Prompt-based for Low-Resource Tibetan Text Classification

Published: 24 August 2023

Abstract

Text classification is a critical and foundational task in Tibetan natural language processing; it plays a crucial role in applications such as sentiment analysis and information extraction. However, the limited availability of annotated data poses a significant challenge. This paper proposes a prompt learning-based method for low-resource Tibetan text classification to overcome this challenge. The method uses a pre-trained language model that learns text representation and generation capabilities from a large-scale unsupervised Tibetan corpus, enabling few-shot Tibetan text classification. Experimental results demonstrate that the proposed method significantly improves the performance of Tibetan text classification in low-resource scenarios. This work provides a new research direction and method for low-resource language processing, such as Tibetan natural language processing, and will hopefully inspire subsequent work on low-resource languages.
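
The abstract gives no implementation details, but the prompt-learning setup it describes can be illustrated with a short sketch. The Python code below (using the Hugging Face transformers library) classifies a text by appending a cloze template containing a mask token, letting a pre-trained masked language model fill the blank, and scoring one candidate label word per class. The checkpoint name, template, and verbalizer words are illustrative assumptions, not the paper's actual configuration.

# Minimal sketch of prompt-based classification with a masked language model.
# NOTE: the checkpoint, template, and label words below are illustrative
# assumptions, not the configuration reported in the paper.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

MODEL = "hfl/cino-base-v2"  # assumed Tibetan-capable multilingual MLM checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForMaskedLM.from_pretrained(MODEL)
model.eval()

# Hypothetical verbalizer mapping each class to one Tibetan label word;
# for simplicity, only the first subword piece of each word is scored.
VERBALIZER = {"politics": "ཆབ་སྲིད", "sports": "རྩེད་མོ", "culture": "རིག་གནས"}

def classify(text: str) -> str:
    # Append a cloze slot for the model to fill (the template is an assumption).
    prompt = f"{text} {tokenizer.mask_token}"
    inputs = tokenizer(prompt, return_tensors="pt")
    # Locate the mask position and take the model's logits there.
    mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
    with torch.no_grad():
        logits = model(**inputs).logits[0, mask_pos]
    # Score each class by the logit of its label word's first subword id.
    scores = {
        label: logits[tokenizer(word, add_special_tokens=False).input_ids[0]].item()
        for label, word in VERBALIZER.items()
    }
    return max(scores, key=scores.get)

In a true few-shot setting, the template and model would additionally be tuned on the handful of labeled examples available; the sketch shows only the zero-shot cloze-scoring step that prompt-based classification builds on.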

Published In

ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 22, Issue 8
August 2023
373 pages
ISSN: 2375-4699
EISSN: 2375-4702
DOI: 10.1145/3615980

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 24 August 2023
Online AM: 31 May 2023
Accepted: 29 May 2023
Revised: 29 March 2023
Received: 11 January 2023
Published in TALLIP Volume 22, Issue 8

Author Tags

  1. Tibetan text classification
  2. prompt learning
  3. deep learning
  4. pre-trained language model

Qualifiers

  • Research-article

Funding Sources

  • National Social Science Foundation of China
  • National Natural Science Foundation of China
  • Major Research Innovation Project of the Chinese Academy of Social Sciences

Article Metrics

  • Downloads (Last 12 months): 149
  • Downloads (Last 6 weeks): 19
Reflects downloads up to 11 Jan 2025

Cited By

  • (2024) A Systematic Review of Existing Tools to Automated Processing Systems for Kazakh Language. BULLETIN Series of Physics & Mathematical Sciences 87:3. DOI: 10.51889/2959-5894.2024.87.3.009. Online publication date: Sep-2024.
  • (2024) Construction of a Tibetan Handwriting Khyug-yig Dataset. Data Intelligence 6:3, 870-887. DOI: 10.3724/2096-7004.di.2024.0048. Online publication date: 28-Oct-2024.
  • (2024) RPEPL: Tibetan Sentiment Analysis Based on Relative Position Encoding and Prompt Learning. ACM Transactions on Asian and Low-Resource Language Information Processing 23:12, 1-18. DOI: 10.1145/3698575. Online publication date: 23-Nov-2024.
  • (2024) T-LLaMA: A Tibetan Large Language Model Based on LLaMA2. Complex & Intelligent Systems 11:1. DOI: 10.1007/s40747-024-01641-7. Online publication date: 19-Dec-2024.
  • (2023) Benchmarking Multilabel Topic Classification in the Kyrgyz Language. Analysis of Images, Social Networks and Texts, 21-35. DOI: 10.1007/978-3-031-54534-4_2. Online publication date: 28-Sep-2023.
