Prompt-based for Low-Resource Tibetan Text Classification

Published: 24 August 2023

Abstract

Text classification is a critical and foundational task in Tibetan natural language processing; it plays a crucial role in applications such as sentiment analysis and information extraction. However, the limited availability of annotated data poses a significant challenge. This paper proposes a prompt learning-based method for low-resource Tibetan text classification to overcome this challenge. The method uses a pre-trained language model that learns text representation and generation capabilities from a large-scale unsupervised Tibetan corpus, enabling few-shot Tibetan text classification. Experimental results demonstrate that the proposed method significantly improves the performance of Tibetan text classification in low-resource scenarios. This work provides a new research direction and method for low-resource language processing, such as Tibetan natural language processing, and will hopefully inspire subsequent work on low-resource languages.
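
The abstract gives no implementation details, but the prompt-learning setup it describes can be illustrated with a short sketch. The Python code below (using the Hugging Face transformers library) classifies a text by appending a cloze template containing a mask token, letting a pre-trained masked language model fill the blank, and scoring one candidate label word per class. The checkpoint name, template, and verbalizer words are illustrative assumptions, not the paper's actual configuration.

# Minimal sketch of prompt-based classification with a masked language model.
# NOTE: the checkpoint, template, and label words below are illustrative
# assumptions, not the configuration reported in the paper.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

MODEL = "hfl/cino-base-v2"  # assumed Tibetan-capable multilingual MLM checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForMaskedLM.from_pretrained(MODEL)
model.eval()

# Hypothetical verbalizer mapping each class to one Tibetan label word;
# for simplicity, only the first subword piece of each word is scored.
VERBALIZER = {"politics": "ཆབ་སྲིད", "sports": "རྩེད་མོ", "culture": "རིག་གནས"}

def classify(text: str) -> str:
    # Append a cloze slot for the model to fill (the template is an assumption).
    prompt = f"{text} {tokenizer.mask_token}"
    inputs = tokenizer(prompt, return_tensors="pt")
    # Locate the mask position and take the model's logits there.
    mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
    with torch.no_grad():
        logits = model(**inputs).logits[0, mask_pos]
    # Score each class by the logit of its label word's first subword id.
    scores = {
        label: logits[tokenizer(word, add_special_tokens=False).input_ids[0]].item()
        for label, word in VERBALIZER.items()
    }
    return max(scores, key=scores.get)

In a true few-shot setting, the template and model would additionally be tuned on the handful of labeled examples available; the sketch shows only the zero-shot cloze-scoring step that prompt-based classification builds on.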

Published In

ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 22, Issue 8
August 2023
373 pages
ISSN: 2375-4699
EISSN: 2375-4702
DOI: 10.1145/3615980

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 24 August 2023
Online AM: 31 May 2023
Accepted: 29 May 2023
Revised: 29 March 2023
Received: 11 January 2023
Published in TALLIP Volume 22, Issue 8

Author Tags

  1. Tibetan text classification
  2. prompt learning
  3. deep learning
  4. pre-trained language model

Qualifiers

  • Research-article

Funding Sources

  • National Social Science Foundation of China
  • National Natural Science Foundation of China
  • Major Research Innovation Project of the Chinese Academy of Social Sciences

Article Metrics

  • Downloads (Last 12 months): 149
  • Downloads (Last 6 weeks): 19
Reflects downloads up to 11 Jan 2025

Cited By

  • (2024) A Systematic Review of Existing Tools to Automated Processing Systems for Kazakh Language. BULLETIN Series of Physics & Mathematical Sciences 87:3. DOI: 10.51889/2959-5894.2024.87.3.009. Online publication date: Sep-2024.
  • (2024) Construction of a Tibetan Handwriting Khyug-yig Dataset. Data Intelligence 6:3, 870-887. DOI: 10.3724/2096-7004.di.2024.0048. Online publication date: 28-Oct-2024.
  • (2024) RPEPL: Tibetan Sentiment Analysis Based on Relative Position Encoding and Prompt Learning. ACM Transactions on Asian and Low-Resource Language Information Processing 23:12, 1-18. DOI: 10.1145/3698575. Online publication date: 23-Nov-2024.
  • (2024) T-LLaMA: A Tibetan Large Language Model Based on LLaMA2. Complex & Intelligent Systems 11:1. DOI: 10.1007/s40747-024-01641-7. Online publication date: 19-Dec-2024.
  • (2023) Benchmarking Multilabel Topic Classification in the Kyrgyz Language. Analysis of Images, Social Networks and Texts, 21-35. DOI: 10.1007/978-3-031-54534-4_2. Online publication date: 28-Sep-2023.
