Character-based Joint Word Segmentation and Part-of-Speech Tagging for Tibetan Based on Deep Learning

Authors:

La Duo

Transactions on Asian and Low-Resource Language Information Processing, Volume 21, Issue 5

Article No.: 95, Pages 1 - 15

https://doi.org/10.1145/3511600

Published: 23 November 2022

Abstract

Tibetan word segmentation and part-of-speech (POS) tagging are basic tasks in Tibetan natural language processing. Most existing methods for Tibetan word segmentation and POS tagging are based on rules and statistics, which require manually constructed features. In addition, joint models have shown stronger capabilities for word segmentation and POS tagging and have received great interest. In this paper, we propose a Bi-LSTM+IDCNN+CRF architecture, a simple yet effective end-to-end neural network model, for joint Tibetan word segmentation and POS tagging. We conduct step-by-step and joint experiments on Tibetan datasets. The results show that the Bi-LSTM+IDCNN+CRF model performs best in both the step-by-step and the joint mode. We obtain state-of-the-art performance in the joint tagging mode: the F1 score of the word segmentation task reaches 92.31%, and the F1 score of the POS tagging task reaches 81.26%.
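As an illustration of what character-based joint tagging and the two F1 scores involve, the following is a minimal sketch and not the article's implementation. It assumes a B/M/E/S boundary scheme fused with the POS tag into one label per unit (for example "B-n"); the abstract does not state the actual label set. The function names, the placeholder units "a" to "e", and the tags "n" and "v" are hypothetical.

```python
from typing import List, Set, Tuple

# One word = (its character/syllable units, its POS tag).
Word = Tuple[List[str], str]


def encode_joint(words: List[Word]) -> Tuple[List[str], List[str]]:
    """Flatten tagged words into units and joint labels such as 'B-n'.

    Assumed scheme: B/M/E mark the begin/middle/end unit of a word,
    S marks a single-unit word; the POS tag is appended after a hyphen.
    """
    units: List[str] = []
    labels: List[str] = []
    for word_units, pos in words:
        n = len(word_units)
        for i, unit in enumerate(word_units):
            if n == 1:
                boundary = "S"
            elif i == 0:
                boundary = "B"
            elif i == n - 1:
                boundary = "E"
            else:
                boundary = "M"
            units.append(unit)
            labels.append(f"{boundary}-{pos}")
    return units, labels


def decode_joint(units: List[str], labels: List[str]) -> List[Word]:
    """Rebuild words from joint labels; a word closes on an S or E label."""
    words: List[Word] = []
    current: List[str] = []
    pos = ""
    for unit, label in zip(units, labels):
        boundary, pos = label.split("-", 1)
        current.append(unit)
        if boundary in ("S", "E"):
            words.append((current, pos))
            current = []
    if current:  # sequence ended inside a word: close it with the last POS seen
        words.append((current, pos))
    return words


def spans(words: List[Word], with_pos: bool) -> Set[tuple]:
    """Word spans as (start, end) or, for POS scoring, (start, end, pos)."""
    out: Set[tuple] = set()
    start = 0
    for word_units, pos in words:
        end = start + len(word_units)
        out.add((start, end, pos) if with_pos else (start, end))
        start = end
    return out


def f1(gold: List[Word], pred: List[Word], with_pos: bool) -> float:
    """Span-level F1: segmentation only, or segmentation plus POS."""
    g, p = spans(gold, with_pos), spans(pred, with_pos)
    correct = len(g & p)
    if correct == 0:
        return 0.0
    precision, recall = correct / len(p), correct / len(g)
    return 2 * precision * recall / (precision + recall)


# Placeholder data, not real Tibetan syllables or a real tag set.
gold = [(["a", "b"], "n"), (["c"], "v"), (["d", "e"], "n")]
pred = [(["a", "b"], "n"), (["c"], "n"), (["d"], "v"), (["e"], "n")]

units, labels = encode_joint(gold)
seg_f1 = f1(gold, pred, with_pos=False)
pos_f1 = f1(gold, pred, with_pos=True)
```

Under this scheme a word counts as correct for POS only if its boundaries are also correct, so the POS F1 cannot exceed the segmentation F1. That is consistent with the gap between the two reported scores, although the article's exact evaluation protocol may differ.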

Index Terms

Character-based Joint Word Segmentation and Part-of-Speech Tagging for Tibetan Based on Deep Learning
1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
      1. Lexical semantics

Published In

ACM Transactions on Asian and Low-Resource Language Information Processing Volume 21, Issue 5

September 2022

486 pages

ISSN:2375-4699

EISSN:2375-4702

DOI:10.1145/3533669

Editor:
Imed Zitouni
Google, USA

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 23 November 2022

Online AM: 31 August 2022

Accepted: 12 January 2022

Revised: 10 January 2022

Received: 29 December 2020

Published in TALLIP Volume 21, Issue 5

Qualifiers

Research-article
Refereed

Funding Sources

National Key R&D Program of China
Ministry of Education - China Mobile Research Foundation
Fundamental Research Funds for the Central Universities
National Natural Science Foundation of China
Major National Project of High Resolution Earth Observation System
State Grid Corporation of China Science and Technology Project
Program for New Century Excellent Talents in University
Strategic Priority Research Program of the Chinese Academy of Sciences
Google Research Awards and Google Faculty Award, Science and Technology Plan of Qinghai Province
