Unsupervised Word Segmentation with Bi-directional Neural Language Model

Published: 25 November 2022

Abstract

We propose an unsupervised word segmentation model in which, for each unlabelled sentence, the learning objective is to maximize the generation probability of the sentence given all of its possible segmentations. This generation probability can be factorized recursively into the likelihood of each possible segment given its context. To capture both long- and short-term dependencies, we use a bi-directional neural language model to better extract features of the segment’s context. We also develop two decoding algorithms that combine the context features from both directions to generate the final segmentation at inference time, which helps to reconcile word-boundary ambiguities. Experimental results show that our context-sensitive unsupervised segmentation model achieves state-of-the-art results under different evaluation settings on various Chinese datasets, and comparable results for Thai.
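
The recursive factorization described above can be illustrated with a small dynamic program. The following is a minimal sketch, not the authors' implementation: it assumes the sentence probability is obtained by summing over all segmentations, it caps segment length with a hypothetical `max_len` parameter, and it replaces the bi-directional neural language model with a hypothetical context-free lexicon scorer (`make_toy_scorer`). The scorer receives the whole sentence plus the segment span, so a context-sensitive model could be swapped in without changing the recursion. The second function is plain Viterbi decoding, shown only as a generic baseline; it is not either of the paper's two decoding algorithms.

```python
import math
from typing import Callable, Dict, List

# A scorer returns log p(sentence[start:end] | context). Hypothetical interface.
SegmentScorer = Callable[[str, int, int], float]


def logsumexp(values: List[float]) -> float:
    """Numerically stable log(sum(exp(v))) for a non-empty list."""
    m = max(values)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(v - m) for v in values))


def sentence_log_prob(sentence: str, seg_log_prob: SegmentScorer,
                      max_len: int = 4) -> float:
    """Log-probability of the sentence summed over all segmentations.

    alpha[t] holds the log-probability of the first t characters,
    built recursively from shorter prefixes.
    """
    n = len(sentence)
    alpha = [-math.inf] * (n + 1)
    alpha[0] = 0.0  # the empty prefix has probability 1
    for end in range(1, n + 1):
        terms = [alpha[start] + seg_log_prob(sentence, start, end)
                 for start in range(max(0, end - max_len), end)]
        alpha[end] = logsumexp(terms)
    return alpha[n]


def best_segmentation(sentence: str, seg_log_prob: SegmentScorer,
                      max_len: int = 4) -> List[str]:
    """Viterbi decoding: the single highest-scoring segmentation."""
    n = len(sentence)
    best = [-math.inf] * (n + 1)
    back = [0] * (n + 1)
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            score = best[start] + seg_log_prob(sentence, start, end)
            if score > best[end]:
                best[end] = score
                back[end] = start
    # Follow back-pointers from the end of the sentence.
    segments = []
    end = n
    while end > 0:
        start = back[end]
        segments.append(sentence[start:end])
        end = start
    return segments[::-1]


def make_toy_scorer(lexicon: Dict[str, float], unk: float = 1e-6) -> SegmentScorer:
    """Context-free stand-in for a neural scorer (illustrative assumption)."""
    def score(sentence: str, start: int, end: int) -> float:
        return math.log(lexicon.get(sentence[start:end], unk))
    return score


if __name__ == "__main__":
    scorer = make_toy_scorer({"ab": 0.5, "c": 0.3, "a": 0.1, "b": 0.1})
    print(sentence_log_prob("abc", scorer))
    print(best_segmentation("abc", scorer))  # ['ab', 'c']
```

In training, the negative of `sentence_log_prob` would serve as the loss for each unlabelled sentence. Working in log space avoids underflow when many short segments are multiplied together.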

Published In

ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 22, Issue 1
January 2023
340 pages
ISSN: 2375-4699
EISSN: 2375-4702
DOI: 10.1145/3572718

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 25 November 2022
Online AM: 29 April 2022
Accepted: 28 March 2022
Revised: 15 March 2022
Received: 17 June 2021
Published in TALLIP Volume 22, Issue 1

Author Tags

  1. Unsupervised word segmentation
  2. bi-directional neural language model
  3. recurrent neural networks
  4. context-sensitive segmentation
  5. decoding algorithms

Qualifiers

  • Research-article
  • Refereed
