Unsupervised Word Segmentation with Bi-directional Neural Language Model

Authors:

Xiaoqing Zheng

ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 22, Issue 1

Article No.: 17, Pages 1 - 16

https://doi.org/10.1145/3529387

Published: 25 November 2022

Abstract

We propose an unsupervised word segmentation model in which, for each unlabelled sentence, the learning objective is to maximize the generation probability of the sentence given all of its possible segmentations. This generation probability can be factorized recursively into the likelihood of each possible segment given its context. To capture both long- and short-term dependencies, we use a bi-directional neural language model to better extract features of each segment’s context. We also develop two decoding algorithms that combine the context features from both directions to generate the final segmentation at inference time, which helps to reconcile word-boundary ambiguities. Experimental results show that our context-sensitive unsupervised segmentation model achieves state-of-the-art results under different evaluation settings on various Chinese datasets, and comparable results for Thai.
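The recursive factorization described in the abstract can be illustrated with a small dynamic program. The sketch below is not the paper's model: the functions `log_marginal` and `best_segmentation`, the `max_len` limit, and the toy lexicon scorer are all hypothetical names and assumptions introduced here. In particular, `toy_seg_logprob` ignores its context argument and stands in for the bi-directional neural language model, and the decoder shown is plain single-direction Viterbi search rather than either of the paper's two decoding algorithms.

```python
import math
from typing import Callable, List

NEG_INF = float("-inf")

def log_sum_exp(values: List[float]) -> float:
    """Numerically stable log(sum(exp(v))) for a non-empty list."""
    m = max(values)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(v - m) for v in values))

def log_marginal(sentence: str,
                 seg_logprob: Callable[[str, str], float],
                 max_len: int = 4) -> float:
    """Log-probability of the sentence summed over all segmentations.

    alpha[t] is the log-probability of generating sentence[:t]; it is built
    recursively from shorter prefixes plus one final segment.
    """
    n = len(sentence)
    alpha = [NEG_INF] * (n + 1)
    alpha[0] = 0.0
    for t in range(1, n + 1):
        terms = [alpha[j] + seg_logprob(sentence[:j], sentence[j:t])
                 for j in range(max(0, t - max_len), t)]
        alpha[t] = log_sum_exp(terms)
    return alpha[n]

def best_segmentation(sentence: str,
                      seg_logprob: Callable[[str, str], float],
                      max_len: int = 4) -> List[str]:
    """Highest-scoring single segmentation (Viterbi search with backpointers)."""
    n = len(sentence)
    best = [NEG_INF] * (n + 1)
    back = [0] * (n + 1)
    best[0] = 0.0
    for t in range(1, n + 1):
        for j in range(max(0, t - max_len), t):
            score = best[j] + seg_logprob(sentence[:j], sentence[j:t])
            if score > best[t]:
                best[t], back[t] = score, j
    segments = []
    t = n
    while t > 0:
        segments.append(sentence[back[t]:t])
        t = back[t]
    return segments[::-1]

# Hypothetical toy scorer: a fixed lexicon with a small floor probability for
# unknown segments. A real model would condition on the context argument.
TOY_LEXICON = {"ab": 0.5, "a": 0.2, "b": 0.2, "c": 0.1}

def toy_seg_logprob(left_context: str, segment: str) -> float:
    return math.log(TOY_LEXICON.get(segment, 1e-6))

if __name__ == "__main__":
    print(best_segmentation("abc", toy_seg_logprob))
    print(math.exp(log_marginal("abc", toy_seg_logprob)))
```

In training, a quantity like `log_marginal` would be the objective to maximize; at inference time, a search like `best_segmentation` picks one segmentation. Capping segment length with `max_len` keeps both loops linear in sentence length, which is a design choice of this sketch.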

Cited By

Zhuang C, Liu C, Zhu H, Ma Y, Shi G, Liu Z, Liu B (2024). Constraint information extraction for 3D geological modelling using a span-based joint entity and relation extraction model. Earth Science Informatics 17:2 (985-998). Online publication date: 16-Feb-2024.
https://doi.org/10.1007/s12145-024-01245-2
Ahmed U, Lin J, Srivastava G, Yun U (2023). Semi-Supervised Lexicon-Aware Embedding for News Article Time Estimation. ACM Transactions on Asian and Low-Resource Language Information Processing. Online publication date: 13-Apr-2023.
https://doi.org/10.1145/3592604

Index Terms

Unsupervised Word Segmentation with Bi-directional Neural Language Model
1. Computing methodologies
  1. Machine learning
    1. Learning paradigms
      1. Unsupervised learning
        Source separation
    2. Machine learning approaches
      1. Neural networks
      2. Rule learning

Published In

ACM Transactions on Asian and Low-Resource Language Information Processing Volume 22, Issue 1

January 2023

340 pages

ISSN:2375-4699

EISSN:2375-4702

DOI:10.1145/3572718

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 25 November 2022

Online AM: 29 April 2022

Accepted: 28 March 2022

Revised: 15 March 2022

Received: 17 June 2021

Published in TALLIP Volume 22, Issue 1

Qualifiers

Research-article
Refereed
