DOI: 10.3115/1119250.1119255

A bottom-up merging algorithm for Chinese unknown word extraction

Published: 11 July 2003

Abstract

Statistical methods for extracting Chinese unknown words usually suffer from the problem that superfluous character strings with strong statistical associations are extracted as well. To solve this problem, this paper proposes using a set of general morphological rules to broaden coverage, while attaching linguistic and statistical constraints to the rules to increase the precision of the representation. To disambiguate rule applications and reduce the complexity of rule matching, a bottom-up merging algorithm for extraction is proposed. It merges possible morphemes recursively by consulting the general rules and dynamically decides which rule to apply first according to the rules' priorities. The effects of different priority strategies are compared experimentally, and the results show that the proposed method is very promising.
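
The abstract gives only a high-level description of the merging procedure, so the following is a minimal sketch of the general idea rather than the paper's actual algorithm. It assumes that rules apply to adjacent token pairs, that each rule returns a numeric priority when it matches, and that the highest-priority applicable merge is performed first. The names `Rule`, `bottom_up_merge`, `PAIR_COUNTS`, and `count_rule`, the toy counts, and the Latin-letter "morphemes" are all hypothetical placeholders, not taken from the paper.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Rule:
    """Hypothetical merge rule over an adjacent pair of tokens."""
    name: str
    # Returns a priority score if the pair may be merged, otherwise None.
    match: Callable[[str, str], Optional[float]]


def bottom_up_merge(tokens: List[str], rules: List[Rule]) -> List[str]:
    """Repeatedly apply the highest-priority applicable merge until none is left."""
    tokens = list(tokens)  # work on a copy
    while True:
        best: Optional[Tuple[float, int]] = None
        # Scan every adjacent pair against every rule and keep the best score.
        for i in range(len(tokens) - 1):
            for rule in rules:
                score = rule.match(tokens[i], tokens[i + 1])
                if score is not None and (best is None or score > best[0]):
                    best = (score, i)
        if best is None:
            return tokens  # no rule applies any more
        _, i = best
        # Merge the winning pair; the merged token can take part in later merges.
        tokens[i:i + 2] = [tokens[i] + tokens[i + 1]]


# Toy co-occurrence counts standing in for a statistical constraint (assumed data).
PAIR_COUNTS: Dict[Tuple[str, str], int] = {
    ("a", "b"): 5,
    ("b", "c"): 4,
    ("ab", "c"): 3,
}


def count_rule(min_count: int) -> Rule:
    """Rule that allows a merge when the pair count reaches min_count."""
    def match(left: str, right: str) -> Optional[float]:
        count = PAIR_COUNTS.get((left, right), 0)
        return float(count) if count >= min_count else None
    return Rule(name=f"count>={min_count}", match=match)


if __name__ == "__main__":
    print(bottom_up_merge(["x", "a", "b", "c", "y"], [count_rule(3)]))
    # prints ['x', 'abc', 'y']
```

In this toy run the pair ("a", "b") outranks ("b", "c"), so "ab" is formed first and then extended to "abc". A different priority ordering would merge "b" and "c" first and stop at "a", "bc", which illustrates why the choice of priority strategy can change the extracted words.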

Published In

SIGHAN '03: Proceedings of the second SIGHAN workshop on Chinese language processing - Volume 17
July 2003
193 pages

Publisher

Association for Computational Linguistics

United States

Article Metrics

  • Downloads (last 12 months): 42
  • Downloads (last 6 weeks): 9

Reflects downloads up to 28 Dec 2024.

Cited By

  • (2019) Is It Possible to Use Chatbot for the Chinese Word Segmentation? Proceedings of the 2019 3rd International Conference on Natural Language Processing and Information Retrieval, pp. 20-24. DOI: 10.1145/3342827.3342836. Online publication date: 28-Jun-2019
  • (2017) Predicting political affiliation of posts on Facebook. Proceedings of the 11th International Conference on Ubiquitous Information Management and Communication, pp. 1-8. DOI: 10.1145/3022227.3022283. Online publication date: 5-Jan-2017
  • (2012) Phrase-based approach for adaptive tokenization. Proceedings of the Twelfth Meeting of the Special Interest Group on Computational Morphology and Phonology, pp. 17-25. DOI: 10.5555/2390930.2390933. Online publication date: 7-Jun-2012
  • (2010) Realization of a news dissemination agent based on weighted association rules and text mining techniques. Expert Systems with Applications: An International Journal, 37:9, pp. 6409-6413. DOI: 10.1016/j.eswa.2010.02.078. Online publication date: 1-Sep-2010
  • (2009) Automatic extraction of new words based on Google News corpora for supporting lexicon-based Chinese word segmentation systems. Expert Systems with Applications: An International Journal, 36:2, pp. 3641-3651. DOI: 10.1016/j.eswa.2008.02.013. Online publication date: 1-Mar-2009
  • (2008) Supporting the development of collaborative problem-based learning environments with an intelligent diagnosis tool. Expert Systems with Applications: An International Journal, 35:3, pp. 622-631. DOI: 10.1016/j.eswa.2007.07.028. Online publication date: 1-Oct-2008
  • (2007) Implementation and performance evaluation of parameter improvement mechanisms for intelligent e-learning systems. Computers & Education, 49:3, pp. 597-614. DOI: 10.1016/j.compedu.2005.11.008. Online publication date: 1-Nov-2007
  • (2003) Introduction to CKIP Chinese word segmentation system for the first international Chinese Word Segmentation Bakeoff. Proceedings of the second SIGHAN workshop on Chinese language processing - Volume 17, pp. 168-171. DOI: 10.3115/1119250.1119276. Online publication date: 11-Jul-2003
