Abstract
Traditional statistical approaches to identifying multi-word terms must handle large amounts of noisy data and are extremely time-consuming. This paper introduces a system that extracts multi-word terms from a set of documents based on the co-related text-segments that occur in those documents. The system uses a short predefined stoplist as its initial input to divide the documents into text-segments and calculates a segment-weight for every text-segment. It then uses the shorter text-segments to split the longer ones according to their weight values, recursively, until no text-segment can be divided further. The resulting text-segments are identified as terms according to a specified threshold. Initial experimental results on a set of traditional Chinese documents show that the system achieves a recall of at least 76.39% and a precision of at least 91.05% when retrieving terms that occur more than once, of which 18.30% are newly identified terms.
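The pipeline described in the abstract can be illustrated with a minimal sketch. The abstract does not define the segment-weight or the exact splitting rule, so the code below makes its own assumptions: the weight of a text-segment is simply its occurrence count, and a longer segment is split by a shorter segment it contains only when the shorter one has a strictly higher weight. The function names (`initial_segments`, `split_once`, `extract_terms`) and the `min_weight` parameter are hypothetical and are not taken from the paper.

```python
import re
from collections import Counter


def initial_segments(docs, stoplist):
    """Cut each document at stoplist entries and at punctuation/whitespace.

    Returns a Counter mapping text-segment -> occurrence count, which this
    sketch uses as the segment-weight (an assumption, not the paper's formula).
    """
    # Try longer stoplist entries first so they win over their own prefixes.
    stops = sorted(stoplist, key=len, reverse=True)
    pattern = re.compile("|".join([re.escape(s) for s in stops] + [r"[\W_]+"]))
    counts = Counter()
    for doc in docs:
        counts.update(seg for seg in pattern.split(doc) if seg)
    return counts


def split_once(weights):
    """Run one pass that splits longer segments by heavier shorter ones.

    Assumed rule: a segment is split around the heaviest shorter segment it
    contains, provided that segment's weight is strictly greater than its own.
    The pieces inherit the weight of the segment they came from.
    """
    result = Counter()
    changed = False
    for seg, w in weights.items():
        candidates = [
            s for s in weights
            if len(s) < len(seg) and s in seg and weights[s] > w
        ]
        if not candidates:
            result[seg] += w
            continue
        changed = True
        best = max(candidates, key=lambda s: (weights[s], len(s)))
        head, _, tail = seg.partition(best)
        for piece in (head, best, tail):
            if piece:
                result[piece] += w
    return result, changed


def extract_terms(docs, stoplist, min_weight=2):
    """Repeat the splitting pass until nothing changes, then apply a threshold."""
    weights = initial_segments(docs, stoplist)
    changed = True
    while changed:
        weights, changed = split_once(weights)
    return {seg: w for seg, w in weights.items() if w >= min_weight}
```

For instance, with the stoplist `["的", "和"]` and the two strings `"資訊檢索的系統"` and `"資訊檢索系統和資訊檢索"`, the initial pass yields the segments 資訊檢索, 系統 and 資訊檢索系統, and the splitting pass then divides 資訊檢索系統 into 資訊檢索 and 系統 because 資訊檢索 occurs more often than the longer segment. The loop terminates because every split produces strictly shorter pieces while the total amount of text is unchanged.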
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Chen, J., Yeh, CH., Chau, R. (2006). A Multi-word Term Extraction System. In: Yang, Q., Webb, G. (eds) PRICAI 2006: Trends in Artificial Intelligence. PRICAI 2006. Lecture Notes in Computer Science, vol 4099. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-36668-3_153
DOI: https://doi.org/10.1007/978-3-540-36668-3_153
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-36667-6
Online ISBN: 978-3-540-36668-3
eBook Packages: Computer Science (R0)