Abstract
Traditional new word detection has focused on finding the positional distribution of new words in Chinese text and has rarely been applied to other languages. It has also been difficult to obtain semantic information about these new words, or translations of them. This paper proposes NEWBA, an enhanced new word identification algorithm that uses bilingual corpus alignment. The results indicate that NEWBA performs better than the traditional unsupervised method. In addition, it yields bilingual word pairs, which provide translations as well as detection. NEWBA can therefore expand the scope of traditional new word detection and obtain more valuable information from bilingual aligned corpora.
Acknowledgments
This work is partly supported by the Beijing Natural Science Foundation (No. 4212026 and No. 4202069) and the Fundamental Strengthening Program Technology Field Fund (No. 2021-JCJQ-JJ-0059).
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Yang, Z., Zhang, H., Shang, J., Wushour, S. (2022). An Enhanced New Word Identification Approach Using Bilingual Alignment. In: Lu, W., Huang, S., Hong, Y., Zhou, X. (eds) Natural Language Processing and Chinese Computing. NLPCC 2022. Lecture Notes in Computer Science, vol 13551. Springer, Cham. https://doi.org/10.1007/978-3-031-17120-8_8
DOI: https://doi.org/10.1007/978-3-031-17120-8_8
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-17119-2
Online ISBN: 978-3-031-17120-8
eBook Packages: Computer Science, Computer Science (R0)