Loanword Identification in Low-Resource Languages with Minimal Supervision

Published: 20 February 2020

Abstract

Bilingual resources play an important role in many natural language processing (NLP) tasks, especially in cross-lingual scenarios. However, building such resources is expensive and time-consuming. Lexical borrowing occurs in almost every language, which motivates us to detect loanwords and to use the resulting “loanword (in the recipient language)”-“donor word (in the donor language)” pairs to extend bilingual resources for NLP tasks in low-resource languages. In this article, we propose a novel method to identify loanwords in Uyghur. Its main advantage is that the model relies only on large monolingual corpora and a small amount of annotated data. Our loanword identification model has two parts: loanword candidate generation and loanword prediction. In the first part, we use two large-scale monolingual corpora and a small bilingual dictionary to train a cross-lingual embedding model. Because semantically unrelated words usually cannot form loanword pairs, we generate a loanword candidate list from this model and a Uyghur word list. In the second part, we select loanwords from these candidates with a log-linear model that integrates several features, such as pronunciation similarity, part-of-speech tags, and hybrid language modeling. To evaluate the proposed method, we conduct two types of experiments: loanword identification and out-of-vocabulary (OOV) translation. Experimental results show that (1) the proposed method achieves significant F1 improvements over other models in all four loanword identification tasks in Uyghur, and (2) after the existing translation models are extended with the loanword identification results, OOV rates in several language pairs are reduced significantly and translation performance improves.
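
The two-stage pipeline described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that word vectors for both languages already live in a shared embedding space, uses normalized edit distance as a crude stand-in for pronunciation similarity, and omits the hybrid language modeling feature. The toy words, vectors, feature names (`pron`, `pos`), weights, and threshold are all hypothetical and are not taken from the article.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors in the shared embedding space."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def edit_distance(a, b):
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]

def string_similarity(a, b):
    """Assumed stand-in for pronunciation similarity: 1 - normalized edit distance."""
    longest = max(len(a), len(b))
    return 1.0 - edit_distance(a, b) / longest if longest else 1.0

def generate_candidates(word_vec, donor_vecs, threshold):
    """Stage 1: keep donor words that are semantically close to the input word."""
    return [w for w, v in donor_vecs.items() if cosine(word_vec, v) >= threshold]

def log_linear_probs(features_by_candidate, weights):
    """Stage 2: p(c) is proportional to exp(sum_k weight_k * feature_k(c))."""
    scores = {c: sum(weights[k] * feats[k] for k in weights)
              for c, feats in features_by_candidate.items()}
    z = sum(math.exp(s) for s in scores.values())
    return {c: math.exp(s) / z for c, s in scores.items()}

# Toy data (hypothetical): one recipient-language word and three donor-language words.
word = "telefon"
word_vec = [0.9, 0.1, 0.0]
donor_vecs = {
    "telephone": [0.8, 0.2, 0.1],
    "table": [0.6, 0.5, 0.1],
    "river": [0.0, 0.1, 0.9],
}
pos_match = {"telephone": 1.0, "table": 1.0, "river": 1.0}  # assumed: all nouns
weights = {"pron": 2.0, "pos": 1.0}                         # assumed, not learned

candidates = generate_candidates(word_vec, donor_vecs, threshold=0.5)
features = {c: {"pron": string_similarity(word, c), "pos": pos_match[c]}
            for c in candidates}
probs = log_linear_probs(features, weights)
best = max(probs, key=probs.get)
```

In this sketch the semantic filter discards unrelated donor words before any string comparison is made, which mirrors the abstract's rationale for generating candidates first. In a real system the feature weights would be fitted on the small annotated set rather than set by hand.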

Cited By

  • (2023) Loanword identification based on web resources: A case study on Wikipedia. Computer Speech & Language 81 (101517). DOI: 10.1016/j.csl.2023.101517. Online publication date: Jun-2023
  • (2023) The appeal of green advertisements on consumers' consumption intention based on low-resource machine translation. The Journal of Supercomputing 79:5 (5086-5108). DOI: 10.1007/s11227-022-04846-0. Online publication date: 1-Mar-2023
  • (2021) Improving Loanword Identification in Low-Resource Language with Data Augmentation and Multiple Feature Fusion. Computational Intelligence and Neuroscience 2021. DOI: 10.1155/2021/9975078. Online publication date: 1-Jan-2021

        Published In

        ACM Transactions on Asian and Low-Resource Language Information Processing  Volume 19, Issue 3
        May 2020
        228 pages
        ISSN:2375-4699
        EISSN:2375-4702
        DOI:10.1145/3378675

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        Published: 20 February 2020
        Accepted: 01 November 2019
        Revised: 01 September 2019
        Received: 01 January 2019
        Published in TALLIP Volume 19, Issue 3

        Author Tags

        1. Uyghur loanword
        2. cross-lingual embedding
        3. loanword identification
        4. low-resource neural machine translation
        5. out-of-vocabulary
        6. pronunciation similarity

        Qualifiers

        • Note
        • Research
        • Refereed

        Funding Sources

        • National Natural Science Foundation of China
        • National Key Research and Development Program of China
