note

Loanword Identification in Low-Resource Languages with Minimal Supervision

Authors:

Chenggang Mi,

Lei Xie,

Yanning ZhangAuthors Info & Claims

ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP), Volume 19, Issue 3

Article No.: 43, Pages 1 - 22

https://doi.org/10.1145/3374212

Published: 20 February 2020 Publication History

Get Access

Abstract

Bilingual resources play a very important role in many natural language processing tasks, especially the tasks in cross-lingual scenarios. However, it is expensive and time consuming to build such resources. Lexical borrowing happens in almost every language. This inspires us to detect these loanwords effectively, and to use the “loanword (in receipt language)”-“donor word (in donor language)” to extend the bilingual resource for NLP tasks in low-resource languages. In this article, we propose a novel method to identify loanwords in Uyghur. The most important advantage of this method is that the model only relies on large amount of monolingual corpora and only a small scale of annotated data. Our loanword identification model includes two parts: loanword candidate generation and loanword prediction. In the first part, we use two large-scale monolingual corpora and a small bilingual dictionary to train a cross-lingual embedding model. Since semantic unrelated words often cannot be treated as loanword pairs, a loanword candidate list will be generated according to this model and a word list in Uyghur. In the second part, we predict from the preceding candidates based on a log-linear model that integrates several features such as pronunciation similarity, part-of-speech tags, and hybrid language modeling. To evaluate the effectiveness of our proposed method, we conduct two types of experiments: loanword identification and OOV translation. Experimental results showed that (1) our proposed method achieved significant F1 improvements compared to other models in all four loanword identification tasks in Uyghur, and (2) after extending the existing translation models with loanword identification results, OOV rates in several language pairs reduced significantly and the translation performance improved.

References

[1]

Wafia Adouane, Simon Dobnik, Jean-Philippe Bernardy, and Nasredine Semmar. 2018. A comparison of character neural language model and bootstrapping for language identification in multilingual noisy texts. In Proceedings of the 2nd Workshop on Subword/Character LEvel Models (SCLeM’18). 22--31.

Abstract

References

Cited By

Index Terms

Recommendations

Multilingual Offensive Language Identification for Low-resource Languages

Improving the Robustness of Loanword Identification in Social Media Texts

Improving Loanword Identification in Low-Resource Language with Data Augmentation and Multiple Feature Fusion

Comments

Information

Published In

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Funding Sources

Contributors

Other Metrics

Bibliometrics

Article Metrics

Other Metrics

Citations

Cited By

Get Access

Login options

Full Access

View options

PDF

eReader

HTML Format

Figures

Other

Share

Share this Publication link

Share on social media

Affiliations