Abstract
Named Entity Recognition (NER) plays a crucial role in natural language processing (NLP) by identifying and classifying named entities in text. However, developing high-performing NER models for low-resource languages remains challenging due to the limited availability of labeled data. This paper proposes a semi-supervised data augmentation approach that combines two state-of-the-art pre-trained language models (PLMs). Our method first fine-tunes both PLMs on a small set of labeled data, then uses them to generate weakly supervised data from unlabeled data through collaborative learning. The resulting predictions are filtered using confidence scores and an agreement measure between the two models, yielding a high-quality augmented dataset. We perform experiments on seven low-resource but widely spoken African languages, demonstrating that the augmented datasets generated by our approach achieve better results in six of the seven languages. Furthermore, we conduct cross-lingual zero-shot experiments between language pairs and multilingual experiments to validate the robustness of our method.
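The filtering step described in the abstract, keeping only predictions on which both fine-tuned PLMs agree with high confidence, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the per-token `(label, confidence)` representation, the `model_a`/`model_b` callables, and the 0.9 threshold are all assumptions made for the example.

```python
def agree_and_confident(preds_a, preds_b, threshold=0.9):
    """Each preds_* is a list of (label, confidence) pairs, one per token.

    A sentence passes only if the two models assign the same label to
    every token AND both models exceed the confidence threshold.
    """
    for (label_a, conf_a), (label_b, conf_b) in zip(preds_a, preds_b):
        if label_a != label_b:               # models disagree -> reject
            return False
        if min(conf_a, conf_b) < threshold:  # either model unsure -> reject
            return False
    return True


def build_weak_dataset(unlabeled, model_a, model_b, threshold=0.9):
    """Label unlabeled sentences with both models and keep only the
    high-quality subset; returns (tokens, labels) pairs."""
    kept = []
    for tokens in unlabeled:
        preds_a = model_a(tokens)  # e.g. a fine-tuned PLM's tagger output
        preds_b = model_b(tokens)
        if agree_and_confident(preds_a, preds_b, threshold):
            kept.append((tokens, [label for label, _ in preds_a]))
    return kept
```

The accepted pairs would then be merged with the original small labeled set to fine-tune the final NER model.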
Acknowledgement
This paper is based on results obtained from JST CREST Grant Number JPMJCR22M2 and JSPS KAKENHI Grant Number JP23K24949.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Mehari Yohannes, H., Lynden, S., Amagasa, T., Matono, A. (2024). Semi-supervised Named Entity Recognition for Low-Resource Languages Using Dual PLMs. In: Rapp, A., Di Caro, L., Meziane, F., Sugumaran, V. (eds) Natural Language Processing and Information Systems. NLDB 2024. Lecture Notes in Computer Science, vol 14762. Springer, Cham. https://doi.org/10.1007/978-3-031-70239-6_12
Print ISBN: 978-3-031-70238-9
Online ISBN: 978-3-031-70239-6