Abstract
Named Entity Recognition (NER) plays a crucial role in natural language processing (NLP) by identifying and classifying named entities in text. However, developing high-performing NER models for low-resource languages remains challenging due to the limited availability of labeled data. This paper proposes a semi-supervised data augmentation approach that combines two state-of-the-art pre-trained language models (PLMs). Our method first fine-tunes both PLMs on a small set of labeled data, then uses them to generate weakly supervised data from unlabeled data through collaborative learning. The resulting predictions are filtered using confidence scores and an agreement measure between the two models, yielding a high-quality augmented dataset. We perform experiments on seven low-resource but widely spoken African languages, demonstrating that the augmented datasets generated by our approach achieve better results in six of the seven languages. Furthermore, we conduct cross-lingual zero-shot experiments between language pairs and multilingual experiments to validate the robustness of our method.
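The filtering step described in the abstract, keeping only predictions on which both fine-tuned PLMs agree with high confidence, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the per-token `(label, confidence)` representation, the `model_a`/`model_b` callables, and the 0.9 threshold are all assumptions made for the example.

```python
def agree_and_confident(preds_a, preds_b, threshold=0.9):
    """Each preds_* is a list of (label, confidence) pairs, one per token.

    A sentence passes only if the two models assign the same label to
    every token AND both models exceed the confidence threshold.
    """
    for (label_a, conf_a), (label_b, conf_b) in zip(preds_a, preds_b):
        if label_a != label_b:               # models disagree -> reject
            return False
        if min(conf_a, conf_b) < threshold:  # either model unsure -> reject
            return False
    return True


def build_weak_dataset(unlabeled, model_a, model_b, threshold=0.9):
    """Label unlabeled sentences with both models and keep only the
    high-quality subset; returns (tokens, labels) pairs."""
    kept = []
    for tokens in unlabeled:
        preds_a = model_a(tokens)  # e.g. a fine-tuned PLM's tagger output
        preds_b = model_b(tokens)
        if agree_and_confident(preds_a, preds_b, threshold):
            kept.append((tokens, [label for label, _ in preds_a]))
    return kept
```

The accepted pairs would then be merged with the original small labeled set to fine-tune the final NER model.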
Acknowledgement
This paper is based on results obtained from JST CREST Grant Number JPMJCR22M2 and JSPS KAKENHI Grant Number JP23K24949.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Mehari Yohannes, H., Lynden, S., Amagasa, T., Matono, A. (2024). Semi-supervised Named Entity Recognition for Low-Resource Languages Using Dual PLMs. In: Rapp, A., Di Caro, L., Meziane, F., Sugumaran, V. (eds) Natural Language Processing and Information Systems. NLDB 2024. Lecture Notes in Computer Science, vol 14762. Springer, Cham. https://doi.org/10.1007/978-3-031-70239-6_12
Print ISBN: 978-3-031-70238-9
Online ISBN: 978-3-031-70239-6