
Semi-supervised Named Entity Recognition for Low-Resource Languages Using Dual PLMs

  • Conference paper
  • First Online:
Natural Language Processing and Information Systems (NLDB 2024)

Abstract

Named Entity Recognition (NER) plays a crucial role in natural language processing (NLP) tasks by identifying and classifying named entities. However, developing high-performing NER models for low-resource languages remains challenging due to the limited availability of labeled data. This paper proposes a semi-supervised data augmentation approach that combines two state-of-the-art pre-trained language models (PLMs). Our method first fine-tunes the two PLMs on a small set of labeled data and then uses them to generate weakly supervised data from unlabeled data through collaborative learning. The resulting predictions are evaluated using confidence scores and a measure of agreement between the two models to produce a high-quality dataset. We perform experiments on seven low-resource but widely spoken African languages, demonstrating that the augmented datasets generated by our approach achieve better results in six of the seven languages. Furthermore, we conduct cross-lingual zero-shot experiments between language pairs and multilingual experiments to validate the robustness of our method.
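
To make the selection step concrete, the sketch below shows one plausible way to implement the confidence-and-agreement filter with two fine-tuned token-classification PLMs from the Hugging Face transformers library (see the note below). The predict/pseudo_label function names, the 0.9 confidence threshold, the first-subword label aggregation, and the rule that both models must agree on every tag are illustrative assumptions rather than details taken from the paper; the selected sentences would then be added to the small labeled training set as augmented data.

```python
# Minimal sketch of the confidence + agreement filtering step, assuming two
# already fine-tuned token-classification PLMs with fast tokenizers from
# Hugging Face transformers. The 0.9 threshold, first-subword aggregation, and
# "agree on every tag" rule are illustrative choices, not the authors' exact setup.
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer


def predict(model, tokenizer, words, device="cpu"):
    """Return (labels, confidences), one label and one confidence per input word."""
    enc = tokenizer(words, is_split_into_words=True,
                    return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = model(**enc.to(device)).logits.softmax(-1)[0]  # (subwords, num_labels)
    labels, confs, prev = [], [], None
    for idx, word_id in enumerate(enc.word_ids(0)):
        if word_id is None or word_id == prev:   # skip special tokens and extra subwords
            continue
        conf, label_id = probs[idx].max(-1)
        labels.append(model.config.id2label[label_id.item()])
        confs.append(conf.item())
        prev = word_id
    return labels, confs


def pseudo_label(model_a, tok_a, model_b, tok_b, unlabeled_sentences, threshold=0.9):
    """Keep a sentence only when both PLMs predict identical tags and both are confident."""
    selected = []
    for words in unlabeled_sentences:
        labels_a, conf_a = predict(model_a, tok_a, words)
        labels_b, conf_b = predict(model_b, tok_b, words)
        if labels_a == labels_b and min(conf_a + conf_b, default=0.0) >= threshold:
            selected.append((words, labels_a))   # weakly supervised training example
    return selected


# Hypothetical usage with two fine-tuned checkpoints (paths are placeholders):
# tok_a = AutoTokenizer.from_pretrained("plm-a-finetuned")
# model_a = AutoModelForTokenClassification.from_pretrained("plm-a-finetuned").eval()
# tok_b = AutoTokenizer.from_pretrained("plm-b-finetuned")
# model_b = AutoModelForTokenClassification.from_pretrained("plm-b-finetuned").eval()
# augmented = pseudo_label(model_a, tok_a, model_b, tok_b, unlabeled_sentences)
```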


Notes

  1. https://huggingface.co/.


Acknowledgement

This paper is based on results obtained from JST CREST Grant Number JPMJCR22M2 and JSPS KAKENHI Grant Number JP23K24949.

Author information

Corresponding author

Correspondence to Hailemariam Mehari Yohannes.


Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Mehari Yohannes, H., Lynden, S., Amagasa, T., Matono, A. (2024). Semi-supervised Named Entity Recognition for Low-Resource Languages Using Dual PLMs. In: Rapp, A., Di Caro, L., Meziane, F., Sugumaran, V. (eds) Natural Language Processing and Information Systems. NLDB 2024. Lecture Notes in Computer Science, vol 14762. Springer, Cham. https://doi.org/10.1007/978-3-031-70239-6_12


  • DOI: https://doi.org/10.1007/978-3-031-70239-6_12

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-70238-9

  • Online ISBN: 978-3-031-70239-6

  • eBook Packages: Computer Science, Computer Science (R0)
