MizBERT: A Mizo BERT Model

Published: 26 June 2024

Abstract

This research investigates the use of pre-trained BERT transformers for the Mizo language. BERT (Bidirectional Encoder Representations from Transformers) is Google's transformer-based neural network approach to Natural Language Processing (NLP), renowned for its strong performance across a wide range of NLP tasks; its effectiveness for low-resource languages such as Mizo, however, remains largely unexplored. In this study, we introduce MizBERT, a dedicated Mizo language model. Through extensive pre-training on a corpus collected from diverse online platforms, MizBERT is tailored to the nuances of the Mizo language. We evaluate MizBERT on two intrinsic metrics, masked language modeling (MLM) accuracy and perplexity, obtaining scores of 76.12% and 3.2565, respectively. We also examine its performance on a downstream text classification task, where MizBERT outperforms both the multilingual BERT (mBERT) model and a Support Vector Machine (SVM) baseline, achieving an accuracy of 98.92%. These results underscore MizBERT's ability to capture and process the intricacies of the Mizo language.
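
For readers who want to probe such a model, the sketch below shows how a pre-trained Mizo masked language model could be queried with the Hugging Face Transformers library. This is an illustrative sketch only: the checkpoint path, the example Mizo sentence, and the masked word are placeholder assumptions, not artifacts released with this article.

```python
# Minimal sketch (not the authors' released code): querying a Mizo masked
# language model via Hugging Face Transformers. MODEL_ID is a placeholder.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

MODEL_ID = "path/to/mizbert-checkpoint"  # assumed local path or hub ID, not the official name

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForMaskedLM.from_pretrained(MODEL_ID)
model.eval()

# Mask one word in an illustrative Mizo sentence and let the model fill it in.
sentence = f"Mizoram chu ram {tokenizer.mask_token} tak a ni."
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Locate the [MASK] position and report the top-5 candidate tokens.
mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
top5 = logits[0, mask_pos].topk(5, dim=-1).indices[0]
print(tokenizer.convert_ids_to_tokens(top5.tolist()))
```

The downstream text classification experiment would follow the same pattern, loading the checkpoint with AutoModelForSequenceClassification and fine-tuning it on labeled Mizo text.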

Published In

ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 23, Issue 7
July 2024, 254 pages
EISSN: 2375-4702
DOI: 10.1145/3613605
Editor: Imed Zitouni

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

Published: 26 June 2024
Online AM: 25 May 2024
Accepted: 21 May 2024
Revised: 13 May 2024
Received: 09 October 2023
Published in TALLIP Volume 23, Issue 7

Author Tags

  1. Mizo
  2. BERT
  3. pre-trained language model

Qualifiers

  • Research-article
