
Development and Evaluation of Myanmar GPT: A Language Model for Myanmar Natural Language Processing

(Preprint Article)

Article Information

Title: Development and Evaluation of Myanmar GPT
Author and Creator: Min Si Thu
Co-authors: Dr. Myo Myint Oo, Pyae Sone Phyo
MyanmarGPT Announcement Date: 17 December 2023
SSRN Preprint ID: 4871732
GitHub Repository: Myanmar GPT GitHub
Hugging Face Model: Myanmar GPT on Hugging Face

Note: This preprint has not been fully edited and requires further refinement.
Development and Evaluation of Myanmar GPT: A Language
Model for Myanmar Natural Language Processing

Abstract
This paper presents the development and evaluation of Myanmar GPT, the first large language model (LLM) tailored for the Myanmar language, created by Min Si Thu. Utilizing a diverse corpus and advanced training techniques, Myanmar GPT aims to bridge the gap in natural language processing for underrepresented languages. The model's performance is evaluated using standard metrics, demonstrating significant improvements over existing solutions. This research contributes a valuable tool for the Myanmar NLP community and sets the stage for future advancements in the field.

Keywords: MyanmarGPT, Natural Language Processing, Transformer Architecture, Myanmar Language
1. Introduction

The Myanmar language, spoken by over 50 million people, is a cornerstone of the nation's cultural and social identity. Its significance is deeply embedded in the everyday lives of Myanmar's citizens, reflecting the country's rich history and diverse cultural practices. However, despite its profound importance, the Myanmar language has not received the level of attention it deserves in the burgeoning field of natural language processing (NLP). Compared with the resources and tools available for more widely spoken languages such as English, Chinese, and Spanish, the disparity is striking. This lack of Myanmar-specific NLP tools and resources poses a significant barrier to the development of applications that could immensely benefit the local population. Automated translation, sentiment analysis, language education tools, and other language-driven technologies remain underdeveloped for Myanmar, highlighting an urgent need for focused research and development in this area.

Existing NLP models often fall short in effectively capturing the unique nuances and complexities inherent in the Myanmar language. This shortfall can be attributed to several factors, including differences in syntax, grammar, and script that distinguish Myanmar from other languages. Current language models, typically trained on large multilingual datasets, tend to underperform when applied to Myanmar text. While proficient in languages with abundant resources, these models struggle with Myanmar's distinct linguistic characteristics, leading to suboptimal outcomes in various NLP tasks. For instance, the intricate structure of Myanmar syntax and the
specific rules governing its grammar present challenges that are not adequately addressed by generalized models. As a result, there is growing recognition within the NLP community of the need for a dedicated language model that can more effectively process and generate Myanmar text.

The primary objective of this research is to bridge this gap by developing a generative pre-trained transformer (GPT) model specifically designed for the Myanmar language. This endeavor involves several critical steps, starting with the collection and preprocessing of a comprehensive Myanmar language corpus. Given the scarcity of existing datasets, this task is both challenging and crucial, requiring meticulous curation of textual data to ensure a robust foundation for model training. Once the corpus is established, the next phase involves training the model using state-of-the-art techniques that have proven effective in other contexts. By tailoring these techniques to the specific requirements of the Myanmar language, the aim is to create a model that can outperform existing ones on various NLP tasks. Finally, the model's performance is rigorously evaluated against established benchmarks to assess its effectiveness and identify areas for further improvement.

This research makes several key contributions that extend beyond the development of the Myanmar GPT model itself. Firstly, the creation of a large-scale Myanmar language corpus represents a significant resource for the NLP community, offering a valuable dataset for future research and application development. Secondly, the development of the Myanmar GPT model marks a major advancement in language-specific NLP, providing a tool that can generate and process Myanmar text with greater accuracy and fluency than existing models. Thirdly, a comprehensive evaluation of the model's performance across various NLP tasks not only demonstrates its capabilities but also provides insights into the specific challenges and opportunities associated with Myanmar language processing. These contributions collectively underscore the potential of this research to transform the landscape of NLP for the Myanmar language, paving the way for more effective and inclusive technological solutions that can benefit millions of Myanmar speakers.

2. Related Work

The advent of transformer models, particularly the GPT series, has significantly transformed the field of natural language processing by enabling the development of highly proficient language models. Among these, GPT-3, developed by OpenAI, stands out as a notable example due to its exceptional capabilities in text generation and comprehension, showcasing the potential of the transformer architecture to handle a wide array of language tasks with remarkable proficiency. However, these advanced models predominantly focus on languages with abundant data resources, which results in underrepresentation and limited functionality for languages like Myanmar. The disparity in resource allocation and model development underscores a
significant challenge in the NLP field, where languages with fewer resources are often left behind.

Efforts to address Myanmar language processing have historically included rule-based systems, statistical models, and, more recently, neural network-based approaches. These efforts laid essential groundwork, leading to the development of Myanmar-English machine translation systems and sentiment analysis tools. While these initiatives provided valuable insights and initial tools for Myanmar language processing, they often depend on limited datasets and lack the sophistication and comprehensive capabilities of modern transformer-based models. The reliance on small, sometimes insufficient datasets restricts their performance and scalability, making it difficult to achieve the accuracy and functionality seen in models developed for more widely spoken languages.

Myanmar GPT aims to bridge this gap by leveraging the transformer architecture to create a model specifically designed for the Myanmar language. This approach distinguishes itself from previous methods by focusing on the unique linguistic characteristics of Myanmar and training the model on a large and diverse corpus. The extensive dataset ensures that the model can capture a wide range of linguistic nuances and contexts, providing superior performance across various NLP tasks. By utilizing state-of-the-art techniques and a comprehensive training process, Myanmar GPT seeks to overcome the limitations faced by earlier models and deliver a highly effective tool for Myanmar language processing.

The development of Myanmar GPT represents a significant advancement in the field of NLP for underrepresented languages. By filling a critical gap in existing solutions, this model has the potential to greatly enhance the capabilities of applications involving the Myanmar language, such as automated translation, sentiment analysis, and language education tools. The superior performance of Myanmar GPT across these tasks demonstrates the importance of tailored language models and highlights the need for continued investment in NLP resources for less commonly spoken languages. Through this focused effort, Myanmar GPT not only advances the technological landscape for Myanmar but also sets a precedent for similar initiatives targeting other underrepresented languages worldwide.

3. Background Theory

Burmese, as a low-resource language with limited digital text corpora and language processing tools, presents unique challenges for NLP. These challenges can be effectively addressed with the transformer architecture, which leverages self-attention mechanisms to model complex linguistic relationships and dependencies.

3.1 Burmese as a Low-Resource Language

Burmese, the official language of Myanmar, is spoken by over 50 million people. Despite its wide usage, Burmese is considered a
low-resource language in the field of natural language processing. This classification arises from the limited availability of digital text corpora, linguistic datasets, and language processing tools tailored to Burmese. Unlike widely spoken languages such as English, Spanish, or Chinese, which benefit from extensive linguistic resources and research, Burmese suffers from a lack of annotated datasets and pretrained models.

Low-resource languages face several challenges in NLP development. First, the scarcity of large, high-quality text corpora hinders the training of effective language models. Second, the unique linguistic features of Burmese, including its script, syntax, and grammar, are not well represented in existing multilingual models. Burmese uses a syllabic alphabet, which poses additional challenges for tokenization and language modeling. These factors contribute to the underperformance of generic NLP models when applied to Burmese text, underscoring the need for language-specific solutions.

3.2 Transformer Architecture

The transformer architecture, introduced by Vaswani et al. in 2017 [1], has revolutionized the field of NLP by enabling the development of highly effective language models. Unlike previous models that relied on recurrent neural networks (RNNs) or convolutional neural networks (CNNs), transformers utilize a mechanism known as self-attention. This allows the model to weigh the importance of different words in a sentence, regardless of their position, and capture long-range dependencies more effectively.

Transformers are composed of an encoder and a decoder. The encoder processes the input text and generates a set of continuous representations; the decoder then uses these representations to produce the output text. Both the encoder and decoder are built from layers of self-attention and feed-forward neural networks. Self-attention mechanisms compute the relevance of each word in the input sequence to every other word, enabling the model to understand context and relationships within the text comprehensively.

Key components of the transformer architecture include (a minimal sketch of the first component follows the list):

1. Self-Attention Mechanism: Calculates attention scores for each word in the sequence, allowing the model to focus on relevant parts of the text.
2. Positional Encoding: Adds information about the position of each word in the sequence, as transformers do not inherently consider word order.
3. Feed-Forward Neural Networks: Apply non-linear transformations to the attention outputs, adding complexity and depth to the model.
4. Residual Connections and Layer Normalization: Improve training stability and model performance by facilitating gradient flow and normalization across layers.
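To make the first item concrete, here is a minimal sketch of scaled dot-product self-attention in PyTorch. The tensor names and the scaling by the square root of the key dimension follow the standard formulation of Vaswani et al.; nothing here is taken from the Myanmar GPT implementation, and a GPT-style decoder would additionally apply a causal mask to the scores so that each position attends only to earlier tokens.

```python
import torch
import torch.nn.functional as F

def self_attention(x: torch.Tensor, w_q: torch.Tensor,
                   w_k: torch.Tensor, w_v: torch.Tensor) -> torch.Tensor:
    """Scaled dot-product self-attention for a single head.

    x: (seq_len, d_model) token embeddings; w_q, w_k, w_v: (d_model, d_k) projections.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v       # project into query/key/value spaces
    scores = (q @ k.T) / (k.size(-1) ** 0.5)  # every token scored against every other
    weights = F.softmax(scores, dim=-1)       # attention distribution per position
    return weights @ v                        # context-weighted mix of value vectors
```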
The success of transformers in NLP tasks led to the development of models such as BERT (Bidirectional Encoder Representations from Transformers) and GPT (Generative Pre-trained Transformer). GPT, in particular, leverages a transformer-based architecture to generate coherent and contextually relevant text. Pretrained on large text corpora, GPT models can be fine-tuned for specific tasks, including language translation, summarization, and text generation.

In the context of low-resource languages like Burmese, the transformer architecture's ability to model complex linguistic structures and relationships is particularly advantageous. By training a transformer-based model on a carefully curated Burmese corpus, it is possible to develop a language model that significantly improves upon existing NLP tools for Burmese. The resulting Myanmar GPT can effectively capture the nuances of the Burmese language, enabling more accurate and contextually appropriate text generation and comprehension.

4. Methodology for Developing Myanmar GPT

The foundation of any effective language model lies in the quality and diversity of its dataset. For the development of Myanmar GPT, a comprehensive and diverse corpus was meticulously assembled from multiple sources to capture the rich and varied use of the Myanmar language. The dataset included a wide array of texts such as news articles, social media posts, literary works, and conversational text. By encompassing such a broad spectrum of sources, the dataset ensured extensive coverage of the different topics, styles, and contexts in which the language is used. This extensive collection aimed to provide the model with a deep and nuanced understanding of the language, crucial for generating relevant and coherent responses.

4.1 Data Preprocessing

Ensuring the quality and consistency of the dataset involved a thorough preprocessing phase. Text normalization was a critical first step, addressing the variations in spelling, punctuation, and formatting that are common in real-world text. This process standardizes the text to a uniform format, which is essential for reducing noise and improving the model's performance. In the context of the Myanmar language, normalization also included handling script-specific issues such as diacritic marks and script variations.
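The preprint does not publish its normalization code; the function below is a minimal sketch of the kind of steps this paragraph describes, assuming standard Unicode canonical composition plus whitespace and punctuation cleanup. The Zawgyi-to-Unicode issue noted in the final comment is a well-known example of a Burmese "script variation", but it is only flagged here, since converting it reliably requires a dedicated detector.

```python
import re
import unicodedata

def normalize(text: str) -> str:
    """Illustrative text normalization for Burmese input (assumed pipeline)."""
    # Canonical composition (NFC) gives a consistent code-point form for
    # combining diacritic marks.
    text = unicodedata.normalize("NFC", text)
    # Collapse runs of whitespace introduced by scraping and PDF extraction.
    text = re.sub(r"\s+", " ", text)
    # Standardize spacing around Burmese section marks (U+104A and U+104B).
    text = re.sub(r"\s*([\u104A\u104B])\s*", r"\1 ", text)
    # Note: real pipelines also detect Zawgyi-encoded text and convert it to
    # standard Unicode; that step needs a dedicated tool and is omitted here.
    return text.strip()
```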
Tokenization, a fundamental step in natural language processing, was customized for the Myanmar script. Unlike languages that use spaces to separate words, the Myanmar script often requires syllable-based segmentation and recognition of compound words. Developing a tokenizer that accurately segments the text into meaningful units was essential. This customized tokenization process helped to accurately capture the linguistic structure of the Myanmar language, thereby improving the quality of the input data fed into the model.
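The customized tokenizer itself is not published in this preprint; the following is a minimal sketch of rule-based Burmese syllable segmentation of the kind the paragraph describes, using a simplified regular expression over the Myanmar Unicode block (U+1000 to U+109F). The break rule and the sample string are illustrative assumptions, not the project's actual tokenizer.

```python
import re

# Simplified heuristic: a new syllable starts at a consonant (U+1000-U+1021)
# that is not stacked under a virama (U+1039) and is not itself an
# asat-killed final consonant (followed by U+103A).
_SYLLABLE_BREAK = re.compile(r"(?<!\u1039)([\u1000-\u1021])(?!\u103A)")

def segment_syllables(text: str) -> list:
    """Split Burmese text into syllables by marking heuristic break points."""
    marked = _SYLLABLE_BREAK.sub(r"|\1", text)
    return [s for s in marked.split("|") if s.strip()]

# Illustrative example: "မြန်မာဘာသာ" ("Myanmar language")
# -> ['မြန်', 'မာ', 'ဘာ', 'သာ']
print(segment_syllables("မြန်မာဘာသာ"))
```

In practice, syllables produced this way often serve as the base units for a subword vocabulary (for example, byte-pair encoding) rather than being fed to the model directly.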
4.2 Model Architecture

Myanmar GPT is based on the GPT-2 architecture, which is renowned for its capability to generate human-like text. The GPT-2 model employs a multi-layer transformer network, an architecture that has revolutionized NLP tasks. Central to the transformer model is the self-attention mechanism, which allows the model to weigh the importance of different words in a sentence when generating text. This mechanism is particularly effective at capturing long-range dependencies within the text, enabling the model to generate coherent and contextually relevant sentences.

The architecture of Myanmar GPT includes an embedding layer that converts input tokens into dense vectors, which are then processed by multiple layers of self-attention and feed-forward neural networks. The final layer is an output layer that generates a probability distribution over the vocabulary for predicting the next token. This architecture enables the model to understand and generate text that is fluent and contextually appropriate.
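As a concrete illustration of this stack, the snippet below instantiates a GPT-2-style model with the Hugging Face transformers library. The preprint does not report Myanmar GPT's exact dimensions, so the values shown are simply the GPT-2 "small" defaults with a placeholder vocabulary size; only the overall shape (embedding layer, stacked self-attention and feed-forward blocks, output head over the vocabulary) reflects the description above.

```python
from transformers import GPT2Config, GPT2LMHeadModel

config = GPT2Config(
    vocab_size=50_000,   # placeholder: size of the custom Myanmar vocabulary
    n_positions=1024,    # maximum context length
    n_embd=768,          # token embedding / hidden dimension
    n_layer=12,          # stacked transformer blocks (self-attention + FFN)
    n_head=12,           # attention heads per block
)
model = GPT2LMHeadModel(config)  # adds the output head: logits over the vocabulary
print(f"{model.num_parameters():,} parameters")
```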
4.3 Training Process

Training a model of this complexity requires significant computational resources. Myanmar GPT was trained using the Adam optimizer, a popular optimization algorithm known for its efficiency and effectiveness in training deep learning models. A learning rate scheduler was employed to adjust the learning rate during training, helping the model converge faster and avoid issues such as overfitting.

The training process was conducted on high-performance GPUs to handle the substantial computational demands of the transformer architecture. Such GPUs are essential for processing large amounts of data quickly and efficiently, enabling the training of large-scale models like GPT-2. During training, hyperparameters such as batch size, learning rate, and the number of training epochs were carefully tuned. Tuning these hyperparameters was crucial for optimizing the model's performance, ensuring that it learned effectively from the dataset and produced high-quality text.
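A minimal sketch of the training step described here, assuming the model from the previous snippet and a dataloader that yields batches of tokenized Myanmar text; the learning rate, warmup length, and step count are placeholders, since the preprint does not report the tuned values. The linear warmup-then-decay schedule is one common choice for GPT-2-style training, not the paper's documented scheduler.

```python
import torch

# Placeholder hyperparameters: the preprint does not report the tuned values.
LEARNING_RATE, WARMUP_STEPS, TOTAL_STEPS = 5e-5, 500, 100_000

optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)
# Linear warmup followed by linear decay of the learning rate.
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer,
    lambda step: min((step + 1) / WARMUP_STEPS,
                     max(0.0, (TOTAL_STEPS - step) / (TOTAL_STEPS - WARMUP_STEPS))),
)

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device).train()
for batch in dataloader:                      # batches of tokenized Myanmar text
    input_ids = batch["input_ids"].to(device)
    # For causal LM training, labels are the inputs; the model shifts internally.
    loss = model(input_ids=input_ids, labels=input_ids).loss
    loss.backward()
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
```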
4.4 Performance Evaluation

Evaluating the performance of Myanmar GPT involved a combination of quantitative and qualitative metrics. The quantitative metrics included perplexity and the BLEU score. Perplexity measures the model's uncertainty when predicting the next word in a sequence, with lower perplexity indicating better performance. The BLEU score, commonly used in machine translation, measures the accuracy of generated text by comparing it to a reference text. These metrics provided a quantitative assessment of the model's ability to generate accurate and coherent text.
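To connect the metric to the training loss: perplexity is the exponential of the average per-token cross-entropy, so it can be computed directly from held-out loss values. The helper below is an illustrative sketch under the same assumptions as the previous snippets (a causal LM and a dataloader of tokenized batches), not the paper's evaluation code; BLEU is typically computed with an external package such as sacrebleu.

```python
import math
import torch

@torch.no_grad()
def evaluate_perplexity(model, dataloader, device="cuda"):
    """Perplexity = exp(mean per-token negative log-likelihood) on held-out text."""
    model.eval()
    total_nll, total_tokens = 0.0, 0
    for batch in dataloader:
        ids = batch["input_ids"].to(device)
        loss = model(input_ids=ids, labels=ids).loss  # mean NLL over predicted tokens
        n_predicted = ids.numel() - ids.size(0)       # labels shift: seq_len - 1 per row
        total_nll += loss.item() * n_predicted
        total_tokens += n_predicted
    return math.exp(total_nll / total_tokens)
```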
In addition to quantitative metrics, human evaluation played a vital role in assessing the model's performance. Native speakers of the Myanmar language were involved in rating the fluency and coherence of the
generated text. This qualitative assessment was essential for understanding the practical effectiveness of the model in real-world applications. Human evaluators provided insights into nuances of the language that quantitative metrics might miss, ensuring that the model's outputs were not only accurate but also natural and contextually appropriate.

5. Results

The performance evaluation of Myanmar GPT involved rigorous benchmarking against established models using several key evaluation metrics. Notably, the model achieved a perplexity score significantly lower than that of baseline models, indicating superior predictive capability and a better ability to understand and predict Myanmar language text. Additionally, the BLEU score showed substantial improvements, underscoring Myanmar GPT's proficiency in generating more accurate and contextually appropriate sentences than existing approaches. These quantitative metrics provide robust evidence of the model's advancement over previous methods, highlighting its effectiveness in enhancing the quality of language processing tasks specific to Myanmar.

Qualitative assessments further validated Myanmar GPT's capabilities through evaluations conducted by native speakers. These assessments focused on the fluency, coherence, and relevance of text generated by the model. Feedback from these evaluations consistently indicated that Myanmar GPT produces text that not only adheres to correct grammar and syntax but also captures the stylistic nuances and contextual accuracy expected in natural language use. This qualitative validation underscores Myanmar GPT's ability to meet high standards of linguistic proficiency, surpassing the performance of earlier models that struggled with the complexities of the Myanmar language.

The results collectively demonstrate the effectiveness of Myanmar GPT across a spectrum of NLP tasks, confirming its capability to generate coherent and contextually relevant text. By successfully learning and applying the intricacies of the Myanmar language, the model has achieved a significant milestone in advancing Myanmar language processing technology. These findings not only highlight the model's technical achievements but also establish a new benchmark for future research and development efforts in Myanmar language NLP. Moving forward, the success of Myanmar GPT sets a promising precedent for further innovations aimed at enhancing language-specific models and expanding their application in diverse linguistic contexts globally.

6. Discussion

The development of Myanmar GPT represents a significant milestone with wide-ranging implications for the Myanmar language and its speakers. By introducing advanced NLP tools tailored specifically for Myanmar, Myanmar GPT plays a crucial role in preserving and promoting the language in the digital era. These tools not
only enhance language processing capabilities but also facilitate better access to information and services for Myanmar speakers, thereby fostering digital inclusivity and empowerment within the community. Moreover, Myanmar GPT serves as a foundational platform for future research and development in Myanmar NLP, encouraging ongoing innovation and collaboration among researchers and practitioners interested in advancing language technology for Myanmar.

Despite its achievements, Myanmar GPT faces inherent challenges and limitations. The model's performance depends heavily on the quality and diversity of its training data, and biases present within the dataset can influence the accuracy and fairness of its outputs. Moreover, the training and deployment of large transformer models like Myanmar GPT require substantial computational resources, posing barriers to accessibility in certain research environments. Addressing these challenges in future research could involve expanding the dataset to include more varied sources of text to improve the model's robustness, and developing more efficient training methodologies to mitigate computational demands. Additionally, focusing on practical applications such as chatbots and translation systems built on Myanmar GPT could further demonstrate its utility and expand its impact across different domains of language technology.

7. Conclusion

This paper presents the development and evaluation of Myanmar GPT, a transformer-based language model specifically designed for the Myanmar language. By leveraging a diverse corpus and advanced training techniques, Myanmar GPT demonstrates significant improvements in handling various NLP tasks compared to existing models. The successful implementation of Myanmar GPT marks a critical step forward in Myanmar language processing, offering valuable tools for the local NLP community and setting the stage for future advancements. This research underscores the importance of developing language-specific models to ensure inclusive and equitable access to AI technologies.
References

1. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30.

2. Ma, X., Yang, X., Xiong, W., Chen, B., Yu, L., Zhang, H., ... & Zhou, C. (2024). Megalodon: Efficient LLM pretraining and inference with unlimited context length. arXiv preprint arXiv:2404.08801.

3. Soky, K., Mimura, M., Kawahara, T., Li, S., Ding, C., Chu, C., & Sam, S. (2021, November). Khmer speech translation corpus of the Extraordinary Chambers in the Courts of Cambodia (ECCC). In 2021 24th Conference of the Oriental COCOSDA International Committee for the Co-ordination and Standardisation of Speech Databases and Assessment Techniques (O-COCOSDA) (pp. 122-127). IEEE.

4. San, M. E., Usanavasin, S., Thu, Y. K., & Okumura, M. (2024). A study for enhancing low-resource Thai-Myanmar-English neural machine translation. ACM Transactions on Asian and Low-Resource Language Information Processing.

5. Black, S., Biderman, S., Hallahan, E., Anthony, Q., Gao, L., Golding, L., ... & Weinbach, S. (2022). GPT-NeoX-20B: An open-source autoregressive language model. arXiv preprint arXiv:2204.06745.

6. Jiang, S., Huang, X., Cai, X., & Lin, N. (2021). Pre-trained models and evaluation data for the Myanmar language. In Neural Information Processing: 28th International Conference, ICONIP 2021, Sanur, Bali, Indonesia, December 8-12, 2021, Proceedings, Part VI (pp. 449-458). Springer International Publishing.

7. Naing, M. T., & Thida, A. (2014). Automatic Myanmar text summarization system with semantic roles (Doctoral dissertation, MERAL Portal).

8. Htet, A. K. (2024). Building a dataset and exploring low-resource approaches to natural language inference with Myanmar (Doctoral dissertation, Macquarie University).
