Transliteration of Arabizi into Arabic Script for Tunisian Dialect

Published: 28 November 2019

Abstract

The evolution of information and communication technology has markedly influenced communication between correspondents. It has facilitated the transmission of information and has engendered new forms of written communication (email, chat, SMS, comments, etc.). Most of these messages and comments are written in Latin script, a form of writing also called Arabizi. Moreover, the language used in social media and SMS messaging is characterized by informal and non-standard vocabulary, such as repeated letters for emphasis, typos, non-standard abbreviations, and nonlinguistic content like emoticons. Since the Tunisian dialect lacks the basic tools and linguistic resources available for Modern Standard Arabic, we use these written sources as a starting point to build large corpora automatically. In the context of natural language processing, and to benefit from social media data, transliterating from Arabizi to Arabic script is a necessary step because most recently available tools for processing the Tunisian dialect expect Arabic script input. Indeed, the transliteration task can help construct and enrich parallel corpora and dictionaries for the Tunisian dialect and can be useful for developing various natural language processing applications such as sentiment analysis, opinion mining, topic detection, and machine translation. In this article, we focus on converting Tunisian dialect text written in Latin script to Arabic script following the Conventional Orthography for Dialectal Arabic. We propose two models to transliterate Arabizi into Arabic script for the Tunisian dialect: a rule-based model and a discriminative model that casts transliteration as a sequence classification task based on conditional random fields. In the first model, we use a set of transliteration rules to convert Tunisian dialect Arabizi texts to Arabic script. In the second model, transliteration is performed at both the word and character levels. Overall, our models achieve a character error rate of 10.47%.
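To make the two central notions of the abstract concrete, the following minimal Python sketch shows a rule-based transliteration step and a character error rate computation. It is an illustration under stated assumptions, not the article's implementation: the rule table `TOY_RULES` is a hypothetical handful of widely used Arabizi conventions (for example, "3" for ع and "7" for ح), the greedy longest-match strategy is a design choice made here, and the error rate is assumed to follow the usual definition of Levenshtein distance divided by reference length. The function names are invented for this example.

```python
from typing import Dict, List

# Hypothetical toy rule table (NOT the article's rules): a few widely used
# Arabizi conventions, written with escapes to avoid right-to-left display issues.
TOY_RULES: Dict[str, str] = {
    "ch": "\u0634",  # sheen
    "kh": "\u062e",  # khah
    "3": "\u0639",   # ain
    "7": "\u062d",   # hah
    "9": "\u0642",   # qaf
    "b": "\u0628",   # beh
    "l": "\u0644",   # lam
    "m": "\u0645",   # meem
}


def transliterate_rule_based(text: str, rules: Dict[str, str] = TOY_RULES) -> str:
    """Rewrite text with greedy longest-match rules.

    Characters not covered by any rule are passed through unchanged.
    """
    max_len = max(len(key) for key in rules)
    lowered = text.lower()
    out: List[str] = []
    i = 0
    while i < len(lowered):
        # Try the longest candidate first so "ch" wins over a single letter.
        for size in range(max_len, 0, -1):
            chunk = lowered[i:i + size]
            if chunk in rules:
                out.append(rules[chunk])
                i += len(chunk)
                break
        else:
            out.append(lowered[i])
            i += 1
    return "".join(out)


def edit_distance(hyp: str, ref: str) -> int:
    """Levenshtein distance (insertions, deletions, substitutions cost 1)."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def character_error_rate(hyp: str, ref: str) -> float:
    """Assumed definition: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(hyp, ref) / len(ref)
```

A system output would be scored with `character_error_rate(transliterate_rule_based(source), reference)`. Under the assumed definition, a value of 0.1047 would mean roughly one character edit per ten reference characters.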

      Published In

      ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 19, Issue 2
      March 2020
      301 pages
      ISSN:2375-4699
      EISSN:2375-4702
      DOI:10.1145/3358605
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 28 November 2019
      Accepted: 01 September 2019
      Revised: 01 July 2019
      Received: 01 February 2019
      Published in TALLIP Volume 19, Issue 2

      Author Tags

      1. Arabizi corpus
      2. CRF model
      3. Natural language processing
      4. Tunisian dialect
      5. diacritization
      6. rule-based approach
      7. transliteration

      Qualifiers

      • Research-article
      • Research
      • Refereed
