
Transliteration of Arabizi into Arabic Script for Tunisian Dialect

Authors:

Abir Masmoudi,

Mariem Ellouze Khmekhem,

Mourad Khrouf,

Lamia Hadrich Belguith

ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP), Volume 19, Issue 2

Article No.: 32, Pages 1 - 21

https://doi.org/10.1145/3364319

Published: 28 November 2019


Abstract

The evolution of information and communication technology has markedly influenced communication between correspondents. This evolution has facilitated the transmission of information and has engendered new forms of written communication (email, chat, SMS, comments, etc.). Most of these messages and comments are written in Latin script, also called Arabizi. Moreover, the language used in social media and SMS messaging is characterized by informal and non-standard usage, such as repeated letters for emphasis, typos, non-standard abbreviations, and nonlinguistic content like emoticons. Since the Tunisian dialect suffers from the unavailability of basic tools and linguistic resources compared to Modern Standard Arabic, we resort to these written sources as a starting point to build large corpora automatically. In the context of natural language processing, and to benefit from these networks' data, transliterating from Arabizi to Arabic script is a necessary step because most recently available tools for processing the Tunisian dialect expect Arabic script input. Indeed, the transliteration task can help construct and enrich parallel corpora and dictionaries for the Tunisian dialect and can be useful for developing various natural language processing applications such as sentiment analysis, opinion mining, topic detection, and machine translation. In this article, we focus on converting Tunisian dialect text written in Latin script to Arabic script following the Conventional Orthography for Dialectal Arabic. We propose two models to transliterate Arabizi into Arabic script for the Tunisian dialect, namely a rule-based model and a discriminative model that treats transliteration as a sequence classification task based on conditional random fields. In the first model, we use a set of transliteration rules to convert Tunisian dialect Arabizi texts to Arabic script. In the second model, transliteration is performed at both the word and character levels. Our models achieved a character error rate of 10.47%.
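The abstract reports a character error rate but does not say how it is computed. As a minimal sketch, assuming the common definition (total character-level edit distance between system output and reference, divided by the total number of reference characters), the metric could be calculated as follows. The function names and the sample word pair are illustrative assumptions, not taken from the article.

```python
from typing import Sequence


def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of character insertions, deletions, and substitutions
    needed to turn hyp into ref (classic dynamic programming, two rows)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete r
                            curr[j - 1] + 1,      # insert h
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def character_error_rate(references: Sequence[str],
                         hypotheses: Sequence[str]) -> float:
    """Assumed definition: summed edit distance / summed reference length."""
    if len(references) != len(hypotheses):
        raise ValueError("references and hypotheses must have equal length")
    total_chars = sum(len(r) for r in references)
    if total_chars == 0:
        raise ValueError("references contain no characters")
    total_edits = sum(levenshtein(r, h)
                      for r, h in zip(references, hypotheses))
    return total_edits / total_chars


# Hypothetical example pair: one wrong final letter in a four-letter word.
print(character_error_rate(["برشا"], ["برشة"]))  # 0.25
```

Under this definition, a rate of 10.47% would correspond to roughly one character edit per ten reference characters. Python strings are compared code point by code point here, so any normalization of Arabic text before scoring would be a separate design decision.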
