research-article

Automatic Diacritics Restoration for Tunisian Dialect

Authors:

Salima Mdhaffar,

Lamia Hadrich Belguith

ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP), Volume 18, Issue 3

Article No.: 28, Pages 1 - 18

https://doi.org/10.1145/3297278

Published: 12 July 2019

Abstract

Modern Standard Arabic and the Arabic dialects are usually written without diacritics. The absence of these marks constitutes a real problem for the automatic processing of such data by NLP tools, because writing Arabic without diacritics introduces several types of ambiguity. First, a word without diacritics can have many possible meanings depending on its diacritization. Second, the undiacritized surface form of an Arabic word might have as many as 200 readings, depending on the complexity of its morphology [12]; the agglutinative nature of Arabic can produce ambiguity that only diacritics resolve. Third, without diacritics a word can have many possible parts of speech (POS) instead of one. This is the case with words that have the same spelling and POS tag but a different lexical sense, and with words that have the same spelling but different POS tags and lexical senses [8]. Finally, there is ambiguity at the grammatical level (syntactic ambiguity). In this article, we present the first work that investigates the automatic diacritization of Tunisian Dialect texts. We first describe our annotation guidelines and procedure. We then propose two models: a statistical machine translation (SMT) model and a discriminative model that treats diacritization as a sequence classification task based on Conditional Random Fields (CRF). In the second approach, we integrate POS features to influence the generation of diacritics. Diacritics restoration was performed at both the word and the character levels. The CRF system achieved the better result, with a Word Error Rate (WER) of 21.44% compared with 34.6% for SMT.
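To make the character-level framing and the WER figures concrete, the following is a minimal sketch and not the authors' implementation. It assumes that diacritics are the Unicode combining marks U+064B to U+0652, that each base letter is labeled with the diacritics that follow it, and that WER is the fraction of words with at least one wrong diacritic. The abstract does not spell out its exact WER definition. The function names and the toy strings are hypothetical.

```python
# Arabic combining marks used as diacritics: fathatan (U+064B) through sukun (U+0652).
# Assumption: this range is the label inventory; the article may use a different set.
DIACRITICS = {chr(code) for code in range(0x064B, 0x0653)}


def strip_diacritics(text):
    """Return the undiacritized form, i.e. the input a restoration system sees."""
    return "".join(ch for ch in text if ch not in DIACRITICS)


def to_char_labels(word):
    """Pair each base letter with the (possibly empty) diacritics that follow it."""
    pairs = []
    for ch in word:
        if ch not in DIACRITICS:
            pairs.append((ch, ""))
        elif pairs:
            # Attach the mark to the most recent base letter (e.g. shadda + fatha).
            letter, label = pairs[-1]
            pairs[-1] = (letter, label + ch)
    return pairs


def diacritization_wer(reference, hypothesis):
    """Fraction of words whose diacritized form differs from the reference."""
    ref_words, hyp_words = reference.split(), hypothesis.split()
    if len(ref_words) != len(hyp_words):
        raise ValueError("reference and hypothesis must have the same number of words")
    if not ref_words:
        return 0.0
    errors = sum(ref != hyp for ref, hyp in zip(ref_words, hyp_words))
    return errors / len(ref_words)


# Toy strings for illustration only (not taken from the article's corpus).
kataba = "\u0643\u064E\u062A\u064E\u0628\u064E"  # kaf+fatha, ta+fatha, ba+fatha
kutub = "\u0643\u064F\u062A\u064F\u0628"         # kaf+damma, ta+damma, ba
kutib = "\u0643\u064F\u062A\u0650\u0628"         # kaf+damma, ta+kasra, ba

print(to_char_labels(kataba))
print(strip_diacritics(kataba) == strip_diacritics(kutub))  # both share one bare spelling
print(diacritization_wer(kataba + " " + kutub, kataba + " " + kutib))
```

Because restoration keeps the number of words unchanged, this sketch compares words position by position instead of computing an edit distance. The (letter, label) pairs are the kind of sequence a character-level classifier such as a CRF would be trained to predict from the bare letters.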

References

[1]

G. Abandah, A. Graves, B. Al-Shagoor, A. Arabiyat, F. Jamour, and M. Al-Taee. 2015. Automatic diacritization of Arabic text using recurrent neural networks. Int. J. Doc. Anal. Recogn. 18, 2 (June 2015), 183--197.

[2]

A. Ahmed and M. Elaraby. 2000. A large-scale computational processor of the Arabic morphology and applications. PhD thesis, Faculty of Engineering, Cairo University Giza, Egypt.

[3]

A. Al-Taani and S. Abu Al-Rub. 2009. A rule-based approach for tagging non-vocalized Arabic words. Int. Arab J. Info. Technol. 6, 3 (2009), 320--328.

[4]

Y. A. Alotaibi, A. H. Meftah, and S. A. Selouani. 2013. Diacritization, automatic segmentation and labeling for levantine Arabic speech. In Proceedings of the Digital Signal Processing and Signal Processing Education Meeting (DSP/SPE’13).

[5]

M. Al-Badrashiny, A. Hawwari, and M. Diab. 2017. A layered language model-based hybrid approach to automatic full diacritization of Arabic. In Proceedings of the 3rd Arabic Natural Language Processing Workshop (WANLP’17).

[6]

M. Ameur, Y. Moulahoum, and A. Guessoum. 2015. Restoration of Arabic diacritics using a multilevel statistical model. In Proceedings of the International Federation for Information Processing (IFIP’15).

[7]

A. Z. Ayman, M. Elmahdy, H. Husni, and J. Al Jaam. 2016. Automatic diacritics restoration for Arabic text. International J. Comput. Info. Sci. 12, 2 (Dec. 2016), 159--165.

[8]

A. Azmi and R. Almajed. 2015. A survey of automatic Arabic diacritization techniques. Nat. Lang. Eng. 21, 477--495.

[9]

T. Baccouche. 2003. L’arabe, d’une koinè dialectale à une langue de culture. Mémoires de la Société de Linguistique de Paris, Tome XI (Les langues de communication), 87--93.

[10]

T. Baccouche. 1994. L’emprunt en arabe moderne, Beit Elhikma et IBLV.

[11]

Y. Belinkov and J. Glass. 2015. Arabic diacritization with recurrent neural networks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

[12]

H. Bouamor, W. Zaghouani, M. Diab, O. Obeid, O. Kemal, M. Ghoneim, and A. Hawwari. 2015. A pilot study on Arabic multi-genre corpus diacritization annotation. In Proceedings of the 2nd Workshop on Arabic Natural Language Processing.

[13]

R. Boujelbane, M. Mallek, M. Ellouze, and L. Belguith. 2014. Fine-grained (POS) tagging of spoken Tunisian dialect corpora. In Proceedings of the International Conference on Applications of Natural Language to Information Systems, (NLDB’14).

[14]

P. Brown, S. Pietra, V. Pietra, and R. Mercer. 1993. The mathematics of statistical machine translation: Parameter estimation. Comput. Linguist. 19, 2 (1993), 263--311.

[15]

M. Diab, M. Ghoneim, and N. Habash. 2007. Arabic diacritization in the context of statistical machine translation. In Proceedings of MTSummit.

[16]

M. Elshafei, H. Al-muhtaseb, and M. Alghamdi. 2006. Statistical methods for automatic diacritization of Arabic text. In Proceedings of Saudi 18th National Computer Conference (NCC’06).

[17]

A. Fashwan and S. Alansary. 2017. SHAKKIL: An automatic diacritization system for modern standard Arabic texts. Proceedings of the 3rd Arabic Natural Language Processing Workshop (WANLP’17).

[18]

Y. Gal. 2002. An HMM approach to vowel restoration in Arabic and Hebrew. In Proceedings of the ACL Workshop on Computational Approaches to Semitic Languages (SEMITIC’02).

[19]

M. L. Gibson. 1998. Dialect Contact in Tunisian Arabic: Sociolinguistic and Structural Aspects. PhD thesis. Department of linguistic science, University of Reading.

[20]

M. Graja, M. Jaoua, and L. Belguith. 2015. Statistical framework with knowledge base integration for robust speech understanding of the Tunisian dialect. IEEE/ACM Trans. Audio, Speech, Lang. Process. 23, 12 (2015), 2311--2321.

[21]

N. Habash, A. Shahrour, and M. Al-Khalil. 2016. Exploiting Arabic diacritization for high-quality automatic annotation. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC’16).

[22]

N. Habash, M. Diab, and O. Rambow. 2012. Conventional orthography for dialectal Arabic. In Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC’12).

[23]

N. Habash and O. Rambow. 2007. Arabic diacritization through full morphological tagging. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics.

[24]

O. Hamed and T. Zesch. 2017. A survey and comparative study of Arabic diacritization tools. J. Lang. Technol. Computat. Linguist. 32, 1.

[25]

S. Harrat, M. Abbas, K. Meftouh, K. Smaili, E. N. S. Bouzareah, and C. Loria. 2013. Diacritics restoration for Arabic dialect texts. In Proceedings of the 14th Annual Conference of the International Speech Communication.

[26]

E. Hermena, D. Drieghe, S. Hellmuth, and P. Simon. 2015. Processing of Arabic diacritical marks: Phonological syntactic disambiguation of homographic verbs and visual crowding effects. J. Exper. Psychol. Hum. Percept. Perform. 41, 494--507.

[27]

Y. Hifny. 2012. Higher order n-gram language models for Arabic diacritics restoration. In Proceedings of the 12th Conference on Language Engineering (ESOLEC’12).

[28]

C. Holes. 2004. Modern Arabic: Structures, Functions, and Varieties. Georgetown University Press, Washington.

[29]

M. Jarrar, N. Habash, D. Akra, N. Zalmout, and W. Bank. 2014. Building a corpus for palestinian Arabic: A preliminary study. In Proceedings of the EMNLP 2014 Workshop on Arabic Natural Language Processing (ANLP’14). 18--27.

[30]

K. Kirchhoff and D. Vergyri. 2005. Cross-dialectal data sharing for acoustic modeling in Arabic speech recognition. Speech Commun. 46, 37--51.

[31]

P. Koehn, H. Hoang, A. Birch, C. Callison-Burch, M. Federico, and N. Bertoldi. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the Association for Computational Linguistics (ACL’07).

[32]

A. Kubra and G. Eryigit. 2014. Vowel and diacritic restoration for social media texts. In Proceedings of the 5th Workshop on Language Analysis for Social Media (LASM’14).

[33]

J. Lafferty, A. McCallum, and F. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the International Conference on Machine Learning (ICML’01). 282--289.

[34]

M. Maamouri, A. Bies, and S. Kulick. 2006. Diacritization: A challenge to Arabic treebank annotation and parsing. Proceedings of the British Computer Society Arabic NLP/MT Conference.

[35]

M. Maamouri, A. Bies, and S. Kulick. 2008. Enhancing the Arabic treebank: A collaborative effort toward new annotation guidelines. In Proceedings of the 6th International Conference on Language Resources and Evaluation (LREC’08).

[36]

A. Masmoudi, F. Bougares, M. Khmekhem, Y. Estève, and L. Belguith. 2018. Automatic speech recognition system for Tunisian dialect. J. Lang. Resour. Eval. 52, 249--267.

[37]

A. Masmoudi, M. Khmekhem, Y. Estève, F. Bougares, and L. Belguith. 2014. Phonetic tool for the Tunisian Arabic. In Proceedings of the 4th International Workshop on Spoken Language Technologies for Under-resourced Languages.

[38]

A. Masmoudi, M. Khmekhem, Y. Estève, F. Bougares, L. Belguith, and N. Habash. 2014. A corpus and a phonetic dictionary for Tunisian Arabic speech recognition. In Proceedings of the 19th Language Resources and Evaluation Conference.

[39]

A. Masmoudi, N. Habash, M. Khmekhem, Y. Estève, and L. Belguith. 2015. Arabic transliteration of romanized Tunisian dialect text: A preliminary investigation. In Proceedings of the 16th International Conference on Computational Linguistics and Intelligent Text Processing (CICLing’15).

[40]

S. Mejri, M. Said, and I. Sfar. 2009. Plurilinguisme et diglossie en Tunisie. Synerg. Tunisie 1, 53--74.

[41]

F. Och and H. Ney. 2003. A systematic comparison of various statistical alignment models. Comput. Linguist. 29, 1, 19--52.

[42]

M. Rashwan, A. Al Sallab, H. Raafat, and A. Rafea. 2015. Deep learning framework with confused sub set resolution architecture for automatic Arabic diacritization. IEEE/ACM Trans. Audio, Speech, Lang. Process. 23, 3 (2015).

[43]

H. Saadane and N. Habash. 2015. A conventional orthography for algerian Arabic. In Proceedings of the 2nd Workshop on Arabic Natural Language Processing.

[44]

I. Sfar. 2005. Morphologie des noms de professions : Incorporation et paraphrase, La terminologie, entre traduction et bilinguisme (2005), 15--16.

[45]

K. Shaalan, M. Abo Bakr, and I. Ziedan. 2009. A hybrid approach for building Arabic diacritizer. In Proceedings of the Workshop on Computational Approaches to Semitic Languages (EACL’09).

[46]

K. Shaalan, H. Abo Bakr, and I. Ziedan. 2008. A statistical method for adding case ending diacritics for Arabic text. In Proceedings of the Language Engineering Conference.

[47]

T. Schlippe. 2008. Statistical methods for automatic diacritization of Arabic texts. Carnegie Mellon University, Pittsburgh, PA.

[48]

A. Stolcke. 2002. SRILM an Extensible Language Modeling Toolkit. In Proceedings of the International Conference on Spoken Language Processing (ICSLP’02).

[49]

F. Talmoudi. 1980. A morphosyntactic study of Romance verbs in the Arabic dialects of Tunis, Sousa, and Sfax. Göteborg, Acta Universitatis Gothoburgensis.

[50]

M. Tilmatine. 1999. Substrat Et Convergences: Le Berbère Et L’arabe Nord-Africain. In Estudios de Dialectologia Norteafricana Y Andalusi, M. Haak, R. Jong, K. De Versteegh (Eds.).

[51]

D. Wang and S. King. 2011. Letter-to-sound pronunciation prediction using conditional random fields. IEEE Signal Process. Lett. 18, 2 (2011), 122--125.

[52]

W. Zaghouani, H. Bouamor, A. Hawwari, M. Diab, O. Obeid, M. Ghoneim, S. Alqahtani, and K. Oflazer. 2016. Guidelines and framework for a large-scale Arabic diacritized corpus. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC’16).

[53]

W. Zaghouani, N. Habash, O. Obeid, B. Mohit, H. Bouamor, and K. Oflazer. 2016. Building an Arabic machine translation post-edited corpus: Guidelines and annotation. In Proceedings of the International Conference on Language Resources and Evaluation (LREC’16).

[54]

W. Zaghouani, N. Habash, H. Bouamor, A. Rozovskaya, B. Mohit, A. Heider, and K. Oflazer. 2015. Correction annotation for non-native Arabic texts: Guidelines and corpus. In Proceedings of the Association for Computational Linguistics Fourth Linguistic Annotation Workshop.

[55]

I. Zitouni, J. Sorensen, and R. Sarikaya. 2006. Maximum entropy-based restoration of Arabic diacritics. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics.

[56]

I. Zitouni and R. Sarikaya. 2009. Arabic diacritic restoration approach based on maximum entropy models. J. Comput. Speech Lang. 23 (2009), 257--276.

[57]

I. Zribi, R. Boujelbane, A. Masmoudi, M. Ellouze, L. Belguith, and N. Habash. 2014. A conventional orthography for Tunisian Arabic. In Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC’14).

[58]

I. Zribi, M. Khmekhem, L. Belguith, and P. Blache. 2017. Morphological disambiguation of Tunisian dialect. J. King Saud Univ. Comput. Info. Sci. 29, 147--155.

[59]

I. Zribi, M. Ellouze, L. H. Belguith, and P. Blache. 2015. Spoken Tunisian Arabic Corpus “STAC”: Transcription and Annotation. Res. Comput. Sci. 90 (2015), 123--135.

[60]

I. Zribi, M. Graja, M. E. Khemakhem, M. Jaoua, and L. Belguith. 2013. Orthographic transcription for spoken Tunisian Arabic. In Proceedings of the International Conference on Computational Linguistics and Intelligent Text Processing (CICLing’13), A. Gelbukh (Ed.).

Index Terms

Automatic Diacritics Restoration for Tunisian Dialect
1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
      1. Language resources
      2. Machine translation

Information & Contributors

Information

Published In

ACM Transactions on Asian and Low-Resource Language Information Processing Volume 18, Issue 3

September 2019

386 pages

ISSN:2375-4699

EISSN:2375-4702

DOI:10.1145/3305347

Editor:
Nianwen Xue
Brandeis University, Waltham, USA

Copyright © 2019 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 12 July 2019

Accepted: 01 December 2018

Revised: 01 October 2018

Received: 01 May 2018

Published in TALLIP Volume 18, Issue 3

Qualifiers

Research-article
Research
Refereed

Bibliometrics & Citations

Bibliometrics

Article Metrics

Total Citations: 20
Total Downloads: 161
Downloads (Last 12 months): 10
Downloads (Last 6 weeks): 1

Reflects downloads up to 04 Oct 2024

Citations

Cited By

K. Khellas and R. Seghir. 2023. Alabib-65: A Realistic Dataset for Algerian Sign Language Recognition. ACM Transactions on Asian and Low-Resource Language Information Processing 22, 6 (2023), 1--23. Online publication date: 10-May-2023.
https://dl.acm.org/doi/10.1145/3596909
H. Khalid, G. Murtaza, and Q. Abbas. 2023. Using Data Augmentation and Bidirectional Encoder Representations from Transformers for Improving Punjabi Named Entity Recognition. ACM Transactions on Asian and Low-Resource Language Information Processing 22, 6 (2023), 1--13. Online publication date: 16-Jun-2023.
https://dl.acm.org/doi/10.1145/3595861
F. Zhang, M. Zhang, S. Liu, Y. Sun, and N. Duan. 2023. Enhancing RDF Verbalization with Descriptive and Relational Knowledge. ACM Transactions on Asian and Low-Resource Language Information Processing 22, 6 (2023), 1--18. Online publication date: 16-Jun-2023.
https://dl.acm.org/doi/10.1145/3595293
R. Khanmohammadi, M. Mirshafiee, Y. Rezaee Jouryabi, and S. Mirroshandel. 2023. Prose2Poem: The Blessing of Transformers in Translating Prose to Persian Poetry. ACM Transactions on Asian and Low-Resource Language Information Processing 22, 6 (2023), 1--18. Online publication date: 14-Apr-2023.
https://dl.acm.org/doi/10.1145/3592791
M. Abbache, A. Abbache, J. Xu, F. Meziane, and X. Wen. 2023. The Impact of Arabic Diacritization on Word Embeddings. ACM Transactions on Asian and Low-Resource Language Information Processing 22, 6 (2023), 1--30. Online publication date: 16-Jun-2023.
https://dl.acm.org/doi/10.1145/3592603
C. Park and J. Kim. 2023. Robust Multi-task Learning-based Korean POS Tagging to Overcome Word Spacing Errors. ACM Transactions on Asian and Low-Resource Language Information Processing 22, 6 (2023), 1--13. Online publication date: 5-Apr-2023.
https://dl.acm.org/doi/10.1145/3591206
S. Parte, A. Ratmele, and R. Dhanare. 2023. An Efficient and Accurate Detection of Fake News Using Capsule Transient Auto Encoder. ACM Transactions on Asian and Low-Resource Language Information Processing 22, 6 (2023), 1--22. Online publication date: 16-Jun-2023.
https://dl.acm.org/doi/10.1145/3589184
F. Zhao, C. Yan, H. Jin, and L. He. 2023. BayesKGR: Bayesian Few-Shot Learning for Knowledge Graph Reasoning. ACM Transactions on Asian and Low-Resource Language Information Processing 22, 6 (2023), 1--21. Online publication date: 27-Mar-2023.
https://dl.acm.org/doi/10.1145/3589183
K. Munir, H. Zhao, and Z. Li. 2023. Semi-Supervised Semantic Role Labeling with Bidirectional Language Models. ACM Transactions on Asian and Low-Resource Language Information Processing 22, 6 (2023), 1--20. Online publication date: 1-Apr-2023.
https://dl.acm.org/doi/10.1145/3587160
R. Das and T. Singh. 2023. Image–Text Multimodal Sentiment Analysis Framework of Assamese News Articles Using Late Fusion. ACM Transactions on Asian and Low-Resource Language Information Processing 22, 6 (2023), 1--30. Online publication date: 17-Feb-2023.
https://dl.acm.org/doi/10.1145/3584861
