Automatic Diacritics Restoration for Tunisian Dialect

Published: 12 July 2019

Abstract

Modern Standard Arabic, as well as the Arabic dialects, is usually written without diacritics. The absence of these marks constitutes a real problem for the automatic processing of such data by NLP tools. Indeed, writing Arabic without diacritics introduces several types of ambiguity. First, a word without diacritics can have many possible meanings depending on its diacritization. Second, the undiacritized surface form of an Arabic word might have as many as 200 readings, depending on the complexity of its morphology [12]. In fact, the agglutinative property of Arabic can produce ambiguity that can only be resolved using diacritics. Third, without diacritics a word can have many possible parts of speech (POS) instead of one. This is the case with words that have the same spelling and POS tag but a different lexical sense, or words that have the same spelling but different POS tags and lexical senses [8]. Finally, there is ambiguity at the grammatical level (syntactic ambiguity). In this article, we present the first work to investigate the automatic diacritization of Tunisian Dialect texts. We first describe our annotation guidelines and procedure. Then, we propose two models: a statistical machine translation (SMT) model and a discriminative model that treats diacritization as a sequence classification task based on Conditional Random Fields (CRF). In the second approach, we integrate POS features to influence the generation of diacritics. Diacritics restoration was performed at both the word and the character levels. The CRF system gave the better results, with a Word Error Rate (WER) of 21.44% against 34.6% for SMT.
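As a rough illustration of the character-level setting and the WER metric mentioned in the abstract, the following Python sketch splits a diacritized word into (letter, diacritic label) pairs, which is the kind of input a sequence labeller such as a CRF could be trained on, and computes a word-level error rate. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the set of code points treated as diacritics, and the WER definition (share of words whose diacritized form differs from the reference, with reference and hypothesis aligned word by word) are all assumptions made here for illustration.

```python
# Assumption: the diacritics of interest are the Arabic marks U+064B..U+0652
# (tanween, short vowels, shadda, sukun) plus the superscript alef U+0670.
DIACRITICS = {chr(c) for c in range(0x064B, 0x0653)} | {"\u0670"}


def strip_diacritics(text):
    """Return the undiacritized surface form of a string."""
    return "".join(ch for ch in text if ch not in DIACRITICS)


def to_char_labels(word):
    """Split a diacritized word into (base letter, diacritic label) pairs.

    The label is the (possibly empty) string of marks that follow the letter,
    so restoring diacritics becomes one classification decision per letter.
    """
    pairs = []
    for ch in word:
        if ch in DIACRITICS:
            if pairs:  # attach the mark to the preceding letter
                base, label = pairs[-1]
                pairs[-1] = (base, label + ch)
        else:
            pairs.append((ch, ""))
    return pairs


def word_error_rate(reference, hypothesis):
    """Percentage of words whose diacritized form differs from the reference.

    Assumes both strings contain the same number of whitespace-separated
    words, which holds when only diacritics are added to a fixed text.
    """
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if len(ref_words) != len(hyp_words):
        raise ValueError("reference and hypothesis must have the same length")
    if not ref_words:
        return 0.0
    errors = sum(r != h for r, h in zip(ref_words, hyp_words))
    return 100.0 * errors / len(ref_words)


# Hypothetical example: kaf, ta, ba, each followed by a fatha.
example_word = "\u0643\u064E\u062A\u064E\u0628\u064E"
example_pairs = to_char_labels(example_word)
example_bare = strip_diacritics(example_word)
```

A word-level system would instead predict the whole diacritized word at once; the per-letter labelling shown here only illustrates what "character level" could mean in practice.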

References

[1]
G. Abandah, A. Graves, B. Al-Shagoor, A. Arabiyat, F. Jamour, and M. Al-Taee. 2015. Automatic diacritization of Arabic text using recurrent neural networks. Int. J. Doc. Anal. Recogn. 18, 2 (June 2015), 183--197.
[2]
A. Ahmed and M. Elaraby. 2000. A large-scale computational processor of the Arabic morphology and applications. PhD thesis, Faculty of Engineering, Cairo University Giza, Egypt.
[3]
A. Al-Taani and S. Abu Al-Rub. 2009. A rule-based approach for tagging non-vocalized Arabic words. Int. Arab J. Info. Technol. 6, 3 (2009), 320--328.
[4]
Y. A. Alotaibi, A. H. Meftah, and S. A. Selouani. 2013. Diacritization, automatic segmentation and labeling for Levantine Arabic speech. In Proceedings of the Digital Signal Processing and Signal Processing Education Meeting (DSP/SPE’13).
[5]
M. Al-Badrashiny, A. Hawwari, and M. Diab. 2017. A layered language model-based hybrid approach to automatic full diacritization of Arabic. In Proceedings of the 3rd Arabic Natural Language Processing Workshop (WANLP’17).
[6]
M. Ameur, Y. Moulahoum, and A. Guessoum. 2015. Restoration of Arabic diacritics using a multilevel statistical model. In Proceedings of the International Federation for Information Processing (IFIP’15).
[7]
A. Z. Ayman, M. Elmahdy, H. Husni, and J. Al Jaam. 2016. Automatic diacritics restoration for Arabic text. Int. J. Comput. Info. Sci. 12, 2 (Dec. 2016), 159--165.
[8]
A. Azmi and R. Almajed. 2015. A survey of automatic Arabic diacritization techniques. Nat. Lang. Eng. 21, 477--495.
[9]
T. Baccouche. 2003. L’arabe, d’une koinè dialectale à une langue de culture. Mémoires de la Société de Linguistique de Paris, Tome XI (Les langues de communication), 87--93.
[10]
T. Baccouche. 1994. L’emprunt en arabe moderne, Beit Elhikma et IBLV.
[11]
Y. Belinkov and J. Glass. 2015. Arabic diacritization with recurrent neural networks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.
[12]
H. Bouamor, W. Zaghouani, M. Diab, O. Obeid, O. Kemal, M. Ghoneim, and A. Hawwari. 2015. A pilot study on Arabic multi-genre corpus diacritization annotation. In Proceedings of the 2nd Workshop on Arabic Natural Language Processing.
[13]
R. Boujelbane, M. Mallek, M. Ellouze, and L. Belguith. 2014. Fine-grained (POS) tagging of spoken Tunisian dialect corpora. In Proceedings of the International Conference on Applications of Natural Language to Information Systems, (NLDB’14).
[14]
P. Brown, S. Pietra, V. Pietra, and R. Mercer. 1993. The mathematics of statistical machine translation: Parameter estimation. Comput. Linguist. 19, 2 (1993), 263--311.
[15]
M. Diab, M. Ghoneim, and N. Habash. 2007. Arabic diacritization in the context of statistical machine translation. In Proceedings of MTSummit.
[16]
M. Elshafei, H. Al-muhtaseb, and M. Alghamdi. 2006. Statistical methods for automatic diacritization of Arabic text. In Proceedings of Saudi 18th National Computer Conference (NCC’06).
[17]
A. Fashwan and S. Alansary. 2017. SHAKKIL: An automatic diacritization system for modern standard Arabic texts. In Proceedings of the 3rd Arabic Natural Language Processing Workshop (WANLP’17).
[18]
Y. Gal. 2002. An HMM approach to vowel restoration in Arabic and Hebrew. In Proceedings of the ACL Workshop on Computational Approaches to Semitic Languages (SEMITIC’02).
[19]
M. L. Gibson. 1998. Dialect Contact in Tunisian Arabic: Sociolinguistic and Structural Aspects. PhD thesis. Department of linguistic science, University of Reading.
[20]
M. Graja, M. Jaoua, and L. Belguith. 2015. Statistical framework with knowledge base integration for robust speech understanding of the Tunisian dialect. IEEE/ACM Trans. Audio, Speech, Lang. Process. 23, 12 (2015), 2311--2321.
[21]
N. Habash, A. Shahrour, and M. Al-Khalil. 2016. Exploiting Arabic diacritization for high-quality automatic annotation. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC’16).
[22]
N. Habash, M. Diab, and O. Rambow. 2012. Conventional orthography for dialectal Arabic. In Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC’12).
[23]
N. Habash and O. Rambow. 2007. Arabic diacritization through full morphological tagging. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics.
[24]
O. Hamed and T. Zesch. 2017. A survey and comparative study of Arabic diacritization tools. J. Lang. Technol. Computat. Linguist. 32, 1.
[25]
S. Harrat, M. Abbas, K. Meftouh, K. Smaili, E. N. S. Bouzareah, and C. Loria. 2013. Diacritics restoration for Arabic dialect texts. In Proceedings of the 14th Annual Conference of the International Speech Communication.
[26]
E. Hermena, D. Drieghe, S. Hellmuth, and P. Simon. 2015. Processing of Arabic diacritical marks: Phonological syntactic disambiguation of homographic verbs and visual crowding effects. J. Exper. Psychol. Hum. Percept. Perform. 41, 494--507.
[27]
Y. Hifny. 2012. Higher order n-gram language models for Arabic diacritics restoration. In Proceedings of the 12th Conference on Language Engineering (ESOLEC’12).
[28]
C. Holes. 2004. Modern Arabic: Structures, Functions, and Varieties. Georgetown University Press, Washington.
[29]
M. Jarrar, N. Habash, D. Akra, N. Zalmout, and W. Bank. 2014. Building a corpus for Palestinian Arabic: A preliminary study. In Proceedings of the EMNLP 2014 Workshop on Arabic Natural Language Processing (ANLP’14). 18--27.
[30]
K. Kirchhoff and D. Vergyri. 2005. Cross-dialectal data sharing for acoustic modeling in Arabic speech recognition. Speech Commun. 46, 37--51.
[31]
P. Koehn, H. Hoang, A. Birch, C. Callison-Burch, M. Federico, and N. Bertoldi. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the Association for Computational Linguistics (ACL’07).
[32]
A. Kubra and G. Eryigit. 2014. Vowel and diacritic restoration for social media texts. In Proceedings of the 5th Workshop on Language Analysis for Social Media (LASM’14).
[33]
J. Lafferty, A. McCallum, and F. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the International Conference on Machine Learning (ICML’01). 282--289.
[34]
M. Maamouri, A. Bies, and S. Kulick. 2006. Diacritization: A challenge to Arabic treebank annotation and parsing. In Proceedings of the British Computer Society Arabic NLP/MT Conference.
[35]
M. Maamouri, A. Bies, and S. Kulick. 2008. Enhancing the Arabic treebank: A collaborative effort toward new annotation guidelines. In Proceedings of the 6th International Conference on Language Resources and Evaluation (LREC’08).
[36]
A. Masmoudi, F. Bougares, M. Khmekhem, Y. Estève, and L. Belguith. 2018. Automatic speech recognition system for Tunisian dialect. J. Lang. Resour. Eval. 52, 249--267.
[37]
A. Masmoudi, M. Khmekhem, Y. Estève, F. Bougares, and L. Belguith. 2014. Phonetic tool for the Tunisian Arabic. In Proceedings of the 4th International Workshop on Spoken Language Technologies for Under-resourced Languages.
[38]
A. Masmoudi, M. Khmekhem, Y. Estève, F. Bougares, L. Belguith, and N. Habash. 2014. A corpus and a phonetic dictionary for Tunisian Arabic speech recognition. In Proceedings of the 19th Language Resources and Evaluation Conference.
[39]
A. Masmoudi, N. Habash, M. Khmekhem, Y. Estève, and L. Belguith. 2015. Arabic transliteration of romanized Tunisian dialect text: A preliminary investigation. In Proceedings of the 16th International Conference on Computational Linguistics and Intelligent Text Processing (CICLing’15).
[40]
S. Mejri, M. Said, and I. Sfar. 2009. Plurilinguisme et diglossie en Tunisie. Synerg. Tunisie 1, 53--74.
[41]
F. Och and H. Ney. 2003. A systematic comparison of various statistical alignment models. Comput. Linguist. 29, 1, 19--52.
[42]
M. Rashwan, A. Al Sallab, H. Raafat, and A. Rafea. 2015. Deep learning framework with confused sub set resolution architecture for automatic Arabic diacritization. IEEE/ACM Trans. Audio, Speech, Lang. Process. 23, 3 (2015).
[43]
H. Saadane and N. Habash. 2015. A conventional orthography for Algerian Arabic. In Proceedings of the 2nd Workshop on Arabic Natural Language Processing.
[44]
I. Sfar. 2005. Morphologie des noms de professions : Incorporation et paraphrase, La terminologie, entre traduction et bilinguisme (2005), 15--16.
[45]
K. Shaalan, M. Abo Bakr, and I. Ziedan. 2009. A hybrid approach for building Arabic diacritizer. In Proceedings of the Workshop on Computational Approaches to Semitic Languages (EACL’09).
[46]
K. Shaalan, H. Abo Bakr, and I. Ziedan. 2008. A statistical method for adding case ending diacritics for Arabic text. In Proceedings of the Language Engineering Conference.
[47]
T. Schlippe. 2008. Statistical methods for automatic diacritization of Arabic texts. Carnegie Mellon University, Pittsburgh, PA.
[48]
A. Stolcke. 2002. SRILM an Extensible Language Modeling Toolkit. In Proceedings of the International Conference on Spoken Language Processing (ICSLP’02).
[49]
F. Talmoudi. 1980. A morphosyntactic study of Romance verbs in the Arabic dialects of Tunis, Sousa, and Sfax. Göteborg, Acta Universitatis Gothoburgensis.
[50]
M. Tilmatine. 1999. Substrat et convergences: Le berbère et l’arabe nord-africain. In Estudios de Dialectologia Norteafricana Y Andalusi, M. Haak, R. Jong, K. De Versteegh (Eds.).
[51]
D. Wang and S. King. 2011. Letter-to-sound pronunciation prediction using conditional random fields. IEEE Signal Process. Lett. 18, 2 (2011), 122--125.
[52]
W. Zaghouani, H. Bouamor, A. Hawwari, M. Diab, O. Obeid, M. Ghoneim, S. Alqahtani, and K. Oflazer. 2016. Guidelines and framework for a large-scale Arabic diacritized corpus. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC’16).
[53]
W. Zaghouani, N. Habash, O. Obeid, B. Mohit, H. Bouamor, and K. Oflazer. 2016. Building an Arabic machine translation post-edited corpus: Guidelines and annotation. In Proceedings of the International Conference on Language Resources and Evaluation (LREC’16).
[54]
W. Zaghouani, N. Habash, H. Bouamor, A. Rozovskaya, B. Mohit, A. Heider, and K. Oflazer. 2015. Correction annotation for non-native Arabic texts: Guidelines and corpus. In Proceedings of the Association for Computational Linguistics Fourth Linguistic Annotation Workshop.
[55]
I. Zitouni, J. Sorensen, and R. Sarikaya. 2006. Maximum entropy-based restoration of Arabic diacritics. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics.
[56]
I. Zitouni and R. Sarikaya. 2009. Arabic diacritic restoration approach based on maximum entropy models. J. Comput. Speech Lang. 23 (2009), 257--276.
[57]
I. Zribi, R. Boujelbane, A. Masmoudi, M. Ellouze, L. Belguith, and N. Habash. 2014. A conventional orthography for Tunisian Arabic. In Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC’14).
[58]
I. Zribi, M. Khmekhem, L. Belguith, and P. Blache. 2017. Morphological disambiguation of Tunisian dialect. J. King Saud Univ. Comput. Info. Sci. 29, 147--155.
[59]
I. Zribi, M. Ellouze, L. H. Belguith, and P. Blache. 2015. Spoken Tunisian Arabic Corpus “STAC”: Transcription and Annotation. Res. Comput. Sci. 90 (2015), 123--135.
[60]
I. Zribi, M. Graja, M. E. Khemakhem, M. Jaoua, and L. Belguith. 2013. Orthographic transcription for spoken Tunisian Arabic. In Proceedings of the International Conference on Computational Linguistics and Intelligent Text Processing (CICLing’13), A. Gelbukh (Ed.).

Published In

ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 18, Issue 3
September 2019
386 pages
ISSN:2375-4699
EISSN:2375-4702
DOI:10.1145/3305347
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 12 July 2019
Accepted: 01 December 2018
Revised: 01 October 2018
Received: 01 May 2018
Published in TALLIP Volume 18, Issue 3

Author Tags

  1. CRF model
  2. Natural language processing
  3. POS tagging
  4. SMT model
  5. Tunisian dialect
  6. diacritization

Qualifiers

  • Research-article
  • Research
  • Refereed
