research-article

Tokenization of Tunisian Arabic: A Comparison between Three Machine Learning Models

Authors: Asma Mekki, Inès Zribi, Mariem Ellouze, Lamia Hadrich Belguith

ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 22, Issue 7

Article No.: 194, Pages 1 - 19

https://doi.org/10.1145/3599234

Published: 20 July 2023

Abstract

Tokenization is the segmentation of a piece of text into smaller units called tokens. Because Arabic is an agglutinative language, tokenization is a crucial preprocessing step for many Natural Language Processing (NLP) applications, such as morphological analysis, parsing, machine translation, and information extraction. In this article, we investigate the word tokenization task combined with a rewriting process that rewrites the orthography of the stem. We apply this task to Tunisian Arabic (TA) text. To the best of our knowledge, this is the first study of word tokenization for TA. We therefore start by collecting and preparing various TA corpora from different sources. We then compare three character-based tokenizers built on Conditional Random Fields (CRF), Support Vector Machines (SVM), and Deep Neural Networks (DNN). The best model, based on CRF, achieved an F-measure of 88.9%.
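As a rough illustration of what "character-based" tokenization can mean, the sketch below frames segmentation as labeling each character with B (begins a token) or I (inside a token) and scores a prediction with a segment-level F-measure. The label set, the function names, the transliterated toy word, and the exact scoring scheme are all assumptions made for illustration; the abstract does not specify them, and the stem-rewriting step is not covered here.

```python
from typing import List, Set, Tuple


def segments_to_labels(segments: List[str]) -> List[Tuple[str, str]]:
    """Turn a segmented word into (character, label) pairs.

    Assumed scheme: 'B' marks the first character of a token,
    'I' marks every following character of the same token.
    """
    pairs = []
    for seg in segments:
        for i, ch in enumerate(seg):
            pairs.append((ch, "B" if i == 0 else "I"))
    return pairs


def labels_to_segments(chars: str, labels: List[str]) -> List[str]:
    """Rebuild tokens from per-character labels (inverse of the above)."""
    segments: List[str] = []
    for ch, label in zip(chars, labels):
        if label == "B" or not segments:
            segments.append(ch)   # start a new token
        else:
            segments[-1] += ch    # extend the current token
    return segments


def spans(segments: List[str]) -> Set[Tuple[int, int]]:
    """Character offsets (start, end) of each token."""
    out, start = set(), 0
    for seg in segments:
        out.add((start, start + len(seg)))
        start += len(seg)
    return out


def f_measure(gold: List[str], pred: List[str]) -> float:
    """Harmonic mean of precision and recall over exactly matching spans."""
    g, p = spans(gold), spans(pred)
    tp = len(g & p)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(p), tp / len(g)
    return 2 * precision * recall / (precision + recall)


# Hypothetical transliterated example: conjunction + stem + pronoun suffix.
gold = ["w", "ktb", "hA"]
print(segments_to_labels(gold))
print(f_measure(gold, ["w", "ktbhA"]))
```

In this framing, a CRF, an SVM, or a neural network would each be trained to predict the B/I label of every character from its context; only the classifier changes, not the labeling scheme.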


Cited By

Qarah F., Alsanoosy T. (2024). A Comprehensive Analysis of Various Tokenizers for Arabic Large Language Models. Applied Sciences 14:13 (5696). Online publication date: 29-Jun-2024.
https://doi.org/10.3390/app14135696
Mekki A., Zribi I., Ellouze M., Belguith L. (2024). TTK. Computer Speech and Language 86:C. Online publication date: 25-Jun-2024.
https://dl.acm.org/doi/10.1016/j.csl.2023.101617

Index Terms

Tokenization of Tunisian Arabic: A Comparison between Three Machine Learning Models
1. General and reference
  1. Cross-computing tools and techniques
    1. Experimentation
  2. Document types
    1. Computing standards, RFCs and guidelines
2. Human-centered computing
  1. Human computer interaction (HCI)
    1. HCI design and evaluation methods
      1. User models


Published In

ACM Transactions on Asian and Low-Resource Language Information Processing Volume 22, Issue 7

July 2023

422 pages

ISSN:2375-4699

EISSN:2375-4702

DOI:10.1145/3610376

Editor:
Imed Zitouni
Google, USA


Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 20 July 2023

Online AM: 24 May 2023

Accepted: 19 May 2023

Revised: 30 August 2022

Received: 15 May 2021

Published in TALLIP Volume 22, Issue 7



