Tokenization of Tunisian Arabic: A Comparison between Three Machine Learning Models

Published: 20 July 2023

Abstract

Tokenization is the process of segmenting a piece of text into smaller units called tokens. Because Arabic is an agglutinative language, tokenization is a crucial preprocessing step for many Natural Language Processing (NLP) applications, such as morphological analysis, parsing, machine translation, and information extraction. In this article, we investigate the word tokenization task combined with a rewriting process that rewrites the orthography of the stem. We apply this task to Tunisian Arabic (TA) text. To the best of our knowledge, this is the first study to address word tokenization for TA. We therefore start by collecting and preparing various TA corpora from different sources. We then compare three character-based tokenizers based on Conditional Random Fields (CRF), Support Vector Machines (SVM), and Deep Neural Networks (DNN). The best model, based on CRF, achieves an F-measure of 88.9%.
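To make the idea of a character-based tokenizer concrete, the following minimal Python sketch treats word tokenization as labeling each character and then scores a predicted segmentation with an F-measure over token spans. The B/I label scheme, the function names, and the romanized toy word are illustrative assumptions, not details taken from the article; the stem-rewriting step and the CRF, SVM, and DNN models themselves are not modeled here.

```python
from typing import List, Set, Tuple


def tokens_to_labels(tokens: List[str]) -> List[str]:
    """Encode a segmentation as one label per character.

    Assumed scheme (not from the article): 'B' marks the first
    character of a token, 'I' marks every other character.
    """
    labels: List[str] = []
    for tok in tokens:
        if tok:  # skip empty tokens
            labels.extend(["B"] + ["I"] * (len(tok) - 1))
    return labels


def labels_to_tokens(word: str, labels: List[str]) -> List[str]:
    """Decode per-character labels back into a list of tokens."""
    if len(word) != len(labels):
        raise ValueError("need exactly one label per character")
    tokens: List[str] = []
    for ch, lab in zip(word, labels):
        if lab == "B" or not tokens:
            tokens.append(ch)   # start a new token
        else:
            tokens[-1] += ch    # extend the current token
    return tokens


def token_spans(tokens: List[str]) -> Set[Tuple[int, int]]:
    """Return the (start, end) character offsets of each token."""
    spans: Set[Tuple[int, int]] = set()
    pos = 0
    for tok in tokens:
        spans.add((pos, pos + len(tok)))
        pos += len(tok)
    return spans


def f_measure(gold: List[str], pred: List[str]) -> float:
    """Harmonic mean of precision and recall over exact token spans."""
    g, p = token_spans(gold), token_spans(pred)
    correct = len(g & p)
    if correct == 0:
        return 0.0
    precision = correct / len(p)
    recall = correct / len(g)
    return 2 * precision * recall / (precision + recall)


# Hypothetical romanized toy word; not drawn from the article's corpora.
word = "wbalktab"
gold = ["w", "b", "al", "ktab"]
pred = labels_to_tokens(word, list("BBIIBIII"))  # a model's imagined output

print(tokens_to_labels(gold))
print(pred)
print(round(f_measure(gold, pred), 3))
```

In a trained system, a sequence model such as a CRF would predict the label sequence from character-level features; here the predicted labels are hard-coded only to show the decoding and scoring steps.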

Information

Published In

ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 22, Issue 7
July 2023
422 pages
ISSN: 2375-4699
EISSN: 2375-4702
DOI: 10.1145/3610376

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 20 July 2023
Online AM: 24 May 2023
Accepted: 19 May 2023
Revised: 30 August 2022
Received: 15 May 2021
Published in TALLIP Volume 22, Issue 7

Author Tags

  1. Word tokenization
  2. Tunisian Arabic
  3. Arabic dialect
  4. deep learning
  5. SVM
  6. CRF

Qualifiers

  • Research-article
