Authorship Attribution (AA) of short texts such as SMS, chat messages, and social media posts has become a relevant research topic, adding new dimensions to the field. However, AA of Arabic tweets is not well investigated and lags behind work on longer texts such as ancient books, poems, and news articles, or even comparable short texts like the fatwa (i.e., a legal decree in Islam). This paper presents the advantage of using a bagging ensemble model over a single-learner model to increase the accuracy of AA of Arabic tweets. To this end, we evaluated the performance of a bagging ensemble model using three state-of-the-art classification approaches as base classifiers, namely Naïve Bayes (NB), Support Vector Machines (SVM), and Decision Trees (DT). In the experiments conducted, the proposed bagging classifier with SVM as the base model achieved the highest accuracy (95.03%) of the classifiers evaluated. This accuracy is among the highest published in similar studies.
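The bagging idea in the abstract above (train several copies of a base classifier on bootstrap resamples, then take a majority vote) can be sketched in a few lines. This is a minimal illustration, not the paper's setup: the base learner here is a toy character-bigram profile classifier standing in for NB, SVM, or DT, and the names `CentroidClassifier`, `bagging_fit`, and `bagging_predict` are hypothetical.

```python
import random
from collections import Counter


def char_ngrams(text, n=2):
    """Count the character n-grams of a text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


class CentroidClassifier:
    """Toy base learner (an assumption, not NB/SVM/DT): one n-gram profile per author."""

    def fit(self, texts, labels):
        self.profiles = {}
        for text, label in zip(texts, labels):
            self.profiles.setdefault(label, Counter()).update(char_ngrams(text))
        return self

    def predict(self, text):
        grams = char_ngrams(text)

        def overlap(profile):
            # Normalised overlap between the text's n-grams and an author profile.
            total = sum(profile.values()) or 1
            return sum(count * profile[g] for g, count in grams.items()) / total

        # sorted() makes tie-breaking deterministic.
        return max(sorted(self.profiles), key=lambda label: overlap(self.profiles[label]))


def bagging_fit(texts, labels, n_estimators=11, seed=0):
    """Train one base learner per bootstrap resample (sampling with replacement)."""
    rng = random.Random(seed)
    n = len(texts)
    models = []
    for _ in range(n_estimators):
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(
            CentroidClassifier().fit([texts[i] for i in idx], [labels[i] for i in idx])
        )
    return models


def bagging_predict(models, text):
    """Aggregate the base learners' predictions by majority vote."""
    votes = Counter(model.predict(text) for model in models)
    return votes.most_common(1)[0][0]
```

Swapping `CentroidClassifier` for a stronger base learner leaves `bagging_fit` and `bagging_predict` unchanged, which is the point of the ensemble wrapper.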
Today's technology provides excellent opportunities for students, particularly those with learning disabilities, to engage in digital learning environments. Learning disabilities are neurologically based processing deficits that affect the acquisition of essential reading, spelling, and writing skills. However, few studies have examined the effectiveness of assistive technology for handwriting and spelling among Arabic-speaking children with specific learning disabilities (i.e., dyslexia and dysgraphia). This study investigates the impact of using computers and tablets on performance in text copying and dictation. The study was conducted in a Moroccan public primary school with two experimental groups. Of 60 students, 12 from the third grade and 12 from the second grade were identified as students with specific learning disabilities, primarily dyslexia and dysgraphia. The results showed fewer spelling errors in both copying and dictation tests when computers and tablets were used. Therefore, the authors recommend that primary schools help students with learning disabilities overcome their difficulties by supplementing handwriting tasks with keyboard-based ones, especially in final examinations.
Several probabilistic methods used for part-of-speech (POS) tagging are based on Hidden Markov Models (HMM). These methods have difficulty estimating transition probabilities accurately from limited amounts of training data. Consequently, a new method appeared that avoids the problems HMMs face: the transition probabilities are estimated using a decision tree. Based on this method, a language-independent POS tagger (called TreeTagger) has been implemented. The main purpose of this work is to create the language model needed to adapt TreeTagger for Arabic POS tagging and lemmatization. To that end, several configuration steps were carried out, namely collecting lexical resources as well as annotated training corpora. In addition, we used the proposed universal tagset, which consists of POS categories common to 22 different languages, including Arabic. We highlight the use of this tagger via various experiments on vowelled and unvowelled text from both Modern Standard Arabic and Classical Arabic. The accuracy rates obtained are 99.4%, 92.6%, and 81.9% for the vowelled Quranic corpus "Al-Mus'haf", the unvowelled "Al-Mus'haf1" corpus, and the NEMLAR corpus, respectively.
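The data-sparsity problem mentioned above can be seen in a small sketch of plain maximum-likelihood estimation of tag transition probabilities. This is only an illustration of the HMM-side difficulty under stated assumptions, not TreeTagger's decision-tree estimator; the function name `transition_probs` and the two toy tag sequences are hypothetical.

```python
from collections import Counter


def transition_probs(tag_sequences):
    """Maximum-likelihood estimate of P(current tag | previous tag) from tagged sentences."""
    bigrams = Counter()
    previous_counts = Counter()
    for tags in tag_sequences:
        for prev, cur in zip(tags, tags[1:]):
            bigrams[(prev, cur)] += 1
            previous_counts[prev] += 1
    return {pair: count / previous_counts[pair[0]] for pair, count in bigrams.items()}


# Toy training data (an assumption): two short tag sequences.
probs = transition_probs([
    ["DET", "NOUN", "VERB"],
    ["DET", "NOUN", "NOUN"],
])
```

With so little data, any transition that never occurred in training, such as `("VERB", "DET")`, is simply missing from `probs` and would be treated as impossible, which is the kind of unreliable estimate that motivates a different estimator.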
The focus of data scientists is essentially divided into three areas: collecting data, analyzing data, and inferring information from data. Each of these tasks requires specialized personnel, takes time, and costs money. The next, and more demanding, step is turning data into products. This field therefore attracts the attention of many research groups in academia as well as industry. In recent decades, data-driven approaches have emerged and gained popularity because they require much less human effort. Natural Language Processing (NLP) is among the fields most strongly influenced by data. The growth of data is behind the performance improvement of most NLP applications, such as machine translation and automatic speech recognition. Consequently, many NLP applications are moving from rule-based systems and knowledge-based methods to data-driven approaches. However, data collected according to undefined design criteria or in technically unsuitable forms will be useless. Data will also be neglected if there is not enough of it to perform the required analysis and infer accurate information. The chief purpose of this overview is to shed some light on the vital role of data in various fields and to give a better understanding of data in light of NLP. Specifically, it describes what happens to data during its life cycle: the building, processing, analyzing, and exploring phases.
Studies in computational intelligence, Nov 18, 2017
Arabic is an old Semitic language whose lexicon and grammar were standardized long ago and are deeply rooted in history. Arabic is a morphologically rich language characterized by derivation and inflection. It is an international language with over 500 million native speakers across 29 countries. In the last 15 years, Arabic has achieved the highest growth of the top ten online languages. Consequently, the volume of stored electronic information is increasing rapidly. Despite this proud heritage, lexical richness, and growth in online users, Arabic is a relatively under-resourced language compared to other languages with smaller or similar speaker populations (e.g., French and German). This chapter covers the major progress that has been made in Arabic linguistic resources, primarily corpus compilation, and the challenges that researchers face in developing such resources. It is hoped that this overview of Arabic corpus linguistics will guide current and future research directions.
Stemming is the main step used for handling morphologically rich languages such as Arabic. It is commonly used in several fields, such as Natural Language Processing, Information Retrieval (IR), and Text Mining. The goal of stemming is to reduce inflected or derived word forms, as they are generally written, to their base (root or stem). Considering that Arabic depends mainly on roots and patterns to generate words, a new efficient heavy/light stemmer is developed based on the interaction between roots and patterns, drawing on rich linguistic resources. This stemmer provides three different outputs: an individual root, a stem, and a stem/root combination. In this paper, we highlight the performance of the developed stemmer via various experiments on both Modern Standard Arabic and Classical Arabic. The accuracies achieved are 96.93% and 96.56% for the Quranic corpus "Al-Mus'haf" and the NEMLAR corpus, respectively. In the context of usability testing, the effectiveness of the stemmer for IR and part-of-speech (PoS) tagging is studied. The results indicate an improvement of 10.98% in PoS tagging and of 14.12% in search efficiency.
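The interaction between roots and patterns described above can be illustrated with a minimal sketch: a word is aligned with a pattern of the same length, the conventional slot letters ف, ع, and ل stand for the three root consonants, and every other pattern letter must match literally. This is not the paper's stemmer; the function `extract_root` and the three-entry pattern list are hypothetical, and the sketch ignores affixes, diacritics, and weak roots.

```python
# Slot letters of the conventional template root; any other pattern letter is literal.
SLOTS = "فعل"

# Hypothetical, tiny pattern inventory; a practical stemmer would need far more.
PATTERNS = ["فعل", "فاعل", "مفعول"]


def extract_root(word, patterns=PATTERNS):
    """Return the root letters of `word` under the first matching pattern, else None."""
    for pattern in patterns:
        if len(pattern) != len(word):
            continue
        root = []
        for pattern_char, word_char in zip(pattern, word):
            if pattern_char in SLOTS:
                root.append(word_char)      # slot position: collect a root consonant
            elif pattern_char != word_char:
                break                       # literal letter mismatch: try next pattern
        else:
            return "".join(root)
    return None
```

For example, كاتب aligns with the pattern فاعل and مكتوب with مفعول, and both yield the root كتب. A word that fits none of the listed patterns returns `None`.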
The Arabic language is expanding worldwide. According to UNESCO, Arabic is spoken by more than 422 million native speakers across 29 countries, and 1.6 billion Muslims worldwide use it to perform their daily prayers. The presence of the Arabic language on the internet grew around 6.091% in the last fifteen years (2000–2015), the highest growth of the top ten online languages. As a result, the number of Arabic documents is increasing rapidly, which calls for improved Arabic Information Retrieval (IR) techniques. Many researchers agree on the benefits of both stemming and lemmatization in IR, particularly with highly inflected languages, short documents, and limited storage space. The chief purpose of the current study is to assess the impact of stemming and lemmatization on Arabic IR. In this paper, we illustrate several concepts of Arabic morphology, including stemming and lemmatization algorithms. Then, we highlight the use of these algorithms and their benefits for different Arabic IR systems. Finally, an experiment is conducted to calculate the occurrence of all Quranic surface word, stem, and lemma forms by searching for their matches in both Classical and Modern Standard Arabic resources. To do so, two recent and efficient analyzers, AlKhalil Morpho Sys and MADAMIRA, are used.
International Journal of Speech Technology, Feb 16, 2016
Few annotated Arabic corpora are widely available. This leads us to contribute to the enrichment of Arabic corpus resources. In this regard, we have decided to start working with ...
Papers by Imad Zeroual