Abstract: Automatically recognizing spontaneous human speech and transcribing it into text is becoming an important task. However, freely available models are rare, especially for under-resourced languages and dialects, since they require large amounts of data to achieve high performance. This paper describes an approach to building an end-to-end Tunisian dialect speech recognition system based on deep learning. For this purpose, a Tunisian dialect paired text-speech dataset called "TunSpeech" was created. Combining existing Modern Standard Arabic (MSA) speech data with dialectal Tunisian data decreased the Out-Of-Vocabulary rate and improved perplexity. On the other hand, synthetic dialectal data generated by a text-to-speech system increased the Word Error Rate.
Advances in speech and language technologies enable tools such as voice search, text-to-speech, speech recognition and machine translation. These are, however, only available for high-resource languages like English, French or Chinese. Without foundational digital resources for African languages, which are considered low-resource in the digital context, these advanced tools remain out of reach. This work details the AI4D African Language Program, a three-part project that 1) incentivised the crowd-sourcing, collection and curation of language datasets through an online quantitative and qualitative challenge, 2) supported research fellows for a period of 3-4 months to create datasets annotated for NLP tasks, and 3) hosted competitive Machine Learning challenges on the basis of these datasets. Key outcomes of the work so far include 1) the creation of 9+ open-source African language datasets annotated for a variety of ML tasks, and 2) the creation of baseline models for these datasets throu...
On various social media platforms, people tend to communicate and to write posts and comments informally, in their local dialects. In Africa, more than 1500 dialects and languages exist. In particular, Tunisians talk and write informally using Latin letters and numbers rather than Arabic ones. In this paper, we introduce a large common-crawl-based Tunisian Arabizi dialectal dataset dedicated to sentiment analysis. The dataset consists of a total of 100k comments (about movies, politics, sports, etc.) annotated manually by Tunisian native speakers as positive, negative or neutral. We evaluate our dataset on the sentiment analysis task using the multilingual version (mBERT) of Bidirectional Encoder Representations from Transformers (BERT) as a contextual language model and embedding technique, then combining mBERT with a Convolutional Neural Network (CNN) as classifier. The dataset is publicly available.
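The mBERT-embeddings-then-CNN-classifier pipeline described above can be illustrated with a toy forward pass. This is a minimal NumPy sketch, not the paper's implementation: the shapes, weights and three-class label mapping are illustrative assumptions, and the contextual embeddings would in practice come from mBERT rather than be passed in directly.

```python
import numpy as np

def conv1d_classifier(embeddings, filters, w_out, b_out):
    """Toy forward pass of a CNN sentiment classifier over contextual
    token embeddings (e.g. produced by mBERT). All weights here are
    illustrative; embeddings: (seq_len, dim), filters: (n_filters, width, dim)."""
    seq_len, dim = embeddings.shape
    n_filters, width, _ = filters.shape
    # 1D convolution over the token axis, followed by ReLU
    conv = np.array([
        [np.maximum(0.0, np.sum(filters[f] * embeddings[t:t + width]))
         for t in range(seq_len - width + 1)]
        for f in range(n_filters)
    ])
    pooled = conv.max(axis=1)        # global max-pooling per filter
    logits = w_out @ pooled + b_out  # linear layer over 3 classes
    return int(logits.argmax())      # hypothetical mapping: 0=pos, 1=neg, 2=neutral
```

Max-pooling over the token axis is what lets a fixed-size classifier handle variable-length comments.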
Purpose: A number of approaches and algorithms have been proposed over the years as a basis for automatic indexing. Many of these approaches suffer from poor precision at low recall. The choice of indexing units has a great impact on search system effectiveness. The authors go beyond simple term indexing to propose a framework for multi-word term (MWT) filtering and indexing. Design/methodology/approach: In this paper, the authors rank MWT in order to filter them, keeping the most effective ones for the indexing process. The proposed model filters MWT according to their ability to capture the document topic and to distinguish between different documents from the same collection. The authors rely on the hypothesis that the best MWT are those that achieve the greatest association degree. The experiments are carried out with English and French data sets. Findings: The results indicate that this approach achieved precision enhancements at low recall, and ...
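Ranking candidate multi-word terms by association degree, as the hypothesis above describes, can be sketched with a standard association measure. Pointwise mutual information is used here only as one plausible instance; the paper's exact measure and counting scheme are not specified in this abstract.

```python
import math
from collections import Counter

def pmi_score(bigram, unigram_counts, bigram_counts, total):
    """Illustrative association score (pointwise mutual information) for
    ranking a candidate two-word MWT. Higher values indicate that the two
    words co-occur more often than chance, i.e. a stronger association degree."""
    w1, w2 = bigram
    p_xy = bigram_counts[bigram] / total   # joint probability of the pair
    p_x = unigram_counts[w1] / total       # marginal probability of word 1
    p_y = unigram_counts[w2] / total       # marginal probability of word 2
    return math.log(p_xy / (p_x * p_y))
```

Candidates would then be sorted by this score and only the top-ranked MWT kept for indexing.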
Understanding the causes of spikes in the emotion flow of influential social media users is a key component when analyzing the diffusion and adoption of opinions and trends. Hence, in this work we focus on detecting the likely reasons or causes of spikes within influential Twitter users’ emotion flow. To achieve this, once an emotion spike is identified, we use linguistic and statistical analyses on the tweets surrounding the spike in order to reveal the spike’s likely explanations or causes in the form of keyphrases. Experimental evaluation on emotion flow visualization, emotion spike identification and likely-cause extraction for several influential Twitter users shows that our method is effective for pinpointing interesting insights behind the causes of emotion fluctuation. Implications of our work are highlighted by relating emotion flow spikes to real-world events and by the transversal application of our technique to other types of timestamped text.
Communications in Computer and Information Science, 2018
Sentiment analysis is the Natural Language Processing (NLP) task that aims to classify text into different classes such as positive, negative or neutral. In this paper, we focus on sentiment analysis for the Arabic language. Most previous works use machine learning techniques combined with hand-engineered features for Arabic sentiment analysis (ASA). More recently, Deep Neural Networks (DNNs) have been widely used for this task, especially for English. In this work, we developed a system called CNN-ASAWR in which we investigate the use of Convolutional Neural Networks (CNNs) for ASA on two datasets: ASTD and SemEval 2017. We explore the importance of various unsupervised word representations learned from unannotated corpora. Experimental results showed that we were able to outperform the previous state-of-the-art systems on these datasets without using any kind of hand-engineered features.
Tunizi is the first 100% Tunisian Arabizi sentiment analysis dataset. Tunisian Arabizi is the representation of the Tunisian dialect written in Latin characters and numbers rather than Arabic letters. We gathered comments from social media platforms that express sentiment about popular topics. For this purpose, we extracted 100k comments using public streaming APIs. Tunizi was preprocessed by removing links, emoji symbols, and punctuation. The collected comments were manually annotated with an overall polarity: positive (1), negative (-1) or neutral (0). We divided the dataset into separate training, validation and test sets with a ratio of 7:1:2, with a balanced split where the numbers of comments from the positive and negative classes are almost the same.
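The preprocessing and 7:1:2 splitting steps above can be sketched as follows. This is a minimal sketch under stated assumptions: the regular expressions, the lowercasing, and the shuffling seed are illustrative choices, not Tunizi's actual pipeline.

```python
import random
import re

def preprocess(comment):
    """Illustrative cleaning pass: strip links, then remove punctuation and
    emoji symbols (anything that is neither a word character nor whitespace)."""
    comment = re.sub(r"https?://\S+", "", comment)   # remove links
    comment = re.sub(r"[^\w\s]", "", comment)        # remove punctuation/emoji
    return comment.strip().lower()

def split_7_1_2(items, seed=42):
    """Shuffle and split a list into train/validation/test with a 7:1:2 ratio."""
    rng = random.Random(seed)
    items = items[:]
    rng.shuffle(items)
    n = len(items)
    train = items[: int(0.7 * n)]
    valid = items[int(0.7 * n): int(0.8 * n)]
    test = items[int(0.8 * n):]
    return train, valid, test
```

Balancing the positive/negative classes, as described above, would be applied per split on top of this.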
Abstract: This work presents a study of the impact of linguistic preprocessing (stop-word removal, stemming, and detection of emojis, negation and named entities) on sentiment classification in the Tunisian dialect. We evaluate this impact on three corpora of different sizes and contents. Two classification techniques are used: Naïve Bayes and Support Vector Machines. We compare our results to the reference results obtained on these same corpora. Our results highlight the positive impact of the preprocessing phase on classification performance.
For easier communication, posting, or commenting on each other's posts, people use their dialects. In Africa, various languages and dialects exist. However, they are still underrepresented and not fully exploited for analytical studies and research purposes. In order to apply approaches like Machine Learning and Deep Learning, datasets are required. One of these African languages is Bambara, spoken by citizens of several countries. However, no previous dataset work on this language had been carried out for sentiment analysis. In this paper, we present the first common-crawl-based Bambara dialectal dataset dedicated to sentiment analysis, available freely for Natural Language Processing research purposes.
The process of automatically translating sentences by examining a number of human-produced translation samples is called Statistical Machine Translation (SMT). To help with word alignment of statistically translated sentences between languages having significantly different word orders, word reordering is required. In this paper, we outline the characteristics of the Turkish language and its challenges for SMT. We propose to reorder translated Turkish sentences based on linguistic knowledge, using morphological analysis and syntactic rules.
In recent digital library systems and the World Wide Web environment, parallel corpora are used by many applications (Natural Language Processing, machine translation, terminology extraction, etc.). This paper presents a new cross-language information retrieval model based on language modeling. The model avoids query and/or document translation and the use of external resources. It proposes a structured indexing schema for multilingual documents by combining a keywords model and a keyphrases model. Applied to parallel collections, a query in one language can retrieve documents in the same language as well as documents in other languages. Promising results are reported on the MuchMore parallel collection (German and English).
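Language-modeling retrieval, the family of models this approach builds on, can be sketched with a smoothed query-likelihood score. This is a generic sketch, not the paper's model: the Jelinek-Mercer smoothing and the lambda value are illustrative assumptions, and the paper additionally combines keyword and keyphrase models.

```python
import math
from collections import Counter

def query_likelihood(query_terms, doc_terms, collection_terms, lam=0.7):
    """Smoothed query-likelihood score: log P(query | document), mixing the
    document language model with the collection model (Jelinek-Mercer,
    lambda chosen arbitrarily here) so unseen terms do not zero out the score."""
    doc = Counter(doc_terms)
    coll = Counter(collection_terms)
    score = 0.0
    for t in query_terms:
        p_doc = doc[t] / max(len(doc_terms), 1)
        p_coll = coll[t] / max(len(collection_terms), 1)
        p = lam * p_doc + (1 - lam) * p_coll
        score += math.log(p) if p > 0 else float("-inf")
    return score
```

Documents are then ranked by this score; in the cross-language setting, indexing parallel documents under one schema is what lets a query in one language score documents in another.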
Papers by Hatem Haddad