Hatem Haddad


    Abstract Automatically recognizing spontaneous human speech and transcribing it into text is becoming an important task. However, freely available models are rare, especially for under-resourced languages and dialects, since they require large amounts of data in order to achieve high performance. This paper describes an approach to build an end-to-end Tunisian dialect speech recognition system based on deep learning. For this purpose, a Tunisian dialect paired text-speech dataset called "TunSpeech" was created. Combining existing Modern Standard Arabic (MSA) speech data with dialectal Tunisian data decreased the Out-Of-Vocabulary rate and improved perplexity. On the other hand, synthetic dialectal data generated by a text-to-speech system increased the Word Error Rate.
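    As a rough illustration of the two language-model metrics the abstract mentions, here is a minimal sketch, assuming toy tokenized corpora (all sample sentences below are invented), of how an Out-Of-Vocabulary rate and an add-alpha unigram perplexity could be computed before and after pooling MSA text with dialect text:

```python
import math
from collections import Counter

def oov_rate(train_tokens, test_tokens):
    """Fraction of test tokens missing from the training vocabulary."""
    vocab = set(train_tokens)
    return sum(t not in vocab for t in test_tokens) / len(test_tokens)

def unigram_perplexity(train_tokens, test_tokens, alpha=1.0):
    """Perplexity of an add-alpha smoothed unigram LM on the test tokens."""
    counts = Counter(train_tokens)
    total, v = sum(counts.values()), len(counts) + 1   # +1: unknown-word bucket
    logp = sum(math.log((counts[t] + alpha) / (total + alpha * v))
               for t in test_tokens)
    return math.exp(-logp / len(test_tokens))

# Toy illustration (invented data): adding MSA text shrinks the OOV rate.
dialect = "el match mte3 lbera7 kan behi barcha".split()
msa = "كانت المباراة جيدة جدا".split()
test = "el match kan behi w المباراة".split()
print(oov_rate(dialect, test), oov_rate(dialect + msa, test))
```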
    Advances in speech and language technologies enable tools such as voice search, text-to-speech, speech recognition and machine translation. These are, however, only available for high-resource languages like English, French or Chinese. Without foundational digital resources for African languages, which are considered low-resource in the digital context, these advanced tools remain out of reach. This work details the AI4D African Language Program, a three-part project that 1) incentivised the crowd-sourcing, collection and curation of language datasets through an online quantitative and qualitative challenge, 2) supported research fellows for a period of 3-4 months to create datasets annotated for NLP tasks, and 3) hosted competitive Machine Learning challenges on the basis of these datasets. Key outcomes of the work so far include 1) the creation of 9+ open-source African language datasets annotated for a variety of ML tasks, and 2) the creation of baseline models for these datasets through the hosted Machine Learning challenges.
    On various social media platforms, people tend to use an informal register to communicate and to write posts and comments: their local dialects. In Africa, more than 1,500 dialects and languages exist. In particular, Tunisians talk and write informally using Latin letters and numbers rather than Arabic ones. In this paper, we introduce a large common-crawl-based Tunisian Arabizi dialectal dataset dedicated to Sentiment Analysis. The dataset consists of a total of 100k comments (about movies, politics, sport, etc.) annotated manually by Tunisian native speakers as positive, negative or neutral. We evaluate our dataset on the sentiment analysis task using Bidirectional Encoder Representations from Transformers (BERT) in its multilingual version (mBERT) as a contextual language model and embedding technique, combined with a Convolutional Neural Network (CNN) as classifier. The dataset is publicly available.
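    A minimal sketch of the described mBERT-plus-CNN setup, assuming PyTorch and Hugging Face transformers; the filter widths, filter count, and the choice to freeze mBERT are illustrative assumptions, not details from the paper:

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class MBertCNN(nn.Module):
    """Frozen mBERT token embeddings feeding a 1-D CNN with three filter
    widths, then a 3-way output (positive / negative / neutral)."""
    def __init__(self, n_classes=3, n_filters=100, widths=(2, 3, 4)):
        super().__init__()
        self.bert = AutoModel.from_pretrained("bert-base-multilingual-cased")
        hidden = self.bert.config.hidden_size          # 768 for mBERT
        self.convs = nn.ModuleList(
            nn.Conv1d(hidden, n_filters, w) for w in widths)
        self.fc = nn.Linear(n_filters * len(widths), n_classes)

    def forward(self, input_ids, attention_mask):
        with torch.no_grad():                          # mBERT as fixed embedder
            h = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state
        h = h.transpose(1, 2)                          # (batch, hidden, seq_len)
        pooled = [torch.relu(c(h)).max(dim=2).values for c in self.convs]
        return self.fc(torch.cat(pooled, dim=1))

tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
batch = tok(["3jebni barcha el film", "ma7abitech"], padding=True, return_tensors="pt")
logits = MBertCNN()(batch["input_ids"], batch["attention_mask"])
```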
    Purpose A number of approaches and algorithms have been proposed over the years as a basis for automatic indexing. Many of these approaches suffer from poor precision at low recall. The choice of indexing units has a great impact on search system effectiveness. The authors go beyond simple term indexing to propose a framework for multi-word term (MWT) filtering and indexing. Design/methodology/approach In this paper, the authors rely on ranking MWTs to filter them, keeping the most effective ones for the indexing process. The proposed model filters MWTs according to their ability to capture the document topic and to distinguish between different documents from the same collection. The authors rely on the hypothesis that the best MWTs are those that achieve the greatest association degree. The experiments are carried out on English and French language datasets. Findings The results indicate that this approach achieved precision enhancements at low recall, and ...
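    The paper's exact association measure is not given here, so the sketch below uses pointwise mutual information as a stand-in to show how candidate multi-word terms could be ranked and filtered (pure Python, invented sample documents):

```python
import math
from collections import Counter

def rank_mwt_by_pmi(docs, min_count=2):
    """Rank candidate two-word terms by pointwise mutual information,
    a simple stand-in for the paper's 'association degree'."""
    words, bigrams = Counter(), Counter()
    for doc in docs:
        toks = doc.lower().split()
        words.update(toks)
        bigrams.update(zip(toks, toks[1:]))
    n = sum(words.values())
    scored = {}
    for (a, b), c in bigrams.items():
        if c >= min_count:                 # keep only recurring candidates
            pmi = math.log((c / n) / ((words[a] / n) * (words[b] / n)))
            scored[(a, b)] = pmi
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)

docs = ["information retrieval systems index documents",
        "effective information retrieval needs good indexing",
        "the documents index is built offline"]
print(rank_mwt_by_pmi(docs)[:5])
```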
    Understanding the causes of spikes in the emotion flow of influential social media users is a key component when analyzing the diffusion and adoption of opinions and trends. Hence, in this work we focus on detecting the likely reasons or causes of spikes within influential Twitter users’ emotion flow. To achieve this, once an emotion spike is identified we apply linguistic and statistical analyses to the tweets surrounding the spike in order to reveal the spike’s likely explanations or causes in the form of keyphrases. Experimental evaluation on emotion flow visualization, emotion spike identification and likely-cause extraction for several influential Twitter users shows that our method is effective at pinpointing interesting insights behind emotion fluctuations. The implications of our work are highlighted by relating emotion flow spikes to real-world events and by the transversal application of our technique to other types of timestamped text.
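    One plausible reading of spike identification is thresholding the deviation of an emotion-intensity series; the sketch below (invented data, z-score threshold as an assumption) flags the windows whose surrounding tweets would then be mined for keyphrases:

```python
import statistics

def emotion_spikes(scores, z_thresh=2.0):
    """Flag time steps whose emotion score deviates strongly from the mean.
    `scores` is a per-window average emotion intensity series."""
    mu = statistics.mean(scores)
    sigma = statistics.stdev(scores) or 1.0   # guard against a flat series
    return [i for i, s in enumerate(scores)
            if abs(s - mu) / sigma >= z_thresh]

# Hypothetical flow: a calm baseline with one burst of strong emotion.
flow = [0.1, 0.12, 0.09, 0.11, 0.95, 0.13, 0.1]
print(emotion_spikes(flow))   # -> [4]; tweets around index 4 are then mined
```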
    Sentiment analysis is the Natural Language Processing (NLP) task that aims to classify text into different classes such as positive, negative or neutral. In this paper, we focus on sentiment analysis for the Arabic language. Most previous work uses machine learning techniques combined with hand-engineered features for Arabic sentiment analysis (ASA). More recently, Deep Neural Networks (DNNs) have been widely used for this task, especially for the English language. In this work, we developed a system called CNN-ASAWR in which we investigate the use of Convolutional Neural Networks (CNNs) for ASA on two datasets: ASTD and SemEval 2017. We explore the importance of various unsupervised word representations learned from unannotated corpora. Experimental results showed that we were able to outperform the previous state-of-the-art systems on these datasets without using any kind of hand-engineered features.
    International audience. No abstract available.
    Tunizi is the first 100% Tunisian Arabizi sentiment analysis dataset. Tunisian Arabizi is the representation of the Tunisian dialect written in Latin characters and numbers rather than Arabic letters. We gathered comments from social media platforms that express sentiment about popular topics. For this purpose, we extracted 100k comments using public streaming APIs. Tunizi was preprocessed by removing links, emoji symbols and punctuation. The collected comments were manually annotated with an overall polarity: positive (1), negative (-1) or neutral (0). We divided the dataset into separate training, validation and test sets with a ratio of 7:1:2, using a balanced split in which the numbers of comments from the positive and negative classes are almost the same.
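    The 7:1:2 balanced split can be reproduced with a per-class shuffle-and-cut; a minimal sketch, assuming `comments` is a list of (text, label) pairs:

```python
import random

def balanced_split(comments, ratios=(0.7, 0.1, 0.2), seed=13):
    """Shuffle each polarity class separately, then cut it 7:1:2 so that
    train/valid/test all keep roughly the same class balance."""
    by_label = {}
    for text, label in comments:
        by_label.setdefault(label, []).append(text)
    rng = random.Random(seed)
    train, valid, test = [], [], []
    for label, texts in by_label.items():
        rng.shuffle(texts)
        n = len(texts)
        a, b = int(ratios[0] * n), int((ratios[0] + ratios[1]) * n)
        train += [(t, label) for t in texts[:a]]
        valid += [(t, label) for t in texts[a:b]]
        test += [(t, label) for t in texts[b:]]
    return train, valid, test
```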
    ABSTRACT This work presents a study of the impact of linguistic preprocessing (stopword removal, stemming, and detection of emojis, negation and named entities) on sentiment classification in the Tunisian dialect. We evaluate this impact on three corpora of different sizes and contents. Two classification techniques are used: Naïve Bayes and Support Vector Machines. We compare our results to the reference results obtained on these same corpora. Our results highlight the positive impact of the preprocessing phase on classification performance.
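    The preprocessing steps studied can be pictured as a single pipeline; the sketch below is a toy stand-in (the stopword list, negation cues, emoji pattern and stemming rule are all invented, not the paper's resources):

```python
import re

# Toy resources; the paper's actual stopword list, stemmer and detectors
# are assumptions here.
STOPWORDS = {"el", "fi", "w", "ya"}
NEGATIONS = {"ma", "mech", "la"}
EMOJI_RE = re.compile("[\U0001F600-\U0001F64F]")

def preprocess(comment):
    toks = []
    for tok in comment.lower().split():
        if EMOJI_RE.fullmatch(tok):
            toks.append("<EMOJI>")        # emoji detection
        elif tok in STOPWORDS:
            continue                      # stopword removal
        elif tok in NEGATIONS:
            toks.append("<NEG>")          # negation detection
        else:
            toks.append(tok.rstrip("s"))  # crude stemming stand-in
    return toks

print(preprocess("el film ma 3jebnich 😞"))
```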
    For easier communication, posting, or commenting on each other's posts, people use their dialects. In Africa, various languages and dialects exist. However, they are still underrepresented and not fully exploited for analytical studies and research purposes. Approaches like Machine Learning and Deep Learning require datasets. One African language is Bambara, used by citizens of different countries. However, no previous dataset for this language existed for Sentiment Analysis. In this paper, we present the first common-crawl-based Bambara dialectal dataset dedicated to Sentiment Analysis, freely available for Natural Language Processing research purposes.
    The process of automatically translating sentences by examining a number of human-produced translation samples is called Statistical Machine Translation (SMT). To help with word alignment of statistically translated sentences between languages with significantly different word orders, word reordering is required. In this paper, we outline the characteristics of the Turkish language and the challenges it poses for SMT. We propose to reorder translated Turkish sentences based on linguistic knowledge, using morphological analysis and syntactic rules.
    In recent digital library systems and the World Wide Web environment, parallel corpora are used by many applications (Natural Language Processing, machine translation, terminology extraction, etc.). This paper presents a new cross-language information retrieval model based on language modeling. The model avoids query and/or document translation and the use of external resources. It proposes a structured indexing schema for multilingual documents by combining a keywords model and a keyphrases model. Applied to parallel collections, a query in one language can retrieve documents in the same language as well as documents in other languages. Promising results are reported on the MuchMore parallel collection (German and English).
    The Tunisian dialect is different from Modern Standard Arabic (MSA). In (Fourati et al., 2020), TUNIZI referred to the Romanized alphabet used by the Tunisian social media community to transcribe informal Arabic. (Younes et al., 2015) mentioned that 81% of Tunisian comments on Facebook used the Romanized alphabet. In (Abidi, 2019), a study conducted on 1.2M Tunisian social media comments (16M words and 1M unique words) showed that 53% of the comments used the Romanized alphabet, while 34% used the Arabic alphabet and 13% used code-switching script. The study also mentioned that 87% of the Romanized-alphabet comments are TUNIZI, while the rest are French and English.
    Previous Named Entity Recognition (NER) models for Modern Standard Arabic (MSA) rely heavily on the use of features and gazetteers, which is time-consuming. In this paper, we introduce a novel neural network architecture based on a bidirectional Gated Recurrent Unit (GRU) combined with Conditional Random Fields (CRF). Our neural network uses minimal features: pretrained word representations learned from unannotated corpora together with character-level embeddings of words. This architecture allowed us to eliminate the need for most handcrafted feature engineering. We evaluate our system on a publicly available dataset, where we achieve results comparable to the previous best-performing systems.
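    A skeleton of the described architecture, assuming PyTorch and the third-party `pytorch-crf` package; dimensions are illustrative and the character-level embedding branch is omitted for brevity:

```python
import torch
import torch.nn as nn
from torchcrf import CRF   # third-party `pytorch-crf` package

class BiGRUCRF(nn.Module):
    """Word embeddings -> bidirectional GRU -> linear emissions -> CRF."""
    def __init__(self, vocab_size, n_tags, word_dim=100, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, word_dim)   # init from pretrained vectors
        self.gru = nn.GRU(word_dim, hidden, bidirectional=True, batch_first=True)
        self.emit = nn.Linear(2 * hidden, n_tags)
        self.crf = CRF(n_tags, batch_first=True)

    def loss(self, words, tags, mask):
        h, _ = self.gru(self.embed(words))
        return -self.crf(self.emit(h), tags, mask=mask)   # negative log-likelihood

    def predict(self, words, mask):
        h, _ = self.gru(self.embed(words))
        return self.crf.decode(self.emit(h), mask=mask)   # best tag sequences
```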
    For CLEF 2018, we focus on cultural microblog search. The aim of this work is to find relevant microcritics about films in monolingual and cross-lingual contexts. This task is challenging due to the short length of both the queries and the documents. For the monolingual context, we propose to expand the query using a probabilistic weighting scheme. For the French-English cross-language task, we used a state-of-the-art approach based on query translation.
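    One common probabilistic expansion scheme is pseudo-relevance feedback; the sketch below (an assumption, since the paper's exact weighting is not spelled out here) adds the terms most over-represented in the top-ranked documents:

```python
from collections import Counter

def expand_query(query_terms, top_docs, all_docs, k=5):
    """Pseudo-relevance feedback: add the k terms most over-represented in
    the top-ranked documents relative to the whole collection.
    `top_docs` must be drawn from `all_docs`."""
    fb = Counter(t for d in top_docs for t in d.split())
    bg = Counter(t for d in all_docs for t in d.split())
    fb_total, bg_total = sum(fb.values()), sum(bg.values())

    def weight(t):
        return (fb[t] / fb_total) / (bg[t] / bg_total)

    candidates = [t for t in fb if t not in query_terms]
    best = sorted(candidates, key=weight, reverse=True)[:k]
    return list(query_terms) + best
```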
    We propose and evaluate a Cross-Language Information Retrieval (CLIR) model based on the extraction and translation of Formal Concepts, avoiding query and/or document translation. The contribution of this work is a unified formal framework that integrates Formal Concept Analysis (FCA) and information retrieval for effective CLIR. The model indexes bilingual documents using bilingual Formal Concepts extracted by FCA. Moreover, the use of noun phrases, in addition to keywords, as indexes is studied. We use two comparable collections: an Italian-French collection and an English-French collection. To evaluate our model, we use three Information Retrieval models: TF-IDF, BM25 and Language Model. Finally, we study the query expansion results. Our main finding suggests that Formal Concept Analysis is effective for aligning Formal Concepts from different languages. Results indicate that our model's performance is comparable to a words translation approach and better than a words ...
    In this paper, we present our contribution to the DEFT 2017 international workshop. We tackled Task 1, entitled “Polarity analysis of non-figurative tweets”. We propose three sentiment classification models implemented using lexicon-based, supervised, and document embedding-based methods. For the first model, a novel strategy is introduced in which Named Entities (NEs) are involved in the Sentiment Analysis task. The first two models adopt bag-of-N-grams features, while for the third model, features are extracted automatically from the data itself in the form of document vectors. The official evaluation of the three models indicated that the best performance was achieved by the supervised learning-based model. Nevertheless, the results obtained by the document embedding-based model are considered promising and could be further improved if pretrained French word vectors were used to initialize the model's features. KEYWORDS: sentiment analysis, supervised learning, ...
    Searching for available, reliable, official, and understandable information is not a trivial task, due to information scattered across the internet and the lack of governmental communication channels that communicate in African dialects and languages. In this paper, we introduce an Artificial Intelligence powered chatbot for crisis communication that is omnichannel, multilingual and multidialectal. We present our work on a modified StarSpace embedding tailored to African dialects for the question-answering task, along with the architecture of the proposed chatbot system and a description of its different layers. English, French, Arabic, Tunisian, Igbo, Yorùbá, and Hausa are used as languages and dialects. Quantitative and qualitative evaluation results are obtained for our deployed Covid-19 chatbot. Results show that users are satisfied and that conversations with the chatbot meet customer needs.
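    The question-answering layer can be pictured as nearest-neighbour matching in a shared embedding space, in the spirit of StarSpace; a minimal sketch with an invented toy vector table:

```python
import numpy as np

def embed(text, vectors):
    """Average word embeddings; `vectors` is a hypothetical token->array map."""
    vecs = [vectors[w] for w in text.lower().split() if w in vectors]
    if not vecs:
        return np.zeros(len(next(iter(vectors.values()))))
    return np.mean(vecs, axis=0)

def best_answer(user_query, faq, vectors):
    """Return the curated answer whose question embeds closest to the query,
    mimicking a shared query/answer embedding space."""
    q = embed(user_query, vectors)

    def cos(a, b):
        n = np.linalg.norm(a) * np.linalg.norm(b)
        return float(a @ b / n) if n else 0.0

    return max(faq, key=lambda qa: cos(q, embed(qa[0], vectors)))[1]

# Toy data: 2-D vectors and a two-entry FAQ (all invented).
vectors = {"symptoms": np.array([1.0, 0.0]), "vaccine": np.array([0.0, 1.0])}
faq = [("covid symptoms", "Fever, cough, fatigue ..."),
       ("vaccine centers", "The list of centers is ...")]
print(best_answer("what are the symptoms", faq, vectors))
```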
    This paper presents the URPAH team's participation in DEFT 2012. Our approach uses noun phrases in the automatic identification of keywords indexing the content of scientific papers published in a review of Human and Social Sciences, with assistance from the terminology of
    This paper investigates the impact of preprocessing on the sentiment classification of Turkish movie and product reviews. Input datasets were subjected to several combinations of preprocessing techniques. The manipulated reviews were then fed to lexicon-based and supervised machine learning sentiment classifiers. The achieved results emphasize the positive impact of the preprocessing phase on the accuracy of both sentiment classifiers, as the baseline was outperformed by a considerable margin, especially when stemming, emoji recognition and negation detection were applied.
    We present a model supporting phrase-based indexing. This model includes a formal description of indexing terms, a derivation process, a matching function, a semantics of the indexing language, and a function weighting the match between indexing terms. It highlights the elements that should guide the design of Information Retrieval Systems based on compound terms. We also propose a set of techniques for implementing this model, particularly for the automatic extraction of phrases and for their weighting when computing the relevance of a document to a query.
    Human Language Technology has played a big role in implementing Latin-based information retrieval systems. Two of the most cited techniques are stemming and truncation. Numerous studies have shown that the inflectional structure of words has a big impact on the retrieval accuracy of information retrieval systems (IRS) for Latin-based languages. Stemming or truncation is done for two principal reasons: the reduction in required index storage and the increase in performance due to the use of word variants. Several stemming algorithms have been proposed for stemming text, such as Porter's for English. While these studies were concerned with Latin-based languages, only a few studies have given attention to the Arabic language. In this paper we present a study of the Arabic language characteristics that can usefully be integrated into an information retrieval system, and of the kinds of stemming techniques that can be used for the Arabic language. We used the .ae domain as a case study. We present some characteristic...
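    For illustration, a toy Arabic light stemmer in the spirit of the techniques discussed; the affix lists below are a small invented subset, not a published rule set:

```python
# Illustrative light stemmer: strips a few frequent Arabic clitics/affixes.
# Real systems (e.g., the Khoja or light10 stemmers) use richer rule sets.
PREFIXES = ["وال", "بال", "كال", "فال", "ال", "و"]
SUFFIXES = ["هما", "كما", "ات", "ون", "ين", "ها", "هم", "ية", "ه", "ة"]

def light_stem(word, min_len=3):
    for p in sorted(PREFIXES, key=len, reverse=True):
        if word.startswith(p) and len(word) - len(p) >= min_len:
            word = word[len(p):]          # drop one prefix at most
            break
    for s in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(s) and len(word) - len(s) >= min_len:
            word = word[:-len(s)]         # drop one suffix at most
            break
    return word

print(light_stem("والمعلومات"))   # -> معلوم
```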
    In this paper, a new sparse adaptive filtering algorithm is proposed. The proposed algorithm introduces a log-sum penalty term into the cost function of a mixed-norm leaky least-mean-square (NLLMS) algorithm. The cost function of the NLLMS algorithm is expressed as a sum of exponentials with a leakage factor. As a result of the log-sum penalty, the proposed algorithm performs well in sparse system identification settings, especially when the unknown system is highly sparse. The performance of the proposed algorithm is compared to those of the reweighted zero-attracting LMS (RZA-LMS) and the p-norm variable step-size LMS (PNVSSLMS) algorithms in sparse system identification settings. The proposed algorithm shows superior performance compared to the aforementioned algorithms.
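    The paper's modified NLLMS cost is not reproduced here, but its named baseline RZA-LMS, which already uses the gradient of a log-sum penalty as a zero attractor, is easy to sketch in NumPy (step sizes and the toy system are invented):

```python
import numpy as np

def rza_lms(x, d, taps=16, mu=0.01, rho=5e-4, eps=10.0):
    """Reweighted zero-attracting LMS (one of the cited baselines): the plain
    LMS update plus the gradient of a log-sum sparsity penalty on the taps."""
    w = np.zeros(taps)
    for n in range(taps - 1, len(x)):
        xn = x[n - taps + 1:n + 1][::-1]   # [x[n], x[n-1], ..., x[n-taps+1]]
        e = d[n] - w @ xn                  # a-priori estimation error
        w += mu * e * xn - rho * np.sign(w) / (1.0 + eps * np.abs(w))
    return w

# Identify a highly sparse unknown system from noisy input/output data.
rng = np.random.default_rng(0)
h = np.zeros(16); h[3] = 0.8               # single active tap
x = rng.standard_normal(4000)
d = np.convolve(x, h)[:len(x)] + 0.01 * rng.standard_normal(len(x))
print(np.round(rza_lms(x, d), 2))          # tap 3 converges near 0.8
```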
    Abstract. In this article we propose an Information Retrieval System (IRS) based on natural language text indexing techniques. We present a document indexing method that relies on a hybrid approach to the selection of textual descriptors. This approach uses natural language processing to extract noun phrases, and statistical filtering based on mutual information to select the most informative noun phrases for the indexing process. We carry out experiments using the Le Monde 94 corpus of the CLEF 2001 collection and the Lemur IRS to evaluate the proposed approach.
    We describe our system submitted to the 2021 Shared Task on Sarcasm and Sentiment Detection in Arabic (Abu Farha et al., 2021). We tackled both subtasks, namely Sarcasm Detection (Subtask 1) and Sentiment Analysis (Subtask 2). We used state-of-the-art pretrained contextualized text representation models and fine-tuned them for the downstream task at hand. As a first approach, we used Google's multilingual BERT, and then other Arabic variants: AraBERT, ARBERT and MARBERT. The results show that MARBERT outperforms all of the previously mentioned models on both Subtask 1 and Subtask 2.
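    A minimal fine-tuning sketch with Hugging Face transformers; the hub name "UBC-NLP/MARBERT", the hyperparameters, and the two inline examples are our assumptions, not the paper's exact recipe:

```python
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

name = "UBC-NLP/MARBERT"   # assumed hub identifier for MARBERT
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=2)

# Two invented training examples standing in for the shared-task data.
texts, labels = ["يا سلام على الذكاء", "الجو جميل اليوم"], [1, 0]
enc = tok(texts, truncation=True, padding=True)
train_ds = [{"input_ids": enc["input_ids"][i],
             "attention_mask": enc["attention_mask"][i],
             "labels": labels[i]} for i in range(len(texts))]

args = TrainingArguments(output_dir="marbert-sarcasm", num_train_epochs=3,
                         per_device_train_batch_size=16, learning_rate=2e-5)
Trainer(model=model, args=args, train_dataset=train_ds).train()
```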
    Arabic sentiment analysis models have recently employed compositional paragraph or sentence embedding features to represent the informal Arabic dialectal content. These embeddings are mostly composed via ordered, syntax-aware composition functions and learned within deep neural network architectures. Given the differences in syntactic structure and word order among the Arabic dialects, a sentiment analysis system developed for one dialect might not be efficient for the others. Here we present syntax-ignorant, sentiment-specific n-gram embeddings for sentiment analysis of several Arabic dialects. The novelty of the proposed model is illustrated through its features and architecture. In the proposed model, the sentiment is expressed by embeddings composed via an unordered additive composition function and learned within a shallow neural architecture. To evaluate the generated embeddings, they were compared with state-of-the-art word/paragraph embeddings. This involved inves...
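    The unordered additive composition can be sketched in a few lines, assuming a precomputed n-gram embedding table `ngram_vecs` (a hypothetical name):

```python
import numpy as np

def compose(text, ngram_vecs, n=2):
    """Unordered additive composition: the text embedding is the plain sum of
    its n-gram embeddings, so the result does not depend on where in the text
    each n-gram occurs (no syntax or word-order modeling)."""
    toks = text.lower().split()
    grams = [" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)]
    vecs = [ngram_vecs[g] for g in grams if g in ngram_vecs]
    if not vecs:
        return np.zeros(len(next(iter(ngram_vecs.values()))))
    return np.sum(vecs, axis=0)
```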
    Social media reflects the attitudes of the public towards specific events. Events are often related to persons, locations or organizations, the so-called Named Entities (NEs). This makes NEs sentiment-bearing components. In this paper, we go beyond NE recognition to the exploitation of sentiment-annotated NEs in Arabic sentiment analysis. We develop an algorithm to detect the sentiment of NEs based on the majority of attitudes towards them. This enabled tagging NEs with proper tags and thus including them in a sentiment analysis framework of two models: supervised and lexicon-based. Both models were applied to datasets of multi-dialectal content. The results revealed that NEs have no considerable impact on the supervised model, while employing NEs in the lexicon-based model improved classification performance and outperformed most of the baseline systems.
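    The majority-vote tagging of NEs could look like the sketch below, assuming `annotated` holds (text, entities, sentiment-label) triples; the minimum-mention threshold is an invented safeguard:

```python
from collections import Counter

def ne_polarity(annotated, min_mentions=3):
    """Tag each named entity with the majority sentiment of the annotated
    texts that mention it."""
    votes = {}
    for text, entities, label in annotated:   # label in {"pos", "neg", "neu"}
        for ne in entities:
            votes.setdefault(ne, Counter())[label] += 1
    return {ne: c.most_common(1)[0][0]
            for ne, c in votes.items() if sum(c.values()) >= min_mentions}
```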

    And 36 more