Social networks now play a fundamental role in promoting and diffusing television and radio programs to different categories of audiences. Political parties, influential groups and political activists have rapidly seized these new communication media to spread their ideas and express their sentiments on critical issues. In this context, Twitter, Facebook and YouTube have become very popular tools for sharing videos and communicating with users, who interact with each other to discuss problems, propose solutions and give viewpoints. This interaction on social media sites yields a huge amount of unstructured and noisy text; hence the need for automated analysis techniques to classify the sentiments conveyed in users’ comments. In this work, we focus on opinions written in a less-resourced Arabic language: Tunisian dialect (TD). We present a process for building a sentiment analysis model for comments written on Tunisian television broadcasts pub...
The difficulty of processing dialects is clearly observed in the high cost of building a representative corpus, in particular for machine translation. Indeed, all machine translation systems require a huge amount of well-managed training data, which is a challenge in a low-resource setting such as the Tunisian Arabic dialect. In this paper, we present a data augmentation technique to create a parallel corpus of the Tunisian Arabic dialect as written on social media and standard Arabic, in order to build a Machine Translation (MT) model. The created corpus was used to build a sentence-based translation model. This model reached a BLEU score of 15.03% on a test set, whereas it was limited to 13.27% when using the corpus without augmentation.
Ahmed Hamdi1 Rahma Boujelbane1,2 Nizar Habash3 Alexis Nasr1 (1) Laboratoire d’Informatique Fondamentale de Marseille, CNRS UMR 7279, Université Aix-Marseille (2) Multimedia, InfoRmation Systems and Advanced Computing Laboratory, Sfax 3021, TUNISIE. (3) Center for Computational Learning Systems, Columbia University, New York, NY 10115, USA {ahmed.hamdi,rahma.boujelbane,alexis.nasr}@lif.univ-mrs.fr habash@ccls.columbia.edu
2018 JCCO Joint International Conference on ICT in Education and Training, International Conference on Computing in Arabic, and International Conference on Geocomputing (JCCO: TICET-ICCA-GECO), 2018
Opinion analysis on the Web is becoming an increasingly attractive task, due to the growing need of individuals and societies to track people's attitudes toward various subjects of daily life (services, products, decisions, etc.). Several tools and corpora have been developed to analyze opinions in well-resourced languages such as standard Arabic, French, English and Spanish. Recently, with the emergence of dialects on social media, opinion analysis for dialects has begun to be studied. In this paper, we present a method for building an opinion lexicon and corpus. Through this work, we aim to propose a statistical model to analyze and follow the opinions of Internet users on broadcasts and on TV and radio programs.
Since the Tunisian revolution, the Tunisian Dialect (TD) used in daily life has become progressively more used and represented in interviews, news and debate programs instead of Modern Standard Arabic (MSA). This situation has important negative consequences for natural language processing (NLP): since the spoken dialects are not officially written and do not have a standard orthography, it is very costly to obtain adequate corpora for training NLP tools. Furthermore, there are almost no parallel corpora involving TD and MSA. In this paper, we describe the creation of a Tunisian dialect text corpus as well as a method for building a bilingual dictionary, in order to create a language model for a speech recognition system for Tunisian Broadcast News. To do so, we use explicit knowledge about the relation between TD and MSA.
Tunisian Arabic is a dialect of the Arabic language spoken in Tunisia. It is an under-resourced language: it has neither a standard orthography nor large collections of written text and dictionaries. In fact, there is no strict separation between Modern Standard Arabic, the official language of the government, media and education, and Tunisian Arabic; the two exist on a continuum dominated by mixed forms. In this paper, we present a conventional orthography for Tunisian Arabic, following a previous effort on developing a conventional orthography for Dialectal Arabic (or CODA) demonstrated for Egyptian Arabic. We explain the design principles of CODA and provide a detailed description of its guidelines as applied to Tunisian Arabic.
Arabic Dialects (AD) have recently begun to receive more attention from the speech science and technology communities. The use of dialects in language technologies will help improve the development process and the usability of applications such as speech recognition, speech comprehension, or speech synthesis. However, AD suffer from a lack of resources compared to Modern Standard Arabic (MSA). This paper deals with the problem of tagging one AD: the Tunisian Dialect (TD). We present a method for building a fine-grained part-of-speech (POS) tagger for the TD. This method consists of adapting an MSA POS tagger by generating a TD training corpus from an MSA corpus using a bilingual MSA-TD lexicon. Evaluated on a corpus of text transcriptions, the TD tagger achieved an accuracy of 78.5%.
Papers by Rahma Boujelbane