Cross-Dialectal Arabic Processing
Salima Harrat, Karima Meftouh, Mourad Abbas, Salma Jamoussi, Motaz
Saad, Kamel Smaïli
To cite this version:
Salima Harrat, Karima Meftouh, Mourad Abbas, Salma Jamoussi, Motaz Saad, et al.. Cross-Dialectal
Arabic Processing. International Conference on Intelligent Text Processing and Computational Linguistics, Apr 2015, cairo, Egypt. 10.1007/978-3-319-18111-0_47. hal-01261598
HAL Id: hal-01261598
https://hal.archives-ouvertes.fr/hal-01261598
Submitted on 25 Jan 2016
HAL is a multi-disciplinary open access
archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from
teaching and research institutions in France or
abroad, or from public or private research centers.
L’archive ouverte pluridisciplinaire HAL, est
destinée au dépôt et à la diffusion de documents
scientifiques de niveau recherche, publiés ou non,
émanant des établissements d’enseignement et de
recherche français ou étrangers, des laboratoires
publics ou privés.
Cross-dialectal Arabic Processing
Salima Harrat1 , Karima Meftouh2 , Mourad Abbas3 , Salma Jamoussi4 , Motaz
Saad5 , and Kamel Smaili5
1
ENSB ⋆ , Ecole Supérieure d’Informatique (ESI), Algiers, Algeria,
2
Badji Mokhtar University, Annaba, Algeria
3
CRSTDLA⋆⋆ , Algiers, Algeria
4
⋆⋆⋆
MIRACL
, Pole Technologique de Sfax, Tunisia
5
Campus Scientifique LORIA , Nancy, France
Abstract. We present, in this paper an Arabic multi-dialect study including dialects from both the Maghreb and the Middle-east that we
compare to the Modern Standard Arabic (MSA). Three dialects from
Maghreb are concerned by this study: two from Algeria and one from
Tunisia and two dialects from Middle-east (Syria and Palestine). The
resources which have been built from scratch have lead to a collection of
a multi-dialect parallel resource. Furthermore, this collection has been
aligned by hand with a MSA corpus. We conducted several analytical
studies in order to understand the relationship between these vernacular languages. For this, we studied the closeness between all the pairs of
dialects and MSA in terms of Hellinger distance. We also performed an
experiment of dialect identification. This experiment showed that neighbouring dialects as expected tend to be confused, making difficult their
identification. Because the Arabic dialects are different from one region to
another which make the communication between people difficult, we conducted cross-lingual machine translation between all the pairs of dialects
and also with MSA. Several interesting conclusions have been carried out
from this experiment.
1
Introduction
In Arab countries, the majority of people speaks dialects. Modern Standard Arabic (MSA) is the official language used only in formal speeches, media and education. What may be surprising is that even educated people, in their daily life
prefer speaking the dialect which is their mother-tongue. Consequently, studying
the dialects becomes a priority which could take benefit from natural language
processing tools.
During the last decade, researchers have been interested to Arabic dialects
processing, like building lexicon, morphological analysis, POS tagging, etc, [1–4].
⋆
⋆⋆
⋆⋆⋆
Ecole Normale Supérieure Bouzareah.
Centre de Recherche Scientifique et Technique pour le Développement de la langue
Arabe.
Multimedia, InfoRmation systems and Advanced Computing Laboratory.
Recent works have been dedicated to other tasks, such as Machine Translation [5,
6] and dialect identification [7, 8]. In [9], a work of building a small multilingual
dialectal corpus is presented further including the MSA.
In this paper, we will focus on a set of Arabic dialects and more particulary
on three from Maghreb (two from Algeria and one from Tunisia). On the other
side we will conduct a study and experiment on Palestinian and Syrian dialects.
To do that, we build a parallel corpus, study the relationship between dialects
and MSA, distinguish one dialect from another and present few experiments
of machine translation between MSA and the different dialects. This paper is
structured as follows: in section 2 we give an overview of the considered dialects.
We discuss in section 3 how resources are built with some related works, then we
detailed how we created our parallel corpus. Section 4 is dedicated to an analytical comparison of all dialects and MSA. Section 5 presents dialect identification
experiments whereas the last one gives results of machine translation between
all dialects and MSA. Finally, we conclude in section 7 by summarizing all the
work.
2
Overview of the Used Dialects
Arabs use in their daily conversations dialects which could be considered such as
variants of MSA. Tunisian and Algerian dialects share many features with the
other Maghrebi dialects because of their similar history. It is worth mentioning
that they contain many words borrowed from other languages, mainly Berber,
French, Turkish, Italian and Spanish. Syrian and Palestinian dialects share an
important number of features since they are included in the Levantine Arabic
dialect continuum. In the following, we give a short overview about each dialect
we study in this article.
2.1
Algerian Dialect
Algerian dialect is an informal spoken language, not used in official speech. Its
vocabulary is roughly similar through all Algeria. However, in the east of the
country, the dialect is closer to Tunisian whereas in the west it is closer to
Moroccan. Most of the words of Arabic dialect come from MSA [10], but there
is significant variation in the vocalization in most cases, and omission of some
letters in other cases. Contrary to MSA, few letters are not used in Algerian as
and , where most of the time they are respectively pronounced as
and .
✠
➔
✠
Moreover Algerian dialect uses some non-Arabic letters like
2.2
✠
❳
➄
✑
➡
and
❍
✒
❳
.
Tunisian Dialect
Like other Maghrebi dialects, the vocabulary of the Tunisian dialect is mostly
Arabic, with significant Berber substrates. However, its morphology, syntax,
phonology and vocabulary differ from standard Arabic. The Tunisian dialect is
very agglutinative: people tend to use very few words for conversation where
one word may express a whole sentence. It differs from MSA especially in its
negation form where the markers are always agglutinated to other words as
affixes or suffixes. Moreover, in Tunisian dialect, several Arabic words are used
with substantial changes in their stem formation.
2.3
Syrian and Palestinian Dialects
Syrian and Palestinian dialects are part of Levantine Spoken Arabic which covers also dialects spoken in Lebanon and Jordan. Levantine Arabic shares most
phonological, structural, and lexical features with other varieties of Arabic. At
the same time, there are differences among Levantine dialects based on geography and urban/rural division. Arabic Syrian dialect is influenced by the Syriac
language, a Semitic language of the Middle East, belonging to the Aramaean
language group. It contains a large proportion of Arabic words and also words
borrowed from Turkish and French. Palestinian dialect has slight phonetic differences from north Levantine dialects. It can be classified into two main categories:
urban and countryside. It can be classified also according to geographical area
(north and south). Palestinian dialect built in this work is mainly the dialect of
people who live in Gaza strip.
3
Building a Parallel Corpus
It is well known that parallel corpora are the foundation stone of several natural language processing tasks, particularly cross-language applications such as
machine translation, bilingual lexicon extraction and multilingual information
retrieval. Building this kind of resources is a challenging task especially when it
deals with under-resourced languages [11]. Arabic is one of these languages for
which parallel corpora are scarce. The problem is much deeper with the Arabic
dialects which are used by a huge number of people but, unfortunately they are
often not written. To overcome the need of text corpora covering these languages,
researchers can choose one of two main possible solutions: building the corpus
from scratch or crawl the web to build a parallel set of sentences.
The solution adopted for our work is the first one: from scratch, since the
overall goal of this work is Speech-to-Speech translation we need real everyday
conversations. In the following, we focus on Annaba’s dialect (ANB), the language spoken in the east of Algeria, on Algiers’s dialect (ALG), the language
used in the capital of Algeria, on Sfax’s dialect (TUN) spoken in the south of
Tunisia, Syrian (SYR) and Palestinian (PAL) dialects.
ANB corpus was created by recording different conversations from every day
life whereas, for ALG, we used the recordings corresponding to movies and shows
which are often expressed in the dialect of Algiers. Then we transcribed both
of them by hand. To increase the size of the two corpora, we translated each of
them into the other. Afterwards, these two corpora have been translated into
MSA.
In order to introduce both Tunisian, Syrian and Palestinian dialects, we used
MSA as a pivot language. We translated the MSA corpus to TUN, SYR and
PAL. The Tunisian corpus was produced by 20 native speakers. Each one was
responsible of translating almost 320 sentences from MSA to TUN. Speakers
have very slight differences in their spoken languages. All of them are from the
south of Tunisia where people tend to use Arabic words rather than French
words as it is the case in the north of the country. In fact, the dialect used in the
south is closer to the Standard Arabic than that used in the north of Tunisia.
Syrian and Palestinian corpora were created in the same way as Tunisian one
except that each of them has been obtained by two translators. Finally, we get
a parallel corpus including ANB, ALG, TUN, SYR, PAL and MSA.
It should be noted that each dialect word is written by adopting the Arabic
notation, that means if a dialectal word does exist in MSA, it is written in a
standard form without any change, otherwise the word is written as it is uttered.
We give in Table 1 an example of a same sentence from the built corpus. We can
remark even if we do not read Arabic that some words are the same from one
dialect to another, while others are completely different.
Table 1. An example of a sentence from the parallel corpus
Dialect/Language
ALG
Sentence
✏ ✏
✏
✠
✚✬
é❑ ◗å❸♠
✡
✠
✏
ú➥
é➤❏➥ ❳
✡
✏ ✏
ANB
✠
✏
ú➥
é➤❏➥ ❳
✡
✠
✏
úæ❏❏❦
✡✳
✡
➼❏❐
✡
✠
✏
✠
é❑ ❳◗➤❑
✡
✏
✏ ✏
⑨ ❆❑
✳
ñ❑P ❳
➻ ❨❏➠
ú♠
✠
✏
ú➥
✠
✏
✏
ñ➟❏➆✢ ❅
✡
■♠
✳
✚✬
✏
ñ❏❏➤➠
✳
ú❮
✏ ✏
✏
é❏❦ð ◗❑
✡
PAL
✠
✏
é➤❏➥ ❳
✡
✠
ú➥
⑨ ❆❑
✳
✠
➻ ❨❑
✳
➻ ❨❏➠
✏
é❏➟❏➆✢
✡
✡
➻ ❨❑
✳
➻ ❨❏➠
✏
❆❑ ❅
ú❮ ❅
✏ ✏
✠
✏
ú➥
✡
Meaning
4
✏
é❑ P ❨î❊
✡
ð
✏
✠
ñ❏✜➆➥
✡
✏
ú❰❐ ❅
ú❦ ❅
✠
à❅
✏
✠
á❑ ❨❑ ◗❑
✡
✡
✠
➻ ❨❏➠
✏
ú● ❅
✏
■➥ñ❐ ❅
à ❆❶➠
✎
✠
ñ❑ ❨❣ ❅
✏
ú❰❐ ❅
✳
➱➾
✏
■➥ñ❐ ❅
➱➾
✡
✍
✏
✡
✏
✑
✏✏
ú➽❐
✡
✠
✠ ✏
é❏➥◗➟❏❷ ❅
ø ❨❐ ❅
✏
All the time that I put to visit you, you want to spoil it in one minute
Analytical Comparison
Multi-dialect Corpus Statistics
The obtained parallel corpus is made up of 6400 parallel sentences. The MSA
part contains 40906 words including 9131 different words. The five dialects, ALG,
✏
■➥ñ❐ ❅
✡
In the following, we will compare dialects between them and with MSA. The
idea is to understand what is close to what? What is different? etc. We hope
that this will help us in a future work to take advantage of MSA in order to
develop linguistic tools for processing dialects.
4.1
■➥ñ❐ ❅
✎
✡
✠
✡
✌
é➤❏➥ ❳
✡
➱➽❐ ❅
✏
ñ❑ ❨➟➥
✠
ø ❆❣
✳
✡
✠
✡
MSA
■➥ñ❐ ❅
✏
✏
✑
✚✬
➼❏♠
✡✳
✍
✏ ✏
✏
❅❳❅
✡
✏
é➤❏➥ ❨❑
✡
✳
➱➾
✡
✡
SYR
✏
■➥ñ❐ ❅
✌
è ❆❑
✳
✠
❅
✏
➝ ❆➥
✑
✏
✚✬
✡✳
é➤❏➥ ❳
✡
ú❮
✡
✠
✠
é❏❑ ❆❣
✳✡
✑
✏
✑
✚✬
✡✳
✡
TUN
ú♠
✡
➱➾
ANB, TUN, SYR and PAL include an average of 37500 words with a vocabulary
which does not exceed 10250 words (see Table 2). The average number of words
in a dialect sentence is of 6 while it is of 7 for MSA. The shortest sentence in
the corpus is composed of 4 words and the longest one contains 25 words.
Table 2. Parallel corpus description
Corpus #Distinct words
ALG
8966
ANB
9060
TUN
10215
SYR
9825
PAL
9196
MSA
9131
4.2
#Words
38707
38428
36648
37259
39286
40906
Common Lexical Units between MSA and Dialects
MSA language is the same throughout the Arab world, while the dialects vary
according to the geographical location. In this Section, we are interested in measuring how much the dialectal vocabulary is close to MSA by using the aforementioned parallel corpus. The experiments we achieved, show that the dialects
employ many MSA words, even if the utterance of these words depends strongly
on each dialect. Particularly, PAL is closest to MSA than other dialects are
(Table 3).
Table 3. Percentage of common words between Arabic dialects and MSA
Dialect ALG ANB TUN SYR PAL
%
21.18 21.07 37.60 37.36 51.68
These results are not surprising. Indeed, Middle-East Arabic dialects tend
to be closer to MSA than those of the Maghreb. Also, it would be noticed that
Arabic dialects spoken in south of Maghreb countries include more Arabic words
than those spoken in the north. This explains the different rates in terms of
common words between the two Algerian dialects on one side and the Tunisian
dialect on another side. Indeed TUN is spoken in the south of Tunisia while ALG
and ANB are dialects of northern Algeria. In Table 4, we give few examples of
the most frequent words between the studied Arabic dialects and MSA.
In the same way, we computed the percentage of common words between all
pairs of dialects. The Table 5 represents the percentage of common words between different dialects. These values show that ALG and ANB share the largest
Table 4. The most frequent words of each dialect relatively to our corpus
Dialect
ALG
ANB
TUN
Most Frequent words
❨❣ ❅ð
✙➉
❧
❤ ❅P
one
right
he went
full
❨❣ ❅ð
✙➉
❧
❤ ❅P
➻ ❨❏➠
one
right
✠
à ❆➾
it was
SYR
✏
✠
he went you have
✏
■➥ð
❨❣ ❅ð
time
one
all
❨❣ ❅ð
ø ❨❏➠
➱➽❐ ❅
✏
Ð ñ❏❐ ❅
✡
è◗Ó
✠
✡
today one time
PAL
➱Ó ❆➾
one
i have
Ð ñ❏❐ ❅
✡
❨❣ ❅ð
■❏↔
✡
✳
❤ ❅P
today
one
good
he went
number of words, followed by PAL and SYR. These results were excepted because ALG dialects and ANB are close since they are used in two cities separated
by 372 miles, as PAL and SYR which are used in the same geographic location
separated by only 175 miles. Also, TUN shares more words with PAL than the
two other Maghrebi dialects do. Only 23% in average of words are common to
Syrian and Maghrebi dialects. This result reinforces the fact that we made at
the beginning of the article about the difficulty of conversing between Arabic
people, from Maghreb and middle-east.
Table 5. Cross dialect percentage of common words.
Ref.
ALG
ANB
TUN
SYR
PAL
Percentage of common words
ALG ANB TUN SYR PAL
73.62 35.43 24.16 25.43
72.86
34.25 23.59 25.00
31.10 30.38
29.79 33.49
21.01 20.73 29.52
- 44.00
24.79 24.63 37.20 49.33 -
We estimated also the percentage of common words at sentence level between
each pair of languages. For each pair of the k th aligned sentences SLk i and SLk j
from the bitext (Li , Lj ). The common words is calculated as in formula 1 , it
corresponds to the percentage of common words in the two sentences to the
total number of words in both sentences. Then we estimate the average common
number of words over all the sentences.
Ovp(SLk i , SLk j ) =
|SLk i ∩ SLk j |
|SLk i ∪ SLk j |
(1)
Table 6 presents the overlap between the Arabic dialects and MSA at sentence
level. The achieved results confirm those of the two last experiments. PAL is the
closest dialect to MSA followed by TUN then SYR, while ALG and ANB are the
farthest. This experiment also highlights the closeness between Algerian dialects
(ALG and ANB) and Levantine dialects (PAL and SYR). It shows also That
TUN is closer to PAL and to SYR than ALG and ANB.
Table 6. Overlapping of vocabularies between Dialects and MSA
MSA
PAL
SYR
TUN
ANB
4.3
ALG
0.12
0.13
0.09
0.16
0.32
ANB
0.10
0.11
0.09
0.13
TUN SYR PAL
0.16 0.14 0.21
0.17 0.21
0.13
Measuring the Cross-language Divergence
In this section, we are interested by measuring the divergence between dialects
and MSA throw unigram language models. For this purpose we choose to use
the Hellinger Distance (HD) [12][13], a measure of distributional divergence. It
quantifies the similarity between two probability distributions. It has been used
to detect failures in classification performance [14] and in machine learning it
is used to estimate the class distribution [15]. In [16], this distance was used to
measure information loss in data protection. Hellinger distance is symmetric and
non-negative, and obeys to triangle rule.
In order to measure the divergence between two languages with HD, let consider a bi-text(Li , Lj ) with the vocabularies Vi and Vj respectively. We constitute
V , a vocabulary including 10K words including the common words between Vi
and Vj and from the remaining words of the two vocabularies we include the
most frequent ones to complete V. For each side of the bi-text, a unigram probability distribution P (w|Li ) is computed over V . To avoid zero probabilities due
to the words not belonging to the considered language, we decided to smooth
the probabilities. The comparison of the two distributions is then calculated as
follows:
s
q
1 X p
( P (w|Li ) − P (w|Lj ))2
(2)
HD(Li , Lj ) =
2
w∈V
Table 7 draws HD values computed between all dialects and MSA. These
values show that PAL is the closest dialect to MSA followed by TUN then SYR,
whereas ALG and ANB are the most divergent. The closest dialects according
to HD are ALG and ANB and also PAL and SYR. The closest dialect to MSA
is PAL and the farthest are ALG and ANB. Another interesting and expected
result is the one related to the distance between TUN and the other dialects,
TUN is closer to ALG and ANB than to PAL and SYR.
Table 7. Hellinger distance values for the different pairs of languages
MSA
PAL
SYR
TUN
ANB
5
ALG
0.72
0.85
0.84
0.79
0.73
ANB
0.72
0.86
0.86
0.80
TUN SYR PAL
0.60 0.62 0.55
0.81 0.76
0.81
Dialect Identification
In this section, we deal with the issue of using several languages in the same
sentence. This is very common in Arabic world and especially in Maghreb. This
phenomenon is commonly named code switching. Arabic people often switch
between several languages. For instance, in Algeria, people could switch from
dialect, to MSA to French. In the following, French will not be taken into account.
To identify the different languages in order to apply the appropriate tools, we
consider the identification of language such as a classification issue which will be
treated in the following by a Naive Bayes classifier (NBC). NBC is probabilistic
learning algorithm, it is used for many issues in NLP [17–19]. A naive Bayes
classifier assumes that all features representing a given problem are conditionally
independent given the value of classification variables.
For our purpose, NBC is based on 3-grams features. Given n classes corresponding to n languages, the purpose is to assign the most suitable class Ci in
accordance to a set of features F = {f1 , ..., fn } which maximizes the conditional
probability:
p(Ci | f1 , ..., fn ) = p(Ci )
n
Y
p(fj |Ci )
(3)
j=1
where p(Ci ) is the probability of the class Ci and p(fj |Ci ) is the conditional
probability of the feature fj observed with the class Ci .
For the experiment, we created a corpus by merging MSA, ALG, ANB, TUN,
SYR and PAL for which each sentence is annotated by its corresponding language
class. By selecting randomly 80% of the data, we create the training corpus and
the remaining has been dedicated for test. Classification results in Table 8 show
that the recall for MSA is the highest one (75%); this could be explained by
the fact that MSA writing obeys to strict rules contrary to dialects for which no
Table 8. Dialect identification results using the parallel corpus
Language Precision Recall F
ALG
0.48
0.50 0.49
ANB
0.49
0.49 0.49
TUN
0.68
0.52 0.59
SYR
0.62
0.55 0.58
PAL
0.53
0.57 0.55
MSA
0.64
0.75 0.69
formal writing rules exist: a dialect word could be written in different forms which
are all acceptable. Consequently, this phenomenon generates a larger distribution
probability for dialects than MSA ones.
Table 9 draws the confusion matrix of the classifier. For dialect side, it is
clearly shown that the highest confusion rates are those between ALG and ANB
and between PAL and SYR, this confusion is justified by the closeness between
these pairs of dialects; ALG and ANB for example share an important vocabulary
in spite of their difference. For MSA side, it is shown that the highest confusion
Table 9. Confusion matrix rates for dialect identification using the parallel corpus.
Estimated language
True language classes ALG ANB TUN SYR
ALG
50
35
4
6
ANB
38
49
5
2
TUN
12
8
52
6
SYR
3
3
4
55
PAL
4
3
4
16
MSA
2
2
4
5
classes
PAL MSA
2
4
3
3
9
14
24
11
57
17
12
75
rates related to MSA are those with PAL, whereas for ALG and ANB dialects,
confusion rates related to MSA do not exceed 4% for both dialects.
6
Machine Translation
Arabic language translation has been widely studied. The rich morphology of
Arabic is seen as a rocky barrier in building efficient translation systems. Indeed,
Arabic is characterized by complex a morphology and a rich vocabulary. It is
a derivational, flexional and agglutinative language. We recall that, in order to
compare it to English, an Arabic word (or more rigorously a lexical entry) can
sometimes correspond to a whole English sentence [20].
Moreover, Arabic words are often ambiguous because a single word could
have multiple morphological analyses. This is due to the richness of the Arabic
affixation and the omission of short vowels. In addition, articles, prepositions,
pronouns, etc. can be affixed to adjectives, nouns, verbs and particles to which
they are related. All these phenomena increase the ambiguity and make the
traditional issues of NLP more challenging such as machine translation from
and to Arabic.
As shown in the previous experiments, dialects even if they are inspired
strongly from Arabic, the significant differences may prevent communication
between people of Arabic world. That is why, it is very important to propose
machine translation between different dialects and MSA. In the following, we
present several experiments in order to develop machine translation between
Arabic dialects and MSA. For each pair of languages we used a parallel corpus
of 6400 sentences (5900 have been dedicated to training and the remaining for
tests).
Table 10. BLEU score of Machine Translation on different pairs of languages using
two smoothing techniques
Source
ALG
ANB
TUN
SYR
PAL
MSA
ALG
KN WB
67.31 65.55
9.89 9.48
7.57 7.50
11.28 10.67
13.55 13.05
ANB
KN WB
61.06 60.81
9.34 9.01
7.50 7.64
9.53 9.15
12.54 11.72
Target
TUN
SYR
KN WB
KN WB
9.67 9.36 7.29 7.95
9.08 8.64 7.52 7.95
13.05 12.93
13.67 13.23
17.93 16.64 23.29 23.07
20.03 20.44 21.38 20.32
PAL
KN WB
10.61 10.14
10.12 9.84
22.55 22.21
26.60 25.74
42.46 41.37
MSA
KN WB
15.1 14.64
14.44 13.95
25.99 25.21
24.14 22.96
40.48 39.76
-
All the MT systems we used are phrase-based [21] with default settings: bidirectional phrase and lexical translation probabilities, distortion model, a word
and a phrase penalty and a trigram language model. We used GIZA++ [22] for
alignment and SRILM toolkit [23] to compute trigram language models. Since
the parallel corpus is small, we experimented the Kneser-Ney and Witten-Bell
smoothing techniques hoping to identify the one which best fits. The results
conducted on the test set are presented in terms of BLEU in Table 10. This
experiment leads to very interesting conclusions. First of all, for small parallel
corpus, it seems that the smoothing technique has an impact on translation results. A difference of almost 2 points in terms of BLEU scores has been observed
for translating from ANB to ALG. But, we can not generalize by affirming that
one smoothing technique is definitely better than another. High score of translation has been achieved between ANB and ALG in both sides. This result is
natural since these two dialects are used in the same country and share up to
60% of words. Almost the same observation is made for the pair SYR and PAL
since these two dialects belong to the same language family (Levantine).
Another interesting and expected result is BLEU score between MSA and
dialects. In fact, the highest one is related to PAL in both sides since this dialect
is the closest one to MSA as shown in other experiments of sections 4.2 and 4.3.
Most surprising results are those relative to SYR and TUN. It seems that it is
easier to translate TUN to MSA than SYR to MSA. Also, translating from MSA
to TUN gives better results than from MSA to the Algerian dialects. In the symmetric side of translation we get the same scale of results. This definitely shows
the closeness of TUN dialect to MSA in comparison to the Algerian dialects.
7
Conclusion
In this paper we present an analytical study of Arabic dialects from Middle-east
and Maghreb. Maghreb’s dialects use several words from French, Turkish, ...
and adapted them phonetically so they become full words of these dialects. In
the opposite, Middle-east dialects are more close to MSA because they share an
important vocabulary with it.
To make this research and study possible, we started from scratch because
for these vernacular languages, there is no available written resources. We build
a parallel corpus including 5 dialects (two from Algeria, one from Tunisia, and
the two others from Middle-east: Palestine and Syria) and MSA. We perform
different experimentations in order to study the relationship between MSA and
dialects on one hand and cross-dialects on the other hand. For this, we calculated the overlapping between each pair of vocabularies. We then calculated the
distance between the distributions of each pair of languages in order to measure
which language is closer to which one. The carried out results are consistent
with the fact that Middle-East Arabic dialects are closer to MSA than those of
the Maghreb. This has been confirmed by the other experiments of identification
handled by machine learning techniques. We showed that it is easier to identify
MSA than dialects because it is a natural language with the whole standard
linguistic constraints. Concerning the experience on identification, the results
could be separated into two classes. The first one concerns ALG and ANB and
the other one the three other dialects. In fact for this last class, the F-measure
results are close and the difference between them are not statistically significant.
This means that it is easier to identify PAL, SYR and TUN than Algerian dialects.
We conducted also several experiments of machine translation between all the
pairs of languages. We took advantage from this experiment to try to understand
whether the smoothing techniques could have an impact on BLEU score when
we are faced to small corpora. We remarked that in some cases, the used method
could improve BLEU significantly. High score of translation has been achieved
between ANB and ALG in both sides. This result is natural since these two
dialects are used in the same country and share up to 70% of words. In the near
future, we will extend this work to other dialects and will propose a speech to
speech system which is the main objective of this work.
References
1. Kilany, H., Gadalla, H., Arram, H., Yacoub, A., El-Habashi, A., McLemore, C.:
Egyptian Colloquial Arabic Lexicon. In: LDC catalog number LDC99L22. (2002)
2. Kirchhoff, K., Bilmes, J., Das, S., Duta, N., Egan, M., Ji, G., He, F., Hopkins, J.,
Liu, D., Noamany, M., Schone, P., Schwartz, R., Vergyri, D.: Novel Approaches to
Arabic Speech Recognition: Report from the 2002 Johns-Hopkins Summer Workshop. In: Proc. IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing.
(2003) 344–347
3. Habash, N., Rambow, O.: Magead: A Morphological Analyzer and Generator
for the Arabic Dialects. In: Proceedings of the 21st International Conference on
Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics. (2006) 681–688
4. Chiang, D., Diab, M., Habash, N., Rambow, O., Shareef, S.: Parsing Arabic Dialects. In: Proceedings of the European Chapter of ACL (EACL). (2006)
5. Zbib, R., Malchiodi, E., Jacob, D., Stallard, D., Matsoukas, S., Schwartz, R.,
Makhoul, J., Zaidan, O., Callison-Burch, C.: Machine Translation of Arabic Dialects. In: Proceedings of the 2012 Conference of the North American Chapter
of the Association for Computational Linguistics: Human Language Technologies.
NAACL HLT 12 (2012) 49–59
6. Salloum, W., Habash, N.: Dialectal Arabic to English Machine Translation: Pivoting through Modern Standard Arabic. In: Proceedings of the 2013 Conference
of the North American Chapter of the Association for Computational Linguistics:
Human Language Technologies. NAACL HLT 13 (2013) 348–358
7. Zaidan, O., Callison-Burch, C.: Arabic Dialect Identification. Computational Linguistics, Volume 40 (2014) 171–202
8. Elfardy, H., Diab, M.: Sentence Level Dialect Identification in Arabic. In: ACL
(2). (2013) 456–461
9. Bouamor, H., Habash, N., Oflazer, K.: A Multidialectal Parallel Corpus of Arabic.
In: Proceedings of the Language Resources and Evaluation Conference, LREC2014. (2014) 1240–1245
10. Meftouh, K., Bouchemal, N., Smaili, K.: A Study of a Non-resourced Language: an
Algerian Dialect. In: Third International Workshop on Spoken Languages Technologies for Under-resourced Languages. (2012) 125–132
11. Skadiņa, I., Aker, A., Giouli, V., Tufis, D., Gaizauskas, R., Mieriņa, M., Mastropavlos, N.: A Collection of Comparable Corpora for Under-resourced Languages. In:
Proceedings of the 2010 Conference on Human Language Technologies – The Baltic
Perspective: Proceedings of the Fourth International Conference Baltic HLT 2010.
(2010) 161–168
12. Kailath, T.: The Divergence and Bhattacharyya Distance Measures in Signal Selection. In: Communication Technology, IEEE Transactions. Volume 15. (1967)
52–60
13. Rao, C.R.: A Review of Canonical Coordinates and an Alternative to Correspondence Analysis Using Hellinger Distance. Quaderns dEstadistica i Investig Ope,
Questiio, Volume 19 (1995) 23–63
14. Cieslak, D.A., Chawla, N.V.: A Framework for Monitoring Classifiers Performance:
When and Why Failure Occurs? Knowledge and Information Systems (2009) 83–
109
15. González-Castro, V., Alaiz-Rodrguez, R., Alegre, E.: Class Distribution Estimation
Based on the Hellinger Distance. Information Sciences (2013) 146–164
16. Torra, V., Carlson, M.: On the Hellinger Distance for Measuring Information
Loss in Microdata. In: Joint UNECE/Eurostat work session on statistical data
confidentiality. (2013)
17. Pop, I.: An Approach of the Naive Bayes Classifier for the Dcument Classification.
General Mathematics, Volume 14, No 4 (2006) 135–138
18. Pedersen, T.: A Simple Approach to Building Ensembles of Naive Bayesian Classifiers for Word Sense Disambiguation. In: Proceedings of 1st Annual Meeting of the
North American Chapter of the Association for Computational Linguistics. (2000)
63–69
19. Ahmed, F., Nurnberger, A.: Arabic/English Word Translation Disambiguation
Using Parallel Corpora and Matching Schemes. In: 12th EAMT conference. (2008)
6–11
20. Badr, I., Zbib, R., Glass, J.: Segmentation for English-to-Arabic Statistical Machine Translation. In: Proceedings of the ACL 2008 Conference Short Papers.
(2008) 153–156
21. Koehn, P., Hoang, H., Birch, A., Callison-Burch, C., Federico, M., Bertoldi, N.,
Cowan, B., Shen, W., Moran, C., Zens, R., Dyer, C., Bojar, O., Constantin, A.,
Herbst, E.: Moses: Open Source Toolkit for Statistical Machine Translation. Proceedings of the Annual Meeting of the Association for Computational Linguistics,
demonstation session (2007) 177–180
22. Och, F.J., Ney, H.: A Systematic Comparison of Various Statistical Alignment
Models. Computational Linguistics, Volume 29, No 1 (2003) 19–51
23. Stolcke, A.: Srilm – an Extensible Language Modeling Toolkit. In: ICSLP, Denver,
USA (2002) 901–904