
Journal of King Saud University – Computer and Information Sciences 29 (2017) 156–163


Morphological, syntactic and diacritics rules for automatic diacritization of Arabic sentences
Amine Chennoufi ⇑, Azzeddine Mazroui
Department of Mathematics and Computer Science, Faculty of Sciences, University Mohamed First, B-P 717, 60000 Oujda, Morocco

Article info

Article history:
Received 13 January 2016
Revised 25 May 2016
Accepted 23 June 2016
Available online 9 July 2016

Keywords:
Arabic language
Automatic diacritization
Arabic diacritical marks
Morphological analysis
Smoothing techniques
Hidden Markov model

Abstract

The diacritical marks of the Arabic language are characters other than letters and are in the majority of cases absent from Arabic texts. This paper presents a hybrid system for automatic diacritization of Arabic sentences combining linguistic rules and statistical treatments. The approach is based on four stages. The first phase consists of a morphological analysis using the second version of the morphological analyzer Alkhalil Morpho Sys. Morphosyntactic outputs from this step are used in the second phase to eliminate invalid word transitions according to the syntactic rules. Then, the third stage uses a discrete hidden Markov model and the Viterbi algorithm to determine the most probable diacritized sentence. The unseen transitions in the training corpus are processed using smoothing techniques. Finally, the last step deals with words not analyzed by the Alkhalil analyzer, for which we use statistical treatments based on the letters. The word error rate of our system is around 2.58% if we ignore the diacritic of the last letter of the word and around 6.28% when this diacritic is taken into account.

© 2016 The Authors. Production and hosting by Elsevier B.V. on behalf of King Saud University. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

⇑ Corresponding author.
E-mail addresses: chennoufi.amin@gmail.com (A. Chennoufi), azze.mazroui@gmail.com (A. Mazroui).
Peer review under responsibility of King Saud University.

1. Introduction

The diacritical mark is a sign accompanying a letter to modify the corresponding sound or to distinguish the word from another homonym word. Diacritical marks are widely used in Semitic languages such as Arabic and Hebrew, and in other languages like Urdu. The purpose of these signs is to clarify the morphological structure, the grammatical function, the semantic meaning of words and other linguistic and voice features (Debili and Achour, 1998). Diacritical marks are often absent in Arabic texts (Farghaly and Shaalan, 2009), unlike Latin languages like French, where the presence of vowels in the texts is mandatory (the vowels in Latin languages play in most cases the same function as diacritical marks in the Arabic language). Indeed, according to Habash (2010), diacritical marks are absent in 98% of Arabic texts, and an undiacritized word can have several potential diacritizations in over 77% of cases (Boudchiche and Mazroui, 2015).

Arabic diacritical marks are classified into three groups (Zitouni et al., 2006):

1) The first group consists of three single short diacritics: “ َ ” fatha, “ ُ ” damma and “ ِ ” kasra. Thus, by adding any of these signs to the letter “م” /m/¹, we obtain the following respective sounds: “مَ” /ma/, “مُ” /mu/ and “مِ” /mi/.
2) The second group represents the doubled case ending diacritics (called tanween): “ ً ” tanween fatha, “ ٌ ” tanween damma and “ ٍ ” tanween kasra. These diacritical marks are reserved only for the last letter of nominal words (nouns, adjectives and adverbs). This phenomenon, called “nunation”, has the phonetic effect of adding an “N” sound after the corresponding short vowel at the word ending. Thus, the letter “م” /m/ with these three signs gives the following sounds: “مًا” /mF/ (man), “مٌ” /mN/ (mon) and “مٍ” /mK/ (min).
3) The third group is called syllabification marks and is composed of “ ّ ” shadda (geminate: consonant is doubled in duration) and “ ْ ” sukun. The sukun indicates the absence of a short vowel and reflects a glottal stop, while the shadda reflects the doubling of a consonant and is always followed by a single diacritic or by a tanween. With the letter “م” /m/ and the diacritical mark fatha, we get “مَّ” /ma/.
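The three groups of marks listed above map directly onto Unicode combining characters, which makes them easy to manipulate programmatically. The following is a minimal sketch, not part of the described system: the dictionary and function names are illustrative assumptions, and only the eight marks named in the text are covered.

```python
# Unicode code points of the eight marks, grouped as in the text above.
SHORT_VOWELS = {"\u064E": "fatha", "\u064F": "damma", "\u0650": "kasra"}
TANWEEN = {
    "\u064B": "tanween fatha",
    "\u064C": "tanween damma",
    "\u064D": "tanween kasra",
}
SYLLABIFICATION = {"\u0651": "shadda", "\u0652": "sukun"}
DIACRITICS = {**SHORT_VOWELS, **TANWEEN, **SYLLABIFICATION}


def strip_diacritics(text):
    """Remove every diacritical mark, keeping only the letters."""
    return "".join(ch for ch in text if ch not in DIACRITICS)


def split_letters(word):
    """Return (letter, marks) pairs, attaching each mark to the letter before it."""
    pairs = []
    for ch in word:
        if ch in DIACRITICS and pairs:
            letter, marks = pairs[-1]
            pairs[-1] = (letter, marks + ch)
        else:
            pairs.append((ch, ""))
    return pairs


# The word /fahimotumo/ written with explicit code points.
word = "\u0641\u064E\u0647\u0650\u0645\u0652\u062A\u064F\u0645\u0652"
print(strip_diacritics(word))
print([[DIACRITICS[m] for m in marks] for _, marks in split_letters(word)])
```

Splitting a word into (letter, marks) pairs is convenient because both the letter-level model and the letter-level error rates discussed later operate on exactly this unit.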

¹ Buckwalter transliteration.

http://dx.doi.org/10.1016/j.jksuci.2016.06.004
1319-1578/© 2016 The Authors. Production and hosting by Elsevier B.V. on behalf of King Saud University.
A. Chennoufi, A. Mazroui / Journal of King Saud University – Computer and Information Sciences 29 (2017) 156–163 157

The diacritization of Arabic words operates at two levels: morphological and syntactic (Diab et al., 2007). The morphological level (lexical diacritics) concerns the internal diacritization of the word (the stem of the word without the last letter) and clarifies the meaning of the word. The syntactic level (case diacritics) concerns the diacritization of the last letter of the stem and is used to identify the syntactic role of words in the sentence. Lexical diacritics do not change with the position of the word in the sentence, while the case diacritic depends on it. Thus, the Arabic-speaking reader must understand the Arabic text before reading it properly (Elshafei et al., 2006). This is difficult for readers who do not have extensive knowledge of the Arabic language. Indeed, Hermena et al. (2015) studied the reaction of readers facing diacritized and undiacritized Arabic texts in an eye-tracking experiment. The results show that readers benefited from the disambiguation of words when diacritical marks are present.

The absence of diacritical marks is a source of complexity for automatic processing systems of the Arabic language, which cannot easily determine the meaning of the sentence (Said et al., 2013). Therefore, an automatic diacritization tool for Arabic is needed to remove ambiguity and improve the performance of Arabic language processing applications such as machine translation (Vergyri and Kirchhoff, 2004) and speech recognition (Messaoudi et al., 2004). The introduction of diacritical marks in the Arabic dialect speech corpus Levantine² (BBN/AUB Babylon DARPA) has helped to increase its reliability and efficiency (Alotaibi et al., 2013).

In addition, the lack of diacritical marks in Arabic sentences is the main cause of the confusion encountered during their analysis (Boudchiche and Mazroui, 2015; Debili and Achour, 1998). The study of Bouamor et al. (2015) showed that automatic text diacritization increases the quality of manual tagging of the corpus.

The objective of this paper is to present an automatic Arabic diacritization system combining linguistic rules and statistical treatments. This article is structured as follows. The second section presents the previous works in this area. The third section is devoted to the presentation of the different steps of our system. We describe the morphological analysis adopted in the first part of the system. Then, we explain the syntactic control used in the second part and some diacritical rules. We conclude that section by presenting the statistical model adopted in the third and fourth steps of the system. The fourth section deals with the experimentation and evaluation of the system. We end this paper with a conclusion and some perspectives.

2. Related work

Automatic diacritization approaches can be classified into four categories. The first one includes approaches based only on statistical processing. The second category includes hybrid approaches using a morphological analysis followed by a statistical processing. The third category consists of hybrid approaches using morphological analysis, syntactic rules and statistical processing. The last one contains the automatic diacritization systems developed by commercial companies. Approaches based solely on rules are rarely used because of their complexity, due to the high level of ambiguity and the large number of morphosyntactic rules (Debili and Achour, 1998).

2.1. Statistics-based models

Gal (2002) was one of the first to use an approach based on hidden Markov models (HMM) for the vocalization of Semitic texts. He tested his method on the Quran for Arabic and on the Old Testament for Hebrew. The developed application does not extend to all Arabic diacritical marks. Emam and Fischer (2005) extended the statistical processing of diacritization based on examples for Statistical Machine Translation (SMT). Alghamdi et al. (2010) introduced a method based on letter-level quad-grams. Hifny (2013) presented a statistical method based on n-grams and compared some smoothing techniques to treat the case of unseen transitions. More recently, Abandah et al. (2015) used a training phase based on recurrent neural networks (RNN) for automatically adding diacritical marks to Arabic text without relying on any prior morphological or contextual analysis. The diacritization is solved as a sequence transcription problem. Their approach uses a deep bidirectional long short-term memory network that builds high-level linguistic abstractions of text and exploits long-range context in both input directions.

2.2. Morphological hybrid approaches

These approaches use both morphological analysis and statistical processing. The works of Vergyri and Kirchhoff (2004) are among the first to use these approaches. Diacritical marks in Arabic conversations are restored by combining morphological and contextual information with a statistical labeling model (acoustic signal). However, they did not model the Shadda diacritic. Similarly, Nelken and Shieber (2005) presented a system that uses probabilistic finite-state automata, and incorporated a trigram model based on words, a quad-gram language model based on letters and an extremely simple morphological model to identify the prefix and the suffix of a word. Zitouni et al. (2006) combined a statistical model based on maximum entropy with the classification of words. The input parameters of this model are the single letters of the word, the morphological segments and the syntactic state. Habash and Rambow (2007) use the outputs of the morphological analyzer BAMA (Buckwalter, 2004) and individual taggers to choose among these outputs the one most selected by these taggers. Diab et al. (2007) were inspired by statistical machine translation (SMT), and they introduced six different diacritization schemes developed from observations of the naturally relevant diacritical marks. For these schemes, the morphological analyzer used was MADA (Habash et al., 2013). Recently, Bebah et al. (2014) exploited the morphological analyzer Alkhalil Morpho Sys (Bebah et al., 2011) in a process based on hidden Markov models.

2.3. Morphosyntactical hybrid approaches

These methods use both morphological and syntactic rules, and statistical processing. The architecture of the automatic diacritization system proposed by Shaalan et al. (2009) combines three approaches: automatic segmentation, part-of-speech (POS) tagging and chunk parsing. This method is based on lexicon extraction, the bi-gram model and support vector machines (SVM). The syntactic information is used to treat, for each word, the diacritical mark of its last letter in a separate final process. The solution proposed by Rashwan et al. (2011) uses in the first step morphological and syntactic information from the ArabMorp³ and ArabTagger⁴ tools, and then an n-gram model and the A* algorithm to select the most likely solution. Said et al. (2013) developed a system based on auto-correction, morphological analysis, part-of-speech tagging and a diacritization process for words unseen in the training corpus. Pasha et al. (2014) presented MADAMIRA (v1.0), a system for the morphological analysis and disambiguation of Arabic words in context. This system combines some aspects of both systems MADA (Habash et al., 2013) and AMIRA (Diab et al., 2007). MADAMIRA provides several morphosyntactical outputs including word diacritization. In the disambiguation step, this system uses the SVM model or the N-gram model. More recently, Shahrour et al. (2015) presented an automatic Arabic diacritization approach that provides the type of the word and the POS tag in context, using additional morphological and syntactic information to re-label the nominal output of the morphological analyzer MADAMIRA.

2.4. Applications developed by commercial companies

As for most applications of natural language processing, commercial companies have developed automatic diacritization systems, either standalone or as part of other applications such as a speech synthesizer or a word processor. Among the most interesting projects, we cite the diacritizer ArabDiac⁵ developed by the RDI company⁶, the mobile application Harakat developed by the company multillect⁷ and those developed by the IBM⁸, INFO ARAB–ISIS⁹, AppTek¹⁰, Sakhr¹¹ and Aljazeera¹² companies. Recently, Microsoft Research, a subsidiary of Microsoft Corporation, launched an automatic diacritization application for the Arabic language called Arabic Authoring Services¹³ (version 1.0) in the 2013 version of Microsoft Word.

² https://catalog.ldc.upenn.edu/LDC2005S08.
³ http://www.rdi-eg.com/technologies/Morpho.aspx.
⁴ http://www.rdi-eg.com/technologies/POS.aspx.
⁵ http://www.rdi-eg.com/technologies/Diac.aspx.
⁶ http://www.rdi-eg.com/.
⁷ https://multillect.com/apidoc/harakat.

3. Description of our automatic diacritization system

Given the morphological and syntactic richness of the Arabic language, the proposed solution for automatic diacritization reflects this richness and is performed in four stages (see Fig. 1). The first stage (module M2) performs a morphological analysis out of context and provides for each word all its possible diacritization forms. In the second step (module M3), the system uses the syntactic rules to eliminate invalid transitions. The third phase is devoted to statistical processing to choose the most likely among the solutions of the second phase. This is done through the use of HMM modeling (module M4), smoothing techniques (module M5) and the Viterbi algorithm (module M6). The last step (modules M7 and M8) treats the words not analyzed in the morphological stage. It consists of a statistical treatment similar to that of the third step with a model based on letters rather than words.

3.1. Morphological analysis

After pre-treatment of the undiacritized text (tokenization and normalization of words), and segmentation into sentences and then into words, the latter are treated with the second version of the Alkhalil Morpho Sys analyzer (Boudchiche et al., 2014). Thus, we get all possible diacritization forms of each word taken out of context, accompanied by their morphosyntactic information. Indeed, for each diacritization form, the system provides the stem, the clitics attached to the stem, the POS tags and the lemma. In the case of a noun or a verb, the system also provides the root, the syntactic form and the patterns of the stem and the lemma. We opted for this analyzer because its performance is much better than that of the first version of BAMA (Buckwalter, 2002) or the first version of the Alkhalil analyzer (Chennoufi and Mazroui, 2016). In particular, the rate of analyzed words is very high since it reached 98.49%.

It should be noted that when the Alkhalil system analyzes a word partially or totally vowelized, it only keeps the outputs whose diacritization is compatible with that of the input word.

3.2. Syntactic control

Most research on automatic diacritization has shown that the rate of syntactic errors (error on the last letter of the word) is at least as high as the rate of morphological errors (error related to the word without its last letter). These papers have recommended the use of syntactic rules for improving the performance of automatic diacritization (Chennoufi and Mazroui, 2016; Schlippe et al., 2008; Shaalan et al., 2009).

We have exploited the morphosyntactic information obtained from the morphological analysis to keep only the transitions between words that respect the linguistic rules of the Arabic language. We have therefore sought to use the majority of outputs provided by the Alkhalil analyzer. Thus, information such as POS tags (noun, verb or particle), syntactic form (genitive noun, jussive form of verbs, ...) and enclitics of words is very useful in this stage. For example, a preposition without suffix is always followed by a genitive noun. It means that only the transitions between prepositions and genitive nouns are kept. We have implemented 36 syntactic rules and we present in Table 1 some examples of them.

At the end of this step, if no transition between two successive words of a sentence is enabled by the 36 rules, we do not reject any transition for these two words.

3.3. Diacritic rules

After preliminary testing of our system, we noticed that a significant portion of diacritization errors come from the non-application of the rule relating to the succession of two sukun diacritics (“قاعدة التقاء الساكنين”). In this case, the second word always begins with the letter Alif “ا” /A/. To address this problem and improve the performance of our system, we have adopted in this case the following diacritic rules:

1) If the stem of the predecessor word is the preposition particle “مِنْ” /mino/, then the sukun of its last letter is replaced by the diacritic fatha (“مِنْ الْكِتَابِ” /mino AlokitaAbi/ (from the book) becomes “مِنَ الْكِتَابِ” /mina AlokitaAbi/).
2) If the predecessor word ends with the plural letter “م” /m/ (“ميم الجمع”), then the sukun of this last letter is replaced by the diacritic damma (“قَرَأْتُمْ الْكِتَابَ” /qaraOotumo AlokitaAba/ (you've read the book) becomes “قَرَأْتُمُ الْكِتَابَ” /qaraOotumu AlokitaAba/).
3) If neither of the above cases applies (the most common case), then the sukun at the last letter of the word is replaced by the diacritic kasra (“خُذْ الْكِتَابَ” /xu*o AlokitaAba/ (take the book) becomes “خُذِ الْكِتَابَ” /xu*i AlokitaAba/).
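The three diacritic rules of Section 3.3 can be sketched as a small post-processing function. This is only an illustration under stated assumptions: the trigger test (previous word ends with a sukun, next word starts with Alif followed by a letter carrying a sukun) is our reading of the examples, and the `prev_stem` and `plural_meem` parameters are hypothetical inputs standing in for information that the morphological analysis would supply.

```python
SUKUN, FATHA, DAMMA, KASRA = "\u0652", "\u064E", "\u064F", "\u0650"
ALIF, MEEM = "\u0627", "\u0645"
MIN = "\u0645\u0650\u0646\u0652"  # preposition /mino/ (from)


def meets_two_sukuns(prev_word, next_word):
    """Assumed trigger: prev_word ends with a sukun and next_word starts
    with Alif followed by a letter that carries a sukun (e.g. /Alo.../)."""
    return (
        prev_word.endswith(SUKUN)
        and len(next_word) >= 3
        and next_word[0] == ALIF
        and next_word[2] == SUKUN
    )


def apply_diacritic_rules(prev_word, next_word, prev_stem=None, plural_meem=False):
    """Return prev_word with its final sukun replaced according to rules 1 to 3.

    prev_stem and plural_meem are hypothetical inputs assumed to come from
    the morphological analysis; they are not named in the paper.
    """
    if not meets_two_sukuns(prev_word, next_word):
        return prev_word
    base = prev_word[:-1]  # drop the final sukun
    if (prev_stem or prev_word) == MIN:       # rule 1: /mino/ -> /mina/
        return base + FATHA
    if plural_meem and base.endswith(MEEM):   # rule 2: plural /m/ -> damma
        return base + DAMMA
    return base + KASRA                       # rule 3: default -> kasra


kitab = "\u0627\u0644\u0652\u0643\u0650\u062A\u064E\u0627\u0628\u064E"  # /AlokitaAba/
qaratum = "\u0642\u064E\u0631\u064E\u0623\u0652\u062A\u064F\u0645\u0652"  # /qaraOotumo/
xuth = "\u062E\u064F\u0630\u0652"  # /xu*o/
print(apply_diacritic_rules(MIN, kitab))
print(apply_diacritic_rules(qaratum, kitab, plural_meem=True))
print(apply_diacritic_rules(xuth, kitab))
```

Passing the plural-meem information as a flag rather than guessing it from the spelling reflects that a final “م” is not always a plural marker; only the analyzer output can tell.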
⁸ www.ibm.com.
⁹ http://www.isisintl.com/.
¹⁰ http://www.apptek.com/.
¹¹ http://www.sakhr.com/index.php/en/.
¹² http://learning.aljazeera.net/TextEditor.
¹³ https://store.office.com/arabic-authoring-services-WA104030856.aspx?assetid=WA104030856.

3.4. Statistical analysis at word level

After morphological analysis step that gives for each word all its possible diacritizations, and following the validation step of transitions between pairs of diacritized words and the application of diacritic rules, we present the third stage of diacritization

Figure 1. Overview of the automatic diacritization architecture. Modules shown: M1 preprocessing of words (input: undiacritized text); M2 morphological analysis by AlKhalil Morpho Sys 2, separating analyzed words from unknown words; M3 syntactic control of transitions and diacritics check of words; M4 hidden Markov model on word level; M5 smoothing technique; M6 choosing the optimal solution with the Viterbi algorithm on the words; M7 hidden Markov model on character level; M8 choosing the optimal solution with the Viterbi algorithm on the characters (output: diacritized text). The transition matrix A and the emission matrix B, on words and on characters, are produced by an HMM trainer and a smoothing estimator from a diacritized corpus.

process. It consists of a statistical treatment based on hidden Markov models and the Viterbi algorithm (Neuhoff, 1975), which provides the most likely diacritized sentence (Fig. 2). The observed states of the HMM are the Arabic words without diacritics (e.g. “فهمتم” /fhmtm/) and the hidden states are the diacritized word forms (e.g. “فَهِمْتُمْ” /fahimotumo/) (Elshafei et al., 2006; Bebah et al., 2014). This choice of states provided the best automatic diacritization scores compared to other hidden states such as lists of diacritical marks (Bebah et al., 2014). To smooth the valid transitions unseen in the training corpus, we used the Absolute Discounting smoothing technique (Ney and Essen, 1991), which achieved the highest scores in previous works (Hifny, 2013; Chennoufi and Mazroui, 2014).

3.5. Statistical analysis at letter level

During the test phase, another constraint was encountered, related to the words not analyzed by Alkhalil Morpho Sys, to which the label “unknown” was associated. The fourth phase of the diacritization system relates only to these cases. These words are not diacritized by the third stage of the system. Thus, for each unanalyzed word, another hidden Markov model is used, in which the Arabic letters are the observed states and the diacritized letters are the hidden states. The Viterbi algorithm is also used to choose the most probable solution.

4. Experimental phase

4.1. Methodology

To carry out the statistical phase, the transition and emission probabilities $a_{ij}$ and $b_i(t)$ are estimated during the training step (for details see Bebah et al., 2014). The estimation method is based on maximum likelihood (Manning and Schütze, 1999). We use the following notation:

$C = \{Ph_1, \ldots, Ph_M\}$ a representative corpus of Arabic texts formed by $M$ sentences $Ph_k$,

Table 1
Examples of syntactic rules used in the automatic diacritization system.

N° | Rules | Examples
1 | The preposition “حرف جر” is always followed by a genitive noun “اسم مجرور”. | Valid: “مِنَ الْمَدْرَسَةِ” /mina Alomadorasati/ (from the school). Not valid: “مِنَ الْمَدْرَسَةَ” /mina Alomadorasata/.
2 | The particle “لَمَّا” /lamaA/ is always followed by a verb in the past tense “فعل ماض” or an apocopative verb in the present tense. | Valid: “لَمَّا ذَهَبَ” /lamaA *ahaba/ (when he left). Not valid: “لَمَّا ذَهَبٌ” /lamaA *ahabN/. Valid: “لَمَّا يَذْهَبْ” /lamaA ya*ohabo/ (when he leaves). Not valid: “لَمَّا يَذْهَبَ” /lamaA ya*ohaba/.
3 | The relative pronoun “اسم موصول” is always followed by a nominative verb in the present tense “فعل مضارع مرفوع”, a verb in the past tense, a nominative noun “اسم مرفوع” or a particle “حرف”. | Valid: “الَّذِي يَكْتُبُ” /Ala*iy yaktubu/ (who writes). Not valid: “الَّذِي يَكْتُبَ” /Ala*iy yaktuba/. Valid: “الَّتِي أُمُّهَا” /Alatiy ÂumuhaA/ (whose mother). Not valid: “الَّتِي أُمَّهَا” /Alatiy ÂumahaA/.
4 | An adverb “ظرف” not attached to a pronoun is always followed by a genitive noun “اسم مجرور”, a demonstrative pronoun “اسم إشارة”, a relative pronoun “اسم موصول” or a particle “حرف”. | Valid: “فَوْقَ سُلَّمٍ” /fawoqa sulamK/ (above the stair). Not valid: “فَوْقَ سَلَّمَ” /fawoqa salama/. Valid: “مَسَاءَ الخَمِيسِ” /masaA’a Alxamiysi/ (on Thursday evening). Not valid: “مَسَاءَ الخَمِيسَ” /masaA’a Alxamiysa/.
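To illustrate how rules such as those in Table 1 interact with the word-level decoding of Section 3.4, the sketch below prunes transitions with one rule and runs a Viterbi search over the remaining ones. It is a simplified illustration, not the authors' implementation: the candidate dictionaries, the tag names (`prep`, `noun`, `genitive`) and the `log_trans` callback are hypothetical, initial probabilities are taken as uniform, and emission scores are omitted on the assumption that each diacritized form corresponds to a single undiacritized word.

```python
import math


def rule_preposition(prev, nxt):
    """Rule 1 of Table 1: a preposition must be followed by a genitive noun."""
    if prev["pos"] == "prep":
        return nxt["pos"] == "noun" and nxt["case"] == "genitive"
    return True


RULES = [rule_preposition]  # the paper uses 36 rules; one is shown here


def allowed_pairs(prev_cands, next_cands, rules=RULES):
    """Index pairs (i, j) of transitions accepted by every rule."""
    pairs = {
        (i, j)
        for i, p in enumerate(prev_cands)
        for j, n in enumerate(next_cands)
        if all(rule(p, n) for rule in rules)
    }
    if not pairs:  # fallback of Section 3.2: reject nothing for this word pair
        pairs = {(i, j) for i in range(len(prev_cands)) for j in range(len(next_cands))}
    return pairs


def viterbi(lattice, log_trans, rules=RULES):
    """lattice: one candidate list per word; log_trans(prev, nxt) -> log prob."""
    scores = [0.0] * len(lattice[0])
    back = []
    for t in range(1, len(lattice)):
        new_scores = [-math.inf] * len(lattice[t])
        pointers = [0] * len(lattice[t])
        for i, j in allowed_pairs(lattice[t - 1], lattice[t], rules):
            s = scores[i] + log_trans(lattice[t - 1][i], lattice[t][j])
            if s > new_scores[j]:
                new_scores[j], pointers[j] = s, i
        scores = new_scores
        back.append(pointers)
    j = max(range(len(scores)), key=scores.__getitem__)
    path = [j]
    for pointers in reversed(back):  # follow back-pointers to the first word
        j = pointers[j]
        path.append(j)
    path.reverse()
    return [lattice[t][k]["form"] for t, k in enumerate(path)]


lattice = [
    [{"form": "mina", "pos": "prep", "case": None}],
    [
        {"form": "Alomadorasata", "pos": "noun", "case": "accusative"},
        {"form": "Alomadorasati", "pos": "noun", "case": "genitive"},
    ],
]
print(viterbi(lattice, lambda prev, nxt: 0.0))
```

With uniform transition scores the rule alone decides the example, which shows why the syntactic control mainly affects the last-letter diacritic. A fuller version would also need to handle candidates that pruning leaves unreachable from every predecessor.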

$n_i^k$ = the number of occurrences of the hidden state $w_i$ (diacritized word) in the sentence $Ph_k$,
$n_{ij}^k$ = the number of occurrences of the transition from the hidden state $w_i$ (diacritized word) to the hidden state $w_j$ (diacritized successor word) in the sentence $Ph_k$,
$m_{it}^k$ = the number of occurrences of the undiacritized word $u_t$ with the hidden state $w_i$ in the sentence $Ph_k$,
$N_{1+}^k(w_i)$ = the number of different words occurring once or more after the diacritized word $w_i$ in the sentence $Ph_k$,
$P_{MLE}(w_j) = \frac{\sum_{k=1}^{M} n_j^k}{N}$: the maximum likelihood estimate of the word $w_j$ in the corpus $C$ of size $N$.

Then, the probabilities $a_{ij}$ and $b_i(t)$ can be estimated by the following formulas:

$$a_{ij} = \frac{\max\left\{\sum_{k=1}^{M} n_{ij}^k - D,\; 0\right\}}{\sum_{k=1}^{M} n_i^k} + \frac{D}{\sum_{k=1}^{M} n_i^k}\, P_{MLE}(w_j) \sum_{k=1}^{M} N_{1+}^k(w_i) \quad \text{and} \quad b_i(t) = \frac{\sum_{k=1}^{M} m_{it}^k}{\sum_{k=1}^{M} n_i^k} \qquad (1)$$

with the constant $D = 0.5$.

4.2. Training and test corpora

Our statistical model was trained on 90% of a large corpus of more than 72 million diacritized words. This training corpus was drawn at random. The remaining 10% (7,176,188 words) is used to test and evaluate our model. These corpora consist of the Tashkeela corpus¹⁴ (63 million diacritized words), the Nemlar corpus (0.5 million diacritized words) (Attia et al., 2005) and a part of the RDI corpus¹⁵ not redundant with the Tashkeela corpus (8.5 million diacritized words). They are composed of texts taken from diacritized old classic books and a few modern documents. The topics cover several thematic areas including theology, grammar, history, economy and geography.

The letter-based HMM, specific to the words not analyzed in the morphological step, was trained on the same corpus as the word-based HMM.

We observed that some texts contain partially diacritized words. These texts have been eliminated and are not part of the 72 million words used in the training and testing phases. Similarly, diacritical marks are not always arranged in the same way in all texts. Indeed, some diacritic writing rules sometimes differ from one Arab country to another and from one area to another. Thus, to evaluate our system we have standardized the diacritic spelling of the training and test corpora with the output of the Alkhalil analyzer. Finally, some spelling mistakes often appear in some texts of the corpus. We have corrected these errors.

¹⁴ http://sourceforge.net/projects/tashkeela/.
¹⁵ http://www.rdi-eg.com/RDI/TrainingData/.

4.2.1. Standardization of diacritic rules

By analyzing the writing rules of diacritical marks in the different texts, we found the following differences:

1) Diacritical marks on long vowels (Alif “ا” /A/, Waw “و” /w/, Yae “ي” /y/) have three forms of writing. The first form does not put diacritical marks on long vowels (“الْماليزِيونَ” /AlomAlyziywna/ (Malaysians)), the second puts them after the long vowels (“الْماَليِزِيوُنَ” /AlomAalyiziywuna/) and the third puts the diacritical mark before the long vowel (“الْمَالِيزِيُونَ” /AlomaAliyziyuwna/). We adopted this last rule because it is similar to that used by the Alkhalil analyzer.
2) The Tanween fatha sign with the letter Alif “ا” /A/ has two forms of writing: one before the letter (“سَلَامًا” /salaAmFA/ (peace)) and the other after the letter (“سَلَاماً” /salaAmAF/). The second form has been adopted.
3) The Shadda sign also presents two forms of writing: one before the diacritical mark and the other after the diacritical mark. The rule that we have adopted is always to write the Shadda sign before the diacritical mark.

We applied these three rules to all words of the corpora.

4.2.2. Correcting spelling errors

We also corrected some errors that were recurrent in the corpora.

1) In some cases, there are words with Alif maksoura “ى” /Y/ instead of the letter Yae “ي” /y/. Thus, whenever the letter Alif maksoura is accompanied by a diacritical mark, we replace it by the letter Yae (e.g. the word “عَلِىّ” /EaliYu/ is replaced with the proper name “عَلِيّ” /Ealiyu/).
2) Some words contain a succession of diacritical marks (“عَلَم”). In this case, we only keep the first diacritical mark and reject the others.

Table 2
Evaluation results of the three automatic diacritization systems on the test corpus.

Approach of automatic diacritization system | WER1 (%) | WER2 (%) | DER1 (%) | DER2 (%)
Morphological analysis + Statistics (Chennoufi and Mazroui, 2016) | 8.29 | 4.10 | 2.93 | 1.54
Morphological analysis + Diacritic rules + Statistics | 6.50 | 2.58 | 2.05 | 0.90
Morphological analysis + Diacritic rules + Syntactic rules + Statistics | 6.28 | 2.58 | 1.99 | 0.90

The lowest error rates are obtained by the third system (tied with the second system on WER2 and DER2).
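The four columns of Table 2 can be made concrete with a short sketch. Following the definitions of Section 4.3, WER1 and DER1 count errors including the diacritic of the last letter, WER2 and DER2 ignore it, and tokens without Arabic letters (numbers, punctuation) are skipped. The function names are illustrative assumptions, and the sketch assumes that reference and hypothesis are already aligned word by word and use the same standardized ordering of marks.

```python
DIACRITICS = set("\u064B\u064C\u064D\u064E\u064F\u0650\u0651\u0652")


def split_letters(word):
    """Return (letter, marks) pairs, attaching each mark to the letter before it."""
    pairs = []
    for ch in word:
        if ch in DIACRITICS and pairs:
            pairs[-1] = (pairs[-1][0], pairs[-1][1] + ch)
        else:
            pairs.append((ch, ""))
    return pairs


def is_arabic_word(token):
    """True if the token contains at least one Arabic letter (U+0621..U+064A)."""
    return any("\u0621" <= ch <= "\u064A" for ch in token)


def error_rates(reference, hypothesis):
    """Return WER1, WER2, DER1 and DER2 (in percent) for aligned word lists."""
    words = werr1 = werr2 = 0
    letters1 = letters2 = derr1 = derr2 = 0
    for ref, hyp in zip(reference, hypothesis):
        if not is_arabic_word(ref):
            continue  # numbers and punctuation are not evaluated
        r, h = split_letters(ref), split_letters(hyp)
        if len(r) != len(h):
            raise ValueError("reference and hypothesis letters do not align")
        wrong = [rm != hm for (_, rm), (_, hm) in zip(r, h)]
        words += 1
        werr1 += any(wrong)        # any letter wrong, last one included
        werr2 += any(wrong[:-1])   # last letter ignored
        letters1 += len(wrong)
        derr1 += sum(wrong)
        letters2 += len(wrong) - 1
        derr2 += sum(wrong[:-1])

    def pct(errors, total):
        return 100.0 * errors / total if total else 0.0

    return {
        "WER1": pct(werr1, words),
        "WER2": pct(werr2, words),
        "DER1": pct(derr1, letters1),
        "DER2": pct(derr2, letters2),
    }


mina = "\u0645\u0650\u0646\u064E"  # /mina/
kitabi = "\u0627\u0644\u0652\u0643\u0650\u062A\u064E\u0627\u0628\u0650"  # /AlokitaAbi/
kitaba = "\u0627\u0644\u0652\u0643\u0650\u062A\u064E\u0627\u0628\u064E"  # /AlokitaAba/
print(error_rates([mina, kitabi, "."], [mina, kitaba, "."]))
```

In this toy case the only error is on a case ending, so it appears in WER1 and DER1 but not in WER2 and DER2, which is the distinction the table is designed to expose.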

Figure 2. Example of using the Viterbi algorithm on an Arabic sentence to find the optimal solution.

3) Sometimes the letter Alif with hamza below “إ” /I/ is not accompanied by the diacritic kasra, which is its only possible diacritization. In this case, we add this diacritical mark.
4) The diacritic rules mentioned in Section 3.3 are not always respected in the corpora. We therefore apply these rules to all words of the corpora.

4.3. Results

Before presenting the results, it is important to explain the evaluation methodology both at the word and at the letter level. The error rate at the word level is noted WER (Word Error Rate) and the error rate at the letter level is noted DER (Diacritic Error Rate). For each of these two types of errors, we introduce the rate that takes into account the diacritical mark of the last letter and the one that ignores this diacritical mark. Consequently, WER1 represents the rate of words incorrectly diacritized by the system, taking into account the diacritic of the last letter. WER2 is defined as WER1 except that it ignores the diacritical mark of the last letter. Similarly, DER1 is the rate of letters incorrectly diacritized including the last letter, while DER2 is defined as DER1 but does not consider the last letter of the word. For this metric, numbers and punctuation are not considered in the evaluation process.

4.3.1. Contribution of syntactic and diacritic rules

To assess the impact of the integration of diacritic and syntactic rules, we evaluate three automatic diacritization systems. The first system is the one developed in a previous work (Chennoufi and Mazroui, 2016), which is based on morphological analysis and statistical treatments without syntactic and diacritic rules. The second system is obtained by integrating the diacritic rules in the first system, and the third is the one that incorporates both diacritic and syntactic rules. After completing the training steps on the same training corpus for these three systems, we tested them on the test corpus consisting of 7.17 million words. The different error rates for these three systems are shown in Table 2.

We note that the integration of diacritic rules has significantly [...] automatically counted in the calculation of WER1, we can assert that the integration of diacritic rules has mainly benefited WER2. Analyzing the results of the third system, which integrates syntactic rules, we find that these rules only corrected some errors made by the second system on the vowel of the last character of the word. Indeed, only WER1 decreased, from 6.50% to 6.28%, while WER2 remained unchanged. The other error rates related to letters (DER1 and DER2) also presented significant decreases. Thus, the integration of syntactic and diacritic rules allowed a significant improvement in the system performance.

4.3.2. Comparison with the results of the literature

To position our system with respect to other Arabic automatic diacritization applications, we compare the performance of our system with those of two other systems. The first one is the MADAMIRA system (Version 1 – 25/08/2014) and the second is Arabic Authoring Services (كُنْ) integrated with Microsoft Word (version 2013). We ran these three systems (MADAMIRA, Arabic Authoring Services and our system) on a random sample of 187,723 words from the test corpus. The outputs of these three systems have undergone the same standardization treatments of Sections 3.3, 4.2.1 and 4.2.2 above. The results of these evaluations are presented in Table 3.

The different error rates of MADAMIRA and Arabic Authoring Services (كُنْ) are relatively high. Indeed, the WER1 error rates of the MADAMIRA systems based on the SVM model and the language model are respectively equal to 36.07% and 27.29%. Similarly, the Arabic Authoring Services system (كُنْ) indicates 20.56% for WER1. However, our system shows a much lower rate, of the order of 6.22%. Similar remarks can be made for the other error rates WER2, DER1 and DER2.

The high error rate of the systems MADAMIRA and Arabic Authoring Services (كُنْ) can be explained in part by the nature of the test corpus. Indeed, this corpus is essentially made up of classical Arabic texts, while both systems are more suited to contemporary texts (MSA: Modern Standard Arabic).

On the other hand, to ensure an objective comparison between our system and some previous work like Abandah et al. (2015), which is the most recent work and announces the best results, we use the same evaluation metric of diacritization introduced by Zitouni et al. (2006) and adopted by Habash and Rambow (2007), Rashwan et al. (2011), Abandah et al. (2015) and other authors. For this metric, the numbers and the punctuations are also consid-

Table 3
Comparison between three Arabic automatic diacritization systems.

Automatic diacritization system | WER1 | WER2 | DER1 | DER2
improved the accuracy of the system. Indeed, WER1 decreased MADAMIRA (SVM) 36.07 20.21 12.66 7.12
from 8.29% for the system does not incorporate the diacritic rules MADAMIRA (language model) 27.29 16.14 9.21 5.56
Arabic Authoring services (‫) ُﻛ ْﻦ‬ 20.56 11.18 7.19 4.16
to 6.50% for one that incorporates these rules. Similarly, WER2
Our system 6.22 2.53 1.98 0.90
decreased from 4.10% for the first system to only 2.58% for the sec-
ond. Given that every word counted in calculating WER2 will be Best results are shown in boldface.
162 A. Chennoufi, A. Mazroui / Journal of King Saud University – Computer and Information Sciences 29 (2017) 156–163

Table 4 Bebah, M., Chennoufi, A., Mazroui, A., Lakhouaja, A., 2014. Hybrid approaches for
Performance comparison between Abandah diacritization system and our system. automatic vowelization of Arabic texts. Int. J. Nat. Lang. Comput. 3, 53–71.
http://dx.doi.org/10.5121/ijnlc.2014.3404.
Automatic diacritization system WER1 WER2 DER1 DER2 Bouamor, H., Zaghouani, W., Diab, M., Obeid, O., Oflazer, K., Ghoneim, M., Hawwari,
A., 2015. A pilot study on Arabic multi-genre corpus diacritization. In:
Abandah et al. (2015) 5.82 3.54 2.09 1.28
Proceedings of the Second Workshop on Arabic Natural Language Processing.
Our system 4.45 1.86 1.52 0.71
Association for Computational Linguistics, Stroudsburg, PA, USA, pp. 80–88.
Best results are shown in boldface. <http://dx.doi.org/10.18653/v1/W15-3209>.
Boudchiche, M., Mazroui, A., 2015. Evaluation of the ambiguity caused by the
absence of diacritical marks in Arabic texts: statistical study. In: 2015 5th
ered in the evaluation process. We tested our system on the same International Conference on Information & Communication Technology and
Accessibility (ICTA). IEEE, pp. 1–6. http://dx.doi.org/10.1109/ICTA.2015.
corpus used as a test corpus by Abandah et al. (2015). This corpus 7426904.
consists on ten books of Tashkeela corpus and the Quran. Table 4 Boudchiche, M., Mazroui, A., Bebah, M., Lakhouaja, A., 2014. L’Analyseur
below shows the scores of Abandah et al. (2015) and our system. Morphosyntaxique AlKhalil Morpho Sys 2. In: 1ère Journée Doctorale
Nationale Sur L’ingénierie de La Langue Arabe, (JDILA’14). Rabat, Morocco.
Table 4 shows that the error rates WER1 and DER1 of Abandah Buckwalter, T., 2002. Arabic Morphological Analyzer Version 1.0. Linguist. Data
system are respectively equal to 5.82% and 2.09%. Our system has a Consort. n° LDC2002L49.
lower error rate WER1 and DER1 respectively equal to 4.45% and Buckwalter, T., 2004. Arabic morphological analyzer version 2.0. Linguist. Data
Consortium. Univ. Pennsylvania.
1.52%. Chennoufi, A., Mazroui, A., 2014. Méthodes de lissage d’une approche morpho-
It should be noted that this assessment methodology is biased statistique pour la voyellation automatique des textes arabes. In: 21ème
and does not reflect the real performances of the system since Traitement Automatique Des Langues Naturelles. pp. 443–448.
Chennoufi, A., Mazroui, A., 2016. Impact of morphological analysis and a large
the punctuations and numbers are never diacritized in the Arabic training corpus on the performances of Arabic diacritization. Int. J. Speech
texts and their error rates are always equal to zero. Technol. 19, 269–280. http://dx.doi.org/10.1007/s10772-015-9313-5.
Debili, F., Achour, H., 1998. Voyellation automatique de l’arabe. In: Proc. Work.
Comput. Approaches to Semit. Lang. pp. 42–49.
4.4. Discussion
Diab, M., Hacioglu, K., Jurafsky, D., 2007. Automatic Processing of Modern Standard
Arabic Text. In: Arabic Computational Morphology. Springer, Netherlands,
The good performances of our system are consequences of: Dordrecht, pp. 159–179. http://dx.doi.org/10.1007/978-1-4020-6046-5_9.
Elshafei, M., Al-Muhtaseb, H., Alghamdi, M., 2006. Machine generation of arabic
diacritical marks. In: The 2006 World Congress in Computer Science Computer
1) The robustness of the second version of AlKhalil analyzer Engineering, and Applied Computing. Las Vegas, USA, pp. 128–133.
used by our system in the morphological stage; Emam, O., Fischer, V., 2005. Hierarchical approach for the statistical vowelization of
2) The use of syntactic and diacritic rules; arabic text. US 8,069,045 B2.
Farghaly, A., Shaalan, K., 2009. Arabic natural language processing. ACM
3) The strong representation of the corpus used in the training Trans. Asian Lang. Inf. Process. 8, 1–22. http://dx.doi.org/10.1145/1644879.
phase given its large size. 1644881.
Gal, Y., 2002. An HMM approach to vowel restoration in Arabic and Hebrew. In:
Proceedings of the ACL-02 Workshop on Computational Approaches to Semitic
The evaluation of this automatic diacritization system of Arabic Languages. Association for Computational Linguistics, Morristown, NJ, USA, pp.
sentences combining morphological analysis, syntactic and dia- 1–7. http://dx.doi.org/10.3115/1118637.1118641.
critic rules and statistical processing produces better performance Habash, N., Rambow, O., 2007. Arabic diacritization through full morphological
tagging. Hum. Lang. Technol. In: 2007 Conf. North Am. Chapter Assoc. Comput.
than other systems. The integration of syntactic rules has con- Linguist. Companion, vol. Short Pap.
tributed to the improvement of the error rate WER1, and they par- Habash, N., Roth, R., Rambow, O., Eskander, R., Tomeh, N., 2013. Morphological
ticularly allowed correcting some mistakes at the last character. In Analysis and Disambiguation for Dialectal Arabic. In: Proceedings of the 2013
Conference of the North American Chapter of the Association for Computational
the same, the integration of diacritic rules has reduced the error
Linguistics: Human Language Technologies (NAACL-HLT). Atlanta, GA, pp. 426–
rate WER2. 432.
Habash, N.Y., 2010. Introduction to Arabic natural language processing. Synth. Lect.
Hum. Lang. Technol. 3, 1–187. http://dx.doi.org/10.2200/
5. Conclusion S00277ED1V01Y201008HLT010.
Hermena, E.W., Drieghe, D., Hellmuth, S., Liversedge, S.P., 2015. Processing of Arabic
This paper presents a model of automatic Arabic diacritization diacritical marks: phonological–syntactic disambiguation of homographic verbs
and visual crowding effects. J. Exp. Psychol. Hum. Percept. Perform. 41, 494–
based on hybrid approach that combines the linguistic rules and 507. http://dx.doi.org/10.1037/xhp0000032.
statistical processing. The use of morphological, syntactic and dia- Hifny, Y., 2013. Restoration of Arabic diacritics using dynamic programming. In: 8th
critic rules combined with the hidden Markov models provides the International Conference on Computer Engineering & Systems (ICCES). IEEE, pp.
3–8. http://dx.doi.org/10.1109/ICCES.2013.6707161.
best performances. Indeed, the evaluation results are very encour-
Manning, C.D., Schütze, H., 1999. Foundations of Statistical Natural Language
aging and much better in comparison with other available systems. Processing. MIT Press.
Spelling errors in the training and testing corpora and their enrich- Messaoudi, A., Lamel, L., Gauvain, J.-L., 2004. The LIMSI RT-04 BN Arabic System. In:
Darpa RT04. Palisades NY.
ment by other texts will improve these scores. In addition, the inte-
Nelken, R., Shieber, S.M., 2005. Arabic diacritization using weighted finite-state
gration of other syntactic rules will contribute to decrease the error transducers. In: Proc. ACL Work. Comput. Approaches to Semit. Lang.
rates. Neuhoff, D., 1975. The Viterbi algorithm as an aid in text recognition (Corresp.).
IEEE Trans. Inf. Theory 21, 222–226. http://dx.doi.org/10.1109/TIT.1975.
1055355.
Ney, H., Essen, U., 1991. On smoothing techniques for bigram-based natural
References language modelling. [Proceedings] ICASSP 91: 1991 International Conference on
Acoustics, Speech, and Signal Processing, vol. 2. IEEE, pp. 825–828. http://dx.doi.
org/10.1109/ICASSP.1991.150464.
Abandah, G.A., Graves, A., Al-Shagoor, B., Arabiyat, A., Jamour, F., Al-Taee, M., 2015.
Pasha, A., Al-Badrashiny, M., Diab, M., Kholy, A. El, Eskander, R., Habash, N.,
Automatic diacritization of Arabic text using recurrent neural networks. Int. J.
Pooleery, M., Rambow, O., Roth, R.M., 2014. MADAMIRA: a fast, comprehensive
Doc. Anal. Recogn. 18, 183–197. http://dx.doi.org/10.1007/s10032-015-0242-2.
tool for morphological analysis and disambiguation of Arabic, in: Proceedings of
Alghamdi, M., Muzaffar, Z., Alhakami, H., 2010. Automatic restoration of Arabic
LREC. Reykjavik, Iceland.
diacritics: a simple, purely statistical approach. Arab. J. Sci. Eng. 35, 125–135.
Rashwan, M.A.A., Al-Badrashiny, M.A.S.A.A., Attia, M., Abdou, S.M., Rafea, A., 2011. A
Alotaibi, Y.A., Meftah, A.H., Selouani, S.-A., 2013. Diacritization, automatic
stochastic arabic diacritizer based on a hybrid of factorized and unfactorized
segmentation and labeling for Levantine Arabic speech. In: 2013 IEEE Digital
textual features. IEEE Trans. Audio. Speech. Lang. Process. 19, 166–175. http://
Signal Processing and Signal Processing Education Meeting (DSP/SPE). IEEE, pp.
dx.doi.org/10.1109/TASL.2010.2045240.
7–11. http://dx.doi.org/10.1109/DSP-SPE.2013.6642556.
Said, A., El-Sharqwi, M., Chalabi, A., Kamal, E., 2013. A hybrid approach for Arabic
Attia, M., Choukri, K., Yaseen, M., 2005. Specifications of the Arabic written corpus
diacritization. In: Natural Language Processing and Information Systems.
produced within the Nemlar project.
Springer, Berlin Heidelberg, pp. 53–64. http://dx.doi.org/10.1007/978-3-642-
Bebah, M., Meziane, A., Mazroui, A., Lakhouaja, A., 2011. Alkhalil Morpho Sys. In: 7th
38824-8_5.
International Computing Conference in Arabic. Riyadh, Saudi Arabia.
Schlippe, T., Nguyen, T., Vogel, S., 2008. Diacritization as a machine translation problem and as a sequence labeling problem. In: 8th AMTA Conference.
Shaalan, K., Bakr, H.M.A., Ziedan, I., 2009. A hybrid approach for building arabic diacritizer. In: Proceedings of the EACL 2009 Workshop on Computational Approaches to Semitic Languages. pp. 27–35.
Shahrour, A., Khalifa, S., Habash, N., 2015. Improving Arabic Diacritization through Syntactic Analysis. In: The 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP 2015. Association for Computational Linguistics, pp. 1309–1315.
Vergyri, D., Kirchhoff, K., 2004. Automatic diacritization of Arabic for acoustic modeling in speech recognition. In: Proc. Work. Comput. Approaches to Arab. Script-based Lang.
Zitouni, I., Sorensen, J.S., Sarikaya, R., 2006. Maximum entropy based restoration of Arabic diacritics. In: Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the ACL – ACL ’06. Association for Computational Linguistics, Morristown, NJ, USA, pp. 577–584. http://dx.doi.org/10.3115/1220175.1220248.
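
Illustration of the error rates of Section 4.3. The four rates WER1, WER2, DER1 and DER2 can be sketched in a few lines of Python. This is only a minimal sketch under stated assumptions, not the evaluation code used in the paper: the function names (split_diacritics, pct, error_rates) and the count_non_letters flag are hypothetical, the reference and predicted tokens are assumed to be aligned and to share the same base letters, and DER2 is assumed to drop the last letter from both the numerator and the denominator. Setting count_non_letters to True mimics the metric of Section 4.3.2, in which numbers and punctuation marks are also counted.

```python
import unicodedata


def split_diacritics(word):
    """Split a word into (base character, sorted diacritics) pairs."""
    units = []
    for ch in word:
        if unicodedata.combining(ch) and units:
            base, marks = units[-1]
            units[-1] = (base, marks + ch)  # attach the mark to the previous letter
        else:
            units.append((ch, ""))
    # Sort the marks so that shadda+vowel and vowel+shadda compare equal.
    return [(base, "".join(sorted(marks))) for base, marks in units]


def pct(count, total):
    """Percentage, defined as 0.0 when the denominator is empty."""
    return 100.0 * count / total if total else 0.0


def error_rates(reference, predicted, count_non_letters=False):
    """Return WER1, WER2, DER1 and DER2 (in %) over aligned token lists."""
    words = wrong1 = wrong2 = 0
    letters1 = bad1 = letters2 = bad2 = 0
    for ref_tok, hyp_tok in zip(reference, predicted):
        ref, hyp = split_diacritics(ref_tok), split_diacritics(hyp_tok)
        is_word = any(unicodedata.category(b).startswith("L") for b, _ in ref)
        if not is_word and not count_non_letters:
            continue  # numbers and punctuation are skipped (assumed Section 4.3 setting)
        # One flag per letter: True where the diacritics differ.
        flags = [r[1] != h[1] for r, h in zip(ref, hyp)]
        words += 1
        wrong1 += any(flags)        # word wrong, last letter included
        wrong2 += any(flags[:-1])   # word wrong, last letter ignored
        letters1 += len(flags)
        bad1 += sum(flags)
        letters2 += len(flags[:-1])
        bad2 += sum(flags[:-1])
    return {"WER1": pct(wrong1, words), "WER2": pct(wrong2, words),
            "DER1": pct(bad1, letters1), "DER2": pct(bad2, letters2)}


# Toy data: the second word differs only in the diacritic of its last letter.
FATHA, DAMMA = "\u064E", "\u064F"
KATABA = "\u0643" + FATHA + "\u062A" + FATHA + "\u0628" + FATHA
ALWALAD = "\u0627\u0644\u0648" + FATHA + "\u0644" + FATHA + "\u062F"
REF = [KATABA, ALWALAD + DAMMA, "."]
HYP = [KATABA, ALWALAD + FATHA, "."]

print(error_rates(REF, HYP))
print(error_rates(REF, HYP, count_non_letters=True))
```

On this toy input, the first call gives WER1 = 50.0, WER2 = 0.0, DER1 = 12.5 and DER2 = 0.0. Counting the full stop as a token lowers WER1 to about 33.3 and DER1 to about 11.1, which illustrates why a metric that includes punctuation marks and numbers yields lower error rates.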
