ICGST-AIML Journal, ISSN: 1687-4846, Volume 8, Issue III, December 2008

A Statistical Approach for Qur'an Vowel Restoration

A.A. EL-Harby, M.A. EL-Shehawey, R.S. El-Barogy
Mansoura University, Faculty of Science, Math. Dept., New Damietta, Egypt.
elharby@yahoo.co.uk

Abstract
This paper presents an automatic system that restores the diacritics (vowels) of non-diacritized Qur'an words, using a unigram baseline model and a bigram Hidden Markov Model (HMM). The proposed system is robust and reliable and uses no morphological analysis for diacritics restoration. HMMs were found to be useful tools for the task of diacritics restoration in the Arabic language. The technique is simple to apply and does not require any language-specific knowledge to be embedded in the model. The Qur'an was used as the corpus, and the system was implemented and tested on several portions of it as training sets. For instance, when the proposed system was applied to the first 1366 words of the Qur'an, the best performance was 94.3% word accuracy for the unigram model and 95.2% word accuracy for the bigram HMM.

Keywords: Diacritics in Arabic language, Vowel Restoration, Statistical Model, HMM.
1. Introduction
Arabic is currently the sixth most widely spoken language in the world, with an estimated 186 million native speakers. As the language of the Qur'an (the holy book of Islam), it is also widely used throughout the Muslim world. It belongs to the Semitic group of languages, which also includes Hebrew and Amharic (the main language of Ethiopia) [1]. Arabic is a member of a highly sophisticated category of natural languages with a very rich morphology, where one root can generate several words having different meanings [2]. Arabic is a diacritized language, yet Arabs often write non-diacritized text, which is not enough to determine the correct pronunciation of words in Text-To-Speech transformation. To solve this problem, a tool supplying full diacritization must be added to the transformation system [3]. Diacritics restoration is the problem of inserting diacritics into a text where they are missing. With the continuously increasing amount of text available on the Web, tools for the automatic insertion of diacritics have become an essential component of many important applications, such as Information Retrieval, Machine Translation, Corpora Acquisition, the construction of Machine Readable Dictionaries, and others [4-6].

Many systems for processing Arabic used either the Arabic script or a one-to-one transliteration of it as their input [8-9]; other systems used the Arabic script directly, for instance [10-11]. HMMs have traditionally been used to capture the contextual dependencies between words [13]. These models are important tools for vowel restoration in many languages, for instance French [14] and Spanish [15]. Gal [16] used unigram and bigram HMM models for vowel restoration of transliterated Hebrew and Arabic. His system needs the frequency of each word, calculated during processing. He noted that three Hebrew letters cannot be distinguished in written words, since they may denote either vowels or consonants; his model cannot resolve this ambiguity and treats all appearances of these letters as consonants. The same problem arises when such models are applied to a transliterated Qur'an. He considered only three vowels in the transliterated Qur'an he used [17], although there are eleven vowels (diacritics) in Arabic, so the obtained results are not precise. In Arabic script the diacritics are explicit and can be classified easily. Therefore, we apply the two models to the Qur'an written in Arabic, with the frequency of each word prepared beforehand for direct use in our system. This paper presents an automatic system that restores the diacritics of the holy Qur'an written in Arabic.

The paper is organized as follows: the characteristics of the Arabic language are presented in Section 2. The preparation of the required database is described in Section 3. The unigram model for Qur'an diacritics restoration is introduced in Section 4.1. In Section 4.2 the bigram HMM for Qur'an diacritics restoration is presented and the Hidden Markov Chain is calculated. The Viterbi algorithm is applied in Section 5. The results of the applied models are presented in Section 6. Finally, the conclusion and future work are discussed in the last sections.
2. Characteristics of Arabic language
In Arabic, vowelization is the process of placing special marks above or under the letters of a word. There are 11 vowelization marks in Arabic, called diacritic marks: Kasra, Fatha, Damma, Sokoon, Tanween Fatha, Tanween Damma, Tanween Kasra, Shadda, Kasra under Shadda, Fatha above Shadda, and Damma above Shadda. All diacritic marks are placed above the Arabic letters except the first and the seventh (Kasra and Tanween Kasra), which are placed under them. The last three marks consist of the first three marks accompanied by mark number eight, Shadda. In our work, the last three diacritic marks are therefore represented by the first three marks respectively plus mark number eight, so the first eight marks are enough to represent all eleven. In addition, letters written without a diacritic mark are assigned the "Null" mark. Each diacritic mark corresponds to a different phonetic value; the marks are listed in Appendix A.

The pronunciation of an Arabic word may differ slightly from sentence to sentence because of the diacritic of its last letter, which depends on the function of the word in the sentence. For instance, the last letter of a noun can carry the diacritic Kasra or Sokoon, while the diacritic of the last letter of other words (e.g. verbs, prepositions, conjunctions) is always Sokoon. Thus, determining the diacritic of the last letter requires some grammatical processing, which is not considered in the proposed system. In our approach, the set of all last diacritic marks (LDMs) contains Kasra (KA), Damma (DA), Fatha (FA), Sokoon (SO), Tanween Kasra (KK), Tanween Damma (DD), Tanween Fatha (FF), and Null (Nl, no diacritic); they are listed in Appendix B.

In Arabic, a word may have many different meanings depending on its vowelization. For instance, the Arabic word "ذهب" may mean either "go" or "gold". The absence of diacritic marks in Arabic text may therefore lead to ambiguity in understanding the intended meaning or to mispronunciation, which is why partial vowelization is sometimes used in written texts. Moreover, vowelization cannot be avoided in many applications, such as speech synthesis by machines and educational books for children [1]. In Arabic, there are on average almost five possible morphological analyses per word [18]. For instance, the word "كتاب" may be morphologically analyzed as the noun "كِتَابٌ" ("book"), the plural noun "كُتُبٌ" ("books"), or the verb "كَتَبَ" ("wrote") / "كُتِبَ" ("was written").

3. Database preparation
In this paper, the proposed approach is carried out on a Qur'an text written in Arabic, considering all eleven diacritic marks; a corpus of Qur'an text is therefore needed in which each word is fully voweled, i.e. supplied with its full diacritical marking. The whole Qur'an was downloaded from http://saaid.net/Quran/h.htm; this version contains all words with their full diacritic marks. The downloaded files are organized in our database in a systematic way. The whole Qur'an consists of 114 Soura and about 87803 words. Our approach needs the frequency of each word, so the number of occurrences was calculated for every word in the corpus and repeated words were removed. The corpus was thereby reduced to 17862 words (an 80% reduction); these words differ in their diacritic patterns. This big reduction improved the performance of the proposed system, since all data are ready for use.

After applying our system to the prepared corpus, it was found that many words were written incorrectly: the diacritics of some letters were missing. For instance, "الرحمن" must be written at a specified position in the corpus as "الرَّحْمَنِ", but it appears in incorrect patterns such as "الرَّحْمنِ" (missing the Fatha) and "الرّحْمَنِ" (a bare Shadda instead of Shadda with Fatha). Likewise, the word "الصَّلاةَ" is written incorrectly as "الصَّلاة" and "الصَلاةَ". The whole Qur'an was also downloaded from other sites, for instance http://www.holyquran.net/quran, and again many words were found to be written incorrectly. Correcting all these words is very difficult, and we preferred to apply our system to a correct corpus. Therefore, we corrected the first 82 sentences (1366 words), starting from the beginning of the Qur'an up to verse number 75 of Sourtu Al-Baqarah. In this corpus, sentences are delimited by a special character (#). Appendix C shows examples of the words and their frequencies. A minimal sketch of this preparation step is given below.
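The paper does not give code for this preparation step; the following Python sketch shows one plausible way to build such a table of word frequencies. It is illustrative only (all names are our own assumptions), and it assumes the corpus is a plain-text string of fully diacritized words with sentences delimited by "#", as described above.

# Illustrative sketch of the corpus preparation of Section 3 (not the
# authors' code). Assumes a plain-text corpus of fully diacritized
# words, with sentences delimited by '#'.
from collections import Counter

# Arabic diacritic code points: Tanween Fatha, Tanween Damma,
# Tanween Kasra, Fatha, Damma, Kasra, Shadda, Sokoon.
DIACRITICS = set('\u064B\u064C\u064D\u064E\u064F\u0650\u0651\u0652')

def strip_diacritics(word):
    """Return the non-diacritized (bare) form of a word."""
    return ''.join(ch for ch in word if ch not in DIACRITICS)

def build_frequency_table(corpus_text):
    """Map each bare form to a Counter of its diacritized variants."""
    table = {}
    for sentence in corpus_text.split('#'):
        for word in sentence.split():
            table.setdefault(strip_diacritics(word), Counter())[word] += 1
    return table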
4. Proposed System
This system was designed for Qur'an diacritics restoration based on a unigram model and a bigram HMM. These models were previously used for vowel restoration on a transliterated version of the Qur'an with only 3 diacritic marks [16], represented by the letters "i", "u", and "a", corresponding to Kasra, Damma, and Fatha respectively. Although Arabic has 11 diacritic marks, in transliteration it is difficult to tell whether these three letters denote vowels or consonants. As an example of the transliterated version, the seven verses of the first SOORA of the Qur'an, downloaded from www.sacred-texts.com, are listed in Fig. 1. In this figure, the words alAAalameena, alssirata, and almustaqeema end with the letter "a" representing a vowel, whereas the word Ihdina ends with the letter "a" representing a consonant. Therefore, the Qur'an written in Arabic is used as the corpus, considering all eleven diacritic marks.

Al-Fatihah: The Opening
1. Bismi Allahi alrrahmani alrraheemi
2. Alhamdu lillahi rabbi alAAalameena
3. Alrrahmani alrraheemi
4. Maliki yawmi alddeeni
5. Iyyaka naAAbudu wa-iyyaka nastaAAeenu
6. Ihdina alssirata almustaqeema
7. Sirata allatheena anAAamta AAalayhim ghayri almaghdoobi AAalayhim wala alddalleena

Figure 1: Example of the Arabic transliteration of the first SOORA of the Qur'an.

4.1 Unigram Model
Our system builds a frequency table containing a separate entry for each vowel-annotated word in the training data; see Appendix C. These frequencies are prepared in advance for direct use by the system. The proposed system picks the diacritized word with the highest count among the entries that share the same non-voweled structure. For example, to restore the diacritics of the word "بقرة" from the prepared corpus, the system searches the table for all words with the same non-voweled structure and finds two: "بَقَرَةً" with frequency 1 and "بَقَرَةٌ" with frequency 3. The system selects the second as the diacritized word. The achieved performance with this model was 94.3% word accuracy. The model always suggests the word with the higher frequency, but this is not always correct, especially for the last letter, whose diacritic mark depends on the function of the word in the sentence. The model proposed in the next section does not depend only on the highest frequency of each word. A sketch of the unigram lookup follows.
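Below is a minimal sketch of this lookup, reusing the hypothetical build_frequency_table from the previous sketch. On the example above it would return the variant of "بقرة" with frequency 3.

def restore_unigram(bare_word, table):
    """Unigram baseline: pick the diacritized variant with the highest
    corpus frequency among the entries that share this non-voweled
    structure; out-of-vocabulary words are returned unchanged."""
    variants = table.get(bare_word)
    if not variants:
        return bare_word
    return variants.most_common(1)[0][0]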
4.2 Bigram Model
This section describes our proposed method of Arabic diacritics restoration, which consists of two processes. The first selects the most likely sequence of LDMs for the non-diacritized words of a sentence; see Fig. 2(a). The second picks the diacritized words that have the same non-diacritized structure and end with the sequence of LDMs determined in the first process; see Fig. 2(b). To accomplish the first process we constructed a bigram HMM, similar to [19], in which the hidden states are the diacritic marks on the last letters of diacritized words and the observations are the non-diacritized words. The model consists of a set of hidden states {T1, T2, ..., TK}, each corresponding to an observed diacritic mark on the last letter of a diacritized word from the training corpus. Each diacritized word ending with the hidden state Ti produces a single emission, which is simply the word in its non-diacritized form.

The process of diacritics restoration is described as follows. Let W1, W2, ..., WN be a sequence of non-diacritized words (observations). We want to find the sequence of LDMs (hidden states) T1, T2, ..., TN that maximizes the probability P(T_{1,N} | W_{1,N}). By Bayes' rule, this probability can be written as

P(T_{1,N} | W_{1,N}) = P(W_{1,N} | T_{1,N}) \, P(T_{1,N}) / P(W_{1,N})    (1)

Since we are interested only in the sequence T1, T2, ..., TN that maximizes the probability in Eq. 1, the common denominator does not affect the result. Thus the problem reduces to finding the sequence that maximizes

P(T_{1,N} | W_{1,N}) \propto P(W_{1,N} | T_{1,N}) \, P(T_{1,N})    (2)

The second factor of Eq. 2 describes the probabilities of transitions through the states of the model, which are approximated by bigram counts. It can be decomposed as

P(T_{1,N}) = P(T_N | T_{1,N-1}) \, P(T_{N-1} | T_{1,N-2}) \cdots P(T_1 | T_0)    (3)

If we assume that the probability of a given LDM Ti depends only on the previous LDM Ti-1 (the bigram assumption), the following approximation can be used:

P(T_{1,N}) \approx P(T_N | T_{N-1}) \cdots P(T_1 | T_0) = \prod_{i=1}^{N} P(T_i | T_{i-1})    (4)

We denote the initial state of the HMM by the character "#", i.e. T0 = "#". The first factor of Eq. 2, P(W_{1,N} | T_{1,N}), can be approximated by assuming that words are independent of each other and that each word depends only on its own LDM:

P(W_{1,N} | T_{1,N}) \approx \prod_{i=1}^{N} P(W_i | T_i)    (5)

Substituting Eq. 4 and Eq. 5 into Eq. 2, we obtain

P(T_{1,N} | W_{1,N}) \propto \prod_{i=1}^{N} P(W_i | T_i) \, P(T_i | T_{i-1})    (6)

More formally, we define the process of Arabic diacritics restoration using a bigram HMM as finding

\arg\max_{T_{1,N}} P(T_{1,N} | W_{1,N}) = \arg\max_{T_{1,N}} \prod_{i=1}^{N} [ P(W_i | T_i) \, P(T_i | T_{i-1}) ]    (7)

The advantage of this equation is that the probabilities involved can readily be estimated from a corpus of diacritized text. The emission probability of each word, P(W_i | T_i), can be estimated simply by counting words by LDM:

P(W_i | T_i) = Count(W_i has the LDM T_i) / Count(T_i occurs as an LDM)    (8)

For instance, P(W_i = "عليهم" | T_i = SO) = Count("عليهم" has the LDM SO) / Count(SO occurs as an LDM) = 6/235.

The second term on the right-hand side of Eq. 7 is the bigram transition probability, which can be estimated simply by counting the occurrences of each pair of LDM categories relative to the occurrences of the individual category:

P(T_i = L_y | T_{i-1} = L_x) = Count(L_x at position i-1 and L_y at position i) / Count(L_x at position i-1)    (9)

where x, y \in {1, 2, ..., 8}. For instance, the probability of the FA mark following the KA mark is estimated by Eq. 9 as P(T_i = FA | T_{i-1} = KA) = 42/103 = 0.407767.

Table 1 gives all bigram frequencies calculated from the proposed corpus; each entry is the numerator of Eq. 9. The Hidden Markov Chain is obtained from Table 1 by dividing each value by the corresponding row sum (Sum-R); the resulting values are listed in Table 2. A sketch of this counting step is given below.
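The following sketch shows how the counts of Eqs. 8 and 9 can be collected; it assumes (our assumption, not stated in the paper) that each training sentence has already been reduced to a list of (non-diacritized word, LDM) pairs. With the counts of Appendix D and Table 1 it would reproduce, for example, P("عليهم" | SO) = 6/235 and P(FA | KA) = 42/103.

from collections import Counter, defaultdict

def estimate_parameters(sentences):
    """Estimate the bigram HMM parameters by counting (Eqs. 8 and 9).
    `sentences` is a list of lists of (bare_word, ldm) pairs, with the
    LDM one of KA, DA, FA, SO, KK, DD, FF, Nl."""
    emit_counts = defaultdict(Counter)    # LDM -> counts of bare words
    trans_counts = defaultdict(Counter)   # previous LDM -> next-LDM counts
    for sentence in sentences:
        prev = '#'                        # initial state T0 = '#'
        for bare, ldm in sentence:
            emit_counts[ldm][bare] += 1
            trans_counts[prev][ldm] += 1
            prev = ldm

    def emit_p(word, ldm):                # Eq. 8: P(W_i | T_i)
        total = sum(emit_counts[ldm].values())
        return emit_counts[ldm][word] / total if total else 0.0

    def trans_p(ldm, prev):               # Eq. 9: P(T_i | T_{i-1})
        total = sum(trans_counts[prev].values())
        return trans_counts[prev][ldm] / total if total else 0.0

    return emit_p, trans_p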
5. Implementation
This section presents the implementation of the bigram model using Eqs. 7-9. The Viterbi algorithm is used to find the most likely path through the hidden states [20]. When two or more paths end at the same state, the path with the highest probability is selected as the most likely path [21]; this algorithm is described in detail by Allen [22].

The probability of the best sequence leading to each possible LDM (category) at each word is stored in an array SCOR(K,N) of dimension K×N, where K is the number of LDMs (T1, ..., TK), equal to 8 in our corpus, and N is the number of words (W1, ..., WN) in the sentence. The element SCOR(i,j) contains the probability of the best sequence up to word Wj that assigns Wj the LDM Ti. To recover the actual best sequence, it is enough to record the preceding category for each category at each word: a second array, PTR(K,N), preserves the LDMs of the previous words and is used to determine the best sequence of diacritics. The structures of these two arrays are shown in Table 3 with a worked example. The algorithm is given in Fig. 3, and the overall system is shown in Fig. 2.

Figure 2: Application of the bigram model (flowchart). (a) Preparing the bigram HMM parameters: read the diacritized text, split it into words, and calculate the parameters of the bigram HMM. (b) Applying the HMM: read the non-diacritized text, split it into words, apply the bigram HMM, output the diacritized text, and evaluate.

Figure 3: The Viterbi algorithm.
1- Initialization step:
   For i = 1 to 8: SCOR(i,1) = P(W1 | Ti) × P(Ti | #); PTR(i,1) = 0.
2- Iteration step:
   For j = 2 to N, for i = 1 to 8:
     SCOR(i,j) = MAX over m = 1..8 of [SCOR(m,j-1) × P(Ti | Tm)] × P(Wj | Ti);
     PTR(i,j) = the index m that gave the maximum above.
3- Sequence identification step:
   T(N) = the i that maximizes SCOR(i,N);
   For i = N-1 down to 1: T(i) = PTR(T(i+1), i+1).

To illustrate the implementation of the Viterbi algorithm, it is carried out on the test sentence "صراط الذين أنعمت عليهم", using Eq. 8 (with the counts of Appendix D) and the Hidden Markov Chain shown in Table 2 (calculated by Eq. 9). The algorithm proceeds in the three steps described above. In the first step, the category of the first word is computed as SCOR(i,1) = P(صراط | Ti) × P(Ti | #), PTR(i,1) = 0, for all i = 1, 2, ..., 8. For instance, at i = 1 (T1 = KA), SCOR(1,1) = P(صراط | KA) × P(KA | #) = 0; at i = 3 (T3 = FA), SCOR(3,1) = P(صراط | FA) × P(FA | #) = 1.34×10^-3. The results of this step form the first column of the array SCOR; the maximum value in this column is 1.34×10^-3, corresponding to the category FA.

In the second step, the algorithm extends the sequence one word at a time and keeps track of the best sequence found to each category; these tracks are recorded in the array PTR. For instance, for the second word, SCOR(3,2) is the only non-zero value; it is computed as
SCOR(3,2) = MAX over m = 1..8 of [SCOR(m,1) × P(T3 | Tm)] × P(الذين | T3)
          = MAX[0 × 0.184, 0, 1.34×10^-3 × 0.307, ..., 0] × 0.0365
          = 1.34×10^-3 × 0.306792 × 0.0365 = 1.50×10^-5,
and PTR(3,2) = 3. (Here P(الذين | FA) = 18/493 = 0.0365 from Appendix D.) The results of the second word are shown in the second column of the array SCOR; see Table 3. The calculations for the remaining words of the test sentence are computed in the same way and listed in Table 3.

In the third step, the highest-probability sequence ends in the state (عليهم | SO). It is easy to track back from this category: T(4) = 4, meaning the LDM of the fourth word is SO, and T(3) = PTR(4,4) = 3, meaning the LDM of the third word is FA. In the same way, the most likely sequence is found to be FA FA FA SO. The Viterbi algorithm then ends, and the first process of the bigram model is complete. The application of the Viterbi algorithm to the test sentence is shown in Fig. 4, where ovals represent hidden states corresponding to LDMs and heavier arrows indicate the best sequence leading to each state. The second process of the bigram model then picks the diacritized words that have the same non-diacritized structure from our corpus. The obtained result for the test sentence is "صِرَاطَ الَّذِينَ أَنْعَمْتَ عَلَيْهِمْ". A runnable sketch of this search is given below.
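The sketch below implements the search of Fig. 3 in Python, using the emit_p and trans_p estimators from the previous sketch (again illustrative, not the authors' code). With the counts used in the paper it should reproduce the path FA FA FA SO of Table 3, e.g. via viterbi("صراط الذين أنعمت عليهم".split(), emit_p, trans_p).

LDMS = ['KA', 'DA', 'FA', 'SO', 'KK', 'DD', 'FF', 'Nl']

def viterbi(bare_words, emit_p, trans_p, ldms=LDMS):
    """Viterbi search over the eight LDM states (Fig. 3): return the
    most likely LDM sequence for a sentence of non-diacritized words."""
    K, N = len(ldms), len(bare_words)
    scor = [[0.0] * N for _ in range(K)]      # SCOR(i, j)
    ptr = [[0] * N for _ in range(K)]         # PTR(i, j)
    for i in range(K):                        # 1- initialization (T0 = '#')
        scor[i][0] = emit_p(bare_words[0], ldms[i]) * trans_p(ldms[i], '#')
    for j in range(1, N):                     # 2- iteration
        for i in range(K):
            m = max(range(K),
                    key=lambda k: scor[k][j - 1] * trans_p(ldms[i], ldms[k]))
            scor[i][j] = (scor[m][j - 1] * trans_p(ldms[i], ldms[m])
                          * emit_p(bare_words[j], ldms[i]))
            ptr[i][j] = m
    best = max(range(K), key=lambda i: scor[i][N - 1])
    path = [best]                             # 3- sequence identification
    for j in range(N - 1, 0, -1):
        path.append(ptr[path[-1]][j])
    return [ldms[i] for i in reversed(path)]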
6. Results
The proposed system was evaluated by measuring the percentage of words in the test set whose diacritic pattern was restored correctly, i.e. whose suggested diacritic pattern exactly matched the original (word accuracy). The system was applied with the two proposed models to different portions of the prepared corpus: 25%, 50%, 75%, and 100%. The unigram model obtained word accuracies of 94%, 94.3%, 93.4%, and 92.5% respectively; the bigram HMM obtained 95.2%, 94.8%, 93.7%, and 93% respectively. Fig. 5 shows the performance of the two models.

Figure 5: The results of the two models (word accuracy of the unigram and bigram models over the four corpus portions listed above).

7. Conclusion
The proposed system was used for diacritics restoration of Arabic words from the Qur'an, considering all eleven diacritic marks. The characteristics of Arabic were described in Section 2. In order to prepare our corpus, the whole Qur'an was downloaded from several sites; it was found that many words were written incorrectly, with the diacritics of some letters missing. Detecting and correcting these words throughout the whole Qur'an is very difficult and time-consuming, so we corrected only the first 1366 words, starting from the beginning of the Qur'an. Our approach needs the frequency of each word, so the corpus was organized systematically for direct use without recalculating the frequencies of all words, which reduced the processing time; see Section 3. It was found that HMMs are useful tools for the task of vowel restoration in Arabic words from the Qur'an. The system accuracy was 94.3% for the unigram model and 95.2% for the bigram HMM. A comparison between the proposed system and Gal's system [16] is shown in Table 4. Gal used the same models for vowel restoration of a transliterated Qur'an; his results did not represent all Arabic vowels, and our accuracy with the bigram HMM is better than his.

Table 4: Comparison with Gal's system.
                     Proposed System   Gal System [16]
Type of words        Arabic            transliterated
Considered vowels    11                3
Performance          95.2%             86%

Automatic diacritization also plays a vital role in other applications that process Arabic text, such as Text-to-Speech, as it disambiguates the various meanings and pronunciations of words. Many automatic systems that classify words and identify their types require input text in a diacritized Arabic format. The method is useful for any Arabic text, and our system is automatic and easy to use.

8. Future Work
Many words in the downloaded corpus were written incorrectly, with the diacritics of some letters missing; therefore, the size of our corpus was kept relatively small; see Section 3. We are going to correct the wrongly written diacritics in the whole Qur'an, in order to obtain correct diacritics restoration for all of its words.

9. References
[1] R. Al-Shalabi, G. Kanaan, and S. Alqrainy, "An Automatic System for Extracting Nouns from A Vowelized Arabic Text", International Arab Conference on Information Technology (ACIT2003), Alexandria, Egypt, 2003.
[2] R. Al-Shalabi, G. Kanaan, and H.M. AlSerhn, "New Approach for Extracting Arabic Roots", International Arab Conference on Information Technology (ACIT2003), Alexandria, Egypt, 2003.
[3] Katrin Kirchhoff, Dimitra Vergyri, Jeff Bilmes, Kevin Duh, and Andreas Stolcke, "Morphology-based language modeling for conversational Arabic speech recognition", Computer Speech & Language, 20(4): 589-608, 2006.
[4] Young-Sook Hwang, Andrew Finch, and Yutaka Sasaki, "Improving statistical machine translation using shallow linguistic knowledge", Computer Speech & Language, 21(2): 350-372, 2007.
[5] Loghman Barari and Behrang QasemiZadeh, "CloniZER Spell Checker, Adaptive, Language Independent Spell Checker", AIML 05 Conference, 19-21 December 2005, CICC, Cairo, Egypt.
[6] Mohamed Mohandes, "Automatic Translation of Arabic Text to Arabic Sign Language", AIML Journal, 6(4): 15-19, 2006.
[7] Yasser M. Abbass, Waleed Fakher, and Mohsen Rashwan, "Arabic / English Identification in a hybrid complex documents images", GVIP 05 Conference, 19-21 December 2005, CICC, Cairo, Egypt.
[8] C. Pisarn and T. Theeramunkong, "An HMM-based method for Thai spelling speech recognition", Computers & Mathematics with Applications, 54(1): 76-95, 2007.
[9] Samy Bengio, "Multimodal speech processing using asynchronous Hidden Markov Models", Information Fusion, 5(2): 81-89, 2004.
[10] Rajaram Sivasubramanian and Abhaikumar Varadhan, "An Efficient Implementation of IS-95A CDMA Transceivers through FPGA", DSP Journal, 6(1): 23-30, 2006.
[11] J. Allen, "Natural Language Understanding", second edition, The Benjamin/Cummings Publishing Company, Inc., 1995.
[12] Y. Gal, "Hebrew vowel restoration using a bigram HMM model", http://www.eecs.harvard.edu/~gal/Hebrew%20vowel%20restoration-final.doc.
[13] Zohar Eviatar and Raphiq Ibrahim, "Morphological Structure and Hemispheric Functioning: The Contribution of the Right Hemisphere to Reading in Different Languages", Neuropsychology, 21(4): 470-484, 2007.

10. Author Biography
A.A. El-Harby is an assistant professor in the Department of Mathematics and Computer Science, Faculty of Science at Damietta, Mansoura University. He received his B.Sc. degree in Computer Science from the Faculty of Science, Suez Canal University, his M.Sc. degree in Computer Science from the Faculty of Science at Damietta, Mansoura University, and his Ph.D. in Computer Science from Keele University, UK. His research interests include image processing, artificial intelligence, neural networks, remote sensing, fuzzy logic, and NLP. Address: Department of Maths., Faculty of Science, New Damietta 34517. Email: elharby@yahoo.co.uk

M.A. EL-Shehawey is an associate professor in the Department of Mathematics and Computer Science, Faculty of Science at Damietta, Mansoura University. He received his B.Sc. degree in Mathematics and his M.Sc. degree in Mathematics (Partial Differential Equations) from the Faculty of Science, Mansoura University, and his Ph.D. in Mathematics (Probability & Stochastic Processes) from Johannes Kepler University, Linz. His research interests include probability and stochastic processes, statistics, and partial differential equations. Address: Department of Math., Faculty of Science, New Damietta 34517. Email: el_shehawy@mans.edu.eg

R.S. El-Barogy is an assistant lecturer in the Department of Mathematics and Computer Science, Faculty of Science at Damietta, Mansoura University. He received his B.Sc. degree in Mathematics and Computer Science and his M.Sc. degree in Computer Science from the Faculty of Science at Damietta, Mansoura University. His research interests include Natural Language Processing and Speech Recognition. Address: Department of Math., Faculty of Science, New Damietta 34517. Email: elbarogy2000@yahoo.com
Table 1: Bigram frequencies from the proposed corpus (rows: state at position i-1; columns: state at position i).

State i-1 |  #   KA   DA   FA   SO  KK  DD  FF   Nl | Sum-R
#         |  0    4    3   54    6   0   1   0   14 |    82
KA        |  0   19    1   42   14   3   2   7   15 |   103
DA        |  0    7   19   32   11   2   2   3   32 |   108
FA        |  0   33   35  131  105   3  10   9  101 |   427
SO        |  0   12   14  102   33   5   6   9   54 |   235
KK        |  0    1    1    6    1   1   3   2    4 |    19
DD        |  0    0    2   16    7   0   6   0    7 |    38
FF        |  0    0    3   20    8   0   0   1    6 |    38
Nl        |  0   33   34   90   50   6  12   7   84 |   316
Sum-C     |  0  109  112  493  235  20  42  38  317 |  1366

Table 2: Bigram probabilities from the proposed corpus (Hidden Markov Chain); each entry of Table 1 divided by its row sum Sum-R.

State i-1 |   #     KA        DA        FA        SO        KK        DD        FF        Nl
#         |  0.0  0.048780  0.036585  0.658537  0.073171  0.0       0.012195  0.0       0.170732
KA        |  0.0  0.184466  0.009709  0.407767  0.135922  0.029126  0.019417  0.067961  0.145631
DA        |  0.0  0.064815  0.175926  0.296296  0.101852  0.018519  0.018519  0.027778  0.296296
FA        |  0.0  0.077283  0.081967  0.306792  0.245902  0.007026  0.023419  0.021077  0.236534
SO        |  0.0  0.051064  0.059574  0.434043  0.140426  0.021277  0.025532  0.038298  0.229787
KK        |  0.0  0.052632  0.052632  0.315789  0.052632  0.052632  0.157895  0.105263  0.210526
DD        |  0.0  0.0       0.052632  0.421053  0.184211  0.0       0.157895  0.0       0.184211
FF        |  0.0  0.0       0.078947  0.526316  0.210526  0.0       0.0       0.026316  0.157895
Nl        |  0.0  0.104430  0.107595  0.284810  0.158228  0.018987  0.037975  0.022152  0.265823

Table 3: The arrays in our system for the test sentence; DW and NDW mean diacritized and non-diacritized words respectively.

(a) The SCOR array:
NDW  |  صراط        الذين        أنعمت        عليهم
KA   |  0            0            0            0
DA   |  0            0            2.19×10^-8   3.44×10^-11
FA   |  1.34×10^-3   1.50×10^-5   9.31×10^-9   0
SO   |  0            0            0            5.85×10^-11
KK   |  0            0            0            0
DD   |  0            0            0            0
FF   |  0            0            0            0
Nl   |  0            0            0            0

(b) The PTR array:
NDW  |  صراط   الذين   أنعمت   عليهم
KA   |  0       0       0       0
DA   |  0       0       3       2
FA   |  0       3       3       0
SO   |  0       0       0       3
KK   |  0       0       0       0
DD   |  0       0       0       0
FF   |  0       0       0       0
Nl   |  0       0       0       0
Best path: FA FA FA SO

Figure 4: The results obtained from the bigram HMM on the test sentence (trellis diagram; ovals represent hidden states corresponding to LDMs, heavier arrows indicate the best sequence leading to each state).

Appendix A: The Arabic diacritics ("ـــ" is used only to show the position of the diacritics).

Diacritic name       Mark        Diacritic name         Mark
Kasra                ــِـ         Tanween Kasra          ــٍـ
Fatha                ــَـ         Shadda                 ــّـ
Damma                ــُـ         Kasra under Shadda     ــِّـ
Sokoon               ــْـ         Fatha above Shadda     ــَّـ
Tanween Fatha        ــًـ         Damma above Shadda     ــُّـ
Tanween Damma        ــٌـ         Null                   ـــ

Appendix B: The diacritics of the last letter (LDM).

State number   Name            Abbreviation
1              Kasra           KA
2              Damma           DA
3              Fatha           FA
4              Sokoon          SO
5              Tanween Kasra   KK
6              Tanween Damma   DD
7              Tanween Fatha   FF
8              Null            Nl

Appendix C: Frequencies of words taken from our database (f = frequency).

Word        f  |  Word          f  |  Word         f
بِسْمِ        2  |  أَشَدُّ          1  |  أَندَادًا       1
آتُوا        1  |  أَصَابِعَهُمْ      1  |  أُنزِلَ         2
آتَيْنَا       1  |  أَصْحَابُ        1  |  أَنزَلَ         1
آدَمَ         1  |  أَضَاءتْ         1  |  أَنزَلْنَا       1
آدَمُ         3  |  أَظْلَمَ          1  |  أَنْعَمْتَ        1
آذَانِهِم      1  |  أُعِدَّتْ         1  |  أَنْعَمْتُ        2
آلِ          1  |  أَعْلَمُ          3  |  أَنفُسَكُمْ       3
آلَ          1  |  أَعُوذُ          1  |  أَنفُسَهُم       1
آمَنَ         3  |  أَغْرَقْنَا        1  |  أَنفُسَهُمْ       1

Appendix D: A summary of word counts by LDM from the corpus.

Word     |  KA   DA   FA   SO  KK  DD  FF   Nl | TOTAL
الله      |   9   12    5    0   0   0   0    0 |    26
الكتاب    |   0    1    2    0   0   0   0    0 |     3
الذين     |   0    0   18    0   0   0   0    0 |    18
الرحيم    |   1    2    0    0   0   0   0    0 |     3
أنعمت     |   0    2    1    0   0   0   0    0 |     3
بقرة      |   0    0    0    0   0   2   1    0 |     3
ظلمات     |   0    0    0    0   1   1   0    0 |     2
عليهم     |   0    1    0    6   0   0   0    0 |     7
صراط      |   0    0    1    0   0   0   0    0 |     1
others   |  99   94  466  229  19  39  37  316 |  1299
TOTAL    | 109  112  493  235  20  42  38  317 |  1366