A Novel Root Based Arabic Stemmer: Journal of King Saud University - Computer and Information Sciences

Journal of King Saud University - Computer and Information Sciences (2015) 27, 94–103
A novel root based Arabic stemmer

Mohammed N. Al-Kabi (a), Saif A. Kazakzeh (b), Belal M. Abu Ata (b), Saif A. Al-Rababah (c), Izzat M. Alsmadi (d,*)
(a) Faculty of Sciences and IT, Zarqa University, P.O. Box 2000, 13110 Zarqa, Jordan
(b) CIS Department, IT & CS Faculty, Yarmouk University, 21163 Irbid, Jordan
(c) Information Systems Department, IT Faculty, Al-albayt University, Jordan
(d) Computer Science Department, Boise State University, Boise, ID 83725, USA
Received 26 December 2012; revised 7 December 2013; accepted 3 April 2014

Available online 21 March 2015
KEYWORDS Natural Language Processing (NLP); Computational intelligence; Stemming; Information retrieval

Abstract Stemming algorithms are used in information retrieval systems, indexers, text mining, text classifiers etc., to extract stems or roots of different words, so that words derived from the same stem or root are grouped together. Many stemming algorithms were built in different natural languages. Khoja stemmer is one of the known and widely used Arabic stemmers. In this paper, we introduced a new light and heavy Arabic stemmer. This new stemmer is presented in this study and compared with two well-known Arabic stemmers. Results showed that accuracy of our stemmer is slightly better than the accuracy yielded by each one of those two well-known Arabic stemmers used for evaluation and comparison. Evaluation tests on our novel stemmer yield 75.03% accuracy, while the other two Arabic stemmers yield slightly lower accuracy.
© 2015 The Authors. Production and hosting by Elsevier B.V. on behalf of King Saud University. This is
an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
* Corresponding author.
Peer review under responsibility of King Saud University.
http://dx.doi.org/10.1016/j.jksuci.2014.04.001

1. Introduction

Semitic languages are mainly used in the Middle East and North Africa. Arabic is currently the most widely used Semitic language, since it is the native language of more than 290 million people worldwide (Arabic language, 2015). These Semitic languages are written from right to left. Most Semitic scripts use the Abjad style. Abjad is a type of alphabet that omits some or all vowels. Not all Semitic languages use a cursive style (Abjad, 2012; Semitic languages, 2012) like the Arabic language (Arabic language, 2015). Semitic languages use non-concatenative (i.e. discontinuous) morphology to form words which represent a modified version of roots (Non-concatenative morphology, 2012; Semitic languages, 2012). Most Semitic roots consist of three consonants (triliteral) (Semitic languages, 2012). Affixes are used by Semitic languages. However, most of the words are formed by inserting vowels between the root consonants (Semitic languages, 2012). Therefore extracting the roots of Semitic words is usually not a trivial process.

The official Arabic language, also called Modern Standard Arabic (MSA) or Literary Arabic, is widely used in schools, universities, academic establishments, newspapers, radio, TV stations, government agencies, etc. Arabic language is
based on 28 letters, where the shapes of some of these letters change according to their location in the word. In addition, these letters can be joined together or written separately based on their location in the word. Several vowel diacritics are used, especially in the Holy Quran and in classical poetry.

Not all Arabic words used in MSA are native Arabic words derived from a three-consonant (i.e. triliteral) origin. Examples include the following Arabic words, which lack authentic Arabic roots since they are phonetically modified Arabic versions of words that originate in other languages: (Television, ), (Programmer, ), (Telephone, ), (Computer, ), (Dictionary, ), (Chemistry, ), (Physics, ), (Geography, ), (Lemon, ), (Orange, ).

In natural languages it is normal to find a number of words derived from the same root or stem. Stemming is the process of extracting the root of each word, in order to treat a group of words that are derived from the same root as synonyms, since they are supposed to refer to the same concept. However, in reality not all words derived from the same root refer to the same concept. Stemming is widely used in information retrieval, text mining, text classification, etc.

The following four Arabic words (Written, ), (Writings, ), (Writer, ), (Book, ) are derived from the same Arabic triliteral origin verb (Wrote, ). They also refer to the same concept. Therefore stemming these four Arabic words is useful for some relevant tasks. On the other hand, stemming the following two Arabic words (accountant, ) and (computer, ), which are derived from the same Arabic triliteral verb (counted, ), is not beneficial, since these two Arabic words are not synonyms and refer to two different concepts. Further, the following four Arabic words: (Books, ), (Office, ), (Library, ), (Writing, ) represent four different concepts that are derived from the same Arabic triliteral verb (Wrote, ). These examples show that Arabic stemming is not always straightforward: even if an automatic extraction tool is very accurate, some of its stemming results are not relevant when the semantics are evaluated.

There are two types of stemming. The first type is light stemming, which is used to remove affixes (i.e. prefixes and suffixes), while the second type is heavy stemming (i.e. root stemming), which is used to extract the root of the words and implicitly includes light stemming.

In this study, a novel Arabic stemming algorithm is proposed, implemented, and tested. The algorithm applies both the light and heavy (root) stemming techniques on Arabic words to extract the triliteral roots of words. Our Arabic stemming algorithm is not dictionary based. The tests conducted on this stemming algorithm reveal an accuracy of 75.03%. The results are compared with two Arabic stemmers described in previous research papers.

The rest of this article is organized as follows: Section 2 presents the related work, Section 3 presents the methodology adopted in this study, and Section 4 presents the experiments conducted to demonstrate the validity of the proposed algorithm, together with an analysis and a comparison between our stemmer and two known Arabic stemmers. Finally, Section 5 presents the conclusion and future work.

2. Related Work

Several research papers and projects have proposed Arabic stemmers (e.g. Al-Shalabi and Evens, 1998; Khoja and Garside, 1999; Abu-Salem et al., 1999). There are many studies that present examples of Arabic stemming algorithms and their effectiveness. Most of these studies claim an accuracy which exceeds 85%. It is impossible to verify these claims due to the lack of the source code and the datasets which were used in the testing process.

The study of Chen and Gey (2002) is not purely dedicated to the construction of Arabic stemmers, since it aims to study English-Arabic cross-language retrieval (CLIR). Therefore the paper constructed two Arabic stemmers besides an Arabic stop word list. They used a simple program which is restricted to removing major Arabic prefixes: the definite article (Alif-laam, ), and four plural suffixes: (Alif-taa, ), (Alif-nuun, ), (Waaw-nuun, ) and (Taa, ). Then they built two stemmers. The first one is called the MT-based Arabic stemmer, which uses the online Ajeeb machine translation system to translate Arabic words to English. These words are then partitioned into groups or clusters, where each group of Arabic words has a common English stem. Next, the MT-based Arabic stemmer selects the shortest word in the cluster and considers it the Arabic stem for all the Arabic words in the cluster. The second Arabic stemmer is called the light stemmer, and its main task is to remove the most frequently used Arabic prefixes and suffixes.

In their study Larkey et al. (2002) constructed and tested a number of Arabic light stemmers. Their tests showed that information retrieval systems (IRSs) which use the best light stemmers are much more effective than those that use morphological stemmers attempting to find the Arabic root. They also concluded that using the best light stemmer within an IRS is better than avoiding stemming, using co-occurrence analysis to produce stem classes, or using very light stemmers. Many think that light stemming is much easier and more accurate than heavy (root-based) stemming, since light stemming is restricted to stripping off predetermined Arabic affixes (prefixes and suffixes) from Arabic words. In reality, in many situations the Arabic affix could be part of the root (e.g., (Governor, )). Therefore the light stemmer should decide whether to remove the affix if it is really an affix, or to keep it if it is part of the Arabic root. Nwesri et al. (2005) exhibited in their study three novel techniques to remove Arabic prefixes (i.e. Arabic prepositions and conjunctions) from Arabic words inputted to their light stemmers. Since those are Arabic light stemmers, they could not be benchmarked against our new root-based stemmer.

Most of the Arabic words are derived from triliteral Arabic roots. However, there are very few quadri-literal Arabic roots relative to the number of triliteral Arabic roots. Kanaan et al. (2004) presented a novel stemming algorithm dedicated to Arabic words derived from quadri-literal Arabic roots only, and used a limited set consisting of 145 Arabic words. The stemmer of Kanaan et al. (2004) yields 95% accuracy. Our study differs from that of Kanaan et al. (2004) in data size, which is much larger here, and in scope: their study is restricted to Arabic words derived from quadri-literal Arabic roots, while this one is designed for Arabic words derived from triliteral Arabic roots.
The study of Taghva et al. (2005) presents the construction of a heavy (root-based) stemmer which does not rely on any dictionary of Arabic roots. The authors claimed that the effectiveness of their Arabic stemmer is equivalent to that of the known stemmer of Khoja and Garside (1999). In addition, they found that the ability of the root-based stemmer to find the right Arabic root is not an essential issue in monolingual Arabic information retrieval. Most of the studies related to Arabic stemmers are either based on a dictionary of Arabic roots or use a set of rules to identify the verb patterns of the Arabic words in order to find the Arabic roots. These stemmers are accurate but consume a lot of the computation resources of the computer system, as the study of Al-Serhan and Ayesh (2006) claimed. Therefore those researchers present an Arabic stemmer based on neural networks, which is characterized by its efficiency and effectiveness. They also stated that the capabilities of their novel stemmer are restricted to finding the root of Arabic words derived from triliteral Arabic roots. The stemmer is limited to Arabic words which consist of no more than five Arabic letters.

A novel Arabic morphological analysis method is presented by Al-Sughaiyer and Al-Kharashi (2006). The main idea of their novel method is based on the verb pattern similarity of words derived from various Arabic roots. Their method is characterized by its simple computation and its accuracy.

Another Arabic root-based stemmer is proposed by Al-Shalabi et al. (2007). Their stemmer is characterized by its capability to find Arabic triliteral, quadri-literal and penta-literal roots. The effectiveness of their stemmer is 95%, but their study does not refer to the dataset they used, and the stemmer is not offered online to the public. Therefore it is excluded from the benchmarking of this study.

Momani and Faraj (2007) presented another novel Arabic root-based stemmer to extract triliteral Arabic roots with 73% accuracy using a dataset of more than 1500 Arabic words. They presented their Arabic stemmer with preliminary examples.

Similar to the popularity of the Porter stemmer for English, the Khoja stemmer (Khoja and Garside, 1999) became popular for Arabic stemming through many relevant citations. One of the attempts to improve over the Khoja stemmer was presented by Kchaou and Kanoun (2008). They adopt two dictionaries of Arabic roots, one for normal Arabic roots and the other for radical Arabic roots, while the Khoja stemmer was based on only one dictionary of Arabic roots. The authors tested their stemmer using 200,000 Arabic words, for which they claimed 98% accuracy.

Most of the research studies related to reducing Arabic words to their stems or roots in information retrieval and Natural Language Processing (NLP) concentrate on the construction of Arabic stemmers, whether they are light or heavy (i.e. root-based) (Mustafa, 2002; Al-Sawadi and Khayat, 1996). In addition, lemmatization algorithms are used to obtain the roots, where lemmatizers are more robust than their counterparts since they depend on morphological analysis and vocabulary usage. Al-Shammari and Lin (2008a,b) presented the first Arabic lemmatizer. The authors claimed that their tests showed that their lemmatization algorithm is better than Khoja's Arabic root-based stemmer when these two different algorithms are used to cluster Arabic text documents.

Al-Shammari and Lin (2008a,b) present a new Arabic stemmer called Educated Text Stemmer (ETS). ETS is characterized by its efficiency, since it does not rely on any root dictionary, so it needs less storage space and less computational time relative to its counterparts. They also claim that their stemmer is effective and better than others, since for example it uses Arabic stop-words, which are neglected by other stemmers, to improve the extracted stems. In addition, ETS was capable of identifying Arabic nouns and verbs.

Ghwanmeh et al. (2009) present another Arabic root-based algorithm based on the Arabic morphological patterns. The capability of the stemmer proposed by Ghwanmeh et al. (2009) is restricted to native Arabic words that consist of four or more Arabic letters. However, this case was treated very well by the Khoja Arabic stemmer (Khoja and Garside, 1999) as well as in our new stemmer. The stemmer proposed in (Ghwanmeh et al., 2009) checks the length of the inputted Arabic word to determine whether to proceed with the necessary steps to extract the Arabic root, or to leave the word as is if the input length is less than 4. When the evaluated Arabic word has a length that exceeds 3, the algorithm starts with the normalization of some of its Arabic letters and then starts stripping off prefixes and suffixes. Afterward their proposed algorithm starts matching the extracted word with 81 Arabic triliteral verb patterns (Forms, ). Those authors tested their algorithm using a dataset of Arabic words extracted from a corpus of 242 abstracts from the proceedings of the Saudi Arabian national computer conferences. Tests of their Arabic stemmer yield an accuracy of 95%. The stemmer of Ghwanmeh et al. (2009) is benchmarked in this study, but the dataset they used is not adopted here since it is limited and restricted to computer-based topics, while ours includes commonly used Arabic words which cover different aspects of our life.

The study of Hmeidi et al. (2010) exhibits a novel bigram-based Arabic stemming algorithm. The authors used two similarity measures (Manhattan and Dice). They tested their algorithm on the Holy Quran and a corpus of 242 abstracts. They claimed that their algorithm was capable of extracting triliteral, quadriliteral, pentaliteral, hexaliteral, and heptaliteral Arabic roots. Tests of their stemmer revealed that using bigrams with the Dice measure yields better Arabic roots than using bigrams with the Manhattan distance measure.

Abu Ata and Al-Omari (2014) proposed an Arabic stemmer dedicated to different Arabic dialects. They describe in their study a novel rule-based algorithm to extract stems from textual Arabic Gulf dialect.

The study of Boubas et al. (2011) exhibits a novel Arabic stemming algorithm which uses genetic algorithms and verb pattern matching. This algorithm is based mainly on a machine learning system and Arabic morphological rules or patterns. They produced an Arabic morphological analyzer capable of generating the Arabic root for any stream of Arabic words.

All our attempts to obtain the stemmers listed in this section, in order to have more Arabic root-based stemmers to benchmark against our new Arabic stemmer, failed.

3. Methodology

In this study, a novel Arabic stemmer is presented to extract the triliteral Arabic roots from Arabic words derived from triliteral roots. The proposed stemmer is based on light and heavy (root-based) stemming methods. C#.NET language is used to implement our new proposed Arabic stemmer.
Afterward this algorithm is tested and its outputs are compared with the outputs of two other Arabic root-based stemmers: the Khoja stemmer (Khoja and Garside, 1999), and the stemmer proposed by Ghwanmeh et al. (2009).

The comparison was restricted to two selected Arabic stemmers since most of the Arabic stemmers proposed previously in the literature are not offered to the public to be tested, except those of Khoja and Garside (1999) and Ghwanmeh et al. (2009). One of the authors of our paper was the developer of the second one (i.e. Ghwanmeh et al. (2009)). We checked the Arabic stemming resources presented in (http://sites.google.com/site/nlp4arabic/), where Al-Stem Stemmer and Alex's version of Arabic Stemmer are Perl stemmers. We ran them and found that they cannot be used in this study, since they are light Arabic stemmers.

The light stemming algorithm is concerned with the removal of the affixes from the inputted words. Our new Arabic stemmer removes several predetermined Arabic affixes. Several examples of these Arabic affixes are shown in Table 1.

El-Affendi (2002) indicates in his study that the total number of Arabic roots is approximately 9464 roots. Triliteral Arabic roots constitute around 70% of the total number of Arabic roots, while 30% of the total number of Arabic roots are classified as quadri-literal Arabic roots. Sawalha and Atwell (2009) used in their study 2730 verb patterns and 985 noun patterns.

On the other hand, root-based stemming is based on comparing the Arabic word under consideration with Arabic triliteral verb patterns (patterns, ), which are selected depending on the number of letters in the word. By comparing the word to that specific verb (pattern, ) we can derive the root of the word. Several examples of those Arabic triliteral verb patterns (patterns, ) are shown in Table 2.

Table 1 A sample list of Arabic affixes removed by our stemmer.
         | 1 Letter | 2 Letters | 3 Letters
Prefixes | (Alif, ), (Waaw, ), & (Yaa, ) | (Alif-laam, ), (Siin-nuun, ), (Faa-alif, ), (Kaaf-taa, ), & (Yaa-Alif, ) | (Waaw-alif-laam, ), (Kaaf-alif-laam, ), (Baa-alif-laam, ), & (Waaw-siin-taa, )
Suffixes | (Yaa, ), (Taa, ), & (Laam, ) | (Haa-nuun, ), (Kaaf-nuun, ), (Haa-miim, ), (Alif-taa, ), & (Taa-haa, ) | (Yaa-Alif-Taa, ), (Kaaf-Miim-Alif, ), & (Haa-Miim-Alif, )

Table 2 Arabic triliteral verbs patterns.
Arabic word | Arabic verb pattern | Arabic triliteral verb
(School, ) | (Mafala, ) | (Studied, )
(Forgiveness, ) | (Estefaal, ) | (Forgave, )
(They are Looking, ) | (Yafalon, ) | (Looked, )

3.1. Stemming algorithms

This section introduces the algorithm of our proposed Arabic stemmer. Each Arabic word inputted to this stemmer has to proceed through three phases, which are described below. The first phase is dedicated to removing Arabic affixes; the second phase is dedicated to identifying the verb pattern of each evaluated Arabic word, while the third phase is dedicated to refining the proposed Arabic root. Fig. 1 exhibits the pseudo code of our proposed Arabic stemming algorithm to extract triliteral Arabic verbs.

The following subsections exhibit a detailed discussion of some of the essential steps shown in Fig. 1.

3.1.1. Removing Arabic affixes (prefixes and suffixes)

Arabic words that are used as inputs to this stemmer first have to be normalized. For example, the following three different shapes of the first Arabic letter (Alif, ) will be normalized to (Alif, ).

In this phase it is essential for the proposed stemming algorithm to remove only the real prefixes and suffixes. Consider the following Arabic word (Adults, ), where the blind removal of the Arabic prefix (Baa-alif-laam, ) will lead to a failure to find the right Arabic root (Reach, ), since two letters are removed from the root (Reach, ).

Our proposed stemmer first attempts to identify Arabic affixes of different lengths, as shown in Table 1, in order to remove them. So in the first phase of our stemmer appropriate affixes are tested and eliminated from the inputted Arabic words. Our stemmer removes the affixes after considering the length of the word and the length of the affix, to control the affix elimination process and yield better roots. For instance, consider the following three Arabic words: (The Reformers, ), (The Products, ), and (The Libraries, ). First our stemmer removes the definite article (Alif-laam, ) from those three words, removes the suffix (Waaw-nuun, ) from (Reformers, ), and removes the suffix (Alif-taa, ) from (Products, ) and (Libraries, ). Prefix and suffix removal converts the three Arabic words to the following Arabic words: (Reformer, ), (Product, ), and (Office, ). So the first phase of this stemmer yields affix-free Arabic words. Moreover, we should notice that the semantics of the Arabic words (Reformer, ), (Product, ), and (Office, ) are correct, which means that the removal of prefixes and suffixes was correct.

3.1.2. Arabic verb pattern identification

In the second phase the stemmer attempts to extract the correct Arabic root. The correctness of each Arabic root extracted by this stemmer is based mainly on identifying the right root pattern for the inputted Arabic word.

In this phase, we compare the output of the first phase to a set of verb (patterns, ) in order to extract the right root. The main task in this phase is to identify the verb (pattern, ) of the output of the first phase, by matching the output to a number of verb (patterns, ) which have similar word lengths. Afterward a matching between the corresponding Arabic letters in the extracted word and the pattern is conducted, where the following three Arabic letters (Faa, ), (Ayn, ), and (Laam, ) within the Arabic pattern
Input: A text file that contains the Arabic words
Output: Arabic triliteral verb(s)
1. Remove Arabic prefix(es) from each word.
2. Normalize the 3 shapes of (Alif, ) to (Bare Alif, ).
3. Remove suffix(es) from each word.
4. Determine the word length after removing affixes (prefixes and suffixes).
5. Identify the Arabic patterns having the same length as the word length found in step 4.
6. Compare each pattern identified in step 5 with the word extracted in step 3.
7. Select the closest pattern:
   a. Choose, from the set of Arabic patterns having the same length as the word, the pattern which has the highest number of Arabic letters in common with the word extracted in step 3.
   b. The pattern with the largest number of matching corresponding letters is considered the right pattern, where the letters of the word extracted in step 3 that correspond to the three Arabic letters (Faa', ), (Ayn, ), (Laam, ) within the pattern under consideration are not compared.
8. Eliminate all letters matched in step 7. The Arabic letters of the word extracted in step 3 which correspond to the Arabic letters (Faa', ), (Ayn, ), and (Laam, ) in the selected pattern (found in step 7.a) are selected to constitute the extracted Arabic root.
9. Refine the extracted Arabic root by converting some of the Arabic letters.
Figure 1 Pseudo code of our proposed Arabic stemming algorithm.
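Steps 1 to 4 of the pseudo code could be sketched in Python as follows. This is only an illustration under stated assumptions, not the authors' C#.NET implementation: the affix lists are a small subset whose Arabic spellings are inferred from the transliterated names in the text, normalization is applied first as described in Section 3.1.1, and the `MIN_REMAINING` guard is a hypothetical stand-in for the length check, whose exact rule the paper does not give.

```python
# Alif variants (hamza above, hamza below, madda) mapped to bare Alif.
ALIF_VARIANTS = str.maketrans({"أ": "ا", "إ": "ا", "آ": "ا"})

# Small illustrative subset of the affixes named in the text, longest first.
# The Arabic spellings are inferred from the transliterations (assumption).
PREFIXES = ["وال", "كال", "بال", "ال", "و"]
SUFFIXES = ["ون", "ات", "هم"]

# Hypothetical guard: never strip an affix if fewer than 3 letters would remain.
MIN_REMAINING = 3


def normalize(word: str) -> str:
    """Replace the three Alif shapes with bare Alif."""
    return word.translate(ALIF_VARIANTS)


def strip_prefix(word: str) -> str:
    """Remove the first (longest) listed prefix that passes the length guard."""
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= MIN_REMAINING:
            return word[len(prefix):]
    return word


def strip_suffix(word: str) -> str:
    """Remove the first listed suffix that passes the length guard."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_REMAINING:
            return word[: -len(suffix)]
    return word


def remove_affixes(word: str) -> str:
    """Normalize, then strip one prefix and one suffix."""
    return strip_suffix(strip_prefix(normalize(word)))


# The three example words of Section 3.1.1 (spellings assumed).
for w in ["المصلحون", "المنتجات", "المكتبات"]:
    print(w, "->", remove_affixes(w))
```

With these assumed spellings, the three example words reduce to the four-letter forms glossed as (Reformer), (Product), and (Office). A guard this simple does not resolve cases where the affix letters belong to the root, such as the (Adults) example.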
are excluded from this similarity matching process. The pattern which achieves the highest matching will be considered by our algorithm.

The right Arabic root will successfully be extracted if the stemmer succeeds at this phase in identifying the right verb (pattern, ). Fig. 2 presents the matching between the outputs of the previous phase (Reformer, ), (Product, ), and (Office, ) and the verb pattern (Mafal, ). The main task of this Arabic stemmer is to extract the three-consonant (triliteral) Arabic verb from which the original word is derived. All patterns used are derived from the Arabic triliteral verb (Faala, ).

Figure 2 Root extraction process.

Identifying the right Arabic pattern for an Arabic word leads to extracting the right Arabic root, by simply extracting the three Arabic letters within the preprocessed word that correspond to the following three Arabic letters (Faa, ), (Ayn, ), (Laam, ) in the identified pattern. In other words, the Arabic root can be extracted simply by eliminating the Arabic letter or letters matched between the pattern and the word extracted in the first phase. As an example, our stemmer will identify the source (Mafal, ) as a verb pattern for the extracted Arabic words: (Reformer, ), (Product, ), and (Office, ) from the previous phase. Thus in this case the extracted Arabic triliteral roots are: (Reformed, ), (Produced, ), and (Wrote, ).

In this phase, the system identifies all the verb patterns that have the same length as the Arabic word resulting from applying the first three steps of the above algorithm. Then, our stemmer starts matching the corresponding letters of the resulting word and each candidate verb pattern. The pattern which has the largest number of matching corresponding letters is the one used by the stemmer to extract the Arabic root. As illustrated in Fig. 3, after finding the right pattern (Mafal, ) the system will eliminate similar letters except the main letters. By doing so, the matched letter will be removed, and the letters that correspond to the main letters will be returned in the same order as in the source. The output of this process is then the root (produced, ) for the evaluated word.

Figure 3 Arabic verb pattern matching and root extraction process.

To show how our Arabic stemmer extracts the Arabic root of inputted Arabic words, consider the 9-letter Arabic word (The Forgiveness, ). The stemmer first has to remove the prefix, normalize some Arabic letters, remove the suffix, and then determine the length of the Arabic word after stripping off prefixes and suffixes. Identifying the word length leads to identifying the verb patterns of equal length. The next problem is to select the appropriate pattern from the set of 7-letter patterns like (Eftale, ), (Estftale, ), (Enfale, ), etc. for the resulting word (Forgiveness, ) after applying the first three steps. To solve this problem and select the appropriate pattern for this word, the stemmer starts comparing corresponding Arabic letters in each of the candidate patterns and the input Arabic word (Forgiveness, ).

Letter position 7 6 5 4 3 2 1
Pattern (Eftale, )
Arabic word

The above comparison between the candidate pattern (Eftale, ) and the resulting Arabic word (Forgiveness, ) yields two matches, at positions 1 and 3. Similarly, our stemmer starts another comparison between the candidate pattern (Enfale, ) and the resulting Arabic word (Forgiveness, ), which yields 1 match at position 1 as shown below:

Letter position 7 6 5 4 3 2 1
Pattern (Enfale, )
Arabic word

Similarly, our stemmer starts another comparison between the candidate pattern (Estftale, ) and the resulting Arabic word (Forgiveness, ), which yields four matches at positions 1, 2, 3, and 6 as shown below:

Letter position 7 6 5 4 3 2 1
Pattern (Estftale, )
Arabic word

Therefore the stemmer in such cases will select the Arabic pattern (Estftale, ). Next the stemmer starts to extract the non-matched Arabic letters from the resulting Arabic word (Forgiveness, ), which means that (Ghayn, ), (Faa, ), and (Raa, ) constitute the Arabic root (Forgive, ).

One of the drawbacks of the new Arabic stemmer proposed in this study is its inability to extract the correct Arabic roots from Arabic words whose lengths are less than 4 characters; it cannot treat vowels properly in those short words. So the present version of our algorithm is unable to extract the correct root from the following two Arabic words: (You see, ), (she saw, ), and outputs them as is. This problem should be considered in the enhancement of this stemmer in the future.
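The pattern identification and root extraction described in this subsection could be sketched in Python as follows. This is a minimal illustration, not the authors' implementation. The pattern inventory `PATTERNS` is a hypothetical handful of patterns, the Arabic spellings of the patterns and example words are assumptions inferred from the transliterations, and returning the word unchanged when no pattern of equal length exists is also an assumption.

```python
# Placeholder letters Faa, Ayn, Laam that mark the root slots in a pattern.
ROOT_SLOTS = "فعل"

# Hypothetical pattern inventory; the stemmer's actual pattern list is not given.
PATTERNS = ["مفعل", "فاعل", "مفعول", "استفعال", "انفعال", "افتعال"]


def match_score(word: str, pattern: str) -> int:
    """Count positions where a non-slot pattern letter equals the word letter."""
    return sum(
        1
        for w, p in zip(word, pattern)
        if p not in ROOT_SLOTS and w == p
    )


def extract_root(word: str, patterns=PATTERNS) -> str:
    """Pick the best-matching pattern of equal length and read off the root."""
    # Words shorter than 4 letters are output as is, as the text describes.
    if len(word) < 4:
        return word
    candidates = [p for p in patterns if len(p) == len(word)]
    if not candidates:
        return word  # assumption: leave the word unchanged
    best = max(candidates, key=lambda p: match_score(word, p))
    # Keep only the word letters aligned with the Faa, Ayn, Laam slots.
    return "".join(w for w, p in zip(word, best) if p in ROOT_SLOTS)


forgiveness = "استغفار"  # assumed spelling of (Forgiveness)
for pattern in ["افتعال", "انفعال", "استفعال"]:
    print(pattern, match_score(forgiveness, pattern))
print(extract_root(forgiveness))
```

With these assumed spellings, the three scores printed are 2, 1 and 4, in line with the match counts reported above, and the extracted root consists of the letters Ghayn, Faa and Raa. Ties between equally scored patterns are resolved here by list order, which is an arbitrary choice of this sketch.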
3.1.3. Root decision evaluation

Up to this phase, the system has the root of the given word. However, in the Arabic language there are some letters that must be drawn differently when they come at the end or in the middle of the word. These letters should be adjusted to have the word correctly displayed. For instance, the letter (Waaw with Hamza above, ) in the middle of the word should be transformed to (Alif, ), while the same letter at the end of the word should be transformed to (Alif, ). Applying this process will correct most of the generated roots.

4. Experimental analysis

As mentioned earlier, we have implemented our proposed stemming algorithm using the C# .NET programming language. The system accepts a text file that includes the Arabic words and produces the roots of those words. Examples of the system's output results are shown in Table 3. The following subsections show the test collection used to test our Arabic stemmer, together with the results of these tests.

4.1. The test collection

The research projects in this field lack a gold standard set to be used to carry out benchmark tests of different Arabic stemmers. Therefore a dataset consisting of 6081 Arabic words derived from native Arabic triliteral verbs was constructed to evaluate our proposed Arabic stemming algorithm relative to the other two stemmers. Those include singular, dual, and plural Arabic words (nouns and verbs) which are derived from triliteral Arabic roots.

4.2. Results

The tests on our novel algorithm yield an accuracy of 75.03% on the whole collection. We have compared the results of the tests on our proposed stemming algorithm with the results of tests on the Khoja and Garside (1999) and Ghwanmeh et al. (2009) Arabic stemmers using the same test collection. Those two stemmers (Khoja and Garside (1999) and Ghwanmeh et al. (2009)) yield accuracies of 74.03% and 67.40% respectively. The results of our stemmer thus slightly exceed those of the Khoja stemmer. Fig. 4 visualizes these results.

The accuracy of this Arabic stemmer may seem at first glance lower than the accuracies of other Arabic stemmers reported in previous studies. One of these is Ghwanmeh et al. (2009), which claims 95% accuracy, but within our study it yields an accuracy of 67.40%. This is due to differences in the size and type of the datasets used to test these stemmers. Therefore there is a need to construct a standard Arabic dataset for Arabic stemmers, to be used to benchmark different Arabic stemmers.

4.3. Analysis

In this section, the effectiveness of our novel stemmer is analyzed using 5176 Arabic words of different lengths. The major attribute of a stemmer's quality is the accuracy with which it predicts word stems. Section 4.2 presents the overall results of comparing the accuracy of the three Arabic stemmers under consideration (i.e. Khoja and Garside (1999), Ghwanmeh et al. (2009) and our proposed stemmer). In this section the tests on the three Arabic stemmers are conducted according to the length of the input Arabic word.

This section describes the comparison results in more detail. The experiment is divided into different categories based on word length. The summary results are shown in bar charts in Fig. 5.

Our test collection has 677 words of four letters. The Khoja and Garside (1999) algorithm yields 69.2% accuracy, followed by our stemmer and the Ghwanmeh et al. (2009) stemmer with accuracies of 69.1% and 55.2% respectively. Fig. 5 presents the accuracy of the three stemmers in extracting Arabic roots from four-letter Arabic words. Our test collection also has 1071 Arabic words of five letters. Our algorithm yields 71.4% accuracy, followed by the Khoja and Garside (1999) and Ghwanmeh et al. (2009) stemmers with accuracies of 65.1% and 52.2% respectively. Fig. 5 shows the five-letter results for the three stemmers.

The test collection has 845 words of six letters. Our algorithm yields 71.8% accuracy, followed by the Khoja and Garside (1999) and Ghwanmeh et al. (2009) stemmers with accuracies of 71.4% and 63.3% respectively. Fig. 5 shows the six-letter results for the three stemmers. The test collection has 733 words of seven letters. The Khoja and Garside (1999) algorithm yields 84.3% accuracy, followed by our stemmer and the Ghwanmeh et al. (2009) stemmer with accuracies of 81.9% and 77.8% respectively. Fig. 5 shows the seven-letter results for the three stemmers. The test collection has 1850 words of eight letters. Our algorithm yields 77% accuracy, followed by the Khoja and Garside (1999) and Ghwanmeh et al. (2009) stemmers with accuracies of 76.5% and 75.5% respectively. Fig. 5 shows the eight-letter results for the three stemmers.

Fig. 5 shows that our stemmer is more effective than the other two Arabic stemmers at extracting Arabic triliteral roots from words of 5, 6 and 8 Arabic letters. Fig. 5 also shows that the Khoja and Garside (1999) stemmer is the best in terms of prediction accuracy at extracting Arabic triliteral roots from words of 4 and 7 letters. This clearly reveals that we still need to work on
stemmer optimization to work with all word sizes.
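The per-length comparison can be sketched in a few lines of Python. This is only an illustration of the bookkeeping, not the authors' C# evaluation code: the function names, the `SAMPLES` list, and the predicted roots in it are hypothetical, and accuracy is assumed to mean the percentage of words whose extracted root equals the expected root.

```python
from collections import defaultdict


def accuracy_by_length(samples):
    """Return {word_length: accuracy in percent}.

    `samples` is an iterable of (word, expected_root, predicted_root) triples.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for word, expected, predicted in samples:
        length = len(word)  # number of letters (no diacritics assumed)
        totals[length] += 1
        if predicted == expected:
            correct[length] += 1
    return {n: 100.0 * correct[n] / totals[n] for n in sorted(totals)}


def overall_accuracy(samples):
    """Return the accuracy in percent over all samples."""
    samples = list(samples)
    hits = sum(1 for _, expected, predicted in samples if predicted == expected)
    return 100.0 * hits / len(samples)


# Hypothetical mini test set: (word, expected root, root returned by a stemmer).
SAMPLES = [
    ("كاتب", "كتب", "كتب"),    # 4 letters, correct
    ("مكتوب", "كتب", "كتب"),   # 5 letters, correct
    ("دارس", "درس", "دار"),    # 4 letters, wrong (made-up stemmer output)
    ("مدرسة", "درس", "درس"),   # 5 letters, correct
]

if __name__ == "__main__":
    print(accuracy_by_length(SAMPLES))
    print(overall_accuracy(SAMPLES))
```

Note that the five word-length groups reported above contain 677 + 1071 + 845 + 733 + 1850 = 5176 words, fewer than the 6081 words of the full collection, so the overall accuracy cannot be recomputed from the per-length figures alone.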
Table 3 Proposed stemmer output results.

Inputted Arabic word    Number of letters    Extracted Arabic triliteral verb
(The noise, )           5                    (Noised, )
(Welding, )             7                    (Welded, )
(The employments, )     7                    (Employed, )
(Will Send you, )       9                    (Sent, )

Figure 4 Stemmer results comparison.

Figure 5 Variable length words analysis.

4.4. Stemmer output analysis

Using stemmers leads to two types of errors: over-stemming and under-stemming. Over-stemming errors occur when words that refer to distinct concepts are stemmed to the same root. Consider the two Arabic words (feet, ) and (Introduction, ), which refer to two different concepts but are most probably stemmed to one Arabic triliteral verb (Presented, ). Under-stemming errors occur when words of the same concept are stemmed to different roots. Consider the two Arabic words (mobile, ) and (mobile, ), which refer to the same concept but are stemmed to two different Arabic triliteral verbs: (transferred, ) and (Toured, ). Table 4 shows examples of under-stemming (false negative) errors of our stemmer, and Table 5 shows examples of over-stemming (false positive) errors of our stemmer. Note that the under-stemming and over-stemming errors produced by our stemmer may or may not be produced by the other two stemmers. We can view these types of errors as a general phenomenon of the Arabic language that does not depend on correct or wrong stemming.

Table 4 Examples of under-stemming (False Negative) errors.

Inputted Arabic word     Our stemmer output          Khoja stemmer    Ghwanmeh stemmer
                         (under-stemming errors)     output           output
(For the reports, )
(For the consumer, )
(For the computers, )
(And the control, )
(In the region, )
(The harshness, )

Considering Table 4, we can easily notice that all the words have the prefix (for, ), which is the cause of the wrong stemming: the algorithm removes the first letter of this prefix but not the second one, since the second letter is considered an original letter in most cases. In Table 5, on the other hand, the over-stemming (false positive) cases have a similar cause: an original letter is removed in the comparison phase, which then selects a wrong shape on which to apply the heavy stemming.
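The two error types defined in this subsection can be checked mechanically once each word is labeled with the concept it expresses. The following Python sketch is a minimal illustration under that assumption. The function name, the concept labels, and the word identifiers are hypothetical, and the English glosses from the examples in the text stand in for the Arabic words and roots.

```python
from itertools import combinations


def stemming_errors(items):
    """Classify word pairs as under- or over-stemming errors.

    `items` is a list of (word, concept, root) triples, where `root` is the
    stemmer output. Returns (under_stemmed_pairs, over_stemmed_pairs).
    """
    under, over = [], []
    for (w1, c1, r1), (w2, c2, r2) in combinations(items, 2):
        if c1 == c2 and r1 != r2:
            # Same concept, different roots: under-stemming (false negative).
            under.append((w1, w2))
        elif c1 != c2 and r1 == r2:
            # Different concepts, same root: over-stemming (false positive).
            over.append((w1, w2))
    return under, over


# Hypothetical labels built from the English glosses used in the text.
ITEMS = [
    ("feet", "FOOT", "presented"),
    ("introduction", "INTRODUCTION", "presented"),
    ("mobile_1", "MOBILE", "transferred"),
    ("mobile_2", "MOBILE", "toured"),
]

if __name__ == "__main__":
    under_pairs, over_pairs = stemming_errors(ITEMS)
    print("under-stemming:", under_pairs)
    print("over-stemming:", over_pairs)
```

The pairwise comparison is quadratic in the number of words, which is acceptable for a collection of a few thousand words. Grouping by concept and by root first would scale better.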
Table 5 Examples of over-stemming (aka False Positive) errors.

Inputted Arabic word     Our stemmer output          Khoja stemmer    Ghwanmeh stemmer
                         (over-stemming errors)      output           output
(His arguments, )
(The legacy, )
(your listening, )
(Vacations, )
(Coral, )
(Recipes, )

The two competing Arabic stemmers, Khoja and Garside (1999) and Ghwanmeh et al. (2009), yield lower accuracy than our Arabic stemmer when the input Arabic words have 5 or 8 letters. The accuracies of our stemmer and the Khoja and Garside (1999) stemmer are nearly equivalent for words of 4 letters (69.1% vs. 69.2%) and 6 letters (71.8% vs. 71.4%). The Ghwanmeh et al. stemmer shows low accuracy for 6-letter Arabic words. The Khoja and Garside (1999) stemmer yields better results for 7-letter input words than the other two Arabic stemmers.

5. Conclusion and future work

In this work, we proposed, developed, and evaluated a new Arabic stemmer. Three main processing phases are applied to generate Arabic roots from words. Phase 1 removes prefixes and suffixes, Phase 2 compares the output to standard word sources or shapes, and Phase 3 corrects the extracted root. Preliminary experimental results indicate an acceptable accuracy for root prediction. We compared our stemmer with two other Arabic stemmers on the same dataset. The results show that our algorithm is more accurate than the other two Arabic stemmers in most cases (across different word lengths).

We plan to enhance the effectiveness of this stemmer in the future. First, we will check the conformance between the removed prefixes and the removed suffixes. In some cases, removing a certain suffix led the stemmer to ignore some prefixes and leave them in place, and vice versa. Solving this problem may enhance the effectiveness of our stemmer. Second, our stemmer is currently restricted to extracting Arabic triliteral roots and fails, for example, to extract quadriliteral roots (i.e., roots with four consonants), so an enhanced version should be prepared. Third, the next version should be capable of extracting Arabic roots from 2-letter and 3-letter Arabic words and of dealing with vowels in these short words.

References

Abjad, 2012. Wikipedia, the free encyclopaedia. http://en.wikipedia.org/wiki/Abjad.
Abu Ata, B., Al-Omari, A., 2014. A rule-based stemmer for Arabic Gulf dialect. Journal of King Saud University - Computer and Information Sciences (JKSU). (Submitted).
Abu-Salem, H., Al-Omari, M., Evens, M., 1999. Stemming methodologies over individual query words for an Arabic information retrieval system. J. Am. Soc. Inf. Sci. (JASIS) 50 (6), 524-529.
Al-Sawadi, A.D., Khayat, M.G., 1996. An end-case analyzer of Arabic sentences. J. King Saud Univ. Comput. Inf. Sci. (JKSU) 8, 21-52.
Al-Serhan, H., Ayesh, A., 2006. A triliteral word roots extraction using neural network for Arabic. In: The 2006 International Conference on Computer Engineering and Systems, pp. 436-440.
Al-Shalabi, R., Kanaan, G., Ghwanmeh, S., Nour, F.M., 2007. Stemmer algorithm for Arabic words based on excessive letter locations. In: 4th International Conference on Innovations in Information Technology (IIT '07), pp. 456-460.
Al-Sughaiyer, Imad A., Al-Kharashi, Ibrahim A., 2006. Rule parser for Arabic stemmer. Lect. Notes Comput. Sci. 2448 (2006), 11-18. http://dx.doi.org/10.1007/3-540-46154-X_2.
Arabic language, 2015. Wikipedia, the free encyclopaedia. http://en.wikipedia.org/wiki/Arabic_language.
Boubas, A., Lulu, L., Belkhouche, B., Harous, S., 2011. GENESTEM: A novel approach for an Arabic stemmer using genetic algorithms. In: International Conference on Innovations in Information Technology (IIT 2011), pp. 77-82.
Chen, A., Gey, F., 2002. Building an Arabic stemmer for information retrieval. In: Proceedings of the Eleventh Text REtrieval Conference (TREC 2002). National Institute of Standards and Technology, pp. 631-639.
Eiman Al-Shammari, Jessica Lin, 2008. A novel Arabic lemmatization algorithm. In: Proceedings of the Second Workshop on Analytics for Noisy Unstructured Text Data (AND '08), pp. 113-118.
Eiman Al-Shammari, Jessica Lin, 2008. Towards an error-free Arabic stemming. In: Proceedings of the 2nd ACM Workshop on Improving Non English Web Searching (iNEWS '08), pp. 9-16.
El-Affendi, M.A., 2002. An LVQ connectionist solution to the non-determinacy problem in Arabic morphological analysis: a learning hybrid algorithm. Natural Language Engineering 8 (1), pp. 3-23. Cambridge University Press.
Ghwanmeh, S., Kanaan, G., Al-Shalabi, R., Rababah, 2009. Enhanced algorithm for extracting the root of Arabic words. In: The Sixth International Conference on Computer Graphics, Imaging and Visualization (CGIV '09), pp. 388-391.
Hmeidi, Ismail I., Al-Shalabi, Riyad F., Al-Taani, Ahmad T., Najadat, Hassan, Al-Hazaimeh, Shaker A., 2010. A novel approach to the extraction of roots from Arabic words using bigrams. J. Am. Soc. Inf. Sci. Technol. (JASIS) 61 (3), 583-591.
Kanaan, G., Al-Shalabi, R., Jaam, J.M., Al-Kabi, M.N., Hasnah, A., 2004. A new stemming algorithm to extract quadri-literal Arabic roots. In: Proceedings of International Conference Information and Communication Technologies: From Theory to Applications, pp. 456-460.
Kchaou, Z., Kanoun, S., 2008. Arabic stemming with two dictionaries. In: International Conference on Innovations in Information Technology (IIT 2008), pp. 688-691.
Khoja, S., Garside, R., 1999. Stemming Arabic Text. Computing Department, Lancaster University, Lancaster, UK.
Larkey, L., Ballesteros, L., Connell, M.E., 2002. Improving stemming for Arabic information retrieval: light stemming and co-occurrence analysis. In: SIGIR '02, Tampere, Finland, pp. 275-282.
Momani, M., Faraj, J., 2007. A novel algorithm to extract tri-literal Arabic roots. In: International Conference on Computer Systems and Applications (AICCSA '07), pp. 309-315.
Mustafa, S.H., 2002. A relational approach to the design of an Arabic lexical database. J. King Saud Univ. Comput. Inf. Sci. (JKSU) 14, 1-23.
Non-concatenative morphology, 2012. Wikipedia, the free encyclopaedia. http://en.wikipedia.org/wiki/Nonconcatenative_morphology.
Nwesri, A.F.A., Tahaghoghi, S.M.M., Scholer, F., 2005. Stemming Arabic conjunctions and prepositions. Lect. Notes Comput. Sci. 3772, 206-217. http://dx.doi.org/10.1007/11575832_23.
Riyad Al-Shalabi, Martha Evens, 1998. A computational morphology system for Arabic. In: Computational Approaches to Semitic Languages Workshop, COLING '98, Montreal, Canada, pp. 58-65.
Sawalha, M., Atwell, E.S., 2009. Linguistically informed and corpus informed morphological analysis of Arabic. In: Proceedings of the 5th Corpus Linguistics Conference (CL2009). University of Liverpool, UK. Lancaster University, University Centre for Computer Corpus Research on Language, University of Liverpool. http://ucrel.lancs.ac.uk/publications/cl2009/, pp. 1258-1265.
Semitic languages, 2012. Wikipedia, the free encyclopaedia. http://en.wikipedia.org/wiki/Semitic_languages.
Taghva, K., Elkhoury, R., Coombs, J., 2005. Arabic stemming without a root dictionary. In: International Conference on Information Technology: Coding and Computing (ITCC 2005), pp. 152-157.
