Abstract
The majority of Arabic text available on the web is written without short vowels (diacritics). Diacritics are commonly used in religious scripts such as the holy Quran (the book of Islam), Al-Hadith (the teachings of Prophet Mohammad (PBUH)), children’s literature, and in some words where ambiguity of articulation might arise. Arabic Internet users may fail to retrieve credible sources of Arabic text if their queries do not match the diacritical marks attached to the words in the collection. However, typing the diacritical marks is tedious and time consuming, while ignoring them leads to the problem of ambiguity. Previous work suggested pre-processing Arabic text to remove these diacritical marks before indexing. Consequently, there are noticeable discrepancies when searching the web for Arabic text using international search engines such as Google and Yahoo. In this article, we propose a framework to enhance the retrieval effectiveness of search engines when searching for diacritic and diacritic-less Arabic text through query expansion techniques. We used a rule-based stemmer and a semantic relational database compiled in an experimental thesaurus to do the expansion. We tested our approach on the scripts of the Quran. We found that query expansion for searching Arabic text is promising, and it is likely that the efficiency can be further improved by advanced natural language processing tools.
1 Introduction
Arabic belongs to the family of Semitic languages. It differs from Latin languages morphologically, syntactically and semantically. The writing system of Arabic has 25 consonants and three long vowels that are written from right to left and change shapes according to their position in the word. In addition, Arabic has short vowels (diacritics) written above and under a consonant to give it its desired sound and hence give a word a desired meaning. The common diacritics used in Arabic language are listed in Table 1.
In Arabic text, diacritics are not generally indicated in writing. Diacritic-less text (i.e., text without diacritic vowels) is commonly used by the Arabic community for everyday written and printed material such as books, magazines, newspapers and letters. However, diacritics are heavily used in religious texts that demand strict obedience to pronunciation rules, such as the Quran (the book of Muslims; followers of Islam) and some scripts of Al-Hadith (teachings of Prophet Mohammad (PBUH)).
In addition, it is very common to use diacritics (fully, partially or even randomly) in classical poetry, children’s literature and ordinary text that would otherwise be ambiguous to read. For instance, a word in Arabic consisting of three consonants like (ك ت ب ktb) “to write” can have many interpretations in the presence of diacritics (Kirchhoff and Vergyri 2005).
For Arabic language speakers, the only way to disambiguate the diacritic-less words is to locate them within the context. Analysis of 23,000 Arabic scripts showed that there is an average of 11.6 possible ways to assign diacritics for every diacritic-less word (Debili et al. 2002). Examples of the different meanings associated with the word “ktb” in the presence of short vowel diacritics are listed in Table 2.
Unfortunately, the bulk of the diacritic-less Arabic scripts available on the Internet prevent at least two groups of people from accessing their contents. The first group includes visually impaired people, while the second group includes people with learning disabilities. Both groups rely on text synthesis and voice recognition applications. Unfortunately, the success of the Arabic voice applications is highly dependent on the presence of diacritisized text which enables the systems to pronounce the words correctly. Many research projects have been carried out to restore diacritics automatically to help such applications (El-Sadany and Hashish 1988; Gal 2002; Zitouni et al. 2006).
Another group who might be affected by the absence of diacritics includes people trying to read and understand the teachings of the Quran and Al-Hadith. The meanings of the Quran verses and Al-Hadith are heavily dependent on the presence of the diacritic vowels.
Most research in the field of Arabic Information Retrieval (AIR) has not paid much attention to the problem of searching and retrieving diacritisized text. Most published works even suggested removing the diacritics at the preprocessing step to unify the content of the inverted list (Buckwalter 2007). This may have been reasonable in the early days, when Arabic text on the web was very limited. Additionally, in the absence of tools and applications that handle this problem seriously, an important new lexical resource for modern standard Arabic (MSA) known as Arabic WordNet (AWN) risks being rejected (Black and Elkateb 2004; Black et al. 2006). In addition, ignoring diacritical text may delay progress in some research areas such as the semantic web (SW) (Abdelali et al. 2003; El-Helw and Aly 2004; Zaidi and Laskri 2005) and ontology-based information retrieval (Elkateb and Black 2004).
Current international search engines (such as Google (Footnote 1), Yahoo (Footnote 2), MSN (Footnote 3), etc.) and Arabic search engines (such as Ayna (Footnote 4), etc.) are not yet mature enough to handle the complexity of the Arabic language. Accordingly, there is no motivation to upload more Arabic scripts to the web if the content cannot be retrieved. It becomes necessary either to extend the search capabilities of these search engines or to develop a (full) Arabic search engine.
In this paper, we propose a framework to enhance the retrieval effectiveness of search engines when searching for diacritic and diacritic-less Arabic text through query expansion techniques using a rule-based stemmer and an experimental thesaurus. We carried out our experiments on the scripts of the Quran, as a free open resource of classical Arabic.
2 State of the art
2.1 Arabic information retrieval
A study of the world market, commissioned by the Miniwatts Marketing Group (Footnote 5), shows that the number of Arab Internet users in the Middle East and Africa could jump to 32 million in 2008 from 2.5 million in the year 2000. In addition, the growth of Arab Internet users in the Middle East region (for the same period) is expected to reach about 924% compared to the growth of the world Internet users (see footnote 5 for details). The study pointed out that 65% of Arabic-speaking Internet users could not read English pages, which account for 70% of the material available on the Internet.
Unfortunately, efforts to build new search engines to serve the increasing number of Arabic-speaking users are still very modest. This is mainly due to the complexity of the Arabic language. Another obstacle facing developers of AIR systems is the lack of adequate resources (i.e., corpora, lexicons, morphological analyzers, part-of-speech taggers, etc.) that could help in scaling, testing and evaluating the performance of their implemented systems in the real world.
Generally speaking, AIR engines can be designed based on one of two categories (Abdelali et al. 2004):
1. Full-form based IR, which has been adopted by most of the commercial engines such as Google, Yahoo and Ayna.
2. Morphology-based IR. Systems developed in this context are experimental systems to improve the performance of AIR. Different approaches to improve the performance of AIR have also been addressed. These approaches include: using light stemmers (Larkey et al. 2002; Semmar et al. 2006), part of speech taggers (Khoja 2001; Diab 2007), using thesauri (Xu et al. 2002) and using ontology (Abdelali et al. 2003).
In this work, we used the rule based morphological analyzer developed by Khoja (1999) to extract Arabic stems. We also compiled an experimental thesaurus based on semantic relations extracted from the Quran.
2.2 Cross language information retrieval
As of today, English is the most dominant language of the web. The majority of Internet users around the world (80% of the world’s population) cannot read English pages and are therefore prevented from benefiting from the credible sources of information available for free on the web. Cross-Language Information Retrieval (CLIR) allows a user to formulate a query in his own language and retrieve documents in one or several other languages.
CLIR has appeared in the literature over the last decade (Grefenstette 1996; Oard 2000). However, the development of CLIR systems is very limited because of the high development cost and complexity. Most CLIR systems rely on thesauri (which are very expensive to implement), stemming, word boundary identification, and lists of stop-words defined for the target language (Larkey and Connell 2005). Several multilingual IR systems have already been studied for many spoken languages other than English. Examples of these languages include Arabic, French, Chinese, Japanese and Spanish. The following is a quick survey of the different approaches that have been tried to tackle the CLIR problem:
1. Machine translation and dictionary-based technique. In this technique, queries are translated using a dictionary into a language in which a document may be found (Hull and Grefenstette 1996; Oard 1998; Pirkola et al. 2001; Hedlund et al. 2004). An example of this approach is a multilingual search engine, called TITAN (Hayashi et al. 1997), where an electronic bilingual (English-Japanese) dictionary helped Japanese users to search the web using their own language.
2. Controlled vocabulary technique. This traditional technique is based on indexing all documents of the collection using fixed terms (descriptors) which are also used for queries. In multilingual IR, these descriptors are translated and mapped to each other in thesauri (French et al. 2001; Kampas 2004).
3. Transliteration/Transcription technique. Transliteration is the process of converting the characters of an alphabetical or syllabic script to the characters of a conversion alphabet. For example, documents written in non-Roman scripts such as Arabic, Chinese, Hebrew, Japanese, etc. are transliterated or transcribed into Roman characters. This approach has been mainly applied to identify proper names in different languages (Gey et al. 2002; Virga and Khudanpur 2003).
4. Corpus-based technique. This technique is based on analyzing large collection of text (corpora) and automatically extracting the information needed on which the translation will be based (Talvensaari et al. 2007).
5. Latent semantic indexing (LSI) technique. LSI is based on using automatic statistical algorithms to improve the retrieval of relevant documents. LSI allows a user to retrieve documents by concepts and meaning even when they do not share any words with the query (Landauer and Littman 1990; Dumais et al. 1996).
2.3 Arabic cross language information retrieval
Arabic CLIR has been attempted by many research groups. The following is a short survey of some of the research related to Arabic CLIR. Aljlayl et al. (2002) built an Arabic–English IR system based on a machine translation approach. AbdulJaleel and Larkey (2003) proposed a statistical transliteration approach for Arabic–English IR. Grefenstette et al. (2005) described the changes required to integrate the Arabic language into their cross language IR system, which had been designed for European languages. Abdelali et al. (2006) described how precision can be improved in query expansion using LSI. Finally, Semmar and Fluhr (2007) presented a new approach to align Arabic–French sentences retrieved from a parallel corpus based on a cross-language IR system. This approach builds a database of sentences of the target text and treats each sentence of the source text as a query to that database.
3 Experimenting with current international search engines
Despite attempts to increase the existing repositories of Arabic content available on the Internet, the content is still close to 0.2% of the total worldwide Internet content. According to the latest statistics by the Internet Society (Footnote 6) regarding the distribution of languages on the Internet, there are only 100 million Arabic web pages covering topics in business, science, politics, religion, and email messages. It is then obvious why most developers design their tools and applications to deal with English scripts, while few attempts have been made to process Arabic scripts.
Google and Yahoo are the most dominant international search engines used to search and retrieve documents in multiple languages. Attempts to tackle the problems that international web retrieval systems have with non-English natural languages have also appeared in the literature. As an example, Lazarinis (2007) created a methodology for identifying some of the deficiencies of searching the web using the Greek language. Other research articles explored the inefficiencies of international search engines in searching for Russian, French, Hungarian and Hebrew queries (Bar-Ilan and Gutman 2005), Chinese queries (Moukdad and Cui 2005), Arabic queries (Moukdad 2004; Al-Maskari et al. 2007) and Polish queries (Sroka 2000).
In this experiment we show the discrepancy between the two engines when searching for diacritic and diacritic-less Arabic documents. First, we searched Google and Yahoo for verses from the Quran. For each query listed in Table 3, we carried out a full-form search using words with diacritics (exactly as they are typed in the Quran) and a second run using the diacritic-less form of the same queries.
Every time we ran an experiment, we used the Advanced Search Preferences Language Tools (available in each search engine), to change the language setting. The settings that we applied include: searching the web, Arabic pages, and English pages. The results of this experiment are listed in Table 3.
It is very obvious here that for the examined search preferences, Google retrieved different results for diacritic and diacritic-less queries. In most cases the returned documents were a mixture of diacritic and diacritic-less results. Many good documents were easily missed because of the absence of diacritics. The case with Yahoo was different: it simply ignored the diacritics in most cases and returned “almost” the same results for diacritic and diacritic-less queries.
Next, we used Google and Yahoo to search for Arabic words where the diacritics reflect some semantics. A list of the words that we tested is given in Table 4. Again we did a full-form match and fixed the search preferences to searching the web. Both engines failed to distinguish the different meanings of the word “ktb”, for example, in the presence of diacritics.
We found that Google returned mixed results, while Yahoo ignored the diacritics. After all, it is obvious that both search engines need to handle this problem more efficiently. However, it is preferable at the end to allow users to enter Arabic words without diacritics while at the same time allowing the retrieval of those words with vowel diacritics for the purposes of disambiguation. In the remaining sections we present our framework and discuss the results of our experiments to solve this problem.
4 The proposed experimental search engine
In previous work, we designed and implemented an experimental search engine to experiment with AIR (Hammo et al. 2002). Later, this engine was modified to support passage retrieval for an open-domain Arabic question answering system, called QARAB (Hammo et al. 2004).
In this work, we extended the capabilities of our experimental system to tackle the problem of retrieving diacritic and diacritic-less Arabic documents. Like most experimental search engines, our system is based on the well-known vector space model (VSM) (Salton and McGill 1983; Salton 1989). It takes a query in Arabic and attempts to provide a list of documents that are close enough to the user’s query, ranked by a similarity measure (the cosine measure). The main components of the extended model are the Tokenizer, Stemmer and Thesaurus modules. The data flow of our system is depicted in Fig. 1.
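As a rough illustration of ranking by the cosine measure in a vector space model, the following minimal sketch scores documents against a query. It assumes raw term-frequency weights, which the text does not specify; the function names `cosine_similarity` and `rank` are hypothetical and are not taken from the system described here.

```python
import math
from collections import Counter


def cosine_similarity(query_tokens, doc_tokens):
    """Cosine of the angle between two raw term-frequency vectors.

    Assumption: plain term counts are used as weights; the paper does not
    state which weighting scheme its engine applies.
    """
    q = Counter(query_tokens)
    d = Counter(doc_tokens)
    # Only terms shared by both vectors contribute to the dot product.
    dot = sum(q[t] * d[t] for t in q.keys() & d.keys())
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    norm_d = math.sqrt(sum(v * v for v in d.values()))
    if norm_q == 0 or norm_d == 0:
        return 0.0
    return dot / (norm_q * norm_d)


def rank(query_tokens, docs):
    """Return (doc_id, score) pairs sorted by descending similarity."""
    scored = [(doc_id, cosine_similarity(query_tokens, tokens))
              for doc_id, tokens in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

With this sketch, a document sharing no term with the query scores 0 and therefore sinks to the bottom of the ranked list, which is why full-form matching alone can miss relevant verses.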
5 Experimental setup
5.1 The study domain
To test the performance of our extended model for retrieving diacritisized Arabic text, we have chosen the Quran. The scripts of the Quran are diacritisized to preserve the pronunciation and the meaning of its words, which can be totally changed in the absence of diacritics. Our approach is based on searching the text of the Quran without having to type the diacritics. We improved the search process through automatic query expansion using a rule-based stemmer and a thesaurus. Users need only type in the words they want to search for, and their queries are automatically augmented with all morphological variations of the query’s words to expand the search. Next, we investigated the effectiveness of applying a thesaurus of semantic classes to expand the search. The obtained results are promising and open directions for enhancing the capabilities of search engines and other applications, such as question answering and information extraction from the Quran and other Arabic resources available on the Internet.
5.2 Data acquisition
The Quran chapters (suras) are split into verses (ayat). The Tokenizer tokenizes the verses into words (tokens), while a rule-based stemmer peels the common affixes (prefixes, infixes and suffixes) from the word to simplify it to a root form. In our extended model we provide four types of indexes:
1. The diacritic index: contains the original words of the Quran.
2. The diacritic-less index: contains the words of the Quran after removing their diacritics.
3. The stem index: contains the roots obtained automatically from the rule-based stemmer.
4. The thesaurus index: contains semantic word classes to expand the query during the search.
The Quran consists of (114) chapters (suras), where each surah is generally known by a name. Also the Quran contains a total of (6,236) verses (ayat), a total of (77,845) words and a total of (1,767) distinct roots.
5.3 The data model of the extended search engine
An IR system for the English language has previously been implemented and tested using a relational database management system (RDBMS) (Lundquist et al. 1997). It has been argued that designing an IR system based on an RDBMS retains the benefits of fast and sophisticated retrieval and of portability across different platforms, and makes it possible to use the security and integrity features built into the RDBMS itself (Lundquist et al. 1997). In this work, we adapted the idea of integrating the relational database model with an IR system to store and manipulate the Quran scripts. The data model of our extended engine is depicted in Fig. 2. The IR system has been coded in Java, while the database has been designed using SQL Server. The system has been tested on a Pentium IV machine running Windows XP.
5.4 Processing the scripts of the Quran
5.4.1 Preprocessing the scripts
The scripts of the Quran undergo a preprocessing phase before the words are extracted and the indexes are built. The system automatically triggers a tokenizer module to split the chapters into verses at the verse Unicode boundary marker (), which marks the end of each verse in the Quran. Other marks such as (۞ ۩), which are used to organize the Quran into parts and sections, also have to be removed. Words in the Quran are bounded by white spaces, which makes it very easy to identify words and verify their correctness. Finally, a set of letters and marks such as (ۜۙۘۗۖ), which are used for reciting the Quran, must also be removed before the tokenization process can take place. At the end of the tokenization process the words are ready to be stemmed and indexed.
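The preprocessing steps above can be sketched as follows. This is a minimal illustration, not the authors' tokenizer: the verse boundary character is not visible in the text, so the sketch assumes it is the Unicode END OF AYAH character (U+06DD), and it ignores any verse numbers that may follow the marker in a real edition. The names `split_verses` and `tokenize` are hypothetical.

```python
import re

# Assumption: the verse boundary marker is U+06DD (ARABIC END OF AYAH).
END_OF_AYAH = "\u06DD"
# Section marks named in the text: U+06DE (rub el hizb) and U+06E9 (sajdah).
SECTION_MARKS = "\u06DE\u06E9"
# Recitation marks listed in the text: U+06D6 to U+06D9 and U+06DC.
RECITATION_MARKS = "\u06D6\u06D7\u06D8\u06D9\u06DC"

_REMOVE = re.compile("[" + SECTION_MARKS + RECITATION_MARKS + "]")


def split_verses(chapter_text):
    """Split a chapter into verses at the assumed end-of-verse marker."""
    return [v.strip() for v in chapter_text.split(END_OF_AYAH) if v.strip()]


def tokenize(verse):
    """Drop section and recitation marks, then split on white space."""
    return _REMOVE.sub("", verse).split()
```

The diacritics themselves are left untouched at this stage, since the diacritic index needs the words exactly as they appear.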
5.4.2 Building the inverted lists (indexes)
5.4.2.1 The diacritic index
This index contains the diacritisized full-form words of the Quran (and their frequencies) as they were obtained from the tokenizer without any further processing (i.e., keeping their diacritics intact). This index is not used directly for searching, because typing the diacritics is not easy and missing a single diacritic vowel leads to an unsuccessful match most of the time. Instead, the words of this index that share the same roots as the query’s bag-of-words are automatically added to the original query to expand the search during query expansion (QE).
5.4.2.2 The diacritic-less index
This index has the distinct words of the Quran after they have been processed by removing their diacritical marks and unifying the alef-hamza character to alef (i.e., converting آأإ to ا). The diacritic-less index is about 22.6% smaller than the diacritic index. This index is the primary index used by our search engine to answer users’ queries. At least one morphological form of each word of the Quran is located in this index. The other forms are obtained from the diacritic index during the QE process.
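The normalization behind the diacritic-less index can be illustrated with a short sketch. It assumes that "diacritical marks" covers every Unicode nonspacing mark (category `Mn`), which is broader than the short vowels alone and is an assumption of this example rather than a statement about the original system. The function names are hypothetical.

```python
import unicodedata

# Alef with madda, hamza above, and hamza below are all mapped to bare alef,
# as described in the text.
ALEF_VARIANTS = {"\u0622": "\u0627", "\u0623": "\u0627", "\u0625": "\u0627"}


def strip_diacritics(word):
    """Remove all nonspacing marks (assumed to cover the Arabic diacritics)."""
    return "".join(ch for ch in word if unicodedata.category(ch) != "Mn")


def normalize(word):
    """Produce the diacritic-less, alef-unified form of a word."""
    bare = strip_diacritics(word)
    return "".join(ALEF_VARIANTS.get(ch, ch) for ch in bare)
```

Applying `normalize` to every token and keeping only the distinct results yields a smaller vocabulary than the diacritic index, because several diacritisized forms collapse onto one bare form.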
5.4.2.3 The stem index
Words obtained from the tokenizer are automatically stemmed before they are added to the stem index. We used a rule-based stemmer, written in Java, with the permission of Khoja (1999). A total of 1,767 distinct roots (including proper names appearing in the Quran) were identified. Each stem in the index has a 1-to-m relationship to entities in the diacritic index, the diacritic-less index and the thesaurus index, respectively. The root index is about 88.2% smaller than the diacritic-less index. An example of this relationship is explained in Fig. 3.
5.4.2.4 The thesaurus index
A thesaurus is a structure that manages the complexities of terminology in language and provides semantic relationships (such as synonymy) between terms. A thesaurus can be built manually, or automatically by collecting the key words of documents and classifying and organizing them. In our work, we benefited from the ongoing work of Kubaisi (2006) to construct a thesaurus for the Quran. In his work, he grouped words that carry similar meanings into semantic classes to help people understand the Quran better. We have compiled these semantic classes into an experimental thesaurus. So far, the index contains (100) semantic groups, where each group is made of 3–6 synonyms. In total, the thesaurus index holds around (500) words. Experiments show that using thesauri increases the recall, sometimes at the expense of precision (Xu et al. 2002).
5.4.3 Using the stemmer
Stemming is the task of mapping several words onto one base form. It has a relatively low processing cost and uses morphological heuristics to remove affixes from words before indexing. It reduces the index size, and it usually improves the results slightly (Strzalkowski and Vauthey 1992). This makes stemming very attractive for many natural language processing (NLP) applications such as IR, information extraction (IE), question answering (QA), machine translation (MT), text summarization (TS), etc.
Arabic stemming is more complicated than English stemming. Most words of the Arabic language are constructed from three-consonant roots by following fixed patterns. Patterns include prefixes, infixes and suffixes to indicate number, gender and tense. Arabic stemming is the process of removing all affixes from a word to extract its root. A stemmer for Arabic, for example, should identify the strings kateb كاتب (writer), ketab كتاب (book), maktabah مكتبه (library) and maktab مكتب (office) as one base form ktb كتب (he wrote).
In our research, we used a rule-based stemmer, developed by Khoja (1999), to experiment with Arabic passage retrieval and QA (Hammo et al. 2004). In this experiment, the stemmer performed reasonably, with accuracy close to (95%). We observed that most of the failing cases were due to stemming proper names such as the names of Prophets, angels, ancient cities, places and people, numerals, as well as words with doubled characters (represented using the diacritic shada). To verify the correctness of the stems, we compared the generated stems with a list of manually extracted and verified roots of Al-Quran (Khadir 2002). We corrected the mistaken ones and added the missing ones. Finally, the diacritic index, the diacritic-less index and the thesaurus index were linked to the stem index using their stem-id fields. An example of the association between indexes is given in Fig. 3.
In Fig. 3, the root (جوع jawaa (make someone hungry)) is associated with the words ((hunger الجوع), (getting hungry تجوع), (make someone hungry جوع), and (and hunger والجوع)). It is also associated with the diacritisized forms of the same words available in the diacritic index. Finally, the root has an association with the semantic class (hunger الجوع), which contains the synonyms (خصاصة، مسغبة، مخمصة). In the next section, we explain how this association enhanced the effectiveness and the performance of our search engine.
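The stem-id linkage between the indexes can be pictured with a small relational sketch. It uses Python's built-in `sqlite3` module with an in-memory database purely for illustration, whereas the system described here used SQL Server; the table and column names are hypothetical, since the actual data model of Fig. 2 is not reproduced in the text. The rows are the word forms and synonyms listed above for the root جوع.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: every word table points at the stem table via stem_id.
conn.executescript("""
CREATE TABLE stem (stem_id INTEGER PRIMARY KEY, root TEXT UNIQUE);
CREATE TABLE plain_word (word TEXT, stem_id INTEGER REFERENCES stem(stem_id));
CREATE TABLE thesaurus_word (word TEXT, stem_id INTEGER REFERENCES stem(stem_id));
""")
conn.execute("INSERT INTO stem VALUES (1, 'جوع')")
conn.executemany("INSERT INTO plain_word VALUES (?, 1)",
                 [("الجوع",), ("تجوع",), ("جوع",), ("والجوع",)])
conn.executemany("INSERT INTO thesaurus_word VALUES (?, 1)",
                 [("خصاصة",), ("مسغبة",), ("مخمصة",)])


def words_for_root(root):
    """Collect every diacritic-less form and synonym linked to a root."""
    rows = conn.execute(
        """
        SELECT w.word FROM stem s
          JOIN plain_word w ON w.stem_id = s.stem_id WHERE s.root = ?
        UNION
        SELECT t.word FROM stem s
          JOIN thesaurus_word t ON t.stem_id = s.stem_id WHERE s.root = ?
        """,
        (root, root),
    )
    return sorted(row[0] for row in rows)
```

A table for the diacritic index would follow the same pattern and is omitted here to keep the sketch short.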
6 Experiments and results
6.1 Data set and test collection
In this paper, we used the scripts of the Quran and a collection of 40 diacritic-less queries obtained from native Arabic speakers. Each person was asked to provide 4 queries that he could remember from the Quran. The list (without modifications) is given in Table 5.
6.2 Experimental results
6.2.1 Searching the diacritic-less index using full-form words
In this experiment, we tested our system using the full-form words (i.e., as they appear in the queries without modifications). Table 6 shows the results of searching the diacritic-less index for the queries listed in Table 5. The results of running this experiment are given in Fig. 4.
The system failed in the cases where an exact match could not be found. Table 7 gives a sample of the entries actually present in the diacritic-less index that could not satisfy some of the queries. It is obvious that failures, in most cases, were due to missing either the diacritics or the affixes (i.e., prefixes, infixes, suffixes) that are attached to the original words.
6.2.1.1 Discussion and findings
In most cases, the system failed to find results that satisfy the full-form of the query’s bag-of-words. For instance, (Q# 4, 5, 13–14, 17, 18, 20, 22, 24–26, 29–31, 35–37 and 39) failed to find any match in the diacritic-less index. The following two examples explain the results of the search engine during this experiment:
Example 1
Query # 13: مستور (covered).
The system could not find an exact match for this query. However, the diacritic-less index includes the words سترا (shield) and مستورا (covered), as shown in Table 7. Although the verses containing these two words are relevant to the query, the system failed to return them because it could not match these words with the word(s) of the query.
Example 2
Query # 15: جوع (hunger).
The system finds two exact matches for this query. Two verses containing the word (جوع hunger) were retrieved. In addition, the diacritic-less index also contains three morphological forms of this word: (تجوع to be hungry), (الجوع starvation), (والجوع and starvation). Unfortunately, the system failed to retrieve these relevant verses because they do not match the query’s bag-of-words. In the following experiments, we show how our system can easily solve this problem and retrieve these verses through query expansion using a stemmer and a thesaurus, respectively.
6.2.2 The effectiveness of QE
QE can be defined as the process of reformulating the query’s bag-of-words to overcome the problem of mismatching potential documents and improving the performance of a search engine by retrieving the documents that are more relevant (of better quality), or at least equally relevant (Qiu and Frei 1993; Vectomova and Wang 2006). Without query expansion, the documents which have the potential to be relevant to the user’s query may not be retrieved at all. Many QE techniques have been investigated in the IR literature. They include:
- QE through synonymy. This is performed through finding words in a thesaurus that are synonymous to the words in the query.
- QE through stemming. This is performed by augmenting the query’s bag-of-words with their morphological variations that share the same stems.
- QE through word sensing. This is performed through sensing the words to resolve ambiguity from a specialized database such as the WordNet.
- QE through fixing spelling errors. This is performed through fixing spelling errors and automatically searching for the corrected form of the words.
- QE through paraphrasing. This is performed by rewriting the terms of the original query.
Some QE techniques such as synonymy and stemming have been criticized for increasing the total recall at the expense of lowering the precision. Other techniques like word sense disambiguation (WSD) tend to increase the precision. However, in addition to increasing the recall, augmenting the user’s query with synonyms and morphological variations and ranking by the occurrences of the query’s words causes documents with more matching terms to migrate toward the top of the ranked list, hence leading to higher performance. In the next sections we discuss our experimental results for QE using a stemmer and a thesaurus.
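Since the trade-off above is stated in terms of recall and precision, a short sketch of how the two measures are computed for one query may help. It uses the standard set-based definitions; the function name `precision_recall` is hypothetical, and the identifiers in the usage note are made-up placeholders rather than results from this study.

```python
def precision_recall(retrieved, relevant):
    """Set-based precision and recall for a single query.

    precision = |retrieved AND relevant| / |retrieved|
    recall    = |retrieved AND relevant| / |relevant|
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

For instance, if five verses are relevant and an unexpanded query returns two of them and nothing else, `precision_recall(["v1", "v2"], ["v1", "v2", "v3", "v4", "v5"])` gives full precision but low recall; an expanded query that returns four relevant verses plus one irrelevant verse trades a little precision for much higher recall.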
6.2.3 QE through stemming
In many cases, using the full-form query bag-of-words for searching may not give good results, and in some cases gives no results at all (some examples were given in Table 7). This is because of the variation in morphological structure between the words in the corpus and the query’s bag-of-words, which most of the time ends in no match. Therefore, in our modified search engine QE is done automatically to find all verses that have words related to the roots of the query’s bag-of-words. The process starts with running the stemmer against the query’s bag-of-words. For each root we obtain from stemming the query, we search the root index for all its associations within the diacritic and diacritic-less indexes. All words satisfying this condition are added to the original query’s bag-of-words. The new expanded query is then submitted again to search for all occurrences of documents (in our case, verses) that have these words.
Running the experiment after expanding the search through stemming was very efficient and satisfactory. The results of QE through stemming are listed in Table 8. QE through stemming outperforms the full-form technique, as clearly indicated by Fig. 5. Generally speaking, stemming improves the recall as well as the precision. Larkey et al. (2002) report that light stemmers can perform better than root-extraction stemmers. The choice between root extraction and light stemming depends on the source of the collection. Nevertheless, the results obtained in this experiment show that this technique, like light stemming, is very practical, especially for QA systems (Hammo et al. 2004).
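The expansion loop described above can be sketched in a few lines. This is an illustration under stated assumptions: a lookup table (`WORD_TO_ROOT`) stands in for the rule-based stemmer, the root index is a plain dictionary, and the "verses" are toy token lists with made-up identifiers rather than actual Quran text. All names are hypothetical.

```python
# Hypothetical stand-in for the stemmer: maps a surface form to its root.
WORD_TO_ROOT = {"جوع": "جوع", "الجوع": "جوع", "تجوع": "جوع", "والجوع": "جوع"}
# Hypothetical root index: maps a root to the word forms associated with it.
ROOT_TO_WORDS = {"جوع": ["جوع", "الجوع", "تجوع", "والجوع"]}
# Toy collection: identifiers and token lists are placeholders.
VERSES = {"v1": ["جوع"], "v2": ["الجوع"], "v3": ["تجوع"], "v4": ["كتب"]}


def expand_query(query_words):
    """Add every word form that shares a root with a query word."""
    expanded = list(query_words)
    for word in query_words:
        root = WORD_TO_ROOT.get(word)
        for form in ROOT_TO_WORDS.get(root, []):
            if form not in expanded:
                expanded.append(form)
    return expanded


def search(query_words, verses):
    """Return ids of verses containing at least one query word."""
    terms = set(query_words)
    return sorted(vid for vid, tokens in verses.items() if terms & set(tokens))
```

In this toy collection the unexpanded query matches only the verse holding the exact form, while the expanded query also reaches the verses holding the other forms of the same root.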
6.2.3.1 Discussion and findings: Example 3
The result of running query # 15: جوع (hunger) using a stemmer.
Query # 15 can be rewritten through the QE process using a stemmer as explained in Fig. 6. The aim of the expansion is to recover the verses missed in the previous experiment (as discussed in Example 2). In this experiment the query undergoes stemming to identify all its roots. For each valid root in the query, the system automatically searches the root index and adds to the original query all words (from the diacritic index and the diacritic-less index) that have an association with this root. The new query is then used to search the indexes for all potential occurrences of the new bag-of-words. Expanding query # 15 adds 7 more words to the query and returns 5 verses instead of 3 as in the previous example. All 5 retrieved verses are relevant to the query. The new expanded query is shown in Fig. 6.
6.2.4 QE through thesaurus
In this experiment, we benefited from the ongoing work of Kubaisi (2006) to construct a thesaurus for the Quran. We have compiled sets of semantic classes into an experimental thesaurus. So far, the thesaurus index contains 100 semantic groups, where each group is made of 3–6 synonyms. In total, the index contains around 500 words. The association between the root index and the thesaurus index makes QE using the thesaurus very straightforward. Words from the thesaurus that are related to the query’s bag-of-words are added automatically to the original query to expand the search. A comparison between QE using a stemmer and QE using a stemmer and a thesaurus is depicted in Fig. 7.
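The link between the root index and the thesaurus index can be illustrated with a small Python sketch. It is an assumption-laden example, not the actual database schema: `SEMANTIC_GROUPS`, `ROOT_TO_GROUPS` and `expand_with_thesaurus` are hypothetical names, and the single semantic group shown is illustrative rather than taken from the experimental thesaurus.

```python
# Illustrative semantic group of hunger-related words; NOT the contents
# of the experimental thesaurus described in the text.
SEMANTIC_GROUPS = {
    "hunger": ["جوع", "مخمصة", "مسغبة"],
}

# Hypothetical association between roots and semantic groups.
ROOT_TO_GROUPS = {"جوع": ["hunger"]}


def expand_with_thesaurus(query_words, roots):
    """Add the members of every semantic group linked to the given roots."""
    expanded = list(query_words)
    for root in roots:
        for group in ROOT_TO_GROUPS.get(root, []):
            for synonym in SEMANTIC_GROUPS.get(group, []):
                if synonym not in expanded:
                    expanded.append(synonym)
    return expanded


# The roots would normally come from stemming the query first.
print(expand_with_thesaurus(["جوع"], roots=["جوع"]))
```

In this sketch the thesaurus step runs after stemming, so the words it contributes are added on top of those obtained from the root index.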
The results obtained from using the thesaurus outperformed the results obtained from the stemmer alone. In Table 9 we give the semantic groups related to the data set used for testing and the results obtained from this experiment.
6.2.4.1 Discussion and findings: Example 4
The result of running query # 15: جوع (hunger) using a stemmer and a thesaurus.
Again, query # 15 can be rewritten through the QE process using a thesaurus, as in Fig. 8. The aim of the expansion is to extend the search using words that are related in meaning to the query’s bag-of-words. By doing the expansion we hope to retrieve more documents that are relevant to the original query, even if the query itself does not contain these words.
In this experiment the query is stemmed to identify all its roots. For each valid root, the system automatically searches the root index and adds to the original query all words (from the diacritic index, the diacritic-less index and the thesaurus index) that are associated with this root. The new query is then used to search the indexes for potential occurrences of the new bag-of-words. The expansion of query # 15 adds 10 more words to the query and returns 8 verses, which are all relevant to the query. The new expanded query is given in Fig. 8.
6.3 Comparing the results of the experiments
A comparison of the different techniques applied to the data set over the three runs is shown in Fig. 9. As indicated by this chart, applying QE techniques improves the results dramatically: using a stemmer and a thesaurus outperformed the original search using the full form of words. The improvement in recall and precision for the QE process is given in Fig. 10. Again, it indicates that using a stemmer and a thesaurus improves AIR search engines.
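For reference, recall and precision for a single query can be computed from the set of retrieved verses and the set of verses relevant to the query. The Python sketch below uses the standard definitions; the verse identifiers are hypothetical, and the assumption that exactly eight verses are relevant is made only for illustration and is not a figure reported in the tables.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = relevant verses retrieved / verses retrieved
    recall    = relevant verses retrieved / verses relevant to the query
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical verse IDs; eight relevant verses is an assumption.
relevant = {f"v{i}" for i in range(1, 9)}
runs = {
    "full-form": {"v1", "v2", "v3"},
    "stemmer": {"v1", "v2", "v3", "v4", "v5"},
    "stemmer + thesaurus": set(relevant),
}
for name, retrieved in runs.items():
    print(name, precision_recall(retrieved, relevant))
```

Under these assumptions precision stays at 1.0 in all three runs while recall rises as the query is expanded, which is the pattern the expansion is meant to produce.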
7 Conclusion and future work
In this article, we explained the problem of searching diacritisized text using current international search engines such as Google and Yahoo. We provided a framework solution for the searching problem through indexing. We investigated the use of QE for searching the Quran scripts in the absence of diacritics, where queries are automatically augmented with related terms extracted from diacritic and diacritic-less indexes by applying a stemmer and a thesaurus of semantic classes. We conducted a set of experiments to test our system on a data set of 40 queries and the scripts of the holy Quran. We found that QE for searching Arabic text is promising and it is likely that the efficiency can be further improved.
Applications such as IE, QA, TS, and MT are a few examples of NLP applications that rely extensively on extracting concepts from web documents. This process requires analyzing the document content morphologically, syntactically or semantically; therefore, new search engines equipped with tools to integrate and derive new meaning from Arabic repositories need to be investigated. Our system could be improved by adding a more sophisticated morphological analyzer, POST, and an Arabic ontology.
Abbreviations
- AIR: Arabic information retrieval
- AWN: Arabic WordNet
- PBUH: Peace be upon him
- CLIR: Cross language information retrieval
- IE: Information extraction
- IR: Information retrieval
- LSI: Latent semantic indexing
- MSA: Modern standard Arabic
- MT: Machine translation
- NLP: Natural language processing
- POST: Part of speech tagging
- QA: Question answering
- QARAB: Question answering system for Arabic
- RDBMS: Relational database management system
- QE: Query expansion
- SQL: Structured query language
- TS: Text summarization
- VR: Verses retrieved
- VRQ: Verses relevant to query
- VRS: Verses retrieved using stemmer
- VRT: Verses retrieved using thesaurus
- VRW: Verses retrieved using words
- VSM: Vector space model
- WSD: Word sense disambiguation
References
Abdelali, A., Cowie, J., Farwell, D., Ogden, W., & Helmreich, S. (2003). Cross-language information retrieval using ontology. In Proceedings of TALN ’2003, Batz-sur-Mer, France.
Abdelali, A., Cowie, J., & Soliman, H. (2004). Arabic information retrieval perspectives. In Proceedings of JEP-TALN 2004 Arabic Language Processing.
Abdelali, A., Cowie, J., & Soliman, H. (2006). Improving query expansion precision using latent semantic analysis: Application on Arabic retrieval. Journées d’Etudes sur le Traitement Automatique de la Langue Arabe (JETALA), Rabat, Morocco.
AbdulJaleel, N., & Larkey, L. (2003). Statistical transliteration for English-Arabic cross language information retrieval. In Proceedings of the Twelfth International Conference on Information and Knowledge Management, pp. 139–146.
Aljlayl, M., Frieder, O., & Grossman, D. (2002). On Arabic-English cross-language information retrieval: A machine translation approach. In Proceedings of the Third International Conference on Information Technology, pp. 2–7.
Al-Maskari, A., Sanderson, M., & Clough, P. (2007). Arabic users’ satisfaction with the online information as obtained from Google. In Proceedings of Sixth International Conference on Conceptions of Library and Information Science (CoLIS).
Bar-Ilan, J., & Gutman, T. (2005). How do search engines respond to some non-English queries? Journal of Information Science, 31(1), 13–28.
Black, W., & Elkateb, S. (2004). A prototype English-Arabic dictionary based on WordNet. In Proceedings of 2nd Global WordNet Conference, (GWC 2004), pp. 67–74.
Black, W., Elkateb, S., Rodriguez, H., Alkhalifa, M., Vossen, P., Pease, A., et al. (2006). Introducing the Arabic wordnet project. In Proceedings of the Third International WordNet Conference, (GWC 2006), pp. 295–299.
Buckwalter, T. (2007). Issues in Arabic morphological analysis. In A. Soudi, A. Van den Bosch, & G. Neumann (Eds.), Arabic computational morphology (pp. 23–41). Netherlands: Springer. ISBN 978-1-4020-6045-8.
Debili, F., Achour, H., & Souissi, E. (2002). De l’étiquetage grammatical à la voyellation automatique de l’arabe. Correspondances (Vol. 71, pp. 10–28). Tunis: Institut de Recherche sur le Maghreb Contemporain.
Diab, M. (2007). Improved Arabic base phrase chunking with a new enriched POS tag set. In Proceedings of the 5th Workshop on Important Unresolved Matters, pp. 89–96.
Dumais, S., Landauer, T., & Littman, M. (1996). Automatic cross-linguistic information retrieval using latent semantic indexing. In SIGIR’96-Workshop on Cross-Linguistic Information Retrieval, pp. 16–23.
Elkateb, S., & Black, W. (2004). A bilingual dictionary with enriched lexical information. In Proceedings of NEMLAR Cairo, Egypt 2004 Arabic Language Tools and Resources, pp. 79–84.
El-Helw, A., & Aly, H. (2004). An intelligent database application for the semantic web. In Proceedings of CSITeA-04 Conference, Cairo, Egypt.
El-Sadany, T., & Hashish, M. (1988). Semi-automatic vowelization of Arabic verbs. In 10th NC Conference, Jeddah, Saudi Arabia.
French, J., Powell, A., Gey, F., & Perelman, N. (2001). Exploiting a controlled vocabulary to improve collection selection and retrieval effectiveness. In Proceedings of the Tenth International Conference on Information and Knowledge Management, pp. 199–206.
Gal, Y. (2002). An HMM approach to vowel restoration in Arabic and Hebrew. In Proceedings of ACL-02 Workshop on Computational Approaches to Semitic Languages, pp. 27–33.
Gey, F., Kando, N., & Peters, C. (2002). Cross language information retrieval: A research roadmap. ACM SIGIR Forum, 36(2), 72–80.
Grefenstette, G. (1996). Cross-linguistic information retrieval workshop. In Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in IR, p. 344.
Grefenstette, G., Semmar, N., & Elkateb-Gara, F. (2005). Modifying a natural language processing system for European languages to treat Arabic in information processing and information retrieval applications. In Proceedings of the ACL Workshop on Computational Approaches to Semitic Languages, pp. 31–38.
Hammo, B., Abu-Salem, H., Lytinen, S., & Evens, M. (2002). QARAB: A question answering system to support the Arabic language. In Proceedings of ACL-02 Workshop on Computational Approaches to Semitic Languages, pp. 55–65.
Hammo, B., Abuleil, S., Lytinen, S., & Evens, M. (2004). Experimenting with a question answering system for the Arabic language. Computers and the Humanities, 38(4), 379–415.
Hayashi, Y., Kikui, G., & Susaki, S. (1997). TITAN: A cross-linguistic search engine for the WWW. In Working Notes of AAAI Spring Symposium on Cross-Language Text and Speech Retrieval, pp. 58–65.
Hedlund, T., Airio, E., Keskustalo, H., Lehtokangas, R., Pirkola, A., & Järvelin, K. (2004). Dictionary based cross-language information retrieval: Learning experiences from CLEF 2000–2002. Information Retrieval, 7(1), 99–119.
Hull, D., & Grefenstette, G. (1996). Querying across languages: A dictionary-based approach to multilingual information retrieval. In Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 49–57.
Kampas, J. (2004). Improving retrieval effectiveness by reranking documents based on controlled vocabulary. Lecture Notes in Computer Science, 2997, 283–295.
Khadir, M. (2002). Quran lexicon. Retrieved April 10, 2008 from http://www.al-mishkat.com/words/book.htm.
Khoja, S. (1999). Stemming Arabic text. Retrieved June 20, 2007 from http://zeus.cs.pacificu.edu/shereen/research.htm.
Khoja, S. (2001). APT: Arabic part-of-speech tagger. In Proceedings of the Student Workshop at NAACL 2001, pp. 20–25.
Kirchhoff, K., & Vergyri, D. (2005). Cross-dialectal data sharing for acoustic modeling in Arabic speech recognition. Speech Communication, 46(1), 37–51.
Kubaisi, A. (2006). Quran words. Retrieved April 10, 2008 from http://www.islamiyyat.com/kalema.htm.
Landauer, T., & Littman, M. (1990). Fully automatic cross-language document retrieval using latent semantic indexing. In Proceedings of the Sixth Annual Conference of the UW Centre for the New Oxford English Dictionary and Text Research, pp. 31–38.
Larkey, L., Ballesteros, L., & Connell, M. (2002). Improving stemming for Arabic information retrieval: Light stemming and co-occurrence analysis. In Proceedings of the 25th Annual International ACM SIGIR Conference on Research & Development in IR, pp. 275–282.
Larkey, L., & Connell, M. (2005). Structured queries, language modeling, and relevance modeling in cross-language information retrieval. Information Processing and Management: An International Journal, 41(3), 457–473.
Lazarinis, F. (2007). Web retrieval systems and the Greek language: Do they have an understanding? Journal of Information Science, 33(5), 622–636.
Lundquist, C., Frieder, O., Holmes, D., & Grossman, D. (1997). A parallel relational database management system approach to relevance feedback in information retrieval. Journal of the American Society of Information Science (JASIS), 50(5), 413–426.
Moukdad, H. (2004). Lost in cyberspace: How do search engines handle Arabic queries? In Proceedings of the 32nd Annual Conference of the Canadian Association for Information Science. Retrieved October 1, 2008 from www.cais-acsi.ca/proceedings/2004/moukdad_2004.pdf.
Moukdad, H., & Cui, H. (2005). How do search engines handle Chinese queries? Webology, 2(3). Retrieved October 1, 2008 from www.Webology.ir/2005/v2n3/a17.html.
Oard, D. (1998). A comparative study of query and document translation for cross-language information retrieval. In Proceedings of the 3rd Conference of the Association for Machine Translation in the Americas, pp. 472–483.
Oard, D. (2000). Evaluating interactive cross-language information retrieval: Document selection. In Cross-Language Information Retrieval and Evaluation, Workshop of Cross-Language Evaluation Forum, CLEF 2000, pp. 57–71.
Pirkola, A., Hedlund, T., Keskustalo, H., & Järvelin, K. (2001). Dictionary-based cross-language information retrieval: Problems, methods, and research findings. Information Retrieval, 4(3–4), 209–230.
Qiu, Y., & Frei, H. (1993). Concept based query expansion. In Proceedings of the 16th ACM SIGIR International Conference on Research and Development in IR, pp. 160–169.
Salton, G., & McGill, M. (1983). Introduction to modern information retrieval. New York: McGraw-Hill Book Company.
Salton, G. (1989). Automatic text processing—the transformation analysis and retrieval of information by computer. MA: Addison Wesley.
Semmar, N., & Fluhr, C. (2007). Arabic to French sentence alignment: Exploration of a cross-language information retrieval approach. In Proceedings of the 5th Workshop on Important Unresolved Matters, pp. 73–80.
Semmar, N., Laib, M., & Fluhr, Ch. (2006). Using stemming in morphological analyzer to improve Arabic information retrieval. In Proceedings of TALN 2006, pp. 317–327.
Sroka, M. (2000). Web search engines for Polish information retrieval: Questions of search capabilities and retrieval performance. International Information & Library Research, 32(2), 87–98.
Strzalkowski, T., & Vauthey, B. (1992). Information retrieval using robust natural language processing. In Proceedings of ACL-92, pp. 104–111.
Talvensaari, T., Juhola, M., Laurikkala, J., & Järvelin, K. (2007). Corpus-based cross-language information retrieval in retrieval of highly relevant documents: Research articles. Journal of the American Society for Information Science and Technology, 58(3), 322–334.
Vectomova, O., & Wang, Y. (2006). A study of the effect of term proximity on query expansion. Journal of Information Science, 32(4), 324–333.
Virga, P., & Khudanpur, S. (2003). Transliteration of proper names in cross-lingual information retrieval. In Proceedings of the ACL 2003 Workshop on Multilingual and Mixed-language Named Entity Recognition, Vol. 15, pp. 57–64.
Xu, J., Fraser, A., & Weischedel, R. (2002). Empirical studies in strategies for Arabic retrieval. In Proceedings of the 25th Annual International ACM SIGIR Conference on Research & Development in Information Retrieval, pp. 269–274.
Zaidi, S., & Laskri, M. (2005). A cross-language information retrieval based on an Arabic ontology in the legal domain. In Proceedings of the International Conference on Signal-Image Technology and Internet-Based Systems (SITIS’05), pp. 86–91.
Zitouni, I., Sorensen, J., & Sarikaya, R. (2006). Maximum entropy based restoration of Arabic diacritics. In Proceedings of the 21st International Conference on Computational Linguistics, pp. 577–584.
Acknowledgment
We would like to thank Shereen Khoja for providing her stemmer, Prof. Nadim Obeid for his valuable suggestions to improve this work, and Mahmoud El-Hajj for helping with the construction of the thesaurus and the database implementation.
Cite this article
Hammo, B.H. Towards enhancing retrieval effectiveness of search engines for diacritisized Arabic documents. Inf Retrieval 12, 300–323 (2009). https://doi.org/10.1007/s10791-008-9081-9