
EUSFLAT - LFA 2005

Personalized Information Retrieval System in the Framework of Fuzzy Logic

M. Oussalah and A. Eltigani
University of Birmingham,
Electronics, Electrical and Computing,
Edgbaston, B15 2TT, UK
M.Oussalah@bham.ac.uk

Abstract

Due to the increase in web-based applications, the need for enhanced information retrieval systems that accommodate users' needs has become crucial. Most commercial information retrieval systems are based on the standard Boolean model and, to a lesser extent, on vector models. Although the deficiencies of these models are now part of textbook knowledge, the development of new models still has to overcome the feasibility and testing challenge. This paper advocates a fuzzy-based approach to information retrieval in which a new model is put forward. Its feasibility and performance are demonstrated through testing on a large-scale university database, and the results are compared to a standard commercial Boolean model.

Keywords: information retrieval, fuzzy logic

1. Introduction

With the continuing and exponential growth of internet-based applications affecting the daily life of all segments of the population, from children to the elderly and from highly skilled to unskilled users, the need for flexible information retrieval systems that accommodate the needs of various types of users has become very relevant. This is at the heart of e-business companies, which attempt to respond to various interests and gain popularity. Needless to say, as a result of the boom in internet technology and storage capacity, the web contains a huge amount of information; in turn, users face a large variety of Web pages and often waste a lot of time searching. To alleviate this difficulty, many tools have been developed and used on the Web. In this respect, several studies have focused on information retrieval systems to increase their flexibility and attractiveness. In particular, personalizing the customer profile and information management have become hot topics in the literature over the last two decades [10, 14].

In its basic concept, any information retrieval system (IRS) makes use of a set of index terms, which correspond to basic words used in all possible documents of the database, and finds those documents whose index terms match those found in the user's query. The central concept in any information retrieval system is the notion of relevance [10, 14, 15], which measures the extent to which a given document is found relevant to the query formulated by the user. Several mathematical models, including Boolean, vector and probabilistic ones, among others, have been developed to determine the weights associated with each document and thereby identify the most relevant ones. In recent years, attempts to construct fuzzy-logic-based information retrieval as an extension of Boolean models have also been proposed [1, 3, 5, 7, 8, 9].

Since the early work of Kraft and his team in the early eighties [5], several extensions of fuzzy retrieval models have been put forward in the fuzzy community. These include various eliciting procedures to determine fuzzy sets of index terms over the set of documents [5], the application of various fuzzy set operators to extend the logical operations [2], the use of fuzzy quantifiers to handle linguistic quantifiers like "very", "popular", etc., instead of treating them as separate index terms [6], and the use of clustering techniques to identify documents that are close to the query according to an underlying metric [12, 13], among others. Unfortunately, the application of fuzzy set theory to IR is not yet mature from both theoretical and practical standpoints, due to a lack of extensive testing and evaluation methodologies. This explains why fuzzy models are not yet a standard in IR applications; further feasibility and comparative studies are required to build their reputation.

This paper investigates an information retrieval system in the framework of fuzzy logic in which an attempt to deal with semantic document representation is accomplished. The results of the information retrieval system are evaluated in terms of precision and recall, and a comparison with a standard Boolean information retrieval model is carried out.


2. From Boolean to fuzzy models of information retrieval systems

The three key components of any information retrieval system are [15]:
1. Query representation—representing the user's information request
2. Document representation—representing the text collection
3. Matching function or ranking function—ranking the documents according to their relevance.

To automatically evaluate a user query, an IRS needs a formal representation of the documents' content, which captures and synthesizes the meaning of a document written in a natural language. This is usually referred to as the indexing mechanism, which constitutes the basis of document representation. It also encodes the importance of each index term in the document representation by allocating a weight value to each index. Obviously the weighting should be able to convey the semantics of the document. A sophisticated query language is therefore not sufficient to produce effective results if the representation of documents introduces a severe loss of information. In almost all commercial information retrieval systems, indexing mechanisms are based on counting the frequency of the index terms within the whole document. The issue of correlation among index terms is often neglected.

The matching function allows us to identify those documents that match the user's request. It also enables ranking the documents according to their degrees of relevance.

In the Boolean model, which is common in commercial information retrieval systems, each document is given as a vector over all index terms, with an associated binary weight indicating whether a specific index term is present in the document or absent, while a query is represented as a set of index terms connected by logical operators like AND, OR, NOT. A query and a given document are matched if the index terms contained in the document satisfy the query. In other words, a document d answers a query q if the implication D ⇒ Q holds, where D and Q stand for the logical formulae of document d and query q, respectively, in terms of the index term representation. It is well known in this model that the more index terms are involved in an AND operator (in the query), the fewer documents are retrieved, and, vice versa, leaving out one or two index terms may result in a (too) large set of retrieved documents. It is also well known that i) a good query is difficult to formulate, as the impact of a complex combination of operators is difficult to grasp, ii) the relative importance of index terms cannot be specified, and iii) the retrieved documents cannot be ranked. The need to overcome the above limitations has given rise to several alternative approaches, especially the vector representation, where the weights take values in the unit interval and are determined as the product of the index frequency within the document times the inverse document frequency, while the similarity between the query and the document is given as the cosine of the weight vector of the document and that of the query.

In the meantime, fuzzy set theory has been investigated as a way to relax the binary assumption on the weights, where each index term defines a fuzzy set whose universe of discourse is the set of all documents in the database. The larger the membership grade, the more important the index term is for characterizing the content of the document. More specifically, let f_{i,j} designate the frequency of index term t_i in document d_j. Using the popular term frequency and inverse document frequency in calculating the weight factors of the vector representation, the latter can be normalized to determine the fuzzy set µ_{t_i} associated to each index term t_i over the universe of documents as follows:

    µ_{t_i}(d_j) = ( f_{i,j} / max_k f_{k,j} ) · ( idf(t_i) / max_k idf(t_k) )        (1)

where idf(t_i) is the inverse document frequency of term t_i, given as idf(t_i) = log(N / n_i), N is the total number of documents in the database and n_i is the total number of documents containing index term t_i.

The normalization stage in (1) ensures a membership grade within the unit interval. (1) is very similar to the weight factors used in the vector model of information retrieval [10]. Expression (1) defines for each document d_j its degree of membership to the fuzzy set pertaining to index term t_i. Obviously, if the document d_j does not contain index term t_i, then f_{i,j} = 0, so µ_{t_i}(d_j) = 0. Otherwise, µ_{t_i}(d_j) increases with the frequency of term t_i in document d_j, and decreases with the frequency of the term across all documents.

Now, given a query q expressed as some logical combination of index terms t_i, say q = L(t), it is straightforward to construct the underlying fuzzy set µ_q (defined over the set of documents) using the fuzzy-set extension of the logical operators used in L. Therefore, as in [5, 6], only those documents d_j for which µ_q(d_j) ≥ ξ are considered relevant, where ξ is some threshold value.

Again, the proper choice of the fuzzy operator that extends each logical operator is still open. For instance, the choice of t-norm or t-conorm operators, which extend the logical AND and OR operators respectively, is not discussed in this paper. We refer to some related works which have, at least briefly, investigated the issue.
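As an illustration of (1), the following sketch (a minimal version, assuming plain whitespace tokenization with no stemming or stop-word removal, unlike the implemented system) computes the normalized tf-idf memberships over a small document collection:

```python
import math
from collections import Counter

def memberships(docs):
    """Fuzzy set of each index term over the documents, following (1):
    normalized term frequency times normalized inverse document frequency."""
    N = len(docs)
    tfs = [Counter(d.split()) for d in docs]             # f_{i,j}
    vocab = set().union(*tfs)
    n = {t: sum(t in tf for tf in tfs) for t in vocab}   # n_i
    idf = {t: math.log(N / n[t]) for t in vocab}         # idf(t_i) = log(N / n_i)
    max_idf = max(idf.values())
    mu = {}
    for j, tf in enumerate(tfs):
        max_f = max(tf.values())
        for t in vocab:
            mu[t, j] = (tf[t] / max_f) * (idf[t] / max_idf)
    return mu

docs = ["students have access to computers",
        "people do not like computers and laptops",
        "computer courses are familiar to students"]
mu = memberships(docs)
# mu["students", 0] is log(3/2)/log(3) ~ 0.37; mu["have", 0] is 1.0
```

Without the stemming used in the implemented system, "computer" and "computers" remain distinct terms, so individual values can differ from a stemmed run.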
Rather, an interesting extension of the Boolean retrieval model consists in generalizing the implication (D ⇒ Q) previously pointed out. In this respect, an alternative to (1) consists in finding fuzzy sets pertaining to each document over the set of index terms, such that

    µ_{d_j}(t_i) = µ_{t_i}(d_j)        (2)

and

    µ_q(t_i) = L_k( µ_{t_i}(q_k) )        (3)

where q_k stands for the kth component of the query q used in the logical expression L, such that q = L(q_1, q_2, ...), and L denotes the fuzzy-set extension of L.

Therefore, the relevance of the document d_j to the query can now be expressed as the valuation of the following fuzzy implication:

    Sim(d_j, q) = Σ_{t_k} I( µ_{d_j}(t_k), µ_q(t_k) )        (4)

where I designates a fuzzy implication operator. For instance, using the Lukasiewicz implicator, we have

    Sim(d_j, q) = Σ_{t_k} min( 1, 1 − µ_{d_j}(t_k) + µ_q(t_k) )        (5)

The rationale behind (2) is to use expression (1) to determine the fuzzy set pertaining to each document over the universe of index terms as well. (4) follows a fuzzy extension of the aforementioned D ⇒ Q implication, while taking the arithmetic sum over the universe of discourse allows us to account for the number of terms that appear in both query q and document d_j. This also ensures a single value for evaluating the similarity between document d_j and query q.

Example:
Consider a set of three documents:
d1: "students have access to computers",
d2: "people do not like computers and laptops",
d3: "computer courses are familiar to students",
and the query q: "students have computers OR laptops".

The set of index terms is therefore
K = (students, have, access, to, computers, people, do, like, laptops, courses, are, familiar),
with t1 = students, t2 = have, ..., t12 = familiar.

Applying (1) and (2), we have

    µ_{d1}(t1) = ( f_{1,1} / max_k f_{k,1} ) · ( idf(t1) / max_k idf(t_k) ) = (1/1) · ( log(3/2) / log(3/1) ) = 0.37;
    µ_{d1}(t2) = µ_{d1}(t3) = (1/1) · ( log(3/1) / log(3/1) ) = 1;
    µ_{d1}(t4) = µ_{d1}(t5) = (1/1) · ( log(3/2) / log(3/1) ) = 0.37;

otherwise µ_{d1}(t_i) = 0. Therefore,

    µ_{d1} = 1/t2 + 1/t3 + 0.37/t1 + 0.37/t4 + 0.37/t5

Similarly,

    µ_{d2} = 1/t6 + 1/t7 + 1/t8 + 1/t9 + 0.37/t5
    µ_{d3} = 1/t10 + 1/t11 + 1/t12 + 0.37/t9 + 0.37/t5
    µ_{q1} = 1/t2 + 0.37/t1 + 0.37/t5
    µ_{q2} = 1/t9

Using the max combination for the set-union operation, we have

    µ_q = 1/t2 + 1/t9 + 0.37/t1 + 0.37/t5

Using (5), we have Sim(d1, q) = 3, Sim(d2, q) = 2 and Sim(d3, q) = 2. Consequently, document d1 is the most relevant to the query, while d2 and d3 are equally relevant.

3. Extension to accommodate document structure

The development in Section 2 treats the different index terms in each document equally. However, it is well agreed that some of these indices are more pertinent than others. For instance, index terms located in the title of the document are trivially more relevant in conveying the semantics than those located in the core of the document. Similarly, those located in a section's or subsection's title are more important than those located in the paragraph(s) related to one of these sections. Terms located in the keyword list of the document should likewise be deemed more relevant. Also, linguistic quantifiers like very, large, small, popular, great, etc., make the underlying index terms, which follow those quantifiers in the document, more relevant than others. It is therefore appealing to account for these facts in the construction of the membership functions pertaining to each document.

For this purpose, we consider the following fuzzy rules:
i) if the index term occurs in the title of the document, then the latter has the highest priority;
ii) if the index term occurs in the keyword list of the document, then that document has the second highest priority;
iii) if the index term occurs in a section or subsection title, then the document has the third highest priority;
iv) if the index term found in the body of a document's section is bold, then its associated occurrence frequency is virtually expanded;


v) if a quantifier is found, then the index term associated with the quantifier is allocated either extra or less term frequency, depending on the nature of the quantifier, by artificially multiplying f_{i,j} in (1) by a fixed quantity.

Rules i)-iii) define a prioritized way of handling index terms found in the document. They implicitly introduce a second pointer into the term characterization, corresponding to the term's whereabouts within the document: title, keyword list, (sub)section title, or the main text of any section (subsection) of the document. Rule iv) virtually increases the weighting of the index term within the document by multiplying the index frequency f_{i,j} by a fixed quantity, say f'_{i,j} = f_{i,j} · p, with p > 1, while in v) the index frequency is either expanded or diminished depending on the nature of the quantifier.

For this purpose, an extra component is added to the representation of µ_{d_j}(t_i), such that

    µ_{d_j}(t_i) = ( µ_{t_i}(d_j), p_i )        (6)

with p_i > 0 and µ_{t_i}(d_j) determined via (1).

The counterpart of (5) is:

    Sim(d_j, q) = ( Σ_{t_k} I( µ_{d_j}(t_k), µ_q(t_k) ), max_{t_i} p_i )        (7)

The ranking of (7) is accomplished according to the following rule:

    (a, b) ≤ (c, d) if either (b < d) or (b = d and a ≤ c),        (8)

which is often used in multidimensional utility calculus. The comparison rule (8) therefore defines a complete ordering of the 2-dimensional space.

We suggest the following values for the p_i:
- p_i = N if t_i occurs in the title of the document;
- p_i = N/2 if t_i occurs in the document's keyword list;
- p_i = N/4 if t_i occurs in the title of a document (sub)section;
- p_i = N/8 if t_i occurs as a bold word;
- 1 < p_i < N/8 if t_i occurs after a quantifier indicating an increase in the term's relevancy;
- 0 < p_i < 1 if t_i occurs after a quantifier indicating a decrease in the term's relevancy;
- p_i = 1 if t_i occurs in the body of a document (sub)section only and is neither bold nor preceded by a quantifier.

Notice that, in the light of the preceding, as soon as an index term is found in the title of the document, the underlying similarity will be ranked first, unless there is another document whose title contains that index term; in the latter case, the two similarities will be ranked according to the result of the overall valuation of the implicator. However, if none of the index terms is found in a title, (sub)section title or keyword list, is bold, or is preceded by a quantifier, then one recovers the valuation (2)-(5). An algorithm for constructing the similarity values is summarized below.

Algorithm
- Step 1: Scan the documents and build the index term representation of all documents.
- Step 2: Rewrite the query in terms of the index term representation.
- Step 3: Apply (6) and (1) to calculate the fuzzy sets µ_{d_j}, for j = 1 to N (number of documents).
- Step 4: Determine µ_q using (3).
- Step 5: Determine the similarity Sim(d_j, q), for j = 1 to N, according to (7) and rank the similarities according to (8).

4. Application to database search

The algorithm developed in Section 3 has been applied to a university database containing about 150000 documents of technical papers. The implemented system is outlined in the conceptual diagram shown in Figure 1 below.

Figure 1. Conceptual diagram
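The scoring and ranking steps of the algorithm, expressions (7) and (8), can be sketched as follows. This is a minimal illustration assuming the membership dictionaries and priorities p_i have already been built; following the worked example of Section 2, the implication sum is restricted to the terms shared by document and query, which reproduces the example's values:

```python
def similarity(mu_d, mu_q, p=None):
    """Two-component similarity of (7): Lukasiewicz implication values summed
    over the terms shared by document and query, paired with the largest
    structural priority p_i among those terms (p_i = 1 when unspecified)."""
    shared = set(mu_d) & set(mu_q)
    s = sum(min(1.0, 1.0 - mu_d[t] + mu_q[t]) for t in shared)  # implicator of (5)
    p_max = max(((p or {}).get(t, 1.0) for t in shared), default=1.0)
    return (s, p_max)

def rank(docs_mu, mu_q, priorities=None):
    """Order document names by rule (8): priorities first, implication sums second."""
    sims = {name: similarity(mu_d, mu_q, (priorities or {}).get(name))
            for name, mu_d in docs_mu.items()}
    return sorted(sims, key=lambda d: (sims[d][1], sims[d][0]), reverse=True)

# Fuzzy sets from the worked example of Section 2
d1 = {"t2": 1, "t3": 1, "t1": 0.37, "t4": 0.37, "t5": 0.37}
d2 = {"t6": 1, "t7": 1, "t8": 1, "t9": 1, "t5": 0.37}
d3 = {"t10": 1, "t11": 1, "t12": 1, "t9": 0.37, "t5": 0.37}
q = {"t2": 1, "t9": 1, "t1": 0.37, "t5": 0.37}
# similarity(d1, q)[0] sums to 3, matching the example, and d1 ranks first
```

With all priorities equal to 1, the ranking reduces to the plain sums of (5); a title match would raise the second component and dominate the comparison, as described above.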


The general architecture includes subroutines for document counting, the quantifier list, reading files, file search, index term counting while eliminating common words like "a", "the", "in", etc., calculating membership functions, calculating similarities, displaying documents, and so on. An inverted file was also built with the aid of GNU mifluz [4]. We used a stop list of 571 words and Porter's stemmer as in [11], which allows us to extend the search to all queries that can be grammatically extracted from the user's query.

An example is shown in Figure 2 below. The histogram on the right-hand side of Figure 2 shows the value of the similarity between the query and the underlying document according to expressions (5) and (7). The document d_11110 obtains the highest grade because an index term match was found in the title of the document, and the term also occurs very frequently throughout the document.

Figure 2. Example of result display

In order to evaluate the performance of the fuzzy retrieval algorithm, a comparison of results was carried out with respect to the logical information retrieval system already implemented in the commercial software of our university database search. For this purpose, we used the standard recall and precision evaluations used in most information retrieval systems. In particular, we computed the precision and recall at various cut-off points, where the precision is determined at various recall levels [3, 10]. More specifically:

Recall: the part of the relevant documents that is actually retrieved, i.e.,

    recall = (# relevant documents retrieved) / (# relevant documents in database)        (9)

Precision: the part of the retrieved documents that is actually relevant, i.e.,

    precision = (# relevant documents retrieved) / (# documents retrieved)        (10)

It is trivial that high recall is obtained at the cost of lower precision. Likewise, high precision can be attained at the cost of recall. The problem is to find a good balance between recall and precision.

Figure 3 shows the precision evaluation when taking 10 recall levels (from 0 to 100%); that is, given a ranked result of the search, one checks whether the first-ranked document is truly relevant; if so, it is associated with a 100% precision level; one then checks the second-ranked document, and so forth. The numerical values shown in Figure 3 were obtained by averaging 100 searches with roughly similar and close queries. It is easy to check that the fuzzy information retrieval system always outperforms the standard logical retrieval system, which, on average, retrieved only 80% of the relevant documents. However, as far as the implementation is concerned, the increase in performance was obtained at the cost of an increase in computation time, which is almost three times that of the logical retrieval model. It should be noted that this large increase in computational time is not entirely related to the fuzzy algorithm itself, but mainly to annexed implementations such as the stemmer and the interface. However, computation time is usually not a primary issue, as relevance is deemed much more important from the user's perspective.

Figure 3. Precision and recall curves using the fuzzy and logical retrieval systems
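The recall and precision measures (9)-(10), together with the ranked precision sweep used for Figure 3, can be sketched as follows. This is a minimal illustration with hypothetical document identifiers and relevance judgments, using an 11-point interpolated curve (0%, 10%, ..., 100%) as an assumed reading of the 10 recall levels reported above:

```python
def recall_precision(retrieved, relevant):
    """Recall (9) and precision (10) for a retrieved set of documents."""
    hits = len(set(retrieved) & set(relevant))
    return hits / len(relevant), hits / len(retrieved)

def precision_at_recall_levels(ranked, relevant, levels=11):
    """Interpolated precision at fixed recall levels: walk down the ranked
    list and, at each relevant document, record the running precision
    against every recall level already reached."""
    best = [0.0] * levels
    hits = 0
    for pos, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            r, p = hits / len(relevant), hits / pos
            for i in range(levels):
                if r >= i / (levels - 1):
                    best[i] = max(best[i], p)
    return best

# Hypothetical ranked result and relevance judgment
ranked = ["d_11110", "d_00231", "d_00017", "d_00542"]
relevant = {"d_11110", "d_00017"}
# recall_precision(ranked, relevant) gives (1.0, 0.5)
```

Plotting `precision_at_recall_levels` for the fuzzy and logical runs side by side yields curves of the kind shown in Figure 3.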


5. Conclusion

Due to the exponential increase of Web-based applications, the need for new models of information retrieval systems has become an issue, especially to tackle their challenging size, diversity and popularity. This paper advocates the use of a fuzzy-based approach for information retrieval in which Kraft's initial model has been extended to accommodate document structure using a multidimensional similarity measure. The matching between the document and user-query models is accomplished through a fuzzy implicator connective, which extends the Boolean model. The proposal has been implemented on a large-scale university library database. Its performance was evaluated in terms of precision and recall and compared to that obtained using the already implemented Boolean retrieval model. The comparison shows the feasibility and the superiority of the fuzzy retrieval model as far as recall and precision evaluations are concerned, but this is obtained at the cost of increased computation time.

References
[1] M. Anvari, G. Rose, Fuzzy relational databases, in: J. Bezdek (Ed.), The Analysis of Fuzzy Information, vol. II, CRC Press, Boca Raton, FL, 1987, pp. 203-212.
[2] G. Bordogna, G. Pasi, A soft aggregation of selection criteria in a fuzzy information retrieval environment, Internat. Fuzzy Systems and Intelligent Control Conf., Louisville, KY, 16-17 March, 1993.
[3] M. Buckland, F. Gey, The relationship between recall and precision, J. Am. Soc. Inf. Sci. 45 (January) (1994) 12-19.
[3] B. Buckles, F. Petry, A fuzzy model for relational databases, Internat. J. Fuzzy Sets and Systems 7 (1982) 213-226.
[4] GNU mifluz. http://www.gnu.org/software/mifluz.
[5] D. Kraft, D. Buell, Fuzzy sets and generalized Boolean retrieval systems, Internat. J. Man-Machine Studies 19 (1983) 45-56.
[6] D. Kraft, G. Bordogna, G. Pasi, An extended fuzzy linguistic approach to generalize Boolean retrieval, J. Information Sci. 2 (3) (1994) 119-134.
[7] D.H. Kraft, F.E. Petry, Fuzzy information systems: managing uncertainty in databases and information retrieval systems, Fuzzy Sets and Systems 90 (1997) 183-191.
[8] J. Medina, O. Pons, M. Vila, Gefred: A generalized model to implement fuzzy relational databases, Inform. Sci. 47 (1994) 234-254.
[9] S. Miyamoto, Fuzzy Sets in Information Retrieval and Cluster Analysis, Kluwer Academic Publishers, Dordrecht, 1990.
[10] G. Salton, Automatic Text Processing: The Transformation, Analysis and Retrieval of Information by Computer, Addison-Wesley, Reading, MA, 1989.
[11] M.F. Porter, An algorithm for suffix stripping, in: K. Sparck Jones and P. Willett (Eds.), Readings in Information Retrieval, pp. 313-316, Morgan Kaufmann Publishers, 1997.
[12] T.A. Runkler, J. Bezdek, Web mining with relational clustering, Internat. J. Approx. Reason. 32 (2003) 217-236.
[13] T.A. Runkler, J.C. Bezdek, Alternating cluster estimation: a new tool for clustering and function approximation, IEEE Trans. Fuzzy Systems 7 (1999) 377-393.
[14] G. Salton, M.J. McGill, Introduction to Modern Information Retrieval, McGraw-Hill, New York, 1983.
[15] C.J. van Rijsbergen, Information Retrieval, Butterworth, London, 1979.

