Personalized Information Retrieval System
EUSFLAT - LFA 2005
2. From Boolean to fuzzy model of information retrieval system

The three key components of any information retrieval system are [15]:
1. Query representation—representing the user's information request
2. Document representation—representing the text collection
3. Matching function, or ranking function—ranking the documents according to their relevance

To automatically evaluate a user query, an IRS needs a formal representation of the documents' content which captures and synthesizes the meaning of a document written in a natural language. This is usually referred to as the indexing mechanism, and it constitutes the basis of the representation of the document. Therefore, a sophisticated query language is not sufficient to produce effective results if the representation of documents introduces a severe loss of information. In almost all commercial information retrieval systems, indexing mechanisms are based on counting the frequency of the index terms within the whole document; the issue of correlation among index terms is often neglected.

The matching function allows us to identify those documents that match the user's request. It also enables ranking the documents according to their degrees of relevance.

In the Boolean model, which is common in commercial information retrieval systems, each document is given as a vector over all index terms, with an associated binary weight indicating whether a specific index term is present in the document or absent; a query, in turn, is represented as a logical formula, with D and Q denoting the formulae of document d and query q, respectively, in terms of the index term representation. It is well known that in this model, the more index terms are involved in an AND operator in the query, the fewer documents are retrieved; vice versa, leaving out one or two index terms may result in a (too) large set of retrieved documents.

In the vector model, by contrast, the weights take values in the unit interval and are determined as the product of the index frequency within the document times the inverse document frequency, while the similarity between the query and the document is given as the cosine of the weight vector of the document and that of the query.

Meanwhile, fuzzy set theory has been investigated as a way to relax the binary assumption of the weights: each index term defines a fuzzy set whose universe of discourse is the set of all documents in the database. The larger the membership grade, the more important the index term is for characterizing the content of the document.

More specifically, let f_{i,j} designate the frequency of index term t_i in document d_j. Using the popular term frequency and inverse document frequency in calculating the weight factors of the vector representation [ ], the latter can be written as

µ_{t_i}(d_j) = (f_{i,j} / max_k f_{k,j}) · (idf(t_i) / max_k idf(t_k))    (1)

where idf(t_i) is the inverse document frequency of term t_i, given as

idf(t_i) = log(N / n_i)

N is the total number of documents in the database and n_i is the total number of documents containing index term t_i.

The normalization stage in (1) allows us to have a membership grade within the unit interval. (1) is very similar to the weight factors used in the vector model of information retrieval [10]. Expression (1) defines, for each document d_j, its degree of membership to the fuzzy set pertaining to index term t_i.

Now, given a query q expressed as some logical combination of index terms t_i, say q = L(t), it is straightforward to construct the underlying fuzzy set µ_q (defined over the set of documents) using the fuzzy-set extension of the logical operators used in L. Therefore, as in [5,6], only those documents d_j for which µ_q(d_j) ≥ ξ are retrieved.
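Expression (1) is straightforward to prototype. The sketch below is illustrative (the function names and the toy three-document collection are ours, not the paper's); it computes the membership grade as normalized term frequency times normalized inverse document frequency, so the grade stays in the unit interval:

```python
import math

def idf(term, docs):
    """Inverse document frequency: idf(t) = log(N / n_t)."""
    n_t = sum(1 for d in docs if term in d)
    return math.log(len(docs) / n_t) if n_t else 0.0

def membership(term, doc, docs, vocab):
    """Degree of membership of `doc` to the fuzzy set of `term` (eq. 1):
    normalized term frequency times normalized idf."""
    freqs = {t: doc.count(t) for t in vocab}
    max_f = max(freqs.values()) or 1
    idfs = {t: idf(t, docs) for t in docs and vocab}
    max_idf = max(idfs.values()) or 1.0
    return (freqs[term] / max_f) * (idfs[term] / max_idf)

# Toy collection of N = 3 "documents", each given as a list of index terms
docs = [["t1", "t2", "t3"], ["t2", "t4"], ["t5"]]
vocab = {t for d in docs for t in d}
grade = membership("t1", docs[0], docs, vocab)  # 1.0: rare term, max frequency
```

A term that appears in the document and in few other documents (here "t1") gets a grade near 1; a term shared by most documents (here "t2", with a small idf) gets a proportionally smaller grade.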
Several works have, at least briefly, investigated this issue. A rather interesting extension of the Boolean retrieval model consists in generalizing the implication (D → Q) previously pointed out. In this respect, an alternative to (1) consists in finding fuzzy sets pertaining to each document over the set of index terms, such that

µ_{d_j}(t_i) = µ_{t_i}(d_j)    (2)

and

µ_q(t_i) = L̄(µ_{q_1}(t_i), µ_{q_2}(t_i), ...)    (3)

where q_k stands for the kth component of the query q used in the logical expression L, such that q = L(q_1, q_2, ...), and L̄ stands for the fuzzy-set extension of L. The similarity between a document and the query is then

Sim(d_j, q) = Σ_{t_k} I(µ_{d_j}(t_k), µ_q(t_k))    (4)

where I designates a fuzzy implication operator. For instance, using the Łukasiewicz implicator, we have

µ_{d_1}(t_1) = (f_{1,1} / max_k f_{k,1}) · (idf(t_1) / max_k idf(t_k)) = (1/1) · (log(3/2) / log(3/1)) = 0.37;
µ_{d_1}(t_2) = µ_{d_1}(t_3) = (1/1) · (log(3/1) / log(3/1)) = 1;
µ_{d_1}(t_4) = µ_{d_1}(t_5) = (1/1) · (log(3/2) / log(3/1)) = 0.37;

and µ_{d_1}(t_i) = 0 otherwise. Therefore,

µ_{d_1} = 0.37/t_1 + 1/t_2 + 1/t_3 + 0.37/t_4 + 0.37/t_5.

Similarly,

µ_{d_2} = 1/t_6 + 1/t_7 + 1/t_8 + 1/t_9 + 0.37/t_5

and

µ_{q_1} = 1/t_2 + 0.37/t_1 + 0.37/t_5,    µ_{q_2} = 1/t_9.

Using (5), we have Sim(d_1, q) = 3, Sim(d_2, q) = 2 and Sim(d_3, q) = 2. Consequently, document d_1 is the most relevant to the query, while d_2 and d_3 are equally relevant.

Index terms are handled according to the following priority rules:
i) if the index term occurs in the title of the document, then the document has highest priority;
ii) if the index term occurs in the keywords list, then the document has second highest priority;
iii) if the index term occurs in a section or subsection title, then the document has third highest priority;
iv) if the index term found in the body of a document's section is bold, then its associated occurrence frequency is virtually expanded;
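Equations (3) and (4) can be sketched in code using the fuzzy sets of the worked example. Two caveats: the connective L behind q = L(q_1, q_2) and the exact valuation (5) are not shown in this excerpt, so max (disjunction) and the Łukasiewicz implicator below are assumptions, and the resulting score is therefore not expected to reproduce the Sim values quoted above:

```python
def lukasiewicz(a, b):
    """Łukasiewicz implicator I(a, b) = min(1, 1 - a + b)."""
    return min(1.0, 1.0 - a + b)

def query_fuzzy_set(components, terms, combine=max):
    """Eq. (3): combine the component fuzzy sets term by term with the
    fuzzy-set extension of the query connective (here max, i.e. OR)."""
    return {t: combine(c.get(t, 0.0) for c in components) for t in terms}

def sim(mu_doc, mu_query, implicator=lukasiewicz):
    """Eq. (4): sum of implication values over the index terms."""
    terms = set(mu_doc) | set(mu_query)
    return sum(implicator(mu_doc.get(t, 0.0), mu_query.get(t, 0.0)) for t in terms)

# Fuzzy sets taken from the worked example above
mu_d1 = {"t1": 0.37, "t2": 1.0, "t3": 1.0, "t4": 0.37, "t5": 0.37}
mu_q1 = {"t1": 0.37, "t2": 1.0, "t5": 0.37}
mu_q2 = {"t9": 1.0}
mu_q = query_fuzzy_set([mu_q1, mu_q2], {"t1", "t2", "t5", "t9"})
score = sim(mu_d1, mu_q)
```

Note how the implication view rewards a document for covering the query (terms where µ_q is high) and penalizes document terms the query does not ask for, such as t_3 and t_4 here.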
v) if a quantifier was found, then the index term associated with the quantifier is allocated either extra or less term frequency, depending on the nature of the quantifier, by artificially multiplying f_{i,j} in (1) by a fixed quantity.

Rules i)-iii) define a prioritized way of handling index terms found in the document. They implicitly introduce a second pointer in the term characterization, corresponding to the term's whereabouts within the document: title, keywords list, (sub)section title, or the main text of any section (subsection) of the document. Rule iv) virtually increases the weighting of the index term within the document by multiplying the index frequency f_{i,j} by a fixed quantity, say f'_{i,j} = f_{i,j} · p with p > 1, while in v) the index frequency is either expanded or diminished depending on the nature of the quantifier.

For this purpose, an extra component p_i is added to the representation of µ_{d_j}(t_i), such that

µ_{d_j}(t_i) = p_i · µ_{t_i}(d_j)    (6)

with p_i > 0 and µ_{t_i}(d_j) determined via (1); for instance,
- p_i = 1 if t_i occurs in the body of a document (sub)section only and is neither bold nor preceded by a quantifier.

The counterpart of (5) is

Sim(d_j, q) = ( Σ_{t_k} I(µ_{d_j}(t_k), µ_q(t_k)) ) / max_{t_i} p_i    (7)

The ranking of (7) is accomplished according to the following rule (8): as soon as an index term is found in the title of a document, the underlying similarity is ranked first, unless there is another document whose title contains that index term; in the latter case, the two similarities are ranked according to the result of the overall valuation of the implicator. However, if none of the index terms is found in the title, (sub)section title, or keyword list, and none is bold or preceded by a quantifier, then one recovers the valuation (2)-(5). An algorithm for constructing the similarity values is summarized below.

Algorithm
- Step 1: Scan the documents and build the index term representation of all documents.
- Step 2: Rewrite the query in terms of the index term representation.
- Step 3: Determine µ_{d_j} for each document using (6).
- Step 4: Determine µ_q using (3).
- Step 5: Determine the similarity Sim(d_j, q), for j = 1 to N, according to (7), and rank the similarities according to (8).

4. Application to database search

The algorithm developed in Section 3 has been applied to a university database containing about 150000 documents of technical papers. The implemented system is outlined in a conceptual diagram.
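The five algorithm steps can be sketched end to end. This is a rough illustration under stated assumptions: the forms of (6) and (7) are reconstructions from the text, max stands in for the fuzzy-set extension L̄ of a disjunctive query, and the priority values p_i are supplied by the caller rather than derived from document markup as rules i)-v) would do:

```python
import math

def lukasiewicz(a, b):
    """Łukasiewicz implicator I(a, b) = min(1, 1 - a + b)."""
    return min(1.0, 1.0 - a + b)

def memberships(doc_terms, all_docs, vocab):
    """Step 1 / eq. (1): fuzzy membership grades of one document."""
    idf = {t: math.log(len(all_docs) / max(1, sum(t in d for d in all_docs)))
           for t in vocab}
    max_idf = max(idf.values()) or 1.0
    freq = {t: doc_terms.count(t) for t in vocab}
    max_f = max(freq.values()) or 1
    return {t: (freq[t] / max_f) * (idf[t] / max_idf) for t in vocab}

def search(all_docs, priorities, query_sets, implicator=lukasiewicz):
    """Steps 2-5: build the query fuzzy set, weight the document grades by
    the priority factors p_i (eq. 6), sum the implication values, normalize
    by the largest priority on the scale (eq. 7), and rank."""
    vocab = {t for d in all_docs for t in d}
    mu_q = {t: max(qs.get(t, 0.0) for qs in query_sets) for t in vocab}
    p_max = max([p for ps in priorities for p in ps.values()] + [1.0])
    sims = []
    for doc, ps in zip(all_docs, priorities):
        mu_d = memberships(doc, all_docs, vocab)
        weighted = {t: ps.get(t, 1.0) * m for t, m in mu_d.items()}  # eq. (6)
        s = sum(implicator(weighted[t], mu_q[t]) for t in vocab)
        sims.append(s / p_max)                                       # eq. (7)
    return sorted(enumerate(sims), key=lambda x: -x[1])

# Two one-term documents, a query on "t1", no special priorities (all p_i = 1),
# in which case (6)-(7) reduce to the plain valuation (4)
ranked = search([["t1"], ["t2"]], [{}, {}], [{"t1": 1.0}])
```

With all p_i = 1 the document containing the query term ranks first; assigning a larger p_i to, say, a title occurrence rescales its grade before the implication is evaluated.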
We used a stop list of 571 words and Porter's stemmer as in [11], which allows us to extend the search to all queries that can be grammatically extracted from the user's query.

An example is shown in Figure 2 below. The histogram shown on the right-hand side of Figure 1 gives the value of the similarity between the query and the underlying document according to expressions (5) and (7). Document d_11110 obtains the highest grade because an index term match was found in the title of the document and the term also occurs very frequently throughout the document.

Precision is defined as

Precision = (# relevant documents retrieved) / (# documents retrieved)    (10)

It is well known that high recall is obtained at the cost of lower precision; likewise, high precision can be attained at the cost of recall. The problem is to find a good balance between recall and precision.

Figure 3 shows the precision evaluation when taking 10 recall levels (from 0 to 100%); that is, given a ranked result of the search, one checks whether the first-ranked document is truly relevant; if so, it is associated with a 100% precision level; one then checks the second-ranked document, and so forth.
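The evaluation just described, stepping down the ranked list and recording precision at fixed recall levels, is essentially the classic interpolated precision-recall curve. A minimal sketch, assuming binary relevance judgments and interpolating each level with the best precision attainable at or beyond it (the function name and the 11-point default are illustrative):

```python
def precision_at_recall_levels(ranked_relevance, total_relevant, levels=11):
    """Precision (eq. 10) at evenly spaced recall levels: walk down the ranked
    list, record (recall, precision) after each retrieved document, then take
    the maximum precision attainable at or beyond each recall level."""
    hits, points = 0, []
    for k, rel in enumerate(ranked_relevance, start=1):
        hits += rel
        points.append((hits / total_relevant, hits / k))  # (recall, precision)
    curve = []
    for i in range(levels):
        level = i / (levels - 1)
        attained = [p for r, p in points if r >= level]
        curve.append(max(attained) if attained else 0.0)
    return curve

# Ranked result: 1 = relevant, 0 = not; the collection holds 2 relevant docs
curve = precision_at_recall_levels([1, 0, 1], 2)
```

Averaging such curves over a set of queries gives exactly the kind of recall-precision plot reported in Figure 3.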
The numerical values reported in the figure were obtained by averaging 100 searches over roughly similar, closely related queries. The fuzzy information retrieval system consistently outperforms the standard logical retrieval system, for which, on average, only 80% of the relevant documents were retrieved. However, as far as the implementation is concerned, the increase in performance was obtained at the cost of increased computation time, which is almost three times that of the logical retrieval model. It should be noted that this large increase in computation time is not entirely due to the fuzzy algorithm itself, but mainly to auxiliary components such as the stemmer and the interface. In practice, computation time is usually not the primary issue, as relevance is deemed much more important from the user's perspective.
References
[1] M. Anvari, G. Rose, Fuzzy relational databases, in: J. Bezdek (Ed.), The Analysis of Fuzzy Information, vol. II, CRC Press, Boca Raton, FL, 1987, pp. 203-212.
[2] G. Bordogna, G. Pasi, A soft aggregation of selection criteria in a fuzzy information retrieval environment, in: Internat. Fuzzy Systems and Intelligent Control Conf., Louisville, KY, 16-17 March, 1993.
[3] M. Buckland, F. Gey, The relationship between recall and precision, J. Am. Soc. Inf. Sci. 45 (January) (1994) 12-19.
[4] B. Buckles, F. Petry, A fuzzy model for relational databases, Internat. J. Fuzzy Sets and Systems 7 (1982) 213-226.
[5] GNU mifluz. http://www.gnu.org/software/mifluz.