
Chapter 4

IR Models
Introduction to IR Models
At the end of this chapter every student must be able to:
 Define what a model is
 Describe why models are needed in information retrieval
 Differentiate between different types of information retrieval models:
 Boolean model
 Vector space model
 Probabilistic model
 Know how to calculate and find the similarity of some
documents to a given query
 Identify term frequency, document frequency, inverse
document frequency, term weight and similarity
measurements
What is a model?
A model is an ideal abstraction of something
that works in the real world.
There are two good reasons for having models of IR:
1. Models guide research and provide the means
for academic discussion
2. Models can serve as a blueprint to implement an
actual retrieval system
IR Models
• In IR, mathematical models are used to understand and reason
about some behavior or phenomena in the real world
• A model of information retrieval predicts and explains what a
user will find relevant given the user's query
Retrieval model
• Thus, retrieval models are models that can describe
the computational processes (here, retrieval)
– e.g., how documents are ranked
– e.g., how similarities are measured
• They are models that attempt to describe the
human process
– e.g., the information need, interaction
– Few do so meaningfully
• They are models that specify the details of
– Document representation
– Query representation
– Retrieval function (matching function)
– Ranking
Retrieval Models
• A number of IR models have been proposed over the years to
retrieve information
• The following are the major models developed to retrieve
information
– Boolean model
• An exact match model
– Statistical models
• Vector space and probabilistic models are the major
statistical models
• Are "best match" models
– Linguistic and knowledge-based models
• Are also "best match" models
What is the difference between best match and exact match?
Types of models
• The three classic information retrieval models are:
– Boolean retrieval model
– Vector space model
– Probabilistic model
1. Boolean model
 A document either matches a query, or does not.
 The Boolean retrieval model is a model for information
retrieval in which we can pose (create) any query in
the form of a Boolean expression of terms, that is,
in which terms are combined with the operators
AND, OR, and NOT.
…..cont
 The first model of information retrieval
 The most criticized model
 Based on Boolean algebra, developed by George Boole
• Boole defined 3 basic operators:
AND
OR
NOT
……cont
• Boolean relevance prediction (R)
– A document is predicted as relevant to a query iff it satisfies the query
expression
– Each query term specifies a set of documents containing it
• AND (∧): the intersection of two sets
• OR (∨): the union of two sets
• NOT (¬): set inverse, or really set difference
– A query, thus, searches a set of documents to determine their content
– The search engine retrieves those documents satisfying the logical
constraints of the query
….cont
• Most queries search for more than one term
– Information need: to find all documents containing "information"
and "retrieval"
Answer: only documents containing both "information" and "retrieval"
satisfy this query
– Information need: to find all documents containing "information"
or "retrieval"
Answer: satisfied by a document that contains either of the two words,
or both
– Information need: to find all documents containing "information"
or "retrieval", but not both
Answer: satisfied by a document that contains exactly one of the two
words
Boolean model
• Consider a set of five docs and assume that they contain the terms
shown in the table

Doc.   Terms
D1     algorithm, information, retrieval
D2     retrieval, science
D3     algorithm, information, science
D4     pattern, retrieval, science
D5     science, algorithm

Find the documents retrieved by the following expressions:
a. information AND retrieval
{d1,d3} ∩ {d1,d2,d4} = {d1}
b. information OR retrieval
{d1,d3} ∪ {d1,d2,d4} = {d1,d2,d3,d4}
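The set operations above can be checked with a short Python sketch
(a minimal illustration of the Boolean model; the posting() helper and
the lower-case document ids are our own, taken from the table above):

docs = {
    "d1": {"algorithm", "information", "retrieval"},
    "d2": {"retrieval", "science"},
    "d3": {"algorithm", "information", "science"},
    "d4": {"pattern", "retrieval", "science"},
    "d5": {"science", "algorithm"},
}

def posting(term):
    # The set of document ids whose term set contains the term
    return {d for d, terms in docs.items() if term in terms}

print(posting("information") & posting("retrieval"))  # AND: {'d1'}
print(posting("information") | posting("retrieval"))  # OR: {'d1','d2','d3','d4'}
print(posting("science") - posting("retrieval"))      # science AND NOT retrieval: {'d3','d5'}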
Advantages and disadvantages of Boolean model
• Advantages of Boolean model
 A very simple model based on sets (easy for experts)
 Computationally efficient
 Expressive and clear
 Still a dominant model in commercial database systems
Disadvantages of Boolean model
• Disadvantages
 Users need to be trained
 Very limited in expressing the user's information need in detail
 No weighting of index or query terms
 Based on exact matching: there may be relevant documents
that are only partially matched
Vector Space Model
Suggested by Hans Peter Luhn and Gerard Salton
A classic model of document retrieval based on
representing documents and queries as vectors
Partial matching is possible
Retrieval is based on the similarity between the query
vector and the document vectors
The output documents are ranked according to this
similarity
….cont
• The similarity is based on the occurrences of the
keywords in the query and the document
• The angle between the query and the document is measured
using the cosine similarity measure, since both the
document and the user's query are represented as vectors
in the VSM
…cont
• The VSM assumes that if document vector V1 is closer to
the query than another document vector V2, then
 the document represented by V1 is more
relevant than the one represented by V2
In the VSM, to decide the similarity of a document to the
given query, term weighting (tf*idf) is used. To
calculate tf*idf, we first have to calculate the following:
1. Term frequency (tf)
• Term frequency is the number of times a given
term appears in a document

tf = (frequency of the term in the document) /
     (frequency of the most frequent term in that document)
(i.e., tf is normalized per document)
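As a quick check, a minimal Python sketch of this normalized term
frequency (the tf() helper is illustrative, not a standard API):

from collections import Counter

def tf(term, doc_tokens):
    # tf = raw count of the term / count of the most frequent term in the document
    counts = Counter(doc_tokens)
    return counts[term] / max(counts.values())

print(tf("new", ["new", "new", "times"]))    # 2/2 = 1.0
print(tf("times", ["new", "new", "times"]))  # 1/2 = 0.5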


2. Inverse document frequency (idf)
• idf is used to measure whether a term is
common or rare across all documents
• idf = log2(N/df), where
N = total number of documents in the collection
df = document frequency (the number of documents
containing the given term)
Changing the base of the logarithm (e.g. from log2 to log10) only
scales all idf values by a constant factor; it does not change the
ranking of the documents.
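The idf formula in a minimal Python sketch (log base 2, matching the
formula above; the helper name is illustrative):

import math

def idf(df, N):
    # idf = log2(N / df): rare terms get high idf, common terms get low idf
    return math.log2(N / df)

print(idf(df=2, N=3))  # log2(3/2) ≈ 0.585
print(idf(df=1, N=3))  # log2(3/1) ≈ 1.585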
3. Term weighting (tf*idf)
Term weighting is the assignment of numerical values to
terms that represent their importance in a document, in
order to improve retrieval effectiveness
• Term weight = tf * idf, i.e. tfi,j * log(N/dfi)
4. Document length
Document length normalization adjusts the term frequency or the
relevance score in order to normalize the effect of document
length on the document ranking.
Document length = the square root of the sum of the squared term
weights of the document
5. Similarity
• At the end we need to calculate the similarity of the documents to
the query.
• The widely used measure of similarity in the vector space model is
called the cosine similarity.
• The cosine similarity between a document vector dj and a query
vector q is given by:
sim(dj, q) = (dj · q) / (|dj| × |q|)
           = Σi (wi,j × wi,q) / (√(Σi wi,j²) × √(Σi wi,q²))
• That is, the denominator of the formula is the length of the
document times the length of the query.
Examples
• Example 1: If the following three documents are given with
one query, which document must be ranked first?
D1: new york times
D2: new york post
D3: los angeles times
Query: new new times

Solution
• Step 1: calculate the inverse document frequency (idf) of each term
• Step 2: calculate the term frequencies (tf)
• Step 3: calculate the term weights (tw): tw = tf * idf
• Step 4: calculate the length of each document and the length of
the query
• Step 5: calculate the similarity of each document to the query

Therefore, since sim(D1) > sim(D2) > sim(D3), the documents must be
ranked as: D1, D2, D3.
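The five steps can be reproduced end to end with a short Python
sketch (a minimal illustration of the Example 1 computation; the
weights() and cosine() helpers are our own, not a standard API):

import math
from collections import Counter

docs = {
    "D1": "new york times",
    "D2": "new york post",
    "D3": "los angeles times",
}
query = "new new times"
N = len(docs)

# Step 1: document frequency and idf (log base 2) of every term
df = Counter(t for text in docs.values() for t in set(text.split()))
idf = {t: math.log2(N / d) for t, d in df.items()}

# Steps 2-3: tf normalized by the most frequent term, times idf
def weights(text):
    counts = Counter(text.split())
    max_f = max(counts.values())
    return {t: (f / max_f) * idf[t] for t, f in counts.items() if t in idf}

# Steps 4-5: vector lengths and cosine similarity
def cosine(a, b):
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    len_a = math.sqrt(sum(w * w for w in a.values()))
    len_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (len_a * len_b)

q = weights(query)
for name, text in docs.items():
    print(name, round(cosine(weights(text), q), 3))
# D1 0.775, D2 0.293, D3 0.113 -> ranking: D1, D2, D3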
Example 2
 Which document must be ranked first for the
following?
Doc1: Breakthrough drug for schizophrenia
Doc2: New schizophrenia drug
Doc3: New approach for treatment of schizophrenia
Doc4: New hopes for schizophrenia patients
Query: Treatment of schizophrenia patients
Example 3
From the following documents, which one must be ranked first?
D1: The health observances for march
D2: The health oriented calendar
D3: The awareness news for march awareness
Q: March health awareness
Advantages and Disadvantages of VSM
• Advantages: supports partial matching, produces ranked output,
and term weighting improves retrieval effectiveness
• Disadvantages: assumes index terms are independent, and the
high-dimensional vectors are costly to compute and store
Latent semantic indexing
• Latent Semantic Indexing (LSI) is an extension of the
vector space retrieval method (Deerwester et al., 1990).
• LSI can retrieve relevant documents even when they do
not share any words with the query.
• If only a synonym of the keyword is present in a
document, the document will still be found relevant.
LSI/LSA
• Latent Semantic Indexing (LSI) [Deerwester et al.] tries to
overcome the problems of lexical matching by using
statistically derived conceptual indices instead of individual
words for retrieval.
• Latent Semantic Indexing is a technique that projects queries
and documents into a space with "latent" semantic dimensions.
• In the latent semantic space, a query and a document can
have high cosine similarity even if they do not share any
terms.
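To make the idea concrete, here is a minimal LSI sketch using a
truncated SVD on a toy term-document matrix (the vocabulary, the
documents and the choice k = 2 are illustrative assumptions, not from
the slides):

import numpy as np

#             d1  d2  d3        d1 = "car engine"
A = np.array([[1,  0,  0],    # car         d2 = "automobile engine"
              [0,  1,  0],    # automobile  d3 = "banana fruit"
              [1,  1,  0],    # engine
              [0,  0,  1],    # banana
              [0,  0,  1]],   # fruit
             dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
Uk, sk, Vk = U[:, :k], s[:k], Vt[:k, :].T   # rows of Vk = docs in latent space

q = np.array([1.0, 0, 0, 0, 0])             # query: "car"
q_hat = (q @ Uk) / sk                       # fold the query into the latent space

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

for j, d in enumerate(Vk, 1):
    print(f"d{j}: {cos(q_hat, d):.3f}")
# d2 ("automobile engine") scores as high as d1 even though it shares
# no term with the query "car"; d3 ("banana fruit") scores 0.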
Probabilistic Model
Why introduce probabilities and probability theory in IR?
 As a process, retrieval is inherently uncertain
 Understanding of the user's information need is uncertain
 Are we sure the user mapped his need into a good query?
 Even if the query represents the need well, did we represent it well?
 Estimating document relevance for the query
 Uncertainty from the selection of the document representation
 Uncertainty from matching the query and documents
Probability theory is a common framework for modeling uncertainty.
An IR system is uncertain primarily about:
1. Understanding of the query
2. Whether a document satisfies the query
Probability theory
 Provides a principled foundation for reasoning under uncertainty
 Probabilistic information retrieval models estimate how likely it is that a
document is relevant for a query
Probabilistic IR models
 Classic probabilistic models (BIM, BM11, BM25)
 Bayesian networks for text retrieval
• Probabilistic IR models are among the oldest, but also among the
best-performing and most widely used IR models
Probability ranking principle
• Ranked retrieval setup: given a collection of documents, the
user issues a query, and an ordered list of documents is
returned.
• Assume a binary notion of relevance: Rd,q is a dichotomous
random variable, such that
– Rd,q = 1 if document d is relevant w.r.t. query q
– Rd,q = 0 otherwise
• Probabilistic ranking orders documents decreasingly by their
estimated probability of relevance w.r.t. the query: P(R = 1|d, q)
Bayesian based probabilistic model
• Let x be a document in the collection
• Let R represent relevance of a document with respect to the given
query, and let NR represent non-relevance
• We need to find P(R|x), the probability that document x is relevant.
By Bayes' rule, P(R|x) = P(x|R) · P(R) / P(x), where
P(R|x) = probability that document x is relevant
P(x|R) = probability that if a relevant document is retrieved, it is x
P(R) = probability of a relevant document in the collection
P(x) = probability that x is in the collection
Example 1
• Assume that the following is given:
P(R) = 0.6
P(x) = 0.5
P(x|R) = 0.7
Then what is P(R|x)?
• P(R|x) = P(R) · P(x|R) / P(x) = (0.6 × 0.7) / 0.5 = 0.84
Example 2
Assume that document y is in the collection.
The probability that if a non-relevant document is retrieved, it is y:
P(y|NR) = 0.2
The probability of non-relevant documents in the collection: P(NR) = 0.6
The probability of y in the collection: P(y) = 0.4
a) What is the probability that y is a non-relevant document?
b) Is the document relevant or non-relevant?
Solution
a) P(NR|y) = P(y|NR) · P(NR) / P(y) = (0.2 × 0.6) / 0.4 = 0.3
b) P(R|y) + P(NR|y) = 1
   P(R|y) = 1 - P(NR|y) = 1 - 0.3 = 0.7
   Hence the document is relevant.
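Both examples are direct applications of Bayes' rule; a minimal
Python sketch (the bayes() helper name is illustrative):

def bayes(prior, likelihood, evidence):
    # P(H|x) = P(H) * P(x|H) / P(x)
    return prior * likelihood / evidence

# Example 1: P(R|x)
print(round(bayes(prior=0.6, likelihood=0.7, evidence=0.5), 2))  # 0.84

# Example 2: P(NR|y), then P(R|y) = 1 - P(NR|y)
p_nr_y = bayes(prior=0.6, likelihood=0.2, evidence=0.4)          # 0.3
print(round(1 - p_nr_y, 2))                                      # 0.7 -> relevant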


Binary independence model
• As the name implies, this model assumes that the index terms
exist independently in the documents and we can then assign
binary (0,1) values to these index terms.

• For a further illustration of this model, consider a document


Dk in a collection, is represented by a binary vector t = (t 1 ,t 2
,t 3 ,…,t u ) where u represents total number of terms in the
collection, t i =1 indicates the presence of the ith index term and
t i =0 indicates its absence.
Binary independence model
• A decision rule can be formulated by which any document
can be assigned to either the relevant or the non-relevant set of
documents for a particular query.
• The obvious rule is to assign a document to the relevant set if the
probability of the document being relevant given the
document representation is greater than the probability of the
document being non-relevant, that is, if:
P(relevant|t) > P(non-relevant|t)
Using Bayes' theorem, each side can be rewritten as, e.g.,
P(relevant|t) = P(t|relevant) · P(relevant) / P(t).
BIM
• Binary: documents and queries are represented as binary vectors
• Independence: terms are assumed to be independent of each other
• Some of the assumptions of the BIM can be removed. For
example, we can remove the assumption that terms are
independent
• A case that violates this assumption is term pairs like: Hong and
Kong, New York, Los Angeles, Addis Ababa, Arba Minch, Abba
Gada, Haadha Siinqee, which are strongly dependent on each other

                                        Relevant docs   Non-relevant docs   Total
No. of documents containing term tk     r               n - r               n
No. of documents not containing term tk R - r           N - n - R + r       N - n
Total                                   R               N - R               N
Definition
N = total number of documents in the collection
n = total number of documents containing the term tk
R = total number of relevant documents retrieved
r = total number of relevant documents retrieved containing the term tk

Based on this:
The odds of term tk appearing in a relevant document are given by
r / (R - r)
The odds of term tk appearing in an irrelevant document are given by
(n - r) / (N - n - R + r)
• If the odds of term tk appearing in relevant documents equal
those for non-relevant documents, then
r / (R - r) = (n - r) / (N - n - R + r)
Now we can calculate the relevance function as:
Relevance function (W) = r(N - n - R + r) / ((R - r)(n - r)); but when
no relevant document contains the term tk, this becomes zero, so we
add 0.5 to each factor to avoid zero results.
Example
N = 20: total no. of documents in the collection
n = 10: total no. of documents containing the term tk
R = 15: total no. of relevant documents retrieved
r = 5: total no. of relevant documents retrieved containing the term tk
Then calculate the relevance function of the term tk.
Solution
• Since the result becomes zero when there is no relevant document,
we add 0.5 to each factor of the formula:
RF(W) = (r + 0.5)(N - n - R + r + 0.5) / ((R - r + 0.5)(n - r + 0.5))
RF(W) = (5 + 0.5)(20 - 10 - 15 + 5 + 0.5) / ((15 - 5 + 0.5)(10 - 5 + 0.5))
      = (5.5 × 0.5) / (10.5 × 5.5)
      ≈ 0.048 (the relevance weight of the term tk)
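The smoothed relevance function in a minimal Python sketch (the
helper name is illustrative; it reproduces the ≈ 0.048 above):

def relevance_weight(N, n, R, r):
    # W = (r+0.5)(N-n-R+r+0.5) / ((R-r+0.5)(n-r+0.5))
    return ((r + 0.5) * (N - n - R + r + 0.5)) / \
           ((R - r + 0.5) * (n - r + 0.5))

print(round(relevance_weight(N=20, n=10, R=15, r=5), 3))
# (5.5 * 0.5) / (10.5 * 5.5) ≈ 0.048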
BM25
• Stands for Best Match
• Used for modern full-text search collections (it pays attention to
term frequency and document length); also called the Okapi
weighting scheme
• It uses the following contingency table:

         Relevant   Non-relevant      Total
Di = 1   ri         ni - ri           ni
Di = 0   R - ri     N - ni - R + ri   N - ni
Total    R          N - R             N

Where:
ri = number of documents relevant to Q containing term ti
R = number of documents relevant to Q
ni = number of documents containing term ti
N = number of documents in the given collection
tf = frequency (number of occurrences) of the term in Di
qtf = frequency of the term in Q
avdl = average document length in the given collection
dl = length of Di
k1, k2, K = free parameters, where K = k1 · [(1 - b) + b · dl/avdl]
Typically k1 = 1.2, k2 = 0 to 1000, b = 0.75
Formula of BM25
• BM25(Q, D) = Σ over query terms of:
log[ ((ri + 0.5)/(R - ri + 0.5)) / ((ni - ri + 0.5)/(N - ni - R + ri + 0.5)) ]
  × ((k1 + 1)·tf) / (K + tf)
  × ((k2 + 1)·qtf) / (k2 + qtf)
Example
• Assume the query 'information system', where each term has qtf = 1
• N = 1,000,000
• No relevance information (r and R = 0)
• 'information' occurs in 500,000 documents (n1 = 500,000)
• 'system' occurs in 10,000 documents (n2 = 10,000)
• 'information' occurs 100 times in the document (f1 = 100)
• 'system' occurs 50 times in the document (f2 = 50)
• The document length is 80% of the average, i.e. dl/avdl = 0.8
• k1 = 1.2, k2 = 200, b = 0.75, so K = 1.2 · [(1 - 0.75) + 0.75 × 0.8] = 1.02
Formula of BM25 applied
• For 'information':
ln[(0 + 0.5)/(0 - 0 + 0.5) / ((500,000 - 0 + 0.5)/(1,000,000 - 500,000 - 0 + 0 + 0.5))]
  × ((1.2 + 1)·100)/(1.02 + 100) × ((200 + 1)·1)/(200 + 1)
  ≈ -0.000002 × 2.178 × 1 ≈ 0
• For 'system':
ln[(0 + 0.5)/(0 - 0 + 0.5) / ((10,000 - 0 + 0.5)/(1,000,000 - 10,000 - 0 + 0 + 0.5))]
  × ((1.2 + 1)·50)/(1.02 + 50) × ((200 + 1)·1)/(200 + 1)
  ≈ 4.595 × 2.156 × 1 ≈ 9.91
• BM25(Q, D) ≈ 0 + 9.91 ≈ 9.91 (using the natural logarithm)
Therefore, the similarity of the query 'information system' to the
document is about 9.91; nearly all of the score comes from the rarer
term 'system', while the very common term 'information' contributes
almost nothing.
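The whole computation in a minimal Python sketch (natural log,
scoring each query term as in the formula above; the bm25_term()
helper is our own, not a standard API):

import math

def bm25_term(N, n, R, r, tf, qtf, k1=1.2, k2=200, K=1.02):
    # log of the smoothed relevance odds ratio
    idf_part = math.log(((r + 0.5) / (R - r + 0.5)) /
                        ((n - r + 0.5) / (N - n - R + r + 0.5)))
    tf_part = ((k1 + 1) * tf) / (K + tf)      # document term frequency component
    qtf_part = ((k2 + 1) * qtf) / (k2 + qtf)  # query term frequency component
    return idf_part * tf_part * qtf_part

N = 1_000_000
score = (bm25_term(N, n=500_000, R=0, r=0, tf=100, qtf=1) +  # 'information'
         bm25_term(N, n=10_000, R=0, r=0, tf=50, qtf=1))     # 'system'
print(round(score, 2))  # ≈ 9.91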
Exercise
• Compare and contrast the following information retrieval
models (pros and cons):
Boolean model vs. vector space model vs. probabilistic model
END OF CHAPTER 4