
Chapter 4

IR Models
Introduction to IR Models
At the end of this chapter every student must be able to:
 Define what a model is
 Describe why models are needed in information retrieval
 Differentiate between different types of information retrieval models:
 Boolean model
 Vector space model
 Probabilistic model
 Know how to calculate and find the similarity of some
documents to a given query
 Identify term frequency, document frequency, inverse
document frequency, term weight and similarity
measurements
What is a model?
A model is an ideal abstraction of something
that works in the real world.
There are two good reasons for having models of IR:
1. Models guide research and provide the means
for academic discussion
2. Models can serve as a blueprint to implement an
actual retrieval system
IR Models
• In IR, mathematical models are used to understand and reason
about some behavior or phenomena in the real world
• A model of information retrieval predicts and explains what a
user will find relevant given the user's query
Retrieval model
• Thus, retrieval models are models that can describe
the computational processes (here, retrieval)
– e.g., how documents are ranked
– e.g., how similarities are measured
• They are models that attempt to describe the
human process
– e.g., the information need, interaction
– Few do so meaningfully
• They are models that specify the details of
– Document representation
– Query representation
– Retrieval function (matching function)
– Ranking
Retrieval Models
• A number of IR models have been proposed over the years to
retrieve information
• The following are the major models developed to retrieve
information
– Boolean model
• An exact match model
– Statistical models
• Vector space and probabilistic models are the major
statistical models
• Are "best match" models
– Linguistic and knowledge-based models
• Are also "best match" models
What is the difference between best match and exact match?
Types of models
• The three classic information retrieval models are:
– Boolean retrieval model
– Vector space model
– Probabilistic model
1. Boolean model
 A document either matches a query, or does not.
 The Boolean retrieval model is a model for information
retrieval in which we can pose (create) any query in
the form of a Boolean expression of terms, that is,
in which terms are combined with the operators
AND, OR, and NOT.
…..cont
 The first model of information retrieval
 The most criticized model
 Based on Boolean algebra, developed by George Boole
• Boole defined 3 basic operators:
AND
OR
NOT
……cont
• Boolean relevance prediction (R)
– A document is predicted as relevant to a query iff it satisfies the query
expression
– Each query term specifies a set of documents containing it
• AND (∧): the intersection of two sets
• OR (∨): the union of two sets
• NOT (¬): set inverse, or really set difference
– A query, thus, searches a set of documents to determine their content
– The search engine retrieves those documents satisfying the logical
constraints of the query
….cont
• Most queries search for more than one term
– Information need: to find all documents containing "information"
and "retrieval"
Answer: only documents containing both "information" and "retrieval"
satisfy this query
– Information need: to find all documents containing "information"
or "retrieval"
Answer: satisfied by a document that contains either of the two words,
or both
– Information need: to find all documents containing "information"
or "retrieval", but not both
Answer: satisfied by a document that contains exactly one of the two
words
Boolean model
• Consider a set of five docs and assume that they contain the terms
shown in the table

Doc.   Terms
D1     algorithm, information, retrieval
D2     retrieval, science
D3     algorithm, information, science
D4     pattern, retrieval, science
D5     science, algorithm

Find the documents retrieved by the following expressions:
a. information AND retrieval
{d1,d3} ∩ {d1,d2,d4} = {d1}
b. information OR retrieval
{d1,d3} ∪ {d1,d2,d4} = {d1,d2,d3,d4}
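The set operations above can be checked with a short Python sketch
(a minimal illustration of the Boolean model; the posting() helper and
the lower-case document ids are our own, taken from the table above):

docs = {
    "d1": {"algorithm", "information", "retrieval"},
    "d2": {"retrieval", "science"},
    "d3": {"algorithm", "information", "science"},
    "d4": {"pattern", "retrieval", "science"},
    "d5": {"science", "algorithm"},
}

def posting(term):
    # The set of document ids whose term set contains the term
    return {d for d, terms in docs.items() if term in terms}

print(posting("information") & posting("retrieval"))  # AND: {'d1'}
print(posting("information") | posting("retrieval"))  # OR: {'d1','d2','d3','d4'}
print(posting("science") - posting("retrieval"))      # science AND NOT retrieval: {'d3','d5'}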
Advantages and disadvantages of Boolean model
• Advantages of Boolean model
 A very simple model based on sets (easy for experts)
 Computationally efficient
 Expressive and clear
 Still a dominant model in commercial database systems
Disadvantages of Boolean model
• Disadvantages
 Users need to be trained
 Very limited in expressing the user's information need in detail
 No weighting of index or query terms
 Based on exact matching: there may be relevant documents
that are only partially matched
Vector Space Model
Suggested by Hans Peter Luhn and Gerard Salton
A classic model of document retrieval based on
representing documents and queries as vectors
Partial matching is possible
Retrieval is based on the similarity between the query
vector and the document vectors
The output documents are ranked according to this
similarity
….cont
• The similarity is based on the occurrences of the
keywords in the query and the document
• The angle between the query and the document is measured
using the cosine similarity measure, since both the
document and the user's query are represented as vectors
in the VSM
…cont
• The VSM assumes that if document vector V1 is closer to
the query than another document vector V2, then
 the document represented by V1 is more
relevant than the one represented by V2
In the VSM, to decide the similarity of a document to the
given query, term weighting (tf*idf) is used. To
calculate tf*idf, we first have to calculate the following:
1. Term frequency (tf)
• Term frequency is the number of times a given
term appears in a document

tf = (frequency of the term in the document) /
     (frequency of the most frequent term in that document)
(i.e., tf is normalized per document)
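As a quick check, a minimal Python sketch of this normalized term
frequency (the tf() helper is illustrative, not a standard API):

from collections import Counter

def tf(term, doc_tokens):
    # tf = raw count of the term / count of the most frequent term in the document
    counts = Counter(doc_tokens)
    return counts[term] / max(counts.values())

print(tf("new", ["new", "new", "times"]))    # 2/2 = 1.0
print(tf("times", ["new", "new", "times"]))  # 1/2 = 0.5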


2. Inverse document frequency (idf)
• idf is used to measure whether a term is
common or rare across all documents
• idf = log2(N/df), where
N = total number of documents in the collection
df = document frequency (the number of documents
containing the given term)
Changing the base of the logarithm (e.g. from log2 to log10) only
scales all idf values by a constant factor; it does not change the
ranking of the documents.
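The idf formula in a minimal Python sketch (log base 2, matching the
formula above; the helper name is illustrative):

import math

def idf(df, N):
    # idf = log2(N / df): rare terms get high idf, common terms get low idf
    return math.log2(N / df)

print(idf(df=2, N=3))  # log2(3/2) ≈ 0.585
print(idf(df=1, N=3))  # log2(3/1) ≈ 1.585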
3. Term weighting (tf*idf)
Term weighting is the assignment of numerical values to
terms that represent their importance in a document, in
order to improve retrieval effectiveness
• Term weight = tf * idf, i.e. tfi,j * log(N/dfi)
4. Document length
Document length normalization adjusts the term frequency or the
relevance score in order to normalize the effect of document
length on the document ranking.
Document length = the square root of the sum of the squared term
weights of the document
5. Similarity
• At the end we need to calculate the similarity of the documents to
the query.
• The widely used measure of similarity in the vector space model is
called the cosine similarity.
• The cosine similarity between a document vector dj and a query
vector q is given by:
sim(dj, q) = (dj · q) / (|dj| × |q|)
           = Σi (wi,j × wi,q) / (√(Σi wi,j²) × √(Σi wi,q²))
• That is, the denominator of the formula is the length of the
document times the length of the query.
Examples
• Example 1: If the following three documents are given with
one query, which document must be ranked first?
D1: new york times
D2: new york post
D3: los angeles times
Query: new new times

Solution
• Step 1: calculate the inverse document frequency (idf) of each term
• Step 2: calculate the term frequencies (tf)
• Step 3: calculate the term weights (tw): tw = tf * idf
• Step 4: calculate the length of each document and the length of
the query
• Step 5: calculate the similarity of each document to the query

Therefore, since sim(D1) > sim(D2) > sim(D3), the documents must be
ranked as: D1, D2, D3.
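The five steps can be reproduced end to end with a short Python
sketch (a minimal illustration of the Example 1 computation; the
weights() and cosine() helpers are our own, not a standard API):

import math
from collections import Counter

docs = {
    "D1": "new york times",
    "D2": "new york post",
    "D3": "los angeles times",
}
query = "new new times"
N = len(docs)

# Step 1: document frequency and idf (log base 2) of every term
df = Counter(t for text in docs.values() for t in set(text.split()))
idf = {t: math.log2(N / d) for t, d in df.items()}

# Steps 2-3: tf normalized by the most frequent term, times idf
def weights(text):
    counts = Counter(text.split())
    max_f = max(counts.values())
    return {t: (f / max_f) * idf[t] for t, f in counts.items() if t in idf}

# Steps 4-5: vector lengths and cosine similarity
def cosine(a, b):
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    len_a = math.sqrt(sum(w * w for w in a.values()))
    len_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (len_a * len_b)

q = weights(query)
for name, text in docs.items():
    print(name, round(cosine(weights(text), q), 3))
# D1 0.775, D2 0.293, D3 0.113 -> ranking: D1, D2, D3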
Example 2
 Which document must be ranked first for the
following?
Doc1: Breakthrough drug for schizophrenia
Doc2: New schizophrenia drug
Doc3: New approach for treatment of schizophrenia
Doc4: New hopes for schizophrenia patients
Query: Treatment of schizophrenia patients
Example 3
From the following documents, which one must be ranked first?
D1: The health observances for march
D2: The health oriented calendar
D3: The awareness news for march awareness
Q: March health awareness
Advantages and Disadvantages of VSM
• Advantages: supports partial matching, produces ranked output,
and term weighting improves retrieval effectiveness
• Disadvantages: assumes index terms are independent, and the
high-dimensional vectors are costly to compute and store
Latent semantic indexing
• Latent Semantic Indexing (LSI) is an extension of the
vector space retrieval method (Deerwester et al., 1990).
• LSI can retrieve relevant documents even when they do
not share any words with the query.
• If only a synonym of the keyword is present in a
document, the document will still be found relevant.
LSI/LSA
• Latent Semantic Indexing (LSI) [Deerwester et al.] tries to
overcome the problems of lexical matching by using
statistically derived conceptual indices instead of individual
words for retrieval.
• Latent Semantic Indexing is a technique that projects queries
and documents into a space with "latent" semantic dimensions.
• In the latent semantic space, a query and a document can
have high cosine similarity even if they do not share any
terms.
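To make the idea concrete, here is a minimal LSI sketch using a
truncated SVD on a toy term-document matrix (the vocabulary, the
documents and the choice k = 2 are illustrative assumptions, not from
the slides):

import numpy as np

#             d1  d2  d3        d1 = "car engine"
A = np.array([[1,  0,  0],    # car         d2 = "automobile engine"
              [0,  1,  0],    # automobile  d3 = "banana fruit"
              [1,  1,  0],    # engine
              [0,  0,  1],    # banana
              [0,  0,  1]],   # fruit
             dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
Uk, sk, Vk = U[:, :k], s[:k], Vt[:k, :].T   # rows of Vk = docs in latent space

q = np.array([1.0, 0, 0, 0, 0])             # query: "car"
q_hat = (q @ Uk) / sk                       # fold the query into the latent space

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

for j, d in enumerate(Vk, 1):
    print(f"d{j}: {cos(q_hat, d):.3f}")
# d2 ("automobile engine") scores as high as d1 even though it shares
# no term with the query "car"; d3 ("banana fruit") scores 0.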
Probabilistic Model
Why introduce probabilities and probability theory in IR?
 As a process, retrieval is inherently uncertain
 Understanding of the user's information need is uncertain
 Are we sure the user mapped his need into a good query?
 Even if the query represents the need well, did we represent it well?
 Estimating document relevance for the query
 Uncertainty from the selection of the document representation
 Uncertainty from matching the query and documents
Probability theory is a common framework for modeling uncertainty.
An IR system is uncertain primarily about:
1. Understanding of the query
2. Whether a document satisfies the query
Probability theory
 Provides a principled foundation for reasoning under uncertainty
 Probabilistic information retrieval models estimate how likely it is that a
document is relevant for a query
Probabilistic IR models
 Classic probabilistic models (BIM, BM11, BM25)
 Bayesian networks for text retrieval
• Probabilistic IR models are among the oldest, but also among the
best-performing and most widely used IR models
Probability ranking principle
• Ranked retrieval setup: given a collection of documents, the
user issues a query, and an ordered list of documents is
returned.
• Assume a binary notion of relevance: Rd,q is a dichotomous
random variable, such that
– Rd,q = 1 if document d is relevant w.r.t. query q
– Rd,q = 0 otherwise
• Probabilistic ranking orders documents decreasingly by their
estimated probability of relevance w.r.t. the query: P(R = 1|d, q)
Bayesian based probabilistic model
• Let x be a document in the collection
• Let R represent relevance of a document with respect to the given
query, and let NR represent non-relevance
• We need to find P(R|x), the probability that document x is relevant.
By Bayes' rule, P(R|x) = P(x|R) · P(R) / P(x), where
P(R|x) = probability that document x is relevant
P(x|R) = probability that if a relevant document is retrieved, it is x
P(R) = probability of a relevant document in the collection
P(x) = probability that x is in the collection
Example 1
• Assume that the following is given:
P(R) = 0.6
P(x) = 0.5
P(x|R) = 0.7
Then what is P(R|x)?
• P(R|x) = P(R) · P(x|R) / P(x) = (0.6 × 0.7) / 0.5 = 0.84
Example 2
Assume that document y is in the collection.
The probability that if a non-relevant document is retrieved, it is y:
P(y|NR) = 0.2
The probability of non-relevant documents in the collection: P(NR) = 0.6
The probability of y in the collection: P(y) = 0.4
a) What is the probability that y is a non-relevant document?
b) Is the document relevant or non-relevant?
Solution
a) P(NR|y) = P(y|NR) · P(NR) / P(y) = (0.2 × 0.6) / 0.4 = 0.3
b) P(R|y) + P(NR|y) = 1
   P(R|y) = 1 - P(NR|y) = 1 - 0.3 = 0.7
   Hence the document is relevant.
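Both examples are direct applications of Bayes' rule; a minimal
Python sketch (the bayes() helper name is illustrative):

def bayes(prior, likelihood, evidence):
    # P(H|x) = P(H) * P(x|H) / P(x)
    return prior * likelihood / evidence

# Example 1: P(R|x)
print(round(bayes(prior=0.6, likelihood=0.7, evidence=0.5), 2))  # 0.84

# Example 2: P(NR|y), then P(R|y) = 1 - P(NR|y)
p_nr_y = bayes(prior=0.6, likelihood=0.2, evidence=0.4)          # 0.3
print(round(1 - p_nr_y, 2))                                      # 0.7 -> relevant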


Binary independence model
• As the name implies, this model assumes that the index terms
exist independently in the documents and we can then assign
binary (0,1) values to these index terms.

• For a further illustration of this model, consider a document


Dk in a collection, is represented by a binary vector t = (t 1 ,t 2
,t 3 ,…,t u ) where u represents total number of terms in the
collection, t i =1 indicates the presence of the ith index term and
t i =0 indicates its absence.
Binary independence model
• A decision rule can be formulated by which any document
can be assigned to either the relevant or the non-relevant set of
documents for a particular query.
• The obvious rule is to assign a document to the relevant set if the
probability of the document being relevant given the
document representation is greater than the probability of the
document being non-relevant, that is, if:
P(relevant|t) > P(non-relevant|t)
Using Bayes' theorem, each side can be rewritten as, e.g.,
P(relevant|t) = P(t|relevant) · P(relevant) / P(t).
BIM
• Binary: documents and queries are represented as binary vectors
• Independence: terms are assumed to be independent of each other
• Some of the assumptions of the BIM can be removed. For
example, we can remove the assumption that terms are
independent
• A case that violates this assumption is term pairs like: Hong and
Kong, New York, Los Angeles, Addis Ababa, Arba Minch, Abba
Gada, Haadha Siinqee, which are strongly dependent on each other

                                        Relevant docs   Non-relevant docs   Total
No. of documents containing term tk     r               n - r               n
No. of documents not containing term tk R - r           N - n - R + r       N - n
Total                                   R               N - R               N
Definition
N = total number of documents in the collection
n = total number of documents containing the term tk
R = total number of relevant documents retrieved
r = total number of relevant documents retrieved containing the term tk

Based on this:
The odds of term tk appearing in a relevant document are given by
r / (R - r)
The odds of term tk appearing in an irrelevant document are given by
(n - r) / (N - n - R + r)
• If the odds of term tk appearing in relevant documents equal
those for non-relevant documents, then
r / (R - r) = (n - r) / (N - n - R + r)
Now we can calculate the relevance function as:
Relevance function (W) = r(N - n - R + r) / ((R - r)(n - r)); but when
no relevant document contains the term tk, this becomes zero, so we
add 0.5 to each factor to avoid zero results.
Example
N = 20: total no. of documents in the collection
n = 10: total no. of documents containing the term tk
R = 15: total no. of relevant documents retrieved
r = 5: total no. of relevant documents retrieved containing the term tk
Then calculate the relevance function of the term tk.
Solution
• Since the result becomes zero when there is no relevant document,
we add 0.5 to each factor of the formula:
RF(W) = (r + 0.5)(N - n - R + r + 0.5) / ((R - r + 0.5)(n - r + 0.5))
RF(W) = (5 + 0.5)(20 - 10 - 15 + 5 + 0.5) / ((15 - 5 + 0.5)(10 - 5 + 0.5))
      = (5.5 × 0.5) / (10.5 × 5.5)
      ≈ 0.048 (the relevance weight of the term tk)
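The smoothed relevance function in a minimal Python sketch (the
helper name is illustrative; it reproduces the ≈ 0.048 above):

def relevance_weight(N, n, R, r):
    # W = (r+0.5)(N-n-R+r+0.5) / ((R-r+0.5)(n-r+0.5))
    return ((r + 0.5) * (N - n - R + r + 0.5)) / \
           ((R - r + 0.5) * (n - r + 0.5))

print(round(relevance_weight(N=20, n=10, R=15, r=5), 3))
# (5.5 * 0.5) / (10.5 * 5.5) ≈ 0.048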
BM25
• Stands for Best Match
• Used for modern full-text search collections (it pays attention to
term frequency and document length); also called the Okapi
weighting scheme
• It uses the following contingency table:

         Relevant   Non-relevant      Total
Di = 1   ri         ni - ri           ni
Di = 0   R - ri     N - ni - R + ri   N - ni
Total    R          N - R             N

Where:
ri = number of documents relevant to Q containing term ti
R = number of documents relevant to Q
ni = number of documents containing term ti
N = number of documents in the given collection
tf = frequency (number of occurrences) of the term in Di
qtf = frequency of the term in Q
avdl = average document length in the given collection
dl = length of Di
k1, k2, K = free parameters, where K = k1 · [(1 - b) + b · dl/avdl]
Typically k1 = 1.2, k2 = 0 to 1000, b = 0.75
Formula of BM25
• BM25(Q, D) = Σ over query terms of:
log[ ((ri + 0.5)/(R - ri + 0.5)) / ((ni - ri + 0.5)/(N - ni - R + ri + 0.5)) ]
  × ((k1 + 1)·tf) / (K + tf)
  × ((k2 + 1)·qtf) / (k2 + qtf)
Example
• Assume the query 'information system', where each term has qtf = 1
• N = 1,000,000
• No relevance information (r and R = 0)
• 'information' occurs in 500,000 documents (n1 = 500,000)
• 'system' occurs in 10,000 documents (n2 = 10,000)
• 'information' occurs 100 times in the document (f1 = 100)
• 'system' occurs 50 times in the document (f2 = 50)
• The document length is 80% of the average, i.e. dl/avdl = 0.8
• k1 = 1.2, k2 = 200, b = 0.75, so K = 1.2 · [(1 - 0.75) + 0.75 × 0.8] = 1.02
Formula of BM25 applied
• For 'information':
ln[(0 + 0.5)/(0 - 0 + 0.5) / ((500,000 - 0 + 0.5)/(1,000,000 - 500,000 - 0 + 0 + 0.5))]
  × ((1.2 + 1)·100)/(1.02 + 100) × ((200 + 1)·1)/(200 + 1)
  ≈ -0.000002 × 2.178 × 1 ≈ 0
• For 'system':
ln[(0 + 0.5)/(0 - 0 + 0.5) / ((10,000 - 0 + 0.5)/(1,000,000 - 10,000 - 0 + 0 + 0.5))]
  × ((1.2 + 1)·50)/(1.02 + 50) × ((200 + 1)·1)/(200 + 1)
  ≈ 4.595 × 2.156 × 1 ≈ 9.91
• BM25(Q, D) ≈ 0 + 9.91 ≈ 9.91 (using the natural logarithm)
Therefore, the similarity of the query 'information system' to the
document is about 9.91; nearly all of the score comes from the rarer
term 'system', while the very common term 'information' contributes
almost nothing.
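The whole computation in a minimal Python sketch (natural log,
scoring each query term as in the formula above; the bm25_term()
helper is our own, not a standard API):

import math

def bm25_term(N, n, R, r, tf, qtf, k1=1.2, k2=200, K=1.02):
    # log of the smoothed relevance odds ratio
    idf_part = math.log(((r + 0.5) / (R - r + 0.5)) /
                        ((n - r + 0.5) / (N - n - R + r + 0.5)))
    tf_part = ((k1 + 1) * tf) / (K + tf)      # document term frequency component
    qtf_part = ((k2 + 1) * qtf) / (k2 + qtf)  # query term frequency component
    return idf_part * tf_part * qtf_part

N = 1_000_000
score = (bm25_term(N, n=500_000, R=0, r=0, tf=100, qtf=1) +  # 'information'
         bm25_term(N, n=10_000, R=0, r=0, tf=50, qtf=1))     # 'system'
print(round(score, 2))  # ≈ 9.91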
Exercise
• Compare and contrast the following information retrieval
models (pros and cons):
Boolean model vs. vector space model vs. probabilistic model
END OF CHAPTER 4