Lecture 2-Boolean Retrieval
Introduction to Information
Retrieval
Information Retrieval
Information Retrieval (IR) is finding
material (usually documents) of an
unstructured nature (usually text) that
satisfies an information need from within
large collections (usually stored on
computers).
Boolean Retrieval
Which plays of Shakespeare contain the words Brutus AND
Caesar but NOT Calpurnia?
One could grep all of Shakespeare's plays for Brutus and
Caesar, then strip out those containing Calpurnia.
Why is that not the answer?
Slow (for large corpora)
No ranked retrieval (returning the best documents first)
The way to avoid linearly scanning the text for each query is
to index the documents in advance.
Term-document incidence matrix
An entry is 1 if the play contains the word, 0 otherwise.
Incidence vectors
So we have a 0/1 vector for each term.
To answer the query, take the vectors for Brutus,
Caesar and Calpurnia, complement the last,
and then do a bitwise AND.
110100 AND 110111 AND 101111 = 100100.
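The bitwise computation above can be sketched in a few lines of Python. This is only an illustration: the vectors are the ones given on the slide (Calpurnia's vector is inferred as the complement of 101111), and the names `N_PLAYS`, `mask`, and `hits` are made up for the example.

```python
# Incidence vectors from the slide, one bit per play (leftmost bit = first play).
N_PLAYS = 6
brutus = 0b110100
caesar = 0b110111
calpurnia = 0b010000                 # inferred: its complement is 101111

mask = (1 << N_PLAYS) - 1            # keep only the 6 meaningful bits
not_calpurnia = ~calpurnia & mask    # complement the last vector -> 101111

answer = brutus & caesar & not_calpurnia
print(format(answer, "06b"))         # 100100

# Positions (0-based, left to right) of the plays that satisfy the query.
hits = [i for i in range(N_PLAYS) if answer >> (N_PLAYS - 1 - i) & 1]
print(hits)                          # [0, 3]
```

The mask is needed because Python integers are unbounded, so a plain `~` would produce a negative number rather than a 6-bit complement.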
Answers to query
Antony and Cleopatra, Act III, Scene ii
Agrippa [Aside to DOMITIUS ENOBARBUS]: Why, Enobarbus,
When Antony found Julius Caesar dead,
He cried almost to roaring; and he wept
When at Philippi he found Brutus slain.
Bigger collections
Consider N = 1 million documents, each with
about 1000 words.
At an average of 6 bytes/word (including spaces and punctuation),
that is about 6 GB of data in the documents.
Say there are M = 500K distinct terms among
these.
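The arithmetic behind these figures, and what it implies for a term-document matrix at this scale, can be checked with a short calculation. The variable names are illustrative, and the sparsity bound is a derived observation (each word occurrence can set at most one matrix cell to 1), not a number stated on the slide.

```python
N = 1_000_000          # documents
WORDS_PER_DOC = 1_000  # approximate words per document
BYTES_PER_WORD = 6     # average, including spaces/punctuation
M = 500_000            # distinct terms

corpus_bytes = N * WORDS_PER_DOC * BYTES_PER_WORD
matrix_cells = M * N                 # one 0/1 cell per (term, document) pair
max_ones = N * WORDS_PER_DOC         # each word occurrence sets at most one cell

print(corpus_bytes)                  # 6000000000, i.e. about 6 GB
print(matrix_cells)                  # 500000000000 cells
print(max_ones / matrix_cells)       # 0.002, so at most 0.2% of cells are 1
```

Under these assumptions the matrix would be overwhelmingly zeros, which is the motivation for recording only the 1 positions.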
Inverted index
For each term t, we must store a list of all
documents that contain t.
Identify each by a docID, a document serial
number
Figure: postings lists for the terms Brutus, Caesar and Calpurnia.
Calpurnia -> 2, 31, 54, 101
Brutus -> ..., 11, 31, 45, 173, 174
Caesar -> ..., 16, 57, 132
Inverted index
We need variable-size postings lists.
On disk, a contiguous run of postings is normal and best.
In memory, we can use linked lists or variable-length arrays.
Some tradeoffs in size/ease of insertion.
Figure: the dictionary (Brutus, Caesar, Calpurnia) maps each term to its postings list; each docID entry in a list is a posting, and postings are sorted by docID.
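One possible in-memory layout is sketched below, with a Python dict as the dictionary and a plain list as a variable-length array of postings. The helper `add_posting` and the sample docIDs are hypothetical, chosen only to show the insertion cost that the tradeoff above refers to.

```python
import bisect
from collections import defaultdict

# Dictionary: term -> postings list (a Python list acts as a variable-length array).
index = defaultdict(list)

def add_posting(term, doc_id):
    """Insert doc_id into the term's postings, keeping them sorted and unique."""
    postings = index[term]
    pos = bisect.bisect_left(postings, doc_id)
    if pos == len(postings) or postings[pos] != doc_id:
        postings.insert(pos, doc_id)   # shifts later entries: the array-insertion cost

# Illustrative docIDs, added out of order and with one duplicate on purpose.
for doc_id in (54, 2, 101, 31, 54):
    add_posting("calpurnia", doc_id)

print(index["calpurnia"])   # [2, 31, 54, 101]
```

An array keeps postings compact and fast to scan, but inserting in the middle shifts every later entry. A linked list makes insertion cheap at the cost of a pointer per posting.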
Figure: indexing pipeline. Token stream (e.g. Countrymen) -> Linguistic modules -> Modified tokens (friend, roman, countryman) -> Indexer -> Inverted index (friend, roman, countryman).
Doc 1
I did enact Julius
Caesar I was killed
i' the Capitol;
Brutus killed me.
Doc 2
So let it be with
Caesar. The noble
Brutus hath told you
Caesar was ambitious
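As a rough sketch of how these two documents could be turned into an inverted index, the code below uses a common sort-based approach: emit (term, docID) pairs, sort them, then collapse duplicates into postings lists. The `tokenize` function is a crude stand-in for real linguistic modules (it only lowercases and strips punctuation), and all names are assumptions made for the example.

```python
import re
from collections import defaultdict

docs = {
    1: "I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious",
}

def tokenize(text):
    """Lowercase and split on non-letters; apostrophes are kept inside tokens."""
    return re.findall(r"[a-z']+", text.lower())

# Step 1: generate (term, docID) pairs. Step 2: sort by term, then by docID.
pairs = sorted(
    (term, doc_id) for doc_id, text in docs.items() for term in tokenize(text)
)

# Step 3: collapse repeated pairs into one postings list per term.
index = defaultdict(list)
for term, doc_id in pairs:
    if not index[term] or index[term][-1] != doc_id:
        index[term].append(doc_id)

print(index["brutus"])        # [1, 2]
print(index["killed"])        # [1]
print(index["ambitious"])     # [2]
print(len(index["caesar"]))   # 2, the document frequency of "caesar"
```

Because the pairs are sorted before grouping, each postings list comes out sorted by docID without any extra work.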
Figure: postings lists for Brutus and Caesar (docIDs recoverable from the figure: 1, 3, 8, 16, 21, 32, 34, 64, 128).
The merge
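Walk through the two postings lists
simultaneously, in time linear in the total
number of postings entries. This works because
both lists are sorted by docID.
Figure: postings lists for Brutus and Caesar, walked in parallel (docIDs recoverable from the figure: 2, 8, 13, 16, 21, 32, 34, 64, 128).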
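A minimal sketch of the merge follows, assuming both inputs are Python lists sorted by docID. The sample lists are illustrative values chosen for this example, since the figure's lists are only partly recoverable.

```python
def intersect(p1, p2):
    """Intersect two postings lists that are sorted by docID."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])   # docID present in both lists
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1                 # advance the pointer at the smaller docID
        else:
            j += 1
    return answer

# Illustrative sample lists (assumed values).
brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
print(intersect(brutus, caesar))   # [2, 8]
```

Each loop iteration advances at least one pointer, so the number of steps is bounded by the combined length of the two lists.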
Query optimization
What is the best order for query processing?
Consider a query that is an AND of n terms.
For each of the n terms, get its postings,
then AND them together.
Figure: postings lists for the three terms.
Brutus -> 2, 4, 8, 16, 32, 64, 128
Caesar -> ..., 16, 21, 34
Calpurnia -> 13, 16
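A common heuristic, sketched here as an assumption rather than as the slide's stated answer, is to process terms in order of increasing postings-list length, so that intermediate results start small and can only shrink. The function names are hypothetical, and the leading docIDs for Caesar are assumed values because they did not survive in the figure.

```python
def intersect(p1, p2):
    """Linear merge of two docID-sorted postings lists."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

def intersect_many(postings_lists):
    """AND together one or more postings lists, shortest list first."""
    ordered = sorted(postings_lists, key=len)
    result = ordered[0]
    for postings in ordered[1:]:
        if not result:             # nothing can match any more, stop early
            break
        result = intersect(result, postings)
    return result

index = {
    "brutus": [2, 4, 8, 16, 32, 64, 128],
    "caesar": [1, 2, 3, 5, 8, 16, 21, 34],   # first five docIDs are assumed
    "calpurnia": [13, 16],
}
order = sorted(index, key=lambda term: len(index[term]))
print(order)                             # ['calpurnia', 'brutus', 'caesar']
print(intersect_many(index.values()))    # [16]
```

Starting with the shortest list bounds every intermediate result by that list's length, which is why document frequency is a useful quantity to keep in the dictionary.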
Exercise
Recommend a query
processing order for
Term          Freq
eyes          213312
kaleidoscope  87009
marmalade     107913
skies         271658
tangerine     46653
trees         316812