lecture1-intro
lecture1-intro
Information Retrieval
Introducing Information Retrieval
and Web Search
Information Retrieval
• Information Retrieval (IR) is finding material
(usually documents) of an unstructured nature
(usually text) that satisfies an information need
from within large collections (usually stored on
computers).
2
Unstructured (text) vs. structured (database)
data in the mid-nineties
3
Unstructured (text) vs. structured (database)
data today
4
Sec. 1.1
5
The classic search model
User task Get rid of mice in a
politically correct way
Misconception?
Info need
Info about removing mice
without killing them
Misformulation?
Query Searc
how trap mice alive
h
Search
engine
Query Results
Collection
refinement
Sec. 1.1
7
Introduction to
Information Retrieval
Term-document incidence matrices
Sec. 1.1
Antony and Cleopatra Julius Caesar The Tempest Hamlet Othello Macbeth
Antony 1 1 0 0 0 1
Brutus 1 1 0 1 0 0
Caesar 1 1 0 1 1 1
Calpurnia 0 1 0 0 0 0
Cleopatra 1 0 0 0 0 0
mercy 1 0 1 1 1 1
worser 1 0 1 1 1 0
Incidence vectors
• So we have a 0/1 vector for each term.
• To answer query: take the vectors for Brutus,
Caesar and Calpurnia (complemented) ➔
bitwise AND.
– 110100 AND
– 110111 AND Antony
Antony and Cleopatra
1
Julius Caesar
1
The Tempest
0
Hamlet
0
Othello
0
Macbeth
1
– 101111 =
Brutus 1 1 0 1 0 0
Caesar 1 1 0 1 1 1
Calpurnia 0 1 0 0 0 0
– 100100
Cleopatra 1 0 0 0 0 0
mercy 1 0 1 1 1 1
worser 1 0 1 1 1 0
11
Sec. 1.1
Answers to query
12
Sec. 1.1
Bigger collections
• Consider N = 1 million documents, each with
about 1000 words.
• Avg 6 bytes/word including
spaces/punctuation
– 6GB of data in the documents.
• Say there are M = 500K distinct terms among
these.
13
Sec. 1.1
14
Introduction to
Information Retrieval
The Inverted Index
The key data structure underlying
modern IR
Sec. 1.2
Inverted index
• For each term t, we must store a list of all
documents that contain t.
– Identify each doc by a docID, a document serial
number
• Can we used fixed-size arrays for this?
Brutus 1 2 4 11 31 45 173 174
Caesar 1 2 4 5 6 16 57 132
Calpurnia 2 31 54 101
What happens if the word Caesar
is added to document 14? 16
Sec. 1.2
Inverted index
• We need variable-size postings lists
– On disk, a continuous run of postings is normal
and best
– In memory, can use linked lists or variable length
arrays Posting
• Some tradeoffs in size/ease of insertion
Brutus 1 2 4 11 31 45 173 174
Caesar 1 2 4 5 6 16 57 132
Calpurnia 2 31 54 101
Dictionary Postings
Sorted by docID (more later on why).
17
Sec. 1.2
Tokenizer
Linguistic modules
Indexer friend 2 4
roman 1 2
Inverted index
countryman 13 16
Initial stages of text processing
• Tokenization
– Cut character sequence into word tokens
• Deal with “John’s”, a state-of-the-art solution
• Normalization
– Map text and query term to same form
• You want U.S.A. and USA to match
• Stemming
– We may wish different forms of a root to match
• authorize, authorization
• Stop words
– We may omit very common words (or not)
• the, a, to, of
Sec. 1.2
Doc 1 Doc 2
Why frequency?
Will discuss later.
Sec. 1.2
Lists of
docIDs
Terms
and
counts
IR system
implementation
• How do we
index efficiently?
• How much
storage do we
need?
Pointers 23
Introduction to
Information Retrieval
Query processing with an inverted index
Sec. 1.3
25
Sec. 1.3
26
Sec. 1.3
The merge
• Walk through the two postings
simultaneously, in time linear in the total
number of postings entries
2 4 8 16 32 64 128 Brutus
1 2 3 5 8 13 21 34 Caesar
28
Introduction to
Information Retrieval
The Boolean Retrieval Model
& Extended Boolean Models
Sec. 1.3
Boolean queries:
More general merges
• Exercise: Adapt the merge for the queries:
Brutus AND NOT Caesar
Brutus OR NOT Caesar
33
Sec. 1.3
Merging
What about an arbitrary Boolean formula?
(Brutus OR Caesar) AND NOT
(Antony OR Cleopatra)
• Can we always merge in “linear” time?
– Linear in what?
• Can we do better?
34
Sec. 1.3
Query optimization
• What is the best order for query
processing?
• Consider a query that is an AND of n terms.
• For each of the n terms, get its postings,
then AND them together.
Brutus 2 4 8 16 32 64 128
Caesar 1 2 3 5 8 16 21 34
Calpurnia 13 16
Brutus 2 4 8 16 32 64 128
Caesar 1 2 3 5 8 16 21 34
Calpurnia 13 16
37
Exercise
• Recommend a query
processing order for
38
Query processing exercises
• Exercise: If the query is friends AND romans AND
(NOT countrymen), how could we use the freq of
countrymen?
• Exercise: Extend the merge to an arbitrary
Boolean query. Can we always guarantee
execution in time linear in the total postings size?
• Hint: Begin with the case of a Boolean formula
query: in this, each query term appears only once
in the query.
39
Consider these documents:
Doc 1 breakthrough drug for schizophrenia
Doc 2 new schizophrenia drug
Doc 3 new approach for treatment of schizophrenia
Doc 4 new hopes for schizophrenia patients
a. Draw the term-document incidence matrix for this
document collection.
b. Draw the inverted index representation for this
collection
For the document collection above, what are the returned
results for
these queries:
a. schizophrenia AND drug
b. for AND NOT(drug OR approach)
Exercise
• Try the search feature at
http://www.rhymezone.com/shakespeare/
• Write down five search features you think it
could do better
41
Introduction to
Information Retrieval
Phrase queries and positional indexes
Sec. 2.4
Phrase queries
• We want to be able to answer queries such as
“stanford university” – as a phrase
• Thus the sentence “I went to university at
Stanford” is not a match.
– The concept of phrase queries has proven easily
understood by users; one of the few “advanced
search” ideas that works
– Many more queries are implicit phrase queries
• For this, it no longer suffices to store only
<term : docs> entries
Sec. 2.4.1
<be: 993427;
1: 7, 18, 33, 72, 86, 231;
Which of docs 1,2,4,5
2: 3, 149; could contain “to be
4: 17, 191, 291, 430, 434; or not to be”?
Proximity queries
• LIMIT! /3 STATUTE /3 FEDERAL /2 TORT
– Again, here, /k means “within k words of”.
• Clearly, positional indexes can be used for
such queries; biword indexes cannot.
• Exercise: Adapt the linear merge of postings to
handle proximity queries. Can you make it
work for any value of k?
– This is a little tricky to do correctly and efficiently
– See Figure 2.12 of IIR
Sec. 2.4.2
Rules of thumb
• A positional index is 2–4 as large as a non-
positional index
Combination schemes
• These two approaches can be profitably
combined
– For particular phrases (“Michael Jackson”, “Britney
Spears”) it is inefficient to keep on merging positional
postings lists
• Even more so for phrases like “The Who”
• Williams et al. (2004) evaluate a more
sophisticated mixed indexing scheme
– A typical web query mixture was executed in ¼ of the
time of using just a positional index
– It required 26% more space than having a positional
index alone
An algorithm for proximity intersection of postings lists p1 and p2.
The algorithm finds places where the two terms appear within k words of each other
and returns a list of triples giving docID and the term position in p1 and p2.
Introduction to
Information Retrieval
Structured vs. Unstructured Data
IR vs. databases:
Structured vs unstructured data
• Structured data tends to refer to information
in “tables”
Employee Manager Salary
Smith Jones 50000
Chang Smith 60000
Ivy Smith 50000
59
Semi-structured data
• In fact almost no data is “unstructured”
• E.g., this slide has distinctly identified zones such
as the Title and Bullets
• … to say nothing of linguistic structure
• Facilitates “semi-structured” search such as
– Title contains data AND Bullets contain search
• Or even
– Title is about Object Oriented Programming AND
Author something like stro*rup
– where * is the wild-card operator
60