Introduction to Information Retrieval
CS276: Information Retrieval and Web Search
Pandu Nayak and Prabhakar Raghavan
Lecture 2: The term vocabulary and postings lists
Ch. 1
Postings
Faster merges: skip lists
Positional postings and phrase queries
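[Figure: indexing pipeline. Token stream (Friends, Romans, Countrymen) → Linguistic modules → Modified tokens (friend, roman, countryman) → Indexer → Inverted index (friend, roman, countryman, each with its postings list, e.g., countryman → 13, 16).]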
Sec. 2.1
Parsing a document
What format is it in?
PDF/Word/Excel/HTML?
Sec. 2.1
Complications: Format/language
Documents being indexed can include docs
from many different languages
A single index may have to contain terms of
several languages.
What is a unit document?
A file?
An email? (Perhaps one of many in an mbox.)
An email with 5 attachments?
A group of files (PPT or LaTeX as HTML pages)
Sec. 2.2.1
Tokenization
Input: Friends, Romans, Countrymen
Output: Tokens
Friends
Romans
Countrymen
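A minimal sketch of the input/output above, assuming a token is simply a maximal run of letters, digits, or apostrophes. The function name `tokenize` and the regular expression are illustrative choices, not the lecture's method; real tokenizers need the language-specific handling discussed next.

```python
import re

def tokenize(text):
    """Split text into tokens: maximal runs of letters, digits, or apostrophes.

    Assumption for illustration only: punctuation and whitespace are separators.
    """
    return re.findall(r"[A-Za-z0-9']+", text)

print(tokenize("Friends, Romans, Countrymen"))
# prints ['Friends', 'Romans', 'Countrymen']
```

Each token is a candidate for an index entry after further processing.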
Sec. 2.2.1
Tokenization
Issues in tokenization:
Finland's capital →
Finland? Finlands? Finland's?
Hewlett-Packard → Hewlett and Packard as two tokens?
state-of-the-art: break up hyphenated sequence.
co-education
lowercase, lower-case, lower case?
It can be effective to get the user to put in possible hyphens
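One way to act on a user-supplied hyphen is to search for every plausible spelling of the hyphenated term. The sketch below is an assumption about how that could be done; the helper name `hyphen_variants` is hypothetical and does not come from the lecture.

```python
def hyphen_variants(term):
    """Return the spellings a hyphenated query term might take in documents.

    Hypothetical helper: for "lower-case" it yields the hyphenated form,
    the joined form, and the space-separated (phrase) form.
    """
    parts = term.split("-")
    if len(parts) == 1:
        return {term}  # no hyphen, nothing to expand
    return {term, "".join(parts), " ".join(parts)}

print(sorted(hyphen_variants("lower-case")))
# prints ['lower case', 'lower-case', 'lowercase']
```

The space-separated variant would have to be run as a phrase query.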
Sec. 2.2.1
Numbers
[Example residue: Japanese text intermingling Katakana, Hiragana, Kanji, and Romaji scripts, with dates/amounts in multiple formats (500, $500K, 6,000).]
Sec. 2.2.1
"Algeria achieved its independence in 1962 after 132 years of French occupation."
With Unicode, the surface presentation is complex, but the stored form is straightforward.
Sec. 2.2.2
Stop words
With a stop list, you exclude from the
dictionary entirely the commonest words.
Intuition:
They have little semantic content: the, a, and, to, be
There are a lot of them: ~30% of postings for top 30 words
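A minimal sketch of applying a stop list at indexing time. The stop list here is just the five example words from the slide, and the function name `remove_stop_words` is an illustrative assumption.

```python
# Stop list taken from the examples above; a real list would be longer.
STOP_WORDS = {"the", "a", "and", "to", "be"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens whose lowercased form is on the stop list."""
    return [t for t in tokens if t.lower() not in stop_words]

print(remove_stop_words("the king of Denmark and a friend".split()))
# prints ['king', 'of', 'Denmark', 'friend']
```

Note that "of" survives only because it is not on this tiny illustrative list.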
Sec. 2.2.3
Normalization to terms
We need to normalize words in indexed text as well as query words into the same form
We want to match U.S.A. and USA
U.S.A., USA → USA
anti-discriminatory, antidiscriminatory → antidiscriminatory
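One simple way to bring U.S.A. and USA into the same form is to delete periods and hyphens from every token, in both documents and queries. The sketch below assumes exactly that rule; the function name `normalize` is illustrative.

```python
def normalize(token):
    """Map a token to its equivalence-class representative.

    Assumed rule for illustration: delete periods and hyphens.
    """
    return token.replace(".", "").replace("-", "")

print(normalize("U.S.A."), normalize("USA"))
# prints USA USA
print(normalize("anti-discriminatory"))
# prints antidiscriminatory
```

The same function must be applied to query terms, otherwise indexed and queried forms will not match.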
Sec. 2.2.3
Normalization to terms
An alternative to equivalence classing is to do asymmetric expansion
An example of where this may be useful
Enter: window → Search: window, windows
Enter: windows → Search: Windows, windows, window
Enter: Windows → Search: Windows
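Asymmetric expansion can be sketched as a lookup from the term the user enters to the list of index terms actually searched. The table below is an illustrative assumption built around the window/windows/Windows example; `EXPANSIONS` and `expand` are hypothetical names.

```python
# Hypothetical expansion table: entered term -> terms to search for.
EXPANSIONS = {
    "window": ["window", "windows"],
    "windows": ["Windows", "windows", "window"],
    "Windows": ["Windows"],
}

def expand(query_term):
    """Return the index terms to search; unknown terms expand to themselves."""
    return EXPANSIONS.get(query_term, [query_term])

print(expand("windows"))
# prints ['Windows', 'windows', 'window']
print(expand("Windows"))
# prints ['Windows']
```

Unlike equivalence classing, the mapping is not symmetric: "Windows" does not pull in "window".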
Sec. 2.2.3
Morgen will ich in MIT
Is this the German "mit"?
Sec. 2.2.3
Case folding
Reduce all letters to lower case
Exception: upper case in mid-sentence?
e.g., General Motors
Fed vs. fed
SAIL vs. sail
Google example:
Query C.A.T.
#1 result was for cat (well, Lolcats)
not Caterpillar Inc.
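A minimal sketch of case folding with the optional mid-sentence exception. It assumes the input is the token list of a single sentence, so only the first token is treated as sentence-initial; the function name and the `keep_midsentence_caps` flag are illustrative.

```python
def case_fold(tokens, keep_midsentence_caps=False):
    """Lowercase tokens; optionally keep capitalized tokens that are not sentence-initial.

    Assumption: `tokens` holds exactly one sentence.
    """
    folded = []
    for i, tok in enumerate(tokens):
        if keep_midsentence_caps and i > 0 and tok[:1].isupper():
            folded.append(tok)  # e.g. keep "Fed" distinct from "fed"
        else:
            folded.append(tok.lower())
    return folded

print(case_fold(["The", "Fed", "raised", "rates"]))
# prints ['the', 'fed', 'raised', 'rates']
print(case_fold(["The", "Fed", "raised", "rates"], keep_midsentence_caps=True))
# prints ['the', 'Fed', 'raised', 'rates']
```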
Sec. 2.2.4
Lemmatization
Reduce inflectional/variant forms to base form
E.g.,
am, are, is → be
car, cars, car's, cars' → car
Sec. 2.2.4
Stemming
Reduce terms to their "roots" before indexing
"Stemming" suggests crude affix chopping
language dependent
e.g., automate(s), automatic, automation all reduced to automat.
for example compressed and compression are both accepted as equivalent to compress.
Sec. 2.2.4
Porter's algorithm
Commonest algorithm for stemming English
Results suggest it's at least as good as other stemming options
Sec. 2.2.4
Typical rules in Porter
sses → ss
ies → i
ational → ate
tional → tion
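The four suffix rules above can be sketched as a single rewrite pass that applies the rule with the longest matching suffix. This is a deliberate simplification and not the full Porter algorithm: it ignores Porter's multiple phases and its conditions on the remaining stem. The names `RULES` and `apply_longest_suffix_rule` are illustrative.

```python
# The four example rules only; the real algorithm has many more, in several phases.
RULES = [("sses", "ss"), ("ies", "i"), ("ational", "ate"), ("tional", "tion")]

def apply_longest_suffix_rule(word, rules=RULES):
    """Rewrite the longest matching suffix, or return the word unchanged."""
    for suffix, replacement in sorted(rules, key=lambda r: len(r[0]), reverse=True):
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word

for w in ["caresses", "ponies", "relational", "conditional"]:
    print(w, "->", apply_longest_suffix_rule(w))
# prints:
# caresses -> caress
# ponies -> poni
# relational -> relate
# conditional -> condition
```

Trying longer suffixes first matters: "relational" ends in both "ational" and "tional", and only the longer rule gives "relate".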
Sec. 2.2.4
Other stemmers
Other stemmers exist, e.g., Lovins stemmer
http://www.comp.lancs.ac.uk/computing/research/stemming/general/lovins.htm
Sec. 2.2.4
Language-specificity
Many of the above features embody
transformations that are
Language-specific and
Often, application-specific
Sec. 2.2
MIT.english
mit.german
guaranteed.english
entries.english
sometimes.english
tokenization.english
These may be grouped by language (or not).
More on this in ranking/query processing.
Sec. 2.3
[Figure: postings lists for the basic merge. Brutus → ..., 41, 48, 64, 128; Caesar → ..., 11, 17, 21, 31.]
Sec. 2.3
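[Figure: postings lists ..., 41, 48, 64, 128 and ..., 11, 17, 21, 31 augmented with skip pointers (skip targets 41 and 128 in the first list; 11 and 31 in the second).]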
Why?
To skip postings that will not figure in the
search results.
How?
Where do we place skip pointers?
Sec. 2.3
Placing skips
Simple heuristic: for postings of length L, use √L evenly-spaced skip pointers.
This ignores the distribution of query terms.
Easy if the index is relatively static; harder if L keeps changing because of updates.
This definitely used to help; with modern hardware it may not (Bahle et al. 2002) unless you're memory-based.
The I/O cost of loading a bigger postings list can outweigh the gains from quicker in-memory merging!
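A sketch of intersecting two sorted postings lists with skip pointers placed by the square-root heuristic. It assumes in-memory Python lists and represents skips as a dict from list index to target index; the names `build_skips`, `advance`, and `intersect_with_skips` are illustrative, and the example lists only loosely follow the Brutus/Caesar figure.

```python
import math

def build_skips(postings):
    """Place roughly sqrt(L) evenly spaced skips: index -> target index."""
    n = len(postings)
    step = math.isqrt(n)
    if step < 2:
        return {}  # list too short for skips to help
    return {i: i + step for i in range(0, n - step, step)}

def advance(postings, skips, i, target):
    """Move past postings[i] (known to be < target), following skips when safe."""
    if i in skips and postings[skips[i]] <= target:
        while i in skips and postings[skips[i]] <= target:
            i = skips[i]
        return i
    return i + 1

def intersect_with_skips(p1, p2):
    """Return the docIDs present in both sorted postings lists."""
    s1, s2 = build_skips(p1), build_skips(p2)
    i = j = 0
    answer = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i = advance(p1, s1, i, p2[j])
        else:
            j = advance(p2, s2, j, p1[i])
    return answer

brutus = [2, 4, 8, 41, 48, 64, 128]
caesar = [1, 2, 3, 8, 11, 17, 21, 31]
print(intersect_with_skips(brutus, caesar))
# prints [2, 8]
```

A skip is taken only when its target is still less than or equal to the docID on the other list, so no matching posting can be jumped over.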
Sec. 2.4
Phrase queries
Want to be able to answer queries such as "stanford university" as a phrase
Thus the sentence "I went to university at Stanford" is not a match.
The concept of phrase queries has proven easily understood by users; one of the few "advanced search" ideas that works
Many more queries are implicit phrase queries
Sec. 2.4.1
Extended biwords
Parse the indexed text and perform part-of-speech tagging (POST).
Bucket the terms into (say) Nouns (N) and articles/prepositions (X).
Call any string of terms of the form NX*N an extended biword.
Each such extended biword is now made a term in the dictionary.
Example: catcher (N) in (X) the (X) rye (N)
Query processing: parse it into N's and X's
Segment query into extended biwords
Look up in index: catcher rye
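A sketch of extracting extended biwords from text that has already been tagged. It assumes the tagger output is a list of (term, tag) pairs with tags "N", "X", or anything else, and it emits each NX*N match as the two nouns joined by a space, following the "catcher rye" lookup above. The function name is illustrative.

```python
def extended_biwords(tagged):
    """Return "noun noun" strings for every N X* N pattern in a tagged sequence."""
    biwords = []
    last_noun = None
    for term, tag in tagged:
        if tag == "N":
            if last_noun is not None:
                biwords.append(f"{last_noun} {term}")
            last_noun = term  # this noun may start the next pattern
        elif tag != "X":
            last_noun = None  # any other tag breaks the N X* N pattern
    return biwords

print(extended_biwords([("catcher", "N"), ("in", "X"), ("the", "X"), ("rye", "N")]))
# prints ['catcher rye']
```

The same function can segment a tagged query, and each resulting string is then looked up as a dictionary term.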
Sec. 2.4.2
Proximity queries
LIMIT! /3 STATUTE /3 FEDERAL /2 TORT
Again, here, /k means "within k words of".
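A sketch of answering a two-term /k query from a positional index. It assumes a toy in-memory index of the form term → {docID: sorted positions}, built by whitespace tokenization; `build_positional_index` and `within_k` are illustrative names, and the two documents are made up for the example.

```python
from collections import defaultdict

def build_positional_index(docs):
    """Build term -> {doc_id: [positions]} from {doc_id: text} (toy tokenization)."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, tok in enumerate(text.lower().split()):
            index[tok].setdefault(doc_id, []).append(pos)
    return index

def within_k(index, t1, t2, k):
    """Return sorted doc_ids where t1 and t2 occur within k words of each other."""
    hits = []
    common = sorted(set(index.get(t1, {})) & set(index.get(t2, {})))
    for doc_id in common:
        pos1, pos2 = index[t1][doc_id], index[t2][doc_id]
        i = j = 0
        while i < len(pos1) and j < len(pos2):
            if abs(pos1[i] - pos2[j]) <= k:
                hits.append(doc_id)
                break
            # Advance whichever position is smaller to close the gap.
            if pos1[i] < pos2[j]:
                i += 1
            else:
                j += 1
    return hits

docs = {
    1: "limit of the federal tort statute",
    2: "statute text here and much later some words about tort",
}
index = build_positional_index(docs)
print(within_k(index, "statute", "tort", 2))
# prints [1]
```

A longer query such as the one above would chain several of these pairwise checks.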
[Table: Postings vs. Positional postings for document sizes 1000 and 100,000; the only cell value that survives is 100.]
Sec. 2.4.2
Rules of thumb
A positional index is 2-4 times as large as a non-positional index
Positional index size is 35-50% of the volume of the original text
Caveat: all of this holds for "English-like" languages
Sec. 2.4.3
Combination schemes
These two approaches can be profitably
combined
For particular phrases ("Michael Jackson", "Britney Spears") it is inefficient to keep on merging positional postings lists
Even more so for phrases like "The Who"
http://www.seg.rmit.edu.au/research/research.php?author=4