Module 4-Boolean Retrieval Models-Edit Distance, Spelling Correction
Overview
❶ Recap
❷ Dictionaries
❸ Wildcard queries
❹ Edit distance
❺ Spelling correction
❻ Soundex
Introduction to Information Retrieval
Type/token distinction
Problems in tokenization
Skip pointers
Positional indexes
Postings lists in a nonpositional index: each posting is just a docID
Postings lists in a positional index: each posting is a docID and a list of positions
Example query: “to₁ be₂ or₃ not₄ to₅ be₆” (subscripts mark word positions in the query)
TO, 993427:
‹ 1: ‹7, 18, 33, 72, 86, 231›;
2: ‹1, 17, 74, 222, 255›;
4: ‹8, 16, 190, 429, 433›;
5: ‹363, 367›;
7: ‹13, 23, 191›; . . . ›
BE, 178239:
‹ 1: ‹17, 25›;
4: ‹17, 191, 291, 430, 434›;
5: ‹14, 19, 101›; . . . ›
Document 4 is a match! (TO at positions 429 and 433, BE at positions 430 and 434)
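The position check on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the intersection algorithm from the book: the dictionary layout and the helper name `match_offsets` are assumptions made for the example, and only the TO and BE postings shown above are used, since the postings for OR and NOT are not given.

```python
# Postings copied from the slide (truncated there with ". . ."):
# term -> {docID: sorted list of positions}.
POSITIONAL_INDEX = {
    "to": {
        1: [7, 18, 33, 72, 86, 231],
        2: [1, 17, 74, 222, 255],
        4: [8, 16, 190, 429, 433],
        5: [363, 367],
        7: [13, 23, 191],
    },
    "be": {
        1: [17, 25],
        4: [17, 191, 291, 430, 434],
        5: [14, 19, 101],
    },
}


def match_offsets(index, pattern):
    """Return {docID: [start positions]} such that every (term, offset)
    pair in `pattern` occurs at position start + offset in that document."""
    first_term, first_offset = pattern[0]
    hits = {}
    for doc_id, positions in index[first_term].items():
        starts = []
        for pos in positions:
            start = pos - first_offset
            if all(
                start + offset in index[term].get(doc_id, ())
                for term, offset in pattern
            ):
                starts.append(start)
        if starts:
            hits[doc_id] = starts
    return hits


# "to be or not to be": TO at offsets 0 and 4, BE at offsets 1 and 5.
pattern = [("to", 0), ("be", 1), ("to", 4), ("be", 5)]
result = match_offsets(POSITIONAL_INDEX, pattern)
print(result)  # {4: [429]}
```

A real system would walk the sorted position lists in parallel rather than test membership one position at a time; the membership test is used here only to keep the sketch short.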
Take-away
Inverted index
Dictionaries
Hashes
Each vocabulary term is hashed into an integer.
Try to avoid collisions
At query time, do the following: hash query term, resolve
collisions, locate entry in fixed-width array
Pros: Lookup in a hash is faster than lookup in a tree.
Lookup time is constant.
Cons
no way to find minor variants (resume vs. résumé)
no prefix search (all terms starting with automat)
need to rehash everything periodically if vocabulary keeps
growing
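The trade-off above can be seen with Python's built-in `dict`, which is a hash table. This is only a toy illustration: the terms and docIDs are invented for the example.

```python
# Toy dictionary: term -> postings list (docIDs are made up for illustration).
dictionary = {
    "automat": [3],
    "automate": [1, 4],
    "automatic": [2, 4, 9],
    "resume": [5],
    "résumé": [7],
}

# Exact lookup is a single hash probe (expected constant time).
print(dictionary.get("automatic"))  # [2, 4, 9]

# Minor variants are unrelated keys, so finding one does not lead to the other.
print(dictionary.get("resume") == dictionary.get("résumé"))  # False

# Prefix search has no shortcut: every key has to be inspected.
prefix_hits = sorted(t for t in dictionary if t.startswith("automat"))
print(prefix_hits)  # ['automat', 'automate', 'automatic']
```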
Trees
Trees solve the prefix problem (find all terms starting with
automat).
Simplest tree: binary tree
Search is slightly slower than in hashes: O(log M), where M is
the size of the vocabulary.
O(log M) only holds for balanced trees.
Rebalancing binary trees is expensive.
B-trees mitigate the rebalancing problem.
B-tree definition: every internal node has a number of
children in the interval [a, b] where a, b are appropriate
positive integers, e.g., [2, 4].
Binary tree
B-tree
Wildcard queries
mon*: find all docs containing any term beginning with mon
Easy with B-tree dictionary: retrieve all terms t in the range:
mon ≤ t < moo
*mon: find all docs containing any term ending with mon
Maintain an additional tree for terms backwards
Then retrieve all terms t in the range: nom ≤ t < non
Result: A set of terms that are matches for wildcard query
Then retrieve documents that contain any of these terms
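The two range lookups can be sketched with a sorted list and binary search standing in for the B-tree. The vocabulary and the helper name `prefix_range` are assumptions made for this example, and the prefix is assumed to be non-empty.

```python
from bisect import bisect_left


def prefix_range(sorted_terms, prefix):
    """Return all terms t with prefix <= t < upper, where upper is the
    prefix with its last character incremented (mon -> moo)."""
    upper = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    lo = bisect_left(sorted_terms, prefix)
    hi = bisect_left(sorted_terms, upper)
    return sorted_terms[lo:hi]


vocabulary = ["demon", "lemon", "monday", "money", "month", "moon", "salmon"]
forward = sorted(vocabulary)                     # tree over the terms
backward = sorted(t[::-1] for t in vocabulary)   # tree over the reversed terms

# mon*: mon <= t < moo in the forward structure.
print(prefix_range(forward, "mon"))  # ['monday', 'money', 'month']

# *mon: nom <= t < non in the backward structure, then reverse each hit.
suffix_hits = sorted(t[::-1] for t in prefix_range(backward, "nom"))
print(suffix_hits)  # ['demon', 'lemon', 'salmon']
```

The resulting term set would then be used to fetch and merge the postings lists of all matching terms.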
Example: m*nchen
We could look up m* and *nchen in the B-tree and intersect
the two term sets.
Expensive
Alternative: permuterm index
Basic idea: Rotate every wildcard query so that the * occurs
at the end.
Store every rotation of each term in the dictionary, say, in a B-tree,
so that the rotated query becomes a prefix lookup
Permuterm index
For term HELLO: add hello$, ello$h, llo$he, lo$hel, o$hell, and $hello
to the B-tree, where $ is a special end-of-term symbol
Permuterm index
For HELLO, we’ve stored: hello$, ello$h, llo$he, lo$hel,
o$hell, and $hello
Queries
For X, look up X$
For X*, look up $X*
For *X, look up X$*
For *X*, look up X*
For X*Y, look up Y$X*
Example: For hel*o, look up o$hel*
Permuterm index would better be called a permuterm tree.
But permuterm index is the more common name.
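The rotation scheme can be sketched as follows, with a sorted list of (rotation, term) pairs standing in for the B-tree. The function names and the small vocabulary are assumptions made for this example. The sketch handles queries with a single * plus the *X* form, and assumes terms contain neither $ nor *.

```python
from bisect import bisect_left


def rotations(term):
    """All rotations of term + '$': hello$, ello$h, ..., $hello."""
    marked = term + "$"
    return [marked[i:] + marked[:i] for i in range(len(marked))]


def build_permuterm(vocabulary):
    """Sorted (rotation, original term) pairs; a stand-in for the B-tree."""
    return sorted((rot, term) for term in vocabulary for rot in rotations(term))


def rotate_query(query):
    """Rotate the query so that the '*' (if any) ends up at the end."""
    if "*" not in query:
        return query + "$"                 # X   -> X$
    if query.startswith("*") and query.endswith("*"):
        return query[1:]                   # *X* -> X*
    head, tail = query.split("*", 1)
    return tail + "$" + head + "*"         # X*Y -> Y$X*


def lookup(index, query):
    key = rotate_query(query)
    exact = not key.endswith("*")
    prefix = key if exact else key[:-1]
    i = bisect_left(index, (prefix,))      # first rotation >= prefix
    hits = set()
    while i < len(index) and index[i][0].startswith(prefix):
        rot, term = index[i]
        if not exact or rot == prefix:
            hits.add(term)
        i += 1
    return sorted(hits)


index = build_permuterm(["hello", "help", "helio", "halo", "jello"])
print(rotations("hello"))
# ['hello$', 'ello$h', 'llo$he', 'lo$hel', 'o$hell', '$hello']
print(lookup(index, "hel*o"))  # ['helio', 'hello']
print(lookup(index, "hel*"))   # ['helio', 'hello', 'help']
print(lookup(index, "*llo"))   # ['hello', 'jello']
print(lookup(index, "*el*"))   # ['helio', 'hello', 'help', 'jello']
```

Storing every rotation multiplies the dictionary size by roughly the average term length plus one, which is the main cost of this approach.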
k-gram indexes
Exercise
Google has very limited support for wildcard queries.
For example, this query doesn’t work very well on Google:
[gen* universit*]
Intention: you are looking for the University of Geneva, but
don’t know which accents to use for the French words for
university and Geneva.
According to Google search basics, 2010-04-29: “Note that
the * operator works only on whole words, not parts of
words.”
But this is not entirely true. Try [pythag*] and [m*nchen]
Exercise: Why doesn’t Google fully support wildcard queries?
Spelling correction
Two principal uses
Correcting documents being indexed
Correcting user queries
Two different methods for spelling correction
Isolated word spelling correction
Check each word on its own for misspelling
Will not catch typos resulting in correctly spelled words, e.g., an
asteroid that fell form the sky
Context-sensitive spelling correction
Look at surrounding words
Can correct form/from error above
Correcting documents
Correcting queries
First: isolated word spelling correction
Premise 1: There is a list of “correct words” from which the
correct spellings come.
Premise 2: We have a way of computing the distance
between a misspelled word and a correct word.
Simple spelling correction algorithm: return the “correct”
word that has the smallest distance to the misspelled word.
Example: informaton → information
For the list of correct words, we can use the vocabulary of all
words that occur in our collection.
Why is this problematic?
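The simple algorithm can be sketched as a minimum over the word list. The slide leaves the distance open; this sketch assumes Levenshtein edit distance as one common choice, and the word list, the function names, and the alphabetical tie-breaking rule are all assumptions made for the example.

```python
def levenshtein(s1, s2):
    """Edit distance with insert, delete, and replace, each at cost 1."""
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]
        for j, c2 in enumerate(s2, start=1):
            current.append(min(
                previous[j] + 1,               # delete c1
                current[j - 1] + 1,            # insert c2
                previous[j - 1] + (c1 != c2),  # replace, or copy at cost 0
            ))
        previous = current
    return previous[-1]


def correct(word, correct_words, distance=levenshtein):
    """Return the listed word closest to `word`.
    Ties go to the alphabetically first candidate (an arbitrary choice)."""
    return min(sorted(correct_words), key=lambda w: distance(word, w))


correct_words = ["information", "informal", "formation", "retrieval"]
print(correct("informaton", correct_words))  # information
```

This compares the query word against every entry, which is too slow for a large vocabulary, so practical systems first narrow down the candidate set.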
Edit distance
The edit distance between string s1 and string s2 is the
minimum number of basic operations that convert s1 to s2.
Levenshtein distance: The admissible basic operations are
insert, delete, and replace
Levenshtein distance dog-do: 1
Levenshtein distance cat-cart: 1
Levenshtein distance cat-cut: 1
Levenshtein distance cat-act: 2
Damerau-Levenshtein distance cat-act: 1
Damerau-Levenshtein includes transposition as a fourth
possible operation.
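The five distances above can be checked with a short dynamic-programming sketch. The function name and the `transpositions` flag are assumptions made for this example. With the flag set, the code computes the restricted "optimal string alignment" form of Damerau-Levenshtein, which is enough for the cat-act case.

```python
def edit_distance(s1, s2, transpositions=False):
    """Levenshtein distance between s1 and s2. With transpositions=True,
    swapping two adjacent characters also counts as one operation."""
    m, n = len(s1), len(s2)
    # d[i][j] = distance between the first i chars of s1 and first j of s2.
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                      # delete all i characters
    for j in range(n + 1):
        d[0][j] = j                      # insert all j characters
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # delete
                d[i][j - 1] + 1,         # insert
                d[i - 1][j - 1] + cost,  # replace or copy
            )
            if (transpositions and i > 1 and j > 1
                    and s1[i - 1] == s2[j - 2] and s1[i - 2] == s2[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transpose
    return d[m][n]


print(edit_distance("dog", "do"))                        # 1
print(edit_distance("cat", "cart"))                      # 1
print(edit_distance("cat", "cut"))                       # 1
print(edit_distance("cat", "act"))                       # 2
print(edit_distance("cat", "act", transpositions=True))  # 1
```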
Exercise
How do I read out the editing operations that transform OSLO into SNOW?
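One way to answer this is to fill the Levenshtein matrix and then walk back from the bottom-right cell to the top-left, recording at each step which neighbor produced the cell's value. The sketch below is one possible backtrace, not necessarily the one shown in the lecture: the function name `edit_script` and the preference order (diagonal, then delete, then insert) are assumptions, and other orders can yield different but equally cheap scripts.

```python
def edit_script(s1, s2):
    """Return (distance, operations) turning s1 into s2 using
    insert, delete, replace (cost 1 each) and copy (cost 0)."""
    m, n = len(s1), len(s2)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(
                d[i - 1][j] + 1,
                d[i][j - 1] + 1,
                d[i - 1][j - 1] + (s1[i - 1] != s2[j - 1]),
            )

    # Walk back from (m, n) to (0, 0), preferring the diagonal move.
    ops = []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (s1[i - 1] != s2[j - 1]):
            if s1[i - 1] == s2[j - 1]:
                ops.append(f"copy {s1[i - 1]}")
            else:
                ops.append(f"replace {s1[i - 1]} by {s2[j - 1]}")
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            ops.append(f"delete {s1[i - 1]}")
            i -= 1
        else:
            ops.append(f"insert {s2[j - 1]}")
            j -= 1
    ops.reverse()
    return d[m][n], ops


distance, operations = edit_script("OSLO", "SNOW")
print(distance)    # 3
print(operations)
# ['delete O', 'copy S', 'replace L by N', 'copy O', 'insert W']
```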
Spelling correction
Soundex
Soundex algorithm
❶ Retain the first letter of the term.
❷ Change all occurrences of the following letters to ’0’ (zero): A, E, I,
O, U, H, W, Y
❸ Change letters to digits as follows:
B, F, P, V to 1
C, G, J, K, Q, S, X, Z to 2
D,T to 3
L to 4
M, N to 5
R to 6
❹ Repeatedly remove one out of each pair of consecutive identical
digits
❺ Remove all zeros from the resulting string; pad the resulting string
with trailing zeros and return the first four positions, which will
consist of a letter followed by three digits
Example: Soundex of HERMAN
❶ Retain H
❷ ERMAN → 0RM0N
❸ 0RM0N → 06505
❹ 06505 → 06505 (no consecutive identical digits)
❺ 06505 → 655
Return H655
Note: HERMANN will generate the same code
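The five steps can be sketched directly in Python. This follows the steps as listed on the slide, applying them to the letters after the first one, as in the HERMAN example. Other published Soundex variants differ in details such as how H and W are treated. The function name, the handling of non-letters (skipped), and the assumption of a non-empty ASCII term are choices made for this example.

```python
# Letter -> digit table from steps 2 and 3.
CODES = {}
for letters, digit in [
    ("AEIOUHWY", "0"),
    ("BFPV", "1"),
    ("CGJKQSXZ", "2"),
    ("DT", "3"),
    ("L", "4"),
    ("MN", "5"),
    ("R", "6"),
]:
    for letter in letters:
        CODES[letter] = digit


def soundex(term):
    """Four-character Soundex code: one letter followed by three digits."""
    term = term.upper()
    first, rest = term[0], term[1:]                     # step 1
    digits = [CODES[ch] for ch in rest if ch in CODES]  # steps 2 and 3
    collapsed = []
    for digit in digits:                                # step 4
        if not collapsed or collapsed[-1] != digit:
            collapsed.append(digit)
    kept = [digit for digit in collapsed if digit != "0"]  # step 5
    return (first + "".join(kept) + "000")[:4]          # pad, then truncate


print(soundex("HERMAN"))   # H655
print(soundex("HERMANN"))  # H655
```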
Exercise
Take-away
Resources
Chapter 3 of IIR
Resources at http://ifnlp.org/ir
Soundex demo
Levenshtein distance demo
Peter Norvig’s spelling corrector