06 Indexing 2
Indexing (2)
Instructor:
Walid Magdy
07-Oct-2020
Walid Magdy, TTDS 2020/2021
Lecture Objectives
• Learn more about indexing:
• Structured documents
• Extent index
• Index compression
• Data structure
• Wild-card search and applications
* You are not asked to implement any of the content in this lecture, but you
might think of using some for your course project ☺
10/6/2020
Structured Documents
• Documents are not always flat:
• Meta-data: title, author, time-stamp
• Structure: headline, section, body
• Tags: link, hashtag, mention
Extent Index
• Special “term” for each element/field/tag
• Index all terms in a structured document as plain text
• Terms in a given field/tag get special additional entry
• Posting: the span (window of positions) covered by a given field
• Allows multiple overlapping spans of different types
Using Extent
• Doc 1:
Headline: “Information retrieval lecture” (positions 1 2 3)
Text: “this is lecture 6 of the TTDS course on IR” (positions 4 5 6 7 8 and onward)
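A minimal sketch of how the example document above could be indexed with extents. It assumes 1-based token positions, lowercased whitespace tokenization, and extents stored as (doc, start, end) spans; the names `build_extent_index` and `term_in_field` are hypothetical, not from the lecture.

```python
from collections import defaultdict

def build_extent_index(docs):
    """docs: {doc_id: [(field_name, text), ...]} with fields in document order."""
    postings = defaultdict(list)  # term  -> [(doc_id, position), ...]
    extents = defaultdict(list)   # field -> [(doc_id, start, end), ...]
    for doc_id, fields in docs.items():
        pos = 0
        for field, text in fields:
            tokens = text.lower().split()
            if not tokens:
                continue
            start = pos + 1
            for tok in tokens:
                pos += 1
                # Every term is indexed as plain text, whatever its field.
                postings[tok].append((doc_id, pos))
            # The field gets one extra entry: the span it covers.
            extents[field].append((doc_id, start, pos))
    return postings, extents

def term_in_field(term, field, postings, extents):
    """Return the doc IDs where `term` occurs inside an extent of `field`."""
    hits = set()
    for doc_id, pos in postings.get(term, []):
        for ext_doc, start, end in extents.get(field, []):
            if ext_doc == doc_id and start <= pos <= end:
                hits.add(doc_id)
    return hits

docs = {1: [("headline", "Information retrieval lecture"),
            ("text", "this is lecture 6 of the TTDS course on IR")]}
postings, extents = build_extent_index(docs)
```

With this layout, a query restricted to the headline only needs the ordinary postings of the term plus the headline spans, for example `term_in_field("retrieval", "headline", postings, extents)`.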
Index Compression
• Inverted indices are big
• Large disk space → large I/O operations
• Index compression
• Reduce space → less I/O
• Allow more chunks of index to be cached in memory
• Large size goes to:
• terms? document numbers?
• Ideas:
• Compress document numbers, how?
Delta Encoding
• Large collections → large sequence of doc IDs
• e.g. Doc IDs: 1, 2, 3, … 66,032, ……, 5,323,424,235
• Large ID number → more bytes to store
• 1 byte: 0 → 255
• 2 bytes: 0 → 65,535
• 4 bytes: 0 → 4.3 B
• e.g. ID 66,032 exceeds 65,535, so it needs 3 bytes
• Idea: delta in ID instead of full ID
• Very useful, especially for frequent terms
• Example posting list for a term, stored as deltas: 5, 1, 3, 7, 321, 15, 2
(every delta fits in 1 byte except 321, which needs 2 bytes)
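The idea can be sketched in a few lines. The doc IDs below are an assumption: they are reconstructed from the deltas in the example, taking the first delta as an offset from 0. The helper names are hypothetical.

```python
def delta_encode(doc_ids):
    """Turn a sorted list of doc IDs into gaps between consecutive IDs."""
    deltas, prev = [], 0
    for doc_id in doc_ids:
        deltas.append(doc_id - prev)
        prev = doc_id
    return deltas

def delta_decode(deltas):
    """Rebuild the doc IDs by accumulating the gaps."""
    doc_ids, total = [], 0
    for gap in deltas:
        total += gap
        doc_ids.append(total)
    return doc_ids

def bytes_needed(n):
    """Smallest number of whole bytes that can hold the non-negative integer n."""
    return max(1, (n.bit_length() + 7) // 8)

doc_ids = [5, 6, 9, 16, 337, 352, 354]   # assumed IDs matching the example deltas
deltas = delta_encode(doc_ids)
sizes = [bytes_needed(d) for d in deltas]
```

Frequent terms appear in many documents, so their gaps are small and most of them fit in a single byte, which is where the saving comes from.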
v-byte Encoding
• Use a variable number of bytes to store each delta in the index
• Use fewer bits to encode
• High bit in a byte → 1/0 = terminate/continue
• Remaining 7 bits → binary number
• Examples:
• “6” → 10000110
• “127” → 11111111
• “128” → 00000001 10000000 → payload 0000001 0000000
• Real example sequence:
10000101 00000001 10000010 10000111
0000101 → 0000001 0000010 → 0000111
5 → 130 → 7
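A minimal sketch of v-byte encoding and decoding that follows the convention on this slide: 7 payload bits per byte, most significant group first, and a high bit of 1 on the terminating byte. The function names are hypothetical.

```python
def vbyte_encode(n):
    """Encode one non-negative integer as v-byte."""
    chunks = [n & 0x7F]           # lowest 7 bits
    n >>= 7
    while n:
        chunks.append(n & 0x7F)   # next 7 bits
        n >>= 7
    chunks.reverse()              # most significant 7-bit group first
    chunks[-1] |= 0x80            # high bit 1 marks the terminating byte
    return bytes(chunks)

def vbyte_decode(data):
    """Decode a byte stream holding any number of v-byte integers."""
    numbers, current = [], 0
    for byte in data:
        current = (current << 7) | (byte & 0x7F)
        if byte & 0x80:           # terminating byte: emit the number
            numbers.append(current)
            current = 0
    return numbers

stream = b"".join(vbyte_encode(n) for n in [5, 130, 7])
bits = " ".join(format(b, "08b") for b in stream)
```

Here `bits` reproduces the example sequence byte by byte, and `vbyte_decode(stream)` recovers the three deltas.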
Index Compression
• There are more sophisticated compression algorithms:
• Elias gamma code
• The more compression
• Less storage
• More processing
• In general
• Less I/O + more processing > more I/O + no processing (where “>” means “faster than”)
• With new data structures, the problem is less severe
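For reference, a small sketch of the Elias gamma code named above. It assumes the common convention in which a positive integer n is written as floor(log2 n) zeros followed by the binary form of n; some textbooks use a different but equivalent bit convention. Bits are kept in a Python string for readability rather than packed, and the function names are hypothetical.

```python
def elias_gamma_encode(n):
    """Encode a positive integer as a string of '0'/'1' characters."""
    if n < 1:
        raise ValueError("Elias gamma is defined for integers >= 1")
    binary = bin(n)[2:]                       # binary form, starts with '1'
    return "0" * (len(binary) - 1) + binary   # length prefix in unary zeros

def elias_gamma_decode(bits):
    """Decode a concatenation of Elias gamma codes."""
    numbers, i = [], 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":                 # count the unary length prefix
            zeros += 1
            i += 1
        numbers.append(int(bits[i:i + zeros + 1], 2))
        i += zeros + 1
    return numbers

encoded = "".join(elias_gamma_encode(n) for n in [1, 5, 9])
```

Because the code works at the bit level, small gaps cost only a few bits, at the price of more decoding work than a byte-aligned scheme such as v-byte.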
Hashes
• Each vocabulary term is hashed to an integer
• Pros
• Lookup is faster than for a tree: O(1)
• Cons
• No easy way to find minor variants:
• judgment/judgement
• No prefix search
• If vocabulary keeps growing, need to occasionally do the
expensive operation of rehashing everything
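A small illustration of these trade-offs, using a Python `dict` as a stand-in for a hashed lexicon. The terms and posting lists are made-up toy data, and the function names are hypothetical.

```python
# Toy lexicon: term -> posting list (doc IDs are invented for illustration).
lexicon = {
    "judgment": [3, 17],
    "judgement": [8],
    "monday": [2, 5],
    "month": [5, 9],
}

def lookup(term):
    """Exact-match lookup: a single hash probe on average."""
    return lexicon.get(term, [])

def prefix_scan(prefix):
    """A hash gives no ordering, so a prefix query must scan every key."""
    return sorted(t for t in lexicon if t.startswith(prefix))
```

The two spellings `judgment` and `judgement` hash to unrelated slots, so finding one tells the lexicon nothing about the other, and `prefix_scan` costs time proportional to the whole vocabulary.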
Trees: B-tree
[Figure: B-tree node splitting the lexicon into the ranges a-hu, hy-m, n-z]
Trees
• Pros?
• Solves the prefix problem (terms starting with “ab”)
• Cons?
• Slower: O(log M) for a vocabulary of M terms [and this requires a balanced tree]
• Rebalancing binary trees is expensive
• But B-trees mitigate the rebalancing problem
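A sketch of why an ordered lexicon solves the prefix problem. A sorted list with binary search is used here as a simple stand-in for a balanced tree or B-tree: both keep terms in order, so all terms with a given prefix form one contiguous range. The vocabulary is toy data, `prefix_range` is a hypothetical name, and the upper bound assumes no term contains the character U+FFFF.

```python
import bisect

terms = sorted(["abacus", "abandon", "able", "about", "access", "hybrid", "zebra"])

def prefix_range(sorted_terms, prefix):
    """Return all terms starting with `prefix` using two O(log M) searches."""
    lo = bisect.bisect_left(sorted_terms, prefix)
    # prefix + U+FFFF sorts after every term that starts with the prefix.
    hi = bisect.bisect_left(sorted_terms, prefix + "\uffff")
    return sorted_terms[lo:hi]

matches = prefix_range(terms, "ab")
```

Unlike the sorted list, a B-tree also supports cheap insertion of new terms, which matters when the vocabulary keeps growing.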
Wild-Card Queries: *
• mon*: find all docs containing any word beginning with “mon”.
• Easy with binary tree (or B-tree) lexicon
• *mon: find words ending in “mon”: harder
• Maintain an additional B-tree for terms written backwards.
• How can we enumerate all terms meeting the wild-card query pro*cent ?
• Query processing: se*ate AND fil*er ?
• Expensive
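One way to answer these queries, sketched with sorted lists standing in for the forward and backward B-trees. For pro*cent it intersects the terms that start with "pro" with the terms that end with "cent"; this is one possible strategy under those assumptions, not necessarily the one intended in the lecture. The vocabulary is toy data and all function names are hypothetical.

```python
import bisect

def prefix_range(sorted_terms, prefix):
    """All terms in a sorted list that start with `prefix`."""
    lo = bisect.bisect_left(sorted_terms, prefix)
    hi = bisect.bisect_left(sorted_terms, prefix + "\uffff")
    return sorted_terms[lo:hi]

vocab = ["common", "demon", "monday", "money", "month",
         "percent", "procent", "proficient", "prominent"]
forward = sorted(vocab)                      # stand-in for the normal B-tree
backward = sorted(t[::-1] for t in vocab)    # stand-in for the backward B-tree

def starts_with(prefix):
    return set(prefix_range(forward, prefix))

def ends_with(suffix):
    # Look up the reversed suffix as a prefix of the reversed terms.
    return {t[::-1] for t in prefix_range(backward, suffix[::-1])}

def infix_wildcard(prefix, suffix):
    """Terms matching prefix*suffix; the length check stops the two parts overlapping."""
    both = starts_with(prefix) & ends_with(suffix)
    return {t for t in both if len(t) >= len(prefix) + len(suffix)}
```

The intersection step is what makes queries such as se*ate AND fil*er expensive: each wild-card can expand to many terms, and every expanded term then needs its own posting-list lookup.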
Permuterm Indexes
• Transform wild-card queries so that the * occurs at the end
• For term hello, index under:
• hello$, ello$h, llo$he, lo$hel, o$hell, $hello
where $ is a special symbol.
• Rotate query wild-card to the end
• Queries:
• X: lookup on X$ (e.g. hello → hello$)
• X*: lookup on $X* (e.g. hel* → hel*$ → $hel*)
• *X: lookup on X$* (e.g. *llo → *llo$ → llo$*)
• X*Y: lookup on Y$X* (e.g. he*lo → he*lo$ → lo$he*)
• Index Size?
Wild-card query → Find possible terms → Filter unmatching terms → Search collection for all terms → Documents
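A sketch of this pipeline under one assumption: that the "find possible terms" step uses an index of character bigrams, as the summary of this lecture suggests. The vocabulary, the posting lists, and every function name are made up for illustration.

```python
import re
from collections import defaultdict

def bigrams(term):
    padded = "$" + term + "$"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}

def build_bigram_index(vocab):
    index = defaultdict(set)          # bigram -> terms containing it
    for term in vocab:
        for gram in bigrams(term):
            index[gram].add(term)
    return index

def query_bigrams(pattern):
    """Bigrams every match must contain; pieces around * are handled separately."""
    grams = set()
    for piece in ("$" + pattern + "$").split("*"):
        grams.update(piece[i:i + 2] for i in range(len(piece) - 1))
    return grams

def candidate_terms(pattern, gram_index, vocab):
    """Step 1: find possible terms by intersecting bigram postings."""
    candidates = set(vocab)
    for gram in query_bigrams(pattern):
        candidates &= gram_index.get(gram, set())
    return candidates

def matching_terms(pattern, gram_index, vocab):
    """Step 2: filter out candidates that do not really match the pattern."""
    regex = re.compile("^" + ".*".join(map(re.escape, pattern.split("*"))) + "$")
    return sorted(t for t in candidate_terms(pattern, gram_index, vocab) if regex.match(t))

def search(pattern, gram_index, vocab, term_index):
    """Step 3: search the collection for all surviving terms."""
    docs = set()
    for term in matching_terms(pattern, gram_index, vocab):
        docs |= set(term_index.get(term, []))
    return sorted(docs)

vocab = ["moon", "month", "money", "demon"]
term_index = {"moon": [1], "month": [2, 3], "money": [3], "demon": [4]}
gram_index = build_bigram_index(vocab)
```

The filtering step is needed because bigram matching alone yields false positives: for mon*, the term "moon" contains $m, mo and on but does not start with "mon".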
Document: Elepbant → $e el le ep pb ba an nt t$
Query: Elephant → $e el le ep ph ha an nt t$
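The two bigram lists above can be reproduced and compared with a few lines. The Jaccard overlap used here is an assumption chosen for illustration; the slide itself does not name a similarity measure. Words are lowercased and padded with $, and the function names are hypothetical.

```python
def char_bigrams(word):
    """Character bigrams of a lowercased word padded with '$' on both sides."""
    padded = "$" + word.lower() + "$"
    return [padded[i:i + 2] for i in range(len(padded) - 1)]

def jaccard(word_a, word_b):
    """Share of distinct bigrams the two words have in common."""
    a, b = set(char_bigrams(word_a)), set(char_bigrams(word_b))
    return len(a & b) / len(a | b)

doc_grams = char_bigrams("Elepbant")
query_grams = char_bigrams("Elephant")
shared = set(doc_grams) & set(query_grams)
```

A single wrong character changes only the two bigrams that contain it, so the misspelt document term still shares 7 of its 9 bigrams with the query term, whereas a word-level index would see no match at all.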
Summary
• Index can be multilayer
• Extent index (multi-terms in one position in document)
• Index does not have to be formed of words
• Character n-grams representation of words
• Two indexes are sometimes used
• Index of character n-grams to find matching words
• Index of terms to search for matched words
Resources
• Textbook 1: Intro to IR, Chapter 3.1 – 3.4
• Textbook 2: IR in Practice, Chapter 5