Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                

06 Indexing 2

Download as pdf or txt
Download as pdf or txt
You are on page 1of 12

10/6/2020

Text Technologies for Data Science


INFR11145

Indexing (2)
Instructor:
Walid Magdy

07-Oct-2020
1
Walid Magdy, TTDS 2020/2021

Lecture Objectives
• Learn more about indexing:
• Structured documents
• Extent index
• Index compression
• Data structure
• Wild-char search and applications

* You are not asked to implement any of the content in this lecture, but you
might think of using some for your course project ☺

2
Walid Magdy, TTDS 2020/2021

1
10/6/2020

Structured Documents
• Document are not always flat:
• Meta-data: title, author, time-stamp
• Structure: headline, section, body
• Tags: link, hashtag, mention

• How to deal with it?


• Neglect!
• Create separate index for each field
• Use “extent index”

3
Walid Magdy, TTDS 2020/2021

Extent Index
• Special “term” for each element/field/tag
• Index all terms in a structured document as plain text
• Terms in a given field/tag get special additional entry
• Posting: spans of window related to a given field
• Allows multiple overlapping spans of different types

he 1,1 1,5 2,1 3,3 4,3 5,1


drink 1,8 2,4 2,6 2,8 3,6 4,5 5,6
ink 3,8 4,2 5,8 D1: He likes to wink, he likes to drink
pink 4,8 5,7 D2: He likes to drink, and drink, and drink
Link 3,1:2 4,1:4 5,7:8 D3: The thing he likes to drink is ink
D4: The ink he likes to drink is pink
D5: He likes to wink, and drink pink ink
4
Walid Magdy, TTDS 2020/2021

2
10/6/2020

Using Extent
• Doc: 1 → 1 2 3
Headline: “Information retrieval lecture”
Text: “this is lecture 6 of the TTDS course on IR”
4 5 6 7 8

• Query → Headline: lecture

lecture 1,3 1,4 2,9 3,7 3,11

Headline 1,1:3 2,1:5 3,1:4

5
Walid Magdy, TTDS 2020/2021

Index Compression
• Inverted indices are big
• Large disk space → large I/O operations
• Index compression
• Reduce space → less I/O
• Allow more chunks of index to be cached in memory
• Large size goes to:
• terms? document numbers?
• Ideas:
• Compress document numbers, how?

6
Walid Magdy, TTDS 2020/2021

3
10/6/2020

Delta Encoding
• Large collections → large sequence of doc IDs
• e.g. Doc IDs: 1, 2, 3, … 66,032, ……, 5,323,424,235
• Large ID number → more bytes to store
• 1 byte: 0→255
• 2 bytes: 0 → 65,535
• 4 bytes: 0 → 4.3 B
3 bytes
• Idea: delta in ID instead of full ID
• Very useful, especially for frequent terms

term 100002 100007 100008 100011 100019

term ? 5 1 3 7 321 15 2

1 byte 2 bytes
7
Walid Magdy, TTDS 2020/2021

v-byte Encoding
• Have different byte storage for each delta in index
• Use fewer bits to encode
• High bit in a byte → 1/0 = terminate/continue
• Remaining 7 bits → binary number
• Examples:
• “6” → 10000110
• “127” → 11111111
• “128” → 0000000110000000 → 00000010000000
• Real example sequence:
100001010000000011000001010000111
0000101 → 000000010000010 → 0000111
5 → 130 → 7
8
Walid Magdy, TTDS 2020/2021

4
10/6/2020

Index Compression
• There are more sophisticated compression
algorithms:
• Elias gamma code
• The more compression
• Less storage
• More processing
• In general
• Less I/O + more processing > more I/O + no processing
“>” = faster
• With new data structures, problem is less severe

9
Walid Magdy, TTDS 2020/2021

Dictionary Data Structures


• The dictionary data structure stores the term
vocabulary, document frequency, pointers to each
postings list …
• For small collections, load full dictionary in memory.
In real-life, cannot load all index to memory!
• Then what to load?
• How to reach quickly?
• What data structure to use for inverted index?

10
Walid Magdy, TTDS 2020/2021

10

5
10/6/2020

Hashes
• Each vocabulary term is hashed to an integer
• Pros
• Lookup is faster than for a tree: O(1)
• Cons
• No easy way to find minor variants:
• judgment/judgement
• No prefix search
• If vocabulary keeps growing, need to occasionally do the
expensive operation of rehashing everything

11
Walid Magdy, TTDS 2020/2021

11

Trees: Binary Search Tree


Root
a-m n-z

a-hu hy-m n-sh si-z

12
Walid Magdy, TTDS 2020/2021

12

6
10/6/2020

Trees: B-tree

n-z
a-hu
hy-m

Every internal node has a number of children in the


interval [a,b] where a, b are appropriate natural
numbers, e.g., [2,4].
13
Walid Magdy, TTDS 2020/2021

13

Trees
• Pros?
• Solves the prefix problem (terms starting with “ab”)
• Cons?
• Slower: O(log M) [and this requires balanced tree]
• Rebalancing binary trees is expensive
• But B-trees mitigate the rebalancing problem

14
Walid Magdy, TTDS 2020/2021

14

7
10/6/2020

Wild-Card Queries: *
• mon*: find all docs containing any word beginning
“mon”.
• Easy with binary tree (or B-tree) lexicon
• *mon: find words ending in “mon”: harder
• Maintain an additional B-tree for terms backwards.
• How can we enumerate all terms meeting the wild-
card query pro*cent ?
• Query processing: se*ate AND fil*er ?
• Expensive

15
Walid Magdy, TTDS 2020/2021

15

Permuterm Indexes
• Transform wild-card queries so that the * occurs at
the end
• For term hello, index under:
• hello$, ello$h, llo$he, lo$hel, o$hell, $hello
where $ is a special symbol.
• Rotate query wild-card to the end
• Queries:
• X lookup on X$ hello → hello$
• X* lookup on $X* hel* → hel*$ → $hel*
• *X lookup on X$* *llo → *llo$ → llo$*
• X*Y lookup on Y$X* he*lo → he*lo$ → lo$he*

• Index Size?
16
Walid Magdy, TTDS 2020/2021

16

8
10/6/2020

Character n-gram Indexes


• Enumerate all n-grams (sequence of n chars)
occurring in any term
• e.g., from text “April is the cruelest month” we get the 2-
grams (bigrams) →
$a,ap,pr,ri,il,l$,$i,is,s$,$t,th,he,e$,$c,cr,ru,ue,el,le,es,st,t$,
$m,mo,on,nt,h$
• $ is a special word boundary symbol
• Maintain a second inverted index from bigrams to
dictionary terms that match each bigram.
• Character n-grams → terms
• terms → documents

17
Walid Magdy, TTDS 2020/2021

17

Character n-gram Indexes


• The n-gram index finds terms based on a query
consisting of n-grams (here n=2).
$m mace madden
mo among amortize
on almond among

Wild card
query
Find possible Filter unmatching Search collection Documents
terms terms for all terms

Index of char Collection index


bigrams of terms

18
Walid Magdy, TTDS 2020/2021

18

9
10/6/2020

Character n-gram Indexes: Query time


• Step 1: Query mon* → $m AND mo AND on
• It would still match moon.
• Step 2: Must post-filter these terms against query.
• Phrase match, or post-step1 match
• Step 3: Surviving enumerated terms are then looked
up in the term-document inverted index.
→ Montreal OR monster OR monkey
• Wild-cards can result in expensive query execution
(very large disjunctions…)

19
Walid Magdy, TTDS 2020/2021

19

Character n-gram Indexes: Applications


• Spelling Correction
• Create n-gram representation for words
• Build index for words:
• Dictionary of words → documents (each word is a document)
• Character n-grams → terms
• When getting a search term that is misspelled (OOV or
not frequent), find possible corrections
• Possible corrections = most matching results
Query: elepgant → $e el le ep pg ga an nt t$
Results:
elegant → $e el le eg ga an nt t$
elephant → $e el le ep ph ha an nt t$
20
Walid Magdy, TTDS 2020/2021

20

10
10/6/2020

Character n-gram Indexes: Applications


• Char n-grams can be used as direct index terms for
some applications:
• Arabic IR, when no stemmer/segmenter is available
• Documents with spelling mistakes: OCR documents
• Word char representation can by with multiple n’s
• “elephant” → 2/3-gram →
“$e el le ep ph ha an nt t$ $el $ele lep eph pha han ant nt$”

The children behaved well ‫األبناء تصرفوا جيدا‬ $‫ا ال أل أب بن نا اء ء‬$


Her children are cute ‫أبناءها لطاف‬ $‫أ أب بن نا اء ءه ها ا‬$

Document: Elepbant → $e el le ep pb ba an nt t$
Query: Elephant → $e el le ep ph ha an nt t$

21
Walid Magdy, TTDS 2020/2021

21

Summary
• Index can by multilayer
• Extent index (multi-terms in one position in document)
• Index does not have to be formed of words
• Character n-grams representation of words
• Two indexes are sometimes used
• Index of character n-grams to find matching words
• Index of terms to search for matched words

22
Walid Magdy, TTDS 2020/2021

22

11
10/6/2020

Resources
• Text book 1: Intro to IR, Chapter 3.1 – 3.4
• Text book 2: IR in Practice, Chapter 5

23
Walid Magdy, TTDS 2020/2021

23

12

You might also like