
IR Indexing


BSBI

Reuters collection example (approximate #’s)


• 800,000 documents from the Reuters news feed
• 200 terms per document
• 400,000 unique terms
• 100,000,000 postings
BSBI

Reuters collection example (approximate #’s)


• Sorting 100,000,000 records on disk is too slow because of
disk seek time.
• Parse and build posting entries one at a time
• Sort posting entries by term
• Then by document in each term
• Doing this with random disk seeks is too slow
• e.g. if every comparison takes 2 disk seeks and sorting N items
needs N log2(N) comparisons, how long does it take?
• roughly 307 days (worked out on the next slide)
BSBI

Reuters collection example (approximate #’s)


• 100,000,000 records
• N log2(N) = 2,657,542,475.91 comparisons
• at ~5 ms per seek, one disk seek per comparison = 13,287,712.38 seconds
• 2 disk seeks per comparison = 26,575,424.76 seconds
• = 442,923.75 minutes
• = 7,382.06 hours
• = 307.59 days
• = 84% of a year
• = 1% of your life
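
A quick sanity check of this estimate in Python. The ~5 ms average seek time is the value implied by the figures above; it is an assumption about the disk, not something measured here.

import math

N = 100_000_000               # (term, doc) postings to sort
SEEK = 0.005                  # assumed average disk seek time: ~5 ms

comparisons = N * math.log2(N)            # ~2.66e9 comparisons
seconds = comparisons * 2 * SEEK          # 2 disk seeks per comparison
print(f"{seconds:,.0f} s = {seconds/3600:,.1f} h = {seconds/86400:.1f} days")
# ~26,575,425 s  ~7,382 h  ~307.6 days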
BSBI - Block sort-based indexing

Different way to sort index


• 12-byte records (term, doc, meta-data)
• Need to sort T= 100,000,000 such 12-byte records by term
• Define a block to have 1,600,000 such records
• can easily fit a couple blocks in memory
• we will be working with 64 such blocks
• Accumulate postings for each block (real blocks are bigger)
• Sort each block
• Write to disk
• Then merge
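
A quick check of these sizes, using only the numbers above:

block_records = 1_600_000
record_bytes = 12
print(block_records * record_bytes)        # 19,200,000 bytes, ~19 MB per block
print(64 * block_records)                  # 102,400,000 records, ~the 100M postings
print(64 * block_records * record_bytes)   # 1,228,800,000 bytes, ~1.2 GB in total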
BSBI - Block sort-based indexing

Different way to sort index


Block 1 (on disk): (1998, www.cnn.com), (Every, www.cnn.com), (Her, news.google.com), (I'm, news.bbc.co.uk)

Block 2 (on disk): (1998, news.google.com), (Her, news.bbc.co.uk), (I, www.cnn.com), (Jensen's, www.cnn.com)

Merged postings: (1998, www.cnn.com), (1998, news.google.com), (Every, www.cnn.com), (Her, news.bbc.co.uk), (Her, news.google.com), (I, www.cnn.com), (I'm, news.bbc.co.uk), (Jensen's, www.cnn.com)
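
A minimal sketch of the merge step, assuming each block is already a sorted run of (term, doc) pairs; heapq.merge streams the runs so only their current heads need to be in memory:

import heapq
from itertools import groupby

block1 = [("1998", "www.cnn.com"), ("Every", "www.cnn.com"),
          ("Her", "news.google.com"), ("I'm", "news.bbc.co.uk")]
block2 = [("1998", "news.google.com"), ("Her", "news.bbc.co.uk"),
          ("I", "www.cnn.com"), ("Jensen's", "www.cnn.com")]

# Merge the sorted runs by term, then group the postings for each term.
merged = heapq.merge(block1, block2, key=lambda pair: pair[0])
for term, pairs in groupby(merged, key=lambda pair: pair[0]):
    print(term, [doc for _, doc in pairs])
# 1998 ['www.cnn.com', 'news.google.com'] ... Jensen's ['www.cnn.com']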
BSBI - Block sort-based indexing

BlockSortBasedIndexConstruction

BlockSortBasedIndexConstruction()
  n ← 0
  while (all documents have not been processed)
    do n ← n + 1
       block ← ParseNextBlock()
       BSBI-Invert(block)
       WriteBlockToDisk(block, f_n)
  MergeBlocks(f_1, f_2, ..., f_n; f_merged)
BSBI - Block sort-based indexing

Block merge indexing


• Parse documents into (TermID, DocID) pairs until “block” is
full
• Invert the block
• Sort the (TermID,DocID) pairs
• Compile into TermID posting lists
• Write the block to disk
• Then merge all blocks into one large postings file
• Need 2 copies of the data on disk (input then output)
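
A compact sketch of this flow in Python. The names and the input format (an iterable of (docID, termIDs) pairs) are hypothetical, and runs are kept as in-memory lists where a real implementation would write each sorted block to its own file on disk:

import heapq
from itertools import groupby

def bsbi_index(doc_stream, block_size=1_600_000):
    """doc_stream yields (doc_id, term_ids) pairs; yields (term_id, posting_list)."""
    runs, pairs = [], []
    for doc_id, term_ids in doc_stream:
        for term_id in term_ids:
            pairs.append((term_id, doc_id))
            if len(pairs) >= block_size:     # block is full:
                runs.append(sorted(pairs))   #   invert (sort) it and "write it to disk"
                pairs = []
    if pairs:
        runs.append(sorted(pairs))
    # Merge all sorted runs and compile one posting list per termID.
    merged = heapq.merge(*runs)
    for term_id, group in groupby(merged, key=lambda pair: pair[0]):
        yield term_id, sorted({doc_id for _, doc_id in group})

# e.g. list(bsbi_index([(1, [7, 3]), (2, [3])], block_size=2)) -> [(3, [1, 2]), (7, [1])]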
BSBI - Block sort-based indexing

Analysis of BSBI
• The dominant term is O(T log T)
• T is the number of (TermID, DocID) pairs
• But in practice ParseNextBlock takes the most time
• Then MergeBlocks
• Again, disk seek times versus memory access times
BSBI - Block sort-based indexing

Analysis of BSBI
• 12-byte records (term, doc, meta-data)
• Need to sort T = 100,000,000 such 12-byte records by term
• Define a block to have 1,600,000 such records
• can easily fit a couple blocks in memory
• we will be working with 64 such blocks
• 64 blocks × 1,600,000 records × 12 bytes = 1,228,800,000 bytes
• N log2(N) comparisons ≈ 5,584,577,250.93
• 2 touches per comparison at memory speeds (~10 × 10^-6 sec)
• ≈ 55,845.77 seconds = 930.76 min = 15.5 hours
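
The same kind of sanity check as before, this time at memory speed. The recomputation below uses only the parameters stated above (64 blocks of 1,600,000 records, 2 memory touches per comparison, ~10 microseconds per touch) and lands in the same range as the 15.5-hour figure:

import math

T = 64 * 1_600_000           # 102,400,000 records
TOUCH = 10e-6                # assumed ~10 microseconds per memory touch

comparisons = T * math.log2(T)          # ~2.72e9 comparisons
seconds = comparisons * 2 * TOUCH       # 2 memory touches per comparison
print(f"{seconds:,.0f} s = {seconds/3600:.1f} h")
# ~54,500 s, roughly 15 hours -- versus ~307 days for the disk-seek estimate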
Index Construction

Overview

• Introduction
• Hardware
• BSBI - Block sort-based indexing
• SPIMI - Single Pass in-memory indexing
• Distributed indexing
• Dynamic indexing
• Miscellaneous topics
Single-Pass In-Memory Indexing

SPIMI
• BSBI is good but,
• it needs a data structure for mapping terms to termIDs
• this won’t fit in memory for big corpora
• Straightforward solution
• dynamically create dictionaries
• store the dictionaries with the blocks
Single-Pass In-Memory Indexing

SPIMI-Invert(tokenStream)
  outputFile ← NewFile()
  dictionary ← NewHash()
  while (free memory available)
    do token ← next(tokenStream)
       if term(token) ∉ dictionary
         then postingsList ← AddToDictionary(dictionary, term(token))
         else postingsList ← GetPostingsList(dictionary, term(token))
       if full(postingsList)
         then postingsList ← DoublePostingsList(dictionary, term(token))
       AddToPostingsList(postingsList, docID(token))
  sortedTerms ← SortTerms(dictionary)
  WriteBlockToDisk(sortedTerms, dictionary, outputFile)
  return outputFile
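
A minimal Python sketch of one SPIMI block. The token-stream format ((term, docID) pairs) and the fixed pair budget standing in for "free memory available" are assumptions for illustration:

from collections import defaultdict

def spimi_invert(token_stream, max_postings=1_600_000):
    """Consume tokens until the memory budget is hit; return one sorted block."""
    dictionary = defaultdict(list)       # term -> postings list, created on first sight
    used = 0
    for term, doc_id in token_stream:
        dictionary[term].append(doc_id)  # add the posting directly; no (TermID, DocID) sort
        used += 1
        if used >= max_postings:         # stand-in for "free memory available"
            break
    # Terms are sorted once, at block-write time, so blocks can be merged later.
    return {term: dictionary[term] for term in sorted(dictionary)}

# e.g. spimi_invert(iter([("her", 1), ("i", 1), ("her", 2)])) -> {'her': [1, 2], 'i': [1]}

A full implementation would write each block to disk and merge the blocks at the end, as in BSBI, but without ever building a global term-to-termID mapping.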
Single-Pass In-Memory Indexing

• So what is different here?


• SPIMI adds postings directly to a posting list.
• BSBI first collected (TermID, DocID) pairs
• then sorted them
• then aggregated the postings
• Each posting list is dynamic, so there is no posting-list sorting
• Saves memory because a term is only stored once
• Complexity is more like O(T)
• Compression enables bigger effective blocks
Single-Pass In-Memory Indexing

Large Scale Indexing


• The key decision in block merge indexing is block size
• In practice, spidering is often interleaved with indexing
• Spidering is bottlenecked by WAN speed and other factors
