Lecture2 Intro Boolean 6per
Introduction to Information Retrieval
Introducing Information Retrieval and Web Search

Information Retrieval
§ Information Retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).
§ These days we frequently think first of web search, but there are many other cases:
§ E-mail search
§ Searching your laptop
§ Corporate knowledge bases
§ Legal information retrieval
[Figure: two bar charts comparing Unstructured and Structured data by Data volume and Market Cap; vertical axis from 0 to 250]
§ Goal: Retrieve documents with information that is relevant to the user’s information need and helps the …
[Figure: the search model. Info need (“Info about removing mice without killing them”) → Query (misformulation?) → Search engine over the Collection → Results, with query refinement feeding back into the Query]
Introduction to Information Retrieval Sec. 1.1
Inverted index
§ For each term t, we must store a list of all documents that contain t.
§ Identify each doc by a docID, a document serial number
§ Can we use fixed-size arrays for this?
Introduction to Information Retrieval Sec. 1.2
Inverted index
[Figure: the Dictionary of terms, each pointing to its Postings list, e.g. roman → 1 2 and countryman → 13 16]
§ Postings are sorted by docID (more later on why).
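The dictionary-and-postings structure can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's indexer: the toy collection, the whitespace tokenizer, and the function name `build_inverted_index` are all assumptions made for the example. Postings are variable-length lists, kept sorted by docID.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to a sorted list of the docIDs that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        # Assumed tokenizer: lowercase and split on whitespace.
        for token in text.lower().split():
            postings[token].add(doc_id)
    # Sort terms (the dictionary) and each postings list (by docID).
    return {term: sorted(ids) for term, ids in sorted(postings.items())}


# Hypothetical toy collection; docIDs are serial numbers.
docs = {
    1: "friends romans countrymen",
    2: "romans and countrymen",
    3: "friends of caesar",
}
index = build_inverted_index(docs)
print(index["romans"])   # [1, 2]
print(index["friends"])  # [1, 3]
```

A set is used while collecting so that a term occurring twice in one document still yields a single posting.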
Why frequency?
Will discuss later.
Brutus → 2 4 8 16 32 64 128
Caesar → 1 2 3 5 8 13 21 34
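The two postings lists above can be intersected by walking both lists at the same time, which is where sorting by docID pays off. The following is a minimal sketch of such a merge for Brutus AND Caesar; the function name `intersect` is chosen here for illustration.

```python
def intersect(p1, p2):
    """Intersect two postings lists that are sorted by docID."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            # docID occurs in both lists: keep it and advance both pointers.
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer


brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
print(intersect(brutus, caesar))  # [2, 8]
```

Each step advances at least one pointer, so for lists of lengths x and y the merge takes O(x+y) operations.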
Introduction to Information Retrieval Sec. 1.4
Introduction to Information Retrieval Sec. 1.3
Boolean queries: More general merges
§ Exercise: Adapt the merge for the queries:
Brutus AND NOT Caesar
Brutus OR NOT Caesar
§ Can we still run through the merge in time O(x+y)?
What can we achieve?

Merging
What about an arbitrary Boolean formula?
(Brutus OR Caesar) AND NOT (Antony OR Cleopatra)
§ Can we always merge in “linear” time?
§ Linear in what?
§ Can we do better?
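One possible way to approach these questions is sketched below: an AND NOT merge and an OR merge, each a single pass over two docID-sorted lists. The function names are illustrative, and the Antony and Cleopatra postings are hypothetical values made up for the example.

```python
def and_not(p1, p2):
    """Return docIDs in p1 that are absent from p2 (both sorted by docID)."""
    answer = []
    i = j = 0
    while i < len(p1):
        if j == len(p2) or p1[i] < p2[j]:
            # p1[i] cannot appear later in p2, so it survives.
            answer.append(p1[i])
            i += 1
        elif p1[i] == p2[j]:
            i += 1
            j += 1
        else:
            j += 1
    return answer


def union(p1, p2):
    """Return docIDs in p1 or p2 (both sorted by docID), without duplicates."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            answer.append(p1[i])
            i += 1
        else:
            answer.append(p2[j])
            j += 1
    answer.extend(p1[i:])
    answer.extend(p2[j:])
    return answer


brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
antony = [2, 13, 64]   # hypothetical postings
cleopatra = [1, 8]     # hypothetical postings

print(and_not(brutus, caesar))  # [4, 16, 32, 64, 128]
# (Brutus OR Caesar) AND NOT (Antony OR Cleopatra)
print(and_not(union(brutus, caesar), union(antony, cleopatra)))
# [3, 4, 5, 16, 21, 32, 34, 128]
```

Both merges stay within O(x+y). Brutus OR NOT Caesar is different in kind: the complement of Caesar's list covers most of the collection, so the answer is not bounded by the lengths of the two input lists.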
Query: Brutus AND Calpurnia AND Caesar
Execute the query as (Calpurnia AND Brutus) AND Caesar.
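A plausible rationale for that execution order, assumed here rather than stated above, is to start with the shortest postings lists so that intermediate results stay small. The sketch below orders the terms of an AND query by postings-list length before merging. The postings values and the names `plan` and `and_query` are hypothetical.

```python
def intersect(p1, p2):
    """Intersect two postings lists that are sorted by docID."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer


def plan(index, terms):
    """Order query terms by increasing postings-list length."""
    return sorted(terms, key=lambda t: len(index.get(t, [])))


def and_query(index, terms):
    """AND together the postings of all terms, shortest list first."""
    ordered = plan(index, terms)
    if not ordered:
        return []
    result = index.get(ordered[0], [])
    for term in ordered[1:]:
        if not result:
            break  # nothing left to match; skip the remaining merges
        result = intersect(result, index.get(term, []))
    return result


# Hypothetical postings lists, for illustration only.
index = {
    "brutus": [1, 2, 4, 11, 31, 45, 173, 174],
    "caesar": [1, 2, 4, 5, 6, 16, 57, 132, 140],
    "calpurnia": [2, 31, 54, 101],
}
query = ["brutus", "calpurnia", "caesar"]
print(plan(index, query))       # ['calpurnia', 'brutus', 'caesar']
print(and_query(index, query))  # [2]
```

With these lengths (4, 8, 9) the plan reproduces the order (Calpurnia AND Brutus) AND Caesar.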
Introduction to Information Retrieval
Phrase queries and positional indexes

Phrase queries
§ We want to be able to answer queries such as “stanford university” as a phrase
§ Thus the sentence “I went to university at Stanford” is not a match.
§ The concept of phrase queries has proven easily understood by users; one of the few “advanced search” ideas that works
§ Many more queries are implicit phrase queries
§ For this, it no longer suffices to store only <term : docs> entries
Introduction to Information Retrieval Sec. 2.4.1
Introduction to Information Retrieval Sec. 2.4.2
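To see why term-to-docs entries are not enough for phrase queries, consider storing the positions of each term within each document. The sketch below is a minimal positional index under assumed conventions (whitespace tokenization, 0-based positions); the second document and the function names are hypothetical.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map term -> {docID: [positions]}, positions in increasing order."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, token in enumerate(text.lower().split()):
            index[token].setdefault(doc_id, []).append(pos)
    return index


def phrase_query(index, phrase):
    """Return sorted docIDs in which the words of `phrase` occur consecutively."""
    terms = phrase.lower().split()
    if not terms or any(t not in index for t in terms):
        return []
    # Candidate documents must contain every term of the phrase.
    candidates = set(index[terms[0]])
    for t in terms[1:]:
        candidates &= set(index[t])
    hits = []
    for doc_id in sorted(candidates):
        later = [set(index[t][doc_id]) for t in terms[1:]]
        # A start position p matches if the k-th later term sits at p + k + 1.
        if any(all(p + k + 1 in s for k, s in enumerate(later))
               for p in index[terms[0]][doc_id]):
            hits.append(doc_id)
    return hits


docs = {
    1: "I went to university at Stanford",
    2: "Stanford University is in California",  # hypothetical document
}
index = build_positional_index(docs)
print(phrase_query(index, "stanford university"))  # [2]
```

Both documents contain both words, so a plain AND over docID lists would return documents 1 and 2; only the position check rules out document 1.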
Combination schemes
§ These two approaches (biword indexes and positional indexes) can be profitably combined
§ For particular phrases (“Michael Jackson”, “Britney Spears”) it is inefficient to keep on merging positional postings lists
§ Even more so for phrases like “The Who”
§ Williams et al. (2004) evaluate a more sophisticated mixed indexing scheme
§ A typical web query mixture was executed in ¼ of the time of using just a positional index
§ It required 26% more space than having a positional index alone
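The idea of combining the two kinds of index can be illustrated with a deliberately simplified sketch. This is not the scheme evaluated by Williams et al. It only shows the general shape: a small phrase index holds precomputed postings for a few chosen phrases, and every other phrase falls back to the positional index. The class name `CombinedIndex`, the `indexed_phrases` parameter, and the toy documents are all hypothetical.

```python
from collections import defaultdict


def tokenize(text):
    """Assumed tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


class CombinedIndex:
    def __init__(self, docs, indexed_phrases):
        self.positional = defaultdict(dict)  # term -> {docID: [positions]}
        # Phrase index: one precomputed postings list per chosen phrase.
        self.phrases = {tuple(tokenize(p)): [] for p in indexed_phrases}
        for doc_id in sorted(docs):
            tokens = tokenize(docs[doc_id])
            for pos, tok in enumerate(tokens):
                self.positional[tok].setdefault(doc_id, []).append(pos)
            for phrase, postings in self.phrases.items():
                n = len(phrase)
                if any(tuple(tokens[i:i + n]) == phrase
                       for i in range(len(tokens) - n + 1)):
                    postings.append(doc_id)

    def phrase_query(self, phrase):
        """Return (docIDs, name of the index that answered the query)."""
        terms = tuple(tokenize(phrase))
        if terms in self.phrases:
            # Direct lookup: no positional merging needed.
            return self.phrases[terms], "phrase index"
        return self._positional_lookup(terms), "positional index"

    def _positional_lookup(self, terms):
        if not terms or any(t not in self.positional for t in terms):
            return []
        hits = []
        for doc_id, starts in sorted(self.positional[terms[0]].items()):
            # Term k of the phrase must occur at position p + k.
            if any(all(p + k in self.positional[t].get(doc_id, ())
                       for k, t in enumerate(terms))
                   for p in starts):
                hits.append(doc_id)
        return hits


docs = {
    1: "the who played live",
    2: "who is the singer",
    3: "tickets for the who tonight",
}
combined = CombinedIndex(docs, indexed_phrases=["the who"])
print(combined.phrase_query("the who"))      # ([1, 3], 'phrase index')
print(combined.phrase_query("played live"))  # ([1], 'positional index')
```

The trade-off mirrors the one on the slide: the extra phrase postings cost space, and in exchange the chosen phrases are answered without merging long positional lists for words such as "the" and "who".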