Lecture 2: Introduction and Boolean Retrieval (6-per-page slide handout)

Introduction to Information Retrieval
Introducing Information Retrieval and Web Search

Information Retrieval
§ Information Retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).
§ These days we frequently think first of web search, but there are many other cases:
§ E-mail search
§ Searching your laptop
§ Corporate knowledge bases
§ Legal information retrieval

Unstructured (text) vs. structured (database) data, in the mid-nineties and today
[Two bar charts, one for the mid-nineties and one for today, comparing unstructured vs. structured data on two measures: data volume and market cap.]

Basic assumptions of Information Retrieval (Sec. 1.1)
§ Collection: A set of documents
§ Assume it is a static collection for the moment
§ Goal: Retrieve documents with information that is relevant to the user's information need and helps the user complete a task

The classic search model
A user task (get rid of mice in a politically correct way) gives rise to an info need (info about removing mice without killing them), which the user formulates as a query (how trap mice alive) for a search engine over the collection. The results may prompt query refinement and another round of search. Things can go wrong at each step: a misconception between task and info need, and a misformulation between info need and query.
How good are the retrieved docs? (Sec. 1.1)
§ Precision: Fraction of retrieved docs that are relevant to the user's information need
§ Recall: Fraction of relevant docs in collection that are retrieved
§ More precise definitions and measurements to follow later

Term-document incidence matrices
Unstructured data in 1620 (Sec. 1.1)
§ Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia?
§ One could grep all of Shakespeare's plays for Brutus and Caesar, then strip out lines containing Calpurnia?
§ Why is that not the answer?
§ Slow (for large corpora)
§ NOT Calpurnia is non-trivial
§ Other operations (e.g., find the word Romans near countrymen) not feasible
§ Ranked retrieval (best documents to return): later lectures

Term-document incidence matrices (Sec. 1.1)

            Antony and  Julius   The
            Cleopatra   Caesar   Tempest  Hamlet  Othello  Macbeth
Antony          1          1        0        0       0        1
Brutus          1          1        0        1       0        0
Caesar          1          1        0        1       1        1
Calpurnia       0          1        0        0       0        0
Cleopatra       1          0        0        0       0        0
mercy           1          0        1        1       1        1
worser          1          0        1        1       1        0

An entry is 1 if the play contains the word, 0 otherwise. From this matrix we can answer Brutus AND Caesar BUT NOT Calpurnia.

Incidence vectors (Sec. 1.1)
§ So we have a 0/1 vector for each term.
§ To answer the query: take the vectors for Brutus, Caesar and Calpurnia (complemented) → bitwise AND.
§ 110100 AND 110111 AND 101111 = 100100
(The vectors are rows of the matrix above: Brutus 110100, Caesar 110111, and the complement of Calpurnia 010000.)

Answers to query (Sec. 1.1)
§ Antony and Cleopatra, Act III, Scene ii
Agrippa [Aside to DOMITIUS ENOBARBUS]: Why, Enobarbus,
When Antony found Julius Caesar dead,
He cried almost to roaring; and he wept
When at Philippi he found Brutus slain.
§ Hamlet, Act III, Scene ii
Lord Polonius: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.
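Below is a minimal Python sketch of this bitwise-AND idea (the vectors are transcribed from the matrix above; everything else, including the names, is my illustration, not from the handout):

    # One bit per play, in matrix column order: Antony and Cleopatra,
    # Julius Caesar, The Tempest, Hamlet, Othello, Macbeth.
    plays = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
             "Hamlet", "Othello", "Macbeth"]
    brutus, caesar, calpurnia = 0b110100, 0b110111, 0b010000

    mask = (1 << len(plays)) - 1                     # 0b111111, for complementing
    result = brutus & caesar & (~calpurnia & mask)   # Brutus AND Caesar AND NOT Calpurnia

    # Read off the 1 bits; the most significant bit is the first play.
    print([p for i, p in enumerate(plays)
           if (result >> (len(plays) - 1 - i)) & 1])
    # ['Antony and Cleopatra', 'Hamlet']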
Bigger collections (Sec. 1.1)
§ Consider N = 1 million documents, each with about 1000 words.
§ Avg 6 bytes/word including spaces/punctuation
§ 6GB of data in the documents.
§ Say there are M = 500K distinct terms among these.

Can't build the matrix (Sec. 1.1)
§ 500K x 1M matrix has half-a-trillion 0's and 1's.
§ But it has no more than one billion 1's. Why? (The collection has only 1M × 1000 = one billion word occurrences, and each occurrence turns on at most one matrix entry.)
§ The matrix is extremely sparse.
§ What's a better representation?
§ We only record the 1 positions.

The Inverted Index
The key data structure underlying modern IR

Inverted index (Sec. 1.2)
§ For each term t, we must store a list of all documents that contain t.
§ Identify each doc by a docID, a document serial number
§ Can we use fixed-size arrays for this?

Brutus → 1 2 4 11 31 45 173 174
Caesar → 1 2 4 5 6 16 57 132
Calpurnia → 2 31 54 101

§ What happens if the word Caesar is added to document 14?

Inverted index (Sec. 1.2)
§ We need variable-size postings lists
§ On disk, a continuous run of postings is normal and best
§ In memory, can use linked lists or variable length arrays
§ Some tradeoffs in size/ease of insertion

Brutus → 1 2 4 11 31 45 173 174
Caesar → 1 2 4 5 6 16 57 132
Calpurnia → 2 31 54 101

Each docID in a list is a posting; the dictionary maps terms to their postings lists. Postings are sorted by docID (more later on why).

Inverted index construction (Sec. 1.2)
Documents to be indexed ("Friends, Romans, countrymen.")
→ Tokenizer → token stream: Friends Romans Countrymen
→ Linguistic modules → modified tokens: friend roman countryman
→ Indexer → inverted index: friend → 2 4; roman → 1 2; countryman → 13 16
Initial stages of text processing (Sec. 1.2)
§ Tokenization
§ Cut character sequence into word tokens
§ Deal with "John's", a state-of-the-art solution
§ Normalization
§ Map text and query term to same form
§ You want U.S.A. and USA to match
§ Stemming
§ We may wish different forms of a root to match
§ authorize, authorization
§ Stop words
§ We may omit very common words (or not)
§ the, a, to, of

Indexer steps: Token sequence (Sec. 1.2)
§ Sequence of (Modified token, Document ID) pairs.

Doc 1: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious

Indexer steps: Sort (Sec. 1.2)
§ Sort by terms (at least conceptually)
§ And then docID
§ This is the core indexing step

Indexer steps: Dictionary & Postings (Sec. 1.2)
§ Multiple term entries in a single document are merged.
§ Split into Dictionary and Postings
§ Doc. frequency information is added.
§ Why frequency? Will discuss later.
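A minimal Python sketch of these indexer steps, run over the two Caesar documents above (the tokenizer is a crude stand-in for real tokenization and linguistic modules; all names are my illustration, not from the handout):

    from collections import defaultdict

    docs = {
        1: "I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.",
        2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious",
    }

    def tokenize(text):
        # Crude stand-in: lowercase and strip surrounding punctuation.
        return [t.strip(".,;'").lower() for t in text.split()]

    # Token sequence: (modified token, docID) pairs, sorted by term, then docID.
    pairs = sorted((tok, doc_id) for doc_id, text in docs.items()
                   for tok in tokenize(text))

    # Dictionary & Postings: merge repeats within a document; df = list length.
    postings = defaultdict(list)
    for term, doc_id in pairs:
        if not postings[term] or postings[term][-1] != doc_id:
            postings[term].append(doc_id)

    for term in sorted(postings):
        print(term, len(postings[term]), postings[term])   # term, doc. freq., postings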

Where do we pay in storage? (Sec. 1.2)
§ The dictionary: terms and counts
§ The postings: lists of docIDs, reached via pointers from the dictionary
§ IR system implementation:
§ How do we index efficiently?
§ How much storage do we need?

Query processing with an inverted index
The index we just built (Sec. 1.3)
§ How do we process a query? (our focus)
§ Later: what kinds of queries can we process?

Query processing: AND (Sec. 1.3)
§ Consider processing the query: Brutus AND Caesar
§ Locate Brutus in the Dictionary; retrieve its postings.
§ Locate Caesar in the Dictionary; retrieve its postings.
§ "Merge" the two postings (intersect the document sets):

Brutus → 2 4 8 16 32 64 128
Caesar → 1 2 3 5 8 13 21 34

The merge (Sec. 1.3)
§ Intersecting two postings lists: a "merge" algorithm
§ Walk through the two postings simultaneously, in time linear in the total number of postings entries

Brutus → 2 4 8 16 32 64 128
Caesar → 1 2 3 5 8 13 21 34

§ If the list lengths are x and y, the merge takes O(x+y) operations.
§ Crucial: postings sorted by docID.
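A minimal Python rendering of the merge (the algorithm is from the lecture; this particular code and its names are mine):

    def intersect(p1, p2):
        # Intersect two postings lists sorted by docID in O(x + y) time.
        answer, i, j = [], 0, 0
        while i < len(p1) and j < len(p2):
            if p1[i] == p2[j]:
                answer.append(p1[i])
                i, j = i + 1, j + 1
            elif p1[i] < p2[j]:
                i += 1
            else:
                j += 1
        return answer

    print(intersect([2, 4, 8, 16, 32, 64, 128], [1, 2, 3, 5, 8, 13, 21, 34]))  # [2, 8]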

The Boolean Retrieval Model & Extended Boolean Models

Boolean queries: Exact match (Sec. 1.3)
§ The Boolean retrieval model is being able to ask a query that is a Boolean expression:
§ Boolean queries are queries using AND, OR and NOT to join query terms
§ Views each document as a set of words
§ Is precise: document matches condition or not.
§ Perhaps the simplest model to build an IR system on
§ Primary commercial retrieval tool for 3 decades.
§ Many search systems you still use are Boolean:
§ Email, library catalog, macOS Spotlight
Example: WestLaw (http://www.westlaw.com/) (Sec. 1.4)
§ Largest commercial (paying subscribers) legal search service (started 1975; ranking added 1992; new federated search added 2010)
§ Tens of terabytes of data; ~700,000 users
§ Majority of users still use boolean queries
§ Example query:
§ What is the statute of limitations in cases involving the federal tort claims act?
§ LIMIT! /3 STATUTE ACTION /S FEDERAL /2 TORT /3 CLAIM
§ /3 = within 3 words, /S = in same sentence

Example: WestLaw, continued (Sec. 1.4)
§ Another example query:
§ Requirements for disabled people to be able to access a workplace
§ disabl! /p access! /s work-site work-place (employment /3 place)
§ Note that SPACE is disjunction, not conjunction!
§ Long, precise queries; proximity operators; incrementally developed; not like web search
§ Many professional searchers still like Boolean search
§ You know exactly what you are getting
§ But that doesn't mean it actually works better…

Boolean queries: More general merges (Sec. 1.3)
§ Exercise: Adapt the merge for the queries:
§ Brutus AND NOT Caesar
§ Brutus OR NOT Caesar
§ Can we still run through the merge in time O(x+y)?
§ What can we achieve? (One possible adaptation is sketched after the next slide.)

Merging (Sec. 1.3)
§ What about an arbitrary Boolean formula?
§ (Brutus OR Caesar) AND NOT (Antony OR Cleopatra)
§ Can we always merge in "linear" time?
§ Linear in what?
§ Can we do better?
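For the first exercise query, Brutus AND NOT Caesar, one possible adaptation (a hedged sketch, not the handout's reference solution) keeps the O(x+y) walk but outputs docIDs found only in the first list:

    def and_not(p1, p2):
        # Docs in p1 but not in p2; both postings lists sorted by docID; O(x + y).
        answer, i, j = [], 0, 0
        while i < len(p1):
            if j == len(p2) or p1[i] < p2[j]:
                answer.append(p1[i])   # p1[i] can no longer appear in p2
                i += 1
            elif p1[i] == p2[j]:
                i, j = i + 1, j + 1    # excluded by the NOT
            else:
                j += 1
        return answer

    print(and_not([2, 4, 8, 16, 32, 64, 128], [1, 2, 3, 5, 8, 13, 21, 34]))
    # [4, 16, 32, 64, 128]

Brutus OR NOT Caesar is the harder case: its answer contains every document outside Caesar's postings, so it cannot even be written out in time linear in the two lists.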

Query optimization (Sec. 1.3)
§ What is the best order for query processing?
§ Consider a query that is an AND of n terms.
§ For each of the n terms, get its postings, then AND them together.

Brutus → 2 4 8 16 32 64 128
Caesar → 1 2 3 5 8 16 21 34
Calpurnia → 13 16

Query: Brutus AND Calpurnia AND Caesar

Query optimization example (Sec. 1.3)
§ Process in order of increasing freq:
§ start with smallest set, then keep cutting further.
§ This is why we kept document freq. in the dictionary.
§ Execute the query as (Calpurnia AND Brutus) AND Caesar.
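A minimal Python sketch of this ordering heuristic, reusing the intersect function from above (the heuristic is from the slide; the code and names are mine):

    def intersect_many(terms, index):
        # AND the terms' postings in order of increasing document frequency.
        ordered = sorted(terms, key=lambda t: len(index[t]))
        result = index[ordered[0]]
        for term in ordered[1:]:
            result = intersect(result, index[term])
            if not result:
                break               # empty intermediate result: stop early
        return result

    index = {
        "Brutus":    [2, 4, 8, 16, 32, 64, 128],
        "Caesar":    [1, 2, 3, 5, 8, 16, 21, 34],
        "Calpurnia": [13, 16],
    }
    print(intersect_many(["Brutus", "Calpurnia", "Caesar"], index))   # [16]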
Exercise (Sec. 1.3)
§ Recommend a query processing order for:
(tangerine OR trees) AND (marmalade OR skies) AND (kaleidoscope OR eyes)

Term          Freq
eyes          213312
kaleidoscope   87009
marmalade     107913
skies         271658
tangerine      46653
trees         316812

§ Which two terms should we process first?

More general optimization (Sec. 1.3)
§ e.g., (madding OR crowd) AND (ignoble OR strife)
§ Get doc. freq.'s for all terms.
§ Estimate the size of each OR by the sum of its doc. freq.'s (conservative).
§ Process in increasing order of OR sizes.
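Applying this estimate to the exercise (my arithmetic, not from the handout): kaleidoscope OR eyes ≤ 87009 + 213312 = 300321, tangerine OR trees ≤ 46653 + 316812 = 363465, and marmalade OR skies ≤ 107913 + 271658 = 379571, so by this heuristic one would process (kaleidoscope OR eyes) first, then (tangerine OR trees), then (marmalade OR skies).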

Query processing exercises
§ Exercise: If the query is friends AND romans AND (NOT countrymen), how could we use the freq of countrymen?
§ Exercise: Extend the merge to an arbitrary Boolean query. Can we always guarantee execution in time linear in the total postings size?
§ Hint: Begin with the case of a Boolean formula query: in this, each query term appears only once in the query.

Exercise
§ Try the search feature at http://www.rhymezone.com/shakespeare/
§ Write down five search features you think it could do better

Phrase queries and positional indexes

Phrase queries (Sec. 2.4)
§ We want to be able to answer queries such as "stanford university" – as a phrase
§ Thus the sentence "I went to university at Stanford" is not a match.
§ The concept of phrase queries has proven easily understood by users; one of the few "advanced search" ideas that works
§ Many more queries are implicit phrase queries
§ For this, it no longer suffices to store only <term : docs> entries
A first attempt: Biword indexes (Sec. 2.4.1)
§ Index every consecutive pair of terms in the text as a phrase
§ For example, the text "Friends, Romans, Countrymen" would generate the biwords:
§ friends romans
§ romans countrymen
§ Each of these biwords is now a dictionary term
§ Two-word phrase query-processing is now immediate.

Longer phrase queries (Sec. 2.4.1)
§ Longer phrases can be processed by breaking them down
§ stanford university palo alto can be broken into the Boolean query on biwords:
stanford university AND university palo AND palo alto
§ Without the docs, we cannot verify that the docs matching the above Boolean query do contain the phrase.
§ Can have false positives!
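A minimal Python sketch of biword generation and of breaking a longer phrase into a biword query (the function name is mine):

    def biwords(tokens):
        # Every consecutive pair of terms becomes a single dictionary term.
        return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]

    print(biwords(["friends", "romans", "countrymen"]))
    # ['friends romans', 'romans countrymen']

    # A longer phrase becomes an AND of its biwords. A matching doc can
    # still be a false positive: the biwords may occur without the phrase.
    print(" AND ".join(biwords(["stanford", "university", "palo", "alto"])))
    # stanford university AND university palo AND palo alto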

Issues for biword indexes (Sec. 2.4.1)
§ False positives, as noted before
§ Index blowup due to a bigger dictionary
§ Infeasible for more than biwords, big even for them
§ Biword indexes are not the standard solution (for all biwords) but can be part of a compound strategy

Solution 2: Positional indexes (Sec. 2.4.2)
§ In the postings, store, for each term, the position(s) at which its tokens appear:

<term, number of docs containing term;
doc1: position1, position2 … ;
doc2: position1, position2 … ;
etc.>

Positional index example (Sec. 2.4.2)

<be: 993427;
1: 7, 18, 33, 72, 86, 231;
2: 3, 149;
4: 17, 191, 291, 430, 434;
5: 363, 367, …>

§ Which of docs 1, 2, 4, 5 could contain "to be or not to be"?
§ For phrase queries, we use a merge algorithm recursively at the document level
§ But we now need to deal with more than just equality

Processing a phrase query (Sec. 2.4.2)
§ Extract inverted index entries for each distinct term: to, be, or, not.
§ Merge their doc:position lists to enumerate all positions with "to be or not to be".
§ to: 2:1,17,74,222,551; 4:8,16,190,429,433; 7:13,23,191; ...
§ be: 1:17,19; 4:17,191,291,430,434; 5:14,19,101; ...
§ Same general method for proximity searches
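A minimal Python sketch of this document-level positional merge, using the to/be postings above (a simplification: real implementations walk the sorted position lists linearly rather than building sets, and for proximity the exact-offset test relaxes to a within-k test; names and structure are mine):

    # Positional postings: term -> {docID: sorted positions}.
    index = {
        "to": {2: [1, 17, 74, 222, 551], 4: [8, 16, 190, 429, 433], 7: [13, 23, 191]},
        "be": {1: [17, 19], 4: [17, 191, 291, 430, 434], 5: [14, 19, 101]},
    }

    def phrase_docs(terms, index):
        # Docs where the i-th term occurs at position p + i for some start p.
        docs = set(index[terms[0]])
        for t in terms[1:]:
            docs &= set(index[t])                 # candidates contain every term
        hits = []
        for d in sorted(docs):
            starts = set(index[terms[0]][d])
            for i, t in enumerate(terms[1:], start=1):
                starts &= {p - i for p in index[t][d]}   # starts that extend the phrase
            if starts:
                hits.append(d)
        return hits

    print(phrase_docs(["to", "be"], index))   # [4], e.g. "to" at 16, "be" at 17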
Proximity queries (Sec. 2.4.2)
§ LIMIT! /3 STATUTE /3 FEDERAL /2 TORT
§ Again, here, /k means "within k words of".
§ Clearly, positional indexes can be used for such queries; biword indexes cannot.
§ Exercise: Adapt the linear merge of postings to handle proximity queries. Can you make it work for any value of k?
§ This is a little tricky to do correctly and efficiently
§ See Figure 2.12 of IIR

Positional index size (Sec. 2.4.2)
§ A positional index expands postings storage substantially
§ Even though indices can be compressed
§ Nevertheless, a positional index is now standardly used because of the power and usefulness of phrase and proximity queries … whether used explicitly or implicitly in a ranking retrieval system.

Positional index size (Sec. 2.4.2)
§ Need an entry for each occurrence, not just once per document
§ Index size depends on average document size. Why?
§ Average web page has <1000 terms
§ SEC filings, books, even some epic poems … easily 100,000 terms
§ Consider a term with frequency 0.1%:

Document size    Postings    Positional postings
1000             1           1
100,000          1           100

(Expected occurrences: once in a 1000-word document, 100 times in a 100,000-word one. Either way the term contributes a single docID posting, but 1 vs. 100 positional postings.)

Rules of thumb (Sec. 2.4.2)
§ A positional index is 2–4 times as large as a non-positional index
§ Positional index size is 35–50% of the volume of the original text
§ Caveat: all of this holds for "English-like" languages

Combination schemes (Sec. 2.4.3)
§ These two approaches can be profitably combined
§ For particular phrases ("Michael Jackson", "Britney Spears") it is inefficient to keep on merging positional postings lists
§ Even more so for phrases like "The Who"
§ Williams et al. (2004) evaluate a more sophisticated mixed indexing scheme
§ A typical web query mixture was executed in ¼ of the time of using just a positional index
§ It required 26% more space than having a positional index alone
