Lecture 2: Introduction and Boolean Retrieval (6-per-page slide handout)

Introduction to Information Retrieval
Introducing Information Retrieval and Web Search

Information Retrieval
§ Information Retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).
§ These days we frequently think first of web search, but there are many other cases:
§ E-mail search
§ Searching your laptop
§ Corporate knowledge bases
§ Legal information retrieval

Unstructured (text) vs. structured (database) data, in the mid-nineties and today
[Two bar charts, one for the mid-nineties and one for today, comparing unstructured vs. structured data on two measures: data volume and market cap.]

Basic assumptions of Information Retrieval (Sec. 1.1)
§ Collection: A set of documents
§ Assume it is a static collection for the moment
§ Goal: Retrieve documents with information that is relevant to the user's information need and helps the user complete a task

The classic search model
A user task (get rid of mice in a politically correct way) gives rise to an info need (info about removing mice without killing them), which the user formulates as a query (how trap mice alive) for a search engine over the collection. The results may prompt query refinement and another round of search. Things can go wrong at each step: a misconception between task and info need, and a misformulation between info need and query.
How good are the retrieved docs? (Sec. 1.1)
§ Precision: Fraction of retrieved docs that are relevant to the user's information need
§ Recall: Fraction of relevant docs in collection that are retrieved
§ More precise definitions and measurements to follow later

Term-document incidence matrices
Unstructured data in 1620 (Sec. 1.1)
§ Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia?
§ One could grep all of Shakespeare's plays for Brutus and Caesar, then strip out lines containing Calpurnia?
§ Why is that not the answer?
§ Slow (for large corpora)
§ NOT Calpurnia is non-trivial
§ Other operations (e.g., find the word Romans near countrymen) not feasible
§ Ranked retrieval (best documents to return): later lectures

Term-document incidence matrices (Sec. 1.1)

            Antony and  Julius   The
            Cleopatra   Caesar   Tempest  Hamlet  Othello  Macbeth
Antony          1          1        0        0       0        1
Brutus          1          1        0        1       0        0
Caesar          1          1        0        1       1        1
Calpurnia       0          1        0        0       0        0
Cleopatra       1          0        0        0       0        0
mercy           1          0        1        1       1        1
worser          1          0        1        1       1        0

An entry is 1 if the play contains the word, 0 otherwise. From this matrix we can answer Brutus AND Caesar BUT NOT Calpurnia.

Incidence vectors (Sec. 1.1)
§ So we have a 0/1 vector for each term.
§ To answer the query: take the vectors for Brutus, Caesar and Calpurnia (complemented) → bitwise AND.
§ 110100 AND 110111 AND 101111 = 100100
(The vectors are rows of the matrix above: Brutus 110100, Caesar 110111, and the complement of Calpurnia 010000.)

Answers to query (Sec. 1.1)
§ Antony and Cleopatra, Act III, Scene ii
Agrippa [Aside to DOMITIUS ENOBARBUS]: Why, Enobarbus,
When Antony found Julius Caesar dead,
He cried almost to roaring; and he wept
When at Philippi he found Brutus slain.
§ Hamlet, Act III, Scene ii
Lord Polonius: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.
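Below is a minimal Python sketch of this bitwise-AND idea (the vectors are transcribed from the matrix above; everything else, including the names, is my illustration, not from the handout):

    # One bit per play, in matrix column order: Antony and Cleopatra,
    # Julius Caesar, The Tempest, Hamlet, Othello, Macbeth.
    plays = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
             "Hamlet", "Othello", "Macbeth"]
    brutus, caesar, calpurnia = 0b110100, 0b110111, 0b010000

    mask = (1 << len(plays)) - 1                     # 0b111111, for complementing
    result = brutus & caesar & (~calpurnia & mask)   # Brutus AND Caesar AND NOT Calpurnia

    # Read off the 1 bits; the most significant bit is the first play.
    print([p for i, p in enumerate(plays)
           if (result >> (len(plays) - 1 - i)) & 1])
    # ['Antony and Cleopatra', 'Hamlet']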
Bigger collections (Sec. 1.1)
§ Consider N = 1 million documents, each with about 1000 words.
§ Avg 6 bytes/word including spaces/punctuation
§ 6GB of data in the documents.
§ Say there are M = 500K distinct terms among these.

Can't build the matrix (Sec. 1.1)
§ 500K x 1M matrix has half-a-trillion 0's and 1's.
§ But it has no more than one billion 1's. Why? (The collection has only 1M × 1000 = one billion word occurrences, and each occurrence turns on at most one matrix entry.)
§ The matrix is extremely sparse.
§ What's a better representation?
§ We only record the 1 positions.

The Inverted Index
The key data structure underlying modern IR

Inverted index (Sec. 1.2)
§ For each term t, we must store a list of all documents that contain t.
§ Identify each doc by a docID, a document serial number
§ Can we use fixed-size arrays for this?

Brutus → 1 2 4 11 31 45 173 174
Caesar → 1 2 4 5 6 16 57 132
Calpurnia → 2 31 54 101

§ What happens if the word Caesar is added to document 14?

Inverted index (Sec. 1.2)
§ We need variable-size postings lists
§ On disk, a continuous run of postings is normal and best
§ In memory, can use linked lists or variable length arrays
§ Some tradeoffs in size/ease of insertion

Brutus → 1 2 4 11 31 45 173 174
Caesar → 1 2 4 5 6 16 57 132
Calpurnia → 2 31 54 101

Each docID in a list is a posting; the dictionary maps terms to their postings lists. Postings are sorted by docID (more later on why).

Inverted index construction (Sec. 1.2)
Documents to be indexed ("Friends, Romans, countrymen.")
→ Tokenizer → token stream: Friends Romans Countrymen
→ Linguistic modules → modified tokens: friend roman countryman
→ Indexer → inverted index: friend → 2 4; roman → 1 2; countryman → 13 16
Initial stages of text processing (Sec. 1.2)
§ Tokenization
§ Cut character sequence into word tokens
§ Deal with "John's", a state-of-the-art solution
§ Normalization
§ Map text and query term to same form
§ You want U.S.A. and USA to match
§ Stemming
§ We may wish different forms of a root to match
§ authorize, authorization
§ Stop words
§ We may omit very common words (or not)
§ the, a, to, of

Indexer steps: Token sequence (Sec. 1.2)
§ Sequence of (Modified token, Document ID) pairs.

Doc 1: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious

Indexer steps: Sort (Sec. 1.2)
§ Sort by terms (at least conceptually)
§ And then docID
§ This is the core indexing step

Indexer steps: Dictionary & Postings (Sec. 1.2)
§ Multiple term entries in a single document are merged.
§ Split into Dictionary and Postings
§ Doc. frequency information is added.
§ Why frequency? Will discuss later.
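A minimal Python sketch of these indexer steps, run over the two Caesar documents above (the tokenizer is a crude stand-in for real tokenization and linguistic modules; all names are my illustration, not from the handout):

    from collections import defaultdict

    docs = {
        1: "I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.",
        2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious",
    }

    def tokenize(text):
        # Crude stand-in: lowercase and strip surrounding punctuation.
        return [t.strip(".,;'").lower() for t in text.split()]

    # Token sequence: (modified token, docID) pairs, sorted by term, then docID.
    pairs = sorted((tok, doc_id) for doc_id, text in docs.items()
                   for tok in tokenize(text))

    # Dictionary & Postings: merge repeats within a document; df = list length.
    postings = defaultdict(list)
    for term, doc_id in pairs:
        if not postings[term] or postings[term][-1] != doc_id:
            postings[term].append(doc_id)

    for term in sorted(postings):
        print(term, len(postings[term]), postings[term])   # term, doc. freq., postings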

Where do we pay in storage? (Sec. 1.2)
§ The dictionary: terms and counts
§ The postings: lists of docIDs, reached via pointers from the dictionary
§ IR system implementation:
§ How do we index efficiently?
§ How much storage do we need?

Query processing with an inverted index
The index we just built (Sec. 1.3)
§ How do we process a query? (our focus)
§ Later: what kinds of queries can we process?

Query processing: AND (Sec. 1.3)
§ Consider processing the query: Brutus AND Caesar
§ Locate Brutus in the Dictionary; retrieve its postings.
§ Locate Caesar in the Dictionary; retrieve its postings.
§ "Merge" the two postings (intersect the document sets):

Brutus → 2 4 8 16 32 64 128
Caesar → 1 2 3 5 8 13 21 34

The merge (Sec. 1.3)
§ Intersecting two postings lists: a "merge" algorithm
§ Walk through the two postings simultaneously, in time linear in the total number of postings entries

Brutus → 2 4 8 16 32 64 128
Caesar → 1 2 3 5 8 13 21 34

§ If the list lengths are x and y, the merge takes O(x+y) operations.
§ Crucial: postings sorted by docID.
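A minimal Python rendering of the merge (the algorithm is from the lecture; this particular code and its names are mine):

    def intersect(p1, p2):
        # Intersect two postings lists sorted by docID in O(x + y) time.
        answer, i, j = [], 0, 0
        while i < len(p1) and j < len(p2):
            if p1[i] == p2[j]:
                answer.append(p1[i])
                i, j = i + 1, j + 1
            elif p1[i] < p2[j]:
                i += 1
            else:
                j += 1
        return answer

    print(intersect([2, 4, 8, 16, 32, 64, 128], [1, 2, 3, 5, 8, 13, 21, 34]))  # [2, 8]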

The Boolean Retrieval Model & Extended Boolean Models

Boolean queries: Exact match (Sec. 1.3)
§ The Boolean retrieval model is being able to ask a query that is a Boolean expression:
§ Boolean queries are queries using AND, OR and NOT to join query terms
§ Views each document as a set of words
§ Is precise: document matches condition or not.
§ Perhaps the simplest model to build an IR system on
§ Primary commercial retrieval tool for 3 decades.
§ Many search systems you still use are Boolean:
§ Email, library catalog, macOS Spotlight
Example: WestLaw (http://www.westlaw.com/) (Sec. 1.4)
§ Largest commercial (paying subscribers) legal search service (started 1975; ranking added 1992; new federated search added 2010)
§ Tens of terabytes of data; ~700,000 users
§ Majority of users still use boolean queries
§ Example query:
§ What is the statute of limitations in cases involving the federal tort claims act?
§ LIMIT! /3 STATUTE ACTION /S FEDERAL /2 TORT /3 CLAIM
§ /3 = within 3 words, /S = in same sentence

Example: WestLaw, continued (Sec. 1.4)
§ Another example query:
§ Requirements for disabled people to be able to access a workplace
§ disabl! /p access! /s work-site work-place (employment /3 place)
§ Note that SPACE is disjunction, not conjunction!
§ Long, precise queries; proximity operators; incrementally developed; not like web search
§ Many professional searchers still like Boolean search
§ You know exactly what you are getting
§ But that doesn't mean it actually works better…

Boolean queries: More general merges (Sec. 1.3)
§ Exercise: Adapt the merge for the queries:
§ Brutus AND NOT Caesar
§ Brutus OR NOT Caesar
§ Can we still run through the merge in time O(x+y)?
§ What can we achieve? (One possible adaptation is sketched after the next slide.)

Merging (Sec. 1.3)
§ What about an arbitrary Boolean formula?
§ (Brutus OR Caesar) AND NOT (Antony OR Cleopatra)
§ Can we always merge in "linear" time?
§ Linear in what?
§ Can we do better?
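For the first exercise query, Brutus AND NOT Caesar, one possible adaptation (a hedged sketch, not the handout's reference solution) keeps the O(x+y) walk but outputs docIDs found only in the first list:

    def and_not(p1, p2):
        # Docs in p1 but not in p2; both postings lists sorted by docID; O(x + y).
        answer, i, j = [], 0, 0
        while i < len(p1):
            if j == len(p2) or p1[i] < p2[j]:
                answer.append(p1[i])   # p1[i] can no longer appear in p2
                i += 1
            elif p1[i] == p2[j]:
                i, j = i + 1, j + 1    # excluded by the NOT
            else:
                j += 1
        return answer

    print(and_not([2, 4, 8, 16, 32, 64, 128], [1, 2, 3, 5, 8, 13, 21, 34]))
    # [4, 16, 32, 64, 128]

Brutus OR NOT Caesar is the harder case: its answer contains every document outside Caesar's postings, so it cannot even be written out in time linear in the two lists.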

Query optimization (Sec. 1.3)
§ What is the best order for query processing?
§ Consider a query that is an AND of n terms.
§ For each of the n terms, get its postings, then AND them together.

Brutus → 2 4 8 16 32 64 128
Caesar → 1 2 3 5 8 16 21 34
Calpurnia → 13 16

Query: Brutus AND Calpurnia AND Caesar

Query optimization example (Sec. 1.3)
§ Process in order of increasing freq:
§ start with smallest set, then keep cutting further.
§ This is why we kept document freq. in the dictionary.
§ Execute the query as (Calpurnia AND Brutus) AND Caesar.
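A minimal Python sketch of this ordering heuristic, reusing the intersect function from above (the heuristic is from the slide; the code and names are mine):

    def intersect_many(terms, index):
        # AND the terms' postings in order of increasing document frequency.
        ordered = sorted(terms, key=lambda t: len(index[t]))
        result = index[ordered[0]]
        for term in ordered[1:]:
            result = intersect(result, index[term])
            if not result:
                break               # empty intermediate result: stop early
        return result

    index = {
        "Brutus":    [2, 4, 8, 16, 32, 64, 128],
        "Caesar":    [1, 2, 3, 5, 8, 16, 21, 34],
        "Calpurnia": [13, 16],
    }
    print(intersect_many(["Brutus", "Calpurnia", "Caesar"], index))   # [16]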
Exercise (Sec. 1.3)
§ Recommend a query processing order for:
(tangerine OR trees) AND (marmalade OR skies) AND (kaleidoscope OR eyes)

Term          Freq
eyes          213312
kaleidoscope   87009
marmalade     107913
skies         271658
tangerine      46653
trees         316812

§ Which two terms should we process first?

More general optimization (Sec. 1.3)
§ e.g., (madding OR crowd) AND (ignoble OR strife)
§ Get doc. freq.'s for all terms.
§ Estimate the size of each OR by the sum of its doc. freq.'s (conservative).
§ Process in increasing order of OR sizes.
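Applying this estimate to the exercise (my arithmetic, not from the handout): kaleidoscope OR eyes ≤ 87009 + 213312 = 300321, tangerine OR trees ≤ 46653 + 316812 = 363465, and marmalade OR skies ≤ 107913 + 271658 = 379571, so by this heuristic one would process (kaleidoscope OR eyes) first, then (tangerine OR trees), then (marmalade OR skies).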

Query processing exercises
§ Exercise: If the query is friends AND romans AND (NOT countrymen), how could we use the freq of countrymen?
§ Exercise: Extend the merge to an arbitrary Boolean query. Can we always guarantee execution in time linear in the total postings size?
§ Hint: Begin with the case of a Boolean formula query: in this, each query term appears only once in the query.

Exercise
§ Try the search feature at http://www.rhymezone.com/shakespeare/
§ Write down five search features you think it could do better

Phrase queries and positional indexes

Phrase queries (Sec. 2.4)
§ We want to be able to answer queries such as "stanford university" – as a phrase
§ Thus the sentence "I went to university at Stanford" is not a match.
§ The concept of phrase queries has proven easily understood by users; one of the few "advanced search" ideas that works
§ Many more queries are implicit phrase queries
§ For this, it no longer suffices to store only <term : docs> entries
A first attempt: Biword indexes (Sec. 2.4.1)
§ Index every consecutive pair of terms in the text as a phrase
§ For example, the text "Friends, Romans, Countrymen" would generate the biwords:
§ friends romans
§ romans countrymen
§ Each of these biwords is now a dictionary term
§ Two-word phrase query-processing is now immediate.

Longer phrase queries (Sec. 2.4.1)
§ Longer phrases can be processed by breaking them down
§ stanford university palo alto can be broken into the Boolean query on biwords:
stanford university AND university palo AND palo alto
§ Without the docs, we cannot verify that the docs matching the above Boolean query do contain the phrase.
§ Can have false positives!
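A minimal Python sketch of biword generation and of breaking a longer phrase into a biword query (the function name is mine):

    def biwords(tokens):
        # Every consecutive pair of terms becomes a single dictionary term.
        return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]

    print(biwords(["friends", "romans", "countrymen"]))
    # ['friends romans', 'romans countrymen']

    # A longer phrase becomes an AND of its biwords. A matching doc can
    # still be a false positive: the biwords may occur without the phrase.
    print(" AND ".join(biwords(["stanford", "university", "palo", "alto"])))
    # stanford university AND university palo AND palo alto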

Issues for biword indexes (Sec. 2.4.1)
§ False positives, as noted before
§ Index blowup due to a bigger dictionary
§ Infeasible for more than biwords, big even for them
§ Biword indexes are not the standard solution (for all biwords) but can be part of a compound strategy

Solution 2: Positional indexes (Sec. 2.4.2)
§ In the postings, store, for each term, the position(s) at which its tokens appear:

<term, number of docs containing term;
doc1: position1, position2 … ;
doc2: position1, position2 … ;
etc.>

Positional index example (Sec. 2.4.2)

<be: 993427;
1: 7, 18, 33, 72, 86, 231;
2: 3, 149;
4: 17, 191, 291, 430, 434;
5: 363, 367, …>

§ Which of docs 1, 2, 4, 5 could contain "to be or not to be"?
§ For phrase queries, we use a merge algorithm recursively at the document level
§ But we now need to deal with more than just equality

Processing a phrase query (Sec. 2.4.2)
§ Extract inverted index entries for each distinct term: to, be, or, not.
§ Merge their doc:position lists to enumerate all positions with "to be or not to be".
§ to: 2:1,17,74,222,551; 4:8,16,190,429,433; 7:13,23,191; ...
§ be: 1:17,19; 4:17,191,291,430,434; 5:14,19,101; ...
§ Same general method for proximity searches
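A minimal Python sketch of this document-level positional merge, using the to/be postings above (a simplification: real implementations walk the sorted position lists linearly rather than building sets, and for proximity the exact-offset test relaxes to a within-k test; names and structure are mine):

    # Positional postings: term -> {docID: sorted positions}.
    index = {
        "to": {2: [1, 17, 74, 222, 551], 4: [8, 16, 190, 429, 433], 7: [13, 23, 191]},
        "be": {1: [17, 19], 4: [17, 191, 291, 430, 434], 5: [14, 19, 101]},
    }

    def phrase_docs(terms, index):
        # Docs where the i-th term occurs at position p + i for some start p.
        docs = set(index[terms[0]])
        for t in terms[1:]:
            docs &= set(index[t])                 # candidates contain every term
        hits = []
        for d in sorted(docs):
            starts = set(index[terms[0]][d])
            for i, t in enumerate(terms[1:], start=1):
                starts &= {p - i for p in index[t][d]}   # starts that extend the phrase
            if starts:
                hits.append(d)
        return hits

    print(phrase_docs(["to", "be"], index))   # [4], e.g. "to" at 16, "be" at 17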
Proximity queries (Sec. 2.4.2)
§ LIMIT! /3 STATUTE /3 FEDERAL /2 TORT
§ Again, here, /k means "within k words of".
§ Clearly, positional indexes can be used for such queries; biword indexes cannot.
§ Exercise: Adapt the linear merge of postings to handle proximity queries. Can you make it work for any value of k?
§ This is a little tricky to do correctly and efficiently
§ See Figure 2.12 of IIR

Positional index size (Sec. 2.4.2)
§ A positional index expands postings storage substantially
§ Even though indices can be compressed
§ Nevertheless, a positional index is now standardly used because of the power and usefulness of phrase and proximity queries … whether used explicitly or implicitly in a ranking retrieval system.

Positional index size (Sec. 2.4.2)
§ Need an entry for each occurrence, not just once per document
§ Index size depends on average document size. Why?
§ Average web page has <1000 terms
§ SEC filings, books, even some epic poems … easily 100,000 terms
§ Consider a term with frequency 0.1%:

Document size    Postings    Positional postings
1000             1           1
100,000          1           100

(Expected occurrences: once in a 1000-word document, 100 times in a 100,000-word one. Either way the term contributes a single docID posting, but 1 vs. 100 positional postings.)

Rules of thumb (Sec. 2.4.2)
§ A positional index is 2–4 times as large as a non-positional index
§ Positional index size is 35–50% of the volume of the original text
§ Caveat: all of this holds for "English-like" languages

Combination schemes (Sec. 2.4.3)
§ These two approaches can be profitably combined
§ For particular phrases ("Michael Jackson", "Britney Spears") it is inefficient to keep on merging positional postings lists
§ Even more so for phrases like "The Who"
§ Williams et al. (2004) evaluate a more sophisticated mixed indexing scheme
§ A typical web query mixture was executed in ¼ of the time of using just a positional index
§ It required 26% more space than having a positional index alone
