Lucene
Agenda
- Walk through for Project 2
- Lucene
  - Introduction and package overview
  - Hands on: simple indexing and querying
  - Cook book: some recipes
  - Applications to Project 2
- Recap of concepts
PROJECT TWO
IR models, query processing and evaluation
terms, etc.
- Can you tweak parameters to perform better?
- Can you simulate Boolean queries?
- Did some models respond better to query processing (QP) than others?
- Any interesting patterns, technique-wise or query-wise?
THE THEORY
What concepts would you need?
Relevance models
- VSM (vector space model) or tf-idf
- Boolean
- BM25
- LM (language models)
- DFR (divergence from randomness)
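To make one of the models above concrete, here is a minimal BM25 sketch over a toy in-memory collection. It is an illustration only, not Lucene's implementation: the class name `Bm25Sketch`, the list-of-tokens document representation, and the parameter values k1 = 1.2 and b = 0.75 (common defaults) are all assumptions made for this example.

```java
import java.util.Collections;
import java.util.List;

/** Toy BM25 scorer; documents are plain lists of already-analyzed tokens. */
final class Bm25Sketch {
    // Assumed parameters: K1 controls term-frequency saturation,
    // B controls how strongly document length is normalized.
    static final double K1 = 1.2;
    static final double B = 0.75;

    /** Inverse document frequency, with +1 inside the log so it is never negative. */
    static double idf(int docCount, int docFreq) {
        return Math.log(1.0 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    /** BM25 score of one document for a query, relative to the given collection. */
    static double score(List<String> query, List<String> doc, List<List<String>> collection) {
        double avgLen = collection.stream().mapToInt(List::size).average().orElse(0.0);
        double total = 0.0;
        for (String term : query) {
            int tf = Collections.frequency(doc, term);
            if (tf == 0) {
                continue; // term absent from this document: contributes nothing
            }
            int df = 0;
            for (List<String> d : collection) {
                if (d.contains(term)) {
                    df++;
                }
            }
            // Length normalization: longer-than-average documents are penalized.
            double norm = K1 * (1.0 - B + B * doc.size() / avgLen);
            total += idf(collection.size(), df) * (tf * (K1 + 1.0)) / (tf + norm);
        }
        return total;
    }
}
```

Changing `K1` and `B` is one simple way to explore the "can you tweak parameters to perform better?" question on a small collection before touching the real index.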
Evaluation metrics
- Precision, recall and F-measure (single-valued)
  - tp, fp, tn, fn
- Based on rank
  - Average precision
  - Mean Average Precision: MAP, GM MAP
  - Precision @ k
  - R-precision
  - Reciprocal rank
  - Bpref (preference-based)
- Incremental value?
  - DCG
  - nDCG
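Several of the metrics above fit in a few lines each. The sketch below is a plain-Java illustration, assuming documents are identified by strings, relevance judgments are a set of relevant ids, and graded gains are small integers; the class and method names are invented for this example. The DCG variant used here divides the gain at rank r by log2(r + 1), which is one of several formulations in use.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;

/** Toy implementations of a few set-based and rank-based evaluation metrics. */
final class EvalMetricsSketch {
    static double precision(int tp, int fp) {
        return tp + fp == 0 ? 0.0 : (double) tp / (tp + fp);
    }

    static double recall(int tp, int fn) {
        return tp + fn == 0 ? 0.0 : (double) tp / (tp + fn);
    }

    /** Balanced F-measure: harmonic mean of precision and recall. */
    static double f1(double p, double r) {
        return p + r == 0.0 ? 0.0 : 2.0 * p * r / (p + r);
    }

    /** Fraction of the top k results that are relevant. */
    static double precisionAtK(List<String> ranked, Set<String> relevant, int k) {
        if (k <= 0) {
            return 0.0;
        }
        int hits = 0;
        for (int i = 0; i < Math.min(k, ranked.size()); i++) {
            if (relevant.contains(ranked.get(i))) {
                hits++;
            }
        }
        return (double) hits / k;
    }

    /** Average of precision values taken at each rank where a relevant document appears. */
    static double averagePrecision(List<String> ranked, Set<String> relevant) {
        if (relevant.isEmpty()) {
            return 0.0;
        }
        int hits = 0;
        double sum = 0.0;
        for (int i = 0; i < ranked.size(); i++) {
            if (relevant.contains(ranked.get(i))) {
                hits++;
                sum += (double) hits / (i + 1);
            }
        }
        return sum / relevant.size();
    }

    /** 1 / rank of the first relevant result, or 0 if none is retrieved. */
    static double reciprocalRank(List<String> ranked, Set<String> relevant) {
        for (int i = 0; i < ranked.size(); i++) {
            if (relevant.contains(ranked.get(i))) {
                return 1.0 / (i + 1);
            }
        }
        return 0.0;
    }

    /** DCG with gain / log2(rank + 1), ranks starting at 1. */
    static double dcg(int[] gains) {
        double sum = 0.0;
        for (int i = 0; i < gains.length; i++) {
            sum += gains[i] / (Math.log(i + 2) / Math.log(2));
        }
        return sum;
    }

    /** DCG divided by the DCG of the ideal (descending-gain) ordering. */
    static double ndcg(int[] gains) {
        int[] ideal = gains.clone();
        Arrays.sort(ideal);
        for (int i = 0, j = ideal.length - 1; i < j; i++, j--) {
            int tmp = ideal[i];
            ideal[i] = ideal[j];
            ideal[j] = tmp;
        }
        double idcg = dcg(ideal);
        return idcg == 0.0 ? 0.0 : dcg(gains) / idcg;
    }
}
```

MAP is then just the arithmetic mean of `averagePrecision` over all queries, and GM MAP the geometric mean of the same per-query values.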
LUCENE
A 100% Java Text IR engine
Introduction
- Originally written in 1999 by Doug Cutting
- Part of the Apache Software Foundation family
- Full-text indexing and searching
- Flexible, extensible
- Spawned several related projects:
  - Nutch: Web crawling + HTML parsing
  - Solr, ElasticSearch: Enterprise search servers
- Used by:
  - AOL, Apple, CiteSeerX, Eclipse, IBM, JIRA, LinkedIn, Twitter, etc.
Package overview
Package    Usage
analysis   Converts text from a Reader into a TokenStream. An Analyzer combines a Tokenizer and TokenFilters to create a TokenStream
document   Simple Document class that is simply a collection of Fields
index      Primarily composed of IndexWriter and IndexReader
search     Data structures to represent queries, an IndexSearcher to search over docs and QueryParsers to convert strings to queries
store      Abstractions for storing persistent data
util       Just a bunch of handy classes
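The idea behind the analysis package, a tokenizer followed by a chain of token filters, can be sketched without Lucene at all. The code below is a conceptual stand-in only: `ToyAnalyzer`, its methods, and the tiny stop-word list are hypothetical and do not correspond to Lucene classes or to what StandardAnalyzer actually does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Conceptual tokenizer + filter chain (plain Java, not the Lucene API). */
final class ToyAnalyzer {
    // Assumed stop-word list, deliberately tiny for the example.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "of", "and"));

    /** "Tokenizer": split on anything that is not a letter or digit. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** "TokenFilter" 1: lowercase every token. */
    static List<String> lowerCase(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            out.add(t.toLowerCase(Locale.ROOT));
        }
        return out;
    }

    /** "TokenFilter" 2: drop stop words. */
    static List<String> removeStopWords(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            if (!STOP_WORDS.contains(t)) {
                out.add(t);
            }
        }
        return out;
    }

    /** "Analyzer": the tokenizer followed by the filters, in order. */
    static List<String> analyze(String text) {
        return removeStopWords(lowerCase(tokenize(text)));
    }

    public static void main(String[] args) {
        System.out.println(analyze("The Art of Lucene, 2nd Edition"));
    }
}
```

Swapping a filter in or out of `analyze` changes which terms end up in the index, which is why the same chain generally needs to be applied to queries as well.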
Getting started
- Pre-shipped files: IndexFiles and SearchFiles
- IndexFiles: used to create an index
    java -cp lucene-core.jar:lucene-demo.jar:lucene-analyzers-common.jar org.apache.lucene.demo.IndexFiles -index <directory> -docs <directory>
- SearchFiles: used to query the index
    java -cp lucene-core.jar:lucene-demo.jar:lucene-queryparser.jar:lucene-analyzers-common.jar org.apache.lucene.demo.SearchFiles
IndexFiles
- Set up the Directory, Analyzer and IndexWriter:
    Directory dir = FSDirectory.open(new File(indexPath)); // where the index is stored on disk
    Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_40); // standard, out of the box; different analyzers tokenize differently
    IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_40, analyzer);
    IndexWriter writer = new IndexWriter(dir, iwc);
- Construct a Document, add Fields and add it to the writer:
    Document doc = new Document();
    Field pathField = new StringField("path", file.getPath(), Field.Store.YES); // indexed as a single token and stored
    doc.add(pathField);
    doc.add(new TextField("contents", new BufferedReader(new InputStreamReader(fis, "UTF-8")))); // tokenized and indexed, not stored
    writer.addDocument(doc); // add the document to the writer
SearchFiles
- Set up the IndexReader, IndexSearcher and QueryParser:
    IndexReader reader = DirectoryReader.open(FSDirectory.open(new File(index))); // read the physical index
    IndexSearcher searcher = new IndexSearcher(reader);
    Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_40); // same analyzer as at index time
    QueryParser parser = new QueryParser(Version.LUCENE_40, field, analyzer); // set up the parser: needs a default field and an analyzer
- Parse the input text into a Query:
    Query query = parser.parse(line);
- Run the search and get the results:
    TopDocs results = searcher.search(query, 100); // top 100 hits
    ScoreDoc[] hits = results.scoreDocs;
Some recipes
- Indexing:
  - Use different analyzers
  - Play around with documents and fields
- Searching:
  - Different query parsers
  - Powerful query syntax: Boolean queries, phrase queries, proximity queries, field boosting, etc.
- Similarity classes:
  - Default (TF-IDF) plus other variants: BM25, LM, DFR
  - Define how relevance ranking is done
  - New in Lucene 4: per-field similarity
Applications to project 2
- The wrapper code makes the corresponding calls to
- Query parser:
  - Simplest is to boost fields: rank one higher than others
  - Machine-learned weights? Can be treated as a classic regression problem
  - Can write more complicated queries:
    - POS tagging: some tags are more important than others
    - SVO analysis: what's the subject? Interchangeable verbs, etc.
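The field-boosting idea above amounts to weighting each field's contribution before combining them. The following plain-Java sketch shows that arithmetic only; the field names, scores and boost values are made up, the class `FieldBoostSketch` is hypothetical, and Lucene's own scoring combines boosts in a more involved way.

```java
import java.util.HashMap;
import java.util.Map;

/** Weighted sum of per-field scores; fields without an explicit boost get weight 1. */
final class FieldBoostSketch {
    static double combine(Map<String, Double> fieldScores, Map<String, Double> boosts) {
        double total = 0.0;
        for (Map.Entry<String, Double> e : fieldScores.entrySet()) {
            total += boosts.getOrDefault(e.getKey(), 1.0) * e.getValue();
        }
        return total;
    }

    public static void main(String[] args) {
        // Hypothetical per-field scores for one document.
        Map<String, Double> scores = new HashMap<>();
        scores.put("title", 1.0);
        scores.put("contents", 2.0);

        // Hand-set boost: a match in the title counts three times as much.
        Map<String, Double> boosts = new HashMap<>();
        boosts.put("title", 3.0);

        System.out.println(combine(scores, boosts));
    }
}
```

With machine-learned weights, the `boosts` map would be filled from a fitted regression model instead of being set by hand; the combination step stays the same.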