Lucene Ranking
Introduction
Lucene is a powerful search framework capable of indexing several gigabytes of document data and then quickly performing complex searches on that data. Lucene can also process data beyond raw text. Typically this consists of data about the documents being indexed, for example title information or document authors. Lucene provides a scoring algorithm that includes this additional data to find the best matches to document queries. The default scoring algorithm is fairly complex and considers such factors as the frequency of a particular query term within individual documents and the frequency of the term in the total population of documents.
Ranking Data
This article is concerned with indexing and searching structured data only. For our example we will be searching car data, in particular the following data:

Color   Transmission   Make     Mileage   Year
Red     Stick          Ford     10000     1996
Red     Automatic      Chevy    20000     1997
Blue    Stick          Ford     5000      1996
Green   Automatic      Honda    10000     2002
Red     Automatic      Toyota   25000     1999
Black   Stick          Chevy    30000     1995
A traditional solution to this problem would be to use a relational database to store the data and SQL to perform the query. This works well for finding the data, but if we want to rank the results according to how well they match the search we must implement a custom ranking algorithm. Another solution is to use Lucene to index and query this data. This is a straightforward process. Each row in the table is represented as a Lucene Document object. Each column of the data is represented as a Lucene Field object (in this particular case a Keyword Field). See listing 1 for an example program. Its query asks for a red Ford with an automatic transmission, a mileage between 0 and 10001, and a year of 1995 or later. When we execute this search we get the following results:
0.9318453 : Red, Stick, Ford, 000010000, 1996
0.5757549 : Blue, Stick, Ford, 000005000, 1996
0.49302885 : Red, Automatic, Chevy, 000020000, 1997
0.49302885 : Red, Automatic, Toyota, 000025000, 1999
0.43772483 : Green, Automatic, Honda, 000010000, 2002
0.0299615 : Black, Stick, Chevy, 000030000, 1995
As we should expect, the row that most closely matches our query appears with the highest score, and the row that matches our query the least appears last in the results. The results in the middle of the group all match an equal number of criteria (3 out of 5) but produce different scores. These scores differ because the Lucene scoring algorithm considers the rarity of a matched term within the global space of all terms for a given field. In other words, a match on a term that is not very common in the data is given a higher score.

What we are looking for is scores that directly reflect the number of matched fields. Our target for the above results would therefore be 0.8, 0.6, 0.6, 0.6, 0.6, and 0.2 respectively. A value of 1.0 would indicate 5 out of 5 matches and a score of 0.2 would indicate 1 out of 5 matches.

Lucene provides a hook into the scoring mechanism, org.apache.lucene.search.Similarity. Six factors are used to compute a score for a document: term or phrase frequency (tf), term document frequency (idf), boost (getBoost), term field normalization (lengthNorm), term coordination (coord), and query normalization (queryNorm). Of these factors, 3 are of interest in modifying the Similarity function to meet our needs: term document frequency, term field normalization, and term coordination.

According to the Lucene Javadoc for Similarity, the term document frequency factor "Computes a score factor based on a term's document frequency (the number of documents which contain the term)." This means that the score of a matched term increases with that term's rarity. The default formula for this function is

$$\mathrm{idf} = \log\left(\frac{\mathrm{numdocs}}{\mathrm{docFreq} + 1}\right) + 1$$
where numdocs is the total number of documents that have been indexed and docFreq is the number of documents that contain this term. From this formula we can see that the lower the value of docFreq, the higher the value of idf. The desired behavior is that the rarity of a term should have no effect on the score of the document. The easiest way to accomplish this is to set this value to a constant. Since scores are normalized (as we'll see later) a value of 1.0 will suffice.

The next term of interest is the term field normalization. From the Lucene Javadoc: "Computes the normalization value for a field given the total number of terms contained in a field. These values, together with field boosts, are stored in an index and multiplied into scores for hits on each field by the search code." This means that a field's contribution to the document score shrinks as the number of terms contained in that field grows. For our demo application this factor is not important since each field contains a single term. However, for our purposes we do not want the number of terms in a field to influence the score (see listing 4 for a demonstration of this factor). The default implementation of this factor is

$$\mathrm{lengthNorm} = \frac{1}{\sqrt{\mathrm{numTerms}}}$$
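To give a feel for the size of these two default factors, here is a small sketch in plain Java with no Lucene dependency. It assumes the usual default forms, idf = log(numDocs / (docFreq + 1)) + 1 and lengthNorm = 1 / sqrt(numTerms); the class name DefaultFactors and its methods are made up for this illustration and are not part of Lucene.

```java
public class DefaultFactors {
    // Assumed default idf: log(numDocs / (docFreq + 1)) + 1.
    public static double idf(int docFreq, int numDocs) {
        return Math.log(numDocs / (double) (docFreq + 1)) + 1.0;
    }

    // Assumed default lengthNorm: 1 / sqrt(numTerms).
    public static double lengthNorm(int numTerms) {
        return 1.0 / Math.sqrt(numTerms);
    }

    public static void main(String[] args) {
        int numDocs = 6; // six cars in the example table

        // "Red" appears in 3 rows of the Color column, "Blue" in only 1.
        System.out.println("idf(Red)  = " + idf(3, numDocs));
        System.out.println("idf(Blue) = " + idf(1, numDocs));

        // A field holding one term versus a field holding three terms.
        System.out.println("lengthNorm(1) = " + lengthNorm(1));
        System.out.println("lengthNorm(3) = " + lengthNorm(3));
    }
}
```

The rarer value (Blue) gets the larger idf, and the longer field gets the smaller lengthNorm, which is exactly the behavior we want to switch off.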
To eliminate lengthNorm from the score we once again use a constant, lengthNorm = 1.0.

The final of our 3 terms is the term coordination. From the Lucene Javadoc: "Computes a score factor based on the fraction of all query terms that a document contains." This means that the more of the query's terms a document contains, the higher the score. The default implementation for this factor is

$$\mathrm{coord} = \frac{\mathrm{overlap}}{\mathrm{maxOverlap}}$$

where overlap is the number of query terms matched in the document and maxOverlap is the total number of terms present in the query. Once again, we remove this as a factor from scoring by setting the value to a constant 1.0. All of these changes result in the following Similarity implementation.
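// Holds constant every factor that would move the score away from a plain match count.
class IsolationSimilarity extends DefaultSimilarity {
    public IsolationSimilarity() {
    }

    // Term rarity no longer affects the score.
    public float idf(int docFreq, int numDocs) {
        return 1.0f;
    }

    // The fraction of matched query terms is not applied as a multiplier.
    public float coord(int overlap, int maxOverlap) {
        return 1.0f;
    }

    // The number of terms in a field no longer affects the score.
    public float lengthNorm(String fieldName, int numTerms) {
        return 1.0f;
    }
}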
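As a rough illustration of the overlap / maxOverlap fraction described above, the following sketch counts how many of the five query criteria each example row satisfies and divides by five. It is plain Java without Lucene; the names CoordDemo, overlap and coord are made up for this example, and the criteria are hard-coded from the query in Listing 1.

```java
public class CoordDemo {
    // Rows of the example table: color, transmission, make, mileage, year.
    public static final String[][] ROWS = {
        {"Red", "Stick", "Ford", "10000", "1996"},
        {"Red", "Automatic", "Chevy", "20000", "1997"},
        {"Blue", "Stick", "Ford", "5000", "1996"},
        {"Green", "Automatic", "Honda", "10000", "2002"},
        {"Red", "Automatic", "Toyota", "25000", "1999"},
        {"Black", "Stick", "Chevy", "30000", "1995"}
    };

    // Counts how many of the five example criteria a row satisfies.
    public static int overlap(String[] row) {
        int mileage = Integer.parseInt(row[3]);
        int year = Integer.parseInt(row[4]);
        int matched = 0;
        if (row[0].equals("Red")) matched++;
        if (row[1].equals("Automatic")) matched++;
        if (row[2].equals("Ford")) matched++;
        if (mileage >= 0 && mileage <= 10001) matched++; // inclusive range
        if (year >= 1995 && year <= 9999) matched++;     // inclusive range
        return matched;
    }

    // The coordination fraction: matched criteria over total criteria.
    public static double coord(int overlap, int maxOverlap) {
        return overlap / (double) maxOverlap;
    }

    public static void main(String[] args) {
        for (String[] row : ROWS) {
            int matched = overlap(row);
            System.out.println(String.join(", ", row) + " : "
                    + matched + " of 5, coord = " + coord(matched, 5));
        }
    }
}
```

The fractions correspond to the 4-of-5, 3-of-5 and 1-of-5 match counts discussed earlier.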
Listing 2 shows the changes to the code with the new Similarity in place. Running the program results in the following:
1.0 : Red, Stick, Ford, 000010000, 1996
0.75 : Red, Automatic, Chevy, 000020000, 1997
0.75 : Blue, Stick, Ford, 000005000, 1996
0.75 : Green, Automatic, Honda, 000010000, 2002
0.75 : Red, Automatic, Toyota, 000025000, 1999
0.25 : Black, Stick, Chevy, 000030000, 1995
These results are close to our desired values but we still have a problem. The top match in our results receives a score of 1.0 even though we don't have a 5 out of 5 match. This occurs because Lucene normalizes the scores of the hits. With our changed Similarity function, the result is that the highest scoring document always receives a score of 1.0. We can change this by introducing a document that is not part of the normal data set (and therefore should be removed from the final hit list), that will always be part of the hit list, and that will always score 1.0. We term this document the ringer document.

The values entered into the ringer document should be values that will not occur in any of the other documents. For our example dataset we use a ringer document containing the value "-1" for all fields. This document is then indexed like any other document. The next step is to construct the query such that the ringer document is always returned. We do this by creating an overall query that combines our normal query with a ringer query constructed specifically to target our ringer document. Listing 3 contains the construction of the ringer document and query. Running it produces the following results:
1.0 : -1, -1, -1, -1, -1
0.8 : Red, Stick, Ford, 000010000, 1996
0.6 : Red, Automatic, Chevy, 000020000, 1997
0.6 : Blue, Stick, Ford, 000005000, 1996
0.6 : Green, Automatic, Honda, 000010000, 2002
0.6 : Red, Automatic, Toyota, 000025000, 1999
0.2 : Black, Stick, Chevy, 000030000, 1995
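The effect of the ringer on normalization can be checked with simple arithmetic. The sketch below is a simplified model, not Lucene's scoring code: it assumes each raw score is just the number of matched clauses and that normalization divides every score by the largest one. The class name RingerNormalization is made up for this illustration.

```java
import java.util.Arrays;

public class RingerNormalization {
    // Divides every raw score by the largest one, so the top hit becomes 1.0.
    public static double[] normalize(double[] raw) {
        double max = 0.0;
        for (double r : raw) {
            max = Math.max(max, r);
        }
        double[] out = new double[raw.length];
        for (int i = 0; i < raw.length; i++) {
            out[i] = (max == 0.0) ? 0.0 : raw[i] / max;
        }
        return out;
    }

    public static void main(String[] args) {
        // Assumed raw scores: one point per matched clause (out of 5).
        double[] withoutRinger = {4, 3, 3, 3, 3, 1};
        // The ringer matches all 5 of its own clauses, so it leads the list.
        double[] withRinger = {5, 4, 3, 3, 3, 3, 1};

        System.out.println(Arrays.toString(normalize(withoutRinger)));
        System.out.println(Arrays.toString(normalize(withRinger)));
    }
}
```

Under these assumptions the first line reproduces the 1.0 / 0.75 / 0.25 pattern seen without the ringer, and the second reproduces the 1.0 / 0.8 / 0.6 / 0.2 pattern seen with it.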
The ringer document has been included in these results for demonstration purposes only. Normally it would be excluded from the hit results when the results are presented. It should also be kept in mind that the number of results will need to be decremented by one to account for the ringer document.
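One way to hide the ringer before presenting results is sketched below. It works on plain string arrays rather than Lucene Hits so that it runs on its own; the class name RingerFilter and the choice to identify the ringer by a "-1" in the first column are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;

public class RingerFilter {
    public static final String RINGER_VALUE = "-1";

    // Returns the hits without the ringer row, identified by "-1" in its first column.
    public static List<String[]> withoutRinger(List<String[]> hits) {
        List<String[]> kept = new ArrayList<>();
        for (String[] row : hits) {
            if (!RINGER_VALUE.equals(row[0])) {
                kept.add(row);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String[]> hits = new ArrayList<>();
        hits.add(new String[] {"-1", "-1", "-1", "-1", "-1"});
        hits.add(new String[] {"Red", "Stick", "Ford", "000010000", "1996"});
        hits.add(new String[] {"Black", "Stick", "Chevy", "000030000", "1995"});

        List<String[]> shown = withoutRinger(hits);
        // The displayed count is one less than the raw hit count.
        System.out.println(shown.size() + " of " + hits.size() + " hits shown");
    }
}
```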
Listing 1
import org.apache.lucene.store.*;
import org.apache.lucene.index.*;
import org.apache.lucene.document.*;
import org.apache.lucene.search.*;
import org.apache.lucene.analysis.standard.*;

public class ExampleDriver {
    // Columns: color, transmission, manufacturer, zero-padded mileage, year.
    public static final String[][] dataArr = new String[][]{
        {"Red", "Stick", "Ford", "000010000", "1996"},
        {"Red", "Automatic", "Chevy", "000020000", "1997"},
        {"Blue", "Stick", "Ford", "000005000", "1996"},
        {"Green", "Automatic", "Honda", "000010000", "2002"},
        {"Red", "Automatic", "Toyota", "000025000", "1999"},
        {"Black", "Stick", "Chevy", "000030000", "1995"}
    };

    public static void main( String[] args ){
        try {
            // Index each row as a Document with one Keyword field per column.
            RAMDirectory rd = new RAMDirectory();
            IndexWriter iw = new IndexWriter( rd, new StandardAnalyzer(), true );
            for ( int i = 0 ; i < dataArr.length; i++ ) {
                Document d = new Document();
                d.add( Field.Keyword( "Color", dataArr[i][0] ) );
                d.add( Field.Keyword( "Transmission", dataArr[i][1] ) );
                d.add( Field.Keyword( "Manufacturer", dataArr[i][2] ) );
                d.add( Field.Keyword( "Mileage", dataArr[i][3] ) );
                d.add( Field.Keyword( "Year", dataArr[i][4] ) );
                iw.addDocument( d );
            }
            iw.optimize();
            iw.close();

            IndexSearcher is = new IndexSearcher( rd );

            // Every clause is optional (required = false, prohibited = false).
            BooleanQuery carQuery = new BooleanQuery();
            carQuery.add( new TermQuery( new Term( "Color", "Red" ) ), false, false );
            carQuery.add( new TermQuery( new Term( "Transmission", "Automatic" ) ), false, false );
            carQuery.add( new TermQuery( new Term( "Manufacturer", "Ford" ) ), false, false );
            carQuery.add( new RangeQuery( new Term( "Mileage", "000000000" ),
                                          new Term( "Mileage", "000010001" ), true ), false, false );
            carQuery.add( new RangeQuery( new Term( "Year", "1995" ),
                                          new Term( "Year", "9999" ), true ), false, false );

            // Print each hit as "score : color, transmission, manufacturer, mileage, year".
            Hits hits = is.search( carQuery );
            for ( int i = 0 ; i < hits.length(); i++ ) {
                Document d = hits.doc( i );
                System.out.println( hits.score( i ) + " : "
                    + d.get( "Color" ) + ", " + d.get( "Transmission" ) + ", "
                    + d.get( "Manufacturer" ) + ", " + d.get( "Mileage" ) + ", "
                    + d.get( "Year" ) );
            }
        } catch ( Throwable t ) {
            t.printStackTrace();
        }
    }
}
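Listing 1 stores mileage as nine-digit, zero-padded strings because its range queries compare terms as strings rather than as numbers. A minimal sketch of that padding follows; the class name MileagePadding and the helper pad are made up for this example, and the width of nine is taken from the listing's data.

```java
import java.util.Locale;

public class MileagePadding {
    // Pads a non-negative mileage to nine digits so string order matches numeric order.
    public static String pad(int mileage) {
        return String.format(Locale.ROOT, "%09d", mileage);
    }

    public static void main(String[] args) {
        // Unpadded, "5000" sorts after "10000" because '5' > '1'.
        System.out.println("unpadded compare: " + "5000".compareTo("10000"));

        // Padded, 5000 sorts before 10000 as expected.
        System.out.println(pad(5000) + " vs " + pad(10000) + " compare: "
                + pad(5000).compareTo(pad(10000)));
    }
}
```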
Listing 2
import org.apache.lucene.store.*;
import org.apache.lucene.index.*;
import org.apache.lucene.document.*;
import org.apache.lucene.search.*;
import org.apache.lucene.analysis.standard.*;

public class ExampleDriver {
    // Columns: color, transmission, manufacturer, zero-padded mileage, year.
    public static final String[][] dataArr = new String[][]{
        {"Red", "Stick", "Ford", "000010000", "1996"},
        {"Red", "Automatic", "Chevy", "000020000", "1997"},
        {"Blue", "Stick", "Ford", "000005000", "1996"},
        {"Green", "Automatic", "Honda", "000010000", "2002"},
        {"Red", "Automatic", "Toyota", "000025000", "1999"},
        {"Black", "Stick", "Chevy", "000030000", "1995"}
    };

    public static void main( String[] args ){
        try {
            RAMDirectory rd = new RAMDirectory();
            IndexWriter iw = new IndexWriter( rd, new StandardAnalyzer(), true );
            // Use the custom Similarity at index time.
            iw.setSimilarity( new IsolationSimilarity() );
            for ( int i = 0 ; i < dataArr.length; i++ ) {
                Document d = new Document();
                d.add( Field.Keyword( "Color", dataArr[i][0] ) );
                d.add( Field.Keyword( "Transmission", dataArr[i][1] ) );
                d.add( Field.Keyword( "Manufacturer", dataArr[i][2] ) );
                d.add( Field.Keyword( "Mileage", dataArr[i][3] ) );
                d.add( Field.Keyword( "Year", dataArr[i][4] ) );
                iw.addDocument( d );
            }
            iw.optimize();
            iw.close();

            IndexSearcher is = new IndexSearcher( rd );
            // Use the same custom Similarity at search time.
            is.setSimilarity( new IsolationSimilarity() );

            BooleanQuery carQuery = new BooleanQuery();
            carQuery.add( new TermQuery( new Term( "Color", "Red" ) ), false, false );
            carQuery.add( new TermQuery( new Term( "Transmission", "Automatic" ) ), false, false );
            carQuery.add( new TermQuery( new Term( "Manufacturer", "Ford" ) ), false, false );
            carQuery.add( new RangeQuery( new Term( "Mileage", "000000000" ),
                                          new Term( "Mileage", "000010001" ), true ), false, false );
            carQuery.add( new RangeQuery( new Term( "Year", "1995" ),
                                          new Term( "Year", "9999" ), true ), false, false );

            Hits hits = is.search( carQuery );
            for ( int i = 0 ; i < hits.length(); i++ ) {
                Document d = hits.doc( i );
                System.out.println( hits.score( i ) + " : "
                    + d.get( "Color" ) + ", " + d.get( "Transmission" ) + ", "
                    + d.get( "Manufacturer" ) + ", " + d.get( "Mileage" ) + ", "
                    + d.get( "Year" ) );
            }
        } catch ( Throwable t ) {
            t.printStackTrace();
        }
    }
}

// Holds idf, coord and lengthNorm constant so they no longer influence the score.
class IsolationSimilarity extends DefaultSimilarity {
    public IsolationSimilarity() {
    }

    public float idf(int docFreq, int numDocs) {
        return 1.0f;
    }

    public float coord(int overlap, int maxOverlap) {
        return 1.0f;
    }

    public float lengthNorm(String fieldName, int numTerms) {
        return 1.0f;
    }
}
Listing 3
import org.apache.lucene.store.*;
import org.apache.lucene.index.*;
import org.apache.lucene.document.*;
import org.apache.lucene.search.*;
import org.apache.lucene.analysis.standard.*;

public class ExampleDriver {
    // The last row is the ringer document; "-1" never occurs in the real data.
    public static final String[][] dataArr = new String[][]{
        {"Red", "Stick", "Ford", "000010000", "1996"},
        {"Red", "Automatic", "Chevy", "000020000", "1997"},
        {"Blue", "Stick", "Ford", "000005000", "1996"},
        {"Green", "Automatic", "Honda", "000010000", "2002"},
        {"Red", "Automatic", "Toyota", "000025000", "1999"},
        {"Black", "Stick", "Chevy", "000030000", "1995"},
        {"-1", "-1", "-1", "-1", "-1"}
    };

    public static void main( String[] args ){
        try {
            RAMDirectory rd = new RAMDirectory();
            IndexWriter iw = new IndexWriter( rd, new StandardAnalyzer(), true );
            iw.setSimilarity( new IsolationSimilarity() );
            for ( int i = 0 ; i < dataArr.length; i++ ) {
                Document d = new Document();
                d.add( Field.Keyword( "Color", dataArr[i][0] ) );
                d.add( Field.Keyword( "Transmission", dataArr[i][1] ) );
                d.add( Field.Keyword( "Manufacturer", dataArr[i][2] ) );
                d.add( Field.Keyword( "Mileage", dataArr[i][3] ) );
                d.add( Field.Keyword( "Year", dataArr[i][4] ) );
                iw.addDocument( d );
            }
            iw.optimize();
            iw.close();

            IndexSearcher is = new IndexSearcher( rd );
            is.setSimilarity( new IsolationSimilarity() );

            // The working query over the real data.
            BooleanQuery carQuery = new BooleanQuery();
            carQuery.add( new TermQuery( new Term( "Color", "Red" ) ), false, false );
            carQuery.add( new TermQuery( new Term( "Transmission", "Automatic" ) ), false, false );
            carQuery.add( new TermQuery( new Term( "Manufacturer", "Ford" ) ), false, false );
            carQuery.add( new RangeQuery( new Term( "Mileage", "000000000" ),
                                          new Term( "Mileage", "000010001" ), true ), false, false );
            carQuery.add( new RangeQuery( new Term( "Year", "1995" ),
                                          new Term( "Year", "9999" ), true ), false, false );

            // The ringer query matches every field of the ringer document.
            BooleanQuery ringerQuery = new BooleanQuery();
            ringerQuery.add( new TermQuery( new Term( "Color", "-1" ) ), false, false );
            ringerQuery.add( new TermQuery( new Term( "Transmission", "-1" ) ), false, false );
            ringerQuery.add( new TermQuery( new Term( "Manufacturer", "-1" ) ), false, false );
            ringerQuery.add( new TermQuery( new Term( "Mileage", "-1" ) ), false, false );
            ringerQuery.add( new TermQuery( new Term( "Year", "-1" ) ), false, false );

            // The overall query combines the working query and the ringer query.
            BooleanQuery q = new BooleanQuery();
            q.add( carQuery, false, false );
            q.add( ringerQuery, false, false );

            Hits hits = is.search( q );
            for ( int i = 0 ; i < hits.length(); i++ ) {
                Document d = hits.doc( i );
                System.out.println( hits.score( i ) + " : "
                    + d.get( "Color" ) + ", " + d.get( "Transmission" ) + ", "
                    + d.get( "Manufacturer" ) + ", " + d.get( "Mileage" ) + ", "
                    + d.get( "Year" ) );
            }
        } catch ( Throwable t ) {
            t.printStackTrace();
        }
    }
}

// Holds idf, coord and lengthNorm constant so they no longer influence the score.
class IsolationSimilarity extends DefaultSimilarity {
    public IsolationSimilarity() {
    }

    public float idf(int docFreq, int numDocs) {
        return 1.0f;
    }

    public float coord(int overlap, int maxOverlap) {
        return 1.0f;
    }

    public float lengthNorm(String fieldName, int numTerms) {
        return 1.0f;
    }
}
Listing 4
import org.apache.lucene.store.*;
import org.apache.lucene.index.*;
import org.apache.lucene.document.*;
import org.apache.lucene.search.*;
import org.apache.lucene.analysis.standard.*;

public class ExampleDriver {
    // Note that the ringer data row contains all the possible values for the multi
    // value field options. For fields that are single valued we can use a value that
    // will not be in the normal population of values.
    public static final String[][] dataArr = new String[][]{
        {"Red", "Stick", "Ford", "000010000", "1996", "A/C;Leather;Sunroof"},
        {"Red", "Automatic", "Chevy", "000020000", "1997", "A/C"},
        {"Blue", "Stick", "Ford", "000005000", "1996", "A/C;Leather"},
        {"Green", "Automatic", "Honda", "000010000", "2002", "A/C;Sunroof"},
        {"Red", "Automatic", "Toyota", "000025000", "1999", "Leather"},
        {"Black", "Stick", "Chevy", "000030000", "1995", ""},
        {"-1", "-1", "-1", "-1", "-1", "-1"}
    };

    public static void main( String[] args ){
        try {
            RAMDirectory rd = new RAMDirectory();
            IndexWriter iw = new IndexWriter( rd, new StandardAnalyzer(), true );
            iw.setSimilarity( new IsolationSimilarity() );
            for ( int i = 0 ; i < dataArr.length; i++ ) {
                Document d = new Document();
                d.add( Field.Keyword( "Color", dataArr[i][0] ) );
                d.add( Field.Keyword( "Transmission", dataArr[i][1] ) );
                d.add( Field.Keyword( "Manufacturer", dataArr[i][2] ) );
                d.add( Field.Keyword( "Mileage", dataArr[i][3] ) );
                d.add( Field.Keyword( "Year", dataArr[i][4] ) );
                // Options is multi-valued: add one Keyword field per option.
                String[] options = dataArr[i][5].split( ";" );
                for( int j = 0 ; j < options.length; j++ ){
                    d.add( Field.Keyword( "Options", options[j] ) );
                }
                iw.addDocument( d );
            }
            iw.optimize();
            iw.close();

            IndexSearcher is = new IndexSearcher( rd );
            is.setSimilarity( new IsolationSimilarity() );

            BooleanQuery carQuery = new BooleanQuery();
            carQuery.add( new TermQuery( new Term( "Color", "Red" ) ), false, false );
            carQuery.add( new TermQuery( new Term( "Transmission", "Automatic" ) ), false, false );
            carQuery.add( new TermQuery( new Term( "Manufacturer", "Ford" ) ), false, false );
            carQuery.add( new RangeQuery( new Term( "Mileage", "000000000" ),
                                          new Term( "Mileage", "000010001" ), true ), false, false );
            carQuery.add( new RangeQuery( new Term( "Year", "1995" ),
                                          new Term( "Year", "9999" ), true ), false, false );
            carQuery.add( new TermQuery( new Term( "Options", "A/C" ) ), false, false );
            carQuery.add( new TermQuery( new Term( "Options", "Leather" ) ), false, false );

            BooleanQuery ringerQuery = new BooleanQuery();
            ringerQuery.add( new TermQuery( new Term( "Color", "-1" ) ), false, false );
            ringerQuery.add( new TermQuery( new Term( "Transmission", "-1" ) ), false, false );
            ringerQuery.add( new TermQuery( new Term( "Manufacturer", "-1" ) ), false, false );
            ringerQuery.add( new TermQuery( new Term( "Mileage", "-1" ) ), false, false );
            ringerQuery.add( new TermQuery( new Term( "Year", "-1" ) ), false, false );
            //The ringer portion for multi-valued fields set a boost equal to the number of values
            //searched on.
            TermQuery tq = new TermQuery( new Term( "Options", "-1" ) );
            tq.setBoost( 2.0f ); //We searched for 2 options.
            ringerQuery.add( tq, false, false );

            BooleanQuery q = new BooleanQuery();
            q.add( carQuery, false, false );
            q.add( ringerQuery, false, false );

            Hits hits = is.search( q );
            for ( int i = 0 ; i < hits.length(); i++ ) {
                Document d = hits.doc( i );
                System.out.print( hits.score( i ) + " : "
                    + d.get( "Color" ) + ", " + d.get( "Transmission" ) + ", "
                    + d.get( "Manufacturer" ) + ", " + d.get( "Mileage" ) + ", "
                    + d.get( "Year" ) + ", " );
                // Print every value of the multi-valued Options field.
                String[] options = d.getValues( "Options" );
                for( int j = 0; j < options.length; j++ ){
                    System.out.print( options[j] );
                    System.out.print( ";" );
                }
                System.out.println();
            }
        } catch ( Throwable t ) {
            t.printStackTrace();
        }
    }
}

// Holds idf, coord and lengthNorm constant so they no longer influence the score.
class IsolationSimilarity extends DefaultSimilarity {
    public IsolationSimilarity() {
    }

    public float idf(int docFreq, int numDocs) {
        return 1.0f;
    }

    public float coord(int overlap, int maxOverlap) {
        return 1.0f;
    }

    public float lengthNorm(String fieldName, int numTerms) {
        return 1.0f;
    }
}
Final Notes
The example listings contain queries that search over all the fields. If a field is not being queried against then it should not be included in either the working query (carQuery in the code listings) or in the ringer query.
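One way to keep the working query and the ringer query in step is to derive the ringer clauses from the same set of queried fields. The sketch below shows the idea with plain maps instead of Lucene query objects; the class name QueryPlan, the method ringerClauses, and the map-based representation are assumptions made for this illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class QueryPlan {
    public static final String RINGER_VALUE = "-1";

    // For each field in the working query, the ringer query targets the same field with "-1".
    public static Map<String, String> ringerClauses(Map<String, String> criteria) {
        Map<String, String> ringer = new LinkedHashMap<>();
        for (String field : criteria.keySet()) {
            ringer.put(field, RINGER_VALUE);
        }
        return ringer;
    }

    public static void main(String[] args) {
        // Only two fields are queried this time, so only two ringer clauses are built.
        Map<String, String> criteria = new LinkedHashMap<>();
        criteria.put("Color", "Red");
        criteria.put("Manufacturer", "Ford");

        System.out.println("working query: " + criteria);
        System.out.println("ringer query:  " + ringerClauses(criteria));
    }
}
```

Because both clause sets come from the same field list, a field that is dropped from the working query is dropped from the ringer query as well.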