Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                

Analyzers

An analyzer examines the text of fields and generates a token stream.

Analyzers are specified as a child of the <fieldType> element in Solr’s schema.

In normal usage, only fields of type solr.TextField or solr.SortableTextField will specify an analyzer. The simplest way to configure an analyzer is with a single <analyzer> element whose class attribute is a fully qualified Java class name. The named class must derive from org.apache.lucene.analysis.Analyzer. For example:

<fieldType name="nametext" class="solr.TextField">
  <analyzer class="org.apache.lucene.analysis.core.WhitespaceAnalyzer"/>
</fieldType>

In this case a single class, WhitespaceAnalyzer, is responsible for analyzing the content of the named text field and emitting the corresponding tokens. For simple cases, such as plain English prose, a single analyzer class like this may be sufficient. But it’s often necessary to do more complex analysis of the field content.

Even the most complex analysis requirements can usually be decomposed into a series of discrete, relatively simple processing steps. As you will soon discover, the Solr distribution comes with a large selection of tokenizers and filters that covers most scenarios you are likely to encounter. Setting up an analyzer chain is very straightforward; you specify a simple <analyzer> element (no class attribute) with child elements that name factory classes for the tokenizer and filters to use, in the order you want them to run.

For example:

  • With name

  • With class name (legacy)

<fieldType name="nametext" class="solr.TextField">
  <analyzer>
    <tokenizer name="standard"/>
    <filter name="lowercase"/>
    <filter name="stop"/>
    <filter name="englishPorter"/>
  </analyzer>
</fieldType>

Tokenizer and filter factory classes are referred by their symbolic names (SPI names). Here, name="standard" refers org.apache.lucene.analysis.standard.StandardTokenizerFactory.

<fieldType name="nametext" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory"/>
  </analyzer>
</fieldType>

Note that classes in the org.apache.lucene.analysis package may be referred to here with the shorthand solr. prefix.

In this case, no Analyzer class was specified on the <analyzer> element. Rather, a sequence of more specialized classes are wired together and collectively act as the Analyzer for the field. The text of the field is passed to the first item in the list (solr.StandardTokenizerFactory), and the tokens that emerge from the last one (solr.EnglishPorterFilterFactory) are the terms that are used for indexing or querying any fields that use the "nametext" fieldType.

Field Values versus Indexed Terms

The output of an Analyzer affects the terms indexed in a given field (and the terms used when parsing queries against those fields) but it has no impact on the stored value for the fields. For example: an analyzer might split "Brown Cow" into two indexed terms "brown" and "cow", but the stored value will still be a single String: "Brown Cow"

Analysis Phases

Analysis takes place in two contexts. At index time, when a field is being created, the token stream that results from analysis is added to an index and defines the set of terms (including positions, sizes, and so on) for the field. At query time, the values being searched for are analyzed and the terms that result are matched against those that are stored in the field’s index.

In many cases, the same analysis should be applied to both phases. This is desirable when you want to query for exact string matches, possibly with case-insensitivity, for example. In other cases, you may want to apply slightly different analysis steps during indexing than those used at query time.

If you provide a simple <analyzer> definition for a field type, as in the examples above, then it will be used for both indexing and queries. If you want distinct analyzers for each phase, you may include two <analyzer> definitions distinguished with a type attribute. For example:

  • With name

  • With class name (legacy)

<fieldType name="nametext" class="solr.TextField">
  <analyzer type="index">
    <tokenizer name="standard"/>
    <filter name="lowercase"/>
    <filter name="keepWord" words="keepwords.txt"/>
    <filter name="synonymFilter" synonyms="syns.txt"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer name="standard"/>
    <filter name="lowercase"/>
  </analyzer>
</fieldType>
<fieldType name="nametext" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeepWordFilterFactory" words="keepwords.txt"/>
    <filter class="solr.SynonymFilterFactory" synonyms="syns.txt"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

In this theoretical example, at index time the text is tokenized, the tokens are set to lowercase, any that are not listed in keepwords.txt are discarded and those that remain are mapped to alternate values as defined by the synonym rules in the file syns.txt. This essentially builds an index from a restricted set of possible values and then normalizes them to values that may not even occur in the original text.

At query time, the only normalization that happens is to convert the query terms to lowercase. The filtering and mapping steps that occur at index time are not applied to the query terms. Queries must then, in this example, be very precise, using only the normalized terms that were stored at index time.

Analysis for Multi-Term Expansion

In some types of queries (i.e., Prefix, Wildcard, Regex, etc.) the input provided by the user is not natural language intended for Analysis. Things like Synonyms or Stop word filtering do not work in a logical way in these types of Queries.

When Solr needs to perform analysis for a query that results in multi-term expansion, then the normalize method is called for each factory in the filter chain. Factories that provide filters that do not make sense in this context will return their inputs unchanged. Normalization applies to both CharFilters and TokenFilters.

For most use cases, this provides the best possible behavior, but if you wish for absolute control over the analysis performed on these types of queries, you may explicitly define a multiterm analyzer to use, such as in the following example:

<fieldType name="nametext" class="solr.TextField">
  <analyzer type="index">
    <tokenizer name="standard"/>
    <filter name="lowercase"/>
    <filter name="keepWord" words="keepwords.txt"/>
    <filter name="synonym" synonyms="syns.txt"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer name="standard"/>
    <filter name="lowercase"/>
  </analyzer>
  <!-- No analysis at all when doing queries that involved Multi-Term expansion -->
  <analyzer type="multiterm">
    <tokenizer name="keyword" />
  </analyzer>
</fieldType>