|
| 1 | +Tsearch2 - full text search extension for PostgreSQL |
| 2 | + |
| 3 | + An [10][online version] of this document is available. |
| 4 | + |
| 5 | + This module is sponsored by Delta-Soft Ltd., Moscow, Russia. |
| 6 | + |
| 7 | + Notice: This version is fully incompatible with the old tsearch (V1), |
| 8 | + which is deprecated in the upcoming 7.4 release and will be obsolete |
| 9 | + in 7.5. |
| 10 | + |
| 11 | + The Tsearch2 contrib module contains an implementation of a new data |
| 12 | + type tsvector - a searchable data type with indexed access. In a |
| 13 | + nutshell, tsvector is a set of unique words along with their |
| 14 | + positional information in the document, organized in a special |
| 15 | + structure optimized for fast access and lookup. Besides its position |
| 16 | + in the document, each word entry can also carry a weight attribute |
| 17 | + describing the importance of the word at that specific position in |
| 18 | + the document. A set of fixed-length bit-signatures representing |
| 19 | + tsvectors is stored in a search tree (developed using PostgreSQL |
| 20 | + GiST), which provides online update of the full text index and fast |
| 21 | + query lookup. The module provides indexed access methods, queries, |
| 22 | + operations and supporting routines for the tsvector data type and easy |
| 23 | + conversion of text data to tsvector. Table-driven configuration allows |
| 24 | + the creation of custom configurations optimized for specific searches |
| 25 | + using standard SQL commands. |
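
As a rough illustration of the description above, the following Python sketch models a tsvector-like value: each unique word is mapped to the positions where it occurs, and each position carries a weight label. This is a conceptual toy only. It is not Tsearch2's parser, storage layout or API; the helper names (`to_toy_tsvector`, `matches_all`), the whitespace tokenizer and the weight label "D" are assumptions made for the example.

```python
from collections import defaultdict


def to_toy_tsvector(text, weight="D"):
    """Map each unique lowercase word to a list of (position, weight) pairs.

    Hypothetical helper: the real module uses a configurable parser and
    dictionaries; this sketch simply splits on whitespace.
    """
    entries = defaultdict(list)
    for position, word in enumerate(text.lower().split(), start=1):
        entries[word].append((position, weight))
    return dict(entries)


def matches_all(vector, words):
    """Return True if every query word occurs in the vector (an AND query)."""
    return all(word.lower() in vector for word in words)


doc = to_toy_tsvector("a fat cat sat on a mat")
print(sorted(doc))   # ['a', 'cat', 'fat', 'mat', 'on', 'sat']
print(doc["a"])      # [(1, 'D'), (6, 'D')]
print(matches_all(doc, ["cat", "mat"]))  # True
```

Note that the repeated word "a" is stored once, with both of its positions, which is the "set of unique words along with their positional information" idea in miniature.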
| 26 | + |
| 27 | + Configuration allows you to: |
| 28 | + * specify the type of lexemes to be indexed and the way they are |
| 29 | + processed. |
| 30 | + * specify dictionaries to be used along with stop-word recognition. |
| 31 | + * specify the parser used to process a document. |
| 32 | + |
| 33 | + See [11]Documentation Roadmap for links to documentation. |
| 34 | + |
| 35 | +Authors |
| 36 | + |
| 37 | + * Oleg Bartunov <oleg@sai.msu.su>, Moscow, Moscow University, Russia |
| 38 | + * Teodor Sigaev <teodor@sigaev.ru>, Moscow, Delta-Soft Ltd., Russia |
| 39 | + |
| 40 | +Contributors |
| 41 | + |
| 42 | + * Robert John Shepherd and Andrew J. Kopciuch submitted |
| 43 | + "Introduction to tsearch" (Robert - tsearch v1, Andrew - tsearch |
| 44 | + v2) |
| 45 | + * Brandon Craig Rhodes wrote "Tsearch2 Guide" and "Tsearch2 |
| 46 | + Reference" and proposed a new naming convention for tsearch V2 |
| 47 | + |
| 48 | +New features |
| 49 | + |
| 50 | + * Relevance ranking of search results |
| 51 | + * Table driven configuration |
| 52 | + * Morphology support (ispell dictionaries, snowball stemmers) |
| 53 | + * Headline support (text fragments with highlighted search terms) |
| 54 | + * Ability to plug in custom dictionaries and parsers |
| 55 | + * Synonym dictionary |
| 56 | + * Generator of templates for dictionaries (built-in snowball stemmer |
| 57 | + support) |
| 58 | + * Statistics on indexed words are available |
| 59 | + |
| 60 | +Limitations |
| 61 | + |
| 62 | + * A lexeme must be no longer than 2048 bytes |
| 63 | + * The number of lexemes is limited to 2^32. Note that the actual |
| 64 | + capacity of a tsvector depends on whether positional information |
| 65 | + is stored or not. |
| 66 | + * tsvector - the size is limited to approximately 2^20 bytes. |
| 67 | + * tsquery - the number of entries (lexemes and operations) < 32768 |
| 68 | + * Positional information |
| 69 | + + maximal position of a lexeme < 2^14 (16384) |
| 70 | + + a lexeme can have at most 256 positions |
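
The per-lexeme limits listed above can be read as simple bounds checks. The Python sketch below restates them, with the constants taken directly from the list. The function name `check_lexeme` is hypothetical, measuring lexeme length in UTF-8 bytes is an assumption, and the sketch only reports violations because the list does not say how the module itself reacts when a limit is exceeded.

```python
MAX_LEXEME_BYTES = 2048          # a lexeme must be no longer than 2048 bytes
MAX_POSITION = 2 ** 14           # maximal position of a lexeme < 2^14 (16384)
MAX_POSITIONS_PER_LEXEME = 256   # a lexeme can have at most 256 positions


def check_lexeme(lexeme, positions):
    """Return a list of limit violations for one lexeme (empty if none)."""
    problems = []
    # Assumption: length is counted in UTF-8 encoded bytes.
    if len(lexeme.encode("utf-8")) > MAX_LEXEME_BYTES:
        problems.append("lexeme too long")
    if len(positions) > MAX_POSITIONS_PER_LEXEME:
        problems.append("too many positions")
    if any(p >= MAX_POSITION for p in positions):
        problems.append("position out of range")
    return problems


print(check_lexeme("cat", [1, 5, 20000]))  # ['position out of range']
```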
| 71 | + |
| 72 | +References |
| 73 | + |
| 74 | + * GiST development site - |
| 75 | + [12]http://www.sai.msu.su/~megera/postgres/gist |
| 76 | + * OpenFTS home page - [13]http://openfts.sourceforge.net/ |
| 77 | + * Mailing list - |
| 78 | + [14]http://sourceforge.net/mailarchive/forum.php?forum=openfts-general |
| 80 | + |
| 81 | + [15]Documentation Roadmap |
| 82 | + |
| 83 | +Documentation Roadmap |
| 84 | + |
| 85 | + * Several docs are available from the docs/ subdirectory |
| 86 | + + "Tsearch V2 Introduction" by Andrew Kopciuch |
| 87 | + + "Tsearch2 Guide" by Brandon Rhodes |
| 88 | + + "Tsearch2 Reference" by Brandon Rhodes |
| 89 | + * Readme.gendict in the gendict/ subdirectory |
| 90 | + + [16][Gendict tutorial] |
| 91 | + |
| 92 | + The online version of the documentation is always available from the |
| 93 | + Tsearch V2 home page - |
| 94 | + [17]http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/ |
| 95 | + |
| 96 | +Support |
| 97 | + |
| 98 | + The authors strongly recommend using the [18][openfts-general] or |
| 99 | + [19][pgsql-general] mailing lists for questions and discussions. |
| 100 | + |
| 101 | +Caution |
| 102 | + |
| 103 | + Although full text searching with our tsearch module appears easy (the |
| 104 | + authors hope it is), any serious search engine requires careful study |
| 105 | + of various aspects, such as stop words, dictionaries, and special |
| 106 | + parsers. The tsearch module was designed to facilitate both cases. |
| 107 | + |
| 108 | +Development History |
| 109 | + |
| 110 | + Pre-tsearch era |
| 111 | + Development of OpenFTS began in 2000 after realizing that we |
| 112 | + needed a search engine optimized for online updates and able to |
| 113 | + access metadata from the database. This is essential for online |
| 114 | + news agencies, web portals, digital libraries, etc. Most search |
| 115 | + engines available utilize an inverted index which is very fast |
| 116 | + for searching but very slow for online updates. Incrementally |
| 117 | + updating an inverted index is a complex engineering task |
| 118 | + while we needed something light, free and with the ability to |
| 119 | + access metadata from the database. The last requirement is very |
| 120 | + important because in a real life application a search engine |
| 121 | + should always consult metadata (topic, permissions, date |
| 122 | + range, version, etc.). We extensively use PostgreSQL as a |
| 123 | + database backend and have no intention to move from it, so the |
| 124 | + problem was to find a data structure and a fast way to access |
| 125 | + it. PostgreSQL has a rather unique data type for storing sets |
| 126 | + (think about words) - arrays, but lacks index access to them. A |
| 127 | + document is parsed into lexemes, which are identified in |
| 128 | + various ways (e.g. stemming, morphology, dictionary), and as a |
| 129 | + result is reduced to an array of integer numbers. During our |
| 130 | + research we found a paper by Joseph Hellerstein which |
| 131 | + introduced an interesting data structure suitable for sets - |
| 132 | + RD-tree (Russian Doll tree). It looked very attractive, but |
| 133 | + implementing it in PostgreSQL seemed difficult because of our |
| 134 | + ignorance of database internals. Further research led us to |
| 135 | + the idea to use GiST for implementing RD-tree, but at that time |
| 136 | + the GiST code had for a long while remained untouched and |
| 137 | + contained several bugs. After work on improving GiST for |
| 138 | + version 7.0.3 of PostgreSQL was done, we were able to implement |
| 139 | + RD-Tree and use it for index access to arrays of integers. This |
| 140 | + implementation was ideally suited for small arrays and |
| 141 | + eliminated complex joins, but was practically useless for |
| 142 | + indexing large arrays. The next improvement came from an idea |
| 143 | + to represent a document by a single bit-signature, a so-called |
| 144 | + superimposed signature (see "Index Structures for Databases |
| 145 | + Containing Data Items with Set-valued Attributes", 1997, Sven |
| 146 | + Helmer for details). We developed the contrib/intarray module |
| 147 | + and used it for full text indexing. |
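
To make the superimposed-signature idea mentioned above more concrete, here is a minimal Python sketch: every word sets one hashed bit in a fixed-length bit string, and a query can only match a document whose signature contains all of the query's bits. The 64-bit width, the use of CRC32 as the hash, the one-bit-per-word scheme and the function names are all assumptions chosen for illustration; they are not taken from contrib/intarray or tsearch.

```python
import zlib

SIGNATURE_BITS = 64  # assumed width, chosen only for this example


def signature(words, bits=SIGNATURE_BITS):
    """OR together one hashed bit per word (a superimposed signature)."""
    sig = 0
    for word in words:
        sig |= 1 << (zlib.crc32(word.encode("utf-8")) % bits)
    return sig


def may_contain(doc_sig, query_sig):
    """False means definitely absent; True means the document needs a recheck."""
    return (doc_sig & query_sig) == query_sig


doc_sig = signature(["fat", "cat", "sat", "mat"])
print(may_contain(doc_sig, signature(["cat", "mat"])))  # True
```

Because different words can hash to the same bit, a True result is only a candidate match and the stored words still have to be checked. A False result is always reliable.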
| 148 | + |
| 149 | + tsearch v1 |
| 150 | + It was inconvenient to use integer ids instead of words, so we |
| 151 | + introduced a new data type called 'txtidx' - a searchable data |
| 152 | + type (textual) with indexed access. This was the first step of |
| 153 | + our work on an implementation of a built-in PostgreSQL full |
| 154 | + text search engine. Even though tsearch v1 had many features of |
| 155 | + a search engine, it lacked configuration support and relevance |
| 156 | + ranking. People were encouraged to use OpenFTS, which provided |
| 157 | + relevance ranking based on coordinate information and flexible |
| 158 | + configuration. OpenFTS v.0.34 is the last version based on |
| 159 | + tsearch v1. |
| 160 | + |
| 161 | + tsearch V2 |
| 162 | + People recognized tsearch as a powerful tool for full text |
| 163 | + searching and insisted on adding ranking support, better |
| 164 | + configurability, etc. We had already thought about moving most of |
| 165 | + the features of OpenFTS to tsearch, and in early 2003 we |
| 166 | + decided to work on a new version of tsearch - tsearch v2. We've |
| 167 | + abandoned auxiliary index tables which were used by OpenFTS to |
| 168 | + store coordinate information and modified the txtidx type to |
| 169 | + store them internally. Also, we've added table-driven |
| 170 | + configuration, support of ispell dictionaries, snowball |
| 171 | + stemmers and the ability to specify which types of lexemes to |
| 172 | + index. Also, it's now possible to generate headlines of |
| 173 | + documents with highlighted search terms. These changes make |
| 174 | + tsearch more user friendly and turn it into a really powerful |
| 175 | + full text search engine. After announcing the alpha version, we |
| 176 | + received a proposal from Brandon Rhodes to rename tsearch |
| 177 | + functions to be more consistent. So we have renamed the txtidx |
| 178 | + type to tsvector, and other things as well. |
| 179 | + |
| 180 | + To allow users of tsearch v1 a smooth upgrade, we named the module |
| 181 | + tsearch2. |
| 182 | + |
| 183 | + A future release of OpenFTS (v.0.35) will be based on tsearch2. Brave |
| 184 | + people can download it from OpenFTS CVS (see the link from the |
| 185 | + [20][OpenFTS page]). |
| 186 | + |
| 187 | +References |
| 188 | + |
| 189 | + 10. http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/Tsearch_V2_Readme.html |
| 190 | + 11. http://www.sai.msu.su/~megera/oddmuse/index.cgi/Tsearch_V2_Readme#Documentation_Roadmap |
| 191 | + 12. http://www.sai.msu.su/~megera/postgres/gist |
| 192 | + 13. http://openfts.sourceforge.net/ |
| 193 | + 14. http://sourceforge.net/mailarchive/forum.php?forum=openfts-general |
| 194 | + 15. http://www.sai.msu.su/~megera/oddmuse/index.cgi?action=anchor&id=Documentation_Roadmap#Documentation_Roadmap |
| 195 | + 16. http://www.sai.msu.su/~megera/oddmuse/index.cgi?Gendict |
| 196 | + 17. http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/ |
| 197 | + 18. http://sourceforge.net/mailarchive/forum.php?forum=openfts-general |
| 198 | + 19. http://archives.postgresql.org/pgsql-general/ |
| 199 | + 20. http://openfts.sourceforge.net/ |