|
1 | 1 | Tsearch2 - full text search extension for PostgreSQL
|
2 | 2 |
|
3 |
| - [10][Online version] of this document is available |
4 |
| - |
5 |
| - This module is sponsored by Delta-Soft Ltd., Moscow, Russia. |
6 |
| - |
7 |
| - Notice: This version is fully incompatible with old tsearch (V1), |
8 |
| - which was deprecated in 7.4 and obsoleted in 8.0. |
9 |
| - |
10 |
| - The Tsearch2 contrib module contains an implementation of a new data |
11 |
| - type tsvector - a searchable data type with indexed access. In a |
12 |
| - nutshell, tsvector is a set of unique words along with their |
13 |
| - positional information in the document, organized in a special |
14 |
| - structure optimized for fast access and lookup. Actually, each word |
15 |
| - entry, besides its position in the document, could have a weight |
16 |
| - attribute, describing importance of this word (at a specific) position |
17 |
| - in document. A set of bit-signatures of a fixed length, representing |
18 |
| - tsvectors, are stored in a search tree (developed using PostgreSQL |
19 |
| - GiST), which provides online update of full text index and fast query |
20 |
| - lookup. The module provides indexed access methods, queries, |
21 |
| - operations and supporting routines for the tsvector data type and easy |
22 |
| - conversion of text data to tsvector. Table driven configuration allows |
23 |
| - creation of custom configuration optimized for specific searches using |
| 3 | + [1]Online version of this document is available |
| 4 | + |
| 5 | + Tsearch2 is a full text search engine, fully integrated into the |
| 6 | + PostgreSQL RDBMS. |
| 7 | + |
| 8 | +Main features |
| 9 | + |
| 10 | + * Full online update |
| 11 | + * Supports multiple table driven configurations |
| 12 | + * flexible and rich linguistic support (dictionaries, stop words), |
| 13 | + thesaurus |
| 14 | + * full multibyte (UTF-8) support |
| 15 | + * Sophisticated ranking functions with support of proximity and |
| 16 | + structure information (rank, rank_cd) |
| 17 | + * Index support (GiST and Gin) with concurrency and recovery support |
| 18 | + * Rich query language with query rewriting support |
| 19 | + * Headline support (text fragments with highlighted search terms) |
| 20 | + * Ability to plug-in custom dictionaries and parsers |
| 21 | + * Template generator for tsearch2 dictionaries with [2]snowball |
| 22 | + stemmer support |
| 23 | + * It is mature (5 years of development) |
| 24 | + |
| 25 | + Tsearch2, in a nutshell, provides an FTS operator (contains) for two |
| 26 | + new data types, representing a document (tsvector) and a query (tsquery). |
| 27 | + Table driven configuration allows creation of custom searches using |
24 | 28 | standard SQL commands.
|
25 |
| - |
26 |
| - Configuration allows you to: |
27 |
| - * specify the type of lexemes to be indexed and the way they are |
28 |
| - processed. |
29 |
| - * specify dictionaries to be used along with stop words recognition. |
30 |
| - * specify the parser used to process a document. |
31 |
| - |
32 |
| - See [11]Documentation Roadmap for links to documentation. |
| 29 | + |
| 30 | + tsvector is a searchable data type representing a document. It is a set |
| 31 | + of unique words along with their positional information in the |
| 32 | + document, organized in a special structure optimized for fast access |
| 33 | + and lookup. Each entry can be labelled to reflect its importance in |
| 34 | + the document. |
| 35 | + |
| 36 | + tsquery is a data type for textual queries with support for boolean |
| 37 | + operators. It consists of lexemes (optionally labelled) with boolean |
| 38 | + operators between them. |
| 39 | + |
| 40 | + Table driven configuration allows you to specify: |
| 41 | + * the parser, which is used to break a document into lexemes |
| 42 | + * which lexemes to index and the way they are processed |
| 43 | + * dictionaries to be used along with stop words recognition. |
33 | 44 |
|
34 | 45 | OpenFTS vs Tsearch2
|
35 | 46 |
|
36 |
| - OpenFTS is a middleware between application and database, so it uses |
37 |
| - tsearch2 as a storage, while database engine is used as a query executor |
38 |
| - (searching). Everything else (parsing of documents, query processing, |
39 |
| - linguistics) carry outs on client side. That's why OpenFTS has its own |
40 |
| - configuration table (fts_conf) and works with its own set of dictionaries. |
41 |
| - OpenFTS is more flexible, because it could be used in multi-server |
42 |
| - architecture with separated machines for repository of documents |
43 |
| - (documents could be stored in file system), database and query engine. |
| 47 | + [3]OpenFTS is middleware between the application and the database. |
| 48 | + OpenFTS uses tsearch2 as storage and the database engine as a query |
| 49 | + executor (searching). Everything else, i.e. parsing of documents, |
| 50 | + query processing and linguistics, is carried out on the client side. |
| 51 | + That's why OpenFTS has its own configuration table (fts_conf) and |
| 52 | + works with its own set of dictionaries. OpenFTS is more flexible, |
| 53 | + because it can be used in a multi-server architecture with separate |
| 54 | + machines for the repository of documents (documents can be stored in |
| 55 | + the filesystem), the database and the query engine. |
| 56 | + |
| 57 | + See [4]Documentation Roadmap for links to documentation. |
44 | 58 |
|
45 | 59 | Authors
|
46 | 60 |
|
47 | 61 | * Oleg Bartunov <oleg@sai.msu.su>, Moscow, Moscow University, Russia
|
48 |
| - * Teodor Sigaev <teodor@sigaev.ru>, Moscow, Delta-Soft Ltd.,Russia |
49 |
| - |
| 62 | + * Teodor Sigaev <teodor@sigaev.ru>, Moscow, Moscow University, Russia |
| 63 | + |
50 | 64 | Contributors
|
51 | 65 |
|
52 |
| - * Robert John Shepherd and Andrew J. Kopciuch submitted |
53 |
| - "Introduction to tsearch" (Robert - tsearch v1, Andrew - tsearch |
| 66 | + * Robert John Shepherd and Andrew J. Kopciuch submitted |
| 67 | + "Introduction to tsearch" (Robert - tsearch v1, Andrew - tsearch |
54 | 68 | v2)
|
55 |
| - * Brandon Craig Rhodes wrote "Tsearch2 Guide" and "Tsearch2 |
| 69 | + * Brandon Craig Rhodes wrote "Tsearch2 Guide" and "Tsearch2 |
56 | 70 | Reference" and proposed new naming convention for tsearch V2
|
57 |
| - |
58 |
| -Features Added with Tsearch2 |
59 | 71 |
|
60 |
| - * Relevance ranking of search results |
61 |
| - * Table driven configuration |
62 |
| - * Morphology support (ispell dictionaries, snowball stemmers) |
63 |
| - * Headline support (text fragments with highlighted search terms) |
64 |
| - * Ability to plug-in custom dictionaries and parsers |
65 |
| - * Synonym dictionary |
66 |
| - * Generator of templates for dictionaries (built-in snowball stemmer |
67 |
| - support) |
68 |
| - * Statistics of indexed words is available |
69 |
| - |
| 72 | +Sponsors |
| 73 | + |
| 74 | + * ABC Startsiden - compound words support |
| 75 | + * University of Mannheim for UTF-8 support (in 8.2) |
| 76 | + * jfg:networks ([5]http:www.jfg-networks.com/) for Gin - Generalized |
| 77 | + Inverted index (in 8.2) |
| 78 | + * Georgia Public Library Service and LibLime, Inc. for Thesaurus |
| 79 | + dictionary |
| 80 | + * PostGIS community - GiST Concurrency and Recovery |
| 81 | + |
| 82 | + The authors are grateful to the Russian Foundation for Basic Research |
| 83 | + and Delta-Soft Ltd., Moscow, Russia for support. |
| 84 | + |
70 | 85 | Limitations
|
71 | 86 |
|
72 |
| - * Lexeme should be not longer than 2048 bytes |
73 |
| - * The number of lexemes is limited by 2^32. Note, that actual |
74 |
| - capacity of tsvector is depends on whether positional information |
75 |
| - is stored or not. |
76 |
| - * tsvector - the size is limited by approximately 2^20 bytes. |
77 |
| - * tsquery - the number of entries (lexemes and operations) < 32768 |
78 |
| - * Positional information |
79 |
| - + maximal position of lexeme < 2^14 (16384) |
80 |
| - + lexeme could have maximum 256 positions |
81 |
| - |
| 87 | + * Length of lexeme < 2K |
| 88 | + * Length of tsvector (lexemes + positions) < 1Mb |
| 89 | + * The number of lexemes < 4^32 |
| 90 | + * 0< Positional information < 16383 |
| 91 | + * No more than 256 positions per lexeme |
| 92 | + * The number of nodes ( lexemes + operations) in tsquery < 32768 |
| 93 | + |
82 | 94 | References
|
83 | 95 |
|
84 | 96 | * GiST development site -
|
85 |
| - [12]http://www.sai.msu.su/~megera/postgres/gist |
86 |
| - * OpenFTS home page - [13]http://openfts.sourceforge.net/ |
| 97 | + [6]http://www.sai.msu.su/~megera/postgres/gist |
| 98 | + * GiN development - [7]http://www.sigaev.ru/gin/ |
| 99 | + * OpenFTS home page - [8]http://openfts.sourceforge.net/ |
87 | 100 | * Mailing list -
|
88 |
| - [14]http://sourceforge.net/mailarchive/forum.php?forum=openfts-gen |
89 |
| - eral |
90 |
| - |
91 |
| - [15]Documentation Roadmap |
92 |
| - |
| 101 | + [9]http://sourceforge.net/mailarchive/forum.php?forum=openfts-gene |
| 102 | + ral |
| 103 | + |
93 | 104 | Documentation Roadmap
|
94 | 105 |
|
95 | 106 | * Several docs are available from docs/ subdirectory
|
96 | 107 | + "Tsearch V2 Introduction" by Andrew Kopciuch
|
97 | 108 | + "Tsearch2 Guide" by Brandon Rhodes
|
98 | 109 | + "Tsearch2 Reference" by Brandon Rhodes
|
99 | 110 | * Readme.gendict in gendict/ subdirectory
|
100 |
| - + [16][Gendict tutorial] |
101 |
| - |
102 |
| - Online version of documentation is always available from Tsearch V2 |
103 |
| - home page - |
104 |
| - [17]http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/ |
105 |
| - |
| 111 | + + Also, check [10]Gendict tutorial |
| 112 | + * Check [11]tsearch2 Wiki pages for various documentation |
| 113 | + |
106 | 114 | Support
|
107 | 115 |
|
108 |
| - Authors urgently recommend people to use [18][openfts-general] or |
109 |
| - [19][pgsql-general] mailing lists for questions and discussions. |
110 |
| - |
111 |
| -Caution |
| 116 | + Authors urgently recommend people to use [12]openfts-general or |
| 117 | + [13]pgsql-general mailing lists for questions and discussions. |
112 | 118 |
|
113 |
| - In spite of apparent easy full text searching with our tsearch module |
114 |
| - (authors hope it's so), any serious search engine require profound |
115 |
| - study of various aspects, such as stop words, dictionaries, special |
116 |
| - parsers. Tsearch module was designed to facilitate both those cases. |
117 |
| - |
118 | 119 | Development History
|
119 | 120 |
|
| 121 | + Latest news |
| 122 | + |
| 123 | + For the PostgreSQL 8.2 release we added: |
| 124 | + * multibyte (UTF-8) support |
| 125 | + * Thesaurus dictionary |
| 126 | + * Query rewriting |
| 127 | + * rank_cd relevance function now supports different weights of |
| 128 | + lexemes |
| 129 | + * GiN support, which improves the scalability of tsearch2 |
| 130 | + |
120 | 131 | Pre-tsearch era
|
121 |
| - Development of OpenFTS began in 2000 after realizing that we |
122 |
| - needed a search engine optimized for online updates and able to |
123 |
| - access metadata from the database. This is essential for online |
| 132 | + Development of OpenFTS began in 2000 after realizing that we |
| 133 | + needed a search engine optimized for online updates with access |
| 134 | + to metadata from the database. This is essential for online |
124 | 135 | news agencies, web portals, digital libraries, etc. Most search
|
125 |
| - engines available utilize an inverted index which is very fast |
126 |
| - for searching but very slow for online updates. Incremental |
127 |
| - updates of an inverted index is a complex engineering task |
128 |
| - while we needed something light, free and with the ability to |
129 |
| - access metadata from the database. The last requirement is very |
130 |
| - important because in a real life application a search engine |
131 |
| - should always consult metadata ( topic, permissions, date |
132 |
| - range, version, etc.). We extensively use PostgreSQL as a |
133 |
| - database backend and have no intention to move from it, so the |
134 |
| - problem was to find a data structure and a fast way to access |
135 |
| - it. PostgreSQL has rather unique data type for storing sets |
136 |
| - (think about words) - arrays, but lacks index access to them. A |
137 |
| - document is parsed into lexemes, which are identified in |
138 |
| - various ways (e.g. stemming, morphology, dictionary), and as a |
139 |
| - result is reduced to an array of integer numbers. During our |
140 |
| - research we found a paper of Joseph Hellerstein which |
141 |
| - introduced an interesting data structure suitable for sets - |
142 |
| - RD-tree (Russian Doll tree). It looked very attractive, but |
143 |
| - implementing it in PostgreSQL seemed difficult because of our |
144 |
| - ignorance of database internals. Further research lead us to |
145 |
| - the idea to use GiST for implementing RD-tree, but at that time |
146 |
| - the GiST code had for a long while remained untouched and |
147 |
| - contained several bugs. After work on improving GiST for |
148 |
| - version 7.0.3 of PostgreSQL was done, we were able to implement |
149 |
| - RD-Tree and use it for index access to arrays of integers. This |
150 |
| - implementation was ideally suited for small arrays and |
151 |
| - eliminated complex joins, but was practically useless for |
152 |
| - indexing large arrays. The next improvement came from an idea |
153 |
| - to represent a document by a single bit-signature, a so-called |
154 |
| - superimposed signature (see "Index Structures for Databases |
155 |
| - Containing Data Items with Set-valued Attributes", 1997, Sven |
156 |
| - Helmer for details). We developeded the contrib/intarray module |
157 |
| - and used it for full text indexing. |
158 |
| - |
| 136 | + engines available utilize an inverted index which is very fast |
| 137 | + for searching but very slow for online updates. Incremental |
| 138 | + updates of an inverted index are a complex engineering task |
| 139 | + while we needed something light, free and with the ability to |
| 140 | + access metadata from the database. The last requirement was |
| 141 | + very important because in a real life application a search |
| 142 | + engine should always consult metadata (topic, permissions, date |
| 143 | + range, version, etc.). We extensively use PostgreSQL as a |
| 144 | + database backend and have no intention to move from it, so the |
| 145 | + problem was to find a data structure and a fast way to access |
| 146 | + it. PostgreSQL has a rather unique data type for storing sets |
| 147 | + (think about words) - arrays, but lacks index access to them. |
| 148 | + During our research we found a paper by Joseph Hellerstein, who |
| 149 | + introduced an interesting data structure suitable for sets - |
| 150 | + RD-tree (Russian Doll tree). Further research led us to the |
| 151 | + idea of using GiST to implement RD-tree, but at that time the |
| 152 | + GiST code had been untouched for a long time and contained several |
| 153 | + bugs. After work on improving GiST for version 7.0.3 of |
| 154 | + PostgreSQL was done, we were able to implement RD-Tree and use |
| 155 | + it for index access to arrays of integers. This implementation |
| 156 | + was ideally suited for small arrays and eliminated complex |
| 157 | + joins, but was practically useless for indexing large arrays. |
| 158 | + The next improvement came from an idea to represent a document |
| 159 | + by a single bit-signature, a so-called superimposed signature |
| 160 | + (see "Index Structures for Databases Containing Data Items with |
| 161 | + Set-valued Attributes", 1997, Sven Helmer for details). We |
| 162 | + developed the contrib/intarray module and used it for full |
| 163 | + text indexing. |
| 164 | + |
159 | 165 | tsearch v1
|
160 | 166 | It was inconvenient to use integer id's instead of words, so we
|
161 |
| - introduced a new data type called 'txtidx' - a searchable data |
162 |
| - type (textual) with indexed access. This was a first step of |
163 |
| - our work on an implementation of a built-in PostgreSQL full |
| 167 | + introduced a new data type called 'txtidx' - a searchable data |
| 168 | + type (textual) with indexed access. This was a first step of |
| 169 | + our work on an implementation of a built-in PostgreSQL full |
164 | 170 | text search engine. Even though tsearch v1 had many features of
|
165 |
| - a search engine it lacked configuration support and relevance |
166 |
| - ranking. People were encouraged to use OpenFTS, which provided |
167 |
| - relevance ranking based on coordinate information and flexible |
168 |
| - configuration. OpenFTS v.0.34 is the last version based on |
| 171 | + a search engine it lacked configuration support and relevance |
| 172 | + ranking. People were encouraged to use OpenFTS, which provided |
| 173 | + relevance ranking based on positional information and flexible |
| 174 | + configuration. OpenFTS v.0.34 is the last version based on |
169 | 175 | tsearch v1.
|
170 |
| - |
| 176 | + |
171 | 177 | tsearch V2
|
172 |
| - People recognized tsearch as a powerful tool for full text |
173 |
| - searching and insisted on adding ranking support, better |
174 |
| - configurability, etc. We already thought about moving most of |
175 |
| - the features of OpenFTS to tsearch, and in the early 2003 we |
176 |
| - decided to work on a new version of tsearch - tsearch v2. We've |
177 |
| - abandoned auxiliary index tables which were used by OpenFTS to |
178 |
| - store coordinate information and modified the txtidx type to |
179 |
| - store them internally. Also, we've added table-driven |
180 |
| - configuration, support of ispell dictionaries, snowball |
181 |
| - stemmers and the ability to specify which types of lexemes to |
182 |
| - index. Also, it's now possible to generate headlines of |
183 |
| - documents with highlighted search terms. These changes make |
184 |
| - tsearch more user friendly and turn it into a really powerful |
185 |
| - full text search engine. After announcing the alpha version, we |
186 |
| - received a proposal from Brandon Rhodes to rename tsearch |
187 |
| - functions to be more consistent. So, we have renamed txtidx |
188 |
| - type to tsvector and other things as well. |
189 |
| - |
190 |
| - To allow users of tsearch v1 smooth upgrade, we named the module as |
191 |
| - tsearch2. |
192 |
| - |
193 |
| - Future release of OpenFTS (v.0.35) will be based on tsearch2. Brave |
194 |
| - people could download it from OpenFTS CVS (see link from [20][OpenFTS |
195 |
| - page] |
| 178 | + People recognized tsearch as a powerful tool for full text |
| 179 | + searching and insisted on adding ranking support, better |
| 180 | + configurability, etc. We already thought about moving most of |
| 181 | + the features of OpenFTS to tsearch, and in early 2003 we |
| 182 | + decided to work on a new version of tsearch. We abandoned |
| 183 | + auxiliary index tables which were used by OpenFTS to store |
| 184 | + positional information and modified the txtidx type to store |
| 185 | + them internally. We added table-driven configuration, support |
| 186 | + of ispell dictionaries, snowball stemmers and the ability to |
| 187 | + specify which types of lexemes to index. Now, it's possible to |
| 188 | + generate headlines of documents with highlighted search terms. |
| 189 | + These changes make tsearch more user friendly and turn it into |
| 190 | + a really powerful full text search engine. Brandon Rhodes |
| 191 | + proposed to rename tsearch functions for consistency and we |
| 192 | + renamed txtidx type to tsvector and other things as well. To |
| 193 | + allow users of tsearch v1 a smooth upgrade, we named the module |
| 194 | + as tsearch2. Since version 0.35 OpenFTS uses tsearch2. |
196 | 195 |
|
197 | 196 | References
|
198 | 197 |
|
199 |
| - 10. http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/Tsearch_V2_Readme.html |
200 |
| - 11. http://www.sai.msu.su/~megera/oddmuse/index.cgi/Tsearch_V2_Readme#Documentation_Roadmap |
201 |
| - 12. http://www.sai.msu.su/~megera/postgres/gist |
202 |
| - 13. http://openfts.sourceforge.net/ |
203 |
| - 14. http://sourceforge.net/mailarchive/forum.php?forum=openfts-general |
204 |
| - 15. http://www.sai.msu.su/~megera/oddmuse/index.cgi?action=anchor&id=Documentation_Roadmap#Documentation_Roadmap |
205 |
| - 16. http://www.sai.msu.su/~megera/oddmuse/index.cgi?Gendict |
206 |
| - 17. http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/ |
207 |
| - 18. http://sourceforge.net/mailarchive/forum.php?forum=openfts-general |
208 |
| - 19. http://archives.postgresql.org/pgsql-general/ |
209 |
| - 20. http://openfts.sourceforge.net/ |
| 198 | + 1. http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/Tsearch_V2_Readme.html |
| 199 | + 2. http://snowball.tartarus.org/ |
| 200 | + 3. http://openfts.sourceforge.net/ |
| 201 | + 4. file://localhost/u/megera/WWW/postgres/gist/tsearch/V2/docs/Tsearch_V2_Readme82.html#dm |
| 202 | + 5. http:www.jfg-networks.com/ |
| 203 | + 6. http://www.sai.msu.su/~megera/postgres/gist |
| 204 | + 7. http://www.sigaev.ru/gin/ |
| 205 | + 8. http://openfts.sourceforge.net/ |
| 206 | + 9. http://sourceforge.net/mailarchive/forum.php?forum=openfts-general |
| 207 | + 10. http://www.sai.msu.su/~megera/wiki/Gendict |
| 208 | + 11. http://www.sai.msu.su/~megera/wiki/Tsearch2 |
| 209 | + 12. http://sourceforge.net/mailarchive/forum.php?forum=openfts-general |
| 210 | + 13. http://archives.postgresql.org/pgsql-general/ |
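
The README text in this diff describes tsvector as a set of unique words along with their positional information, and the Limitations list caps positional information below 16383 and allows no more than 256 positions per lexeme. The following is only a toy Python model of that description, not tsearch2's actual implementation: the names `to_toy_tsvector` and `contains` are hypothetical, tokenization is assumed to be plain lowercasing plus whitespace splitting (no parser, dictionaries, stemming or stop words), and silently dropping positions beyond the limits is a choice made for this sketch only.

```python
# Toy model of the tsvector idea: unique words mapped to their positions.
# The two limits are taken from the Limitations list; everything else here
# is an illustrative assumption and not how tsearch2 stores its data.
MAX_POSITION = 16383            # positions must stay below this bound
MAX_POSITIONS_PER_LEXEME = 256  # no more than 256 positions per lexeme


def to_toy_tsvector(text):
    """Return a dict mapping each unique lowercased word to its 1-based positions."""
    vector = {}
    for position, word in enumerate(text.lower().split(), start=1):
        positions = vector.setdefault(word, [])
        # In this sketch, positions beyond either limit are simply not recorded.
        if position < MAX_POSITION and len(positions) < MAX_POSITIONS_PER_LEXEME:
            positions.append(position)
    return vector


def contains(vector, *words):
    """Toy 'contains' check: True if every given word occurs in the vector (AND query)."""
    return all(word.lower() in vector for word in words)


if __name__ == "__main__":
    doc = to_toy_tsvector("a fat cat sat on a mat")
    print(doc["a"])                     # positions of the word "a"
    print(contains(doc, "fat", "cat"))  # both words are present
```

Keeping positions next to each word is what lets ranking take proximity into account, while the membership test alone is enough for a boolean AND query.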