Commit 092ed29

committed
New README, forgotten when docs was updated
1 parent 0c96e42 commit 092ed29

1 file changed

+167
-166
lines changed

contrib/tsearch2/README.tsearch2

Lines changed: 167 additions & 166 deletions
@@ -1,209 +1,210 @@
11
Tsearch2 - full text search extension for PostgreSQL
22

3-
[10][Online version] of this document is available
4-
5-
This module is sponsored by Delta-Soft Ltd., Moscow, Russia.
6-
7-
Notice: This version is fully incompatible with old tsearch (V1),
8-
which was deprecated in 7.4 and obsoleted in 8.0.
9-
10-
The Tsearch2 contrib module contains an implementation of a new data
11-
type tsvector - a searchable data type with indexed access. In a
12-
nutshell, tsvector is a set of unique words along with their
13-
positional information in the document, organized in a special
14-
structure optimized for fast access and lookup. Actually, each word
15-
entry, besides its position in the document, could have a weight
16-
attribute, describing importance of this word (at a specific) position
17-
in document. A set of bit-signatures of a fixed length, representing
18-
tsvectors, are stored in a search tree (developed using PostgreSQL
19-
GiST), which provides online update of full text index and fast query
20-
lookup. The module provides indexed access methods, queries,
21-
operations and supporting routines for the tsvector data type and easy
22-
conversion of text data to tsvector. Table driven configuration allows
23-
creation of custom configuration optimized for specific searches using
3+
[1]Online version of this document is available
4+
5+
Tsearch2 - is the full text engine, fully integrated into PostgreSQL
6+
RDBMS.
7+
8+
Main features
9+
10+
* Full online update
11+
* Supports multiple table driven configurations
12+
* flexible and rich linguistic support (dictionaries, stop words),
13+
thesaurus
14+
* full multibyte (UTF-8) support
15+
* Sophisticated ranking functions with support of proximity and
16+
structure information (rank, rank_cd)
17+
* Index support (GiST and Gin) with concurrency and recovery support
18+
* Rich query language with query rewriting support
19+
* Headline support (text fragments with highlighted search terms)
20+
* Ability to plug-in custom dictionaries and parsers
21+
* Template generator for tsearch2 dictionaries with [2]snowball
22+
stemmer support
23+
* It is mature (5 years of development)
24+
25+
Tsearch2, in a nutshell, provides FTS operator (contains) for the new
26+
data types, representing document (tsvector) and query (tsquery).
27+
Table driven configuration allows creation of custom searches using
2428
standard SQL commands.
25-
26-
Configuration allows you to:
27-
* specify the type of lexemes to be indexed and the way they are
28-
processed.
29-
* specify dictionaries to be used along with stop words recognition.
30-
* specify the parser used to process a document.
31-
32-
See [11]Documentation Roadmap for links to documentation.
29+
30+
tsvector is a searchable data type, representing document. It is a set
31+
of unique words along with their positional information in the
32+
document, organized in a special structure optimized for fast access
33+
and lookup. Each entry could be labelled to reflect its importance in
34+
document.
35+
36+
tsquery is a data type for textual queries with support of boolean
37+
operators. It consists of lexemes (optionally labelled) with boolean
38+
operators between.
39+
40+
Table driven configuration allows to specify:
41+
* parser, which used to break document onto lexemes
42+
* what lexemes to index and the way they are processed
43+
* dictionaries to be used along with stop words recognition.
3344

3445
OpenFTS vs Tsearch2
3546

36-
OpenFTS is a middleware between application and database, so it uses
37-
tsearch2 as a storage, while database engine is used as a query executor
38-
(searching). Everything else (parsing of documents, query processing,
39-
linguistics) carry outs on client side. That's why OpenFTS has its own
40-
configuration table (fts_conf) and works with its own set of dictionaries.
41-
OpenFTS is more flexible, because it could be used in multi-server
42-
architecture with separated machines for repository of documents
43-
(documents could be stored in file system), database and query engine.
47+
[3]OpenFTS is a middleware between application and database. OpenFTS
48+
uses tsearch2 as a storage and database engine as a query executor
49+
(searching). Everything else, i.e. parsing of documents, query
50+
processing, linguistics, carry outs on client side. That's why OpenFTS
51+
has its own configuration table (fts_conf) and works with its own set
52+
of dictionaries. OpenFTS is more flexible, because it could be used in
53+
multi-server architecture with separate machines for repository of
54+
documents (documents could be stored in filesystem), database and
55+
query engine.
56+
57+
See [4]Documentation Roadmap for links to documentation.
4458

4559
Authors
4660

4761
* Oleg Bartunov <oleg@sai.msu.su>, Moscow, Moscow University, Russia
48-
* Teodor Sigaev <teodor@sigaev.ru>, Moscow, Delta-Soft Ltd.,Russia
49-
62+
* Teodor Sigaev <teodor@sigaev.ru>, Moscow,Moscow University,Russia
63+
5064
Contributors
5165

52-
* Robert John Shepherd and Andrew J. Kopciuch submitted
53-
"Introduction to tsearch" (Robert - tsearch v1, Andrew - tsearch
66+
* Robert John Shepherd and Andrew J. Kopciuch submitted
67+
"Introduction to tsearch" (Robert - tsearch v1, Andrew - tsearch
5468
v2)
55-
* Brandon Craig Rhodes wrote "Tsearch2 Guide" and "Tsearch2
69+
* Brandon Craig Rhodes wrote "Tsearch2 Guide" and "Tsearch2
5670
Reference" and proposed new naming convention for tsearch V2
57-
58-
Features Added with Tsearch2
5971

60-
* Relevance ranking of search results
61-
* Table driven configuration
62-
* Morphology support (ispell dictionaries, snowball stemmers)
63-
* Headline support (text fragments with highlighted search terms)
64-
* Ability to plug-in custom dictionaries and parsers
65-
* Synonym dictionary
66-
* Generator of templates for dictionaries (built-in snowball stemmer
67-
support)
68-
* Statistics of indexed words is available
69-
72+
Sponsors
73+
74+
* ABC Startsiden - compound words support
75+
* University of Mannheim for UTF-8 support (in 8.2)
76+
* jfg:networks ([5]http:www.jfg-networks.com/) for Gin - Generalized
77+
Inverted index (in 8.2)
78+
* Georgia Public Library Service and LibLime, Inc. for Thesaurus
79+
dictionary
80+
* PostGIS community - GiST Concurrency and Recovery
81+
82+
The authors are grateful to the Russian Foundation for Basic Research
83+
and Delta-Soft Ltd., Moscow, Russia for support.
84+
7085
Limitations
7186

72-
* Lexeme should be not longer than 2048 bytes
73-
* The number of lexemes is limited by 2^32. Note, that actual
74-
capacity of tsvector is depends on whether positional information
75-
is stored or not.
76-
* tsvector - the size is limited by approximately 2^20 bytes.
77-
* tsquery - the number of entries (lexemes and operations) < 32768
78-
* Positional information
79-
+ maximal position of lexeme < 2^14 (16384)
80-
+ lexeme could have maximum 256 positions
81-
87+
* Length of lexeme < 2K
88+
* Length of tsvector (lexemes + positions) < 1Mb
89+
* The number of lexemes < 4^32
90+
* 0< Positional information < 16383
91+
* No more than 256 positions per lexeme
92+
* The number of nodes ( lexemes + operations) in tsquery < 32768
93+
8294
References
8395

8496
* GiST development site -
85-
[12]http://www.sai.msu.su/~megera/postgres/gist
86-
* OpenFTS home page - [13]http://openfts.sourceforge.net/
97+
[6]http://www.sai.msu.su/~megera/postgres/gist
98+
* GiN development - [7]http://www.sigaev.ru/gin/
99+
* OpenFTS home page - [8]http://openfts.sourceforge.net/
87100
* Mailing list -
88-
[14]http://sourceforge.net/mailarchive/forum.php?forum=openfts-gen
89-
eral
90-
91-
[15]Documentation Roadmap
92-
101+
[9]http://sourceforge.net/mailarchive/forum.php?forum=openfts-gene
102+
ral
103+
93104
Documentation Roadmap
94105

95106
* Several docs are available from docs/ subdirectory
96107
+ "Tsearch V2 Introduction" by Andrew Kopciuch
97108
+ "Tsearch2 Guide" by Brandon Rhodes
98109
+ "Tsearch2 Reference" by Brandon Rhodes
99110
* Readme.gendict in gendict/ subdirectory
100-
+ [16][Gendict tutorial]
101-
102-
Online version of documentation is always available from Tsearch V2
103-
home page -
104-
[17]http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/
105-
111+
+ Also, check [10]Gendict tutorial
112+
* Check [11]tsearch2 Wiki pages for various documentation
113+
106114
Support
107115

108-
Authors urgently recommend people to use [18][openfts-general] or
109-
[19][pgsql-general] mailing lists for questions and discussions.
110-
111-
Caution
116+
Authors urgently recommend people to use [12]openfts-general or
117+
[13]pgsql-general mailing lists for questions and discussions.
112118

113-
In spite of apparent easy full text searching with our tsearch module
114-
(authors hope it's so), any serious search engine require profound
115-
study of various aspects, such as stop words, dictionaries, special
116-
parsers. Tsearch module was designed to facilitate both those cases.
117-
118119
Development History
119120

121+
Latest news
122+
123+
To the PostgreSQL 8.2 release we added:
124+
* multibyte (UTF-8) support
125+
* Thesaurus dictionary
126+
* Query rewriting
127+
* rank_cd relevation function now support different weights of
128+
lexemes
129+
* GiN support adds scalability of tsearch2
130+
120131
Pre-tsearch era
121-
Development of OpenFTS began in 2000 after realizing that we
122-
needed a search engine optimized for online updates and able to
123-
access metadata from the database. This is essential for online
132+
Development of OpenFTS began in 2000 after realizing that we
133+
need a search engine optimized for online updates with access
134+
to metadata from the database. This is essential for online
124135
news agencies, web portals, digital libraries, etc. Most search
125-
engines available utilize an inverted index which is very fast
126-
for searching but very slow for online updates. Incremental
127-
updates of an inverted index is a complex engineering task
128-
while we needed something light, free and with the ability to
129-
access metadata from the database. The last requirement is very
130-
important because in a real life application a search engine
131-
should always consult metadata ( topic, permissions, date
132-
range, version, etc.). We extensively use PostgreSQL as a
133-
database backend and have no intention to move from it, so the
134-
problem was to find a data structure and a fast way to access
135-
it. PostgreSQL has rather unique data type for storing sets
136-
(think about words) - arrays, but lacks index access to them. A
137-
document is parsed into lexemes, which are identified in
138-
various ways (e.g. stemming, morphology, dictionary), and as a
139-
result is reduced to an array of integer numbers. During our
140-
research we found a paper of Joseph Hellerstein which
141-
introduced an interesting data structure suitable for sets -
142-
RD-tree (Russian Doll tree). It looked very attractive, but
143-
implementing it in PostgreSQL seemed difficult because of our
144-
ignorance of database internals. Further research lead us to
145-
the idea to use GiST for implementing RD-tree, but at that time
146-
the GiST code had for a long while remained untouched and
147-
contained several bugs. After work on improving GiST for
148-
version 7.0.3 of PostgreSQL was done, we were able to implement
149-
RD-Tree and use it for index access to arrays of integers. This
150-
implementation was ideally suited for small arrays and
151-
eliminated complex joins, but was practically useless for
152-
indexing large arrays. The next improvement came from an idea
153-
to represent a document by a single bit-signature, a so-called
154-
superimposed signature (see "Index Structures for Databases
155-
Containing Data Items with Set-valued Attributes", 1997, Sven
156-
Helmer for details). We developeded the contrib/intarray module
157-
and used it for full text indexing.
158-
136+
engines available utilize an inverted index which is very fast
137+
for searching but very slow for online updates. Incremental
138+
updates of an inverted index is a complex engineering task
139+
while we needed something light, free and with the ability to
140+
access metadata from the database. The last requirement was
141+
very important because in a real life application search engine
142+
should always consult metadata ( topic, permissions, date
143+
range, version, etc.). We extensively use PostgreSQL as a
144+
database backend and have no intention to move from it, so the
145+
problem was to find a data structure and a fast way to access
146+
it. PostgreSQL has rather unique data type for storing sets
147+
(think about words) - arrays, but lacks index access to them.
148+
During our research we found a paper of Joseph Hellerstein, who
149+
introduced an interesting data structure suitable for sets -
150+
RD-tree (Russian Doll tree). Further research lead us to the
151+
idea to use GiST for implementing RD-tree, but at that time the
152+
GiST code was intouched for a long time and contained several
153+
bugs. After work on improving GiST for version 7.0.3 of
154+
PostgreSQL was done, we were able to implement RD-Tree and use
155+
it for index access to arrays of integers. This implementation
156+
was ideally suited for small arrays and eliminated complex
157+
joins, but was practically useless for indexing large arrays.
158+
The next improvement came from an idea to represent a document
159+
by a single bit-signature, a so-called superimposed signature
160+
(see "Index Structures for Databases Containing Data Items with
161+
Set-valued Attributes", 1997, Sven Helmer for details). We
162+
developeded the contrib/intarray module and used it for full
163+
text indexing.
164+
159165
tsearch v1
160166
It was inconvenient to use integer id's instead of words, so we
161-
introduced a new data type called 'txtidx' - a searchable data
162-
type (textual) with indexed access. This was a first step of
163-
our work on an implementation of a built-in PostgreSQL full
167+
introduced a new data type called 'txtidx' - a searchable data
168+
type (textual) with indexed access. This was a first step of
169+
our work on an implementation of a built-in PostgreSQL full
164170
text search engine. Even though tsearch v1 had many features of
165-
a search engine it lacked configuration support and relevance
166-
ranking. People were encouraged to use OpenFTS, which provided
167-
relevance ranking based on coordinate information and flexible
168-
configuration. OpenFTS v.0.34 is the last version based on
171+
a search engine it lacked configuration support and relevance
172+
ranking. People were encouraged to use OpenFTS, which provided
173+
relevance ranking based on positional information and flexible
174+
configuration. OpenFTS v.0.34 is the last version based on
169175
tsearch v1.
170-
176+
171177
tsearch V2
172-
People recognized tsearch as a powerful tool for full text
173-
searching and insisted on adding ranking support, better
174-
configurability, etc. We already thought about moving most of
175-
the features of OpenFTS to tsearch, and in the early 2003 we
176-
decided to work on a new version of tsearch - tsearch v2. We've
177-
abandoned auxiliary index tables which were used by OpenFTS to
178-
store coordinate information and modified the txtidx type to
179-
store them internally. Also, we've added table-driven
180-
configuration, support of ispell dictionaries, snowball
181-
stemmers and the ability to specify which types of lexemes to
182-
index. Also, it's now possible to generate headlines of
183-
documents with highlighted search terms. These changes make
184-
tsearch more user friendly and turn it into a really powerful
185-
full text search engine. After announcing the alpha version, we
186-
received a proposal from Brandon Rhodes to rename tsearch
187-
functions to be more consistent. So, we have renamed txtidx
188-
type to tsvector and other things as well.
189-
190-
To allow users of tsearch v1 smooth upgrade, we named the module as
191-
tsearch2.
192-
193-
Future release of OpenFTS (v.0.35) will be based on tsearch2. Brave
194-
people could download it from OpenFTS CVS (see link from [20][OpenFTS
195-
page]
178+
People recognized tsearch as a powerful tool for full text
179+
searching and insisted on adding ranking support, better
180+
configurability, etc. We already thought about moving most of
181+
the features of OpenFTS to tsearch, and in the early 2003 we
182+
decided to work on a new version of tsearch. We abandoned
183+
auxiliary index tables which were used by OpenFTS to store
184+
positional information and modified the txtidx type to store
185+
them internally. We added table-driven configuration, support
186+
of ispell dictionaries, snowball stemmers and the ability to
187+
specify which types of lexemes to index. Now, it's possible to
188+
generate headlines of documents with highlighted search terms.
189+
These changes make tsearch more user friendly and turn it into
190+
a really powerful full text search engine. Brandon Rhodes
191+
proposed to rename tsearch functions for consistency and we
192+
renamed txtidx type to tsvector and other things as well. To
193+
allow users of tsearch v1 smooth upgrade, we named the module
194+
as tsearch2. Since version 0.35 OpenFTS uses tsearch2.
196195

197196
References
198197

199-
10. http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/Tsearch_V2_Readme.html
200-
11. http://www.sai.msu.su/~megera/oddmuse/index.cgi/Tsearch_V2_Readme#Documentation_Roadmap
201-
12. http://www.sai.msu.su/~megera/postgres/gist
202-
13. http://openfts.sourceforge.net/
203-
14. http://sourceforge.net/mailarchive/forum.php?forum=openfts-general
204-
15. http://www.sai.msu.su/~megera/oddmuse/index.cgi?action=anchor&id=Documentation_Roadmap#Documentation_Roadmap
205-
16. http://www.sai.msu.su/~megera/oddmuse/index.cgi?Gendict
206-
17. http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/
207-
18. http://sourceforge.net/mailarchive/forum.php?forum=openfts-general
208-
19. http://archives.postgresql.org/pgsql-general/
209-
20. http://openfts.sourceforge.net/
198+
1. http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/Tsearch_V2_Readme.html
199+
2. http://snowball.tartarus.org/
200+
3. http://openfts.sourceforge.net/
201+
4. file://localhost/u/megera/WWW/postgres/gist/tsearch/V2/docs/Tsearch_V2_Readme82.html#dm
202+
5. http:www.jfg-networks.com/
203+
6. http://www.sai.msu.su/~megera/postgres/gist
204+
7. http://www.sigaev.ru/gin/
205+
8. http://openfts.sourceforge.net/
206+
9. http://sourceforge.net/mailarchive/forum.php?forum=openfts-general
207+
10. http://www.sai.msu.su/~megera/wiki/Gendict
208+
11. http://www.sai.msu.su/~megera/wiki/Tsearch2
209+
12. http://sourceforge.net/mailarchive/forum.php?forum=openfts-general
210+
13. http://archives.postgresql.org/pgsql-general/
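
The new README text describes tsvector as a set of unique words with their positions in the document, and tsquery as lexemes joined by boolean operators. The Python sketch below is only a toy model of that description, not tsearch2's implementation or API: the names `to_vector` and `matches` and the nested-tuple query form are hypothetical, and the sketch does no stop-word removal, stemming, or weighting.

```python
import re
from collections import defaultdict


def to_vector(text):
    """Toy stand-in for a tsvector: unique lowercased words -> 1-based positions."""
    vec = defaultdict(list)
    for pos, word in enumerate(re.findall(r"[a-z0-9]+", text.lower()), start=1):
        vec[word].append(pos)
    return dict(vec)


def matches(vec, query):
    """Evaluate a toy boolean query against a vector.

    A query is either a bare word, ('not', q), or ('and' | 'or', q1, q2).
    This tuple form is an assumption made for illustration only.
    """
    if isinstance(query, str):
        return query in vec
    op = query[0]
    if op == "and":
        return matches(vec, query[1]) and matches(vec, query[2])
    if op == "or":
        return matches(vec, query[1]) or matches(vec, query[2])
    if op == "not":
        return not matches(vec, query[1])
    raise ValueError(f"unknown operator: {op!r}")


doc = to_vector("The cat sat on the mat")
print(doc)
print(matches(doc, ("and", "cat", ("not", "dog"))))
```

A word that occurs twice, such as "the" here, is stored once with both positions. That is the "unique words along with their positional information" idea in miniature.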
