
Commit e9ea125

This patch adds the following to the FTI module:

* The ability to index more than one column in a table with a single trigger.
* All uses of sprintf changed to snprintf to prevent users from crashing Postgres.
* Error messages made more consistent.
* Some changes made to bring it into line with coding requirements for triggers specified in the docs (i.e. check you're a trigger before casting your context).
* The perl script that generates indices has been updated to support indexing multiple columns in a table.
* Fairly well tested in our development environment indexing a food database's brand and description fields. The size of the fti index is around 300,000 rows.
* All docs and examples upgraded. This includes specifying more efficient index usage than was specified before, better examples that don't produce duplicates, etc.

Christopher Kings-Lynne & Brett
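As a minimal sketch of the sprintf-to-snprintf change mentioned in the commit message: the helper below is hypothetical and is not taken from fti.c. It assumes the module formats SQL text into a fixed-size buffer, and shows how snprintf bounds the write and lets the caller detect truncation. The column names `string` and `id` come from the README; the function name, parameters, and return convention are assumptions, and escaping of the word is omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper, not taken from fti.c: format an INSERT statement
 * into a caller-supplied buffer. Returns 0 on success, or -1 if the
 * statement did not fit (the buffer then holds a truncated,
 * NUL-terminated string). Quoting and escaping are left out.
 */
static int
build_insert_query(char *buf, size_t buflen,
                   const char *table, const char *word, unsigned int oid)
{
    int n;

    assert(buf != NULL && buflen > 0);

    /* snprintf writes at most buflen bytes, including the NUL;
     * sprintf would keep writing past the end of buf. */
    n = snprintf(buf, buflen,
                 "INSERT INTO %s (string, id) VALUES ('%s', %u)",
                 table, word, oid);

    /* A return value >= buflen means the output was truncated. */
    if (n < 0 || (size_t) n >= buflen)
        return -1;
    return 0;
}
```

With this shape, an overlong word makes the call fail instead of overrunning the buffer, so the caller can report an error rather than crash the backend.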
1 parent 16365ac commit e9ea125

3 files changed

+190
-142
lines changed


contrib/fulltextindex/README.fti

+17 -13
@@ -3,29 +3,31 @@ An attempt at some sort of Full Text Indexing for PostgreSQL.
 The included software is an attempt to add some sort of Full Text Indexing
 support to PostgreSQL. I mean by this that we can ask questions like:
 
-Give me all rows that have 'still' and 'nash' in the 'artist' field.
+Give me all rows that have 'still' and 'nash' in the 'artist' or 'title'
+fields.
 
 Ofcourse we can write this as:
 
-select * from cds where artist ~* 'stills' and artist ~* 'nash';
+select * from cds where (artist ~* 'stills' or title ~* 'stills') and
+(artist ~* 'nash' or title ~* 'nash');
 
 But this does not use any indices, and therefore, if your database
 gets very large, it will not have very high performance (the above query
 requires at least one sequential scan, it probably takes 2 due to the
 self-join).
 
 The approach used by this add-on is to define a trigger on the table and
-column you want to do this queries on. On every insert in the table, it
-takes the value in the specified column, breaks the text in this column
+columns you want to do this queries on. On every insert in the table, it
+takes the value in the specified columns, breaks the text in these columns
 up into pieces, and stores all sub-strings into another table, together
 with a reference to the row in the original table that contained this
 sub-string (it uses the oid of that row).
 
 By now creating an index over the 'fti-table', we can search for
 substrings that occur in the original table. By making a join between
 the fti-table and the orig-table, we can get the actual rows we want
-(this can also be done by using subselects, and maybe there're other
-ways too).
+(this can also be done by using subselects - but subselects are currently
+inefficient in Postgres, and maybe there're other ways too).
 
 The trigger code also allows an array called StopWords, that prevents
 certain words from being indexed.
@@ -62,20 +64,22 @@ The create the function that contains the trigger::
 And finally define the trigger on the 'cds' table:
 
 create trigger cds-fti-trigger after update or insert or delete on cds
-for each row execute procedure fti(cds-fti, artist);
+for each row execute procedure fti(cds-fti, artist, title);
 
 Here, the trigger will be defined on table 'cds', it will create
-sub-strings from the field 'artist', and it will place those sub-strings
-in the table 'cds-fti'.
+sub-strings from the fields 'artist' and 'title', and it will place
+those sub-strings in the table 'cds-fti'.
 
 Now populate the table 'cds'. This will also populate the table 'cds-fti'.
-It's fastest to populate the table *before* you create the indices.
+It's fastest to populate the table *before* you create the indices. Use the
+supplied 'fti.pl' to assist you with this.
 
 Before you start using the system, you should at least have the following
 indices:
 
-create index cds-fti-idx on cds-fti (string, id);
-create index cds-oid-idx on cds (oid);
+create index cds-fti-idx on cds-fti (string); -- String matching
+create index cds-fti-idx on cds-fti (id); -- For deleting a cds row
+create index cds-oid-idx on cds (oid); -- For joining cds to cds-fti
 
 To get the most performance out of this, you should have 'cds-fti'
 clustered on disk, ie. all rows with the same sub-strings should be
@@ -109,7 +113,7 @@ clustered : same as above, only clustered : 4.501.321 rows
 A sequential scan of the artist_fti table (and thus also the clustered table)
 takes around 6:16 minutes....
 
-Unfortunately I cannot probide anybody else with this test-date, since I
+Unfortunately I cannot provide anybody else with this test-data, since I
 am not allowed to redistribute the data (it's a database being sold by
 a couple of wholesale companies). Anyways, it's megabytes, so you probably
 wouldn't want it in this distribution anyways.
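
The README text in this diff says the trigger breaks each column value into pieces and stores "all sub-strings" in the fti table. A minimal sketch of one way to do that for a single word follows. It assumes only suffixes are stored, so that a left-anchored match on the indexed `string` column can find any substring, and it assumes a minimum length of two characters. Neither assumption is stated in the excerpt above, and `word_suffixes` and `MIN_SUBSTRING_LEN` are hypothetical names, not identifiers from fti.c.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumed minimum sub-string length; not stated in the README excerpt. */
#define MIN_SUBSTRING_LEN 2

/*
 * Hypothetical helper: collect every suffix of `word` that is at least
 * MIN_SUBSTRING_LEN characters long. The pointers stored in `out` point
 * into `word` itself, so nothing is copied. Returns the number of
 * suffixes stored, capped at max_out.
 */
static size_t
word_suffixes(const char *word, const char *out[], size_t max_out)
{
    size_t len, start, count = 0;

    assert(word != NULL && out != NULL);

    len = strlen(word);
    for (start = 0;
         start + MIN_SUBSTRING_LEN <= len && count < max_out;
         start++)
        out[count++] = word + start;

    return count;
}
```

Under these assumptions the word 'stills' yields 'stills', 'tills', 'ills', 'lls' and 'ls'. Each would become one row in 'cds-fti' alongside the oid of the originating 'cds' row.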
