
Commit fd58231

Sync our Snowball stemmer dictionaries with current upstream.
We haven't touched these since text search functionality landed in core
in 2007 :-(. While the upstream project isn't a beehive of activity,
they do make additions and bug fixes from time to time. Update our
copies of these files.

Also update our documentation about how to keep things in sync, since
they're not making distribution tarballs these days. Fortunately,
their source code turns out to be a breeze to build.

Notable changes:

* The non-UTF8 version of the hungarian stemmer now works in LATIN2
  not LATIN1.

* New stemmers have appeared for arabic, indonesian, irish, lithuanian,
  nepali, and tamil. These all work in UTF8, and the indonesian and
  irish ones also work in LATIN1.

(There are some new stemmers that I did not incorporate, mainly because
their names don't match the underlying languages, suggesting that they're
not to be considered mainstream.)

Worth noting: the upstream Nepali dictionary was contributed by
Arthur Zakirov.

initdb forced because the contents of snowball_create.sql have
changed.

Still TODO: see about updating the stopword lists.

Arthur Zakirov, minor mods and doc work by me

Discussion: https://postgr.es/m/20180626122025.GA12647@zakirov.localdomain
Discussion: https://postgr.es/m/20180219140849.GA9050@zakirov.localdomain
1 parent b076eb7 commit fd58231

88 files changed: +13093 -7303 lines changed

doc/src/sgml/textsearch.sgml

+8 -2
@@ -3792,24 +3792,30 @@ Parser: "pg_catalog.default"
  List text search dictionaries (add <literal>+</literal> for more detail).
  <screen>
  =&gt; \dFd
- List of text search dictionaries
- Schema | Name | Description
+ List of text search dictionaries
+ Schema | Name | Description
  ------------+-----------------+-----------------------------------------------------------
+ pg_catalog | arabic_stem | snowball stemmer for arabic language
  pg_catalog | danish_stem | snowball stemmer for danish language
  pg_catalog | dutch_stem | snowball stemmer for dutch language
  pg_catalog | english_stem | snowball stemmer for english language
  pg_catalog | finnish_stem | snowball stemmer for finnish language
  pg_catalog | french_stem | snowball stemmer for french language
  pg_catalog | german_stem | snowball stemmer for german language
  pg_catalog | hungarian_stem | snowball stemmer for hungarian language
+ pg_catalog | indonesian_stem | snowball stemmer for indonesian language
+ pg_catalog | irish_stem | snowball stemmer for irish language
  pg_catalog | italian_stem | snowball stemmer for italian language
+ pg_catalog | lithuanian_stem | snowball stemmer for lithuanian language
+ pg_catalog | nepali_stem | snowball stemmer for nepali language
  pg_catalog | norwegian_stem | snowball stemmer for norwegian language
  pg_catalog | portuguese_stem | snowball stemmer for portuguese language
  pg_catalog | romanian_stem | snowball stemmer for romanian language
  pg_catalog | russian_stem | snowball stemmer for russian language
  pg_catalog | simple | simple dictionary: just lower case and check for stopword
  pg_catalog | spanish_stem | snowball stemmer for spanish language
  pg_catalog | swedish_stem | snowball stemmer for swedish language
+ pg_catalog | tamil_stem | snowball stemmer for tamil language
  pg_catalog | turkish_stem | snowball stemmer for turkish language
  </screen>
  </para>

src/backend/snowball/Makefile

+15 -1
@@ -23,51 +23,65 @@ OBJS= $(WIN32RES) dict_snowball.o api.o utilities.o \
      stem_ISO_8859_1_finnish.o \
      stem_ISO_8859_1_french.o \
      stem_ISO_8859_1_german.o \
-     stem_ISO_8859_1_hungarian.o \
+     stem_ISO_8859_1_indonesian.o \
+     stem_ISO_8859_1_irish.o \
      stem_ISO_8859_1_italian.o \
      stem_ISO_8859_1_norwegian.o \
      stem_ISO_8859_1_porter.o \
      stem_ISO_8859_1_portuguese.o \
      stem_ISO_8859_1_spanish.o \
      stem_ISO_8859_1_swedish.o \
+     stem_ISO_8859_2_hungarian.o \
      stem_ISO_8859_2_romanian.o \
      stem_KOI8_R_russian.o \
+     stem_UTF_8_arabic.o \
      stem_UTF_8_danish.o \
      stem_UTF_8_dutch.o \
      stem_UTF_8_english.o \
      stem_UTF_8_finnish.o \
      stem_UTF_8_french.o \
      stem_UTF_8_german.o \
      stem_UTF_8_hungarian.o \
+     stem_UTF_8_indonesian.o \
+     stem_UTF_8_irish.o \
      stem_UTF_8_italian.o \
+     stem_UTF_8_lithuanian.o \
+     stem_UTF_8_nepali.o \
      stem_UTF_8_norwegian.o \
      stem_UTF_8_porter.o \
      stem_UTF_8_portuguese.o \
      stem_UTF_8_romanian.o \
      stem_UTF_8_russian.o \
      stem_UTF_8_spanish.o \
      stem_UTF_8_swedish.o \
+     stem_UTF_8_tamil.o \
      stem_UTF_8_turkish.o

  # first column is language name and also name of dictionary for not-all-ASCII
  # words, second is name of dictionary for all-ASCII words
  # Note order dependency: use of some other language as ASCII dictionary
  # must come after creation of that language
  LANGUAGES= \
+     arabic arabic \
      danish danish \
      dutch dutch \
      english english \
      finnish finnish \
      french french \
      german german \
      hungarian hungarian \
+     indonesian indonesian \
+     irish irish \
      italian italian \
+     lithuanian lithuanian \
+     nepali nepali \
      norwegian norwegian \
      portuguese portuguese \
      romanian romanian \
      russian english \
      spanish spanish \
      swedish swedish \
+     tamil tamil \
      turkish turkish


src/backend/snowball/README

+34 -19
@@ -4,46 +4,61 @@ Snowball-Based Stemming
  =======================

  This module uses the word stemming code developed by the Snowball project,
- http://snowball.tartarus.org/
+ http://snowballstem.org (formerly http://snowball.tartarus.org)
  which is released by them under a BSD-style license.

- The files under src/backend/snowball/libstemmer/ and
- src/include/snowball/libstemmer/ are taken directly from their libstemmer_c
- distribution, with only some minor adjustments of file inclusions. Note
+ The Snowball project is not currently making formal releases; it's best
+ to pull from their git repository
+
+     git clone https://github.com/snowballstem/snowball.git
+
+ and then building the derived files is as simple as
+
+     cd snowball
+     make
+
+ At least on Linux, no platform-specific adjustment is needed.
+
+ Postgres' files under src/backend/snowball/libstemmer/ and
+ src/include/snowball/libstemmer/ are taken directly from the Snowball
+ files, with only some minor adjustments of file inclusions. Note
  that most of these files are in fact derived files, not master source.
- The master sources are in the Snowball language, and are available along
- with the Snowball-to-C compiler from the Snowball project. We choose to
- include the derived files in the PostgreSQL distribution because most
- installations will not have the Snowball compiler available.
+ The master sources are in the Snowball language, and are built using
+ the Snowball-to-C compiler that is also part of the Snowball project.
+ We choose to include the derived files in the PostgreSQL distribution
+ because most installations will not have the Snowball compiler available.
+
+ We are currently synced with the Snowball git commit
+ 1964ce688cbeca505263c8f77e16ed923296ce7a
+ of 2018-06-29.

- To update the PostgreSQL sources from a new Snowball libstemmer_c
- distribution:
+ To update the PostgreSQL sources from a new Snowball version:

- 1. Copy the *.c files in libstemmer_c/src_c/ to src/backend/snowball/libstemmer
+ 0. If you didn't do it already, "make -C snowball".
+
+ 1. Copy the *.c files in snowball/src_c/ to src/backend/snowball/libstemmer
  with replacement of "../runtime/header.h" by "header.h", for example

-     for f in libstemmer_c/src_c/*.c
+     for f in .../snowball/src_c/*.c
      do
          sed 's|\.\./runtime/header\.h|header.h|' $f >libstemmer/`basename $f`
      done

- (Alternatively, if you rebuild the stemmer files from the master Snowball
- sources, just omit "-r ../runtime" from the Snowball compiler switches.)
-
- 2. Copy the *.c files in libstemmer_c/runtime/ to
+ 2. Copy the *.c files in snowball/runtime/ to
  src/backend/snowball/libstemmer, and edit them to remove direct inclusions
  of system headers such as <stdio.h> --- they should only include "header.h".
  (This removal avoids portability problems on some platforms where <stdio.h>
  is sensitive to largefile compilation options.)

- 3. Copy the *.h files in libstemmer_c/src_c/ and libstemmer_c/runtime/
+ 3. Copy the *.h files in snowball/src_c/ and snowball/runtime/
  to src/include/snowball/libstemmer. At this writing the header files
  do not require any changes.

  4. Check whether any stemmer modules have been added or removed. If so, edit
  the OBJS list in Makefile, the list of #include's in dict_snowball.c, and the
- stemmer_modules[] table in dict_snowball.c.
+ stemmer_modules[] table in dict_snowball.c. You might also need to change
+ the LANGUAGES list in Makefile.

  5. The various stopword files in stopwords/ must be downloaded
- individually from pages on the snowball.tartarus.org website.
+ individually from pages on the snowballstem.org website.
  Be careful that these files must be stored in UTF-8 encoding.

src/backend/snowball/dict_snowball.c

+53 -33
@@ -32,30 +32,38 @@
  #include "snowball/libstemmer/stem_ISO_8859_1_finnish.h"
  #include "snowball/libstemmer/stem_ISO_8859_1_french.h"
  #include "snowball/libstemmer/stem_ISO_8859_1_german.h"
- #include "snowball/libstemmer/stem_ISO_8859_1_hungarian.h"
+ #include "snowball/libstemmer/stem_ISO_8859_1_indonesian.h"
+ #include "snowball/libstemmer/stem_ISO_8859_1_irish.h"
  #include "snowball/libstemmer/stem_ISO_8859_1_italian.h"
  #include "snowball/libstemmer/stem_ISO_8859_1_norwegian.h"
  #include "snowball/libstemmer/stem_ISO_8859_1_porter.h"
  #include "snowball/libstemmer/stem_ISO_8859_1_portuguese.h"
  #include "snowball/libstemmer/stem_ISO_8859_1_spanish.h"
  #include "snowball/libstemmer/stem_ISO_8859_1_swedish.h"
+ #include "snowball/libstemmer/stem_ISO_8859_2_hungarian.h"
  #include "snowball/libstemmer/stem_ISO_8859_2_romanian.h"
  #include "snowball/libstemmer/stem_KOI8_R_russian.h"
+ #include "snowball/libstemmer/stem_UTF_8_arabic.h"
  #include "snowball/libstemmer/stem_UTF_8_danish.h"
  #include "snowball/libstemmer/stem_UTF_8_dutch.h"
  #include "snowball/libstemmer/stem_UTF_8_english.h"
  #include "snowball/libstemmer/stem_UTF_8_finnish.h"
  #include "snowball/libstemmer/stem_UTF_8_french.h"
  #include "snowball/libstemmer/stem_UTF_8_german.h"
  #include "snowball/libstemmer/stem_UTF_8_hungarian.h"
+ #include "snowball/libstemmer/stem_UTF_8_indonesian.h"
+ #include "snowball/libstemmer/stem_UTF_8_irish.h"
  #include "snowball/libstemmer/stem_UTF_8_italian.h"
+ #include "snowball/libstemmer/stem_UTF_8_lithuanian.h"
+ #include "snowball/libstemmer/stem_UTF_8_nepali.h"
  #include "snowball/libstemmer/stem_UTF_8_norwegian.h"
  #include "snowball/libstemmer/stem_UTF_8_porter.h"
  #include "snowball/libstemmer/stem_UTF_8_portuguese.h"
  #include "snowball/libstemmer/stem_UTF_8_romanian.h"
  #include "snowball/libstemmer/stem_UTF_8_russian.h"
  #include "snowball/libstemmer/stem_UTF_8_spanish.h"
  #include "snowball/libstemmer/stem_UTF_8_swedish.h"
+ #include "snowball/libstemmer/stem_UTF_8_tamil.h"
  #include "snowball/libstemmer/stem_UTF_8_turkish.h"

  PG_MODULE_MAGIC;
@@ -74,48 +82,60 @@ typedef struct stemmer_module
      int (*stem) (struct SN_env *);
  } stemmer_module;

+ /* Args: stemmer name, PG code for encoding, Snowball's name for encoding */
+ #define STEMMER_MODULE(name,enc,senc) \
+     {#name, enc, name##_##senc##_create_env, name##_##senc##_close_env, name##_##senc##_stem}
+
  static const stemmer_module stemmer_modules[] =
  {
      /*
       * Stemmers list from Snowball distribution
       */
-     {"danish", PG_LATIN1, danish_ISO_8859_1_create_env, danish_ISO_8859_1_close_env, danish_ISO_8859_1_stem},
-     {"dutch", PG_LATIN1, dutch_ISO_8859_1_create_env, dutch_ISO_8859_1_close_env, dutch_ISO_8859_1_stem},
-     {"english", PG_LATIN1, english_ISO_8859_1_create_env, english_ISO_8859_1_close_env, english_ISO_8859_1_stem},
-     {"finnish", PG_LATIN1, finnish_ISO_8859_1_create_env, finnish_ISO_8859_1_close_env, finnish_ISO_8859_1_stem},
-     {"french", PG_LATIN1, french_ISO_8859_1_create_env, french_ISO_8859_1_close_env, french_ISO_8859_1_stem},
-     {"german", PG_LATIN1, german_ISO_8859_1_create_env, german_ISO_8859_1_close_env, german_ISO_8859_1_stem},
-     {"hungarian", PG_LATIN1, hungarian_ISO_8859_1_create_env, hungarian_ISO_8859_1_close_env, hungarian_ISO_8859_1_stem},
-     {"italian", PG_LATIN1, italian_ISO_8859_1_create_env, italian_ISO_8859_1_close_env, italian_ISO_8859_1_stem},
-     {"norwegian", PG_LATIN1, norwegian_ISO_8859_1_create_env, norwegian_ISO_8859_1_close_env, norwegian_ISO_8859_1_stem},
-     {"porter", PG_LATIN1, porter_ISO_8859_1_create_env, porter_ISO_8859_1_close_env, porter_ISO_8859_1_stem},
-     {"portuguese", PG_LATIN1, portuguese_ISO_8859_1_create_env, portuguese_ISO_8859_1_close_env, portuguese_ISO_8859_1_stem},
-     {"spanish", PG_LATIN1, spanish_ISO_8859_1_create_env, spanish_ISO_8859_1_close_env, spanish_ISO_8859_1_stem},
-     {"swedish", PG_LATIN1, swedish_ISO_8859_1_create_env, swedish_ISO_8859_1_close_env, swedish_ISO_8859_1_stem},
-     {"romanian", PG_LATIN2, romanian_ISO_8859_2_create_env, romanian_ISO_8859_2_close_env, romanian_ISO_8859_2_stem},
-     {"russian", PG_KOI8R, russian_KOI8_R_create_env, russian_KOI8_R_close_env, russian_KOI8_R_stem},
-     {"danish", PG_UTF8, danish_UTF_8_create_env, danish_UTF_8_close_env, danish_UTF_8_stem},
-     {"dutch", PG_UTF8, dutch_UTF_8_create_env, dutch_UTF_8_close_env, dutch_UTF_8_stem},
-     {"english", PG_UTF8, english_UTF_8_create_env, english_UTF_8_close_env, english_UTF_8_stem},
-     {"finnish", PG_UTF8, finnish_UTF_8_create_env, finnish_UTF_8_close_env, finnish_UTF_8_stem},
-     {"french", PG_UTF8, french_UTF_8_create_env, french_UTF_8_close_env, french_UTF_8_stem},
-     {"german", PG_UTF8, german_UTF_8_create_env, german_UTF_8_close_env, german_UTF_8_stem},
-     {"hungarian", PG_UTF8, hungarian_UTF_8_create_env, hungarian_UTF_8_close_env, hungarian_UTF_8_stem},
-     {"italian", PG_UTF8, italian_UTF_8_create_env, italian_UTF_8_close_env, italian_UTF_8_stem},
-     {"norwegian", PG_UTF8, norwegian_UTF_8_create_env, norwegian_UTF_8_close_env, norwegian_UTF_8_stem},
-     {"porter", PG_UTF8, porter_UTF_8_create_env, porter_UTF_8_close_env, porter_UTF_8_stem},
-     {"portuguese", PG_UTF8, portuguese_UTF_8_create_env, portuguese_UTF_8_close_env, portuguese_UTF_8_stem},
-     {"romanian", PG_UTF8, romanian_UTF_8_create_env, romanian_UTF_8_close_env, romanian_UTF_8_stem},
-     {"russian", PG_UTF8, russian_UTF_8_create_env, russian_UTF_8_close_env, russian_UTF_8_stem},
-     {"spanish", PG_UTF8, spanish_UTF_8_create_env, spanish_UTF_8_close_env, spanish_UTF_8_stem},
-     {"swedish", PG_UTF8, swedish_UTF_8_create_env, swedish_UTF_8_close_env, swedish_UTF_8_stem},
-     {"turkish", PG_UTF8, turkish_UTF_8_create_env, turkish_UTF_8_close_env, turkish_UTF_8_stem},
+     STEMMER_MODULE(danish, PG_LATIN1, ISO_8859_1),
+     STEMMER_MODULE(dutch, PG_LATIN1, ISO_8859_1),
+     STEMMER_MODULE(english, PG_LATIN1, ISO_8859_1),
+     STEMMER_MODULE(finnish, PG_LATIN1, ISO_8859_1),
+     STEMMER_MODULE(french, PG_LATIN1, ISO_8859_1),
+     STEMMER_MODULE(german, PG_LATIN1, ISO_8859_1),
+     STEMMER_MODULE(indonesian, PG_LATIN1, ISO_8859_1),
+     STEMMER_MODULE(irish, PG_LATIN1, ISO_8859_1),
+     STEMMER_MODULE(italian, PG_LATIN1, ISO_8859_1),
+     STEMMER_MODULE(norwegian, PG_LATIN1, ISO_8859_1),
+     STEMMER_MODULE(porter, PG_LATIN1, ISO_8859_1),
+     STEMMER_MODULE(portuguese, PG_LATIN1, ISO_8859_1),
+     STEMMER_MODULE(spanish, PG_LATIN1, ISO_8859_1),
+     STEMMER_MODULE(swedish, PG_LATIN1, ISO_8859_1),
+     STEMMER_MODULE(hungarian, PG_LATIN2, ISO_8859_2),
+     STEMMER_MODULE(romanian, PG_LATIN2, ISO_8859_2),
+     STEMMER_MODULE(russian, PG_KOI8R, KOI8_R),
+     STEMMER_MODULE(arabic, PG_UTF8, UTF_8),
+     STEMMER_MODULE(danish, PG_UTF8, UTF_8),
+     STEMMER_MODULE(dutch, PG_UTF8, UTF_8),
+     STEMMER_MODULE(english, PG_UTF8, UTF_8),
+     STEMMER_MODULE(finnish, PG_UTF8, UTF_8),
+     STEMMER_MODULE(french, PG_UTF8, UTF_8),
+     STEMMER_MODULE(german, PG_UTF8, UTF_8),
+     STEMMER_MODULE(hungarian, PG_UTF8, UTF_8),
+     STEMMER_MODULE(indonesian, PG_UTF8, UTF_8),
+     STEMMER_MODULE(irish, PG_UTF8, UTF_8),
+     STEMMER_MODULE(italian, PG_UTF8, UTF_8),
+     STEMMER_MODULE(lithuanian, PG_UTF8, UTF_8),
+     STEMMER_MODULE(nepali, PG_UTF8, UTF_8),
+     STEMMER_MODULE(norwegian, PG_UTF8, UTF_8),
+     STEMMER_MODULE(porter, PG_UTF8, UTF_8),
+     STEMMER_MODULE(portuguese, PG_UTF8, UTF_8),
+     STEMMER_MODULE(romanian, PG_UTF8, UTF_8),
+     STEMMER_MODULE(russian, PG_UTF8, UTF_8),
+     STEMMER_MODULE(spanish, PG_UTF8, UTF_8),
+     STEMMER_MODULE(swedish, PG_UTF8, UTF_8),
+     STEMMER_MODULE(tamil, PG_UTF8, UTF_8),
+     STEMMER_MODULE(turkish, PG_UTF8, UTF_8),

      /*
       * Stemmer with PG_SQL_ASCII encoding should be valid for any server
       * encoding
       */
-     {"english", PG_SQL_ASCII, english_ISO_8859_1_create_env, english_ISO_8859_1_close_env, english_ISO_8859_1_stem},
+     STEMMER_MODULE(english, PG_SQL_ASCII, ISO_8859_1),

      {NULL, 0, NULL, NULL, NULL} /* list end marker */
  };
