Commit d78a7d9

Improve support of Hunspell in ispell dictionary.

Now it's possible to load recent version of Hunspell for several languages.
To handle these dictionaries Hunspell patch adds support for:
* FLAG long - sets the double extended ASCII character flag type
* FLAG num - sets the decimal number flag type (from 1 to 65535)
* AF parameter - alias for flag's set

Also it moves test dictionaries into separate directory.

Author: Artur Zakirov with editorization by me

1 parent 9445db9 commit d78a7d9

15 files changed: +1103 -89 lines changed

doc/src/sgml/textsearch.sgml

Lines changed: 140 additions & 8 deletions
@@ -2615,18 +2615,41 @@ SELECT plainto_tsquery('supernova star');
 </para>
 
 <para>
-To create an <application>Ispell</> dictionary, use the built-in
-<literal>ispell</literal> template and specify several parameters:
+To create an <application>Ispell</> dictionary perform these steps:
 </para>
-
+<itemizedlist spacing="compact" mark="bullet">
+<listitem>
+<para>
+download dictionary configuration files. <productname>OpenOffice</>
+extension files have the <filename>.oxt</> extension. It is necessary
+to extract <filename>.aff</> and <filename>.dic</> files, change
+extensions to <filename>.affix</> and <filename>.dict</>. For some
+dictionary files it is also needed to convert characters to the UTF-8
+encoding with commands (for example, for norwegian language dictionary):
 <programlisting>
-CREATE TEXT SEARCH DICTIONARY english_ispell (
+iconv -f ISO_8859-1 -t UTF-8 -o nn_no.affix nn_NO.aff
+iconv -f ISO_8859-1 -t UTF-8 -o nn_no.dict nn_NO.dic
+</programlisting>
+</para>
+</listitem>
+<listitem>
+<para>
+copy files to the <filename>$SHAREDIR/tsearch_data</> directory
+</para>
+</listitem>
+<listitem>
+<para>
+load files into PostgreSQL with the following command:
+<programlisting>
+CREATE TEXT SEARCH DICTIONARY english_hunspell (
 TEMPLATE = ispell,
-DictFile = english,
-AffFile = english,
-StopWords = english
-);
+DictFile = en_us,
+AffFile = en_us,
+Stopwords = english);
 </programlisting>
+</para>
+</listitem>
+</itemizedlist>
 
 <para>
 Here, <literal>DictFile</>, <literal>AffFile</>, and <literal>StopWords</>
@@ -2642,6 +2665,56 @@ CREATE TEXT SEARCH DICTIONARY english_ispell (
 example, a Snowball dictionary, which recognizes everything.
 </para>
 
+<para>
+The <filename>.affix</> file of <application>Ispell</> has the following
+structure:
+<programlisting>
+prefixes
+flag *A:
+. > RE # As in enter > reenter
+suffixes
+flag T:
+E > ST # As in late > latest
+[^AEIOU]Y > -Y,IEST # As in dirty > dirtiest
+[AEIOU]Y > EST # As in gray > grayest
+[^EY] > EST # As in small > smallest
+</programlisting>
+</para>
+<para>
+And the <filename>.dict</> file has the following structure:
+<programlisting>
+lapse/ADGRS
+lard/DGRS
+large/PRTY
+lark/MRS
+</programlisting>
+</para>
+
+<para>
+Format of the <filename>.dict</> file is:
+<programlisting>
+basic_form/affix_class_name
+</programlisting>
+</para>
+
+<para>
+In the <filename>.affix</> file every affix flag is described in the
+following format:
+<programlisting>
+condition > [-stripping_letters,] adding_affix
+</programlisting>
+</para>
+
+<para>
+Here, condition has a format similar to the format of regular expressions.
+It can use groupings <literal>[...]</> and <literal>[^...]</>.
+For example, <literal>[AEIOU]Y</> means that the last letter of the word
+is <literal>"y"</> and the penultimate letter is <literal>"a"</>,
+<literal>"e"</>, <literal>"i"</>, <literal>"o"</> or <literal>"u"</>.
+<literal>[^EY]</> means that the last letter is neither <literal>"e"</>
+nor <literal>"y"</>.
+</para>
+
 <para>
 Ispell dictionaries support splitting compound words;
 a useful feature.
@@ -2663,6 +2736,65 @@ SELECT ts_lexize('norwegian_ispell', 'sjokoladefabrikk');
 </programlisting>
 </para>
 
+<para>
+<application>MySpell</> format is a subset of <application>Hunspell</>.
+The <filename>.affix</> file of <application>Hunspell</> has the following
+structure:
+<programlisting>
+PFX A Y 1
+PFX A 0 re .
+SFX T N 4
+SFX T 0 st e
+SFX T y iest [^aeiou]y
+SFX T 0 est [aeiou]y
+SFX T 0 est [^ey]
+</programlisting>
+</para>
+
+<para>
+The first line of an affix class is the header. Fields of an affix rules are
+listed after the header:
+</para>
+<itemizedlist spacing="compact" mark="bullet">
+<listitem>
+<para>
+parameter name (PFX or SFX)
+</para>
+</listitem>
+<listitem>
+<para>
+flag (name of the affix class)
+</para>
+</listitem>
+<listitem>
+<para>
+stripping characters from beginning (at prefix) or end (at suffix) of the
+word
+</para>
+</listitem>
+<listitem>
+<para>
+adding affix
+</para>
+</listitem>
+<listitem>
+<para>
+condition that has a format similar to the format of regular expressions.
+</para>
+</listitem>
+</itemizedlist>
+
+<para>
+The <filename>.dict</> file looks like the <filename>.dict</> file of
+<application>Ispell</>:
+<programlisting>
+larder/M
+lardy/RT
+large/RSPMYT
+largehearted
+</programlisting>
+</para>
+
 <note>
 <para>
 <application>MySpell</> does not support compound words.
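
Reader's note on the Hunspell rule fields documented in the textsearch.sgml changes above: the following is a minimal Python sketch, not PostgreSQL's implementation, of how one rule line (strip, add, condition) could be applied to a base word. The `SFX_T` table is copied from the `SFX T` example in the diff; the function name `apply_suffix` and the tuple layout are assumptions made for illustration. Note that this sketch generates inflected forms from a base word, whereas a dictionary used for text search would typically work in the opposite direction and reduce an inflected word to its base form.

```python
import re

# Rules copied from the "SFX T" example: (stripping characters, affix to add, condition).
# "0" in the strip field means that nothing is removed from the word.
SFX_T = [
    ("0", "st", "e"),
    ("y", "iest", "[^aeiou]y"),
    ("0", "est", "[aeiou]y"),
    ("0", "est", "[^ey]"),
]


def apply_suffix(word, rules):
    """Return the forms produced by every rule whose condition matches the end of word."""
    forms = []
    for strip, add, condition in rules:
        # The condition is regex-like and, for suffixes, is anchored at the word end.
        if not re.search(f"(?:{condition})$", word):
            continue
        stem = word
        if strip != "0":
            if not word.endswith(strip):
                continue
            stem = word[: -len(strip)]
        forms.append(stem + add)
    return forms


for word in ("late", "dirty", "gray", "small"):
    print(word, apply_suffix(word, SFX_T))
```

With the four rules above, each of the sample words matches exactly one condition, which mirrors the "late > latest", "dirty > dirtiest", "gray > grayest" and "small > smallest" comments in the Ispell-format example.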

src/backend/tsearch/Makefile

Lines changed: 5 additions & 2 deletions
@@ -13,8 +13,11 @@ include $(top_builddir)/src/Makefile.global
 
 DICTDIR=tsearch_data
 
-DICTFILES=synonym_sample.syn thesaurus_sample.ths hunspell_sample.affix \
-ispell_sample.affix ispell_sample.dict
+DICTFILES=dicts/synonym_sample.syn dicts/thesaurus_sample.ths \
+dicts/hunspell_sample.affix \
+dicts/ispell_sample.affix dicts/ispell_sample.dict \
+dicts/hunspell_sample_long.affix dicts/hunspell_sample_long.dict \
+dicts/hunspell_sample_num.affix dicts/hunspell_sample_num.dict
 
 OBJS = ts_locale.o ts_parse.o wparser.o wparser_def.o dict.o \
 dict_simple.o dict_synonym.o dict_thesaurus.o \

src/backend/tsearch/dicts/hunspell_sample_long.affix

Lines changed: 35 additions & 0 deletions
@@ -0,0 +1,35 @@
+FLAG long
+
+AF 7
+AF cZ #1
+AF cL #2
+AF sGsJpUsS #3
+AF sSpB #4
+AF cZsS #5
+AF sScZs\ #6
+AF sA #7
+
+COMPOUNDFLAG cZ
+ONLYINCOMPOUND cL
+
+PFX pB Y 1
+PFX pB 0 re .
+
+PFX pU N 1
+PFX pU 0 un .
+
+SFX sJ Y 1
+SFX sJ 0 INGS [^E]
+
+SFX sG Y 1
+SFX sG 0 ING [^E]
+
+SFX sS Y 1
+SFX sS 0 S [^SXZHY]
+
+SFX sA Y 1
+SFX sA Y IES [^AEIOU]Y
+
+SFX s\ N 1
+SFX s\ 0 Y/2 [^Y]
+

src/backend/tsearch/dicts/hunspell_sample_long.dict

Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
+book/3
+booking/4
+footballklubber
+foot/5
+football/1
+ball/6
+klubber/1
+sky/7

src/backend/tsearch/dicts/hunspell_sample_num.affix

Lines changed: 26 additions & 0 deletions
@@ -0,0 +1,26 @@
+FLAG num
+
+COMPOUNDFLAG 101
+ONLYINCOMPOUND 102
+
+PFX 201 Y 1
+PFX 201 0 re .
+
+PFX 202 N 1
+PFX 202 0 un .
+
+SFX 301 Y 1
+SFX 301 0 INGS [^E]
+
+SFX 302 Y 1
+SFX 302 0 ING [^E]
+
+SFX 303 Y 1
+SFX 303 0 S [^SXZHY]
+
+SFX 304 Y 1
+SFX 304 Y IES [^AEIOU]Y
+
+SFX 305 N 1
+SFX 305 0 Y/102 [^Y]
+

src/backend/tsearch/dicts/hunspell_sample_num.dict

Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
+book/302,301,202,303
+booking/303,201
+footballklubber
+foot/101,303
+football/101
+ball/303,101,305
+klubber/101
+sky/304
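
Reader's note on the flag encodings exercised by the sample files in this commit: the sketch below shows, under stated assumptions, how the flag field of a dictionary entry could be decoded for the three encodings named in the commit message. It is written in Python for illustration only and is not the code in PostgreSQL's spell.c; the helper names `split_flags` and `resolve_alias` are hypothetical. It assumes that `FLAG long` flags are two characters each, that `FLAG num` flags are comma-separated decimal numbers from 1 to 65535, and that with `AF` lines present a numeric field is a 1-based index into the alias table.

```python
def split_flags(field, mode="char"):
    """Split the flag field of a dictionary entry into individual flags."""
    if mode == "long":
        # FLAG long: every flag is exactly two characters.
        if len(field) % 2:
            raise ValueError("FLAG long needs an even number of characters")
        return [field[i:i + 2] for i in range(0, len(field), 2)]
    if mode == "num":
        # FLAG num: comma-separated decimal numbers in the range 1..65535.
        flags = [int(part) for part in field.split(",")]
        for flag in flags:
            if not 1 <= flag <= 65535:
                raise ValueError(f"flag {flag} is outside 1..65535")
        return flags
    # Default encoding: one character per flag.
    return list(field)


def resolve_alias(field, aliases):
    """Assumption: with AF lines present, the field is a 1-based alias index."""
    return aliases[int(field) - 1]


# Alias table copied from the AF lines of the long-flag sample affix file.
AF_TABLE = ["cZ", "cL", "sGsJpUsS", "sSpB", "cZsS", "sScZs\\", "sA"]

long_flags = split_flags(resolve_alias("3", AF_TABLE), mode="long")  # entry book/3
num_flags = split_flags("302,301,202,303", mode="num")  # entry book/302,301,202,303
print(long_flags)
print(num_flags)
```

Read this way, `book/3` in the long-flag sample and `book/302,301,202,303` in the numeric sample name the same four affix classes (ING, INGS, un, S), once as two-character flags behind an alias and once as plain numbers.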
