Commit 5e1963f

Collations with nondeterministic comparison
This adds a flag "deterministic" to collations. If that is false, such a collation disables various optimizations that assume that strings are equal only if they are byte-wise equal. That then allows use cases such as case-insensitive or accent-insensitive comparisons or handling of strings with different Unicode normal forms.

This functionality is only supported with the ICU provider. At least glibc doesn't appear to have any locales that work in a nondeterministic way, so it's not worth supporting this for the libc provider.

The term "deterministic comparison" in this context is from Unicode Technical Standard #10 (https://unicode.org/reports/tr10/#Deterministic_Comparison).

This patch makes changes in three areas:

- CREATE COLLATION DDL changes and system catalog changes to support this new flag.

- Many executor nodes and auxiliary code are extended to track collations. Previously, this code would just throw away collation information, because the eventually-called user-defined functions didn't use it since they only cared about equality, which didn't need collation information.

- String data type functions that do equality comparisons and hashing are changed to take the (non-)deterministic flag into account. For comparison, this just means skipping various shortcuts and tie breakers that use byte-wise comparison. For hashing, we first need to convert the input string to a canonical "sort key" using the ICU analogue of strxfrm().

Reviewed-by: Daniel Verite <daniel@manitou-mail.org>
Reviewed-by: Peter Geoghegan <pg@bowt.ie>
Discussion: https://www.postgresql.org/message-id/flat/1ccc668f-4cbc-0bef-af67-450b47cdfee7@2ndquadrant.com
1 parent 2ab6d28 commit 5e1963f
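
The commit message's third point implies an invariant: once a collation may treat different byte sequences as equal, the hash function has to be computed over a canonical form, or equal strings would land in different hash buckets. The following is only a minimal sketch of that invariant. It assumes a toy "collation" that folds ASCII case, and `toy_equal`, `toy_hash`, and the FNV-1a hash are hypothetical stand-ins, not the ICU or PostgreSQL routines the patch uses.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a collation's canonical form: ASCII case folding. */
static unsigned char fold_byte(unsigned char c)
{
    return (unsigned char) tolower(c);
}

/* Equality under the toy nondeterministic collation (ignores ASCII case). */
bool toy_equal(const char *a, const char *b)
{
    while (*a && *b)
    {
        if (fold_byte((unsigned char) *a) != fold_byte((unsigned char) *b))
            return false;
        a++;
        b++;
    }
    /* Equal only if both strings ended at the same position. */
    return *a == *b;
}

/*
 * FNV-1a hash. With deterministic = true the raw bytes are hashed; with
 * deterministic = false the folded bytes are hashed, so that strings which
 * toy_equal() treats as equal also hash to the same value.
 */
uint32_t toy_hash(const char *s, bool deterministic)
{
    uint32_t h = 2166136261u;

    for (; *s; s++)
    {
        unsigned char c = (unsigned char) *s;

        h = (h ^ (deterministic ? c : fold_byte(c))) * 16777619u;
    }
    return h;
}
```

With these definitions, `toy_equal("Hello", "hELLO")` holds and both strings produce the same `toy_hash(..., false)` value, while the byte-wise variant `toy_hash(..., true)` still distinguishes inputs such as "a" and "A".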

69 files changed, +2087 -239 lines changed

contrib/bloom/bloom.h

+1
@@ -137,6 +137,7 @@ typedef struct BloomMetaPageData
 typedef struct BloomState
 {
     FmgrInfo hashFn[INDEX_MAX_KEYS];
+    Oid collations[INDEX_MAX_KEYS];
     BloomOptions opts; /* copy of options on index's metapage */
     int32 nColumns;

contrib/bloom/blutils.c

+2 -1
@@ -163,6 +163,7 @@ initBloomState(BloomState *state, Relation index)
         fmgr_info_copy(&(state->hashFn[i]),
                        index_getprocinfo(index, i + 1, BLOOM_HASH_PROC),
                        CurrentMemoryContext);
+        state->collations[i] = index->rd_indcollation[i];
     }

     /* Initialize amcache if needed with options from metapage */
@@ -267,7 +268,7 @@ signValue(BloomState *state, BloomSignatureWord *sign, Datum value, int attno)
      * different columns will be mapped into different bits because of step
      * above
      */
-    hashVal = DatumGetInt32(FunctionCall1(&state->hashFn[attno], value));
+    hashVal = DatumGetInt32(FunctionCall1Coll(&state->hashFn[attno], state->collations[attno], value));
     mySrand(hashVal ^ myRand());

     for (j = 0; j < state->opts.bitSize[attno]; j++)
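
The bloom change above keeps each index column's collation next to its hash function and hands it over on every call. A rough sketch of that pattern follows. All names here (`ToyIndexState`, `toy_text_hash`, `toy_sign_value`, the two collation constants) are hypothetical illustrations rather than PostgreSQL APIs, and the ASCII case folding merely stands in for a real nondeterministic collation.

```c
#include <ctype.h>
#include <stdint.h>

#define TOY_MAX_KEYS 4

/* Hypothetical collation identifiers; real code would carry collation OIDs. */
enum { TOY_COLL_BYTEWISE = 1, TOY_COLL_NOCASE = 2 };

/* A hash function that receives the collation alongside the value. */
typedef uint32_t (*toy_hash_fn)(int collation, const char *value);

typedef struct ToyIndexState
{
    toy_hash_fn hash_fn[TOY_MAX_KEYS];
    int         collations[TOY_MAX_KEYS];   /* one collation per column */
} ToyIndexState;

/* FNV-1a over the bytes, folding ASCII case only for TOY_COLL_NOCASE. */
uint32_t toy_text_hash(int collation, const char *value)
{
    uint32_t h = 2166136261u;

    for (; *value; value++)
    {
        unsigned char c = (unsigned char) *value;

        if (collation == TOY_COLL_NOCASE)
            c = (unsigned char) tolower(c);
        h = (h ^ c) * 16777619u;
    }
    return h;
}

/* Call the column's hash function with the column's own collation. */
uint32_t toy_sign_value(const ToyIndexState *state, int attno, const char *value)
{
    return state->hash_fn[attno](state->collations[attno], value);
}
```

If the collation were dropped at the call site, as the pre-patch `FunctionCall1` did, the hash function would have no way to know that column 1 in this sketch is meant to ignore case.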

doc/src/sgml/catalogs.sgml

+7
@@ -2077,6 +2077,13 @@ SCRAM-SHA-256$<replaceable>&lt;iteration count&gt;</replaceable>:<replaceable>&l
 default, <literal>c</literal> = libc, <literal>i</literal> = icu</entry>
 </row>

+<row>
+<entry><structfield>collisdeterministic</structfield></entry>
+<entry><type>bool</type></entry>
+<entry></entry>
+<entry>Is the collation deterministic?</entry>
+</row>
+
 <row>
 <entry><structfield>collencoding</structfield></entry>
 <entry><type>int4</type></entry>

doc/src/sgml/charset.sgml

+56 -5
@@ -847,11 +847,13 @@ CREATE COLLATION german (provider = libc, locale = 'de_DE');

 <para>
 Note that while this system allows creating collations that <quote>ignore
-case</quote> or <quote>ignore accents</quote> or similar (using
-the <literal>ks</literal> key), PostgreSQL does not at the moment allow
-such collations to act in a truly case- or accent-insensitive manner. Any
-strings that compare equal according to the collation but are not
-byte-wise equal will be sorted according to their byte values.
+case</quote> or <quote>ignore accents</quote> or similar (using the
+<literal>ks</literal> key), in order for such collations to act in a
+truly case- or accent-insensitive manner, they also need to be declared as not
+<firstterm>deterministic</firstterm> in <command>CREATE COLLATION</command>;
+see <xref linkend="collation-nondeterministic"/>.
+Otherwise, any strings that compare equal according to the collation but
+are not byte-wise equal will be sorted according to their byte values.
 </para>

 <note>
@@ -883,6 +885,55 @@ CREATE COLLATION french FROM "fr-x-icu";
 </para>
 </sect4>
 </sect3>
+
+<sect3 id="collation-nondeterministic">
+<title>Nondeterminstic Collations</title>
+
+<para>
+A collation is either <firstterm>deterministic</firstterm> or
+<firstterm>nondeterministic</firstterm>. A deterministic collation uses
+deterministic comparisons, which means that it considers strings to be
+equal only if they consist of the same byte sequence. Nondeterministic
+comparison may determine strings to be equal even if they consist of
+different bytes. Typical situations include case-insensitive comparison,
+accent-insensitive comparison, as well as comparion of strings in
+different Unicode normal forms. It is up to the collation provider to
+actually implement such insensitive comparisons; the deterministic flag
+only determines whether ties are to be broken using bytewise comparison.
+See also <ulink url="https://unicode.org/reports/tr10">Unicode Technical
+Standard 10</ulink> for more information on the terminology.
+</para>
+
+<para>
+To create a nondeterministic collation, specify the property
+<literal>deterministic = false</literal> to <command>CREATE
+COLLATION</command>, for example:
+<programlisting>
+CREATE COLLATION ndcoll (provider = icu, locale = 'und', deterministic = false);
+</programlisting>
+This example would use the standard Unicode collation in a
+nondeterministic way. In particular, this would allow strings in
+different normal forms to be compared correctly. More interesting
+examples make use of the ICU customization facilities explained above.
+For example:
+<programlisting>
+CREATE COLLATION case_insensitive (provider = icu, locale = 'und-u-ks-level2', deterministic = false);
+CREATE COLLATION ignore_accents (provider = icu, locale = 'und-u-ks-level1-kc-true', deterministic = false);
+</programlisting>
+</para>
+
+<para>
+All standard and predefined collations are deterministic, all
+user-defined collations are deterministic by default. While
+nondeterministic collations give a more <quote>correct</quote> behavior,
+especially when considering the full power of Unicode and its many
+special cases, they also have some drawbacks. Foremost, their use leads
+to a performance penalty. Also, certain operations are not possible with
+nondeterministic collations, such as pattern matching operations.
+Therefore, they should be used only in cases where they are specifically
+wanted.
+</para>
+</sect3>
 </sect2>
 </sect1>
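
The new documentation section states that the deterministic flag "only determines whether ties are to be broken using bytewise comparison". One way to picture this is a two-stage comparison, sketched below under stated assumptions: `toy_collate` is a hypothetical ASCII case-insensitive comparison standing in for whatever the collation provider implements, and `toy_compare` is an illustrative wrapper, not a PostgreSQL function.

```c
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

/* Stand-in for the provider's comparison: ASCII case-insensitive ordering. */
static int toy_collate(const char *a, const char *b)
{
    for (;; a++, b++)
    {
        int ca = tolower((unsigned char) *a);
        int cb = tolower((unsigned char) *b);

        if (ca != cb)
            return (ca < cb) ? -1 : 1;
        if (ca == 0)
            return 0;           /* both strings ended: a tie */
    }
}

/*
 * Returns -1, 0, or 1. When the provider reports a tie, a deterministic
 * collation falls back to a byte-wise comparison; a nondeterministic one
 * leaves the tie in place, so the strings count as equal.
 */
int toy_compare(const char *a, const char *b, bool deterministic)
{
    int r = toy_collate(a, b);

    if (r == 0 && deterministic)
    {
        r = strcmp(a, b);
        r = (r > 0) - (r < 0);  /* normalize to -1, 0, or 1 */
    }
    return r;
}
```

Here `toy_compare("abc", "ABC", false)` yields 0, whereas the deterministic variant orders the two strings by their bytes. This mirrors why a case-insensitive locale alone is not enough: without `deterministic = false` the tie-breaker makes the strings unequal again.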

doc/src/sgml/citext.sgml

+21
@@ -14,6 +14,16 @@
 exactly like <type>text</type>.
 </para>

+<tip>
+<para>
+Consider using <firstterm>nondeterministic collations</firstterm> (see
+<xref linkend="collation-nondeterministic"/>) instead of this module. They
+can be used for case-insensitive comparisons, accent-insensitive
+comparisons, and other combinations, and they handle more Unicode special
+cases correctly.
+</para>
+</tip>
+
 <sect2>
 <title>Rationale</title>

@@ -246,6 +256,17 @@ SELECT * FROM users WHERE nick = 'Larry';
 will be invoked instead.
 </para>
 </listitem>
+
+<listitem>
+<para>
+The approach of lower-casing strings for comparison does not handle some
+Unicode special cases correctly, for example when one upper-case letter
+has two lower-case letter equivalents. Unicode distinguishes between
+<firstterm>case mapping</firstterm> and <firstterm>case
+folding</firstterm> for this reason. Use nondeterministic collations
+instead of <type>citext</type> to handle that correctly.
+</para>
+</listitem>
 </itemizedlist>
 </sect2>

doc/src/sgml/func.sgml

+6
@@ -4065,6 +4065,12 @@ cast(-44 as bit(12)) <lineannotation>111111010100</lineannotation>
 </para>
 </caution>

+<para>
+The pattern matching operators of all three kinds do not support
+nondeterministic collations. If required, apply a different collation to
+the expression to work around this limitation.
+</para>
+
 <sect2 id="functions-like">
 <title><function>LIKE</function></title>

doc/src/sgml/ref/create_collation.sgml

+22
@@ -23,6 +23,7 @@ CREATE COLLATION [ IF NOT EXISTS ] <replaceable>name</replaceable> (
 [ LC_COLLATE = <replaceable>lc_collate</replaceable>, ]
 [ LC_CTYPE = <replaceable>lc_ctype</replaceable>, ]
 [ PROVIDER = <replaceable>provider</replaceable>, ]
+[ DETERMINISTIC = <replaceable>boolean</replaceable>, ]
 [ VERSION = <replaceable>version</replaceable> ]
 )
 CREATE COLLATION [ IF NOT EXISTS ] <replaceable>name</replaceable> FROM <replaceable>existing_collation</replaceable>
@@ -124,6 +125,27 @@ CREATE COLLATION [ IF NOT EXISTS ] <replaceable>name</replaceable> FROM <replace
 </listitem>
 </varlistentry>

+<varlistentry>
+<term><literal>DETERMINISTIC</literal></term>
+
+<listitem>
+<para>
+Specifies whether the collation should use deterministic comparisons.
+The default is true. A deterministic comparison considers strings that
+are not byte-wise equal to be unequal even if they are considered
+logically equal by the comparison. PostgreSQL breaks ties using a
+byte-wise comparison. Comparison that is not deterministic can make the
+collation be, say, case- or accent-insensitive. For that, you need to
+choose an appropriate <literal>LC_COLLATE</literal> setting
+<emphasis>and</emphasis> set the collation to not deterministic here.
+</para>
+
+<para>
+Nondeterministic collations are only supported with the ICU provider.
+</para>
+</listitem>
+</varlistentry>
+
 <varlistentry>
 <term><replaceable>version</replaceable></term>

src/backend/access/hash/hashfunc.c

+89 -11
@@ -27,8 +27,10 @@
 #include "postgres.h"

 #include "access/hash.h"
+#include "catalog/pg_collation.h"
 #include "utils/builtins.h"
 #include "utils/hashutils.h"
+#include "utils/pg_locale.h"

 /*
  * Datatype-specific hash functions.
@@ -243,15 +245,51 @@ Datum
 hashtext(PG_FUNCTION_ARGS)
 {
     text *key = PG_GETARG_TEXT_PP(0);
+    Oid collid = PG_GET_COLLATION();
+    pg_locale_t mylocale = 0;
     Datum result;

-    /*
-     * Note: this is currently identical in behavior to hashvarlena, but keep
-     * it as a separate function in case we someday want to do something
-     * different in non-C locales. (See also hashbpchar, if so.)
-     */
-    result = hash_any((unsigned char *) VARDATA_ANY(key),
-                      VARSIZE_ANY_EXHDR(key));
+    if (!collid)
+        ereport(ERROR,
+                (errcode(ERRCODE_INDETERMINATE_COLLATION),
+                 errmsg("could not determine which collation to use for string hashing"),
+                 errhint("Use the COLLATE clause to set the collation explicitly.")));
+
+    if (!lc_collate_is_c(collid) && collid != DEFAULT_COLLATION_OID)
+        mylocale = pg_newlocale_from_collation(collid);
+
+    if (!mylocale || mylocale->deterministic)
+    {
+        result = hash_any((unsigned char *) VARDATA_ANY(key),
+                          VARSIZE_ANY_EXHDR(key));
+    }
+    else
+    {
+#ifdef USE_ICU
+        if (mylocale->provider == COLLPROVIDER_ICU)
+        {
+            int32_t ulen = -1;
+            UChar *uchar = NULL;
+            Size bsize;
+            uint8_t *buf;
+
+            ulen = icu_to_uchar(&uchar, VARDATA_ANY(key), VARSIZE_ANY_EXHDR(key));
+
+            bsize = ucol_getSortKey(mylocale->info.icu.ucol,
+                                    uchar, ulen, NULL, 0);
+            buf = palloc(bsize);
+            ucol_getSortKey(mylocale->info.icu.ucol,
+                            uchar, ulen, buf, bsize);
+
+            result = hash_any(buf, bsize);
+
+            pfree(buf);
+        }
+        else
+#endif
+            /* shouldn't happen */
+            elog(ERROR, "unsupported collprovider: %c", mylocale->provider);
+    }

     /* Avoid leaking memory for toasted inputs */
     PG_FREE_IF_COPY(key, 0);
@@ -263,12 +301,52 @@ Datum
 hashtextextended(PG_FUNCTION_ARGS)
 {
     text *key = PG_GETARG_TEXT_PP(0);
+    Oid collid = PG_GET_COLLATION();
+    pg_locale_t mylocale = 0;
     Datum result;

-    /* Same approach as hashtext */
-    result = hash_any_extended((unsigned char *) VARDATA_ANY(key),
-                               VARSIZE_ANY_EXHDR(key),
-                               PG_GETARG_INT64(1));
+    if (!collid)
+        ereport(ERROR,
+                (errcode(ERRCODE_INDETERMINATE_COLLATION),
+                 errmsg("could not determine which collation to use for string hashing"),
+                 errhint("Use the COLLATE clause to set the collation explicitly.")));
+
+    if (!lc_collate_is_c(collid) && collid != DEFAULT_COLLATION_OID)
+        mylocale = pg_newlocale_from_collation(collid);
+
+    if (!mylocale || mylocale->deterministic)
+    {
+        result = hash_any_extended((unsigned char *) VARDATA_ANY(key),
+                                   VARSIZE_ANY_EXHDR(key),
+                                   PG_GETARG_INT64(1));
+    }
+    else
+    {
+#ifdef USE_ICU
+        if (mylocale->provider == COLLPROVIDER_ICU)
+        {
+            int32_t ulen = -1;
+            UChar *uchar = NULL;
+            Size bsize;
+            uint8_t *buf;
+
+            ulen = icu_to_uchar(&uchar, VARDATA_ANY(key), VARSIZE_ANY_EXHDR(key));
+
+            bsize = ucol_getSortKey(mylocale->info.icu.ucol,
+                                    uchar, ulen, NULL, 0);
+            buf = palloc(bsize);
+            ucol_getSortKey(mylocale->info.icu.ucol,
+                            uchar, ulen, buf, bsize);
+
+            result = hash_any_extended(buf, bsize, PG_GETARG_INT64(1));
+
+            pfree(buf);
+        }
+        else
+#endif
+            /* shouldn't happen */
+            elog(ERROR, "unsupported collprovider: %c", mylocale->provider);
+    }

     PG_FREE_IF_COPY(key, 0);
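
Both hash functions above use the same two-call idiom: ask for the size of the sort key with an empty buffer, allocate, fill the buffer, then hash the key bytes instead of the original string. The commit message calls the ICU routine an analogue of `strxfrm()`, so the idiom can be sketched with the standard C library alone. This is an illustration under assumptions, not the patch's code: `hash_via_sort_key` and `hash_bytes_fnv1a` are hypothetical names, FNV-1a stands in for `hash_any`, and `malloc`/`free` stand in for `palloc`/`pfree`.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* FNV-1a over a byte buffer; a stand-in for a real hash function. */
uint32_t hash_bytes_fnv1a(const unsigned char *p, size_t n)
{
    uint32_t h = 2166136261u;

    for (size_t i = 0; i < n; i++)
        h = (h ^ p[i]) * 16777619u;
    return h;
}

/*
 * Hash a string through its strxfrm() sort key in the current locale.
 * Returns 0 on success and stores the hash in *out, or -1 on failure.
 */
int hash_via_sort_key(const char *s, uint32_t *out)
{
    /* First call: size query only. strxfrm() excludes the terminator. */
    size_t need = strxfrm(NULL, s, 0) + 1;
    char  *buf = malloc(need);
    size_t len;

    if (buf == NULL)
        return -1;

    /* Second call: actually produce the sort key. */
    len = strxfrm(buf, s, need);
    if (len >= need)
    {
        /* Key did not fit (e.g. the locale changed between the calls). */
        free(buf);
        return -1;
    }

    *out = hash_bytes_fnv1a((const unsigned char *) buf, len);
    free(buf);
    return 0;
}
```

Because strings that collate as equal produce identical sort keys, hashing the key rather than the raw bytes keeps the hash consistent with the collation's notion of equality. The cost is an extra transformation and allocation per call, which is one source of the performance penalty the documentation mentions.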

src/backend/access/spgist/spgtextproc.c

+2 -1
@@ -630,7 +630,8 @@ spg_text_leaf_consistent(PG_FUNCTION_ARGS)
          * query (prefix) string, so we don't need to check it again.
          */
         res = (level >= queryLen) ||
-            DatumGetBool(DirectFunctionCall2(text_starts_with,
+            DatumGetBool(DirectFunctionCall2Coll(text_starts_with,
+                                                 PG_GET_COLLATION(),
                                              out->leafValue,
                                              PointerGetDatum(query)));

src/backend/catalog/pg_collation.c

+2
@@ -46,6 +46,7 @@ Oid
 CollationCreate(const char *collname, Oid collnamespace,
                 Oid collowner,
                 char collprovider,
+                bool collisdeterministic,
                 int32 collencoding,
                 const char *collcollate, const char *collctype,
                 const char *collversion,
@@ -160,6 +161,7 @@ CollationCreate(const char *collname, Oid collnamespace,
     values[Anum_pg_collation_collnamespace - 1] = ObjectIdGetDatum(collnamespace);
     values[Anum_pg_collation_collowner - 1] = ObjectIdGetDatum(collowner);
     values[Anum_pg_collation_collprovider - 1] = CharGetDatum(collprovider);
+    values[Anum_pg_collation_collisdeterministic - 1] = BoolGetDatum(collisdeterministic);
     values[Anum_pg_collation_collencoding - 1] = Int32GetDatum(collencoding);
     namestrcpy(&name_collate, collcollate);
     values[Anum_pg_collation_collcollate - 1] = NameGetDatum(&name_collate);
