
Commit 85b7efa

Support LIKE with nondeterministic collations
This allows, for example, using LIKE with case-insensitive collations. There was previously no internal implementation of this, so it was met with a not-supported error. This adds the internal implementation and removes the error. The implementation follows the specification of the SQL standard for this.

Unlike with deterministic collations, the LIKE matching cannot go character by character but has to go substring by substring. For example, if we are matching against LIKE 'foo%bar', we can't start by looking for an 'f', then an 'o', but instead we have to find something that matches 'foo'. This is because the collation could consider substrings of different lengths to be equal. This is all internal to MatchText() in like_match.c.

The changes in GenericMatchText() in like.c just pass through the locale information to MatchText(), which was previously not needed. This matches exactly Generic_Text_IC_like() below.

ILIKE is not affected. (It's unclear whether ILIKE makes sense under nondeterministic collations.)

This also updates match_pattern_prefix() in like_support.c to support optimizing the case of an exact pattern with nondeterministic collations. This was already alluded to in the previous code.

(includes documentation examples from Daniel Vérité and test cases from Paul A Jungwirth)

Reviewed-by: Jian He <jian.universality@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/700d2e86-bf75-4607-9cf2-f5b7802f6e88@eisentraut.org
1 parent 8fcd802 commit 85b7efa

7 files changed: +458 -44 lines changed

doc/src/sgml/charset.sgml

Lines changed: 1 addition & 1 deletion
@@ -1197,7 +1197,7 @@ CREATE COLLATION ignore_accents (provider = icu, locale = 'und-u-ks-level1-kc-tr
     to a performance penalty. Note, in particular, that B-tree cannot use
     deduplication with indexes that use a nondeterministic collation. Also,
     certain operations are not possible with nondeterministic collations,
-    such as pattern matching operations. Therefore, they should be used
+    such as some pattern matching operations. Therefore, they should be used
     only in cases where they are specifically wanted.
    </para>

doc/src/sgml/func.sgml

Lines changed: 47 additions & 5 deletions
@@ -5414,9 +5414,10 @@ cast(-44 as bit(12)) <lineannotation>111111010100</lineannotation>
    </caution>

    <para>
-    The pattern matching operators of all three kinds do not support
-    nondeterministic collations. If required, apply a different collation to
-    the expression to work around this limitation.
+    <function>SIMILAR TO</function> and <acronym>POSIX</acronym>-style regular
+    expressions do not support nondeterministic collations. If required, use
+    <function>LIKE</function> or apply a different collation to the expression
+    to work around this limitation.
    </para>

   <sect2 id="functions-like">
@@ -5462,6 +5463,46 @@ cast(-44 as bit(12)) <lineannotation>111111010100</lineannotation>
 </programlisting>
    </para>

+   <para>
+    <function>LIKE</function> pattern matching supports nondeterministic
+    collations (see <xref linkend="collation-nondeterministic"/>), such as
+    case-insensitive collations or collations that, say, ignore punctuation.
+    So with a case-insensitive collation, one could have:
+<programlisting>
+'AbC' LIKE 'abc' COLLATE case_insensitive <lineannotation>true</lineannotation>
+'AbC' LIKE 'a%' COLLATE case_insensitive <lineannotation>true</lineannotation>
+</programlisting>
+    With collations that ignore certain characters or in general that consider
+    strings of different lengths equal, the semantics can become a bit more
+    complicated. Consider these examples:
+<programlisting>
+'.foo.' LIKE 'foo' COLLATE ign_punct <lineannotation>true</lineannotation>
+'.foo.' LIKE 'f_o' COLLATE ign_punct <lineannotation>true</lineannotation>
+'.foo.' LIKE '_oo' COLLATE ign_punct <lineannotation>false</lineannotation>
+</programlisting>
+    The way the matching works is that the pattern is partitioned into
+    sequences of wildcards and non-wildcard strings (wildcards being
+    <literal>_</literal> and <literal>%</literal>). For example, the pattern
+    <literal>f_o</literal> is partitioned into <literal>f, _, o</literal>, the
+    pattern <literal>_oo</literal> is partitioned into <literal>_,
+    oo</literal>. The input string matches the pattern if it can be
+    partitioned in such a way that the wildcards match one character or any
+    number of characters respectively and the non-wildcard partitions are
+    equal under the applicable collation. So for example, <literal>'.foo.'
+    LIKE 'f_o' COLLATE ign_punct</literal> is true because one can partition
+    <literal>.foo.</literal> into <literal>.f, o, o.</literal>, and then
+    <literal>'.f' = 'f' COLLATE ign_punct</literal>, <literal>'o'</literal>
+    matches the <literal>_</literal> wildcard, and <literal>'o.' = 'o' COLLATE
+    ign_punct</literal>. But <literal>'.foo.' LIKE '_oo' COLLATE
+    ign_punct</literal> is false because <literal>.foo.</literal> cannot be
+    partitioned in a way that the first character is any character and the
+    rest of the string compares equal to <literal>oo</literal>. (Note that
+    the single-character wildcard always matches exactly one character,
+    independent of the collation. So in this example, the
+    <literal>_</literal> would match <literal>.</literal>, but then the rest
+    of the input string won't match the rest of the pattern.)
+   </para>
+
    <para>
     <function>LIKE</function> pattern matching always covers the entire
     string. Therefore, if it's desired to match a sequence anywhere within
@@ -5503,8 +5544,9 @@ cast(-44 as bit(12)) <lineannotation>111111010100</lineannotation>

    <para>
     The key word <token>ILIKE</token> can be used instead of
-    <token>LIKE</token> to make the match case-insensitive according
-    to the active locale. This is not in the <acronym>SQL</acronym> standard but is a
+    <token>LIKE</token> to make the match case-insensitive according to the
+    active locale. (But this does not support nondeterministic collations.)
+    This is not in the <acronym>SQL</acronym> standard but is a
     <productname>PostgreSQL</productname> extension.
    </para>
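
The partitioning step described in the new documentation paragraph can be illustrated with a small standalone sketch. This is not code from the commit: the function name `partition_like_pattern` and the fixed limits `SEG_MAX` and `SEG_LEN` are hypothetical, each wildcard is emitted as its own segment, and a backslash is assumed to make the following byte literal.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SEG_MAX 16              /* hypothetical limit: segments per pattern */
#define SEG_LEN 32              /* hypothetical limit: bytes per segment */

/*
 * Split a LIKE pattern into wildcard segments ("_" or "%") and literal
 * segments.  Returns the number of segments written to segs, or -1 if the
 * pattern exceeds the sketch's limits or ends with a lone backslash.
 */
static int
partition_like_pattern(const char *pattern, char segs[SEG_MAX][SEG_LEN])
{
    int     nsegs = 0;
    size_t  litlen = 0;         /* bytes collected in the open literal */

    assert(pattern != NULL);

    for (const char *p = pattern; *p != '\0'; p++)
    {
        if (*p == '_' || *p == '%')
        {
            /* close any open literal segment before the wildcard */
            if (litlen > 0)
            {
                segs[nsegs][litlen] = '\0';
                nsegs++;
                litlen = 0;
            }
            if (nsegs >= SEG_MAX)
                return -1;
            segs[nsegs][0] = *p;
            segs[nsegs][1] = '\0';
            nsegs++;
            continue;
        }
        if (*p == '\\')
        {
            /* assumption of this sketch: the next byte is taken literally */
            p++;
            if (*p == '\0')
                return -1;
        }
        if (nsegs >= SEG_MAX || litlen + 1 >= SEG_LEN)
            return -1;
        segs[nsegs][litlen++] = *p;
    }

    /* close a trailing literal segment, if any */
    if (litlen > 0)
    {
        segs[nsegs][litlen] = '\0';
        nsegs++;
    }
    return nsegs;
}
```

Called on `f_o` this yields the three segments `f`, `_`, `o`, and on `_oo` the two segments `_`, `oo`, mirroring the examples in the paragraph. Comparing the literal segments is then left to the collation.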

src/backend/utils/adt/like.c

Lines changed: 16 additions & 10 deletions
@@ -147,22 +147,28 @@ SB_lower_char(unsigned char c, pg_locale_t locale)
 static inline int
 GenericMatchText(const char *s, int slen, const char *p, int plen, Oid collation)
 {
-    if (collation)
-    {
-        pg_locale_t locale = pg_newlocale_from_collation(collation);
+    pg_locale_t locale;

-        if (!locale->deterministic)
-            ereport(ERROR,
-                    (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-                     errmsg("nondeterministic collations are not supported for LIKE")));
+    if (!OidIsValid(collation))
+    {
+        /*
+         * This typically means that the parser could not resolve a conflict
+         * of implicit collations, so report it that way.
+         */
+        ereport(ERROR,
+                (errcode(ERRCODE_INDETERMINATE_COLLATION),
+                 errmsg("could not determine which collation to use for LIKE"),
+                 errhint("Use the COLLATE clause to set the collation explicitly.")));
     }

+    locale = pg_newlocale_from_collation(collation);
+
     if (pg_database_encoding_max_length() == 1)
-        return SB_MatchText(s, slen, p, plen, 0);
+        return SB_MatchText(s, slen, p, plen, locale);
     else if (GetDatabaseEncoding() == PG_UTF8)
-        return UTF8_MatchText(s, slen, p, plen, 0);
+        return UTF8_MatchText(s, slen, p, plen, locale);
     else
-        return MB_MatchText(s, slen, p, plen, 0);
+        return MB_MatchText(s, slen, p, plen, locale);
 }

 static inline int

src/backend/utils/adt/like_match.c

Lines changed: 147 additions & 2 deletions
@@ -157,7 +157,9 @@ MatchText(const char *t, int tlen, const char *p, int plen, pg_locale_t locale)
              * the first pattern byte to each text byte to avoid recursing
              * more than we have to. This fact also guarantees that we don't
              * have to consider a match to the zero-length substring at the
-             * end of the text.
+             * end of the text. With a nondeterministic collation, we can't
+             * rely on the first bytes being equal, so we have to recurse in
+             * any case.
              */
             if (*p == '\\')
             {
@@ -172,7 +174,7 @@ MatchText(const char *t, int tlen, const char *p, int plen, pg_locale_t locale)

             while (tlen > 0)
             {
-                if (GETCHAR(*t, locale) == firstpat)
+                if (GETCHAR(*t, locale) == firstpat || (locale && !locale->deterministic))
                 {
                     int matched = MatchText(t, tlen, p, plen, locale);

@@ -196,6 +198,149 @@ MatchText(const char *t, int tlen, const char *p, int plen, pg_locale_t locale)
             NextByte(p, plen);
             continue;
         }
+        else if (locale && !locale->deterministic)
+        {
+            /*
+             * For nondeterministic locales, we find the next substring of the
+             * pattern that does not contain wildcards and try to find a
+             * matching substring in the text. Crucially, we cannot do this
+             * character by character, as in the normal case, but must do it
+             * substring by substring, partitioned by the wildcard characters.
+             * (This is per SQL standard.)
+             */
+            const char *p1;
+            size_t p1len;
+            const char *t1;
+            size_t t1len;
+            bool found_escape;
+            const char *subpat;
+            size_t subpatlen;
+            char *buf = NULL;
+
+            /*
+             * Determine next substring of pattern without wildcards. p is
+             * the start of the subpattern, p1 is one past the last byte. Also
+             * track if we found an escape character.
+             */
+            p1 = p;
+            p1len = plen;
+            found_escape = false;
+            while (p1len > 0)
+            {
+                if (*p1 == '\\')
+                {
+                    found_escape = true;
+                    NextByte(p1, p1len);
+                    if (p1len == 0)
+                        ereport(ERROR,
+                                (errcode(ERRCODE_INVALID_ESCAPE_SEQUENCE),
+                                 errmsg("LIKE pattern must not end with escape character")));
+                }
+                else if (*p1 == '_' || *p1 == '%')
+                    break;
+                NextByte(p1, p1len);
+            }
+
+            /*
+             * If we found an escape character, then make an unescaped copy of
+             * the subpattern.
+             */
+            if (found_escape)
+            {
+                char *b;
+
+                b = buf = palloc(p1 - p);
+                for (const char *c = p; c < p1; c++)
+                {
+                    if (*c == '\\')
+                        ;
+                    else
+                        *(b++) = *c;
+                }
+
+                subpat = buf;
+                subpatlen = b - buf;
+            }
+            else
+            {
+                subpat = p;
+                subpatlen = p1 - p;
+            }
+
+            /*
+             * Shortcut: If this is the end of the pattern, then the rest of
+             * the text has to match the rest of the pattern.
+             */
+            if (p1len == 0)
+            {
+                int cmp;
+
+                cmp = pg_strncoll(subpat, subpatlen, t, tlen, locale);
+
+                if (buf)
+                    pfree(buf);
+                if (cmp == 0)
+                    return LIKE_TRUE;
+                else
+                    return LIKE_FALSE;
+            }
+
+            /*
+             * Now build a substring of the text and try to match it against
+             * the subpattern. t is the start of the text, t1 is one past the
+             * last byte. We start with a zero-length string.
+             */
+            t1 = t;
+            t1len = tlen;
+            for (;;)
+            {
+                int cmp;
+
+                CHECK_FOR_INTERRUPTS();
+
+                cmp = pg_strncoll(subpat, subpatlen, t, (t1 - t), locale);
+
+                /*
+                 * If we found a match, we have to test if the rest of pattern
+                 * can match against the rest of the string. Otherwise we
+                 * have to continue here try matching with a longer substring.
+                 * (This is similar to the recursion for the '%' wildcard
+                 * above.)
+                 *
+                 * Note that we can't just wind forward p and t and continue
+                 * with the main loop. This would fail for example with
+                 *
+                 * U&'\0061\0308bc' LIKE U&'\00E4_c' COLLATE ignore_accents
+                 *
+                 * You'd find that t=\0061 matches p=\00E4, but then the rest
+                 * won't match; but t=\0061\0308 also matches p=\00E4, and
+                 * then the rest will match.
+                 */
+                if (cmp == 0)
+                {
+                    int matched = MatchText(t1, t1len, p1, p1len, locale);
+
+                    if (matched == LIKE_TRUE)
+                    {
+                        if (buf)
+                            pfree(buf);
+                        return matched;
+                    }
+                }
+
+                /*
+                 * Didn't match. If we used up the whole text, then the match
+                 * fails. Otherwise, try again with a longer substring.
+                 */
+                if (t1len == 0)
+                    return LIKE_FALSE;
+                else
+                    NextChar(t1, t1len);
+            }
+            if (buf)
+                pfree(buf);
+            continue;
+        }
         else if (GETCHAR(*p, locale) != GETCHAR(*t, locale))
         {
             /* non-wildcard pattern char fails to match text char */
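
To make the substring-by-substring idea concrete, here is a minimal standalone model of it. It is not the PostgreSQL implementation: `toy_coll_equal` is a hypothetical stand-in for a `pg_strncoll(...) == 0` test under an "ignore punctuation" collation, it handles single-byte text only, and it does not process escape characters. The names `toy_like` and `toy_like_cstr` are likewise made up for this sketch.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Toy collation: two byte ranges are equal if they agree once all
 * punctuation bytes are skipped. */
static bool
toy_coll_equal(const char *a, size_t alen, const char *b, size_t blen)
{
    size_t  i = 0;
    size_t  j = 0;

    for (;;)
    {
        while (i < alen && ispunct((unsigned char) a[i]))
            i++;
        while (j < blen && ispunct((unsigned char) b[j]))
            j++;
        if (i == alen || j == blen)
            return i == alen && j == blen;
        if (a[i] != b[j])
            return false;
        i++;
        j++;
    }
}

/* Match text t against LIKE pattern p, comparing each literal run of the
 * pattern against text substrings of every possible length. */
static bool
toy_like(const char *t, size_t tlen, const char *p, size_t plen)
{
    size_t  sublen = 0;

    if (plen == 0)
        return tlen == 0;

    if (*p == '%')
    {
        /* '%' may consume any number of bytes, including none */
        for (size_t i = 0; i <= tlen; i++)
            if (toy_like(t + i, tlen - i, p + 1, plen - 1))
                return true;
        return false;
    }

    if (*p == '_')              /* always exactly one character */
        return tlen > 0 && toy_like(t + 1, tlen - 1, p + 1, plen - 1);

    /* find the end of the literal run (no escape handling here) */
    while (sublen < plen && p[sublen] != '_' && p[sublen] != '%')
        sublen++;

    /* end of pattern: the whole remaining text must compare equal */
    if (sublen == plen)
        return toy_coll_equal(p, plen, t, tlen);

    /* otherwise try ever longer text prefixes, starting at length zero */
    for (size_t i = 0; i <= tlen; i++)
        if (toy_coll_equal(p, sublen, t, i) &&
            toy_like(t + i, tlen - i, p + sublen, plen - sublen))
            return true;
    return false;
}

static bool
toy_like_cstr(const char *text, const char *pattern)
{
    assert(text != NULL && pattern != NULL);
    return toy_like(text, strlen(text), pattern, strlen(pattern));
}
```

Under this toy collation, `toy_like_cstr(".foo.", "f_o")` is true and `toy_like_cstr(".foo.", "_oo")` is false, in line with the documentation examples. The call `toy_like_cstr("f.oo", "f_o")` shows why a first successful prefix comparison is not enough: the prefix `f` compares equal but the remainder then fails, whereas the longer prefix `f.` also compares equal and lets the remainder match.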

src/backend/utils/adt/like_support.c

Lines changed: 13 additions & 16 deletions
@@ -272,22 +272,6 @@ match_pattern_prefix(Node *leftop,
         return NIL;
     patt = (Const *) rightop;

-    /*
-     * Not supported if the expression collation is nondeterministic. The
-     * optimized equality or prefix tests use bytewise comparisons, which is
-     * not consistent with nondeterministic collations. The actual
-     * pattern-matching implementation functions will later error out that
-     * pattern-matching is not supported with nondeterministic collations. (We
-     * could also error out here, but by doing it later we get more precise
-     * error messages.) (It should be possible to support at least
-     * Pattern_Prefix_Exact, but no point as long as the actual
-     * pattern-matching implementations don't support it.)
-     *
-     * expr_coll is not set for a non-collation-aware data type such as bytea.
-     */
-    if (expr_coll && !get_collation_isdeterministic(expr_coll))
-        return NIL;
-
     /*
      * Try to extract a fixed prefix from the pattern.
      */
@@ -404,13 +388,26 @@ match_pattern_prefix(Node *leftop,
     {
         if (!op_in_opfamily(eqopr, opfamily))
             return NIL;
+        if (indexcollation != expr_coll)
+            return NIL;
         expr = make_opclause(eqopr, BOOLOID, false,
                              (Expr *) leftop, (Expr *) prefix,
                              InvalidOid, indexcollation);
         result = list_make1(expr);
         return result;
     }

+    /*
+     * Anything other than Pattern_Prefix_Exact is not supported if the
+     * expression collation is nondeterministic. The optimized equality or
+     * prefix tests use bytewise comparisons, which is not consistent with
+     * nondeterministic collations.
+     *
+     * expr_coll is not set for a non-collation-aware data type such as bytea.
+     */
+    if (expr_coll && !get_collation_isdeterministic(expr_coll))
+        return NIL;
+
     /*
      * Otherwise, we have a nonempty required prefix of the values. Some
      * opclasses support prefix checks directly, otherwise we'll try to
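
The reordered checks in match_pattern_prefix() can be summarized as a small decision function. The sketch below is only a model of the control flow visible in the hunks above. The types `PrefixKind`, `RewriteKind`, and `CollationId` and the function `choose_rewrite` are hypothetical names, and the further opclass checks that the real function performs for prefix rewrites are left out.

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned int CollationId;   /* hypothetical stand-in for a collation OID; 0 = not set */

typedef enum { PREFIX_NONE, PREFIX_PARTIAL, PREFIX_EXACT } PrefixKind;
typedef enum { REWRITE_NONE, REWRITE_EQUALITY, REWRITE_PREFIX } RewriteKind;

/*
 * Decide which index condition, if any, a pattern could be turned into.
 * An exact pattern becomes an equality test only when the index collation
 * is the expression collation; every other rewrite is refused for a
 * nondeterministic expression collation.
 */
static RewriteKind
choose_rewrite(PrefixKind kind, bool eq_in_opfamily,
               CollationId index_coll, CollationId expr_coll,
               bool expr_coll_deterministic)
{
    assert(kind == PREFIX_NONE || kind == PREFIX_PARTIAL || kind == PREFIX_EXACT);

    if (kind == PREFIX_NONE)
        return REWRITE_NONE;

    if (kind == PREFIX_EXACT)
    {
        if (!eq_in_opfamily)
            return REWRITE_NONE;
        if (index_coll != expr_coll)
            return REWRITE_NONE;
        return REWRITE_EQUALITY;
    }

    /* bytewise prefix tests are inconsistent with nondeterministic collations */
    if (expr_coll != 0 && !expr_coll_deterministic)
        return REWRITE_NONE;

    return REWRITE_PREFIX;
}
```

In this model an exact pattern under a nondeterministic collation can still use an equality index condition, provided the index was built with that same collation, while a partial prefix under the same collation yields no rewrite.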
