
Commit ca70bda

Fix issues around strictness of SIMILAR TO.
As a result of some long-ago quick hacks, the SIMILAR TO operator and the corresponding flavor of substring() interpreted "ESCAPE NULL" as selecting the default escape character '\'. This is both surprising and not per spec: the standard is clear that these functions should return NULL for NULL input.

Additionally, because of inconsistency of the strictness markings of 3-argument substring() and similar_escape(), the planner could not inline the SQL definition of substring(), resulting in a substantial performance penalty compared to the underlying POSIX substring() function.

The simplest fix for this would be to change the strictness marking of similar_escape(), but if we do that we risk breaking existing views that depend on that function. Hence, leave similar_escape() as-is as a compatibility function, and instead invent a new function similar_to_escape() that comes in two strict variants.

There are a couple of other behaviors in this area that are also not per spec, but they are documented and seem generally at least as sane as the spec's definition, so leave them alone. But improve the documentation to describe them fully.

Patch by me; thanks to Álvaro Herrera and Andrew Gierth for review and discussion.

Discussion: https://postgr.es/m/14047.1557708214@sss.pgh.pa.us
1 parent c5bc705 commit ca70bda
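
The NULL-handling difference described in the commit message can be modeled outside the server. The following is a minimal sketch, not PostgreSQL code: it assumes SQL NULL is represented by a C NULL pointer, and the enum and all function names (`EscapeChoice`, `choose_escape`, `legacy_similar_escape`, `strict_similar_to_escape_2`, `strict_similar_to_escape_1`) are hypothetical. It also counts bytes rather than multibyte characters, unlike the real code.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical outcome codes for this sketch only. */
typedef enum
{
    ESC_NULL_RESULT,        /* the SQL-level result is NULL */
    ESC_DEFAULT_BACKSLASH,  /* no escape given: default to '\' */
    ESC_NONE,               /* empty escape string: no escape character */
    ESC_CUSTOM,             /* a single-character escape */
    ESC_INVALID             /* more than one character: an error */
} EscapeChoice;

/* Models how the shared workhorse interprets its escape argument. */
static EscapeChoice
choose_escape(const char *esc)
{
    size_t len;

    if (esc == NULL)
        return ESC_DEFAULT_BACKSLASH;
    len = strlen(esc);          /* assumption: bytes, not characters */
    if (len == 0)
        return ESC_NONE;
    if (len > 1)
        return ESC_INVALID;
    return ESC_CUSTOM;
}

/* Legacy, non-strict behavior: a NULL escape silently means '\'. */
static EscapeChoice
legacy_similar_escape(const char *pat, const char *esc)
{
    if (pat == NULL)
        return ESC_NULL_RESULT;
    return choose_escape(esc);
}

/* Strict two-argument variant: any NULL input yields NULL. */
static EscapeChoice
strict_similar_to_escape_2(const char *pat, const char *esc)
{
    if (pat == NULL || esc == NULL)
        return ESC_NULL_RESULT;
    return choose_escape(esc);
}

/* Strict one-argument variant: used when ESCAPE is omitted. */
static EscapeChoice
strict_similar_to_escape_1(const char *pat)
{
    if (pat == NULL)
        return ESC_NULL_RESULT;
    return choose_escape(NULL);
}
```

In this model, only the legacy function maps a NULL escape to the backslash default; the strict pair separates "ESCAPE omitted" (one argument) from "ESCAPE NULL" (two arguments, NULL result).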

7 files changed: +193 -41 lines changed

doc/src/sgml/func.sgml (+39 -11)
@@ -4121,6 +4121,14 @@ cast(-44 as bit(12)) <lineannotation>111111010100</lineannotation>
     special meaning of underscore and percent signs in the pattern.
    </para>

+   <para>
+    According to the SQL standard, omitting <literal>ESCAPE</literal>
+    means there is no escape character (rather than defaulting to a
+    backslash), and a zero-length <literal>ESCAPE</literal> value is
+    disallowed.  <productname>PostgreSQL</productname>'s behavior in
+    this regard is therefore slightly nonstandard.
+   </para>
+
    <para>
     The key word <token>ILIKE</token> can be used instead of
     <token>LIKE</token> to make the match case-insensitive according
@@ -4139,9 +4147,9 @@ cast(-44 as bit(12)) <lineannotation>111111010100</lineannotation>
    </para>

    <para>
-    There is also the prefix operator <literal>^@</literal> and corresponding
-    <function>starts_with</function> function which covers cases when only
-    searching by beginning of the string is needed.
+    Also see the prefix operator <literal>^@</literal> and corresponding
+    <function>starts_with</function> function, which are useful in cases
+    where simply matching the beginning of a string is needed.
    </para>
   </sect2>

@@ -4172,7 +4180,7 @@ cast(-44 as bit(12)) <lineannotation>111111010100</lineannotation>
     It is similar to <function>LIKE</function>, except that it
     interprets the pattern using the SQL standard's definition of a
     regular expression.  SQL regular expressions are a curious cross
-    between <function>LIKE</function> notation and common regular
+    between <function>LIKE</function> notation and common (POSIX) regular
     expression notation.
    </para>

@@ -4256,18 +4264,38 @@ cast(-44 as bit(12)) <lineannotation>111111010100</lineannotation>
    </para>

    <para>
-    As with <function>LIKE</function>, a backslash disables the special meaning
-    of any of these metacharacters; or a different escape character can
-    be specified with <literal>ESCAPE</literal>.
+    As with <function>LIKE</function>, a backslash disables the special
+    meaning of any of these metacharacters.  A different escape character
+    can be specified with <literal>ESCAPE</literal>, or the escape
+    capability can be disabled by writing <literal>ESCAPE ''</literal>.
+   </para>
+
+   <para>
+    According to the SQL standard, omitting <literal>ESCAPE</literal>
+    means there is no escape character (rather than defaulting to a
+    backslash), and a zero-length <literal>ESCAPE</literal> value is
+    disallowed.  <productname>PostgreSQL</productname>'s behavior in
+    this regard is therefore slightly nonstandard.
+   </para>
+
+   <para>
+    Another nonstandard extension is that following the escape character
+    with a letter or digit provides access to the escape sequences
+    defined for POSIX regular expressions; see
+    <xref linkend="posix-character-entry-escapes-table"/>,
+    <xref linkend="posix-class-shorthand-escapes-table"/>, and
+    <xref linkend="posix-constraint-escapes-table"/> below.
    </para>

    <para>
     Some examples:
 <programlisting>
-'abc' SIMILAR TO 'abc'      <lineannotation>true</lineannotation>
-'abc' SIMILAR TO 'a'        <lineannotation>false</lineannotation>
-'abc' SIMILAR TO '%(b|d)%'  <lineannotation>true</lineannotation>
-'abc' SIMILAR TO '(b|c)%'   <lineannotation>false</lineannotation>
+'abc' SIMILAR TO 'abc'          <lineannotation>true</lineannotation>
+'abc' SIMILAR TO 'a'            <lineannotation>false</lineannotation>
+'abc' SIMILAR TO '%(b|d)%'      <lineannotation>true</lineannotation>
+'abc' SIMILAR TO '(b|c)%'       <lineannotation>false</lineannotation>
+'-abc-' SIMILAR TO '%\mabc\M%'  <lineannotation>true</lineannotation>
+'xabcy' SIMILAR TO '%\mabc\M%'  <lineannotation>false</lineannotation>
 </programlisting>
    </para>

src/backend/parser/gram.y (+8 -8)
@@ -13073,31 +13073,31 @@ a_expr: c_expr { $$ = $1; }

             | a_expr SIMILAR TO a_expr                          %prec SIMILAR
                 {
-                    FuncCall *n = makeFuncCall(SystemFuncName("similar_escape"),
-                                               list_make2($4, makeNullAConst(-1)),
+                    FuncCall *n = makeFuncCall(SystemFuncName("similar_to_escape"),
+                                               list_make1($4),
                                                @2);
                     $$ = (Node *) makeSimpleA_Expr(AEXPR_SIMILAR, "~",
                                                    $1, (Node *) n, @2);
                 }
             | a_expr SIMILAR TO a_expr ESCAPE a_expr            %prec SIMILAR
                 {
-                    FuncCall *n = makeFuncCall(SystemFuncName("similar_escape"),
+                    FuncCall *n = makeFuncCall(SystemFuncName("similar_to_escape"),
                                                list_make2($4, $6),
                                                @2);
                     $$ = (Node *) makeSimpleA_Expr(AEXPR_SIMILAR, "~",
                                                    $1, (Node *) n, @2);
                 }
             | a_expr NOT_LA SIMILAR TO a_expr                   %prec NOT_LA
                 {
-                    FuncCall *n = makeFuncCall(SystemFuncName("similar_escape"),
-                                               list_make2($5, makeNullAConst(-1)),
+                    FuncCall *n = makeFuncCall(SystemFuncName("similar_to_escape"),
+                                               list_make1($5),
                                                @2);
                     $$ = (Node *) makeSimpleA_Expr(AEXPR_SIMILAR, "!~",
                                                    $1, (Node *) n, @2);
                 }
             | a_expr NOT_LA SIMILAR TO a_expr ESCAPE a_expr     %prec NOT_LA
                 {
-                    FuncCall *n = makeFuncCall(SystemFuncName("similar_escape"),
+                    FuncCall *n = makeFuncCall(SystemFuncName("similar_to_escape"),
                                                list_make2($5, $7),
                                                @2);
                     $$ = (Node *) makeSimpleA_Expr(AEXPR_SIMILAR, "!~",
@@ -14323,9 +14323,9 @@ subquery_Op:
             | NOT_LA ILIKE
                     { $$ = list_make1(makeString("!~~*")); }
 /* cannot put SIMILAR TO here, because SIMILAR TO is a hack.
- * the regular expression is preprocessed by a function (similar_escape),
+ * the regular expression is preprocessed by a function (similar_to_escape),
  * and the ~ operator for posix regular expressions is used.
- *        x SIMILAR TO y     ->    x ~ similar_escape(y)
+ *        x SIMILAR TO y     ->    x ~ similar_to_escape(y)
  * this transformation is made on the fly by the parser upwards.
  * however the SubLink structure which handles any/some/all stuff
  * is not ready for such a thing.

src/backend/utils/adt/regexp.c (+71 -14)
@@ -654,15 +654,18 @@ textregexreplace(PG_FUNCTION_ARGS)
 }

 /*
- * similar_escape()
- * Convert a SQL:2008 regexp pattern to POSIX style, so it can be used by
- * our regexp engine.
+ * similar_to_escape(), similar_escape()
+ *
+ * Convert a SQL "SIMILAR TO" regexp pattern to POSIX style, so it can be
+ * used by our regexp engine.
+ *
+ * similar_escape_internal() is the common workhorse for three SQL-exposed
+ * functions.  esc_text can be passed as NULL to select the default escape
+ * (which is '\'), or as an empty string to select no escape character.
  */
-Datum
-similar_escape(PG_FUNCTION_ARGS)
+static text *
+similar_escape_internal(text *pat_text, text *esc_text)
 {
-    text       *pat_text;
-    text       *esc_text;
     text       *result;
     char       *p,
                *e,
@@ -673,26 +676,21 @@ similar_escape(PG_FUNCTION_ARGS)
     bool        incharclass = false;
     int         nquotes = 0;

-    /* This function is not strict, so must test explicitly */
-    if (PG_ARGISNULL(0))
-        PG_RETURN_NULL();
-    pat_text = PG_GETARG_TEXT_PP(0);
     p = VARDATA_ANY(pat_text);
     plen = VARSIZE_ANY_EXHDR(pat_text);
-    if (PG_ARGISNULL(1))
+    if (esc_text == NULL)
     {
         /* No ESCAPE clause provided; default to backslash as escape */
         e = "\\";
         elen = 1;
     }
     else
     {
-        esc_text = PG_GETARG_TEXT_PP(1);
         e = VARDATA_ANY(esc_text);
         elen = VARSIZE_ANY_EXHDR(esc_text);
         if (elen == 0)
             e = NULL;           /* no escape character */
-        else
+        else if (elen > 1)
         {
             int         escape_mblen = pg_mbstrlen_with_len(e, elen);

@@ -898,6 +896,65 @@ similar_escape(PG_FUNCTION_ARGS)

     SET_VARSIZE(result, r - ((char *) result));

+    return result;
+}
+
+/*
+ * similar_to_escape(pattern, escape)
+ */
+Datum
+similar_to_escape_2(PG_FUNCTION_ARGS)
+{
+    text       *pat_text = PG_GETARG_TEXT_PP(0);
+    text       *esc_text = PG_GETARG_TEXT_PP(1);
+    text       *result;
+
+    result = similar_escape_internal(pat_text, esc_text);
+
+    PG_RETURN_TEXT_P(result);
+}
+
+/*
+ * similar_to_escape(pattern)
+ * Inserts a default escape character.
+ */
+Datum
+similar_to_escape_1(PG_FUNCTION_ARGS)
+{
+    text       *pat_text = PG_GETARG_TEXT_PP(0);
+    text       *result;
+
+    result = similar_escape_internal(pat_text, NULL);
+
+    PG_RETURN_TEXT_P(result);
+}
+
+/*
+ * similar_escape(pattern, escape)
+ *
+ * Legacy function for compatibility with views stored using the
+ * pre-v13 expansion of SIMILAR TO.  Unlike the above functions, this
+ * is non-strict, which leads to not-per-spec handling of "ESCAPE NULL".
+ */
+Datum
+similar_escape(PG_FUNCTION_ARGS)
+{
+    text       *pat_text;
+    text       *esc_text;
+    text       *result;
+
+    /* This function is not strict, so must test explicitly */
+    if (PG_ARGISNULL(0))
+        PG_RETURN_NULL();
+    pat_text = PG_GETARG_TEXT_PP(0);
+
+    if (PG_ARGISNULL(1))
+        esc_text = NULL;        /* use default escape character */
+    else
+        esc_text = PG_GETARG_TEXT_PP(1);
+
+    result = similar_escape_internal(pat_text, esc_text);
+
     PG_RETURN_TEXT_P(result);
 }
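
For readers unfamiliar with what the workhorse function does, the core idea of converting a SIMILAR TO pattern into a POSIX regular expression can be sketched in standalone C. This is a deliberately simplified model under stated assumptions, not the code in regexp.c: the names `sketch_emit` and `sketch_similar_to_posix` are hypothetical, the escape is a single byte (0 meaning "no escape"), and bracket expressions, the substring() quoting rules, and multibyte escapes are all ignored. The `^(?:` and `)$` anchoring is likewise an assumption of this sketch.

```c
#include <stddef.h>
#include <string.h>

/* Append s to out, keeping it NUL-terminated; -1 if it does not fit. */
static int
sketch_emit(char *out, size_t outsz, size_t *n, const char *s)
{
    size_t len = strlen(s);

    if (*n + len + 1 > outsz)
        return -1;
    memcpy(out + *n, s, len);
    *n += len;
    out[*n] = '\0';
    return 0;
}

/*
 * Translate a SIMILAR TO pattern into an anchored POSIX regex.
 * esc is the escape byte, or 0 for "no escape character".
 * Returns 0 on success, -1 if the output buffer is too small.
 */
static int
sketch_similar_to_posix(const char *pat, char esc, char *out, size_t outsz)
{
    size_t n = 0;
    int    after_escape = 0;
    char   buf[3];

    if (outsz == 0)
        return -1;
    out[0] = '\0';
    if (sketch_emit(out, outsz, &n, "^(?:") != 0)
        return -1;

    for (; *pat != '\0'; pat++)
    {
        char c = *pat;

        if (after_escape)
        {
            /* Escaped character: pass it through behind a backslash. */
            buf[0] = '\\'; buf[1] = c; buf[2] = '\0';
            after_escape = 0;
        }
        else if (esc != '\0' && c == esc)
        {
            after_escape = 1;
            continue;
        }
        else if (c == '%')
        {
            buf[0] = '.'; buf[1] = '*'; buf[2] = '\0';
        }
        else if (c == '_')
        {
            buf[0] = '.'; buf[1] = '\0';
        }
        else if (c == '\\' || c == '.' || c == '^' || c == '$')
        {
            /* Special in POSIX but not in SQL regexes: quote it. */
            buf[0] = '\\'; buf[1] = c; buf[2] = '\0';
        }
        else
        {
            buf[0] = c; buf[1] = '\0';
        }
        if (sketch_emit(out, outsz, &n, buf) != 0)
            return -1;
    }
    return sketch_emit(out, outsz, &n, ")$");
}
```

Because an escaped character is passed through behind a backslash, an escape followed by a letter or digit reaches the POSIX engine as a regex escape, which is consistent with the `\m` and `\M` examples added to the documentation in this commit.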

src/include/catalog/catversion.h (+1 -1)
@@ -53,6 +53,6 @@
  */

 /*                          yyyymmddN */
-#define CATALOG_VERSION_NO  201908012
+#define CATALOG_VERSION_NO  201909071

 #endif

src/include/catalog/pg_proc.dat (+10 -5)
@@ -3346,9 +3346,15 @@
   proname => 'repeat', prorettype => 'text', proargtypes => 'text int4',
   prosrc => 'repeat' },

-{ oid => '1623', descr => 'convert SQL99 regexp pattern to POSIX style',
+{ oid => '1623', descr => 'convert SQL regexp pattern to POSIX style',
   proname => 'similar_escape', proisstrict => 'f', prorettype => 'text',
   proargtypes => 'text text', prosrc => 'similar_escape' },
+{ oid => '1986', descr => 'convert SQL regexp pattern to POSIX style',
+  proname => 'similar_to_escape', prorettype => 'text',
+  proargtypes => 'text text', prosrc => 'similar_to_escape_2' },
+{ oid => '1987', descr => 'convert SQL regexp pattern to POSIX style',
+  proname => 'similar_to_escape', prorettype => 'text', proargtypes => 'text',
+  prosrc => 'similar_to_escape_1' },

 { oid => '1624',
   proname => 'mul_d_interval', prorettype => 'interval',
@@ -5771,10 +5777,10 @@
 { oid => '2073', descr => 'extract text matching regular expression',
   proname => 'substring', prorettype => 'text', proargtypes => 'text text',
   prosrc => 'textregexsubstr' },
-{ oid => '2074', descr => 'extract text matching SQL99 regular expression',
+{ oid => '2074', descr => 'extract text matching SQL regular expression',
   proname => 'substring', prolang => 'sql', prorettype => 'text',
   proargtypes => 'text text text',
-  prosrc => 'select pg_catalog.substring($1, pg_catalog.similar_escape($2, $3))' },
+  prosrc => 'select pg_catalog.substring($1, pg_catalog.similar_to_escape($2, $3))' },

 { oid => '2075', descr => 'convert int8 to bitstring',
   proname => 'bit', prorettype => 'bit', proargtypes => 'int8 int4',
@@ -10554,8 +10560,7 @@
   proparallel => 'r', prorettype => 'void', proargtypes => '',
   prosrc => 'pg_replication_origin_xact_reset' },

-{ oid => '6012',
-  descr => 'advance replication origin to specific location',
+{ oid => '6012', descr => 'advance replication origin to specific location',
   proname => 'pg_replication_origin_advance', provolatile => 'v',
   proparallel => 'u', prorettype => 'void', proargtypes => 'text pg_lsn',
   prosrc => 'pg_replication_origin_advance' },

src/test/regress/expected/strings.out (+50 -1)
@@ -410,7 +410,56 @@ SELECT SUBSTRING('abcdefg' FROM 'b(.*)f') AS "cde";
  cde
 (1 row)

--- PostgreSQL extension to allow using back reference in replace string;
+-- Check behavior of SIMILAR TO, which uses largely the same regexp variant
+SELECT 'abcdefg' SIMILAR TO '_bcd%' AS true;
+ true
+------
+ t
+(1 row)
+
+SELECT 'abcdefg' SIMILAR TO 'bcd%' AS false;
+ false
+-------
+ f
+(1 row)
+
+SELECT 'abcdefg' SIMILAR TO '_bcd#%' ESCAPE '#' AS false;
+ false
+-------
+ f
+(1 row)
+
+SELECT 'abcd%' SIMILAR TO '_bcd#%' ESCAPE '#' AS true;
+ true
+------
+ t
+(1 row)
+
+-- Postgres uses '\' as the default escape character, which is not per spec
+SELECT 'abcdefg' SIMILAR TO '_bcd\%' AS false;
+ false
+-------
+ f
+(1 row)
+
+-- and an empty string to mean "no escape", which is also not per spec
+SELECT 'abcd\efg' SIMILAR TO '_bcd\%' ESCAPE '' AS true;
+ true
+------
+ t
+(1 row)
+
+-- these behaviors are per spec, though:
+SELECT 'abcdefg' SIMILAR TO '_bcd%' ESCAPE NULL AS null;
+ null
+------
+
+(1 row)
+
+SELECT 'abcdefg' SIMILAR TO '_bcd#%' ESCAPE '##' AS error;
+ERROR:  invalid escape string
+HINT:  Escape string must be empty or one character.
+-- Test back reference in regexp_replace
 SELECT regexp_replace('1112223333', E'(\\d{3})(\\d{3})(\\d{4})', E'(\\1) \\2-\\3');
  regexp_replace
 ----------------

src/test/regress/sql/strings.sql (+14 -1)
@@ -144,7 +144,20 @@ SELECT SUBSTRING('abcdefg' FROM 'c.e') AS "cde";
 -- With a parenthesized subexpression, return only what matches the subexpr
 SELECT SUBSTRING('abcdefg' FROM 'b(.*)f') AS "cde";

--- PostgreSQL extension to allow using back reference in replace string;
+-- Check behavior of SIMILAR TO, which uses largely the same regexp variant
+SELECT 'abcdefg' SIMILAR TO '_bcd%' AS true;
+SELECT 'abcdefg' SIMILAR TO 'bcd%' AS false;
+SELECT 'abcdefg' SIMILAR TO '_bcd#%' ESCAPE '#' AS false;
+SELECT 'abcd%' SIMILAR TO '_bcd#%' ESCAPE '#' AS true;
+-- Postgres uses '\' as the default escape character, which is not per spec
+SELECT 'abcdefg' SIMILAR TO '_bcd\%' AS false;
+-- and an empty string to mean "no escape", which is also not per spec
+SELECT 'abcd\efg' SIMILAR TO '_bcd\%' ESCAPE '' AS true;
+-- these behaviors are per spec, though:
+SELECT 'abcdefg' SIMILAR TO '_bcd%' ESCAPE NULL AS null;
+SELECT 'abcdefg' SIMILAR TO '_bcd#%' ESCAPE '##' AS error;
+
+-- Test back reference in regexp_replace
 SELECT regexp_replace('1112223333', E'(\\d{3})(\\d{3})(\\d{4})', E'(\\1) \\2-\\3');
 SELECT regexp_replace('AAA BBB CCC ', E'\\s+', ' ', 'g');
 SELECT regexp_replace('AAA', '^|$', 'Z', 'g');
