
Commit 7c85032

Fix SQL-style substring() to have spec-compliant greediness behavior.
SQL's regular-expression substring() function is defined to have a pattern argument that's separated into three subpatterns by escape-double-quote markers; the function result is the part of the input matching the second subpattern. The standard makes it clear that if there is ambiguity about how to match the input to the subpatterns, the first and third subpatterns should be taken to match the smallest possible amount of text (i.e., they're "non greedy", in the terms of our regex code). We were not doing it that way: the first subpattern would eat the largest possible amount of text, causing the function result to be shorter than what the spec requires.

Fix that by attaching explicit greediness quantifiers to the subpatterns. (This depends on the regex fix in commit 8a29ed0; before that, this didn't reliably change the regex engine's behavior.)

Also, by adding parentheses around each subpattern, we ensure that "|" (OR) in the subpatterns behaves sanely. Previously, "|" in the first or third subpatterns didn't work.

This patch also makes the function throw an error if you write more than two escape-double-quote markers, and do something sane if you write just one, and documents that behavior. Previously, an odd number of markers led to a confusing complaint about unbalanced parentheses, while extra pairs of markers were just ignored. (Note that the spec requires exactly two markers, but we've historically allowed there to be none, and this patch preserves the old behavior for that case.)

In passing, adjust some substring() test cases that didn't really prove what they said they were testing for: they used patterns that didn't match the data string, so that the output would be NULL whether or not the function was really strict.

Although this is certainly a bug fix, changing the behavior in back branches seems undesirable: applications could perhaps be depending on the old behavior, since it's not obviously wrong unless you read the spec very closely. Hence, no back-patch.

Discussion: https://postgr.es/m/5bb27a41-350d-37bf-901e-9d26f5592dd0@charter.net
1 parent fb489e4 commit 7c85032

4 files changed: +174 -26 lines changed

doc/src/sgml/func.sgml

+34 -8
@@ -4296,19 +4296,45 @@ cast(-44 as bit(12)) <lineannotation>111111010100</lineannotation>
     </para>

    <para>
-    The <function>substring</function> function with three parameters,
-    <function>substring(<replaceable>string</replaceable> from
-    <replaceable>pattern</replaceable> for
-    <replaceable>escape-character</replaceable>)</function>, provides
-    extraction of a substring that matches an SQL
-    regular expression pattern. As with <literal>SIMILAR TO</literal>, the
+    The <function>substring</function> function with three parameters
+    provides extraction of a substring that matches an SQL
+    regular expression pattern. The function can be written according
+    to SQL99 syntax:
+<synopsis>
+substring(<replaceable>string</replaceable> from <replaceable>pattern</replaceable> for <replaceable>escape-character</replaceable>)
+</synopsis>
+    or as a plain three-argument function:
+<synopsis>
+substring(<replaceable>string</replaceable>, <replaceable>pattern</replaceable>, <replaceable>escape-character</replaceable>)
+</synopsis>
+    As with <literal>SIMILAR TO</literal>, the
     specified pattern must match the entire data string, or else the
     function fails and returns null. To indicate the part of the
-    pattern that should be returned on success, the pattern must contain
+    pattern for which the matching data sub-string is of interest,
+    the pattern should contain
     two occurrences of the escape character followed by a double quote
     (<literal>"</literal>). <!-- " font-lock sanity -->
     The text matching the portion of the pattern
-    between these markers is returned.
+    between these separators is returned when the match is successful.
+   </para>
+
+   <para>
+    The escape-double-quote separators actually
+    divide <function>substring</function>'s pattern into three independent
+    regular expressions; for example, a vertical bar (<literal>|</literal>)
+    in any of the three sections affects only that section. Also, the first
+    and third of these regular expressions are defined to match the smallest
+    possible amount of text, not the largest, when there is any ambiguity
+    about how much of the data string matches which pattern. (In POSIX
+    parlance, the first and third regular expressions are forced to be
+    non-greedy.)
+   </para>
+
+   <para>
+    As an extension to the SQL standard, <productname>PostgreSQL</productname>
+    allows there to be just one escape-double-quote separator, in which case
+    the third regular expression is taken as empty; or no separators, in which
+    case the first and third regular expressions are taken as empty.
    </para>

    <para>
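
The splitting rule documented in this hunk (shortest first part, shortest third part, the middle part takes the rest) can be illustrated with a brute-force sketch. This is not how PostgreSQL implements it: the helper names `toy_match` and `toy_substring` are hypothetical, the three parts are passed as separate strings instead of being split on escape-double-quote separators, and the toy matcher assumes patterns that use only literal characters, `_` and `%`.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Toy matcher (hypothetical helper): does s[0..slen) match p[0..plen)?
 * Supports only literals, '_' (any one character) and '%' (any run).
 */
static int
toy_match(const char *p, size_t plen, const char *s, size_t slen)
{
    if (plen == 0)
        return slen == 0;
    if (p[0] == '%')
    {
        /* try every possible length for the run matched by '%' */
        for (size_t k = 0; k <= slen; k++)
            if (toy_match(p + 1, plen - 1, s + k, slen - k))
                return 1;
        return 0;
    }
    if (slen > 0 && (p[0] == '_' || p[0] == s[0]))
        return toy_match(p + 1, plen - 1, s + 1, slen - 1);
    return 0;
}

/*
 * Split s into three pieces matching p1, p2, p3, preferring the shortest
 * piece for p1 and then the shortest piece for p3.  Copies the piece that
 * matched p2 into out and returns 1, or returns 0 (standing in for SQL
 * NULL) when no split makes the whole string match.
 */
static int
toy_substring(const char *s, const char *p1, const char *p2,
              const char *p3, char *out, size_t outsize)
{
    size_t n = strlen(s);

    assert(outsize > n);
    for (size_t i = 0; i <= n; i++)         /* shortest part 1 first */
    {
        if (!toy_match(p1, strlen(p1), s, i))
            continue;
        for (size_t j = n + 1; j-- > i;)    /* shortest part 3 first */
        {
            if (toy_match(p2, strlen(p2), s + i, j - i) &&
                toy_match(p3, strlen(p3), s + j, n - j))
            {
                memcpy(out, s + i, j - i);
                out[j - i] = '\0';
                return 1;
            }
        }
    }
    return 0;
}
```

Under these assumptions, splitting `abcdefg` with parts `a%`, `%`, `%g` yields `bcdef`: the first part settles for `a` and the third for `g`, even though both could have consumed more text.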

src/backend/utils/adt/regexp.c

+65 -8
@@ -708,20 +708,42 @@ similar_escape(PG_FUNCTION_ARGS)
      * We surround the transformed input string with
      *          ^(?: ... )$
      * which requires some explanation. We need "^" and "$" to force
-     * the pattern to match the entire input string as per SQL99 spec.
+     * the pattern to match the entire input string as per the SQL spec.
      * The "(?:" and ")" are a non-capturing set of parens; we have to have
      * parens in case the string contains "|", else the "^" and "$" will
      * be bound into the first and last alternatives which is not what we
      * want, and the parens must be non capturing because we don't want them
      * to count when selecting output for SUBSTRING.
+     *
+     * When the pattern is divided into three parts by escape-double-quotes,
+     * what we emit is
+     *          ^(?:part1){1,1}?(part2){1,1}(?:part3)$
+     * which requires even more explanation. The "{1,1}?" on part1 makes it
+     * non-greedy so that it will match the smallest possible amount of text
+     * not the largest, as required by SQL. The plain parens around part2
+     * are capturing parens so that that part is what controls the result of
+     * SUBSTRING. The "{1,1}" forces part2 to be greedy, so that it matches
+     * the largest possible amount of text; hence part3 must match the
+     * smallest amount of text, as required by SQL. We don't need an explicit
+     * greediness marker on part3. Note that this also confines the effects
+     * of any "|" characters to the respective part, which is what we want.
+     *
+     * The SQL spec says that SUBSTRING's pattern must contain exactly two
+     * escape-double-quotes, but we only complain if there's more than two.
+     * With none, we act as though part1 and part3 are empty; with one, we
+     * act as though part3 is empty. Both behaviors fall out of omitting
+     * the relevant part separators in the above expansion. If the result
+     * of this function is used in a plain regexp match (SIMILAR TO), the
+     * escape-double-quotes have no effect on the match behavior.
      *----------
      */

     /*
-     * We need room for the prefix/postfix plus as many as 3 output bytes per
-     * input byte; since the input is at most 1GB this can't overflow
+     * We need room for the prefix/postfix and part separators, plus as many
+     * as 3 output bytes per input byte; since the input is at most 1GB this
+     * can't overflow size_t.
      */
-    result = (text *) palloc(VARHDRSZ + 6 + 3 * plen);
+    result = (text *) palloc(VARHDRSZ + 23 + 3 * (size_t) plen);
     r = VARDATA(result);

     *r++ = '^';
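
The constant in the allocation changes from 6 to 23 because the two part separators now count toward the fixed overhead. A minimal sketch of that arithmetic follows, assuming the prefix is `^(?:` and the postfix is `)$` as the comment describes; the function names are hypothetical and the `VARHDRSZ` header is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Bytes emitted regardless of pattern content, in the worst case. */
static size_t
separator_overhead(void)
{
    return strlen("^(?:")           /* prefix: 4 bytes */
        + strlen("){1,1}?(")        /* first part separator: 8 bytes */
        + strlen("){1,1}(?:")       /* second part separator: 9 bytes */
        + strlen(")$");             /* postfix: 2 bytes */
}

/* Worst-case output size for a pattern of plen bytes (header excluded). */
static size_t
worst_case_output_bytes(size_t plen)
{
    size_t overhead = separator_overhead();

    /* guard the multiplication; the real code relies on plen <= 1GB */
    assert(plen <= (SIZE_MAX - overhead) / 3);
    return overhead + 3 * plen;
}
```

The old value of 6 corresponds to the prefix and postfix alone (4 + 2); the separators add the remaining 17 bytes.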
@@ -760,7 +782,7 @@ similar_escape(PG_FUNCTION_ARGS)
             }
             else if (e && elen == mblen && memcmp(e, p, mblen) == 0)
             {
-                /* SQL99 escape character; do not send to output */
+                /* SQL escape character; do not send to output */
                 afterescape = true;
             }
             else
@@ -784,18 +806,53 @@ similar_escape(PG_FUNCTION_ARGS)
         /* fast path */
         if (afterescape)
         {
-            if (pchar == '"' && !incharclass)   /* for SUBSTRING patterns */
-                *r++ = ((nquotes++ % 2) == 0) ? '(' : ')';
+            if (pchar == '"' && !incharclass)   /* escape-double-quote? */
+            {
+                /* emit appropriate part separator, per notes above */
+                if (nquotes == 0)
+                {
+                    *r++ = ')';
+                    *r++ = '{';
+                    *r++ = '1';
+                    *r++ = ',';
+                    *r++ = '1';
+                    *r++ = '}';
+                    *r++ = '?';
+                    *r++ = '(';
+                }
+                else if (nquotes == 1)
+                {
+                    *r++ = ')';
+                    *r++ = '{';
+                    *r++ = '1';
+                    *r++ = ',';
+                    *r++ = '1';
+                    *r++ = '}';
+                    *r++ = '(';
+                    *r++ = '?';
+                    *r++ = ':';
+                }
+                else
+                    ereport(ERROR,
+                            (errcode(ERRCODE_INVALID_USE_OF_ESCAPE_CHARACTER),
+                             errmsg("SQL regular expression may not contain more than two escape-double-quote separators")));
+                nquotes++;
+            }
             else
             {
+                /*
+                 * We allow any character at all to be escaped; notably, this
+                 * allows access to POSIX character-class escapes such as
+                 * "\d". The SQL spec is considerably more restrictive.
+                 */
                 *r++ = '\\';
                 *r++ = pchar;
             }
             afterescape = false;
         }
         else if (e && pchar == *e)
         {
-            /* SQL99 escape character; do not send to output */
+            /* SQL escape character; do not send to output */
             afterescape = true;
         }
         else if (incharclass)
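
To see what the separator logic produces end to end, here is a simplified standalone sketch of the translation. It is an illustration, not the actual `similar_escape`: the names `toy_similar_escape` and `append_str` are hypothetical, it assumes single-byte characters and a single-byte escape, it ignores character classes and the other SQL metacharacters, and it returns -1 where the real code calls `ereport`.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Append src at dst (no terminator) and return the new write position. */
static char *
append_str(char *dst, const char *src)
{
    size_t n = strlen(src);

    memcpy(dst, src, n);
    return dst + n;
}

/*
 * Translate a toy SQL pattern into POSIX regex text.  Handles only '%',
 * '_', escaped characters and escape-double-quote separators; everything
 * else is copied through unchanged.  Returns 0 on success, -1 if dst is
 * too small or the pattern has more than two separators.
 */
static int
toy_similar_escape(const char *pat, char esc, char *dst, size_t dstsize)
{
    char *r = dst;
    int nquotes = 0;
    int afterescape = 0;

    assert(pat != NULL && dst != NULL);
    /* same worst-case bound as the patch, plus a terminating NUL */
    if (dstsize < 23 + 3 * strlen(pat) + 1)
        return -1;

    r = append_str(r, "^(?:");
    for (const char *p = pat; *p != '\0'; p++)
    {
        if (afterescape)
        {
            if (*p == '"')
            {
                if (nquotes == 0)
                    r = append_str(r, "){1,1}?(");   /* end part 1, open part 2 */
                else if (nquotes == 1)
                    r = append_str(r, "){1,1}(?:");  /* end part 2, open part 3 */
                else
                    return -1;                       /* more than two separators */
                nquotes++;
            }
            else
            {
                *r++ = '\\';
                *r++ = *p;
            }
            afterescape = 0;
        }
        else if (*p == esc)
            afterescape = 1;
        else if (*p == '%')
            r = append_str(r, ".*");
        else if (*p == '_')
            *r++ = '.';
        else
            *r++ = *p;
    }
    r = append_str(r, ")$");
    *r = '\0';
    return 0;
}
```

With escape character `#`, the pattern `a#"%#"g` becomes `^(?:a){1,1}?(.*){1,1}(?:g)$`, which matches the expansion described in the function's header comment. With a single separator the closing `)$` ends the capturing group instead, which is how the "part 3 is empty" behavior falls out without special-casing.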

src/test/regress/expected/strings.out

+54 -5
@@ -313,7 +313,7 @@ SELECT SUBSTRING('1234567890' FROM 4 FOR 3) = '456' AS "456";
 t
 (1 row)

--- T581 regular expression substring (with SQL99's bizarre regexp syntax)
+-- T581 regular expression substring (with SQL's bizarre regexp syntax)
 SELECT SUBSTRING('abcdefg' FROM 'a#"(b_d)#"%' FOR '#') AS "bcd";
 bcd
 -----
@@ -328,13 +328,13 @@ SELECT SUBSTRING('abcdefg' FROM '#"(b_d)#"%' FOR '#') IS NULL AS "True";
 (1 row)

 -- Null inputs should return NULL
-SELECT SUBSTRING('abcdefg' FROM '(b|c)' FOR NULL) IS NULL AS "True";
+SELECT SUBSTRING('abcdefg' FROM '%' FOR NULL) IS NULL AS "True";
 True
 ------
 t
 (1 row)

-SELECT SUBSTRING(NULL FROM '(b|c)' FOR '#') IS NULL AS "True";
+SELECT SUBSTRING(NULL FROM '%' FOR '#') IS NULL AS "True";
 True
 ------
 t
@@ -346,8 +346,57 @@ SELECT SUBSTRING('abcdefg' FROM NULL FOR '#') IS NULL AS "True";
 t
 (1 row)

--- PostgreSQL extension to allow omitting the escape character;
--- here the regexp is taken as Posix syntax
+-- The first and last parts should act non-greedy
+SELECT SUBSTRING('abcdefg' FROM 'a#"%#"g' FOR '#') AS "bcdef";
+bcdef
+-------
+bcdef
+(1 row)
+
+SELECT SUBSTRING('abcdefg' FROM 'a*#"%#"g*' FOR '#') AS "abcdefg";
+abcdefg
+---------
+abcdefg
+(1 row)
+
+-- Vertical bar in any part affects only that part
+SELECT SUBSTRING('abcdefg' FROM 'a|b#"%#"g' FOR '#') AS "bcdef";
+bcdef
+-------
+bcdef
+(1 row)
+
+SELECT SUBSTRING('abcdefg' FROM 'a#"%#"x|g' FOR '#') AS "bcdef";
+bcdef
+-------
+bcdef
+(1 row)
+
+SELECT SUBSTRING('abcdefg' FROM 'a#"%|ab#"g' FOR '#') AS "bcdef";
+bcdef
+-------
+bcdef
+(1 row)
+
+-- Can't have more than two part separators
+SELECT SUBSTRING('abcdefg' FROM 'a*#"%#"g*#"x' FOR '#') AS "error";
+ERROR: SQL regular expression may not contain more than two escape-double-quote separators
+CONTEXT: SQL function "substring" statement 1
+-- Postgres extension: with 0 or 1 separator, assume parts 1 and 3 are empty
+SELECT SUBSTRING('abcdefg' FROM 'a#"%g' FOR '#') AS "bcdefg";
+bcdefg
+--------
+bcdefg
+(1 row)
+
+SELECT SUBSTRING('abcdefg' FROM 'a%g' FOR '#') AS "abcdefg";
+abcdefg
+---------
+abcdefg
+(1 row)
+
+-- substring() with just two arguments is not allowed by SQL spec;
+-- we accept it, but we interpret the pattern as a POSIX regexp not SQL
 SELECT SUBSTRING('abcdefg' FROM 'c.e') AS "cde";
 cde
 -----

src/test/regress/sql/strings.sql

+21 -5
@@ -110,19 +110,35 @@ SELECT SUBSTRING('1234567890' FROM 3) = '34567890' AS "34567890";

 SELECT SUBSTRING('1234567890' FROM 4 FOR 3) = '456' AS "456";

--- T581 regular expression substring (with SQL99's bizarre regexp syntax)
+-- T581 regular expression substring (with SQL's bizarre regexp syntax)
 SELECT SUBSTRING('abcdefg' FROM 'a#"(b_d)#"%' FOR '#') AS "bcd";

 -- No match should return NULL
 SELECT SUBSTRING('abcdefg' FROM '#"(b_d)#"%' FOR '#') IS NULL AS "True";

 -- Null inputs should return NULL
-SELECT SUBSTRING('abcdefg' FROM '(b|c)' FOR NULL) IS NULL AS "True";
-SELECT SUBSTRING(NULL FROM '(b|c)' FOR '#') IS NULL AS "True";
+SELECT SUBSTRING('abcdefg' FROM '%' FOR NULL) IS NULL AS "True";
+SELECT SUBSTRING(NULL FROM '%' FOR '#') IS NULL AS "True";
 SELECT SUBSTRING('abcdefg' FROM NULL FOR '#') IS NULL AS "True";

--- PostgreSQL extension to allow omitting the escape character;
--- here the regexp is taken as Posix syntax
+-- The first and last parts should act non-greedy
+SELECT SUBSTRING('abcdefg' FROM 'a#"%#"g' FOR '#') AS "bcdef";
+SELECT SUBSTRING('abcdefg' FROM 'a*#"%#"g*' FOR '#') AS "abcdefg";
+
+-- Vertical bar in any part affects only that part
+SELECT SUBSTRING('abcdefg' FROM 'a|b#"%#"g' FOR '#') AS "bcdef";
+SELECT SUBSTRING('abcdefg' FROM 'a#"%#"x|g' FOR '#') AS "bcdef";
+SELECT SUBSTRING('abcdefg' FROM 'a#"%|ab#"g' FOR '#') AS "bcdef";
+
+-- Can't have more than two part separators
+SELECT SUBSTRING('abcdefg' FROM 'a*#"%#"g*#"x' FOR '#') AS "error";
+
+-- Postgres extension: with 0 or 1 separator, assume parts 1 and 3 are empty
+SELECT SUBSTRING('abcdefg' FROM 'a#"%g' FOR '#') AS "bcdefg";
+SELECT SUBSTRING('abcdefg' FROM 'a%g' FOR '#') AS "abcdefg";
+
+-- substring() with just two arguments is not allowed by SQL spec;
+-- we accept it, but we interpret the pattern as a POSIX regexp not SQL
 SELECT SUBSTRING('abcdefg' FROM 'c.e') AS "cde";

 -- With a parenthesized subexpression, return only what matches the subexpr
