Commit 7dc13a0

Change regex \D and \W shorthands to always match newlines.
Newline is certainly not a digit, nor a word character, so it is sensible that it should match these complemented character classes. Previously, \D and \W acted that way by default, but in newline-sensitive mode ('n' or 'p' flag) they did not match newlines.

This behavior was previously forced because explicit complemented character classes don't match newlines in newline-sensitive mode; but as of the previous commit that implementation constraint no longer exists. It seems useful to change this because the primary real-world use for newline-sensitive mode seems to be to match the default behavior of other regex engines such as Perl and Javascript ... and their default behavior is that these match newlines.

The old behavior can be kept by writing an explicit complemented character class, i.e. [^[:digit:]] or [^[:word:]]. (This means that \D and \W are not exactly equivalent to those strings, but they weren't anyway.)

Discussion: https://postgr.es/m/3220564.1613859619@sss.pgh.pa.us
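The behavior this commit aims for can be sketched outside PostgreSQL. The example below uses Python's re module, which is an assumption of this sketch: the commit message names Perl and Javascript, not Python. In Python's default mode "." stops at a newline, roughly as in newline-sensitive mode, while \D and \W still match it. The sample strings are taken from the regression tests in this commit; first_match is a hypothetical helper, and the [^\d\n] pattern is only an analogue of what [^[:digit:]] does in PostgreSQL's 'n' mode.

```python
import re

# Sample strings borrowed from the regression tests in this commit.
DIGIT_SAMPLE = "abc\ndef345"
WORD_SAMPLE = "***\n@@@___"


def first_match(pattern: str, text: str) -> str:
    """Return the first match of pattern in text, or "" if there is none.

    Hypothetical helper for this illustration only.
    """
    m = re.search(pattern, text)
    return m.group(0) if m else ""


# "." does not cross the newline in Python's default (non-DOTALL) mode.
dot_run = first_match(r".+", DIGIT_SAMPLE)

# \D and \W match the newline, so each run spans both lines.
non_digit_run = first_match(r"\D+", DIGIT_SAMPLE)
non_word_run = first_match(r"\W+", WORD_SAMPLE)

# Excluding "\n" explicitly stops the run at the end of the first line.
# This mimics, in Python, the old PostgreSQL behavior that remains
# available through [^[:digit:]] in newline-sensitive mode.
non_digit_same_line = first_match(r"[^\d\n]+", DIGIT_SAMPLE)

if __name__ == "__main__":
    print(repr(dot_run))              # 'abc'
    print(repr(non_digit_run))        # 'abc\ndef'
    print(repr(non_word_run))         # '***\n@@@'
    print(repr(non_digit_same_line))  # 'abc'
```

The \D and \W results line up with the new expected output in test_regex.out, where the match now continues past the newline.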
1 parent 2a0af7f commit 7dc13a0

4 files changed, +36 -17 lines changed
doc/src/sgml/func.sgml

+20 -8
@@ -6323,32 +6323,38 @@ SELECT foo FROM regexp_split_to_table('the quick brown fox', '\s*') AS foo;
  <tbody>
  <row>
  <entry> <literal>\d</literal> </entry>
- <entry> <literal>[[:digit:]]</literal> </entry>
+ <entry> matches any digit, like
+ <literal>[[:digit:]]</literal> </entry>
  </row>
 
  <row>
  <entry> <literal>\s</literal> </entry>
- <entry> <literal>[[:space:]]</literal> </entry>
+ <entry> matches any whitespace character, like
+ <literal>[[:space:]]</literal> </entry>
  </row>
 
  <row>
  <entry> <literal>\w</literal> </entry>
- <entry> <literal>[[:word:]]</literal> </entry>
+ <entry> matches any word character, like
+ <literal>[[:word:]]</literal> </entry>
  </row>
 
  <row>
  <entry> <literal>\D</literal> </entry>
- <entry> <literal>[^[:digit:]]</literal> </entry>
+ <entry> matches any non-digit, like
+ <literal>[^[:digit:]]</literal> </entry>
  </row>
 
  <row>
  <entry> <literal>\S</literal> </entry>
- <entry> <literal>[^[:space:]]</literal> </entry>
+ <entry> matches any non-whitespace character, like
+ <literal>[^[:space:]]</literal> </entry>
  </row>
 
  <row>
  <entry> <literal>\W</literal> </entry>
- <entry> <literal>[^[:word:]]</literal> </entry>
+ <entry> matches any non-word character, like
+ <literal>[^[:word:]]</literal> </entry>
  </row>
  </tbody>
  </tgroup>
@@ -6813,14 +6819,20 @@ SELECT regexp_match('abc01234xyz', '(?:(.*?)(\d+)(.*)){1,1}');
  If newline-sensitive matching is specified, <literal>.</literal>
  and bracket expressions using <literal>^</literal>
  will never match the newline character
- (so that matches will never cross newlines unless the RE
- explicitly arranges it)
+ (so that matches will not cross lines unless the RE
+ explicitly includes a newline)
  and <literal>^</literal> and <literal>$</literal>
  will match the empty string after and before a newline
  respectively, in addition to matching at beginning and end of string
  respectively.
  But the ARE escapes <literal>\A</literal> and <literal>\Z</literal>
  continue to match beginning or end of string <emphasis>only</emphasis>.
+ Also, the character class shorthands <literal>\D</literal>
+ and <literal>\W</literal> will match a newline regardless of this mode.
+ (Before <productname>PostgreSQL</productname> 14, they did not match
+ newlines when in newline-sensitive mode.
+ Write <literal>[^[:digit:]]</literal>
+ or <literal>[^[:word:]]</literal> to get the old behavior.)
  </para>
 
  <para>

src/backend/regex/re_syntax.n

+6 -1
@@ -804,7 +804,7 @@ and bracket expressions using
  \fB^\fR
  will never match the newline character
  (so that matches will never cross newlines unless the RE
- explicitly arranges it)
+ explicitly includes a newline)
  and
  \fB^\fR
  and
@@ -817,6 +817,11 @@ ARE
  and
  \fB\eZ\fR
  continue to match beginning or end of string \fIonly\fR.
+ Also, the character class shorthands
+ \fB\eD\fR
+ and
+ \fB\eW\fR
+ will match a newline regardless of this mode.
  .PP
  If partial newline-sensitive matching is specified,
  this affects \fB.\fR

src/backend/regex/regcomp.c

+2 -4
@@ -1407,10 +1407,6 @@ charclasscomplement(struct vars *v,
 
  /* build arcs for char class; this may cause color splitting */
  subcolorcvec(v, cv, cstate, cstate);
-
- /* in NLSTOP mode, ensure newline is not part of the result set */
- if (v->cflags & REG_NLSTOP)
- 	newarc(v->nfa, PLAIN, v->nlcolor, cstate, cstate);
  NOERR();
 
  /* clean up any subcolors in the arc set */
@@ -1612,6 +1608,8 @@ cbracket(struct vars *v,
 
  NOERR();
  bracket(v, left, right);
+
+ /* in NLSTOP mode, ensure newline is not part of the result set */
  if (v->cflags & REG_NLSTOP)
  	newarc(v->nfa, PLAIN, v->nlcolor, left, right);
  NOERR();

src/test/modules/test_regex/expected/test_regex.out

+8 -4
@@ -2144,7 +2144,8 @@ select * from test_regex('\D+', E'abc\ndef345', 'nLP');
  test_regex
  -------------------------------
  {0,REG_UNONPOSIX,REG_ULOCALE}
- {abc}
+ {"abc +
+ def"}
  (2 rows)
 
  select * from test_regex('[\D]+', E'abc\ndef345', 'LPE');
@@ -2159,7 +2160,8 @@ select * from test_regex('[\D]+', E'abc\ndef345', 'nLPE');
  test_regex
  ----------------------------------------
  {0,REG_UBBS,REG_UNONPOSIX,REG_ULOCALE}
- {abc}
+ {"abc +
+ def"}
  (2 rows)
 
  select * from test_regex('\w+', E'abc_012\ndef', 'LP');
@@ -2202,7 +2204,8 @@ select * from test_regex('\W+', E'***\n@@@___', 'nLP');
  test_regex
  -------------------------------
  {0,REG_UNONPOSIX,REG_ULOCALE}
- {***}
+ {"*** +
+ @@@"}
  (2 rows)
 
  select * from test_regex('[\W]+', E'***\n@@@___', 'LPE');
@@ -2217,7 +2220,8 @@ select * from test_regex('[\W]+', E'***\n@@@___', 'nLPE');
  test_regex
  ----------------------------------------
  {0,REG_UBBS,REG_UNONPOSIX,REG_ULOCALE}
- {***}
+ {"*** +
+ @@@"}
  (2 rows)
 
  -- doing 13 "escapes"
