Commit 12c9a04

Implement lookbehind constraints in our regular-expression engine.

A lookbehind constraint is like a lookahead constraint in that it consumes no text; but it checks for existence (or nonexistence) of a match *ending* at the current point in the string, rather than one *starting* at the current point. This is a long-requested feature since it exists in many other regex libraries, but Henry Spencer had never got around to implementing it in the code we use.

Just making it work is actually pretty trivial; but naive copying of the logic for lookahead constraints leads to code that often spends O(N^2) time to scan an N-character string, because we have to run the match engine from string start to the current probe point each time the constraint is checked. In typical use-cases a lookbehind constraint will be written at the start of the regex and hence will need to be checked at every character, so O(N^2) work overall.

To fix that, I introduced a third copy of the core DFA matching loop, paralleling the existing longest() and shortest() loops. This version, matchuntil(), can suspend and resume matching given a couple of pointers' worth of storage space. So we need only run it across the string once, stopping at each interesting probe point and then resuming to advance to the next one.

I also put in an optimization that simplifies one-character lookahead and lookbehind constraints, such as "(?=x)" or "(?<!\w)", into AHEAD and BEHIND constraints, which already existed in the engine. This avoids the overhead of the LACON machinery entirely for these rather common cases.

The net result is that lookbehind constraints run a factor of three or so slower than Perl's for multi-character constraints, but faster than Perl's for one-character constraints ... and they work fine for variable-length constraints, which Perl gives up on entirely. So that's not bad from a competitive perspective, and there's room for further optimization if anyone cares.

(In reality, raw scan rate across a large input string is probably not that big a deal for Postgres usage anyway; so I'm happy if it's linear.)
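
The suspend-and-resume idea can be illustrated with a toy example. The sketch below is not the engine's matchuntil(); it hard-codes a three-state DFA for the single constraint "(?<=ab)", and the names probe_cache, step, behind_ab_naive and behind_ab_resume are all hypothetical. It only shows why keeping the last scan position and DFA state turns repeated probes into one pass over the string.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical saved scan position: how far we got, and in which state. */
struct probe_cache
{
    size_t pos;     /* number of characters already consumed */
    int    state;   /* DFA state after consuming them */
};

/*
 * Toy DFA for "some substring ending here matches ab":
 * state 0 = no progress, 1 = just saw 'a', 2 = just saw "ab" (accepting).
 */
int
step(int state, char c)
{
    if (c == 'a')
        return 1;
    if (c == 'b' && state == 1)
        return 2;
    return 0;
}

/* Naive check: rescan from string start for every probe point. */
bool
behind_ab_naive(const char *s, size_t probe, size_t *steps)
{
    int state = 0;

    for (size_t i = 0; i < probe; i++)
    {
        state = step(state, s[i]);
        (*steps)++;
    }
    return state == 2;
}

/* Resumable check: continue from the cached position when possible. */
bool
behind_ab_resume(const char *s, size_t probe,
                 struct probe_cache *cache, size_t *steps)
{
    /* A probe behind the cached position forces a restart. */
    if (probe < cache->pos)
    {
        cache->pos = 0;
        cache->state = 0;
    }
    while (cache->pos < probe)
    {
        cache->state = step(cache->state, s[cache->pos]);
        cache->pos++;
        (*steps)++;
    }
    return cache->state == 2;
}
```

Probing every position of a 7-character string in increasing order costs 0+1+...+7 = 28 DFA steps with the naive version and 7 with the resumable one, which is the O(N^2) versus O(N) difference described above.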
1 parent c5057b2 commit 12c9a04

14 files changed: +690 -73 lines changed

doc/src/sgml/func.sgml

+17 -3
@@ -4477,13 +4477,27 @@ SELECT foo FROM regexp_split_to_table('the quick brown fox', E'\\s*') AS foo;
 where no substring matching <replaceable>re</> begins
 (AREs only) </entry>
 </row>
+
+<row>
+<entry> <literal>(?&lt;=</><replaceable>re</><literal>)</> </entry>
+<entry> <firstterm>positive lookbehind</> matches at any point
+where a substring matching <replaceable>re</> ends
+(AREs only) </entry>
+</row>
+
+<row>
+<entry> <literal>(?&lt;!</><replaceable>re</><literal>)</> </entry>
+<entry> <firstterm>negative lookbehind</> matches at any point
+where no substring matching <replaceable>re</> ends
+(AREs only) </entry>
+</row>
 </tbody>
 </tgroup>
 </table>

 <para>
-Lookahead constraints cannot contain <firstterm>back references</>
-(see <xref linkend="posix-escape-sequences">),
+Lookahead and lookbehind constraints cannot contain <firstterm>back
+references</> (see <xref linkend="posix-escape-sequences">),
 and all parentheses within them are considered non-capturing.
 </para>
 </sect3>
@@ -5355,7 +5369,7 @@ SELECT regexp_matches('abc01234xyz', '(?:(.*?)(\d+)(.*)){1,1}');
 the lack of special treatment for a trailing newline,
 the addition of complemented bracket expressions to the things
 affected by newline-sensitive matching,
-the restrictions on parentheses and back references in lookahead
+the restrictions on parentheses and back references in lookahead/lookbehind
 constraints, and the longest/shortest-match (rather than first-match)
 matching semantics.
 </para>

src/backend/regex/README

+4 -4
@@ -332,10 +332,10 @@ The possible arc types are:
 as "$0->to_state" or "$1->to_state" for end-of-string and end-of-line
 constraints respectively.

-LACON constraints, which represent "(?=re)" and "(?!re)" constraints,
-i.e. the input starting at this point must match (or not match) a
-given sub-RE, but the matching input is not consumed. These are
-dumped as ":subtree_number:->to_state".
+LACON constraints, which represent "(?=re)", "(?!re)", "(?<=re)", and
+"(?<!re)" constraints, i.e. the input starting/ending at this point must
+match (or not match) a given sub-RE, but the matching input is not
+consumed. These are dumped as ":subtree_number:->to_state".

 If you see anything else (especially any question marks) in the display of
 an arc, it's dumpnfa() trying to tell you that there's something fishy

src/backend/regex/re_syntax.n

+12 -3
@@ -196,10 +196,18 @@ where a substring matching \fIre\fR begins
 \fB(?!\fIre\fB)\fR
 \fInegative lookahead\fR (AREs only), matches at any point
 where no substring matching \fIre\fR begins
+.TP
+\fB(?<=\fIre\fB)\fR
+\fIpositive lookbehind\fR (AREs only), matches at any point
+where a substring matching \fIre\fR ends
+.TP
+\fB(?<!\fIre\fB)\fR
+\fInegative lookbehind\fR (AREs only), matches at any point
+where no substring matching \fIre\fR ends
 .RE
 .PP
-The lookahead constraints may not contain back references (see later),
-and all parentheses within them are considered non-capturing.
+Lookahead and lookbehind constraints may not contain back references
+(see later), and all parentheses within them are considered non-capturing.
 .PP
 An RE may not end with `\fB\e\fR'.

@@ -856,7 +864,8 @@ Incompatibilities of note include `\fB\eb\fR', `\fB\eB\fR',
 the lack of special treatment for a trailing newline,
 the addition of complemented bracket expressions to the things
 affected by newline-sensitive matching,
-the restrictions on parentheses and back references in lookahead constraints,
+the restrictions on parentheses and back references in lookahead/lookbehind
+constraints,
 and the longest/shortest-match (rather than first-match) matching semantics.
 .PP
 The matching rules for REs containing both normal and non-greedy quantifiers

src/backend/regex/regc_lex.c

+25 -4
@@ -582,6 +582,8 @@ next(struct vars * v)
 {
     NOTE(REG_UNONPOSIX);
     v->now++;
+    if (ATEOS())
+        FAILW(REG_BADRPT);
     switch (*v->now++)
     {
         case CHR(':'): /* non-capturing paren */
@@ -596,12 +598,31 @@ next(struct vars * v)
             return next(v);
             break;
         case CHR('='): /* positive lookahead */
-            NOTE(REG_ULOOKAHEAD);
-            RETV(LACON, 1);
+            NOTE(REG_ULOOKAROUND);
+            RETV(LACON, LATYPE_AHEAD_POS);
             break;
         case CHR('!'): /* negative lookahead */
-            NOTE(REG_ULOOKAHEAD);
-            RETV(LACON, 0);
+            NOTE(REG_ULOOKAROUND);
+            RETV(LACON, LATYPE_AHEAD_NEG);
+            break;
+        case CHR('<'):
+            if (ATEOS())
+                FAILW(REG_BADRPT);
+            switch (*v->now++)
+            {
+                case CHR('='): /* positive lookbehind */
+                    NOTE(REG_ULOOKAROUND);
+                    RETV(LACON, LATYPE_BEHIND_POS);
+                    break;
+                case CHR('!'): /* negative lookbehind */
+                    NOTE(REG_ULOOKAROUND);
+                    RETV(LACON, LATYPE_BEHIND_NEG);
+                    break;
+                default:
+                    FAILW(REG_BADRPT);
+                    break;
+            }
+            assert(NOTREACHED);
             break;
         default:
             FAILW(REG_BADRPT);
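
The lexer change above amounts to classifying the one or two characters that follow "(?", with an end-of-string check before each read. A minimal standalone sketch of that classification follows. The enum lookaround_kind and the function classify_lookaround are hypothetical names for illustration; the real code returns LACON tokens carrying LATYPE_* values, whose definitions are not part of this excerpt, and reports malformed input through FAILW(REG_BADRPT) rather than a return code.

```c
#include <stddef.h>

/* Hypothetical result codes; not the engine's LATYPE_* values. */
enum lookaround_kind
{
    LK_NONE,        /* not a lookaround prefix, or truncated input */
    LK_AHEAD_POS,   /* (?=  */
    LK_AHEAD_NEG,   /* (?!  */
    LK_BEHIND_POS,  /* (?<= */
    LK_BEHIND_NEG   /* (?<! */
};

/*
 * Classify the prefix of s (length len, not necessarily NUL-terminated).
 * Each character is read only after checking that it lies within len,
 * mirroring the ATEOS() guards added in the diff.
 */
enum lookaround_kind
classify_lookaround(const char *s, size_t len)
{
    if (len < 3 || s[0] != '(' || s[1] != '?')
        return LK_NONE;

    switch (s[2])
    {
        case '=':
            return LK_AHEAD_POS;
        case '!':
            return LK_AHEAD_NEG;
        case '<':
            if (len < 4)
                return LK_NONE;     /* "(?<" cut off at end of string */
            if (s[3] == '=')
                return LK_BEHIND_POS;
            if (s[3] == '!')
                return LK_BEHIND_NEG;
            return LK_NONE;
        default:
            return LK_NONE;
    }
}
```

Taking an explicit length instead of relying on a terminating NUL keeps the bounds checks visible, which is the point of the added guards.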

src/backend/regex/regc_nfa.c

+43
@@ -1348,6 +1348,49 @@ cleartraverse(struct nfa * nfa,
         cleartraverse(nfa, a->to);
 }

+/*
+ * single_color_transition - does getting from s1 to s2 cross one PLAIN arc?
+ *
+ * If traversing from s1 to s2 requires a single PLAIN match (possibly of any
+ * of a set of colors), return a state whose outarc list contains only PLAIN
+ * arcs of those color(s). Otherwise return NULL.
+ *
+ * This is used before optimizing the NFA, so there may be EMPTY arcs, which
+ * we should ignore; the possibility of an EMPTY is why the result state could
+ * be different from s1.
+ *
+ * It's worth troubling to handle multiple parallel PLAIN arcs here because a
+ * bracket construct such as [abc] might yield either one or several parallel
+ * PLAIN arcs depending on earlier atoms in the expression. We'd rather that
+ * that implementation detail not create user-visible performance differences.
+ */
+static struct state *
+single_color_transition(struct state * s1, struct state * s2)
+{
+    struct arc *a;
+
+    /* Ignore leading EMPTY arc, if any */
+    if (s1->nouts == 1 && s1->outs->type == EMPTY)
+        s1 = s1->outs->to;
+    /* Likewise for any trailing EMPTY arc */
+    if (s2->nins == 1 && s2->ins->type == EMPTY)
+        s2 = s2->ins->from;
+    /* Perhaps we could have a single-state loop in between, if so reject */
+    if (s1 == s2)
+        return NULL;
+    /* s1 must have at least one outarc... */
+    if (s1->outs == NULL)
+        return NULL;
+    /* ... and they must all be PLAIN arcs to s2 */
+    for (a = s1->outs; a != NULL; a = a->outchain)
+    {
+        if (a->type != PLAIN || a->to != s2)
+            return NULL;
+    }
+    /* OK, return s1 as the possessor of the relevant outarcs */
+    return s1;
+}
+
 /*
  * specialcolors - fill in special colors for an NFA
  */
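
To see what single_color_transition() accepts and rejects, it can be exercised on a small hand-built graph. The sketch below reuses the function body from the diff, but the struct layouts are cut-down stand-ins with only the fields the function touches, and the add_arc helper, the enum arc_type and the color field are hypothetical. The engine's own struct state and struct arc have more members, and its arc-construction routines are not shown in this excerpt.

```c
#include <stddef.h>

/* Simplified stand-ins for the engine's arc types and structs. */
enum arc_type { PLAIN, EMPTY };

struct state;

struct arc
{
    enum arc_type type;
    int           color;     /* hypothetical: which color a PLAIN arc matches */
    struct state *from;
    struct state *to;
    struct arc   *outchain;  /* next arc leaving "from" */
    struct arc   *inchain;   /* next arc entering "to" */
};

struct state
{
    int         nins;
    int         nouts;
    struct arc *ins;
    struct arc *outs;
};

/* Hypothetical helper: link arc a into the lists of both endpoints. */
void
add_arc(struct arc *a, enum arc_type type, int color,
        struct state *from, struct state *to)
{
    a->type = type;
    a->color = color;
    a->from = from;
    a->to = to;
    a->outchain = from->outs;
    from->outs = a;
    from->nouts++;
    a->inchain = to->ins;
    to->ins = a;
    to->nins++;
}

/* Same logic as the function added in the diff above. */
struct state *
single_color_transition(struct state *s1, struct state *s2)
{
    struct arc *a;

    /* Skip one leading and one trailing EMPTY arc, if present */
    if (s1->nouts == 1 && s1->outs->type == EMPTY)
        s1 = s1->outs->to;
    if (s2->nins == 1 && s2->ins->type == EMPTY)
        s2 = s2->ins->from;
    if (s1 == s2)
        return NULL;
    if (s1->outs == NULL)
        return NULL;
    /* Every outarc of s1 must be a PLAIN arc leading directly to s2 */
    for (a = s1->outs; a != NULL; a = a->outchain)
    {
        if (a->type != PLAIN || a->to != s2)
            return NULL;
    }
    return s1;
}
```

For a graph A -EMPTY-> B =PLAIN,PLAIN=> C -EMPTY-> D, calling the function on (A, D) returns B, the state that owns the two parallel PLAIN arcs. For a two-character chain X -PLAIN-> Y -PLAIN-> Z, calling it on (X, Z) returns NULL, so such a constraint would not qualify as a one-character case.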
