Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
Skip to content

Commit d3158f3

Browse files
committed
Fix regex back-references that are directly quantified with *.
The syntax "\n*", that is a backref with a * quantifier directly applied to it, has never worked correctly in Spencer's library. This has been an open bug in the Tcl bug tracker since 2005: https://sourceforge.net/tracker/index.php?func=detail&aid=1115587&group_id=10894&atid=110894 The core of the problem is in parseqatom(), which first changes "\n*" to "\n+|" and then applies repeat() to the NFA representing the backref atom. repeat() thinks that any arc leading into its "rp" argument is part of the sub-NFA to be repeated. Unfortunately, since parseqatom() already created the arc that was intended to represent the empty bypass around "\n+", this arc gets moved too, so that it now leads into the state loop created by repeat(). Thus, what was supposed to be an "empty" bypass gets turned into something that represents zero or more repetitions of the NFA representing the backref atom. In the original example, in place of ^([bc])\1*$ we now have something that acts like ^([bc])(\1+|[bc]*)$ At runtime, the branch involving the actual backref fails, as it's supposed to, but then the other branch succeeds anyway. We could no doubt fix this by some rearrangement of the operations in parseqatom(), but that code is plenty ugly already, and what's more the whole business of converting "x*" to "x+|" probably needs to go away to fix another problem I'll mention in a moment. Instead, this patch suppresses the *-conversion when the target is a simple backref atom, leaving the case of m == 0 to be handled at runtime. This makes the patch in regcomp.c a one-liner, at the cost of having to tweak cbrdissect() a little. In the event I went a bit further than that and rewrote cbrdissect() to check all the string-length-related conditions before it starts comparing characters. It seems a bit stupid to possibly iterate through many copies of an n-character backreference, only to fail at the end because the target string's length isn't a multiple of n --- we could have found that out before starting. The existing coding could only be a win if integer division is hugely expensive compared to character comparison, but I don't know of any modern machine where that might be true. This does not fix all the problems with quantified back-references. In particular, the code is still broken for back-references that appear within a larger expression that is quantified (so that direct insertion of the quantification limits into the BACKREF node doesn't apply). I think fixing that will take some major surgery on the NFA code, specifically introducing an explicit iteration node type instead of trying to transform iteration into concatenation of modified regexps. Back-patch to all supported branches. In HEAD, also add a regression test case for this. (It may seem a bit silly to create a regression test file for just one test case; but I'm expecting that we will soon import a whole bunch of regex regression tests from Tcl, so might as well create the infrastructure now.)
1 parent 86328cb commit d3158f3

File tree

2 files changed

+52
-29
lines changed

2 files changed

+52
-29
lines changed

src/backend/regex/regcomp.c

Lines changed: 6 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -1087,8 +1087,12 @@ parseqatom(struct vars * v,
10871087
NOERR();
10881088
}
10891089

1090-
/* it's quantifier time; first, turn x{0,...} into x{1,...}|empty */
1091-
if (m == 0)
1090+
/*
1091+
* It's quantifier time. If the atom is just a BACKREF, we'll let it deal
1092+
* with quantifiers internally. Otherwise, the first step is to turn
1093+
* x{0,...} into x{1,...}|empty
1094+
*/
1095+
if (m == 0 && atomtype != BACKREF)
10921096
{
10931097
EMPTYARC(s2, atom->end); /* the bypass */
10941098
assert(PREF(qprefer) != 0);

src/backend/regex/regexec.c

Lines changed: 46 additions & 27 deletions
Original file line numberDiff line numberDiff line change
@@ -719,7 +719,7 @@ cdissect(struct vars * v,
719719
case '|': /* alternation */
720720
assert(t->left != NULL);
721721
return caltdissect(v, t, begin, end);
722-
case 'b': /* back ref -- shouldn't be calling us! */
722+
case 'b': /* back reference */
723723
assert(t->left == NULL && t->right == NULL);
724724
return cbrdissect(v, t, begin, end);
725725
case '.': /* concatenation */
@@ -976,12 +976,12 @@ cbrdissect(struct vars * v,
976976
chr *begin, /* beginning of relevant substring */
977977
chr *end) /* end of same */
978978
{
979-
int i;
980979
int n = t->subno;
981-
size_t len;
982-
chr *paren;
980+
size_t numreps;
981+
size_t tlen;
982+
size_t brlen;
983+
chr *brstring;
983984
chr *p;
984-
chr *stop;
985985
int min = t->min;
986986
int max = t->max;
987987

@@ -992,46 +992,65 @@ cbrdissect(struct vars * v,
992992

993993
MDEBUG(("cbackref n%d %d{%d-%d}\n", t->retry, n, min, max));
994994

995+
/* get the backreferenced string */
995996
if (v->pmatch[n].rm_so == -1)
996997
return REG_NOMATCH;
997-
paren = v->start + v->pmatch[n].rm_so;
998-
len = v->pmatch[n].rm_eo - v->pmatch[n].rm_so;
998+
brstring = v->start + v->pmatch[n].rm_so;
999+
brlen = v->pmatch[n].rm_eo - v->pmatch[n].rm_so;
9991000

10001001
/* no room to maneuver -- retries are pointless */
10011002
if (v->mem[t->retry])
10021003
return REG_NOMATCH;
10031004
v->mem[t->retry] = 1;
10041005

1005-
/* special-case zero-length string */
1006-
if (len == 0)
1006+
/* special cases for zero-length strings */
1007+
if (brlen == 0)
1008+
{
1009+
/*
1010+
* matches only if target is zero length, but any number of
1011+
* repetitions can be considered to be present
1012+
*/
1013+
if (begin == end && min <= max)
1014+
{
1015+
MDEBUG(("cbackref matched trivially\n"));
1016+
return REG_OKAY;
1017+
}
1018+
return REG_NOMATCH;
1019+
}
1020+
if (begin == end)
10071021
{
1008-
if (begin == end)
1022+
/* matches only if zero repetitions are okay */
1023+
if (min == 0)
1024+
{
1025+
MDEBUG(("cbackref matched trivially\n"));
10091026
return REG_OKAY;
1027+
}
10101028
return REG_NOMATCH;
10111029
}
10121030

1013-
/* and too-short string */
1014-
assert(end >= begin);
1015-
if ((size_t) (end - begin) < len)
1031+
/*
1032+
* check target length to see if it could possibly be an allowed number of
1033+
* repetitions of brstring
1034+
*/
1035+
assert(end > begin);
1036+
tlen = end - begin;
1037+
if (tlen % brlen != 0)
1038+
return REG_NOMATCH;
1039+
numreps = tlen / brlen;
1040+
if (numreps < min || (numreps > max && max != INFINITY))
10161041
return REG_NOMATCH;
1017-
stop = end - len;
10181042

1019-
/* count occurrences */
1020-
i = 0;
1021-
for (p = begin; p <= stop && (i < max || max == INFINITY); p += len)
1043+
/* okay, compare the actual string contents */
1044+
p = begin;
1045+
while (numreps-- > 0)
10221046
{
1023-
if ((*v->g->compare) (paren, p, len) != 0)
1024-
break;
1025-
i++;
1047+
if ((*v->g->compare) (brstring, p, brlen) != 0)
1048+
return REG_NOMATCH;
1049+
p += brlen;
10261050
}
1027-
MDEBUG(("cbackref found %d\n", i));
10281051

1029-
/* and sort it out */
1030-
if (p != end) /* didn't consume all of it */
1031-
return REG_NOMATCH;
1032-
if (min <= i && (i <= max || max == INFINITY))
1033-
return REG_OKAY;
1034-
return REG_NOMATCH; /* out of range */
1052+
MDEBUG(("cbackref matched\n"));
1053+
return REG_OKAY;
10351054
}
10361055

10371056
/*

0 commit comments

Comments
 (0)