Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
summaryrefslogtreecommitdiff
diff options
context:
space:
mode:
authorTom Lane2021-02-20 23:11:56 +0000
committerTom Lane2021-02-20 23:11:56 +0000
commit08c0d6ad65f7c161add82ae906efb90dbd7f653d (patch)
treecc6376d6fd084c6584b18d818c0816e3b7e190c9 /src/backend/regex/README
parent17661188336c8cbb1783808912096932c57893a3 (diff)
Invent "rainbow" arcs within the regex engine.
Some regular expression constructs, most notably the "." match-anything metacharacter, produce a sheaf of parallel NFA arcs covering all possible colors (that is, character equivalence classes). We can make a noticeable improvement in the space and time needed to process large regexes by replacing such cases with a single arc bearing the special color code "RAINBOW". This requires only minor additional complication in places such as pull() and push(). Callers of pg_reg_getoutarcs() must now be prepared for the possibility of seeing a RAINBOW arc. For the one known user, contrib/pg_trgm, that's a net benefit since it cuts the number of arcs to be dealt with, and the handling isn't any different than for other colors that contain too many characters to be dealt with individually. This is part of a patch series that in total reduces the regex engine's runtime by about a factor of four on a large corpus of real-world regexes. Patch by me, reviewed by Joel Jacobson Discussion: https://postgr.es/m/1340281.1613018383@sss.pgh.pa.us
Diffstat (limited to 'src/backend/regex/README')
-rw-r--r--src/backend/regex/README36
1 files changed, 28 insertions, 8 deletions
diff --git a/src/backend/regex/README b/src/backend/regex/README
index f08aab69e37..a83ab5074df 100644
--- a/src/backend/regex/README
+++ b/src/backend/regex/README
@@ -261,6 +261,18 @@ and the NFA has these arcs:
states 4 -> 5 on color 2 ("x" only)
which can be seen to be a correct representation of the regex.
+There is one more complexity, which is how to handle ".", that is a
+match-anything atom. We used to do that by generating a "rainbow"
+of arcs of all live colors between the two NFA states before and after
+the dot. That's expensive in itself when there are lots of colors,
+and it also typically adds lots of follow-on arc-splitting work for the
+color splitting logic. Now we handle this case by generating a single arc
+labeled with the special color RAINBOW, meaning all colors. Such arcs
+never need to be split, so they help keep NFAs small in this common case.
+(Note: this optimization doesn't help in REG_NLSTOP mode, where "." is
+not supposed to match newline. In that case we still handle "." by
+generating an almost-rainbow of all colors except newline's color.)
+
Given this summary, we can see we need the following operations for
colors:
@@ -349,6 +361,8 @@ The possible arc types are:
PLAIN arcs, which specify matching of any character of a given "color"
(see above). These are dumped as "[color_number]->to_state".
+ In addition there can be "rainbow" PLAIN arcs, which are dumped as
+ "[*]->to_state".
EMPTY arcs, which specify a no-op transition to another state. These
are dumped as "->to_state".
@@ -356,11 +370,11 @@ The possible arc types are:
AHEAD constraints, which represent a "next character must be of this
color" constraint. AHEAD differs from a PLAIN arc in that the input
character is not consumed when crossing the arc. These are dumped as
- ">color_number>->to_state".
+ ">color_number>->to_state", or possibly ">*>->to_state".
BEHIND constraints, which represent a "previous character must be of
this color" constraint, which likewise consumes no input. These are
- dumped as "<color_number<->to_state".
+ dumped as "<color_number<->to_state", or possibly "<*<->to_state".
'^' arcs, which specify a beginning-of-input constraint. These are
dumped as "^0->to_state" or "^1->to_state" for beginning-of-string and
@@ -396,14 +410,20 @@ substring, or an imaginary following EOS character if the substring is at
the end of the input.
3. If the NFA is (or can be) in the goal state at this point, it matches.
+This definition is necessary to support regexes that begin or end with
+constraints such as \m and \M, which imply requirements on the adjacent
+character if any. The executor implements that by checking if the
+adjacent character (or BOS/BOL/EOS/EOL pseudo-character) is of the
+right color, and it does that in the same loop that checks characters
+within the match.
+
So one can mentally execute an untransformed NFA by taking ^ and $ as
ordinary constraints that match at start and end of input; but plain
arcs out of the start state should be taken as matches for the character
before the target substring, and similarly, plain arcs leading to the
post state are matches for the character after the target substring.
-This definition is necessary to support regexes that begin or end with
-constraints such as \m and \M, which imply requirements on the adjacent
-character if any. NFAs for simple unanchored patterns will usually have
-pre-state outarcs for all possible character colors as well as BOS and
-BOL, and post-state inarcs for all possible character colors as well as
-EOS and EOL, so that the executor's behavior will work.
+After the optimize() transformation, there are explicit arcs mentioning
+BOS/BOL/EOS/EOL adjacent to the pre-state and post-state. So a finished
+NFA for a pattern without anchors or adjacent-character constraints will
+have pre-state outarcs for RAINBOW (all possible character colors) as well
+as BOS and BOL, and likewise post-state inarcs for RAINBOW, EOS, and EOL.