Commit 27af914

Create the beginnings of internals documentation for the regex code.
Create src/backend/regex/README to hold an implementation overview of the regex package, and fill it in with some preliminary notes about the code's DFA/NFA processing and colormap management. Much more to do there of course.

Also, improve some code comments around the colormap and cvec code. No functional changes except to add one missing assert.
1 parent 2f582f7 commit 27af914

4 files changed (+343, -16 lines)

src/backend/regex/README

Lines changed: 291 additions & 0 deletions
@@ -0,0 +1,291 @@
Implementation notes about Henry Spencer's regex library
========================================================

If Henry ever had any internals documentation, he didn't publish it.
So this file is an attempt to reverse-engineer some docs.

General source-file layout
--------------------------

There are four separately-compilable source files, each exposing exactly
one exported function:
    regcomp.c: pg_regcomp
    regexec.c: pg_regexec
    regerror.c: pg_regerror
    regfree.c: pg_regfree
(The pg_ prefixes were added by the Postgres project to distinguish this
library version from any similar one that might be present on a particular
system. They'd need to be removed or replaced in any standalone version
of the library.)

There are additional source files regc_*.c that are #include'd in regcomp,
and similarly additional source files rege_*.c that are #include'd in
regexec. This was done to avoid exposing internal symbols globally;
all functions not meant to be part of the library API are static.

(Actually the above is a lie in one respect: there is one more global
symbol, pg_set_regex_collation in regcomp. It is not meant to be part of
the API, but it has to be global because both regcomp and regexec call it.
It'd be better to get rid of that, as well as the static variables it
sets, in favor of keeping the needed locale state in the regex structs.
We have not done this yet for lack of a design for how to add
application-specific state to the structs.)

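For concreteness, the single-translation-unit trick amounts to something
like this (an illustrative excerpt only, not verbatim; regcomp.c itself is
the authoritative statement of which files are included and where):

    /* regcomp.c (sketch): the regc_*.c sub-files are compiled as part of
     * this translation unit, so everything they define can remain static */
    #include "regguts.h"

    /* ... pg_regcomp() and the rest of the top-level compilation code ... */

    #include "regc_lex.c"
    #include "regc_color.c"
    #include "regc_nfa.c"
    #include "regc_cvec.c"
    #include "regc_pg_locale.c"
    #include "regc_locale.c"
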
What's where in src/backend/regex/:

regcomp.c           Top-level regex compilation code
regc_color.c        Color map management
regc_cvec.c         Character vector (cvec) management
regc_lex.c          Lexer
regc_nfa.c          NFA handling
regc_locale.c       Application-specific locale code from Tcl project
regc_pg_locale.c    Postgres-added application-specific locale code
regexec.c           Top-level regex execution code
rege_dfa.c          DFA creation and execution
regerror.c          pg_regerror: generate text for a regex error code
regfree.c           pg_regfree: API to free a no-longer-needed regex_t

The locale-specific code is concerned primarily with case-folding and with
expanding locale-specific character classes, such as [[:alnum:]]. It
really needs refactoring if this is ever to become a standalone library.

The header files for the library are in src/include/regex/:

regcustom.h         Customizes library for particular application
regerrs.h           Error message list
regex.h             Exported API
regguts.h           Internals declarations


DFAs, NFAs, and all that
------------------------

This library is a hybrid DFA/NFA regex implementation. (If you've never
heard either of those terms, get thee to a first-year comp sci textbook.)
It might not be clear at first glance what that really means and how it
relates to what you'll see in the code. Here's what really happens:

* Initial parsing of a regex generates an NFA representation, with number
of states approximately proportional to the length of the regexp.

* The NFA is then optimized into a "compact NFA" representation, which is
basically the same data but without fields that are not going to be needed
at runtime. We do a little bit of cleanup too, such as removing
unreachable states that might be created as a result of the rather naive
transformation done by initial parsing. The cNFA representation is what
is passed from regcomp to regexec.

* Unlike traditional NFA-based regex engines, we do not execute directly
from the NFA representation, as that would require backtracking and so be
very slow in some cases. Rather, we execute a DFA, which ideally can
process an input string in linear time (O(M) for M characters of input)
without backtracking. Each state of the DFA corresponds to a set of
states of the NFA, that is all the states that the NFA might have been in
upon reaching the current point in the input string. Therefore, an NFA
with N states might require as many as 2^N states in the corresponding
DFA, which could easily require unreasonable amounts of memory. We deal
with this by materializing states of the DFA lazily (only when needed) and
keeping them in a limited-size cache. The possible need to build the same
state of the DFA repeatedly makes this approach not truly O(M) time, but
in the worst case as much as O(M*N). That's still far better than the
worst case for a backtracking NFA engine.

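To make the lazy-materialization idea concrete, here is a deliberately tiny
sketch. Everything in it is hypothetical (the names, the sizes, the
single-word bitset); the real DFA machinery lives in rege_dfa.c, with its
data structures declared in regguts.h.

    #include <string.h>

    #define NCOLORS        8                /* assumed number of colors */
    #define DFA_CACHE_SIZE 32               /* bounded cache of DFA states */

    typedef unsigned long long StateSet;    /* bit i set => NFA state i live */

    struct dstate
    {
        StateSet    members;                /* NFA states this DFA state stands for */
        int         outs[NCOLORS];          /* successor cache slot per color, or -1 */
    };

    struct dcache
    {
        struct dstate d[DFA_CACHE_SIZE];
        int         ndstates;
    };

    /*
     * Find the cache slot holding this set of NFA states, materializing it
     * if needed.  When the cache is full we just discard everything and
     * start over, trading extra rebuild time for bounded memory.
     */
    static int
    dstate_for(struct dcache *c, StateSet members, int *flushed)
    {
        int         i;

        *flushed = 0;
        for (i = 0; i < c->ndstates; i++)
        {
            if (c->d[i].members == members)
                return i;                   /* already materialized */
        }
        if (c->ndstates == DFA_CACHE_SIZE)
        {
            c->ndstates = 0;                /* cache full: throw it all away */
            *flushed = 1;
        }
        i = c->ndstates++;
        c->d[i].members = members;
        memset(c->d[i].outs, -1, sizeof(c->d[i].outs));
        return i;
    }

    /*
     * Advance over one input color, building the successor DFA state on
     * demand.  next_members() stands in for "union of the NFA transitions
     * on this color from all member states".
     */
    static int
    dfa_step(struct dcache *c, int cur, int color,
             StateSet (*next_members) (StateSet, int))
    {
        StateSet    next;
        int         flushed;
        int         i;

        if (c->d[cur].outs[color] >= 0)
            return c->d[cur].outs[color];   /* successor already built */
        next = next_members(c->d[cur].members, color);
        i = dstate_for(c, next, &flushed);
        if (!flushed)                       /* cur's slot is gone after a flush */
            c->d[cur].outs[color] = i;
        return i;
    }
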
If that were the end of it, we'd just say this is a DFA engine, with the
use of NFAs being merely an implementation detail. However, a DFA engine
cannot handle some important regex features such as capturing parens and
back-references. If the parser finds that a regex uses these features
(collectively called "messy cases" in the code), then we have to use
NFA-style backtracking search after all.

When using the NFA mode, the representation constructed by the parser
consists of a tree of sub-expressions ("subre"s). Leaf tree nodes are
either plain regular expressions (which are executed as DFAs in the manner
described above) or back-references (which try to match the input to some
previous substring). Non-leaf nodes are capture nodes (which save the
location of the substring currently matching their child node) or
concatenation or alternation nodes. At execution time, the executor
recursively scans the tree. At concatenation or alternation nodes,
it considers each possible alternative way of matching the input string,
i.e., each place where the string could be split for a concatenation, or each
child node for an alternation. It tries the next alternative if the match
fails according to the child nodes. This is exactly the sort of
backtracking search done by a traditional NFA regex engine. If there are
many tree levels it can get very slow.

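In rough outline, the tree might be modeled like this (hypothetical field
names only; the real declarations are in regguts.h and carry considerably
more bookkeeping, such as iteration bounds and match flags):

    struct cnfa;                            /* compact NFA; kept opaque here */

    struct subre_sketch
    {
        enum
        {
            PLAIN,                          /* leaf: pure regex, run as a DFA */
            BACKREF,                        /* leaf: must match a prior capture */
            CAPTURE,                        /* record what the child matches */
            CONCAT,                         /* left part followed by right part */
            ALTERNATE                       /* left part or right part */
        }           kind;
        int         group;                  /* capture-group number, where relevant */
        struct subre_sketch *child1;        /* only/left child; NULL for leaves */
        struct subre_sketch *child2;        /* right child of CONCAT/ALTERNATE */
        struct cnfa *prefilter;             /* DFA-able summary of what can match */
    };
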
But all is not lost: we can still be smarter than the average pure NFA
engine. To do this, each subre node has an associated DFA, which
represents what the node could possibly match insofar as a mathematically
pure regex can describe that, which basically means "no backrefs".
Before we perform any search of possible alternative sub-matches, we run
the DFA to see if it thinks the proposed substring could possibly match.
If not, we can reject the match immediately without iterating through many
possibilities.

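Continuing the sketch above, the search over a concatenation node with the
DFA used as a prefilter could look roughly like this (dfa_matches() and
node_matches() are stand-ins for the real machinery in regexec.c, and the
string handling is oversimplified):

    /* Hypothetical stand-ins for the real execution machinery. */
    static int dfa_matches(struct cnfa *prefilter, const char *s, int b, int e);
    static int node_matches(struct subre_sketch *t, const char *s, int b, int e);

    /*
     * Try every split point of s[b..e) between the two children of a CONCAT
     * node, but first let the node's own DFA reject the whole substring
     * cheaply, so that hopeless cases never reach the backtracking loop.
     */
    static int
    concat_matches(struct subre_sketch *t, const char *s, int b, int e)
    {
        int         split;

        if (!dfa_matches(t->prefilter, s, b, e))
            return 0;                       /* cheap rejection, no recursion */

        for (split = b; split <= e; split++)
        {
            if (node_matches(t->child1, s, b, split) &&
                node_matches(t->child2, s, split, e))
                return 1;                   /* found a workable division */
        }
        return 0;                           /* every division failed */
    }
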
As an example, consider the regex "(a[bc]+)\1". The compiled
representation will have a top-level concatenation subre node. Its left
child is a capture node, and the child of that is a plain DFA node for
"a[bc]+". The concatenation's right child is a backref node for \1.
The DFA associated with the concatenation node will be "a[bc]+a[bc]+",
where the backref has been replaced by a copy of the DFA for its referent
expression. When executed, the concatenation node will have to search for
a possible division of the input string that allows its two child nodes to
each match their part of the string (and although this specific case can
only succeed when the division is at the middle, the code does not know
that, nor would it be true in general). However, we can first run the DFA
and quickly reject any input that doesn't contain two a's and some number
of b's and c's. If the DFA doesn't match, there is no need to recurse to
the two child nodes for each possible string division point. In many
cases, this prefiltering makes the search run much faster than a pure NFA
engine could do. It is this behavior that justifies using the phrase
"hybrid DFA/NFA engine" to describe Spencer's library.


Colors and colormapping
-----------------------

In many common regex patterns, there are large numbers of characters that
can be treated alike by the execution engine. A simple example is the
pattern "[[:alpha:]][[:alnum:]]*" for an identifier. Basically the engine
only needs to care whether an input symbol is a letter, a digit, or other.
We could build the NFA or DFA with a separate arc for each possible letter
and digit, but that's very wasteful of space and not so cheap to execute
either, especially when dealing with Unicode which can have thousands of
letters. Instead, the parser builds a "color map" that maps each possible
input symbol to a "color", or equivalence class. The NFA or DFA
representation then has arcs labeled with colors, not specific input
symbols. At execution, the first thing the executor does with each input
symbol is to look up its color in the color map, and then everything else
works from the color only.

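In code terms, the per-character work at execution time is roughly the
following (hypothetical names throughout; color_of() and next_state() stand
in for the colormap lookup and the color-labeled transition):

    typedef int chr;                        /* stand-in for the library's chr */

    /*
     * Sketch: map each input character to its color first, then drive the
     * automaton purely by color, so the per-character cost is one colormap
     * lookup plus one arc lookup no matter how many characters share a color.
     */
    static int
    longest_colored_run(const chr *input, int len,
                        int (*color_of) (chr),
                        int (*next_state) (int, int))
    {
        int         state = 0;              /* assume 0 is the start state */
        int         i;

        for (i = 0; i < len; i++)
        {
            int         color = color_of(input[i]);

            state = next_state(state, color);
            if (state < 0)
                return i;                   /* no arc for that color: stop */
        }
        return len;
    }
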
To build the colormap, we start by assigning every possible input symbol
the color WHITE, which means "other" (that is, at the end of parsing, the
symbols that are still WHITE are those not explicitly referenced anywhere
in the regex). When we see a simple literal character or a bracket
expression in the regex, we want to assign that character, or all the
characters represented by the bracket expression, a unique new color that
can be used to label the NFA arc corresponding to the state transition for
matching this character or bracket expression. The basic idea is:
first, change the color assigned to a character to some new value;
second, run through all the existing arcs in the partially-built NFA,
and for each one referencing the character's old color, add a parallel
arc referencing its new color (this keeps the reassignment from changing
the semantics of what we already built); and third, add a new arc with
the character's new color to the current pair of NFA states, denoting
that seeing this character allows the state transition to be made.

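Ignoring for the moment the complication discussed next, those three steps
could be sketched like this (all names here are hypothetical; the real
logic is spread across regc_color.c and regc_nfa.c):

    typedef int chr;                        /* stand-in for the library's chr */

    struct arc_sketch
    {
        int         color;
        int         from;
        int         to;
        struct arc_sketch *next;            /* all arcs, chained together */
    };

    struct nfa_sketch
    {
        struct arc_sketch *arcs;            /* plus states, colormap, etc. */
    };

    /* Hypothetical stand-ins for the real colormap/NFA primitives. */
    static int  color_of(struct nfa_sketch *nfa, chr c);
    static int  alloc_color(struct nfa_sketch *nfa);
    static void set_color(struct nfa_sketch *nfa, chr c, int color);
    static void add_arc(struct nfa_sketch *nfa, int from, int to, int color);

    static void
    recolor_for_atom(struct nfa_sketch *nfa, chr c, int from, int to)
    {
        int         oldcolor = color_of(nfa, c);
        int         newcolor = alloc_color(nfa);
        struct arc_sketch *a;

        /* step 1: give the character its own new color */
        set_color(nfa, c, newcolor);

        /* step 2: every arc that accepted the old color must also accept the
         * new one, else we'd change the meaning of what was already built */
        for (a = nfa->arcs; a != NULL; a = a->next)
        {
            if (a->color == oldcolor)
                add_arc(nfa, a->from, a->to, newcolor);
        }

        /* step 3: the transition this atom actually denotes */
        add_arc(nfa, from, to, newcolor);
    }
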
This is complicated a bit by not wanting to create more colors
(equivalence classes) than absolutely necessary. In particular, if a
bracket expression mentions two characters that had the same color before,
they should still share the same color after we process the bracket, since
there is still no need to distinguish them. But we do need to
distinguish them from other characters that previously had the same color
yet are not listed in the bracket expression. To mechanize this, the code
has a concept of "parent colors" and "subcolors", where a color's subcolor
is the new color that we are giving to any characters of that color while
parsing the current atom. (The word "parent" is a bit unfortunate here,
because it suggests a long-lived relationship, but a subcolor link really
only lasts for the duration of parsing a single atom.) In other words,
a subcolor link means that we are in the process of splitting the parent color
into two colors (equivalence classes), depending on whether or not each
member character should be included by the current regex atom.

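In sketch form, the rule applied to each character of the current atom is
roughly this (hypothetical names and bookkeeping; okcolors(), described
below, is what closes out the links once the atom is done):

    typedef int chr;                        /* stand-in for the library's chr */

    #define MAXCOLORS_SKETCH 128

    struct colormap_sketch
    {
        int         nchrs[MAXCOLORS_SKETCH];   /* member count per color */
        int         sub[MAXCOLORS_SKETCH];     /* open subcolor, or -1 if none */
        /* ... plus the per-character color assignments themselves ... */
    };

    /* Hypothetical stand-ins; set_color() is assumed to keep nchrs[] current. */
    static int  color_of(struct colormap_sketch *cm, chr c);
    static int  alloc_color(struct colormap_sketch *cm);
    static void set_color(struct colormap_sketch *cm, chr c, int color);

    /*
     * Decide which color a character should carry while parsing this atom:
     * sole members of their color keep it, otherwise the character is moved
     * into its parent color's subcolor, creating that subcolor on first use.
     */
    static int
    subcolor_for(struct colormap_sketch *cm, chr c)
    {
        int         old = color_of(cm, c);

        if (cm->nchrs[old] == 1)
            return old;                     /* already alone: nothing to split */
        if (cm->sub[old] < 0)
            cm->sub[old] = alloc_color(cm); /* open the new equivalence class */
        set_color(cm, c, cm->sub[old]);     /* move the chr into the subcolor */
        return cm->sub[old];
    }
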
As an example, suppose we have the regex "a\d\wx". Initially all possible
character codes are labeled WHITE (color 0). To parse the atom "a", we
create a new color (1), update "a"'s color map entry to 1, and create an
arc labeled 1 between the first two states of the NFA. Now we see \d,
which is really a bracket expression containing the digits "0"-"9".
First we process "0", which is currently WHITE, so we create a new color
(2), update "0"'s color map entry to 2, and create an arc labeled 2
between the second and third states of the NFA. We also mark color WHITE
as having the subcolor 2, which means that future relabelings of WHITE
characters should also select 2 as the new color. Thus, when we process
"1", we won't create a new color but re-use 2. We update "1"'s color map
entry to 2, and then find that we don't need a new arc because there is
already one labeled 2 between the second and third states of the NFA.
Similarly for the other 8 digits, so there will be only one arc labeled 2
between NFA states 2 and 3 for all members of this bracket expression.
At completion of processing of the bracket expression, we call okcolors()
which breaks all the existing parent/subcolor links; there is no longer a
marker saying that WHITE characters should be relabeled 2. (Note:
actually, we did the same creation and clearing of a subcolor link for the
primitive atom "a", but it didn't do anything very interesting.) Now we
come to the "\w" bracket expression, which for simplicity assume expands
to just "[a-z0-9]". We process "a", but observe that it is already the
sole member of its color 1. This means there is no need to subdivide that
equivalence class more finely, so we do not create any new color. We just
make an arc labeled 1 between the third and fourth NFA states. Next we
process "b", which is WHITE and far from the only WHITE character, so we
create a new color (3), link that as WHITE's subcolor, relabel "b" as
color 3, and make an arc labeled 3. As we process "c" through "z", each
is relabeled from WHITE to 3, but no new arc is needed. Now we come to
"0", which is not the only member of its color 2, so we suppose that a new
color is needed and create color 4. We link 4 as subcolor of 2, relabel
"0" as color 4 in the map, and add an arc for color 4. Next "1" through
"9" are similarly relabeled as color 4, with no additional arcs needed.
Having finished the bracket expression, we call okcolors(), which breaks
the subcolor links. okcolors() further observes that we have removed
every member of color 2 (the previous color of the digit characters).
Therefore, it runs through the partial NFA built so far and relabels arcs
labeled 2 to color 4; in particular the arc from NFA state 2 to state 3 is
relabeled color 4. Then it frees up color 2, since we have no more use
for that color. We now have an NFA in which transitions for digits are
consistently labeled with color 4. Last, we come to the atom "x".
"x" is currently labeled with color 3, and it's not the only member of
that color, so we realize that we now need to distinguish "x" from other
letters when we did not before. We create a new color, which might have
been 5 but instead we recycle the unused color 2. "x" is relabeled 2 in
the color map and 2 is linked as the subcolor of 3, and we add an arc for
2 between states 4 and 5 of the NFA. Now we call okcolors(), which breaks
the subcolor link between colors 3 and 2 and notices that both colors are
nonempty. Therefore, it also runs through the existing NFA arcs and adds
an additional arc labeled 2 wherever there is an arc labeled 3; this
action ensures that characters of color 2 (i.e., "x") will still be
considered as allowing any transitions they did before. We are now done
parsing the regex, and we have these final color assignments:
    color 1: "a"
    color 2: "x"
    color 3: other letters
    color 4: digits
and the NFA has these arcs:
    states 1 -> 2 on color 1 (hence, "a" only)
    states 2 -> 3 on color 4 (digits)
    states 3 -> 4 on colors 1, 3, 4, and 2 (covering all \w characters)
    states 4 -> 5 on color 2 ("x" only)
which can be seen to be a correct representation of the regex.

Given this summary, we can see we need the following operations for
colors:

* A fast way to look up the current color assignment for any character
code. (This is needed during both parsing and execution, while the
remaining operations are needed only during parsing.)
* A way to alter the color assignment for any given character code.
* We must track the number of characters currently assigned to each
color, so that we can detect empty and singleton colors.
* We must track all existing NFA arcs of a given color, so that we
can relabel them at need, or add parallel arcs of a new color when
an existing color has to be subdivided.

The last two of these are handled with the "struct colordesc" array and
the "colorchain" links in NFA arc structs. The color map proper (that
is, the per-character lookup array) is handled as a multi-level tree,
with each tree level indexed by one byte of a character's value. The
code arranges to not have more than one copy of bottom-level tree pages
that are all-the-same-color.

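For illustration, a byte-indexed lookup of that shape might look like the
following (purely a sketch; the real tree and colormap declarations are in
regguts.h):

    /*
     * Four-level lookup for a 32-bit character value: upper levels hold
     * child pointers indexed by one byte each, the bottom level holds one
     * color per low-order byte, and bottom pages that are entirely one
     * color can be shared among parents.
     */
    typedef unsigned int chr_sketch;        /* stand-in for a 32-bit chr */
    typedef short color_sketch;

    union tree_sketch
    {
        union tree_sketch *ptrs[256];       /* upper levels: child per byte */
        color_sketch colors[256];           /* bottom level: the answers */
    };

    static color_sketch
    lookup_color(union tree_sketch *top, chr_sketch c)
    {
        union tree_sketch *t = top;

        t = t->ptrs[(c >> 24) & 0xFF];      /* one byte per level ... */
        t = t->ptrs[(c >> 16) & 0xFF];
        t = t->ptrs[(c >> 8) & 0xFF];
        return t->colors[c & 0xFF];         /* ... ending at a page of colors */
    }
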
Unfortunately, this design does not seem terribly efficient for common
cases such as a tree in which all Unicode letters are colored the same,
because there aren't that many places where we get a whole page all the
same color, except at the end of the map. (It also strikes me that given
PG's current restrictions on the range of Unicode values, we could use a
3-level rather than 4-level tree; but there's no provision for that in
regguts.h at the moment.)

A bigger problem is that it just doesn't seem very reasonable to have to
consider each Unicode letter separately at regex parse time for a regex
such as "\w"; more than likely, a huge percentage of those codes will
never be seen at runtime. We need to fix things so that locale-based
character classes are somehow processed "symbolically" without making a
full expansion of their contents at parse time. This would mean that we'd
have to be ready to call iswalpha() at runtime, but if that only happens
for high-code-value characters, it shouldn't be a big performance hit.

src/backend/regex/regc_cvec.c

Lines changed: 12 additions & 1 deletion
@@ -77,6 +77,7 @@ static void
 addchr(struct cvec * cv,           /* character vector */
        chr c)                      /* character to add */
 {
+    assert(cv->nchrs < cv->chrspace);
     cv->chrs[cv->nchrs++] = (chr) c;
 }
 
@@ -95,17 +96,27 @@ addrange(struct cvec * cv, /* character vector */
 }
 
 /*
- * getcvec - get a cvec, remembering it as v->cv
+ * getcvec - get a transient cvec, initialized to empty
+ *
+ * The returned cvec is valid only until the next call of getcvec, which
+ * typically will recycle the space. Callers should *not* free the cvec
+ * explicitly; it will be cleaned up when the struct vars is destroyed.
+ *
+ * This is typically used while interpreting bracket expressions. In that
+ * usage the cvec is only needed momentarily until we build arcs from it,
+ * so transientness is a convenient behavior.
  */
 static struct cvec *
 getcvec(struct vars * v,            /* context */
         int nchrs,                  /* to hold this many chrs... */
         int nranges)                /* ... and this many ranges */
 {
+    /* recycle existing transient cvec if large enough */
     if (v->cv != NULL && nchrs <= v->cv->chrspace &&
         nranges <= v->cv->rangespace)
         return clearcvec(v->cv);
 
+    /* nope, make a new one */
     if (v->cv != NULL)
         freecvec(v->cv);
     v->cv = newcvec(nchrs, nranges);
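
For context, a hypothetical caller following the transient-cvec convention
documented above might look like this (arcs_from_cvec() is invented for the
sketch; it stands in for whatever code turns the collected characters and
ranges into NFA arcs):

    /* Hypothetical caller: build a transient cvec describing one bracket
     * expression, turn it into NFA arcs, and simply let the next getcvec()
     * call recycle the storage. */
    static void
    bracket_to_arcs(struct vars * v, struct state * lp, struct state * rp)
    {
        struct cvec *cv = getcvec(v, 16, 4);    /* room for 16 chrs, 4 ranges */

        addchr(cv, '_');                        /* a literal member */
        addrange(cv, 'a', 'z');                 /* a character range */
        arcs_from_cvec(v, cv, lp, rp);          /* invented: emit the arcs */
        /* note: no freecvec(cv) -- transient cvecs are recycled automatically */
    }
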

src/backend/regex/regcomp.c

Lines changed: 1 addition & 0 deletions
@@ -356,6 +356,7 @@ pg_regcomp(regex_t *re,
     ZAPCNFA(g->search);
     v->nfa = newnfa(v, v->cm, (struct nfa *) NULL);
     CNOERR();
+    /* set up a reasonably-sized transient cvec for getcvec usage */
     v->cv = newcvec(100, 20);
     if (v->cv == NULL)
         return freev(v, REG_ESPACE);
