Implementation notes about Henry Spencer's regex library
========================================================

If Henry ever had any internals documentation, he didn't publish it.
So this file is an attempt to reverse-engineer some docs.

General source-file layout
--------------------------

There are four separately-compilable source files, each exposing exactly
one exported function:
    regcomp.c: pg_regcomp
    regexec.c: pg_regexec
    regerror.c: pg_regerror
    regfree.c: pg_regfree
(The pg_ prefixes were added by the Postgres project to distinguish this
library version from any similar one that might be present on a particular
system. They'd need to be removed or replaced in any standalone version
of the library.)

There are additional source files regc_*.c that are #include'd in regcomp,
and similarly additional source files rege_*.c that are #include'd in
regexec. This was done to avoid exposing internal symbols globally;
all functions not meant to be part of the library API are static.

(Actually the above is a lie in one respect: there is one more global
symbol, pg_set_regex_collation in regcomp. It is not meant to be part of
the API, but it has to be global because both regcomp and regexec call it.
It'd be better to get rid of that, as well as the static variables it
sets, in favor of keeping the needed locale state in the regex structs.
We have not done this yet for lack of a design for how to add
application-specific state to the structs.)

What's where in src/backend/regex/:

regcomp.c           Top-level regex compilation code
regc_color.c        Color map management
regc_cvec.c         Character vector (cvec) management
regc_lex.c          Lexer
regc_nfa.c          NFA handling
regc_locale.c       Application-specific locale code from Tcl project
regc_pg_locale.c    Postgres-added application-specific locale code
regexec.c           Top-level regex execution code
rege_dfa.c          DFA creation and execution
regerror.c          pg_regerror: generate text for a regex error code
regfree.c           pg_regfree: API to free a no-longer-needed regex_t

The locale-specific code is concerned primarily with case-folding and with
expanding locale-specific character classes, such as [[:alnum:]]. It
really needs refactoring if this is ever to become a standalone library.

The header files for the library are in src/include/regex/:

regcustom.h         Customizes library for particular application
regerrs.h           Error message list
regex.h             Exported API
regguts.h           Internals declarations


DFAs, NFAs, and all that
------------------------

This library is a hybrid DFA/NFA regex implementation. (If you've never
heard either of those terms, get thee to a first-year comp sci textbook.)
It might not be clear at first glance what that really means and how it
relates to what you'll see in the code. Here's what really happens:

* Initial parsing of a regex generates an NFA representation, with number
of states approximately proportional to the length of the regexp.

* The NFA is then optimized into a "compact NFA" representation, which is
basically the same data but without fields that are not going to be needed
at runtime. We do a little bit of cleanup too, such as removing
unreachable states that might be created as a result of the rather naive
transformation done by initial parsing. The cNFA representation is what
is passed from regcomp to regexec.

* Unlike traditional NFA-based regex engines, we do not execute directly
from the NFA representation, as that would require backtracking and so be
very slow in some cases. Rather, we execute a DFA, which ideally can
process an input string in linear time (O(M) for M characters of input)
without backtracking. Each state of the DFA corresponds to a set of
states of the NFA, that is, all the states that the NFA might have been in
upon reaching the current point in the input string. Therefore, an NFA
with N states might require as many as 2^N states in the corresponding
DFA, which could easily require unreasonable amounts of memory. We deal
with this by materializing states of the DFA lazily (only when needed) and
keeping them in a limited-size cache. The possible need to build the same
state of the DFA repeatedly makes this approach not truly O(M) time, but
in the worst case as much as O(M*N). That's still far better than the
worst case for a backtracking NFA engine.
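
To make the state-set idea concrete, here is a toy C sketch (not the
library's code; the little NFA and all names are invented for
illustration). It tracks the set of currently-possible NFA states as a
bitmask while scanning the input; each distinct bitmask value is exactly
what one DFA state encodes, and a real engine would lazily materialize and
cache those states rather than recompute the set at every step.

    #include <stdint.h>
    #include <stdio.h>

    /* Hand-built toy NFA for "a[bc]+":
     *   state 0 --a--> state 1,  state 1 --b/c--> state 2,
     *   state 2 --b/c--> state 2;  state 2 is the only accepting state. */
    static uint32_t step(uint32_t states, char c)
    {
        uint32_t next = 0;

        if ((states & (1u << 0)) && c == 'a')
            next |= 1u << 1;
        if ((states & (1u << 1)) && (c == 'b' || c == 'c'))
            next |= 1u << 2;
        if ((states & (1u << 2)) && (c == 'b' || c == 'c'))
            next |= 1u << 2;
        return next;
    }

    int main(void)
    {
        const char *input = "abcb";
        uint32_t states = 1u << 0;      /* start with only state 0 alive */

        for (const char *p = input; *p; p++)
        {
            states = step(states, *p);  /* move to the next "DFA state" */
            if (states == 0)
                break;                  /* no NFA state alive: cannot match */
        }
        printf("%s: %s\n", input,
               (states & (1u << 2)) ? "match" : "no match");
        return 0;
    }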

If that were the end of it, we'd just say this is a DFA engine, with the
use of NFAs being merely an implementation detail. However, a DFA engine
cannot handle some important regex features such as capturing parens and
back-references. If the parser finds that a regex uses these features
(collectively called "messy cases" in the code), then we have to use
NFA-style backtracking search after all.

When using the NFA mode, the representation constructed by the parser
consists of a tree of sub-expressions ("subre"s). Leaf tree nodes are
either plain regular expressions (which are executed as DFAs in the manner
described above) or back-references (which try to match the input to some
previous substring). Non-leaf nodes are capture nodes (which save the
location of the substring currently matching their child node) or
concatenation or alternation nodes. At execution time, the executor
recursively scans the tree. At concatenation or alternation nodes,
it considers each possible alternative way of matching the input string,
ie each place where the string could be split for a concatenation, or each
child node for an alternation. It tries the next alternative if the match
fails according to the child nodes. This is exactly the sort of
backtracking search done by a traditional NFA regex engine. If there are
many tree levels it can get very slow.

But all is not lost: we can still be smarter than the average pure NFA
engine. To do this, each subre node has an associated DFA, which
represents what the node could possibly match insofar as a mathematically
pure regex can describe that, which basically means "no backrefs".
Before we perform any search of possible alternative sub-matches, we run
the DFA to see if it thinks the proposed substring could possibly match.
If not, we can reject the match immediately without iterating through many
possibilities.

As an example, consider the regex "(a[bc]+)\1". The compiled
representation will have a top-level concatenation subre node. Its left
child is a capture node, and the child of that is a plain DFA node for
"a[bc]+". The concatenation's right child is a backref node for \1.
The DFA associated with the concatenation node will be "a[bc]+a[bc]+",
where the backref has been replaced by a copy of the DFA for its referent
expression. When executed, the concatenation node will have to search for
a possible division of the input string that allows its two child nodes to
each match their part of the string (and although this specific case can
only succeed when the division is at the middle, the code does not know
that, nor would it be true in general). However, we can first run the DFA
and quickly reject any input that doesn't contain two a's and some number
of b's and c's. If the DFA doesn't match, there is no need to recurse to
the two child nodes for each possible string division point. In many
cases, this prefiltering makes the search run much faster than a pure NFA
engine could do. It is this behavior that justifies using the phrase
"hybrid DFA/NFA engine" to describe Spencer's library.


Colors and colormapping
-----------------------

In many common regex patterns, there are large numbers of characters that
can be treated alike by the execution engine. A simple example is the
pattern "[[:alpha:]][[:alnum:]]*" for an identifier. Basically the engine
only needs to care whether an input symbol is a letter, a digit, or other.
We could build the NFA or DFA with a separate arc for each possible letter
and digit, but that's very wasteful of space and not so cheap to execute
either, especially when dealing with Unicode which can have thousands of
letters. Instead, the parser builds a "color map" that maps each possible
input symbol to a "color", or equivalence class. The NFA or DFA
representation then has arcs labeled with colors, not specific input
symbols. At execution, the first thing the executor does with each input
symbol is to look up its color in the color map, and then everything else
works from the color only.
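
As a toy illustration of that flow (not the library's representation; the
three colors and the tiny transition table below are invented), here is a
C sketch of executing "[[:alpha:]][[:alnum:]]*" over ASCII input. Only
three colors are needed no matter how many characters fall in each class,
and the DFA's transition table is indexed by color, not by character.

    #include <stdio.h>

    enum { OTHER = 0, LETTER = 1, DIGIT = 2, NCOLORS = 3 };

    /* Color map lookup: the only place raw characters are examined. */
    static int color_of(unsigned char c)
    {
        if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z'))
            return LETTER;
        if (c >= '0' && c <= '9')
            return DIGIT;
        return OTHER;
    }

    /* DFA arcs are labeled with colors.  State 0 is the start state,
     * state 1 is accepting, and -1 means "dead" (no match possible). */
    static const int next_state[2][NCOLORS] = {
        /* state 0 */ {-1, 1, -1},
        /* state 1 */ {-1, 1, 1},
    };

    int main(void)
    {
        const char *s = "x9y";
        int state = 0;

        for (const char *p = s; *p && state >= 0; p++)
            state = next_state[state][color_of((unsigned char) *p)];
        printf("%s: %s\n", s, state == 1 ? "match" : "no match");
        return 0;
    }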

To build the colormap, we start by assigning every possible input symbol
the color WHITE, which means "other" (that is, at the end of parsing, the
symbols that are still WHITE are those not explicitly referenced anywhere
in the regex). When we see a simple literal character or a bracket
expression in the regex, we want to assign that character, or all the
characters represented by the bracket expression, a unique new color that
can be used to label the NFA arc corresponding to the state transition for
matching this character or bracket expression. The basic idea is:
first, change the color assigned to a character to some new value;
second, run through all the existing arcs in the partially-built NFA,
and for each one referencing the character's old color, add a parallel
arc referencing its new color (this keeps the reassignment from changing
the semantics of what we already built); and third, add a new arc with
the character's new color to the current pair of NFA states, denoting
that seeing this character allows the state transition to be made.
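
Here is a toy C sketch of those three steps (invented names, a flat
256-entry color map, and a fixed-size arc array; the real code uses a
multi-level map and, importantly, the subcolor machinery described next to
avoid creating colors it doesn't need, whereas this sketch always makes a
new color, which is wasteful but still semantically correct).

    #include <stdio.h>

    #define MAXARCS 64

    typedef struct { int from, to, color; } Arc;

    static int cmap[256];           /* character -> color; 0 is WHITE */
    static Arc  arcs[MAXARCS];
    static int  narcs = 0;
    static int  nextcolor = 1;

    static void addarc(int from, int to, int color)
    {
        /* toy code: no overflow check against MAXARCS */
        arcs[narcs].from = from;
        arcs[narcs].to = to;
        arcs[narcs].color = color;
        narcs++;
    }

    /* Give character c a fresh color and connect two NFA states on it. */
    static void newcolorarc(unsigned char c, int from, int to)
    {
        int oldcolor = cmap[c];
        int newcolor = nextcolor++;
        int n = narcs;

        cmap[c] = newcolor;                     /* step 1: recolor c */

        /* step 2: wherever the old color could drive an arc, the
         * recolored character still can, via a parallel arc */
        for (int i = 0; i < n; i++)
            if (arcs[i].color == oldcolor)
                addarc(arcs[i].from, arcs[i].to, newcolor);

        /* step 3: the arc this atom actually wanted to create */
        addarc(from, to, newcolor);
    }

    int main(void)
    {
        newcolorarc('a', 1, 2);     /* atom "a" between NFA states 1 and 2 */
        newcolorarc('a', 2, 3);     /* "a" again, between states 2 and 3 */

        for (int i = 0; i < narcs; i++)
            printf("arc %d -> %d on color %d\n",
                   arcs[i].from, arcs[i].to, arcs[i].color);
        return 0;
    }

In the second call, the parallel arc labeled with the new color preserves
the 1 -> 2 transition created by the first atom; the real code would
instead notice that "a" is the sole member of its color and simply reuse
it, which is part of what the subcolor bookkeeping below is for.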

This is complicated a bit by not wanting to create more colors
(equivalence classes) than absolutely necessary. In particular, if a
bracket expression mentions two characters that had the same color before,
they should still share the same color after we process the bracket, since
there is still no need to distinguish them. But we do need to
distinguish them from other characters that previously had the same color
yet are not listed in the bracket expression. To mechanize this, the code
has a concept of "parent colors" and "subcolors", where a color's subcolor
is the new color that we are giving to any characters of that color while
parsing the current atom. (The word "parent" is a bit unfortunate here,
because it suggests a long-lived relationship, but a subcolor link really
only lasts for the duration of parsing a single atom.) In other words,
a subcolor link means that we are in the process of splitting the parent
color into two colors (equivalence classes), depending on whether or not
each member character should be included by the current regex atom.

As an example, suppose we have the regex "a\d\wx". Initially all possible
character codes are labeled WHITE (color 0). To parse the atom "a", we
create a new color (1), update "a"'s color map entry to 1, and create an
arc labeled 1 between the first two states of the NFA. Now we see \d,
which is really a bracket expression containing the digits "0"-"9".
First we process "0", which is currently WHITE, so we create a new color
(2), update "0"'s color map entry to 2, and create an arc labeled 2
between the second and third states of the NFA. We also mark color WHITE
as having the subcolor 2, which means that future relabelings of WHITE
characters should also select 2 as the new color. Thus, when we process
"1", we won't create a new color but re-use 2. We update "1"'s color map
entry to 2, and then find that we don't need a new arc because there is
already one labeled 2 between the second and third states of the NFA.
Similarly for the other 8 digits, so there will be only one arc labeled 2
between NFA states 2 and 3 for all members of this bracket expression.
At completion of processing of the bracket expression, we call okcolors(),
which breaks all the existing parent/subcolor links; there is no longer a
marker saying that WHITE characters should be relabeled 2. (Note:
actually, we did the same creation and clearing of a subcolor link for the
primitive atom "a", but it didn't do anything very interesting.) Now we
come to the "\w" bracket expression, which for simplicity we'll assume expands
to just "[a-z0-9]". We process "a", but observe that it is already the
sole member of its color 1. This means there is no need to subdivide that
equivalence class more finely, so we do not create any new color. We just
make an arc labeled 1 between the third and fourth NFA states. Next we
process "b", which is WHITE and far from the only WHITE character, so we
create a new color (3), link that as WHITE's subcolor, relabel "b" as
color 3, and make an arc labeled 3. As we process "c" through "z", each
is relabeled from WHITE to 3, but no new arc is needed. Now we come to
"0", which is not the only member of its color 2, so we suppose that a new
color is needed and create color 4. We link 4 as subcolor of 2, relabel
"0" as color 4 in the map, and add an arc for color 4. Next "1" through
"9" are similarly relabeled as color 4, with no additional arcs needed.
Having finished the bracket expression, we call okcolors(), which breaks
the subcolor links. okcolors() further observes that we have removed
every member of color 2 (the previous color of the digit characters).
Therefore, it runs through the partial NFA built so far and relabels arcs
labeled 2 to color 4; in particular the arc from NFA state 2 to state 3 is
relabeled color 4. Then it frees up color 2, since we have no more use
for that color. We now have an NFA in which transitions for digits are
consistently labeled with color 4. Last, we come to the atom "x".
"x" is currently labeled with color 3, and it's not the only member of
that color, so we realize that we now need to distinguish "x" from other
letters when we did not before. We create a new color, which might have
been 5 but instead we recycle the unused color 2. "x" is relabeled 2 in
the color map and 2 is linked as the subcolor of 3, and we add an arc for
2 between states 4 and 5 of the NFA. Now we call okcolors(), which breaks
the subcolor link between colors 3 and 2 and notices that both colors are
nonempty. Therefore, it also runs through the existing NFA arcs and adds
an additional arc labeled 2 wherever there is an arc labeled 3; this
action ensures that characters of color 2 (i.e., "x") will still be
considered as allowing any transitions they did before. We are now done
parsing the regex, and we have these final color assignments:
    color 1: "a"
    color 2: "x"
    color 3: other letters
    color 4: digits
and the NFA has these arcs:
    states 1 -> 2 on color 1 (hence, "a" only)
    states 2 -> 3 on color 4 (digits)
    states 3 -> 4 on colors 1, 3, 4, and 2 (covering all \w characters)
    states 4 -> 5 on color 2 ("x" only)
which can be seen to be a correct representation of the regex.

Given this summary, we can see we need the following operations for
colors:

* A fast way to look up the current color assignment for any character
  code. (This is needed during both parsing and execution, while the
  remaining operations are needed only during parsing.)
* A way to alter the color assignment for any given character code.
* We must track the number of characters currently assigned to each
  color, so that we can detect empty and singleton colors.
* We must track all existing NFA arcs of a given color, so that we
  can relabel them at need, or add parallel arcs of a new color when
  an existing color has to be subdivided.

The last two of these are handled with the "struct colordesc" array and
the "colorchain" links in NFA arc structs. The color map proper (that
is, the per-character lookup array) is handled as a multi-level tree,
with each tree level indexed by one byte of a character's value. The
code arranges to not have more than one copy of bottom-level tree pages
that are all-the-same-color.
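
Here is a toy C sketch of that kind of lookup structure (a simplified
two-level map for 16-bit character codes with invented names; the real
structures in regguts.h have more levels and considerably more
bookkeeping). All codes start out pointing at one shared all-WHITE page,
and a bottom-level page is copied out only when some character on it
needs a different color.

    #include <stdio.h>
    #include <stdlib.h>

    typedef struct { short colors[256]; } Page;

    static Page whitepage;          /* shared page: every entry is WHITE (0) */
    static Page *toplevel[256];     /* indexed by the character's high byte */

    static void cm_init(void)
    {
        for (int i = 0; i < 256; i++)
            toplevel[i] = &whitepage;
    }

    static short cm_lookup(unsigned int chr)
    {
        return toplevel[(chr >> 8) & 0xFF]->colors[chr & 0xFF];
    }

    static void cm_setcolor(unsigned int chr, short color)
    {
        Page **slot = &toplevel[(chr >> 8) & 0xFF];

        if (*slot == &whitepage)
        {
            /* unshare: give this high-byte range its own page
             * (out-of-memory handling omitted in this toy) */
            *slot = malloc(sizeof(Page));
            **slot = whitepage;
        }
        (*slot)->colors[chr & 0xFF] = color;
    }

    int main(void)
    {
        cm_init();
        cm_setcolor('a', 1);
        printf("'a' -> %d, 'b' -> %d, U+0661 -> %d\n",
               cm_lookup('a'), cm_lookup('b'), cm_lookup(0x0661));
        return 0;
    }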

Unfortunately, this design does not seem terribly efficient for common
cases such as a tree in which all Unicode letters are colored the same,
because there aren't that many places where we get a whole page all the
same color, except at the end of the map. (It also strikes me that given
PG's current restrictions on the range of Unicode values, we could use a
3-level rather than 4-level tree; but there's no provision for that in
regguts.h at the moment.)

A bigger problem is that it just doesn't seem very reasonable to have to
consider each Unicode letter separately at regex parse time for a regex
such as "\w"; more than likely, a huge percentage of those codes will
never be seen at runtime. We need to fix things so that locale-based
character classes are somehow processed "symbolically" without making a
full expansion of their contents at parse time. This would mean that we'd
have to be ready to call iswalpha() at runtime, but if that only happens
for high-code-value characters, it shouldn't be a big performance hit.