Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
Skip to content

Commit 0471cd5

Browse files
committed
Clarify description of greedy and non-greedy POSIX regular expressions,
per discussion in Nov 2004 with Ken Tanzer.
1 parent a9566cc commit 0471cd5

File tree

1 file changed

+94
-30
lines changed

1 file changed

+94
-30
lines changed

doc/src/sgml/func.sgml

Lines changed: 94 additions & 30 deletions
Original file line numberDiff line numberDiff line change
@@ -1,5 +1,5 @@
11
<!--
2-
$PostgreSQL: pgsql/doc/src/sgml/func.sgml,v 1.233 2005/01/08 05:19:18 tgl Exp $
2+
$PostgreSQL: pgsql/doc/src/sgml/func.sgml,v 1.234 2005/01/09 20:08:50 tgl Exp $
33
PostgreSQL documentation
44
-->
55

@@ -3772,45 +3772,109 @@ substring('foobar' from 'o(.)b') <lineannotation>o</lineannotation>
37723772
In the event that an RE could match more than one substring of a given
37733773
string, the RE matches the one starting earliest in the string.
37743774
If the RE could match more than one substring starting at that point,
3775-
its choice is determined by its <firstterm>preference</>:
3776-
either the longest substring, or the shortest.
3775+
either the longest possible match or the shortest possible match will
3776+
be taken, depending on whether the RE is <firstterm>greedy</> or
3777+
<firstterm>non-greedy</>.
37773778
</para>
37783779

37793780
<para>
3780-
Most atoms, and all constraints, have no preference.
3781-
A parenthesized RE has the same preference (possibly none) as the RE.
3782-
A quantified atom with quantifier
3783-
<literal>{</><replaceable>m</><literal>}</>
3784-
or
3785-
<literal>{</><replaceable>m</><literal>}?</>
3786-
has the same preference (possibly none) as the atom itself.
3787-
A quantified atom with other normal quantifiers (including
3788-
<literal>{</><replaceable>m</><literal>,</><replaceable>n</><literal>}</>
3789-
with <replaceable>m</> equal to <replaceable>n</>)
3790-
prefers longest match.
3791-
A quantified atom with other non-greedy quantifiers (including
3792-
<literal>{</><replaceable>m</><literal>,</><replaceable>n</><literal>}?</>
3793-
with <replaceable>m</> equal to <replaceable>n</>)
3794-
prefers shortest match.
3795-
A branch has the same preference as the first quantified atom in it
3796-
which has a preference.
3797-
An RE consisting of two or more branches connected by the
3798-
<literal>|</> operator prefers longest match.
3781+
Whether an RE is greedy or not is determined by the following rules:
3782+
<itemizedlist>
3783+
<listitem>
3784+
<para>
3785+
Most atoms, and all constraints, have no greediness attribute (because
3786+
they cannot match variable amounts of text anyway).
3787+
</para>
3788+
</listitem>
3789+
<listitem>
3790+
<para>
3791+
Adding parentheses around an RE does not change its greediness.
3792+
</para>
3793+
</listitem>
3794+
<listitem>
3795+
<para>
3796+
A quantified atom with a fixed-repetition quantifier
3797+
(<literal>{</><replaceable>m</><literal>}</>
3798+
or
3799+
<literal>{</><replaceable>m</><literal>}?</>)
3800+
has the same greediness (possibly none) as the atom itself.
3801+
</para>
3802+
</listitem>
3803+
<listitem>
3804+
<para>
3805+
A quantified atom with other normal quantifiers (including
3806+
<literal>{</><replaceable>m</><literal>,</><replaceable>n</><literal>}</>
3807+
with <replaceable>m</> equal to <replaceable>n</>)
3808+
is greedy (prefers longest match).
3809+
</para>
3810+
</listitem>
3811+
<listitem>
3812+
<para>
3813+
A quantified atom with a non-greedy quantifier (including
3814+
<literal>{</><replaceable>m</><literal>,</><replaceable>n</><literal>}?</>
3815+
with <replaceable>m</> equal to <replaceable>n</>)
3816+
is non-greedy (prefers shortest match).
3817+
</para>
3818+
</listitem>
3819+
<listitem>
3820+
<para>
3821+
A branch &mdash; that is, an RE that has no top-level
3822+
<literal>|</> operator &mdash; has the same greediness as the first
3823+
quantified atom in it that has a greediness attribute.
3824+
</para>
3825+
</listitem>
3826+
<listitem>
3827+
<para>
3828+
An RE consisting of two or more branches connected by the
3829+
<literal>|</> operator is always greedy.
3830+
</para>
3831+
</listitem>
3832+
</itemizedlist>
3833+
</para>
3834+
3835+
<para>
3836+
The above rules associate greediness attributes not only with individual
3837+
quantified atoms, but with branches and entire REs that contain quantified
3838+
atoms. What that means is that the matching is done in such a way that
3839+
the branch, or whole RE, matches the longest or shortest possible
3840+
substring <emphasis>as a whole</>. Once the length of the entire match
3841+
is determined, the part of it that matches any particular subexpression
3842+
is determined on the basis of the greediness attribute of that
3843+
subexpression, with subexpressions starting earlier in the RE taking
3844+
priority over ones starting later.
3845+
</para>
3846+
3847+
<para>
3848+
An example of what this means:
3849+
<screen>
3850+
SELECT SUBSTRING('XY1234Z', 'Y*([0-9]{1,3})');
3851+
<lineannotation>Result: </lineannotation><computeroutput>123</computeroutput>
3852+
SELECT SUBSTRING('XY1234Z', 'Y*?([0-9]{1,3})');
3853+
<lineannotation>Result: </lineannotation><computeroutput>1</computeroutput>
3854+
</screen>
3855+
In the first case, the RE as a whole is greedy because <literal>Y*</>
3856+
is greedy. It can match beginning at the <literal>Y</>, and it matches
3857+
the longest possible string starting there, i.e., <literal>Y123</>.
3858+
The output is the parenthesized part of that, or <literal>123</>.
3859+
In the second case, the RE as a whole is non-greedy because <literal>Y*?</>
3860+
is non-greedy. It can match beginning at the <literal>Y</>, and it matches
3861+
the shortest possible string starting there, i.e., <literal>Y1</>.
3862+
The subexpression <literal>[0-9]{1,3}</> is greedy but it cannot change
3863+
the decision as to the overall match length; so it is forced to match
3864+
just <literal>1</>.
37993865
</para>
38003866

38013867
<para>
3802-
Subject to the constraints imposed by the rules for matching the whole RE,
3803-
subexpressions also match the longest or shortest possible substrings,
3804-
based on their preferences,
3805-
with subexpressions starting earlier in the RE taking priority over
3806-
ones starting later.
3807-
Note that outer subexpressions thus take priority over
3808-
their component subexpressions.
3868+
In short, when an RE contains both greedy and non-greedy subexpressions,
3869+
the total match length is either as long as possible or as short as
3870+
possible, according to the attribute assigned to the whole RE. The
3871+
attributes assigned to the subexpressions only affect how much of that
3872+
match they are allowed to <quote>eat</> relative to each other.
38093873
</para>
38103874

38113875
<para>
38123876
The quantifiers <literal>{1,1}</> and <literal>{1,1}?</>
3813-
can be used to force longest and shortest preference, respectively,
3877+
can be used to force greediness or non-greediness, respectively,
38143878
on a subexpression or a whole RE.
38153879
</para>
38163880

0 commit comments

Comments
 (0)