|
1 | 1 | <!--
|
2 |
| -$PostgreSQL: pgsql/doc/src/sgml/func.sgml,v 1.233 2005/01/08 05:19:18 tgl Exp $ |
| 2 | +$PostgreSQL: pgsql/doc/src/sgml/func.sgml,v 1.234 2005/01/09 20:08:50 tgl Exp $ |
3 | 3 | PostgreSQL documentation
|
4 | 4 | -->
|
5 | 5 |
|
@@ -3772,45 +3772,109 @@ substring('foobar' from 'o(.)b') <lineannotation>o</lineannotation>
|
3772 | 3772 | In the event that an RE could match more than one substring of a given
|
3773 | 3773 | string, the RE matches the one starting earliest in the string.
|
3774 | 3774 | If the RE could match more than one substring starting at that point,
|
3775 |
| - its choice is determined by its <firstterm>preference</>: |
3776 |
| - either the longest substring, or the shortest. |
| 3775 | + either the longest possible match or the shortest possible match will |
| 3776 | + be taken, depending on whether the RE is <firstterm>greedy</> or |
| 3777 | + <firstterm>non-greedy</>. |
3777 | 3778 | </para>
|
3778 | 3779 |
|
3779 | 3780 | <para>
|
3780 |
| - Most atoms, and all constraints, have no preference. |
3781 |
| - A parenthesized RE has the same preference (possibly none) as the RE. |
3782 |
| - A quantified atom with quantifier |
3783 |
| - <literal>{</><replaceable>m</><literal>}</> |
3784 |
| - or |
3785 |
| - <literal>{</><replaceable>m</><literal>}?</> |
3786 |
| - has the same preference (possibly none) as the atom itself. |
3787 |
| - A quantified atom with other normal quantifiers (including |
3788 |
| - <literal>{</><replaceable>m</><literal>,</><replaceable>n</><literal>}</> |
3789 |
| - with <replaceable>m</> equal to <replaceable>n</>) |
3790 |
| - prefers longest match. |
3791 |
| - A quantified atom with other non-greedy quantifiers (including |
3792 |
| - <literal>{</><replaceable>m</><literal>,</><replaceable>n</><literal>}?</> |
3793 |
| - with <replaceable>m</> equal to <replaceable>n</>) |
3794 |
| - prefers shortest match. |
3795 |
| - A branch has the same preference as the first quantified atom in it |
3796 |
| - which has a preference. |
3797 |
| - An RE consisting of two or more branches connected by the |
3798 |
| - <literal>|</> operator prefers longest match. |
| 3781 | + Whether an RE is greedy or not is determined by the following rules: |
| 3782 | + <itemizedlist> |
| 3783 | + <listitem> |
| 3784 | + <para> |
| 3785 | + Most atoms, and all constraints, have no greediness attribute (because |
| 3786 | + they cannot match variable amounts of text anyway). |
| 3787 | + </para> |
| 3788 | + </listitem> |
| 3789 | + <listitem> |
| 3790 | + <para> |
| 3791 | + Adding parentheses around an RE does not change its greediness. |
| 3792 | + </para> |
| 3793 | + </listitem> |
| 3794 | + <listitem> |
| 3795 | + <para> |
| 3796 | + A quantified atom with a fixed-repetition quantifier |
| 3797 | + (<literal>{</><replaceable>m</><literal>}</> |
| 3798 | + or |
| 3799 | + <literal>{</><replaceable>m</><literal>}?</>) |
| 3800 | + has the same greediness (possibly none) as the atom itself. |
| 3801 | + </para> |
| 3802 | + </listitem> |
| 3803 | + <listitem> |
| 3804 | + <para> |
| 3805 | + A quantified atom with other normal quantifiers (including |
| 3806 | + <literal>{</><replaceable>m</><literal>,</><replaceable>n</><literal>}</> |
| 3807 | + with <replaceable>m</> equal to <replaceable>n</>) |
| 3808 | + is greedy (prefers longest match). |
| 3809 | + </para> |
| 3810 | + </listitem> |
| 3811 | + <listitem> |
| 3812 | + <para> |
| 3813 | + A quantified atom with a non-greedy quantifier (including |
| 3814 | + <literal>{</><replaceable>m</><literal>,</><replaceable>n</><literal>}?</> |
| 3815 | + with <replaceable>m</> equal to <replaceable>n</>) |
| 3816 | + is non-greedy (prefers shortest match). |
| 3817 | + </para> |
| 3818 | + </listitem> |
| 3819 | + <listitem> |
| 3820 | + <para> |
| 3821 | + A branch — that is, an RE that has no top-level |
| 3822 | + <literal>|</> operator — has the same greediness as the first |
| 3823 | + quantified atom in it that has a greediness attribute. |
| 3824 | + </para> |
| 3825 | + </listitem> |
| 3826 | + <listitem> |
| 3827 | + <para> |
| 3828 | + An RE consisting of two or more branches connected by the |
| 3829 | + <literal>|</> operator is always greedy. |
| 3830 | + </para> |
| 3831 | + </listitem> |
| 3832 | + </itemizedlist> |
| 3833 | + </para> |
| 3834 | + |
| 3835 | + <para> |
| 3836 | + The above rules associate greediness attributes not only with individual |
| 3837 | + quantified atoms, but with branches and entire REs that contain quantified |
| 3838 | + atoms. What that means is that the matching is done in such a way that |
| 3839 | + the branch, or whole RE, matches the longest or shortest possible |
| 3840 | + substring <emphasis>as a whole</>. Once the length of the entire match |
| 3841 | + is determined, the part of it that matches any particular subexpression |
| 3842 | + is determined on the basis of the greediness attribute of that |
| 3843 | + subexpression, with subexpressions starting earlier in the RE taking |
| 3844 | + priority over ones starting later. |
| 3845 | + </para> |
| 3846 | + |
| 3847 | + <para> |
| 3848 | + An example of what this means: |
| 3849 | +<screen> |
| 3850 | +SELECT SUBSTRING('XY1234Z', 'Y*([0-9]{1,3})'); |
| 3851 | +<lineannotation>Result: </lineannotation><computeroutput>123</computeroutput> |
| 3852 | +SELECT SUBSTRING('XY1234Z', 'Y*?([0-9]{1,3})'); |
| 3853 | +<lineannotation>Result: </lineannotation><computeroutput>1</computeroutput> |
| 3854 | +</screen> |
| 3855 | + In the first case, the RE as a whole is greedy because <literal>Y*</> |
| 3856 | + is greedy. It can match beginning at the <literal>Y</>, and it matches |
| 3857 | + the longest possible string starting there, i.e., <literal>Y123</>. |
| 3858 | + The output is the parenthesized part of that, or <literal>123</>. |
| 3859 | + In the second case, the RE as a whole is non-greedy because <literal>Y*?</> |
| 3860 | + is non-greedy. It can match beginning at the <literal>Y</>, and it matches |
| 3861 | + the shortest possible string starting there, i.e., <literal>Y1</>. |
| 3862 | + The subexpression <literal>[0-9]{1,3}</> is greedy but it cannot change |
| 3863 | + the decision as to the overall match length; so it is forced to match |
| 3864 | + just <literal>1</>. |
3799 | 3865 | </para>
|
3800 | 3866 |
|
3801 | 3867 | <para>
|
3802 |
| - Subject to the constraints imposed by the rules for matching the whole RE, |
3803 |
| - subexpressions also match the longest or shortest possible substrings, |
3804 |
| - based on their preferences, |
3805 |
| - with subexpressions starting earlier in the RE taking priority over |
3806 |
| - ones starting later. |
3807 |
| - Note that outer subexpressions thus take priority over |
3808 |
| - their component subexpressions. |
| 3868 | + In short, when an RE contains both greedy and non-greedy subexpressions, |
| 3869 | + the total match length is either as long as possible or as short as |
| 3870 | + possible, according to the attribute assigned to the whole RE. The |
| 3871 | + attributes assigned to the subexpressions only affect how much of that |
| 3872 | + match they are allowed to <quote>eat</> relative to each other. |
3809 | 3873 | </para>
|
3810 | 3874 |
|
3811 | 3875 | <para>
|
3812 | 3876 | The quantifiers <literal>{1,1}</> and <literal>{1,1}?</>
|
3813 |
| - can be used to force longest and shortest preference, respectively, |
| 3877 | + can be used to force greediness or non-greediness, respectively, |
3814 | 3878 | on a subexpression or a whole RE.
|
3815 | 3879 | </para>
|
3816 | 3880 |
|
|
0 commit comments