Regular Expressions
In this section...
“What Is a Regular Expression?” on page 2-51
“Steps for Building Expressions” on page 2-52
“Operators and Characters” on page 2-55

This topic describes what regular expressions are and how to use them to search text. Regular
expressions are flexible and powerful, though they use complex syntax. An alternative to regular
expressions is a pattern (since R2020b), which is simpler to define and results in code that is easier
to read. For more information, see “Build Pattern Expressions” on page 6-40.

What Is a Regular Expression?


A regular expression is a sequence of characters that defines a certain pattern. You normally use a
regular expression to search text for a group of words that matches the pattern, for example, while
parsing program input or while processing a block of text.

The character vector 'Joh?n\w*' is an example of a regular expression. It defines a pattern that
starts with the letters Jo, is optionally followed by the letter h (indicated by 'h?'), is then followed
by the letter n, and ends with any number of word characters, that is, characters that are alphabetic,
numeric, or underscore (indicated by '\w*'). This pattern matches any of the following:

Jon, John, Jonathan, Johnny
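
As a quick check of this pattern, the sketch below runs it against a sample list of names. The variable names and the extra name Joan (added as a non-matching case) are illustrative assumptions, not part of the original example.

```matlab
% Sample text: four names that fit the pattern and one (Joan) that does not.
names = 'Jon John Jonathan Johnny Joan';

% The 'match' option returns the matched text instead of starting indices.
matches = regexp(names, 'Joh?n\w*', 'match');
disp(matches)
```

Joan is skipped because the pattern requires the letter n directly after Jo or Joh.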

Regular expressions provide a unique way to search a volume of text for a particular subset of
characters within that text. Instead of looking for an exact character match as you would do with a
function like strfind, regular expressions give you the ability to look for a particular pattern of
characters.

For example, several ways of expressing a metric rate of speed are:

km/h
km/hr
km/hour
kilometers/hour
kilometers per hour

You could locate any of the above terms in your text by issuing five separate search commands:

strfind(text, 'km/h');
strfind(text, 'km/hour');
% etc.

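To be more efficient, however, you can build a single phrase that applies to all of these search terms: the letter k, optionally followed by ilo; the letter m, optionally followed by eters; either a slash or the word per surrounded by white space; and the letter h, optionally followed by r or our.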

2 Program Components

Translate this phrase into a regular expression (to be explained later in this section) and you have:

pattern = 'k(ilo)?m(eters)?(/|\sper\s)h(r|our)?';

Now locate one or more of the terms using just a single command:

text = ['The high-speed train traveled at 250 ', ...
    'kilometers per hour alongside the automobile ', ...
    'travelling at 120 km/h.'];
regexp(text, pattern, 'match')

ans =

1×2 cell array

{'kilometers per hour'} {'km/h'}

Four MATLAB functions support searching and replacing characters using regular
expressions. The first three are similar in the input values they accept and the output values they
return. For details, see the function reference pages.

Function Description
regexp Match regular expression.
regexpi Match regular expression, ignoring case.
regexprep Replace part of text using regular expression.
regexptranslate Translate text into regular expression.

When calling any of the first three functions, pass the text to be parsed and the regular expression in
the first two input arguments. When calling regexprep, pass an additional input that is an
expression that specifies a pattern for the replacement.
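
The calls below are a minimal sketch of how the four functions differ. The sample text and variable names are assumptions made for illustration.

```matlab
txt = 'Speed limit: 120 KM/H';

% regexp is case sensitive, so the lowercase pattern finds nothing here.
exact = regexp(txt, 'km/h', 'match');

% regexpi ignores case and finds the uppercase unit.
anyCase = regexpi(txt, 'km/h', 'match');

% regexprep takes a third input: the replacement text.
longForm = regexprep(txt, 'KM/H', 'kilometers per hour');

% regexptranslate with the 'escape' option escapes special characters so
% that literal text such as '$5.00' can be used as a pattern.
literal = regexptranslate('escape', '$5.00');
price = regexp('Total: $5.00', literal, 'match');
```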

Steps for Building Expressions


There are three steps involved in using regular expressions to search text for a particular term:

1. Identify unique patterns in the string on page 2-53

   This entails breaking up the text you want to search for into groups of like character types. These
   character types could be a series of lowercase letters, a dollar sign followed by three numbers
   and then a decimal point, etc.

2. Express each pattern as a regular expression on page 2-53

   Use the metacharacters and operators described in this documentation to express each segment
   of your search pattern as a regular expression. Then combine these expression segments into the
   single expression to use in the search.

3. Call the appropriate search function on page 2-54

   Pass the text you want to parse to one of the search functions, such as regexp or regexpi, or to
   the text replacement function, regexprep.

The example shown in this section searches a record containing contact information belonging to a
group of five friends. This information includes each person's name, telephone number, place of
residence, and email address. The goal is to extract specific information from the text.

contacts = { ...
'Harry 287-625-7315 Columbus, OH hparker@hmail.com'; ...
'Janice 529-882-1759 Fresno, CA jan_stephens@horizon.net'; ...
'Mike 793-136-0975 Richmond, VA sue_and_mike@hmail.net'; ...
'Nadine 648-427-9947 Tampa, FL nadine_berry@horizon.net'; ...
'Jason 697-336-7728 Montrose, CO jason_blake@mymail.com'};

The first part of the example builds a regular expression that represents the format of a standard
email address. Using that expression, the example then searches the information for the email
address of one of the group of friends. Contact information for Janice is in row 2 of the contacts cell
array:
contacts{2}

ans =

'Janice 529-882-1759 Fresno, CA jan_stephens@horizon.net'

Step 1 — Identify Unique Patterns in the Text

A typical email address is made up of standard components: the user's account name, followed by an
@ sign, the name of the user's internet service provider (ISP), a dot (period), and the domain to which
the ISP belongs. The table below lists these components in the left column, and generalizes the
format of each component in the right column.

Unique pattern: Start with the account name (jan_stephens . . .)
General description: One or more lowercase letters and underscores

Unique pattern: Add '@' (jan_stephens@ . . .)
General description: @ sign

Unique pattern: Add the ISP (jan_stephens@horizon . . .)
General description: One or more lowercase letters, no underscores

Unique pattern: Add a dot (period) (jan_stephens@horizon. . . .)
General description: Dot (period) character

Unique pattern: Finish with the domain (jan_stephens@horizon.net)
General description: com or net

Step 2 — Express Each Pattern as a Regular Expression

In this step, you translate the general formats derived in Step 1 into segments of a regular
expression. You then add these segments together to form the entire expression.

The table below shows the generalized format descriptions of each character pattern in the left-most
column. (This was carried forward from the right column of the table in Step 1.) The second column
shows the operators or metacharacters that represent the character pattern.

Description of each segment Pattern


One or more lowercase letters and underscores [a-z_]+
@ sign @
One or more lowercase letters, no underscores [a-z]+
Dot (period) character \.
com or net (com|net)

Assembling these patterns into one character vector gives you the complete expression:

email = '[a-z_]+@[a-z]+\.(com|net)';

Step 3 — Call the Appropriate Search Function

In this step, you use the regular expression derived in Step 2 to match an email address for one of the
friends in the group. Use the regexp function to perform the search.

Here is the list of contact information shown earlier in this section. Each person's record occupies a
row of the contacts cell array:

contacts = { ...
'Harry 287-625-7315 Columbus, OH hparker@hmail.com'; ...
'Janice 529-882-1759 Fresno, CA jan_stephens@horizon.net'; ...
'Mike 793-136-0975 Richmond, VA sue_and_mike@hmail.net'; ...
'Nadine 648-427-9947 Tampa, FL nadine_berry@horizon.net'; ...
'Jason 697-336-7728 Montrose, CO jason_blake@mymail.com'};

This is the regular expression that represents an email address, as derived in Step 2:

email = '[a-z_]+@[a-z]+\.(com|net)';

Call the regexp function, passing row 2 of the contacts cell array and the email regular
expression. This returns the email address for Janice.

regexp(contacts{2}, email, 'match')

ans =

1×1 cell array

{'jan_stephens@horizon.net'}

MATLAB parses a character vector from left to right, “consuming” the vector as it goes. If matching
characters are found, regexp records the location and resumes parsing the character vector, starting
just after the end of the most recent match.

Make the same call, but this time for the fifth person in the list:

regexp(contacts{5}, email, 'match')

ans =

1×1 cell array

{'jason_blake@mymail.com'}

You can also search for the email address of everyone in the list by using the entire cell array for the
input argument:

regexp(contacts, email, 'match');
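
With a cell array input, the output is nested: one cell per row, each holding a cell array of matches. The sketch below shows one way to unpack it, using a shortened two-row copy of the contacts list. The variable names nested, firstAddress, and addresses are assumptions for illustration.

```matlab
contacts = { ...
    'Janice 529-882-1759 Fresno, CA jan_stephens@horizon.net'; ...
    'Mike 793-136-0975 Richmond, VA sue_and_mike@hmail.net'};
email = '[a-z_]+@[a-z]+\.(com|net)';

% Each output cell is itself a cell array of the matches found in that row.
nested = regexp(contacts, email, 'match');
firstAddress = nested{1}{1};

% The 'once' option returns only the first match per row, as a character vector.
addresses = regexp(contacts, email, 'match', 'once');
```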

Operators and Characters


Regular expressions can contain characters, metacharacters, operators, tokens, and flags that specify
patterns to match, as described in these sections:

• “Metacharacters” on page 2-55


• “Character Representation” on page 2-56
• “Quantifiers” on page 2-56
• “Grouping Operators” on page 2-57
• “Anchors” on page 2-58
• “Lookaround Assertions” on page 2-58
• “Logical and Conditional Operators” on page 2-59
• “Token Operators” on page 2-60
• “Dynamic Expressions” on page 2-60
• “Comments” on page 2-61
• “Search Flags” on page 2-61

Metacharacters

Metacharacters represent letters, letter ranges, digits, and space characters. Use them to construct a
generalized pattern of characters.

Metacharacter: .
Description: Any single character, including white space.
Example: '..ain' matches sequences of five consecutive characters that end with 'ain'.

Metacharacter: [c1c2c3]
Description: Any character contained within the square brackets. The following characters are treated literally: $ | . * + ? and - when not used to indicate a range.
Example: '[rp.]ain' matches 'rain' or 'pain' or '.ain'.

Metacharacter: [^c1c2c3]
Description: Any character not contained within the square brackets. The following characters are treated literally: $ | . * + ? and - when not used to indicate a range.
Example: '[^*rp]ain' matches all four-letter sequences that end in 'ain', except 'rain' and 'pain' and '*ain'. For example, it matches 'gain', 'lain', or 'vain'.

Metacharacter: [c1-c2]
Description: Any character in the range of c1 through c2.
Example: '[A-G]' matches a single character in the range of A through G.

Metacharacter: \w
Description: Any alphabetic, numeric, or underscore character. For English character sets, \w is equivalent to [a-zA-Z_0-9].
Example: '\w*' identifies a word comprised of any grouping of alphabetic, numeric, or underscore characters.

Metacharacter: \W
Description: Any character that is not alphabetic, numeric, or underscore. For English character sets, \W is equivalent to [^a-zA-Z_0-9].
Example: '\W*' identifies a term that is not a word comprised of any grouping of alphabetic, numeric, or underscore characters.

Metacharacter: \s
Description: Any white-space character; equivalent to [ \f\n\r\t\v].
Example: '\w*n\s' matches words that end with the letter n, followed by a white-space character.

Metacharacter: \S
Description: Any non-white-space character; equivalent to [^ \f\n\r\t\v].
Example: '\d\S' matches a numeric digit followed by any non-white-space character.

Metacharacter: \d
Description: Any numeric digit; equivalent to [0-9].
Example: '\d*' matches any number of consecutive digits.

Metacharacter: \D
Description: Any nondigit character; equivalent to [^0-9].
Example: '\w*\D\>' matches words that do not end with a numeric digit.

Metacharacter: \oN or \o{N}
Description: Character of octal value N.
Example: '\o{40}' matches the space character, defined by octal 40.

Metacharacter: \xN or \x{N}
Description: Character of hexadecimal value N.
Example: '\x2C' matches the comma character, defined by hex 2C.
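
A few of these metacharacters can be tried on a short sample string. The text and variable names below are illustrative assumptions; only the patterns come from the table.

```matlab
txt = 'Room 12B, floor 3';

digits = regexp(txt, '\d+', 'match');      % runs of numeric digits
words = regexp(txt, '\w+', 'match');       % runs of word characters
upperAG = regexp(txt, '[A-G]', 'match');   % single characters in the range A through G
comma = regexp(txt, '\x2C', 'match');      % the comma, given by its hexadecimal value
```

Here upperAG contains only 'B', because the R in Room falls outside the range A through G.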

Character Representation

Operator Description
\a Alarm (beep)
\b Backspace
\f Form feed
\n New line
\r Carriage return
\t Horizontal tab
\v Vertical tab
\char Any character with special meaning in regular expressions that you want to match literally
(for example, use \\ to match a single backslash)

Quantifiers

Quantifiers specify the number of times a pattern must occur in the matching text.

Quantifier: expr*
Number of times expression occurs: 0 or more times consecutively.
Example: '\w*' matches a word of any length.

Quantifier: expr?
Number of times expression occurs: 0 times or 1 time.
Example: '\w*(\.m)?' matches words that optionally end with the extension .m.

Quantifier: expr+
Number of times expression occurs: 1 or more times consecutively.
Example: '<img src="\w+\.gif">' matches an <img> HTML tag when the file name contains one or more characters.

Quantifier: expr{m,n}
Number of times expression occurs: At least m times, but no more than n times consecutively. {0,1} is equivalent to ?.
Example: '\S{4,8}' matches between four and eight non-white-space characters.

Quantifier: expr{m,}
Number of times expression occurs: At least m times consecutively. {0,} and {1,} are equivalent to * and +, respectively.
Example: '<a href="\w{1,}\.html">' matches an <a> HTML tag when the file name contains one or more characters.

Quantifier: expr{n}
Number of times expression occurs: Exactly n times consecutively. Equivalent to {n,n}.
Example: '\d{4}' matches four consecutive digits.
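
The following sketch applies three of these quantifiers to short sample strings. The strings and variable names are assumptions chosen for illustration.

```matlab
% {4} takes exactly four digits, so only the first four of 12345 are matched
% and the three-digit number is skipped.
fourDigits = regexp('ID 12345 and 678', '\d{4}', 'match');

% {2,} requires at least two consecutive occurrences.
runs = regexp('a aa aaa', 'a{2,}', 'match');

% ? makes the preceding character optional.
spellings = regexp('color colour', 'colou?r', 'match');
```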

Quantifiers can appear in three modes, described in the following table. q represents any of the
quantifiers in the previous table.

Mode: exprq
Description: Greedy expression: match as many characters as possible.
Example: Given the text '<tr><td><p>text</p></td>', the expression '</?t.*>' matches all characters between <tr and /td>: '<tr><td><p>text</p></td>'

Mode: exprq?
Description: Lazy expression: match as few characters as necessary.
Example: Given the text '<tr><td><p>text</p></td>', the expression '</?t.*?>' ends each match at the first occurrence of the closing angle bracket (>): '<tr>' '<td>' '</td>'

Mode: exprq+
Description: Possessive expression: match as much as possible, but do not rescan any portions of the text.
Example: Given the text '<tr><td><p>text</p></td>', the expression '</?t.*+>' does not return any matches, because the closing angle bracket is captured using .*, and is not rescanned.
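
The three modes can be compared side by side on the text used in the examples above. This is a minimal sketch; the variable names are assumptions.

```matlab
html = '<tr><td><p>text</p></td>';

% Greedy: .* runs to the end of the text, then backs up to the last '>'.
greedy = regexp(html, '</?t.*>', 'match');

% Lazy: .*? stops at the first '>' after each tag that starts with t.
lazy = regexp(html, '</?t.*?>', 'match');

% Possessive: .*+ keeps everything it consumed, so the final '>' never matches.
possessive = regexp(html, '</?t.*+>', 'match');
```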

Grouping Operators

Grouping operators allow you to capture tokens, apply one operator to multiple elements, or disable
backtracking in a specific group.

Grouping operator: (expr)
Description: Group elements of the expression and capture tokens.
Example: 'Joh?n\s(\w*)' captures a token that contains the last name of any person with the first name John or Jon.

Grouping operator: (?:expr)
Description: Group, but do not capture tokens.
Example: '(?:[aeiou][^aeiou]){2}' matches two consecutive patterns of a vowel followed by a nonvowel, such as 'anon'. Without grouping, '[aeiou][^aeiou]{2}' matches a vowel followed by two nonvowels.

Grouping operator: (?>expr)
Description: Group atomically. Do not backtrack within the group to complete the match, and do not capture tokens.
Example: 'A(?>.*)Z' does not match 'AtoZ', although 'A(?:.*)Z' does. Using the atomic group, Z is captured using .* and is not rescanned.

Grouping operator: (expr1|expr2)
Description: Match expression expr1 or expression expr2. If there is a match with expr1, then expr2 is ignored. You can include ?: or ?> after the opening parenthesis to suppress tokens or group atomically.
Example: '(let|tel)\w+' matches words that start with let or tel.
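
The grouping examples from the table can be run as follows. This is a sketch; the input 'John Smith' and the variable names are assumptions.

```matlab
% Capturing group: the token holds the text matched inside the parentheses.
tok = regexp('John Smith', 'Joh?n\s(\w*)', 'tokens');
lastName = tok{1}{1};

% Noncapturing group: the quantifier {2} applies to the whole vowel-nonvowel pair.
pairs = regexp('anon', '(?:[aeiou][^aeiou]){2}', 'match');

% Atomic group: .* consumes the final Z and does not give it back.
atomic = regexp('AtoZ', 'A(?>.*)Z', 'match');
regular = regexp('AtoZ', 'A(?:.*)Z', 'match');
```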

Anchors

Anchors in the expression match the beginning or end of a character vector or word.

Anchor: ^expr
Matches the: Beginning of the input text.
Example: '^M\w*' matches a word starting with M at the beginning of the text.

Anchor: expr$
Matches the: End of the input text.
Example: '\w*m$' matches words ending with m at the end of the text.

Anchor: \<expr
Matches the: Beginning of a word.
Example: '\<n\w*' matches any words starting with n.

Anchor: expr\>
Matches the: End of a word.
Example: '\w*e\>' matches any words ending with e.
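
As an illustration, the sketch below applies each anchor to one sample sentence. The sentence and variable names are assumptions; the patterns follow the forms in the table.

```matlab
txt = 'Many men mow meadows';

atStart = regexp(txt, '^M\w*', 'match');     % word at the beginning of the text
atEnd = regexp(txt, '\w*s$', 'match');       % word ending with s at the end of the text
startsM = regexp(txt, '\<m\w*', 'match');    % words that start with lowercase m
endsN = regexp(txt, '\w*n\>', 'match');      % words that end with n
```

Because matching is case sensitive by default, startsM does not include 'Many'.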

Lookaround Assertions

Lookaround assertions look for patterns that immediately precede or follow the intended match, but
are not part of the match.

The pointer remains at the current location, and characters that correspond to the test expression
are not captured or discarded. Therefore, lookahead assertions can match overlapping character
groups.

Lookaround assertion: expr(?=test)
Description: Look ahead for characters that match test.
Example: '\w*(?=ing)' matches terms that are followed by ing, such as 'Fly' and 'fall' in the input text 'Flying, not falling.'

Lookaround assertion: expr(?!test)
Description: Look ahead for characters that do not match test.
Example: 'i(?!ng)' matches instances of the letter i that are not followed by ng.

Lookaround assertion: (?<=test)expr
Description: Look behind for characters that match test.
Example: '(?<=re)\w*' matches terms that follow 're', such as 'new', 'use', and 'cycle' in the input text 'renew, reuse, recycle'

Lookaround assertion: (?<!test)expr
Description: Look behind for characters that do not match test.
Example: '(?<!\d)(\d)(?!\d)' matches single-digit numbers (digits that do not precede or follow other digits).

If you specify a lookahead assertion before an expression, the operation is equivalent to a logical AND.

Operation: (?=test)expr
Description: Match both test and expr.
Example: '(?=[a-z])[^aeiou]' matches consonants.

Operation: (?!test)expr
Description: Match expr and do not match test.
Example: '(?![aeiou])[a-z]' matches consonants.

For more information, see “Lookahead Assertions in Regular Expressions” on page 2-63.
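
The lookaround examples above can be checked with a short sketch. The first two inputs come from the table; the third input and all variable names are assumptions for illustration.

```matlab
% Lookahead: text that is followed by ing (the ing itself is not part of the match).
stems = regexp('Flying, not falling.', '\w*(?=ing)', 'match');

% Lookbehind: text that follows re (the re itself is not part of the match).
afterRe = regexp('renew, reuse, recycle', '(?<=re)\w*', 'match');

% Negative lookbehind and lookahead together: digits with no digit on either side.
singles = regexp('7 apples, 12 pears, 3 figs', '(?<!\d)\d(?!\d)', 'match');
```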

Logical and Conditional Operators

Logical and conditional operators allow you to test the state of a given condition, and then use the
outcome to determine which pattern, if any, to match next. These operators support logical OR and if
or if/else conditions. (For AND conditions, see “Lookaround Assertions” on page 2-58.)

Conditions can be tokens on page 2-60, lookaround assertions on page 2-58, or dynamic expressions
on page 2-60 of the form (?@cmd). Dynamic expressions must return a logical or numeric value.

Conditional operator: expr1|expr2
Description: Match expression expr1 or expression expr2. If there is a match with expr1, then expr2 is ignored.
Example: '(let|tel)\w+' matches words that start with let or tel.

Conditional operator: (?(cond)expr)
Description: If condition cond is true, then match expr.
Example: '(?(?@ispc)[A-Z]:\\)' matches a drive name, such as C:\, when run on a Windows system.

Conditional operator: (?(cond)expr1|expr2)
Description: If condition cond is true, then match expr1. Otherwise, match expr2.
Example: 'Mr(s?)\..*?(?(1)her|his) \w*' matches text that includes her when the text begins with Mrs, or that includes his when the text begins with Mr.

Token Operators

Tokens are portions of the matched text that you define by enclosing part of the regular expression in
parentheses. You can refer to a token by its sequence in the text (an ordinal token), or assign names
to tokens for easier code maintenance and readable output.

Ordinal token operator: (expr)
Description: Capture in a token the characters that match the enclosed expression.
Example: 'Joh?n\s(\w*)' captures a token that contains the last name of any person with the first name John or Jon.

Ordinal token operator: \N
Description: Match the Nth token.
Example: '<(\w+).*>.*</\1>' captures tokens for HTML tags, such as 'title' from the text '<title>Some text</title>'.

Ordinal token operator: (?(N)expr1|expr2)
Description: If the Nth token is found, then match expr1. Otherwise, match expr2.
Example: 'Mr(s?)\..*?(?(1)her|his) \w*' matches text that includes her when the text begins with Mrs, or that includes his when the text begins with Mr.

Named token operator: (?<name>expr)
Description: Capture in a named token the characters that match the enclosed expression.
Example: '(?<month>\d+)-(?<day>\d+)-(?<yr>\d+)' creates named tokens for the month, day, and year in an input date of the form mm-dd-yy.

Named token operator: \k<name>
Description: Match the token referred to by name.
Example: '<(?<tag>\w+).*>.*</\k<tag>>' captures tokens for HTML tags, such as 'title' from the text '<title>Some text</title>'.

Named token operator: (?(name)expr1|expr2)
Description: If the named token is found, then match expr1. Otherwise, match expr2.
Example: 'Mr(?<sex>s?)\..*?(?(sex)her|his) \w*' matches text that includes her when the text begins with Mrs, or that includes his when the text begins with Mr.

Note: If an expression has nested parentheses, MATLAB captures tokens that correspond to the
outermost set of parentheses. For example, given the search pattern '(and(y|rew))', MATLAB
creates a token for 'andrew' but not for 'y' or 'rew'.

For more information, see “Tokens in Regular Expressions” on page 2-66.
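
The sketch below runs the ordinal and named token examples from the tables. The input date and the variable names are assumptions made for illustration.

```matlab
html = '<title>Some text</title>';

% Ordinal token: \1 must repeat whatever the first parentheses captured.
tok = regexp(html, '<(\w+).*>.*</\1>', 'tokens');
tagName = tok{1}{1};

% Named token referenced inside the expression with \k<tag>.
named = regexp(html, '<(?<tag>\w+).*>.*</\k<tag>>', 'names');

% Named tokens returned as fields of a structure.
d = regexp('Due 01-11-2000', '(?<month>\d+)-(?<day>\d+)-(?<yr>\d+)', 'names');
```

The 'names' option returns a structure, so the captured text is available as named.tag, d.month, d.day, and d.yr.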

Dynamic Expressions

Dynamic expressions allow you to execute a MATLAB command or a regular expression to determine
the text to match.

The parentheses that enclose dynamic expressions do not create a capturing group.

Operator: (??expr)
Description: Parse expr and include the resulting term in the match expression. When parsed, expr must correspond to a complete, valid regular expression. Dynamic expressions that use the backslash escape character (\) require two backslashes: one for the initial parsing of expr, and one for the complete match.
Example: '^(\d+)((??\\w{$1}))' determines how many characters to match by reading a digit at the beginning of the match. The dynamic expression is enclosed in a second set of parentheses so that the resulting match is captured in a token. For instance, matching '5XXXXX' captures tokens for '5' and 'XXXXX'.

Operator: (??@cmd)
Description: Execute the MATLAB command represented by cmd, and include the output returned by the command in the match expression.
Example: '(.{2,}).?(??@fliplr($1))' finds palindromes that are at least four characters long, such as 'abba'.

Operator: (?@cmd)
Description: Execute the MATLAB command represented by cmd, but discard any output the command returns. (Helpful for diagnosing regular expressions.)
Example: '\w*?(\w)(?@disp($1))\1\w*' matches words that include double letters (such as pp), and displays intermediate results.

Within dynamic expressions, use the following operators to define replacement terms.

Replacement Operator Description


$& or $0 Portion of the input text that is currently a match
$` Portion of the input text that precedes the current match
$' Portion of the input text that follows the current match (use $'' to represent $')
$N Nth token
$<name> Named token
${cmd} Output returned when MATLAB executes the command, cmd

For more information, see “Dynamic Regular Expressions” on page 2-72.
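
The first two dynamic expression examples from the table can be run as a short sketch. The inputs '5XXXXX' and 'abba' come from the table; the variable names are assumptions.

```matlab
% (??expr): the leading digit (token 1) sets how many word characters to match.
tok = regexp('5XXXXX', '^(\d+)((??\\w{$1}))', 'tokens');

% (??@cmd): the output of fliplr on token 1 becomes part of the expression.
palindrome = regexp('abba', '(.{2,}).?(??@fliplr($1))', 'match');
```

In the first call, $1 is replaced by '5' before the inner expression is parsed, so the expression that is matched is \w{5}.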

Comments

The comment operator enables you to insert comments into your code to make it more maintainable.
The text of the comment is ignored by MATLAB when matching against the input text.

Characters: (?#comment)
Description: Insert a comment in the regular expression. The comment text is ignored when matching the input.
Example: '(?# Initial digit)\<\d\w+' includes a comment, and matches words that begin with a number.

Search Flags

Search flags modify the behavior for matching expressions.

Flag Description
(?-i) Match letter case (default for regexp and regexprep).
(?i) Do not match letter case (default for regexpi).

(?s) Match dot (.) in the pattern with any character (default).
(?-s) Match dot in the pattern with any character that is not a newline character.
(?-m) Match the ^ and $ metacharacters at the beginning and end of text (default).
(?m) Match the ^ and $ metacharacters at the beginning and end of a line.
(?-x) Include space characters and comments when matching (default).
(?x) Ignore space characters and comments when matching. Use '\ ' and '\#' to
match space and # characters.

The expression that the flag modifies can appear either after the parentheses, such as

(?i)\w*

or inside the parentheses and separated from the flag with a colon (:), such as

(?i:\w*)

The latter syntax allows you to change the behavior for part of a larger expression.
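
The two flag syntaxes, and the effect of the (?m) flag, are sketched below. The sample text and variable names are assumptions.

```matlab
txt = 'MATLAB matlab';

% Flag before the expression: the whole pattern ignores case.
both = regexp(txt, '(?i)matlab', 'match');

% Flag inside the parentheses: only 'lab' ignores case, 'MAT' must match exactly.
partial = regexp(txt, 'MAT(?i:lab)', 'match');

% By default ^ and $ anchor to the whole text; (?m) anchors them to each line.
twoLines = sprintf('one\ntwo');
wholeText = regexp(twoLines, '^\w+$', 'match');
perLine = regexp(twoLines, '(?m)^\w+$', 'match');
```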

See Also
pattern | regexp | regexpi | regexprep | regexptranslate

More About
• “Lookahead Assertions in Regular Expressions” on page 2-63
• “Tokens in Regular Expressions” on page 2-66
• “Dynamic Regular Expressions” on page 2-72
• “Search and Replace Text” on page 6-37

Lookahead Assertions in Regular Expressions


In this section...
“Lookahead Assertions” on page 2-63
“Overlapping Matches” on page 2-63
“Logical AND Conditions” on page 2-64

Lookahead Assertions
There are two types of lookaround assertions for regular expressions: lookahead and lookbehind. In
both cases, the assertion is a condition that must be satisfied to return a match to the expression.

A lookahead assertion has the form (?=test) and can appear anywhere in a regular expression.
MATLAB looks ahead of the current location in the text for the test condition. If MATLAB matches the
test condition, it continues processing the rest of the expression to find a match.

For example, look ahead in a character vector specifying a path to find the name of the folder that
contains a program file (in this case, fileread.m).

chr = which('fileread')

chr =

'matlabroot\toolbox\matlab\iofun\fileread.m'

regexp(chr,'\w+(?=\\\w+\.[mp])','match')

ans =

1×1 cell array

{'iofun'}

The match expression, \w+, searches for one or more alphanumeric or underscore characters. Each
time regexp finds a term that matches this condition, it looks ahead for a backslash (specified with
two backslashes, \\), followed by a file name (\w+) with an .m or .p extension (\.[mp]). The
regexp function returns the match that satisfies the lookahead condition, which is the folder name
iofun.
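
Because the output of which depends on the installation, the same lookahead can be tried on a fixed path. The path below is a hypothetical Windows-style example, and the variable names are assumptions.

```matlab
% Hypothetical path, used only so that the result does not depend on the machine.
p = 'C:\toolbox\matlab\iofun\fileread.m';

% Match a name only when it is followed by a backslash, a file name, and .m or .p.
folder = regexp(p, '\w+(?=\\\w+\.[mp])', 'match');
```

The names toolbox and matlab are rejected because the text that follows them is another folder rather than a file name with an extension.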

Overlapping Matches
Lookahead assertions do not consume any characters in the text. As a result, you can use them to find
overlapping character sequences.

For example, use lookahead to find every sequence of six nonwhitespace characters in a character
vector by matching initial characters that precede five additional characters:

chr = 'Locate several 6-char. phrases';
startIndex = regexpi(chr,'\S(?=\S{5})')

startIndex =

1 8 9 16 17 24 25

The starting indices correspond to these phrases:


Locate severa everal 6-char -char. phrase hrases

Without the lookahead operator, MATLAB parses a character vector from left to right, consuming the
vector as it goes. If matching characters are found, regexp records the location and resumes parsing
just after the end of the most recent match, so the matches never overlap.

chr = 'Locate several 6-char. phrases';
startIndex = regexpi(chr,'\S{6}')

startIndex =

1 8 16 24

The starting indices correspond to these phrases:


Locate severa 6-char phrase
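
To see the overlapping phrases themselves rather than only their starting indices, one option is to slice the text at each index. This is a sketch, and the variable name phrases is an assumption.

```matlab
chr = 'Locate several 6-char. phrases';

% Starting index of every nonwhitespace character followed by five more.
startIndex = regexp(chr, '\S(?=\S{5})');

% Extract the six characters that begin at each starting index.
phrases = arrayfun(@(k) chr(k:k+5), startIndex, 'UniformOutput', false);
disp(phrases)
```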

Logical AND Conditions


Another way to use a lookahead operation is to perform a logical AND between two conditions. This
example initially attempts to locate all lowercase consonants in a character array consisting of the
first 50 characters of the help for the normest function:
helptext = help('normest');
chr = helptext(1:50)

chr =

' NORMEST Estimate the matrix 2-norm.
NORMEST(S'

Merely searching for non-vowels ([^aeiou]) does not return the expected answer, as the output
includes capital letters, space characters, and punctuation:
c = regexp(chr,'[^aeiou]','match')

c =

1×43 cell array

Columns 1 through 14

{' '} {'N'} {'O'} {'R'} {'M'} {'E'} {'S'} {'T'} {' '} {'E'} {'s

Columns 15 through 28

{' '} {'t'} {'h'} {' '} {'m'} {'t'} {'r'} {'x'} {' '} {'2'} {'-

Columns 29 through 42

{'.'} {'↵'} {' '} {' '} {' '} {' '} {'N'} {'O'} {'R'} {'M'} {'E

Column 43

{'S'}

Try this again, using a lookahead operator to create the following AND condition:

(lowercase letter) AND (not a vowel)

This time, the result is correct:

c = regexp(chr,'(?=[a-z])[^aeiou]','match')

c =

1×13 cell array

{'s'} {'t'} {'m'} {'t'} {'t'} {'h'} {'m'} {'t'} {'r'} {'x'} {'n

Note that when using a lookahead operator to perform an AND, you need to place the match
expression expr after the test expression test:

(?=test)expr or (?!test)expr
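
Both forms can be compared on a fixed string, so that the result does not depend on the installed help text. The sample string is taken from the first line of the output above; the variable names are assumptions.

```matlab
chr = 'NORMEST Estimate the matrix 2-norm.';

% (lowercase letter) AND (not a vowel)
viaPositive = regexp(chr, '(?=[a-z])[^aeiou]', 'match');

% (not a vowel) AND (lowercase letter), written with a negative lookahead
viaNegative = regexp(chr, '(?![aeiou])[a-z]', 'match');
```

Both calls return the same 13 lowercase consonants.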

See Also
regexp | regexpi | regexprep

More About
• “Regular Expressions” on page 2-51

Tokens in Regular Expressions


In this section...
“Introduction” on page 2-66
“Multiple Tokens” on page 2-68
“Unmatched Tokens” on page 2-69
“Tokens in Replacement Text” on page 2-69
“Named Capture” on page 2-70

Introduction
Parentheses used in a regular expression not only group elements of that expression together, but
also designate any matches found for that group as tokens. You can use tokens to match other parts
of the same text. One advantage of using tokens is that they remember what they matched, so you
can recall and reuse matched text in the process of searching or replacing.

Each token in the expression is assigned a number, starting from 1, going from left to right. To make
a reference to a token later in the expression, refer to it using a backslash followed by the token
number. For example, when referencing a token generated by the third set of parentheses in the
expression, use \3.

As a simple example, if you wanted to search for identical sequential letters in a character array, you
could capture the first letter as a token and then search for a matching character immediately
afterwards. In the expression shown below, the (\S) phrase creates a token whenever regexp
matches any nonwhitespace character in the character array. The second part of the expression,
'\1', looks for a second instance of the same character immediately following the first.
poe = ['While I nodded, nearly napping, ' ...
'suddenly there came a tapping,'];

[mat,tok,ext] = regexp(poe, '(\S)\1', 'match', ...
    'tokens', 'tokenExtents');
mat

mat =

1×4 cell array

{'dd'} {'pp'} {'dd'} {'pp'}

The cell array tok contains cell arrays that each contain a token.
tok{:}

ans =

1×1 cell array

{'d'}

ans =

1×1 cell array

{'p'}

ans =

1×1 cell array

{'d'}

ans =

1×1 cell array

{'p'}

The cell array ext contains numeric arrays that each contain starting and ending indices for a token.

ext{:}

ans =

11 11

ans =

26 26

ans =

35 35

ans =

57 57

For another example, capture pairs of matching HTML tags (e.g., <a> and </a>) and the text
between them. The expression used for this example is

expr = '<(\w+).*?>.*?</\1>';

The first part of the expression, '<(\w+)', matches an opening angle bracket (<) followed by one or
more alphabetic, numeric, or underscore characters. The enclosing parentheses capture token
characters following the opening angle bracket.

The second part of the expression, '.*?>.*?', matches the remainder of this HTML tag (characters
up to the >), and any characters that may precede the next opening angle bracket.

The last part, '</\1>', matches all characters in the ending HTML tag. This tag is composed of the
sequence </tag>, where tag is whatever characters were captured as a token.

hstr = '<!comment><a name="752507"></a><b>Default</b><br>';
expr = '<(\w+).*?>.*?</\1>';

[mat,tok] = regexp(hstr, expr, 'match', 'tokens');
mat{:}

ans =

'<a name="752507"></a>'

ans =

'<b>Default</b>'

tok{:}

ans =

1×1 cell array

{'a'}

ans =

1×1 cell array

{'b'}

Multiple Tokens
Here is an example of how tokens are assigned values. Suppose that you are going to search the
following text:

andy ted bob jim andrew andy ted mark

You choose to search the above text with the following search pattern:

and(y|rew)|(t)e(d)

This pattern has three parenthetical expressions that generate tokens. When you finally perform the
search, the following tokens are generated for each match.

Match     Token 1   Token 2
andy      y
ted       t         d
andrew    rew
andy      y
ted       t         d

Only the highest level parentheses are used. For example, if the search pattern and(y|rew) finds the
text andrew, token 1 is assigned the value rew. However, if the search pattern (and(y|rew)) is
used, token 1 is assigned the value andrew.
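
The search described above can be run as follows. This is a minimal sketch; the loop simply prints each match next to its tokens for comparison with the table, and the variable names are assumptions.

```matlab
txt = 'andy ted bob jim andrew andy ted mark';
[mat, tok] = regexp(txt, 'and(y|rew)|(t)e(d)', 'match', 'tokens');

% Print each match followed by the tokens captured for it.
for k = 1:numel(mat)
    fprintf('%-8s', mat{k});
    fprintf('%-5s', tok{k}{:});
    fprintf('\n');
end
```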

Unmatched Tokens
For those tokens specified in the regular expression that have no match in the text being evaluated,
regexp and regexpi return an empty character vector ('') as the token output, and an extent that
marks the position in the string where the token was expected.

The example shown here executes regexp on a character vector specifying the path returned from
the MATLAB tempdir function. The regular expression expr includes six token specifiers, one for
each piece of the path. The third specifier [a-z]+ has no match in the character vector because this
part of the path, Profiles, begins with an uppercase letter:
chr = tempdir

chr =

'C:\WINNT\Profiles\bpascal\LOCALS~1\Temp\'

expr = ['([A-Z]:)\\(WINNT)\\([a-z]+)?.*\\' ...
    '([a-z]+)\\([A-Z]+~\d)\\(Temp)\\'];

[tok, ext] = regexp(chr, expr, 'tokens', 'tokenExtents');

When a token is not found in the text, regexp returns an empty character vector ('') as the token
and a numeric array with the token extent. The first number of the extent is the string index that
marks where the token was expected, and the second number of the extent is equal to one less than
the first.

In the case of this example, the empty token is the third specified in the expression, so the third token
returned is empty:
tok{:}

ans =

1×6 cell array

{'C:'} {'WINNT'} {0×0 char} {'bpascal'} {'LOCALS~1'} {'Temp'}

The third token extent returned in the variable ext has the starting index set to 10, which is where
the nonmatching term, Profiles, begins in the path. The ending extent index is set to one less than
the starting index, or 9:
ext{:}

ans =

1 2
4 8
10 9
19 25
27 34
36 39

Tokens in Replacement Text


When using tokens in replacement text, reference them using $1, $2, etc. instead of \1, \2, etc. This
example captures two tokens and reverses their order. The first, $1, is 'Norma Jean' and the
second, $2, is 'Baker'. Note that regexprep returns the modified text, not a vector of starting
indices.

regexprep('Norma Jean Baker', '(\w+\s\w+)\s(\w+)', '$2, $1')

ans =

'Baker, Norma Jean'
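
The same technique works with more than two tokens. The sketch below reorders the three parts of a date; the input date and variable names are assumptions.

```matlab
isoDate = '2024-03-15';

% $1 is the year, $2 the month, and $3 the day, numbered left to right.
usDate = regexprep(isoDate, '(\d+)-(\d+)-(\d+)', '$2/$3/$1');
disp(usDate)
```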

Named Capture
If you use a lot of tokens in your expressions, it may be helpful to assign them names rather than
having to keep track of which token number is assigned to which token.

When referencing a named token within the expression, use the syntax \k<name> instead of the
numeric \1, \2, etc.:
poe = ['While I nodded, nearly napping, ' ...
'suddenly there came a tapping,'];

regexp(poe, '(?<anychar>.)\k<anychar>', 'match')

ans =

1×4 cell array

{'dd'} {'pp'} {'dd'} {'pp'}

Named tokens can also be useful in labeling the output from the MATLAB regular expression
functions. This is especially true when you are processing many pieces of text.

For example, parse different parts of street addresses from several character vectors. A short name is
assigned to each token in the expression:
chr1 = '134 Main Street, Boulder, CO, 14923';
chr2 = '26 Walnut Road, Topeka, KA, 25384';
chr3 = '847 Industrial Drive, Elizabeth, NJ, 73548';

p1 = '(?<adrs>\d+\s\S+\s(Road|Street|Avenue|Drive))';
p2 = '(?<city>[A-Z][a-z]+)';
p3 = '(?<state>[A-Z]{2})';
p4 = '(?<zip>\d{5})';

expr = [p1 ', ' p2 ', ' p3 ', ' p4];

As the following results demonstrate, you can make your output easier to work with by using named
tokens:
loc1 = regexp(chr1, expr, 'names')

loc1 =

struct with fields:

adrs: '134 Main Street'
city: 'Boulder'
state: 'CO'
zip: '14923'
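
The same expression applies to the other two addresses. The loop below is one possible sketch; the variable names addresses, loc, and cities are assumptions, and the patterns are repeated so that the block runs on its own.

```matlab
addresses = { ...
    '134 Main Street, Boulder, CO, 14923'; ...
    '26 Walnut Road, Topeka, KA, 25384'; ...
    '847 Industrial Drive, Elizabeth, NJ, 73548'};

p1 = '(?<adrs>\d+\s\S+\s(Road|Street|Avenue|Drive))';
p2 = '(?<city>[A-Z][a-z]+)';
p3 = '(?<state>[A-Z]{2})';
p4 = '(?<zip>\d{5})';
expr = [p1 ', ' p2 ', ' p3 ', ' p4];

% Parse each address and collect the city names by field name.
cities = cell(size(addresses));
for k = 1:numel(addresses)
    loc = regexp(addresses{k}, expr, 'names');
    cities{k} = loc.city;
    fprintf('%s: %s, %s %s\n', loc.adrs, loc.city, loc.state, loc.zip);
end
```

Accessing fields by name (loc.city, loc.zip) avoids having to remember which token number holds which part of the address.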
