Regular Expressions
In this section...
“What Is a Regular Expression?” on page 2-51
“Steps for Building Expressions” on page 2-52
“Operators and Characters” on page 2-55
This topic describes what regular expressions are and how to use them to search text. Regular
expressions are flexible and powerful, though they use complex syntax. An alternative to regular
expressions is a pattern (since R2020b), which is simpler to define and results in code that is easier
to read. For more information, see “Build Pattern Expressions” on page 6-40.
What Is a Regular Expression?
The character vector 'Joh?n\w*' is an example of a regular expression. It defines a pattern that
starts with the letters Jo, is optionally followed by the letter h (indicated by 'h?'), is then followed
by the letter n, and ends with any number of word characters, that is, characters that are alphabetic,
numeric, or underscore (indicated by '\w*'). This pattern matches names such as Jon, John,
Jonathan, and Johnny.
Regular expressions provide a unique way to search a volume of text for a particular subset of
characters within that text. Instead of looking for an exact character match as you would do with a
function like strfind, regular expressions give you the ability to look for a particular pattern of
characters.
For example, a rate of speed in metric units can be written in several ways:
km/h
km/hr
km/hour
kilometers/hour
kilometers per hour
You could locate any of the above terms in your text by issuing five separate search commands:
strfind(text, 'km/h');
strfind(text, 'km/hour');
% etc.
To be more efficient, however, you can build a single phrase that applies to all of these search terms:
2 Program Components
Translate this phrase into a regular expression (to be explained later in this section) and you have:
pattern = 'k(ilo)?m(eters)?(/|\sper\s)h(r|our)?';
Now locate one or more of the terms using just a single command:
ans =
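As a minimal sketch of that single command, assuming a hypothetical sample sentence (the variable txt and its contents are illustrative and not taken from the original), the 'match' option asks regexp to return the matched text rather than the starting indices:

```matlab
% Hypothetical sample text containing two of the speed notations.
txt = 'The train ran at 250 kilometers per hour; the car did 120 km/h.';
pattern = 'k(ilo)?m(eters)?(/|\sper\s)h(r|our)?';

% 'match' returns the text of every match as a cell array.
found = regexp(txt, pattern, 'match');
disp(found)
```

Both notations are found with one call, where strfind would need one call per spelling.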
There are four MATLAB functions that support searching and replacing characters using regular
expressions. The first three are similar in the input values they accept and the output values they
return. For details, click the links to the function reference pages.
Function Description
regexp Match regular expression.
regexpi Match regular expression, ignoring case.
regexprep Replace part of text using regular expression.
regexptranslate Translate text into regular expression.
When calling any of the first three functions, pass the text to be parsed and the regular expression in
the first two input arguments. When calling regexprep, pass an additional input that is an
expression that specifies a pattern for the replacement.
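The argument order described above can be sketched as follows. The input text and variable names are hypothetical, chosen only to illustrate the call shapes of the four functions:

```matlab
% Hypothetical input text for illustration.
str = 'Total: 42 items (approx.)';

% regexp and regexpi take the text first, then the expression.
digitsFound = regexp(str, '\d+', 'match');      % case-sensitive search
wordFound   = regexpi(str, 'TOTAL', 'match');   % ignores letter case

% regexprep takes a third input: the replacement.
masked = regexprep(str, '\d+', '#');

% regexptranslate('escape', ...) escapes characters that have special
% meaning, so the result can be used to search for the text literally.
literal = regexptranslate('escape', '(approx.)');
literalFound = regexp(str, literal, 'match');
```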
Steps for Building Expressions
There are three steps involved in using regular expressions to search text for a particular term:
1 Identify unique patterns in the text
This entails breaking up the text you want to search for into groups of like character types. These
character types could be a series of lowercase letters, a dollar sign followed by three numbers
and then a decimal point, etc.
2 Express each pattern as a regular expression on page 2-53
Use the metacharacters and operators described in this documentation to express each segment
of your search pattern as a regular expression. Then combine these expression segments into the
single expression to use in the search.
3 Call the appropriate search function on page 2-54
Pass the text you want to parse to one of the search functions, such as regexp or regexpi, or to
the text replacement function, regexprep.
The example shown in this section searches a record containing contact information belonging to a
group of five friends. This information includes each person's name, telephone number, place of
residence, and email address. The goal is to extract specific information from the text.
contacts = { ...
'Harry 287-625-7315 Columbus, OH hparker@hmail.com'; ...
'Janice 529-882-1759 Fresno, CA jan_stephens@horizon.net'; ...
'Mike 793-136-0975 Richmond, VA sue_and_mike@hmail.net'; ...
'Nadine 648-427-9947 Tampa, FL nadine_berry@horizon.net'; ...
'Jason 697-336-7728 Montrose, CO jason_blake@mymail.com'};
The first part of the example builds a regular expression that represents the format of a standard
email address. Using that expression, the example then searches the information for the email
address of one of the group of friends. Contact information for Janice is in row 2 of the contacts cell
array:
contacts{2}
ans =
'Janice 529-882-1759 Fresno, CA jan_stephens@horizon.net'
Step 1: Identify Unique Patterns in the Text
A typical email address is made up of standard components: the user's account name, followed by an
@ sign, the name of the user's internet service provider (ISP), a dot (period), and the domain to which
the ISP belongs. The table below lists these components in the left column, and generalizes the
format of each component in the right column.
Step 2: Express Each Pattern as a Regular Expression
In this step, you translate the general formats derived in Step 1 into segments of a regular
expression. You then add these segments together to form the entire expression.
The table below shows the generalized format descriptions of each character pattern in the left-most
column. (This was carried forward from the right column of the table in Step 1.) The second column
shows the operators or metacharacters that represent the character pattern.
Assembling these patterns into one character vector gives you the complete expression:
email = '[a-z_]+@[a-z]+\.(com|net)';
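One way to picture the assembly is to hold each segment in its own variable and concatenate them. This is only a sketch: the segment variable names are hypothetical, and the per-segment descriptions are read off the final expression rather than quoted from the original table:

```matlab
% Each segment is one piece of the final expression.
account = '[a-z_]+';      % one or more lowercase letters or underscores
atSign  = '@';            % literal @ sign
isp     = '[a-z]+';       % one or more lowercase letters
dot     = '\.';           % literal dot (escaped, since . is a metacharacter)
domain  = '(com|net)';    % either com or net

% Concatenate the segments into a single character vector.
email = [account atSign isp dot domain];
```

Keeping the segments separate makes it easier to adjust one component, for example to allow more domains, without retyping the whole expression.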
Step 3: Call the Appropriate Search Function
In this step, you use the regular expression derived in Step 2 to match an email address for one of the
friends in the group. Use the regexp function to perform the search.
Here is the list of contact information shown earlier in this section. Each person's record occupies a
row of the contacts cell array:
contacts = { ...
'Harry 287-625-7315 Columbus, OH hparker@hmail.com'; ...
'Janice 529-882-1759 Fresno, CA jan_stephens@horizon.net'; ...
'Mike 793-136-0975 Richmond, VA sue_and_mike@hmail.net'; ...
'Nadine 648-427-9947 Tampa, FL nadine_berry@horizon.net'; ...
'Jason 697-336-7728 Montrose, CO jason_blake@mymail.com'};
This is the regular expression that represents an email address, as derived in Step 2:
email = '[a-z_]+@[a-z]+\.(com|net)';
Call the regexp function, passing row 2 of the contacts cell array and the email regular
expression. This returns the email address for Janice.
regexp(contacts{2}, email, 'match')
ans =
{'jan_stephens@horizon.net'}
MATLAB parses a character vector from left to right, “consuming” the vector as it goes. If matching
characters are found, regexp records the location and resumes parsing the character vector, starting
just after the end of the most recent match.
Make the same call, but this time for the fifth person in the list:
regexp(contacts{5}, email, 'match')
ans =
{'jason_blake@mymail.com'}
You can also search for the email address of everyone in the list by using the entire cell array for the
input argument:
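A sketch of the whole-array call follows, using a shortened two-row version of the contacts list to keep it brief. The variable names addresses and flat are hypothetical:

```matlab
% Shortened version of the contacts list from this section.
contacts = { ...
    'Harry 287-625-7315 Columbus, OH hparker@hmail.com'; ...
    'Janice 529-882-1759 Fresno, CA jan_stephens@horizon.net'};
email = '[a-z_]+@[a-z]+\.(com|net)';

% With a cell array input, regexp returns one cell of matches per row.
addresses = regexp(contacts, email, 'match');

% Flatten the nested cells into a single list of addresses.
flat = [addresses{:}];
```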
Operators and Characters
Metacharacters
Metacharacters represent letters, letter ranges, digits, and space characters. Use them to construct a
generalized pattern of characters.
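A few commonly used metacharacters can be tried on a short piece of text. The sample text and variable names below are hypothetical:

```matlab
str = 'Room 12B, floor 3';   % hypothetical text

digits   = regexp(str, '\d', 'match');      % single digits
words    = regexp(str, '\w+', 'match');     % runs of letters, digits, underscores
spaces   = regexp(str, '\s');               % start indices of whitespace
upperSet = regexp(str, '[A-Z]', 'match');   % any one uppercase letter
```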
Character Representation
Operator Description
\a Alarm (beep)
\b Backspace
\f Form feed
\n New line
\r Carriage return
\t Horizontal tab
\v Vertical tab
\char Any character with special meaning in regular expressions that you want to match literally
(for example, use \\ to match a single backslash)
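The \char form can be illustrated by searching for characters that would otherwise act as operators. The sample text is hypothetical:

```matlab
str = 'C:\temp\data.txt costs $5';   % hypothetical text

backslashes = regexp(str, '\\');             % indices of literal backslashes
dotTxt      = regexp(str, '\.txt', 'match'); % literal dot, then txt
dollar      = regexp(str, '\$\d', 'match');  % literal dollar sign and a digit
```

Without the leading backslash, the dot would match any character and the dollar sign would act as an end-of-text anchor.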
Quantifiers
Quantifiers specify the number of times a pattern must occur in the matching text.
{0,1} is equivalent to ?.
expr{m,} At least m times consecutively. {0,} and {1,} are equivalent to * and +, respectively.
Example: '<a href="\w{1,}\.html">' matches an <a> HTML tag when the file name contains one
or more characters.
expr{n} Exactly n times consecutively. Equivalent to {n,n}.
Example: '\d{4}' matches four consecutive digits.
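The quantifiers above can be compared side by side. The sample text and variable names are hypothetical:

```matlab
str = 'ids: 7, 42, 2024, 123456';   % hypothetical text

% Exactly four consecutive digits; note that a longer run of digits also
% contains four consecutive digits, so 123456 contributes 1234.
fourDigits = regexp(str, '\d{4}', 'match');

twoOrMore = regexp(str, '\d{2,}', 'match');      % at least two digits
optionalS = regexp('id ids', 'ids?', 'match');   % ? is the same as {0,1}
```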
Quantifiers can appear in three modes, described in the following table. q represents any of the
quantifiers in the previous table.
'<tr><td><p>text</p></td>'
exprq? Lazy expression: match as few characters as necessary. Given the text
'<tr><td><p>text</p></td>', the expression '</?t.*?>' ends each match at the first occurrence
of the closing angle bracket (>):
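The difference between the greedy and lazy modes can be checked with the text from the table. The lazy expression is the one given above; the greedy expression is assumed to be the same expression without the trailing ?:

```matlab
str = '<tr><td><p>text</p></td>';

greedy = regexp(str, '</?t.*>', 'match');    % as many characters as possible
lazy   = regexp(str, '</?t.*?>', 'match');   % as few characters as necessary
```

The greedy form runs to the last closing angle bracket and returns the whole text as one match, while the lazy form stops at the first closing bracket and returns the individual tags that start with <t or </t.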
Grouping Operators
Grouping operators allow you to capture tokens, apply one operator to multiple elements, or disable
backtracking in a specific group.
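Two of these uses, applying one operator to several elements and capturing tokens, can be sketched as follows. The sample texts are hypothetical:

```matlab
% Grouping lets a quantifier apply to several characters at once.
% (?:...) groups without capturing a token.
repeated = regexp('abab ab a', '(?:ab)+', 'match');

% Plain parentheses group the alternatives and also record a token.
[m, t] = regexp('andy andrew', 'and(y|rew)', 'match', 'tokens');
```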
Anchors
Anchors in the expression match the beginning or end of a character vector or word.
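The four anchors (^, $, \< and \>) can be compared on one short text. The sample text and variable names are hypothetical:

```matlab
str = 'cat concat catalog';   % hypothetical text

atStart   = regexp(str, '^cat', 'match');       % only at the beginning of the text
atEnd     = regexp(str, 'catalog$', 'match');   % only at the end of the text
wordStart = regexp(str, '\<cat\w*', 'match');   % words that begin with cat
wordEnd   = regexp(str, '\w*cat\>', 'match');   % words that end with cat
```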
Lookaround Assertions
Lookaround assertions look for patterns that immediately precede or follow the intended match, but
are not part of the match.
The pointer remains at the current location, and characters that correspond to the test expression
are not captured or discarded. Therefore, lookahead assertions can match overlapping character
groups.
If you specify a lookahead assertion before an expression, the operation is equivalent to a logical AND.
For more information, see “Lookahead Assertions in Regular Expressions” on page 2-63.
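A short sketch of lookahead and lookbehind follows. The sample text is hypothetical; in each call the test expression decides where a match is allowed but is not returned as part of it:

```matlab
str = 'price: $25, weight: 25kg';   % hypothetical text

% Lookahead (?=test): digits that are followed by kg.
beforeKg = regexp(str, '\d+(?=kg)');

% Lookbehind (?<=test): digits that are preceded by a dollar sign.
afterDollar = regexp(str, '(?<=\$)\d+');

% The matched text is 25 in both cases; only the location differs.
matchedKg = regexp(str, '\d+(?=kg)', 'match');
```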
Logical and Conditional Operators
Logical and conditional operators allow you to test the state of a given condition, and then use the
outcome to determine which pattern, if any, to match next. These operators support logical OR and if
or if/else conditions. (For AND conditions, see “Lookaround Assertions” on page 2-58.)
Conditions can be tokens on page 2-60, lookaround assertions on page 2-58, or dynamic expressions
on page 2-60 of the form (?@cmd). Dynamic expressions must return a logical or numeric value.
Token Operators
Tokens are portions of the matched text that you define by enclosing part of the regular expression in
parentheses. You can refer to a token by its sequence in the text (an ordinal token), or assign names
to tokens for easier code maintenance and readable output.
Note If an expression has nested parentheses, MATLAB captures tokens that correspond to the
outermost set of parentheses. For example, given the search pattern '(and(y|rew))', MATLAB
creates a token for 'andrew' but not for 'y' or 'rew'.
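Ordinal and named tokens can be compared on the same text. The date text, token names, and variable names below are hypothetical:

```matlab
str = '2024-05-17';   % hypothetical date text

% Ordinal tokens: one cell per match, holding the tokens in order.
tok = regexp(str, '(\d+)-(\d+)-(\d+)', 'tokens');
yr = tok{1}{1};

% Named tokens: the 'names' option returns a structure with one field
% per token name.
parts = regexp(str, '(?<year>\d+)-(?<month>\d+)-(?<day>\d+)', 'names');
```

With named tokens, parts.month reads more clearly than tok{1}{2}, and the code does not break if another group is added earlier in the expression.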
Dynamic Expressions
Dynamic expressions allow you to execute a MATLAB command or a regular expression to determine
the text to match.
The parentheses that enclose dynamic expressions do not create a capturing group.
Within dynamic expressions, use the following operators to define replacement terms.
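Two dynamic forms can be sketched briefly, assuming the (??expr) match operator and the ${cmd} replacement operator behave as described in the regexp and regexprep reference pages. The sample inputs are hypothetical:

```matlab
% (??expr): parse expr as part of the expression. Here the leading number
% decides how many X characters must follow.
strs = {'5XXXXX', '8XXXXXXXX', '1X'};
m = regexp(strs, '^(\d+)(??X{$1})$', 'match', 'once');

% ${cmd} in a replacement: run a MATLAB command on the match.
% $0 stands for the entire match, here the first letter of each word.
capitalized = regexprep('hello world', '\<\w', '${upper($0)}');
```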
Comments
The comment operator enables you to insert comments into your code to make it more maintainable.
The text of the comment is ignored by MATLAB when matching against the input text.
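A minimal sketch of the comment operator, (?#comment), on a hypothetical piece of text:

```matlab
str = 'order 66 shipped';   % hypothetical text

% The (?#...) part documents the expression and is ignored when matching.
num = regexp(str, '(?# one or more digits)\d+', 'match');
```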
Search Flags
Flag Description
(?-i) Match letter case (default for regexp and regexprep).
(?i) Do not match letter case (default for regexpi).
(?s) Match dot (.) in the pattern with any character (default).
(?-s) Match dot in the pattern with any character that is not a newline character.
(?-m) Match the ^ and $ metacharacters at the beginning and end of text (default).
(?m) Match the ^ and $ metacharacters at the beginning and end of a line.
(?-x) Include space characters and comments when matching (default).
(?x) Ignore space characters and comments when matching. Use '\ ' and '\#' to
match space and # characters.
The expression that the flag modifies can appear either after the parentheses, such as
(?i)\w*
or inside the parentheses and separated from the flag with a colon (:), such as
(?i:\w*)
The latter syntax allows you to change the behavior for part of a larger expression.
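The two placements of a flag can be compared using the case flag. The sample text and variable names are hypothetical:

```matlab
str = 'MATLAB matlab Matlab';   % hypothetical text

caseSensitive = regexp(str, 'matlab', 'match');        % default for regexp
wholeFlag     = regexp(str, '(?i)matlab', 'match');    % flag applies to what follows
partFlag      = regexp(str, '(?i:m)atlab', 'match');   % flag applies to m only
```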
See Also
pattern | regexp | regexpi | regexprep | regexptranslate
More About
• “Lookahead Assertions in Regular Expressions” on page 2-63
• “Tokens in Regular Expressions” on page 2-66
• “Dynamic Regular Expressions” on page 2-72
• “Search and Replace Text” on page 6-37
Lookahead Assertions in Regular Expressions
Lookahead Assertions
There are two types of lookaround assertions for regular expressions: lookahead and lookbehind. In
both cases, the assertion is a condition that must be satisfied to return a match to the expression.
A lookahead assertion has the form (?=test) and can appear anywhere in a regular expression.
MATLAB looks ahead of the current location in the text for the test condition. If MATLAB matches the
test condition, it continues processing the rest of the expression to find a match.
For example, look ahead in a character vector specifying a path to find the name of the folder that
contains a program file (in this case, fileread.m).
chr = which('fileread')
chr =
'matlabroot\toolbox\matlab\iofun\fileread.m'
regexp(chr,'\w+(?=\\\w+\.[mp])','match')
ans =
{'iofun'}
The match expression, \w+, searches for one or more alphanumeric or underscore characters. Each
time regexp finds a term that matches this condition, it looks ahead for a backslash (specified with
two backslashes, \\), followed by a file name (\w+) with an .m or .p extension (\.[mp]). The
regexp function returns the match that satisfies the lookahead condition, which is the folder name
iofun.
Overlapping Matches
Lookahead assertions do not consume any characters in the text. As a result, you can use them to find
overlapping character sequences.
For example, use lookahead to find every sequence of six nonwhitespace characters in a character
vector by matching initial characters that precede five additional characters:
chr = 'Locate several 6-char. phrases';
startIndex = regexpi(chr,'\S(?=\S{5})')
startIndex =
1 8 9 16 17 24 25
Without the lookahead operator, MATLAB parses a character vector from left to right, consuming the
vector as it goes. If matching characters are found, regexp records the location and resumes parsing
the character vector from the location of the most recent match. There is no overlapping of
characters in this process.
chr = 'Locate several 6-char. phrases';
startIndex = regexpi(chr,'\S{6}')
startIndex =
1 8 16 24
chr =
Merely searching for non-vowels ([^aeiou]) does not return the expected answer, as the output
includes capital letters, space characters, and punctuation:
c = regexp(chr,'[^aeiou]','match')
c =
Columns 1 through 14
{' '} {'N'} {'O'} {'R'} {'M'} {'E'} {'S'} {'T'} {' '} {'E'} {'s
Columns 15 through 28
{' '} {'t'} {'h'} {' '} {'m'} {'t'} {'r'} {'x'} {' '} {'2'} {'-
Columns 29 through 42
{'.'} {'↵'} {' '} {' '} {' '} {' '} {'N'} {'O'} {'R'} {'M'} {'E
Column 43
{'S'}
Try this again, using a lookahead operator to create the following AND condition:
c = regexp(chr,'(?=[a-z])[^aeiou]','match')
c =
{'s'} {'t'} {'m'} {'t'} {'t'} {'h'} {'m'} {'t'} {'r'} {'x'} {'n
Note that when using a lookahead operator to perform an AND, you need to place the match
expression expr after the test expression test:
(?=test)expr or (?!test)expr
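The AND form can be tried on a short text of your own. The sketch below uses the same expression as above on a hypothetical sentence:

```matlab
chr = 'Quick brown fox 42!';   % hypothetical text

% (?=[a-z]) requires a lowercase letter AND [^aeiou] requires a non-vowel,
% so only lowercase consonants are returned.
consonants = regexp(chr, '(?=[a-z])[^aeiou]', 'match');
joined = [consonants{:}];
```

The capital Q, the spaces, the digits, and the exclamation mark all fail the lookahead test, so they are excluded even though they are not vowels.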
See Also
regexp | regexpi | regexprep
More About
• “Regular Expressions” on page 2-51
Tokens in Regular Expressions
Introduction
Parentheses used in a regular expression not only group elements of that expression together, but
also designate any matches found for that group as tokens. You can use tokens to match other parts
of the same text. One advantage of using tokens is that they remember what they matched, so you
can recall and reuse matched text in the process of searching or replacing.
Each token in the expression is assigned a number, starting from 1, going from left to right. To make
a reference to a token later in the expression, refer to it using a backslash followed by the token
number. For example, when referencing a token generated by the third set of parentheses in the
expression, use \3.
As a simple example, if you wanted to search for identical sequential letters in a character array, you
could capture the first letter as a token and then search for a matching character immediately
afterwards. In the expression shown below, the (\S) phrase creates a token whenever regexp
matches any nonwhitespace character in the character array. The second part of the expression,
'\1', looks for a second instance of the same character immediately following the first.
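poe = ['While I nodded, nearly napping, ' ...
'suddenly there came a tapping,'];
[mat,tok,ext] = regexp(poe,'(\S)\1','match','tokens','tokenExtents');
mat
mat =
{'dd'} {'pp'} {'dd'} {'pp'}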
The cell array tok contains cell arrays that each contain a token.
tok{:}
ans =
{'d'}
ans =
{'p'}
ans =
{'d'}
ans =
{'p'}
The cell array ext contains numeric arrays that each contain starting and ending indices for a token.
ext{:}
ans =
11 11
ans =
26 26
ans =
35 35
ans =
57 57
For another example, capture pairs of matching HTML tags (e.g., <a> and </a>) and the text
between them. The expression used for this example is
expr = '<(\w+).*?>.*?</\1>';
The first part of the expression, '<(\w+)', matches an opening angle bracket (<) followed by one or
more alphabetic, numeric, or underscore characters. The enclosing parentheses capture token
characters following the opening angle bracket.
The second part of the expression, '.*?>.*?', matches the remainder of this HTML tag (characters
up to the >), and any characters that may precede the next opening angle bracket.
The last part, '</\1>', matches all characters in the ending HTML tag. This tag is composed of the
sequence </tag>, where tag is whatever characters were captured as a token.
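A sketch of the search follows. The HTML fragment hstr is hypothetical, chosen to contain two matched tag pairs and one unpaired tag; it is not necessarily the input used in the original example:

```matlab
% Hypothetical HTML fragment with two tag pairs and one unpaired <br>.
hstr = '<a name="752507"></a><b>Default</b><br>';
expr = '<(\w+).*?>.*?</\1>';

% mat holds the matched text; tok holds the tag name captured by (\w+).
[mat, tok] = regexp(hstr, expr, 'match', 'tokens');
mat{:}
```

The <br> tag is not returned because no closing </br> follows it, so the back reference \1 cannot be satisfied.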
ans =
'<a name="752507"></a>'
ans =
'<b>Default</b>'
tok{:}
ans =
{'a'}
ans =
{'b'}
Multiple Tokens
Here is an example of how tokens are assigned values. Suppose that you are going to search the
following text:
You choose to search the above text with the following search pattern:
and(y|rew)|(t)e(d)
This pattern has three parenthetical expressions that generate tokens. When you finally perform the
search, the following tokens are generated for each match.
Only the highest level parentheses are used. For example, if the search pattern and(y|rew) finds the
text andrew, token 1 is assigned the value rew. However, if the search pattern (and(y|rew)) is
used, token 1 is assigned the value andrew.
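The two cases in the preceding paragraph can be checked directly. The variable names are hypothetical, and the 'once' option is used so that the tokens of the single match come back in one cell array:

```matlab
% Inner parentheses only: token 1 is the text matched by (y|rew).
inner = regexp('andrew', 'and(y|rew)', 'tokens', 'once');

% Added outer parentheses: token 1 is now the whole of andrew.
outer = regexp('andrew', '(and(y|rew))', 'tokens', 'once');

firstInner = inner{1};
firstOuter = outer{1};
```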
Unmatched Tokens
For those tokens specified in the regular expression that have no match in the text being evaluated,
regexp and regexpi return an empty character vector ('') as the token output, and an extent that
marks the position in the string where the token was expected.
The example shown here executes regexp on a character vector specifying the path returned from
the MATLAB tempdir function. The regular expression expr includes six token specifiers, one for
each piece of the path. The third specifier [a-z]+ has no match in the character vector because this
part of the path, Profiles, begins with an uppercase letter:
chr = tempdir
chr =
'C:\WINNT\Profiles\bpascal\LOCALS~1\Temp\'
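A call of the kind described can be sketched as follows. The expression is an assumption built to fit the description (six token specifiers, the third being [a-z]+), and the path is hard-coded to the sample value rather than taken from tempdir so that the sketch behaves the same on any machine:

```matlab
% Fixed path equal to the sample output, used here instead of calling tempdir.
chr = 'C:\WINNT\Profiles\bpascal\LOCALS~1\Temp\';

% Six token specifiers; the third, ([a-z]+)?, is optional so the overall
% match can succeed even though Profiles starts with an uppercase letter.
expr = ['([A-Z]:)\\(WINNT)\\([a-z]+)?.*\\' ...
        '([a-z]+)\\([A-Z]+~\d)\\(Temp)\\'];

% tok holds the token text; ext holds the start and end index of each token.
[tok, ext] = regexp(chr, expr, 'tokens', 'tokenExtents');
```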
When a token is not found in the text, regexp returns an empty character vector ('') as the token
and a numeric array with the token extent. The first number of the extent is the string index that
marks where the token was expected, and the second number of the extent is equal to one less than
the first.
In the case of this example, the empty token is the third specified in the expression, so the third token
returned is empty:
tok{:}
ans =
{'C:'} {'WINNT'} {0×0 char} {'bpascal'} {'LOCALS~1'} {'Temp'}
The third token extent returned in the variable ext has the starting index set to 10, which is where
the nonmatching term, Profiles, begins in the path. The ending extent index is set to one less than
the starting index, or 9:
ext{:}
ans =
1 2
4 8
10 9
19 25
27 34
36 39
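In replacement text, refer to tokens as $1, $2, and so on, rather than \1, \2. The example below
captures two tokens and reverses their order. The first, $1, is 'Norma Jean' and the
second, $2, is 'Baker'. Note that regexprep returns the modified text, not a vector of starting
indices.
regexprep('Norma Jean Baker', '(\w+\s\w+)\s(\w+)', '$2, $1')
ans =
'Baker, Norma Jean'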
Named Capture
If you use a lot of tokens in your expressions, it may be helpful to assign them names rather than
having to keep track of which token number is assigned to which token.
When referencing a named token within the expression, use the syntax \k<name> instead of the
numeric \1, \2, etc.:
poe = ['While I nodded, nearly napping, ' ...
'suddenly there came a tapping,'];
ans =
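A sketch of a named back reference on the same text follows. The token name anychar and the variable doubled are assumptions made for illustration; (?<name>expr) assigns the name and \k<name> refers back to it:

```matlab
poe = ['While I nodded, nearly napping, ' ...
       'suddenly there came a tapping,'];

% (?<anychar>.) names the captured character; \k<anychar> requires the
% same character to appear again immediately afterwards.
doubled = regexp(poe, '(?<anychar>.)\k<anychar>', 'match');
```

This finds the same doubled letters as the numbered form '(\S)\1' used earlier, because the text contains no doubled spaces.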
Named tokens can also be useful in labeling the output from the MATLAB regular expression
functions. This is especially true when you are processing many pieces of text.
For example, parse different parts of street addresses from several character vectors. A short name is
assigned to each token in the expression:
chr1 = '134 Main Street, Boulder, CO, 14923';
chr2 = '26 Walnut Road, Topeka, KA, 25384';
chr3 = '847 Industrial Drive, Elizabeth, NJ, 73548';
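p1 = '(?<adrs>\d+\s\S+\s(Road|Street|Avenue|Drive))';
p2 = '(?<city>[A-Z][a-z]+)';
p3 = '(?<state>[A-Z]{2})';
p4 = '(?<zip>\d{5})';
% Join the four segments with the comma-space separator used in the addresses.
expr = [p1 ', ' p2 ', ' p3 ', ' p4];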
As the following results demonstrate, you can make your output easier to work with by using named
tokens:
loc1 = regexp(chr1, expr, 'names')
loc1 =