REGULAR EXPRESSIONS
Pocket Primer
LICENSE, DISCLAIMER OF LIABILITY, AND LIMITED WARRANTY
By purchasing or using this book and its companion files (the “Work”), you agree
that this license grants permission to use the contents contained herein, including
the companion files, but does not give you the right of ownership to any of the
textual content in the book / files or ownership to any of the information or products
contained in it. This license does not permit uploading of the Work onto the
Internet or on a network (of any kind) without the written consent of the Publisher.
Duplication or dissemination of any text, code, simulations, images, etc. contained
herein is limited to and subject to licensing terms for the respective products, and
permission must be obtained from the Publisher or the owner of the content, etc.,
in order to reproduce or network any portion of the textual material (in any media)
that is contained in the Work.
The sole remedy in the event of a claim of any kind is expressly limited to replacement
of the book and/or companion files, and only at the discretion of the Publisher. The
use of “implied warranty” and certain “exclusions” vary from state to state, and might
not apply to the purchaser of this product.
The companion files are available for downloading by writing to the publisher at
info@merclearning.com.
Oswald Campesato
This publication, portions of it, or any accompanying software may not be reproduced in any
way, stored in a retrieval system of any type, or transmitted by any means, media, electronic
display or mechanical display, including, but not limited to, photocopy, recording, Internet
postings, or scanning, without prior permission in writing from the publisher.
The publisher recognizes and respects all marks used by companies, manufacturers, and
developers as a means to distinguish their products. All brand names and product names
mentioned in this book are trademarks or service marks of their respective companies. Any
omission or misuse (of any kind) of service marks or trademarks, etc. is not an attempt to
infringe on the property of others.
Our titles are available for adoption, license, or bulk purchase by institutions, corporations,
etc. For additional information, please contact the Customer Service Dept. at
800-232-0223 (toll free).
All of our titles are available in digital format at authorcloudware.com and other digital
vendors. The sole obligation of Mercury Learning and Information to the purchaser is to
replace the book, based on defective materials or faulty workmanship, but not based on the
operation or functionality of the product. Companion files are available by writing to the
publisher at info@merclearning.com.
I’d like to dedicate this book to my parents –
may this bring joy and happiness into their lives.
Contents
Preface
Printing Lines
Character Classes and sed
Removing Control Characters
Counting Words in a Dataset
Back References and Forward References in sed
Working with Forward References
Displaying Only “Pure” Words in a Dataset
The awk Command
Built-In Variables That Control awk
How Does the awk Command Work?
Aligning Text with the printf Command
Matching with Metacharacters and Character Sets
Printing Lines Using Conditional Logic
Selecting and Switching Any Two Columns
Reversing All Rows with awk
Reversing the Lines in a File
Switching Two Adjacent Columns (1)
Switching Two Adjacent Columns (2)
Switching Consecutive Columns
A More Complex Example
Chapter Summary
Index
Preface
The goal of this book is to introduce readers to regular expressions in
several technologies. While the material is primarily for people who
have little or no experience with regular expressions, there is also some
content that may be suitable for intermediate users, or for people who wish to
understand how to translate what they know about regular expressions from
prior experience into any of the languages discussed in this book. Hence,
this is more suitable as an introductory “how-to” book than a reference book.
Keep in mind that this book will not make you an expert in creating regular
expressions.
If you are interested in applying regular expressions to tasks that involve
some type of data cleaning, Data Cleaning Pocket Primer might be a good fit
for you.
This book is intended for data scientists, data analysts, and other people
who want to understand regular expressions to perform various tasks. As such,
no prior knowledge of regular expressions is required (but can obviously be
helpful).
You will acquire an understanding of how to create an assortment of regular
expressions, such as filtering data for strings containing uppercase or lowercase
letters; matching integers, decimals, hexadecimal, and scientific numbers; and
context-dependent pattern matching expressions.
Some chapters contain use cases, such as replacing non-alphabetic characters
with a white space (Chapter 1), how to switch columns in a text file
(Chapter 5), and how to reverse the order of the fields of a record in a text file
The code samples in this book were created and tested using bash on a
MacBook Pro with OS X 10.12.6 (macOS Sierra). Regarding their content: the
regular expressions were written primarily by the author, and in some cases
there are code samples that incorporate short sections of code from
discussions in online forums. The key point to remember is that the overwhelming
majority of the code samples follow the “Four Cs”: they must be Clear,
Concise, Complete, and Correct to the extent that it’s possible to do so, given the
size of this book.
You need some familiarity with working from the command line in a
Unix-like environment. There are also subjective prerequisites, such as a strong
desire to learn regular expressions, along with the motivation and discipline to
read and understand the code samples. In any case, if you’re not sure whether
or not you can absorb the material in this book, glance through the code
samples to get a feel for the level of complexity.
Although there isn’t a specific list, this book does not cover REs that are
very complex or that handle “corner cases” of interest mainly to expert-level
developers. The purpose of the material in the chapters is to illustrate how to
create a variety of regular expressions for handling common data-related tasks with
datasets, after which you can do further reading to deepen your knowledge.
If you are a Mac user, there are three ways to open a command shell. The first method is to use
Finder to navigate to Applications > Utilities and then double-click
on the Terminal application. Next, if you already have a command shell
available, you can launch a new command shell by typing the following command:
open /Applications/Utilities/Terminal.app
A third method is to press command+n in that command shell, and your Mac will launch another
command shell.
If you are a PC user, you can install Cygwin (open source, https://cygwin.com/),
which simulates bash commands, or use another toolkit such as MKS (a
commercial product). Please read the online documentation that describes the
download and installation process.
If you use RStudio, you can launch a command shell inside of RStudio by
navigating to Tools > Command Line, and then you can launch bash
commands. Note that custom aliases are not automatically set if they are defined in
a file other than the main start-up file (such as .bash_login).
Although Perl has fantastic support for regular expressions (support that was
peerless for many years), Perl has become a sort of “niche” language. Since Perl
appeals to a much smaller audience, it makes more sense to include Perl regular
expressions in an Appendix instead of a chapter.
However, it’s worth spending a few minutes to skim through the first
portion of the Perl Appendix: the examples of regular expressions are modeled
after the material in Chapter 1 and the syntax is very similar.
In addition, if you are a front-end Web developer (or perhaps a full-stack
developer), you will benefit from the Appendix because the Perl examples are
more similar to JavaScript than other scripting languages. Furthermore, if you
work with R, you can leverage your knowledge of Perl regular expressions
because the Perl syntax is supported in R.
The answer to this question varies widely, mainly because the answer
depends heavily on your objectives. The best answer is to try a new tool or
technique from the book out on a problem or task you care about, professionally or
personally. Precisely what that might be depends on who you are, as the needs
of a data scientist, manager, student or developer are all different. In addition,
keep what you learned in mind as you tackle new data cleaning or
manipulation challenges. Sometimes knowing a technique is possible makes finding a
solution easier, even if you have to re-read the section to remember exactly
how the syntax works.
If you have reached the limits of what you have learned here and want to
get further technical depth about regular expressions, there are various online
resources and literature describing how to create complex and arcane regular
expressions.
Chapter 1
Introduction to Regular Expressions
This chapter introduces you to basic Regular Expressions, often abbreviated
as REs, that will prepare you for the material in subsequent chapters.
The REs in this chapter are illustrated via the Unix grep utility
that is available on any Unix-related platform, including Linux and macOS
(OS X). If you are a complete neophyte, you’ll learn a decent variety of REs by
the time you have finished reading this chapter.
In fact, this chapter does not require you to understand any of the deeper
theory that underlies REs: simply launch the grep (or egrep) utility from
the command line to see the result of matching REs to various strings. In most
cases, the text strings are placed in text files so that the REs can be tested
against multiple strings simultaneously.
In essence, this chapter acts as “ground zero” for REs, starting from the
simplest search strings (i.e., hard-coded strings), to search strings that contain
REs involving uppercase letters, lowercase letters, numbers, special charac-
ters, and various combinations of such strings.
If you have some experience working with REs, skim through the code
samples in this chapter (you might find something new to you). If you are
impatient, see if you can explain the purpose of the following RE:
[^ ]*?@[^ ]*?\.[^ ]*. If you know the answer, then you can probably go directly
to Chapter 2.
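As a rough check on that RE, the following sketch runs a simplified form of it with grep -E. Two assumptions are made here: the lazy quantifiers *? are a Perl-style feature that POSIX extended REs do not define, so they are written as plain * (which does not change which lines match), and the sample lines are invented for illustration.

```shell
# Invented sample lines (not taken from the book's files)
sample='contact jsmith@acme.com today
no address on this line
at sign @ but no domain'

# [^ ]*  zero or more non-space characters
# @      a literal at sign
# \.     a literal period
# -o     prints only the text that matched, not the whole line
printf '%s\n' "$sample" | grep -E -o '[^ ]*@[^ ]*\.[^ ]*'
```

Only the first line contains something shaped like text@text.text, so that is the only token the command prints.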
The first section in this chapter (which comprises most of the chapter)
contains code snippets that illustrate how to perform very simple pattern matching
with lines of text in a text file. This section also introduces the metacharacters
^, $, ., \, and ?, along with code snippets that illustrate how to use these
metacharacters in REs (and also their nuances). The purpose of this section is
to provide a myriad of concrete examples of REs, after which the more abstract
descriptions of metacharacters will be more meaningful to you.
grey
gr[a-z]y
^the
^[the]
[^the]
^[^z]
^t.*gray
^the.*gray.$
As you can see, the word grey appears in the first and second lines, the
word gray appears in the first and third lines, and all three lines contain either
grey or gray.
Here are the tasks that we want to perform in this section (and also the next
section):
The solutions to the three preceding tasks are very easy. The following
command performs the first task:
The third task can be solved using the metacharacter “|” (logical “or” in
egrep syntax) and the egrep utility, as shown here:
The third task can also be solved with a character class, which is the topic
of the next section.
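A minimal sketch of the alternation approach follows. The three sample lines are an assumption: they are a stand-in consistent with the description of lines1.txt above, not the file itself. grep -E is used as the POSIX spelling of egrep.

```shell
# Hypothetical stand-in for lines1.txt
sample='the dog is grey and the cat is gray.
this dog is grey
that cat is gray'

# The | metacharacter needs extended REs, hence grep -E
printf '%s\n' "$sample" | grep -E 'grey|gray'   # prints all three lines

# A hard-coded string needs no metacharacters at all
printf '%s\n' "$sample" | grep -c 'grey'        # counts the lines containing grey
printf '%s\n' "$sample" | grep -c 'gray'        # counts the lines containing gray
```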
The following command performs the third task listed in the previous section:
The term gr[ae]y is an RE, and it’s a compact way of representing the
two strings gray and grey. The order of the letters in the square
brackets is irrelevant, which means that the third task can also be solved with
this command:
With the broader RE gr[a-z]y, the matching lines again contain either grey or gray,
but if the text file included a line with the string grzy, then such a line
would also appear in the output.
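The difference between a short character class and a range can be sketched as follows. The sample data is hypothetical: three lines consistent with the earlier description of lines1.txt, plus an invented fourth line containing grzy.

```shell
# Hypothetical sample data; the last line is invented to exercise the range
sample='the dog is grey and the cat is gray.
this dog is grey
that cat is gray
a made-up line with grzy'

printf '%s\n' "$sample" | grep 'gr[ae]y'    # first three lines only
printf '%s\n' "$sample" | grep 'gr[ea]y'    # same output: order inside [] is irrelevant
printf '%s\n' "$sample" | grep 'gr[a-z]y'   # all four lines, including grzy
```

A range such as [a-z] is convenient, but it accepts every lowercase letter in that position, so it is usually worth listing only the letters you actually intend to match.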
We can also specify a single letter inside the square brackets. For example,
the term [a] is an RE that matches the letter a. Now launch this command
from the command line:
Once again, the order of the letters in the square brackets is irrelevant,
which means that the following commands have the same output:
dog
den
dupe
On the other hand, the RE ^[the] matches any lines that start with one
of the letters t, h, or e, as shown here:
By contrast, the following expression matches any lines that do not start
with the letter t (and in this case, there are no matching lines):
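The three uses of “^” discussed so far can be compared side by side in the sketch below, which again assumes a hypothetical three-line stand-in for lines1.txt. Note that grep returns exit status 1 when nothing matches, which is why the last command ends with || true.

```shell
# Hypothetical stand-in for lines1.txt
sample='the dog is grey and the cat is gray.
this dog is grey
that cat is gray'

printf '%s\n' "$sample" | grep '^the'     # starts with the letters t-h-e: first line only
printf '%s\n' "$sample" | grep '^[the]'   # starts with t, h, or e: all three lines

# ^ outside the brackets anchors to the start of the line;
# ^ inside the brackets negates the class. Every sample line starts with t,
# so the count is 0 and grep reports "no match" via its exit status.
printf '%s\n' "$sample" | grep -c '^[^t]' || true
```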
Based on what you have learned thus far, you know the meaning of the fol-
lowing REs:
Later in this book you will learn how to match more complex expressions, such
as zip codes for different countries, email addresses, phone numbers, and ISBNs.
Notice that the first line in the file lines1.txt is excluded, because
although gray is in the line, the line ends with a period instead of gray.
The metacharacter “.” matches any single character (except a linefeed). The
metacharacter “*” matches zero or more occurrences of the preceding element, so
the combination “.*” matches zero or more occurrences of any character. Although
“*” is often called a wildcard, it behaves like the wildcard of common
Find/Replace tools only in the combination “.*”. This combination is useful when you want to match the intervening
letters between a start character (or word) and an end character (or word).
For example, if you want to match the lines that start with the letter t,
followed by any characters, and then followed by an occurrence of the word gray,
use this expression:
Notice how the metacharacters “.*” enable you to “match” the intervening
characters between the initial t and the occurrence of the word gray
somewhere else in a line. In this example, gray appears at the end of both
matching lines, but a line containing the word gray somewhere “in the middle”
would also have matched the RE.
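The behavior of “.*” can be sketched with the commands below. The sample data is an assumption: three lines consistent with the description of lines1.txt, plus an invented fourth line in which gray appears in the middle.

```shell
# Hypothetical sample data; the fourth line is invented
sample='the dog is grey and the cat is gray.
this dog is grey
that cat is gray
the gray cat sleeps'

# t at the start of the line, then anything, then gray
printf '%s\n' "$sample" | grep '^t.*gray'     # lines 1, 3, and 4

# the at the start of the line, then anything, then gray
printf '%s\n' "$sample" | grep '^the.*gray'   # lines 1 and 4
```

Because “.*” can also match zero characters, the RE ^t.*gray would match a line that begins with tgray as well.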
If you want to match the lines that start with the word the, followed by an
occurrence of the word gray, use this RE:
You can match the final “.” character by using the “escape” metacharacter
“\”, which tells the RE to treat something that is normally a metacharacter as a literal character.
The following RE also matches the final “.” character, because a period is a
legitimate match, but it would also match a line that ends in “grayx” or “gray!”
However, the following RE does not match the final “.” character, but it
will match a line that ends in “gray,” because “.” as a metacharacter matches
the final “y”:
Finally, the following expression only matches the first line, because you
need one and only one additional character after “gray” to match:
grep "^the.*gray.$" lines1.txt
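The end-of-line variants described above can be compared in one place. This is a sketch that assumes a hypothetical three-line stand-in for lines1.txt in which only the first line ends with a period.

```shell
# Hypothetical stand-in for lines1.txt
sample='the dog is grey and the cat is gray.
this dog is grey
that cat is gray'

printf '%s\n' "$sample" | grep 'gray$'     # line 3: gray is the very end of the line
printf '%s\n' "$sample" | grep 'gray\.$'   # line 1: gray followed by a literal period
printf '%s\n' "$sample" | grep 'gray.$'    # line 1: gray followed by any one character
printf '%s\n' "$sample" | grep 'gra.$'     # line 3: the unescaped . matches the final y
```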
x
y
z w
a
x
y
z w
The output consisting of two lines (the second line is blank) is here:
Match lines that contain only whitespaces with this expression, which
literally means “match lines that begin with whitespace, and end in one or
more instances of whitespace.” The “+” metacharacter means “match one
or more instances of the prior element”:
The output is a blank line, which you will see on the screen. Note that
matching an empty line is different from matching a line containing only
whitespaces.
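The distinction between empty lines and whitespace-only lines can be sketched as follows. The five sample lines are an assumption loosely modeled on the listing above, and the POSIX class [[:space:]] is used here instead of any shorthand because it is defined for every POSIX grep.

```shell
# Hypothetical sample: line 2 is empty, line 4 holds three spaces
sample_lines() { printf 'x\n\ny\n   \nz w\n'; }

sample_lines | grep -c '^$'                  # empty lines only
sample_lines | grep -E -c '^[[:space:]]+$'   # lines with one or more whitespaces and nothing else
sample_lines | grep -E -c '^[[:space:]]*$'   # either kind: * also allows zero whitespaces
```

The only difference between the last two commands is the quantifier: “+” requires at least one whitespace character, whereas “*” also accepts none.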
Escaping a Metacharacter
Recall that if you want to match the lines that start with the letter t and also
end with the word gray, use this expression:
If you want to match the lines that contain a “.”, use this expression:
If you want to match the lines that match .doc, use this expression:
The following expression matches the lines that end with .doc:
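A minimal sketch of these escaped-period REs, using invented sample lines (they are not the contents of any file in the book), shows how the escaped and unescaped forms differ:

```shell
# Invented sample lines
sample='report.doc
readme.txt
my.documents folder
undocumented'

printf '%s\n' "$sample" | grep '\.'       # any line containing a literal period: first three lines
printf '%s\n' "$sample" | grep '\.doc'    # a literal .doc anywhere: lines 1 and 3
printf '%s\n' "$sample" | grep '\.doc$'   # a literal .doc at the end of the line: line 1
printf '%s\n' "$sample" | grep '.doc'     # unescaped: any character followed by doc,
                                          # so undocumented matches as well
```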
cat
catty
catfish
small catfish
If you want to match the lines that contain dog, use this expression:
dog
doggy
If you want to match the lines that start with the word dog, use this expression:
dog
doggy
If you want to match the lines that end with the word dog, use this expres-
sion:
dog
If you want to match the lines that start and also end with the word dog,
use this expression:
dog
If you want to match the lines that start with a blank space, use this expres-
sion:
catfish
If you want to match the lines that start with a period, use this expression:
.gray
If you want to match the lines with any occurrence of a period, use this
expression:
grey.
.gray
By contrast, the following expression matches all lines because the “.”
metacharacter has not been escaped, so you are now telling it to match lines that
begin with any character at all (only an empty line would fail to match):
grey.
.gray
dog
doggy
cat
catty
catfish
small catfish
The following expression matches lines that start with a space, followed by
any characters, and then followed by the string cat:
catfish
small catfish
The following expression matches lines that contain the letter r or the
letter e:
grey.
.gray
The following expression matches lines that contain the letter g, followed
by either the letter r or the letter e:
grey.
Note that the third RE in the preceding list matches other words (e.g., are,
bre, cre, and so forth) that are not contained in lines3.txt, and it’s just
happenstance that the RE matches the string .grey.
This RE matches the word .gray:
grep "^.[g][re]" lines3.txt
• “?” means “match exactly zero or one instance of the previous element”
• “+” means “match one or more instances of the previous element”
• “|” is used as a “logical or” in an extended regular expression
The pipe “|” metacharacter provides a choice of options. (It has a different
meaning from the pipe symbol on the command line: REs have their own syntax,
which often differs from that of the shell.) For
example, the expression a|b means a or b, and the expression a|b|c means
a or b or c.
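The three metacharacters in the preceding list can be tried out with the sketch below. The word list is invented for illustration, and grep -E is used because “?”, “+”, and “|” belong to the extended RE syntax.

```shell
# Invented word list
sample='color
colour
colouur
gogle
google
ggle'

printf '%s\n' "$sample" | grep -E 'colou?r'            # u is optional: color, colour
printf '%s\n' "$sample" | grep -E 'go+gle'             # one or more o: gogle, google
printf '%s\n' "$sample" | grep -E '^(color|google)$'   # either whole word: color, google
```

The parentheses in the last command limit the scope of “|”, so that the anchors “^” and “$” apply to both alternatives.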
The “$” metacharacter refers to the end of a line of text, and in REs inside
the vi editor, the “$” metacharacter refers to the last line in a file.
The “^” metacharacter refers to the beginning of a string or a line of text.
For example:
In the case of REs, the “^” metacharacter can also mean “does not match” when it is
the first character inside square brackets:
the context determines which interpretation to use for the “^” metacharacter.
This chapter contains multiple sections with examples of these
metacharacters, as well as “^”, “$”, “*”, and “\” metacharacters.
This section contains examples of REs that match mixed-case strings
(typically user names or text that has proper sentences). Listing 1.5 displays the
contents of lines5.txt, which is used in code snippets in this section.
The following RE matches only the lines that contain mixed-case strings.
Recall that [A-Z] is the character
class that matches any capital letter, and [a-z] is the character class that
matches any lowercase letter:
The following RE matches mixed-case strings that end with a period “.”:
The following RE matches strings that start with an uppercase letter fol-
lowed by a space and another lowercase string, and end in a period “.”:
Another RE that uses the “|” metacharacter to match strings that contain
either John or john is here:
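The mixed-case REs in this section can be sketched as follows. Everything here is an assumption made for illustration: the sample lines are invented rather than taken from lines5.txt, and “mixed case” is interpreted as an uppercase letter immediately followed by a lowercase letter. LC_ALL=C is set so that [A-Z] and [a-z] mean exactly the ASCII letters regardless of the locale.

```shell
# Invented sample lines (not the contents of lines5.txt)
sample='John Smith
john smith
JOHN SMITH
Hello world.'

# An uppercase letter immediately followed by a lowercase letter
printf '%s\n' "$sample" | LC_ALL=C grep '[A-Z][a-z]'                # lines 1 and 4

# Capitalized word, a space, a lowercase word, and a final period
printf '%s\n' "$sample" | LC_ALL=C grep '^[A-Z][a-z]* [a-z]*\.$'    # line 4

# Two equivalent ways to accept John or john
printf '%s\n' "$sample" | LC_ALL=C grep -E 'John|john'              # lines 1 and 2
printf '%s\n' "$sample" | LC_ALL=C grep '[Jj]ohn'                   # lines 1 and 2
```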
grey.
.gray
dog
doggy
cat
catty
catfish
small catfish
NOTE: The examples in this section require egrep because grep does not
support + on all operating systems.
catfish
The following expression matches lines that start with one or more
whitespaces, then any number of characters, and are then followed by the string cat:
catfish
small catfish
grey.
.gray
dog
doggy
cat
catty
grey.
dog
doggy
cat
catty
.gray
catfish
small catfish
The following expression matches lines that do not start with a word, fol-
lowed by the string cat:
catfish
The following expression matches lines that do not start with a word, and
contain the string cat somewhere in the line:
catfish
small catfish
You can use \b as a boundary marker to match email addresses that occur
somewhere in a text string:
The following expression matches lines that start with two digits (followed
by anything):
grep "^[0-9][0-9]." lines6.txt
05/12/18
05/12/2018
05912918
05.12.18
05.12.2018
0591292018
The following expression matches lines that start with two digits (followed
by a forward slash):
05/12/18
05/12/2018
The following expression matches lines that start with two digits (followed
by a forward slash or period):
05/12/18
05/12/2018
05.12.18
05.12.2018
The following expression matches lines that end with a forward slash or
period, followed by two digits:
05/12/18
05/12/2018
05.12.18
05.12.2018
05/12/18
05.12.18
The following expression matches lines that contain four consecutive digits:
05/12/2018
05912918
05.12.2018
0591292018
The following expression matches lines that end with four consecutive dig-
its that are preceded by a forward slash or period:
05/12/2018
05.12.2018
Remove the “.” in the preceding RE to obtain the following expression that
matches the pattern mm/dd/yyyy:
05/12/2018
Keep in mind that it’s common for people to start a date with a
one-character month (for example, 5 instead of 05, which requires matching one digit
instead of two digits). A single RE for this scenario probably requires the “or”
operator | to handle both possibilities for the first nine months of the year.
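One possible way to write such an RE is sketched below; it is an illustration under stated assumptions, not the book's own solution. The month is matched by 0?[1-9] (an optional leading zero for months 1 through 9) or by 1[0-2] (months 10 through 12). The sample lines are partly invented.

```shell
# Sample dates; the second, third, and fourth lines are invented
sample='05/12/2018
5/12/2018
12/25/2018
13/01/2018
05912918'

# month / two-digit day / four-digit year, anchored at both ends
printf '%s\n' "$sample" | grep -E '^(0?[1-9]|1[0-2])/[0-9]{2}/[0-9]{4}$'
```

The command prints the first three lines. The line starting with 13 is rejected because neither alternative accepts 13 as a month. The day field is only checked for its length here, so a day such as 99 would still be accepted.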
05/12/2018
05912918
05.12.2018
0591292018
There is also a simpler way to match multiple consecutive digits via the
\d character class. The following expression uses egrep to match lines that
contain three consecutive digits:
05/12/2018
05912918
05.12.2018
0591292018
Here are some REs that use egrep in order to match some common patterns:
The following expression uses egrep to match lines that contain any pair
of digits followed by a non-digit character:
egrep "\d{2}\D" lines6.txt
05/12/18
05/12/2018
05.12.18
05.12.2018
The following expression uses egrep to match lines that contain three
pairs of digits that are separated by a non-digit character:
05/12/18
05/12/2018
05.12.18
05.12.2018
The following expression uses egrep to match lines that contain three
pairs of digits that are separated by a non-digit character, and also exclude four-
digit sequences:
05/12/18
05.12.18
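The shorthands \d and \D come from Perl-style syntax rather than from the POSIX definition of extended REs, so they may not be accepted by every grep. As a portable sketch, the same three ideas can be written with [0-9] and [^0-9]. The data below repeats the six lines shown in the outputs above; the exact REs are assumptions, since they are reconstructed from the descriptions.

```shell
# The six sample lines that appear in the outputs above
sample='05/12/18
05/12/2018
05912918
05.12.18
05.12.2018
0591292018'

# A pair of digits followed by a non-digit character
printf '%s\n' "$sample" | grep -E '[0-9]{2}[^0-9]'

# Three pairs of digits separated by a non-digit character
printf '%s\n' "$sample" | grep -E '[0-9]{2}[^0-9][0-9]{2}[^0-9][0-9]{2}'

# The same, anchored at both ends so that four-digit years are excluded
printf '%s\n' "$sample" | grep -E '^[0-9]{2}[^0-9][0-9]{2}[^0-9][0-9]{2}$'
```

The first two commands print the four lines that contain a separator, and the anchored version prints only 05/12/18 and 05.12.18.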
05/12/2018
123-45-6789
Now that you’ve seen some working examples of REs, let’s summarize our
understanding of metacharacters.
This section is marked optional because it’s more complex than the other
examples in this chapter, and also because the solution involves Perl. If you
have some knowledge of Perl, then you will probably be comfortable with this
example. If you are new to Perl, you might benefit from reading at least a por-
tion of the Perl Appendix before delving into the code sample in this section.
This example illustrates how to use a basic RE to solve a common data-
related task. Until now we’ve focused on the “Find” applications, but REs are
just as useful in Find/Replace scenarios. For our first example of this type, we
will show how it works in Perl. In later chapters we’ll show how to do this exact
use case in other languages and environments. If you are interested in more
Perl examples, the appendix contains many, including both Perl versions of this
chapter’s examples as well as some of the more advanced concepts from later
chapters.
Listing 1.7 displays the contents of alphanums.txt, which consists of
two comma-separated fields in each row. Question: how would you split each
row into three fields consisting of numbers and alphabetic characters?
"CCC_9012_2",3YZ
"DDD_3456_1",4WX
One Perl-based solution for replacing everything except digits
with blank spaces is shown here:
1234 4 1
5678 3 2
9012 2 3
3456 1 4
Notice that every line in the preceding output starts with five blank spaces,
because the original lines start with five non-digits. Read the Appendix that
contains an assortment of Perl-based REs, many of which are counterparts to
the code snippets in this chapter.
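To illustrate how the same Find/Replace idea carries over to another tool, here is a sketch that uses sed instead of Perl. It is an assumed equivalent rather than the book's code: it replaces every non-digit character with a blank space, using the two rows of alphanums.txt that are shown above.

```shell
# The two rows of alphanums.txt shown above
rows='"CCC_9012_2",3YZ
"DDD_3456_1",4WX'

# s/[^0-9]/ /g : substitute a space for every character that is not a digit
printf '%s\n' "$rows" | sed 's/[^0-9]/ /g'
```

Each output line starts with five blank spaces, one for each of the five leading non-digits (the double quote, three letters, and the underscore).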
The key point of this section is that you can apply your knowledge of the
REs to solve a variety of tasks in multiple programming languages, but there
may be (usually small) language-specific syntax differences in both the
command and in how the output is presented.
Useful Links
Although this chapter (and the next one as well) uses the grep and egrep
commands for testing REs, there are also websites that enable you to test
whether or not a text string matches an RE. For example, the following website
provides an interface for testing REs:
https://regex101.com/
Navigate to the preceding website, enter an RE in the “Regular Expres-
sion” field, and then specify a text string in the “Test String” field. The right
panel displays whether or not a full or partial match succeeded, along with a
description of the details of the RE.
A search for “regular expressions in <language>” will usually turn up useful
syntax links, beyond what is covered in this text.
Chapter Summary
This chapter extends the material in Chapter 1, with examples of
interesting and more sophisticated REs that match ISBNs, email addresses,
and so forth. The REs in this chapter also use the grep (or egrep)
command, just as we did in Chapter 1.
The first (short yet relevant) section of this chapter contains tips for “think-
ing in REs”, specifically designed to help you solve new tasks involving REs.
Although this section will not make you an expert in REs, you will learn useful
guidelines for creating REs to solve a variety of tasks.
The second section in this chapter contains REs that match dates, phone
numbers, and zip codes. You will also see REs that match various types of num-
bers, such as integers, decimals, hexadecimals, octals, and binary numbers. In
addition, you will learn how to create REs for scientific numbers.
The third section contains REs that match IP addresses and simple com-
ment strings (in source code), as well as REs for matching proper names and
ISBNs. The final section discusses capture groups and back references, which
are useful for more complex pattern matches.
This section provides a simple methodology for creating the REs in this
chapter as well as REs for your own projects. After you have crafted an RE
for a task, then you can focus on simplifying that RE. However, there is often
a trade-off: compact REs that are also complex and sophisticated tend to be
more difficult for other people to understand (and hence more difficult to
debug and to enhance), whereas lengthier REs that are based on a combina-
tion of simpler REs can be simpler to manage in applications. When in doubt,
include a well-structured comment block that concisely explains the purpose
of the RE.
Someone looking at this later should be able to parse out “begins with
either a +, a -, or neither, and the rest of the characters on the line must be one
or more digits” by working from left to right.
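An RE that fits this left-to-right description can be sketched as follows. The RE and the sample lines are assumptions made for illustration: an optional sign, then one or more digits, with anchors at both ends of the line.

```shell
# Invented sample lines
sample='42
+42
-42
4 2
+-42
forty-two'

# ^        start of the line
# [+-]?    an optional + or -
# [0-9]+   one or more digits
# $        end of the line
printf '%s\n' "$sample" | grep -E '^[+-]?[0-9]+$'
```

The command prints 42, +42, and -42. The remaining lines fail because they contain a space, two signs, or letters.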
Although the preceding example is simple, you can use the same type of
analysis to solve more complex problems, along with the following points:
For example, consider the RE for ISBNs, which we will develop later in
this chapter. It consists of the concatenation of four REs. The solution pre-
sented in this chapter is lengthy, yet it requires about five minutes to create
Common Regex Tasks • 27
This section contains REs that match simple phone numbers, followed by
some more complex REs that handle phone extensions. You will also learn
about some of the features of a Google library (written in multiple program-
ming languages) that supports international phone numbers. This section is
surprisingly long (phone numbers are simple, n’est-ce pas?) and contains some
nice REs that will help you hone your ability to differentiate between phone
formats that have very slight differences.
As you might already know, different countries have special cases for their
phone numbers. In the United States, (408) 974-3218 is a valid number,
whereas (999) 974-3218 is invalid. Meanwhile, the numbers 0404 999
999 and (02) 9999 9999 are valid numbers in Australia, but the sequence
(09) 9999 9999 is invalid. In the United States, numbers with a 555 prefix
at the local level (e.g., [405] 555-3212) are conventionally used as fictitious numbers in
movies and similar works, so that no random person is bothered by fans
dialing the number.
Listing 2.1 displays the contents of phonenumbers.txt, which contains
various patterns for phone numbers.
The following RE matches U.S. phone numbers of the form ddd ddd dddd:
The following RE matches U.S. phone numbers of the form ddd ddd-dddd:
650 123-4567
The following RE matches U.S. phone numbers of the form (ddd) ddd-dddd:
(650) 123-4567
The following RE matches U.S. phone numbers of the form 1-(ddd) ddd-dddd:
1-(650) 123-4567
The following RE checks for numbers that have an optional dash “-” be-
tween the three groups of digits:
9405306123
The following RE checks for numbers that have an optional dash “-” or a
blank between the three groups of digits:
9405306123
650 123-4567
650 123 4567
Compare the preceding pair of similar REs to make sure that you under-
stand how (and why) they produce a different set of matching phone numbers.
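The comparison can be made concrete with the following sketch. Both REs are assumptions reconstructed from the descriptions above, and the sample lines reuse phone number formats that appear earlier in this section.

```shell
# Sample numbers in formats shown earlier in this section
sample='9405306123
650 123-4567
650 123 4567
(650) 123-4567'

# An optional dash between the three groups of digits
printf '%s\n' "$sample" | grep -E '^[0-9]{3}-?[0-9]{3}-?[0-9]{4}$'

# An optional dash or blank between the three groups of digits
printf '%s\n' "$sample" | grep -E '^[0-9]{3}[ -]?[0-9]{3}[ -]?[0-9]{4}$'
```

The first command prints only 9405306123, because a blank is not an accepted delimiter. The second command also prints the two numbers that start with 650 followed by a blank. Neither RE accepts the parenthesized area code.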
The following RE matches numbers with seven digits and also numbers with
ten digits, with extensions allowed (and delimiters are spaces, dashes, or periods):
^(?:(?:\+?1\s*(?:[.-]\s*)?)?(?:\(\s*([2-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9])\s*\)|([2-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9]))\s*(?:[.-]\s*)?)?([2-9]1[02-9]|[2-9][02-9]1|[2-9][02-9]{2})\s*(?:[.-]\s*)?([0-9]{4})(?:\s*(?:#|x\.?|ext\.?|extension)\s*(\d+))?$
There are phone-related libraries for other languages that rely on the
Google i18n phone number dataset:
PHP: https://github.com/giggsey/libphonenumber-for-php
Python: https://github.com/daviddrysdale/python-phonenumbers
Ruby: https://github.com/sstephenson/global_phone
C#: https://github.com/erezak/libphonenumber-csharp
Objective-C: https://github.com/iziz/libPhoneNumber-iOS
The following website provides a PHP script that validates phone numbers
based on a list of acceptable formats:
http://www.bitrepository.com/how-to-validate-a-telephone-number.html
The following RE matches strings that contain five digits (which is a com-
mon U.S. zip code pattern):
94053
94053-06123
9405306123
However, the third string in the preceding output is an invalid U.S. zip
code. Let’s see how to match either of the first two zip codes and exclude the
third (invalid) zip code.
The following expression matches U.S. zip codes consisting of five digits:
The following expression matches U.S. zip codes consisting of five digits
followed by a hyphen, and then followed by another five digits:
94053-06123
Recall from earlier examples that the “or” operator lets you combine both
expressions to properly sort out both valid U.S. zip code options:
94053
94053-06123
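A minimal Python 3 sketch of this "or" combination follows. The two sub-patterns are assumptions that mirror the sample data in this section; real ZIP+4 codes have four digits after the hyphen, so \d{5}-\d{4} would be the pattern for production data.

```python
import re

# Assumed patterns for the two formats used in this section's sample data.
FIVE = r"\d{5}"
FIVE_PLUS_FIVE = r"\d{5}-\d{5}"

# The "|" operator accepts either alternative; the anchors apply to both.
ZIP = re.compile(r"^(" + FIVE_PLUS_FIVE + "|" + FIVE + r")$")

samples = ["94053", "94053-06123", "9405306123"]
valid = [s for s in samples if ZIP.match(s)]
print(valid)
```

The anchors ^ and $ are what exclude the ten-digit string: without them, the five-digit alternative would match its first five digits.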
You can also define REs that match zip codes that end in a fixed pattern.
For example, the following RE matches U.S. zip codes that end with 43 or 58:
The preceding RE matches the zip code 94043 as well as the 94058 zip
code. On the other hand, the following RE matches zip codes that start with
three digits and end in either 53 or 23:
94053
94053-06123
9405306123
Valid Canadian postal codes are significantly different from U.S. zip codes:
they have the form A1A 1A1, where A is a capital letter and 1 is a digit (with a
space between the two triplets). The following RE matches Canadian postal codes:
egrep "^[A-Z][0-9][A-Z] [0-9][A-Z][0-9]" lines1.txt
V6K 8Z3
Matching email addresses is a complex task. This section provides REs that
match common email addresses that have the following pattern:
1. an initial string having at least four characters and at most twelve char-
acters (which can be any combination of lowercase letters, uppercase
letters, or digits), then
Here is the RE that has the structure described in the preceding list (which
also requires egrep instead of grep) that matches an email address:
jsmith@acme.com
There are a few points to keep in mind regarding the preceding RE. First,
it only matches email addresses with the suffix “.com”. Second, longer (yet still
valid) email addresses are excluded, such as the one shown here:
myverylongemailaddress@acme.com
Consequently, you need to make decisions about the allowable set of email
addresses that you want to match with your RE.
The following RE extends that structure by requiring a dot “.” within the
initial portion of the email address:
egrep "^[A-Za-z0-9]{4,12}\.[A-Za-z0-9]{4,12}\@[A-Za-z0-9]{4,8}\.com$" lines1.txt
john.smith@acme.com
The portion \.[A-Za-z0-9]{4,12} of the preceding RE shows you how to match
the dot “.” character, followed by an alphanumeric string that has at least four
characters and at most twelve characters.
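The two forms discussed so far can also be covered by one pattern. The following Python 3 sketch is a variation, not the book's RE: it wraps the dotted portion in an optional group, and the names EMAIL and is_simple_email are hypothetical.

```python
import re

# One pattern for both forms: the "(\. ... )?" group makes the dotted part optional.
EMAIL = re.compile(
    r"^[A-Za-z0-9]{4,12}"        # 4 to 12 alphanumeric characters
    r"(\.[A-Za-z0-9]{4,12})?"    # optionally a dot and 4 to 12 more
    r"@[A-Za-z0-9]{4,8}"         # "@" and a 4-to-8-character domain name
    r"\.com$"                    # the ".com" suffix, then end of string
)

def is_simple_email(address):
    """Return True if the address fits the restricted pattern above."""
    return EMAIL.match(address) is not None

for address in ["jsmith@acme.com", "john.smith@acme.com",
                "myverylongemailaddress@acme.com"]:
    print(address, is_simple_email(address))
```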
There are other combinations of characters that form valid email addresses
that have not been discussed in this section. For example, consider the follow-
ing email addresses:
dave.edward.smith@gmail.com
dave-777-smith@artist.net
Dave-777-Smith@artist.net
The REs that match the preceding email addresses are an exercise for you.
Most applications manage the complexity by only focusing on the following
patterns:
• It must have a period after the @, and at least one character between the
@ and the period.
• It must have at least one character after the period.
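These loose rules can be sketched as a single Python 3 pattern. The requirement of at least one character before the @ is an assumption added here, and LOOSE_EMAIL and looks_like_email are hypothetical names.

```python
import re

# Assumption: at least one character is also required before the "@".
LOOSE_EMAIL = re.compile(
    r"^[^@\s]+"     # one or more characters before the "@"
    r"@[^@\s.]+"    # "@" and at least one character before the period
    r"\.[^@\s]+$"   # a period and at least one character after it
)

def looks_like_email(address):
    """Return True if the address passes the loose structural checks."""
    return LOOSE_EMAIL.match(address) is not None

for address in ["dave.edward.smith@gmail.com", "Dave-777-Smith@artist.net",
                "user@.com", "user@host"]:
    print(address, looks_like_email(address))
```

A check this loose accepts the three addresses listed earlier, at the cost of also accepting many strings that are not deliverable addresses.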
#acbedf
#abd
#acbedf
#AcBeDf
#ABD
#fad
#ABCDEF
#a1b2d3
#abd
#a1b2d3
#A1B2D4
#A3b5D7
#acbedf
#AcBeDf
#ABD
#fad
#f00
#F00
#FF0000
#ABCDEF
#123456
This section contains examples of REs that match integers, floating point
numbers, hexadecimal numbers, octal numbers, and binary numbers. The
subsequent section discusses REs for scientific numbers, which are a “gener-
alization” of decimal numbers: they are more complex, and so they merit their
own section.
Listing 2.4 displays the contents of numbers.txt, which is used in code
snippets in this section.
#hexadecimal numbers
12345
FA4389
0xFA4389
0X4A3E5C
#octal numbers
1234
03434
#binary numbers
010101
110101
0b010101
1234
-123
1234
1234.432 why?
0.458 why?
010101
110101
0b010101 why?
1234
03434
1234
-123
1234.432
-123.528
0.458
010101
110101
1234
03434
1234.432
-123.528
0.458
1234
12345
FA4389
1234
03434
010101
110101
0b010101
The last string in the preceding output matches the initial pattern (because
of the lowercase “b”). Remove the a-f section of the preceding RE if you
want to exclude strings that contain lowercase letters.
Notice that numbers that are integers, octal numbers, and binary numbers
also appear in the preceding list (because they are valid hexadecimal numbers).
1234
12345
FA4389
0xFA4389
0X4A3E5C
1234
03434
010101
110101
0b010101
Once again, notice that numbers that are integers, octal numbers, and bi-
nary numbers also appear in the preceding list (because they are valid hexa-
decimal numbers).
You can also match two-digit hexadecimal “couplets” that are separated
by a blank space. For example, the following RE matches the string A3 B6
3F 62 (the Bash echo command “echoes” the string in quotes):
1234
110101
1234
Notice that there are two occurrences of the number 1234: the first one
appears as an integer (and it’s a valid octal number) and the second one ap-
pears in the section with octal numbers. Moreover, the number 110101 from
the binary section is also a valid octal number.
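The overlap between the bases is easy to see when the same string is tested against all three character sets. The following Python 3 sketch uses assumed, unprefixed patterns (the book's REs are not reproduced here), and bases is a hypothetical helper.

```python
import re

# Assumed, unprefixed patterns for each base.
PATTERNS = {
    "binary": re.compile(r"^[01]+$"),
    "octal": re.compile(r"^[0-7]+$"),
    "hexadecimal": re.compile(r"^[0-9A-Fa-f]+$"),
}

def bases(text):
    """Return the names of every base whose pattern accepts the text."""
    return [name for name, pattern in PATTERNS.items() if pattern.match(text)]

for text in ["110101", "1234", "FA4389", "0b010101", "0xFA4389"]:
    print(text, bases(text))
```

Note that 0b010101 is reported as hexadecimal because “b” is a valid hexadecimal digit, whereas 0xFA4389 is rejected by all three because “x” is not a digit in any of them.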
The following RE matches octal numbers with a 0 prefix:
1234
1234
03434
Once again, there are two occurrences of the number 1234: the first one
appears as an integer (and it’s a valid octal number) and the second one ap-
pears in the section with octal numbers.
010101
110101
010101
110101
0b010101
This section contains examples of REs that match scientific numbers. List-
ing 2.5 displays the contents of numbers2.txt that will be used in code
snippets in this section.
egrep '^[+-]?\d*(([,.]\d{3})+)?([,.]\d+)?([eE][+-]?\d+)?$' numbers.txt
1234
-123
1234.432
-123.528
0.458
12345
1234
03434
010101
110101
0.123
+13
423.2e32
-7.20e+19
-.4E-8
-27.6603
+0005
1234.432
-123.528
0.458
12345
1234
03434
010101
110101
-123
1234.432
-123.528
0.458
12345
FA4389
0xFA4389
0X4A3E5C
1234
03434
010101
110101
0b010101
*** Option #5:
--------------
0.123
z = 0xFFFF00;
+13
423.2e32
-7.20e+19
-.4E-8
-27.6603
+0005
125.e12
*** Option #6:
--------------
1234
-123
1234.432
-123.528
0.458
12345
FA4389
0xFA4389
0X4A3E5C
1234
03434
010101
110101
0b010101
*** Option #7:
--------------
0.123
z = 0xFFFF00;
+13
423.2e32
-7.20e+19
-.4E-8
-27.6603
+0005
125.e12
*** Option #8:
--------------
0.123
z = 0xFFFF00;
+13
423.2e32
-7.20e+19
-.4E-8
-27.6603
+0005
125.e12
As you can see, the REs in Listing 2.7 have varying degrees of success in
terms of matching scientific numbers. In general, they err on the side of
“false positives” (matching strings that are not valid scientific numbers) rather
than “false negatives” (rejecting strings that are valid scientific numbers).
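Both kinds of error can be observed with the RE from the egrep command at the start of this section. The Python 3 sketch below reuses that RE unchanged; the helper name accepted is hypothetical. Depending on the grep implementation, \d may not be recognized by egrep, in which case [0-9] is the portable spelling.

```python
import re

# The RE from the egrep command shown earlier in this section, unchanged.
SCI = re.compile(r"^[+-]?\d*(([,.]\d{3})+)?([,.]\d+)?([eE][+-]?\d+)?$")

def accepted(text):
    """Return True if the RE matches the whole string."""
    return SCI.match(text) is not None

for text in ["423.2e32", "-7.20e+19", "-.4E-8", "125.e12", "", "+"]:
    print(repr(text), accepted(text))
```

Because every part of the RE is optional, it accepts the empty string and a lone sign (false positives), while it rejects 125.e12, which many languages accept as a number (a false negative).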
Most real-world applications and programming languages use various numeric
variable “types” to deal with calculations. When they must convert
a number that is stored as text in scientific notation, they require you
either to use a predefined format or to tell the conversion routine which
format is being used. Consequently, most of the complexity of the matching
problem is confined to a reasonable subset that can be matched with a simpler
regular expression. Unless you are doing something like scanning documents
and pulling numbers out of free-form text without knowing ahead of
time which formats the numbers use, you should not normally encounter this
level of complexity.
As with any problem of this type, it is often easier to run several “match”
expressions and then filter out duplicates and false positives with other
program logic than to try to match every possibility with a single regular
expression.
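One way to sketch this "several expressions plus program logic" approach in Python 3 is shown below. The three patterns are hypothetical (one is deliberately loose), and the built-in float() conversion stands in for the "other program logic" that removes false positives.

```python
import re

# Hypothetical patterns: two precise ones plus a deliberately loose one.
PATTERNS = [
    re.compile(r"[+-]?\d+\.?\d*([eE][+-]?\d+)?"),  # 423.2e32, 125.e12, +13
    re.compile(r"[+-]?\.\d+([eE][+-]?\d+)?"),      # -.4E-8
    re.compile(r"[+-]?[\d.]+([eE][+-]?\d+)?"),     # loose: also accepts 1.2.3
]

def scientific_numbers(lines):
    """Keep each line once if any pattern matches it and float() accepts it."""
    kept = []
    for line in lines:
        token = line.strip()
        if token in kept:
            continue                      # filter out duplicates
        if not any(p.fullmatch(token) for p in PATTERNS):
            continue                      # no pattern matched
        try:
            float(token)                  # filter out false positives
        except ValueError:
            continue
        kept.append(token)
    return kept

lines = ["0.123", "z = 0xFFFF00;", "+13", "423.2e32", "-.4E-8",
         "125.e12", "+13", "1.2.3"]
print(scientific_numbers(lines))
```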
This section contains examples of REs that match IP addresses and sim-
ple comment strings (in source code). Listing 2.8 displays the contents of
lines2.txt that will be used in code snippets in this section.
// this is a comment
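The following RE matches lines that start with an IP-address-like string
containing exactly three digits in each of its four components (it does not
check that each component is in the range 0 to 255):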
egrep "^\d{3}\.\d{3}\.\d{3}\.\d{3}" lines2.txt
192.168.123.065
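A digit-count RE accepts strings such as 999.999.999.999, so a stricter check usually combines an RE for the shape with ordinary code for the range. The following Python 3 sketch is one such combination; SHAPE and is_ipv4 are hypothetical names, and allowing one to three digits per component is an assumption.

```python
import re

# Shape only: four groups of one to three digits separated by dots.
SHAPE = re.compile(r"^([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})$")

def is_ipv4(text):
    """Return True if the text has the right shape and every part is <= 255."""
    m = SHAPE.match(text)
    if m is None:
        return False
    return all(int(part) <= 255 for part in m.groups())

for text in ["192.168.123.065", "999.168.123.065", "192.168.1"]:
    print(text, is_ipv4(text))
```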
The following snippet matches the lines that contain the string http or
https:
ftp://www.acme.com
http://www.bdnf.com
https://www.ceog.com
a line with https://www.ceog.com embedded in it
The following snippet matches the lines that contain the string http or
https:
http://www.bdnf.com
https://www.ceog.com
a line with https://www.ceog.com embedded in it
The following snippet matches the lines that contain the string ftp, http,
or https:
ftp://www.acme.com
http://www.bdnf.com
https://www.ceog.com
a line with https://www.ceog.com embedded in it
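The same line filtering can be sketched in Python 3 with re.search(), which (like egrep) looks for the pattern anywhere in the line. The patterns below are assumptions, and grep is a hypothetical helper that imitates egrep's line-by-line behavior.

```python
import re

lines = [
    "ftp://www.acme.com",
    "http://www.bdnf.com",
    "https://www.ceog.com",
    "a line with https://www.ceog.com embedded in it",
]

def grep(pattern, lines):
    """Return the lines in which the pattern occurs anywhere."""
    return [line for line in lines if re.search(pattern, line)]

print(grep(r"https?", lines))       # http or https, anywhere in the line
print(grep(r"ftp|https?", lines))   # ftp, http, or https
print(grep(r"^https?", lines))      # http or https at the start of the line
```

Adding the ^ anchor is what excludes the line in which the URL is embedded in other text.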
The following snippet matches the lines that contain the string http em-
bedded in the line of text:
Let’s consider the following REs that match male names having only a last
name (we’ll handle the proper names with a first name and multiple middle
names later in this section):
^Mr\.?\s[A-Z][a-z]+$
^Mr\.?\s[A-Z]$
Now let’s consider the following REs that match female names.
^Ms\.?\s[A-Z][a-z]+$
^Ms\.?\s[A-Z]$
^Mrs\.?\s[A-Z][a-z]+$
^Mrs\.?\s[A-Z]$
Finally, let’s consider the following REs that match male names as well as
female names, and let’s see how they differ:
^M([rs]|(rs))\.?\s[A-Z]([a-z]+)?$
^Mr\.?\s[A-Z]\w*$
^M(r|s|rs)\.?\s[A-Z]\w*$
Now we can match proper names that contain a first name with the following
REs:
^M([rs]|(rs))\.?\s[A-Z]([a-z]+\s+[A-Za-z]*)$
^M([rs]|(rs))\.?\s[A-Z]([a-z]+(\s*[A-Za-z])+)$
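To see how these REs behave, the following Python 3 sketch applies two of them (copied from the text) to a few hypothetical names. The constant names TITLE_LAST and TITLE_FULL are assumptions made for this example.

```python
import re

# Both patterns are taken from the text above.
TITLE_LAST = re.compile(r"^M(r|s|rs)\.?\s[A-Z]\w*$")
TITLE_FULL = re.compile(r"^M([rs]|(rs))\.?\s[A-Z]([a-z]+(\s*[A-Za-z])+)$")

names = ["Mr. Smith", "Mrs Jones", "Ms. K", "Dr. Who", "Mr. John Smith"]

for name in names:
    print(name,
          TITLE_LAST.match(name) is not None,
          TITLE_FULL.match(name) is not None)
```

Neither RE accepts the prefix Dr., and the second RE rejects a bare initial such as Ms. K because it requires at least one lowercase letter after the capital.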
There are additional cases to consider for proper names. For example, you
might need to match suffixes such as Jr., Sr., or Esq. In addition, you might
need to consider prefixes such as Sir, Count, Lord, Dr, Dr., Prof, and Professor
(among others). Thus, REs that match proper names can involve many nuances,
and it’s a good idea to determine (to the extent that it’s possible to do so)
which prefixes and suffixes you need to match before you embark on the
task of creating the appropriate REs.
The next section contains REs for matching ISBNs (that are more complex
than the REs in this section) and also illustrates the same divide-and-conquer
technique.
The REs in this section involve multiple simpler REs that are concatenated
using the pipe “|” symbol, which indicates an “OR” operation (as described in
the first section of this chapter).
As a simplified example, suppose we want to construct an RE that matches
the following strings:
a123
ab123
abc123
abcd123
The solution is easy to construct when you describe the strings in an Eng-
lish sentence: the strings start with either an a OR an ab OR an abc OR an
abcd, AND all of them have the number 123 as the rightmost portion. Using
the “|” symbol we can construct the RE like this:
^(a|ab|abc|abcd)123$
Now let’s consider valid ISBNs, which can start with the optional string
ISBN, and also contain either ten-digit sequences or thirteen-digit sequences.
Listing 2.11 displays the contents of ISBN.txt, which contains examples of
valid ISBN numbers.
Notice that the first line in Listing 2.11 contains the string ISBN followed
by a blank space, and the next two lines contain the string ISBN followed by a
hyphen, and then two more digits, and then either a colon “:” or a blank space.
Those two lines end with a hyphenated thirteen-digit number and a hyphen-
ated ten-digit number, respectively.
The fourth line in Listing 2.11 contains a thirteen-digit number with white
spaces; the fifth line contains a “pure” thirteen-digit number; and the sixth line
contains a hyphenated ten-digit number.
Now let’s see how to match the numeric portion of the ISBNs in Listing
2.11. The following RE matches the digits in the first and the second line:
\d{3}-\d-\d{3}-\d{5}-\d
The following RE matches the digits in the third line as well as the sixth
line:
\d-\d{3}-\d{5}-\d
\d{13}
Now let’s create REs for the text prefix (when present) and combine them
with the earlier list of REs to match all of the lines in Listing 2.11. The result
involves four REs, as shown in the following:
ISBN 978-0-596-52068-7
ISBN-13: 978-0-596-52068-7
ISBN-10 0-596-52068-9
978-0-596-52068-7
978 0 596 52068 7
9780596520687
0-596-52068-9
Now we can combine the preceding four REs to create a single RE that
matches every valid ISBN in the text file ISBN.txt:
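The combined RE itself is not reproduced here, so the following Python 3 sketch shows one possible combination under stated assumptions: an optional ISBN, ISBN-10, or ISBN-13 prefix (with an optional colon), followed by any of the numeric forms discussed above. It checks the format only, not the ISBN check digit, and ISBN_RE is a hypothetical name.

```python
import re

# An assumed combination of prefix and numeric REs; format check only.
ISBN_RE = re.compile(
    r"^(?:ISBN(?:-1[03])?:?\s+)?"                 # optional text prefix
    r"(?:\d{3}[- ]\d[- ]\d{3}[- ]\d{5}[- ]\d"     # 13 digits, hyphens or spaces
    r"|\d{13}"                                    # 13 digits, no separators
    r"|\d-\d{3}-\d{5}-\d)$"                       # 10 digits, hyphenated
)

samples = [
    "ISBN 978-0-596-52068-7",
    "ISBN-13: 978-0-596-52068-7",
    "ISBN-10 0-596-52068-9",
    "978-0-596-52068-7",
    "978 0 596 52068 7",
    "9780596520687",
    "0-596-52068-9",
]

matched = [s for s in samples if ISBN_RE.match(s)]
print(len(matched), "of", len(samples), "lines matched")
```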
^[A-Z]-\d{3}
^([A-Z]+)-\d{3}
Note that the capture group consists of one or more capital letters that ap-
pear at the beginning of a line because of the ^ metacharacter. You can refer-
ence this capture group as \1. You can define nine capture groups, designated
as \1 through \9. Here is another example:
^([A-Z]+)-(\d{3})-\d{4}
In the preceding code snippet, the first capture group, \1, refers to
([A-Z]+) and consists of one or more capital letters that appear at the
beginning of a line. The second capture group, \2, refers to (\d{3}) and consists
of three consecutive digits that appear after the first capture group and a
hyphen (the hyphen itself is not captured). More information about capture
groups is available here:
http://www.rexegg.com/regex-capture.html
http://www.rexegg.com/regex-lookarounds.html
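For readers who want to experiment, the two capture groups can be inspected from Python 3 as follows. The sample string ABC-123-4567 and the helper name parts are assumptions made for this sketch.

```python
import re

PATTERN = re.compile(r"^([A-Z]+)-(\d{3})-\d{4}")

def parts(text):
    """Return the two captured groups as a tuple, or None if there is no match."""
    m = PATTERN.match(text)
    return m.groups() if m else None

print(parts("ABC-123-4567"))

# In a replacement string, \1 and \2 refer to the captured text.
swapped = re.sub(r"^([A-Z]+)-(\d{3})", r"\2-\1", "ABC-123-4567")
print(swapped)
```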
Now consider the following RE that uses a back reference in order to de-
tect duplicate (consecutive) words (with uppercase letters):
\b([A-Z]+)\s+\1\b
([A-Z]+)
Finally, use the term \1 later in the RE in order to back reference that
matched pattern.
If you are unfamiliar with back references, they might require some practice
to become comfortable with them and to see when they can be advantageous.
For example, compare the following pair of REs:
"\b([A-Z]+)\s+\1\b"
"\b([A-Z]+)\s+([A-Z]+)\b"
The two REs look similar, but they do not match the same strings. The first
RE matches only a repeated word (such as THE THE), because \1 must match
the same text that the capture group matched. The second RE matches any two
consecutive uppercase words (such as THE CAT), so it cannot detect duplicates.
Just to be sure, now let’s test the preceding RE with the egrep utility to
see if it finds duplicate (uppercase) words in the text file duplicates.txt:
egrep "\b([A-Z]+)\s+\1\b" duplicates.txt
Now let’s test the following RE that searches for duplicate words containing
lowercase letters as well as uppercase letters:
egrep "\b([A-Za-z]+)\s+\1\b" duplicates.txt
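The same duplicate-word test can be sketched in Python 3. The sample lines are hypothetical (the contents of duplicates.txt are not shown here), and the second pattern is included only to show what happens without a back reference.

```python
import re

DUPLICATE = re.compile(r"\b([A-Za-z]+)\s+\1\b")          # same word twice
ANY_TWO = re.compile(r"\b([A-Za-z]+)\s+([A-Za-z]+)\b")   # any two words

def first_duplicate(line):
    """Return the first consecutively repeated word in the line, or None."""
    m = DUPLICATE.search(line)
    return m.group(1) if m else None

print(first_duplicate("this is is a test"))
print(first_duplicate("this is a test"))
print(ANY_TWO.search("THE CAT") is not None)
```

The \b anchors matter here: without them, the “is” at the end of “this” followed by “is” would be reported as a duplicate in the second line.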
The following list contains guidelines for improving the performance of REs:
Chapter Summary
In this chapter you learned some tips for “thinking in REs”, which will be
helpful when you are faced with new tasks involving REs. Next you saw a
variety of RE tasks that seem simple but turn out to be more complex in
practice, such as matching phone numbers and scientific notation.
Finally, you were exposed to important concepts about testing REs and
some basic rules on RE performance.
Chapter 3
REs in Python
This chapter introduces you to REs in Python, with a mixture of code
blocks and complete code samples that cover many of the topics that
are discussed in Chapter 1. Since the details about metacharacters and
character classes in Python are virtually identical to the information that you
learned in Chapter 1, you can probably read this chapter quickly (even if you
are only interested in a cursory view of Python and REs). If you are interested
in learning more about Python, perhaps after you become comfortable with
Python syntax, a Python Pocket Primer is available here:
https://www.amazon.com/dp/B00KGF0PJA.
The first part of this chapter shows you how to define REs with digits and
letters (uppercase as well as lowercase), and also how to use character sets
and character classes in REs.
The second portion discusses the Python re module, which contains several
useful methods, such as the re.match() method for matching groups
of characters, the re.search() method to perform searches in character
strings, and the re.findall() method. You will also learn how to group
character classes in REs.
The final portion of this chapter contains an assortment of code samples,
such as modifying text strings, splitting text strings with the re.split()
method, and substituting text strings with the re.sub() method.
There are several points to keep in mind before you read this chapter and
after you have installed Python on your machine. First, you need basic
proficiency in Python to be comfortable with the code samples. If necessary, read
some rudimentary online Python tutorials in preparation for this chapter.
Second, the code samples were written for Python 2.7, and some fairly minor
changes are necessary in order to convert them to Python 3.x (most visibly,
each print statement becomes a call to the print() function). A good starting
point is the Python 3 documentation for REs:
https://docs.python.org/3/howto/regex.html.
As you know from Chapter 2, you can define REs to match characters,
digits, telephone numbers, zip codes, or email addresses. The re module
(added in Python 1.5) provides Perl-style RE patterns (Perl REs are dis-
cussed in an Appendix). Note that earlier versions of Python provided the
regex module that was removed in Python 2.5. The re module provides
an assortment of methods (discussed later in this chapter) for searching text
strings or replacing text strings, which is similar to the basic search and/or
replace functionality that is available in word processors (but usually without
RE support). The re module also provides methods for splitting text strings
based on REs.
Before delving into the methods in the re module, let’s quickly review the
various metacharacters and character classes for Python.
Metacharacters in Python
Python supports a set of metacharacters, most of which are the same as the
metacharacters in other scripting languages such as Perl, as well as program-
ming languages such as JavaScript and Java. The complete list of metacharac-
ters in Python is here:
. ^ $ * + ? { } [ ] \ | ( )
? (matches 0 or 1): the expression ca?t matches ct or cat but not caat
* (matches 0 or more): the expression ca*t matches ct or cat or caat
+ (matches 1 or more): the expression ca+t matches cat or caat but not ct
^ (beginning of line): the expression ^[a] matches the string abc (but
not bc)
$ (end of line): [c]$ matches the string abc (but not cab)
. (a single dot): matches any single character (except a newline)
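The entries in this list can be checked directly in Python 3. In the sketch below, full is a hypothetical helper around re.fullmatch(), which succeeds only when the pattern matches the entire string.

```python
import re

def full(pattern, text):
    """Return True if the pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None

for pattern in ["ca?t", "ca*t", "ca+t"]:
    print(pattern, [word for word in ["ct", "cat", "caat"] if full(pattern, word)])

print(re.search(r"^[a]", "abc") is not None)   # anchored at the beginning
print(re.search(r"[c]$", "cab") is not None)   # anchored at the end
print(full(r"c.t", "c\nt"))                    # the dot does not match a newline
```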
You can treat metacharacters as literal characters in two ways. The first way
is by “escaping” their symbolic meaning with the backslash (“\”) character. Thus, the
sequences \?, \*, \+, \^, \$, and \. represent the literal characters
instead of their symbolic meaning. You can also “escape” the backslash character
with the sequence “\\”. If you have two consecutive backslash characters,
you need an additional backslash for each of them, which means that “\\\\”
is the “escaped” sequence for “\\”.
The second way is to list the metacharacters inside a pair of square brackets.
For example, [+?] treats the two characters “+” and “?” as literal characters
instead of metacharacters. The second approach is obviously more compact
and less prone to error (it’s easy to forget a backslash in a long sequence of
metacharacters). As you might surmise, the methods in the re module support
metacharacters.
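Both approaches can be compared side by side in Python 3. The sample text below is hypothetical. The sketch also uses re.escape(), a function in the re module that adds the backslashes for you, and it relies on raw strings (the r prefix) so that each backslash reaches the re module unchanged.

```python
import re

text = "1+1=2? yes."

escaped = re.findall(r"\+|\?|\.", text)   # first way: backslashes
bracketed = re.findall(r"[+?.]", text)    # second way: square brackets
print(escaped, bracketed)

# re.escape() builds a pattern that matches its argument literally.
literal = re.escape("a+b?")
print(literal, re.fullmatch(literal, "a+b?") is not None)

# A literal backslash is written as \\ in the pattern.
print(re.findall(r"\\", r"C:\temp"))
```

Keep in mind that a few characters remain special inside square brackets, notably a leading ^, the backslash, the closing bracket, and a hyphen between two characters.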
NOTE The “^” character that is to the left (and outside) of a sequence in square
brackets (such as ^[A-Z]) “anchors” the RE to the beginning of a line,
whereas the “^” character that is the first character inside a pair of square
brackets (such as [^A-Z]) negates the set of characters inside the square brackets.
“^[a-z]” means any string that starts with any lowercase letter
“[^a-z]” means any single character that is not a lowercase letter (so it
matches any string that contains at least one such character)
“^[^a-z]” means any string that starts with anything except a lowercase
letter
“^[a-z]$” means a single lowercase letter
“^[^a-z]$” means a single character (including digits) that is not a
lowercase letter
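These five patterns can be tried out in Python 3 with re.search(), which looks for the pattern anywhere in the string unless the pattern itself is anchored. The helper name found and the sample strings are hypothetical.

```python
import re

def found(pattern, text):
    """Return True if the pattern occurs somewhere in the text."""
    return re.search(pattern, text) is not None

print(found(r"^[a-z]", "abc"), found(r"^[a-z]", "Abc"))
print(found(r"[^a-z]", "abC"), found(r"[^a-z]", "abc"))
print(found(r"^[^a-z]", "1abc"), found(r"^[^a-z]", "abc"))
print(found(r"^[a-z]$", "a"), found(r"^[a-z]$", "ab"))
print(found(r"^[^a-z]$", "7"), found(r"^[^a-z]$", "a"))
```

In each pair, the first call is expected to succeed and the second to fail.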
print 'text1:',text1
print 'text2:',text2
The RE in Listing 3.1 might seem daunting if you are new to REs, but let’s
demystify its contents by examining the entire expression and then the meaning
of each character. First of all, the term [/\.*?=+] matches a forward
slash (“/”), a dot (“.”), an asterisk (“*”), a question mark (“?”), an equals
sign (“=”), or a plus sign (“+”). Notice that the dot “.” is preceded by a
backslash character “\”. Doing so “escapes” the meaning of the “.” metacharacter
(which matches any single character except a newline) and treats it as a
literal character. Inside square brackets the dot is already treated as a
literal character, so the backslash is harmless but optional.
Thus the term [/\.*?=+]+ means “one or more occurrences of any
of the metacharacters—treated as literal characters—inside the square
brackets”.
Consequently, the expression re.sub("[/\.*?=+]+","",text1)
means “search the text string text1 and replace any of the metacharacters
(treated as literal characters) found with an empty string ("")”.
The output from Listing 3.1 is here:
Later in this chapter you will learn about other functions in the re module
that enable you to modify and split text strings.
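As a minimal Python 3 sketch of the re.sub() call described above, the following code removes the listed characters from a hypothetical input string (this is not the string used in Listing 3.1).

```python
import re

text1 = "a/b.c*d?e=f+g"   # hypothetical input, not the one from Listing 3.1
text2 = re.sub(r"[/\.*?=+]+", "", text1)

print("text1:", text1)
print("text2:", text2)

# Because of the trailing "+", a run of such characters is replaced once.
text3 = re.sub(r"[/\.*?=+]+", "-", "a+=?b")
print("text3:", text3)
```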
^[0-9].
However, the following expression matches a text string that does not start
with a digit because of the “^” metacharacter that is at the beginning of the
expression in square brackets as well as the “^” metacharacter that is to the
left (and outside) the expression in square brackets (which you learned in a
previous note):
^[^0-9]
Thus, the “^” character inside a pair of matching square brackets (“[]”)
negates the expression immediately to its right that is also located inside the
square brackets.
The backslash (“\”) allows you to “escape” the meaning of a metacharacter.
Consequently, a dot “.” matches any single character (except for a newline),
whereas the sequence “\.” matches the dot “.” character.
Other examples involving the backslash metacharacter are here:
Character classes are convenient expressions that are shorter and simpler
than their “bare” counterparts that you saw in the previous section. Some con-
venient character sequences that express patterns of digits and letters are as
follows:
Based on the preceding definitions, \d+ matches one or more digits and
\w+ matches one or more word characters (letters, digits, or underscores),
both of which are more compact expressions than using character sets. In
addition, we can reformulate the expressions in the previous section:
The curly braces (“{}”) are quantifiers: they specify the number
(or range) of repetitions of the expression that precedes them.
The re module provides the following methods for matching and search-
ing one or more occurrences of an RE in a text string:
NOTE The match() function only matches patterns at the start of a string.
The two methods match() and search() are discussed in this chapter, and
you can read online documentation regarding the Python findall() and
finditer() methods. The next section shows you how to use the match()
function in the Python re module.
The pattern parameter is the RE that you want to match in the string
parameter. The flags parameter allows you to specify multiple flags using the
bitwise OR operator that is represented by the pipe “|” symbol.
group(num=0): This method returns the entire match (or specific sub-
group num).
groups(): This method returns all matching subgroups in a tuple (empty
if there weren’t any).
The following code block illustrates how to use the group() function in REs:
>>> import re
>>> p = re.compile('(a(b)c)de')
>>> m = p.match('abcde')
>>> m.group(0)
'abcde'
>>> m.group(1)
'abc'
>>> m.group(2)
'b'
Notice that the higher numbers inside the group() method match more
deeply nested expressions that are specified in the initial RE.
Listing 3.2 displays the contents of MatchGroup1.py, which illustrates
how to use the group() function to match an alphanumeric text string and
an alphabetic string.
import re
line1 = 'abcd123'
line2 = 'abcdefg'
mixed = re.compile(r"^[a-z0-9]{5,7}$")
line3 = mixed.match(line1)
line4 = mixed.match(line2)
print 'line1:',line1
print 'line2:',line2
print 'line3:',line3.group(0)
print 'line4:',line4
print 'line5:',line4.group(0)
line6 = 'a1b2c3d4e5f6g7'
mixed2 = re.compile(r"^([a-z]+[0-9]+){5,7}$")
line7 = mixed2.match(line6)
print 'line6:',line6
print 'line7:',line7.group(0)
print 'line8:',line7.group(1)
line9 = 'abc123fgh4567'
mixed3 = re.compile(r"^([a-z]*[0-9]*){5,7}$")
line10 = mixed3.match(line9)
print 'line9:',line9
print 'line10:',line10.group(0)
line1: abcd123
line2: abcdefg
line3: abcd123
line4: <_sre.SRE_Match object at 0x1004854a8>
line5: abcdefg
line6: a1b2c3d4e5f6g7
line7: a1b2c3d4e5f6g7
line8: g7
line9: abc123fgh4567
line10: abc123fgh4567
Notice that line3 and line7 involve two similar but different REs. The
variable mixed specifies any combination of lowercase letters and digits
(in any order), where the length of the text string is between 5 and 7. The
string 'abcd123' satisfies these conditions.
On the other hand, mixed2 specifies a pattern consisting of pairs, where
each pair contains one or more lowercase letters followed by one or more
digits, and where the number of such pairs is between 5 and 7. The string
'a1b2c3d4e5f6g7' consists of seven such pairs and therefore matches,
whereas the string 'abcd123' forms only a single pair and does not match
mixed2.
The third RE mixed3 specifies a pair such that each pair consists of zero
or more occurrences of lowercase letters and zero or more occurrences of a
digit, and also that the number of such pairs is between 5 and 7. As you can
see from the output, the RE in mixed3 matches lowercase letters and digits
in any order.
In the preceding example, the RE specified a range for the length of the
string, which involves a lower limit of 5 and an upper limit of 7. However, you
can also specify a lower limit without an upper limit (or an upper limit without
a lower limit).
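The open-ended forms of the quantifier can be sketched in Python 3 as follows; the variable names and test strings are hypothetical.

```python
import re

at_least_five = re.compile(r"^[a-z]{5,}$")   # lower limit only
at_most_three = re.compile(r"^[a-z]{,3}$")   # upper limit only (zero is allowed)

print(at_least_five.match("abcdefghij") is not None)
print(at_least_five.match("abcd") is not None)
print(at_most_three.match("abc") is not None)
print(at_most_three.match("abcd") is not None)
print(at_most_three.match("") is not None)
```

Note that omitting the lower limit means "zero or more, up to the upper limit", so the last pattern also accepts an empty string.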
The following RE mixed4 specifies lowercase letters, and requires a
match of five, six, or seven such characters:
mixed4 = re.compile(r"^[a-z]{5,7}$")
line11 = mixed4.match(line1)
print 'line11:',line11
Since line1 only contains four lowercase letters, there is no match, and
in this case the output is None, as shown here:
line11: None
import re
alphas = re.compile(r"^[abcde]{5,}")
line1 = alphas.match("abcde").group(0)
line2 = alphas.match("edcba").group(0)
line3 = alphas.match("acbedf").group(0)
line4 = alphas.match("abcdefghi").group(0)
line5 = alphas.match("abcdefghi abcdef")
print 'line1:',line1
print 'line2:',line2
print 'line3:',line3
print 'line4:',line4
print 'line5:',line5
Listing 3.3 initializes the variable alphas as an RE that matches any string
that starts with at least five consecutive characters, each of which is one of
the letters a through e. The next portion of Listing 3.3 initializes the four
variables line1, line2, line3, and line4 by means of the alphas RE that is
applied to various text strings. These four variables are set to the entire
matched text by means of the expression group(0).
The output from Listing 3.3 is here:
line1: abcde
line2: edcba
line3: acbed
line4: abcde
line5: <_sre.SRE_Match object at 0x1004854a8>
Unlike the first four output lines, the output for line5 displays the match
object itself rather than the matched text, simply because .group(0) was not
specified in the definition of line5.
Listing 3.4 displays the contents of MatchGroup3.py, which illustrates
how to use an RE with the group() function to match words in a text string.
if matchObj:
    print "matchObj.group() : ", matchObj.group()
    print "matchObj.group(1) : ", matchObj.group(1)
    print "matchObj.group(2) : ", matchObj.group(2)
else:
    print "matchObj does not match line:", line
Capture Groups
You have already seen examples of capture groups, such as matchObj.
group(1) and matchObj.group(2), in the preceding section. The
groups contain the matched values, and the integer in the parentheses speci-
fies different capture groups.
Specifically, match.group(0) returns the fully matched string, whereas
match.group(1), match.group(2), and so forth return the capture
groups, numbered from left to right by their opening parentheses in the RE.
In addition, match.group() is the same as match.group(0).
Capture groups are powerful and can become quite complex, in part be-
cause a matching group can be a substring of an enclosing matching group,
similar to the way that “back references” work with the sed utility (discussed in
Chapter 5). If you want to learn more about capture groups, perform an Inter-
net search where you can find some examples of highly complex capture groups.
The match() method supports various optional modifiers that affect the
type of matching that will be performed. As you saw in the previous example,
you can also specify multiple modifiers separated by the OR (“|”) symbol. Ad-
ditional modifiers that are available for RE are shown here:
re.M makes $ match the end of a line and makes ^ match the start of any line
re.S makes a period (“.”) match any character (including a newline)
re.U interprets letters according to the Unicode character set
Experiment with these modifiers by writing Python code that uses them in
conjunction with different text strings.
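As a starting point for such experiments, the following Python 3 sketch applies re.M and re.S to a hypothetical two-line string, and combines re.M with re.I (which ignores case) by means of the “|” symbol.

```python
import re

text = "first line\nSecond line"

no_flag = re.findall(r"^\w+", text)          # ^ matches only at the very start
multiline = re.findall(r"^\w+", text, re.M)  # ^ matches at the start of each line

dot_plain = re.search(r"line.Second", text)        # "." does not match "\n"
dot_all = re.search(r"line.Second", text, re.S)    # "." matches "\n" as well

combined = re.findall(r"^second.*$", text, re.M | re.I)

print(no_flag, multiline)
print(dot_plain is None, dot_all is not None)
print(combined)
```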
As you saw earlier in this chapter, the re.match() method only matches
from the beginning of a string, whereas the re.search() method can suc-
cessfully match a substring anywhere in a text string.
The re.search() method takes two arguments, an RE pattern and a
string, and then searches for the specified pattern in the given string. The
search() method returns a match object (if the search was successful) or
None. As a simple example, the following searches for the pattern tasty fol-
lowed by a five-letter word:
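import re
# sample string (an assumption; any text that contains 'tasty'
# followed by a five-letter word works)
text = 'a tasty pizza for lunch'
match = re.search(r'tasty \w\w\w\w\w', text)
if match:
    ## 'found tasty pizza'
    print 'found', match.group()
else:
    print 'Nothing tasty here'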
The following code block further illustrates the difference between the
match() method and the search() method:
>>> import re
>>> print re.search('this', 'this is the one').span()
(0, 4)
>>>
>>> print re.search('the', 'this is the one').span()
(8, 11)
>>> print re.match('this', 'this is the one').span()
(0, 4)
>>> print re.match('the', 'this is the one').span()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'span'
import re
str1 = "123456"
matches1 = re.findall("(\d+)", str1)
print 'matches1:',matches1
str1 = "123456"
matches1 = re.findall("(\d\d\d)", str1)
print 'matches1:',matches1
str1 = "123456"
matches1 = re.findall("(\d\d)", str1)
print 'matches1:',matches1
print
str2 = "1a2b3c456"
matches2 = re.findall("(\d)", str2)
print 'matches2:',matches2
print
str2 = "1a2b3c456"
matches2 = re.findall("\d", str2)
print 'matches2:',matches2
print
str3 = "1a2b3c456"
matches3 = re.findall("(\w)", str3)
print 'matches3:',matches3
Listing 3.5 contains simple REs (which you have seen already) for match-
ing digits in the variables str1 and str2. The final code block of Listing 3.5
matches every character in the string str3, effectively “splitting” str3 into
a list where each element consists of one character.
The output from Listing 3.5 is here (notice the blank lines after the first
three output lines):
matches1: ['123456']
matches1: ['123', '456']
matches1: ['12', '34', '56']

matches2: ['1', '2', '3', '4', '5', '6']

matches2: ['1', '2', '3', '4', '5', '6']

matches3: ['1', 'a', '2', 'b', '3', 'c', '4', '5', '6']
Listing 3.6 initializes the string variable str and the RE caps that matches
any word that starts with a capital letter, because the first portion of caps is
the pattern [A-Z] that matches any capital letter between A and Z inclusive.
The output of Listing 3.6 is here:
str: This Sentence contains Capitalized words
caps: ['This', 'Sentence', 'Capitalized']
p1 = re.compile('[a-z]+')
m1 = p1.match("hello")
In the preceding code block, the p1 object represents the compiled RE for
one or more lowercase letters, and the match object m1 supports
the following methods:
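Four commonly used match-object methods in the Python re module are group(), start(), end(), and span(). The sketch below (Python 3 syntax) applies each of them; the input string 'hello world' is an illustrative choice, not taken from the book:

```python
import re

p1 = re.compile('[a-z]+')
m1 = p1.match('hello world')

matched_text = m1.group()   # the substring that matched
start_index = m1.start()    # index where the match begins
end_index = m1.end()        # index just past the last matched character
span = m1.span()            # (start, end) as a tuple

print(matched_text, start_index, end_index, span)
```

The match stops at the space because a space is not in the class [a-z].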
if searchObj:
    print "searchObj.group() : ", searchObj.group()
    print "searchObj.group(1) : ", searchObj.group(1)
    print "searchObj.group(2) : ", searchObj.group(2)
else:
    print "searchObj does not match line:", line
if matchObj:
    print "matchObj.group() : ", matchObj.group()
    print "matchObj.group(1): ", matchObj.group(1)
    print "matchObj.group(2): ", matchObj.group(2)
else:
    print "matchObj does not match line:", line
Listing 3.7 contains the variable line that represents a text string, and the
variable searchObj holds the result of invoking the search() method with an RE
and a pair of pipe-delimited modifiers (discussed in more detail in the next
section). If searchObj is not None, the if/else conditional code in Listing 3.7
displays the contents of the three groups resulting from the successful match
with the contents of the variable line.
The same logic applies to matchObj, which is based on the re.match()
function instead of the re.search() function (recall the distinction that was
explained earlier in the chapter).
The output from Listing 3.7 is here:
In addition to the character classes that you have seen earlier in this chap-
ter, you can specify subexpressions of character classes. Listing 3.8 displays
the contents of Grouping1.py, which illustrates how to use groups with the
match() method.
p1 = re.compile('(ab)*')
print 'match1:',p1.match('ababababab').group()
print 'span1: ',p1.match('ababababab').span()
p2 = re.compile('(a)b')
m2 = p2.match('ab')
print 'match2:',m2.group(0)
print 'match3:',m2.group(1)
Since the explanation is quite lengthy, let’s look at the output and then
delve into the explanation. The output from Listing 3.8 is here:
match1: ababababab
span1: (0, 10)
match2: ab
match3: a
Listing 3.8 starts by defining the RE p1 that matches zero or more occur-
rences of the string ab. The first print statement displays the result of using
the match() function of p1 (followed by the group() function) against a
string, and the result is a string. This illustrates the use of “method chaining”,
which eliminates the need for an intermediate object (as shown in the sec-
ond code block). The second print statement displays the result of using the
match() function of p1, followed by applying the span() function, against
a string. In this case the result is a numeric range (see the output above).
The second part of Listing 3.8 defines the RE p2 that matches the letter a
(captured in a group by the parentheses) followed by the letter b. The variable
m2 invokes the match method on p2 using the string ab. The third print statement
displays the result of invoking group(0) on m2, and the fourth print statement
displays the result of invoking group(1) on m2. Both results are substrings of
the input string ab. Recall that group(0) returns the entire match, and
group(1) returns the text captured by the first parenthesized group in the
definition of p2. In general, group(n) returns the text captured by the nth
pair of parentheses, counted by the position of the opening parenthesis.
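To make the numbering of groups concrete, here is a small sketch in Python 3 syntax. The nested pattern ((a)b)c is a hypothetical example chosen for illustration and does not appear in Listing 3.8:

```python
import re

# Groups are numbered by the position of their opening parenthesis.
p = re.compile(r'((a)b)c')
m = p.match('abc')

whole = m.group(0)    # the entire match
outer = m.group(1)    # text captured by the first '(' : (a)b
inner = m.group(2)    # text captured by the second '(' : a
every = m.groups()    # all captured groups as a tuple

print(whole, outer, inner, every)
```

The groups() method is a convenient way to retrieve every captured group at once instead of calling group(n) repeatedly.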
This section contains some examples that illustrate how to use character
classes to match various strings and also how to use delimiters in order to split a
text string. For example, one common date string involves a date format of the
form MM/DD/YY. Another common scenario involves records with a delim-
iter that separates multiple fields. Usually such records contain one delimiter,
but as you will see, Python makes it very easy to split records using multiple
delimiters.
date1 = '02/28/2013'
date2 = 'February 28, 2013'
if re.match(r'\d+/\d+/\d+', date2):
    print('date2 matches this pattern')
else:
    print('date2 does not match this pattern')
Now that you understand how to define REs for digits and letters, let’s
look at some more sophisticated REs. For example, the following expression
matches a non-empty string that is any combination of digits, uppercase letters,
or lowercase letters (i.e., no special characters). Without the + quantifier the
expression would match only a single character:
^[a-zA-Z0-9]+$
^[\w\W\d]$
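The effect of the anchors and the quantifier is easy to check experimentally. The following sketch (Python 3 syntax; the variable names and test strings are illustrative assumptions) compares an anchored class with no quantifier, the same class with +, and the class [\w\W\d], which accepts any single character because \w and \W together cover every character:

```python
import re

one_char = re.compile(r'^[a-zA-Z0-9]$')     # exactly one alphanumeric character
many_chars = re.compile(r'^[a-zA-Z0-9]+$')  # one or more alphanumeric characters
any_char = re.compile(r'^[\w\W\d]$')        # any single character at all

table = {}
for s in ['a', 'abc123', 'abc 123', '!']:
    table[s] = (bool(one_char.match(s)),
                bool(many_chars.match(s)),
                bool(any_char.match(s)))
    print(repr(s), table[s])
```

Only the version with + accepts a multi-character string such as abc123, and none of the three accepts a string containing a space alongside other characters.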
import re
Listing 3.11 displays the contents of the Python script RegEx2.py, which
illustrates how to define simple REs in order to split the contents of a text
string.
line2 = "abc1,abc2:abc3;abc4"
result2 = re.split(r'[,:;]', line2)
print 'result2:',result2
Listing 3.11 contains three blocks of code, each of which uses the split()
method in the re module in order to tokenize three different strings. The first
RE specifies a whitespace, the second RE specifies three punctuation charac-
ters, and the third RE specifies the combination of the first two REs.
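The three kinds of split described above can be sketched as follows (Python 3 syntax). Only line2 and its RE are taken from the listing; the strings line1 and line3 and their REs are hypothetical stand-ins:

```python
import re

line1 = "abc1 abc2 abc3"            # hypothetical whitespace-delimited string
line2 = "abc1,abc2:abc3;abc4"       # the string shown in the listing
line3 = "abc1 abc2,abc3:abc4;abc5"  # hypothetical string mixing both kinds

result1 = re.split(r'\s', line1)        # split on a whitespace character
result2 = re.split(r'[,:;]', line2)     # split on comma, colon, or semicolon
result3 = re.split(r'[,:;\s]', line3)   # split on any of the above

print('result1:', result1)
print('result2:', result2)
print('result3:', result3)
```

Placing all the delimiters in a single character class is what allows one call to split() to handle several delimiters at once.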
Listing 3.12 contains two text strings that can be split using the same RE
'\d+\. '. Note that if you use the expression '\d\. ' only the first text
string will split correctly.
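The difference between the two expressions shows up as soon as an item number has more than one digit. The strings below are hypothetical (they are not the ones in Listing 3.12), and the sketch uses Python 3 syntax:

```python
import re

line1 = "1. First item 2. Second item"      # hypothetical single-digit numbering
line2 = "10. Tenth item 11. Eleventh item"  # hypothetical two-digit numbering

good1 = re.split(r'\d+\. ', line1)
good2 = re.split(r'\d+\. ', line2)
bad2 = re.split(r'\d\. ', line2)   # consumes only the last digit of each number

print(good1)
print(good2)
print(bad2)
```

With the shorter expression the leading digit of each two-digit number is left behind in the output. Note also that split() returns an empty first element when the string begins with a delimiter.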
The result of launching Listing 3.12 is here:
Earlier in this chapter you saw a preview of using the sub() method to
remove all the metacharacters in a text string. The following code block
illustrates how to use the sub() method of a compiled RE to replace matching
words in a text string.
>>> import re
>>> p = re.compile( '(one|two|three)')
>>> p.sub( 'some', 'one book two books three books')
'some book some books some books'
>>>
>>> p.sub( 'some', 'one book two books three books', count=1)
'some book two books three books'
The following code block uses the re.sub() method in order to insert a
line feed after each alphabetic character in a text string:
line2:
a
b
c
d
e
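One way to produce the output above is to capture each letter and re-emit it followed by a line feed. This is a sketch in Python 3 syntax under that assumption; the book's original code block may differ:

```python
import re

line = 'abcde'
# \1 refers to the captured letter; \n in the replacement inserts a line feed.
line2 = re.sub(r'([a-zA-Z])', r'\1\n', line)

print('line2:')
print(line2, end='')
```

The re module interprets the \n escape inside the replacement string, so a raw string works here.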
Now consider the following example that illustrates how to use the Python
subn() function with a text string:
line = 'abcde'
linere = re.compile(r'', re.IGNORECASE)
line3 = linere.subn('', line)
print 'line3:',line3
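The output from launching the preceding Python code block is here (the empty
pattern matches at each of the six positions before, between, and after the
five characters, so subn() reports six substitutions that leave the string
unchanged):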
line3: ('abcde', 6)
Listing 3.13 displays the contents of the Python script RegEx3.py, which
illustrates how to find substrings using the startswith() function and
endswith() function.
line2 = "abc1,Abc2:def3;Def4"
result2 = re.split(r'[,:;]', line2)
for w in result2:
    if(w.startswith('Abc')):
        print 'Word starts with Abc:',w
    elif(w.endswith('4')):
        print 'Word ends with 4:',w
    else:
        print 'Word:',w
Listing 3.13 starts by initializing the string line2 (with punctuation
characters as word delimiters) and the list result2, which is produced by the
split() function with a comma, colon, and semicolon as “split delimiters” in
order to tokenize the string variable line2.
The output after launching Listing 3.13 is here:
Word: abc1
Word starts with Abc: Abc2
Word: def3
Word ends with 4: Def4
line1 = "abcdef"
line2 = "123,abc1,abc2,abc3"
line3 = "abc1,abc2,123,456f"
if re.match("^[A-Za-z]*$", line1):
    print 'line1 contains only letters:',line1
if re.match("^[\w]*$", line1):
    print 'line1 contains only letters:',line1
if re.match("^[0-9][0-9][0-9]", line2):
    print 'line2 starts with 3 digits:',line2
if re.match("^\d\d\d", line2):
    print 'line2 starts with 3 digits:',line2
print
Listing 3.14 starts by initializing three string variables line1, line2, and
line3. The first RE contains an expression that matches any line consisting
only of uppercase or lowercase letters (or both):
if re.match("^[A-Za-z]*$", line1):
A related test uses a string method instead of an RE:
line1[:-1].isalpha()
The preceding snippet removes the last character of line1 (for example, a
trailing newline) and checks whether every remaining character is alphabetic.
The next snippet checks if line1 consists only of word characters (the class
\w matches letters, digits, and the underscore):
if re.match("^[\w]*$", line1):
The next portion of Listing 3.14 checks if a string starts with three
consecutive digits:
if re.match("^[0-9][0-9][0-9]", line2):
    print 'line2 starts with 3 digits:',line2
if re.match("^\d\d\d", line2):
The first snippet uses the pattern [0-9] to match a digit, whereas the
second snippet uses the expression \d to match a digit.
The output from Listing 3.14 is here:
Compilation Flags
Compilation flags modify the manner in which REs work. Flags are available
in the re module as a long name (such as IGNORECASE) and a short,
one-letter form (such as I). The short form is the same as the flags in pattern
modifiers in Perl. You can specify multiple flags by using the “|” symbol. For
example, re.I | re.M sets both the I and M flags.
You can check the online Python documentation regarding all the available
compilation flags in Python.
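The following sketch (Python 3 syntax) shows the effect of combining two flags with the “|” symbol. The sample text is hypothetical:

```python
import re

text = "first line\nSecond Line\nsecond thought"

# Without flags, '^second' is case sensitive and ^ matches only at the
# start of the whole string.
no_flags = re.findall(r'^second', text)

# re.I ignores case; re.M lets ^ match at the start of every line.
both_flags = re.findall(r'^second', text, re.I | re.M)

print(no_flags)
print(both_flags)
```

The short names are aliases for the long ones, so re.I | re.M and re.IGNORECASE | re.MULTILINE are interchangeable.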
Compound REs
if re.match("^[Tt]his", line1):
    print 'line1 starts with This or this:'
    print line1
else:
    print 'no match'
if re.match("^This|That", line2):
    print 'line2 starts with This or That:'
    print line2
else:
    print 'no match'
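One subtlety about the RE ^This|That in the preceding code is worth a short experiment. The “|” symbol has the lowest precedence, so the RE is read as “^This” or “That”. With re.match() this makes no difference, because match() always starts at the beginning of the string, but with re.search() the second alternative is not anchored. The sketch below uses Python 3 syntax and a hypothetical test string:

```python
import re

line = "Not This but That"   # hypothetical test string

loose = re.search(r'^This|That', line)      # parsed as (^This) or (That)
strict = re.search(r'^(This|That)', line)   # the anchor applies to both alternatives

loose_text = loose.group() if loose else None
strict_text = strict.group() if strict else None

print(loose_text)
print(strict_text)
```

Wrapping the alternatives in parentheses is a simple way to make the anchor apply to both of them.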
Listing 3.15 starts with two string variables line1 and line2, followed by
an if/else conditional code block that checks if line1 starts with the RE
[Tt]his, which matches the string This as well as the string this.
The second conditional code block checks if line2 starts with the string
This or the string That. Notice the “^” metacharacter, which in this context
anchors the RE to the beginning of the string. The output from Listing 3.15
is here:
charCount = 0
digitCount = 0
otherCount = 0
for ch in line1:
    if(re.match(r'\d', ch)):
        digitCount = digitCount + 1
    elif(re.match(r'\w', ch)):
        charCount = charCount + 1
    else:
        otherCount = otherCount + 1
print 'charcount:',charCount
print 'digitcount:',digitCount
print 'othercount:',otherCount
charcount: 16
digitcount: 5
othercount: 6
You can also “group” subexpressions and even refer to them symbolically.
For example, the following expression matches zero or one occurrences of
three consecutive letters or digits:
^([a-zA-Z0-9]{3,3})?
^\d{3,3}[-]\d{3,3}[-]\d{4,4}
^\d{5,5}([-]\d{5,5})?
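The three expressions above can be tried against a few sample strings. In the sketch below (Python 3 syntax) the labels and the sample strings are hypothetical; the REs are copied as shown:

```python
import re

patterns = {
    'optional three alphanumerics': r'^([a-zA-Z0-9]{3,3})?',
    'ddd-ddd-dddd': r'^\d{3,3}[-]\d{3,3}[-]\d{4,4}',
    'ddddd with optional -ddddd': r'^\d{5,5}([-]\d{5,5})?',
}
samples = ['abc', '650-555-1234', '94043', '94043-12345', 'x']

results = {}
for label, pat in patterns.items():
    # re.match() succeeds if the RE matches at the start of the string.
    results[label] = [s for s in samples if re.match(pat, s)]
    print(label, '->', results[label])
```

Notice that the first RE matches every sample, including x: because the group is optional, the RE can match the empty string at the start of any input. Also note that {3,3} is equivalent to {3}.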
str = 'john.doe@google.com'
match = re.search(r'\w+@\w+', str)
if match:
    print match.group() ## 'doe@google'
Exercise: use the preceding code block as a starting point in order to define
an RE for email addresses.
As you saw in Chapter 2, most email checks are fairly simple in produc-
tion code: at least one character, followed by an @ symbol, at least one more
character, followed by a period, and at least one character after the period.
Such checks are obviously minimalistic, and they cannot prove that the email
address is real.
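The minimalistic check described above can be sketched as follows (Python 3 syntax). The RE and the name simple_email are assumptions made for illustration; this is not a validator for real-world email addresses:

```python
import re

# At least one character, an @ symbol, at least one character, a period,
# and at least one character; spaces and extra @ symbols are rejected.
simple_email = re.compile(r'^[^@\s]+@[^@\s]+\.[^@\s]+$')

checks = {}
for candidate in ['john.doe@google.com', 'john.doe@google',
                  'no-at-sign.com', 'a@b.c']:
    checks[candidate] = bool(simple_email.match(candidate))
    print(candidate, checks[candidate])
```

A check like this only confirms the overall shape of the string, which is consistent with the point that such a test cannot prove the address is real.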
Listing 3.17 displays the contents of the Python script RegEx4.py, which
illustrates how to define REs that match various text strings.
if expr1.search( searchString ):
    print '"Test" was found.'
if expr2.match( searchString ):
    print '"Test" was found at the beginning of the line.'
if expr3.match( searchString ):
    print '"Test" was found at the end of the line.'
if result:
    print 'There are %d words(s) ending in "es":' % \
        ( len( result ) ),
Listing 3.17 starts with the variable searchString that specifies a text
string, followed by the REs expr1, expr2, expr3. The RE expr1 matches
the string Test that occurs anywhere in searchString, whereas expr2
matches Test if it occurs at the beginning of searchString, and expr3
matches Test if it occurs at the end of searchString. The RE expr4
matches words that end in the letters es, and the RE expr5 matches the
letter t followed by a vowel.
The output from Listing 3.17 is here:
"Test" was found.
"Test" was found at the beginning of the line.
There are 1 words(s) ending in "es": matches
The letter t, followed by a vowel, occurs 3 times: Te ti te
Chapter Summary
This chapter showed you how to create various types of REs. First you
learned how to define primitive REs using sequences of digits, lowercase let-
ters, and uppercase letters. Next you learned how to use character classes,
which are more convenient and simpler expressions that can perform the same
functionality. You also learned how to use the Python re library in order to
compile REs and then use them to see if they match substrings of text strings.
Chapter 4
Working with REs in R
This chapter introduces you to REs in R, which are used from a statistical
viewpoint to solve tasks for data scientists. Keep in mind that basic
familiarity with standard data types in R is required for this chapter,
such as creating string vectors, vectors of sentences, and data frames. This
chapter shows you how to use REs in some R-specific commands, thereby
enhancing your knowledge of R. When you have finished this chapter, you will
have enough knowledge to convert the code samples in the first two chapters
to their R counterparts.
The first section of this chapter contains a summary of rules for
metacharacters in R, an overview of search functions in R, as well as an
explanation of grep-related commands in R. The second section of this chapter
contains basic examples of REs in R, which are similar to approximately
25% of Chapter 1. The final section of this chapter contains a collection of
one-line REs in R that use some of the R commands that are discussed in
the second section.
One recommendation: download and install RStudio for your platform and
use RStudio to test the REs in this chapter. RStudio is an extremely power-
ful code development environment, and a must-learn tool if you plan to work
extensively in R.
Here are the rules that specify how to match metacharacters as regular
characters when they are included in a character class:
Search Functions in R
Perl RE Support in R
REs in R are usually restricted, and the inline help functionality does
not provide extensive information about many topics. What an individual
command supports depends on who wrote it and what they chose to implement,
which means behavior is more variable than something like Java, which
was developed commercially by a single development team. On the plus side,
R functions can correctly interpret Perl RE syntax when perl=TRUE is sup-
ported by the command and specified by the user. For your convenience, the
Appendix contains examples of one-line Perl REs. In addition, the RE syntax
in Python bears some resemblance to REs in R. All of the commands dis-
cussed in this chapter support perl=TRUE and share common RE behavior
if perl=FALSE.
The previous section gave you a high-level description of the modus op-
erandi of the Unix grep command versus the grep command in R. This
section provides a deeper explanation and some examples that illustrate how to
work with grep in R. In particular, this section briefly discusses the commands
grepl, regexpr, and gregexpr, which also have grep-like functionality.
Here’s a simple example of the grep command in R (an explanation is
provided later):
> x<-c("abc","bcd","cde","def")
> grep("bc",x)
[1] 1 2
The grep command in R requires an RE and the input vector as the first
and second arguments, respectively. If you specify value=FALSE or omit
the value parameter, then grep returns a new vector with the indexes of the
elements in the input vector that partially or fully matched the RE. On the other
hand, if you specify value=TRUE, then grep returns a vector with copies of
the actual elements in the input vector that partially or fully matched.
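For readers arriving from Chapter 3, the difference between value=FALSE and value=TRUE can be mimicked in Python. The sketch below is only a rough analog in Python 3 syntax, not R's implementation; the function named grep is a hypothetical helper that returns 1-based indexes to mirror R's convention:

```python
import re

def grep(pattern, items, value=False):
    """Rough analog of R's grep(): 1-based indexes, or the matching elements."""
    hits = [(i, s) for i, s in enumerate(items, start=1)
            if re.search(pattern, s)]          # partial matches count
    return [s for _, s in hits] if value else [i for i, _ in hits]

x = ["abc", "bcd", "cde", "def"]
indexes = grep("bc", x)               # positions of the matching elements
elements = grep("bc", x, value=True)  # the matching elements themselves

print(indexes)
print(elements)
```

In both modes the result contains one entry per matching element, so it can be shorter than the input.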
For example, the following grep command in R matches the RE a+ (one
or more occurrences of the letter a) with the elements of a string vector:
grep("a+", c("abc", "def", "cba a", "aa"), perl=TRUE, value=FALSE)
> grepl("bc",x)
[1] TRUE TRUE FALSE FALSE
[1] 1 -1 3 1
attr(,"match.length")
[1] 1 -1 1 2
attr(,"useBytes")
[1] TRUE
[[1]]
[1] 1
attr(,"match.length")
[1] 1
attr(,"useBytes")
[1] TRUE
[[2]]
[1] -1
attr(,"match.length")
[1] -1
attr(,"useBytes")
[1] TRUE
[[3]]
[1] 3 5
attr(,"match.length")
[1] 1 1
attr(,"useBytes")
[1] TRUE
[[4]]
[1] 1
attr(,"match.length")
[1] 2
attr(,"useBytes")
[1] TRUE
The first argument for the regmatches command is the same input that
is supplied to the regexpr command or the gregexpr command. The
second argument is the vector that is returned by the regexpr command
or the list returned by the gregexpr command. If you pass the vector from
the regexpr command, then regmatches returns a character vector with
all the strings that were matched. Note that this vector may be shorter than the
input vector if there is no match in some of the elements.
If you pass the list from the gregexpr command, then regmatches returns
a list with the same number of elements as the input vector. Each element of
the output list is a character vector with all the matches of the corresponding
element in the input vector, or an empty character vector (character(0)) if an
element had no matches. The examples in the beginning of this section
illustrate some of the preceding points.
Notice that the second and fourth element didn’t match and were un-
changed. The first element replaced the “a” with the single vector value “one.”
In addition, the third element concatenated the vector into a single string
(changing it to “two three”) and then replaced the “a” with that string.
x<-c("abc","bcdbc","cde","def")
sub(".*(bc).*","\\1",grep("bc",x,value=TRUE))
The grep removes elements that don’t match (the third and fourth strings),
so the output vector has fewer elements than the input vector. If we remove
everything except the first occurrence of bc in the first two strings, we get the
following output:
Now let’s look at the gsub() command that substitutes all occurrences of
a pattern with a given string. By way of comparison, the sub() command is
similar to find/replace, whereas gsub() is similar to find/replace all. Here
is a simple example of the gsub() command:
The substr() command returns the substring of a given string that lies
between the specified start and stop positions (inclusive):
> x<-"abcdefghijk"
> substr(x,5,8)
[1] "efgh"
strsplit("11/03/2013","/")
[[1]]
[1] "11" "03" "2013"
Now that you have seen examples of how to use some useful string-related
commands in R, let’s look at how to use REs in R.
vect1 = c("the","dog","is","grey","and","the","cat","is","gray")
As you can see, the word grey appears in the first and second lines, the
word gray appears in the first and third lines, and all three lines contain either
grey or gray.
Here are the tasks that we want to perform:
"grey"
Notice that the string groy is not displayed in the preceding output,
nor is there a -1 (which you might have expected) as the index for the non-
occurrence of the string groy.
The following pair of commands uses the pattern [ae] to combine grey
and gray, and then displays the occurrence of grey and gray in vect1,
along with the index values of their positions in vect1:
grep(pattern = "gr[ae]y", vect1, value = TRUE)
grep(pattern = "gr[ae]y", vect1, value = FALSE)
The only matches are grey and gray, but if vect1 included the “word”
grzy, then this word would appear in the previous output.
We can also specify a single letter inside the square brackets. For example,
the term [a] is an RE that matches the letter a. Launch this command:
If we specify a vowel that does not appear in any word in vect1, then we
see a message that indirectly hints at the absence of that vowel. An example is
here:
character(0)
integer(0)
Once again, the order of the letters in the square brackets is irrelevant,
which means that the following commands have the same output:
In this section let’s define a vector of strings called mytext1 whose
contents are shown here:
mytext1 <- c("the dog is grey and the cat is gray.",
             "this dog is grey",
             "that cat is gray")
Now let’s apply the REs that we saw earlier in this chapter to the variable
mytext1. For example, check for strings in mytext1 with either of these
two REs:
[1] "the dog is grey and the cat is gray." "this dog is grey"
[3] "that cat is gray"
[1] "the dog is grey and the cat is gray." "this dog is grey"
[3] "that cat is gray"
[1] "the dog is grey and the cat is gray." "this dog is grey"
[3] "that cat is gray"
[1] "the dog is grey and the cat is gray." "this dog is grey"
[3] "that cat is gray"
The examples in this section use the grep() function, but you can also
use the sub() and gsub() functions, described earlier in this chapter, in
conjunction with REs.
This section contains examples of using the grep function (and related
functions) in R in order to find matching strings in string vectors. If necessary,
read the appropriate sections in Chapter 1 to refresh your memory regarding
the REs in this section.
Initialize string vector strings:
## [1] "accb"
grep("ac{2,}b", strings, value = TRUE)
## [1] "abcd"
grep("ab$", strings, value = TRUE)
## [1] "cdab"
grep("\\bab", strings, value = TRUE)
## [1] "^ab"
grep("abc|abd", strings, value = TRUE)
## [1] "A.bc"
Case Sensitivity in R
Pattern matching is case sensitive in R. However, you can perform case
insensitive pattern matching by specifying ignore.case=TRUE (in base
R functions) or by “wrapping them” with ignore.case() for stringr
functions (recent versions of stringr express this as
regex(pattern, ignore_case = TRUE)).
Yet another way to specify case-insensitive pattern matching is to use the
tolower() and toupper() functions to convert strings to lower- or upper-
case and then perform pattern-matching operations. Consider the following
example:
## character(0)
grep(pattern, strings, value = TRUE, ignore.case = TRUE)
mytext <- c("This is the first line", "This is second", "This line
has 997 also")
## [1] 3
## [1] 1 2 3
## [1] 1 2
require(stringi)
str <- c(
"this is a string that is slightly longer",
"nospacesinthisstring",
"several whitespaces",
" startswithspaces",
"endswithgspaces ",
" contains both leading and trailing ",
"just one space each")
stri_count(str,regex="\\S+")
#[1] 8 1 2 1 1 5 4
Listing 4.1 initializes str as a vector of strings, and then invokes the sap-
ply() method twice. The first invocation invokes the gregexpr() method
in order to find the position of the first occurrence of the RE in each of the
four substrings of str, which yields the values 5, 2, 1, and 1, respectively.
The second invocation of the sapply() method invokes the str-
split() method that splits the four substrings of str into substrings based
on the specified RE, which produces the values 5, 2, 1, and 1, respectively.
The third portion of Listing 4.1 initializes str as a new list of strings, and
then invokes the stri_count() function in order to count the number of
occurrences of non-whitespace character strings in each of the seven sub-
strings of str, which is 8, 1, 2, 1, 1, 5, and 4, respectively.
R supports several more advanced string functions that are somewhat re-
lated to REs, such as splitting a string, getting a subset of a string, pasting
strings together, and so forth. These R functions are very useful for data
cleaning, and a short introduction follows.
The strsplit() function (which returns a list) splits its first argument
into substrings, where the second argument split is an RE used for splitting
strings. The unlist() function converts a list into a character vector, and the
function str_split_fixed() returns a character matrix.
The paste() or paste0() functions put things together. The paste0()
function is equivalent to paste() with sep = "". We can use the col-
lapse = "-" argument to concatenate a character vector into a string.
str_function(string, pattern)
For example, the str_detect() function checks whether or not a pattern appears
in a string. The str_extract() and str_extract_all() functions extract
the first occurrence and all occurrences, respectively, of a pattern in a string.
The str_match() and str_match_all() functions extract the first matched group
and all matched groups, respectively, from a string. Other functions in this
package include: str_locate() and str_locate_all(), str_replace() and
str_replace_all(), str_split() and str_split_fixed(). As with the previous R
commands, an internet search for “R stringr package documentation” will pro-
vide more details regarding the functions in this package.
Chapter Summary
This section assumes a bit of basic familiarity with the Unix/Linux command
line and how the commands accept input and generate output.
While the examples use the bash “shell” environment for syntax, most
of them will also work with other common shells such as the Bourne and Korn shells.
This chapter shows you how to use REs in order to transform data using the
Unix sed utility (an acronym for “stream editor”), followed by a short section
that contains examples of REs with the Unix awk utility.
The first part of this chapter contains basic examples of the sed command,
such as replacing and deleting strings, numbers, and letters. The second part
of this chapter discusses various switches that are available for the sed com-
mand, along with an example of replacing multiple delimiters with a single
delimiter in a dataset.
The third part of this chapter provides a very brief introduction of the
awk command. You will learn about some built-in variables for awk, and also
how to manipulate string variables using awk. Note that some of these string-
related examples can also be handled using other bash commands.
The final section contains code samples that involve metacharacters (intro-
duced in Chapter 1) and character sets in awk commands. You will also see
how to use conditional logic in awk commands in order to determine whether
or not to print specific lines of text.
The sed command is the most common command line tool used in Unix/
Linux environments to do find/replace-type functions using REs for pattern
matching, although it has many other uses. As such it’s worth a bit of explana-
tion before diving into examples.
The name sed is an acronym for “stream editor”, and the utility derives many
of its commands from the ed line-editor (ed was the first UNIX text editor). The
sed command is a “non-interactive” stream-oriented editor that can be used to
automate editing via shell scripts. This ability to modify an entire stream of data
(which can be the contents of multiple files, in a manner similar to how grep
behaves) as if you were inside an editor is not common in modern programming
languages. This behavior allows some capabilities not easily duplicated else-
where, while behaving exactly like any other command (grep, cat, ls, find,
and so forth) in how it can accept data, output data, and pattern match with REs.
Some of the more common uses for sed include: print matching lines,
delete matching lines, and find/replace matching strings or REs.
The sed command requires you to specify a string in order to match the
lines in a file. For example, suppose that the file numbers.txt contains the
following lines:
1
2
123
3
five
4
The following sed command prints all the lines that contain the string 3:
cat numbers.txt | sed -n "/3/p"
Keep in mind that it’s always more efficient to just read in the file using the
sed command than to pipe it in with a different command. You can “feed” it
data from another command if that other command adds value (such as adding
line numbers, removing blank lines, or other similar helpful activities).
The -n option suppresses the automatic printing of every input line, and the p
command prints the matching line. If you omit the -n option, then every line is
printed, and the p command causes the matching line to be printed again. Hence,
issue the following command:
The output (the data to the right of the colon) is as follows. Note that the
labels to the left of the colon show the source of the data, to illustrate the “one
row at a time” behavior of sed.
It is also possible to match two patterns and print everything between the
lines that match:
The output of the preceding command (all lines between 123 and five,
inclusive) is here:
123
3
five
The examples in this section illustrate how to use sed to substitute new
text for an existing text pattern.
x="abc"
echo $x |sed "s/abc/def/"
def
In the prior command you have instructed sed to substitute ("s) the first text
pattern (/abc) with the second pattern (/def) and no further instructions (/").
Deleting a text pattern is simply a matter of leaving the second pattern
empty:
defabc
As you see, this only removes the first occurrence of the pattern. You can
remove all the occurrences of the pattern by adding the “global” terminal in-
struction (/g"):
def
Note that we are operating directly on the main stream with this command,
as we are not using the -n tag. You can also suppress the main stream with -n
and print the substitution, achieving the same output if you use the terminal p
(print) instruction:
For substitutions either syntax will do, but that is not always true of other
commands.
You can also remove digits instead of letters by using the numeric metacha-
racters as your regular expression match pattern (from Chapter 1):
The following sed command deletes a range of lines, starting from the line
that matches 123 and continuing through the file until reaching the line that
matches the string five (and also deleting all the intermediate lines). The
syntax should be familiar from the earlier matching example:
sed "/123/,/five/d" columns4.txt
Hullu
Recall that an integer consists of one or more digits, so it matches the
regular expression [0-9]+, which matches one or more digits. However, basic
sed REs do not treat an unescaped + as a quantifier, so you can specify the
regular expression [0-9]* with the g flag in order to remove every number from
the variable x:
echo $x | sed "s/[0-9]*//g"
The following command removes all lowercase letters from the variable x:
echo $x | sed "s/[a-z]*//g"
The following command removes all lowercase and uppercase letters from
the variable x:
echo $x | sed "s/[a-zA-Z]*//g"
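When you combine ranges, both ranges belong inside a single pair of square brackets. An expression such as [a-z][A-Z]* means “a lowercase letter followed by zero or more uppercase letters”, which is not the same as “any letter”. The sketch below checks this with the Python re module (Python 3 syntax) rather than sed; the character classes involved behave the same way in both tools for this input, and the value of x is hypothetical:

```python
import re

x = "ABCabcXYZ123"   # hypothetical value for the variable x

only_lower = re.sub(r'[a-z]', '', x)          # removes lowercase letters
two_classes = re.sub(r'[a-z][A-Z]*', '', x)   # lowercase, then any uppercase run
one_class = re.sub(r'[a-zA-Z]', '', x)        # removes every letter

print(only_lower)
print(two_classes)
print(one_class)
```

The middle expression leaves the leading ABC in place, because those uppercase letters are not preceded by a lowercase letter.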
The previous section showed you how to delete a range of rows of a text
file, based on a start line and end line, using either a numeric range or a pair
of strings. As deleting is just substituting an empty result for what you match,
it should now be clear that a replace activity involves populating that part of
the command with something that achieves your desired outcome. This sec-
tion contains various examples that illustrate how to get the exact substitution
you desire.
The following examples illustrate how to convert lowercase abc to upper-
case ABC in sed:
The output of the preceding command is here (only the first occurrence of
abc is replaced):
ABC
echo "abcdefabc" |sed "s/abc/ABC/g"
The output of the preceding command is here (the trailing g applies the
substitution to every occurrence of abc):
ABCdefABC
ABCde
Obviously you can use the following sed expression that combines the
three substitutions into one substitution:
Nevertheless, the -e switch is useful when you need to perform more complex
substitutions that cannot be combined into a single substitution.
The “/” character is not the only delimiter that sed supports, which is use-
ful when strings contain the “/” character. For example, you can reverse the
order of /aa/bb/cc/ with this command:
echo "/aa/bb/cc" |sed -n "s#/aa/bb/cc#/cc/bb/aa/#p"
/cc/bb/aa/
The following examples illustrate how to use the “w” terminal command in-
struction to write the sed output to both standard output and also to a named
file upper1 if the match succeeds:
If you examine the contents of the text file upper1 you will see that it
contains the same string ABCdefabc that is displayed on the screen. This
two-stream behavior that we noticed earlier with the print (“p”) terminal
command is unusual but sometimes useful. It is more common to simply
send the standard output to a file using the “>” syntax, as shown in the fol-
lowing example (both syntaxes work for a replace operation), but in that case
nothing is written to the terminal screen. The previous syntax allows both at
the same time:
Listing 5.1 displays the contents of update2.sh that replace the occur-
rence of the string hello with the string goodbye in the files with the suffix
txt in the current directory.
Listing 5.1 contains a for loop that iterates over the list of text files with
the txt suffix. For each such file, initialize the variable newfile that is created
by appending the string _new to the first file (represented by the variable f).
Next, replace the occurrences of hello with the string goodbye in each file f,
and redirect the output to $newfile. Finally, rename $newfile to $f using
the mv command.
If you want to perform the update in matching files in all subdirectories,
replace the “for” statement with the following:
Listing 5.2 displays the contents of the dataset delim1.txt, which contains
multiple delimiters “|”, “:”, and “^”. Listing 5.3 displays the contents of
delimiter1.sh, which illustrates how to replace multiple delimiters with a
single comma delimiter “,” in delim1.txt.
As you can see, the second line in Listing 5.3 is simple yet very powerful:
you can extend the sed command with as many delimiters as you require in
order to create a dataset with a single delimiter between values. The output
from Listing 5.3 is shown here:
1000,Jane,Edwards,Sales
2000,Tom,Smith,Development
3000,Dave,Del Ray,Marketing
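A minimal sketch of this kind of replacement is shown below. The listings themselves are not reproduced here, so the sample rows are an assumed reconstruction: the placement of “|”, “:”, and “^” is illustrative and was chosen only to yield the output shown above.

```shell
# Assumed reconstruction of delim1.txt; the delimiter placement is a guess.
workdir=$(mktemp -d)
cat > "$workdir/delim1.txt" <<'EOF'
1000|Jane:Edwards^Sales
2000|Tom:Smith^Development
3000|Dave:Del Ray^Marketing
EOF

# One -e expression per delimiter; add more -e expressions as required.
cleaned=$(sed -e 's/|/,/g' -e 's/:/,/g' -e 's/\^/,/g' "$workdir/delim1.txt")
echo "$cleaned"
```

The space inside “Del Ray” survives because a space is not one of the delimiters being replaced.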
Keep in mind that this kind of transformation can be unsafe unless you have
checked that your new delimiter is not already in use. A grep command is
useful for that check (you want the result to be zero, because -c counts the
lines in the input file that match the pattern):
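One way to run that check is sketched here, using a hypothetical two-row sample in place of the real dataset.

```shell
# Hypothetical sample rows standing in for the real dataset.
workdir=$(mktemp -d)
printf '1000|Jane:Edwards^Sales\n2000|Tom:Smith^Development\n' > "$workdir/delim1.txt"

# -c prints the number of lines that contain a comma; 0 means that "," is
# safe to use as the new delimiter. grep exits with status 1 when nothing
# matches, so "|| true" keeps the script going under "set -e".
count=$(grep -c "," "$workdir/delim1.txt" || true)
echo "$count"
```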
The three command line switches -n, -e, and -i are useful when you
specify them with the sed command.
As a review, specify -n when you want to suppress the printing of the basic
stream output:
sed -n 's/foo/bar/'
Specify -n and end the command with /p' when you want to print only the lines on which the substitution succeeded:
sed -n 's/foo/bar/p'
A more advanced example that hints at the flexibility of sed involves the in-
sertion of a character after a fixed number of positions. For example, consider
the following code snippet:
ABCnDEFnGHInJKLnMNOnPQRnSTUnVWXnYZ
While the previous example does not seem especially useful, consider a
large text stream with no line breaks (everything on one line). You could use
something like this to insert newline characters, or something else to break the
data into easier-to-process chunks. It is possible to work through exactly what
sed is doing by looking at each element of the command and comparing to the
output, even if you don’t know the syntax. (Tip: sometimes you will encounter
very complex instructions for sed without any documentation in the code: try
not to be that person when coding.)
The output changes after every three characters, and we know that the dot (.)
matches any single character, so .{3} must be telling sed to match three
characters (the braces are escaped with backslashes because sed, in its
default mode, does not treat unescaped braces as a repetition count, so a bare
.{3} would not be interpreted as intended). The “n” is clear enough in
the replacement column, so the “&\” must somehow be telling sed to insert a
character instead of replacing the match. The terminal g command, of course,
means to repeat. To clarify and confirm those guesses, take what you could
infer and perform an Internet search.
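The following sketch reproduces the output shown earlier. It is an assumed form of the command, since the original snippet is not reproduced here: it writes a literal n in the replacement, because sed implementations differ on whether \n in the replacement means a newline or the letter n.

```shell
# "&" in the replacement stands for the text that the pattern matched, so each
# three-character match is written back followed by a literal "n".
# \{3\} is the basic-RE way of saying "exactly three of the preceding item".
marked=$(echo "ABCDEFGHIJKLMNOPQRSTUVWXYZ" | sed 's/.\{3\}/&n/g')
echo "$marked"
```

The final YZ is left alone because only two characters remain, so the pattern cannot match a ninth time.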
The sed utility is very useful for manipulating the contents of text files.
For example, you can print ranges of lines or subsets of lines that match a
regular expression. You can also perform search-and-replace on the lines
in a text file. This section contains examples that illustrate how to perform
such functionality.
Printing Lines
Listing 5.4 displays the contents of test4.txt (double-spaced lines)
that is used for several examples in this section.
def
abc
abc
The following code snippet prints the first three lines in test4.txt (we
used this syntax before when deleting rows, and it is equally useful for printing):
The output of the preceding code snippet is here (the second line is blank):
abc
def
def
abc
The following code snippet takes advantage of the basic output stream
and the second match stream to duplicate every line (including blank lines) in
test4.txt:
abc
abc
def
def
abc
abc
abc
abc
The following code snippet prints the first three lines and then capitalizes
the string abc, duplicating ABC in the final output because we did not use -n
and did end with /p" in the second sed command. Remember that /p" only
prints the text that matched the sed command, where the basic output prints
the whole file, which is why def does not get duplicated:
ABC
ABC
def
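The three commands described in this section can be sketched as follows. The contents of test4.txt used here are an assumption (four text lines separated by blank lines), since the listing is not fully legible in this copy.

```shell
# Assumed contents of test4.txt (double-spaced lines).
workdir=$(mktemp -d)
printf 'abc\n\ndef\n\nabc\n\nabc\n' > "$workdir/test4.txt"

# 1. Print only lines 1 through 3 (-n suppresses the default output).
first_three=$(sed -n "1,3p" "$workdir/test4.txt")

# 2. Without -n, "p" prints each line a second time, duplicating every line.
doubled=$(sed "p" "$workdir/test4.txt")

# 3. Print lines 1-3, then capitalize abc; "p" repeats only the lines where
#    the substitution succeeded, so ABC appears twice and def once.
capitalized=$(sed -n "1,3p" "$workdir/test4.txt" | sed "s/abc/ABC/p")

echo "$first_three"
echo "$capitalized"
```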
As our first example involving sed and character classes, the following
code snippet illustrates how to match lines that contain lowercase letters:
The following code snippet illustrates how to match lines that contain low-
ercase letters:
The following code snippet illustrates how to match lines that contain the
numbers 4, 5, or 6:
The following code snippet illustrates how to match lines that start with any
two characters followed by EE:
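The character-class matches described above can be tried with a sketch like the following. The sample lines are hypothetical, and LC_ALL=C is set as a precaution so that [a-z] covers exactly the ASCII lowercase letters.

```shell
export LC_ALL=C   # make ranges such as [a-z] behave as plain ASCII ranges

# Hypothetical sample lines; the original input file is not shown in the text.
sample=$(printf 'abc\nABC\n123\n456\nxyEE\nEExy\n')

lower=$(echo "$sample" | sed -n '/[a-z]/p')        # lines with a lowercase letter
digits=$(echo "$sample" | sed -n '/[4-6]/p')       # lines containing 4, 5, or 6
two_then_ee=$(echo "$sample" | sed -n '/^..EE/p')  # any two characters, then EE

echo "$lower"
echo "$digits"
echo "$two_then_ee"
```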
The following command removes the carriage return and the tab characters
from the text file ControlChars.txt:
You cannot see the tab character in the second sed command in the pre-
ceding code snippet; however, if you redirect the output to the file nocon-
trol1.txt, you can see that there are no embedded control characters in
this new file by typing the following command:
cat -t nocontrol1.txt
In the chapter describing grep you learned about back references, and
similar functionality is available with the sed command. The main difference
is that the back references can also be used in the replacement section of the
command.
The following sed command matches two consecutive occurrences of the
letter “a” and prints four of them:
aaaa
The following sed command replaces all duplicate pairs of letters with the
letters aa:
The output of the previous sed command is here (note the trailing “/ ”
character):
aa/aa/aa/
The preceding sed command uses the @ character as a delimiter. The char-
acter class [0-9] matches one single digit. Since there are four digits in the
input string 1234, the character class [0-9] is repeated four times, and the
value of each digit is stored in \1, \2, \3, and \4. The output from the pre-
ceding sed command is here:
1,234
A more general sed expression that can insert a comma in five-digit num-
bers is here:
12,345
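The back-reference commands discussed above are not reproduced in this copy, so the following is a sketch of commands that produce the same outputs. The third expression is one possible generalization, not necessarily the one the author used.

```shell
# \(aa\) captures two consecutive a's; \1\1 writes the captured text twice.
four_a=$(echo "aa" | sed -n 's/\(aa\)/\1\1/p')

# Each \([0-9]\) captures one digit, and "@" serves as the delimiter.
with_comma=$(echo "1234" | sed 's@\([0-9]\)\([0-9]\)\([0-9]\)\([0-9]\)@\1,\2\3\4@')

# Assumed generalization: capture the leading digits and the final three
# digits separately, then put a comma between the two groups.
general=$(echo "12345" | sed 's/\([0-9]\{1,\}\)\([0-9]\{3\}\)$/\1,\2/')

echo "$four_a"
echo "$with_comma"
echo "$general"
```

The generalized form also turns 1234 into 1,234, because the first group only has to contain at least one digit.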
group appears later in the string instead of earlier in the string. Keep in mind
that Perl supports RE forward references, whereas other languages (such as
JavaScript) do not support RE forward references.
Use the symbol “=” to denote a forward reference in an RE. The following
syntax shows you how to specify whether or not a forward reference contains
a string, as shown here:
In the previous chapter we solved this task using the egrep command, and
this section shows you how to solve this task using the sed command.
For simplicity, let’s work with a text string, and that way we can see the
intermediate results as we work toward the solution. The approach will be
similar to the code block shown earlier which counted unique words. Let’s
initialize the variable x as shown here:
The first step is to split x into one word per line by replacing spaces with
newlines:
The second step is to invoke sed with the regular expression ^[a-zA-Z]+,
which matches any string consisting of one or more uppercase and/or
lowercase letters (and nothing else). Note that the -E switch is needed to parse
this kind of regular expression in sed, because it uses extended regular
expression syntax that was not available when sed was new.
ghi
abc
Ghi
If you also want to sort the output and print only the unique words, pipe the
result to the sort and uniq commands:
Ghi
abc
ghi
If you want to extract only the integers in the variable x, use this command:
123
If you want to extract alphanumeric words from the variable x, use this
command:
123
123z
Ghi
abc
ghi
Now you can replace echo $x with a dataset in order to retrieve only
alphabetic strings from that dataset.
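The steps in this section can be sketched end to end as follows. The value of x is an assumption chosen to be consistent with the outputs listed above, and the trailing $ anchor is added so that each pattern must cover the whole word.

```shell
export LC_ALL=C   # byte-wise ordering, so sort places uppercase before lowercase

# Assumed value of x; the original assignment is not shown in the text.
x="ghi abc Ghi 123 #def5 123z"

# Step 1: one word per line. Step 2: keep lines made up only of letters.
words=$(echo "$x" | tr -s ' ' '\n' | sed -nE '/^[a-zA-Z]+$/p')

# Sorted, unique alphabetic words.
unique=$(echo "$x" | tr -s ' ' '\n' | sed -nE '/^[a-zA-Z]+$/p' | sort | uniq)

# Only the integers, and then only the alphanumeric words.
integers=$(echo "$x" | tr -s ' ' '\n' | sed -nE '/^[0-9]+$/p')
alnum=$(echo "$x" | tr -s ' ' '\n' | sed -nE '/^[0-9a-zA-Z]+$/p' | sort | uniq)

echo "$words"
echo "$unique"
echo "$integers"
echo "$alnum"
```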
This concludes the portion of the chapter pertaining to the sed command.
The next portion of the chapter discusses the awk command, along with many
simple code snippets that perform a variety of tasks.
The awk (Aho, Weinberger, and Kernighan) command has a C-like syntax,
and you can use this utility to perform very complex operations on numbers
and text strings.
Awk offers nearly the flexibility of an entire programming language, yet
Unix/Linux treats it like any other command. As such, it is the go-to command
when grep and sed are not enough to get the job done.
As a side comment, there is also the gawk command, which is GNU awk, as
well as the nawk command, which is “new” awk (neither command is discussed in
this book). One advantage of nawk is that it allows you to set the value of an
internal variable externally.
Other built-in variables include FILENAME (the name of the file that awk
is currently reading), FNR (the current record number in the current file), NF
(the number of fields in the current input record), and NR (the number of input
records awk has processed since the beginning of the program’s execution).
Consult the online documentation for additional information regarding
these (and other) arguments for the awk command.
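As a brief illustration of two of these built-in variables, the following sketch prints the record number and the field count for each line of a small, hypothetical input.

```shell
# NR counts the records read so far; NF is the number of fields in the
# current record.
summary=$(printf 'a b c\nd e\n' | awk '{ print NR ": " NF " fields" }')
echo "$summary"
```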
whatever follows it, in this case a space. Switches will often provide a shortcut
to an action that normally needs a command inside a BEGIN{} block):
x="a b c d e"
echo $x |awk -F" " '{print $1}'
a
echo $x |awk -F" " '{print NF}'
5
echo $x |awk -F" " '{print $0}'
a b c d e
echo $x |awk -F" " '{print $3, $1}'
c a
Yet another way is shown here (but as we’ve discussed earlier, it can be inef-
ficient, so only do it if the cat command is adding value in some way):
This simple example of four ways to do the same task should illustrate
why commenting awk calls of any complexity is almost always a good idea.
The next person to look at your code may not know/remember the syntax you
are using.
Listing 5.8 contains a printf() statement that displays the first four
fields of each row in the file columns2.txt, where each field is 10 charac-
ters wide.
The output from launching the code in Listing 5.8 is here:
one * two* * *
three * four* * *
one * two* three* four*
five * six* * *
one * two* three* *
four * five* * *
Keep in mind that printf is reasonably powerful and as such has its own
syntax, which is beyond the scope of this chapter. A search online can find the
manual pages and also discussions of “how to do X with printf().”
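A minimal sketch of fixed-width output with printf() is shown here. Both the two-row input and the format string are assumptions: every field is left-justified in a ten-character column and followed by an asterisk, which may differ from the alignment used in the original listing.

```shell
# Hypothetical two-row input standing in for columns2.txt.
formatted=$(printf 'one two\nthree four five six\n' |
  awk '{ printf("%-10s*%-10s*%-10s*%-10s*\n", $1, $2, $3, $4) }')
echo "$formatted"
```

Fields that do not exist in a row (such as $3 and $4 in the first row) are empty strings, so printf() pads them to ten spaces.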
If we can match a simple pattern, by now you probably expect that you can
also match a regular expression, just as we did in grep and sed. Listing 5.9
displays the contents of Patterns1.sh, which uses metacharacters to match
the beginning and the end of a line of text in the file columns2.txt.
one
five
four
The following code snippet prints the first and third columns of the lines of
text in products.txt whose cost equals 300:
awk ' $2 == 300 { print $1, $3 }' products.txt
The following code snippet prints the first and third columns of the lines of
text in products.txt that start with the string Tablet:
awk '/^Tablet/ { print $1, $3 }' products.txt
The example in this section shows you how to switch any pairs of columns
(and display them) in the rows of a text file. Listing 5.13 displays the contents
of switchcolumns.sh, which performs this task. Notice that the code does
not require any REs.
As you can see, the if statement in Listing 5.13 processes the rows that con-
tain at least six columns and prints the sixth column and the third column. The
output from Listing 5.13 is here:
four,one
two,three
three,three
The example in this section shows you how to reverse the order of all
the columns in each row in a text file. Listing 5.14 displays the contents of
Listing 5.15 consists of a one-line for loop that contains the logic required
to reverse the fields in each row of manycolumns.txt. In fact, you could
even replace the contents of Listing 5.14 with the following one-liner:
ten
four three two one
four three two one four three
three two one four three two one
seven six five
five four three two one
three two one three two one three two one
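One way to write such a field-reversing one-liner is sketched below. It is not the original command, which is not reproduced here, and the two input rows are hypothetical stand-ins for manycolumns.txt.

```shell
# Start with the last field, then append the remaining fields right to left.
reversed=$(printf 'one two three four\nfive six seven\n' |
  awk '{ out = $NF; for (i = NF - 1; i > 0; i--) out = out " " $i; print out }')
echo "$reversed"
```

Building the line in a variable before printing it avoids a trailing space at the end of each row.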
Listing 5.16 initializes the array lines with all the rows of the input file, and
the END block contains a loop that prints the contents of lines in reverse
order. You could even replace the contents of Listing 5.16 with the following
one-liner:
awk '{ lines[i++]=$0 } END { for(j=i-1;j>=0;j--)print lines[j]; }' manycolumns.txt
Incidentally, the BSD version of the Unix tail command can also reverse
the order of the rows in a file, and it’s much simpler than the awk script:
tail -r manycolumns.txt
The example in this section shows you how to switch pairs of columns in a
text file. For example, we can switch the first two columns, and also switch the
third and fourth columns, after we verify that they exist. Listing 5.17 displays
the contents of switchcolumns.sh, which performs this task. Notice that
the code does not require any REs.
two one
four three
two one
four three
six five
two one
five four
two,one
four,three
two,one,four,three
six,five
two,one
five,four
The examples in the previous section work correctly for rows containing
two or four columns, but they can become difficult to generalize in rows that
have an arbitrarily large number of columns. However, the awk-based code
example in this section does enable you to switch consecutive columns in a row,
regardless of the number of columns in that row.
Listing 5.19 displays the contents of switchcolumns3.sh, which
switches each pair of consecutive columns in manycolumns.txt.
# print linefeed
printf("\n")
}
' manycolumns.txt
Listing 5.19 initializes the variable line as the current line and creates an
array field whose contents are the columns of line. Next, the variable fc2 is
calculated as the largest even number that is no greater than the length of the
array field.
The next portion of Listing 5.19 contains a loop that switches consecutive
columns of the current line. Notice that the subsequent if statement prints
the rightmost field of the current line if the line has an odd number of fields.
The last code snippet prints a linefeed (otherwise we would have a single line
of output).
The output from the awk script is here:
ten
two,one,four,three,
four,three,two,one,four,three,
two,one,four,three,two,one,three
six,five,seven
two,one,four,three,five
two,one,one,three,three,two,two,one,three
There is one more detail to fix: remove the trailing “,” that appears in rows
with an even number of fields (can you explain why that happens?). One way
to remove the trailing “,” is with the sed command:
./switchcolumns3.sh | sed "s/,$//"
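Because only the tail end of Listing 5.19 is visible here, the following is a reconstruction based on the description above. The variable names line, field, and fc2 come from the text; the loop body and the three sample rows are assumptions.

```shell
swapped=$(printf 'ten\none two three four\nfive six seven\n' | awk '
{
  line = $0
  n = split(line, field, " ")     # field[1..n] holds the columns
  fc2 = int(n / 2) * 2            # largest even number <= n

  # switch each pair of consecutive columns
  for (i = 1; i <= fc2; i += 2) {
    printf("%s,%s,", field[i + 1], field[i])
  }

  # an odd number of fields leaves one unpaired rightmost field
  if (n % 2 == 1) {
    printf("%s", field[n])
  }

  # print linefeed
  printf("\n")
}')

# Remove the trailing comma left on rows with an even number of fields.
cleaned=$(echo "$swapped" | sed 's/,$//')
echo "$cleaned"
```

In this sketch every swapped pair is printed with a comma after it, so only rows that end with a pair carry a trailing comma.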
As you can see, the solution in Listing 5.19 is elegant in its simplicity (are
you surprised?). In fact, there are even simpler solutions available, but
the current solution demonstrates some of the other things that you can do in
an awk script.
Although there are few situations where you need a shell script such as
switchcolumns3.sh (possibly never), the point to keep in mind is that this
task can be performed in a very simple manner, without the use of any REs. If
you think that the latter is easy to do, see if you can create a suitable
regular expression (hint: it’s very difficult!).
Another point to keep in mind: the complexity of the solution to a particular
task can vary among languages (or utilities), and it’s worthwhile learning dif-
ferent languages—such as those discussed in this book—so that you can solve
tasks more easily.
Finally, keep in mind that a short and simple solution is easier to debug and
enhance, not only for you but also for the people who inherit your code.
The example in this section is admittedly more contrived than the other
code samples, but it serves to illustrate the ease with which you can solve com-
plex tasks with very simple awk scripts.
The awk script rotaterows.sh in this section does the following:
Listing 5.20 initializes the variable line as the current line and creates an
array field whose contents are the columns of line. Next, the if statement
checks whether the second field starts with the string six, in which case it
executes another code block that contains additional conditional logic. That
logic prints the fourth, third, and first columns if the current row has at
least four columns; otherwise, it prints the contents of the current line.
The else portion of the code in Listing 5.20 is executed when the sec-
ond column does not start with the string six, in which case a for loop
is executed that reverses the order of the columns in the current row. The
output from launching the code in Listing 5.20 is here:
ten
four three two one
four three two one four three
three two one four three two one
five six seven
five four three two one
three two one three two one three two one
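The conditional logic described for Listing 5.20 can be sketched as shown below. This is a simplified reconstruction, not the original script: it uses the built-in fields $1 through $NF instead of a separate array, and the three input rows are hypothetical.

```shell
rotated=$(printf 'one two three four\nfive six seven\none six three four five\n' | awk '
{
  if ($2 ~ /^six/) {
    if (NF >= 4) {
      print $4, $3, $1        # fourth, third, and first columns
    } else {
      print $0                # fewer than four columns: print unchanged
    }
  } else {
    out = $NF                 # reverse the order of the columns
    for (i = NF - 1; i > 0; i--) out = out " " $i
    print out
  }
}')
echo "$rotated"
```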
Notice that Listing 5.20 is slightly shorter than Listing 5.19 from the previ-
ous use case, even though the current task is arguably more complex.
If you still aren’t convinced of the power of awk scripts, suppose you need
to do the following:
If you have read the code in the previous two sections, the code in Listing
5.21 ought to be self-explanatory. Notice that Listing 5.21 has the same num-
ber of lines of code as Listing 5.20, despite having slightly greater complexity
in terms of conditional logic.
Another point to notice is that Listing 5.21 is a straightforward implemen-
tation of the description of the task: if you read the code aloud, it’s almost like
English sentences, and the code contains only two simple REs.
Chapter Summary
This chapter introduced you to the sed utility, illustrating the basic tasks
of data transformation: allowing additions, removal, and mutation of data by
matching individual patterns, or matching the position of the rows in a file, or
a combination of the two.
Moreover, we showed that sed not only uses REs to match data, similar
to the grep command, but can also use REs to describe how to transform the
data.
Next you learned about the awk command, which is its own programming
language that supports REs. A series of examples showed the versatility of the
awk command, and hopefully communicated the sense that it is an even more
flexible and powerful utility than we can show in a single chapter.
Now that you have finished this book, you might be interested in “next
steps” to learn more about REs. The answer to this question varies widely,
mainly because the answer depends heavily on your objectives. The best an-
swer is to try techniques from the book out on a problem or task you care
about, professionally or personally. Precisely what that might be depends on
who you are, as the needs of a data scientist, manager, student, or developer
are all different. In addition, keep what you learned in mind as you tackle new
challenges. Sometimes knowing a technique is possible makes finding a solu-
tion easier, even if you have to reread the section to remember exactly how
the syntax works. In addition, there are various online resources and literature
describing how to create complex and arcane regular expressions.
At this point there is one more thing to say: congratulations! You have com-
pleted a fast-paced yet dense book, and if you are an RE neophyte, the mate-
rial will probably keep you busy for many hours. The examples in the chapters
provide a solid foundation, and the Appendices contain additional examples
of REs in Perl, Java, and Scala. The combined effect demonstrates that the
universe of possibilities is larger than the examples in this book, and ultimately
they will spark ideas in you. Good luck!
Appendix A
REs in Perl
This Appendix contains an assortment of REs in Perl, with code snippets
from earlier chapters that have been converted to Perl syntax. Please
keep in mind that you will learn only rudimentary Perl functionality
that pertains to REs, and that Perl has powerful features that are not discussed
because they are beyond the scope of this Appendix.
The first section of this chapter is similar to the examples in Chapter 1, but
without fully replicating the same details. Although the REs in this section are
often the same as their counterparts in Chapter 1, there are some syntactic
differences when you invoke Perl “one-liners” from the command line, versus
doing so with the grep command.
The second section in this chapter contains a description of metacharacters
and character classes, along with code snippets that illustrate how to use them.
For example, you will see how to match alphabetic characters (uppercase, low-
ercase, or a combination of both types), pure digits, and regular expressions
with combinations of digits and alphabetic characters.
The third section contains REs that match dates, phone numbers, and zip
codes. This section also contains REs that match various types of numbers,
such as integers, decimals, hexadecimals, octals, and binary numbers. You will
also learn how to work with scientific numbers and REs.
The final section contains REs that match IP addresses and simple com-
ment strings (in source code), as well as REs for matching ISBNs.
Recall that Chapter 1 uses the Unix grep utility and the Unix egrep util-
ity to illustrate various REs, whereas this Appendix uses the Perl executable.
If you work on a PC, please read the Preface for information about software to
download to your PC so that you can run Perl commands.
As you can see, the word grey appears in the first and second lines, the
word gray appears in the first and third lines, and all three lines contain
either grey or gray.
Here are the tasks that we want to perform:
-w : Use warnings.
-l : Remove (“chomp” in Perl parlance) the newline character from each
line before processing and place it back during printing.
-n : Create an implicit while(<>) { ... } loop to perform an action on each
line.
-e : Direct the Perl interpreter to execute the code that follows it.
Finally, print the entire line if the line contains the word gray.
or grey in a text file, which means matching with the vowel a or the vowel e.
Square brackets provide this functionality: the term [ae] means “use either a
or e” (and later you’ll see other variations, such as a range of letters or numbers).
The following command performs the third task listed in the previous section:
The term gr[ae]y is a compact way of representing the two strings gray
and grey. The order of the letters in the square brackets is irrelevant, which
means that the third task can also be solved with this command:
The only matches are grey and gray, but if the text file included a line
with the string grzy, then this line would appear in the previous output.
We can also specify a single letter inside the square brackets. For example,
the term [a] is an RE that matches the letter a. Launch this command:
Once again, the order of the letters in the square brackets is irrelevant,
which means that the following commands have the same output:
The “^” metacharacter matches a pattern that starts from the beginning of
a line. For example, the RE “^the” matches any lines that start with the string
the, as shown here:
On the other hand, the RE “^[the]” matches any lines that start with
either a t, or an h, or an e, as shown here:
By contrast, the following RE matches any lines that do not start with the
letter t, and in this case, there are no matching lines:
Since every line starts with the letter t, you can specify any other letter in
the preceding code snippet and the result matches all the lines in the text file.
For example, the following RE matches all lines:
Notice that the first line is excluded: the next section explains why this hap-
pened, and also the type of RE that will match the first line.
occurrences of any character. The “*” metacharacter is useful when you want
to match the intervening letters between a start character (or word) and an end
character (or word).
For example, if you want to match the lines that start with the letter t, fol-
lowed by an occurrence of the word gray, use this expression:
Notice how the “*” metacharacter enables you to “ignore” the intervening
characters between the initial t and the occurrence of the word gray some-
where else in a line.
If you want to match the lines that start with the word the, followed by an
occurrence of the word gray, use this expression:
You can match the final “.” character with the following expression:
x
y
z w
Match all lines that start with a whitespace with this expression:
x
Match all lines that end with a whitespace with this expression:
The output is a blank line, which you will see on the screen. Note that
matching an empty line is different from matching a line containing only
whitespaces.
Escaping a Metacharacter
If you want to match the lines that start with the letter t and also end with
the word gray, use this expression:
If you want to match the lines that contain a “.”, use this expression:
If you want to match the lines that match .doc, use this expression:
The following expression matches the lines that end with .doc:
cat
catty
catfish
small catfish
If you want to match the lines that contain dog, use this expression:
perl -wln -e 'print if /dog/' lines3.txt
If you want to match the lines that start with the word dog, use this expres-
sion:
perl -wln -e 'print if /^dog/' lines3.txt
If you want to match the lines that end with the word dog, use this expres-
sion:
perl -wln -e 'print if /dog$/' lines3.txt
If you want to match the lines that start and also end with the word dog,
use this expression:
perl -wln -e 'print if /^dog$/' lines3.txt
If you want to match the lines that start with a blank space, use this expression:
perl -wln -e 'print if /^ /' lines3.txt
catfish
If you want to match the lines that start with a period, use this expression:
perl -wln -e 'print if /^\./' lines3.txt
If you want to match the lines with any occurrence of a period, use this
expression:
perl -wln -e 'print if /\./' lines3.txt
By contrast, the following expression matches all non-empty lines because the
“.” metacharacter has not been escaped:
perl -wln -e 'print if /^./' lines3.txt
The following expression matches lines that contain a word that ends
with cat:
perl -wln -e 'print if /cat\b/' lines3.txt
The following expression matches lines that contain a space, followed by
any characters, and then followed by the string cat:
perl -wln -e 'print if /[ ].*cat/' lines3.txt
The following expression matches lines that contain the letter r or the letter e:
grey.
.gray
The following expression matches lines that contain the letter g, followed
by either the letter r or the letter e:
grey.
The following expression matches lines that contain a period before a ques-
tion mark “?”:
maybe...? perhaps?
The following expression matches lines that contain the word or:
yes123? or no?
either/or? or yes?
The following expression matches lines that match the sequence of the
word or, followed by a ?, followed by a blank space, and then another occur-
rence of or:
either/or? or yes?
The following expression matches lines that contain three consecutive dot
“.” characters:
maybe...? perhaps?
yes?
yes123? or no?
maybe?
maybe...? perhaps?
either/or? or yes?
The following expression matches lines that start with the word yes, and
are optionally followed by the number 123:
yes?
yes123? or no?
maybe?
maybe...? perhaps?
either/or? or yes?
The following expression matches lines that start with two digits (followed
by anything):
05/12/18
05/12/2018
05912918
05.12.18
05.12.2018
0591292018
The following expression matches lines that start with two digits (followed
by a forward slash):
05/12/18
05/12/2018
The following expression matches lines that start with two digits (followed
by a forward slash or period):
05/12/18
05/12/2018
05.12.18
05.12.2018
The following expression matches lines that end with two digits that are
preceded by a forward slash or period:
perl -wln -e 'print if /[\/.][0-9][0-9]$/' lines5.txt
The following expression matches lines that contain four consecutive digits:
perl -wln -e 'print if /[0-9][0-9][0-9][0-9]/' lines5.txt
The following expressions both match the lines in lines5.txt in which four
consecutive digits are preceded by a forward slash or period; the second
expression additionally requires those digits to end the line:
perl -wln -e 'print if /[\/.][0-9][0-9][0-9][0-9]/' lines5.txt
perl -wln -e 'print if /[\/.][0-9][0-9][0-9][0-9]$/' lines5.txt
05/12/2018
05912918
05.12.2018
0591292018
There is also a simpler way to match multiple consecutive digits via the
\d character class. The following expression matches lines that contain three
consecutive digits:
The following expression matches lines that contain a pair of digits followed
by a non-digit character:
The following expression matches lines that contain three pairs of digits
that are separated by a non-digit character:
The following expression matches lines that contain three pairs of digits
that are separated by a non-digit character, and also excludes four-digit se-
quences:
36K8Z3
123-45-6789
jsmith@acme.com
john.smith@acme.com
650 123-4567
650 123 4567
(650) 123 4567
1-650 123-4567
The following RE matches strings that contain five digits, which is a com-
mon U.S. zip code pattern:
perl -wln -e 'print if /\d{5}/' lines6.txt
94053
94053-06123s
9405306123
The following expression matches U.S. zip codes consisting of five digits:
perl -wln -e 'print if /^\d{5}$/' lines6.txt
The following expression matches U.S. zip codes consisting of five digits
followed by a hyphen, and then followed by another five digits:
perl -wln -e 'print if /^\d{5}-\d{5}$/' lines6.txt
Valid Canadian postal codes are of the form A1A 1A1, where A is a capital
letter and 1 is a digit (with a space between the two triplets). The following RE
matches Canadian postal codes:
egrep "^[A-Z][0-9][A-Z] [0-9][A-Z][0-9]" lines6.txt
V6K 8Z3
Matching email addresses is a complex task. This section provides REs that
match common (but not all) email addresses that have the following pattern:
1. an initial string having at least four characters and at most twelve char-
acters (which can be any combination of lowercase letters, uppercase
letters, or digits), then
2. followed by the “@” symbol, then
3. a string having at least four characters and at most twelve characters
(which can be any combination of lowercase letters, uppercase letters,
or digits), then
4. followed by the string “.com”
Here is the RE that has the structure described in the preceding list that
matches an email address:
jsmith@acme.com
There are some limitations regarding the preceding RE. First, it only
matches email addresses with the suffix .com. Second, longer (yet still valid)
email addresses are excluded, such as the one shown here:
myverylongemailaddress@acme.com
john.smith@acme.com
The section shown in bold in the preceding RE shows you how to match
the dot “.” character, followed by an alphanumeric string that has at least four
characters and at most twelve characters.
The following RE matches U.S. phone numbers of the form ddd ddd dddd:
The following RE matches U.S. phone numbers of the form ddd ddd-
dddd:
650 123-4567
The following RE matches U.S. phone numbers of the form (ddd) ddd-
dddd:
(650) 123-4567
The following RE matches U.S. phone numbers of the form 1-ddd ddd-
dddd:
1-(650) 123-4567
This section contains examples of REs that match integers, floating point
numbers, hexadecimal numbers, octal numbers, and binary numbers. The sub-
sequent section discusses REs for scientific numbers, which are a “generaliza-
tion” of decimal numbers: they are more complex, and so they merit their own
section.
Listing A.8 displays the contents of numbers.txt, which is used in some
RE code snippets in this section.
#hexadecimal numbers
12345
FA4389
0xFA4389
0X4A3E5C
#octal numbers
1234
03434
#binary numbers
010101
110101
0b010101
Notice that integers, octal numbers, and binary numbers also appear in the
preceding list (because they are valid hexadecimal numbers).
The following RE matches hexadecimal numbers that start with either 0x
or 0X:
perl -wln -e 'print if /^(0x|0X)[a-fA-F0-9]+$/' numbers.txt
Notice that there are two occurrences of the number 1234: the first one
appears as an integer (and it’s a valid octal number) and the second one appears
in the section with octal numbers. Moreover, the number 110101 from the
binary section is also a valid octal number.
1234
12345
1234
03434
Once again, there are two occurrences of the number 1234: the first one
appears as an integer (and it’s a valid octal number) and the second one
appears in the section with octal numbers.
010101
110101
010101
110101
0b010101
This section contains examples of REs that match scientific numbers and
hexadecimal numbers.
Listing A.9 displays the contents of lines7.txt and Listing A.10
displays the contents of lines8.txt, which are used in some code snippets.
Matching all scientific numbers (and nothing else) is rather complex, and
this section contains some REs that partially succeed in this task. A useful
exercise for you is to determine why these REs contain “false positives” (i.e.,
strings that you want to exclude).
Option #1: the following RE matches scientific numbers:
perl -wln -e 'print if /^[+-]?\d*(([,.]\d{3})+)?([,.]\d+)?([eE][+-]?\d+)?$/' lines8.txt
192.168.123.065
+0005
125.e12
*** Option #8:
--------------
0.123
z = 0xFFFF00;
+13
423.2e32
-7.20e+19
-.4E-8
-27.6603
+0005
125.e12
As you can see, the REs in Listing A.10 have varying degrees of success in
terms of matching scientific numbers. In general, they err by matching “false
positives” (strings that are not valid scientific numbers) rather than by
missing valid scientific numbers (“false negatives”).
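One way to see where a false positive comes from is to test a stricter RE against the same sample strings. The following Java sketch is an illustration only, not one of the book's options: the class name, the helper isScientific, and the definition of a scientific number encoded in the RE (optional sign, digits with an optional fraction or a fraction alone, optional exponent) are all assumptions.

```java
import java.util.regex.Pattern;

class ScientificNumberDemo {
    // Assumed definition: optional sign, then either digits with an optional
    // fractional part or a fraction with no leading digits, then an optional
    // exponent. At least one digit is always required.
    static final Pattern SCIENTIFIC =
        Pattern.compile("[+-]?(\\d+\\.?\\d*|\\.\\d+)([eE][+-]?\\d+)?");

    // Hypothetical helper: true only if the entire string fits the pattern.
    static boolean isScientific(String s) {
        return SCIENTIFIC.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "0.123", "+13", "423.2e32", "-7.20e+19", "-.4E-8",
            "192.168.123.065", "z = 0xFFFF00;", ""
        };
        for (String s : samples) {
            System.out.println("[" + s + "] -> " + isScientific(s));
        }
    }
}
```

Because every part of the Option #1 RE is optional, it also matches an empty line, and its repeated three-digit group accepts 192.168.123.065. The sketch rejects both, at the cost of no longer accepting thousands separators.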
The following snippet matches the lines that contain the strings http or
https:
perl -wln -e 'print if /http/' urls.txt
ftp://www.acme.com
http://www.bdnf.com
https://www.ceog.com
a line with https://www.ceog.com embedded in it
The following snippet matches the lines that contain the strings http or
https:
http://www.bdnf.com
https://www.ceog.com
a line with https://www.ceog.com embedded in it
The following snippet matches the lines that contain the strings ftp,
http, or https:
ftp://www.acme.com
http://www.bdnf.com
https://www.ceog.com
a line with https://www.ceog.com embedded in it
The following snippet matches the lines that contain the string http
embedded in the line of text:
One interesting point: the equivalent RE with the egrep command is here
(the initial whitespace is specified in a different location):
The preceding code snippet specifies a whitespace and any lowercase letter
in this expression: [ a-z]. However, the corresponding section in the Perl
expression must include the whitespace after the range of lowercase letters:
[a-z ]. If you do not make this slight modification, you will see the following
error message:
Unquoted string "a" may clash with future reserved word at -e line 1.
syntax error at -e line 1, near "a-z"
Valid ISBNs can start with the optional string ISBN, and also contain either
ten-digit sequences or thirteen-digit sequences. Listing A.13 displays the
contents of ISBN.txt, which contains examples of valid ISBNs.
Notice that the first line in Listing A.14 contains the string ISBN followed
by a blank space, and the next two lines contain the string ISBN, followed by a
hyphen, and then two more digits, and then either a colon “:” or a blank space.
Those two lines end with a hyphenated thirteen-digit number and a
hyphenated ten-digit number, respectively.
The fourth line in Listing A.14 contains a thirteen-digit number with white
spaces; the fifth line contains a “pure” thirteen-digit number; and the sixth line
contains a hyphenated ten-digit number.
Now let’s see how to match the numeric portion of the ISBNs in Listing
A.14. The following RE matches the digits in the first and the second lines:
\d{3}-\d-\d{3}-\d{5}-\d
The following RE matches the digits in the third line as well as the sixth line:
\d-\d{3}-\d{5}-\d
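The two REs above can be exercised from Java as follows. This is a minimal sketch under stated assumptions: the class IsbnDigitsDemo, the helper numericPortion, and the sample lines are made up for illustration and are not the contents of ISBN.txt. The thirteen-digit RE is tried first because the ten-digit RE also matches inside a hyphenated thirteen-digit number.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class IsbnDigitsDemo {
    // The two REs from the text, with backslashes doubled for Java strings.
    static final Pattern ISBN13 = Pattern.compile("\\d{3}-\\d-\\d{3}-\\d{5}-\\d");
    static final Pattern ISBN10 = Pattern.compile("\\d-\\d{3}-\\d{5}-\\d");

    // Hypothetical helper: returns the hyphenated numeric portion of a line,
    // or null if neither RE finds a match.
    static String numericPortion(String line) {
        Matcher m13 = ISBN13.matcher(line);
        if (m13.find()) {
            return m13.group();
        }
        Matcher m10 = ISBN10.matcher(line);
        if (m10.find()) {
            return m10.group();
        }
        return null;
    }

    public static void main(String[] args) {
        // Made-up sample lines in the two hyphenated layouts.
        System.out.println(numericPortion("ISBN-13: 978-0-596-52068-7"));
        System.out.println(numericPortion("ISBN-10 0-596-52068-9"));
        System.out.println(numericPortion("no ISBN on this line"));
    }
}
```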
Now let’s create REs for the text prefix (when present) and combine them
with the earlier list of REs to match entire lines in Listing A.14. The result
involves four REs, as shown in the following examples:
Now we can combine the preceding four REs to create a single (and
lengthy) RE that matches every valid ISBN in the text file ISBN.txt:
Miscellaneous Patterns
This section contains examples of REs that match simple comment strings
(in source code). Listing A.15 displays the contents of lines7.txt, which is
used in some code snippets.
// this is a comment
// this is a comment
v = 7; // this is also a comment
This section contains examples of REs that match IP addresses that are in
Listing A.15 in the previous section.
The following RE matches arbitrary valid IP addresses:
192.168.3.99
192.168.123.065
The following RE matches valid IP addresses that contain three digits in all
four components:
192.168.123.065
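As a rough illustration of the two cases described above, here is a Java sketch with one RE for dotted addresses in general and one that requires three digits in every component. Both REs, the class name, and the helper names are assumptions made for this example and are not the book's REs. Neither RE checks that each component is at most 255, so a string such as 999.999.999.999 is also accepted.

```java
import java.util.regex.Pattern;

class IpAddressDemo {
    // Assumed RE: four groups of one to three digits separated by dots.
    static final Pattern ANY_IP = Pattern.compile("\\d{1,3}(\\.\\d{1,3}){3}");
    // Assumed RE: exactly three digits in all four components.
    static final Pattern THREE_DIGIT_IP = Pattern.compile("\\d{3}(\\.\\d{3}){3}");

    static boolean looksLikeIp(String s) {
        return ANY_IP.matcher(s).matches();
    }

    static boolean hasThreeDigitParts(String s) {
        return THREE_DIGIT_IP.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"192.168.3.99", "192.168.123.065", "192.168.3"};
        for (String s : samples) {
            System.out.println(s + " -> " + looksLikeIp(s)
                + ", three digits: " + hasThreeDigitParts(s));
        }
    }
}
```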
This section contains examples of REs that match mixed-case strings
(typically user names). Listing A.15 displays the contents of lines10.txt, which
is used in some code snippets.
The following RE matches mixed-case strings that end with a period “.”:
perl -wln -e 'print if /[A-Z][a-z]+\./' lines9.txt
The following RE matches strings that start with an uppercase letter
followed by a space, another lowercase string, and end in a period “.”:
Another RE that uses the “|” metacharacter to match strings that contain
either John or john is here:
catfish
The following expression matches lines that contain one or more whitespace
characters, followed by any number of characters, and then the string cat:
perl -wln -e 'print if /\s+.*cat/' lines3.txt
catfish
small catfish
grey.
.gray
dog
doggy
cat
catty
grey.
dog
doggy
cat
catty
The following expression matches lines that start with a non-word character,
followed by the string cat:
perl -wln -e 'print if /^\Wcat/' lines3.txt
If you read Chapter 5, you recognize that the RE (shown in bold) in the
preceding code snippet bears an uncanny resemblance to a sed-based RE.
Now let’s look at a Perl-based RE for replacing non-digits with blank spaces,
as shown here:
1234 4 1
5678 3 2
9012 2 3
3456 1 4
As you can see, each of the four output lines starts with five blank spaces
because the preceding Perl snippet replaces a non-digit with a blank. Since the
lines in alphanums.txt start with a quote (“), followed by three capital letters,
then another quote (“), those five characters are replaced by blanks. Similar
comments apply to the other whitespaces that appear in the output.
Now consider the following REs that match lines with words ending with ‘g’:
perl -ne 'print if /g\w+\b/' lines1.txt
perl -ne 'print if /g\w+\b/ ' lines1.txt
perl -ne 'print if /g\w+\b/ ' lines1.txt
perl -ne 'print if / g\w+\b/' lines1.txt
perl -ne 'print if /[^g ]g\w+\b/' lines1.txt
perl -ne 'print if /[ ]g\w+\b/' lines1.txt
perl -ne 'print if /[ ]g\S+\b/' lines1.txt
[empty output]
perl -ne 'print "$&\n" if /[^ ]g\w*\b/' lines1.txt
og
og
The code snippets are grouped in blocks, and in each block the code snippets
look very similar, but they have subtle differences that require a solid
understanding of metacharacters in REs.
While it’s virtually impossible to check all possible combinations of
characters in a text string, you need to be vigilant and test your REs on a large
variety of patterns to minimize the likelihood of matching (or not matching)
an “outlier” string.
Summary
This Appendix started with an introduction to some basic REs in Perl,
followed by examples that illustrate how to match (or how to not match)
characters or words. Next you learned about the metacharacter “^” and how its
interpretation depends on its location in an RE, followed by the “$”
metacharacter for matching strings at the end of a line.
You then saw how to use the metacharacters “.”, “*”, and “\” to create REs
that are combinations of metacharacters, along with “escaping” the meaning of
metacharacters.
Moreover, you learned how to create REs for common strings, such as
dates, U.S. phone numbers, zip codes (U.S. and Canadian), and some email
addresses. Then you saw how to detect IP addresses and comments in source
code, as well as create REs for matching ISBNs.
Although the REs in this Appendix are not exhaustive, they do provide you
with enough information to help you define REs that are more comprehensive.
In addition, you are now in a good position to convert the other REs in
Chapter 2 (that are not covered in this Appendix) into Perl-based REs.
Appendix B
REs in Java
This short Appendix introduces you to REs in Java, with code samples
that illustrate how to work with REs in Java programs. Keep in mind
that the Java code samples in this Appendix contain compiled code,
which differs from the Perl Appendix and all the book chapters. However,
even if you are new to Java, the REs in the code samples have already been
discussed in the book chapters, so you can easily follow the Java code. In any
case, this Appendix is optional.
The first section of this Appendix contains an eclectic mix of REs (all of
which appear in Chapter 1 or Chapter 2) in complete Java code samples.
Although there are fewer Java code samples in this Appendix (compared to the
number of REs in Chapter 1), they contain a greater assortment of REs.
The second part of this Appendix contains some code snippets with
Scala-based REs. This section is extremely short, and if you like the Java section,
then this section will be a very simple transition. If you want more practice,
feel free to convert the code samples in Chapter 1 into their Scala-based
counterparts.
Please keep in mind the following points when you read this Appendix.
First, the discussions of metacharacters and character classes in Chapter 1 are
not repeated in this Appendix.
Second, if you want to launch the Java code samples from the command
line, you might need to perform an Internet search in order to
download and install the necessary software for your platform. Alternatively,
you can simply read the output that accompanies the code samples in this
Appendix (so you won’t need to launch the code yourself).
Third, this Appendix does not provide any tutorial-style material that
explains how to create, compile, and launch Java or Scala programs.
Fortunately, the code samples are rudimentary, and they do not require any
knowledge of OOP (Object Oriented Programming) to understand them. However,
// true
System.out.println("Pattern: hello");
System.out.println("Match: "+line1.matches("hello"));
// true
System.out.println("Pattern: [hH]ello");
System.out.println("Match: "+line1.matches("[hH]ello"));
// false
System.out.println("Pattern: he");
System.out.println("Match: "+line1.matches("he"));
// false
System.out.println("Pattern: goodbye");
System.out.println("Match: "+line1.matches("goodbye"));
}
}
Listing B.1 contains four pairs of code snippets that compare REs with
the string hello, and each code snippet is preceded by a comment line that
indicates whether the result is true or false. All four pattern matches rely on
the matches() method of the Java String class. The first result is obviously
true, and the second result—which is also true—shows you how to define a
simple RE with a character class.
The third result might surprise you: only a full match yields a true result,
and since the string he is a proper substring of hello, the result is false.
Finally, the fourth result is clearly false.
The output from launching the code in Listing B.1 is here:
Match: true
Pattern: [hH]ello
Match: true
Pattern: he
Match: false
Pattern: goodbye
Match: false
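The full-match behavior of matches() can be contrasted with a substring search, which the find() method of the java.util.regex.Matcher class performs. The following is a minimal sketch: the class name and the two helper methods are hypothetical, and the string hello is the value that the discussion of Listing B.1 describes.

```java
import java.util.regex.Pattern;

class FullVersusPartialMatch {
    // True only if the RE matches the entire string (same rule as String.matches()).
    static boolean fullMatch(String regex, String text) {
        return text.matches(regex);
    }

    // True if the RE matches anywhere inside the string.
    static boolean partialMatch(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        String line1 = "hello";
        System.out.println(fullMatch("he", line1));          // false
        System.out.println(partialMatch("he", line1));       // true
        System.out.println(fullMatch("[hH]ello", line1));    // true
        System.out.println(partialMatch("goodbye", line1));  // false
    }
}
```

If you want a substring match without leaving the String class, one option is to surround the RE with .* on both sides, as in line1.matches(".*he.*").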
// true
System.out.println("Pattern: ^[\\d].*");
System.out.println("Match: "+line1.matches("^[\\d].*"));
// true
System.out.println("Pattern: ^[\\d]+\\s+[\\d]+.*");
System.out.println("Match: "+
line1.matches("^[\\d]+\\s+[\\d]+.*"));
}
}
Listing B.2 contains a string that consists of two integers that are separated
by a space. The first RE matches a string that starts with a digit, followed by
zero or more arbitrary characters. The second RE matches a string that starts
with a number, followed by one or more whitespace characters, which is then
followed by another number, and then zero or more arbitrary characters. As you
can see, the string line1 matches both REs.
The output from launching the code in Listing B.2 is here:
===> LINE: 123 456
Pattern: ^[\d].*
Match: true
Pattern: ^[\d]+\s+[\d]+.*
Match: true
// true
System.out.println("Pattern: [hH]ello");
System.out.println("Match: "+line1.matches("[hH]ello"));
Listing B.3 contains two REs, where the first RE specifies a character class.
The second RE contains a pipe “|” symbol, which you know is used for an
either-or match. Based on your knowledge of REs, both pattern matches are
clearly true.
The output from launching the code in Listing B.3 is here:
// true
System.out.println("Pattern: ^[\\d].*");
System.out.println("Match: "+line1.matches("^[\\d].*"));
// true
System.out.println("Pattern: .*\\d{3}$");
System.out.println("Match: "+line1.matches(".*\\d{3}$"));
// true
System.out.println("Pattern: ^\\d{3}[a-z]+\\d{3}$");
System.out.println("Match: "+
line1.matches("^\\d{3}[a-z]+\\d{3}$"));
}
}
Listing B.4 contains a string consisting of three digits, followed by five
characters, and then another three digits. The first RE matches strings that start
with an integer. The second RE matches strings that end with three digits. The
third RE matches strings that start with three digits, followed by one or more
lowercase letters, and then end with three digits. Hence, the pattern match for
all three REs is true.
The output from launching the code in Listing B.4 is here:
// true
System.out.println("Pattern: [A-Za-z]+");
System.out.println("Match: "+line1.matches("[A-Za-z]+"));
// true
System.out.println("Pattern: ^[A-Za-z]+$");
System.out.println("Match: "+line1.matches("^[A-Za-z]+$"));
// true
System.out.println("Pattern: ^[A-Z][a-z]+");
System.out.println("Match: "+line1.matches("^[A-Z][a-z]+"));
// true
System.out.println("Pattern: ^[A-Z][a-z]{4,6}");
System.out.println("Match: "+
line1.matches("^[A-Z][a-z]{4,6}"));
}
}
// false
System.out.println("Pattern: .hello$");
System.out.println("Match: "+line1.matches(".hello$"));
// true
System.out.println("Pattern: \\.hello\\$");
System.out.println("Match: "+line1.matches("\\.hello\\$"));
// true
System.out.println("Pattern: \\.[A-Za-z]+\\$");
System.out.println("Match: "+
line1.matches("\\.[A-Za-z]+\\$"));
}
}
Listing B.6 contains a string that starts with a period “.”, followed by
lowercase letters, and ending with a dollar sign “$”. The first RE fails to match
the string because the initial “.” and final “$” are treated as metacharacters. By
contrast, the second RE successfully matches because the initial “.” and final
“$” are both “escaped” via a pair of consecutive backslash “\” characters. The
third RE is a modified version of the second RE: the hard-coded string hello is
replaced with the RE [A-Za-z]+, which matches the string hello.
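When an entire string should be treated literally, the Pattern.quote() method of the java.util.regex.Pattern class is an alternative to escaping each metacharacter by hand. The sketch below is not part of Listing B.6: the class name and helper are hypothetical, and the value .hello$ for line1 is assumed from the description above.

```java
import java.util.regex.Pattern;

class LiteralMatchDemo {
    // Hypothetical helper: treats every character of 'literal' as an ordinary
    // character, so "." and "$" lose their special meaning.
    static boolean matchesLiterally(String literal, String text) {
        return text.matches(Pattern.quote(literal));
    }

    public static void main(String[] args) {
        String line1 = ".hello$"; // assumed value, based on the description above

        // Unescaped: "." matches any character and "$" anchors the end of input.
        System.out.println(line1.matches(".hello$"));            // false
        System.out.println("xhello".matches(".hello$"));         // true
        // Escaped by hand, as in the second RE of the listing.
        System.out.println(line1.matches("\\.hello\\$"));        // true
        // Escaped as a whole with Pattern.quote().
        System.out.println(matchesLiterally(".hello$", line1));  // true
    }
}
```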
The output from launching the code in Listing B.6 is here:
The title of this section explicitly mentions date-related REs because Java
provides extensive support for dates and calendars via date-related and
calendar-related classes that match many different date formats. Hence, you do not
need to use REs in Java if you need to work with dates. In fact, those classes
provide many other date-related features that are unavailable in REs, and you
ought to explore those Java classes if you need more sophisticated date-related
functionality.
Keep in mind that this section does not use any date-related Java classes:
we’ll use simple REs to match patterns of strings that have two different date
formats.
Listing B.7 displays the contents of DateStrings.java, which
illustrates how to match some valid dates.
// true
System.out.println("Pattern: \\d{2}.\\d{2}.\\d{2}");
System.out.println("Match: "+
line1.matches("\\d{2}.\\d{2}.\\d{2}"));
// true
System.out.println("Pattern: \\d{2}.\\d{2}.\\d{2}");
System.out.println("Match: "+
line2.matches("\\d{2}.\\d{2}.\\d{2}"));
}
}
Listing B.7 contains two date-related strings: the first has the MM/DD/YY
format, and the second has the MM.DD.YY format. However, we can use the
same RE to match both date formats: \\d{2}.\\d{2}.\\d{2}. The preceding
RE contains the “.” metacharacter that matches the “/” in the first date
string and also the “.” in the second date string.
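Because the “.” metacharacter matches any character, the RE in Listing B.7 also accepts separators other than “/” and “.”. One possible way to tighten it is sketched here, using a character class for the separator and a backreference so that both separators must be the same. The class name, helper names, and sample strings are assumptions made for illustration.

```java
import java.util.regex.Pattern;

class DateSeparatorDemo {
    // The RE from Listing B.7: "." accepts any character as a separator.
    static final Pattern LOOSE = Pattern.compile("\\d{2}.\\d{2}.\\d{2}");
    // Assumed stricter variant: the separator must be "/" or ".", and the
    // backreference \1 requires the second separator to equal the first.
    static final Pattern STRICT = Pattern.compile("\\d{2}([/.])\\d{2}\\1\\d{2}");

    static boolean looseMatch(String s) {
        return LOOSE.matcher(s).matches();
    }

    static boolean strictMatch(String s) {
        return STRICT.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"12/31/99", "12.31.99", "12a31b99", "12/31.99"};
        for (String s : samples) {
            System.out.println(s + " -> loose: " + looseMatch(s)
                + ", strict: " + strictMatch(s));
        }
    }
}
```

The loose RE accepts all four samples, whereas the stricter one accepts only 12/31/99 and 12.31.99. Neither RE checks that the month and day values are in range.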
The output from launching the code in Listing B.7 is here:
// true
System.out.println("Pattern: \\d{5}");
System.out.println("Match: "+line1.matches("\\d{5}"));
// true
System.out.println("Pattern: \\d{5}(-\\d{5})");
System.out.println("Match: "+
line2.matches("\\d{5}(-\\d{5})"));
}
}
Listing B.8 contains two REs for U.S. zip codes. The first RE is \\d{5},
which matches many (most?) U.S. zip codes. The second RE is
\\d{5}(-\\d{5}), which matches U.S. zip codes that are qualified by an extra
five-digit sequence.
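The two REs in Listing B.8 can also be folded into a single RE by making the extension optional with the “?” metacharacter. The sketch below is an assumption-laden variant rather than the book's code: it uses a four-digit extension, which is the length of the standard ZIP+4 extension, instead of the five-digit extension in the listing, and the class name, helper, and sample values are hypothetical.

```java
import java.util.regex.Pattern;

class ZipCodeDemo {
    // Five digits, optionally followed by a hyphen and four more digits.
    static final Pattern ZIP = Pattern.compile("\\d{5}(-\\d{4})?");

    // Hypothetical helper: true only if the whole string is a zip code.
    static boolean isZip(String s) {
        return ZIP.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"94043", "94043-1351", "94043-", "9404", "94043-13510"};
        for (String s : samples) {
            System.out.println(s + " -> " + isZip(s));
        }
    }
}
```

Only the first two samples match; the others have a dangling hyphen, too few digits, or too many digits in the extension.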
This concludes the Java-related portion of this Appendix. The next section
contains a few examples of working with REs in Scala.
Scala provides the Regex class for handling REs, which delegates to the
java.util.regex package of the Java Platform. An instance of Regex
represents a compiled RE pattern. For performance reasons, it’s better to
construct frequently used REs only once (preferably outside of loops).
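Since the Scala Regex class delegates to java.util.regex, the same advice can be illustrated in Java, the language used in the rest of this Appendix. The following is a sketch with hypothetical names (CompileOnceDemo, countIntegerLines) and made-up input; it compiles the RE once and reuses it inside the loop.

```java
import java.util.regex.Pattern;

class CompileOnceDemo {
    // Compiled a single time when the class is loaded, then reused for every line.
    static final Pattern INTEGER = Pattern.compile("\\d+");

    // Hypothetical helper: counts the lines that consist only of digits.
    static int countIntegerLines(String[] lines) {
        int count = 0;
        for (String line : lines) {
            // Reusing INTEGER avoids compiling the RE again on each iteration,
            // which is what calling line.matches("\\d+") here would do.
            if (INTEGER.matcher(line).matches()) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String[] lines = {"123", "abc", "4567", "12a"};
        System.out.println(countIntegerLines(lines)); // 2
    }
}
```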
More information is available here:
http://www.scala-lang.org/api/current/scala/util/matching/Regex.html
As a simple example, the following RE in Scala matches an integer:
Even if you are unfamiliar with Scala, you can see that the preceding code
snippet initializes the variable num as an RE that consists of one or more digits
via the \d+ expression.
The following code snippet illustrates how to create an RE that matches
dates that consist of four digits, a hyphen, a pair of digits, a hyphen, and
another pair of digits:
val date = raw"(\d{4})-(\d{2})-(\d{2})".r
Since escape characters are not processed in multi-line string literals, three
consecutive quotes (before and after) avoid the need to escape the backslash
character. Hence, \\d can also be written as """\d""".
Extraction
To extract the capturing groups when an RE is matched, use it as an
extractor in a pattern match:
REs in Scala have access to various methods, such as start() and
hasNext(), as shown here:
val r = "(ab+c)".r
val s = "xxxabcyyyabbczzz"
r.findAllIn(s).start // 3
val mi = r.findAllIn(s)
mi.hasNext // true
mi.start // 3
mi.next() // "abc"
mi.start // 3
mi.hasNext // true
mi.start // 9
mi.next() // "abbc"
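For comparison, a roughly equivalent loop in Java uses the find(), start(), and group() methods of the java.util.regex.Matcher class, which is the class that the Scala iterator relies on. This is a sketch only; the class name FindAllDemo and the helper findAllWithStart are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FindAllDemo {
    // Hypothetical helper: returns each match as "startIndex:matchedText".
    static List<String> findAllWithStart(String regex, String text) {
        List<String> results = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        // find() advances to the next match, much like hasNext()/next() in Scala.
        while (m.find()) {
            results.add(m.start() + ":" + m.group());
        }
        return results;
    }

    public static void main(String[] args) {
        // Same RE and input string as in the Scala snippet above.
        System.out.println(findAllWithStart("(ab+c)", "xxxabcyyyabbczzz"));
        // prints [3:abc, 9:abbc]
    }
}
```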
Summary
This Appendix started with an introduction to some basic REs in Java,
followed by examples that illustrate how to match (or how to not match)
characters or words. You saw examples of using metacharacters and character classes
to match sequences of numbers and characters (uppercase and lowercase).
Moreover, you learned how to create REs for common strings, such as dates,
U.S. phone numbers, and U.S. zip codes.
Then you learned how to create REs in Scala, which provides a “wrapper”
around a Java class for matching REs.
Index
E
egrep utility, 3
email addresses, 31–33, 140
extraction, Scala, 166–167
F
file, reversing lines, 117–118
findAll() method, 55
  character classes, 66–67
FTP, 43–44
G
Google i18n phone number dataset, 29
Google library, 27, 29
greedy search, 79
gregexpr command, 84–85
grep command, 82
grepl command, 83
grep (or egrep) utility, 1
“group” subexpressions, 77
gsub() commands, 86, 87
H
hard-coded strings, 1
hexadecimal color sequences, 33–34
hexadecimal numbers, 36–37, 143
http links, 43–44
I
integers, 35–36, 142–143
Internet search, 79, 96
IP addresses, 43, 152
ISBNs, 26, 46–48, 149–150
J
Java programs, 158–167
  date-related REs, 164–165
  mixed-case strings, 162–163
  mixing and escaping metacharacters, 163–164
  numbers and, 160
  ranges and, 160–161
M
metacharacters, 5–6, 21–22
  “^,” 6
  “$,” 7
  “.,” “*,” and “\,” 7–8
  “+,” “?,” and “|,” 13–14
  “^” and “\,” 59
  escaping, 10
  extended, 13–14
  mixing and escaping, examples, 10–13
  in Python, 56–58
miscellaneous patterns, 151
mixed-case strings, 14–15, 152–153, 162–163
N
neophyte, 1
numbers, 35–38, 141–144
  binary numbers, 38, 144
  hexadecimal numbers, 36–37, 143
  integers and decimal numbers, 35–36, 142–143
  octal numbers, 37–38, 143–144
  scientific numbers, 38–42, 144–148
O
Object Oriented Programming, 158
octal numbers, 37–38, 143–144
P
pattern-matching functions, 96
performance factors, 54
Perl-style RE patterns, 56, 124–157
  character class, 125–126
  dates and metacharacters, 136–138
  metacharacter, escaping, 131
  “?” metacharacter, 134–135
  “^” metacharacter, 128
  “|” metacharacter, 135–136
  “$” metacharacter, 128
  “.,” “*” and “\” metacharacters, 128–129
  mixing and escaping metacharacters, 132–134