Perl Regex

Perl: Regular Expressions
A powerful tool for searching and
transforming text.
Motivation
We have seen many
operations involving
string comparisons
Several Perl built-in
functions also help with
operations on strings
split & join
substr
length
There is a lot we can do
with such functions
Given a string holding
some timestamp,
extract out different
parts of date & time
while (my $line = <STDIN>) {
    chomp $line;
    # ... extract the date and time parts of $line here
}
Motivation
iCalendar dates are used
by iCal-like programs
The year, month, etc.
portions of the code are
fixed in position
How could we use substr
to help us?
This code certainly obtains
what we need.
But it can be a bit tricky
to get right.
Adapting code to use
another date/time format
is not trivial
and is bugbait!
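The substr approach might look like the following. This is only a sketch: the sample timestamp and the variable names are invented for illustration, and the layout YYYYMMDDTHHMMSS is assumed.

```perl
use strict;
use warnings;

# Hypothetical sample value in iCalendar form: YYYYMMDDTHHMMSS
my $stamp = '20240131T235959';

# Every field is located by a hand-counted offset and length.
my $year   = substr $stamp,  0, 4;
my $month  = substr $stamp,  4, 2;
my $day    = substr $stamp,  6, 2;
my $hour   = substr $stamp,  9, 2;   # offset 8 is the T separator
my $minute = substr $stamp, 11, 2;
my $second = substr $stamp, 13, 2;

print "$year-$month-$day $hour:$minute:$second\n";   # 2024-01-31 23:59:59
```

Every offset has to be recounted by hand if the format changes, which is where the bugs creep in.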
Motivation
A better method is to
indicate the string's pattern
in a way that reflects the
actual order of pattern
components
The date begins at the
start of the string.
The year is four digits.
The month follows (two
digits)
and then the day.
The T character
separates the date and
time
Hour, minute and second
follow, each two digits
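The description above translates almost word for word into a pattern. A minimal sketch, assuming the same kind of timestamp (the sample value and variable names are invented for illustration):

```perl
use strict;
use warnings;

# Hypothetical sample value in iCalendar form: YYYYMMDDTHHMMSS
my $stamp = '20240131T235959';

my ($year, $month, $day, $hour, $minute, $second);
if ( $stamp =~ m/\A (\d{4}) (\d{2}) (\d{2})   # year, month, day
                 T                            # separator
                 (\d{2}) (\d{2}) (\d{2})      # hour, minute, second
                /xms ) {
    ($year, $month, $day, $hour, $minute, $second) = ($1, $2, $3, $4, $5, $6);
    print "$year-$month-$day $hour:$minute:$second\n";   # 2024-01-31 23:59:59
}
```

The pattern reads in the same order as the string, so a change of format means editing the pattern rather than recounting offsets.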
Motivation
Back to our code
modification example
Now we have a different
date format
Using a regular
expression, we can
greatly reduce the
possibility of bugs
String begins with an
followed by year
followed by a dash
followed by month
etc
Simple matching
Metacharacters
Anchored search
Character classes
Range operators in
character classes
Matching any character
Extracting Matches
Search and Replace
Our coverage of regex syntax will
be much more slowly paced than
the motivation just shown!
Previous slides have been
shown to give you a flavour
of what regular expressions
can achieve.
We will learn how to
construct such expressions
over the next few lectures.
We have a range of topics
Regular expressions can seem
complex and cryptic
However, slow and patient
work with such expressions
will improve your
productivity.
Perl Regular Expressions
Perl is renowned for its
excellence at text processing.
Its handling of regular
expressions is a big
factor in its fame.
Mastering even the basics
will allow you to manipulate
text with ease.
Regular expressions rest on a
strong formalism: finite-state automata (FSA).
You have already used
some and seen others.
Other languages have
some support for regexes,
usually via some library.
Simple String Matching
Regular expressions are
usually used in
conjunction with an if:
if <string matches
this pattern>
... then <do
something with that
match>.
The simplest such match
looks for a literal string.
But note: this is quite
different from using eq.
A match succeeds if the pattern occurs
anywhere in the string; eq compares
whole strings.
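A small sketch of the difference between a match and eq (the sample string is chosen for illustration only):

```perl
use strict;
use warnings;

my $greeting = 'Hello World';   # sample string, chosen for illustration

# A regex match succeeds if the pattern occurs anywhere in the string.
my $found = ( $greeting =~ m/World/xms ) ? 1 : 0;

# eq succeeds only if the two strings are identical.
my $same = ( $greeting eq 'World' ) ? 1 : 0;

print "match: $found, eq: $same\n";   # match: 1, eq: 0
```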
A word about
m/yadayada/xms
The text between the two slashes is the regular expression
(regex).
Leading m indicates the regex is used for a match
Trailing xms are three regex options
x : Extended formatting (whitespace in regex is ignored)
m : ^ and $ match at line boundaries (and eliminates a cause of some subtle bugs)
s : ensures everything, including a newline, is matched by the . symbol
Why all of this verbiage instead of plain old /yadayada/?
Also note: other delimiters are allowed, e.g. m{ } instead of m//
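The effect of each option can be seen by switching it off. A minimal sketch, using an invented two-line string:

```perl
use strict;
use warnings;

my $text = "first line\nsecond line\n";   # two-line sample string

# x: whitespace inside the pattern is ignored, so \s is used for the space
my $x_ok = ( $text =~ m/ first \s line /xms ) ? 1 : 0;

# m: ^ can match at the start of any line, not only the start of the string
my $m_ok = ( $text =~ m/^second/xms ) ? 1 : 0;
my $no_m = ( $text =~ m/^second/xs  ) ? 1 : 0;

# s: . also matches a newline, so the pattern can span both lines
my $s_ok = ( $text =~ m/first.*second/xms ) ? 1 : 0;
my $no_s = ( $text =~ m/first.*second/xm  ) ? 1 : 0;

print "$x_ok $m_ok $no_m $s_ok $no_s\n";   # 1 1 0 1 0
```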
Another example
The code on the right
searches for a pattern in
some dictionary file
Note that a command-line
argument is being
used for a regex!
Also note < > syntax:
This takes the first
unused command-line
argument, and uses it
as a filename for input
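The slide's own code is not reproduced here. The following is only a sketch of the same idea: the sub name matching_lines is hypothetical, and an in-memory filehandle with invented words stands in for the dictionary file so the example runs on its own.

```perl
use strict;
use warnings;

# Return the lines read from $fh that match the pattern in $regex.
sub matching_lines {
    my ($regex, $fh) = @_;
    my @hits;
    while ( my $line = <$fh> ) {
        chomp $line;
        push @hits, $line if $line =~ m/$regex/xms;
    }
    return @hits;
}

# Stand-in for a dictionary file: an in-memory filehandle with made-up words.
my $words = "apple\nbanana\ncherry\ngrape\n";
open my $fh, '<', \$words or die "Cannot open in-memory file: $!";

my @found = matching_lines( 'an', $fh );   # 'an' plays the role of $ARGV[0]
close $fh;
print "$_\n" for @found;                   # banana
```

In a script, the pattern would come from shift @ARGV and the loop would read from <> instead of an explicit filehandle.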
Metacharacters
Regexes obtain their power
by describing sets of
strings.
Such descriptions involve
the use of
metacharacters
Of course, some strings
that we want to match will
contain these characters.
Therefore we must
escape them with a backslash.
{ } [ ] ( )
^ $ .
/ \
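A short sketch of escaping, by hand and with \Q...\E (the sample text is invented):

```perl
use strict;
use warnings;

my $price = 'Total: $4.50 (approx.)';   # sample text containing metacharacters

# Escaped by hand: \$ \. \( \) are matched literally.
my $by_hand = ( $price =~ m/\$4\.50 [ ] \(approx\.\)/xms ) ? 1 : 0;

# \Q...\E escapes every metacharacter in an interpolated string.
my $literal = '$4.50 (approx.)';
my $quoted  = ( $price =~ m/\Q$literal\E/xms ) ? 1 : 0;

# Unescaped, . matches any character, so this also matches "4x50".
my $loose = ( '4x50' =~ m/4.50/xms ) ? 1 : 0;

print "$by_hand $quoted $loose\n";   # 1 1 1
```

\Q...\E is convenient when the text to match comes from a variable, such as user input.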
We may wish to anchor a match to certain
locations
^ matches the beginning of a line (with the m option).
$ matches the end of a line (with the m option).
\A matches the beginning of a string.
\z matches the end of a string.
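The four anchors can be compared on one string. A minimal sketch, using an invented two-line string:

```perl
use strict;
use warnings;

my $text = "one\ntwo\n";   # two-line sample string

my $caret  = ( $text =~ m/^two/xms   ) ? 1 : 0;   # ^ : start of a line
my $start  = ( $text =~ m/\A two/xms ) ? 1 : 0;   # \A: start of string only
my $dollar = ( $text =~ m/one$/xms   ) ? 1 : 0;   # $ : end of a line
my $end    = ( $text =~ m/two \z/xms ) ? 1 : 0;   # \z: very end of string

print "$caret $start $dollar $end\n";   # 1 0 1 0
```

The last test fails because the string ends with a newline, and \z only matches after the final character.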
Character classes
These allow sets of characters to be matched.
Used at points within a regex.
Range operators
Ranges can
eliminate some
ugly code
[0123456789]
becomes [0-9]
[abcdefghijklmnopqrstuvwxyz]
becomes [a-z]
If - is the first or last
character in a character
class, it is treated as an
ordinary character
Negated character classes
The special character
^ in the first position
of a character class
denotes a negated
character class
Matches any character
but those in the
brackets
/[^a]at/   # doesn't match aat or at, but
           # matches all other bat, cat,
           # 0at, %at, etc.
/[^0-9]/   # matches a nonnumeric character
/[a^]at/   # matches aat or ^at; here ^
           # is ordinary
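Character classes, ranges and negation can be tried side by side. A minimal sketch over a list of invented sample words:

```perl
use strict;
use warnings;

my @words = qw(bat cat aat 0at ^at);   # sample words

my @class   = grep { m/\A [bc]at  \z/xms } @words;   # b or c, then at
my @range   = grep { m/\A [0-9]at \z/xms } @words;   # any digit, then at
my @negated = grep { m/\A [^a]at  \z/xms } @words;   # anything but a, then at
my @literal = grep { m/\A [a^]at  \z/xms } @words;   # a or a literal ^, then at

print "@class | @range | @negated | @literal\n";
# bat cat | 0at | bat cat 0at ^at | aat ^at
```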
Matching any character
The period '.' matches any character but "\n"
(with the s option it matches "\n" as well).
A period is a metacharacter, so it needs to be
escaped to match as an ordinary period.
m/..rt/xms # matches any 2 chars, followed by rt
m/end\./xms # matches end.
m/end[.]/xms # same thing, matches only end.
"" =~ m/./xms # doesn't match; needs a character
"a" =~ m/./xms # matches
"\n" =~ m/./xm # doesn't match; without s, needs a character
# other than \n
"\n" =~ m/./xms # matches; the s option lets . match \n
"a\n" =~ m/./xms # matches (the a)
Matching this or that
We would like to match different
possible words or character strings
We use the alternation character |
"cats and dogs" =~ /cat|dog|bird/ # matches "cat"
"cats and dogs" =~ /dog|cat|bird/ # matches "cat": the earliest
# match in the string wins
Group
Together
Sometimes we want alternatives for part of a
regular expression.
/(a|b)b/ # matches ab or bb
/(ac|b)b/ # matches acb or bb
/(^a|b)c/ # matches ac at start of string or
# bc anywhere
/(a|[bc])d/ # matches ad, bd, or cd
/house(cat|)/ # matches either housecat
# or house
/house(cat(s|)|)/ # matches either housecats or
# housecat or house.
# Note groups can be nested.
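Alternation and grouping can be checked together. A minimal sketch; the word lists are invented, and the patterns are anchored with \A and \z so that only whole words count:

```perl
use strict;
use warnings;

# Alternation: the earliest position in the string wins.
my ($first) = 'cats and dogs' =~ m/(dog|cat|bird)/xms;

# Grouping limits the alternatives to part of the pattern.
my @ab_or_bb = grep { m/\A (a|b)b \z/xms } qw(ab bb cb abb);

# Nested groups with empty alternatives make parts optional.
my @houses = grep { m/\A house(cat(s|)|) \z/xms }
             qw(house housecat housecats housedog);

print "$first | @ab_or_bb | @houses\n";
# cat | ab bb | house housecat housecats
```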
Extracting Matches
The grouping metacharacters () also serve another
completely different function: they allow the extraction of
the parts of a string that matched.
For each grouping, the part that matched inside goes into
the special variables $1, $2, etc.
# extract hours, minutes, seconds
$time =~ /(\d\d):(\d\d):(\d\d)/; # match hh:mm:ss format
# \d is equivalent to [0-9]
$hours = $1;
$minutes = $2;
$seconds = $3;
# More compact, equivalent code
($hours, $minutes, $seconds) = ($time =~ /(\d\d):(\d\d):(\d\d)/);
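A runnable sketch of both styles, with an invented sample time. The first style tests the match before reading $1, $2 and $3, since those variables keep their old values when a match fails:

```perl
use strict;
use warnings;

my $time = '12:34:56';   # sample value in hh:mm:ss form

# Style 1: test the match, then copy the special variables.
my ($hours, $minutes, $seconds);
if ( $time =~ m/(\d\d):(\d\d):(\d\d)/xms ) {
    ($hours, $minutes, $seconds) = ($1, $2, $3);
}

# Style 2: in list context the match returns the captured parts directly.
my ($h, $m, $s) = $time =~ m/(\d\d):(\d\d):(\d\d)/xms;

print "$hours $minutes $seconds | $h $m $s\n";   # 12 34 56 | 12 34 56
```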
Matching Repetitions
We would like to be able to match multiple times:
a? = match 'a' 0 or 1 times (i.e., optional)
a* = match 'a' 0 or more times, i.e., any number of times
a+ = match 'a' 1 or more times, i.e., at least once
a{n,m} = match at least n times, but not more than m times
a{n,} = match n or more times
a{n} = match exactly n times
$year =~ /^\d{2,4}$/; # make sure year is at least 2 but
# not more than 4 digits
/[a-z]+\d*/i; # match a word and any number of digits
/y(es)?/i; # matches y, Y,
# or a case-insensitive yes
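A minimal sketch of the quantifiers at work (the candidate lists are invented). The anchors \A and \z matter here: without them, \d{2,4} would also succeed inside a five-digit number.

```perl
use strict;
use warnings;

# Between 2 and 4 digits, and nothing else.
my @years = grep { m/\A \d{2,4} \z/xms } qw(7 99 2024 12345);

# y, optionally followed by es, in any letter case.
my @yes = grep { m/\A y(es)? \z/xmsi } qw(y Y yes YES no yep);

print "@years | @yes\n";   # 99 2024 | y Y yes YES
```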
Search and Replace
Regular expressions also play a role in
search and replace operations in Perl
Search and replace is accomplished
with the s/// operator
General form:
s/regexp/replacement/modifiers
$x = "Time to feed the cat!";
if ( $x =~ s/cat/hamster/ ) {
    print $x; # Time to feed the hamster!
}
More Search and Replace
$y = "'quoted words'";
$y =~ s/'(.*)'$/<<$1>>/; # replace single quotes, $y
# contains "<<quoted words>>"
$x = "I batted 4 for 4";
$x =~ s/4/four/; # doesn't do it all:
# $x contains
# "I batted four for 4"
$x = "I batted 4 for 4";
$x =~ s/4/four/g; # /g modifier does it all:
# $x contains
# "I batted four for four"
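The same comparison as a runnable sketch. In scalar context s/// returns the number of substitutions made, which is captured here in two illustrative variables:

```perl
use strict;
use warnings;

my $once = 'I batted 4 for 4';
my $all  = 'I batted 4 for 4';

# Without g only the first occurrence is replaced.
my $count_once = ( $once =~ s/4/four/xms );

# With g every occurrence is replaced.
my $count_all  = ( $all =~ s/4/four/gxms );

print "$once ($count_once)\n";   # I batted four for 4 (1)
print "$all ($count_all)\n";     # I batted four for four (2)
```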
A few more regexp
topics
Advanced uses of matches
Escape sequences
List and scalar context, e.g., phone
Finding all instances of a match
Parentheses
Substituting with s///
tr, the translate function
Advanced uses of
matches
You can assign pattern memory
directly to your own variable
names (capturing):
($phone) = $value =~ /^phone\:(.+)$/;
Read from right to left. Apply this pattern
to the value in $value, and assign the
results to the list on the left.
($front,$back) = /^phone\:(\d{3})-(\d{4})/;
Apply this pattern to $_ and assign the
results to the list on the left.
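Both forms as a runnable sketch. The sample record is invented, and a for loop is used only as one convenient way to make $_ refer to it:

```perl
use strict;
use warnings;

my $value = 'phone:555-1234';   # sample record, invented for illustration

# Explicit target: the capture is assigned to the one-element list ($phone).
my ($phone) = $value =~ m/\A phone: (.+) \z/xms;

# Implicit target: inside the loop $_ is an alias for $value.
my ($front, $back);
for ($value) {
    ($front, $back) = m/\A phone: (\d{3}) - (\d{4})/xms;
}

print "$phone $front $back\n";   # 555-1234 555 1234
```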
Meaning of backslash letters
\n: newline
\r: carriage return
\t: tab
\f: formfeed
\d: a digit (same as [0-9])
\D: a non-digit
\w: a word character (alphanumeric or underscore), same as [0-9a-zA-Z_]
\W: a non-word character
\s: a whitespace character, same as [ \t\n\r\f]
\S: a non-whitespace character
Reminder: list or scalar
context?
A pattern match returns a false value (the empty string) or 1 (true) in
scalar context, and a list of matches in list
context.
Recall: There are a lot of functions that do
different things depending on whether they are
used in scalar or list context.
# returns the number of elements
$count = @array;
# returns a reversed string
$revString = reverse $string;
# returns a reversed list
@revArray = reverse @array;
You must always be cautious of this behaviour.
Practical Example of
Context
$phone = $string =~ /^.+\:(.+)$/;
$phone contains 1 if pattern matches,
a false value otherwise
($phone) = $string =~ /^.+\:(.+)$/;
$phone contains the matched string
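The only difference between the two statements is the pair of parentheses on the left. A minimal sketch with invented sample strings:

```perl
use strict;
use warnings;

my $string = 'phone:555-1234';   # sample record, invented for illustration

# Scalar context: the result says only whether the match succeeded.
my $flag = $string =~ m/\A .+ : (.+) \z/xms;

# List context: the result is the list of captured parts.
my ($phone) = $string =~ m/\A .+ : (.+) \z/xms;

# A failed match in list context returns an empty list.
my ($none) = 'no colon here' =~ m/\A .+ : (.+) \z/xms;

print "$flag $phone ", ( defined $none ? 'defined' : 'undef' ), "\n";
# 1 555-1234 undef
```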

Finding all instances of a
match
Use the g modifier with a regular
expression:
@sites = $sequence =~ /(TATTA)/g;
think g for global
Returns a list of all the matches (in
order), and stores them in the array
If you have n pairs of parentheses,
the array looks like the following:
($1, $2, ..., $n, $1, $2, ..., $n, ...)
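A minimal sketch with one and with two pairs of parentheses (the sequence and the key=value text are invented):

```perl
use strict;
use warnings;

my $sequence = 'GGTATTACCTATTAGTATTA';   # made-up sequence with three sites

# One pair of parentheses: one list element per match.
my @sites = $sequence =~ m/(TATTA)/gxms;

# Two pairs: the captures of each match follow one another in the list.
my @pairs = 'a=1 b=2' =~ m/(\w)=(\d)/gxms;

print scalar(@sites), " sites | @pairs\n";   # 3 sites | a 1 b 2
```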
Perl is Greedy
Perl regular expressions try to match the
largest possible string which matches your
pattern:
"lalaaaaagag" =~ /(la.*ag)/
/la.*ag/ matches laag, lalag, laaaaaag
$1 contains lalaaaaagag
If this is not what you wanted to do, use the
? modifier:
"lalaaaaagag" =~ /(la.+?ag)/
/(la.+?ag)/ matches as few characters
as possible to find matching pattern
$1 contains lalaaaaag
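The two patterns can be run on the same string to compare what is captured:

```perl
use strict;
use warnings;

my $text = 'lalaaaaagag';

# Greedy: .* takes as much as it can, so the match runs to the last "ag".
my ($greedy) = $text =~ m/(la.*ag)/xms;

# Non-greedy: .+? takes as little as it can, so the match stops at the first "ag".
my ($lazy) = $text =~ m/(la.+?ag)/xms;

print "$greedy\n";   # lalaaaaagag
print "$lazy\n";     # lalaaaaag
```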

Making parentheses
forgetful
Sometimes you need parentheses to make your
regular expression work, but you don't actually want
to keep the results. You can still use parentheses for
grouping.
Certain characters are overloaded; recall:
\d? means 0 or 1 instances
\d+? means the fewest non-zero number of
digits
(?:group) means look for the group of
atoms in the string, but don't remember
them
Example of forgetting
# Method 1: make the first group non-capturing when -x is given
if (@ARGV && $ARGV[0] eq "-x") {
    $mod = "?:";
} else {
    $mod = "";
}
$pat1 = "\\w+";
$pat2 = "\\d+";
while (<STDIN>) {
    $_ =~ /($mod$pat1) ($pat2)/;
    print $1, "\n";   # the word, or the number when -x is given
}
# Method 2: capture both groups and choose which one to print
if (@ARGV && $ARGV[0] eq "-x") {
    $ignore = 1;
} else {
    $ignore = 0;
}
while (<STDIN>) {
    $_ =~ /(\w+) (\d+)/;
    if ($ignore) {
        print $2, "\n";
    } else {
        print $1, "\n";
    }
}
More examples using s///
Substituting one word for another
$string =~ s/dogs/cats/;
If $string was "I love dogs", it is now "I love cats"
Removing trailing white space
$string =~ s/\s+$//;
If $string was "ATG   ", it is now "ATG"
Adding 10 to every number in a string
$string =~ s/(\d+)/$1+10/ge;
Note pattern memory
g means global (just like in regular expressions)
e is specific to s, evaluate the expression on the right
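A runnable sketch of the last two substitutions, with invented sample strings:

```perl
use strict;
use warnings;

# e: the replacement is evaluated as Perl code, so $1 + 10 is computed.
my $order = 'buy 3 apples and 12 pears';
$order =~ s/(\d+)/$1 + 10/gexms;

# Trailing white space is matched by \s+ anchored with $ and replaced by nothing.
my $dna = "ATG   ";
$dna =~ s/\s+$//xms;

print "$order\n";     # buy 13 apples and 22 pears
print "[$dna]\n";     # [ATG]
```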

tr function
translate or transliterate
Even less like a regular expression than s///:
substitutes characters in the first list
with characters from the second list:
$string =~ tr/a/A/;
every a is translated to an A
No need for a global modifier using tr.

More examples of tr
converting named scalar to lowercase
$ARGV[1] =~ tr/A-Z/a-z/;
count the number of * in $_
$cnt = tr/*/*/;
$cnt = $_ =~ tr/*/*/;
change all non-alphabetic characters to spaces
tr/a-zA-Z/ /c;
notice space + c = complement search string
delete all non-alphabetic characters completely
tr/a-zA-Z//cd;
d = delete found but unreplaced characters
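The four uses of tr as a runnable sketch. The sample strings are invented, and each result is kept in its own variable so the original stays unchanged:

```perl
use strict;
use warnings;

my $name = 'Hello, World!';   # sample string

# Lowercase: each character in A-Z maps to the matching one in a-z.
( my $lower = $name ) =~ tr/A-Z/a-z/;

# Counting: tr returns the number of characters it translated.
my $stars = '2*3*4';
my $cnt   = ( $stars =~ tr/*/*/ );

# c complements the search list: everything except letters becomes a space.
( my $spaced = $name ) =~ tr/a-zA-Z/ /c;

# cd: everything except letters is deleted.
( my $letters = $name ) =~ tr/a-zA-Z//cd;

print "$lower | $cnt | $spaced| $letters\n";
```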
Using the results of matches
within a pattern
\1, \2, \3 refer to what a previous set of
parentheses matched
"abc abc" =~ /(\w+) \1/ # matches
"abc def" =~ /(\w+) \1/ # doesn't match
Can also use $1, $2, etc. to perform some
interesting operations:
s/^([^ ]*) *([^ ]*)/$2 $1/ # swap first two words
/(\w+)\s*=\s*\1/ # match foo = foo
other default variables used in matches
$` : returns everything before matched string
$& : returns entire matched string
$' : returns everything after matched string
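A runnable sketch of backreferences and the three match variables (all sample strings are invented):

```perl
use strict;
use warnings;

# \1 inside the pattern must repeat whatever group 1 matched.
my $same   = ( 'abc abc' =~ m/(\w+) [ ] \1/xms ) ? 1 : 0;
my $differ = ( 'abc def' =~ m/(\w+) [ ] \1/xms ) ? 1 : 0;

# $1 and $2 in the replacement swap the first two words.
my $words = 'hello world again';
$words =~ s/^([^ ]*) *([^ ]*)/$2 $1/;

# $`, $& and $' hold the text before, inside and after the last match.
my ($before, $matched, $after);
if ( 'the cat sat' =~ m/cat/xms ) {
    ($before, $matched, $after) = ( $`, $&, $' );
}

print "$same $differ | $words | <$before><$matched><$after>\n";
# 1 0 | world hello again | <the ><cat>< sat>
```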

Example: Celsius
#!/usr/bin/perl -w
print "Enter temperature: \n";
$line = <STDIN>;
if ( $line =~ /^([-+]?[0-9]+(?:\.[0-9]*)?)\s*([CF])$/i ) {
    $temp  = $1;   # the number, with optional sign and fraction
    $scale = $2;   # C or F, in either case
    if ( $scale =~ /c/i ) {
        $cel = $temp;
        $fah = ($cel * 9 / 5) + 32;
    }
    else {
        $fah = $temp;
        $cel = ($fah - 32) * 5 / 9;
    }
    printf( "%.2f C is %.2f F\n", $cel, $fah );
}
else {
    printf( "Bad format\n" );
}
Regex on command line
We can execute simple regular
expressions on the command line:
$ perl -p -i -e 's/kat/cat/g' in.txt
-p : apply program to each line in file and print the result
-i : write changes back to in.txt
-e : program between ''

