Regex
Objectives
Regular expressions
Regular expressions are a concept and an implementation used in many different programming
environments for sophisticated pattern matching. They are an incredibly powerful tool that can
amplify your capacity to find, manage, and transform data and files.
Regular expressions can help you:
Match on types of characters (e.g. ‘upper case letters’, ‘digits’, ‘spaces’, etc.).
Match patterns that repeat any number of times.
Capture the parts of the original string that match your pattern.
Regex can also be useful for daily work. For example, say your organization wants to change the
way it displays telephone numbers on its website by removing the parentheses around the
area code. Rather than searching for each specific phone number (which could take forever and be
prone to error) or searching for every open parenthesis character (which could also take forever
and return many false positives), you could search for the pattern of a phone number.
Regular expressions rely on the use of literal characters and metacharacters. A metacharacter is
any American Standard Code for Information Interchange (ASCII) character that has a special
meaning. By using metacharacters, possibly together with literal characters, you can construct a
regex for finding strings or files that match a pattern rather than a specific string.
Since regular expressions define some ASCII characters as “metacharacters” that have more
than their literal meaning, it is also important to be able to “escape” these metacharacters to use
them for their normal, literal meaning. For example, the period . means “match any character”,
but if you want to match a period then you will need to use a \ in front of it to signal to the
regular expression processor that you want to use the period as a plain old period and not a
metacharacter. That notation is called “escaping” the special character. The concept of
“escaping” special characters is shared across a variety of computational settings, including
markdown and Hypertext Markup Language (HTML).
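The effect of escaping can be sketched in a few lines of Python, using the standard re module. The sample sentence is invented for illustration; the point is only to contrast . as a metacharacter with \. as a literal period.

```python
import re

# Invented sample text for illustration.
text = "Version 3.14 was released, not 3514."

# Unescaped: "." matches any character, so "3.14" also matches "3514".
unescaped = re.findall(r"3.14", text)

# Escaped: "\." matches only a literal period.
escaped = re.findall(r"3\.14", text)

# re.escape() adds the backslashes for you when a string should be literal.
auto_escaped = re.escape("3.14")

print(unescaped)
print(escaped)
print(auto_escaped)
```

Other regex implementations escape metacharacters with a backslash in the same way, but helper functions such as re.escape are specific to Python.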
Most regular expression implementations employ similar syntaxes and metacharacters (generally
influenced by the regex syntax of a programming language called Perl), and they behave
similarly for most pattern-matching in this lesson. But there are differences, often subtle, in each,
so it’s always a good practice to read the application or language’s documentation whenever
available, especially when you start using more advanced regex features. Some programs,
notably many UNIX command line programs (for more on UNIX see our ‘Shell Lesson’), use an
older regex standard (called ‘POSIX regular expressions’) which is less feature-rich and uses
different metacharacters than Perl-influenced implementations. For the purposes of our lesson,
you do not need to worry too much about these differences, but if you want to follow up, see this
detailed syntax comparison.
A very simple use of a regular expression would be to locate the same word spelled two different
ways. For example, the regular expression organi[sz]e matches both organise and organize. But
because it locates all matches for the pattern in the file, not just for that word, it would also
match reorganise, reorganize, organises, organizes, organised, organized, etc.
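As a rough illustration in Python (the word list is made up for this sketch), the pattern organi[sz]e finds the two spellings anywhere inside a word. Adding the word boundary \b on both sides, which is introduced later in this lesson, restricts the match to the whole word.

```python
import re

# Hypothetical word list, chosen to show matches and a non-match.
words = ["organise", "organize", "reorganised", "organizes", "organic"]

# The pattern may appear anywhere inside the word.
anywhere = [w for w in words if re.search(r"organi[sz]e", w)]

# \b on both sides requires the pattern to be a whole word.
whole_word = [w for w in words if re.search(r"\borgani[sz]e\b", w)]

print(anywhere)
print(whole_word)
```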
Learning common regex metacharacters
Square brackets can be used to define a list or range of characters to be found. So:
[ABC] matches A or B or C.
[A-Z] matches any upper case letter.
[A-Za-z] matches any upper or lower case letter.
[A-Za-z0-9] matches any upper or lower case letter or any digit.
* matches the preceding element zero or more times. For example, ab*c matches “ac”,
“abc”, “abbbc”, etc.
+ matches the preceding element one or more times. For example, ab+c matches “abc”,
“abbbc” but not “ac”.
? matches when the preceding character appears zero or one time.
{VALUE} matches the preceding character the number of times defined by VALUE;
ranges, say, 1-6, can be specified with the syntax {VALUE,VALUE}, e.g. \d{1,9} will
match any number between one and nine digits in length.
| means or.
/i renders an expression case-insensitive (so that, for example, /[a-z]/i is equivalent to [A-Za-z]).
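The metacharacters above can be tried out with a short Python sketch. The helper name matches and the candidate strings are assumptions made for this example. Python has no /.../i delimiter syntax, so the re.IGNORECASE flag stands in for /i here.

```python
import re

def matches(pattern, candidates, flags=0):
    """Return the candidates that the pattern matches in full (hypothetical helper)."""
    return [c for c in candidates if re.fullmatch(pattern, c, flags)]

star = matches(r"ab*c", ["ac", "abc", "abbbc"])            # zero or more b
plus = matches(r"ab+c", ["ac", "abc", "abbbc"])            # one or more b
optional = matches(r"colou?r", ["color", "colour", "colouur"])  # zero or one u
counted = matches(r"\d{1,9}", ["7", "123456789", "1234567890"])  # 1 to 9 digits
either = matches(r"cat|dog", ["cat", "dog", "cow"])        # alternation
ignore_case = matches(r"[a-z]+", ["abc", "ABC", "AbC"], re.IGNORECASE)

for result in (star, plus, optional, counted, either, ignore_case):
    print(result)
```

re.fullmatch is used so that each candidate must match the pattern from start to end; with re.search, a pattern such as \d{1,9} would also succeed on part of a longer number.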
^[Oo]rgani.e\w*
To embed this knowledge we will not, however, be using computers. Instead, we’ll use pen and
paper for now.
Exercise
Work in teams of four to six on the exercises below. When you think you have the right answer,
check it against the solution.
When you finish, split your team into two groups and write each other some tests. These should
include a) strings you want the other team to write regex for and b) regular expressions you want
the other team to work out what they would match.
Then test each other on the answers. If you want to check your logic, use regex101, myregexp,
regex pal or regexper.com: the first three help you see what text your regular expression will
match, and the last visualises the workflow of a regular expression.
INTRODUCING OPTIONS
What would match the strings French and France that appear at the beginning of a line?
CASE INSENSITIVITY
How do you match the whole words colour and color (case insensitive)?
Solutions
\b[Cc]olou?r\b|\bCOLOU?R\b
/colou?r/i
In real life, you should only come across the case insensitive
variations colour, color, Colour, Color, COLOUR, and COLOR (rather than, say, coLour). So
based on what we know, the logical regular expression is \b[Cc]olou?r\b|\bCOLOU?R\b.
An alternative, more elegant option we’ve not discussed is to take advantage of the / delimiters
and add an ‘ignore case’ flag. To use these flags, include / delimiters before and after the
expression then letters after to raise each flag (where i is ‘ignore case’): so /colou?r/i will
match all case insensitive variants of colour and color.
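Both approaches can be compared in Python, as a sketch under a few assumptions: the sample sentence is invented, the re.IGNORECASE flag plays the role of /i, and \b has been added to the flagged pattern so that it matches whole words only.

```python
import re

# Invented sample text for illustration.
text = "Colour, color, COLOUR and COLOR, but not colourful or discolor."

# Explicit pattern: only the six expected capitalisations.
explicit = re.findall(r"\b[Cc]olou?r\b|\bCOLOU?R\b", text)

# Flagged pattern: every mix of upper and lower case.
flagged = re.findall(r"\bcolou?r\b", text, flags=re.IGNORECASE)

# The two patterns differ on unusual capitalisation such as "coLour".
explicit_odd = re.findall(r"\b[Cc]olou?r\b|\bCOLOU?R\b", "coLour")
flagged_odd = re.findall(r"\bcolou?r\b", "coLour", flags=re.IGNORECASE)

print(explicit)
print(flagged)
```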
WORD BOUNDARIES
How would you find the whole word headrest or head rest but not head  rest (that is,
with two spaces between head and rest)?
How would you find a string that ends with four letters preceded by at least one zero?
MATCHING DATES
How would you match the date format dd-MM-yyyy or dd-MM-yy at the end of a line only?
How would you match publication formats such as British Library : London,
2015 and Manchester University Press: Manchester, 1999?
KEYPOINTS
Matching & Extracting Strings
Last updated on 2023-05-03
How can you use regular expressions to match and extract strings?
Objectives
Use regular expressions to match words, email addresses, and phone numbers.
Open the swcCoC.md file, copy the text, and paste that into the test string box.
For a quick test to see if it is working, type the string community into the regular expression box.
If you look in the box on the right of the screen, you see that the expression matches six
instances of the string ‘community’ (the instances are also highlighted within the text).
Change the expression to communi and you get 15 full matches of several words. Why?
FINDING PLURALS
Find all of the words starting with Comm or comm that are plural.
\w+ matches the preceding element (a word character) one or more times
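One possible pattern for this exercise, sketched in Python. The pattern is an assumption of this example rather than the lesson's official solution, and the sample sentence is invented (it is not taken from swcCoC.md).

```python
import re

# Invented sample text; not the contents of swcCoC.md.
text = "Our communities and committees welcome comments from the community."

# [Cc]omm  : starts with Comm or comm
# \w+      : one or more word characters
# s\b      : ends in "s" at a word boundary
plurals = re.findall(r"\b[Cc]omm\w+s\b", text)

print(plurals)
```

Note that this finds words ending in s, which is only an approximation of "plural": it would miss irregular plurals and could match singular words that happen to end in s.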
Open the swcCoC.md file, copy it, and paste it into the test string box.
The string before the “@” could contain any kind of word character, special character or digit in
any combination and length. How would you express this in regex? Hint: often addresses will
have a dash (-) or dot (.) in them, and neither of these are included in the word character
expression (\w). How do you capture this in the expression?
. matches a literal period (when used in between square brackets, . does not mean “any
character”, it literally means “.”)
- matches a dash
[] the brackets enclose a character class, which matches any one of the characters listed inside it (here the word characters, the dot, and the dash).
The string after the “@” could contain any kind of word character, special character or digit in
any combination and length as well as the dash. In addition, we know that it will have some
characters after a period (.). Most common domain names have two or three characters, but
many more are now possible. Find the latest list here. What expression would capture this? Hint:
the . is also a metacharacter, so you will have to use the escape \ to express a literal period.
Note: for the string after the period, we did not try to match a - character, since those rarely
appear in the characters after the period at the end of an email address.
[] the brackets enclose a character class, which matches any one of the characters listed inside it
(here digits, word characters and the dash).
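Putting the two halves together, a simplified email pattern might look like the sketch below. The exact pattern and the addresses are assumptions made for this example; real email address syntax is considerably more permissive, so treat this as a teaching aid rather than a validator.

```python
import re

# [\w.-]+  : word characters, dots and dashes before the "@"
# @        : a literal "@"
# [\w.-]+  : word characters, dots and dashes after the "@"
# \.\w{2,} : a literal period followed by two or more word characters
EMAIL = re.compile(r"[\w.-]+@[\w.-]+\.\w{2,}")

# Invented sample text with made-up addresses.
text = "Contact jane.doe@example.org or bob-smith@mail.example.co.uk."

emails = EMAIL.findall(text)
print(emails)
```

The final full stop of the sentence is not captured, because the pattern must end with a period followed by at least two word characters.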
What to consider:
1. It may or may not have a country code, perhaps starting with a “+”.
2. It will have an area code, potentially enclosed in parentheses.
3. It may have the sections all separated with a “-”.
Start with what we know, which is the most basic format of a phone number: three digits, a dash,
and four digits. How would we write a regular expression that matches this?
Start with what we know, which is the most basic format of a phone number: three digits, a dash,
and four digits. How would we expand the expression to include an area code (three digits and a
dash)?
Start with what we know, which is the most basic format of a phone number: three digits, a dash,
and four digits. How would we expand the expression to include a phone number with an area
code in parentheses, separated from the phone number with or without a space?
\d matches digits
See the previous exercise for the explanation of the rest of the expression.
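The three phone number exercises build on one another, which can be sketched in Python as follows. These patterns are one plausible answer for each step, not necessarily the lesson's own solutions, and the phone numbers are invented.

```python
import re

# Step 1: three digits, a dash, four digits.
basic = r"\d{3}-\d{4}"

# Step 2: an area code (three digits and a dash) in front.
with_area = r"\d{3}-" + basic

# Step 3: area code in parentheses, then an optional space.
# The parentheses are metacharacters, so they are escaped.
with_parens = r"\(\d{3}\) ?" + basic

def is_match(pattern, candidate):
    """True if the whole candidate matches the pattern (hypothetical helper)."""
    return re.fullmatch(pattern, candidate) is not None

print(is_match(basic, "555-1234"))
print(is_match(with_area, "212-555-1234"))
print(is_match(with_parens, "(212) 555-1234"))
print(is_match(with_parens, "(212)555-1234"))
```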
Country codes are preceded by a “+” and can have up to three digits. We also have to consider
that there may or may not be a space between the country code and anything appearing next.
\d matches digits
See the previous exercise for the explanation of the rest of the expression.
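A sketch of the country code step in Python. It assumes the parenthesised area code format from the previous exercise and makes the whole country code optional; both choices, and the sample numbers, are assumptions of this example.

```python
import re

# \+        : a literal "+" (escaped, because + is a metacharacter)
# \d{1,3}   : one to three digits
# " ?"      : an optional space
# (...)?    : the whole country code group is optional
PHONE = re.compile(r"(\+\d{1,3} ?)?\(\d{3}\) ?\d{3}-\d{4}")

def is_phone(candidate):
    """True if the whole candidate matches the pattern (hypothetical helper)."""
    return PHONE.fullmatch(candidate) is not None

print(is_phone("+1 (212) 555-1234"))
print(is_phone("+44(212)555-1234"))
print(is_phone("(212) 555-1234"))
print(is_phone("+1234 (212) 555-1234"))
```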
One of the reasons we stress the value of consistent and predictable directory and file naming
conventions is that working in this way enables you to use the computer to select files based on
the characteristics of their file names. For example, if you have a bunch of files where the first
four digits are the year and you only want to do something with files from ‘2017’, then you can.
Or if you have ‘journal’ somewhere in a filename when you have data about journals, you can
use the computer to select just those files. Equally, using plain text formats means that you can
go further and select files or elements of files based on characteristics of the data within those
files. See Workshop Overview: File Naming & Formatting for further background.
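As a small illustration of selecting files by name, the Python sketch below filters a list of file names. The names are invented; in practice the list might come from a directory listing.

```python
import re

# Hypothetical file names for illustration.
filenames = [
    "2017-01-survey.csv",
    "2017_journal_titles.txt",
    "2018-journal-data.csv",
    "notes-2017.md",
]

# re.match only looks at the start of the string, so this selects
# names whose first four characters are 2017.
from_2017 = [name for name in filenames if re.match(r"2017", name)]

# re.search looks anywhere in the string.
journals = [name for name in filenames if re.search(r"journal", name)]

print(from_2017)
print(journals)
```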
2. Upload the CSV file to Google Sheets and open as a Google Sheet if it does not do
this by default.
3. Look in the ADDRESS column and notice that the values contain the latitude and
longitude in parentheses after the library address.
4. Construct a regular expression to match and extract the latitude and longitude into
a new column named ‘latlong’. HINT: Look up the function REGEXEXTRACT in
Google Sheets. That function expects the first argument to be a string (a cell
in ADDRESS column) and a quoted regular expression in the second.
Show me the solution
This is one way to solve this challenge. You might have found others. Inside the cell you can use
the below to extract the latitude and longitude into a single cell. You can then copy the formula
down to the end of the column.
=REGEXEXTRACT(G2,"-?\d+\.\d+, -?\d+\.\d+")
Latitude and longitude are in decimal degree format and can be positive or negative, so we start
with an optional dash -? for negative values, then use \d+ to match one or more digits, followed
by a period \. (note that we had to escape the period using \). After the period we look for one
or more digits \d+ again, followed by a literal comma ,. We then match a literal space followed by
another optional dash -?. (There are a few 0.0 latitude/longitude values that are probably
errors, but we would want to retain them so we can deal with them.) We then repeat the \d+\.\d+
we used for the latitude to match the longitude.
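The same pattern can be tried outside Google Sheets. The Python sketch below uses the regular expression from the formula above; the function name extract_latlong and the sample address are assumptions made for this example.

```python
import re

# Same pattern as in the REGEXEXTRACT formula.
LATLONG = re.compile(r"-?\d+\.\d+, -?\d+\.\d+")

def extract_latlong(address):
    """Return the 'lat, long' text found in address, or None (hypothetical helper)."""
    found = LATLONG.search(address)
    return found.group(0) if found else None

# Invented address in the style described above.
sample = "123 Example St, Springfield (41.8781, -87.6298)"

print(extract_latlong(sample))
print(extract_latlong("No coordinates here"))
```

Returning None when nothing matches makes the missing cases easy to spot, whereas the spreadsheet function reports an error in that situation.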
KEYPOINTS
Test yourself with RegexCrossword.com or via the quiz and exercises in this
lesson.
Multiple Choice Quiz
Objectives
A. ^
B. #
C. *
Answer
C
A. \s
B. \b
C. $
Answer
A
A. $Confident
B. ^Confident
C. #Confident
Answer
B
A. ^Confidential\d
B. ^Confidential\b
C. ^Confidential\w
Answer
B
A. revolution[a-z]?
B. revolution[a-z]*
C. revolution[a-z]+
Answer
B
A. [rR]evolution[s]+
B. revolution[s]?
C. [rR]evolution[s]?
Answer
C
A. dog|cat
B. dog,cat
C. dog | cat
Answer
A
A. \bdog|cat\b
B. \bdog\b | \bcat\b
C. \bdog\b|\bcat\b
Answer
C
A. {2,4}
B. {2-4}
C. [2,4]
Answer
A
C. The letter d four times?
Answer
B
KEYPOINTS
Exercises
OVERVIEW
Questions
Objectives
Exercises
The exercises are designed to embed the regex knowledge you learned during this module. We
recommend you work through it sometime after class (within a week or so).
KEYPOINTS