Regular Expressions: Python For Everybody
Chapter 11
http://en.wikipedia.org/wiki/Regular_expression
Really smart “Find” or “Search”
Understanding Regular Expressions
• Very powerful and quite cryptic
• Fun once you understand them
• Regular expressions are a language unto themselves
• A language of “marker characters” - programming with
characters
• It is kind of an “old school” language - compact
http://xkcd.com/208/
Regular Expression Quick Guide
^ Matches the beginning of a line
$ Matches the end of the line
. Matches any character
\s Matches whitespace
\S Matches any non-whitespace character
* Repeats a character zero or more times
*? Repeats a character zero or more times (non-greedy)
+ Repeats a character one or more times
+? Repeats a character one or more times (non-greedy)
[aeiou] Matches a single character in the listed set
[^XYZ] Matches a single character not in the listed set
[a-z0-9] The set of characters can include a range
( Indicates where string extraction is to start
) Indicates where string extraction is to end
https://www.py4e.com/lectures3/Pythonlearn-11-Regex-Handout.txt
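A few entries from the quick guide can be tried out directly. The following is a minimal sketch; the sample strings are made up for illustration and do not come from the course data.

```python
import re

# Hypothetical sample line, not taken from mbox-short.txt
line = 'From: csev@umich.edu'

# ^ anchors the match at the start of the line
starts_with_from = re.search(r'^From:', line) is not None

# \S+ matches a run of one or more non-whitespace characters
words = re.findall(r'\S+', 'one two  three')

# [aeiou] matches a single character from the listed set
vowels = re.findall(r'[aeiou]', 'regex')

# [^0-9]+ matches runs of characters that are not digits
non_digits = re.findall(r'[^0-9]+', 'ab12cd')

# Parentheses select the part of the match that findall() returns
user = re.findall(r'^From: (\S+)@', line)

print(starts_with_from, words, vowels, non_digits, user)
```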
The Regular Expression Module
• Before you can use regular expressions in your program,
you must import the library using “import re”
With the string method find():
hand = open('mbox-short.txt')
for line in hand:
    line = line.rstrip()
    if line.find('From:') >= 0:
        print(line)
With re.search():
import re
hand = open('mbox-short.txt')
for line in hand:
    line = line.rstrip()
    if re.search('From:', line) :
        print(line)
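To run the comparison without the data file, the sketch below swaps the file handle for an in-memory io.StringIO object. The three lines of text are made up for illustration; only the loop mirrors the slide.

```python
import io
import re

# Stand-in for open('mbox-short.txt'); the contents are invented
hand = io.StringIO(
    'From: stephen.marquard@uct.ac.za\n'
    'Subject: svn commit\n'
    'X-Note: copied from the From: header\n'
)

with_find = []
with_search = []
for line in hand:
    line = line.rstrip()
    # find() returns -1 when the substring is absent
    if line.find('From:') >= 0:
        with_find.append(line)
    # re.search() returns None when the pattern is absent
    if re.search('From:', line):
        with_search.append(line)

# For a plain substring, both tests select the same lines
print(with_find == with_search)
```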
Using re.search() Like startswith()
With the string method startswith():
hand = open('mbox-short.txt')
for line in hand:
    line = line.rstrip()
    if line.startswith('From:') :
        print(line)
With re.search():
import re
hand = open('mbox-short.txt')
for line in hand:
    line = line.rstrip()
    if re.search('^From:', line) :
        print(line)
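The effect of the ^ anchor is easiest to see on a line that contains 'From:' somewhere other than the start. A small sketch, using two made-up lines:

```python
import re

# Invented sample lines; the second has 'From:' in the middle
lines = [
    'From: stephen.marquard@uct.ac.za',
    'Reply-To: see the From: header',
]

with_startswith = [ln for ln in lines if ln.startswith('From:')]
with_anchor = [ln for ln in lines if re.search('^From:', ln)]
without_anchor = [ln for ln in lines if re.search('From:', ln)]

print(with_startswith == with_anchor)  # anchored pattern behaves like startswith()
print(len(without_anchor))             # unanchored pattern keeps both lines
```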
X-Sieve: CMU Sieve 2.3
X-DSPAM-Result: Innocent
X-: Very Short
X-Plane is behind schedule: two weeks
^X-\S+:
^ Match the start of the line
\S Match any non-whitespace character
+ One or more times
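Applying the pattern to the four sample lines shows which ones it accepts. This is a minimal sketch; the variable names are chosen here for illustration.

```python
import re

samples = [
    'X-Sieve: CMU Sieve 2.3',
    'X-DSPAM-Result: Innocent',
    'X-: Very Short',
    'X-Plane is behind schedule: two weeks',
]

# 'X-' at the start, then one or more non-whitespace characters, then ':'
matched = [s for s in samples if re.search(r'^X-\S+:', s)]

for s in matched:
    print(s)
```

The third line fails because nothing sits between 'X-' and the colon, and the fourth fails because a space appears before the first colon.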
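Matching and Extracting Data
• re.search() reports whether the string matches the regular
expression: it returns a match object (treated as True) on
success and None (treated as False) otherwise
• To extract the matching strings themselves, use
re.findall(), which returns them as a list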
>>> import re
>>> x = 'My 2 favorite numbers are 19 and 42'
>>> y = re.findall('[0-9]+',x)
>>> print(y)
['2', '19', '42']
>>> y = re.findall('[AEIOU]+',x)
>>> print(y)
[]
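The difference between testing for a match and extracting the matches can be sketched with the same string. The conversion to integers at the end is an addition for illustration.

```python
import re

x = 'My 2 favorite numbers are 19 and 42'

# re.search() stops at the first match and returns a match object
first = re.search(r'[0-9]+', x)
print(first.group())

# With no uppercase vowels in x, re.search() returns None
missing = re.search(r'[AEIOU]+', x)
print(missing)

# re.findall() returns every match as a list of strings
numbers = [int(s) for s in re.findall(r'[0-9]+', x)]
print(numbers)
```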
Warning: Greedy Matching
The repeat characters (* and +) push outward in both
directions (greedy) to match the largest possible string
>>> import re
>>> x = 'From: Using the : character'
>>> y = re.findall('^F.+:', x)
>>> print(y)
['From: Using the :']
^F.+:
.+ One or more characters
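The quick guide lists +? as the non-greedy form of +. A short sketch on the same string shows how the two differ:

```python
import re

x = 'From: Using the : character'

# Greedy: .+ grabs as much as possible, so the match ends at the last colon
greedy = re.findall(r'^F.+:', x)

# Non-greedy: .+? grabs as little as possible, so the match ends at the first colon
lazy = re.findall(r'^F.+?:', x)

print(greedy)
print(lazy)
```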
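>>> x = 'From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008'
>>> y = re.findall(r'\S+@\S+',x)
>>> print(y)
['stephen.marquard@uct.ac.za']
Parentheses mark where string extraction starts and ends:
^From (\S+@\S+)
>>> y = re.findall(r'^From (\S+@\S+)',x)
>>> print(y)
['stephen.marquard@uct.ac.za']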
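On a line that starts with 'From ', both patterns return the same address. They differ on other lines, as the sketch below suggests; the 'Cc:' line is a made-up example.

```python
import re

from_line = 'From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008'
cc_line = 'Cc: someone@example.com'  # hypothetical line for contrast

any_address = r'\S+@\S+'
from_address = r'^From (\S+@\S+)'

# The unanchored pattern finds an address on either line
print(re.findall(any_address, from_line))
print(re.findall(any_address, cc_line))

# The anchored pattern only extracts from lines beginning with 'From '
print(re.findall(from_address, from_line))
print(re.findall(from_address, cc_line))
```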
String Parsing Examples…
From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008
Position 21 is the '@' and position 31 is the space that
follows it; the host name lies between them
The Regex Version
From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008
import re
lin = 'From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008'
y = re.findall('@([^ ]*)',lin)
print(y)
['uct.ac.za']
'@([^ ]*)'
[^ ] Match non-blank character
* Match many of them
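For comparison, the host name can also be pulled out with plain string methods. The sketch below shows one way to do it with find() and slicing, one with two calls to split(), and the regex version; the variable names are chosen here for illustration.

```python
import re

data = 'From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008'

# 1. find() and slicing: locate the '@', then the next space after it
atpos = data.find('@')
sppos = data.find(' ', atpos)
host_by_slice = data[atpos + 1:sppos]

# 2. Two splits: first on whitespace, then on '@'
email = data.split()[1]
host_by_split = email.split('@')[1]

# 3. Regular expression: extract the non-blank run after the '@'
host_by_regex = re.findall(r'@([^ ]*)', data)[0]

print(atpos, sppos)
print(host_by_slice, host_by_split, host_by_regex)
```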
Even Cooler Regex Version
From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008
import re
lin = 'From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008'
y = re.findall('^From .*@([^ ]*)',lin)
print(y)
['uct.ac.za']
'^From .*@([^ ]*)'
^From Starting at the beginning of the line, look for the string
'From '
.*@ Skip any run of characters up to an at sign
( Start extracting
[^ ] Match non-blank character
* Match many of them
) Stop extracting
Writing [^ ]+ instead of [^ ]* requires at least one non-blank
character after the at sign; on this line both give the same result
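The anchored pattern is stricter than '@([^ ]*)' alone. The sketch below contrasts the two on made-up lines (the 'Received:' and two-address lines are hypothetical) and also shows one consequence of the greedy .* when a line holds more than one at sign.

```python
import re

loose = r'@([^ ]*)'
strict = r'^From .*@([^ ]*)'

from_line = 'From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008'
other_line = 'Received: by mail@example.org with ESMTP'  # invented line

# The loose pattern extracts a host from any line containing '@'
print(re.findall(loose, other_line))

# The strict pattern ignores lines that do not start with 'From '
print(re.findall(strict, other_line))
print(re.findall(strict, from_line))

# Caveat: .* is greedy, so with two at signs it skips ahead to the last one
two_addresses = 'From a@b.com via c@d.org'  # invented line
print(re.findall(strict, two_addresses))
```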
Spam Confidence
X-DSPAM-Confidence: 0.8475
import re
hand = open('mbox-short.txt')
numlist = list()
for line in hand:
    line = line.rstrip()
    stuff = re.findall('^X-DSPAM-Confidence: ([0-9.]+)', line)
    if len(stuff) != 1 : continue
    num = float(stuff[0])
    numlist.append(num)
print('Maximum:', max(numlist))
python ds.py
Maximum: 0.9907
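The same extraction can be tried without the data file. In this sketch a short list of lines stands in for mbox-short.txt; the list is made up, apart from the two confidence values that appear above.

```python
import re

# Invented stand-in for the lines of mbox-short.txt
lines = [
    'X-DSPAM-Confidence: 0.8475',
    'X-DSPAM-Probability: 0.0000',
    'Subject: svn commit',
    'X-DSPAM-Confidence: 0.9907',
]

numlist = []
for line in lines:
    # findall() returns a one-element list only on confidence lines
    stuff = re.findall(r'^X-DSPAM-Confidence: ([0-9.]+)', line)
    if len(stuff) != 1:
        continue
    numlist.append(float(stuff[0]))

# max() raises ValueError on an empty list, so guard against no matches
maximum = max(numlist) if numlist else None
print('Maximum:', maximum)
```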
Escape Character
To make a special regular expression character match itself
literally, prefix it with a backslash ('\')
>>> import re
>>> x = 'We just received $10.00 for cookies.'
>>> y = re.findall(r'\$[0-9.]+',x)
>>> print(y)
['$10.00']
\$[0-9.]+
\$ A real dollar sign
[0-9.] A digit or period
+ One or more times
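Leaving out the backslash changes the meaning of the pattern. The sketch below compares the escaped and unescaped forms, and also uses re.escape(), a standard-library helper not mentioned in the slides, to build the escaped text automatically.

```python
import re

x = 'We just received $10.00 for cookies.'

# Escaped: \$ matches a literal dollar sign
escaped = re.findall(r'\$[0-9.]+', x)

# Unescaped: $ means "end of the line", so nothing can follow it
unescaped = re.findall(r'$[0-9.]+', x)

# re.escape() adds the backslash for us
built = re.findall(re.escape('$') + r'[0-9.]+', x)

print(escaped)
print(unescaped)
print(built)
```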
Summary
• Regular expressions are a cryptic but powerful
language for matching strings and extracting elements
from those strings
• Regular expressions have special characters that
indicate intent
Acknowledgements / Contributions
These slides are Copyright 2010- Charles R. Severance (
...
www.dr-chuck.com) of the University of Michigan School of
Information and open.umich.edu and made available under a
Creative Commons Attribution 4.0 License. Please maintain this
last slide in all copies of the document to comply with the
attribution requirements of the license. If you make a change,
feel free to add your name and organization to the list of
contributors on this page as you republish the materials.