Regular Expressions: Python For Everybody

The document discusses regular expressions and provides examples of using regular expressions in Python to parse strings and extract information. Regular expressions provide a powerful yet cryptic way to match patterns in text. The document also demonstrates how to refine regular expression patterns to fine-tune matching.

Uploaded by

Emmanuel

Regular Expressions

Chapter 11

Python for Everybody

www.py4e.com
Regular Expressions
In computing, a regular expression, also
referred to as “regex” or “regexp”, provides a
concise and flexible means for matching
strings of text, such as particular characters,
words, or patterns of characters. A regular
expression is written in a formal language
that can be interpreted by a regular
expression processor.
http://en.wikipedia.org/wiki/Regular_expression
Regular Expressions
Really clever “wild card” expressions for matching and parsing strings
Really smart “Find” or “Search”

http://en.wikipedia.org/wiki/Regular_expression
Understanding Regular Expressions
• Very powerful and quite cryptic
• Fun once you understand them
• Regular expressions are a language unto themselves
• A language of “marker characters” - programming with characters
• It is kind of an “old school” language - compact
http://xkcd.com/208/
Regular Expression Quick Guide
^ Matches the beginning of a line
$ Matches the end of the line
. Matches any character
\s Matches whitespace
\S Matches any non-whitespace character
* Repeats a character zero or more times
*? Repeats a character zero or more times (non-greedy)
+ Repeats a character one or more times
+? Repeats a character one or more times (non-greedy)
[aeiou] Matches a single character in the listed set
[^XYZ] Matches a single character not in the listed set
[a-z0-9] The set of characters can include a range
( Indicates where string extraction is to start
) Indicates where string extraction is to end

https://www.py4e.com/lectures3/Pythonlearn-11-Regex-Handout.txt
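
A few entries from the quick guide can be tried out directly. The block below is a minimal sketch, not part of the original slides; the sample line is borrowed from the mail headers used later in the deck, and the variable names are only illustrative. Raw strings (the r prefix) are used so that backslashes reach the regular expression engine unchanged.

```python
import re

line = 'X-DSPAM-Confidence: 0.8475'

# ^ anchors the match at the beginning of the line
starts_with_x = re.search(r'^X-', line) is not None

# $ anchors the match at the end of the line
ends_with_digit = re.search(r'[0-9]$', line) is not None

# \S+ matches a run of one or more non-whitespace characters
tokens = re.findall(r'\S+', line)

# [aeiou] matches a single character from the listed set (lowercase only)
vowels = re.findall(r'[aeiou]', line)

# [^0-9] matches a single character that is NOT in the listed set
non_digits = re.findall(r'[^0-9]', '0.8475')

print(starts_with_x, ends_with_digit)
print(tokens)
print(vowels)
print(non_digits)
```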
The Regular Expression Module
• Before you can use regular expressions in your program, you must import the library using “import re”

• You can use re.search() to see if a string matches a regular expression, similar to using the find() method for strings

• You can use re.findall() to extract portions of a string that match your regular expression, similar to a combination of find() and slicing: var[5:10]
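
The two comparisons in the bullets above can be sketched side by side. This is an illustration under assumptions: the sample line and the variable names are made up for the example, and the pattern \S+@\S+ is the one the deck introduces later.

```python
import re

line = 'From: stephen.marquard@uct.ac.za'

# find() returns an index (-1 when absent); re.search() returns a match object or None
found_with_find = line.find('From:') >= 0
found_with_search = re.search('From:', line) is not None

# Slicing needs explicit positions worked out by hand...
colon = line.find(':')
sliced = line[colon + 2:]

# ...while re.findall() returns the matching pieces directly, as a list
extracted = re.findall(r'\S+@\S+', line)

print(found_with_find, found_with_search)
print(sliced)
print(extracted)
```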
Using re.search() Like find()

Using find():

hand = open('mbox-short.txt')
for line in hand:
    line = line.rstrip()
    if line.find('From:') >= 0:
        print(line)

Using re.search():

import re
hand = open('mbox-short.txt')
for line in hand:
    line = line.rstrip()
    if re.search('From:', line) :
        print(line)
Using re.search() Like startswith()
Using startswith():

hand = open('mbox-short.txt')
for line in hand:
    line = line.rstrip()
    if line.startswith('From:') :
        print(line)

Using re.search():

import re
hand = open('mbox-short.txt')
for line in hand:
    line = line.rstrip()
    if re.search('^From:', line) :
        print(line)

We fine-tune what is matched by adding special characters to the string
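
The effect of the ^ character is easiest to see without a file. The sketch below uses a short, made-up list of lines as a stand-in for mbox-short.txt; the list contents and variable names are assumptions for illustration only.

```python
import re

# Hypothetical stand-in for a few lines of mbox-short.txt
lines = [
    'From: stephen.marquard@uct.ac.za',
    'Subject: Re: From: the archives',
    'X-DSPAM-Result: Innocent',
]

# 'From:' anywhere in the line, like find()
anywhere = [l for l in lines if re.search('From:', l)]

# 'From:' only at the beginning of the line, like startswith()
at_start = [l for l in lines if re.search('^From:', l)]
with_startswith = [l for l in lines if l.startswith('From:')]

print(anywhere)
print(at_start)
```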
Wild-Card Characters
• The dot character matches any character

• If you add the asterisk character, the character is “any number of times”

X-Sieve: CMU Sieve 2.3
X-DSPAM-Result: Innocent
X-DSPAM-Confidence: 0.8475
X-Content-Type-Message-Body: text/plain

^X.*:
^   Match the start of the line
.   Match any character
*   Many times
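
To see which part of each line the pattern ^X.*: actually covers, the match object's group() method can be printed. This is a small sketch; the 'Received:' line is an invented non-matching example and the dictionary name is hypothetical.

```python
import re

headers = [
    'X-Sieve: CMU Sieve 2.3',
    'X-DSPAM-Confidence: 0.8475',
    'Received: from murder',   # made-up line that should not match
]

# Map each line to the text covered by the pattern, or None when there is no match
matched = {}
for line in headers:
    m = re.search('^X.*:', line)
    matched[line] = m.group() if m else None

for line, text in matched.items():
    print(repr(line), '->', repr(text))
```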
Fine-Tuning Your Match
Depending on how “clean” your data is and the purpose of
your application, you may want to narrow your match down
a bit
X-Sieve: CMU Sieve 2.3
X-DSPAM-Result: Innocent
X-Plane is behind schedule: two weeks
X-: Very short

^X.*:
^   Match the start of the line
.   Match any character
*   Many times
Fine-Tuning Your Match
Depending on how “clean” your data is and the purpose of
your application, you may want to narrow your match down
a bit

X-Sieve: CMU Sieve 2.3
X-DSPAM-Result: Innocent
X-: Very Short
X-Plane is behind schedule: two weeks

^X-\S+:
^    Match the start of the line
\S   Match any non-whitespace character
+    One or more times
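
The difference between the loose and the narrowed pattern can be checked on the four sample lines from the slide. A minimal sketch; the variable names are illustrative.

```python
import re

lines = [
    'X-Sieve: CMU Sieve 2.3',
    'X-DSPAM-Result: Innocent',
    'X-: Very Short',
    'X-Plane is behind schedule: two weeks',
]

# Loose: X, then anything, then a colon somewhere later in the line
loose = [l for l in lines if re.search(r'^X.*:', l)]

# Narrow: X-, then one or more non-whitespace characters, then a colon
narrow = [l for l in lines if re.search(r'^X-\S+:', l)]

print(len(loose), 'lines match the loose pattern')
print(narrow)
```

The narrow pattern rejects 'X-: Very Short' because \S+ needs at least one character before the colon, and it rejects the 'X-Plane' line because a space appears before the colon.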
Matching and Extracting Data
• re.search() returns a match object if the string matches the regular expression and None if it does not, so it can be used like a True/False test

• If we actually want the matching strings to be extracted, we use re.findall()
>>> import re
>>> x = 'My 2 favorite numbers are 19 and 42'
>>> y = re.findall('[0-9]+',x)
>>> print(y)
['2', '19', '42']

[0-9]+   One or more digits
Matching and Extracting Data
When we use re.findall(), it returns a list of zero or more
sub-strings that match the regular expression

>>> import re
>>> x = 'My 2 favorite numbers are 19 and 42'
>>> y = re.findall('[0-9]+',x)
>>> print(y)
['2', '19', '42']
>>> y = re.findall('[AEIOU]+',x)
>>> print(y)
[]
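
For comparison, here is a sketch of what re.search() gives back on the same string. It uses the standard match object methods group() and span(); the variable names are only for illustration.

```python
import re

x = 'My 2 favorite numbers are 19 and 42'

# re.search() stops at the first match and returns a match object
m = re.search('[0-9]+', x)
first_number = m.group() if m else None   # the matched text
first_span = m.span() if m else None      # (start, end) positions in x

# When nothing matches, re.search() returns None, which behaves like False
no_match = re.search('[AEIOU]+', x)

# re.findall() returns every match as a list of strings; convert if numbers are needed
all_numbers = [int(s) for s in re.findall('[0-9]+', x)]

print(first_number, first_span)
print(no_match)
print(all_numbers)
```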
Warning: Greedy Matching
The repeat characters (* and +) push outward in both
directions (greedy) to match the largest possible string
>>> import re
>>> x = 'From: Using the : character'
>>> y = re.findall('^F.+:', x)
>>> print(y)
['From: Using the :']

^F.+:
F    First character in the match is an F
.+   One or more characters
:    Last character in the match is a :

Why not 'From:' ?
Non-Greedy Matching
Not all regular expression repeat codes are greedy! If you add a ? character, the + and * chill out a bit...

>>> import re
>>> x = 'From: Using the : character'
>>> y = re.findall('^F.+?:', x)
>>> print(y)
['From:']

^F.+?:
F     First character in the match is an F
.+?   One or more characters but not greedy
:     Last character in the match is a :
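
Running the greedy and non-greedy patterns on the same string makes the difference concrete. The third pattern, which uses a negated set instead of a non-greedy repeat, is an alternative added here for comparison and is not from the slides.

```python
import re

x = 'From: Using the : character'

# Greedy: .+ grabs as much as it can, so the match ends at the LAST colon
greedy = re.findall('^F.+:', x)

# Non-greedy: .+? grabs as little as it can, so the match ends at the FIRST colon
non_greedy = re.findall('^F.+?:', x)

# Alternative (assumption, not in the slides): forbid colons inside the repeated part
negated_set = re.findall('^F[^:]+:', x)

print(greedy)
print(non_greedy)
print(negated_set)
```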
Fine-Tuning String Extraction
You can refine the match for re.findall() and separately determine
which portion of the match is to be extracted by using parentheses

>>> x = 'From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008'
>>> y = re.findall('\S+@\S+',x)
>>> print(y)
['stephen.marquard@uct.ac.za']

\S+@\S+
\S+   At least one non-whitespace character
Fine-Tuning String Extraction
Parentheses are not part of the match - but they tell where to start and stop extracting the string

>>> x = 'From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008'
>>> y = re.findall('\S+@\S+',x)
>>> print(y)
['stephen.marquard@uct.ac.za']
>>> y = re.findall('^From (\S+@\S+)',x)
>>> print(y)
['stephen.marquard@uct.ac.za']

^From (\S+@\S+)
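
The role of the parentheses shows up when the same pattern is run with and without them. This is a sketch under assumptions: the extra lines in the list are made up to show that the 'From ' prefix also acts as a filter, and raw strings are used so the backslashes are passed through unchanged.

```python
import re

x = 'From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008'

# No parentheses: findall() returns everything the pattern matched
whole_match = re.findall(r'^From \S+@\S+', x)

# With parentheses: the whole pattern must still match, but only the group is returned
group_only = re.findall(r'^From (\S+@\S+)', x)

# Hypothetical extra lines: only lines starting with 'From ' contribute an address
lines = [
    x,
    'Return-Path: <postmaster@collab.sakaiproject.org>',
    'From: stephen.marquard@uct.ac.za',
]
senders = []
for line in lines:
    senders.extend(re.findall(r'^From (\S+@\S+)', line))

print(whole_match)
print(group_only)
print(senders)
```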
String Parsing Examples…
From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008
(the at sign is at position 21; the space after the host name is at position 31)

Extracting a host name - using find and string slicing

>>> data = 'From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008'
>>> atpos = data.find('@')
>>> print(atpos)
21
>>> sppos = data.find(' ',atpos)
>>> print(sppos)
31
>>> host = data[atpos+1 : sppos]
>>> print(host)
uct.ac.za
The Double Split Pattern
Sometimes we split a line one way, and then grab one of
the pieces of the line and split that piece again

From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008

words = line.split()
email = words[1]               # stephen.marquard@uct.ac.za
pieces = email.split('@')      # ['stephen.marquard', 'uct.ac.za']
print(pieces[1])               # 'uct.ac.za'
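
The double split steps can be wrapped in a small function and checked against the find-and-slice approach. The function names below are hypothetical and the sketch assumes the line has the 'From address date' shape shown on the slide.

```python
def host_by_double_split(line):
    """Split on whitespace, take the second word, then split that on '@'."""
    words = line.split()
    email = words[1]
    pieces = email.split('@')
    return pieces[1]


def host_by_find(line):
    """Locate the at sign and the next space, then slice between them."""
    atpos = line.find('@')
    sppos = line.find(' ', atpos)
    return line[atpos + 1:sppos]


line = 'From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008'
print(host_by_double_split(line))
print(host_by_find(line))
```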
The Regex Version
From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008

import re
lin = 'From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008'
y = re.findall('@([^ ]*)',lin)
print(y)

['uct.ac.za']

'@([^ ]*)'
@      Look through the string until you find an at sign
[^ ]   Match non-blank character
*      Match many of them
( )    Extract the non-blank characters
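
Two properties of the pattern '@([^ ]*)' are worth trying: it returns one host per at sign in the line, and because * means zero or more, a lone at sign still matches with an empty extraction. The sample strings below are invented for illustration.

```python
import re

# Hypothetical line with two addresses: one extracted host per at sign
two_addresses = 'Cc: alice@example.com bob@example.org'
hosts = re.findall('@([^ ]*)', two_addresses)

# Hypothetical line with a stray at sign followed by a space
stray = 'meet me @ noon'
with_star = re.findall('@([^ ]*)', stray)   # * allows zero characters
with_plus = re.findall('@([^ ]+)', stray)   # + requires at least one

print(hosts)
print(with_star)
print(with_plus)
```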

Even Cooler Regex Version
From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008

import re
lin = 'From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008'
y = re.findall('^From .*@([^ ]*)',lin)
print(y)

['uct.ac.za']

'^From .*@([^ ]*)'
^From   Starting at the beginning of the line, look for the string 'From '
.*@     Skip a bunch of characters, looking for an at sign
(       Start extracting
[^ ]    Match non-blank character
*       Match many of them
)       Stop extracting
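
What the '^From ' prefix adds can be seen by running both versions of the pattern on a line that contains an address but does not start with 'From '. The second line below is a made-up example.

```python
import re

lin = 'From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008'
other = 'Cc: someone@example.com'   # hypothetical line, not from the slides

# Without the prefix, any at sign in any line produces a host
unanchored = [re.findall('@([^ ]*)', s) for s in (lin, other)]

# With the prefix, only lines that begin with 'From ' produce a host
anchored = [re.findall('^From .*@([^ ]*)', s) for s in (lin, other)]

print(unanchored)
print(anchored)
```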
Spam Confidence
X-DSPAM-Confidence: 0.8475

import re
hand = open('mbox-short.txt')
numlist = list()
for line in hand:
    line = line.rstrip()
    stuff = re.findall('^X-DSPAM-Confidence: ([0-9.]+)', line)
    if len(stuff) != 1 : continue
    num = float(stuff[0])
    numlist.append(num)
print('Maximum:', max(numlist))

python ds.py
Maximum: 0.9907
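
The same loop can be tried without the mbox-short.txt file. The sketch below swaps the file for a short in-memory list; the list is a made-up sample (only the two confidence values appear in the slides), so the result describes this sample, not the real file.

```python
import re

# Hypothetical in-memory sample standing in for mbox-short.txt
sample = [
    'X-DSPAM-Result: Innocent',
    'X-DSPAM-Confidence: 0.8475',
    'X-DSPAM-Probability: 0.0000',
    'X-DSPAM-Confidence: 0.9907',
    'X-DSPAM-Confidence: n/a',      # no digits after the colon, so it is skipped
]

numlist = []
for line in sample:
    line = line.rstrip()
    # The parentheses extract only the number, not the header name
    stuff = re.findall(r'^X-DSPAM-Confidence: ([0-9.]+)', line)
    if len(stuff) != 1:
        continue
    numlist.append(float(stuff[0]))

print('Maximum:', max(numlist))
```

Note that [0-9.]+ would also accept text such as 1.2.3, which float() rejects, so real data may need a stricter pattern.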
Escape Character
If you want a special regular expression character to just
behave normally (most of the time) you prefix it with '\'

>>> import re
>>> x = 'We just received $10.00 for cookies.'
>>> y = re.findall('\$[0-9.]+',x)
>>> print(y)
['$10.00']

\$[0-9.]+
\$       A real dollar sign
[0-9.]   A digit or period
+        At least one or more
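
Leaving out the backslash shows why the escape is needed: an unescaped $ means "end of the line", so nothing can follow it. The last part uses re.escape() from the standard library, which is not covered in the slides, as one possible way to escape a literal string automatically.

```python
import re

x = 'We just received $10.00 for cookies.'

# Escaped: \$ is a real dollar sign
escaped = re.findall(r'\$[0-9.]+', x)

# Unescaped: $ means end of line, and no digits can come after the end
unescaped = re.findall('$[0-9.]+', x)

# re.escape() adds the backslashes needed to match a string literally
literal = re.escape('$10.00')
found_literal = re.findall(literal, x)

print(escaped)
print(unescaped)
print(found_literal)
```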
Summary
• Regular expressions are a cryptic but powerful
language for matching strings and extracting elements
from those strings
• Regular expressions have special characters that
indicate intent
Acknowledgements / Contributions
These slides are Copyright 2010- Charles R. Severance (
...
www.dr-chuck.com) of the University of Michigan School of
Information and open.umich.edu and made available under a
Creative Commons Attribution 4.0 License. Please maintain this
last slide in all copies of the document to comply with the
attribution requirements of the license. If you make a change,
feel free to add your name and organization to the list of
contributors on this page as you republish the materials.

Initial Development: Charles Severance, University of Michigan

School of Information

… Insert new Contributors and Translations here
