Module3 RegularExpressions

The document discusses regular expressions in Python. It introduces regular expressions as programs that search and parse strings. It provides examples of using regular expression functions like search() and findall() to extract patterns from text. It also summarizes important regular expression meta characters like ., *, ?, [], etc. and provides examples of using them to match various patterns in strings.

Uploaded by

adfqadsf


3.4 REGULAR EXPRESSIONS

Searching for required patterns and extracting only the lines or words that match them is
a very common task in solving problems programmatically. We have done such tasks
earlier using string slicing and string methods like split(), find() etc. Because searching
and extracting are so common, Python provides a powerful module for regular
expressions that handles these tasks elegantly. Although their syntax is quite complicated,
regular expressions provide a compact and efficient way of searching for patterns.

Regular expressions are themselves little programs to search and parse strings. To use
them in our program, the module re must be imported. This module has a search() function,
which is used to find a particular pattern within a string. Consider the
following example –

import re
fhand = open('myfile.txt')
for line in fhand:
    line = line.rstrip()
    if re.search('how', line):
        print(line)

For the file myfile.txt discussed in previous chapters, the output would
be –

hello, how are you?

how about you?

In the above program, the search() function is used to find the lines that contain the
string how.

One can observe that the above program is not much different from a program that uses
the find() method of strings. But regular expressions make use of special characters with
specific meanings. In the following example, we make use of the caret (^) symbol, which
indicates the beginning of the line.

import re
hand = open('myfile.txt')
for line in hand:
    line = line.rstrip()
    if re.search('^how', line):
        print(line)

The output would be –

how about you?

Here, we have searched for lines that start with the string how. Again, this program does
not make full use of regular expressions, because it could have been
written using the string method startswith(). Hence, in the next section, we will understand
the true usage of regular expressions.
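
As a rough illustration of this point, the sketch below runs the regular expression versions and the plain string method versions side by side. The sample lines are assumptions chosen to resemble myfile.txt; they are not the actual contents of that file.

```python
import re

# Hypothetical sample lines standing in for the contents of myfile.txt
lines = ['hello, how are you?', 'I am doing fine.', 'how about you?']

# Substring test: re.search() versus the string method find()
with_re = [ln for ln in lines if re.search('how', ln)]
with_find = [ln for ln in lines if ln.find('how') != -1]

# Start-of-line test: the ^ anchor versus the string method startswith()
anchored_re = [ln for ln in lines if re.search('^how', ln)]
anchored_str = [ln for ln in lines if ln.startswith('how')]

print(with_re == with_find)          # both approaches select the same lines
print(anchored_re == anchored_str)   # both approaches select the same lines
```

For patterns this simple the string methods are enough; regular expressions pay off once metacharacters are needed.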

3.4.1 Character Matching in Regular Expressions

Regular expressions use a set of metacharacters to describe search patterns. Table 3.1 shows the
details of a few important metacharacters. Some examples for quick and easy
understanding of regular expressions are given in Table 3.2.

Table 3.1 List of Important Meta-Characters

Character   Meaning
^ (caret)   Matches beginning of the line
$           Matches end of the line
. (dot)     Matches any single character except newline. With the re.DOTALL
            flag (re.S), newline is matched as well
[…]         Matches any single character in brackets
[^…]        Matches any single character NOT in brackets
re*         Matches 0 or more occurrences of preceding expression.
re+         Matches 1 or more occurrences of preceding expression.
re?         Matches 0 or 1 occurrence of preceding expression.
re{n}       Matches exactly n occurrences of preceding expression.
re{n,}      Matches n or more occurrences of preceding expression.
re{n,m}     Matches at least n and at most m occurrences of preceding expression.
a|b         Matches either a or b.
(re)        Groups regular expressions and remembers matched text.
\d          Matches digits. Equivalent to [0-9] for ASCII text.
\D          Matches non-digits.
\w          Matches word characters (letters, digits and underscore).
\W          Matches non-word characters.
\s          Matches whitespace. Equivalent to [ \t\n\r\f\v] for ASCII text.
\S          Matches non-whitespace.
\A          Matches beginning of string.
\Z          Matches end of string. In Python, it matches only at the very end,
            not before a trailing newline.
\z          Matches end of string in Perl-style engines. Older versions of
            Python's re module do not accept it, so use \Z instead.
\b          Matches the empty string, but only at the start or end of a word.
\B          Matches the empty string, but not at the start or end of a word.
( )         When parentheses are added to a regular expression, they are ignored
            for the purpose of matching, but allow you to extract a particular subset
            of the matched string rather than the whole string when using
            findall()

Table 3.2 Examples for Regular Expressions

Expression    Description
[Pp]ython     Match "Python" or "python"
rub[ye]       Match "ruby" or "rube"
[aeiou]       Match any one lowercase vowel
[0-9]         Match any digit; same as [0123456789]
[a-z]         Match any lowercase ASCII letter
[A-Z]         Match any uppercase ASCII letter
[a-zA-Z0-9]   Match any ASCII letter (uppercase or lowercase) or digit
[^aeiou]      Match anything other than a lowercase vowel
[^0-9]        Match anything other than a digit
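
The entries in Tables 3.1 and 3.2 can be tried out directly with findall() and search(). The following is a minimal sketch; all the test strings are made up for illustration, and re.DOTALL is the standard flag of the re module that lets the dot match a newline.

```python
import re

# Quantifiers: * (0 or more), + (1 or more), ? (0 or 1), {n} (exactly n)
star = re.findall(r'ab*', 'a ab abbb')            # ['a', 'ab', 'abbb']
plus = re.findall(r'ab+', 'a ab abbb')            # ['ab', 'abbb']
optional = re.findall(r'colou?r', 'color colour')
exactly3 = re.findall(r'[0-9]{3}', '12 345 6789')

# Character classes from Table 3.2
either_case = re.findall(r'[Pp]ython', 'Python and python')
not_vowel = re.findall(r'[^aeiou ]', 'regex rule')

# Shorthand classes and the word boundary \b
numbers = re.findall(r'\d+', 'room 42, floor 7')
whole_word = re.findall(r'\bcat\b', 'cat concat cat.')

# By default the dot does not match a newline; re.DOTALL changes that
dot_default = re.search(r'a.b', 'a\nb')
dot_all = re.search(r'a.b', 'a\nb', re.DOTALL)

print(star, plus, optional, exactly3)
print(either_case, not_vowel, numbers, whole_word)
print(dot_default is None, dot_all is not None)
```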

The most commonly used metacharacter is the dot, which matches any single character except a newline. Consider the
following example, where the regular expression searches for lines that start with I,
followed by any two characters (each dot stands for one character), followed by the
character m.
import re
fhand = open('myfile.txt')
for line in fhand:
    line = line.rstrip()
    if re.search('^I..m', line):
        print(line)

The output would be –

I am doing fine.

Note that the regular expression ^I..m matches not only ‘I am’ but also ‘Isdm’,
‘I*3m’ and so on. That is, any two characters may appear between I and m.
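
A quick way to check this claim is to test the pattern against a handful of strings. The candidate strings below are made up for illustration.

```python
import re

pattern = r'^I..m'

# Hypothetical test strings: the first three have exactly two characters
# between a leading I and an m; the others do not.
candidates = ['I am doing fine.', 'Isdm', 'I*3m', 'I m', 'Im', 'You and I am']

matched = [s for s in candidates if re.search(pattern, s)]
print(matched)   # ['I am doing fine.', 'Isdm', 'I*3m']
```

The last candidate is rejected because of the ^ anchor: it contains 'I am', but not at the beginning of the string.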

In the previous program, we knew that there are exactly two characters between I and m.
Hence, we were able to use two dots. But when we don’t know the exact number of
characters between two characters (or strings), we can use the dot and + symbols
together. Consider the program given below –

import re
hand = open('myfile.txt')
for line in hand:
    line = line.rstrip()
    if re.search('^h.+u', line):
        print(line)

The output would be –

hello, how are you?
how about you?

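Observe the regular expression ^h.+u here. It indicates that the line should start
with h, followed by one or more characters of any kind (the dot with +), followed by u.
The line itself need not end with u: in both output lines the match stops at the u of
you, and the ? that follows is simply not part of the match.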
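
To see exactly which part of a line the pattern ^h.+u covers, the match object returned by search() can be inspected with group(). The sketch below also shows the non-greedy form .+? for comparison; that form is not used elsewhere in these notes and is included only as an aside.

```python
import re

line = 'how about you?'

# Greedy: .+ takes as many characters as it can, so the match runs to the LAST u
greedy = re.search(r'^h.+u', line).group()

# Non-greedy (an aside): .+? takes as few characters as it can, stopping at the FIRST u
lazy = re.search(r'^h.+?u', line).group()

# .+ needs at least one character between h and u, so 'hu' does not match
too_short = re.search(r'^h.+u', 'hu')

print(greedy)      # how about you
print(lazy)        # how abou
print(too_short)   # None
```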

A few examples:
To understand the behavior of a few basic metacharacters, we will see some examples. The
file used for these examples is mbox-short.txt, which can be downloaded from –
https://www.py4e.com/code3/mbox-short.txt

Use this as input and try the following examples –

• Pattern to extract lines starting with the word From (or from) and ending with edu:
import re
fhand = open('mbox-short.txt')
for line in fhand:
    line = line.rstrip()
    pattern = '^[Ff]rom.*edu$'
    if re.search(pattern, line):
        print(line)

Here the pattern given for regular expression indicates that the line should start with
either From or from. Then there may be 0 or more characters, and later the line should
end with edu.

• Pattern to extract lines ending with any digit:

Replace the pattern with the following string; the rest of the program remains the same.
pattern = '[0-9]$'

• Using Not:
pattern = '^[^a-z0-9]+'

Here, the first ^ indicates that the match must begin at the start of the line. The
^ inside the square brackets negates the set: it matches any single character that is not listed in the brackets.
Hence, the whole pattern means that the line must start with a character other than a
lowercase letter or a digit.

• Start with an uppercase letter and end with a digit:

pattern = '^[A-Z].*[0-9]$'

Here, the line should start with a capital letter, followed by 0 or more characters, and must
end with a digit.
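
The four patterns above can be compared without downloading the file by running them over a few in-memory lines. This is only a sketch: the sample lines are hypothetical, written in the style of mbox-short.txt rather than copied from it.

```python
import re

# Hypothetical sample lines in the style of mbox-short.txt
lines = [
    'From: stephen.marquard@uct.ac.za',
    'from: louis@media.berkeley.edu',
    'X-DSPAM-Confidence: 0.8475',
    'Details: rev=39772',
    '  indented line',
    'plain text',
]

patterns = {
    'From/from ... edu': r'^[Ff]rom.*edu$',
    'ends with a digit': r'[0-9]$',
    'does not start with a-z or 0-9': r'^[^a-z0-9]+',
    'uppercase start, digit end': r'^[A-Z].*[0-9]$',
}

# For each pattern, collect the lines in which search() finds a match
results = {}
for label, pattern in patterns.items():
    results[label] = [ln for ln in lines if re.search(pattern, ln)]
    print(label, '->', results[label])
```

Note that the third pattern also accepts the line that begins with spaces, since a space is neither a lowercase letter nor a digit.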

3.4.2 Extracting Data using Regular Expressions

The re module provides a function findall() to extract all of the substrings matching a regular
expression. This function returns a list of all non-overlapping matches in the string. If
no match is found, the function returns an empty list. Consider an example of extracting
anything that looks like an email address from a line.

import re
s = 'A message from csev@umich.edu to cwen@iupui.edu about meeting @2PM'
lst = re.findall(r'\S+@\S+', s)   # raw string keeps the backslashes intact
print(lst)

The output would be –

['csev@umich.edu', 'cwen@iupui.edu']

Here, the pattern requires at least one non-whitespace character (\S) before @ and at
least one non-whitespace character after @. Hence, it will not match @2PM, because there is a
whitespace character before the @.

Now, we can write a complete program to extract all email-ids from the file.

import re
fhand = open('mbox-short.txt')
for line in fhand:
    line = line.rstrip()
    x = re.findall(r'\S+@\S+', line)
    if len(x) > 0:
        print(x)

Here, the condition len(x) > 0 is checked because we want to print the matches only for lines that
contain an email-ID. If a line has no match for the given pattern, the findall()
function returns an empty list. The length of an empty list is zero, and hence we
print only the lists with length greater than 0.

The output of above program will be something as below –

['stephen.marquard@uct.ac.za']
['<postmaster@collab.sakaiproject.org>']
['<200801051412.m05ECIaH010327@nakamura.uits.iupui.edu>']
['<source@collab.sakaiproject.org>;']
['<source@collab.sakaiproject.org>;']
['<source@collab.sakaiproject.org>;']
['apache@localhost)']
……………………………….
………………………………..

Note that, apart from just email-IDs, the output contains additional characters (<, >, ; etc.)
attached to the extracted pattern. To remove them, refine the pattern. That is, we want the
email-ID to start with a letter or digit and end with a letter. Hence,
the statement would be –

x = re.findall(r'[a-zA-Z0-9]\S*@\S*[a-zA-Z]', line)
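
The effect of the refinement can be seen by applying both patterns to some of the strings from the output shown earlier. A minimal sketch:

```python
import re

# Strings taken from the earlier output, including the unwanted punctuation
samples = [
    '<postmaster@collab.sakaiproject.org>',
    '<source@collab.sakaiproject.org>;',
    'apache@localhost)',
]

loose = r'\S+@\S+'
strict = r'[a-zA-Z0-9]\S*@\S*[a-zA-Z]'

loose_hits = [re.findall(loose, s) for s in samples]
strict_hits = [re.findall(strict, s) for s in samples]

for s, a, b in zip(samples, loose_hits, strict_hits):
    print(s, '->', a, 'vs', b)
```

The refined pattern works because \S* is greedy but gives characters back until the final [a-zA-Z] can match, which leaves the trailing >, ; and ) outside the match.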

3.4.3 Combining Searching and Extracting

Assume that we need to extract the data in a particular syntax. For example, we need to
extract the lines containing following format –

X-DSPAM-Confidence: 0.8475
X-DSPAM-Probability: 0.0000

The line should start with X-, followed by 0 or more non-whitespace characters. Then we need a colon and
a space, which are written as they are. Then there must be a number consisting of one or more
digits, with or without a decimal point. Note that a dot inside square brackets is a literal dot
and not a metacharacter. The pattern for the regular expression would be –
^X-\S*: [0-9.]+

The complete program is –

import re
hand = open('mbox-short.txt')
for line in hand:
    line = line.rstrip()
    if re.search(r'^X-\S*: [0-9.]+', line):
        print(line)

The output lines will be as below –

X-DSPAM-Confidence: 0.8475
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.6178
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.6961
X-DSPAM-Probability: 0.0000
……………………………………………………
……………………………………………………

Assume that we want only the numbers (representing confidence, probability etc.) in the
above output. We can use the split() function on the extracted string. But it is better to refine
the regular expression. To do so, we need the help of parentheses.

When we add parentheses to a regular expression, they do not change what the expression
matches. But when we are using findall(), parentheses indicate that while we want the whole
expression to match, we are interested in extracting only the portion of the substring that
matches the part of the expression inside the parentheses.

import re
hand = open('mbox-short.txt')
for line in hand:
    line = line.rstrip()
    x = re.findall(r'^X-\S*: ([0-9.]+)', line)
    if len(x) > 0:
        print(x)

Because of the parentheses around [0-9.]+ in the pattern above, findall() still matches lines starting
with X- but extracts only the numeric portion. Now, the output would be –
['0.8475']
['0.0000']
['0.6178']
['0.0000']
['0.6961']
…………………
………………..
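
How findall() treats parentheses can be summarised with one line from the file. The sketch below compares a pattern with no group, one group and two groups; the two-group variant is an added illustration and does not appear in the programs above.

```python
import re

line = 'X-DSPAM-Confidence: 0.8475'

# No parentheses: findall() returns the whole matched text
whole = re.findall(r'^X-\S*: [0-9.]+', line)

# One group: findall() returns only the text matched inside the parentheses
number = re.findall(r'^X-\S*: ([0-9.]+)', line)

# Two groups (illustrative variant): findall() returns a list of tuples
both = re.findall(r'^X-(\S*): ([0-9.]+)', line)

# The extracted text is still a string; convert it before doing arithmetic
value = float(number[0])

print(whole)    # ['X-DSPAM-Confidence: 0.8475']
print(number)   # ['0.8475']
print(both)     # [('DSPAM-Confidence', '0.8475')]
print(value)
```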

Another example of similar form: The file mbox-short.txt contains lines like –

Details: http://source.sakaiproject.org/viewsvn/?view=rev&rev=39772

We may be interested in extracting only the revision numbers mentioned at the end of
these lines. Then, we can write the statement –

x = re.findall(r'^Details:.*rev=([0-9]+)', line)

The regex here indicates that the line must start with Details:, contain
rev= somewhere after it, and then have digits. As we want only those digits, we put parentheses around that portion
of the expression. Note that the expression [0-9]+ is greedy: it keeps grabbing digits until it finds
a character other than a digit, so it captures the whole number however long it is. The
output of the above regular expression is a list with the revision number for each matching line, as given below –
['39772']
['39771']
['39770']
['39769']
………………………
………………………

Consider another example – we may be interested in knowing time of a day of each email.
The file mbox-short.txt has lines like –
From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008

Here, we would like to extract only the hour 09. That is, we would like only the two digits
representing the hour. Hence, we need to modify our expression as –
x = re.findall(r'^From .* ([0-9][0-9]):', line)

Here, [0-9][0-9] indicates two consecutive digits. An alternative way
of writing this would be -

x = re.findall(r'^From .* ([0-9]{2}):', line)

The number 2 within curly braces indicates that the preceding expression should appear
exactly two times. Hence [0-9]{2} matches exactly two digits. Now, the
output would be –

['09']
['18']
['16']
['15']
…………………
…………………
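
Both forms of the pattern can be checked against a few lines in memory. In the sketch below, the first line is the example quoted above; the other two lines are hypothetical and only serve to show which lines are skipped.

```python
import re

# First line is the example from the text; the other two are hypothetical
lines = [
    'From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008',
    'From louis@media.berkeley.edu Fri Jan 4 18:10:48 2008',
    'From: zqian@umich.edu',
]

two_classes = r'^From .* ([0-9][0-9]):'
with_braces = r'^From .* ([0-9]{2}):'

# Collect every captured hour across all lines
hours_a = [h for ln in lines for h in re.findall(two_classes, ln)]
hours_b = [h for ln in lines for h in re.findall(with_braces, ln)]

print(hours_a)   # ['09', '18']
print(hours_b)   # ['09', '18']
```

The minutes and seconds are not captured because the pattern requires a space immediately before the two digits. The third line is skipped because From is followed by a colon rather than a space.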

3.4.4 Escape Character

As we have discussed till now, characters like dot, plus, question mark, asterisk, dollar
etc. are metacharacters in regular expressions. Sometimes, we need these characters
themselves as a part of the matching string. Then, we need to escape them using a
backslash. For example,

import re
x = 'We just received $10.00 for cookies.'
y = re.findall(r'\$[0-9.]+', x)   # \$ matches a literal dollar sign
print(y)

Output:
['$10.00']

Here, we want to extract only the price $10.00. As the $ symbol is a metacharacter, we need
to put \ before it, so that $ is treated as a part of the matching string and not as a
metacharacter.
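
The difference the backslash makes can be seen by running the pattern with and without it. The sketch below also uses re.escape(), a function of the re module that adds the backslashes automatically; it is not discussed in these notes and is shown only as a convenience.

```python
import re

x = 'We just received $10.00 for cookies.'

# Escaped: \$ is a literal dollar sign
escaped = re.findall(r'\$[0-9.]+', x)

# Unescaped: $ means end of string, and nothing can follow the end, so no match
unescaped = re.findall(r'$[0-9.]+', x)

# re.escape() builds a pattern that matches the given text literally
literal = re.escape('$10.00')
found = re.findall(literal, x)

print(escaped)     # ['$10.00']
print(unescaped)   # []
print(literal)
print(found)       # ['$10.00']
```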

3.4.5 Bonus Section for Unix/Linux Users

Support for searching files using regular expressions has long been built into the Unix OS. There is a
command-line program called grep (the name comes from the ed editor command g/re/p, "globally
search for a regular expression and print") that behaves much like the search() function.

$ grep '^From:' mbox-short.txt

Output:
From: stephen.marquard@uct.ac.za
From: louis@media.berkeley.edu
From: zqian@umich.edu
From: rjlowe@iupui.edu

Note that \S is not part of the POSIX regular expression syntax, so not every grep
implementation supports it. A portable alternative is [^ ], which matches any character other than a space.
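
For readers without a Unix shell, the behaviour of the grep command above can be approximated in Python. This is only a sketch: grep() below is a hypothetical helper written for this illustration, and the sample lines are made up in the style of mbox-short.txt.

```python
import re

def grep(pattern, lines):
    """Hypothetical helper: return the lines that contain a match for pattern."""
    compiled = re.compile(pattern)
    return [ln for ln in lines if compiled.search(ln)]

# Made-up sample lines in the style of mbox-short.txt
sample = [
    'From: stephen.marquard@uct.ac.za',
    'From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008',
    'Subject: hello',
]

hits = grep(r'^From:', sample)

# On these lines, \S and the bracket form [^ ] select the same text
with_S = grep(r'^From: \S+@\S+', sample)
with_bracket = grep(r'^From: [^ ]+@[^ ]+', sample)

print(hits)
print(with_S == with_bracket)
```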
