Unit-3 - Regular Expression
We often need to extract specific information from a body of data. For example, we may
want to know how many people contacted us through Gmail in the last month, or the phone
numbers of the employees in a company whose names start with 'A', or the dates of birth of
the patients who were admitted to a hospital for treatment of hypertension, etc.
To get such information, we first have to search the data. Once we find the required
information, we have to extract it for further use. Regular expressions are useful for
performing such operations on data.
Regular Expressions
A regular expression is a string that contains special symbols and characters, used to find
and extract the information we need from the given data.
A regular expression helps us to search, match, find and split information as per our
requirements.
A regular expression is also simply called a regex. Regular expressions are available not
only in Python but also in many other languages like Java, Perl, AWK, etc.
Python provides the re module for this purpose; re stands for regular expressions.
This module contains functions like compile(), search(), match(), findall(), split(), etc.,
which are used to find information in the available data.
So, when we write a regular expression, we should import the re module as:
import re
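Regular expressions are nothing but strings containing characters and special symbols.
A simple regular expression may look like this:
reg = r'm\w\w'
Meaning: the letter m followed by any two word characters (letters, digits or underscore),
that is, a three-character string starting with m. The r prefix makes it a raw string, so
the backslashes reach the re module unchanged.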
Some of the special sequences beginning with '\' represent predefined sets of characters
that are often useful, such as the set of digits, the set of letters, or the set of anything that
isn’t whitespace.
The following predefined special sequences are a subset of those available.
\d
Matches any decimal digit; this is equivalent to the class [0-9].
\D
Matches any non-digit character; this is equivalent to the class [^0-9].
\s
Matches any whitespace character; this is equivalent to the class [ \t\n\r\f\v].
\S
Matches any non-whitespace character; this is equivalent to the class [^ \t\n\r\f\v].
\w
Matches any alphanumeric character; this is equivalent to the class [a-zA-Z0-9_].
\W
Matches any non-alphanumeric character; this is equivalent to the class [^a-zA-Z0-9_].
\A
Matches only at the start of the string.
\b
Matches the empty string, but only at the beginning or end of a word.
\B
Matches the empty string, but only when it is not at the beginning or end of a word. This
means that r'py\B' matches 'python', 'py3', 'py2', but not 'py', 'py.', or 'py!'. \B is just the
opposite of \b.
\Z
Matches only at the end of the string.
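The special sequences above can be tried out with a short sketch like the one below. The sample string and the variable names are made up for this illustration; only re.findall() and re.split() from the re module are used.

```python
import re

# Made-up sample text for illustration only.
sample = "Room 42, py3 and python; call py!"

digits = re.findall(r'\d+', sample)        # runs of decimal digits
words = re.findall(r'\w+', sample)         # runs of word characters
pieces = re.split(r'\s+', sample)          # split wherever whitespace occurs
py_inside = re.findall(r'py\B', sample)    # 'py' followed by another word character
py_whole = re.findall(r'\bpy\b', sample)   # 'py' standing alone as a word

print(digits)      # ['42', '3']
print(words)       # ['Room', '42', 'py3', 'and', 'python', 'call', 'py']
print(pieces)      # ['Room', '42,', 'py3', 'and', 'python;', 'call', 'py!']
print(py_inside)   # ['py', 'py']  (from 'py3' and 'python')
print(py_whole)    # ['py']        (from 'py!')
```

Note that \s+ splits only on whitespace, so the punctuation stays attached to the pieces, while \w+ skips the punctuation entirely.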
Special Characters and Pattern Matching
Python allows you to specify a pattern to match when searching for a substring contained in a
string. This pattern is known as a regular expression. For example, if you want to find a
North American-style phone number in a string, you can define a pattern consisting of three
digits, followed by a dash (-), and followed by four more digits.
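The phone number pattern described above could be written roughly as follows. This is only a sketch: the sample message and its numbers are invented, and the {n} notation (exactly n repetitions) together with \b is one possible way to express "three digits, a dash, four digits".

```python
import re

# Invented sample text; the numbers are not real.
message = "Call 555-1234 before noon, or 555-9876 after lunch."

# Three digits, a dash, then four digits, with word boundaries on both sides.
phone_pattern = re.compile(r'\b\d{3}-\d{4}\b')
numbers = phone_pattern.findall(message)

print(numbers)   # ['555-1234', '555-9876']
```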
In a regular expression, certain characters have special meanings. The table below lists these
special characters and their meanings.
If you use multiple special characters in a pattern, you can use parentheses () to specify the
order in which the characters are to be interpreted.
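To see the effect of parentheses, the sketch below compares the pattern ab+ with (ab)+ on a made-up string. It uses re.finditer() rather than findall(), because findall() returns only the group contents when the pattern contains parentheses.

```python
import re

# Made-up sample text for illustration only.
text = "ab abab abbb"

# Without parentheses, + applies only to the character just before it (b).
no_group = [m.group() for m in re.finditer(r'ab+', text)]

# With parentheses, + applies to the whole group "ab".
with_group = [m.group() for m in re.finditer(r'(ab)+', text)]

print(no_group)     # ['ab', 'ab', 'ab', 'abbb']
print(with_group)   # ['ab', 'abab', 'ab']
```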
The following methods of the re module are used with regular expressions:
The match() method looks for a match only at the beginning of the string. If a match is
found, it returns a match object that contains the matched string; otherwise it returns
None. We can access the string from the returned object using the group() method.
The search() method scans the string from beginning to end and returns a match object for
the first occurrence of the pattern; otherwise it returns None. We can use the group()
method to retrieve the string from the object returned by this method.
The findall() method scans the string from beginning to end and returns all
occurrences of the matching string in the form of a list object. If no matching strings
are found, it returns an empty list. We can retrieve the resultant strings
from the list using a for loop.
The split() method splits the string wherever the regular expression matches, and the
resultant pieces are returned as a list. If the pattern is not found anywhere, the list
contains the original string as its only element. We can retrieve the resultant string
pieces from the list using a for loop.
The sub() method substitutes (or replaces) new strings in place of the matching strings.
After substitution, the modified string is returned by this method; the original string is
not changed.
import re
reg = r'm\w\w'
str = 'cat mat man mom'
prog = re.compile(reg)
res = prog.search(str)
print(res.group())   # prints mat, the first match only
# search() returns only the first match; to get all the matches, use findall() instead
import re
str = 'cat mat man mom'
prog = re.compile(r'm\w\w')
res = prog.findall(str)
print("findall() method returns a list")
print(res)   # prints the matches as a list
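# match() returns the matched string only if it is at the start of the string; otherwise it returns None
import re
str = 'mat man mom'
res = re.match(r'm\w\w', str)    # matches 'mat' at the start of the string
res1 = re.match(r'n\w\w', str)   # no match: the string does not start with n
print(res.group())
print(res1)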
Program 4: A Python program to create a regular expression to split a string into pieces
wherever one or more non-alphanumeric characters are found.
# split() method
import re
str = "this: is; python's book"
res = re.split(r'\W+', str)
print("Split at any character which is not alphanumeric")
print(res)
# RE can also be used to find and replace words. The sub() method is used for this.
# Syntax: sub(regular expression, new string, string)
import re
str = "Today is sunday"
res = re.sub(r'sunday', 'Monday', str)
print(res)
Repetition
Things get more interesting when you use +, * and ? to specify repetition in the pattern:
+ -- 1 or more occurrences of the pattern to its left, e.g. 'i+' = one or more i's
* -- 0 or more occurrences of the pattern to its left
? -- match 0 or 1 occurrences of the pattern to its left
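The three repetition symbols can be compared side by side with a small sketch. The sample string is made up so that the letter i appears zero, one, two and three times between p and t.

```python
import re

# Made-up sample text for illustration only.
text = "pt pit piit piiit"

one_or_more = re.findall(r'pi+t', text)    # + : at least one i
zero_or_more = re.findall(r'pi*t', text)   # * : any number of i's, including none
zero_or_one = re.findall(r'pi?t', text)    # ? : no i or exactly one i

print(one_or_more)    # ['pit', 'piit', 'piiit']
print(zero_or_more)   # ['pt', 'pit', 'piit', 'piiit']
print(zero_or_one)    # ['pt', 'pit']
```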
Program 6: A Python program to create a regular expression to retrieve all the words
starting with a given string.
for r in res:
    print(r)
# But there is a problem in the output: 'ay' is not a word, yet it still appears in the
# output. To solve this, use \b.
# Using \b
res = re.findall(r'\ba[\w]*\b', str)
print("\nUsing \\b")
for r in res:
    print(r)
Program 7: A Python program to create a regular expression to retrieve all the words
starting with a given string.
import re
Str1 = '1 2 3 one 1st two three four five six seven 8eight 9nine 10ten as on in'
print("String is : " + Str1)
Output:
import re
str1 = "hello world"
print(str1)
print("\nSearch at the start using r'^He'")
res = re.search(r'^He', str1, re.IGNORECASE)
if res:
    print("String starts with 'He'")
else:
    print("String does not start with 'He'")
res = re.search(r'World$', str1)
if res:
    print("String ends with 'world'")
else:
    print("String does not end with 'world'")
# For 'World' the else part is executed, because re.IGNORECASE is not used in this search
Output:
Using RE on Files
We can use RE not only on strings, but also on files, where a large amount of data is available.
The following programs illustrate how we can apply RE to a file.
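As a minimal sketch of the idea, the code below reads a text file line by line and collects every email-like string with findall(). The pattern is a deliberately simplified assumption and does not cover every valid address; the function name find_emails, the sample addresses and the temporary file are all invented for this illustration.

```python
import os
import re
import tempfile

# Assumed, simplified pattern: name characters, '@', then a dotted domain name.
EMAIL_PATTERN = re.compile(r'[\w.+-]+@[\w.-]+\.\w+')

def find_emails(path):
    """Return every email-like string found in the text file at path."""
    found = []
    with open(path, 'r') as f:
        for line in f:                      # process the file one line at a time
            found.extend(EMAIL_PATTERN.findall(line))
    return found

# Create a small throwaway file so that the sketch can run on its own.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write("Contact: asha@example.com\n")
    tmp.write("No address on this line\n")
    tmp.write("Backup: ravi.k@mail.example.org, office\n")
    sample_path = tmp.name

emails = find_emails(sample_path)
os.remove(sample_path)                      # clean up the temporary file

print(emails)   # ['asha@example.com', 'ravi.k@mail.example.org']
```

Reading the file line by line keeps memory use small even when the file is large, since only one line is held at a time.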
Program 8: Python Program to create RE that reads email-ids from a text file.
Output:
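Program 9: Python program to retrieve data from a file using RE and then write the data to
another file.
Program:
import re
# Raw string, so that the backslashes in the Windows path are not treated as escapes
f1 = open(r'E:\Varsha\LJIET\Python\Notes\salary.txt', 'r')
f2 = open('newfile.txt', 'w')
print("ID\tSalary")
for line in f1:
    res1 = re.search(r'\d{4}', line)           # first run of 4 digits: the ID
    res2 = re.search(r'\d{4,}\.\d{2}', line)   # 4 or more digits, a literal dot, 2 digits: the salary
    if res1 and res2:                          # skip lines that do not contain both values
        print(res1.group() + "\t" + res2.group())
        f2.write(res1.group() + "\t" + res2.group() + "\n")
f1.close()
f2.close()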
Output:
The re.match() function returns a match object on success and None on failure. We use the
group(num) or groups() method of the match object to get the matched expression.
No.  Match Object Method   Description
1    group(num=0)          Returns the entire match (or the specific subgroup num).
2    groups()              Returns all matching subgroups in a tuple (empty if there weren't any).
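The difference between group() and groups() can be seen in a short self-contained sketch. The record format "name: age" and the sample value are assumptions chosen only for this illustration.

```python
import re

# Assumed record format "name: age"; the value is made up.
record = "Asha: 31"

# Two parenthesised subgroups: the name and the age.
match_obj = re.match(r'(\w+): (\d+)', record)

if match_obj:
    whole = match_obj.group()         # entire match
    name = match_obj.group(1)         # first subgroup
    age = match_obj.group(2)          # second subgroup
    all_groups = match_obj.groups()   # all subgroups as a tuple
    print(whole)        # Asha: 31
    print(name)         # Asha
    print(age)          # 31
    print(all_groups)   # ('Asha', '31')
else:
    print("No match!!")
```

Subgroups are numbered from 1 in the order of their opening parentheses; group(0), or simply group(), always refers to the whole match.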
Program: Demonstrate the use of the group() and groups() methods with re.match():
import re
if matchObj:
    print("matchObj.group() : ", matchObj.group())
    print("matchObj.group(1) : ", matchObj.group(1))
    print("matchObj.group(2) : ", matchObj.group(2))
else:
    print("No match!!")
Output:
if searchObj:
    print("searchObj.group() : ", searchObj.group())
    print("searchObj.group(1) : ", searchObj.group(1))
    print("searchObj.group(2) : ", searchObj.group(2))
else:
    print("Nothing found!!")
Output:
File: ReAllFunctionDemo.py
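import re
data = """ljiet32
careerljiet32
selenium"""
k1 = re.findall(r"^\w", data)                 # ^ matches only at the start of the whole string
k2 = re.findall(r"^\w", data, re.MULTILINE)   # with re.MULTILINE, ^ also matches at the start of each line
print(k1)   # ['l']
print(k2)   # ['l', 'c', 's']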
Output: