Lecture 6 Re Basics
• A regular expression (regex) is a special sequence of characters that defines a search
pattern for finding a string or set of strings. It can detect the presence or absence of text
by matching it against a particular pattern, and it can also split a pattern into one or more
sub-patterns.
• Python provides the re module, which supports the use of regex in Python. Its primary
function is re.search(), which takes a regular expression and a string. It returns
either the first match or None.
• A regular expression is a special sequence of characters that helps you match or find
other strings or sets of strings, using a specialized syntax held in a pattern. Regular
expressions are widely used in the UNIX world.
• The Python module re provides full support for Perl-like regular expressions in
Python. The re module raises the exception re.error if an error occurs while
compiling or using a regular expression.
• We will cover two important functions that are used to handle regular
expressions. But a small thing first: various characters have a special
meaning when they are used in a regular expression. To avoid any confusion
while dealing with regular expressions, we will use raw strings, written as r'expression'.
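The points above about raw strings and re.error can be illustrated with a minimal sketch. The sample text and the variable names are assumptions made for illustration only.

```python
import re

text = "a cat sat"  # hypothetical sample text

# In a normal string, "\b" is the backspace character.
# In a raw string it stays as backslash + b, which the regex engine
# reads as a word boundary.
plain_result = re.search("\bcat\b", text)   # looks for backspace characters
raw_result = re.search(r"\bcat\b", text)    # looks for the whole word "cat"

print(plain_result)        # None
print(raw_result.group())  # cat

# An invalid pattern raises re.error when it is compiled.
try:
    re.compile(r"(unclosed")
    compile_failed = False
except re.error:
    compile_failed = True

print(compile_failed)      # True
```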
Regular expressions (called REs, or regexes, or regex patterns) are essentially a tiny, highly
specialized programming language embedded inside Python and made available through the re
module. Using this little language, you specify the rules for the set of possible strings that you
want to match; this set might contain English sentences, or e-mail addresses, or TeX
commands, or anything you like. You can then ask questions such as “Does this string match the
pattern?”, or “Is there a match for the pattern anywhere in this string?”. You can also use REs to
modify a string or to split it apart in various ways.
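The four kinds of question described above map onto four functions of the re module. The following is a minimal sketch; the sample text and the deliberately simplified e-mail pattern are assumptions, not a complete address validator.

```python
import re

text = "Contact: alice@example.com, bob@example.org"  # hypothetical input
pattern = r"\w+@\w+\.\w+"  # simplified e-mail pattern, for illustration only

# "Does this string match the pattern?" (checked at the start of the string)
at_start = re.match(pattern, text)

# "Is there a match for the pattern anywhere in this string?"
anywhere = re.search(pattern, text)

# Modify the string: replace every match.
masked = re.sub(pattern, "<hidden>", text)

# Split the string apart at each comma followed by optional whitespace.
parts = re.split(r",\s*", text)

print(at_start)          # None
print(anywhere.group())  # alice@example.com
print(masked)            # Contact: <hidden>, <hidden>
print(parts)             # ['Contact: alice@example.com', 'bob@example.org']
```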
Regular expression patterns are compiled into a series of bytecodes which are then executed by
a matching engine written in C. For advanced use, it may be necessary to pay careful attention to
how the engine will execute a given RE, and write the RE in a certain way in order to produce
bytecode that runs faster. Optimization isn’t covered in this document, because it requires that
you have a good understanding of the matching engine’s internals.
The regular expression language is relatively small and restricted, so not all possible string
processing tasks can be done using regular expressions. There are also tasks that can be done
with regular expressions, but the expressions turn out to be very complicated. In these cases,
you may be better off writing Python code to do the processing; while Python code will be
slower than an elaborate regular expression, it will also probably be more understandable.
import re
s = "Welcome to Artificial intelligence"
res = re.search(r"\D{3} t",s)
print(res.group())
ome t
# Code gives the starting index and the ending index of the string "characters".
import re
# Example input (assumed here; any string containing "characters" works).
s = "A regular expression is a special sequence of characters"
match = re.search(r'characters', s)
# The r prefix (r'characters') stands for raw.
# A raw string is slightly different from a regular string: it won’t
# interpret the \ character as an escape character.
# This matters because the regular expression engine uses the \ character
# for its own escaping purposes.
print(match)
print('Start Index:', match.start())
print('End Index:', match.end())
print('Span:', match.span())
Match Object
A Match object contains all the information about the search and its result. If no match is
found, None is returned instead. Let’s see some of the commonly used methods and
attributes of the match object.
# matchObj is the value returned by re.match() or re.search()
if matchObj:
    print("matchObj.group() : ", matchObj.group())
    print("matchObj.group(1) : ", matchObj.group(1))
    print("matchObj.group(2) : ", matchObj.group(2))
else:
    print("No match!!")
import re
# res is a Match object returned by a re.search() call that is not shown here
print(res.start())
print(res.end())
print(res.span())
11
15
(11, 15)
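A self-contained sketch of the Match object methods used above is given here. The input line and the pattern are hypothetical and were chosen only so that the match has two groups.

```python
import re

line = "Lecture 6: Regex Basics"             # hypothetical input
match_obj = re.search(r"(\w+) (\d+)", line)  # two groups: a word and a number

if match_obj:
    whole = match_obj.group()    # the entire match
    first = match_obj.group(1)   # text captured by the first group
    second = match_obj.group(2)  # text captured by the second group
    span = match_obj.span()      # (start, end) indices of the entire match
    print(whole, first, second, span)
else:
    print("No match!!")
```

Because re.search() returns None when nothing matches, testing the result before calling group() avoids an AttributeError.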
MetaCharacters Description
• \ Used to drop the special meaning of the character following it.
• [] Represent a character class.
• ^ Matches the beginning.
• $ Matches the end.
• . Matches any character except newline.
• | Means OR (matches any of the patterns separated by it).
• ? Matches zero or one occurrence.
• * Any number of occurrences (including 0 occurrences).
• + One or more occurrences.
• {} Indicate the number of occurrences of a preceding regex to match.
• () Enclose a group of Regex.
\ Backslash
• The backslash (\) makes sure that the character following it is not treated in a special
way. This can be considered a way of escaping metacharacters.
• For example, if you want to search for the dot (.) in a string, you will find that the dot
is treated as a special character because it is one of the metacharacters (as shown in the
table above).
• In this case, we put the backslash (\) just before the dot (.) so that it loses its
special meaning. See the example below for a better understanding.
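# A Python program to demonstrate escaping a metacharacter with re.search().
import re
s = 'Artificial .Intelligence'
# without using \: the dot matches any character, so the first match is 'A'
match = re.search(r'.', s)
print(match)
# using \: the escaped dot matches only the literal '.'
match = re.search(r'\.', s)
print(match)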
[] Square Brackets
• Square brackets ([]) represent a character class consisting of a set of characters
that we wish to match. For example, the character class [abc] will match any single
a, b, or c.
• We can also specify a range of characters using - inside the square brackets. For
example, [0-3] is the same as [0123] and [a-c] is the same as [abc].
• We can also invert the character class using the caret (^) symbol. For example,
[^0-3] matches any character except 0, 1, 2, or 3.
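The three forms of character class (a set, a range, and an inverted class) can be sketched as follows. The sample text is an assumption made for illustration.

```python
import re

text = "abc123xyz"  # hypothetical sample text

only_abc = re.findall(r"[abc]", text)     # any single a, b, or c
digits = re.findall(r"[0-9]", text)       # a range of characters
not_a_to_c = re.findall(r"[^a-c]", text)  # inverted class: anything but a, b, c

print(only_abc)    # ['a', 'b', 'c']
print(digits)      # ['1', '2', '3']
print(not_a_to_c)  # ['1', '2', '3', 'x', 'y', 'z']
```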
^ Caret
• Caret (^) symbol matches the beginning of the string i.e. checks whether the string
starts with the given character(s) or not. For example –
• ^g will check if the string starts with g such as geeks, globe, girl, g, etc.
• ^ge will check if the string starts with ge such as geeks, geeksforgeeks, etc.
$ Dollar
• Dollar ($) symbol matches the end of the string, i.e. checks whether the string ends
with the given character(s) or not. For example –
• s$ will check for the string that ends with s, such as geeks, ends, s, etc.
• ks$ will check for the string that ends with ks such as geeks, geeksforgeeks, ks, etc.
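The caret and dollar anchors can be tried together in a short sketch. The word list is hypothetical and reuses some of the words from the bullets above.

```python
import re

words = ["geeks", "globe", "ends", "apple"]  # hypothetical word list

# ^g : the string must start with g
starts_with_g = [w for w in words if re.search(r"^g", w)]

# s$ : the string must end with s
ends_with_s = [w for w in words if re.search(r"s$", w)]

print(starts_with_g)  # ['geeks', 'globe']
print(ends_with_s)    # ['geeks', 'ends']
```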
. Dot
• Dot (.) symbol matches any single character except the newline character (\n).
For example –
• a.b will check for the string that contains any character at the place of the dot such
as acb, acbd, abbb, etc
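A small sketch of the dot follows. The last two candidate strings are added assumptions that show where a.b does not match: there is no character between a and b in "ab", and the dot does not match a newline by default.

```python
import re

candidates = ["acb", "acbd", "abbb", "ab", "a\nb"]

# Keep only the strings in which a.b is found somewhere.
dot_hits = [s for s in candidates if re.search(r"a.b", s)]

print(dot_hits)  # ['acb', 'acbd', 'abbb']
```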
| Or
• Or symbol works as the or operator meaning it checks whether the pattern before
or after the or symbol is present in the string or not. For example –
• a|b will match any string that contains a or b such as acd, bcd, abcd, etc.
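The or symbol also works between longer alternatives, not only single characters. The following sketch uses a hypothetical sentence and two hypothetical alternatives.

```python
import re

text = "cat, dog, bird, catdog"  # hypothetical sample text

# cat|dog matches either "cat" or "dog" wherever they occur.
either = re.findall(r"cat|dog", text)

print(either)  # ['cat', 'dog', 'cat', 'dog']
```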
? Question Mark
• Question mark (?) checks whether the regex before the question mark occurs
exactly once or not at all. For example –
• ab?c will be matched for the string ac, acb, dabc but will not be matched for abbc
because there are two b. Similarly, it will not be matched for abdc because b is not
followed by c.
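The claims in this example can be checked with a short sketch that searches each of the strings mentioned above. The dictionary name is a choice made for illustration.

```python
import re

pattern = re.compile(r"ab?c")  # a, then an optional b, then c

question_results = {
    s: bool(pattern.search(s))
    for s in ["ac", "acb", "dabc", "abbc", "abdc"]
}

print(question_results)
# {'ac': True, 'acb': True, 'dabc': True, 'abbc': False, 'abdc': False}
```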
* Star
• Star (*) symbol matches zero or more occurrences of the regex preceding the symbol.
For example –
• ab*c will be matched for the string ac, abc, abbbc, dabc, etc. but will not be matched
for abdc because b is not followed by c.
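The same kind of check works for the star, again using the strings from the example above.

```python
import re

pattern = re.compile(r"ab*c")  # a, then zero or more b, then c

star_results = {
    s: bool(pattern.search(s))
    for s in ["ac", "abc", "abbbc", "dabc", "abdc"]
}

print(star_results)
# {'ac': True, 'abc': True, 'abbbc': True, 'dabc': True, 'abdc': False}
```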
+ Plus
• Plus (+) symbol matches one or more occurrences of the regex preceding the +
symbol. For example –
• ab+c will be matched for the string abc, abbc, dabc, but will not be matched for ac,
abdc because there is no b in ac and b is not followed by c in abdc.
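A matching sketch for the plus symbol, using the strings from the example above, shows that at least one b is now required.

```python
import re

pattern = re.compile(r"ab+c")  # a, then one or more b, then c

plus_results = {
    s: bool(pattern.search(s))
    for s in ["abc", "abbc", "dabc", "ac", "abdc"]
}

print(plus_results)
# {'abc': True, 'abbc': True, 'dabc': True, 'ac': False, 'abdc': False}
```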
{m,n} Braces
• Braces match from m to n repetitions (both inclusive) of the preceding regex. Write the
quantifier without a space after the comma; Python treats {2, 4} as literal text. For
example –
• a{2,4} will be matched for the string aaab, baaaac, gaad, but will not be matched for
strings like abc, bc because there is only one a or no a in both the cases.
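The following sketch checks the braces example and also shows one pitfall in Python's re module: a quantifier written with a space after the comma is not treated as a quantifier but as literal text. The variable names are choices made for illustration.

```python
import re

pattern = re.compile(r"a{2,4}")  # between 2 and 4 consecutive a's

brace_results = {}
for s in ["aaab", "baaaac", "gaad", "abc", "bc"]:
    m = pattern.search(s)
    brace_results[s] = m.group() if m else None

print(brace_results)
# {'aaab': 'aaa', 'baaaac': 'aaaa', 'gaad': 'aa', 'abc': None, 'bc': None}

# With a space inside the braces the pattern is read literally.
with_space = re.search(r"a{2, 4}", "aaab")
print(with_space)  # None
```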
() Group
• Group symbol is used to group sub-patterns. For example –
• (a|b)cd will match for strings like acd, abcd, gacd, etc.
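Besides grouping the alternatives, the parentheses also capture the text they matched. A minimal sketch, using two of the strings from the example above:

```python
import re

m1 = re.search(r"(a|b)cd", "gacd")
m2 = re.search(r"(a|b)cd", "abcd")

print(m1.group(), m1.group(1))  # acd a
print(m2.group(), m2.group(1))  # bcd b
```

Here group() returns the whole match, while group(1) returns only the part matched inside the parentheses.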
re.findall()
• Return all non-overlapping matches of pattern in string, as a list of strings. The string is
scanned left-to-right, and matches are returned in the order found.
# A Python program to demonstrate working of
# findall()
import re
['123456789', '987654321']
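A minimal findall() sketch that is consistent with the output shown above is given here. The input string and the pattern \d+ are assumptions, chosen so that the result is the same list of two numbers.

```python
import re

# Hypothetical input containing two runs of digits.
text = "Hello my number is 123456789 and my friend's number is 987654321"

# \d+ matches one or more consecutive digits.
numbers = re.findall(r"\d+", text)

print(numbers)  # ['123456789', '987654321']
```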
re.compile()
• Regular expressions are compiled into pattern objects, which have methods for various
operations such as searching for pattern matches or performing string substitutions.
# Import the regular expression module.
import re
['A', 'y', ',', ' ', 's', 'i', ' ', 'M', 'r', '.', ' ', 'G', 'i', 'n',
's', 'o', 'n', ' ', 'S', 't', 'r', 'k']
['s', 's']
['A', 'y', 'e', ',', ' ', 'a', 'i', 'd', ' ', 'M', 'r', '.', ' ', 'G',
'i', 'b', 'e', 'n', 'o', 'n', ' ', 'S', 't', 'a', 'r', 'k']
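The three output lists above are consistent with the following sketch. The input string is inferred from the lists themselves, and the three character classes are assumptions that happen to reproduce them; the original code may have used different patterns.

```python
import re

text = "Aye, said Mr. Gibenson Stark"  # inferred from the output lists

# Compile each pattern once; the pattern objects can then be reused.
not_a_to_e = re.compile(r"[^a-e]")  # everything except a, b, c, d, e
only_s = re.compile(r"[s]")         # only the lowercase letter s
not_s = re.compile(r"[^s]")         # everything except lowercase s

first = not_a_to_e.findall(text)
second = only_s.findall(text)
third = not_s.findall(text)

print(first)
print(second)  # ['s', 's']
print(third)
```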
# \d matches a decimal digit; for ASCII text it is equivalent to [0-9].
p = re.compile(r'\d')
print(p.findall("I went to him at 11 A.M. on 4th July 1886"))
import re
# \w matches a word character; for ASCII text it is equivalent to [a-zA-Z0-9_].
p = re.compile(r'\w')
print(p.findall("He said * in some_lang."))
['H', 'e', 's', 'a', 'i', 'd', 'i', 'n', 's', 'o', 'm', 'e', '_', 'l',
'a', 'n', 'g']
['I', 'went', 'to', 'him', 'at', '11', 'A', 'M', 'he', 'said', 'in',
'some_language']
[' ', ' ', '*', '*', '*', ' ', ' ', '.']
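The special sequences \d, \w and \W can be compared side by side in one sketch. It reuses the two input strings from the examples above; the patterns \w+ and \W are assumptions about what produced lists of that shape. In Python 3, \d and \w also match non-ASCII digits and letters in str patterns unless the re.ASCII flag is given.

```python
import re

sentence = "I went to him at 11 A.M. on 4th July 1886"
text = "He said * in some_lang."

digits = re.findall(r"\d", sentence)  # single digits
words = re.findall(r"\w+", text)      # runs of word characters
non_word = re.findall(r"\W", text)    # single non-word characters

print(digits)    # ['1', '1', '4', '1', '8', '8', '6']
print(words)     # ['He', 'said', 'in', 'some_lang']
print(non_word)  # [' ', ' ', '*', ' ', ' ', '.']
```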
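import re
# match is assumed to hold the result of an earlier re.search() call
if match is not None:
    print("The regex pattern matches:", match.group())
else:
    print("The regex pattern does not match.")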