
Lecture 6 Re Basics

Uploaded by

mudassirsabri45
Copyright
© © All Rights Reserved

Regular Expressions (RegEx)

• A regular expression is a special sequence of characters that defines a search pattern
for finding a string or set of strings. It can detect the presence or absence of text by
matching it against a particular pattern, and it can also split a pattern into one or more
sub-patterns.

• Python provides the re module to support regular expressions. Its primary
function is re.search(), which takes a regular expression and a string and returns
either the first match or None.

• A regular expression is a special sequence of characters that helps you match or find
other strings or sets of strings, using a specialized syntax held in a pattern. Regular
expressions are widely used in the UNIX world.

• The Python module re provides full support for Perl-like regular expressions in
Python. The re module raises the exception re.error if an error occurs while
compiling or using a regular expression.

• We will cover the main functions used to handle regular expressions. But a small
thing first: various characters have a special meaning when they are used in a
regular expression. To avoid confusion when dealing with regular expressions, we
will write patterns as raw strings, r'expression'.

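
The effect of the raw-string prefix can be seen in a short sketch. The variable names below are illustrative only; the sample sentence is the one used elsewhere in these notes.

```python
import re

s = "Welcome to Artificial intelligence"

# In a normal string literal, "\b" is the single backspace character.
plain = "\bArti"
# In a raw string literal the backslash is kept, so the regex engine sees \b.
raw = r"\bArti"

print(len(plain), len(raw))

plain_match = re.search(plain, s)   # looks for a backspace followed by "Arti"
raw_match = re.search(raw, s)       # looks for "Arti" at a word boundary

print(plain_match)
print(raw_match.group())
```

Without the r prefix the pattern silently means something else, which is why the examples in these notes use raw strings.
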
Regular expressions (called REs, or regexes, or regex patterns) are essentially a tiny, highly
specialized programming language embedded inside Python and made available through the re
module. Using this little language, you specify the rules for the set of possible strings that you
want to match; this set might contain English sentences, or e-mail addresses, or TeX
commands, or anything you like. You can then ask questions such as “Does this string match the
pattern?”, or “Is there a match for the pattern anywhere in this string?”. You can also use REs to
modify a string or to split it apart in various ways.
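
As a rough illustration of the questions listed above, the following sketch uses four functions from the re module. The e-mail pattern is a deliberately simplified assumption and not a real validator, and the sample text is made up.

```python
import re

text = "Contact: alice@example.com, bob@example.org"
# Simplified pattern for illustration; real e-mail validation is more involved.
email = r"[\w.]+@[\w.]+"

# "Does this string match the pattern?" (the whole string must match)
whole = re.fullmatch(email, "alice@example.com") is not None

# "Is there a match for the pattern anywhere in this string?"
anywhere = re.search(email, text) is not None

# Modify a string: replace every match.
masked = re.sub(email, "<hidden>", text)

# Split a string apart on a pattern (a comma followed by optional whitespace).
parts = re.split(r",\s*", text)

print(whole, anywhere)
print(masked)
print(parts)
```
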

Regular expression patterns are compiled into a series of bytecodes which are then executed by
a matching engine written in C. For advanced use, it may be necessary to pay careful attention to
how the engine will execute a given RE, and write the RE in a certain way in order to produce
bytecode that runs faster. Optimization isn’t covered in this document, because it requires that
you have a good understanding of the matching engine’s internals.

The regular expression language is relatively small and restricted, so not all possible string
processing tasks can be done using regular expressions. There are also tasks that can be done
with regular expressions, but the expressions turn out to be very complicated. In these cases,
you may be better off writing Python code to do the processing; while Python code will be
slower than an elaborate regular expression, it will also probably be more understandable.

import re
s = "Welcome to Artificial intelligence"
res = re.search(r"\D{3} t",s)
print(res.group())

ome t

# This code gives the starting and ending index of the substring "characters".
import re

s = ('a special sequence of characters that uses a characters search '
     'pattern to find a string or set of strings')

match = re.search(r'characters', s)
# The r prefix in r'characters' marks a raw string.
# A raw string differs slightly from a regular string: Python does not
# interpret the \ character in it as an escape character.
# This matters because the regular expression engine uses the \ character
# for its own escaping.

print(match)
print('Start Index:', match.start())
print('End Index:', match.end())
print('span Index:', match.span())

<re.Match object; span=(22, 32), match='characters'>
Start Index: 22
End Index: 32
span Index: (22, 32)

Match Object
A Match object contains all the information about the search and its result. If no match is
found, None is returned instead. Let's look at some commonly used methods and attributes of
the match object.
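
Because a failed search returns None rather than a Match object, calling a method such as group() on the result without checking raises an AttributeError. A minimal sketch of the usual guard, with illustrative variable names:

```python
import re

s = "Welcome to Artificial intelligence"

found = re.search(r"Arti\w+", s)    # matches "Artificial"
missing = re.search(r"Python", s)   # no match, so this is None

for label, m in (("found", found), ("missing", missing)):
    if m is None:
        print(label, "-> no match")
    else:
        print(label, "->", m.group(), m.span())
```
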

Getting the string and the regex

The match.re attribute returns the regular expression object that produced the match, and
the match.string attribute returns the string that was passed in.
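
A short sketch of these two attributes, reusing the sample sentence from these notes. Note that match.re is a compiled pattern object, and its pattern attribute holds the pattern text.

```python
import re

s = "Welcome to Artificial intelligence"
m = re.search(r"\bArti", s)

print(m.re)            # the compiled pattern object
print(m.re.pattern)    # the pattern text that was passed to search()
print(m.string)        # the string that was searched
```
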

Getting index of matched object


• start() method returns the starting index of the matched substring
• end() method returns the ending index of the matched substring, which is the index just
past its last character
• span() method returns a tuple containing the starting and the ending index of the
matched substring

#!/usr/bin/python
import re

string = "Cats are smarter than dogs"

matchObj = re.match(r'(.*) are (.*?) .*', string, re.M | re.I)

if matchObj:
    print("matchObj.group() : ", matchObj.group())
    print("matchObj.group(1) : ", matchObj.group(1))
    print("matchObj.group(2) : ", matchObj.group(2))
else:
    print("No match!!")

matchObj.group() : Cats are smarter than dogs
matchObj.group(1) : Cats
matchObj.group(2) : smarter

import re

s = "Welcome to Artificial intelligence"

# here res is the match object
res = re.search(r"\bArti", s)

print(res.start())
print(res.end())
print(res.span())

11
15
(11, 15)

MetaCharacters Description
• \ Used to drop the special meaning of the character following it.
• [] Represents a character class.
• ^ Matches the beginning.
• $ Matches the end.
• . Matches any character except newline.
• | Means OR (matches any of the patterns separated by it).
• ? Matches zero or one occurrence.
• * Matches any number of occurrences (including 0 occurrences).
• + Matches one or more occurrences.
• {} Indicates the number of occurrences of the preceding regex to match.
• () Encloses a group of regex.

\ Backslash
• The backslash (\) makes sure that the character following it is not treated in a special
way. This can be considered a way of escaping metacharacters.
• For example, if you search for the dot (.) in a string, the dot will be treated as a
special character, because it is one of the metacharacters (as shown in the table above).
• In this case, we put the backslash (\) just before the dot (.) so that it loses its
special meaning. See the example below for a better understanding.

# A Python program to demonstrate escaping a metacharacter with re.search().
import re
s = 'Artificial .Intelligence'

# without using \
match = re.search(r'.', s)
print(match)

# using \
match = re.search(r'\.', s)
print(match)

<re.Match object; span=(0, 1), match='A'>
<re.Match object; span=(11, 12), match='.'>

import re

phone = "2004-959-559 # This is Phone Number"

# Delete Python-style comments


num = re.sub(r'#.*$', "", phone)
print("Phone Num : ", num)

# Remove anything other than digits


num = re.sub(r'\D', "", phone)
print("Phone Num : ", num)

Phone Num :  2004-959-559
Phone Num :  2004959559

import re

string = '39801 356, 2102 1111'

# Three-digit number followed by a space followed by a two-digit number
pattern = r'(\d{3}) (\d{2})'

# The match variable contains a Match object, or None if nothing matched.
match = re.search(pattern, string)
if match:
    print(match.group())
else:
    print("pattern not found")

801 35

[] Square Brackets
• Square brackets ([]) represent a character class consisting of a set of characters
that we wish to match. For example, the character class [abc] will match any single
a, b, or c.

• We can also specify a range of characters using a hyphen (-) inside the square
brackets. For example,

• [0-3] is the same as [0123]

• [a-c] is the same as [abc]

• We can also invert the character class using the caret (^) symbol. For example,

• [^0-3] means any character except 0, 1, 2, or 3

• [^a-c] means any character except a, b, or c
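
The character classes above can be tried out with re.findall(). This is a small sketch; the sample string is made up for illustration.

```python
import re

s = "cab 2024 x-ray"

set_matches = re.findall(r"[abc]", s)       # any single a, b, or c
range_matches = re.findall(r"[a-c]", s)     # same class written as a range
digit_matches = re.findall(r"[0-3]", s)     # only the digits 0 to 3
inverted = re.findall(r"[^0-3]", "2024")    # anything except 0 to 3

print(set_matches)
print(range_matches)
print(digit_matches)
print(inverted)
```
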

^ Caret
• Caret (^) symbol matches the beginning of the string i.e. checks whether the string
starts with the given character(s) or not. For example –

• ^g will check if the string starts with g such as geeks, globe, girl, g, etc.

• ^ge will check if the string starts with ge such as geeks, geeksforgeeks, etc.

$ Dollar
• Dollar ($) symbol matches the end of the string, i.e. checks whether the string ends
with the given character(s) or not. For example –

• s$ will check for the string that ends with s such as geeks, ends, s, etc.
• ks$ will check for the string that ends with ks such as geeks, geeksforgeeks, ks, etc.
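
A brief sketch of the ^ and $ anchors, filtering a made-up word list with re.search():

```python
import re

words = ["geeks", "globe", "forge", "ends"]

# ^g : the string must start with g
starts_with_g = [w for w in words if re.search(r"^g", w)]
# s$ : the string must end with s
ends_with_s = [w for w in words if re.search(r"s$", w)]

print(starts_with_g)
print(ends_with_s)
```
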

. Dot
• Dot(.) symbol matches only a single character except for the newline character (\n).
For example –

• a.b will check for the string that contains any character at the place of the dot such
as acb, acbd, abbb, etc

• .. will check if the string contains at least 2 characters
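
The newline exception can be checked directly. The sketch below also uses the re.DOTALL flag, which is not covered in these notes; it makes the dot match newlines as well. The sample string is made up.

```python
import re

s = "acb a-b a\nb abbb"

# By default the dot does not match "\n", so "a\nb" is skipped.
default = re.findall(r"a.b", s)
# With re.DOTALL the dot matches any character, including "\n".
dotall = re.findall(r"a.b", s, re.DOTALL)

print(default)
print(dotall)
```
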

| Or
• Or symbol works as the or operator meaning it checks whether the pattern before
or after the or symbol is present in the string or not. For example –

• a|b will match any string that contains a or b such as acd, bcd, abcd, etc.
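
A small sketch of alternation. The first part checks the strings listed above plus one made-up string with neither letter; the second shows that whole words can be alternatives too.

```python
import re

# a|b matches wherever an a or a b occurs.
results = {w: bool(re.search(r"a|b", w)) for w in ["acd", "bcd", "abcd", "xyz"]}
print(results)

# Alternatives can be longer than one character.
animals = re.findall(r"cat|dog", "cats and dogs and birds")
print(animals)
```
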

? Question Mark
• Question mark (?) checks whether the regex before the question mark occurs exactly
once or not at all. For example –

• ab?c will be matched for the strings ac, acb, dabc but will not be matched for abbc
because there are two b's. Similarly, it will not be matched for abdc because b is not
followed by c.

* Star
• Star (*) symbol matches zero or more occurrences of the regex preceding the symbol.
For example –

• ab*c will be matched for the string ac, abc, abbbc, dabc, etc. but will not be matched
for abdc because b is not followed by c.

+ Plus
• Plus (+) symbol matches one or more occurrences of the regex preceding the +
symbol. For example –

• ab+c will be matched for the string abc, abbc, dabc, but will not be matched for ac,
abdc because there is no b in ac and b is not followed by c in abdc.

{m,n} Braces
• Braces match from m to n repetitions (both inclusive) of the preceding regex. Write the
bounds without a space after the comma; with a space, Python treats the braces as
literal characters. For example –

• a{2,4} will be matched for the strings aaab, baaaac, gaad, but will not be matched for
strings like abc, bc because there is only one a or no a in those cases.
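
The four quantifiers (?, *, + and braces) can be compared side by side. In this sketch, found() is a hypothetical helper written for the example, not part of the re module.

```python
import re

def found(pattern, text):
    """Hypothetical helper: True if pattern matches anywhere in text."""
    return re.search(pattern, text) is not None

# ? : zero or one b between a and c
optional = [found(r"ab?c", t) for t in ("ac", "abc", "abbc")]
# * : zero or more b's
star = [found(r"ab*c", t) for t in ("ac", "abbbc", "abdc")]
# + : one or more b's
plus = [found(r"ab+c", t) for t in ("abbc", "ac")]
# {2,4} : between two and four a's in a row
braces = re.findall(r"a{2,4}", "a aa aaaaa")

print(optional, star, plus, braces)
```

In the last line, the run of five a's yields one match of four, because the quantifier is greedy, and the single a that remains is too short to match.
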

() Group
• Group symbol is used to group sub-patterns. For example –

• (a|b)cd will match for strings like acd, abcd, gacd, etc.
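
Besides grouping, parentheses also capture what they matched, which can be read back with group(). A minimal sketch using two of the strings listed above:

```python
import re

m1 = re.search(r"(a|b)cd", "gacd")
m2 = re.search(r"(a|b)cd", "abcd")

# group(0) is the whole match; group(1) is what the parentheses captured.
print(m1.group(0), m1.group(1))
print(m2.group(0), m2.group(1))
```
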

Special Sequence Description Examples

• \A Matches only at the start of the string. (Pattern \Afor matches in: for geeks, for the world)
• \b Matches at the beginning or end of a word. \b(string) checks for the beginning of a
word and (string)\b checks for the end of a word. (Pattern \bge matches in: geeks, get)
• \B The opposite of \b: matches only where there is no word boundary, so the given text
must not start or end a word. (Pattern \Bge matches in: together, forge)
• \d Matches any decimal digit; for ASCII text this is equivalent to the class [0-9].
(Pattern \d matches in: 123, gee1)
• \D Matches any non-digit character; for ASCII text this is equivalent to the class [^0-9].
(Pattern \D matches in: geeks, geek1)
• \s Matches any whitespace character. (Pattern \s matches in: gee ks, a bc a)
• \S Matches any non-whitespace character. (Pattern \S matches in: a bd, abcd)
• \w Matches any word character (letter, digit, or underscore); for ASCII text this is
equivalent to the class [a-zA-Z0-9_]. (Pattern \w matches in: 123, geeKs4)
• \W Matches any non-word character. (Pattern \W matches in: >$, gee<>)
• \Z Matches only at the end of the string. (Pattern ab\Z matches in: abcdab, abababab)
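
A few of these sequences in action. This is a sketch with made-up sample strings built from the example words above.

```python
import re

s = "geeks get together"

word_start = re.findall(r"\bge", s)   # "ge" at the start of a word
inside = re.findall(r"\Bge", s)       # "ge" not at the start of a word
at_start = re.search(r"\Afor", "for geeks") is not None
not_at_start = re.search(r"\Afor", "geeks for") is not None
at_end = re.search(r"ab\Z", "abcdab").span()
digits = re.findall(r"\d+", "gee1 123")
chunks = re.findall(r"\S+", "a bd  abcd")

print(word_start, inside)
print(at_start, not_at_start, at_end)
print(digits, chunks)
```
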

re.findall()
• Return all non-overlapping matches of pattern in string, as a list of strings. The string is
scanned left-to-right, and matches are returned in the order found.

# A Python program to demonstrate working of
# findall()
import re

# A sample text string where the regular expression
# is searched.
string = """Hello my Number is 123456789 and
my friend's number is 987654321"""

# A sample regular expression to find digits.
# \d matches any decimal digit; this is equivalent to the set class [0-9].
regex = r'\d+'

match = re.findall(regex, string)
print(match)

# This example is contributed by Ayush Saluja.

['123456789', '987654321']

re.compile()
• Regular expressions are compiled into pattern objects, which have methods for various
operations such as searching for pattern matches or performing string substitutions.

# The regular expression module is imported.
import re

# compile() creates a regular expression pattern object.
# The character class [a-e] is equivalent to [abcde];
# the leading ^ negates it, so [^a-e] matches any single
# character other than 'a', 'b', 'c', 'd', 'e'.
p = re.compile('[^a-e]')

# findall() searches for the regular expression
# and returns a list of all matches.
print(p.findall("Aye, said Mr. Gibenson Stark"))

['A', 'y', ',', ' ', 's', 'i', ' ', 'M', 'r', '.', ' ', 'G', 'i', 'n',
's', 'o', 'n', ' ', 'S', 't', 'r', 'k']

# The regular expression module is imported.
import re

# compile() creates a regular expression pattern object.
# The character class [$s] matches either '$' or 's'.
# Inside a character class, $ is a literal dollar sign,
# not the end-of-string anchor.
p = re.compile('[$s]')

# findall() searches for the regular expression
# and returns a list of all matches.
print(p.findall("Aye, said Mr. Gibenson Stark"))

['s', 's']

# The regular expression module is imported.
import re

# compile() creates a regular expression pattern object.
# The negated character class [^s] matches any single
# character except lowercase 's'.
p = re.compile('[^s]')

# findall() searches for the regular expression
# and returns a list of all matches.
print(p.findall("Aye, said Mr. Gibenson Stark"))

['A', 'y', 'e', ',', ' ', 'a', 'i', 'd', ' ', 'M', 'r', '.', ' ', 'G',
'i', 'b', 'e', 'n', 'o', 'n', ' ', 'S', 't', 'a', 'r', 'k']

Understanding the Output:

• With the plain class [a-e] (no caret), findall() on the same string returns
['e', 'a', 'd', 'b', 'e', 'a']; the negated class [^a-e] returns every other character.
• For [a-e], the first occurrence is ‘e’ in “Aye” and not ‘A’, because matching is case sensitive.
• The next occurrence is ‘a’ in “said”, then ‘d’ in “said”, followed by ‘b’ and ‘e’ in “Gibenson”;
the last ‘a’ matches in “Stark”.
• The metacharacter backslash ‘\’ has a very important role, as it signals various sequences. If
the backslash is to be used without its special meaning as a metacharacter, use ‘\\’.

Example 2: The set class [\s,.] will match any
whitespace character, ‘,’, or ‘.’.

import re

# For ASCII text, \d is equivalent to [0-9].
p = re.compile(r'\d')
print(p.findall("I went to him at 11 A.M. on 4th July 1886"))

# \d+ matches a run of one or more digits.
p = re.compile(r'\d+')
print(p.findall("I went to him at 11 A.M. on 4th July 1886"))

['1', '1', '4', '1', '8', '8', '6']
['11', '4', '1886']

import re

# For ASCII text, \w is equivalent to [a-zA-Z0-9_].
p = re.compile(r'\w')
print(p.findall("He said * in some_lang."))

# \w+ matches a run of one or more word characters.
p = re.compile(r'\w+')
print(p.findall("I went to him at 11 A.M., he \
said *** in some_language."))

# \W matches non-word characters.
p = re.compile(r'\W')
print(p.findall("he said *** in some_language."))

['H', 'e', 's', 'a', 'i', 'd', 'i', 'n', 's', 'o', 'm', 'e', '_', 'l',
'a', 'n', 'g']
['I', 'went', 'to', 'him', 'at', '11', 'A', 'M', 'he', 'said', 'in',
'some_language']
[' ', ' ', '*', '*', '*', ' ', ' ', '.']

import re

# '*' matches zero or more occurrences
# of the preceding character.
p = re.compile('ab*')
print(p.findall("ababbaabbb"))

['ab', 'abb', 'a', 'abbb']


Understanding the Output:
• Our RE is ab*, which means ‘a’ followed by any number of ‘b’s, starting from 0.
• Output ‘ab’, is valid because of single ‘a’ accompanied by single ‘b’.
• Output ‘abb’, is valid because of single ‘a’ accompanied by 2 ‘b’.
• Output ‘a’, is valid because of single ‘a’ accompanied by 0 ‘b’.
• Output ‘abbb’, is valid because of single ‘a’ accompanied by 3 ‘b’.

# A Python program to demonstrate working of re.search().
import re

# Let's use a regular expression to match a date string
# in the form of month name followed by day number.
regex = r"([a-zA-Z]+) (\d+)"

match = re.search(regex, "I was born on June 24")

if match is not None:
    # We reach here when the expression "([a-zA-Z]+) (\d+)"
    # matches the date string.

    # This will print [14, 21), since it matches at index 14
    # and ends at 21.
    print("Match at index %s, %s" % (match.start(), match.end()))

    # We use the group() method to get all the matches and
    # captured groups. The groups contain the matched values.
    # In particular:
    # match.group(0) always returns the fully matched string
    # match.group(1), match.group(2), ... return the capture
    # groups in order from left to right in the input string
    # match.group() is equivalent to match.group(0)

    # So this will print "June 24"
    print("Full match: %s" % (match.group(0)))

    # So this will print "June"
    print("Month: %s" % (match.group(1)))

    # So this will print "24"
    print("Day: %s" % (match.group(2)))

else:
    print("The regex pattern does not match.")

Match at index 14, 21
Full match: June 24
Month: June
Day: 24
