
17_Regular Expression

The document provides an overview of Regular Expressions (RegEx), explaining their purpose, syntax, and functions in Python's re module. Key functions such as match(), search(), findall(), split(), and sub() are discussed, along with their differences and applications in text processing. Understanding RegEx is essential for data scientists and developers for efficient data mining and text manipulation.

Uploaded by

Arif Ahmad
Copyright
© All Rights Reserved

Regular Expressions

Learning objectives
• What is a Regular Expression?
• Metacharacters
• match() function
• search() function
• re.match() vs re.search()
• findall() function
• split() function
• sub() function
What is a Regular Expression?
• A regular expression (RegEx) is a special sequence of characters that
helps you match or find other strings or sets of strings, using a
specialized syntax held in a pattern.

• Regular expressions are widely used in the UNIX world.

• The module re provides full support for regular expressions in Python.


• The module re raises the exception re.error if an error occurs while
compiling or using a regular expression.
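
A minimal sketch of how re.error can be caught when a pattern fails to compile. The helper name is_valid_pattern is hypothetical and not part of the re module; it is only used here for illustration.

```python
import re

def is_valid_pattern(pattern):
    """Hypothetical helper: True if `pattern` compiles, False if re.error is raised."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        # Raised by the re module for a malformed pattern
        return False

print(is_valid_pattern(r"\d+"))   # a well-formed pattern
print(is_valid_pattern(r"(abc"))  # unbalanced parenthesis
```

Checking a pattern this way can be useful when the pattern comes from user input rather than from the program itself.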
What is a Regular Expression?
• As a data scientist or developer, a solid understanding of
regex makes data mining and text mining tasks much easier.

• It is extremely useful for extracting information from text such as
files, logs, spreadsheets, or even documents.

• When using regular expressions, the first thing to recognize is
that everything is essentially a character, and we are writing
patterns to match a specific sequence of characters, also referred
to as a string.
What is a Regular Expression?
• For instance, a regular expression could tell a program to search
for specific text in a string and then print out the result
accordingly.

• Regular expressions (regex) are essentially text patterns that you
can use to automate searching through and replacing elements
within strings of text.

• This can make cleaning and working with text-based data sets
much easier, saving you the trouble of having to search through
mountains of text by hand.
Metacharacters
• Metacharacters are characters that carry a special meaning inside a
pattern. They are the building blocks of regular expressions and are
used with the functions of the re module.
• The re module supports many metacharacters:

\ : Gives a special meaning to the character following it, or escapes a metacharacter
[] : Represents a character class
^ : Matches the beginning of the string
$ : Matches the end of the string
. : Matches any character except newline
Metacharacters
? : Matches zero or one occurrence of the preceding RE
| : Matches either of the REs separated by it (alternation)
* : Matches any number of occurrences of the preceding RE (including zero)
+ : Matches one or more occurrences of the preceding RE
{} : Indicates the number of occurrences of the preceding RE to match
() : Encloses a group of REs
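
The metacharacters listed above can be tried out with a short sketch like the following. The sample strings are made up for illustration, and re.findall() is used only because it shows every match at once.

```python
import re

text = "cat bat rat 42 cats"  # made-up sample text

char_class = re.findall(r"[cb]at", text)   # [] : 'c' or 'b', followed by 'at'
at_start = re.findall(r"^cat", text)       # ^  : only at the beginning of the string
at_end = re.findall(r"cats$", text)        # $  : only at the end of the string
any_char = re.findall(r".at", text)        # .  : any character, followed by 'at'
optional = re.findall(r"cats?", text)      # ?  : the 's' may occur zero or one time
either = re.findall(r"bat|rat", text)      # |  : 'bat' or 'rat'
digits = re.findall(r"\d+", text)          # \d is a special sequence, + means one or more
exact = re.findall(r"\d{2}", text)         # {} : exactly two digits
star = re.findall(r"ca*t", "ct cat caat")  # *  : zero or more 'a' characters
groups = re.search(r"(\w+) (\w+)", text).groups()  # () : two captured groups

for name, result in [("[]", char_class), ("^", at_start), ("$", at_end),
                     (".", any_char), ("?", optional), ("|", either),
                     ("\\d+", digits), ("{}", exact), ("*", star), ("()", groups)]:
    print(name, result)
```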
match() function
• This function checks for a match only at the beginning of the string. It
attempts to match a regular expression pattern against the string, with
optional flags.

• Here is the syntax for this function:

re.match(pattern, string, flags=0)

• Here is the description of the parameters:

• pattern: The regular expression to be matched.
• string: The string that is searched for a match to the pattern
at its beginning.
• flags (optional): Flags that modify the behavior of the regex. They can be a
combination of various constants, such as re.IGNORECASE.
match() function
• The re.match function returns a match object on success and None on failure.
• Example: Simple example of the match() function.
import re
line = "Learning Data Science"
matchObj = re.match(r'(.*) Data', line)
print(matchObj)
• Here the r prefix (as in r'portal') stands for raw, not regex. A raw string
differs slightly from a regular string: Python does not interpret the \ character
in it as an escape character.
• This matters because the regular expression engine uses the \ character for its
own escaping purposes.
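
As a small sketch building on the example above, the match object can be inspected with its group() and span() methods, and the effect of the r prefix can be seen by comparing string lengths. The variable names are chosen for illustration only.

```python
import re

line = "Learning Data Science"
match_obj = re.match(r"(.*) Data", line)

if match_obj:
    whole = match_obj.group()    # the full text that matched
    first = match_obj.group(1)   # the text captured by the first ( ) group
    span = match_obj.span()      # (start, end) positions of the match
    print(whole, "|", first, "|", span)

# re.match() only looks at the beginning, so "Data" alone does not match here
no_match = re.match(r"Data", line)
print(no_match)

# In a regular string "\n" is one newline character;
# in a raw string r"\n" is two characters: a backslash and the letter n
print(len("\n"), len(r"\n"))
```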
search() function
• The search() function searches the string for a match and returns a
Match object if there is one.
• If there is more than one match, only the first occurrence is
returned.
• It searches for the first occurrence of the RE pattern within the string,
with optional flags.
• Here is the syntax for this function: re.search(pattern, string, flags=0)
• pattern: The regular expression pattern to search for.
• string: The input string in which to search for the pattern.
• flags (optional): Flags that modify the behavior of the regex. They can be a
combination of various constants, such as re.IGNORECASE.
Example: search for the exact word "apple" as a whole word in the text.
import re
# Input string
text = "I have an apple and a pineapple."
# Search for the pattern in the string; \b marks a word boundary,
# so the "apple" inside "pineapple" is not matched
match = re.search(r'\bapple\b', text)
# Check if a match is found
if match:
    print("Pattern found:", match.group())
    print("Start index:", match.start())
    print("End index:", match.end())
else:
    print("Pattern not found.")
search() function
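# Search the string to see if it starts with "The" and ends with "Spain":
import re
txt = "The rain in Spain"
x = re.search("^The.*Spain$", txt)
print(x)  # a Match object, because the pattern matches

# Example: search for a word that does not occur in the string
import re
txt = "The rain in Spain"
x = re.search("Portugal", txt)
print(x)  # None, because there is no match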
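
The flags parameter described for search() can be illustrated with a short sketch. It reuses the sample string from the examples above; the lowercase pattern "spain" is an assumption made to show the effect of re.IGNORECASE.

```python
import re

txt = "The rain in Spain"

# Without a flag the search is case-sensitive, so lowercase "spain" is not found
without_flag = re.search(r"spain", txt)

# With re.IGNORECASE the same pattern matches "Spain"
with_flag = re.search(r"spain", txt, re.IGNORECASE)

print(without_flag)
if with_flag:
    print(with_flag.group(), with_flag.start(), with_flag.end())

# Several flags can be combined with the | operator
combined = re.search(r"^the", txt, re.IGNORECASE | re.MULTILINE)
print(combined.group())
```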
re.match() vs re.search()
• The two functions differ in where they look for a match.

• re.match() checks for a match only at the beginning of the
string.
• re.search() searches for a match anywhere in the string.

• Return value: if the pattern is found at the beginning of the string,
re.match() returns a match object; otherwise, it returns None.

• re.search() returns a match object if the pattern is found
anywhere in the string; otherwise, it returns None.
re.match() vs re.search()
import re

Substring = 'Science'
String = 'You are learning Data Science with Python Programming.'

# Use of re.search() method: finds "Science" in the middle of the string
print(re.search(Substring, String, re.IGNORECASE))

# Use of re.match() method: prints None, because the string does not start with "Science"
print(re.match(Substring, String, re.IGNORECASE))
The findall() function
• The findall() function returns a list containing all matches. The list
contains the matches in the order they are found. If no matches
are found, an empty list is returned.
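
A minimal sketch of findall(), using the same sample string as the search() examples. The patterns are chosen for illustration only.

```python
import re

txt = "The rain in Spain"

# Every non-overlapping occurrence of "ai", in the order found
all_ai = re.findall("ai", txt)
print(all_ai)

# No match at all gives an empty list, not None
none_found = re.findall("Portugal", txt)
print(none_found)

# When the pattern contains groups, findall() returns the groups
# (as tuples when there is more than one group) instead of the whole match
around_ai = re.findall(r"(\w)ai(\w)", txt)
print(around_ai)
```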
The split() function
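The split() function returns a list where the string has been split at
each match:
Example:

import re
txt = "The rain in Spain"
# Split at each whitespace character
x = re.split(r"\s", txt)
print(x)
# The built-in str.split() gives the same result for this string
print(txt.split())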
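
As a further sketch, re.split() also accepts a maxsplit argument that limits the number of splits, and the pattern can describe more than one kind of separator. The second sample string is made up for illustration.

```python
import re

txt = "The rain in Spain"

# Split only at the first whitespace character
first_split = re.split(r"\s", txt, maxsplit=1)
print(first_split)

# Split at a comma or semicolon, followed by optional whitespace
parts = re.split(r"[,;]\s*", "a, b;c")
print(parts)
```

A pattern such as the second one is where re.split() goes beyond str.split(), which only splits on a single fixed separator or on runs of whitespace.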
import re
# Load the text file and read it line by line
file_path = "zen_of_python.txt"
with open(file_path, 'r') as file:
    content = file.readlines()

# Define a regular expression to find lines containing the word "better"
# re.IGNORECASE makes the search case-insensitive.
pattern = re.compile(r'\bbetter\b', re.IGNORECASE)

# The search method checks if the pattern exists in each line.
matching_lines = [line.strip() for line in content if pattern.search(line)]

# Print matching lines
print("Lines containing the word 'better':")
for line in matching_lines:
    print(line)
import re
file_path = "zen_of_python.txt"
with open(file_path, 'r') as file:
    content = file.read()

# Define a regular expression to find all 5-letter words
# \w{5} matches exactly 5 word characters (letters, digits, or
# underscores); \b on each side keeps longer words from matching.
pattern = re.compile(r'\b\w{5}\b')

# findall() method returns all non-overlapping matches of the pattern in a
# list.
five_letter_words = pattern.findall(content)

print("Five-letter words in the text:")
print(five_letter_words)
print(len(five_letter_words))
import re
file_path = "zen_of_python.txt"
with open(file_path, 'r') as file:
    content = file.read()

# Define a regular expression to find the word "than"
pattern = r'\bthan\b'

# re.sub() replaces all occurrences of the pattern with "then".
updated_content = re.sub(pattern, 'then', content)

# The updated text is written to a new file to preserve the original.
output_file_path = "zen_of_python_updated.txt"
with open(output_file_path, 'w') as file:
    file.write(updated_content)

print(f"Replaced 'than' with 'then' in the text. Updated file saved to {output_file_path}.")
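
The same replacement can be sketched on an in-memory string, without a file. The sample text is an assumption standing in for the file contents; the count argument and re.subn() are shown as optional variations.

```python
import re

# Sample text used in place of the file contents (an assumption for illustration)
text = "Beautiful is better than ugly. Explicit is better than implicit."
pattern = r"\bthan\b"

# Replace every occurrence
replaced_all = re.sub(pattern, "then", text)
print(replaced_all)

# count=1 limits the replacement to the first occurrence
replaced_first = re.sub(pattern, "then", text, count=1)
print(replaced_first)

# re.subn() also reports how many replacements were made
new_text, n_replacements = re.subn(pattern, "then", text)
print(n_replacements)
```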
You should now have learned:
• What is a Regular Expression?
• Metacharacters
• match() function
• search() function
• re.match() vs re.search()
• findall() function
• split() function
• sub() function
