A Simple Intro To Regex With Python
Tirthajyoti Sarkar
May 19 · 9 min read
Introduction
https://towardsdatascience.com/a-simple-intro-to-regex-with-python-14d23a34d170 1/18
5/29/2020 A simple intro to Regex with Python - Towards Data Science
Text mining is a hot topic in data science these days. The volume, variety, and
complexity of textual data are increasing at an astounding pace.
As per this article, the global text analytics market was valued at USD 5.46 billion in
2019 and is expected to reach a value of USD 14.84 billion by 2025.
Regular expressions are used to identify whether a pattern exists in a given sequence
of characters (string) or not and also to locate the position of the pattern in a corpus of
text. They help in manipulating textual data, which is often a pre-requisite for data
science projects that involve text analytics.
It is, therefore, important for budding data scientists to have a preliminary knowledge
of this powerful tool for future projects and analysis tasks.
In Python, there is a built-in module called re, which needs to be imported for working
with Regex.
import re
In this short review, we will go through the basics of Regex usage in simple text
processing with some practical examples in Python.
A `compile`d program
Instead of repeating the code, we can use compile to create a regex program and use
built-in methods.
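As a minimal sketch of this idea (the pattern and the word list are illustrative assumptions, not taken from the article), the same check can be written once with the module-level function and once with a compiled program:

```python
import re

# Illustrative pattern and inputs (assumptions for this sketch)
pattern = r"data"
words = ["database", "dataset", "metadata"]

# Without compile: the pattern is passed again on every call
uncompiled = [re.match(pattern, w) is not None for w in words]

# With compile: build the regex program once, then call its methods
prog = re.compile(pattern)
compiled = [prog.match(w) is not None for w in words]

print(uncompiled)
print(compiled)
```

Both lists agree; the compiled form simply avoids restating the pattern at each call site.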
So, compiled programs return special objects, e.g. match objects. But if the pattern does
not match, they return None, and that means we can still run our conditional loop!
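The sketch below illustrates the point with an assumed pattern and two assumed strings. A match object is truthy and None is falsy, so a plain `if` is enough:

```python
import re

prog = re.compile(r"data")  # illustrative pattern (assumption)

hit = prog.match("dataset")    # pattern found at the start: a match object
miss = prog.match("metadata")  # pattern not at the start: None

print(hit)
print(miss)

# The return value can drive a conditional directly
for text in ["dataset", "metadata"]:
    if prog.match(text):
        print(f"{text}: matched")
    else:
        print(f"{text}: no match")
```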
Positional matching
We can easily pass additional parameters to the match method to check for positional
matching of a string pattern.
Above, we notice that once we have created a program prog with the pattern thon, we can
reuse it on as many strings as we like.
Also, note that the pos argument indicates where in the string the matching should
start. For the last two code snippets, we change the starting position and get
different results in terms of the match, although the string is identical.
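A short sketch of positional matching with the thon pattern follows. The test strings are assumptions chosen for illustration, and the starting position is passed as the second positional argument of the compiled program's match method:

```python
import re

prog = re.compile(r"thon")

# match() anchors at the starting position, which defaults to 0
at_start = prog.match("Python")      # text starts with "Py", not "thon": None
from_two = prog.match("Python", 2)   # "thon" begins at index 2

# Same string, different starting positions, different outcomes
m_pos2 = prog.match("Marathon", 2)   # index 2 starts "rath": None
m_pos4 = prog.match("Marathon", 4)   # "thon" begins at index 4

print(at_start, from_two)
print(m_pos2, m_pos4)
```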
Beyond fixed strings like these, a regex can match very complex patterns. But to see
such advanced examples, let’s first explore the search method.
Note how the match method returns None (because we did not specify the proper
starting position of the pattern in the text), but the search method finds the position of
the match (by scanning through the text).
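The contrast can be sketched as follows, using an assumed pattern and sentence. The span() method of the match object reports where the match was found:

```python
import re

prog = re.compile(r"ing")   # illustrative pattern (assumption)
text = "Spring is coming"   # illustrative text (assumption)

m = prog.match(text)    # "ing" is not at index 0, so this is None
s = prog.search(text)   # scans forward and stops at the first "ing"

print(m)
print(s.span())              # (start, end) indices of the first match
print(text[s.start():s.end()])
```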
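Naturally, we can use the span() method of the match object, returned by search, to
locate the position of the matched pattern.
The search method, however, only finds the first match in the text. To discover all
the matches, we can use the findall and finditer methods.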
The findall method returns a list with the matching pattern. You can count the
number of items to understand the frequency of the searched term in the text.
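A minimal sketch of counting a term with findall, using a made-up sentence:

```python
import re

prog = re.compile(r"regex")
# Illustrative text (assumption)
text = "regex is short for regular expression; a regex engine runs a regex"

matches = prog.findall(text)   # list of every non-overlapping match
frequency = len(matches)       # how often the term occurs

print(matches)
print(frequency)
```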
The finditer method produces an iterator. We can use this to see more information,
as shown below.
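One way this might look, with an assumed pattern and sentence, is to loop over the iterator and read the matched text and its position from each match object:

```python
import re

prog = re.compile(r"ing")   # illustrative pattern (assumption)
text = "Spring is coming"   # illustrative text (assumption)

# finditer yields one match object per occurrence, in order
for m in prog.finditer(text):
    print(m.group(), m.span())

# The same information collected into a list
spans = [m.span() for m in prog.finditer(text)]
```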
Here are various examples. We will also apply the group() method on the object
returned by search to return the matched string.
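The sketch below shows group() on a few search results. The text and all three patterns are assumptions made for illustration and are not necessarily the ones the article used:

```python
import re

text = "Order 66 shipped on May 19"  # made-up text

# Each pattern below is an illustrative assumption
digits = re.search(r"\d+", text)         # first run of digits
word = re.search(r"[A-Z][a-z]+", text)   # first capitalised word
date = re.search(r"May \d+", text)       # literal month followed by digits

# group() returns the substring that the pattern matched
print(digits.group())
print(word.group())
print(date.group())
```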
Here is an example.
Start of a string
The ^ (caret) matches a pattern at the beginning of a string (but not anywhere else).
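A small sketch with assumed strings. Note that search is used here, so it is the caret, not the method, that pins the pattern to the start:

```python
import re

prog = re.compile(r"^India")  # illustrative pattern (assumption)

at_start = prog.search("India is my country")  # pattern at the beginning
elsewhere = prog.search("I live in India")     # pattern present, but not at the start

print(at_start)
print(elsewhere)
```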
End of a string
The $ (dollar sign) matches a pattern at the end of the string. Following is a practical
example where we are only interested in pulling out the patent information of Apple
and discarding other companies. We check the end of the text for ‘Apple’ and, only if it
matches, we pull out the patent number using the numerical digit matching code we
showed earlier.
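A sketch of that workflow, assuming a handful of made-up news lines (the wording and patent numbers are hypothetical):

```python
import re

# Hypothetical news snippets; the numbers are made up for illustration
news = [
    "Patent no 123456 belongs to Microsoft",
    "Patent no 672890 belongs to Apple",
    "Patent no 451234 belongs to Google",
]

ends_with_apple = re.compile(r"Apple$")  # 'Apple' at the very end of the line
digits = re.compile(r"\d+")              # a run of numerical digits

apple_patents = []
for line in news:
    # Only look for a patent number when the line ends with 'Apple'
    if ends_with_apple.search(line):
        apple_patents.append(digits.search(line).group())

print(apple_patents)
```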
Let’s suppose we want to extract an email id. We put in a pattern-matching regex with
alphabetical characters + @ + .com. But it cannot catch an email id with some
numerical digits in it.
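A sketch of this first attempt, with hypothetical addresses and one plausible spelling of the alphabetic-only pattern:

```python
import re

# Alphabetic characters + '@' + alphabetic characters + '.com' (assumed spelling)
prog = re.compile(r"[a-zA-Z]+@[a-zA-Z]+\.com")

plain = prog.search("My email is coolguy@xyz.com")          # letters only: found
with_digits = prog.search("My email is coolguy12@xyz.com")  # digits before '@': missed

print(plain.group())
print(with_digits)
```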
So, we expand the regex a little bit. But we are only extracting email ids with the
domain name ‘.com’. So, it cannot catch the emails with other domains.
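The expanded pattern might look like the following sketch (hypothetical addresses again). Digits are now allowed before the @, but the domain is still fixed to .com:

```python
import re

# Letters or digits before '@'; domain still hard-coded to '.com' (assumed spelling)
prog = re.compile(r"[a-zA-Z0-9]+@[a-zA-Z]+\.com")

dot_com = prog.search("My email is coolguy12@xyz.com")  # now found
dot_org = prog.search("My email is coolguy12@xyz.org")  # other domain: missed

print(dot_com.group())
print(dot_org)
```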
It is quite easy to expand on that, but clever manipulation of an email address may
prevent extraction with such a regex.
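As a rough sketch, one possible expansion accepts any alphabetic domain suffix. The obfuscated address below is a made-up example of the kind of manipulation that still slips through:

```python
import re

# Any alphabetic suffix after the dot (one assumed way to generalise the domain)
prog = re.compile(r"[a-zA-Z0-9]+@[a-zA-Z]+\.[a-zA-Z]+")

dot_org = prog.search("My email is coolguy12@xyz.org")            # found
obfuscated = prog.search("My email is coolguy12 at xyz dot org")  # no '@' or '.': missed

print(dot_org.group())
print(obfuscated)
```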
For example, if we are interested in finding phone numbers with the ‘312’ area code, the
following code fails to extract it from the second string.
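One hypothetical way such a failure can arise is a pattern that hard-codes the hyphen as the separator. The pattern and both strings below are assumptions made for this sketch:

```python
import re

# Hyphen hard-coded as the only separator (assumed pattern)
prog = re.compile(r"312-\d{3}-\d{4}")

first = prog.search("My phone number is 312-423-3456")   # hyphens: found
second = prog.search("My phone number is 312.423.3456")  # dots: missed

print(first.group())
print(second)
```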
A combined example
Now, we show an example of extracting valid phone numbers from a text using
findall() and the multi-character matching tricks we learned so far.
Note that a valid phone number with 312 area code is of the pattern 312-xxx-xxxx or
312.xxx.xxxx.
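A minimal sketch of such an extraction, assuming a made-up text. The pattern spells out the two accepted forms with alternation, so a mixed form such as 312-xxx.xxxx would not be accepted:

```python
import re

# Hypothetical text; all phone numbers are made up
text = ("Call 312-423-3456 or 312.555.0199 for Chicago, "
        "212-555-0123 for New York, and ignore 312-5555.")

# Either 312-xxx-xxxx or 312.xxx.xxxx
prog = re.compile(r"312-\d{3}-\d{4}|312\.\d{3}\.\d{4}")

# With no capturing groups, findall returns the full matched strings
numbers = prog.findall(text)
print(numbers)
```

This sketch does not guard against a longer digit string that merely contains a valid-looking number; word boundaries could be added if that matters.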
Summary
We reviewed the essentials of defining Regex objects and search patterns with Python
and how to use them for extracting patterns from a text corpus.
Regex is a vast topic, almost a small programming language in itself.
Readers, particularly those who are interested in text analytics, are encouraged to
explore this topic further through other authoritative sources. Here are a few links:
medium.com
codeburst.io
medium.com
. . .
Also, you can check the author’s GitHub repositories for code, ideas, and
resources in machine learning and data science. If you are, like me, passionate
about AI/machine learning/data science, please feel free to add me on LinkedIn or
follow me on Twitter.
www.linkedin.com