REGEX in Data Analytics
Study Notes
Data analytics is done using various tools and techniques such as data slicing,
manipulation, mining, and algorithmic processing. The most popular
languages used for running these analytic applications are
Python and R.
We shall discuss these techniques one by one. The first one is
REGEX.
What is REGEX:
REGEX stands for Regular Expressions. A regular expression is a search pattern with which we can
find or replace specific text in a large data set.
In a large data set, it typically helps by locating strings that follow a specific pattern.
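For example, suppose you want to find data where the name of a person starts with the letter ‘i’ and
ends with ‘e’. We can apply the following pattern in the search, where ^ anchors the start of the
string, .* allows any characters in between, and $ anchors the end:
search("^i.*e$", txt)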
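As a hedged illustration, the sketch below assumes Python's built-in re module (the notes do not name the library) and a made-up list of names. It uses the anchored pattern ^i.*e$ to keep only the names that start with 'i' and end with 'e'. A bare i*e would instead match any text containing an 'e', because * means "zero or more of the preceding character".

```python
import re

# Hypothetical sample data, for illustration only.
names = ["irene", "isabelle", "ian", "eddie", "ie"]

# ^i  : the string must start with "i"
# .*  : any characters (zero or more) in between
# e$  : the string must end with "e"
pattern = r"^i.*e$"

matches = [name for name in names if re.search(pattern, name)]
print(matches)

# For comparison: "i*e" means "zero or more i, then e", so it also
# matches names that merely contain an "e", such as "eddie".
loose = [name for name in names if re.search("i*e", name)]
print(loose)
```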
REGEX Functions:
REGEX has different functions apart from the search function. They are as follows:
Functions Description
These functions help in doing the necessary REGEX search and replace operations.
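Assuming the notes refer to Python's built-in re module, the following minimal sketch shows four of its commonly used functions: findall, search, split and sub. The sample sentence is an assumption chosen for illustration.

```python
import re

txt = "The rain in Spain"  # hypothetical sample text

# findall: returns a list of every non-overlapping match
all_ai = re.findall("ai", txt)

# search: returns a Match object for the first match, or None if there is none
first_space = re.search(r"\s", txt)
first_space_pos = first_space.start()

# split: returns the pieces of the string, split at each match
words = re.split(r"\s", txt)

# sub: returns a new string with each match replaced
replaced = re.sub(r"\s", "_", txt)

print(all_ai, first_space_pos, words, replaced)
```

Here search stops at the first match, while findall, split and sub act on every match in the string.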
Metacharacters in REGEX:
These metacharacters act as building blocks in search patterns. By combining these
characters, a complex search pattern can be formed to extract data from any type of text data set.
Character  Description                                                                  Example
\d         Shows a match where the string contains digits (numbers from 0-9).           "\d"
\D         Shows a match where the string DOES NOT contain digits.                      "\D"
\s         Shows a match where the string contains a white space character.             "\s"
\S         Shows a match where the string DOES NOT contain a white space character.     "\S"
\w         Shows a match where the string contains any word characters (letters a to z
           and A to Z, digits from 0-9, and the underscore _ character).                "\w"
\W         Shows a match where the string DOES NOT contain any word characters.         "\W"
\Z         Shows a match if the specified characters are at the end of the string.      "Spain\Z"
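The character classes above can be tried out with a short sketch. It assumes Python's built-in re module, and the sample sentence is made up for illustration.

```python
import re

txt = "Order 66 shipped to Spain"  # hypothetical sample text

digits = re.findall(r"\d", txt)          # every digit character
non_digits = re.findall(r"\D", txt)      # every character that is not a digit
spaces = re.findall(r"\s", txt)          # every white space character
non_spaces = re.findall(r"\S", txt)      # every character that is not white space
word_runs = re.findall(r"\w+", txt)      # runs of letters, digits and underscores
non_word = re.findall(r"\W", txt)        # every non-word character (here: the spaces)

# \Z matches only at the very end of the string
ends_with_spain = re.search(r"Spain\Z", txt) is not None

print(digits, word_runs, ends_with_spain)
```

Writing the patterns as raw strings (the r prefix) keeps Python from interpreting the backslashes before the regex engine sees them.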
Sets in REGEX:
A set is a group of characters placed inside square brackets [ ]; the pattern matches any one of them.
Set         Description
[arn]       Shows a match where one of the specified characters (a, r, or n) is present.
[a-n]       Shows a match for any lower case character, alphabetically between a and n.
[0123]      Shows a match where any of the specified digits (0, 1, 2, or 3) is present.
[0-5][0-9]  Shows a match for any two-digit number from 00 to 59.
[a-zA-Z]    Shows a match for any character alphabetically between a and z, lower case OR upper case.
[+]         In sets, the characters +, *, ., |, (), $, {} have no special meaning, so [+] means:
            show a match for any + character in the string.
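Each set listed above can be checked with a small sketch. It assumes Python's built-in re module, and the short input strings are made up for illustration.

```python
import re

# [arn]: any one of the characters a, r or n
arn = re.findall(r"[arn]", "rain")

# [a-n]: any lower case letter between a and n
a_to_n = re.findall(r"[a-n]", "Spain")

# [0123]: any one of the digits 0, 1, 2 or 3
low_digits = re.findall(r"[0123]", "2049")

# [0-5][0-9]: two-digit numbers from 00 to 59
two_digit = re.findall(r"[0-5][0-9]", "08:45 and 61")

# [a-zA-Z]: any letter, lower case or upper case
letters = re.findall(r"[a-zA-Z]", "R2-D2")

# [+]: inside a set, + is a literal plus sign, not a quantifier
plus_signs = re.findall(r"[+]", "3+4+5")

print(arn, a_to_n, low_digits, two_digit, letters, plus_signs)
```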