
REGEX in Data Analytics

Uploaded by

yashkshatriya108

RegEx

Study Notes


Introduction to Data Analytics

Data Analytics, often discussed under the broader name Data Science, is one of
the fastest-growing fields today. As technology advances and more users adopt
it, service providers receive huge volumes of data that must be processed and
understood in order to improve their services.

This is done using various tools and techniques such as data slicing,
manipulation, mining and algorithmic processing. The most popular
languages for running these analytic applications are Python and R.

We shall discuss these techniques one by one. The first is REGEX.

What is REGEX:
REGEX stands for Regular Expressions. A regular expression is a search pattern with which
we can find or replace specific text in a large data set.

The concept was introduced in the 1950s by the American mathematician Stephen Cole Kleene.

In a large set of data, it helps by finding the strings that match a specific pattern.

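For example, suppose you want to find data where the name of a person starts with the
letter ‘i’ and ends with ‘e’. In Python's re module, the following search can be applied:

re.search("^i.*e$", txt)

Here ^ anchors the match to the start of the string, .* allows any number of characters in
between, and $ anchors it to the end. (The pattern "i*e" would instead match zero or more
‘i’ characters followed by an ‘e’ anywhere in the string.)

How the results are displayed depends on the analyst's code.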
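
As a minimal sketch of the search described above, the snippet below filters a list of names with Python's re module. The list `names` is made-up sample data, and the re.IGNORECASE flag is an assumption added so that capitalised names such as "Isabelle" also match.

```python
import re

# Made-up sample data, for illustration only
names = ["irene", "Isabelle", "ivan", "marie", "ie"]

# ^i   : the name must start with "i"
# .*   : any number of characters in between
# e$   : the name must end with "e"
pattern = re.compile(r"^i.*e$", re.IGNORECASE)

matches = [name for name in names if pattern.search(name)]
print(matches)  # ['irene', 'Isabelle', 'ie']

# Without the anchors, "i*e" matches an "e" anywhere, so "marie" is found too
loose = re.search(r"i*e", "marie")
print(loose is not None)  # True
```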



REGEX Functions:
Python's re module offers several functions besides search. The main ones are as follows:

Function   Description

findall    Returns a list containing all the matches
search     Returns a match object if there is a match anywhere in the string
split      Returns a list where the string has been split at each match
sub        Replaces one or many matches with a string

These functions help in doing the necessary REGEX search and replace operations.
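
The four functions can be tried side by side. This is a small sketch using Python's re module; the sample sentence in `txt` is an assumption chosen for illustration.

```python
import re

txt = "The rain in Spain"  # sample text, for illustration only

# findall: list of every match
all_ai = re.findall(r"ai", txt)
print(all_ai)  # ['ai', 'ai']

# search: match object for the first match (None if there is no match)
first_space = re.search(r"\s", txt)
print(first_space.start())  # 3

# split: list of pieces, split at each match
parts = re.split(r"\s", txt)
print(parts)  # ['The', 'rain', 'in', 'Spain']

# sub: replace every match with another string
replaced = re.sub(r"\s", "_", txt)
print(replaced)  # The_rain_in_Spain
```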

Metacharacters in REGEX:

Character  Description                                                           Example

[]         A set of characters                                                   "[a-m]"
\          Signals a special sequence (also used to escape special characters)   "\d"
.          Any single character except a newline (one dot per character)         "he..o"
^          The string starts with                                                "^hello"
$          The string ends with                                                  "world$"
*          Zero or more occurrences of the preceding character                   "aix*"
+          One or more occurrences of the preceding character                    "aix+"
{}         Exactly the specified number of occurrences of the preceding item     "al{2}"
|          Either or                                                             "falls|stays"
()         Capture and group

These metacharacters act as helpers in the search commands. With the help of these
characters a complex search parameter can be formed to get data from any type of data set.
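
The snippet below is a sketch that runs the example patterns from the table with Python's re module. All input strings are made-up, and the grouping pattern `(\d+)-(\d+)` is an illustrative assumption, since the table gives no example for ( ).

```python
import re

# [] : any one character from the set a to m
in_set = re.findall(r"[a-m]", "hello")           # ['h', 'e', 'l', 'l']

# .  : exactly one character per dot
dots = re.search(r"he..o", "hello world")        # matches 'hello'

# ^ and $ : anchors for the start and end of the string
starts = re.search(r"^hello", "hello world") is not None   # True
ends = re.search(r"world$", "hello world") is not None     # True

# * : zero or more 'x';  + : one or more 'x'
star = re.findall(r"aix*", "ai aix aixx")        # ['ai', 'aix', 'aixx']
plus = re.findall(r"aix+", "ai aix aixx")        # ['aix', 'aixx']

# {} : exactly two 'l' characters after 'a'
exact = re.findall(r"al{2}", "al all alll")      # ['all', 'all']

# | : either alternative
either = re.findall(r"falls|stays", "The rain falls and stays")  # ['falls', 'stays']

# () : capture groups (hypothetical pattern for illustration)
groups = re.search(r"(\d+)-(\d+)", "pages 10-25").groups()       # ('10', '25')

print(in_set, dots.group(), starts, ends, star, plus, exact, either, groups)
```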

Special Sequences in REGEX:


A special sequence is a ‘\’ followed by a character. Each sequence has a special meaning,
as described in the list.

Character  Description                                                                Example

\A         Matches if the specified characters are at the beginning of the string     "\AThe"
\b         Matches where the specified characters are at the beginning or at the      r"\bain"
           end of a word                                                              r"ain\b"
\B         Matches where the specified characters are present, but NOT at the         r"\Bain"
           beginning (or at the end) of a word                                        r"ain\B"
\d         Matches any digit (0-9)                                                    "\d"
\D         Matches any character that is NOT a digit                                  "\D"
\s         Matches any white space character                                          "\s"
\S         Matches any character that is NOT a white space character                  "\S"
\w         Matches any word character (letters a to z and A to Z, digits 0-9,         "\w"
           and the underscore _ character)
\W         Matches any character that is NOT a word character                         "\W"
\Z         Matches if the specified characters are at the end of the string           "Spain\Z"
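
The sequences in the table can be checked with a short sketch in Python's re module. The sample strings are assumptions for illustration. Raw strings (the r prefix) are used so that Python does not treat sequences such as \b as its own escape characters.

```python
import re

txt = "The rain in Spain"  # sample text, for illustration only

at_start = re.findall(r"\AThe", txt)       # ['The']
at_end = re.findall(r"Spain\Z", txt)       # ['Spain']

# "ain" never begins a word here, but it ends both "rain" and "Spain"
word_begin = re.findall(r"\bain", txt)     # []
word_end = re.findall(r"ain\b", txt)       # ['ain', 'ain']
not_begin = re.findall(r"\Bain", txt)      # ['ain', 'ain']
not_end = re.findall(r"ain\B", txt)        # []

digits = re.findall(r"\d", "Room 42")      # ['4', '2']
spaces = re.findall(r"\s", txt)            # [' ', ' ', ' ']
words = re.findall(r"\w+", "a_b c1!")      # ['a_b', 'c1']
non_words = re.findall(r"\W", "a_b c1!")   # [' ', '!']

print(at_start, at_end, word_begin, word_end, not_begin, not_end)
print(digits, spaces, words, non_words)
```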

Sets in REGEX:
A set is a group of characters placed inside [ ] brackets. It matches any one of those characters.

Set         Description

[arn]       Matches where one of the specified characters (a, r, or n) is present
[a-n]       Matches any lower case character, alphabetically between a and n
[^arn]      Matches any character EXCEPT a, r, and n
[0123]      Matches where any of the specified digits (0, 1, 2, or 3) is present
[0-9]       Matches any digit between 0 and 9
[0-5][0-9]  Matches any two-digit number from 00 to 59
[a-zA-Z]    Matches any character alphabetically between a and z, lower case OR upper case
[+]         In sets, the characters +, *, ., |, (), $, {} have no special meaning, so [+]
            matches any + character in the string
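
A few of the sets above, tried out in a short sketch with Python's re module. The input strings are made-up examples.

```python
import re

# [arn] : any of the characters a, r, n
any_arn = re.findall(r"[arn]", "The rain")        # ['r', 'a', 'n']

# [^arn] : any character except a, r, n
not_arn = re.findall(r"[^arn]", "rain")           # ['i']

# [0-5][0-9] : two-digit numbers from 00 to 59 ("99" is rejected)
two_digit = re.findall(r"[0-5][0-9]", "08:45:99") # ['08', '45']

# [a-zA-Z] : letters of either case
letters = re.findall(r"[a-zA-Z]", "R2-d2")        # ['R', 'd']

# [+] : inside a set, + is a literal plus sign
plus_signs = re.findall(r"[+]", "1+1=2")          # ['+']

print(any_arn, not_arn, two_digit, letters, plus_signs)
```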
