4 Data Transformation Using Pandas
In this series, we will cover the basics of data analysis using Python. The lessons
will build up gradually until they form a concrete analytical mindset for students.
This lesson covers the essentials of Pandas for processing tabular data.
What is Pandas?
Pandas is an open-source Python package that is widely used for data science, data analysis,
and machine learning tasks. It is built on top of NumPy (which we covered previously), the
package that provides support for multi-dimensional arrays. As one of the most popular data-wrangling
packages, Pandas works well with many other data science modules inside the Python ecosystem,
and is typically included in every Python distribution.
Pandas makes it simple to do many of the time-consuming, repetitive tasks associated with
working with data, including:
Data cleansing
Data fill
Data normalization
Merges and joins
Data visualization
Statistical analysis
Data inspection
Loading and saving data
In fact, with Pandas you can do everything that makes world-leading data scientists vote Pandas
the best data analysis and manipulation tool available.
The main strength of pandas is its basic data structure, the DataFrame : every tabular
dataset is stored directly as a DataFrame , which eases the transformation and
manipulation of the data.
Also, because as analysts we will mostly be interested in columns and rows, pandas stores every
column and every row as a Series , which likewise eases transformation and manipulation at the
column and row level.
# Create a dict
fruits_dct = { 'Oranges': [3, 2, 0, 1],
'Apples': [0, 3, 7, 2]
}
Install pandas if it's not installed (the Kaggle environment has it installed by default).
Import the package into the current working environment.
Every package in Python is nothing but some code files published online for anyone to use or edit.
Importing the package means that you are grabbing those functions and code into the current
runtime environment, which allows you to use any of the package's functions, methods, or attributes.
To import pandas you need to use the following line.
import pandas as pd
The term import pandas is self-explanatory, but the term as pd is just a convention:
an agreement among people who use pandas to abbreviate it as pd .
# import statement
import pandas as pd
# build a DataFrame from the dict (this cell was dropped from the export)
fruits_df = pd.DataFrame(fruits_dct)
# let's print it
fruits_df
Oranges Apples
0 3 0
1 2 3
2 0 7
3 1 2
print(type(fruits_df))
<class 'pandas.core.frame.DataFrame'>
As you can see, the DataFrame is a neat and clean representation of tabular data, which
constitutes roughly 80% of the data we will deal with in general as analysts.
Before diving deep into pandas , we need to get familiar with the structure of every table of
data you will see.
Structure of a table
As mentioned before, 80% of the data we will see is tabular, so it's crucial to know its structure
deeply.
Index is like an address: it's how any data point across the dataframe or series can be
accessed.
Columns , also called features or fields, are lists of values for the specific thing we are
measuring. For example, in the snippet above every value in the apples column represents the
number of apples at that time.
Rows , also called observations or occurrences, represent every object in our data.
For example, here we have 4 rows, which means we have 4 different counts of apples and
oranges.
Pandas can load and save tabular data in many formats, including:
CSV files
Excel files
JSON files
SQL Tables
Pickle files
Parquet files
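Each of these formats has a matching pd.read_* function. A minimal sketch (the file names here are placeholders, not paths from the original notebook):

import pandas as pd

df_csv = pd.read_csv('data.csv')            # CSV files
df_xlsx = pd.read_excel('data.xlsx')        # Excel files (needs openpyxl)
df_json = pd.read_json('data.json')         # JSON files
df_pkl = pd.read_pickle('data.pkl')         # Pickle files
df_parq = pd.read_parquet('data.parquet')   # Parquet files (needs pyarrow)
# SQL tables are read with pd.read_sql(query, connection)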
Now, for the sake of analysis, we will read a CSV file containing the top 1,000 movies in the
history of film-making according to the IMDb website.
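A sketch of the read call (the exact file path isn't shown in this export, so the name is an assumption):

# file name assumed; adjust to wherever the IMDB CSV lives
df_movies = pd.read_csv('IMDB-Movie-Data.csv')
df_movies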
     Title                    Rank  Genre                     Description                                         Director              Actors
0    Guardians of the Galaxy  1     Action,Adventure,Sci-Fi   A group of intergalactic criminals are forced ...  James Gunn            Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...
1    Prometheus               2     Adventure,Mystery,Sci-Fi  Following clues to the origin of mankind, a te...  Ridley Scott          Noomi Rapace, Logan Marshall-Green, Michael Fa...
2    Split                    3     Horror,Thriller           Three girls are kidnapped by a man with a diag...  M. Night Shyamalan    James McAvoy, Anya Taylor-Joy, Haley Lu Richar...
3    Sing                     4     Animation,Comedy,Family   In a city of humanoid animals, a hustling thea...  Christophe Lourdelet  Matthew McConaughey, Reese Witherspoon, Seth Ma...
4    Suicide Squad            5     Action,Adventure,Fantasy  A secret government agency recruits some of th...  David Ayer            Will Smith, Jared Leto, Margot Robbie, Viola D...
...  ...                      ...   ...                       ...                                                 ...                   ...
995  Secret in ...            996   Crime,Drama,Mystery       A tight-knit team of rising ...                     Billy Ray             Chiwetel Ejiofor, Nicole Kidman, Juli...
As you can see, pandas is capable of viewing any table in a clean and tidy format of the
pandas.core.frame.DataFrame data type. Now, let's get familiar with some of pandas'
basic methods and attributes to explore more about our dataset.
df.info() : method used for getting information about the number of rows and columns, the count of
non-null values, and memory usage. You can read more about it in the documentation.
df.head() : method that views the first 5 rows of the dataframe. You can pass any number in the
brackets to view that many rows.
df.tail() : method that views the last 5 rows of the dataframe. You can pass any number in the
brackets to view that many rows.
df.describe() : method that views the summary statistics for numerical columns. You can
read more about this in the documentation.
df.shape : attribute used to find the number of rows and columns (can you
remember the same attribute in NumPy ?).
df.columns : attribute used to view the column names; it's iterable by default.
You can see more in the documentation.
df_movies.head()
[first five rows of df_movies — Guardians of the Galaxy through Suicide Squad — with the same columns as the table above; the export also shows row 998, Search Party (rank 999, Adventure,Comedy, directed by Scot Armstrong, 2014)]
df_movies.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Title 1000 non-null object
1 Rank 1000 non-null int64
2 Genre 1000 non-null object
3 Description 1000 non-null object
4 Director 1000 non-null object
5 Actors 1000 non-null object
6 Year 1000 non-null int64
7 Runtime (Minutes) 1000 non-null int64
8 Rating 1000 non-null float64
9 Votes 1000 non-null int64
10 Revenue (Millions) 872 non-null float64
11 Metascore 936 non-null float64
dtypes: float64(3), int64(4), object(5)
memory usage: 93.9+ KB
df.info() is one of the most useful methods to use when looking at any DataFrame .
For our DataFrame , we have two columns with null values: Revenue (Millions) and Metascore .
To find the exact number of null values per column, we can chain the isnull() and sum() methods as follows.

df_movies.isnull().sum()
Title 0
Rank 0
Genre 0
Description 0
Director 0
Actors 0
Year 0
Runtime (Minutes) 0
Rating 0
Votes 0
Revenue (Millions) 128
Metascore 64
dtype: int64
df_movies.describe()
[summary statistics table for Rank, Year, Runtime (Minutes), Rating, Votes, Revenue (Millions), and Metascore]
As you can see, df.describe() calculates summary statistics such as count , mean ,
std , min , max , and the percentiles.
Some column names in the dataset may cause some confusion, such as Runtime (Minutes) and
Revenue (Millions) . We will rename them to Runtime_in_Minutes and Revenue_in_Millions .
df.rename() : method for renaming columns. It takes the columns in the form of a dict where the
old names are keys and the new names are values. It also takes an argument called inplace ;
when it's set to True , the changes are committed to the current dataframe directly.
For better display of the column names, we will convert them to lowercase.
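Reconstructing the cells that were dropped from this export, the two steps might look like the following (exact form assumed):

# rename the confusing columns; inplace=True commits the change directly
df_movies.rename(columns={'Runtime (Minutes)': 'Runtime_in_Minutes',
                          'Revenue (Millions)': 'Revenue_in_Millions'},
                 inplace=True)

# lowercase all column names for a cleaner display
df_movies.columns = [col.lower() for col in df_movies.columns]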
pd.concat() : appends any number of dataframes to one another, provided they follow the
same structure. Check the documentation.
We will use this function to artificially stack two copies of df_movies , giving a situation
where every row exists twice.
# to create a dataset with row duplicates, let's just double our movie dataset
df_temp = pd.concat([df_movies, df_movies], axis = 0)
# investigate the shape
df_temp.shape
(2000, 12)
As the shape check says, we now have a new dataframe df_temp that is double the size of
df_movies . Now, we will sort its rows according to the title column to get a better look.
NOTE: For any pandas method that has the argument axis , 0 means operating along the index
(down the rows) and 1 means operating along the columns.
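The sorting call itself might look like this (reconstructed; exact form assumed):

# sort by the title column and peek at the doubled rows
df_temp.sort_values(by='title').head()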
     title                 rank  genre                 description                                        director   actors                                             year
507  (500) Days of Summer  508   Comedy,Drama,Romance  An offbeat romantic comedy about a woman who d...  Marc Webb  Zooey Deschanel, Joseph Gordon-Levitt, Geoffre...  2009
507  (500) Days of Summer  508   Comedy,Drama,Romance  An offbeat romantic comedy about a woman who d...  Marc Webb  Zooey Deschanel, Joseph Gordon-Levitt, Geoffre...  2009
...  (the next title, whose description starts "After getting ..." and whose cast includes John Goodman, follows the same doubled pattern)
# drop exact duplicate rows; the default keep='first' retains one copy of each
df_temp.drop_duplicates(inplace=True)
df_temp.shape

(1000, 12)
Let's re-create df_temp to study the effect of the keep argument.

df_temp = pd.concat([df_movies, df_movies], axis=0)
df_temp.shape

(2000, 12)
Now we'll remove every row that has a duplicate ( keep=False drops all copies), which means the
remaining number of rows will be 0.
df_temp.drop_duplicates(inplace=True, keep=False)
# check the shape
df_temp.shape
(0, 12)
Now, as our dataframe consists of 1,000 unique movies, it's better to make the
title column our index.
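A sketch of that step (reconstructed):

# make the title column the index of the dataframe
df_movies.set_index('title', inplace=True)
df_movies.head(3)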
title
Guardians of the Galaxy  1  Action,Adventure,Sci-Fi   A group of intergalactic criminals are forced ...  James Gunn          Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...  2014
Prometheus               2  Adventure,Mystery,Sci-Fi  Following clues to the origin of mankind, a te...  Ridley Scott        Noomi Rapace, Logan Marshall-Green, Michael Fa...  2012
Split                    3  Horror,Thriller           Three girls are kidnapped by a man with a diag...  M. Night Shyamalan  James McAvoy, Anya Taylor-Joy, Haley Lu Richar...  2016
df.loc[] : used to select specific rows/columns by label using the following syntax:
df.loc[row_labels, column_labels]
df.iloc[] : used to select specific rows/columns by integer position using the following syntax:
df.iloc[row_positions, column_positions]
Now, we want to select some info about the Split movie, from the rank column through the actors column.

split_loc0 = df_movies.loc['Split', 'rank':'actors']
# print the selection
split_loc0
rank 3
genre Horror,Thriller
description Three girls are kidnapped by a man with a diag...
director M. Night Shyamalan
actors James McAvoy, Anya Taylor-Joy, Haley Lu Richar...
Name: Split, dtype: object
pd.DataFrame(split_loc0).T
split_loc1 = df_movies.loc['Split', : ]
# print the split
pd.DataFrame(split_loc1).T
[one-row DataFrame for Split, transposed: rank, genre, description, director, actors]
We can obtain the same result using df.iloc[] . Remember the Split movie is ranked 3, which
becomes position 2 when taking 0-indexing into consideration.
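A sketch of the equivalent iloc call (column positions assumed from the layout above, where rank through actors occupy positions 0–4):

split_iloc = df_movies.iloc[2, 0:5]
pd.DataFrame(split_iloc).T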
[same one-row result for Split as above]
Suppose that after importing the dataframe we want to work with just a subset of it, such as the
genre and rating columns.
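Reconstructing the dropped cell, the selection might look like this:

# keep only the genre and rating columns
df_movies_subset = df_movies[['genre', 'rating']].copy()
df_movies_subset.head()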
[head of df_movies_subset: genre and rating, indexed by title]
df_movies_subset.info()
<class 'pandas.core.frame.DataFrame'>
Index: 1000 entries, Guardians of the Galaxy to Nine Lives
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 genre 1000 non-null object
1 rating 1000 non-null float64
dtypes: float64(1), object(1)
memory usage: 55.7+ KB
Suppose your selection criterion is based on data types; for example, you want to isolate all string
columns in a separate dataframe.
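A sketch of that selection (reconstructed):

# isolate all string (object) columns in a separate dataframe
df_movies_str = df_movies.select_dtypes(include='object')
df_movies_str.head(1)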
[head of df_movies_str: genre, description, director, actors, indexed by title; first row is Guardians of the Galaxy]
df_movies_str.info()
<class 'pandas.core.frame.DataFrame'>
Index: 1000 entries, Guardians of the Galaxy to Nine Lives
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 genre 1000 non-null object
1 description 1000 non-null object
2 director 1000 non-null object
3 actors 1000 non-null object
dtypes: object(4)
memory usage: 71.4+ KB
All columns of df_movies_str are of data type object , which represents strings.
Dropping Null Values: When a relatively small portion of rows have null values, we can
simply drop them.
Imputation: Sometimes, when the portion of null values is a bit bigger, we can choose
some values to put in place of those nulls, such as the mean or median value for numerical columns,
and perhaps the most or least frequent category for object columns. Sometimes the
imputation follows a systematic pattern, as in the example we will see shortly.
For applying the different scenarios, we will make a copy of df_movies using the df.copy() method.
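Reconstructing the dropped cells, the copy and the drop might look like this:

# keep an untouched copy for the imputation scenario
df_movies_cp = df_movies.copy()

# drop the rows with any null value, then re-inspect
df_movies.dropna(inplace=True)
df_movies.info()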
<class 'pandas.core.frame.DataFrame'>
Index: 838 entries, Guardians of the Galaxy to Nine Lives
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 rank 838 non-null int64
1 genre 838 non-null object
2 description 838 non-null object
3 director 838 non-null object
4 actors 838 non-null object
5 year 838 non-null int64
6 runtime_in_minutes 838 non-null int64
7 rating 838 non-null float64
8 votes 838 non-null int64
9 revenue_in_millions 838 non-null float64
10 metascore 838 non-null float64
dtypes: float64(3), int64(4), object(4)
memory usage: 78.6+ KB
We have eliminated the problem of missing values in df_movies . Let's now handle
df_movies_cp using df.fillna() . First, check the missing counts:

df_movies_cp.isnull().sum()
rank 0
genre 0
description 0
director 0
actors 0
year 0
runtime_in_minutes 0
rating 0
votes 0
revenue_in_millions 128
metascore 64
dtype: int64
Now we need to figure out some information about our columns, such as the mean or median.
Pandas provides this through a wide range of methods that work at both the column and the
dataframe level, such as:
sum()
min()
max()
count()
idxmin()
idxmax()
mean()
median()
std()
# compute the mean revenue to use as the fill value
mean_revenue = df_movies_cp.revenue_in_millions.mean()
mean_revenue

82.95637614678898
Now, let's use the fillna() method to fix the problem in the revenue column.
df_movies_cp.revenue_in_millions.fillna(mean_revenue, inplace=True)
# check missing
df_movies_cp.isnull().sum()
rank 0
genre 0
description 0
director 0
actors 0
year 0
runtime_in_minutes 0
rating 0
votes 0
revenue_in_millions 0
metascore 64
dtype: int64
Sometimes, filling missing values needs to follow a pattern. Let's look at the following example,
taken from Uber.
[table with columns Date, Hour, Requests, Completes, Supply Hours, Time On Trip, pETA, aETA; the Date column has missing values]
As you can see, we have some missing values in the date column. The dataset is gathered over
several days for hours 11 , 12 , and 13 , so it makes sense for the date value of the upper row to
propagate into the rows underneath until a new date is found.
This can be done using the method argument of fillna . We will use ffill , which is an abbreviation
for forward fill.
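A sketch of the fill (df_uber is an assumed name for the Uber frame shown above):

# forward-fill: each missing Date takes the value from the row above
df_uber['Date'] = df_uber['Date'].fillna(method='ffill')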
[the same table after forward-filling: the Date column is now complete]
Filtering
Filtering is a common activity in analyzing almost any dataframe. In almost all cases, you will need
to apply some filter to the data.
Now, we want to study the movies directed by Ridley Scott . Let's see how to apply this filter.
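A sketch of the filter (reconstructed; the variable name appears later in the notebook):

# boolean mask: keep only the rows whose director is Ridley Scott
df_movies_ridley_scott = df_movies[df_movies['director'] == 'Ridley Scott']
df_movies_ridley_scott.head(3)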
title
Prometheus   2    Adventure,Mystery,Sci-Fi  Following clues to the origin of mankind, a te...  Ridley Scott  Noomi Rapace, Logan Marshall-Green, Michael Fa...  2012
The Martian  103  Adventure,Drama,Sci-Fi    An astronaut becomes stranded on Mars after hi...  Ridley Scott  Matt Damon, Jessica Chastain, Kristen Wiig, Ka...  2015
Robin Hood   388  Action,Adventure,Drama    In 12th century England, Robin and his band of...  Ridley Scott  Russell Crowe, Cate Blanchett, Matthew Macfady...  2010
df_movies_ridley_scott.shape
(8, 11)
8 movies out of the top 1000 are directed by Ridley Scott !
Now, let's suppose that we want to get all movies directed by Ridley Scott or Christopher Nolan .
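One way to do it (a sketch; the variable name here is an assumption):

# isin() accepts a list of values to match against
df_movies_two_directors = df_movies[
    df_movies['director'].isin(['Ridley Scott', 'Christopher Nolan'])]
df_movies_two_directors.head(3)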
title
Prometheus       2   Adventure,Mystery,Sci-Fi  Following clues to the origin of mankind, a te...  Ridley Scott       Noomi Rapace, Logan Marshall-Green, Michael Fa...  2012
Interstellar     37  Adventure,Drama,Sci-Fi    A team of explorers travel through a wormhole ...  Christopher Nolan  Matthew McConaughey, Anne Hathaway, Jessica Ch...  2014
The Dark Knight  55  Action,Crime,Drama        When the menace known as the Joker wreaks havo...  Christopher Nolan  Christian Bale, Heath Ledger, Aaron Eckhart, Mi...  2008
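Filters can also combine several conditions with & (and) and | (or), each condition wrapped in parentheses. For example: movies released between 2005 and 2010, rated above 8.0, with revenue below the 25th percentile.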
df_movies_multi_filter = df_movies[
((df_movies['year'] >= 2005) & (df_movies['year'] <= 2010))
& (df_movies['rating'] > 8.0)
& (df_movies['revenue_in_millions'] < df_movies['revenue_in_millions'].quantile(0.25))
]
title
3 Idiots  431  Comedy,Drama  Two friends are searching for their long lost ...  Rajkumar Hirani  Aamir Khan, Madhavan, Mona Singh, Sharman Joshi  2009
Now, we will list some of the ways of iterating over rows while measuring the execution time
of each.
For the sake of simplicity, we will use only 100 rows of the df_movies dataframe.
1. index
You can simply iterate over the index of the dataframe and access the row data based on it. Let's
see how!
df_movies.head()
[first rows of df_movies, as shown earlier]

# keep only the first 100 rows for the timing experiments
# (reconstructed cell; the exact form is assumed)
df_small = df_movies.head(100).copy()
df_small.shape

(100, 11)
%%time
for ind in df_small.index:
    print(df_small['director'][ind], df_small['revenue_in_millions'][ind])
2. iterrows()
iterrows() can also be used for iterating over rows. Let's see how!
%%time
for index, row in df_small.iterrows():
    print(row["director"], row["revenue_in_millions"])
Clint Eastwood 125.07
Zack Snyder 330.25
Tate Taylor 75.31
Sam Taylor-Johnson 166.15
Christopher Nolan 53.08
Matthew Vaughn 128.25
Peter Berg 31.86
George Miller 153.63
Robin Swicord 0.01
Peter Berg 61.28
Robert Zemeckis 40.07
J.A. Bayona 3.73
David Frankel 30.98
Byron Howard 341.26
Gore Verbinski 309.4
Joss Whedon 623.28
Quentin Tarantino 120.52
Gore Verbinski 423.03
Paul Feig 128.34
Christopher Nolan 292.57
Matt Ross 5.88
Martin Scorsese 116.87
David Fincher 167.74
James Wan 350.03
Colin Trevorrow 652.18
Ben Affleck 10.38
James Cameron 760.51
Quentin Tarantino 54.12
Gavin O'Connor 86.2
Denis Villeneuve 60.96
Duncan Jones 47.17
Tate Taylor 169.71
Todd Phillips 43.02
Joss Whedon 458.99
Shane Black 36.25
Makoto Shinkai 4.68
Jeremy Gillespie 0.15
Olivier Assayas 1.29
Martin Scorsese 132.37
Brian Helgeland 1.87
Kenneth Branagh 181.02
Ridley Scott 228.43
Guy Ritchie 45.43
David Mackenzie 26.86
Taylor Hackford 1.66
David Yates 126.59
Alex Garland 25.44
Greg McLean 10.16
Steve McQueen 56.67
Zack Snyder 210.59
CPU times: user 8.88 ms, sys: 847 µs, total: 9.73 ms
Wall time: 9.4 ms
3. itertuples()
%%time
for row in df_small.itertuples():
    print(row.director, row.revenue_in_millions)
Martin Scorsese 132.37
Brian Helgeland 1.87
Kenneth Branagh 181.02
Ridley Scott 228.43
Guy Ritchie 45.43
David Mackenzie 26.86
Taylor Hackford 1.66
David Yates 126.59
Alex Garland 25.44
Greg McLean 10.16
Steve McQueen 56.67
Zack Snyder 210.59
CPU times: user 2.91 ms, sys: 0 ns, total: 2.91 ms
Exercise
As you may have noticed by now, the genre column has multiple values in the same row. That's
because any movie can be classified under multiple genres.
Now, the exercise is to loop over the rows, gather all the genres into one list, and then get the
count of each genre using the Counter class from the collections module.
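A sketch of that loop (following the comment that survives in a later cell):

from collections import Counter

genre_lst = []
for row in df_movies.itertuples():
    # split every genre by the ',' then add the parts to genre_lst
    genre_lst.extend(row.genre.split(','))

genre_lst[:20]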
['Action',
'Adventure',
'Sci-Fi',
'Adventure',
'Mystery',
'Sci-Fi',
'Horror',
'Thriller',
'Animation',
'Comedy',
'Family',
'Action',
'Adventure',
'Fantasy',
'Action',
'Adventure',
'Fantasy',
'Comedy',
'Drama',
'Music']
Now, let's count the repetitions of every genre in our top 1,000 movies.
Counter(genre_lst).most_common()
[('Drama', 419),
('Action', 277),
('Comedy', 250),
('Adventure', 244),
('Thriller', 148),
('Crime', 126),
('Romance', 120),
('Sci-Fi', 107),
('Fantasy', 92),
('Horror', 87),
('Mystery', 86),
('Biography', 67),
('Family', 48),
('Animation', 45),
('History', 25),
('Music', 15),
('Sport', 15),
('War', 10),
('Musical', 5),
('Western', 4)]
As you can see, Drama genre is the most repeated genre in the list of top 1000 movies!
1. apply()
The apply() function is used to apply some function along some axis. Check the documentation.
Let's say we will add a new column called rating_category , which will simply be an indicator
derived from the rating column. Any movie with a rating above 8 will be labeled good ;
otherwise, it will be neutral .
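A sketch of the apply() call (the exact lambda is assumed):

# flag every movie: 'good' if rating > 8, otherwise 'neutral'
df_movies['rating_category'] = df_movies['rating'].apply(
    lambda rating: 'good' if rating > 8 else 'neutral')
df_movies.head(10)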
[df_movies with the new rating_category column appended; the first ten rows — Guardians of the Galaxy through Passengers — are shown]
2. insert()
The insert() function is used to insert some values, usually a list, at a specific position in a dataframe.
Check the documentation.
Exercise
For the current dataset, I have an assumption that long movies are the most profitable. To
validate it, I will create a new column based on the values of the runtime_in_minutes and
revenue_in_millions columns.
We will iterate over the dataframe, put all the values in a list, and finally insert this column using the
df.insert() function.
lng_prof_lst = list()
for row in df_movies.itertuples():
    # bucket each movie: long if over 120 minutes, profitable if revenue above $100M
    if row.runtime_in_minutes > 120 and row.revenue_in_millions > 100:
        lng_prof_lst.append('long_profitable')
    elif row.runtime_in_minutes < 120 and row.revenue_in_millions > 100:
        lng_prof_lst.append('short_profitable')
    elif row.runtime_in_minutes > 120 and row.revenue_in_millions < 100:
        lng_prof_lst.append('long_unprofitable')
    else:
        lng_prof_lst.append('short_unprofitable')
['long_profitable',
'long_profitable',
'short_profitable',
'short_profitable',
'long_profitable',
'short_unprofitable',
'long_profitable',
'long_unprofitable',
'short_profitable',
'long_profitable',
'long_profitable',
'long_profitable',
'short_profitable',
'short_unprofitable',
'short_profitable',
'long_unprofitable',
'long_profitable',
'short_unprofitable',
'short_profitable',
'short_unprofitable']
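The insert step itself might look like this (the position is assumed from the output below, where the flag sits right after rank; the column name comes from the later value_counts output):

# insert the flag right after the rank column (position 1)
df_movies.insert(1, 'long_profitable_flg', lng_prof_lst)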
[df_movies with the new long_profitable_flg column inserted right after rank; e.g. Guardians of the Galaxy → long_profitable, Split → short_profitable, The Great Wall → short_unprofitable, The Lost City of Z → long_unprofitable]
df[column].value_counts() : this method counts every category in a column. It has an argument
called normalize ; when it's True , the result is returned as proportions. Check the
documentation.
df_movies['long_profitable_flg'].value_counts()
short_unprofitable 454
long_unprofitable 142
short_profitable 123
long_profitable 119
Name: long_profitable_flg, dtype: int64
df_movies['long_profitable_flg'].value_counts(normalize=True)
short_unprofitable 0.541766
long_unprofitable 0.169451
short_profitable 0.146778
long_profitable 0.142005
Name: long_profitable_flg, dtype: float64
Our assumption does not hold: about 54% of the movies are short and unprofitable.
Of course, you will sometimes need to remove columns. This can be done using the following
function.
df.drop('column name', axis=1/0, inplace=True) : The function is pretty easy ( axis=1 drops a
column, axis=0 a row), but just in case, you can check the documentation.
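A sketch of the drop (reconstructed):

# drop the helper column now that we've checked the assumption
df_movies.drop('long_profitable_flg', axis=1, inplace=True)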
It worked!
But what is the difference between the long and wide formats? Let's see the following figure!
The figure shows the difference between the wide and long formats of data: although both
tables convey exactly the same information, they format it in different ways!
So, what is the correct format when doing data analysis? Is it wide or long ?
There's no right or wrong regarding this point. You may use whichever version you want, but you have
to fulfill two main aspects.
Usability : you have to think about the usability of the data in its current state. For
example, if you were given a very wide table with 300 columns, one column for each
date in the observation period, would this be a good form for representing such data? In this
format you are missing a lot of insights, such as plotting those 300 dates against any other quantity!
Converting such a case to long format may enhance the usability of the data.
Readability : you also have to think about whether your data is readable and easy enough to
understand. For example, if you were given a very long table with 300 different values for the key,
would it be better to deal with such a format, or to have every key as a separate column? Obviously,
such a case is better converted to wide format.
pd.melt() converts a dataframe from wide to long format. Its main arguments are:
id_vars : list of columns that will not change from the wide to the long format.
value_vars : list of columns that will be converted to values in the long format.
var_name : the name of the key column after converting.
value_name : the name of the value column after converting. Also, make sure to check its
documentation.
pd.pivot_table() converts a dataframe from long back to wide format. Its main arguments are:
index : column that will remain the same from long to wide.
columns : column whose categories will be broadcast into new column names in the wide
format.
values : column that will be used as the value for all corresponding categories in the
columns argument.
aggfunc : may be used for aggregating multiple occurrences of the same category.
Also, make sure to check its documentation.
[wide table with columns Name, Weight, BP; rows for John, Smith, Liz]
Now, let's convert the above dataframe into long format and see.
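A sketch of the melt call (df_health is an assumed name for the frame above):

# keep Name as identifier, stack Weight and BP into key/value rows
df_health_long = pd.melt(df_health, id_vars=['Name'], value_vars=['Weight', 'BP'])
df_health_long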
3 John BP 120
4 Smith BP 130
5 Liz BP 100
Now you may have a good idea of how pd.melt() works.
For more convenience, we will now read a new dataframe, called gapminder , which gathers some
info about different countries, such as GDP per capita, life expectancy, and population.
For this dataset we have its URL, and yes, pandas can read directly from a URL!
data_url = "https://goo.gl/ioc2Td"
gapminder = pd.read_csv(data_url)
gapminder.head(3)
3 rows × 38 columns
gapminder.columns
Index(['continent', 'country', 'gdpPercap_1952', ..., 'pop_1997', 'pop_2002', 'pop_2007'],
      dtype='object')
You can notice that the dataset is in a very wide format: each column gives the values for a
specific year, which is not a very good format for analysing such a dataset.
So we will start by separating all the gdpPercap columns, together with continent and country ,
because the latter two are both considered indexes.
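A sketch of that selection (reconstructed from the regex quoted below):

# grab continent, country, and every gdpPercap_* column in one shot
gdpPercap = gapminder.filter(regex='^gdp|^c')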
gdpPercap.shape
(142, 14)
Regular Expression, or Regex for short, is the ^gdp|^c part. Regex is used in general for text
parsing and it has some strict rules. If you want to get the correct regex for the text you want to
catch in a more fun way, you can use AutoRegex, which is an AI translator between plain English
text and regex.
For example, to catch the columns starting with the letter c or the word gdp , I entered capture
anything starts with the letter c or the word 'gdp'
Now, let's use pd.melt() to get a better format for the gdpPercap dataframe.
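Reconstructing the dropped cell (argument names follow from the columns that appear below):

gdpPercap_tidy = pd.melt(gdpPercap, id_vars=['continent', 'country'],
                         var_name='year', value_name='gdpPercap')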
gdpPercap_tidy.shape
(1704, 4)
As you can see, the format is much better, except for the year column, which we will fix right now.
Also, notice that converting from wide to long results in an increase in the number of rows and a
decrease in the number of columns.
Values in the year column such as gdpPercap_1952 need to become 1952 . To fix this we will
create a function called keep_year that takes any text mixed with numbers and returns only the
numbers.
def keep_year(text):
    clean_text = ''.join([item for item in text if item.isdigit()])
    return clean_text
keep_year('some_text_and_all_this_about_1994')
'1994'
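Applying it to the whole column might look like this (reconstructed):

# strip the 'gdpPercap_' prefix from every value in the year column
gdpPercap_tidy['year'] = gdpPercap_tidy['year'].apply(keep_year)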
This is a perfect format, but to make sure, let's check the data types of the columns.
gdpPercap_tidy.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1704 entries, 0 to 1703
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 continent 1704 non-null object
1 country 1704 non-null object
2 year 1704 non-null object
3 gdpPercap 1704 non-null float64
dtypes: float64(1), object(3)
memory usage: 53.4+ KB
The year column is still of type object (string); let's convert it to an integer type.

gdpPercap_tidy['year'] = gdpPercap_tidy['year'].astype(int)
gdpPercap_tidy.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1704 entries, 0 to 1703
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 continent 1704 non-null object
1 country 1704 non-null object
2 year 1704 non-null int64
3 gdpPercap 1704 non-null float64
dtypes: float64(1), int64(1), object(2)
memory usage: 53.4+ KB
Now, let's repeat all the previous steps for the lifeExp and pop columns.
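A sketch of those repeated steps (same recipe as for gdpPercap):

lifeExp_tidy = pd.melt(gapminder.filter(regex='^life|^c'),
                       id_vars=['continent', 'country'],
                       var_name='year', value_name='lifeExp')
lifeExp_tidy['year'] = lifeExp_tidy['year'].apply(keep_year).astype(int)

pop_tidy = pd.melt(gapminder.filter(regex='^pop|^c'),
                   id_vars=['continent', 'country'],
                   var_name='year', value_name='pop')
pop_tidy['year'] = pop_tidy['year'].apply(keep_year).astype(int)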
Now we need to combine those three dataframes, gdpPercap_tidy , lifeExp_tidy , and
pop_tidy , into one.
We can use pd.concat() for this task, but this time stacking the dataframes horizontally (column-wise).
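A sketch of the call (reconstructed):

# axis=1 stacks the frames side by side, column-wise
gapminder_final = pd.concat([gdpPercap_tidy, lifeExp_tidy, pop_tidy], axis=1)
gapminder_final.head()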
[head of the concatenated frame: continent, country, and year each appear three times, once from every tidy frame, alongside gdpPercap, lifeExp, and pop — e.g. Burkina Faso, Africa, 1952, 543.255241 and Central African Republic, Africa, 1952, 1071.310713]
Obviously, this is not the required result: we have duplicate columns! How can we handle this in ONE
LINE OF CODE?
gapminder_final = gapminder_final.T.drop_duplicates().T
# Let's check!
gapminder_final.head()
# Done!!
Now we want more practice in transforming long dataframes to wide using pd.pivot() . To do
this, we will take a subset of the gapminder dataset.
gm_df = gapminder_final[['continent','year','lifeExp']].copy()
# let's check it
gm_df.head()

gm_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1704 entries, 0 to 1703
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 continent 1704 non-null object
1 year 1704 non-null object
2 lifeExp 1704 non-null object
dtypes: object(3)
memory usage: 117.8+ KB
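The double transpose in the duplicate-dropping trick above turned every column into object dtype; the notebook evidently converted year and lifeExp back, which might look like this:

gm_df['year'] = gm_df['year'].astype(int)
gm_df['lifeExp'] = gm_df['lifeExp'].astype(float)
gm_df.info()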
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1704 entries, 0 to 1703
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 continent 1704 non-null object
1 year 1704 non-null int64
2 lifeExp 1704 non-null float64
dtypes: float64(1), int64(1), object(1)
memory usage: 117.8+ KB
For our long dataset, we summarized it using pd.pivot_table() . We chose the
categories (values) in the continent column to be broadcast as columns, and the values to be
taken from the lifeExp column (passed to the values argument). Because we have multiple values
of lifeExp corresponding to each continent, we had to choose an aggregation function to
apply during the transformation. Here we chose np.mean() , which gives
the mean (average) lifeExp for each continent.
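A sketch of that call (reconstructed):

import numpy as np

# one column per continent, values = average lifeExp
pd.pivot_table(gm_df, columns='continent', values='lifeExp', aggfunc=np.mean)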
[pivot table output: rows indexed by year, one column per continent, values = average lifeExp]
Here we used pd.pivot_table() in a different way. The year column is the index and the
categories in the continent column are broadcast into columns. So, this table gives the average
lifeExp per continent per year.
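A sketch of that second call (reconstructed):

# average lifeExp per continent per year
pd.pivot_table(gm_df, index='year', columns='continent',
               values='lifeExp', aggfunc=np.mean)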
Grouping
Any groupby operation involves one or more of the following steps on the original dataframe:
splitting the data into groups, applying a function to each group, and combining the results.
The basic argument is by , which is the column that we will be grouping over. All other arguments are
almost self-explanatory.
When we use .groupby() function on any categorical column of dataframe, it returns a GroupBy
object. Then we can use various methods on this object and even aggregate other columns to get
the summarized view of the dataset.
For now, we will read a new dataframe sales_data to apply some concepts about grouping .
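A sketch of the read call (the file name is an assumption; the original path isn't shown in this export):

sales_df = pd.read_csv('sales_data.csv')
sales_df.head()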
[first five rows of sales_df: OrderID, Quantity, UnitPrice(USD), Status, OrderDate, Product_Category, ... — e.g. order 2.951110e+18, 92 units at 238 USD, Not Delivered, 8/8/2021, Healthcare]
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9999 entries, 0 to 9998
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 OrderID 9999 non-null float64
1 Quantity 9999 non-null int64
2 UnitPrice(USD) 9999 non-null int64
3 Status 9999 non-null object
4 OrderDate 9999 non-null object
5 Product_Category 9963 non-null object
6 Sales_Manager 9999 non-null object
7 Shipping_Cost(USD) 9999 non-null int64
8 Delivery_Time(Days) 9948 non-null float64
 9   Shipping_Address     9999 non-null   object
 ...
sales_gp_df = sales_df.groupby("Product_Category")
type(sales_gp_df)
pandas.core.groupby.generic.DataFrameGroupBy
You can see that when grouping over Product_Category , a GroupBy object is returned. It behaves
like a dictionary whose keys are the unique groups into which the records are split and whose values
are the rows of each group (across the columns not mentioned in the groupby).
Certainly, the GroupBy object holds the contents of the entire DataFrame, just in a more structured
form. And just like dictionaries, there are several methods to get the required data efficiently.
sales_df.Product_Category.nunique()
5
We have 5 different categories in the Product_Category column. Let's see the number of groups.
sales_gp_df.ngroups

5
For now, let's try to get the average UnitPrice(USD) per Product_Category and see how this
works.
# we will first perform the groupby then get the mean unit price.
sales_df.groupby("Product_Category")['UnitPrice(USD)'].mean().reset_index()
Product_Category UnitPrice(USD)
0 Entertainment 176.038618
1 Fashion 176.117199
2 Healthcare 175.489503
3 Home 175.354854
4 Office 175.127300
Different aggregate functions can be used with grouping. Let's try to get the median
Shipping_Cost(USD) per Product_Category .
sales_df.groupby("Product_Category")['Shipping_Cost(USD)'].median().reset_index()
Product_Category Shipping_Cost(USD)
0 Entertainment 28.0
1 Fashion 28.0
2 Healthcare 27.0
3 Home 28.0
4 Office 28.0
Grouping can also work on multiple levels! Suppose we want to get the average
Delivery_Time(Days) per Product_Category per Sales_Manager .
sales_df.groupby(['Product_Category', 'Sales_Manager'])['Delivery_Time(Days)'].mean().reset_index()
Grouping in pandas can work with multiple aggregate functions. For example, imagine that we want
to get the sum, mean, median, standard deviation, and maximum Shipping_Cost(USD) for every
Product_Category .
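A sketch of that call (reconstructed):

sales_df.groupby("Product_Category")['Shipping_Cost(USD)'].agg(
    ['sum', 'mean', 'median', 'std', 'max']).reset_index()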
The .agg() function adds a lot of flexibility when aggregating with different aggregate functions.
Check the documentation.
We can apply .agg() to multiple columns. For example, we want to get the mean of
Shipping_Cost(USD) and the standard deviation of Quantity for every Product_Category .
import numpy as np

# aggregates dict
f = {'Shipping_Cost(USD)': [np.mean], 'Quantity': [np.std]}
# pass this dict to GroupBy
sales_df.groupby("Product_Category").agg(f).reset_index()
[table: Product_Category with the mean of Shipping_Cost(USD) and the std of Quantity]
As you can see, different aggregate functions can be applied to different columns!
In addition to that, some useful functions can be used with GroupBy ; for example, you can get the first
and last row in each category. Let's see how!
sales_df.groupby("Product_Category").first()
[one row per Product_Category: the first order in each group]
sales_df.groupby("Product_Category").last()
[one row per Product_Category: the last order in each group]
Interestingly, you can select a specific row of each group using the nth() function. Let's see how!
sales_df.groupby("Product_Category").nth(3)
[the fourth row (n=3) of each Product_Category group]
We can find the size() of each group, which is the number of rows in each Product_Category .
sales_df.groupby("Product_Category").size().reset_index()
Product_Category 0
0 Entertainment 1968
1 Fashion 1971
2 Healthcare 1953
3 Home 2060
4 Office 2011
sales_df.groupby("Product_Category").count().reset_index()
.count() counts only the non-null values from each column, whereas .size() simply returns the
number of rows available in each group irrespective of presence or absence of values.
GroupBy method get_group() is used to select or extract only one group from the GroupBy object.
sales_gp_df.get_group('Healthcare').head(10)
[the first 10 Healthcare orders, e.g. rows 0, 6, 14, and 19 of sales_df]
Also, remember that a GroupBy object is nothing but a dict, and you can iterate over it.
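A sketch of such a loop (the number of rows printed per group is assumed):

# each iteration yields the group key and its sub-dataframe
for product_category, group in sales_gp_df:
    print(product_category)
    print(group.head())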
[for each group key — Entertainment, Fashion, Healthcare, Home, ... — the key is printed followed by that group's first rows]
Easy and simple! Keep this aspect of GroupBy objects in mind; it can be game-changing when
dealing with large datasets!
A combination of .loc and the .between method can be used for this task. .between
returns a boolean vector containing True wherever the corresponding Series element
is between the boundary values left and right. Check the documentation; the most important
arguments are as follows:
pd.cut() can also be used to bin values into discrete intervals. Use cut when you need to
segment and sort data values into bins. This function is also useful for going from a
continuous variable to a categorical variable. Check the documentation. Its main arguments are
as follows:
labels : Specifies the labels for the returned bins. Must be the same length as the
resulting bins.
include_lowest : (bool) Whether the first interval should be left-inclusive or not.
To demonstrate the above methods, we will use a dataset named banking_clients , which records
the total transactions made by some clients, associated with their
account numbers. Our job is to add a customer lifetime value flag clv_flg that indicates the
importance of every client based on the transaction amounts!
Let's apply!
Let's use df.describe() to see the range of values in the total_transactions column.
banking_df.describe()
[summary statistics for account_number and total_transactions]
So, let's make the following bucketing for clients based on total_transactions :
low: (55700, 90000] , medium: (90000, 120000] , high: (120000, 200000]
Note: a square bracket [ or ] means the boundary value is inclusive, and a round bracket ( or )
means it is exclusive.
# the assigned labels ('low'/'medium'/'high') are assumed; the export truncated these lines
# low bucket
banking_df.loc[banking_df['total_transactions'].between(55700, 90000, 'right'), 'clv_flg_v1'] = 'low'
# medium bucket
banking_df.loc[banking_df['total_transactions'].between(90000, 120000, 'right'), 'clv_flg_v1'] = 'medium'
# high bucket
banking_df.loc[banking_df['total_transactions'].between(120000, 200000, 'right'), 'clv_flg_v1'] = 'high'
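The same bucketing can be done with pd.cut() in one call (a sketch; the bucket labels are assumed to mirror clv_flg_v1):

banking_df['clv_flg_v2'] = pd.cut(banking_df['total_transactions'],
                                  bins=[55700, 90000, 120000, 200000],
                                  labels=['low', 'medium', 'high'],
                                  include_lowest=True)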