4 Data Transformation Using Pandas

In this series, we will cover the basics of Data Analysis using Python. The lessons build on each other gradually to form a concrete analytical mindset. This lesson covers the essentials of Pandas for processing tabular data.

What is Pandas?

Pandas is an open-source Python package that is widely used for data science/data analysis and machine learning tasks. It is built on top of NumPy, which we touched on previously and which provides support for multi-dimensional arrays. As one of the most popular data wrangling packages, Pandas works well with many other data science modules inside the Python ecosystem, and it is included in most Python data science distributions.

Pandas makes it simple to do many of the time-consuming, repetitive tasks associated with working with data, including:

Data cleansing
Data fill
Data normalization
Merges and joins
Data visualization
Statistical analysis
Data inspection
Loading and saving data

In fact, with Pandas you can do everything that leads world-class data scientists to rate Pandas as the best data analysis and manipulation tool available.

Pandas Getting Started

The main strength of pandas is its basic data structure, the DataFrame: every tabular piece of data is stored directly as a DataFrame, which eases the transformation and manipulation of data.

Also, because as analysts we are mostly interested in columns and rows, Pandas stores every column and every row as a Series, which likewise eases transformation and manipulation at the column and row level.

Let's see the following graph!

Creating a DataFrame


DataFrames can be created from various datatypes, but the most straightforward approach is to create a DataFrame from a dict where the keys are the column names and the values are lists.

Let's try to create the above piece of data.

# Create a dict
fruits_dct = {'Oranges': [3, 2, 0, 1],
              'Apples': [0, 3, 7, 2]}

To use pandas properly, you have to do two things:

Install pandas if it's not installed (the Kaggle environment has it installed by default)
Import the package into the current working environment.

Every package in Python is nothing but some code files published online for anyone to use or edit. Importing the package means that you are grabbing those functions and code into the current runtime environment, which allows you to use any of its functions, methods, or attributes. To import pandas you need the following line.

import pandas as pd

The term import pandas is fairly self-explanatory, while the term as pd is just a convention: an agreement between people who use pandas to abbreviate it as pd.

Let's create our first pandas DataFrame!

# import statement
import pandas as pd

# let's create our dataframe from fruits_dct


fruits_df = pd.DataFrame(fruits_dct)

# lets print it
fruits_df

Oranges Apples

0 3 0

1 2 3

2 0 7

3 1 2

Let's check its type.

print(type(fruits_df))

<class 'pandas.core.frame.DataFrame'>
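Since every column (and row) is itself a Series, here is a quick sketch to confirm that on the fruits_df we just built:

print(type(fruits_df['Apples']))   # <class 'pandas.core.series.Series'> -- a column is a Series
print(type(fruits_df.loc[0]))      # a single row comes back as a Series too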

As you can see, the DataFrame is a neat and clean representation of tabular data, which makes up roughly 80% of the data we will deal with as analysts.

Before diving deep into pandas, we need to get familiar with the structure of every table of data you will see.

Structure of a Table

As mentioned before, 80% of the data we will see is tabular, so it's crucial to know its structure deeply.

Let's see the following image:

Index is like an address; it's how any data point across the dataframe or series can be accessed.
Columns, also called features or fields, are lists of values for the specific thing we are measuring. For example, in the snippet above, every value in the Apples column represents the number of apples at that time.
Rows, also called observations or occurrences, represent every object in our data. For example, here we have 4 rows, which means we have 4 different counts of apples and oranges.
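A minimal sketch of those three pieces, using the fruits_df created earlier:

print(fruits_df.index)     # RangeIndex(start=0, stop=4, step=1) -- the address of every row
print(fruits_df.columns)   # Index(['Oranges', 'Apples'], dtype='object') -- the features
print(fruits_df.loc[0])    # the first observation, returned as a Series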

Reading Data using Pandas


Pandas can read data from a wide range of formats (a sketch of the matching reader functions follows the list), such as:

CSV files
Excel files
JSON files
SQL Tables
Pickle files
Parquet files
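Each of these formats has a matching pd.read_* function. A sketch, with hypothetical file names:

# file names below are placeholders for illustration
df_csv     = pd.read_csv('data.csv')          # CSV files
df_excel   = pd.read_excel('data.xlsx')       # Excel files
df_json    = pd.read_json('data.json')        # JSON files
df_pickle  = pd.read_pickle('data.pkl')       # Pickle files
df_parquet = pd.read_parquet('data.parquet')  # Parquet files
# SQL tables additionally need a connection object:
# pd.read_sql('SELECT * FROM my_table', connection)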
Now, for the sake of analysis, we will read a CSV file containing the top 1000 movies in the history of filmmaking according to the IMDB website.

Let's see how.

# read csv file


df_movies = pd.read_csv('../input/data-analytics/IMDB-Movie-Data.csv')
# see the data
display(df_movies)
# print its type
display(type(df_movies))

[Output: a preview of the full 1000 x 12 DataFrame with columns Title, Rank, Genre, Description, Director, Actors, ...; the first rows are Guardians of the Galaxy, Prometheus, Split, Sing, and Suicide Squad, and the preview truncates at row 995.]

As you can see, pandas is capable of viewing any table in a clean and tidy format of the pandas.core.frame.DataFrame data type. Now, let's get familiar with some of pandas' basic methods and attributes to explore more about our dataset.
NOTE: the pd.read_csv() function will be covered in depth later on.

Basic Attributes and Methods


Now, we will demonstrate some of the basic attributes and methods of pandas.

df.info() : method for getting information about the number of rows and columns, the non-null counts, and memory usage. You can read more about it in the documentation
df.head() : method that views the first 5 rows of the dataframe. You can pass a number in the brackets to view that many rows.
df.tail() : method that views the last 5 rows of the dataframe. You can pass a number in the brackets to view that many rows.
df.describe() : method that views the summary statistics for numerical columns. You can read more about this in the documentation
df.shape : attribute used to find out the number of rows and columns (can you remember the same attribute in NumPy?)
df.columns : attribute used to view the column names; it returns an iterable Index. You can see more in the documentation

Let's see some code!

df_movies.head()

[Output: the first five rows of the dataframe, starting with Guardians of the Galaxy, Prometheus, and Split; the display is truncated at the right edge.]

# now, we want to view the last 2 rows of the df


df_movies.tail(2)

[Output: the last two rows of the dataframe, including Search Party (rank 999, Adventure,Comedy, directed by Scot Armstrong, 2014).]

# lets view our columns


df_movies.columns

Index(['Title', 'Rank', 'Genre', 'Description', 'Director', 'Actors', 'Year',


'Runtime (Minutes)', 'Rating', 'Votes', 'Revenue (Millions)',
'Metascore'],
dtype='object')

df_movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Title 1000 non-null object
1 Rank 1000 non-null int64
2 Genre 1000 non-null object
3 Description 1000 non-null object
4 Director 1000 non-null object
5 Actors 1000 non-null object
6 Year 1000 non-null int64
7 Runtime (Minutes) 1000 non-null int64
8 Rating 1000 non-null float64
9 Votes 1000 non-null int64
10 Revenue (Millions) 872 non-null float64
11 Metascore 936 non-null float64
dtypes: float64(3), int64(4), object(5)
memory usage: 93.9+ KB

df.info() is one of the most useful methods when looking at any DataFrame:

The first line indicates the datatype, which is DataFrame here.
The second line indicates the index type and the total number of entries.
The third line indicates the total number of columns.
Then comes a summary table containing every column name, its non-null count, and its data type.
Then we see a summary of the data types of all columns.
And on the last line we see the memory usage of our dataframe.

For our DataFrame, we have two columns with Null values: Revenue (Millions) and Metascore.

To find the number of Null values specifically, we can use the isnull() method as follows.

# number of all nulls in every column summed


df_movies.isnull().sum()

Title 0
Rank 0
Genre 0
Description 0
Director 0
Actors 0
Year 0
Runtime (Minutes) 0
Rating 0
Votes 0
Revenue (Millions) 128
Metascore 64
dtype: int64

This also validates our previous findings.

Now, let's see the summary statistics using df.describe()

df_movies.describe()

       Rank         Year         Runtime      Rating       Votes         Revenue      Metascore
                                 (Minutes)                               (Millions)
count  1000.000000  1000.000000  1000.000000  1000.000000  1.000000e+03  872.000000   936.00...
mean   500.500000   2012.783000  113.172000   6.723200     1.698083e+05  82.956376    58.98...
std    288.819436   3.205962     18.810908    0.945429     1.887626e+05  103.253540   17.19...
min    1.000000     2006.000000  66.000000    1.900000     6.100000e+01  0.000000     11.00...
25%    250.750000   2010.000000  100.000000   6.200000     3.630900e+04  13.270000    47.00...
50%    500.500000   2014.000000  111.000000   6.800000     1.107990e+05  47.985000    59.50...
75%    750.250000   2016.000000  123.000000   7.400000     2.399098e+05  113.715000   72.00...

As you can see, df.describe() calculates summary statistics such as count, mean, std, min, max, and the percentiles.

Some column names in the dataset may cause some confusion, such as Runtime (Minutes) and Revenue (Millions). We will rename them to Runtime_in_Minutes and Revenue_in_Millions.

df.rename() : method for renaming columns. It takes the columns in the form of a dict where the old names are keys and the new names are values. It also takes an argument called inplace; when it is set to True, the changes are committed to the current dataframe directly.

# rename() :: method ---> for renaming column names

df_movies.rename(columns={'Runtime (Minutes)': 'Runtime_in_Minutes',
                          'Revenue (Millions)': 'Revenue_in_Millions'},
                 inplace=True)
# check the new column names
df_movies.columns

Index(['Title', 'Rank', 'Genre', 'Description', 'Director', 'Actors', 'Year',


'Runtime_in_Minutes', 'Rating', 'Votes', 'Revenue_in_Millions',
'Metascore'],
dtype='object')

For better display of the column names, we will convert them to lowercase.

Try to think about a solution before viewing the answer!

# using list comprehensions with column names


# let's try lowering all columns' names in one line of code
df_movies.columns = [column.lower() for column in df_movies.columns]
# check column names
df_movies.columns

Index(['title', 'rank', 'genre', 'description', 'director', 'actors', 'year',


'runtime_in_minutes', 'rating', 'votes', 'revenue_in_millions',
'metascore'],
dtype='object')

Duplicate Rows Handling


One of the basic checks to run over any dataset is to check for duplicate rows. Here we don't have any duplicate rows, so we will create this situation manually.

pd.concat() : appends any number of dataframes to each other, as long as they follow the same structure. Check the documentation
We will use this function to artificially append df_movies to itself, giving a situation where every row exists twice.

# to create a dataset with row duplicates let's just double our movie dataset
df_temp = pd.concat([df_movies, df_movies], axis = 0)
# investigate the shape
df_temp.shape

(2000, 12)

As the shape check shows, we now have a new dataframe df_temp that is double the size of df_movies. Now, we will sort its rows by the title column to have a better look.

df.sort_values(by='column_name', axis=1/0, inplace=True/False) : used to sort a dataframe by a column. Check the documentation

NOTE: For any pandas method that has an axis argument, axis=0 refers to the index (rows) and axis=1 refers to the columns.
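A quick sketch of this convention on the fruits_df from earlier:

print(fruits_df.sum(axis=0))   # down the index: one sum per column (Oranges 6, Apples 12)
print(fruits_df.sum(axis=1))   # across the columns: one sum per row (3, 5, 7, 3)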

# Let's view df_temp sorted


df_temp.sort_values(by='title', axis=0, inplace=True)
df_temp.head()

[Output: df_temp sorted by title; (500) Days of Summer (rank 508, Comedy,Drama,Romance, Marc Webb, 2009) appears twice in a row, confirming that every row now exists twice.]

Now, we will try removing duplicates using the following function:

df.drop_duplicates(subset=[sequence of columns], keep='first'/'last'/False, inplace=True/False) : the keep argument has the following 3 cases:

* 'first' ---> keep only the first occurrence of the row
* 'last' ---> keep only the last occurrence of the row
* False ---> drop every row that has a duplicate, regardless of the number of occurrences
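A tiny sketch of the three cases on a toy frame:

toy = pd.DataFrame({'a': [1, 1, 2]})
print(toy.drop_duplicates(keep='first'))   # keeps rows 0 and 2
print(toy.drop_duplicates(keep='last'))    # keeps rows 1 and 2
print(toy.drop_duplicates(keep=False))     # keeps only row 2 (the value 1 is fully dropped)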

# Now we will try to drop the full-row duplicates


df_temp.drop_duplicates(inplace=True)
# check the shape
df_temp.shape

(1000, 12)

Let's re-create df_temp to study the effect of the keep argument.

# let's re-create the temp df


df_temp = pd.concat([df_movies, df_movies], axis = 0)
# check shape
df_temp.shape

(2000, 12)

Now, we'll remove any rows that have duplicates at all, which means the remaining number of rows will be 0.

df_temp.drop_duplicates(inplace=True, keep=False)
# check the shape
df_temp.shape

(0, 12)

Slicing and Selecting in DataFrames


We will always need to select and slice some parts of the DataFrame. The selection may be arbitrary, for example some specific rows and columns, or based on some conditions.

Now, as our dataframe consists of 1000 unique movies, it's better to set the title column as our index.

df.set_index('column_name', inplace=True/False) : Check documentation

# Let's set the index to the 'title' column first


df_movies.set_index('title',inplace=True)
# Lets see a slice now.
df_movies.head()

[Output: df_movies now indexed by title; the first rows are Guardians of the Galaxy, Prometheus, and Split, with columns rank, genre, description, director, actors, year, ...]

How to select specific rows/columns?

df.loc[] : used to select specific rows/columns by label, using the following syntax:

df.loc[row_labels, column_labels]

df.iloc[] : used to select specific rows/columns by integer position, using the following syntax:

df.iloc[row_positions, column_positions]
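Before applying these to the movies, a minimal sketch of the difference on fruits_df; note that loc slices are inclusive of the end label, while iloc slices are end-exclusive like plain Python slicing:

fruits_df.loc[0:2, 'Oranges']   # label-based: rows 0, 1 AND 2
fruits_df.iloc[0:2, 0]          # position-based: rows 0 and 1 only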

Now, we want to select some info about the Split movie, from the rank column to the actors column.

# let's try loc


split_loc0 = df_movies.loc['Split', 'rank': 'actors' ]
# print the split
split_loc0

rank 3
genre Horror,Thriller
description Three girls are kidnapped by a man with a diag...
director M. Night Shyamalan
actors James McAvoy, Anya Taylor-Joy, Haley Lu Richar...
Name: Split, dtype: object
Let's try to beautify the output a little bit.

pd.DataFrame(split_loc0).T

       rank  genre            description                                       director            actors
Split  3     Horror,Thriller  Three girls are kidnapped by a man with a d...   M. Night Shyamalan  James McAvoy, Anya Taylor-Joy, Haley Lu...

If we want to view all columns for the Split movie:

split_loc1 = df_movies.loc['Split', : ]
# print the split
pd.DataFrame(split_loc1).T

[Output: a one-row DataFrame with all columns for Split, from rank through metascore.]

We can use another form when using df.loc[] .

# Another form of loc


split_loc2 = df_movies.loc['Split']
# print the split
pd.DataFrame(split_loc2).T

[Output: the same one-row DataFrame for Split.]

We can obtain the same result using df.iloc[]. Remember the Split movie is ranked 3, which becomes position 2 when taking 0-indexing into consideration.

# let's try iloc


split_iloc0 = df_movies.iloc[2]
# print the split
pd.DataFrame(split_iloc0).T

[Output: the same one-row DataFrame for Split, this time selected by position.]

Suppose that after importing the dataframe we want to work with a subset of it, such as the genre and rating columns.

df_movies_subset = df_movies[['genre', 'rating']]


# print the head
df_movies_subset.head()

genre rating

title

Guardians of the Galaxy Action,Adventure,Sci-Fi 8.1

Prometheus Adventure,Mystery,Sci-Fi 7.0

Split Horror,Thriller 7.3

Sing Animation,Comedy,Family 7.2

Suicide Squad Action,Adventure,Fantasy 6.2

df_movies_subset.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1000 entries, Guardians of the Galaxy to Nine Lives
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 genre 1000 non-null object
1 rating 1000 non-null float64
dtypes: float64(1), object(1)
memory usage: 55.7+ KB

Suppose your criterion for selection is based on datatypes; for example, you want to isolate all string columns in a separate dataframe.

df.select_dtypes(include=[], exclude=[]) :: method for filtering columns based on datatypes. Check the documentation

Now, let's isolate all string columns in a separate dataframe.

# include all object columns


df_movies_str = df_movies.select_dtypes(include=['object'])
# Let's view it
df_movies_str.head()

[Output: the first five rows of df_movies_str, indexed by title, with columns genre, description, director, and actors.]

We can validate that using the df.info() method.

df_movies_str.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1000 entries, Guardians of the Galaxy to Nine Lives
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 genre 1000 non-null object
1 description 1000 non-null object
2 director 1000 non-null object
3 actors 1000 non-null object
dtypes: object(4)
memory usage: 71.4+ KB

All columns of df_movies_str are of data type object, which represents strings.

Missing Values Handling


Missing values are a common thing to check for when exploring any dataframe, and there are several handling techniques for them, such as the following:

Dropping Null Values: When a relatively small portion of rows have Null values, we can simply drop them.

df.dropna(axis=1/0, inplace=True/False) : This method removes any rows (or columns) with missing values. Check the documentation

Imputation: Sometimes, when the portion of Null values is slightly bigger, we can choose some values to fill those Nulls, such as the mean or median for numerical columns and perhaps the most (or least) frequent category for object columns. Sometimes the imputation follows a systematic pattern, such as the example we will see shortly.

df.fillna(value, method='ffill'/'bfill', inplace=True/False) : This method imputes values into the dataframe in place of Null values. Check the documentation

For applying different scenarios, we will make a copy of the df_movies using the following method.

df.copy() : Create a copy of the dataframe.

# let's create a copy of the `df_movies`


df_movies_cp = df_movies.copy()

# let's use dropna to remove all rows with Null values

df_movies.dropna(axis=0, inplace=True)
# let's check with `df.info()`
df_movies.info()

<class 'pandas.core.frame.DataFrame'>
Index: 838 entries, Guardians of the Galaxy to Nine Lives
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 rank 838 non-null int64
1 genre 838 non-null object
2 description 838 non-null object
3 director 838 non-null object
4 actors 838 non-null object
5 year 838 non-null int64
6 runtime_in_minutes 838 non-null int64
7 rating 838 non-null float64
8 votes 838 non-null int64
9 revenue_in_millions 838 non-null float64
10 metascore 838 non-null float64
dtypes: float64(3), int64(4), object(4)
memory usage: 78.6+ KB

We have eliminated the problem of missing values in df_movies. Let's try to handle df_movies_cp using df.fillna()

# Remember df_movies_cp has Null columns


df_movies_cp.isnull().sum()

rank 0
genre 0
description 0
director 0

actors 0
year 0
runtime_in_minutes 0
rating 0
votes 0
revenue_in_millions 128
metascore 64
dtype: int64

Now we need to figure out some info about our columns, such as the mean or median, and pandas can easily provide this through a wide range of methods that work at both the column level and the dataframe level, such as the following (a short sketch comes after the list):

sum()
min()
max()
count()
idxmin()
idxmax()
mean()
median()
std()
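A couple of these in action as a quick sketch (mean itself is used for the imputation right below):

df_movies_cp['rating'].max()      # the highest rating in the list
df_movies_cp['rating'].idxmax()   # the index label (movie title) holding that rating
df_movies_cp['revenue_in_millions'].median()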

Let's use mean!

# let's find the mean revenue


mean_revenue = df_movies_cp.revenue_in_millions.mean()
# print it
mean_revenue

82.95637614678898

Now, let's use the fillna() method to fix the problem in the revenue column.

df_movies_cp.revenue_in_millions.fillna(mean_revenue, inplace=True)

# check missing
df_movies_cp.isnull().sum()

rank 0
genre 0
description 0
director 0
actors 0
year 0
runtime_in_minutes 0
rating 0
votes 0
revenue_in_millions 0
metascore 64
dtype: int64

Now, we have fixed the problem of missing values in the revenue_in_millions column.

Sometimes, filling missing values needs to follow some pattern. Let's look at the following example, taken from Uber.

# read the dataset


uber = pd.read_csv('../input/data-analytics/uber.csv')
# view the head
uber.head(10)

Date Hour Requests Completes Supply Hours Time On Trip pETA aETA

0 9/1/2018 11 79 55 42.63 20.43 5.51 7.19

1 NaN 12 73 41 36.43 15.53 5.48 8.48

2 NaN 13 54 50 23.02 17.76 5.07 8.94

3 9/2/2018 11 193 170 64.20 31.47 5.31 6.55

4 NaN 12 258 210 80.28 38.68 4.94 6.08

5 NaN 13 153 107 59.18 23.37 5.14 6.42

6 9/3/2018 11 124 34 30.67 19.65 6.70 8.19

7 NaN 12 78 34 27.02 14.38 6.36 8.01

8 NaN 13 36 15 20.82 12.62 7.82 9.05

9 9/4/2018 11 98 43 29.17 16.55 6.99 8.32

As you can see, we have some missing values in the Date column. The dataset is gathered over several days for hours 11, 12, and 13, so it makes sense to carry the date value from the upper row down into the rows underneath until a new date appears.

This can be done using the method argument of fillna. We will use ffill, which is an abbreviation for forward fill.

uber.fillna(method = 'ffill', inplace = True)


# lets view it
uber.head(10)

Date Hour Requests Completes Supply Hours Time On Trip pETA aETA

0 9/1/2018 11 79 55 42.63 20.43 5.51 7.19

1 9/1/2018 12 73 41 36.43 15.53 5.48 8.48

2 9/1/2018 13 54 50 23.02 17.76 5.07 8.94

3 9/2/2018 11 193 170 64.20 31.47 5.31 6.55

4 9/2/2018 12 258 210 80.28 38.68 4.94 6.08

5 9/2/2018 13 153 107 59.18 23.37 5.14 6.42

6 9/3/2018 11 124 34 30.67 19.65 6.70 8.19

7 9/3/2018 12 78 34 27.02 14.38 6.36 8.01

8 9/3/2018 13 36 15 20.82 12.62 7.82 9.05

9 9/4/2018 11 98 43 29.17 16.55 6.99 8.32

Now, it's in a good format!

Filtering

Filtering is a common activity in analyzing almost any dataframe. In almost all cases, you will need to apply some filter on the data.

Now, we want to study the movies directed by Ridley Scott. Let's see how to apply this filter.

df_movies_ridley_scott = df_movies[df_movies['director'] == 'Ridley Scott']


df_movies_ridley_scott.head(10)

[Output: the movies directed by Ridley Scott, including Prometheus (rank 2, 2012), The Martian (rank 103, 2015), and Robin Hood (rank 388, 2010).]

Check the shape

df_movies_ridley_scott.shape

(8, 11)

8 of the top 1000 movies are directed by Ridley Scott!

Now, let's suppose that we want to get all movies directed by Ridley Scott or Christopher Nolan.

df_movies_ridley_scott_christopher_nolan = df_movies[(df_movies['director'] == 'Christopher Nolan') |
                                                     (df_movies['director'] == 'Ridley Scott')]
# print the subset
df_movies_ridley_scott_christopher_nolan.head(10)

[Output: the movies directed by Ridley Scott or Christopher Nolan, including Prometheus (rank 2, Ridley Scott), Interstellar (rank 37), The Dark Knight (rank 55), The Prestige (rank 65), and Inception (rank 81), the last four by Christopher Nolan.]
Also, we can apply more complex filtering conditions.

Suppose that we want the movies that satisfy all of these conditions:

released between 2005 and 2010
with a rating higher than 8
with total revenue below the 25th percentile (0.25 quantile)

df_movies_multi_filter = df_movies[
((df_movies['year'] >= 2005) & (df_movies['year'] <= 2010))
& (df_movies['rating'] > 8.0)
& (df_movies['revenue_in_millions'] < df_movies['revenue_in_millions'].quantile(0.25))
]

# Check the contents


df_movies_multi_filter

[Output: two matching movies -- 3 Idiots (rank 431, Comedy,Drama, Rajkumar Hirani, 2009) and The Lives of Others (rank 477, Drama,Thriller, Florian Henckel von Donnersmarck, 2006).]

Iterating over rows in a DataFrame

One of the most repetitive tasks when analysing data is iterating over rows. This process can be used to create new, insightful columns or even to get analytical insights directly.

Now, we will list some of the ways of iterating over rows while measuring the execution time of each.

For the sake of simplicity, we will use only 100 rows of the df_movies dataframe.

1. index

You can simply iterate over the index of the dataframe and access the rows' data based on it. Let's see how!

df_movies.head()

[Output: the first five rows of df_movies, indexed by title: Guardians of the Galaxy, Prometheus, Split, ...]

# take first 100 rows


df_small = df_movies[:100]
# lets see its shape
df_small.shape

(100, 11)

%%time
for ind in df_small.index:
print(df_small['director'][ind], df_small['revenue_in_millions'][ind])

James Gunn 333.13


Ridley Scott 126.46
M. Night Shyamalan 138.12
Christophe Lourdelet 270.32
David Ayer 325.02
Yimou Zhang 45.13
Damien Chazelle 151.06
James Gray 8.01
Morten Tyldum 100.01
David Yates 234.02
Theodore Melfi 169.27
Gareth Edwards 532.17
Ron Clements 248.75
Nacho Vigalondo 2.87
Chris Renaud 368.31
Mel Gibson 67.12
Paul Greengrass 162.16
Garth Davis 51.69
Denis Villeneuve 100.5
Stephen Gaghan 7.22
Kenneth Lonergan 47.7
Walt Dohrn 153.69
Roland Emmerich 103.14
Jon Lucas 113.08
Justin Kurzel 54.65
John Hamburg 60.31
Tom Ford 10.64
Bryan Singer 155.33
Tim Miller 363.02
Paul W.S. Anderson 26.84
Anthony Russo 408.08
Christopher Nolan 187.99
Scott Derrickson 232.6
Antoine Fuqua 93.38
Greg Tiernan 97.66
Barry Jenkins 27.85
John Lee Hancock 12.79
Ricardo de Montreuil 4.21
Rob Marshall 241.06
John Madden 3.44
Justin Lin 158.8
J.J. Abrams 936.63
Anna Foerster 30.35
Garry Marshall 32.46
Chad Stahelski 43.0
Christopher Nolan 533.32
Martin Scorsese 7.08
Fede Alvarez 89.21
Thea Sharrock 56.23
Lone Scherfig 3.18
Clint Eastwood 125.07
Zack Snyder 330.25
Tate Taylor 75.31
Sam Taylor-Johnson 166.15
Christopher Nolan 53.08
Matthew Vaughn 128.25
Peter Berg 31.86
George Miller 153.63

For this method, it took 4.22 ms to iterate over the 100 rows.

2. iterrows()

iterrows() can also be used for iterating over rows. Let's see how!

%%time
for index, row in df_small.iterrows():
print(row["director"], row["revenue_in_millions"])

Clint Eastwood 125.07
Zack Snyder 330.25
Tate Taylor 75.31
Sam Taylor-Johnson 166.15
Christopher Nolan 53.08
Matthew Vaughn 128.25
Peter Berg 31.86
George Miller 153.63
Robin Swicord 0.01
Peter Berg 61.28
Robert Zemeckis 40.07
J.A. Bayona 3.73
David Frankel 30.98
Byron Howard 341.26
Gore Verbinski 309.4
Joss Whedon 623.28
Quentin Tarantino 120.52
Gore Verbinski 423.03
Paul Feig 128.34
Christopher Nolan 292.57
Matt Ross 5.88
Martin Scorsese 116.87
David Fincher 167.74
James Wan 350.03
Colin Trevorrow 652.18
Ben Affleck 10.38
James Cameron 760.51
Quentin Tarantino 54.12
Gavin O'Connor 86.2
Denis Villeneuve 60.96
Duncan Jones 47.17
Tate Taylor 169.71
Todd Phillips 43.02
Joss Whedon 458.99
Shane Black 36.25
Makoto Shinkai 4.68
Jeremy Gillespie 0.15
Olivier Assayas 1.29
Martin Scorsese 132.37
Brian Helgeland 1.87
Kenneth Branagh 181.02
Ridley Scott 228.43
Guy Ritchie 45.43
David Mackenzie 26.86
Taylor Hackford 1.66
David Yates 126.59
Alex Garland 25.44
Greg McLean 10.16
Steve McQueen 56.67
Zack Snyder 210.59
CPU times: user 8.88 ms, sys: 847 µs, total: 9.73 ms
Wall time: 9.4 ms

For this method, it took 8.09 ms.

3. itertuples()

Let's see how it can be used!

%%time
for row in df_small.itertuples():
print(row.director, row.revenue_in_millions)

Garry Marshall 32.46


Chad Stahelski 43.0

Martin Scorsese 132.37
Brian Helgeland 1.87
Kenneth Branagh 181.02
Ridley Scott 228.43
Guy Ritchie 45.43
David Mackenzie 26.86
Taylor Hackford 1.66
David Yates 126.59
Alex Garland 25.44
Greg McLean 10.16
Steve McQueen 56.67
Zack Snyder 210.59
CPU times: user 2.91 ms, sys: 0 ns, total: 2.91 ms

itertuples() is generally the fastest method when iterating over rows.

Exercise

As you may have noticed by now, the genre column has multiple values in the same row. That's because any movie can be classified under multiple genres.

Now, the exercise is to loop over the rows, gather all the genres into one list, and then count each genre using the Counter class from the collections module.

Let's see how this can be done!

# list to carry all genres


genre_lst = list()
# loop over all rows using itertuples() [the fastest]
for row in df_movies.itertuples():
# split every genre by the ',' then add this to genre_lst
genre_lst += row.genre.split(',')

# lets see our list


genre_lst[:20]

['Action',
'Adventure',
'Sci-Fi',
'Adventure',
'Mystery',
'Sci-Fi',
'Horror',
'Thriller',
'Animation',
'Comedy',
'Family',
'Action',
'Adventure',
'Fantasy',
'Action',

'Adventure',
'Fantasy',
'Comedy',
'Drama',
'Music']

Now, let's count the repetitions of every genre across our top 1000 movies.

from collections import Counter

Counter(genre_lst).most_common()

[('Drama', 419),
('Action', 277),
('Comedy', 250),
('Adventure', 244),
('Thriller', 148),
('Crime', 126),
('Romance', 120),
('Sci-Fi', 107),
('Fantasy', 92),
('Horror', 87),
('Mystery', 86),
('Biography', 67),
('Family', 48),
('Animation', 45),
('History', 25),
('Music', 15),
('Sport', 15),
('War', 10),
('Musical', 5),
('Western', 4)]

As you can see, Drama is the most frequent genre in the list of the top 1000 movies!

Creating or Removing Columns from a DataFrame


As part of analysing almost any dataframe, you may need to add some new columns that hold new insights about the data.

Let's see various methods for doing that!

1. apply()

The apply() function is used to apply some function along an axis. Check the documentation

Let's say we will add a new column called rating_category, which will simply be an indicator over the rating column: any movie with a rating of 8 or above is considered good; otherwise, it is neutral.

df_movies['rating_category'] = df_movies['rating'].apply(lambda x: 'good' if x >= 8 else 'neutral')


# let's print the new dataset and see!
df_movies.head(10)

[Output: the first ten rows of df_movies (Guardians of the Galaxy through Passengers); the new rating_category column sits at the far right and is cut off in this capture.]
2. insert()

The insert() function is used to insert some values, usually a list, at a specific position in a dataframe. Check the documentation

Exercise

For the current dataset, I have an assumption that long movies are the most profitable. To validate that, I will create a new column based on the values of the runtime_in_minutes and revenue_in_millions columns.

We will iterate over the dataframe, put all the values in a list, and finally insert this column using the df.insert() function.

lng_prof_lst = list()
for row in df_movies.itertuples():
    # classify each movie by runtime (minutes) and revenue (millions)
    if row.runtime_in_minutes > 120 and row.revenue_in_millions > 100:
        lng_prof_lst.append('long_profitable')
    elif row.runtime_in_minutes < 120 and row.revenue_in_millions > 100:
        lng_prof_lst.append('short_profitable')
    elif row.runtime_in_minutes > 120 and row.revenue_in_millions < 100:
        lng_prof_lst.append('long_unprofitable')
    else:
        lng_prof_lst.append('short_unprofitable')

# lets view the list


lng_prof_lst[:20]

['long_profitable',
'long_profitable',
'short_profitable',
'short_profitable',
'long_profitable',
'short_unprofitable',
'long_profitable',
'long_unprofitable',
'short_profitable',
'long_profitable',
'long_profitable',
'long_profitable',
'short_profitable',
'short_unprofitable',
'short_profitable',
'long_unprofitable',
'long_profitable',
'short_unprofitable',
'short_profitable',
'short_unprofitable']

Perfect! Let's insert our list into the dataframe now!

df_movies.insert(1, "long_profitable_flg", lng_prof_lst)


# lets see our dataframe now.
df_movies.head(10)

[Output: the first ten rows with the new long_profitable_flg column in position 1 -- Guardians of the Galaxy, Prometheus, Suicide Squad, and La La Land are long_profitable; Split, Sing, and Passengers are short_profitable; The Great Wall is short_unprofitable; The Lost City of Z is long_unprofitable; the row for Fantastic Beasts is cut off.]
7/18/24, 1:12 PM 4-data-transformation-using-pandas.ipynb - Colab

To get some deeper insights, let's use a new function:

df[column].value_counts() : this method counts the occurrences of every category in a column. It has an argument called normalize; when it is True, the result is expressed as proportions. Check the documentation

df_movies['long_profitable_flg'].value_counts()

short_unprofitable 454
long_unprofitable 142
short_profitable 123
long_profitable 119
Name: long_profitable_flg, dtype: int64

df_movies['long_profitable_flg'].value_counts(normalize=True)

short_unprofitable 0.541766
long_unprofitable 0.169451
short_profitable 0.146778
long_profitable 0.142005
Name: long_profitable_flg, dtype: float64

Our assumption is not true: 54% of the movies are short and unprofitable.

Of course, you will sometimes need to remove columns. This can be done using the following function.

df.drop('column_name', axis=1/0, inplace=True) : The function is pretty easy, but just in case, you can check the documentation

Now, we will use it to drop the metascore column.

df_movies.drop('metascore', axis=1, inplace=True)


df_movies.columns

Index(['rank', 'long_profitable_flg', 'genre', 'description', 'director',


'actors', 'year', 'runtime_in_minutes', 'rating', 'votes',
'revenue_in_millions', 'rating_category'],
dtype='object')

It worked!
Data Reshaping


In the field of data analysis, reshaping a pandas dataframe is one of the most popular data
wrangling activities. Changing a table's format from long to wide or from wide to long is also known
as transposing, pivoting, or unpivoting.

But what is the difference between the long and wide formats? Let's see the following figure!

The above figure shows the difference between the wide and long formats of data. Although both tables convey exactly the same information, they format it in different ways!

So, what is the correct format when doing data analysis? Is it wide or long ?

There's no right or wrong regarding this point. You may use whichever version you want, but you have to fulfill two main aspects.

Usability: you have to think about the usability of the data in its current state. For example, if you were given a very wide table with 300 columns, one column for each date in the observation period, would this be a good form for representing such data? In this format, you are missing a lot of insights, such as plotting those 300 dates against any other quantity! So, converting such a case to long format may enhance the usability of the data.

Readability: you also have to think about whether your data is readable and easy enough to understand. For example, if you were given a very long table with 300 different values for the key, would it be better to deal with that format, or to have every key as a separate column? Obviously, such a case is better converted to wide format.

How to convert from the wide to the long format?

pd.melt(df, id_vars=[column_names], value_vars=[column_names], var_name='name', value_name='name')

id_vars : list of columns that will not change from the wide to the long format.
value_vars : list of columns whose values will be unpivoted into the long format.
var_name : the name of the key column after converting.
value_name : the name of the value column after converting. Also, make sure to check its documentation

How to convert from the long to the wide format?

pd.pivot_table(df, index=[column_names], columns='column_name', values='column_name', aggfunc=...)

index : column(s) that will remain the same from long to wide.
columns : column whose categories will be broadcast into new column names in the wide format.
values : column used as the values for the corresponding categories in the columns argument.
aggfunc : may be used to aggregate multiple occurrences of the same category. Also, make sure to check its documentation

Now, let's see all of this in action using some data!

Let's define a simple dataframe to see how pd.melt() works!

# Define a dict that contains the data above


data = {'Name': ['John', 'Smith', 'Liz',],
'Weight': [150, 170, 110],
'BP': [120, 130, 100]}
# Let's create a wide DataFrame
df_wide = pd.DataFrame(data)
# Let's print
df_wide

Name Weight BP

0 John 150 120

1 Smith 170 130

2 Liz 110 100

Now, let's convert the above dataframe into long format and see.

df_long = df_wide.melt(id_vars='Name', var_name='key', value_name='value')


# print to check
df_long

Name key value

0 John Weight 150

1 Smith Weight 170

2 Liz Weight 110

3 John BP 120

4 Smith BP 130

5 Liz BP 100

Now, you should have a good idea of how pd.melt() works.

For more convenience, we will read a new dataframe, called gapminder, which gathers some info about different countries, such as GDP per capita, life expectancy, and population.

For this dataset we have its URL, and yes, pandas can read directly from a URL!

data_url = "https://goo.gl/ioc2Td"
gapminder = pd.read_csv(data_url)
gapminder.head(3)

continent country gdpPercap_1952 gdpPercap_1957 gdpPercap_1962 gdpPercap_1967 g

0 Africa Algeria 2449.008185 3013.976023 2550.816880 3246.991771

1 Africa Angola 3520.610273 3827.940465 4269.276742 5522.776375

2 Africa Benin 1062.752200 959.601080 949.499064 1035.831411

3 rows × 38 columns

Let's view the columns for gapminder

gapminder.columns

Index(['continent', 'country', 'gdpPercap_1952', 'gdpPercap_1957',


'gdpPercap_1962', 'gdpPercap_1967', 'gdpPercap_1972', 'gdpPercap_1977',
'gdpPercap_1982', 'gdpPercap_1987', 'gdpPercap_1992', 'gdpPercap_1997',
'gdpPercap_2002', 'gdpPercap_2007', 'lifeExp_1952', 'lifeExp_1957',
'lifeExp_1962', 'lifeExp_1967', 'lifeExp_1972', 'lifeExp_1977',
'lifeExp_1982', 'lifeExp_1987', 'lifeExp_1992', 'lifeExp_1997',
'lifeExp_2002', 'lifeExp_2007', 'pop_1952', 'pop_1957', 'pop_1962',
'pop_1967', 'pop_1972', 'pop_1977', 'pop_1982', 'pop_1987', 'pop_1992',

'pop_1997', 'pop_2002', 'pop_2007'],
dtype='object')

You can notice that the dataset is in a very wide format and each column gives the values for a specific year, which is not a very good format for analysing such a dataset.

For this, we will start by separating all the gdpPercap columns together with continent and country, because the latter two are considered indexes.

Let's see how!

gdpPercap = gapminder.loc[:, gapminder.columns.str.contains('^gdp|^c')]


# Print the head
gdpPercap.head(3)

continent country gdpPercap_1952 gdpPercap_1957 gdpPercap_1962 gdpPercap_1967 g

0 Africa Algeria 2449.008185 3013.976023 2550.816880 3246.991771

1 Africa Angola 3520.610273 3827.940465 4269.276742 5522.776375

2 Africa Benin 1062.752200 959.601080 949.499064 1035.831411

gdpPercap.shape

(142, 14)

The pandas Series.str.contains() function is used to test whether a pattern or regex is contained within a string of a Series or Index.

The ^gdp|^c part is a Regular Expression, or regex for short. Regex is used in general for text parsing, and it has some strict rules. If you want to get the correct regex for the text you want to catch in a more fun way, you can use AutoRegex, which is an AI translator between plain English and regex.

For example, to catch the columns starting with c or the word gdp, I entered: capture anything that starts with the letter c or the word 'gdp'.
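A quick sketch of what that pattern actually matches, using Python's re module on a few of the column names (str.contains follows the same search semantics):

import re
cols = ['continent', 'country', 'gdpPercap_1952', 'lifeExp_1952', 'pop_1952']
print([c for c in cols if re.search('^gdp|^c', c)])
# ['continent', 'country', 'gdpPercap_1952']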

Now, let's use pd.melt() to get a better format for the gdpPercap dataframe.

# Let's convert it from this very wide format to a long one

gdpPercap_tidy = gdpPercap.melt(id_vars=['continent', 'country'], var_name='year', value_name='gdpPercap')
gdpPercap_tidy.head(3)

continent country year gdpPercap

0 Africa Algeria gdpPercap_1952 2449.008185

1 Africa Angola gdpPercap_1952 3520.610273

2 Africa Benin gdpPercap_1952 1062.752200

gdpPercap_tidy.shape

(1704, 4)

As you can see, the format is much better, except for the year column, which we will fix right now. Also, notice that converting a wide format to long results in an increase in the number of rows and a decrease in the number of columns.

The values in the year column, such as gdpPercap_1952, need to become 1952. To fix this we will create a function called keep_year that takes any text mixed with numbers and returns only the digits.

def keep_year(text):
clean_text = ''.join([item for item in text if item.isdigit()])
return clean_text

Let's test our function before applying it to the dataframe.

keep_year('some_text_and_all_this_about_1994')

'1994'
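As an aside, a vectorized sketch of the same idea: pandas' Series.str.extract can pull the digits out with a regex capture group, without a Python-level loop. We stick with keep_year below.

# not applied here; shown for comparison only
years = gdpPercap_tidy['year'].str.extract(r'(\d+)', expand=False)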

Perfect! Let's apply it to our dataset.

# apply the function to the year column


gdpPercap_tidy['year'] = gdpPercap_tidy['year'].apply(keep_year)
# let's see the result!
gdpPercap_tidy.head()

continent country year gdpPercap

0 Africa Algeria 1952 2449.008185

1 Africa Angola 1952 3520.610273

2 Africa Benin 1952 1062.752200

3 Africa Botswana 1952 851.241141

4 Africa Burkina Faso 1952 543.255241

This is a perfect format, but to make sure, let's check the data types of the columns.

gdpPercap_tidy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1704 entries, 0 to 1703
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 continent 1704 non-null object
1 country 1704 non-null object
2 year 1704 non-null object
3 gdpPercap 1704 non-null float64
dtypes: float64(1), object(3)
memory usage: 53.4+ KB

The year column is of object type, so it needs to be converted to int.

Let's do it!

# convert the `year` column to numeric


gdpPercap_tidy.year = pd.to_numeric(gdpPercap_tidy.year)
# check the data types
gdpPercap_tidy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1704 entries, 0 to 1703
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 continent 1704 non-null object
1 country 1704 non-null object
2 year 1704 non-null int64
3 gdpPercap 1704 non-null float64
dtypes: float64(1), int64(1), object(2)
memory usage: 53.4+ KB

Now, let's repeat all the previous steps for the lifeExp and pop columns.

# preparing lifeExp dataframe


lifeExp = gapminder.loc[:, gapminder.columns.str.contains('^life|^c')]
# melt the dataframe.
lifeExp_tidy = lifeExp.melt(id_vars=['continent', 'country'], var_name='year', value_name='lifeExp')
# fix the year column
lifeExp_tidy.year = lifeExp_tidy.year.apply(keep_year)
# convert the year column to numeric
lifeExp_tidy.year = pd.to_numeric(lifeExp_tidy.year)
# display the data itself
display(lifeExp_tidy.head())
# display info
display(lifeExp_tidy.info())

continent country year lifeExp

0 Africa Algeria 1952 43.077

1 Africa Angola 1952 30.015

2 Africa Benin 1952 38.223

3 Africa Botswana 1952 47.622

4 Africa Burkina Faso 1952 31.975


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1704 entries, 0 to 1703
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 continent 1704 non-null object
1 country 1704 non-null object
2 year 1704 non-null int64
3 lifeExp 1704 non-null float64
dtypes: float64(1), int64(1), object(2)
memory usage: 53.4+ KB
None

Perfect! Let's do the same for the pop columns!

# preparing pop dataframe


pop = gapminder.loc[:, gapminder.columns.str.contains('^pop|^c')]
# melt the dataframe.
pop_tidy = pop.melt(id_vars=['continent', 'country'], var_name='year', value_name = 'pop')
# fix the year column
pop_tidy.year = pop_tidy.year.apply(keep_year)
# convert the year column to numeric
pop_tidy.year = pd.to_numeric(pop_tidy.year)
# display the data itself
display(pop_tidy.head())
# display info
display(pop_tidy.info())

continent country year pop

0 Africa Algeria 1952 9279525.0

1 Africa Angola 1952 4232095.0

2 Africa Benin 1952 1738315.0

3 Africa Botswana 1952 442308.0

4 Africa Burkina Faso 1952 4469979.0


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1704 entries, 0 to 1703
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 continent 1704 non-null object
1 country 1704 non-null object
2 year 1704 non-null int64
3 pop 1704 non-null float64
dtypes: float64(1), int64(1), object(2)
memory usage: 53.4+ KB
None

Now, we need to combine those three dataframes (gdpPercap_tidy, lifeExp_tidy, and pop_tidy) into one.

We can use pd.concat() for this task, but this time to stack the dataframes horizontally (side by side, axis=1).

gapminder_final = pd.concat([gdpPercap_tidy, lifeExp_tidy, pop_tidy], sort=True, axis=1)
# check
gapminder_final.head(10)

continent country year gdpPercap continent country year lifeExp continent ...
0 Africa Algeria 1952 2449.008185 Africa Algeria 1952 43.077 Africa ...
1 Africa Angola 1952 3520.610273 Africa Angola 1952 30.015 Africa ...
2 Africa Benin 1952 1062.752200 Africa Benin 1952 38.223 Africa ...
3 Africa Botswana 1952 851.241141 Africa Botswana 1952 47.622 Africa ...
4 Africa Burkina Faso 1952 543.255241 Africa Burkina Faso 1952 31.975 Africa ...
5 Africa Burundi 1952 339.296459 Africa Burundi 1952 39.031 Africa ...
6 Africa Cameroon 1952 1172.667655 Africa Cameroon 1952 38.523 Africa ...
7 Africa Central African Republic 1952 1071.310713 Africa Central African Republic 1952 35.463 Africa ...

Obviously, this is not the required result! We have duplicate columns! How can we handle this in ONE
LINE OF CODE?

# transpose, drop the duplicated rows (formerly the duplicated columns), transpose back
gapminder_final = gapminder_final.T.drop_duplicates().T
# Let's check!
gapminder_final.head()
# Done!!

continent country year gdpPercap lifeExp pop

0 Africa Algeria 1952 2449.008185 43.077 9279525.0

1 Africa Angola 1952 3520.610273 30.015 4232095.0

2 Africa Benin 1952 1062.7522 38.223 1738315.0

3 Africa Botswana 1952 851.241141 47.622 442308.0

4 Africa Burkina Faso 1952 543.255241 31.975 4469979.0

Now, our dataframe is finally clean and tidy!
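
As a side note, the same clean result could have been reached without the transpose trick: merging the three tidy frames on their shared key columns avoids creating duplicate columns in the first place, and it preserves the numeric dtypes of the value columns. A minimal sketch (the name gapminder_merged is illustrative):

# join the three tidy frames on their shared keys
gapminder_merged = (gdpPercap_tidy
                    .merge(lifeExp_tidy, on=['continent', 'country', 'year'])
                    .merge(pop_tidy, on=['continent', 'country', 'year']))
gapminder_merged.head()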

Now, we want more practice in transforming long dataframes to wide, this time using pd.pivot_table() . To do
this, we will take a subset of the gapminder dataset.

gm_df = gapminder_final[['continent','year','lifeExp']].copy()
# lets print it
gm_df.head()

continent year lifeExp

0 Africa 1952 43.077

1 Africa 1952 30.015

2 Africa 1952 38.223

3 Africa 1952 47.622

4 Africa 1952 31.975

# check the datatype for each column
gm_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1704 entries, 0 to 1703
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 continent 1704 non-null object
1 year 1704 non-null object
2 lifeExp 1704 non-null object
dtypes: object(3)
memory usage: 117.8+ KB

Notice that all three columns are now object : the transpose round-trip above mixed dtypes within each transposed column, so pandas fell back to object . We need to convert lifeExp and year back to numeric types.

# convert year column
gm_df.year = pd.to_numeric(gm_df.year)
# convert lifeExp column
gm_df.lifeExp = pd.to_numeric(gm_df.lifeExp)
# let's check info
gm_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1704 entries, 0 to 1703
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 continent 1704 non-null object
1 year 1704 non-null int64
2 lifeExp 1704 non-null float64
dtypes: float64(1), int64(1), object(1)
memory usage: 117.8+ KB

Now, we will try to use pd.pivot_table() .

# let's import numpy
import numpy as np
# pivot
p_df_1 = pd.pivot_table(gm_df, values='lifeExp', columns='continent', aggfunc=np.mean)
# let's print it
p_df_1

continent Africa Americas Asia Europe Oceania

lifeExp 48.86533 64.658737 60.064903 71.903686 74.326208

For our long dataset, we summarized it using pd.pivot_table() . We chose the categories (values) in the continent column to be broadcast as columns, and the values to be taken from the lifeExp column (passed to the values argument). Because multiple lifeExp values correspond to each continent, we had to choose an aggregation function to apply during the transformation. Here, we chose np.mean() , which gives the mean (average) lifeExp for each continent.

Let's try to use pd.pivot_table() in a different way!

p_df_2 = pd.pivot_table(gm_df, values='lifeExp',
                        index=['year'],
                        columns='continent',
                        aggfunc=np.mean)
# check the table
p_df_2.head()

continent Africa Americas Asia Europe Oceania

year

1952 39.135500 53.27984 46.314394 64.408500 69.255

1957 41.266346 55.96028 49.318544 66.703067 70.295

1962 43.319442 58.39876 51.563223 68.539233 71.085

1967 45.334538 60.41092 54.663640 69.737600 71.310

1972 47.450942 62.39492 57.319269 70.775033 71.910

Here we used pd.pivot_table() in a different way: the year column becomes the index, and the
categories in the continent column are broadcast into columns. So, this table gives the average
lifeExp per continent per year.
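
Note that pd.pivot_table() aggregates, while the plain df.pivot() only reshapes and therefore requires every index/column pair to be unique. A minimal sketch, assuming the lifeExp_tidy frame built earlier (each country/year pair occurs exactly once there):

# pure reshape, no aggregation: one row per country, one column per year
wide_lifeExp = lifeExp_tidy.pivot(index='country', columns='year', values='lifeExp')
wide_lifeExp.head()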

Grouping
Any groupby operation involves a combination of the following operations on the original dataframe:

Splitting the dataframe
Applying a function
Combining the results

The basic syntax for grouping in pandas:

df.groupby(by=None, axis=0, as_index=True, sort=True, dropna=True)

The key argument is by , which is the column (or columns) to group over. All other arguments are
almost self-explanatory.

When we use the .groupby() function on any categorical column of a dataframe, it returns a GroupBy
object. We can then use various methods on this object, and even aggregate other columns, to get
a summarized view of the dataset.
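
Before moving to a real dataset, here is a minimal split-apply-combine sketch on a hypothetical toy dataframe:

# a toy dataframe to group
toy = pd.DataFrame({'team': ['A', 'A', 'B', 'B', 'B'],
                    'score': [10, 20, 5, 7, 9]})
# split by team, apply the mean to each group, combine the results
toy.groupby('team')['score'].mean()

team
A    15.0
B     7.0
Name: score, dtype: float64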

For now, we will read a new dataframe, sales_data , to apply some grouping concepts.

# read the dataset
sales_df = pd.read_csv('../input/data-analytics/sales_data.csv')
# view head
display(sales_df.head())
# view info
display(sales_df.info())

OrderID Quantity UnitPrice(USD) Status OrderDate Product_Category Sales_M...
0 2.951110e+18 92 238 Not Delivered 8/8/2021 Healthcare ...
1 2.181910e+18 61 136 Not Delivered 10/3/2021 Office ...
2 3.239110e+18 67 235 Not Delivered 9/27/2021 Office ...
3 1.112610e+18 33 133 Not Shipped 7/30/2021 Fashion ...
4 1.548310e+18 13 189 Not Delivered 8/15/2021 Fashion ...

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9999 entries, 0 to 9998
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 OrderID 9999 non-null float64
1 Quantity 9999 non-null int64
2 UnitPrice(USD) 9999 non-null int64
3 Status 9999 non-null object
4 OrderDate 9999 non-null object
5 Product_Category 9963 non-null object
6 Sales_Manager 9999 non-null object
7 Shipping_Cost(USD) 9999 non-null int64
8 Delivery_Time(Days) 9948 non-null float64
9 Shipping_Address 9999 non-null object
...

sales_gp_df = sales_df.groupby("Product_Category")
type(sales_gp_df)

pandas.core.groupby.generic.DataFrameGroupBy

You can see that grouping over Product_Category returns a GroupBy object, which is essentially a
dictionary where the keys are the unique groups into which the records are split and the values are
the corresponding rows of the dataframe.

Certainly, the GroupBy object holds the contents of the entire DataFrame, just in a more structured form. And just
like with dictionaries, there are several methods to get the required data efficiently.

Let's try to validate it!

sales_df.Product_Category.nunique()

We have 5 different categories in Product_Category . Let's see the number of groups.

sales_gp_df.ngroups

Also, for the GroupBy object we have 5 groups.
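
Since the GroupBy object behaves like a dict, you can also inspect the mapping directly through its .groups attribute; a quick sketch:

# .groups maps each group key to the row index labels of that group
list(sales_gp_df.groups.keys())

['Entertainment', 'Fashion', 'Healthcare', 'Home', 'Office']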

Now, let's try to get the average UnitPrice(USD) per Product_Category and see how this
works.

# we will first perform the groupby then get the mean unit price.
sales_df.groupby("Product_Category")['UnitPrice(USD)'].mean().reset_index()

Product_Category UnitPrice(USD)

0 Entertainment 176.038618

1 Fashion 176.117199

2 Healthcare 175.489503

3 Home 175.354854

4 Office 175.127300

Different aggregate functions can be used with grouping. Let's try to get the median
Shipping_Cost(USD) per Product_Category .

sales_df.groupby("Product_Category")['Shipping_Cost(USD)'].median().reset_index()

Product_Category Shipping_Cost(USD)

0 Entertainment 28.0

1 Fashion 28.0

2 Healthcare 27.0

3 Home 28.0

4 Office 28.0

Grouping can also work on multiple levels! Suppose that we want to get the average
Delivery_Time(Days) per Product_Category per Sales_Manager .

Let's see how!

sales_df.groupby(['Product_Category', 'Sales_Manager'])['Delivery_Time(Days)'].mean().reset_index()

Product_Category Sales_Manager Delivery_Time(Days)
...
18 Fashion Sofia 17.675978
19 Fashion Stella 17.169591
20 Healthcare Abdul 17.336957
21 Healthcare Anthony 18.202970
22 Healthcare Emma 17.542373
23 Healthcare Jacob 17.303665
24 Healthcare John 17.555556
25 Healthcare Kristen 17.053476
26 Healthcare Maria 17.584699
27 Healthcare Pablo 17.056701
28 Healthcare Sofia 18.049774
29 Healthcare Stella 17.827225
30 Home Abdul 17.888268
31 Home Anthony 17.469194
32 Home Emma 17.504762
33 Home Jacob 17.193237
34 Home John 17.685714
35 Home Kristen 17.630841
36 Home Maria 18.285714
37 Home Pablo 17.025126
38 Home Sofia 17.730435
39 Home Stella 17.615000
40 Office Abdul 17.457447
41 Office Anthony 17.378641
42 Office Emma 17.481651
43 Office Jacob 17.408377
44 Office John 17.552511
45 Office Kristen 17.876238
46 Office Maria 18.105556
47 Office Pablo 17.449495
48 Office Sofia 17.308901
49 Office Stella 17.023697

Grouping in pandas can work with multiple aggregate functions. For example, imagine that we want
to get the sum, mean, median, standard deviation, maximum, and count of Shipping_Cost(USD) for every
Product_Category :

sales_df.groupby("Product_Category")['Shipping_Cost(USD)'].agg([np.sum, np.mean, np.median, np.std, np.max, len]).reset_index()

Product_Category sum mean median std max len

0 Entertainment 54277 27.579776 28.0 4.646282 35 1968

1 Fashion 54568 27.685439 28.0 4.603958 35 1971

2 Healthcare 53745 27.519201 27.0 4.610302 35 1953

3 Home 56728 27.537864 28.0 4.550939 35 2060

4 Office 55347 27.522128 28.0 4.642962 35 2011

The .agg() function adds a lot of flexibility when aggregating with different aggregate functions.
Check the documentation.

We can apply .agg() to multiple columns. For example, suppose we want the mean of
Shipping_Cost(USD) and the standard deviation of Quantity for every Product_Category .

Let's see how!

# aggregates dict
f = {'Shipping_Cost(USD)': [np.mean], 'Quantity': [np.std]}
# pass this dict to GroupBy
sales_df.groupby("Product_Category").agg(f).reset_index()

Product_Category Shipping_Cost(USD) Quantity

mean std

0 Entertainment 27.579776 29.184244

1 Fashion 27.685439 28.687947

2 Healthcare 27.519201 29.265929

3 Home 27.537864 28.917479

4 Office 27.522128 28.983277

As you can see, different aggregate functions can be applied to different columns!
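
As a side note, newer pandas versions also support named aggregation, where keyword arguments define the output column names as (column, function) pairs; a minimal sketch (the names mean_shipping and std_quantity are illustrative):

# named aggregation: output_name=(input_column, aggregation_function)
sales_df.groupby('Product_Category').agg(
    mean_shipping=('Shipping_Cost(USD)', 'mean'),
    std_quantity=('Quantity', 'std')
).reset_index()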

In addition, some useful functions can be used with GroupBy; for example, you can get the first
and last row in each category. Let's see how!

sales_df.groupby("Product_Category").first()

OrderID Quantity UnitPrice(USD) Status OrderDate Sales_Manager ...
Product_Category
Entertainment 4.934810e+18 51 204 Not Delivered 11/13/2021 A...
Fashion 1.112610e+18 33 133 Not Shipped 7/30/2021 A...
Healthcare 2.951110e+18 92 238 Not Delivered 8/8/2021 P...
...

The last() function works similarly.

sales_df.groupby("Product_Category").last()

OrderID Quantity UnitPrice(USD) Status OrderDate Sales_Manager ...
Product_Category
Entertainment 1.468010e+18 53 229 Delivered 7/12/2021 Ant...
Fashion 2.281610e+18 18 117 Shipped 12/23/2021 S...
Healthcare 1.847410e+18 37 135 Shipped 10/3/2021 M...
Home 2.301610e+18 75 201 Not Delivered 10/16/2021 S...

Interestingly, you can view a specific row of each group using the nth() function. Let's see how!

sales_df.groupby("Product_Category").nth(3)

OrderID Quantity UnitPrice(USD) Status OrderDate Sales_Manager ...
Product_Category
Entertainment 4.030410e+18 21 123 Not Delivered 10/10/2021 Kr...
Fashion 2.804110e+18 31 163 Not Shipped 12/23/2021 A...
Healthcare 4.276410e+18 83 224 Not Delivered 7/10/2021 S...

We can find the size() of each group, which is the number of rows in each Product_Category .

sales_df.groupby("Product_Category").size().reset_index()

Product_Category 0

0 Entertainment 1968

1 Fashion 1971

2 Healthcare 1953

3 Home 2060

4 Office 2011

Also, we can use count() for the same purpose.

sales_df.groupby("Product_Category").count().reset_index()

Product_Category OrderID Quantity UnitPrice(USD) Status OrderDate Sales_Manager

0 Entertainment 1968 1968 1968 1968 1968 1968

1 Fashion 1971 1971 1971 1971 1971 1971

2 Healthcare 1953 1953 1953 1953 1953 1953

3 Home 2060 2060 2060 2060 2060 2060

4 Office 2011 2011 2011 2011 2011 2011

.count() counts only the non-null values from each column, whereas .size() simply returns the
number of rows available in each group irrespective of presence or absence of values.
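
A tiny sketch of the difference, on a hypothetical frame with one missing value:

# a hypothetical frame with one missing value
tiny = pd.DataFrame({'grp': ['x', 'x', 'y'], 'val': [1.0, np.nan, 3.0]})
# size() counts all rows per group: x -> 2, y -> 1
print(tiny.groupby('grp').size())
# count() counts only the non-null values per group: x -> 1, y -> 1
print(tiny.groupby('grp')['val'].count())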

The GroupBy method get_group() is used to select or extract a single group from the GroupBy object.

sales_gp_df.get_group('Healthcare').head(10)

OrderID Quantity UnitPrice(USD) Status OrderDate Product_Category Sales_...
0 2.951110e+18 92 238 Not Delivered 8/8/2021 Healthcare ...
6 2.750410e+18 73 242 Not Delivered 7/8/2021 Healthcare ...
14 2.559910e+18 55 233 Not Delivered 7/15/2021 Healthcare ...
19 4.276410e+18 83 224 Not Delivered 7/10/2021 Healthcare ...
22 2.111510e+18 74 250 Delivered 12/8/2021 Healthcare ...
24 3.509310e+18 8 119 Shipped 11/1/2021 Healthcare ...
30 1.408910e+18 7 130 Delivered 7/4/2021 Healthcare ...
...
Now, we are viewing only the rows where Product_Category is Healthcare .

Also, remember that the GroupBy object behaves like a dict, and you can iterate over it.

for gp_name, gp_contents in sales_gp_df:
    print(gp_name)
    display(gp_contents.head())

Entertainment
OrderID Quantity UnitPrice(USD) Status OrderDate Product_Category Sal...
5 4.934810e+18 51 204 Not Delivered 11/13/2021 Entertainment ...
12 3.882310e+18 78 219 Delivered 10/29/2021 Entertainment ...
20 2.469010e+18 15 156 Shipped 12/13/2021 Entertainment ...
21 4.030410e+18 21 123 Not Delivered 10/10/2021 Entertainment ...
23 3.629310e+18 78 155 Delivered 12/2/2021 Entertainment ...

Fashion
OrderID Quantity UnitPrice(USD) Status OrderDate Product_Category Sale...
3 1.112610e+18 33 133 Not Shipped 7/30/2021 Fashion ...
4 1.548310e+18 13 189 Not Delivered 8/15/2021 Fashion ...
7 4.797510e+18 48 240 Delivered 10/4/2021 Fashion ...
8 2.804110e+18 31 163 Not Shipped 12/23/2021 Fashion ...
9 1.735910e+18 62 214 Not Delivered 8/14/2021 Fashion ...

Healthcare
OrderID Quantity UnitPrice(USD) Status OrderDate Product_Category Sal...
0 2.951110e+18 92 238 Not Delivered 8/8/2021 Healthcare ...
6 2.750410e+18 73 242 Not Delivered 7/8/2021 Healthcare ...
14 2.559910e+18 55 233 Not Delivered 7/15/2021 Healthcare ...
19 4.276410e+18 83 224 Not Delivered 7/10/2021 Healthcare ...
22 2.111510e+18 74 250 Delivered 12/8/2021 Healthcare ...

Home
OrderID Quantity UnitPrice(USD) Status OrderDate Product_Category Sal...
10 4.337210e+18 57 226 Not Shipped 9/27/2021 Home ...
13 4.583610e+18 46 208 Not Shipped 7/28/2021 Home ...
16 2.808810e+18 96 115 Delivered 10/17/2021 Home ...
...

Easy and simple! Keep this aspect of GroupBy objects in mind; it can be game-changing when
dealing with large datasets!

Data Bucketing

This is a typical data pre-processing method, sometimes called binning and often referred to as
bucketing or discretization, which groups intervals of continuous data into bins or buckets.

Now, we will demonstrate 3 methods for bucketing :

A combination of .loc and the .between method can be used for this task. The .between
method returns a boolean vector containing True wherever the corresponding Series element
is between the boundary values left and right. Check the documentation; the most important
arguments are as follows:

left : left boundary
right : right boundary
inclusive : which boundary to include. Acceptable values are {"both", "neither", "left",
"right"}.

pd.cut() can also be used to bin values into discrete intervals. Use cut when you need to
segment and sort data values into bins. This function is also useful for going from a
continuous variable to a categorical variable. Check the documentation. Its main arguments are
as follows:

x : the input array to be binned. Must be 1-dimensional.
bins : sequence of scalars: defines the bin edges, allowing for non-uniform width.
labels : specifies the labels for the returned bins. Must be the same length as the
resulting bins.
include_lowest : (bool) whether the first interval should be left-inclusive or not.

pd.qcut() is a quantile-based discretization function. It discretizes a variable into
equal-sized buckets based on rank or on sample quantiles. Check the documentation. Its
main arguments are as follows:

x : the input array to be binned. Must be 1-dimensional.
q : number of quantiles. 10 for deciles, 4 for quartiles, etc. Alternately, an array of quantiles,
e.g. [0, .25, .5, .75, 1.] for quartiles.
labels : specifies the labels for the returned bins. Must be the same length as the
resulting bins.
retbins : (bool) whether to return the (bins, labels) or not. Can be useful if bins is given
as a scalar.

To demonstrate the above methods, we will use a dataset named banking_clients , which shows the
total transactions made by some clients, associated with their account numbers. Our job is to add a
customer lifetime value flag clv_flg that indicates the importance of every client based on the
transaction amounts!

Let's apply!

# read the data
banking_df = pd.read_csv('../input/data-analytics/banking_clients.csv')
# view it
banking_df.head()

account_number client_name total_transactions

0 141962 Herman LLC 63626.03

1 146832 Kiehn-Spinka 99608.77

2 163416 Purdy-Kunde 77898.21

3 218895 Kulas Inc 137351.96

4 239344 Stokes LLC 91535.92

Let's use df.describe() to see the range of values in the total_transactions column.

banking_df.describe()

account_number total_transactions

count 20.00000 20.000000

mean 476998.75000 101711.287500

std 231499.20897 27037.449673

min 141962.00000 55733.050000

25% 252734.50000 89137.707500

50% 476006.50000 100271.535000

75% 695352.25000 110132.552500

max 786968.00000 184793.700000

So, let's bucket the clients based on total_transactions as follows:

(55700, 90000] --> low
(90000, 120000] --> medium
(120000, 200000] --> high

Note: a square bracket ] indicates that the boundary value is inclusive, and a round bracket (
indicates that it is exclusive.

# low bucket
banking_df.loc[banking_df['total_transactions'].between(55700, 90000, 'right'), 'clv_flg_v1'] = 'low'
# medium bucket
banking_df.loc[banking_df['total_transactions'].between(90000, 120000, 'right'), 'clv_flg_v1'] = 'medium'
# high bucket
banking_df.loc[banking_df['total_transactions'].between(120000, 200000, 'right'), 'clv_flg_v1'] = 'high'

# let's see our dataset
banking_df.head()

account_number client_name total_transactions clv_flg_v1

0 141962 Herman LLC 63626.03 low

1 146832 Kiehn-Spinka 99608.77 medium

2 163416 Purdy-Kunde 77898.21 low

3 218895 Kulas Inc 137351.96 high

4 239344 Stokes LLC 91535.92 medium

Let's see how many clients fall in each category.
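
A minimal sketch of that count (assuming the value_counts() method):

# count clients per bucket
banking_df['clv_flg_v1'].value_counts()

The same buckets can also be reproduced with pd.cut() , and a quantile-based split can be obtained with pd.qcut() . A minimal sketch, assuming the banking_df from above (the column names clv_flg_v2 and clv_flg_v3 are illustrative):

# method 2: pd.cut() with explicit bin edges (right-inclusive by default)
banking_df['clv_flg_v2'] = pd.cut(banking_df['total_transactions'],
                                  bins=[55700, 90000, 120000, 200000],
                                  labels=['low', 'medium', 'high'])
# method 3: pd.qcut() splits into (roughly) equal-sized buckets by sample quantiles
banking_df['clv_flg_v3'] = pd.qcut(banking_df['total_transactions'],
                                   q=3,
                                   labels=['low', 'medium', 'high'])
# compare the three flags
banking_df.head()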
