Data Preprocessing With Python For Absolute Beginners. Step by Step. AI Publishing
Edited by AI Publishing
Ebook Converted and Cover by Gazler Studio
Published by AI Publishing LLC
ISBN-13: 978-1-7347901-0-8
Legal Notice:
You cannot amend, distribute, sell, use, quote, or paraphrase any
part of the content within this book without the specific consent of
the author.
Disclaimer Notice:
Please note the information contained within this document is for
educational and entertainment purposes only. No warranties of
any kind are expressed or implied. Readers acknowledge that the
author is not engaging in the rendering of legal, financial, medical,
or professional advice. Please consult a licensed professional before
attempting any techniques outlined in this book.
www.aipublishing.io/book-preprocessing-python
Get in Touch with Us
Preface
About the Author
Chapter 1: Introduction
1.1. What is Data Preprocessing?
1.2. Environment Setup
1.2.1. Windows Setup
1.2.2. Mac Setup
1.2.3. Linux Setup
1.3. Python Crash Course
1.3.1. Writing Your First Program
1.3.2. Python Variables and Data Types
1.3.3. Python Operators
1.3.4. Conditional Statements
1.3.5. Iteration Statements
1.3.6. Functions
1.3.7. Objects and Classes
1.4. Different Libraries for Data Preprocessing
1.4.1. NumPy
1.4.2. Scikit Learn
1.4.3. Matplotlib
1.4.4. Seaborn
1.4.5. Pandas
Exercise 1.1
Exercise 1.2
§§ Book Approach
The book follows a very simple approach. It is divided into
nine chapters. Chapter 1 introduces the basic concept of
data preprocessing, along with the installation steps for the
software that we will need to perform data preprocessing in
this book. Chapter 1 also contains a crash course on Python.
A brief overview of different data types is given in Chapter 2.
Chapter 3 explains how to handle missing values in the data,
while the encoding of categorical data is explained in
Chapter 4. Data discretization is presented in Chapter 5.
Chapter 6 explains the process of handling outliers, while
Chapter 7 explains how to scale features in the dataset.
Handling of mixed and datetime data types is explained in
Chapter 8, while data balancing and resampling are
explained in Chapter 9. A full data preprocessing final project
is also available at the end of the book.
Requirements
This box lists the requirements that must be met before
proceeding to the next topic. Generally, it works as a checklist
to confirm that everything is ready before a tutorial.
Further Readings
Here, you will be pointed to an external reference or
source that provides additional content about the specific
topic being studied. In general, it consists of packages,
documentation, and cheat sheets.
Hands-on Time
Here, you will be pointed to an external file to train and test all
the knowledge acquired about a tool that has been studied.
Generally, these files are Jupyter notebooks (.ipynb), Python
(.py) files, or documents (.pdf).
Example
www.aipublishing.io/book-preprocessing-python
1
Introduction
3. Run the executable file after the download is complete.
You will most likely find the downloaded file in your
download folder. The name of the file should be similar
to “Anaconda3-5.1.0-Windows-x86_64.” The installation
wizard will open when you run the file, as shown in the
following figure. Click the Continue button.
5. The Important Information dialog will pop up. Simply
click Continue to go with the default version that is
Anaconda 3.
7. It is mandatory to read the license agreement and click
the Agree button before you can click the Continue
button again.
The system will prompt you to give your password. Use the
same password you use to login to your Mac computer. Now,
click on Install Software.
The next screen will display the message that the installation
has completed successfully. Click on the Close button to close
the installer.
2. The second step is to download the installer bash script.
Log into your Linux computer and open your terminal.
Now, go to the /tmp directory and download the Anaconda
installer bash script from Anaconda’s home page using curl.
$ cd /tmp
$ curl -O https://repo.anaconda.com/archive/Anaconda3-5.2.0-Linux-x86_64.sh
$ sha256sum Anaconda3-5.2.0-Linux-x86_64.sh
$ bash Anaconda3-5.2.0-Linux-x86_64.sh
The command line will produce the following output. You will
be asked to review the license agreement. Keep on pressing
Enter until you reach the end.
Output
Type Yes when you get to the bottom of the License Agreement.
5. The installer will ask you to choose the installation
location after you agree to the license agreement.
[/home/tola/anaconda3] >>>
The installation will proceed once you press Enter. Once again,
you have to be patient as the installation process takes some
time to complete.
6. You will receive the following result when the installation
is complete. If you wish to use the conda command, type
Yes.
Output
...
Installation finished.
Do you wish the installer to prepend Anaconda3 install
location to path in your /home/tola/.bashrc? [yes|no]
[no]>>>
$ conda list
The print() method displays on the console any string passed
to it. If you see the following output, you have successfully
run your first Python program.
Output:
Welcome to Data Visualization with Python
a. Strings
b. Integers
c. Floats
d. Booleans
e. Lists
f. Tuples
g. Dictionaries
A variable is an alias for the memory address where actual
data is stored. The data or the values stored at a memory
address can be accessed and updated via the variable name.
Unlike other programming languages like C++, Java, and C#,
Python is loosely typed, which means that you don’t have to
state the data type while creating a variable. Rather, the type
of data is evaluated at runtime.
Script 1:
# A string Variable
first_name = "Joseph"
print(type(first_name))
# An Integer Variable
age = 20
print(type(age))
# A float variable (hypothetical example; the original float line was lost, but the output below shows a float type)
height = 5.8
print(type(height))
# A boolean variable
married = False
print(type(married))
#List
cars = ["Honda", "Toyota", "Suzuki"]
print(type(cars))
#Tuples
days = ("Sunday", "Monday", "Tuesday", "Wednesday",
"Thursday", "Friday", "Saturday")
print(type(days))
#Dictionaries
days2 = {1:"Sunday", 2:"Monday", 3:"Tuesday", 4:"Wednesday",
5:"Thursday", 6:"Friday", 7:"Saturday"}
print(type(days2))
Output:
<class 'str'>
<class 'int'>
<class 'float'>
<class 'bool'>
<class 'list'>
<class 'tuple'>
<class 'dict'>
Arithmetic Operators

Operator Name: Addition
Symbol: +
Functionality: Adds the operands on either side
Example: X + Y = 30
Script 2:
X = 20
Y = 10
print(X + Y)
print(X - Y)
print(X * Y)
print(X / Y)
print(X ** Y)
Output:
30
10
200
2.0
10240000000000
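Script 2 covers addition, subtraction, multiplication, division, and exponentiation. As a small complementary sketch (not one of the book's numbered scripts), the floor division and modulus operators work as follows:

X = 20
Y = 10
# Floor division returns the quotient without the fractional part
print(X // Y)    # 2
# Modulus returns the remainder of the division
print(X % Y)     # 0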
Logical Operators
Script 3:
X = True
Y = False
print(X and Y)
print(X or Y)
print(not(X and Y))
Output:
False
True
True
Comparison Operators
Script 4:
X = 20
Y = 35
print(X == Y)
print(X != Y)
print(X > Y)
print(X < Y)
print(X >= Y)
print(X <= Y)
Output:
False
True
False
True
False
True
Assignment Operators
Script 5:
X = 20; Y = 10
R = X + Y
print(R)
X = 20;
Y = 10
X += Y
print(X)
X = 20;
Y = 10
X -= Y
print(X)
X = 20;
Y = 10
X *= Y
print(X)
X = 20;
Y = 10
X /= Y
print(X)
X = 20;
Y = 10
X %= Y
print(X)
X = 20;
Y = 10
X **= Y
print(X)
Output:
30
30
10
200
2.0
0
10240000000000
Membership Operators
Script 6:
days = ("Sunday", "Monday", "Tuesday", "Wednesday",
"Thursday", "Friday", "Saturday")
print('Sunday' in days)
Output:
True
Script 7:
days = ("Sunday", "Monday", "Tuesday", "Wednesday",
"Thursday", "Friday", "Saturday")
print('Xunday' not in days)
Output:
True
Script 8:
# The if statement
if 10 > 5:
    print("Ten is greater than five")
Output:
Ten is greater than five
IF-Else Statement
Script 9:
# if-else statement
if 5 > 10:
    print("5 is greater than 10")
else:
    print("10 is greater than 5")
Output:
10 is greater than 5
IF-Elif Statement
Script 10:
# if-elif and else
if 5 > 10:
    print("5 is greater than 10")
elif 8 < 4:
    print("8 is smaller than 4")
else:
    print("5 is not greater than 10 and 8 is not smaller than 4")
Output:
5 is not greater than 10 and 8 is not smaller than 4
For Loop
Script 11:
items = range(5)
for item in items:
    print(item)
Output:
0
1
2
3
4
While Loop
Script 12:
c = 0
while c < 10:
    print(c)
    c = c + 1
Output:
0
1
2
3
4
5
6
7
8
9
1.3.6. Functions
Functions, in any programming language, are used to
implement a piece of code that needs to be executed
numerous times at different locations in the code. In such
cases, instead of writing long pieces of code again and again,
you can simply define a function that contains the piece of
code, and then call the function wherever you want in
the code.
Script 13:
def myfunc():
    print("This is a simple function")

myfunc()
Output:
This is a simple function
You can also pass values to a function. The values are passed
inside the parenthesis of the function call. However, you
must specify the parameter name in the function definition,
too. In the following script, we define a function named
myfuncparam(). The function accepts one parameter, i.e., num.
The value passed in the parenthesis of the function call will be
stored in this num variable and will be printed by the print()
method inside the myfuncparam() method.
Script 14:
def myfuncparam(num):
    print("This is a function with parameter value: " + num)

myfuncparam("Parameter 1")
Output:
This is a function with parameter value: Parameter 1
Script 15:
def myreturnfunc():
    return "This function returns a value"

val = myreturnfunc()
print(val)
Output:
This function returns a value
Script 16:
class Fruit:
    name = "apple"
    price = 10

    def eat_fruit(self):
        print("Fruit has been eaten")

f = Fruit()
f.eat_fruit()
print(f.name)
print(f.price)
Output:
Fruit has been eaten
apple
10
Script 17:
class Fruit:
    name = "apple"
    price = 10

    # Constructor reconstructed: the call below passes a name and a price
    def __init__(self, fruit_name, fruit_price):
        self.name = fruit_name
        self.price = fruit_price

    def eat_fruit(self):
        print("Fruit has been eaten")

f = Fruit("Orange", 15)
f.eat_fruit()
print(f.name)
print(f.price)
Output:
Fruit has been eaten
Orange
15
1.4.1. NumPy
NumPy is one of the most commonly used libraries for
numeric and scientific computing. NumPy is extremely fast
and contains support for multiple mathematical domains such
as linear algebra, geometry, etc. It is extremely important to
learn NumPy in case you plan to make a career in data science
and data preprocessing.
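As a quick illustrative sketch (not one of the book's numbered scripts), the following shows NumPy array creation and two basic operations:

import numpy as np
a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])
# Element-wise addition
print(a + b)
# Matrix multiplication, a basic linear algebra operation
print(np.dot(a, b))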
1.4.3. Matplotlib
Data visualization is an important precursor to data
preprocessing. Before you actually apply data preprocessing
techniques to the data, you should know what the data looks
like, what the distribution of a certain variable is, and so on.
Matplotlib is the de facto standard for static data visualization in Python.
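As a minimal sketch of Matplotlib usage (not one of the book's numbered scripts), the following plots a simple line chart:

import matplotlib.pyplot as plt
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]
plt.plot(x, y)
plt.xlabel("x")
plt.ylabel("x squared")
plt.show()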
1.4.4. Seaborn
The Seaborn library is built on top of the Matplotlib library and
contains all the plotting capabilities of Matplotlib. However,
with Seaborn, you can plot much more aesthetically pleasing
graphs with the help of Seaborn's default styles and color
palettes.
1.4.5. Pandas
The Pandas library offers data structures and utilities for data
manipulation and analysis, along with plotting functions that
are built on top of Matplotlib and can produce different types
of static plots in a single line of code. With Pandas, you can
import data in various formats such as CSV (Comma Separated
Values) and TSV (Tab Separated Values), and can plot a variety
of data visualizations from these data sources.
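As a minimal sketch (the file names here are hypothetical placeholders, not files shipped with the book), this is how CSV and TSV files can be imported with Pandas:

import pandas as pd
# "data.csv" and "data.tsv" are hypothetical file names used for illustration
df_csv = pd.read_csv("data.csv")
df_tsv = pd.read_csv("data.tsv", sep="\t")
print(df_csv.head())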
Exercise 1.1
Question 1:
Question 3:
Exercise 1.2
Print the multiplication table of the integer 9 using a while loop:
§§ References
1. https://numpy.org/
2. https://scikit-learn.org/
3. https://matplotlib.org/
4. https://seaborn.pydata.org/index.html
5. https://pandas.pydata.org/
2
Understanding Data Types
2.1. Introduction
A dataset can contain variables of different types depending
upon the data they store. It is important to know the different
types of data that a variable can store since different
techniques are required to handle data of various types. In
this chapter, you will see the different data types that you may
come across.
pip install pandas
pip install numpy
pip install matplotlib
pip install seaborn
Script 1:
import matplotlib.pyplot as plt
import seaborn as sns
plt.rcParams["figure.figsize"] = [8,6]
sns.set_style("darkgrid")
titanic_data = sns.load_dataset('titanic')
titanic_data.head()
The script above first imports the Matplotlib and Seaborn
libraries. It then increases the size of the default plot and
sets the grid style to dark. Finally, it loads the Titanic dataset
using Seaborn's load_dataset() method and displays the
dataset's first five rows via the head() method.
You can see that the dataset contains information about the
gender, age, fare, class, etc. of the passengers. Let’s now
identify the columns containing the discrete and continuous
numerical values.
Script 2:
sns.countplot(x='pclass', data=titanic_data)
Output:
Let’s plot a distribution plot for the fare column to see how
the fare is distributed. Execute the following script:
Script 3:
sns.distplot(titanic_data['fare'], kde=False)
Output:
Script 4:
sns.countplot(x='survived', data=titanic_data)
Output:
If you look at the Titanic dataset, you can see that the class
column contains ordinal data. The class column can have three
possible values, i.e., First, Second, and Third. Here there is a
certain relationship between the three values. The first class is
more expensive than the second class, and the second class
is more expensive than the third class. We can verify this by
plotting the average fare paid by the passengers in each class.
Script 5:
sns.barplot(x='class', y='fare', data=titanic_data)
Output:
You can see that the average fare paid by the first-class
passengers is around 85, while the second and third-class
passengers paid average fares of 20 and 16, respectively.
Script 6:
sns.barplot(x='embark_town', y='age', data=titanic_data)
Output:
1. Uniform Distribution
Uniform distribution is a type of distribution where all
observations have an equal likelihood of occurrence. The plot
of a uniform distribution is a straight horizontal line.
2. Normal Distribution
On the other hand, the normal distribution, which is also
known as the Gaussian distribution, is a type of distribution
where most of the observations occur around the center
peak, and the probabilities for values decrease symmetrically
the further they lie from the peak in either direction. A normal
distribution usually looks like a bell.
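As an illustrative sketch (not one of the book's numbered scripts), the following generates and plots samples from both distributions:

import numpy as np
import matplotlib.pyplot as plt
# 10,000 draws from a uniform and a normal (Gaussian) distribution
uniform_sample = np.random.uniform(0, 10, 10000)
normal_sample = np.random.normal(5, 1, 10000)
plt.hist(uniform_sample, bins=50)
plt.title("Uniform distribution")
plt.show()
plt.hist(normal_sample, bins=50)
plt.title("Normal distribution")
plt.show()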
Script 7:
tips_data = sns.load_dataset('tips')
tips_data.head()
Script 8:
plt.title('Tip distribution')
plt.hist(tips_data["tip"])
Output:
2.9. Outliers
Outliers are values that lie too far from the rest of the
observations in a column. For instance, if the weight of most
of the people in a sample varies between 50 and 100 kilograms,
an observation of 500 kilograms will be considered an outlier
since such an observation occurs rarely.
Script 9:
sns.boxplot( y='age', data=titanic_data)
Output:
The Seaborn box plot plots the quartile information along with
the outliers. Here, the above output shows that the median
age is around 29 for all the passengers in the Titanic
dataset. The fourth quartile contains values between 39 and 65
years. Beyond the age of 65, you can see outliers in the form of
black dots, which means that there are a few passengers
older than 65.
Exercise 2.1
Question 1:
Question 2:
Question 3:
Question 4:
Question 5:
Question 6:
3.1. Introduction
In the previous chapters, you were introduced to many high-
level concepts that we are going to study in this book. One of
the concepts was missing values. You studied what missing
values are, how missing values are introduced in datasets, and
how they affect statistical models. In this chapter, you will see
how to handle missing values.
Advantages of CCA
The assumption behind CCA is that the data is missing at random.
CCA is extremely simple to apply, and no statistical technique
is involved. Finally, the distribution of the variables is also
preserved.
Disadvantages of CCA
The major disadvantage of CCA is that if a dataset contains a
large number of missing values, a large subset of data will be
removed by CCA. Also, if the values are not missing randomly,
CCA can create a biased dataset. Finally, statistical models
trained on a dataset on which CCA is applied are not capable
of handling missing values in production.
As a rule of thumb, if you are sure that the values are missing
totally at random and the percentage of records with missing
values is less than 5 percent, you can use CCA to handle those
missing values.
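As a minimal sketch of CCA on the Titanic dataset (not one of the book's numbered scripts), dropna() removes every row that contains at least one missing value; note how many rows are lost, which illustrates the disadvantage mentioned above:

import seaborn as sns
titanic_data = sns.load_dataset('titanic')
# Fraction of rows that contain at least one missing value
print(titanic_data.isnull().any(axis=1).mean())
# Complete case analysis: keep only fully observed rows
complete_cases = titanic_data.dropna()
print(titanic_data.shape, complete_cases.shape)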
For example, consider an Age column that contains a missing value:

Age: 15, NA, 20, 25, 40

The mean of the four non-missing values is 25, so after mean imputation the column becomes:

Age: 15, 25, 20, 25, 40
Script 1:
import matplotlib.pyplot as plt
import seaborn as sns
plt.rcParams["figure.figsize"] = [8,6]
sns.set_style("darkgrid")
titanic_data = sns.load_dataset('titanic')
titanic_data.head()
Output:
Let’s filter some of the numeric columns from the dataset, and
see if they contain any missing values.
Script 2:
titanic_data = titanic_data[["survived", "pclass", "age",
"fare"]]
titanic_data.head()
Output:
Script 3:
titanic_data.isnull().mean()
Output:
survived 0.000000
pclass 0.000000
age 0.198653
fare 0.000000
dtype: float64
The output shows that only the age column contains missing
values. And the ratio of missing values is around 19.86 percent.
Let’s now find out the median and mean values for all the non-
missing values in the age column.
Script 4:
median = titanic_data.age.median()
print(median)
mean = titanic_data.age.mean()
print(mean)
Output:
28.0
29.69911764705882
To plot the kernel density plots for the actual age and median
and mean age, we will add columns to the Pandas dataframe.
Script 5:
import numpy as np
titanic_data['Median_Age'] = titanic_data.age.fillna(median)
titanic_data['Mean_Age'] = titanic_data.age.fillna(mean)
titanic_data['Mean_Age'] = np.round(titanic_data['Mean_Age'], 1)
titanic_data.head(20)
Output:
Script 6:
plt.rcParams["figure.figsize"] = [8,6]
fig = plt.figure()
ax = fig.add_subplot(111)
titanic_data['age'].plot(kind='kde', ax=ax)         # original ages
titanic_data['Median_Age'].plot(kind='kde', ax=ax)  # median imputation
titanic_data['Mean_Age'].plot(kind='kde', ax=ax)    # mean imputation
You can clearly see that the original distribution of the age
column has been distorted by the mean and median imputation,
and the overall variance of the dataset has also decreased.
Recommendations
Mean and median imputation can be used for missing
numerical data in case the data is missing at random. If the
data is normally distributed, mean imputation is better;
otherwise, median imputation is preferred in the case of
skewed distributions.
Advantages
Mean and median imputations are easy to implement and are a
useful strategy for quickly obtaining a complete dataset.
Furthermore, the mean and median imputations can be
implemented during the production phase.
Disadvantages
As said earlier, the biggest disadvantage of mean and median
imputation is that it affects the original data distribution and
the variance and covariance of the data.
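As a sketch of how median imputation can be packaged for the production phase, Scikit-learn's SimpleImputer learns the median with fit() and can later reapply it to new data with transform(). This class is not used in the book's scripts; it is shown here only as a sketch of an alternative to fillna():

import seaborn as sns
from sklearn.impute import SimpleImputer
titanic_data = sns.load_dataset('titanic')
imputer = SimpleImputer(strategy='median')
# fit_transform() learns the median of the age column and fills the missing values
titanic_data[['age']] = imputer.fit_transform(titanic_data[['age']])
print(titanic_data['age'].isnull().sum())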
Script 7:
import matplotlib.pyplot as plt
import seaborn as sns
plt.rcParams["figure.figsize"] = [8,6]
sns.set_style("darkgrid")
titanic_data = sns.load_dataset('titanic')
titanic_data.isnull().mean()
Output:
survived 0.000000
pclass 0.000000
age 0.198653
fare 0.000000
dtype: float64
The above output shows that only the age column has missing
values, which are around 20 percent of the whole dataset.
The next step is plotting the data distribution for the age
column. A histogram can reveal the data distribution of a
column.
Script 8:
titanic_data.age.hist(bins=50)
Output:
The output shows that the age column has an almost normal
distribution. Hence, the end of distribution value can be
calculated by adding three standard deviations to the mean
value of the age column.
Script 9:
eod_value = titanic_data.age.mean() + 3 * titanic_data.age.std()
print(eod_value)
Output:
73.278
Script 10:
import numpy as np
titanic_data['age_eod'] = titanic_data.age.fillna(eod_value)
titanic_data.head(20)
Output:
The above output shows that the end of distribution value, i.e.,
~73, has replaced the NaN values in the age column.
Finally, you can plot the kernel density estimation plot for
the original age column and the age column with the end of
distribution imputation.
Script 11:
plt.rcParams["figure.figsize"] = [8,6]
fig = plt.figure()
ax = fig.add_subplot(111)
titanic_data['age'].plot(kind='kde', ax=ax)      # original ages
titanic_data['age_eod'].plot(kind='kde', ax=ax)  # end of distribution imputation
Output:
Script 12:
import matplotlib.pyplot as plt
import seaborn as sns
plt.rcParams["figure.figsize"] = [8,6]
sns.set_style("darkgrid")
titanic_data = sns.load_dataset('titanic')
titanic_data.isnull().mean()
Output:
survived 0.000000
pclass 0.000000
age 0.198653
fare 0.000000
dtype: float64
The output shows that only the age column contains some
missing values. Next, we plot the histogram for the age column
to see data distribution.
Script 13:
titanic_data.age.hist()
Output:
The output shows that the maximum positive value is around 80.
Therefore, 99 can be a very good arbitrary value. Furthermore,
since the age column only contains positive values, −1 can be
another very useful arbitrary value. Let’s replace the missing
values in the age column first by 99, and then by −1.
Script 14:
import numpy as np
titanic_data['age_99'] = titanic_data.age.fillna(99)
titanic_data['age_minus1'] = titanic_data.age.fillna(-1)
titanic_data.head(20)
Output:
The final step is to plot the kernel density plots for the original
age column and for the age columns where the missing values
are replaced by 99 and −1. The following script does that:
Script 15:
plt.rcParams["figure.figsize"] = [8,6]
fig = plt.figure()
ax = fig.add_subplot(111)
titanic_data['age'].plot(kind='kde', ax=ax)         # original ages
titanic_data['age_99'].plot(kind='kde', ax=ax)      # missing values replaced by 99
titanic_data['age_minus1'].plot(kind='kde', ax=ax)  # missing values replaced by -1
Output:
We will again use the Titanic dataset. We will first try to find the
percentage of missing values in the age, fare, and embark_town
columns.
Script 16:
import matplotlib.pyplot as plt
import seaborn as sns
plt.rcParams["figure.figsize"] = [8,6]
sns.set_style("darkgrid")
titanic_data = sns.load_dataset('titanic')
# percentage of missing values per column
titanic_data = titanic_data[["embark_town", "age", "fare"]]
titanic_data.isnull().mean()
Output:
embark_town 0.002245
age 0.198653
fare 0.000000
dtype: float64
Script 17:
titanic_data.embark_town.value_counts().sort_values(ascending=False).plot.bar()
plt.xlabel('Embark Town')
plt.ylabel('Number of Passengers')
Output:
Script 18:
titanic_data.embark_town.mode()
Output:
0 Southampton
dtype: object
Script 19:
titanic_data.embark_town.fillna('Southampton', inplace=True)
Let’s now find the mode of the age column and use it to
replace the missing values in the age column.
Script 20:
titanic_data.age.mode()
Output:
24.0
The output shows that the mode of the age column is 24.
Therefore, we can use this value to replace the missing values
in the age column.
Script 21:
import numpy as np
titanic_data['age_mode'] = titanic_data.age.fillna(24)
titanic_data.head(20)
Output:
Finally, let’s plot the kernel density estimation plot for the
original age column and the age column that contains the
mode of the values in place of the missing values.
Script 22:
plt.rcParams["figure.figsize"] = [8,6]
fig = plt.figure()
ax = fig.add_subplot(111)
titanic_data['age'].plot(kind='kde', ax=ax)       # original ages
titanic_data['age_mode'].plot(kind='kde', ax=ax)  # mode imputation
Output:
Script 23:
import matplotlib.pyplot as plt
import seaborn as sns
plt.rcParams["figure.figsize"] = [8,6]
sns.set_style("darkgrid")
titanic_data = sns.load_dataset('titanic')
# percentage of missing values per column
titanic_data = titanic_data[["embark_town", "age", "fare"]]
titanic_data.isnull().mean()
Output:
embark_town 0.002245
age 0.198653
fare 0.000000
dtype: float64
Script 24:
titanic_data.embark_town.fillna('Missing', inplace=True)
After applying the missing value imputation, plot the bar plot
for the embark_town column. You can see that we have a very
small, almost negligible bar for the Missing category.
Script 25:
titanic_data.embark_town.value_counts().sort_values(ascending=False).plot.bar()
plt.xlabel('Embark Town')
plt.ylabel('Number of Passengers')
Output:
Exercise 3.1
Question 1:
Question 2:
Question 3:
Exercise 3.2
Replace the missing values in the deck column of the Titanic
dataset by the most frequently occurring categories in that
column. Plot a bar plot for the updated deck column.
4
Encoding Categorical Data
4.1. Introduction
Models based on statistical algorithms, such as machine
learning and deep learning, work with numbers. However, a
dataset can contain numerical, categorical, date time, and
mixed variables, as you saw in chapter 2. A mechanism is
needed to convert categorical data to its numeric counterpart
so that the data can be used to build statistical models. The
techniques used to convert categorical data into numeric data
are called categorical data encoding schemes. In this chapter,
you will see some of the most commonly used categorical
data encoding schemes.
Country Target
USA 1
UK 0
USA 1
France 1
USA 0
UK 0
UK France Target
0 0 1
1 0 0
0 0 1
0 1 1
0 0 0
1 0 0
Script 1:
import matplotlib.pyplot as plt
import seaborn as sns
plt.rcParams["figure.figsize"] = [8,6]
sns.set_style("darkgrid")
titanic_data = sns.load_dataset('titanic')
titanic_data.head()
Output:
Script 2:
titanic_data = titanic_data[["sex", "class", "embark_town"]]
titanic_data.head()
Output:
Script 3:
print(titanic_data['sex'].unique())
print(titanic_data['class'].unique())
print(titanic_data['embark_town'].unique())
Output:
['male' 'female']
[Third, First, Second]
Categories (3, object): [Third, First, Second]
['Southampton' 'Cherbourg' 'Queenstown' nan]
Script 4:
import pandas as pd
temp = pd.get_dummies(titanic_data['sex'])
temp.head()
In the output, you will see two columns, one for males and one
for females.
Output:
Let’s display the actual sex name and the one hot encoded
version for the sex column in the same dataframe.
Script 5:
pd.concat([titanic_data['sex'],
pd.get_dummies(titanic_data['sex'])], axis=1).head()
Output:
From the above output, you can see that in the first row, 1 has
been added in the male column because the actual value in
the sex column is male. Similarly, in the second row, 1 is added
to the female column since the actual value in the sex column
is female.
Script 6:
import pandas as pd
temp = pd.get_dummies(titanic_data['embark_town'])
temp.head()
Output:
As you saw earlier, you can have N−1 one hot encoded
columns for a categorical column that contains N unique
labels. You can remove the first column created by the get_
dummies() method by passing True as the value for the
drop_first parameter, as shown below:
Script 7:
import pandas as pd
temp = pd.get_dummies(titanic_data['embark_town'], drop_first=True)
temp.head()
Output:
Also, you can create one hot encoded column for null values in
the actual column by passing True as a value for the dummy_na
parameter.
Script 8:
import pandas as pd
temp = pd.get_dummies(titanic_data['embark_town'], dummy_na=True, drop_first=True)
temp.head()
Output:
Country Target
USA 1
UK 0
USA 1
France 1
USA 0
UK 0
The above table has been label encoded as follows. You can
see that USA has been labeled as 1, UK has been labeled as 2,
and France has been labeled as 3.
Country Target
1 1
2 0
1 1
3 1
1 0
2 0
Script 9:
# for integer encoding using sklearn
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
le.fit(titanic_data['class'])
titanic_data['le_class'] = le.transform(titanic_data['class'])
titanic_data.head()
Output:
From the above output, you can see that the class Third is
labeled as 2, the class First is labeled as 0, and so on. It is
important to mention that label encoding starts at 0.
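A useful property of label encoding is that it is reversible. As a small sketch continuing from Script 9 (not one of the book's numbered scripts), the inverse_transform() method of the LabelEncoder recovers the original string labels from the integers:

# Recover the original class labels from the integer labels
original_labels = le.inverse_transform(titanic_data['le_class'])
print(original_labels[:5])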
Country Target
USA 1
UK 0
USA 1
France 1
USA 0
UK 0
Country Target
3 1
2 0
3 1
1 1
3 0
2 0
Script 10:
titanic_data.dropna(inplace = True)
Script 11:
value_counts = titanic_data['embark_town'].value_counts().to_dict()
print(value_counts)
Output:
{'Southampton': 644, 'Cherbourg': 168, 'Queenstown': 77}
Script 12:
titanic_data['embark_town'] = titanic_data['embark_town'].map(value_counts)
titanic_data.head()
Output:
Script 13:
frequency_count = (titanic_data['embark_town'].value_counts() / len(titanic_data)).to_dict()
print(frequency_count)
Output:
{644: 0.7244094488188977, 168: 0.1889763779527559, 77:
0.08661417322834646}
Script 14:
titanic_data['embark_town'] = titanic_data['embark_town'].map(frequency_count)
titanic_data.head()
Output:
Country Target
USA 1
UK 0
USA 1
France 1
USA 0
UK 0
USA -> 1
UK -> 0
France -> 2
Country Target
1 1
0 0
1 1
2 1
1 0
0 0
Script 15:
titanic_data = sns.load_dataset('titanic')
titanic_data = titanic_data[["sex", "class", "embark_town", "survived"]]
titanic_data.groupby(['class'])['survived'].mean().sort_values()
Output:
class
Third 0.242363
Second 0.472826
First 0.629630
Name: survived, dtype: float64
You can see that the First class has the highest mean value
against the survived column. You can use any other column
as the target column. Next, we create a dictionary where class
labels are assigned corresponding integer labels. Finally, the
map() function is used to create a column that contains ordinal
values, as shown below:
Script 16:
ordered_cats = titanic_data.groupby(['class'])['survived'].mean().sort_values().index
cat_map = {k: i for i, k in enumerate(ordered_cats)}  # e.g., {'Third': 0, 'Second': 1, 'First': 2}
titanic_data['class_ordered'] = titanic_data['class'].map(cat_map)
titanic_data.head()
Output:
You can see that most passengers survived from the First
class, and it has been given the highest label, i.e., 2, and so on.
To understand mean encoding, look at the Country column of
the table below. There are three rows where the Country is
USA, and the total sum of the target for these three rows is 2.
Hence, the target mean value for USA will be 2/3 = 0.66. For
UK, this value is 0 since for both the occurrences of UK, there
is a 0 in the Target column. Hence, 0/2 = 0. Finally, France will
have a value of 1.
Actual Table:
Country Target
USA 1
UK 0
USA 1
France 1
USA 0
UK 0
Country Target
0.66 1
0 0
0.66 1
1 1
0.66 0
0 0
Script 17:
titanic_data.groupby(['class'])['survived'].mean()
Output:
class
First 0.629630
Second 0.472826
Third 0.242363
Name: survived, dtype: float64
Script 18:
mean_labels = titanic_data.groupby(['class'])['survived'].mean().to_dict()
titanic_data['class_mean'] = titanic_data['class'].map(mean_labels)
titanic_data.head()
Output:
Exercise 4.1
Question 1:
A. Mean Encoding
B. Ordinal Encoding
C. One Hot Encoding
D. All of the Above
Question 2:
Question 3:
Exercise 4.2
Apply frequency encoding to the class column of the Titanic
Dataset:
5
Data Discretization
5.1. Introduction
In the previous chapter, you studied how to perform numerical
encoding of the categorical values. In this chapter, you will
see how to convert continuous numeric values into discrete
intervals.
In this chapter, you will see some of the most commonly used
approaches for discretization.
Script 1:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
plt.rcParams["figure.figsize"] = [8,6]
sns.set_style("darkgrid")
diamond_data = sns.load_dataset('diamonds')
diamond_data.head()
Output:
Script 2:
sns.distplot(diamond_data['price'])
Output:
The histogram for the price column shows that our dataset is
positively skewed. We can use discretization on this type of
data distribution.
Let’s now find the total price range by subtracting the minimum
price from the maximum price.
Script 3:
price_range = diamond_data['price'].max() - diamond_data['price'].min()
print(price_range)
Output:
18497
Script 4:
price_range / 10
Output:
1849.7
Script 5:
lower_interval = int(np.floor(diamond_data['price'].min()))
upper_interval = int(np.ceil(diamond_data['price'].max()))
# bin width: the price range divided into 10 equal intervals
interval_length = int(np.round(price_range / 10))
print(lower_interval)
print(upper_interval)
print(interval_length)
Output:
326
18823
1850
Next, let’s create the 10 bins for our dataset. To create bins, we
will start with the minimum value, and add the bin interval or
length to it. To get the second interval, the interval length will
be added to the upper boundary of the first interval and so on.
The following script creates 10 equal width bins.
Script 6:
total_bins = [lower_interval + i * interval_length for i in range(11)]
print(total_bins)
Output:
[326, 2176, 4026, 5876, 7726, 9576, 11426, 13276, 15126, 16976, 18826]
Next, we will create string labels for each bin. You can give any
name to the bin labels.
Script 7:
bin_labels = ['Bin_no_' + str(i) for i in range(1, len(total_bins))]
print(bin_labels)
print(bin_labels)
Output:
['Bin_no_1', 'Bin_no_2', 'Bin_no_3', 'Bin_no_4', 'Bin_no_5',
'Bin_no_6', 'Bin_no_7', 'Bin_no_8', 'Bin_no_9', 'Bin_no_10']
Script 8:
diamond_data['price_bins'] = pd.cut(x=diamond_data['price'], bins=total_bins, labels=bin_labels, include_lowest=True)
diamond_data.head(10)
Output:
In the above output, you can see that a column price_bins has
been added that shows the bin value for the price.
Next, let’s plot a bar plot that shows the frequency of prices
in each bin.
Script 9:
diamond_data.groupby('price_bins')['price'].count().plot.bar()
plt.xticks(rotation=45)
Output:
The output shows that the price of most of the diamonds lies
in the first bin or the first interval.
Script 10:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
plt.rcParams["figure.figsize"] = [8,6]
sns.set_style("darkgrid")
diamond_data = sns.load_dataset('diamonds')
diamond_data.head()
Output:
Script 11:
discretised_price, bins = pd.qcut(diamond_data['price'], 10, labels=None, retbins=True, precision=3, duplicates='raise')
To see the bin intervals, simply print the bins returned by the
“qcut()” function as shown below:
Script 12:
print(bins)
print(type(bins))
Output:
[ 326. 646. 837. 1087. 1698. 2401. 3465. 4662. 6301.2
9821. 18823. ]
<class 'numpy.ndarray'>
Next, let’s find the number of records per bin. Execute the
following script:
Script 13:
discretised_price.value_counts()
Output:
(325.999, 646.0] 5411
(1698.0, 2401.0] 5405
(837.0, 1087.0] 5396
(6301.2, 9821.0] 5395
(3465.0, 4662.0] 5394
(9821.0, 18823.0] 5393
(4662.0, 6301.2] 5389
(1087.0, 1698.0] 5388
(646.0, 837.0] 5385
(2401.0, 3465.0] 5384
Name: price, dtype: int64
From the output, you can see that all the bins have more or
less the same number of records. This is what equal frequency
discretization does, i.e., create bins with an equal number of
records.
Script 14:
bin_labels = ['Bin_no_' + str(i) for i in range(1, 11)]
print(bin_labels)
Output:
['Bin_no_1', 'Bin_no_2', 'Bin_no_3', 'Bin_no_4', 'Bin_no_5',
'Bin_no_6', 'Bin_no_7', 'Bin_no_8', 'Bin_no_9', 'Bin_no_10']
Script 15:
diamond_data['price_bins'] = pd.cut(x=diamond_data['price'], bins=bins, labels=bin_labels, include_lowest=True)
diamond_data.head(10)
Output:
In the output above, you can see a new column, i.e., price_
bins. This column contains equal frequency discrete bin labels.
Finally, we can plot a bar plot that displays the frequency of
records per bin.
Script 16:
diamond_data.groupby('price_bins')['price'].count().plot.bar()
plt.xticks(rotation=45)
Output:
You can see that the number of records is almost the same for
all the bins.
Script 17:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer
plt.rcParams["figure.figsize"] = [8,6]
sns.set_style("darkgrid")
diamond_data = sns.load_dataset('diamonds')
diamond_data.head()
Output:
Script 18:
discretization = KBinsDiscretizer(n_bins=10, encode='ordinal', strategy='kmeans')
discretization.fit(diamond_data[['price']])
Next, you can access the bins created via K-means clustering
using the bin_edges_ attribute.
Script 19:
intervals = discretization.bin_edges_.tolist()
print(intervals)
Output:
[array([  326.        ,  1417.67543928,  2627.50524806,  3950.3762392 ,
         5441.70606939,  7160.05893161,  9140.61465361, 11308.37609661,
        13634.55462656, 16130.22549621, 18823.        ])]
Script 20:
intervals = [326., 1417.67543928, 2627.50524806, 3950.3762392,
             5441.70606939, 7160.05893161, 9140.61465361, 11308.37609661,
             13634.55462656, 16130.22549621, 18823.]
Script 21:
bin_labels = ['Bin_no_' + str(i) for i in range(1, 11)]
print(bin_labels)
Output:
['Bin_no_1', 'Bin_no_2', 'Bin_no_3', 'Bin_no_4', 'Bin_no_5', 'Bin_no_6', 'Bin_no_7', 'Bin_no_8', 'Bin_no_9', 'Bin_no_10']
Finally, you can use the cut() method from the Pandas library
to create a new column containing bins for the price column.
Script 22:
diamond_data['price_bins'] = pd.cut(x=diamond_data['price'], bins=intervals, labels=bin_labels, include_lowest=True)
diamond_data.head(10)
Output:
Script 23:
diamond_data.groupby('price_bins')['price'].count().plot.bar()
plt.xticks(rotation=45)
Output:
Script 24:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.tree import DecisionTreeClassifier
sns.set_style("darkgrid")
diamond_data = sns.load_dataset('diamonds')
diamond_data.head()
Output:
For instance, the following script creates bins for the price
column of the Diamonds dataset, based on the values in the
cut column.
Script 25:
tree_model = DecisionTreeClassifier(max_depth=3)
tree_model.fit(diamond_data['price'].to_frame(), diamond_data['cut'])
diamond_data['price_tree'] = tree_model.predict_proba(diamond_data['price'].to_frame())[:,1]
diamond_data.head()
Output:
Script 26:
diamond_data['price_tree'].unique()
Output:
array([0.12743549, 0.10543414, 0.0964318 , 0.11666667, 0.15124195,
       0.08576481, 0.05252665, 0.08874839])
Script 27:
diamond_data.groupby(['price_tree'])['price'].count().plot.bar()
plt.xticks(rotation=45)
Output:
Script 28:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.tree import DecisionTreeClassifier
sns.set_style("darkgrid")
tips_data = sns.load_dataset('tips')
tips_data.head()
Output:
Let’s find the maximum and minimum values in the tip column.
Script 29:
tips_data['tip'].describe()
Output:
count 244.000000
mean 2.998279
std 1.383638
min 1.000000
25% 2.000000
50% 2.900000
75% 3.562500
max 10.000000
Name: tip, dtype: float64
You can see that the tip column has a minimum value of 1
and a maximum value of 10. We will create three bins for the
tip column. The first bin will contain records where the tip is
between 0 and 3. The second bin will contain records where
the tip is between 3 and 7, and finally, the third bin will contain
records between 7 and 10. After that, you can simply use the
“cut()” method from the Pandas library to create a column that
contains customized bins, as shown in the following script:
Script 30:
buckets = [0, 3, 7, 10]
# the pd.cut() call below is reconstructed from the surrounding text
tips_data['tip_bins'] = pd.cut(tips_data['tip'], bins=buckets, include_lowest=True)
tips_data.head()
Output:
Finally, the number of records per bin can be plotted via a bar
plot.
Script 31:
tips_data.groupby('tip_bins')['tip'].count().plot.bar()
plt.xticks(rotation=45)
Output:
Exercise 5.1
Question 1:
Question 2:
Question 3:
Exercise 5.2
Create five bins for the total_bill column of the Tips dataset
using equal frequency discretization. Plot a bar plot displaying
the frequency of bills per category.
6
Outlier Handling
6.1. Introduction
Outliers were briefly explained in section 2.9 of chapter 2.
In that section, you studied what the different types of outliers
are, how they occur in a dataset, and how they can affect the
performance of statistical machine learning and deep learning
models. In this chapter, you are going to see how to handle
outliers.
Let’s remove the outliers from the age column of the Titanic
dataset. The Titanic dataset contains records of the passengers
who traveled on the unfortunate Titanic that sank in 1912. The
following script imports the Titanic dataset from the Seaborn
library.
Script 1:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
plt.rcParams["figure.figsize"] = [8,6]
sns.set_style("darkgrid")
titanic_data = sns.load_dataset('titanic')
titanic_data.head()
The first five rows of the Titanic dataset look like this.
Output:
To visualize the outliers, you can simply plot the box plot for
the age column, as shown below:
Script 2:
sns.boxplot( y='age', data=titanic_data)
Output:
You can see that there are a few outliers in the form of black
dots at the upper end of the age distribution in the box plot.
To find the lower limit, multiply the IQR by 1.5 and subtract the
result from the first quartile value (0.25 quantile). To find the
upper limit, add the product of IQR and 1.5 to the third quartile
value (0.75 quantile). The IQR itself is calculated by subtracting
the first quartile value from the third quartile value.
The following script finds the lower and upper limits for the
outliers for the age column.
Script 3:
IQR = titanic_data["age"].quantile(0.75) - titanic_
data["age"].quantile(0.25)
print(lower_age_limit)
print(upper_age_limit)
Output:
-6.6875
64.8125
The output shows that any age value larger than 64.81 and
smaller than -6.68 will be considered an outlier. The following
script finds the rows containing the outlier values:
Script 4:
age_outliers = np.where(titanic_data["age"] > upper_age_limit,
True,
np.where(titanic_data["age"] < lower_age_limit,
True, False))
Script 5:
titanic_without_age_outliers = titanic_data.loc[~(age_outliers), ]
titanic_data.shape, titanic_without_age_outliers.shape
Output:
((891, 15), (880, 15))
Finally, you can plot a box plot to see if the outliers have
actually been removed.
Script 6:
sns.boxplot( y='age', data = titanic_without_age_outliers)
Output:
You can see from the above output that the dataset doesn’t
contain any outliers now.
Script 7:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
plt.rcParams["figure.figsize"] = [8,6]
sns.set_style("darkgrid")
titanic_data = sns.load_dataset('titanic')
The following script plots a box plot for the fare column of the
Titanic dataset.
Script 8:
sns.boxplot( y='fare', data=titanic_data)
Output:
You can see that the fare column has a very high variance
owing to the presence of a large number of outliers. Let’s plot
a distribution plot to see the histogram distribution of data in
the fare column.
Script 9:
sns.distplot(titanic_data['fare'])
The output shows that the data in the fare column is positively
skewed.
Output:
We will again use the IQR method to find the upper and lower
limits to find the outliers in the fare column.
Script 10:
IQR = titanic_data["fare"].quantile(0.75) - titanic_
data["fare"].quantile(0.25)
print(lower_fare_limit)
print(upper_fare_limit)
The output shows that any fare greater than 65.63 and less
than -26.72 is an outlier.
Output:
-26.724
65.6344
Script 11:
titanic_data["fare"]= np.where(titanic_data["fare"] > upper_
fare_limit, upper_fare_limit,
np.where(titanic_data["fare"] < lower_fare_limit,
lower_fare_limit, titanic_data["fare"]))
Let’s now plot a box plot to see if we still have any outliers in
the fare column.
Script 12:
sns.boxplot( y='fare', data=titanic_data)
Output:
The output shows that all the outliers have been removed.
Script 13:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
plt.rcParams["figure.figsize"] = [8,6]
sns.set_style("darkgrid")
titanic_data = sns.load_dataset('titanic')
Script 14:
sns.boxplot( y='age', data=titanic_data)
Output:
The following script finds the upper and lower thresholds for
the age column of the Titanic dataset using mean and standard
deviation capping.
Script 15:
lower_age_limit = titanic_data["age"].mean() - (3 * titanic_
data["age"].std())
upper_age_limit = titanic_data["age"].mean() + (3 * titanic_
data["age"].std())
print(lower_age_limit)
print(upper_age_limit)
Output:
-13.88037434994331
73.27860964406095
The output shows that the upper threshold value obtained via
the mean and standard deviation capping is 73.27, and the
lower limit or threshold is -13.88.
Script 16:
titanic_data["age"]= np.where(titanic_data["age"] > upper_age_
limit, upper_age_limit,
np.where(titanic_data["age"] < lower_age_limit,
lower_age_limit, titanic_data["age"]))
Script 17:
sns.boxplot( y='age', data=titanic_data)
The box plot shows that we still have some outlier values after
applying the mean and standard deviation capping on the age
column of the Titanic dataset.
Output:
Script 18:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
plt.rcParams["figure.figsize"] = [8,6]
sns.set_style("darkgrid")
titanic_data = sns.load_dataset('titanic')
The following script plots a box plot for the fare column of the
Titanic dataset.
Script 19:
sns.boxplot( y='fare', data=titanic_data)
Output:
The following script sets 0.05 as the lower limit and 0.95 as
the upper limit for the quantiles to find the outliers:
Script 20:
lower_fare_limit = titanic_data["fare"].quantile(0.05)
upper_fare_limit = titanic_data["fare"].quantile(0.95)
print(lower_fare_limit)
print(upper_fare_limit)
Output:
7.225
112.07915
Script 21:
titanic_data["fare"]= np.where(titanic_data["fare"] > upper_
fare_limit, upper_fare_limit,
np.where(titanic_data["fare"] < lower_fare_limit,
lower_fare_limit, titanic_data["fare"]))
The following script plots a box plot for the fare column of
the Titanic dataset after removing outliers using the quantile
method.
Script 22:
sns.boxplot( y='fare', data=titanic_data)
Output:
Script 23:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
plt.rcParams["figure.figsize"] = [8,6]
sns.set_style("darkgrid")
titanic_data = sns.load_dataset('titanic')
print(titanic_data.age.max())
print(titanic_data.age.min())
Output:
80.0
0.42
Script 24:
titanic_data["age"]= np.where(titanic_data["age"] > 50, 50,
np.where(titanic_data["age"] < 10, 10, titanic_
data["age"]))
Let’s now print the maximum and minimum values for the age
column of the Titanic dataset.
Script 25:
print(titanic_data.age.max())
print(titanic_data.age.min())
Output:
50.0
10.0
Script 26:
sns.boxplot( y='age', data=titanic_data)
Output:
Exercise 6.1
Question 1:
Question 2:
Question 3:
Exercise 6.2
On the price column of the following Diamonds dataset, apply
outlier capping via IQR. Display box plot for the price column
after outlier capping.
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
plt.rcParams["figure.figsize"] = [8,6]
sns.set_style("darkgrid")
diamond_data = sns.load_dataset('diamonds')
diamond_data.head()
7
Feature Scaling
7.1. Introduction
A dataset can have different attributes. The attributes can
have different magnitudes, variances, standard deviations,
mean values, etc. For instance, the salary can be in thousands,
whereas age is normally a two-digit number. The difference
in the scale or magnitude of attributes can actually affect
statistical models. For linear models, variables with a bigger
range dominate those with a smaller range. Similarly, the
gradient descent algorithm converges faster when variables
have similar scales. Feature magnitudes can also distort
Euclidean distances, as the sketch below illustrates.
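In this minimal sketch (the two (age, fare) pairs are made-up values, not rows from a real dataset), the distance between two passengers is dominated almost entirely by the fare difference because fare spans a much larger range than age:

import numpy as np
# Hypothetical (age, fare) pairs for two passengers
p1 = np.array([22, 7.25])
p2 = np.array([38, 512.33])
# The age difference (16) barely matters next to the fare difference (~505)
print(np.linalg.norm(p1 - p2))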
7.2. Standardization
Standardization is the process of centering a variable at
zero and standardizing the data variance to 1. To standardize
a dataset, you simply subtract the mean of the data from each
data point and divide the result by the standard deviation of
the data.
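As a minimal sketch of this formula (not one of the book's numbered scripts), you can standardize a column by hand and check that the result has mean 0 and standard deviation 1:

import seaborn as sns
titanic_data = sns.load_dataset('titanic')
fare = titanic_data['fare']
# Subtract the mean from each data point and divide by the standard deviation
fare_standardized = (fare - fare.mean()) / fare.std()
print(fare_standardized.mean())  # approximately 0
print(fare_standardized.std())   # 1.0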
Script 1:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
plt.rcParams["figure.figsize"] = [8,6]
sns.set_style("darkgrid")
titanic_data = sns.load_dataset('titanic')
titanic_data = titanic_data[["age", "fare", "pclass"]]
titanic_data.head()
Output:
Let’s see the mean, std, min, and max values for the age, fare,
and pclass columns.
Script 2:
titanic_data.describe()
Output:
You can see that the mean, min, and max values for the three
columns are very different.
Script 3:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(titanic_data)
titanic_data_scaled = scaler.transform(titanic_data)
Script 4:
titanic_data_scaled = pd.DataFrame(titanic_data_scaled,
columns = titanic_data.columns)
titanic_data_scaled.head()
You can see from the output that values have been scaled.
Output:
The following script plots a kernel density plot for the unscaled
columns.
Script 5:
sns.kdeplot(titanic_data['age'])
Output:
The following script plots a kernel density plot for the scaled
columns.
Script 6:
sns.kdeplot(titanic_data_scaled['age'])
Output:
Script 7:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaler.fit(titanic_data)
titanic_data_scaled = scaler.transform(titanic_data)
Script 8:
titanic_data_scaled = pd.DataFrame(titanic_data_scaled,
columns = titanic_data.columns)
titanic_data_scaled.head()
Output:
Let’s plot the kernel density plot to see if the data distribution
has changed or not.
Script 9:
sns.kdeplot(titanic_data_scaled['age'])
Output:
The following script calculates the mean values for all the
columns.
Script 10:
mean_vals = titanic_data.mean(axis=0)
mean_vals
Output:
age 29.699118
fare 32.204208
pclass 2.308642
dtype: float64
Script 11:
range_vals = titanic_data.max(axis=0) - titanic_data.min(axis=0)
range_vals
Output:
age 79.5800
fare 512.3292
pclass 2.0000
dtype: float64
Script 12:
titanic_data_scaled = (titanic_data - mean_vals) / range_vals
range_vals
Let’s plot the kernel density plot to see if the data distribution
has been affected or not. Execute the following script:
Script 13:
sns.kdeplot(titanic_data_scaled['age'])
Output:
The output shows that the data distribution has not been
affected.
Script 14:
from sklearn.preprocessing import MaxAbsScaler
scaler = MaxAbsScaler()
scaler.fit(titanic_data)
titanic_data_scaled = scaler.transform(titanic_data)
Script 15:
titanic_data_scaled = pd.DataFrame(titanic_data_scaled,
columns = titanic_data.columns)
titanic_data_scaled.head()
Output:
Let’s plot the kernel density plot to see if the data distribution
has been affected by absolute maximum scaling or not.
Execute the following script:
Script 16:
sns.kdeplot(titanic_data_scaled['age'])
Output:
To implement median and quantile scaling, you can use the
RobustScaler class from the sklearn.preprocessing module.
You have to pass the Pandas dataframe to the fit() method of
the class and then to the transform() method of the class. The
following script applies median and quantile scaling on the
age, fare, and pclass columns of the Titanic dataset.
Script 17:
from sklearn.preprocessing import RobustScaler
scaler = RobustScaler()
scaler.fit(titanic_data)
titanic_data_scaled = scaler.transform(titanic_data)
Script 18:
titanic_data_scaled = pd.DataFrame(titanic_data_scaled,
columns = titanic_data.columns)
titanic_data_scaled.head()
Output:
Script 19:
sns.kdeplot(titanic_data_scaled['age'])
Output:
Script 20:
from sklearn.preprocessing import Normalizer
titanic_data.dropna(inplace =True)
scaler = Normalizer(norm='l1')
scaler.fit(titanic_data)
titanic_data_scaled = scaler.transform(titanic_data)
Script 21:
titanic_data_scaled = pd.DataFrame(titanic_data_scaled,
columns = titanic_data.columns)
titanic_data_scaled.head()
Output:
Script 22:
sns.kdeplot(titanic_data_scaled['age'])
Output:
The output shows that the vector unit length scaling actually
changes the default data distribution.
Exercise 7.1
Question 1
Question 2
Question 3
Exercise 7.2
On the price column of the following Diamonds dataset,
apply min/max scaling. Display the Kernel density plot for
the price column after scaling.
diamond_data = sns.load_dataset('diamonds')
diamond_data.head()
8
Handling Mixed and
DateTime Variables
8.1. Introduction
In sections 2.4 and 2.5 of the second chapter, we briefly
reviewed the datetime and mixed variables. In this
chapter, you will see how to handle datetime data and mixed
variables.
Let’s first see the type of mixed variables that can contain
multiple values of different data types. Execute the following
script to import the required libraries.
Script 1:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
Script 2:
name = ['Jon', 'Nick', 'Ben', 'Sally', 'Alice', 'Josh']
education = [9, 'Graduate', 7, 'Graduate', 'PhD', 8]
std = {'name': name, 'Qualification': education}
student_df = pd.DataFrame(std)
student_df.head()
Output:
You can see from the above dataset that the Qualification
column contains integer as well as string values. For instance,
the first record contains 9 while the second record contains
Graduate. This is a type of mixed variable as it can contain data
of various types. One of the ways to handle such a variable is
to create a new column for each of the unique data types in
the original mixed variable. For instance, for the Qualification
column that contains either string or numerical data, we need
to create two columns: one that contains the categorical or
string values from the original mixed variable, and the other
that contains the numeric values from the original variable.
The following script creates a numeric column that will contain
the numeric values from the original mixed variable.
Script 3:
student_df['q_numeric'] = pd.to_numeric(student_df['Qualification'],
                                        errors='coerce',
                                        downcast='integer')
Script 4:
student_df['q_categoric'] = np.where(student_df['q_numeric'].isnull(),
                                     student_df['Qualification'],
                                     np.nan)
student_df.head()
Output:
In the output above, you can see that two new columns, “q_
numeric” and “q_categoric,” have been added to the dataset.
The “q_numeric” column contains the numeric values from the
original Qualification column (and NaN where the value was a
string), while the “q_categoric” column contains the string values.
Let's now look at mixed variables in a real-world dataset. The following script imports the Titanic dataset:
Script 5:
titanic_data = pd.read_csv("https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv")
titanic_data.head()
Output:
The output shows the first five rows of the Titanic dataset. The
Titanic dataset contains information about the passengers
who traveled in the famous Titanic ship that sank in 1912. The
Ticket and Cabin columns of the Titanic dataset contain mixed
values. Let’s filter these two columns and display the first five
rows.
Script 6:
titanic_data = titanic_data[['Ticket', 'Cabin']]
titanic_data.head()
Output:
From the output, you can clearly see that both the Ticket
and Cabin columns contain values that are a combination of
numbers and strings. To deal with such mixed variables, you
again need to create two new columns. One of the columns
will contain the numeric portions of the original mixed values,
while the other column will contain the categorical portions
of the original mixed values. Execute the following script to
create new columns for the Ticket mixed variable.
Script 7:
titanic_data['Ticket_Num'] = titanic_data['Ticket'].str.extract(r'(\d+)')
titanic_data['Ticket_Cat'] = titanic_data['Ticket'].str[0]
titanic_data.head()
Output:
From the output, you can see that the newly created “Ticket_
Num” column contains the numeric portion while the “Ticket_
Cat” column contains the first character from the original value.
It is important to mention that the regular expression (\d+)
captures only the first contiguous run of digits, so a space or
special character truncates the extracted value. Therefore,
instead of “5 21171” in the first row, you only see 5 in the
corresponding numeric column since there is a space after 5.
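If you instead want to keep every digit from a mixed value, one option (a sketch, not the approach used by the scripts above; the column name Ticket_AllDigits is made up for illustration) is to strip all non-digit characters first:
# sketch: remove every non-digit character so that all digits are
# kept, not only the first contiguous run
titanic_data['Ticket_AllDigits'] = titanic_data['Ticket'].str.replace(r'\D', '', regex=True)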
Let's now move on to datetime variables. The following script imports a dataset containing Tesla stock prices:
Script 8:
tesla_stock = pd.read_csv("https://raw.githubusercontent.com/plotly/datasets/master/tesla-stock-price.csv")
tesla_stock = tesla_stock.shift(-1)
tesla_stock.dropna(inplace = True)
tesla_stock.head()
Output:
In the output, the date column contains dates. However, by
default, the Pandas dataframe treats dates as string type data.
You need to tell the Pandas dataframe to treat the date column
as a datetime type. This will allow us to execute datetime
functions on the dataset. To convert string dates to datetime
values, you need to call the to_datetime() function and pass it
the column that contains the date type data, as shown below:
Script 9:
tesla_stock['date'] = pd.to_datetime(tesla_stock['date'])
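If your date strings are in a non-standard format, you can also pass an explicit format string to the to_datetime() function. The following minimal sketch uses a made-up day/month/year value:
# sketch: parsing an explicit date format (the example value is
# hypothetical)
print(pd.to_datetime(pd.Series(['15/10/2018']), format='%d/%m/%Y'))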
Once the column has the datetime type, you can extract the week number of the year via the dt accessor. Note that the dt.week attribute used in older Pandas versions has been removed; in recent versions, use dt.isocalendar().week instead:
Script 10:
tesla_stock['week'] = tesla_stock['date'].dt.isocalendar().week
tesla_stock[['date', 'week']].head()
Output:
Similarly, the dt.month attribute retrieves the month from the date:
Script 11:
tesla_stock['month'] = tesla_stock['date'].dt.month
tesla_stock[['date', 'month']].head()
Output:
And the dt.day attribute retrieves the day of the month:
Script 12:
tesla_stock['day_month'] = tesla_stock['date'].dt.day
tesla_stock[['date', 'day_month']].head()
Output:
If you want to retrieve the name of the day of the week using
the date as input, you can use the dt.day_name() method (the
older dt.weekday_name attribute has been removed from recent
Pandas versions), as shown in the below script:
Script 13:
tesla_stock['day_week'] = tesla_stock['date'].dt.day_name()
tesla_stock[['date', 'day_week']].head()
Output:
You can also subtract one date from another to get the time difference between them, as shown below:
Script 14:
diff = tesla_stock["date"].iloc[0] - tesla_stock["date"].iloc[4]
print(tesla_stock["date"].iloc[0])
print(tesla_stock["date"].iloc[4])
print(diff)
Output:
2018-10-15 00:00:00
2018-10-09 00:00:00
6 days 00:00:00
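The result of the subtraction is a Timedelta object. If you only need the number of days, you can read its days attribute (a quick sketch):
# sketch: a pandas Timedelta exposes its components directly
print(diff.days)  # 6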
To see how to extract time information, let's import a dataset that contains a timestamp column. The following script imports the bike dataset:
Script 15:
bike_data = pd.read_csv("https://raw.githubusercontent.com/QROWD/TR/master/datasets/bike.csv")
bike_data.dropna(inplace = True)
bike_data.head()
Output:
As before, first convert the timestamp column to the datetime type.
Script 16:
bike_data['timestamp'] = pd.to_datetime(bike_data['timestamp'])
Next, let’s print the hour, minute, and second information from
the timestamp column.
Script 17:
bike_data['hour'] = bike_data['timestamp'].dt.hour
bike_data['min'] = bike_data['timestamp'].dt.minute
bike_data['sec'] = bike_data['timestamp'].dt.second
bike_data.shift(-50).head(20)
Output:
In the output, you can see three new columns, hour, min,
and sec, that contain information about the hour, minute, and
second, respectively.
Finally, the dt.time attribute extracts the complete time portion from the timestamp.
Script 18:
bike_data['time'] = bike_data['timestamp'].dt.time
bike_data.shift(-50).head(20)
Output:
In the output, you can see that a new column, i.e., “time,” has
been added, which contains the time information only from
the “timestamp” column.
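Similarly, if you only need the date portion of the timestamp, the dt.date attribute works the same way (a minimal sketch; the column name date_only is made up for illustration):
# sketch: extract only the date portion from the timestamp column
bike_data['date_only'] = bike_data['timestamp'].dt.date
bike_data[['timestamp', 'date_only']].head()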
Exercise 8.1
Question 1:
Which function is used to convert a string type dataframe
column to datetime type?
A. convertToDate()
B. convertToDateTime()
C. to_datetime()
D. None of the above
Question 2:
Which attribute is used to find the day of the week from the
datetime type column?
A. dt.weekday_name
B. dt_day_week
C. dt_name_of_weekday
D. None of the above
Question 3:
Which attribute is used to find the time portion from a datetime
type column of a Pandas dataframe?
A. dt.get_time
B. dt.show_time
C. dt.time
D. dt.display_time
Exercise 8.2
9
Handling Imbalanced
Datasets
9.1. Introduction
An imbalanced dataset is a type of dataset where there
is a substantial mismatch between the number of records
belonging to different output classes. Imbalanced datasets
can greatly affect the performance of statistical models. In this
chapter, you will see how to balance the imbalanced datasets.
Script 1:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
plt.rcParams["figure.figsize"] = [8,6]
sns.set_style("darkgrid")
churn_data = pd.read_csv("https://raw.githubusercontent.com/albayraktaroglu/Datasets/master/churn.csv")
Next, let's drop some of the columns that we do not need in this chapter.
Script 2:
churn_data = churn_data.drop("State", axis = 1)
churn_data = churn_data.drop("Phone", axis = 1)
churn_data = churn_data.drop("VMail Plan", axis = 1)
churn_data = churn_data.drop("Int'l Plan", axis = 1)
Let’s see how our dataset looks now. Execute the following
script.
Script 3:
churn_data.head()
Output:
Script 4:
churn_data.shape
Output:
(3333, 17)
The above output shows that our dataset contains 3333 rows
and 17 columns.
Let’s now see the distribution of the data with respect to the
customers who churned and those who didn’t.
Script 5:
sns.countplot(x='Churn?', data=churn_data)
Output:
The output clearly shows that the False class is in a high
majority compared to the True class. This is a classic example
of imbalanced data, and such data can negatively affect the
performance of machine learning models. Let's see the actual
numbers for the distribution of customer churn.
Script 6:
churn_data["Churn?"].value_counts()
Output:
False. 2850
True. 483
Name: Churn?, dtype: int64
The output shows that 2,850 customers didn't churn, while 483
customers left the telecom company.
To apply down sampling and up sampling, let's first divide the dataset into records where churn is True and records where churn is False:
Script 7:
churn_true = churn_data[churn_data["Churn?"] == "True."]
churn_false = churn_data[churn_data["Churn?"] == "False."]
print(churn_true.shape)
print(churn_false.shape)
Output:
(483, 17)
(2850, 17)
Let's first apply down sampling, where the records from the majority class are reduced to match the number of records in the minority class. To do so, you can use the resample function from the sklearn.utils module, as shown below:
Script 8:
from sklearn.utils import resample
churn_falseds = resample(churn_false,
                         replace=True,
                         n_samples=len(churn_true),
                         random_state=27)
Now, if you look at the shape of churn_falseds, you will see
that it contains 483 records, which is equal to the size of the
minority class, as shown below.
Script 9:
churn_falseds.shape
Output:
(483, 17)
Next, concatenate the down sampled majority class with the minority class to obtain the balanced dataset:
Script 10:
churn_downsampled = pd.concat([churn_true, churn_falseds])
Now, if you plot the count plot for the Churn? column for the
newly balanced dataset, you should see equal bars for the two
classes.
Script 11:
sns.countplot(x='Churn?', data=churn_downsampled)
Output:
Finally, you can verify the count for both the False and True
classes using the value_counts function, as shown below:
Script 12:
churn_downsampled["Churn?"].value_counts()
Output:
True. 483
False. 483
Name: Churn?, dtype: int64
The output shows that both the False and True classes have
an equal number of records.
9.4. Up Sampling
In up sampling, you copy records from the minority class until
the number of minority records becomes equal to that of the
majority class, as shown in the following script:
Script 13:
from sklearn.utils import resample
churn_trueus = resample(churn_true,
                        replace=True,
                        n_samples=len(churn_false),
                        random_state=27)
Next, concatenate the up sampled minority class with the majority class:
Script 14:
churn_upsampled = pd.concat([churn_trueus, churn_false])
If you plot the count plot, you will see an equal number of
records in the Churn? column, as shown below:
Script 15:
sns.countplot(x='Churn?', data=churn_upsampled)
Output:
Finally, let’s find out the exact number of records where churn
is True or False.
Script 16:
churn_upsampled["Churn?"].value_counts()
Output:
False. 2850
True. 2850
Name: Churn?, dtype: int64
The output now shows that both the classes have an equal
number of records, i.e., 2,850.
Another approach to balancing a dataset is SMOTE (Synthetic
Minority Oversampling Technique), which generates synthetic
records for the minority class. You can use the imbalanced-learn
library to apply SMOTE for oversampling. Execute the following
command to install the imbalanced-learn library.
pip install imbalanced-learn
Before applying SMOTE, let's convert the string labels in the Churn? column to integers:
Script 17:
churn_data['Churn?'] = churn_data['Churn?'].map({'True.': 1, 'False.': 0})
Next, you need to divide the data into the feature set and the
output labels set. The following script divides the data into
the features and labels set. We first print the number of values
in both classes.
Script 18:
y = churn_data[["Churn?"]]
X = churn_data.drop("Churn?", axis = 1)
y["Churn?"].value_counts()
Output:
0 2850
1 483
Name: Churn?, dtype: int64
Next, to apply SMOTE, you can use the SMOTE class from the
imblearn.over_sampling module. You need to pass the feature
and label set to the fit_resample() method of the SMOTE
class object, as shown below:
Script 19:
# install imblearn using the following pip command
# pip install imbalanced-learn
from imblearn.over_sampling import SMOTE
sm = SMOTE(random_state=2)
X_us, y_us = sm.fit_resample(X, y)
Script 20:
y_us["Churn?"].value_counts()
Output:
1 2850
0 2850
Name: Churn?, dtype: int64
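To confirm visually that SMOTE balanced the two classes, you can plot the counts again (a sketch reusing the countplot approach from earlier in this chapter):
# sketch: count plot of the resampled labels; both bars should be equal
sns.countplot(x='Churn?', data=y_us)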
Exercise 9.1
Final Project – A Complete
Data Preprocessing Pipeline
1.1. Introduction
In the previous chapters, you saw various data preprocessing
and feature engineering techniques that can be used to
preprocess data for various purposes, e.g., machine learning
and deep learning. In this section, you will build a complete
data preprocessing pipeline that you can use to prepare your
data before it is fed to a statistical model. Before we go on
and build our data preprocessing pipeline, let's first review
the steps involved in creating a machine learning model and
see where the data preprocessing step comes into play.
1. Raw Data
2. Data Visualization
3. Data preprocessing
4. Feature Selection
5. Training Statistical Models
6. Model Deployment
Script 1:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
Script 2:
titanic_data = sns.load_dataset('titanic')
titanic_data.head()
Output:
For the sake of simplicity, we will only work with the following
eight columns of the Titanic dataset in this chapter. The
following script filters the columns specified in the "cols" list:
Script 3:
cols = [
'pclass', 'sex', 'age', 'sibsp', 'parch', 'fare','embarked',
'survived']
titanic_data = titanic_data[cols]
Script 4:
titanic_data.head()
Output:
Script 5:
titanic_data.dtypes
Output:
pclass int64
sex object
age float64
sibsp int64
parch int64
fare float64
embarked object
survived int64
dtype: object
Let’s now check out the ratio of missing values in all the
columns.
Script 6:
titanic_data.isnull().mean()
Output:
pclass 0.000000
sex 0.000000
age 0.198653
sibsp 0.000000
parch 0.000000
fare 0.000000
embarked 0.002245
survived 0.000000
dtype: float64
We will fit our preprocessing pipeline on the training set only
and then use the fitted pipeline to transform the test set.
Let's first divide the data into the training and test sets.
Script 7:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    titanic_data.drop('survived', axis=1),
    titanic_data['survived'],
    test_size=0.2,
    random_state=42)
X_train.shape, X_test.shape
Output:
((712, 7), (179, 7))
Next, let's create the pipeline itself. The pipeline below imputes the missing numeric values in the age and fare columns with an arbitrary number, imputes the missing categorical values in the embarked column, ordinally encodes the sex and embarked columns, and finally trains a random forest classifier. Note that the feature-engine module paths below assume feature-engine version 1.0 or later.
Script 8:
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
# module paths assume feature-engine >= 1.0
from feature_engine import imputation as miss_data_imput
from feature_engine import encoding as cat_encode

titanic_data_pipe = Pipeline([
    ('numerical_imputation', miss_data_imput.ArbitraryNumberImputer(
        arbitrary_number=-1, variables=['age', 'fare'])),
    ('categorical_imputation', miss_data_imput.CategoricalImputer(
        variables=['embarked'])),
    ('categorical_encoder', cat_encode.OrdinalEncoder(
        encoding_method='ordered', variables=['sex', 'embarked'])),
    ('rf', RandomForestClassifier(random_state=0))
])
Once you create the pipeline, the last step is to apply the
pipeline to the training set. To do so, you need to call the fit()
method, which applies all the steps in the pipeline in sequence
to the training set. Once the pipeline has been fitted to the
training set, you can make predictions on the training and test
sets using the predict() method, as shown below:
Script 9:
titanic_data_pipe.fit(X_train, y_train)
pred_X_train = titanic_data_pipe.predict(X_train)
pred_X_test = titanic_data_pipe.predict(X_test)
Script 10:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
print(confusion_matrix(y_test,pred_X_test))
print(classification_report(y_test,pred_X_test))
print(accuracy_score(y_test, pred_X_test))
Output:
[[90 15]
[19 55]]
precision recall f1-score support
0.8100558659217877
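As a quick overfitting check (a sketch, not part of the original scripts), you can compare the test accuracy above with the accuracy on the training set:
# sketch: a large gap between training and test accuracy suggests
# overfitting
print(accuracy_score(y_train, pred_X_train))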
Let's now build a similar pipeline for a regression problem, where we predict the price of a diamond. The following script imports the Diamonds dataset:
Script 11:
diamond_data = sns.load_dataset('diamonds')
diamond_data.head()
Output:
Script 12:
diamond_data.dtypes
Output:
carat float64
cut object
color object
clarity object
depth float64
table float64
price int64
x float64
y float64
z float64
dtype: object
The output shows that the cut, color, and clarity are categorical
columns in our dataset, while all of the remaining columns are
numeric.
Script 13:
diamond_data.isnull().mean()
Output:
carat 0.0
cut 0.0
color 0.0
clarity 0.0
depth 0.0
table 0.0
price 0.0
x 0.0
y 0.0
z 0.0
dtype: float64
The output shows that the dataset doesn’t contain any missing
values.
The next step is to divide the data into training and test sets.
Execute the following script to do so. The following script will
also display the shape of the training and test sets.
Script 14:
X_train, X_test, y_train, y_test = train_test_split(
diamond_data.drop('price', axis=1),
diamond_data['price'],
test_size=0.2,
random_state=42)
X_train.shape, X_test.shape
Output:
((43152, 9), (10788, 9))
The output shows that the training set contains 43,152 records,
while the test set contains 10,788 records.
Since the Diamonds dataset contains no missing values, the pipeline only needs to encode the categorical columns before training a random forest regressor:
Script 15:
from sklearn.ensemble import RandomForestRegressor

diamond_data_pipe = Pipeline([
    ('categorical_encoder', cat_encode.OrdinalEncoder(
        encoding_method='ordered', variables=['cut', 'color', 'clarity'])),
    ('rf', RandomForestRegressor(random_state=42))
])
Script 16:
diamond_data_pipe.fit(X_train, y_train)
pred_X_train = diamond_data_pipe.predict(X_train)
pred_X_test = diamond_data_pipe.predict(X_test)
Script 17:
from sklearn import metrics
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, pred_X_test))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, pred_X_test))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, pred_X_test)))
Output:
Mean Absolute Error: 271.5354506226789
Mean Squared Error: 308115.6650668038
Root Mean Squared Error: 555.0816742307422
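To put these error values in perspective (a minimal sketch), you can compare the root mean squared error with the average diamond price:
# sketch: compare the RMSE with the mean of the target column
print(diamond_data['price'].mean())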
Exercise 2.1
Question 1:
Question 2:
Question 3:
Question 4:
Answer: size
Question 5:
Question 6:
Exercise 3.1
Question 1:
Question 2:
Question 3:
Exercise 3.2
Solution:
import matplotlib.pyplot as plt
import seaborn as sns
plt.rcParams["figure.figsize"] = [8,6]
sns.set_style("darkgrid")
titanic_data = sns.load_dataset('titanic')
titanic_data = titanic_data[["deck"]]
titanic_data.head()
titanic_data.isnull().mean()
titanic_data.deck.value_counts().sort_values(ascending=False).plot.bar()
plt.xlabel('deck')
plt.ylabel('Number of Passengers')
titanic_data.deck.mode()
titanic_data.deck.fillna('C', inplace=True)
titanic_data.deck.value_counts().sort_values(ascending=False).plot.bar()
plt.xlabel('deck')
plt.ylabel('Number of Passengers')
Exercise 4.1
Question 1:
Question 2:
Question 3:
Exercise 4.2
Solution:
import matplotlib.pyplot as plt
import seaborn as sns
plt.rcParams["figure.figsize"] = [8,6]
sns.set_style("darkgrid")
titanic_data = sns.load_dataset('titanic')
titanic_data.head()
value_counts = titanic_data['class'].value_counts().to_dict()
print(value_counts)
titanic_data['class_freq'] = titanic_data['class'].map(value_counts)
titanic_data.head()
Exercise 5.1
Question 1:
Which of the following discretization schemes is supervised?
A. K Means Discretization
B. Decision Tree Discretization
C. Equal Width Discretization
D. Equal Frequency Discretization
Answer: B
Question 2:
Which of the following discretization schemes generates bins
of equal size?
A. K Means Discretization
B. Decision Tree Discretization
C. Equal Frequency Discretization
D. None of the Above
Answer: D
Question 3:
Which of the following discretization schemes generates bins
containing an equal number of samples?
A. K Means Discretization
B. Decision Tree Discretization
C. Equal Frequency Discretization
D. Equal Width Discretization
Answer: C
Exercise 5.2
Solution:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.tree import DecisionTreeClassifier
sns.set_style("darkgrid")
tips_data = sns.load_dataset('tips')
tips_data.head()

# the discretiser and bin definitions were missing from the original
# text; the lines below are a sketch that assumes 10 equal-width bins
discretiser = KBinsDiscretizer(n_bins=10, encode='ordinal', strategy='uniform')
discretised_bill = pd.DataFrame(
    discretiser.fit_transform(tips_data[['total_bill']]),
    columns=['discretised_bill'])
pd.concat([discretised_bill, tips_data['total_bill']], axis=1).head(10)

# assumed: equal-width bin edges and descriptive labels
bins = np.linspace(tips_data['total_bill'].min(), tips_data['total_bill'].max(), 11)
bin_labels = ['bin_' + str(i) for i in range(1, 11)]
tips_data['bill_bins'] = pd.cut(x=tips_data['total_bill'], bins=bins,
                                labels=bin_labels, include_lowest=True)
tips_data.head(10)
tips_data.groupby('bill_bins')['total_bill'].count().plot.bar()
plt.xticks(rotation=45)
Exercise 6.1
Question 1:
Which of the following techniques can be used to remove
outliers from a dataset?
A. Trimming
B. Censoring
C. Discretization
D. All of the above
Answer: D
Question 2:
What multiple of the IQR is normally used to cap outliers via
IQR?
A. 2.0
B. 3.0
C. 1.5
D. 1.0
Answer: C
Question 3:
How many standard deviations from the mean are normally used
to cap outliers via the mean and standard deviation method?
A. 2.0
B. 3.0
C. 1.5
D. 1.0
Answer: B
Exercise 6.2
diamond_data = sns.load_dataset('diamonds')
diamond_data.head()
Solution:
IQR = diamond_data["price"].quantile(0.75) - diamond_data["price"].quantile(0.25)
lower_price_limit = diamond_data["price"].quantile(0.25) - (IQR * 1.5)
upper_price_limit = diamond_data["price"].quantile(0.75) + (IQR * 1.5)
print(lower_price_limit)
print(upper_price_limit)
Exercise 7.1
Question 1:
After standardization, the mean value of the dataset becomes:
A. 1
B. 0
C. -1
D. None of the above
Answer: B
Question 2:
What is the formula to apply mean normalization on the
dataset?
A. (values - mean) / (max - min)
B. (value) / (max - min)
C. (value) / (max)
D. None of the above
Answer: A
Question 3:
The formula value/max(values) is used to implement
A. Min/Max Scaling
B. Maximum Absolute Scaling
C. Standardization
D. Mean Normalization
Answer: B
Exercise 7.2
diamond_data = sns.load_dataset('diamonds')
diamond_data = diamond_data[['price']]
diamond_data.head()
Solution:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaler.fit(diamond_data)
diamond_data_scaled = scaler.transform(diamond_data)
diamond_data_scaled = pd.DataFrame(diamond_data_scaled,
columns = diamond_data.columns)
diamond_data_scaled.head()
sns.kdeplot(diamond_data_scaled['price'])
Exercise 8.1
Question 1:
Which function is used to convert string type dataframe
column to datetime type?
A. convertToDate()
B. convertToDateTime()
C. to_datetime()
D. None of the above
Answer: C
Question 2:
Which attribute is used to find the day of the week from the
datetime type column?
A. dt.weekday_name
B. dt_day_week
C. dt_name_of_weekday
D. None of the above
Answer: A
Question 3:
Which attribute is used to find the time portion from a datetime
type column of a Pandas dataframe?
A. dt.get_time
B. dt.show_time
C. dt.time
D. dt.display_time
Answer: C
Exercise 8.2
titanic_data = pd.read_csv("https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv")
titanic_data.dropna(inplace = True)
titanic_data.head()
Solution:
titanic_data = titanic_data[['Ticket', 'Cabin']]
titanic_data.head()
Exercise 9.1
churn_data = pd.read_csv("https://raw.githubusercontent.com/IBM/xgboost-smote-detect-fraud/master/data/creditcard.csv")
churn_data.head()
Solution:
y = churn_data[["Class"]]
X = churn_data.drop("Class", axis = 1)
y["Class"].value_counts()
from imblearn.over_sampling import SMOTE
sm = SMOTE(random_state=2)
X_us, y_us = sm.fit_resample(X, y)
y_us["Class"].value_counts()