AI Publishing. Python Scikit-Learn For Beginners... For Data Scientist 2021
Edited by AI Publishing
eBook Converted and Cover by Gazler Studio
Published by AI Publishing LLC
ISBN-13: 978-1-7347901-5-3
Legal Notice:
You are not permitted to amend, use, distribute, sell, quote, or paraphrase any part
of the content within this book without the specific consent of the author.
Disclaimer Notice:
Kindly note that the information contained within this document is solely for
educational and entertainment purposes. No warranties of any kind are indicated or
expressed. Readers accept that the author is not providing any legal, professional,
financial, or medical advice. Kindly consult a licensed professional before trying out
any techniques explained in this book.
By reading this document, the reader consents that under no circumstances is the
author liable for any losses, direct or indirect, that are incurred as a consequence of
the use of the information contained within this document, including, but not
restricted to, errors, omissions, or inaccuracies.
How to Contact Us
https://www.aispublishing.net/book-pdssl
Preface
Book Approach
Who Is This Book For?
How to Use This Book?
Chapter 1: Introduction
1.1. What Is Machine Learning and Data Science?
1.2. Where Does Scikit-Learn Fit In?
1.3. Other Machine Learning Libraries
1.3.1. NumPy
1.3.2. Matplotlib
1.3.3. Seaborn
1.3.4. Pandas
1.3.5. TensorFlow
1.3.6. Keras
1.4. What’s Ahead?
Exercise Solutions
Exercise 2.1
Exercise 2.2
Exercise 3.1
Exercise 3.2
Exercise 4.1
Exercise 4.2
Exercise 5.1
Exercise 5.2
Exercise 6.1
Exercise 6.2
Exercise 7.1
Exercise 7.2
Exercise 8.1
Exercise 8.2
Exercise 9.1
Exercise 9.2
Exercise 10.1
Exercise 10.2
Exercise 11.1
Exercise 11.2
Preface
§ Book Approach
The book follows a very simple approach. It is divided into 11
chapters. Chapter 1 provides a very brief introduction to data
science and machine learning and the use of Python’s Scikit-
learn library for machine learning. The process for
environment setup, including the software needed to run the
scripts in this book, is explained in Chapter 2. Chapter 2 also
contains a crash course on Python for beginners. If you are
already familiar with Python, you can skip chapter 2.
In this chapter, you will study what machine learning and data
science are, how they differ, and the steps you need to take
to become a machine learning and data science
expert. Since this book is about the use of the Scikit-learn
library for machine learning, this chapter also briefly reviews
what Scikit-learn is and what you can do with it. Finally, some
of the other most commonly used machine learning libraries
have also been introduced.
1.3.1 NumPy
NumPy is one of the most commonly used libraries for
numeric and scientific computing. NumPy is extremely fast
and contains support for multiple mathematical domains, such
as linear algebra, geometry, etc. It is extremely important to
learn NumPy if you plan to make a career in data science
and data preparation.
1.3.2 Matplotlib
Matplotlib is the de facto standard for static data visualization
in Python, which is the first step in data science and machine
learning. Being the oldest data visualization library in Python,
Matplotlib is the most widely used data visualization library.
Matplotlib was developed to resemble MATLAB, which is one
of the most widely used programming languages in academia.
While Matplotlib graphs are easy to plot, their default look
and feel is dated, reminiscent of the 1990s. Many
wrapper libraries, such as Pandas and Seaborn, have been
developed on top of Matplotlib. These libraries allow users to
plot much cleaner and more sophisticated graphs.
1.3.3 Seaborn
The Seaborn library is built on top of the Matplotlib library and
inherits all of Matplotlib's plotting capabilities. However,
with Seaborn, you can plot far more aesthetically pleasing
graphs with the help of Seaborn's default styles and color
palettes.
1.3.5 TensorFlow
TensorFlow is one of the most commonly used libraries for
deep learning. TensorFlow has been developed by Google and
offers an easy-to-use API for the development of various deep
learning models. TensorFlow is consistently being updated,
and at the time of writing of this book, TensorFlow 2 is the
latest major release of TensorFlow. With TensorFlow, you can
not only develop deep learning applications with ease but
also deploy them easily, owing to TensorFlow's built-in
deployment functionalities.
1.3.6 Keras
Keras is a high-level TensorFlow library that implements
complex TensorFlow functionalities under the hood. If you are
new to deep learning, Keras is the one deep learning library
that you should start with when developing deep learning
applications. As a matter of fact, Keras has been adopted as the
official deep learning library for TensorFlow 2.0, and now all
the TensorFlow applications use Keras abstractions for
training deep learning models.
9. Click Continue on the next window. You also have the option
to install Microsoft VSCode at this point.
The next screen will display the message that the installation
has been completed successfully. Click on the Close button to
close the installer.
$ cd /tmp
$ curl -O https://repo.anaconda.com/archive/Anaconda3-5.2.0-Linux-x86_64.sh
$ sha256sum Anaconda3-5.2.0-Linux-x86_64.sh
09f53738b0cd3bb96f5b1bac488e5528df9906be2480fe61df40e0e0d19e3d48
Anaconda3-5.2.0-Linux-x86_64.sh
$ bash Anaconda3-5.2.0-Linux-x86_64.sh
The command line will produce the following output. You
will be asked to review the license agreement. Keep on
pressing Enter until you reach the end.
Output
Output
[/home/tola/anaconda3] >>>
Output
…
Installation finished.
Do you wish the installer to prepend Anaconda3 install location to path
in your /home/tola/.bashrc? [yes|no]
[no]>>>
$ source ~/.bashrc
8. You can also test the installation using the conda command.
$ conda list
https://colab.research.google.com/
With Google Colab, you can import datasets from your
Google Drive. Execute the following script, and click on the
link that appears, as shown below:
Copy the link, and paste it in the empty field in the Google
Colab cell, as shown below:
This way, you can import datasets from your Google drive to
your Google Colab environment.
2.2. Python Crash Course
If you are familiar with the basic concepts of the Python
programming language, you can skip this section. For those
who are absolute beginners to Python, this section provides a
very brief overview of some of the most basic concepts of
Python. Python is a very vast programming language, and this
section is by no means a substitute for a complete Python
book. However, if you want to see how various operations and
commands are executed in Python, you are welcome to follow
along with the rest of this section.
Output:
Welcome to Data Visualization with Python
a. Strings
b. Integers
c. Floating Point Numbers
d. Booleans
e. Lists
f. Tuples
g. Dictionaries
Script 2:
Output:
<class ‘str’>
<class ‘int’>
<class ‘float’>
<class ‘bool’>
<class ‘list’>
<class ‘tuple’>
<class ‘dict’>
a. Arithmetic Operators
b. Logical Operators
c. Comparison Operators
d. Assignment Operators
e. Membership Operators
§ Arithmetic Operators
Arithmetic operators are used to perform arithmetic
operations in Python. The following table summarizes the
arithmetic operators supported by Python. Suppose X = 20,
and Y = 10.
Script 3:
Output:
30
10
200
2.0
10240000000000
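As a point of reference, a minimal sketch of an arithmetic-operators script that produces output like the above (with X = 20 and Y = 10) could look like this:

X = 20
Y = 10

print(X + Y)   # addition: 30
print(X - Y)   # subtraction: 10
print(X * Y)   # multiplication: 200
print(X / Y)   # division: 2.0
print(X ** Y)  # exponentiation: 10240000000000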
§ Logical Operators
Logical operators are used to perform logical AND, OR, and
NOT operations in Python. The following table summarizes the
logical operators. Here, X is True, and Y is False.
Script 4:
Output:
1. False
2. True
3. True
§ Comparison Operators
Comparison operators, as the name suggests, are used to
compare two or more than two operands. Depending upon
the relation between the operands, comparison operators
return Boolean values. The following table summarizes
comparison operators in Python. Here, X is 20, and Y is 35.
Script 5
Output:
False
True
False
True
False
True
§ Assignment Operators
Assignment operators are used to assign values to variables.
The following table summarizes the assignment operators.
Here, X is 20, and Y is equal to 10.
Take a look at script 6 to see Python assignment operators in
action.
Script 6:
Output:
30
30
10
200
2.0
0
10240000000000
§ Membership Operators
Membership operators are used to find if an item is a member
of a collection of items or not. There are two types of
membership operators: the in operator and the not in
operator. The following script shows the in operator in action.
Script 7:
Output:
True
Script 8:
Output:
True
a. If statement
b. If-else statement
c. If-elif statement
§ IF Statement
If you have to check for a single condition and you are not
concerned about the alternate condition, you can use the if
statement. For instance, if you want to check if 10 is greater
than 5, and based on that, you want to print a statement, you
can use the if statement. The condition evaluated by the if
statement returns a Boolean value.
Script 9:
Output:
10 is greater than 5
§ IF-Else Statement
The If-else statement comes in handy when you want to
execute an alternate piece of code in case the condition for
the if statement returns false. For instance, in the following
example, the condition 5 > 10 will return false. Hence, the code
block that follows the else statement will execute.
Script 10:
Output:
10 is greater than 5
§ IF-Elif Statement
The if-elif statement comes handy when you have to evaluate
multiple conditions. For instance, in the following example, we
first check if 5 > 10, which evaluates to false. Next, an elif
statement evaluates the condition 8 < 4, which also returns
false. Hence, the code block that follows the last else
statement executes.
Script 11:
Output:
5 is not greater than 10 and 8 is not smaller than 4
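For reference, a minimal sketch of an if-elif-else block consistent with the example described above could look like this:

if 5 > 10:
    print("5 is greater than 10")
elif 8 < 4:
    print("8 is smaller than 4")
else:
    # both conditions are false, so this block executes
    print("5 is not greater than 10 and 8 is not smaller than 4")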
a. For Loop
b. While Loop
§ For Loop
The for loop is used to iteratively execute a piece of code a
certain number of times. You should use for loop when you
know the exact number of iterations or repetitions for which
you want to run your code. A for loop iterates over a
collection of items. In the following example, we create a
collection of five integers using the range() method. Next, a
for loop iterates five times and prints each integer in the
collection.
Script 12:
Output:
0
1
2
3
4
§ While Loop
The while loop keeps executing a certain piece of code unless
the evaluation condition becomes false. For instance, the
while loop in the following script keeps executing unless the
variable c becomes greater than 10.
Script 13:
Output:
0
1
2
3
4
5
6
7
8
9
2.2.6 Functions
Functions in any programming language are used to
implement the piece of code that is required to be executed
multiple times at different locations in the code. In such cases,
instead of writing long pieces of code, again and again, you
can simply define a function that contains the piece of code,
and then you can call the function wherever you want in the
code.
Script 14:
Output:
This is a simple function
You can also pass values to a function. The values are passed
inside the parentheses of the function call. However, you must
specify the parameter name in the function definition, too. In
the following script, we define a function named
myfuncparam(). The function accepts one parameter, i.e.,
num. The value passed in the parentheses of the function call
will be stored in this num variable and will be printed by the
print() method inside the myfuncparam() method.
Script 15:
Output:
This is a function with parameter value: Parameter 1
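A minimal sketch of such a parameterized function, consistent with the output above, might look like this:

def myfuncparam(num):
    # num receives whatever value is passed in the function call
    print("This is a function with parameter value: " + num)

myfuncparam("Parameter 1")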
Script 16:
Output:
This function returns a value
Script 17:
Output:
Fruit has been eaten apple
10
Script 18:
Output:
Fruit has been eaten Orange
15
B. While Loop
C. Both A & B
D. None of the above
Question 2
What is the maximum number of values that a function can
return in Python?
A. Single Value
B. Double Value
Question 3
Which of the following membership operators are supported
by Python?
A. In
B. Out
C. Not In
D. Both A and C
Exercise 2.2.
Print the table of integer 9 using a while loop:
Data Preprocessing with Scikit-Learn
3.1.1 Standardization
Script 1:
Output:
Let’s see the mean, std, min and max values for the age, fare,
and pclass columns.
Script 2:
Output:
You can see that the mean, min and max values for the three
columns are very different.
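As a hedged sketch of how such columns can be brought to the same scale with Scikit-learn's StandardScaler (dropping missing values first to keep the example simple):

import seaborn as sns
from sklearn.preprocessing import StandardScaler

titanic_data = sns.load_dataset('titanic')[['age', 'fare', 'pclass']].dropna()

scaler = StandardScaler()
scaled_values = scaler.fit_transform(titanic_data)

# after standardization, each column has mean ~0 and standard deviation ~1
print(scaled_values.mean(axis=0))
print(scaled_values.std(axis=0))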
Script 3:
Script 4:
You can see from the output that values have been scaled.
Output:
Script 5:
Output:
The following script plots a kernel density plot for the scaled
columns.
Script 6:
Output:
Script 7:
Output:
Let’s plot the kernel density plot to see if the data distribution
has changed or not.
Script 9:
Output:
The following script calculates the mean values for all the
columns.
Script 10:
Output:
age 29.699118
fare 32.204208
pclass 2.308642
dtype: float64
Script 11:
Output:
age 79.5800
fare 512.3292
pclass 2.0000
dtype: float64
Script 12:
Let’s plot the kernel density plot to see if the data distribution
has been affected or not. Execute the following script:
Script 13:
Output:
The output shows that the data distribution has not been
affected.
2. Missing Data Not at Random: In this case, you can attribute the
missing data to a logical reason. For instance, research
shows that depressed patients are more likely to leave fields
empty in forms compared to patients who are not
depressed. Therefore, the missing data is not random; there
is an established reason for it.
Script 14:
Output:
Let’s filter some of the numeric columns from the dataset and
see if they contain any missing values.
Script 15:
Output:
Script 16:
Output:
survived 0.000000
pclass 0.000000
age 0.198653
fare 0.000000
dtype: float64
The output shows that only the age column contains missing
values. And the ratio of missing values is around 19.86 percent.
Let’s now find out the median and mean values for all the non-
missing values in the age column.
Script 17:
Output:
28.0
29.69911764705882
To plot the kernel density plots for the actual age and median
and mean age, we will add columns to the Pandas dataframe.
Script 18:
The above script adds Median_Age and Mean_Age columns
to the titanic_data dataframe and prints the first 20 records.
Here is the output of the above script:
Output:
The highlighted rows in the above output show that NaN, i.e.,
null values in the age column have been replaced by the
median values in the Median_Age column and by mean values
in the Mean_Age columns.
You can clearly see that the original distribution of values in
the age column has been distorted by mean and median
imputation, and the overall variance of the dataset has also decreased.
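A minimal sketch of this kind of median and mean imputation, using the column names mentioned above, could be:

import seaborn as sns

titanic_data = sns.load_dataset('titanic')

median_age = titanic_data['age'].median()   # 28.0
mean_age = titanic_data['age'].mean()       # about 29.7

# fill the missing ages with the median and mean, in two new columns
titanic_data['Median_Age'] = titanic_data['age'].fillna(median_age)
titanic_data['Mean_Age'] = titanic_data['age'].fillna(mean_age)
print(titanic_data[['age', 'Median_Age', 'Mean_Age']].head(20))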
Upper IQR Limit = 75th Quantile + 1.5 x IQR
Lower IQR Limit = 25th Quantile - 1.5 x IQR
Script 20:
Output:
survived 0.000000
pclass 0.000000
age 0.198653
fare 0.000000
dtype: float64
The above output shows that only the age column has missing
values, which are around 20 percent of the whole dataset.
The next step is plotting the data distribution for the age
column. A histogram can reveal the data distribution of a
column.
Script 21:
Output:
The output shows that the age column has an almost normal
distribution. Hence, the end of distribution value can be
calculated by adding three standard deviations to the mean
value of the age column, as shown in the following script:
Script 22:
Output:
73.278
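A sketch of this end-of-distribution calculation and imputation (the age_eod column name is only illustrative):

import seaborn as sns

titanic_data = sns.load_dataset('titanic')

eod_value = titanic_data['age'].mean() + 3 * titanic_data['age'].std()
print(eod_value)   # about 73.27

# replace missing ages with the end-of-distribution value
titanic_data['age_eod'] = titanic_data['age'].fillna(eod_value)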
Script 23:
Output:
The above output shows that the end of distribution value, i.e.,
73.27, has replaced the NaN values in the age column.
Finally, you can plot the kernel density estimation plot for the
original age column and the age column with the end of
distribution imputation.
Script 24:
Output:
We will again use the Titanic dataset. We will first try to find
the percentage of missing values in the age, fare, and
embarked_town columns.
Script 25:
Output:
embark_town 0.002245
age 0.198653
fare 0.000000
dtype: float64
Script 26:
Output:
Let's verify that Southampton is actually the mode value for
the embark_town column.
Script 27:
Output:
0 Southampton
dtype: object
Script 28:
Let’s now find the mode of the age column and use it to
replace the missing values in the age column.
Script 29:
Output:
24.0
The output shows that the mode of the age column is 24.
Therefore, we can use this value to replace the missing values
in the age column.
Script 30:
Output:
Finally, let’s plot the kernel density estimation plot for the
original age column and the age column that contains the
mode of the values in place of the missing values.
Script 31:
Output:
Let’s load the Titanic dataset and see if any categorical value
contains missing values.
Script 32:
Output:
embark_town 0.002245
age 0.198653
fare 0.000000
dtype: float64
Script 33:
Script 34:
Output:
3.3. Categorical Data Encoding
Models based on statistical algorithms, such as machine
learning and deep learning, work with numbers. However,
datasets can contain numerical, categorical, date time, and
mixed variables. A mechanism is needed to convert
categorical data to its numeric counterpart so that the data
can be used to build statistical models. The techniques used to
convert categorical data into numeric data are called
categorical data encoding schemes. In this section, you will
see some of the most commonly used categorical data
encoding schemes.
As a matter of fact, you only need N-1 columns in the one hot
encoded dataset for a column that originally contained N
unique labels. Look at the following table:
Script 35:
Output:
Script 36:
Output:
Let’s print the unique values in the three columns in the
titanic_data dataframe.
Script 37:
Output:
[‘male’ ‘female’] [Third, First, Second]
Categories (3, object): [Third, First, Second]
[‘Southampton’ ‘Cherbourg’ ‘Queenstown’ nan]
Script 38:
In the output, you will see two columns, one for males and one
for females.
Output:
Let’s display the actual sex name and the one hot encoded
version for the sex column in the same dataframe.
Script 39:
Output:
From the above output, you can see that in the first row, 1 has
been added in the male column because the actual value in
the sex column is male. Similarly, in the second row, 1 is added
to the female column since the actual value in the sex column
is female.
Script 40:
Output:
As you saw earlier, you can have N-1 one hot encoded
columns for the categorical column that contains N unique
labels. You can remove the first column created by the
get_dummies() method by passing True as the value for the
drop_first parameter, as shown below:
Script 41:
Output:
Also, you can create one hot encoded columns for null values
in the actual column by passing True as a value for the
dummy_na parameter.
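A sketch of one hot encoding with the get_dummies() method, covering the drop_first and dummy_na options discussed above:

import pandas as pd
import seaborn as sns

titanic_data = sns.load_dataset('titanic')

# one column per category of the sex column
print(pd.get_dummies(titanic_data['sex']).head())

# N-1 columns: the first category is dropped
print(pd.get_dummies(titanic_data['sex'], drop_first=True).head())

# an extra indicator column for missing values
print(pd.get_dummies(titanic_data['embark_town'], dummy_na=True).head())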
Script 42:
Output:
Script 43:
Output:
From the above output, you can see that class Third has been
labeled as 2, the class First is labeled as 0, and so on. It is
important to mention that label encoding starts from 0.
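A sketch of label encoding with Scikit-learn's LabelEncoder, consistent with the labels described above (classes are sorted alphabetically, so First becomes 0 and Third becomes 2); the class_encoded column name is only illustrative:

import seaborn as sns
from sklearn.preprocessing import LabelEncoder

titanic_data = sns.load_dataset('titanic')

le = LabelEncoder()
titanic_data['class_encoded'] = le.fit_transform(titanic_data['class'].astype(str))

print(le.classes_)                                     # ['First' 'Second' 'Third']
print(titanic_data[['class', 'class_encoded']].head())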
Script 44:
Output:
The output shows that the dataset contains 10 columns. We
will only perform discretization on the price column. Let’s first
plot a histogram for the price column.
Script 45:
Output:
The histogram for the price column shows that our dataset is
positively skewed. We can use discretization on this type of
data distribution.
Script 46:
Output:
18497
Script 47:
Output:
1849.7
The minimum price will be rounded down to the floor, and the
maximum price will be rounded up to the ceiling. The bin
interval will be rounded off to the nearest integer value. The
following script does that:
Script 48:
Output:
326
18823
1850
Next, let’s create the 10 bins for our dataset. To create bins,
we will start with the minimum value and add the bin interval
or length to it. To get the second interval, the interval length
will be added to the upper boundary of the first interval and
so on. The following script creates 10 equal width bins.
Script 49:
Output:
[326, 2176, 4026, 5876, 7726, 9576, 11426, 13276, 15126, 16976, 18826]
Next, we will create string labels for each bin. You can give
any name to the bin labels.
Script 50:
Output:
['Bin_no_1', 'Bin_no_2', 'Bin_no_3', 'Bin_no_4', 'Bin_no_5', 'Bin_no_6',
'Bin_no_7', 'Bin_no_8', 'Bin_no_9', 'Bin_no_10']
Script 51:
Output:
In the above output, you can see that a column price_bins has
been added that shows the bin value for the price.
Next, let’s plot a bar plot that shows the frequency of prices in
each bin.
Script 52:
Output:
The output shows that the price of most of the diamonds lies
in the first bin or the first interval.
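A sketch of equal-width binning with pd.cut(), following the bin edges and labels described above:

import numpy as np
import pandas as pd
import seaborn as sns

diamonds = sns.load_dataset('diamonds')

# 11 equally spaced edges between the rounded minimum and maximum price -> 10 bins
bin_edges = np.linspace(np.floor(diamonds['price'].min()),
                        np.ceil(diamonds['price'].max()), 11)
bin_labels = ['Bin_no_' + str(i) for i in range(1, 11)]

diamonds['price_bins'] = pd.cut(diamonds['price'], bins=bin_edges,
                                labels=bin_labels, include_lowest=True)
print(diamonds['price_bins'].value_counts())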
Script 53:
Output:
Script 54:
Output:
To see the bin intervals, simply print the bins returned by the
“qcut()” function, as shown below:
Script 55:
Output:
[326. 646. 837. 1087. 1698. 2401. 3465. 4662. 6301.2 9821. 18823.]
<class ‘numpy.ndarray’>
Next, let’s find the number of records per bin. Execute the
following script:
Script 56:
Output:
(325.999, 646.0] 5411
(1698.0, 2401.0] 5405
(837.0, 1087.0] 5396
(6301.2, 9821.0] 5395
(3465.0, 4662.0] 5394
(9821.0, 18823.0] 5393
(4662.0, 6301.2] 5389
(1087.0, 1698.0] 5388
(646.0, 837.0] 5385
(2401.0, 3465.0] 5384
Name: price, dtype: int64
From the output, you can see that all the bins have more or
less the same number of records. This is what equal frequency
discretization does, i.e., create bins with an equal number of
records.
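A sketch of equal-frequency binning with pd.qcut(), which returns roughly the same number of records per bin:

import pandas as pd
import seaborn as sns

diamonds = sns.load_dataset('diamonds')

# 10 quantile-based bins; retbins=True also returns the computed bin edges
price_bins, bin_edges = pd.qcut(diamonds['price'], q=10, retbins=True)

print(bin_edges)
print(price_bins.value_counts())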
Script 57:
Output:
['Bin_no_1', 'Bin_no_2', 'Bin_no_3', 'Bin_no_4', 'Bin_no_5', 'Bin_no_6',
'Bin_no_7', 'Bin_no_8', 'Bin_no_9', 'Bin_no_10']
Script 58:
Output:
In the output above, you can see a new column, i.e.,
price_bins. This column contains equal frequency discrete bin
labels.
Script 59:
Output:
You can see that the number of records is almost the same for
all the bins.
4. You can cap or censor the outliers and replace them with
maximum and minimum values that can be found via several
techniques.
Let’s remove the outliers from the age column of the Titanic
dataset. The Titanic dataset contains records of the
passengers who traveled on the unfortunate Titanic ship that
sank in 1912. The following script imports the Titanic dataset
from the Seaborn library.
Script 60:
The first five rows of the Titanic dataset look like this.
Output:
To visualize the outliers, you can simply plot the box plot for
the age column, as shown below:
Script 61:
Output:
You can see that there are a few outliers in the form of black
dots at the upper end of the age distribution in the box plot.
The following script finds the lower and upper limits for the
outliers for the age column.
Script 62:
Output:
-6.6875
64.8125
The output shows that any age value larger than 64.81 and
smaller than –6.68 will be considered an outlier. The following
script finds the rows containing the outlier values:
Script 63:
Finally, the following script removes the rows containing the
outlier values from the actual Titanic dataset.
Script 64:
Output:
((891, 15), (880, 15))
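A sketch of the IQR-based trimming described above (the limits come out to roughly -6.69 and 64.81 for the age column):

import seaborn as sns

titanic_data = sns.load_dataset('titanic')

q1 = titanic_data['age'].quantile(0.25)
q3 = titanic_data['age'].quantile(0.75)
iqr = q3 - q1

lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# keep only the rows whose age is not an outlier (missing ages are kept)
no_outliers = titanic_data[~((titanic_data['age'] < lower_limit) |
                             (titanic_data['age'] > upper_limit))]
print(titanic_data.shape, no_outliers.shape)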
Finally, you can plot a box plot to see if outliers have actually
been removed.
Script 65:
Output:
You can see from the above output that the dataset doesn’t
contain any outliers now.
Script 66:
Script 67:
Output:
The following script finds the upper and lower threshold for
the age column of the Titanic dataset, using the mean and
standard deviation capping.
Script 68:
Output:
-13.88037434994331
73.27860964406095
The output shows that the upper threshold value obtained via
mean and standard deviation capping is 73.27 and the lower
limit or threshold is –13.88.
The following script replaces the outlier values with the upper
and lower limits.
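A sketch of mean and standard deviation capping, where values outside the thresholds are replaced by the thresholds themselves (the age_capped column name is only illustrative):

import numpy as np
import seaborn as sns

titanic_data = sns.load_dataset('titanic')

upper_limit = titanic_data['age'].mean() + 3 * titanic_data['age'].std()
lower_limit = titanic_data['age'].mean() - 3 * titanic_data['age'].std()

titanic_data['age_capped'] = np.where(
    titanic_data['age'] > upper_limit, upper_limit,
    np.where(titanic_data['age'] < lower_limit, lower_limit, titanic_data['age']))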
Script 69:
Script 70:
The box plot shows that we still have some outlier values after
applying mean and standard deviation capping on the age
column of the Titanic dataset.
Output:
Exercise 3.1
Question 1
Which of the following techniques can be used to remove
outliers from a dataset?
A. Trimming
B. Censoring
C. Discretization
D. All of the above
Question 2
Which attribute is set to True to remove the first column from
the one-hot encoded columns generated via the
get_dummies() method?
A. drop_first
B. remove_first
C. delete_first
D. None of the above
Question 3
After standardization, the mean value of the dataset becomes:
A. 1
B. 0
C. -1
D. None of the above
Exercise 3.2
Replace the missing values in the deck column of the Titanic
dataset with the most frequently occurring categories in that
column. Plot a bar plot for the updated deck column.
Feature Selection with Python Scikit- Learn
Library
Script 1:
Output:
Script 2:
Script 3:
Output:
fixed acidity 3.031416
volatile acidity 0.032062
citric acid 0.037947
residual sugar 1.987897
chlorides 0.002215
free sulfur dioxide 109.414884
total sulfur dioxide 1082.102373
density 0.000004
pH 0.023835
sulphates 0.028733
alcohol 1.135647
dtype: float64
Script 4:
Output:
VarianceThreshold(threshold=0.1)
Script 5:
Output:
Index([‘fixed acidity’, ‘residual sugar’, ‘free sulfur dioxide’, ‘total
sulfur dioxide’, ‘alcohol’], dtype=’object’)
You can also get the attribute names that are not selected
using the following script.
Script 6:
Output:
['volatile acidity', 'citric acid', 'chlorides', 'density', 'pH',
'sulphates']
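A sketch of variance-based selection with VarianceThreshold, assuming data is the wine-quality features DataFrame used in the scripts above:

from sklearn.feature_selection import VarianceThreshold

selector = VarianceThreshold(threshold=0.1)
selector.fit(data)

# get_support() returns a boolean mask of the columns that pass the threshold
selected_columns = data.columns[selector.get_support()]
rejected_columns = [col for col in data.columns if col not in selected_columns]

print(selected_columns)
print(rejected_columns)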
To get the final dataset with the selected features, you can
simply remove the features that are not selected based on the
variance threshold. Execute the following script to get the
final dataset containing the selected features only.
Script 7:
Output:
One of the main issues with variance-based feature selection
is that it doesn't take the relationships between features
into account during feature selection. Hence, with variance-
based feature selection, redundant features may be selected.
To avoid selecting redundant features, you can use a feature
selection method based on correlation.
Script 8:
Output:
You can also plot the correlation matrix using the heatmap()
plot from the seaborn library, as shown below:
Script 9:
Output:
In the above script, the correlation between features is
represented in the form of black to white boxes. You can see
that the correlation varies between -0.6 to 1.0, where darker
boxes represent high negative correlation while a lighter box
represents high positive correlation.
To find all the correlated features, you can iterate through the
rows in the feature correlation matrix and then select the
features that have a correlation higher than a certain
threshold. For example, in the following script, all the features
with an absolute correlation higher than 0.6 are selected and
added to the correlated feature matrix set.
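A sketch of such a loop, assuming data is the wine-quality features DataFrame and using the 0.6 threshold mentioned above:

correlation_matrix = data.corr()
correlated_features = set()

threshold = 0.6
for i in range(len(correlation_matrix.columns)):
    for j in range(i):
        # only look below the diagonal so each pair is checked once
        if abs(correlation_matrix.iloc[i, j]) > threshold:
            correlated_features.add(correlation_matrix.columns[i])

print(len(correlated_features))
print(correlated_features)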
Script 10:
Script 11:
Output:
4
Script 12:
The following four features have correlations higher than 0.6
with at least one of the other features in the dataset.
Output:
{'pH', 'total sulfur dioxide', 'density', 'citric acid'}
Finally, you can create the final feature set by removing the
correlated features, as shown in the following script.
Script 13:
Output:
Script 14:
Output:
RFE(estimator=LinearRegression(), n_features_to_select=4)
To find the feature names, you can first retrieve the index
value of the selected features using the ranking attribute of
the RFE class, as shown below.
Script 15:
Script 16:
Output:
array([1, 4, 7, 9], dtype=int64)
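A sketch of recursive feature elimination with a linear regression estimator, assuming X and y are the wine features and quality label used above; here the selected columns are read from the support_ mask (features with rank 1 in ranking_):

from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

rfe = RFE(estimator=LinearRegression(), n_features_to_select=4)
rfe.fit(X, y)

# boolean mask of the selected features; ranking_ holds the full ranking
selected_features = X.columns[rfe.support_]
print(selected_features)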
Script 17:
Output:
Script 18:
Script 19:
Script 20:
Finally, the following script creates the final dataset with the
selected features.
Script 21:
Output:
Question 2
Which of the following features should you remove from the
dataset?
A. Features with high mutual correlation
Question 3
Which of the following feature selection method does not
depend upon the output label?
A. Feature selection based on Model performance
You can read data from CSV files. However, the datasets we
are going to use in this section are available by default in the
Seaborn library. To view all the datasets, you can use the
get_dataset_names() function, as shown in the following
script:
Script 1:
Output:
[‘anagrams’,
‘anscombe’,
‘attention’,
‘brain_networks’,
‘car_crashes’,
‘diamonds’,
‘dots’,
‘exercise’,
‘flights’,
‘fmri’,
‘gammas’,
‘geyser’,
‘iris’,
‘mpg’,
‘penguins’,
‘planets’,
‘tips’,
‘titanic’]
The following script loads the Tips dataset and displays its
first five rows.
Script 2:
Output:
Script 3:
In this section, we will be working with the Tips dataset. We
will be using machine learning algorithms to predict the tip for
a particular record based on the remaining features, such as
total_bill, sex, day, time, etc.
As a first step, we divide the data into features and labels set.
Our labels set consists of values from the tip column, while the
feature set consists of values from the remaining columns. The
following script divides data into features and labels set.
Script 4:
Script 5:
Output:
And the following script prints the labels set.
Script 6:
Output:
0 1.01
1 1.66
2 3.50
3 3.31
4 3.61
Name: tip, dtype: float64
Script 7:
Output:
Script 9:
Output:
Script 10:
Output:
The final step is to join the numerical columns with the one-
hot encoded columns. To do so, you can use the concat()
function from the Pandas library, as shown below:
Script 11:
The final dataset looks like this. You can see that it doesn’t
contain any categorical value.
Output:
Script 12:
Script 14:
Once you have trained a model and have made predictions on
the test set, the next step is to know how well your model has
performed for making predictions on the unknown test set.
There are various metrics to check that. However, mean
absolute error, mean squared error, and root mean squared
error are three of the most common metrics.
The functions used to calculate these metrics are
available in the sklearn.metrics module. The predicted and
actual values have to be passed to these functions, as shown
below.
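A sketch of computing these three metrics, assuming y_test holds the actual values and y_pred the model predictions from the previous steps:

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

print('Mean Absolute Error:', mean_absolute_error(y_test, y_pred))
print('Mean Squared Error:', mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error:', np.sqrt(mean_squared_error(y_test, y_pred)))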
Script 15:
Output:
Mean Absolute Error: 0.7080218832979829
Mean Squared Error: 0.893919522160961
Root Mean Squared Error: 0.9454731736865732
Script 16:
Output:
Mean Absolute Error: 0.7513877551020406
Mean Squared Error: 0.9462902040816326
Root Mean Squared Error: 0.9727744877830794
Script 17:
The mean absolute error value of 0.70 shows that random
forest performs better than both linear regression and KNN for
predicting a tip in the Tips dataset.
Output:
Mean Absolute Error: 0.7054065306122449
Mean Squared Error: 0.8045782841306138
Root Mean Squared Error: 0.8969828783932354
From the results obtained from sections 6.2 to 6.5, we can see
that Random Forest Regressor algorithm results in the
minimum MAE, MSE, and RMSE values. The choice of
algorithm to use depends totally upon your dataset and
evaluation metrics. Some algorithms perform better on one
dataset, while the other algorithms perform better on the
other dataset. It is better that you use all the algorithms to see
which gives the best results. However, as a rule of thumb, if
you only have limited options, try starting with ensemble
learning algorithms such as Random Forest. They yield the
best result. You will study more about model selection
strategies in chapter 9.
Let’s pick the 101st record from our dataset, which is located at
the 100th index.
Script 18:
The output shows that the value of the tip in the 100th record
in our dataset is 2.5.
Output:
total_bill 11.35
tip 2.5
sex Female
smoker Yes
day Fri
time Dinner
size 2
Name: 100, dtype: object
We will try to predict the value of the tip of the 100th record
using the random forest regressor algorithm and see what
output we get. Look at the script below:
Script 19:
Output:
[2.2609]
Script 20:
Output:
(2000, 8) (2000, 3)
The output shows that you have 2,000 records with eight
features and three outputs. Thus, this is a multi-output
regression problem.
Let’s divide the dummy dataset into training and test sets, and
apply feature scaling on it.
Script 21:
Script 22:
Output:
Mean Absolute Error: 0.24400622004095515
Mean Squared Error: 0.09288200051053495
Root Mean Squared Error: 0.3047654844475256
Script 23:
The output shows the predicted output and actual output.
You can see that there are three values in the output now
since this is a multi-output regression problem.
Output:
[[52.14499321 154.07153888 29.65411176]]
[50.3331556 155.43458476 26.52621361]
Script 24:
Output:
Mean Absolute Error: 17.578462377518566
Mean Squared Error: 737.569952450891
Root Mean Squared Error: 27.158239126476722
Script 25:
Output:
[[15.29925902 114.41624666 12.90183432]]
[50.3331556 155.43458476 26.52621361]
Script 26:
Output:
Mean Absolute Error: 0.24566521365979566
Mean Squared Error: 0.09412825912574384
Root Mean Squared Error: 0.30680329060449113
Script 27:
Output:
[[52.10616073 154.0113967 29.64235478]]
[50.3331556 155.43458476 26.52621361]
Script 28:
Output:
Mean Absolute Error: 0.2999883276581629
Mean Squared Error: 0.14873277575291557
Root Mean Squared Error: 0.3856588852249039
Finally, the following script shows how you can use the
LinearSVR algorithm along with the RegressorChain wrapper
to make predictions on a single data point.
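A sketch of this chained multi-output setup on a dummy dataset similar to the one above (random_state values are arbitrary):

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.multioutput import RegressorChain
from sklearn.svm import LinearSVR

# dummy multi-output regression data: 2,000 records, 8 features, 3 outputs
X, y = make_regression(n_samples=2000, n_features=8, n_targets=3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

model = RegressorChain(LinearSVR(max_iter=10000))
model.fit(X_train, y_train)

# predict the three output values for a single test record
print(model.predict(X_test[0].reshape(1, -1)))
print(y_test[0])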
Script 29:
Output:
[[52.11002869 154.00609972 29.12383881]]
[50.3331556 155.43458476 26.52621361]
B. Red
C. 2.5
D. None of the above
Question 2
Which one of the following algorithms is a lazy algorithm?
A. Random Forest
B. KNN
C. SVM
D. Linear Regression
Question 3
Which one of the following is not a regression
metric?
A. Accuracy
B. Recall
C. F1 Measure
D. All of the above
Exercise 5.2
Using the ‘Diamonds’ dataset from the Seaborn library, train a
regression algorithm of your choice, which predicts the price
of the diamond. Perform all the preprocessing steps.
Solving Classification Problems in Machine
Learning Using Sklearn Library
Script 1:
Output:
Script 3:
The following script prints the first five rows of the features
set.
Script 5:
Output:
And the following script prints the first five rows of the labels
set, as shown below:
Script 6:
Output:
0 1
1 0
2 1
3 0
4 0
Name: Exited, dtype: int64
6.1.2 Converting Categorical Data to Numbers
In section 5.1.2, you saw that we converted categorical
columns to numerical because the machine learning
algorithms in the Sklearn library only work with numbers.
Script 7:
Script 8:
Output:
Script 9:
The output shows that there are two categorical columns:
Geography and Gender in our dataset.
Output:
Script 10:
Output:
The last and final step is to join or concatenate the numeric
columns and one-hot encoded categorical columns. To do so,
you can use the concat function from the Pandas library, as
shown here:
Script 11:
Output:
Script 12:
6.1.4 Data Scaling/Normalization
The last step (optional) before the data is passed to the
machine learning algorithms is to scale the data. You can see
that some columns of the dataset contain small values while
the others contain very large values. It is better to convert all
values to a uniform scale. To do so, you can use the
StandardScaler() function from the sklearn.preprocessing
module, as shown below:
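A minimal sketch of this scaling step, assuming X_train and X_test from the previous steps; the scaler is fitted on the training set only and then applied to both sets:

from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
X_train = sc.fit_transform(X_train)   # learn mean/std from the training data
X_test = sc.transform(X_test)         # apply the same transformation to the test data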
Script 13:
Script 14:
True Positive: True positives are those labels that are actually
true and also predicted as true by the model.
§ Confusion Matrix
§ Precision
Another way to analyze a classification algorithm is by
calculating precision, which is basically obtained by dividing
true positives by the sum of true positive and false positive, as
shown below:
§ Recall
Recall is calculated by dividing true positives by the sum of
true positive and false negative, as shown below:
§ F1 Measure
F1 measure is simply the harmonic mean of precision and
recall and is calculated as follows:
§ Accuracy
Accuracy refers to the number of correctly predicted labels
divided by the total number of observations in a dataset.
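For reference, the standard formulas behind these four metrics, where TP, TN, FP, and FN denote true positives, true negatives, false positives, and false negatives, are:

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 = 2 x (Precision x Recall) / (Precision + Recall)
Accuracy = (TP + TN) / (TP + TN + FP + FN)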
The functions used to calculate these metrics are
available in the sklearn.metrics module. The predicted and
actual values have to be passed to these functions, as shown
below.
Script 15:
Output:
The output shows that for 81 percent of the records in the test
set, logistic regression correctly predicted whether or not a
customer will leave the bank.
The pros and cons of the KNN classifier algorithm are the
same as the KNN regression algorithm, which has been
explained already in chapter 5, section 5.2.2.
Script 16:
Output:
Further Readings – KNN Classification
To study more about KNN classification, please check these
links:
1. https://bit.ly/33pXWIj
2. https://bit.ly/2FqNmZx
Script 17:
Output:
Script 18:
Output:
[0.796 0.796 0.7965 0.7965 0.7965]
Script 19:
Output:
CreditScore 665
Geography France
Gender Female
Age 40
Tenure 6
Balance 0
NumOfProducts 1
HasCrCard 1
IsActiveMember 1
EstimatedSalary 161848
Exited 0
Name: 100, dtype: object
The output above shows that the customer did not exit the
bank after six months since the value for the Exited attribute is
0. Let’s see what our classification model predicts:
Script 20:
Output:
[0]
Script 21:
Output:
(2000, 12) (2000,)
Script 22:
The output below shows that there are four unique labels in
the output corresponding to four output classes.
Output:
array([0, 1, 2, 3])
Script 23:
Let’s first see the first technique. The following script uses a
random forest classifier for multiclass classification. You can
see that no change has been made in the script that you used
for binary classification.
Script 24:
Output:
You can implement one vs. rest classifier using the Scikit-learn
library. To do so, you can use the OneVsRestClassifier class
from the multiclass module. The base algorithm that you
want to use for training the multiple binary classifiers is
passed as a parameter value to the constructor of the
OneVsRestClassifier class. For instance, the following script
uses the logistic regression algorithm as a base classifier to
make multiclass classification using the OneVsRestClassifier
class. The process is straightforward. The training data is
passed to the fit() method of the OneVsRestClassifier class.
To make predictions, you can use the predict() method.
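A sketch of this one vs. rest setup, assuming X_train, X_test, y_train, and y_test from the multiclass example above:

from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# one binary logistic regression classifier is trained per class
clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)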
Script 25:
Output:
Script 26:
Output:
Let’s first create a dummy multilabel dataset. You can use the
make_multilabel_classification method from the datasets
module of the Sklearn library to create a dummy multilabel
dataset. The following script creates a dummy multilabel
classification dataset, which contains 2,000 records, 10
features, and 5 classes.
Script 27:
Output:
(2000, 10) (2000, 5)
Let’s print one of the output labels and see what it contains.
Script 28:
The output for the record at index 200 shows that there can
be three possible output classes for the record, as marked by
digit 1 at three different indexes belonging to different classes.
Output:
array([0, 1, 0, 1, 1])
The following script divides the dummy data into the training
and test sets and also applies feature scaling on the dataset.
Script 29:
Script 30:
Output:
Apart from default algorithms, you can use one vs. rest
classifier for multilabel classification, as well. The final
prediction is calculated as a result of the union of individual
one vs. the rest binary classifiers. The following script uses
logistic regression as the base algorithm to perform multilabel
classification using the one vs. the rest classifier.
Script 31:
Output:
B. Red
C. Male
D. None of the above
Question 2
Which of the following metrics is used for unbalanced
classification datasets?
A. Accuracy
B. F1
C. Precision
D. Recall
Question 3
Which one of the following functions is used to convert
categorical values to one-hot encoded numerical values?
A. pd.get_onehot()
B. pd.get_dummies()
C. pd.get_numeric()
D. All of the above
Exercise 6.2
Using the iris dataset from the Seaborn library, train a
classification algorithm of your choice, which predicts the
species of the iris plant. Perform all the preprocessing steps.
Clustering Data with Scikit-Learn Library
2. Hierarchical Clustering.
3. Assign the data point to the cluster of the centroid with the
shortest distance.
5. Repeat steps 2-4 until the new centroid values for all the clusters
no longer change from the previous centroid values.
Let’s first see how to cluster dummy data using the K-Means
clustering.
Script 1:
Script 2:
The output looks like this. Using K-Means clustering, you will
see how you can create four clusters in this dataset.
Output:
Script 3:
Once the model is trained, you can print the cluster centers
using the cluster_centers_ attribute of the K-Means class
object.
Script 4:
Output:
[[-1.43956092 -2.89493362]
[-8.7477121 -8.06593055]
[-9.25770944 6.1927544]
[-0.21911049 -10.22506455]]
To print the cluster ids for all the data points, you can use the
labels_ attribute of the K-Means class, as shown below.
Script 5:
Output:
[2 0 3 1 3 2 3 0 1 0 3 2 1 2 1 3 1 3 2 2 3 3
1 2 3 3 2 3 2 0 2 2 0 3 1 2 2
1 2 3 1 2 2 2 3 0 0 0 0 3 3 3 3 2 1 1 3 2 2
0 0 1 1 1 1 1 3 3 1 0 1 1 1 0
1 1 1 3 1 3 0 0 0 3 3 0 3 0 2 0 3 3 1 2 3 1
2 0 1 3 0 0 1 2 3 3 3 0 1 2 0
0 0 1 1 3 1 2 0 0 0 3 1 0 3 0 0 3 3 1 0 3 3
0 3 1 3 0 3 1 0 3 1 3 3 2 3 2
1 0 3 3 0 2 1 3 3 3 0 3 2 2 1 2 2 3 0 1 3 1
1 1 1 0 2 1 2 3 3 1 1 1 0 0 2
2 2 2 1 0 0 2 2 3 1 0 0 2 2 2 1 0 0 2 2 0 3
1 0 2 3 1 3 1 0 1 2 0 1 2 2 0
0 3 2 3 2 2 3 2 3 0 0 0 3 1 0 2 3 1 2 2 3 1
0 3 0 0 0 1 0 1 3 2 0 0 1 3 1
3 3 2 0 2 2 0 0 1 0 3 0 3 3 1 0 3 1 1 1 1 1
0 1 0 1 2 0 3 2 2 0 1 0 2 1 2
1 2 0 2 1 0 1 3 2 2 2 2 0 1 2 2 2 1 2 1 3 1
3 2 1 1 3 1 0 0 0 0 3 1 2 2 1
0 0 1 3 1 3 1 3 2 0 3 0 1 0 2 2 2 0 2 2 0 3
0 0 2 1 3 2 3 1 0 3 1 2 3 2 3
0 2 1 0 1 3 1 1 3 2 3 1 1 2 1 0 0 2 2 2 2 1
3 1 3 1 3 0 0 0 2 1 0 2 2 2 3
3 0 2 1 0 1 0 2 2 0 2 0 0 1 3 2 0 1 3 0 0 2
0 1 3 0 0 3 1 1 3 0 3 3 3 0 2
1 3 2 2 3 3 0 3 2 0 3 0 3 3 3 2 2 1 3 0 2 3
2 2 1 0 2 0 0 1 0 2 1 2 3 1 3
1 1 0 2 1 1 1 2 2 0 2 2 0 2 0 0 0 3 0]
Script 6:
Output:
Output:
The CSV dataset file for this project is freely available at this
link (https://bit.ly/3kxXvCl). The CSV file for the dataset
Mall_Customers.csv can also be downloaded from the
Datasets folder of GitHub and SharePoint repositories.
Script 9:
The following script prints the first five rows of the dataset.
Script 10:
The below output shows that the dataset has five columns:
CustomerID, Genre, Age, Annual Income (K$), and Spending
Score (1–100). The spending score is the score assigned to
customers based on their previous spending habits.
Customers with higher spending in the past have higher
scores.
Output:
Let’s see the shape of the dataset.
Script 11:
Output
(200, 5)
Data Analysis
Script 12:
Output:
Similarly, we can plot a histogram for the spending scores of
the customers, as well.
Script 13:
Output:
We can also plot a regression line between annual income and
spending score to see if there is any linear relationship
between the two or not.
Script 14:
From the almost flat regression line in the output below, you can
infer that there is no linear relationship between annual income
and spending score.
Output:
Finally, you can also plot a linear regression line between the
age column and the spending score.
Script 15:
Output:
Enough of the data analysis. We are now ready to perform
customer segmentation on our data using the K-Means
algorithm.
Script 16:
The output shows that we now have only annual income and
spending score columns in our dataset.
Output:
To implement K-Means clustering, you can use the K-Means
class from the sklearn.cluster module of the Sklearn
library. You have to pass the number of clusters as an
attribute to the K-Means class constructor. To train the K-
Means model, simply pass the dataset to the fit() method of
the K-Means class, as shown below.
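A sketch of this K-Means setup, assuming dataset holds the annual income and spending score columns filtered above:

from sklearn.cluster import KMeans

km_model = KMeans(n_clusters=4)
km_model.fit(dataset)

print(km_model.cluster_centers_)   # coordinates of the four cluster centers
print(km_model.labels_)            # cluster id assigned to each record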
Script 17:
Output
KMeans(n_clusters=4)
Once the model is trained, you can print the cluster centers
using the cluster_centers_ attribute of the K-Means class
object.
Script 18:
Output
[[86.53846154 82.12820513]
[48.26 56.48]
[26.30434783 20.91304348]
[87. 18.63157895]]
To print the cluster ids for all the data points, you can use the
labels_ attribute of the K-Means class, as shown below.
Script 19:
Output
[2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1
2 1 2 1 2 1 2 1 2 1 2 1 2 1 2
1 2 1 2 1 2 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 0 3 0 3 0 3 0 3 0 3
0 3 0 3 0 3 0 3 0 3 0 3 0 3 0
3 0 3 0 3 0 3 0 3 0 3 0 3 0 3 0 3 0 3 0 3 0
3 0 3 0 3 0 3 0 3 0 3 0 3 0 3
0 3 0 3 0 3 0 3 0 3 0 3 0 3 0]
Script 20:
Output:
Script 21:
From the output below, it can be seen that the value of inertia
didn’t decrease much after five clusters.
Output:
Let’s now segment our customer data into five groups by
creating five clusters.
Script 22:
Output
KMeans(n_clusters=5)
Script 23:
Output:
From the above output, you can see that the customers are
divided into five segments. The customers in the middle of the
plot (in purple) are the customers with average income and
average spending. The customers belonging to the red cluster
are the ones with low income and low spending. You need to
target the customers who belong to the top right cluster (sky
blue). These are the customers with high incomes and high
spending in the past, and they are more likely to spend in the
future, as well. So any new marketing campaigns or
advertisements should be directed to these customers.
The last step is to find the customers who belong to the sky-
blue cluster. To do so, we will first plot the centers of the
clusters.
Script 24:
To fetch all the records from the cluster with id 1, we will first
create a dataframe containing index values of all the records
in the dataset and their corresponding cluster labels, as shown
below.
Script 25:
Output:
Next, we can simply filter all the records from the cluster_map
dataframe, where the value of the cluster column is 1. Execute
the following script to do so.
Script 26:
Here are the first five records that belong to cluster 1. These
are the customers who have high incomes and high spending,
and these customers should be targeted during marketing
campaigns.
Output:
Example 1
Script 27:
Script 28:
Output:
Note: You might get different data points because the points
are randomly generated.
The features are passed to the linkage() function, and the
linkage matrix it returns is passed to the dendrogram() function
to plot a dendrogram for the features, as shown in the following script:
Script 29:
Here is the output of the above script.
Output:
From the figure, it can be seen that points 1 and 4 are closest
to each other. Hence, a cluster is formed by connecting these
points. The cluster of points 1 and 4 is closest to data point 8,
resulting in a cluster containing points 1, 4, and 8. In the same
way, the remaining clusters are formed until a big cluster is
formed.
Script 30:
Output:
array([1, 0, 0, 1, 0, 0, 1, 1, 1, 0], dtype=int64)
Script 31:
The output shows that our clustering algorithm has
successfully clustered the data points.
Output:
Example 2
Script 32:
Output:
The following script applies agglomerative hierarchical
clustering on the dataset. The number of predicted clusters is
4.
Script 33:
The output shows the labels of some of the data points in our
dataset. You can see that since there are four clusters, there
are four unique labels, i.e., 0, 1, 2, and 3.
Output:
array([2, 1, 0, 0, 1, 2, 2, 1, 2, 2, 3, 3, 1, 2, 0, 0, 2, 0, 1, 0, 1, 2, 2,
1], dtype=int64)
Script 34:
Output:
Similarly, to plot the actual clusters in the dataset (for the sake
of comparison), execute the following script.
Script 35:
Output:
Script 36:
Output:
The following script divides the data into features and labels
sets and displays the first five rows of the labels set.
Script 37:
Output:
Similarly, the following script applies the agglomerative
clustering on the features set using the
AgglomerativeClustering class from the sklearn.cluster
module.
Script 38:
The output below shows the predicted cluster labels for the
features set in the Iris dataset.
Output:
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 2, 0, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 0, 0, 2, 2, 2,
2, 0, 2, 0, 2, 0, 2, 2, 0, 0, 2, 2, 2, 2, 2, 0, 0, 2, 2, 2, 0, 2, 2, 2,
0, 2, 2, 2, 0, 2, 2, 0], dtype=int64)
Script 39:
Output:
You can also create dendrograms for the features set using
the shc module from the scipy.cluster.hierarchy library. You
have to pass the features set to the linkage() function of the shc
module, and the returned linkage matrix is then passed to
the dendrogram() function to plot the dendrogram, as shown in the
following script.
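A sketch of this dendrogram step, assuming features is the Iris features set used above:

import matplotlib.pyplot as plt
import scipy.cluster.hierarchy as shc

plt.figure(figsize=(10, 7))
# linkage() builds the hierarchical clustering; dendrogram() draws it
shc.dendrogram(shc.linkage(features, method='ward'))
plt.show()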
Script 40:
Output:
If you want to cluster the dataset into three clusters, you can
simply draw a horizontal line that passes through the three
vertical lines, as shown below. The clusters below the
horizontal line are the resultant clusters. In the following
figure, we form three clusters.
B. Hierarchical Clustering
Question 2
In K-Means clustering, what does the inertia tell us?
A. The distance between data points within a cluster
Question 3
In hierarchical clustering, in the case of vertical dendrograms,
the number of clusters is equal to the number of______ lines
that the_____line passes through?
A. horizontal, vertical
B. vertical, horizontal
Disadvantages of PCA
There are two major disadvantages of PCA:
1. You need to standardize the data before you apply PCA.
In this section, you will see how to use PCA to select the two
most important features in the Iris dataset using the Sklearn
library.
Script 1:
The following script imports the Iris dataset using the Seaborn
library and prints the first five rows of the dataset.
Script 2:
Output:
The following script divides the data into the features and
labels sets.
Script 3:
Script 4:
Finally, both the training and test sets should be scaled before
PCA could be applied to them.
Script 5:
To apply PCA via Sklearn, all you have to do is import the PCA
class from the Sklearn.decomposition module. Next, to apply
PCA to the training set, pass the training set to the
fit_transform() method of the PCA class object. To apply PCA
on the test set, pass the test set to the transform() method of
the PCA class object. This is shown in the following script.
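A sketch of these PCA steps, assuming the scaled X_train and X_test from the previous scripts:

from sklearn.decomposition import PCA

pca = PCA()                               # or PCA(n_components=2) to keep two components
X_train_pca = pca.fit_transform(X_train)  # fit on the training set and transform it
X_test_pca = pca.transform(X_test)        # only transform the test set

print(pca.explained_variance_ratio_)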
Script 6:
Once you have applied PCA on a dataset, you can use the
explained_variance_ratio_ feature to print the variance
caused by all the features in the dataset. This is shown in the
following script:
Script 7:
Output:
[0.72229951 0.2397406 0.03335483 0.00460506]
Script 8:
Script 9:
Output:
0.8666666666666667
The output shows that even with two features, the accuracy
for correctly predicting the label for the iris plant is 86.66 percent.
Finally, with two features, you can easily visualize the dataset
using the following script.
Script 10:
Output:
8.2 Linear Discriminant Analysis
Linear Discriminant Analysis (LDA) is a supervised
dimensionality reduction technique where a decision
boundary is formed around the data points belonging to each
cluster of a class. The data points are projected to new
dimensions in a way that the distance between the data points
within a cluster is minimized while the distance between the
clusters is maximized. The new dimensions are ranked w.r.t
their ability to (i) minimize the distance between the data
points within a cluster and (ii) maximize the distance between
individual clusters.
Disadvantages of LDA
There are three major disadvantages of LDA:
1. Not able to detect correlated features.
Let’s see how you can implement LDA using the Sklearn
library. As always, the first step is to import the required
libraries.
Script 11:
Script 12:
Output:
Script 13:
Finally, the following script divides the data into training and
test sets.
Script 14:
Like PCA, you need to scale the data before you can apply
LDA on it. The data scaling is performed in the following step.
Script 15:
Script 16:
Like PCA, you can find variance ratios for LDA using the
explained_variance_ratio_ attribute.
Script 17:
Output:
[1.]
The above output shows that even with one component, the
maximum variance can be achieved.
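A sketch of the LDA transformation, assuming the scaled X_train, X_test, and the labels y_train from the previous scripts; unlike PCA, LDA is supervised, so the labels are passed to fit_transform():

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

lda = LDA(n_components=1)
X_train_lda = lda.fit_transform(X_train, y_train)
X_test_lda = lda.transform(X_test)

print(lda.explained_variance_ratio_)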
Script 18:
Next, we will try to classify whether or not a banknote is fake
using a single feature. We will use the Logistic Regression
algorithm for that. This is shown in the following script.
Script 19:
Output:
0.9890909090909091
The output shows that even with a single feature, we are able
to correctly predict whether or not a banknote is fake with
98.90 percent accuracy.
Disadvantages of SVD
Let’s now see how you can implement SVD via Python’s
Sklearn Library.
Script 20:
Output:
Script 21:
The script below divides the data into training and test sets.
The training set can be used to train the machine learning
model, while the test set is used to evaluate the performance
of a trained model.
Script 22:
Output:
(1279, 11)
(1279,)
Script 23:
As a first step, you will train a regression algorithm on the
complete feature set and then make predictions about the
quality of the wine. Next, you will apply SVD to the Wine
dataset to reduce it to two components and again predict the
quality of wine.
Script 24:
Output:
Mean Absolute Error: 0.403125
Mean Squared Error: 0.484375
Root Mean Squared Error: 0.6959705453537527
Next, we will apply SVD to our features set and reduce the
total number of features to 2.
To apply SVD via the Sklearn Library, you can use the
TruncatedSVD class from the Sklearn.decomposition module.
You first need to create an object of the class and then pass
the training set to the fit_transform() method. The test set
can simply be passed to the transform() method. These two
methods return training and test sets with reduced features (2
by default). The following script then prints the shape of the
reduced training set.
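A sketch of this SVD step, assuming the scaled X_train, X_test, and y_train from the previous scripts:

from sklearn.decomposition import TruncatedSVD

svd = TruncatedSVD(n_components=2)
X_train_svd = svd.fit_transform(X_train)   # reduce the training set to two components
X_test_svd = svd.transform(X_test)

print(X_train_svd.shape)
print(y_train.shape)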
Script 25:
From the output, you can see that the training set now
consists of only two features.
Output:
(1279, 2)
(1279,)
You will now use the reduced training set to make predictions
about the quality of the wine. Execute the following script.
Script 26:
The output shows that with two features, the mean absolute
error is 0.51, which is slightly greater than when you used the
complete feature set with 11 features. This is because when
you remove nine features, you lose some information from
your dataset.
Output:
Mean Absolute Error: 0.51875
Mean Squared Error: 0.64375
Root Mean Squared Error: 0.8023403267940606
Question 2
In PCA, dimensionality reduction depends upon the:
A. Features set only
Question 3
LDA is a____dimensionality reduction technique.
A. Unsupervised
B. Semi-Supervised
C. Supervised
D. Reinforcement
Exercise 8.2
Apply principal component analysis for dimensionality
reduction on the customer_churn.csv dataset from the
Datasets folder. Print the accuracy using two principal
components. Also, plot the results on the test set using the
two principal components.
Selecting Best Models with Scikit- Learn
So far in this book, you have typically divided the data into an 80
percent training set and a 20 percent test set. However, this means
that only 20 percent of the data is ever used for testing, and that
same 20 percent of the data is never used for training.
In this section, you will see two scripts. In the first script, you
will see model training without cross-validation. In the second
section, the same dataset will be used for the model training
with cross-validation.
Script 1:
Output:
Script 2:
The script below divides the dataset into 80 percent training
set and 20 percent testing set using the train_test_split()
function from the sklearn.model_selection module.
Script 3:
Script 4:
Script 5:
Output:
Mean Absolute Error: 0.40390625
Mean Squared Error: 0.3176339125
Root Mean Squared Error: 0.563590199080857
The following script imports the dataset and prints its first five
rows.
Script 6:
Output:
The dataset is divided into the features and labels sets in the
script below:
Script 7:
Script 8:
Next, instead of dividing the dataset into the training and test
sets as you did previously, you will simply initialize an instance
of your machine learning model. The following script initializes
the RandomForestRegressor model from the sklearn.ensemble
class, as shown below:
Script 9:
Script 10:
In the output, you will see five values. Each value corresponds
to the mean absolute error for one of the five data folds.
Output:
[-0.504025 -0.51095625 -0.50714375 -0.5192125 -0.49863323]
You can find the average and standard deviation of the mean
absolute error values for the five folds using the following
script.
Script 11:
Output:
-0.51 accuracy with a standard deviation of 0.01
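A sketch of this cross-validation step, assuming X and y are the full wine features and labels sets loaded above:

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

model = RandomForestRegressor(random_state=42)

# five-fold cross-validation; scores are negated MAE values, one per fold
scores = cross_val_score(model, X, y, cv=5, scoring='neg_mean_absolute_error')

print(scores)
print("%0.2f mean score with a standard deviation of %0.2f" % (scores.mean(), scores.std()))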
The following script imports the dataset and prints its first five
rows.
Script 12:
Output:
The script below divides the data into features and labels sets
X and y, respectively.
Script 13:
Script 14:
Next, just like you did with cross-validation, you need to
specify the machine learning model whose parameters you
want to select. In the script below, we select the
RandomForestRegressor algorithm.
Script 15:
Script 16:
Finally, to train your grid search model, you need to call the
fit() method, as shown below:
Script 17:
Once the grid search finishes training, you can find the best
parameters that your grid search algorithm selected using the
best_params_ attribute, as shown below:
Script 18:
In the output below, you can see the parameter values that
return the best results for the prediction of wine quality as
selected by the GridSearchCV() algorithm.
Output:
{'bootstrap': True, 'criterion': 'mae', 'min_samples_leaf': 1,
'n_estimators': 300}
To see the best-case mean absolute error returned by the grid
search using the above parameters, you can use the
best_score_ attribute, as shown below:
Script 19:
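For example:

print(grid_search.best_score_)   # negative because neg_mean_absolute_error is used for scoring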
Output:
-0.531481223876698
Script 20:
Script 21:
Output:
Script 22:
You need to specify the list of training set sizes at which you
want the model to be evaluated. For instance, the following
script evaluates the model at 100 different training set sizes.
Script 23:
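A minimal sketch of the learning-curve call, assuming a classification dataset X, y and a RandomForestClassifier (the estimator used in the book's script is not shown):

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

train_sizes, train_scores, test_scores = learning_curve(
    RandomForestClassifier(random_state=42), X, y,
    cv=5,
    train_sizes=np.linspace(0.01, 1.0, 100),   # evaluate the model at 100 training set sizes
    verbose=1,
)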
Output:
[learning_curve] Training set sizes: [10 21 32 43 54 65 76 87 98 109 120
131 142 153 164 175 186 197 208 219 230 241 252 263 274 285 296 307 318
329 340 351 362 372 383 394 405 416 427 438 449 460 471 482 493 504 515
526 537 548 559 570 581 592 603 614 625 636 647 658 669 680 691 702 713
724 734 745 756 767 778 789 800 811 822 833 844 855 866 877 888 899 910
921 932 943 954 965 976 987 998 1009 1020 1031 1042 1053 1064 1075 1086
1097]
Script 24:
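Continuing from the previous sketch:

print(np.mean(train_scores, axis=1))   # average training accuracy over the 5 folds at each size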
The output shows that the mean accuracies for the training
data are approximately 1 at every training set size.
Output:
[1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
1. 1. 1. 1.]
And the following script returns the mean accuracies on the
test data at every training set size.
Script 25:
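For example:

print(np.mean(test_scores, axis=1))    # average validation accuracy over the 5 folds at each size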
Output:
[0.55539217 0.55539217 0.55539217 0.55539217 0.55539217
0.55539217 0.55539217 0.55539217 0.55539217 0.55539217
0.55539217 0.55539217 0.55539217 0.55539217 0.55539217
0.55539217 0.55539217 0.55539217 0.55539217 0.55539217
0.55539217 0.55539217 0.55539217 0.55539217 0.55539217
0.55539217 0.55539217 0.55539217 0.55539217 0.55539217
0.55539217 0.55539217 0.55539217 0.55539217 0.55539217
0.55539217 0.55539217 0.55539217 0.55539217 0.55539217
0.55539217 0.55539217 0.55539217 0.55539217 0.55539217
0.55539217 0.55539217 0.55539217 0.55539217 0.55539217
0.55539217 0.55539217 0.55539217 0.55539217 0.55539217
0.62244194 0.7981075 0.88192966 0.93222031 0.96720106
0.96938553 0.96866092 0.96793364 0.97303251 0.97448441
0.97448441 0.97448971 0.98031586 0.97740411 0.99053484
0.99126211 0.99052953 0.99125946 0.98907498 0.99052687
0.99198407 0.99271666 0.99052422 0.99271666 0.992714
0.99344127 0.99490113 0.99344127 0.99198673 0.99490113
0.992714 0.99344393 0.9897996 0.992714 0.99198938 0.99271666
0.992714 0.9941712 0.99417386 0.99344127 0.99489847
0.99344127 0.99344127 0.99343862 0.99344127]
Next, you can plot the mean training and test accuracy scores
against the number of records in the training data.
Script 26:
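A minimal sketch of the plot, continuing from the learning-curve sketch above:

import matplotlib.pyplot as plt

plt.plot(train_sizes, np.mean(train_scores, axis=1), label='Training accuracy')
plt.plot(train_sizes, np.mean(test_scores, axis=1), label='Test accuracy')
plt.xlabel('Number of training records')
plt.ylabel('Accuracy')
plt.legend()
plt.show()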
The output below shows that the performance on the training
data remains constant, i.e., 100 percent at every training set
size, so the model has very low bias. On the test data, the
model performs poorly up to around 600 records, which tells
us that the model is overfitting up to that point. After roughly
the 600th record, the model's performance on the test data
starts improving and comes very close to its performance on
the training data at around 800 records.
Output:
Script 27:
Script 28:
Output:
The script below divides the dataset into the features and
labels sets.
Script 29:
Division of data into the training and test sets is done using
the following script.
Script 30:
Script 31:
Script 32:
Script 33:
After you have trained the model, you can save it. The process
is simple. You need to call the dump() method of the pickle
module. The first argument to the dump() method is the
trained classifier, and the second argument is a file object
opened at the path where you want to save the classifier. The
file must be opened in write-binary ('wb') mode when saving
the model. Look at the following script for reference.
Script 34:
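A minimal sketch; the variable name classifier and the file name are assumptions:

import pickle

# Save the trained classifier to disk in write-binary ('wb') mode
with open('saved_classifier.pkl', 'wb') as model_file:
    pickle.dump(classifier, model_file)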
Finally, you can load the saved model using the load() method
of the pickle module. You need to pass a file object for the
saved classifier, opened in read-binary ('rb') mode, as shown in
the following script. The following script also makes a
prediction on the test set using the loaded classifier.
Script 35:
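Continuing from the previous sketch:

import pickle

# Load the classifier back from disk in read-binary ('rb') mode and reuse it
with open('saved_classifier.pkl', 'rb') as model_file:
    loaded_classifier = pickle.load(model_file)

y_pred = loaded_classifier.predict(X_test)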
Script 36:
Output:
Further Readings – Model Selection with Scikit-Learn
Question 2:
Learning curves can be used to study the:
A. Bias of a trained algorithm
Question 3:
Which pickle method can be used to save a trained machine
learning model?
A. save()
B. register()
C. load()
D. dump()
Exercise 9.2
Use the Grid Search to find parameters of the
RandomForestClassifier algorithm, which return the highest
classification accuracy for classifying the banknote.csv
dataset:
Natural Language Processing with Scikit-Learn
2. If a category exists in the test set but not in the training set,
the probability of prediction for that category in the test set
will be set to 0.
Script 1:
Script 2:
Output:
Script 3:
Output:
(5728, 2)
10.2.4 Data Visualization
Script 4:
Output:
From the above pie chart, you can see that 24 percent of the
emails in our dataset are spam emails.
Next, we will plot word clouds for the spam and non-spam
emails in our dataset. A word cloud is a type of plot that
shows the most frequently occurring words in a body of text.
The higher a word's frequency of occurrence, the larger it
appears in the plot.
But first, we will remove all the stop words, such as "a," "is,"
"you," "i," and "are," from our dataset because these words
occur very frequently and carry little value for classification.
The following script imports a list of stop words.
Script 5:
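The book does not show which stop word list it uses; one common option, assumed here, is the STOPWORDS set that ships with the wordcloud package:

from wordcloud import STOPWORDS

print(STOPWORDS)   # the built-in set of English stop words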
The following script filters the spam emails from the dataset
and then plots a word cloud using only the spam emails.
Script 6:
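A minimal sketch, assuming the email DataFrame is named dataset and has a text column named text and a label column named spam (1 for spam, 0 for ham):

import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS

# Keep only the spam emails and join their texts into one long string
spam_text = ' '.join(dataset[dataset['spam'] == 1]['text'])

wordcloud = WordCloud(stopwords=STOPWORDS, background_color='white').generate(spam_text)
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()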
Script 7:
Output:
10.2.5 Cleaning the Data
Before we actually train our machine learning model on the
training data, we need to remove special characters and
numbers from our text. Removing special characters and
numbers leaves extra spaces in the text, which also need to
be removed.
Before cleaning the data, let's first divide the data into the
email text, which forms the feature set (X), and the email
labels (y), which indicate whether or not an email is spam.
Script 8:
Script 9:
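The book's clean_text() implementation is not reproduced here; a minimal sketch that removes special characters and numbers and then collapses the extra spaces, as described above:

import re

def clean_text(doc):
    # Keep letters only: replace special characters and digits with spaces
    doc = re.sub(r'[^a-zA-Z]', ' ', str(doc))
    # Collapse the extra spaces left behind and lower-case the text
    doc = re.sub(r'\s+', ' ', doc).strip().lower()
    return doc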
The following script calls the clean_text() method and
preprocesses all the emails in the dataset.
Script 10:
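For example:

X = [clean_text(email) for email in X]   # preprocess every email in the feature set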
Once the naïve Bayes model is trained on the training set, the
test set containing only email texts is passed as inputs to the
model. The model then predicts which of the emails in the test
set are spam. Predicted outputs for the test set are then
compared with the actual label in the test data in order to
evaluate the performance of the spam email detector naïve
Bayes model.
The following script divides the data into training and test
sets.
Script 12:
To train the machine learning model, you will be using the
MultinomialNB() class from the sklearn.naive_bayes module,
which implements the naïve Bayes algorithm in Scikit-learn.
The fit() method of the MultinomialNB() class is used to train
the model.
Script 13:
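A minimal sketch, assuming X_train already holds numeric features for the training emails (for instance, TF-IDF vectors produced in an earlier script that is not shown here):

from sklearn.naive_bayes import MultinomialNB

spam_classifier = MultinomialNB()
spam_classifier.fit(X_train, y_train)   # train the naive Bayes model on the training emails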
Script 14:
Script 15:
Output:
Script 16:
Output:
Subject localized software all languages available hello we would like to
offer localized software versions german french spanish uk and many
others aii iisted software is avail able for immediate download no need
to wait week for cd delivery just few examples norton Internet security
pro windows xp professional with sp fuil version corei draw graphicssuite
dreamweaver mx homesite inciudinq macromedia studio mx just browse our
site and find any software you need in your native ianguaqe best reqards
kayieen 1
Let’s pass this sentence into our spam detector classifier and
see what it thinks:
Script 17:
Output:
[1]
Output:
From the output, you can see that the dataset contains two
columns: SentimentText and Sentiment. The former contains
the text reviews about movies, while the latter contains the
user's opinion of the corresponding movie. In the Sentiment
column, 1 refers to a positive opinion, while 0 refers to a
negative opinion.
Let’s see the number of rows in the dataset.
Script 20:
Output:
(25000, 2)
Script 21:
Output:
The pie chart shows that half of the reviews are positive, while
the other half contains negative reviews.
Before cleaning the data, let’s first divide the data into text
reviews and user sentiment.
Script 22:
Script 24:
Script 25:
The text data has been processed. Now, we can train our
machine learning model on the text.
The following script divides the data into training and test
sets.
Script 26:
To train the machine learning model, you will be using the
RandomForestClassifier (https://bit.ly/2V1G0k0) model,
which is one of the most commonly used machine learning
models for classification. The fit() method of the
RandomForestClassifier class is used to train the model.
Script 27:
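A minimal sketch, assuming X_train holds the vectorized review texts from the earlier scripts; the n_estimators value is an assumption:

from sklearn.ensemble import RandomForestClassifier

review_classifier = RandomForestClassifier(n_estimators=200, random_state=42)
review_classifier.fit(X_train, y_train)   # train the classifier on the training reviews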
Script 28:
Output:
Script 29:
Output:
[1]
Question 1:
Which attribute of the TfidfVectorizer vectorizer is used to
define the minimum word count?
A. min_word
B. min_count
C. min_df
D. None of the above
Question 2:
Which method of the RandomForestClassifier object is used
to train the algorithm on the input data:
A. train()
B. fit()
C. predict()
D. train_data()
Question 3:
Sentiment analysis with RandomForestClassifier is a type of
____ learning problem.
A. Supervised
B. Unsupervised
C. Reinforcement
D. Lazy
Exercise 10.2
Import the “spam.csv” file from the resources folder. The
dataset contains ham and spam text messages. Write a
Python application that uses Scikit-Learn to classify ham and
spam messages in the dataset. Column v1 contains a text label,
while column v2 contains the text of the message.
Image Classification with Scikit-Learn
Script 1:
Script 2:
Output:
Script 3:
Output:
Let’s now print the label for the image at index 0 to see if the
label is actually 5.
Script 4:
Output:
5
Script 5:
The rest of the process is familiar. You need to divide the data
into training and test sets, as shown in the following script.
You can scale your image data just like any other data, as
shown in the following script.
Script 8:
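The book does not show which classifier it trained on the scaled image data; the sketch below assumes a RandomForestClassifier and the variable names X_train, X_test, y_train, and y_test from the previous scripts, and it also includes the accuracy check that Script 9 performs:

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

digit_classifier = RandomForestClassifier(random_state=42)
digit_classifier.fit(X_train, y_train)    # scaled training images from the previous script

y_pred = digit_classifier.predict(X_test)
print(accuracy_score(y_test, y_pred))     # accuracy on the held-out test images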
Script 9:
Output:
The output shows that your model can predict digits in the
images in the MNIST dataset with 96.95 percent accuracy.
Question 1:
A colored image has_____channels:
A. 1
B. 2
C. 3
D. 4
Question 2:
You need to convert an image into a_____dimensional array
before you can use Scikit-learn to train models on image data?
A. 1
B. 2
C. 3
D. 4
Question 3:
To convert a one-dimensional numpy array into a two-
dimensional array or matrix, which method can be used?
A. np.tomatrix()
B. pd.convert2d
C. np.reshape()
D. None of the above
Exercise 11.2
Divide the following image dataset into 80 percent training
and 20 percent test sets. Train the model on the training set
and make predictions on the test set. Print the accuracy and
confusion matrix for the model performance.
Exercise Solutions
Exercise 2.1
Question 1
Which loop should be used when you want to repeatedly
execute a block of code a specific number of times?
A. For Loop
B. While Loop
C. Both A & B
D. None of the above
Answer: A
Question 2
What is the maximum number of values that a function can
return in Python?
A. Single Value
B. Double Value
Answer: C
Question 3
Which of the following membership operators are supported
by Python?
A. In
B. Out
C. Not In
D. Both A and C
Answer: D
Exercise 2.2
Print the multiplication table of the integer 9 using a while loop:
Exercise 3.1
Question 1
Which of the following techniques can be used to remove
outliers from a dataset?
A. Trimming
B. Censoring
C. Discretization
D. All of the above
Answer: D
Question 2
Which attribute is set to True to remove the first column from
the one-hot encoded columns generated via the
get_dummies() method?
A. drop_first
B. remove_first
C. delete_first
D. None of the above
Answer: A
Question 3
After standardization, the mean value of the dataset becomes:
A. 1
B. 0
C. -1
D. None of the above
Answer: B
Exercise 3.2
Replace the missing values in the deck column of the Titanic
dataset with the most frequently occurring categories in that
column. Plot a bar plot for the updated deck column.
Solution:
Exercise 4.1
Question 1
Which of the following feature types should you retain in the
dataset?
A. Features with low Variance
Answer: D
Question 2
Which of the following features should you remove from the
dataset?
A. Features with high mutual correlation
Answer: A
Question 3
Which of the following feature selection methods does not
depend upon the output label?
A. Feature selection based on Model performance
Answer: C
Exercise 4.2
Using the ‘winequality-white’ dataset from the ‘Dataset and
Source Codes’ folder, apply the recursive feature elimination
technique for feature selection.
Solution:
Exercise 5.1
Question 1
Which of the following is an example of a regression output?
A. True
B. Red
C. 2.5
D. None of the above
Answer: C
Question 2
Which of the following algorithms is a lazy algorithm?
A. Random Forest
B. KNN
C. SVM
D. Linear Regression
Answer: B
Question 3
Which of the following is not a regression metric?
A. Accuracy
B. Recall
C. F1 Measure
D. All of the above
Answer: D
Exercise 5.2
Using the ‘Diamonds’ dataset from the Seaborn library, train a
regression algorithm of your choice, which predicts the price
of the diamond. Perform all the preprocessing steps.
Solution:
Exercise 6.1
Question 1
Which of the following is not an example of a classification
output?
A. True
B. Red
C. Male
D. None of the above
Answer: D
Question 2
Which of the following metrics is used for unbalanced
classification datasets?
A. Accuracy
B. F1
C. Precision
D. Recall
Answer: C
Question 3
Which of the following functions is used to convert categorical
values to one-hot encoded numerical values?
A. pd.get_onehot()
B. pd.get_dummies()
C. pd.get_numeric()
D. All of the above
Answer: B
Exercise 6.2
Using the iris dataset from the Seaborn library, train a
classification algorithm of your choice that predicts the
species of the iris plant. Perform all the preprocessing steps.
Solution:
Exercise 7.1
Question 1
Which of the following is a supervised machine learning
algorithm?
A. K-Means Clustering
B. Hierarchical Clustering
Answer: D
Question 2
In K-Means clustering, what does the inertia tell us?
A. the distance between data points within a cluster
Answer: C
Question 3
In hierarchical clustering, in the case of vertical dendrograms,
the number of clusters is equal to the number of_____ lines
that the_____line passes through?
A. horizontal, vertical
B. vertical, horizontal
Answer: B
Exercise 7.2
Apply K-Means clustering on the banknote.csv dataset
available in the Datasets folder in this GitHub repository
(https://bit.ly/3nhAJBi). Find the optimal number of clusters
and then print the clustered dataset. The following script
imports the dataset and prints the first five rows of the
dataset.
Exercise 8.1
Question 1
Which of the following are the benefits of dimensionality
reduction?
A. Data Visualization
Answer: C
Question 2
In PCA, dimensionality reduction depends upon the:
A. Features set only
Answer: A
Question 3
LDA is a____dimensionality reduction technique
A. Unsupervised
B. Semi-Supervised
C. Supervised
D. Reinforcement
Answer: C
Exercise 8.2
Apply principal component analysis for dimensionality
reduction on the customer_churn.csv dataset from the
Datasets folder in this GitHub repository
(https://bit.ly/3nhAJBi). Print the accuracy using two principal
components. Also, plot the results on a test set using the two
principal components.
Solution:
Exercise 9.1
Question 1:
With grid search, you can________
A. Test all parameters for a model by default
Answer: B
Question 2:
Learning curves can be used to study the:
A. Bias of a trained algorithm
Answer: C
Question 3:
Which pickle method can be used to save a trained machine
learning model:
A. save()
B. register()
C. load()
D. dump()
Answer: D
Exercise 9.2
Use the Grid Search to find the parameters of the
RandomForestClassifier algorithm, which return the highest
classification accuracy for classifying the banknote.csv
dataset:
Solution:
Exercise 10.1
Question 1:
Which attribute of the TfidfVectorizer vectorizer is used to
define the minimum word count?
A. min_word
B. min_count
C. min_df
D. None of the Above
Answer: C
Question 2:
Which method of the RandomForestClassifier object is used
to train the algorithm on the input data?
A. train()
B. fit()
C. predict()
D. train_data()
Answer: B
Question 3:
Sentiment analysis with RandomForestClassifier is a type
of _____ learning problem.
A. Supervised
B. Unsupervised
C. Reinforcement
D. Lazy
Answer: A
Exercise 10.2
Import the “spam.csv” file from the resources folder. The
dataset contains ham and spam text messages. Write a
Python application that uses Scikit-Learn to classify ham and
spam messages in the dataset. Column v1 contains a text label,
while column v2 contains the text of the message.
Solution:
Exercise 11.1
Question 1:
A colored image has_____channels:
A. 1
B. 2
C. 3
D. 4
Answer: C
Question 2:
You need to convert an image into a_____dimensional array
before you can use Scikit-learn to train models on image data?
A. 1
B. 2
C. 3
D. 4
Answer: A
Question 3:
To convert a one-dimensional numpy array into a two-
dimensional array or matrix, which method can be used?
A. np.tomatrix()
B. pd.convert2d
C. np.reshape()
D. None of the above
Answer: C
Exercise 11.2
Divide the following image dataset into 80 percent training
and 20 percent test sets. Train the model on the training set
and make predictions on the test set. Print the accuracy and
confusion matrix for the model performance.
Solution: