Python Machine Learning - Sample Chapter
Unlock deeper insights into machine learning with this vital guide
to cutting-edge predictive analytics
Foreword by Dr. Randal S. Olson
Artificial Intelligence and Machine Learning Researcher, University of Pennsylvania
Sebastian Raschka
Preface
I probably don't need to tell you that machine learning has become one of the most
exciting technologies of our time and age. Big companies, such as Google, Facebook,
Apple, Amazon, IBM, and many more, heavily invest in machine learning research
and applications for good reasons. Although it may seem that machine learning has
become the buzzword of our time and age, it is certainly not hype. This exciting
field opens the way to new possibilities and has become indispensable to our daily
lives. Talking to the voice assistant on our smart phones, recommending the right
product for our customers, stopping credit card fraud, filtering out spam from our
e-mail inboxes, detecting and diagnosing medical diseases, the list goes on and on.
If you want to become a machine learning practitioner, a better problem solver, or
maybe even consider a career in machine learning research, then this book is for you!
However, for a novice, the theoretical concepts behind machine learning can be quite
overwhelming. Yet, many practical books that have been published in recent years
will help you get started in machine learning by implementing powerful learning
algorithms. In my opinion, the use of practical code examples serve an important
purpose. They illustrate the concepts by putting the learned material directly into
action. However, remember that with great power comes great responsibility! The
concepts behind machine learning are too beautiful and important to be hidden in
a black box. Thus, my personal mission is to provide you with a different book; a
book that discusses the necessary details regarding machine learning concepts, offers
intuitive yet informative explanations on how machine learning algorithms work,
how to use them, and most importantly, how to avoid the most common pitfalls.
If you type "machine learning" as a search term in Google Scholar, it returns an
overwhelmingly large number of results: 1,800,000 publications. Of course, we cannot discuss
all the nitty-gritty details about all the different algorithms and applications that have
emerged in the last 60 years. However, in this book, we will embark on an exciting
journey that covers all the essential topics and concepts to give you a head start in this
field. If you find that your thirst for knowledge is not satisfied, there are many useful
resources that can be used to follow up on the essential breakthroughs in this field.
If you have already studied machine learning theory in detail, this book will show
you how to put your knowledge into practice. If you have used machine learning
techniques before and want to gain more insight into how machine learning really
works, this book is for you! Don't worry if you are completely new to the machine
learning field; you have even more reason to be excited. I promise you that machine
learning will change the way you think about the problems you want to solve and
will show you how to tackle them by unlocking the power of data.
Before we dive deeper into the machine learning field, let me answer your most
important question, "why Python?" The answer is simple: it is powerful yet very
accessible. Python has become the most popular programming language for data
science because it allows us to forget about the tedious parts of programming and
offers us an environment where we can quickly jot down our ideas and put concepts
directly into action.
Reflecting on my personal journey, I can truly say that the study of machine learning
made me a better scientist, thinker, and problem solver. In this book, I want to
share this knowledge with you. Knowledge is gained by learning, the key is our
enthusiasm, and the true mastery of skills can only be achieved by practice. The road
ahead may be bumpy on occasions, and some topics may be more challenging than
others, but I hope that you will embrace this opportunity and focus on the reward.
Remember that we are on this journey together, and throughout this book, we will
add many powerful techniques to your arsenal that will help us solve even the
toughest problems the data-driven way.
Chapter 4, Building Good Training Sets – Data Preprocessing, discusses how to deal with
the most common problems in unprocessed datasets, such as missing data. It also
discusses several approaches to identify the most informative features in datasets
and teaches you how to prepare variables of different types as proper inputs for
machine learning algorithms.
Chapter 5, Compressing Data via Dimensionality Reduction, describes the essential
techniques to reduce the number of features in a dataset to smaller sets while
retaining most of their useful and discriminatory information. It discusses the
standard approach to dimensionality reduction via principal component analysis
and compares it to supervised and nonlinear transformation techniques.
Chapter 6, Learning Best Practices for Model Evaluation and Hyperparameter Tuning,
discusses the do's and don'ts for estimating the performances of predictive models.
Moreover, it discusses different metrics for measuring the performance of our
models and techniques to fine-tune machine learning algorithms.
Chapter 7, Combining Different Models for Ensemble Learning, introduces you to the
different concepts of combining multiple learning algorithms effectively. It teaches
you how to build ensembles of experts to overcome the weaknesses of individual
learners, resulting in more accurate and reliable predictions.
Chapter 8, Applying Machine Learning to Sentiment Analysis, discusses the essential
steps to transform textual data into meaningful representations for machine learning
algorithms to predict the opinions of people based on their writing.
Chapter 9, Embedding a Machine Learning Model into a Web Application, continues with
the predictive model from the previous chapter and walks you through the essential
steps of developing web applications with embedded machine learning models.
Chapter 10, Predicting Continuous Target Variables with Regression Analysis, discusses
the essential techniques for modeling linear relationships between target and
response variables to make predictions on a continuous scale. After introducing
different linear models, it also talks about polynomial regression and
tree-based approaches.
Chapter 11, Working with Unlabeled Data – Clustering Analysis, shifts the focus to a
different subarea of machine learning, unsupervised learning. We apply algorithms
from three fundamental families of clustering algorithms to find groups of objects
that share a certain degree of similarity.
Chapter 12, Training Artificial Neural Networks for Image Recognition, extends the
concept of gradient-based optimization, which we first introduced in Chapter 2,
Training Machine Learning Algorithms for Classification, to build powerful, multilayer
neural networks based on the popular backpropagation algorithm.
Chapter 13, Parallelizing Neural Network Training with Theano, builds upon the
knowledge from the previous chapter to provide you with a practical guide for
training neural networks more efficiently. The focus of this chapter is on Theano, an
open source Python library that allows us to utilize multiple cores of modern GPUs.
Chapter 3 – A Tour of Machine Learning Classifiers Using Scikit-learn
We will assign the petal length and petal width of the 150 flower samples to the feature
matrix X and the corresponding class labels of the flower species to the vector y:
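A minimal sketch of the loading, splitting, and standardization steps described in this section (the exact listing may differ in details; the test_size and random_state values here are assumptions):

>>> from sklearn import datasets
>>> from sklearn.model_selection import train_test_split  # older scikit-learn: sklearn.cross_validation
>>> from sklearn.preprocessing import StandardScaler
>>> import numpy as np
>>> iris = datasets.load_iris()
>>> X = iris.data[:, [2, 3]]   # petal length and petal width
>>> y = iris.target
>>> # hold out 30 percent of the samples as a test set (assumed split: 105 training, 45 test samples)
>>> X_train, X_test, y_train, y_test = train_test_split(
...     X, y, test_size=0.3, random_state=0)
>>> sc = StandardScaler()
>>> sc.fit(X_train)             # estimate mean and standard deviation per feature
>>> X_train_std = sc.transform(X_train)
>>> X_test_std = sc.transform(X_test)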
Using the preceding code, we loaded the StandardScaler class from the
preprocessing module and initialized a new StandardScaler object that we assigned
to the variable sc. Using the fit method, StandardScaler estimated the parameters
μ (sample mean) and σ (standard deviation) for each feature dimension from the
training data. By calling the transform method, we then standardized the training
data using those estimated parameters μ and σ. Note that we used the same
scaling parameters to standardize the test set so that the values in the training
and test dataset are comparable to each other.
Having standardized the training data, we can now train a perceptron model. Most
algorithms in scikit-learn already support multiclass classification by default via the
One-vs.-Rest (OvR) method, which allows us to feed the three flower classes to the
perceptron all at once. The code is as follows:
>>> from sklearn.linear_model import Perceptron
>>> ppn = Perceptron(n_iter=40, eta0=0.1, random_state=0)
>>> ppn.fit(X_train_std, y_train)
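Having trained the perceptron, we can make predictions on the held-out test set and count the misclassified samples. A minimal sketch (the exact print statement in the original listing may differ):

>>> y_pred = ppn.predict(X_test_std)
>>> print('Misclassified samples: %d' % (y_test != y_pred).sum())
Misclassified samples: 4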
On executing the preceding code, we see that the perceptron misclassifies 4 out of the
45 flower samples. Thus, the misclassification error on the test dataset is 0.089, or 8.9
percent (4 / 45 ≈ 0.089).
Instead of the misclassification error, many machine learning
practitioners report the classification accuracy of a model, which is
simply calculated as follows:
1 - misclassification error = 0.911 or 91.1 percent.
Scikit-learn also implements a large variety of different performance metrics that are
available via the metrics module. For example, we can calculate the classification
accuracy of the perceptron on the test set as follows:
>>> from sklearn.metrics import accuracy_score
>>> print('Accuracy: %.2f' % accuracy_score(y_test, y_pred))
0.91
Here, y_test are the true class labels and y_pred are the class labels that we
predicted previously.
Note that we evaluate the performance of our models based
on the test set in this chapter. In Chapter 6, Learning Best Practices for
Model Evaluation and Hyperparameter Tuning, you will learn about useful techniques,
including graphical analysis such as learning curves, to detect
and prevent overfitting. Overfitting means that the model
captures the patterns in the training data well, but fails to
generalize well to unseen data.
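The decision region plots in this chapter are created with a small plot_decision_regions helper function. A minimal sketch of such a helper, of the combined training and test arrays that it is called with later, and of the plotting call for the perceptron (marker, color, and styling choices are assumptions):

>>> from matplotlib.colors import ListedColormap
>>> import matplotlib.pyplot as plt
>>> def plot_decision_regions(X, y, classifier, test_idx=None, resolution=0.02):
...     # setup marker generator and color map
...     markers = ('s', 'x', 'o', '^', 'v')
...     colors = ('red', 'blue', 'lightgreen', 'gray', 'cyan')
...     cmap = ListedColormap(colors[:len(np.unique(y))])
...     # plot the decision surface on a grid spanning both feature ranges
...     x1_min, x1_max = X[:, 0].min() - 1, X[:, 0].max() + 1
...     x2_min, x2_max = X[:, 1].min() - 1, X[:, 1].max() + 1
...     xx1, xx2 = np.meshgrid(np.arange(x1_min, x1_max, resolution),
...                            np.arange(x2_min, x2_max, resolution))
...     Z = classifier.predict(np.array([xx1.ravel(), xx2.ravel()]).T)
...     Z = Z.reshape(xx1.shape)
...     plt.contourf(xx1, xx2, Z, alpha=0.4, cmap=cmap)
...     plt.xlim(xx1.min(), xx1.max())
...     plt.ylim(xx2.min(), xx2.max())
...     # plot the samples, class by class
...     for idx, cl in enumerate(np.unique(y)):
...         plt.scatter(x=X[y == cl, 0], y=X[y == cl, 1],
...                     alpha=0.8, c=[cmap(idx)],
...                     marker=markers[idx], label=cl)
...     # highlight the test samples with unfilled circles
...     if test_idx:
...         X_test = X[test_idx, :]
...         plt.scatter(X_test[:, 0], X_test[:, 1],
...                     facecolors='none', edgecolors='black',
...                     alpha=1.0, linewidths=1, marker='o',
...                     s=55, label='test set')
>>> X_combined_std = np.vstack((X_train_std, X_test_std))
>>> y_combined = np.hstack((y_train, y_test))
>>> plot_decision_regions(X=X_combined_std, y=y_combined,
...                       classifier=ppn, test_idx=range(105, 150))
>>> plt.xlabel('petal length [standardized]')
>>> plt.ylabel('petal width [standardized]')
>>> plt.legend(loc='upper left')
>>> plt.show()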
As we can see in the resulting plot, the three flower classes cannot be perfectly
separated by a linear decision boundary:
Logistic regression (which, despite its name, is a model for classification rather than regression) is based on the odds ratio: the odds in favor of a particular event. The odds ratio can be written as p / (1 - p), where p stands for the probability of the positive
event. The term positive event does not necessarily mean good, but refers to the event
that we want to predict, for example, the probability that a patient has a certain
disease; we can think of the positive event as class label y = 1 . We can then further
define the logit function, which is simply the logarithm of the odds ratio (log-odds):
logit(p) = log( p / (1 - p) )
The logit function takes input values in the range 0 to 1 and transforms them to
values over the entire real number range, which we can use to express a linear
relationship between feature values and the log-odds:
logit( p(y=1 | x) ) = w_0 x_0 + w_1 x_1 + ... + w_m x_m = Σ_{i=0}^{m} w_i x_i = w^T x

Here, p(y=1 | x) is the conditional probability that a particular sample belongs to class 1 given its features x. What we are actually interested in is predicting that probability, which is obtained via the inverse form of the logit function. It is called the logistic function, often simply abbreviated as the sigmoid function due to its characteristic S-shape:

φ(z) = 1 / (1 + e^(-z))
Here, z is the net input, that is, the linear combination of weights and sample features
and can be calculated as z = w^T x = w_0 + w_1 x_1 + ... + w_m x_m.
Now let's simply plot the sigmoid function for some values in the range -7 to 7 to see
what it looks like:
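A short sketch of such a plot, assuming matplotlib and NumPy (the styling details are assumptions):

>>> import matplotlib.pyplot as plt
>>> import numpy as np
>>> def sigmoid(z):
...     return 1.0 / (1.0 + np.exp(-z))
>>> z = np.arange(-7, 7, 0.1)
>>> phi_z = sigmoid(z)
>>> plt.plot(z, phi_z)
>>> plt.axvline(0.0, color='k')                 # mark z = 0
>>> plt.axhline(y=0.5, ls='dotted', color='k')  # mark the 0.5 threshold
>>> plt.yticks([0.0, 0.5, 1.0])
>>> plt.ylim(-0.1, 1.1)
>>> plt.xlabel('z')
>>> plt.ylabel(r'$\phi(z)$')
>>> plt.show()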
As a result of executing the previous code example, we should now see the S-shaped
(sigmoidal) curve:
To build some intuition for the logistic regression model, we can relate it to our
previous Adaline implementation in Chapter 2, Training Machine Learning Algorithms
for Classification. In Adaline, we used the identity function φ(z) = z as the activation
function. In logistic regression, this activation function simply becomes the sigmoid
function that we defined earlier, which is illustrated in the following figure:
[Figure: the logistic regression architecture. The inputs x_1, ..., x_m (plus the constant input 1) are weighted by w_0, w_1, ..., w_m and summed by the net input function; the net input z is passed through the sigmoid function to produce the output y, a quantizer converts it into a binary class label, and the error is used to update the weights.]
The output of the sigmoid function is then interpreted as the probability of a particular
sample belonging to class 1, φ(z) = P(y=1 | x; w), given its features x parameterized by
the weights w. For example, if we compute φ(z) = 0.8 for a particular flower sample,
it means that the chance that this sample is an Iris-Versicolor flower is 80 percent.
Similarly, the probability that this flower is an Iris-Setosa flower can be calculated as
P(y=0 | x; w) = 1 - P(y=1 | x; w) = 0.2, or 20 percent. The predicted probability can then
simply be converted into a binary outcome via a quantizer (unit step function):

ŷ = 1 if φ(z) ≥ 0.5
ŷ = 0 otherwise
If we look at the preceding sigmoid plot, this is equivalent to the following:
ŷ = 1 if z ≥ 0.0
ŷ = 0 otherwise
In fact, there are many applications where we are not only interested in the predicted
class labels, but where estimating the class-membership probability is particularly
useful. Logistic regression is used in weather forecasting, for example, to not
only predict if it will rain on a particular day but also to report the chance of rain.
Similarly, logistic regression can be used to predict the chance that a patient has a
particular disease given certain symptoms, which is why logistic regression enjoys
wide popularity in the field of medicine.
Recall the sum-squared-error cost function that we defined for Adaline in Chapter 2:

J(w) = Σ_i (1/2) ( φ(z^(i)) - y^(i) )^2
We minimized this in order to learn the weights w for our Adaline classification
model. To explain how we can derive the cost function for logistic regression,
let's first define the likelihood L that we want to maximize when we build a
logistic regression model, assuming that the individual samples in our dataset are
independent of one another. The formula is as follows:
L(w) = P(y | x; w) = Π_{i=1}^{n} P( y^(i) | x^(i); w ) = Π_{i=1}^{n} ( φ(z^(i)) )^(y^(i)) ( 1 - φ(z^(i)) )^(1 - y^(i))
In practice, it is easier to maximize the (natural) log of this equation, which is called
the log-likelihood function:
l(w) = log L(w) = Σ_{i=1}^{n} [ y^(i) log( φ(z^(i)) ) + (1 - y^(i)) log( 1 - φ(z^(i)) ) ]
Firstly, applying the log function reduces the potential for numerical underflow,
which can occur if the likelihoods are very small. Secondly, we can convert the
product of factors into a summation of factors, which makes it easier to obtain
the derivative of this function via the addition trick, as you may remember
from calculus.
Now we could use an optimization algorithm such as gradient ascent to maximize
this log-likelihood function. Alternatively, let's rewrite the log-likelihood as a cost
function J that can be minimized using gradient descent as in Chapter 2, Training
Machine Learning Algorithms for Classification:
J(w) = Σ_{i=1}^{n} [ - y^(i) log( φ(z^(i)) ) - (1 - y^(i)) log( 1 - φ(z^(i)) ) ]
To get a better grasp on this cost function, let's take a look at the cost that we
calculate for one single-sample instance:
J( φ(z), y; w ) = - y log( φ(z) ) - (1 - y) log( 1 - φ(z) )
Looking at the preceding equation, we can see that the first term becomes zero if
y = 0, and the second term becomes zero if y = 1:

J( φ(z), y; w ) = - log( φ(z) )        if y = 1
J( φ(z), y; w ) = - log( 1 - φ(z) )    if y = 0
The following plot illustrates the cost for the classification of a single-sample instance
for different values of φ(z):
We can see that the cost approaches 0 (plain blue line) if we correctly predict that
a sample belongs to class 1. Similarly, we can see on the y axis that the cost also
approaches 0 if we correctly predict y = 0 (dashed line). However, if the prediction
is wrong, the cost goes towards infinity. The moral is that we penalize wrong
predictions with an increasingly larger cost.
If we were to implement logistic regression ourselves, we could simply substitute the
cost function J in our Adaline implementation from Chapter 2, Training Machine Learning
Algorithms for Classification, with the new cost function:

J(w) = - Σ_i [ y^(i) log( φ(z^(i)) ) + (1 - y^(i)) log( 1 - φ(z^(i)) ) ]
This would compute the cost of classifying all training samples per epoch and we
would end up with a working logistic regression model. However, since scikit-learn
implements a highly optimized version of logistic regression that also supports
multiclass settings off-the-shelf, we will skip the implementation and use the
sklearn.linear_model.LogisticRegression class as well as the familiar fit
method to train the model on the standardized flower training dataset:
>>> from sklearn.linear_model import LogisticRegression
>>> lr = LogisticRegression(C=1000.0, random_state=0)
>>> lr.fit(X_train_std, y_train)
>>> plot_decision_regions(X_combined_std, y_combined,
...                       classifier=lr, test_idx=range(105, 150))
>>> plt.xlabel('petal length [standardized]')
>>> plt.ylabel('petal width [standardized]')
>>> plt.legend(loc='upper left')
>>> plt.show()
After fitting the model on the training data, we plotted the decision regions, training
samples and test samples, as shown here:
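Furthermore, we can estimate the class-membership probabilities of a sample via the predict_proba method. For example, a call along the following lines (the exact sample indexing in the original listing is an assumption) returns the probabilities of the first test sample for each of the three classes:

>>> lr.predict_proba(X_test_std[0, :].reshape(1, -1))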
array([[ 0.000, 0.063, 0.937 ]])
The preceding array tells us that the model predicts a chance of 93.7 percent that the
sample belongs to the Iris-Virginica class, and a 6.3 percent chance that the sample is
an Iris-Versicolor flower.
We can show that the weight update in logistic regression via gradient descent is
indeed equal to the equation that we used in Adaline in Chapter 2, Training Machine
Learning Algorithms for Classification. Let's start by calculating the partial derivative of
the log-likelihood function with respect to the jth weight:
∂l(w)/∂w_j = ( y · 1/φ(z) - (1 - y) · 1/(1 - φ(z)) ) ∂φ(z)/∂w_j
Before we continue, let's calculate the partial derivative of the sigmoid function first:
∂φ(z)/∂z = ∂/∂z [ 1 / (1 + e^(-z)) ] = e^(-z) / (1 + e^(-z))^2
         = ( 1 / (1 + e^(-z)) ) ( 1 - 1 / (1 + e^(-z)) )
         = φ(z) ( 1 - φ(z) )
Now we can resubstitute ∂φ(z)/∂z = φ(z)(1 - φ(z)) in our first equation to obtain the following:

( y · 1/φ(z) - (1 - y) · 1/(1 - φ(z)) ) ∂φ(z)/∂w_j
  = ( y · 1/φ(z) - (1 - y) · 1/(1 - φ(z)) ) φ(z) (1 - φ(z)) ∂z/∂w_j
  = ( y (1 - φ(z)) - (1 - y) φ(z) ) x_j
  = ( y - φ(z) ) x_j
Remember that the goal is to find the weights that maximize the log-likelihood so
that we would perform the update for each weight as follows:
w_j := w_j + η Σ_{i=1}^{n} ( y^(i) - φ(z^(i)) ) x_j^(i)
Since we update all weights simultaneously, we can write the general update rule
as follows:
w := w + Δw

We define Δw as follows:

Δw = η ∇l(w)
Since maximizing the log-likelihood is equal to minimizing the cost function J that
we defined earlier, we can write the gradient descent update rule as follows:
Δw_j = -η ∂J/∂w_j = η Σ_{i=1}^{n} ( y^(i) - φ(z^(i)) ) x_j^(i)

w := w + Δw,   Δw = -η ∇J(w)
This is equal to the gradient descent rule in Adaline in Chapter 2, Training Machine
Learning Algorithms for Classification.
Although we have only encountered linear models for classification so far, the
problem of overfitting and underfitting can be best illustrated by using a more
complex, nonlinear decision boundary as shown in the following figure:
[Figure: examples of an underfitted, a well-fitted, and an overfitted decision boundary for the same dataset.]

The most common form of regularization is the so-called L2 regularization (sometimes also called L2 shrinkage or weight decay), which can be written as follows:

(λ/2) ‖w‖^2 = (λ/2) Σ_{j=1}^{m} w_j^2

Here, λ is the so-called regularization parameter.
In order to apply regularization, we just need to add the regularization term to the
cost function that we defined for logistic regression to shrink the weights:
J(w) = Σ_{i=1}^{n} [ - y^(i) log( φ(z^(i)) ) - (1 - y^(i)) log( 1 - φ(z^(i)) ) ] + (λ/2) ‖w‖^2
Via the regularization parameter λ, we can then control how well we fit the training
data while keeping the weights small. By increasing the value of λ, we increase the
regularization strength.
The parameter C that is implemented for the LogisticRegression class in
scikit-learn comes from a convention in support vector machines, which will be
the topic of the next section. C is directly related to the regularization parameter λ,
which is its inverse:

C = 1 / λ
So, we can rewrite the regularized cost function of logistic regression as follows:

J(w) = C Σ_{i=1}^{n} [ - y^(i) log( φ(z^(i)) ) - (1 - y^(i)) log( 1 - φ(z^(i)) ) ] + (1/2) ‖w‖^2
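The listing that produces the following plot fits one logistic regression model per value of C and records the resulting weight coefficients. A minimal sketch, assuming ten values of C spaced on a log scale and the coefficients of one of the one-vs.-rest classifiers; the index lr.coef_[1] and the plotting details are assumptions:

>>> weights, params = [], []
>>> for c in np.arange(-5, 5):
...     lr = LogisticRegression(C=10.0**c, random_state=0)
...     lr.fit(X_train_std, y_train)
...     weights.append(lr.coef_[1])   # coefficients of the class 2 vs. rest classifier (index is an assumption)
...     params.append(10.0**c)
>>> weights = np.array(weights)
>>> plt.plot(params, weights[:, 0], label='petal length')
>>> plt.plot(params, weights[:, 1], linestyle='--', label='petal width')
>>> plt.ylabel('weight coefficient')
>>> plt.xlabel('C')
>>> plt.legend(loc='upper left')
>>> plt.xscale('log')
>>> plt.show()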
By executing the preceding code, we fitted ten logistic regression models with
different values for the inverse-regularization parameter C. For the purposes of
illustration, we only collected the weight coefficients of the class 2 vs. all classifier.
Remember that we are using the OvR technique for multiclass classification.
As we can see in the resulting plot, the weight coefficients shrink if we decrease the
parameter C, that is, if we increase the regularization strength:
Turning to maximum margin classification with support vector machines (SVM), the positive and negative hyperplanes that run parallel to the decision boundary can be expressed as follows:

w_0 + w^T x_pos = 1     (1)

w_0 + w^T x_neg = -1    (2)
If we subtract those two linear equations (1) and (2) from each other, we get:
w^T ( x_pos - x_neg ) = 2
We can normalize this by the length of the vector w, which is defined as follows:
‖w‖ = sqrt( Σ_{j=1}^{m} w_j^2 )
So we arrive at the following equation:

w^T ( x_pos - x_neg ) / ‖w‖ = 2 / ‖w‖
The left side of the preceding equation can then be interpreted as the distance
between the positive and negative hyperplane, which is the so-called margin that we
want to maximize.
Now the objective function of the SVM becomes the maximization of this margin
by maximizing 2/‖w‖ under the constraint that the samples are classified correctly,
which can be written as follows:

w_0 + w^T x^(i) ≥ 1    if y^(i) = 1

w_0 + w^T x^(i) ≤ -1   if y^(i) = -1
These two equations basically say that all negative samples should fall on one side
of the negative hyperplane, whereas all the positive samples should fall behind the
positive hyperplane. This can also be written more compactly as follows:
y^(i) ( w_0 + w^T x^(i) ) ≥ 1   for all i
In practice, though, it is easier to minimize the reciprocal term (1/2)‖w‖^2, which can be
solved by quadratic programming. However, a detailed discussion about quadratic
programming is beyond the scope of this book, but if you are interested, you can
learn more about Support Vector Machines (SVM) in Vladimir Vapnik's The Nature
of Statistical Learning Theory, Springer Science & Business Media, or Chris J.C. Burges'
excellent explanation in A Tutorial on Support Vector Machines for Pattern Recognition
(Data Mining and Knowledge Discovery, 2(2):121-167, 1998).
For linearly inseparable data, the positive-valued slack variable ξ^(i) is simply added to the linear constraints:

w^T x^(i) ≥ 1 - ξ^(i)    if y^(i) = 1

w^T x^(i) ≤ -1 + ξ^(i)   if y^(i) = -1
So the new objective to be minimized (subject to the preceding constraints) becomes:
(1/2) ‖w‖^2 + C Σ_i ξ^(i)
Using the variable C, we can then control the penalty for misclassification. Large
values of C correspond to large error penalties whereas we are less strict about
misclassification errors if we choose smaller values for C. We can then use the
parameter C to control the width of the margin and therefore tune the bias-variance
trade-off as illustrated in the following figure:
Now that we have learned the basic concepts behind a linear SVM, let's train an SVM
model to classify the different flowers in our Iris dataset:
>>> from sklearn.svm import SVC
>>> svm = SVC(kernel='linear', C=1.0, random_state=0)
>>> svm.fit(X_train_std, y_train)
>>> plot_decision_regions(X_combined_std, y_combined,
...                       classifier=svm, test_idx=range(105, 150))
>>> plt.xlabel('petal length [standardized]')
>>> plt.ylabel('petal width [standardized]')
>>> plt.legend(loc='upper left')
>>> plt.show()
The decision regions of the SVM visualized after executing the preceding code
example are shown in the following plot:
Another appealing property of SVMs is that they can be kernelized to solve nonlinear classification problems. Before we discuss the main idea behind the kernel trick, let's create a simple dataset that has the form of an XOR gate, using the logical_xor function from NumPy:
>>> np.random.seed(0)
>>> X_xor = np.random.randn(200, 2)
>>> y_xor = np.logical_xor(X_xor[:, 0] > 0, X_xor[:, 1] > 0)
>>> y_xor = np.where(y_xor, 1, -1)
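A scatter plot of the two classes can then be produced along the following lines (marker and color choices are assumptions):

>>> plt.scatter(X_xor[y_xor == 1, 0], X_xor[y_xor == 1, 1],
...             c='b', marker='x', label='1')
>>> plt.scatter(X_xor[y_xor == -1, 0], X_xor[y_xor == -1, 1],
...             c='r', marker='s', label='-1')
>>> plt.legend(loc='best')
>>> plt.show()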
After executing the code, we will have an XOR dataset with random noise,
as shown in the following figure:
Obviously, we would not be able to separate samples from the positive and negative
class very well using a linear hyperplane as the decision boundary via the linear
logistic regression or linear SVM model that we discussed in earlier sections.
The basic idea behind kernel methods to deal with such linearly inseparable data
is to create nonlinear combinations of the original features to project them onto a
higher dimensional space via a mapping function φ(·) where it becomes linearly
separable. As shown in the next figure, we can transform a two-dimensional dataset
onto a new three-dimensional feature space where the classes become separable via
the following projection:
φ(x_1, x_2) = (z_1, z_2, z_3) = (x_1, x_2, x_1^2 + x_2^2)
This allows us to separate the two classes shown in the plot via a linear hyperplane
that becomes a nonlinear decision boundary if we project it back onto the original
feature space:
In order to train an SVM in this higher dimensional space without computing the mapping φ explicitly, we replace the dot product between two mapped samples by a kernel function:

k( x^(i), x^(j) ) = φ( x^(i) )^T φ( x^(j) )
One of the most widely used kernels is the Radial Basis Function kernel
(RBF kernel) or Gaussian kernel:
k( x^(i), x^(j) ) = exp( - ‖x^(i) - x^(j)‖^2 / (2σ^2) )

This is often simplified to:

k( x^(i), x^(j) ) = exp( - γ ‖x^(i) - x^(j)‖^2 )

Here, γ = 1 / (2σ^2) is a free parameter that is to be optimized.
Roughly speaking, the term kernel can be interpreted as a similarity function between
a pair of samples. The minus sign inverts the distance measure into a similarity score
and, due to the exponential term, the resulting similarity score will fall into a range
between 1 (for exactly similar samples) and 0 (for very dissimilar samples).
Now that we defined the big picture behind the kernel trick, let's see if we can train
a kernel SVM that is able to draw a nonlinear decision boundary that separates the
XOR data well. Here, we simply use the SVC class from scikit-learn that we imported
earlier and replace the parameter kernel='linear' with kernel='rbf':
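A sketch along those lines (the gamma and C values shown are assumptions):

>>> svm = SVC(kernel='rbf', random_state=0, gamma=0.10, C=10.0)
>>> svm.fit(X_xor, y_xor)
>>> plot_decision_regions(X_xor, y_xor, classifier=svm)
>>> plt.legend(loc='upper left')
>>> plt.show()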
As we can see in the resulting plot, the kernel SVM separates the XOR data
relatively well:
Let's also apply an RBF kernel SVM to our standardized Iris training data (the gamma and C values below are assumptions; the γ value is deliberately small):

>>> svm = SVC(kernel='rbf', random_state=0, gamma=0.2, C=1.0)  # gamma and C values are assumptions
>>> svm.fit(X_train_std, y_train)
>>> plot_decision_regions(X_combined_std, y_combined,
...                       classifier=svm, test_idx=range(105, 150))
>>> plt.xlabel('petal length [standardized]')
>>> plt.ylabel('petal width [standardized]')
>>> plt.legend(loc='upper left')
>>> plt.show()
Since we chose a relatively small value for γ, the resulting decision boundary of the
RBF kernel SVM model will be relatively soft, as shown in the following figure:

Now let's increase the value of γ and observe the effect on the decision boundary:
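For example, we can use a much larger value of gamma (the specific value below is an assumption; any sufficiently large value shows the effect):

>>> svm = SVC(kernel='rbf', random_state=0, gamma=100.0, C=1.0)
>>> svm.fit(X_train_std, y_train)
>>> plot_decision_regions(X_combined_std, y_combined,
...                       classifier=svm, test_idx=range(105, 150))
>>> plt.xlabel('petal length [standardized]')
>>> plt.ylabel('petal width [standardized]')
>>> plt.legend(loc='upper left')
>>> plt.show()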
In the resulting plot, we can now see that the decision boundary around the classes 0
and 1 is much tighter using a relatively large value of γ:

Although the model fits the training dataset very well, such a classifier will
likely have a high generalization error on unseen data, which illustrates that the
optimization of γ also plays an important role in controlling overfitting.
Decision tree classifiers are attractive models if we care about interpretability: the model learns a series of questions that break down our data to arrive at a decision.
Let's consider the following example where we use a decision tree to decide upon an
activity on a particular day:
[Figure: a decision tree for choosing an activity. The root node asks "Work to do?"; if Yes, the decision is "Stay in"; if No, the next node asks "Outlook?". Sunny leads to "Go to beach", Overcast leads to "Go running", and Rainy leads to the node "Friends busy?", where Yes means "Stay in" and No means "Go to movies".]
Based on the features in our training set, the decision tree model learns a series of
questions to infer the class labels of the samples. Although the preceding figure
illustrates the concept of a decision tree based on categorical variables, the same
concept applies if our features are real numbers, like in the Iris dataset. For example,
we could simply define a cut-off value along the sepal width feature axis and ask the
binary question "Is sepal width ≥ 2.8 cm?"
Using the decision algorithm, we start at the tree root and split the data on the
feature that results in the largest information gain (IG), which will be explained in
more detail in the following section. In an iterative process, we can then repeat this
splitting procedure at each child node until the leaves are pure. This means that the
samples at each node all belong to the same class. In practice, this can result in a very
deep tree with many nodes, which can easily lead to overfitting. Thus, we typically
want to prune the tree by setting a limit for the maximal depth of the tree.
In order to split the nodes at the most informative features, we need to define an
objective function that we want to optimize via the tree learning algorithm. Here, our
objective function is to maximize the information gain at each split, which we define
as follows:

IG(D_p, f) = I(D_p) - Σ_{j=1}^{m} (N_j / N_p) I(D_j)
Here, f is the feature to perform the split, D p and D j are the dataset of the parent
and jth child node, I is our impurity measure, N p is the total number of samples at
the parent node, and N j is the number of samples in the jth child node. As we can
see, the information gain is simply the difference between the impurity of the parent
node and the sum of the child node impurities: the lower the impurity of the child
nodes, the larger the information gain. However, for simplicity and to reduce the
combinatorial search space, most libraries (including scikit-learn) implement binary
decision trees. This means that each parent node is split into two child nodes, Dleft
and Dright :
IG(D_p, f) = I(D_p) - (N_left / N_p) I(D_left) - (N_right / N_p) I(D_right)
Now, the three impurity measures or splitting criteria that are commonly used in
binary decision trees are Gini impurity (I_G), entropy (I_H), and the classification
error (I_E). Let's start with the definition of entropy for all non-empty classes
(p(i|t) ≠ 0):

I_H(t) = - Σ_{i=1}^{c} p(i|t) log_2 p(i|t)
Here, p(i|t) is the proportion of the samples that belong to class i for a particular
node t. The entropy is therefore 0 if all samples at a node belong to the same class,
and the entropy is maximal if we have a uniform class distribution. For example, in
a binary class setting, the entropy is 0 if p(i=1|t) = 1 or p(i=1|t) = 0. If the classes are
distributed uniformly with p(i=1|t) = 0.5 and p(i=0|t) = 0.5, the entropy is 1. Therefore,
we can say that the entropy criterion attempts to maximize the mutual information
in the tree.
Intuitively, the Gini impurity can be understood as a criterion to minimize the
probability of misclassification:
I_G(t) = Σ_{i=1}^{c} p(i|t) ( 1 - p(i|t) ) = 1 - Σ_{i=1}^{c} p(i|t)^2
Similar to entropy, the Gini impurity is maximal if the classes are perfectly mixed,
for example, in a binary class setting ( c = 2 ):
I_G(t) = 1 - Σ_{i=1}^{c} 0.5^2 = 0.5
However, in practice both the Gini impurity and entropy typically yield very similar
results and it is often not worth spending much time on evaluating trees using
different impurity criteria rather than experimenting with different pruning cut-offs.
Another impurity measure is the classification error:
I_E = 1 - max{ p(i|t) }
This is a useful criterion for pruning but not recommended for growing a decision
tree, since it is less sensitive to changes in the class probabilities of the nodes. We
can illustrate this by looking at the two possible splitting scenarios shown in the
following figure:
[Figure: two possible splits of the same parent dataset. In scenario A, the parent node (40, 40) is split into the child nodes (30, 10) and (10, 30); in scenario B, the parent node (40, 40) is split into the child nodes (20, 40) and (20, 0).]
We start with a dataset D_p at the parent node that consists of 40 samples from
class 1 and 40 samples from class 2, and we split it into two datasets, D_left and D_right,
respectively. The information gain using the classification error as a splitting
criterion would be the same (IG_E = 0.25) in both scenarios A and B:
I_E(D_p) = 1 - 0.5 = 0.5

A: I_E(D_left)  = 1 - 3/4 = 0.25
A: I_E(D_right) = 1 - 3/4 = 0.25
A: IG_E = 0.5 - (4/8) 0.25 - (4/8) 0.25 = 0.25

B: I_E(D_left)  = 1 - 4/6 = 1/3
B: I_E(D_right) = 1 - 1 = 0
B: IG_E = 0.5 - (6/8) (1/3) - 0 = 0.25
[ 84 ]
Chapter 3
However, the Gini impurity would favor the split in scenario B (IG_G ≈ 0.17) over
scenario A (IG_G = 0.125), which is indeed more pure:
A: I_G(D_left)  = 1 - ( (3/4)^2 + (1/4)^2 ) = 3/8 = 0.375
A: I_G(D_right) = 1 - ( (1/4)^2 + (3/4)^2 ) = 3/8 = 0.375
A: IG_G = 0.5 - (4/8) 0.375 - (4/8) 0.375 = 0.125

B: I_G(D_left)  = 1 - ( (2/6)^2 + (4/6)^2 ) = 4/9 ≈ 0.44
B: I_G(D_right) = 1 - ( 1^2 + 0^2 ) = 0
B: IG_G = 0.5 - (6/8) (4/9) - 0 = 1/6 ≈ 0.17
Similarly, the entropy criterion would favor scenario B (IG_H ≈ 0.31) over
scenario A (IG_H ≈ 0.19):
I_H(D_p) = - ( 0.5 log_2(0.5) + 0.5 log_2(0.5) ) = 1

A: I_H(D_left)  = - ( (3/4) log_2(3/4) + (1/4) log_2(1/4) ) ≈ 0.81
A: I_H(D_right) = - ( (1/4) log_2(1/4) + (3/4) log_2(3/4) ) ≈ 0.81
A: IG_H = 1 - (4/8) 0.81 - (4/8) 0.81 = 0.19

B: I_H(D_left)  = - ( (2/6) log_2(2/6) + (4/6) log_2(4/6) ) ≈ 0.92
B: I_H(D_right) = 0
B: IG_H = 1 - (6/8) 0.92 - 0 = 0.31
For a more visual comparison of the three different impurity criteria that we
discussed previously, let's plot the impurity indices for the probability range [0, 1]
for class 1. Note that we will also add in a scaled version of the entropy (entropy/2) to
observe that the Gini impurity is an intermediate measure between entropy and the
classification error. The code is as follows:
>>> import matplotlib.pyplot as plt
>>> import numpy as np
>>> # The impurity functions below are a reconstruction consistent with the text;
>>> # the original listing may differ in details.
>>> def gini(p):
...     return p*(1 - p) + (1 - p)*(1 - (1 - p))
>>> def entropy(p):
...     return - p*np.log2(p) - (1 - p)*np.log2(1 - p)
>>> def error(p):
...     return 1 - np.max([p, 1 - p])
>>> x = np.arange(0.0, 1.0, 0.01)
>>> ent = [entropy(p) if p != 0 else None for p in x]
>>> sc_ent = [e*0.5 if e else None for e in ent]     # scaled entropy (entropy/2)
>>> err = [error(i) for i in x]
>>> fig = plt.figure()
>>> ax = plt.subplot(111)
>>> for i, lab, ls, c in zip([ent, sc_ent, gini(x), err],
...                          ['Entropy', 'Entropy (scaled)',
...                           'Gini Impurity',
...                           'Misclassification Error'],
...                          ['-', '-', '--', '-.'],
...                          ['black', 'lightgray',
...                           'red', 'green', 'cyan']):
...     line = ax.plot(x, i, label=lab,
...                    linestyle=ls, lw=2, color=c)
>>> ax.legend(loc='upper center', bbox_to_anchor=(0.5, 1.15),
...           ncol=3, fancybox=True, shadow=False)
>>> ax.axhline(y=0.5, linewidth=1, color='k', linestyle='--')
>>> ax.axhline(y=1.0, linewidth=1, color='k', linestyle='--')
>>> plt.ylim([0, 1.1])
>>> plt.xlabel('p(i=1)')
>>> plt.ylabel('Impurity Index')
>>> plt.show()
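Decision trees build their decision boundaries by dividing the feature space into rectangles. To see this, we can train a decision tree on the Iris data via scikit-learn's DecisionTreeClassifier and plot its decision regions; a minimal sketch, assuming the entropy criterion, a maximum depth of 3, and the unstandardized petal measurements (these parameter choices are assumptions consistent with the tree described below):

>>> from sklearn.tree import DecisionTreeClassifier
>>> tree = DecisionTreeClassifier(criterion='entropy', max_depth=3,
...                               random_state=0)
>>> tree.fit(X_train, y_train)
>>> X_combined = np.vstack((X_train, X_test))
>>> y_combined = np.hstack((y_train, y_test))
>>> plot_decision_regions(X_combined, y_combined,
...                       classifier=tree, test_idx=range(105, 150))
>>> plt.xlabel('petal length [cm]')
>>> plt.ylabel('petal width [cm]')
>>> plt.legend(loc='upper left')
>>> plt.show()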
After executing the preceding code example, we get the typical axis-parallel decision
boundaries of the decision tree:
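A nice feature of scikit-learn is that it allows us to export the trained decision tree as a .dot file, which we can then visualize using the GraphViz program. A sketch of the export step (the output filename and feature names are assumptions):

>>> from sklearn.tree import export_graphviz
>>> export_graphviz(tree, out_file='tree.dot',
...                 feature_names=['petal length', 'petal width'])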
After we have installed GraphViz on our computer, we can convert the tree.dot file
into a PNG file by executing the following command from the command line in the
location where we saved the tree.dot file:
> dot -Tpng tree.dot -o tree.png
Looking at the decision tree figure that we created via GraphViz, we can now nicely
trace back the splits that the decision tree determined from our training dataset.
We started with 105 samples at the root and split them into two child nodes with 34
and 71 samples, respectively, using the petal width cut-off ≤ 0.75 cm. After the first split,
we can see that the left child node is already pure and only contains samples from
the Iris-Setosa class (entropy = 0). The further splits on the right are then used to
separate the samples from the Iris-Versicolor and Iris-Virginica classes.
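Random forests combine many individual decision trees into an ensemble. Using scikit-learn's RandomForestClassifier, a forest consistent with the description below (10 trees, entropy criterion, two parallel jobs; the random_state value is an assumption) can be trained and visualized like this:

>>> from sklearn.ensemble import RandomForestClassifier
>>> forest = RandomForestClassifier(criterion='entropy',
...                                 n_estimators=10,
...                                 random_state=1,
...                                 n_jobs=2)
>>> forest.fit(X_train, y_train)
>>> plot_decision_regions(X_combined, y_combined,
...                       classifier=forest, test_idx=range(105, 150))
>>> plt.xlabel('petal length [cm]')
>>> plt.ylabel('petal width [cm]')
>>> plt.legend(loc='upper left')
>>> plt.show()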
After executing the preceding code, we should see the decision regions formed by
the ensemble of trees in the random forest, as shown in the following figure:
Using the preceding code, we trained a random forest from 10 decision trees via the
n_estimators parameter and used the entropy criterion as an impurity measure to
split the nodes. Although we are growing a very small random forest from a very
small training dataset, we used the n_jobs parameter for demonstration purposes,
which allows us to parallelize the model training using multiple cores of our
computer (here, two).
The last supervised learning algorithm that we discuss in this chapter is the k-nearest neighbor (KNN) classifier, a typical example of a lazy learner: instead of learning a discriminative function from the training data, it memorizes the training dataset.
The KNN algorithm itself is fairly straightforward and can be summarized by the
following steps:
1. Choose the number of neighbors, k, and a distance metric.
2. Find the k nearest neighbors of the sample that we want to classify.
3. Assign the class label by majority vote.
The following figure illustrates how a new data point (?) is assigned the triangle class
label based on majority voting among its five nearest neighbors.
Based on the chosen distance metric, the KNN algorithm finds the k samples in the
training dataset that are closest (most similar) to the point that we want to classify.
The class label of the new data point is then determined by a majority vote among
its k nearest neighbors.
The main advantage of such a memory-based approach is that the classifier
immediately adapts as we collect new training data. However, the downside is that
the computational complexity for classifying new samples grows linearly with the
number of samples in the training dataset in the worst-case scenario, unless the
dataset has very few dimensions (features) and the algorithm has been implemented
using efficient data structures such as KD-trees (J. H. Friedman, J. L. Bentley, and R.
A. Finkel. An algorithm for finding best matches in logarithmic expected time. ACM
Transactions on Mathematical Software (TOMS), 3(3):209-226, 1977). Furthermore, we
can't discard training samples since no training step is involved. Thus, storage space
can become a challenge if we are working with large datasets.
By executing the following code, we will now implement a KNN model in
scikit-learn using a Euclidean distance metric:
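A sketch using the parameters referenced in the following text (n_neighbors=5, metric='minkowski', p=2, which corresponds to the Euclidean distance); the plotting boilerplate mirrors the earlier examples:

>>> from sklearn.neighbors import KNeighborsClassifier
>>> knn = KNeighborsClassifier(n_neighbors=5, p=2, metric='minkowski')
>>> knn.fit(X_train_std, y_train)
>>> plot_decision_regions(X_combined_std, y_combined,
...                       classifier=knn, test_idx=range(105, 150))
>>> plt.xlabel('petal length [standardized]')
>>> plt.ylabel('petal width [standardized]')
>>> plt.legend(loc='upper left')
>>> plt.show()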
By specifying five neighbors in the KNN model for this dataset, we obtain a
relatively smooth decision boundary, as shown in the following figure:
The right choice of k is crucial to find a good balance between over- and underfitting.
We also have to make sure that we choose a distance metric that is appropriate for
the features in the dataset. Often, a simple Euclidean distance measure is used for
real-valued samples, for example, the flowers in our Iris dataset, which have features
measured in centimeters. However, if we are using a Euclidean distance measure, it
is also important to standardize the data so that each feature contributes equally to
the distance. The 'minkowski' distance that we used in the previous code is just a
generalization of the Euclidean and Manhattan distance that can be written as follows:
d( x^(i), x^(j) ) = ( Σ_k | x_k^(i) - x_k^(j) |^p )^(1/p)
It becomes the Euclidean distance if we set the parameter p=2, or the Manhattan
distance at p=1, respectively. Many other distance metrics are available in scikit-learn
and can be provided to the metric parameter. They are listed at
http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.DistanceMetric.html.
Summary
In this chapter, you learned about many different machine learning algorithms that are
used to tackle linear and nonlinear problems. We have seen that decision trees are
particularly attractive if we care about interpretability. Logistic regression is not only
a useful model for online learning via stochastic gradient descent, but also allows us
to predict the probability of a particular event. Although support vector machines
are powerful linear models that can be extended to nonlinear problems via the
kernel trick, they have many parameters that have to be tuned in order to make good
predictions. In contrast, ensemble methods such as random forests don't require
much parameter tuning and don't overfit as easily as decision trees, which makes
them attractive models for many practical problem domains. The K-nearest neighbor
classifier offers an alternative approach to classification via lazy learning that allows
us to make predictions without any model training but with a more computationally
expensive prediction step.
However, even more important than the choice of an appropriate learning algorithm
is the available data in our training dataset. No algorithm will be able to make good
predictions without informative and discriminatory features.
In the next chapter, we will discuss important topics regarding the preprocessing
of data, feature selection, and dimensionality reduction, which we will need to
build powerful machine learning models. Later in Chapter 6, Learning Best Practices
for Model Evaluation and Hyperparameter Tuning, we will see how we can evaluate
and compare the performance of our models and learn useful tricks to fine-tune the
different algorithms.