Classification Algorithms I
Supervised learning
Supervised learning is used whenever we want to predict a certain outcome from a given
input, and we have examples of input/output (predictor/response) pairs. We build a
machine learning model from these input/output pairs, which comprise our training
set. Our goal is to make accurate predictions for new, never-before-seen data. Supervised
learning often requires human effort to build the training set, but afterward automates and
often speeds up an otherwise laborious or infeasible task.
Source: Müller, A. C., & Guido, S. (2016). Introduction to Machine Learning with Python: A Guide for Data Scientists (IMLP). O'Reilly Media, Inc.
On the other hand, we should be aware that a larger dataset by itself allows for greater model complexity, moving the sweet spot further to the right on the complexity spectrum. In the
real world, you often have the ability to decide how much data to collect, which might be
more beneficial than tweaking and tuning your model. Never underestimate the power of
more data.
Creating a sample dataset
Scikit-learn has a lot of tools for creating synthetic (meaning made-up) datasets, which
are great for testing machine learning algorithms. I'm going to use the make_blobs function.
I create a synthetic two-class classification dataset which has two features. The following
code creates a dataset and a scatter plot visualizing all of the data points in this dataset. The
plot has the first feature on the x-axis and the second feature on the y-axis. As is always the
case in scatter plots, each data point is represented as one dot. The color of the dot
indicates its class.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
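A minimal sketch of how such a dataset and its scatter plot could be created (the make_blobs parameters below, such as n_samples, centers, and random_state, are illustrative assumptions):

from sklearn.datasets import make_blobs
# Create a synthetic two-class dataset with two features
X, y = make_blobs(n_samples=30, centers=2, n_features=2, random_state=4)
# Scatter plot: first feature on the x-axis, second feature on the y-axis,
# with each point colored by its class
plt.scatter(X[:, 0], X[:, 1], c=y, edgecolors='k')
plt.xlabel("First feature")
plt.ylabel("Second feature")
plt.show()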
Figure 2-2. Scatter plot of the same dataset from the textbook IMLP
k-Nearest Neighbors
The k-NN algorithm is arguably the simplest machine learning algorithm. Building the
model consists only of storing the training dataset. To make a prediction for a new data
point, the algorithm finds the closest data points in the training dataset--its "nearest
neighbors". This is a non-parametric, lazy learning algorithm.
• Non-parametric: the model does not learn a fixed set of parameters from the training data
• Lazy algorithm: There is no explicit training phase involved
k-Neighbors classification
In its simplest version, the k-NN algorithm considers exactly one nearest neighbor: the closest training data point to the point we want to make a prediction for. The prediction is then simply the label of that training point. This is 1-NN. Figure 2-4 illustrates this for classification on the above dataset.
Here, we added three new data points, shown as stars. For each of them, we marked the
closest point in the training set. The prediction of the one-nearest-neighbor algorithm is
the label of that point (shown by the color of the star).
Python code:
• from sklearn.neighbors import KNeighborsClassifier
• KNeighborsClassifier(n_neighbors=1)
Instead of considering only the closest neighbor, we can also consider an arbitrary number, k, of neighbors and assign the class that is most frequent among them. When we use, say, the three closest neighbors, the prediction is again shown as the color of the star, and you can see that the prediction for the new data point at the top left is not the same as the prediction when we used only one neighbor.
While this illustration is for a binary classification problem, this method can be applied to
datasets with any number of classes. For more classes, we count how many neighbors
belong to each class and again predict the most common class (majority voting).
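As a small, hypothetical illustration of the predictions in these figures (X and y are the synthetic data created above; the coordinates of the three new "star" points are made up for this sketch, and the full train/test workflow follows below):

from sklearn.neighbors import KNeighborsClassifier

# Three hypothetical new points standing in for the stars in the figures
X_new = np.array([[8.2, 3.7], [9.3, 1.5], [11.1, 0.5]])

for k in [1, 3]:
    # Fit on the whole synthetic dataset and predict the class of each new point
    clf = KNeighborsClassifier(n_neighbors=k).fit(X, y)
    print("k = {}: predictions = {}".format(k, clf.predict(X_new)))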
Now let's look at how we can apply the k-nearest neighbors algorithm using scikit-
learn. First, we split our data into a training and a test set so we can evaluate
generalization performance, as discussed above.
Steps to apply a k-NN model:
from sklearn.model_selection import train_test_split
# Split the data into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
# Instantiate the classifier, here with 3 neighbors (as in the textbook example)
knn = KNeighborsClassifier(n_neighbors=3)
# Fit the model, knn, to the training set. For k-NN, this simply stores the data.
knn.fit(X_train, y_train)
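The accuracy figure quoted below can be obtained with the standard predict/score calls (the exact value depends on the dataset and split used above):

# Predict labels for the test data and measure accuracy
print("Test set predictions:", knn.predict(X_test))
print("Test set accuracy: {:.2f}".format(knn.score(X_test, y_test)))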
We see that our model is about 86% accurate, meaning the model predicted the class
correctly for 86% of the samples in the test dataset.
Analyzing KNeighborsClassifier
For two-dimensional datasets, we can also illustrate the prediction for all possible test
points in the xy-plane. We color the plane according to the class that would be assigned to a
point in this region. This lets us view the decision boundary, which is the divide between
where the algorithm assigns class 0 versus where it assigns class 1 to the new test data.
Let's build k-NN models for k = 1, 3, and 9 on the above dataset and plot the decision boundaries.
from matplotlib.colors import ListedColormap
from sklearn import neighbors, datasets
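# A minimal sketch of the decision-boundary plots for k = 1, 3, and 9
# (assumption: X, y are the two-feature synthetic data created earlier)
fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for n_neighbors, ax in zip([1, 3, 9], axes):
    # Fit a k-NN classifier with the given number of neighbors
    clf = neighbors.KNeighborsClassifier(n_neighbors=n_neighbors).fit(X, y)
    # Predict the class of every point on a dense grid covering the feature space
    xx, yy = np.meshgrid(np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, 200),
                         np.linspace(X[:, 1].min() - 1, X[:, 1].max() + 1, 200))
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    # Color the plane by predicted class and overlay the training points
    ax.contourf(xx, yy, Z, alpha=0.3, cmap=ListedColormap(['#FFAAAA', '#AAAAFF']))
    ax.scatter(X[:, 0], X[:, 1], c=y, cmap=ListedColormap(['#FF0000', '#0000FF']), edgecolors='k')
    ax.set_title("{} neighbor(s)".format(n_neighbors))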
plt.show()
As you can see in the first figure, using a single neighbor results in a decision boundary that
follows the training data closely. Considering more and more neighbors leads to a
smoother decision boundary. A smoother boundary corresponds to a simpler model. In
other words, using few neighbors corresponds to high model complexity (as shown on the
right side of Figure 2-1), and using many neighbors corresponds to low model complexity
(as shown on the left side of Figure 2-1).
If you consider the extreme case where the number of neighbors is the number of all data
points in the training set, each test point would have exactly the same neighbors (all
training points) and all predictions would be the same: the class that is most frequent in the
training set.
k-NN accuracy on breast cancer data
Let's see the connection between model complexity and generalization (ability of the model
to make accurate predictions on new, unseen data). We will do this on the real-world
breast cancer dataset.
The breast cancer dataset records clinical measurements of breast cancer tumors. Each tumor
is labeled as "benign" (for harmless tumors) or "malignant" (for cancerous tumors), and the
task is to learn to predict whether a tumor is malignant based on the measurements of the
tissue. The data can be loaded using the load_breast_cancer function from scikit-
learn.
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()
The dataset consists of 569 data points, with 30 features each. Of these 569 data points,
212 are labeled as malignant and 357 as benign.
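These counts can be checked directly from the loaded dataset object, for example:

print("Shape of cancer.data:", cancer.data.shape)
# Count how many samples carry each class label
print("Sample counts per class:",
      {n: v for n, v in zip(cancer.target_names, np.bincount(cancer.target))})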
We begin by splitting the dataset into a training and a test set.
# Create training and testing datasets
X_train, X_test, y_train, y_test = train_test_split(cancer.data,
cancer.target, stratify=cancer.target, random_state=66)
Then we evaluate training and test set performance with different numbers of neighbors.
training_accuracy = []
test_accuracy = []
# try n_neighbors from 1 to 20
neighbors_settings = range(1, 21)
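The loop that fills these lists, and the plot of the two accuracy curves, are sketched below (assuming KNeighborsClassifier has been imported and the training/test split from above):

for n_neighbors in neighbors_settings:
    # Build and fit the model
    clf = KNeighborsClassifier(n_neighbors=n_neighbors)
    clf.fit(X_train, y_train)
    # Record training set accuracy
    training_accuracy.append(clf.score(X_train, y_train))
    # Record generalization (test set) accuracy
    test_accuracy.append(clf.score(X_test, y_test))

# Plot training and test accuracy against the number of neighbors
plt.plot(neighbors_settings, training_accuracy, label="training accuracy")
plt.plot(neighbors_settings, test_accuracy, label="test accuracy")
plt.xlabel("n_neighbors")
plt.ylabel("Accuracy")
plt.legend()
plt.show()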
While real-world plots are rarely very smooth, we can still recognize some of the
characteristics of overfitting and underfitting (note that considering fewer neighbors
corresponds to a more complex model).
Considering a single nearest neighbor, the prediction on the training set is perfect (close to
1). But when more neighbors are considered, the model becomes simpler and the training
accuracy drops. The test set accuracy for using a single neighbor is lower than when using
more neighbors, indicating that using the single nearest neighbor leads to a model that is
too complex. On the other hand, when considering 10 neighbors, the model is too simple
and performance is even worse. The best performance is somewhere in the middle, using
around 6 neighbors (6-NN) with a test accuracy of 93.7%. The worst performance is
around 88% accuracy (for n_neighbors=2), which might still be acceptable.
Keep in mind that the best n_neighbors varies from dataset to dataset, and even for the same dataset with different training/test split sizes.
print('--- Based on test set ---\n')

# Best-performing number of neighbors on the test set
index = test_accuracy.index(max(test_accuracy))
print('Best n_neighbors = {} | Training accuracy = {:0.3f} | Test accuracy = {:0.3f}'.format(
    neighbors_settings[index], training_accuracy[index], test_accuracy[index]))

# Worst-performing number of neighbors on the test set
index = test_accuracy.index(min(test_accuracy))
print('Worst n_neighbors = {} | Training accuracy = {:0.3f} | Test accuracy = {:0.3f}'.format(
    neighbors_settings[index], training_accuracy[index], test_accuracy[index]))
Summary
Parameters
The two most important parameters of k-NN are:
• n_neighbors: the number of neighbors
• metric: how we measure the distance between data points
n_neighbors
In practice, using a small number of neighbors, such as three or five, often works well, but you should certainly adjust this parameter for your data.

metric
The default distance metric is Minkowski with a power parameter of 2, which is equivalent to Euclidean distance and works well in many settings. Choosing the right distance measure requires understanding the context of the data. The Minkowski distance of order p between two points x and y with n features is

$d_p(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}$
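A small sketch of setting both parameters explicitly (the particular values are illustrative, not recommendations):

from sklearn.neighbors import KNeighborsClassifier

# metric='minkowski' with p=2 is the default and is equivalent to Euclidean distance
knn = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2)
# Setting p=1 instead gives Manhattan (city-block) distance
knn_manhattan = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=1)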
For further details on the parameters, refer to the scikit-learn documentation for KNeighborsClassifier.
Strengths and weaknesses
Strengths:
• Simple, easy to understand and implement
• Few parameters to tune
• Non-parametric (no weights are learned)
• Fast training phase, since it only has to store the training samples
• Works well for multiclass datasets
• A good baseline method to try before considering more advanced techniques
Weaknesses:
• Computationally expensive testing phase, which often makes it impractical at industrial scale
• Prediction can be slow when the training set is large (in terms of the number of features or the number of samples)
• k-NN can suffer from skewed class distributions (a class that is frequent in the training set will dominate the majority vote)
• Often does not perform well on datasets with many features (hundreds or more), and it does particularly badly with datasets where most features are 0 most of the time (so-called sparse datasets)
• The data has to be preprocessed, e.g., scaled, to avoid skewing the distance metric in favor of a particular feature (see the sketch after this list)
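As noted in the last point, features on larger scales would otherwise dominate the distance computation. A minimal sketch of scaling before k-NN, assuming the breast cancer training/test split from above and using StandardScaler in a pipeline:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# Scale each feature to zero mean and unit variance before distances are computed
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
pipe.fit(X_train, y_train)
print("Test accuracy with scaling: {:.3f}".format(pipe.score(X_test, y_test)))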
So, while the k-nearest neighbors algorithm is easy to understand, it is not often used in practice, due to prediction being slow and its inability to handle many features. The method we discuss next has neither of k-NN's drawbacks.