Ch5 Data Science

Decision Trees and Random Forests

A Decision Tree is a Supervised Machine Learning algorithm that imitates the human thinking process. It makes predictions just like a human mind would in real life. It can be thought of as a series of if-then-else statements, making a decision or prediction at every point as it grows.

A decision tree looks like a flowchart or an inverted tree. It grows from root to leaf, but in an upside-down manner. Because of this structure, we can easily interpret the decision-making/prediction process of a decision tree.
The starting node is called the Root Node. It splits further, based on some decision criterion, into what are called Internal Nodes. Internal nodes that split further are called parent nodes, and the nodes formed out of parent nodes are called child nodes. Nodes that do not split further are called Leaf Nodes.
A Decision Tree can be a Classification Tree or a Regression Tree, based upon the type of target variable. In a classification tree, the predicted class is the majority class in the leaf node. In a regression tree, the final predicted value is the average of the values in the leaf node.

There are various Decision Tree algorithms depending on whether the tree uses binary splits or multiway splits. In a binary split, a parent node splits into 2 child nodes; in a multiway split, the parent node splits into more than 2 child nodes.

CART, meaning Classification and Regression Tree, is the algorithm that builds binary split trees, while the ID3 algorithm builds multiway split trees. We will discuss the CART algorithm in detail. (Hereafter, "Decision Tree" will mean a CART tree.)

A Decision Tree divides the data into various subsets by making a split based on a chosen attribute. This attribute is chosen based upon a homogeneity criterion called the Gini Index.

The Gini Index basically measures the purity (or, equivalently, the impurity) of the nodes after the split happens. That is, it measures how pure the child nodes are relative to the parent node after the split.

The Tree should make a split on the attribute that leads to maximum homogeneity of the child nodes. The standard formula for the Gini Index (impurity) is as follows:

Gini = 1 − Σ pᵢ²  (summed over i = 1 to n)

where pᵢ is the probability of finding a point with the label i, and n is the number of classes/resultant values of the target variable.

Thus the tree calculates the Gini index at the parent node. Then, for every attribute, the weighted Gini index of the resulting child nodes is calculated. The attribute that gives maximum homogeneity (lowest weighted impurity) in the child nodes is chosen and the split is performed. This process is repeated until all the attributes are exhausted. At the end we get what we call a fully grown Decision Tree. The attributes which lead to the initial splits can therefore be considered the most important attributes for prediction.
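As a rough sketch of this calculation (the function names and toy labels below are made up for illustration, not taken from the example later in this chapter):

import numpy as np

def gini(labels):
    # Gini impurity = 1 - sum(p_i^2) over the classes present in the node
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def weighted_gini(left_labels, right_labels):
    # impurity of a binary split = size-weighted average of the child impurities
    n = len(left_labels) + len(right_labels)
    return (len(left_labels) / n) * gini(left_labels) + (len(right_labels) / n) * gini(right_labels)

# toy example: a split that separates the classes well has a low weighted Gini
parent = [0, 0, 0, 1, 1, 1]
print(gini(parent))                          # 0.5 (maximally impure for 2 classes)
print(weighted_gini([0, 0, 0], [1, 1, 1]))   # 0.0 (pure children)

A split whose children are purer than the parent lowers the weighted Gini, which is exactly what the tree looks for.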

The problem with a fully grown Decision Tree is that it overfits. It simply memorizes the training data and fails to generalize, leading to poor performance on test data. To solve this problem and get maximum efficacy out of a decision tree, we need to prune the fully grown tree or truncate its growth.

Pruning is a technique where we cut back the less important branches of a fully grown tree, while truncation is a technique where we control the growth of the tree in order to avoid overfitting. Truncation is more popularly used and is done by tuning the hyperparameters of the tree, which will be discussed in the example.
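As a rough sketch, truncation with scikit-learn's DecisionTreeClassifier could look like this (X_train and y_train are assumed to already exist, and the hyperparameter values are only illustrative):

from sklearn.tree import DecisionTreeClassifier

# limiting depth and leaf size truncates the tree's growth and combats overfitting
tree = DecisionTreeClassifier(
    max_depth=5,           # stop growing beyond 5 levels
    min_samples_split=50,  # a node must have at least 50 samples to be split
    min_samples_leaf=20,   # every leaf must keep at least 20 samples
    random_state=42
)
tree.fit(X_train, y_train)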

Overall, Decision Trees are efficient algorithms which require zero or minimal data preprocessing. They can handle linear and non-linear data, and categorical or numerical data, efficiently, and make predictions based on the given set of attributes. Most importantly, they are easily interpretable.

Some shortcomings of Decision Trees are that they overfit. A small change in the data may change the entire tree structure, which makes them high-variance algorithms. Gini Index calculations make them complex, and they consume a lot of time and memory. They can also be called greedy algorithms, since they consider only the immediate split's implications and not the implications of further splits.

Random Forests

A combination of various models (linear regression, logistic regression, decision tree, etc.) brought together as a single model in order to achieve the final outcome is called an Ensemble. In an ensemble, the different models are viewed as one, not separately.

A Random Forest is a powerful ensemble model built from a large number of Decision Trees. It overcomes the shortcomings of a single decision tree and adds some other advantages. That does not mean it is always better than a decision tree: there can be instances when a single decision tree performs better than a random forest.

An ensemble functions in such a way that every model in the ensemble helps to compensate for the shortcomings of every other model. If a random forest is built with 10 decision trees, every tree may not perform great on the data, but the stronger trees help to fill the gaps for the weaker trees. This is what makes an ensemble a powerful machine learning model.

The individual trees in a random forest must satisfy two criteria:

1. Diversity: every model in the ensemble should function independently and should complement the other models.

2. Acceptability: every model in the ensemble should perform at least better than a random-guesser model.

Bagging is one of the common ensemble techniques used to achieve the diversity criterion. Bagging stands for Bootstrapped Aggregation. Bootstrapping is the method used to create bootstrapped samples: the given data is sampled uniformly and with replacement. A bootstrapped sample of the same size as the original dataset contains roughly two-thirds (about 63%) of the unique data points, the rest being repeats. Bootstrapped samples of uniform length are created and given as input to the individual decision trees in the random forest model. The results of all the individual models are then aggregated to arrive at a final decision.
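A minimal sketch of drawing one bootstrapped sample with NumPy (the ten indices are made up for illustration):

import numpy as np

rng = np.random.default_rng(0)
data = np.arange(10)                                  # pretend these are 10 row indices

# sample uniformly, with replacement, to the same length as the data
bootstrap_idx = rng.choice(len(data), size=len(data), replace=True)
bootstrap_sample = data[bootstrap_idx]

oob_mask = ~np.isin(data, bootstrap_sample)           # rows never drawn: the Out Of Bag points
print(bootstrap_sample, data[oob_mask])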
[Figure: A typical Random Forest Classifier]

Also, not all features are used to train every tree. A random subset of features of fixed size is chosen at every node of every tree. This reduces the correlation between the different trees and ensures that they function independently.

In a Random Forest Classifier, the majority class predicted by the individual trees is taken as the final prediction, while in a Random Forest Regressor, the average of all the individual predicted values is taken as the final prediction.

In the Random Forest model, the data usually does not need a separate test split. The entire dataset is used to form bootstrapped samples, so for each tree a set of data points is always left out and can serve as a test or validation set for that tree. These left-out points are called Out Of Bag (OOB) samples. Thus for every tree in the random forest ensemble there exists a set of data points which was not present in its training data and is used as a validation set to evaluate its performance.
The OOB error is calculated as the ratio of the number of incorrect predictions on the OOB samples to the total number of OOB predictions.
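A rough sketch of a random forest with OOB evaluation in scikit-learn (X and y are assumed to be an already prepared feature matrix and target; the hyperparameter values are illustrative):

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=100,      # number of trees in the forest
    max_features='sqrt',   # random subset of features considered at each split
    oob_score=True,        # evaluate each tree on its Out Of Bag samples
    random_state=42
)
rf.fit(X, y)
print(rf.oob_score_)       # accuracy estimated from the OOB samples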

Random Forests also help identify the important features among all the features in the given data, letting us eliminate the less important features/attributes. This leads to better prediction results. Feature importance is measured by the decrease in impurity (increase in purity) that a feature provides, again using the Gini Index calculation.

The main advantage of random forests over decision trees is that they are stable, low-variance models. They also overcome the overfitting problem of decision trees. Since they use bootstrapped data and random subsets of features, they ensure diversity and robust performance. They are relatively immune to the curse of dimensionality, as each individual tree does not consider all the features at once.

The main disadvantage of random forests is their lack of interpretability. Unlike with a decision tree, one cannot easily trace how the model arrives at a prediction. Another disadvantage is that they are complex and computationally expensive.

Now let us see the Python implementation of both the Decision Tree and Random Forest models with the help of a telecom churn dataset.
Python Implementation

For telecom operators, retaining highly profitable customers is the number one business goal. The telecommunications industry experiences an average of 15–25% annual churn. We will analyse customer-level data of a leading telecom firm, build predictive models to identify customers at high risk of churn, and identify the main indicators of churn.

The dataset contains customer-level information for a span of four consecutive months — June, July, August and September. The months are encoded as 6, 7, 8 and 9, respectively. We need to predict the churn in the last (i.e. the ninth) month using the data (features) from the first three months.

Flow of analysis:

1. Import the required libraries
2. Read and understand the data
3. Data cleanup and preparation
4. Filter high-value customers
5. Churn analysis
6. Exploratory Data Analysis
7. Model building
8. Model building — hyperparameter tuning
9. Best model

The required libraries are imported. We will load more packages from various libraries as and when required.
The data is loaded into a pandas data frame and analysed.
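The exact code is not reproduced here; a minimal sketch of this step could look like the following, where the file name telecom_churn_data.csv is a placeholder:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# load the customer-level data into a pandas data frame and take a first look
churn_df = pd.read_csv('telecom_churn_data.csv')   # placeholder file name
print(churn_df.shape)
churn_df.info()
churn_df.head()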

Initial analysis tells us that the data has 99999 rows and 226 columns. There are a lot of missing values which need to be imputed or dropped.

Columns with more than 30% null values are mostly dropped, and for the others the null values are imputed with mean/median/mode values as appropriate. Finally we have a dataset with 99999 rows and 185 columns.

Since most of the revenue comes from high-value customers, we filter the data accordingly and get a dataset with 30011 rows and 185 columns.
We perform Exploratory Data Analysis (EDA) for these customers.

Churn Analysis

We can see that 91.4% of the customers are non-churn and only 8.6% are churn, so this is an imbalanced dataset. Following are some of the analysis graphs from the EDA.

[Figures: minutes-of-usage distributions for churn vs non-churn customers: STD incoming within the operator network, roaming incoming, and local incoming]


We split the data into train and test sets, separating the churn variable as the dependent variable y and the rest of the features as the independent variables X.
We need to handle the class imbalance, since otherwise any model we build will perform well on the majority class and poorly on the minority class. We can handle this by balancing the classes, i.e. increasing the minority class or decreasing the majority class.

There are various class-imbalance handling techniques, as follows (a minimal sketch of one of them appears after the list):

1. Random Under-Sampling
2. Random Over-Sampling
3. SMOTE — Synthetic Minority Oversampling Technique
4. ADASYN — Adaptive Synthetic Sampling Method
5. SMOTETomek — over-sampling followed by under-sampling
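As a rough sketch, Random Under-Sampling combined with a Random Forest could look like this, assuming the imbalanced-learn package and an existing X_train, y_train split (the settings are illustrative, not the tuned ones from this analysis):

from imblearn.under_sampling import RandomUnderSampler
from sklearn.ensemble import RandomForestClassifier

# balance the training classes by randomly dropping majority-class rows
rus = RandomUnderSampler(random_state=42)
X_res, y_res = rus.fit_resample(X_train, y_train)

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_res, y_res)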

We implemented all the techniques with Logistic Regression, Decision Tree and Random Forest models. We need to concentrate on the models that give us high sensitivity/recall. Following were the results:

[Table: Comparison of evaluation metrics for various models]

As we can see in the summary table, recall is high for Random Forest with Random Under-Sampling, and for Logistic Regression with Random Under-Sampling, Random Over-Sampling, SMOTE, ADASYN and SMOTE+Tomek. Among the Logistic Regression variants, ADASYN has the highest recall.
We will pick the Random Forest with Under-Sampling model for further analysis.

We know that a Random Forest gives us feature importances, letting us eliminate the less important features. We run the Random Forest Classifier and choose the important features:
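A minimal sketch of extracting the important features, assuming a fitted RandomForestClassifier named rf and that X_train is a pandas DataFrame (both names are placeholders, and the top-20 cut-off is arbitrary):

import pandas as pd

# rank features by the impurity-based importance computed by the forest
importances = pd.Series(rf.feature_importances_, index=X_train.columns)
top_features = importances.sort_values(ascending=False).head(20)
print(top_features)

# keep only the selected columns for further modelling
X_train_top = X_train[top_features.index]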
We perform hyperparameter tuning using GridSearchCV for the Random Forest Under-Sampling model. Tuning the hyperparameters of a Decision Tree Classifier is similar; we would just replace the Random Forest Classifier with a Decision Tree Classifier as the base model.
We get the following best score and best model.

The grid search provided by GridSearchCV exhaustively generates candidates from a grid of parameter values specified with the param_grid parameter. This process is computationally expensive and time consuming, especially over a large hyperparameter space.
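A rough sketch of such a grid search over the forest (reusing the under-sampled X_res, y_res from the earlier sketch; the grid values and the recall scoring are illustrative):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [5, 10, 20],
    'min_samples_leaf': [20, 50],
    'max_features': ['sqrt', 0.3],
}

# exhaustively try every combination in param_grid with 5-fold cross-validation
grid = GridSearchCV(RandomForestClassifier(random_state=42),
                    param_grid, scoring='recall', cv=5, n_jobs=-1)
grid.fit(X_res, y_res)
print(grid.best_score_, grid.best_params_)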

There is another approach, RandomizedSearchCV, to find the best model. Here a fixed number of hyperparameter settings is sampled from specified distributions. This makes the process time-efficient and the resulting model less complex.
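A rough sketch of the randomized alternative, sampling 20 settings from the same hypothetical parameter space:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

rand_search = RandomizedSearchCV(RandomForestClassifier(random_state=42),
                                 param_distributions=param_grid,  # same grid as above
                                 n_iter=20, scoring='recall', cv=5,
                                 random_state=42, n_jobs=-1)
rand_search.fit(X_res, y_res)
print(rand_search.best_score_, rand_search.best_params_)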
We get the following best score and best model.

Comparing the scores obtained and the complexity of the above models, we conclude that the model obtained using RandomizedSearchCV is the best model.

We make predictions on the test set and get the following results: recall of 81% and accuracy of 85%. The following is a representation of the top 10 important predictors of churn.
Support Vector Machine (SVM)

The Support Vector Machine is a supervised learning algorithm mostly used for classification, but it can also be used for regression. The main idea is that, based on the labeled training data, the algorithm tries to find the optimal hyperplane which can be used to classify new data points. In two dimensions the hyperplane is a simple line.

Usually a learning algorithm tries to learn the most common characteristics of a class (what differentiates one class from another), and the classification is based on those representative characteristics (so classification is based on differences between classes). The SVM works the other way around: it finds the most similar examples between the classes. Those will be the support vectors.

As an example, let's consider two classes, apples and lemons.

Other algorithms will learn the most evident, most representative characteristics of apples and lemons: apples are green and round, while lemons are yellow and elliptic.

In contrast, SVM will search for apples that are very similar to lemons, for example apples which are yellow and elliptic. This will be a support vector. The other support vector will be a lemon similar to an apple (green and round). So other algorithms learn the differences, while SVM learns similarities.
If we visualize the example above in 2D, we will have something like this:

As we go from left to right, all the examples will be classified as apples until we reach the yellow apple. From this point, the confidence that a new example is an apple drops while the lemon-class confidence increases. When the lemon-class confidence becomes greater than the apple-class confidence, the new examples will be classified as lemons (somewhere between the yellow apple and the green lemon).
Based on these support vectors, the algorithm tries to find the best hyperplane that separates the classes. In 2D the hyperplane is a line, so it would look like this:

Ok, but why did I draw the blue boundary like in the
picture above? I could also draw boundaries like this:
As you can see, we have an infinite number of possibilities to
draw the decision boundary. So how can we find the optimal
one?

Finding the Optimal Hyperplane

Intuitively the best line is the line that is far away from both the apple and lemon examples (has the largest margin). To have an optimal solution, we have to maximize the margin in both directions (if we have multiple classes, then we have to maximize it considering each of the classes).
So if we compare the picture above with the picture below, we can easily observe that the first is the optimal hyperplane (line) and the second is a sub-optimal solution, because its margin is far smaller.
Because we want to maximize the margin taking all the classes into consideration, instead of using one margin per class we use a "global" margin which takes all the classes into consideration. This margin would look like the purple line in the following picture:
This margin is orthogonal to the boundary and equidistant to the support vectors.

So where do the vectors come in? Each of the calculations (computing distances and optimal hyperplanes) is made in a vector space, so each data point is considered a vector. The dimension of the space is defined by the number of attributes of the examples.
To understand the math behind this, please read this brief mathematical description of vectors, hyperplanes and optimization: SVM Succinctly.
All in all, support vectors are the data points that define the position and the margin of the hyperplane. We call them "support" vectors because they are the representative data points of the classes: if we move one of them, the position and/or the margin of the hyperplane will change. Moving other data points has no effect on the margin or the position of the hyperplane.

To make classifications, we don't need all the training data points (as in the case of KNN); we have to save only the support vectors. In the worst case all the points will be support vectors, but this is very rare, and if it happens then you should check your model for errors or bugs.

So basically the learning is equivalent to finding the hyperplane with the best margin, so it is a simple optimization problem.

Basic Steps

The basic steps of the SVM are:

1. select two hyperplanes (in 2D, lines) which separate the data with no points between them (red lines)

2. maximize their distance (the margin)

3. the average line (here the line halfway between the two red lines) will be the decision boundary

This is very nice and easy, but finding the best margin, i.e. solving the optimization problem, is not trivial (it is easy in 2D, when we have only two attributes, but what if we have N dimensions with N a very big number?).

To solve the optimization problem, we use Lagrange Multipliers. To understand this technique you can read the following two articles: Duality Lagrange Multiplier and A Simple Explanation of Why Lagrange Multipliers Work.

Until now we had linearly separable data, so we could use a line as the class boundary. But what if we have to deal with non-linear data sets?

SVM for Non-Linear Data Sets

An example of non-linear data is:

In this case we cannot find a straight line to separate the apples from the lemons. So how can we solve this problem? We will use the Kernel Trick!

The basic idea is that when a data set is inseparable in the current dimensions, we add another dimension; maybe that way the data will become separable. Just think about it: the example above is in 2D and it is inseparable, but maybe in 3D there is a gap between the apples and the lemons, maybe there is a level difference, so the lemons are on level one and the apples are on level two. In this case, we can easily draw a separating hyperplane (in 3D a hyperplane is a plane) between level 1 and level 2.
Mapping to Higher Dimensions

To solve this problem we shouldn’t just blindly add another


dimension, we should transform the space so we generate this
level difference intentionally.

Mapping from 2D to 3D

Let's assume that we add another dimension called x3. The important transformation is that in the new dimension the points are given the value x3 = x1² + x2².

If we plot the surface defined by the x1² + x2² formula, we will get something like this:
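As a rough numeric sketch of this mapping (the 2D points are made up; in practice the kernel computes the effect of such a mapping implicitly):

import numpy as np

# made-up 2D points: lemons near the origin, apples further out
lemons = np.array([[0.2, 0.1], [-0.3, 0.2], [0.1, -0.2]])
apples = np.array([[2.0, 1.5], [-1.8, 2.1], [1.9, -2.0]])

def add_level(points):
    # new third coordinate: x3 = x1^2 + x2^2 (a distance-based "level")
    x3 = points[:, 0] ** 2 + points[:, 1] ** 2
    return np.column_stack([points, x3])

print(add_level(lemons))   # small x3 values: lower level
print(add_level(apples))   # large x3 values: higher level, now separable by a plane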

Now we have to map the apples and lemons (which are just simple points) to this new space. Think about it carefully: what did we do? We just used a transformation that adds levels based on distance. If you are at the origin, then the points will be on the lowest level. As we move away from the origin, we are climbing the hill (moving from the center of the plane towards the margins), so the level of the points will be higher. Now if we consider that the origin is the lemon in the center, we will have something like this:

Now we can easily separate the two classes. These transformations are based on functions called kernels. Popular kernels are: the Polynomial Kernel, Gaussian Kernel, Radial Basis Function (RBF), Laplace RBF Kernel, Sigmoid Kernel, ANOVA RBF Kernel, etc. (see Kernel Functions, or for a more detailed description, Machine Learning Kernels).
Mapping from 1D to 2D

Another, easier example in 2D would be:

After using the kernel and after all the transformations we will get:
So after the transformation, we can easily delimit the two classes
using just a single line.

In real-life applications we won't have a simple straight line, but lots of curves and high dimensions. In some cases we won't have two hyperplanes which separate the data with no points between them, so we need some trade-offs and tolerance for outliers. Fortunately the SVM algorithm has a so-called regularization parameter to configure the trade-off and to tolerate outliers.

Tuning Parameters
As we saw in the previous section, choosing the right kernel is crucial, because if the transformation is incorrect then the model can have very poor results. As a rule of thumb, always check whether you have linear data, and in that case use a linear SVM (linear kernel). A linear SVM is a parametric model, but an RBF-kernel SVM isn't, so the complexity of the latter grows with the size of the training set. Not only is it more expensive to train an RBF-kernel SVM, but you also have to keep the kernel matrix around, and the projection into this "infinite" higher-dimensional space where the data becomes linearly separable is more expensive during prediction as well. Furthermore, you have more hyperparameters to tune, so model selection is more expensive as well! And finally, it's much easier to overfit a complex model!

Regularization

The Regularization Parameter (in python it's called C) tells the SVM optimization how much you want to avoid misclassifying each training example.

If C is high, the optimization will choose a smaller-margin hyperplane, so the misclassification rate on the training data will be lower.

On the other hand, if C is low, then the margin will be big, even if there are misclassified training examples. This is shown in the following two diagrams:
As you can see in the image, when C is low the margin is larger (so implicitly we don't have so many curves; the line doesn't strictly follow the data points), even though two apples were classified as lemons. When C is high, the boundary is full of curves and all the training data was classified correctly. Don't forget: even if all the training data was correctly classified, this doesn't mean that increasing C will always increase the precision (because of overfitting).

Gamma

The next important parameter is Gamma. The gamma parameter defines how far the influence of a single training example reaches. This means that a high Gamma will consider only points close to the plausible hyperplane, and a low Gamma will consider points at a greater distance.
As you can see, decreasing the Gamma means that finding the correct hyperplane will consider points at greater distances, so more and more points will be used (the green lines indicate which points were considered when finding the optimal hyperplane).

Margin

The last parameter is the margin. We've already talked about margin: a higher margin results in a better model and thus better classification (or prediction). The margin should always be maximized.

SVM Example using Python

In this example we will use the Social_Networks_Ads.csv file, the same file as in the previous article (see the KNN example using Python).

In this example I will write down only the differences between SVM and KNN, because I don't want to repeat myself in each article! If you want the whole explanation of how we read the data set, how we parse and split our data, or how we evaluate or plot the decision boundaries, then please read the code example from the previous chapter (KNN)!

Because the sklearn library is a very well written and useful Python
library, we don’t have too much code to change. The only difference
is that we have to import the SVC class (SVC = SVM in sklearn)
from sklearn.svm instead of the KNeighborsClassifier class from
sklearn.neighbors.
# Fitting SVM to the Training set
from sklearn.svm import SVC
classifier = SVC(kernel = 'rbf', C = 0.1, gamma = 0.1)
classifier.fit(X_train, y_train)

After importing the SVC, we can create our new model using the predefined constructor. This constructor has many parameters, but I will describe only the most important ones; most of the time you won't use the other parameters.

The most important parameters are:

1. kernel: the kernel type to be used. The most common kernels are rbf (this is the default value), poly and sigmoid, but you can also create your own kernel.

2. C: this is the regularization parameter described in the Tuning Parameters section.

3. gamma: this was also described in the Tuning Parameters section.

4. degree: it is used only if the chosen kernel is poly, and sets the degree of the polynomial.

5. probability: this is a boolean parameter, and if it's true then for each prediction the model will return the vector of probabilities of belonging to each class of the response variable. So basically it will give you the confidence of each prediction.

6. shrinking: this shows whether or not you want a shrinking heuristic used in your optimization of the SVM; it is used in Sequential Minimal Optimization. Its default value is true, and if you don't have a good reason, please don't change this value to false, because shrinking will greatly improve your performance for very little loss in terms of accuracy in most cases.

Now let's see the output of running this code. The decision boundary for the training set looks like this:
As we can see, and as we learnt in the Tuning Parameters section, because C has a small value (0.1) the decision boundary is smooth.

Now if we increase C from 0.1 to 100 we will have more curves in the decision boundary:
What would happen if we use C=0.1 but increase Gamma from 0.1 to 10? Let's see!
What happened here? Why do we have such a bad model? As you've seen in the Tuning Parameters section, high gamma means that when calculating the plausible hyperplane we consider only points which are close. Now, because the density of the green points is high only in the selected green region, in that region the points are close enough to the plausible hyperplane, so those hyperplanes were chosen. Be careful with the gamma parameter, because it can have a very bad influence over the results of your model if you set it to a very high value (what a "very high value" is depends on the density of the data points).
For this example the best values for C and Gamma are 1.0 and 1.0.
Now if we run our model on the test set we will get the following
diagram:

And the Confusion Matrix looks like this:

As you can see, we’ve got only 3 False Positives and only 4 False
Negatives. The Accuracy of this model is 93% which is a really
good result, we obtained a better score than using KNN (which had
an accuracy of 80%).
 What Is K Means Clustering
 Implementation of K means Clustering Python
 WCSS And Elbow Method To find No. Of clusters
 Implementation of K means Clustering Python Code

K Means is one of the most popular Unsupervised Machine Learning algorithms, used for solving clustering problems. K Means segregates unlabeled data into various groups, called clusters, based on similar features and common patterns.

K Means Algorithm theory with Python

1. What Is Clustering
2. What Is the K Means Algorithm
3. Diagrammatic Implementation of K Means Clustering
4. Choosing the Right Number of Clusters
5. Python Implementation

1. What Is Clustering?

Suppose we have N unlabeled multivariate datasets of various animals like dogs, cats, birds etc. The technique of segregating these datasets into various groups, on the basis of having similar features and characteristics, is called Clustering.

The groups formed are known as Clusters. The clustering technique is used in various fields such as image recognition and spam filtering.

Clustering is used as an Unsupervised Learning algorithm in Machine Learning, since it can segregate multivariate data into various groups, without any supervisor, on the basis of common patterns hidden inside the datasets.


2. What Is the K Means Algorithm

The K Means algorithm is an iterative algorithm that divides a set of n data points into k subgroups/clusters, based on their similarity and their mean distance from the centroid of that particular subgroup.

K here is the pre-defined number of clusters to be formed by the algorithm. If K=3, the number of clusters to be formed from the dataset is 3.

Algorithm Steps of K Means

The working of the K-Means algorithm is explained in the below steps (a from-scratch sketch follows the steps):

Step-1: Select the value of K, to decide the number of clusters to be formed.

Step-2: Select K random points which will act as centroids.

Step-3: Assign each data point, based on its distance from the randomly selected points (centroids), to the nearest/closest centroid; this forms the predefined clusters.

Step-4: Place a new centroid in each cluster.

Step-5: Repeat step 3, which reassigns each data point to the new closest centroid of its cluster.

Step-6: If any reassignment occurred, go to Step-4; else go to Step-7.

Step-7: FINISH
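A minimal from-scratch sketch of these steps in NumPy (the points and iteration cap are made up for illustration; in practice we use sklearn's KMeans as in the implementation below):

import numpy as np

def k_means(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step-2: pick K random data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step-3/5: assign every point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step-4: move each centroid to the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step-6: stop when no centroid moves any more
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

X = np.array([[1, 1], [1.5, 2], [8, 8], [8, 9], [0.5, 1.5], [9, 8.5]])
labels, centroids = k_means(X, k=2)
print(labels, centroids)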

3. Diagrammatic Implementation of K Means Clustering

STEP 1: Let's choose the number k of clusters, i.e. K=2, to segregate the dataset into different clusters. We choose 2 random points which will act as centroids to form the clusters.

STEP 2: Now we assign each data point of the scatter plot to the closest centroid (K-point), based on its distance. This is done by drawing a median line between the two centroids. Consider the below image:

STEP 3: Points on the left side of the line are nearer to the blue centroid, and points to the right of the line are closer to the yellow centroid. The left ones form a cluster with the blue centroid and the right ones with the yellow centroid.

STEP 4: We repeat the process by choosing new centroids. To choose the new centroids, we find the new center of gravity of each cluster, as depicted below:

STEP 5: Next, we reassign each data point to the new centroids. We repeat the same process as above (using a median line). The yellow data point on the blue side of the median line will be included in the blue cluster.

STEP 6: As reassignment has taken place, we repeat the above step of finding new centroids.

STEP 7: We repeat the above process of finding the center of gravity of the clusters, as depicted below.

STEP 8: After finding the new centroids we again draw the median line and reassign the data points, as in the above steps.

STEP 9: We finally segregate the points based on the median line, such that two groups are formed and no dissimilar point is included in a single group.

The final clusters formed are as follows:

4. Choosing the Right Number of Clusters

The number of clusters that we choose for the algorithm shouldn't be arbitrary. Each cluster is formed by calculating and comparing the distances of the data points within a cluster from its centroid.
We can choose the right number of clusters with the help of the Within-Cluster Sum of Squares (WCSS) method.

WCSS stands for the sum of squared distances of the data points in each cluster from that cluster's centroid.

The main idea is to minimize the distance between the data points and the centroids of the clusters. The process is iterated until we reach a minimum value for the sum of distances.

To find the optimal number of clusters, the elbow method follows the below steps:

1. Execute K-means clustering on the given dataset for different K values (ranging from 1 to 10).

2. For each value of K, calculate the WCSS value.

3. Plot a graph/curve between the WCSS values and the respective number of clusters K.

4. The sharp point of bend (a point that looks like an elbow joint in the arm-shaped plot) is taken as the best/optimal value of K.

5. Python Implementation
Importing relevant libraries
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
from sklearn.cluster import KMeans

Loading the Data

data = pd.read_csv('Countryclusters.csv')
data

Plotting the data
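The plotting code is not reproduced here; a minimal sketch, reusing the imports above and assuming Countryclusters.csv has the Longitude and Latitude columns used later, could be:

# scatter plot of the raw country data before clustering
plt.scatter(data['Longitude'], data['Latitude'])
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.show()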

Selecting the features

x = data.iloc[:,1:3] # first index for rows, second for columns
x

Clustering
kmeans = KMeans(3)
kmeans.fit(x)

Clustering Results
identified_clusters = kmeans.fit_predict(x)
identified_clusters
array([1, 1, 0, 0, 0, 2])
data_with_clusters = data.copy()
data_with_clusters['Clusters'] = identified_clusters
plt.scatter(data_with_clusters['Longitude'], data_with_clusters['Latitude'], c=data_with_clusters['Clusters'], cmap='rainbow')
Trying a different method (to find the number of clusters to be selected)
WCSS and Elbow Method
wcss = []
for i in range(1, 7):
    kmeans = KMeans(i)
    kmeans.fit(x)
    wcss_iter = kmeans.inertia_
    wcss.append(wcss_iter)

number_clusters = range(1, 7)
plt.plot(number_clusters, wcss)
plt.title('The Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')

We can choose 3 as the number of clusters; this method shows what a good number of clusters is.
Principal Component Analysis (PCA)

Principal Component Analysis is basically a statistical procedure to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables.
Each of the principal components is chosen in such a way that it describes most of the still-available variance, and all these principal components are orthogonal to each other. Among all principal components, the first principal component has the maximum variance.

Uses of PCA:
 It is used to find inter-relations between variables in the data.
 It is used to interpret and visualize data.
 Decreasing the number of variables makes further analysis simpler.
 It is often used to visualize genetic distance and relatedness between populations.
PCA is performed on a square symmetric matrix. This can be a pure sums-of-squares and cross-products matrix, a covariance matrix or a correlation matrix. A correlation matrix is used if the individual variances differ much.

Objectives of PCA:
 It is basically a non-dependent procedure which reduces the attribute space from a large number of variables to a smaller number of factors.
 PCA is basically a dimension-reduction process, but there is no guarantee that the resulting dimensions are interpretable.
 The main task in PCA is to select a subset of variables from a larger set, based on which original variables have the highest correlation with the principal components.
Principal Axis Method: PCA basically searches for a linear combination of variables so that we can extract maximum variance from the variables. Once this process completes, it removes that variance and searches for another linear combination that explains the maximum proportion of the remaining variance, which basically leads to orthogonal factors. In this method, we analyze total variance.
Eigenvector: It is a non-zero vector that stays parallel to itself after matrix multiplication. Suppose x is an eigenvector of dimension r of a matrix M with dimension r*r; then Mx and x are parallel. We need to solve Mx = λx, where both x (the eigenvector) and λ (the eigenvalue) are unknown.
In terms of eigenvectors, we can say that principal components show both the common and unique variance of the variables. Basically, it is a variance-focused approach seeking to reproduce the total variance and correlation with all components. The principal components are basically the linear combinations of the original variables, weighted by their contribution to explaining the variance in a particular orthogonal dimension.

Eigenvalues: Eigenvalues are also known as characteristic roots. An eigenvalue basically measures the variance in all the variables which is accounted for by that factor. The ratio of eigenvalues is the ratio of the explanatory importance of the factors with respect to the variables. If a factor's eigenvalue is low, then it is contributing less to the explanation of the variables. In simple words, it measures the amount of variance in the total given dataset accounted for by the factor. We can calculate a factor's eigenvalue as the sum of its squared factor loadings over all the variables.
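As a small illustration of the eigenvalue/eigenvector idea (this is not the method used in the implementation below, which relies on sklearn's PCA), one can eigendecompose a covariance matrix with NumPy; the data here is made up:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))            # made-up data: 100 observations, 3 variables
X_centered = X - X.mean(axis=0)

cov = np.cov(X_centered, rowvar=False)   # 3x3 covariance matrix
eigenvalues, eigenvectors = np.linalg.eigh(cov)   # eigh: for symmetric matrices

# the eigenvector with the largest eigenvalue is the first principal component
order = np.argsort(eigenvalues)[::-1]
print(eigenvalues[order])
print(eigenvectors[:, order])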

Now, Let’s understand Principal Component Analysis with Python.


To get the dataset used in the implementation, click here.
Step 1: Importing the libraries
 Python

# importing required libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

Step 2: Importing the data set

Import the dataset and distribute it into X and y components for data analysis.

# importing or loading the dataset
dataset = pd.read_csv('wine.csv')

# distributing the dataset into two components X and y
X = dataset.iloc[:, 0:13].values
y = dataset.iloc[:, 13].values

Step 3: Splitting the dataset into the Training set and Test set
# Splitting the X and y into the
# Training set and Testing set
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
Step 4: Feature Scaling
Performing the pre-processing on the training and testing sets by fitting the StandardScaler.
# performing preprocessing part
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()

X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

Step 5: Applying the PCA function

Applying the PCA function to the training and testing set for analysis.
# Applying PCA function on training
# and testing set of X component
from sklearn.decomposition import PCA

pca = PCA(n_components = 2)

X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)

explained_variance = pca.explained_variance_ratio_

Step 6: Fitting Logistic Regression To the training set


# Fitting Logistic Regression To the training set
from sklearn.linear_model import LogisticRegression

classifier = LogisticRegression(random_state = 0)
classifier.fit(X_train, y_train)

Step 7: Predicting the test set result


# Predicting the test set result using
# predict function under LogisticRegression
y_pred = classifier.predict(X_test)

Step 8: Making the confusion matrix


# making confusion matrix between
# test set of Y and predicted value.
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_pred)

Step 9: Predicting the training set result
# Predicting the training set
# result through scatter plot
from matplotlib.colors import ListedColormap

X_set, y_set = X_train, y_train

X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1,
                               stop = X_set[:, 0].max() + 1, step = 0.01),
                     np.arange(start = X_set[:, 1].min() - 1,
                               stop = X_set[:, 1].max() + 1, step = 0.01))

plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(),
             X2.ravel()]).T).reshape(X1.shape), alpha = 0.75,
             cmap = ListedColormap(('yellow', 'white', 'aquamarine')))

plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())

for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                c = ListedColormap(('red', 'green', 'blue'))(i), label = j)

plt.title('Logistic Regression (Training set)')
plt.xlabel('PC1') # for Xlabel
plt.ylabel('PC2') # for Ylabel
plt.legend() # to show legend

# show scatter plot
plt.show()
Step 10: Visualizing the Test set results
# Visualising the Test set results through scatter plot
from matplotlib.colors import ListedColormap

X_set, y_set = X_test, y_test

X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1,
                               stop = X_set[:, 0].max() + 1, step = 0.01),
                     np.arange(start = X_set[:, 1].min() - 1,
                               stop = X_set[:, 1].max() + 1, step = 0.01))

plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(),
             X2.ravel()]).T).reshape(X1.shape), alpha = 0.75,
             cmap = ListedColormap(('yellow', 'white', 'aquamarine')))

plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())

for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                c = ListedColormap(('red', 'green', 'blue'))(i), label = j)

# title for scatter plot
plt.title('Logistic Regression (Test set)')
plt.xlabel('PC1') # for Xlabel
plt.ylabel('PC2') # for Ylabel
plt.legend()

# show scatter plot
plt.show()
