Ch5 Data Science
Decision Trees and Random Forests
A Decision Tree divides the data into subsets by splitting on a chosen
attribute. The attribute is chosen based on a homogeneity criterion
called the Gini Index: the tree should split on the attribute that
leads to maximum homogeneity of the child nodes. The formula for the
Gini Index is as follows:
Gini = 1 − Σ_{i=1}^{n} p_i²
where p_i is the probability of finding a point with the label i, and
n is the number of classes (possible values of the target variable).
Thus the tree calculates the Gini index at the parent node, and then
the weighted Gini index of the child nodes for every candidate
attribute. The attribute that gives maximum homogeneity in the child
nodes is chosen and the split is performed. This process is repeated
until all the attributes are exhausted, and at the end we get what is
called a fully grown decision tree. The attributes which lead to the
initial splits can therefore be considered the most important
attributes for prediction.
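To make the split criterion concrete, here is a minimal sketch (not code from the text) that computes the Gini index of a node and the weighted Gini index of a candidate split; the toy labels are made up for illustration.
import numpy as np

def gini(labels):
    """Gini index of a node: 1 - sum of squared class probabilities."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def weighted_gini(left, right):
    """Weighted Gini index of the two child nodes produced by a split."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# Toy labels: this split separates the classes perfectly, so the weighted
# Gini of the children (0.0) is lower than the Gini of the parent (~0.47).
parent = np.array([0, 0, 0, 0, 0, 1, 1, 1])
print(gini(parent))                           # ~0.469
print(weighted_gini(parent[:5], parent[5:]))  # 0.0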
Random Forests
Also, not all features are used to train every tree: a random subset
of features of fixed size is chosen at every node of every tree. This
reduces the correlation between the different trees, so that they
function largely independently.
In the Random Forest model, the data is usually not divided into
explicit training and test sets. Instead, the entire data set is used
to form bootstrapped samples, so that for each tree a set of data
points is left out and can serve as a test or validation set for that
tree. These left-out points are called Out-Of-Bag (OOB) samples. Thus
for every tree in the random forest ensemble there exists a set of
data points which was not present in its training data and is used as
a validation set to evaluate its performance.
The OOB error is calculated as the ratio of the number of incorrect
predictions on the OOB samples to the total number of OOB predictions.
Random Forests also help decide which features are important, among
all the features in the given data, by identifying and eliminating
less important features/attributes. This leads to better prediction
results. Feature importance is determined by calculating the decrease
in impurity (increase in purity) that a feature produces, using the
Gini Index.
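As a hedged sketch (not the chapter's churn code), the snippet below shows how out-of-bag scoring and Gini-based feature importances look in scikit-learn; the synthetic data set is only for illustration.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
rf = RandomForestClassifier(
    n_estimators=200,
    max_features="sqrt",   # random subset of features considered at each split
    oob_score=True,        # score each tree on its out-of-bag samples
    random_state=0,
)
rf.fit(X, y)

print("OOB accuracy:", rf.oob_score_)             # 1 - OOB error
print("Feature importances:", rf.feature_importances_)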
Now let us see the Python implementation of both the Decision Tree and
Random Forest models, with the help of a telecom churn data set.
Python Implementation
Flow of analysis:
Initial analysis tells us that the data has 99999 rows and 226
columns. There are a lot of missing values which need to be imputed
or dropped.
Most of the columns with more than 30% null values are dropped, and
for the others the null values are imputed with mean/median/mode
values as appropriate. Finally we have a data set with 99999 rows and
185 columns.
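A hedged sketch of this cleaning step is shown below; df is a placeholder name for the churn DataFrame, and the 30% threshold follows the text.
import pandas as pd

# Drop columns with more than 30% missing values
null_fraction = df.isnull().mean()
df = df.drop(columns=null_fraction[null_fraction > 0.30].index)

# Impute the remaining missing values: median for numeric columns
# (mean would also work), mode for categorical columns
for col in df.columns[df.isnull().any()]:
    if pd.api.types.is_numeric_dtype(df[col]):
        df[col] = df[col].fillna(df[col].median())
    else:
        df[col] = df[col].fillna(df[col].mode()[0])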
Churn Analysis
We can see that 91.4% of the customers are non-churn and only 8.6% are
churn, so this is an imbalanced data set. The following are some of
the analysis graphs from the EDA.
[EDA plots: STD incoming minutes of usage within the operator network,
and roaming incoming minutes of usage, for churn vs. non-churn customers]
To handle the class imbalance, the following resampling techniques are
considered (a code sketch follows the list):
1. Random Under-Sampling
2. Random Over-Sampling
3. SMOTE — Synthetic Minority Oversampling Technique
4. ADASYN — Adaptive Synthetic Sampling Method
5. SMOTETomek — Over-sampling followed by under-sampling
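A hedged sketch of these resampling techniques using the imbalanced-learn library is shown below; X_train and y_train are assumed to come from the churn train/test split.
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler, SMOTE, ADASYN
from imblearn.combine import SMOTETomek

samplers = {
    "Random Under-Sampling": RandomUnderSampler(random_state=0),
    "Random Over-Sampling": RandomOverSampler(random_state=0),
    "SMOTE": SMOTE(random_state=0),
    "ADASYN": ADASYN(random_state=0),
    "SMOTE+Tomek": SMOTETomek(random_state=0),
}

# Each sampler returns a rebalanced copy of the training data
resampled = {name: s.fit_resample(X_train, y_train) for name, s in samplers.items()}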
As we can see in the summary table, recall is high for Random Forest
with Random Undersampling, and for Logistic Regression with Random
Undersampling, Random Oversampling, SMOTE, ADASYN, and SMOTE+Tomek.
Within Logistic Regression, ADASYN gives the highest recall.
We pick the Random Forest with random undersampling for further
analysis. We make predictions on the test set and get the following
results: recall of 81% and accuracy of 85%. The following is a
representation of the top 10 most important predictors of churn.
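As a hedged sketch, the top 10 predictors could be extracted from a fitted random forest like this; rf and the DataFrame X_train are placeholder names, not objects defined in the text.
import pandas as pd

# Rank features by Gini-based importance and keep the ten largest
importances = pd.Series(rf.feature_importances_, index=X_train.columns)
top10 = importances.sort_values(ascending=False).head(10)
top10.plot(kind="barh", title="Top 10 predictors of churn")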
Support Vector Machine (SVM)
In contrast, SVM will search for apples that are very similar to
lemons, for example apples which are yellow and have an elliptical
form. This will be a support vector. The other support vector will be
a lemon similar to an apple (green and rounded). So other algorithms
learn the differences, while SVM learns the similarities.
If we visualize the example above in 2D, we will have something like
this:
Ok, but why did I draw the blue boundary like in the
picture above? I could also draw boundaries like this:
As you can see, we have an infinite number of possibilities to
draw the decision boundary. So how can we find the optimal
one?
Intuitively, the best line is the line that is far away from both the
apple and the lemon examples (i.e. has the largest margin). To have an
optimal solution, we have to maximize the margin in both directions
(if we have multiple classes, then we have to maximize it considering
each of the classes).
So if we compare the picture above with the picture below, we can
easily observe that the first is the optimal hyperplane (line) and the
second is a sub-optimal solution, because its margin is far smaller.
Because we want to maximize the margin taking all the classes into
consideration, instead of using one margin per class we use a
“global” margin which takes all the classes into account. This margin
would look like the purple line in the following picture:
This margin is orthogonal to the boundary and equidistant from the
support vectors.
Basic Steps
The basic steps of the SVM are:
1. select two hyperplanes (in 2D, lines) which separate the data
with no points between them
2. maximize their distance (the margin)
3. the average line (here, the line halfway between the two red
lines) will be the decision boundary
This is very nice and easy, but finding the best margin is an
optimization problem that is not trivial (it is easy in 2D, when we
have only two attributes, but what if we have N dimensions, with N a
very big number?).
The basic idea is that when a data set is inseparable in the current
dimensions, add another dimension, maybe that way the data
will be separable. Just think about it, the example above is in 2D and
it is inseparable, but maybe in 3D there is a gap between the apples
and the lemons, maybe there is a level difference, so lemons are on
level one and apples are on level two. In this case, we can easily draw
a separating hyperplane (in 3D a hyperplane is a plane) between
level 1 and 2.
Mapping to Higher Dimensions
Mapping from 2D to 3D
Now we have to map the apples and lemons (which are just simple
points) to this new space. Think about it carefully, what did we do?
We just used a transformation in which we added levels based
on distance. If you are in the origin, then the points will be on the
lowest level. As we move away from the origin, it means that we
are climbing the hill (moving from the center of the plane towards
the margins) so the level of the points will be higher. Now if we
consider that the origin is the lemon from the center, we will have
something like this:
After using the kernel and after all the transformations we will get:
So after the transformation, we can easily delimit the two classes
using just a single line.
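A minimal sketch of this idea (my own illustration, not the article's code): lift each 2D point to 3D by adding a coordinate that grows with the distance from the origin, so that a plane can separate classes that were not linearly separable in 2D.
import numpy as np

def lift(points_2d):
    """Map (x, y) -> (x, y, x**2 + y**2): the 'level' grows with distance from the origin."""
    z = np.sum(points_2d ** 2, axis=1, keepdims=True)
    return np.hstack([points_2d, z])

# Lemons clustered around the origin, apples on a ring around them
rng = np.random.default_rng(0)
lemons = 0.5 * rng.standard_normal((50, 2))
angles = np.linspace(0, 2 * np.pi, 50)
apples = np.column_stack([3 * np.cos(angles), 3 * np.sin(angles)])

# After lifting, lemons sit on a low level (small z) and apples on a high one
# (z around 9), so a plane at a constant z separates them.
print(lift(lemons)[:, 2].max(), lift(apples)[:, 2].min())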
Tuning Parameters
As we saw in the previous section, choosing the right kernel is
crucial, because if the transformation is incorrect the model can give
very poor results. As a rule of thumb, always check whether your data
is linearly separable, and in that case use a linear SVM (linear
kernel). Linear SVM is a parametric model, while an RBF kernel SVM
isn't, so the complexity of the latter grows with the size of the
training set. Not only is it more expensive to train an RBF kernel
SVM, but you also have to keep the kernel matrix around, and the
projection into this “infinite” higher-dimensional space where the
data becomes linearly separable is more expensive during prediction as
well. Furthermore, you have more hyperparameters to tune, so model
selection is more expensive too. And finally, it's much easier to
overfit a complex model!
Regularization
The regularization parameter C tells the SVM how much we want to avoid
misclassifying each training example: with a high C, the optimizer
will choose a smaller margin if that helps classify all the training
points correctly. On the other hand, if C is low, the margin will be
big, even if some training examples end up misclassified. This is
shown in the following two diagrams:
As you can see in the image, when C is low the margin is larger (so
implicitly the boundary has fewer curves and doesn't strictly follow
the data points), even if two apples were classified as lemons. When C
is high, the boundary is full of curves and all the training data was
classified correctly. Don't forget: even if all the training data was
correctly classified, this doesn't mean that increasing C will always
increase the precision (because of overfitting).
Gamma
The gamma parameter of the RBF kernel controls how far the influence
of a single training example reaches: with a high gamma, only the
points close to the plausible hyperplane are taken into account; with
a low gamma, points farther away count as well.
Margin
The margin, as discussed above, is the distance between the decision
boundary and the closest data points (the support vectors); a larger
margin generally gives a better, more robust model.
Because the sklearn library is a very well written and useful Python
library, we don’t have too much code to change. The only difference
is that we have to import the SVC class (SVC = SVM in sklearn)
from sklearn.svm instead of the KNeighborsClassifier class from
sklearn.neighbors.
# Fitting SVM to the Training set
from sklearn.svm import SVC
classifier = SVC(kernel = 'rbf', C = 0.1, gamma = 0.1)
classifier.fit(X_train, y_train)
After importing SVC, we can create our new model using the predefined
constructor. This constructor has many parameters, but I will describe
only the most important ones; most of the time you won't use the other
parameters.
Now let's see the output of running this code. The decision boundary
for the training set looks like this:
As we can see and as we’ve learnt in the Tuning Parameters section,
because the C has a small value (0.1) the decision boundary is
smooth.
Now if we increase the C from 0.1 to 100 we will have more curves in
the decision boundary:
What would happen if we use C = 0.1 but now increase gamma from 0.1 to
10? Let's see!
What happened here? Why do we have such a bad model? As you’ve
seen in the Tuning Parameters section, high gamma means that
when calculating the plausible hyperplane we consider only points
which are close. Now because the density of the green points is
high only in the selected green region, in that region the
points are close enough to the plausible hyperplane, so those
hyperplanes were chosen. Be careful with the gamma parameter,
because this can have a very bad influence over the results of your
model if you set it to a very high value (what is a “very high value”
depends on the density of the data points).
For this example the best values for C and Gamma are 1.0 and 1.0.
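These values can be found with a simple grid search; the sketch below is a hedged illustration (the grid is mine, and X_train/y_train are assumed to be the training data used above).
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Try a few C/gamma combinations with 5-fold cross-validation
param_grid = {'C': [0.1, 1.0, 10, 100], 'gamma': [0.1, 1.0, 10]}
grid = GridSearchCV(SVC(kernel='rbf'), param_grid, cv=5, scoring='accuracy')
grid.fit(X_train, y_train)
print(grid.best_params_)   # e.g. {'C': 1.0, 'gamma': 1.0} in this example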
Now if we run our model on the test set we will get the following
diagram:
As you can see, we've got only 3 false positives and only 4 false
negatives. The accuracy of this model is 93%, which is a really good
result; we obtained a better score than with KNN (which had an
accuracy of 80%).
What Is K-Means Clustering
WCSS and the Elbow Method to find the number of clusters
Implementation of K-Means Clustering in Python
1. What Is Clustering?
The algorithm works through the following steps:
Step-1: Select the number K to decide how many clusters to form.
Step-2: Select K random points, which will serve as the initial centroids.
Step-3: Assign each data point, based on its distance from the randomly
selected points (centroids), to the nearest/closest centroid; this forms
the K predefined clusters.
Step-4: Calculate a new centroid for each cluster (the center of gravity
of the points assigned to it).
Step-5: Repeat step 3, reassigning each data point to the new closest
centroid of each cluster.
Step-6: If any point was reassigned, go back to step 4; otherwise the
clusters are final.
Step-7: FINISH
STEP 1: We choose K = 2 and select two random points as the initial
centroids (shown in blue and yellow).
STEP 2: Now we will assign each data point in the scatter plot to its
closest centroid, based on its distance from each of the K points. This is
done by drawing a median line between the two centroids. Consider the
image below.
STEP 3: Points on the left side of the line are closer to the blue
centroid, and points to the right of the line are closer to the yellow
centroid; the left points form a cluster with the blue centroid and the
right points with the yellow centroid.
STEP 4: We compute the center of gravity of each cluster and move the
centroids to these new positions.
STEP 5: Next, we will reassign each data point to the new closest
centroid, repeating the same process as above (using a median line). The
yellow data point that now lies on the blue side of the median line will
be included in the blue cluster.
STEP 6: As reassignment has taken place, we repeat the step of finding
new centroids.
STEP 7: We will repeat the process of finding the center of gravity of
each cluster to obtain new centroids, as depicted below.
STEP 8: After finding the new centroids, we will again draw the median
line and reassign the data points, as in the steps above.
STEP 9: We finally segregate the points based on the median line, so that
two groups are formed and no dissimilar point is included in either group.
The final clusters formed are as follows:
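A hedged, from-scratch sketch of the assign/recompute loop described in the steps above, using two clusters and a few toy 2D points of my own.
import numpy as np

X = np.array([[1.0, 1.0], [1.5, 2.0], [1.2, 0.8],
              [8.0, 8.0], [8.5, 9.0], [7.8, 8.2]])
rng = np.random.default_rng(0)
centroids = X[rng.choice(len(X), size=2, replace=False)]        # Steps 1-2

for _ in range(10):
    # Steps 3/5: assign every point to its nearest centroid
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Step 4: recompute each centroid as the center of gravity of its cluster
    new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(2)])
    # Steps 6-7: stop when no centroid moves any more
    if np.allclose(new_centroids, centroids):
        break
    centroids = new_centroids

print(labels)       # cluster assignment of each point
print(centroids)    # final cluster centers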
WCSS (Within-Cluster Sum of Squares) is the sum of the squared distances
of the data points in each cluster from that cluster's centroid:
WCSS = Σ_k Σ_{x in cluster k} ||x − μ_k||², where μ_k is the centroid of
cluster k.
The main idea is to minimize the distance between the data points and the
centroids of their clusters. The process is iterated until we reach a
minimum value for the sum of distances.
To find the optimal number of clusters, the elbow method follows the
steps below:
1. Execute K-means clustering on the given dataset for different values
of K (ranging from 1 to 10).
2. For each value of K, calculate the WCSS.
3. Plot the WCSS values against the number of clusters K.
4. The sharp bend of the plot (the point that looks like the elbow of an
arm) is taken as the best/optimal value of K.
5. Python Implementation
Importing relevant libraries
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
from sklearn.cluster import KMeans
Clustering Results
# `data` is assumed to be a DataFrame with 'Latitude' and 'Longitude'
# columns, `x` the features selected for clustering, and `kmeans` a
# KMeans model created earlier, e.g. kmeans = KMeans(3)
identified_clusters = kmeans.fit_predict(x)
identified_clusters
array([1, 1, 0, 0, 0, 2])
data_with_clusters = data.copy()
data_with_clusters['Clusters'] = identified_clusters
plt.scatter(data_with_clusters['Longitude'], data_with_clusters['Latitude'],
            c=data_with_clusters['Clusters'], cmap='rainbow')
Trying a different method (to find the number of clusters to be selected)
WCSS and Elbow Method
wcss = []
for i in range(1, 7):
    kmeans = KMeans(i)
    kmeans.fit(x)
    wcss_iter = kmeans.inertia_
    wcss.append(wcss_iter)

number_clusters = range(1, 7)
plt.plot(number_clusters, wcss)
plt.title('The Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
Uses of PCA:
It is used to find inter-relation between variables in the data.
It is used to interpret and visualize data.
By decreasing the number of variables, it makes further analysis
simpler.
It’s often used to visualize genetic distance and relatedness
between populations.
PCA is performed on a square symmetric matrix. This can be a pure
sums-of-squares-and-cross-products (SSCP) matrix, a covariance matrix, or
a correlation matrix. A correlation matrix is used if the variances of
the individual variables differ greatly.
Objectives of PCA:
PCA is basically an unsupervised procedure (it does not depend on an
outcome variable) which reduces the attribute space from a large number
of variables to a smaller number of factors.
PCA is basically a dimension-reduction process, but there is no
guarantee that the resulting dimensions are interpretable.
The main task in PCA is to select a subset of variables from a larger
set, based on which original variables have the highest correlation with
the principal components.
Principal Axis Method: PCA searches for a linear combination of variables
that extracts the maximum variance from the variables. Once this is done,
PCA removes that variance and searches for another linear combination
that explains the maximum proportion of the remaining variance, which
leads to orthogonal factors. In this method, we analyze total variance.
Eigenvector: a non-zero vector that stays parallel to itself after matrix
multiplication. Suppose x is an eigenvector of dimension r of a matrix M
of dimension r×r; then Mx and x are parallel. To obtain the eigenvectors
and eigenvalues we solve Mx = λx, where the vector x and the scalar λ
(the eigenvalue) are both unknown.
In terms of eigenvectors, the principal components capture both the
common and the unique variance of the variables. PCA is a
variance-focused approach that seeks to reproduce the total variance and
the correlations using all the components. The principal components are
linear combinations of the original variables, weighted by their
contribution to explaining the variance in a particular orthogonal
dimension.
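As a hedged illustration of the eigenvector idea (not the library code used below), PCA can be done "by hand" by eigen-decomposing the covariance matrix and projecting the data onto the leading eigenvectors.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                  # toy data: 100 samples, 3 variables
X = X - X.mean(axis=0)                         # centre each variable

cov = np.cov(X, rowvar=False)                  # square symmetric covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)         # solves cov @ x = lambda * x

order = np.argsort(eigvals)[::-1]              # components sorted by explained variance
components = eigvecs[:, order[:2]]             # keep the two leading principal axes
X_reduced = X @ components                     # project the data onto them
explained_ratio = eigvals[order[:2]] / eigvals.sum()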
Step 3: Splitting the dataset into the Training set and Test set
# Splitting X and y into the Training set and Testing set; X and y are
# assumed to be the feature matrix and target prepared in the earlier steps
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Feature scaling
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# Applying PCA and keeping the first two principal components
pca = PCA(n_components=2)
X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)
explained_variance = pca.explained_variance_ratio_

# Fitting Logistic Regression to the reduced training set
classifier = LogisticRegression(random_state=0)
classifier.fit(X_train, y_train)

# Predicting the test set results and building the confusion matrix
y_pred = classifier.predict(X_test)
cm = confusion_matrix(y_test, y_pred)

# Visualising the training set in the PCA plane (re-run this block with
# X_set, y_set = X_test, y_test to visualise the test set)
X_set, y_set = X_train, y_train
X1, X2 = X_set[:, 0], X_set[:, 1]
plt.xlim(X1.min() - 1, X1.max() + 1)
plt.ylim(X2.min() - 1, X2.max() + 1)
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                c = ListedColormap(('red', 'green', 'blue'))(i), label = j)
plt.legend()
plt.show()