Unit No. 03 - Classification & Regression
Course Outcomes:
On completion of the course, learner will be able to
CO1. DEMONSTRATE fundamentals of artificial intelligence and machine learning.
CO2. APPLY feature extraction and selection techniques.
CO3. APPLY machine learning algorithms for classification and regression problems.
CO4. DEVISE AND DEVELOP a machine learning model using various steps.
CO5. EXPLAIN concepts of reinforced and deep learning.
CO6. SIMULATE machine learning model in mechanical engineering problems.
Theory questions
There are various algorithms in machine learning, so choosing the best algorithm for the given dataset and problem is the main point to remember while creating a machine learning model. The decision tree is introduced below, followed by two reasons for using it.
Decision Tree is a Supervised learning technique that can be used for both
classification and Regression problems, but mostly it is preferred for solving
Classification problems.
It is a tree-structured classifier, where internal nodes represent the features of a
dataset, branches represent the decision rules and each leaf node represents the
outcome.
In a Decision tree, there are two nodes, which are the Decision Node and Leaf Node.
Decision nodes are used to make any decision and have multiple branches, whereas Leaf
nodes are the output of those decisions and do not contain any further branches.
The decisions or the test are performed on the basis of features of the given dataset.
It is a graphical representation for getting all the possible solutions to a
problem/decision based on given conditions.
It is called a decision tree because, similar to a tree, it starts with the root node, which
expands on further branches and constructs a tree-like structure.
A decision tree simply asks a question, and based on the answer (Yes/No), it further splits the tree into sub-trees. The diagram below explains the general structure of a decision tree.
Decision Trees usually mimic human thinking ability while making a decision, so it is
easy to understand.
The logic behind the decision tree can be easily understood because it shows a tree-like
structure.
The decision tree comprises a root node, leaf nodes, branches, parent/child nodes, etc. The following is an explanation of this terminology.
Root Node: Root node is from where the decision tree starts. It represents the entire
dataset, which further gets divided into two or more homogeneous sets.
Leaf Node: Leaf nodes are the final output nodes; the tree cannot be segregated further once a leaf node is reached.
Splitting: Splitting is the process of dividing the decision node/root node into sub-
nodes according to the given conditions.
Branch/Sub Tree: A tree formed by splitting the tree.
Pruning: Pruning is the process of removing the unwanted branches from the tree.
Parent/Child node: The root node of the tree is called the parent node, and other
nodes are called the child nodes.
In a decision tree, for predicting the class of the given dataset, the algorithm starts from the
root node of the tree. This algorithm compares the values of root attribute with the record
(real dataset) attribute and, based on the comparison, follows the branch and jumps to the
next node. For the next node, the algorithm again compares the attribute value with the
other sub-nodes and moves further. It continues this process until it reaches a leaf node of the tree.
Step-1: Begin the tree with the root node, say S, which contains the complete dataset.
Step-2: Find the best attribute in the dataset using an Attribute Selection Measure (ASM), i.e. information gain or the Gini index.
Step-3: Divide S into subsets that contain possible values for the best attribute.
Step-4: Generate the decision tree node, which contains the best attribute.
Step-5: Recursively make new decision trees using the subsets of the dataset created in Step-3. Continue this process until a stage is reached where you cannot further classify the nodes; the final nodes are called leaf nodes.
Example: Suppose there is a candidate who has a job offer and wants to decide whether he should accept the offer or not. To solve this problem, the decision tree starts with the root node (the Salary attribute, selected by ASM).
The root node splits further into the next decision node (distance from the office) and one
leaf node based on the corresponding labels. The next decision node further gets split into
one decision node (Cab facility) and one leaf node. Finally, the decision node splits into two
leaf nodes (Accepted offers and Declined offer). See the above figure.
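The five steps above can be reproduced in a few lines with scikit-learn. The sketch below is illustrative only: the Iris dataset stands in for the job-offer example, and criterion="entropy" selects information gain as the ASM (use "gini" for the Gini index).

```python
# Minimal decision-tree sketch with scikit-learn (illustrative dataset and parameters).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# criterion="entropy" uses information gain as the ASM; criterion="gini" uses the Gini index.
tree = DecisionTreeClassifier(criterion="entropy", random_state=42)
tree.fit(X_train, y_train)

print("Test accuracy:", round(tree.score(X_test, y_test), 3))
print(export_text(tree, feature_names=load_iris().feature_names))  # text view of the learned splits
```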
The general idea is that we will segment the predictor space into a number of simple
regions. In order to make a prediction for a given observation, we typically use the mean of
the training data in the region to which it belongs. Since the set of splitting rules used to
segment the predictor space can be summarized by a tree such approaches are called
decision tree methods. These methods are simple and useful for interpretation. We want to
predict a response or class Y from inputs X1,X2, . . .Xp. We do this by growing a binary tree.
At each internal node in the tree, we apply a test to one of the inputs, say Xi . Depending on
the outcome of the test, we go to either the left or the right sub-branch of the tree. Eventually
we come to a leaf node, where we make a prediction. This prediction aggregates or
averages all the training data points which reach that leaf. In order to motivate regression
trees, we begin with a simple example. Our motivation is to predict a baseball player’s Salary
based on Years (the number of years that he has played in the major leagues) and Hits (the
number of hits that he made in the previous year). We first remove observations that are
missing Salary values and log-transform Salary so that its distribution has more of a typical
bell-shape. Recall that Salary is measured in thousands of dollars.
The tree represents a series of splits starting at the top of the tree. The top split assigns
observations having Years < 4.5 to the left branch. The predicted salary for these players is
given by the mean response value for the players in the data set with Years < 4.5. For such players, the mean log salary is 5.107, so we make a prediction of e^5.107 thousands of dollars, i.e. about $165,174. How would you interpret the rest (right branch) of the tree?
In keeping with the tree analogy, the regions R1, R2, and R3 are known as terminal
nodes or leaves of the tree.
As is the case for Figure 2, decision trees are typically drawn upside down, in the sense
that the leaves are at the bottom of the tree.
The points along the tree where the predictor space is split are referred to as internal
nodes.
In Figure 2, the two internal nodes are indicated by the text Years < 4.5 and Hits < 117.5.
We refer to the segments of the trees that connect the nodes as branches.
Years is the most important factor in determining Salary, and players with less experience
earn lower salaries than more experienced players.
Given that a player is less experienced, the number of hits that he made in the previous
year seems to play little role in his salary.
But among players who have been in the major leagues for five or more years, the
number of hits made in the previous year does affect salary, and players who made more
hits last year tend to have higher salaries.
The regression tree shown in Figure 2 is likely an over-simplification of the true relationship between Hits, Years, and Salary, but it has a very nice, easy interpretation compared with more complicated approaches.
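A regression tree of this kind can be fitted in a few lines. The sketch below uses a small synthetic stand-in for the Hitters data (the real dataset is not reproduced here), so the split points and the predicted salary are purely illustrative.

```python
# Illustrative regression tree in the spirit of the Hitters example (synthetic data, not the real dataset).
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
years = rng.integers(1, 20, size=200)          # years played in the major leagues
hits = rng.integers(40, 200, size=200)         # hits made in the previous year
log_salary = 4.5 + 0.08 * years + 0.004 * hits + rng.normal(0, 0.3, size=200)

X = np.column_stack([years, hits])
tree = DecisionTreeRegressor(max_depth=2)      # depth 2 mimics the two-split tree in Figure 2
tree.fit(X, log_salary)

# Predict for a player with 5 years and 120 hits, then undo the log-transform (salary in thousands of $).
pred_log = tree.predict([[5, 120]])[0]
print("Predicted salary (thousands of dollars):", round(float(np.exp(pred_log)), 1))
```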
Mathematics based questions
5. Explain entropy reduction, information gain and Gini index in decision tree.
While implementing a Decision tree, the main issue arises that how to select the best
attribute for the root node and for sub-nodes. So, to solve such problems there is a
technique which is called as Attribute selection measure or ASM. By this measurement, we
can easily select the best attribute for the nodes of the tree. There are two popular
techniques for ASM, which are:
Information Gain:
Information gain is the measurement of changes in entropy after the segmentation of a
dataset based on an attribute.
It calculates how much information a feature provides us about a class.
According to the value of information gain, we split the node and build the decision
tree. A decision tree algorithm always tries to maximize the value of information gain,
and a node/attribute having the highest information gain is split first. It can be
calculated using the below formula:
Information Gain= Entropy(S) – [(Weighted Average) * Entropy (each feature)]
Entropy:
Entropy is a metric to measure the impurity in a given attribute. It specifies randomness in
data. Entropy can be calculated as:
Entropy(S) = − P(yes) log2 P(yes) − P(no) log2 P(no)
Where S is the set of samples, P(yes) is the probability of yes, and P(no) is the probability of no.
Gini Index:
Gini index is a measure of impurity or purity used while creating a decision tree in the
CART (Classification and Regression Tree) algorithm.
An attribute with a low Gini index should be preferred over one with a high Gini index. The Gini index can be calculated using the formula: Gini Index = 1 − Σ_j (P_j)²
It only creates binary splits, and the CART algorithm uses the Gini index to create binary
splits.
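As a quick check of these formulas, the sketch below computes entropy, information gain and the Gini index for a hypothetical split; the class counts (9 yes / 5 no split into 6/2 and 3/3) are made up for the example.

```python
# Entropy, information gain and Gini index for a hypothetical binary split (illustrative counts).
import math

def entropy(pos, neg):
    total = pos + neg
    return -sum((c / total) * math.log2(c / total) for c in (pos, neg) if c > 0)

def gini(pos, neg):
    total = pos + neg
    return 1.0 - (pos / total) ** 2 - (neg / total) ** 2

# Parent node: 9 "yes" and 5 "no"; a feature splits it into (6 yes, 2 no) and (3 yes, 3 no).
parent = entropy(9, 5)
weighted_children = (8 / 14) * entropy(6, 2) + (6 / 14) * entropy(3, 3)

print(f"Entropy(S)       = {parent:.3f}")                      # ~0.940
print(f"Information Gain = {parent - weighted_children:.3f}")  # ~0.048
print(f"Gini(S)          = {gini(9, 5):.3f}")                  # ~0.459
```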
7. Decision trees often tend to overfit during training. What is the reason?
A decision tree tends to overfit because, at each node, it makes a decision among a subset of all the features (columns), so by the time it reaches a final decision it has built a complicated and long decision chain. Only if a data point satisfies all the rules along this chain can the final decision be made. This kind of specific rule makes the tree very specific to the training set and, on the other hand, unable to generalize well to new data points it has never seen. Especially when your dataset has many features (high dimension), it tends to overfit more. In the J48 decision tree, overfitting happens when the algorithm gets information with exceptional attributes. This causes many fragmentations in the process distribution. Statistically unimportant nodes with the fewest examples are known as fragmentations. Usually the J48 algorithm builds trees and grows their branches 'just deep enough to perfectly classify the training examples'. This approach performs better with noise-free data, but most of the time this strategy overfits the training examples when the data are noisy. At present, two strategies are widely used to bypass this overfitting in decision tree learning: 1) if the tree grows taller, stop it from growing before it reaches the maximum point of accurate classification of the training data; 2) let the tree overfit the training data and then post-prune the tree. By default, the decision tree model is allowed to grow to its full depth.
Pruning refers to a technique to remove the parts of the decision tree to prevent growing to
its full depth. By tuning the hyperparameters of the decision tree model one can prune the
trees and prevent them from overfitting. There are two types of pruning Pre-pruning and
Post-pruning. Now let's discuss the in-depth understanding and hands-on implementation
of each of these pruning techniques.
Pre-Pruning:
The pre-pruning technique refers to the early stopping of the growth of the decision tree.
The pre-pruning technique involves tuning the hyperparameters of the decision tree model
prior to the training pipeline. The hyperparameters of the decision tree including
max_depth, min_samples_leaf, min_samples_split can be tuned to early stop the growth
of the tree and prevent the model from overfitting.
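A minimal pre-pruning sketch with scikit-learn is shown below; the breast-cancer dataset and the specific hyperparameter values are stand-ins chosen only for illustration.

```python
# Pre-pruning: cap the tree's growth with hyperparameters set before training.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pruned = DecisionTreeClassifier(max_depth=4, min_samples_leaf=5, min_samples_split=10, random_state=0)
pruned.fit(X_train, y_train)

print("Tree depth:", pruned.get_depth())
print("Test accuracy:", round(pruned.score(X_test, y_test), 3))
```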
Post-Pruning:
The Post-pruning technique allows the decision tree model to grow to its full depth, then
removes the tree branches to prevent the model from overfitting. Cost complexity pruning
(ccp) is one type of post-pruning technique. In case of cost complexity pruning, the
ccp_alpha can be tuned to get the best fit model.
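For post-pruning, scikit-learn exposes cost complexity pruning through the ccp_alpha parameter. The sketch below (same illustrative dataset as above) scans the pruning path for a reasonable alpha; selecting it on the test split is a simplification, and cross-validation would normally be used.

```python
# Post-pruning: grow the tree fully, then prune it back with cost complexity pruning (ccp_alpha).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The pruning path lists the effective alphas at which subtrees would be removed.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

best_alpha, best_score = 0.0, 0.0
for alpha in path.ccp_alphas:
    alpha = max(float(alpha), 0.0)   # guard against tiny negative values from floating-point error
    model = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
    score = model.score(X_test, y_test)
    if score > best_score:
        best_alpha, best_score = alpha, score

print("Chosen ccp_alpha:", best_alpha, "-> test accuracy:", round(best_score, 3))
```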
Problems/Numerical
Problem 1:
If we decided to arbitrarily label all 4 gumballs as red, how often would one of the gumballs be incorrectly labelled?
4 red and 0 blue:
The impurity measurement is 0 because we would never incorrectly label any of the 4 red gumballs here. If we arbitrarily computed the score with respect to the 'blue' class instead, the index would still be 0, because there are no blue gumballs in the group.
The gini score is the same no matter which class you take the probabilities of, because the formula is symmetric in the classes.
A gini score of 0 is the most pure score possible.
2 red and 2 blue:
The impurity measurement is 0.5 because we would incorrectly label gumballs about half the time. Because this index is used with binary target variables (0, 1), a gini index of 0.5 is the least pure score possible: half is one type and half is the other. Dividing gini scores by 0.5 can help intuitively understand what the score represents. 0.5/0.5 = 1, meaning the grouping is as impure as possible (in a group with just 2 outcomes).
3 red and 1 blue:
The impurity measurement here is 0.375. If we divide this by 0.5 for a more intuitive understanding we get 0.75, which expresses the impurity relative to the worst case.
Problem 2:
How does entropy work with the same gumball scenarios stated in problem 1?
4 red and 0 blue:
Unsurprisingly, the impurity measurement is 0 for entropy as well. This is the max purity
score using information entropy.
2 red and 2 blue:
The impurity measurement is 1.0 here, the maximum possible entropy, since the two classes are equally likely.
3 red and 1 blue:
The purity/impurity measurement is 0.811 here, a bit worse than the corresponding gini score of 0.375.
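The numbers quoted in Problems 1 and 2 can be verified with a few lines of Python; the sketch below simply evaluates the Gini and entropy formulas for the three gumball groupings.

```python
# Gini index and entropy for the gumball groupings in Problems 1 and 2.
import math

def gini(counts):
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

for red, blue in [(4, 0), (2, 2), (3, 1)]:
    print(f"{red} red, {blue} blue -> gini = {gini([red, blue]):.3f}, entropy = {entropy([red, blue]):.3f}")
# Expected: (4,0) -> 0.000 / 0.000, (2,2) -> 0.500 / 1.000, (3,1) -> 0.375 / 0.811
```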
Theory questions
Random Forest is a popular machine learning algorithm that belongs to the supervised
learning technique. It can be used for both Classification and Regression problems in ML. It
is based on the concept of ensemble learning, which is a process of combining multiple
classifiers to solve a complex problem and to improve the performance of the model.
As the name suggests, "Random Forest is a classifier that contains a number of decision
trees on various subsets of the given dataset and takes the average to improve the
predictive accuracy of that dataset." Instead of relying on one decision tree, the random
forest takes the prediction from each tree and based on the majority votes of predictions,
and it predicts the final output. The greater number of trees in the forest leads to higher
accuracy and prevents the problem of overfitting.
Assumptions for Random Forest
Since the random forest combines multiple trees to predict the class of the dataset, it is
possible that some decision trees may predict the correct output, while others may not. But
together, all the trees predict the correct output. Therefore, below are two assumptions for a
better Random forest classifier:
There should be some actual values in the feature variable of the dataset so that the
classifier can predict accurate results rather than a guessed result.
The predictions from each tree must have very low correlations.
The below diagram explains the working of the Random Forest algorithm:
Below are some points that explain why we should use the Random Forest algorithm:
It takes less training time as compared to other algorithms.
It predicts output with high accuracy, even for the large dataset it runs efficiently.
It can also maintain accuracy when a large proportion of data is missing.
It can be used for both classifications as well as regression tasks.
Overfitting is a critical problem that can make results poor, but in the case of the random forest the classifier will not overfit if there are enough trees.
It can be used for categorical values as well.
10. How does the random forest tree work for classification?
Random Forest works in two phases: the first is to create the random forest by combining N decision trees, and the second is to make predictions with each tree created in the first phase.
The Working process can be explained in the below steps and diagram:
Step-1: Select random K data points from the training set.
Step-2: Build the decision trees associated with the selected data points (Subsets).
Step-3: Choose the number N for decision trees that you want to build.
Step-4: Repeat Step 1 & 2.
Step-5: For new data points, find the predictions of each decision tree, and assign the new
data points to the category that wins the majority votes.
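These steps are what scikit-learn's RandomForestClassifier does internally. A minimal sketch follows; the Iris dataset replaces the fruit-image example purely for illustration.

```python
# Minimal random forest sketch with scikit-learn (Iris used as a stand-in dataset).
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# n_estimators is the number N of decision trees; each tree is trained on a bootstrap sample (Steps 1-4).
forest = RandomForestClassifier(n_estimators=100, random_state=1)
forest.fit(X_train, y_train)

# Step 5: the final class for a new data point is the majority vote over all trees.
print("Test accuracy:", round(forest.score(X_test, y_test), 3))
print("Predicted class for the first test sample:", forest.predict(X_test[:1])[0])
```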
The working of the algorithm can be better understood by the below example:
Example: Suppose there is a dataset that contains multiple fruit images. So, this dataset is
given to the Random forest classifier. The dataset is divided into subsets and given to each
decision tree. During the training phase, each decision tree produces a prediction result, and
when a new data point occurs, then based on the majority of results, the Random Forest
classifier predicts the final decision. Consider the below image:
Bagging: Given the training set of N examples, we repeatedly sample subsets of the
training data of size n where n is less than N. Sampling is done at random but with
replacement. This subsampling of a training set is called bootstrap aggregating, or
bagging, for short.
Random subspace method: If each training example has M features, we take a subset of
them of size m < M to train each estimator. So no estimator sees the full training set,
each estimator sees only m features of n training examples.
Training estimators: We create Ntree decision trees, or estimators, and train each one on
a different set of m features and n training examples. The trees are not pruned, as they
would be in the case of training a simple decision tree classifier.
Perform inference by aggregating predictions of estimators: To make a prediction
for a new incoming example, we pass the relevant features of this example to each of the
Ntree estimators. We will obtain Ntree predictions, which we need to combine to produce
the overall prediction of the random forest. In the case of classification, we will use
majority voting to decide on the predicted class, and in the case of regression, we will
take the mean value of the predictions of all the estimators.
12. What are advantages and limitations of the random forest tree?
The Ensemble learning helps improve machine learning results by combining several models.
This approach allows the production of better predictive performance compared to a single
model. Basic idea is to learn a set of classifiers (experts) and to allow them to vote. Bagging
and Boosting are two types of Ensemble Learning. These two decrease the variance of a
single estimate as they combine several estimates from different models. So the result may
be a model with higher stability. Let's understand these two terms at a glance.
1. Bagging: It is a homogeneous weak learners' model in which the learners are trained independently in parallel, and their outputs are combined to determine the model average.
2. Boosting: It is also a homogeneous weak learners' model, but it works differently from Bagging. In this model, learners learn sequentially and adaptively to improve the model predictions of a learning algorithm.
Bagging: Bootstrap Aggregating, also known as bagging, is a machine learning ensemble
meta-algorithm designed to improve the stability and accuracy of machine learning
algorithms used in statistical classification and regression. It decreases the variance and
helps to avoid overfitting. It is usually applied to decision tree methods. Bagging is a special
case of the model averaging approach.
Description of the Technique
Suppose a set D of d tuples; at each iteration i, a training set D_i of d tuples is sampled with replacement from D (i.e., a bootstrap sample). Then a classifier model M_i is learned for each training set D_i. Each classifier M_i returns its class prediction. The bagged classifier M* counts the votes and assigns the class with the most votes to X (an unknown sample).
Implementation Steps of Bagging
Step 1: Multiple subsets are created from the original data set with equal tuples,
selecting observations with replacement.
Step 2: A base model is created on each of these subsets.
Step 3: Each model is learned in parallel from each training set and independent of each
other.
Step 4: The final predictions are determined by combining the predictions from all the
models.
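scikit-learn's BaggingClassifier implements exactly these steps (bootstrap subsets, independently trained base models, vote aggregation). The sketch below is illustrative; its default base learner is a decision tree.

```python
# Bagging sketch: bootstrap sampling + independently trained base models + aggregated votes.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2)

# Steps 1-3: draw 50 bootstrap subsets with replacement and fit one base model per subset, in parallel.
bagged = BaggingClassifier(n_estimators=50, bootstrap=True, n_jobs=-1, random_state=2)
bagged.fit(X_train, y_train)

# Step 4: the final prediction combines (votes over) the predictions of all base models.
print("Bagged test accuracy:", round(bagged.score(X_test, y_test), 3))
```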
Figure: an illustration of the intuition behind the boosting algorithm, consisting of sequentially trained learners and a weighted dataset.
Similarities between Bagging and Boosting
Bagging and Boosting, both being the commonly used methods, have a universal similarity
of being classified as ensemble methods. Here we will explain the similarities between them.
Both are ensemble methods to get N learners from 1 learner.
Both generate several training data sets by random sampling.
Both make the final decision by averaging the N learners (or taking the majority of them
i.e Majority Voting).
Both are good at reducing variance and provide higher stability.
Differences between Bagging and Boosting
1. Bagging is the simplest way of combining predictions that belong to the same type; Boosting is a way of combining predictions that belong to different types.
2. Bagging aims to decrease variance, not bias; Boosting aims to decrease bias, not variance.
3. In Bagging, each model receives equal weight; in Boosting, models are weighted according to their performance.
4. In Bagging, each model is built independently; in Boosting, new models are influenced by the performance of previously built models.
5. In Bagging, different training data subsets are randomly drawn with replacement from the entire training dataset; in Boosting, every new subset contains the elements that were misclassified by previous models.
6. Bagging tries to solve the over-fitting problem; Boosting tries to reduce bias.
7. If the classifier is unstable (high variance), apply Bagging; if the classifier is stable and simple (high bias), apply Boosting.
8. Example: the Random Forest model uses Bagging; AdaBoost uses Boosting techniques.
15. Which is the best, Bagging or Boosting?
There's not an outright winner; it depends on the data, the simulation and the circumstances.
Bagging and Boosting decrease the variance of your single estimate as they combine
several estimates from different models. So the result may be a model with higher
stability.
If the problem is that the single model gets a very low performance, Bagging will rarely
get a better bias. However, Boosting could generate a combined model with lower
errors as it optimises the advantages and reduces pitfalls of the single model.
By contrast, if the difficulty of the single model is over-fitting, then Bagging is the best option. Boosting, for its part, doesn't help to avoid over-fitting; in fact, this technique is faced with this problem itself. Thus, Bagging is effective more often than Boosting.
16. What are the main advantages of using a random forest versus a single
decision tree?
In an ideal world, we'd like to reduce both bias-related and variance-related errors. This issue is well addressed by random forests. A random forest is nothing more than a collection of decision trees with their findings combined into a single final result. Random forests are so powerful because of their capability to reduce overfitting without massively increasing error due to bias, which makes them far more resilient than a single decision tree. They combine numerous decision trees to reduce overfitting and bias-related inaccuracy, and hence produce usable results.
Theory questions
18. What are the Pros and Cons of using Naive Bayes?
19. How does the Bayes algorithm differ from decision trees?
The working of the Naïve Bayes classifier can be understood with the help of the example below. Suppose we have a dataset of weather conditions and a corresponding target variable "Play". Using this dataset, we need to decide whether we should play or not on a particular day according to the weather conditions. To solve this problem, we need to follow the steps below:
1. Convert the given dataset into frequency tables.
2. Generate a likelihood table by finding the probabilities of the given features.
3. Use Bayes' theorem to calculate the posterior probability.
Problem: If the weather is sunny, should the player play or not?
Solution: To solve this, first consider the dataset below:
SN Outlook Play SN Outlook Play SN Outlook Play SN Outlook Play
0 Rainy Yes 4 Sunny No 8 Rainy No 12 Overcast Yes
1 Sunny Yes 5 Rainy Yes 9 Sunny No 13 Overcast Yes
2 Overcast Yes 6 Sunny Yes 10 Sunny Yes
3 Overcast Yes 7 Overcast Yes 11 Rainy No
Frequency table for the Weather Conditions:
Weather Yes No
Overcast 5 0
Rainy 2 2
Sunny 3 2
Total 10 4
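The remaining steps (the likelihood table and the posterior) follow directly from this frequency table. The short sketch below works through the "Sunny" query so the result can be checked by hand.

```python
# Naive Bayes by hand for the weather data: should the player play when the outlook is Sunny?
freq = {"Overcast": {"Yes": 5, "No": 0},
        "Rainy":    {"Yes": 2, "No": 2},
        "Sunny":    {"Yes": 3, "No": 2}}

total_yes = sum(row["Yes"] for row in freq.values())   # 10
total_no = sum(row["No"] for row in freq.values())     # 4
total = total_yes + total_no                           # 14

# Likelihoods and priors read off the frequency table.
p_sunny_given_yes = freq["Sunny"]["Yes"] / total_yes   # 3/10
p_sunny_given_no = freq["Sunny"]["No"] / total_no      # 2/4
p_yes, p_no = total_yes / total, total_no / total      # 10/14, 4/14
p_sunny = sum(freq["Sunny"].values()) / total          # 5/14

# Bayes' theorem: P(Yes|Sunny) = P(Sunny|Yes) * P(Yes) / P(Sunny), and similarly for No.
p_yes_given_sunny = p_sunny_given_yes * p_yes / p_sunny   # 0.60
p_no_given_sunny = p_sunny_given_no * p_no / p_sunny      # 0.40

print(f"P(Yes|Sunny) = {p_yes_given_sunny:.2f}, P(No|Sunny) = {p_no_given_sunny:.2f}")
print("Decision:", "Play" if p_yes_given_sunny > p_no_given_sunny else "Do not play")
```

Since P(Yes|Sunny) = 0.60 exceeds P(No|Sunny) = 0.40, the player should play on a sunny day.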
Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is used to determine the
probability of a hypothesis with prior knowledge. It depends on the conditional probability.
Using Bayes theorem, we can find the probability of A happening, given that B has occurred.
Here, B is the evidence and A is the hypothesis. The assumption made here is that the
predictors/features are independent. That is presence of one particular feature does not
affect the other. Hence it is called naive.
Bayes' theorem is written as P(A|B) = [P(B|A) P(A)] / P(B), where:
P(A|B) is the posterior probability: the probability of hypothesis A given the observed event B.
P(B|A) is the likelihood: the probability of the evidence B given that hypothesis A is true.
P(A) is the prior probability of the hypothesis, and P(B) is the probability of the evidence.
Problems/Numerical
Problem 1:
Consider a car theft example. The attributes are Colour, Type, and Origin, and the target, Stolen, can be either Yes or No. Use the Naive Bayes classifier to classify a "Red Domestic SUV". The dataset is as below.
Solution:
Note there is no example of a Red Domestic SUV in our data set. We need to calculate the probabilities P(Red|Yes), P(SUV|Yes), P(Domestic|Yes), P(Red|No), P(SUV|No), and P(Domestic|No), and multiply them by P(Yes) and P(No) respectively.
Problem 2:
Problem 3:
Theory Mathematics Numerical
Topic: Support Vector Machine
Theory questions
Support Vector Machine or SVM is one of the most popular Supervised Learning algorithms,
which is used for Classification as well as Regression problems. However, primarily, it is used
for Classification problems in Machine Learning. The goal of the SVM algorithm is to create
the best line or decision boundary that can segregate n-dimensional space into classes so
that we can easily put the new data point in the correct category in the future. This best
decision boundary is called a hyperplane. SVM chooses the extreme points/vectors that help
in creating the hyperplane. These extreme cases are called support vectors, and hence the algorithm is termed a Support Vector Machine. Consider the below diagram, in which there are two different categories that are classified using a decision boundary or hyperplane:
SVM works by mapping data to a high-dimensional feature space so that data points can be
categorized, even when the data are not otherwise linearly separable.
Figure: the original dataset, the data with a separator added, and the transformed data.
A separator between the categories is found, and then the data are transformed in such a
way that the separator could be drawn as a hyperplane. Following this, characteristics of new
data can be used to predict the group to which a new record should belong. For example,
consider the following figure, in which the data points fall into two different categories. The
two categories can be separated with a curve, as shown in the figure. After the
transformation, the boundary between the two categories can be defined by a hyperplane,
as shown in the following figure.
The mathematical function used for the transformation is known as the kernel function.
Following are the popular functions.
Linear
Polynomial
Radial basis function (RBF)
Sigmoid
A linear kernel function is recommended when linear separation of the data is
straightforward. In other cases, one of the other functions should be used. You will need to
experiment with the different functions to obtain the best model in each case, as they each
use different algorithms and parameters.
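In scikit-learn these kernels correspond to the kernel parameter of SVC. The sketch below tries each one on a small non-linearly-separable dataset; the dataset and parameters are illustrative.

```python
# Comparing the common kernel functions with scikit-learn's SVC (illustrative dataset).
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.2, random_state=0)   # two classes, not linearly separable
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for kernel in ["linear", "poly", "rbf", "sigmoid"]:
    model = SVC(kernel=kernel, gamma="scale").fit(X_train, y_train)
    print(f"{kernel:8s} kernel -> test accuracy {model.score(X_test, y_test):.3f}")
```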
The distance of the vectors from the hyperplane is called the margin, which is the separation between the line and the closest class points. We would like to choose a hyperplane that maximizes the margin between classes. The graph below shows what good margins and bad margins are.
The margin can be sub-divided into:
1. Soft Margin – As most real-world data are not fully linearly separable, we allow some margin violations to occur, which is called soft margin classification. It is better to have a large margin, even though some constraints are violated. A margin violation means choosing a hyperplane that allows some data points to stay either on the incorrect side of the hyperplane or between the margin and the correct side of the hyperplane.
2. Hard Margin – If the training data is linearly separable, we can select two parallel hyperplanes that separate the two classes of data so that the distance between them is as large as possible.
27. Explain Support Vector Machine terminology.
Support Vector Machines are part of the supervised learning model with an associated
learning algorithm. It is the most powerful and flexible algorithm used for classification,
regression, and detection of outliers. It is used in case of high dimension spaces; where each
data item is plotted as a point in n-dimension space such that each feature value
corresponds to the value of specific coordinate. The classification is made on the basis of a
hyperplane/line as wide as possible, which distinguishes between two categories more
clearly. Basically, support vectors are the observational points of each individual, whereas the
support vector machine is the boundary that differentiates one class from another class.
Some significant terminology of SVM is given below:
Support Vectors: These are the data points or feature vectors lying closest to the hyperplane. They help in defining the separating line.
Hyperplane: It is a subspace whose dimension is one less than that of a decision plane.
It is used to separate different objects into their distinct categories. The best hyperplane
is the one with the maximum separation distance between the two classes.
Margins: The margin is defined as the (perpendicular) distance from the data point to the decision boundary. There are two types of margins: good margins and bad margins. Good margins are the ones with a large separation, while bad margins are the ones in which the separation is small.
The main goal of SVM is to find the maximum marginal hyperplane, so as to segregate the
dataset into distinct classes. It undergoes the following steps:
Firstly the SVM will produce the hyperplanes repeatedly, which will separate out the class
in the best suitable way.
Then we will look for the best option that will help in correct segregation.
Linear SVM:
The working of the SVM algorithm can be understood by using an example. Suppose we
have a dataset that has two tags (green and blue), and the dataset has two features x1 and
x2. We want a classifier that can classify the pair (x1, x2) of coordinates as either green or blue. Consider the below image: as it is a 2-D space, we can easily separate these two classes by just using a straight line. But there can be multiple lines that can separate these classes; consider the below image. The SVM algorithm helps to find the best line or decision boundary; this best boundary or region is called a hyperplane. The SVM algorithm finds the closest points of the lines from both classes. These points are called support vectors. The distance between the vectors and the hyperplane is called the margin, and the goal of SVM is to maximize this margin. The hyperplane with the maximum margin is called the optimal hyperplane.
Non-Linear SVM:
If data is linearly arranged, then we can separate it by using a straight line, but for non-linear
data, we cannot draw a single straight line. Consider the below image:
So to separate these data points, we need to add one more dimension. For linear data, we
have used two dimensions x and y, so for non-linear data, we will add a third dimension z. It
can be calculated as:
z = x² + y²
By adding the third dimension, the sample space will become as below image:
So now, SVM will divide the datasets into classes in the following way. Consider the below
image:
Since we are in 3-d Space, hence it is looking like a plane parallel to the x-axis. If we convert
it in 2d space with z=1, then it will become as:
Hence we get a circumference of radius 1 in case of non-linear data.
29. What are advantages and limitations of the Support Vector Machine
Advantages
SVMs work well even when we have little prior knowledge of the data.
Works well with even unstructured and semi structured data like text, Images and trees.
The kernel trick is the real strength of SVM. With an appropriate kernel function, we can solve any complex problem.
Unlike in neural networks, SVM is not solved for local optima.
It scales relatively well to high dimensional data. SVM is more effective in high
dimensional spaces.
SVM models generalize well in practice; the risk of over-fitting is less in SVM.
SVM works relatively well when there is a clear margin of separation between classes.
SVM is effective in cases where the number of dimensions is greater than the number of
samples.
SVM is relatively memory efficient
Disadvantages
Choosing a "good" kernel function is not easy.
SVM algorithm is not suitable for large data sets.
Long training time for large datasets.
Difficult to understand and interpret the final model, variable weights and individual
impact.
Since the final model is not easy to inspect, we cannot make small calibrations to it; hence it is tough to incorporate our business logic.
The SVM hyperparameters are the cost (C) and gamma. It is not easy to fine-tune these hyperparameters, and it is hard to visualize their impact.
SVM does not perform very well when the data set has more noise i.e. target classes are
overlapping.
In cases where the number of features for each data point exceeds the number of
training data samples, the SVM will underperform.
As the support vector classifier works by putting data points, above and below the
classifying hyperplane there is no probabilistic explanation for the classification.
Binary classification models like logistic regression and SVM do not support multi-class classification natively and require meta-strategies. Two different examples of this approach are the One-vs-Rest and One-vs-One strategies.
The One-vs-Rest strategy splits a multi-class classification into one binary classification
problem per class.
The One-vs-One strategy splits a multi-class classification into one binary classification
problem per each pair of classes.
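Both meta-strategies are available in scikit-learn as wrappers around any binary classifier. A small sketch, assuming the 3-class Iris dataset only for illustration:

```python
# One-vs-Rest and One-vs-One meta-strategies wrapping a binary SVM.
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)   # 3 classes

ovr = OneVsRestClassifier(SVC(kernel="linear")).fit(X, y)   # one binary problem per class
ovo = OneVsOneClassifier(SVC(kernel="linear")).fit(X, y)    # one binary problem per pair of classes

print("One-vs-Rest binary classifiers:", len(ovr.estimators_))   # 3
print("One-vs-One binary classifiers:", len(ovo.estimators_))    # 3 = C(3, 2)
```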
Hyper parameters of SVM are considered as Kernel, Regularization, Gamma and Margin.
Kernel: The learning of the hyperplane in linear SVM is done by transforming the problem
using some linear algebra. This is where the kernel plays role.
For the linear kernel, the prediction for a new input is calculated using the dot product between the input x and each support vector x_i as follows:
f(x) = B_0 + sum(a_i * (x · x_i))
This equation involves calculating the inner products of a new input vector x with all support vectors in the training data. The coefficients B_0 and a_i (one for each input) must be estimated from the training data by the learning algorithm.
The polynomial kernel can be written as K(x, x_i) = (1 + sum(x * x_i))^d, and the radial basis function (exponential) kernel as K(x, x_i) = exp(−gamma * sum((x − x_i)²)).
Polynomial and RBF kernels calculate the separation line in a higher dimension. This is called the kernel trick.
Regularization: The regularization parameter (often termed the C parameter in Python's sklearn library) tells the SVM optimization how much you want to avoid misclassifying each
training example. For large values of C, the optimization will choose a smaller-margin
hyperplane if that hyperplane does a better job of getting all the training points classified
correctly. Conversely, a very small value of C will cause the optimizer to look for a larger-
margin separating hyperplane, even if that hyperplane misclassifies more points. The images
below are example of two different regularization parameters. Left one has some
misclassification due to lower regularization value. Higher value leads to results like right
one.
Gamma: The gamma parameter defines how far the influence of a single training example reaches. A low gamma means points far from the plausible separation line are also considered (a smoother boundary), whereas a high gamma means only points close to the line are considered (a more complex boundary). The figures referred to here show example decision boundaries for high gamma and low gamma.
Margin: Finally, a last but very important characteristic of the SVM classifier: at its core, SVM tries to achieve a good margin.
A margin is a separation of line to the closest class points.
A good margin is one where this separation is larger for both the classes. Images below
gives to visual example of good and bad margin. A good margin allows the points to be in
their respective classes without crossing to other class.
In Support Vector Machine, there is the word vector. That means it is important to understand vectors well and how to use them.
What is a vector?
o its norm
o its direction
How to add and subtract vectors?
What is the dot product?
How to project a vector onto another?
What is the equation of the hyperplane?
How to compute the margin?
What is a vector?
If we define a point A(3,4) in ℝ², we can plot it like this.
Definition: Any point x = (x1, x2), x ≠ 0, in ℝ² specifies a vector in the plane, namely the vector starting at the origin and ending at x.
This definition means that there exists a vector between the origin and A.
If we say that the point at the origin is the point O(0,0), then the vector above is the vector OA→. We could also give it an arbitrary name such as u.
Note:
You can notice that we write vectors either with an arrow on top of them or in bold; in the rest of this text the arrow notation is used when there are two letters, like OA→, and the bold notation otherwise.
Ok so now we know that there is a vector, but we still don't know what IS a vector.
Definition: A vector is an object that has both a magnitude and a direction.
We will now look at these two concepts.
1) The magnitude
The magnitude or length of a vector x is written ∥x∥ and is called its norm.
For our vector OA→, ∥OA∥ is the length of the segment OA.
From the figure, we can easily calculate the distance OA using Pythagoras' theorem:
OA² = OB² + AB²
OA² = 3² + 4²
OA² = 25
OA = 5
∥OA∥ = 5
2) The direction
The direction is the second component of a vector.
Definition: The direction of a vector u = (u1, u2) is the vector w = (u1/∥u∥, u2/∥u∥).
Its coordinates are cos(θ) = u1/∥u∥ and cos(α) = u2/∥u∥, where θ and α are the angles the vector u makes with the horizontal and vertical axes. Hence the original definition of the vector w; that is why its coordinates are also called direction cosines.
Computing the direction vector
We will now compute the direction of the vector u from Figure 4:
cos(θ) = u1/∥u∥ = 3/5 = 0.6
cos(α) = u2/∥u∥ = 4/5 = 0.8
Since the subtraction is not commutative, we can also consider the other case:
v−u=(v1−u1,v2−u2)
The last two pictures describe the "true" vectors generated by the difference of u and v.
However, since a vector has a magnitude and a direction, we often consider that parallel
translate of a given vector (vectors with the same magnitude and direction but with a
different origin) are the same vector, just drawn in a different place in space.
So don't be surprised if you meet the following:
If you do the math, it looks wrong, because the end of the vector u−v is not in the right
point, but it is a convenient way of thinking about vectors which you'll encounter often.
The dot product
One very important notion to understand SVM is the dot product.
Definition: Geometrically, it is the product of the Euclidean magnitudes of the two vectors and the cosine of the angle between them.
Which means if we have two vectors x and y and there is an angle θ (theta) between them,
their dot product is:
x⋅y=∥x∥∥y∥cos(θ)
Why?
To understand let's look at the problem geometrically.
In the definition, they talk about cos(θ), let's see what it is.
By definition we know that in a right-angled triangle:
cos(θ)=adjacent/hypotenuse
In our example, we don't have a right-angled triangle.
However if we take a different look Figure 12 we can find two right-angled triangles formed
by each vector with the horizontal axis.
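The quantities discussed in this section (norm, direction cosines, dot product) are easy to verify numerically. Below is a small NumPy sketch using the vector A(3,4) from the example; the second vector is arbitrary and only for illustration.

```python
# Norm, direction cosines and dot product for the vectors used in this section.
import numpy as np

u = np.array([3.0, 4.0])            # the vector OA from the example
norm_u = np.linalg.norm(u)          # Pythagoras: sqrt(3^2 + 4^2) = 5
direction = u / norm_u              # (cos(theta), cos(alpha)) = (0.6, 0.8)

v = np.array([8.0, 7.0])            # an arbitrary second vector, purely illustrative
dot = np.dot(u, v)                  # equals ||u|| * ||v|| * cos(angle between u and v)
angle = np.degrees(np.arccos(dot / (norm_u * np.linalg.norm(v))))

print("||u|| =", norm_u)
print("direction of u =", direction)
print("u . v =", dot, "-> angle between u and v ~", round(float(angle), 1), "degrees")
```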
Theory questions
Logistic regression is one of the most popular Machine Learning algorithms, which
comes under the Supervised Learning technique. It is used for predicting the categorical
dependent variable using a given set of independent variables.
Logistic regression predicts the output of a categorical dependent variable. Therefore the
outcome must be a categorical or discrete value. It can be either Yes or No, 0 or 1, true
or False, etc. but instead of giving the exact value as 0 and 1, it gives the probabilistic
values which lie between 0 and 1.
Logistic Regression is very similar to Linear Regression except in how they are used: Linear Regression is used for solving regression problems, whereas Logistic Regression is used for solving classification problems.
In Logistic regression, instead of fitting a regression line, we fit an "S" shaped logistic
function, which predicts two maximum values (0 or 1).The curve from the logistic
function indicates the likelihood of something such as whether the cells are cancerous or
not, a mouse is obese or not based on its weight, etc.
Logistic Regression is a significant machine learning algorithm because it has the ability
to provide probabilities and classify new data using continuous and discrete datasets.
Logistic Regression can be used to classify the observations using different types of data
and can easily determine the most effective variables used for the classification. The
below image is showing the logistic function:
Note: Logistic regression uses the concept of predictive modeling, like regression; therefore, it is called logistic regression, but because it is used to classify samples it falls under the classification algorithms.
Starting from the straight-line equation y = b0 + b1x1 + b2x2 + ... + bnxn: in logistic regression y can be between 0 and 1 only, so we divide y by (1 − y), giving y/(1 − y), which ranges from 0 to +infinity. But we need a range between −infinity and +infinity, so we take the logarithm, and the equation becomes:
log[y/(1 − y)] = b0 + b1x1 + b2x2 + ... + bnxn
On the basis of the categories, Logistic Regression can be classified into three types:
Binomial: In binomial Logistic regression, there can be only two possible types of the
dependent variables, such as 0 or 1, Pass or Fail, etc.
Multinomial: In multinomial Logistic regression, there can be 3 or more possible
unordered types of the dependent variable, such as "cat", "dogs", or "sheep"
Ordinal: In ordinal Logistic regression, there can be 3 or more possible ordered types of
dependent variables, such as "low", "Medium", or "High".
Advantages and disadvantages of Logistic Regression:
1. Advantage: Logistic regression is easy to implement and interpret, and very efficient to train. Disadvantage: If the number of observations is less than the number of features, Logistic Regression should not be used; otherwise, it may lead to overfitting.
2. Advantage: It makes no assumptions about the distributions of classes in feature space. Disadvantage: It constructs linear decision boundaries.
3. Advantage: It easily extends to multiple classes (multinomial regression) and gives a natural probabilistic view of class predictions. Disadvantage: A limitation of Logistic Regression is the assumption of linearity between the dependent variable and the independent variables.
4. Advantage: It not only provides a measure of how appropriate a predictor is (coefficient size), but also its direction of association (positive or negative). Disadvantage: It can only be used to predict discrete functions; hence, the dependent variable of Logistic Regression is bound to a discrete set.
5. Advantage: It is very fast at classifying unknown records. Disadvantage: Non-linear problems can't be solved with logistic regression because it has a linear decision surface, and linearly separable data is rarely found in real-world scenarios.
6. Advantage: It gives good accuracy for many simple data sets and performs well when the dataset is linearly separable. Disadvantage: Logistic Regression requires little or no multicollinearity between the independent variables.
7. Advantage: Model coefficients can be interpreted as indicators of feature importance. Disadvantage: It is tough to capture complex relationships using logistic regression; more powerful and compact algorithms such as neural networks can easily outperform it.
8. Advantage: Logistic regression is less inclined to overfit, although it can overfit in high-dimensional datasets; regularization (L1 and L2) techniques may be used to avoid overfitting in these scenarios. Disadvantage: In Linear Regression the independent and dependent variables are related linearly, but Logistic Regression needs the independent variables to be linearly related to the log odds, log(p/(1−p)).
38. Differentiate between logistic regression and linear regression?
The cost for a single training example can be written as:
Cost(hθ(x), y) = −log(hθ(x)) if y = 1, and Cost(hθ(x), y) = −log(1 − hθ(x)) if y = 0
(the i indexes have been removed for clarity). In words, this is the cost the algorithm pays if it predicts a value hθ(x) while the actual label turns out to be y. By using this function we grant convexity to the function the gradient descent algorithm has to process, as discussed above. There is also a mathematical proof for that, which is outside the scope of this introductory course. In the case y = 1, the output (i.e. the cost to pay) approaches 0 as hθ(x) approaches 1. Conversely, the cost to pay grows to infinity as hθ(x) approaches 0. This is a desirable property: we want a bigger penalty as the algorithm predicts something far away from the actual value. If the label is y = 1 but the algorithm predicts hθ(x) = 0, the outcome is completely wrong. The same intuition applies when y = 0: there is a bigger penalty when the label is y = 0 but the algorithm predicts hθ(x) = 1. The above two cases can be compressed into a single function:
Cost(hθ(x), y) = −y log(hθ(x)) − (1 − y) log(1 − hθ(x))
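A short numerical sketch of this cost function is given below; the prediction values are made up simply to show how confident-but-wrong predictions are penalised.

```python
# The compressed logistic-regression cost, evaluated on illustrative predictions.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(h, y):
    # -y*log(h) - (1-y)*log(1-h): small when h agrees with y, huge when it confidently disagrees.
    return -y * np.log(h) - (1 - y) * np.log(1 - h)

print(round(float(cost(0.99, 1)), 3))   # ~0.010 : confident and correct -> tiny cost
print(round(float(cost(0.01, 1)), 3))   # ~4.605 : confident and wrong  -> large cost

# Average cost J(theta) over a small batch, with h = sigmoid(theta^T x) for each example.
h = sigmoid(np.array([2.0, -1.5, 0.3]))
y = np.array([1, 0, 1])
print("J(theta) =", round(float(np.mean(cost(h, y))), 3))
```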
Theory questions
K-Nearest Neighbour is one of the simplest Machine Learning algorithms based on
Supervised Learning technique.
The K-NN algorithm assumes similarity between the new case/data and the available cases and puts the new case into the category that is most similar to the available categories.
The K-NN algorithm stores all the available data and classifies a new data point based on similarity. This means that when new data appears, it can be easily classified into a well-suited category by using the K-NN algorithm.
K-NN algorithm can be used for Regression as well as for Classification but mostly it is
used for the Classification problems.
K-NN is a non-parametric algorithm, which means it does not make any assumption
on underlying data.
It is also called a lazy learner algorithm because it does not learn from the training set
immediately instead it stores the dataset and at the time of classification, it performs an
action on the dataset.
KNN algorithm at the training phase just stores the dataset and when it gets new data,
then it classifies that data into a category that is much similar to the new data.
Example: Suppose we have an image of a creature that looks similar to both a cat and a dog, and we want to know whether it is a cat or a dog. For this identification, we can use the KNN algorithm, as it works on a similarity measure. Our KNN model will find the features of the new data point most similar to the cat and dog images, and based on the most similar features it will put it in either the cat or the dog category.
41. Why do we need a K-NN Algorithm?
Suppose there are two categories, Category A and Category B, and we have a new data point x1; in which of these categories will this data point lie? To solve this type of problem, we need a K-NN algorithm. With the help of K-NN, we can easily identify the category or class of a particular data point. Consider the below diagram:
The K-NN working can be explained on the basis of the below algorithm:
Step-1: Select the number K of the neighbors
Step-2: Calculate the Euclidean distance of K number of neighbors
Step-3: Take the K nearest neighbors as per the calculated Euclidean distance.
Step-4: Among these k neighbors, count the number of the data points in each category.
Step-5: Assign the new data points to that category for which the number of the
neighbour is maximum.
Step-6: Our model is ready.
Suppose we have a new data point and we need to put it in the required category. Consider
the below image:
Firstly, we will choose the number of neighbors, so we will choose the k=5.
Next, we will calculate the Euclidean distance between the data points. The Euclidean
distance is the distance between two points, which we have already studied in geometry.
It can be calculated as:
By calculating the Euclidean distance we got the nearest neighbors, as three nearest
neighbors in category A and two nearest neighbors in category B. Consider the below
image:
As we can see the 3 nearest neighbors are from category A, hence this new data point
must belong to category A.
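The same steps with k = 5 can be run with scikit-learn's KNeighborsClassifier; the two-blob dataset and the query point below are illustrative stand-ins for Category A and Category B.

```python
# KNN with k=5 on an illustrative two-class dataset, mirroring the steps above.
from sklearn.datasets import make_blobs
from sklearn.neighbors import KNeighborsClassifier

# Two categories (A = 0, B = 1) of points in the plane.
X, y = make_blobs(n_samples=100, centers=2, random_state=3)

knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean")
knn.fit(X, y)

new_point = [[0.0, 0.0]]                         # a hypothetical new data point
distances, indices = knn.kneighbors(new_point)   # Steps 2-3: the 5 nearest neighbours by Euclidean distance
print("Classes of the 5 nearest neighbours:", y[indices[0]])
print("Majority-vote prediction:", knn.predict(new_point)[0])   # Steps 4-5
```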
43. What is the difference between KNN and K means?
K-Means Clustering is an unsupervised learning algorithm that is used to solve clustering problems in machine learning and data science. The following explains what the K-means clustering algorithm is and how it works.
K-Means Clustering is an Unsupervised Learning algorithm, which groups the unlabeled
dataset into different clusters. Here K defines the number of pre-defined clusters that need
to be created in the process, as if K=2, there will be two clusters, and for K=3, there will be
three clusters, and so on.
It is an iterative algorithm that divides the unlabeled dataset into k different clusters in such a way that each data point belongs to only one group, and points in the same group have similar properties.
It allows us to cluster the data into different groups and is a convenient way to discover the categories of groups in the unlabeled dataset on its own, without the need for any training.
It is a centroid-based algorithm, where each cluster is associated with a centroid. The main
aim of this algorithm is to minimize the sum of distances between the data point and their
corresponding clusters.
The algorithm takes the unlabeled dataset as input, divides the dataset into k-number of
clusters, and repeats the process until it does not find the best clusters. The value of k should
be predetermined in this algorithm.
The k-means clustering algorithm mainly performs two tasks:
Determines the best value for K center points or centroids by an iterative process.
Assigns each data point to its closest k-center. Those data points which are near to the
particular k-center, create a cluster.
Hence each cluster has datapoints with some commonalities, and it is away from other
clusters. The below diagram explains the working of the K-means Clustering Algorithm:
Now we will assign each data point of the scatter plot to its closest K-point or centroid.
We will compute it by applying some mathematics that we have studied to calculate the
distance between two points. So, we will draw a median between both the centroids.
Consider the below image:
From the above image, it is clear that points left side of the line is near to the K1 or blue
centroid, and points to the right of the line are close to the yellow centroid. Let's color
them as blue and yellow for clear visualization.
As we need to find the closest cluster, so we will repeat the process by choosing a new
centroid. To choose the new centroids, we will compute the center of gravity of these
centroids, and will find new centroids as below:
Next, we will reassign each data point to the new centroid. For this, we will repeat the same process of finding a median line. The median will be like the below image:
From the above image, we can see, one yellow point is on the left side of the line, and two
blue points are right to the line. So, these three points will be assigned to new centroids.
As reassignment has taken place, so we will again go to the step-4, which is finding new
centroids or K-points.
We will repeat the process by finding the center of gravity of centroids, so the new
centroids will be as shown in the below image:
As we got the new centroids so again will draw the median line and reassign the data
points. So, the image will be:
We can see in the above image; there are no dissimilar data points on either side of the
line, which means our model is formed. Consider the below image:
As our model is ready, so we can now remove the assumed centroids, and the two final
clusters will be as shown in the below image:
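The whole iterative procedure above is what scikit-learn's KMeans runs internally. A minimal sketch with k = 2 on an illustrative 2-D dataset:

```python
# K-means with k=2 on an illustrative 2-D dataset, matching the two-cluster walkthrough above.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=2, random_state=4)   # unlabeled points

kmeans = KMeans(n_clusters=2, n_init=10, random_state=4)
labels = kmeans.fit_predict(X)   # repeats the assign-points / move-centroids loop until it converges

print("Final centroids:\n", kmeans.cluster_centers_)
print("First 10 cluster assignments:", labels[:10])
```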
46. Is K nearest neighbor supervised or unsupervised?
The k-nearest neighbors (KNN) algorithm is a simple, supervised machine learning algorithm that can be used to solve both classification and regression problems. It's easy to implement and understand, but it has the major drawback of becoming significantly slower as the size of the data in use grows. KNN works by finding the distances between a query and all the examples in the data, selecting the specified number of examples (K) closest to the query, then voting for the most frequent label (in the case of classification) or averaging the labels (in the case of regression). In both classification and regression, choosing the right K for our data is done by trying several Ks and picking the one that works best.
Advantages
The algorithm is simple and easy to implement.
There‘s no need to build a model, tune several parameters, or make additional
assumptions.
The algorithm is versatile. It can be used for classification, regression, and search.
Disadvantages
The algorithm gets significantly slower as the number of examples and/or predictors/independent variables increases.
To select the K that's right for your data, we run the KNN algorithm several times with different values of K and choose the K that reduces the number of errors we encounter while maintaining the algorithm's ability to accurately make predictions when it's given data it hasn't seen before. Here are some things to keep in mind:
As we decrease the value of K to 1, our predictions become less stable. Just think for a minute: imagine K=1 and we have a query point surrounded by several reds and one green (I'm thinking about the top left corner of the colored plot above), but the green is the single nearest neighbor. Reasonably, we would think the query point is most likely red, but because K=1, KNN incorrectly predicts that the query point is green.
Inversely, as we increase the value of K, our predictions become more stable due to
majority voting / averaging, and thus, more likely to make more accurate predictions (up
to a certain point). Eventually, we begin to witness an increasing number of errors. It is at
this point we know we have pushed the value of K too far.
In cases where we are taking a majority vote (e.g. picking the mode in a classification
problem) among labels, we usually make K an odd number to have a tiebreaker.
49. How to choose the value of "K number of clusters" in K-means Clustering?
The performance of the K-means clustering algorithm depends upon highly efficient clusters
that it forms. But choosing the optimal number of clusters is a big task. There are some
different ways to find the optimal number of clusters, but here we are discussing the most
appropriate method to find the number of clusters or value of K. The method is given below:
Elbow Method
The Elbow method is one of the most popular ways to find the optimal number of clusters.
This method uses the concept of WCSS value. WCSS stands for Within Cluster Sum of
Squares, which defines the total variations within a cluster. The formula to calculate the
value of WCSS (for 3 clusters) is given below:
WCSS = ∑_{Pi in Cluster1} distance(Pi, C1)² + ∑_{Pi in Cluster2} distance(Pi, C2)² + ∑_{Pi in Cluster3} distance(Pi, C3)²
In the above formula of WCSS,
∑_{Pi in Cluster1} distance(Pi, C1)² is the sum of the squares of the distances between each data point and its centroid within Cluster 1, and the same holds for the other two terms.
To measure the distance between data points and centroid, we can use any method such as
Euclidean distance or Manhattan distance.
To find the optimal value of clusters, the elbow method follows the below steps:
It executes the K-means clustering on a given dataset for different K values (ranges from
1-10).
For each value of K, calculates the WCSS value.
Plots a curve between calculated WCSS values and the number of clusters K.
The sharp point of bend, where the plot looks like an arm, is considered the best value of K.
Since the graph shows the sharp bend, which looks like an elbow, hence it is known as the
elbow method. The graph for the elbow method looks like the below image:
Note: We can choose the number of clusters equal to the number of data points. If we do so, the value of WCSS becomes zero, and that will be the endpoint of the plot.
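A sketch of the elbow method is given below; it fits K-means for K = 1 to 10 on an illustrative dataset and plots the WCSS (exposed by scikit-learn as inertia_) so the bend can be read off the curve.

```python
# Elbow method: plot WCSS (inertia_) against K and look for the bend.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=5)

wcss = []
for k in range(1, 11):                  # K ranging from 1 to 10, as described above
    km = KMeans(n_clusters=k, n_init=10, random_state=5).fit(X)
    wcss.append(km.inertia_)            # inertia_ is the within-cluster sum of squares (WCSS)

plt.plot(range(1, 11), wcss, marker="o")
plt.xlabel("Number of clusters K")
plt.ylabel("WCSS")
plt.title("Elbow method")
plt.show()
```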
*********************