
Module 3

Introduction to Machine Learning - 4CS1201
Syllabus - MODULE 3
Supervised Learning
• A type of machine learning in which machines are trained using well-"labelled" training data and, on the basis of that data, predict the output.
• Labelled data means some input data is already tagged with the correct output.
• Supervised learning is the process of providing input data as well as the correct output data to the machine learning model.
• The aim of a supervised learning algorithm is to find a mapping function that maps the input variable (x) to the output variable (y). (A minimal sketch follows this list.)
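As a sketch of the fit-then-predict workflow, the example below trains a model on labelled pairs and predicts the label of an unseen input. It assumes scikit-learn is installed; the toy data (hours studied vs. pass/fail) is hypothetical.

# Minimal supervised learning sketch: learn a mapping f(x) ≈ y from labelled data.
from sklearn.linear_model import LogisticRegression

# Toy labelled data (hypothetical): x = hours studied, y = pass (1) / fail (0).
X = [[1], [2], [3], [4], [5], [6]]
y = [0, 0, 0, 1, 1, 1]

model = LogisticRegression()
model.fit(X, y)                  # learn the mapping from x to y
print(model.predict([[3.5]]))    # predict the label for an unseen input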
How Supervised Learning Works

(Figure: Supervised Machine Learning - Javatpoint)


Types of Supervised Learning
1. Classification
• Used when the output variable is categorical, i.e. the data falls into discrete classes such as Yes/No, Male/Female, True/False, etc.
• Algorithms:
• Random Forest
• Decision Trees
• Logistic Regression
• Support Vector Machines
Types of Supervised Learning
2. Regression
• Used when there is a relationship between the input variable and the output variable and the output is a continuous value.
• It is used for the prediction of continuous variables, such as weather forecasting, market trends, etc.
• Algorithms:
• Linear Regression
• Regression Trees
• Non-Linear Regression
• Bayesian Linear Regression
• Polynomial Regression
Types of Supervised Learning - Difference

Regression:
• Output - continuous or real value
• A best-fit line is found
• Examples - weather prediction, house price prediction
• Problems - linear and non-linear

Classification:
• Output - discrete value
• A classification boundary is found
• Examples - spam emails, identification of cancerous cells
• Problems - binary or multi-class
Advantages and Disadvantages of Supervised Learning
1. K-Nearest Neighbor
• One of the simplest supervised machine learning algorithms.
• Assumes similarity between the new case/data and the available cases, and puts the new case into the category most similar to the available categories.
• Used for regression as well as classification, but mostly used for classification problems.
• A non-parametric algorithm, which means it makes no assumptions about the underlying data.
• A lazy learner algorithm: it does not learn from the training set immediately; instead it stores the dataset and, at the time of classification, performs an action on the dataset.
1. K-Nearest Neighbor
• Distance Metric Used in K-NN
Euclidean Distance
This is the Cartesian distance between two points in the plane/hyperplane. Euclidean distance can also be visualized as the length of the straight line joining the two points under consideration. This metric measures the net displacement between two states of an object.

Manhattan Distance
The Manhattan distance metric is generally used when we are interested in the total distance travelled by the object rather than its displacement. It is calculated by summing the absolute differences between the coordinates of the points in n dimensions. (Both metrics are sketched in code below.)
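A small sketch of the two metrics just described, assuming NumPy is available:

import numpy as np

def euclidean_distance(p, q):
    # Straight-line (Cartesian) distance: sqrt of the summed squared differences.
    return np.sqrt(np.sum((np.asarray(p) - np.asarray(q)) ** 2))

def manhattan_distance(p, q):
    # Total distance travelled along the axes: summed absolute differences.
    return np.sum(np.abs(np.asarray(p) - np.asarray(q)))

print(euclidean_distance([0, 0], [3, 4]))   # 5.0
print(manhattan_distance([0, 0], [3, 4]))   # 7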
1. K-Nearest Neighbor
• Distance Metric Used in K-NN
Understanding data points and Euclidean distance
Let X be the training dataset with n data points, where each data point is represented by a d-dimensional feature vector, and let Y be the corresponding labels or values for each data point in X. Given a new data point x, the algorithm calculates the distance between x and each data point xi in X using the Euclidean distance:

d(x, xi) = sqrt( (x1 - xi1)^2 + (x2 - xi2)^2 + ... + (xd - xid)^2 )
1. K-Nearest Neighbor
• How to select the value of K in the K-NN Algorithm?

• If the input data has more outliers or noise, a higher value of k tends to work better.
• It is recommended to choose an odd value of k to avoid ties in binary classification. (A small selection sketch follows.)
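One common way to pick k in practice is to compare cross-validated accuracy over a range of odd values. A hedged sketch, assuming scikit-learn and its built-in iris dataset:

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Evaluate odd k values and keep the one with the best mean CV accuracy.
scores = {}
for k in range(1, 20, 2):
    knn = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(knn, X, y, cv=5).mean()

best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])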
1. K-Nearest Neighbor - Solved Examples
1. Solved Numerical Example of KNN Classifier to classify New Instance IRIS Example by Mahesh Huddar - YouTube
2. Solved Example KNN Classifier to classify New Instance Height and Weight Example by Mahesh Huddar (youtube.com)
3. K nearest Neighbor Learning Algorithm Lazy Learner Solved Example Dr. Mahesh Huddar (youtube.com)
4. Solved Example K Nearest Neighbors Algorithm Weighted KNN to classify New Instance by Mahesh Huddar (youtube.com)
5. KNN Algorithm In Machine Learning | KNN Algorithm Using Python | K Nearest Neighbor | Simplilearn (youtube.com)
Support Vector Machines (SVM)
• Used for classification as well as regression problems; however, it is primarily used for classification problems in machine learning.
• The goal of the SVM algorithm is to create the best line or decision boundary that can segregate n-dimensional space into classes, so that new data points can easily be put in the correct category in the future. This best decision boundary is called a hyperplane.
• SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme cases are called support vectors, and hence the algorithm is termed a Support Vector Machine.
Support Vector Machines (SVM)
• Hyperplane: There can be multiple lines/decision boundaries to segregate the classes in n-dimensional space, but we need to find the best decision boundary that helps to classify the data points. This best boundary is known as the hyperplane of the SVM.
• The dimension of the hyperplane depends on the number of features in the dataset: with 2 features the hyperplane is a straight line, and with 3 features it is a 2-dimensional plane.
• We always create the hyperplane with maximum margin, i.e. the maximum distance between the hyperplane and the nearest data points.
Support Vector Machines (SVM)
• Support Vectors: The data points or vectors that are closest to the hyperplane and that affect its position are termed support vectors. Since these vectors support the hyperplane, they are called support vectors.
Types of SVM
• 2 types:
• Linear SVM: used for linearly separable data, i.e. if a dataset can be classified into two classes using a single straight line, the data is termed linearly separable and the classifier used is called a Linear SVM classifier.
• Non-linear SVM: used for non-linearly separable data, i.e. if a dataset cannot be classified using a straight line, the data is termed non-linear and the classifier used is called a Non-linear SVM classifier.
Support Vector Machines (SVM)
1. Linear SVM
The equation of a hyperplane is w . x + b = 0, where w is a vector normal to the hyperplane and b is an offset. To classify a point as negative or positive we need to define a decision rule. Since the projection of any vector onto another vector is a dot product, we can define the decision rule as:

if w . x + b >= 0, classify as positive; otherwise classify as negative.

(See the sketch below.)
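A minimal sketch of this decision rule, assuming scikit-learn; the two toy clusters are hypothetical. After fitting a linear SVM, the sign of w . x + b classifies a new point:

import numpy as np
from sklearn.svm import SVC

# Two linearly separable clusters (hypothetical data).
X = np.array([[1, 1], [2, 1], [1, 2], [5, 5], [6, 5], [5, 6]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear")
clf.fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]
x_new = np.array([4, 4])
print("w.x + b =", w @ x_new + b)        # the sign gives the predicted class
print("support vectors:\n", clf.support_vectors_)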
Support Vector Machines (SVM)
2. Non-Linear SVM
The classes cannot be separated by a straight line, so to separate these data points we need to add one more dimension. For linear data we have used two dimensions, x and y, so for non-linear data we add a third dimension z, which can be calculated as:

z = x^2 + y^2

(Sketched below.)
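A sketch of this idea, assuming scikit-learn and NumPy; the ring-shaped toy data is hypothetical. Appending z = x^2 + y^2 makes circularly arranged classes linearly separable; in practice a non-linear kernel (e.g. RBF) achieves the same effect implicitly:

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Inner circle (class 0) and outer ring (class 1): not linearly separable in 2-D.
angles = rng.uniform(0, 2 * np.pi, 100)
r = np.concatenate([rng.uniform(0, 1, 50), rng.uniform(2, 3, 50)])
X = np.column_stack([r * np.cos(angles), r * np.sin(angles)])
y = np.array([0] * 50 + [1] * 50)

# Explicit feature map: append z = x^2 + y^2, then a linear SVM suffices.
Z = np.column_stack([X, X[:, 0] ** 2 + X[:, 1] ** 2])
print(SVC(kernel="linear").fit(Z, y).score(Z, y))   # ~1.0

# The same idea without the explicit map: a non-linear (RBF) kernel.
print(SVC(kernel="rbf").fit(X, y).score(X, y))      # ~1.0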
Support Vector Machines (SVM) - Solved Examples
1. Solved Support Vector Machine | Linear SVM Example by Mahesh Huddar (youtube.com)
2. How to draw a hyper plane in Support Vector Machine | Linear SVM - Solved Example by Mahesh Huddar (youtube.com)
3. Support Vector Machine (SVM) Algorithm - Javatpoint
Decision Tree Algorithm
• A supervised learning technique that can be used for both classification and regression problems, but is mostly preferred for solving classification problems.
• It is a tree-structured classifier, where internal nodes represent the features of a dataset, branches represent the decision rules, and each leaf node represents the outcome.
• In a decision tree there are two types of nodes: decision nodes and leaf nodes. Decision nodes are used to make decisions and have multiple branches, whereas leaf nodes are the outputs of those decisions and do not contain any further branches.
Decision Tree Algorithm
• The decisions or tests are performed on the basis of features of the given dataset.
• It is a graphical representation for getting all the possible solutions to a problem/decision based on given conditions.
• It is called a decision tree because, similar to a tree, it starts with a root node, which expands into further branches and constructs a tree-like structure.
• In order to build a tree, we use the CART algorithm, which stands for Classification and Regression Tree algorithm.
• A decision tree simply asks a question and, based on the answer (Yes/No), further splits the tree into subtrees.
Decision Tree Algorithm
• Need
• Terminologies
• Working
Decision Tree Algorithm
• Attribute Selection Measures (ASM)
• The main issue while implementing a decision tree is how to select the best attribute for the root node and for sub-nodes. To solve such problems, there is a technique called the Attribute Selection Measure (ASM).
• 2 techniques: Information Gain and Gini Index (the Gini index is used in the example below)
Example of Decision Tree
CART Decision Tree Example on the same dataset
Gini index:
Gini = 1 - Σ (Pi)^2, for i = 1 to the number of classes

Outlook    Yes  No  Number of instances
Sunny       2    3   5
Overcast    4    0   4
Rain        3    2   5

Gini(Outlook=Sunny) = 1 - (2/5)^2 - (3/5)^2 = 1 - 0.16 - 0.36 = 0.48
Gini(Outlook=Overcast) = 1 - (4/4)^2 - (0/4)^2 = 0
Gini(Outlook=Rain) = 1 - (3/5)^2 - (2/5)^2 = 1 - 0.36 - 0.16 = 0.48

Then the weighted sum of Gini indexes for the Outlook feature:
Gini(Outlook) = (5/14) x 0.48 + (4/14) x 0 + (5/14) x 0.48 = 0.171 + 0 + 0.171 = 0.342
Similarly:
Gini(Temp) = (4/14) x 0.5 + (4/14) x 0.375 + (6/14) x 0.445 = 0.142 + 0.107 + 0.190 = 0.439
Gini(Humidity) = (7/14) x 0.489 + (7/14) x 0.244 = 0.367
Gini(Wind) = (8/14) x 0.375 + (6/14) x 0.5 = 0.428

Gini(Outlook) is therefore the lowest and thus becomes the root node. The process is then repeated for the subsets: we apply the same principles to the sub-datasets.
Final form of the decision tree built by the CART algorithm (figure). (A sketch of the weighted Gini computation follows.)
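A short sketch reproducing the weighted Gini computation above for the Outlook feature; the per-value (yes, no) counts are taken from the table:

def gini(yes, no):
    # Gini = 1 - Σ (Pi)^2 over the two classes.
    total = yes + no
    return 1 - (yes / total) ** 2 - (no / total) ** 2

# (yes, no) counts per Outlook value, from the table above.
counts = {"Sunny": (2, 3), "Overcast": (4, 0), "Rain": (3, 2)}
n = sum(yes + no for yes, no in counts.values())   # 14 instances in total

# Weighted sum of the per-value Gini indexes.
weighted = sum((yes + no) / n * gini(yes, no) for yes, no in counts.values())
print(round(weighted, 3))   # 0.343, i.e. the slide's Gini(Outlook) ≈ 0.342 (rounding)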
Decision Tree Algorithm - Solved Examples
1. Decision Tree Solved Play Tennis Example Big Data Analytics CART Algorithm by Mahesh Huddar (youtube.com)
2. Decision Tree Solved Numerical Example Big Data Analytics CART Algorithm by Mahesh Huddar (youtube.com)
3. Decision Tree Solved Numerical Example Big Data Analytics ML CART Algorithm by Mahesh Huddar (youtube.com)
Random Forest Classifier
• Used for both classification and regression problems in ML.
• Based on the concept of ensemble learning, which is the process of combining multiple classifiers to solve a complex problem and to improve the performance of the model.
• "Random Forest is a classifier that contains a number of decision trees on various subsets of the given dataset and takes the average to improve the predictive accuracy of that dataset."
• Instead of relying on one decision tree, the random forest takes the prediction from each tree and, based on the majority vote of the predictions, predicts the final output.
• A greater number of trees in the forest leads to higher accuracy and helps prevent the problem of overfitting. (A minimal sketch follows this list.)
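A hedged sketch of the majority-vote idea, assuming scikit-learn and its built-in iris dataset:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# n_estimators is the number of trees in the forest; each tree is trained on
# a random bootstrap subset, and their votes are combined for the prediction.
forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_train, y_train)
print(forest.score(X_test, y_test))   # test-set accuracy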
Random Forest Classifier
• Need
• Working
Evaluating classification model performance
Confusion Matrix
• A confusion matrix is a table that is often used to describe the performance of a classification model on a set of test data for which the true values are known.
Evaluating classification model performance
Confusion Matrix - Need
Calculations using Confusion Matrix (a short sketch follows)
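A sketch of the calculations typically derived from a confusion matrix (accuracy, precision, recall, F1 score), assuming scikit-learn; the y_true/y_pred labels are hypothetical:

from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score, f1_score)

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]   # known true labels (hypothetical)
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]   # model predictions (hypothetical)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")

print("accuracy :", accuracy_score(y_true, y_pred))    # (TP+TN)/total
print("precision:", precision_score(y_true, y_pred))   # TP/(TP+FP)
print("recall   :", recall_score(y_true, y_pred))      # TP/(TP+FN)
print("f1 score :", f1_score(y_true, y_pred))          # harmonic mean of P and R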
Evaluating classification model performance
ROC and AUC
• The Receiver Operating Characteristic (ROC) curve is an evaluation metric for binary classification problems. It is a probability curve that plots the TPR against the FPR at various threshold values; in other words, it shows the performance of a classification model at all classification thresholds.
• The Area Under the Curve (AUC) measures the ability of a binary classifier to distinguish between classes and is used as a summary of the ROC curve.
• The higher the AUC, the better the model's performance at distinguishing between the positive and negative classes.
Evaluating classification model performance
ROC and AUC
• Sensitivity / True Positive Rate / Recall: TPR = TP / (TP + FN)
Sensitivity tells us what proportion of the positive class got correctly classified.

• False Negative Rate: FNR = FN / (TP + FN)
The FNR tells us what proportion of the positive class got incorrectly classified by the classifier.

A higher TPR and a lower FNR are desirable, since we want to classify the positive class correctly.

• Specificity / True Negative Rate: TNR = TN / (TN + FP)
Specificity tells us what proportion of the negative class got correctly classified.

• False Positive Rate: FPR = FP / (TN + FP)
The FPR tells us what proportion of the negative class got incorrectly classified by the classifier.

A higher TNR and a lower FPR are desirable, since we want to classify the negative class correctly.
Let's understand the ROC curve with an example. Suppose we need to distinguish between patients with a particular disease (say, phobic) and those who do not have the disease (non-phobic). The patient populations for these two states form two overlapping normal distributions. The curves overlap, as they almost always will in real life: some individuals with the disease have test scores or other characteristics similar to those without it.
Streiner (2007) examines the ROC with the problem of classifying patients into Phobic (disease-state positive category) and Non-Phobic (disease-state negative category) using a 10-point test score. Table 1 shows the test scores from 1-10 and a frequency table of the test results categorized by label.
To predict Phobic and Non-Phobic cases, we need to define a cutoff score. Let's do it for different cutoff scores.
Plotting the ROC Curve
• The lower left-hand corner of the curve shows the beginning of the classification process: no classifications are identified initially.
• Initially the cutoff is 9/10 (a cutoff of 10+ represents Phobic, and less than or equal to 9 is Non-Phobic; see Table 1). In this case the classification criterion is very strict and only strong TP cases are classified, but only a few examples in the sample meet this criterion (21 instances with score 10).
• Once we make the cutoff a little more relaxed, for example 8/9, we begin to see some FPs as well.
• The cutoff 7/8 is the one closest to the upper left corner, and hence it minimizes the overall classification error. At this point, TPR ≈ 0.8 and FPR ≈ 0.1. (A plotting sketch follows.)
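A sketch of plotting a ROC curve and computing AUC from predicted probabilities, assuming scikit-learn and matplotlib; the synthetic dataset is hypothetical:

import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = LogisticRegression().fit(X_tr, y_tr)
scores = model.predict_proba(X_te)[:, 1]          # P(class = 1)

fpr, tpr, thresholds = roc_curve(y_te, scores)    # TPR vs FPR per threshold
print("AUC =", roc_auc_score(y_te, scores))

plt.plot(fpr, tpr)
plt.plot([0, 1], [0, 1], linestyle="--")          # chance line
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.show()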
Evaluating classification model performance - Solved Example
• Confusion Matrix Solved Example Accuracy Precision Recall F1 Score Prevalence by Mahesh Huddar - YouTube
Thank you
