
ECE3047 - Machine Learning Fundamentals
Prepared By
Dr. Rohith G
Assistant Professor (Senior)
School of Electronics Engineering (SENSE), VIT-Chennai
Under the guidance of, and with materials mentored by,
Dr. Sathiya Narayanan S
Assistant Professor (Senior)
School of Electronics Engineering (SENSE), VIT-Chennai
• Module 1: Introduction to Machine Learning
• Module 2: Data Preprocessing
• Module 3: Regression
• Module 4: Classification
• Module 5: Clustering
• Module 6: Optimization
• Module 7: Reinforcement Learning
Topics in Module-4

Classification
• Introduction – Hyperplane – Radial Basis Function (RBF) – Support Vector Machine (SVM) – Support Vector Regression (SVR) – Random Forest (RF) – Case Study.

• Bayes' theorem – Parameter Estimation – Distribution – Classifier – Networks – K-Nearest Neighbors – Case Study.
What is Classification?

Classification segregates vast quantities of data into discrete values, i.e. distinct categories such as 0/1, True/False, or a pre-defined output label class.

• The Classification algorithm is a Supervised Learning technique that is used to identify the category of new observations on the basis of training data.
• In classification, a program learns from the given dataset or observations and then classifies new observations into a number of classes or groups.


Classification Vs. Regression

• Regression algorithms predict a continuous value. In some cases, the predicted value can be used to identify the linear relationship between the attributes.
• Classification algorithms predict the target class (e.g., Yes/No). If the trained model predicts one of two target classes, it is known as binary classification.
Types of Classification?
• Types of Classifiers: The algorithm which implements the classification on a dataset is known as a classifier. There are two types of classification:
• Binary Classifier: If the classification problem has only two possible outcomes, then it is called a Binary Classifier. Examples: YES or NO, MALE or FEMALE, SPAM or NOT SPAM, CAT or DOG, etc.
• Multi-class Classifier: If a classification problem has more than two outcomes, then it is called a Multi-class Classifier. Examples: classification of types of crops, classification of types of music.
• Types of learners: In classification problems, there are two types of learners:
• Lazy Learners: A lazy learner first stores the training dataset and waits until it receives the test dataset. In the lazy learner's case, classification is done on the basis of the most related data stored in the training dataset. It takes less time in training but more time for predictions.
• Example: K-NN algorithm, case-based reasoning
• Eager Learners: Eager learners develop a classification model based on a training dataset before receiving a test dataset. Opposite to lazy learners, an eager learner takes more time in learning and less time in prediction.
• Example: Decision Trees, Naïve Bayes, ANN.
Types of Classification?
• Types of classification:
• Supervised: The set of possible classes is known in advance.
• Unsupervised: Set of possible classes is not known. After classification we can try to assign a
name to that class. Unsupervised classification is called clustering.
• Types of Classification algorithms: Classification algorithms can be further divided into mainly two categories:
• Linear Models
• Logistic Regression
• Support Vector Machines
• Non-linear Models
• K-Nearest Neighbours
• Kernel SVM
• Naïve Bayes
• Decision Tree Classification
• Random Forest Classification



Evaluation of Classification model?
• Log loss or cross-entropy loss (see the sketch after this list):
• It is used for evaluating the performance of a classifier whose output is a probability value between 0 and 1.
• For a good binary classification model, the value of log loss should be near 0.
• The value of log loss increases if the predicted value deviates from the actual value.
• The lower the log loss, the higher the accuracy of the model.
• Confusion Matrix:
• The confusion matrix provides a matrix/table as output and describes the performance of the model.
• It is also known as the error matrix.
• The matrix presents the prediction results in a summarized form, with the total number of correct predictions and incorrect predictions.
• AUC-ROC curve:
• ROC stands for Receiver Operating Characteristic curve and AUC stands for Area Under the Curve.
• It is a graph that shows the performance of the classification model at different thresholds.
• To visualize the performance of a classification model across thresholds, we use the AUC-ROC curve.
• The ROC curve is plotted with TPR (True Positive Rate) on the Y-axis and FPR (False Positive Rate) on the X-axis.
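A minimal sketch of these three metrics on a toy binary problem, assuming scikit-learn is available (the labels and probabilities below are made up for illustration):

```python
# Sketch of the three evaluation metrics above on a toy binary problem,
# using scikit-learn (assumed available). The data is illustrative only.
from sklearn.metrics import log_loss, confusion_matrix, roc_auc_score

y_true = [0, 0, 1, 1, 1]                 # actual class labels
y_prob = [0.1, 0.4, 0.35, 0.8, 0.9]      # predicted P(class = 1)
y_pred = [1 if p >= 0.5 else 0 for p in y_prob]  # threshold at 0.5

print("Log loss:", log_loss(y_true, y_prob))        # near 0 is better
print("Confusion matrix:\n", confusion_matrix(y_true, y_pred))
print("ROC AUC:", roc_auc_score(y_true, y_prob))    # 1.0 is perfect
```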
An Example of Bayes Theorem
• Given:
• A doctor knows that meningitis causes stiff neck 50% of the time
• Prior probability of any patient having meningitis is 1/50,000
• Prior probability of any patient having stiff neck is 1/20

• If a patient has stiff neck, what’s the probability he/she has meningitis?

P(M | S) = P(S | M) · P(M) / P(S) = (0.5 × 1/50000) / (1/20) = 0.0002



Naïve Bayes Classification model
•Naïve Bayes Classifier is one of the simplest and most effective classification algorithms; it helps in building fast machine learning models that can make quick predictions.
•It is a probabilistic classifier, which means it predicts on the basis of the probability of an object.
•Naïve: It is called naïve because it assumes that the occurrence of a certain feature is independent of the occurrence of other features. For example, if a fruit is identified on the basis of color, shape, and taste, then a red, spherical, and sweet fruit is recognized as an apple. Each feature individually contributes to identifying it as an apple, without depending on the others.
•Bayes: It is called Bayes because it depends on the principle of Bayes' theorem.
•Bayes' theorem is also known as Bayes' rule or Bayes' law, and is used to determine the probability of a hypothesis with prior knowledge. It depends on conditional probability.
•The formula for Bayes' theorem is given as:
P(A | B) = P(B | A) · P(A) / P(B)
•Where,
•P(A) is the Prior Probability: probability of the hypothesis before observing the evidence.
•P(B) is the Marginal Probability: probability of the evidence.
•P(A | B) is the Posterior Probability: probability of hypothesis A given the observed event B.
•P(B | A) is the Likelihood: probability of the evidence given that hypothesis A is true.
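A minimal sketch of Naive Bayes in practice, assuming scikit-learn is available; the dataset here is synthetic, for illustration only:

```python
# Sketch of a Naive Bayes classifier with scikit-learn (assumed available).
# The dataset is synthetic and illustrative, not from the slides.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = GaussianNB()          # assumes features are conditionally independent
model.fit(X_train, y_train)
print("Accuracy:", model.score(X_test, y_test))
print("Class probabilities for one sample:", model.predict_proba(X_test[:1]))
```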
Naïve Bayes Classification model



Naïve Bayes Classification model Solved Example#1
If the weather is sunny, should the player play or not?



Naïve Bayes Classification model Solved Example#1
Step-1 Frequency table for the Weather Conditions:

Step-2 Likelihood table for the weather conditions:



Naïve Bayes Classification model Solved Example#1
Step-3 Applying Bayes Theorem

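The frequency and likelihood tables on these slides are images, so the sketch below rebuilds Steps 1-3 with illustrative counts; the numbers are assumptions, not the slide's actual table:

```python
# Hedged sketch of Step-1 to Step-3 above. The counts are hypothetical
# stand-ins for the slide's (image-only) frequency table.
weather_counts = {  # weather -> {"Yes": plays, "No": does not play}
    "Sunny":    {"Yes": 3, "No": 2},
    "Overcast": {"Yes": 4, "No": 0},
    "Rainy":    {"Yes": 2, "No": 3},
}

total_yes = sum(c["Yes"] for c in weather_counts.values())
total_no  = sum(c["No"]  for c in weather_counts.values())
total     = total_yes + total_no

# Bayes' theorem: P(Yes | Sunny) = P(Sunny | Yes) * P(Yes) / P(Sunny)
p_sunny_given_yes = weather_counts["Sunny"]["Yes"] / total_yes
p_yes             = total_yes / total
p_sunny           = sum(weather_counts["Sunny"].values()) / total

p_yes_given_sunny = p_sunny_given_yes * p_yes / p_sunny
print("P(Yes | Sunny) =", round(p_yes_given_sunny, 3))
print("Play" if p_yes_given_sunny > 0.5 else "Don't play")
```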


Naïve Bayes Classification model Solved Example#2
Attributes are Color, Type, and Origin, and the target, Stolen, can be either Yes or No. We want to classify a Red Domestic SUV. Note that there is no example of a Red Domestic SUV in our data set.



Naïve Bayes Classification model Solved Example#2

Step-1 There are six categories for computing the classification task in the training samples



Naïve Bayes Classification model Solved Example#2

Step-2 The probability estimates are given



Naïve Bayes Classification model Solved Example#2
Step-3 Likelihood: Looking at P(Red | Yes), we have 5 cases where vj = Yes, and in 3 of those cases ai = Red. So for P(Red | Yes), n = 5 and nc = 3. Note that all attributes are binary (two possible values). We are assuming no other information, so p = 1/(number of attribute values) = 0.5 for all of our attributes. Our m value is arbitrary (we will use m = 3) but consistent for all attributes.
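The formula image is missing from this slide; the quantities n, nc, m, and p above plug into the standard m-estimate of probability:

P(ai | vj) = (nc + m · p) / (n + m)

so, for example, P(Red | Yes) = (3 + 3 × 0.5) / (5 + 3) = 4.5/8 ≈ 0.56.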

Step-4



Naïve Bayes Classification Model
Advantages of Naïve Bayes Classifier:
• Naïve Bayes is one of the fastest and easiest ML algorithms to predict the class of a dataset.
• It can be used for Binary as well as Multi-class Classifications.
• It performs well in Multi-class predictions as compared to the other Algorithms.
• It is the most popular choice for text classification problems.
Disadvantages of Naïve Bayes Classifier:
• Naive Bayes assumes that all features are independent or unrelated, so it cannot
learn the relationship between features.
Applications of Naïve Bayes Classifier:
• It is used for Credit Scoring.
• It is used in medical data classification.
• It can be used in real-time predictions because Naïve Bayes Classifier is an
eager learner.
• It is used in text classification, such as spam filtering and sentiment analysis.



Summary of Naïve Bayes Classification Model
• Naïve Bayes algorithm is a supervised learning algorithm, which is based
on Bayes theorem and used for solving classification problems.
• Bayes’ rule can be turned into a classifier
• Maximum A Posteriori (MAP) hypothesis estimation incorporates prior
knowledge; Max Likelihood (ML) doesn’t
• Naive Bayes Classifier is a simple but effective Bayesian classifier for
vector data (i.e. data with several attributes) that assumes that attributes
are independent given the class.
• Bayesian classification is a generative approach to classification
• It is a probabilistic classifier, which means it predicts on the basis of the
probability of an object.
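The independence assumption can be written compactly; a standard form of the Naive Bayes decision rule (not shown on the slide) is

P(C | x1, …, xn) ∝ P(C) · P(x1 | C) · P(x2 | C) ⋯ P(xn | C)

i.e. the posterior of class C factorizes into the class prior times one likelihood term per attribute, and the class with the largest product is predicted (the MAP hypothesis mentioned above).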



K-Nearest Neighbor(KNN) Algorithm
• K-Nearest Neighbour is a Supervised Learning technique used for Regression as well as for Classification.
• The K-NN algorithm assumes similarity between the new case/data and the available cases and puts the new case into the category that is most similar to the available categories.
• K-NN is a non-parametric algorithm, which means it does not make any
assumption on underlying data.

• The KNN model will find the features of the new data most similar to the cat and dog images and, based on the most similar features, put it in either the cat or the dog category.



Need for K-Nearest Neighbor(KNN) Algorithm?
• The KNN algorithm at the training phase just stores the dataset, and when it gets new data, it classifies that data into the category most similar to the new data.
• It is also called a lazy learner algorithm because it does not learn from the
training set immediately instead it stores the dataset and at the time of
classification, it performs an action on the dataset.
• KNN requires a feature-space representation of the instances in the dataset and a measure of similarity between instances.
• The prediction is based
on finding out what
class the nearest
instance belongs to.



K-Nearest Neighbor(KNN) Algorithm



K-Nearest Neighbor(KNN) Algorithm



K-Nearest Neighbor(KNN) Algorithm
Pseudocode:
1.Load the data
2.Choose K value
3.For each data point in the data:
1. Find the Euclidean distance to
all training data samples
2. Store the distances on an
ordered list and sort it
3. Choose the top K entries from
the sorted list
4. Label the test point based on
the majority of classes present
in the selected points
4.End
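A runnable Python sketch of this pseudocode; the four-point dataset is a made-up toy example:

```python
# Minimal, runnable sketch of the KNN pseudocode above.
# The tiny dataset is illustrative, not from the slides.
import math
from collections import Counter

def knn_predict(train, query, k):
    """train: list of (features, label) pairs; query: feature tuple."""
    # Steps 1-2: Euclidean distance from the query to every training sample
    distances = [(math.dist(features, query), label)
                 for features, label in train]
    distances.sort()                     # step 2: sort the ordered list
    top_k = distances[:k]                # step 3: choose the top K entries
    votes = Counter(label for _, label in top_k)
    return votes.most_common(1)[0][0]    # step 4: majority class wins

train = [((1, 1), "A"), ((2, 1), "A"), ((8, 9), "B"), ((9, 8), "B")]
print(knn_predict(train, query=(2, 2), k=3))  # -> "A"
```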



K-Nearest Neighbor(KNN) Algorithm-An Illustration



K-Nearest Neighbor(KNN) Algorithm-An Example

[Figure: (a) 1-nearest neighbor, (b) 2-nearest neighbor, (c) 3-nearest neighbor]

The k-nearest neighbors of a record x are the data points that have the k smallest distances to x.
Voronoi Diagram
Decision surface formed by the training examples

• Each line segment is equidistant between points in opposite classes.


• The more points, the more complex the boundaries.
Voronoi Diagram
The boundary is always the perpendicular bisector of the line between two
points (Voronoi tessellation)



K-Nearest Neighbor(KNN) Algorithm-An Example



K-Nearest Neighbor(KNN) Algorithm
• Choosing the value of k:
• If k is too small, sensitive to noise
points
• If k is too large, neighborhood may
include points from other classes
• Higher values of k provide smoothing
that reduces the risk of overfitting
due to noise in the training data
• The value of k can be chosen based on error-rate measures (see the sketch below)
• To avoid over-smoothing, we should not choose k = n, where n is the total number of tuples in the training data set
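A sketch of choosing k from cross-validated error rates, assuming scikit-learn and its bundled Iris dataset are available:

```python
# Choosing k via a cross-validated error-rate measure, using scikit-learn
# (assumed available) on a bundled toy dataset.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
for k in (1, 3, 5, 7, 9, 15):
    acc = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
    print(f"k={k:2d}  cross-validated accuracy={acc:.3f}")
# Pick the k with the best accuracy (lowest error rate), avoiding extremes.
```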
K-Nearest Neighbor(KNN) Algorithm-Solved Example

• Let us consider the data given in the table above, consisting of 10 entries.


K-Nearest Neighbor(KNN) Algorithm-Solved Example
The distance between the new point and each training point is
calculated.



K-Nearest Neighbor(KNN) Algorithm-Solved Example
The distance between the new point and each training point is calculated using either of the standard distance forms, typically the Euclidean distance d(x, y) = √(Σi (xi − yi)²) or the Manhattan distance d(x, y) = Σi |xi − yi|.


K-Nearest Neighbor(KNN) Algorithm-Solved Example
The closest k data points are selected (based on the distance). In this
example, points 1, 5, 6 will be selected if the value of k is 3.



K-Nearest Neighbor(KNN) Algorithm-Solved Example
• Select the k value. This determines the
number of neighbors we look at when
we assign a value to any new
observation.
• In our example, for a value k = 3, the
closest points are ID1, ID5 and ID6.



K-Nearest Neighbor(KNN) Algorithm-Solved Example
• In our example, for a value k = 5, the
closest points are ID1, ID4, ID5, ID6
and ID10.



K-Nearest Neighbor(KNN) Algorithm
Advantages
• It is simple to implement.
• It is robust to noisy training data.
• It can be more effective if the
training data is large.
Disadvantages:
• We always need to determine the value of K, which may sometimes be complex.
• The computation cost is high
because of calculating the distance
between the data points for all the
training samples.



Support Vector Machine (SVM)
• Support Vector Machine or SVM is one of the most popular Supervised Learning
algorithms, which is used for Classification as well as Regression problems.
• SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme cases are called support vectors, and hence the algorithm is termed Support Vector Machine.



Support Vector Machine (SVM)
• The goal of the SVM algorithm is to create the best line or decision boundary that
can segregate n-dimensional space into classes so that we can easily put the new
data point in the correct category in the future. This best decision boundary is
called a hyperplane.



Terminologies in Support Vector Machine (SVM)
• Hyperplane: There can be multiple
lines/decision boundaries to segregate the
classes in n-dimensional space, but we need
to find out the best decision boundary that
helps to classify the data points. This best
boundary is known as the hyperplane of
SVM.
• The dimensions of the hyperplane depend on the features present in the dataset; with 2 features, the hyperplane will be a straight line.
• Support Vectors: The data points or vectors that are closest to the hyperplane and which affect the position of the hyperplane are termed Support Vectors.
Terminologies in Support Vector Machine (SVM)
• The distance between the vectors and the hyperplane is called the margin.
• The goal of SVM is to maximize this margin.
• The hyperplane with maximum margin is called the optimal hyperplane.

[Figure: positive hyperplane, negative hyperplane, and the decision boundary ax + by − c = 0 between them]
Types of Support Vector Machine (SVM)
• Linear SVM: Linear SVM is used for linearly separable data, which means if a dataset
can be classified into two classes by using a single straight line, then such data is
termed as linearly separable data.

• Non-linear SVM: Non-linear SVM is used for non-linearly separable data, i.e. a dataset that cannot be classified by using a straight line (see the sketch below).
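A brief sketch contrasting the two types, assuming scikit-learn is available; make_circles generates synthetic data that no straight line can separate:

```python
# Contrasting linear and non-linear (kernel) SVMs with scikit-learn
# (assumed available), on synthetic, non-linearly separable data.
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, noise=0.1, factor=0.4, random_state=0)

linear_svm = SVC(kernel="linear").fit(X, y)   # Linear SVM
rbf_svm    = SVC(kernel="rbf").fit(X, y)      # Non-linear (kernel) SVM

print("Linear SVM training accuracy:", linear_svm.score(X, y))  # poor here
print("RBF SVM training accuracy:   ", rbf_svm.score(X, y))     # much better
```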
Types of Support Vector Machine (SVM)



An example for SVM

[Figure: training samples plotted by Temperature and Humidity; markers distinguish "play tennis" from "do not play tennis"]
SVM

Data: ⟨xi, yi⟩, i = 1, …, l, where xi ∈ R^d and yi ∈ {−1, +1}.

[Figure: a separating hyperplane with the region f(x) = −1 on one side and f(x) = +1 on the other]

All hyperplanes in R^d are parameterized by a vector w and a constant b, and can be expressed as w·x + b = 0 (remember the equation for a hyperplane from algebra!).
Our aim is to find a hyperplane f(x) = sign(w·x + b) that correctly classifies our data.
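A tiny sketch of the decision rule f(x) = sign(w·x + b); the weight vector and bias below are hypothetical values, not learned from data:

```python
# Sketch of f(x) = sign(w.x + b). The weights and bias are made-up
# illustrative values, not a trained SVM solution.
import numpy as np

w = np.array([2.0, -1.0])   # hypothetical weight vector
b = -1.0                    # hypothetical bias

def f(x):
    return int(np.sign(w @ x + b))  # +1 or -1 side of the hyperplane

print(f(np.array([2.0, 1.0])))   # 2*2 - 1*1 - 1 = +2 -> +1
print(f(np.array([0.0, 2.0])))   # 0 - 2 - 1 = -3 -> -1
```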
Formulation of Margin
Define the hyperplane H such that:
xi·w + b ≥ +1 when yi = +1
xi·w + b ≤ −1 when yi = −1

H1 and H2 are the planes:
H1: xi·w + b = +1
H2: xi·w + b = −1

The points on the planes H1 and H2 are the support vectors.

d+ = the shortest distance to the closest positive point
d− = the shortest distance to the closest negative point
The margin of a separating hyperplane is d+ + d−.
Decision on margin for SVM



Maximizing the margin
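The slide graphics are missing here; in outline, the standard argument is: H1 and H2 are the planes xi·w + b = ±1, each at distance 1/‖w‖ from the decision boundary H, so the margin is

d+ + d− = 2 / ‖w‖.

Maximizing the margin is therefore equivalent to the constrained optimization problem

minimize (1/2)‖w‖²  subject to  yi (xi·w + b) ≥ 1 for all i,

whose solution is determined by the support vectors alone.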
Support Vector Machine (SVM)-Illustration



Support Vector Machine (SVM)-Illustration



Support Vector Machine (SVM)-Pros and Cons
Advantages:
•Effective in high dimensional spaces.
•Still effective in cases where number of dimensions is
greater than the number of samples.
•Uses a subset of training points in the decision function
(called support vectors), so it is also memory efficient.
•Versatile: different Kernel functions can be specified for
the decision function. Common kernels are provided, but it
is also possible to specify custom kernels.

Disadvantages:
•If the number of features is much greater than the number of samples, avoiding over-fitting when choosing kernel functions and the regularization term is crucial.
•SVMs do not directly provide probability estimates; these are calculated using an expensive five-fold cross-validation.
Support Vector Machine (SVM)-Solved Example#1
Suppose, we have positively labeled data points

And we have negatively labeled data points

1 By inspection, it should be obvious that there are three support vectors.



Support Vector Machine (SVM)-Solved Example#1
2 The hyperplane defining the SVM is given as



Support Vector Machine (SVM)-Solved Example#1
4



Support Vector Machine (SVM)-Solved Example#1



Non-Linear Support Vector Machine (SVM)-Solved Example#2
Suppose, we have positively labeled data points

And we have negatively labeled data points

1 Nonlinear mapping from the input space into some feature space.

Substituting the labelled points into the above feature space:



Non-Linear Support Vector Machine (SVM)-Solved Example#2
2 There are two support vectors

3 The hyperplane defining the SVM is given as



Non-Linear Support Vector Machine (SVM)-Solved Example#2
4 The above equation reduces to



Non-Linear Support Vector Machine (SVM)-Solved Example#2



Decision tree
• A Decision tree is a flowchart-like tree structure, where each internal node denotes a
test on an attribute, each branch represents an outcome of the test, and each leaf node
(terminal node) holds a class label.



Decision tree
• A tree can be “learned” by splitting the
source set into subsets based on an
attribute value test.
• This process is repeated on each derived
subset in a recursive manner called
recursive partitioning.
• The recursion is complete when the subset at a node all has the same value of the target variable, or when splitting no longer adds value to the predictions.
• The construction of a decision tree
classifier does not require any domain
knowledge or parameter setting, and
therefore is appropriate for exploratory
knowledge discovery.
Decision tree
• Decision trees can handle high-dimensional data. In general, the decision tree classifier has good accuracy.
• Decision tree induction is a typical
inductive approach to learn knowledge on
classification.
• Decision trees classify instances by
sorting them down the tree from the root
to some leaf node, which provides the
classification of the instance.
• An instance is classified by starting at the
root node of the tree, testing the attribute
specified by this node, then moving down
the tree branch corresponding to the value
of the attribute.
Decision tree
Strength:
• Decision trees are able to generate understandable rules.
• Decision trees perform classification without requiring much computation.
• Decision trees are able to handle both continuous and categorical variables.
• Decision trees provide a clear indication of which fields are most important for prediction
or classification.
Disadvantage:
• Decision trees are less appropriate for estimation tasks where the goal is to predict the
value of a continuous attribute.
• Decision trees are prone to errors in classification problems with many classes and a
relatively small number of training examples.
• Decision trees can be computationally expensive to train. The process of growing a decision
tree is computationally expensive. At each node, each candidate splitting field must be
sorted before its best split can be found. In some algorithms, combinations of fields are
used and a search must be made for optimal combining weights. Pruning algorithms can
also be expensive since many candidate sub-trees must be formed and compared.
Decision tree Solved Example
Consider a dataset based on which we will determine whether to play football or not.

There are 4 independent variables (Outlook, Temperature, Humidity, and Wind) to determine the dependent variable: whether to play football or not.
Decision tree Solved Example
1 Calculation of Information Gain (the difference between the parent entropy and the average weighted entropy) and Entropy (which determines how a decision tree chooses to split the data):
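The formula images are missing; the standard definitions behind these calculations are

Entropy(S) = − Σ pᵢ log₂ pᵢ          (sum over classes i)
Gain(S, A) = Entropy(S) − Σ (|Sᵥ|/|S|) · Entropy(Sᵥ)   (sum over values v of attribute A)

A minimal, runnable sketch of the two quantities in Python; the tiny Outlook/Play sample is illustrative, since the slide's full table is an image:

```python
# Entropy and information gain, as defined above. The tiny Outlook sample
# below is illustrative; the slide's full table is an image.
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def info_gain(attribute, labels):
    total = len(labels)
    weighted = 0.0
    for value in set(attribute):
        subset = [l for a, l in zip(attribute, labels) if a == value]
        weighted += len(subset) / total * entropy(subset)
    return entropy(labels) - weighted

outlook = ["Sunny", "Sunny", "Overcast", "Rain", "Rain", "Overcast"]
play    = ["No",    "No",    "Yes",      "Yes",  "No",   "Yes"]
print("Gain(S, Outlook) =", round(info_gain(outlook, play), 3))
```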



Decision tree Solved Example
2 Calculation of Information Gain (the difference between the parent entropy and the average weighted entropy) and Entropy (which determines how a decision tree chooses to split the data):



Decision tree Solved Example
3 Calculation of Information Gain (the difference between the parent entropy and the average weighted entropy) and Entropy (which determines how a decision tree chooses to split the data):



Decision tree Solved Example
4 Initial Decision tree diagram



Decision tree Solved Example
5 Decide in the decision tree whether Temperature, Humidity, or Wind has the higher information gain.



Decision tree Solved Example
5 Decide in the decision tree whether Temperature, Humidity, or Wind has the higher information gain.



Decision tree Solved Example
5 Decide in the decision tree whether Temperature, Humidity, or Wind has the higher information gain.



Decision tree Solved Example
5 Decide in the decision tree whether Temperature, Humidity, or Wind has the higher information gain.



Decision tree Solved Example
5 Decide in the decision tree whether Temperature, Humidity, or Wind has the higher information gain.



Decision tree Solved Example
6 Decide in the decision tree whether Temperature or Humidity has the higher information gain.



Decision tree Solved Example
6 Decide in the decision tree whether humidity is normal or high based on the higher information gain.



Decision tree Solved Example
6 Decide in the decision tree whether the wind is strong or not based on the higher information gain.



Decision tree Solved Example
6 Decide in the decision tree whether the wind is strong or not based on the higher information gain.



Decision tree Solved Example
6 Decide in the decision tree whether the wind is strong or not based on the higher information gain.



Decision tree Solved Example
6 Final Decision tree

