ML UNIT-IV Notes
UNIT-IV
Linear models: The least-squares method, The perceptron: a heuristic learning algorithm for
linear classifiers, Support vector machines, obtaining probabilities from linear classifiers, Going
beyond linearity with kernel methods.
Distance Based Models: Introduction, Neighbors and exemplars, Nearest Neighbours
classification, Distance Based Clustering, Hierarchical Clustering.
Linear regression is a linear model, i.e. a model that assumes a linear relationship between the
input variables (x) and the single output variable (y). More specifically, it assumes that y can be
calculated from a linear combination of the input variables (x).
When there is a single input variable (x), the method is referred to as simple linear
regression.
When there are multiple input variables, literature from statistics often refers to the
method as multiple linear regression.
Different techniques can be used to prepare or train the linear regression equation from
data, the most common of which is called Ordinary Least Squares. It is therefore common
to refer to a model prepared this way as Ordinary Least Squares Linear Regression
or just Least Squares Regression.
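As a quick illustration, here is a minimal sketch of fitting a simple linear regression by Ordinary Least Squares with NumPy; the x and y values are hypothetical data made up for the demo.

import numpy as np

# Hypothetical data: y is roughly 2*x + 1 with a little noise.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 9.0, 10.8])

# Ordinary Least Squares: solve for [B0, B1] minimizing ||y - (B0 + B1*x)||^2.
X = np.column_stack([np.ones_like(x), x])  # design matrix with an intercept column
coef, residuals, rank, sv = np.linalg.lstsq(X, y, rcond=None)
B0, B1 = coef
print("y = %.2f + %.2f * x" % (B0, B1))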
With a single input x, the model is a line of the form y = B0 + B1 * x, where B0 is the
intercept and B1 is the coefficient of x. In higher dimensions, when we have more than one
input (x), the line is called a plane or a hyper-plane. The representation is therefore the form
of the equation together with the specific values used for the coefficients (e.g. B0 and B1 above).
It is common to talk about the complexity of a regression model like linear regression.
This refers to the number of coefficients used in the model.
When a coefficient becomes zero, it effectively removes the influence of the corresponding
input variable on the model, and therefore on the predictions made from the model (0 * x = 0).
This becomes relevant when you look at regularization methods, which change the learning
algorithm to reduce the complexity of regression models by putting pressure on the
absolute size of the coefficients, driving some to zero.
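As one hedged illustration of this effect, the sketch below fits scikit-learn's Lasso (L1-regularized linear regression) to hypothetical data in which the third feature is irrelevant; the alpha value and the data are assumptions chosen for the demo.

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
# Hypothetical target: depends on the first two features only.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=100)

# L1 regularization puts pressure on the absolute size of the coefficients.
model = Lasso(alpha=0.5).fit(X, y)
print(model.coef_)  # the third coefficient is driven to (or very near) zero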
Support Vector Machines
An SVM model is a representation of the examples as points in space, mapped so that the
examples of the separate categories are divided by a clear gap that is as wide as possible.
In addition to performing linear classification, SVMs can efficiently perform a non-linear
classification, implicitly mapping their inputs into high-dimensional feature spaces.
Given a set of training examples, each marked as belonging to one or the other of two categories,
an SVM training algorithm builds a model that assigns new examples to one category or the
other, making it a non-probabilistic binary linear classifier.
Working of SVM
Support Vectors − Data points that are closest to the hyperplane are called support vectors.
The separating line is defined with the help of these data points.
Hyperplane − The decision plane or boundary that divides the space between sets of objects
having different classes.
Margin − The gap between the two lines through the closest data points of
different classes. It can be calculated as the perpendicular distance from the line to the
support vectors. A large margin is considered a good margin and a small margin is
considered a bad margin.
The main goal of SVM is to divide the dataset into classes by finding a maximum marginal
hyperplane (MMH), which can be done in the following two steps −
First, SVM generates hyperplanes iteratively that segregate the classes in the best way.
Then, it chooses the hyperplane that separates the classes with the maximum margin.
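As a minimal sketch of this idea, the example below trains a linear SVM with scikit-learn on a tiny hypothetical two-class dataset and reads off the support vectors and the hyperplane; the data points are made up for illustration.

import numpy as np
from sklearn.svm import SVC

# Hypothetical linearly separable 2-D data with two classes.
X = np.array([[1, 2], [2, 3], [3, 3], [6, 5], [7, 7], [8, 6]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear", C=1.0).fit(X, y)
print(clf.support_vectors_)       # the data points closest to the hyperplane
print(clf.coef_, clf.intercept_)  # w and b of the hyperplane w.x + b = 0
print(clf.predict([[4, 4]]))      # classify a new point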
SVM Kernels
In practice, the SVM algorithm is implemented with a kernel that transforms the input data space
into the required form. SVM uses a technique called the kernel trick, in which the kernel takes a
low-dimensional input space and transforms it into a higher-dimensional space. In simple words,
the kernel converts a non-separable problem into a separable problem by adding more dimensions.
This makes SVM more powerful, flexible and accurate. The following are some of the types of
kernels used by SVM.
Linear Kernel
It can be used as a dot product between any two observations. The formula of the linear kernel is
as below −
K(x, xi) = sum(x * xi)
From the above formula, we can see that the kernel value for two vectors x and xi is the sum
of the products of each pair of input values, i.e. their dot product.
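A minimal sketch of this formula in NumPy, using hypothetical vectors:

import numpy as np

def linear_kernel(x, xi):
    # Sum of the element-wise products, i.e. the dot product of the two vectors.
    return np.sum(x * xi)

x = np.array([1.0, 2.0, 3.0])
xi = np.array([4.0, 5.0, 6.0])
print(linear_kernel(x, xi))  # 1*4 + 2*5 + 3*6 = 32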
Polynomial Kernel
It is a more generalized form of the linear kernel and can distinguish curved or nonlinear input
spaces. Following is the formula for the polynomial kernel −
K(x, xi) = (1 + sum(x * xi))^d
Here d is the degree of the polynomial, which we need to specify manually in the learning algorithm.
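A minimal sketch of this formula in NumPy, assuming the usual parenthesization (1 + x . xi)^d and hypothetical inputs:

import numpy as np

def polynomial_kernel(x, xi, d=2):
    # (1 + x . xi)^d, with the degree d specified manually.
    return (1.0 + np.sum(x * xi)) ** d

x = np.array([1.0, 2.0])
xi = np.array([0.5, 1.0])
print(polynomial_kernel(x, xi, d=2))  # (1 + 2.5)^2 = 12.25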
RBF Kernel
The RBF (Radial Basis Function) kernel, mostly used in SVM classification, maps the input
space into an infinite-dimensional space. The following formula explains it mathematically −
K(x, xi) = exp(-gamma * sum((x - xi)^2))
Here, gamma ranges from 0 to 1 and must be specified manually in the learning algorithm. A
good default value of gamma is 0.1.
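A minimal sketch of this formula in NumPy; the vectors and the gamma value are hypothetical:

import numpy as np

def rbf_kernel(x, xi, gamma=0.1):
    # exp(-gamma * ||x - xi||^2), with gamma specified manually.
    return np.exp(-gamma * np.sum((x - xi) ** 2))

x = np.array([1.0, 2.0])
xi = np.array([2.0, 4.0])
print(rbf_kernel(x, xi, gamma=0.1))  # exp(-0.1 * (1 + 4)) = exp(-0.5)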
Just as we can implement SVM for linearly separable data, we can implement it in Python for
data that is not linearly separable by using kernels.
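As a hedged illustration, the sketch below trains an RBF-kernel SVM on scikit-learn's make_circles dataset, a standard synthetic dataset that is not linearly separable; the gamma value and the dataset parameters are assumptions chosen for the demo.

from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Hypothetical non-linearly-separable data: one class inside a ring of the other.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = SVC(kernel="rbf", gamma=0.1).fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))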
SVM classifiers offer great accuracy and work well in high-dimensional spaces. SVM
classifiers use only a subset of the training points, and as a result use very little memory.
However, they have a high training time, hence in practice they are not suitable for large
datasets. Another disadvantage is that SVM classifiers do not work well with overlapping classes.
Distance Based Models
Introduction
In many real-world applications, we use Machine Learning algorithms for classifying or
recognizing images and for retrieving information through an image's content. Examples include
face recognition, censoring images online, retail catalogs and recommendation systems.
Choosing a good distance metric becomes really important here, because the distance metric
helps the algorithm recognize similarities between the contents.
A distance metric uses a distance function, which provides a relationship measure between each
pair of elements in the dataset.
Distance Function
Do you remember studying the Pythagorean theorem? If you do, then you might remember
calculating the distance between two data points using it.
In order to calculate the distance between data points A and B, the Pythagorean theorem
considers the lengths along the x and y axes.
A distance function provides the distance between the elements of a set. If the distance is zero,
the elements are equivalent; otherwise they are different from each other.
A distance function is nothing but the mathematical formula used by a distance metric, and it
can differ across different distance metrics.
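A minimal sketch of this calculation in Python; points A and B are hypothetical:

import math

# Distance between A and B from the Pythagorean theorem:
# d(A, B) = sqrt((x2 - x1)^2 + (y2 - y1)^2)
A = (1.0, 2.0)
B = (4.0, 6.0)
d = math.sqrt((B[0] - A[0]) ** 2 + (B[1] - A[1]) ** 2)
print(d)  # sqrt(9 + 16) = 5.0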
o Distance-based models are the second class of Geometric models. Like Linear models,
distance-based models are based on the geometry of data.
o As the name implies, distance-based models work on the concept of distance.
o In the context of Machine Learning, the concept of distance is not merely the
physical distance between two points. Instead, we can think of the distance between
two points in terms of the mode of transport between them.
o Travelling between two cities by plane covers less distance physically than by train,
because a plane is unrestricted. Similarly, in chess, the concept of distance depends on
the piece used; for example, a Bishop can only move diagonally.
o Thus, depending on the entity and the mode of travel, the concept of distance can be
experienced differently.
o The distance metrics commonly used are Euclidean, Minkowski, Manhattan,
and Mahalanobis.
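As a hedged illustration, the sketch below computes these four metrics with SciPy on hypothetical vectors. Note that the Mahalanobis distance also needs the inverse covariance matrix of the data it is measured against, which is estimated here from a small made-up dataset.

import numpy as np
from scipy.spatial import distance

u = np.array([1.0, 2.0])
v = np.array([4.0, 6.0])

print(distance.euclidean(u, v))       # straight-line distance: 5.0
print(distance.cityblock(u, v))       # Manhattan distance: |1-4| + |2-6| = 7
print(distance.minkowski(u, v, p=3))  # generalizes Euclidean (p=2) and Manhattan (p=1)

# Mahalanobis needs the inverse covariance matrix of the underlying data.
data = np.array([[1.0, 2.0], [2.0, 3.0], [4.0, 6.0], [5.0, 8.0]])
VI = np.linalg.inv(np.cov(data.T))
print(distance.mahalanobis(u, v, VI))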
Notes:
The centroid represents the geometric centre of a plane figure, i.e. the arithmetic mean
position of all the points in the figure. This definition extends to any object in
n-dimensional space: its centroid is the mean position of all its points.
Medoids are similar in concept to means or centroids, but a medoid is always an actual
member of the dataset. Medoids are most commonly used on data where a mean or centroid
cannot be defined, or where the centroid is not representative of the dataset, such as in image data.
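A minimal sketch contrasting the two on hypothetical points; the centroid need not be one of the data points, while the medoid always is:

import numpy as np

points = np.array([[1.0, 1.0], [2.0, 1.0], [4.0, 3.0], [5.0, 4.0]])

# Centroid: the mean position of all the points.
centroid = points.mean(axis=0)

# Medoid: the actual data point whose total distance to all the others is smallest.
dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
medoid = points[dists.sum(axis=1).argmin()]

print("centroid:", centroid)  # may fall between the points
print("medoid:", medoid)      # always one of the points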
Nearest Neighbours Classification (KNN)
Lazy learning algorithm − KNN is a lazy learning algorithm because it does not have a
specialized training phase; it uses all of the training data during classification.
The K-nearest neighbours (KNN) algorithm uses 'feature similarity' to predict the values of new
data points, which means that a new data point is assigned a value based on how closely it
matches the points in the training set. We can understand its working with the help of the
following steps (a minimal sketch in code appears after the steps) −
Step 1 − For implementing any algorithm, we need a dataset. So during the first step of KNN, we
must load the training as well as the test data.
Step 2 − Next, we need to choose the value of K, i.e. the number of nearest data points to
consider. K can be any integer.
Step 3 − For each point in the test data, do the following −
3.1 − Calculate the distance between the test point and each row of the training data
using any of the methods, namely Euclidean, Manhattan or Hamming distance. The most
commonly used method is Euclidean distance.
3.2 − Based on the distance values, sort the rows in ascending order.
3.3 − Choose the top K rows from the sorted array.
3.4 − Assign a class to the test point based on the most frequent class of these rows.
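A minimal sketch of these steps from scratch in NumPy, using Euclidean distance and hypothetical training data:

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_test, k=3):
    # Step 3.1: Euclidean distance from the test point to every training row.
    dists = np.linalg.norm(X_train - x_test, axis=1)
    # Steps 3.2 and 3.3: sort by distance and take the top K rows.
    nearest = np.argsort(dists)[:k]
    # Step 3.4: majority vote over the classes of those K neighbours.
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[1.0, 1.0], [1.5, 2.0], [5.0, 5.0], [6.0, 5.5]])
y_train = np.array(["red", "red", "blue", "blue"])
print(knn_predict(X_train, y_train, np.array([1.2, 1.5]), k=3))  # "red"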
Example
The following is an example to understand the concept of K and the working of the KNN
algorithm. Suppose we need to classify a new data point, shown as a black dot at (60, 60), into
either the blue class or the red class. We assume K = 3, i.e. the algorithm will find the three
nearest data points.
Among the three nearest neighbours of the black dot, two lie in the red class; hence the black dot
is also assigned to the red class.
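Since the original diagram is not reproduced here, the sketch below recreates the example with hypothetical coordinates around (60, 60), using scikit-learn's KNeighborsClassifier:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical coordinates standing in for the diagram: two red points and one
# blue point near (60, 60), plus a blue cluster further away.
X = np.array([[58, 62], [61, 58], [65, 65], [30, 30], [35, 25], [25, 35]])
y = np.array(["red", "red", "blue", "blue", "blue", "blue"])

clf = KNeighborsClassifier(n_neighbors=3).fit(X, y)
print(clf.predict([[60, 60]]))  # two of the three nearest neighbours are red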
Pros
It is very useful for nonlinear data because the algorithm makes no assumptions about
the data.
It has relatively high accuracy, although there are better supervised learning models than
KNN.
Cons
It is a computationally expensive algorithm because it stores all of the training data.
Applications of KNN
The following are some of the areas in which KNN can be applied successfully −
Banking System
KNN can be used in a banking system to predict whether an individual is fit for loan approval,
by asking whether that individual has characteristics similar to those of defaulters.
KNN algorithms can also be used to find an individual's credit rating by comparing them with
persons having similar traits.
Politics
With the help of KNN algorithms, we can classify a potential voter into various classes like
"Will Vote", "Will Not Vote", "Will Vote for Party 'Congress'" and "Will Vote for Party 'BJP'".
Other areas in which KNN algorithm can be used are Speech Recognition, Handwriting
Detection, Image Recognition and Video Recognition.
Distance Based Clustering
Clustering methods are among the most useful unsupervised ML methods. They are used
to find similarity and relationship patterns among data samples, and then to cluster those
samples into groups that are similar based on their features.
Clustering is important because it determines the intrinsic grouping among the present
unlabelled data. Clustering algorithms make some assumptions about data points to constitute
their similarity, and each assumption constructs different but equally valid clusters. A clustering
system groups similar kinds of data together into different clusters, as in the sketch below.
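As one hedged illustration of distance-based clustering, the sketch below runs scikit-learn's KMeans on hypothetical unlabelled data with two natural groups; the number of clusters is an assumption chosen for the demo.

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical unlabelled data with two natural groups.
X = np.array([[1.0, 2.0], [1.5, 1.8], [1.0, 0.6],
              [8.0, 8.0], [9.0, 11.0], [8.0, 9.0]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)           # cluster assignment for each sample
print(km.cluster_centers_)  # the centroid of each cluster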
Hierarchical Clustering
Agglomerative hierarchical algorithms − In agglomerative hierarchical algorithms, each data
point starts as its own cluster, and the closest clusters are successively merged (bottom-up
approach) until one big cluster remains.
Divisive hierarchical algorithms − On the other hand, in divisive hierarchical algorithms, all
the data points are treated as one big cluster, and the process of clustering involves dividing
(top-down approach) the one big cluster into various small clusters.
We are going to explain the most used and important hierarchical clustering, i.e. agglomerative.
The steps to perform it are as follows (a minimal sketch in code appears after the steps) −
Step 1 − Treat each data point as a single cluster. Hence, we will have, say, K clusters
at the start; the number of data points will also be K at the start.
Step 2 − Now, in this step we form a bigger cluster by joining the two closest data points.
This results in a total of K-1 clusters.
Step 3 − Next, to form more clusters we join the two closest clusters. This results
in a total of K-2 clusters.
Step 4 − Repeat the previous step until all the points have been merged into one big
cluster, i.e. there are no more clusters left to join.
Step 5 − At last, after making one single big cluster, the dendrogram is cut to divide the data
into multiple clusters depending upon the problem.
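A minimal sketch of these steps using SciPy's hierarchical clustering routines on hypothetical data points; the linkage method and the cut level are assumptions chosen for the demo.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical data points, each starting as its own cluster (Step 1).
X = np.array([[1.0, 1.0], [1.2, 1.1], [5.0, 5.0], [5.1, 4.9], [9.0, 9.0]])

# linkage repeatedly merges the two closest clusters (Steps 2-4)
# until a single cluster remains; Z records every merge.
Z = linkage(X, method="single")

# Cutting the resulting dendrogram gives the final clusters (Step 5).
print(fcluster(Z, t=2, criterion="maxclust"))  # split into at most 2 clusters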