Machine Learning Unit 4
Unit IV
Support Vector Machines (SVM)
Introduction:
Support Vector Machine or SVM is one of the most popular Supervised Learning algorithms,
which is used for Classification as well as Regression problems. However, primarily, it is used
for Classification problems in Machine Learning.
The goal of the SVM algorithm is to create the best line or decision boundary that can
segregate n-dimensional space into classes so that we can easily put the new data point in
the correct category in the future. This best decision boundary is called a hyperplane.
SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme cases are called support vectors, and hence the algorithm is termed a Support Vector Machine. Consider the below diagram in which there are two different categories that are
classified using a decision boundary or hyperplane:
Example: SVM can be understood with the example that we used in the KNN classifier. Suppose we see a strange cat that also has some features of dogs. If we want a model that can accurately identify whether it is a cat or a dog, such a model can be created using the SVM algorithm. We will first train our model with lots of images of cats and dogs so that it can learn the different features of cats and dogs, and then we test it with this strange creature. The SVM creates a decision boundary between the two classes (cat and dog) and chooses the extreme cases (support vectors) of each class. On the basis of these support vectors, it will classify the creature as a cat. Consider the below diagram:
SVM algorithm can be used for Face detection, image classification, text categorization, etc.
Types of SVM
o Linear SVM: Linear SVM is used for linearly separable data. If a dataset can be classified into two classes by using a single straight line, then such data is termed linearly separable data, and the classifier used is called a Linear SVM classifier.
o Non-linear SVM: Non-Linear SVM is used for non-linearly separable data. If a dataset cannot be classified by using a straight line, then such data is termed non-linear data, and the classifier used is called a Non-linear SVM classifier.
Linear Discriminant Analysis (LDA)
Assumptions of LDA
LDA assumes that the data has a Gaussian distribution and that the covariance matrices of
the different classes are equal. It also assumes that the data is linearly separable, meaning
that a linear decision boundary can accurately classify the different classes.
Suppose we have two sets of data points belonging to two different classes that we want
to classify. As shown in the given 2D graph, when the data points are plotted on the 2D
plane, there’s no straight line that can separate the two classes of data points completely.
Hence, in this case, LDA (Linear Discriminant Analysis) is used which reduces the 2D graph
into a 1D graph in order to maximize the separability between the two classes.
Here, Linear Discriminant Analysis uses both axes (X and Y) to create a new axis and
projects data onto a new axis in a way to maximize the separation of the two categories
and hence, reduces the 2D graph into a 1D graph.
Two criteria are used by LDA to create a new axis:
1. Maximize the distance between the means of the two classes.
2. Minimize the variation within each class.
Fig: Perpendicular distance between the new axis (line) and the data points
In the above graph, it can be seen that a new axis (in red) is generated and plotted in the
2D graph such that it maximizes the distance between the means of the two classes and
minimizes the variation within each class. In simple terms, this newly generated axis
increases the separation between the data points of the two classes. After generating this
new axis using the above-mentioned criteria, all the data points of the classes are plotted
on this new axis and are shown in the figure given below.
But Linear Discriminant Analysis fails when the means of the distributions are shared, as it
becomes impossible for LDA to find a new axis that makes both classes linearly separable.
In such cases, we use non-linear discriminant analysis.
For two classes c1 and c2 with means μ1 and μ2, the projection of a data point x onto a direction v is vᵀx, and the projected class means are m1 = vᵀμ1 and m2 = vᵀμ2.
Now, we need to project our data onto the line having direction v which maximizes the separation between the projected means relative to the spread (scatter) of each class:
J(v) = (m1 − m2)² / (s1² + s2²)
For maximizing the above equation we need to find a projection vector that maximizes the difference of the projected means while reducing the scatter of both classes. The scatter matrices s1 and s2 of classes c1 and c2 are:
s1 = ∑x∈c1 (x − μ1)(x − μ1)ᵀ and s2 = ∑x∈c2 (x − μ2)(x − μ2)ᵀ
After simplifying the above equation, we get the scatter within the classes (Sw) and the scatter between the classes (Sb):
Sw = s1 + s2 and Sb = (μ1 − μ2)(μ1 − μ2)ᵀ
so that J(v) = (vᵀ Sb v) / (vᵀ Sw v).
Now, to maximize the above equation we differentiate it with respect to v and set the derivative to zero, which leads to the generalized eigenvalue problem:
Sw⁻¹ Sb v = λ v
Here, for the maximum value of J(v), we use the eigenvector corresponding to the highest eigenvalue. This provides us with the best solution for LDA.
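As a concrete illustration, here is a minimal NumPy sketch of this computation (the two small clusters of 2D points and all variable names are made-up for illustration, not taken from the text): it builds Sw and Sb, solves Sw⁻¹ Sb v = λ v, and projects the data onto the eigenvector with the largest eigenvalue.

# Minimal sketch of Fisher's Linear Discriminant for two classes (illustrative data).
import numpy as np

# Toy 2D data for classes c1 and c2 (hypothetical points)
c1 = np.array([[4.0, 2.0], [2.0, 4.0], [2.0, 3.0], [3.0, 6.0], [4.0, 4.0]])
c2 = np.array([[9.0, 10.0], [6.0, 8.0], [9.0, 5.0], [8.0, 7.0], [10.0, 8.0]])

mu1, mu2 = c1.mean(axis=0), c2.mean(axis=0)

# Within-class scatter Sw = s1 + s2
s1 = (c1 - mu1).T @ (c1 - mu1)
s2 = (c2 - mu2).T @ (c2 - mu2)
Sw = s1 + s2

# Between-class scatter Sb = (mu1 - mu2)(mu1 - mu2)^T
diff = (mu1 - mu2).reshape(-1, 1)
Sb = diff @ diff.T

# Solve Sw^-1 Sb v = lambda v; keep the eigenvector with the largest eigenvalue
eigvals, eigvecs = np.linalg.eig(np.linalg.inv(Sw) @ Sb)
v = eigvecs[:, np.argmax(eigvals.real)].real

# Project all points onto the new 1D axis
print("Projection direction v:", v)
print("Projected class means:", (c1 @ v).mean(), (c2 @ v).mean())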
Extensions to LDA
1. Quadratic Discriminant Analysis (QDA): Each class uses its own estimate of
variance (or covariance when there are multiple input variables).
2. Flexible Discriminant Analysis (FDA): Where non-linear combinations of inputs
are used such as splines.
3. Regularized Discriminant Analysis (RDA): Introduces regularization into the
estimate of the variance (actually covariance), moderating the influence of
different variables on LDA.
Perceptron Algorithm:
The Perceptron model is one of the simplest types of Artificial Neural Networks. It is a supervised learning algorithm for binary classifiers. Hence, we can consider it as a single-layer neural network with four main parameters: input values, weights and bias, net sum, and an activation function.
Mr. Frank Rosenblatt invented the perceptron model as a binary classifier which contains
three main components. These are as follows:
o Input Nodes or Input Layer:
This is the primary component of the Perceptron, which accepts the initial data into the system for further processing. Each input node contains a real numerical value.
o Weight and Bias:
The weight parameter represents the strength of the connection between units and is another important Perceptron component. Weight is directly proportional to the strength of the associated input neuron in deciding the output. Further, bias can be considered as the intercept term in a linear equation.
o Activation Function:
This is the final and an important component that helps determine whether the neuron will fire or not. The activation function is typically a step-like function, such as:
o Sign function
o Step function, and
o Sigmoid function
The data scientist chooses the activation function based on the problem statement and the desired outputs. The choice of activation function (e.g., Sign, Step, or Sigmoid) also affects training behavior, for example whether the learning process is slow or suffers from vanishing or exploding gradients.
The step function or activation function plays a vital role in ensuring that the output is mapped to the required range, such as (0, 1) or (-1, 1). It is important to note that the weight of an input is indicative of the strength of a node. Similarly, an input's bias value gives the ability to shift the activation function curve up or down.
How does the Perceptron work?
Step-1
In the first step, multiply all input values by their corresponding weight values and add them together to determine the weighted sum ∑wi*xi. Then add a special term called the bias 'b' to this weighted sum to improve the model's performance:
∑wi*xi + b
Step-2
In the second step, an activation function is applied to the above-mentioned weighted sum, which gives us an output either in binary form or as a continuous value, as follows:
Y = f(∑wi*xi + b)
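To make the two steps concrete, the following is a minimal Python sketch of a single perceptron forward pass, assuming a step activation; the input values, weights, and bias are made-up illustrative numbers.

# Minimal sketch of a perceptron forward pass (illustrative values).
import numpy as np

def step(z):
    # Step activation: returns 1 if z >= 0, else 0
    return 1 if z >= 0 else 0

def perceptron_output(x, w, b):
    # Y = f(sum(wi * xi) + b)
    z = np.dot(w, x) + b      # weighted sum plus bias
    return step(z)            # apply the activation function

x = np.array([1.0, 0.5])      # hypothetical inputs
w = np.array([0.4, -0.2])     # hypothetical weights
b = 0.1                       # hypothetical bias
print(perceptron_output(x, w, b))   # prints 1, since 0.4 - 0.1 + 0.1 = 0.4 >= 0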
Based on the layers, Perceptron models are divided into two types. These are as follows:
1. Single Layer Perceptron Model
2. Multi-Layer Perceptron Model
1. Single Layer Perceptron Model:
This is one of the simplest types of Artificial Neural Networks (ANN). A single-layer perceptron model consists of a feed-forward network and includes a threshold transfer function inside the model. The main objective of the single-layer perceptron model is to analyze linearly separable objects with binary outcomes.
In a single-layer perceptron model, the algorithm does not start from previously recorded data; it begins with randomly allocated values for the weight parameters. It then sums up all the weighted inputs. If the total sum of all inputs is more than a pre-determined threshold value, the model is activated and shows the output value as +1.
If the outcome matches the pre-determined threshold value, the performance of the model is considered satisfactory, and the weights are not changed. However, this model runs into discrepancies when multiple weighted input values are fed into it. Hence, to obtain the desired output and minimize errors, some changes to the input weights may be necessary.
2. Multi-Layer Perceptron Model:
Like a single-layer perceptron model, a multi-layer perceptron model has the same basic structure but includes one or more hidden layers.
The multi-layer perceptron model is trained with the Backpropagation algorithm, which executes in two stages as follows:
o Forward Stage: Activation functions start from the input layer in the forward stage
and terminate on the output layer.
o Backward Stage: In the backward stage, weight and bias values are modified as per the model's requirement. In this stage, the error between the actual and desired output is propagated backward, starting at the output layer and ending at the input layer.
Hence, a multi-layer perceptron model can be seen as multiple layers of artificial neurons in which the activation function need not remain linear, unlike in a single-layer perceptron model. Instead of a linear function, the activation function can be a sigmoid, TanH, ReLU, etc.
A multi-layer perceptron model has greater processing power and can process linear and non-
linear patterns. Further, it can also implement logic gates such as AND, OR, XOR, NAND, NOT,
XNOR, NOR.
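As a brief, hedged sketch of this capability, the snippet below uses scikit-learn's MLPClassifier (assuming scikit-learn is installed) to learn the XOR gate, which a single-layer perceptron cannot represent; the hidden-layer size, activation, solver, and iteration count are illustrative choices rather than values prescribed by the text.

# Sketch: a multi-layer perceptron learning XOR (illustrative hyperparameters).
import numpy as np
from sklearn.neural_network import MLPClassifier

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])   # XOR truth table

mlp = MLPClassifier(hidden_layer_sizes=(8,), activation="tanh",
                    solver="lbfgs", max_iter=2000, random_state=0)
mlp.fit(X, y)
print(mlp.predict(X))        # expected [0 1 1 0] once training converges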
Perceptron Function:
The perceptron function f(x) is obtained by multiplying the input vector 'x' with the learned weight vector 'w' and adding the bias 'b':
f(x) = 1 if w·x + b > 0
f(x) = 0 otherwise
Characteristics of Perceptron
o The output of a perceptron can only be a binary number (0 or 1) due to the hard limit
transfer function.
o Perceptron can only be used to classify linearly separable sets of input vectors. If the input vectors are not linearly separable, it cannot classify them properly.
Future of Perceptron
The future of the Perceptron model is bright and significant, as it helps to interpret data by building intuitive patterns and applying them in the future. Machine learning is a rapidly growing branch of Artificial Intelligence that is continuously evolving; hence, perceptron technology will continue to support and facilitate analytical behavior in machines, which will in turn add to the efficiency of computers.
The perceptron model is continuously becoming more advanced and working efficiently on
complex problems with the help of artificial neurons.
Large Margin Classifier for Linearly Separable Data:
Consider a binary classification problem where you have two classes: Class A and Class
B. The objective of SVM is to find a hyperplane that best separates these two classes
while maximizing the margin. The equation for this hyperplane is
w·x + b = 0
Where
w: The weight vector that defines the orientation of the hyperplane.
b: The bias term that determines the offset of the hyperplane from the origin.
The margin is the distance between the hyperplane and the nearest data points from
each class. To maximize this margin, we want to find w and b such that the distance
from the hyperplane to the closest point in Class A and the closest point in Class B is
maximized.
Mathematically, the margin can be expressed as:
Margin = 2 / ||w||
Where ||w|| is the Euclidean norm (magnitude) of the weight vector w. The objective of SVM is to maximize this margin, which is equivalent to minimizing ||w||.
Here is a diagram illustrating this concept:
In the diagram, the decision hyperplane (the straight line) separates Class A from Class B.
The margin is the distance from the hyperplane to the closest data points from each class.
• SVM's objective is to find the optimal w and b that maximize this margin while
ensuring that data points are correctly classified.
• In this ideal scenario of linearly separable data, the support vectors are the data points
closest to the hyperplane, and they are used to define the margin.
• SVM finds these support vectors and optimizes the margin by solving a constrained
optimization problem.
The large margin classifier provides a robust solution for linearly separable data, ensuring a
wider separation between classes and making it less sensitive to noise in the data.
Fig: Representing good and bad SVM classifier models in small and large margin cases
A large margin classifier in SVM for linearly separable data aims to find an optimal
hyperplane that maximizes the margin between two classes, ensuring a robust separation.
Support vectors define this margin, and SVM finds the best hyperplane by minimizing
classification errors while maximizing the margin, enhancing classification accuracy and
robustness.
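The following is a minimal scikit-learn sketch of a large-margin linear SVM on linearly separable data (assuming scikit-learn is installed); the synthetic blobs and the very large C value are illustrative assumptions used to approximate a hard margin.

# Sketch: large-margin linear SVM on separable synthetic data.
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=60, centers=2, cluster_std=1.0, random_state=0)

clf = SVC(kernel="linear", C=1e6)   # very large C approximates a hard margin
clf.fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]
print("Hyperplane: w.x + b = 0 with w =", w, "and b =", b)
print("Margin width = 2/||w|| =", 2 / (w @ w) ** 0.5)
print("Support vectors:")
print(clf.support_vectors_)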
Linear Soft Margin Classifier:
The linear soft margin classifier in SVM aims to find a hyperplane that best separates
overlapping classes, even when perfect separation isn't possible. It introduces a
"slack variable" (ξ) to account for classification errors. The objective function is
modified as follows:
minimize (1/2)||w||² + C ∑ξi subject to yi(w·xi + b) ≥ 1 − ξi and ξi ≥ 0 for all i
Where
w: The weight vector.
b: The bias or threshold.
C: A hyperparameter that controls the trade-off between maximizing the margin and minimizing the misclassification error.
ξi: Slack variable for the ith data point.
• The decision hyperplane (a straight line) attempts to separate the classes, but due to
overlapping, some data points may lie on the wrong side.
• The slack variables (ξ) allow for some misclassifications while trying to maximize the margin. The parameter C controls the balance between minimizing misclassification errors (large C) and maximizing the margin (small C).
• It helps SVM adapt to overlapping classes and create a margin that balances the trade-
off between classification accuracy and margin size.
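A brief sketch of this trade-off, assuming scikit-learn and an illustrative overlapping dataset: the same data is fit with a small C (wider margin, more slack allowed) and a large C (fewer training errors tolerated).

# Sketch: effect of C in the soft-margin SVM (illustrative data and values).
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=200, centers=2, cluster_std=3.0, random_state=1)

for C in (0.01, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C={C}: training accuracy = {clf.score(X, y):.2f}, "
          f"support vectors = {len(clf.support_vectors_)}")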
Kernel-Induced Feature Space (Non-linear SVM):
• In SVM, the kernel function plays a central role. The kernel function, denoted as K(x,
y), takes two input data points x and y and returns a measure of similarity between
them.
• It implicitly maps the data into a higher-dimensional feature space where linear
separation might be possible.
The equation for SVM's decision boundary in the feature space is:
f(x) = ∑ᵢ αi yi K(xi, x) + b
Where
K(xi, x): The kernel function that maps xi and x into the feature space.
αi: The learned coefficient (Lagrange multiplier) for the ith training point, and yi is its class label.
b: The bias term.
• Consider a simple 2D dataset where Class A (Green points) and Class B (blue points)
are not linearly separable in the original feature space:
• In this diagram, it's evident that a straight-line decision boundary cannot separate the
classes effectively in the original 2D space.
• Now, by using a kernel function, we implicitly map this data to a higher-dimensional feature space, often referred to as a "kernel-induced feature space." Let's say we use a radial basis function (RBF) kernel: K(x, y) = exp(−γ ||x − y||²), where γ > 0 controls the width of the kernel.
• This RBF kernel implicitly maps the data to a higher-dimensional space where the
classes might become linearly separable
Fig: Non-linearly separable data shown in the original 2D space and in a 3D kernel-induced feature space
• In this new feature space, the data points might be linearly separable with the right
choice of kernel and kernel parameters, enabling SVM to find an optimal decision
boundary that maximizes the margin between classes.
The transformation into the kernel-induced feature space is implicit and doesn't require
explicit calculation of the transformed feature vectors. It allows SVM to handle non-linearly
separable data effectively.
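As an illustrative sketch (assuming scikit-learn), the snippet below generates concentric circles, which no straight line can separate in 2D, and compares a linear kernel with an RBF kernel; the gamma and C values are arbitrary illustrative choices.

# Sketch: RBF kernel SVM on data that is not linearly separable in 2D.
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear_clf = SVC(kernel="linear").fit(X, y)
rbf_clf = SVC(kernel="rbf", gamma=2.0, C=1.0).fit(X, y)

print("Linear kernel accuracy:", linear_clf.score(X, y))   # poor, around 0.5
print("RBF kernel accuracy:   ", rbf_clf.score(X, y))      # close to 1.0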
Support Vector Regression (SVR):
Support Vector Machines (SVM) are not just limited to classification tasks; they can also be
used for regression. In regression tasks, the goal is to predict a continuous target variable
rather than class labels. SVM for regression is known as Support Vector Regression (SVR). It is classified into two models: linear SVR and non-linear SVR.
Linear SVR (Linear Regression using SVM):
Linear regression using Support Vector Machines (SVM) is a variation of SVM designed for
regression tasks. It aims to find a linear relationship between input features and a
continuous target variable.
• In linear regression using SVM, the goal is to find a linear function that best
approximates the relationship between input features and the target variable. This
linear function is represented as:
f(x) = w⋅x+b
f(x): The predicted target variable. w: The weight vector. b: The bias term.
• The linear regression objective is to minimize the mean squared error (MSE) between the predictions and the true target values:
MSE = (1/n) ∑ᵢ (yi − f(xi))²
• The target variable (y) is represented on the vertical axis, and the input features (x)
are on the horizontal axis.
• The linear function f(x) = w.x + b is the best-fitting line that minimizes the mean
squared error by adjusting the weight vector (w) and the bias term (b).
• This linear model can be used for regression tasks to predict continuous target
variables based on input features.
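A short sketch of linear SVR, assuming scikit-learn; the data is generated from a made-up linear relationship y = 2x + 1 with noise, and the C and epsilon values are illustrative.

# Sketch: linear Support Vector Regression fitting f(x) = w.x + b.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 2.0 * X.ravel() + 1.0 + rng.normal(scale=0.5, size=100)

svr = SVR(kernel="linear", C=10.0, epsilon=0.1)
svr.fit(X, y)
print("Learned w:", svr.coef_[0], "b:", svr.intercept_[0])   # expected near 2 and 1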
Non-linear SVR (Non-linear Regression using SVM):
• In non-linear regression using SVM, the goal is to find a non-linear function that best
fits the relationship between input features and the target variable.
• Unlike linear regression, which assumes a linear relationship, non-linear regression
allows for more complex, non-linear patterns.
• The non-linear regression objective is to minimize the mean squared error (MSE) between the predictions and the true target values, where the prediction function takes the form:
f(x) = ∑ᵢ₌₁ⁿ αi K(xi, x) + b
Where
αi: The learned coefficient for the ith training point.
K(xi, x): The kernel function that implicitly maps xi and x into a higher-dimensional feature space.
b: The bias or intercept term.
• The target variable (y) is represented on the vertical axis, and the input features (x)
are on the horizontal axis.
• The non-linear function f(x) = ∑ᵢ₌₁ⁿ αi K(xi, x) + b captures non-linear relationships
between input features and the target variable by implicitly mapping the data into a
higher-dimensional feature space using the kernel function.
• The model can then make non-linear predictions based on the input features.
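A short sketch of non-linear SVR with an RBF kernel, assuming scikit-learn; the sine-shaped data and all parameter values are illustrative and are not taken from the text.

# Sketch: non-linear (RBF-kernel) Support Vector Regression.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 5, size=(200, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)

svr_rbf = SVR(kernel="rbf", C=10.0, gamma=0.5, epsilon=0.05)
svr_rbf.fit(X, y)
print("R^2 on training data:", svr_rbf.score(X, y))   # expected close to 1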
Learning with Neural Networks toward Cognitive Machines
1. Neural Network Foundations:
Neural networks, particularly deep learning models like convolutional neural networks
(CNNs) and recurrent neural networks (RNNs), serve as the foundation for cognitive
machine learning. These models are capable of handling complex data, learning patterns,
and making predictions.
2. Supervised and Unsupervised Learning:
Cognitive machines combine supervised learning (learning from labeled examples) with unsupervised learning (discovering structure in unlabeled data) to build rich internal representations of the world.
4. Transfer Learning:
To mimic cognitive abilities, neural networks use transfer learning. Pre-trained models are fine-tuned for specific tasks, which is akin to humans applying knowledge learned in one context to solve related problems.
5. Multimodal Data Processing:
Cognitive machines process data from various sources (text, images, audio)
simultaneously, fostering a more comprehensive understanding of the environment. They
can analyze multiple data modalities to make informed decisions.
6. Memory and Reasoning:
Cognitive machines integrate memory networks and reasoning modules, enabling them to
store and retrieve information and perform logical reasoning. This allows them to solve
problems by considering context and past experiences.
7. Natural Language Processing:
Cognitive machines excel in natural language processing tasks. They can understand and
generate human-like text and engage in meaningful conversations, making them highly
interactive and adaptive.
8. Contextual Awareness:
These machines have contextual awareness, recognizing the importance of the context in
which they operate. They can adapt their behavior, decisions, and responses based on the
current situation.
9. Continuous Learning:
Cognitive machines don't stop learning after initial training. They engage in continuous
learning and self-improvement, allowing them to adapt to changing conditions and acquire
new knowledge over time.
10. Emulating Human Cognition:
The ultimate goal of learning with neural networks toward cognitive machines is to create
systems that replicate and augment human-like cognition. They mimic human problem-
solving, decision-making, creativity, and adaptability.
In summary, learning with neural networks toward cognitive machines involves a holistic
approach to developing intelligent systems. By combining various learning techniques, these
machines can process complex data, reason, understand language, adapt to changing
situations, and replicate cognitive functions, bringing us closer to creating intelligent
systems that emulate human cognition and understanding.
Neuron Models:
Let us discuss two neuron models:
1. Biological neuron
2. Artificial neuron
1. Biological neuron:
• Neuron Structure:
A biological neuron consists of dendrites that receive incoming signals, a cell body (soma) that processes them, and an axon that carries the resulting signal toward other neurons.
• Synapses:
Neurons communicate with each other through synapses, which are small gaps
between the axon of one neuron and the dendrites of another. Neurotransmitters
are released at the synapse to transmit signals.
• Action Potential:
Neurons transmit electrical signals in the form of action potentials. An action
potential is a brief change in the neuron's electrical charge, leading to the
propagation of a signal along the axon.
• Resting Potential and Threshold:
At rest, a neuron maintains a stable electrical charge known as the resting potential. When the electrical charge inside the neuron reaches a certain threshold, an action potential is initiated. This action potential travels down the axon and signals the release of neurotransmitters at the synapse.
• Neural Networks:
Biological neurons are connected to one another in vast networks; learning corresponds to changes in the strength of the synaptic connections between them.
2. Artificial neuron:
An artificial neuron is a simplified mathematical model of the biological neuron. Its main components are as follows:
• Inputs and Weights:
The neuron receives input values (x1, x2, ..., xn), each multiplied by an associated weight (w1, w2, ..., wn) that represents the strength of that connection.
• Summation (Σ):
The weighted inputs are summed together, typically with a bias term (b), to compute
the net input:
Net Input= (w1∗x1) + (w2∗x2) +...+ (wn∗xn) +b
• Activation Function (f):
The net input is passed through an activation function (for example, a step or sigmoid function), which determines whether and how strongly the neuron fires.
• Output (y):
The result of the activation function is the output of the artificial neuron. It represents the neuron's response to the input signals.
Single-Layer Neural Network:
A single-layer neural network, also known as a single-layer Perceptron, is the simplest neural
network architecture. It consists of an input layer, which directly connects to an output
layer, without any hidden layers. Single-layer networks are mainly used for binary
classification problems or linearly separable tasks.
Fig: Single layer Artificial Neural Network
Each output neuron computes a weighted sum of its inputs:
z = w1*x1 + w2*x2 + ... + wn*xn + b
Where:
z is the weighted sum.
wi are the weights and xi the input values.
b is the bias.
A step function, also known as the Heaviside step function, is often used as the activation function. It outputs 1 if the weighted sum z is greater than or equal to 0, and 0 otherwise.
• In the diagram, input features (x1, x2, ..., xn) are connected to the weighted sum calculation, followed by the activation function (step function), which produces a binary output (0 or 1).
• This single-layer neural network can make binary decisions based on the weighted
sum of its input features, which is often used for linearly separable classification
problems.
• Single-layer networks are limited in their capability compared to more complex neural
architectures like multi-layer perceptrons (MLPs) or deep neural networks.
• They can only solve problems that are linearly separable and cannot capture complex
non-linear relationships in data. While simple, they are foundational in understanding
neural networks and are a starting point for more sophisticated architectures. To
handle more complex tasks, deeper neural networks with hidden layers are employed.
Multilayer Neural Network:
A multilayer neural network adds one or more hidden layers between the input layer and the output layer. The weighted sum for each neuron in a hidden layer is calculated as follows:
zj = ∑ᵢ₌₁ⁿ wji⋅xi + bj
Where
zj is the weighted sum for neuron j in the hidden layer.
wji is the weight connecting input i to neuron j, and bj is the bias for neuron j in the hidden layer.
The weighted sum for each neuron in the output layer is calculated similarly to the hidden layer:
zk = ∑ⱼ₌₁ᵐ w′kj⋅f(zj) + b′k
Where
zk is the weighted sum for neuron k in the output layer.
w′kj is the weight connecting neuron j in the hidden layer to neuron k in the output layer, and b′k is the bias for neuron k in the output layer.
binary classification, you might use a sigmoid function. For multiclass classification, a
softmax function is common.
In this diagram, input features (x1, x2... xn) are connected to the weighted sum
calculations in the hidden layer, followed by the activation function for the hidden layer. The
output of the hidden layer is then connected to the weighted sum calculations in the output
layer, followed by the activation function for the output layer. This network structure allows
multilayer neural networks to capture complex relationships and solve a wide range of tasks,
including classification, regression, and more.
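The forward pass just described can be written out directly; the following is a minimal NumPy sketch with a single hidden layer and sigmoid activations, where the network size, inputs, and randomly drawn weights are all made-up illustrative values.

# Sketch: forward pass of a small multilayer network (3 inputs -> 4 hidden -> 1 output).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.2, 0.8])                             # input features
W_hidden = np.random.default_rng(0).normal(size=(4, 3))    # weights w_ji
b_hidden = np.zeros(4)                                     # biases b_j
W_out = np.random.default_rng(1).normal(size=(1, 4))       # weights w'_kj
b_out = np.zeros(1)                                        # biases b'_k

z_hidden = W_hidden @ x + b_hidden    # z_j = sum_i w_ji * x_i + b_j
a_hidden = sigmoid(z_hidden)          # f(z_j)
z_out = W_out @ a_hidden + b_out      # z_k = sum_j w'_kj * f(z_j) + b'_k
y_hat = sigmoid(z_out)                # sigmoid output for a binary problem
print("Network output:", y_hat)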
Linear Neuron Model:
• Inputs (x1, x2, ..., xn): A linear neuron takes multiple input values (x1, x2, ..., xn). Each input
is associated with a weight (w1, w2... wn), which represents the importance of that
input.
• Weighted Sum (z): The weighted sum of inputs is computed as
z = w1*x1 + w2*x2 + ... + wn*xn
• Threshold (θ): The weighted sum is compared to a threshold (θ) to produce the
output.
• Output (y): If the weighted sum z is greater than or equal to the threshold θ, the
neuron's output is 1. Otherwise, the output is 0.
A linear neuron can be used for binary classification, where it acts as a simple decision-
maker, and the weights and threshold are adjusted to make correct classifications.
Widrow-Hoff Learning Rule:
The Widrow-Hoff (least mean squares) learning rule adjusts each weight in proportion to the error between the desired output (d) and the actual output (y):
wi(new) = wi(old) + α·(d − y)·xi
Where
wi(new) is the new weight and wi(old) is the current weight.
α is the learning rate, controlling the step size for weight updates.
(d − y) is the error, and xi is the input associated with weight wi.
• The learning rule adjusts the weights in the direction that reduces the error. It
continues to update the weights in an iterative process until the error is minimized or
converges to a satisfactory level.
The Widrow-Hoff learning rule is a foundational concept in machine learning and neural
networks, providing a mechanism for training linear neurons to make accurate binary
classifications or predictions in a supervised learning context.
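To make the rule concrete, here is a minimal sketch (with made-up data, learning rate, and epoch count) that trains a linear neuron with the Widrow-Hoff update to realize the AND function; the update uses the linear output w·x + b, and a 0.5 threshold is applied only when predicting.

# Sketch: Widrow-Hoff (LMS) training of a linear neuron on the AND function.
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
d = np.array([0.0, 0.0, 0.0, 1.0])            # desired outputs for AND

w, b, alpha = np.zeros(2), 0.0, 0.1           # weights, bias, learning rate

for epoch in range(100):
    for x_i, d_i in zip(X, d):
        y_i = w @ x_i + b                      # linear output y
        w += alpha * (d_i - y_i) * x_i         # w_new = w + alpha * (d - y) * x
        b += alpha * (d_i - y_i)               # bias treated as a weight on a constant input 1

predictions = (X @ w + b >= 0.5).astype(int)   # threshold the linear output
print("Predictions for AND:", predictions)     # expected [0 0 0 1]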
Error Correction Delta Rule:
The Error Correction Delta Rule, often referred to simply as the Delta Rule or the Delta
Learning Rule, is a supervised learning algorithm used to adjust the weights of artificial
neurons in a neural network, specifically in the context of supervised learning tasks. The
primary goal of this rule is to minimize the error between the actual output of the neuron
and the desired target output.
Components of the Error Correction Delta Rule:
• Actual Output (Y): This is the output produced by the artificial neuron or network
based on the current set of weights and inputs.
• Desired Target Output (D): This is the expected or correct output for the given input.
It's provided during the training phase.
• Error (E): The error is the difference between the actual output and the desired target
output:
E=D-Y
The goal of the Error Correction Delta Rule is to adjust the weights to minimize the error (E). The update for the ith weight wi is given by:
wi(new) = wi(old) + α·E·xi
Where
wi(new) is the new weight and wi(old) is the current weight.
α is the learning rate, controlling the step size for weight updates.
xi is the input associated with weight wi.
The training procedure is as follows:
• Compute the actual output (Y) for the given input using the current weights.
• Calculate the error (E) by taking the difference between the desired target output (D)
and the actual output (Y).
• Adjust each weight (wi) based on the weight update rule, considering the learning rate
(α).
This weight adjustment process is repeated iteratively for multiple data points during the
training process until the error converges to a satisfactory level, meaning that the difference
between the desired and actual outputs is minimized.
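The following is a brief sketch of this iterative procedure (compute Y, compute E = D − Y, update each weight by α·E·xi, and repeat until the error converges); the training data follows a made-up relationship D = 2*x1 + x2 so that an exact solution exists, and the learning rate, epoch limit, and stopping threshold are illustrative.

# Sketch: iterative training of a linear unit with the Delta Rule.
import numpy as np

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
D = np.array([2.0, 1.0, 3.0, 5.0])   # desired target outputs (D = 2*x1 + x2)

w, b, alpha = np.zeros(2), 0.0, 0.05

for epoch in range(5000):
    total_squared_error = 0.0
    for x, d_target in zip(X, D):
        y = w @ x + b                  # actual output Y
        e = d_target - y               # error E = D - Y
        w += alpha * e * x             # delta-rule weight update
        b += alpha * e
        total_squared_error += e ** 2
    if total_squared_error < 1e-6:     # stop once the error has converged
        break

print("Stopped after", epoch + 1, "epochs; w =", w, "b =", b)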
The Error Correction Delta Rule is a foundational concept in supervised learning for
neural networks. It's used to train the network by iteratively adjusting the weights to make
the network's predictions more accurate and aligned with the desired target outputs. The
choice of the learning rate is crucial, as it affects the speed and stability of the learning
process.