PRu 4
In the field of machine learning, the goal of statistical classification is to use an object's
characteristics to identify which class (or group) it belongs to. A linear classifier achieves this by
making a classification decision based on the value of a linear combination of the characteristics.
In other words, a linear classifier is a model that assigns data points to discrete classes based on
a linear combination of their explanatory variables. As an example, a model could combine details
about a dog such as weight, height, colour and other features to decide its breed. The
effectiveness of these models lies in their ability to find a mathematical combination of features
that groups data points together when they have the same class and separates them when they
have different classes, providing us with clear boundaries for how to classify.
If each instance belongs to one and only one class, then our input data can be divided into decision
regions separated by decision boundaries.
Discriminant Functions
A two-category classifier with a discriminant function of the form g(x) = wᵀx + w0 uses the
following rule:
Decide ω1 if g(x) > 0 and ω2 if g(x) < 0
⇔ Decide ω1 if wᵀx > −w0 and ω2 otherwise
If g(x) = 0 ⇒ x can be assigned to either class
• The equation g(x) = 0 defines the decision surface that separates points assigned to the
category ω1 from points assigned to the category ω2
• When g(x) is linear, the decision surface is a hyperplane
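To make this rule concrete, the following is a minimal Python sketch of the two-category decision rule; the weight vector w, threshold w0 and test point are illustrative assumptions rather than values from the text.

import numpy as np

def g(x, w, w0):
    # Linear discriminant g(x) = w^T x + w0
    return np.dot(w, x) + w0

def decide(x, w, w0):
    # Decide omega_1 if g(x) > 0, omega_2 if g(x) < 0; a tie (g(x) = 0) may go to either class
    return "omega_1" if g(x, w, w0) > 0 else "omega_2"

w, w0 = np.array([2.0, -1.0]), -0.5          # assumed example parameters
print(decide(np.array([1.0, 0.5]), w, w0))   # g = 2.0 - 0.5 - 0.5 = 1.0 > 0, so omega_1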
Decision Hyperplanes
• a hyperplane is a subspace whose dimension is one less than that of its ambient space. For
example, if a space is 3-dimensional then its hyperplanes are the 2-dimensional planes,
while if the space is 2-dimensional, its hyperplanes are the 1-dimensional lines.
• A decision boundary is a hypersurface that partitions the underlying vector space into two
sets, one for each class. A general (curved) hypersurface in a low-dimensional space can
become a hyperplane when the data are mapped into a space of much higher dimension.
• In a low-dimensional space, 'hyperplane' and 'decision boundary' are equivalent: 'plane'
implies straight and flat, so the boundary is a line or a plane that separates the data sets.
When a non-linear operation maps the data to a new feature space, the decision boundary
is still a hyperplane in that space, but it is no longer a plane in the original space.
Linear Discriminant Functions and Decision Hyperplanes
Let us once more focus on the two-class case and consider linear discriminant functions. Then the
respective decision hypersurface in the l-dimensional feature space is a hyperplane, that is
g(x) = wᵀx + w0 = 0
where w = [w1, w2, …, wl]ᵀ is known as the weight vector and w0 as the threshold. If x1, x2 are two
points on the decision hyperplane, then the following is valid:
0 = wᵀx1 + w0 = wᵀx2 + w0 ⇒ wᵀ(x1 − x2) = 0
Since the difference vector x1 − x2 obviously lies on the decision hyperplane (for any x1, x2), it is
apparent from the equation above that the vector w is orthogonal to the decision hyperplane.
The corresponding geometry (for w1 > 0, w2 > 0, w0 < 0) involves the distance d of the hyperplane
from the origin and the distance z of a point x from the hyperplane. Recalling our high school
math, it is easy to see that these quantities are given by
d = |w0| / √(w1² + w2²) and z = |g(x)| / √(w1² + w2²)
In other words, |g(x)| is a measure of the Euclidean distance of the point x from the decision
hyperplane. On one side of the plane g(x) takes positive values and on the other negative. In the
special case that w0 = 0, the hyperplane passes through the origin.
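As a quick check of these formulas, here is a small Python sketch that computes d and z for a 2-dimensional hyperplane; the weights and test point below are assumed example values.

import numpy as np

w = np.array([3.0, 4.0])   # assumed weight vector, so ||w|| = 5
w0 = -5.0                  # assumed threshold

def distance_to_hyperplane(x):
    # z = |g(x)| / ||w||, the Euclidean distance of x from the hyperplane
    return abs(np.dot(w, x) + w0) / np.linalg.norm(w)

d = abs(w0) / np.linalg.norm(w)                    # |w0| / sqrt(w1^2 + w2^2) = 1.0
z = distance_to_hyperplane(np.array([2.0, 3.0]))   # |6 + 12 - 5| / 5 = 2.6
print(d, z)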
The Perceptron algorithm
https://www.youtube.com/watch?v=1XkjVl-j8MM
• The Perceptron algorithm is a two-class (binary) classification machine learning algorithm.
• It is a type of neural network model, perhaps the simplest type of neural network model.
• It consists of a single node or neuron that takes a row of data as input and predicts a class
label. This is achieved by calculating the weighted sum of the inputs plus a bias (whose
input is fixed at 1). The weighted sum of the inputs of the model is called the activation.
Activation = Weights * Inputs + Bias
If the activation is above 0.0, the model will output 1.0; otherwise, it will output 0.0.
Predict 1: If Activation > 0.0
Predict 0: If Activation <= 0.0
• The Perceptron model is regarded as one of the simplest and most effective types of
artificial neural network. It is a supervised learning algorithm for binary classifiers, and can
be considered a single-layer neural network with four main parameters: input values,
weights and bias, net sum, and an activation function.
Perceptron models are divided into two types. These are as follows:
• Single-layer Perceptron Model
• Multi-layer Perceptron model
Characteristics of Perceptron
• Perceptron is a machine learning algorithm for supervised learning of binary classifiers.
• In Perceptron, the weight coefficient is automatically learned.
• Initially, weights are multiplied with input features, and the decision is made whether the
neuron is fired or not.
• The activation function applies a step rule to check whether the weighted sum is greater
than zero.
• The linear decision boundary is drawn, enabling the distinction between the two linearly
separable classes +1 and -1.
• If the weighted sum of all input values exceeds the threshold value, the neuron produces
an output signal; otherwise, no output is produced (a runnable sketch of this rule follows
below).
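A minimal, self-contained sketch of the perceptron just described is given below, assuming NumPy and a toy AND dataset; the learning rate and epoch count are illustrative choices, not values from the text.

import numpy as np

def perceptron_train(X, y, lr=0.1, epochs=50):
    # Learn weights w and bias b with the perceptron update rule:
    # nudge the parameters only when a sample is misclassified.
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, target in zip(X, y):
            activation = np.dot(w, xi) + b
            pred = 1.0 if activation > 0 else 0.0
            update = lr * (target - pred)   # zero when the prediction is correct
            w += update * xi
            b += update
    return w, b

# Toy linearly separable data: the logical AND function (assumed example)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 0, 0, 1], dtype=float)
w, b = perceptron_train(X, y)
print([1.0 if np.dot(w, xi) + b > 0 else 0.0 for xi in X])   # [0.0, 0.0, 0.0, 1.0]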
Example Problem:
Find the MSE for the following set of values: (43,41), (44,45), (45,49), (46,47), (47,44).
Step 1: Find the regression line. Fitting a least-squares line to these points gives y = 9.2 +
0.8x.
Step 2: Find the new Y’ values:
• 9.2 + 0.8(43) = 43.6
• 9.2 + 0.8(44) = 44.4
• 9.2 + 0.8(45) = 45.2
• 9.2 + 0.8(46) = 46
• 9.2 + 0.8(47) = 46.8
Step 3: Find the error (Y – Y’):
• 41 – 43.6 = -2.6
• 45 – 44.4 = 0.6
• 49 – 45.2 = 3.8
• 47 – 46 = 1
• 44 – 46.8 = -2.8
Step 4: Square the Errors:
• (−2.6)² = 6.76
• 0.6² = 0.36
• 3.8² = 14.44
• 1² = 1
• (−2.8)² = 7.84
Step 5: Add all of the squared errors up: 6.76 + 0.36 + 14.44 + 1 + 7.84 = 30.4.
Step 6: Find the mean squared error:
• 30.4 / 5 = 6.08
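The whole calculation can be verified with a few lines of Python (assuming NumPy is available); np.polyfit performs the least-squares fit used in Step 1.

import numpy as np

x = np.array([43, 44, 45, 46, 47])
y = np.array([41, 45, 49, 47, 44])

slope, intercept = np.polyfit(x, y, 1)   # degree-1 fit: 0.8 and 9.2
y_pred = intercept + slope * x           # [43.6, 44.4, 45.2, 46.0, 46.8]
mse = np.mean((y - y_pred) ** 2)         # 6.08
print(f"y = {intercept:.1f} + {slope:.1f}x, MSE = {mse:.2f}")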
Stochastic Approximation of LMS Algorithm
LMS or Gradient Descent
• The least mean square algorithm uses a technique called “method of steepest
descent” and continuously estimates results by updating filter weights. Through the
principle of algorithm convergence, the least mean square algorithm provides
particular learning curves useful in machine learning theory and implementation.
• Gradient Descent is a generic optimization algorithm capable of finding optimal
solutions to a wide range of problems.
• An important parameter of Gradient Descent (GD) is the size of the steps,
determined by the learning rate hyperparameter. If the learning rate is too small,
then the algorithm will have to go through many iterations to converge, which will
take a long time; if it is too high, we may overshoot the optimal value (see the
sketch below).
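The sketch below illustrates this trade-off on a one-parameter quadratic loss; the loss function and learning rates are illustrative assumptions, not from the text.

def gradient_descent(lr, steps=20, theta=10.0):
    # Minimise loss(theta) = theta**2, whose gradient is 2*theta
    for _ in range(steps):
        theta -= lr * (2 * theta)
    return theta

print(gradient_descent(lr=0.01))   # too small: after 20 steps, still far from the minimum at 0
print(gradient_descent(lr=0.1))    # reasonable: close to 0
print(gradient_descent(lr=1.1))    # too large: each step overshoots and the iterates diverge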
Stochastic Gradient Descent
• The word ‘stochastic’ means a system or process linked with a random probability. Hence,
in Stochastic Gradient Descent, a few samples are selected randomly instead of the whole
data set for each iteration. In Gradient Descent, there is a term called “batch” which
denotes the total number of samples from a dataset that is used for calculating the gradient
in each iteration.
• So, in SGD, we find out the gradient of the cost function of a single example at each
iteration instead of the sum of the gradient of the cost function of all the examples.
• In SGD, since only one sample from the dataset is chosen at random for each iteration, the
path taken by the algorithm to reach the minimum is usually noisier than in typical
Gradient Descent. But that doesn’t matter much, because the path taken by the algorithm
does not matter as long as we reach the minimum, and with a significantly shorter training
time.
• Stochastic gradient descent is an optimization algorithm often used in machine learning
applications to find the model parameters that correspond to the best fit between
predicted and actual outputs. It’s an inexact but powerful technique.
• Stochastic gradient descent is widely used in machine learning applications. Combined with
backpropagation, it’s dominant in neural network training applications.
Algorithm
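The algorithm itself is not reproduced here; what follows is a hedged sketch of a stochastic LMS update for a linear model y ≈ wᵀx + b, assuming NumPy and a toy dataset generated from y = 2x + 1.

import numpy as np

rng = np.random.default_rng(0)

def sgd_lms(X, y, lr=0.05, steps=5000):
    # One randomly chosen sample per step; the LMS (Widrow-Hoff) update moves
    # the parameters along the negative gradient of that sample's squared error.
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        i = rng.integers(len(X))
        error = y[i] - (np.dot(w, X[i]) + b)
        w += lr * error * X[i]
        b += lr * error
    return w, b

# Toy data from y = 2x + 1 plus small noise (assumed example)
X = rng.uniform(0, 1, size=(100, 1))
y = 2 * X[:, 0] + 1 + rng.normal(0, 0.05, size=100)
w, b = sgd_lms(X, y)
print(w, b)   # roughly [2.0] and 1.0, up to SGD noise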
Sum of Error Estimate / Standard Error Estimation
Standard Error Meaning
The standard error is one of the mathematical tools used in statistics to estimate the variability. It
is abbreviated as SE. The standard error of a statistic or an estimate of a parameter is the standard
deviation of its sampling distribution. We can define it as an estimate of that standard deviation.
Standard Error Formula
The accuracy of a sample that describes a population is identified through the SE formula. The
sample mean deviates from the population mean, and that deviation is given by
SE = s / √n, where s = √( Σ(xi − x̄)² / (n − 1) )
where xi stands for the data values, x̄ is the mean value and n is the sample size.
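A brief sketch of this computation in Python, using the standard library's statistics module; the sample values are illustrative.

import math
import statistics

data = [43, 44, 45, 46, 47]    # assumed sample
s = statistics.stdev(data)     # sample standard deviation (n - 1 in the denominator)
se = s / math.sqrt(len(data))
print(se)                      # about 0.707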