
Chapter 2


Training Machine Learning Algorithms for Classification

September 9, 2021

Biology

Logic Gate

Simple logic gate with binary outputs


Signals arrive at dendrites
Integrated into cell body
If signal exceeds threshold, generate output, and pass to axon
Rosenblatt Perceptron

Binary classification task


Positive class (1) vs. negative class (-1)
Define activation function φ(z)
Takes as input a dot product of input and weights
Net input: z = w1 x1 + · · · + wm xm

w = (w1, w2, . . . , wm)^T ,   x = (x1, x2, . . . , xm)^T

Heaviside step function

φ(z) known as activation


if activation above some threshold, predict class 1
predict class -1 otherwise
Heaviside Step Function
φ(z) =  1   if z ≥ θ
       −1   otherwise.

Step function simplified

Bring the threshold θ to the left side of the equation and define a
weight-zero as w0 = −θ and x0 = 1, so that we write z in a more
compact form

z = w0 x0 + w1 x1 + · · · + wm xm = w^T x

and

φ(z) =  1   if z ≥ 0
       −1   otherwise.
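The compact form above translates directly to code. A minimal NumPy sketch, assuming the bias weight w0 is stored as the first element of the weight vector (function names are illustrative, not from the notebook):

```python
import numpy as np

def net_input(w, X):
    # z = w0*x0 + w1*x1 + ... + wm*xm, with x0 = 1 absorbed into w[0]
    return np.dot(X, w[1:]) + w[0]

def predict(w, X):
    # Heaviside step with the threshold moved to 0
    return np.where(net_input(w, X) >= 0.0, 1, -1)
```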

Basic Linear Algebra

Vector dot product


z = w^T x = Σ_{j=0}^{m} wj xj

Example:

[1 2 3] × [4 5 6]^T = 1×4 + 2×5 + 3×6 = 32.
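The numeric example can be checked with NumPy's dot product:

```python
import numpy as np

w = np.array([1, 2, 3])
x = np.array([4, 5, 6])
z = w.dot(x)  # 1*4 + 2*5 + 3*6
print(z)      # 32
```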

Input squashed into a binary output

Rosenblatt perceptron algorithm

1. Initialize the weights to 0 or small random numbers.
2. For each training sample x^(i), perform the following steps:
   1. Compute the output value ŷ^(i)
   2. Update the weights

Weight update

Weight update rule:


wj := wj + ∆wj

Perceptron learning rule:

∆wj = η (y^(i) − ŷ^(i)) xj^(i)

Where η is the learning rate (a constant between 0.0 and 1.0), y (i)
is the true class label of the ith training sample, and ŷ (i) is the
predicted class label.
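One application of this rule to a single sample can be sketched as follows (illustrative names; the bias weight w0 is assumed to sit in w[0], updated with x0 = 1):

```python
import numpy as np

def perceptron_update(w, x_i, y_i, eta=0.1):
    # predict with the current weights
    z = np.dot(x_i, w[1:]) + w[0]
    y_hat = 1 if z >= 0.0 else -1
    # delta_wj = eta * (y_i - y_hat) * x_ij for every weight j
    delta = eta * (y_i - y_hat)
    w[1:] += delta * x_i
    w[0] += delta  # bias weight, since x0 = 1
    return w
```

Note that when the prediction is correct, delta is 0 and the weights stay unchanged, exactly as the update-rule examples below show.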

Update rule examples

Correct prediction, weights unchanged:

∆wj = η (−1 − (−1)) xj^(i) = 0
∆wj = η (1 − 1) xj^(i) = 0

Wrong prediction, weights pushed towards the positive or negative class:

∆wj = η (1 − (−1)) xj^(i) = η(2) xj^(i)
∆wj = η (−1 − 1) xj^(i) = η(−2) xj^(i)

Linear separability

Convergence

Convergence guaranteed if
The two classes are linearly separable
The learning rate is sufficiently small
If the classes cannot be separated:
Set a maximum number of passes over the training dataset
(epochs)
Set a threshold for the number of tolerated misclassifications
Otherwise, it will never stop updating the weights (never converge)

Linear separability

Perceptron implementation

IPython notebook on GitHub

ADAPtive LInear NEuron (Adaline)

Weights updated based on a linear activation function


Remember that the perceptron used a unit step function
φ(z) is simply the identity function of the net input

φ(w^T x) = w^T x


A quantizer is then used to predict class label

Adaline: notice the difference with perceptron

Cost functions

ML algorithms often define an objective function


This function is optimized during learning
It is often a cost function we want to minimize
Adaline uses a cost function J(·)
Defined as the sum of squared errors (SSE)

J(w) = (1/2) Σ_i (y^(i) − φ(z^(i)))^2
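The SSE cost can be evaluated in a few lines. A sketch, assuming the same weight layout as before (bias in w[0]; names illustrative):

```python
import numpy as np

def sse_cost(w, X, y):
    # Adaline activation is the identity: phi(z) = z
    output = np.dot(X, w[1:]) + w[0]
    errors = y - output
    # J(w) = 1/2 * sum of squared errors
    return 0.5 * (errors ** 2).sum()
```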

Advantages of Adaline cost function

The linear activation function is differentiable


Unlike the unit step function
Why derivatives?
We need to know how much each variable affects the output!
It is convex
Can use gradient descent to learn the weights

What is the gradient? Ask Wikipedia:

The gradient is a multi-variable generalization of the derivative.
While a derivative can be defined on functions of a single
variable, for functions of several variables, the gradient takes its
place.
Like the derivative, the gradient represents the slope of the
tangent of the graph of the function. More precisely, the
gradient points in the direction of the greatest rate of increase
of the function, and its magnitude is the slope of the graph in
that direction.

Gradient Descent

Gradient Descent: an intuition

https://www.analyticsvidhya.com/blog/2017/03/introduction-to-gradient-descent-algorithm-along-its-variants/

Gradient descent: an intuition

Suppose you are at the top of a mountain, and you have to
reach a lake which is at the lowest point of the mountain
(a.k.a. the valley). A twist: you are blindfolded.
The best way is to check the ground near you and observe
where the land tends to descend. This will give an idea in
what direction you should take your first step; subsequently,
keep following the descending path.

https://www.analyticsvidhya.com/blog/2017/03/introduction-to-gradient-descent-algorithm-along-its-variants/

Gradient Descent: an intuition

https://www.analyticsvidhya.com/blog/2017/03/introduction-to-gradient-descent-algorithm-along-its-variants/

Gradient Descent

Weights updated by taking small steps


Step size determined by learning rate
Take a step in the direction opposite to the gradient ∇J(w) of
the cost function
w := w + ∆w.
The weight change is defined as follows:

∆w = −η∇J(w)

Gradient computation

To compute the gradient of the cost function, we need to compute


the partial derivative of the cost function with respect to each
weight wj ,

∂J/∂wj = −Σ_i (y^(i) − φ(z^(i))) xj^(i)

Weight update of weight wj

∆wj = −η ∂J/∂wj = η Σ_i (y^(i) − φ(z^(i))) xj^(i)

We update all weights simultaneously, so the Adaline learning rule
becomes

w := w + ∆w.
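One full-batch update step can be sketched as follows (illustrative names; bias weight assumed in w[0]):

```python
import numpy as np

def adaline_gd_step(w, X, y, eta=0.01):
    output = np.dot(X, w[1:]) + w[0]    # phi(z) = z, linear activation
    errors = y - output                 # y(i) - phi(z(i)) for all samples i
    w[1:] += eta * np.dot(X.T, errors)  # all weights updated simultaneously
    w[0] += eta * errors.sum()          # bias term, x0 = 1
    return w
```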

Partial derivatives

∂J/∂wj = ∂/∂wj [ (1/2) Σ_i (y^(i) − φ(z^(i)))^2 ]
       = (1/2) ∂/∂wj Σ_i (y^(i) − φ(z^(i)))^2
       = (1/2) Σ_i 2 (y^(i) − φ(z^(i))) ∂/∂wj (y^(i) − φ(z^(i)))
       = Σ_i (y^(i) − φ(z^(i))) ∂/∂wj (y^(i) − Σ_j wj xj^(i))
       = Σ_i (y^(i) − φ(z^(i))) (−xj^(i))
       = −Σ_i (y^(i) − φ(z^(i))) xj^(i)

Adaline learning rule vs. Perceptron rule

Looks (almost) identical. What is the difference?


φ(z^(i)) with z^(i) = w^T x^(i) is a real number
And not an integer class label as in Perceptron
The weight update is done based on all samples in training set
Perceptron updates weights incrementally after each sample
This approach is known as “batch” gradient descent

Perceptron implementation

IPython notebook on GitHub

Lessons learned

Learning rate too high: error becomes larger (overshoots the
global minimum)
Learning rate too low: takes many epochs to converge
Feature normalization speeds up convergence
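Feature normalization here usually means standardization: shifting each feature to zero mean and unit variance. A quick sketch with a toy feature matrix:

```python
import numpy as np

X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])  # toy feature matrix, one row per sample

# standardize each column (feature): zero mean, unit standard deviation
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
```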

Stochastic gradient descent (SGD)

Large dataset with millions of data points (“big data”)


Batch gradient descent costly
Need to compute the error for the entire dataset ...
... to take one step towards the global minimum!
∆w = η Σ_i (y^(i) − φ(z^(i))) x^(i).

SGD updates the weights incrementally for each training
sample:

∆w = η (y^(i) − φ(z^(i))) x^(i).
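A single SGD update, sketched in the same layout as the batch version (bias in w[0]; names illustrative):

```python
import numpy as np

def sgd_step(w, x_i, y_i, eta=0.01):
    output = np.dot(x_i, w[1:]) + w[0]  # phi(z) = z
    error = y_i - output
    w[1:] += eta * error * x_i          # update from this one sample only
    w[0] += eta * error
    return w
```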

SGD details

Approximation of gradient descent


Reaches convergence faster because of frequent weight
updates
Important to present data in random order
Learning rate often gradually decreased (adaptive learning
rate)
Can be used for online learning
Middle ground between SGD and batch GD is known as
mini-batch learning
E.g. 50 examples at a time
Can use vector/matrix operations rather than loops as in SGD
Vectorized operations highly efficient
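The mini-batch middle ground can be sketched by shuffling the data once per epoch and applying the vectorized batch update 50 samples at a time (assumptions: bias in w[0], illustrative names):

```python
import numpy as np

def minibatch_epoch(w, X, y, eta=0.01, batch_size=50):
    idx = np.random.permutation(len(y))        # present data in random order
    for start in range(0, len(y), batch_size):
        b = idx[start:start + batch_size]
        output = np.dot(X[b], w[1:]) + w[0]    # vectorized over the batch
        errors = y[b] - output
        w[1:] += eta * np.dot(X[b].T, errors)  # one update per mini-batch
        w[0] += eta * errors.sum()
    return w
```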
