
Chapter 2


Training Machine Learning Algorithms for Classification

September 9, 2021

Biology

Logic Gate

Simple logic gate with binary outputs


Signals arrive at dendrites
Integrated into cell body
If signal exceeds threshold, generate output, and pass to axon
Rosenblatt Perceptron

Binary classification task


Positive class (1) vs. negative class (-1)
Define activation function φ(z)
Takes as input a dot product of input and weights
Net input: z = w1 x1 + · · · + wm xm

w = (w1, w2, . . . , wm)^T ,   x = (x1, x2, . . . , xm)^T

Heaviside step function

φ(z) known as activation


if activation above some threshold, predict class 1
predict class -1 otherwise
Heaviside Step Function
φ(z) =  1   if z ≥ θ
       −1   otherwise.

Step function simplified

Bring the threshold θ to the left side of the equation and define a
weight-zero as w0 = −θ and x0 = 1, so that we write z in a more
compact form

z = w0 x0 + w1 x1 + · · · + wm xm = w^T x

and

φ(z) =  1   if z ≥ 0
       −1   otherwise.
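The compact form above translates directly to code. A minimal NumPy sketch, assuming the bias weight w0 is stored as the first element of the weight vector (function names are illustrative, not from the notebook):

```python
import numpy as np

def net_input(w, X):
    # z = w0*x0 + w1*x1 + ... + wm*xm, with x0 = 1 absorbed into w[0]
    return np.dot(X, w[1:]) + w[0]

def predict(w, X):
    # Heaviside step with the threshold moved to 0
    return np.where(net_input(w, X) >= 0.0, 1, -1)
```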

Basic Linear Algebra

Vector dot product


z = w^T x = Σ_{j=0}^{m} wj xj

Example:

[1 2 3] × [4 5 6]^T = 1×4 + 2×5 + 3×6 = 32.
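The numeric example can be checked with NumPy's dot product:

```python
import numpy as np

w = np.array([1, 2, 3])
x = np.array([4, 5, 6])
z = w.dot(x)  # 1*4 + 2*5 + 3*6
print(z)      # 32
```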

Input squashed into a binary output

Rosenblatt perceptron algorithm

1. Initialize the weights to 0 or small random numbers.
2. For each training sample x^(i), perform the following steps:
   1. Compute the output value ŷ^(i)
   2. Update the weights

Weight update

Weight update rule:


wj := wj + ∆wj

Perceptron learning rule:

∆wj = η (y^(i) − ŷ^(i)) xj^(i)

Where η is the learning rate (a constant between 0.0 and 1.0), y (i)
is the true class label of the ith training sample, and ŷ (i) is the
predicted class label.
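One application of this rule to a single sample can be sketched as follows (illustrative names; the bias weight w0 is assumed to sit in w[0], updated with x0 = 1):

```python
import numpy as np

def perceptron_update(w, x_i, y_i, eta=0.1):
    # predict with the current weights
    z = np.dot(x_i, w[1:]) + w[0]
    y_hat = 1 if z >= 0.0 else -1
    # delta_wj = eta * (y_i - y_hat) * x_ij for every weight j
    delta = eta * (y_i - y_hat)
    w[1:] += delta * x_i
    w[0] += delta  # bias weight, since x0 = 1
    return w
```

Note that when the prediction is correct, delta is 0 and the weights stay unchanged, exactly as the update-rule examples below show.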

Update rule examples

Correct prediction, weights unchanged:

∆wj = η (−1 − (−1)) xj^(i) = 0
∆wj = η (1 − 1) xj^(i) = 0

Wrong prediction, weights pushed towards the positive or negative class:

∆wj = η (1 − (−1)) xj^(i) = η(2) xj^(i)
∆wj = η (−1 − 1) xj^(i) = η(−2) xj^(i)

Linear separability

Convergence

Convergence guaranteed if
The two classes are linearly separable
The learning rate is sufficiently small
If the classes cannot be separated:
Set a maximum number of passes over the training dataset
(epochs)
Set a threshold for the number of tolerated misclassifications
Otherwise, it will never stop updating the weights (never converge)

Linear separability

Perceptron implementation

IPython notebook on GitHub

ADAPtive LInear NEuron (Adaline)

Weights updated based on a linear activation function


Remember that the perceptron used a unit step function
φ(z) is simply the identity function of the net input

φ(w^T x) = w^T x


A quantizer is then used to predict class label

Adaline: notice the difference with perceptron

Cost functions

ML algorithms often define an objective function


This function is optimized during learning
It is often a cost function we want to minimize
Adaline uses a cost function J(·)
Defined as the sum of squared errors (SSE)

J(w) = (1/2) Σ_i (y^(i) − φ(z^(i)))^2
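The SSE cost can be evaluated in a few lines. A sketch, assuming the same weight layout as before (bias in w[0]; names illustrative):

```python
import numpy as np

def sse_cost(w, X, y):
    # Adaline activation is the identity: phi(z) = z
    output = np.dot(X, w[1:]) + w[0]
    errors = y - output
    # J(w) = 1/2 * sum of squared errors
    return 0.5 * (errors ** 2).sum()
```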

Advantages of Adaline cost function

The linear activation function is differentiable


Unlike the unit step function
Why derivatives?
We need to know how much each variable affects the output!
It is convex
Can use gradient descent to learn the weights

What is the gradient? Ask Wikipedia:

The gradient is a multi-variable generalization of the derivative.
While a derivative can be defined on functions of a single
variable, for functions of several variables, the gradient takes its
place.
Like the derivative, the gradient represents the slope of the
tangent of the graph of the function. More precisely, the
gradient points in the direction of the greatest rate of increase
of the function, and its magnitude is the slope of the graph in
that direction.

Gradient Descent

Gradient Descent: an intuition

https://www.analyticsvidhya.com/blog/2017/03/introduction-to-gradient-descent-algorithm-along-its-variants/

Gradient descent: an intuition

Suppose you are at the top of a mountain, and you have to
reach a lake which is at the lowest point of the mountain
(a.k.a. the valley). A twist: you are blindfolded.
The best way is to check the ground near you and observe
where the land tends to descend. This will give an idea in
what direction you should take your first step; subsequently,
keep following the descending path.

https://www.analyticsvidhya.com/blog/2017/03/introduction-to-gradient-descent-algorithm-along-its-variants/

Gradient Descent: an intuition

https://www.analyticsvidhya.com/blog/2017/03/introduction-to-gradient-descent-algorithm-along-its-variants/

Gradient Descent

Weights updated by taking small steps


Step size determined by learning rate
Take a step in the direction opposite to the gradient ∇J(w) of
the cost function
w := w + ∆w.
The weight change is defined as follows:

∆w = −η∇J(w)

Gradient computation

To compute the gradient of the cost function, we need to compute


the partial derivative of the cost function with respect to each
weight wj ,

∂J/∂wj = −Σ_i (y^(i) − φ(z^(i))) xj^(i)

Weight update of weight wj

∆wj = −η ∂J/∂wj = η Σ_i (y^(i) − φ(z^(i))) xj^(i)

We update all weights simultaneously, so the Adaline learning rule
becomes

w := w + ∆w.
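One full-batch update step can be sketched as follows (illustrative names; bias weight assumed in w[0]):

```python
import numpy as np

def adaline_gd_step(w, X, y, eta=0.01):
    output = np.dot(X, w[1:]) + w[0]    # phi(z) = z, linear activation
    errors = y - output                 # y(i) - phi(z(i)) for all samples i
    w[1:] += eta * np.dot(X.T, errors)  # all weights updated simultaneously
    w[0] += eta * errors.sum()          # bias term, x0 = 1
    return w
```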

Partial derivatives

∂J/∂wj = ∂/∂wj [ (1/2) Σ_i (y^(i) − φ(z^(i)))^2 ]
       = (1/2) ∂/∂wj Σ_i (y^(i) − φ(z^(i)))^2
       = (1/2) Σ_i 2 (y^(i) − φ(z^(i))) ∂/∂wj (y^(i) − φ(z^(i)))
       = Σ_i (y^(i) − φ(z^(i))) ∂/∂wj (y^(i) − Σ_j wj xj^(i))
       = Σ_i (y^(i) − φ(z^(i))) (−xj^(i))
       = −Σ_i (y^(i) − φ(z^(i))) xj^(i)

Adaline learning rule vs. Perceptron rule

Looks (almost) identical. What is the difference?


φ(z^(i)) with z^(i) = w^T x^(i) is a real number
And not an integer class label as in Perceptron
The weight update is done based on all samples in training set
Perceptron updates weights incrementally after each sample
This approach is known as “batch” gradient descent

Perceptron implementation

IPython notebook on GitHub

Lessons learned

Learning rate too high: error becomes larger (overshoots the
global minimum)
Learning rate too low: takes many epochs to converge
Feature normalization speeds up convergence
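Feature normalization here usually means standardization: shifting each feature to zero mean and unit variance. A quick sketch with a toy feature matrix:

```python
import numpy as np

X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])  # toy feature matrix, one row per sample

# standardize each column (feature): zero mean, unit standard deviation
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
```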

Stochastic gradient descent (SGD)

Large dataset with millions of data points (“big data”)


Batch gradient descent costly
Need to compute the error for the entire dataset ...
... to take one step towards the global minimum!
∆w = η Σ_i (y^(i) − φ(z^(i))) x^(i).

SGD updates the weights incrementally for each training
sample:

∆w = η (y^(i) − φ(z^(i))) x^(i).
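A single SGD update, sketched in the same layout as the batch version (bias in w[0]; names illustrative):

```python
import numpy as np

def sgd_step(w, x_i, y_i, eta=0.01):
    output = np.dot(x_i, w[1:]) + w[0]  # phi(z) = z
    error = y_i - output
    w[1:] += eta * error * x_i          # update from this one sample only
    w[0] += eta * error
    return w
```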

SGD details

Approximation of gradient descent


Reaches convergence faster because of frequent weight
updates
Important to present data in random order
Learning rate often gradually decreased (adaptive learning
rate)
Can be used for online learning
Middle ground between SGD and batch GD is known as
mini-batch learning
E.g. 50 examples at a time
Can use vector/matrix operations rather than loops as in SGD
Vectorized operations highly efficient
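The mini-batch middle ground can be sketched by shuffling the data once per epoch and applying the vectorized batch update 50 samples at a time (assumptions: bias in w[0], illustrative names):

```python
import numpy as np

def minibatch_epoch(w, X, y, eta=0.01, batch_size=50):
    idx = np.random.permutation(len(y))        # present data in random order
    for start in range(0, len(y), batch_size):
        b = idx[start:start + batch_size]
        output = np.dot(X[b], w[1:]) + w[0]    # vectorized over the batch
        errors = y[b] - output
        w[1:] += eta * np.dot(X[b].T, errors)  # one update per mini-batch
        w[0] += eta * errors.sum()
    return w
```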
