Chapter 2
September 9, 2021
Biology
Logic Gate
$$\mathbf{w} = \begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_m \end{bmatrix},
\qquad
\mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_m \end{bmatrix}$$
Heaviside step function
Step function simplified
Bring the threshold θ to the left side of the equation and define a zero-weight as w_0 = −θ and x_0 = 1, so that we can write z in a more compact form
$$z = w_0 x_0 + w_1 x_1 + \cdots + w_m x_m = \mathbf{w}^T \mathbf{x}$$
and
$$\phi(z) = \begin{cases} 1 & \text{if } z \ge 0 \\ -1 & \text{otherwise.} \end{cases}$$
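As a rough sketch (the function names net_input and predict are mine, not from the slides), the net input z and the decision function can be written with NumPy:

    import numpy as np

    def net_input(w, x):
        """Net input z = w^T x, with w[0] acting as the bias weight (x_0 = 1)."""
        return np.dot(x, w[1:]) + w[0]

    def predict(w, x):
        """Unit step function: 1 if z >= 0, otherwise -1."""
        return np.where(net_input(w, x) >= 0.0, 1, -1)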
Basic Linear Algebra
$$\begin{bmatrix} 1 & 2 & 3 \end{bmatrix} \times \begin{bmatrix} 4 \\ 5 \\ 6 \end{bmatrix} = 1 \times 4 + 2 \times 5 + 3 \times 6 = 32.$$
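The same dot product can be checked with NumPy (a quick sketch, not part of the original slide):

    import numpy as np

    a = np.array([1, 2, 3])   # row vector
    b = np.array([4, 5, 6])   # column vector, written here as a 1-D array
    print(np.dot(a, b))       # 1*4 + 2*5 + 3*6 = 32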
Input squashed into a binary output
Rosenblatt perceptron algorithm
Weight update
The perceptron updates each weight as

$$w_j := w_j + \Delta w_j, \qquad \Delta w_j = \eta \left(y^{(i)} - \hat{y}^{(i)}\right) x_j^{(i)},$$

where η is the learning rate (a constant between 0.0 and 1.0), y^(i) is the true class label of the i-th training sample, and ŷ^(i) is the predicted class label.
Update rule examples
For a correctly classified sample, e.g. y^(i) = ŷ^(i) = 1, the weights stay unchanged:

$$\Delta w_j = \eta \, (1 - 1) \, x_j^{(i)} = 0$$
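A small numeric sketch of this behaviour (the values of eta and x_i are made up):

    import numpy as np

    eta = 0.1                      # learning rate
    x_i = np.array([2.0, 3.0])     # one training sample

    # Correct prediction (y = y_hat = 1): the update is zero.
    print(eta * (1 - 1) * x_i)     # [0. 0.]

    # Misclassification (y = 1, y_hat = -1): weights are pushed
    # towards the positive class.
    print(eta * (1 - (-1)) * x_i)  # [0.4 0.6]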
Linear separability
Convergence
Convergence is guaranteed if:
The two classes are linearly separable
The learning rate is sufficiently small
If the classes cannot be separated (see the sketch after this list):
Set a maximum number of passes over the training dataset (epochs)
Set a threshold for the number of tolerated misclassifications
Otherwise, the perceptron will never stop updating the weights, i.e. it will never converge
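A minimal sketch of both stopping criteria on made-up toy data (all names here are assumptions, not from the slides):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(20, 2))                 # toy data
    y = np.where(X[:, 0] + X[:, 1] >= 0, 1, -1)  # linearly separable labels

    eta, max_epochs, tolerated_errors = 0.1, 100, 0
    w = np.zeros(1 + X.shape[1])                 # w[0] plays the role of the bias

    for epoch in range(max_epochs):              # criterion 1: cap on the number of epochs
        errors = 0
        for x_i, y_i in zip(X, y):
            y_hat = 1 if np.dot(x_i, w[1:]) + w[0] >= 0 else -1
            update = eta * (y_i - y_hat)
            w[1:] += update * x_i
            w[0] += update
            errors += int(update != 0.0)
        if errors <= tolerated_errors:           # criterion 2: tolerated misclassifications
            break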
Linear separability
Perceptron implementation
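A minimal perceptron sketch in NumPy, following the update rule above (my own sketch, not necessarily the chapter's exact implementation; all names are illustrative):

    import numpy as np

    class Perceptron:
        """Rosenblatt perceptron classifier (minimal sketch)."""

        def __init__(self, eta=0.01, n_iter=10):
            self.eta = eta          # learning rate (between 0.0 and 1.0)
            self.n_iter = n_iter    # passes over the training dataset (epochs)

        def net_input(self, X):
            # z = w^T x, with w_[0] as the bias weight (x_0 = 1)
            return np.dot(X, self.w_[1:]) + self.w_[0]

        def predict(self, X):
            # unit step function
            return np.where(self.net_input(X) >= 0.0, 1, -1)

        def fit(self, X, y):
            self.w_ = np.zeros(1 + X.shape[1])
            self.errors_ = []       # misclassifications per epoch
            for _ in range(self.n_iter):
                errors = 0
                for x_i, target in zip(X, y):
                    update = self.eta * (target - self.predict(x_i))
                    self.w_[1:] += update * x_i
                    self.w_[0] += update
                    errors += int(update != 0.0)
                self.errors_.append(errors)
            return self

Usage, for example: Perceptron(eta=0.1, n_iter=10).fit(X, y) on a two-class dataset with labels 1 and −1.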
ADAPtive LInear NEuron (Adaline)
$$\phi\!\left(\mathbf{w}^T \mathbf{x}\right) = \mathbf{w}^T \mathbf{x}$$
Adaline: notice the difference from the perceptron
Cost functions
$$J(\mathbf{w}) = \frac{1}{2} \sum_i \left(y^{(i)} - \phi\!\left(z^{(i)}\right)\right)^2$$
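A small sketch of this cost in NumPy, using the fact that φ(z) = z for Adaline (function and variable names are assumptions):

    import numpy as np

    def sse_cost(w, X, y):
        """Sum of squared errors J(w) = 1/2 * sum_i (y_i - phi(z_i))**2."""
        output = np.dot(X, w[1:]) + w[0]   # phi(z) = z for Adaline
        errors = y - output
        return 0.5 * np.sum(errors ** 2)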
Advantages of Adaline cost function
What is the gradient? Ask Wikipedia:
Gradient Descent
Gradient Descent: an intuition
https://www.analyticsvidhya.com/blog/2017/03/introduction-to-gradient-descent-algorithm-along-its-variants/
Gradient Descent
∆w = −η∇J(w)
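As a toy illustration of this rule (my own example, not from the slides), gradient descent on the one-dimensional cost J(w) = w² quickly walks towards the minimum at w = 0:

    eta = 0.1          # learning rate
    w = 5.0            # arbitrary starting point

    def grad_J(w):
        return 2 * w   # gradient of the toy cost J(w) = w**2

    for _ in range(50):
        w += -eta * grad_J(w)   # w := w + delta_w,  delta_w = -eta * grad J(w)

    print(w)           # close to the minimum at w = 0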
Gradient computation
$$\frac{\partial J}{\partial w_j} = -\sum_i \left(y^{(i)} - \phi\!\left(z^{(i)}\right)\right) x_j^{(i)},$$

$$\Delta w_j = -\eta \frac{\partial J}{\partial w_j} = \eta \sum_i \left(y^{(i)} - \phi\!\left(z^{(i)}\right)\right) x_j^{(i)}$$

$$\mathbf{w} := \mathbf{w} + \Delta \mathbf{w}.$$
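A sketch of one full-batch Adaline update implementing these formulas (array and function names are assumptions):

    import numpy as np

    def adaline_gd_step(w, X, y, eta=0.01):
        """One batch update: w := w + eta * sum_i (y_i - phi(z_i)) * x_i."""
        output = np.dot(X, w[1:]) + w[0]     # phi(z^(i)) = z^(i) for Adaline
        errors = y - output                  # y^(i) - phi(z^(i))
        w = w.copy()
        w[1:] += eta * np.dot(X.T, errors)   # equals -eta * dJ/dw_j for each j
        w[0] += eta * errors.sum()           # bias weight (x_0 = 1)
        return w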
Partial derivatives
$$\begin{aligned}
\frac{\partial J}{\partial w_j}
&= \frac{\partial}{\partial w_j} \, \frac{1}{2} \sum_i \left(y^{(i)} - \phi\!\left(z^{(i)}\right)\right)^2 \\
&= \frac{1}{2} \, \frac{\partial}{\partial w_j} \sum_i \left(y^{(i)} - \phi\!\left(z^{(i)}\right)\right)^2 \\
&= \frac{1}{2} \sum_i 2 \left(y^{(i)} - \phi\!\left(z^{(i)}\right)\right) \frac{\partial}{\partial w_j} \left(y^{(i)} - \phi\!\left(z^{(i)}\right)\right) \\
&= \sum_i \left(y^{(i)} - \phi\!\left(z^{(i)}\right)\right) \frac{\partial}{\partial w_j} \left(y^{(i)} - \sum_k w_k x_k^{(i)}\right) \\
&= \sum_i \left(y^{(i)} - \phi\!\left(z^{(i)}\right)\right) \left(-x_j^{(i)}\right) \\
&= -\sum_i \left(y^{(i)} - \phi\!\left(z^{(i)}\right)\right) x_j^{(i)}
\end{aligned}$$

(The inner sum runs over the features k, so only the term with k = j survives the differentiation.)
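One way to double-check the result of this derivation is to compare the analytic gradient against a central finite-difference estimate on made-up data (a sketch, all names are assumptions):

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.normal(size=(5, 3))
    y = rng.normal(size=5)
    w = rng.normal(size=1 + X.shape[1])      # w[0] is the bias weight

    def cost(w):
        errors = y - (np.dot(X, w[1:]) + w[0])
        return 0.5 * np.sum(errors ** 2)

    # Analytic gradient: dJ/dw_j = -sum_i (y^(i) - phi(z^(i))) * x_j^(i)
    errors = y - (np.dot(X, w[1:]) + w[0])
    analytic = -np.concatenate(([errors.sum()], np.dot(X.T, errors)))

    # Numerical gradient via central differences
    eps = 1e-6
    numeric = np.array([(cost(w + eps * e) - cost(w - eps * e)) / (2 * eps)
                        for e in np.eye(len(w))])
    print(np.allclose(analytic, numeric))    # expected to print True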
Adaline learning rule vs. Perceptron rule
Perceptron implementation
Lessons learned
Stochastic gradient descent (SGD)
SGD details
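A sketch of the per-sample (stochastic) variant of the Adaline update, with the training data shuffled every epoch (function and variable names are assumptions):

    import numpy as np

    def adaline_sgd_epoch(w, X, y, eta=0.01, rng=None):
        """One epoch of SGD: update the weights after every single sample."""
        if rng is None:
            rng = np.random.default_rng(0)
        w = w.copy()
        for i in rng.permutation(len(y)):    # shuffle to avoid update cycles
            x_i, target = X[i], y[i]
            error = target - (np.dot(x_i, w[1:]) + w[0])
            w[1:] += eta * error * x_i       # incremental weight update
            w[0] += eta * error
        return w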