Lecture 3
Cost function:

$$J(w) = \frac{1}{2m} \sum_{i=1}^{m} \left(z_i - w x_i\right)^2$$

Gradient descent update for $w$:

$$w^{1} = w^{0} - \eta \frac{\partial J}{\partial w}$$
Example revisited:
Updating $w_2$

$$w_2 = w_2 - \eta \frac{\partial E}{\partial w_2}$$

• $E = \frac{1}{2}(T - a_2)^2$, where $T$ is the true outcome $\rightarrow \frac{\partial E}{\partial a_2} = -(T - a_2)$
• $a_2 = f(z_2) = \frac{1}{1 + e^{-z_2}} \rightarrow \frac{\partial a_2}{\partial z_2} = a_2(1 - a_2)$
• $z_2 = a_1 \cdot w_2 + b_2 \rightarrow \frac{\partial z_2}{\partial w_2} = a_1$

By the chain rule,

$$\frac{\partial E}{\partial w_2} = \frac{\partial E}{\partial a_2} \cdot \frac{\partial a_2}{\partial z_2} \cdot \frac{\partial z_2}{\partial w_2} = (-(T - a_2)) \cdot (a_2(1 - a_2)) \cdot a_1$$

Therefore,

$$w_2 = w_2 - \eta \frac{\partial E}{\partial w_2} = w_2 - \eta \cdot (-(T - a_2)) \cdot (a_2(1 - a_2)) \cdot a_1$$
Updating $b_2$

$$b_2 = b_2 - \eta \frac{\partial E}{\partial b_2}$$

$$\frac{\partial E}{\partial b_2} = \frac{\partial E}{\partial a_2} \cdot \frac{\partial a_2}{\partial z_2} \cdot \frac{\partial z_2}{\partial b_2} = (-(T - a_2)) \cdot (a_2(1 - a_2)) \cdot 1$$

Thus,

$$b_2 = b_2 - \eta \frac{\partial E}{\partial b_2} = b_2 - \eta \cdot (-(T - a_2)) \cdot (a_2(1 - a_2)) \cdot 1$$
Updating $w_1$

$$w_1 = w_1 - \eta \frac{\partial E}{\partial w_1}$$

• $E = \frac{1}{2}(T - a_2)^2$, where $T$ is the true outcome $\rightarrow \frac{\partial E}{\partial a_2} = -(T - a_2)$
• $a_2 = f(z_2) = \frac{1}{1 + e^{-z_2}} \rightarrow \frac{\partial a_2}{\partial z_2} = a_2(1 - a_2)$
• $z_2 = a_1 \cdot w_2 + b_2 \rightarrow \frac{\partial z_2}{\partial a_1} = w_2$
• $a_1 = f(z_1) = \frac{1}{1 + e^{-z_1}} \rightarrow \frac{\partial a_1}{\partial z_1} = a_1(1 - a_1)$
• $z_1 = x_1 \cdot w_1 + b_1 \rightarrow \frac{\partial z_1}{\partial w_1} = x_1$

Thus,

$$\frac{\partial E}{\partial w_1} = \frac{\partial E}{\partial a_2} \cdot \frac{\partial a_2}{\partial z_2} \cdot \frac{\partial z_2}{\partial a_1} \cdot \frac{\partial a_1}{\partial z_1} \cdot \frac{\partial z_1}{\partial w_1}$$

Therefore,

$$w_1 = w_1 - \eta \frac{\partial E}{\partial w_1} = w_1 - \eta \cdot (-(T - a_2)) \cdot (a_2(1 - a_2)) \cdot w_2 \cdot (a_1(1 - a_1)) \cdot x_1$$
Updating $b_1$

$$b_1 = b_1 - \eta \frac{\partial E}{\partial b_1}$$

$$\frac{\partial E}{\partial b_1} = \frac{\partial E}{\partial a_2} \cdot \frac{\partial a_2}{\partial z_2} \cdot \frac{\partial z_2}{\partial a_1} \cdot \frac{\partial a_1}{\partial z_1} \cdot \frac{\partial z_1}{\partial b_1} = (-(T - a_2)) \cdot (a_2(1 - a_2)) \cdot w_2 \cdot (a_1(1 - a_1)) \cdot 1$$

Therefore,

$$b_1 = b_1 - \eta \frac{\partial E}{\partial b_1} = b_1 - \eta \cdot (-(T - a_2)) \cdot (a_2(1 - a_2)) \cdot w_2 \cdot (a_1(1 - a_1)) \cdot 1$$
In the example:

$$E = \frac{1}{2}(T - a_2)^2 = .1083$$

Then update the weights and biases according to epochs:

$$w_2 = w_2 - \eta \frac{\partial E}{\partial w_2} = w_2 - \eta \cdot (-(T - a_2)) \cdot (a_2(1 - a_2)) \cdot a_1 = .427$$

$$b_2 = b_2 - \eta \frac{\partial E}{\partial b_2} = b_2 - \eta \cdot (-(T - a_2)) \cdot (a_2(1 - a_2)) \cdot 1 = .612$$

$$w_1 = w_1 - \eta \frac{\partial E}{\partial w_1} = w_1 - \eta \cdot (-(T - a_2)) \cdot (a_2(1 - a_2)) \cdot w_2 \cdot (a_1(1 - a_1)) \cdot x_1 = .1496$$
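To make the update rules concrete, here is a minimal NumPy sketch of one gradient-descent step for this two-layer chain. The input $x_1$, target $T$, learning rate, and initial parameters are made-up values, since the lecture does not list them.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical values: the lecture does not list x1, T, eta or the
# initial parameters, so these numbers are made up for illustration.
x1, T, eta = 0.5, 0.8, 0.1
w1, b1, w2, b2 = 0.15, 0.4, 0.45, 0.6

# Forward pass
z1 = x1 * w1 + b1; a1 = sigmoid(z1)
z2 = a1 * w2 + b2; a2 = sigmoid(z2)
E = 0.5 * (T - a2) ** 2

# Backward pass: exactly the chain-rule factors derived above
delta2 = -(T - a2) * a2 * (1 - a2)        # dE/dz2
dE_dw2, dE_db2 = delta2 * a1, delta2
delta1 = delta2 * w2 * a1 * (1 - a1)      # dE/dz1
dE_dw1, dE_db1 = delta1 * x1, delta1

# Gradient-descent updates
w2 -= eta * dE_dw2; b2 -= eta * dE_db2
w1 -= eta * dE_dw1; b1 -= eta * dE_db1
```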
The vanishing gradient problem is a common issue in neural networks, especially deep
networks with many layers. It occurs when gradients become very small, effectively
"vanishing," during the process of backpropagation. This problem is particularly
prominent when using the sigmoid activation function in earlier hidden layers.
The sigmoid activation function maps input values to the range (0, 1), which can cause two problems in deep networks:
• Its derivative, $\sigma'(z) = \sigma(z)(1 - \sigma(z))$, is at most 0.25, so by the chain rule the gradient reaching an earlier layer is a product of many such small factors and shrinks geometrically with depth.
• For inputs of large magnitude the sigmoid saturates near 0 or 1, where its derivative is close to zero, further diminishing the backpropagated gradient.
The earlier hidden layers in a neural network are responsible for detecting basic
patterns and features in the data. If the gradients for these layers vanish, they stop
learning and adjusting their weights, which negatively impacts the entire learning
process of the network.
Thus, replacing the sigmoid with a Rectified Linear Unit (ReLU) activation function
helps mitigate this problem. ReLU has a gradient of 1 for positive inputs and 0 for
negative inputs, which helps maintain larger gradients during backpropagation,
preventing the vanishing gradient issue.
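As a quick numerical illustration of this effect (an illustrative sketch, not from the lecture), compare the product of sigmoid derivatives accumulated across layers with ReLU's gradient of 1:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# The gradient reaching an early layer is (roughly) a product of local
# derivatives, one per layer. sigmoid' = s(1-s) is at most 0.25, so the
# product shrinks geometrically with depth; ReLU' is 1 for z > 0.
z = 2.0                                   # an arbitrary pre-activation
sig_grad = sigmoid(z) * (1 - sigmoid(z))  # ~0.105
for depth in (5, 10, 20):
    print(depth, sig_grad ** depth)       # shrinks toward zero rapidly
# The corresponding ReLU product (gradient 1 for positive inputs) stays 1.
```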
• Gradient computation
o Loss Gradient at output
• Partial derivative of the loss w.r.t. $f(x)_c$:

$$\frac{\partial}{\partial f(x)_c}\underbrace{\left(-\log f(x)_y\right)}_{\text{cross-entropy loss}} = \frac{-1_{(y=c)}}{f(x)_y},$$
where $c$ is any output neuron (class), $f(x)_c$ is the predicted probability for class $c$, and $y$ is the true label (actual class).
Here, − log 𝑓(𝒙)𝑦 is the cross-entropy loss between the true label 𝑦 and
the predicted probability 𝑓(𝑥).
This loss function measures how well the predicted probability aligns with
the true class.
o Gradient:

$$\nabla_{f(x)}\left(-\log f(x)_y\right) = -\frac{1}{f(x)_y}\begin{bmatrix} 1_{(y=0)} \\ \vdots \\ 1_{(y=C-1)} \end{bmatrix} = -\frac{e(y)}{f(x)_y},$$

where $e(y)$ is a one-hot encoded vector in which the correct class $y$ has a value of 1 and all other classes are 0.
Example: Assume you are working with a classification problem that has 4 classes (i.e., $C = 4$), and the true class label $y$ is class 2.

The softmax output of your neural network might look like this for a given input $x$:

$$f(x) = [.1\;\; .7\;\; .15\;\; .05]^T$$

This means the network predicts probability .1 for class 0, .7 for class 1, .15 for class 2, and .05 for class 3. For the true class $y = 2$, the one-hot encoded vector $e(y)$ is

$$e(y) = [0\;\; 0\;\; 1\;\; 0]^T$$

In this example, the vector $e(y)$ has a 1 at the index of the correct class (class 2) and 0 elsewhere. Now,

$$\nabla_{f(x)}\left(-\log f(x)_y\right) = -\frac{e(y)}{f(x)_y} = -\left[0,\; 0,\; \tfrac{1}{f(x)_y},\; 0\right]^T = -\left[0,\; 0,\; \tfrac{1}{.15},\; 0\right]^T = \left[0,\; 0,\; -6.67,\; 0\right]^T$$
This gradient vector will be used during backpropagation to adjust the network's
weights to make the model's prediction for the correct class (class 2) more confident.
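The same computation in NumPy (a sketch of the example above, with classes zero-indexed so that class 2 is the third entry):

```python
import numpy as np

# The worked example above: C = 4 classes, true class y = 2
# (zero-indexed, so the third entry with probability .15).
f_x = np.array([0.10, 0.70, 0.15, 0.05])  # softmax output
y = 2
e_y = np.zeros(4); e_y[y] = 1.0           # one-hot vector e(y)

grad = -e_y / f_x[y]                      # gradient of -log f(x)_y w.r.t. f(x)
print(grad)                               # [-0. -0. -6.6667 -0.]
```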
• Partial derivative:

$$\frac{\partial}{\partial a^{(L+1)}(x)_c}\left(-\log f(x)_y\right) = -\left(1_{(y=c)} - f(x)_c\right),$$

where $f(x)_y$ is the predicted probability for the true class $y$, i.e., the output of the softmax function for class $y$.
• The softmax function outputs probabilities for each class $c$ based on the pre-activation $a_c^{(L+1)}$, where

$$f(x)_c = \frac{\exp\left(a_c^{(L+1)}\right)}{\sum_k \exp\left(a_k^{(L+1)}\right)}.$$

This function takes in the pre-activation scores $a_c^{(L+1)}$ and converts them into probabilities that sum to 1.
• Derivative of the loss with respect to the pre-activation
We want to compute the gradient of the loss function with respect to the pre-activation $a_c^{(L+1)}$ for class $c$.
• Case 1: $y = c$ (the true class)
Consider $\frac{\partial}{\partial a^{(L+1)}(x)_c}\left(-\log f(x)_y\right)$:
In this case, the loss is directly influenced by the probability $f(x)_c$ for the correct class, and the gradient reflects how the model should adjust this probability. For the cross-entropy loss,

$$\frac{\partial}{\partial f(x)_c}\left(-\log f(x)_y\right) = \frac{-1_{(y=c)}}{f(x)_y} = \frac{-1}{f(x)_c}$$

Now, using the derivative of the softmax function with respect to its own pre-activation $a_c^{(L+1)}$ (quotient rule), we get:

$$\frac{\partial f(x)_c}{\partial a_c^{(L+1)}} = \frac{\partial}{\partial a_c^{(L+1)}}\left(\frac{\exp\left(a_c^{(L+1)}\right)}{\sum_k \exp\left(a_k^{(L+1)}\right)}\right) = \frac{\exp\left(a_c^{(L+1)}\right) \sum_k \exp\left(a_k^{(L+1)}\right) - \exp\left(a_c^{(L+1)}\right) \exp\left(a_c^{(L+1)}\right)}{\left(\sum_k \exp\left(a_k^{(L+1)}\right)\right)^2} = f(x)_c \left(1 - f(x)_c\right)$$

so that

$$\frac{\partial L}{\partial a_c^{(L+1)}} = \frac{\partial L}{\partial f(x)_c} \cdot \frac{\partial f(x)_c}{\partial a_c^{(L+1)}} = -\frac{1}{f(x)_c} \cdot f(x)_c \left(1 - f(x)_c\right) = f(x)_c - 1,$$

where

$$\frac{\partial L}{\partial f(x)_c} = \frac{\partial}{\partial f(x)_c}\left(-\log f(x)_c\right) = -\frac{1}{f(x)_c}$$

• Case 2: $y \neq c$. Here the loss depends on $a_c^{(L+1)}$ only through $f(x)_y$, and the softmax cross-derivative is $\frac{\partial f(x)_y}{\partial a_c^{(L+1)}} = -f(x)_y f(x)_c$, so

$$\frac{\partial L}{\partial a_c^{(L+1)}} = -\frac{1}{f(x)_y} \cdot \left(-f(x)_y f(x)_c\right) = f(x)_c$$
Thus, by combining Cases 1 and 2,

$$\frac{\partial}{\partial a_c^{(L+1)}}\left(-\log f(x)_y\right) = -\left(1_{(y=c)} - f(x)_c\right) = f(x)_c - e(y)_c$$
o Gradient of the softmax and cross-entropy loss function for all cases simultaneously:
o Let $a^{(L+1)}(x)$ represent the vector of pre-activations for all classes:

$$a^{(L+1)}(x) = \left[a_1^{(L+1)}, a_2^{(L+1)}, \ldots, a_C^{(L+1)}\right]$$

Stacking the per-class result above then gives

$$\nabla_{a^{(L+1)}(x)}\left(-\log f(x)_y\right) = f(x) - e(y)$$
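A small sanity-check sketch of this result: the analytic gradient $f(x) - e(y)$ agrees with a finite-difference estimate (the pre-activation values here are made up):

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

# Analytic gradient f(x) - e(y) of -log softmax(a)_y, checked against
# central finite differences. The pre-activation values are made up.
a = np.array([0.5, -1.0, 2.0, 0.1])
y = 2
f = softmax(a)
e_y = np.zeros_like(a); e_y[y] = 1.0
analytic = f - e_y

eps = 1e-6
numeric = np.zeros_like(a)
for c in range(len(a)):
    ap, am = a.copy(), a.copy()
    ap[c] += eps; am[c] -= eps
    numeric[c] = (-np.log(softmax(ap)[y]) + np.log(softmax(am)[y])) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-5))  # True
```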
Example: Neural Network with One Hidden Layer Using Stochastic Gradient Descent
• Input layer: 2 neurons
• Hidden layer: 3 neurons, using the ReLU activation function
• Output layer: 1 neuron, using the sigmoid activation function (for binary classification)
Initialization
We start by initializing the weights and biases for each layer. Assume we use random
initialization for simplicity.
Let:
• $W_1$ be the weight matrix from the input to the hidden layer (shape: 2×3):

$$W_1 = \begin{bmatrix} .2 & -.4 & .1 \\ .4 & .3 & -.5 \end{bmatrix}, \qquad b_1 = [.1\;\; -.2\;\; .3]$$

• $W_2$ be the weight matrix from the hidden layer to the output layer (shape: 3×1):

$$W_2 = [.5\;\; -.3\;\; .2]^T$$
Forward Propagation

$$a_1 = ReLU(z_1) = [.5\;\; 0\;\; .15]^T$$
Why?
The output of the network is 𝑦̂ = 𝜎(𝑧2 ), where 𝑧2 = 𝑎1 𝑊2 + 𝑏2 .
The error signal 𝛿2 is the gradient of the loss w.r.t. 𝑧2 :
$$\delta_2 = \frac{\partial L}{\partial z_2}.$$

Now, the gradient of the loss w.r.t. $W_2$ is:

$$\nabla W_2 = \frac{\partial L}{\partial W_2} = \underbrace{\frac{\partial L}{\partial z_2}}_{=\delta_2} \cdot \underbrace{\frac{\partial z_2}{\partial W_2}}_{=a_1} = a_1^T \delta_2$$
• $\nabla b_2 = \delta_2 = -.406$

Why?

$$\nabla b_2 = \frac{\partial L}{\partial b_2} = \underbrace{\frac{\partial L}{\partial z_2}}_{=\delta_2} \cdot \underbrace{\frac{\partial z_2}{\partial b_2}}_{=1} = \delta_2$$
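A NumPy sketch of these two gradients for the 2-3-1 network. The input, label, bias $b_2$, and loss are not stated in the notes; assuming $x = [1, .5]$ (which reproduces $a_1 = [.5\; 0\; .15]$), $b_2 = .1$, $y = 1$, and a sigmoid output with binary cross-entropy loss (for which $\delta_2 = \hat{y} - y$) also reproduces $\delta_2 \approx -.406$:

```python
import numpy as np

# Assumed values (not given in the notes): x, b2, y, and BCE loss.
x = np.array([[1.0, 0.5]])
W1 = np.array([[0.2, -0.4, 0.1],
               [0.4, 0.3, -0.5]])
b1 = np.array([0.1, -0.2, 0.3])
W2 = np.array([[0.5], [-0.3], [0.2]])
b2 = np.array([0.1])

z1 = x @ W1 + b1
a1 = np.maximum(0.0, z1)                   # ReLU -> [.5, 0, .15]
z2 = a1 @ W2 + b2
y_hat = 1.0 / (1.0 + np.exp(-z2))          # sigmoid output

y = 1.0
delta2 = y_hat - y                         # dL/dz2 under BCE, ~ -.406
grad_W2 = a1.T @ delta2                    # a1^T delta2, shape 3x1
grad_b2 = delta2[0, 0]                     # = delta2
```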
If the loss function $\phi(a)$ depends on $a$ only through the pre-activation functions $q_i(a)$ in the layer above, then

$$\frac{\partial \phi(a)}{\partial a} = \sum_i \frac{\partial \phi(a)}{\partial q_i(a)} \cdot \frac{\partial q_i(a)}{\partial a}.$$
Considering a pre-activation (weighted sum) for layer $k$, $a^{(k)}(x)_i = b_i^{(k)} + \sum_j W_{i,j}^{(k)} h^{(k-1)}(x)_j$, let $f(x)_y$ be the softmax output for the true class $y$ and $\nabla_{a^{(k)}}$ the gradient of the loss w.r.t. the pre-activations of layer $k$. For the $j$th hidden unit,

$$\frac{\partial}{\partial h^{(k)}(x)_j}\left(-\log f(x)_y\right) = \left(W_{\cdot,j}^{(k+1)}\right)^T \left(\nabla_{a^{(k+1)}(x)}\left(-\log f(x)_y\right)\right)$$

Gradient:

$$\nabla_{h^{(k)}(x)}\left(-\log f(x)_y\right) = {W^{(k+1)}}^T \left(\nabla_{a^{(k+1)}(x)}\left(-\log f(x)_y\right)\right)$$
Let’s assume the following weight matrices for simplicity:

$$W^{(1)} = \begin{bmatrix} .2 & .4 \\ -.3 & .1 \end{bmatrix}, \qquad W^{(2)} = \begin{bmatrix} .5 & .6 \\ -.4 & .2 \end{bmatrix}$$
Backward propagation:
• Gradient at the output layer: The error at the output layer for softmax with cross-entropy is

$$\delta^{(2)} = f(x) - e(y) = [.539\;\; .461]^T - [1\;\; 0]^T = [-.461\;\; .461]^T$$
• Gradient for $W^{(2)}$:

$$\frac{\partial L}{\partial W^{(2)}} = \underbrace{\frac{\partial L}{\partial z^{(2)}}}_{\substack{\text{error at the output layer} \\ \equiv \delta^{(2)} = \hat{y} - y}} \cdot \underbrace{\frac{\partial z^{(2)}}{\partial W^{(2)}}}_{= \frac{\partial}{\partial W^{(2)}}\left(W^{(2)} h^{(1)}\right) = h^{(1)}}$$

In matrix form,

$$\nabla W^{(2)} = h^{(1)} {\delta^{(2)}}^T = [.4\;\; 0]^T [-.461\;\; .461] = \begin{bmatrix} -.1844 & .1844 \\ 0 & 0 \end{bmatrix}$$
Considering that the $j$th activation $h^{(k)}(x)_j = g(a^{(k)}(x)_j)$ depends only on the pre-activation $a^{(k)}(x)_j$ (a single unit, so no sum is needed),

$$\frac{\partial L}{\partial a^{(k)}(x)_j} = \frac{\partial}{\partial a^{(k)}(x)_j}\left(-\log f(x)_y\right) = \frac{\partial\left(-\log f(x)_y\right)}{\partial h^{(k)}(x)_j} \cdot \underbrace{\frac{\partial h^{(k)}(x)_j}{\partial a^{(k)}(x)_j}}_{= g'\left(a^{(k)}(x)_j\right)}$$
Gradient:

$$\nabla_{a^{(k)}(x)}\left(-\log f(x)_y\right) = \nabla_{h^{(k)}(x)}\left(-\log f(x)_y\right) \odot \left[\ldots,\; g'\left(a^{(k)}(x)_j\right),\; \ldots\right],$$

since the Jacobian $\nabla_{a^{(k)}(x)} h^{(k)}(x)$ is a diagonal matrix with entries $g'(a^{(k)}(x)_j)$.
$$\nabla W_1 = x^T \delta_1; \qquad \nabla b_1 = \delta_1$$

Proof:
• The hidden layer outputs 𝑎1 = 𝑅𝑒𝐿𝑈(𝑧1 ), where 𝑧1 = 𝑥𝑊1 + 𝑏1 .
• The output of the network 𝑧2 = 𝑎1 𝑊2 + 𝑏2
Now we want to calculate the error signal 𝛿1 , which tells us how much the loss
depends on the pre-activation 𝑧1 of the hidden layer.
$$\delta_1 = \frac{\partial L}{\partial z_1} = \underbrace{\frac{\partial L}{\partial a_1}}_{=\delta_2 W_2^T} \cdot \underbrace{\frac{\partial a_1}{\partial z_1}}_{=ReLU'(z_1)} = \delta_2 W_2^T \odot ReLU'(z_1),$$
where
$$\frac{\partial L}{\partial a_1} = \underbrace{\frac{\partial L}{\partial z_2}}_{=\delta_2} \cdot \underbrace{\frac{\partial z_2}{\partial a_1}}_{=W_2} = \delta_2 W_2^T$$

$$\delta_2 = \frac{\partial L}{\partial z_2}$$

and, since $z_2 = a_1 W_2 + b_2$,

$$\frac{\partial z_2}{\partial a_1} = W_2$$

$$\frac{\partial a_1}{\partial z_1} = ReLU'(z_1) = \begin{cases} 1, & z_1 > 0 \\ 0, & z_1 \le 0 \end{cases}$$
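Continuing the earlier sketch, the hidden-layer error signal and the $W_1$/$b_1$ gradients, with the same assumed input $x = [1, .5]$:

```python
import numpy as np

# Same assumed values as in the earlier sketch; delta2 from the lecture.
x = np.array([[1.0, 0.5]])
W1 = np.array([[0.2, -0.4, 0.1],
               [0.4, 0.3, -0.5]])
b1 = np.array([0.1, -0.2, 0.3])
W2 = np.array([[0.5], [-0.3], [0.2]])
delta2 = np.array([[-0.406]])

z1 = x @ W1 + b1                           # [.5, -.45, .15]
relu_grad = (z1 > 0).astype(float)         # ReLU'(z1) = [1, 0, 1]
delta1 = (delta2 @ W2.T) * relu_grad       # dL/dz1 = delta2 W2^T ⊙ ReLU'(z1)
grad_W1 = x.T @ delta1                     # x^T delta1, shape 2x3
grad_b1 = delta1[0]                        # = delta1
```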
where $a^{(k)}(x)_i = b_i^{(k)} + \sum_j W_{i,j}^{(k)} h^{(k-1)}(x)_j$.
• Gradient (weights):

$$\nabla_{W^{(k)}}\left(-\log f(x)_y\right) = \left(\nabla_{a^{(k)}(x)}\left(-\log f(x)_y\right)\right) h^{(k-1)}(x)^T$$

• Gradient (biases):

$$\nabla_{b^{(k)}}\left(-\log f(x)_y\right) = \nabla_{a^{(k)}(x)}\left(-\log f(x)_y\right)$$
Example:
Let's work through an example of a neural network with the following structure:
• Input layer: 3 neurons
• Hidden layer 1: 5 neurons, using the ReLU activation function
• Hidden layer 2: 5 neurons, using the ReLU activation function
• Output layer: 2 neurons, using the softmax activation function (for multi-class
classification)
We'll go through the forward pass, calculate the loss, and then backpropagate to
compute the gradients for the weights.
We initialize the parameters as follows:
o Hidden layer 1: 5 neurons with weights $W^{(1)}$ (shape: 3×5) and biases $b^{(1)}$ (shape: 1×5)
o Hidden layer 2: 5 neurons with weights $W^{(2)}$ (shape: 5×5) and biases $b^{(2)}$ (shape: 1×5)
o Output layer: 2 neurons with weights $W^{(3)}$ (shape: 5×2) and biases $b^{(3)}$ (shape: 1×2)

$$W^{(1)} = \begin{bmatrix} .2 & -.1 & .4 & -.3 & .1 \\ -.2 & .5 & -.4 & .3 & -.1 \\ .1 & -.3 & .2 & .1 & -.2 \end{bmatrix}, \qquad b^{(1)} = [.1\;\; .1\;\; .1\;\; .1\;\; .1]$$

$$W^{(2)} = \begin{bmatrix} .3 & -.2 & .5 & .1 & -.1 \\ .2 & .1 & -.3 & .2 & .1 \\ -.5 & .3 & .1 & -.1 & .4 \\ .1 & .2 & -.2 & .3 & -.4 \\ -.3 & .4 & .2 & -.1 & .2 \end{bmatrix}, \qquad b^{(2)} = [.05\;\; .05\;\; .05\;\; .05\;\; .05]$$

$$W^{(3)} = \begin{bmatrix} .4 & -.5 \\ .3 & .2 \\ -.4 & .1 \\ .2 & -.3 \\ -.1 & .4 \end{bmatrix}, \qquad b^{(3)} = [.2\;\; -.1]$$
Forward propagation
• Hidden layer 1 pre-activation and activation (ReLU), as used below:

$$z^{(1)} = [.05\;\; .65\;\; .05\;\; -.1\;\; .55]^T, \qquad h^{(1)} = ReLU(z^{(1)}) = [.05\;\; .65\;\; .05\;\; 0\;\; .55]^T$$

• Hidden layer 2 pre-activation:

$$z^{(2)} = {W^{(2)}}^T h^{(1)} + {b^{(2)}}^T = [-.075\;\; .245\;\; .285\;\; .045\;\; .405]^T$$

• Hidden layer 2 activation (ReLU):

$$h^{(2)} = ReLU(z^{(2)}) = [0\;\; .245\;\; .285\;\; .045\;\; .405]^T$$

• Output layer pre-activation:

$$z^{(3)} = {W^{(3)}}^T h^{(2)} + {b^{(3)}}^T = [.128\;\; -.016]^T$$

• Output layer activation (softmax):

$$f(x)_1 = \frac{e^{.128}}{e^{.128} + e^{-.016}} = .535; \qquad f(x)_2 = \frac{e^{-.016}}{e^{.128} + e^{-.016}} = .465$$
Let’s assume the true label is $y = [1, 0]$ (meaning the true class is class 1). Then the cross-entropy loss is $-\log f(x)_1 = -\log(.535) \approx .625$.
Backpropagation
• Gradient at the output layer: For softmax with cross-entropy, the error signal with respect to the output pre-activations is

$$\delta^{(3)} = f(x) - y = [.535\;\; .465]^T - [1\;\; 0]^T = [-.465\;\; .465]^T$$
• Gradient for 𝑊 (3) and 𝑏 (3)
The gradient of the loss with respect to the weights 𝑊 (3) is
$$\nabla W^{(3)} = h^{(2)} {\delta^{(3)}}^T = [0\;\; .245\;\; .285\;\; .045\;\; .405]^T [-.465\;\; .465] = \begin{bmatrix} 0 & 0 \\ -.114 & .114 \\ -.1325 & .1325 \\ -.0209 & .0209 \\ -.1883 & .1883 \end{bmatrix}$$

$$\nabla b^{(3)} = \delta^{(3)} = [-.465\;\; .465]^T$$
Now, update the weights and biases for layer 3, assuming a learning rate $\eta = .01$:

$$W^{(3)} = W^{(3)} - \eta \nabla W^{(3)}, \qquad b^{(3)} = b^{(3)} - \eta \nabla b^{(3)}$$
• Gradient for hidden layer 2: Backpropagating through the ReLU,

$$\delta^{(2)} = \left(W^{(3)} \delta^{(3)}\right) \odot g'\left(z^{(2)}\right) = [0\;\; -.0465\;\; .2325\;\; -.2325\;\; .2325]^T$$

$$\nabla W^{(2)} = h^{(1)} {\delta^{(2)}}^T, \qquad \nabla b^{(2)} = \delta^{(2)} = [0\;\; -.0465\;\; .2325\;\; -.2325\;\; .2325]^T$$
• Gradient for hidden layer 1:
Next, we backpropagate to the first hidden layer. The gradient with respect to the pre-activation $z^{(1)}$ is

$$\delta^{(1)} = \left(W^{(2)} \delta^{(2)}\right) \odot g'\left(z^{(1)}\right),$$

where

$$W^{(2)} \delta^{(2)} = [.07805\;\; -.02355\;\; .06745\;\; .0092\;\; -.00165]^T$$

$$g'\left(z^{(1)}\right) = g'\left([.05\;\; .65\;\; .05\;\; -.1\;\; .55]^T\right) = [1\;\; 1\;\; 1\;\; 0\;\; 1]^T$$

Thus,

$$\delta^{(1)} = \left(W^{(2)} \delta^{(2)}\right) \odot g'\left(z^{(1)}\right) = [.07805\;\; -.02355\;\; .06745\;\; 0\;\; -.00165]^T$$
• Gradient for $W^{(1)}$ and $b^{(1)}$:

$$\nabla W^{(1)} = x \, {\delta^{(1)}}^T = [1\;\; .5\;\; -1]^T [.07805\;\; -.02355\;\; .06745\;\; 0\;\; -.00165] = \begin{bmatrix} .07805 & -.02355 & .06745 & 0 & -.00165 \\ .039025 & -.011775 & .033725 & 0 & -.000825 \\ -.07805 & .02355 & -.06745 & 0 & .00165 \end{bmatrix}$$

$$\nabla b^{(1)} = \delta^{(1)} = [.07805\;\; -.02355\;\; .06745\;\; 0\;\; -.00165]^T$$
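A compact NumPy sketch of this backward pass, starting from the intermediate values stated above ($x = [1, .5, -1]$ is taken from the $\nabla W^{(1)}$ step; the comments give the formulas, and results may differ from the notes in the last decimal due to rounding):

```python
import numpy as np

# Backward pass of the 3-5-5-2 example, seeded with the intermediate
# values stated in the notes above.
x  = np.array([1.0, 0.5, -1.0])
z1 = np.array([0.05, 0.65, 0.05, -0.10, 0.55])
h1 = np.maximum(0.0, z1)
z2 = np.array([-0.075, 0.245, 0.285, 0.045, 0.405])
h2 = np.maximum(0.0, z2)
f  = np.array([0.535, 0.465])                # softmax output
W2 = np.array([[ 0.3, -0.2,  0.5,  0.1, -0.1],
               [ 0.2,  0.1, -0.3,  0.2,  0.1],
               [-0.5,  0.3,  0.1, -0.1,  0.4],
               [ 0.1,  0.2, -0.2,  0.3, -0.4],
               [-0.3,  0.4,  0.2, -0.1,  0.2]])
W3 = np.array([[ 0.4, -0.5],
               [ 0.3,  0.2],
               [-0.4,  0.1],
               [ 0.2, -0.3],
               [-0.1,  0.4]])

e_y = np.array([1.0, 0.0])                   # true class: class 1
d3 = f - e_y                                 # delta^(3) = [-.465, .465]
grad_W3, grad_b3 = np.outer(h2, d3), d3      # h2 d3^T and d3
d2 = (W3 @ d3) * (z2 > 0)                    # delta^(2), ReLU mask
grad_W2, grad_b2 = np.outer(h1, d2), d2
d1 = (W2 @ d2) * (z1 > 0)                    # delta^(1)
grad_W1, grad_b1 = np.outer(x, d1), d1       # x d1^T and d1
```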
Training NN – Regularization
In the SGD algorithm that performs an update after each example, over $N$ iterations with training examples $(x^{(t)}, y^{(t)})$, the regularized update direction is

$$\Delta = -\nabla_\theta \, l\left(f(x^{(t)}; \theta), y^{(t)}\right) - \lambda \nabla_\theta \Omega(\theta) = -\left(\nabla_\theta \, l\left(f(x^{(t)}; \theta), y^{(t)}\right) + \lambda \nabla_\theta \Omega(\theta)\right),$$

where $\Omega(\theta)$ is the regularizer and $\lambda$ its strength.
• L2 regularization

$$\Omega(\theta) = \sum_k \sum_i \sum_j \left(W_{i,j}^{(k)}\right)^2 = \sum_k \left\|W^{(k)}\right\|_F^2,$$
where the subscript 𝐹 stands for the Frobenius norm. The Frobenius norm is a
matrix norm that is analogous to the Euclidean norm for vectors.
o Gradient: ∇𝑾(𝑘) Ω(𝜽) = 2𝑾(𝑘)
o Only applied on weights, not on biases
o Can be interpreted as having a Gaussian prior over the weights.
• L1 regularization

$$\Omega(\theta) = \sum_k \sum_i \sum_j \left|W_{i,j}^{(k)}\right|$$
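A short sketch of both penalties and their (sub)gradients for a list of weight matrices; `np.sign` is used as the standard subgradient of $|\cdot|$, with the common convention of value 0 at exactly 0:

```python
import numpy as np

# Penalties and gradients over a list of weight matrices
# (biases excluded, as noted above).
def l2_penalty(weights):
    return sum((W ** 2).sum() for W in weights)  # sum of squared Frobenius norms

def l2_grad(W):
    return 2.0 * W

def l1_penalty(weights):
    return sum(np.abs(W).sum() for W in weights)

def l1_grad(W):
    return np.sign(W)  # subgradient of |.|; 0 at exactly 0 by convention

# Regularized SGD step on one weight matrix:
# W <- W - eta * (grad_loss_W + lam * l2_grad(W))
```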
Example: Suppose we are training a linear regression model with two features. The model is:

$$y = w_1 x_1 + w_2 x_2 + b$$

The loss function without regularization is typically the Mean Squared Error (MSE):

$$MSE = \frac{1}{n} \sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2$$
L1 regularization adds a penalty proportional to the absolute values of the weights. The regularized loss function becomes:

$$Loss = MSE + \lambda \left(|w_1| + |w_2|\right)$$

Let's assume that at this point in training the weights are $w_1 = .6$ and $w_2 = .2$, with $\lambda = .1$ and learning rate $\eta = .01$ (the values used in the updates below).
• Without regularization: The gradient descent update rule for the weights without regularization is based only on the gradient of the loss:

$$w_i = w_i - \eta \frac{\partial (MSE)}{\partial w_i}$$

For simplicity, assume the gradients of the MSE with respect to $w_1$ and $w_2$ are .1 and .05, respectively. The weights are then updated as follows:

$$w_1 = .6 - .01 \cdot .1 = .599; \qquad w_2 = .2 - .01 \cdot .05 = .1995$$
So the weights gradually decrease, but both remain non-zero.
• Effect of L1 over multiple iterations:
With L1 regularization the update becomes $w_i = w_i - \eta\left(\frac{\partial(MSE)}{\partial w_i} + \lambda \cdot sign(w_i)\right)$. Now, suppose after a few more iterations the weights have continued shrinking:

$$w_1 = .05; \qquad w_2 = .01$$

If we update again:

$$w_1 = .05 - .01 \cdot (.1 + .1 \cdot 1) = .048; \qquad w_2 = .01 - .01 \cdot (.05 + .1 \cdot 1) = .0085$$
As training continues, 𝑤2 will eventually become smaller and smaller. When it
gets sufficiently close to zero, the regularization term dominates the update. If,
for example, 𝑤2 becomes .0001:
𝑤2 = .0001 − .01 ∙ (. 05 + .1 ∙ 1) = −.0014
At this point, 𝑤2 can be pushed to exactly 0 because of the regularization term.
Once 𝑤2 reaches zero, it will stay at zero in subsequent iterations, effectively
removing the contribution of feature 𝑥2 from the model.
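A small loop illustrating this shrink-to-zero behavior, under the same assumed constant MSE gradients (.1 and .05), $\eta = .01$ and $\lambda = .1$; clipping a sign flip to exactly 0, as done here, is one common way (an assumption in this sketch) to realize "pushed to exactly 0":

```python
# Illustrative L1 shrinkage loop with constant (assumed) MSE gradients.
eta, lam = 0.01, 0.1
mse_grad = {"w1": 0.10, "w2": 0.05}
w = {"w1": 0.6, "w2": 0.2}

for step in range(200):
    for k in w:
        if w[k] == 0.0:
            continue                  # a zeroed weight stays at zero
        w[k] -= eta * (mse_grad[k] + lam * (1.0 if w[k] > 0 else -1.0))
        if w[k] < 0.0:
            w[k] = 0.0                # clip the sign flip to exactly 0

print(w)  # w2 reaches exactly 0.0; w1 is still around 0.2
```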
• Variance of the trained model: does the model vary a lot if the training set changes?
• Bias of the trained model: is the average model close to the true solution?
• Generalization error can be seen as the sum of the (squared) bias and the
variance.
• For biases:
o Initialize all to 0
• For weights:
o Can’t initialize weights to 0 with tanh function
o Can’t initialize all weights to the same value (need to break symmetry)
o Sample $W_{i,j}^{(k)}$ from $Unif[-b, b]$, where

$$b = \frac{\sqrt{6}}{\sqrt{size\left(h^{(k)}(x)\right) + size\left(h^{(k-1)}(x)\right)}}$$
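A short sketch of this initialization scheme (often called Glorot or Xavier uniform initialization) in NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)

def glorot_uniform(fan_in, fan_out):
    # b = sqrt(6) / sqrt(size(h^(k)) + size(h^(k-1)))
    b = np.sqrt(6.0) / np.sqrt(fan_in + fan_out)
    return rng.uniform(-b, b, size=(fan_in, fan_out))

W1 = glorot_uniform(3, 5)   # e.g. the 3x5 first layer from the example
b1 = np.zeros(5)            # biases initialized to 0
```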