Lecture 3

Gradient Descent

$$J(w) = \frac{1}{2m}\sum_{i=1}^{m}(z_i - w x_i)^2$$

Gradient descent update rule, with learning rate $\eta$:

$$w^{1} = w^{0} - \eta\,\frac{\partial J}{\partial w}$$

Example revisited:

Updating $w_2$

$$w_2 = w_2 - \eta\,\frac{\partial E}{\partial w_2}$$

• $E = \frac{1}{2}(T - a_2)^2$, where $T$ is the true outcome $\rightarrow \frac{\partial E}{\partial a_2}$
• $a_2 = f(z_2) = \frac{1}{1+e^{-z_2}} \rightarrow \frac{\partial a_2}{\partial z_2}$
• $z_2 = a_1 \cdot w_2 + b_2 \rightarrow \frac{\partial z_2}{\partial w_2}$

By the chain rule,

$$\frac{\partial E}{\partial w_2} = \frac{\partial E}{\partial a_2}\cdot\frac{\partial a_2}{\partial z_2}\cdot\frac{\partial z_2}{\partial w_2} = \big(-(T - a_2)\big)\cdot\big(a_2(1 - a_2)\big)\cdot a_1$$

Therefore,

$$w_2 = w_2 - \eta\,\frac{\partial E}{\partial w_2} = w_2 - \eta \cdot \big(-(T - a_2)\big)\cdot\big(a_2(1 - a_2)\big)\cdot a_1$$

Updating $b_2$

$$b_2 = b_2 - \eta\,\frac{\partial E}{\partial b_2}$$

$$\frac{\partial E}{\partial b_2} = \frac{\partial E}{\partial a_2}\cdot\frac{\partial a_2}{\partial z_2}\cdot\frac{\partial z_2}{\partial b_2} = \big(-(T - a_2)\big)\cdot\big(a_2(1 - a_2)\big)\cdot 1$$

Thus,

$$b_2 = b_2 - \eta\,\frac{\partial E}{\partial b_2} = b_2 - \eta \cdot \big(-(T - a_2)\big)\cdot\big(a_2(1 - a_2)\big)\cdot 1$$
Updating $w_1$

$$w_1 = w_1 - \eta\,\frac{\partial E}{\partial w_1}$$

• $E = \frac{1}{2}(T - a_2)^2$, where $T$ is the true outcome $\rightarrow \frac{\partial E}{\partial a_2}$
• $a_2 = f(z_2) = \frac{1}{1+e^{-z_2}} \rightarrow \frac{\partial a_2}{\partial z_2}$
• $z_2 = a_1 \cdot w_2 + b_2 \rightarrow \frac{\partial z_2}{\partial a_1} = w_2$
• $a_1 = f(z_1) = \frac{1}{1+e^{-z_1}} \rightarrow \frac{\partial a_1}{\partial z_1}$
• $z_1 = x_1 \cdot w_1 + b_1 \rightarrow \frac{\partial z_1}{\partial w_1}$

Thus,

$$\frac{\partial E}{\partial w_1} = \frac{\partial E}{\partial a_2}\cdot\frac{\partial a_2}{\partial z_2}\cdot\frac{\partial z_2}{\partial a_1}\cdot\frac{\partial a_1}{\partial z_1}\cdot\frac{\partial z_1}{\partial w_1} = \big(-(T - a_2)\big)\cdot\big(a_2(1 - a_2)\big)\cdot w_2 \cdot\big(a_1(1 - a_1)\big)\cdot x_1$$

Therefore,

$$w_1 = w_1 - \eta\,\frac{\partial E}{\partial w_1} = w_1 - \eta \cdot \big(-(T - a_2)\big)\cdot\big(a_2(1 - a_2)\big)\cdot w_2 \cdot\big(a_1(1 - a_1)\big)\cdot x_1$$
Updating $b_1$

$$b_1 = b_1 - \eta\,\frac{\partial E}{\partial b_1}$$

$$\frac{\partial E}{\partial b_1} = \frac{\partial E}{\partial a_2}\cdot\frac{\partial a_2}{\partial z_2}\cdot\frac{\partial z_2}{\partial a_1}\cdot\frac{\partial a_1}{\partial z_1}\cdot\frac{\partial z_1}{\partial b_1} = \big(-(T - a_2)\big)\cdot\big(a_2(1 - a_2)\big)\cdot w_2 \cdot\big(a_1(1 - a_1)\big)\cdot 1$$

Therefore,

$$b_1 = b_1 - \eta \cdot \big(-(T - a_2)\big)\cdot\big(a_2(1 - a_2)\big)\cdot w_2 \cdot\big(a_1(1 - a_1)\big)\cdot 1$$

In the example:

$$z_1 = .415;\quad a_1 = .6023;\quad z_2 = .921;\quad a_2 = .7153$$
$$w_2 = .45;\quad w_1 = .15;\quad b_2 = .65;\quad b_1 = .4;\quad x_1 = .1$$

Now assume that the true outcome is $T = .25$.

Then $E = \frac{1}{2}(T - a_2)^2 = .1083$. Update the weights and biases until the number of epochs reaches 1000 or the change falls below $\epsilon = .001$, with learning rate $\eta = .4$.

In the first iteration,

$$w_2 = w_2 - \eta\,\frac{\partial E}{\partial w_2} = w_2 - \eta \cdot \big(-(T - a_2)\big)\cdot\big(a_2(1 - a_2)\big)\cdot a_1 = .427$$

$$b_2 = b_2 - \eta\,\frac{\partial E}{\partial b_2} = b_2 - \eta \cdot \big(-(T - a_2)\big)\cdot\big(a_2(1 - a_2)\big)\cdot 1 = .612$$

$$w_1 = w_1 - \eta\,\frac{\partial E}{\partial w_1} = w_1 - \eta \cdot \big(-(T - a_2)\big)\cdot\big(a_2(1 - a_2)\big)\cdot w_2 \cdot\big(a_1(1 - a_1)\big)\cdot x_1 = .1496$$

$$b_1 = b_1 - \eta \cdot \big(-(T - a_2)\big)\cdot\big(a_2(1 - a_2)\big)\cdot w_2 \cdot\big(a_1(1 - a_1)\big)\cdot 1 = .3959$$
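As a quick sanity check, here is a minimal Python sketch of this first iteration; the variable names mirror the notation above, and the computed updates match the hand calculations up to rounding:

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

# Values from the example
x1, T, eta = .1, .25, .4
w1, b1, w2, b2 = .15, .4, .45, .65

# Forward pass
z1 = x1 * w1 + b1              # .415
a1 = sigmoid(z1)               # .6023
z2 = a1 * w2 + b2              # .921
a2 = sigmoid(z2)               # .7153
E = .5 * (T - a2) ** 2         # .1083

# Backward pass: dE/dz2 = -(T - a2) * a2 * (1 - a2) is shared by w2 and b2
dE_dz2 = -(T - a2) * a2 * (1 - a2)
w2_new = w2 - eta * dE_dz2 * a1          # .427
b2_new = b2 - eta * dE_dz2               # .612

# Propagate through the old w2 and the first sigmoid
dE_dz1 = dE_dz2 * w2 * a1 * (1 - a1)
w1_new = w1 - eta * dE_dz1 * x1          # .1496
b1_new = b1 - eta * dE_dz1               # .3959
```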

NOTE: Vanishing gradient problem

The vanishing gradient problem is a common issue in neural networks, especially deep networks with many layers. It occurs when gradients become very small, effectively "vanishing," during backpropagation. The problem is particularly prominent when the sigmoid activation function is used in the earlier hidden layers.

The sigmoid activation function maps input values to the range (0, 1), which causes two problems in deep networks:

• Small derivatives: the derivative of the sigmoid function is $\sigma'(x) = \sigma(x)(1 - \sigma(x))$. This derivative is small for large positive or negative values of $x$: when $x$ is far from 0, $\sigma(x)$ saturates near 0 or 1, and $\sigma'(x)$ becomes very close to 0.
• Shrinking gradients in early layers: when backpropagating through many layers, the gradients of the weights in earlier layers become smaller and smaller, approaching zero. The network then cannot effectively update the weights of the earlier layers, effectively "freezing" learning in those layers.

The earlier hidden layers in a neural network are responsible for detecting basic
patterns and features in the data. If the gradients for these layers vanish, they stop
learning and adjusting their weights, which negatively impacts the entire learning
process of the network.

Thus, replacing the sigmoid with a Rectified Linear Unit (ReLU) activation function helps mitigate this problem. ReLU has a gradient of 1 for positive inputs and 0 for negative inputs, which helps maintain larger gradients during backpropagation and mitigates the vanishing gradient issue. The short demonstration below compares the two derivatives.
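A quick numeric check of the saturation (a small added illustration): the sigmoid derivative peaks at .25 and collapses as $|x|$ grows, while ReLU's derivative stays 1 on positive inputs.

```python
import math

def sigmoid_prime(x):
    s = 1 / (1 + math.exp(-x))
    return s * (1 - s)

def relu_prime(x):
    return 1.0 if x > 0 else 0.0

# A chain of n saturated sigmoid layers scales gradients by roughly (near-0)^n,
# while each ReLU unit with positive input contributes a factor of exactly 1.
for x in [0, 2, 5, 10]:
    print(x, round(sigmoid_prime(x), 6), relu_prime(x))
# 0   0.25      1.0
# 2   0.104994  1.0
# 5   0.006648  1.0
# 10  4.5e-05   1.0
```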

Training NN – Output layer gradient

• Gradient computation
  o Loss gradient at output
• Partial derivative of the loss w.r.t. $f(x)_c$:

$$\frac{\partial}{\partial f(x)_c}\underbrace{\left(-\log f(\boldsymbol{x})_y\right)}_{\text{cross-entropy loss}} = \frac{-1_{(y=c)}}{f(x)_y},$$

where $c$ is any output neuron (class), $f(x)_c$ is the predicted probability for class $c$, and $y$ is the true label (actual class).
Here, $-\log f(\boldsymbol{x})_y$ is the cross-entropy loss between the true label $y$ and the predicted probability $f(x)$. This loss function measures how well the predicted probability aligns with the true class.
  o Gradient:

$$\nabla_{\boldsymbol{f}(\boldsymbol{x})}\left(-\log \boldsymbol{f}(\boldsymbol{x})_y\right) = -\frac{1}{f(x)_y}\begin{bmatrix} 1_{(y=0)} \\ \vdots \\ 1_{(y=C-1)} \end{bmatrix} = -\frac{\boldsymbol{e}(y)}{f(x)_y},$$

where $\boldsymbol{e}(y)$ is a one-hot encoded vector in which the correct class $y$ has a value of 1 and all other classes are 0.

[Figure: network diagram with the loss $l(f(x), y)$ at the output]
Example: Assume you are working with a classification problem that has 4 classes (i.e., $C = 4$), and the true class label $y$ is class 2.

The softmax output of your neural network might look like this for a given input $x$:

$$f(x) = [.1\ \ .7\ \ .15\ \ .05]^T$$

This means the network is predicting the following probabilities for each class:

$$P(\text{class } 0) = .1;\quad P(\text{class } 1) = .7;\quad P(\text{class } 2) = .15;\quad P(\text{class } 3) = .05.$$

For the true class $y = 2$, the one-hot encoded vector $e(y)$ looks like this:

$$e(y) = [0, 0, 1, 0]^T$$

In this example, the vector $e(y)$ has a 1 at the index of the correct class (class 2) and 0 elsewhere.

Now,

$$\nabla_{\boldsymbol{f}(\boldsymbol{x})}\left(-\log f(x)_y\right) = -\frac{e(y)}{f(x)_y} = -\left[0,\ 0,\ \tfrac{1}{f(x)_2},\ 0\right]^T = -\left[0,\ 0,\ \tfrac{1}{.15},\ 0\right]^T = -[0,\ 0,\ 6.67,\ 0]^T.$$

This gradient vector will be used during backpropagation to adjust the network's
weights to make the model's prediction for the correct class (class 2) more confident.

$$w_{new} = w_{old} - \eta \cdot \nabla_{\boldsymbol{f}(\boldsymbol{x})}\left(-\log f(x)_y\right)$$
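In NumPy, the same gradient computation is a one-liner (a minimal sketch using the numbers above):

```python
import numpy as np

f_x = np.array([.10, .70, .15, .05])   # softmax outputs from the example
y = 2                                  # true class index (0-based)

e_y = np.zeros_like(f_x)
e_y[y] = 1.0                           # one-hot vector e(y)

grad = -e_y / f_x[y]                   # gradient of -log f(x)_y w.r.t. f(x)
print(grad)                            # [-0. -0. -6.6667 -0.], i.e. -[0, 0, 6.67, 0]^T
```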

• Loss Gradient at output pre-activation


• Partial derivative:

$$\frac{\partial}{\partial a^{(L+1)}(\boldsymbol{x})_c}\left(-\log f(\boldsymbol{x})_y\right) = -\left(1_{(y=c)} - f(\boldsymbol{x})_c\right),$$

where $f(\boldsymbol{x})_y$ is the predicted probability for the true class $y$, i.e., the output of the softmax function for class $y$.

• The softmax function outputs probabilities for each class $c$ based on the pre-activation $a^{(L+1)}_c$, where

$$f(x)_c = \frac{\exp\!\big(a^{(L+1)}_c\big)}{\sum_k \exp\!\big(a^{(L+1)}_k\big)}.$$

This function takes in the pre-activation scores $a^{(L+1)}_c$ and converts them into probabilities that sum to 1.
• Derivative of the loss with respect to pre-activation:
We want to compute the gradient of the loss function with respect to the pre-activation $a^{(L+1)}_c$ for class $c$.

• Case 1: $y = c$ (the true class)
Consider $\frac{\partial}{\partial a^{(L+1)}(\boldsymbol{x})_c}\left(-\log f(\boldsymbol{x})_y\right)$:
In this case, the loss is directly influenced by the probability $f(\boldsymbol{x})_c$ for the correct class, and the gradient reflects how the model should adjust this probability. For the cross-entropy loss,

$$\frac{\partial}{\partial f(x)_c}\left(-\log f(x)_y\right) = \frac{-1_{(y=c)}}{f(x)_y} = \frac{-1}{f(x)_y}.$$

Now, using the derivative of the softmax function with respect to its pre-activation $a^{(L+1)}_c$, we get:

$$\frac{\partial f(x)_c}{\partial a^{(L+1)}_c} = \frac{\partial}{\partial a^{(L+1)}_c}\left(\frac{\exp\!\big(a^{(L+1)}_c\big)}{\sum_k \exp\!\big(a^{(L+1)}_k\big)}\right) = \frac{\exp\!\big(a^{(L+1)}_c\big)\sum_k \exp\!\big(a^{(L+1)}_k\big) - \exp\!\big(a^{(L+1)}_c\big)\exp\!\big(a^{(L+1)}_c\big)}{\left(\sum_k \exp\!\big(a^{(L+1)}_k\big)\right)^2} = f(x)_c - f(x)_c^2 = f(x)_c\big(1 - f(x)_c\big).$$

So, for $y = c$, the total derivative is

$$\frac{\partial}{\partial a^{(L+1)}_c}\left(-\log f(x)_y\right) = -\frac{1}{f(x)_y}\cdot\frac{\partial f(x)_y}{\partial a^{(L+1)}_c} \underset{y=c}{=} -\frac{1}{f(x)_c}\cdot f(x)_c\big(1 - f(x)_c\big) = f(x)_c - 1.$$

• Case 2: $y \neq c$ (incorrect class)

Here the loss $-\log f(x)_y$ depends on $a^{(L+1)}_c$ only through $f(x)_y$. The derivative of the softmax output for the true class $y$ with respect to $a^{(L+1)}_c$, $y \neq c$, is

$$\frac{\partial f(x)_y}{\partial a^{(L+1)}_c} = \frac{0 - \exp\!\big(a^{(L+1)}_y\big)\exp\!\big(a^{(L+1)}_c\big)}{\left(\sum_k \exp\!\big(a^{(L+1)}_k\big)\right)^2} = -f(x)_y\, f(x)_c,$$

so

$$\frac{\partial L}{\partial a^{(L+1)}_c} = \frac{\partial L}{\partial f(x)_y}\cdot\frac{\partial f(x)_y}{\partial a^{(L+1)}_c} = -\frac{1}{f(x)_y}\cdot\big(-f(x)_y\, f(x)_c\big) = f(x)_c,$$

where

$$\frac{\partial L}{\partial f(x)_y} = \frac{\partial}{\partial f(x)_y}\left(-\log f(x)_y\right) = -\frac{1}{f(x)_y}.$$

Thus, by combining cases 1 and 2,

$$\frac{\partial}{\partial a^{(L+1)}_c}\left(-\log f(x)_y\right) = -\left(1_{(y=c)} - f(x)_c\right) = f(x)_c - e(y)_c$$

  o Gradient of the softmax and cross-entropy loss function for all cases simultaneously:
  o Let $\boldsymbol{a}^{(L+1)}(\boldsymbol{x})$ represent the vector of pre-activations for all classes:

$$\boldsymbol{a}^{(L+1)}(\boldsymbol{x}) = \left[a^{(L+1)}_1, a^{(L+1)}_2, \dots, a^{(L+1)}_C\right]$$

• $\boldsymbol{f}(\boldsymbol{x})$ represents the vector of softmax outputs (class probabilities):

$$\boldsymbol{f}(\boldsymbol{x}) = [f(x)_1, f(x)_2, \dots, f(x)_C]$$

• $\boldsymbol{e}(y)$ represents a one-hot encoded vector where the entry corresponding to the true class $y$ is 1, and all other entries are 0:

$$\boldsymbol{e}(y) = [e(y)_1, e(y)_2, \dots, e(y)_C].$$

  o For example, if the true class is the second of three classes, $\boldsymbol{e}(y) = [0, 1, 0]$.

• Vector form of the gradient:

$$\nabla_{\boldsymbol{a}^{(L+1)}(\boldsymbol{x})}\left(-\log \boldsymbol{f}(\boldsymbol{x})_y\right) = \boldsymbol{f}(\boldsymbol{x}) - \boldsymbol{e}(y).$$
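In code, the vector form is one line. A minimal sketch with hypothetical logits (the max-subtraction in the softmax is a standard numerical-stability trick and does not change the output):

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())        # shift by the max for numerical stability
    return e / e.sum()

a = np.array([1.0, 2.0, 0.5, -1.0])   # hypothetical pre-activations a^(L+1)
y = 2                                 # true class index

f = softmax(a)
e_y = np.eye(len(a))[y]               # one-hot e(y)
grad_a = f - e_y                      # gradient w.r.t. the pre-activations
```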

Example: Neural Network with One Hidden Layer Using Stochastic Gradient Descent

Neural Network Structure:

• Input layer: 2 features ($x_1$, $x_2$)
• Hidden layer: 3 neurons, using the ReLU activation function
• Output layer: 1 neuron, using the sigmoid activation function (for binary classification)

Initialization

We start by initializing the weights and biases for each layer. Assume we use random initialization for simplicity.

Let:

• $W_1$ be the weight matrix from the input to the hidden layer (shape: 2x3):

$$W_1 = \begin{bmatrix} .2 & -.4 & .1 \\ .4 & .3 & -.5 \end{bmatrix}$$

• $b_1$ be the biases for the hidden layer (shape: 1x3):

$$b_1 = [.1\ \ -.2\ \ .3]$$

• $W_2$ be the weight matrix from the hidden layer to the output layer (shape: 3x1):

$$W_2 = [.5\ \ -.3\ \ .2]^T$$

• $b_2$ be the bias for the output layer (shape: 1x1), e.g., $b_2 = .1$

Forward Propagation

Given an input vector $x = [1\ \ .5]^T$:

1. Hidden layer computation: $z_1 = xW_1 + b_1 = [.5\ \ -.45\ \ .15]^T$

Apply an activation function (let's use ReLU):

$$a_1 = ReLU(z_1) = [.5\ \ 0\ \ .15]^T$$

2. Output layer computation: $z_2 = a_1 W_2 + b_2 = .38$

Since this is binary classification, apply the sigmoid activation function to get the predicted output:

$$\hat{y} = \sigma(z_2) = \frac{1}{1 + e^{-z_2}} = .594$$

3. Loss computation
Use the binary cross-entropy loss function (with true label $y = 1$):

$$L = -[y\log\hat{y} + (1 - y)\log(1 - \hat{y})] = .520$$

4. Backward propagation (gradient computation)
a. Gradient of the output layer: $\delta_2 = \hat{y} - y = .594 - 1 = -.406$
Compute the gradients of $W_2$ and $b_2$:
• $\nabla W_2 = a_1^T \delta_2 = [-.203\ \ 0\ \ -.061]^T$

10
Why?
The output of the network is 𝑦̂ = 𝜎(𝑧2 ), where 𝑧2 = 𝑎1 𝑊2 + 𝑏2 .
The error signal 𝛿2 is the gradient of the loss w.r.t. 𝑧2 :
𝜕𝐿
𝛿2 = .
𝜕𝑧2
Now, the gradient of the loss w.r.t. 𝑊2 is:
𝜕𝐿 𝜕𝐿 𝜕𝑧2
▽ 𝑊2 = = ∙ = 𝑎1𝑇 𝛿2
𝜕𝑊2 𝜕𝑧
⏟2 𝜕𝑊
⏟2
=𝛿2 =𝑎1

• $\nabla b_2 = \delta_2 = -.406$

Why?

$$\nabla b_2 = \frac{\partial L}{\partial b_2} = \underbrace{\frac{\partial L}{\partial z_2}}_{=\delta_2} \cdot \underbrace{\frac{\partial z_2}{\partial b_2}}_{=1} = \delta_2$$

Hidden layer gradient

[Figure: network diagram highlighting the $j$th hidden unit and the loss $l(f(x), y)$]

If the loss function $\phi(a)$ can be written in terms of the pre-activations $q_i(a)$ of the layer above, then

$$\frac{\partial \phi(a)}{\partial a} = \sum_i \frac{\partial \phi(a)}{\partial q_i(a)} \cdot \frac{\partial q_i(a)}{\partial a},$$

where $a$ is a unit in the layer.

Loss gradient at hidden layers

Consider the pre-activation (weighted sum) for layer $k$:

$$a^{(k)}(x)_i = b^{(k)}_i + \sum_j W^{(k)}_{i,j}\, h^{(k-1)}(x)_j,$$

where

• $h^{(k-1)}(x)$ is the activation from the previous layer $k-1$,
• $W^{(k)}$ is the weight matrix between layer $k-1$ and layer $k$,
• $b^{(k)}$ is the bias for layer $k$.

Let $f(\boldsymbol{x})_y$ be the softmax output for the true class $y$ and $\nabla_{a^{(k)}}$ be the gradient of the loss function w.r.t. the pre-activations at layer $k$. Considering each path from $h^{(k)}(\boldsymbol{x})_j$ into layer $k+1$,

$$\frac{\partial L}{\partial h^{(k)}(\boldsymbol{x})_j} = \frac{\partial}{\partial h^{(k)}(\boldsymbol{x})_j}\left(-\log f(\boldsymbol{x})_y\right) = \sum_i \frac{\partial\left(-\log f(\boldsymbol{x})_y\right)}{\partial a^{(k+1)}(\boldsymbol{x})_i} \cdot \underbrace{\frac{\partial a^{(k+1)}(\boldsymbol{x})_i}{\partial h^{(k)}(\boldsymbol{x})_j}}_{=W^{(k+1)}_{i,j}} = \left(\boldsymbol{W}^{(k+1)}_{\cdot,j}\right)^T\left(\nabla_{a^{(k+1)}(x)}\left(-\log f(\boldsymbol{x})_y\right)\right)$$

Gradient:

$$\nabla_{\boldsymbol{h}^{(k)}(\boldsymbol{x})}\left(-\log f(\boldsymbol{x})_y\right) = \boldsymbol{W}^{(k+1)T}\left(\nabla_{\boldsymbol{a}^{(k+1)}(\boldsymbol{x})}\left(-\log f(\boldsymbol{x})_y\right)\right)$$

Example (Backpropagation in a hidden layer)

Let’s take a simple neural network with:

• Input layer: 2 neurons

• Hidden layer: 2 neurons with ReLU activation

• Output layer: 2 neurons with softmax activation

12
Let's assume the following weight matrices and biases for simplicity:

• Weights from input to hidden layer (2x2 matrix):

$$W^{(1)} = \begin{bmatrix} .2 & .4 \\ -.3 & .1 \end{bmatrix}$$

• Weights from hidden to output layer (2x2 matrix):

$$W^{(2)} = \begin{bmatrix} .5 & .6 \\ -.4 & .2 \end{bmatrix}$$

Let's ignore biases for simplicity.

Forward propagation: Given an input vector $\boldsymbol{x} = [1\ \ .5]^T$,

• Hidden layer pre-activation:

$$z^{(1)} = W^{(1)}\boldsymbol{x} = [.4\ \ -.25]^T$$

• Hidden layer activation (ReLU):

$$h^{(1)} = ReLU(z^{(1)}) = [.4\ \ 0]^T$$

• Output layer pre-activation:

$$z^{(2)} = W^{(2)}h^{(1)} = [.2\ \ -.16]^T$$

• Output layer activation (softmax):

$$f(x) = softmax(z^{(2)}) = \frac{e^{z^{(2)}}}{\sum e^{z^{(2)}}} = [.589\ \ .411]^T$$

Now, let's assume that the true label is class 1 (the first class), so $y = 1$.

Then, the loss is calculated using the cross-entropy:

$$L = -\log(f(x)_1) = -\log(.589) = .529$$

Backward propagation:

• Gradient at output layer: The error at the output layer for softmax with cross-entropy is

$$\delta^{(2)} = f(x) - e(y) = [.589\ \ .411]^T - [1\ \ 0]^T = [-.411\ \ .411]^T$$

• Gradient for $W^{(2)}$:

$$\frac{\partial L}{\partial W^{(2)}} = \underbrace{\frac{\partial L}{\partial z^{(2)}}}_{\text{error at the output layer} \;\equiv\; \delta^{(2)}} \cdot \underbrace{\frac{\partial z^{(2)}}{\partial W^{(2)}}}_{=\frac{\partial}{\partial W^{(2)}}(W^{(2)}h^{(1)}) = h^{(1)}}$$

In matrix form, since $z^{(2)}_i = \sum_j W^{(2)}_{i,j} h^{(1)}_j$,

$$\nabla W^{(2)} = \delta^{(2)}\, h^{(1)T} = [-.411\ \ .411]^T\,[.4\ \ 0] = \begin{bmatrix} -.1644 & 0 \\ .1644 & 0 \end{bmatrix}$$

Loss gradient at hidden layers pre-activation

The $j$th activation $h^{(k)}(\boldsymbol{x})_j = g(a^{(k)}(\boldsymbol{x})_j)$ depends only on the pre-activation $a^{(k)}(\boldsymbol{x})_j$, where $a^{(k)}(\boldsymbol{x})_j = W^{(k)}_{j,:}\, h^{(k-1)}(x) + b^{(k)}_j$,

• $W^{(k)}_{j,:}$ is the row vector of weights connecting the previous layer to the current layer,
• $h^{(k-1)}(x)$ is the activation from the previous layer,

thus there is no sum:

$$\frac{\partial L}{\partial a^{(k)}(\boldsymbol{x})_j} = \frac{\partial}{\partial a^{(k)}(\boldsymbol{x})_j}\left(-\log f(\boldsymbol{x})_y\right) = \frac{\partial\left(-\log f(\boldsymbol{x})_y\right)}{\partial h^{(k)}(\boldsymbol{x})_j} \cdot \underbrace{\frac{\partial h^{(k)}(\boldsymbol{x})_j}{\partial a^{(k)}(\boldsymbol{x})_j}}_{=g'(a^{(k)}(\boldsymbol{x})_j)}$$

Gradient:

$$\nabla_{\boldsymbol{a}^{(k)}(\boldsymbol{x})}\left(-\log f(\boldsymbol{x})_y\right) = \nabla_{\boldsymbol{h}^{(k)}(\boldsymbol{x})}\left(-\log f(\boldsymbol{x})_y\right)\underbrace{\nabla_{\boldsymbol{a}^{(k)}(\boldsymbol{x})} h^{(k)}(\boldsymbol{x})^T}_{\substack{\text{Jacobian: diagonal matrix}\\ =\,diag(g'(a^{(k)}(x)))}} = \nabla_{\boldsymbol{h}^{(k)}(\boldsymbol{x})}\left(-\log f(\boldsymbol{x})_y\right) \underbrace{\odot}_{\substack{\text{element-wise}\\ \text{product}}} \left[\dots,\ g'(a^{(k)}(\boldsymbol{x})_j),\ \dots\right]$$
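In code, the two steps at a hidden layer (map the gradient back through $W^{(k+1)}$, then mask by $g'(a^{(k)})$) can be sketched as follows, using hypothetical numbers and ReLU units, with the convention $a^{(k+1)}_i = b_i + \sum_j W^{(k+1)}_{i,j} h^{(k)}_j$:

```python
import numpy as np

# Hypothetical layer sizes: layer k has 3 units, layer k+1 has 2 units
W_next = np.array([[.5, -.4, .1],
                   [.6, .2, -.3]])        # W^(k+1), shape (2, 3)
grad_a_next = np.array([-.4, .4])         # gradient at layer k+1 pre-activations
a_k = np.array([.4, -.25, .7])            # layer k pre-activations (ReLU units)

grad_h = W_next.T @ grad_a_next           # back through the weights: W^T grad
grad_a = grad_h * (a_k > 0)               # elementwise product with ReLU'(a^(k))
```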

Revisit the previous example:

• Gradient of the hidden layer:

Backpropagate the error to the hidden layer:

$$\delta_1 = \delta_2 W_2^T \odot ReLU'(z_1),$$

where $\delta_1$ is the error signal for the hidden layer, $\delta_2$ is the error signal for the output layer, and $W_2$ is the weight matrix between the hidden layer and the output layer.
The gradients of $W_1$ and $b_1$ are

$$\nabla W_1 = x^T \delta_1;\qquad \nabla b_1 = \delta_1$$
Pf)

• The hidden layer outputs $a_1 = ReLU(z_1)$, where $z_1 = xW_1 + b_1$.
• The output of the network is $z_2 = a_1 W_2 + b_2$.

Now we want to calculate the error signal $\delta_1$, which tells us how much the loss depends on the pre-activation $z_1$ of the hidden layer:

$$\delta_1 = \frac{\partial L}{\partial z_1} = \underbrace{\frac{\partial L}{\partial a_1}}_{=\delta_2 W_2^T} \odot \underbrace{\frac{\partial a_1}{\partial z_1}}_{=ReLU'(z_1)} = \delta_2 W_2^T \odot ReLU'(z_1),$$

where

$$\frac{\partial L}{\partial a_1} = \underbrace{\frac{\partial L}{\partial z_2}}_{=\delta_2} \cdot \underbrace{\frac{\partial z_2}{\partial a_1}}_{=W_2} = \delta_2 W_2^T$$

Since the error signal $\delta_2$ is the gradient of the loss w.r.t. $z_2$:

$$\delta_2 = \frac{\partial L}{\partial z_2}$$

and $z_2 = a_1 W_2 + b_2$, so

$$\frac{\partial z_2}{\partial a_1} = W_2$$

Next, the derivative of $a_1 = ReLU(z_1)$ w.r.t. $z_1$ is

$$\frac{\partial a_1}{\partial z_1} = ReLU'(z_1) = \begin{cases} 1, & z_1 > 0 \\ 0, & z_1 \le 0 \end{cases}$$
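Putting the whole 2-3-1 example together, here is a vectorized sketch in the example's row-vector convention; the commented values match the hand-derived gradients up to rounding:

```python
import numpy as np

# Row-vector convention from the example: z1 = x W1 + b1, z2 = a1 W2 + b2
x = np.array([[1.0, 0.5]])                 # input, shape (1, 2)
y = 1.0                                    # true label
W1 = np.array([[.2, -.4, .1],
               [.4, .3, -.5]])
b1 = np.array([[.1, -.2, .3]])
W2 = np.array([[.5], [-.3], [.2]])
b2 = np.array([[.1]])

# Forward pass
z1 = x @ W1 + b1                           # [[.5, -.45, .15]]
a1 = np.maximum(z1, 0)                     # ReLU: [[.5, 0, .15]]
z2 = a1 @ W2 + b2                          # [[.38]]
y_hat = 1 / (1 + np.exp(-z2))              # sigmoid: [[.594]]
loss = -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))   # .520

# Backward pass
delta2 = y_hat - y                         # [[-.406]]
grad_W2 = a1.T @ delta2                    # [[-.203], [0], [-.061]]
grad_b2 = delta2
delta1 = (delta2 @ W2.T) * (z1 > 0)        # [[-.203, 0, -.081]]
grad_W1 = x.T @ delta1
grad_b1 = delta1
```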

Parameter Gradient – Loss gradient of parameters

Partial derivative (weights):

$$\frac{\partial L}{\partial W^{(k)}_{i,j}} = \frac{\partial}{\partial W^{(k)}_{i,j}}\left(-\log f(\boldsymbol{x})_y\right) = \frac{\partial\left(-\log f(\boldsymbol{x})_y\right)}{\partial a^{(k)}(\boldsymbol{x})_i} \cdot \frac{\partial a^{(k)}(\boldsymbol{x})_i}{\partial W^{(k)}_{i,j}} = \frac{\partial\left(-\log f(\boldsymbol{x})_y\right)}{\partial a^{(k)}(\boldsymbol{x})_i} \cdot h^{(k-1)}_j(\boldsymbol{x}),$$

where $a^{(k)}(\boldsymbol{x})_i = b^{(k)}_i + \sum_j W^{(k)}_{i,j}\, h^{(k-1)}(\boldsymbol{x})_j$.

Gradient:

$$\underbrace{\nabla_{W^{(k)}}\left(-\log f(\boldsymbol{x})_y\right)}_{\text{matrix}} = \underbrace{\nabla_{\boldsymbol{a}^{(k)}(\boldsymbol{x})}\left(-\log f(\boldsymbol{x})_y\right)}_{\text{column vector}}\ \underbrace{\boldsymbol{h}^{(k-1)}(\boldsymbol{x})^T}_{\text{row vector}}$$

Partial derivative (biases):

$$\frac{\partial}{\partial b^{(k)}_i}\left(-\log f(\boldsymbol{x})_y\right) = \frac{\partial\left(-\log f(\boldsymbol{x})_y\right)}{\partial a^{(k)}(x)_i} \cdot \underbrace{\frac{\partial a^{(k)}(x)_i}{\partial b^{(k)}_i}}_{=1}$$

Reminder: $a^{(k)}(x)_i = b^{(k)}_i + \sum_j W^{(k)}_{i,j}\, h^{(k-1)}(x)_j$

• Gradient (biases):

$$\nabla_{b^{(k)}}\left(-\log f(\boldsymbol{x})_y\right) = \nabla_{a^{(k)}(x)}\left(-\log f(\boldsymbol{x})_y\right)$$

Example:
Let's work through an example of a neural network with the following structure:
• Input layer: 3 neurons
• Hidden layer 1: 5 neurons, using the ReLU activation function
• Hidden layer 2: 5 neurons, using the ReLU activation function
• Output layer: 2 neurons, using the softmax activation function (for multi-class
classification)
We'll go through the forward pass, calculate the loss, and then backpropagate to
compute the gradients for the weights.

• What is the neural network structure?
  o Input layer: $x = [x_1, x_2, x_3]$
  o Hidden layer 1: 5 neurons with weights $W^{(1)}$ (shape: 3x5) and biases $b^{(1)}$ (shape: 1x5)
  o Hidden layer 2: 5 neurons with weights $W^{(2)}$ (shape: 5x5) and biases $b^{(2)}$ (shape: 1x5)
  o Output layer: 2 neurons with weights $W^{(3)}$ (shape: 5x2) and biases $b^{(3)}$ (shape: 1x2)

Let's assume the weights and biases are initialized as follows:

$$W^{(1)} = \begin{bmatrix} .2 & -.1 & .4 & -.3 & .1 \\ -.2 & .5 & -.4 & .3 & -.1 \\ .1 & -.3 & .2 & .1 & -.2 \end{bmatrix},\qquad b^{(1)} = [.1\ \ .1\ \ .1\ \ .1\ \ .1]$$

$$W^{(2)} = \begin{bmatrix} .3 & -.2 & .5 & .1 & -.1 \\ .2 & .1 & -.3 & .2 & .1 \\ -.5 & .3 & .1 & -.1 & .4 \\ .1 & .2 & -.2 & .3 & -.4 \\ -.3 & .4 & .2 & -.1 & .2 \end{bmatrix},\qquad b^{(2)} = [.05\ \ .05\ \ .05\ \ .05\ \ .05]$$

$$W^{(3)} = \begin{bmatrix} .4 & -.5 \\ .3 & .2 \\ -.4 & .1 \\ .2 & -.3 \\ -.1 & .4 \end{bmatrix},\qquad b^{(3)} = [.2\ \ -.1]$$

Let's use the input vector $x = [1, .5, -1]^T$.

Forward propagation

• Hidden layer 1 pre-activation:

$$z^{(1)} = W^{(1)T}x + b^{(1)T} = [.05\ \ .65\ \ .05\ \ -.1\ \ .55]^T$$

• Hidden layer 1 activation (ReLU):

$$h^{(1)} = ReLU(z^{(1)}) = [.05\ \ .65\ \ .05\ \ 0\ \ .55]^T$$

• Hidden layer 2 pre-activation:

$$z^{(2)} = W^{(2)T}h^{(1)} + b^{(2)T} = [-.075\ \ .245\ \ .285\ \ .045\ \ .405]^T$$

• Hidden layer 2 activation (ReLU):

$$h^{(2)} = ReLU(z^{(2)}) = [0\ \ .245\ \ .285\ \ .045\ \ .405]^T$$

• Output layer pre-activation:

$$z^{(3)} = W^{(3)T}h^{(2)} + b^{(3)T} = [.128\ \ -.016]^T$$

• Output layer activation (softmax):

$$f(x)_1 = \frac{e^{.128}}{e^{.128} + e^{-.016}} = .535;\qquad f(x)_2 = \frac{e^{-.016}}{e^{.128} + e^{-.016}} = .465$$

Let's assume the true label is $y = [1, 0]$ (meaning the true class is class 1). Then, the cross-entropy loss is

$$L = -\sum_i y_i \log(f(x)_i) = -[1 \times \log(.535) + 0 \times \log(.465)] = .626$$

Backpropagation

• Gradient at the output layer: The gradient of the loss with respect to the output
layer activations 𝑓(𝑥) is
𝛿 (3) = 𝑓(𝑥) − 𝑦 = [. 535, .465]𝑇 − [1, 0]𝑇 = [−.465, .465]𝑇
• Gradient for 𝑊 (3) and 𝑏 (3)
The gradient of the loss with respect to the weights 𝑊 (3) is

18
𝑇
▽ 𝑊 (3) = ℎ(2) 𝛿 (3) = [0 . 245 . 285 . 045 . 405]𝑇 [−.465, .465]
0 0
−.114 . 114
= −.1325 . 1325
−.0209 . 0209
[−.1883 . 1883 ]
▽ 𝑏 (3) = 𝛿 (3) = [−.465, .465]𝑇
Now, Updated weights and biases for layer 3: Assuming a learning rate 𝜂 = .01,
𝑊 (3) = 𝑊 (3) − 𝜂 ×▽ 𝑊 (3)
𝑏 (3) = 𝑏 (3) − 𝜂 ×▽ 𝑏 (3)

• Gradient for hidden layer 2:

The gradient with respect to the pre-activation $z^{(2)}$ is

$$\delta^{(2)} = \big(W^{(3)}\delta^{(3)}\big) \odot g'(z^{(2)})$$

Here,

$$W^{(3)}\delta^{(3)} = \begin{bmatrix} .4 & -.5 \\ .3 & .2 \\ -.4 & .1 \\ .2 & -.3 \\ -.1 & .4 \end{bmatrix}[-.465\ \ .465]^T = \begin{bmatrix} -.4185 \\ -.0465 \\ .2325 \\ -.2325 \\ .2325 \end{bmatrix},\qquad g'(z^{(2)}) = g'\big([-.075\ \ .245\ \ .285\ \ .045\ \ .405]^T\big) = [0, 1, 1, 1, 1]^T$$

Therefore, the gradient at the second hidden layer's pre-activation becomes:

$$\delta^{(2)} = \big(W^{(3)}\delta^{(3)}\big) \odot g'(z^{(2)}) = [0\ \ -.0465\ \ .2325\ \ -.2325\ \ .2325]^T$$
• Gradient for $W^{(2)}$ and $b^{(2)}$:

$$\nabla W^{(2)} = h^{(1)}\delta^{(2)T} = [.05\ \ .65\ \ .05\ \ 0\ \ .55]^T\,[0\ \ -.0465\ \ .2325\ \ -.2325\ \ .2325] = \begin{bmatrix} 0 & -.002325 & .011625 & -.011625 & .011625 \\ 0 & -.030225 & .151125 & -.151125 & .151125 \\ 0 & -.002325 & .011625 & -.011625 & .011625 \\ 0 & 0 & 0 & 0 & 0 \\ 0 & -.025575 & .127875 & -.127875 & .127875 \end{bmatrix}$$

$$\nabla b^{(2)} = \delta^{(2)} = [0\ \ -.0465\ \ .2325\ \ -.2325\ \ .2325]^T$$
• Gradient for hidden layer 1:
Next, we backpropagate to the first hidden layer. The gradient with respect to the pre-activation $z^{(1)}$ is

$$\delta^{(1)} = \big(W^{(2)}\delta^{(2)}\big) \odot g'(z^{(1)}),$$

where

$$W^{(2)}\delta^{(2)} = [.07805\ \ -.02355\ \ .06745\ \ .0092\ \ -.00165]^T,\qquad g'(z^{(1)}) = g'\big([.05\ \ .65\ \ .05\ \ -.1\ \ .55]^T\big) = [1\ \ 1\ \ 1\ \ 0\ \ 1]^T$$

Thus,

$$\delta^{(1)} = \big(W^{(2)}\delta^{(2)}\big) \odot g'(z^{(1)}) = [.07805\ \ -.02355\ \ .06745\ \ 0\ \ -.00165]^T$$

• Gradient for $W^{(1)}$ and $b^{(1)}$:

$$\nabla W^{(1)} = x\,\delta^{(1)T} = [1, .5, -1]^T\,[.07805\ \ -.02355\ \ .06745\ \ 0\ \ -.00165] = \begin{bmatrix} .07805 & -.02355 & .06745 & 0 & -.00165 \\ .039025 & -.011775 & .033725 & 0 & -.000825 \\ -.07805 & .02355 & -.06745 & 0 & .00165 \end{bmatrix}$$

$$\nabla b^{(1)} = \delta^{(1)} = [.07805\ \ -.02355\ \ .06745\ \ 0\ \ -.00165]^T$$
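The same pattern generalizes to any depth. Below is a compact sketch of one forward/backward pass for a ReLU network with a softmax output, following this example's conventions ($z^{(k)} = W^{(k)T}h^{(k-1)} + b^{(k)}$, $\nabla W^{(k)} = h^{(k-1)}\delta^{(k)T}$); it is an illustration, not a reference implementation:

```python
import numpy as np

def forward_backward(x, y, Ws, bs):
    """One pass through a ReLU MLP with softmax output and cross-entropy loss.

    Ws[k] has shape (fan_in, fan_out), matching the example: z = W^T h + b.
    x and the entries of bs are 1-D arrays; y is the true class index.
    Returns the loss and per-layer weight/bias gradients in forward order.
    """
    hs, zs = [x], []
    for k, (W, b) in enumerate(zip(Ws, bs)):
        z = W.T @ hs[-1] + b
        zs.append(z)
        if k < len(Ws) - 1:
            hs.append(np.maximum(z, 0))            # ReLU hidden layers

    e = np.exp(zs[-1] - zs[-1].max())              # stabilized softmax
    f = e / e.sum()
    loss = -np.log(f[y])

    delta = f - np.eye(len(f))[y]                  # delta at output pre-activation
    gWs, gbs = [], []
    for k in range(len(Ws) - 1, -1, -1):
        gWs.append(np.outer(hs[k], delta))         # h^(k-1) delta^T, shape of W^(k)
        gbs.append(delta)
        if k > 0:
            delta = (Ws[k] @ delta) * (zs[k - 1] > 0)   # back through W and ReLU'
    return loss, gWs[::-1], gbs[::-1]
```

Calling it with the matrices above (as 1-D arrays for $x$ and the biases) reproduces the gradient formulas derived step by step in this example.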

Training NN – Regularization

In the SGD algorithm that performs updates after each example, for $N$ iterations with each training example $(\boldsymbol{x}^{(t)}, y^{(t)})$,

$$\Delta = -\nabla_{\boldsymbol{\theta}}\, l\big(f(\boldsymbol{x}^{(t)}; \boldsymbol{\theta}), y^{(t)}\big) - \lambda\nabla_{\boldsymbol{\theta}}\,\Omega(\boldsymbol{\theta}) = -\left(\nabla_{\boldsymbol{\theta}}\, l\big(f(\boldsymbol{x}^{(t)}; \boldsymbol{\theta}), y^{(t)}\big) + \lambda\nabla_{\boldsymbol{\theta}}\,\Omega(\boldsymbol{\theta})\right),$$

where $\Omega(\boldsymbol{\theta})$ is a regularizer with gradient $\nabla_{\boldsymbol{\theta}}\Omega(\boldsymbol{\theta})$, and $\lambda$ controls its strength.

• L2 regularization:

$$\Omega(\boldsymbol{\theta}) = \sum_k \sum_i \sum_j \left(W^{(k)}_{i,j}\right)^2 = \sum_k \left\|\boldsymbol{W}^{(k)}\right\|_F^2,$$

where the subscript $F$ stands for the Frobenius norm. The Frobenius norm is a matrix norm analogous to the Euclidean norm for vectors.
  o Gradient: $\nabla_{\boldsymbol{W}^{(k)}}\Omega(\boldsymbol{\theta}) = 2\boldsymbol{W}^{(k)}$
  o Only applied to weights, not to biases
  o Can be interpreted as having a Gaussian prior over the weights.

• L1 regularization, a.k.a. Lasso (Least Absolute Shrinkage and Selection Operator):

$$\Omega(\boldsymbol{\theta}) = \sum_k \sum_i \sum_j \left|W^{(k)}_{i,j}\right|$$

  o Gradient: $\nabla_{\boldsymbol{W}^{(k)}}\Omega(\boldsymbol{\theta}) = sign(\boldsymbol{W}^{(k)})$, where $sign(\boldsymbol{W}^{(k)}) = 1_{W^{(k)}_{i,j} > 0} - 1_{W^{(k)}_{i,j} < 0}$
  o Also only applied to weights
  o Unlike L2, L1 will push certain weights to be exactly 0 (a small sketch of both penalty gradients follows)
  o Can be interpreted as having a Laplacian prior over the weights.
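As referenced above, a minimal sketch of the two penalty gradients; the regularized SGD step then subtracts $\eta(\nabla l + \lambda\nabla\Omega)$:

```python
import numpy as np

def reg_gradient(W, kind="l2"):
    """Gradient of the regularizer for one weight matrix: 2W for L2, sign(W) for L1."""
    return 2 * W if kind == "l2" else np.sign(W)

# One regularized SGD step for a weight matrix W with loss gradient grad_loss:
#   W = W - eta * (grad_loss + lam * reg_gradient(W, kind))
```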

Example: Suppose we are training a linear regression model with two features. The model is:

$$y = w_1 x_1 + w_2 x_2 + b$$

The loss function without regularization is typically the mean squared error (MSE):

$$MSE = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$$
L1 regularization adds a penalty proportional to the absolute values of the weights. The regularized loss function becomes:

$$Loss = MSE + \lambda(|w_1| + |w_2|),$$

where $\lambda$ is a hyperparameter that controls the strength of the regularization. This penalty encourages the weights to shrink, and possibly one or both weights may be driven to zero.

Let's assume that partway through training the weights are $w_1 = .6$ and $w_2 = .2$, the regularization strength is $\lambda = .1$, and the learning rate is $\eta = .01$.

• Without regularization: The gradient descent update rule for the weights without regularization is based only on the gradient of the loss:

$$w_i = w_i - \eta\,\frac{\partial(MSE)}{\partial w_i}$$

For simplicity, assume that the gradients of the MSE with respect to $w_1$ and $w_2$ are .1 and .05, respectively. The weights are updated as follows:

$$w_1 = .6 - .01 \cdot .1 = .599;\qquad w_2 = .2 - .01 \cdot .05 = .1995$$

So the weights gradually decrease, but both remain non-zero.

• With L1 regularization: Now, when we apply L1 regularization, we add the gradient of the regularization term to the gradient of the loss function. The derivative of $|w|$ is the sign function, which is 1 if $w > 0$ and $-1$ if $w < 0$. Thus, the update rule becomes:

$$w_i = w_i - \eta\left(\frac{\partial(MSE)}{\partial w_i} + \lambda \cdot sign(w_i)\right)$$

For this example, the updates are:

$$w_1 = .6 - .01 \cdot (.1 + .1 \cdot 1) = .598;\qquad w_2 = .2 - .01 \cdot (.05 + .1 \cdot 1) = .1985$$

Here both weights decrease faster than they would without regularization; the penalty term $\lambda$ accelerates the shrinking.

22
• Effect of L1 over multiple iterations:
Now, suppose after a few more iterations, the weights continue shrinking:
𝑤1 = .05; 𝑤2 = .01
If we update again:
𝑤1 = .05 − .01 ∙ (. 1 + .1 ∙ 1) = .048; 𝑤2 = .01 − .01 ∙ (. 05 + .1 ∙ 1) = .0085
As training continues, 𝑤2 will eventually become smaller and smaller. When it
gets sufficiently close to zero, the regularization term dominates the update. If,
for example, 𝑤2 becomes .0001:
𝑤2 = .0001 − .01 ∙ (. 05 + .1 ∙ 1) = −.0014
At this point, 𝑤2 can be pushed to exactly 0 because of the regularization term.
Once 𝑤2 reaches zero, it will stay at zero in subsequent iterations, effectively
removing the contribution of feature 𝑥2 from the model.
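A small sketch of these updates, using a hypothetical helper that clips updates crossing zero (one common way to realize the "stays at zero" behavior described above):

```python
def l1_update(w, grad_mse, eta=.01, lam=.1):
    """One SGD step with an L1 penalty; updates that cross zero are clipped to 0."""
    sign_w = 1.0 if w > 0 else (-1.0 if w < 0 else 0.0)
    w_new = w - eta * (grad_mse + lam * sign_w)
    return 0.0 if w * w_new < 0 else w_new     # stop at zero instead of crossing it

print(l1_update(.6, .1))       # .598
print(l1_update(.2, .05))      # .1985
print(l1_update(.0001, .05))   # 0.0: the sign flip is clipped, and w stays at 0
```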

Q: Why do we use regularization? - Bias-variance trade-off

• Variance of the trained model: does it vary a lot if the training set changes?
• Bias of the trained model: is the average model close to the true solution?
• Generalization error can be seen as the sum of the (squared) bias and the variance.

Q: How to initialize the parameters?

• For biases:
  o Initialize all to 0
• For weights:
  o Can't initialize weights to 0 with the tanh function
  o Can't initialize all weights to the same value (need to break symmetry)
  o Sample $W^{(k)}_{i,j}$ from $Unif[-b, b]$, where $b = \frac{\sqrt{6}}{\sqrt{size(h^{(k)}(x)) + size(h^{(k-1)}(x))}}$; a minimal sampler is sketched below
    ▪ Other values of $b$ could work as well (not an exact science)
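As referenced above, a minimal sampler for this rule (a sketch; `fan_in` and `fan_out` play the roles of $size(h^{(k-1)}(x))$ and $size(h^{(k)}(x))$):

```python
import numpy as np

def init_layer(fan_in, fan_out, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    # b = sqrt(6) / sqrt(fan_in + fan_out): the uniform range from the rule above
    bound = np.sqrt(6.0) / np.sqrt(fan_in + fan_out)
    W = rng.uniform(-bound, bound, size=(fan_in, fan_out))
    b = np.zeros(fan_out)          # biases initialized to 0, as stated above
    return W, b

W1, b1 = init_layer(3, 5)          # e.g., a 3 -> 5 layer
```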
