Lecture 3

Gradient Descent

$$J(w) = \frac{1}{2m}\sum_{i=1}^{m}(z_i - w x_i)^2$$

Gradient descent update rule, with learning rate $\eta$:

$$w^{1} = w^{0} - \eta\,\frac{\partial J}{\partial w}$$

Example revisited:

Updating $w_2$

$$w_2 = w_2 - \eta\,\frac{\partial E}{\partial w_2}$$

• $E = \frac{1}{2}(T - a_2)^2$, where $T$ is the true outcome $\rightarrow \frac{\partial E}{\partial a_2}$
• $a_2 = f(z_2) = \frac{1}{1+e^{-z_2}} \rightarrow \frac{\partial a_2}{\partial z_2}$
• $z_2 = a_1 \cdot w_2 + b_2 \rightarrow \frac{\partial z_2}{\partial w_2}$

By the chain rule,

$$\frac{\partial E}{\partial w_2} = \frac{\partial E}{\partial a_2}\cdot\frac{\partial a_2}{\partial z_2}\cdot\frac{\partial z_2}{\partial w_2} = \big(-(T - a_2)\big)\cdot\big(a_2(1 - a_2)\big)\cdot a_1$$

Therefore,

$$w_2 = w_2 - \eta\,\frac{\partial E}{\partial w_2} = w_2 - \eta \cdot \big(-(T - a_2)\big)\cdot\big(a_2(1 - a_2)\big)\cdot a_1$$

Updating $b_2$

$$b_2 = b_2 - \eta\,\frac{\partial E}{\partial b_2}$$

$$\frac{\partial E}{\partial b_2} = \frac{\partial E}{\partial a_2}\cdot\frac{\partial a_2}{\partial z_2}\cdot\frac{\partial z_2}{\partial b_2} = \big(-(T - a_2)\big)\cdot\big(a_2(1 - a_2)\big)\cdot 1$$

Thus,

$$b_2 = b_2 - \eta\,\frac{\partial E}{\partial b_2} = b_2 - \eta \cdot \big(-(T - a_2)\big)\cdot\big(a_2(1 - a_2)\big)\cdot 1$$
Updating $w_1$

$$w_1 = w_1 - \eta\,\frac{\partial E}{\partial w_1}$$

• $E = \frac{1}{2}(T - a_2)^2$, where $T$ is the true outcome $\rightarrow \frac{\partial E}{\partial a_2}$
• $a_2 = f(z_2) = \frac{1}{1+e^{-z_2}} \rightarrow \frac{\partial a_2}{\partial z_2}$
• $z_2 = a_1 \cdot w_2 + b_2 \rightarrow \frac{\partial z_2}{\partial a_1} = w_2$
• $a_1 = f(z_1) = \frac{1}{1+e^{-z_1}} \rightarrow \frac{\partial a_1}{\partial z_1}$
• $z_1 = x_1 \cdot w_1 + b_1 \rightarrow \frac{\partial z_1}{\partial w_1}$

Thus,

$$\frac{\partial E}{\partial w_1} = \frac{\partial E}{\partial a_2}\cdot\frac{\partial a_2}{\partial z_2}\cdot\frac{\partial z_2}{\partial a_1}\cdot\frac{\partial a_1}{\partial z_1}\cdot\frac{\partial z_1}{\partial w_1} = \big(-(T - a_2)\big)\cdot\big(a_2(1 - a_2)\big)\cdot w_2 \cdot\big(a_1(1 - a_1)\big)\cdot x_1$$

Therefore,

$$w_1 = w_1 - \eta\,\frac{\partial E}{\partial w_1} = w_1 - \eta \cdot \big(-(T - a_2)\big)\cdot\big(a_2(1 - a_2)\big)\cdot w_2 \cdot\big(a_1(1 - a_1)\big)\cdot x_1$$
Updating $b_1$

$$b_1 = b_1 - \eta\,\frac{\partial E}{\partial b_1}$$

$$\frac{\partial E}{\partial b_1} = \frac{\partial E}{\partial a_2}\cdot\frac{\partial a_2}{\partial z_2}\cdot\frac{\partial z_2}{\partial a_1}\cdot\frac{\partial a_1}{\partial z_1}\cdot\frac{\partial z_1}{\partial b_1} = \big(-(T - a_2)\big)\cdot\big(a_2(1 - a_2)\big)\cdot w_2 \cdot\big(a_1(1 - a_1)\big)\cdot 1$$

Therefore,

$$b_1 = b_1 - \eta \cdot \big(-(T - a_2)\big)\cdot\big(a_2(1 - a_2)\big)\cdot w_2 \cdot\big(a_1(1 - a_1)\big)\cdot 1$$

In the example:

$$z_1 = .415;\quad a_1 = .6023;\quad z_2 = .921;\quad a_2 = .7153$$
$$w_2 = .45;\quad w_1 = .15;\quad b_2 = .65;\quad b_1 = .4;\quad x_1 = .1$$

Now assume that the true outcome is $T = .25$.

Then $E = \frac{1}{2}(T - a_2)^2 = .1083$. Update the weights and biases until the number of epochs reaches 1000 or the change falls below $\epsilon = .001$, with learning rate $\eta = .4$.

In the first iteration,

$$w_2 = w_2 - \eta\,\frac{\partial E}{\partial w_2} = w_2 - \eta \cdot \big(-(T - a_2)\big)\cdot\big(a_2(1 - a_2)\big)\cdot a_1 = .427$$

$$b_2 = b_2 - \eta\,\frac{\partial E}{\partial b_2} = b_2 - \eta \cdot \big(-(T - a_2)\big)\cdot\big(a_2(1 - a_2)\big)\cdot 1 = .612$$

$$w_1 = w_1 - \eta\,\frac{\partial E}{\partial w_1} = w_1 - \eta \cdot \big(-(T - a_2)\big)\cdot\big(a_2(1 - a_2)\big)\cdot w_2 \cdot\big(a_1(1 - a_1)\big)\cdot x_1 = .1496$$

$$b_1 = b_1 - \eta \cdot \big(-(T - a_2)\big)\cdot\big(a_2(1 - a_2)\big)\cdot w_2 \cdot\big(a_1(1 - a_1)\big)\cdot 1 = .3959$$
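As a quick sanity check, here is a minimal Python sketch of this first iteration; the variable names mirror the notation above, and the computed updates match the hand calculations up to rounding:

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

# Values from the example
x1, T, eta = .1, .25, .4
w1, b1, w2, b2 = .15, .4, .45, .65

# Forward pass
z1 = x1 * w1 + b1              # .415
a1 = sigmoid(z1)               # .6023
z2 = a1 * w2 + b2              # .921
a2 = sigmoid(z2)               # .7153
E = .5 * (T - a2) ** 2         # .1083

# Backward pass: dE/dz2 = -(T - a2) * a2 * (1 - a2) is shared by w2 and b2
dE_dz2 = -(T - a2) * a2 * (1 - a2)
w2_new = w2 - eta * dE_dz2 * a1          # .427
b2_new = b2 - eta * dE_dz2               # .612

# Propagate through the old w2 and the first sigmoid
dE_dz1 = dE_dz2 * w2 * a1 * (1 - a1)
w1_new = w1 - eta * dE_dz1 * x1          # .1496
b1_new = b1 - eta * dE_dz1               # .3959
```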

NOTE: Vanishing gradient problem

The vanishing gradient problem is a common issue in neural networks, especially deep networks with many layers. It occurs when gradients become very small, effectively "vanishing," during backpropagation. The problem is particularly prominent when the sigmoid activation function is used in the earlier hidden layers.

The sigmoid activation function maps input values to the range (0, 1), which causes two problems in deep networks:

• Small derivatives: the derivative of the sigmoid function is $\sigma'(x) = \sigma(x)(1 - \sigma(x))$. This derivative is small for large positive or negative values of $x$: when $x$ is far from 0, $\sigma(x)$ saturates near 0 or 1, and $\sigma'(x)$ becomes very close to 0.
• Shrinking gradients in early layers: when backpropagating through many layers, the gradients of the weights in earlier layers become smaller and smaller, approaching zero. The network then cannot effectively update the weights of the earlier layers, effectively "freezing" learning in those layers.

The earlier hidden layers in a neural network are responsible for detecting basic
patterns and features in the data. If the gradients for these layers vanish, they stop
learning and adjusting their weights, which negatively impacts the entire learning
process of the network.

Thus, replacing the sigmoid with a Rectified Linear Unit (ReLU) activation function helps mitigate this problem. ReLU has a gradient of 1 for positive inputs and 0 for negative inputs, which helps maintain larger gradients during backpropagation and mitigates the vanishing gradient issue. The short demonstration below compares the two derivatives.
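A quick numeric check of the saturation (a small added illustration): the sigmoid derivative peaks at .25 and collapses as $|x|$ grows, while ReLU's derivative stays 1 on positive inputs.

```python
import math

def sigmoid_prime(x):
    s = 1 / (1 + math.exp(-x))
    return s * (1 - s)

def relu_prime(x):
    return 1.0 if x > 0 else 0.0

# A chain of n saturated sigmoid layers scales gradients by roughly (near-0)^n,
# while each ReLU unit with positive input contributes a factor of exactly 1.
for x in [0, 2, 5, 10]:
    print(x, round(sigmoid_prime(x), 6), relu_prime(x))
# 0   0.25      1.0
# 2   0.104994  1.0
# 5   0.006648  1.0
# 10  4.5e-05   1.0
```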

Training NN – Output layer gradient

• Gradient computation
  o Loss gradient at output
• Partial derivative of the loss w.r.t. $f(x)_c$:

$$\frac{\partial}{\partial f(x)_c}\underbrace{\left(-\log f(\boldsymbol{x})_y\right)}_{\text{cross-entropy loss}} = \frac{-1_{(y=c)}}{f(x)_y},$$

where $c$ is any output neuron (class), $f(x)_c$ is the predicted probability for class $c$, and $y$ is the true label (actual class).
Here, $-\log f(\boldsymbol{x})_y$ is the cross-entropy loss between the true label $y$ and the predicted probability $f(x)$. This loss function measures how well the predicted probability aligns with the true class.
  o Gradient:

$$\nabla_{\boldsymbol{f}(\boldsymbol{x})}\left(-\log \boldsymbol{f}(\boldsymbol{x})_y\right) = -\frac{1}{f(x)_y}\begin{bmatrix} 1_{(y=0)} \\ \vdots \\ 1_{(y=C-1)} \end{bmatrix} = -\frac{\boldsymbol{e}(y)}{f(x)_y},$$

where $\boldsymbol{e}(y)$ is a one-hot encoded vector in which the correct class $y$ has a value of 1 and all other classes are 0.

[Figure: network diagram with the loss $l(f(x), y)$ at the output]
Example: Assume you are working with a classification problem that has 4 classes (i.e., $C = 4$), and the true class label $y$ is class 2.

The softmax output of your neural network might look like this for a given input $x$:

$$f(x) = [.1\ \ .7\ \ .15\ \ .05]^T$$

This means the network is predicting the following probabilities for each class:

$$P(\text{class } 0) = .1;\quad P(\text{class } 1) = .7;\quad P(\text{class } 2) = .15;\quad P(\text{class } 3) = .05.$$

For the true class $y = 2$, the one-hot encoded vector $e(y)$ looks like this:

$$e(y) = [0, 0, 1, 0]^T$$

In this example, the vector $e(y)$ has a 1 at the index of the correct class (class 2) and 0 elsewhere.

Now,

$$\nabla_{\boldsymbol{f}(\boldsymbol{x})}\left(-\log f(x)_y\right) = -\frac{e(y)}{f(x)_y} = -\left[0,\ 0,\ \tfrac{1}{f(x)_2},\ 0\right]^T = -\left[0,\ 0,\ \tfrac{1}{.15},\ 0\right]^T = -[0,\ 0,\ 6.67,\ 0]^T.$$

This gradient vector will be used during backpropagation to adjust the network's
weights to make the model's prediction for the correct class (class 2) more confident.

$$w_{new} = w_{old} - \eta \cdot \nabla_{\boldsymbol{f}(\boldsymbol{x})}\left(-\log f(x)_y\right)$$
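In NumPy, the same gradient computation is a one-liner (a minimal sketch using the numbers above):

```python
import numpy as np

f_x = np.array([.10, .70, .15, .05])   # softmax outputs from the example
y = 2                                  # true class index (0-based)

e_y = np.zeros_like(f_x)
e_y[y] = 1.0                           # one-hot vector e(y)

grad = -e_y / f_x[y]                   # gradient of -log f(x)_y w.r.t. f(x)
print(grad)                            # [-0. -0. -6.6667 -0.], i.e. -[0, 0, 6.67, 0]^T
```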

• Loss Gradient at output pre-activation


• Partial derivative:

$$\frac{\partial}{\partial a^{(L+1)}(\boldsymbol{x})_c}\left(-\log f(\boldsymbol{x})_y\right) = -\left(1_{(y=c)} - f(\boldsymbol{x})_c\right),$$

where $f(\boldsymbol{x})_y$ is the predicted probability for the true class $y$, i.e., the output of the softmax function for class $y$.

• The softmax function outputs probabilities for each class $c$ based on the pre-activation $a^{(L+1)}_c$, where

$$f(x)_c = \frac{\exp\!\big(a^{(L+1)}_c\big)}{\sum_k \exp\!\big(a^{(L+1)}_k\big)}.$$

This function takes in the pre-activation scores $a^{(L+1)}_c$ and converts them into probabilities that sum to 1.
• Derivative of the loss with respect to pre-activation:
We want to compute the gradient of the loss function with respect to the pre-activation $a^{(L+1)}_c$ for class $c$.

• Case 1: $y = c$ (the true class)
Consider $\frac{\partial}{\partial a^{(L+1)}(\boldsymbol{x})_c}\left(-\log f(\boldsymbol{x})_y\right)$:
In this case, the loss is directly influenced by the probability $f(\boldsymbol{x})_c$ for the correct class, and the gradient reflects how the model should adjust this probability. For the cross-entropy loss,

$$\frac{\partial}{\partial f(x)_c}\left(-\log f(x)_y\right) = \frac{-1_{(y=c)}}{f(x)_y} = \frac{-1}{f(x)_y}.$$

Now, using the derivative of the softmax function with respect to its pre-activation $a^{(L+1)}_c$, we get:

$$\frac{\partial f(x)_c}{\partial a^{(L+1)}_c} = \frac{\partial}{\partial a^{(L+1)}_c}\left(\frac{\exp\!\big(a^{(L+1)}_c\big)}{\sum_k \exp\!\big(a^{(L+1)}_k\big)}\right) = \frac{\exp\!\big(a^{(L+1)}_c\big)\sum_k \exp\!\big(a^{(L+1)}_k\big) - \exp\!\big(a^{(L+1)}_c\big)\exp\!\big(a^{(L+1)}_c\big)}{\left(\sum_k \exp\!\big(a^{(L+1)}_k\big)\right)^2} = f(x)_c - f(x)_c^2 = f(x)_c\big(1 - f(x)_c\big).$$

So, for $y = c$, the total derivative is

$$\frac{\partial}{\partial a^{(L+1)}_c}\left(-\log f(x)_y\right) = -\frac{1}{f(x)_y}\cdot\frac{\partial f(x)_y}{\partial a^{(L+1)}_c} \underset{y=c}{=} -\frac{1}{f(x)_c}\cdot f(x)_c\big(1 - f(x)_c\big) = f(x)_c - 1.$$

• Case 2: $y \neq c$ (incorrect class)

Here the loss $-\log f(x)_y$ depends on $a^{(L+1)}_c$ only through $f(x)_y$. The derivative of the softmax output for the true class $y$ with respect to $a^{(L+1)}_c$, $y \neq c$, is

$$\frac{\partial f(x)_y}{\partial a^{(L+1)}_c} = \frac{0 - \exp\!\big(a^{(L+1)}_y\big)\exp\!\big(a^{(L+1)}_c\big)}{\left(\sum_k \exp\!\big(a^{(L+1)}_k\big)\right)^2} = -f(x)_y\, f(x)_c,$$

so

$$\frac{\partial L}{\partial a^{(L+1)}_c} = \frac{\partial L}{\partial f(x)_y}\cdot\frac{\partial f(x)_y}{\partial a^{(L+1)}_c} = -\frac{1}{f(x)_y}\cdot\big(-f(x)_y\, f(x)_c\big) = f(x)_c,$$

where

$$\frac{\partial L}{\partial f(x)_y} = \frac{\partial}{\partial f(x)_y}\left(-\log f(x)_y\right) = -\frac{1}{f(x)_y}.$$

Thus, by combining cases 1 and 2,

$$\frac{\partial}{\partial a^{(L+1)}_c}\left(-\log f(x)_y\right) = -\left(1_{(y=c)} - f(x)_c\right) = f(x)_c - e(y)_c$$

  o Gradient of the softmax and cross-entropy loss function for all cases simultaneously:
  o Let $\boldsymbol{a}^{(L+1)}(\boldsymbol{x})$ represent the vector of pre-activations for all classes:

$$\boldsymbol{a}^{(L+1)}(\boldsymbol{x}) = \left[a^{(L+1)}_1, a^{(L+1)}_2, \dots, a^{(L+1)}_C\right]$$

• $\boldsymbol{f}(\boldsymbol{x})$ represents the vector of softmax outputs (class probabilities):

$$\boldsymbol{f}(\boldsymbol{x}) = [f(x)_1, f(x)_2, \dots, f(x)_C]$$

• $\boldsymbol{e}(y)$ represents a one-hot encoded vector where the entry corresponding to the true class $y$ is 1, and all other entries are 0:

$$\boldsymbol{e}(y) = [e(y)_1, e(y)_2, \dots, e(y)_C].$$

  o For example, if the true class is the second of three classes, $\boldsymbol{e}(y) = [0, 1, 0]$.

• Vector form of the gradient:

$$\nabla_{\boldsymbol{a}^{(L+1)}(\boldsymbol{x})}\left(-\log \boldsymbol{f}(\boldsymbol{x})_y\right) = \boldsymbol{f}(\boldsymbol{x}) - \boldsymbol{e}(y).$$
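In code, the vector form is one line. A minimal sketch with hypothetical logits (the max-subtraction in the softmax is a standard numerical-stability trick and does not change the output):

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())        # shift by the max for numerical stability
    return e / e.sum()

a = np.array([1.0, 2.0, 0.5, -1.0])   # hypothetical pre-activations a^(L+1)
y = 2                                 # true class index

f = softmax(a)
e_y = np.eye(len(a))[y]               # one-hot e(y)
grad_a = f - e_y                      # gradient w.r.t. the pre-activations
```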

Example: Neural Network with One Hidden Layer Using Stochastic Gradient Descent

Neural Network Structure:

• Input layer: 2 features ($x_1$, $x_2$)
• Hidden layer: 3 neurons, using the ReLU activation function
• Output layer: 1 neuron, using the sigmoid activation function (for binary classification)

Initialization

We start by initializing the weights and biases for each layer. Assume we use random initialization for simplicity.

Let:

• $W_1$ be the weight matrix from the input to the hidden layer (shape: 2x3):

$$W_1 = \begin{bmatrix} .2 & -.4 & .1 \\ .4 & .3 & -.5 \end{bmatrix}$$

• $b_1$ be the biases for the hidden layer (shape: 1x3):

$$b_1 = [.1\ \ -.2\ \ .3]$$

• $W_2$ be the weight matrix from the hidden layer to the output layer (shape: 3x1):

$$W_2 = [.5\ \ -.3\ \ .2]^T$$

• $b_2$ be the bias for the output layer (shape: 1x1), e.g., $b_2 = .1$

Forward Propagation

Given an input vector $x = [1\ \ .5]^T$:

1. Hidden layer computation: $z_1 = xW_1 + b_1 = [.5\ \ -.45\ \ .15]^T$

Apply an activation function (let's use ReLU):

$$a_1 = ReLU(z_1) = [.5\ \ 0\ \ .15]^T$$

2. Output layer computation: $z_2 = a_1 W_2 + b_2 = .38$

Since this is binary classification, apply the sigmoid activation function to get the predicted output:

$$\hat{y} = \sigma(z_2) = \frac{1}{1 + e^{-z_2}} = .594$$

3. Loss computation
Use the binary cross-entropy loss function (with true label $y = 1$):

$$L = -[y\log\hat{y} + (1 - y)\log(1 - \hat{y})] = .520$$

4. Backward propagation (gradient computation)
a. Gradient of the output layer: $\delta_2 = \hat{y} - y = .594 - 1 = -.406$
Compute the gradients of $W_2$ and $b_2$:
• $\nabla W_2 = a_1^T \delta_2 = [-.203\ \ 0\ \ -.061]^T$

10
Why?
The output of the network is 𝑦̂ = 𝜎(𝑧2 ), where 𝑧2 = 𝑎1 𝑊2 + 𝑏2 .
The error signal 𝛿2 is the gradient of the loss w.r.t. 𝑧2 :
𝜕𝐿
𝛿2 = .
𝜕𝑧2
Now, the gradient of the loss w.r.t. 𝑊2 is:
𝜕𝐿 𝜕𝐿 𝜕𝑧2
▽ 𝑊2 = = ∙ = 𝑎1𝑇 𝛿2
𝜕𝑊2 𝜕𝑧
⏟2 𝜕𝑊
⏟2
=𝛿2 =𝑎1

• $\nabla b_2 = \delta_2 = -.406$

Why?

$$\nabla b_2 = \frac{\partial L}{\partial b_2} = \underbrace{\frac{\partial L}{\partial z_2}}_{=\delta_2} \cdot \underbrace{\frac{\partial z_2}{\partial b_2}}_{=1} = \delta_2$$

Hidden layer gradient

[Figure: network diagram highlighting the $j$th hidden unit and the loss $l(f(x), y)$]

If the loss function $\phi(a)$ can be written in terms of the pre-activations $q_i(a)$ of the layer above, then

$$\frac{\partial \phi(a)}{\partial a} = \sum_i \frac{\partial \phi(a)}{\partial q_i(a)} \cdot \frac{\partial q_i(a)}{\partial a},$$

where $a$ is a unit in the layer.

Loss gradient at hidden layers

Consider the pre-activation (weighted sum) for layer $k$:

$$a^{(k)}(x)_i = b^{(k)}_i + \sum_j W^{(k)}_{i,j}\, h^{(k-1)}(x)_j,$$

where

• $h^{(k-1)}(x)$ is the activation from the previous layer $k-1$,
• $W^{(k)}$ is the weight matrix between layer $k-1$ and layer $k$,
• $b^{(k)}$ is the bias for layer $k$.

Let $f(\boldsymbol{x})_y$ be the softmax output for the true class $y$ and $\nabla_{a^{(k)}}$ be the gradient of the loss function w.r.t. the pre-activations at layer $k$. Considering each path from $h^{(k)}(\boldsymbol{x})_j$ into layer $k+1$,

$$\frac{\partial L}{\partial h^{(k)}(\boldsymbol{x})_j} = \frac{\partial}{\partial h^{(k)}(\boldsymbol{x})_j}\left(-\log f(\boldsymbol{x})_y\right) = \sum_i \frac{\partial\left(-\log f(\boldsymbol{x})_y\right)}{\partial a^{(k+1)}(\boldsymbol{x})_i} \cdot \underbrace{\frac{\partial a^{(k+1)}(\boldsymbol{x})_i}{\partial h^{(k)}(\boldsymbol{x})_j}}_{=W^{(k+1)}_{i,j}} = \left(\boldsymbol{W}^{(k+1)}_{\cdot,j}\right)^T\left(\nabla_{a^{(k+1)}(x)}\left(-\log f(\boldsymbol{x})_y\right)\right)$$

Gradient:

$$\nabla_{\boldsymbol{h}^{(k)}(\boldsymbol{x})}\left(-\log f(\boldsymbol{x})_y\right) = \boldsymbol{W}^{(k+1)T}\left(\nabla_{\boldsymbol{a}^{(k+1)}(\boldsymbol{x})}\left(-\log f(\boldsymbol{x})_y\right)\right)$$

Example (Backpropagation in a hidden layer)

Let’s take a simple neural network with:

• Input layer: 2 neurons

• Hidden layer: 2 neurons with ReLU activation

• Output layer: 2 neurons with softmax activation

12
Let's assume the following weight matrices and biases for simplicity:

• Weights from input to hidden layer (2x2 matrix):

$$W^{(1)} = \begin{bmatrix} .2 & .4 \\ -.3 & .1 \end{bmatrix}$$

• Weights from hidden to output layer (2x2 matrix):

$$W^{(2)} = \begin{bmatrix} .5 & .6 \\ -.4 & .2 \end{bmatrix}$$

Let's ignore biases for simplicity.

Forward propagation: Given an input vector $\boldsymbol{x} = [1\ \ .5]^T$,

• Hidden layer pre-activation:

$$z^{(1)} = W^{(1)}\boldsymbol{x} = [.4\ \ -.25]^T$$

• Hidden layer activation (ReLU):

$$h^{(1)} = ReLU(z^{(1)}) = [.4\ \ 0]^T$$

• Output layer pre-activation:

$$z^{(2)} = W^{(2)}h^{(1)} = [.2\ \ -.16]^T$$

• Output layer activation (softmax):

$$f(x) = softmax(z^{(2)}) = \frac{e^{z^{(2)}}}{\sum e^{z^{(2)}}} = [.589\ \ .411]^T$$

Now, let's assume that the true label is class 1 (the first class), so $y = 1$.

Then, the loss is calculated using the cross-entropy:

$$L = -\log(f(x)_1) = -\log(.589) = .529$$

Backward propagation:

• Gradient at output layer: The error at the output layer for softmax with cross-entropy is

$$\delta^{(2)} = f(x) - e(y) = [.589\ \ .411]^T - [1\ \ 0]^T = [-.411\ \ .411]^T$$

• Gradient for $W^{(2)}$:

$$\frac{\partial L}{\partial W^{(2)}} = \underbrace{\frac{\partial L}{\partial z^{(2)}}}_{\text{error at the output layer} \;\equiv\; \delta^{(2)}} \cdot \underbrace{\frac{\partial z^{(2)}}{\partial W^{(2)}}}_{=\frac{\partial}{\partial W^{(2)}}(W^{(2)}h^{(1)}) = h^{(1)}}$$

In matrix form, since $z^{(2)}_i = \sum_j W^{(2)}_{i,j} h^{(1)}_j$,

$$\nabla W^{(2)} = \delta^{(2)}\, h^{(1)T} = [-.411\ \ .411]^T\,[.4\ \ 0] = \begin{bmatrix} -.1644 & 0 \\ .1644 & 0 \end{bmatrix}$$

Loss gradient at hidden layers pre-activation

The $j$th activation $h^{(k)}(\boldsymbol{x})_j = g(a^{(k)}(\boldsymbol{x})_j)$ depends only on the pre-activation $a^{(k)}(\boldsymbol{x})_j$, where $a^{(k)}(\boldsymbol{x})_j = W^{(k)}_{j,:}\, h^{(k-1)}(x) + b^{(k)}_j$,

• $W^{(k)}_{j,:}$ is the row vector of weights connecting the previous layer to the current layer,
• $h^{(k-1)}(x)$ is the activation from the previous layer,

thus there is no sum:

$$\frac{\partial L}{\partial a^{(k)}(\boldsymbol{x})_j} = \frac{\partial}{\partial a^{(k)}(\boldsymbol{x})_j}\left(-\log f(\boldsymbol{x})_y\right) = \frac{\partial\left(-\log f(\boldsymbol{x})_y\right)}{\partial h^{(k)}(\boldsymbol{x})_j} \cdot \underbrace{\frac{\partial h^{(k)}(\boldsymbol{x})_j}{\partial a^{(k)}(\boldsymbol{x})_j}}_{=g'(a^{(k)}(\boldsymbol{x})_j)}$$

Gradient:

$$\nabla_{\boldsymbol{a}^{(k)}(\boldsymbol{x})}\left(-\log f(\boldsymbol{x})_y\right) = \nabla_{\boldsymbol{h}^{(k)}(\boldsymbol{x})}\left(-\log f(\boldsymbol{x})_y\right)\underbrace{\nabla_{\boldsymbol{a}^{(k)}(\boldsymbol{x})} h^{(k)}(\boldsymbol{x})^T}_{\substack{\text{Jacobian: diagonal matrix}\\ =\,diag(g'(a^{(k)}(x)))}} = \nabla_{\boldsymbol{h}^{(k)}(\boldsymbol{x})}\left(-\log f(\boldsymbol{x})_y\right) \underbrace{\odot}_{\substack{\text{element-wise}\\ \text{product}}} \left[\dots,\ g'(a^{(k)}(\boldsymbol{x})_j),\ \dots\right]$$
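In code, the two steps at a hidden layer (map the gradient back through $W^{(k+1)}$, then mask by $g'(a^{(k)})$) can be sketched as follows, using hypothetical numbers and ReLU units, with the convention $a^{(k+1)}_i = b_i + \sum_j W^{(k+1)}_{i,j} h^{(k)}_j$:

```python
import numpy as np

# Hypothetical layer sizes: layer k has 3 units, layer k+1 has 2 units
W_next = np.array([[.5, -.4, .1],
                   [.6, .2, -.3]])        # W^(k+1), shape (2, 3)
grad_a_next = np.array([-.4, .4])         # gradient at layer k+1 pre-activations
a_k = np.array([.4, -.25, .7])            # layer k pre-activations (ReLU units)

grad_h = W_next.T @ grad_a_next           # back through the weights: W^T grad
grad_a = grad_h * (a_k > 0)               # elementwise product with ReLU'(a^(k))
```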

Revisit the previous example:

• Gradient of the hidden layer:

Backpropagate the error to the hidden layer:

$$\delta_1 = \delta_2 W_2^T \odot ReLU'(z_1),$$

where $\delta_1$ is the error signal for the hidden layer, $\delta_2$ is the error signal for the output layer, and $W_2$ is the weight matrix between the hidden layer and the output layer.
The gradients of $W_1$ and $b_1$ are

$$\nabla W_1 = x^T \delta_1;\qquad \nabla b_1 = \delta_1$$
Pf)

• The hidden layer outputs $a_1 = ReLU(z_1)$, where $z_1 = xW_1 + b_1$.
• The output of the network is $z_2 = a_1 W_2 + b_2$.

Now we want to calculate the error signal $\delta_1$, which tells us how much the loss depends on the pre-activation $z_1$ of the hidden layer:

$$\delta_1 = \frac{\partial L}{\partial z_1} = \underbrace{\frac{\partial L}{\partial a_1}}_{=\delta_2 W_2^T} \odot \underbrace{\frac{\partial a_1}{\partial z_1}}_{=ReLU'(z_1)} = \delta_2 W_2^T \odot ReLU'(z_1),$$

where

$$\frac{\partial L}{\partial a_1} = \underbrace{\frac{\partial L}{\partial z_2}}_{=\delta_2} \cdot \underbrace{\frac{\partial z_2}{\partial a_1}}_{=W_2} = \delta_2 W_2^T$$

Since the error signal $\delta_2$ is the gradient of the loss w.r.t. $z_2$:

$$\delta_2 = \frac{\partial L}{\partial z_2}$$

and $z_2 = a_1 W_2 + b_2$, so

$$\frac{\partial z_2}{\partial a_1} = W_2$$

Next, the derivative of $a_1 = ReLU(z_1)$ w.r.t. $z_1$ is

$$\frac{\partial a_1}{\partial z_1} = ReLU'(z_1) = \begin{cases} 1, & z_1 > 0 \\ 0, & z_1 \le 0 \end{cases}$$
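Putting the whole 2-3-1 example together, here is a vectorized sketch in the example's row-vector convention; the commented values match the hand-derived gradients up to rounding:

```python
import numpy as np

# Row-vector convention from the example: z1 = x W1 + b1, z2 = a1 W2 + b2
x = np.array([[1.0, 0.5]])                 # input, shape (1, 2)
y = 1.0                                    # true label
W1 = np.array([[.2, -.4, .1],
               [.4, .3, -.5]])
b1 = np.array([[.1, -.2, .3]])
W2 = np.array([[.5], [-.3], [.2]])
b2 = np.array([[.1]])

# Forward pass
z1 = x @ W1 + b1                           # [[.5, -.45, .15]]
a1 = np.maximum(z1, 0)                     # ReLU: [[.5, 0, .15]]
z2 = a1 @ W2 + b2                          # [[.38]]
y_hat = 1 / (1 + np.exp(-z2))              # sigmoid: [[.594]]
loss = -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))   # .520

# Backward pass
delta2 = y_hat - y                         # [[-.406]]
grad_W2 = a1.T @ delta2                    # [[-.203], [0], [-.061]]
grad_b2 = delta2
delta1 = (delta2 @ W2.T) * (z1 > 0)        # [[-.203, 0, -.081]]
grad_W1 = x.T @ delta1
grad_b1 = delta1
```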

Parameter Gradient – Loss gradient of parameters

Partial derivative (weights):

$$\frac{\partial L}{\partial W^{(k)}_{i,j}} = \frac{\partial}{\partial W^{(k)}_{i,j}}\left(-\log f(\boldsymbol{x})_y\right) = \frac{\partial\left(-\log f(\boldsymbol{x})_y\right)}{\partial a^{(k)}(\boldsymbol{x})_i} \cdot \frac{\partial a^{(k)}(\boldsymbol{x})_i}{\partial W^{(k)}_{i,j}} = \frac{\partial\left(-\log f(\boldsymbol{x})_y\right)}{\partial a^{(k)}(\boldsymbol{x})_i} \cdot h^{(k-1)}_j(\boldsymbol{x}),$$

where $a^{(k)}(\boldsymbol{x})_i = b^{(k)}_i + \sum_j W^{(k)}_{i,j}\, h^{(k-1)}(\boldsymbol{x})_j$.

Gradient:

$$\underbrace{\nabla_{W^{(k)}}\left(-\log f(\boldsymbol{x})_y\right)}_{\text{matrix}} = \underbrace{\nabla_{\boldsymbol{a}^{(k)}(\boldsymbol{x})}\left(-\log f(\boldsymbol{x})_y\right)}_{\text{column vector}}\ \underbrace{\boldsymbol{h}^{(k-1)}(\boldsymbol{x})^T}_{\text{row vector}}$$

Partial derivative (biases):

$$\frac{\partial}{\partial b^{(k)}_i}\left(-\log f(\boldsymbol{x})_y\right) = \frac{\partial\left(-\log f(\boldsymbol{x})_y\right)}{\partial a^{(k)}(x)_i} \cdot \underbrace{\frac{\partial a^{(k)}(x)_i}{\partial b^{(k)}_i}}_{=1}$$

Reminder: $a^{(k)}(x)_i = b^{(k)}_i + \sum_j W^{(k)}_{i,j}\, h^{(k-1)}(x)_j$

• Gradient (biases):

$$\nabla_{b^{(k)}}\left(-\log f(\boldsymbol{x})_y\right) = \nabla_{a^{(k)}(x)}\left(-\log f(\boldsymbol{x})_y\right)$$

Example:
Let's work through an example of a neural network with the following structure:
• Input layer: 3 neurons
• Hidden layer 1: 5 neurons, using the ReLU activation function
• Hidden layer 2: 5 neurons, using the ReLU activation function
• Output layer: 2 neurons, using the softmax activation function (for multi-class
classification)
We'll go through the forward pass, calculate the loss, and then backpropagate to
compute the gradients for the weights.

• What is the neural network structure?
  o Input layer: $x = [x_1, x_2, x_3]$
  o Hidden layer 1: 5 neurons with weights $W^{(1)}$ (shape: 3x5) and biases $b^{(1)}$ (shape: 1x5)
  o Hidden layer 2: 5 neurons with weights $W^{(2)}$ (shape: 5x5) and biases $b^{(2)}$ (shape: 1x5)
  o Output layer: 2 neurons with weights $W^{(3)}$ (shape: 5x2) and biases $b^{(3)}$ (shape: 1x2)

Let's assume the weights and biases are initialized as follows:

$$W^{(1)} = \begin{bmatrix} .2 & -.1 & .4 & -.3 & .1 \\ -.2 & .5 & -.4 & .3 & -.1 \\ .1 & -.3 & .2 & .1 & -.2 \end{bmatrix},\qquad b^{(1)} = [.1\ \ .1\ \ .1\ \ .1\ \ .1]$$

$$W^{(2)} = \begin{bmatrix} .3 & -.2 & .5 & .1 & -.1 \\ .2 & .1 & -.3 & .2 & .1 \\ -.5 & .3 & .1 & -.1 & .4 \\ .1 & .2 & -.2 & .3 & -.4 \\ -.3 & .4 & .2 & -.1 & .2 \end{bmatrix},\qquad b^{(2)} = [.05\ \ .05\ \ .05\ \ .05\ \ .05]$$

$$W^{(3)} = \begin{bmatrix} .4 & -.5 \\ .3 & .2 \\ -.4 & .1 \\ .2 & -.3 \\ -.1 & .4 \end{bmatrix},\qquad b^{(3)} = [.2\ \ -.1]$$

Let's use the input vector $x = [1, .5, -1]^T$.

Forward propagation

• Hidden layer 1 pre-activation:

$$z^{(1)} = W^{(1)T}x + b^{(1)T} = [.05\ \ .65\ \ .05\ \ -.1\ \ .55]^T$$

• Hidden layer 1 activation (ReLU):

$$h^{(1)} = ReLU(z^{(1)}) = [.05\ \ .65\ \ .05\ \ 0\ \ .55]^T$$

• Hidden layer 2 pre-activation:

$$z^{(2)} = W^{(2)T}h^{(1)} + b^{(2)T} = [-.075\ \ .245\ \ .285\ \ .045\ \ .405]^T$$

• Hidden layer 2 activation (ReLU):

$$h^{(2)} = ReLU(z^{(2)}) = [0\ \ .245\ \ .285\ \ .045\ \ .405]^T$$

• Output layer pre-activation:

$$z^{(3)} = W^{(3)T}h^{(2)} + b^{(3)T} = [.128\ \ -.016]^T$$

• Output layer activation (softmax):

$$f(x)_1 = \frac{e^{.128}}{e^{.128} + e^{-.016}} = .535;\qquad f(x)_2 = \frac{e^{-.016}}{e^{.128} + e^{-.016}} = .465$$

Let's assume the true label is $y = [1, 0]$ (meaning the true class is class 1). Then, the cross-entropy loss is

$$L = -\sum_i y_i \log(f(x)_i) = -[1 \times \log(.535) + 0 \times \log(.465)] = .626$$

Backpropagation

• Gradient at the output layer: The gradient of the loss with respect to the output
layer activations 𝑓(𝑥) is
𝛿 (3) = 𝑓(𝑥) − 𝑦 = [. 535, .465]𝑇 − [1, 0]𝑇 = [−.465, .465]𝑇
• Gradient for 𝑊 (3) and 𝑏 (3)
The gradient of the loss with respect to the weights 𝑊 (3) is

18
𝑇
▽ 𝑊 (3) = ℎ(2) 𝛿 (3) = [0 . 245 . 285 . 045 . 405]𝑇 [−.465, .465]
0 0
−.114 . 114
= −.1325 . 1325
−.0209 . 0209
[−.1883 . 1883 ]
▽ 𝑏 (3) = 𝛿 (3) = [−.465, .465]𝑇
Now, Updated weights and biases for layer 3: Assuming a learning rate 𝜂 = .01,
𝑊 (3) = 𝑊 (3) − 𝜂 ×▽ 𝑊 (3)
𝑏 (3) = 𝑏 (3) − 𝜂 ×▽ 𝑏 (3)

• Gradient for hidden layer 2:

The gradient with respect to the pre-activation $z^{(2)}$ is

$$\delta^{(2)} = \big(W^{(3)}\delta^{(3)}\big) \odot g'(z^{(2)})$$

Here,

$$W^{(3)}\delta^{(3)} = \begin{bmatrix} .4 & -.5 \\ .3 & .2 \\ -.4 & .1 \\ .2 & -.3 \\ -.1 & .4 \end{bmatrix}[-.465\ \ .465]^T = \begin{bmatrix} -.4185 \\ -.0465 \\ .2325 \\ -.2325 \\ .2325 \end{bmatrix},\qquad g'(z^{(2)}) = g'\big([-.075\ \ .245\ \ .285\ \ .045\ \ .405]^T\big) = [0, 1, 1, 1, 1]^T$$

Therefore, the gradient at the second hidden layer's pre-activation becomes:

$$\delta^{(2)} = \big(W^{(3)}\delta^{(3)}\big) \odot g'(z^{(2)}) = [0\ \ -.0465\ \ .2325\ \ -.2325\ \ .2325]^T$$
• Gradient for $W^{(2)}$ and $b^{(2)}$:

$$\nabla W^{(2)} = h^{(1)}\delta^{(2)T} = [.05\ \ .65\ \ .05\ \ 0\ \ .55]^T\,[0\ \ -.0465\ \ .2325\ \ -.2325\ \ .2325] = \begin{bmatrix} 0 & -.002325 & .011625 & -.011625 & .011625 \\ 0 & -.030225 & .151125 & -.151125 & .151125 \\ 0 & -.002325 & .011625 & -.011625 & .011625 \\ 0 & 0 & 0 & 0 & 0 \\ 0 & -.025575 & .127875 & -.127875 & .127875 \end{bmatrix}$$

$$\nabla b^{(2)} = \delta^{(2)} = [0\ \ -.0465\ \ .2325\ \ -.2325\ \ .2325]^T$$
• Gradient for hidden layer 1:
Next, we backpropagate to the first hidden layer. The gradient with respect to the pre-activation $z^{(1)}$ is

$$\delta^{(1)} = \big(W^{(2)}\delta^{(2)}\big) \odot g'(z^{(1)}),$$

where

$$W^{(2)}\delta^{(2)} = [.07805\ \ -.02355\ \ .06745\ \ .0092\ \ -.00165]^T,\qquad g'(z^{(1)}) = g'\big([.05\ \ .65\ \ .05\ \ -.1\ \ .55]^T\big) = [1\ \ 1\ \ 1\ \ 0\ \ 1]^T$$

Thus,

$$\delta^{(1)} = \big(W^{(2)}\delta^{(2)}\big) \odot g'(z^{(1)}) = [.07805\ \ -.02355\ \ .06745\ \ 0\ \ -.00165]^T$$

• Gradient for $W^{(1)}$ and $b^{(1)}$:

$$\nabla W^{(1)} = x\,\delta^{(1)T} = [1, .5, -1]^T\,[.07805\ \ -.02355\ \ .06745\ \ 0\ \ -.00165] = \begin{bmatrix} .07805 & -.02355 & .06745 & 0 & -.00165 \\ .039025 & -.011775 & .033725 & 0 & -.000825 \\ -.07805 & .02355 & -.06745 & 0 & .00165 \end{bmatrix}$$

$$\nabla b^{(1)} = \delta^{(1)} = [.07805\ \ -.02355\ \ .06745\ \ 0\ \ -.00165]^T$$
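The same pattern generalizes to any depth. Below is a compact sketch of one forward/backward pass for a ReLU network with a softmax output, following this example's conventions ($z^{(k)} = W^{(k)T}h^{(k-1)} + b^{(k)}$, $\nabla W^{(k)} = h^{(k-1)}\delta^{(k)T}$); it is an illustration, not a reference implementation:

```python
import numpy as np

def forward_backward(x, y, Ws, bs):
    """One pass through a ReLU MLP with softmax output and cross-entropy loss.

    Ws[k] has shape (fan_in, fan_out), matching the example: z = W^T h + b.
    x and the entries of bs are 1-D arrays; y is the true class index.
    Returns the loss and per-layer weight/bias gradients in forward order.
    """
    hs, zs = [x], []
    for k, (W, b) in enumerate(zip(Ws, bs)):
        z = W.T @ hs[-1] + b
        zs.append(z)
        if k < len(Ws) - 1:
            hs.append(np.maximum(z, 0))            # ReLU hidden layers

    e = np.exp(zs[-1] - zs[-1].max())              # stabilized softmax
    f = e / e.sum()
    loss = -np.log(f[y])

    delta = f - np.eye(len(f))[y]                  # delta at output pre-activation
    gWs, gbs = [], []
    for k in range(len(Ws) - 1, -1, -1):
        gWs.append(np.outer(hs[k], delta))         # h^(k-1) delta^T, shape of W^(k)
        gbs.append(delta)
        if k > 0:
            delta = (Ws[k] @ delta) * (zs[k - 1] > 0)   # back through W and ReLU'
    return loss, gWs[::-1], gbs[::-1]
```

Calling it with the matrices above (as 1-D arrays for $x$ and the biases) reproduces the gradient formulas derived step by step in this example.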

Training NN – Regularization

In the SGD algorithm that performs updates after each example, for $N$ iterations with each training example $(\boldsymbol{x}^{(t)}, y^{(t)})$,

$$\Delta = -\nabla_{\boldsymbol{\theta}}\, l\big(f(\boldsymbol{x}^{(t)}; \boldsymbol{\theta}), y^{(t)}\big) - \lambda\nabla_{\boldsymbol{\theta}}\,\Omega(\boldsymbol{\theta}) = -\left(\nabla_{\boldsymbol{\theta}}\, l\big(f(\boldsymbol{x}^{(t)}; \boldsymbol{\theta}), y^{(t)}\big) + \lambda\nabla_{\boldsymbol{\theta}}\,\Omega(\boldsymbol{\theta})\right),$$

where $\Omega(\boldsymbol{\theta})$ is a regularizer with gradient $\nabla_{\boldsymbol{\theta}}\Omega(\boldsymbol{\theta})$, and $\lambda$ controls its strength.

• L2 regularization:

$$\Omega(\boldsymbol{\theta}) = \sum_k \sum_i \sum_j \left(W^{(k)}_{i,j}\right)^2 = \sum_k \left\|\boldsymbol{W}^{(k)}\right\|_F^2,$$

where the subscript $F$ stands for the Frobenius norm. The Frobenius norm is a matrix norm analogous to the Euclidean norm for vectors.
  o Gradient: $\nabla_{\boldsymbol{W}^{(k)}}\Omega(\boldsymbol{\theta}) = 2\boldsymbol{W}^{(k)}$
  o Only applied to weights, not to biases
  o Can be interpreted as having a Gaussian prior over the weights.

• L1 regularization, a.k.a. Lasso (Least Absolute Shrinkage and Selection Operator):

$$\Omega(\boldsymbol{\theta}) = \sum_k \sum_i \sum_j \left|W^{(k)}_{i,j}\right|$$

  o Gradient: $\nabla_{\boldsymbol{W}^{(k)}}\Omega(\boldsymbol{\theta}) = sign(\boldsymbol{W}^{(k)})$, where $sign(\boldsymbol{W}^{(k)}) = 1_{W^{(k)}_{i,j} > 0} - 1_{W^{(k)}_{i,j} < 0}$
  o Also only applied to weights
  o Unlike L2, L1 will push certain weights to be exactly 0 (a small sketch of both penalty gradients follows)
  o Can be interpreted as having a Laplacian prior over the weights.
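As referenced above, a minimal sketch of the two penalty gradients; the regularized SGD step then subtracts $\eta(\nabla l + \lambda\nabla\Omega)$:

```python
import numpy as np

def reg_gradient(W, kind="l2"):
    """Gradient of the regularizer for one weight matrix: 2W for L2, sign(W) for L1."""
    return 2 * W if kind == "l2" else np.sign(W)

# One regularized SGD step for a weight matrix W with loss gradient grad_loss:
#   W = W - eta * (grad_loss + lam * reg_gradient(W, kind))
```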

Example: Suppose we are training a linear regression model with two features. The model is:

$$y = w_1 x_1 + w_2 x_2 + b$$

The loss function without regularization is typically the mean squared error (MSE):

$$MSE = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$$
L1 regularization adds a penalty proportional to the absolute values of the weights. The regularized loss function becomes:

$$Loss = MSE + \lambda(|w_1| + |w_2|),$$

where $\lambda$ is a hyperparameter that controls the strength of the regularization. This penalty encourages the weights to shrink, and possibly one or both weights may be driven to zero.

Let's assume that partway through training the weights are $w_1 = .6$ and $w_2 = .2$, the regularization strength is $\lambda = .1$, and the learning rate is $\eta = .01$.

• Without regularization: The gradient descent update rule for the weights without regularization is based only on the gradient of the loss:

$$w_i = w_i - \eta\,\frac{\partial(MSE)}{\partial w_i}$$

For simplicity, assume that the gradients of the MSE with respect to $w_1$ and $w_2$ are .1 and .05, respectively. The weights are updated as follows:

$$w_1 = .6 - .01 \cdot .1 = .599;\qquad w_2 = .2 - .01 \cdot .05 = .1995$$

So the weights gradually decrease, but both remain non-zero.

• With L1 regularization: Now, when we apply L1 regularization, we add the gradient of the regularization term to the gradient of the loss function. The derivative of $|w|$ is the sign function, which is 1 if $w > 0$ and $-1$ if $w < 0$. Thus, the update rule becomes:

$$w_i = w_i - \eta\left(\frac{\partial(MSE)}{\partial w_i} + \lambda \cdot sign(w_i)\right)$$

For this example, the updates are:

$$w_1 = .6 - .01 \cdot (.1 + .1 \cdot 1) = .598;\qquad w_2 = .2 - .01 \cdot (.05 + .1 \cdot 1) = .1985$$

Here both weights decrease faster than they would without regularization; the penalty term $\lambda$ accelerates the shrinking.

22
• Effect of L1 over multiple iterations:
Now, suppose after a few more iterations, the weights continue shrinking:
𝑤1 = .05; 𝑤2 = .01
If we update again:
𝑤1 = .05 − .01 ∙ (. 1 + .1 ∙ 1) = .048; 𝑤2 = .01 − .01 ∙ (. 05 + .1 ∙ 1) = .0085
As training continues, 𝑤2 will eventually become smaller and smaller. When it
gets sufficiently close to zero, the regularization term dominates the update. If,
for example, 𝑤2 becomes .0001:
𝑤2 = .0001 − .01 ∙ (. 05 + .1 ∙ 1) = −.0014
At this point, 𝑤2 can be pushed to exactly 0 because of the regularization term.
Once 𝑤2 reaches zero, it will stay at zero in subsequent iterations, effectively
removing the contribution of feature 𝑥2 from the model.
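A small sketch of these updates, using a hypothetical helper that clips updates crossing zero (one common way to realize the "stays at zero" behavior described above):

```python
def l1_update(w, grad_mse, eta=.01, lam=.1):
    """One SGD step with an L1 penalty; updates that cross zero are clipped to 0."""
    sign_w = 1.0 if w > 0 else (-1.0 if w < 0 else 0.0)
    w_new = w - eta * (grad_mse + lam * sign_w)
    return 0.0 if w * w_new < 0 else w_new     # stop at zero instead of crossing it

print(l1_update(.6, .1))       # .598
print(l1_update(.2, .05))      # .1985
print(l1_update(.0001, .05))   # 0.0: the sign flip is clipped, and w stays at 0
```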

Q: Why do we use regularization? - Bias-variance trade-off

• Variance of the trained model: does it vary a lot if the training set changes?
• Bias of the trained model: is the average model close to the true solution?
• Generalization error can be seen as the sum of the (squared) bias and the variance.

Q: How to initialize the parameters?

• For biases:
  o Initialize all to 0
• For weights:
  o Can't initialize weights to 0 with the tanh function
  o Can't initialize all weights to the same value (need to break symmetry)
  o Sample $W^{(k)}_{i,j}$ from $Unif[-b, b]$, where $b = \frac{\sqrt{6}}{\sqrt{size(h^{(k)}(x)) + size(h^{(k-1)}(x))}}$; a minimal sampler is sketched below
    ▪ Other values of $b$ could work as well (not an exact science)
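As referenced above, a minimal sampler for this rule (a sketch; `fan_in` and `fan_out` play the roles of $size(h^{(k-1)}(x))$ and $size(h^{(k)}(x))$):

```python
import numpy as np

def init_layer(fan_in, fan_out, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    # b = sqrt(6) / sqrt(fan_in + fan_out): the uniform range from the rule above
    bound = np.sqrt(6.0) / np.sqrt(fan_in + fan_out)
    W = rng.uniform(-bound, bound, size=(fan_in, fan_out))
    b = np.zeros(fan_out)          # biases initialized to 0, as stated above
    return W, b

W1, b1 = init_layer(3, 5)          # e.g., a 3 -> 5 layer
```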
