
Note on Backpropagation

John Hull

Backpropagation is a way of using the chain rule to calculate derivatives of the mean squared error (or other objective function) with respect to the parameter values. For convenience we assume a single target. The mean squared error is given by:
$$
E = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2
$$

where there are n observations, 𝑦̂𝑖 is the value of the target for the
ith observation, and yi is the estimate of the target’s value given by
the neural network. If θ is the value of a parameter, then

$$
\frac{\partial E}{\partial \theta} = -\frac{2}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i) \frac{\partial y_i}{\partial \theta}
$$

We can therefore consider each observation separately, calculating 𝜕𝑦𝑖⁄𝜕θ, and at the end use this equation to get the partial derivative we are interested in.
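As a concrete illustration, the following is a minimal sketch in Python of how the per-observation derivatives combine into 𝜕𝐸⁄𝜕θ; the function name, the variable names and the use of NumPy are assumptions made for the example, not part of the note.

```python
import numpy as np

def dE_dtheta(y_hat, y, dy_dtheta):
    """Combine per-observation derivatives into dE/dtheta.

    y_hat     : array of the n target values
    y         : array of the n network estimates
    dy_dtheta : array of the n values of dy_i/dtheta for one parameter theta
    """
    n = len(y_hat)
    # dE/dtheta = -(2/n) * sum_i (y_hat_i - y_i) * dy_i/dtheta
    return -(2.0 / n) * np.sum((y_hat - y) * dy_dtheta)
```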
We start with the values of θ used to calculate the target 𝑦𝑖 at the end of the network and work back through the network considering other θ values. As in Chapter 6, we define L as the number of layers and K as the number of neurons per layer. The value at the kth neuron for layer l will be denoted by 𝑉𝑙𝑘 (1 ≤ k ≤ K; 1 ≤ l ≤ L).
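For concreteness, a short sketch of a forward pass that records the values 𝑉𝑙𝑘 is given below; the sigmoid activation, the weight matrices and the variable names are illustrative assumptions rather than choices made in the note.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(x, weights, biases):
    """Record the neuron values V[l][k] for one observation.

    x       : input vector
    weights : list of L matrices; weights[l] maps layer l (or the input when l = 0) to layer l + 1
    biases  : list of L bias vectors
    """
    V, v = [], x
    for W, b in zip(weights, biases):
        v = sigmoid(W @ v + b)  # value at each neuron of the next layer
        V.append(v)
    return V
```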
First we note that if θ is a parameter relating the output to the final layer, 𝜕𝑦𝑖⁄𝜕θ can be calculated without difficulty. If θ is a parameter relating the value at neuron k of the final layer to a neuron in the penultimate layer, we have from the chain rule


$$
\frac{\partial y_i}{\partial \theta} = \frac{\partial y_i}{\partial V_{Lk}} \frac{\partial V_{Lk}}{\partial \theta}
$$

Both 𝜕𝑦𝑖⁄𝜕𝑉𝐿𝑘 and 𝜕𝑉𝐿𝑘⁄𝜕θ can be calculated without difficulty.
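For example, if the output is taken to be a linear function of the final-layer values (an illustrative assumption; the weights 𝑤𝑘 and constant c are not part of the note's notation),

$$
y_i = c + \sum_{k=1}^{K} w_k V_{Lk}, \qquad \frac{\partial y_i}{\partial V_{Lk}} = w_k, \qquad \frac{\partial y_i}{\partial w_k} = V_{Lk}
$$

so the first factor is simply an output weight, and 𝜕𝑉𝐿𝑘⁄𝜕θ follows from the form of the activation function, as illustrated for a general layer below.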


Now let us consider the situation where the parameter θ relates the value at neuron k of layer l to a neuron in layer l−1 (l < L). Then

$$
\frac{\partial y_i}{\partial \theta} = \frac{\partial y_i}{\partial V_{lk}} \frac{\partial V_{lk}}{\partial \theta}
$$

The partial derivative 𝜕𝑉𝑙𝑘 ⁄𝜕θ can be calculated without difficulty.
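For instance, if θ is taken to be the weight 𝑤𝑗𝑘 connecting neuron j of layer l−1 to neuron k of layer l, and the activation function is f (an illustrative parameterization rather than the note's notation), then

$$
V_{lk} = f\left(b_{lk} + \sum_j w_{jk} V_{l-1,j}\right), \qquad \frac{\partial V_{lk}}{\partial w_{jk}} = f'\left(b_{lk} + \sum_j w_{jk} V_{l-1,j}\right) V_{l-1,j}
$$

and the derivative with respect to the bias 𝑏𝑙𝑘 is the same expression without the factor 𝑉𝑙−1,𝑗.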


We have to do a little more work to calculate 𝜕𝑦𝑖⁄𝜕𝑉𝑙𝑘. An
application of the chain rule gives
$$
\frac{\partial y_i}{\partial V_{lk}} = \sum_{k^*=1}^{K} \frac{\partial y_i}{\partial V_{l+1,k^*}} \frac{\partial V_{l+1,k^*}}{\partial V_{lk}}
$$

The partial derivative, 𝜕𝑉𝑙+1,𝑘∗⁄𝜕𝑉𝑙𝑘, can be calculated without difficulty for all k and k*. Because calculations start at the end of the network and work back, we have already calculated the values of 𝜕𝑦𝑖⁄𝜕𝑉𝑙+1,𝑘∗ for all k* by the time that we consider a θ that relates layer l−1 to layer l.
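A compact sketch of this backward recursion is given below; the sigmoid activation, the linear output and all variable names are illustrative assumptions used only to make the example runnable, not the note's own specification.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def backward(x, weights, biases, w_out, c_out):
    """Return the estimate y_i, the neuron values V and D[l] = dy_i/dV_l.

    weights[l], biases[l] map layer l (or the input when l = 0) to layer l + 1;
    the output is assumed linear: y_i = c_out + w_out . V_L.
    """
    # Forward pass: store pre-activation values A and neuron values V.
    A, V, v = [], [], x
    for W, b in zip(weights, biases):
        a = W @ v + b
        v = sigmoid(a)
        A.append(a)
        V.append(v)
    y = c_out + w_out @ V[-1]

    L = len(V)
    D = [None] * L
    D[L - 1] = np.asarray(w_out, dtype=float)  # dy_i/dV_{Lk} = w_k for a linear output
    # Work back through the network:
    # dy_i/dV_{lk} = sum_{k*} dy_i/dV_{l+1,k*} * dV_{l+1,k*}/dV_{lk},
    # where dV_{l+1,k*}/dV_{lk} = f'(a_{l+1,k*}) * (weight from neuron k to neuron k*).
    for l in range(L - 2, -1, -1):
        fprime = sigmoid(A[l + 1]) * (1.0 - sigmoid(A[l + 1]))
        D[l] = weights[l + 1].T @ (D[l + 1] * fprime)
    return y, V, D

# Illustrative usage with a 4-input network and two layers of 3 neurons each:
rng = np.random.default_rng(0)
weights = [rng.standard_normal((3, 4)), rng.standard_normal((3, 3))]
biases = [np.zeros(3), np.zeros(3)]
y, V, D = backward(rng.standard_normal(4), weights, biases, w_out=np.ones(3), c_out=0.0)
```

With these quantities in hand, 𝜕𝑦𝑖⁄𝜕θ for any weight or bias follows from the chain-rule expressions above, and 𝜕𝐸⁄𝜕θ from the averaging formula at the start of the note.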
Taken together, the equations we have presented provide a
fast way to calculate all the partial derivatives necessary for the
gradient descent algorithm.
