6.034f Neural Net Notes October 28, 2010
With y = 1/(1 + e^{-x}), the sigmoid's output, differentiating with respect to the input x gives:

\begin{aligned}
\frac{dy}{dx} &= \frac{d}{dx}\left(\frac{1}{1+e^{-x}}\right) \\
&= \frac{d}{dx}(1+e^{-x})^{-1} \\
&= -1 \times (1+e^{-x})^{-2} \times e^{-x} \times -1 \\
&= \frac{1}{1+e^{-x}} \times \frac{e^{-x}}{1+e^{-x}} \\
&= \frac{1}{1+e^{-x}} \times \frac{1+e^{-x}-1}{1+e^{-x}} \\
&= \frac{1}{1+e^{-x}} \times \left(\frac{1+e^{-x}}{1+e^{-x}} - \frac{1}{1+e^{-x}}\right) \\
&= y(1-y)
\end{aligned}
Thus, remarkably, the derivative of the output with respect to the input is expressed as a simple
function of the output.
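If you want to see the identity at work, the following short Python sketch (not part of the original notes; the function name and test values are arbitrary) compares the analytic derivative y(1 − y) with a central-difference approximation:

    import math

    def sigmoid(x):
        # The sigmoid, y = 1 / (1 + e^(-x)).
        return 1.0 / (1.0 + math.exp(-x))

    h = 1e-6  # step for the central-difference approximation
    for x in (-2.0, 0.0, 1.5):
        y = sigmoid(x)
        numerical = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
        analytic = y * (1 - y)  # the identity derived above
        print(f"x={x:5.2f}  numerical={numerical:.8f}  analytic={analytic:.8f}")

The two columns agree to roughly the precision of the finite-difference step.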
Performance
Performance is measured with the following function:

P = -\frac{1}{2}\left(d_{\text{sample}} - o_{\text{sample}}\right)^2

where P is the performance function, d_sample is the desired output for some specific sample, and o_sample is the observed output for that sample. From this point forward, assume that d and o are the desired and observed outputs for a specific sample so that we need not drag a subscript around as we work through the algebra.
The reason for choosing the given formula for P is that the formula has convenient properties.
The formula yields a maximum at o = d and monotonically decreases as o deviates from d . Moreover,
the derivative of P with respect to o is simple:
\begin{aligned}
\frac{dP}{do} &= \frac{d}{do}\left[-\frac{1}{2}(d-o)^2\right] \\
&= -\frac{2}{2} \times (d-o)^1 \times -1 \\
&= d - o
\end{aligned}
Gradient ascent
Backpropagation is a specialization of the idea of gradient ascent. You try to find the maximum of a performance function P by changing the weights associated with neurons, so you move in the direction of the gradient in a space that gives P as a function of the weights, w. That is, you move in the direction of most rapid ascent by taking a step whose components are governed by the following formula, which shows how much to change a weight, w, in terms of a partial derivative:

\Delta w \propto \frac{\partial P}{\partial w}
The actual change is influenced by a rate constant, α; accordingly, the new weight, w', is given by the following:

w' = w + \alpha \times \frac{\partial P}{\partial w}
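As an illustration only (not from the notes), here is a minimal gradient-ascent sketch on the one-dimensional performance function P(w) = −½(d − w)², whose derivative is d − w; repeated steps move w toward the maximum at w = d. The target, starting weight, and rate constant are arbitrary choices:

    # Gradient ascent on P(w) = -0.5 * (d - w)**2, for which dP/dw = d - w.
    d = 0.9       # value at which P is maximized (arbitrary)
    w = 0.1       # initial weight (arbitrary)
    alpha = 0.25  # rate constant (arbitrary)

    for step in range(20):
        dP_dw = d - w          # derivative of the performance function
        w = w + alpha * dP_dw  # step in the direction of the gradient
    print(w)  # close to d = 0.9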
Gradient descent
If the performance function were \frac{1}{2}(d_{\text{sample}} - o_{\text{sample}})^2 instead of -\frac{1}{2}(d_{\text{sample}} - o_{\text{sample}})^2, then you would be searching for the minimum rather than the maximum of P, and the change in w would be subtracted from w instead of added, so w' would be w - \alpha \times \frac{\partial P}{\partial w} instead of w + \alpha \times \frac{\partial P}{\partial w}. The two sign changes, one in the performance function and the other in the update formula, cancel, so in the end you get the same result whether you use gradient ascent, as I prefer, or gradient descent.
The simplest net
[Figure: the simplest net, two sigmoid neurons in series. The left neuron multiplies its input i_l by the weight w_l to form the product p_l and passes the result through a sigmoid to produce the output o_l; the right neuron does the same with i_r, w_r, p_r, and o_r.]
Note that the subscripts indicate layer. Thus, i_l, w_l, p_l, and o_l are the input, weight, product, and output associated with the neuron on the left, while i_r, w_r, p_r, and o_r are the input, weight, product, and output associated with the neuron on the right. Of course, o_l = i_r.
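To keep the notation concrete, here is a small forward-pass sketch in Python (mine, not the notes'), using the same names as the figure; the input and weight values are arbitrary:

    import math

    def sigmoid(x):
        return 1.0 / (1.0 + math.exp(-x))

    i_l, w_l, w_r = 1.0, 0.5, -0.3  # arbitrary input and weights

    p_l = w_l * i_l     # product formed inside the left neuron
    o_l = sigmoid(p_l)  # output of the left neuron
    i_r = o_l           # the right neuron's input is the left neuron's output
    p_r = w_r * i_r     # product formed inside the right neuron
    o_r = sigmoid(p_r)  # output of the right neuron, which determines P
    print(o_l, o_r)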
Suppose that the output of the right neuron, o_r, is the value that determines performance P. To compute the partial derivative of P with respect to the weight in the right neuron, w_r, you need the chain rule, which allows you to compute partial derivatives of one variable with respect to another in terms of an intermediate variable. In particular, for w_r, you have the following, taking o_r to be the intermediate variable:

\frac{\partial P}{\partial w_r} = \frac{\partial P}{\partial o_r} \times \frac{\partial o_r}{\partial w_r}
Now, you can repeat, using the chain rule to turn \frac{\partial o_r}{\partial w_r} into \frac{\partial o_r}{\partial p_r} \times \frac{\partial p_r}{\partial w_r}:

\frac{\partial P}{\partial w_r} = \frac{\partial P}{\partial o_r} \times \frac{\partial o_r}{\partial p_r} \times \frac{\partial p_r}{\partial w_r}
Conveniently, you have seen two of the derivatives already, and the third, \frac{\partial p_r}{\partial w_r} = \frac{\partial (w_r \times o_l)}{\partial w_r}, is easy to compute:

\frac{\partial P}{\partial w_r} = [(d - o_r)] \times [o_r(1 - o_r)] \times [i_r]
Repeating the analysis for w_l yields the following. Each line is the same as the previous one, except that one more partial derivative is expanded using the chain rule:
\begin{aligned}
\frac{\partial P}{\partial w_l} &= \frac{\partial P}{\partial o_r} \times \frac{\partial o_r}{\partial w_l} \\
&= \frac{\partial P}{\partial o_r} \times \frac{\partial o_r}{\partial p_r} \times \frac{\partial p_r}{\partial w_l} \\
&= \frac{\partial P}{\partial o_r} \times \frac{\partial o_r}{\partial p_r} \times \frac{\partial p_r}{\partial o_l} \times \frac{\partial o_l}{\partial w_l} \\
&= \frac{\partial P}{\partial o_r} \times \frac{\partial o_r}{\partial p_r} \times \frac{\partial p_r}{\partial o_l} \times \frac{\partial o_l}{\partial p_l} \times \frac{\partial p_l}{\partial w_l} \\
&= [(d - o_r)] \times [o_r(1 - o_r)] \times [w_r] \times [o_l(1 - o_l)] \times [i_l]
\end{aligned}
Thus, the derivative consists of products of terms that have already been computed and terms in the vicinity of w_l. This is clearer if you write the two derivatives next to one another:

\begin{aligned}
\frac{\partial P}{\partial w_r} &= (d - o_r) \times o_r(1 - o_r) \times i_r \\
\frac{\partial P}{\partial w_l} &= (d - o_r) \times o_r(1 - o_r) \times w_r \times o_l(1 - o_l) \times i_l
\end{aligned}
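You can check these formulas numerically. The sketch below (an illustration, not part of the notes) evaluates both expressions and compares them with finite differences of P = −½(d − o_r)²; the values chosen for the input, weights, and desired output are arbitrary:

    import math

    def sigmoid(x):
        return 1.0 / (1.0 + math.exp(-x))

    def forward(i_l, w_l, w_r):
        # Forward pass through the two-neuron net; returns (o_l, o_r).
        o_l = sigmoid(w_l * i_l)
        o_r = sigmoid(w_r * o_l)
        return o_l, o_r

    def performance(i_l, w_l, w_r, d):
        o_l, o_r = forward(i_l, w_l, w_r)
        return -0.5 * (d - o_r) ** 2

    i_l, w_l, w_r, d = 1.0, 0.5, -0.3, 0.9  # arbitrary values
    o_l, o_r = forward(i_l, w_l, w_r)
    i_r = o_l

    # Partial derivatives from the derived formulas.
    dP_dwr = (d - o_r) * o_r * (1 - o_r) * i_r
    dP_dwl = (d - o_r) * o_r * (1 - o_r) * w_r * o_l * (1 - o_l) * i_l

    # Central-difference approximations for comparison.
    h = 1e-6
    num_dwr = (performance(i_l, w_l, w_r + h, d) - performance(i_l, w_l, w_r - h, d)) / (2 * h)
    num_dwl = (performance(i_l, w_l + h, w_r, d) - performance(i_l, w_l - h, w_r, d)) / (2 * h)

    print(dP_dwr, num_dwr)
    print(dP_dwl, num_dwl)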
You can simplify the equations by defining δs as follows, where each delta is associated with either
the left or right neuron:
\begin{aligned}
\delta_r &= o_r(1 - o_r) \times (d - o_r) \\
\delta_l &= o_l(1 - o_l) \times w_r \times \delta_r
\end{aligned}
Then, you can write the partial derivatives with the δs:
\begin{aligned}
\frac{\partial P}{\partial w_r} &= i_r \times \delta_r \\
\frac{\partial P}{\partial w_l} &= i_l \times \delta_l
\end{aligned}
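Written as code (again, only a sketch with arbitrary values), the δ form makes the reuse explicit: δ_r is computed once and then feeds both the right neuron's partial derivative and δ_l:

    import math

    def sigmoid(x):
        return 1.0 / (1.0 + math.exp(-x))

    i_l, w_l, w_r, d = 1.0, 0.5, -0.3, 0.9  # arbitrary values
    o_l = sigmoid(w_l * i_l)
    i_r = o_l
    o_r = sigmoid(w_r * i_r)

    delta_r = o_r * (1 - o_r) * (d - o_r)      # depends only on the right neuron
    delta_l = o_l * (1 - o_l) * w_r * delta_r  # reuses delta_r from the layer to the right

    dP_dwr = i_r * delta_r
    dP_dwl = i_l * delta_l
    print(dP_dwr, dP_dwl)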
If you add more layers to the front of the network, each weight has a partial derivative that is computed like the partial derivative of the weight of the left neuron. That is, each has a partial
derivative determined by its input and its delta, where its delta in turn is determined by its output,
the weight to its right, and the delta to its right. Thus, for the weights in the final layer, you compute
the change as follows, where I use f as the subscript instead of r to emphasize that the computation
is for the neuron in the final layer:
\Delta w_f = \alpha \times i_f \times \delta_f
where
\delta_f = o_f(1 - o_f) \times (d - o_f)

For the weights in the other layers, you compute the change as follows:

\Delta w_l = \alpha \times i_l \times \delta_l
where
\delta_l = o_l(1 - o_l) \times w_r \times \delta_r
So far, each layer has contained only one neuron. When the final layer contains several neurons, then for each weight, w, in the final-layer neuron f_k, you compute the change as follows from the input corresponding to the weight and from the δ associated with the neuron:

\begin{aligned}
\Delta w &= \alpha \times i \times \delta_{f_k} \\
\delta_{f_k} &= o_{f_k}(1 - o_{f_k}) \times (d_k - o_{f_k})
\end{aligned}
Note that the output of each final-layer neuron is subtracted from the output desired for that neuron.
For other layers, there may also be many neurons, and the output of each may influence all the
neurons in the next layer to the right. The change in weight has to account for what happens to all
of those neurons to the right, so a summation appears, but otherwise you compute the change, as
before, from the input corresponding to the weight and from the δ associated with the neuron:
\begin{aligned}
\Delta w &= \alpha \times i \times \delta_{l_i} \\
\delta_{l_i} &= o_{l_i}(1 - o_{l_i}) \times \sum_j w_{l_i \to r_j} \times \delta_{r_j}
\end{aligned}

Note that w_{l_i \to r_j} is the weight that connects the jth right-side neuron to the output of the ith left-side neuron.
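To make the summation concrete, here is a minimal sketch (not from the notes) of one backward pass through a small net with two neurons in each of two layers; the input, weights, desired outputs, and rate constant are all arbitrary. Each left-side δ sums over the right-side neurons that its output feeds:

    import math

    def sigmoid(x):
        return 1.0 / (1.0 + math.exp(-x))

    alpha = 0.5            # rate constant (arbitrary)
    x = 1.0                # the single external input (arbitrary)
    w_left = [0.4, -0.6]   # one weight per left-side neuron
    W = [[0.3, -0.2],      # W[i][j]: weight from left neuron i to right neuron j
         [0.7,  0.1]]
    d = [1.0, 0.0]         # desired outputs for the two final-layer neurons

    # Forward pass.
    o_left = [sigmoid(w * x) for w in w_left]
    o_right = [sigmoid(sum(o_left[i] * W[i][j] for i in range(2))) for j in range(2)]

    # Final-layer deltas: output * (1 - output) * (desired - output).
    delta_right = [o_right[j] * (1 - o_right[j]) * (d[j] - o_right[j]) for j in range(2)]

    # Left-layer deltas: sum over the right-side neurons each output feeds.
    delta_left = [o_left[i] * (1 - o_left[i]) * sum(W[i][j] * delta_right[j] for j in range(2))
                  for i in range(2)]

    # Weight changes: alpha * (input to the weight) * (delta of the weight's neuron).
    dW = [[alpha * o_left[i] * delta_right[j] for j in range(2)] for i in range(2)]
    dw_left = [alpha * x * delta_left[i] for i in range(2)]

    print(dW)
    print(dw_left)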
Summary
Once you understand how to derive the formulas, you can combine and simplify them in preparation for solving problems. For each weight, you compute the weight's change from the input corresponding to the weight and from the δ associated with the neuron. Assuming that δ is the delta associated with that neuron, you have the following, where w_{\to r_j} is the weight connecting the output of the neuron you are working on, the ith left-side neuron, to the jth right-side neuron, and δ_{r_j} is the δ associated with that right-side neuron:

\Delta w = \alpha \times i \times \delta

\delta = o(1 - o) \times (d - o) \quad \text{for a neuron in the final layer}

\delta = o(1 - o) \times \sum_j w_{\to r_j} \times \delta_{r_j} \quad \text{for a neuron in any other layer}
That is, you compute the change in a neuron's w, in every layer, by multiplying α times the
neuron’s input times its δ. The δ is determined for all but the final layer in terms of the neuron’s
output and all the weights that connect that output to neurons in the layer to the right and the δs
associated with those right-side neurons. The δ for each neuron in the final layer is determined only
by the output of that neuron and by the difference between the desired output and the actual output
of that neuron.
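Putting the summary to work, a minimal training loop for the two-neuron net might look like the sketch below (an illustration only; the sample, initial weights, rate constant, and number of iterations are arbitrary): repeat the forward pass, compute the δs from right to left, then add α × input × δ to every weight:

    import math

    def sigmoid(x):
        return 1.0 / (1.0 + math.exp(-x))

    i_l, d = 1.0, 0.9     # one training sample: input and desired output (arbitrary)
    w_l, w_r = 0.5, -0.3  # initial weights (arbitrary)
    alpha = 1.0           # rate constant (arbitrary)

    for step in range(2000):
        # Forward pass.
        o_l = sigmoid(w_l * i_l)
        o_r = sigmoid(w_r * o_l)
        # Deltas, computed from right to left.
        delta_r = o_r * (1 - o_r) * (d - o_r)
        delta_l = o_l * (1 - o_l) * w_r * delta_r
        # Weight updates: alpha * input * delta.
        w_r += alpha * o_l * delta_r
        w_l += alpha * i_l * delta_l

    print(o_r)  # moves toward d = 0.9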
[Figure: a left-side neuron with input i computes a weighted sum (Σ) followed by a sigmoid (∫) to produce an output o; that output feeds, through the weights w→r1 ... w→rN, into N right-side neurons whose deltas are δ1 ... δN.]
MIT OpenCourseWare
http://ocw.mit.edu
For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.