Machine Learning and Pattern Recognition Week 8 - Backprop
Derivatives for neural networks, and other functions with multiple parameters and stages
of computation, can be expressed by mechanical application of the chain rule. Computing
these derivatives efficiently requires ordering the computation carefully, and expressing each
step using matrix computations.
[The first parts of this note cover general material on differentiation. The symbols used (f, x, y, ...) do not have the machine learning meanings (such as model functions, inputs, or outputs) that they carry in the rest of the course notes.]
You should know how to differentiate simple expressions. For example,

f(x, y) = x²y  means that  ∂f/∂x = 2xy,  ∂f/∂y = x².    (1)
You should also know how to use these derivatives to predict how much a function will
change if its inputs are perturbed. Then you can run a check:
fn = @(x, y) (x.^2) * y; % Python: def fn(x, y): return (x**2) * y
xx = randn(); yy = randn(); hh = 1e-5;
2*xx*yy % analytic df/dx
(fn(xx+hh, yy) - fn(xx-hh, yy)) / (2*hh) % approximate df/dx
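The same check can be written in NumPy. This is a minimal sketch of the check above; fn, xx, yy, and hh simply mirror the Matlab names and are not part of any library:

import numpy as np

def fn(x, y):
    return (x**2) * y

xx, yy, hh = np.random.randn(), np.random.randn(), 1e-5
print(2*xx*yy)                                      # analytic df/dx
print((fn(xx+hh, yy) - fn(xx-hh, yy)) / (2*hh))     # central finite-difference approximation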
A function might be defined in terms of a series of computations. For example, the variables
x and y might be defined in terms of other quantities: x = r cos θ, and y = r sin θ. The chain
rule of differentiation gives the derivatives with respect to the earlier quantities:
∂f/∂r = (∂f/∂x)(∂x/∂r) + (∂f/∂y)(∂y/∂r),   and   ∂f/∂θ = (∂f/∂x)(∂x/∂θ) + (∂f/∂y)(∂y/∂θ).    (2)
A small change δr in r causes the function to change by about (∂f/∂r) δr. That change is caused by small changes in x and y of δx ≈ (∂x/∂r) δr and δy ≈ (∂y/∂r) δr.
You could write code for f (θ, r ) and find its derivatives by evaluating the expressions above.
You don’t need answers from me: you can check your derivatives by finite differences.
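For instance, here is a NumPy sketch of this check for ∂f/∂r, with f(x, y) = x²y, x = r cos θ, and y = r sin θ (the values of r and θ below are arbitrary):

import numpy as np

def f(r, th):
    x, y = r*np.cos(th), r*np.sin(th)
    return (x**2) * y

r, th, hh = 1.3, 0.7, 1e-5
x, y = r*np.cos(th), r*np.sin(th)
# chain rule: df/dr = (df/dx)(dx/dr) + (df/dy)(dy/dr)
dfdr = 2*x*y*np.cos(th) + (x**2)*np.sin(th)
print(dfdr)                                   # chain-rule value
print((f(r+hh, th) - f(r-hh, th)) / (2*hh))   # finite-difference check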
In the example above, you could also substitute the expressions for x (θ, r ) and y(θ, r ) into
the equation for f ( x, y), and then differentiate directly with respect to θ and r. You should
get the same answers. You might have found it easier to eliminate the intermediate quantities
x and y in this example: there was no need to reason about the chain rule. However, the
chain rule approach is better for larger computations, such as neural network functions,
because we can apply it mechanically.
In general, given a function f (x), where the vector of inputs is computed from another
vector u, the chain rule states:
∂f/∂u_i = ∑_{d=1}^{D} (∂f/∂x_d)(∂x_d/∂u_i).    (3)
When a quantity ui is used multiple times in a function computation, we sum over the effect
it has through each of the quantities that we compute directly from it.
        w
      ↗   ↘
u → v       y → z        (4)
      ↘   ↗
        x
Each child variable is computed as a function of its parents. As a running example, the
function z = exp(sin(u²) log(u²)) can be written as a series of elementary steps following
the graph above. We list the local functions below, along with the corresponding local
derivatives of each child variable with respect to its parents:
v = u²,  w = sin(v),  x = log(v),  y = wx,  z = exp(y).

∂v/∂u = 2u,  ∂w/∂v = cos(v),  ∂x/∂v = 1/v,  ∂y/∂w = x,  ∂y/∂x = w,  ∂z/∂y = exp(y) = z.
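Written as NumPy code, the forward computation of the function is just these elementary steps applied in order (a sketch; u = 1.7 is an arbitrary input, and any u ≠ 0 keeps log(v) defined):

import numpy as np

u = 1.7
v = u**2
w = np.sin(v)
x = np.log(v)
y = w * x
z = np.exp(y)
print(z, np.exp(np.sin(u**2) * np.log(u**2)))   # matches the direct computation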
Forward-mode differentiation computes the derivative of every variable in the graph with
respect to a scalar input. As the name suggests, we accumulate these derivatives in a forward
pass through the graph. Here we differentiate with respect to the input u, and notate these derivatives with a dot: θ̇ = ∂θ/∂u, where θ is any intermediate quantity. The chain rule of
differentiation gives us each derivative in terms of quantities we computed while computing
the function, a local derivative, and the derivatives that we have already computed for the
parent quantities:
v̇ = ∂v/∂u = 2u
ẇ = ∂w/∂u = (∂w/∂v)(∂v/∂u) = cos(v) v̇
ẋ = ∂x/∂u = (∂x/∂v)(∂v/∂u) = (1/v) v̇
ẏ = ∂y/∂u = (∂y/∂w)(∂w/∂u) + (∂y/∂x)(∂x/∂u) = x ẇ + w ẋ
ż = ∂z/∂u = (∂z/∂y)(∂y/∂u) = z ẏ
To compute the numerical value of each θ̇, we only need the derivatives of the elementary
function used at that stage, and the numerical values of the other quantities, which we have
already computed.
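Extending the sketch above, the forward-mode pass carries a dotted quantity alongside each variable, and the result can be checked against a finite difference (again a sketch; fn and hh are just names for this example):

import numpy as np

u = 1.7
v = u**2;       v_dot = 2*u
w = np.sin(v);  w_dot = np.cos(v) * v_dot
x = np.log(v);  x_dot = (1/v) * v_dot
y = w * x;      y_dot = x*w_dot + w*x_dot
z = np.exp(y);  z_dot = z * y_dot

def fn(u):
    return np.exp(np.sin(u**2) * np.log(u**2))

hh = 1e-5
print(z_dot)                              # forward-mode dz/du
print((fn(u+hh) - fn(u-hh)) / (2*hh))     # finite-difference check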
After computing the function, the additional computational cost of computing a derivative
can be less than the cost of computing the function. The final stage of our function gives
an example: ż = zẏ requires one floating-point multiply operation, whereas z = exp(y) usually has the cost of many floating-point operations. Propagating derivatives can also be
more expensive: y requires one multiply, whereas ẏ requires two multiplies and an addition.
However, for the elementary mathematical functions we use in code, propagating derivatives
is never much more expensive than the original function.
The computation of the derivatives can be done alongside the original function computation.
Only a constant factor of extra memory is required: we need to track a θ̇ derivative for every θ
intermediate quantity currently in memory. Because only derivatives of elementary functions
are needed, the process can be completely automated, and is then called forwards-mode
automatic differentiation (AD).2
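One way such automation can work is to carry (value, derivative) pairs through the computation. The Dual class below is a minimal illustrative sketch, not any particular AD library; only the operations needed for the running example are implemented:

import numpy as np

class Dual:
    """A (value, derivative) pair for forward-mode differentiation."""
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot
    def __mul__(self, other):
        # product rule: d(ab) = a_dot*b + a*b_dot
        return Dual(self.val*other.val, self.dot*other.val + self.val*other.dot)

def dsin(a): return Dual(np.sin(a.val), np.cos(a.val)*a.dot)
def dlog(a): return Dual(np.log(a.val), a.dot/a.val)
def dexp(a): return Dual(np.exp(a.val), np.exp(a.val)*a.dot)

u = Dual(1.7, 1.0)                     # seed: du/du = 1
z = dexp(dsin(u*u) * dlog(u*u))        # z = exp(sin(u^2) log(u^2))
print(z.val, z.dot)                    # function value and dz/du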
1. Caveats: 1) Computing functions does not normally require keeping the whole DAG in memory. Computing
derivatives as advocated in this note can sometimes have prohibitive memory costs. 2) DAGs are also used in
machine learning (but not in this course) to define probability distributions rather than deterministic functions as
here. The diagram here is not a probabilistic graphical model.
2. The (non-examinable) complex step trick is a hacky proof of concept: for some cases (analytic functions of
real-valued inputs) it tracks a small multiple of the θ̇ derivative quantities in the complex part of each number.
More general AD tools exist.
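The complex-step trick mentioned above fits in a couple of lines of NumPy. This sketch reuses the running example; it only works because exp, sin, and log are analytic and u is real:

import numpy as np

def fn(u):
    return np.exp(np.sin(u**2) * np.log(u**2))

u, hh = 1.7, 1e-20
print(np.imag(fn(u + 1j*hh)) / hh)   # approximates dz/du without cancellation error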
3. A literature on reducing the memory consumption of reverse-mode differentiation exists. There are tricks to
avoid storing everything, at the cost of more computation, including “check-pointing” and recomputing the inputs
of reversible operations from their outputs.
4. Backpropagation is the term from the neural network literature for reverse-mode differentiation. Backpropagation
is also sometimes used to mean the whole procedure of training a network with a gradient descent method, where
the gradients come from reverse-mode differentiation.
The chain rule gives a general equation6 for backpropagating matrix derivatives:
Āij = ∂z/∂Aij = ∑_{k,l} (∂z/∂Ckl)(∂Ckl/∂Aij) = ∑_{k,l} C̄kl (∂Ckl/∂Aij).    (7)
However, we shouldn’t evaluate all of the terms in this equation. If A and C are N × N matrices, there are N⁴ elements {∂Ckl/∂Aij} when considering all combinations of i, j, k, and l.
We’ve argued that derivatives can have the same computational scaling as the cost function,
but most matrix functions scale better than O(N⁴). Therefore, it is usually possible to
evaluate the sum above without explicitly computing all of its terms.
Given a standard matrix function, we don’t want to differentiate the function! That is, we
usually won’t compute the partial derivatives of all the outputs with respect to all the inputs.
Instead we derive a propagation rule that takes the derivatives of a scalar cost function with
respect to the output, C̄, and returns the derivatives with respect to the parent quantities: in the above example, Ā and B̄. These reverse-mode propagation rules are the building blocks for
differentiating larger matrix functions. By chaining them together we can differentiate any
function built up of primitives for which we know the propagation rules.
Standard results: The general backpropagation rule above simplifies for some standard matrix operations as follows:

For a matrix product C = AB:   Ā = C̄ B⊤  and  B̄ = A⊤ C̄.
For a sum C = A + B:   Ā = C̄  and  B̄ = C̄.
For an elementwise function C = g(A):   Ā = g′(A) ⊙ C̄, where g′ and ⊙ are applied elementwise.
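A quick NumPy check of the matrix-product rule Ā = C̄B⊤ (a sketch: C̄ is a random matrix standing in for the derivatives of some scalar cost, and the cost used here is chosen so that its derivatives with respect to C are exactly that matrix):

import numpy as np

rng = np.random.default_rng(0)
N = 4
A, B = rng.standard_normal((N, N)), rng.standard_normal((N, N))
C_bar = rng.standard_normal((N, N))        # stand-in for d(cost)/dC

def cost(A):
    C = A @ B
    return np.sum(C_bar * C)               # scalar cost with d(cost)/dC = C_bar

A_bar = C_bar @ B.T                        # claimed propagation rule

i, j, hh = 1, 2, 1e-5                      # finite-difference check of one element
E = np.zeros((N, N)); E[i, j] = hh
print(A_bar[i, j], (cost(A + E) - cost(A - E)) / (2*hh))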
5. How slow? It’s hardware and compiler dependent. The last time I compared my naive C code for a matrix
operation to a hardware-optimized BLAS implementation, the difference in speed was about 50×.
6. If A is a parent to more than one child in the graph, then Ā is the sum over this equation for all its children C.
7. In practice. There are algorithms with better scaling, but they have large constant factors.
8. Apologies: I’ve put the activations and hiddens for each example in columns rather than rows in this section.
The single example and mini-batch cases look more similar this way, but it’s probably less standard, and doesn’t
match the orientation of the design matrix, so I’d have to say H(0) = X⊤.
9. The vector of ones isn’t required in NumPy code, or recent Matlab, because these languages broadcast the
addition operator automatically. However, using the vector of ones makes the “+” a standard matrix addition, and
helps us get the derivatives correct.
10. Earlier in the notes (and later), when considering N scalar targets, I used y and f as N × 1 vectors. Whereas
here I just used f for a single K-dimensional network output. Another option is to use B × K matrices F and Y for
predictions and targets for B examples, even if K = 1. I’m afraid keeping track of the orientation and meaning of
matrices is generally a pain.
We can backpropagate these δ = ā error signals through the layer equations above by deriving
the updates from scratch, or by combining the standard backpropagation rules for matrix
operations:
b̄(l) = ā(l)                              b̄(l) = Ā(l) 1_B = ∑_{b=1}^{B} Ā(l)_{:,b}      (11)
W̄(l) = ā(l) h(l−1)⊤                       W̄(l) = Ā(l) H(l−1)⊤                           (12)
h̄(l−1) = W(l)⊤ ā(l)                       H̄(l−1) = W(l)⊤ Ā(l)                           (13)
ā(l−1) = g(l−1)′(a(l−1)) ⊙ h̄(l−1)         Ā(l−1) = g(l−1)′(A(l−1)) ⊙ H̄(l−1)             (14)
We need to derive or look up g(l)′, the derivative of the non-linearity in the lth layer.
We obtain the gradients for all of the parameters in the layer, and a new error signal to
backpropagate to the previous layer.
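A minimal NumPy sketch of this backward step for a single example follows. It assumes the usual layer equations a(l) = W(l) h(l−1) + b(l) and h(l) = g(l)(a(l)), and picks a logistic sigmoid as a concrete non-linearity; the function and variable names are mine, not from any library:

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def backprop_layer(a_bar, W, h_prev, a_prev):
    """Backpropagate a_bar = d(cost)/da through one layer with a = W h_prev + b,
    h_prev = sigmoid(a_prev). Returns parameter gradients and the new error signal."""
    b_bar = a_bar                                      # eq (11)
    W_bar = np.outer(a_bar, h_prev)                    # eq (12): a_bar h_prev^T
    h_prev_bar = W.T @ a_bar                           # eq (13)
    g_prime = sigmoid(a_prev) * (1 - sigmoid(a_prev))  # sigmoid derivative
    a_prev_bar = g_prime * h_prev_bar                  # eq (14)
    return W_bar, b_bar, a_prev_bar

# tiny usage example with random values
rng = np.random.default_rng(1)
K, D = 3, 5                                            # layer output and input sizes
W = rng.standard_normal((K, D))
a_prev = rng.standard_normal(D)
h_prev = sigmoid(a_prev)
a_bar = rng.standard_normal(K)                         # error signal from the layer above
W_bar, b_bar, a_prev_bar = backprop_layer(a_bar, W, h_prev, a_prev)
print(W_bar.shape, b_bar.shape, a_prev_bar.shape)      # (3, 5) (3,) (5,)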
• You could continue the previous question to get x̄ and ȳ. You should then also be able to immediately write down the x̄ rule for c = x⊤Ax. You should be able to check for yourself whether you got it right; a finite-difference checking sketch is given below.
• Given a rule to compute B̄ from C̄, how could you compute a derivative of the local function ∂Ckl/∂Bij if you wanted to know it? Hint: in footnote 11. You should know why we don’t usually compute all of these derivatives.
11. C̄ contains derivatives of some scalar function f (C ). You can choose what that function is. Write out notation
for what we want, everything we know, and things we can set. Then this question shouldn’t be too hard.
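For the first bullet point, here is a NumPy sketch of such a finite-difference check (check_grad and xbar_fn are names I’ve made up; swap the xbar_fn line for your own proposed rule):

import numpy as np

def check_grad(cost_fn, grad_fn, x, hh=1e-5):
    """Largest discrepancy between an analytic gradient and central finite differences."""
    fd = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x); e[i] = hh
        fd[i] = (cost_fn(x + e) - cost_fn(x - e)) / (2*hh)
    return np.max(np.abs(fd - grad_fn(x)))

rng = np.random.default_rng(2)
D = 4
A = rng.standard_normal((D, D))
x = rng.standard_normal(D)
cost_fn = lambda x: x @ A @ x            # c = x^T A x
xbar_fn = lambda x: (A + A.T) @ x        # replace with the rule you derived
print(check_grad(cost_fn, xbar_fn, x))   # tiny (e.g. ~1e-9) if the rule is right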
7 Further Reading
For keen students:
Reverse mode differentiation and its automatic application to code is an old idea:
Who invented the reverse mode of differentiation?, Andreas Griewank, Optimization Stories,
Documenta Mathematica, Extra Volume ISMP (2012), pp. 389–400.
Automatic differentiation in machine learning: a survey. Atilim Gunes Baydin, Barak A.
Pearlmutter, Alexey Andreyevich Radul, and Jeffrey Mark Siskind (2015). A good tutorial
overview of automatic differentiation and some common misconceptions.
Although the Baydin et al. survey has “machine learning” in the title, early versions didn’t cover at all the real reasons that adoption in machine learning has been so slow: software
tools that work easily with the sort of code written by machine learning researchers didn’t
exist. Even now (version 4), the survey has little on using efficient linear algebra operations.
Theano wasn’t as general as traditional AD compilers, but it did most of what people wanted and was easy to use, so it revolutionized how machine learning code was written. Now the gap between what’s possible and what’s easy needs closing further.
What if we want machine learning tools to automatically pass derivatives through a useful
function of a matrix, like the Cholesky decomposition, that isn’t commonly used in neural
networks? Iain recently wrote a tutorial note https://arxiv.org/abs/1602.07527 discussing
the options. Matrix-based approaches from this note were rapidly adopted by several tools,
including TensorFlow, Autograd, MXNet, PyTorch, and Theano. However, some mainstream machine learning packages still can’t differentiate some reasonably common linear algebra operations. But the situation is improving.