Machine Learning and Pattern Recognition Week 8 - Backprop
Derivatives for neural networks, and other functions with multiple parameters and stages
of computation, can be expressed by mechanical application of the chain rule. Computing
these derivatives efficiently requires ordering the computation carefully, and expressing each
step using matrix computations.
[The first parts of this note cover general material on differentiation. The symbols used (f, x, y, ...) do not have the machine learning meanings (such as model functions, inputs, or outputs) that they carry in the rest of the course notes.]
You should know how to differentiate simple expressions. For example,

f(x, y) = x²y  means that  ∂f/∂x = 2xy,  ∂f/∂y = x².    (1)
You should also know how to use these derivatives to predict how much a function will
change if its inputs are perturbed. Then you can run a check:
fn = @(x, y) (x.^2) * y; % Python: def fn(x, y): return (x**2) * y
xx = randn(); yy = randn(); hh = 1e-5;
2*xx*yy % analytic df/dx
(fn(xx+hh, yy) - fn(xx-hh, yy)) / (2*hh) % approximate df/dx
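The same check can be written in NumPy. This is a minimal sketch of the check above; fn, xx, yy, and hh simply mirror the Matlab names and are not part of any library:

import numpy as np

def fn(x, y):
    return (x**2) * y

xx, yy, hh = np.random.randn(), np.random.randn(), 1e-5
print(2*xx*yy)                                      # analytic df/dx
print((fn(xx+hh, yy) - fn(xx-hh, yy)) / (2*hh))     # central finite-difference approximation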
A function might be defined in terms of a series of computations. For example, the variables
x and y might be defined in terms of other quantities: x = r cos θ, and y = r sin θ. The chain
rule of differentiation gives the derivatives with respect to the earlier quantities:
∂f/∂r = (∂f/∂x)(∂x/∂r) + (∂f/∂y)(∂y/∂r),   and   ∂f/∂θ = (∂f/∂x)(∂x/∂θ) + (∂f/∂y)(∂y/∂θ).    (2)
A small change δr in r causes the function to change by about (∂f/∂r) δr. That change is caused by small changes in x and y of δx ≈ (∂x/∂r) δr and δy ≈ (∂y/∂r) δr.
You could write code for f (θ, r ) and find its derivatives by evaluating the expressions above.
You don’t need answers from me: you can check your derivatives by finite differences.
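For instance, here is a NumPy sketch of this check for ∂f/∂r, with f(x, y) = x²y, x = r cos θ, and y = r sin θ (the values of r and θ below are arbitrary):

import numpy as np

def f(r, th):
    x, y = r*np.cos(th), r*np.sin(th)
    return (x**2) * y

r, th, hh = 1.3, 0.7, 1e-5
x, y = r*np.cos(th), r*np.sin(th)
# chain rule: df/dr = (df/dx)(dx/dr) + (df/dy)(dy/dr)
dfdr = 2*x*y*np.cos(th) + (x**2)*np.sin(th)
print(dfdr)                                   # chain-rule value
print((f(r+hh, th) - f(r-hh, th)) / (2*hh))   # finite-difference check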
In the example above, you could also substitute the expressions for x (θ, r ) and y(θ, r ) into
the equation for f ( x, y), and then differentiate directly with respect to θ and r. You should
get the same answers. You might have found it easier to eliminate the intermediate quantities
x and y in this example: there was no need to reason about the chain rule. However, the
chain rule approach is better for larger computations, such as neural network functions,
because we can apply it mechanically.
In general, given a function f (x), where the vector of inputs is computed from another
vector u, the chain rule states:
∂f/∂u_i = ∑_{d=1}^{D} (∂f/∂x_d)(∂x_d/∂u_i).    (3)
When a quantity ui is used multiple times in a function computation, we sum over the effect
it has through each of the quantities that we compute directly from it.
        w
      ↗   ↘
u → v       y → z        (4)
      ↘   ↗
        x
Each child variable is computed as a function of its parents. As a running example, the
function z = exp(sin(u²) log(u²)) can be written as a series of elementary steps following
the graph above. We list the local functions below, along with the corresponding local
derivatives of each child variable with respect to its parents:
v = u²,  w = sin(v),  x = log(v),  y = wx,  z = exp(y).

∂v/∂u = 2u,  ∂w/∂v = cos(v),  ∂x/∂v = 1/v,  ∂y/∂w = x,  ∂y/∂x = w,  ∂z/∂y = exp(y) = z.
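Written as NumPy code, the forward computation of the function is just these elementary steps applied in order (a sketch; u = 1.7 is an arbitrary input, and any u ≠ 0 keeps log(v) defined):

import numpy as np

u = 1.7
v = u**2
w = np.sin(v)
x = np.log(v)
y = w * x
z = np.exp(y)
print(z, np.exp(np.sin(u**2) * np.log(u**2)))   # matches the direct computation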
Forward-mode differentiation computes the derivative of every variable in the graph with
respect to a scalar input. As the name suggests, we accumulate these derivatives in a forward
pass through the graph. Here we differentiate with respect to the input u, and notate these derivatives with a dot: θ̇ = ∂θ/∂u, where θ is any intermediate quantity. The chain rule of
differentiation gives us each derivative in terms of quantities we computed while computing
the function, a local derivative, and the derivatives that we have already computed for the
parent quantities:
v̇ = ∂v/∂u = 2u
ẇ = ∂w/∂u = (∂w/∂v)(∂v/∂u) = cos(v) v̇
ẋ = ∂x/∂u = (∂x/∂v)(∂v/∂u) = (1/v) v̇
ẏ = ∂y/∂u = (∂y/∂w)(∂w/∂u) + (∂y/∂x)(∂x/∂u) = x ẇ + w ẋ
ż = ∂z/∂u = (∂z/∂y)(∂y/∂u) = z ẏ
To compute the numerical value of each θ̇, we only need the derivatives of the elementary
function used at that stage, and the numerical values of the other quantities, which we have
already computed.
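Extending the sketch above, the forward-mode pass carries a dotted quantity alongside each variable, and the result can be checked against a finite difference (again a sketch; fn and hh are just names for this example):

import numpy as np

u = 1.7
v = u**2;       v_dot = 2*u
w = np.sin(v);  w_dot = np.cos(v) * v_dot
x = np.log(v);  x_dot = (1/v) * v_dot
y = w * x;      y_dot = x*w_dot + w*x_dot
z = np.exp(y);  z_dot = z * y_dot

def fn(u):
    return np.exp(np.sin(u**2) * np.log(u**2))

hh = 1e-5
print(z_dot)                              # forward-mode dz/du
print((fn(u+hh) - fn(u-hh)) / (2*hh))     # finite-difference check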
After computing the function, the additional computational cost of computing a derivative
can be less than the cost of computing the function. The final stage of our function gives
an example: ż = zẏ requires one floating-point multiply operation, whereas z = exp(y) usually has the cost of many floating-point operations. Propagating derivatives can also be
more expensive: y requires one multiply, whereas ẏ requires two multiplies and an addition.
However, for the elementary mathematical functions we use in code, propagating derivatives
is never much more expensive than the original function.
The computation of the derivatives can be done alongside the original function computation.
Only a constant factor of extra memory is required: we need to track a θ̇ derivative for every θ
intermediate quantity currently in memory. Because only derivatives of elementary functions
are needed, the process can be completely automated, and is then called forwards-mode
automatic differentiation (AD).2
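One way such automation can work is to carry (value, derivative) pairs through the computation. The Dual class below is a minimal illustrative sketch, not any particular AD library; only the operations needed for the running example are implemented:

import numpy as np

class Dual:
    """A (value, derivative) pair for forward-mode differentiation."""
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot
    def __mul__(self, other):
        # product rule: d(ab) = a_dot*b + a*b_dot
        return Dual(self.val*other.val, self.dot*other.val + self.val*other.dot)

def dsin(a): return Dual(np.sin(a.val), np.cos(a.val)*a.dot)
def dlog(a): return Dual(np.log(a.val), a.dot/a.val)
def dexp(a): return Dual(np.exp(a.val), np.exp(a.val)*a.dot)

u = Dual(1.7, 1.0)                     # seed: du/du = 1
z = dexp(dsin(u*u) * dlog(u*u))        # z = exp(sin(u^2) log(u^2))
print(z.val, z.dot)                    # function value and dz/du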
1. Caveats: 1) Computing functions does not normally require keeping the whole DAG in memory. Computing
derivatives as advocated in this note can sometimes have prohibitive memory costs. 2) DAGs are also used in
machine learning (but not in this course) to define probability distributions rather than deterministic functions as
here. The diagram here is not a probabilistic graphical model.
2. The (non-examinable) complex step trick is a hacky proof of concept: for some cases (analytic functions of
real-valued inputs) it tracks a small multiple of the θ̇ derivative quantities in the complex part of each number.
More general AD tools exist.
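The complex-step trick mentioned above fits in a couple of lines of NumPy. This sketch reuses the running example; it only works because exp, sin, and log are analytic and u is real:

import numpy as np

def fn(u):
    return np.exp(np.sin(u**2) * np.log(u**2))

u, hh = 1.7, 1e-20
print(np.imag(fn(u + 1j*hh)) / hh)   # approximates dz/du without cancellation error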
3. A literature on reducing the memory consumption of reverse-mode differentiation exists. There are tricks to
avoid storing everything, at the cost of more computation, including “check-pointing” and recomputing the inputs
of reversible operations from their outputs.
4. Backpropagation is the term from the neural network literature for reverse-mode differentiation. Backpropagation
is also sometimes used to mean the whole procedure of training a network with a gradient descent method, where
the gradients come from reverse-mode differentiation.
The chain rule gives a general equation6 for backpropagating matrix derivatives:
Āij = ∂z/∂Aij = ∑_{k,l} (∂z/∂Ckl)(∂Ckl/∂Aij) = ∑_{k,l} C̄kl (∂Ckl/∂Aij).    (7)
However, we shouldn’t evaluate all of the terms in this equation. If A and C are N × N matrices, there are N⁴ elements {∂Ckl/∂Aij} when considering all combinations of i, j, k, and l.
We’ve argued that derivatives can have the same computational scaling as the cost function,
but most matrix functions scale better than O(N⁴). Therefore, it is usually possible to
evaluate the sum above without explicitly computing all of its terms.
Given a standard matrix function, we don’t want to differentiate the function! That is, we
usually won’t compute the partial derivatives of all the outputs with respect to all the inputs.
Instead we derive a propagation rule that takes the derivatives of a scalar cost function with
respect to the output, C̄, and returns the derivatives with respect to the parent quantities: in the above example, Ā and B̄. These reverse-mode propagation rules are the building blocks for
differentiating larger matrix functions. By chaining them together we can differentiate any
function built up of primitives for which we know the propagation rules.
Standard results: The general backpropagation rule above simplifies for some standard matrix operations as follows:

For a matrix product C = AB:   Ā = C̄ B⊤  and  B̄ = A⊤ C̄.
For a sum C = A + B:   Ā = C̄  and  B̄ = C̄.
For an elementwise function C = g(A):   Ā = g′(A) ⊙ C̄, where g′ and ⊙ are applied elementwise.
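A quick NumPy check of the matrix-product rule Ā = C̄B⊤ (a sketch: C̄ is a random matrix standing in for the derivatives of some scalar cost, and the cost used here is chosen so that its derivatives with respect to C are exactly that matrix):

import numpy as np

rng = np.random.default_rng(0)
N = 4
A, B = rng.standard_normal((N, N)), rng.standard_normal((N, N))
C_bar = rng.standard_normal((N, N))        # stand-in for d(cost)/dC

def cost(A):
    C = A @ B
    return np.sum(C_bar * C)               # scalar cost with d(cost)/dC = C_bar

A_bar = C_bar @ B.T                        # claimed propagation rule

i, j, hh = 1, 2, 1e-5                      # finite-difference check of one element
E = np.zeros((N, N)); E[i, j] = hh
print(A_bar[i, j], (cost(A + E) - cost(A - E)) / (2*hh))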
5. How slow? It’s hardware and compiler dependent. The last time I compared my naive C code for a matrix
operation to a hardware-optimized BLAS implementation, the difference in speed was about 50×.
6. If A is a parent to more than one child in the graph, then Ā is the sum over this equation for all its children C.
7. In practice. There are algorithms with better scaling, but they have large constant factors.
8. Apologies: I’ve put the activations and hiddens for each example in columns rather than rows in this section.
The single example and mini-batch cases look more similar this way, but it’s probably less standard, and doesn’t
match the orientation of the design matrix, so I’d have to say H(0) = X⊤.
9. The vector of ones isn’t required in NumPy code, or recent Matlab, because these languages broadcast the
addition operator automatically. However, using the vector of ones makes the “+” a standard matrix addition, and
helps us get the derivatives correct.
10. Earlier in the notes (and later), when considering N scalar targets, I used y and f as N × 1 vectors. Whereas
here I just used f for a single K-dimensional network output. Another option is to use B × K matrices F and Y for
predictions and targets for B examples, even if K = 1. I’m afraid keeping track of the orientation and meaning of
matrices is generally a pain.
We can backpropagate these δ = ā error signals through the layer equations above by deriving
the updates from scratch, or by combining the standard backpropagation rules for matrix
operations:
b̄(l) = ā(l)                              b̄(l) = Ā(l) 1_B = ∑_{b=1}^{B} Ā(l)_{:,b}      (11)
W̄(l) = ā(l) h(l−1)⊤                       W̄(l) = Ā(l) H(l−1)⊤                           (12)
h̄(l−1) = W(l)⊤ ā(l)                       H̄(l−1) = W(l)⊤ Ā(l)                           (13)
ā(l−1) = g(l−1)′(a(l−1)) ⊙ h̄(l−1)         Ā(l−1) = g(l−1)′(A(l−1)) ⊙ H̄(l−1)             (14)
We need to derive or look up g(l)′, the derivative of the non-linearity in the lth layer.
We obtain the gradients for all of the parameters in the layer, and a new error signal to
backpropagate to the previous layer.
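A minimal NumPy sketch of this backward step for a single example follows. It assumes the usual layer equations a(l) = W(l) h(l−1) + b(l) and h(l) = g(l)(a(l)), and picks a logistic sigmoid as a concrete non-linearity; the function and variable names are mine, not from any library:

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def backprop_layer(a_bar, W, h_prev, a_prev):
    """Backpropagate a_bar = d(cost)/da through one layer with a = W h_prev + b,
    h_prev = sigmoid(a_prev). Returns parameter gradients and the new error signal."""
    b_bar = a_bar                                      # eq (11)
    W_bar = np.outer(a_bar, h_prev)                    # eq (12): a_bar h_prev^T
    h_prev_bar = W.T @ a_bar                           # eq (13)
    g_prime = sigmoid(a_prev) * (1 - sigmoid(a_prev))  # sigmoid derivative
    a_prev_bar = g_prime * h_prev_bar                  # eq (14)
    return W_bar, b_bar, a_prev_bar

# tiny usage example with random values
rng = np.random.default_rng(1)
K, D = 3, 5                                            # layer output and input sizes
W = rng.standard_normal((K, D))
a_prev = rng.standard_normal(D)
h_prev = sigmoid(a_prev)
a_bar = rng.standard_normal(K)                         # error signal from the layer above
W_bar, b_bar, a_prev_bar = backprop_layer(a_bar, W, h_prev, a_prev)
print(W_bar.shape, b_bar.shape, a_prev_bar.shape)      # (3, 5) (3,) (5,)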
• You could continue the previous question to get x̄ and ȳ. You should then also be able to immediately write down the x̄ rule for c = x⊤Ax. You should be able to check for yourself whether you got it right; a finite-difference checking sketch is given below.
• Given a rule to compute B̄ from C̄, how could you compute a derivative of the local function ∂Ckl/∂Bij if you wanted to know it? Hint: in footnote 11. You should know why we don’t usually compute all of these derivatives.
11. C̄ contains derivatives of some scalar function f (C ). You can choose what that function is. Write out notation
for what we want, everything we know, and things we can set. Then this question shouldn’t be too hard.
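For the first bullet point, here is a NumPy sketch of such a finite-difference check (check_grad and xbar_fn are names I’ve made up; swap the xbar_fn line for your own proposed rule):

import numpy as np

def check_grad(cost_fn, grad_fn, x, hh=1e-5):
    """Largest discrepancy between an analytic gradient and central finite differences."""
    fd = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x); e[i] = hh
        fd[i] = (cost_fn(x + e) - cost_fn(x - e)) / (2*hh)
    return np.max(np.abs(fd - grad_fn(x)))

rng = np.random.default_rng(2)
D = 4
A = rng.standard_normal((D, D))
x = rng.standard_normal(D)
cost_fn = lambda x: x @ A @ x            # c = x^T A x
xbar_fn = lambda x: (A + A.T) @ x        # replace with the rule you derived
print(check_grad(cost_fn, xbar_fn, x))   # tiny (e.g. ~1e-9) if the rule is right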
7 Further Reading
For keen students:
Reverse mode differentiation and its automatic application to code is an old idea:
Who invented the reverse mode of differentiation?, Andreas Griewank, Optimization Stories,
Documenta Mathematica, Extra Volume ISMP (2012), pp. 389–400.
Automatic differentiation in machine learning: a survey. Atilim Gunes Baydin, Barak A.
Pearlmutter, Alexey Andreyevich Radul, and Jeffrey Mark Siskind (2015). A good tutorial
overview of automatic differentiation and some common misconceptions.
Although the Baydin et al. survey has “machine learning” in the title, early versions didn’t cover at all the real reasons that adoption in machine learning has been so slow: software
tools that work easily with the sort of code written by machine learning researchers didn’t
exist. Even now (version 4), the survey has little on using efficient linear algebra operations.
Theano wasn’t as general as traditional AD compilers, but it did most of what people wanted and was easy to use, so it revolutionized how machine learning code was written. Now the gap between what’s possible and what’s easy needs closing further.
What if we want machine learning tools to automatically pass derivatives through a useful
function of a matrix, like the Cholesky decomposition, that isn’t commonly used in neural
networks? Iain recently wrote a tutorial note https://arxiv.org/abs/1602.07527 discussing
the options. Matrix-based approaches from this note were rapidly adopted by several tools,
including TensorFlow, Autograd, MXNet, PyTorch, and Theano. However, some mainstream machine learning packages still can’t differentiate some reasonably common linear algebra operations. But the situation is improving.