Understanding Backpropagation Algorithm
Learn the nuts and bolts of a neural network’s most important
ingredient
Simeon Kostadinov
Aug 8
In this article, I would like to go over the mathematical process of training and
optimizing a simple 4-layer neural network. I believe this would help the reader
understand how backpropagation works as well as realize its importance.
Input layer
The neurons, colored in purple, represent the input data. These can be as simple as
scalars or more complex like vectors or multidimensional matrices.
The first set of activations (a¹) is equal to the input values. NB: an “activation” is a neuron’s
value after an activation function has been applied; see the next section.
Hidden layers
The final values at the hidden neurons, colored in green, are computed using z^l, the weighted
inputs in layer l, and a^l, the activations in layer l. For layers 2 and 3 the
equations are:
l = 2:  z² = W¹x + b¹,  a² = f(z²)
l = 3:  z³ = W²a² + b²,  a³ = f(z³)
where f is the activation function, applied element-wise.
W¹ and W² are the weight matrices and b¹ and b² are the bias vectors used to compute layers 2
and 3.
Looking carefully, you can see that x, z², a², z³, a³, W¹, W², b¹ and b² are missing the subscripts
shown in the 4-layer network illustration above. The reason is that we have combined all
parameter values into matrices, grouped by layer. This is the standard way of working with
neural networks and one should be comfortable with the calculations. However, I will go over
the equations to clear out any confusion.
Let’s pick layer 2 and its parameters as an example. The same operations can be applied
to any layer in the network.
W¹ is a weight matrix of shape (n, m) where n is the number of output neurons
(neurons in the next layer) and m is the number of input neurons (neurons in the
previous layer). For us, n = 2 and m = 4.
Equation for W¹:
W¹ = [ (W_11)¹  (W_12)¹  (W_13)¹  (W_14)¹
       (W_21)¹  (W_22)¹  (W_23)¹  (W_24)¹ ]
NB: The first number in any weight’s subscript matches the index of the neuron in
the next layer (in our case this is the Hidden_2 layer) and the second number
matches the index of the neuron in the previous layer (in our case this is the Input
layer).
x is the input vector of shape (m, 1) where m is the number of input neurons. For
us, m = 4.
Equation for x:
x = [ x_1  x_2  x_3  x_4 ]ᵀ
b¹ is a bias vector of shape (n, 1), where n is the number of neurons in the current layer. For
us, n = 2.
Equation for b¹:
b¹ = [ (b_1)¹  (b_2)¹ ]ᵀ
Following the equation for z², we can use the above definitions of W¹, x and b¹ to derive
“Equation for z²”:
Equation for z²:
z² = W¹x + b¹ = [ (z_1)²  (z_2)² ]ᵀ, with
(z_1)² = (W_11)¹x_1 + (W_12)¹x_2 + (W_13)¹x_3 + (W_14)¹x_4 + (b_1)¹
(z_2)² = (W_21)¹x_1 + (W_22)¹x_2 + (W_23)¹x_3 + (W_24)¹x_4 + (b_2)¹
You will see that z² can be expressed using (z_1)² and (z_2)², where each (z_i)² is the sum of the
products of every input x_j with the corresponding weight (W_ij)¹, plus the bias (b_i)¹.
Expanding the matrix product this way leads back to the same “Equation for z²” and proves that
the matrix representations for z², a², z³ and a³ are correct.
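To make these shapes concrete, here is a minimal NumPy sketch of the forward pass through the
two hidden layers. The four inputs and two neurons in layer 2 come from the text; the sigmoid
activation, the assumption that layer 3 also has two neurons, and the random parameter values
are placeholder choices of mine, not something the article fixes.

```python
import numpy as np

def f(z):
    # Placeholder activation function (sigmoid); the article leaves f unspecified.
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

x  = rng.normal(size=(4, 1))   # input vector, shape (m, 1) with m = 4
W1 = rng.normal(size=(2, 4))   # W¹: shape (n, m) = (2, 4), feeds layer 2
b1 = rng.normal(size=(2, 1))   # b¹: shape (n, 1) = (2, 1)
W2 = rng.normal(size=(2, 2))   # W²: feeds layer 3 (assumed to also have 2 neurons)
b2 = rng.normal(size=(2, 1))   # b²

a1 = x                 # input activations
z2 = W1 @ a1 + b1      # weighted inputs of layer 2
a2 = f(z2)             # activations of layer 2
z3 = W2 @ a2 + b2      # weighted inputs of layer 3
a3 = f(z3)             # activations of layer 3

print(z2.shape, a3.shape)   # (2, 1) (2, 1)
```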
Output layer
The final part of a neural network is the output layer, which produces the predicted value s. In
our simple example it is presented as a single neuron, colored in blue, and computed from the
activations a³ of the second hidden layer using the remaining weight matrix W³.
Again, we are using the matrix representation to keep the computation compact. One can use the
above techniques to understand the underlying logic. Please leave any comments
below if you find yourself lost in the equations — I would love to help!
The final step in a forward pass is to evaluate the predicted output s against an
expected output y.
The output y is part of the training dataset (x, y) where x is the input (as we saw in the
previous section).
Evaluation between s and y happens through a cost function. This can be as simple as
MSE (mean squared error) or more complex like cross-entropy.
C = cost(s, y)
where cost can be MSE, cross-entropy or any other cost function.
Based on C’s value, the model “knows” how much to adjust its parameters in order to
get closer to the expected output y. This happens using the backpropagation algorithm.
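As a concrete illustration, here is what this evaluation step could look like with MSE as the
cost function, one of the options mentioned above. The 0.5 factor and the example values of s
and y are my own choices for the sketch.

```python
import numpy as np

def mse_cost(s, y):
    # Mean squared error between predicted output s and expected output y.
    # The 0.5 factor is a common convention that simplifies the derivative.
    return 0.5 * np.mean((s - y) ** 2)

s = np.array([[0.8]])   # predicted output from the forward pass
y = np.array([[1.0]])   # expected output from the training pair (x, y)

C = mse_cost(s, y)
print(C)   # 0.02
```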
Backpropagation and computing gradients
“…the ability to create useful new features distinguishes back-propagation from earlier,
simpler methods…”
The gradient ∂C/∂x shows how much a parameter x needs to change (in the positive or negative
direction) to minimize C. In our network those parameters are the weights and the biases, so
backpropagation computes ∂C/∂w and ∂C/∂b for every weight and bias using the chain rule.
The common part in the weight and bias gradients is often called the “local gradient” and is
expressed as follows:
(δ_j)^l = ∂C/∂(z_j)^l
The “local gradient” can easily be determined using the chain rule. I won’t go over the
process now but if you have any questions, please comment below.
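To make the role of the local gradients explicit, here is a hedged sketch of a full backward pass
for the network above. It assumes a sigmoid activation, a linear output s = W³a³ and an MSE
cost; none of these choices are prescribed by the article, they simply let the chain rule be
written out in code.

```python
import numpy as np

def f(z):                      # assumed sigmoid activation
    return 1.0 / (1.0 + np.exp(-z))

def f_prime(z):                # derivative of the sigmoid
    s = f(z)
    return s * (1.0 - s)

rng = np.random.default_rng(0)
x, y = rng.normal(size=(4, 1)), np.array([[1.0]])
W1, b1 = rng.normal(size=(2, 4)), rng.normal(size=(2, 1))
W2, b2 = rng.normal(size=(2, 2)), rng.normal(size=(2, 1))
W3 = rng.normal(size=(1, 2))                    # assumed linear output layer

# Forward pass (same equations as in the hidden-layer sketch).
a1 = x
z2 = W1 @ a1 + b1; a2 = f(z2)
z3 = W2 @ a2 + b2; a3 = f(z3)
s  = W3 @ a3                                    # predicted output
C  = 0.5 * np.sum((s - y) ** 2)                 # MSE cost

# Backward pass: local gradients delta = dC/dz, then parameter gradients.
dC_ds  = s - y                                  # dC/ds for the 0.5*(s - y)^2 cost
delta3 = (W3.T @ dC_ds) * f_prime(z3)           # dC/dz³
delta2 = (W2.T @ delta3) * f_prime(z2)          # dC/dz²

dC_dW3 = dC_ds  @ a3.T                          # dC/dW³
dC_dW2 = delta3 @ a2.T                          # dC/dW²  (dC/db² = delta3)
dC_dW1 = delta2 @ a1.T                          # dC/dW¹  (dC/db¹ = delta2)

print(dC_dW1.shape, dC_dW2.shape, dC_dW3.shape)   # (2, 4) (2, 2) (1, 2)
```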
. . .
Algorithm for optimizing weights and biases (also called “Gradient descent”)
. . .
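Below is a minimal sketch of that optimization loop, i.e. plain batch gradient descent. The
forward_and_gradients callable is hypothetical; it stands in for the forward and backward
passes sketched earlier and returns the cost together with the gradient of every parameter. The
learning rate and step count are arbitrary assumptions.

```python
import numpy as np

def gradient_descent(params, forward_and_gradients, learning_rate=0.1, steps=1000):
    """Repeatedly nudge every parameter against its gradient to reduce the cost C.

    params is a dict of NumPy arrays (weights and biases); forward_and_gradients is
    any callable returning (C, grads), where grads mirrors the keys of params.
    """
    for _ in range(steps):
        C, grads = forward_and_gradients(params)
        for name in params:
            params[name] -= learning_rate * grads[name]   # w <- w - eps * dC/dw
    return params

# Toy usage: minimize C = 0.5 * ||w||^2, whose gradient is simply w.
toy = {"w": np.array([[3.0], [-2.0]])}
toy = gradient_descent(toy, lambda p: (0.5 * np.sum(p["w"] ** 2), {"w": p["w"]}))
print(toy["w"])   # close to zero after enough steps
```

In practice this loop is usually run on mini-batches and paired with more elaborate optimizers,
but the parameter update itself stays the same.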
I would like to dedicate the final part of this section to a simple example in which we
will calculate the gradient of C with respect to a single weight (w_22)².
Weight (w_22)² connects (a_2)² and (z_2)³, so computing the gradient requires applying the
chain rule through (z_2)³ and (a_2)³:
∂C/∂(w_22)² = ∂C/∂(a_2)³ · ∂(a_2)³/∂(z_2)³ · ∂(z_2)³/∂(w_22)² = ∂C/∂(a_2)³ · ∂(a_2)³/∂(z_2)³ · (a_2)²
The last factor is simply (a_2)², because (z_2)³ = (w_21)²(a_1)² + (w_22)²(a_2)² + (b_2)².
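A quick way to convince yourself that this chain-rule expression is correct is to compare it with
a numerical (finite-difference) estimate. The sketch below reuses the assumed sigmoid
activation, linear output and MSE cost from the earlier code blocks; it is only a sanity check,
not part of the derivation.

```python
import numpy as np

def f(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
x, y = rng.normal(size=(4, 1)), np.array([[1.0]])
W1, b1 = rng.normal(size=(2, 4)), rng.normal(size=(2, 1))
W2, b2 = rng.normal(size=(2, 2)), rng.normal(size=(2, 1))
W3 = rng.normal(size=(1, 2))

def cost(W2_):
    a2 = f(W1 @ x + b1)
    a3 = f(W2_ @ a2 + b2)
    s = W3 @ a3
    return 0.5 * np.sum((s - y) ** 2)

# Analytic gradient for the single weight (w_22)², following the chain rule above.
a2 = f(W1 @ x + b1)
z3 = W2 @ a2 + b2
a3 = f(z3)
s = W3 @ a3
dC_da3 = W3.T @ (s - y)                    # dC/da³
dC_dz3 = dC_da3 * a3 * (1 - a3)            # multiply by f'(z³) for the sigmoid
analytic = dC_dz3[1, 0] * a2[1, 0]         # dC/d(w_22)² = dC/d(z_2)³ · (a_2)²

# Numerical estimate: perturb (w_22)², i.e. W2[1, 1] with 0-based indexing.
eps = 1e-6
W2_plus = W2.copy();  W2_plus[1, 1] += eps
W2_minus = W2.copy(); W2_minus[1, 1] -= eps
numerical = (cost(W2_plus) - cost(W2_minus)) / (2 * eps)

print(analytic, numerical)   # the two values should agree closely
```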
I hope this example manages to throw some light on the mathematics behind
computing gradients. To further enhance your skills, I strongly recommend watching
Stanford’s NLP series where Richard Socher gives 4 great explanations of
backpropagation.
Final remarks
In this article, I went through a detailed explanation of how backpropagation works under the
hood, using mathematical techniques such as computing gradients and the chain rule. Knowing
the nuts and bolts of this algorithm will fortify your neural network knowledge and make you
feel comfortable taking on more complex models. Enjoy your deep learning journey!
Thank you for reading. I hope you enjoyed the article 🤩 and I wish you a great day!