Machine Learning and Pattern Recognition Week 8 - Neural Net Fitting
Neural networks are almost always fitted with gradient-based optimizers, such as variants
of Stochastic Gradient Descent1. We defer how to compute the gradients to the next note.
1 Initialization
How do we set the initial weights before calling an optimizer? Don’t set all the weights to
zero! If different hidden units (adaptable basis functions) start out with the same parameters,
they will all compute the same function of the inputs. Each unit will then get the same
gradient vector, and be updated in the same way. As each hidden unit remains identical to
the others, our neural network function can only depend on a single linear combination of
the inputs.
Instead we usually initialize the weights randomly. Don’t simply set all the weights using
randn() though! As a concrete example, if all your inputs were x_d ∈ {−1, +1}, the activation
(w^(k))ᵀx to hidden unit k would have zero mean, but typical size √D if there are D inputs.
(See the review of random walks in the expectations notes.) If your units saturate, like
the logistic sigmoid, most of the gradients will be close to zero, and it will be hard for the
gradient optimizer to update the parameters to useful settings.
Summary: set a weight matrix that transforms K values to small random values, like
0.1*randn()/sqrt(K), assuming your input features have typical size ∼1.
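As a minimal NumPy sketch of this heuristic (the layer sizes are made up, and randn() is written as standard_normal):

import numpy as np

rng = np.random.default_rng(0)

K = 100    # number of inputs feeding into this layer (hypothetical size)
H = 50     # number of hidden units in this layer (hypothetical size)

# Small random weights: if the inputs have typical size ~1, each activation
# (w^(k))^T x has typical size ~0.1, keeping logistic sigmoids away from saturation.
W = 0.1 * rng.standard_normal((H, K)) / np.sqrt(K)
b = np.zeros(H)    # biases can start at zero; the random weights already break symmetry

# Quick check on a batch of 32 hypothetical +/-1 inputs:
X = rng.choice([-1.0, 1.0], size=(32, K))
A = X @ W.T + b                  # pre-activations, typical size ~0.1
hidden = 1 / (1 + np.exp(-A))    # sigmoid values near 0.5, far from saturation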
The MLP course points to Glorot and Bengio’s (2010) paper Understanding the difficulty of
training deep feedforward networks, which suggests a scaling ∝ 1/√(K^(l) + K^(l−1)), involving
the number of hidden units in the layer after the weights, not just before. The argument
involves the gradient computations, which we haven’t described in detail for neural networks
yet, so we refer the interested reader to the paper or the MLP (2019) slides2.
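For reference, a sketch of that scaling in its commonly used uniform form (the layer widths here are assumptions; see the paper for the exact rule and its derivation):

import numpy as np

rng = np.random.default_rng(1)
K_in, K_out = 300, 100    # hypothetical layer widths K^(l-1) and K^(l)

# Glorot/Xavier uniform initialization: scale proportional to 1/sqrt(K_in + K_out).
limit = np.sqrt(6.0 / (K_in + K_out))
W = rng.uniform(-limit, limit, size=(K_out, K_in))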
Some specialized neural network architectures have particular tricks for initializing them.
Do a literature search if you find yourself trying something other than a standard dense
feedforward network: e.g., recurrent/recursive architectures, convolutional architectures,
transformers, or memory networks. Alternatively, a pragmatic tip: if you are using a neural
network toolbox, try to process your data to have similar properties to the standard datasets
that are usually used to demonstrate that software. For example, similar dimensionality,
means, variances, sparsity (number of non-zero features). Then any initialization tricks that
the demonstrations use are more likely to carry over to your setting.
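For example, a simple preprocessing step is to standardize each feature so its statistics roughly match the demo datasets that toolbox defaults are tuned for (a sketch; the synthetic stand-in data and constants are assumptions):

import numpy as np

rng = np.random.default_rng(2)
X_raw = rng.gamma(shape=2.0, scale=3.0, size=(1000, 20))   # stand-in for your own data

# Shift and scale each feature to zero mean and unit variance.
mu = X_raw.mean(axis=0)
sigma = X_raw.std(axis=0) + 1e-8    # guard against constant features
X = (X_raw - mu) / sigma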
2 Local optima
The cost function for neural networks is not unimodal, and so is certainly not convex (a
stronger property). We can see why by considering a neural network with two hidden units.
Assume we’ve fitted the network to a (local) optimum of a cost function, so that any small
change in parameters will make the network worse. Then we can find another parameter
vector that will represent exactly the same function, showing that the optimum is only a
local one.
To create the second parameter vector, we simply take all of the parameters associated
with hidden unit one, and replace them with the corresponding parameters associated with
hidden unit two. Then we take all of the parameters associated with hidden unit two and
replace them with the parameters that were associated with hidden unit one. The network is
really the same as before, with the hidden units labelled differently, so it will have the
same cost.
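A quick numerical check of this symmetry, for a tiny network with two logistic hidden units (all sizes and values here are made up):

import numpy as np

def mlp(X, W1, b1, w2, b2):
    # One hidden layer of logistic sigmoid units, linear output.
    H = 1 / (1 + np.exp(-(X @ W1.T + b1)))
    return H @ w2 + b2

rng = np.random.default_rng(3)
X = rng.standard_normal((5, 3))            # 5 example inputs with 3 features
W1 = 0.1 * rng.standard_normal((2, 3))     # two hidden units
b1 = 0.1 * rng.standard_normal(2)
w2 = rng.standard_normal(2)
b2 = rng.standard_normal()

# Swap every parameter associated with hidden unit one and hidden unit two:
print(np.allclose(mlp(X, W1, b1, w2, b2),
                  mlp(X, W1[::-1], b1[::-1], w2[::-1], b2)))   # True: same function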
1. Adam (https://arxiv.org/abs/1412.6980) has now been popular for some time, although pure SGD is still in
use too.
2. https://www.inf.ed.ac.uk/teaching/courses/mlp/2019-20/lectures/mlp06-enc.pdf
3. The high-level idea is old, but a recent (2018) analysis described the idea that some parts of a large network
“get lucky” and identify good features as “The Lottery Ticket Hypothesis”, https://arxiv.org/abs/1803.03635
5 Further Reading
Most textbooks are long out-of-date when it comes to recent practical wisdom on fitting
neural networks and regularization strategies. However, https://www.deeplearningbook.org/
is still fairly recent, and is a good starting point. The MLP notes are also more detailed
on practical tips for deep nets.
If you were to read about one more trick, perhaps it should be Batch Normalization (or “batch
norm”), which is (just) “old” enough to be covered in the deep learning textbook. Like most
ideas, it doesn’t always improve things, so experiments are required. And variants are still
being actively explored.
The discussion in this note about initialization pointed out that we don’t want to saturate
hidden units. Batch normalization shifts and scales the activations for a unit across a training
batch to have a target mean and variance. Gradient-based training of neural nets then often
works better. In hindsight it’s surprising that this trick is so recent: it’s a simple
idea that someone could have come up with in a previous decade, but didn’t.
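A minimal sketch of the core shift-and-scale step on a batch of pre-activations (the parameter names gamma and beta follow the usual convention; the sizes are made up, and the running averages used at test time are omitted):

import numpy as np

rng = np.random.default_rng(4)
A = 3.0 * rng.standard_normal((32, 50)) + 5.0   # pre-activations: batch of 32, 50 units

# Normalize each unit's activations across the training batch...
mu = A.mean(axis=0)
var = A.var(axis=0)
A_hat = (A - mu) / np.sqrt(var + 1e-5)

# ...then shift and scale towards a target mean (beta) and scale (gamma),
# which are learned along with the weights in full batch normalization.
gamma = np.ones(50)
beta = np.zeros(50)
A_bn = gamma * A_hat + beta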