Fitting and initializing neural networks

Neural networks are almost always fitted with gradient-based optimizers, such as variants
of Stochastic Gradient Descent.1 We defer how to compute the gradients to the next note.

1 Initialization
How do we set the initial weights before calling an optimizer? Don’t set all the weights to
zero! If different hidden units (adaptable basis functions) start out with the same parameters,
they will all compute the same function of the inputs. Each unit will then get the same
gradient vector, and be updated in the same way. As each hidden unit remains the same, our
neural network function can only depend on a single linear combination of the inputs.
Instead we usually initialize the weights randomly. Don’t simply set all the weights using
randn() though! As a concrete example, if all your inputs were x_d ∈ {−1, +1}, the activation
(w^(k))^T x to hidden unit k would have zero mean, but typical size √D if there are D inputs.
(See the review of random walks in the expectations notes.) If your units saturate, like
the logistic sigmoid, most of the gradients will be close to zero, and it will be hard for the
gradient optimizer to update the parameters to useful settings.
Summary: initialize a weight matrix that transforms K values with small random values, for
example 0.1*randn()/sqrt(K), assuming your input features have typical size ∼1.
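As a concrete sketch of that rule in NumPy (the layer shapes and function name here are illustrative, not from the note):

    import numpy as np

    def init_layer(K, H, rng=None):
        # Weights for a layer mapping K input values to H hidden units.
        # Scaling by 1/sqrt(K) keeps each unit's activation at a typical
        # size of ~0.1 when the input features have typical size ~1.
        rng = np.random.default_rng() if rng is None else rng
        W = 0.1 * rng.standard_normal((H, K)) / np.sqrt(K)
        b = np.zeros(H)  # biases can start at zero; the random W already breaks symmetry
        return W, b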
The MLP course points to Glorot and Bengio’s (2010) paper Understanding the difficulty of
training deep feedforward networks, which suggests a scaling ∝ 1/√(K^(l) + K^(l−1)), involving
the number of hidden units in the layer after the weights, not just before. The argument
involves the gradient computations, which we haven’t described in detail for neural networks
yet, so we refer the interested reader to the paper or the MLP (2019) slides.2
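A minimal sketch of that scaling rule, using a Gaussian draw with standard deviation sqrt(2 / (K^(l−1) + K^(l))); other common variants draw from a uniform distribution with the same variance:

    import numpy as np

    def glorot_init(K_in, K_out, rng=None):
        # Glorot/Bengio scaling: spread ~ 1/sqrt(K_in + K_out), where K_in and
        # K_out are the layer widths before and after the weight matrix.
        rng = np.random.default_rng() if rng is None else rng
        return np.sqrt(2.0 / (K_in + K_out)) * rng.standard_normal((K_out, K_in))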
Some specialized neural network architectures have particular tricks for initializing them.
Do a literature search if you find yourself trying something other than a standard dense
feedforward network: e.g., recurrent/recursive architectures, convolutional architectures,
transformers, or memory networks. Alternatively, a pragmatic tip: if you are using a neural
network toolbox, try to process your data to have similar properties to the standard datasets
that are usually used to demonstrate that software. For example, similar dimensionality,
means, variances, sparsity (number of non-zero features). Then any initialization tricks that
the demonstrations use are more likely to carry over to your setting.

2 Local optima
The cost function for neural networks is not unimodal, and so is certainly not convex (a
stronger property). We can see why by considering a neural network with two hidden units.
Assume we’ve fitted the network to a (local) optimum of a cost function, so that any small
change in parameters will make the network worse. Then we can find another parameter
vector that will represent exactly the same function, showing that the optimum is only a
local one.
To create the second parameter vector, we simply take all of the parameters associated
with hidden unit one, and replace them with the corresponding parameters associated with
hidden unit two. Then we take all of the parameters associated with hidden unit two and
replace them with the parameters that were associated with hidden unit one. The network is
really the same as before, with the hidden units labelled differently, so will have the same
cost.
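This symmetry is easy to check numerically. A tiny sketch for a one-hidden-layer network with two hidden units (all weights made up for illustration):

    import numpy as np

    def mlp(x, W, b, v, c):
        # One hidden layer: h = tanh(W x + b), output f(x) = v^T h + c.
        h = np.tanh(W @ x + b)
        return v @ h + c

    rng = np.random.default_rng(0)
    W = rng.standard_normal((2, 3)); b = rng.standard_normal(2)
    v = rng.standard_normal(2);      c = rng.standard_normal()

    # Swap everything associated with hidden unit 1 and hidden unit 2:
    perm = [1, 0]
    x = rng.standard_normal(3)
    print(mlp(x, W, b, v, c))                    # original network
    print(mlp(x, W[perm], b[perm], v[perm], c))  # relabelled network: same output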

1. Adam (https://arxiv.org/abs/1412.6980) has now been popular for some time, although pure SGD is still in
use too.
2. https://www.inf.ed.ac.uk/teaching/courses/mlp/2019-20/lectures/mlp06-enc.pdf



Models with “hidden” or “latent” representations of data usually have many equivalent
ways to represent the same model. When the goal of a machine learning system is to make
predictions, it doesn’t matter whether the parameters are well-specified. However, it’s worth
remembering that the values of individual parameters are often completely arbitrary, and
can’t be interpreted in isolation.
In practice local optima don’t just correspond to permuting the hidden units. Some local op-
tima will have better cost than others, and some will make predictions that generalize better
than others. For small neural networks, one could fit many times and use the network that
cross-validates the best. However, researchers pushing up against available computational
resources will find it difficult to optimize a network many times.
One advantage of large neural networks is that fitting far more parameters than necessary
tends to work better(!). One intuition is that there are many more ways to set the parameters
to get low cost, so it’s less hard to find one good setting,3 although it’s difficult to make
rigorous statements on this issue. Understanding the difficulties that are faced in really
high-dimensional optimization is an open area of research. (For example, https://arxiv.
org/abs/1412.6544.)

3 Regularization by early stopping


We have referred to complex models that generalize poorly as “overfitted”. One idea to avoid
“overfitting” is to fit less! That is, stop the optimization routine before it has found a local
optimum of the cost function. This heuristic idea is often called “early stopping”.
The most common way to implement early stopping is to periodically monitor performance
on a validation set. If the validation score is the best that we have seen so far, we save a
copy of the network’s parameters. If the validation score fails to improve on that best score over
some number of future checks (say 20), we stop the optimization and return the weights
we’ve saved.
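A sketch of that loop (sgd_step and validation_cost stand in for whatever optimizer update and model evaluation you are using; they are not part of the note):

    def fit_with_early_stopping(params, sgd_step, validation_cost,
                                patience=20, check_every=100, max_steps=100_000):
        # Stop once the validation cost has failed to improve for `patience` checks.
        # Assumes sgd_step returns a fresh parameter object (otherwise copy it).
        best_cost, best_params, checks_since_best = float('inf'), params, 0
        for step in range(1, max_steps + 1):
            params = sgd_step(params)              # one (mini-batch) update
            if step % check_every == 0:
                cost = validation_cost(params)
                if cost < best_cost:
                    best_cost, best_params, checks_since_best = cost, params, 0
                else:
                    checks_since_best += 1
                    if checks_since_best >= patience:
                        break                      # give up; keep the best weights seen
        return best_params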
David MacKay’s textbook mentions early stopping (Section 39.4, p479). This book points
out that stopping the optimizer prevents the weights from growing too large. Goodfellow
et al.’s deep learning textbook (Chapter 7) makes a more detailed link to L2 regularization.
MacKay argued that adding a regularization term to the cost function to achieve a similar
effect seems more appealing: if we have a well-defined cost function, we’re not tied to a
particular optimizer, and it’s probably easier to analyse what we’re doing.
However, I’ve found it hard to argue with early stopping as a pragmatic, sensible procedure.
The heuristic directly checks whether continuing to fit is improving predictions for held-out
data, which is what we care about. And we might save a lot of computer time by stopping
early. Moreover, we can still use a regularized cost function along with early stopping.
Questions: These questions don’t relate to early stopping, but seemed natural to ask after
the video above, which had some more general review:

• [The website version of this note has a question here.]


• [The website version of this note has a question here.]

4 Regularization by corrupting the data or model


There is a whole family of methods for regularizing models that involve adding noise to
the data or model during training. As with early stopping, it can be hard to understand
what objective we are fitting, and the effect of the regularizer can depend on which optimizer
we are using. However, these methods are often effective…

3. The high-level idea is old, but a recent (2018) analysis described the idea that some parts of a large network
“get lucky” and identify good features as “The Lottery Ticket Hypothesis”, https://arxiv.org/abs/1803.03635



Adding Gaussian noise to the inputs of a linear model during gradient training has the
same average effect as L2 regularization4 . We can also add noise when training neural
networks. The procedure will still have a regularization effect, but one that’s harder to
understand. We can also add noise to the weights or hidden units of a neural network.
In some applications, adding noise has worked better than optimizing easy-to-define cost
functions (like L2 regularizers).
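As an illustration, input noise is a one-line corruption applied afresh each time a mini-batch is used (sigma is a noise level you would tune on validation data):

    import numpy as np

    def noisy_inputs(X, sigma, rng):
        # Add fresh zero-mean Gaussian noise to a batch of inputs. For a linear
        # model this has, on average, an effect like an L2 penalty (footnote 4).
        return X + sigma * rng.standard_normal(X.shape)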
Other regularization methods randomly set some of the hidden unit values to zero
(“drop-out”5) or input features to zero (such as in “denoising auto-encoders”6 or a 2006
feature-dropping regularizer). These heuristics prevent the model from fitting delicate combinations
of parameters, or fits that depend on careful combinations of features. If used aggressively,
“masking noise” makes it hard to fit anything! Often large models are needed when using
these heuristics.
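A minimal sketch of the masking idea for hidden unit values, using the “inverted dropout” convention of rescaling at training time so no correction is needed at test time (one common convention, not necessarily the exact one in the linked paper):

    import numpy as np

    def dropout_mask(h, keep_prob, rng):
        # Zero each hidden unit value independently with probability 1 - keep_prob.
        # Dividing by keep_prob keeps the expected activation the same; at test
        # time the mask is simply not applied.
        mask = rng.random(h.shape) < keep_prob
        return h * mask / keep_prob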

5 Further Reading
Most textbooks are long out-of-date when it comes to recent practical wisdom on fitting
neural networks and regularization strategies. However, https://www.deeplearningbook.
org/ is still fairly recent, and is a good starting point. The MLP notes are also more detailed
on practical tips for deep nets.
If you were to read about one more trick, perhaps it should be Batch Normalization (or “batch
norm”), which is (just) “old” enough to be covered in the deep learning textbook. Like most
ideas, it doesn’t always improve things, so experiments are required. And variants are still
being actively explored.
The discussion in this note about initialization pointed out that we don’t want to saturate
hidden units. Batch normalization shifts and scales the activations for a unit across a training
batch to have a target mean and variance. Gradient-based training of the neural nets often
then works better. In hindsight it’s surprising that this trick is so recent: it’s a simple
idea that someone could have come up with in a previous decade, but didn’t.
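A sketch of the core shift-and-scale step for one layer’s activations over a training batch (omitting the running averages normally kept for use at test time; gamma and beta are learned parameters that set the target scale and mean):

    import numpy as np

    def batch_norm(A, gamma, beta, eps=1e-5):
        # A has shape (batch_size, num_units). Normalize each unit's activations
        # across the batch to zero mean and unit variance, then rescale and shift
        # to the learned targets gamma (scale) and beta (mean).
        mu = A.mean(axis=0)
        var = A.var(axis=0)
        return gamma * (A - mu) / np.sqrt(var + eps) + beta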

4. Non-examinable: there’s a sketch in these slides: https://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec9.pdf. More detail in Bishop’s (1995) neural network textbook, section 9.3.
5. https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf
6. http://icml2008.cs.helsinki.fi/papers/592.pdf

MLPR:w8c Iain Murray and Arno Onken, http://www.inf.ed.ac.uk/teaching/courses/mlpr/2020/
