Lec 05 Regularization

This document discusses regularization techniques for deep learning models. It begins by introducing the concept of adding a penalty term to the loss function to limit model capacity and prevent overfitting. It then describes several specific regularization methods: 1) L2 regularization (weight decay) adds a penalty that is the sum of the squares of the weights. This has the effect of pushing weights closer to zero during training. 2) L1 regularization uses a penalty that is the sum of the absolute values of the weights, encouraging sparsity. 3) Other techniques discussed include early stopping, ensemble methods, dropout, and data augmentation.


Deep Learning

Lecture 5 – Regularization

Prof. Dr.-Ing. Andreas Geiger


Autonomous Vision Group
University of Tübingen / MPI-IS
Agenda

5.1 Parameter Penalties

5.2 Early Stopping

5.3 Ensemble Methods

5.4 Dropout

5.5 Data Augmentation

Recap: Capacity, Overfitting and Underfitting
[Figure: polynomial fits of degree M = 1, M = 3 and M = 9 (with a test set) to noisy observations of a ground-truth function. Left: capacity too low. Middle: capacity about right. Right: capacity too high.]

I Underfitting: Model too simple, does not achieve low error on training set
I Overfitting: Training error small, but test error (= generalization error) large
I Regularization: Take model from third regime (right) to second regime (middle)

Recap: Capacity, Overfitting and Underfitting
[Figure: training error and generalization error (log scale) as a function of the degree of the polynomial.]

Regularization:
I Trades increased bias for reduced variance
I Goal is to minimize generalization error despite using large model family
Function Space View

[Figure: function space view — within the model's function space, the data term and the regularizer jointly determine the solution.]
5.1
Parameter Penalties
Parameter Penalties
Let 𝒳 = (X, y) denote the dataset and w the model parameters. We can limit the
model capacity by adding a parameter norm penalty R to the loss L:

L̃(𝒳, w) = L(𝒳, w) + α R(w)

Here, L̃ is the total loss, L the original loss and R the regularizer;
α ∈ [0, ∞) controls the strength of the regularizer.

I R quantifies the size of the parameters / model capacity


I Minimizing L̃ will decrease both L and R
I Typically, R is applied only to the weights (not the bias) of the affine layers
I Often, R drives weights closer to the origin (in absence of prior knowledge)
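
To make this concrete, here is a minimal sketch of such a regularized objective in Python (all names are illustrative; the squared L2 norm merely serves as one example for R):

import numpy as np

def total_loss(data_loss, weights, alpha):
    # Total loss = original loss + alpha * R(w); the penalty is applied
    # to the weight matrices only, not to the biases.
    R = sum(0.5 * np.sum(W ** 2) for W in weights)  # example: R(w) = 1/2 ||w||_2^2
    return data_loss + alpha * R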

Parameter Penalties

Why do we want the weights/inputs to be small?


I Suppose x1 and x2 are nearly identical.
The following two networks make nearly the same predictions:

I But the second network might predict wrongly if the test distribution is slightly
different (x1 and x2 match less closely) ⇒ Worse generalization

Parameter Penalties

[Figure: objective function, regularizer, and the resulting optimum of the regularized objective.]

L2 Regularization
Weight decay (= ridge regression) uses an L2 penalty R(w) = ½ ‖w‖₂²:

L̃(𝒳, w) = L(𝒳, w) + α R(w)
        = L(𝒳, w) + (α/2) wᵀw

The parameter updates during gradient descent are given by:

wt+1 = wt − η ∇w L̃(𝒳, wt)
     = wt − η (∇w L(𝒳, wt) + α wt)
     = (1 − η α) wt − η ∇w L(𝒳, wt)

Thus, we decay the weights at each training iteration before the gradient update.
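
In code, this decay-then-step update takes one line (a minimal sketch; all names are illustrative):

def weight_decay_step(w, grad_L, lr, alpha):
    # Shrink the weights by the factor (1 - lr * alpha), then apply the
    # gradient of the unregularized loss.
    return (1.0 - lr * alpha) * w - lr * grad_L

Deep learning frameworks typically expose exactly this behaviour, e.g. as a weight_decay argument of SGD-style optimizers.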
L2 Regularization
What happens over the entire course of training? Let w∗ = argmin_w L(𝒳, w)
denote the solution to the unregularized objective and consider a
quadratic approximation L̂ of the unregularized loss L around w∗:

L̂(𝒳, w) = L(𝒳, w∗) + gᵀ(w − w∗) + ½ (w − w∗)ᵀ H (w − w∗)
        = L(𝒳, w∗) + ½ (w − w∗)ᵀ H (w − w∗)

with gradient vector g = 0 (as w∗ is a minimum) and positive semi-definite Hessian matrix H.

When including the regularization term, this approximation becomes:

L̂(𝒳, w) = L(𝒳, w∗) + ½ (w − w∗)ᵀ H (w − w∗) + (α/2) wᵀw

L2 Regularization

When including the regularization term, this approximation becomes:

L̂(𝒳, w) = L(𝒳, w∗) + ½ (w − w∗)ᵀ H (w − w∗) + (α/2) wᵀw

The minimum w̃ of the regularized objective L̂(𝒳, w) is attained at ∇w L̂(𝒳, w) = 0:

H (w̃ − w∗) + α w̃ = 0
(H + αI) w̃ = H w∗
w̃ = (H + αI)⁻¹ H w∗

Thus, as α approaches 0, the regularized solution w̃ approaches w∗ .

L2 Regularization
What happens if α grows?
Consider the decomposition H = QΛQᵀ of the symmetric Hessian matrix into
a diagonal matrix of eigenvalues Λ and an orthonormal basis of eigenvectors Q:

w̃ = (H + αI)⁻¹ H w∗
  = (QΛQᵀ + αI)⁻¹ QΛQᵀ w∗
  = (Q (Λ + αI) Qᵀ)⁻¹ QΛQᵀ w∗
  = Q (Λ + αI)⁻¹ Λ Qᵀ w∗

Thus, the component of w∗ that is aligned with the i-th eigenvector of H is rescaled
by a factor of λi / (λi + α). Regularization mostly affects directions with small
eigenvalues λi ≪ α.
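
This closed form is easy to verify numerically (a NumPy sketch, not part of the original slides):

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
H = A @ A.T                      # symmetric positive semi-definite Hessian
w_star = rng.standard_normal(3)  # unregularized optimum
alpha = 0.1

w_tilde = np.linalg.solve(H + alpha * np.eye(3), H @ w_star)

lam, Q = np.linalg.eigh(H)       # H = Q diag(lam) Q^T
w_check = Q @ ((lam / (lam + alpha)) * (Q.T @ w_star))
assert np.allclose(w_tilde, w_check)  # per-direction shrinkage lam_i / (lam_i + alpha)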
L2 Regularization

[Figure: contour lines of the unregularized objective and of the L2 regularizer, meeting at the regularized solution w̃.]

I Contours of unregularized objective L(𝒳, w) and L2 regularizer R(w)


I At w̃, the competing objectives reach an equilibrium (solution to regularized loss)
I Along w1 , eigenvalue of H is small (low curvature) ⇒ strong effect of regularizer
I Along w2 , eigenvalue of H is large (high curvature) ⇒ small effect of regularizer
L1 Regularization

[Figure: contour lines of the unregularized objective and of the L1 regularizer, meeting at the regularized solution w̃.]

I Contours of unregularized objective L(𝒳, w) and L1 regularizer R(w)


I At w̃, the competing objectives reach an equilibrium (solution to regularized loss)
I L1 regularized loss function: L̃(𝒳, w) = L(𝒳, w) + α ‖w‖₁
I L1 Regularization results in a solution which is more sparse (compared to L2 )
L2 vs. L1 Regularization

Example: Assume 3 input features: x = (1, 2, 1)ᵀ

The following two linear classifiers f_w(x) = σ(wᵀx) yield the same result/loss:
I w1 = (0, 0.75, 0)ᵀ ⇒ ignores 2 features
I w2 = (0.25, 0.5, 0.25)ᵀ ⇒ takes all features into account
But the L1 and L2 regularizers prefer different solutions!

L2 Regularization:
I ‖w1‖₂² = 0 + 0.75² + 0 = 0.5625
I ‖w2‖₂² = 0.25² + 0.5² + 0.25² = 0.375 ⇒ L2 prefers w2

L1 Regularization:
I ‖w1‖₁ = 0 + 0.75 + 0 = 0.75 ⇒ L1 prefers w1
I ‖w2‖₁ = 0.25 + 0.5 + 0.25 = 1

Slide credits: Leal-Taixe and Niessner, I2DL.


L2 vs. L1 Regularization

[Figure: graphical comparison of the L2 and L1 penalties.]

Slide credits: Leal-Taixe and Niessner, I2DL.

Interpretation as MAP Inference

L2 regularization can be interpreted as Bayesian maximum-a-posteriori (MAP)
estimation of the network parameters w with a Gaussian prior applied to w:

w̃ = argmax_w p(w|y, X)
  = argmax_w p(y|X, w) p(w)
  = argmax_w [log p(y|X, w) + log p(w)]
  = argmax_w [log p(y|X, w) + log N(w|0, α⁻¹I)]
  = argmin_w [− log p(y|X, w) + (α/2) wᵀw]

where the last step uses log N(w|0, α⁻¹I) = −(α/2) wᵀw + const.
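
For the special case of linear regression with Gaussian likelihood, this MAP estimate has a familiar closed form (an illustrative sketch; the assumption of unit observation noise is mine):

import numpy as np

def ridge_map(X, y, alpha):
    # MAP estimate under Gaussian likelihood (unit noise variance) and
    # prior N(0, alpha^{-1} I): ridge regression,
    # w = (X^T X + alpha I)^{-1} X^T y.
    return np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)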

Computation Graph

I The combination of loss functions is straightforward (additional compute nodes in the graph)


5.2
Early Stopping
Early Stopping

I While training error decreases over time, validation error starts increasing again
I Thus: train for some time and return parameters with lowest validation error
Early Stopping vs. Parameter Penalties

[Figure: SGD trajectory on the unregularized objective stopped early at w̃, compared with the minimum of the L2-regularized objective.]

Early stopping:
I Dashed: trajectory taken by SGD
I Trajectory stops at w̃ before reaching the minimum w∗

L2 regularization:
I Regularize objective with L2 penalty
I Penalty forces minimum of regularized loss w̃ closer to the origin

I Under some assumptions, both are equivalent (see Chapter 7.8 of the textbook)
Early Stopping

Early Stopping:
I Most commonly used form of regularization in deep learning
I Effective, simple and computationally efficient form of regularization
I Training time can be viewed as hyperparameter ⇒ model selection problem
I Efficient, as a single training run tests all candidate training times (unlike weight decay, which needs one run per value of α)
I Only cost: periodically evaluate validation error on validation set
I Validation set can be small, and it can be evaluated less frequently

Remark: If little training data is available, one can perform a second training phase
where the model is retrained from scratch on all training data using the same number
of training iterations determined by the early stopping procedure
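
A minimal early-stopping loop might look as follows (a sketch; train_fn and val_error_fn are assumed helper functions, not part of the slides):

import copy

def train_with_early_stopping(model, train_fn, val_error_fn, patience=10):
    # Train in intervals, keep the parameters with the lowest validation
    # error, and stop once it has not improved `patience` times in a row.
    best_err, best_model, bad_evals = float("inf"), None, 0
    while bad_evals < patience:
        train_fn(model)                # train for one epoch / interval
        err = val_error_fn(model)      # periodic validation evaluation
        if err < best_err:
            best_err, best_model, bad_evals = err, copy.deepcopy(model), 0
        else:
            bad_evals += 1
    return best_model, best_err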
5.3
Ensemble Methods
Ensemble Methods

Idea:
I Train several models separately for the same task
I At inference time: average results
I Thus, often also called “model averaging”

Intuition:
I Different models make different errors on the test set
I By averaging we obtain a more robust estimate without training a better model!
I Works best if models are maximally uncorrelated
I Winning entries of challenges are often ensembles (e.g., Netflix challenge)
I Drawback: requires evaluation of multiple models at inference time
Ensemble Methods

Consider K regression models, each of which has an error εk with variances
E[εk²] = v and covariances E[εk εl] = c. The expected squared error of the ensemble
predictor (with each model having the same weight) is given as:

E[(1/K Σk εk)²] = 1/K² E[Σk εk² + Σk Σ_{l≠k} εk εl]
               = 1/K² (Σk E[εk²] + Σk Σ_{l≠k} E[εk εl])
               = 1/K² (K v + K (K−1) c)
               = v/K + (K−1)/K · c
Ensemble Methods

Consider K regression models, each of which has an error εk ∼ N(0, Σ) with
variances E[εk²] = v and covariances E[εk εl] = c. The ensemble error is given by:

E[(1/K Σk εk)²] = v/K + (K−1)/K · c

I If errors are correlated (c = v), the ensemble error becomes v ⇒ no gain
I If errors are uncorrelated (c = 0), the ensemble error reduces to v/K

Thus:
I Ensemble maximally effective if errors maximally uncorrelated
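
The formula above is easy to confirm with a small Monte-Carlo simulation (an illustrative sketch, not part of the original slides):

import numpy as np

rng = np.random.default_rng(0)
K, v, c, n = 5, 1.0, 0.3, 200_000
cov = np.full((K, K), c) + (v - c) * np.eye(K)  # variance v, pairwise covariance c
eps = rng.multivariate_normal(np.zeros(K), cov, size=n)

mse = np.mean(eps.mean(axis=1) ** 2)
print(mse, v / K + (K - 1) / K * c)  # both approximately 0.44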

Ensemble Methods

Different Types of Ensemble Methods:


I Initialization: Train networks starting from different random initializations on
the same dataset, or using different minibatches (via stochastic gradient descent).
This often already introduces some independence.
I Model: Use different models, architectures, losses or hyperparameters
I Bagging: Train networks on different random draws (with replacement) from the
original dataset. Thus, each dataset likely misses some of the examples from the
original dataset and contains some duplicates.
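
A bagging draw can be sketched in a few lines (illustrative; X and y are assumed to be NumPy arrays):

import numpy as np

def bootstrap_draw(X, y, rng):
    # Sample len(X) indices with replacement: each draw misses roughly
    # 37% of the original examples and duplicates others.
    idx = rng.integers(0, len(X), size=len(X))
    return X[idx], y[idx]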

Bagging Example

I First model learns to detect top “loop”, second model detects bottom “loop”
5.4
Dropout
Dropout

Idea:
I During training, set neurons to zero with probability µ (typically µ = 0.5)
I Each binary mask is one model, changes randomly with every training iteration
I Creates ensemble “on the fly” from a single network with shared parameters

Srivastava, Hinton, Krizhevsky, Sutskever and Salakhutdinov: Dropout: a simple way to prevent neural networks from overfitting. JMLR, 2014.
Dropout

Why is this a good idea?


I Forces the network to learn a redundant representation ⇒ regularization
I Reduces effective model capacity ⇒ requires larger models and longer training
I Prevents co-adaptation of features (units can’t learn to undo output of others)
I Requires only one forward pass at inference time ⇒ Why?
Srivastava, Hinton, Krizhevsky, Sutskever and Salakhutdinov: Dropout: a simple way to prevent neural networks from overfitting. JMLR, 2014.
Dropout
Inference:
I Dropout makes the output random. Formally, we have:

ŷ_z = f_w(x, z)

Here, z is a binary mask with one element per unit, drawn i.i.d. from a Bernoulli
distribution p(zi) = µ^(1−zi) (1 − µ)^zi, where zi = 0 if neuron i is removed from the network
I At inference time, we want to calculate the ensemble prediction:

ŷ = f_w(x) = E_z[f_w(x, z)] = Σ_z p(z) f_w(x, z)

I For M neurons, we have 2^M terms ⇒ this is not tractable

Srivastava, Hinton, Krizhevsky, Sutskever and Salakhutdinov: Dropout: a simple way to prevent neural networks from overfitting. JMLR, 2014.
Dropout
Let us consider a simple linear model:

f_w(x) = w1 x1 + w2 x2
f_w(x, z) = z1 w1 x1 + z2 w2 x2

Assuming µ = 0.5, during training we optimize the expectation over the ensemble:

E_z[f_w(x, z)] = ¼ (0 + 0) + ¼ (w1 x1 + 0) + ¼ (0 + w2 x2) + ¼ (w1 x1 + w2 x2)
             = ½ (w1 x1 + w2 x2) = ½ f_w(x)

Thus, at test time, we must multiply the trained weights by 1 − µ.


Remark: This weight scaling inference is only an approximation for non-linear models.
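
The train/test asymmetry can be summarized in a short sketch (illustrative NumPy code; scaling the activations by 1 − µ is equivalent to scaling the outgoing weights):

import numpy as np

def dropout_forward(h, mu, train, rng=None):
    # Training: zero each unit with probability mu (z_i = 0 drops unit i).
    # Inference: scale activations by (1 - mu), the expected mask value.
    if train:
        z = rng.random(h.shape) >= mu
        return h * z
    return h * (1.0 - mu)

In practice, many implementations use "inverted dropout", which instead divides by 1 − µ during training so that no scaling is needed at test time.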
Srivastava, Hinton, Krizhevsky, Sutskever and Salakhutdinov: Dropout: a simple way to prevent neural networks from overfitting. JMLR, 2014.
Dropout

I Features of an Autoencoder on MNIST with a single hidden layer of 256 ReLUs


I Right: With dropout ⇒ less co-adaptation and thus better generalization
Srivastava, Hinton, Krizhevsky, Sutskever and Salakhutdinov: Dropout: a simple way to prevent neural networks from overfitting. JMLR, 2014.
5.5
Data Augmentation
Data Augmentation

Motivation:
I Deep neural networks must be invariant to a wide variety of input variations
I Often large intra-class variation in terms of pose, appearance, lighting, etc.
Data Augmentation
I Best way towards better generalization
is to train on more data
I However, data in practice often limited
I Goal of data augmentation: create
“fake” data from the existing data (on
the fly) and add it to the training set
I New data must preserve semantics
I Even simple operations like translation
or adding per-pixel noise often already
greatly improve generalization
I https://github.com/aleju/imgaug

Geometric Transformations
Data Augmentation: Geometry

Image Cropping

iaa.Crop(px=(1,64))

Data Augmentation: Geometry

Image Cropping and Padding

iaa.CropAndPad(percent=(-0.2, 0.2), pad_mode=ia.ALL, pad_cval=(0, 255))

Data Augmentation: Geometry

Horizontal Image Flipping

iaa.Fliplr(0.5)

Data Augmentation: Geometry

Affine Transformation

iaa.Affine()

Data Augmentation: Geometry

Piecewise Affine Transformation

iaa.PiecewiseAffine(scale=(0.01, 0.1))

Data Augmentation: Geometry

Perspective Transformation

iaa.PerspectiveTransform(scale=(0, 0.4))

Local Filters
Data Augmentation: Local Filters

Gaussian Blur

iaa.GaussianBlur(sigma=(0.0, 10.0))

Data Augmentation: Local Filters

Image Sharpening

iaa.Sharpen(alpha=(0, 0.5), lightness=(0.75, 1.25))

Data Augmentation: Local Filters

Emboss Effect

iaa.Emboss(alpha=(0, 1.0), strength=(0, 2.0))

Data Augmentation: Local Filters

Edge Detection

iaa.EdgeDetect(alpha=(0, 1.0))

Adding Noise
Data Augmentation: Noise

Geirhos, Temme, Rauber, Schütt, Bethge and Wichmann: Generalisation in humans and deep neural networks. NeurIPS, 2018.
Data Augmentation: Noise

Gaussian Noise

iaa.AdditiveGaussianNoise(loc=0, scale=(0.0, 0.2*255))

Data Augmentation: Noise

Salt and Pepper Noise

iaa.SaltAndPepper(0.1)

Data Augmentation: Noise

Dropout Noise

iaa.Dropout((0.01, 0.5))

Data Augmentation: Noise

Cutout

iaa.Cutout(nb_iterations=(1, 5), size=0.2, squared=False)

Data Augmentation: Noise

Noise Augmentation:
I Noise can also be applied to the hidden units, not only to the input
I Prominent example: Dropout
Color Transformations
Data Augmentation: Color

Contrast

iaa.LinearContrast((0.1, 2.0), per_channel=0)

Data Augmentation: Color

Brightness

iaa.Multiply((0.5, 1.5), per_channel=0)

Data Augmentation: Color

Brightness per Channel

iaa.Multiply((0.5, 1.5), per_channel=0.5)

Data Augmentation: Color

Local Brightness

iaa.FrequencyNoiseAlpha(exponent=(-4, 0), first=iaa.Multiply((0.5, 1.5), per_channel=True))

Data Augmentation: Color

Hue and Saturation

iaa.AddToHueAndSaturation((-50, 50))

Data Augmentation: Color

Color Inversion

iaa.Invert(0.5, per_channel=0.75)

Data Augmentation: Color

Grayscale

iaa.Grayscale(alpha=(0.0, 1.0))

Weather
Data Augmentation: Weather

Snow

iaa.FastSnowyLandscape(lightness_threshold=(100, 255), lightness_multiplier=(1.0, 4.0))

Data Augmentation: Weather

Clouds

iaa.Clouds()

Data Augmentation: Weather

Fog

iaa.Fog()

Random Combinations
Data Augmentation: Random Combination
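
Individual augmenters like the ones above can be chained into one randomized pipeline; a small sketch following the imgaug API (parameter values are illustrative):

import imgaug.augmenters as iaa

# Each image receives the listed augmenters in random order.
seq = iaa.Sequential([
    iaa.Fliplr(0.5),
    iaa.Crop(px=(0, 16)),
    iaa.GaussianBlur(sigma=(0.0, 2.0)),
    iaa.AdditiveGaussianNoise(scale=(0.0, 0.05 * 255)),
    iaa.Multiply((0.8, 1.2)),
], random_order=True)

images_aug = seq(images=images)  # images: uint8 array of shape (N, H, W, 3)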

Output Transformations
Data Augmentation: Output Transformations
I For some classification tasks, e.g., handwritten letter recognition, be careful to
not apply transformations that would change the output class
I Example 1: A horizontal flip changes the interpretation of the letter ’d’:

[Figure: the letter ’d’ flipped horizontally reads as ’b’.]

I Example 2: A 180° rotation changes the interpretation of the number ’6’:

[Figure: the number ’6’ rotated by 180° reads as ’9’.]

I Remark: For general object recognition, flips and rotations can often be useful!
Data Augmentation: Output Transformations

I For dense prediction tasks (depth/instance/keypoints), also transform targets


Data Augmentation
I When comparing two networks, make sure you use the same augmentation
I Consider data augmentation as a part of your network design
I It is important to specify the right distributions (often done empirically)
I Can also be combined with ensemble idea:
I At training time, sample random crops/scales and train one model
I At inference time, average predictions for a fixed set of crops of the test image (see the sketch at the end of this section)
I AutoAugment uses reinforcement learning to find strategies automatically:

Cubuk, Zoph, Mané, Vasudevan, Le: AutoAugment: Learning Augmentation Strategies From Data. CVPR, 2019.
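
Such test-time augmentation amounts to a few lines (an illustrative sketch; model and crop_fns are assumed placeholders):

import numpy as np

def predict_with_crops(model, image, crop_fns):
    # Average the model's predictions over a fixed set of crops.
    return np.mean([model(fn(image)) for fn in crop_fns], axis=0)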
