Lec 05 Regularization

This document discusses regularization techniques for deep learning models. It begins by introducing the concept of adding a penalty term to the loss function to limit model capacity and prevent overfitting. It then describes several specific regularization methods: 1) L2 regularization (weight decay) adds a penalty that is the sum of the squares of the weights. This has the effect of pushing weights closer to zero during training. 2) L1 regularization uses a penalty that is the sum of the absolute values of the weights, encouraging sparsity. 3) Other techniques discussed include early stopping, ensemble methods, dropout, and data augmentation.


Deep Learning

Lecture 5 – Regularization

Prof. Dr.-Ing. Andreas Geiger


Autonomous Vision Group
University of Tübingen / MPI-IS
Agenda

5.1 Parameter Penalties

5.2 Early Stopping

5.3 Ensemble Methods

5.4 Dropout

5.5 Data Augmentation

Recap: Capacity, Overfitting and Underfitting
[Figure: polynomial fits of degree M = 1, M = 3 and M = 9 (with a test set) to noisy observations of a ground-truth function. Left: capacity too low. Middle: capacity about right. Right: capacity too high.]

I Underfitting: Model too simple, does not achieve low error on training set
I Overfitting: Training error small, but test error (= generalization error) large
I Regularization: Take model from third regime (right) to second regime (middle)

Recap: Capacity, Overfitting and Underfitting
[Figure: training error and generalization error (log scale) as a function of the degree of the polynomial.]

Regularization:
I Trades increased bias for reduced variance
I Goal is to minimize generalization error despite using large model family
Function Space View

[Figure: function space view — within the model's function space, the data term and the regularizer jointly determine the solution.]
5.1
Parameter Penalties
Parameter Penalties
Let 𝒳 = (X, y) denote the dataset and w the model parameters. We can limit the
model capacity by adding a parameter norm penalty R to the loss L:

L̃(𝒳, w) = L(𝒳, w) + α R(w)

Here, L̃ is the total loss, L the original loss and R the regularizer;
α ∈ [0, ∞) controls the strength of the regularizer.

I R quantifies the size of the parameters / model capacity


I Minimizing L̃ will decrease both L and R
I Typically, R is applied only to the weights (not the bias) of the affine layers
I Often, R drives weights closer to the origin (in absence of prior knowledge)
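
To make this concrete, here is a minimal sketch of such a regularized objective in Python (all names are illustrative; the squared L2 norm merely serves as one example for R):

import numpy as np

def total_loss(data_loss, weights, alpha):
    # Total loss = original loss + alpha * R(w); the penalty is applied
    # to the weight matrices only, not to the biases.
    R = sum(0.5 * np.sum(W ** 2) for W in weights)  # example: R(w) = 1/2 ||w||_2^2
    return data_loss + alpha * R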

Parameter Penalties

Why do we want the weights/inputs to be small?


I Suppose x1 and x2 are nearly identical.
The following two networks make nearly the same predictions:

I But the second network might predict wrongly if the test distribution is slightly
different (x1 and x2 match less closely) ⇒ Worse generalization

Parameter Penalties

[Figure: objective function, regularizer, and the resulting optimum of the regularized objective.]

L2 Regularization
Weight decay (= ridge regression) uses an L2 penalty R(w) = ½ ‖w‖₂²:

L̃(𝒳, w) = L(𝒳, w) + α R(w)
        = L(𝒳, w) + (α/2) wᵀw

The parameter updates during gradient descent are given by:

wt+1 = wt − η ∇w L̃(𝒳, wt)
     = wt − η (∇w L(𝒳, wt) + α wt)
     = (1 − η α) wt − η ∇w L(𝒳, wt)

Thus, we decay the weights at each training iteration before the gradient update.
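
In code, this decay-then-step update takes one line (a minimal sketch; all names are illustrative):

def weight_decay_step(w, grad_L, lr, alpha):
    # Shrink the weights by the factor (1 - lr * alpha), then apply the
    # gradient of the unregularized loss.
    return (1.0 - lr * alpha) * w - lr * grad_L

Deep learning frameworks typically expose exactly this behaviour, e.g. as a weight_decay argument of SGD-style optimizers.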
L2 Regularization
What happens over the entire course of training? Let w∗ = argmin_w L(𝒳, w)
denote the solution to the unregularized objective and consider a
quadratic approximation L̂ of the unregularized loss L around w∗:

L̂(𝒳, w) = L(𝒳, w∗) + gᵀ(w − w∗) + ½ (w − w∗)ᵀ H (w − w∗)
        = L(𝒳, w∗) + ½ (w − w∗)ᵀ H (w − w∗)

with gradient vector g = 0 (as w∗ is a minimum) and positive semi-definite Hessian matrix H.

When including the regularization term, this approximation becomes:

L̂(𝒳, w) = L(𝒳, w∗) + ½ (w − w∗)ᵀ H (w − w∗) + (α/2) wᵀw

L2 Regularization

When including the regularization term, this approximation becomes:

L̂(𝒳, w) = L(𝒳, w∗) + ½ (w − w∗)ᵀ H (w − w∗) + (α/2) wᵀw

The minimum w̃ of the regularized objective L̂(𝒳, w) is attained at ∇w L̂(𝒳, w) = 0:

H (w̃ − w∗) + α w̃ = 0
(H + αI) w̃ = H w∗
w̃ = (H + αI)⁻¹ H w∗

Thus, as α approaches 0, the regularized solution w̃ approaches w∗ .

L2 Regularization
What happens if α grows?
Consider the decomposition H = QΛQᵀ of the symmetric Hessian matrix into
a diagonal matrix of eigenvalues Λ and an orthonormal basis of eigenvectors Q:

w̃ = (H + αI)⁻¹ H w∗
  = (QΛQᵀ + αI)⁻¹ QΛQᵀ w∗
  = (Q (Λ + αI) Qᵀ)⁻¹ QΛQᵀ w∗
  = Q (Λ + αI)⁻¹ Λ Qᵀ w∗

Thus, the component of w∗ that is aligned with the i-th eigenvector of H is rescaled
by a factor of λi / (λi + α). Regularization mostly affects directions with small
eigenvalues λi ≪ α.
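
This closed form is easy to verify numerically (a NumPy sketch, not part of the original slides):

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
H = A @ A.T                      # symmetric positive semi-definite Hessian
w_star = rng.standard_normal(3)  # unregularized optimum
alpha = 0.1

w_tilde = np.linalg.solve(H + alpha * np.eye(3), H @ w_star)

lam, Q = np.linalg.eigh(H)       # H = Q diag(lam) Q^T
w_check = Q @ ((lam / (lam + alpha)) * (Q.T @ w_star))
assert np.allclose(w_tilde, w_check)  # per-direction shrinkage lam_i / (lam_i + alpha)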
L2 Regularization

[Figure: contour lines of the unregularized objective and of the L2 regularizer, meeting at the regularized solution w̃.]

I Contours of unregularized objective L(𝒳, w) and L2 regularizer R(w)


I At w̃, the competing objectives reach an equilibrium (solution to regularized loss)
I Along w1 , eigenvalue of H is small (low curvature) ⇒ strong effect of regularizer
I Along w2 , eigenvalue of H is large (high curvature) ⇒ small effect of regularizer
L1 Regularization

[Figure: contour lines of the unregularized objective and of the L1 regularizer, meeting at the regularized solution w̃.]

I Contours of unregularized objective L(𝒳, w) and L1 regularizer R(w)


I At w̃, the competing objectives reach an equilibrium (solution to regularized loss)
I L1 regularized loss function: L̃(𝒳, w) = L(𝒳, w) + α ‖w‖₁
I L1 Regularization results in a solution which is more sparse (compared to L2 )
L2 vs. L1 Regularization

Example: Assume 3 input features: x = (1, 2, 1)ᵀ

The following two linear classifiers f_w(x) = σ(wᵀx) yield the same result/loss:
I w1 = (0, 0.75, 0)ᵀ ⇒ ignores 2 features
I w2 = (0.25, 0.5, 0.25)ᵀ ⇒ takes all features into account
But the L1 and L2 regularizers prefer different solutions!

L2 Regularization:
I ‖w1‖₂² = 0 + 0.75² + 0 = 0.5625
I ‖w2‖₂² = 0.25² + 0.5² + 0.25² = 0.375 ⇒ L2 prefers w2

L1 Regularization:
I ‖w1‖₁ = 0 + 0.75 + 0 = 0.75 ⇒ L1 prefers w1
I ‖w2‖₁ = 0.25 + 0.5 + 0.25 = 1

Slide credits: Leal-Taixe and Niessner, I2DL.


L2 vs. L1 Regularization

[Figure: graphical comparison of the L2 and L1 penalties.]

Slide credits: Leal-Taixe and Niessner, I2DL.

Interpretation as MAP Inference

L2 regularization can be interpreted as Bayesian maximum-a-posteriori (MAP)
estimation of the network parameters w with a Gaussian prior applied to w:

w̃ = argmax_w p(w|y, X)
  = argmax_w p(y|X, w) p(w)
  = argmax_w [log p(y|X, w) + log p(w)]
  = argmax_w [log p(y|X, w) + log N(w|0, α⁻¹I)]
  = argmin_w [− log p(y|X, w) + (α/2) wᵀw]

where the last step uses log N(w|0, α⁻¹I) = −(α/2) wᵀw + const.
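
For the special case of linear regression with Gaussian likelihood, this MAP estimate has a familiar closed form (an illustrative sketch; the assumption of unit observation noise is mine):

import numpy as np

def ridge_map(X, y, alpha):
    # MAP estimate under Gaussian likelihood (unit noise variance) and
    # prior N(0, alpha^{-1} I): ridge regression,
    # w = (X^T X + alpha I)^{-1} X^T y.
    return np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)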

Computation Graph

I The combination of loss functions is straightforward (additional compute nodes in the graph)


5.2
Early Stopping
Early Stopping

I While training error decreases over time, validation error starts increasing again
I Thus: train for some time and return parameters with lowest validation error
Early Stopping vs. Parameter Penalties

[Figure: SGD trajectory on the unregularized objective stopped early at w̃, compared with the minimum of the L2-regularized objective.]

Early stopping:
I Dashed: trajectory taken by SGD
I Trajectory stops at w̃ before reaching the minimum w∗

L2 regularization:
I Regularize objective with L2 penalty
I Penalty forces minimum of regularized loss w̃ closer to the origin

I Under some assumptions, both are equivalent (see Chapter 7.8 of the textbook)
Early Stopping

Early Stopping:
I Most commonly used form of regularization in deep learning
I Effective, simple and computationally efficient form of regularization
I Training time can be viewed as hyperparameter ⇒ model selection problem
I Efficient, as a single training run tests all candidate training times (unlike weight decay, which needs one run per value of α)
I Only cost: periodically evaluate validation error on validation set
I Validation set can be small, and it can be evaluated less frequently

Remark: If little training data is available, one can perform a second training phase
where the model is retrained from scratch on all training data using the same number
of training iterations determined by the early stopping procedure
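
A minimal early-stopping loop might look as follows (a sketch; train_fn and val_error_fn are assumed helper functions, not part of the slides):

import copy

def train_with_early_stopping(model, train_fn, val_error_fn, patience=10):
    # Train in intervals, keep the parameters with the lowest validation
    # error, and stop once it has not improved `patience` times in a row.
    best_err, best_model, bad_evals = float("inf"), None, 0
    while bad_evals < patience:
        train_fn(model)                # train for one epoch / interval
        err = val_error_fn(model)      # periodic validation evaluation
        if err < best_err:
            best_err, best_model, bad_evals = err, copy.deepcopy(model), 0
        else:
            bad_evals += 1
    return best_model, best_err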
5.3
Ensemble Methods
Ensemble Methods

Idea:
I Train several models separately for the same task
I At inference time: average results
I Thus, often also called “model averaging”

Intuition:
I Different models make different errors on the test set
I By averaging we obtain a more robust estimate without training a better model!
I Works best if models are maximally uncorrelated
I Winning entries of challenges are often ensembles (e.g., Netflix challenge)
I Drawback: requires evaluation of multiple models at inference time
Ensemble Methods

Consider K regression models, each of which has an error εk with variances
E[εk²] = v and covariances E[εk εl] = c. The expected squared error of the ensemble
predictor (with each model having the same weight) is given as:

E[(1/K Σk εk)²] = 1/K² E[Σk εk² + Σk Σ_{l≠k} εk εl]
               = 1/K² (Σk E[εk²] + Σk Σ_{l≠k} E[εk εl])
               = 1/K² (K v + K (K−1) c)
               = v/K + (K−1)/K · c
Ensemble Methods

Consider K regression models, each of which has an error εk ∼ N(0, Σ) with
variances E[εk²] = v and covariances E[εk εl] = c. The ensemble error is given by:

E[(1/K Σk εk)²] = v/K + (K−1)/K · c

I If errors are correlated (c = v), the ensemble error becomes v ⇒ no gain
I If errors are uncorrelated (c = 0), the ensemble error reduces to v/K

Thus:
I Ensemble maximally effective if errors maximally uncorrelated
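
The formula above is easy to confirm with a small Monte-Carlo simulation (an illustrative sketch, not part of the original slides):

import numpy as np

rng = np.random.default_rng(0)
K, v, c, n = 5, 1.0, 0.3, 200_000
cov = np.full((K, K), c) + (v - c) * np.eye(K)  # variance v, pairwise covariance c
eps = rng.multivariate_normal(np.zeros(K), cov, size=n)

mse = np.mean(eps.mean(axis=1) ** 2)
print(mse, v / K + (K - 1) / K * c)  # both approximately 0.44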

Ensemble Methods

Different Types of Ensemble Methods:


I Initialization: Train networks starting from different random initializations on
the same dataset, or using different minibatches (via stochastic gradient descent).
This often already introduces some independence.
I Model: Use different models, architectures, losses or hyperparameters
I Bagging: Train networks on different random draws (with replacement) from the
original dataset. Thus, each dataset likely misses some of the examples from the
original dataset and contains some duplicates.
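
A bagging draw can be sketched in a few lines (illustrative; X and y are assumed to be NumPy arrays):

import numpy as np

def bootstrap_draw(X, y, rng):
    # Sample len(X) indices with replacement: each draw misses roughly
    # 37% of the original examples and duplicates others.
    idx = rng.integers(0, len(X), size=len(X))
    return X[idx], y[idx]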

Bagging Example

I First model learns to detect top “loop”, second model detects bottom “loop”
5.4
Dropout
Dropout

Idea:
I During training, set neurons to zero with probability µ (typically µ = 0.5)
I Each binary mask is one model, changes randomly with every training iteration
I Creates ensemble “on the fly” from a single network with shared parameters

Srivastava, Hinton, Krizhevsky, Sutskever and Salakhutdinov: Dropout: a simple way to prevent neural networks from overfitting. JMLR, 2014.
Dropout

Why is this a good idea?


I Forces the network to learn a redundant representation ⇒ regularization
I Reduces effective model capacity ⇒ requires larger models and longer training
I Prevents co-adaptation of features (units can’t learn to undo output of others)
I Requires only one forward pass at inference time ⇒ Why?
Srivastava, Hinton, Krizhevsky, Sutskever and Salakhutdinov: Dropout: a simple way to prevent neural networks from overfitting. JMLR, 2014.
Dropout
Inference:
I Dropout makes the output random. Formally, we have:

ŷ_z = f_w(x, z)

Here, z is a binary mask with one element per unit, drawn i.i.d. from a Bernoulli
distribution p(zi) = µ^(1−zi) (1 − µ)^zi, where zi = 0 if neuron i is removed from the network
I At inference time, we want to calculate the ensemble prediction:

ŷ = f_w(x) = E_z[f_w(x, z)] = Σ_z p(z) f_w(x, z)

I For M neurons, we have 2^M terms ⇒ this is not tractable

Srivastava, Hinton, Krizhevsky, Sutskever and Salakhutdinov: Dropout: a simple way to prevent neural networks from overfitting. JMLR, 2014.
Dropout
Let us consider a simple linear model:

f_w(x) = w1 x1 + w2 x2
f_w(x, z) = z1 w1 x1 + z2 w2 x2

Assuming µ = 0.5, during training we optimize the expectation over the ensemble:

E_z[f_w(x, z)] = ¼ (0 + 0) + ¼ (w1 x1 + 0) + ¼ (0 + w2 x2) + ¼ (w1 x1 + w2 x2)
             = ½ (w1 x1 + w2 x2) = ½ f_w(x)

Thus, at test time, we must multiply the trained weights by 1 − µ.


Remark: This weight scaling inference is only an approximation for non-linear models.
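
The train/test asymmetry can be summarized in a short sketch (illustrative NumPy code; scaling the activations by 1 − µ is equivalent to scaling the outgoing weights):

import numpy as np

def dropout_forward(h, mu, train, rng=None):
    # Training: zero each unit with probability mu (z_i = 0 drops unit i).
    # Inference: scale activations by (1 - mu), the expected mask value.
    if train:
        z = rng.random(h.shape) >= mu
        return h * z
    return h * (1.0 - mu)

In practice, many implementations use "inverted dropout", which instead divides by 1 − µ during training so that no scaling is needed at test time.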
Srivastava, Hinton, Krizhevsky, Sutskever and Salakhutdinov: Dropout: a simple way to prevent neural networks from overfitting. JMLR, 2014.
Dropout

I Features of an Autoencoder on MNIST with a single hidden layer of 256 ReLUs


I Right: With dropout ⇒ less co-adaptation and thus better generalization
Srivastava, Hinton, Krizhevsky, Sutskever and Salakhutdinov: Dropout: a simple way to prevent neural networks from overfitting. JMLR, 2014.
5.5
Data Augmentation
Data Augmentation

Motivation:
I Deep neural networks must be invariant to a wide variety of input variations
I Often large intra-class variation in terms of pose, appearance, lighting, etc.
Data Augmentation
I Best way towards better generalization
is to train on more data
I However, data in practice often limited
I Goal of data augmentation: create
“fake” data from the existing data (on
the fly) and add it to the training set
I New data must preserve semantics
I Even simple operations like translation
or adding per-pixel noise often already
greatly improve generalization
I https://github.com/aleju/imgaug

Geometric Transformations
Data Augmentation: Geometry

Image Cropping

iaa.Crop(px=(1,64))

Data Augmentation: Geometry

Image Cropping and Padding

iaa.CropAndPad(percent=(-0.2, 0.2), pad_mode=ia.ALL, pad_cval=(0, 255))

Data Augmentation: Geometry

Horizontal Image Flipping

iaa.Fliplr(0.5)

Data Augmentation: Geometry

Affine Transformation

iaa.Affine()

Data Augmentation: Geometry

Piecewise Affine Transformation

iaa.PiecewiseAffine(scale=(0.01, 0.1))

Data Augmentation: Geometry

Perspective Transformation

iaa.PerspectiveTransform(scale=(0, 0.4))

Local Filters
Data Augmentation: Local Filters

Gaussian Blur

iaa.GaussianBlur(sigma=(0.0, 10.0))

Data Augmentation: Local Filters

Image Sharpening

iaa.Sharpen(alpha=(0, 0.5), lightness=(0.75, 1.25))

Data Augmentation: Local Filters

Emboss Effect

iaa.Emboss(alpha=(0, 1.0), strength=(0, 2.0))

Data Augmentation: Local Filters

Edge Detection

iaa.EdgeDetect(alpha=(0, 1.0))

Adding Noise
Data Augmentation: Noise

Geirhos, Temme, Rauber, Schütt, Bethge and Wichmann: Generalisation in humans and deep neural networks. NeurIPS, 2018.
Data Augmentation: Noise

Gaussian Noise

iaa.AdditiveGaussianNoise(loc=0, scale=(0.0, 0.2*255))

Data Augmentation: Noise

Salt and Pepper Noise

iaa.SaltAndPepper(0.1)

Data Augmentation: Noise

Dropout Noise

iaa.Dropout((0.01, 0.5))

Data Augmentation: Noise

Cutout

iaa.Cutout(nb_iterations=(1, 5), size=0.2, squared=False)

Data Augmentation: Noise

Noise Augmentation:
I Noise can also be applied to the hidden units, not only to the input
I Prominent example: Dropout
Color Transformations
Data Augmentation: Color

Contrast

iaa.LinearContrast((0.1, 2.0), per_channel=0)

Data Augmentation: Color

Brightness

iaa.Multiply((0.5, 1.5), per_channel=0)

Data Augmentation: Color

Brightness per Channel

iaa.Multiply((0.5, 1.5), per_channel=0.5)

Data Augmentation: Color

Local Brightness

iaa.FrequencyNoiseAlpha(exponent=(-4, 0), first=iaa.Multiply((0.5, 1.5), per_channel=True))

Data Augmentation: Color

Hue and Saturation

iaa.AddToHueAndSaturation((-50, 50))

Data Augmentation: Color

Color Inversion

iaa.Invert(0.5, per_channel=0.75)

Data Augmentation: Color

Grayscale

iaa.Grayscale(alpha=(0.0, 1.0))

Weather
Data Augmentation: Weather

Snow

iaa.FastSnowyLandscape(lightness_threshold=(100, 255), lightness_multiplier=(1.0, 4.0))

Data Augmentation: Weather

Clouds

iaa.Clouds()

Data Augmentation: Weather

Fog

iaa.Fog()

Random Combinations
Data Augmentation: Random Combination
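
Individual augmenters like the ones above can be chained into one randomized pipeline; a small sketch following the imgaug API (parameter values are illustrative):

import imgaug.augmenters as iaa

# Each image receives the listed augmenters in random order.
seq = iaa.Sequential([
    iaa.Fliplr(0.5),
    iaa.Crop(px=(0, 16)),
    iaa.GaussianBlur(sigma=(0.0, 2.0)),
    iaa.AdditiveGaussianNoise(scale=(0.0, 0.05 * 255)),
    iaa.Multiply((0.8, 1.2)),
], random_order=True)

images_aug = seq(images=images)  # images: uint8 array of shape (N, H, W, 3)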

Output Transformations
Data Augmentation: Output Transformations
I For some classification tasks, e.g., handwritten letter recognition, be careful to
not apply transformations that would change the output class
I Example 1: A horizontal flip changes the interpretation of the letter ’d’:

[Figure: the letter ’d’ flipped horizontally reads as ’b’.]

I Example 2: A 180° rotation changes the interpretation of the number ’6’:

[Figure: the number ’6’ rotated by 180° reads as ’9’.]

I Remark: For general object recognition, flips and rotations can often be useful!
Data Augmentation: Output Transformations

I For dense prediction tasks (depth/instance/keypoints), also transform targets


Data Augmentation
I When comparing two networks, make sure you use the same augmentation
I Consider data augmentation as a part of your network design
I It is important to specify the right distributions (often done empirically)
I Can also be combined with ensemble idea:
I At training time, sample random crops/scales and train one model
I At inference time, average predictions for a fixed set of crops of the test image (see the sketch at the end of this section)
I AutoAugment uses reinforcement learning to find strategies automatically:

Cubuk, Zoph, Mané, Vasudevan, Le: AutoAugment: Learning Augmentation Strategies From Data. CVPR, 2019.
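
Such test-time augmentation amounts to a few lines (an illustrative sketch; model and crop_fns are assumed placeholders):

import numpy as np

def predict_with_crops(model, image, crop_fns):
    # Average the model's predictions over a fixed set of crops.
    return np.mean([model(fn(image)) for fn in crop_fns], axis=0)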
