Lecture 5 – Regularization
Recap: Capacity, Overfitting and Underfitting
[Figure: polynomial fits of degree M = 1, M = 3 and M = 9 to noisy observations, each shown with the ground truth and the test set]
- Underfitting: Model too simple, does not achieve low error on the training set
- Overfitting: Training error small, but test error (= generalization error) large
- Regularization: Take the model from the third regime (right) to the second regime (middle)
Recap: Capacity, Overfitting and Underfitting
[Figure: training error and generalization error (log scale) as a function of the polynomial degree (0–9)]
Regularization:
- Trades increased bias for reduced variance
- The goal is to minimize the generalization error despite using a large model family
Function Space View
[Figure: function space view — data, regularizer and the resulting solution within the space of functions]
5.1 Parameter Penalties
Let X = (X, y) denote the dataset and w the model parameters. We can limit the model capacity by adding a parameter norm penalty R to the loss L:

L̃(X, w) = L(X, w) + α R(w)

where α ≥ 0 weights the penalty relative to the loss.
Parameter Penalties
- But the second network might predict wrongly if the test distribution is slightly different (x₁ and x₂ match less closely) ⇒ worse generalization
Parameter Penalties
[Figure: contours of the objective function with its optimum, together with the regularizer]
L2 Regularization
Weight decay (= ridge regression) uses an L2 penalty R(w) = ½ ‖w‖₂²:

w_{t+1} = w_t − η ∇_w L̃(X, w_t)
        = w_t − η (∇_w L(X, w_t) + α w_t)
        = (1 − ηα) w_t − η ∇_w L(X, w_t)
Thus, we decay the weights at each training iteration before the gradient update.
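A minimal sketch of this update in plain NumPy; grad_L stands in for the gradient of the unregularized loss, and the quadratic toy loss below is made up for illustration:

import numpy as np

def sgd_step_weight_decay(w, grad_L, eta=0.1, alpha=1e-2):
    # Two equivalent views of the same update:
    # (1) gradient step on the regularized loss: w - eta * (grad_L(w) + alpha * w)
    # (2) decay the weights first, then take the usual gradient step
    return (1.0 - eta * alpha) * w - eta * grad_L(w)

# Toy example: L(w) = 0.5 * ||w - 1||^2, so grad_L(w) = w - 1
w = np.zeros(3)
for _ in range(100):
    w = sgd_step_weight_decay(w, lambda w: w - 1.0)
print(w)  # converges near 1 / (1 + alpha): pulled slightly towards zero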
L2 Regularization
What happens over the entire course of training? Let w* = argmin_w L(X, w) denote the solution to the unregularized objective and consider a quadratic approximation L̂ of the unregularized loss L around w*:

L̂(X, w) = L(X, w*) + g⊤(w − w*) + ½ (w − w*)⊤ H (w − w*)
        = L(X, w*) + ½ (w − w*)⊤ H (w − w*)

where the gradient g vanishes since w* is a minimum, and H denotes the Hessian of L at w*. Adding the L2 penalty yields the regularized approximation:

L̂(X, w) = L(X, w*) + ½ (w − w*)⊤ H (w − w*) + (α/2) w⊤w
L2 Regularization
Setting the gradient of the regularized approximation

L̂(X, w) = L(X, w*) + ½ (w − w*)⊤ H (w − w*) + (α/2) w⊤w

to zero yields the regularized optimum w̃:

H(w̃ − w*) + α w̃ = 0
(H + αI) w̃ = H w*
w̃ = (H + αI)⁻¹ H w*
L2 Regularization
What happens as α grows? Consider the eigendecomposition H = QΛQ⊤ of the symmetric Hessian matrix into a diagonal matrix of eigenvalues Λ and an orthonormal basis of eigenvectors Q:

w̃ = (H + αI)⁻¹ H w*
  = (QΛQ⊤ + αI)⁻¹ QΛQ⊤ w*
  = (Q (Λ + αI) Q⊤)⁻¹ QΛQ⊤ w*
  = Q (Λ + αI)⁻¹ Λ Q⊤ w*

Thus, the component of w* that is aligned with the i-th eigenvector of H is rescaled by a factor of λᵢ/(λᵢ + α). Regularization mostly affects directions with small eigenvalues λᵢ ≪ α.
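A quick numeric check of this rescaling (eigenvalues and w* are made up):

import numpy as np

Q, _ = np.linalg.qr(np.random.randn(3, 3))   # orthonormal eigenvectors
lam = np.array([10.0, 1.0, 0.01])            # eigenvalues of H (made up)
H = Q @ np.diag(lam) @ Q.T
w_star = np.random.randn(3)
alpha = 0.1

# Regularized optimum: (H + alpha*I)^-1 H w*
w_tilde = np.linalg.solve(H + alpha * np.eye(3), H @ w_star)

# In the eigenbasis, each component of w* is rescaled by lam / (lam + alpha):
# large eigenvalues (10) barely shrink, small ones (0.01) are almost zeroed.
print(Q.T @ w_tilde / (Q.T @ w_star))
print(lam / (lam + alpha))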
L2 Regularization
[Figure: contours of the unregularized objective together with an L2 regularizer (left) and an L1 regularizer (right)]
The following two linear classifiers f_w(x) = σ(w⊤x) yield the same result/loss:
- w₁ = (0, 0.75, 0)⊤ ⇒ ignores 2 features
- w₂ = (0.25, 0.5, 0.25)⊤ ⇒ takes all features into account
But the L1 and L2 regularizers prefer different solutions:
- L2 Regularization: ‖w₁‖₂² = 0² + 0.75² + 0² = 0.5625 and ‖w₂‖₂² = 0.25² + 0.5² + 0.25² = 0.375 ⇒ prefers the dense solution w₂
- L1 Regularization: ‖w₁‖₁ = 0 + 0.75 + 0 = 0.75 and ‖w₂‖₁ = 0.25 + 0.5 + 0.25 = 1 ⇒ prefers the sparse solution w₁
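A one-liner check of this arithmetic in plain NumPy:

import numpy as np

w1 = np.array([0.0, 0.75, 0.0])
w2 = np.array([0.25, 0.5, 0.25])

# Squared L2 norms: the L2 penalty is smaller for the dense w2
print(np.sum(w1**2), np.sum(w2**2))            # 0.5625 0.375
# L1 norms: the L1 penalty is smaller for the sparse w1
print(np.sum(np.abs(w1)), np.sum(np.abs(w2)))  # 0.75 1.0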
Remark: Parameter penalties can also be interpreted as maximum a-posteriori (MAP) estimation of the weights:

w̃ = argmax_w p(w | y, X)

where, e.g., a Gaussian prior p(w) on the weights corresponds to an L2 penalty.
5.2 Early Stopping

Early Stopping
- While the training error decreases over time, the validation error starts increasing again
- Thus: train for some time and return the parameters with the lowest validation error
Early Stopping vs. Parameter Penalties
[Figure: parameter-space trajectories comparing early stopping with an L2 regularizer on an unregularized objective]
Early Stopping:
- Most commonly used form of regularization in deep learning
- Effective, simple and computationally efficient form of regularization
- Training time can be viewed as a hyperparameter ⇒ model selection problem (see the sketch below)
- Efficient, as a single training run tests all values of this hyperparameter (unlike weight decay, which requires a separate run per penalty strength)
- Only cost: periodically evaluating the error on the validation set
- The validation set can be small, and the evaluation can be performed infrequently
Remark: If little training data is available, one can perform a second training phase in which the model is retrained from scratch on all training data for the number of training iterations determined by the early stopping procedure
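A minimal sketch of patience-based early stopping; train_epoch and validation_error are hypothetical stand-ins for your training loop and validation pass:

import copy

def train_with_early_stopping(model, train_epoch, validation_error,
                              patience=10, max_epochs=1000):
    best_err, best_model, best_epoch = float("inf"), None, 0
    for epoch in range(max_epochs):
        train_epoch(model)                     # one pass over the training set
        err = validation_error(model)          # periodic validation evaluation
        if err < best_err:                     # keep the parameters with the
            best_err, best_epoch = err, epoch  # lowest validation error so far
            best_model = copy.deepcopy(model)
        elif epoch - best_epoch >= patience:   # stop after `patience` epochs
            break                              # without improvement
    return best_model, best_epoch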
5.3 Ensemble Methods
Idea:
- Train several models separately for the same task
- At inference time: average their results
- Thus, often also called “model averaging”
Intuition:
- Different models make different errors on the test set
- By averaging we obtain a more robust estimate without needing a better individual model
- Works best if the models' errors are maximally uncorrelated
- Winning entries of challenges are often ensembles (e.g., the Netflix challenge)
- Drawback: requires evaluating multiple models at inference time
Ensemble Methods
Consider k models whose errors εᵢ have zero mean, variance E[εᵢ²] = v, and pairwise covariance E[εᵢεⱼ] = c. The expected squared error of the ensemble average is then

E[(1/k Σᵢ εᵢ)²] = v/k + (k − 1)c/k

Thus:
- If the errors are perfectly correlated (c = v), averaging does not help: the error stays v
- If the errors are perfectly uncorrelated (c = 0), the error shrinks to v/k
- Ensemble maximally effective if errors maximally uncorrelated (verified numerically below)
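A numeric check of this formula with made-up variance and covariance values:

import numpy as np

k, v, c, n = 10, 1.0, 0.3, 200_000
rng = np.random.default_rng(0)

# Sample correlated model errors with Var = v and pairwise Cov = c
cov = np.full((k, k), c) + np.eye(k) * (v - c)
errors = rng.multivariate_normal(np.zeros(k), cov, size=n)

ensemble_error = errors.mean(axis=1)   # error of the ensemble average
print(ensemble_error.var())            # empirical
print(v / k + (k - 1) * c / k)         # predicted: 0.37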
Bagging Example
Bagging (“bootstrap aggregating”) trains each ensemble member on a different dataset sampled with replacement from the original training set:
- First model learns to detect the top “loop”, second model detects the bottom “loop”
5.4 Dropout
Idea:
- During training, set neurons to zero with probability µ (typically µ = 0.5)
- Each binary mask is one model; the mask changes randomly with every training iteration
- This creates an ensemble “on the fly” from a single network with shared parameters

Srivastava, Hinton, Krizhevsky, Sutskever and Salakhutdinov: Dropout: a simple way to prevent neural networks from overfitting. JMLR, 2014.
Dropout
Let ŷ_z = f_w(x, z) denote the prediction under a binary mask z with one element per unit, drawn i.i.d. from a Bernoulli distribution p(zᵢ) = µ^{1−zᵢ}(1 − µ)^{zᵢ}, where zᵢ = 0 means neuron i is removed from the network.
- At inference time, we want to calculate the ensemble prediction

ŷ = f_w(x) = E_z[f_w(x, z)] = Σ_z p(z) f_w(x, z)

which is intractable to evaluate exactly, as it sums over exponentially many masks.

Srivastava, Hinton, Krizhevsky, Sutskever and Salakhutdinov: Dropout: a simple way to prevent neural networks from overfitting. JMLR, 2014.
Dropout
Let us consider a simple linear model:

f_w(x) = w₁x₁ + w₂x₂
f_w(x, z) = z₁w₁x₁ + z₂w₂x₂

Assuming µ = 0.5, during training we optimize the expectation over the ensemble:

E_z[f_w(x, z)] = ¼ (0 + 0) + ¼ (w₁x₁ + 0) + ¼ (0 + w₂x₂) + ¼ (w₁x₁ + w₂x₂)
              = ½ (w₁x₁ + w₂x₂) = ½ f_w(x)

Thus, for a linear model, scaling the weights by 1 − µ at inference time reproduces the ensemble expectation exactly.

Srivastava, Hinton, Krizhevsky, Sutskever and Salakhutdinov: Dropout: a simple way to prevent neural networks from overfitting. JMLR, 2014.
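A minimal sketch checking this on the two-unit linear model (weights and inputs are made up); the “inverted dropout” helper at the end is the variant commonly used in practice:

import itertools
import numpy as np

mu = 0.5                      # drop probability
w = np.array([0.3, -1.2])     # w1, w2 (made up)
x = np.array([2.0, 1.0])      # x1, x2 (made up)

def f(x, z):                  # masked model: z1*w1*x1 + z2*w2*x2
    return np.sum(z * w * x)

# Exact ensemble expectation: enumerate all 2^2 binary masks z
expectation = sum(
    np.prod(np.where(np.array(z) == 1, 1 - mu, mu)) * f(x, np.array(z))
    for z in itertools.product([0, 1], repeat=2)
)
# For a linear model this equals the weight-scaled prediction (1 - mu) * f_w(x)
print(expectation, (1 - mu) * np.sum(w * x))

# Inverted dropout scales by 1/(1 - mu) already during training,
# so no rescaling is needed at inference time:
def dropout_train(activations, mu, rng):
    mask = rng.random(activations.shape) >= mu   # keep with probability 1 - mu
    return activations * mask / (1.0 - mu)

print(dropout_train(np.ones(5), mu, np.random.default_rng(0)))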
5.5 Data Augmentation
Motivation:
- Deep neural networks must be invariant to a wide variety of input variations
- There is often large intra-class variation in terms of pose, appearance, lighting, etc.
Data Augmentation
- The best way towards better generalization is to train on more data
- However, in practice data is often limited
- Goal of data augmentation: create “fake” data from the existing data (on the fly) and add it to the training set
- The new data must preserve semantics
- Even simple operations like translation or adding per-pixel noise often already greatly improve generalization
- https://github.com/aleju/imgaug (see the pipeline sketch below)
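A minimal sketch composing several imgaug augmenters from the following slides into one random pipeline; the parameter ranges here are illustrative, not tuned:

import numpy as np
import imgaug.augmenters as iaa

seq = iaa.Sequential([
    iaa.Fliplr(0.5),                        # horizontal flip for 50% of images
    iaa.Crop(px=(0, 16)),                   # random crop of 0-16 pixels per side
    iaa.GaussianBlur(sigma=(0.0, 2.0)),     # random blur strength
    iaa.AddToHueAndSaturation((-50, 50)),   # random color shift
], random_order=True)                       # apply augmenters in random order

# Dummy batch of 8 random 64x64 RGB images (uint8)
images = np.random.randint(0, 255, (8, 64, 64, 3), dtype=np.uint8)
augmented = seq(images=images)              # fresh random augmentation per call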
Geometric Transformations
Data Augmentation: Geometry
- Image Cropping: iaa.Crop(px=(1, 64))
- Horizontal Flip: iaa.Fliplr(0.5)
- Affine Transformation: iaa.Affine()
- Piecewise Affine Transformation: iaa.PiecewiseAffine(scale=(0.01, 0.1))
- Perspective Transformation: iaa.PerspectiveTransform(scale=(0, 0.4))
Local Filters
Data Augmentation: Local Filters
- Gaussian Blur: iaa.GaussianBlur(sigma=(0.0, 10.0))
- Image Sharpening
- Emboss Effect
- Edge Detection: iaa.EdgeDetect(alpha=(0, 1.0))
Adding Noise
Data Augmentation: Noise
- Gaussian Noise
- Salt and Pepper Noise: iaa.SaltAndPepper(0.1)
- Dropout Noise: iaa.Dropout((0.01, 0.5))
- Cutout

Noise Augmentation:
- Noise can also be applied to the hidden units, not only to the input
- Prominent example: Dropout

Geirhos, Temme, Rauber, Schütt, Bethge and Wichmann: Generalisation in humans and deep neural networks. NeurIPS, 2018.
Color Transformations
Data Augmentation: Color
- Contrast
- Brightness
- Local Brightness
- Hue and Saturation: iaa.AddToHueAndSaturation((-50, 50))
- Color Inversion
- Grayscale: iaa.Grayscale(alpha=(0.0, 1.0))
Weather Effects

Data Augmentation: Weather
- Snow
- Clouds: iaa.Clouds()
- Fog: iaa.Fog()
Random Combinations
Data Augmentation: Random Combinations
[Figure: example images produced by random combinations of the augmentations above]
Output Transformations
Data Augmentation: Output Transformations
- For some classification tasks, e.g., handwritten letter recognition, be careful not to apply transformations that would change the output class
- Example: a horizontal flip turns the letter ‘d’ into ‘b’, and a 180° rotation turns it into ‘p’
- Remark: For general object recognition, flips and rotations can often be useful (a task-dependent pipeline is sketched below)
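A small sketch of a task-dependent pipeline along these lines; make_augmenter and the "letters" task name are hypothetical, and the parameter ranges are illustrative:

import imgaug.augmenters as iaa

def make_augmenter(task):
    ops = [iaa.GaussianBlur(sigma=(0.0, 1.0))]   # label-preserving for both tasks
    if task != "letters":                        # flips/rotations would turn
        ops.append(iaa.Fliplr(0.5))              # 'd' into 'b' or 'p'!
        ops.append(iaa.Affine(rotate=(-180, 180)))
    return iaa.Sequential(ops)

letter_aug = make_augmenter("letters")   # no class-changing transformations
object_aug = make_augmenter("objects")   # flips and rotations allowed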
Data Augmentation: Output Transformations
Cubuk, Zoph, Mané, Vasudevan, Le: AutoAugment: Learning Augmentation Strategies From Data. CVPR, 2019.