Deep Neural Network Module 4 Regularization
AIML Module 5
Seetha Parameswaran
BITS Pilani
1
The author of this deck, Prof. Seetha Parameswaran,
is gratefully acknowledging the authors who made
their course materials freely available online.
2
Regularization Techniques
3
What we Learn….
4.1 Model Selection
4.2 Underfitting and Overfitting
4.3 L1 and L2 Regularization
4.4 Dropout
4.5 Challenges - Vanishing and Exploding Gradients, Covariate Shift
4.6 Parameter Initialization
4.7 Batch Normalization
4
Generalization in DNN
5
Generalization
● Goal is to discover patterns that generalize.
○ The goal is to discover patterns that capture regularities in the
underlying population from which our training set was drawn.
○ Models are trained on a sample of data.
○ When working with finite samples, we run the risk of discovering apparent
associations that turn out not to hold up when we collect more data or test on
newer samples.
● The trained model should predict well on new or unseen data. This challenge
is called generalization.
6
Training Error and Generalization Error
● Training error is the error of our model as calculated on the training
dataset.
○ Obtained while training the model.
● Generalization error is the expectation of our modelʼs error, if an
infinite stream of additional data examples drawn from the same
underlying data distribution as the original sample were applied to the
model.
○ Cannot be computed, but estimated.
○ Estimate the generalization error by applying the model to an independent test set,
constituted of a random selection of data examples that were withheld from the
training set.
7
Model Complexity
● Simple models and abundant data
○ Expect the generalization error to resemble the training error.
● More complex models and fewer examples
○ Expect the training error to go down but the generalization gap to grow.
● Model complexity
○ A model with more parameters might be considered more complex.
○ A model whose parameters can take a wider range of values might be more
complex.
○ A neural network model that takes more training iterations is more complex, and
one subject to early stopping (fewer training iterations) is less complex.
8
Factors that influence the generalizability of a model
1. The number of tunable parameters.
○ When the number of tunable parameters, called the degrees of freedom, is large,
models tend to be more susceptible to overfitting.
2. The values taken by the parameters.
○ When weights can take a wider range of values, models can be more susceptible
to overfitting.
3. The number of training examples.
○ It is trivially easy to overfit a dataset containing only one or two examples even if
your model is simple. But overfitting a dataset with millions of examples requires
an extremely flexible model.
9
Model Selection
● Model selection is the process of selecting the final model after
evaluating several candidate models.
● With MLPs, compare models with
○ different numbers of hidden layers,
○ different numbers of hidden units
○ different activation functions applied to each hidden layer.
● Use Validation dataset to determine the best among our candidate
models.
10
Validation dataset
● Never rely on the test data for model selection.
○ Risk of overfitting the test data.
● Do not rely solely on the training data for model selection
○ We cannot estimate the generalization error on the very data that we use to train
the model.
● Split the data three ways, incorporating a validation dataset (or
validation set) in addition to the training and test datasets.
● In deep learning, with millions of data examples available, the split is generally
(a minimal split sketch follows this list):
○ Training = 98-99 % of the original dataset
○ Validation = 1-2 % of training dataset
○ Testing = 1-2 % of the original dataset
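As a concrete illustration, a minimal three-way split sketch is shown below. It assumes
scikit-learn (not mentioned in the slides) and uses synthetic data; the 98/1/1 proportions
are only an example.

import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data standing in for a large labelled dataset (hypothetical shapes).
X = np.random.randn(100_000, 20)
y = np.random.randint(0, 2, size=100_000)

# Hold out 2% of the data, then split the holdout evenly into validation and test sets.
X_train, X_hold, y_train, y_hold = train_test_split(X, y, test_size=0.02, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_hold, y_hold, test_size=0.5, random_state=0)

print(len(X_train), len(X_val), len(X_test))   # 98000 1000 1000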
11
Just Right Model
● High Training accuracy
● High Validation accuracy
● Low Bias and Low Variance
● We usually care more about the
validation error than about the gap
between the training and validation
errors.
12
Underfitting
● Low Training accuracy and Low Validation accuracy.
● Training error and validation error are both substantial, but there is little gap
between them.
● The model is too simple (insufficiently expressive) to capture the pattern that we
are trying to model.
● If the generalization gap between the training and validation errors is small, a
more complex model may be better.
13
Overfitting
● The phenomenon of fitting the
training data more closely than we
fit the underlying distribution is
called overfitting.
● High Training accuracy and Low
Validation accuracy
● Training error is significantly lower
than the validation error.
● The techniques used to combat
overfitting are called
regularization.
14
Underfitting or Overfitting?
15
Polynomial degree and underfitting vs. overfitting
16
Model complexity and dataset size
● With more data, we can fit a more complex model.
● With more data, the generalization error typically decreases.
17
Deep Learning Model Selection
18
Regularization
19
Regularization Techniques
● Weight Decay (L2 regularization)
● Dropout
● Early Stopping
20
Weight Decay
21
L2 Regularization
● Measure the complexity of a linear function f(x) = w⊤x by some norm
of its weight vector, e.g., ∥w∥².
● Add the norm as a penalty term to the problem of minimizing the loss.
This will ensure that the weight vector is small.
● The objective function becomes minimizing the sum of the
prediction loss and the penalty term.
● L2-regularized linear models constitute the ridge regression algorithm.
22
L2 Regularization
● The trade-off between the standard loss and the additive penalty is controlled by
the regularization constant λ, a non-negative hyperparameter.
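Written out, a standard statement of the L2-penalized objective (this LaTeX form is added
here for clarity and follows the usual ridge-regression convention; λ = 0 recovers the
unregularized loss):

\[
\min_{\mathbf{w},\, b}\; L(\mathbf{w}, b) + \frac{\lambda}{2}\,\lVert \mathbf{w} \rVert^{2}
\]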
24
L2 Regularization vs. L1 Regularization
● L2 regularization: penalizes the sum of squares of the weights, ∥w∥².
● L1 regularization: penalizes the sum of absolute values of the weights, ∥w∥₁.
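A minimal sketch of the two penalty terms in code; PyTorch is assumed here (the slides do
not prescribe a framework), and lam stands for the regularization constant λ:

import torch

w = torch.randn(10, requires_grad=True)    # example weight vector

l2_penalty = (w ** 2).sum()                # sum of squared weights (ridge / weight decay)
l1_penalty = w.abs().sum()                 # sum of absolute weights (encourages sparsity)

# Added to the prediction loss, scaled by the regularization constant:
# total_loss = prediction_loss + lam * l2_penalty   (or lam * l1_penalty)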
25
Dropout
26
Smoothness
● Classical generalization theory suggests that to close the gap
between train and test performance, aim for a simple model.
● Simplicity can be achieved through
○ weight decay (small weight norms)
○ smoothness, i.e., the function should not be sensitive to small changes to its
inputs.
● Injecting noise enforces smoothness
○ training with input noise
○ injecting noise into each layer of the network before calculating the subsequent
layer during training.
27
Dropout
● Dropout involves injecting noise while computing each internal layer
during forward propagation.
● It has become a standard technique for training neural networks.
● The method is called dropout because we literally drop out some
neurons during training.
● Apply dropout to a hidden layer, zeroing out each hidden unit with
probability p.
● The calculation of the outputs no longer depends on the dropped-out
neurons, and their respective gradients also vanish when performing
backpropagation (in that iteration).
● Dropout is disabled at test time.
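A from-scratch sketch of (inverted) dropout applied to one layer's activations, assuming
NumPy; p is the drop probability, and surviving activations are rescaled by 1/(1 − p) so
their expected value is unchanged:

import numpy as np

def dropout_layer(h, p):
    # Zero each activation with probability p (training time only).
    assert 0.0 <= p <= 1.0
    if p == 1.0:
        return np.zeros_like(h)
    if p == 0.0:
        return h
    mask = (np.random.rand(*h.shape) > p).astype(h.dtype)
    return mask * h / (1.0 - p)   # inverted dropout: rescale to keep the expectation

# At test time dropout is disabled and the activations are used as-is.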
28
Without Dropout
29
Dropout for first iteration
30
Dropout for second iteration
31
Dropout
● Dropout effectively trains a smaller (thinned) neural network at each iteration,
giving the effect of regularization.
● In general (an illustrative stack follows this list),
○ Vary the keep probability (0.5 to 0.8) for each hidden layer.
○ The input layer has a keep probability of 1.0 or 0.9.
○ The output layer has a keep probability of 1.0.
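As an illustration of these guidelines, a hypothetical PyTorch stack (layer sizes and
rates are examples only; nn.Dropout takes the drop probability, i.e., 1 − keep probability):

import torch.nn as nn

net = nn.Sequential(
    nn.Dropout(p=0.1),                     # input: keep probability ~0.9
    nn.Linear(784, 256), nn.ReLU(),
    nn.Dropout(p=0.5),                     # hidden layer: keep probability 0.5
    nn.Linear(256, 128), nn.ReLU(),
    nn.Dropout(p=0.2),                     # hidden layer: keep probability 0.8
    nn.Linear(128, 10),                    # output: no dropout (keep probability 1.0)
)
# net.train() enables dropout; net.eval() disables it at test time.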
32
Early Stopping
33
Before Early stopping
● When training large models, training error decreases steadily over
time, but validation set error begins to rise again.
● Training objective decreases consistently over time.
● Validation set average loss begins to increase again, forming an
asymmetric U-shaped curve.
34
Early stopping
● While training, we are no longer looking for a local minimum of the validation
error.
● Instead, train until the validation set error has not improved for some amount
of time.
● Every time the error on the validation set improves, store a copy of the
model parameters. When the training algorithm terminates, return
these parameters.
35
Early stopping
● An effective and simple form of regularization.
● It effectively trains simpler models.
36
Early Stopping code
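A minimal sketch of early stopping with a patience counter, assuming a PyTorch-style model
(state_dict / load_state_dict) and hypothetical train_one_epoch and evaluate helpers:

import copy

def fit_with_early_stopping(model, train_one_epoch, evaluate, max_epochs=100, patience=5):
    # Stop when the validation loss has not improved for `patience` consecutive epochs.
    best_loss = float("inf")
    best_params = copy.deepcopy(model.state_dict())
    epochs_without_improvement = 0

    for epoch in range(max_epochs):
        train_one_epoch(model)
        val_loss = evaluate(model)          # average loss on the validation set

        if val_loss < best_loss:            # improvement: store a copy of the parameters
            best_loss = val_loss
            best_params = copy.deepcopy(model.state_dict())
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break                       # stop early

    model.load_state_dict(best_params)      # restore the best parameters seen
    return model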
37
Numerical Stability and Initialization
38
Why Initialization is important?
● The choice of initialization is crucial for maintaining numerical stability.
● The choices of initialization can be tied up in interesting ways with the
choice of the nonlinear activation function.
● Which function we choose and how we initialize parameters can
determine how quickly our optimization algorithm converges.
● Poor choices can cause us to encounter exploding or vanishing gradients
while training.
39
Vanishing and Exploding Gradients
● Consider a deep network with L layers, input x and output o. Each
layer l is defined by a transformation fl parameterized by weights
W(l), whose hidden variable is h(l), so that h(l) = fl(h(l−1)) and
o = fL ∘ … ∘ f1(x).
● If all the hidden variables and the input are vectors, then the gradient
of o with respect to any set of parameters W(l) is a product of Jacobian
matrices:
∂W(l) o = ∂h(L−1) h(L) · … · ∂h(l) h(l+1) · ∂W(l) h(l)
● This product of many matrices is prone to numerical underflow (vanishing
gradients) or overflow (exploding gradients) as the depth L grows; a small
numerical illustration follows below.
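A small NumPy illustration (hypothetical 4-dimensional layers, 100 of them): multiplying
many random Jacobian-like factors drives the product's norm toward zero or toward overflow,
depending on the scale of the factors.

import numpy as np

rng = np.random.default_rng(0)
prod_small = np.eye(4)
prod_large = np.eye(4)

for _ in range(100):                                               # 100 layers deep
    prod_small = prod_small @ (0.1 * rng.standard_normal((4, 4)))  # small per-layer factors
    prod_large = prod_large @ rng.standard_normal((4, 4))          # larger per-layer factors

print(np.linalg.norm(prod_small))   # tiny: gradients vanish
print(np.linalg.norm(prod_large))   # huge: gradients explode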
41
Vanishing Gradients
● Activation function sigmoid σ can cause the vanishing gradient
problem.
○ The sigmoidʼs gradient vanishes both when its inputs are large and when they are
small.
○ When backpropagating through many layers, where the inputs to many of the
sigmoids are close to zero, the gradients of the overall product may vanish.
● Solution: Use ReLU for hidden layers. ReLU is more stable.
42
Parameter Initialization
1. Default Initialization
○ Uses a normal distribution to initialize the values of the parameters.
2. Xavier Initialization
○ Samples weights from a Gaussian distribution with zero mean and variance
σ² = 2 / (nin + nout), where nin and nout are the numbers of inputs and outputs
of the layer (a sketch follows below).
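A sketch in PyTorch (the framework and layer sizes are assumptions, not prescribed by
the slides); nn.init provides both schemes:

import torch.nn as nn

layer = nn.Linear(256, 128)                      # illustrative layer sizes

# "Default"-style initialization: small zero-mean Gaussian weights.
nn.init.normal_(layer.weight, mean=0.0, std=0.01)

# Xavier (Glorot) initialization: zero mean, variance 2 / (fan_in + fan_out).
nn.init.xavier_normal_(layer.weight)
nn.init.zeros_(layer.bias)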
43
Batch Normalization
44
Why Batch Normalization?
1. Standardize the input features to each have a mean of zero and
variance of one. This standardization puts the parameters a priori at a
similar scale. Better optimization.
2. In an MLP or CNN, as we train, the variables in intermediate layers may
take values with widely varying magnitudes: along the layers from the
input to the output, across units in the same layer, and over time due to
our updates to the model parameters. This drift in the distribution of
such variables could hamper the convergence of the network.
3. Deeper networks are complex and easily capable of overfitting. This
means that regularization becomes more critical.
45
Batch Normalization
● Batch normalization is a popular and effective technique that
consistently accelerates the convergence of deep networks.
● Batch normalization is applied to individual layers.
● It works as follows:
○ In each training iteration, first normalize the inputs (of batch normalization) by
subtracting their mean and dividing by their standard deviation, where both
are estimated based on the statistics of the current minibatch.
○ Next, apply a scale coefficient and a scale offset.
● It is from this normalization based on batch statistics that batch
normalization derives its name.
● Batch normalization works best for moderate minibatch sizes in the
50 to 100 range.
46
Batch Normalization
● Denote by x ∈ B an input to batch normalization (BN) that is from a
minibatch B. Batch normalization transforms x as
BN(x) = γ ⊙ (x − μ̂B) / σ̂B + β
● μ̂B is the sample mean and σ̂B is the sample standard deviation of the
minibatch B.
● After applying standardization, the resulting minibatch has zero mean
and unit variance; γ and β then rescale and shift it (see the next slide).
47
Batch Normalization
● The elementwise scale parameter γ and shift parameter β have the
same shape as x; γ and β are learned jointly with the other model
parameters.
● Batch normalization actively centers and rescales the inputs to each
layer back to a given mean and size.
● The minibatch statistics are calculated as
μ̂B = (1/|B|) Σx∈B x and σ̂²B = (1/|B|) Σx∈B (x − μ̂B)² + ε,
where the small constant ε > 0 avoids division by zero (a from-scratch
sketch follows below).
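A from-scratch sketch of the training-time computation for a fully-connected layer's
minibatch (NumPy; X has shape (batch_size, features), gamma and beta are the learned
scale and shift, and eps keeps the denominator away from zero):

import numpy as np

def batch_norm_train(X, gamma, beta, eps=1e-5):
    mu = X.mean(axis=0)                       # per-feature sample mean of the minibatch
    var = X.var(axis=0)                       # per-feature sample variance of the minibatch
    X_hat = (X - mu) / np.sqrt(var + eps)     # standardize: zero mean, unit variance
    return gamma * X_hat + beta               # learned scale and shift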
49
Batch Normalization Layers
● Batch normalization implementations for fully-connected layers and
convolutional layers are slightly different.
○ Fully-Connected Layers
■ Insert batch normalization after the affine transformation and before the
nonlinear activation function.
○ Convolutional Layers
■ Apply batch normalization after the convolution and before the nonlinear
activation function.
■ Carry out each batch normalization over the m · p · q elements per output
channel simultaneously.
● It operates on a full minibatch at a time (an illustrative placement is sketched
below).
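An illustrative PyTorch placement consistent with these rules (layer sizes and channel
counts are arbitrary examples):

import torch.nn as nn

fc_block = nn.Sequential(
    nn.Linear(256, 128),                          # affine transformation
    nn.BatchNorm1d(128),                          # batch norm before the nonlinearity
    nn.ReLU(),
)

conv_block = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),   # convolution
    nn.BatchNorm2d(16),                           # normalized per output channel
    nn.ReLU(),
)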
50
Batch Normalization During Prediction
● After training, use the entire dataset to compute stable estimates of
the variable statistics and then fix them at prediction time.
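In frameworks this is typically handled with running (moving-average) estimates collected
during training rather than a separate full-dataset pass; a PyTorch sketch of switching
between the two modes (the network below is only an example):

import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(20, 10), nn.BatchNorm1d(10), nn.ReLU())

net.train()                       # training mode: minibatch statistics, running averages updated
_ = net(torch.randn(32, 20))

net.eval()                        # prediction mode: fixed running estimates are used
_ = net(torch.randn(1, 20))       # works even for a single example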
51
Ref TB Dive into Deep Learning
● Sections 5.4, 5.5, 5.6 and 8.5 (online
version)
52
Next Session:
CNN
53