DL Unit 3
Regularization for Deep Learning: Parameter Norm Penalties, Norm Penalties as Constrained
Optimization, Regularization and Under-Constrained Problems, Dataset Augmentation, Noise
Robustness, Semi-Supervised Learning, Multi-Task Learning, Early Stopping, Parameter Tying
and Parameter Sharing, Sparse Representations, Bagging and Other Ensemble Methods,
Dropout, Adversarial Training, Tangent Distance, Tangent Prop and Manifold Tangent
Classifier.
Optimization for Training Deep Models: Pure Optimization, Challenges in Neural Network
Optimization, Basic Algorithms, Parameter Initialization Strategies, Algorithms with Adaptive
Learning Rates, Approximate Second-Order Methods, Optimization Strategies and Meta-
Algorithms.
https://studyglance.in/dl/display.php?tno=12&topic=Early-stopping
What is Regularization?
A central problem in machine learning is how to make an algorithm that will perform well not just on the training data, but also on new inputs.

Consider the classic underfitting-to-overfitting illustration: as we increase model complexity (moving towards the right in such a plot), the model tries to learn the details and the noise in the training data too well, which ultimately results in poor performance on unseen data. In other words, as the complexity of the model increases, the training error keeps reducing but the testing error does not.
If you’ve built a neural network before, you know how complex they are. This makes them more
prone to overfitting.
Regularization is a technique which makes slight modifications to the learning algorithm such
that the model generalizes better. This in turn improves the model’s performance on the unseen
data as well.
1. Parameter Norm Penalties

The most common parameter norm penalties are:
i. the L² penalty (weight decay), and
ii. the L¹ penalty, which encourages sparse weights.
The idea here is to limit the capacity (the space of all possible model families) of the model by adding a parameter norm penalty, Ω(θ), to the objective function J:

J̃(θ; X, y) = J(θ; X, y) + αΩ(θ)

Here, θ represents only the weights and not the biases, the reason being that the biases require much less data to fit than the weights, and regularizing them can introduce a significant amount of underfitting.

2. Norm Penalties as Constrained Optimization
We can construct a generalized Lagrangian function containing the objective function along with the penalties. Suppose we wanted Ω(θ) < k; then we could construct the following Lagrangian:

L(θ, α; X, y) = J(θ; X, y) + α(Ω(θ) − k)
We get optimal θ by solving the Lagrangian. If Ω(θ) > k, then the weights need to be compensated
highly and hence, α should be large to reduce its value below k. Likewise, if Ω(θ)<k, then the
norm shouldn’t be reduced too much and hence, α should be small. This is now similar to the
parameter norm penalty regularized objective function as both of them encourage lower values of
the norm. Thus, parameter norm penalties naturally impose a constraint, like the L²-regularization,
defining a constrained L²-ball. Larger α implies a smaller constrained region as it pushes the
values really low, hence, allowing a small radius and vice versa. The idea of constraints over
penalties is important for several reasons. Large penalties might cause non-convex optimization
algorithms to get stuck in local minima due to small values of θ, leading to the formation of so-
called dead cells, as the weights entering and leaving them are too small to have an impact.
Constraints don’t enforce the weights to be near zero, rather being confined to a constrained
region.
Another reason is that constraints induce higher stability. With higher learning rates, a large weight can produce a large gradient, which in turn makes the weight even larger, iteratively leading to numerical overflow in the value of θ. Constraints, along with reprojection (projecting θ back onto the corresponding ball after each update), prevent this feedback loop from blowing up.
A final suggestion made by Hinton was to restrict the individual column norms of the weight
matrix rather than the Frobenius norm of the entire weight matrix, so as to prevent any hidden unit
from having a large weight. The idea here is that if we restrict the Frobenius norm, it doesn’t
guarantee that the individual weights would be small, just their norm. So, we might have large
weights being compensated by extremely small weights to make the overall norm small.
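As a concrete illustration, here is a minimal NumPy sketch (an assumption of this write-up, not code from the original text) of the column-wise max-norm reprojection described above, where W holds one column of incoming weights per hidden unit:

import numpy as np

def project_columns_to_max_norm(W, max_norm=3.0):
    # Rescale any column whose L2 norm exceeds max_norm back onto the ball;
    # columns already satisfying the constraint are left unchanged.
    col_norms = np.linalg.norm(W, axis=0, keepdims=True)
    scale = np.minimum(1.0, max_norm / (col_norms + 1e-12))
    return W * scale

# typical usage: after every gradient step, W = project_columns_to_max_norm(W)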
4. Data augmentation
The best way to make a machine learning model generalize better is to train it on more data. Data augmentation is a way of creating fake data and adding it to the training set.
Data augmentation is a technique of artificially increasing the training set by creating modified
copies of a dataset using existing data. It includes making minor changes to the dataset or using
deep learning to generate new data points.
Augmented vs. Synthetic data
Augmented data is driven from original data with some minor changes. In the case of image
augmentation, we make geometric and color space transformations (flipping, resizing, cropping,
brightness, contrast) to increase the size and diversity of the training set.
Synthetic data is generated artificially without using the original dataset. It often uses DNNs (Deep
Neural Networks) and GANs (Generative Adversarial Networks) to generate synthetic data.
Note: the augmentation techniques are not limited to images. You can augment audio, video, text, and
other types of data too.
Dataset augmentation is very effective for the classification problem of object recognition. Images are high-dimensional and include a wide variety of factors of variation, many of which can be easily simulated. Translating the training images a few pixels in each direction can greatly improve performance. Many other operations, such as rotating or scaling the image, have also proven quite effective.
One must be careful not to apply transformations that would change the correct class. For example, optical character recognition tasks require recognizing the difference between 'b' and 'd' and the difference between '6' and '9', so horizontal flips and 180° rotations are not appropriate ways of augmenting datasets for these tasks.
Injecting noise into the input of a neural network can also be seen as a form of data augmentation. Many regression tasks can still be solved even if a small amount of random noise is added to the input. Neural networks, however, prove not to be very robust to noise; one way to improve their robustness is to train them with random noise applied to their inputs. Input noise injection is part of some unsupervised learning algorithms such as the denoising autoencoder. Noise can also be applied to the hidden units, which can be seen as doing dataset augmentation at multiple levels of abstraction.
In this section, we will look at audio, image, text, and advanced data augmentation techniques.

Audio Data Augmentation
1. Noise injection: add gaussian or random noise to the audio dataset to improve the
model performance.
2. Shifting: shift audio left (fast forward) or right with random seconds.
3. Changing the speed: stretches times series by a fixed rate.
4. Changing the pitch: randomly change the pitch of the audio.
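A minimal sketch of the first two techniques, assuming the clip is already loaded as a 1-D NumPy array signal sampled at sample_rate Hz (names and noise levels are illustrative assumptions):

import numpy as np

def add_gaussian_noise(signal, noise_level=0.005):
    # Inject zero-mean Gaussian noise scaled to the signal's peak amplitude.
    noise = np.random.randn(len(signal)) * noise_level * np.max(np.abs(signal))
    return signal + noise

def random_shift(signal, sample_rate, max_shift_seconds=0.5):
    # Shift the waveform left or right by a random number of samples,
    # zero-filling the region exposed by the shift.
    max_shift = int(max_shift_seconds * sample_rate)
    shift = np.random.randint(-max_shift, max_shift + 1)
    shifted = np.roll(signal, shift)
    if shift > 0:
        shifted[:shift] = 0
    elif shift < 0:
        shifted[shift:] = 0
    return shifted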
Image Data Augmentation
1. Geometric transformations: randomly flip, crop, rotate, stretch, and zoom images.
You need to be careful about applying multiple transformations on the same images,
as this can reduce model performance.
2. Color space transformations: randomly change RGB color channels, contrast, and
brightness.
3. Kernel filters: randomly change the sharpness or blurring of the image.
4. Random erasing: delete some part of the initial image.
5. Mixing images: blending and mixing multiple images.
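As an illustrative sketch, such a pipeline could be assembled with torchvision (the transform list and parameter values below are arbitrary choices, not prescribed settings):

import torchvision.transforms as T

augment = T.Compose([
    T.RandomHorizontalFlip(p=0.5),                 # geometric: random flip
    T.RandomResizedCrop(size=150),                 # geometric: crop + zoom
    T.RandomRotation(degrees=15),                  # geometric: small rotation
    T.ColorJitter(brightness=0.2, contrast=0.2),   # color space transform
    T.GaussianBlur(kernel_size=3),                 # kernel filter: blurring
    T.ToTensor(),
    T.RandomErasing(p=0.25),                       # random erasing (operates on tensors)
])
# augmented = augment(pil_image)  # applied afresh each time, e.g. every epoch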
Advanced Techniques
Data augmentation can be applied to any machine learning application where acquiring quality data is challenging, and it can help improve model robustness and performance across all fields of study.
Healthcare
Acquiring and labeling medical imaging datasets is time-consuming and expensive. You also
need a subject matter expert to validate the dataset before performing data analysis. Using
geometric and other transformations can help you train robust and accurate machine-learning
models.
For example, in the case of pneumonia classification, you can use random cropping, zooming, stretching, and color space transformation to improve the model performance. However, you need to be careful about certain augmentations, as they can have the opposite effect. For example, random rotation and reflection along the x-axis are not recommended for X-ray imaging datasets.
Self-Driving Cars
There is limited data available on self-driving cars, and companies are using simulated
environments to generate synthetic data using reinforcement learning. It can help you train and
test machine learning applications where data security is an issue.
Text Data Augmentation

Text data augmentation is generally used in situations with limited quality data, where improving the performance metric takes priority. You can apply synonym augmentation, word-embedding substitution, character swaps, and random insertion and deletion. These techniques are also valuable for low-resource languages.
(See: Selective Text Augmentation with Word Roles for Low-Resource Text Classification, Papers With Code.)
Researchers use text augmentation for the language models in high error recognition scenarios,
sequence-to-sequence data generation, and text classification.
In sound classification and speech recognition, data augmentation works wonders. It improves
the model performance even on low-resource languages.
The random noise injection, shifting, and changing the pitch can help you produce state-of-the-
art speech-to-text models. You can also use GANs to generate realistic sounds for a particular
application.
Having more data is the most desirable way to improve a machine learning model's performance. In many cases, it is relatively easy to artificially generate data. For a classification task, we want the model to be invariant to certain types of transformations, and we can generate the corresponding (x, y) pairs by translating the input x. But for certain problems, like density estimation, we can't apply this directly unless we have already solved the density estimation problem.
However, caution needs to be maintained while augmenting data to make sure that the class doesn't change. For example, if the labels contain both "b" and "d", then horizontal flipping would be a bad idea for data augmentation. Adding random noise to the inputs is another form of data augmentation, while adding noise to hidden units can be seen as doing data augmentation at multiple levels of abstraction.
Finally, when comparing machine learning models, we need to evaluate them using the same
hand-designed data augmentation schemes or else it might happen that algorithm A outperforms
algorithm B, just because it was trained on a dataset which had more / better data augmentation.
5. Noise Robustness
Noise applied to the inputs is a form of data augmentation. For some models, the addition of noise with extremely small variance at the input is equivalent to imposing a penalty on the norm of the weights.

Noise applied to the hidden units: noise injection can be much more powerful than simply shrinking the parameters. Noise applied to the hidden units is so important that dropout can be seen as the main development of this approach.

Adding noise to the weights: this technique is primarily used with recurrent neural networks (RNNs). It can be interpreted as a stochastic implementation of Bayesian inference over the weights. Bayesian learning considers model weights to be uncertain and representable via a probability distribution p(w) that reflects this uncertainty; adding noise to the weights is a practical, stochastic way to reflect it.
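A minimal sketch of weight-noise injection for one training step, assuming a PyTorch model; the noise scale sigma is an illustrative hyperparameter:

import torch

def training_step_with_weight_noise(model, loss_fn, x, y, optimizer, sigma=0.01):
    # Save the clean weights, then perturb every weight with Gaussian noise.
    saved = [p.detach().clone() for p in model.parameters()]
    with torch.no_grad():
        for p in model.parameters():
            p.add_(torch.randn_like(p) * sigma)
    loss = loss_fn(model(x), y)        # gradients reflect the perturbed weights
    optimizer.zero_grad()
    loss.backward()
    with torch.no_grad():              # restore clean weights before the update
        for p, s in zip(model.parameters(), saved):
            p.copy_(s)
    optimizer.step()
    return loss.item()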
6. Semi-Supervised Learning
In the paradigm of semi-supervised learning, both unlabeled examples from P
(x) and labeled examples from P (x, y) are used to estimate P (y | x) or predict
y from x.
The most basic disadvantage of any supervised learning algorithm is that the dataset has to be hand-labeled, either by a machine learning engineer or a data scientist. This is a very costly process, especially when dealing with large volumes of data. The most basic disadvantage of any unsupervised learning is that its application spectrum is limited.

To counter these disadvantages, the concept of semi-supervised learning was introduced. In this type of learning, the algorithm is trained upon a combination of labeled and unlabelled data. Typically, this combination will contain a very small amount of labeled data and a very large amount of unlabelled data.
The typical procedure is as follows (a code sketch of this loop follows the list):
1. First, the model is trained on the small amount of labeled data, just like a supervised learning model, until it gives reasonably accurate results.
2. The model is then used to assign pseudo-labels to the unlabeled dataset; at this stage the results may not be fully accurate.
3. The labels from the labeled training data and the pseudo-labels are linked together.
4. The inputs from the labeled training data and the unlabeled training data are likewise combined.
5. Finally, the model is trained again on the new combined data, as in the first step. This reduces errors and improves the accuracy of the model.
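A minimal self-training (pseudo-labeling) sketch using scikit-learn; the classifier, confidence threshold, and variable names are illustrative assumptions:

import numpy as np
from sklearn.linear_model import LogisticRegression

def self_train(X_labeled, y_labeled, X_unlabeled, threshold=0.95):
    # Step 1: train on the labeled data only.
    model = LogisticRegression(max_iter=1000).fit(X_labeled, y_labeled)
    # Step 2: pseudo-label the unlabeled data, keeping only confident predictions.
    proba = model.predict_proba(X_unlabeled)
    confident = proba.max(axis=1) >= threshold
    pseudo_y = model.classes_[proba.argmax(axis=1)][confident]
    # Steps 3-4: combine the real and pseudo-labeled examples.
    X_all = np.vstack([X_labeled, X_unlabeled[confident]])
    y_all = np.concatenate([y_labeled, pseudo_y])
    # Step 5: retrain on the combined dataset.
    return LogisticRegression(max_iter=1000).fit(X_all, y_all)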
Sharing parameters: instead of having separate unsupervised and supervised components in the model, one can construct models in which a generative model of either P(x) or P(x, y) shares parameters with a discriminative model of P(y | x).
7. Multitask Learning
Multi-Task Learning is a sub-field of Deep Learning that aims to solve multiple
different tasks at the same time, by taking advantage of the similarities
between different tasks. This can improve the learning efficiency and also act as
a regularizer which we will discuss in a while.
Formally, if there are n tasks (conventional deep learning approaches aim to
solve just 1 task using 1 particular model), where these n tasks or a subset of
them are related to each other but not exactly identical, Multi-Task Learning
(MTL) will help in improving the learning of a particular model by using the
knowledge contained in all the n tasks.
The different supervised tasks (predicting y(i) given x) share the same input x, as well as some intermediate-level representation h(shared) capturing a common pool of factors (a common input but different target random variables). Task-specific parameters h(1), h(2) can be learned on top of this shared representation h(shared). The common pool of factors explains the variations of the input x, while each task is associated with a subset of these factors. In the unsupervised learning context, some of the top-level factors, h(3), are associated with none of the output tasks: these factors explain some of the input variations but are not relevant for predicting y(1) or y(2).
The model can generally be divided into two kinds of parts and associated
parameters:
1. Task-specific parameters (which only benefit from the examples of their task to
achieve good generalization). These are the upper layers of the neural network.
2. Generic parameters, shared across all the tasks (which benefit from the pooled data
of all the tasks). These are the lower layers of the neural network.
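A minimal PyTorch sketch of this division into shared lower layers and task-specific heads (layer sizes and task count are illustrative assumptions):

import torch.nn as nn

class MultiTaskNet(nn.Module):
    def __init__(self, in_dim=64, hidden=128, n_classes_a=10, n_classes_b=5):
        super().__init__()
        # Generic parameters: lower layers shared across all tasks.
        self.shared = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # Task-specific parameters: one upper head per task.
        self.head_a = nn.Linear(hidden, n_classes_a)
        self.head_b = nn.Linear(hidden, n_classes_b)

    def forward(self, x):
        h_shared = self.shared(x)      # common representation h(shared)
        return self.head_a(h_shared), self.head_b(h_shared)

# The training loss is typically a weighted sum of the per-task losses.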
Benefits of multi-task learning: improved generalization and tighter generalization error bounds. These are achieved thanks to the shared parameters, for which statistical strength can be greatly improved in proportion to the increased number of examples for the shared parameters, compared to the scenario of single-task models. From the point of view of deep learning, the underlying prior belief is the following: among the factors that explain the variations observed in the data associated with the different tasks, some are shared across two or more tasks.
Multitask learning leads to better generalization when there is actually some relationship between
the tasks, which actually happens in the context of Deep Learning where some of the factors,
which explain the variation observed in the data, are shared across different tasks.
8. Early Stopping

When training a large model, the training error decreases steadily over time, but the validation set error eventually begins to rise again. Early stopping therefore monitors the validation error during training and returns the parameters from the point of lowest validation error, halting once the validation error has not improved for a pre-specified number of steps. Since the validation set is held out of training, there are two strategies for making use of that data once early stopping has determined the best number of training steps:
o Initialize the model again and train on all of the data. In this second training round, train for the same number of steps that early stopping predicted.
o Keep the parameters obtained from the first training round and continue training using all of the data. Monitor the average loss function on the validation set and continue training until it falls below the value of the training set objective at which the early stopping procedure halted. This prevents the high cost of re-training the model from scratch, but may not ever terminate if the objective on the validation set never reaches the target value.
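A minimal early-stopping loop sketch (assuming a PyTorch-style model with state_dict, and train_one_epoch / validation_loss as hypothetical helper functions):

import copy

def fit_with_early_stopping(model, patience=5, max_epochs=200):
    best_loss, best_state, best_epoch = float("inf"), None, 0
    for epoch in range(max_epochs):
        train_one_epoch(model)                  # hypothetical helper
        val_loss = validation_loss(model)       # hypothetical helper
        if val_loss < best_loss:
            best_loss, best_epoch = val_loss, epoch
            best_state = copy.deepcopy(model.state_dict())   # checkpoint best params
        elif epoch - best_epoch >= patience:
            break                               # no improvement for `patience` epochs
    model.load_state_dict(best_state)           # return to the best parameters
    return model, best_epoch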
9. Parameter Tying and Parameter Sharing

Till now, most of the methods focused on bringing the weights to a fixed point, e.g. 0 in the case of a norm penalty. However, there might be situations where we have some prior knowledge of the kind of dependencies the model should encode. Suppose two models, A and B, perform a classification task on similar input and output distributions. In such a case, we'd expect the parameters of both models to be similar to each other as well. We could impose a norm penalty on the distance between the weights (parameter tying), but a more popular method is to force the sets of parameters to be equal. This is the essence of parameter sharing. A major benefit here is that we need to store only a subset of the parameters (e.g. storing only the parameters for model A instead of both A and B), which leads to large memory savings. In the example of convolutional neural networks (CNNs), the same feature is computed across different regions of the image, and hence a cat is detected irrespective of whether it is at position i or i+1.
10. Sparse Representations

We can place penalties even on the activation values of the units, which indirectly imposes a penalty on the parameters. This leads to representational sparsity, where many of the activation values of the units are zero: h becomes a sparse representation of x. Another idea could be to average the activation values across various examples and push that average towards some target value. An example of obtaining representational sparsity by imposing a hard constraint on the activation values is the Orthogonal Matching Pursuit (OMP) algorithm, where a representation h is learned for the input x by solving the constrained optimization problem:

h* = arg minₕ ||x − Wh||²  subject to  ||h||₀ < k

where the constraint is on the number of non-zero entries of h, indicated by k. The problem can be solved efficiently when W is constrained to be orthogonal.
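As a hedged illustration, scikit-learn exposes this algorithm directly; here W is an assumed dictionary matrix of shape (n_dims, n_atoms) and x a signal of shape (n_dims,):

from sklearn.linear_model import OrthogonalMatchingPursuit

omp = OrthogonalMatchingPursuit(n_nonzero_coefs=5, fit_intercept=False)  # hard constraint k = 5
omp.fit(W, x)
h = omp.coef_   # sparse code with at most 5 non-zero entries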
11. Bagging and Other Ensemble Methods

The techniques which train multiple models and take the majority vote (or average) across those models for the final prediction are called ensemble methods. The idea is that it is highly unlikely that multiple models would make the same error on the same test example. Suppose that we have K regression models, with model #i making an error ϵi on each example, where the errors are drawn from a zero-mean multivariate normal distribution such that 𝔼(ϵi²) = v and 𝔼(ϵiϵj) = c. The error of the ensemble on each example is then the average across all the models: (∑i ϵi)/K. The mean of this average error is 0 (as the mean of each individual ϵi is 0), and its variance is:

𝔼[((∑i ϵi)/K)²] = v/K + ((K−1)/K)·c

Thus, if c = v, then there is no change (the variance stays at v). If c = 0, then the variance of the average error decreases linearly with K. There are various ensembling techniques. In the case of Bagging (Bootstrap Aggregating), the same training algorithm is used multiple times: the dataset is split into K parts by sampling with replacement, and a model is trained on each of those K parts. Because of sampling with replacement, the K parts have a few similarities as well as a few differences. These differences cause the differences in the predictions of the K models.
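A small NumPy simulation (with arbitrary illustrative values of v, c and K) confirming the variance formula v/K + ((K−1)/K)·c for the averaged error:

import numpy as np

v, c, K, n_trials = 1.0, 0.3, 10, 200000
# Covariance matrix with v on the diagonal and c everywhere else.
cov = np.full((K, K), c) + np.eye(K) * (v - c)
errors = np.random.multivariate_normal(np.zeros(K), cov, size=n_trials)
avg_error = errors.mean(axis=1)          # ensemble-averaged error per trial
print("empirical variance:", avg_error.var())
print("theoretical value :", v / K + (K - 1) * c / K)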
12. Dropout

Dropout provides a computationally inexpensive but powerful method of regularizing a broad family of models. The problem with bagging is that we can't train an exponentially large number of models and store them all for prediction. Dropout makes this feasible: in a simplistic view, dropout trains the ensemble of all sub-networks formed by randomly removing a few non-output units, by multiplying their outputs by 0. For every training sample, a mask is computed for all the input and hidden units independently. For clarification, suppose we have h hidden units in some layer; then a mask for that layer refers to an h-dimensional vector of Bernoulli random variables, each entry indicating whether the corresponding unit is kept (1) or dropped (0).
In bagging, the models are all independent of each other, whereas in dropout, the different models share parameters, with each model inheriting a different subset of the parent network's parameters.
In bagging, each model is trained to convergence, but in dropout, each sub-model is trained for just one step, and the parameter sharing ensures that subsequent updates lead to better predictions in the future.
At test time, we combine the predictions of all the models. In the case of bagging with K models, this was given by the arithmetic mean. In the case of dropout, the probability that a model is chosen is given by p(μ), with μ denoting the mask vector, and the prediction becomes ∑μ p(μ)·p(y|x, μ). This is not computationally feasible, but there is a better method to approximate it in a single pass, using the geometric mean instead of the arithmetic mean.
We need to take care of two main things when working with the geometric mean:
i. none of the sub-models should assign zero probability to any event, and
ii. the resulting distribution must be renormalized so that it sums to 1.
The advantage of dropout is that the first term can be approximated in one pass of the complete model by dividing the weight values by the keep probability (the weight scaling inference rule). The motivation behind this is to capture the right expected values from the output of each unit, i.e. the total expected input to a unit at train time should equal the total expected input at test time. A big advantage of dropout is that it doesn't place any restriction on the type of model or training procedure to use.
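A minimal NumPy sketch of dropout implemented as "inverted dropout", where the division by the keep probability is done at train time so that test-time inference needs no rescaling (keep_prob is an illustrative value):

import numpy as np

def dropout_forward(h, keep_prob=0.8, train=True):
    # Apply a Bernoulli mask to the layer's activations h.
    if not train:
        return h                   # scaling was already folded in at train time
    mask = (np.random.rand(*h.shape) < keep_prob)
    return h * mask / keep_prob    # rescale so E[output] matches test time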
Points to note:
Reduces the representational capacity of the model and hence, the model should be
large enough to begin with.
Equivalent to L² for linear regression, with different weight decay coefficient for each
input feature.
Biological Interpretation:

During sexual reproduction, genes are swapped between organisms, so a gene cannot rely on a fixed set of partner genes and must learn to do something useful in many different contexts. Analogously, the units trained with dropout learn to perform well regardless of the presence of other hidden units, and in many different contexts.
Adding noise in the hidden layers is more effective than adding noise in the input layer. For example, let's assume that some unit learns to detect a nose in a face recognition task. Now, if this unit is removed, then some other unit either learns to redundantly detect a nose or associates some other feature (like the mouth) with recognising a face. Either way, the model learns to make more use of the information in the input. On the other hand, adding noise to the input won't completely remove the nose information, unless the noise is so large as to remove most of the information from the input.
13. Adversarial Training

Deep learning has outperformed humans in the task of image recognition, which might lead us to believe that these models have acquired a human-level understanding of an image. However, experimentally searching for an x′ (given an x) such that the prediction made by the model changes shows otherwise: although the newly formed image (the adversarial example) looks almost exactly the same to a human, the model classifies it wrongly, and with very high confidence. Adversarial training refers to training on images which are adversarially generated, and it has been shown to reduce the error rate. The main factor attributed to the above-mentioned behaviour is the excessive linearity of the model (say y = Wx), caused by the main building blocks being primarily linear: a small change of ϵ in the input causes a drastic change of Wϵ in the output. The idea of adversarial training is to avoid this jumping and induce the model to be locally constant in the neighborhood of the training data.
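A hedged sketch of generating such an adversarial example with the fast gradient sign method (FGSM) in PyTorch; epsilon and the choice of loss are illustrative assumptions:

import torch
import torch.nn.functional as F

def fgsm_example(model, x, y, epsilon=0.01):
    # Perturb x in the direction that maximally increases the loss,
    # exploiting the local linearity of the model.
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    x_adv = x + epsilon * x.grad.sign()   # x' looks unchanged to a human
    return x_adv.detach()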
This can also be used in semi-supervised learning. For an unlabelled sample x, we can assign the
label ŷ (x) using our model. Then, we find an adversarial example, x′, such that y(x′)≠ŷ (x) (an
adversary found this way is called virtual adversarial example). The objective then is to assign
the same class to both x and x′. The idea behind this is that different classes are assumed to lie on
disconnected manifolds and a little push from one manifold shouldn’t land in any other manifold.
14. Tangent Distance, Tangent Prop and Manifold Tangent Classifier

Many ML models assume the data to lie on a low-dimensional manifold to overcome the curse of dimensionality. The inherent assumption which follows is that small perturbations that move a point along the manifold it originally belonged to shouldn't lead to different class predictions. The idea of the tangent distance algorithm is to perform k-nearest-neighbour classification using, as the distance metric, the distance between the manifolds the points lie on; a manifold Mi is approximated by its tangent plane at xi, which makes this distance tractable to compute.

The tangent prop algorithm trains a neural network classifier, f(x), to be invariant to known transformations that move the input along its manifold. Local invariance requires that ∇f(x) be perpendicular to the known manifold tangent vectors V(i) at x; this can be achieved by adding a penalty term that minimizes the directional derivative of f(x) along each of the V(i).
It is similar to data augmentation in that both of them use prior knowledge of the domain to
specify various transformations that the model should be invariant to. However, tangent prop only
resists infinitesimal perturbations while data augmentation causes invariance to much larger
perturbations.
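A hedged PyTorch sketch of the tangent prop penalty, assuming tangent_vectors holds known tangent directions V(i) with the same shape as x (the model's outputs are summed to a scalar for simplicity):

import torch

def tangent_prop_penalty(model, x, tangent_vectors):
    # Penalize the directional derivative of f along each tangent direction,
    # encouraging local invariance to movements along the manifold.
    x = x.clone().detach().requires_grad_(True)
    out = model(x).sum()
    grad = torch.autograd.grad(out, x, create_graph=True)[0]
    penalty = 0.0
    for v in tangent_vectors:
        penalty = penalty + ((grad * v).sum()) ** 2
    return penalty

# total_loss = task_loss + lam * tangent_prop_penalty(model, x, tangent_vectors)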
Optimization for Training Deep Models: Pure Optimization, Challenges in Neural Network
Optimization, Basic Algorithms, Parameter Initialization Strategies, Algorithms with
Adaptive Learning Rates, Approximate Second-Order Methods, Optimization Strategies
and Meta-Algorithms.
1. Pure Optimization

In deep learning, the performance of the model is estimated/evaluated with the help of a loss function, and this loss is used to train the network so that it performs better. Essentially, we try to minimize the loss function: a lower loss means the model performs better. The process of minimizing (or maximizing) any mathematical function is called optimization.

Optimizers are algorithms or methods used to change the attributes of the neural network, such as the weights and the learning rate, so that the loss is reduced.
2. Convex Optimization:

Convex optimization deals with the study of the problem of minimizing convex functions over convex sets. All linear functions are convex, so linear programming problems are convex problems. When we have a convex objective and a convex feasible region, there can be only one optimal solution, which is globally optimal. Convexity plays a vital role in the design of optimization algorithms, largely because it is much easier to analyze and test algorithms in such a context.

A convex region can be recognized as follows: select any two points in the region and join them by a straight line. If the line and the selected points all lie inside the region, then we call that region a convex region.
3. Non-Convex Optimization:

• All non-linear problems can be modelled using non-convex functions (linear functions are convex).
• Neural networks are universal function approximators; to achieve this, they need to be able to model arbitrary non-linear functions, which makes their training objective non-convex.
• Non-convex loss surfaces bring optimization challenges such as local minima, saddle points, and varying curvature.
Challenges in Neural Network Optimization

Deep neural networks are very powerful and can do amazing things, but training them can be difficult. Training deep neural networks (DNNs) has led to impressive advances in artificial intelligence, but it comes with hurdles like vanishing gradients, overfitting, and limited labeled data.

Vanishing and Exploding Gradients

Gradient magnitudes can shrink or grow too quickly as they propagate through many layers. This can make it hard for the network to learn and remain stable.
Solution: Gradient clipping, advanced weight initialization, and skip connections help keep gradients in a stable range so the network can learn accurately and consistently.
Overfitting
Overfitting happens when a model knows too much about the training data, so it can't make good
predictions about new data. As a result, the model performs well on the training data but
struggles to make accurate predictions on new, unseen data. It's essential to address overfitting
by employing techniques like regularization, cross-validation, and more diverse datasets to
ensure the model generalizes well to unseen examples.
Solution: Regularisation techniques help ensure our models do not simply memorize the training data but instead use what they've learned to make good predictions about new data. Techniques like dropout, L1/L2 regularisation, and early stopping can help us do this.
Data Augmentation and Preprocessing

Data augmentation and preprocessing are techniques used to provide better information to the model during training, enabling it to learn more effectively and make accurate predictions.
Solution: Apply data augmentation techniques like rotation, translation, and flipping alongside
data normalization and proper handling of missing values.
Label Noise
Training labels are sometimes incorrect or noisy, which makes it hard for models to learn well.
Solution: Noise-robust loss functions can help ensure that the model is not unduly affected by label mistakes.
Imbalanced Datasets
Datasets can have too many examples of one class and too few of another. This can cause models to perform poorly on the under-represented classes.
Solution: To fix class imbalance, use techniques like class weighting, oversampling, or data synthesis so that all classes are adequately represented.
Computational Resources

Training deep neural networks can be very difficult and take a lot of computing power, especially if the model is very big.
Solution: Using multiple machines, or special chips called GPUs and TPUs, can help make training faster and easier.
Hyperparameter Tuning
Deep neural networks have numerous hyperparameters that require careful tuning to achieve
optimal performance.
Solution: Utilize automated hyperparameter optimization methods, such as Bayesian optimization or genetic algorithms, to efficiently find the best hyperparameters.
Convergence Speed
It is important to ensure a model converges quickly, especially when training on large datasets with complicated architectures.
Solution: Adopt learning rate scheduling or adaptive algorithms like Adam or RMSprop to
expedite convergence.
Activation Function Selection

Using the proper activation function when building a machine-learning model is important. This helps ensure the model trains properly and yields correct results.
Solution: ReLU and its variants (Leaky ReLU, Parametric ReLU) are popular choices due to their ability to mitigate vanishing-gradient issues.
Difficult Loss Surfaces

Gradient descent algorithms can struggle when the optimization problem is very difficult.
Solution: Advanced techniques, such as stochastic gradient descent with momentum and Nesterov Accelerated Gradient, can help us navigate difficult loss surfaces better.
Memory Constraints
Training large models on large datasets requires a lot of memory, and training can fail if enough memory is not available.
Data Availability

Deep learning networks need lots of data to work well. If they don't get enough data, or the data distribution shifts, they won't work as well.
Exploring Architecture Design Space
Designing neural network architectures is difficult because there are many possible choices. Selecting the best architecture for a specific task can take considerable time and effort.
Solution: Use automated neural architecture search (NAS) algorithms to explore the design
space and discover architectures tailored to the task.
Adversarial Attacks
Deep neural networks can be fooled by tiny, imperceptible changes to their inputs, causing them to give confidently wrong answers.
Solution: Adversarial training (training on adversarially generated examples, as discussed earlier) can improve robustness to such attacks.

Interpretability

Understanding the decisions made by deep neural networks is crucial in critical applications like healthcare and autonomous driving.
Sequential Data

Training deep neural networks on sequential data, such as time series or natural language sequences, presents unique challenges.
Solution: Utilize specialized architectures like recurrent neural networks (RNNs) or transformers
to handle sequential data effectively.
Limited Data
Training deep neural networks with limited labeled data is a common challenge, especially in
specialized domains.
Solution: Consider semi-supervised, transfer, or active learning to make the most of available
data.
Catastrophic Forgetting
When a model forgets previously learned knowledge after training on new data, it encounters the
issue of catastrophic forgetting.
Solution: Implement techniques like elastic weight consolidation (EWC) or knowledge
distillation to retain old knowledge during continual learning.
Deployment on Limited Hardware

Running trained models on devices with little computing power can be hard.
Solution: Techniques such as model compression, pruning, and quantization help models run efficiently on resource-limited devices.
Privacy and Security

When training models on sensitive data for complex tasks, it is essential to keep the data private and ensure the systems are secure.
Training Time

Training deep neural networks is like assembling a challenging puzzle: it takes a lot of time, especially if the model is vast and has many pieces.
Solution: Special hardware like GPUs or TPUs can help us train models faster. We can also use several machines simultaneously to make training even quicker.
Model Size

Some models are too big and need a lot of storage, so they are hard to use on regular computers.

Learning Rate Scheduling

Setting an appropriate learning rate schedule can be challenging, and it strongly affects model convergence and performance.
Solution: Learning rate schedules (for example, step decay or warm-up followed by decay) can make training easier and faster.
Local Minima

Deep neural networks can get stuck in local minima during training, impacting the model's final performance.
Solution: Strategies like simulated annealing, momentum-based optimization, and evolutionary algorithms can help us escape such difficult spots.
Complex Loss Landscapes

Finding the best parameters can be very hard when there are many options, because the loss surface is complicated and bumpy.
3. Basic Algorithms

Optimizers are algorithms or methods used to change the attributes of your neural network, such as the weights and the learning rate, in order to reduce the losses.

Gradient Descent

Gradient Descent is the most basic but most used optimization algorithm. It's used heavily in linear regression and classification algorithms. Backpropagation in neural networks also uses a gradient descent algorithm. The weights are updated using the gradient over the whole dataset at once: θ = θ − α·∇J(θ).
Advantages:
1. Easy computation.
2. Easy to implement.
3. Easy to understand.
Disadvantages:
1. May get trapped at local minima.
2. Weights are changed only after calculating the gradient on the whole dataset. So, if the dataset is very large, it may take a long time to converge to the minima.
3. Requires large memory to calculate the gradient on the whole dataset.
Stochastic Gradient Descent (SGD)

It's a variant of gradient descent that updates the model's parameters more frequently: the parameters are altered after the computation of the loss on each training example. So, if the dataset contains 1000 rows, SGD will update the model parameters 1000 times in one pass over the dataset, instead of once as in gradient descent: θ = θ − α·∇J(θ; x(i), y(i)) for each training example (x(i), y(i)).
Advantages:
1. Frequent updates of model parameters, hence converges in less time.
2. Requires less memory, as there is no need to store values of the loss function for the whole dataset.
3. May find new minima.
Disadvantages:
1. High variance in the model parameters.
2. May overshoot even after achieving the global minimum.
3. To get the same convergence as gradient descent, the learning rate needs to be reduced slowly.
Mini-Batch Gradient Descent

It's the best among all the variations of gradient descent algorithms. It is an improvement on both SGD and standard gradient descent. It updates the model parameters after every batch: the dataset is divided into various batches, and after every batch the parameters are updated.

θ = θ − α·∇J(θ; B(i)), where {B(i)} are the batches of training examples.
Advantages:
1. Frequently updates the model parameters and also has less
variance.
2. Requires medium amount of memory.
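A compact NumPy sketch of one epoch of mini-batch gradient descent for linear least squares (learning rate, batch size, and the gradient formula for squared error are illustrative assumptions):

import numpy as np

def minibatch_gd_epoch(theta, X, y, lr=0.01, batch_size=32):
    # Shuffle once per epoch, then update after every mini-batch B(i).
    idx = np.random.permutation(len(X))
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        Xb, yb = X[batch], y[batch]
        grad = 2 * Xb.T @ (Xb @ theta - yb) / len(batch)   # ∇J on the batch
        theta = theta - lr * grad
    return theta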
Momentum

Momentum accelerates gradient descent by accumulating an exponentially decaying moving average of past gradients:

V(t) = γ·V(t−1) + α·∇J(θ),  θ = θ − V(t)

The momentum term γ is usually set to 0.9 or a similar value.
Advantages:
1. Reduces the oscillations and high variance of the parameters.
2. Converges faster than gradient descent.
Disadvantages:
1.One more hyper-parameter is added which needs to be selected
manually and accurately.
Nesterov Accelerated Gradient (NAG)

NAG is a variant of momentum that evaluates the gradient at the look-ahead position θ − γ·V(t−1) rather than at θ, correcting the update before it overshoots:

V(t) = γ·V(t−1) + α·∇J(θ − γ·V(t−1)),  θ = θ − V(t)

Advantages:
1. Does not miss the local minima.
2. Slows down as minima are approached.
Disadvantages:
1. Still, the hyperparameter needs to be selected manually.
Adagrad

One disadvantage of all the optimizers explained so far is that the learning rate is constant for all parameters and for each cycle. Adagrad changes the learning rate: it adapts the learning rate 'η' for each parameter individually, at every time step 't', based on the history of gradients for that parameter:

θ(t+1) = θ(t) − (η / √(G(t) + ϵ)) · g(t)

Here g(t) is the gradient at time step t, and G(t) stores the sum of the squares of the gradients w.r.t. θ(i) up to time step t, while ϵ is a smoothing term that avoids division by zero (usually on the order of 1e−8). Interestingly, without the square root operation, the algorithm performs much worse.
It makes big updates for less frequent parameters and a small step for
frequent parameters.
Advantages:
1. Learning rate changes for each training parameter.
2. Don’t need to manually tune the learning rate.
3. Able to train on sparse data.
Disadvantages:
1. Computationally expensive, since per-parameter squared-gradient accumulators must be maintained and updated at every step.
2. The learning rate is always decreasing, which results in slow training.
AdaDelta

It is an extension of AdaGrad which addresses its decaying learning rate problem. Instead of accumulating all previously squared gradients, AdaDelta restricts the window of accumulated past gradients to some fixed size w: an exponentially moving average is used rather than the sum of all the gradients.
E[g²](t) = γ·E[g²](t−1) + (1−γ)·g²(t)
Advantages:
1. Now the learning rate does not decay and the training does not
stop.
Disadvantages:
1. Computationally expensive.
Adam
Adam (Adaptive Moment Estimation) works with momentums of first
and second order. The intuition behind the Adam is that we don’t want
to roll so fast just because we can jump over the minimum, we want to
decrease the velocity a little bit for a careful search. In addition to
storing an exponentially decaying average of past squared gradients
like AdaDelta, Adam also keeps an exponentially decaying average of
past gradients M(t).
M(t) and V(t) are values of the first moment, which is the mean, and the second moment, which is the uncentered variance, of the gradients respectively. The moments are bias-corrected and then used to update the parameters:

M̂(t) = M(t) / (1 − β1^t),  V̂(t) = V(t) / (1 − β2^t)
θ(t+1) = θ(t) − α · M̂(t) / (√V̂(t) + ϵ)

Typical values are β1 = 0.9, β2 = 0.999, and ϵ = 10⁻⁸.
Advantages:
1. The method is fast and converges rapidly.
2. Rectifies the vanishing learning rate and high variance.
Disadvantages:
1. Computationally costly.
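A minimal NumPy sketch of the Adam update for a single parameter vector (hyperparameter values follow the defaults quoted above; t starts at 1, and m, v start as zero arrays):

import numpy as np

def adam_step(theta, grad, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # Exponentially decaying averages of the gradient and the squared gradient.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    # Bias correction counteracts the zero initialization of m and v.
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v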
4. Parameter Initialization Strategies

Some optimization algorithms are not iterative and simply solve for a solution point in one shot; training deep models, in contrast, is iterative, and the choice of the initial point matters. Four initialization strategies are compared below:
• Zero initialization.
• Random initialization.
• The Xavier recommendation.
• The Kaiming He recommendation.
To illustrate the above cases, we’ll use the cats vs dogs dataset which
consists of 50 images for cats and 50 images for dogs. Each image is
150 pixels x 150 pixels on RGB color scale. Therefore, we would have
67,500 features where each column in the input matrix would be one
image which means our input data would have 67,500 x 100
dimension.
With zero initialization, the cost curve stays flat: the neural network didn't learn anything! That is because of the symmetry between all neurons, which leads to all neurons having the same update on every iteration. Therefore, regardless of how many iterations we run the optimization algorithm, all the neurons would still get the same update and no learning would happen. As a result, we must break symmetry when initializing parameters so that the model starts learning on each update of gradient descent.
Random initialization helps here, but the loss function still has a high value and may take a long time to converge to a significantly low value.
The Xavier method is best applied when the activation function on the hidden layers is the hyperbolic tangent, so that the weights of each hidden layer l have the following variance: var(W^l) = 1/n^(l−1), where n^(l−1) is the number of units in the previous layer. We can achieve this by multiplying random values drawn from the standard normal distribution by √(1/n^(l−1)); the He recommendation, suited to ReLU activations, instead scales by √(2/n^(l−1)). A sketch of both follows.
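A minimal NumPy sketch of both recommendations for a weight matrix of shape (n_out, n_in), with scaling factors following the variances above:

import numpy as np

def xavier_init(n_out, n_in):
    # var(W) = 1/n_in, suited to tanh activations
    return np.random.randn(n_out, n_in) * np.sqrt(1.0 / n_in)

def he_init(n_out, n_in):
    # var(W) = 2/n_in, suited to ReLU activations
    return np.random.randn(n_out, n_in) * np.sqrt(2.0 / n_in)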
We'll train the network using both methods and look at the results.

# train NN where all parameters were initialized based on He recommendation
layers_dims = [X.shape[0], 5, 5, 1]
parameters = model(X, Y, layers_dims, hidden_layers_activation_fn="tanh", initialization_method="he")
accuracy(X, parameters, Y, "tanh")

The cost after 100 iterations is: 0.6300611704834093
The cost after 200 iterations is: 0.49092836452522753
The cost after 300 iterations is: 0.46579423512433943
The cost after 400 iterations is: 0.6516254192289226
The cost after 500 iterations is: 0.32487779301799485
The cost after 600 iterations is: 0.4631461605716059
The cost after 700 iterations is: 0.8050310690163623
The cost after 800 iterations is: 0.31739195517372376
The cost after 900 iterations is: 0.3094592175030812
The cost after 1000 iterations is: 0.19934509244449203
The accuracy rate is: 99.00%.
# train NN where all parameters were initialized based on Xavier recommendation
layers_dims = [X.shape[0], 5, 5, 1]
parameters = model(X, Y, layers_dims, hidden_layers_activation_fn="tanh", initialization_method="xavier")
accuracy(X, parameters, Y, "tanh")

The cost after 100 iterations is: 0.6351961521800779
The cost after 200 iterations is: 0.548973489787121
The cost after 300 iterations is: 0.47982386652748565
The cost after 400 iterations is: 0.32811768889968684
The cost after 500 iterations is: 0.2793453045790634
The cost after 600 iterations is: 0.3258507563809604
The cost after 700 iterations is: 0.2873032724176074
The cost after 800 iterations is: 0.0924974839405706
The cost after 900 iterations is: 0.07418011931058155
The cost after 1000 iterations is: 0.06204402572328295
The accuracy rate is: 99.00%.
As shown by applying the four methods, the parameters' initial values play a huge role in achieving low cost values as well as converging and achieving lower training error rates. The same would apply to the test error rate if we had test data.
5. Algorithms with Adaptive Learning Rates

There are methods which tune learning rates adaptively and work for a broad range of parameters.

Adagrad:

cache = cache + dx²
x = x − learning_rate · dx / (√cache + ϵ)

Here ϵ is the smoothing term (usually takes a value of around 1e−6) needed to avoid division by zero. One issue with Adagrad is that, in the case of deep learning, the monotonically decreasing learning rate usually proves to be too aggressive and leads to early learning stoppage.
RMSprop:

RMSprop keeps an exponentially decaying moving average of the squared gradients instead of a monotonically growing sum, which fixes Adagrad's aggressive decay:

cache = decay_rate · cache + (1 − decay_rate) · dx²
x = x − learning_rate · dx / (√cache + ϵ)
Adam:

The Adam update can be considered as RMSprop with momentum. In place of the raw and noisy gradient vector dx, a "smooth" version of the gradient, m, is used:

m = β1 · m + (1 − β1) · dx
v = β2 · v + (1 − β2) · dx²
x = x − learning_rate · m / (√v + ϵ)
6. Approximate Second-Order Methods
Fig. 1 shows the loss function (green) and the gradient (red) at the
current position. The left plot shows the case where the gradient
exactly matches the loss function locally. The right plot shows the case
where the loss function turns upwards when moving in the direction of
the negative gradient. While it might make sense to apply a large
update step in the left plot, a smaller update step is needed in the right
plot to avoid overshooting the minimum
Fig. 1: Loss function f(w) (green) and its gradient at w=-1 (red). (Image by author)
The Newton update step is Δw = −H⁻¹·g, where g is the gradient and H the Hessian matrix. This gives the Newton update with the negative gradient scaled and rotated by the inverse of the Hessian.
Hutchinson’s method
Create a random vector z by flipping a coin for each of its elements,
and set +1 for head and -1 for tail, so in the 2D case z could be (1, -1)
as an example
This looks strange at first sight. We already need H to get the diagonal
elements of H — this does not sound very clever. But it turns out that
we only need to know the result of H·z (a vector), which is easy to
compute, so we never need to know the actual elements of the full
Hessian matrix.
But how does Hutchinson’s method work? When writing it down for
the 2D case (Fig. 4) it is easy to see. Element-wise multiplication
means multiplying the vectors row-wise. Inspecting the result of
z⊙(H·z) shows terms with z₁² (and z₂²) and terms with z₁·z₂ in it.
When computing z⊙(H·z) for multiple trials, z₁² (and z₂²) is always
+1, while z₁·z₂ gives +1 in 50% of the trials and -1 for the other 50%
(simply write down all possible products: 1·1, 1·(-1), (-1)·1, (-1)·(-1)).
When computing the mean over multiple trials, the terms containing
z₁·z₂ tend to zero, and we are left with a vector of the diagonal
elements of H.
There is only one problem left: we don’t have the Hessian matrix H
which is used in Hutchinson’s method. However, as already mentioned
we don’t actually need the Hessian. It is enough if we have the result of
H·z. It is computed with the help of PyTorch’s automatic
differentiation functions: if we take the gradient already computed by PyTorch, multiply it by z, and differentiate w.r.t. the parameter vector w, we get the same result as if we computed H·z directly. So we can
compute H·z without knowing the elements of H. Fig. 5 shows why this
is true: differentiating the gradient one more time gives the Hessian
and z is treated as a constant.
Fig. 5: Equality of what we compute with PyTorch’s automatic differentiation (left) and the matrix vector
product H·z that we need for Hutchinson’s method (right). (Image by author)
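A hedged PyTorch sketch of Hutchinson's estimator for the Hessian diagonal of a scalar loss with respect to a parameter vector w (the number of trials is an illustrative choice):

import torch

def hessian_diagonal_estimate(loss, w, n_trials=100):
    # First derivative, kept in the graph so it can be differentiated again.
    (grad,) = torch.autograd.grad(loss, w, create_graph=True)
    est = torch.zeros_like(w)
    for _ in range(n_trials):
        z = (torch.randint(0, 2, w.shape) * 2 - 1).to(w.dtype)  # random +-1 vector
        # H·z computed by differentiating (grad · z); z is treated as a constant.
        (hz,) = torch.autograd.grad(grad, w, grad_outputs=z, retain_graph=True)
        est += z * hz        # z ⊙ (H·z); the cross terms average out to zero
    return est / n_trials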
We now have the gradient and the diagonal values of the Hessian, so
we could directly apply the Newton update step. However, AdaHessian
follows a similar approach as the Adam optimizer: it averages over
multiple time-steps to get a smoother estimate of the gradient and the
Hessian.
Fig. 6: 3 steps towards the minimum of a quadratic function. Left: gradient descent with an appropriate
learning rate. Right: Newton’s method using a diagonal Hessian approximation (right). (Image by author)
7. Optimization Strategies and Meta-Algorithms

Many optimization techniques are not exactly algorithms but rather general templates that can be specialized to yield algorithms, or subroutines that can be incorporated into many different algorithms.

i. Batch Normalization: Training a deep neural network with many layers is delicate, as the network can be sensitive to the initial random weights and the configuration of the learning algorithm. One potential reason for this difficulty is that the distribution of the inputs to layers deep in the network may change after each mini-batch when the weights are updated, making the learning algorithm forever chase a moving target. This change in the distribution of inputs to layers in the network is referred to by the technical name internal covariate shift.
The challenge is that the model is updated layer-by-layer, backwards from the output to the input, using an estimate of the error that assumes the weights in the layers preceding the current layer are fixed. Batch normalization gives an elegant way of reparametrizing practically any deep neural network; this reparameterization significantly reduces the problem of coordinating updates across numerous layers. A sketch of the normalization step follows.
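A minimal NumPy sketch of the batch-normalization transform for one layer's pre-activations at training time (gamma and beta are the learned scale and shift; eps is a small constant):

import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    # Normalize each feature over the mini-batch, then scale and shift.
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta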
ii. Coordinate Descent: Coordinate descent is an optimization algorithm that successively minimizes along coordinate directions to find the minimum of a function. At each iteration, a line search along the selected coordinate direction can be used to determine the appropriate step size. Coordinate descent is applicable in both differentiable and derivative-free contexts.
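A small NumPy sketch of coordinate descent on an illustrative convex quadratic, using a simple grid line search along each axis:

import numpy as np

def f(x):
    # Illustrative objective: f(x1, x2) = x1^2 + 2*x2^2 + x1*x2
    return x[0]**2 + 2 * x[1]**2 + x[0] * x[1]

def coordinate_descent(x, n_sweeps=20):
    for _ in range(n_sweeps):
        for i in range(len(x)):        # minimize along one coordinate at a time
            candidates = x[i] + np.linspace(-1.0, 1.0, 201)
            values = []
            for c in candidates:
                trial = x.copy()
                trial[i] = c
                values.append(f(trial))
            x[i] = candidates[int(np.argmin(values))]
    return x

print(coordinate_descent(np.array([3.0, -2.0])))   # converges towards (0, 0)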