
Unit 2.4


MIT Art Design and Technology University

MIT School of Computing, Pune


21BTCS031 – Deep Learning & Neural Networks

Class - L.Y. CORE (SEM-I)

Unit - II Deep Networks


Dr. Anant Kaulage
Dr. Sunita Parinam
Dr. Mayura Shelke
Dr. Aditya Pai
AY 2024-2025 SEM-I
Optimizers
Understanding Optimizers
• In deep learning we have the concept of loss, which tells us how poorly the model is performing at the current instant.
• Now we need to use this loss to train our network such that it performs
better.
• Essentially what we need to do is to take the loss and try to minimize it,
because a lower loss means our model is going to perform better.
• The process of minimizing (or maximizing) any mathematical expression
is called optimization.
• In a neural network, we have many weights in between each layer.
• We have to understand that each and every weight in the network will
affect the output of the network in some way, because they are all
directly or indirectly connected to the output.
Figure: Visualizing how altering particular weights affects different parts of the network.
Relationship of weights with loss
• Now that we understand how to change the output of the network by
changing the weights, let’s go ahead to understand how we can
minimize the loss.
• Changing the weights changes the output.
• Changing the output changes the loss, since loss is a function of the
predicted value (Y_pred), which is basically the output of the
network.
• Hence we can say that changing the weights will ultimately change
the loss.
• Changing can mean increasing or decreasing, but we need to decrease the loss.
• So now we need to see how to change the weights in such a way that
the loss decreases.
• This process is called optimization.
Mathematical perspective of changing weights
• Looking at it from a mathematical perspective, we can do this by using partial
derivatives.
• A partial derivative will allow us to understand how two mathematical
expressions affect each other.
• Let us take X and Y, which are connected by some arbitrary mathematical
relationship.
• If we find the partial derivative of Y with respect to X, we can understand how changing X will affect Y.
• If the partial derivative is positive, that means increasing X will also increase
Y.
• If it’s negative that means increasing X will decrease Y.
Figure: Partial derivatives for optimizing the loss.

At a particular instant, if the partial derivative of the loss with respect to a weight is positive, then we will decrease that weight in order to decrease the loss.
If the partial derivative is negative, then we will increase that weight in order to decrease the loss.
• This algorithm is called Gradient Descent.
• And this is the most basic method of optimizing neural networks.
• This happens as an iterative process, and hence we will update the value of each weight multiple times before the loss converges to a suitable value.

• The weights are updated only after the gradient is calculated over the whole dataset.
• If there is a huge amount of data, updating the weights takes more time and requires a large amount of RAM, which slows down the process and is computationally expensive.
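
In standard notation, with W a weight and L the loss, the update rule being described here is:

W_new = W_old − α · (∂L/∂W)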
• Here the alpha symbol is the learning rate.
• This will affect the speed of optimization of our
neural network.
• If we have a large learning rate, we will reach the minimum of the loss faster because we are taking big steps; however, because the steps are big we might overshoot and never settle into a very good minimum.
• A smaller learning rate will solve this issue, but it will take a lot of steps for the neural network's loss to decrease to a good value.
• Hence we need to keep the learning rate at an optimal value.
• Usually keeping alpha = 0.01 is a safe value.
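
To make the role of the learning rate concrete, here is a minimal Python sketch of gradient descent on a toy loss; the loss L(w) = (w − 3)², the starting point, and all names are illustrative assumptions, not from the slides:

def grad(w):
    # dL/dw for the toy loss L(w) = (w - 3)**2
    return 2.0 * (w - 3.0)

w = 10.0       # arbitrary initial weight
alpha = 0.01   # the "safe" learning rate mentioned above
for _ in range(2000):
    w = w - alpha * grad(w)   # W_new = W_old - alpha * dL/dW
print(w)       # approaches the minimum at w = 3

For this toy loss, any alpha above 1.0 makes each step overshoot by more than it corrects, so the loop diverges; this is the overshooting behaviour described above.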
• In some cases, problems like Vanishing Gradient or Exploding
Gradient may also occur due to incorrect parameter
initialization.
• These problems occur due to a very small or very large
gradient, which makes it difficult for the algorithm to converge.
Gradient Descent variants
• There are three variants of gradient descent based on the amount of
data used to calculate the gradient:
• Batch gradient descent
• Stochastic gradient descent
• Mini-batch gradient descent
Batch Gradient Descent
• Batch Gradient Descent (vanilla gradient descent) calculates the error for each observation in the dataset but performs an update only after all observations have been evaluated.
• Batch gradient descent is not often used, because it represents a huge
consumption of computational resources, as the entire dataset needs
to remain in memory.
Stochastic Gradient Descent

• This is another variant of the Gradient Descent optimizer, with the additional capability of working with data that poses a non-convex optimization problem.
• The problem with such data is that the cost function tends to come to rest at a local minimum, which is not suitable for your learning algorithm.
• Instead of taking the entire dataset at one time, in SGD we feed a single record at a time to the neural network and update the weights.
• Since SGD updates after every single record, there is no redundant computation; it is faster than batch gradient descent and less computationally expensive.
• Because SGD updates more frequently, the cost function will show severe oscillations, as we can see in the figure.
• The oscillations of SGD may let it jump to a better local minimum.
Mini-Batch Gradient Descent
• It is a combination of both batch gradient descent and stochastic gradient descent.
• Mini-batch gradient descent performs an update for a batch of
observations.
• It is the algorithm of choice for neural networks, and the batch sizes
are usually from 50 to 256.
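
The three variants differ only in how many records contribute to each gradient step. A minimal Python/NumPy sketch (the linear model, the mean-squared-error loss, and all names are illustrative assumptions):

import numpy as np

def gradient(w, X, y):
    # gradient of the mean squared error for a linear model y_pred = X @ w
    return 2.0 * X.T @ (X @ w - y) / len(y)

def train(X, y, batch_size, alpha=0.01, epochs=100):
    w = np.zeros(X.shape[1])
    n = len(y)
    for _ in range(epochs):
        idx = np.random.permutation(n)        # shuffle once per epoch
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            w -= alpha * gradient(w, X[batch], y[batch])
    return w

# batch_size = len(y)   -> batch gradient descent (one update per epoch)
# batch_size = 1        -> stochastic gradient descent (one update per record)
# batch_size = 50..256  -> mini-batch gradient descent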
Momentum

• Here, we are starting from the labelled green dot.
• Every subsequent green dot represents the loss and new weight value after a single update has occurred.
• Gradient descent will only proceed until the local minimum, since the partial derivative (gradient) near the local minimum is near zero.
• Hence, after reaching the local minimum it will stay near there and will not try to reach the global minimum.
• This is a rather simple graph; in reality the graph will be much more complicated, with many local minima present.
• Hence if we use just gradient descent we are not guaranteed to reach a good loss.
• We can combat this problem by using the concept of momentum.
• In momentum, what we are going to do is essentially try to capture
some information regarding the previous updates a weight has gone
through before performing the current update.
• Essentially, if a weight is constantly moving in a particular direction
(increasing or decreasing), it will slowly accumulate some
“momentum” in that direction.
• Hence when it faces some resistance and actually has to go the
opposite way, it will still continue going in the original direction for a
while because of the accumulated momentum.
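
In one standard formulation (with γ as the momentum coefficient, typically around 0.9; the slide's exact symbols are not reproduced, so this notation is an assumption):

V_t = γ · V_(t−1) + α · (∂L/∂W)
W_new = W_old − V_t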
Nesterov Accelerated Gradients (NAG)
• In NAG, instead of calculating the gradients at the current position, we try to calculate them from an approximate future position.
• This is because we want to try to calculate our gradients in a smarter
way.
• Just before reaching a minimum, the momentum will start reducing because we are using gradients from a future point.
• This results in improved stability and fewer oscillations while converging; furthermore, in practice it performs better than pure momentum.
How NAG helps in optimizing a weight in a
neural network.

• Now what we want to do is instead of calculating gradients with
respect to the current W value, we will calculate them with respect to
the future W value.
• This allows the momentum factor to start adapting to sharp gradient
changes before they even occur, leading to increased stability while
training.
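
In one standard formulation of NAG, with the same symbols as the momentum equations above, the gradient is evaluated at the look-ahead point W − γ · V:

V_t = γ · V_(t−1) + α · ∇L(W_old − γ · V_(t−1))
W_new = W_old − V_t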

• This is the new momentum equation with NAG. As you can see,
we are taking gradients from an approximate future value of W
instead of the current value of W.

Adaptive Optimization

Adagrad
• Adagrad is short for adaptive gradients. In this method we try to change the learning rate (alpha) for each update.
• The learning rate changes during each update in such a way that it will
decrease if a weight is being updated too much in a short amount of
time and it will increase if a weight is not being updated much.
• First, each weight has its own cache value, which collects the squares
of the gradients till the current point.
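
In standard notation, the per-weight cache update is:

cache = cache + (∂L/∂W)²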

• The cache will continue to increase in value as the training progresses. Now the new update formula is as follows:
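
In standard Adagrad notation, with ε a small constant (for example 1e-08) that prevents division by zero:

W_new = W_old − (α / (√cache + ε)) · (∂L/∂W)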
• Essentially what's happening here is that if a weight has been having very large updates, its cache value is also going to increase.
• As a result, the learning rate will become smaller and that weight's update magnitudes will decrease over time.
• On the other hand, if a weight has not been having any significant updates, its cache value is going to be very small, and hence its learning rate will increase, forcing it to take bigger updates.
• This is the basic principle of the Adagrad optimizer.
• However, the disadvantage of this algorithm is that regardless of a weight's past gradients, the cache will always increase by some amount, because squares cannot be negative.
• Hence the learning rate of every weight will eventually decrease to a value so low that training no longer happens significantly.

• The next adaptive optimizer, RMSProp, effectively solves this problem.


RMSProp

• In RMSProp, the only difference lies in the cache updating strategy. In the new formula, we introduce a new parameter, the decay rate (gamma):
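
In the standard RMSProp form, the cache update becomes:

cache = γ · cache + (1 − γ) · (∂L/∂W)²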

• Here the gamma value is usually around 0.9 or 0.99. Hence for each update, the squares of the gradients get added at a very slow rate compared to Adagrad.
• This ensures that the learning rate is changing constantly based on the way
the weight is being updated, just like adagrad, but at the same time the
learning rate does not decay too quickly, hence allowing training to
continue for much longer.
Adam
• Adam is a little like combining RMSProp with Momentum.
• First we calculate our m value, which will represent the momentum at
the current point.
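
In standard Adam notation, with β₁ (beta1) as the first decay rate:

m = β₁ · m + (1 − β₁) · (∂L/∂W)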

The only difference between this equation and the momentum equation is that, instead of the learning rate, the factor (1 − β₁) is multiplied with the current gradient.
• Next we will calculate the accumulated cache, which is exactly the
same as it is in RMSProp:
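
In standard notation, with β₂ (beta2) as the second decay rate:

cache = β₂ · cache + (1 − β₂) · (∂L/∂W)²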

• Now we can get the final update formula:
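
W_new = W_old − (α / (√cache + ε)) · m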


• As we can observe here, we are accumulating the gradients by calculating momentum, and we are also constantly adapting the learning rate by using the cache.
• Due to these two features, Adam usually performs better than any
other optimizer out there and is usually preferred while training
neural networks.
• In the paper for Adam, the recommended parameters are 0.9 for beta1, 0.999 for beta2, and 1e-08 for epsilon.
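
Putting the pieces together, here is a compact Python sketch of the Adam update described above; grad_fn and all other names are illustrative assumptions, and the bias-correction terms m_hat and cache_hat come from the Adam paper rather than the slides:

import numpy as np

def adam(w, grad_fn, alpha=0.01, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    m = np.zeros_like(w)        # momentum-style gradient accumulator
    cache = np.zeros_like(w)    # RMSProp-style squared-gradient accumulator
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1 - beta1) * g              # momentum term
        cache = beta2 * cache + (1 - beta2) * g**2   # adaptive term
        m_hat = m / (1 - beta1**t)                   # bias correction (from the paper)
        cache_hat = cache / (1 - beta2**t)
        w = w - alpha * m_hat / (np.sqrt(cache_hat) + eps)
    return w

# Example usage with the toy gradient from earlier:
# w_opt = adam(np.array([10.0]), lambda w: 2.0 * (w - 3.0))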
