Optimization Techniques in Deep Learning
1. Gradient Descent
2. Stochastic Gradient Descent (SGD)
3. Mini-Batch Stochastic Gradient Descent (MB-SGD)
4. SGD with Momentum
5. Nesterov Accelerated Gradient (NAG)
6. Adaptive Gradient (AdaGrad)
7. AdaDelta
8. RMSProp
9. Adam
10. Nadam
Gradient Descent is an iterative algorithm that starts from a random point on the
function and traverses down its slope in steps until it reaches the lowest point of that
function. This algorithm is apt for cases where the optimal points cannot be found by
equating the slope of the function to 0. In a deep learning algorithm, the weights must be
altered for the loss function to reach its minimum value. With the help of backpropagation,
the loss is transferred from one layer to another, and the weight parameters are modified
depending on the loss so that the loss can be minimized.
The idea of this method is to update the variables iteratively in the direction opposite
to the gradient of the objective function.
With every update, the method guides the model toward the target and gradually
converges to the optimal value of the objective function.
The term gradient descent refers to the changes to the model that move it along a
slope or gradient in a graph toward the lowest possible error value. Each time the
algorithm is run, it moves step-by-step in the direction of the steepest descent,
defined by the negative of the gradient.
The size of the steps is known as the learning rate. A higher learning rate covers more
ground with each step but may be less precise and can overshoot the minimum. A low
learning rate is more precise but time-consuming to run on large datasets.
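To make the update rule concrete, here is a minimal sketch of plain gradient descent on a simple one-dimensional function; the quadratic f(w) = (w − 3)², the starting point, the learning rates, and the step count are illustrative assumptions rather than values from the text.

def f_grad(w):
    # derivative of f(w) = (w - 3)^2 with respect to w
    return 2.0 * (w - 3.0)

def gradient_descent(w0=0.0, learning_rate=0.1, steps=50):
    w = w0
    for _ in range(steps):
        w = w - learning_rate * f_grad(w)   # step along the negative gradient
    return w

print(gradient_descent())                   # converges near the minimum at w = 3
print(gradient_descent(learning_rate=1.1))  # too large a step size: the iterates diverge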
Gradient descent is most appropriately used when the optimal parameters cannot be
computed exactly through a closed-form calculation and the target must be searched for
by an optimization algorithm. In such cases, gradient descent can also be much cheaper and
faster at finding a solution.
The adaptive methods covered later (AdaGrad and RMSProp) refine this basic update by scaling the learning rate with past squared gradients:

w_t = w_{t−1} − η_t · (∂h/∂w_{t−1})

AdaGrad: η_t = η / √(α_t + ε), where α_t = Σ_{i=1..t} (∂h/∂w_i)²

RMSProp: η_t = η / √(W_avg,t + ε), where W_avg,t = γ · W_avg,t−1 + (1 − γ) · (∂h/∂w_t)²
Optimization algorithms are responsible for reducing the loss and providing the most accurate
results possible. The weights are initialized using some initialization strategy and are updated
with each epoch according to an update equation of the form w_new = w_old − η · (∂L/∂w_old).
The best results are achieved using optimization strategies or algorithms called optimizers.
Gradient Descent:
There are two main types of gradient descent configurations: batch and stochastic. Each type
has its pros and cons, and the data scientist must understand the differences to be able to select
the best approach for the problem at hand.
In a batch gradient descent, the algorithm calculates the error for each example in the training
dataset, but the model is only updated after all the training samples have been run through the
algorithm. The model is updated in a group or batch.
Each cycle through the complete training dataset is called a training epoch. In the batch gradient
descent, the model is updated at the conclusion of each training epoch.
Each batch is used to evaluate how closely the machine learning model's estimates fit the
target function on the training dataset. The batch approach is computationally more
efficient than the stochastic method.
The lower update frequency means the error gradient is more stable and may offer more stable
convergence on some problems. It is useful in parallel processing implementations because the
calculation of prediction errors and the model updates occur separately. However, with large
datasets, running the entire batch can be slow. Also, the stable error gradient may lead to
premature convergence on a less-than-optimal set of parameters.
There is a variation called mini-batch gradient descent that divides the training dataset into
smaller batches, often between 10 and 1,000 examples selected at random. It is a compromise
between the efficiency of the full-batch version and the robustness of the stochastic approach. It
is faster because smaller batches are run and not all training data has to be loaded into
memory at once. However, error gradients must still be accumulated across the examples in each
mini-batch, as in the batch approach.
Mini-batch gradient descent is the most common form used for machine learning.
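A minimal sketch of mini-batch gradient descent for a small linear-regression problem follows; the synthetic data, batch size, learning rate, and epoch count are illustrative assumptions. Setting batch_size = len(X) reproduces full-batch gradient descent, and batch_size = 1 reproduces the stochastic version discussed next.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))                 # 1,000 examples, 3 features
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=1000)   # noisy targets

def minibatch_gd(X, y, batch_size=32, learning_rate=0.1, epochs=20):
    w = np.zeros(X.shape[1])
    n = len(X)
    for _ in range(epochs):                    # one epoch = one pass over the data
        order = rng.permutation(n)             # shuffle, then take batches at random
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            grad = 2.0 * X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)  # MSE gradient
            w -= learning_rate * grad          # one update per batch
    return w

print(minibatch_gd(X, y))                      # approaches [2.0, -1.0, 0.5]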
Stochastic Gradient Descent:
In comparison to the batch approach, the stochastic version calculates error and updates the
model for a single random example in the training dataset. The stochastic gradient descent is also
called the online machine learning algorithm. Each iteration of the gradient descent uses a single
sample and requires a prediction for each iteration. Stochastic gradient descent is often used
when there is a lot of data.
Stochastic gradient descent is more computationally intensive because the error is calculated and
the model is updated after each instance. Stochastic gradient descent can lead to faster learning
for some problems due to the increase in update frequency. The frequent updates also give faster
insights into the model’s performance and rate of improvement.
Due to the granularity of updating the model at each step, the model can deliver a more
accurate result before reaching convergence. However, despite all the benefits, the process can
be affected by a noisy update procedure that makes it hard for the algorithm to settle at the
minimum error for the model.
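For contrast with the mini-batch sketch above, here is a minimal sketch of the per-example (stochastic) update, i.e. the batch_size = 1 special case; X, y, and w follow the same linear-regression setup, and the learning rate is an illustrative assumption.

import numpy as np

def sgd_epoch(X, y, w, learning_rate=0.01):
    for i in np.random.permutation(len(X)):       # visit examples in random order
        grad = 2.0 * X[i] * (X[i] @ w - y[i])     # gradient from a single example
        w = w - learning_rate * grad              # update immediately: fast but noisy
    return w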
Gradient Descent uses the whole training dataset to update the weights and biases. If we
have millions of records, training becomes slow and computationally very expensive.
SGD solves this problem by using only a single record to update the parameters. But SGD is
still slow to converge because it needs a forward and backward propagation for every record,
and the path to reach the global minimum becomes very noisy.
Mini-batch GD overcomes the SGD drawbacks by using a batch of records to update the
parameters. Since it doesn't use the entire dataset for each update, its path to the global
minimum is not as smooth as that of Gradient Descent, but it is far less noisy than SGD's.
Momentum-based Optimizer:
It generally works better than plain Stochastic Gradient Descent.
The problem with SGD is that, because of its high oscillation while it tries to reach the
minimum, we can't increase the learning rate, so it takes time to converge. In this algorithm,
we use Exponentially Weighted Averages to compute the gradient and use this averaged
gradient to update the parameters.
In SGD with momentum, we add a momentum term to the gradient update: the present
update depends on the previous updates, and so on. This accelerates SGD, so it converges
faster and oscillates less.
V(t) = γ · V(t−1) + η · ∇J(θ)
θ = θ − V(t)
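A minimal sketch of SGD with momentum follows; grad_fn stands in for whatever routine returns the (mini-batch) gradient of the loss at the current weights, and the hyperparameter values are illustrative assumptions.

import numpy as np

def sgd_momentum(grad_fn, w0, learning_rate=0.01, gamma=0.9, steps=100):
    w = np.asarray(w0, dtype=float)
    v = np.zeros_like(w)                             # V(t): running "velocity"
    for _ in range(steps):
        v = gamma * v + learning_rate * grad_fn(w)   # V(t) = γ·V(t-1) + η·∇J(θ)
        w = w - v                                    # θ = θ - V(t)
    return w

# Example: minimise f(w) = ||w||², whose gradient is 2w.
print(sgd_momentum(lambda w: 2.0 * w, w0=[5.0, -3.0]))   # moves toward [0, 0]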
Adagrad (Adaptive Gradient Algorithm):
In all the optimizers we have seen so far, up to SGD with momentum, the learning rate remains
constant. In the Adagrad optimizer there is no momentum concept, so it is much simpler than
SGD with momentum.
The idea behind Adagrad is to use a different learning rate for each parameter, based on the
iteration. The reason different learning rates are needed is that the learning rate for parameters
tied to sparse features needs to be higher than for those tied to dense features, because sparse
features occur less frequently.
Technically, it acts on the learning rate by dividing it by the square root of the accumulated sum
of all squared gradients. In the update rule, AdaGrad modifies the general learning rate η at
each step for all the parameters based on past computations. One of the biggest disadvantages is
the accumulation of squared gradients in the denominator: since every added term is positive,
the accumulated sum keeps growing during training, which makes the learning rate shrink and
eventually become very small. On the other hand, this method is not very sensitive to the master
step size and converges faster.
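A minimal sketch of the AdaGrad update described above follows; grad_fn is a stand-in for the gradient of the loss, and the learning rate, epsilon, and step count are illustrative assumptions.

import numpy as np

def adagrad(grad_fn, w0, learning_rate=0.5, eps=1e-8, steps=200):
    w = np.asarray(w0, dtype=float)
    cache = np.zeros_like(w)                  # running sum of squared gradients
    for _ in range(steps):
        g = grad_fn(w)
        cache += g ** 2                       # only grows, so the effective step keeps shrinking
        w -= learning_rate * g / (np.sqrt(cache) + eps)   # per-parameter step size
    return w

print(adagrad(lambda w: 2.0 * w, w0=[5.0, -3.0]))   # moves toward the minimum at [0, 0]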
Nesterov Accelerated Gradient (NAG):
In momentum-based optimization, the current step is taken based on the values of previous
iterations. But we need a smarter algorithm that knows when to slow down so that the update
does not overshoot. To do this, the algorithm should have an approximate idea of the parameter
values at its next iteration. We can then efficiently look ahead by calculating the gradient with
respect to the approximate future position of the parameters.
From the momentum update above, we know that the term γ·V(t−1) carries the contribution of
previous iterations. Computing θ − γ·V(t−1) therefore gives us an approximation of the next
position of the parameters θ. We can look ahead of the current parameters by evaluating the
gradient at this approximate future position:
V(t) = γ · V(t−1) + η · ∇J(θ − γ · V(t−1))
θ = θ − V(t)
By using the NAG technique, we adapt the error function with the help of both previous and
(approximate) future values, and thus speed up convergence. In the next techniques, we will
try to adapt or vary the individual parameters' learning rates depending on how important
each parameter is in a given case.
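A minimal sketch of NAG follows: the only change from the momentum sketch above is that the gradient is evaluated at the look-ahead point θ − γ·V(t−1) rather than at the current parameters; grad_fn and the hyperparameters are illustrative assumptions.

import numpy as np

def nag(grad_fn, w0, learning_rate=0.01, gamma=0.9, steps=100):
    w = np.asarray(w0, dtype=float)
    v = np.zeros_like(w)
    for _ in range(steps):
        lookahead = w - gamma * v                        # approximate next position
        v = gamma * v + learning_rate * grad_fn(lookahead)
        w = w - v                                        # θ = θ - V(t)
    return w

print(nag(lambda w: 2.0 * w, w0=[5.0, -3.0]))            # moves toward [0, 0]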
RMSProp:
Root Mean Squared Propagation (RMSProp) is another adaptive learning rate method that tries to
improve on AdaGrad. Instead of taking the cumulative sum of squared gradients as in AdaGrad,
we take their exponential moving average. The first step in both AdaGrad and RMSProp is
identical; RMSProp then simply divides the learning rate by this exponentially decaying average.
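A minimal sketch of RMSProp follows: it mirrors the AdaGrad sketch above, but the squared gradients are averaged with an exponential moving average instead of summed; grad_fn and the hyperparameters are illustrative assumptions.

import numpy as np

def rmsprop(grad_fn, w0, learning_rate=0.01, decay=0.9, eps=1e-8, steps=1000):
    w = np.asarray(w0, dtype=float)
    avg = np.zeros_like(w)                      # W_avg: decaying average of squared gradients
    for _ in range(steps):
        g = grad_fn(w)
        avg = decay * avg + (1.0 - decay) * g ** 2
        w -= learning_rate * g / (np.sqrt(avg) + eps)
    return w

print(rmsprop(lambda w: 2.0 * w, w0=[5.0, -3.0]))   # moves toward [0, 0]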
Adam (Adaptive Moment Estimation):
Adam is a combination of RMSProp and Momentum. This method computes an adaptive learning
rate for each parameter. In addition to storing the decaying average of past squared gradients, it
also keeps an average of past gradients, similar to Momentum. Thus, Adam behaves like a heavy
ball with friction, which prefers flat minima in the error surface.
Note: Adam is considered one of the best optimization algorithms for deep learning, and its
popularity is growing very fast.
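A minimal sketch of Adam follows: an exponential moving average of the gradients (the momentum term m) plus an exponential moving average of the squared gradients (the scaling term v), each with a bias correction for their zero initialization; grad_fn and the hyperparameters are illustrative assumptions, with beta1 = 0.9, beta2 = 0.999, and eps = 1e-8 being the commonly used defaults.

import numpy as np

def adam(grad_fn, w0, learning_rate=0.05, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    w = np.asarray(w0, dtype=float)
    m = np.zeros_like(w)                      # first moment: average gradient
    v = np.zeros_like(w)                      # second moment: average squared gradient
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g ** 2
        m_hat = m / (1.0 - beta1 ** t)        # bias correction for the zero start
        v_hat = v / (1.0 - beta2 ** t)
        w -= learning_rate * m_hat / (np.sqrt(v_hat) + eps)
    return w

print(adam(lambda w: 2.0 * w, w0=[5.0, -3.0]))   # moves toward [0, 0]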
Conclusion:
Now that we have had a look at the different optimization techniques, we cannot simply apply all
of them to the same problem; depending on the problem, the best approach may change.
Now it's your turn to decide which optimization technique you want to use in your model.
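In practice these optimizers are rarely hand-coded; a hedged sketch of how one might be selected in a framework such as PyTorch is shown below (assuming torch is installed; the model, data, and hyperparameters are placeholders, not recommendations).

import torch
import torch.nn as nn

model = nn.Linear(3, 1)                       # placeholder model
loss_fn = nn.MSELoss()
x = torch.randn(32, 3)                        # placeholder batch
y = torch.randn(32, 1)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# e.g. torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9, nesterov=True),
# or torch.optim.Adagrad(...), torch.optim.RMSprop(...), torch.optim.NAdam(...)

for epoch in range(10):
    optimizer.zero_grad()                     # clear gradients from the previous step
    loss = loss_fn(model(x), y)
    loss.backward()                           # backpropagate the loss
    optimizer.step()                          # apply the chosen optimizer's update rule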