Optimization Techniques in Deep Learning
1. Gradient Descent
2. Stochastic Gradient Descent (SGD)
3. Mini-Batch Stochastic Gradient Descent (MB-SGD)
4. SGD with Momentum
5. Nesterov Accelerated Gradient (NAG)
6. Adaptive Gradient (AdaGrad)
7. AdaDelta
8. RMSProp
9. Adam
10. Nadam
Gradient Descent is an iterative algorithm that starts from a random point on the
function and traverses down its slope in steps until it reaches the lowest point of that
function. This algorithm is apt for cases where the optimal points cannot be found by
equating the slope of the function to 0. In a deep learning algorithm, the weights must be
altered for the loss function to reach its minimum value. With the help of backpropagation,
the loss is transferred from one layer to another, and the weight parameters are modified
depending on the loss so that the loss can be minimized.
The idea of this method is to update the variables iteratively in the direction opposite
to the gradient of the objective function.
With every update, the method guides the model toward the target and gradually
converges to the optimal value of the objective function.
The term gradient descent refers to the changes to the model that move it along a
slope or gradient in a graph toward the lowest possible error value. Each time the
algorithm is run, it moves step-by-step in the direction of the steepest descent,
defined by the negative of the gradient.
The size of the steps is known as the learning rate. A higher learning rate covers more
ground with each step but may be less precise and can overshoot the minimum. A low
learning rate is more precise but time-consuming to run on large datasets.
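To make the update rule concrete, here is a minimal sketch of plain gradient descent on a simple one-dimensional function; the quadratic f(w) = (w − 3)², the starting point, the learning rates, and the step count are illustrative assumptions rather than values from the text.

def f_grad(w):
    # derivative of f(w) = (w - 3)^2 with respect to w
    return 2.0 * (w - 3.0)

def gradient_descent(w0=0.0, learning_rate=0.1, steps=50):
    w = w0
    for _ in range(steps):
        w = w - learning_rate * f_grad(w)   # step along the negative gradient
    return w

print(gradient_descent())                   # converges near the minimum at w = 3
print(gradient_descent(learning_rate=1.1))  # too large a step size: the iterates diverge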
Gradient descent is most appropriately used when the optimal parameters cannot be
computed exactly through a closed-form calculation and the target must be searched for
by an optimization algorithm. In such cases, gradient descent can also be much cheaper and
faster at finding a solution.
The adaptive methods covered later (AdaGrad and RMSProp) refine this basic update by scaling the learning rate with past squared gradients:

w_t = w_{t−1} − η_t · (∂h/∂w_{t−1})

AdaGrad: η_t = η / √(α_t + ε), where α_t = Σ_{i=1..t} (∂h/∂w_i)²

RMSProp: η_t = η / √(W_avg,t + ε), where W_avg,t = γ · W_avg,t−1 + (1 − γ) · (∂h/∂w_t)²
Optimization algorithms are responsible for reducing the loss and providing the most accurate
results possible. The weights are initialized using some initialization strategy and are updated
with each epoch according to an update equation of the form w_new = w_old − η · (∂L/∂w_old).
The best results are achieved using optimization strategies or algorithms called optimizers.
Gradient Descent:
There are two main types of gradient descent configurations: batch and stochastic. Each type
has its pros and cons, and the data scientist must understand the differences to be able to select
the best approach for the problem at hand.
In a batch gradient descent, the algorithm calculates the error for each example in the training
dataset, but the model is only updated after all the training samples have been run through the
algorithm. The model is updated in a group or batch.
Each cycle through the complete training dataset is called a training epoch. In the batch gradient
descent, the model is updated at the conclusion of each training epoch.
Each batch is used to evaluate how closely the machine learning model's estimates fit the
target function on the training dataset. The batch approach is computationally more
efficient than the stochastic method.
The lower update frequency means the error gradient is more stable and may offer more stable
convergence on some problems. It is useful in parallel processing implementations because the
calculation of prediction errors and the model updates occur separately. However, with large
datasets, running the entire batch can be slow. Also, the stable error gradient may lead to
premature convergence on a less-than-optimal set of parameters.
There is a variation called mini-batch gradient descent that divides the training dataset into
smaller batches, often between 10 and 1,000 examples selected at random. It is a compromise
between the efficiency of the full-batch version and the robustness of the stochastic approach. It
is faster because smaller batches are run and not all training data has to be loaded into
memory at once. However, error gradients must still be accumulated across the examples in each
mini-batch, as in the batch approach.
Mini-batch gradient descent is the most common form used for machine learning.
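A minimal sketch of mini-batch gradient descent for a small linear-regression problem follows; the synthetic data, batch size, learning rate, and epoch count are illustrative assumptions. Setting batch_size = len(X) reproduces full-batch gradient descent, and batch_size = 1 reproduces the stochastic version discussed next.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))                 # 1,000 examples, 3 features
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=1000)   # noisy targets

def minibatch_gd(X, y, batch_size=32, learning_rate=0.1, epochs=20):
    w = np.zeros(X.shape[1])
    n = len(X)
    for _ in range(epochs):                    # one epoch = one pass over the data
        order = rng.permutation(n)             # shuffle, then take batches at random
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            grad = 2.0 * X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)  # MSE gradient
            w -= learning_rate * grad          # one update per batch
    return w

print(minibatch_gd(X, y))                      # approaches [2.0, -1.0, 0.5]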
Stochastic Gradient Descent:
In comparison to the batch approach, the stochastic version calculates error and updates the
model for a single random example in the training dataset. The stochastic gradient descent is also
called the online machine learning algorithm. Each iteration of the gradient descent uses a single
sample and requires a prediction for each iteration. Stochastic gradient descent is often used
when there is a lot of data.
Stochastic gradient descent is more computationally intensive because the error is calculated and
the model is updated after each instance. Stochastic gradient descent can lead to faster learning
for some problems due to the increase in update frequency. The frequent updates also give faster
insights into the model’s performance and rate of improvement.
Due to the granularity of updating the model at each step, the model can deliver a more
accurate result before reaching convergence. However, despite all the benefits, the process can
be affected by a noisy update procedure that makes it hard for the algorithm to settle at the
minimum error for the model.
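For contrast with the mini-batch sketch above, here is a minimal sketch of the per-example (stochastic) update, i.e. the batch_size = 1 special case; X, y, and w follow the same linear-regression setup, and the learning rate is an illustrative assumption.

import numpy as np

def sgd_epoch(X, y, w, learning_rate=0.01):
    for i in np.random.permutation(len(X)):       # visit examples in random order
        grad = 2.0 * X[i] * (X[i] @ w - y[i])     # gradient from a single example
        w = w - learning_rate * grad              # update immediately: fast but noisy
    return w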
Gradient Descent uses the whole training dataset to update the weights and biases. If we
have millions of records, training becomes slow and computationally very expensive.
SGD solves this problem by using only a single record to update the parameters. But SGD is
still slow to converge because it needs a forward and backward propagation for every record,
and the path to reach the global minimum becomes very noisy.
Mini-batch GD overcomes the SGD drawbacks by using a batch of records to update the
parameters. Since it doesn't use the entire dataset for each update, its path to the global
minimum is not as smooth as that of Gradient Descent, but it is far less noisy than SGD's.
Momentum-based Optimizer:
It generally works better than plain Stochastic Gradient Descent.
The problem with SGD is that, because of its high oscillation while it tries to reach the
minimum, we can't increase the learning rate, so it takes time to converge. In this algorithm,
we use Exponentially Weighted Averages to compute the gradient and use this averaged
gradient to update the parameters.
In SGD with momentum, we add a momentum term to the gradient update: the present
update depends on the previous updates, and so on. This accelerates SGD, so it converges
faster and oscillates less.
V(t) = γ · V(t−1) + η · ∇J(θ)
θ = θ − V(t)
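A minimal sketch of SGD with momentum follows; grad_fn stands in for whatever routine returns the (mini-batch) gradient of the loss at the current weights, and the hyperparameter values are illustrative assumptions.

import numpy as np

def sgd_momentum(grad_fn, w0, learning_rate=0.01, gamma=0.9, steps=100):
    w = np.asarray(w0, dtype=float)
    v = np.zeros_like(w)                             # V(t): running "velocity"
    for _ in range(steps):
        v = gamma * v + learning_rate * grad_fn(w)   # V(t) = γ·V(t-1) + η·∇J(θ)
        w = w - v                                    # θ = θ - V(t)
    return w

# Example: minimise f(w) = ||w||², whose gradient is 2w.
print(sgd_momentum(lambda w: 2.0 * w, w0=[5.0, -3.0]))   # moves toward [0, 0]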
Adagrad (Adaptive Gradient Algorithm):
In all the optimizers we have seen so far, up to SGD with momentum, the learning rate remains
constant. In the Adagrad optimizer there is no momentum concept, so it is much simpler than
SGD with momentum.
The idea behind Adagrad is to use a different learning rate for each parameter, based on the
iteration. The reason different learning rates are needed is that the learning rate for parameters
tied to sparse features needs to be higher than for those tied to dense features, because sparse
features occur less frequently.
Technically, it acts on the learning rate by dividing it by the square root of the accumulated sum
of all squared gradients. In the update rule, AdaGrad modifies the general learning rate η at
each step for all the parameters based on past computations. One of the biggest disadvantages is
the accumulation of squared gradients in the denominator: since every added term is positive,
the accumulated sum keeps growing during training, which makes the learning rate shrink and
eventually become very small. On the other hand, this method is not very sensitive to the master
step size and converges faster.
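A minimal sketch of the AdaGrad update described above follows; grad_fn is a stand-in for the gradient of the loss, and the learning rate, epsilon, and step count are illustrative assumptions.

import numpy as np

def adagrad(grad_fn, w0, learning_rate=0.5, eps=1e-8, steps=200):
    w = np.asarray(w0, dtype=float)
    cache = np.zeros_like(w)                  # running sum of squared gradients
    for _ in range(steps):
        g = grad_fn(w)
        cache += g ** 2                       # only grows, so the effective step keeps shrinking
        w -= learning_rate * g / (np.sqrt(cache) + eps)   # per-parameter step size
    return w

print(adagrad(lambda w: 2.0 * w, w0=[5.0, -3.0]))   # moves toward the minimum at [0, 0]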
Nesterov Accelerated Gradient (NAG):
In momentum-based optimization, the current step is taken based on the values of previous
iterations. But we need a smarter algorithm that knows when to slow down so that the update
does not overshoot. To do this, the algorithm should have an approximate idea of the parameter
values at its next iteration. We can then efficiently look ahead by calculating the gradient with
respect to the approximate future position of the parameters.
From the momentum update above, we know that the term γ·V(t−1) carries the contribution of
previous iterations. Computing θ − γ·V(t−1) therefore gives us an approximation of the next
position of the parameters θ. We can look ahead of the current parameters by evaluating the
gradient at this approximate future position:
V(t) = γ · V(t−1) + η · ∇J(θ − γ · V(t−1))
θ = θ − V(t)
By using the NAG technique, we adapt the error function with the help of both previous and
(approximate) future values, and thus speed up convergence. In the next techniques, we will
try to adapt or vary the individual parameters' learning rates depending on how important
each parameter is in a given case.
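A minimal sketch of NAG follows: the only change from the momentum sketch above is that the gradient is evaluated at the look-ahead point θ − γ·V(t−1) rather than at the current parameters; grad_fn and the hyperparameters are illustrative assumptions.

import numpy as np

def nag(grad_fn, w0, learning_rate=0.01, gamma=0.9, steps=100):
    w = np.asarray(w0, dtype=float)
    v = np.zeros_like(w)
    for _ in range(steps):
        lookahead = w - gamma * v                        # approximate next position
        v = gamma * v + learning_rate * grad_fn(lookahead)
        w = w - v                                        # θ = θ - V(t)
    return w

print(nag(lambda w: 2.0 * w, w0=[5.0, -3.0]))            # moves toward [0, 0]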
RMSProp:
Root Mean Squared Propagation (RMSProp) is another adaptive learning rate method that tries to
improve on AdaGrad. Instead of taking the cumulative sum of squared gradients as in AdaGrad,
we take their exponential moving average. The first step in both AdaGrad and RMSProp is
identical; RMSProp then simply divides the learning rate by this exponentially decaying average.
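A minimal sketch of RMSProp follows: it mirrors the AdaGrad sketch above, but the squared gradients are averaged with an exponential moving average instead of summed; grad_fn and the hyperparameters are illustrative assumptions.

import numpy as np

def rmsprop(grad_fn, w0, learning_rate=0.01, decay=0.9, eps=1e-8, steps=1000):
    w = np.asarray(w0, dtype=float)
    avg = np.zeros_like(w)                      # W_avg: decaying average of squared gradients
    for _ in range(steps):
        g = grad_fn(w)
        avg = decay * avg + (1.0 - decay) * g ** 2
        w -= learning_rate * g / (np.sqrt(avg) + eps)
    return w

print(rmsprop(lambda w: 2.0 * w, w0=[5.0, -3.0]))   # moves toward [0, 0]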
Adam (Adaptive Moment Estimation):
Adam is a combination of RMSProp and Momentum. This method computes an adaptive learning
rate for each parameter. In addition to storing the decaying average of past squared gradients, it
also keeps an average of past gradients, similar to Momentum. Thus, Adam behaves like a heavy
ball with friction, which prefers flat minima in the error surface.
Note: Adam is considered one of the best optimization algorithms for deep learning, and its
popularity is growing very fast.
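A minimal sketch of Adam follows: an exponential moving average of the gradients (the momentum term m) plus an exponential moving average of the squared gradients (the scaling term v), each with a bias correction for their zero initialization; grad_fn and the hyperparameters are illustrative assumptions, with beta1 = 0.9, beta2 = 0.999, and eps = 1e-8 being the commonly used defaults.

import numpy as np

def adam(grad_fn, w0, learning_rate=0.05, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    w = np.asarray(w0, dtype=float)
    m = np.zeros_like(w)                      # first moment: average gradient
    v = np.zeros_like(w)                      # second moment: average squared gradient
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g ** 2
        m_hat = m / (1.0 - beta1 ** t)        # bias correction for the zero start
        v_hat = v / (1.0 - beta2 ** t)
        w -= learning_rate * m_hat / (np.sqrt(v_hat) + eps)
    return w

print(adam(lambda w: 2.0 * w, w0=[5.0, -3.0]))   # moves toward [0, 0]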
Conclusion:
Now that we have had a look at the different optimization techniques, we cannot simply apply all
of them to the same problem; depending on the problem, the best approach may change.
Now it's your turn to decide which optimization technique you want to use in your model.
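In practice these optimizers are rarely hand-coded; a hedged sketch of how one might be selected in a framework such as PyTorch is shown below (assuming torch is installed; the model, data, and hyperparameters are placeholders, not recommendations).

import torch
import torch.nn as nn

model = nn.Linear(3, 1)                       # placeholder model
loss_fn = nn.MSELoss()
x = torch.randn(32, 3)                        # placeholder batch
y = torch.randn(32, 1)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# e.g. torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9, nesterov=True),
# or torch.optim.Adagrad(...), torch.optim.RMSprop(...), torch.optim.NAdam(...)

for epoch in range(10):
    optimizer.zero_grad()                     # clear gradients from the previous step
    loss = loss_fn(model(x), y)
    loss.backward()                           # backpropagate the loss
    optimizer.step()                          # apply the chosen optimizer's update rule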