Unit 2.4
At a particular instant, if the partial derivative of the loss with respect to a weight is positive, then we decrease that weight in order to decrease the loss.
If the partial derivative is negative, then we increase that weight in order to decrease the loss.
• This algorithm is called Gradient Descent.
• It is the most basic method of optimizing neural networks.
• This is an iterative process, so we update the value of each weight many times
before the loss converges to a suitable value.
• The weights are updated only after the gradient has been calculated over the whole dataset.
• With a huge amount of data, each weight update takes more time and requires a
large amount of memory (RAM), which slows down the process and makes it
computationally expensive.
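For reference, the update being described is the standard gradient descent rule for a single weight (written here as a reconstruction; the slide's own notation may differ slightly):

```latex
w \leftarrow w - \alpha \,\frac{\partial L}{\partial w}
```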
• Here the alpha symbol is the learning rate.
• It controls the speed at which our neural network is optimized.
• If we have a large learning rate, we approach the loss minimum faster because we
are taking big steps; however, those big steps may overshoot the minimum, so we
may never settle at a very good one.
• A smaller learning rate will solve this issue, but it
will take a lot of steps for the neural network’s
loss to decrease to a good value.
• Hence we need to keep the learning rate at an
optimal value.
• Usually alpha = 0.01 is a safe value (used in the loop sketch below).
• In some cases, problems like Vanishing Gradient or Exploding
Gradient may also occur due to incorrect parameter
initialization.
• These problems occur due to a very small or very large
gradient, which makes it difficult for the algorithm to converge.
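Putting the pieces above together, a minimal gradient descent loop might look like the following sketch. The function name `compute_gradients` is a placeholder (not from the original material) standing in for whatever computes dL/dw over the dataset.

```python
import numpy as np

def gradient_descent(weights, compute_gradients, alpha=0.01, num_steps=1000):
    """Minimal (full-batch) gradient descent loop.

    weights:           1-D NumPy array of parameters.
    compute_gradients: placeholder returning dL/dw over the whole dataset.
    alpha:             learning rate (0.01 is the 'safe' default mentioned above).
    """
    for step in range(num_steps):
        grads = compute_gradients(weights)   # gradient over the entire dataset
        weights = weights - alpha * grads    # step against the gradient
    return weights
```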
Gradient Descent variants
• There are three variants of gradient descent based on the amount of
data used to calculate the gradient:
• Batch gradient descent
• Stochastic gradient descent
• Mini-batch gradient descent
Batch Gradient Descent
• Batch Gradient Descent (vanilla gradient descent) calculates the error
for each observation in the dataset but performs a weight update only after
all observations have been evaluated, as sketched after this list.
• Batch gradient descent is not often used, because it represents a huge
consumption of computational resources, as the entire dataset needs
to remain in memory.
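A sketch of one batch gradient descent epoch, assuming a hypothetical per-observation gradient helper `grad_for_example` (not from the original material); the gradient is averaged over the dataset here, which is one common convention.

```python
import numpy as np

def batch_gd_epoch(weights, X, y, grad_for_example, alpha=0.01):
    """One epoch of batch (vanilla) gradient descent:
    accumulate the gradient over every observation, then update once."""
    total_grad = np.zeros_like(weights)
    for xi, yi in zip(X, y):
        total_grad += grad_for_example(weights, xi, yi)  # error for each observation
    total_grad /= len(X)                                 # average over the dataset
    return weights - alpha * total_grad                  # single update per epoch
```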
Stochastic Gradient Descent
• Stochastic gradient descent instead performs a weight update after evaluating
each individual observation, which makes updates frequent but noisy.
Nesterov Accelerated Gradient (NAG)
• This is the momentum update modified with NAG: we take the gradient at an
approximate future value of W instead of at the current value of W, as
sketched below.
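One common way of writing the NAG update (conventions differ between sources, so treat this as a sketch rather than the slide's exact formula):

```latex
v_t = \gamma\, v_{t-1} + \alpha\, \nabla_W L\!\left(W - \gamma\, v_{t-1}\right),
\qquad W \leftarrow W - v_t
```

The gradient is evaluated at the look-ahead point W − γ v_{t−1}, i.e. an approximate future value of W, rather than at W itself.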
Adaptive Optimization
Adagrad
• Adagrad is short for adaptive gradients. It adapts the learning rate (alpha)
separately for each weight on every update.
• The effective learning rate shrinks quickly for a weight that receives large or
frequent updates in a short amount of time, and shrinks slowly for a weight that
is rarely updated, so that weight keeps a relatively larger learning rate.
• To do this, each weight has its own cache value, which accumulates the squares
of its gradients up to the current point, as sketched below.
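A per-weight Adagrad update can be sketched as follows; `eps` is a small constant added for numerical stability (my addition, since the slides do not show the exact symbols).

```python
import numpy as np

def adagrad_update(weights, grads, cache, alpha=0.01, eps=1e-8):
    """Adagrad step: each weight's cache accumulates its squared gradients,
    and the effective learning rate shrinks as the cache grows."""
    cache = cache + grads ** 2                            # per-weight sum of squared gradients
    weights = weights - alpha * grads / (np.sqrt(cache) + eps)
    return weights, cache
```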
RMSProp
• In RMSProp the only difference lies in the cache updating strategy. In the
new formula, we introduce a new parameter, the decay rate (gamma).
• The gamma value is usually around 0.9 or 0.99. Hence on each update the
squared gradients are added to the cache at a very slow rate compared to
Adagrad.
• This ensures that the learning rate still changes constantly based on how each
weight is being updated, just like in Adagrad, but at the same time the
learning rate does not decay too quickly, allowing training to continue for
much longer, as in the sketch below.
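The RMSProp variant only changes how the cache is updated, using the decay rate gamma; again a sketch, with `eps` as a small stability constant of my choosing.

```python
import numpy as np

def rmsprop_update(weights, grads, cache, alpha=0.01, gamma=0.9, eps=1e-8):
    """RMSProp step: the cache is an exponential moving average of squared
    gradients, so it decays instead of growing without bound as in Adagrad."""
    cache = gamma * cache + (1 - gamma) * grads ** 2
    weights = weights - alpha * grads / (np.sqrt(cache) + eps)
    return weights, cache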
Adam
• Adam is a little like combining RMSProp with Momentum.
• First we calculate our m value, which represents the momentum at
the current point.
• The only difference between this equation and the momentum equation is that
instead of the learning rate, the current gradient is multiplied by the constant
(1 − beta1).
• Next we calculate the accumulated cache, which is updated in exactly the
same way as in RMSProp; a combined sketch of the Adam update is given below.
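Putting the two pieces together, an Adam-style update can be sketched as follows. The names beta1, beta2, eps and the step counter t are my choices; the bias-correction step is part of the full Adam algorithm, even though it is not discussed above.

```python
import numpy as np

def adam_update(weights, grads, m, cache, t, alpha=0.01,
                beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam step: momentum-style first moment (m) combined with an
    RMSProp-style cache of squared gradients. t is the step number, starting at 1."""
    m = beta1 * m + (1 - beta1) * grads               # momentum term
    cache = beta2 * cache + (1 - beta2) * grads ** 2  # RMSProp-style cache
    # Bias correction (included in the full Adam algorithm)
    m_hat = m / (1 - beta1 ** t)
    cache_hat = cache / (1 - beta2 ** t)
    weights = weights - alpha * m_hat / (np.sqrt(cache_hat) + eps)
    return weights, m, cache
```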