
SANT LONGOWAL INSTITUTE OF ENGINEERING AND TECHNOLOGY
(Deemed-to-be University, under MHRD, Govt. of India)
Sangrur, Longowal, Punjab

Assignment
Of
Soft Computing

Submitted To: Dr. Birmohan Singh (Professor, CSE Dept.)
Submitted By: Manohar Suman (GCS - 1830054)


GROUP - C
Prepare an assignment on the various optimizers used for neural network training (e.g., SGD, SGDM, Adadelta, Adam, Adagrad).

Optimizers are algorithms or methods used to change the attributes of a neural network, such as its weights and learning rate, in order to reduce the losses.

Various optimizers used for neural networks are as follows:

Gradient Descent
Gradient Descent is the most basic but most used optimization algorithm. It’s
used heavily in linear regression and classification algorithms.
Backpropagation in neural networks also uses a gradient descent algorithm.

Gradient descent is a first-order optimization algorithm that depends on the first-order derivative of the loss function. It calculates which way the weights should be altered so that the function can reach a minimum. Through backpropagation, the loss is propagated from one layer to another, and the model’s parameters (also known as weights) are modified depending on the loss so that it can be minimized.

Algorithm: θ = θ − α⋅∇J(θ)
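As a small illustration of this update rule (a sketch only; the gradient function grad_J and the toy objective below are assumptions made for demonstration, not part of the assignment):

```python
import numpy as np

def gradient_descent(theta, grad_J, alpha=0.01, n_iters=1000):
    """Full-batch gradient descent: theta = theta - alpha * grad_J(theta)."""
    for _ in range(n_iters):
        theta = theta - alpha * grad_J(theta)   # step against the gradient of the loss
    return theta

# Toy example: minimize J(theta) = ||theta||^2, whose gradient is 2*theta.
theta0 = np.array([3.0, -2.0])
print(gradient_descent(theta0, lambda th: 2 * th))  # approaches [0, 0]
```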

Advantages:

1. Easy computation.

2. Easy to implement.

3. Easy to understand.

Disadvantages:

1. May get trapped at local minima.

2. Weights are changed only after calculating the gradient on the whole dataset. So, if the dataset is too large, then this may take years to converge to the minima.

3. Requires large memory to calculate gradient on the whole dataset.


Stochastic Gradient Descent
It’s a variant of Gradient Descent that updates the model’s parameters more frequently. Here, the model parameters are altered after computing the loss on each training example. So, if the dataset contains 1,000 rows, SGD will update the model parameters 1,000 times in one cycle through the dataset (one epoch), instead of once as in Gradient Descent.

θ = θ − α⋅∇J(θ; x(i); y(i)), where {x(i), y(i)} are the training examples.

Because the model parameters are updated so frequently, they have high variance, and the loss function fluctuates with different intensities.
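A minimal NumPy sketch of these per-example updates (assuming a small least-squares problem chosen only for illustration):

```python
import numpy as np

def sgd(theta, X, y, grad_example, alpha=0.02, epochs=200, seed=0):
    """Stochastic gradient descent: one parameter update per training example (x(i), y(i))."""
    rng = np.random.default_rng(seed)
    for _ in range(epochs):
        for i in rng.permutation(len(X)):                   # visit examples in random order
            theta = theta - alpha * grad_example(theta, X[i], y[i])
    return theta

# Toy least-squares example: the per-example gradient is 2 * (x.theta - y) * x.
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 3.0]])
y = np.array([5.0, 4.0, 9.0])
theta = sgd(np.zeros(2), X, y, lambda th, x, t: 2 * (x @ th - t) * x)
print(theta)  # approaches the exact solution [1, 2]
```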

Advantages:

1. Frequent updates of model parameters, hence it converges in less time.

2. Requires less memory, as there is no need to store the values of the loss function.

3. May find new minima.

Disadvantages:

1. High variance in model parameters.

2. May overshoot even after reaching the global minima.

3. To get the same convergence as gradient descent, the learning rate needs to be reduced slowly.
Mini-Batch Gradient Descent

It’s the best among all the variations of gradient descent algorithms and an improvement on both SGD and standard gradient descent. It updates the model parameters after every batch: the dataset is divided into various batches, and after every batch the parameters are updated.

θ = θ − α⋅∇J(θ; B(i)), where {B(i)} are the batches of training examples.
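A sketch of the batched update (again on an assumed toy least-squares problem, used only for illustration):

```python
import numpy as np

def minibatch_gd(theta, X, y, grad_batch, alpha=0.05, batch_size=2, epochs=100, seed=0):
    """Mini-batch gradient descent: one parameter update per batch B(i)."""
    rng = np.random.default_rng(seed)
    for _ in range(epochs):
        order = rng.permutation(len(X))
        for start in range(0, len(X), batch_size):
            idx = order[start:start + batch_size]           # indices of the current batch B(i)
            theta = theta - alpha * grad_batch(theta, X[idx], y[idx])
    return theta

# Toy example: the mean-squared-error gradient over a batch is 2/|B| * X_B^T (X_B theta - y_B).
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 3.0], [1.0, 0.0]])
y = np.array([5.0, 4.0, 9.0, 1.0])
grad = lambda th, Xb, yb: 2 * Xb.T @ (Xb @ th - yb) / len(Xb)
print(minibatch_gd(np.zeros(2), X, y, grad))  # approaches the exact solution [1, 2]
```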

Advantages:

1. Frequently updates the model parameters and also has less variance.

2. Requires a medium amount of memory.

All types of Gradient Descent have some challenges:

1. Choosing an optimum value of the learning rate. If the learning rate is too small, then gradient descent may take ages to converge.

2. The learning rate is constant for all the parameters; there may be some parameters which we may not want to change at the same rate.

3. May get trapped at local minima.


Momentum

Momentum was invented to reduce the high variance in SGD and to soften the convergence. It accelerates convergence towards the relevant direction and reduces the fluctuation in the irrelevant direction. One more hyperparameter is used in this method, known as momentum and symbolized by ‘γ’.

V(t) = γ⋅V(t−1) + α⋅∇J(θ)

Now, the weights are updated by θ=θ−V(t).

The momentum term γ is usually set to 0.9 or a similar value.
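A minimal sketch of this velocity-based update (the toy objective is an assumption made for illustration):

```python
import numpy as np

def momentum_gd(theta, grad_J, alpha=0.01, gamma=0.9, n_iters=500):
    """Gradient descent with momentum: V(t) = gamma*V(t-1) + alpha*grad, then theta = theta - V(t)."""
    v = np.zeros_like(theta)                    # velocity term V(t)
    for _ in range(n_iters):
        v = gamma * v + alpha * grad_J(theta)   # exponentially decaying accumulation of gradients
        theta = theta - v                       # move with the accumulated velocity
    return theta

# Toy example: J(theta) = ||theta||^2 with gradient 2*theta.
print(momentum_gd(np.array([3.0, -2.0]), lambda th: 2 * th))  # approaches [0, 0]
```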

Advantages:

1. Reduces the oscillations and high variance of the parameters.

2. Converges faster than gradient descent.

Disadvantages:

1. One more hyperparameter is added, which needs to be selected manually and accurately.

Nesterov Accelerated Gradient


Momentum may be a good method, but if the momentum is too high the algorithm may overshoot the minima and keep moving past them. To resolve this issue, the NAG algorithm was developed. It is a look-ahead method. We know we’ll be using γV(t−1) for modifying the weights, so θ − γV(t−1) approximately tells us the future location. Now, we calculate the cost based on this future parameter rather than the current one.

V(t) = γ⋅V(t−1) + α⋅∇J(θ − γV(t−1)), and then update the parameters using θ = θ − V(t).
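A sketch of this look-ahead update, which differs from plain momentum only in where the gradient is evaluated (the toy objective is assumed for illustration):

```python
import numpy as np

def nag(theta, grad_J, alpha=0.01, gamma=0.9, n_iters=500):
    """Nesterov accelerated gradient: evaluate the gradient at the look-ahead point theta - gamma*V(t-1)."""
    v = np.zeros_like(theta)
    for _ in range(n_iters):
        lookahead = theta - gamma * v                   # approximate future position of the parameters
        v = gamma * v + alpha * grad_J(lookahead)       # gradient taken at the look-ahead point
        theta = theta - v
    return theta

# Toy example: J(theta) = ||theta||^2 with gradient 2*theta.
print(nag(np.array([3.0, -2.0]), lambda th: 2 * th))  # approaches [0, 0]
```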

Advantages:

1. Does not overshoot the minima.

2. Slows down as a minimum approaches.

Disadvantages:

1. Still, the hyperparameter needs to be selected manually.


Adagrad
One of the disadvantages of all the optimizers explained so far is that the learning rate is constant for all parameters and for each cycle. This optimizer changes that: it adapts the learning rate ‘η’ for each parameter and at every time step ‘t’. It works on the first-order derivative of the error function, accumulating the squares of past gradients separately for each parameter.

g(t, i) = ∇J(θ(t, i)) is the derivative of the loss function for the given parameter θ(i) at time step t.

θ(t+1, i) = θ(t, i) − ( η / √(G(t, ii) + ϵ) ) ⋅ g(t, i) is the parameter update for a given parameter i at time step t.

Here η is a learning rate which is modified for the given parameter θ(i) at each time step based on the previous gradients calculated for θ(i). G(t, ii) stores the sum of the squares of the gradients w.r.t. θ(i) up to time step t, while ϵ is a smoothing term that avoids division by zero (usually on the order of 1e−8). Interestingly, without the square-root operation, the algorithm performs much worse.

It makes big updates for less frequent parameters and a small step for frequent
parameters.
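A sketch of this per-parameter scaling (the toy objective is assumed for illustration):

```python
import numpy as np

def adagrad(theta, grad_J, eta=0.5, eps=1e-8, n_iters=500):
    """Adagrad: per-parameter step eta / sqrt(G + eps), where G accumulates all past squared gradients."""
    G = np.zeros_like(theta)                    # running sum of squared gradients, one entry per parameter
    for _ in range(n_iters):
        g = grad_J(theta)
        G = G + g ** 2                          # accumulate g(t)^2 element-wise; G only ever grows
        theta = theta - eta / np.sqrt(G + eps) * g
    return theta

# Toy example: J(theta) = ||theta||^2 with gradient 2*theta.
print(adagrad(np.array([3.0, -2.0]), lambda th: 2 * th))  # approaches [0, 0] with ever-shrinking steps
```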

Advantages:

1. Learning rate changes for each training parameter.

2. Don’t need to manually tune the learning rate.

3. Able to train on sparse data.

Disadvantages:

1. Computationally expensive, as it must compute and store the accumulated squared gradients for every parameter.

2. The learning rate is always decreasing, which results in slow training.


AdaDelta

It is an extension of AdaGrad that removes its decaying learning rate problem. Instead of accumulating all previously squared gradients, Adadelta limits the window of accumulated past gradients to some fixed size w. An exponentially decaying moving average is used rather than the sum of all the gradients.

E[g²](t) = γ⋅E[g²](t−1) + (1−γ)⋅g²(t)

We set γ to a similar value as the momentum term, around 0.9.

Update the parameters: θ(t+1) = θ(t) − ( η / √(E[g²](t) + ϵ) ) ⋅ g(t)
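A sketch of the update in the simplified form given above, i.e. a fixed η divided by the decaying average of squared gradients (full Adadelta additionally tracks a decaying average of squared parameter updates in place of η):

```python
import numpy as np

def adadelta_simplified(theta, grad_J, eta=0.05, gamma=0.9, eps=1e-8, n_iters=500):
    """Simplified Adadelta-style update: scale steps by a decaying average of squared gradients
    E[g^2](t) instead of Adagrad's ever-growing sum, so the effective learning rate does not vanish."""
    Eg2 = np.zeros_like(theta)                  # E[g^2](t)
    for _ in range(n_iters):
        g = grad_J(theta)
        Eg2 = gamma * Eg2 + (1 - gamma) * g ** 2
        theta = theta - eta / np.sqrt(Eg2 + eps) * g
    return theta

# Toy example: J(theta) = ||theta||^2 with gradient 2*theta.
print(adadelta_simplified(np.array([3.0, -2.0]), lambda th: 2 * th))  # settles close to [0, 0]
```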

Advantages:

1. Now the learning rate does not decay and the training does not stop.

Disadvantages:

1. Computationally expensive.
Adam

Adam (Adaptive Moment Estimation) works with first- and second-order moments. The intuition behind Adam is that we don’t want to roll so fast that we jump over the minimum; we want to decrease the velocity a little for a careful search. In addition to storing an exponentially decaying average of past squared gradients like AdaDelta, Adam also keeps an exponentially decaying average of past gradients, M(t).

M(t) and V(t) are the first moment (the mean) and the second moment (the uncentered variance) of the gradients, respectively.

First and second moments: M(t) = β1⋅M(t−1) + (1−β1)⋅g(t) and V(t) = β2⋅V(t−1) + (1−β2)⋅g²(t).

Because M(t) and V(t) are initialized at zero, we take bias-corrected estimates so that E[M̂(t)] equals E[g(t)], where E[f(x)] is the expected value of f(x): M̂(t) = M(t) / (1 − β1^t) and V̂(t) = V(t) / (1 − β2^t).

To update the parameters:

θ(t+1) = θ(t) − ( η / (√V̂(t) + ϵ) ) ⋅ M̂(t)

The typical values are 0.9 for β1, 0.999 for β2, and 10⁻⁸ for ϵ.
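A sketch that puts the moment estimates, bias correction, and update together (the toy objective is assumed for illustration):

```python
import numpy as np

def adam(theta, grad_J, eta=0.05, beta1=0.9, beta2=0.999, eps=1e-8, n_iters=500):
    """Adam: bias-corrected decaying averages of the gradient (M) and the squared gradient (V)."""
    M = np.zeros_like(theta)                    # first moment estimate (mean of gradients)
    V = np.zeros_like(theta)                    # second moment estimate (uncentered variance)
    for t in range(1, n_iters + 1):
        g = grad_J(theta)
        M = beta1 * M + (1 - beta1) * g
        V = beta2 * V + (1 - beta2) * g ** 2
        M_hat = M / (1 - beta1 ** t)            # correct the bias from zero initialization
        V_hat = V / (1 - beta2 ** t)
        theta = theta - eta * M_hat / (np.sqrt(V_hat) + eps)
    return theta

# Toy example: J(theta) = ||theta||^2 with gradient 2*theta.
print(adam(np.array([3.0, -2.0]), lambda th: 2 * th))  # settles close to [0, 0]
```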
Advantages:

1. The method is fast and converges rapidly.

2. Rectifies the vanishing learning rate and the high variance of the updates.

Disadvantages:

1. Computationally costly.

Conclusions

Adam is one of the best optimizers. If one wants to train a neural network in less time and more efficiently, then Adam is the optimizer to choose.

For sparse data, use the optimizers with a dynamic learning rate.

If one wants to use a gradient descent algorithm, then mini-batch gradient descent is the best option.
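As a practical aside (a sketch, not part of the assignment text), deep learning frameworks ship these optimizers ready to use; the PyTorch calls below show one way to select them, with a tiny placeholder model assumed only so the optimizers have parameters to manage.

```python
import torch

model = torch.nn.Linear(10, 1)   # placeholder model; in practice only one optimizer is created per model

sgd      = torch.optim.SGD(model.parameters(), lr=0.01)                               # plain mini-batch SGD
sgdm     = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)                 # SGD with momentum
nesterov = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, nesterov=True)  # NAG
adagrad  = torch.optim.Adagrad(model.parameters(), lr=0.01)
adadelta = torch.optim.Adadelta(model.parameters())
adam     = torch.optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.999), eps=1e-8)
```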
