Soft Computing Assignment
SANT LONGOWAL INSTITUTE OF ENGINEERING AND TECHNOLOGY
(DEEMED-TO-BE-UNIVERSITY, Under MHRD, Govt. of India)
Longowal, Sangrur, Punjab
Assignment
Of
Soft Computing
Gradient Descent
Gradient Descent is the most basic but most widely used optimization algorithm. It is used heavily in linear regression and classification algorithms, and backpropagation in neural networks also relies on gradient descent.
Algorithm: θ = θ − α⋅∇J(θ)
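As a rough illustration of this update rule, here is a minimal NumPy sketch that minimizes a toy quadratic loss J(θ) = 0.5⋅||θ||². The loss, its gradient grad_J, the learning rate alpha and the iteration count are all assumptions made for the example, not part of the assignment.

import numpy as np

# Toy loss J(θ) = 0.5 * ||θ||^2, so its gradient is ∇J(θ) = θ (illustration only).
def grad_J(theta):
    return theta

theta = np.array([4.0, -3.0])   # arbitrary starting point
alpha = 0.1                     # learning rate (assumed value)

for step in range(100):
    theta = theta - alpha * grad_J(theta)   # the update θ = θ − α⋅∇J(θ)

print(theta)   # approaches the minimum at [0, 0]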
Advantages:
1. Easy computation.
2. Easy to implement.
3. Easy to understand.
Disadvantages:
1. Weights are changed only after calculating the gradient on the whole dataset. So, if the dataset is very large, it may take a very long time to converge to the minima.
Stochastic Gradient Descent
Stochastic Gradient Descent (SGD) updates the model parameters after computing the loss on each individual training example. Because the parameters are updated so frequently, they have high variance, and the loss function fluctuates with varying intensity.
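The following sketch shows these per-example updates on an assumed toy linear-regression problem; the data, the squared-error loss and the learning rate are illustrative choices, not taken from the text.

import numpy as np

# Stochastic gradient descent: parameters are updated after every single
# training example, which is why they fluctuate (high variance).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                 # toy inputs (assumed data)
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=100)   # toy targets

w = np.zeros(3)
alpha = 0.01                                  # learning rate (assumed value)

for epoch in range(20):
    for i in rng.permutation(len(X)):         # visit examples in random order
        g = (X[i] @ w - y[i]) * X[i]          # gradient of the squared error on one example
        w = w - alpha * g                     # update after each example

print(w)   # close to true_w, but individual updates are noisy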
Advantages:
Disadvantages:
Mini-Batch Gradient Descent
Mini-batch gradient descent updates the model parameters after every small batch of training examples:
θ = θ − α⋅∇J(θ; B(i)), where B(i) are the batches of training examples.
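A sketch of this batched update on the same kind of toy linear-regression problem; the batch size of 16, the learning rate and the data are assumed values for illustration.

import numpy as np

# Mini-batch gradient descent: the gradient is averaged over a small batch B(i)
# instead of one example or the whole dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=100)

w = np.zeros(3)
alpha, batch_size = 0.05, 16                    # assumed hyperparameters

for epoch in range(50):
    order = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]   # indices of batch B(i)
        Xb, yb = X[idx], y[idx]
        g = Xb.T @ (Xb @ w - yb) / len(idx)     # average gradient over the batch
        w = w - alpha * g                       # one update per batch

print(w)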
Advantages:
1. Updates the model parameters frequently while having less variance than SGD.
All the gradient descent variants above share some common challenges:
1. Choosing an optimum value of the learning rate: if the learning rate is too small, gradient descent may take ages to converge.
2. The learning rate is constant for all the parameters, even though there may be some parameters we do not want to change at the same rate.
Momentum
Momentum was invented to reduce the high variance in SGD and to soften the convergence. It accelerates convergence in the relevant direction and reduces fluctuation in the irrelevant directions. One more hyperparameter is used in this method, known as momentum and symbolized by ‘γ’.
V(t) = γ⋅V(t−1) + α⋅∇J(θ), and the weights are then updated as θ = θ − V(t).
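A minimal sketch of this velocity-based update on the same toy quadratic loss as before; γ = 0.9 and the learning rate of 0.1 are assumed values, as they are not given in the text.

import numpy as np

# Momentum: an exponentially weighted velocity accumulates past gradients,
# damping oscillations and speeding movement along the consistent direction.
def grad_J(theta):                 # toy gradient: J(θ) = 0.5 * ||θ||^2
    return theta

theta = np.array([4.0, -3.0])
v = np.zeros_like(theta)           # V(0) = 0
alpha, gamma = 0.1, 0.9            # learning rate and momentum (assumed values)

for step in range(100):
    v = gamma * v + alpha * grad_J(theta)   # V(t) = γ⋅V(t−1) + α⋅∇J(θ)
    theta = theta - v                       # θ = θ − V(t)

print(theta)   # approaches the minimum at [0, 0]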
Advantages:
Disadvantages:
Adagrad
Adagrad adapts the learning rate: η is modified for each parameter θ(i) at every time step t, based on the previous gradients calculated for that parameter:
θ(i, t+1) = θ(i, t) − (η / √(G(i, t) + ϵ)) ⋅ g(i, t)
Here g(i, t) is the gradient of the loss w.r.t. θ(i) at time step t, and G(i, t) stores the sum of the squares of the gradients w.r.t. θ(i) up to time step t, while ϵ is a smoothing term that avoids division by zero (usually on the order of 1e−8). Interestingly, without the square root operation, the algorithm performs much worse.
It makes big updates for the less frequent parameters and smaller steps for the frequent parameters.
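A minimal Adagrad sketch on the same toy quadratic loss; the learning rate η = 0.5 and the toy gradient are assumptions made for illustration, while ϵ = 1e−8 follows the order of magnitude mentioned above.

import numpy as np

# Adagrad: each parameter gets its own effective learning rate η / sqrt(G + ϵ),
# where G accumulates that parameter's squared gradients.
def grad_J(theta):                 # toy gradient, for illustration only
    return theta

theta = np.array([4.0, -3.0])
G = np.zeros_like(theta)           # running sum of squared gradients, per parameter
eta, eps = 0.5, 1e-8               # assumed learning rate; ϵ avoids division by zero

for step in range(200):
    g = grad_J(theta)
    G += g ** 2                                   # accumulate squared gradients
    theta = theta - eta / np.sqrt(G + eps) * g    # per-parameter scaled update

print(theta)   # gradually moves toward the minimum at [0, 0]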
Advantages:
Disadvantages:
AdaDelta
AdaDelta is an extension of Adagrad that removes its decaying-learning-rate problem: instead of accumulating all past squared gradients, it keeps an exponentially decaying average of them:
E[g²](t) = γ⋅E[g²](t−1) + (1−γ)⋅g²(t)
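A sketch of AdaDelta, under the assumption that the standard form of the algorithm is meant: besides E[g²] above, it also keeps a decaying average E[Δθ²] of the squared parameter updates, so no global learning rate has to be hand-tuned. The toy gradient, γ = 0.9 and ϵ = 1e−6 are illustrative choices, not values from the text.

import numpy as np

# AdaDelta: decaying averages of squared gradients and of squared parameter
# updates set the step size, so no global learning rate is needed.
def grad_J(theta):                 # toy gradient, for illustration only
    return theta

theta = np.array([4.0, -3.0])
Eg2 = np.zeros_like(theta)         # E[g²], decaying average of squared gradients
Edx2 = np.zeros_like(theta)        # E[Δθ²], decaying average of squared updates
gamma, eps = 0.9, 1e-6             # decay rate and smoothing term (assumed values)

for step in range(1000):
    g = grad_J(theta)
    Eg2 = gamma * Eg2 + (1 - gamma) * g ** 2              # E[g²](t)
    dx = -np.sqrt(Edx2 + eps) / np.sqrt(Eg2 + eps) * g    # scaled update step
    Edx2 = gamma * Edx2 + (1 - gamma) * dx ** 2           # E[Δθ²](t)
    theta = theta + dx

print(theta)   # slowly drifts toward the minimum at [0, 0]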
Advantages:
1. Now the learning rate does not decay and the training does not stop.
Disadvantages:
1. Computationally expensive.
Adam
Adam (Adaptive Moment Estimation) works with first- and second-order moments of the gradients: M(t) is the first moment (the mean) and V(t) is the second moment (the uncentered variance), computed as exponentially decaying averages with decay rates β1 and β2.
Typical values are 0.9 for β1, 0.999 for β2, and 1e−8 for ϵ.
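A minimal Adam sketch on the same toy quadratic loss, using the β1, β2 and ϵ values quoted above; the step size alpha = 0.1 and the toy gradient are assumptions made for illustration.

import numpy as np

# Adam: combines a momentum-like first moment M with an adaptive second
# moment V, plus bias correction for the zero-initialized averages.
def grad_J(theta):                 # toy gradient, for illustration only
    return theta

theta = np.array([4.0, -3.0])
M = np.zeros_like(theta)           # first moment (mean of the gradients)
V = np.zeros_like(theta)           # second moment (uncentered variance)
alpha, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8   # alpha is assumed

for t in range(1, 201):
    g = grad_J(theta)
    M = beta1 * M + (1 - beta1) * g             # M(t)
    V = beta2 * V + (1 - beta2) * g ** 2        # V(t)
    M_hat = M / (1 - beta1 ** t)                # bias-corrected first moment
    V_hat = V / (1 - beta2 ** t)                # bias-corrected second moment
    theta = theta - alpha * M_hat / (np.sqrt(V_hat) + eps)

print(theta)   # converges close to the minimum at [0, 0]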
Advantages:
1. Fast; converges rapidly and trains the network efficiently.
Disadvantages:
1. Computationally costly.
Conclusions
Adam is the best of the optimizers discussed. If one wants to train a neural network in less time and more efficiently, Adam is the optimizer to use.
For sparse data, use the optimizers with a dynamic learning rate.
If one wants to use a plain gradient descent algorithm, then mini-batch gradient descent is the best option.