
SANT LONGOWAL INSTITUTE OF ENGINEERING AND TECHNOLOGY
(Deemed-to-be University, under MHRD, Govt. of India)
Sangrur, Longowal, Punjab

Assignment
Of
Soft Computing

Submitted To: Dr. Birmohan Singh (Professor, CSE Dept.)
Submitted By: Manohar Suman (GCS - 1830054)


GROUP - C
Prepare an assignment on the various optimizers used for neural network training (e.g., SGD, SGDM, Adadelta, Adam, Adagrad).

Optimizers are algorithms or methods used to change the attributes of a neural network, such as its weights and learning rate, in order to reduce the losses.

Various optimizers used for neural networks are as follows:

Gradient Descent
Gradient Descent is the most basic but most used optimization algorithm. It’s
used heavily in linear regression and classification algorithms.
Backpropagation in neural networks also uses a gradient descent algorithm.

Gradient descent is a first-order optimization algorithm that depends on the first-order derivative of the loss function. It calculates which way the weights should be altered so that the function can reach a minimum. Through backpropagation, the loss is propagated from one layer to another, and the model’s parameters (also known as weights) are modified depending on the loss so that it can be minimized.

Algorithm: θ = θ − α⋅∇J(θ)
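As a small illustration of this update rule (a sketch only; the gradient function grad_J and the toy objective below are assumptions made for demonstration, not part of the assignment):

```python
import numpy as np

def gradient_descent(theta, grad_J, alpha=0.01, n_iters=1000):
    """Full-batch gradient descent: theta = theta - alpha * grad_J(theta)."""
    for _ in range(n_iters):
        theta = theta - alpha * grad_J(theta)   # step against the gradient of the loss
    return theta

# Toy example: minimize J(theta) = ||theta||^2, whose gradient is 2*theta.
theta0 = np.array([3.0, -2.0])
print(gradient_descent(theta0, lambda th: 2 * th))  # approaches [0, 0]
```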

Advantages:

1. Easy computation.

2. Easy to implement.

3. Easy to understand.

Disadvantages:

1. May get trapped at local minima.

2. Weights are changed only after calculating the gradient on the whole dataset. So, if the dataset is too large, then this may take years to converge to the minima.

3. Requires large memory to calculate gradient on the whole dataset.


Stochastic Gradient Descent
It’s a variant of Gradient Descent that updates the model’s parameters more frequently. Here, the model parameters are altered after computing the loss on each training example. So, if the dataset contains 1,000 rows, SGD will update the model parameters 1,000 times in one cycle through the dataset (one epoch), instead of once as in Gradient Descent.

θ = θ − α⋅∇J(θ; x(i); y(i)), where {x(i), y(i)} are the training examples.

Because the model parameters are updated so frequently, they have high variance, and the loss function fluctuates with different intensities.
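A minimal NumPy sketch of these per-example updates (assuming a small least-squares problem chosen only for illustration):

```python
import numpy as np

def sgd(theta, X, y, grad_example, alpha=0.02, epochs=200, seed=0):
    """Stochastic gradient descent: one parameter update per training example (x(i), y(i))."""
    rng = np.random.default_rng(seed)
    for _ in range(epochs):
        for i in rng.permutation(len(X)):                   # visit examples in random order
            theta = theta - alpha * grad_example(theta, X[i], y[i])
    return theta

# Toy least-squares example: the per-example gradient is 2 * (x.theta - y) * x.
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 3.0]])
y = np.array([5.0, 4.0, 9.0])
theta = sgd(np.zeros(2), X, y, lambda th, x, t: 2 * (x @ th - t) * x)
print(theta)  # approaches the exact solution [1, 2]
```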

Advantages:

1. Frequent updates of model parameters, hence it converges in less time.

2. Requires less memory, as there is no need to store the values of the loss function.

3. May find new minima.

Disadvantages:

1. High variance in model parameters.

2. May overshoot even after reaching the global minima.

3. To get the same convergence as gradient descent, the learning rate needs to be reduced slowly.
Mini-Batch Gradient Descent

It’s the best among all the variations of gradient descent algorithms and an improvement on both SGD and standard gradient descent. It updates the model parameters after every batch: the dataset is divided into various batches, and after every batch the parameters are updated.

θ = θ − α⋅∇J(θ; B(i)), where {B(i)} are the batches of training examples.
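A sketch of the batched update (again on an assumed toy least-squares problem, used only for illustration):

```python
import numpy as np

def minibatch_gd(theta, X, y, grad_batch, alpha=0.05, batch_size=2, epochs=100, seed=0):
    """Mini-batch gradient descent: one parameter update per batch B(i)."""
    rng = np.random.default_rng(seed)
    for _ in range(epochs):
        order = rng.permutation(len(X))
        for start in range(0, len(X), batch_size):
            idx = order[start:start + batch_size]           # indices of the current batch B(i)
            theta = theta - alpha * grad_batch(theta, X[idx], y[idx])
    return theta

# Toy example: the mean-squared-error gradient over a batch is 2/|B| * X_B^T (X_B theta - y_B).
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 3.0], [1.0, 0.0]])
y = np.array([5.0, 4.0, 9.0, 1.0])
grad = lambda th, Xb, yb: 2 * Xb.T @ (Xb @ th - yb) / len(Xb)
print(minibatch_gd(np.zeros(2), X, y, grad))  # approaches the exact solution [1, 2]
```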

Advantages:

1. Frequently updates the model parameters and also has less variance.

2. Requires a medium amount of memory.

All types of Gradient Descent have some challenges:

1. Choosing an optimum value of the learning rate. If the learning rate is too small, then gradient descent may take ages to converge.

2. The learning rate is constant for all the parameters; there may be some parameters which we may not want to change at the same rate.

3. May get trapped at local minima.


Momentum

Momentum was invented to reduce the high variance in SGD and to soften the convergence. It accelerates convergence towards the relevant direction and reduces the fluctuation in the irrelevant direction. One more hyperparameter is used in this method, known as momentum and symbolized by ‘γ’.

V(t) = γ⋅V(t−1) + α⋅∇J(θ)

Now, the weights are updated by θ=θ−V(t).

The momentum term γ is usually set to 0.9 or a similar value.
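A minimal sketch of this velocity-based update (the toy objective is an assumption made for illustration):

```python
import numpy as np

def momentum_gd(theta, grad_J, alpha=0.01, gamma=0.9, n_iters=500):
    """Gradient descent with momentum: V(t) = gamma*V(t-1) + alpha*grad, then theta = theta - V(t)."""
    v = np.zeros_like(theta)                    # velocity term V(t)
    for _ in range(n_iters):
        v = gamma * v + alpha * grad_J(theta)   # exponentially decaying accumulation of gradients
        theta = theta - v                       # move with the accumulated velocity
    return theta

# Toy example: J(theta) = ||theta||^2 with gradient 2*theta.
print(momentum_gd(np.array([3.0, -2.0]), lambda th: 2 * th))  # approaches [0, 0]
```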

Advantages:

1. Reduces the oscillations and high variance of the parameters.

2. Converges faster than gradient descent.

Disadvantages:

1. One more hyperparameter is added, which needs to be selected manually and accurately.

Nesterov Accelerated Gradient


Momentum may be a good method, but if the momentum is too high the algorithm may overshoot the minima and keep moving past them. To resolve this issue, the NAG algorithm was developed. It is a look-ahead method. We know we’ll be using γV(t−1) for modifying the weights, so θ − γV(t−1) approximately tells us the future location. Now, we calculate the cost based on this future parameter rather than the current one.

V(t) = γ⋅V(t−1) + α⋅∇J(θ − γV(t−1)), and then update the parameters using θ = θ − V(t).
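A sketch of this look-ahead update, which differs from plain momentum only in where the gradient is evaluated (the toy objective is assumed for illustration):

```python
import numpy as np

def nag(theta, grad_J, alpha=0.01, gamma=0.9, n_iters=500):
    """Nesterov accelerated gradient: evaluate the gradient at the look-ahead point theta - gamma*V(t-1)."""
    v = np.zeros_like(theta)
    for _ in range(n_iters):
        lookahead = theta - gamma * v                   # approximate future position of the parameters
        v = gamma * v + alpha * grad_J(lookahead)       # gradient taken at the look-ahead point
        theta = theta - v
    return theta

# Toy example: J(theta) = ||theta||^2 with gradient 2*theta.
print(nag(np.array([3.0, -2.0]), lambda th: 2 * th))  # approaches [0, 0]
```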

Advantages:

1. Does not overshoot the minima.

2. Slows down as a minimum approaches.

Disadvantages:

1. Still, the hyperparameter needs to be selected manually.


Adagrad
One of the disadvantages of all the optimizers explained so far is that the learning rate is constant for all parameters and for each cycle. This optimizer changes that: it adapts the learning rate ‘η’ for each parameter and at every time step ‘t’. It works on the first-order derivative of the error function, accumulating the squares of past gradients separately for each parameter.

g(t, i) = ∇J(θ(t, i)) is the derivative of the loss function for the given parameter θ(i) at time step t.

θ(t+1, i) = θ(t, i) − ( η / √(G(t, ii) + ϵ) ) ⋅ g(t, i) is the parameter update for a given parameter i at time step t.

Here η is a learning rate which is modified for the given parameter θ(i) at each time step based on the previous gradients calculated for θ(i). G(t, ii) stores the sum of the squares of the gradients w.r.t. θ(i) up to time step t, while ϵ is a smoothing term that avoids division by zero (usually on the order of 1e−8). Interestingly, without the square-root operation, the algorithm performs much worse.

It makes big updates for less frequent parameters and a small step for frequent
parameters.
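A sketch of this per-parameter scaling (the toy objective is assumed for illustration):

```python
import numpy as np

def adagrad(theta, grad_J, eta=0.5, eps=1e-8, n_iters=500):
    """Adagrad: per-parameter step eta / sqrt(G + eps), where G accumulates all past squared gradients."""
    G = np.zeros_like(theta)                    # running sum of squared gradients, one entry per parameter
    for _ in range(n_iters):
        g = grad_J(theta)
        G = G + g ** 2                          # accumulate g(t)^2 element-wise; G only ever grows
        theta = theta - eta / np.sqrt(G + eps) * g
    return theta

# Toy example: J(theta) = ||theta||^2 with gradient 2*theta.
print(adagrad(np.array([3.0, -2.0]), lambda th: 2 * th))  # approaches [0, 0] with ever-shrinking steps
```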

Advantages:

1. Learning rate changes for each training parameter.

2. Don’t need to manually tune the learning rate.

3. Able to train on sparse data.

Disadvantages:

1. Computationally expensive, as it must compute and store the accumulated squared gradients for every parameter.

2. The learning rate is always decreasing, which results in slow training.


AdaDelta

It is an extension of AdaGrad that removes its decaying learning rate problem. Instead of accumulating all previously squared gradients, Adadelta limits the window of accumulated past gradients to some fixed size w. An exponentially decaying moving average is used rather than the sum of all the gradients.

E[g²](t) = γ⋅E[g²](t−1) + (1−γ)⋅g²(t)

We set γ to a similar value as the momentum term, around 0.9.

Update the parameters: θ(t+1) = θ(t) − ( η / √(E[g²](t) + ϵ) ) ⋅ g(t)
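A sketch of the update in the simplified form given above, i.e. a fixed η divided by the decaying average of squared gradients (full Adadelta additionally tracks a decaying average of squared parameter updates in place of η):

```python
import numpy as np

def adadelta_simplified(theta, grad_J, eta=0.05, gamma=0.9, eps=1e-8, n_iters=500):
    """Simplified Adadelta-style update: scale steps by a decaying average of squared gradients
    E[g^2](t) instead of Adagrad's ever-growing sum, so the effective learning rate does not vanish."""
    Eg2 = np.zeros_like(theta)                  # E[g^2](t)
    for _ in range(n_iters):
        g = grad_J(theta)
        Eg2 = gamma * Eg2 + (1 - gamma) * g ** 2
        theta = theta - eta / np.sqrt(Eg2 + eps) * g
    return theta

# Toy example: J(theta) = ||theta||^2 with gradient 2*theta.
print(adadelta_simplified(np.array([3.0, -2.0]), lambda th: 2 * th))  # settles close to [0, 0]
```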

Advantages:

1. Now the learning rate does not decay and the training does not stop.

Disadvantages:

1. Computationally expensive.
Adam

Adam (Adaptive Moment Estimation) works with first- and second-order moments. The intuition behind Adam is that we don’t want to roll so fast that we jump over the minimum; we want to decrease the velocity a little for a careful search. In addition to storing an exponentially decaying average of past squared gradients like AdaDelta, Adam also keeps an exponentially decaying average of past gradients, M(t).

M(t) and V(t) are the first moment (the mean) and the second moment (the uncentered variance) of the gradients, respectively.

First and second moments: M(t) = β1⋅M(t−1) + (1−β1)⋅g(t) and V(t) = β2⋅V(t−1) + (1−β2)⋅g²(t).

Because M(t) and V(t) are initialized at zero, we take bias-corrected estimates so that E[M̂(t)] equals E[g(t)], where E[f(x)] is the expected value of f(x): M̂(t) = M(t) / (1 − β1^t) and V̂(t) = V(t) / (1 − β2^t).

To update the parameters:

θ(t+1) = θ(t) − ( η / (√V̂(t) + ϵ) ) ⋅ M̂(t)

The typical values are 0.9 for β1, 0.999 for β2, and 10⁻⁸ for ϵ.
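A sketch that puts the moment estimates, bias correction, and update together (the toy objective is assumed for illustration):

```python
import numpy as np

def adam(theta, grad_J, eta=0.05, beta1=0.9, beta2=0.999, eps=1e-8, n_iters=500):
    """Adam: bias-corrected decaying averages of the gradient (M) and the squared gradient (V)."""
    M = np.zeros_like(theta)                    # first moment estimate (mean of gradients)
    V = np.zeros_like(theta)                    # second moment estimate (uncentered variance)
    for t in range(1, n_iters + 1):
        g = grad_J(theta)
        M = beta1 * M + (1 - beta1) * g
        V = beta2 * V + (1 - beta2) * g ** 2
        M_hat = M / (1 - beta1 ** t)            # correct the bias from zero initialization
        V_hat = V / (1 - beta2 ** t)
        theta = theta - eta * M_hat / (np.sqrt(V_hat) + eps)
    return theta

# Toy example: J(theta) = ||theta||^2 with gradient 2*theta.
print(adam(np.array([3.0, -2.0]), lambda th: 2 * th))  # settles close to [0, 0]
```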
Advantages:

1. The method is fast and converges rapidly.

2. Rectifies the vanishing learning rate and the high variance of the updates.

Disadvantages:

1. Computationally costly.

Conclusions

Adam is one of the best optimizers. If one wants to train a neural network in less time and more efficiently, then Adam is the optimizer to choose.

For sparse data, use the optimizers with a dynamic learning rate.

If one wants to use a gradient descent algorithm, then mini-batch gradient descent is the best option.
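As a practical aside (a sketch, not part of the assignment text), deep learning frameworks ship these optimizers ready to use; the PyTorch calls below show one way to select them, with a tiny placeholder model assumed only so the optimizers have parameters to manage.

```python
import torch

model = torch.nn.Linear(10, 1)   # placeholder model; in practice only one optimizer is created per model

sgd      = torch.optim.SGD(model.parameters(), lr=0.01)                               # plain mini-batch SGD
sgdm     = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)                 # SGD with momentum
nesterov = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, nesterov=True)  # NAG
adagrad  = torch.optim.Adagrad(model.parameters(), lr=0.01)
adadelta = torch.optim.Adadelta(model.parameters())
adam     = torch.optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.999), eps=1e-8)
```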
