Abstract. Optimization is a branch of mathematics that aims to model, analyse and solve, analytically or numerically, problems of minimizing or maximizing a function over a specific dataset. Several optimization algorithms are used in systems based on deep learning (DL), such as the gradient descent (GD) algorithm. Given the importance and efficiency of the GD algorithm, several research works have improved it and produced other variants that have also been very successful in DL. This paper presents a comparative study of the stochastic, momentum, Nesterov, AdaGrad, RMSProp, AdaDelta, Adam, AdaMax and Nadam gradient descent algorithms, based on the convergence speed of these algorithms as well as on the mean absolute error of each algorithm in generating an optimization solution. The obtained results show that the AdaGrad algorithm achieves the best performance among the studied algorithms, with a mean absolute error (MAE) of 0.3858 in 53 iterations, while AdaDelta shows the lowest performance, with a MAE of 0.6035 in 6000 iterations. The case study treated in this work is based on an extract of the keratoconus dataset of Harvard Dataverse, and the results are obtained using Python.
1. Introduction
AI is difficult to define, yet it can be presented as a set of theories and techniques that aim to make computer systems capable of imitating some human behaviours such as reasoning, task planning, decision-making and learning [1].
The expectations of AI in the health sector are promising. Intelligent systems can assist in the diagnosis and detection of several diseases such as keratoconus [2], glaucoma [3] and diabetic retinopathy [4] in ophthalmology, for example. These systems are mainly based on the analysis of biomedical images, taking advantage of the benefits of DL tools for the classification, prediction, and treatment of these diseases. Detection and classification of keratoconus using DL requires the analysis of topographic maps of the eye with a high number of features [5], which makes the learning phase of these systems very complicated and slow. Gradient descent (GD) is an optimization algorithm that is widely used in DL [6].
This work presents a comparative study of the stochastic, momentum, Nesterov, AdaGrad, RMSProp, AdaDelta, Adam, AdaMax and Nadam GD algorithms. The rest of the paper is organized as follows: section 2 presents the GD algorithm and its different variants; related works are presented in section 3; section 4 presents a case study; section 5 discusses the obtained results; and the last section concludes the paper and presents its perspectives.
2. Gradient descent algorithm and its variants
The GD algorithm is an iterative optimization algorithm widely used in DL; it gradually corrects the parameters θ in order to minimize the cost function J(θ) [7]. The GD algorithm uses the whole dataset for each parameter update. This approach is the most precise of the gradient algorithms, but also the most expensive given the number of calculations to be performed. To overcome this drawback, several variants of this algorithm have been implemented.
The stochastic gradient descent algorithm updates the parameters for each training example (or small subset of examples) instead of the whole dataset [8]. It is structured as follows:
Input: learning rate η, initial θ
Repeat until stop criteria is met:
    For each training example (x(i), y(i)) in the dataset:
        Compute the gradient step: step ← −η·∇θ J(f(x(i); θ), y(i))
        Apply the update: θ ← θ + step
    Endfor
EndRepeat
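As an illustration, a minimal Python sketch of such per-example updates, applied here to fitting a line y ≈ θ0 + θ1·x under a squared-error cost, could look as follows (the function name sgd_fit and the use of NumPy are assumptions made for this example, not the code of the case study presented later):

import numpy as np

def sgd_fit(x, y, lr=0.01, epochs=50):
    # Fit y ~ theta[0] + theta[1] * x by per-example (stochastic) updates.
    theta = np.zeros(2)
    for _ in range(epochs):
        for xi, yi in zip(x, y):                  # one update per training example
            error = (theta[0] + theta[1] * xi) - yi
            grad = np.array([error, error * xi])  # gradient of 0.5 * error**2
            theta -= lr * grad                    # theta <- theta - eta * gradient
    return theta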
The Momentum variant accelerates gradient descent by accumulating a velocity m, an exponentially decaying average of past gradients, and moving in its direction [10]:

m ← β·m − η·∇θJ(θ)        (1)
θ ← θ + m        (2)
The Momentum gradient algorithm is described as follows:
Input: learning rate η, momentum β, initial θ, initial velocity m
While stop criteria is not met do:
    Compose a minibatch of n examples x(i) from the dataset and corresponding targets y(i)
    Compute gradient estimate: g ← (1/n)·∇θ Σi J(f(x(i); θ), y(i))
    Compute velocity update: m ← β·m − η·g
    Apply the update: θ ← θ + m
EndWhile
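A minimal Python sketch of one momentum step of equations (1) and (2), assuming a user-supplied function grad_J that returns the gradient of the cost for given parameters, could be:

def momentum_step(theta, m, grad_J, lr=0.01, beta=0.9):
    g = grad_J(theta)        # gradient of the cost at the current parameters
    m = beta * m - lr * g    # velocity update: m <- beta*m - eta*g
    return theta + m, m      # updated parameters and velocity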
The Nesterov accelerated gradient variant evaluates the gradient at the look-ahead position θ + β·m instead of the current position θ [11]:

m ← β·m − η·∇θJ(θ + β·m)        (3)
θ ← θ + m        (4)

The Nesterov algorithm is depicted as follows:
Input: learning rate η, momentum β, initial θ, initial velocity m
While stop criteria is not met do:
    Compose a minibatch of n examples x(i) from the dataset and corresponding targets y(i)
    Apply an intermediate update: θ̃ ← θ + β·m
    Compute gradient estimate at the intermediate point: g ← (1/n)·∇θ̃ Σi J(f(x(i); θ̃), y(i))
    Compute velocity update: m ← β·m − η·g
    Apply the update: θ ← θ + m
EndWhile
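A corresponding sketch of one Nesterov step, under the same grad_J assumption, differs from the momentum step only in where the gradient is evaluated:

def nesterov_step(theta, m, grad_J, lr=0.01, beta=0.9):
    g = grad_J(theta + beta * m)  # gradient at the look-ahead point theta + beta*m
    m = beta * m - lr * g         # velocity update
    return theta + m, m           # updated parameters and velocity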
These algorithms use the same learning rate for all parameters. However, approaching a minimum with a badly chosen learning rate produces fluctuations around the minima. The AdaGrad algorithm adapts the learning rate of each parameter individually, using the history of its squared gradients [12]:
s ← s + ∇θJ(θ) ⊗ ∇θJ(θ)        (5)
θ ← θ − η·∇θJ(θ) / (√s + β)        (6)

where η is the initial learning rate, s accumulates, for each dimension, the sum of the squared gradients of all previous iterations, and β is a smoothing constant. The AdaGrad algorithm is structured as follows [13]:
Input: learning rate η, a small constant β (about 10⁻⁷, for numerical stability), initial θ.
Initialize the gradient accumulation variable r ← 0
While stop criteria is not met do:
    Compose a minibatch of n examples x(i) from the dataset and corresponding targets y(i)
    Compute gradient estimate: g ← (1/n)·∇θ Σi J(f(x(i); θ), y(i))
    Accumulate squared gradient: r ← r + g ⊗ g
    Calculate update: m ← −(η / (√r + β))·g
    Apply update: θ ← θ + m
EndWhile
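A minimal Python sketch of one AdaGrad step of equations (5) and (6), under the same grad_J assumption, could be:

import numpy as np

def adagrad_step(theta, r, grad_J, lr=0.01, beta=1e-7):
    g = grad_J(theta)
    r = r + g * g                                  # accumulate squared gradients
    theta = theta - lr * g / (np.sqrt(r) + beta)   # per-parameter scaled step
    return theta, r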
As the sum of the squared gradients accumulates, the effective learning rate becomes smaller and smaller, which is a weakness of the AdaGrad algorithm. The RMSProp algorithm corrects this by replacing the sum with an exponentially decaying average of the squared gradients:

s ← β·s + (1 − β)·∇θJ(θ) ⊗ ∇θJ(θ)        (7)
θ ← θ − η·∇θJ(θ) / √(s + α)        (8)

where α is a small smoothing constant.
The RMSProp algorithm is structured as follows:
Input: learning rate η, decay rate β, small constant α (about 10⁻⁷), initial θ.
Initialize the gradient accumulation variable r ← 0
While stop criteria is not met do:
    Compose a minibatch of n examples x(i) from the dataset and corresponding targets y(i)
    Compute gradient estimate: g ← (1/n)·∇θ Σi J(f(x(i); θ), y(i))
    Accumulate squared gradient: r ← β·r + (1 − β)·g ⊗ g
    Calculate update: m ← −(η / √(r + α))·g
    Apply update: θ ← θ + m
EndWhile
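A minimal Python sketch of one RMSProp step of equations (7) and (8), under the same assumptions, could be:

import numpy as np

def rmsprop_step(theta, r, grad_J, lr=0.01, beta=0.9, alpha=1e-7):
    g = grad_J(theta)
    r = beta * r + (1.0 - beta) * g * g      # decaying average of squared gradients
    theta = theta - lr * g / np.sqrt(r + alpha)
    return theta, r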
The AdaDelta algorithm [16] replaces the learning rate η in the previous update rule with the root mean square of the past parameter updates RMS[Δθ]_{t-1}, where RMS[x]_t = √(E[x²]_t + ε) and E[x²]_t is an exponentially decaying average of x². This finally provides the AdaDelta update equations:

Δθ_t ← −(RMS[Δθ]_{t-1} / RMS[g]_t)·g_t        (10)
θ_{t+1} ← θ_t + Δθ_t        (11)
The learning rate is eliminated from the parameter update expression for the AdaDelta algorithm.
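A minimal Python sketch of one AdaDelta step (the decay rate rho, the constant eps and the gradient function grad_J are assumptions for this example) shows that no global learning rate appears in the update:

import numpy as np

def adadelta_step(theta, acc_g, acc_dx, grad_J, rho=0.95, eps=1e-6):
    g = grad_J(theta)
    acc_g = rho * acc_g + (1.0 - rho) * g * g                # E[g^2]
    dx = -np.sqrt(acc_dx + eps) / np.sqrt(acc_g + eps) * g   # RMS[dθ]_{t-1}/RMS[g]_t * g
    acc_dx = rho * acc_dx + (1.0 - rho) * dx * dx            # E[dθ^2]
    return theta + dx, acc_g, acc_dx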
The Adam algorithm [17] combines the adaptive per-parameter learning rates of RMSProp with an exponentially decaying average of the past gradients m, as for Momentum. The Adam equations are described as follows:

m_t ← β1·m_{t-1} + (1 − β1)·g_t        (12)
v_t ← β2·v_{t-1} + (1 − β2)·g_t²        (13)
m̂_t ← m_t / (1 − β1^t)        (14)
v̂_t ← v_t / (1 − β2^t)        (15)
θ ← θ − η·m̂_t / (√v̂_t + α)        (16)
The Adam algorithm is structured as follows:
Input: learning rate η, decay rates β1 and β2, small constant α (about 10⁻⁷), initial θ.
Initialize m_0 ← 0; v_0 ← 0; t ← 0 (initialize 1st moment, 2nd moment and time step)
While θ_t not converged do:
    t ← t + 1
    Calculate gradients for step t: g_t ← ∇θ J_t(θ_{t-1})
    Update biased 1st moment estimate: m_t ← β1·m_{t-1} + (1 − β1)·g_t
    Update biased 2nd raw moment estimate: v_t ← β2·v_{t-1} + (1 − β2)·g_t²
    Calculate bias-corrected 1st moment estimate: m̂_t ← m_t / (1 − β1^t)
    Calculate bias-corrected 2nd raw moment estimate: v̂_t ← v_t / (1 − β2^t)
    Update parameters: θ_t ← θ_{t-1} − η·m̂_t / (√v̂_t + α)
EndWhile
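A minimal Python sketch of one Adam step of equations (12) to (16), under the same assumptions, could be:

import numpy as np

def adam_step(theta, m, v, t, grad_J, lr=0.001, beta1=0.9, beta2=0.999, alpha=1e-7):
    t += 1
    g = grad_J(theta)
    m = beta1 * m + (1.0 - beta1) * g        # biased 1st moment estimate
    v = beta2 * v + (1.0 - beta2) * g * g    # biased 2nd raw moment estimate
    m_hat = m / (1.0 - beta1 ** t)           # bias correction
    v_hat = v / (1.0 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + alpha)
    return theta, m, v, t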
The AdaMax algorithm is an extension of Adam based on the infinity norm. In Adam, the factor v_t scales the gradient inversely proportionally to the L2 norm of the past gradients (the v_{t-1} term) and of the current gradient g_t [17]:

v_t ← β2·v_{t-1} + (1 − β2)·|g_t|²        (17)

This update can be generalized to the Lp norm:

v_t ← β2^p·v_{t-1} + (1 − β2^p)·|g_t|^p        (18)
The authors of AdaMax [17] show that v_t with the L∞ norm converges to a more stable value. To avoid confusion with Adam, the infinity-norm-constrained v_t is denoted u_t:

u_t ← β2^∞·v_{t-1} + (1 − β2^∞)·|g_t|^∞ = max(β2·v_{t-1}, |g_t|)        (19)

The obtained AdaMax update rule is then as follows:

θ_{t+1} ← θ_t − (η / u_t)·m̂_t        (20)
The AdaMax algorithm is structured as follows:
Input: learning rate η, decay rates β1 and β2, initial θ.
Initialize m_0 ← 0; u_0 ← 0; t ← 0 (initialize 1st moment, infinity norm and time step)
While θ_t not converged do:
    t ← t + 1
    Calculate gradients for step t: g_t ← ∇θ J_t(θ_{t-1})
    Update biased 1st moment estimate: m_t ← β1·m_{t-1} + (1 − β1)·g_t
    Update exponentially weighted infinity norm: u_t ← max(β2·u_{t-1}, |g_t|)
    Update parameters: θ_t ← θ_{t-1} − (η / (1 − β1^t))·m_t / u_t
EndWhile
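A minimal Python sketch of one AdaMax step of equations (19) and (20), under the same assumptions, could be (a tiny constant is added to the denominator only to avoid a division by zero at the very first steps):

import numpy as np

def adamax_step(theta, m, u, t, grad_J, lr=0.002, beta1=0.9, beta2=0.999):
    t += 1
    g = grad_J(theta)
    m = beta1 * m + (1.0 - beta1) * g       # biased 1st moment estimate
    u = np.maximum(beta2 * u, np.abs(g))    # exponentially weighted infinity norm
    theta = theta - (lr / (1.0 - beta1 ** t)) * m / (u + 1e-12)  # small constant for safety
    return theta, m, u, t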
The Nadam algorithm combines the Adam algorithm and the Nesterov accelerated gradient. To incorporate Nesterov momentum into Adam, a rewriting of the momentum update is considered for better clarity: the momentum step of iteration t + 1 is applied when updating iteration t, instead of at iteration t + 1, as follows [18]:

m_t ← β_t·m_{t-1} + η·g_t        (22)
θ_t ← θ_{t-1} − (β_{t+1}·m_t + η·g_t)        (23)
The momentum and gradient steps here both depend on the current gradient. Applying the same modification to the Adam algorithm gives the following equations [18]:

θ_t ← θ_{t-1} − η·[ β_t·m_{t-1} / (1 − ∏_{i=1..t} β_i) + (1 − β_t)·g_t / (1 − ∏_{i=1..t} β_i) ]        (24)
θ_t ← θ_{t-1} − η·[ β_{t+1}·m_t / (1 − ∏_{i=1..t+1} β_i) + (1 − β_t)·g_t / (1 − ∏_{i=1..t} β_i) ]        (25)
The Nadam algorithm is structured as follows:
Input: learning rate η, momentum coefficients β_1, ..., β_t, decay rate ν, small constant α (about 10⁻⁷), initial θ.
Initialize m_0 ← 0; n_0 ← 0; t ← 0
While θ_t not converged do:
    t ← t + 1
    Calculate gradients for step t: g_t ← ∇θ J_t(θ_{t-1})
    Update 1st moment estimate: m_t ← β_t·m_{t-1} + (1 − β_t)·g_t
    Update 2nd raw moment estimate: n_t ← ν·n_{t-1} + (1 − ν)·g_t²
    Calculate bias-corrected estimates:
        m̂_t ← β_{t+1}·m_t / (1 − ∏_{i=1..t+1} β_i) + (1 − β_t)·g_t / (1 − ∏_{i=1..t} β_i)
        n̂_t ← ν·n_t / (1 − ν^t)
    Update parameters: θ_t ← θ_{t-1} − η·m̂_t / (√n̂_t + α)
EndWhile
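A minimal Python sketch of one Nadam step, written in the simplified form with a constant momentum coefficient β1 (so that the products of β_i reduce to powers of β1; grad_J is again an assumed, user-supplied gradient function), could be:

import numpy as np

def nadam_step(theta, m, n, t, grad_J, lr=0.002, beta1=0.9, nu=0.999, alpha=1e-7):
    t += 1
    g = grad_J(theta)
    m = beta1 * m + (1.0 - beta1) * g              # 1st moment estimate
    n = nu * n + (1.0 - nu) * g * g                # 2nd raw moment estimate
    m_hat = m / (1.0 - beta1 ** (t + 1))           # look-ahead bias correction
    g_hat = g / (1.0 - beta1 ** t)
    m_bar = beta1 * m_hat + (1.0 - beta1) * g_hat  # Nesterov combination
    n_hat = nu * n / (1.0 - nu ** t)               # bias-corrected 2nd moment, as in [18]
    theta = theta - lr * m_bar / (np.sqrt(n_hat) + alpha)
    return theta, m, n, t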
3. Related works
Several research works have been carried out to improve the performance of the gradient descent algorithm and to compare its different variants. The authors in [19] provided a comparative study of stochastic gradient descent based optimization techniques (momentum, Adam, AdaGrad, AdaDelta and RMSProp). In this study, the authors compared the advantages and disadvantages of these approaches, considering the convergence time, the number of fluctuations and the update rate of features on selected test functions. In [15], the authors proposed an analysis of the RMSProp algorithm for training deep neural networks and suggested two variants of this algorithm, SC-Adagrad and SC-RMSProp, for which they showed logarithmic regret bounds for strongly convex functions. The authors in [20] proved that the ADAM and RMSProp algorithms are guaranteed to reach criticality for smooth non-convex objectives, and experimentally studied the convergence and generalization properties of RMSProp and ADAM against Nesterov's Accelerated Gradient method on a variety of common autoencoder setups; these experiments notably highlighted the sensitivity of ADAM to its momentum parameter. In [21], the authors carried out a comparative study of the previously mentioned algorithms, evaluated in terms of convergence speed, accuracy and loss function. In [22], the authors proposed a comparative experimental analysis of different stochastic optimization algorithms for image registration in the spatial domain. The authors of [12] and [23] provide analytical studies of GD algorithms and propose improvements to increase their performance. The present work is an experimental comparative study of the nine gradient descent variants detailed above, since few works in the state of the art have compared them all.
4. Case Study
The following case study is implemented in Python in order to compare the performances of the stochastic, momentum, Nesterov, AdaGrad, RMSProp, AdaDelta, Adam, AdaMax and Nadam GD algorithms in terms of convergence speed and mean absolute error of the different generated solutions. The implementation is based on an extract of the keratoconus dataset of Harvard Dataverse [24]. The dataset used is composed of two columns and 96 rows. Structured in a CSV file, this dataset represents the relationship between the flat and steep corneal meridians, as shown in figure 1 below.
Figure 2. Stochastic solution (a) and Global cost error function (b).
Figure 3. Momentum solution (a) and Global cost error function (b).
Figure 4. Nesterov solution (a) and Global cost error function (b).
Figure 5. AdaGrad solution (a) and Global cost error function (b).
Figure 6. RMSProp solution (a) and Global cost error function (b).
Figure 7. Adam solution (a) and Global cost error function (b).
Figure 8. AdaDelta solution (a) and Global cost error function (b).
Figure 9. AdaMax solution (a) and Global cost error function (b).
Figure 10. Nadam solution (a) and Global cost error function (b).
Overall, all generated solutions fit the points of the dataset as closely as possible, and after a certain number of iterations the global cost error functions stabilize; this stability indicates the convergence of the different studied algorithms.
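To illustrate how such a comparison can be set up in Python, the following sketch fits a simple linear model to the two columns of a CSV extract with several Keras optimizers and reports the final mean absolute error; the file name, column order, standardization step, learning rates and number of epochs are assumptions made for this illustration, not the exact experimental setup of this work:

import pandas as pd
import tensorflow as tf

# Hypothetical file name and column order, for illustration only.
data = pd.read_csv("keratoconus_extract.csv")
x = data.iloc[:, 0].values.astype("float32").reshape(-1, 1)  # flat corneal meridian
y = data.iloc[:, 1].values.astype("float32").reshape(-1, 1)  # steep corneal meridian
x = (x - x.mean()) / x.std()   # standardize for numerical stability (extra choice for this sketch)
y = (y - y.mean()) / y.std()

optimizers = {
    "SGD": tf.keras.optimizers.SGD(learning_rate=0.01),
    "Momentum": tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9),
    "Nesterov": tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9, nesterov=True),
    "AdaGrad": tf.keras.optimizers.Adagrad(learning_rate=0.01),
    "RMSProp": tf.keras.optimizers.RMSprop(learning_rate=0.01),
    "AdaDelta": tf.keras.optimizers.Adadelta(learning_rate=1.0),
    "Adam": tf.keras.optimizers.Adam(learning_rate=0.01),
    "AdaMax": tf.keras.optimizers.Adamax(learning_rate=0.01),
    "Nadam": tf.keras.optimizers.Nadam(learning_rate=0.01),
}

results = {}
for name, opt in optimizers.items():
    model = tf.keras.Sequential([tf.keras.Input(shape=(1,)), tf.keras.layers.Dense(1)])
    model.compile(optimizer=opt, loss="mae")          # mean absolute error as the cost
    history = model.fit(x, y, epochs=100, verbose=0)
    results[name] = history.history["loss"][-1]       # final MAE of the fit
print(results)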
5. Results and discussion
Figures 2, 3, 4, 5, 6, 7, 8, 9 and 10 represent respectively the optimization solutions and error cost
functions obtained by the stochastic, Momentum, Nesterov, AdaGrad, RMSProp, Adam, AdaDelta,
AdaMax and Nadam gradient descent algorithms. Table 1 below summarizes the performance of the different studied algorithms in terms of the number of iterations and the mean absolute error (MAE) of each solution:
Table 1. Comparison of the performances of the studied algorithms.

Algorithm      Number of iterations      MAE
Stochastic     85 (fixed)                0.7221
Momentum       160                       0.3673
Nesterov       70                        ≈0.38
AdaGrad        53                        0.3858
RMSProp        210                       0.3788
AdaDelta       6000                      0.6035
Adam           90                        ≈0.38
AdaMax         95                        ≈0.38
Nadam          70                        ≈0.38
The obtained results show that the Stochastic and AdaDelta gradient algorithms, figures 2 and 8 respectively, present the lowest performances, with the largest mean absolute errors, 0.7221 and 0.6035 respectively, as well as the greatest number of iterations, 6000 for the AdaDelta algorithm (the number of iterations is fixed at 85 for the stochastic gradient). The algorithms of Nesterov in figure 4, Adam in figure 7, AdaMax in figure 9, AdaGrad in figure 5 and Nadam in figure 10 show almost similar performances for the mean absolute error, which is of the order of 0.38. On the other hand, there is a remarkable difference in the number of iterations carried out by each algorithm, with a distinction for the AdaGrad algorithm, which performed the fewest iterations with only 53, followed by the algorithms of Nesterov, AdaMax, Adam and Nadam with 70, 95, 90 and 70 iterations respectively. The RMSProp algorithm in figure 6 and the Momentum algorithm in figure 3 achieve the best MAE, of the order of 0.3788 and 0.3673 respectively, at the cost of a larger number of iterations, 210 and 160 iterations for RMSProp and Momentum respectively. Among the studied algorithms, the AdaGrad algorithm is of great interest for later use in a keratoconus detection and classification project, given its good performance in terms of convergence speed and MAE.
This work thus provides a more detailed and in-depth study of the different gradient descent algorithms, applied to ophthalmological data.
6. Conclusion
This work presented a comparative study of the Stochastic, Momentum, Nesterov, AdaGrad, AdaDelta, RMSProp, Adam, AdaMax and Nadam gradient descent optimization algorithms in terms of convergence speed and mean absolute error of the generated solutions. Among the studied algorithms, the AdaGrad algorithm provides the best compromise, with a MAE of 0.3858 in only 53 iterations. This algorithm is interesting considering the quality of the minimization it provides and its convergence speed, which makes it possible to use it in future works on larger datasets. The Stochastic and AdaDelta algorithms present the lowest performances, with MAE of 0.7221 and 0.6035 respectively, and 6000 iterations for AdaDelta. This study was carried out in order to facilitate the choice of the most efficient algorithm for later use in a keratoconus detection project based on the analysis of eye topographic images.
References
[1] Stadie B C, Abbeel P and Sutskever I 2017 Third-Person Imitation Learning (arXiv:1703.01703).
[2] Lavric A and Valentin P 2019 KeratoDetect: Keratoconus Detection Algorithm Using
Convolutional Neural Networks Comput. Intell. Neurosci 2019 p 1–9.
[3] Chai Y, Liu H and Xu J 2018 Glaucoma diagnosis based on both hidden features and domain
knowledge through deep learning models Knowledge-Based Syst 161 p 147–156.
[4] Gargeya R and Leng T 2017 Automated Identification of Diabetic Retinopathy Using Deep
Learning Ophthalmology 124 p 962–969.
[5] Kamiya K, Ayatsuka Y, Kato Y, Fujimura F, Takahashi M, Shoji N, Mori Y and Miyata K 2019
Keratoconus detection using deep learning of colour-coded maps with anterior segment optical
coherence tomography: A diagnostic accuracy study BMJ Open 9 p 1–7.
[6] Ruder S 2016 An overview of gradient descent optimization algorithms (arXiv:1609.04747) p 1–
14.
[7] Aatila M, Lachgar M and Kartit A 2020 An Overview of Gradient Descent Algorithm
Optimization in Machine Learning: Application in the Ophthalmology Field International
Conference on Smart Applications and Data Analysis Springer Marrakech Morocco p 349-
359.
[8] Mandt S, Hoffman M D and Blei D M 2017 Stochastic Gradient Descent as Approximate
Bayesian Inference (arXiv:1704.04289) p 1–35.
[9] Botev A, Lever G and Barber D 2017 Nesterov’s Accelerated Gradient and Momentum as
approximations to Regularised Update Descent International Joint Conference on Neural
Networks (IJCNN) IEEE Anchorage AK USA p 1899–1903.
[10] Qian N 1999 On the momentum term in gradient descent learning algorithms Neural Networks 12 p 145–151.
[11] Nesterov Y 1983 A method of solving a convex programming problem with convergence rate O(1/k²) Soviet Mathematics Doklady 27 p 372–376.
[12] Duchi J, Hazan E and Singer Y 2010 Adaptive subgradient methods for online learning and
stochastic optimization Conf. Learn. Theory p 257–269.
[13] Zhang N, Lei D and Zhao J F 2019 An Improved Adagrad Gradient Descent Optimization
Algorithm Proc 2018 Chinese Automation Congress p 2359–2362.
[14] Hadgu A T, Nigam A and Diaz-Aviles E 2015 Large-scale learning with AdaGrad on Spark Proc.
2015 IEEE Int. Conf. Big Data 2015 2 Santa Clara USA p 2828–2830.
[15] Chandra M and Matthias M 2017 Variants of RMSProp and Adagrad with Logarithmic Regret
Bounds (arXiv:1706.05507).
[16] Zeiler M D 2012 ADADELTA: An Adaptive Learning Rate Method (arXiv:1212.5701).
[17] Kingma D P and Ba J L 2015 Adam: A method for stochastic optimization 3rd International
Conference on Learning Representations San Diego p 1–15.
[18] Dozat T 2016 Incorporating Nesterov Momentum into Adam ICLR Workshop p 2013–2016.
[19] Yazan Y and Talu M F 2017 Comparison of the stochastic gradient descent based optimization
techniques International Artificial Intelligence and Data Processing Symposium IEEE
Malatya Turkey p 1–5.
[20] De S, Mukherjee A and Ullah E 2018 Convergence guarantees for RMSProp and ADAM in non-
convex optimization and an empirical comparison to Nesterov acceleration
(arXiv:1807.06766).
[21] Dogo E M, Afolabi O J, Nwulu N I, Twala B and Aigbavboa C O 2018 A Comparative Analysis
of Gradient Descent-Based Optimization Algorithms on Convolutional Neural Networks
International Conference on Computational Techniques, Electronics and Mechanical Systems
IEEE Belgaum India p 92–99.
[22] Voronov S, Voronov I and Kovalenko R 2018 Comparative analysis of stochastic optimization
algorithms for image registration IV International Conference on Information Technology and
Nanotechnology.
[23] Hui Z, Zaiyi C, Chuan Q, Zai H, Vincent W Z, Tong X and Enhong C 2020 Adam revisited: a
weighted past gradients perspective Front. Comput. Sci 14.
[24] Yousefi S, Yousefi E, Takahashi H, Hayashi T, Tampo H, Inoda S, Arai Y and Asbell P 2018 Replication data for "Keratoconus severity identification using unsupervised machine learning" (PLOS ONE) Harvard Dataverse.