Abstract. Optimization is a branch of mathematics that aims to model, analyse and solve, analytically or numerically, problems of minimizing or maximizing a function over a specific dataset. Several optimization algorithms are used in systems based on deep learning (DL), such as the gradient descent (GD) algorithm. Given the importance and efficiency of the GD algorithm, several research works have improved it and produced other variants that have also been very successful in DL. This paper presents a comparative study of the stochastic, momentum, Nesterov, AdaGrad, RMSProp, AdaDelta, Adam, AdaMax and Nadam gradient descent algorithms, based on the convergence speed of these algorithms as well as on the mean absolute error of each algorithm in generating an optimization solution. The obtained results show that the AdaGrad algorithm achieves the best performance among the studied algorithms, with a mean absolute error (MAE) of 0.3858 in 53 iterations, while AdaDelta shows the lowest performance, with a MAE of 0.6035 in 6000 iterations. The case study treated in this work is based on an extract of the keratoconus dataset of Harvard Dataverse, and the results are obtained using Python.
1. Introduction
AI is difficult to define, yet it can be presented as a set of theories and techniques that aim to make computer systems capable of imitating some human behaviours such as reasoning, task planning, decision-making and learning [1].
The expectations of AI in the health sector are promising. Intelligent systems can assist in the diagnosis and detection of several diseases such as keratoconus [2], glaucoma [3] and diabetic retinopathy [4] in ophthalmology, for example. These systems are mainly based on the analysis of biomedical images, taking advantage of the benefits of DL tools for the classification, prediction, and treatment of these diseases. Detection and classification of keratoconus using DL requires the analysis of topographic maps of the eye with a high number of features [5], which makes the learning phase of these systems very complicated and slow. Gradient descent (GD) is an optimization algorithm that is widely used in DL [6].
This work presents a comparative study of the stochastic, momentum, Nesterov, AdaGrad, RMSProp, AdaDelta, Adam, AdaMax and Nadam GD algorithms. The rest of the paper is organized as follows: section 2 presents the GD algorithm and its different variants; related works are presented in section 3; section 4 presents a case study; section 5 discusses the obtained results; and the last section concludes the paper and presents its perspectives.
2. Gradient descent algorithm and its variants
The GD algorithm is an iterative optimization algorithm widely used in DL; it gradually corrects the parameters θ in order to minimize the cost function J(θ) [7]. The GD algorithm uses the whole dataset for each parameter update. This approach is the most precise of the gradient algorithms, but also the most expensive given the number of calculations to be performed. To overcome this drawback, several variants of this algorithm have been implemented.
The stochastic gradient descent algorithm updates the parameters for each training example (or small subset of examples) instead of the whole dataset [8]. It is structured as follows:
Input: learning rate η, initial θ
Repeat until stop criteria is met:
    For each training example (x(i), y(i)) in the dataset:
        Compute the gradient step: step ← −η·∇θ J(f(x(i); θ), y(i))
        Apply the update: θ ← θ + step
    Endfor
EndRepeat
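As an illustration, a minimal Python sketch of such per-example updates, applied here to fitting a line y ≈ θ0 + θ1·x under a squared-error cost, could look as follows (the function name sgd_fit and the use of NumPy are assumptions made for this example, not the code of the case study presented later):

import numpy as np

def sgd_fit(x, y, lr=0.01, epochs=50):
    # Fit y ~ theta[0] + theta[1] * x by per-example (stochastic) updates.
    theta = np.zeros(2)
    for _ in range(epochs):
        for xi, yi in zip(x, y):                  # one update per training example
            error = (theta[0] + theta[1] * xi) - yi
            grad = np.array([error, error * xi])  # gradient of 0.5 * error**2
            theta -= lr * grad                    # theta <- theta - eta * gradient
    return theta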
The Momentum variant accelerates gradient descent by accumulating a velocity m, an exponentially decaying average of past gradients, and moving in its direction [10]:

m ← β·m − η·∇θJ(θ)        (1)
θ ← θ + m        (2)
The Momentum gradient algorithm is described as follows:
Input: learning rate η, momentum β, initial θ, initial velocity m
While stop criteria is not met do:
    Compose a minibatch of n examples x(i) from the dataset and corresponding targets y(i)
    Compute gradient estimate: g ← (1/n)·∇θ Σi J(f(x(i); θ), y(i))
    Compute velocity update: m ← β·m − η·g
    Apply the update: θ ← θ + m
EndWhile
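A minimal Python sketch of one momentum step of equations (1) and (2), assuming a user-supplied function grad_J that returns the gradient of the cost for given parameters, could be:

def momentum_step(theta, m, grad_J, lr=0.01, beta=0.9):
    g = grad_J(theta)        # gradient of the cost at the current parameters
    m = beta * m - lr * g    # velocity update: m <- beta*m - eta*g
    return theta + m, m      # updated parameters and velocity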
The Nesterov accelerated gradient variant evaluates the gradient at the look-ahead position θ + β·m instead of the current position θ [11]:

m ← β·m − η·∇θJ(θ + β·m)        (3)
θ ← θ + m        (4)

The Nesterov algorithm is depicted as follows:
Input: learning rate η, momentum β, initial θ, initial velocity m
While stop criteria is not met do:
    Compose a minibatch of n examples x(i) from the dataset and corresponding targets y(i)
    Apply an intermediate update: θ̃ ← θ + β·m
    Compute gradient estimate at the intermediate point: g ← (1/n)·∇θ̃ Σi J(f(x(i); θ̃), y(i))
    Compute velocity update: m ← β·m − η·g
    Apply the update: θ ← θ + m
EndWhile
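A corresponding sketch of one Nesterov step, under the same grad_J assumption, differs from the momentum step only in where the gradient is evaluated:

def nesterov_step(theta, m, grad_J, lr=0.01, beta=0.9):
    g = grad_J(theta + beta * m)  # gradient at the look-ahead point theta + beta*m
    m = beta * m - lr * g         # velocity update
    return theta + m, m           # updated parameters and velocity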
These algorithms use the same learning rate for all parameters. However, approaching a minimum with a badly chosen learning rate produces fluctuations around the minima. The AdaGrad algorithm adapts the learning rate of each parameter individually, using the history of its squared gradients [12]:
s ← s + ∇θJ(θ) ⊗ ∇θJ(θ)        (5)
θ ← θ − η·∇θJ(θ) / (√s + β)        (6)

where η is the initial learning rate, s accumulates, for each dimension, the sum of the squared gradients of all previous iterations, and β is a smoothing constant. The AdaGrad algorithm is structured as follows [13]:
Input: learning rate η, a small constant β (about 10⁻⁷, for numerical stability), initial θ.
Initialize the gradient accumulation variable r ← 0
While stop criteria is not met do:
    Compose a minibatch of n examples x(i) from the dataset and corresponding targets y(i)
    Compute gradient estimate: g ← (1/n)·∇θ Σi J(f(x(i); θ), y(i))
    Accumulate squared gradient: r ← r + g ⊗ g
    Calculate update: m ← −(η / (√r + β))·g
    Apply update: θ ← θ + m
EndWhile
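A minimal Python sketch of one AdaGrad step of equations (5) and (6), under the same grad_J assumption, could be:

import numpy as np

def adagrad_step(theta, r, grad_J, lr=0.01, beta=1e-7):
    g = grad_J(theta)
    r = r + g * g                                  # accumulate squared gradients
    theta = theta - lr * g / (np.sqrt(r) + beta)   # per-parameter scaled step
    return theta, r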
As the sum of the squared gradients accumulates, the effective learning rate becomes smaller and smaller, which is a weakness of the AdaGrad algorithm. The RMSProp algorithm corrects this by replacing the sum with an exponentially decaying average of the squared gradients:

s ← β·s + (1 − β)·∇θJ(θ) ⊗ ∇θJ(θ)        (7)
θ ← θ − η·∇θJ(θ) / √(s + α)        (8)

where α is a small smoothing constant.
The RMSProp algorithm is structured as follows:
Input: learning rate η, decay rate β, small constant α (about 10⁻⁷), initial θ.
Initialize the gradient accumulation variable r ← 0
While stop criteria is not met do:
    Compose a minibatch of n examples x(i) from the dataset and corresponding targets y(i)
    Compute gradient estimate: g ← (1/n)·∇θ Σi J(f(x(i); θ), y(i))
    Accumulate squared gradient: r ← β·r + (1 − β)·g ⊗ g
    Calculate update: m ← −(η / √(r + α))·g
    Apply update: θ ← θ + m
EndWhile
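A minimal Python sketch of one RMSProp step of equations (7) and (8), under the same assumptions, could be:

import numpy as np

def rmsprop_step(theta, r, grad_J, lr=0.01, beta=0.9, alpha=1e-7):
    g = grad_J(theta)
    r = beta * r + (1.0 - beta) * g * g      # decaying average of squared gradients
    theta = theta - lr * g / np.sqrt(r + alpha)
    return theta, r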
The AdaDelta algorithm [16] replaces the learning rate η in the previous update rule with the root mean square of the past parameter updates RMS[Δθ]_{t-1}, where RMS[x]_t = √(E[x²]_t + ε) and E[x²]_t is an exponentially decaying average of x². This finally provides the AdaDelta update equations:

Δθ_t ← −(RMS[Δθ]_{t-1} / RMS[g]_t)·g_t        (10)
θ_{t+1} ← θ_t + Δθ_t        (11)
The learning rate is eliminated from the parameter update expression for the AdaDelta algorithm.
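A minimal Python sketch of one AdaDelta step (the decay rate rho, the constant eps and the gradient function grad_J are assumptions for this example) shows that no global learning rate appears in the update:

import numpy as np

def adadelta_step(theta, acc_g, acc_dx, grad_J, rho=0.95, eps=1e-6):
    g = grad_J(theta)
    acc_g = rho * acc_g + (1.0 - rho) * g * g                # E[g^2]
    dx = -np.sqrt(acc_dx + eps) / np.sqrt(acc_g + eps) * g   # RMS[dθ]_{t-1}/RMS[g]_t * g
    acc_dx = rho * acc_dx + (1.0 - rho) * dx * dx            # E[dθ^2]
    return theta + dx, acc_g, acc_dx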
The Adam algorithm [17] combines the adaptive per-parameter learning rates of RMSProp with an exponentially decaying average of the past gradients m, as for Momentum. The Adam equations are described as follows:

m_t ← β1·m_{t-1} + (1 − β1)·g_t        (12)
v_t ← β2·v_{t-1} + (1 − β2)·g_t²        (13)
m̂_t ← m_t / (1 − β1^t)        (14)
v̂_t ← v_t / (1 − β2^t)        (15)
θ ← θ − η·m̂_t / (√v̂_t + α)        (16)
The Adam algorithm is structured as follows:
Input: learning rate η, decay rates β1 and β2, small constant α (about 10⁻⁷), initial θ.
Initialize m_0 ← 0; v_0 ← 0; t ← 0 (initialize 1st moment, 2nd moment and time step)
While θ_t not converged do:
    t ← t + 1
    Calculate gradients for step t: g_t ← ∇θ J_t(θ_{t-1})
    Update biased 1st moment estimate: m_t ← β1·m_{t-1} + (1 − β1)·g_t
    Update biased 2nd raw moment estimate: v_t ← β2·v_{t-1} + (1 − β2)·g_t²
    Calculate bias-corrected 1st moment estimate: m̂_t ← m_t / (1 − β1^t)
    Calculate bias-corrected 2nd raw moment estimate: v̂_t ← v_t / (1 − β2^t)
    Update parameters: θ_t ← θ_{t-1} − η·m̂_t / (√v̂_t + α)
EndWhile
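A minimal Python sketch of one Adam step of equations (12) to (16), under the same assumptions, could be:

import numpy as np

def adam_step(theta, m, v, t, grad_J, lr=0.001, beta1=0.9, beta2=0.999, alpha=1e-7):
    t += 1
    g = grad_J(theta)
    m = beta1 * m + (1.0 - beta1) * g        # biased 1st moment estimate
    v = beta2 * v + (1.0 - beta2) * g * g    # biased 2nd raw moment estimate
    m_hat = m / (1.0 - beta1 ** t)           # bias correction
    v_hat = v / (1.0 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + alpha)
    return theta, m, v, t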
The AdaMax algorithm is an extension of Adam based on the infinity norm. In Adam, the factor v_t scales the gradient inversely proportionally to the L2 norm of the past gradients (the v_{t-1} term) and of the current gradient g_t [17]:

v_t ← β2·v_{t-1} + (1 − β2)·|g_t|²        (17)

This update can be generalized to the Lp norm:

v_t ← β2^p·v_{t-1} + (1 − β2^p)·|g_t|^p        (18)
The authors of AdaMax [17] show that v_t with the L∞ norm converges to a more stable value. To avoid confusion with Adam, the infinity-norm-constrained v_t is denoted u_t:

u_t ← β2^∞·v_{t-1} + (1 − β2^∞)·|g_t|^∞ = max(β2·v_{t-1}, |g_t|)        (19)

The obtained AdaMax update rule is then as follows:

θ_{t+1} ← θ_t − (η / u_t)·m̂_t        (20)
The AdaMax algorithm is structured as follows:
Input: learning rate η, decay rates β1 and β2, initial θ.
Initialize m_0 ← 0; u_0 ← 0; t ← 0 (initialize 1st moment, infinity norm and time step)
While θ_t not converged do:
    t ← t + 1
    Calculate gradients for step t: g_t ← ∇θ J_t(θ_{t-1})
    Update biased 1st moment estimate: m_t ← β1·m_{t-1} + (1 − β1)·g_t
    Update exponentially weighted infinity norm: u_t ← max(β2·u_{t-1}, |g_t|)
    Update parameters: θ_t ← θ_{t-1} − (η / (1 − β1^t))·m_t / u_t
EndWhile
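A minimal Python sketch of one AdaMax step of equations (19) and (20), under the same assumptions, could be (a tiny constant is added to the denominator only to avoid a division by zero at the very first steps):

import numpy as np

def adamax_step(theta, m, u, t, grad_J, lr=0.002, beta1=0.9, beta2=0.999):
    t += 1
    g = grad_J(theta)
    m = beta1 * m + (1.0 - beta1) * g       # biased 1st moment estimate
    u = np.maximum(beta2 * u, np.abs(g))    # exponentially weighted infinity norm
    theta = theta - (lr / (1.0 - beta1 ** t)) * m / (u + 1e-12)  # small constant for safety
    return theta, m, u, t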
The Nadam algorithm combines the Adam algorithm and the Nesterov accelerated gradient. To incorporate Nesterov momentum into Adam, a rewriting of the momentum update is considered for better clarity: the momentum step of iteration t + 1 is applied when updating iteration t, instead of at iteration t + 1, as follows [18]:

m_t ← β_t·m_{t-1} + η·g_t        (22)
θ_t ← θ_{t-1} − (β_{t+1}·m_t + η·g_t)        (23)
The momentum and gradient steps here both depend on the current gradient. Applying the same modification to the Adam algorithm gives the following equations [18]:

θ_t ← θ_{t-1} − η·[ β_t·m_{t-1} / (1 − ∏_{i=1..t} β_i) + (1 − β_t)·g_t / (1 − ∏_{i=1..t} β_i) ]        (24)
θ_t ← θ_{t-1} − η·[ β_{t+1}·m_t / (1 − ∏_{i=1..t+1} β_i) + (1 − β_t)·g_t / (1 − ∏_{i=1..t} β_i) ]        (25)
The Nadam algorithm is structured as follows:
Input: learning rate η, momentum coefficients β_1, ..., β_t, decay rate ν, small constant α (about 10⁻⁷), initial θ.
Initialize m_0 ← 0; n_0 ← 0; t ← 0
While θ_t not converged do:
    t ← t + 1
    Calculate gradients for step t: g_t ← ∇θ J_t(θ_{t-1})
    Update 1st moment estimate: m_t ← β_t·m_{t-1} + (1 − β_t)·g_t
    Update 2nd raw moment estimate: n_t ← ν·n_{t-1} + (1 − ν)·g_t²
    Calculate bias-corrected estimates:
        m̂_t ← β_{t+1}·m_t / (1 − ∏_{i=1..t+1} β_i) + (1 − β_t)·g_t / (1 − ∏_{i=1..t} β_i)
        n̂_t ← ν·n_t / (1 − ν^t)
    Update parameters: θ_t ← θ_{t-1} − η·m̂_t / (√n̂_t + α)
EndWhile
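A minimal Python sketch of one Nadam step, written in the simplified form with a constant momentum coefficient β1 (so that the products of β_i reduce to powers of β1; grad_J is again an assumed, user-supplied gradient function), could be:

import numpy as np

def nadam_step(theta, m, n, t, grad_J, lr=0.002, beta1=0.9, nu=0.999, alpha=1e-7):
    t += 1
    g = grad_J(theta)
    m = beta1 * m + (1.0 - beta1) * g              # 1st moment estimate
    n = nu * n + (1.0 - nu) * g * g                # 2nd raw moment estimate
    m_hat = m / (1.0 - beta1 ** (t + 1))           # look-ahead bias correction
    g_hat = g / (1.0 - beta1 ** t)
    m_bar = beta1 * m_hat + (1.0 - beta1) * g_hat  # Nesterov combination
    n_hat = nu * n / (1.0 - nu ** t)               # bias-corrected 2nd moment, as in [18]
    theta = theta - lr * m_bar / (np.sqrt(n_hat) + alpha)
    return theta, m, n, t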
3. Related works
Several research works have been carried out to improve the performance of the gradient descent algorithm and to compare its different variants. The authors in [19] provided a comparative study of stochastic gradient descent based optimization techniques (momentum, Adam, AdaGrad, AdaDelta and RMSProp). In this study, the authors compared the advantages and disadvantages of these approaches, considering the convergence time, the number of fluctuations and the update rate of features on selected test functions. In [15], the authors proposed an analysis of the RMSProp algorithm for training deep neural networks and suggested two variants of this algorithm, SC-Adagrad and SC-RMSProp, for which they showed logarithmic regret bounds for strongly convex functions. The authors in [20] proved that the ADAM and RMSProp algorithms are guaranteed to reach criticality for smooth non-convex objectives, and experimentally studied the convergence and generalization properties of RMSProp and ADAM against Nesterov's Accelerated Gradient method on a variety of common autoencoder setups; these experiments notably highlighted the sensitivity of ADAM to its momentum parameter. In [21], the authors carried out a comparative study of the previously mentioned algorithms, evaluated in terms of convergence speed, accuracy and loss function. In [22], the authors proposed a comparative experimental analysis of different stochastic optimization algorithms for image registration in the spatial domain. The authors of [12] and [23] provide analytical studies of GD algorithms and propose improvements to increase their performance. The present work is an experimental comparative study of the nine gradient descent variants detailed above, since few works in the state of the art have compared them all.
4. Case Study
The following case study is implemented in Python in order to compare the performances of the stochastic, momentum, Nesterov, AdaGrad, RMSProp, AdaDelta, Adam, AdaMax and Nadam GD algorithms in terms of convergence speed and mean absolute error of the different generated solutions. The implementation is based on an extract of the keratoconus dataset of Harvard Dataverse [24]. The dataset used is composed of two columns and 96 rows. Structured in a CSV file, this dataset represents the relationship between the flat and steep corneal meridians, as shown in figure 1 below.
Figure 2. Stochastic solution (a) and Global cost error function (b).
Figure 3. Momentum solution (a) and Global cost error function (b).
Figure 4. Nesterov solution (a) and Global cost error function (b).
Figure 5. AdaGrad solution (a) and Global cost error function (b).
Figure 6. RMSProp solution (a) and Global cost error function (b).
Figure 7. Adam solution (a) and Global cost error function (b).
Figure 8. AdaDelta solution (a) and Global cost error function (b).
Figure 9. AdaMax solution (a) and Global cost error function (b).
Figure 10. Nadam solution (a) and Global cost error function (b).
Overall, all generated solutions fit the points of the dataset as closely as possible, and after a certain number of iterations the global cost error functions stabilize; this stability indicates the convergence of the different studied algorithms.
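To illustrate how such a comparison can be set up in Python, the following sketch fits a simple linear model to the two columns of a CSV extract with several Keras optimizers and reports the final mean absolute error; the file name, column order, standardization step, learning rates and number of epochs are assumptions made for this illustration, not the exact experimental setup of this work:

import pandas as pd
import tensorflow as tf

# Hypothetical file name and column order, for illustration only.
data = pd.read_csv("keratoconus_extract.csv")
x = data.iloc[:, 0].values.astype("float32").reshape(-1, 1)  # flat corneal meridian
y = data.iloc[:, 1].values.astype("float32").reshape(-1, 1)  # steep corneal meridian
x = (x - x.mean()) / x.std()   # standardize for numerical stability (extra choice for this sketch)
y = (y - y.mean()) / y.std()

optimizers = {
    "SGD": tf.keras.optimizers.SGD(learning_rate=0.01),
    "Momentum": tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9),
    "Nesterov": tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9, nesterov=True),
    "AdaGrad": tf.keras.optimizers.Adagrad(learning_rate=0.01),
    "RMSProp": tf.keras.optimizers.RMSprop(learning_rate=0.01),
    "AdaDelta": tf.keras.optimizers.Adadelta(learning_rate=1.0),
    "Adam": tf.keras.optimizers.Adam(learning_rate=0.01),
    "AdaMax": tf.keras.optimizers.Adamax(learning_rate=0.01),
    "Nadam": tf.keras.optimizers.Nadam(learning_rate=0.01),
}

results = {}
for name, opt in optimizers.items():
    model = tf.keras.Sequential([tf.keras.Input(shape=(1,)), tf.keras.layers.Dense(1)])
    model.compile(optimizer=opt, loss="mae")          # mean absolute error as the cost
    history = model.fit(x, y, epochs=100, verbose=0)
    results[name] = history.history["loss"][-1]       # final MAE of the fit
print(results)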
5. Results and discussion
Figures 2, 3, 4, 5, 6, 7, 8, 9 and 10 represent respectively the optimization solutions and error cost
functions obtained by the stochastic, Momentum, Nesterov, AdaGrad, RMSProp, Adam, AdaDelta,
AdaMax and Nadam gradient descent algorithms. Table 1 below summarizes the performance of the different studied algorithms in terms of the number of iterations and the mean absolute error (MAE) of each solution:
Table 1. Comparison of the performances of the studied algorithms.

Algorithm      Number of iterations      MAE
Stochastic     85 (fixed)                0.7221
Momentum       160                       0.3673
Nesterov       70                        ≈0.38
AdaGrad        53                        0.3858
RMSProp        210                       0.3788
AdaDelta       6000                      0.6035
Adam           90                        ≈0.38
AdaMax         95                        ≈0.38
Nadam          70                        ≈0.38
The obtained results show that the Stochastic and AdaDelta gradient algorithms, figures 2 and 8 respectively, present the lowest performances, with the largest mean absolute errors, 0.7221 and 0.6035 respectively, as well as the greatest number of iterations, 6000 for the AdaDelta algorithm (the number of iterations is fixed at 85 for the stochastic gradient). The algorithms of Nesterov in figure 4, Adam in figure 7, AdaMax in figure 9, AdaGrad in figure 5 and Nadam in figure 10 show almost similar performances for the mean absolute error, which is of the order of 0.38. On the other hand, there is a remarkable difference in the number of iterations carried out by each algorithm, with a distinction for the AdaGrad algorithm, which performed the fewest iterations with only 53, followed by the algorithms of Nesterov, AdaMax, Adam and Nadam with 70, 95, 90 and 70 iterations respectively. The RMSProp algorithm in figure 6 and the Momentum algorithm in figure 3 achieve the best MAE, of the order of 0.3788 and 0.3673 respectively, at the cost of a larger number of iterations, 210 and 160 iterations for RMSProp and Momentum respectively. Among the studied algorithms, the AdaGrad algorithm is of great interest for later use in a keratoconus detection and classification project, given its good performance in terms of convergence speed and MAE.
This work thus provides a more detailed and in-depth study of the different gradient descent algorithms, applied to ophthalmological data.
6. Conclusion
This work presented a comparative study of the Stochastic, Momentum, Nesterov, AdaGrad, AdaDelta, RMSProp, Adam, AdaMax and Nadam gradient descent optimization algorithms in terms of convergence speed and mean absolute error of the generated solutions. Among the studied algorithms, the AdaGrad algorithm provides the best compromise, with a MAE of 0.3858 in only 53 iterations. This algorithm is interesting considering the quality of the minimization it provides and its convergence speed, which makes it possible to use it in future works on larger datasets. The Stochastic and AdaDelta algorithms present the lowest performances, with MAE of 0.7221 and 0.6035 respectively, and 6000 iterations for AdaDelta. This study was carried out in order to facilitate the choice of the most efficient algorithm for later use in a keratoconus detection project based on the analysis of eye topographic images.
References
[1] Stadie B C, Abbeel P and Sutskever I 2017 Third-Person Imitation Learning (arXiv:1703.01703).
[2] Lavric A and Valentin P 2019 KeratoDetect: Keratoconus Detection Algorithm Using
Convolutional Neural Networks Comput. Intell. Neurosci 2019 p 1–9.
[3] Chai Y, Liu H and Xu J 2018 Glaucoma diagnosis based on both hidden features and domain
knowledge through deep learning models Knowledge-Based Syst 161 p 147–156.
[4] Gargeya R and Leng T 2017 Automated Identification of Diabetic Retinopathy Using Deep
Learning Ophthalmology 124 p 962–969.
[5] Kamiya K, Ayatsuka Y, Kato Y, Fujimura F, Takahashi M, Shoji N, Mori Y and Miyata K 2019
Keratoconus detection using deep learning of colour-coded maps with anterior segment optical
coherence tomography: A diagnostic accuracy study BMJ Open 9 p 1–7.
[6] Ruder S 2016 An overview of gradient descent optimization algorithms (arXiv:1609.04747) p 1–
14.
[7] Aatila M, Lachgar M and Kartit A 2020 An Overview of Gradient Descent Algorithm
Optimization in Machine Learning: Application in the Ophthalmology Field International
Conference on Smart Applications and Data Analysis Springer Marrakech Morocco p 349-
359.
[8] Mandt S, Hoffman M D and Blei D M 2017 Stochastic Gradient Descent as Approximate
Bayesian Inference (arXiv:1704.04289) p 1–35.
[9] Botev A, Lever G and Barber D 2017 Nesterov’s Accelerated Gradient and Momentum as
approximations to Regularised Update Descent International Joint Conference on Neural
Networks (IJCNN) IEEE Anchorage AK USA p 1899–1903.
[10] Qian N 1999 On the momentum term in gradient descent learning algorithms Neural Networks 12 p 145–151.
[11] Nesterov Y 1983 A method of solving a convex programming problem with convergence rate O(1/k²) Soviet Mathematics Doklady 27 p 372–376.
[12] Duchi J, Hazan E and Singer Y 2010 Adaptive subgradient methods for online learning and
stochastic optimization Conf. Learn. Theory p 257–269.
[13] Zhang N, Lei D and Zhao J F 2019 An Improved Adagrad Gradient Descent Optimization
Algorithm Proc 2018 Chinese Automation Congress p 2359–2362.
[14] Hadgu A T, Nigam A and Diaz-Aviles E 2015 Large-scale learning with AdaGrad on Spark Proc.
2015 IEEE Int. Conf. Big Data 2015 2 Santa Clara USA p 2828–2830.
[15] Chandra M and Matthias M 2017 Variants of RMSProp and Adagrad with Logarithmic Regret
Bounds (arXiv:1706.05507).
[16] Zeiler M D 2012 ADADELTA: An Adaptive Learning Rate Method (arXiv:1212.5701).
[17] Kingma D P and Ba J L 2015 Adam: A method for stochastic optimization 3rd International
Conference on Learning Representations San Diego p 1–15.
[18] Dozat T 2016 Incorporating Nesterov Momentum into Adam ICLR Workshop p 2013–2016.
[19] Yazan Y and Talu M F 2017 Comparison of the stochastic gradient descent based optimization
techniques International Artificial Intelligence and Data Processing Symposium IEEE
Malatya Turkey p 1–5.
[20] De S, Mukherjee A and Ullah E 2018 Convergence guarantees for RMSProp and ADAM in non-
convex optimization and an empirical comparison to Nesterov acceleration
(arXiv:1807.06766).
[21] Dogo E M, Afolabi O J, Nwulu N I, Twala B and Aigbavboa C O 2018 A Comparative Analysis
of Gradient Descent-Based Optimization Algorithms on Convolutional Neural Networks
International Conference on Computational Techniques, Electronics and Mechanical Systems
IEEE Belgaum India p 92–99.
[22] Voronov S, Voronov I and Kovalenko R 2018 Comparative analysis of stochastic optimization
algorithms for image registration IV International Conference on Information Technology and
Nanotechnology.
[23] Hui Z, Zaiyi C, Chuan Q, Zai H, Vincent W Z, Tong X and Enhong C 2020 Adam revisited: a
weighted past gradients perspective Front. Comput. Sci 14.
[24] Yousefi S, Yousefi E, Takahashi H, Hayashi T, Tampo H, Inoda S, Arai Y and Asbell P 2018 Replication data for "Keratoconus severity identification using unsupervised machine learning" (PLOS ONE) Harvard Dataverse.