Optimization in Machine Learning
Derya Soydaner
Statistics Department, Mimar Sinan Fine Arts University
İstanbul, 34380, Turkey
derya.soydaner@msgsu.edu.tr
In recent years, we have witnessed the rise of deep learning. Deep neural networks have
proved their success in many areas. However, optimizing these networks has become more
difficult as neural networks go deeper and datasets become bigger. Therefore, more
advanced optimization algorithms have been proposed over the past years. In this study,
widely used optimization algorithms for deep learning are examined in detail. To this end,
these algorithms, called adaptive gradient methods, are implemented for both supervised
and unsupervised tasks. The behaviour of the algorithms during training and their results
on four image datasets, namely MNIST, CIFAR-10, Kaggle Flowers and Labeled Faces in the
Wild, are compared by pointing out their differences against basic optimization algorithms.
1. Introduction
Adaptive gradient methods have been widely used in deep learning. Although stochastic
gradient descent (SGD) has been one of the most preferred algorithms for many years, it
struggles with serious problems such as ill-conditioning and the time required for
large-scale datasets when training deep neural networks. It also requires manual tuning
of the learning rate and is difficult to parallelize [16]. These shortcomings of SGD
motivated the development of more advanced algorithms. Nowadays, the optimization
algorithms used for deep learning adapt their learning rates during training. Basically,
adaptive gradient methods adjust the learning rate for each parameter: when the gradients
for some parameters are large, their learning rates are reduced, and vice versa.
In recent years, many adaptive methods have been proposed, and they have become the
most commonly used alternatives to SGD. In addition to their strong performance in
training deep models, another advantage is that they are first-order optimization
algorithms, just as SGD is. Thus, they are computationally efficient for training deep
neural networks. This work aims to present the most widely used adaptive optimization
algorithms that have proven their effectiveness and to compare their working principles.
To this end, image processing, one of the most important application areas of deep
learning, is considered. Firstly, the effects of adaptive gradient methods are observed
for the image classification task by using convolutional neural networks (CNNs).
Secondly, the algorithms are compared on unsupervised learning tasks using convolutional
autoencoders.
2. Related Work
In the deep learning literature, the working principles and performance analysis of
optimization algorithms are widely studied. For example, theoretical guarantees of
convergence to criticality for RMSProp and Adam are presented in the setting of
optimizing a non-convex objective [28]. The authors design experiments to empirically
study the convergence and generalization properties of RMSProp and Adam against
Nesterov's accelerated gradient method. In another study, conjugate gradient, SGD and
limited-memory BFGS algorithms are compared [16]. A review of numerical optimization
algorithms in the context of machine learning applications is presented [3].
Additionally, similar to this work, an overview of gradient-based optimization
algorithms is given [25].
In this study, the most widely used optimization algorithms are examined in the
context of deep learning. Meanwhile, new variants of adaptive methods continue to be
proposed. For example, new variants of Adam and AMSGrad, called AdaBound and AMSBound
respectively, employ dynamic bounds on learning rates to achieve a gradual and smooth
transition from adaptive methods to SGD [18]. Also, a new algorithm that adapts the
learning rate locally for each parameter separately, and also globally for all
parameters together, is presented [9]. Another algorithm, called Nostalgic Adam
(NosAdam), which places bigger weights on past gradients than on recent gradients when
designing the adaptive learning rate, is introduced [12]. In another study, two variants
called SC-Adagrad and SC-RMSProp are proposed [20]. A new adaptive optimization
algorithm called YOGI, which controls the increase in the effective learning rate, is
presented [30]. A novel adaptive learning rate scheme, called ESGD, based on the
equilibration preconditioner is developed [4]. Also, a new algorithm called Adafactor is
presented [26]. Instead of scaling parameter updates by the inverse square roots of
exponential moving averages of squared past gradients, Adafactor maintains only the
per-row and per-column sums of the moving averages, and estimates the per-parameter
second moments based on these sums.
$$\hat{g} \leftarrow \frac{1}{m}\nabla_{\theta}\sum_{i} L\big(f(x^{(i)};\theta),\,y^{(i)}\big) \tag{1}$$
$$\theta \leftarrow \theta - \epsilon_{k}\,\hat{g} \tag{2}$$
Here, the learning rate ε_k is a very important hyperparameter. The magnitude of the
update depends on the learning rate. If it is too large, updates depend too much on
recent instances. If it is too small, many updates may be needed for convergence [1].
This hyperparameter can be chosen by trial and error. One way is to choose, among
several candidate learning rates, the one that results in the smallest loss function
value; this is called line search. Another way is to monitor the first several epochs
and use a learning rate that is higher than the best performing one. In Equation 2, the
learning rate is denoted as ε_k at iteration k because, in practice, it is necessary to
gradually decrease the learning rate over time [8].
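To make the update concrete, the following is a minimal NumPy sketch of the minibatch SGD step in Equations 1 and 2. The least-squares loss, its gradient and the decay schedule are illustrative assumptions, not the implementation used in this study.

```python
import numpy as np

def sgd_step(theta, grad_fn, X_batch, y_batch, lr):
    """One SGD update: theta <- theta - lr * minibatch gradient (Eqs. 1-2)."""
    g_hat = grad_fn(theta, X_batch, y_batch)  # average gradient over the minibatch
    return theta - lr * g_hat

# Placeholder model: least-squares loss L = 0.5 * ||X theta - y||^2 / m.
def lsq_grad(theta, X, y):
    m = X.shape[0]
    return X.T @ (X @ theta - y) / m

rng = np.random.default_rng(0)
X, y = rng.normal(size=(128, 10)), rng.normal(size=128)
theta = np.zeros(10)
for k in range(100):
    lr_k = 0.01 / (1.0 + 0.001 * k)   # gradually decayed learning rate (epsilon_k)
    theta = sgd_step(theta, lsq_grad, X, y, lr_k)
```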
SGD with momentum aims primarily to solve two problems: poor conditioning of the Hessian
matrix and variance in the stochastic gradient. The idea behind this algorithm is to take
a running average by incorporating the previous update into the current change, as if
there is a momentum due to previous updates [1]. When SGD is used with momentum, it can
converge faster and oscillates less.
SGD with momentum uses a variable v called velocity. The velocity is the direction and
speed at which the parameters move through parameter space. It is set to an exponentially
decaying average of the negative gradient. Also, SGD with momentum requires a new
hyperparameter α ∈ [0, 1) called the momentum parameter, which determines how quickly the
contributions of previous gradients exponentially decay. The parameters are updated after
the velocity update is computed:
$$v \leftarrow \alpha v - \epsilon\,\nabla_{\theta}\frac{1}{m}\sum_{i=1}^{m} L\big(f(x^{(i)};\theta),\,y^{(i)}\big) \tag{3}$$
$$\theta \leftarrow \theta + v \tag{4}$$
The velocity v accumulates the gradient elements. The larger α is relative to ε, the more
previous gradients affect the current direction. Common values of α used in practice are
0.5, 0.9 and 0.99 [8]. However, a disadvantage of this algorithm is the requirement of the
momentum hyperparameter in addition to the learning rate.
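A minimal sketch of the update in Equations 3 and 4 follows, assuming the minibatch gradient g has already been computed; the variable names and default values are illustrative, not the paper's code.

```python
import numpy as np

def momentum_step(theta, v, g, lr=0.01, alpha=0.9):
    """SGD with momentum: accumulate a velocity, then move the parameters."""
    v = alpha * v - lr * g      # Eq. (3): decaying accumulation of negative gradients
    theta = theta + v           # Eq. (4)
    return theta, v

theta = np.zeros(10)
v = np.zeros_like(theta)        # velocity is initialized to zero
g = np.ones(10)                 # stand-in for a computed minibatch gradient
theta, v = momentum_step(theta, v, g)
```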
A variant, SGD with Nesterov momentum, first evaluates an interim point by applying the
current velocity to the parameters:
$$\tilde{\theta} \leftarrow \theta + \alpha v \tag{5}$$
Then, gradient is computed at the interim point. By using this gradient, velocity
update is computed. Finally, the parameters are updated:
$$g \leftarrow \frac{1}{m}\nabla_{\tilde{\theta}}\sum_{i=1}^{m} L\big(f(x^{(i)};\tilde{\theta}),\,y^{(i)}\big) \tag{6}$$
$$v \leftarrow \alpha v - \epsilon\,g \tag{7}$$
$$\theta \leftarrow \theta + v \tag{8}$$
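The Nesterov variant changes only where the gradient is evaluated, as the following sketch illustrates; `grad_fn` is a placeholder for the minibatch gradient computation and is not part of the original text.

```python
import numpy as np

def nesterov_step(theta, v, grad_fn, lr=0.01, alpha=0.9):
    """SGD with Nesterov momentum: evaluate the gradient at the look-ahead point."""
    theta_interim = theta + alpha * v        # Eq. (5)
    g = grad_fn(theta_interim)               # Eq. (6): gradient at the interim point
    v = alpha * v - lr * g                   # Eq. (7)
    return theta + v, v                      # Eq. (8)

theta, v = np.zeros(5), np.zeros(5)
theta, v = nesterov_step(theta, v, grad_fn=lambda th: 2.0 * th)  # toy quadratic gradient
```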
3.2. AdaGrad
One of the optimization algorithms that individually adapts the learning rates of
model parameters is AdaGrad [6]. The parameters with the largest partial derivative
of the loss have a rapid decrease in their learning rate, while parameters with small
partial derivatives have a relatively small decrease in their learning rate [8]. This is
performed by using all the historical squared values of the gradient.
AdaGrad uses an additional variable r for gradient accumulation. At the beginning of the
algorithm, the gradient accumulation variable is initialized to zero and the gradient is
computed for a minibatch:
$$g \leftarrow \frac{1}{m}\nabla_{\theta}\sum_{i} L\big(f(x^{(i)};\theta),\,y^{(i)}\big) \tag{9}$$
By using this gradient, the squared gradient is accumulated. Then, the update
is computed by scaling learning rates of all parameters inversely proportional to the
square root of the sum of all the historical squared values of the gradient. Finally,
this update is applied to the model parameters:
$$r \leftarrow r + g \odot g \tag{10}$$
$$\Delta\theta \leftarrow -\frac{\epsilon}{\delta + \sqrt{r}} \odot g \tag{11}$$
$$\theta \leftarrow \theta + \Delta\theta \tag{12}$$
where ε is the global learning rate and δ is a small constant for numerical stability.
However, AdaGrad has serious disadvantages. Generally, it performs well for
simple quadratic problems, but it often stops too early when training neural networks.
The learning rate gets scaled down so much that the algorithm ends up stopping entirely
before reaching the global optimum [7]. Also, for training deep neural networks, the
accumulation of squared gradients from the beginning of training can result in an
excessive decrease in the effective learning rate. AdaGrad performs well for some but
not all deep learning models [8].
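For illustration, a minimal NumPy sketch of the AdaGrad update in Equations 10 to 12 follows; the toy gradient and constants are assumptions, not the setup used in the experiments.

```python
import numpy as np

def adagrad_step(theta, r, g, lr=0.001, delta=1e-7):
    """AdaGrad: scale each parameter's step by the accumulated squared gradients."""
    r = r + g * g                            # Eq. (10): accumulate squared gradient
    update = -lr * g / (delta + np.sqrt(r))  # Eq. (11): per-parameter scaling
    return theta + update, r                 # Eq. (12)

theta = np.zeros(5)
r = np.zeros_like(theta)                     # accumulation variable starts at zero
for _ in range(10):
    g = 2.0 * theta - 1.0                    # toy gradient
    theta, r = adagrad_step(theta, r, g)
```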
3.3. AdaDelta
The underlying idea of the AdaDelta algorithm is to remedy the two main drawbacks of
AdaGrad: the continual decay of learning rates throughout training and the need for a
manually selected global learning rate. To this end, AdaDelta restricts the window of
past gradients to some fixed size w instead of accumulating the sum of squared gradients
over all time. As mentioned in the previous section, AdaGrad accumulates the squared
gradients from each iteration starting at the beginning of training. Instead, AdaDelta
maintains an exponentially decaying average of the squared gradients:
$$E[g^{2}]_{t} = \rho E[g^{2}]_{t-1} + (1-\rho)\,g_{t}^{2} \tag{13}$$
where ρ is a decay constant similar to that used in the momentum method. Since the square
root of this quantity is required in the parameter updates, this effectively becomes the
root mean square (RMS) of previous squared gradients up to time t:
$$RMS[g]_{t} = \sqrt{E[g^{2}]_{t} + \delta} \tag{14}$$
where δ is again a small constant. Based on this RMS, the parameter update is
computed, updates are accumulated and the parameters are updated, respectively:
$$\Delta\theta_{t} = -\frac{RMS[\Delta\theta]_{t-1}}{RMS[g]_{t}}\,g_{t} \tag{15}$$
$$E[\Delta\theta^{2}]_{t} = \rho E[\Delta\theta^{2}]_{t-1} + (1-\rho)\,\Delta\theta_{t}^{2} \tag{16}$$
$$\theta_{t+1} = \theta_{t} + \Delta\theta_{t} \tag{17}$$
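A minimal sketch of the AdaDelta update in Equations 13 to 17 follows; the decay constant and the toy gradient are illustrative assumptions, not the experimental configuration.

```python
import numpy as np

def adadelta_step(theta, acc_g, acc_dx, g, rho=0.95, delta=1e-6):
    """AdaDelta: no global learning rate; decaying averages of both the squared
    gradients and the squared updates are maintained."""
    acc_g = rho * acc_g + (1 - rho) * g * g                         # Eq. (13)
    update = -np.sqrt(acc_dx + delta) / np.sqrt(acc_g + delta) * g  # Eq. (15)
    acc_dx = rho * acc_dx + (1 - rho) * update * update             # Eq. (16)
    return theta + update, acc_g, acc_dx                            # Eq. (17)

theta = np.zeros(5)
acc_g, acc_dx = np.zeros_like(theta), np.zeros_like(theta)
for _ in range(10):
    g = 2.0 * theta - 1.0                                           # toy gradient
    theta, acc_g, acc_dx = adadelta_step(theta, acc_g, acc_dx, g)
```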
3.4. RMSProp
Another algorithm that modifies AdaGrad is RMSProp [10]. It is proposed to perform better
in the nonconvex setting by changing the gradient accumulation into an exponentially
weighted moving average. As mentioned in Section 3.2, AdaGrad shrinks the learning rate
according to the entire history of the squared gradient. Instead, RMSProp uses an
exponentially decaying average to discard history from the extreme past so that it can
converge rapidly after finding a convex bowl [8]. After the gradient g is computed for a
minibatch, the accumulation variable is updated as:
$$r \leftarrow \rho r + (1-\rho)\,g \odot g \tag{18}$$
where ρ is the decay rate. Then the parameter update is computed and applied as follows:
$$\Delta\theta \leftarrow -\frac{\epsilon}{\sqrt{\delta + r}} \odot g \tag{19}$$
$$\theta \leftarrow \theta + \Delta\theta \tag{20}$$
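A minimal sketch of the RMSProp update in Equations 18 to 20 follows; the decay rate and constants are illustrative assumptions.

```python
import numpy as np

def rmsprop_step(theta, r, g, lr=0.001, rho=0.9, delta=1e-6):
    """RMSProp: like AdaGrad, but the squared-gradient history decays exponentially."""
    r = rho * r + (1 - rho) * g * g           # Eq. (18)
    update = -lr * g / np.sqrt(delta + r)     # Eq. (19)
    return theta + update, r                  # Eq. (20)

theta, r = np.zeros(5), np.zeros(5)
for _ in range(10):
    g = 2.0 * theta - 1.0                     # toy gradient
    theta, r = rmsprop_step(theta, r, g)
```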
3.5. Adam
Adam is one of the most widely used optimization algorithms in deep learning.
The name Adam is derived from adaptive moment estimation, because it computes
individual adaptive learning rates for different parameters from estimates of the first
and second moments of the gradients. Adam combines the advantages of AdaGrad, which
works well with sparse gradients, and RMSProp, which works well in online and
non-stationary settings [13].
Adam has some important properties. Firstly, momentum is incorporated directly as an
estimate of the first-order moment of the gradient. Also, Adam includes bias corrections
to the estimates of both the first-order moments and the second-order moments to account
for their initialization at the origin [8]. The algorithm updates exponential moving
averages of the gradient m_t and the squared gradient u_t, where the hyperparameters ρ₁
and ρ₂ control the exponential decay rates of these moving averages. The moving averages
themselves are estimates of the first moment (the mean) and the second raw moment (the
uncentered variance) of the gradient [13].
The Adam algorithm requires first and second moment variables m and u. After computing
the gradient, the biased first and second moment estimates are updated at time step t,
respectively:
$$m_{t} \leftarrow \rho_{1} m_{t-1} + (1-\rho_{1})\,g \tag{21}$$
$$u_{t} \leftarrow \rho_{2} u_{t-1} + (1-\rho_{2})\,g \odot g \tag{22}$$
Then, the bias in the first and second moments is corrected. By using the corrected
moment estimates, the parameter update is calculated and applied:
$$\hat{m}_{t} \leftarrow \frac{m_{t}}{1-\rho_{1}^{t}} \tag{23}$$
$$\hat{u}_{t} \leftarrow \frac{u_{t}}{1-\rho_{2}^{t}} \tag{24}$$
$$\Delta\theta = -\epsilon\,\frac{\hat{m}_{t}}{\sqrt{\hat{u}_{t}} + \delta} \tag{25}$$
$$\theta_{t} \leftarrow \theta_{t-1} + \Delta\theta \tag{26}$$
Adam has many advantages. First of all, it requires little tuning of the learning rate.
Also, it is straightforward to implement and invariant to diagonal rescaling of
gradients. It is computationally efficient and has modest memory requirements. Besides,
Adam is appropriate for non-stationary objectives and problems with very noisy and
sparse gradients [13].
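A minimal NumPy sketch of the Adam update in Equations 21 to 26 follows; the default constants are the commonly reported Adam values and the toy gradient is an assumption.

```python
import numpy as np

def adam_step(theta, m, u, g, t, lr=0.001, rho1=0.9, rho2=0.999, delta=1e-8):
    """Adam: bias-corrected first and second moment estimates (t starts at 1)."""
    m = rho1 * m + (1 - rho1) * g                      # Eq. (21)
    u = rho2 * u + (1 - rho2) * g * g                  # Eq. (22)
    m_hat = m / (1 - rho1 ** t)                        # Eq. (23): bias correction
    u_hat = u / (1 - rho2 ** t)                        # Eq. (24)
    update = -lr * m_hat / (np.sqrt(u_hat) + delta)    # Eq. (25)
    return theta + update, m, u                        # Eq. (26)

theta = np.zeros(5)
m, u = np.zeros_like(theta), np.zeros_like(theta)
for t in range(1, 11):
    g = 2.0 * theta - 1.0                              # toy gradient
    theta, m, u = adam_step(theta, m, u, g, t)
```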
3.6. AdaMax
AdaMax is proposed as an extension of Adam. It is a variant of Adam based on the
infinity norm. In Adam, the update rule for individual weights scales their gradients
inversely proportionally to the L2 norm of their individual current and past gradients.
AdaMax is based on the idea that this L2-norm-based update rule can be generalized to an
Lp-norm-based update rule.
The AdaMax algorithm begins by calculating the gradient w.r.t. the stochastic objective
at time step t, as usual. Then, the biased first moment estimate and the exponentially
weighted infinity norm are computed. By using them, the model parameters are updated.
These steps are defined below, respectively:
$$g_{t} \leftarrow \nabla_{\theta} f_{t}(\theta_{t-1}) \tag{27}$$
$$m_{t} \leftarrow \rho_{1} m_{t-1} + (1-\rho_{1})\,g_{t} \tag{28}$$
$$\gamma_{t} \leftarrow \max(\rho_{2}\gamma_{t-1},\,|g_{t}|) \tag{29}$$
$$\theta_{t} \leftarrow \theta_{t-1} - \frac{\epsilon}{1-\rho_{1}^{t}}\,\frac{m_{t}}{\gamma_{t}} \tag{30}$$
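A minimal sketch of the AdaMax steps in Equations 27 to 30 follows; the small initial value of γ is an assumption added only to avoid division by zero in the toy loop, and the gradient is a placeholder.

```python
import numpy as np

def adamax_step(theta, m, gamma, g, t, lr=0.001, rho1=0.9, rho2=0.999):
    """AdaMax: replace Adam's L2-norm accumulator with an infinity-norm one."""
    m = rho1 * m + (1 - rho1) * g                   # Eq. (28)
    gamma = np.maximum(rho2 * gamma, np.abs(g))     # Eq. (29): weighted infinity norm
    update = -(lr / (1 - rho1 ** t)) * m / gamma    # parameter update, Eq. (30)
    return theta + update, m, gamma

theta = np.ones(5)
m = np.zeros_like(theta)
gamma = np.full_like(theta, 1e-8)                   # tiny init guards the division
for t in range(1, 11):
    g = 2.0 * theta - 1.0                           # toy gradient
    theta, m, gamma = adamax_step(theta, m, gamma, g, t)
```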
3.7. Nadam
Nadam (Nesterov-accelerated adaptive moment estimation) modifies Adam's momentum
component with Nesterov's accelerated gradient. Thus, Nadam aims to improve the speed of
convergence and the quality of the learned models [5].
Similar to Adam, after computing the gradient, the first and second moment variables are
updated as in Equations 31 and 32. Then, the corrected moments are computed and the
parameters are updated as in the following equations:
$$m_{t} \leftarrow \rho_{t} m_{t-1} + (1-\rho_{t})\,g_{t} \tag{31}$$
$$u_{t} \leftarrow \nu u_{t-1} + (1-\nu)\,g_{t}^{2} \tag{32}$$
$$\hat{m} \leftarrow \frac{\rho_{t+1}\,m_{t}}{1-\prod_{i=1}^{t+1}\rho_{i}} + \frac{(1-\rho_{t})\,g_{t}}{1-\prod_{i=1}^{t}\rho_{i}} \tag{33}$$
$$\hat{u} \leftarrow \frac{\nu\,u_{t}}{1-\nu^{t}} \tag{34}$$
$$\theta_{t} \leftarrow \theta_{t-1} - \frac{\epsilon_{t}}{\sqrt{\hat{u}_{t}}+\delta}\,\hat{m}_{t} \tag{35}$$
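A simplified Nadam sketch follows. It assumes a constant momentum decay ρ_t = ρ, so the products in Equation 33 reduce to powers of ρ; this is a sketch of the update rule under that assumption, not the schedule of the original paper.

```python
import numpy as np

def nadam_step(theta, m, u, g, t, lr=0.001, rho=0.9, nu=0.999, delta=1e-8):
    """Nadam with a constant momentum decay (rho_t = rho for all t)."""
    m = rho * m + (1 - rho) * g                          # Eq. (31)
    u = nu * u + (1 - nu) * g * g                        # Eq. (32)
    m_hat = (rho * m / (1 - rho ** (t + 1))             # Eq. (33), products -> powers
             + (1 - rho) * g / (1 - rho ** t))
    u_hat = nu * u / (1 - nu ** t)                       # Eq. (34)
    update = -lr * m_hat / (np.sqrt(u_hat) + delta)      # Eq. (35)
    return theta + update, m, u

theta = np.zeros(5)
m, u = np.zeros_like(theta), np.zeros_like(theta)
for t in range(1, 11):
    g = 2.0 * theta - 1.0                                # toy gradient
    theta, m, u = nadam_step(theta, m, u, g, t)
```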
3.8. AMSGrad
Another exponential moving average variant is AMSGrad [23]. The purpose of developing
AMSGrad is to guarantee convergence while preserving the benefits of Adam and RMSProp.
In the AMSGrad algorithm, the first and second moment variables are updated as in
Equations 36 and 37. The key difference between AMSGrad and Adam is shown in Equation
38: AMSGrad maintains the maximum of all u_t until the present time step and uses this
maximum value, instead of u_t as in Adam, for normalizing the running average of the
gradient. By doing this, AMSGrad obtains a non-increasing step size. Finally, the
parameters are updated as in Equation 39, where Û_t denotes diag(û_t).
$$m_{t} \leftarrow \rho_{1} m_{t-1} + (1-\rho_{1})\,g_{t} \tag{36}$$
$$u_{t} \leftarrow \rho_{2} u_{t-1} + (1-\rho_{2})\,g_{t}^{2} \tag{37}$$
$$\hat{u}_{t} \leftarrow \max(\hat{u}_{t-1},\,u_{t}) \tag{38}$$
$$\theta_{t+1} \leftarrow \Pi_{\mathcal{F},\sqrt{\hat{U}_{t}}}\!\left(\theta_{t} - \epsilon_{t}\,\frac{m_{t}}{\sqrt{\hat{u}_{t}}}\right) \tag{39}$$
On the other hand, what distinguishes AMSGrad from Adam and AdaGrad is that it neither
increases nor decreases the learning rate and, furthermore, it decreases u_t, which can
potentially lead to a non-decreasing learning rate even if the gradient is large in
future iterations [23].
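A minimal sketch of the AMSGrad update follows; the feasibility projection of Equation 39 is omitted (an identity projection is assumed), and the constants and gradient are illustrative.

```python
import numpy as np

def amsgrad_step(theta, m, u, u_hat, g, lr=0.001, rho1=0.9, rho2=0.999, delta=1e-8):
    """AMSGrad: normalize by the running maximum of the second-moment estimate,
    which makes the effective step size non-increasing."""
    m = rho1 * m + (1 - rho1) * g              # Eq. (36): first moment, as in Adam
    u = rho2 * u + (1 - rho2) * g * g          # Eq. (37): second moment, as in Adam
    u_hat = np.maximum(u_hat, u)               # Eq. (38): keep the maximum seen so far
    update = -lr * m / (np.sqrt(u_hat) + delta)
    return theta + update, m, u, u_hat

theta = np.zeros(5)
m, u, u_hat = (np.zeros_like(theta) for _ in range(3))
for _ in range(10):
    g = 2.0 * theta - 1.0                      # toy gradient
    theta, m, u, u_hat = amsgrad_step(theta, m, u, u_hat, g)
```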
4. Experiments
4.1. Datasets
The performances of the optimization algorithms are evaluated on four image datasets,
namely MNIST [17], CIFAR-10 [14], Kaggle Flowers [19] and Labeled Faces in the Wild
(LFW) [11]. The well-known MNIST dataset includes 60000 training and 10000 test
examples, each of which is a 28 × 28 grayscale handwritten digit image. CIFAR-10 is
composed of 50000 training and 10000 test examples; they are 32 × 32 color images
belonging to 10 classes. Kaggle Flowers contains 4242 images of 5 different types of
flowers. LFW includes 13233 face images belonging to 5749 people. In the supervised
learning experiments, the LFW classes that have at least 30 images are chosen. LFW has
34 classes that meet this requirement; thus, 1777 face images belonging to 34 people are
used for classification. On the other hand, all 13233 images are used in the
unsupervised learning experiments. Besides, LFW and Kaggle Flowers include color images
of varying sizes. In this study, LFW and Kaggle Flowers images are scaled to a size of
64 × 64 and 96 × 96, respectively. Also, they are randomly divided into two subsets,
0.75 for training and 0.25 for testing.
In the architecture of the autoencoder, the size of the representation is very
important, because as the representation becomes smaller, reconstructing the images
becomes harder. Therefore, two architectures are preferred for the unsupervised learning
experiments, as described in Table 2. The encoder and decoder have symmetric
convolutional-deconvolutional layers; the difference between them is that upsampling
layers are used in the decoder instead of max-pooling layers. The first architecture,
CAE-1, reduces the size of the representation to one quarter, while the second one,
CAE-2, reduces it to one half. For example, the representation of the 64 × 64 LFW images
is 1024-dimensional in CAE-1 and 2048-dimensional in CAE-2. For the unsupervised
learning experiments, the reconstruction loss is the mean squared error and the
activation function in the output layer is tanh. The rest of the layers use ReLU.
In all experiments, the learning rate is 0.01 for SGD and its momentum variants, and
0.001 for the algorithms with adaptive learning rates. The same learning rate is used
for all adaptive algorithms; SGD and its momentum variants use a larger one because they
do not perform well with a smaller learning rate. Therefore, the hyperparameters with
which the algorithms give the best results on test data are preferred. The momentum α is
0.9 for the SGD variants. The decay constant ρ is 0.95 for AdaDelta. Also, the decay
constants ρ₁ and ρ₂ are 0.9 and 0.999 for Adam, AdaMax, Nadam and AMSGrad. The minibatch
size is 128 for all models, and each of them is trained for 50 epochs. In order to
compare different algorithms, the same parameter initialization is used. The
implementation is based on Theano [2] and runs on a single GTX 1050 Ti GPU.
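For reference, the hyperparameter settings above can be collected into a small configuration mapping. This is only a restatement of the reported values (entries not stated in the text, such as RMSProp's decay rate, are left out), not the authors' Theano code.

```python
# Reported experimental settings: minibatch size 128, 50 epochs, shared initialization.
OPTIMIZER_SETTINGS = {
    "sgd":      {"lr": 0.01},
    "momentum": {"lr": 0.01, "alpha": 0.9},
    "nesterov": {"lr": 0.01, "alpha": 0.9},
    "adagrad":  {"lr": 0.001},
    "adadelta": {"rho": 0.95},                                # no global learning rate
    "rmsprop":  {"lr": 0.001},
    "adam":     {"lr": 0.001, "rho1": 0.9, "rho2": 0.999},
    "adamax":   {"lr": 0.001, "rho1": 0.9, "rho2": 0.999},
    "nadam":    {"lr": 0.001, "rho1": 0.9, "rho2": 0.999},
    "amsgrad":  {"lr": 0.001, "rho1": 0.9, "rho2": 0.999},
}
```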
Fig. 1. The behaviour of algorithms on MNIST during training for three CNN architectures.
The classification results for CIFAR-10 are given in Table 4. As CIFAR-10 includes color
images at a higher resolution than MNIST, the performances of the algorithms begin to
differ. AdaMax gives the best accuracies on test data for all architectures. AMSGrad
performs close to AdaMax for CNN-1 and CNN-2. On the other hand, Adam and SGD with
Nesterov momentum follow AdaMax when the neural network becomes deeper. Similar to the
MNIST results, AdaGrad cannot perform well on CIFAR-10. The behaviour of the algorithms
during training is shown in Figure 2. While SGD begins training with the highest loss,
Adam begins with the lowest. Also, AdaMax seems to perform better for deeper
architectures.
Fig. 2. The behaviour of algorithms on CIFAR-10 during training for three CNN architectures.
Up to this point, the algorithms perform well except for SGD and AdaGrad. The change of
depth in the neural network architectures does not significantly affect the performance
rankings of the algorithms. Therefore, the algorithms are also compared on a more
complicated dataset than MNIST and CIFAR-10. LFW consists of face images, each of which
is a 64 × 64 color image belonging to one of 34 people. It includes
more classes than MNIST and CIFAR-10 as well as higher resolution images. Also, it has a
small number of samples per class. This classification task is therefore more difficult
for the optimization algorithms. The classification results for LFW are given in
Table 5. The most obvious result is the decrease in SGD performance: it learns almost
nothing during training, and accordingly its accuracy on test data is the lowest.
Similarly, AdaGrad also cannot perform well. The best accuracies on test data are
obtained by Adam and RMSProp for CNN-1, AdaDelta for CNN-2 and Adam for CNN-3. In the
previous experiments, the performances of the SGD momentum variants compete with the
adaptive algorithms while training the neural networks on MNIST and CIFAR-10. However,
when training on LFW, the difference between them appears and the superiority of the
adaptive learning algorithms becomes clear. Especially AdaDelta, Adam and its variants
perform much better.
Fig. 3. The behaviour of algorithms on LFW during training for three CNN architectures.
and AMSGrad performs well on test data in addition to AdaDelta. Also, the most important
advantage of AdaDelta is that it does not require the learning rate to be selected
manually.
Fig. 4. The behaviour of algorithms on Kaggle Flowers during training for three CNN architectures.
Additionally, the algorithms are compared based on the training time they require. As
LFW includes a small number of training examples in the supervised learning experiments,
there is no significant difference between the training times of the algorithms for each
CNN architecture. On the other hand, MNIST and CIFAR-10 have more training examples than
the other two datasets; therefore, the training times vary across the algorithms, as
shown in Figure 5. For CNN-1, RMSProp is the fastest among the adaptive algorithms for
both datasets. While AdaDelta and AdaMax are slow for MNIST, AMSGrad and Nadam require
more time for CIFAR-10. When the dataset becomes more complex, Adam needs more time.
AdaGrad is fast but cannot generalize well on test data. Also, SGD and its momentum
variants seem to be fast, but they use a larger learning rate and involve fewer
computational steps per update. Nevertheless, similar to AdaGrad, they cannot generalize
well.
RMSProp, Nadam and AdaMax require more time on CIFAR-10 for CNN-2. However, the
complexity of the dataset does not matter for Adam and AMSGrad. When the neural network
becomes deeper, in CNN-3, Nadam and AdaDelta take more time in comparison with the
others for both datasets. In order to train CNN-1 and CNN-2, the difference between the
slowest and fastest algorithms is less than 1 minute. On the other hand, this difference
increases to nearly 5 minutes for CNN-3. AdaMax takes the minimum time among the Adam
variants for the deepest CNN. Additionally, the training time differs when CNN-3 is
trained on Kaggle Flowers. As seen in Figure 5, Nadam and AdaDelta take more time, while
RMSProp requires less time on Kaggle Flowers in comparison with the other adaptive
algorithms.
Fig. 5. (Top) MNIST and CIFAR-10 training time for CNN-1 and CNN-2. (Bottom left) MNIST
training time for CNN-3. (Bottom right) Kaggle Flowers training time for CNN-3.
and AMSGrad cannot perform well on high resolution images. Also, RMSProp performs well
together with Adam and its variants on Kaggle Flowers for both unsupervised tasks. Both
the vanilla and denoising autoencoder results are given in Table 8 for CIFAR-10, Table 9
for LFW and Table 10 for Kaggle Flowers, respectively. Also, the behaviour of the
optimization algorithms during training for the unsupervised learning tasks is shown in
Figure 7 for CIFAR-10, Figure 8 for LFW and Figure 9 for Kaggle Flowers.
Fig. 6. The behaviour of algorithms on MNIST during training for two autoencoder architectures.
(Top) Training loss for CAE. (Bottom) Training loss for denoising CAE.
Fig. 7. The behaviour of algorithms on CIFAR-10 during training for two autoencoder architec-
tures. (Top) Training loss for CAE. (Bottom) Training loss for denoising CAE.
SGD is by far the worst for both unsupervised tasks on all datasets. In general, Adam
and its variants perform much better than the other adaptive gradient methods as well as
SGD and its momentum variants. AdaGrad cannot perform well among the adaptive
algorithms.
Fig. 8. The behaviour of algorithms on LFW during training for two autoencoder architectures.
(Top) Training loss for CAE. (Bottom) Training loss for denoising CAE.
Fig. 9. The behaviour of algorithms on Kaggle Flowers during training for two autoencoder archi-
tectures. (Top) Training loss for CAE. (Bottom) Training loss for denoising CAE.
For the denoising task, the reconstructions of all datasets for CAE-1 are given in
Figure 12. The small size of the representation makes this task more difficult. None of
the algorithms can reconstruct the colors properly on Kaggle Flowers. Also, as it is
hard to reconstruct faces and objects, the results are blurry. However, as the
representation size increases, the colors produced especially by Adam and its variants
are much better, as shown in Figure 13. Also, while SGD and AdaGrad can reconstruct
images with their colors on CIFAR-10 for CAE-1, both algorithms fail with the denoising
CAE-1.
Lastly, the algorithms are compared based on the training time they require. Similar to
the supervised learning experiments, training time does not vary significantly on LFW
and Kaggle Flowers. The training times on MNIST and CIFAR-10 for the unsupervised
learning experiments are shown in Figure 14. In general, Adam and Nadam need more time
than the other algorithms to reconstruct images. Similar to the supervised results, they
need more time as the dataset becomes more complex. Also, AdaMax requires more time on
CIFAR-10, as seen especially for CAE-1. All algorithms need more time for CAE-2 relative
to CAE-1. Also, the denoising task mostly decreases training time on CIFAR-10 and
increases it on MNIST. Even though SGD and its momentum variants train the autoencoders
quickly using the learning rate at which they perform best, they cannot generalize well
on test data.
Fig. 14. (Top) MNIST and CIFAR-10 training time for CAE-1 and CAE-2. (Bottom) MNIST and
CIFAR-10 training time for denoising CAE-1 and CAE-2.
5. Conclusion
In this study, the most commonly used optimization algorithms in deep learning are
examined. The differences between their working principles are summarized, considering
the pros and cons of each. In this context, the importance of adaptive learning
algorithms is highlighted. The performances of the algorithms are compared empirically
on four image datasets. The behaviour of the algorithms during training is also
presented for both supervised and unsupervised tasks.
References
1. E. Alpaydın, Introduction to Machine Learning, The MIT Press, 2014.
2. J. Bergstra, O. Breuleux, F. Bastien, P. Lamblin, R. Pascanu, G. Desjardins, J. Turian,
D. Warde-Farley and Y. Bengio, Theano: A CPU and GPU math compiler in Python,
In Proc. 9th Python in Science Conference, Austin, Texas, 2010, pp. 3–10.
3. L. Bottou, F.E. Curtis and J. Nocedal, Optimization methods for large-scale machine
learning, SIAM Review, 60(2) (2018) 223–311.
4. Y. Dauphin, H. De Vries, J. Chung and Y. Bengio, RMSProp and equilibrated adaptive
learning rates for non-convex optimization, CoRR abs/1502.04390, 2015.
5. T. Dozat, Incorporating Nesterov Momentum into Adam, International Conference on
Learning Representations, San Juan, Puerto Rico, 1 (2016) 2013–2016.
6. J. Duchi, E. Hazan and Y. Singer, Adaptive subgradient methods for online learning
and stochastic optimization, Journal of Machine Learning Research, 12 (2011) 2121–
2159.
7. A. Geron, Hands-on machine learning with Scikit-learn and Tensorflow, O’Reilly, 2017.
8. I. Goodfellow, Y. Bengio and A. Courville, Deep Learning, The MIT Press, 2016.
9. H. Hayashi, J. Koushik and G. Neubig, Eve: A gradient based optimization method
with locally and globally adaptive learning rates, arXiv preprint arXiv:1611.01505,
2016.
10. G. Hinton, Neural networks for machine learning, Coursera, video lectures, 2012.
11. G.B. Huang, M. Ramesh, T. Berg and E. Learned-Miller, Labeled Faces in the Wild:
A database for studying face recognition in unconstrained environments, Technical
Report, University of Massachusetts, Amherst, 2007, pp. 07–49.
12. H. Huang, C. Wang and B. Dong, Nostalgic Adam: Weighing more of the past gra-
dients when designing the adaptive learning rate, arXiv preprint arXiv:1805.07557,
2018.
13. D. Kingma and J. Ba, Adam: A method for stochastic optimization, arXiv preprint
arXiv:1412.6980, 2014.
14. A. Krizhevsky and G. Hinton, Learning multiple layers of features from tiny images,
Technical Report, University of Toronto, 2009.
15. A. Krizhevsky, I. Sutskever and G. Hinton, ImageNet classification with deep convolu-
tional neural networks, Advances in Neural Information Processing Systems, 25 (2012)
1097–1105.
16. Q.V. Le, J. Ngiam, A. Coates, A. Lahiri, B. Prochnow and A.Y. Ng, On optimiza-
tion methods for deep learning, in Proc. 28th International Conference on Machine
Learning, Bellevue, Washington, USA, 2011, pp. 265–272.
17. Y. LeCun, L. Bottou, Y. Bengio and P. Haffner, Gradient-based learning applied to
document recognition, Proceedings of the IEEE, 86(11) (1998) 2278–2324.
18. L. Luo, Y. Xiong, Y. Liu and X. Sun, Adaptive gradient methods with dynamic bound
of learning rate, International Conference on Learning Representations, New Orleans,
2019.
19. A. Mamaev, Flowers Recognition, https://www.kaggle.com/alxmamaev/flowers-
recognition, 2018.
20. M.C. Mukkamala and M. Hein, Variants of RMSProp and Adagrad with logarithmic regret
bounds, in Proc. 34th International Conference on Machine Learning, Sydney, Australia,
2017.