
International Journal of Pattern Recognition and Artificial Intelligence

© World Scientific Publishing Company

A Comparison of Optimization Algorithms for Deep Learning


arXiv:2007.14166v1 [cs.LG] 28 Jul 2020

Derya Soydaner
Statistics Department, Mimar Sinan Fine Arts University
İstanbul, 34380, Turkey
derya.soydaner@msgsu.edu.tr

In recent years, we have witnessed the rise of deep learning. Deep neural networks have
proved their success in many areas. However, the optimization of these networks has
become more difficult as neural networks go deeper and datasets become bigger.
Therefore, more advanced optimization algorithms have been proposed over the past
years. In this study, widely used optimization algorithms for deep learning are examined
in detail. To this end, these algorithms, called adaptive gradient methods, are implemented
for both supervised and unsupervised tasks. The behaviour of the algorithms during
training and their results on four image datasets, namely MNIST, CIFAR-10, Kaggle Flowers
and Labeled Faces in the Wild, are compared by pointing out their differences against
basic optimization algorithms.

Keywords: Adaptive gradient methods; optimization; deep learning; image processing.

1. Introduction
Adaptive gradient methods have been widely used in deep learning. Although
stochastic gradient descent (SGD) has been one of the most preferred algorithms for
many years, it struggles with serious problems such as ill-conditioning and the time
required for large-scale datasets when training deep neural networks. It also requires
manual tuning of the learning rate and is difficult to parallelize 16 . Thus, the
problems of SGD led to the invention of more advanced algorithms. Nowadays,
the optimization algorithms used for deep learning adapt their learning rates during
training. Basically, the adaptive gradient methods adjust the learning rate for
each parameter: when the gradients for some parameters are large, their
learning rates are reduced, and vice versa.
In recent years, many adaptive methods have been proposed, and they have become the
most commonly used alternatives to SGD. In addition to their high performance in
training deep models, another advantage is that they are first-order optimization
algorithms, just like SGD. Thus, they are computationally efficient for training deep
neural networks. This work aims to present the most widely used adaptive optimization
algorithms that have proven their superiority and to compare their working principles.
To this end, image processing, one of the most important areas of deep learning, is
handled. Firstly, the effects of adaptive gradient methods are observed for the
image classification task by using convolutional neural networks (CNNs). Secondly,
as an unsupervised task, convolutional autoencoders (CAEs), which are among the
quintessential examples of unsupervised learning, are used for image reconstruction.
Besides, the effects of the algorithms are examined by using denoising autoencoders.
In this way, the behaviours of the algorithms during training are analyzed in addition
to their performances for both supervised and unsupervised learning tasks.
The rest of the paper is organized as follows. In Section 2, studies about adap-
tive gradient methods and the most recent variants of them are reviewed briefly.
In Section 3, widely used optimization algorithms in deep learning are explained
by showing their update rules and solutions for challenges of training deep net-
works. SGD and its momentum variants are also mentioned in this section. The
experiments and comparative results are presented in Section 4, and the conclusion is given in Section 5.

2. Related Work
In deep learning literature, working principles and performance analysis of op-
timization algorithms are widely studied. For example, theoretical guarantees of
convergence to criticality for RMSProp and Adam are presented in the setting of
optimizing a non-convex objective 28 . They design experiments to empirically study
the convergence and generalization properties of RMSProp and Adam against
Nesterov's accelerated gradient method. In another study, conjugate gradient, SGD
and limited memory BFGS algorithms are compared 16 . A review is presented on
numerical optimization algorithms in the context of machine learning applications 3 .
Additionally, similar to this work, an overview of gradient optimization algorithms
is summarized 25 .
In this study, the most widely used optimization algorithms are examined in the
context of deep learning. On the other side, new variants of adaptive methods
continue to be proposed. For example, new variants of Adam and AMS-
Grad, called AdaBound and AMSBound respectively, are proposed 18 . They employ
dynamic bounds on learning rates to achieve a gradual and smooth transition from
adaptive methods to SGD. Also, a new algorithm that adapts the learning rate
locally for each parameter separately, and also globally for all parameters together
is presented 9 . Another new algorithm, called Nostalgic Adam (NosAdam), which
places bigger weights on the past gradients than the recent gradients when design-
ing the adaptive learning rate is introduced 12 . In another study, two variants called
SC-Adagrad and SC-RMSProp are proposed 20 . A new adaptive optimization algo-
rithm called YOGI is presented 30 . It controls the increase in effective learning rate.
A novel adaptive learning rate scheme, called ESGD, based on the equilibration pre-
conditioner is developed 4 . Also, a new algorithm called Adafactor is presented 26 .
Instead of updating parameters scaling by the inverse square roots of exponential
moving averages of squared past gradients, Adafactor maintains only the per-row
and per-column sums of the moving averages, and estimates the per-parameter
second moments based on these sums.

3. Optimization Algorithms with Adaptive Learning Rates


The choice of the algorithm used to optimize a neural network is one of the most impor-
tant steps. In machine learning, there are three main kinds of optimization methods.
The first kind, called batch or deterministic gradient methods, processes all training
examples simultaneously in a large batch. The second kind, called stochastic or online
methods, uses only one example at a time. Today, most algorithms are a blend of
the two: during training, they use a subset of the training set, larger than one example
but smaller than the whole set, at each update. These algorithms are called minibatch
methods. In the deep learning era, minibatch methods are mostly preferred for two major
reasons. Firstly, they accelerate the training of neural networks. Secondly, as the
minibatches are selected randomly and are independent, an unbiased estimate of the
expected gradient can be computed 8 .
In this paper, the most widely used minibatch-based adaptive algorithms are
examined in detail. Besides, SGD, which was conventionally preferred for a long
time, is briefly explained alongside its momentum variants.

3.1. Stochastic gradient descent


Basically, SGD 24 follows the gradient of randomly selected minibatches downhill.
In order to train a neural network using SGD, firstly, the gradient estimate is
computed by using a loss function. Then, the update at iteration k is applied for
parameters θ. The calculations for each minibatch of m examples x^{(1)}, ..., x^{(m)}
drawn from the training set, with corresponding targets y^{(i)}, are as follows:

$$\hat{g} \leftarrow \frac{1}{m} \sum_i \nabla_\theta L(f(x^{(i)}; \theta), y^{(i)}) \qquad (1)$$

$$\theta \leftarrow \theta - \epsilon_k \hat{g} \qquad (2)$$
Here, the learning rate ε_k is a very important hyperparameter. The magnitude of
the update depends on the learning rate. If it is too large, updates depend too much
on recent instances. If it is too small, many updates may be needed for convergence 1 .
This hyperparameter can be chosen by trial and error. One way is to try several
learning rates and choose the one that results in the smallest loss function value;
this is called line search. Another way is to monitor the first several epochs and use a
learning rate that is higher than the best performing one. In Equation 2,
the learning rate is denoted as ε_k at iteration k because, in practice, it is necessary
to gradually decrease the learning rate over time 8 .
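
As a concrete illustration of Equations 1 and 2, the following is a minimal NumPy sketch of one SGD step (the function and argument names are illustrative, not taken from the paper's Theano implementation):

```python
import numpy as np

def sgd_step(theta, per_example_grads, lr=0.01):
    """One SGD update: average the per-example gradients over the minibatch
    (Eq. 1), then step in the negative gradient direction (Eq. 2)."""
    g_hat = np.mean(per_example_grads, axis=0)  # minibatch gradient estimate
    return theta - lr * g_hat
```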

3.1.1. Stochastic gradient descent with momentum


SGD has difficulty reaching the global optimum because of its tendency to oscillate,
especially on steep surface curves. Noisy or small gradients may also be problematic.
The method of momentum 22 is designed to accelerate learning in such cases. It
aims primarily to solve two problems: poor conditioning of the Hessian matrix and
variance in the stochastic gradient. The idea behind this algorithm is to take a run-
ning average by incorporating the previous update in the current change, as if there
is a momentum due to previous updates 1 . When SGD is used with momentum, it
can converge faster with reduced oscillation.
SGD with momentum uses a variable v called velocity. The velocity is the di-
rection and speed at which the parameters move through parameter space. It is
set to an exponentially decaying average of the negative gradient. Also, SGD with
momentum requires a new hyperparameter α ∈ [0, 1) called the momentum parameter,
which determines how quickly the contributions of previous gradients exponentially
decay. The parameters are updated after the velocity update is computed:

$$v \leftarrow \alpha v - \epsilon \nabla_\theta \left( \frac{1}{m} \sum_{i=1}^{m} L(f(x^{(i)}; \theta), y^{(i)}) \right) \qquad (3)$$

$$\theta \leftarrow \theta + v \qquad (4)$$
The velocity v accumulates the gradient elements. The larger α is relative to ε,
the more previous gradients affect the current direction. Common values of α used
in practice are 0.5, 0.9 and 0.99 8 . However, a disadvantage of this algorithm is the
requirement of a momentum hyperparameter in addition to the learning rate.
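
A corresponding sketch of Equations 3 and 4, where the velocity v is assumed to start at zero and g_hat denotes the minibatch gradient:

```python
import numpy as np

def sgd_momentum_step(theta, v, g_hat, lr=0.01, alpha=0.9):
    """SGD with momentum: accumulate an exponentially decaying average of
    past gradients in the velocity v (Eq. 3), then move the parameters (Eq. 4)."""
    v = alpha * v - lr * g_hat
    return theta + v, v
```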

3.1.2. Stochastic gradient descent with Nesterov momentum


SGD with Nesterov momentum 29 is proposed as a variant of the standard momen-
tum method, taking inspiration from Nesterov's accelerated gradient method 21 . The
idea is to measure the gradient of the loss function not at the local position but
slightly ahead in the direction of the momentum 7 ; that is, the gradient is evaluated
after the current velocity is applied. Therefore, SGD with Nesterov momentum
begins with an interim update for a minibatch 8 :

$$\tilde{\theta} \leftarrow \theta + \alpha v \qquad (5)$$

Then, the gradient is computed at the interim point. By using this gradient, the velocity
update is computed. Finally, the parameters are updated:

$$g \leftarrow \frac{1}{m} \nabla_{\tilde{\theta}} \sum_{i=1}^{m} L(f(x^{(i)}; \tilde{\theta}), y^{(i)}) \qquad (6)$$

$$v \leftarrow \alpha v - \epsilon g \qquad (7)$$

$$\theta \leftarrow \theta + v \qquad (8)$$
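
The Nesterov variant only changes where the gradient is evaluated. In the sketch below, grad_fn is an assumed callable that returns the minibatch gradient at a given parameter vector:

```python
import numpy as np

def sgd_nesterov_step(theta, v, grad_fn, lr=0.01, alpha=0.9):
    """SGD with Nesterov momentum (Eqs. 5-8)."""
    theta_interim = theta + alpha * v  # interim (look-ahead) point (Eq. 5)
    g = grad_fn(theta_interim)         # gradient at the interim point (Eq. 6)
    v = alpha * v - lr * g             # velocity update (Eq. 7)
    return theta + v, v                # parameter update (Eq. 8)
```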

3.2. AdaGrad
One of the optimization algorithms that individually adapts the learning rates of
model parameters is AdaGrad 6 . The parameters with the largest partial derivative
of the loss have a rapid decrease in their learning rate, while parameters with small
partial derivatives have a relatively small decrease in their learning rate 8 . This is
performed by using all the historical squared values of the gradient.
AdaGrad uses an additional variable r for gradient accumulation. At the begin-
ning of the algorithm, the gradient accumulation variable is initialized to zero and
the gradient is computed for a minibatch:

$$g \leftarrow \frac{1}{m} \sum_i \nabla_\theta L(f(x^{(i)}; \theta), y^{(i)}) \qquad (9)$$

By using this gradient, the squared gradient is accumulated. Then, the update
is computed by scaling learning rates of all parameters inversely proportional to the
square root of the sum of all the historical squared values of the gradient. Finally,
this update is applied to the model parameters:

$$r \leftarrow r + g \odot g \qquad (10)$$

$$\Delta\theta \leftarrow -\frac{\epsilon}{\delta + \sqrt{r}} \odot g \qquad (11)$$

$$\theta \leftarrow \theta + \Delta\theta \qquad (12)$$
where ε is the global learning rate and δ is a small constant for numerical stability.
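
A minimal sketch of one AdaGrad step (Equations 9-12), with the accumulator r assumed to start at zero:

```python
import numpy as np

def adagrad_step(theta, r, g, lr=0.001, delta=1e-7):
    """AdaGrad: accumulate squared gradients (Eq. 10) and scale each
    parameter's step by the inverse root of its accumulator (Eqs. 11-12)."""
    r = r + g * g
    theta = theta - lr * g / (delta + np.sqrt(r))
    return theta, r
```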
However, AdaGrad has serious disadvantages. Generally, it performs well for
simple quadratic problems, but it often stops too early when training neural net-
works. The learning rate gets scaled down so much that the algorithm ends up stop-
ping entirely before reaching the global optimum 7 . Also, for training deep neural
networks, the accumulation of squared gradients from the beginning of training can
result in an excessive decrease in the effective learning rate. AdaGrad performs well
for some but not all deep learning models 8 .

3.3. AdaDelta
The underlying idea of the AdaDelta algorithm is to address the two main drawbacks of
AdaGrad: the continual decay of learning rates throughout training and the need
for a manually selected global learning rate. To this end, AdaDelta restricts the
window of past gradients to some fixed size w instead of accumulating the sum
of squared gradients over all time. As mentioned in the previous section, AdaGrad
accumulates the squared gradients from each iteration starting at the beginning
of training. This accumulated sum continues to grow during training, effectively
shrinking the learning rate on each dimension. After many iterations, the learning
rate becomes infinitesimally small. With the windowed accumulation, AdaGrad be-
comes a local estimate using recent gradients instead of accumulating to infinity.
Thus, learning continues to make progress even after many iterations of updates
have been done 31 .
Since storing w previous squared gradients is inefficient, AdaDelta implements
this accumulation as an exponentially decaying average of the squared gradients.
Assuming this running average is E[g^2]_t at time t, the gradient accumulation is
computed as follows:

$$E[g^2]_t = \rho E[g^2]_{t-1} + (1 - \rho)\, g_t^2 \qquad (13)$$

where ρ is a decay constant similar to that used in the momentum method. Since the
square root of this quantity is required in the parameter updates, it effectively
becomes the root mean square (RMS) of previous squared gradients up to time t:

$$RMS[g]_t = \sqrt{E[g^2]_t + \delta} \qquad (14)$$

where δ is again a small constant. Based on this RMS, the parameter update is
computed, updates are accumulated and the parameters are updated, respectively:

 
$$\Delta\theta_t = -\frac{RMS[\Delta\theta]_{t-1}}{RMS[g]_t}\, g_t \qquad (15)$$

$$E[\Delta\theta^2]_t = \rho E[\Delta\theta^2]_{t-1} + (1 - \rho)\, \Delta\theta_t^2 \qquad (16)$$

$$\theta_{t+1} = \theta_t + \Delta\theta_t \qquad (17)$$

The advantage of AdaDelta is that it requires no manual tuning of a learning
rate and appears robust to noisy gradient information, different model architectures,
various data modalities and selection of hyperparameters 31 .
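
A sketch of the AdaDelta step (Equations 13-17); eg2 and edx2 denote the running averages of squared gradients and squared updates, both assumed to start at zero:

```python
import numpy as np

def adadelta_step(theta, eg2, edx2, g, rho=0.95, delta=1e-6):
    """AdaDelta: the ratio of the RMS of past updates to the RMS of past
    gradients replaces a global learning rate (Eqs. 13-17)."""
    eg2 = rho * eg2 + (1 - rho) * g * g                       # Eq. 13
    dx = -(np.sqrt(edx2 + delta) / np.sqrt(eg2 + delta)) * g  # Eqs. 14-15
    edx2 = rho * edx2 + (1 - rho) * dx * dx                   # Eq. 16
    return theta + dx, eg2, edx2                              # Eq. 17
```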

3.4. RMSProp
Another algorithm that modifies AdaGrad is RMSProp 10 . It is proposed to perform
better in the nonconvex setting by changing the gradient accumulation into an
exponentially weighted moving average. As mentioned in Section 3.2, AdaGrad
shrinks the learning rate according to the entire history of the squared gradient.
Instead, RMSProp uses an exponentially decaying average to discard history from
the extreme past so that it can converge rapidly after finding a convex bowl 8 .

In order to implement RMSProp, the squared gradient is accumulated after computing
the gradient:

$$r \leftarrow \rho r + (1 - \rho)\, g \odot g \qquad (18)$$
where ρ is the decay rate. Then the parameter update is computed and applied as
follows:

$$\Delta\theta = -\frac{\epsilon}{\sqrt{\delta + r}} \odot g \qquad (19)$$

$$\theta \leftarrow \theta + \Delta\theta \qquad (20)$$
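
A corresponding sketch of the RMSProp step (Equations 18-20), again with the accumulator r starting at zero:

```python
import numpy as np

def rmsprop_step(theta, r, g, lr=0.001, rho=0.9, delta=1e-6):
    """RMSProp: an exponentially decaying average of squared gradients
    (Eq. 18) scales a global learning rate per parameter (Eqs. 19-20)."""
    r = rho * r + (1 - rho) * g * g
    theta = theta - lr * g / np.sqrt(delta + r)
    return theta, r
```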

3.5. Adam
Adam is one of the most widely used optimization algorithms in deep learning.
The name Adam is derived from adaptive moment estimation because it computes
individual adaptive learning rates for different parameters from estimates of first
and second moments of the gradients. Adam combines the advantages of AdaGrad
which works well with sparse gradients and RMSProp which works well in online
and non-stationary settings 13 .
There are some important properties of Adam. Firstly, momentum is incorpo-
rated directly as an estimate of the first-order moment of the gradient. Also, Adam
includes bias corrections to the estimate of both the first-order moments and the
second-order moments to account for their initialization at the origin 8 . The al-
gorithm updates exponential moving averages of the gradient m_t and the squared
gradient u_t, where the hyperparameters ρ1 and ρ2 control the exponential decay rates
of these moving averages. The moving averages themselves are estimates of the first
moment (the mean) and the second raw moment (the uncentered variance) of the
gradient 13 .
The Adam algorithm requires first and second moment variables m and u. After
computing the gradient, the biased first and second moment estimates are updated at
time step t, respectively:

$$m_t \leftarrow \rho_1 m_{t-1} + (1 - \rho_1)\, g_t \qquad (21)$$

$$u_t \leftarrow \rho_2 u_{t-1} + (1 - \rho_2)\, g_t \odot g_t \qquad (22)$$
Then, the bias in the first and second moments is corrected. By using the corrected
moment estimates, the parameter update is calculated and applied:

$$\hat{m}_t \leftarrow \frac{m_t}{1 - \rho_1^t} \qquad (23)$$

$$\hat{u}_t \leftarrow \frac{u_t}{1 - \rho_2^t} \qquad (24)$$

$$\Delta\theta = -\epsilon\, \frac{\hat{m}_t}{\sqrt{\hat{u}_t} + \delta} \qquad (25)$$

$$\theta_t \leftarrow \theta_{t-1} + \Delta\theta \qquad (26)$$

Adam has many advantages. First of all, it requires little tuning of the learning
rate. Also, it is straightforward to implement and invariant to diagonal rescaling of
the gradients. It is computationally efficient and has little memory requirement.
Besides, Adam is appropriate for non-stationary objectives and problems with very
noisy and sparse gradients 13 .
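
A minimal sketch of the Adam step (Equations 21-26); m and u are assumed to start at zero and t is the 1-based time step:

```python
import numpy as np

def adam_step(theta, m, u, g, t, lr=0.001, rho1=0.9, rho2=0.999, delta=1e-8):
    """Adam: bias-corrected first and second moment estimates (Eqs. 21-26)."""
    m = rho1 * m + (1 - rho1) * g                          # Eq. 21
    u = rho2 * u + (1 - rho2) * g * g                      # Eq. 22
    m_hat = m / (1 - rho1 ** t)                            # Eq. 23
    u_hat = u / (1 - rho2 ** t)                            # Eq. 24
    theta = theta - lr * m_hat / (np.sqrt(u_hat) + delta)  # Eqs. 25-26
    return theta, m, u
```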

3.6. AdaMax
AdaMax is proposed as an extension of Adam. It is a variant of Adam based on the
infinity norm. In Adam, the update rule for individual weights scales their gradients
inversely proportionally to the L2 norm of their individual current and past gradients.
AdaMax is based on the idea that this L2 norm based update rule can be generalized
to an Lp norm based update rule.
The AdaMax algorithm begins by calculating the gradient of the stochastic objective
at time step t, as usual. Then, the biased first moment estimate and the exponentially
weighted infinity norm are computed. By using them, the model parameters are
updated. These steps are defined below, respectively:


$$g_t \leftarrow \nabla_\theta f_t(\theta_{t-1}) \qquad (27)$$

$$m_t \leftarrow \rho_1 m_{t-1} + (1 - \rho_1)\, g_t \qquad (28)$$

$$\gamma_t \leftarrow \max(\rho_2 \gamma_{t-1}, |g_t|) \qquad (29)$$

$$\theta_t \leftarrow \theta_{t-1} - \frac{\epsilon}{1 - \rho_1^t} \cdot \frac{m_t}{\gamma_t} \qquad (30)$$

It is shown that if AdaMax is preferred as the optimization algorithm, there is no
need to correct for initialization bias. Besides, the magnitude of parameter updates
has a simpler bound with AdaMax than with Adam 13 .
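
A sketch of the AdaMax step (Equations 28-30); gamma is the exponentially weighted infinity norm, assumed to start at zero, and a tiny constant is added in the division purely to avoid a zero denominator (it is not part of Equation 30):

```python
import numpy as np

def adamax_step(theta, m, gamma, g, t, lr=0.001, rho1=0.9, rho2=0.999):
    """AdaMax: the exponentially weighted infinity norm replaces the
    second moment of Adam, without bias correction (Eqs. 28-30)."""
    m = rho1 * m + (1 - rho1) * g                # first moment (Eq. 28)
    gamma = np.maximum(rho2 * gamma, np.abs(g))  # infinity norm (Eq. 29)
    theta = theta - (lr / (1 - rho1 ** t)) * m / (gamma + 1e-12)  # Eq. 30
    return theta, m, gamma
```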

3.7. Nadam
Nadam (Nesterov-accelerated adaptive moment estimation) modifies Adam's mo-
mentum component with Nesterov's accelerated gradient. Thus, Nadam aims to
improve the speed of convergence and the quality of the learned models 5 .
Similar to Adam, after computing the gradient, the first and second moment variables
are updated as in Equations 31 and 32. Then, the corrected moments are computed
and the parameters are updated as in the following equations:


$$m_t \leftarrow \rho_t m_{t-1} + (1 - \rho_t)\, g_t \qquad (31)$$

$$u_t \leftarrow \nu u_{t-1} + (1 - \nu)\, g_t^2 \qquad (32)$$

$$\hat{m} \leftarrow \rho_{t+1} m_t \Big/ \Big(1 - \prod_{i=1}^{t+1} \rho_i\Big) + (1 - \rho_t)\, g_t \Big/ \Big(1 - \prod_{i=1}^{t} \rho_i\Big) \qquad (33)$$

$$\hat{u} \leftarrow \nu u_t / (1 - \nu^t) \qquad (34)$$

$$\theta_t \leftarrow \theta_{t-1} - \frac{\epsilon}{\sqrt{\hat{u}_t} + \delta}\, \hat{m}_t \qquad (35)$$
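
A sketch of the Nadam step under the simplifying assumption of a constant first-moment decay ρ (Equation 33 allows a time-varying schedule ρ_t, which the product terms account for):

```python
import numpy as np

def nadam_step(theta, m, u, g, t, lr=0.001, rho=0.9, nu=0.999, delta=1e-8):
    """Nadam with a constant decay rho: a Nesterov-style look-ahead blends
    the current gradient into the corrected first moment (Eqs. 31-35)."""
    m = rho * m + (1 - rho) * g                            # Eq. 31
    u = nu * u + (1 - nu) * g * g                          # Eq. 32
    m_hat = (rho * m / (1 - rho ** (t + 1))
             + (1 - rho) * g / (1 - rho ** t))             # Eq. 33 (constant rho)
    u_hat = nu * u / (1 - nu ** t)                         # Eq. 34
    theta = theta - lr * m_hat / (np.sqrt(u_hat) + delta)  # Eq. 35
    return theta, m, u
```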

3.8. AMSGrad
Another exponential moving average variant is AMSGrad 23 . The purpose of
developing AMSGrad is to guarantee convergence while preserving the benefits
of Adam and RMSProp. In the AMSGrad algorithm, the first and second moment vari-
ables are updated as in Equations 36 and 37. The key difference between AMSGrad
and Adam is shown in Equation 38: AMSGrad maintains the maximum of
all u_t until the present time step and uses this maximum value, instead of u_t as in
Adam, for normalizing the running average of the gradient. By doing this, AMS-
Grad results in a non-increasing step size. Finally, the parameters are updated as
in Equation 39, where Û_t denotes diag(û_t).

$$m_t \leftarrow \rho_{1t} m_{t-1} + (1 - \rho_{1t})\, g_t \qquad (36)$$

$$u_t \leftarrow \rho_2 u_{t-1} + (1 - \rho_2)\, g_t^2 \qquad (37)$$

$$\hat{u}_t \leftarrow \max(\hat{u}_{t-1}, u_t) \qquad (38)$$

$$\theta_{t+1} \leftarrow \Pi_{\mathcal{F}, \sqrt{\hat{U}_t}}\!\left( \theta_t - \epsilon_t\, m_t / \sqrt{\hat{u}_t} \right) \qquad (39)$$

On the other side, the difference of AMSGrad from Adam and AdaGrad is that it
neither increases nor decreases the learning rate and, furthermore, it decreases u_t,
which can potentially lead to a non-decreasing learning rate even if the gradient is large
in future iterations 23 .
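
A sketch of the AMSGrad step for the unconstrained case, where the projection Π reduces to the identity; the small constant delta in the denominator is added only for numerical stability and is not part of Equation 39:

```python
import numpy as np

def amsgrad_step(theta, m, u, u_max, g, lr=0.001, rho1=0.9, rho2=0.999, delta=1e-8):
    """AMSGrad: like Adam, but normalize by the running maximum of the
    second moment, giving a non-increasing step size (Eqs. 36-39)."""
    m = rho1 * m + (1 - rho1) * g                      # Eq. 36
    u = rho2 * u + (1 - rho2) * g * g                  # Eq. 37
    u_max = np.maximum(u_max, u)                       # Eq. 38
    theta = theta - lr * m / (np.sqrt(u_max) + delta)  # Eq. 39 (unconstrained)
    return theta, m, u, u_max
```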

4. Experiments
4.1. Datasets
The performances of optimization algorithms are evaluated on four image datasets,
namely, MNIST 17 , CIFAR-10 14 , Kaggle Flowers 19 and Labeled Faces in the Wild
(LFW) 11 . The well-known dataset MNIST includes 60000 training and 10000 test
examples each of which is a 28×28 gray scale handwritten digit image. CIFAR-10 is
composed of 50000 training and 10000 test examples. They are 32 × 32 color images
belonging to 10 classes. Kaggle Flowers contains 4242 images of 5 different types of
flowers. LFW includes 13233 face images belonging to 5749 people. In supervised
learning experiments, LFW classes that have at least 30 images are chosen. LFW
has 34 classes meeting this requirement; thus, 1777 face images belonging to 34 people
are used for classification. On the other side, all 13233 images are used in the
unsupervised learning experiments. Besides, LFW and Kaggle Flowers include color
images in different sizes. In this study, LFW and Kaggle Flowers are scaled to a
size of 64 × 64 and 96 × 96, respectively. Also, they are randomly divided into two
subsets, 0.75 for training and 0.25 for testing.

4.2. Experimental setting


In this study, both supervised and unsupervised learning tasks are handled. Firstly,
three different CNN architectures are used for the classification task. The first one in-
cludes three convolutional and two fully-connected (FC) layers, similar to the well-
known LeNet-5 17 . The second one has five convolutional and three FC layers
organized in an architecture similar to AlexNet 15 . The last one has seven convolu-
tional layers stacked VGG-style 27 . These architectures are shown in Table
1. All of them have convolutional layers with 3 × 3 kernels in addition
to 2 × 2 max-pooling layers. In order to avoid overfitting, a weight decay of
1e−4 is applied to the convolutional layers. Additionally, dropout with a ratio of 0.50
is applied to the FC layers preceding the output layer. All layers use ReLU as the
activation function except the output layer. The loss function is categorical cross entropy.
Secondly, convolutional autoencoders (CAE) are preferred for the unsupervised
learning task. In general, a CAE comprises two parts, called the encoder and the de-
coder. While the encoder generates a representation of the input data, the decoder
takes this representation as input and reconstructs it in the output. When deciding on
the architecture of the autoencoder, the size of the representation is very important,
because when the size of the representation becomes smaller, the reconstruction of the
images becomes harder. Therefore, two architectures are preferred for the unsuper-
vised learning experiments as described in Table 2. The encoder and decoder have
symmetric convolutional-deconvolutional layers. Here, the difference between the
encoder and decoder is that upsampling layers are used in the decoder instead of max-
pooling layers. The first architecture, CAE-1, reduces the size of the representation to
one quarter, while the second one, CAE-2, reduces it to one half. For example, the
representation of the 64 × 64 images of LFW is 1024 dimensional in CAE-1 and 2048
dimensional in CAE-2. For the unsupervised learning experiments, the reconstruction
loss is mean squared error and the activation function is tanh in the output layer.
The rest of the layers use ReLU.

Table 1. CNN architectures used in supervised learning experiments.

CNN-1         CNN-2         CNN-3
Conv-32       Conv-32       Conv-32
MaxPool       MaxPool       Conv-32
                            MaxPool
Conv-64       Conv-64       Conv-64
MaxPool       MaxPool       Conv-64
                            MaxPool
Conv-128      Conv-128      Conv-128
MaxPool       Conv-128      Conv-128
              Conv-128      Conv-128
FC-128        MaxPool       MaxPool
FC-softmax    FC-128        FC-128
              FC-256        FC-256
              FC-softmax    FC-softmax

Table 2. CAE architectures used in unsupervised learning experiments.

CAE-1         CAE-2
Conv-32       Conv-32
MaxPool       MaxPool
Conv-4        Conv-8
MaxPool       MaxPool
Conv-4        Conv-8
UpSample      UpSample
Conv-32       Conv-32
UpSample      UpSample
Conv-tanh     Conv-tanh

In all experiments, the learning rate ε is 0.01 for SGD and its momentum vari-
ants, and 0.001 for the algorithms with adaptive learning rates. The same learning rate
is used for all adaptive algorithms; SGD and its momentum variants use a larger one
because they do not perform well if the learning rate is smaller. In other words, the
hyperparameters with which the algorithms give the best results on test data are
preferred. The momentum α is 0.9 for the SGD variants. The decay constant ρ is 0.95
for AdaDelta. Also, the decay constants ρ1 and ρ2 are 0.9 and 0.999 for Adam, AdaMax,
Nadam and AMSGrad. The minibatch size is 128 for all models and each of them is
trained for 50 epochs. In order to compare different algorithms, the same parameter
initialization is used. The software is based on Theano 2 and runs on a single GTX
1050 Ti GPU.
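
For reference, the CNN-1 configuration described above might be reproduced roughly as follows in Keras (the paper's implementation uses Theano; this sketch, including the MNIST input shape, the 'same' padding and the choice of Adam as the compiled optimizer, is an assumed illustration rather than the original code):

```python
import tensorflow as tf
from tensorflow.keras import layers, regularizers

def build_cnn1(input_shape=(28, 28, 1), n_classes=10, weight_decay=1e-4):
    """CNN-1: three 3x3 conv layers with 2x2 max-pooling, then FC-128 and softmax."""
    model = tf.keras.Sequential([
        layers.Conv2D(32, (3, 3), padding='same', activation='relu',
                      kernel_regularizer=regularizers.l2(weight_decay),
                      input_shape=input_shape),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), padding='same', activation='relu',
                      kernel_regularizer=regularizers.l2(weight_decay)),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(128, (3, 3), padding='same', activation='relu',
                      kernel_regularizer=regularizers.l2(weight_decay)),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(128, activation='relu'),
        layers.Dropout(0.5),  # dropout on the FC layer preceding the output
        layers.Dense(n_classes, activation='softmax'),
    ])
    # Adaptive methods use a learning rate of 0.001 in the paper; SGD variants use 0.01.
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001,
                                                     beta_1=0.9, beta_2=0.999),
                  loss='categorical_crossentropy', metrics=['accuracy'])
    return model
```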

4.3. Supervised learning experiments


In order to compare the optimization algorithms, all datasets are classified for each
CNN architecture. The classification results for the basic dataset MNIST are given
in Table 3. For all architectures, AdaDelta and AdaMax give the best accuracies
on test data. On the other side, SGD with Nesterov momentum follows AdaDelta
and AdaMax very closely for CNN-3. The worst performance belongs to AdaGrad,
which is the simplest adaptive learning rate algorithm. The behaviour of the algorithms
during training is shown in Figure 1. SGD begins training with the highest loss for
all three architectures while Nadam begins with the lowest. Therefore, it can be
said that Nadam makes a better start to training with the same parameter
initialization. Adam and its variants give similar results to each other for CNN-1.
However, the difference between their performances becomes apparent when the network
becomes deeper. Besides, AdaDelta performs better than AdaGrad and RMSProp
during training.

Table 3. Comparison of the algorithms on MNIST for classification.

                   CNN-1                   CNN-2                   CNN-3
Algorithm          Test Loss   Test Acc.   Test Loss   Test Acc.   Test Loss   Test Acc.
SGD 0.0468 99.00 0.0724 99.03 0.0780 99.24
SGD - momentum 0.0395 99.26 0.0633 99.18 0.0718 99.26
SGD - Nesterov 0.0366 99.34 0.0511 99.35 0.0589 99.41
AdaGrad 0.0600 98.57 0.0753 98.79 0.0730 99.08
AdaDelta 0.0306 99.50 0.0395 99.49 0.0460 99.40
RMSProp 0.0505 99.26 0.1899 98.70 0.1108 99.32
Adam 0.0425 99.35 0.0597 99.00 0.0536 99.29
AdaMax 0.0337 99.37 0.0418 99.51 0.0454 99.42
Nadam 0.0364 99.32 0.0567 99.19 0.0565 99.16
AMSGrad 0.0401 99.24 0.0561 99.28 0.0497 99.36

Fig. 1. The behaviour of algorithms on MNIST during training for three CNN architectures.

The classification results for CIFAR-10 are given in Table 4. As CIFAR-10 in-
cludes color images at a higher resolution than MNIST, the performances of the al-
gorithms begin to change. AdaMax gives the best accuracies on test data for all
architectures. AMSGrad performs close to AdaMax for CNN-1 and CNN-2. On the
other side, Adam and SGD with Nesterov momentum follow AdaMax when the
neural network becomes deeper. Similar to the MNIST results, AdaGrad cannot per-
form well on CIFAR-10. The behaviour of the algorithms during training is shown
in Figure 2. While SGD begins training with the highest loss, Adam begins with
the lowest. Also, AdaMax seems to perform better for deeper architectures.

Table 4. Comparison of the algorithms on CIFAR-10 for classification.

                   CNN-1                   CNN-2                   CNN-3
Algorithm          Test Loss   Test Acc.   Test Loss   Test Acc.   Test Loss   Test Acc.
SGD 0.9428 67.98 0.8899 70.84 1.0112 68.19
SGD - momentum 1.2479 74.96 1.3368 76.74 1.1978 78.21
SGD - Nesterov 1.2235 76.01 1.2909 77.96 1.1651 79.97
AdaGrad 1.2428 56.62 1.1769 59.61 1.1531 60.55
AdaDelta 1.6619 74.08 1.7222 76.23 1.4789 78.55
RMSProp 1.2670 75.01 1.1354 76.14 0.9821 74.81
Adam 1.2279 76.05 1.1988 76.94 0.9393 79.03
AdaMax 0.8054 76.89 1.1240 79.17 1.0437 81.41
Nadam 1.2735 76.11 1.3102 76.60 1.0672 78.98
AMSGrad 1.1940 76.45 1.2000 77.11 1.0135 78.27

Fig. 2. The behaviour of algorithms on CIFAR-10 during training for three CNN architectures.

So far, the algorithms perform well except for SGD and AdaGrad. The change
of depth in the neural network architectures does not significantly affect the performance
rankings of the algorithms. Therefore, the algorithms are also compared on a more
complicated dataset than MNIST and CIFAR-10. LFW consists of face images,
each of which is a 64 × 64 color image belonging to one of 34 people. It includes
more classes than MNIST and CIFAR-10 as well as a higher resolution. Also, it has
a small number of samples per class. This classification task is more difficult
for the optimization algorithms. The classification results for LFW are given in
Table 5. The most obvious result is the decrease in SGD performance: it learns
almost nothing during training. Accordingly, its accuracy on test data
is the lowest. Similarly, AdaGrad also cannot perform well. The best accuracies
on test data are obtained by Adam and RMSProp for CNN-1, AdaDelta
for CNN-2 and Adam for CNN-3. In the previous experiments, the performances
of the SGD momentum variants compete with the adaptive algorithms while training
the neural networks on MNIST and CIFAR-10. However, while training the neural
networks on LFW, the difference between them appears and the superiority of
the adaptive learning algorithms becomes clear. Especially AdaDelta, Adam and its
variants perform much better.

Table 5. Comparison of the algorithms on LFW for classification.

                   CNN-1                   CNN-2                   CNN-3
Algorithm          Test Loss   Test Acc.   Test Loss   Test Acc.   Test Loss   Test Acc.
SGD 2.8016 22.76 3.1141 19.22 3.1152 19.22
SGD - momentum 1.0460 74.70 1.1003 71.66 1.0803 73.18
SGD - Nesterov 0.9470 78.24 0.8571 79.25 0.9930 75.37
AdaGrad 1.6427 58.17 1.8106 52.44 1.7973 52.95
AdaDelta 0.8116 82.29 0.6383 84.31 0.8417 80.10
RMSProp 0.7532 84.48 0.7469 81.78 0.9060 75.54
Adam 0.6887 84.48 0.8219 79.25 0.7223 82.96
AdaMax 0.8734 77.74 0.9822 74.70 0.8617 78.58
Nadam 0.7637 83.81 0.7172 82.96 0.8015 80.26
AMSGrad 0.6795 82.63 0.7648 82.12 0.6726 81.78

Fig. 3. The behaviour of algorithms on LFW during training for three CNN architectures.

The behaviour of the algorithms during training is shown in Figure 3. While SGD
begins training with the highest loss for CNN-1, RMSProp begins with the highest
for CNN-2 and CNN-3. On the other side, AdaDelta begins training with a high
training loss but ends as one of the lowest, giving a good accuracy on test data.
As the network becomes deeper, RMSProp becomes worse in both training loss
and test accuracy. Also, RMSProp sometimes has sudden peaks during training.
This behaviour is more obvious for the deeper networks. While the training losses of
Adam and AMSGrad are very close to each other, AdaMax falls behind its variants. In
general, AdaDelta, Adam, Nadam and AMSGrad perform significantly better
during training. This result can be easily seen especially when the depth of the neural
network increases.
Lastly, the performances of the algorithms are compared on Kaggle Flowers, which
contains 96 × 96 images. The classification results for Kaggle Flowers are reported
in Table 6 and the training process is shown in Figure 4. AdaMax and AMS-
Grad come into prominence according to the accuracies on test data. On the contrary,
SGD gives the worst results. RMSProp begins the training with the highest loss.
However, the initial loss becomes closer for all algorithms when the neural net-
work becomes deeper. Also, the performance of Nadam decreases when the depth
increases. This behaviour of Nadam may arise because of the high resolution of the
images. AdaDelta, Nadam and RMSProp have sudden peaks during training. On
the other side, AdaDelta is not so successful on test data even though it performs
well during training. Interestingly, AdaGrad performs well especially for CNN-3.
Its accuracy on test data is the closest to the best one, which is achieved by
AdaMax. Therefore, AdaGrad seems to perform well for deeper networks as well as
input images with high resolutions.

Table 6. Comparison of the algorithms on Kaggle Flowers for classification.

                   CNN-1                   CNN-2                   CNN-3
Algorithm          Test Loss   Test Acc.   Test Loss   Test Acc.   Test Loss   Test Acc.
SGD 1.0115 59.94 1.0764 59.66 1.3027 44.40
SGD - momentum 1.5339 67.99 1.3234 69.19 1.5486 63.64
SGD - Nesterov 1.5711 68.54 1.4873 68.54 1.7025 64.56
AdaGrad 0.8664 67.25 0.8639 66.60 0.9359 66.88
AdaDelta 1.7850 68.64 1.8371 65.58 2.1974 61.79
RMSProp 1.9709 67.99 1.9051 66.88 1.9653 63.55
Adam 1.9147 69.93 1.4274 68.82 1.8564 62.16
AdaMax 1.2073 69.93 1.1429 71.32 1.0537 70.30
Nadam 1.7687 67.06 1.9668 67.06 1.4252 64.56
AMSGrad 1.8220 70.67 1.6430 71.23 1.9611 61.88

In the supervised learning experiments, the positive effect of momentum for SGD
is conspicuous, especially for the CIFAR-10, LFW and Kaggle Flowers datasets. Also,
Nesterov momentum is slightly better, although the momentum and Nesterov mo-
mentum results are close at the end of training. In general, Adam, AdaMax
and AMSGrad perform well on test data in addition to AdaDelta. Also, the most
important advantage of AdaDelta is that it does not require the learning rate to be
selected manually.

Fig. 4. The behaviour of algorithms on Kaggle Flowers during training for three CNN architectures.

Additionally, the algorithms are compared based on the training time they re-
quire. As LFW includes a small number of training examples, there is no significant
difference between the training times of the algorithms for each CNN architecture in
the supervised learning experiments. On the other hand, MNIST and CIFAR-10 have
more training examples than the other two datasets. Therefore, the training times
vary across the algorithms, as shown in Figure 5. For CNN-1, RMSProp is the fastest
among the adaptive algorithms for both datasets. While AdaDelta and AdaMax are
slow for MNIST, AMSGrad and Nadam require more time for CIFAR-10. When
the dataset becomes more complex, Adam needs more time. AdaGrad is fast
but cannot generalize well on test data. Also, SGD and its momentum variants
seem fast, but they use a larger learning rate and have fewer computational steps
per update. Nevertheless, similar to AdaGrad, they cannot generalize well.
RMSProp, Nadam and AdaMax require more time on CIFAR-10 for CNN-2.
However, the complexity of the dataset does not matter for Adam and AMSGrad.
When the neural network becomes deeper, in CNN-3, Nadam and AdaDelta take
more time in comparison with the others for both datasets. In order to train CNN-1
and CNN-2, the difference between the slowest and fastest algorithms is less than
1 minute. On the other hand, this required time increases to nearly 5 minutes for
CNN-3. AdaMax takes the minimum time among the Adam variants for the deepest
CNN. Additionally, the training time differs when CNN-3 is trained on Kaggle Flow-
ers. As seen in Figure 5, Nadam and AdaDelta take more time. RMSProp requires
less time on Kaggle Flowers in comparison with the other adaptive algorithms.

Fig. 5. (Top) MNIST and CIFAR-10 training time for CNN-1 and CNN-2. (Bottom left) MNIST
training time for CNN-3. (Bottom right) Kaggle Flowers training time for CNN-3.

4.4. Unsupervised learning experiments


When CAE is used, the task is to reconstruct input images in the output layer.
When denoising CAE is used, noisy images are taken as input and the task is to
reconstruct clean images in the output layer. Therefore, Gaussian noise is
applied to the input images before using the denoising CAE. In these experiments, the noise
factor is 0.25. Both vanilla and denoising autoencoder results on MNIST are given
in Table 7. While Adam gives the lowest reconstruction loss for the CAE-1 architecture,
AdaMax and Nadam give the lowest for CAE-2, but the performances of Adam
and its variants are very close to each other. On the other side, Nadam gives the best
results for the denoising CAE. RMSProp performs well on the denoising task in addition
to the Adam variants. The performance of AdaMax slightly decreases for the denoising
task. Besides, AdaGrad falls behind the other adaptive algorithms in both tasks.
The training loss for MNIST is shown in Figure 6, where it can be seen that Adam
and AMSGrad begin the training with the lowest loss.
The results on CIFAR-10, LFW and Kaggle Flowers are similar. Adam and its
variants give the best results. Different from the MNIST results, RMSProp begins train-
ing with the lowest reconstruction loss. However, Adam and its variants give better
results at the end of the training. The results on Kaggle Flowers are interesting.
For CAE-1, the performances of Adam and AMSGrad get worse in the denoising task.
Accordingly, it can be said that when the size of the representation is small, Adam
and AMSGrad cannot perform well on high-resolution images. Also, RMSProp
performs well together with Adam and its variants on Kaggle Flowers for both un-
supervised tasks. Both vanilla and denoising autoencoder results are given in Table
8 for CIFAR-10, Table 9 for LFW and Table 10 for Kaggle Flowers, respectively.
Also, the behaviour of the optimization algorithms during training for the unsu-
pervised learning tasks on CIFAR-10 is shown in Figure 7, LFW in Figure 8 and
Kaggle Flowers in Figure 9.

Table 7. Comparison of the algorithms on MNIST for reconstruction.

                   CAE                        Denoising CAE
Algorithm          CAE-1 Loss   CAE-2 Loss    CAE-1 Loss   CAE-2 Loss
SGD 0.0175 0.0160 0.0219 0.0186
SGD - momentum 0.0095 0.0074 0.0120 0.0103
SGD - Nesterov 0.0094 0.0074 0.0120 0.0102
AdaGrad 0.0113 0.0084 0.0140 0.0125
AdaDelta 0.0080 0.0053 0.0105 0.0077
RMSProp 0.0065 0.0044 0.0087 0.0067
Adam 0.0053 0.0038 0.0082 0.0067
AdaMax 0.0055 0.0036 0.0088 0.0069
Nadam 0.0056 0.0036 0.0080 0.0066
AMSGrad 0.0054 0.0038 0.0083 0.0068

Fig. 6. The behaviour of algorithms on MNIST during training for two autoencoder architectures.
(Top) Training loss for CAE. (Bottom) Training loss for denoising CAE.

Table 8. Comparison of the algorithms on CIFAR-10 for reconstruction.

                   CAE                        Denoising CAE
Algorithm          CAE-1 Loss   CAE-2 Loss    CAE-1 Loss   CAE-2 Loss
SGD 0.0145 0.0152 0.0189 0.0158
SGD - momentum 0.0093 0.0091 0.0119 0.0108
SGD - Nesterov 0.0093 0.0089 0.0119 0.0109
AdaGrad 0.0115 0.0067 0.0167 0.0113
AdaDelta 0.0093 0.0088 0.0113 0.0094
RMSProp 0.0072 0.0059 0.0102 0.0083
Adam 0.0055 0.0047 0.0096 0.0075
AdaMax 0.0061 0.0047 0.0087 0.0074
Nadam 0.0069 0.0046 0.0087 0.0073
AMSGrad 0.0060 0.0049 0.0098 0.0075

Fig. 7. The behaviour of algorithms on CIFAR-10 during training for two autoencoder architec-
tures. (Top) Training loss for CAE. (Bottom) Training loss for denoising CAE.

SGD is by far the worst for both unsupervised tasks on all datasets. In general,
Adam and its variants seem to be much better than the other adaptive gradient
methods as well as SGD and its momentum variants. AdaGrad cannot perform
well among the adaptive algorithms.

Table 9. Comparison of the algorithms on LFW for reconstruction.

                   CAE                        Denoising CAE
Algorithm          CAE-1 Loss   CAE-2 Loss    CAE-1 Loss   CAE-2 Loss
SGD 0.0122 0.0136 0.0125 0.0094
SGD - momentum 0.0070 0.0077 0.0073 0.0063
SGD - Nesterov 0.0070 0.0077 0.0071 0.0063
AdaGrad 0.0071 0.0046 0.0090 0.0069
AdaDelta 0.0061 0.0055 0.0086 0.0066
RMSProp 0.0044 0.0033 0.0062 0.0053
Adam 0.0030 0.0018 0.0044 0.0039
AdaMax 0.0033 0.0019 0.0052 0.0043
Nadam 0.0036 0.0025 0.0050 0.0045
AMSGrad 0.0030 0.0018 0.0049 0.0041

Fig. 8. The behaviour of algorithms on LFW during training for two autoencoder architectures.
(Top) Training loss for CAE. (Bottom) Training loss for denoising CAE.

In order to compare the results visually, the reconstructions of test
images obtained by all algorithms for the CAE-1 architecture are shown in Figure 10.
As handwritten digit images are simpler than the other datasets, all algorithms perform
well on MNIST. Even though the reconstructions of SGD are a little blurry, the
digits can be seen clearly. An important point concerns the Kaggle Flowers results:
SGD and AdaMax cannot reconstruct colors properly. Similar results can be seen
for CAE-2, as shown in Figure 11. Here, it seems that all adaptive algorithms can
reconstruct images with their colors when the size of the representation increases.

Table 10. Comparison of the algorithms on Kaggle Flowers for reconstruction.

                   CAE                        Denoising CAE
Algorithm          CAE-1 Loss   CAE-2 Loss    CAE-1 Loss   CAE-2 Loss
SGD 0.0359 0.0385 0.0367 0.0347
SGD - momentum 0.0217 0.0220 0.0320 0.0217
SGD - Nesterov 0.0217 0.0214 0.0230 0.0219
AdaGrad 0.0298 0.0179 0.0254 0.0234
AdaDelta 0.0220 0.0180 0.0236 0.0203
RMSProp 0.0202 0.0159 0.0216 0.0179
Adam 0.0144 0.0121 0.0269 0.0148
AdaMax 0.0205 0.0133 0.0209 0.0164
Nadam 0.0170 0.0138 0.0208 0.0166
AMSGrad 0.0145 0.0122 0.0270 0.0146

Fig. 9. The behaviour of algorithms on Kaggle Flowers during training for two autoencoder archi-
tectures. (Top) Training loss for CAE. (Bottom) Training loss for denoising CAE.

For the denoising task, the reconstructions of all datasets for CAE-1 are given in
Figure 12. The small size of the representation makes this task more difficult. None of
the algorithms can reconstruct colors properly on Kaggle Flowers. Also, as it is hard to
reconstruct faces and objects, the results are blurry. However, as the representation
size increases, the colors given especially by Adam and its variants are much better,
as shown in Figure 13. Also, while SGD and AdaGrad can reconstruct images with
their colors on CIFAR-10 for CAE-1, both algorithms fail in the denoising CAE-1.

Fig. 10. Reconstructions of test images using CAE-1.

Fig. 11. Reconstructions of test images using CAE-2.



Fig. 12. Reconstructions of test images using denoising CAE-1.

Fig. 13. Reconstructions of test images using denoising CAE-2.



Lastly, the algorithms are compared based on the training time they require.
Similar to the supervised learning experiments, the training time does not vary significantly
on LFW and Kaggle Flowers. The training times on MNIST and CIFAR-10 for the
unsupervised learning experiments are shown in Figure 14. In general, Adam and
Nadam need more time than the other algorithms to reconstruct images. Similar to
the supervised results, when the dataset becomes more complex, they need more time.
Also, AdaMax requires more time on CIFAR-10, as especially seen in CAE-1. All
algorithms need more time for CAE-2 relative to CAE-1. Also, the denoising task mostly
causes the training time to decrease on CIFAR-10 and increase on MNIST. Even though
SGD and its momentum variants train the autoencoders fast using the learning rate
at which they perform best, they cannot generalize well on test data.

Fig. 14. (Top) MNIST and CIFAR-10 training time for CAE-1 and CAE-2. (Bottom) MNIST and
CIFAR-10 training time for denoising CAE-1 and CAE-2.

5. Conclusion
In this study, the most commonly used optimization algorithms in deep learning
are examined. The differences between their working principles are summarized
considering pros and cons for each of them. In this context, the importance of
adaptive learning algorithms is highlighted. The performances of algorithms are
compared on four image datasets empirically. The behaviour of the algorithms

during training is observed according to the effects of different image resolutions
and neural network architectures. Because the adaptive methods are mostly superior
and computationally efficient, their results are better for both tasks. Still,
the research to find better adaptive methods for deep learning continues.

References
1. E. Alpaydın, Introduction to Machine Learning, The MIT Press, 2014.
2. J. Bergstra, O. Breuleux, F. Bastien, P. Lamblin, R. Pascanu, G. Desjardins, J. Turian,
D. Warde-Farley and Y. Bengio, Theano: A CPU and GPU math compiler in Python,
In Proc. 9th Python in Science Conference, Austin, Texas, 2010, pp. 3–10.
3. L. Bottou, F.E. Curtis and J. Nocedal, Optimization methods for large-scale machine
learning, Siam Review, 60(2) (2018) 223–311.
4. Y. Dauphin, H. De Vries, J. Chung and Y. Bengio, RMSProp and equilibrated adaptive
learning rates for non-convex optimization, CoRR abs/1502.04390, 2015.
5. T. Dozat, Incorporating Nesterov Momentum into Adam, International Conference on
Learning Representations, San Juan, Puerto Rico, 1(2016) 2013-2016.
6. J. Duchi, E. Hazan and Y. Singer, Adaptive subgradient methods for online learning
and stochastic optimization, Journal of Machine Learning Research, 12 (2011) 2121–
2159.
7. A. Geron, Hands-on machine learning with Scikit-learn and Tensorflow, O’Reilly, 2017.
8. I. Goodfellow, Y. Bengio and A. Courville, Deep Learning, The MIT Press, 2016.
9. H. Hayashi, J. Koushik and G. Neubig, Eve: A gradient based optimization method
with locally and globally adaptive learning rates, arXiv preprint arXiv:1611.01505,
2016.
10. G. Hinton, Neural networks for machine learning, Coursera, video lectures, 2012.
11. G.B. Huang, M. Ramesh, T. Berg and E. Learned-Miller, Labeled Faces in the Wild:
A database for studying face recognition in unconstrained environments, Technical
Report, University of Massachusetts, Amherst, 2007, pp. 07–49.
12. H. Huang, C. Wang and B. Dong, Nostalgic Adam: Weighing more of the past gra-
dients when designing the adaptive learning rate, arXiv preprint arXiv:1805.07557,
2018.
13. D. Kingma and J. Ba, Adam: A method for stochastic optimization, arXiv preprint
arXiv:1412.6980, 2014.
14. A. Krizhevsky and G. Hinton, Learning multiple layers of features from tiny images,
Technical Report, University of Toronto, 2009.
15. A. Krizhevsky, I. Sutskever and G. Hinton, ImageNet classification with deep convolu-
tional neural networks, Advances in Neural Information Processing Systems, 25(2012)
1097-1105.
16. Q.V. Le, J. Ngiam, A. Coates, A. Lahiri, B. Prochnow and A.Y. Ng, On optimiza-
tion methods for deep learning, in Proc. 28th International Conference on Machine
Learning, Bellevue, Washington, USA, 2011, pp. 265–272.
17. Y. LeCun, L. Bottou, Y. Bengio and P. Haffner, Gradient-based learning applied to
document recognition, Proceedings of the IEEE, 86(11) (1998) 2278–2324.
18. L. Luo, Y. Xiong, Y. Liu and X. Sun, Adaptive gradient methods with dynamic bound
of learning rate, International Conference on Learning Representations, New Orleans,
2019.
19. A. Mamaev, Flowers Recognition, https://www.kaggle.com/alxmamaev/flowers-
recognition, 2018.
20. M.C. Mukkamala and M. Hein, Variants of RMSProp and Adagrad with logarithmic
regret bounds, in Proc. 34th International Conference on Machine Learning, Sydney,
Australia, 70 (2017) 2545–2553.
21. Y. Nesterov, A method of solving a convex programming problem with convergence
rate O(1/k2 ), Soviet Mathematics Doklady, 27 (1983) 372–376.
22. B.T. Polyak, Some methods of speeding up the convergence of iteration methods,
USSR Computational Mathematics and Mathematical Physics, 4(5) (1964) 1–17.
23. S.J. Reddi, S. Kale and S. Kumar, On the convergence of Adam and beyond, Inter-
national Conference on Learning Representations, Vancouver, Canada, 2018.
24. H. Robbins and S. Monro, A stochastic approximation method, The Annals of Math-
ematical Statistics, 22 (1951) 400–407.
25. S. Ruder, An overview of gradient descent optimization algorithms, arXiv preprint
arXiv:1609.04747, 2016.
26. N. Shazeer and M. Stern, Adafactor: Adaptive learning rates with sublinear memory
cost, arXiv preprint arXiv:1804.04235, 2018.
27. K. Simonyan and A. Zisserman, Very deep convolutional networks for large-scale
image recognition, International Conference on Learning Representations, San Diego,
CA, USA, 2015.
28. D. Soham, M. Anirbit and U. Enayat, Convergence guarantees for RMSProp and
Adam in non-convex optimization and an empirical comparison to Nesterov accelera-
tion, arXiv preprint arXiv:1807.06766, 2018.
29. I. Sutskever, J. Martens, G. Dahl and G. Hinton, On importance of initialization and
momentum in deep learning, International Conference on Machine Learning, Atlanta,
USA, 2013, pp. 1139–1147.
30. M. Zaheer, S. Reddi, D. Sachan, S. Kale and S. Kumar, Adaptive methods for non-
convex optimization, Advances in Neural Information Processing Systems, 31 (2018),
pp. 9793–9803.
31. M.D. Zeiler, ADADELTA: An adaptive learning rate method, arXiv preprint
arXiv:1212.5701, 2012.

Derya Soydaner received her B.Sc., M.Sc. and Ph.D. degrees in statistics from
Mimar Sinan Fine Arts University, Istanbul, Turkey in 2012, 2014 and 2018,
respectively. She mainly studied neural networks throughout her entire postgraduate
education. Currently, she is a faculty member in the Department of Statistics at
Mimar Sinan Fine Arts University. Her research interests include pattern recognition,
machine learning and image processing.
