
International Journal of Pattern Recognition and Artificial Intelligence

© World Scientific Publishing Company

A Comparison of Optimization Algorithms for Deep Learning


arXiv:2007.14166v1 [cs.LG] 28 Jul 2020

Derya Soydaner
Statistics Department, Mimar Sinan Fine Arts University
İstanbul, 34380, Turkey
derya.soydaner@msgsu.edu.tr

In recent years, we have witnessed the rise of deep learning. Deep neural networks have
proved their success in many areas. However, the optimization of these networks has
become more difficult as neural networks go deeper and datasets become bigger.
Therefore, more advanced optimization algorithms have been proposed over the past
years. In this study, widely used optimization algorithms for deep learning are examined
in detail. To this end, these algorithms, called adaptive gradient methods, are implemented
for both supervised and unsupervised tasks. The behaviour of the algorithms during
training and their results on four image datasets, namely MNIST, CIFAR-10, Kaggle Flowers
and Labeled Faces in the Wild, are compared by pointing out their differences against
basic optimization algorithms.

Keywords: Adaptive gradient methods; optimization; deep learning; image processing.

1. Introduction
Adaptive gradient methods have been widely used in deep learning. Although
stochastic gradient descent (SGD) has been one of the most preferred algorithms for
many years, it struggles with serious problems such as ill-conditioning and the time
required for large-scale datasets when training deep neural networks. It also requires
manual tuning of the learning rate and is difficult to parallelize 16 . Thus, the
problems of SGD led to the invention of more advanced algorithms. Nowadays,
the optimization algorithms used for deep learning adapt their learning rates during
training. Basically, the adaptive gradient methods adjust the learning rate for
each parameter: when the gradients for some parameters are large, their
learning rates are reduced, and vice versa.
In recent years, many adaptive methods have been proposed, and they have become the
most commonly used alternatives to SGD. In addition to their high performance in
training deep models, another advantage is that they are first-order optimization
algorithms, just like SGD. Thus, they are computationally efficient for training deep
neural networks. This work aims to present the most widely used adaptive optimization
algorithms that have proven their superiority and to compare their working principles.
To this end, image processing, one of the most important areas of deep learning, is
handled. Firstly, the effects of adaptive gradient methods are observed for the
image classification task by using convolutional neural networks (CNNs). Secondly,
as an unsupervised task, convolutional autoencoders (CAEs), which are among the
quintessential examples of unsupervised learning, are used for image reconstruction.
Besides, the effects of the algorithms are examined by using denoising autoencoders.
In this way, the behaviours of the algorithms during training are analyzed in addition
to their performances for both supervised and unsupervised learning tasks.
The rest of the paper is organized as follows. In Section 2, studies about adap-
tive gradient methods and the most recent variants of them are reviewed briefly.
In Section 3, widely used optimization algorithms in deep learning are explained
by showing their update rules and solutions for challenges of training deep net-
works. SGD and its momentum variants are also mentioned in this section. The
experiments and comparative results are presented in Section 4, and the conclusion is given in Section 5.

2. Related Work
In deep learning literature, working principles and performance analysis of op-
timization algorithms are widely studied. For example, theoretical guarantees of
convergence to criticality for RMSProp and Adam are presented in the setting of
optimizing a non-convex objective 28 . They design experiments to empirically study
the convergence and generalization properties of RMSProp and Adam against
Nesterov's accelerated gradient method. In another study, conjugate gradient, SGD
and limited memory BFGS algorithms are compared 16 . A review is presented on
numerical optimization algorithms in the context of machine learning applications 3 .
Additionally, similar to this work, an overview of gradient optimization algorithms
is summarized 25 .
In this study, the most widely used optimization algorithms are examined in the
context of deep learning. On the other side, new variants of adaptive methods
continue to be proposed. For example, new variants of Adam and AMS-
Grad, called AdaBound and AMSBound respectively, are proposed 18 . They employ
dynamic bounds on learning rates to achieve a gradual and smooth transition from
adaptive methods to SGD. Also, a new algorithm that adapts the learning rate
locally for each parameter separately, and also globally for all parameters together
is presented 9 . Another new algorithm, called Nostalgic Adam (NosAdam), which
places bigger weights on the past gradients than the recent gradients when design-
ing the adaptive learning rate is introduced 12 . In another study, two variants called
SC-Adagrad and SC-RMSProp are proposed 20 . A new adaptive optimization algo-
rithm called YOGI is presented 30 . It controls the increase in effective learning rate.
A novel adaptive learning rate scheme, called ESGD, based on the equilibration pre-
conditioner is developed 4 . Also, a new algorithm called Adafactor is presented 26 .
Instead of updating parameters scaling by the inverse square roots of exponential
moving averages of squared past gradients, Adafactor maintains only the per-row
and per-column sums of the moving averages, and estimates the per-parameter
second moments based on these sums.

3. Optimization Algorithms with Adaptive Learning Rates


The choice of the algorithm used to optimize a neural network is one of the most impor-
tant steps. In machine learning, there are three main kinds of optimization methods.
The first kind, called batch or deterministic gradient methods, processes all training
examples simultaneously in a large batch. The second kind, called stochastic or online
methods, uses only one example at a time. Today, most algorithms are a blend of
the two: during training, they use a subset of the training set, larger than one example
but smaller than the whole set, at each update. These algorithms are called minibatch
methods. In the deep learning era, minibatch methods are mostly preferred for two major
reasons. Firstly, they accelerate the training of neural networks. Secondly, as the
minibatches are selected randomly and are independent, an unbiased estimate of the
expected gradient can be computed 8 .
In this paper, the most widely used minibatch-based adaptive algorithms are
examined in detail. Besides, SGD, which was conventionally preferred for a long
time, is briefly explained alongside its momentum variants.

3.1. Stochastic gradient descent


Basically, SGD 24 follows the gradient of randomly selected minibatches downhill.
In order to train a neural network using SGD, firstly, the gradient estimate is
computed by using a loss function. Then, the update at iteration k is applied for
parameters θ. The calculations for each minibatch of m examples x^{(1)}, ..., x^{(m)}
drawn from the training set, with corresponding targets y^{(i)}, are as follows:

$$\hat{g} \leftarrow \frac{1}{m} \sum_i \nabla_\theta L(f(x^{(i)}; \theta), y^{(i)}) \qquad (1)$$

$$\theta \leftarrow \theta - \epsilon_k \hat{g} \qquad (2)$$
Here, the learning rate ε_k is a very important hyperparameter. The magnitude of
the update depends on the learning rate. If it is too large, updates depend too much
on recent instances. If it is too small, many updates may be needed for convergence 1 .
This hyperparameter can be chosen by trial and error. One way is to try several
learning rates and choose the one that results in the smallest loss function value;
this is called line search. Another way is to monitor the first several epochs and use a
learning rate that is higher than the best performing one. In Equation 2,
the learning rate is denoted as ε_k at iteration k because, in practice, it is necessary
to gradually decrease the learning rate over time 8 .
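
As a concrete illustration of Equations 1 and 2, the following is a minimal NumPy sketch of one SGD step (the function and argument names are illustrative, not taken from the paper's Theano implementation):

```python
import numpy as np

def sgd_step(theta, per_example_grads, lr=0.01):
    """One SGD update: average the per-example gradients over the minibatch
    (Eq. 1), then step in the negative gradient direction (Eq. 2)."""
    g_hat = np.mean(per_example_grads, axis=0)  # minibatch gradient estimate
    return theta - lr * g_hat
```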

3.1.1. Stochastic gradient descent with momentum


SGD has difficulty reaching the global optimum because of its tendency to oscillate,
especially on steep surface curves. Noisy or small gradients may also be problematic.
The method of momentum 22 is designed to accelerate learning in such cases. It
aims primarily to solve two problems: poor conditioning of the Hessian matrix and
variance in the stochastic gradient. The idea behind this algorithm is to take a run-
ning average by incorporating the previous update in the current change, as if there
is a momentum due to previous updates 1 . When SGD is used with momentum, it
can converge faster with reduced oscillation.
SGD with momentum uses a variable v called velocity. The velocity is the di-
rection and speed at which the parameters move through parameter space. It is
set to an exponentially decaying average of the negative gradient. Also, SGD with
momentum requires a new hyperparameter α ∈ [0, 1) called the momentum parameter,
which determines how quickly the contributions of previous gradients exponentially
decay. The parameters are updated after the velocity update is computed:

$$v \leftarrow \alpha v - \epsilon \nabla_\theta \left( \frac{1}{m} \sum_{i=1}^{m} L(f(x^{(i)}; \theta), y^{(i)}) \right) \qquad (3)$$

$$\theta \leftarrow \theta + v \qquad (4)$$
The velocity v accumulates the gradient elements. The larger α is relative to ε,
the more previous gradients affect the current direction. Common values of α used
in practice are 0.5, 0.9 and 0.99 8 . However, a disadvantage of this algorithm is the
requirement of a momentum hyperparameter in addition to the learning rate.
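
A corresponding sketch of Equations 3 and 4, where the velocity v is assumed to start at zero and g_hat denotes the minibatch gradient:

```python
import numpy as np

def sgd_momentum_step(theta, v, g_hat, lr=0.01, alpha=0.9):
    """SGD with momentum: accumulate an exponentially decaying average of
    past gradients in the velocity v (Eq. 3), then move the parameters (Eq. 4)."""
    v = alpha * v - lr * g_hat
    return theta + v, v
```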

3.1.2. Stochastic gradient descent with Nesterov momentum


SGD with Nesterov momentum 29 is proposed as a variant of the standard momen-
tum method, taking inspiration from Nesterov's accelerated gradient method 21 . The
idea is to measure the gradient of the loss function not at the local position but
slightly ahead in the direction of the momentum 7 ; that is, the gradient is evaluated
after the current velocity is applied. Therefore, SGD with Nesterov momentum
begins with an interim update for a minibatch 8 :

$$\tilde{\theta} \leftarrow \theta + \alpha v \qquad (5)$$

Then, the gradient is computed at the interim point. By using this gradient, the velocity
update is computed. Finally, the parameters are updated:

$$g \leftarrow \frac{1}{m} \nabla_{\tilde{\theta}} \sum_{i=1}^{m} L(f(x^{(i)}; \tilde{\theta}), y^{(i)}) \qquad (6)$$

$$v \leftarrow \alpha v - \epsilon g \qquad (7)$$

$$\theta \leftarrow \theta + v \qquad (8)$$
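
The Nesterov variant only changes where the gradient is evaluated. In the sketch below, grad_fn is an assumed callable that returns the minibatch gradient at a given parameter vector:

```python
import numpy as np

def sgd_nesterov_step(theta, v, grad_fn, lr=0.01, alpha=0.9):
    """SGD with Nesterov momentum (Eqs. 5-8)."""
    theta_interim = theta + alpha * v  # interim (look-ahead) point (Eq. 5)
    g = grad_fn(theta_interim)         # gradient at the interim point (Eq. 6)
    v = alpha * v - lr * g             # velocity update (Eq. 7)
    return theta + v, v                # parameter update (Eq. 8)
```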

3.2. AdaGrad
One of the optimization algorithms that individually adapts the learning rates of
model parameters is AdaGrad 6 . The parameters with the largest partial derivative
of the loss have a rapid decrease in their learning rate, while parameters with small
partial derivatives have a relatively small decrease in their learning rate 8 . This is
performed by using all the historical squared values of the gradient.
AdaGrad uses an additional variable r for gradient accumulation. At the begin-
ning of the algorithm, the gradient accumulation variable is initialized to zero and
the gradient is computed for a minibatch:

$$g \leftarrow \frac{1}{m} \sum_i \nabla_\theta L(f(x^{(i)}; \theta), y^{(i)}) \qquad (9)$$

By using this gradient, the squared gradient is accumulated. Then, the update
is computed by scaling learning rates of all parameters inversely proportional to the
square root of the sum of all the historical squared values of the gradient. Finally,
this update is applied to the model parameters:

$$r \leftarrow r + g \odot g \qquad (10)$$

$$\Delta\theta \leftarrow -\frac{\epsilon}{\delta + \sqrt{r}} \odot g \qquad (11)$$

$$\theta \leftarrow \theta + \Delta\theta \qquad (12)$$
where ε is the global learning rate and δ is a small constant for numerical stability.
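
A minimal sketch of one AdaGrad step (Equations 9-12), with the accumulator r assumed to start at zero:

```python
import numpy as np

def adagrad_step(theta, r, g, lr=0.001, delta=1e-7):
    """AdaGrad: accumulate squared gradients (Eq. 10) and scale each
    parameter's step by the inverse root of its accumulator (Eqs. 11-12)."""
    r = r + g * g
    theta = theta - lr * g / (delta + np.sqrt(r))
    return theta, r
```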
However, AdaGrad has serious disadvantages. Generally, it performs well for
simple quadratic problems, but it often stops too early when training neural net-
works. The learning rate gets scaled down so much that the algorithm ends up stop-
ping entirely before reaching the global optimum 7 . Also, for training deep neural
networks, the accumulation of squared gradients from the beginning of training can
result in an excessive decrease in the effective learning rate. AdaGrad performs well
for some but not all deep learning models 8 .

3.3. AdaDelta
The underlying idea of the AdaDelta algorithm is to address the two main drawbacks of
AdaGrad: the continual decay of learning rates throughout training and the need
for a manually selected global learning rate. To this end, AdaDelta restricts the
window of past gradients to some fixed size w instead of accumulating the sum
of squared gradients over all time. As mentioned in the previous section, AdaGrad
accumulates the squared gradients from each iteration starting at the beginning
of training. This accumulated sum continues to grow during training, effectively
shrinking the learning rate on each dimension. After many iterations, the learning
rate becomes infinitesimally small. With the windowed accumulation, AdaGrad be-
comes a local estimate using recent gradients instead of accumulating to infinity.
Thus, learning continues to make progress even after many iterations of updates
have been done 31 .
Since storing w previous squared gradients is inefficient, AdaDelta implements
this accumulation as an exponentially decaying average of the squared gradients.
Assuming this running average is E[g^2]_t at time t, the gradient accumulation is
computed as follows:

$$E[g^2]_t = \rho E[g^2]_{t-1} + (1 - \rho)\, g_t^2 \qquad (13)$$

where ρ is a decay constant similar to that used in the momentum method. Since the
square root of this quantity is required in the parameter updates, it effectively
becomes the root mean square (RMS) of previous squared gradients up to time t:

$$RMS[g]_t = \sqrt{E[g^2]_t + \delta} \qquad (14)$$

where δ is again a small constant. Based on this RMS, the parameter update is
computed, updates are accumulated and the parameters are updated, respectively:

 
$$\Delta\theta_t = -\frac{RMS[\Delta\theta]_{t-1}}{RMS[g]_t}\, g_t \qquad (15)$$

$$E[\Delta\theta^2]_t = \rho E[\Delta\theta^2]_{t-1} + (1 - \rho)\, \Delta\theta_t^2 \qquad (16)$$

$$\theta_{t+1} = \theta_t + \Delta\theta_t \qquad (17)$$

The advantage of AdaDelta is that it requires no manual tuning of a learning
rate and appears robust to noisy gradient information, different model architectures,
various data modalities and selection of hyperparameters 31 .
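
A sketch of the AdaDelta step (Equations 13-17); eg2 and edx2 denote the running averages of squared gradients and squared updates, both assumed to start at zero:

```python
import numpy as np

def adadelta_step(theta, eg2, edx2, g, rho=0.95, delta=1e-6):
    """AdaDelta: the ratio of the RMS of past updates to the RMS of past
    gradients replaces a global learning rate (Eqs. 13-17)."""
    eg2 = rho * eg2 + (1 - rho) * g * g                       # Eq. 13
    dx = -(np.sqrt(edx2 + delta) / np.sqrt(eg2 + delta)) * g  # Eqs. 14-15
    edx2 = rho * edx2 + (1 - rho) * dx * dx                   # Eq. 16
    return theta + dx, eg2, edx2                              # Eq. 17
```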

3.4. RMSProp
Another algorithm that modifies AdaGrad is RMSProp 10 . It is proposed to perform
better in the nonconvex setting by changing the gradient accumulation into an
exponentially weighted moving average. As mentioned in Section 3.2, AdaGrad
shrinks the learning rate according to the entire history of the squared gradient.
Instead, RMSProp uses an exponentially decaying average to discard history from
the extreme past so that it can converge rapidly after finding a convex bowl 8 .

In order to implement RMSProp, the squared gradient is accumulated after computing
the gradient:

$$r \leftarrow \rho r + (1 - \rho)\, g \odot g \qquad (18)$$
where ρ is the decay rate. Then the parameter update is computed and applied as
follows:

$$\Delta\theta = -\frac{\epsilon}{\sqrt{\delta + r}} \odot g \qquad (19)$$

$$\theta \leftarrow \theta + \Delta\theta \qquad (20)$$
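
A corresponding sketch of the RMSProp step (Equations 18-20), again with the accumulator r starting at zero:

```python
import numpy as np

def rmsprop_step(theta, r, g, lr=0.001, rho=0.9, delta=1e-6):
    """RMSProp: an exponentially decaying average of squared gradients
    (Eq. 18) scales a global learning rate per parameter (Eqs. 19-20)."""
    r = rho * r + (1 - rho) * g * g
    theta = theta - lr * g / np.sqrt(delta + r)
    return theta, r
```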

3.5. Adam
Adam is one of the most widely used optimization algorithms in deep learning.
The name Adam is derived from adaptive moment estimation because it computes
individual adaptive learning rates for different parameters from estimates of first
and second moments of the gradients. Adam combines the advantages of AdaGrad
which works well with sparse gradients and RMSProp which works well in online
and non-stationary settings 13 .
There are some important properties of Adam. Firstly, momentum is incorpo-
rated directly as an estimate of the first-order moment of the gradient. Also, Adam
includes bias corrections to the estimate of both the first-order moments and the
second-order moments to account for their initialization at the origin 8 . The al-
gorithm updates exponential moving averages of the gradient m_t and the squared
gradient u_t, where the hyperparameters ρ1 and ρ2 control the exponential decay rates
of these moving averages. The moving averages themselves are estimates of the first
moment (the mean) and the second raw moment (the uncentered variance) of the
gradient 13 .
The Adam algorithm requires first and second moment variables m and u. After
computing the gradient, the biased first and second moment estimates are updated at
time step t, respectively:

$$m_t \leftarrow \rho_1 m_{t-1} + (1 - \rho_1)\, g_t \qquad (21)$$

$$u_t \leftarrow \rho_2 u_{t-1} + (1 - \rho_2)\, g_t \odot g_t \qquad (22)$$
Then, the bias in the first and second moments is corrected. By using the corrected
moment estimates, the parameter update is calculated and applied:

$$\hat{m}_t \leftarrow \frac{m_t}{1 - \rho_1^t} \qquad (23)$$

$$\hat{u}_t \leftarrow \frac{u_t}{1 - \rho_2^t} \qquad (24)$$

$$\Delta\theta = -\epsilon\, \frac{\hat{m}_t}{\sqrt{\hat{u}_t} + \delta} \qquad (25)$$

$$\theta_t \leftarrow \theta_{t-1} + \Delta\theta \qquad (26)$$

Adam has many advantages. First of all, it requires little tuning of the learning
rate. Also, it is straightforward to implement and invariant to diagonal rescaling of
the gradients. It is computationally efficient and has little memory requirement.
Besides, Adam is appropriate for non-stationary objectives and problems with very
noisy and sparse gradients 13 .
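
A minimal sketch of the Adam step (Equations 21-26); m and u are assumed to start at zero and t is the 1-based time step:

```python
import numpy as np

def adam_step(theta, m, u, g, t, lr=0.001, rho1=0.9, rho2=0.999, delta=1e-8):
    """Adam: bias-corrected first and second moment estimates (Eqs. 21-26)."""
    m = rho1 * m + (1 - rho1) * g                          # Eq. 21
    u = rho2 * u + (1 - rho2) * g * g                      # Eq. 22
    m_hat = m / (1 - rho1 ** t)                            # Eq. 23
    u_hat = u / (1 - rho2 ** t)                            # Eq. 24
    theta = theta - lr * m_hat / (np.sqrt(u_hat) + delta)  # Eqs. 25-26
    return theta, m, u
```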

3.6. AdaMax
AdaMax is proposed as an extension of Adam. It is a variant of Adam based on the
infinity norm. In Adam, the update rule for individual weights scales their gradients
inversely proportionally to the L2 norm of their individual current and past gradients.
AdaMax is based on the idea that this L2 norm based update rule can be generalized
to an Lp norm based update rule.
The AdaMax algorithm begins by calculating the gradient of the stochastic objective
at time step t, as usual. Then, the biased first moment estimate and the exponentially
weighted infinity norm are computed. By using them, the model parameters are
updated. These steps are defined below, respectively:


$$g_t \leftarrow \nabla_\theta f_t(\theta_{t-1}) \qquad (27)$$

$$m_t \leftarrow \rho_1 m_{t-1} + (1 - \rho_1)\, g_t \qquad (28)$$

$$\gamma_t \leftarrow \max(\rho_2 \gamma_{t-1}, |g_t|) \qquad (29)$$

$$\theta_t \leftarrow \theta_{t-1} - \frac{\epsilon}{1 - \rho_1^t} \cdot \frac{m_t}{\gamma_t} \qquad (30)$$

It is shown that if AdaMax is preferred as the optimization algorithm, there is no
need to correct for initialization bias. Besides, the magnitude of parameter updates
has a simpler bound with AdaMax than with Adam 13 .
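
A sketch of the AdaMax step (Equations 28-30); gamma is the exponentially weighted infinity norm, assumed to start at zero, and a tiny constant is added in the division purely to avoid a zero denominator (it is not part of Equation 30):

```python
import numpy as np

def adamax_step(theta, m, gamma, g, t, lr=0.001, rho1=0.9, rho2=0.999):
    """AdaMax: the exponentially weighted infinity norm replaces the
    second moment of Adam, without bias correction (Eqs. 28-30)."""
    m = rho1 * m + (1 - rho1) * g                # first moment (Eq. 28)
    gamma = np.maximum(rho2 * gamma, np.abs(g))  # infinity norm (Eq. 29)
    theta = theta - (lr / (1 - rho1 ** t)) * m / (gamma + 1e-12)  # Eq. 30
    return theta, m, gamma
```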

3.7. Nadam
Nadam (Nesterov-accelerated adaptive moment estimation) modifies Adam's mo-
mentum component with Nesterov's accelerated gradient. Thus, Nadam aims to
improve the speed of convergence and the quality of the learned models 5 .
Similar to Adam, after computing the gradient, the first and second moment variables
are updated as in Equations 31 and 32. Then, the corrected moments are computed
and the parameters are updated as in the following equations:


$$m_t \leftarrow \rho_t m_{t-1} + (1 - \rho_t)\, g_t \qquad (31)$$

$$u_t \leftarrow \nu u_{t-1} + (1 - \nu)\, g_t^2 \qquad (32)$$

$$\hat{m} \leftarrow \rho_{t+1} m_t \Big/ \Big(1 - \prod_{i=1}^{t+1} \rho_i\Big) + (1 - \rho_t)\, g_t \Big/ \Big(1 - \prod_{i=1}^{t} \rho_i\Big) \qquad (33)$$

$$\hat{u} \leftarrow \nu u_t / (1 - \nu^t) \qquad (34)$$

$$\theta_t \leftarrow \theta_{t-1} - \frac{\epsilon}{\sqrt{\hat{u}_t} + \delta}\, \hat{m}_t \qquad (35)$$
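
A sketch of the Nadam step under the simplifying assumption of a constant first-moment decay ρ (Equation 33 allows a time-varying schedule ρ_t, which the product terms account for):

```python
import numpy as np

def nadam_step(theta, m, u, g, t, lr=0.001, rho=0.9, nu=0.999, delta=1e-8):
    """Nadam with a constant decay rho: a Nesterov-style look-ahead blends
    the current gradient into the corrected first moment (Eqs. 31-35)."""
    m = rho * m + (1 - rho) * g                            # Eq. 31
    u = nu * u + (1 - nu) * g * g                          # Eq. 32
    m_hat = (rho * m / (1 - rho ** (t + 1))
             + (1 - rho) * g / (1 - rho ** t))             # Eq. 33 (constant rho)
    u_hat = nu * u / (1 - nu ** t)                         # Eq. 34
    theta = theta - lr * m_hat / (np.sqrt(u_hat) + delta)  # Eq. 35
    return theta, m, u
```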

3.8. AMSGrad
Another exponential moving average variant is AMSGrad 23 . The purpose of
developing AMSGrad is to guarantee convergence while preserving the benefits
of Adam and RMSProp. In the AMSGrad algorithm, the first and second moment vari-
ables are updated as in Equations 36 and 37. The key difference between AMSGrad
and Adam is shown in Equation 38: AMSGrad maintains the maximum of
all u_t until the present time step and uses this maximum value, instead of u_t as in
Adam, for normalizing the running average of the gradient. By doing this, AMS-
Grad results in a non-increasing step size. Finally, the parameters are updated as
in Equation 39, where Û_t denotes diag(û_t).

$$m_t \leftarrow \rho_{1t} m_{t-1} + (1 - \rho_{1t})\, g_t \qquad (36)$$

$$u_t \leftarrow \rho_2 u_{t-1} + (1 - \rho_2)\, g_t^2 \qquad (37)$$

$$\hat{u}_t \leftarrow \max(\hat{u}_{t-1}, u_t) \qquad (38)$$

$$\theta_{t+1} \leftarrow \Pi_{\mathcal{F}, \sqrt{\hat{U}_t}}\!\left( \theta_t - \epsilon_t\, m_t / \sqrt{\hat{u}_t} \right) \qquad (39)$$

On the other side, the difference of AMSGrad from Adam and AdaGrad is that it
neither increases nor decreases the learning rate and, furthermore, it decreases u_t,
which can potentially lead to a non-decreasing learning rate even if the gradient is large
in future iterations 23 .
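
A sketch of the AMSGrad step for the unconstrained case, where the projection Π reduces to the identity; the small constant delta in the denominator is added only for numerical stability and is not part of Equation 39:

```python
import numpy as np

def amsgrad_step(theta, m, u, u_max, g, lr=0.001, rho1=0.9, rho2=0.999, delta=1e-8):
    """AMSGrad: like Adam, but normalize by the running maximum of the
    second moment, giving a non-increasing step size (Eqs. 36-39)."""
    m = rho1 * m + (1 - rho1) * g                      # Eq. 36
    u = rho2 * u + (1 - rho2) * g * g                  # Eq. 37
    u_max = np.maximum(u_max, u)                       # Eq. 38
    theta = theta - lr * m / (np.sqrt(u_max) + delta)  # Eq. 39 (unconstrained)
    return theta, m, u, u_max
```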

4. Experiments
4.1. Datasets
The performances of optimization algorithms are evaluated on four image datasets,
namely, MNIST 17 , CIFAR-10 14 , Kaggle Flowers 19 and Labeled Faces in the Wild
(LFW) 11 . The well-known dataset MNIST includes 60000 training and 10000 test
examples each of which is a 28×28 gray scale handwritten digit image. CIFAR-10 is
composed of 50000 training and 10000 test examples. They are 32 × 32 color images
belonging to 10 classes. Kaggle Flowers contains 4242 images of 5 different types of
flowers. LFW includes 13233 face images belonging to 5749 people. In supervised
learning experiments, LFW classes that have at least 30 images are chosen. LFW
has 34 classes meeting this requirement; thus, 1777 face images belonging to 34 people
are used for classification. On the other side, all 13233 images are used in the
unsupervised learning experiments. Besides, LFW and Kaggle Flowers include color
images in different sizes. In this study, LFW and Kaggle Flowers are scaled to a
size of 64 × 64 and 96 × 96, respectively. Also, they are randomly divided into two
subsets, 0.75 for training and 0.25 for testing.

4.2. Experimental setting


In this study, both supervised and unsupervised learning tasks are handled. Firstly,
three different CNN architectures are used for the classification task. The first one in-
cludes three convolutional and two fully-connected (FC) layers, similar to the well-
known LeNet-5 17 . The second one has five convolutional and three FC layers
organized in an architecture similar to AlexNet 15 . The last one has seven convolu-
tional layers stacked VGG-style 27 . These architectures are shown in Table
1. All of them have convolutional layers with 3 × 3 kernels in addition
to 2 × 2 max-pooling layers. In order to avoid overfitting, a weight decay of
1e−4 is applied to the convolutional layers. Additionally, dropout with a ratio of 0.50
is applied to the FC layers preceding the output layer. All layers use ReLU as the
activation function except the output layer. The loss function is categorical cross entropy.
Secondly, convolutional autoencoders (CAE) are preferred for the unsupervised
learning task. In general, a CAE comprises two parts, called the encoder and the de-
coder. While the encoder generates a representation of the input data, the decoder
takes this representation as input and reconstructs it in the output. When deciding on
the architecture of the autoencoder, the size of the representation is very important,
because when the size of the representation becomes smaller, the reconstruction of the
images becomes harder. Therefore, two architectures are preferred for the unsuper-
vised learning experiments as described in Table 2. The encoder and decoder have
symmetric convolutional-deconvolutional layers. Here, the difference between the
encoder and decoder is that upsampling layers are used in the decoder instead of max-
pooling layers. The first architecture, CAE-1, reduces the size of the representation to
one quarter, while the second one, CAE-2, reduces it to one half. For example, the
representation of the 64 × 64 images of LFW is 1024 dimensional in CAE-1 and 2048
dimensional in CAE-2. For the unsupervised learning experiments, the reconstruction
loss is mean squared error and the activation function is tanh in the output layer.
The rest of the layers use ReLU.

Table 1. CNN architectures used in supervised learning experiments.

CNN-1         CNN-2         CNN-3
Conv-32       Conv-32       Conv-32
MaxPool       MaxPool       Conv-32
                            MaxPool
Conv-64       Conv-64       Conv-64
MaxPool       MaxPool       Conv-64
                            MaxPool
Conv-128      Conv-128      Conv-128
MaxPool       Conv-128      Conv-128
              Conv-128      Conv-128
FC-128        MaxPool       MaxPool
FC-softmax    FC-128        FC-128
              FC-256        FC-256
              FC-softmax    FC-softmax

Table 2. CAE architectures used in unsupervised learning experiments.

CAE-1         CAE-2
Conv-32       Conv-32
MaxPool       MaxPool
Conv-4        Conv-8
MaxPool       MaxPool
Conv-4        Conv-8
UpSample      UpSample
Conv-32       Conv-32
UpSample      UpSample
Conv-tanh     Conv-tanh

In all experiments, the learning rate ε is 0.01 for SGD and its momentum vari-
ants, and 0.001 for the algorithms with adaptive learning rates. The same learning rate
is used for all adaptive algorithms; SGD and its momentum variants use a larger one
because they do not perform well if the learning rate is smaller. In other words, the
hyperparameters with which the algorithms give the best results on test data are
preferred. The momentum α is 0.9 for the SGD variants. The decay constant ρ is 0.95
for AdaDelta. Also, the decay constants ρ1 and ρ2 are 0.9 and 0.999 for Adam, AdaMax,
Nadam and AMSGrad. The minibatch size is 128 for all models and each of them is
trained for 50 epochs. In order to compare different algorithms, the same parameter
initialization is used. The software is based on Theano 2 and runs on a single GTX
1050 Ti GPU.
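
For reference, the CNN-1 configuration described above might be reproduced roughly as follows in Keras (the paper's implementation uses Theano; this sketch, including the MNIST input shape, the 'same' padding and the choice of Adam as the compiled optimizer, is an assumed illustration rather than the original code):

```python
import tensorflow as tf
from tensorflow.keras import layers, regularizers

def build_cnn1(input_shape=(28, 28, 1), n_classes=10, weight_decay=1e-4):
    """CNN-1: three 3x3 conv layers with 2x2 max-pooling, then FC-128 and softmax."""
    model = tf.keras.Sequential([
        layers.Conv2D(32, (3, 3), padding='same', activation='relu',
                      kernel_regularizer=regularizers.l2(weight_decay),
                      input_shape=input_shape),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), padding='same', activation='relu',
                      kernel_regularizer=regularizers.l2(weight_decay)),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(128, (3, 3), padding='same', activation='relu',
                      kernel_regularizer=regularizers.l2(weight_decay)),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(128, activation='relu'),
        layers.Dropout(0.5),  # dropout on the FC layer preceding the output
        layers.Dense(n_classes, activation='softmax'),
    ])
    # Adaptive methods use a learning rate of 0.001 in the paper; SGD variants use 0.01.
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001,
                                                     beta_1=0.9, beta_2=0.999),
                  loss='categorical_crossentropy', metrics=['accuracy'])
    return model
```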

4.3. Supervised learning experiments


In order to compare the optimization algorithms, all datasets are classified for each
CNN architecture. The classification results for the basic dataset MNIST are given
in Table 3. For all architectures, AdaDelta and AdaMax give the best accuracies
on test data. On the other side, SGD with Nesterov momentum follows AdaDelta
and AdaMax very closely for CNN-3. The worst performance belongs to AdaGrad,
which is the simplest adaptive learning rate algorithm. The behaviour of the algorithms
during training is shown in Figure 1. SGD begins training with the highest loss for
all three architectures while Nadam begins with the lowest. Therefore, it can be
said that Nadam makes a better start to training with the same parameter
initialization. Adam and its variants give similar results to each other for CNN-1.
However, the difference between their performances becomes apparent when the network
becomes deeper. Besides, AdaDelta performs better than AdaGrad and RMSProp
during training.

Table 3. Comparison of the algorithms on MNIST for classification.

                   CNN-1                   CNN-2                   CNN-3
Algorithm          Test Loss   Test Acc.   Test Loss   Test Acc.   Test Loss   Test Acc.
SGD 0.0468 99.00 0.0724 99.03 0.0780 99.24
SGD - momentum 0.0395 99.26 0.0633 99.18 0.0718 99.26
SGD - Nesterov 0.0366 99.34 0.0511 99.35 0.0589 99.41
AdaGrad 0.0600 98.57 0.0753 98.79 0.0730 99.08
AdaDelta 0.0306 99.50 0.0395 99.49 0.0460 99.40
RMSProp 0.0505 99.26 0.1899 98.70 0.1108 99.32
Adam 0.0425 99.35 0.0597 99.00 0.0536 99.29
AdaMax 0.0337 99.37 0.0418 99.51 0.0454 99.42
Nadam 0.0364 99.32 0.0567 99.19 0.0565 99.16
AMSGrad 0.0401 99.24 0.0561 99.28 0.0497 99.36

Fig. 1. The behaviour of algorithms on MNIST during training for three CNN architectures.

The classification results for CIFAR-10 are given in Table 4. As CIFAR-10 in-
cludes color images at a higher resolution than MNIST, the performances of the al-
gorithms begin to change. AdaMax gives the best accuracies on test data for all
architectures. AMSGrad performs close to AdaMax for CNN-1 and CNN-2. On the
other side, Adam and SGD with Nesterov momentum follow AdaMax when the
neural network becomes deeper. Similar to the MNIST results, AdaGrad cannot per-
form well on CIFAR-10. The behaviour of the algorithms during training is shown
in Figure 2. While SGD begins training with the highest loss, Adam begins with
the lowest. Also, AdaMax seems to perform better for deeper architectures.

Table 4. Comparison of the algorithms on CIFAR-10 for classification.

                   CNN-1                   CNN-2                   CNN-3
Algorithm          Test Loss   Test Acc.   Test Loss   Test Acc.   Test Loss   Test Acc.
SGD 0.9428 67.98 0.8899 70.84 1.0112 68.19
SGD - momentum 1.2479 74.96 1.3368 76.74 1.1978 78.21
SGD - Nesterov 1.2235 76.01 1.2909 77.96 1.1651 79.97
AdaGrad 1.2428 56.62 1.1769 59.61 1.1531 60.55
AdaDelta 1.6619 74.08 1.7222 76.23 1.4789 78.55
RMSProp 1.2670 75.01 1.1354 76.14 0.9821 74.81
Adam 1.2279 76.05 1.1988 76.94 0.9393 79.03
AdaMax 0.8054 76.89 1.1240 79.17 1.0437 81.41
Nadam 1.2735 76.11 1.3102 76.60 1.0672 78.98
AMSGrad 1.1940 76.45 1.2000 77.11 1.0135 78.27

Fig. 2. The behaviour of algorithms on CIFAR-10 during training for three CNN architectures.

So far, the algorithms perform well except for SGD and AdaGrad. The change
of depth in the neural network architectures does not significantly affect the performance
rankings of the algorithms. Therefore, the algorithms are also compared on a more
complicated dataset than MNIST and CIFAR-10. LFW consists of face images,
each of which is a 64 × 64 color image belonging to one of 34 people. It includes
more classes than MNIST and CIFAR-10 as well as a higher resolution. Also, it has
a small number of samples per class. This classification task is more difficult
for the optimization algorithms. The classification results for LFW are given in
Table 5. The most obvious result is the decrease in SGD performance: it learns
almost nothing during training. Accordingly, its accuracy on test data
is the lowest. Similarly, AdaGrad also cannot perform well. The best accuracies
on test data are obtained by Adam and RMSProp for CNN-1, AdaDelta
for CNN-2 and Adam for CNN-3. In the previous experiments, the performances
of the SGD momentum variants compete with the adaptive algorithms while training
the neural networks on MNIST and CIFAR-10. However, while training the neural
networks on LFW, the difference between them appears and the superiority of
the adaptive learning algorithms becomes clear. Especially AdaDelta, Adam and its
variants perform much better.

Table 5. Comparison of the algorithms on LFW for classification.

                   CNN-1                   CNN-2                   CNN-3
Algorithm          Test Loss   Test Acc.   Test Loss   Test Acc.   Test Loss   Test Acc.
SGD 2.8016 22.76 3.1141 19.22 3.1152 19.22
SGD - momentum 1.0460 74.70 1.1003 71.66 1.0803 73.18
SGD - Nesterov 0.9470 78.24 0.8571 79.25 0.9930 75.37
AdaGrad 1.6427 58.17 1.8106 52.44 1.7973 52.95
AdaDelta 0.8116 82.29 0.6383 84.31 0.8417 80.10
RMSProp 0.7532 84.48 0.7469 81.78 0.9060 75.54
Adam 0.6887 84.48 0.8219 79.25 0.7223 82.96
AdaMax 0.8734 77.74 0.9822 74.70 0.8617 78.58
Nadam 0.7637 83.81 0.7172 82.96 0.8015 80.26
AMSGrad 0.6795 82.63 0.7648 82.12 0.6726 81.78

Fig. 3. The behaviour of algorithms on LFW during training for three CNN architectures.

The behaviour of the algorithms during training is shown in Figure 3. While SGD
begins training with the highest loss for CNN-1, RMSProp begins with the highest
for CNN-2 and CNN-3. On the other side, AdaDelta begins training with a high
training loss but ends as one of the lowest, giving a good accuracy on test data.
As the network becomes deeper, RMSProp becomes worse in both training loss
and test accuracy. Also, RMSProp sometimes has sudden peaks during training.
This behaviour is more obvious for the deeper networks. While the training losses of
Adam and AMSGrad are very close to each other, AdaMax falls behind its variants. In
general, AdaDelta, Adam, Nadam and AMSGrad perform significantly better
during training. This result can be easily seen especially when the depth of the neural
network increases.
Lastly, the performances of the algorithms are compared on Kaggle Flowers, which
contains 96 × 96 images. The classification results for Kaggle Flowers are reported
in Table 6 and the training process is shown in Figure 4. AdaMax and AMS-
Grad come into prominence according to the accuracies on test data. On the contrary,
SGD gives the worst results. RMSProp begins the training with the highest loss.
However, the initial loss becomes closer for all algorithms when the neural net-
work becomes deeper. Also, the performance of Nadam decreases when the depth
increases. This behaviour of Nadam may arise because of the high resolution of the
images. AdaDelta, Nadam and RMSProp have sudden peaks during training. On
the other side, AdaDelta is not so successful on test data even though it performs
well during training. Interestingly, AdaGrad performs well especially for CNN-3.
Its accuracy on test data is the closest to the best one, which is achieved by
AdaMax. Therefore, AdaGrad seems to perform well for deeper networks as well as
input images with high resolutions.

Table 6. Comparison of the algorithms on Kaggle Flowers for classification.

                   CNN-1                   CNN-2                   CNN-3
Algorithm          Test Loss   Test Acc.   Test Loss   Test Acc.   Test Loss   Test Acc.
SGD 1.0115 59.94 1.0764 59.66 1.3027 44.40
SGD - momentum 1.5339 67.99 1.3234 69.19 1.5486 63.64
SGD - Nesterov 1.5711 68.54 1.4873 68.54 1.7025 64.56
AdaGrad 0.8664 67.25 0.8639 66.60 0.9359 66.88
AdaDelta 1.7850 68.64 1.8371 65.58 2.1974 61.79
RMSProp 1.9709 67.99 1.9051 66.88 1.9653 63.55
Adam 1.9147 69.93 1.4274 68.82 1.8564 62.16
AdaMax 1.2073 69.93 1.1429 71.32 1.0537 70.30
Nadam 1.7687 67.06 1.9668 67.06 1.4252 64.56
AMSGrad 1.8220 70.67 1.6430 71.23 1.9611 61.88

In the supervised learning experiments, the positive effect of momentum for SGD
is conspicuous, especially for the CIFAR-10, LFW and Kaggle Flowers datasets. Also,
Nesterov momentum is slightly better, although the momentum and Nesterov mo-
mentum results are close at the end of training. In general, Adam, AdaMax
and AMSGrad perform well on test data in addition to AdaDelta. Also, the most
important advantage of AdaDelta is that it does not require the learning rate to be
selected manually.

Fig. 4. The behaviour of algorithms on Kaggle Flowers during training for three CNN architectures.

Additionally, the algorithms are compared based on the training time they re-
quire. As LFW includes a small number of training examples, there is no significant
difference between the training times of the algorithms for each CNN architecture in
the supervised learning experiments. On the other hand, MNIST and CIFAR-10 have
more training examples than the other two datasets. Therefore, the training times
vary across the algorithms, as shown in Figure 5. For CNN-1, RMSProp is the fastest
among the adaptive algorithms for both datasets. While AdaDelta and AdaMax are
slow for MNIST, AMSGrad and Nadam require more time for CIFAR-10. When
the dataset becomes more complex, Adam needs more time. AdaGrad is fast
but cannot generalize well on test data. Also, SGD and its momentum variants
seem fast, but they use a larger learning rate and have fewer computational steps
per update. Nevertheless, similar to AdaGrad, they cannot generalize well.
RMSProp, Nadam and AdaMax require more time on CIFAR-10 for CNN-2.
However, the complexity of the dataset does not matter for Adam and AMSGrad.
When the neural network becomes deeper, in CNN-3, Nadam and AdaDelta take
more time in comparison with the others for both datasets. In order to train CNN-1
and CNN-2, the difference between the slowest and fastest algorithms is less than
1 minute. On the other hand, this required time increases to nearly 5 minutes for
CNN-3. AdaMax takes the minimum time among the Adam variants for the deepest
CNN. Additionally, the training time differs when CNN-3 is trained on Kaggle Flow-
ers. As seen in Figure 5, Nadam and AdaDelta take more time. RMSProp requires
less time on Kaggle Flowers in comparison with the other adaptive algorithms.

Fig. 5. (Top) MNIST and CIFAR-10 training time for CNN-1 and CNN-2. (Bottom left) MNIST
training time for CNN-3. (Bottom right) Kaggle Flowers training time for CNN-3.

4.4. Unsupervised learning experiments


When CAE is used, the task is to reconstruct input images in the output layer.
When denoising CAE is used, noisy images are taken as input and the task is to
reconstruct clean images in the output layer. Therefore, Gaussian noise is
applied to the input images before using the denoising CAE. In these experiments, the noise
factor is 0.25. Both vanilla and denoising autoencoder results on MNIST are given
in Table 7. While Adam gives the lowest reconstruction loss for the CAE-1 architecture,
AdaMax and Nadam give the lowest for CAE-2, but the performances of Adam
and its variants are very close to each other. On the other side, Nadam gives the best
results for the denoising CAE. RMSProp performs well on the denoising task in addition
to the Adam variants. The performance of AdaMax slightly decreases for the denoising
task. Besides, AdaGrad falls behind the other adaptive algorithms in both tasks.
The training loss for MNIST is shown in Figure 6, where it can be seen that Adam
and AMSGrad begin the training with the lowest loss.
The results on CIFAR-10, LFW and Kaggle Flowers are similar. Adam and its
variants give the best results. Different from the MNIST results, RMSProp begins train-
ing with the lowest reconstruction loss. However, Adam and its variants give better
results at the end of the training. The results on Kaggle Flowers are interesting.
For CAE-1, the performances of Adam and AMSGrad get worse in the denoising task.
Accordingly, it can be said that when the size of the representation is small, Adam
and AMSGrad cannot perform well on high-resolution images. Also, RMSProp
performs well together with Adam and its variants on Kaggle Flowers for both un-
supervised tasks. Both vanilla and denoising autoencoder results are given in Table
8 for CIFAR-10, Table 9 for LFW and Table 10 for Kaggle Flowers, respectively.
Also, the behaviour of the optimization algorithms during training for the unsu-
pervised learning tasks on CIFAR-10 is shown in Figure 7, LFW in Figure 8 and
Kaggle Flowers in Figure 9.

Table 7. Comparison of the algorithms on MNIST for reconstruction.

                   CAE                        Denoising CAE
Algorithm          CAE-1 Loss   CAE-2 Loss    CAE-1 Loss   CAE-2 Loss
SGD 0.0175 0.0160 0.0219 0.0186
SGD - momentum 0.0095 0.0074 0.0120 0.0103
SGD - Nesterov 0.0094 0.0074 0.0120 0.0102
AdaGrad 0.0113 0.0084 0.0140 0.0125
AdaDelta 0.0080 0.0053 0.0105 0.0077
RMSProp 0.0065 0.0044 0.0087 0.0067
Adam 0.0053 0.0038 0.0082 0.0067
AdaMax 0.0055 0.0036 0.0088 0.0069
Nadam 0.0056 0.0036 0.0080 0.0066
AMSGrad 0.0054 0.0038 0.0083 0.0068

Fig. 6. The behaviour of algorithms on MNIST during training for two autoencoder architectures.
(Top) Training loss for CAE. (Bottom) Training loss for denoising CAE.

Table 8. Comparison of the algorithms on CIFAR-10 for reconstruction.

                   CAE                        Denoising CAE
Algorithm          CAE-1 Loss   CAE-2 Loss    CAE-1 Loss   CAE-2 Loss
SGD 0.0145 0.0152 0.0189 0.0158
SGD - momentum 0.0093 0.0091 0.0119 0.0108
SGD - Nesterov 0.0093 0.0089 0.0119 0.0109
AdaGrad 0.0115 0.0067 0.0167 0.0113
AdaDelta 0.0093 0.0088 0.0113 0.0094
RMSProp 0.0072 0.0059 0.0102 0.0083
Adam 0.0055 0.0047 0.0096 0.0075
AdaMax 0.0061 0.0047 0.0087 0.0074
Nadam 0.0069 0.0046 0.0087 0.0073
AMSGrad 0.0060 0.0049 0.0098 0.0075

Fig. 7. The behaviour of algorithms on CIFAR-10 during training for two autoencoder architec-
tures. (Top) Training loss for CAE. (Bottom) Training loss for denoising CAE.

SGD is by far the worst for both unsupervised tasks on all datasets. In general,
Adam and its variants seem to be much better than the other adaptive gradient
methods as well as SGD and its momentum variants. AdaGrad cannot perform
well among the adaptive algorithms.

Table 9. Comparison of the algorithms on LFW for reconstruction.

                   CAE                        Denoising CAE
Algorithm          CAE-1 Loss   CAE-2 Loss    CAE-1 Loss   CAE-2 Loss
SGD 0.0122 0.0136 0.0125 0.0094
SGD - momentum 0.0070 0.0077 0.0073 0.0063
SGD - Nesterov 0.0070 0.0077 0.0071 0.0063
AdaGrad 0.0071 0.0046 0.0090 0.0069
AdaDelta 0.0061 0.0055 0.0086 0.0066
RMSProp 0.0044 0.0033 0.0062 0.0053
Adam 0.0030 0.0018 0.0044 0.0039
AdaMax 0.0033 0.0019 0.0052 0.0043
Nadam 0.0036 0.0025 0.0050 0.0045
AMSGrad 0.0030 0.0018 0.0049 0.0041

Fig. 8. The behaviour of algorithms on LFW during training for two autoencoder architectures.
(Top) Training loss for CAE. (Bottom) Training loss for denoising CAE.

In order to compare the results visually, the reconstructions of test
images obtained by all algorithms for the CAE-1 architecture are shown in Figure 10.
As handwritten digit images are simpler than the other datasets, all algorithms perform
well on MNIST. Even though the reconstructions of SGD are a little blurry, the
digits can be seen clearly. An important point concerns the Kaggle Flowers results:
SGD and AdaMax cannot reconstruct colors properly. Similar results can be seen
for CAE-2, as shown in Figure 11. Here, it seems that all adaptive algorithms can
reconstruct images with their colors when the size of the representation increases.

Table 10. Comparison of the algorithms on Kaggle Flowers for reconstruction.

                   CAE                        Denoising CAE
Algorithm          CAE-1 Loss   CAE-2 Loss    CAE-1 Loss   CAE-2 Loss
SGD 0.0359 0.0385 0.0367 0.0347
SGD - momentum 0.0217 0.0220 0.0320 0.0217
SGD - Nesterov 0.0217 0.0214 0.0230 0.0219
AdaGrad 0.0298 0.0179 0.0254 0.0234
AdaDelta 0.0220 0.0180 0.0236 0.0203
RMSProp 0.0202 0.0159 0.0216 0.0179
Adam 0.0144 0.0121 0.0269 0.0148
AdaMax 0.0205 0.0133 0.0209 0.0164
Nadam 0.0170 0.0138 0.0208 0.0166
AMSGrad 0.0145 0.0122 0.0270 0.0146

Fig. 9. The behaviour of algorithms on Kaggle Flowers during training for two autoencoder archi-
tectures. (Top) Training loss for CAE. (Bottom) Training loss for denoising CAE.

For the denoising task, the reconstructions of all datasets for CAE-1 are given in
Figure 12. The small size of the representation makes this task more difficult. None of
the algorithms can reconstruct colors properly on Kaggle Flowers. Also, as it is hard to
reconstruct faces and objects, the results are blurry. However, as the representation
size increases, the colors given especially by Adam and its variants are much better,
as shown in Figure 13. Also, while SGD and AdaGrad can reconstruct images with
their colors on CIFAR-10 for CAE-1, both algorithms fail in the denoising CAE-1.

Fig. 10. Reconstructions of test images using CAE-1.

Fig. 11. Reconstructions of test images using CAE-2.



Fig. 12. Reconstructions of test images using denoising CAE-1.

Fig. 13. Reconstructions of test images using denoising CAE-2.



Lastly, the algorithms are compared based on the training time they require.
Similar to the supervised learning experiments, the training time does not vary significantly
on LFW and Kaggle Flowers. The training times on MNIST and CIFAR-10 for the
unsupervised learning experiments are shown in Figure 14. In general, Adam and
Nadam need more time than the other algorithms to reconstruct images. Similar to
the supervised results, when the dataset becomes more complex, they need more time.
Also, AdaMax requires more time on CIFAR-10, as especially seen in CAE-1. All
algorithms need more time for CAE-2 relative to CAE-1. Also, the denoising task mostly
causes the training time to decrease on CIFAR-10 and increase on MNIST. Even though
SGD and its momentum variants train the autoencoders fast using the learning rate
at which they perform best, they cannot generalize well on test data.

Fig. 14. (Top) MNIST and CIFAR-10 training time for CAE-1 and CAE-2. (Bottom) MNIST and
CIFAR-10 training time for denoising CAE-1 and CAE-2.

5. Conclusion
In this study, the most commonly used optimization algorithms in deep learning
are examined. The differences between their working principles are summarized
considering pros and cons for each of them. In this context, the importance of
adaptive learning algorithms is highlighted. The performances of algorithms are
compared on four image datasets empirically. The behaviour of the algorithms

during training is observed according to the effects of different image resolutions
and neural network architectures. Because the adaptive methods are mostly superior
and computationally efficient, their results are better for both tasks. Still,
the research to find better adaptive methods for deep learning continues.

References
1. E. Alpaydın, Introduction to Machine Learning, The MIT Press, 2014.
2. J. Bergstra, O. Breuleux, F. Bastien, P. Lamblin, R. Pascanu, G. Desjardins, J. Turian,
D. Warde-Farley and Y. Bengio, Theano: A CPU and GPU math compiler in Python,
In Proc. 9th Python in Science Conference, Austin, Texas, 2010, pp. 3–10.
3. L. Bottou, F.E. Curtis and J. Nocedal, Optimization methods for large-scale machine
learning, Siam Review, 60(2) (2018) 223–311.
4. Y. Dauphin, H. De Vries, J. Chung and Y. Bengio, RMSProp and equilibrated adaptive
learning rates for non-convex optimization, CoRR abs/1502.04390, 2015.
5. T. Dozat, Incorporating Nesterov Momentum into Adam, International Conference on
Learning Representations, San Juan, Puerto Rico, 1(2016) 2013-2016.
6. J. Duchi, E. Hazan and Y. Singer, Adaptive subgradient methods for online learning
and stochastic optimization, Journal of Machine Learning Research, 12 (2011) 2121–
2159.
7. A. Geron, Hands-on machine learning with Scikit-learn and Tensorflow, O’Reilly, 2017.
8. I. Goodfellow, Y. Bengio and A. Courville, Deep Learning, The MIT Press, 2016.
9. H. Hayashi, J. Koushik and G. Neubig, Eve: A gradient based optimization method
with locally and globally adaptive learning rates, arXiv preprint arXiv:1611.01505,
2016.
10. G. Hinton, Neural networks for machine learning, Coursera, video lectures, 2012.
11. G.B. Huang, M. Ramesh, T. Berg and E. Learned-Miller, Labeled Faces in the Wild:
A database for studying face recognition in unconstrained environments, Technical
Report, University of Massachusetts, Amherst, 2007, pp. 07–49.
12. H. Huang, C. Wang and B. Dong, Nostalgic Adam: Weighing more of the past gra-
dients when designing the adaptive learning rate, arXiv preprint arXiv:1805.07557,
2018.
13. D. Kingma and J. Ba, Adam: A method for stochastic optimization, arXiv preprint
arXiv:1412.6980, 2014.
14. A. Krizhevsky and G. Hinton, Learning multiple layers of features from tiny images,
Technical Report, University of Toronto, 2009.
15. A. Krizhevsky, I. Sutskever and G. Hinton, ImageNet classification with deep convolu-
tional neural networks, Advances in Neural Information Processing Systems, 25(2012)
1097-1105.
16. Q.V. Le, J. Ngiam, A. Coates, A. Lahiri, B. Prochnow and A.Y. Ng, On optimiza-
tion methods for deep learning, in Proc. 28th International Conference on Machine
Learning, Bellevue, Washington, USA, 2011, pp. 265–272.
17. Y. LeCun, L. Bottou, Y. Bengio and P. Haffner, Gradient-based learning applied to
document recognition, Proceedings of the IEEE, 86(11) (1998) 2278–2324.
18. L. Luo, Y. Xiong, Y. Liu and X. Sun, Adaptive gradient methods with dynamic bound
of learning rate, International Conference on Learning Representations, New Orleans,
2019.
19. A. Mamaev, Flowers Recognition, https://www.kaggle.com/alxmamaev/flowers-
recognition, 2018.
20. M.C. Mukkamala and M. Hein, Variants of RMSProp and Adagrad with logarithmic
regret bounds, in Proc. 34th International Conference on Machine Learning, Sydney,
Australia, 70 (2017) 2545–2553.
21. Y. Nesterov, A method of solving a convex programming problem with convergence
rate O(1/k2 ), Soviet Mathematics Doklady, 27 (1983) 372–376.
22. B.T. Polyak, Some methods of speeding up the convergence of iteration methods,
USSR Computational Mathematics and Mathematical Physics, 4(5) (1964) 1–17.
23. S.J. Reddi, S. Kale and S. Kumar, On the convergence of Adam and beyond, Inter-
national Conference on Learning Representations, Vancouver, Canada, 2018.
24. H. Robbins and S. Monro, A stochastic approximation method, The Annals of Math-
ematical Statistics, 22 (1951) 400–407.
25. S. Ruder, An overview of gradient descent optimization algorithms, arXiv preprint
arXiv:1609.04747, 2016.
26. N. Shazeer and M. Stern, Adafactor: Adaptive learning rates with sublinear memory
cost, arXiv preprint arXiv:1804.04235, 2018.
27. K. Simonyan and A. Zisserman, Very deep convolutional networks for large-scale
image recognition, International Conference on Learning Representations, San Diego,
CA, USA, 2015.
28. D. Soham, M. Anirbit and U. Enayat, Convergence guarantees for RMSProp and
Adam in non-convex optimization and an empirical comparison to Nesterov accelera-
tion, arXiv preprint arXiv:1807.06766, 2018.
29. I. Sutskever, J. Martens, G. Dahl and G. Hinton, On importance of initialization and
momentum in deep learning, International Conference on Machine Learning, Atlanta,
USA, 2013, pp. 1139–1147.
30. M. Zaheer, S. Reddi, D. Sachan, S. Kale and S. Kumar, Adaptive methods for non-
convex optimization, Advances in Neural Information Processing Systems, 31 (2018),
pp. 9793–9803.
31. M.D. Zeiler, ADADELTA: An adaptive learning rate method, arXiv preprint
arXiv:1212.5701, 2012.

Derya Soydaner received her B.Sc., M.Sc. and Ph.D. degrees in statistics from
Mimar Sinan Fine Arts University, Istanbul, Turkey in 2012, 2014 and 2018,
respectively. She mainly studied neural networks throughout her entire postgraduate
education. Currently, she is a faculty member in the Department of Statistics at
Mimar Sinan Fine Arts University. Her research interests include pattern recognition,
machine learning and image processing.
