
A Comparative Analysis of Gradient Descent-Based Optimization Algorithms on Convolutional Neural Networks

E. M. Dogo1, O. J. Afolabi2, N. I. Nwulu3, B. Twala4, C. O. Aigbavboa5
1,2,3Department of Electrical and Electronics Engineering Science, 5Department of Construction Management and Quantity Survey, University of Johannesburg, Johannesburg, South Africa
4School of Engineering, Department of Electrical and Mining Engineering, University of South Africa
{1eustaced, 2afolabij, 3nnwulu, 5caigbavboa}@uj.ac.za, 4twalab@unisa.ac.za

Abstract—In this paper, we perform a comparative evaluation of the seven most commonly used first-order stochastic gradient-based optimization techniques in a simple Convolutional Neural Network (ConvNet) architectural setup. The investigated techniques are Stochastic Gradient Descent (SGD) in its vanilla form (vSGD), with momentum (SGDm), and with momentum and Nesterov acceleration (SGDm+n); Root Mean Square Propagation (RMSProp); Adaptive Moment Estimation (Adam); Adaptive Gradient (AdaGrad); Adaptive Delta (AdaDelta); the Adam extension based on the infinity norm (Adamax); and Nesterov-accelerated Adaptive Moment Estimation (Nadam). We trained the model and evaluated the optimization techniques in terms of convergence speed, accuracy and loss function, using three randomly selected, publicly available image classification datasets. The overall experimental results show that Nadam achieved better performance across the three datasets than the other optimization techniques, while AdaDelta performed the worst.

Keywords— Artificial Intelligence, optimizers, performance measures, deep learning, stochastic gradient descent

I. INTRODUCTION
Convolutional Neural Networks (ConvNets) are one of the state-of-the-art deep learning techniques and architectures that have, over the past few years, produced remarkable results in the computer vision problem domain, for example in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), which witnessed a steady reduction in error rate between 2010 and 2017. In 2017 the winning algorithm achieved a remarkable classification error rate of 2.3%, surpassing the human classification error rate of 5% [1, 2].

Training deep neural networks such as ConvNets is to a large extent a challenging non-convex, high-dimensional optimization problem. Depending on the problem domain and datasets, the goal of a deep learning researcher is to implement a model that produces better and faster results through hyperparameter tuning to minimize the loss function. Optimally tuning the weights in deep neural networks is key to producing accurate model classification or prediction. However, tuning and adjusting the weights depends on finding the weights with the lowest loss (gradient descent) and on the direction of change of the slope during gradient descent (backpropagation). To avoid the problem of becoming trapped in a local minimum, as opposed to obtaining the desired global minimum, as well as other challenges associated with gradient descent such as the curse of dimensionality, many optimization algorithms have been proposed in recent years, the trend being mostly towards adaptive estimation methods that work automatically and require little tuning of the hyperparameters.

Non-convex problems are the most common case for neural networks, hence choosing an optimization strategy that seeks to find the global optimum in these networks is usually challenging, due to the estimation of a very large number of parameters in a high-dimensional search space; an improper optimization technique may cause the network to remain in a local minimum during training without any improvement. Additionally, some neural network researchers intuitively show a preference for the Adam optimizer in all cases, due to its robustness and performance across numerous scenarios. Hence, an investigation is required to analyse the performance of optimizers depending on the model and dataset employed, for a better understanding of their behaviour. Thus, the contribution of this paper is an experimental comparison of the performance of seven (7) well-known and commonly used first-order stochastic gradient descent optimization algorithms on a ConvNet model using three (3) different image classification datasets, revealing how well, how fast and how stably each optimizer was able to find an appropriate and optimal minimum during training, using convergence speed, accuracy and loss function as performance criteria.

The rest of the paper is organized as follows: Section 2 presents a background of key concepts and related works, followed by a brief description of the optimization techniques examined. The experimental results and discussion are presented in Section 3. Finally, Section 4 concludes the paper.

II. BACKGROUND AND LITERATURE REVIEW
A. Overview of Neural Networks Optimization
Over the years, research on the improvement and development of new optimization algorithms has played a vital role in the improvement of deep learning architectures. A neural network is to a very large extent an optimization problem which seeks to find the global optimum through a robust training trajectory and fast convergence using gradient descent algorithms [3]. The optimization problem in data fitting is to find model parameter values that are consistent with prior information and give the best fit, with the smallest prediction error, to the observed data. The optimization problem is mathematically defined as [4]:

Let f : ℝⁿ → ℝ. Find

    x̂ = arg min f(x), x ∈ ℝⁿ,

where f is called the objective function or cost function and x̂ is the minimizer of the objective function f.

B. Gradient Descent Variants
Gradient descent is a way to minimize an objective function f(x), parameterized by a model's parameters x ∈ ℝⁿ, by updating the parameters in the opposite direction of the gradient of the objective function ∇ₓf(x) with respect to the parameters; how closely the local minimum is approached is determined by the learning rate [3]. Generally, there are three variants of gradient descent: 1) Batch Gradient Descent (BGD), which uses all rows of the dataset for each update of the weights and is hence referred to as a deterministic approach; 2) Stochastic Gradient Descent (SGD), which uses a single training example at a time (one row after another) and adjusts the weights iteratively for each row; and 3) a hybrid of BGD and SGD called mini-batch gradient descent, which uses more than one training example at a time [5]. They differ in the amount of data used to compute the gradient of the objective function, typically trading off the accuracy of the parameter update against the time it takes to perform the update [3]. BGD works well in convex scenarios, where finding a local minimum is an assurance of reaching the global minimum; however, highly non-convex scenarios are common for neural networks, and the challenge is to avoid getting trapped in suboptimal local minima. For this and other reasons, such as choosing and adjusting learning rates, gradient descent optimization algorithms have been developed over the years for deep learning to mitigate the challenges associated with gradient descent.
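To make the three variants concrete, the sketch below shows a single parameter update in NumPy. This is an illustrative sketch only, not code from the paper; the function and argument names (loss_grad, params, lr, batch_size) are assumptions introduced here.

```python
# Minimal sketch of one update step for BGD, SGD and mini-batch gradient descent.
# loss_grad(params, X, y) is an assumed user-supplied function returning dL/dparams.
import numpy as np

def gradient_descent_step(params, X, y, loss_grad, lr=0.01, batch_size=None):
    """batch_size=None  -> batch gradient descent (all rows, deterministic)
       batch_size=1     -> stochastic gradient descent (one example per update)
       otherwise        -> mini-batch gradient descent (the usual compromise)"""
    n = X.shape[0]
    if batch_size is None:
        batch_size = n
    idx = np.random.permutation(n)[:batch_size]      # rows used for this update
    grad = loss_grad(params, X[idx], y[idx])
    return params - lr * grad                        # step opposite the gradient

# Example gradient for a least-squares linear model y ≈ X @ w:
def lsq_grad(w, Xb, yb):
    return 2.0 * Xb.T @ (Xb @ w - yb) / len(yb)
```

The momentum and Nesterov-style variants compared later (SGDm, SGDm+n) add a velocity term on top of this basic step.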
Gradient descent optimization is a well-studied area of deep learning research. Over the years, different optimization algorithms have been developed by researchers to improve on vanilla SGD and enhance the performance of deep convolutional neural networks. Optimization is one of the key components in deep learning: it helps a model train better during backpropagation, when the weights are adjusted to minimize the loss error, and also addresses the curse of dimensionality problem [5]. The most popular optimization algorithms used in deep learning libraries such as Caffe, Lasagne and Keras are SGD [6], RMSProp [7], AdaGrad [8], AdaDelta [9], Adam [10, 11], Adamax [10] and Nadam [12]. Researchers continue to develop optimizers to achieve better generalization, such as in [13].

CNN is the most widely used deep learning model in the areas of computer vision, image processing, speech recognition and natural language processing (NLP). A CNN is a specific kind of neural network normally used for processing data with a grid-like topology, such as 1D time series data sampled at regular intervals and 2D grids of image pixels [5]. A CNN employs a convolutional mathematical operation in its convolutional layers with a weighting function w(a), where a is the age of a measurement in a time series. Applying this weighted average operation at every time interval, the convolution operation is generally defined as:

    s(t) = (x * w)(t) = ∫ x(a) w(t - a) da                          (1)

where x is the input, w is the kernel or filter, and s is the output, called the feature map, for a continuous time series t. The discrete convolution operation, under the assumption that x and w are defined on integer values of t, takes the following form:

    s(t) = (x * w)(t) = Σ_{a = -∞}^{∞} x(a) w(t - a)                (2)

For a 2-dimensional input image I and a 2-dimensional kernel K, the cross-correlation form of convolution is described as:

    S(i, j) = (I * K)(i, j) = Σ_m Σ_n I(i + m, j + n) K(m, n)       (3)
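As a purely didactic illustration of Eq. (3), the "valid" cross-correlation of an image with a kernel can be written directly in NumPy. This sketch is ours, not the authors' implementation, whose model relies on Keras convolution layers instead.

```python
import numpy as np

def cross_correlate2d(I, K):
    """Valid cross-correlation S(i, j) = sum_m sum_n I(i+m, j+n) * K(m, n), as in Eq. (3)."""
    kh, kw = K.shape
    out_h, out_w = I.shape[0] - kh + 1, I.shape[1] - kw + 1
    S = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            S[i, j] = np.sum(I[i:i + kh, j:j + kw] * K)   # elementwise product, then sum
    return S

# A 5x5 input with a 3x3 averaging kernel yields a 3x3 feature map:
feature_map = cross_correlate2d(np.arange(25.0).reshape(5, 5), np.ones((3, 3)) / 9.0)
```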
A CNN differs from the usual neural network layers by having: 1) sparse interactions, whereby the kernel is made smaller than the input so that only meaningful features are used, which translates to fewer parameters and operations to compute the output and better statistical efficiency of the model; 2) parameter sharing, which refers to the use of one weight value or parameter for multiple functions in a model; and 3) equivariant representation, whereby the output changes in the same manner as the input changes [5].

A comparison of five stochastic gradient descent based optimization methods, namely SGD and SGD with momentum, Adam, AdaGrad, AdaDelta and RMSProp, was carried out by the authors in [14], based on convergence time, number of fluctuations and parameter update rate, using different numbers of iterations and specific test function values on a Stacked denoising Autoencoder (SdA) architecture. Based on their experimental findings, AdaDelta had superior performance compared to the other optimizers in terms of fast convergence; however, it is unclear what dataset was used for their experiment. In [15], the author carried out a performance comparison of four gradient descent-based variants, namely gradient descent, stochastic gradient descent, semi-stochastic gradient descent and stochastic average gradient, on logistic and softmax regression on synthetic data and the MNIST handwritten digits dataset respectively, limited to convex objective fitting problems. Stochastic gradient descent generally performed better than GD in the two experiments conducted, whereas the two hybrid variants produced better accuracy within reasonable time. Similarly, a comparative performance study of gradient descent (GD) and SGD was conducted in [16], using accuracy, training time and convergence time as performance metrics, on the MNIST dataset for linear regression and multinomial logistic regression. A recent study [3] gave an intuitive overview of the behaviour of modern gradient descent optimization algorithms; the author's opinion was to use adaptive learning rate methods when training deep and complex neural network models and when faster convergence is desired. Also highlighted were the challenges associated with these optimizers, together with strategies that could be utilised to improve gradient descent optimization. The authors in [17], through practical experimentation, demonstrated sources of gradient descent performance failures in deep learning, attributing these failures to subtle problems such as the informativeness of the gradients, the signal to noise ratio (SNR), and bad conditioning associated with activations, which lead to vanishing gradients and a slow training process.

From these related works, it appears that an extensive study comparing the performance effects of these popular and widely used optimization algorithms on a convolutional neural network architecture using image classification datasets has not been conducted. Most studies have investigated GD versus SGD, or a couple of these optimizers against certain metrics and models. Others have intuitively shed light on tricks and strategies to improve the optimization algorithms and on common problems associated with these optimizers. Ours is a practical experiment to find out the impact of these seven well utilised optimizers, using three standard image classification datasets on a simplified convolutional architecture.

III. EXPERIMENTAL SETUP AND RESULTS
We carried out our experiments on three image classification dataset problems using a ConvNet model. We examined the effectiveness of seven (7) popular stochastic gradient based optimization algorithms used in deep learning on the built model. The optimizers we compared were SGD (vanilla, with momentum, and with Nesterov momentum), RMSProp, Adam, Adamax, AdaGrad, AdaDelta and Nadam, while the datasets used were 'Cats and Dogs', 'Fashion MNIST' and 'Natural Images', obtained from Kaggle's public dataset platform. Generally, we applied all the optimization algorithms to each dataset using our ConvNet model. We maintained the same hyperparameter settings for all experiments. Dropout was added to the output layer of the ConvNet model to address overfitting, and the whole network was trained over 450 epochs for the 'Fashion MNIST' and 'Cats and Dogs' datasets but over 150 epochs for the 'Natural Images' dataset. The Natural Images dataset was trained over 150 epochs so that the average number of training hours (5 hours) is kept constant for all experiments. This does not have any negative impact on the Natural Images models, since convergence was already approached at about 140 epochs for most of the optimizers.

A. ConvNet Model
The ConvNet architecture employed in this study has 3x3-sized filters across all convolution layers and 3x3 MaxPooling with a stride size of 1, followed by a fully connected layer. The model was built with the Keras deep learning library and trained using Kaggle's NVIDIA K80 GPU with 12GB VRAM for about 5 hours per optimizer. Dropout and data augmentation were the two regularization techniques applied during the training of the model. The Rectified Linear Unit (ReLU) nonlinearity activation function was applied to each convolutional layer. A summary of the ConvNet configuration applied in this study is outlined in Table 1.

TABLE 1. SUMMARY OF CONVNET CONFIGURATIONS

| Dataset-1: Cats and Dogs              | Dataset-2: Fashion MNIST                   | Dataset-3: Natural Images               |
| Input image: 64x64x3-channel RGB      | Input image: 28x28x1-channel grayscale     | Input image: 150x150x3-channel RGB      |
| Conv3x3-32; stride=1                  | Conv3x3-32; stride=1                       | Conv3x3-32; stride=1                    |
| ReLU (nonlinearity function), applied across all three columns                                                          |
| Pooling layer: MaxPooling 2x2; stride=1                                                                                 |
| Conv3x3-32; stride=1                  | Conv3x3-32; stride=1                       | Conv3x3-32; stride=1                    |
| ReLU (nonlinearity function)                                                                                            |
| Pooling layer: MaxPooling 2x2; stride=1                                                                                 |
| Conv3x3-64; stride=1                  | Conv3x3-64; stride=1                       | Conv3x3-64; stride=1                    |
| ReLU (nonlinearity function)                                                                                            |
| Pooling layer: MaxPooling 2x2; stride=1                                                                                 |
| Flattening (input layer of the fully connected network)                                                                 |
| FC-64 layer                                                                                                             |
| ReLU (nonlinearity function)                                                                                            |
| Dropout = 0.5 (addressing overfitting)                                                                                  |
| Binary cross-entropy loss             | Categorical cross-entropy loss             | Categorical cross-entropy loss          |
| Sigmoid output layer (ɸ)              | Softmax output layer (σ)                   | Softmax output layer (σ)                |
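As a concrete, hedged illustration of Table 1, the sketch below builds the Cats and Dogs column of the configuration in Keras (which the paper states was used) and loops over the optimizers. Padding, learning rates, batch size and the data generators (train_gen, val_gen) are not reported in the paper and are therefore assumptions or left out here.

```python
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout

def build_convnet(input_shape=(64, 64, 3)):
    """ConvNet following the Cats and Dogs column of Table 1."""
    return Sequential([
        Conv2D(32, (3, 3), strides=1, activation='relu', input_shape=input_shape),
        MaxPooling2D(pool_size=(2, 2), strides=1),
        Conv2D(32, (3, 3), strides=1, activation='relu'),
        MaxPooling2D(pool_size=(2, 2), strides=1),
        Conv2D(64, (3, 3), strides=1, activation='relu'),
        MaxPooling2D(pool_size=(2, 2), strides=1),
        Flatten(),
        Dense(64, activation='relu'),
        Dropout(0.5),                       # addresses overfitting, as in Table 1
        Dense(1, activation='sigmoid'),     # binary output for Cats vs Dogs
    ])

# The comparison then reduces to rebuilding the model once per optimizer and
# training under identical settings (450 epochs for this dataset):
for opt in ['sgd', 'rmsprop', 'adagrad', 'adadelta', 'adam', 'adamax', 'nadam']:
    model = build_convnet()
    model.compile(optimizer=opt, loss='binary_crossentropy', metrics=['accuracy'])
    # model.fit(train_gen, validation_data=val_gen, epochs=450)  # generators assumed
```

The momentum and Nesterov SGD variants would be instantiated explicitly, e.g. tf.keras.optimizers.SGD(momentum=0.9, nesterov=True); the paper does not report the momentum coefficient, so 0.9 here is only a placeholder.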

B. Datasets
In this work, we chose three (3) popular image classification datasets to evaluate our model with the different optimization algorithms.

1) Performance Assessment of Dataset-1 (Cats and Dogs): This dataset was provided by Microsoft Research and is referred to as the Asirra dataset; it is made up of 64x64-pixel coloured images of dogs and cats. The training set contains 25,000 images, comprising 12,500 images of dogs and 12,500 images of cats, with a test set of 12,500 images [18]. The simulation results are shown in Table 2 and in Figures 1-9.

2) Performance Assessment of Dataset-2 (Fashion MNIST): Fashion-MNIST is a dataset of Zalando's article images, consisting of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28x28 grayscale image associated with a label from 10 classes, with 7,000 images per class. Zalando intends Fashion-MNIST to serve as a direct drop-in replacement for the original MNIST dataset for benchmarking machine learning algorithms, as it shares the same image size and structure of training and testing splits [19]. The simulation results are shown in Table 3 and in Figures 10-18.

3) Performance Assessment of Dataset-3 (Natural Images): This dataset is made up of 6,899 coloured images of 150x150 pixels from 8 distinct classes, namely airplane, car, cat, dog, flower, fruit, motorbike and person, obtained from various sources, and is used as a benchmark dataset for the work in [20]. The simulation results are shown in Table 4 and in Figures 19-27.
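For orientation only, a minimal sketch of how the Fashion-MNIST split described above can be loaded in Keras; the Kaggle-hosted Cats and Dogs and Natural Images folders are assumed to be arranged on disk for a directory-based generator (the directory names below are illustrative, not from the paper).

```python
from tensorflow.keras.datasets import fashion_mnist
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# 60,000 training and 10,000 test examples of 28x28x1 grayscale images [19]
(x_train, y_train), (x_test, y_test) = fashion_mnist.load_data()
x_train = x_train.reshape(-1, 28, 28, 1) / 255.0
x_test = x_test.reshape(-1, 28, 28, 1) / 255.0

# Directory-based datasets (Cats and Dogs at 64x64, Natural Images at 150x150)
gen = ImageDataGenerator(rescale=1.0 / 255)   # augmentation options omitted
# train_flow = gen.flow_from_directory('data/cats_and_dogs/train', target_size=(64, 64),
#                                      class_mode='binary')
```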

TABLE 2. RESULTS FOR CATS & DOGS DATASET

| Optimizer | Convergence time (hrs) | Accuracy | Loss  |
| vSGD      | 2.625                  | 0.643    | 0.670 |
| SGDm      | 2.375                  | 0.745    | 0.525 |
| SGDm+n    | 2.250                  | 0.736    | 0.531 |
| AdaGrad   | 2.500                  | 0.675    | 0.649 |
| RMSProp   | 2.125                  | 0.848    | 0.429 |
| Adam      | 2.625                  | 0.829    | 0.400 |
| AdaDelta  | 2.625                  | 0.624    | 0.687 |
| Adamax    | 2.250                  | 0.826    | 0.437 |
| Nadam     | 2.625                  | 0.855    | 0.362 |

TABLE 3. RESULTS FOR FASHION MNIST DATASET

| Optimizer | Convergence time (hrs) | Accuracy | Loss  |
| vSGD      | 2.500                  | 0.373    | 1.886 |
| SGDm      | 2.375                  | 0.585    | 1.278 |
| SGDm+n    | 2.625                  | 0.574    | 1.317 |
| AdaGrad   | 2.625                  | 0.404    | 2.048 |
| RMSProp   | 2.500                  | 0.458    | 1.648 |
| Adam      | 2.625                  | 0.720    | 0.920 |
| AdaDelta  | 2.750                  | 0.272    | 2.220 |
| Adamax    | 2.250                  | 0.622    | 1.189 |
| Nadam     | 2.625                  | 0.712    | 0.941 |

Fig. 1. SGD vanilla
Fig. 2. SGD momentum
Fig. 3. SGD momentum + nesterov
Fig. 4. RMSProp
Fig. 5. Nadam
Fig. 6. Adamax

Fig. 7. Adam
Fig. 8. AdaGrad
Fig. 9. AdaDelta

Fig. 10. SGD vanilla
Fig. 11. SGD momentum
Fig. 12. SGD momentum + nesterov
Fig. 13. RMSProp
Fig. 14. Nadam
Fig. 15. Adamax

Fig. 16. Adam
Fig. 17. AdaGrad
Fig. 18. AdaDelta

TABLE 4. RESULTS FOR NATURAL IMAGES DATASET

| Optimizer | Convergence time (hrs) | Accuracy | Loss  |
| vSGD      | 5.056                  | 0.775    | 1.228 |
| SGDm      | 4.861                  | 0.891    | 0.501 |
| SGDm+n    | 5.017                  | 0.851    | 0.584 |
| AdaGrad   | 5.017                  | 0.699    | 1.589 |
| RMSProp   | 5.172                  | 0.797    | 0.619 |
| Adam      | 5.056                  | 0.892    | 0.920 |
| AdaDelta  | 5.483                  | 0.739    | 1.669 |
| Adamax    | 4.900                  | 0.909    | 0.440 |
| Nadam     | 5.444                  | 0.913    | 0.225 |

Fig. 19. SGD vanilla
Fig. 20. SGD momentum
Fig. 21. SGD momentum + nesterov
Fig. 22. RMSProp

Fig. 23. Nadam
Fig. 24. Adamax
Fig. 25. Adam
Fig. 26. AdaGrad
Fig. 27. AdaDelta

C. Overall Results Discussion
1) Dataset 1 – Cats and Dogs: Table 2 and Figures 1-9 present the results for the Cats and Dogs image classification problem, which show Nadam outperforming the other optimizers, closely trailed by RMSProp, with no significant difference between Adam (third) and Adamax (fourth). Convergence was at 450 epochs, with an average convergence time of 2.5 hours for all the optimizers. AdaDelta, vSGD and AdaGrad were the weakest performers on this dataset.

2) Dataset 2 – Fashion MNIST: Generally, all the optimizers performed poorly on the Fashion MNIST dataset, as presented in Table 3 and Figures 10-18, which could be because of the simplified ConvNet architecture utilised and the size of the dataset. However, Adam was the best performer on this dataset with an accuracy of 72%, marginally better than Nadam, which achieved an accuracy of 71.2%. The remaining optimizers performed poorly, especially AdaDelta, with an accuracy of just 27.2%. RMSProp not only underperformed on this dataset but was also unstable throughout training: RMSProp could not converge during training, and readings were taken over the first 450 epochs, during which the other optimizers had converged. It is evident that all the optimizers were thoroughly tested on this dataset; performance improvements could, however, be made by tweaking the ConvNet architectural design using Adam and Nadam, which were the better performers on this dataset.

3) Dataset 3 – Natural Images: Table 4 and Figures 19-27 present the results obtained on the Natural Images dataset, with Nadam the best performer in terms of validation accuracy and loss, closely followed by Adamax, and no significant difference between Adam (third) and SGDm (fourth). The remaining optimizers obtained accuracies below 80%, with AdaGrad, at an accuracy of 69.9%, the worst performer on this dataset. It is also worth noting the improvement exhibited by SGDm (89.1%) in comparison to vSGD (77.5%), although performance dropped for SGDm+n (85.1%). Training was taken to converge at 140 epochs, with an average convergence time of 5 hours across all the optimizers.

In terms of overall performance across the three datasets, Nadam was the best performer, attaining an accuracy of 85.5% on the Cats and Dogs dataset, 91.3% on the Natural Images dataset and 71.2% on the Fashion MNIST dataset; however, Nadam's performance was marginally lower than that of Adam (72%), its closest competitor, on the Fashion MNIST dataset. The experiments conducted confirm that the performance of the optimizers is very much dependent on the type and size of the dataset, with each optimizer exhibiting different levels of accuracy and loss across the three datasets.

IV. CONCLUSION
In this paper, a comparative evaluation of the effect of seven optimization algorithms on three image classification datasets using a simple convolutional architecture was carried out. Our findings reveal that the performance of each optimizer varied from dataset to dataset, which confirms the effect that data type and size have on the performance of the different optimizers. Based on the several experiments conducted, our findings show that Nadam exhibited superior and more robust performance across all three datasets evaluated when compared to the other optimization techniques. This could be attributed to the fact that Nadam combines the strengths of the Nesterov Accelerated Gradient (NAG) and Adaptive Moment Estimation (Adam) algorithms, which apparently suits the three datasets and the model examined in this study.

In this study, only one model and three image classification datasets were used to perform all our experiments. It will be interesting to use more than three datasets from different problem domains and to experiment on the comparative effects of these optimizers across a number of different ConvNet architectural designs and deep learning models for the purposes of better generalisation. We leave this for future endeavours.

ACKNOWLEDGEMENT
We would like to thank the University of Johannesburg for funding support as well as the valuable resources to complete this work.

REFERENCES
[1] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A.C. Berg and L. Fei-Fei, "ImageNet Large Scale Visual Recognition Challenge," International Journal of Computer Vision, vol. 115, issue 3, pp. 211-252, 2015.
[2] ImageNet, "Large Scale Visual Recognition Challenge (ILSVRC)," 2018.
[3] S. Ruder, "An overview of gradient descent optimization algorithms," 2016. Retrieved from http://arxiv.org/abs/1609.04747, 29 October 2018.
[4] K. Madsen and H.B. Nielsen, "Introduction to Optimization and Data Fitting," DTU Informatics - IMM, 2010.
[5] I. Goodfellow, Y. Bengio and A. Courville, Deep Learning, MIT Press, London, England, 2016.
[6] T. Schaul, I. Antonoglou and D. Silver, "Unit Tests for Stochastic Optimization," arXiv preprint arXiv:1312.6055, Dec 20, 2013.
[7] G. Hinton, N. Srivastava and K. Swersky, "rmsprop: Divide the gradient by a running average of its recent magnitude," 2012.
[8] J. Duchi, E. Hazan and Y. Singer, "Adaptive Subgradient Methods for Online Learning and Stochastic Optimization," Journal of Machine Learning Research, vol. 12, pp. 2121-2159, 2011.
[9] M.D. Zeiler, "ADADELTA: An Adaptive Learning Rate Method," arXiv preprint arXiv:1212.5701, Dec 22, 2012.
[10] D.P. Kingma and J.L. Ba, "Adam: A Method for Stochastic Optimization," International Conference on Learning Representations (ICLR), 2015.
[11] S.J. Reddi, S. Kale and S. Kumar, "On the Convergence of Adam and Beyond," International Conference on Learning Representations (ICLR), 2018.
[12] I. Sutskever, J. Martens, G. Dahl and G. Hinton, "On the importance of initialization and momentum in deep learning," PMLR, Atlanta, Georgia, USA, pp. 1139-1147, 2013.
[13] K. Lv, S. Jiang and J. Li, "Learning Gradient Descent: Better Generalization and Longer Horizons," arXiv:1703.03633, Mar 10, 2017.
[14] E. Yazan and M.F. Talu, "Comparison of the stochastic gradient descent based optimization techniques," 2017 International Artificial Intelligence and Data Processing Symposium (IDAP), pp. 1-5, Malatya, Turkey, 2017.
[15] G. Papamakarios, "Comparison of stochastic optimization algorithms," School of Mathematics, University of Edinburgh, 2014. Retrieved from https://www.maths.ed.ac.uk/~prichtar/papers/Papamakarios.pdf, 26 October 2014.
[16] R. Hallen, "A Study of Gradient-Based Algorithms," 2017. Retrieved from http://lup.lub.lu.se/student-papers/record/8904399, 29 October 2018.
[17] S. Shalev-Shwartz, O. Shamir and S. Shammah, "Failures of Gradient-Based Deep Learning," arXiv:1703.07950, 2017.
[18] J. Elson, J.R. Douceur, J. Howell and J. Saul, "Asirra: A CAPTCHA that Exploits Interest-Aligned Manual Image Categorization," Proceedings of the 14th ACM Conference on Computer and Communications Security (CCS), pp. 366-374, Oct 28, 2007.
[19] H. Xiao, K. Rasul and R. Vollgraf, "Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms," arXiv:1708.07747, 2017.
[20] P. Roy, S. Ghosh, S. Bhattacharya and U. Pal, "Effects of Degradations on Deep Neural Network Architectures," arXiv:1807.10108, 2018.
