Abstract
Overfitting and long training time are two fundamental challenges in multilayered neural network learning, and deep learning in particular. Dropout and batch normalization are two well-recognized approaches to tackle these challenges. While both approaches share overlapping design principles, numerous research results have shown that they have unique strengths to improve deep learning. Many tools expose these two approaches as simple function calls, allowing them to be flexibly stacked to form deep learning architectures. Although usage guidelines are available, there is unfortunately no well-defined set of rules or comprehensive study that investigates them with respect to data input, network configuration, learning efficiency, and accuracy. It is not clear when users should consider using dropout and/or batch normalization, and how they should be combined (or used as alternatives) to achieve optimized deep learning outcomes. In this paper we conduct an empirical study to investigate the effect of dropout and batch normalization on training deep learning models. We use multilayered dense neural networks and convolutional neural networks (CNN) as the deep learning models, mix dropout and batch normalization to design different architectures, and observe their performance in terms of training and test CPU time, number of parameters in the model (as a proxy for model size), and classification accuracy. The interplay between network structure, dropout, and batch normalization allows us to conclude when and how dropout and batch normalization should be considered in deep learning. The empirical study quantified the increase in training time when dropout and batch normalization are used, as well as the increase in prediction time (important for constrained environments such as smartphones and low-powered IoT devices). It showed that a non-adaptive optimizer (e.g. SGD) can outperform adaptive optimizers, but only at the cost of a significant amount of training time spent on hyperparameter tuning, while an adaptive optimizer (e.g. RMSProp) performs well without much tuning. Finally, it showed that dropout and batch normalization should be used in CNNs only with caution and experimentation (when in doubt and short on time to experiment, use only batch normalization).
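To make the "simple function call" stacking concrete, the sketch below builds a small Keras CNN that mixes the two techniques. It is only an illustrative sketch, not one of the architectures evaluated in the study: the layer sizes, the placement of the BatchNormalization and Dropout layers, the CIFAR-10-sized input, and the choice of RMSProp are assumptions made for the example.

# Illustrative sketch only (assumed layer sizes and placement, not the study's
# exact architectures): dropout and batch normalization added as plain layer
# calls in a small convolutional network using the tf.keras Sequential API.
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(32, 32, 3)),            # CIFAR-10-sized images (assumed)
    layers.Conv2D(32, (3, 3), activation="relu"),
    layers.BatchNormalization(),                # normalize the conv activations
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Dropout(0.25),                       # remove 25% of the units in training
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.BatchNormalization(),
    layers.Dropout(0.5),
    layers.Dense(10, activation="softmax"),     # 10-class output
])

model.compile(optimizer="rmsprop",              # adaptive optimizer, little tuning needed
              loss="categorical_crossentropy",
              metrics=["accuracy"])
model.summary()                                 # parameter count, a proxy for model size

Adding, removing, or repositioning the Dropout and BatchNormalization layers in a stack like this is how the different architectures compared in the study are formed.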
Notes
Note that the Keras API uses this parameter (the dropout rate) to specify the fraction of units to remove, which is the opposite of the dropout paper, where the parameter is the probability of retaining a unit. This paper follows the Keras convention, i.e. the fraction of units to remove (a short check of this convention appears after these notes).
The source code used in the experiments is available on GitHub at https://github.com/fau-masters-collected-works-cgarbin/cap6619-deep-learning-term-project
Note that the dropout network is listed in the top 10 results as “1,024 hidden units”. The number of units is adjusted by the dropout rate, 0.5 in this case, so a dropout network configured to run with 1,024 units in a layer effectively has 2,048 units in that layer.
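The Keras convention described in note 1 can be verified with a short, self-contained check (illustrative only, not part of the study's code): a constant input is passed through a Dropout layer in training mode and the fraction of zeroed units is measured, showing that the rate parameter is the fraction of units removed rather than the retention probability p of the dropout paper.

# Illustrative check of the Keras dropout-rate convention from note 1
# (not part of the study's code).
import numpy as np
import tensorflow as tf

layer = tf.keras.layers.Dropout(rate=0.5)
x = np.ones((1, 10000), dtype="float32")
y = layer(x, training=True)                  # training=True enables dropping
dropped = float(np.mean(y.numpy() == 0.0))   # fraction of units set to zero
print(f"fraction of units removed: {dropped:.2f}")   # close to 0.50, i.e. the rate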
References
Bengio Y (2012) Practical recommendations for gradient-based training of deep architectures
Brock A, Lim T, Ritchie JM, Weston N (2017) Freezeout: accelerate training by progressively freezing layers. arXiv:1706.04983
Cortes C, Vapnik V (1995) Support-vector networks. Mach Learn 20(3):273–297
Deng L, Hinton G, Kingsbury B (2013) New types of deep neural network learning for speech recognition and related applications: an overview. https://doi.org/10.1109/icassp.2013.6639344
Goodfellow IJ, Bengio Y, Courville AC (2016) Deep learning. Adaptive Computation and Machine Learning. MIT Press, Cambridge. http://www.deeplearningbook.org/
Hinz T, Navarro-Guerrero N, Magg S, Wermter S (2018) Speeding up the hyperparameter optimization of deep convolutional neural networks. International Journal of Computational Intelligence and Applications 17(2):1850008. https://doi.org/10.1142/s1469026818500086
Ioffe S, Szegedy C (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift
KerasTeam (2016) Using test data as validation data during training. https://github.com/keras-team/keras/issues/1753
Kohler JM, Daneshmand H, Lucchi A, Zhou M, Neymeyr K, Hofmann T (2018) Towards a theoretical understanding of batch normalization, arXiv:1805.10694
Krizhevsky A (2009) Learning multiple layers of features from tiny images. https://www.cs.toronto.edu/kriz/learning-features-2009-TR.pdf
Krizhevsky A, Sutskever I, Hinton GE (2012) ImageNet classification with deep convolutional neural networks. In: Bartlett PL, Pereira FCN, Burges CJC, Bottou L, Weinberger KQ (eds) Advances in neural information processing systems 25: 26th annual conference on neural information processing systems 2012. Proceedings of a meeting held December 3-6, 2012, Lake Tahoe, Nevada, USA, pp 1106–1114. http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks
Krizhevsky A, Nair V, Hinton G (2019) The cifar-10 dataset. https://www.cs.toronto.edu/kriz/cifar.html
Krizhevsky A, Sutskever I, Hinton GE (2017) ImageNet classification with deep convolutional neural networks. Commun ACM 60(6):84. https://doi.org/10.1145/3065386
Längkvist M, Karlsson L, Loutfi A (2014) A review of unsupervised feature learning and deep learning for time-series modeling. Pattern Recogn Lett 42:11. https://doi.org/10.1016/j.patrec.2014.01.008
LeCun Y (1999) The mnist database of handwritten digits. http://yann.lecun.com/exdb/mnist/
LeCun Y, Bengio Y, Hinton G (2015) Deep learning. Nature 521:436. https://doi.org/10.1038/nature14539
Li X, Chen S, Hu X, Yang J (2018) Understanding the disharmony between dropout and batch normalization by variance shift
Lipton ZC, Berkowitz J, Elkan C (2015) A critical review of recurrent neural networks for sequence learning
Loh WY (2014) Fifty years of classification and regression trees. Int Stat Rev 82:329. https://doi.org/10.1111/insr.12016
Luo P, Wang X, Shao W, Peng Z (2018) Towards understanding regularization in batch normalization
Mishkin D, Sergievskiy N, Matas J (2017) Systematic evaluation of convolution neural network advances on the imagenet. Comput. Vision Image Understanding. https://doi.org/10.1016/j.cviu.2017.05.007. http://www.sciencedirect.com/science/article/pii/S1077314217300814
Morgan N, Bourlard H (1989) Generalization and parameter estimation in feedforward nets: some experiments. In: Touretzky DS (ed) Advances in neural information processing systems 2, [NIPS Conference, Denver, Colorado, USA, November 27-30, 1989]. Morgan Kaufmann, pp 630–637. http://papers.nips.cc/paper/275-generalization-and-parameter-estimation-in-feedforward-nets-some-experiments
Nair V, Hinton GE (2010) Rectified linear units improve restricted Boltzmann machines. In: Fürnkranz J, Joachims T (eds) Proceedings of the 27th international conference on machine learning (ICML-10), June 21-24, 2010, Haifa, Israel. Omnipress, pp 807–814. http://www.icml2010.org/papers/432.pdf
Perez L, Wang J (2017) The effectiveness of data augmentation in image classification using deep learning
Ruder S (2016) An overview of gradient descent optimization algorithms
Rumelhart DE, Hinton G, Williams RJ (1985) Learning internal representations by error propagation. https://doi.org/10.21236/ada164453
Smith LN (2018) A disciplined approach to neural network hyper-parameters: part 1 – learning rate, batch size, momentum, and weight decay
Srivastava N, Hinton G, Krizhevsky A, Sutskever I, Salakhutdinov R (2014) Dropout: a simple way to prevent neural networks from overfitting. J Mach Learn Res 15(1):1929. http://dl.acm.org/citation.cfm?id=2627435.2670313
Keras Team (2019) Keras MNIST CNN example. https://github.com/keras-team/keras/blob/master/examples/mnist_cnn.py
Tieleman T, Hinton G (2012) Lecture 6.5—Rmsprop: divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning
Stanford University (2018) CS231n: convolutional neural networks for visual recognition. http://cs231n.github.io/classification/#nn
Wang X, Gao L, Song J, Shen H (2017) Beyond frame-level CNN: saliency-aware 3-D CNN with LSTM for video action recognition. IEEE Signal Process Lett 24(4):510. https://doi.org/10.1109/LSP.2016.2611485
Wang X, Gao L, Wang P, Sun X, Liu X (2018) Two-stream 3-D convnet fusion for action recognition in videos with arbitrary size and length. IEEE Trans Multimed 20(3):634. https://doi.org/10.1109/TMM.2017.2749159
Zhu X (2005) Semi-supervised learning literature survey. Tech. Rep. 1530, Computer Sciences, University of Wisconsin-Madison
Acknowledgements
This research is sponsored by the US National Science Foundation (NSF) through Grants IIS-1763452 and CNS-1828181.
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
About this article
Cite this article
Garbin, C., Zhu, X. & Marques, O. Dropout vs. batch normalization: an empirical study of their impact to deep learning. Multimed Tools Appl 79, 12777–12815 (2020). https://doi.org/10.1007/s11042-019-08453-9