Dropout vs. batch normalization: an empirical study of their impact to deep learning

Published in: Multimedia Tools and Applications

Abstract

Overfitting and long training time are two fundamental challenges in multilayered neural network learning, and in deep learning in particular. Dropout and batch normalization are two well-recognized approaches to tackle these challenges. While both approaches share overlapping design principles, numerous research results have shown that they have unique strengths for improving deep learning. Many tools expose these two approaches as simple function calls, allowing flexible stacking to form deep learning architectures. Although usage guidelines for them are available, unfortunately there is no well-defined set of rules or comprehensive study investigating them with respect to data input, network configuration, learning efficiency, and accuracy. It is not clear when users should consider using dropout and/or batch normalization, and how they should be combined (or used alternatively) to achieve optimized deep learning outcomes. In this paper we conduct an empirical study to investigate the effect of dropout and batch normalization on training deep learning models. We use multilayered dense neural networks and convolutional neural networks (CNN) as the deep learning models, and mix dropout and batch normalization to design different architectures, subsequently observing their performance in terms of training and test CPU time, number of parameters in the model (as a proxy for model size), and classification accuracy. The interplay between network structures, dropout, and batch normalization allows us to conclude when and how dropout and batch normalization should be considered in deep learning. The empirical study quantified the increase in training time when dropout and batch normalization are used, as well as the increase in prediction time (important for constrained environments, such as smartphones and low-powered IoT devices). It showed that a non-adaptive optimizer (e.g. SGD) can outperform adaptive optimizers, but only at the cost of a significant amount of training time spent on hyperparameter tuning, whereas an adaptive optimizer (e.g. RMSProp) performs well without much tuning. Finally, it showed that dropout and batch normalization should be used in CNNs only with caution and experimentation (when in doubt and short on time to experiment, use only batch normalization).
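As a rough illustration of the kind of architectures compared in this study, the sketch below builds a Keras dense classifier in which dropout and batch normalization can be toggled and the optimizer switched between RMSProp and SGD. It is a minimal sketch under assumed settings (layer sizes, dropout rate, learning rate), not the experimental code, which is linked in the Notes below.

from tensorflow.keras import layers, models, optimizers

def build_model(use_dropout=True, use_batchnorm=True,
                units=1024, dropout_rate=0.5, optimizer="rmsprop"):
    """Dense MNIST-style classifier mixing dropout and batch normalization.

    Illustrative sketch only; all values here are assumptions, not the
    configurations reported in the paper.
    """
    model = models.Sequential([layers.Input(shape=(784,))])
    for _ in range(2):  # two hidden layers
        model.add(layers.Dense(units, activation="relu"))
        if use_batchnorm:
            model.add(layers.BatchNormalization())
        if use_dropout:
            # Keras 'rate' is the fraction of units to drop (see note 2 below).
            model.add(layers.Dropout(dropout_rate))
    model.add(layers.Dense(10, activation="softmax"))

    # An adaptive optimizer (RMSProp) works well with little tuning; SGD can
    # do better but typically requires a learning-rate/momentum search.
    opt = (optimizers.RMSprop() if optimizer == "rmsprop"
           else optimizers.SGD(learning_rate=0.1, momentum=0.9))
    model.compile(optimizer=opt, loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

Timing model.fit and model.predict, reading model.count_params(), and evaluating on a test set then yield the quantities compared in the study: training and prediction CPU time, parameter count (as a proxy for model size), and classification accuracy.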

[Figures 1–15 omitted; see the full article.]


Notes

  1. https://github.com/fau-masters-collected-works-cgarbin/cap6619-deep-learning-term-project

  2. Note that the Keras API uses this parameter to specify the fraction of units to remove (the opposite of the retention probability used in the dropout paper). This paper follows the Keras convention, i.e. the fraction of units to remove (see the sketch after these notes).

  3. The source code used in the experiments is available in Github at https://github.com/fau-masters-collected-works-cgarbin/cap6619-deep-learning-term-project

  4. Note that the dropout network is listed in the top 10 results as “1,024 hidden units”. The number of units is adjusted by the dropout rate (0.5 in this case): a dropout network configured to run with 1,024 units in a layer effectively has 2,048 units in that layer (illustrated in the sketch after these notes).
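To make notes 2 and 4 concrete, the following is a minimal sketch (with assumed values, not the experimental code linked above) of the Keras rate semantics and the unit-count adjustment:

from tensorflow.keras import layers

# Note 2: in Keras, Dropout(rate) gives the fraction of units to DROP, the
# opposite of the retention probability p used in the original dropout paper.
dropout = layers.Dropout(rate=0.5)  # drops ~50% of activations during training

# Note 4: a dropout network listed as "1,024 hidden units" with rate 0.5 is
# adjusted to 1,024 / (1 - 0.5) = 2,048 units actually present in that layer.
listed_units = 1024
rate = 0.5
effective_units = int(listed_units / (1 - rate))  # 2048
hidden = layers.Dense(effective_units, activation="relu")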


Acknowledgements

This research is sponsored by the US National Science Foundation (NSF) through Grants IIS-1763452 and CNS-1828181.

Author information

Corresponding author

Correspondence to Xingquan Zhu.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article

Cite this article

Garbin, C., Zhu, X. & Marques, O. Dropout vs. batch normalization: an empirical study of their impact to deep learning. Multimed Tools Appl 79, 12777–12815 (2020). https://doi.org/10.1007/s11042-019-08453-9
