Batch normalization: accelerating deep network training by reducing internal covariate shift

Published: 06 July 2015

Abstract

Training Deep Neural Networks is complicated by the fact that the distribution of each layer's inputs changes during training, as the parameters of the previous layers change. This slows down the training by requiring lower learning rates and careful parameter initialization, and makes it notoriously hard to train models with saturating nonlinearities. We refer to this phenomenon as internal covariate shift, and address the problem by normalizing layer inputs. Our method draws its strength from making normalization a part of the model architecture and performing the normalization for each training mini-batch. Batch Normalization allows us to use much higher learning rates and be less careful about initialization, and in some cases eliminates the need for Dropout. Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin. Using an ensemble of batch-normalized networks, we improve upon the best published result on ImageNet classification: reaching 4.82% top-5 test error, exceeding the accuracy of human raters.
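As a concrete illustration of the transform the abstract describes, below is a minimal NumPy sketch of the training-time forward pass: each feature is normalized to zero mean and unit variance using mini-batch statistics, then scaled and shifted by learned per-feature parameters (gamma and beta in the paper's notation, with a small epsilon for numerical stability). The function name and shapes are illustrative, not the authors' reference code.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Batch-normalize one layer's inputs over a mini-batch (training-time pass).

    x     : (batch, features) mini-batch of layer inputs
    gamma : (features,) learned scale
    beta  : (features,) learned shift
    eps   : small constant for numerical stability (epsilon in the paper)
    """
    mu = x.mean(axis=0)                    # per-feature mini-batch mean
    var = x.var(axis=0)                    # per-feature mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # zero mean, unit variance per feature
    return gamma * x_hat + beta            # scale and shift restore representational power

# Usage: a mini-batch of 32 examples, 4 features, far from zero mean / unit variance
x = 5.0 * np.random.randn(32, 4) + 3.0
y = batch_norm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=0))  # ~0 for every feature
print(y.var(axis=0))   # ~1 for every feature
```

Note that this covers only training: at inference time the paper replaces the mini-batch statistics with population estimates accumulated during training, so the normalization becomes a fixed linear transform.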



Published In

ICML'15: Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37
July 2015, 2558 pages
Publisher: JMLR.org
DOI: 10.5555/3045118.3045167
