Batch normalization: accelerating deep network training by reducing internal covariate shift

Published: 06 July 2015

Abstract

Training Deep Neural Networks is complicated by the fact that the distribution of each layer's inputs changes during training, as the parameters of the previous layers change. This slows down the training by requiring lower learning rates and careful parameter initialization, and makes it notoriously hard to train models with saturating nonlinearities. We refer to this phenomenon as internal covariate shift, and address the problem by normalizing layer inputs. Our method draws its strength from making normalization a part of the model architecture and performing the normalization for each training mini-batch. Batch Normalization allows us to use much higher learning rates and be less careful about initialization, and in some cases eliminates the need for Dropout. Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin. Using an ensemble of batch-normalized networks, we improve upon the best published result on ImageNet classification: reaching 4.82% top-5 test error, exceeding the accuracy of human raters.
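As a concrete illustration of the transform the abstract describes, below is a minimal NumPy sketch of the training-time forward pass: each feature is normalized to zero mean and unit variance using mini-batch statistics, then scaled and shifted by learned per-feature parameters (gamma and beta in the paper's notation, with a small epsilon for numerical stability). The function name and shapes are illustrative, not the authors' reference code.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Batch-normalize one layer's inputs over a mini-batch (training-time pass).

    x     : (batch, features) mini-batch of layer inputs
    gamma : (features,) learned scale
    beta  : (features,) learned shift
    eps   : small constant for numerical stability (epsilon in the paper)
    """
    mu = x.mean(axis=0)                    # per-feature mini-batch mean
    var = x.var(axis=0)                    # per-feature mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # zero mean, unit variance per feature
    return gamma * x_hat + beta            # scale and shift restore representational power

# Usage: a mini-batch of 32 examples, 4 features, far from zero mean / unit variance
x = 5.0 * np.random.randn(32, 4) + 3.0
y = batch_norm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=0))  # ~0 for every feature
print(y.var(axis=0))   # ~1 for every feature
```

Note that this covers only training: at inference time the paper replaces the mini-batch statistics with population estimates accumulated during training, so the normalization becomes a fixed linear transform.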



Published In

ICML'15: Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37
July 2015, 2558 pages
Publisher: JMLR.org
DOI: 10.5555/3045118.3045167
