Convergence analysis of two-layer neural networks with ReLU activation

Published: 04 December 2017

Abstract

In recent years, stochastic gradient descent (SGD) based techniques have become the standard tools for training neural networks. However, a formal theoretical understanding of why SGD can train neural networks in practice is largely missing. In this paper, we make progress on understanding this mystery by providing a convergence analysis for SGD on a rich subset of two-layer feedforward networks with ReLU activations. This subset is characterized by a special structure called "identity mapping". We prove that, if the input follows a Gaussian distribution, then with the standard O(1/√d) initialization of the weights, SGD converges to the global minimum in a polynomial number of steps. Unlike vanilla networks, the "identity mapping" makes our network asymmetric, so the global minimum is unique. To complement our theory, we also show experimentally that multi-layer networks with this mapping perform better than vanilla networks.
Our convergence theorem differs from traditional non-convex optimization techniques. We show that SGD converges to the optimum in two phases: in phase I, the gradient points in the wrong direction, yet a potential function g gradually decreases; in phase II, SGD enters a nice one-point convex region and converges. We also show that the identity mapping is necessary for convergence: it moves the initial point to a better region for optimization. Experiments verify our claims.
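
To make the setup concrete, below is a minimal numpy sketch of the training problem described in the abstract: a two-layer ReLU network whose first layer carries an identity ("residual") mapping I + W, Gaussian inputs, O(1/√d) weight initialization, and plain SGD in a teacher-student setting. The summed output, the squared loss, the step size, and the teacher network W_star are illustrative assumptions, not the paper's exact formulation.

import numpy as np

# Minimal sketch (assumptions noted above): a two-layer ReLU network whose
# first layer carries an identity mapping, f_W(x) = sum(ReLU((I + W) x)),
# trained by plain SGD on Gaussian inputs in a teacher-student setup.
# The O(1/sqrt(d)) initialization and Gaussian inputs follow the abstract;
# the output aggregation, loss, step size, and teacher are assumptions
# made only to obtain a runnable illustration.

rng = np.random.default_rng(0)
d = 50                                            # input dimension
eta = 1e-3                                        # SGD step size (assumed)
steps = 20000

I = np.eye(d)
W_star = rng.normal(scale=1.0 / np.sqrt(d), size=(d, d))   # hypothetical teacher
W = rng.normal(scale=1.0 / np.sqrt(d), size=(d, d))        # student, O(1/sqrt(d)) init

def forward(W, x):
    """Two-layer net with identity mapping: sum of ReLU((I + W) x)."""
    return np.maximum((I + W) @ x, 0.0).sum()

for t in range(steps):
    x = rng.normal(size=d)                        # Gaussian input, as in the abstract
    y = forward(W_star, x)                        # teacher label
    pre = (I + W) @ x
    y_hat = np.maximum(pre, 0.0).sum()
    # Gradient of the squared loss 0.5 * (y_hat - y)^2 with respect to W:
    # d y_hat / d W_{ij} = 1[pre_i > 0] * x_j (ReLU derivative times input).
    grad = (y_hat - y) * np.outer((pre > 0).astype(float), x)
    W -= eta * grad                               # one SGD step

print("final ||W - W_star||_F:", np.linalg.norm(W - W_star))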

Published In

NIPS'17: Proceedings of the 31st International Conference on Neural Information Processing Systems
December 2017
7104 pages

Publisher

Curran Associates Inc.
Red Hook, NY, United States
