Convergence analysis of two-layer neural networks with ReLU activation

Published: 04 December 2017

Abstract

In recent years, stochastic gradient descent (SGD) based techniques have become the standard tools for training neural networks. However, a formal theoretical understanding of why SGD can train neural networks in practice is largely missing. In this paper, we make progress on understanding this mystery by providing a convergence analysis for SGD on a rich subset of two-layer feedforward networks with ReLU activations. This subset is characterized by a special structure called "identity mapping". We prove that, if the input follows a Gaussian distribution, then with the standard O(1/√d) initialization of the weights, SGD converges to the global minimum in a polynomial number of steps. Unlike vanilla networks, the "identity mapping" makes our network asymmetric, so the global minimum is unique. To complement our theory, we also show experimentally that multi-layer networks with this mapping perform better than vanilla networks.
Our convergence theorem differs from traditional non-convex optimization techniques. We show that SGD converges to the optimum in two phases: in phase I, the gradient points in the wrong direction, yet a potential function g gradually decreases; in phase II, SGD enters a nice one-point convex region and converges. We also show that the identity mapping is necessary for convergence: it moves the initial point to a better region for optimization. Experiments verify our claims.
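
To make the setup concrete, below is a minimal numpy sketch of the training problem described in the abstract: a two-layer ReLU network whose first layer carries an identity ("residual") mapping I + W, Gaussian inputs, O(1/√d) weight initialization, and plain SGD in a teacher-student setting. The summed output, the squared loss, the step size, and the teacher network W_star are illustrative assumptions, not the paper's exact formulation.

import numpy as np

# Minimal sketch (assumptions noted above): a two-layer ReLU network whose
# first layer carries an identity mapping, f_W(x) = sum(ReLU((I + W) x)),
# trained by plain SGD on Gaussian inputs in a teacher-student setup.
# The O(1/sqrt(d)) initialization and Gaussian inputs follow the abstract;
# the output aggregation, loss, step size, and teacher are assumptions
# made only to obtain a runnable illustration.

rng = np.random.default_rng(0)
d = 50                                            # input dimension
eta = 1e-3                                        # SGD step size (assumed)
steps = 20000

I = np.eye(d)
W_star = rng.normal(scale=1.0 / np.sqrt(d), size=(d, d))   # hypothetical teacher
W = rng.normal(scale=1.0 / np.sqrt(d), size=(d, d))        # student, O(1/sqrt(d)) init

def forward(W, x):
    """Two-layer net with identity mapping: sum of ReLU((I + W) x)."""
    return np.maximum((I + W) @ x, 0.0).sum()

for t in range(steps):
    x = rng.normal(size=d)                        # Gaussian input, as in the abstract
    y = forward(W_star, x)                        # teacher label
    pre = (I + W) @ x
    y_hat = np.maximum(pre, 0.0).sum()
    # Gradient of the squared loss 0.5 * (y_hat - y)^2 with respect to W:
    # d y_hat / d W_{ij} = 1[pre_i > 0] * x_j (ReLU derivative times input).
    grad = (y_hat - y) * np.outer((pre > 0).astype(float), x)
    W -= eta * grad                               # one SGD step

print("final ||W - W_star||_F:", np.linalg.norm(W - W_star))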

Published In

NIPS'17: Proceedings of the 31st International Conference on Neural Information Processing Systems
December 2017
7104 pages

Publisher

Curran Associates Inc.
Red Hook, NY, United States
