Envisioning future deep learning theories: some basic concepts and characteristics

  • Position Paper
Science China Information Sciences

Abstract

To advance deep learning methodologies in the next decade, a theoretical framework for reasoning about modern neural networks is needed. While efforts are increasing toward demystifying why deep learning is so effective, a comprehensive picture remains lacking, suggesting that a better theory is possible. We argue that a future deep learning theory should inherit three characteristics: a hierarchically structured network architecture, parameters iteratively optimized using stochastic gradient-based methods, and information from the data that evolves compressively. As an instantiation, we integrate these characteristics into a graphical model called neurashed. This model effectively explains some common empirical patterns in deep learning. In particular, neurashed enables insights into implicit regularization, information bottleneck, and local elasticity. Finally, we discuss how neurashed can guide the development of deep learning theories.

Acknowledgements

This work was supported in part by an Alfred Sloan Research Fellowship and the Wharton Dean’s Research Fund. We would like to thank Patrick CHAO, Zhun DENG, Cong FANG, Hangfeng HE, Qingxuan JIANG, Konrad KORDING, Yi MA, and Jiayao ZHANG for their helpful discussions and comments. We are grateful to two anonymous reviewers for their constructive comments that helped improve the presentation of the paper.

Author information

Corresponding author

Correspondence to Weijie J. Su.

About this article

Cite this article

Su, W.J. Envisioning future deep learning theories: some basic concepts and characteristics. Sci. China Inf. Sci. 67, 203101 (2024). https://doi.org/10.1007/s11432-023-4129-1

  • DOI: https://doi.org/10.1007/s11432-023-4129-1
