Scaling description of generalization with number of parameters in deep learning

Geiger, Mario; Jacot, Arthur; Spigler, Stefano; Gabriel, Franck; Sagun, Levent; d'Ascoli, Stéphane; Biroli, Giulio; Hongler, Clément; Wyart, Matthieu

arXiv:1901.01608v1 (cond-mat)

[Submitted on 6 Jan 2019 (this version), latest version 8 Oct 2019 (v5)]

Abstract: We provide a description for the evolution of the generalization performance of fixed-depth fully-connected deep neural networks, as a function of their number of parameters $N$. In the setup where the number of data points is larger than the input dimension, as $N$ gets large, we observe that increasing $N$ at fixed depth reduces the fluctuations of the output function $f_N$ induced by initial conditions, with $|\!|f_N-{\bar f}_N|\!|\sim N^{-1/4}$ where ${\bar f}_N$ denotes an average over initial conditions. We explain this asymptotic behavior in terms of the fluctuations of the so-called Neural Tangent Kernel that controls the dynamics of the output function. For the task of classification, we predict these fluctuations to increase the true test error $\epsilon$ as $\epsilon_{N}-\epsilon_{\infty}\sim N^{-1/2} + \mathcal{O}( N^{-3/4})$. This prediction is consistent with our empirical results on the MNIST dataset, and it explains in a concrete case the puzzling observation that the predictive power of deep networks improves as the number of fitting parameters grows. This asymptotic description breaks down at a so-called jamming transition, which takes place at a critical $N=N^*$, below which the training error is non-zero. In the absence of regularization, we observe an apparent divergence $|\!|f_N|\!|\sim (N-N^*)^{-\alpha}$ and provide a simple argument suggesting $\alpha=1$, consistent with empirical observations. This result leads to a plausible explanation for the cusp in test error known to occur at $N^*$. Overall, our analysis suggests that once models are averaged, the optimal model complexity is reached just beyond the point where the data can be perfectly fitted, a result of practical importance that needs to be tested in a wide range of architectures and data sets.
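As an illustration of how a scaling relation such as $|\!|f_N-{\bar f}_N|\!|\sim N^{-1/4}$ might be checked numerically, the following minimal sketch fits a power-law exponent by least squares in log-log space. It is not the authors' code: the helper name `fit_power_law_exponent`, the prefactor, the noise level, and the range of $N$ are all assumptions chosen for illustration, and the data are synthetic rather than measurements from the paper.

```python
import numpy as np


def fit_power_law_exponent(n_values, y_values):
    """Fit y ~ prefactor * n**exponent by linear least squares on (log n, log y).

    Hypothetical helper for illustration; returns (exponent, prefactor).
    """
    log_n = np.log(np.asarray(n_values, dtype=float))
    log_y = np.log(np.asarray(y_values, dtype=float))
    # A power law is a straight line in log-log coordinates:
    # log y = exponent * log n + log prefactor
    slope, intercept = np.polyfit(log_n, log_y, 1)
    return slope, np.exp(intercept)


# Synthetic stand-in data (an assumption, NOT results from the paper):
# fluctuations that decay as N**(-1/4) with small multiplicative noise.
rng = np.random.default_rng(0)
n_params = np.logspace(3, 6, 20)  # hypothetical parameter counts
noise = np.exp(rng.normal(0.0, 0.02, size=n_params.size))
fluctuation = 2.0 * n_params ** -0.25 * noise

exponent, prefactor = fit_power_law_exponent(n_params, fluctuation)
print(f"fitted exponent: {exponent:.3f}, fitted prefactor: {prefactor:.3f}")
```

With real measurements, `fluctuation` would be replaced by an estimate of the deviation of each trained network's output from the ensemble average at each $N$. The fit is only meaningful in the asymptotic regime the abstract describes, well above the jamming point $N^*$.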

Subjects:	Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (cs.LG)
Cite as:	arXiv:1901.01608 [cond-mat.dis-nn]
	(or arXiv:1901.01608v1 [cond-mat.dis-nn] for this version)
	https://doi.org/10.48550/arXiv.1901.01608

Submission history

From: Mario Geiger
[v1] Sun, 6 Jan 2019 21:11:25 UTC (221 KB)
[v2] Fri, 18 Jan 2019 20:04:44 UTC (167 KB)
[v3] Thu, 24 Jan 2019 01:04:08 UTC (168 KB)
[v4] Tue, 18 Jun 2019 13:23:00 UTC (227 KB)
[v5] Tue, 8 Oct 2019 08:43:29 UTC (248 KB)
