Scaling description of generalization with number of parameters in deep learning

Geiger, Mario; Jacot, Arthur; Spigler, Stefano; Gabriel, Franck; Sagun, Levent; d'Ascoli, Stéphane; Biroli, Giulio; Hongler, Clément; Wyart, Matthieu

doi:10.1088/1742-5468/ab633c

arXiv:1901.01608 (cond-mat)

[Submitted on 6 Jan 2019 (v1), last revised 8 Oct 2019 (this version, v5)]

Abstract: Supervised deep learning involves the training of neural networks with a large number $N$ of parameters. For large enough $N$, in the so-called over-parametrized regime, one can essentially fit the training data points. Sparsity-based arguments would suggest that the generalization error increases as $N$ grows past a certain threshold $N^{*}$. Instead, empirical studies have shown that in the over-parametrized regime, generalization error keeps decreasing with $N$. We resolve this paradox through a new framework. We rely on the so-called Neural Tangent Kernel, which connects large neural nets to kernel methods, to show that the initialization causes finite-size random fluctuations $\|f_{N}-\bar{f}_{N}\|\sim N^{-1/4}$ of the neural net output function $f_{N}$ around its expectation $\bar{f}_{N}$. These affect the generalization error $\epsilon_{N}$ for classification: under natural assumptions, it decays to a plateau value $\epsilon_{\infty}$ in a power-law fashion $\sim N^{-1/2}$. This description breaks down at a so-called jamming transition $N=N^{*}$. At this threshold, we argue that $\|f_{N}\|$ diverges. This result leads to a plausible explanation for the cusp in test error known to occur at $N^{*}$. Our results are confirmed by extensive empirical observations on the MNIST and CIFAR image datasets. Our analysis finally suggests that, given a computational envelope, the smallest generalization error is obtained using several networks of intermediate sizes, just beyond $N^{*}$, and averaging their outputs.
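The scaling claims in the abstract can be illustrated with a toy numerical sketch. This is not the authors' code or experimental setup: it assumes, purely for illustration, that each network's output at a test point deviates from its expectation by independent Gaussian noise of amplitude $N^{-1/4}$, and that ensemble members are independent. The names `fluctuation_norm` and `fitted_exponent` are hypothetical helpers introduced here. Under these assumptions the fluctuation norm should scale as $N^{-1/4}$, its square as $N^{-1/2}$ (the exponent quoted for the excess generalization error, if one assumes the excess error is proportional to the squared fluctuation), and averaging $K$ networks should shrink the fluctuation by roughly $\sqrt{K}$.

```python
import math
import random


def fluctuation_norm(n_params, n_points=4000, ensemble_size=1):
    """RMS deviation of a toy (ensemble-averaged) output from its expectation.

    Toy assumption, not taken from the paper: at every test point each
    network deviates from the mean output by independent Gaussian noise
    with standard deviation n_params ** (-1/4).
    """
    rng = random.Random(n_params)  # a different, reproducible stream per size
    scale = n_params ** -0.25
    total = 0.0
    for _ in range(n_points):
        # Average the deviations of `ensemble_size` independent toy networks.
        deviation = sum(rng.gauss(0.0, scale) for _ in range(ensemble_size))
        deviation /= ensemble_size
        total += deviation * deviation
    return math.sqrt(total / n_points)


def fitted_exponent(sizes, values):
    """Least-squares slope of log(values) against log(sizes)."""
    xs = [math.log(s) for s in sizes]
    ys = [math.log(v) for v in values]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    return numerator / denominator


sizes = [10**3, 10**4, 10**5, 10**6]
single = [fluctuation_norm(n) for n in sizes]
averaged = [fluctuation_norm(n, ensemble_size=16) for n in sizes]

# Exponent of the fluctuation norm: near -1/4 by construction of the toy model.
print("fluctuation exponent:", fitted_exponent(sizes, single))
# Exponent of the squared fluctuation: near -1/2.
print("squared-fluctuation exponent:", fitted_exponent(sizes, [v * v for v in single]))
# Averaging 16 independent toy networks reduces the fluctuation by about sqrt(16) = 4.
print("reduction from averaging 16 networks:", single[0] / averaged[0])
```

The last printed ratio is the toy-model counterpart of the abstract's closing suggestion: if fluctuations around the mean output are what degrade the test error, averaging several independently initialized networks suppresses them without increasing the size of any single network.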

Comments:	The clarity of the text has been improved: the section "Related works" has been updated and the section "3.1 Regression task" has been added
Subjects:	Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (cs.LG)
Cite as:	arXiv:1901.01608 [cond-mat.dis-nn]
	(or arXiv:1901.01608v5 [cond-mat.dis-nn] for this version)
	https://doi.org/10.48550/arXiv.1901.01608
Related DOI:	https://doi.org/10.1088/1742-5468/ab633c

Submission history

From: Mario Geiger
[v1] Sun, 6 Jan 2019 21:11:25 UTC (221 KB)
[v2] Fri, 18 Jan 2019 20:04:44 UTC (167 KB)
[v3] Thu, 24 Jan 2019 01:04:08 UTC (168 KB)
[v4] Tue, 18 Jun 2019 13:23:00 UTC (227 KB)
[v5] Tue, 8 Oct 2019 08:43:29 UTC (248 KB)
