Scaling laws for learning with real and surrogate data

Jain, Ayush; Montanari, Andrea; Sasoglu, Eren

arXiv:2402.04376 (cs)

[Submitted on 6 Feb 2024 (v1), last revised 28 Jun 2024 (this version, v2)]

Abstract: Collecting large quantities of high-quality data can be prohibitively expensive or impractical, and is a bottleneck in machine learning. One may instead augment a small set of $n$ data points from the target distribution with data from more accessible sources, e.g. data collected under different circumstances or synthesized by generative models. We refer to such data as 'surrogate data.' We introduce a weighted empirical risk minimization (ERM) approach for integrating surrogate data into training. We analyze this method mathematically under several classical statistical models, and validate our findings empirically on datasets from different domains. Our main findings are: $(i)$ Integrating surrogate data can significantly reduce the test error on the original distribution. Surprisingly, this can happen even when the surrogate data is unrelated to the original data. We trace this behavior back to the classical Stein's paradox. $(ii)$ In order to reap the benefit of surrogate data, it is crucial to use optimally weighted ERM. $(iii)$ The test error of models trained on mixtures of real and surrogate data is approximately described by a scaling law. This scaling law can be used to predict the optimal weighting scheme, and to choose the amount of surrogate data to add.
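
The weighted ERM idea in the abstract can be illustrated with a minimal sketch. The abstract does not spell out the objective, so everything below is an assumption made for illustration: a squared loss, a ridge penalty, a single mixing weight `alpha` in [0, 1] that multiplies the surrogate-data average (with `1 - alpha` on the real-data average), and the hypothetical function name `weighted_erm_ridge`. It is not the authors' implementation.

```python
import numpy as np

def weighted_erm_ridge(X_real, y_real, X_surr, y_surr, alpha, lam=1e-3):
    """Illustrative weighted ERM for linear regression (assumed form).

    Minimizes over theta:
        (1 - alpha) * mean((X_real @ theta - y_real)**2)
      +      alpha  * mean((X_surr @ theta - y_surr)**2)
      +        lam  * ||theta||^2
    """
    n, d = X_real.shape
    m = X_surr.shape[0]
    # Setting the gradient of the objective to zero gives A @ theta = b.
    A = ((1 - alpha) / n) * X_real.T @ X_real \
        + (alpha / m) * X_surr.T @ X_surr \
        + lam * np.eye(d)
    b = ((1 - alpha) / n) * X_real.T @ y_real \
        + (alpha / m) * X_surr.T @ y_surr
    return np.linalg.solve(A, b)

def pick_alpha(X_real, y_real, X_surr, y_surr, X_val, y_val, alphas):
    """Choose alpha by mean squared error on held-out real data."""
    errors = []
    for a in alphas:
        theta = weighted_erm_ridge(X_real, y_real, X_surr, y_surr, a)
        errors.append(np.mean((X_val @ theta - y_val) ** 2))
    return alphas[int(np.argmin(errors))]
```

Setting `alpha = 0` recovers ordinary ridge regression on the real data alone, and `alpha = 1` trains on the surrogate data alone. Selecting `alpha` on a held-out split of real data, as in the hypothetical `pick_alpha` helper, is one simple stand-in for the optimal weighting the abstract refers to.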

Comments:	Added new experiments
Subjects:	Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Cite as:	arXiv:2402.04376 [cs.LG]
	(or arXiv:2402.04376v2 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2402.04376

Submission history

From: Ayush Jain
[v1] Tue, 6 Feb 2024 20:30:19 UTC (1,070 KB)
[v2] Fri, 28 Jun 2024 15:36:50 UTC (1,120 KB)
