Scaling laws for learning with real and surrogate data

Jain, Ayush; Montanari, Andrea; Sasoglu, Eren

arXiv:2402.04376v1 (cs)

[Submitted on 6 Feb 2024 (this version), latest version 28 Jun 2024 (v2)]

Abstract: Collecting large quantities of high-quality data is often prohibitively expensive or impractical, and is a crucial bottleneck in machine learning. One may instead augment a small set of $n$ data points from the target distribution with data from more accessible sources, such as public datasets, data collected under different circumstances, or data synthesized by generative models. Blurring distinctions, we refer to such data as 'surrogate data'.
We define a simple scheme for integrating surrogate data into training and use both theoretical models and empirical studies to explore its behavior. Our main findings are: $(i)$ Integrating surrogate data can significantly reduce the test error on the original distribution; $(ii)$ In order to reap this benefit, it is crucial to use optimally weighted empirical risk minimization; $(iii)$ The test error of models trained on mixtures of real and surrogate data is well described by a scaling law. This can be used to predict the optimal weighting and the gain from surrogate data.
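
As an illustration of finding (ii), the following is a minimal sketch of weighted empirical risk minimization on a mixture of real and surrogate data. The abstract does not spell out the objective, so the form used here is an assumption: a convex combination with weight `alpha` on the surrogate loss and `1 - alpha` on the real loss, applied to ridge regression on toy data. The function names (`weighted_ridge`, `pick_alpha`), the toy data, and the grid search over `alpha` on held-out target data are all hypothetical choices for this sketch, not the authors' procedure.

```python
import numpy as np


def weighted_ridge(X_real, y_real, X_sur, y_sur, alpha, lam=1e-3):
    """Ridge regression on a weighted mix of real and surrogate samples.

    Minimizes (1 - alpha) * mean squared error on the real data
    + alpha * mean squared error on the surrogate data + lam * ||w||^2.
    (Assumed objective; not taken from the paper.)
    """
    n, m = len(y_real), len(y_sur)
    d = X_real.shape[1]
    # Normal equations of the weighted objective.
    A = (1 - alpha) / n * X_real.T @ X_real + alpha / m * X_sur.T @ X_sur
    b = (1 - alpha) / n * X_real.T @ y_real + alpha / m * X_sur.T @ y_sur
    return np.linalg.solve(A + lam * np.eye(d), b)


def pick_alpha(X_real, y_real, X_sur, y_sur, X_val, y_val, grid):
    """Return the alpha in `grid` with the lowest held-out error, plus all errors."""
    errs = []
    for alpha in grid:
        w = weighted_ridge(X_real, y_real, X_sur, y_sur, alpha)
        errs.append(float(np.mean((X_val @ w - y_val) ** 2)))
    return grid[int(np.argmin(errs))], errs


# Toy data (hypothetical): the surrogate distribution uses a shifted parameter vector.
rng = np.random.default_rng(0)
d, n, m = 20, 30, 2000
w_true = rng.normal(size=d)
w_sur = w_true + 0.3 * rng.normal(size=d)


def sample(k, w):
    """Draw k Gaussian feature vectors and noisy linear responses."""
    X = rng.normal(size=(k, d))
    return X, X @ w + 0.5 * rng.normal(size=k)


X_real, y_real = sample(n, w_true)    # few samples from the target distribution
X_sur, y_sur = sample(m, w_sur)       # many samples from the surrogate distribution
X_val, y_val = sample(500, w_true)    # held-out target data, used only to score alpha

grid = [0.0, 0.25, 0.5, 0.75, 1.0]
best_alpha, errs = pick_alpha(X_real, y_real, X_sur, y_sur, X_val, y_val, grid)
print(best_alpha, errs)
```

Setting `alpha = 0` recovers training on real data alone and `alpha = 1` training on surrogate data alone, so the grid brackets both extremes. In a realistic setting the held-out target set would be much smaller than in this toy example, which is one reason a predictive scaling law for the optimal weight, as in finding (iii), would be useful.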

Subjects:	Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Cite as:	arXiv:2402.04376 [cs.LG]
	(or arXiv:2402.04376v1 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2402.04376

Submission history

From: Ayush Jain [view email]
[v1] Tue, 6 Feb 2024 20:30:19 UTC (1,070 KB)
[v2] Fri, 28 Jun 2024 15:36:50 UTC (1,120 KB)
