Shape matters: Understanding the implicit bias of the noise covariance

JZ HaoChen, C Wei, J Lee… - Conference on Learning …, 2021 - proceedings.mlr.press
Abstract
The noise in stochastic gradient descent (SGD) provides a crucial implicit regularization effect for training overparameterized models. Prior theoretical work largely focuses on spherical Gaussian noise, whereas empirical studies show that parameter-dependent noise—induced by mini-batches or label perturbation—is far more effective than Gaussian noise. This paper theoretically characterizes this phenomenon on a quadratically-parameterized model introduced by Vaskevicius et al. and Woodworth et al. We show that in an over-parameterized setting, SGD with label noise recovers the sparse ground truth from an arbitrary initialization, whereas SGD with Gaussian noise or gradient descent overfits to dense solutions with large norms. Our analysis reveals that parameter-dependent noise introduces a bias towards local minima with smaller noise variance, whereas spherical Gaussian noise does not.
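
As a concrete illustration of the setup described above, the following sketch (an assumption-laden toy experiment, not the paper's own code) compares SGD with label noise against SGD with spherical Gaussian gradient noise on the quadratically-parameterized model f(x) = <u * u, x> with a sparse ground truth. All specifics—dimension, sample count, sparsity level, step size, noise scale, and initialization—are illustrative choices, not the paper's settings.

import numpy as np

rng = np.random.default_rng(0)

d, n, k = 100, 40, 3              # dimension, samples, sparsity (illustrative)
theta_star = np.zeros(d)
theta_star[:k] = 1.0              # sparse ground truth
X = rng.standard_normal((n, d))
y = X @ theta_star                # clean labels; noise is injected by the optimizer

def run_sgd(noise_type, steps=150_000, lr=2e-3, sigma=0.3):
    """SGD on the loss 0.5 * (<u*u, x> - y)^2 with the chosen noise type."""
    u = np.full(d, 0.5)           # arbitrary (non-vanishing) initialization
    for _ in range(steps):
        i = rng.integers(n)
        x, target = X[i], y[i]
        if noise_type == "label":
            target = target + sigma * rng.choice([-1.0, 1.0])   # label perturbation
        residual = (u * u) @ x - target
        grad = 2.0 * residual * x * u                            # gradient of the loss wrt u
        if noise_type == "gaussian":
            grad = grad + sigma * rng.standard_normal(d)         # spherical Gaussian noise
        u = u - lr * grad
    return u * u                   # recovered parameter theta

for noise_type in ("label", "gaussian"):
    theta = run_sgd(noise_type)
    print(f"{noise_type:8s} noise: ||theta - theta*|| = "
          f"{np.linalg.norm(theta - theta_star):.3f}, "
          f"#|theta_j| > 0.1: {(np.abs(theta) > 0.1).sum()}")

In line with the abstract's claim, one would expect the label-noise run to end closer to the sparse ground truth while the Gaussian-noise run drifts toward a denser solution, though the exact behavior of this toy sketch depends on the hyperparameters chosen here.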