Stochastic gradient descent with random label noises: doubly stochastic models and inference stabilizer

Haoyi Xiong; Xuhong Li; Boyang Yu; Dongrui Wu; Zhanxing Zhu; Dejing Dou

doi:10.1088/2632-2153/ad13ba

1. Introduction

Stochastic gradient descent (SGD) has been widely used as an effective way to train deep neural networks with large datasets [1]. While the mini-batch sampling strategy was first proposed to lower the cost of computation per iteration, it is considered to incorporate an implicit regularizer preventing the learning process from converging to the local minima with poor generalization performance [2–6]. To interpret such implicit regularization, one can model SGD as gradient descent (GD) with gradient noise caused by mini-batch sampling [7]. Studies have demonstrated the potential of such implicit regularization or gradient noise to improve the generalization performance of learning from both theoretical [8–11] and empirical aspects [2, 5, 6]. In summary, gradient noise keeps SGD from converging to a sharp local minima that generalizes poorly [2, 10, 11] and would select a flat minima [12] as the outcome of learning.

In this work, we aim to investigate the influence of random label noise on implicit regularization under mini-batch sampling of SGD. To simplify our research, we assume the training dataset to be a set of vectors $\mathcal{D} = \{x_1,x_2,x_3,\ldots,x_N\}$ . The label $\tilde{y}_i$ for every vector $x_i\in \mathcal{D}$ is the noisy response of the true neural network $f^{\,\,*}(x)$ such that

$\begin{align} \tilde{y}_i = y_i+\varepsilon_i,\ y_i = f^{\,\,*}\left(x_i\right),\ \mathrm{and }\ \mathbb{E}\left[\varepsilon_i\right] = 0,\ \mathrm{var}\left[\varepsilon_i\right] = \sigma^2\ , \end{align} \tag{ 1 }$

where the label noise ε_i is assumed to be an independent zero-mean random variable. In our work, the random label noise can either be (1) drawn from probability distributions before training steps (but re-sampled by mini-batch sampling of SGD) or (2) realized by the random variables per training iteration [13]. Thus, learning is to estimate $\widehat\theta$ in $f(x,\widehat\theta)$ for approximating $f^{\,\,*}(x)$ , such that

$\begin{align} \widehat\theta\gets\underset{\forall\theta\in\mathbb{R}^d}{\mathrm{arg min}}\ \left\{\frac{1}{N}\sum_{i = 1}^N\tilde{L}_i\left(\theta\right): = \frac{1}{2N}\sum_{i = 1}^N\left(f\left(x_i,\theta\right)-\tilde{y}_i\right)^2\right\}. \end{align} \tag{ 2 }$

Note that we denote ${L}^*_i(\theta) = \frac{1}{2}(f(x_i,\theta)-{y}_i)^2$ as the loss based on a noiseless sample in this work. Inspired by [2, 12], our work studies how unbiased label noise (ULN) ε_i ( $1\unicode{x2A7D} i\unicode{x2A7D} N$ ) would affect the 'selection' of $\widehat\theta$ from possible solutions, from the viewpoint of learning dynamics [14] of SGD under mini-batch sampling [10, 15, 16]. For symbols used in this paper, please refer to table 1.

Table 1. Key symbols and definitions.

Symbols	Definitions and equations
x_i , and $y_i = f^{\,\,*}(x_i)$	the ith data point and true label; equation (1)
$\tilde{y}_i$ , and ε_i	the ith noisy label and the label noise; equation (1)
$f(x,\theta)$	the output of neural network with parameter θ and input x; equation (2)
$\widehat{\theta}$	the estimator of parameters of a neural network; equation (2)
$L^*_i$	${L}_i(\theta) = \frac{1}{2}(f(x_i,\theta)-{y}_i)^2$ the loss based on a noiseless sample; equation (2)
$\tilde{L}_i$	$\tilde{L}_i(\theta) = \frac{1}{2}(f(x_i,\theta)-\tilde{y}_i)^2$ the loss based on a noisy sample; equation (2)
SGD without assumptions on label noise
B_k	the mini-batch of samples drawn by the kth step of SGD; equation (3)
$b = \vert B_k\vert$	the constant batch size of Bk; equation (3)
θ_k	the kth step of SGD; equation (3)
V_k	SGD noise caused by mini-batch sampling of loss gradients; equation (5)
η	the learning rate of SGD; equation (3)
${\Theta}(t)$	the continuous-time dynamics of SGD; equation (6)
z_k	the random vector of standard Gaussian; equation (9)
W(t)	the Brownian motion over time; equation (6)
$\overline{\theta}_k$	the kth step of discrete-time approximation to ${\Theta}(t)$ ; equation (7)
SGD with unbiased label noise (ULN)
$\theta^\mathrm{ULN}_k$	the kth step of SGD with unbiased label noise; equation (8)
$\xi^*_k$	SGD noise through mini-batch sampling of TRUE loss gradient; equation (8)
$\xi^\mathrm{ULN}_k$	SGD noise through mini-batch sampling of unbiased label noise; equation (8)
$\Sigma_N^\mathrm{SGD}$	the covariance matrix of the TRUE loss gradients; equations (15)–(17)
$\Sigma_N^\mathrm{ULN}$	the covariance matrix based on unbiased label noise; equations (15)–(17)
${\Theta}^\mathrm{ULN}(t)$	the continuous-time doubly stochastic model; equation (16)
$\overline{\theta}^\mathrm{ULN}_k$	the kth step of discrete-time doubly stochastic model; equation (17)
$W_1(t)$ , and $W_2(t)$	two independent Brownian motions over time; equation (16)
z_k , and $z^{^{\prime}}_k$	two independent random vectors of standard Gaussian; equations (9), (17).
${\Theta}^\mathrm{LNL}(t)$	the continuous-time dynamics under label-noiseless settings; equation (20)

1.1. Background: SGD dynamics and implicit regularization

To analyze the SGD algorithm solving the problem in equation (2), we follow settings in [15] and consider SGD as an algorithm that, in the kth iteration with the estimate θ_k, randomly picks up a b-length subset of samples from the training dataset i.e. $B_k\subset \mathcal{D}$ , and estimates the mini-batch stochastic gradient $\frac{1}{b}\sum_{\forall x_i\in B_k}\nabla \tilde{L}_i(\theta_k)$ , then updates the estimate for $\theta_{k+1}$ based on θ_k, as follows

$\begin{align} \theta_{k+1}\gets\left(\theta_k-\frac{\eta}{\vert B_k\vert}\sum_{\forall x_i\in B_k}\nabla\tilde{L}_i\left(\theta_k\right)\right)\ , \end{align} \tag{ 3 }$

where η refers to the step-size of SGD. Furthermore, we can derive the mini-batch sampled loss gradients into the combination of the full-batch loss gradient and the noise, such that

$\begin{align} &\frac{\eta}{\vert B_k\vert}\sum_{\forall x_i\in B_k}\nabla\tilde{L}_i\left(\theta_k\right)\nonumber\\ = &\frac{\eta}{N}\sum_{\forall x_i\in \mathcal{D}}\nabla\tilde{L}_i\left(\theta_k\right)-\left(\frac{\eta}{N}\sum_{\forall x_i\in \mathcal{D}}\nabla\tilde{L}_i\left(\theta_k\right)-\frac{\eta}{\vert B_k\vert}\sum_{\forall x_i\in B_k}\nabla\tilde{L}_i\left(\theta_k\right)\right)\nonumber\\ = &\frac{\eta}{N}\sum_{\forall x_i\in \mathcal{D}}\nabla\tilde{L}_i\left(\theta_k\right)-\sqrt{\eta}V_k\left(\theta_k\right)\ , \end{align} \tag{ 4 }$

where η refers to the step-size of SGD and $V_k(\theta_k)$ refers to a stochastic gradient noise term caused by mini-batch sampling. The noise will converge to zero with increasing batch size as follows

$\begin{align} V_k\left(\theta_k\right) = \sqrt{\eta}\left(\frac{1}{N}\sum_{\forall x_i\in \mathcal{D}}\nabla\tilde{L}_i\left(\theta_k\right)-\frac{1}{\vert B_k\vert}\sum_{\forall x_i\in B_k}\nabla\tilde{L}_i\left(\theta_k\right)\right) \to \mathbf{0}_d,\ \text{as }\ B\to N\ . \end{align} \tag{ 5 }$

With $\mathbf{d}t = \eta\to 0$ and constant batch size $b = \vert B_k\vert$ , the SGD algorithm will diffuse to a continuous-time dynamics ${\Theta}(t)$ with a stochastic differential equation (SDE), with weak convergence [15, 17], as follows

$\begin{align} \mathbf{d}{\Theta} = -\frac{1}{N}\sum_{i = 1}^N\nabla\tilde{L}_i\left({\Theta}\right)\mathbf{d} t+ \left(\frac{\eta}{b}\tilde{\Sigma}_N^\mathrm{SGD}\left({\Theta}\right)\right)^\frac{1}{2}\mathbf{d}W\left(t\right)\ \end{align} \tag{ 6 }$

where W(t) is a standard Brownian motion in $\mathbb{R}^d$ , and we define $\Sigma_N^\mathrm{SGD}({\Theta})$ as the sample covariance matrix of loss gradients $\nabla L_i({\Theta})$ for $1\unicode{x2A7D} i\unicode{x2A7D} N$ . For detailed derivations of the continuous-time approximation above and the assumptions made, please refer to [15]. We follow [15] and do not make low-rank assumptions on $\tilde{\Sigma}_N^\mathrm{SGD}({\Theta})$ . Through Euler discretization [9, 15], one can approximate SGD as $\overline{\theta}_k$ such that

$\begin{align} &\overline{\theta}_{k+1}\gets\overline{\theta}_k-\frac{\eta}{N}\sum_{\forall x_i\in \mathcal{D}}\nabla\tilde{L}_i\left(\overline{\theta}_k\right)+\sqrt{\eta}\xi_k\left(\overline{\theta}_k\right),\ \text{and}\nonumber\\ &\xi_k\left(\overline{\theta}_k\right) = \left(\frac{\eta}{b}\tilde{\Sigma}_N^\mathrm{SGD}\left(\overline{\theta}_k\right)\right)^\frac{1}{2}z_k,\ z_k\sim\mathcal{N}\left(0,\mathbf{I}_d\right)\ . \end{align} \tag{ 7 }$

The implicit regularizer of SGD is $\xi_k(\overline{\theta}_k) = (\frac{\eta}{b}\tilde{\Sigma}_N^\mathrm{SGD}(\overline{\theta}_k))^\frac{1}{2} z_k$ which is data-dependent and controlled by the learning rate η and batch size B [18]. References [8–10] discuss SGD for varational inference and enabled novel applications to samplers [19, 20]. To understand the effect on generalization performance, previous work [2, 18] studied the escaping behavior from the sharp local minima [6] and convergence to the flat ones. Jia and Su [21] discovered the way that SGD could find a flat local minimum from information-theoretical perspectives and proposed a novel regularizer to improve the performance. Finally, Gidel et al [22] studied regularization effects to linear deep neural networks (DNNs) and our previous work [16] proposed new multiplicative noise to interpret SGD and obtain stronger theoretical properties.

1.2. Our contributions

In this work, we assume the unbiased random label noise ε_i ( $1\unicode{x2A7D} i\unicode{x2A7D} N$ ) and the mini-batch sampler of SGD are independent. When the random label noise has been drawn from probability distributions prior to the training procedure, SGD re-samples the label noise and generates a new type of data-dependent noise, in addition to the stochastic gradient noise of label-noiseless losses, through re-sampling label-noisy data and averaging label-noisy loss gradients of random mini-batches [23, 24].

Our analysis shows that under mild conditions, with gradients of label-noisy losses, SGD might incorporate an additional data-dependent noise term, complementing the stochastic gradient noise [15, 16] of label-noiseless losses, through re-sampling the samples with label noise [24] or dynamically adding noise to labels over iterations [13]. We consider such noise as an implicit regularization caused by ULN, and interpret the effect of such noise as a solution selector of the learning procedure. More specifically, this work has made unique contributions as follows.

1.2.1. Doubly stochastic models

We reviewed the previous work [10, 15, 16, 25] and extended the analytical framework in [15] to interpret the effects of ULN as an additional implicit regularizer on top of the continuous-time dynamics of SGD. Through discretizing the continuous-time dynamics of label-noisy SGD, we write a discrete-time approximation to the learning dynamics, denoted as $\theta^\mathrm{ULN}_k$ for $k = 1,2,\ldots$ , as

$\begin{align} \theta^\mathrm{ULN}_{k+1}\gets\theta^\mathrm{ULN}_k-\frac{\eta}{N}\sum_{i = 1}^N\nabla L_i^*\left(\theta^\mathrm{ULN}_k\right)+\sqrt{\eta}\xi_k^*\left(\theta^\mathrm{ULN}_k\right)+\sqrt{\eta}\xi_k^\mathrm{ULN}\left(\theta^\mathrm{ULN}_k\right), \end{align} \tag{ 8 }$

where $L_i^*(\theta) = (f(x_i,\theta)-f^{\,\,*}(x_i))^2$ refers to the label-noiseless loss function with sample x_i and the true (noiseless) label y_i , and the noise term $\xi_k^*(\theta)$ refers to the stochastic gradient noise [15] of label-noiseless loss function $L_i^*(\theta)$ ; then, we can obtain the new implicit regularizer caused by the ULN for $\forall \theta\in\mathbb{R}^d$ , which can be approximated as follows

$\begin{align} \xi_k^{\mathrm{ULN}}\left(\theta\right)\approx\left(\frac{\eta\sigma^2}{bN}\sum_{i = 1}^N\nabla_\theta f\left(x_i,\theta\right)\nabla_\theta f\left(x_i,\theta\right)^\top\right)^{\frac{1}{2}}z_k,\ \text{and}\ z_k\sim\mathcal{N}\left(\mathbf{0}_d,\mathbf{I}_d\right)\ , \end{align} \tag{ 9 }$

where z_k refers to a random noise vector drawn from the standard Gaussian distribution, θ_k refers to the parameters of the network in the kth iteration, $(\cdot)^{1/2}$ refers to the Chelosky decomposition of the matrix, $\nabla_\theta f(x_i,\theta) = \partial f(x_i,\theta)/\partial \theta$ refers to the gradient of the neural network output for sample x_i over the parameter θ_k, and B and η are defined as the batch size and the learning rate of SGD, respectively. Obviously, the strength of this implicit regularizer is controlled by σ², B and η.

Section 3 formulates the algorithm of SGD with unbiased random label noise as stochastic dynamics based on two noise terms (proposition 1), derives the continuous-time and discrete-time doubly stochastic models from SGD algorithms (definitions 1 and 2), and provides approximation error bounds (proposition 2). Proofs of two propositions are provided in appendices A and B

1.2.2. Inference stabilizer as implicit regularizer

The regularization effects of unbiased random label noise should be

$\begin{align} \mathbb{E}_{z_k}\left\|\xi_k^{\mathrm{ULN}}\left(\theta_k\right)\right\|_2^2\approx\frac{\eta\sigma^2}{bN}\sum_{i = 1}^N \|\nabla_\theta f\left(x_i,{\theta}_k\right)\|_2^2 = \frac{\eta\sigma^2}{bN}\sum_{i = 1}^N \left\|\frac{\partial }{\partial\theta}f\left(x_i,\theta_k\right)\right\|_2^2\ , \end{align} \tag{ 10 }$

where $\nabla_\theta f(x,{\theta})$ refers to the gradient of f over θ and the effects are controlled by the batch size B and the variance of label noises σ². Similar results have been obtained by assuming deep learning algorithms are driven by an Ornstein–Uhlenbeck (OU)-like process [26], while our work does not rely on such an assumption and is all based on our proposed doubly stochastic models.

Section 4 analyzes the implicit regularization effects of unbiased random label noise for SGD, where we conclude the implicit regularizer is a controller of the neural network gradient norm $\frac{1}{N}\sum_{i = 1}^N \|\nabla_\theta f(x_i,{\theta})\|_2^2$ in the dynamics (proposition 3). We then offer remarks 2–4 to characterize the behaviors of SGD with ULN: (1) SGD would escape the local minimums with higher gradient norms, due to the larger perturbation driven by the implicit regularizer, (2) the strength of implicit regularization effects is controlled by the learning rate η and batch size b, and (3) it is possible to tune the performance of SGD through adding and controlling the ULN, as low neural network gradient norms usually correspond to flat loss landscapes.

We validate our findings through a series of experiments. In section 5, we apply self-distillation with unbiased label noise [27, 28] for training deep neural networks. Through the teacher–student learning paradigm, the well-trained model beneficially escapes local minima by learning from its own noisy outputs. In section 6, we visualize implicit regularization effects using SGD-based linear regression with ULN. We observe a Gaussian-like distribution, centered at the solution for linear regression, with its (co-)variance determined by the covariance of data samples, the learning rate, and the batch size. Collectively, the results of these experiments substantiate our theory.

1.2.3. Significance of our contributions

This work establishes a novel framework that provides an in-depth understanding of the influence of ULN in SGD. The study highlights the implicit regularization effects within the learning process and demonstrates how these effects can be quantified through doubly stochastic models. By assuming the independence of two noise sources and using globally bounded assumptions, the paper brings new theoretical insights on SGD dynamics, which are reinforced by empirical evidence obtained from DNNs and linear regression models. Moreover, it provides a comprehensive empirical study on the impacts of additive noise during the self-distillation process and SGD-based linear regression. Therefore, this study significantly contributes to the learning space by elucidating the underlying mechanisms of SGD and broadening our understanding of implicit regularization effects due to ULN, in particular in the context of complex machine learning algorithms.

2. Related work

2.1. SGD implicit regularization for ordinary least squares (OLS)

The most recent and relevant work in this area can be found in references [25, 29], where the same group of authors studied the implicit regularization of GD and SGD for OLS. They investigated an implicit regularizer of $\ell_2$ -norm on the parameter, which regularizes OLS as a ridge estimator with decaying penalty. Prior to these efforts, F. Bach and his group studied the convergence of gradient-based solutions for linear regression with OLS and regularized estimators under both noisy and noiseless settings in [30–32].

Langevin dynamics and gradient noises. With similar agendas, previous works [8–10] studied limiting behaviors of SGD (or steady-state of dynamics) from the perspectives of Bayesian/variational inference. They also proposed novel applications to stochastic gradient MCMC samplers [19, 20]. Through connecting $\Sigma_N^\mathrm{SGD}(\theta)$ to the loss Hessian $1/N\nabla^2 L_i(\theta)$ in near-convergence regions, reference [2] described the escaping behavior from the sharp local minima, while reference [6] discussed this issue in large-batch training settings. Furthermore, reference [18] discussed how learning rates and batch sizes would affect the generalization performance and flatness of optimization results. Finally, reference [22] describes the implicit regularization on linear neural networks and in reference [16] a new multiplicative noise model was proposed to interpret the gradient noise with stronger theoretical properties.

2.2. Self-distillation and noisy students

Self-distillation [27, 28, 33, 34] has been examined as an effective way to further improve the generalization performance of well-trained models. Such strategies enable knowledge distillation using well-trained ones as teacher models and optionally adding noise (e.g. dropout, stochastic depth and label smoothing, or potentially the label noise) onto the training procedure of student models [13].

Deep learning with label noise. The influence of label noise on deep learning has been explored in several studies, including [26, 35, 36]. Some of these findings align with our work; for example, [26, 35] achieved similar results, through different approaches. Blanc et al [26] also viewed SGD as a dynamical system and made similar observations by establishing stronger assumptions (please refer to the Discussion on relevant work). Bar et al [35] applied a spectral analysis to the learned mapping of networks and provided theoretical justification for the observed robustness to label noise. They associated typical smoothness regularization to the suppression of high-frequency components potentially caused by label noise. On a different note, a survey by Song et al [36] reviews deep learning algorithms for label noise. The studies featured in this survey predominantly view label noise as a negative influencer of DNN performance, proposing robust training algorithms as a countermeasure. Our work, however, is more concerned with the potential advantages conferred by ULN. This distinct aspect sets our work apart from other works.

2.3. Discussion on relevant work

Compared to the above works, we still make contributions in the above three categories. First of all, this work characterizes the implicit regularization effects of label noise to SGD dynamics. Compared to linear regression [25, 29], our proposed doubly stochastic model could be used to explain the learning dynamics of SGD with label noise for nonlinear neural networks. Even from linear regression perspectives [25, 29, 32], we precisely measured the gaps between SGD dynamics with and without label noise and provide a new example with numerical simulations to visualize the implicit regularization effects.

Compared to references [28, 37], our analysis emphasized the role of the implicit regularizer caused by label noise as a model selector, where models with high inferential stability would be selected. Li et al [38] is the most relevant work to us, where authors studied the early stopping of GD with label noise via neural tangent kernel (NTK) approximation [39]. Our work oerforms the analysis for SGD without assumptions for approximations such as NTK.

In addition to NTK assumption, in reference [26] it is assumed the deep learning algorithms are driven by an OU-like process and the authors obtain results similar to the inference stabilizer (the third result from our research), while our work makes a contribution by proposing doubly stochastic models and we reach the conclusion in a different way. We also provide the first empirical results and evidence, based on commonly-used DNN architectures and benchmark datasets, to visualize the effects of implicit regularizers caused by ULN in real-world settings.

Please note that an earlier manuscript [40] from us had been put on OpenReview with a discussion, where external reviewers aired their concerns—part of the results had been investigated in [26] and we did not provide the results in a strong form (e.g. theorems or proofs). Therefore, this work shifts the main contributions from implicit regularization of label noise to doubly stochastic models with approximation error bounds and proofs. The implicit regularization effects could be estimated via doubly stochastic models directly without the assumption of an OU process. To the best of our knowledge, this work is the first to understand the effects of ULN on SGD dynamics by addressing technical issues including implicit regularization, OLS, self-distillation, model selection, and stability inference results.

3. Double stochastic models for SGD with unbiased random label noise

In this section, we present SGD with unbiased random label noise, derive the continuous-time/discrete-time doubly stochastic models, and provide convergence of approximation between models.

3.1. Modeling ULN in SGD

In our research, SGD with unbiased random label noise refers to an iterative algorithm that updates the estimate incrementally from initialization, ${\theta}^\mathrm{ULN}_0$ . With mini-batch sampling and unbiased random label noises, in the kth iteration, the SGD algorithm updates the estimate $\theta^\mathrm{ULN}_k$ using the stochastic gradient $\tilde{\mathrm{g}}_k(\theta^\mathrm{ULN}_k)$ through a GD rule, such that

$\begin{align} \theta^\mathrm{ULN}_{k+1}\gets \theta^\mathrm{ULN}_k-\eta \tilde{\mathrm{g}}_k\left(\theta^\mathrm{ULN}_k\right)\ . \end{align} \tag{ 11 }$

Specifically, in the kth iteration, SGD randomly picks up a batch of samples $B_k\subseteq\mathcal{D}$ to estimate the stochastic gradient, as follows

$\begin{align} \eta\tilde{\mathrm{g}}_k\left(\theta^\mathrm{ULN}_k\right) & = \frac{\eta}{\vert B_k\vert}\sum_{x_i\in B_k}\nabla \tilde{L}_i\left(\theta^\mathrm{ULN}_k\right) = \frac{\eta}{\vert B_k\vert}\sum_{x_i\in B_k}\left(\left(f\left(x_i,\theta_k^\mathrm{ULN}\right)-y_i\right)-\varepsilon_i\right)\cdot\nabla_\theta f\left(x_i,\theta_k^\mathrm{ULN}\right)\nonumber\\ & = \frac{\eta}{N}\sum_{i = 1}^N\nabla {L}^*_i\left(\theta^\mathrm{ULN}_k\right)+\sqrt{\eta}\xi_{k}^*\left(\theta^\mathrm{ULN}_k\right)+\sqrt{\eta}\xi_k^\mathrm{ULN}\left(\theta^\mathrm{ULN}_k\right), \end{align} \tag{ 12 }$

where $\nabla {L}^*_i(\theta)$ for $\forall\theta\in\mathbb{R}^d$ refers to the loss gradient based on the label-noiseless sample $(x_i,y_i)$ and $y_i = f^{\,\,*}(x_i)$ , $\xi^*_{k}(\theta)$ refers to stochastic gradient noises [15] through mini-batch sampling over the gradients of label-noiseless samples, and $\xi_{k}^\mathrm{ULN}(\theta)$ is an additional noise term caused by the mini-batch sampling and the unbiased random label noises, such that

$\begin{align} \nabla {L}^*_i\left(\theta\right)& = \frac{\partial}{\partial\theta}\frac{\left(f\left(x_i,\theta\right)-f^{\,\,*}\left(x_i\right)\right)^2}{2} = \left(f\left(x_i,\theta\right)-f^{\,\,*}\left(x_i\right)\right)\cdot\nabla f\left(x_i,\theta\right)\ ,\nonumber\\ \xi^*_{k}\left(\theta\right)& = \frac{\sqrt{\eta}}{\vert B_k\vert}\sum_{x_j\in B_k}\left(\nabla L^*_j\left(\theta\right) -\frac{1}{N}\sum_{i = 1}^N\nabla L^*_i\left(\theta\right)\right),\nonumber\\ \xi_{k}^\mathrm{ULN}\left(\theta\right)& = - \frac{\sqrt{\eta}}{\vert B_k\vert}\ \sum_{x_j\in B_k} \varepsilon_j\cdot\nabla_\theta f\left(x_j,\theta\right)\ . \end{align} \tag{ 13 }$

Proposition 1 (Mean and variance of the two noise terms). The mean and variance of the noise terms $\xi^*_{k}(\theta)$ and $\xi_{k}^\mathrm{ULN}(\theta)$ should be the vector-value functions as follows

$\begin{align} & \mathbb{E}_{B_k}\left[\xi^*_{k}\left(\theta\right)\right] = {\boldsymbol{0}}_d, \ & \text{and}\ & \mathrm{Var}_{B_k}\left[\xi^*_{k}\left(\theta\right)\right] = \frac{\eta}{\vert B_k\vert}\Sigma_{N}^\mathrm{SGD}\left(\theta\right) \nonumber\\ & {\mathbb{E}}_{B_k,\varepsilon_i}\left[\xi_{k}^\mathrm{ULN}\left(\theta\right)\right] = {\boldsymbol{0}}_d, \ & \text{and}\ & \mathrm{Var}_{B_k\varepsilon_i}\left[\xi_{k}^\mathrm{ULN}\left(\theta\right)\right] = \frac{\eta}{\vert B_k\vert}\Sigma_{N}^\mathrm{ULN}\left(\theta\right)\ . \nonumber\\ \end{align} \tag{ 14 }$

The two matrix-value functions $\Sigma_N^\mathrm{SGD}(\theta)$ and $\Sigma_N^\mathrm{ULN}(\theta)$ over $\theta\in\mathbb{R}^d$ characterize the variance of noise vectors. When we assume that label noise and mini-batch sampling are independent:

$\begin{equation} \begin{aligned} \Sigma_N^\mathrm{SGD}\left(\theta\right)& = \frac{1}{N}\sum_{j = 1}^N\left(\nabla L^*_j\left(\theta\right) -\frac{1}{N}\sum_{i = 1}^NL^*_i\left(\theta\right)\right)\left(\nabla L^*_j\left(\theta\right) -\frac{1}{N}\sum_{i = 1}^NL^*_i\left(\theta\right)\right)^\top\\ \Sigma_N^\mathrm{ULN}\left(\theta\right)& = \frac{\sigma^2}{N}\sum_{j = 1}^N\nabla_\theta f\left(x_j,\theta\right)\nabla_\theta f\left(x_j,\theta\right)^\top\ \text{as }\ \mathrm{var}\left[\varepsilon_j\right] = \sigma^2\ . \end{aligned} \end{equation} \tag{ 15 }$

The two noise terms $\xi^*_{k}(\theta)$ and $\xi_{k}^\mathrm{ULN}(\theta)$ that are controlled by the learning rate and the batch size will largely influence the SGD dynamics. Please refer to appendix A for proofs.

With the mean and variance of the two noise terms, we can easily formulate the learning dynamics of SGD with ULN as follows.

3.2. Doubly stochastic models and approximation

We consider the SGD algorithm with unbiased random label noise in the form of GD with additive data-dependent noise, such that $\theta_{k+1} = \theta_k-\frac{\eta}{N}\sum_{i = 1}^N\nabla\tilde{L}_i(\theta_k)+\sqrt{\eta}\tilde{V}_k(\theta_k)$ . To simplify the model and analysis involved, we assume

A1.1.
The two noise terms $\xi^*_{k}(\theta_k)$ and $\xi^\mathrm{ULN}_{k}(\theta_k)$ are independent

By making the assumption above, the complexities of considering their potential interaction or correlation can be bypassed. Furthermore, our later experiments (in sections 5 and 6) would show the robustness and correctness of our model and analysis based on this assumption. Hence, when η → 0, we can follow the analysis in [17] to derive the diffusion process of SGD with unbiased random label noise, denoted as ${\Theta}^\mathrm{ULN}(t)$ over continuous-time $t\unicode{x2A7E} 0$ . We define the doubly stochastic models that characterize the continuous-time dynamics of SGD with ULN as follows.

Definition 1 (Continuous-time doubly stochastic models). Given an SGD algorithm and $\theta^\mathrm{ULN}_k$ defined and specified in section 3.1, with $\eta = \mathbf{dt}$ , we assume $B_k = B$ for $k = 1, 2, 3\ldots$ and formulate its continuous-time dynamics as

$\begin{align} \mathbf{d}{{\Theta}}^\mathrm{ULN} = - \frac{1}{N}\sum_{i = 1}^N \nabla L^*_i\left({{\Theta}}^\mathrm{ULN}\right)\mathbf{d}t + \left(\frac{\eta}{b}\Sigma_N^\mathrm{SGD}\left({{\Theta}}^\mathrm{ULN}\right)\right)^\frac{1}{2}\mathbf{d}W_1\left(t\right) +\left(\frac{\eta}{b}\Sigma_N^\mathrm{ULN}\left({{\Theta}}^\mathrm{ULN}\right)\right)^\frac{1}{2}\mathbf{d}W_2\left(t\right)\ , \end{align} \tag{ 16 }$

where ${W}_1(t)$ and ${W}_2(t)$ refer to two independent Brownian motions over time, $\mathbf{d}t = {\eta}$ and ${\Theta}^\mathrm{ULN}(0) = \theta^\mathrm{ULN}_0$ .

Obviously, we can obtain the discrete-time approximation [9, 15] to the SGD dynamics as follows.

Definition 2 (Discrete-time doubly stochastic models). We denote $\overline{\theta}^\mathrm{ULN}_k$ for $k = 1,2,\ldots$ as the discrete-time approximation to the doubly stochastic models for SGD with ULN, which in the kth iteration behaves as

$\begin{align} \overline{\theta}^\mathrm{ULN}_{k+1}\gets \overline{\theta}^\mathrm{ULN}_k-\frac{\eta}{N}\sum_{i = 1}^N\nabla L_i^*\left(\overline{\theta}^\mathrm{ULN}_k\right)+\sqrt{\eta}\left(\frac{\eta}{b}\Sigma_N^\mathrm{SGD}\left({\overline{\theta}^\mathrm{ULN}_k}\right)\right)^\frac{1}{2}z_k +\sqrt{\eta}\left(\frac{\eta}{b}\Sigma_N^\mathrm{ULN}\left({\overline{\theta}^\mathrm{ULN}_k}\right)\right)^\frac{1}{2}z^{^{\prime}}_k , \end{align} \tag{ 17 }$

where z_k and $z^{^{\prime}}_k$ are two independent d-dimensional random vectors drawn from a standard d-dimensional Gaussian distribution $\mathcal{N}(\mathbf{0}_d,\mathbf{I}_d)$ per iteration independently, and $\overline{\theta}^\mathrm{ULN}_0 = {\Theta}^\mathrm{ULN}(0)$ .

The convergence between $\overline{\theta}^\mathrm{ULN}_k$ and ${\Theta}^\mathrm{ULN}(t)$ is tight when $t = k\eta$ and the convergence bound is as follows.

Proposition 2 (Convergence of approximation). Let $T \unicode{x2A7E} 0$ . Let $\Sigma_N^\mathrm{SGD}(\theta)$ and $\Sigma_N^\mathrm{ULN}(\theta)$ be the two diffusion matrices defined in equation (15). Assume that

A2.1.
There exists some M > 0 such that $\underset{_{i = 1,2,\ldots ,N}}{\text{max}}\{(\|\nabla L^*_i(\theta)\|_2)\}\unicode{x2A7D} M$ and $\underset{_{i = 1,2,\ldots ,N}}{\text{max}}\{(\|\nabla_\theta f(x_i;\theta)\|_2)\}\unicode{x2A7D} M$ ;
A2.2.
There exists some L > 0 such that $\nabla L_i(\theta)$ and $\nabla_\theta f(x,\theta)$ for $\forall x\in\mathcal{D}$ are Lipschitz continuous with bounded Lipschitz constant L > 0 uniformly for all $i = 1,2,\ldots ,N$ .

The continuous-time dynamics of SGD with ULN, denoted as ${\Theta}^\mathrm{ULN}(t)$ in equation (16), is with order 1 strong approximation to the discrete time SGD dynamics $\overline{\theta}_k^\mathrm{ULN}$ in equation (17). That is, there exists a constant C independent of η but depending on σ², L and M such that

$\begin{align} \mathbb{E} \|{\Theta}^\mathrm{ULN}\left({k\eta}\right) - \overline{\theta}^\mathrm{ULN}_k \|^2 \unicode{x2A7D} C \eta^2, \quad \text{for all } 0\unicode{x2A7D} k\unicode{x2A7D} \lfloor T/\eta \rfloor. \end{align} \tag{ 18 }$

Please refer to appendix B for proofs.

Note that the boundedness and smoothness assumptions made in this proposition have also been used in a series of previous studies [15, 41–44]. We believe the assumptions are reasonable in deep learning settings. For example, the rectified linear unit (ReLU) activation function [45], commonly used in DNNs, has a globally bounded derivative, being either 0 or 1. The boundedness of DNNs also depends on the norms of weight matrices, while well-trained DNNs often have norm-bounded weight matrices [46–49]. Techniques like weight normalization or layer normalization can further maintain this boundedness.

Remark 1. With the above strong convergence bound for approximation, we can consider $\overline{\theta}^\mathrm{ULN}_k$ —the solution of equation (17)—as a tight approximation to the SGD algorithm with ULN based on the same initialization. A tight approximation to the noise term $\xi_k^\mathrm{ULN}(\theta)$ (defined in equation (27)) could be as follows

$\begin{align} \xi_k^\mathrm{ULN}\left(\theta\right)\approx\left(\frac{\eta}{b}\Sigma_N^\mathrm{ULN}\left(\theta\right)\right)^\frac{1}{2}z^{^{\prime}}_k,\ \textbf{and}\ z^{^{\prime}}_k\sim\mathcal{N}\left(\mathbf{0_d},\mathbf{I_d}\right)\ . \end{align} \tag{ 19 }$

We use such discrete-time iterations and approximations to the noise term $\xi_k^\mathrm{ULN}(\theta)$ to interpret the implicit regularization behaviors of the SGD with ULN algorithm $\theta^\mathrm{ULN}_k$ accordingly.

4. Implicit regularization effects on neural networks

In this section, we use our model to interpret the regularization effects of SGD with unbiased label noise on general neural networks, without assumptions on the structures of neural networks.

4.1. Implicit regularizer influenced by unbiased random label noise

Comparing the stochastic gradient with unbiased random label noise $\tilde{g}_k(\theta)$ and the stochastic gradient based on the label-noiseless losses, we find an additional noise term $\xi_k^\mathrm{ULN}(\theta)$ as the implicit regularizer.

To interpret $\xi_k^\mathrm{ULN}(\theta)$ , we first define the diffusion process of SGD based on label-noiseless losses i.e. $L_i^*(\theta)$ for $1\unicode{x2A7D} i\unicode{x2A7D} N$ as

$\begin{align} \mathbf{d}{\Theta}^\mathrm{LNL} = -\frac{1}{N}\sum_{i = 1}^N\nabla L_i^*\left({\Theta}^\mathrm{LNL}\right)\mathbf{d}t+\left(\frac{\eta}{b}\Sigma_N^\mathrm{SGD}\left({\Theta}^\mathrm{LNL}\right)\right)^\frac{1}{2}\mathbf{W}\left(t\right)\ . \end{align} \tag{ 20 }$

By comparing ${\Theta}^\mathrm{ULN}(t)$ with ${\Theta}^\mathrm{LNL}(t)$ , the effects of $\xi_k^\mathrm{ULN}({\Theta})$ over continuous-time form should be $\sqrt{\eta/B}(\Sigma_N^\mathrm{ULN}({\Theta}))^{1/2}\mathbf{d}W(t)$ . Then, in discrete-time, we could get results as follows.

Proposition 3 (The implicit regularizer $\xi_k^\mathrm{ULN}(\theta)$ ). The implicit regularizer of SGD with unbiased random label noise can be approximated as follows

$\begin{align} \xi_k^{\mathrm{ULN}}\left(\theta\right)\approx\left(\frac{\sigma^2\eta}{bN}\sum_{i = 1}^N\nabla_\theta f\left(x_i,\theta\right)\nabla_\theta f\left(x_i,\theta\right)^\top\right)^{\frac{1}{2}}z_k,\ \text{and}\ z_k\sim\mathcal{N}\left(\mathbf{0}_d,\mathbf{I}_d\right)\ . \end{align} \tag{ 21 }$

Hence, we can estimate the expected regularization effects of the implicit regularizer $\|\xi_k^{\mathrm{ULN}}(\theta)\|_2$ as follows

$\begin{align} \mathbb{E}_{z_k}\|\xi_k^{\mathrm{ULN}}\left(\theta\right)\|_2^2& = \frac{\eta\sigma^2}{bN}\sum_{i = 1}^N\left\|\nabla_\theta f\left(x_i,\theta\right)\right\|_2^2 . \end{align} \tag{ 22 }$

Please refer to appendix C for proofs.

We thus conclude that the effects of implicit regularization caused by unbiased random label noise for SGD is proportional to $\frac{1}{N}\sum_{i = 1}^N\left\|\nabla_\theta f(x_i,\theta)\right\|_2^2$ , the average gradient norm of the neural network $f(x,\theta)$ over samples.

4.2. Understanding the ULN as an inference stabilizer

Here, we extend the existing results on SGD [2, 50] to understand proposition 3 and obtain remarks as follows.

Remark 2 (Inference stability). In the partial derivative form, the gradient norm can be written as

$\begin{align} \frac{1}{N}\sum_{i = 1}^N\|\nabla_\theta f\left(x_i,\theta\right)\|_2^2 = \frac{1}{N}\sum_{i = 1}^N\|\frac{\partial}{\partial\theta}f\left(x_i,\theta\right)\|_2^2 \end{align} \tag{ 23 }$

which characterizes the variation of neural network output $f(x,\theta)$ based on samples x_i (for $1\unicode{x2A7D} i\unicode{x2A7D} N$ ) over the parameter interpolation around the point θ. Lower $\frac{1}{N}\sum_{i = 1}^N\|\nabla_\theta f(x_i,\theta)\|_2^2$ leads to higher stability of neural network $f(x,\theta)$ outputs against the (random) perturbations of parameters.

Remark 3 (Escape and converge). When the noise $\xi_k^{\mathrm{ULN}}(\theta)$ is θ-dependent (section 4 presents a special case that $\xi_k^{\mathrm{ULN}}(\theta)$ is θ-independent with OLS), we follow reference [2] and suggest that the implicit regularizer helps SGD escape from the point $\tilde{\theta}$ with a high neural network gradient norm $\frac{1}{N}\sum_{i = 1}^N\|\nabla_\theta f(x_i,\tilde{\theta})\|_2^2$ , as the scale of noise $\xi_k^{\mathrm{ULN}}(\tilde{\theta})$ is large. We follow reference [50] and suggest that when the SGD with unbiased random label noise converges, the algorithm will converge to a point $\theta^*$ with small $\frac{1}{N}\sum_{i = 1}^N\|\nabla_\theta f(x_i,\theta^*)\|_2^2$ . Similar results have been obtained in [26] when assuming deep learning algorithms are driven by an OU-like process.

Remark 4 (Performance tuning). Considering $\eta\sigma^2/B$ as the coefficient balancing the implicit regularizer and vanilla SGD, one can regularize/penalize the SGD learning procedure with a fixed η and B more fiercely using a larger σ². More specifically, we expect to obtain a regularized solution with lower $\frac{1}{N}\sum_{i = 1}^N\|\nabla_\theta f(x_i,\theta)\|_2^2$ or higher inference stability of neural networks, as regularization effects become stronger when σ² increases.

5. Experiments on self-distillation with ULN

The goal of this experiment is to understand proposition 3 and remarks 3 and 4 i.e. examining (1) whether the ULN will lower the gradient norm of the neural networks; (2) whether such ULN improves the performance of neural networks; and (3) whether one can carry out performance tuning through controlling the variances of ULN, all in real-world deep learning settings.

5.1. Experiment design

To evaluate SGD with ULN, we design a set of novel experiments based on self-distillation with unbiased label noises. In addition to learning from noisy labels directly, our experiment intends to train a (student) network from the noisy outputs of a (teacher) network in a quadratic regression loss, where the student network has been initialized from weights of the teacher network and ULN is given to the soft outputs of the teacher network randomly.

We aim to directly measure the gradient norm $\frac{1}{N}\sum_{i = 1}^N\|\nabla_\theta f(x_i,\theta)\|_2^2$ of the neural network after every epoch to test the SGD implicit regularization effects of ULN (i.e. proposition 3). The performance comparisons among the teacher network, the student network (trained with ULN) and the student network (trained noiselessly) demonstrate the advantage of ULN in SGD for regression tasks (i.e. remarks 3 and 4).

Particularly, we design a set of novel experiments based on self-distillation with unbiased label noise and explore in which way the proposed SGD with unbiased label noises fits the settings of self-distillation with unbiased label noises. Further, we introduce the goal of our empirical experiments with a list of expected evidence, then present the experiment settings for the empirical evaluation. Finally, we present the experimental results with solid evidence to validate our proposals in this work.

5.2. Noisy self-distillation

Given a well-trained model, self-distillation algorithms [27, 28, 33, 34] intend to further improve the performance of a model through learning from the 'soft label' outputs (i.e. logits) of the model (as the teacher). Furthermore, some studies found that the self-distillation could be further improved through incorporating certain randomness and stochasticity in the training procedure, namely noisy self-distillation, so as to obtain better generalization performance [28, 33]. In this work, we study two well-known strategies for additive noise as follows.

(i)
Gaussian noise. Given a pre-trained model with ${\mathbb{L}}$ -dimensional logit output, for every iteration of self-distillation, this strategy simple first draws random vectors from a ${\mathbb{L}}$ -dimensional Gaussian distribution ${\mathcal{N}}({\boldsymbol{0}}_{\mathbb{L}},\sigma^2{\mathbf{I}}_{\mathbb{L}})$ , then adds the vectors to the logit outputs of the model. It makes the student model learn from the noisy outputs. Note that in our analysis, we assume the output of the model is single dimension, while, in self-distillation, the logit labels are multiple dimensions. Thus, the diagonal matrix $\sigma^2{\mathbf{I}}_{\mathbb{L}}$ now refers to the complete form of the variances and σ² controls the scale of variances of the noise.
(ii)
Symmetric noise. Basically, this strategy is derived from [13] that generates noise through randomly swapping the values of logit output among the $\mathbb{L}$ dimensions. Specifically, in every iteration of self-distillation, given a swap-probability p, every logit output (denoted as y here) from the pre-trained model and every dimension of logit output (denoted as y_l ), the strategy in probability p swaps the logit value in the dimension that corresponds to y_l with any other dimension $y_{m\neq l}$ in equal prior (i.e. in $(\mathbb{L}-1)^{-1}$ probability). In the rest $1-p$ dimensions of logit (probability) outputs, the strategy remains the original logit output there. Hence, the new noisy label $\tilde{y}$ has expectation $\mathbb{E}[\tilde{y}]$ as follows
$\begin{align} \mathbb{E}\left[\tilde{y}_l\right] = \left(1-p\right)\cdot y_l+\frac{p\cdot\sum_{\forall m\neq l} y_m}{\mathbb{L}-1}. \end{align} \tag{ 24 }$
This strategy introduces explicit bias to the original logit outputs. However, when we consider the expectation $\mathbb{E}[\tilde{y}]$ as the innovative soft label, then the random noise around the new soft label is still unbiased as $\mathbb{E}[\tilde{y}-\mathbb{E}[\tilde{y}]] = 0$ for all dimensions. Note that this noise is not the symmetric noise studied for robust learning [51].

Thus, literately, our proposed SGD with unbiased label noises settings well fits the practice of noisy-self distillation.

5.3. Datasets and DNN models

We choose ResNet-56 [52], one of the most practical deep models, for conducting the experiments on three datasets: SVHN [53], CIFAR-10 and CIFAR-100 [54]. We follow the standard training procedure [52] for training a teacher model (original model). Specifically, we train the model from scratch for 200 epochs and adopt the SGD optimizer with batch size 64 and momentum 0.9. The learning rate is set to 0.1 at the beginning of training and divided by 10 at 100th epoch and 150th epoch. A standard weight decay with a small regularization parameter (10⁻⁴) is applied. As for noiseless self-distillation, we follow the standard procedure [55] for distilling knowledge from the teacher to a student of the same network structure. The basic training setting is the same as training the teacher model.

For the settings of noisy self-distillation, we divide the original training set into a new training set (80%) and a validation set (20%). For clarity, we also present the results using varying scales of ULN on all three sets, where the original training set is used for training.

5.4. Experiment results

Figure 1 presents the results of the above two methods with increasing scales of noise, i.e. increasing σ² for Gaussian noise and increasing p for symmetric noise. In figures 1(a)–(c), we demonstrate that the gradient norms of neural networks $\frac{1}{N} \sum_{i = 1}^N\|\nabla_\theta f(x_i,\theta)\|_2^2$ decrease with increasing σ² and p for two strategies. The results back up our theoretical investigation, which means the model has high inferential stability, as the variation of neural network outputs against the potential random perturbation in parameters has been reduced by the regularization. In figures 1(d)–(f) and (g)–(i), we plot the validation and testing accuracy of the models obtained under noisy self-distillation. The results show that (1) student models have lower gradient norms in neural networks $\frac{1}{N} \sum_{i = 1}^N\|\nabla_\theta f(x_i,\theta)\|_2^2$ than teacher models, and the gradient norm further decreases with increasing scale of noise (i.e. σ² and P); (2) some of the models have been improved through noisy self-distillation compared to the teacher model, while noisy self-distillation can obtain better performance than noiseless self-distillation; and (3) it is possible to select noisily self-distilled models using validation accuracy for better overall generalization performance (in the testing dataset). All results here are based on 200 epochs of noisy self-distillation.

We show the evolution of training and test losses during the entire training procedure, and compare the settings of adding no label noise, symmetric and Gaussian noise for noisy self-distillation. Figure 2 presents the results on the three datasets, i.e. SVHN, CIFAR-10 and CIFAR-100, with the optimal scales of label noise on validation sets. It shows all algorithms would finally converge to a local minima with a training loss near to zero, while the local minima searched by SGD with symmetric noise would be flatter with better generalization performance (in particular for the CIFAR-100 dataset).

**Figure 2.** Training and validation loss per epoch during the training procedure.
Download figure:
Standard image High-resolution image

6. Experiments on linear regression with ULN

To validate our findings in linear regression settings, we carry out numerical evaluation using synthesized data to simply visualize the dynamics over iterations of SGD algorithms with label-noisy OLS and label-noiseless OLS.

6.1. Linear regression with ULN

We here hope to see how ULN will influence SGD iterations for ordinary linear regression (OLS), where a simple quadratic loss function is considered for OLS, such that

$\begin{align} \widehat\beta_\mathrm{OLS} \gets \underset{\beta\in\mathbb{R}^d}{\mathrm{arg min}}\ \left\{\frac{1}{N}\sum_{i = 1}^N\tilde{L}_i\left(\beta\right): = \frac{1}{2N}\sum_{i = 1}^N\left(x_i^\top\beta-\tilde{y}_i\right)^2\right\}\ , \end{align} \tag{ 25 }$

where samples are generated through $\tilde{y}_i = x_i^\top\beta^*+\varepsilon_i$ , $\mathbb{E}[\varepsilon_i] = 0$ and $\mathrm{var}[\varepsilon_i] = \sigma^2$ . Note that in this section, we replace the notation of θ with β to present the parameters in linear regression models.

Let us combine equations (11) and (25). We write the SGD for ordinary least squares with unbiased label noises as the iterations $\beta_k^\mathrm{ULN}$ for $k = 1,2,3\ldots$ as follows

$\begin{align} \beta^\mathrm{ULN}_{k+1}\gets\beta^\mathrm{ULN}_k-\frac{\eta}{N}\sum_{i = 1}^N\nabla L_i^*\left(\beta^\mathrm{ULN}_k\right)+\sqrt{\eta}\xi_k^*\left(\beta^\mathrm{ULN}_k\right)+\sqrt{\eta}\xi_k^\mathrm{ULN}\left(\beta^\mathrm{ULN}_k\right), \end{align} \tag{ 26 }$

where $\nabla {L}^*_i(\beta)$ for $\forall\theta\in\mathbb{R}^d$ refers to the loss gradient based on the label-noiseless sample $(x_i,y_i)$ and $y_i = x_i^\top\beta^*$ , $\xi^*_{k}(\beta)$ refers to stochastic gradient noise caused by mini-batch sampling over the gradients of label-noiseless samples, and $\xi_{k}^\mathrm{ULN}(\beta)$ is an additional noise term caused by the mini-batch sampling and the ULN, such that

$\begin{align} \nabla {L}^*_i\left(\beta\right)& = \frac{\partial}{\partial\beta}\frac{\left(x_i^\top\beta-x_i^\top\beta^*\right)^2}{2} = x_ix_i^\top\left(\beta-\beta^*\right) ,\nonumber\\ \xi^*_{k}\left(\beta\right)& = \frac{\sqrt{\eta}}{\vert B_k\vert}\sum_{x_j\in B_k}\left(\nabla L^*_j\left(\beta\right) -\frac{1}{N}\sum_{i = 1}^NL^*_i\left(\beta\right)\right),\nonumber\\ \xi_{k}^\mathrm{ULN}\left(\beta\right)& = - \frac{\sqrt{\eta}}{\vert B_k\vert}\ \sum_{x_j\in B_k} \varepsilon_j\cdot x_j\ . \end{align} \tag{ 27 }$

We denote the sample covariance matrix of N samples as

$\begin{align} \overline{\Sigma}_N = \frac{1}{N}\sum_{i = 1}^{N}x_ix_i^\top\ . \end{align} \tag{ 28 }$

According to remark 1 for implicit regularization in the general form, we can write the implicit regularizer of SGD with the random label noise for OLS as

$\begin{align} \xi_k^\mathrm{ULN}\left(\beta\right)\approx\left(\frac{\eta\sigma^2}{B}\overline{\Sigma}_N\right)^\frac{1}{2}z^{^{\prime}}_k = \left({\frac{\eta\sigma^2}{bN}\sum_{i = 1}^Nx_ix_i^\top}\right)^\frac{1}{2} z^{^{\prime}}_k,\ \text{and}\ z^{^{\prime}}_k\sim\mathcal{N}\left(\mathbf{0}_d,\mathbf{I}_d\right),\ \end{align} \tag{ 29 }$

which is unbiased $\mathbb{E}[\xi_k^\mathrm{ULN}(\beta)] = \mathbf{0_d}$ with invariant covariance structure, and is independent of β (the location) and k (the time).

Combine=ing proposition 2 and linear regression settings, we obtain the continuous-time dynamics for linear regression with ULN, denoted as $\beta^\mathrm{ULN}(t)$ . According to references [32, 56], we can see SGD and its continuous-time dynamics for noiseless linear regression (denoted as $\beta^\mathrm{LNL}_k$ and $\beta^\mathrm{LNL}(t)$ ) would asymptotically converge to the optimal solution $\beta^*$ . As the additional noise term $\xi^\mathrm{ULN}$ is unbiased with an invariant covariance structure, when $t\to\infty$ , we can simply conclude that $\underset{t\to\infty}{\text{lim}}\mathbb{E}\ \beta^\mathrm{ULN}(t) = \underset{t\to\infty}{\text{lim}}\mathbb{E}\ \beta^\mathrm{LNL}(t) =$ $\beta^*$ , $\underset{t\to\infty}{\text{lim}}\mathbf{d}\beta^\mathrm{ULN}(t) = (\frac{\eta\sigma^2}{B}\overline{\Sigma}_N)^{1/2}\mathbf{d}W(t)$ . By definition of a distribution from a stochastic process, we could conclude $\beta^\mathrm{ULN}(t)$ converges to a stationary distribution, such that $\beta^\mathrm{ULN}(t)\sim\mathcal{N}(\beta^*,\frac{\eta\sigma^2}{B}\overline{\Sigma}_N)$ , as $t\to\infty$ .

Remark 5. Thus, with $k\to\infty$ , the SGD algorithm for OLS with unbiased label noise would converge to a Gaussian-like distribution as follows

$\begin{align} \lim_{k\to\infty}\mathbb{E}\ \left[\beta_k^\mathrm{ULN}\right] = \beta^*, \text{and}\ \lim_{k\to\infty}\mathrm{Var}\ \left[\beta_k^\mathrm{ULN}\right] = \frac{\eta\sigma^2}{B}\overline{\Sigma}_N. \end{align} \tag{ 30 }$

The span and shape of the distribution are controlled by σ² and $\overline{\Sigma}_N$ when η and B are constant.

In this experiment, we hope to evaluate the above remark using numerical simulations, so as to test: (1) whether the trajectories of $\beta_k^\mathrm{ULN}$ converges to a distribution of $\mathcal{N}(\beta^*,\frac{\eta\sigma^2}{B}\overline{\Sigma}_N)$ ; (2) whether the shape of the convergence area can be controlled by the sample covariance matrix of the data $\overline{\Sigma}_N$ ; and (3) whether the size of the convergence area can be controlled by the variance of label noises σ².

6.2. Experiment setup

In our experiments, we use 100 random samples realized from a two-dimensional Gaussian distribution $X_i\sim\mathcal{N}(\mathbf{0},\Sigma_{1,2})$ for $1\unicode{x2A7D} i\unicode{x2A7D} 100$ , where $\Sigma_{1,2}$ is a symmetric covariance matrix controlling the random sample generation. To add the noise to the labels, we first draw 100 copies of random noise from the normal distribution with the given variance $\varepsilon_i\sim\mathcal{N}(0,{\sigma}^2)$ , then we set up the OLS problem with $(X_i,\tilde{y}_i)$ pairs using $\tilde{y}_i = {X}_i^\top\beta^*+\varepsilon_i$ and $\beta^* = [1,1]^\top$ and various settings of σ² and $\Sigma_{1,2}$ . We set up the SGD algorithms with fixed learning rate η = 0.01, and batch size B = 5, with the total number of iterations $K = 1000\,000$ to visualize the complete paths.

6.3. Experiment results

Figure 3 presents the results of numerical validations. In figures 3(a)–(d), we gradually increase the variances of label noises σ² from 0.25 to 2.0, where we can observe (1) SGD over label-noiseless OLS converges to the optimal solution $\beta^* = [1.0,1.0]^\top$ in a fast manner, (2) SGD over OLS with unbiased random label noise would asymptotically converge to a distribution centered at the optimal point, and (3) when σ² increases, the span of the converging distribution becomes larger. In figures 3(e)–(h), we use four settings of $\Sigma_{1,2}$ , where we can see (4) no matter how $\Sigma_{1,2}$ is set for OLS problems, the SGD with unbiased random label noise would asymptotically converge to a distribution centered at the optimal point.

**Figure 3.** Trajectories of SGD over OLS with and without unbiased random label noise using various $\tilde{\sigma}^2$ and $\Sigma_{1,2}$ settings for (noisy) random data generation. For figures (a)–(d), the experiments are set up with a fixed $\Sigma_{1,2} = [[20,0]^\top,[0,20]^\top]$ and varying $\tilde{\sigma}^2$ . For figures (e)–(h), the experiments are set up with a fixed $\tilde{\sigma}^2 = 0.5$ and varying $\Sigma_{1,2}$ , where we set $\Sigma^\mathrm{Ver} = [[10,0]^\top,[0,100]^\top]$ and $\Sigma^\mathrm{Hor} = [[100,0]^\top,[0,10]^\top]$ to shape the converging distributions.
Download figure:
Standard image High-resolution image

Comparing the results in (e) with (f), we find that, when the trace of $\Sigma_{1,2}$ increases, the span of converging distributions increase. Furthermore, (5) the shapes of converging distributions depend on $\Sigma_{1,2}$ . In figure 3(g), when we place the principal component of $\Sigma_{1,2}$ onto the vertical axis (i.e. $\Sigma^\mathrm{Ver} = [[10,0]^\top,[0,100]^\top]$ ), the distribution lies principally on the vertical axis. Figure 3(h) demonstrates the opposite layout of the distribution, when we set $\Sigma^\mathrm{Hor} = [[100,0]^\top,[0,10]^\top]$ as $\Sigma_{1,2}$ . The scale and shape of the converging distribution backs up our theoretical investigation in equation (29).

Note that the unbiased random label noise is added to the labels prior to the learning procedure. In this setting, it is the mini-batch sampler of SGD that re-samples the noise and influences the dynamics of SGD by forming the implicit regularizer.

7. Discussion and conclusion

In this work, we advance the understanding of the impact of random label noise in the context of SGD under mini-batch sampling. Unlike most prior studies [38, 57] focusing on performance degradation due to label noise or corrupted labels, we intended to study the implicit regularization effects caused by this random label noise.

By adopting a dynamical systems viewpoint of SGD, our work decomposes SGD with ULN into three distinct components: $\nabla L^*(\theta)$ , the true gradient of label-noiseless loss functions; $\xi_k^*(\theta)$ , noise introduced via mini-batch sampling on the label-noiseless loss gradients; and $\xi_k^\mathrm{ULN}(\theta)$ , noise influenced by both random label noise and mini-batch sampling. The unpublished noise term, $\xi_k^\mathrm{ULN}(\theta)$ emerges as an implicit regularizer, contributing to a lower neural network gradient norm, a signature of higher model stability against random parameter perturbations.

Our extensive experiments offer robust evidence for our theoreticalanalyses. Notably, experiments involving self-distillation with deep neural networks substantiate the role of the implicit regularizer in lowering the gradient norm of neural networks and enhancing model stability. This process, enhanced by iteratively adding noise to the outputs of teacher models, eventually results in resolution with better generalization performance. Complementary evidence comes from our linear regression analysis, where the derived SGD dynamics are vividly manifested in the observed trajectories of SGD-based linear regression under the influence of random label noise. These empirical observations echo our theoretical findings, affirming the critical controlling role of data distribution, learning rate and batch size for the convergence behavior.

To sum up, our work underscores the significance of understanding and harnessing the influence of random label noise in SGD. The emergence of the implicit regularizer, $\xi_k^\mathrm{ULN}(\theta)$ , could help shape future algorithms to harness noise in a positive way, leading to improved model stability and performance. This study thus opens up new avenues for future research in designing more effective and robust learning procedures.

Data availability statement

No new data were created or analyzed in this study.

Ethical statements

No procedures performed in the studies involving human participants or animals. All authors have made contributions to the manuscript. Dr Haoyi Xiong contributed to the original research ideas, formulated the research problems, proposed part of algorithms, wrote the manuscript and was involved in discussions. Dr Xuhong Li contributed to the original research ideas, conducted part of the experiments, and wrote part of the manuscript. Mr Boyang Li helped in part of the mathematical proofs. Dongrui Wu and Zhanxing Zhu were involved in discussions. Prof Dejing Dou oversaw the progress of research. All procedures were in accordance with the ethical standards of the institutional and/or national research committee and with the 1964 Helsinki Declaration and its later amendments or comparable ethical standards.

Appendix A: Proof of proposition 1

As the mini-batch B_k are randomly, independently and uniformly drawn from the full-set sample $\mathcal{D}$ , thus for $\forall\theta\in\mathbb{R}^d$ and $\forall x_j\in B_k$ :

$\begin{align} \mathbb{E}_{x_j\in B_k}\left[\nabla L^*_j\left(\theta\right)\right] = \frac{1}{N}\sum_{i = 1}^N \nabla L_i^*\left(\theta\right)\Rightarrow\ \mathbb{E}_{B_k}\ \left[\frac{1}{\vert B_k\vert}\sum_{x_j\in B_k}\nabla L^*_j\left(\theta\right) \right] = \frac{1}{N}\sum_{i = 1}^N \nabla L^*_i\left(\theta\right)\ . \end{align} \tag{ A.1 }$

Hence, we derive that

$\begin{align} \mathbb{E}_{B_k}\left[\xi^*_k\left(\theta\right)\right] & = \mathbb{E}_{B_k}\ \sqrt{\eta}\left(\frac{1}{\vert B_k\vert}\sum_{i = 1}^k\nabla^*_j\left(\theta\right)-\frac{1}{N}\sum_{i = 1}^NL^*_i\left(\theta\right)\right)\nonumber\\ & = \sqrt{\eta}\left(\mathbb{E}_{B_k}\left[\frac{1}{\vert B_k\vert}\sum_{i = 1}^k\nabla^*_j\left(\theta\right)\right]-\frac{1}{N}\sum_{i = 1}^NL^*_i\left(\theta\right)\right) = \mathbf{0_d}\ . \end{align} \tag{ A.2 }$

Further, for any $\theta\in\mathbb{R}^d$ :

$\begin{align} \mathrm{Var}\left[\xi^*_k\left(\theta\right)\right] & = \mathbb{E}_{B_k}\ \left(\xi^*_k\left(\theta\right)-\mathbb{E}_{B_k}\xi^*_k\left(\theta\right)\right)\left(\xi^*_k\left(\theta\right)-\mathbb{E}_{B_k}\xi^*_k\left(\theta\right)\right)^\top = \mathbb{E}_{B_k} \left[\xi^*_k\left(\theta\right)\xi^*_k\left(\theta\right)^\top\right] \nonumber\\ & = \frac{\eta}{\vert B_k\vert^2} \sum_{x_j\in B_k} \mathrm{Var}\left[\nabla L^*_j\left(\theta\right)\right] = \frac{\eta}{\vert B_k\vert N}\sum_{j = 1}^N\left(\nabla L^*_j\left(\theta\right)-\frac{1}{N}\sum_{i = 1}^NL^*_i\left(\theta\right)\right)\nonumber\\ &\quad\times\left(\nabla L^*_j\left(\theta\right)-\frac{1}{N}\sum_{i = 1}^NL^*_i\left(\theta\right)\right)^\top = \frac{\eta}{\vert B_k\vert}\Sigma_N^\mathrm{SGD}\left(\theta\right), \end{align} \tag{ A.3 }$

where $\Sigma_N^\mathrm{SGD}(\theta)$ is defined as equation (15).

Similarly, as the mini-batch B_k are randomly, independently and uniformly drawn from the full-set sample $\mathcal{D}$ , therefore $\forall\theta\in\mathbb{R}^d$ and

$\begin{align} \mathbb{E}_{x_j\in B_k}\left[\nabla_\theta f\left(x_j, \theta\right)\right] = \frac{1}{N}\sum_{i = 1}^N \nabla_\theta f\left(x_i,\theta\right)\ . \end{align} \tag{ A.4 }$

Thus:

$\begin{align} \mathbb{E}_{B_k,\varepsilon_i}\left[\xi_k^\mathrm{ULN}\left(\theta\right)\right]& = -\frac{\sqrt{\eta}}{\vert B_k\vert}\sum_{x_i\in B_k}\left\{\mathbb{E}_{x_i\in B_k}\left[\nabla f\left(x_i,\theta\right)\right]\cdot \mathbb{E}_{\varepsilon_i}\left[\varepsilon_i\right]\right\} = \mathbf{0_d}\ . \end{align} \tag{ A.5 }$

Again, by the definition:

$\begin{align} \mathrm{Var}_{B_k,\varepsilon_i}\left[\xi_k^\mathrm{ULN}(\theta)\right]& = \mathbb{E}_{B_k,\varepsilon_i}\left[\xi_k^\mathrm{ULN}(\theta)\xi_k^\mathrm{ULN}(\theta)^\top\right] \nonumber\\ &\quad\times \text{Let us assume}~B_k~\mathrm{and}~\varepsilon_i\mathrm{\,for}~1\unicode{x2A7D} i\unicode{x2A7D} N~\mathrm{are \,independent}\nonumber\\ & = \frac{\eta}{\vert B_k\vert^2}\sum_{x_j\in B_k}\mathbb{E}_{B_k\varepsilon_i}\left[(\varepsilon_i\cdot\nabla_\theta f(x_i,\theta)-\mathbb{E}_{B_k\varepsilon_i}(\varepsilon_i\cdot\nabla_\theta f(x_i,\theta))^2\right]. \nonumber\\ &\quad\times \text{As}~\mathrm{Var}\left[\varepsilon_i\right] = \sigma^2~\mathrm{for}~1\unicode{x2A7D} i\unicode{x2A7D} N,~\mathrm{there\, has}\nonumber\\ & = \frac{\eta\sigma^2}{\vert B_k\vert N}\sum_{i = 1}^N\nabla_\theta f(x_i,\theta)\nabla_\theta f(x_i,\theta)^\top = \frac{\eta\sigma^2}{\vert B_k\vert}\Sigma_N^\mathrm{ULN}(\theta)\ \end{align} \tag{ A.6 }$

where $\Sigma_N^\mathrm{ULN}(\theta)$ is defined as equation (15). $\square$

Appendix B: Proof of proposition 2

We show that, as $\eta \rightarrow 0$ , the discrete iteration $\overline{\theta}_k$ of equation (17) and on finite–time intervals is close to the solution of the SDE. (16). The main techniques follow [58], but [58] only considered the case when $\Sigma_N^\mathrm{SGD}(\theta)$ and $\Sigma_N^\mathrm{ULN}(\theta)$ are constants.

Let $C_1(\theta) = \sqrt{\frac{1}{B}\Sigma_N^\mathrm{SGD}(\theta)}$ , $C_2(\theta) = \sqrt{\frac{1}{B}\Sigma_N^\mathrm{ULN}(\theta)}$ and $L^*(\theta) = \frac{1}{N}\sum_{i = 1}^NL_i^*(\theta)$ . Let $\widehat{\Theta}(t)$ be the process defined by the integral form of the SDE

$\begin{align} \widehat{\Theta}\left(t\right)-\widehat{\Theta}\left(0\right) &= -\int_0^t \nabla L^*\left(\widehat{\Theta}_{\lfloor \frac{s}{\eta}\rfloor\eta}\right) \mathbf{d} s + \sqrt{\eta} \int_0^t C_1\left(\widehat{\Theta}\left(\lfloor\frac{s}{\eta}\rfloor\eta\right)\right) \mathbf{d} W_1\left(s\right)\nonumber\\ &\quad+ \sqrt{\eta} \int_0^t C_2\left(\widehat{\Theta}\left(\lfloor\frac{s}{\eta}\rfloor\eta\right)\right) \mathbf{d} W_2\left(s\right) \ , \ \widehat{\Theta}\left(0\right) = \theta_0 \ . \end{align} \tag{ B.1 }$

Here for a real positive number a > 0 we define $\lfloor a \rfloor = \text{max}\left\{k\in \mathbb{N}_+, k\lt a\right\}$ . From equation (B.1) we see that we have, for $k = 0,1,2,\ldots$

$\begin{align} \widehat{\Theta}\left({\left(k+1\right)\eta}\right)-\widehat{\Theta}\left({k\eta}\right) &= -\eta \nabla L^*\left(\widehat{\Theta}\left({k\eta}\right)\right)-\sqrt{\eta}C_1\left(\widehat{\Theta}\left({k\eta}\right)\right)\left(W_1\left(\left(k+1\right)\eta\right)-W_1\left(k\eta\right)\right) \nonumber\\ &\quad - \sqrt{\eta} C_2\left(\widehat{\Theta}\left({k\eta}\right)\right)\left(W_2\left(\left(k+1\right)\eta\right)-W_2\left(k\eta\right)\right). \end{align} \tag{ B.2 }$

Since $\sqrt{\eta}(W_1({(k+1)\eta})-W_1({k\eta}))\sim \mathcal{N}(0, \eta^2 I)$ and $\sqrt{\eta}(W_2({(k+1)\eta})-W_2({k\eta}))\sim \mathcal{N}(0, \eta^2 I)$ , we could let $\eta z_{k+1} = \sqrt{\eta}(W_1({(k+1)\eta})-W_1({k\eta}))$ and $z^{^{\prime}}_{k+1} = \sqrt{\eta}(W_2({(k+1)\eta})-W_2({k\eta}))$ , where $z_{k+1}$ and $z^{^{\prime}}_{k+1}$ are the i.i.d. Gaussian sequences in (16). From here, we see that

$\begin{align} \widehat{\Theta}\left({k\eta}\right) = \overline{\theta}^\mathrm{ULN}_k \ , \end{align} \tag{ B.3 }$

where $\overline{\theta}^\mathrm{ULN}_k$ is the solution to (17).

We then try to bound $\widehat{\Theta}_t$ in equation (B.1) and ${\Theta}^\mathrm{ULB}(t)$ in equation (16). Finally we can obtain the error estimation of $\overline{\theta}^\mathrm{ULN}_k = \widehat{\Theta}({k\eta})$ and ${\Theta}^\mathrm{ULN}({k\eta})$ by simply setting $t = k\eta$ . Since we assumed that $\nabla L^*_i(\theta)$ and $\nabla_\theta f(x, \theta)$ are L–Lipschitz continuous, we get

$\begin{align} \|C_1\left(\theta_1\right)-C_1\left(\theta_2\right)\|_2& = \sqrt{\frac{1}{bN}\sum_{i = 1}^N\|\nabla L^*_i\left(\theta_1\right)-\nabla L^*_i\left(\theta_2\right)\|_2^2} \unicode{x2A7D} L\|\theta_1-\theta_2\|_2 \end{align} \tag{ B.4 }$

since the batch size $b\unicode{x2A7E} 1$ . Similarly,

$\begin{align} \|C_2\left(\theta_1\right)-C_2\left(\theta_2\right)\|_2& = \sqrt{\frac{\sigma^2}{bN}\sum_{i = 1}^N\|\nabla_\theta f\left(x_i,\theta_1\right)-\nabla_\theta f\left(x_i,\theta_2\right)\|_2^2}\unicode{x2A7D} \sigma L\|\theta_1-\theta_2\|_2 \end{align} \tag{ B.5 }$

since the batch size $B\unicode{x2A7E} 1$ . Thus $C_1(\theta)$ and C₂ are both L–Lipschitz continuous. Taking the difference between (B.1) and (16) we get

$\begin{align} \widehat{\Theta}\left(t\right)-{\Theta}^\mathrm{ULN}\left(t\right) &= -\int_0^t \left[\nabla L^*\left(\widehat\Theta\left({\lfloor \frac{s}{\eta} \rfloor\eta}\right)\right)-\nabla L^*\left({\Theta}^\mathrm{ULN}\left(s\right)\right)\right] \mathbf{d} s\nonumber\\ &\quad+\sqrt{\eta}\int_0^t \left[C_1\left(\widehat{\Theta}\left({\lfloor \frac{s}{\eta}\rfloor}\right)\right) -C_1\left({\Theta}^\mathrm{ULN}\left(s\right)\right)\right]\mathbf{d} W_1\left(s\right)\nonumber\\ &\quad + \sqrt{\eta}\int_0^t \left[C_2\left(\widehat{\Theta}\left({\lfloor \frac{s}{\eta}\rfloor}\right)\right)-C_2\left({\Theta}^\mathrm{ULN}\left(s\right)\right)\right]\mathbf{d} W_2\left(s\right). \end{align} \tag{ B.6 }$

We can estimate

$\begin{align} &\|\nabla L^*(\widehat{\Theta}({\lfloor\frac{s}{\eta}\rfloor\eta}))-\nabla L^*(\Theta^\mathrm{ULN}(s))\|_2^2 \nonumber\\ &\quad\unicode{x2A7D} 2\|\nabla L^*(\widehat{\Theta}({\lfloor\frac{s}{\eta}\rfloor\eta}))-\nabla L^*({\Theta}^\mathrm{ULN}({\lfloor\frac{s}{\eta}\rfloor\eta}))\|_2^2 + 2\|\nabla L^*(\Theta({\lfloor\frac{s}{\eta}\rfloor\eta}))-\nabla L^*({\Theta}^\mathrm{ULN}(s))\|_2^2 \nonumber\\ &\quad\unicode{x2A7D} 2L^2\|\widehat{\Theta}({\lfloor\frac{s}{\eta}\rfloor\eta})-{\Theta}^\mathrm{ULN}({\lfloor\frac{s}{\eta}\rfloor\eta})\|_2^2 +2L^2 \|\widehat\Theta(\lfloor\frac{s}{\eta}\rfloor\eta)-{\Theta}^\mathrm{ULN}(s)\|_2^2 \ , \end{align} \tag{ B.7 }$

where we used the inequality derived from L-Lipschitz. Similarly, we estimate

$\begin{align} & \|C_1(\widehat{\Theta}(\lfloor\frac{s}{\eta}\rfloor\eta))-C_1({\Theta}^\mathrm{ULN}(s))\|_2^2 \nonumber\\ &\quad\unicode{x2A7D} 2\|C_1(\widehat{\Theta}(\lfloor\frac{s}{\eta}\rfloor\eta))-C_1({\Theta}^\mathrm{ULN}({\lfloor\frac{s}{\eta}\rfloor\eta}))\|_2^2 + 2\|C_1({\Theta}^\mathrm{ULN}({\lfloor\frac{s}{\eta}\rfloor\eta}))-{\Theta}^\mathrm{ULN}(s))\|_2^2 \nonumber\\ &\quad\unicode{x2A7D} 2L^2\|\widehat{\Theta}(\lfloor\frac{s}{\eta}\rfloor\eta)-{\Theta}^\mathrm{ULN}({\lfloor\frac{s}{\eta}\rfloor\eta})\|_2^2 +2L^2 \|{\Theta}^\mathrm{ULN}({\lfloor\frac{s}{\eta}\rfloor\eta})-{\Theta}^\mathrm{ULN}(s))\|_2^2 \ . \end{align} \tag{ B.8 }$

Likewise, based on the inequality derived from L-Lipschitz, we can also have

$\begin{align} & \|C_2(\widehat{\Theta}(\lfloor\frac{s}{\eta}\rfloor\eta))-C_2({\Theta}^\mathrm{ULN}(s))\|_2^2 \nonumber\\ &\quad\unicode{x2A7D} 2\|C_2(\widehat{\Theta}(\lfloor\frac{s}{\eta}\rfloor\eta))-C_2({\Theta}^\mathrm{ULN}({\lfloor\frac{s}{\eta}\rfloor\eta}))\|_2^2 + 2\|C_2({\Theta}^\mathrm{ULN}({\lfloor\frac{s}{\eta}\rfloor\eta}))-{\Theta}^\mathrm{ULN}(s))\|_2^2 \nonumber\\ &\quad\unicode{x2A7D} 2\sigma^2L^2\|\widehat{\Theta}(\lfloor\frac{s}{\eta}\rfloor\eta)-{\Theta}^\mathrm{ULN}({\lfloor\frac{s}{\eta}\rfloor\eta})\|_2^2 +2\sigma^2L^2 \|{\Theta}^\mathrm{ULN}({\lfloor\frac{s}{\eta}\rfloor\eta})-{\Theta}^\mathrm{ULN}(s))\|_2^2 \ .\end{align} \tag{ B.9 }$

On the other hand, from (B.6), the Itô's isometry [59] and Cauchy–Schwarz inequality we have

$\begin{equation} \begin{aligned} \mathbb{E}\vert\widehat{\Theta}(t)-{\Theta}^\mathrm{ULN}(t)\vert^2 \unicode{x2A7D} & 2 \mathbb{E}\left\|\int_0^t \left[\nabla L(^*(\Theta({\lfloor \frac{s}{\eta} \rfloor\eta}))-\nabla L^*({\Theta}^\mathrm{ULN}(s))\right] \mathbf{d}s\right\|_2^2\\ &+2\eta \mathbb{E} \left\|\int_0^t \left[C_1(\widehat{\Theta}_{\lfloor \frac{s}{\eta}\rfloor})-C_1({\Theta}^\mathrm{ULN}(s))\right]\mathbf{d}W_1(s)\right\|^2\\ &+2\eta \mathbb{E} \left\|\int_0^t \left[C_2(\widehat{\Theta}_{\lfloor \frac{s}{\eta}\rfloor})-C_2({\Theta}^\mathrm{ULN}(s))\right]\mathbf{d}W_1(s)\right\|^2\\ \unicode{x2A7D} &2\int_0^t \mathbb{E}\left\|\nabla L(^*(\Theta({\lfloor \frac{s}{\eta} \rfloor\eta}))-\nabla L^*({\Theta}^\mathrm{ULN}(s))\right\|_2^2\mathbf{d}s\\ &+2\eta \int_0^t \mathbb{E}\left\|C_1(\widehat{\Theta}_{\lfloor \frac{s}{\eta}\rfloor})-C_1({\Theta}^\mathrm{ULN}(s))\right\|^2\mathbf{d}s\\ &+2\eta \int_0^t \mathbb{E}\left\|C_2(\widehat{\Theta}_{\lfloor \frac{s}{\eta}\rfloor})-C_2({\Theta}^\mathrm{ULN}(s))\right\|^2\mathbf{d}s \ . \end{aligned} \end{equation} \tag{ B.10 }$

Combining equations (B.7)–(B.10) we obtain

$\begin{align} & \mathbb{E}\|\widehat{\Theta}(t)-{\Theta}^\mathrm{ULN}(t)\|_2^2 \nonumber\\ &\unicode{x2A7D} 2\int_0^t \left(2L^2 \mathbb{E} \|\widehat{\Theta}(\lfloor\frac{s}{\eta}\rfloor\eta)-{\Theta}^\mathrm{ULN}({\lfloor\frac{s}{\eta}\rfloor\eta})\|_2^2+2L^2 \mathbb{E}\|\Theta_{\lfloor \frac{s}{\eta} \rfloor\eta}-{\Theta}^\mathrm{ULN}(s))\|_2^2\right)\mathbf{d}s \nonumber\\ &\quad+2\eta \int_0^t \left(2L^2 \mathbb{E}\|\widehat{\Theta}(\lfloor\frac{s}{\eta}\rfloor\eta)-{\Theta}^\mathrm{ULN}({\lfloor\frac{s}{\eta}\rfloor\eta})\|_2^2 +2L^2 \mathbb{E} \|{\Theta}^\mathrm{ULN}({\lfloor\frac{s}{\eta}\rfloor\eta})-{\Theta}^\mathrm{ULN}(s))\|_2^2\right)\mathbf{d}s \nonumber\\ &\quad +2\eta \int_0^t \left(2\sigma^2L^2 \mathbb{E}\|\widehat{\Theta}(\lfloor\frac{s}{\eta}\rfloor\eta)-{\Theta}^\mathrm{ULN}({\lfloor\frac{s}{\eta}\rfloor\eta})\|_2^2 +2\sigma^2L^2 \mathbb{E} \|{\Theta}^\mathrm{ULN}({\lfloor\frac{s}{\eta}\rfloor\eta})-{\Theta}^\mathrm{ULN}(s))\|_2^2\right)\mathbf{d}s \ .\nonumber\\ &= 4(1+\eta+\eta\sigma^2) L^2\cdot \left(\int_0^t \mathbb{E}\|\widehat{\Theta}(\lfloor\frac{s}{\eta}\rfloor\eta)-{\Theta}^\mathrm{ULN}({\lfloor\frac{s}{\eta}\rfloor\eta})\|_2^2\mathbf{d}s +\int_0^t \mathbb{E} \|{\Theta}^\mathrm{ULN}({\lfloor\frac{s}{\eta}\rfloor\eta})-{\Theta}^\mathrm{ULN}(s))\|_2^2 \mathbf{d}s\right) \ .\end{align} \tag{ B.11 }$

Since we assumed that there is an M > 0 such that $\max_{i = 1,2\ldots ,N}(\vert\nabla L^*_i(\theta)\vert)\unicode{x2A7D} M$ , we conclude that $\vert\nabla L^*(\theta)\vert\unicode{x2A7D} \dfrac{1}{N}\sum_{i = 1}^N \vert\nabla L^*_i(\theta)\vert\unicode{x2A7D} M$ and

$\begin{align} \|C_1\left(\theta\right)\|_2\unicode{x2A7D} \dfrac{1}{\sqrt{b}}\sqrt{\frac{1}{N}\sum_{i = 1}^N \|\nabla L^*_i\left(\theta\right)\|_2^2}\unicode{x2A7D} M, \end{align} \tag{ B.12 }$

$\begin{align} \|C_2\left(\theta\right)\|_2\unicode{x2A7D} \dfrac{1}{\sqrt{b}}\sqrt{\frac{\sigma^2}{N}\sum_{i = 1}^N \|\nabla_\theta f\left(x_i,\theta\right)\|_2^2}\unicode{x2A7D} \sigma M \end{align} \tag{ B.13 }$

since $B\unicode{x2A7E} 1$ . Based on equation (16), the Itô's isometry [59], the Cauchy–Schwarz inequality and $0\unicode{x2A7D} s-\lfloor\frac{s}{\eta}\rfloor\eta \unicode{x2A7D} \eta$ we know that

$\begin{align} &\mathbb{E} \|{\Theta}^\mathrm{ULN}({\lfloor\frac{s}{\eta}\rfloor\eta})-{\Theta}^\mathrm{ULN}(s))\|_2^2\nonumber\\ &= \displaystyle{\mathbb{E} \left\|-\int_{\lfloor\frac{s}{\eta}\rfloor\eta}^s \nabla L^*(\Theta_u) \mathbf{d}u +\sqrt{\eta}\int_{\lfloor\frac{s}{\eta}\rfloor\eta}^s C_1(\Theta_u)\mathbf{d}W_1(u)+\sqrt{\eta}\int_{\lfloor\frac{s}{\eta}\rfloor\eta}^s C_2(\Theta_u)\mathbf{d}W_2(u)\right\|_2^2} \nonumber\\ &\quad\unicode{x2A7D} \displaystyle{2 \mathbb{E} \left\|\int_{\lfloor\frac{s}{\eta}\rfloor\eta}^s \nabla L^*(\Theta_u) \mathbf{d}u\right\|_2^2 +2\eta \mathbb{E}\left\|\int_{\lfloor\frac{s}{\eta}\rfloor\eta}^s C_1(\Theta_u)\mathbf{d}W_1(u) \right\|_2^2 +2\eta \mathbb{E}\left\|\int_{\lfloor\frac{s}{\eta}\rfloor\eta}^s C_2(\Theta_u)\mathbf{d}W_2(u) \right\|_2^2} \nonumber\\ &\quad\unicode{x2A7D} 2 \mathbb{E} \left(\int_{\lfloor\frac{s}{\eta}\rfloor\eta}^s \left\|\nabla L^*(\Theta_u)\right\|\mathbf{d}u\right)^2 +2\eta\int_{\lfloor\frac{s}{\eta}\rfloor\eta}^s \mathbb{E}\|C_1(\Theta_u)\|_2^2 \mathbf{d}u +2\eta\int_{\lfloor\frac{s}{\eta}\rfloor\eta}^s \mathbb{E}\|C_2(\Theta_u)\|_2^2 \mathbf{d}u \nonumber\\ &\quad\unicode{x2A7D} \displaystyle{2\eta \int_{\lfloor\frac{s}{\eta}\rfloor\eta}^s \mathbb{E}\vert\nabla L^*(\Theta_u)\vert^2 \mathbf{d}u +2\eta \int_{\lfloor\frac{s}{\eta}\rfloor\eta}^s \mathbb{E}\|C_1(\Theta_u)\|_2^2 \mathbf{d}u +2\eta \int_{\lfloor\frac{s}{\eta}\rfloor\eta}^s \mathbb{E}\|C_2(\Theta_u)\|_2^2 \mathbf{d}u} \nonumber\\ &\quad\unicode{x2A7D} 2\eta^2 M^2 +2\eta^2 M^2+2\eta^2\sigma^2 M^2 = (4+2\sigma^2)\eta^2 M^2 \ .\end{align} \tag{ B.14 }$

Combining (B.14) and (B.11) we obtain

$\begin{align} &\mathbb{E}\|\widehat{\Theta}\left(t\right)-{\Theta}^\mathrm{ULN}\left(t\right)\|_2^2 \unicode{x2A7D} 4\left(1+\eta+\eta\sigma^2\right)L^2 \cdot \left(\int_0^t \mathbb{E}\|\widehat{\Theta}\left(\lfloor\frac{s}{\eta}\rfloor\eta\right)\right.\nonumber\\ &\left.\quad-{\Theta}^\mathrm{ULN}\left({\lfloor\frac{s}{\eta}\rfloor\eta}\right)\|_2^2\mathbf{d}s +\left(4+2\sigma^2\right)\eta^2 M^2 t\right) \ .\end{align} \tag{ B.15 }$

Setting T > 0 and $m(t) = {\mathop{\text{max}}\limits_{0\unicode{x2A7D} s\unicode{x2A7D} t}} \mathbb{E} \vert\widehat{\Theta}_s-{\Theta}^\mathrm{ULN}(s))\vert^2$ , and noticing that $m(\lfloor\frac{s}{\eta}\rfloor \eta)\unicode{x2A7D} m(s)$ (as $\lfloor\frac{s}{\eta}\rfloor\eta \unicode{x2A7D} s$ ), then we see the above gives, for any $0\unicode{x2A7D} t\unicode{x2A7D} T$ ,

$\begin{align} m\left(t\right) \unicode{x2A7D} 4\left(1+\eta+\eta\sigma^2\right)L^2 \cdot \left(\int_0^t m\left(s\right)\mathbf{d}s +\left(4+2\sigma^2\right)\eta^2 M^2 T\right) \ .\end{align} \tag{ B.16 }$

By Gronwall's inequality we obtain, for $0\unicode{x2A7D} t \unicode{x2A7D} T$ ,

$\begin{align} m\left(t\right) \unicode{x2A7D} 4\left(1+\eta+\eta\sigma^2\right)\left(4+2\sigma^2\right) L^2\eta^2 M^2 T e^{4\left(1+\eta+\eta\sigma^2\right) L^2 t}. \end{align} \tag{ B.17 }$

Suppose $0\lt\eta\lt 1$ , then there is a constant C which is independent of η s.t.

$\begin{align} \mathbb{E}\|\widehat{\Theta}\left(t\right)-{\Theta}^\mathrm{ULN}\left(t\right)\|_2^2 \unicode{x2A7D} m\left(t\right) \unicode{x2A7D} C \eta^2. \end{align} \tag{ B.18 }$

Setting $t = k\eta$ in (B.18) and making use of (B.3), we finish the proof. $\square$

Appendix C: Proof of proposition 3

To obtain equation (22), we can use the simple vector-matrix-vector products transform that, for the random vector v and any symmetric matrix A, there has $\mathbb{E}_v [v^\top Av] = \mathrm{trace}(A\mathbb{E}[vv^\top])$ , such that

$\begin{align} \mathbb{E}_{z_k}\|\xi_k^{\mathrm{ULN}}\left(\theta\right)\|_2^2 & = \mathbb{E}_{z_k}\left[\xi_k^{\mathrm{ULN}}\left(\theta\right)^\top \xi_k^{\mathrm{ULN}}\left(\theta\right)\right] \approx{\frac{\eta\sigma^2}{b}}\mathbb{E}_{z_k}\left[z_k^\top\left(\frac{1}{N}\sum_{i = 1}^N\nabla_\theta f\left(x_i,\theta\right)\nabla_\theta f\left(x_i,\theta\right)^\top\right)z_k\right] \nonumber\\ & = {\frac{\eta\sigma^2}{b}}\mathrm{trace}\left(\left(\frac{1}{N}\sum_{i = 1}^N\nabla_\theta f\left(x_i,\theta\right)\nabla_\theta f\left(x_i,\theta\right)^\top\right)\mathbb{E}_{z_k}\left[z_kz_k^\top\right]\right)\nonumber \text{as } \mathbb{E}_{z_k}\left[z_kz_k^\top\right]\nonumber\\ & = \mathbf{I}_d\ for\ z_k\sim\mathcal{N}\left(\mathbf{0}_d,\mathbf{I}_d\right) = \frac{\eta\sigma^2}{b}\mathrm{trace}\left(\frac{1}{N}\sum_{i = 1}^N\nabla_\theta f\left(x_i,\theta\right)\nabla_\theta f\left(x_i,\theta\right)^\top\right)\nonumber\\ & = \frac{\eta\sigma^2}{bN}\sum_{i = 1}^N\left\|\nabla_\theta f\left(x_i,\theta\right)\right\|_2^2. \end{align} \tag{ B.19 }$

$\square$