Global Convergence and Rich Feature Learning in $L$ -Layer Infinite-Width Neural Networks under $\mu$ P Parametrization

Zixiang Chen (equal contribution), Department of Computer Science, University of California, Los Angeles. Email: chenzx19@cs.ucla.edu
Greg Yang (equal contribution), xAI
Qingyue Zhao, Department of Computer Science, University of California, Los Angeles. Email: zhaoqy24@cs.ucla.edu
Quanquan Gu, Department of Computer Science, University of California, Los Angeles. Email: qgu@cs.ucla.edu

Abstract

Despite deep neural networks’ powerful representation learning capabilities, theoretical understanding of how networks can simultaneously achieve meaningful feature learning and global convergence remains elusive. Existing approaches like the neural tangent kernel (NTK) are limited because features stay close to their initialization in this parametrization, leaving open questions about feature properties during substantial evolution. In this paper, we investigate the training dynamics of infinitely wide, $L$ -layer neural networks using the tensor program (TP) framework. Specifically, we show that, when trained with stochastic gradient descent (SGD) under the Maximal Update parametrization ( $\mu$ P) and mild conditions on the activation function, SGD enables these networks to learn linearly independent features that substantially deviate from their initial values. This rich feature space captures relevant data information and ensures that any convergent point of the training process is a global minimum. Our analysis leverages both the interactions among features across layers and the properties of Gaussian random variables, providing new insights into deep representation learning. We further validate our theoretical findings through experiments on real-world datasets.

1 Introduction

Deep learning has achieved remarkable success in various machine learning tasks, from image classification (Krizhevsky et al., 2012) and speech recognition (Hinton et al., 2012) to game playing (Silver et al., 2016). Yet this empirical success has posed a significant theoretical challenge: how can we explain the effectiveness of neural networks given their non-convex optimization landscape and over-parameterized nature? Traditional optimization and learning theory frameworks struggle to provide satisfactory explanations. A breakthrough came with the study of infinite-width neural networks, where the network behavior can be precisely characterized in the limit of infinite width. This theoretical framework has spawned several important approaches to understanding neural networks, with the Neural Tangent Kernel (NTK) emerging as a prominent example.

Under the NTK parametrization (NTP) (Jacot et al., 2018), neural network training behaves like a linear model: the features learned during training in each layer remain essentially identical to those obtained from random initialization. Consequently, the training process of over-parameterized deep neural networks can be characterized by training linear models with random features (Lee et al., 2019; Arora et al., 2019b; Cao and Gu, 2019; Chen et al., 2021). Since random features are linearly independent, global convergence can be proved for wide neural networks trained using (stochastic) gradient descent (GD/SGD) (Du et al., 2019c; Allen-Zhu et al., 2019; Du et al., 2019a; Zou et al., 2019; Zou and Gu, 2019). However, the NTK parametrization has significant limitations, such as its inability to perform feature learning and transfer learning, which involve pretraining and fine-tuning. While NTK theory provides convergence results under infinite width, its inability to explain feature learning motivates us to ask:

In this paper, we show that deep neural networks can achieve both objectives through proper parametrization. While previous approaches like NTK and standard parametrization fail to perform meaningful feature learning, and mean field parametrization suffers from feature collapse in deep networks, we demonstrate that the $\mu$ parametrization (Yang and Hu, 2020, 2021; Yang et al., 2021; Yang, 2019a) enables both feature learning and global convergence. Specifically, working with $L$ -layer neural networks under $\mu$ P scaling, we prove that despite substantial feature evolution during training, the networks maintain linearly independent features in each layer when trained with stochastic gradient descent. As a consequence, if the training converges, it must converge to a global minimum. Our contributions are summarized as follows:

• We establish that multilayer perceptrons (MLPs) under Maximal Update Parametrization ( $\mu$ P) learn linearly independent features that capture task-relevant information. The learned features substantially deviate from their initialization, demonstrating true feature learning rather than random feature approximation. This resolves a fundamental challenge in deep learning theory: characterizing feature properties that ensure global convergence while allowing meaningful feature learning.

• Our proof technique analyzes neural network Gaussian processes by exploiting their second-order invariants across adjacent layers. These structural properties persist during training, which allows us to track the evolution of feature correlations. Through a careful inductive argument over network layers and iterations, we establish that when training converges, the linear independence of features ensures convergence to a global minimum. The proof reveals a deep connection between the feature learning dynamics and the structural properties of infinite-width neural networks.

• Through experiments on classification tasks, we validate our theoretical findings by demonstrating that features maintain linear independence through analysis of covariance matrix properties. Our empirical results demonstrate $\mu$ P’s unique capability to simultaneously achieve meaningful feature learning while preserving feature richness, as supported by non-vanishing eigenvalues as network widths increase. Through comparative analysis against other parametrization schemes, we show that this behavior robustly persists across different choices of activation functions, illustrating the practical implications of our theoretical results.

Notation. For any positive integer $N$ , we use $[N]$ to denote the index set $\{1,\dots,N\}$ . We use $\phi:\mathbb{R}\to\mathbb{R}$ to denote the activation function. For an $L$ -layer network, we use superscript $l\in[L]$ to index layers, with $Z^{h^{l}}$ and $Z^{x^{l}}$ denoting pre-activation and post-activation features respectively. We write $\widehat{W}_{0}^{L+1}\coloneqq W_{0}^{L+1}n$ for the scaled last-layer weights. For any matrix $W$ and vector $x$ , $\widehat{Z}^{Wx}$ denotes the Gaussian component of $Z^{Wx}$ (that is, $Z^{Wx}\coloneqq\widehat{Z}^{Wx}+\dot{Z}^{Wx}$ , as detailed in Appendix B). We use $\mathbb{E}[\cdot]$ to denote expectation. We consider a filtration $\{\mathcal{F}_{t}\}_{t\geq 0}$ , where $\mathcal{F}_{t}$ is the $\sigma$ -algebra generated by all random variables up to time $t$ . This gives a sequence of probability spaces $(\Omega,\mathcal{F}_{t},\mathbb{P})$ with $\mathcal{F}_{0}\subseteq\mathcal{F}_{1}\subseteq\ldots\subseteq\mathcal{F}_{T}$ . An event $\mathcal{E}\in\mathcal{F}_{T}$ occurs almost surely (denoted as a.s.) if $\mathbb{P}(\mathcal{E})=1$ . The functions $\mathring{f}$ and $\mathring{\chi}$ denote the infinite-width limits of network outputs and error signals induced by $\mathring{f}$ respectively.

2 Related Work

Neural Tangent Kernel Parametrization Jacot et al. (2018) first introduced the neural tangent kernel (NTK) by studying the training dynamics of multi-layer perceptrons (MLPs) with Lipschitz and smooth activation functions under square loss. Based on NTK, Allen-Zhu et al. (2019); Du et al. (2019a); Zou et al. (2019); Arora et al. (2019a) proved the global convergence of (stochastic) gradient descent for various neural architectures with general activation and loss functions. Standard parametrization (SP) and NTK parametrization (NTP) share the same weight initialization scheme but with different learning schedules. As network width increases, SP requires learning rates to decrease as $O(1/\text{width})$ for all layers to maintain stability (Yang and Hu, 2020). When considering the infinite-width limit, neither SP nor NTK parametrization can learn features - the features remain essentially the same as those from random initialization. Both theoretical studies and empirical evidence demonstrated that these parametrizations failed to capture the feature learning behavior observed in practical neural networks (Woodworth et al., 2020; Geiger et al., 2020; Bordelon and Pehlevan, 2022; Yang et al., 2023a).

Mean Field Analysis The mean field limit emerged when networks and learning rates were scaled appropriately as width approached infinity, yielding nonlinear parameter evolution (Mei et al., 2018; Chizat and Bach, 2018; Rotskoff and Vanden-Eijnden, 2018; Sirignano and Spiliopoulos, 2018). Early analysis of two-layer networks showed promising results, proving convergence to global optima with explicit convergence rates established through both direct analysis (Chen et al., 2020) and mean field Langevin dynamics (Nitanda et al., 2022). Progress extended to three-layer networks with the global convergence results of Pham and Nguyen (2021). However, studies of deeper architectures revealed significant limitations: for networks deeper than four layers, both feature vectors and gradients degenerated to zero vectors (Nguyen and Pham, 2020; Fang et al., 2021). While Hajjar et al. (2021) introduced Integrable Parameterization (IP) to address this, networks with more than four layers still started at a stationary point in the infinite-width limit, making rich feature learning hard to achieve.

Figure 1: Different parametrization schemes exhibit distinct feature learning behaviors as width increases in $3$ -layer MLPs. We train MLPs on the CIFAR-10 dataset and measure feature properties in Layer $1$ . Left: Feature change ( $\|h(x)-h^{0}(x)\|_{2}/\|h^{0}(x)\|_{2}$ ) shows that only $\mu$ P maintains stable feature representations. Right: Feature diversity measured by the minimum eigenvalue of the feature Gram matrix $K_{ij}=\langle h(x_{i}),h(x_{j})\rangle$ , where a larger eigenvalue indicates that the features span a higher-dimensional space. The results reveal that the mean-field parametrization suffers from feature collapse while SP, NTP and $\mu$ P preserve rich feature representations. Notably, only $\mu$ P achieves both feature learning capability and feature richness. See Appendix A for experimental details.

Tensor Programs Tensor Programs (TPs) emerged as a unified framework for understanding infinite-width limits across neural architectures (Yang, 2019b, 2020a, 2020b). This approach generalized previous architecture-specific parametrizations (Du et al., 2018, 2019b; Hron et al., 2020; Alemohammad et al., 2020). Yang and Hu (2020) characterized two distinct behaviors in infinite-width MLPs: one where initialization dominated the training dynamics (the kernel regime), and another where training data substantially influenced the learned weights (the feature learning regime). Within this framework, the $\mu$ parametrization was identified as enabling maximal feature learning across all layers and architectures (Yang and Hu, 2020; Yang et al., 2021; Littwin and Yang, 2022). The framework has continued to expand with analysis of depth-dependent scaling (Yang et al., 2023b). Recent work by Yang et al. (2023a) refined the understanding through spectral analysis and input dimension scaling, which we adopt in our experiments.

Our experimental results reveal distinct feature learning behaviors across different parametrization schemes. As shown in Figure 1, Standard Parametrization (SP) keeps features close to initialization (demonstrated by small feature change in the left panel), while Integrable Parametrization (IP) achieves feature learning but suffers from feature collapse (shown by decreasing feature diversity in the right panel). In contrast, $\mu$ P achieves both substantial feature change and maintains feature diversity. We summarize these key characteristics in Table 1. Additional experiments with different activation functions, further illustrating these trends, are provided in Appendix A.

Table 1: Feature Properties Under Different Parametrizations

Parametrization	Feature Learning	Feature Richness
Standard (SP)	✗	Rich
Neural Tangent (NTP)	✗	Rich
Meanfield (IP)	✓	Low
Maximal Update ( $\mu$ P)	✓	Rich

Note: IP (Integrable Parametrization) refers to parametrizations with a $1/n$ scaling factor for all layers except the first one, which leads to absolute convergence of weighted sums in the mean-field limit.

3 Preliminaries

Table 2: Initialization variance and learning rate scaling under different parametrization schemes for MLP networks.

Layer	SP Init. Var.	SP LR	NTP Init. Var.	NTP LR	IP Init. Var.	IP LR	$\mu$ P Init. Var.	$\mu$ P LR
Input ( $W^{1}$ )	$1$	$\eta\cdot n^{-1}$	$1$	$\eta$	$1$	$\eta\cdot n$	$1$	$\eta\cdot n$
Hidden ( $W^{l}$ )	$n^{-1}$	$\eta\cdot n^{-1}$	$n^{-1}$	$\eta\cdot n^{-1}$	$n^{-2}$	$\eta$	$n^{-1}$	$\eta$
Output ( $W^{L+1}$ )	$n^{-1}$	$\eta\cdot n^{-1}$	$n^{-1}$	$\eta\cdot n^{-1}$	$n^{-2}$	$\eta\cdot n^{-1}$	$n^{-2}$	$\eta\cdot n^{-1}$

Different parametrization schemes for MLPs are shown in Table 2, where Init. Var. denotes initialization variance, LR denotes learning rate scaling, $\eta$ is the base learning rate, and $n$ is the layer width; for notational simplicity, we omit the constants in the table. Given a general MLP with $L$ hidden layers specified by weight matrices $W^{1}\in\mathbb{R}^{n\times d}$ , $\{W^{l}\}_{l=2}^{L}\in\mathbb{R}^{n\times n}$ , $W^{L+1}\in\mathbb{R}^{n}$ , and activation $\phi:\mathbb{R}\to\mathbb{R}$ , the network computation is formally defined as

\begin{align}
h^{1} &= W^{1}\xi\in\mathbb{R}^{n}, \notag\\
x^{l} &= \phi(h^{l})\in\mathbb{R}^{n}, \notag\\
h^{l+1} &= W^{l+1}x^{l}\in\mathbb{R}^{n}, \notag\\
f(\xi) &= W^{L+1}x^{L}\in\mathbb{R}, \tag{3.1}
\end{align}

where $L>1$ is any positive integer and $l\in\{1,\dots,L-1\}$ . Among these schemes, the Maximal Update Parametrization ( $\mu$ P) shown in Table 2 achieves maximal parameter updates at initialization. As $n\rightarrow\infty$ , we can consider the following infinite-width feature learning process: $f_{t}(\xi)\overset{a.s.}{\rightarrow}\mathring{f}_{t}(\xi)$ (Yang and Hu, 2020, Theorem 6.4). The neural network is assumed to be trained using a differentiable loss $\mathcal{L}$ by stochastic gradient descent, where the $s$ -th sampled batch is denoted by $\{(\xi_{i},y_{i})\}_{i\in\mathcal{B}_{s}}\subseteq S$ , with $\mathcal{B}_{s}$ the index set and $S$ the training dataset. For simplicity, we present the full-batch gradient descent result in the main paper, i.e., $\mathcal{B}_{s}=[m]$ with $m=|S|$ .
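As a concrete illustration of (3.1) with the $\mu$ P initialization variances of Table 2, the following NumPy sketch samples the weights and evaluates the forward pass at finite width. The names `init_mup` and `forward`, the choice of $\tanh$ as activation, and the sizes are assumptions made for illustration only; the learning-rate scaling of Table 2 is not implemented here.

```python
import numpy as np

def init_mup(d, n, L, rng):
    """Sample weights with the muP initialization variances listed in Table 2."""
    W_in = rng.normal(0.0, 1.0, size=(n, d))            # W^1: variance 1
    W_hid = [rng.normal(0.0, n ** -0.5, size=(n, n))    # W^2..W^L: variance 1/n
             for _ in range(L - 1)]
    W_out = rng.normal(0.0, 1.0 / n, size=n)            # W^{L+1}: variance 1/n^2
    return W_in, W_hid, W_out

def forward(weights, xi, phi=np.tanh):
    """Evaluate (3.1); returns f(xi) and the lists of h^l and x^l for l = 1..L."""
    W_in, W_hid, W_out = weights
    hs = [W_in @ xi]                 # h^1 = W^1 xi
    xs = [phi(hs[0])]                # x^1 = phi(h^1)
    for W in W_hid:
        hs.append(W @ xs[-1])        # h^{l+1} = W^{l+1} x^l
        xs.append(phi(hs[-1]))       # x^{l+1} = phi(h^{l+1})
    return float(W_out @ xs[-1]), hs, xs

rng = np.random.default_rng(0)
weights = init_mup(d=8, n=512, L=3, rng=rng)
f, hs, xs = forward(weights, rng.normal(size=8))
```

With these variances the hidden pre-activations have entries of order one, while the output at initialization is small because the last layer is initialized with variance $n^{-2}$ .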

Represent Hidden States via $Z$ Random Variables:

Following Yang and Hu (2020), we represent the network’s hidden states using $Z$ random variables. This representation generalizes the spirit of two-layer mean field analysis: even with multiple hidden layers $(L\geq 2)$ , the entries of the preactivation vectors $h$ and activation vectors $x$ in (3.1) become approximately i.i.d. as width $n$ approaches infinity. This allows us to characterize their asymptotic behavior using scalar random variables that reflect their elementwise distributions.

Specifically, for a vector $x\in\mathbb{R}^{n}$ , we track it using $Z^{x}$ , where $x$ ’s entries behave like i.i.d. copies of $Z^{x}$ . When $x$ is properly scaled such that $\|x\|^{2}_{2}=\Theta(n)$ (i.e., its typical entry magnitude is independent of $n$ ), then $Z^{x}$ becomes independent of $n$ . For any two such normalized vectors $x,y\in\mathbb{R}^{n}$ , their corresponding random variables $Z^{x}$ and $Z^{y}$ are correlated via $\lim_{n\to\infty}x^{\top}y/n=\mathbb{E}Z^{x}Z^{y}$ . Our goal is to characterize the $Z$ variables of (3.1) throughout the training process.
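The relation $\lim_{n\to\infty}x^{\top}y/n=\mathbb{E}Z^{x}Z^{y}$ can be checked numerically in a simple case. The sketch below is an illustrative assumption, not taken from the paper: it sets $y$ to a standard Gaussian vector and $x=\tanh(y)$ , so that $Z^{y}\sim\mathcal{N}(0,1)$ and $Z^{x}=\tanh(Z^{y})$ , and compares $x^{\top}y/n$ with a Gauss-Hermite quadrature estimate of $\mathbb{E}[\tanh(Z)Z]$ .

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

n = 200_000
rng = np.random.default_rng(0)
g = rng.standard_normal(n)      # entries are i.i.d. copies of Z^y ~ N(0, 1)
x = np.tanh(g)                  # entries are i.i.d. copies of Z^x = tanh(Z^y)
y = g

empirical = float(x @ y) / n    # x^T y / n at finite width

# E[Z^x Z^y] = E[tanh(Z) Z], computed by Gauss-Hermite quadrature.
# hermegauss uses the weight exp(-z^2 / 2), so we divide by sqrt(2 pi).
nodes, w = hermegauss(80)
expected = float(np.sum(w * np.tanh(nodes) * nodes) / np.sqrt(2.0 * np.pi))
```

For large $n$ the two quantities agree up to a fluctuation of order $n^{-1/2}$ .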

Definition 3.1.

[Yang and Hu 2020] During training, we define the error signal $\mathring{\chi}_{t,i}$ at time step $t$ for the $i$ -th sample. When training with SGD to minimize the loss function $\mathcal{L}$ , this error signal is computed as $\mathring{\chi}_{t,i}=\mathcal{L}^{\prime}(\mathring{f}_{t}(\xi_{i}),y_{i})\mathds{1}\{i\in\mathcal{B}_{t}\}$ , where $\mathring{f}_{t}$ is the model output at time $t$ , $(\xi_{i},y_{i})$ is the $i$ -th training sample pair, and $\mathcal{B}_{t}$ denotes the mini-batch at time step $t$ . The indicator function $\mathds{1}\{\cdot\}$ ensures that the error signal is only computed for samples in the current mini-batch.

This error signal captures how much the model’s prediction deviates from the true label for each sample in the current mini-batch, and serves as the driving force for parameter updates during SGD training. For instance, in the case of mean squared error loss, the error signal takes the form $\mathring{\chi}_{t,i}=2(\mathring{f}_{t}(\xi_{i})-y_{i})\mathds{1}\{i\in\mathcal{B}_{t}\}$ . Having defined the error signal, we now describe how the $Z$ variables characterize the network’s computation in the infinite-width limit $\mathring{f}_{t}(\xi)$ . The forward pass tracks how network features propagate through layers, while the backward pass characterizes gradient flow. For clarity of presentation, we next introduce a simplified version of $\mathring{f}$ that includes the key properties needed for our theoretical analysis. The complete derivation and technical details can be found in Appendix B.
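A minimal sketch of the squared-loss error signal, including the mini-batch indicator, is given below. The function name `mse_error_signal` and the toy numbers are hypothetical and serve only to illustrate the formula.

```python
import numpy as np

def mse_error_signal(f_out, y, batch_idx):
    """chi_{t,i} = 2 (f_t(xi_i) - y_i) * 1{i in B_t} for the squared loss."""
    y = np.asarray(y, dtype=float)
    mask = np.zeros(len(y))
    mask[list(batch_idx)] = 1.0          # indicator of the current mini-batch
    return 2.0 * (np.asarray(f_out, dtype=float) - y) * mask

# Sample 1 is outside the mini-batch, so its error signal is zero.
chi = mse_error_signal(f_out=[0.5, -1.0, 2.0], y=[1.0, 0.0, 0.0], batch_idx=[0, 2])
```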

Forward Pass

For $z\in\{x^{l},h^{l}\}_{l}$ , we have $Z^{z_{t}(\xi)}=Z^{z_{0}(\xi)}+Z^{\delta z_{1}(\xi)}+\cdots+Z^{\delta z_{t}(\xi)}$ where

(a) for $l\in[L]$ , $Z^{\delta x^{l}_{t}(\xi)}=\phi(Z^{h^{l}_{t}(\xi)})-\phi(Z^{h^{l}_{t-1}(\xi)})$ ,

(b) for $l=1$ , we have

$$Z^{\delta h^{l}_{t}(\xi)}=-\sum_{i\in[m]}\eta\mathring{\chi}_{t-1,i}\xi_{i}^{\top}\xi Z^{dh^{l}_{t-1}(\xi_{i})},$$

and for $2\leq l\leq L$ , we have

$$Z^{\delta h^{l}_{t}(\xi)}=\widehat{Z}^{W^{l}_{0}\delta x^{l-1}_{t}(\xi)}+F_{t}(\xi),\tag{3.2}$$

where $F_{t}$ is a function determined by the random variables $\{Z^{dh_{s}(\xi_{i})}\}_{i\in[m],s\in[t-1]}$ (see Appendix B for details), and the $\widehat{Z}^{W^{l}_{0}\delta x^{l-1}_{t}(\xi)}$ are zero-mean jointly Gaussian with covariance

$$\text{Cov}(\widehat{Z}^{W^{l}_{0}\delta x^{l-1}_{t}(\xi)},\widehat{Z}^{W^{l}_{0}\delta x^{l-1}_{s}(\bm{\zeta})})=\mathbb{E}[Z^{\delta x^{l-1}_{t}(\xi)}Z^{\delta x^{l-1}_{s}(\bm{\zeta})}].$$

For the last-layer weights, we have $Z^{\widehat{W}_{t}^{L+1}}=Z^{\widehat{W}_{0}^{L+1}}+Z^{\delta W_{1}^{L+1}}+\cdots+Z^{\delta W_{t}^{L+1}}$ where

$$Z^{\delta W_{t}^{L+1}}=-\eta\sum_{i\in[m]}\mathring{\chi}_{t-1,i}Z^{x_{t-1}^{L}(\xi_{i})}.\tag{3.3}$$

The output deltas have limits $\mathring{f}_{t}(\xi)=\delta\mathring{f}_{1}(\xi)+\cdots+\delta\mathring{f}_{t}(\xi)$ where

$$\delta\mathring{f}_{t}(\xi)=\mathbb{E}Z^{\delta W_{t}^{L+1}}Z^{x_{t}^{L}(\xi)}+\mathbb{E}Z^{\widehat{W}_{t-1}^{L+1}}Z^{\delta x_{t}^{L}(\xi)}.\tag{3.4}$$

Backward Pass

For gradients:

\begin{align}
Z^{dx_{t}^{L}(\xi)} &= Z^{\widehat{W}_{t}^{L+1}}, \tag{3.5}\\
Z^{dh_{t}^{l}(\xi)} &= Z^{dx_{t}^{l}(\xi)}\phi^{\prime}(Z^{h_{t}^{l}(\xi)}), \tag{3.6}\\
Z^{dx_{t}^{l-1}(\xi)} &= \widehat{Z}^{W_{0}^{l\top}dh_{t}^{l}(\xi)}+G_{t}(\xi), \tag{3.7}
\end{align}

where $G_{t}$ is a function determined by the random variables $\{Z^{x_{s}^{l-1}(\xi_{i})}\}_{i\in[m],s\in[t-1]}$ (see Appendix B for details), and where $\{\widehat{Z}^{W_{0}^{l\top}dh_{t}^{l}(\xi)}\}_{\xi,t}$ are zero-mean jointly Gaussian with covariance

$$\text{Cov}(\widehat{Z}^{W_{0}^{l\top}dh_{t}^{l}(\xi)},\widehat{Z}^{W_{0}^{l\top}dh_{s}^{l}(\bm{\zeta})})=\mathbb{E}[Z^{dh_{t}^{l}(\xi)}Z^{dh_{s}^{l}(\bm{\zeta})}].$$
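The recursion (3.5)-(3.7) mirrors ordinary backpropagation at finite width. The sketch below is a finite-width analogue under stated assumptions, not the infinite-width limit: it computes the plain derivatives $\partial f/\partial h^{l}$ (without the factor $n$ that appears in $\widehat{W}^{L+1}$ ), uses $\tanh$ as the activation, and checks one coordinate against a central finite difference. All function names are hypothetical.

```python
import numpy as np

def forward(W_in, W_hid, W_out, xi, phi=np.tanh):
    """Pre-activations h^1..h^L and output f(xi) of the MLP in (3.1)."""
    hs = [W_in @ xi]
    for W in W_hid:
        hs.append(W @ phi(hs[-1]))
    return float(W_out @ phi(hs[-1])), hs

def backward(W_hid, W_out, hs, dphi):
    """Finite-width analogue of (3.5)-(3.7): returns df/dh^l for l = 1..L."""
    dx = W_out.copy()                      # df/dx^L, cf. (3.5) up to the factor n
    dhs = []
    for l in range(len(hs) - 1, -1, -1):
        dh = dx * dphi(hs[l])              # (3.6): dh^l = dx^l * phi'(h^l)
        dhs.append(dh)
        if l > 0:
            dx = W_hid[l - 1].T @ dh       # (3.7): dx^{l-1} = (W^l)^T dh^l
    return dhs[::-1]

def f_from_h1(W_hid, W_out, h1, phi=np.tanh):
    """Output as a function of the first pre-activation, used for the check."""
    h = h1
    for W in W_hid:
        h = W @ phi(h)
    return float(W_out @ phi(h))

rng = np.random.default_rng(0)
n, d = 16, 4
W_in = rng.normal(size=(n, d))
W_hid = [rng.normal(scale=n ** -0.5, size=(n, n)) for _ in range(2)]
W_out = rng.normal(scale=1.0 / n, size=n)
_, hs = forward(W_in, W_hid, W_out, rng.normal(size=d))
dhs = backward(W_hid, W_out, hs, dphi=lambda h: 1.0 - np.tanh(h) ** 2)

# Central finite difference of f with respect to the first coordinate of h^1.
eps = 1e-5
e0 = np.zeros(n)
e0[0] = eps
fd = (f_from_h1(W_hid, W_out, hs[0] + e0)
      - f_from_h1(W_hid, W_out, hs[0] - e0)) / (2 * eps)
```

At finite width the term $W^{l\top}dh^{l}$ is computed exactly; the split into a Gaussian part and a history term $G_{t}$ only arises in the infinite-width description.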

Remark 3.2.

The error signal generalizes to different optimization objectives. For example, in binary classification problems, the error signal can be expressed as $\mathring{\chi}_{t,i}=-y_{i}/\big(1+\exp(y_{i}\cdot\mathring{f}_{t}(\xi_{i}))\big)\mathds{1}\{i\in\mathcal{B}_{t}\}$ , where $\mathring{f}_{t}$ is the model output at time $t$ , $(\xi_{i},y_{i})$ is the $i$ -th training sample pair, and $\mathcal{B}_{t}$ denotes the mini-batch at time step $t$ . Note that $\mathcal{L}$ is only required to be continuously differentiable with respect to its first argument (Yang and Hu, 2020); we omit this requirement in subsequent presentations.
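The binary-classification expression is consistent with the logistic loss $\log(1+\exp(-y f))$ with labels $y\in\{-1,+1\}$ ; that choice of loss is our assumption, since the remark does not name it. The following sketch compares the formula (without the mini-batch indicator) against a finite-difference derivative of that loss.

```python
import numpy as np

def logistic_loss(f, y):
    """Assumed loss: log(1 + exp(-y f)) with labels y in {-1, +1}."""
    return np.log1p(np.exp(-y * f))

def logistic_error_signal(f, y):
    """chi = -y / (1 + exp(y f)), the derivative of the loss above in f."""
    return -y / (1.0 + np.exp(y * f))

f, y, eps = 0.3, -1.0, 1e-6
fd = (logistic_loss(f + eps, y) - logistic_loss(f - eps, y)) / (2 * eps)
chi = logistic_error_signal(f, y)
```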

4 Main Results

In this section, we present our main theoretical results, which rely on the following assumptions regarding the training data and activation function. Specifically, we will first state a mild geometric condition on the inputs, and then discuss the regularity requirements on the activation function.

Assumption 4.1.

Consider input vectors $\xi$ drawn from the training data set $S\subseteq\mathbb{R}^{d}$ satisfying that for any three different points $\xi_{i},\xi_{j},\xi_{k}\in S$ , the following property holds,

$$|\langle\xi_{i},\xi_{j}\rangle|\neq|\langle\xi_{i},\xi_{k}\rangle|,\quad|\langle\xi_{i},\xi_{j}\rangle|\neq 0,\quad\forall i\neq j.$$

Assumption 4.1 rules out the possibility of identical or zero inner products among different data points, which could otherwise lead to degenerate analyses. Although it may appear restrictive, it holds with probability $1$ if the samples are drawn from any continuous distribution (e.g., Gaussian). Indeed, the set of points violating the above requirement—such as those with exactly matching inner products—has Lebesgue measure zero. In practice, minor random perturbations to discrete data can also ensure the condition is satisfied.
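Assumption 4.1 can be tested on a finite dataset by inspecting the absolute Gram matrix row by row. The helper below is a hypothetical sketch; the tolerance `tol` is an assumption needed because exact equality is not meaningful in floating point.

```python
import numpy as np

def satisfies_assumption_4_1(X, tol=1e-12):
    """Check Assumption 4.1 on the rows of X up to a numerical tolerance."""
    G = np.abs(X @ X.T)                          # G[i, j] = |<xi_i, xi_j>|
    for i in range(G.shape[0]):
        row = np.sort(np.delete(G[i], i))        # |<xi_i, xi_j>| for all j != i
        if row.size and row[0] <= tol:           # some inner product vanishes
            return False
        if np.any(np.diff(row) <= tol):          # two inner products coincide
            return False
    return True

rng = np.random.default_rng(0)
gaussian_ok = satisfies_assumption_4_1(rng.standard_normal((5, 3)))
orthogonal_ok = satisfies_assumption_4_1(np.eye(3))
tied_ok = satisfies_assumption_4_1(np.array([[1.0, 0.0], [1.0, 1.0], [1.0, -1.0]]))
```

Gaussian samples pass the check, whereas an orthonormal set (zero inner products) and a set with tied absolute inner products both fail.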

Definition 4.2 (GOOD Function).

A function $\phi:\mathbb{R}\to\mathbb{R}$ is called GOOD if it prevents degeneracy in neural networks by ensuring non-trivial compositions. Specifically, for any finite set of parameters $\{a_{i}\},\{b_{i}\},\{c_{i}\}$ such that $a_{k}b_{k}\neq 0$ for some $k$ and $|b_{i}|\neq|b_{j}|$ for all $i\neq j$ , the composite mapping

$$f(x)=\sum_{i=1}^{n}a_{i}\phi(b_{i}x+c_{i}),\quad x\in\mathbb{R}$$

is not a constant function. Moreover, for any real numbers $r_{1},r_{2}$ , the function $(r_{1}+\phi(x))(r_{2}+\phi^{\prime}(x))$ is not almost everywhere constant.

We next introduce an assumption on the activation function that ensures it is both sufficiently smooth and GOOD:

Assumption 4.3.

We assume that the activation function $\phi$ satisfies the following properties.

1. $\phi$ is twice continuously differentiable.

2. $\phi^{\prime}$ and $\phi^{\prime\prime}$ are bounded.

3. $\phi$ is a GOOD function.

4. $\{x\in\mathbb{R}:\phi(x)=y\}$ and $\{x\in\mathbb{R}:\phi^{\prime}(x)=y\}$ are countable for all $y\in\mathbb{R}$ .

Remark 4.4.

Assumption 4.3 imposes regularity and smoothness conditions on the activation function, ensuring that $\phi^{\prime}$ is pseudo-Lipschitz (see Yang and Hu (2020, Definition E.3) for the definition of pseudo-Lipschitz functions), a requirement for Yang and Hu (2020, Theorem 7.4). These conditions are met by many commonly used activation functions, including the sigmoid function $\sigma(x)=1/\big(1+\exp(-x)\big)$ and hyperbolic tangent ( $\tanh$ ), which is a rescaled version of sigmoid.

Modern activation functions such as the SiLU (Sigmoid Linear Unit), defined as $\text{SiLU}(x)=x\cdot\sigma(x)$ (Hendrycks and Gimpel, 2016), also satisfy these assumptions. SiLU has been widely adopted in practice, including in several state-of-the-art open-source foundation models (Touvron et al., 2023a, b). A detailed discussion of activation functions that meet these criteria is provided in Appendix D.
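The activations mentioned above are easy to inspect numerically. The sketch below checks on a grid that $\tanh(x)=2\sigma(2x)-1$ and that the derivative of SiLU, $\sigma(x)\big(1+x(1-\sigma(x))\big)$ , stays bounded. A grid check is only a sanity test under our own choice of range and resolution, not a proof of Assumption 4.3.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def silu(x):
    return x * sigmoid(x)

def silu_prime(x):
    """Derivative of x * sigmoid(x): sigma(x) * (1 + x * (1 - sigma(x)))."""
    s = sigmoid(x)
    return s * (1.0 + x * (1.0 - s))

x = np.linspace(-20.0, 20.0, 4001)

# tanh is a shifted and rescaled sigmoid.
tanh_gap = float(np.max(np.abs(np.tanh(x) - (2.0 * sigmoid(2.0 * x) - 1.0))))

# The closed-form derivative agrees with a central finite difference.
eps = 1e-6
fd_gap = float(np.max(np.abs((silu(x + eps) - silu(x - eps)) / (2 * eps) - silu_prime(x))))

max_abs_silu_prime = float(np.max(np.abs(silu_prime(x))))
```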

With these assumptions in place, we can now state our main theoretical results regarding feature non-degeneracy and convergence. In particular, the following theorem establishes that in wide neural networks, feature representations evolve while maintaining their diversity and avoiding collapse throughout training.

Theorem 4.5.

Consider an infinite-width $L$ -layer MLP trained with gradient descent. Under Assumptions 4.1 and 4.3, the features in each layer are non-degenerate at any time $t$ during training. Specifically, for each layer $l\in[L]$ :

1. The pre-activation features $\{Z^{h_{t}^{l}(\xi)}\}_{\xi\in S}$ are linearly independent.

2. The post-activation features $\{Z^{x_{t}^{l}(\xi)}\}_{\xi\in S}$ are linearly independent.

This non-degeneracy property has important implications for the convergence behavior of the model. In particular, it allows us to characterize the state of the model at convergence, as described in the following corollary.

Corollary 4.6.

Consider an infinite-width $L$ -layer MLP under the conditions of Theorem 4.5. If the model converges at time $T$ , meaning that the model weights remain unchanged for all $t\geq T$ , then the error signal vanishes for all subsequent mini-batches:

$$\mathring{\chi}_{T,i}=0,\quad\forall i\in\bigcup_{t\geq T}\mathcal{B}_{t},$$

where $\mathcal{B}_{t}$ denotes the mini-batch at time $t$ .

This corollary establishes that feature non-degeneracy forces convergence to occur only at critical points where the error signal vanishes. More precisely, when the network converges, the error signals must vanish across all samples in subsequent mini-batches, implying convergence to a global minimum of the training objective. This is a consequence of the feature non-degeneracy established in Theorem 4.5, as non-degenerate features ensure that weight updates can only stop when the network has effectively minimized the error signals.
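One way to see the mechanism, using only the last-layer update (3.3) in the full-batch setting and assuming $\eta>0$ , is the following sketch; the complete argument is the one given in Appendix C. If the weights no longer change after time $T$ , then $Z^{\delta W_{T+1}^{L+1}}=0$ almost surely, and the linear independence of $\{Z^{x_{T}^{L}(\xi_{i})}\}_{i\in[m]}$ from Theorem 4.5 forces every coefficient to vanish:

```latex
\begin{align*}
0 \overset{a.s.}{=} Z^{\delta W_{T+1}^{L+1}}
  = -\eta\sum_{i\in[m]}\mathring{\chi}_{T,i}\,Z^{x_{T}^{L}(\xi_{i})}
\quad\Longrightarrow\quad
\mathring{\chi}_{T,i}=0 \quad \text{for all } i\in[m].
\end{align*}
```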

5 Key Techniques and Analysis

In this section, we first identify the key technical challenges in establishing our main results, and then present the techniques and insights to address them. We begin by discussing two fundamental challenges: the tension between feature evolution and structural stability, and the intricate coupling across network layers. We then develop a systematic framework based on Gaussian processes to overcome these challenges. The complete proof is presented in Appendix C.

5.1 Technical Challenges

Establishing global convergence while allowing meaningful feature learning presents two fundamental technical challenges that must be addressed simultaneously:

Feature Evolution vs. Structural Stability: In contrast to the NTK parametrization (where features stay near their initialization), $\mu$ P enables features to evolve substantially during training. Specifically, for any feature $z\in\{x^{l},h^{l}\}_{l}$ in the Forward Pass of Section 3, we have:

$$Z^{z_{t}(\xi)}=Z^{z_{0}(\xi)}+\underbrace{Z^{\delta z_{1}(\xi)}+\cdots+Z^{\delta z_{t}(\xi)}}_{\text{feature learning term}}.$$

The presence of the feature learning term makes it challenging to track and characterize features’ properties throughout optimization. This contrasts sharply with the setting under NTK parametrization, where $Z^{z_{t}(\xi)}$ stays equal to its initialization $Z^{z_{0}(\xi)}$ (Yang and Hu, 2020) - a mathematically simpler but limited case where the network behavior is fully determined by the initial kernel.

Cross-Layer Coupling: In deep networks, changes in one layer’s features affect both earlier and later layers through forward and backward propagation. For forward propagation in layer $l$ and backward propagation in layer $l+1$ , we have by (3.2) and (3.7):

\begin{align}
Z^{\delta h^{l}_{t}(\xi)} &= \widehat{Z}^{W^{l}_{0}\delta x^{l-1}_{t}(\xi)}+F_{t}(\xi), \tag{5.1}\\
Z^{dx_{s}^{l}(\xi)} &= \widehat{Z}^{W_{0}^{l+1\top}dh_{s}^{l+1}(\xi)}+G_{s}(\xi), \tag{5.2}
\end{align}

where $F_{t}$ and $G_{s}$ capture the historical dependencies through previous gradients $\{Z^{dh^{l}_{s}(\xi_{i})}\}_{i\in[m],s\in[t-1]}$ and previous features $\{Z^{x_{s}^{l-1}(\xi_{i})}\}_{i\in[m],s\in[t-1]}$ respectively. This intricate coupling between forward and backward passes makes it challenging to ensure that features remain well-behaved as they propagate through the network.

Our key insight in addressing these challenges lies in analyzing structural invariants preserved by the induced Gaussian processes during training. While features evolve substantially, we find that certain second-order properties—specifically, non-degeneracy—remain invariant across layers and time steps. This invariance ensures rich feature learning while preventing the network from getting stuck in local minima.

5.2 The Gaussian Process View

In the infinite-width limit, neural network training induces two families of Gaussian processes that capture forward and backward propagation:

\begin{align}
&\{\widehat{Z}^{W_{0}^{l}\delta x_{s}^{l-1}(\xi_{i})}\}_{i\in[m],s\in[t],2\leq l\leq L}, \tag{5.3}\\
&\{\widehat{Z}^{W_{0}^{l\top}dh_{s}^{l}(\xi_{i})}\}_{i\in[m],s\in[t],2\leq l\leq L}. \tag{5.4}
\end{align}

The forward process (5.3) tracks how features evolve across layers, while the backward process (5.4) describes gradient flow. Unlike prior work that studies these processes in isolation, we discover fundamental connections between their structural properties that enable both feature learning and convergence.

Covariance Structure of Gaussian Processes Our key technical insight is that these Gaussian processes (5.3) and (5.4) exhibit invariant covariance properties that persist throughout training. Recall from (5.1) and (5.2) that both forward and backward propagation can be decomposed into a Gaussian term and a history-dependent term:

$$Z^{\delta h^{l}_{t}(\xi)}=\underbrace{\widehat{Z}^{W^{l}_{0}\delta x^{l-1}_{t}(\xi)}}_{\text{Gaussian term}}+\underbrace{F_{t}(\xi)}_{\text{history term}},\qquad Z^{dx_{s}^{l}(\xi)}=\underbrace{\widehat{Z}^{W_{0}^{l+1\top}dh_{s}^{l+1}(\xi)}}_{\text{Gaussian term}}+\underbrace{G_{s}(\xi)}_{\text{history term}}.$$

We notice that these Gaussian terms can preserve covariance relationships across layers throughout training:

\begin{align*}
\text{Cov}(\widehat{Z}^{W_{0}^{l}\delta x_{s}^{l-1}(\xi)},\widehat{Z}^{W_{0}^{l}\delta x_{t}^{l-1}(\bm{\zeta})}) &= \mathbb{E}[Z^{\delta x_{s}^{l-1}(\xi)}Z^{\delta x_{t}^{l-1}(\bm{\zeta})}],\\
\text{Cov}(\widehat{Z}^{W_{0}^{l\top}dh_{s}^{l}(\xi)},\widehat{Z}^{W_{0}^{l\top}dh_{t}^{l}(\bm{\zeta})}) &= \mathbb{E}[Z^{dh_{s}^{l}(\xi)}Z^{dh_{t}^{l}(\bm{\zeta})}].
\end{align*}

These covariance relationships reveal that feature correlations between adjacent layers follow consistent patterns, even as individual features evolve. They link the feature spaces of adjacent layers through their second-order statistics, providing a structural bridge that persists throughout the training process.
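The covariance rule can be illustrated at finite width in the simplest situation, where the vectors fed into the layer are independent of the weight matrix. This is an assumption of the sketch: during training the inputs depend on $W_{0}^{l}$ , which is why the paper isolates the Gaussian component $\widehat{Z}$ . With $W$ having i.i.d. $\mathcal{N}(0,1/n)$ entries, the empirical covariance of the entries of $Wx$ and $Wy$ should be close to $x^{\top}y/n$ .

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Two correlated layer inputs with entries of order one (illustrative choice).
g1 = rng.standard_normal(n)
g2 = 0.5 * g1 + np.sqrt(0.75) * rng.standard_normal(n)
x, y = np.tanh(g1), np.tanh(g2)

W = rng.normal(scale=n ** -0.5, size=(n, n))   # hidden-layer init, variance 1/n
u, v = W @ x, W @ y

lhs = float(u @ v) / n     # empirical covariance of the entries of Wx and Wy
rhs = float(x @ y) / n     # empirical counterpart of E[Z^x Z^y]
```

The gap between `lhs` and `rhs` shrinks at rate $n^{-1/2}$ as the width grows.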

5.3 From Covariance Structure to Non-degeneracy

The preservation of covariance relationships across layers ensures the non-degeneracy of the induced Gaussian processes throughout training. In the proof of Theorem 4.5, we consider any linear combination of the Gaussian processes:

$$\sum_{i\in[m],s\in[t]}\lambda_{i,s}\widehat{Z}^{W_{0}^{l}\delta x_{s}^{l-1}(\xi_{i})},\quad\sum_{i\in[m],s\in[t]}\lambda_{i,s}\widehat{Z}^{W_{0}^{l\top}dh_{s}^{l}(\xi_{i})}.$$

Through our covariance preservation property, we show that if these linear combinations degenerate (i.e., equal to zero almost surely), then the corresponding linear combinations of original features and gradients must also degenerate:

$$\sum_{i\in[m],s\in[t]}\lambda_{i,s}Z^{\delta x_{s}^{l-1}(\xi_{i})}\overset{a.s.}{=}0,\quad\sum_{i\in[m],s\in[t]}\lambda_{i,s}Z^{dh_{s}^{l}(\xi_{i})}\overset{a.s.}{=}0.$$

This connection through linear combinations allows us to transfer the non-degeneracy property from feature space to the induced Gaussian processes across layers, establishing that both forward and backward processes remain non-degenerate throughout training. This result reveals a fundamental connection between covariance structure and feature richness: the preservation of covariance relationships ensures that linear independence propagates through layers.
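The step from the Gaussian combination to the feature combination can be written out for the forward process; the backward case is analogous. This is a sketch that uses only the zero-mean property and the covariance rule stated in Section 5.2. Expanding the second moment gives

```latex
\begin{align*}
\mathbb{E}\Big[\Big(\sum_{i\in[m],s\in[t]}\lambda_{i,s}\widehat{Z}^{W_{0}^{l}\delta x_{s}^{l-1}(\xi_{i})}\Big)^{2}\Big]
&= \sum_{i,j\in[m]}\sum_{s,r\in[t]}\lambda_{i,s}\lambda_{j,r}\,
   \mathbb{E}\big[Z^{\delta x_{s}^{l-1}(\xi_{i})}Z^{\delta x_{r}^{l-1}(\xi_{j})}\big] \\
&= \mathbb{E}\Big[\Big(\sum_{i\in[m],s\in[t]}\lambda_{i,s}Z^{\delta x_{s}^{l-1}(\xi_{i})}\Big)^{2}\Big].
\end{align*}
```

If the Gaussian combination vanishes almost surely, the left-hand side is zero, so the second moment on the right is zero and the feature combination vanishes almost surely as well.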

This directly contrasts with other parametrizations. In the NTK parametrization, features stay at their initialization in the infinite-width limit, $Z^{z_{t}(\xi)}=Z^{z_{0}(\xi)}$ , so the joint space-time process is necessarily degenerate: it captures no new information during training. Our analysis shows that $\mu$ P maintains the non-degeneracy of features across both space and time, enabling the network to learn rich and meaningful features throughout training.

To empirically validate this theoretical finding, we analyze the minimum eigenvalue of the feature matrix constructed from the joint space-time features at Layer $2$ , by appending initial and final representations, complementing our analysis of feature diversity in Figure 1. Under the same experimental setup with 3-layer MLPs trained on CIFAR-10, Figure 2 shows that the $\mu$ P parametrization maintains higher eigenvalues across different network widths compared to other parametrizations. This aligns with our theoretical prediction and further strengthens the findings in Figure 1 where we observed $\mu$ P’s unique ability to achieve both feature learning and feature richness.

5.4 Evolution Framework

To rigorously track how these structural properties evolve throughout training, we need to carefully handle the natural flow of information in neural networks: forward propagation followed by backward propagation. In each iteration, the network first computes forward features through all layers, then calculates gradients backwards for parameter updates. This computational pattern naturally leads to a two-level filtration framework. We introduce a sequence of $\sigma$ -algebras to track the evolution of random variables during training. Let $\mathcal{F}_{0}$ denote the initial condition:

\displaystyle\mathcal{F}_{0}=\sigma\big{(}\{Z^{h^{l}_{0}(\xi_{i})},Z^{x^{l}_{0}(\xi_{i})}\}_{i\in[m],l\in[L]},Z^{\widehat{W}_{0}^{L+1}}\big{)}

Then we define $\mathcal{F}_{t}$ to track all completed iterations up to time $t$ , and an extended filtration $\mathcal{G}_{t}$ to capture the forward pass of the $(t+1)$ -th iteration:

	$\displaystyle\mathcal{F}_{t}$	$\displaystyle=\sigma\big{(}\mathcal{F}_{0},\{\widehat{Z}^{W_{0}^{l}\delta x_{s}^{l-1}(\xi_{i})}\}_{i\in[m],s\in[t],2\leq l\leq L},\{\widehat{Z}^{W_{0}^{l\top}dh_{s}^{l}(\xi_{i})}\}_{i\in[m],s\in[t],2\leq l\leq L}\big{)}$		(5.5)
	$\displaystyle{\mathcal{G}}_{t}$	$\displaystyle=\sigma\big{(}\mathcal{F}_{t},\{\widehat{Z}^{W_{0}^{l}\delta x_{t+1}^{l-1}(\xi_{i})}\}_{i\in[m],2\leq l\leq L}\big{)}$		(5.6)

This filtration structure allows us to precisely track how information flows through the network: $\mathcal{F}_{t}$ contains all information up to time $t$ , while $\mathcal{G}_{t}$ extends this to include the forward pass information at time $t+1$ before its backward pass begins. This framework enables us to:

1. Inductive Proof Structure: The filtration framework enables a structured inductive proof that follows the natural flow of computation in neural networks. We establish non-degeneracy in four steps, motivated by how information propagates through the network:

• Step 1: Features in the first hidden layer $\widehat{Z}^{W_{0}^{2}\delta x_{s}^{1}(\xi_{i})}$ . This forms the base case, as it depends only on the input.

• Step 2: Features in the remaining layers $\widehat{Z}^{W_{0}^{l}\delta x_{s}^{l-1}(\xi_{i})}$ . This uses the non-degeneracy of the previous layers.

• Step 3: Gradients in the last layer $\widehat{Z}^{W_{0}^{L\top}dh_{s}^{L}(\xi_{i})}$ . This builds upon the established feature properties.

• Step 4: Gradients in the remaining layers $\widehat{Z}^{W_{0}^{l\top}dh_{s}^{l}(\xi_{i})}$ . This completes the backward pass analysis.

Each step leverages the non-degeneracy established in previous steps, creating a chain of dependency that mirrors the network’s computation graph.

2. Conditional Analysis: The filtration enables a precise decomposition of feature and gradient updates into new and historical information:

	$\displaystyle Z^{h^{l}_{s}(\xi_{i})}$	$\displaystyle=\underbrace{\widehat{Z}^{W_{0}^{l}\delta x^{l-1}_{s}(\xi_{i})}}_{\text{new randomness}}+\underbrace{\Delta_{s}(\xi_{i})}_{\text{history}}$
	$\displaystyle Z^{dx_{s}^{l}(\xi_{i})}$	$\displaystyle=\underbrace{\widehat{Z}^{W_{0}^{l+1\top}dh_{s}^{l+1}(\xi_{i})}}_{\text{new randomness}}+\underbrace{G_{s}(\xi_{i})}_{\text{history}}$

where $\Delta_{s}(\xi_{i})\in\mathcal{F}_{s-1}$ captures the accumulated feature history, and $G_{s}(\xi_{i})$ represents previous gradient information. This decomposition is crucial for our inductive proof: by focusing on the new randomness in each step, we can show that non-degeneracy is preserved when conditioned on all historical information.

3. Non-degeneracy Preservation: By leveraging GOOD activation functions as introduced in Assumption 4.3 (e.g., Sigmoid, Tanh, SiLU) and the covariance structure, we show that non-degeneracy propagates forward in time. Specifically, suppose that at time $t$ we have:

\displaystyle\sum_{i\in[m],s\in[t]}\lambda_{i,s}\widehat{Z}^{W_{0}^{l}\delta x_{s}^{l-1}(\xi_{i})}\overset{a.s.}{\not=}0,\qquad\sum_{i\in[m],s\in[t]}\lambda_{i,s}\widehat{Z}^{W_{0}^{l\top}dh_{s}^{l}(\xi_{i})}\overset{a.s.}{\not=}0.

Then we show that these sums remain nonzero at time $t+1$ based on two key properties: (1) the non-degeneracy of Gaussian processes is preserved when they share the same covariance structure, and (2) GOOD activation functions, such as Sigmoid and SiLU, exhibit a crucial “non-collapsing” property: a linear combination of activations of jointly non-degenerate Gaussian variables cannot be almost surely constant unless all combining coefficients are zero (Lemma C.2). Together, these properties ensure that features can evolve significantly during training while preserving their diversity, a fundamental distinction from the NTK parametrization, where features remain near their initialization.

Now, we revisit the key technical challenges introduced at the beginning of this section and demonstrate how our framework addresses them.

Feature Evolution vs. Structural Stability: Unlike NTK, where features remain close to their initialization, $\mu$ P enables substantial feature evolution. Our framework ensures that new randomness, represented by Gaussian features such as $\widehat{Z}^{W_{0}^{l}\delta x_{t+1}^{l-1}(\xi_{i})}$ , enters the system with a well-defined structure that preserves non-degeneracy. This structured evolution prevents feature collapse while allowing representations to adapt dynamically, ensuring both expressivity and stability throughout training.

Cross-Layer Coupling: The interplay between layers introduces dependencies that can destabilize training. By leveraging a two-level filtration structure $\mathcal{F}_{t}$ and $\mathcal{G}_{t}$ , our framework tracks both forward propagation $Z^{h^{l}_{s}}$ and backward propagation $Z^{dx_{s}^{l}}$ , ensuring that updates in one layer do not collapse the feature space of others. This structure maintains well-defined covariance relationships across layers, allowing $\mu$ P to support both deep feature learning and global convergence, distinguishing it from NTK and standard parametrizations.

6 Conclusion and Future Work

In this work, we establish a fundamental theoretical result: deep neural networks under $\mu$ P parametrization can simultaneously achieve meaningful feature learning while preserving feature non-degeneracy. Through a rigorous analysis of Gaussian processes and their covariance structures, we show that features not only remain linearly independent throughout training but also undergo substantial evolution from their initialization. This provides insight into a fundamental question in deep learning theory: how neural networks can simultaneously learn expressive representations and achieve global convergence.

Our analysis establishes fundamental connections between covariance preservation and feature richness. By preventing feature degeneracy, our framework provides a rigorous foundation for understanding how overparameterized networks learn expressive representations. Moreover, our results highlight the crucial role of parametrization in enabling both stable training and meaningful feature evolution. These insights into how $\mu$ P enables both feature learning and global convergence suggest promising directions for bridging the gap between theory and practical deep learning success.

Several promising directions for future work emerge from our analysis. First, extending our theoretical framework to transformer architectures, particularly the attention mechanism, would be valuable for understanding feature learning in modern language models. Second, our analysis of structural invariants could provide new perspectives on convergence rates beyond just global convergence, potentially informing optimization strategies in deep learning. Third, studying how our insights on feature non-degeneracy influence generalization bounds may yield deeper theoretical foundations for understanding the generalization properties of deep neural networks. Finally, exploring how $\mu$ P interacts with more complex training paradigms, such as fine-tuning and self-supervised learning, could further enhance our understanding of deep network training dynamics in practical settings.

Appendix A Experimental Details

Experimental details for Figures 1 and 2. We conduct experiments using 3-layer MLPs with input dimension $n_{0}=3072$ (flattened CIFAR-10 images), three hidden layers of equal width $n_{1}=n_{2}=n_{3}=n$ with $n\in\{8,16,32,64,128,256,512,1024,2048,4096\}$ , and output dimension 1. In each experiment, we use only 10 samples randomly selected from the airplane and automobile classes of CIFAR-10. All networks use SiLU activation functions and are trained on a binary classification task with $\pm 1$ targets for 1000 steps. We use a global learning rate $\eta=0.1$ across all parametrization schemes, chosen to ensure stable training across parametrizations, with $10$ runs for each width setting.

We implement the following parametrization schemes:

Standard Parametrization (SP):

\displaystyle\sigma_{\ell}=\sqrt{\frac{2}{n_{\ell-1}}};\quad\eta_{\ell}=\eta\cdot\frac{1}{n}

at all layers, with $\eta=0.1$ .

Neural Tangent Parametrization (NTP):

\displaystyle\sigma_{\ell}=\sqrt{\frac{2}{n_{\ell-1}}};\quad\eta_{\ell}=\eta\cdot\frac{1}{n_{\ell-1}}

at all layers, with $\eta=0.1$ .

Integrable Parametrization (IP): For initialization variances:

\displaystyle\sigma_{1}=\sqrt{\frac{2}{d}};\quad\sigma_{\ell}=\frac{\sqrt{2}}{n},\quad\ell\geq 2.

For learning rates:

\displaystyle\eta_{1}=\eta\cdot\frac{n}{d};\quad\eta_{2}=\eta_{3}=\eta;\quad\eta_{4}=\eta\cdot\frac{1}{n}.

Maximal Update Parametrization ( $\mu$ P): For initialization variances:

\displaystyle\sigma_{1}=\sqrt{\frac{2}{d}};\quad\sigma_{2}=\sigma_{3}=\sqrt{\frac{2}{n}};\quad\sigma_{4}=\frac{\sqrt{2}}{n}.

For learning rates:

\displaystyle\eta_{1}=\eta\cdot\frac{n}{d};\quad\eta_{2}=\eta_{3}=\eta;\quad\eta_{4}=\eta\cdot\frac{1}{n}.
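The four schemes above can be collected into a single lookup. The following is a minimal sketch that transcribes the listed formulas for the four weight matrices of the 3-hidden-layer MLP, with $n_{0}=d$ and $n_{1}=n_{2}=n_{3}=n$; the function name `layer_scales` and the scheme labels are hypothetical and are not taken from our experiment code.

```python
import math

def layer_scales(scheme, d, n, eta=0.1):
    """Return (sigmas, etas): per-layer init scale and learning rate for the
    4 weight matrices, following the formulas listed in this appendix.

    Layer l has fan-in n_{l-1}, with n_0 = d and n_1 = n_2 = n_3 = n.
    """
    fan_in = [d, n, n, n]
    if scheme in ("SP", "NTP"):
        sigmas = [math.sqrt(2.0 / f) for f in fan_in]
        # SP uses eta / n at all layers; NTP uses eta / n_{l-1}.
        etas = [eta / n] * 4 if scheme == "SP" else [eta / f for f in fan_in]
    elif scheme in ("IP", "muP"):
        # IP and muP differ only in the scale of the two middle layers.
        hidden = math.sqrt(2.0) / n if scheme == "IP" else math.sqrt(2.0 / n)
        sigmas = [math.sqrt(2.0 / d), hidden, hidden, math.sqrt(2.0) / n]
        etas = [eta * n / d, eta, eta, eta / n]
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    return sigmas, etas

sigmas, etas = layer_scales("muP", d=3072, n=1024)
```

Written this way, it is easy to see that IP and $\mu$ P share learning rates and differ only in the initialization scale of the middle layers.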

Networks are trained with batch size 10 for 1000 steps, sufficient for all widths to achieve stable feature representations with training loss smaller than $0.05$ . For each width configuration, we conduct $10$ independent trials with different random seeds ( $42\sim 53$ ) and report the mean values in Figures 1 and 2. To quantify feature properties, we measure two metrics:

•

Feature change: $\|h(x)-h^{0}(x)\|_{2}/\|h^{0}(x)\|_{2}$ , where $h^{0}$ represents features at initialization
•

Feature diversity: minimum eigenvalue of the Gram matrix $K_{ij}=\langle h(x_{i}),h(x_{j})\rangle$ computed over batch samples

These measurements allow us to track both the evolution of features from their initialization state and the maintenance of feature richness throughout training.
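For concreteness, the two metrics can be computed as in the NumPy sketch below. It assumes features are stored as an array of shape (batch, width), with one row per sample, and that the feature change is reported per sample; any aggregation across samples or normalization by width is left open here, and the function names are illustrative.

```python
import numpy as np

def feature_change(h, h0):
    """Relative change ||h(x) - h0(x)||_2 / ||h0(x)||_2 for each sample (row)."""
    return np.linalg.norm(h - h0, axis=1) / np.linalg.norm(h0, axis=1)

def feature_diversity(h):
    """Minimum eigenvalue of the Gram matrix K_ij = <h(x_i), h(x_j)>."""
    K = h @ h.T
    # eigvalsh returns the eigenvalues of a symmetric matrix in ascending order.
    return float(np.linalg.eigvalsh(K)[0])

h0 = np.eye(3)          # toy features at initialization
h = 2.0 * np.eye(3)     # toy features after training
change = feature_change(h, h0)
diversity = feature_diversity(h)
```

A minimum eigenvalue near zero signals that the feature vectors of the batch are close to linearly dependent, which is the degeneracy the theory rules out in the infinite-width limit.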

A.1 Additional Results on Activation Functions

To further examine the impact of activation functions, we conduct experiments with Tanh and ReLU under the same settings as described in Section A. Below, we present the feature evolution and diversity results for these activations.

Our theoretical analysis fully explains the feature learning behavior of Tanh networks, as confirmed by our experimental results in Figure 3. Tanh enables meaningful feature learning while leading to a gradual decrease in feature diversity as width goes up. This is consistent with our theoretical predictions, which account for the smooth and bounded nature of the Tanh activation.

While our theoretical analysis does not directly apply to ReLU due to its non-smooth nature, our experimental results in Figure 4 indicate that ReLU-trained networks still exhibit feature learning and maintain meaningful representations under Maximal Update Parametrization ( $\mu$ P). One possible explanation is that, although ReLU lacks explicit smoothness assumptions required in our analysis, its piecewise linear structure still allows for non-trivial feature evolution in practice. Moreover, $\mu$ P ensures that weight updates are appropriately scaled across layers, preventing degenerate training dynamics that could otherwise hinder learning in deep networks. Understanding the precise mechanisms behind ReLU’s feature learning in the infinite-width setting remains an important direction for future theoretical work.

Appendix B More Details for $\mu$ P Parametrization

Formally, the MLP definition (Yang and Hu, 2020, Table 1) in this section is

\displaystyle h^{1}=W^{1}\xi\in\mathbb{R}^{n},\quad x^{l}=\phi(h^{l})\in\mathbb{R}^{n},\quad h^{l+1}=W^{l+1}x^{l}\in\mathbb{R}^{n},\quad f(\xi)=W^{L+1}x^{L},

(B.1)

where $L>1$ is any positive integer and $l\in\{1,\dots,L-1\}$ . Then the $\mu$ P for this $L$ -hidden-layer MLP is defined as follows (Yang and Hu, 2020).

1.

Initial weight matrices in the middle layers: $W_{0}^{2},\ldots,W_{0}^{L}$ , with each coordinate $(W_{0}^{l})_{\alpha\beta}\sim\mathcal{N}(0,1/n)$ .
2.

Initial weight matrices in the input and output layers: input layer matrix $W_{0}^{1}\in\mathbb{R}^{n\times d}$ and output layer matrix $\widehat{W}_{0}^{L+1}\coloneqq W_{0}^{L+1}n\in\mathbb{R}^{1\times n}$ , with each coordinate $(W_{0}^{1})_{\alpha\beta},(\widehat{W}_{0}^{L+1})_{\alpha\beta}\sim\mathcal{N}(0,1)$ .
3.

Initial model outputs: we define the scalars $f_{0}(\xi)\coloneqq W_{0}^{L+1}x_{0}^{L}(\xi)$ for any input $\xi$ .
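The initialization listed above, together with the forward pass (B.1), can be sketched in NumPy as follows. This is an illustrative finite-width sketch: the function names, the choice of Tanh for $\phi$, and the toy sizes are assumptions made here for the example.

```python
import numpy as np

def init_mup(d, n, L, rng):
    """Sample initial weights of an L-hidden-layer MLP as in items 1 and 2."""
    W1 = rng.normal(0.0, 1.0, size=(n, d))                 # input layer: N(0, 1)
    Ws = [rng.normal(0.0, 1.0 / np.sqrt(n), size=(n, n))   # W^2, ..., W^L: N(0, 1/n)
          for _ in range(L - 1)]
    W_hat = rng.normal(0.0, 1.0, size=(1, n))              # \hat{W}^{L+1}: N(0, 1)
    return W1, Ws, W_hat / n                               # W^{L+1} = \hat{W}^{L+1} / n

def forward(xi, W1, Ws, W_out, phi=np.tanh):
    """Compute f(xi) following (B.1)."""
    h = W1 @ xi                      # h^1 = W^1 xi
    for W in Ws:
        h = W @ phi(h)               # h^{l+1} = W^{l+1} phi(h^l)
    return (W_out @ phi(h)).item()   # f(xi) = W^{L+1} x^L

rng = np.random.default_rng(0)
W1, Ws, W_out = init_mup(d=8, n=64, L=3, rng=rng)
f0 = forward(rng.normal(size=8), W1, Ws, W_out)
```

Because the output weights carry the extra $1/n$ factor, the initial output `f0` shrinks as the width grows, which is consistent with item 3 defining $f_{0}(\xi)$ through $W_{0}^{L+1}$ rather than $\widehat{W}_{0}^{L+1}$.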

Under Assumption 4.3 on $\phi$ , we can similarly characterize the $Z$ variables in the infinite-width training dynamics of SGD for this $L$ -hidden-layer MLP as follows (Yang and Hu, 2020).

For $z\in\{x^{l},h^{l}\}_{l}$ , we have

\displaystyle Z^{z_{t}(\xi)}=Z^{z_{0}(\xi)}+Z^{\delta z_{1}(\xi)}+\cdots+Z^{\delta z_{t}(\xi)}

(B.2)

For $l\in[L],x=x^{l},h=h^{l}$ , we have

\displaystyle Z^{\delta x_{t}(\xi)}=\phi(Z^{h_{t}(\xi)})-\phi(Z^{h_{t-1}(\xi)}).

(B.3)
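As a quick consistency check of (B.2) and (B.3), the increments telescope. The short derivation below assumes only that the relation $x^{l}=\phi(h^{l})$ from (B.1) holds at initialization, i.e., $Z^{x_{0}(\xi)}=\phi(Z^{h_{0}(\xi)})$:

```latex
\begin{aligned}
Z^{x_{t}(\xi)}
  &= Z^{x_{0}(\xi)} + \sum_{s=1}^{t} Z^{\delta x_{s}(\xi)}
  && \text{by (B.2)} \\
  &= Z^{x_{0}(\xi)} + \sum_{s=1}^{t} \big[\phi(Z^{h_{s}(\xi)}) - \phi(Z^{h_{s-1}(\xi)})\big]
  && \text{by (B.3)} \\
  &= Z^{x_{0}(\xi)} + \phi(Z^{h_{t}(\xi)}) - \phi(Z^{h_{0}(\xi)})
   = \phi(Z^{h_{t}(\xi)}).
\end{aligned}
```

So the post-activation limit at time $t$ is the activation applied to the pre-activation limit at time $t$, as one would expect.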

For $h=h^{1}$ , we have

\displaystyle Z^{\delta h_{t}(\xi)}=-\sum_{i\in[m]}\eta\mathring{\chi}_{t-1,i}\xi_{i}^{\top}\xi Z^{dh_{t-1}(\xi_{i})}.

For $2\leq l\leq L,h=h^{l},x=x^{l-1},W=W^{l}$ , we have

\displaystyle Z^{\delta h_{t}(\xi)}=\widehat{Z}^{W_{0}\delta x_{t}(\xi)}+\dot{Z}^{W_{0}\delta x_{t}(\xi)}-\eta\sum_{s=0}^{t-1}\sum_{i\in[m]}\mathring{\chi}_{s,i}Z^{dh_{s}(\xi_{i})}\mathbb{E}Z^{x_{s}(\xi_{i})}Z^{x_{t}(\xi)}

where

\displaystyle\dot{Z}^{W_{0}\delta x_{t}(\xi)}=\sum_{i\in[m]}\sum_{s=0}^{t-1}Z^{dh_{s}(\xi_{i})}\mathbb{E}\frac{\partial Z^{\delta x_{t}(\xi)}}{\partial\widehat{Z}^{W_{0}^{\top}dh_{s}(\xi_{i})}}.

For the last-layer weight, we have

\displaystyle Z^{\widehat{W}_{t}^{L+1}}=Z^{\widehat{W}_{0}^{L+1}}-\eta\sum_{s=0}^{t-1}\sum_{i\in[m]}\mathring{\chi}_{s,i}Z^{x_{s}^{L}(\xi_{i})}

(B.4)

The output deltas have limits

\displaystyle\delta\mathring{f}_{t}(\xi)=\mathbb{E}Z^{\delta W_{t}^{L+1}}Z^{x_{t}^{L}(\xi)}+\mathbb{E}Z^{\widehat{W}_{t-1}^{L+1}}Z^{\delta x_{t}^{L}(\xi)}

(B.5)

and

\displaystyle\mathring{f}_{t}(\xi)=\delta\mathring{f}_{1}(\xi)+\cdots+\delta\mathring{f}_{t}(\xi).

For gradients:

$\displaystyle Z^{dx_{t}^{L}(\xi)}$	$\displaystyle=Z^{\widehat{W}_{t}^{L+1}}$	(B.6)
$\displaystyle Z^{dh_{t}^{l}(\xi)}$	$\displaystyle=Z^{dx_{t}^{l}(\xi)}\phi^{\prime}(Z^{h_{t}^{l}(\xi)})$	(B.7)
$\displaystyle Z^{dx_{t}^{l-1}(\xi)}$	$\displaystyle=\widehat{Z}^{W_{0}^{l\top}dh_{t}^{l}(\xi)}+\dot{Z}^{W_{0}^{l\top}dh_{t}^{l}(\xi)}-\eta\sum_{s=0}^{t-1}\sum_{i\in[m]}\mathring{\chi}_{s,i}Z^{x_{s}^{l-1}(\xi_{i})}\mathbb{E}Z^{dh_{s}^{l}(\xi_{i})}Z^{dh_{t}^{l}(\xi)}$	(B.8)

where

\displaystyle\dot{Z}^{W_{0}^{l\top}dh_{t}^{l}(\xi)}=\sum_{i\in[m]}\sum_{s=0}^{t-1}Z^{x^{l-1}_{s}(\xi_{i})}\mathbb{E}\frac{\partial Z^{dh_{t}^{l}(\xi)}}{\partial\widehat{Z}^{W_{0}^{l}x^{l-1}_{s}(\xi_{i})}}.

Loss derivative:

\displaystyle\mathring{\chi}_{t,i}=\mathcal{L}^{\prime}(\mathring{f}_{t},y_{i},t,i)=\mathcal{L}^{\prime}(\mathring{f}_{t},y_{i})\operatorname{\mathds{1}}\{i\in\mathcal{B}_{t}\}.

Intuition behind the entanglement term for the two-hidden-layer case. The inclusion of $W^{\top}$ in the backward pass substantially increases the system complexity by introducing multiplications between $W$ and certain nonlinear transformations of $W^{\top}$ in the forward pass, which necessitates involved definitions of $\dot{Z}^{Wx_{t}(\xi)}$ and $\dot{Z}^{W^{\top}d\bar{h}_{t}(\xi)}$ . Since they are all “conditioned out” in our analysis, we only showcase the definition of $\dot{Z}^{Wx_{t}(\xi)}=\sum_{j=1}^{m}\sum_{r=0}^{t-1}\theta_{r,j}Z^{d\bar{h}_{r}(\xi_{j})}$ to give a sense of the entanglement between $W$ and $W^{\top}$ . Here $\theta_{r,j}$ is computed as follows. By definition, $Z^{x_{t}(\xi)}$ is constructed as

\displaystyle Z^{x_{t}(\xi)}=\Phi(\widehat{Z}^{W^{\top}d\bar{h}_{0}(\xi_{1})},\dots,\widehat{Z}^{W^{\top}d\bar{h}_{0}(\xi_{m})},\ldots,\widehat{Z}^{W^{\top}d\bar{h}_{t-1}(\xi_{1})},\dots,\widehat{Z}^{W^{\top}d\bar{h}_{t-1}(\xi_{m})},Z^{U_{0}})

for some function $\Phi:\mathbb{R}^{mt+1}\rightarrow\mathbb{R}$ . Then

\displaystyle\theta_{r,j}=\mathbb{E}[\partial\Phi(\widehat{Z}^{W^{\top}d\bar{h}_{0}(\xi_{1})},\dots,\widehat{Z}^{W^{\top}d\bar{h}_{0}(\xi_{m})},\ldots,\widehat{Z}^{W^{\top}d\bar{h}_{t-1}(\xi_{1})},\dots,\widehat{Z}^{W^{\top}d\bar{h}_{t-1}(\xi_{m})},Z^{U_{0}})/\partial\widehat{Z}^{W^{\top}d\bar{h}_{r}(\xi_{j})}].

Appendix C Proof of Theorem 4.5

We begin by describing three key lemmas, each highlighting a crucial aspect of our subsequent proof.

Lemma C.1.

Suppose random variables $\{u_{k}\}_{k\in[K]}$ and $\{v_{k}\}_{k\in[K]}$ satisfy $\mathbb{E}[u_{i}u_{j}]=\mathbb{E}[v_{i}v_{j}]$ for all $i,j\in[K]$ . Then

\displaystyle\sum_{k\in[K]}\alpha_{k}u_{k}\overset{a.s.}{=}0\Leftrightarrow\sum_{k\in[K]}\alpha_{k}v_{k}\overset{a.s.}{=}0.

Proof.

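$\sum_{k\in[K]}\alpha_{k}u_{k}\overset{a.s.}{=}0$ implies $\mathbb{E}[(\sum_{k\in[K]}\alpha_{k}u_{k})^{2}]=0$ . Because $\{u_{k}\}_{k\in[K]}$ and $\{v_{k}\}_{k\in[K]}$ share the same second-moment matrix, expanding the square gives $\mathbb{E}[(\sum_{k\in[K]}\alpha_{k}v_{k})^{2}]=\sum_{i,j\in[K]}\alpha_{i}\alpha_{j}\mathbb{E}[v_{i}v_{j}]=\sum_{i,j\in[K]}\alpha_{i}\alpha_{j}\mathbb{E}[u_{i}u_{j}]=0$ , and a random variable with zero second moment is zero almost surely. The converse follows by exchanging the roles of $u$ and $v$ . ∎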
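Lemma C.1 can be illustrated numerically. In the sketch below, two different families of random variables are built from independent Gaussian samples so that they share the same second-moment matrix $AA^{\top}$; a coefficient vector in the null space of that matrix then annihilates both linear combinations. The construction via a matrix `A` and an orthogonal matrix `Q` is an illustrative assumption and is not part of the proof.

```python
import numpy as np

rng = np.random.default_rng(0)
K, r = 4, 3
A = rng.normal(size=(K, r))                    # u = A g has second-moment matrix A A^T (rank 3)
Q, _ = np.linalg.qr(rng.normal(size=(r, r)))   # orthogonal, so (A Q)(A Q)^T = A A^T

gram = A @ A.T
eigvals, eigvecs = np.linalg.eigh(gram)        # ascending eigenvalues
alpha = eigvecs[:, 0]                          # eigenvector for the (numerically) zero eigenvalue

u = A @ rng.normal(size=(r, 10_000))           # samples of (u_1, ..., u_K)
v = (A @ Q) @ rng.normal(size=(r, 10_000))     # different variables, same second moments

second_moment = alpha @ gram @ alpha           # E[(sum_k alpha_k u_k)^2] = alpha^T (A A^T) alpha
max_u = np.abs(alpha @ u).max()
max_v = np.abs(alpha @ v).max()
```

All three quantities vanish up to floating-point error, even though `u` and `v` are generated from independent samples.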

Lemma C.2.

Suppose any level set of $\phi:\mathbb{R}\to\mathbb{R}$ is countable and $g_{1},\ldots,g_{K}$ are jointly non-degenerate Gaussian. If $\mathbb{P}\big{(}\sum_{i}a_{i}\phi(g_{i})=C\big{)}>0$ where $C$ is a constant, then $a_{i}=0$ for all $i$ and $C=0$ .

Proof.

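Conditioned on $g_{2},\ldots,g_{K}$ , the variable $g_{1}$ is a non-degenerate Gaussian, so the event $\{\mathbb{P}\big{(}\sum_{i}a_{i}\phi(g_{i})=C|g_{2},\ldots,g_{K}\big{)}>0\}$ can have positive probability only if $a_{1}=0$ ; otherwise $g_{1}$ would fall in a single level set of $\phi$ , which is countable, with positive conditional probability. Applying the same reasoning inductively to $g_{2},\ldots,g_{K}$ gives $a_{i}=0$ for all $i\in[K]$ , and hence $C=0$ . ∎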
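The proof relies on the fact that each coordinate of a jointly non-degenerate Gaussian vector remains non-degenerate after conditioning on the others. For a centered Gaussian with covariance matrix `cov`, the conditional variance is the Schur complement, which the following sketch computes; the helper name and the toy covariance are illustrative.

```python
import numpy as np

def conditional_variance(cov, k=0):
    """Variance of g_k given all other coordinates of a Gaussian with covariance `cov`."""
    rest = [i for i in range(cov.shape[0]) if i != k]
    c_kr = cov[k, rest]                 # covariances between g_k and the rest
    c_rr = cov[np.ix_(rest, rest)]      # covariance of the rest
    # Schur complement: cov_kk - c_kr^T c_rr^{-1} c_kr
    return float(cov[k, k] - c_kr @ np.linalg.solve(c_rr, c_kr))

cov = np.array([[2.0, 1.0],
                [1.0, 2.0]])
var_given_rest = conditional_variance(cov)   # 2 - 1 * (1/2) * 1 = 1.5
```

When `cov` is positive definite, this quantity equals the reciprocal of the corresponding diagonal entry of the inverse covariance and is therefore strictly positive, which is what makes $g_{1}|g_{2},\ldots,g_{K}$ a non-degenerate univariate Gaussian.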

Lemma C.3.

Suppose $\phi$ satisfies Assumption 4.3. Moreover, suppose $g_{1},\ldots,g_{K}$ are jointly non-degenerate Gaussian. If $\big{(}c_{1}+\sum_{i}a_{i}\phi(g_{i})\big{)}\cdot\big{(}c_{2}+\sum_{i}b_{i}\phi^{\prime}(g_{i})\big{)}=C$ where $c_{1},c_{2},C$ are constants, then $C=0$ and either $a_{i}=0$ for all $i\in[K]$ or $b_{i}=0$ for all $i\in[K]$ .

Remark C.4.

Considering the function tail, it is easy to prove that the Sigmoid function $\sigma(x)=\frac{1}{1+\exp(-x)}$ , the smoothed ReLU function $\overline{\text{ReLU}}(x)=\log(1+\exp(x))=\int\sigma(x)dx$ , and the SiLU (Sigmoid Linear Unit) function, defined as $\text{SiLU}(x)=x\cdot\sigma(x)$ , all satisfy these assumptions. Notably, SiLU is employed in state-of-the-art open-source foundation models (Touvron et al., 2023a, b).
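For reference, the three activations named in this remark can be written down directly. The sketch below defines them in NumPy and checks by central finite differences that the derivative of the smoothed ReLU is the Sigmoid, matching the integral relation above, and that the closed-form SiLU derivative is correct. It does not verify Assumption 4.3 itself; the grid and step size are arbitrary choices.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softplus(x):
    """Smoothed ReLU, log(1 + exp(x)), evaluated in a numerically stable way."""
    return np.logaddexp(0.0, x)

def silu(x):
    return x * sigmoid(x)

def silu_prime(x):
    """Closed-form derivative of SiLU: sigma(x) + x * sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s + x * s * (1.0 - s)

x = np.linspace(-5.0, 5.0, 201)
eps = 1e-5
softplus_fd = (softplus(x + eps) - softplus(x - eps)) / (2 * eps)
silu_fd = (silu(x + eps) - silu(x - eps)) / (2 * eps)
```

Here `softplus_fd` agrees with `sigmoid(x)` and `silu_fd` agrees with `silu_prime(x)` up to finite-difference error.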

Proof.

We first prove that $C=0$ . Condition on the random variables $g_{2},\ldots,g_{K}$ and denote $g:=g_{1}\mid(g_{2},\ldots,g_{K})$ . Then $g$ is a non-degenerate univariate Gaussian.

Case 1: Suppose $C\neq 0$ . In this scenario, neither $a_{1}$ nor $b_{1}$ can be zero; otherwise $\phi$ (or $\phi^{\prime}$ ) would have an uncountable level set, contradicting our assumptions. Given that $(c_{1}^{\prime}+a_{1}\phi(g))(c_{2}^{\prime}+b_{1}\phi^{\prime}(g))=C$ almost surely, we may rewrite:

\displaystyle\bigg{(}\frac{c_{1}^{\prime}}{a_{1}}+\phi(g)\bigg{)}\bigg{(}\frac{c_{2}^{\prime}}{b_{1}}+\phi^{\prime}(g)\bigg{)}=\frac{C}{a_{1}b_{1}}.

Here $c_{1}^{\prime},c_{2}^{\prime}$ are constants (absorbing the conditioning on $g_{2},\dots,g_{K}$ ). But this implies that for almost all $x\in\mathbb{R}$ , $\big{(}\frac{c_{1}^{\prime}}{a_{1}}+\phi(x)\big{)}\big{(}\frac{c_{2}^{\prime}}{b_{1}}+\phi^{\prime}(x)\big{)}=\frac{C}{a_{1}b_{1}}$ , which contradicts the assumption on the activation function. Therefore $C$ must be zero.

Case 2: Now consider $C=0$ . Then at least one of the following holds with positive probability:

\displaystyle c_{1}+\sum_{i}a_{i}\phi(g_{i})=0\quad\text{or}\quad c_{2}+\sum_{i}b_{i}\phi^{\prime}(g_{i})=0.

In either case, applying Lemma C.2 (which crucially uses the fact that $\phi$ has at most countable level sets, forcing the sum to avoid being constant on any uncountable domain with positive probability unless all involved coefficients vanish) completes the proof of zeroing out the corresponding coefficients. Concretely, if

\displaystyle\mathbb{P}\bigg{(}\sum_{i}a_{i}\phi(g_{i})=C\Big{|}g_{2},\ldots,g_{K}\bigg{)}>0

then for those realizations we view $g_{1}$ (conditioned on $g_{2},\dots,g_{K}$ ) as a non-degenerate univariate Gaussian. Holding $g_{2},\dots,g_{K}$ fixed, the only way $\sum_{i}a_{i}\phi(g_{i})$ can remain constant over a positive-measure set of $g_{1}$ values is if $a_{1}=0$ . Repeating this argument inductively for $g_{2},g_{3},\dots$ shows that $a_{i}=0$ for all $i\in[K]$ ; the same argument applied to $\phi^{\prime}$ handles the coefficients $b_{i}$ . Therefore, either all $a_{i}$ vanish or all $b_{i}$ vanish, completing the proof of this lemma. ∎

With these lemmas at hand, we now prove our main theorem by an inductive argument. In particular, we show that the following two families of Gaussian processes, introduced in Section 4, remain non-degenerate throughout training:

	$\displaystyle\{\widehat{Z}^{W_{0}^{l}\delta x_{s}^{l-1}(\xi_{i})}\}_{i\in[m],s\in[t],2\leq l\leq L},$		(C.1)
	$\displaystyle\{\widehat{Z}^{W_{0}^{l\top}dh_{s}^{l}(\xi_{i})}\}_{i\in[m],s\in[t],2\leq l\leq L}.$		(C.2)

Recall that a Gaussian process is non-degenerate if its covariance matrix $C$ at any finite collection of points satisfies $\mathrm{det}(C)\neq 0$ (Adler and Taylor, 2009). Using the filtration framework introduced in Section 4, our proof follows the natural flow of computation in the network, proceeding layer by layer and separately handling forward and backward passes. We break this into four key steps, each building upon the results of previous steps:

•

Step 1: prove non-degeneracy for the features in the first hidden layer $\widehat{Z}^{W_{0}^{2}\delta x_{s}^{1}(\xi_{i})}$ . This forms our base case as it only depends on the input data and network initialization, providing the foundation for our inductive argument.
•

Step 2: prove non-degeneracy for the features in remaining layers $\widehat{Z}^{W_{0}^{l}\delta x_{s}^{l-1}(\xi_{i})}$ , $3\leq l\leq L$ . This step leverages the non-degeneracy established in Step 1 and shows how it propagates through deeper layers of the network.
•

Step 3: prove non-degeneracy for the gradients in the last layer $\widehat{Z}^{W_{0}^{L\top}dh_{s}^{L}(\xi_{i})}$ . Here we transition from analyzing forward features to backward gradients, showing how the established feature properties ensure meaningful gradient flow.
•

Step 4: prove non-degeneracy for the gradients in remaining layers $\widehat{Z}^{W_{0}^{l\top}dh_{s}^{l}(\xi_{i})}$ , $2\leq l\leq L-1$ . Finally, we complete our analysis by showing how gradient non-degeneracy propagates backward through the network, ensuring effective training dynamics at all layers.

The proof proceeds by induction on the time step $t$ , where at each step we verify that these properties hold across all layers. This structure lets us establish the global non-degeneracy property by tracking local changes at each layer and time step as information flows forward and backward through the network during training. We now proceed with the detailed proof.

Proof of Theorem 4.5.

Considering Trajectory Until Error Signals Vanish. Throughout this proof, we focus on the training trajectory up to the time when all error signals $\mathring{\chi}_{t,i}$ become zero. This is because once the error signals vanish, there are no further parameter updates, and the training dynamics remain static thereafter. Our analysis ensures that up to this point, the Gaussian processes governing the feature and gradient updates remain non-degenerate, thereby maintaining the linear independence of features across all layers.

Connecting $\widehat{Z}^{W\delta x}$ to $h^{l}$ and $x^{l}$ . Recall from Section 3 that each pre-activation $h^{l}(\xi)$ and post-activation $x^{l}(\xi)$ can be decomposed into a primary Gaussian increment plus lower-order (history-dependent) terms in the infinite-width limit.

Because these additional terms do not alter the essential covariance structure when conditioned on past information (they vanish or become deterministic in the limit), the linear (in)dependence of $\{h^{l}(\xi)\}$ or $\{x^{l}(\xi)\}$ is governed by the non-degeneracy of $\{\widehat{Z}^{W_{0}^{l}\,\delta x_{s}^{l-1}(\xi)}\}$ . Hence, showing that $\{\widehat{Z}^{W_{0}^{l}\,\delta x_{s}^{l-1}(\xi)}\}$ remain non-degenerate under conditioning on historical variables directly implies that $\{h^{l}(\xi)\}$ and $\{x^{l}(\xi)\}$ cannot collapse into a linearly dependent set.

Below, we provide an inductive argument to establish precisely this non-degeneracy at each step.

By definition when $t=0$ , $\{\widehat{Z}^{W_{0}^{l}\delta x_{0}^{l-1}(\xi_{i})}\},i\in[m],2\leq l\leq L$ are independent and therefore non-degenerate Gaussian.

Now assume that the random Gaussian Process features defined in (C.1) and (C.2) are non-degenerate at time $t$ , specifically for

\displaystyle\{\widehat{Z}^{W_{0}^{l}\delta x_{s}^{l-1}(\xi_{i})}\}_{i\in[m],s\in[t]},\{\widehat{Z}^{W_{0}^{l\top}dh_{s}^{l}(\xi_{i})}\}_{i\in[m],s\in[t]}

where layer $2\leq l\leq L$ .

Step 1: We first prove that $\{\widehat{Z}^{W_{0}^{2}\delta x_{s}^{1}(\xi_{i})}\}_{i\in[m],s\in[t+1]}$ is non-degenerate. Suppose, for contradiction, that there exist $\{\lambda_{i,s}\}_{i\in[m],s\in[t+1]}$ , not all zero, such that

\displaystyle\sum_{i\in[m],s\in[t+1]}\lambda_{i,s}\widehat{Z}^{W^{2}_{0}\delta x^{1}_{s}(\xi_{i})}\overset{\text{a.s.}}{=}0.

Since $\{\widehat{Z}^{W_{0}^{2}\delta x_{s}^{1}(\xi_{i})}\}_{i\in[m],s\in[t]}$ are non-degenerate, we conclude that $\{\lambda_{i,t+1}\}_{i\in[m]}$ are not all zero. Taking the second moment, we have

\displaystyle\mathbb{E}\bigg{[}\bigg{(}\sum_{i\in[m],s\in[t+1]}\lambda_{i,s}\widehat{Z}^{W_{0}^{2}\delta x^{1}_{s}(\xi_{i})}\bigg{)}^{2}\bigg{]}=0.

Because $\{\widehat{Z}^{W^{2}_{0}\delta x^{1}_{s}(\xi_{i})}\}$ shares the same covariance matrix as $\{Z^{\delta x^{1}_{s}(\xi_{i})}\}$ , Lemma C.1 gives

\displaystyle\mathbb{E}\bigg{[}\bigg{(}\sum_{i\in[m],s\in[t+1]}\lambda_{i,s}Z^{\delta x^{1}_{s}(\xi_{i})}\bigg{)}^{2}\bigg{]}=0\Rightarrow\sum_{i\in[m],s\in[t+1]}\lambda_{i,s}Z^{\delta x^{1}_{s}(\xi_{i})}\overset{a.s.}{=}0.

Therefore, by definition, we have

\displaystyle\sum_{i\in[m],s\in[t+1]}\lambda_{i,s}\Big{(}\phi(Z^{h^{1}_{s}(\xi_{i})})-\phi(Z^{h^{1}_{s-1}(\xi_{i})})\Big{)}\overset{a.s.}{=}0,

(C.3)

where $Z^{h^{1}_{s}(\xi_{i})}$ satisfies that

\displaystyle Z^{h^{1}_{s}(\xi_{i})}=Z^{h^{1}_{0}(\xi_{i})}-\sum_{j\in[m]}\eta\mathring{\chi}_{0,j}\xi_{j}^{\top}\xi_{i}Z^{dh^{1}_{0}(\xi_{j})}-\cdots-\sum_{j\in[m]}\eta\mathring{\chi}_{s-1,j}\xi_{j}^{\top}\xi_{i}Z^{dh^{1}_{s-1}(\xi_{j})}.

Plugging (3.6) and (3.7) into the above equation further gives the following reformulation of $Z^{h^{1}_{s}(\xi_{i})}$ :

\displaystyle Z^{h^{1}_{s}(\xi_{i})}=\Delta_{s}(\xi_{i})-\sum_{j\in[m]}\eta\mathring{\chi}_{s,j}\xi_{j}^{\top}\xi_{i}\phi^{\prime}(Z^{h_{s-1}^{1}(\xi_{j})})\widehat{Z}^{{W^{2}_{0}}^{\top}d{h}^{2}_{s}(\xi_{j})},

(C.4)

where $\Delta_{s}(\xi_{i})\in\mathcal{G}_{s-1}$ is a random variable. Notice that $Z^{h_{s-1}^{1}(\xi_{j})}\in\mathcal{G}_{s-1}$ and $Z^{h_{s}^{1}(\xi_{j})}\in\mathcal{G}_{s}$ .

At least one of the $\{\mathring{\chi}_{s,j}\}_{j\in[m]}$ is nonzero; without loss of generality, assume $\mathring{\chi}_{s,k}\not=0$ . By the induction hypothesis, the non-degeneracy property holds at time $t$ . Therefore, $Z^{{W^{2}_{0}}^{\top}d{h}^{2}_{s}(\xi_{k})}$ conditioned on $\mathcal{G}_{t-1}\cup\{Z^{{W^{2}_{0}}^{\top}d{h}^{2}_{s}(\xi_{j})}\}_{j\not=k}$ is a non-degenerate Gaussian.

Plugging (C.4) into (C.3) and then conditioning on $\mathcal{G}_{t-1}\cup\{Z^{{W^{2}_{0}}^{\top}d{h}^{2}_{s}(\xi_{j})}\}_{j\not=k}$ gives

\displaystyle\sum_{i\in[m]}\lambda_{i,t+1}\phi(\xi_{k}^{\top}\xi_{i}Z^{U}+c_{i})=C

where $c_{i}$ and $C$ are constants and $Z^{U}$ is a non-degenerate univariate Gaussian random variable. Since $\phi$ meets the conditions in Assumption 4.3 and the dataset fulfills Assumption 4.1, which ensure that the inner products and level sets behave as required, we can conclude that $\lambda_{i,t+1}=0$ for all $i\in[m]$ , a contradiction. Therefore, $\{\widehat{Z}^{W_{0}^{2}\delta x_{s}^{1}(\xi_{i})}\}_{i\in[m],s\in[t+1]}$ is indeed non-degenerate.
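The final step of Step 1 says that the functions $z\mapsto\phi(\xi_{k}^{\top}\xi_{i}z+c_{i})$ , together with the constant function, admit no nontrivial linear relation. A numerical sanity check of this kind of statement (not a proof) is to evaluate these functions on a grid and inspect the rank of the resulting matrix. In the sketch below, SiLU is used for $\phi$, and the slopes `a` (standing in for $\xi_{k}^{\top}\xi_{i}$ ) and offsets `c` are hypothetical values chosen for illustration.

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))

def design_matrix(a, c, z, phi=silu):
    """Columns phi(a_i * z + c_i) evaluated on the grid z, plus a constant column."""
    cols = [phi(a_i * z + c_i) for a_i, c_i in zip(a, c)]
    return np.column_stack(cols + [np.ones_like(z)])

z = np.linspace(-4.0, 4.0, 400)
a = np.array([0.5, 1.0, 2.0])     # hypothetical inner products xi_k^T xi_i
c = np.array([0.1, -0.3, 0.7])    # hypothetical history-dependent offsets

rank_generic = np.linalg.matrix_rank(design_matrix(a, c, z))

# If one inner product is zero, the corresponding column is constant and a
# nontrivial linear relation with the constant column appears.
a_zero = np.array([0.0, 1.0, 2.0])
rank_degenerate = np.linalg.matrix_rank(design_matrix(a_zero, c, z))
```

With distinct nonzero slopes the matrix has full column rank, so only the trivial coefficients satisfy the relation on the grid. The rank drops when a slope is zero, which suggests why a condition on the inner products of the data is needed alongside the condition on $\phi$ .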

Step 2: We prove the following is non-degenerate.

\displaystyle\{\widehat{Z}^{W_{0}^{l}\delta x_{s}^{l-1}(\xi_{i})}\}_{i\in[m],s\in[t+1]},l\geq 3.

Suppose, for contradiction, that there exist $\{\lambda_{i,s}\}_{i\in[m],s\in[t+1]}$ , not all zero, such that

\displaystyle\sum_{i\in[m],s\in[t+1]}\lambda_{i,s}\widehat{Z}^{W^{l}_{0}\delta x^{l-1}_{s}(\xi_{i})}\overset{\text{a.s.}}{=}0.

Since $\{\widehat{Z}^{W_{0}^{l}\delta x_{s}^{l-1}(\xi_{i})}\}_{i\in[m],s\in[t]}$ are non-degenerate, we conclude that $\{\lambda_{i,t+1}\}_{i\in[m]}$ are not all zero. Taking the second moment, we have

\displaystyle\mathbb{E}\bigg{[}\bigg{(}\sum_{i\in[m],s\in[t+1]}\lambda_{i,s}\widehat{Z}^{W_{0}^{l}\delta x^{l-1}_{s}(\xi_{i})}\bigg{)}^{2}\bigg{]}=0.

Because $\{\widehat{Z}^{W^{l}_{0}\delta x^{l-1}_{s}(\xi_{i})}\}$ shares the same covariance matrix as $\{Z^{\delta x^{l-1}_{s}(\xi_{i})}\}$ , Lemma C.1 gives

\displaystyle\mathbb{E}\bigg{[}\bigg{(}\sum_{i\in[m],s\in[t+1]}\lambda_{i,s}Z^{\delta x^{l-1}_{s}(\xi_{i})}\bigg{)}^{2}\bigg{]}=0\Rightarrow\sum_{i\in[m],s\in[t+1]}\lambda_{i,s}Z^{\delta x^{l-1}_{s}(\xi_{i})}\overset{a.s.}{=}0.

Therefore, by definition, we have

\displaystyle\sum_{i\in[m],s\in[t+1]}\lambda_{i,s}\big{[}\phi(Z^{h^{l-1}_{s}(\xi_{i})})-\phi(Z^{h^{l-1}_{s-1}(\xi_{i})})\big{]}\overset{a.s.}{=}0,

(C.5)

where $Z^{h^{l-1}_{s}}$ satisfies that

\displaystyle Z^{h^{l-1}_{s}(\xi_{i})}=Z^{h^{l-1}_{0}(\xi_{i})}+Z^{\delta h^{l-1}_{1}(\xi_{i})}+\cdots+Z^{\delta h^{l-1}_{s}(\xi_{i})}.

A reformulation of the above update rule further gives that,

\displaystyle Z^{h^{l-1}_{s}(\xi_{i})}=\Delta_{s}(\xi_{i})+\widehat{Z}^{W_{0}^{l-1}\delta x^{l-2}_{s}(\xi_{i})},

(C.6)

where $\Delta_{s}(\xi_{i})\in\mathcal{F}_{s-1}$ is a random variable. Notice that $\widehat{Z}^{W_{0}^{l-1}\delta x^{l-2}_{s}(\xi_{i})},Z^{h^{l-1}_{s}(\xi_{i})}\in\mathcal{F}_{s}$. Arbitrarily pick an index $k$. Because the induction hypothesis gives the non-degenerate property at time $t$ for all layers, and we have already proved the non-degenerate property at time $t+1$ for layer $l-1$, conditioning (C.5) on $\sigma\big{(}\mathcal{F}_{t}\cup\{\widehat{Z}^{W_{0}^{l-1}\delta x^{l-2}_{t+1}(\xi_{j})}\}_{j\not=k}\big{)}$ gives

\displaystyle\lambda_{k,t+1}\phi(Z^{U_{k}}+c_{k})=C_{k},

where $U_{k}$ is the non-degenerate univariate Gaussian random variable $\widehat{Z}^{W_{0}^{l-1}\delta x^{l-2}_{t+1}(\xi_{k})}|\sigma\big{(}\mathcal{F}_{t}\cup\{\widehat{Z}^{W_{0}^{l-1}\delta x^{l-2}_{t+1}(\xi_{j})}\}_{j\not=k}\big{)}$, and $c_{k}$ and $C_{k}$ are constants. By Assumption 4.3 on the activation function, we know that $\lambda_{k,t+1}=0$ for arbitrary $k\in[m]$, which is a contradiction. Therefore, $\{\widehat{Z}^{W_{0}^{l}\delta x_{s}^{l-1}(\xi_{i})}\}_{i\in[m],s\in[t+1]}$ is indeed non-degenerate.
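The second-moment step in this argument rests on a standard fact: for centered jointly Gaussian variables with covariance matrix $\Sigma$, the second moment of $\sum_{i}\lambda_{i}Z_{i}$ equals $\lambda^{\top}\Sigma\lambda$, so a combination that vanishes almost surely corresponds to a null vector of $\Sigma$. The following is a minimal sketch of that identity. The two covariance matrices and the helper name `quad_form` are hypothetical and chosen only for illustration; they are not taken from the proof.

```python
def quad_form(cov, lam):
    """Return lam^T cov lam, the second moment of sum_i lam[i] * Z_i for centered Z."""
    n = len(lam)
    return sum(lam[i] * cov[i][j] * lam[j] for i in range(n) for j in range(n))

# Hypothetical degenerate family: Z_3 = Z_1 + Z_2 with Z_1, Z_2 independent N(0, 1).
cov_degenerate = [
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 1.0],
    [1.0, 1.0, 2.0],
]
# Hypothetical non-degenerate family: three independent N(0, 1) variables.
cov_independent = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]

lam = [1.0, 1.0, -1.0]  # encodes the combination Z_1 + Z_2 - Z_3
second_moment_degenerate = quad_form(cov_degenerate, lam)
second_moment_independent = quad_form(cov_independent, lam)
```

In the degenerate family the second moment vanishes, while in the independent family the same coefficients give a strictly positive value, which is the dichotomy the proof exploits.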

Step 3: We prove the following gradients are non-degenerate.

\displaystyle\{\widehat{Z}^{W_{0}^{L\top}dh_{s}^{L}(\xi_{i})}\}_{i\in[m],s\in[t+1]}.

Suppose there exist $\{\lambda_{i,s}\}_{i\in[m],s\in[t+1]}$, not all zero, such that

\displaystyle\sum_{i\in[m],s\in[t+1]}\lambda_{i,s}\widehat{Z}^{W_{0}^{L\top}dh_{s}^{L}(\xi_{i})}\overset{\text{a.s.}}{=}0.

Since $\{\widehat{Z}^{W_{0}^{L\top}dh_{s}^{L}(\xi_{i})}\}_{i\in[m],s\in[t]}$ are non-degenerate, we conclude that $\{\lambda_{i,t+1}\}_{i\in[m]}$ are not all zero. Taking the second moment, we have

\displaystyle\mathbb{E}\bigg{[}\bigg{(}\sum_{i\in[m],s\in[t+1]}\lambda_{i,s}\widehat{Z}^{W_{0}^{L\top}dh_{s}^{L}(\xi_{i})}\bigg{)}^{2}\bigg{]}=0.

Because $\{\widehat{Z}^{W_{0}^{L\top}dh_{s}^{L}(\xi_{i})}\}$ shares the same covariance matrix as $\{Z^{dh_{s}^{L}(\xi_{i})}\}$, by Lemma C.1 we have

\displaystyle\mathbb{E}\bigg{[}\bigg{(}\sum_{i\in[m],s\in[t+1]}\lambda_{i,s}Z^{dh_{s}^{L}(\xi_{i})}\bigg{)}^{2}\bigg{]}=0\Rightarrow\sum_{i\in[m],s\in[t+1]}\lambda_{i,s}Z^{dh_{s}^{L}(\xi_{i})}\overset{\text{a.s.}}{=}0.

Therefore by definition we have that

\displaystyle\sum_{i\in[m],s\in[t+1]}\lambda_{i,s}Z^{dx_{s}^{L}(\xi_{i})}\phi^{\prime}(Z^{h_{s}^{L}(\xi_{i})})\overset{\text{a.s.}}{=}0,

(C.7)

where $Z^{dx_{s}^{L}(\xi)}$ satisfies

\displaystyle Z^{dx_{s}^{L}(\xi)}

\displaystyle=Z^{\widehat{W}_{s}^{L+1}}=Z^{\widehat{W}_{0}^{L+1}}-\eta\sum_{s^{\prime}=0}^{s-1}\sum_{i\in[m]}\mathring{\chi}_{s^{\prime},i}Z^{x_{s^{\prime}}^{L}(\xi_{i})}.

Reformulating the above update rule further gives

\displaystyle Z^{dx_{s}^{L}(\xi)}=\widetilde{\Delta}_{s}-\eta\sum_{i\in[m]}\mathring{\chi}_{s,i}\phi(Z^{h_{s}^{L}(\xi_{i})}),

\displaystyle Z^{h_{s}^{L}(\xi_{i})}\overset{(i)}{=}\Delta_{s}+\widehat{Z}^{W_{0}^{L}\delta x^{L-1}_{s}(\xi_{i})},

where $\Delta_{s},\widetilde{\Delta}_{s}\in\mathcal{F}_{s}$ and (i) is due to (C.6). Notice that $Z^{dx_{s}^{L}},Z^{h_{s}^{L}(\xi_{i})}\in\mathcal{F}_{s}$. In (C.7), only $\widehat{Z}^{W_{0}^{L}\delta x^{L-1}_{t+1}(\xi_{i})}\in\mathcal{F}_{t+1}$ provides new randomness. Because the induction hypothesis gives the non-degenerate property at time $t$ for all layers, and we have already proved the non-degenerate property of $\widehat{Z}^{W_{0}^{L}\delta x^{L-1}_{t+1}(\xi_{i})}$, conditioning (C.7) on $\mathcal{F}_{t}$ gives

\displaystyle\Big{(}C-\eta\sum_{i\in[m]}\mathring{\chi}_{t,i}\phi(Z^{U_{i}}+b_{i})\Big{)}\cdot\Big{(}\sum_{i\in[m],s\in[t+1]}\lambda_{i,t+1}\phi^{\prime}(Z^{U_{i}}+b_{i})\Big{)}=C^{\prime},

where $U_{i}=\widehat{Z}^{W_{0}^{L}\delta x^{L-1}_{t+1}(\xi_{i})}|\mathcal{F}_{t}$ and $b_{i},C,C^{\prime}$ are all constants. Since $\mathring{\chi}_{t,i}$ are not all zero, by Lemma C.3, we have that $\lambda_{i,t+1}=0$ for all $i\in[m]$, which is a contradiction. Therefore, $\{\widehat{Z}^{W_{0}^{L\top}dh_{s}^{L}(\xi_{i})}\}_{i\in[m],s\in[t+1]}$ is indeed non-degenerate.

Step 4: We prove the following gradients are non-degenerate.

\displaystyle\{\widehat{Z}^{W_{0}^{l\top}dh_{s}^{l}(\xi_{i})}\}_{i\in[m],s\in[t+1]},\quad 2\leq l\leq L-1.

Suppose there exist $\{\lambda_{i,s}\}_{i\in[m],s\in[t+1]}$, not all zero, such that

\displaystyle\sum_{i\in[m],s\in[t+1]}\lambda_{i,s}\widehat{Z}^{W_{0}^{l\top}dh_{s}^{l}(\xi_{i})}\overset{\text{a.s.}}{=}0.

Since $\{\widehat{Z}^{W_{0}^{l\top}dh_{s}^{l}(\xi_{i})}\}_{i\in[m],s\in[t]}$ are non-degenerate, we conclude that $\{\lambda_{i,t+1}\}_{i\in[m]}$ are not all zero. Taking the second moment, we have

\displaystyle\mathbb{E}\bigg{[}\bigg{(}\sum_{i\in[m],s\in[t+1]}\lambda_{i,s}\widehat{Z}^{W_{0}^{l\top}dh_{s}^{l}(\xi_{i})}\bigg{)}^{2}\bigg{]}=0.

Because $\{\widehat{Z}^{W_{0}^{l\top}dh_{s}^{l}(\xi_{i})}\}$ shares the same covariance matrix as $\{Z^{dh_{s}^{l}(\xi_{i})}\}$, by Lemma C.1 we have

\displaystyle\mathbb{E}\bigg{[}\bigg{(}\sum_{i\in[m],s\in[t+1]}\lambda_{i,s}Z^{dh_{s}^{l}(\xi_{i})}\bigg{)}^{2}\bigg{]}=0\Rightarrow\sum_{i\in[m],s\in[t+1]}\lambda_{i,s}Z^{dh_{s}^{l}(\xi_{i})}\overset{\text{a.s.}}{=}0.

Therefore by definition we have that

\displaystyle\sum_{i\in[m],s\in[t+1]}\lambda_{i,s}Z^{dx_{s}^{l}(\xi_{i})}\phi^{\prime}(Z^{h_{s}^{l}(\xi_{i})})\overset{\text{a.s.}}{=}0,

(C.8)

where $Z^{dx_{s}^{l}(\xi)}$ satisfies

\displaystyle Z^{dx_{s}^{l}(\xi_{i})}

\displaystyle=\widehat{Z}^{W_{0}^{l+1\top}dh_{s}^{l+1}(\xi_{i})}+G_{s}(\xi_{i}).

(C.9)

Similar to Steps 1 and 2, we have that $Z^{dx_{s}^{l}(\xi_{i})}\in\mathcal{G}_{s}$ and $Z^{h_{s}^{l}(\xi_{i})}\in\mathcal{G}_{s-1}$. Therefore, when (C.8) is conditioned on $\mathcal{G}_{t}$, only $\widehat{Z}^{W_{0}^{l+1\top}dh_{t+1}^{l+1}(\xi)}$ provides new randomness. Arbitrarily pick an index $j$. Because the induction hypothesis gives the non-degenerate property at time $t$ for all layers, and we have already proved the non-degenerate property at time $t+1$ for layer $l+1$, conditioning (C.8) on $\mathcal{G}_{t}\cup\{\widehat{Z}^{W_{0}^{l+1\top}dh_{t+1}^{l+1}(\xi_{i})}\}_{i\not=j}$ gives

\displaystyle\lambda_{j,t+1}c_{j}Z^{U_{j}}=C_{j},

where $U_{j}$ is a non-degenerate univariate Gaussian random variable, and $c_{j}$ and $C_{j}$ are constants. By Assumption 4.3 on the activation function, we know that $c_{j}\not=0$, which implies $\lambda_{j,t+1}=0$ for all $j\in[m]$, a contradiction. Therefore, $\{\widehat{Z}^{W_{0}^{l\top}dh_{s}^{l}(\xi_{i})}\}_{i\in[m],s\in[t+1]}$ is indeed non-degenerate.

∎

Proof of Corollary 4.6

Proof.

As stated in the main text, if the training parameters stop updating at time $T$ , then the training loss must be zero.

By Theorem 4.5, the training trajectory remains non-degenerate throughout training. Suppose, for contradiction, that at time $T$ the training loss is still nonzero for some sample $(\xi_{i},y_{i})$ . This implies that the error signal $\mathring{\chi}_{T,i}$ is nonzero. However, the non-degenerate trajectory ensures that a nonzero error signal $\mathring{\chi}_{T,i}$ would necessitate further parameter updates, contradicting the assumption that all parameter updates vanish at or after time $T$ . Therefore, the training loss must be zero for all samples at time $T$ , implying convergence to a global minimum. ∎
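The mechanism behind this argument can be illustrated with a finite-width analogue. The sketch below assumes an output-layer step of the form $-\eta\sum_{i}\chi_{i}x_{i}$, mirroring the update of $\widehat{W}^{L+1}$ in the proof of Theorem 4.5. When the features $x_{i}$ are linearly independent, this step vanishes only if every error signal $\chi_{i}$ is zero. The feature vectors, the error values, and the helper names `last_layer_update` and `squared_norm` are hypothetical.

```python
def last_layer_update(features, errors, lr):
    """Return -lr * sum_i errors[i] * features[i], a finite-width analogue of the output-layer step."""
    dim = len(features[0])
    return [-lr * sum(e * x[k] for e, x in zip(errors, features)) for k in range(dim)]

def squared_norm(vec):
    """Squared Euclidean norm of a list of floats."""
    return sum(v * v for v in vec)

# Hypothetical features for m = 2 samples at width 3; the two vectors are linearly independent.
features = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]

update_nonzero_error = last_layer_update(features, errors=[0.5, -0.25], lr=0.1)
update_zero_error = last_layer_update(features, errors=[0.0, 0.0], lr=0.1)
```

With a nonzero error vector the update has positive norm, so training cannot have stopped; only the all-zero error vector produces a vanishing update.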

Appendix D Activation Functions with the GOOD Property

We now verify that many practical activation functions, especially those with exponential tails, satisfy the GOOD property introduced in Definition 4.2. By “exponential tail,” we mean that as $|x|\to\infty$ , the function and/or its derivatives decay at least as fast as $e^{-c|x|}$ for some $c>0$ . Representative examples include the sigmoid, tanh, SiLU, and GeLU. Below, we restate the full definition of GOOD and then show how each requirement is met by these exponential-tail activations.

Definition D.1 (Restatement of Definition 4.2).

An activation function $\phi:\mathbb{R}\to\mathbb{R}$ is called GOOD if it satisfies the following two conditions:

(a)

Non-constant decomposition. For any finite set of parameters $\{a_{i}\},\{b_{i}\},\{c_{i}\}$ such that $\exists k\text{ with }a_{k}b_{k}\neq 0$ and $|b_{i}|\neq|b_{j}|$ for all $i\neq j$ , the function

\displaystyle f(x)=\sum_{i=1}^{m}a_{i}\phi(b_{i}x+c_{i})

(D.1)

is not a constant function.

(b)

Non-degenerate product with derivative. For any real numbers $r_{1},r_{2}$ , the product

\displaystyle\bigl{(}r_{1}+\phi(x)\bigr{)}\bigl{(}r_{2}+\phi^{\prime}(x)\bigr{)}

(D.2)

is not almost everywhere (a.e.) constant on $\mathbb{R}$ .
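The two conditions can be probed numerically before turning to the proofs. The sketch below evaluates the functions in (D.1) and (D.2) on a grid and reports how much they vary. A positive spread is only evidence of non-constancy, not a proof. The parameter choices and the helper names `spread_a` and `spread_b` are hypothetical. As a contrast, the identity activation violates (b) with $r_{2}=-1$, since then $r_{2}+\phi^{\prime}(x)=0$ everywhere.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_prime(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def spread_a(phi, a, b, c, xs):
    """Max minus min of f(x) = sum_i a_i * phi(b_i * x + c_i) over the grid xs."""
    vals = [sum(ai * phi(bi * x + ci) for ai, bi, ci in zip(a, b, c)) for x in xs]
    return max(vals) - min(vals)

def spread_b(phi, dphi, r1, r2, xs):
    """Max minus min of (r1 + phi(x)) * (r2 + phi'(x)) over the grid xs."""
    vals = [(r1 + phi(x)) * (r2 + dphi(x)) for x in xs]
    return max(vals) - min(vals)

xs = [k / 10.0 for k in range(-50, 51)]  # grid on [-5, 5]

# Hypothetical parameters with distinct |b_i| and some a_k * b_k != 0.
sig_a = spread_a(sigmoid, a=[1.0, -2.0], b=[1.0, 3.0], c=[0.0, 0.5], xs=xs)
sig_b = spread_b(sigmoid, sigmoid_prime, r1=0.0, r2=0.0, xs=xs)
# The identity activation fails (b): with r2 = -1 the product is identically zero.
lin_b = spread_b(lambda x: x, lambda x: 1.0, r1=2.0, r2=-1.0, xs=xs)
```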

Before analyzing each activation function in detail, we visualize these functions and their derivatives in Figure 5. These plots illustrate the key characteristics we will exploit in our proofs, particularly the exponential decay behavior in the tails. Note how most activation functions and their derivatives exhibit rapid decay as $|x|\to\infty$ , with ReLU serving as a contrasting example that grows linearly.

In the following subsections, we formally prove that these exponential-tail activations satisfy both conditions of Definition D.1.

D.1 Sigmoid and Tanh

Proposition D.2.

The sigmoid function $\sigma(x)=\frac{1}{1+\exp(-x)}$ satisfies both (a) and (b) in Definition D.1, hence is GOOD.

Proof.

We first prove condition (a). Without loss of generality, set $c_{i}=0$ , since the shifts do not affect the tail behavior of the activation function. Define $\Omega=\{i|a_{i}\neq 0\}$ , $A^{+}=\{i\in\Omega|b_{i}>0\}$ and $A^{-}=\{i\in\Omega|b_{i}<0\}$ . Let $i^{*}=\mathop{\mathrm{argmin}}_{i\in\Omega}|b_{i}|$ . If $b_{i^{*}}=0$ , we can redefine $\Omega\leftarrow\Omega\backslash\{i^{*}\}$ and $f\leftarrow f-a_{i^{*}}/2$ (recall that $\sigma(0)=1/2$ ) and repeat the argument. Thus we assume $b_{i^{*}}\neq 0$ without loss of generality.

We have:

\displaystyle f(x)

\displaystyle=\sum_{i\in A^{+}}a_{i}\sigma(b_{i}x)+\sum_{i\in A^{-}}a_{i}\sigma(b_{i}x)=\sum_{i\in A^{+}}a_{i}-\sum_{i\in A^{+}}a_{i}[1-\sigma(b_{i}x)]+\sum_{i\in A^{-}}a_{i}\sigma(b_{i}x).

For $b_{i^{*}}<0$ , we have:

\displaystyle|f(x)-\sum_{i\in A^{+}}a_{i}|

\displaystyle=\bigg{|}\sum_{i\in A^{+}}a_{i}[1-\sigma(b_{i}x)]-\sum_{i\in A^{-}}a_{i}\sigma(b_{i}x)\bigg{|}=|a_{i^{*}}|\sigma(b_{i^{*}}x)+O(\exp(-Bx)),

where $B>|b_{i^{*}}|$ . This dominant term cannot be cancelled unless $a_{i^{*}}=0$ .

For $b_{i^{*}}>0$ :

\displaystyle|f(x)-\sum_{i\in A^{+}}a_{i}|=\bigg{|}\sum_{i\in A^{+}}a_{i}[1-\sigma(b_{i}x)]-\sum_{i\in A^{-}}a_{i}\sigma(b_{i}x)\bigg{|}=|a_{i^{*}}|[1-\sigma(b_{i^{*}}x)]+O(\exp(-Bx)),

where $B>|b_{i^{*}}|$ . This shows $f(x)$ cannot be constant unless $a_{i^{*}}b_{i^{*}}=0$ , contradicting our assumption.
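The tail-dominance step can be checked numerically. The sketch below takes positive slopes only and compares $\sum_{i}a_{i}[1-\sigma(b_{i}x)]$ with the single term $a_{i^{*}}\exp(-b_{i^{*}}x)$ having the smallest slope. The ratio should approach $1$ as $x$ grows, showing that the faster-decaying terms cannot cancel the slowest one. The coefficients and the helper name `tail_ratio` are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def tail_ratio(a, b, x):
    """For positive slopes b, ratio of sum_i a_i * (1 - sigmoid(b_i x)) to its slowest-decaying term."""
    i_star = min(range(len(b)), key=lambda i: abs(b[i]))
    # Use 1 - sigmoid(t) = sigmoid(-t) to avoid cancellation for large t.
    total = sum(ai * sigmoid(-bi * x) for ai, bi in zip(a, b))
    leading = a[i_star] * math.exp(-b[i_star] * x)
    return total / leading

# Hypothetical coefficients; the smallest slope is b = 0.5 with a = 0.3.
a = [0.3, -5.0, 2.0]
b = [0.5, 1.0, 2.0]
ratios = [tail_ratio(a, b, x) for x in (10.0, 20.0, 40.0)]
```

Even though the other coefficients are much larger in magnitude, their contribution is exponentially suppressed relative to the term with the smallest $|b_{i}|$.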

For condition (b), we need to show $(r_{1}+\sigma(x))(r_{2}+\sigma^{\prime}(x))$ is not a.e. constant. Note $\sigma^{\prime}(x)=\sigma(x)(1-\sigma(x))$ has exponential decay as $|x|\to\infty$ . A direct computation shows:

\displaystyle(r_{1}+\sigma(x))(r_{2}+\sigma^{\prime}(x))

\displaystyle=r_{1}r_{2}+r_{2}\sigma(x)+(r_{1}+\sigma(x))\sigma(x)(1-\sigma(x))=r_{1}r_{2}+(r_{1}+r_{2})\sigma(x)+(1-r_{1})\sigma(x)^{2}-\sigma(x)^{3}.

(D.3)

The right-hand side of (D.3) is a cubic polynomial in $\sigma(x)$ whose leading coefficient is $-1$ . Since $\sigma$ is continuous and takes every value in $(0,1)$ , the product could be a.e. constant only if this polynomial were constant on $(0,1)$ , which is impossible for any $r_{1},r_{2}$ because the coefficient of $\sigma(x)^{3}$ never vanishes. Thus $(r_{1}+\sigma(x))(r_{2}+\sigma^{\prime}(x))$ cannot be constant almost everywhere. ∎
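As a sanity check on condition (b) for the sigmoid, write $u=\sigma(x)$ , so that $\sigma^{\prime}(x)=u(1-u)$ and the product becomes the polynomial $(r_{1}+u)(r_{2}+u(1-u))$ in $u$ . Multiplying out directly gives $r_{1}r_{2}+(r_{1}+r_{2})u+(1-r_{1})u^{2}-u^{3}$ . The sketch below compares the two forms and measures how much the product varies on a grid. The sampled values of $r_{1},r_{2}$ and the helper names are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def product(r1, r2, x):
    """(r1 + sigmoid(x)) * (r2 + sigmoid'(x)) with sigmoid' = s * (1 - s)."""
    s = sigmoid(x)
    return (r1 + s) * (r2 + s * (1.0 - s))

def cubic_in_u(r1, r2, u):
    """Expanded form of (r1 + u) * (r2 + u * (1 - u)) as a polynomial in u."""
    return r1 * r2 + (r1 + r2) * u + (1.0 - r1) * u**2 - u**3

def spread(r1, r2, xs):
    """Max minus min of the product over the grid xs."""
    vals = [product(r1, r2, x) for x in xs]
    return max(vals) - min(vals)

xs = [k / 10.0 for k in range(-50, 51)]  # grid on [-5, 5]
# Hypothetical test pairs, including the edge case r1 = r2 = 0.
pairs = [(0.0, 0.0), (1.0, 0.0), (-0.5, 2.0)]
max_gap = max(
    abs(product(r1, r2, x) - cubic_in_u(r1, r2, sigmoid(x)))
    for r1, r2 in pairs
    for x in xs
)
spreads = [spread(r1, r2, xs) for r1, r2 in pairs]
```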

Remark D.3.

Since $\tanh(x)=2\sigma(2x)-1$ is an affine transformation of the sigmoid function $\sigma$ , it inherits the same exponential-tail property and similarly meets both (a) and (b).
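The relation between the two activations is the standard identity $\tanh(x)=2\sigma(2x)-1$ , which a few lines of Python can confirm numerically. The helper name `tanh_via_sigmoid` and the grid are chosen here for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def tanh_via_sigmoid(x):
    """tanh expressed through the sigmoid: 2 * sigmoid(2x) - 1."""
    return 2.0 * sigmoid(2.0 * x) - 1.0

grid = [k / 4.0 for k in range(-40, 41)]  # grid on [-10, 10]
max_gap = max(abs(math.tanh(x) - tanh_via_sigmoid(x)) for x in grid)
```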

D.2 SiLU and GeLU

Proposition D.4.

The SiLU function $\mathrm{SiLU}(x)=x\sigma(x)$ is GOOD.

Proof.

Define $\Omega=\{i|a_{i}\neq 0\}$ , $A^{+}=\{i\in\Omega|b_{i}>0\}$ and $A^{-}=\{i\in\Omega|b_{i}<0\}$ . Let $i^{*}=\mathop{\mathrm{argmin}}_{i\in\Omega}|b_{i}|$ . Using similar reasoning as in the sigmoid case, we assume $b_{i^{*}}\neq 0$ without loss of generality.

We have:

\displaystyle f(x)

\displaystyle=\sum_{i\in A^{+}}a_{i}\phi(b_{i}x)+\sum_{i\in A^{-}}a_{i}\phi(b_{i}x)=\sum_{i\in A^{+}}a_{i}b_{i}x-\sum_{i\in A^{+}}a_{i}[b_{i}x-\phi(b_{i}x)]+\sum_{i\in A^{-}}a_{i}\phi(b_{i}x).

For $b_{i^{*}}<0$ , we have:

\displaystyle|f(x)-\sum_{i\in A^{+}}a_{i}b_{i}x|

\displaystyle=\bigg{|}\sum_{i\in A^{+}}a_{i}[b_{i}x-\phi(b_{i}x)]-\sum_{i\in A^{-}}a_{i}\phi(b_{i}x)\bigg{|}=\underbrace{|a_{i^{*}}|\phi(b_{i^{*}}x)}_{\mathrm{Majortail}}+O(x\exp(-Bx)),

where $B>|b_{i^{*}}|$ .

For $b_{i^{*}}>0$ , we have:

\displaystyle|f(x)-\sum_{i\in A^{+}}a_{i}b_{i}x|

\displaystyle=\bigg{|}\sum_{i\in A^{+}}a_{i}[b_{i}x-\phi(b_{i}x)]-\sum_{i\in A^{-}}a_{i}\phi(b_{i}x)\bigg{|}=\underbrace{|a_{i^{*}}|[b_{i^{*}}x-\phi(b_{i^{*}}x)]}_{\mathrm{Majortail}}+O(x\exp(-Bx)),

where $B>|b_{i^{*}}|$ . Note that $\mathrm{Majortail}$ is bounded by some constant and asymptotically $\mathrm{Majortail}=\Theta(x\exp(-|b_{i^{*}}|x))$ . Therefore, $f(x)$ is constant only if $\sum_{i\in A^{+}}a_{i}b_{i}=0$ and $a_{i^{*}}b_{i^{*}}=0$ , which contradicts our assumption.
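The asymptotic rate of the major tail can be checked directly. For $b>0$ we have $bx-\mathrm{SiLU}(bx)=bx\,\sigma(-bx)$ , and the sketch below compares this quantity with $bx\exp(-bx)$ . The ratio equals $\sigma(bx)$ and therefore tends to $1$ . The helper names `silu_gap` and `gap_ratio` and the sample slope are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def silu(x):
    return x * sigmoid(x)

def silu_gap(t):
    """t - SiLU(t), rewritten as t * sigmoid(-t) to avoid cancellation for large t."""
    return t * sigmoid(-t)

def gap_ratio(b, x):
    """Ratio of b*x - SiLU(b*x) to b*x*exp(-b*x) for b > 0 and x > 0."""
    return silu_gap(b * x) / (b * x * math.exp(-b * x))

# Hypothetical slope b = 0.5, evaluated at increasing x.
ratios = [gap_ratio(0.5, x) for x in (10.0, 30.0, 60.0)]
```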

For condition (b), we need to show $(r_{1}+x\sigma(x))(r_{2}+\sigma(x)+x\sigma^{\prime}(x))$ is not a.e. constant. Note that:

\displaystyle\phi^{\prime}(x)

\displaystyle=\sigma(x)+x\sigma^{\prime}(x)=\sigma(x)+x\sigma(x)(1-\sigma(x))

(D.4)

Then we have:

\displaystyle(r_{1}+x\sigma(x))(r_{2}+\sigma(x)+x\sigma(x)(1-\sigma(x)))

\displaystyle=r_{1}r_{2}+r_{1}\sigma(x)+r_{1}x\sigma(x)(1-\sigma(x))+r_{2}x\sigma(x)+x\sigma(x)^{2}+x^{2}\sigma(x)^{2}(1-\sigma(x))

(D.5)

Consider the tail in (D.5). If this expression were constant, the coefficient of $x^{2}$ term must vanish, which requires $\sigma(x)^{2}(1-\sigma(x))\equiv 0$ . However, this is impossible as $\sigma(x)\in(0,1)$ for all $x$ . Thus this product cannot be constant almost everywhere. ∎
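Both ingredients of this step can be checked numerically. The sketch below compares the derivative formula (D.4) with a central finite difference and then measures how much the product in condition (b) varies on a grid. The step size, the sampled pairs $(r_{1},r_{2})$ , and the helper names are hypothetical choices, and a positive spread is evidence rather than a proof.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def silu(x):
    return x * sigmoid(x)

def silu_prime(x):
    """Derivative from (D.4): sigmoid(x) + x * sigmoid(x) * (1 - sigmoid(x))."""
    s = sigmoid(x)
    return s + x * s * (1.0 - s)

def central_diff(f, x, h=1e-5):
    """Second-order central finite difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

def product_spread(r1, r2, xs):
    """Max minus min of (r1 + SiLU(x)) * (r2 + SiLU'(x)) over the grid xs."""
    vals = [(r1 + silu(x)) * (r2 + silu_prime(x)) for x in xs]
    return max(vals) - min(vals)

xs = [k / 10.0 for k in range(-50, 51)]  # grid on [-5, 5]
derivative_gap = max(abs(silu_prime(x) - central_diff(silu, x)) for x in (-3.0, 0.0, 2.0))
# Hypothetical pairs; (0, -1) removes the linear growth of the product as x grows.
spreads = [product_spread(r1, r2, xs) for r1, r2 in [(0.0, 0.0), (0.0, -1.0), (1.5, 0.3)]]
```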

Remark D.5.

GeLU, defined by $x\Phi(x)$ where $\Phi$ is the Gaussian CDF, similarly satisfies (a) and (b) because of its strong exponential decay. Specifically, as $|x|\to\infty$ , GeLU and its derivatives exhibit Gaussian-like decay $O(e^{-x^{2}/2})$ , which is even stronger than the exponential decay of sigmoid and SiLU.
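The Gaussian-type decay can be made concrete for the function itself. For $x>0$ , $x-\mathrm{GeLU}(x)=x(1-\Phi(x))$ , and the classical Mills-ratio bound $1-\Phi(x)\leq\varphi(x)/x$ , with $\varphi$ the standard normal density, gives $x-\mathrm{GeLU}(x)\leq e^{-x^{2}/2}/\sqrt{2\pi}$ . The sketch below checks this bound at a few points using the standard-library `math.erfc`. The helper names and sample points are hypothetical.

```python
import math

def gelu(x):
    """GeLU(x) = x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_gap(x):
    """x - GeLU(x) = x * (1 - Phi(x)), computed with erfc to stay accurate in the tail."""
    return 0.5 * x * math.erfc(x / math.sqrt(2.0))

def gaussian_density(x):
    """Standard normal density exp(-x^2 / 2) / sqrt(2 * pi)."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

points = (1.0, 2.0, 4.0, 8.0)
gaps = [gelu_gap(x) for x in points]
bounds = [gaussian_density(x) for x in points]
```

The ratio of the gap to the bound approaches $1$ as $x$ grows, so the Gaussian rate is attained up to a constant factor.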

Conclusion.

We have shown that key exponential-tail activations ( $\sigma$ , $\tanh$ , SiLU, GeLU) fulfill both (a) and (b) in Definition D.1, and hence are GOOD. These results rely crucially on the exponential decay properties of these functions, which ensure that scaled copies cannot combine to yield constant functions. This ensures rich, non-degenerate behavior in our infinite-width analysis under $\mu$ P scaling.

References

Adler and Taylor (2009) Adler, R. J. and Taylor, J. E. (2009). Random fields and geometry. Springer Science & Business Media.
Alemohammad et al. (2020) Alemohammad, S., Wang, Z., Balestriero, R. and Baraniuk, R. (2020). The recurrent neural tangent kernel. arXiv preprint arXiv:2006.10246 .
Allen-Zhu et al. (2019) Allen-Zhu, Z., Li, Y. and Song, Z. (2019). A convergence theory for deep learning via over-parameterization. In International Conference on Machine Learning.
Arora et al. (2019a) Arora, S., Du, S., Hu, W., Li, Z. and Wang, R. (2019a). Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. In International Conference on Machine Learning.
Arora et al. (2019b) Arora, S., Du, S. S., Hu, W., Li, Z., Salakhutdinov, R. and Wang, R. (2019b). On exact computation with an infinitely wide neural net. In Advances in Neural Information Processing Systems.
Bordelon and Pehlevan (2022) Bordelon, B. and Pehlevan, C. (2022). Self-consistent dynamical field theory of kernel evolution in wide neural networks. Advances in Neural Information Processing Systems 35 32240–32256.
Cao and Gu (2019) Cao, Y. and Gu, Q. (2019). Generalization bounds of stochastic gradient descent for wide and deep neural networks. In Advances in Neural Information Processing Systems.
Chen et al. (2020) Chen, Z., Cao, Y., Gu, Q. and Zhang, T. (2020). A generalized neural tangent kernel analysis for two-layer neural networks. Advances in Neural Information Processing Systems 33 13363–13373.
Chen et al. (2021) Chen, Z., Cao, Y., Zou, D. and Gu, Q. (2021). How much over-parameterization is sufficient to learn deep relu networks? In International Conference on Learning Representations.
Chizat and Bach (2018) Chizat, L. and Bach, F. (2018). On the global convergence of gradient descent for over-parameterized models using optimal transport. In Advances in neural information processing systems.
Du et al. (2019a) Du, S., Lee, J., Li, H., Wang, L. and Zhai, X. (2019a). Gradient descent finds global minima of deep neural networks. In International Conference on Machine Learning.
Du et al. (2019b) Du, S. S., Hou, K., Salakhutdinov, R. R., Poczos, B., Wang, R. and Xu, K. (2019b). Graph neural tangent kernel: Fusing graph neural networks with graph kernels. Advances in neural information processing systems 32.
Du et al. (2018) Du, S. S., Lee, J. D. and Tian, Y. (2018). When is a convolutional filter easy to learn? In International Conference on Learning Representations.
Du et al. (2019c) Du, S. S., Zhai, X., Poczos, B. and Singh, A. (2019c). Gradient descent provably optimizes over-parameterized neural networks. In International Conference on Learning Representations.
Fang et al. (2021) Fang, C., Lee, J., Yang, P. and Zhang, T. (2021). Modeling from features: a mean-field framework for over-parameterized deep neural networks. In Conference on learning theory. PMLR.
Geiger et al. (2020) Geiger, M., Spigler, S., Jacot, A. and Wyart, M. (2020). Disentangling feature and lazy training in deep neural networks. Journal of Statistical Mechanics: Theory and Experiment 2020 113301.
Hajjar et al. (2021) Hajjar, K., Chizat, L. and Giraud, C. (2021). Training integrable parameterizations of deep neural networks in the infinite-width limit. arXiv preprint arXiv:2110.15596 .
Hendrycks and Gimpel (2016) Hendrycks, D. and Gimpel, K. (2016). Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415 .
Hinton et al. (2012) Hinton, G., Deng, L., Yu, D., Dahl, G. E., Mohamed, A.-r., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P., Sainath, T. N. et al. (2012). Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine 29 82–97.
Hron et al. (2020) Hron, J., Bahri, Y., Sohl-Dickstein, J. and Novak, R. (2020). Infinite attention: Nngp and ntk for deep attention networks. In International Conference on Machine Learning. PMLR.
Jacot et al. (2018) Jacot, A., Gabriel, F. and Hongler, C. (2018). Neural tangent kernel: Convergence and generalization in neural networks. In Advances in neural information processing systems.
Krizhevsky et al. (2012) Krizhevsky, A., Sutskever, I. and Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems.
Lee et al. (2019) Lee, J., Xiao, L., Schoenholz, S. S., Bahri, Y., Sohl-Dickstein, J. and Pennington, J. (2019). Wide neural networks of any depth evolve as linear models under gradient descent. In Advances in Neural Information Processing Systems.
Littwin and Yang (2022) Littwin, E. and Yang, G. (2022). Adaptive optimization in the infinite-width limit. In The Eleventh International Conference on Learning Representations.
Mei et al. (2018) Mei, S., Montanari, A. and Nguyen, P.-M. (2018). A mean field view of the landscape of two-layer neural networks. Proceedings of the National Academy of Sciences 115 E7665–E7671.
Nguyen and Pham (2020) Nguyen, P.-M. and Pham, H. T. (2020). A rigorous framework for the mean field limit of multilayer neural networks. arXiv preprint arXiv:2001.11443 .
Nitanda et al. (2022) Nitanda, A., Wu, D. and Suzuki, T. (2022). Convex analysis of the mean field langevin dynamics. In International Conference on Artificial Intelligence and Statistics. PMLR.
Pham and Nguyen (2021) Pham, H. T. and Nguyen, P.-M. (2021). Global convergence of three-layer neural networks in the mean field regime. arXiv preprint arXiv:2105.05228 .
Rotskoff and Vanden-Eijnden (2018) Rotskoff, G. M. and Vanden-Eijnden, E. (2018). Neural networks as interacting particle systems: Asymptotic convexity of the loss landscape and universal scaling of the approximation error. arXiv preprint arXiv:1805.00915 .
Silver et al. (2016) Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M. et al. (2016). Mastering the game of go with deep neural networks and tree search. Nature 529 484–489.
Sirignano and Spiliopoulos (2018) Sirignano, J. and Spiliopoulos, K. (2018). Mean field analysis of neural networks: A central limit theorem. arXiv preprint arXiv:1808.09372 .
Touvron et al. (2023a) Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F. et al. (2023a). Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 .
Touvron et al. (2023b) Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S. et al. (2023b). Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 .
Woodworth et al. (2020) Woodworth, B., Gunasekar, S., Lee, J. D., Moroshko, E., Savarese, P., Golan, I., Soudry, D. and Srebro, N. (2020). Kernel and rich regimes in overparametrized models. In Conference on Learning Theory. PMLR.
Yang (2019a) Yang, G. (2019a). Scaling limits of wide neural networks with weight sharing: Gaussian process behavior, gradient independence, and neural tangent kernel derivation. arXiv preprint arXiv:1902.04760 .
Yang (2019b) Yang, G. (2019b). Wide feedforward or recurrent neural networks of any architecture are gaussian processes. Advances in Neural Information Processing Systems 32.
Yang (2020a) Yang, G. (2020a). Tensor programs ii: Neural tangent kernel for any architecture. arXiv preprint arXiv:2006.14548 .
Yang (2020b) Yang, G. (2020b). Tensor programs iii: Neural matrix laws. arXiv preprint arXiv:2009.10685 .
Yang et al. (2021) Yang, G., Hu, E., Babuschkin, I., Sidor, S., Liu, X., Farhi, D., Ryder, N., Pachocki, J., Chen, W. and Gao, J. (2021). Tuning large neural networks via zero-shot hyperparameter transfer. Advances in Neural Information Processing Systems 34 17084–17097.
Yang and Hu (2020) Yang, G. and Hu, E. J. (2020). Feature learning in infinite-width neural networks. arXiv preprint arXiv:2011.14522 .
Yang and Hu (2021) Yang, G. and Hu, E. J. (2021). Tensor programs iv: Feature learning in infinite-width neural networks. In International Conference on Machine Learning. PMLR.
Yang et al. (2023a) Yang, G., Simon, J. B. and Bernstein, J. (2023a). A spectral condition for feature learning. arXiv preprint arXiv:2310.17813 .
Yang et al. (2023b) Yang, G., Yu, D., Zhu, C. and Hayou, S. (2023b). Tensor programs vi: Feature learning in infinite-depth neural networks. arXiv preprint arXiv:2310.02244 .
Zou et al. (2019) Zou, D., Cao, Y., Zhou, D. and Gu, Q. (2019). Gradient descent optimizes over-parameterized deep ReLU networks. Machine Learning.
Zou and Gu (2019) Zou, D. and Gu, Q. (2019). An improved analysis of training over-parameterized deep neural networks. In Advances in Neural Information Processing Systems.
