
Global Convergence and Rich Feature Learning in $L$-Layer Infinite-Width Neural Networks under $\mu$P Parametrization

Zixiang Chen (equal contribution; Department of Computer Science, University of California, Los Angeles; chenzx19@cs.ucla.edu), Greg Yang (equal contribution; xAI), Qingyue Zhao (Department of Computer Science, University of California, Los Angeles; zhaoqy24@cs.ucla.edu), Quanquan Gu (Department of Computer Science, University of California, Los Angeles; qgu@cs.ucla.edu)

Abstract

Despite deep neural networks’ powerful representation learning capabilities, theoretical understanding of how networks can simultaneously achieve meaningful feature learning and global convergence remains elusive. Existing approaches like the neural tangent kernel (NTK) are limited because features stay close to their initialization in this parametrization, leaving open questions about feature properties during substantial evolution. In this paper, we investigate the training dynamics of infinitely wide, $L$-layer neural networks using the tensor program (TP) framework. Specifically, we show that, when trained with stochastic gradient descent (SGD) under the Maximal Update parametrization ($\mu$P) and mild conditions on the activation function, SGD enables these networks to learn linearly independent features that substantially deviate from their initial values. This rich feature space captures relevant data information and ensures that any convergent point of the training process is a global minimum. Our analysis leverages both the interactions among features across layers and the properties of Gaussian random variables, providing new insights into deep representation learning. We further validate our theoretical findings through experiments on real-world datasets.

1 Introduction

Deep learning has achieved remarkable success in various machine learning tasks, from image classification (Krizhevsky et al., 2012) and speech recognition (Hinton et al., 2012) to game playing (Silver et al., 2016). Yet this empirical success has posed a significant theoretical challenge: how can we explain the effectiveness of neural networks given their non-convex optimization landscape and over-parameterized nature? Traditional optimization and learning theory frameworks struggle to provide satisfactory explanations. A breakthrough came with the study of infinite-width neural networks, where the network behavior can be precisely characterized in the limit of infinite width. This theoretical framework has spawned several important approaches to understanding neural networks, with the Neural Tangent Kernel (NTK) emerging as a prominent example.

Under the NTK parametrization (NTP) (Jacot et al., 2018), neural network training behaves like a linear model: the features learned during training in each layer remain essentially identical to those obtained from random initialization. Consequently, the training process of over-parameterized deep neural networks can be characterized by training linear models with random features (Lee et al., 2019; Arora et al., 2019b; Cao and Gu, 2019; Chen et al., 2021). Since random features are linearly independent, global convergence can be proved for wide neural networks trained using (stochastic) gradient descent (GD/SGD) (Du et al., 2019c; Allen-Zhu et al., 2019; Du et al., 2019a; Zou et al., 2019; Zou and Gu, 2019). However, the NTK parametrization has significant limitations, such as its inability to perform feature learning and transfer learning, which involve pretraining and fine-tuning. While NTK theory provides convergence results under infinite width, its inability to explain feature learning motivates us to ask:

Can deep neural networks simultaneously learn meaningful features and achieve global convergence?

In this paper, we show that deep neural networks can achieve both objectives through proper parametrization. While previous approaches like NTK and standard parametrization fail to perform meaningful feature learning, and mean field parametrization suffers from feature collapse in deep networks, we demonstrate that the $\mu$ parametrization (Yang and Hu, 2020, 2021; Yang et al., 2021; Yang, 2019a) enables both feature learning and global convergence. Specifically, working with $L$-layer neural networks under $\mu$P scaling, we prove that despite substantial feature evolution during training, the networks maintain linearly independent features in each layer when trained with stochastic gradient descent. As a consequence, if the training converges, it must converge to a global minimum. Our contributions are summarized as follows:

  • We establish that multilayer perceptrons (MLPs) under Maximal Update Parametrization ($\mu$P) learn linearly independent features that capture task-relevant information. The learned features substantially deviate from their initialization, demonstrating true feature learning rather than random feature approximation. This resolves a fundamental challenge in deep learning theory: characterizing feature properties that ensure global convergence while allowing meaningful feature learning.

  • Our proof technique analyzes neural network Gaussian processes by exploiting their second-order invariants across adjacent layers. These structural properties persist during training, which allows us to track the evolution of feature correlations. Through a careful inductive argument over network layers and iterations, we establish that when training converges, the linear independence of features ensures convergence to a global minimum. The proof reveals a deep connection between the feature learning dynamics and the structural properties of infinite-width neural networks.

  • Through experiments on classification tasks, we validate our theoretical findings by demonstrating that features maintain linear independence through analysis of covariance matrix properties. Our empirical results demonstrate $\mu$P’s unique capability to simultaneously achieve meaningful feature learning while preserving feature richness, as supported by non-vanishing eigenvalues as network widths increase. Through comparative analysis against other parametrization schemes, we show that this behavior robustly persists across different choices of activation functions, illustrating the practical implications of our theoretical results.

Notation. For any positive integer $N$, we use $[N]$ to denote the index set $\{1,\dots,N\}$. We use $\phi:\mathbb{R}\to\mathbb{R}$ to denote the activation function. For an $L$-layer network, we use superscript $l\in[L]$ to index layers, with $Z^{h^{l}}$ and $Z^{x^{l}}$ denoting pre-activation and post-activation features respectively. For matrices and vectors, $\widehat{W}_{0}^{L+1}\coloneqq W_{0}^{L+1}n$ denotes the scaled last-layer weights. For any matrix $W$ and vector $x$, $\widehat{Z}^{Wx}$ denotes the Gaussian component of $Z^{Wx}$ (here $Z^{Wx}\coloneqq\widehat{Z}^{Wx}+\dot{Z}^{Wx}$, which is detailed in Appendix B). We use $\mathbb{E}[\cdot]$ to denote expectation.
We consider a filtration $\{\mathcal{F}_{t}\}_{t\geq 0}$, where $\mathcal{F}_{t}$ is the $\sigma$-algebra generated by all random variables up to time $t$. This gives a sequence of probability spaces $(\Omega,\mathcal{F}_{t},\mathbb{P})$ with $\mathcal{F}_{0}\subseteq\mathcal{F}_{1}\subseteq\ldots\subseteq\mathcal{F}_{T}$. An event $\mathcal{E}\in\mathcal{F}_{T}$ occurs almost surely (denoted as a.s.) if $\mathbb{P}(\mathcal{E})=1$. The functions $\mathring{f}$ and $\mathring{\chi}$ denote the infinite-width limits of network outputs and error signals induced by $\mathring{f}$ respectively.

2 Related Work

Neural Tangent Kernel Parametrization Jacot et al. (2018) first introduced the neural tangent kernel (NTK) by studying the training dynamics of multi-layer perceptrons (MLPs) with Lipschitz and smooth activation functions under square loss. Based on NTK, Allen-Zhu et al. (2019); Du et al. (2019a); Zou et al. (2019); Arora et al. (2019a) proved the global convergence of (stochastic) gradient descent for various neural architectures with general activation and loss functions. Standard parametrization (SP) and NTK parametrization (NTP) share the same weight initialization scheme but with different learning schedules. As network width increases, SP requires learning rates to decrease as $O(1/\text{width})$ for all layers to maintain stability (Yang and Hu, 2020). When considering the infinite-width limit, neither SP nor NTK parametrization can learn features: the features remain essentially the same as those from random initialization. Both theoretical studies and empirical evidence demonstrated that these parametrizations failed to capture the feature learning behavior observed in practical neural networks (Woodworth et al., 2020; Geiger et al., 2020; Bordelon and Pehlevan, 2022; Yang et al., 2023a).

Mean Field Analysis The mean field limit emerged when networks and learning rates were scaled appropriately as width approached infinity, yielding nonlinear parameter evolution (Mei et al., 2018; Chizat and Bach, 2018; Rotskoff and Vanden-Eijnden, 2018; Sirignano and Spiliopoulos, 2018). Early analysis of two-layer networks showed promising results, proving convergence to global optima with explicit convergence rates established through both direct analysis (Chen et al., 2020) and mean field Langevin dynamics (Nitanda et al., 2022). Progress extended to three-layer networks with the global convergence results of Pham and Nguyen (2021). However, studies of deeper architectures revealed significant limitations: for networks deeper than four layers, both feature vectors and gradients degenerated to zero vectors (Nguyen and Pham, 2020; Fang et al., 2021). While Hajjar et al. (2021) introduced Integrable Parameterization (IP) to address this, networks with more than four layers still started at a stationary point in the infinite-width limit, which makes rich feature learning hard to achieve.

Figure 1: Different parametrization schemes exhibit distinct feature learning behaviors as width increases in $3$-layer MLPs. We train MLPs on the CIFAR-10 dataset and measure feature properties in Layer $1$. Left: Feature change ($\|h(x)-h^{0}(x)\|_{2}/\|h^{0}(x)\|_{2}$) shows that only $\mu$P maintains stable feature representations. Right: Feature diversity measured by the minimum eigenvalue of the feature Gram matrix $K_{ij}=\langle h(x_{i}),h(x_{j})\rangle$, where a larger eigenvalue indicates that the features span a higher-dimensional space. The results reveal that Meanfield parametrization suffers from feature collapse while SP, NTP and $\mu$P preserve rich feature representations. Notably, only $\mu$P achieves both feature learning capability and feature richness. See Appendix A for experimental details.

Tensor Programs Tensor Programs (TPs) emerged as a unified framework for understanding infinite-width limits across neural architectures (Yang, 2019b, 2020a, 2020b). This approach generalized previous architecture-specific parametrizations (Du et al., 2018, 2019b; Hron et al., 2020; Alemohammad et al., 2020). Yang and Hu (2020) characterized two distinct behaviors in infinite-width MLPs: one where initialization dominated the training dynamics (the kernel regime), and another where training data substantially influenced the learned weights (the feature learning regime). Within this framework, the $\mu$ parametrization was identified as enabling maximal feature learning across all layers and architectures (Yang and Hu, 2020; Yang et al., 2021; Littwin and Yang, 2022). The framework has continued to expand with analysis of depth-dependent scaling (Yang et al., 2023b). Recent work by Yang et al. (2023a) refined the understanding through spectral analysis and input dimension scaling, which we adopt in our experiments.

Our experimental results reveal distinct feature learning behaviors across different parametrization schemes. As shown in Figure 1, Standard Parametrization (SP) keeps features close to initialization (demonstrated by small feature change in the left panel), while Integrable Parametrization (IP) achieves feature learning but suffers from feature collapse (shown by decreasing feature diversity in the right panel). In contrast, $\mu$P achieves both substantial feature change and maintains feature diversity. We summarize these key characteristics in Table 1. Additional experiments with different activation functions, further illustrating these trends, are provided in Appendix A.
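The two quantities reported in Figure 1 can be computed from a matrix of layer features. The following is a minimal sketch, not the authors' experimental code: the function names are hypothetical, and averaging the relative change over samples is an assumption, since the aggregation used in the figure is specified only in Appendix A.

```python
import numpy as np

def relative_feature_change(h_trained, h_init):
    """Mean over samples of ||h(x) - h^0(x)||_2 / ||h^0(x)||_2.

    Both arguments have shape (num_samples, width). Averaging over
    samples is an assumption made for this sketch.
    """
    diff = np.linalg.norm(h_trained - h_init, axis=1)
    base = np.linalg.norm(h_init, axis=1)
    return float(np.mean(diff / base))

def min_gram_eigenvalue(h):
    """Smallest eigenvalue of the Gram matrix K_ij = <h(x_i), h(x_j)>."""
    K = h @ h.T                      # (num_samples, num_samples), symmetric PSD
    return float(np.linalg.eigvalsh(K)[0])  # eigvalsh returns ascending order
```

A minimum eigenvalue near zero signals that the feature vectors of different samples are (nearly) linearly dependent, which is the feature collapse discussed above; a value bounded away from zero indicates linearly independent features.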

Table 1: Feature Properties Under Different Parametrizations

| Parametrization | Feature Learning | Feature Richness |
|---|---|---|
| Standard (SP) | ✗ | Rich |
| Neural Tangent (NTP) | ✗ | Rich |
| Meanfield (IP) | ✓ | Low |
| Maximal Update ($\mu$P) | ✓ | Rich |

Note: IP (Integrable Parametrization) refers to parametrizations with a $1/n$ scaling factor for all layers except the first one, which leads to absolute convergence of weighted sums in the mean-field limit.

3 Preliminaries

Table 2: Initialization variance and learning rate scaling under different parametrization schemes for MLP networks.

| Layer | SP Init. Var. | SP LR | NTP Init. Var. | NTP LR | IP Init. Var. | IP LR | $\mu$P Init. Var. | $\mu$P LR |
|---|---|---|---|---|---|---|---|---|
| Input ($W^{1}$) | $1$ | $\eta\cdot n^{-1}$ | $1$ | $\eta$ | $1$ | $\eta\cdot n$ | $1$ | $\eta\cdot n$ |
| Hidden ($W^{l}$) | $n^{-1}$ | $\eta\cdot n^{-1}$ | $n^{-1}$ | $\eta\cdot n^{-1}$ | $n^{-2}$ | $\eta$ | $n^{-1}$ | $\eta$ |
| Output ($W^{L+1}$) | $n^{-1}$ | $\eta\cdot n^{-1}$ | $n^{-1}$ | $\eta\cdot n^{-1}$ | $n^{-2}$ | $\eta\cdot n^{-1}$ | $n^{-2}$ | $\eta\cdot n^{-1}$ |
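The scalings in Table 2 can be transcribed into a small lookup, which makes the differences between schemes easy to inspect. This is an illustrative sketch only: the names `SCALING`, `init_variance`, `learning_rate` and `init_weights` are hypothetical, constants are omitted as in the table, and drawing the weights from a zero-mean Gaussian is an assumption (the table specifies only the variance).

```python
import numpy as np

# Width exponents (a, b): init variance = n**a, learning rate = eta * n**b.
# Values transcribed from Table 2; constants are omitted, as in the table.
SCALING = {
    "SP":  {"input": (0, -1), "hidden": (-1, -1), "output": (-1, -1)},
    "NTP": {"input": (0, 0),  "hidden": (-1, -1), "output": (-1, -1)},
    "IP":  {"input": (0, 1),  "hidden": (-2, 0),  "output": (-2, -1)},
    "muP": {"input": (0, 1),  "hidden": (-1, 0),  "output": (-2, -1)},
}

def init_variance(scheme, layer, n):
    """Initialization variance of one weight entry at width n."""
    return float(n) ** SCALING[scheme][layer][0]

def learning_rate(scheme, layer, n, eta):
    """Layer-wise learning rate at width n for base rate eta."""
    return eta * float(n) ** SCALING[scheme][layer][1]

def init_weights(scheme, n, d, num_hidden, rng):
    """Sample W^1 (n x d), W^2..W^L (n x n) and W^{L+1} (n,).

    Assumption: zero-mean Gaussian entries with the tabulated variance.
    num_hidden is L, the number of hidden layers.
    """
    def std(layer):
        return np.sqrt(init_variance(scheme, layer, n))

    W_in = std("input") * rng.standard_normal((n, d))
    W_hid = [std("hidden") * rng.standard_normal((n, n))
             for _ in range(num_hidden - 1)]
    W_out = std("output") * rng.standard_normal(n)
    return W_in, W_hid, W_out
```

For example, `learning_rate("muP", "input", n=1024, eta=0.1)` multiplies the base rate by the width, while `learning_rate("muP", "hidden", n=1024, eta=0.1)` leaves it unchanged.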

Different parametrization schemes for MLPs are shown in Table 2. (In the table, Init. Var. denotes initialization variance and LR denotes learning rate scaling; $\eta$ is the base learning rate and $n$ is the layer width. For notational simplicity, we omit the constant in the table.) Given a general MLP with $L$ hidden layers specified by weight matrices $W^{1}\in\mathbb{R}^{n\times d}$, $\{W^{l}\}_{l=2}^{L}\in\mathbb{R}^{n\times n}$, $W^{L+1}\in\mathbb{R}^{n}$, and activation $\phi:\mathbb{R}\to\mathbb{R}$, the network computation is formally defined as

$$
\begin{aligned}
h^{1} &= W^{1}\xi\in\mathbb{R}^{n},\\
x^{l} &= \phi(h^{l})\in\mathbb{R}^{n},\\
h^{l+1} &= W^{l+1}x^{l}\in\mathbb{R}^{n},\\
f(\xi) &= W^{L+1}x^{L}\in\mathbb{R},
\end{aligned}
\tag{3.1}
$$

where $L>1$ is any positive integer and $l\in\{1,\dots,L-1\}$. Among these schemes, the Maximal Update Parametrization ($\mu$P) shown in Table 2 achieves maximal parameter updates at initialization. As $n\rightarrow\infty$, we can consider the following infinite-width feature learning process: $f_{t}(\xi)\overset{a.s.}{\rightarrow}\mathring{f}_{t}(\xi)$ (Yang and Hu, 2020, Theorem 6.4). The neural network is assumed to be trained using a differentiable loss $\mathcal{L}$ by stochastic gradient descent, where the $s$-th sampled batch is denoted by $\{(\xi_{i},y_{i})\}_{i\in\mathcal{B}_{s}}\subseteq S$, where $\mathcal{B}_{s}$ is the index set and $S$ is the training dataset. For simplicity, we present the full-batch gradient descent result in the main paper, i.e., $\mathcal{B}_{s}=[m]$ with $m=|S|$.
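The recursion in (3.1) translates directly into code. The sketch below is illustrative only: `mlp_forward` is a hypothetical name, and `tanh` is an assumed default activation, since the text only requires mild conditions on $\phi$.

```python
import numpy as np

def mlp_forward(W_in, W_hid, W_out, xi, phi=np.tanh):
    """Evaluate the MLP of Eq. (3.1) on one input xi of shape (d,).

    W_in is W^1 (n x d), W_hid is the list [W^2, ..., W^L] (each n x n),
    W_out is W^{L+1} (n,). Returns f(xi) together with the lists of
    pre-activations h^1..h^L and post-activations x^1..x^L.
    """
    h = W_in @ xi                  # h^1 = W^1 xi
    hs, xs = [h], [phi(h)]         # x^1 = phi(h^1)
    for W in W_hid:
        h = W @ xs[-1]             # h^{l+1} = W^{l+1} x^l
        hs.append(h)
        xs.append(phi(h))          # x^{l+1} = phi(h^{l+1})
    f = float(W_out @ xs[-1])      # f(xi) = W^{L+1} x^L
    return f, hs, xs
```

Returning the intermediate `hs` and `xs` is a design choice made so that layer features can be inspected, for instance to compare them before and after training.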

Represent Hidden States via $Z$ Random Variables:

Following Yang and Hu (2020), we represent the network’s hidden states using $Z$ random variables. This representation generalizes the spirit of two-layer mean field analysis: even with multiple hidden layers $(L\geq 2)$, the entries of the preactivation vectors $h$ and activation vectors $x$ in (3.1) become approximately i.i.d. as width $n$ approaches infinity. This allows us to characterize their asymptotic behavior using scalar random variables that reflect their elementwise distributions.

Specifically, for a vector $x\in\mathbb{R}^{n}$, we track it using $Z^{x}$, where $x$’s entries behave like i.i.d. copies of $Z^{x}$. When $x$ is properly scaled such that $\|x\|^{2}_{2}=\Theta(n)$ (i.e., its typical entry magnitude is independent of $n$), then $Z^{x}$ becomes independent of $n$. For any two such normalized vectors $x,y\in\mathbb{R}^{n}$, their corresponding random variables $Z^{x}$ and $Z^{y}$ are correlated via $\lim_{n\to\infty}x^{\top}y/n=\mathbb{E}Z^{x}Z^{y}$. Our goal is to characterize these $Z$ in (3.1) throughout the training process.
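The identity $\lim_{n\to\infty}x^{\top}y/n=\mathbb{E}Z^{x}Z^{y}$ can be illustrated numerically in the idealized case where the entry pairs are exactly i.i.d. The sketch below makes that simplifying assumption (in a real network the entries are only approximately i.i.d.), and the target correlation `rho` and the Gaussian construction are hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
rho = 0.6  # hypothetical value of E[Z^x Z^y]

def normalized_inner_product(n):
    """Draw x, y in R^n whose entry pairs are i.i.d. copies of (Z^x, Z^y),
    standard Gaussians with correlation rho, and return x^T y / n."""
    zx = rng.standard_normal(n)
    zy = rho * zx + np.sqrt(1.0 - rho**2) * rng.standard_normal(n)
    return float(zx @ zy / n)

# The estimate concentrates around rho as the width n grows.
estimates = {n: normalized_inner_product(n) for n in (10**2, 10**4, 10**6)}
for n, value in estimates.items():
    print(f"n = {n:>7d}: x^T y / n = {value:.4f}")
```

Both vectors also satisfy $\|x\|_{2}^{2}=\Theta(n)$ here, since each entry has unit variance regardless of $n$.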

Definition 3.1 (Yang and Hu, 2020).

During training, we define the error signal $\mathring{\chi}_{t,i}$ at time step $t$ for the $i$-th sample. When training with SGD to minimize the loss function $\mathcal{L}$, this error signal is computed as $\mathring{\chi}_{t,i}=\mathcal{L}^{\prime}(\mathring{f}_{t},y_{i})\mathds{1}\{i\in\mathcal{B}_{t}\}$, where $\mathring{f}_{t}$ is the model output at time $t$, $(\xi_{i},y_{i})$ is the $i$-th training sample pair, and $\mathcal{B}_{t}$ denotes the mini-batch at time step $t$. The indicator function $\mathds{1}\{\cdot\}$ ensures that the error signal is only computed for samples in the current mini-batch.

This error signal captures how much the model’s prediction deviates from the true label for each sample in the current mini-batch, and serves as the driving force for parameter updates during SGD training. For instance, in the case of mean squared error loss, the error signal takes the form $\mathring{\chi}_{t,i}=2(\mathring{f}_{t}(\xi_{i})-y_{i})\mathds{1}\{i\in\mathcal{B}_{t}\}$. Having defined the error signal, we now describe how the $Z$-variables characterize the network’s computation in the infinite-width limit $\mathring{f}_{t}(\xi)$. The forward pass tracks how network features propagate through layers, while the backward pass characterizes gradient flow. For clarity of presentation, we next introduce a simplified version of $\mathring{f}$ that includes the key properties needed for our theoretical analysis. The complete derivation and technical details can be found in Appendix B.
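For the mean squared error case, the error signal can be sketched as follows. The function name `mse_error_signal` is hypothetical, and samples are indexed from zero here (an assumption of this sketch; the text uses $[m]=\{1,\dots,m\}$).

```python
import numpy as np

def mse_error_signal(f_t, y, batch_idx):
    """chi_{t,i} = 2 (f_t(xi_i) - y_i) * 1{i in B_t} for every sample i.

    f_t: model outputs on all m training inputs, shape (m,).
    y: labels, shape (m,).
    batch_idx: zero-based indices of the samples in the mini-batch B_t.
    """
    f_t = np.asarray(f_t, dtype=float)
    y = np.asarray(y, dtype=float)
    in_batch = np.zeros(len(y), dtype=bool)
    in_batch[np.asarray(list(batch_idx), dtype=int)] = True  # indicator 1{i in B_t}
    return np.where(in_batch, 2.0 * (f_t - y), 0.0)
```

With full-batch gradient descent, `batch_idx` is simply `range(m)` and the indicator is always one; samples outside the mini-batch receive a zero error signal and therefore do not contribute to the update.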

Forward Pass

  1. 1.

    For z{xl,hl}l𝑧subscriptsuperscript𝑥𝑙superscript𝑙𝑙z\in\{x^{l},h^{l}\}_{l}italic_z ∈ { italic_x start_POSTSUPERSCRIPT italic_l end_POSTSUPERSCRIPT , italic_h start_POSTSUPERSCRIPT italic_l end_POSTSUPERSCRIPT } start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT, we have Zzt(ξ)=Zz0(ξ)+Zδz1(ξ)++Zδzt(ξ)superscript𝑍subscript𝑧𝑡𝜉superscript𝑍subscript𝑧0𝜉superscript𝑍𝛿subscript𝑧1𝜉superscript𝑍𝛿subscript𝑧𝑡𝜉Z^{z_{t}(\xi)}=Z^{z_{0}(\xi)}+Z^{\delta z_{1}(\xi)}+\cdots+Z^{\delta z_{t}(\xi)}italic_Z start_POSTSUPERSCRIPT italic_z start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( italic_ξ ) end_POSTSUPERSCRIPT = italic_Z start_POSTSUPERSCRIPT italic_z start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ( italic_ξ ) end_POSTSUPERSCRIPT + italic_Z start_POSTSUPERSCRIPT italic_δ italic_z start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ( italic_ξ ) end_POSTSUPERSCRIPT + ⋯ + italic_Z start_POSTSUPERSCRIPT italic_δ italic_z start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( italic_ξ ) end_POSTSUPERSCRIPT where

    1. (a)

      for l[L]𝑙delimited-[]𝐿l\in[L]italic_l ∈ [ italic_L ], Zδxtl(ξ)=ϕ(Zhtl(ξ))ϕ(Zht1l(ξ))superscript𝑍𝛿subscriptsuperscript𝑥𝑙𝑡𝜉italic-ϕsuperscript𝑍subscriptsuperscript𝑙𝑡𝜉italic-ϕsuperscript𝑍subscriptsuperscript𝑙𝑡1𝜉Z^{\delta x^{l}_{t}(\xi)}=\phi(Z^{h^{l}_{t}(\xi)})-\phi(Z^{h^{l}_{t-1}(\xi)})italic_Z start_POSTSUPERSCRIPT italic_δ italic_x start_POSTSUPERSCRIPT italic_l end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( italic_ξ ) end_POSTSUPERSCRIPT = italic_ϕ ( italic_Z start_POSTSUPERSCRIPT italic_h start_POSTSUPERSCRIPT italic_l end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( italic_ξ ) end_POSTSUPERSCRIPT ) - italic_ϕ ( italic_Z start_POSTSUPERSCRIPT italic_h start_POSTSUPERSCRIPT italic_l end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT ( italic_ξ ) end_POSTSUPERSCRIPT ),

    2. (b)

      for l=1𝑙1l=1italic_l = 1, we have

      $$ Z^{\delta h^{l}_{t}(\xi)} = -\sum_{i\in[m]} \eta\, \mathring{\chi}_{t-1,i}\, \xi_{i}^{\top}\xi\, Z^{dh^{l}_{t-1}(\xi_{i})}, $$

      for $2\leq l\leq L$, we have

      $$ Z^{\delta h^{l}_{t}(\xi)} = \widehat{Z}^{W^{l}_{0}\delta x^{l-1}_{t}(\xi)} + F_{t}(\xi), \tag{3.2} $$

      where $F_{t}$ is a function that is determined by the random variables $\{Z^{dh_{s}(\xi_{i})}\}_{i\in[m],s\in[t-1]}$ (see Appendix B for details), and $\widehat{Z}^{W^{l}_{0}\delta x^{l-1}_{t}(\xi)}$ are zero-mean jointly Gaussian with covariance

      $$ \text{Cov}\big(\widehat{Z}^{W^{l}_{0}\delta x^{l-1}_{t}(\xi)}, \widehat{Z}^{W^{l}_{0}\delta x^{l-1}_{s}(\bm{\zeta})}\big) = \mathbb{E}\big[Z^{\delta x^{l-1}_{t}(\xi)} Z^{\delta x^{l-1}_{s}(\bm{\zeta})}\big]. $$
  2.

    For the last-layer weight, we have $Z^{\widehat{W}_{t}^{L+1}} = Z^{\widehat{W}_{0}^{L+1}} + Z^{\delta W_{1}^{L+1}} + \cdots + Z^{\delta W_{t}^{L+1}}$, where

    $$ Z^{\delta W_{t}^{L+1}} = -\eta \sum_{i\in[m]} \mathring{\chi}_{t-1,i}\, Z^{x_{t-1}^{L}(\xi_{i})}. \tag{3.3} $$
  3.

    The output deltas have limits $\mathring{f}_{t}(\xi) = \delta\mathring{f}_{1}(\xi) + \cdots + \delta\mathring{f}_{t}(\xi)$, where

    $$ \delta\mathring{f}_{t}(\xi) = \mathbb{E} Z^{\delta W_{t}^{L+1}} Z^{x_{t}^{L}(\xi)} + \mathbb{E} Z^{\widehat{W}_{t-1}^{L+1}} Z^{\delta x_{t}^{L}(\xi)}. \tag{3.4} $$

Backward Pass

  1.

    For gradients:

    \begin{align}
    Z^{dx_{t}^{L}(\xi)} &= Z^{\widehat{W}_{t}^{L+1}} \tag{3.5} \\
    Z^{dh_{t}^{l}(\xi)} &= Z^{dx_{t}^{l}(\xi)} \phi'(Z^{h_{t}^{l}(\xi)}) \tag{3.6} \\
    Z^{dx_{t}^{l-1}(\xi)} &= \widehat{Z}^{W_{0}^{l\top} dh_{t}^{l}(\xi)} + G_{t}(\xi) \tag{3.7}
    \end{align}

    where $G_{t}$ is a function that is determined by the random variables $\{Z^{x_{s}^{l-1}(\xi_{i})}\}_{i\in[m],s\in[t-1]}$ (see Appendix B for details), and where $\{\widehat{Z}^{W_{0}^{l\top} dh_{t}^{l}(\xi)}\}_{\xi,t}$ are zero-mean jointly Gaussian with covariance

    $$ \text{Cov}\big(\widehat{Z}^{W_{0}^{l\top} dh_{t}^{l}(\xi)}, \widehat{Z}^{W_{0}^{l\top} dh_{s}^{l}(\bm{\zeta})}\big) = \mathbb{E}\big[Z^{dh_{t}^{l}(\xi)} Z^{dh_{s}^{l}(\bm{\zeta})}\big]. $$
Remark 3.2.

The error signal generalizes to different optimization objectives. For example, in binary classification problems, the error signal can be expressed as $\mathring{\chi}_{t,i} = -y_{i}/\big(1+\exp(y_{i}\cdot\mathring{f}_{t}(\xi_{i}))\big)\operatorname{\mathds{1}}\{i\in\mathcal{B}_{t}\}$, where $\mathring{f}_{t}$ is the model output at time $t$, $(\xi_{i},y_{i})$ is the $i$-th training sample pair, and $\mathcal{B}_{t}$ denotes the mini-batch at time step $t$. (Footnote 4: $\mathcal{L}$ is only required to be continuously differentiable with respect to its first argument (Yang and Hu, 2020), which we omit in subsequent presentations.)
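As an illustration of the error signal in Remark 3.2, the following minimal Python sketch evaluates $\mathring{\chi}_{t,i}$ for a list of labels and model outputs. The function names and the representation of the mini-batch as a set of indices are assumptions made for this example. The expression $-y/(1+\exp(yf))$ is the derivative of the logistic loss $\log(1+\exp(-yf))$ with respect to $f$, and the sketch rewrites it through a numerically stable sigmoid to avoid overflow for large $|yf|$.

```python
import math


def stable_sigmoid(z):
    """Sigmoid that avoids overflow in exp for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def logistic_error_signal(labels, outputs, batch):
    """Error signal chi_i = -y_i / (1 + exp(y_i * f_i)) * 1{i in batch}.

    labels  : list of +1/-1 labels y_i
    outputs : list of model outputs f_t(xi_i)
    batch   : set of sample indices in the current mini-batch (hypothetical encoding)
    """
    chi = []
    for i, (y, f) in enumerate(zip(labels, outputs)):
        if i in batch:
            # -y / (1 + exp(y f)) equals -y * sigmoid(-y f)
            chi.append(-y * stable_sigmoid(-y * f))
        else:
            # samples outside the mini-batch contribute no error signal
            chi.append(0.0)
    return chi


chi = logistic_error_signal([1, -1, 1], [0.0, 0.0, 2.0], {0, 1})
```

In this usage, the first two samples are in the mini-batch and sit at output zero, so their error signals have magnitude one half with sign opposite to the label, while the third sample is outside the batch and receives zero.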

4 Main Results

In this section, we present our main theoretical results, which rely on the following assumptions regarding the training data and activation function. Specifically, we will first state a mild geometric condition on the inputs, and then discuss the regularity requirements on the activation function.

Assumption 4.1.

Consider input vectors $\xi$ drawn from the training data set $S\subseteq\mathbb{R}^{d}$ satisfying that for any three different points $\xi_{i},\xi_{j},\xi_{k}\in S$, the following property holds:

$$ |\langle\xi_{i},\xi_{j}\rangle| \neq |\langle\xi_{i},\xi_{k}\rangle|, \quad |\langle\xi_{i},\xi_{j}\rangle| \neq 0, \quad \forall i\neq j. $$

Assumption 4.1 rules out inner products among different data points that are zero or that coincide in absolute value, which could otherwise lead to degenerate analyses. Although it may appear restrictive, it holds with probability $1$ if the samples are drawn from any continuous distribution (e.g., Gaussian). Indeed, the set of points violating the above requirement, such as those with exactly matching inner products, has Lebesgue measure zero. In practice, minor random perturbations to discrete data can also ensure the condition is satisfied.
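Assumption 4.1 can be checked directly on a finite data set. The sketch below is an illustrative pure-Python check; the function names `satisfies_assumption_4_1` and `jitter`, the tolerance `tol`, and the Gaussian form of the perturbation are assumptions of this example and are not taken from the paper. It also shows how a small random perturbation can repair a data set with an exactly zero inner product.

```python
import random
from itertools import combinations


def inner(u, v):
    """Euclidean inner product of two vectors given as lists."""
    return sum(a * b for a, b in zip(u, v))


def satisfies_assumption_4_1(points, tol=1e-12):
    """Check |<xi_i, xi_j>| != |<xi_i, xi_k>| and |<xi_i, xi_j>| != 0
    for all distinct i, j, k, up to a numerical tolerance."""
    m = len(points)
    for i in range(m):
        mags = [abs(inner(points[i], points[j])) for j in range(m) if j != i]
        if any(g <= tol for g in mags):
            return False  # some inner product is (numerically) zero
        if any(abs(a - b) <= tol for a, b in combinations(mags, 2)):
            return False  # two inner products share the same absolute value
    return True


def jitter(points, scale, rng):
    """Add small independent Gaussian noise to every coordinate."""
    return [[x + rng.gauss(0.0, scale) for x in p] for p in points]


degenerate = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # first two points are orthogonal
repaired = jitter(degenerate, 1e-3, random.Random(0))
```

Here `degenerate` violates the assumption because two of its points are orthogonal, whereas the jittered copy is expected to pass the check, in line with the measure-zero argument above.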

Definition 4.2 (GOOD Function).

A function $\phi:\mathbb{R}\to\mathbb{R}$ is called GOOD if it prevents degeneracy in neural networks by ensuring non-trivial compositions. Specifically, for any finite set of parameters $\{a_{i}\},\{b_{i}\},\{c_{i}\}$ satisfying $a_{k}b_{k}\neq 0$ for some $k$ and $|b_{i}|\neq|b_{j}|$ for all $i\neq j$, we have that the composite mapping

$$ f(x) = \sum_{i=1}^{n} a_{i}\phi(b_{i}x+c_{i}), \quad x\in\mathbb{R} $$

is not a constant function. Moreover, for any real numbers $r_{1},r_{2}$, the function $(r_{1}+\phi(x))(r_{2}+\phi'(x))$ is not almost everywhere constant.

We next introduce an assumption on the activation function that ensures it is both sufficiently smooth and GOOD:

Assumption 4.3.

We assume that the activation function $\phi$ satisfies the following properties.

  1.

    $\phi$ is twice continuously differentiable.

  2.

    $\phi'$ and $\phi''$ are bounded.

  3.

    $\phi$ is a GOOD function.

  4.

    $\{x\in\mathbb{R}:\phi(x)=y\}$ and $\{x\in\mathbb{R}:\phi'(x)=y\}$ are countable for all $y\in\mathbb{R}$.

Remark 4.4.

Assumption 4.3 imposes regularity and smoothness conditions on the activation function, ensuring that $\phi'$ is pseudo-Lipschitz (footnote 5: see Yang and Hu (2020, Definition E.3) for the definition of pseudo-Lipschitz functions), a requirement for Yang and Hu (2020, Theorem 7.4). These conditions are met by many commonly used activation functions, including the sigmoid function $\sigma(x)=1/\big(1+\exp(-x)\big)$ and the hyperbolic tangent ($\tanh$), which is a shifted and rescaled version of the sigmoid.

Modern activation functions such as the SiLU (Sigmoid Linear Unit), defined as $\text{SiLU}(x)=x\cdot\sigma(x)$ (Hendrycks and Gimpel, 2016), also satisfy these assumptions. SiLU has been widely adopted in practice, including in several state-of-the-art open-source foundation models (Touvron et al., 2023a, b). A detailed discussion of activation functions that meet these criteria is provided in Appendix D.
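For reference, the closed forms behind the smoothness and boundedness conditions can be written out explicitly. The identities below are standard calculus facts stated here for convenience; they are not quoted from Appendix D and cover only items 1 and 2 of Assumption 4.3, not the GOOD property.

```latex
\begin{align*}
\tanh(x) &= 2\,\sigma(2x) - 1, \\
\sigma'(x) &= \sigma(x)\,\bigl(1-\sigma(x)\bigr), \\
\text{SiLU}'(x) &= \sigma(x)\,\bigl(1 + x\,(1-\sigma(x))\bigr), \\
\text{SiLU}''(x) &= \sigma(x)\,\bigl(1-\sigma(x)\bigr)\,\bigl(2 + x\,(1-2\sigma(x))\bigr).
\end{align*}
```

Since $\sigma(x)\bigl(1-\sigma(x)\bigr)$ decays exponentially as $|x|\to\infty$, the products with $x$ stay bounded, so both $\text{SiLU}'$ and $\text{SiLU}''$ are bounded even though SiLU itself is not.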

With these assumptions in place, we can now state our main theoretical results regarding feature non-degeneracy and convergence. In particular, the following theorem establishes that in wide neural networks, feature representations evolve while maintaining their diversity and avoiding collapse throughout training.

Theorem 4.5.

Consider an infinite-width $L$-layer MLP trained with gradient descent. Under Assumptions 4.1 and 4.3, the features in each layer are non-degenerate at any time $t$ during training. Specifically, for each layer $l\in[L]$:

  1.

    The pre-activation features $\{Z^{h_{t}^{l}(\xi)}\}_{\xi\in S}$ are linearly independent.

  2.

    The post-activation features $\{Z^{x_{t}^{l}(\xi)}\}_{\xi\in S}$ are linearly independent.
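Linear independence of square-integrable random variables $Z_{1},\dots,Z_{m}$ is equivalent to positive definiteness of their Gram matrix $\big(\mathbb{E}[Z_{i}Z_{j}]\big)_{i,j}$. At finite width $n$, a natural proxy replaces each $Z^{x_{t}^{l}(\xi_{i})}$ by the width-$n$ feature vector and each expectation by an average over coordinates. The sketch below implements this proxy with a Cholesky-style test; it is an illustrative check under these assumptions, with hypothetical function names and tolerance, and is not claimed to be the procedure used in the paper's experiments.

```python
import math


def gram_matrix(features):
    """Empirical Gram matrix G[i][j] = <x_i, x_j> / n for width-n feature vectors."""
    n = len(features[0])
    return [[sum(a * b for a, b in zip(u, v)) / n for v in features]
            for u in features]


def is_linearly_independent(features, tol=1e-10):
    """Return True if the Gram matrix is positive definite up to `tol`.

    Runs a Cholesky factorization and fails as soon as a pivot is not
    strictly larger than the tolerance.
    """
    g = gram_matrix(features)
    m = len(g)
    chol = [[0.0] * m for _ in range(m)]
    for i in range(m):
        for j in range(i + 1):
            s = g[i][j] - sum(chol[i][k] * chol[j][k] for k in range(j))
            if i == j:
                if s <= tol:
                    return False  # degenerate direction: features are dependent
                chol[i][i] = math.sqrt(s)
            else:
                chol[i][j] = s / chol[j][j]
    return True
```

For example, `is_linearly_independent([[1.0, 2.0, 3.0], [2.0, 4.0, 6.0]])` reports a degenerate pair because the second vector is a multiple of the first.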

This non-degeneracy property has important implications for the convergence behavior of the model. In particular, it allows us to characterize the state of the model at convergence, as described in the following corollary.

Corollary 4.6.

Consider an infinite-width $L$-layer MLP under the conditions of Theorem 4.5. If the model converges at time $T$, meaning that the model weights remain unchanged for all $t\geq T$, then the error signal vanishes for all subsequent mini-batches:

$$ \mathring{\chi}_{T,i} = 0, \quad \forall i\in\bigcup_{t\geq T}\mathcal{B}_{t}, $$

where $\mathcal{B}_{t}$ denotes the mini-batch at time $t$.

This corollary establishes that feature non-degeneracy forces convergence to occur only at critical points where the error signal vanishes. More precisely, when the network converges, the error signals must vanish across all samples in subsequent mini-batches, implying convergence to a global minimum of the training objective. This is a consequence of the feature non-degeneracy established in Theorem 4.5, as non-degenerate features ensure that weight updates can only stop when the network has effectively minimized the error signals.
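The role of non-degeneracy can be made concrete for the last layer using only (3.3) and Theorem 4.5. The following is an informal outline of that step, assuming a learning rate $\eta>0$; the complete proof is in Appendix C.

```latex
% Convergence at time T means the last-layer weight stops moving, so by (3.3):
\[
0 \;=\; Z^{\delta W_{t+1}^{L+1}}
  \;=\; -\eta \sum_{i\in[m]} \mathring{\chi}_{t,i}\, Z^{x_{t}^{L}(\xi_{i})}
  \qquad \text{for all } t \geq T .
\]
% By Theorem 4.5 the family {Z^{x_t^L(xi_i)}}_i is linearly independent,
% so every coefficient of the vanishing linear combination must be zero:
\[
\mathring{\chi}_{t,i} \;=\; 0 \qquad \text{for all } i\in[m],\ t \geq T .
\]
```

Without linear independence, a nonzero error vector could lie in the null space of the feature family, and the update could vanish at a point that is not a global minimum.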

5 Key Techniques and Analysis

In this section, we first identify the key technical challenges in establishing our main results, and then present the techniques and insights to address them. We begin by discussing two fundamental challenges: the tension between feature evolution and structural stability, and the intricate coupling across network layers. We then develop a systematic framework based on Gaussian processes to overcome these challenges. The complete proof is presented in Appendix C.

5.1 Technical Challenges

Establishing global convergence while allowing meaningful feature learning presents two fundamental technical challenges that must be addressed simultaneously:

  1.

    Feature Evolution vs. Structural Stability: In contrast to the NTK parametrization (where features stay near their initialization), $\mu$P enables features to evolve substantially during training. Specifically, for any feature $z\in\{x^{l},h^{l}\}_{l}$ in the Forward Pass (a) of Section 3, we have:

    $$ Z^{z_{t}(\xi)} = Z^{z_{0}(\xi)} + \underbrace{Z^{\delta z_{1}(\xi)} + \cdots + Z^{\delta z_{t}(\xi)}}_{\text{feature learning term}}. $$

    The presence of the feature learning term makes it challenging to track and characterize the features' properties throughout optimization. This contrasts sharply with the setting under the NTK parametrization, where $Z^{z_{t}(\xi)}$ stays equal to its initialization $Z^{z_{0}(\xi)}$ (Yang and Hu, 2020), a mathematically simpler but limited case in which the network behavior is fully determined by the initial kernel.

  2.

    Cross-Layer Coupling: In deep networks, changes in one layer's features affect both earlier and later layers through forward and backward propagation. For forward propagation in layer $l$ and backward propagation in layer $l+1$, we have by (3.2) and (3.7):

    \begin{align}
    Z^{\delta h^{l}_{t}(\xi)} &= \widehat{Z}^{W^{l}_{0}\delta x^{l-1}_{t}(\xi)} + F_{t}(\xi) \tag{5.1} \\
    Z^{dx_{s}^{l}(\xi)} &= \widehat{Z}^{W_{0}^{l+1\top} dh_{s}^{l+1}(\xi)} + G_{s}(\xi), \tag{5.2}
    \end{align}

    where $F_{t}$ and $G_{s}$ capture the historical dependencies through previous gradients $\{Z^{dh^{l}_{s}(\xi_{i})}\}_{i\in[m],s\in[t-1]}$ and previous features $\{Z^{x_{s}^{l-1}(\xi_{i})}\}_{i\in[m],s\in[t-1]}$, respectively. This intricate coupling between forward and backward passes makes it challenging to ensure that features remain well-behaved as they propagate through the network.

Our key insight in addressing these challenges lies in analyzing structural invariants preserved by the induced Gaussian processes during training. While features evolve substantially, we find that certain second-order properties—specifically, non-degeneracy—remain invariant across layers and time steps. This invariance ensures rich feature learning while preventing the network from getting stuck in local minima.

5.2 The Gaussian Process View

In the infinite-width limit, neural network training induces two families of Gaussian processes that capture forward and backward propagation:

$$ \{\widehat{Z}^{W_{0}^{l}\delta x_{s}^{l-1}(\xi_{i})}\}_{i\in[m],\,s\in[t],\,2\leq l\leq L}, \tag{5.3} $$
$$ \{\widehat{Z}^{W_{0}^{l\top} dh_{s}^{l}(\xi_{i})}\}_{i\in[m],\,s\in[t],\,2\leq l\leq L}. \tag{5.4} $$

The forward process (5.3) tracks how features evolve across layers, while the backward process (5.4) describes gradient flow. Unlike prior work that studies these processes in isolation, we discover fundamental connections between their structural properties that enable both feature learning and convergence.

Covariance Structure of Gaussian Processes. Our key technical insight is that these Gaussian processes (5.3) and (5.4) exhibit invariant covariance properties that persist throughout training. Recall from (5.1) and (5.2) that both forward and backward propagation can be decomposed into a Gaussian term and a history-dependent term:

$$ Z^{\delta h^{l}_{t}(\xi)} = \underbrace{\widehat{Z}^{W^{l}_{0}\delta x^{l-1}_{t}(\xi)}}_{\text{Gaussian term}} + \underbrace{F_{t}(\xi)}_{\text{history term}}, \qquad Z^{dx_{s}^{l}(\xi)} = \underbrace{\widehat{Z}^{W_{0}^{l+1\top} dh_{s}^{l+1}(\xi)}}_{\text{Gaussian term}} + \underbrace{G_{s}(\xi)}_{\text{history term}}. $$

These Gaussian terms preserve covariance relationships across layers throughout training:

\begin{align*}
\text{Cov}\big(\widehat{Z}^{W_{0}^{l}\delta x_{s}^{l-1}(\xi)},\widehat{Z}^{W_{0}^{l}\delta x_{t}^{l-1}(\bm{\zeta})}\big) &=\mathbb{E}\big[Z^{\delta x_{s}^{l-1}(\xi)}Z^{\delta x_{t}^{l-1}(\bm{\zeta})}\big],\\
\text{Cov}\big(\widehat{Z}^{W_{0}^{l\top}dh_{s}^{l}(\xi)},\widehat{Z}^{W_{0}^{l\top}dh_{t}^{l}(\bm{\zeta})}\big) &=\mathbb{E}\big[Z^{dh_{s}^{l}(\xi)}Z^{dh_{t}^{l}(\bm{\zeta})}\big].
\end{align*}

These covariance relationships reveal that feature correlations between adjacent layers follow consistent patterns, even as individual features evolve. They link the feature spaces of adjacent layers through their second-order statistics, providing a structural bridge that persists throughout the training process.
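The first identity can be checked numerically at finite width. The sketch below is an illustration under stated assumptions, not the authors' code: it assumes a square matrix `W` with i.i.d. $\mathcal{N}(0,1/n)$ entries and vectors `x`, `y` drawn independently of `W`, and it uses the coordinate average as a finite-width stand-in for the expectation. The names `coordinate_moment`, `x`, `y`, and `W` are hypothetical.

```python
import numpy as np

def coordinate_moment(u, v):
    """Average of u_a * v_a over coordinates (finite-width stand-in for E[Z^u Z^v])."""
    return float(np.mean(u * v))

rng = np.random.default_rng(0)
n = 2000

# Two correlated vectors with O(1) coordinates, drawn independently of W.
x = rng.standard_normal(n)
y = 0.6 * x + 0.8 * rng.standard_normal(n)

# Assumed normalization: i.i.d. N(0, 1/n) entries.
W = rng.standard_normal((n, n)) / np.sqrt(n)

lhs = coordinate_moment(W @ x, W @ y)  # empirical covariance of the Gaussian terms
rhs = coordinate_moment(x, y)          # empirical second moment of the inputs
print(f"lhs={lhs:.3f}, rhs={rhs:.3f}")
```

The two printed values should agree up to fluctuations of order $1/\sqrt{n}$, which is the finite-width counterpart of the covariance identity.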

5.3 From Covariance Structure to Non-degeneracy

The preservation of covariance relationships across layers ensures the non-degeneracy of the induced Gaussian processes throughout training. In the proof of Theorem 4.5, we consider any linear combination of the Gaussian processes:

$$\sum_{i\in[m],s\in[t]}\lambda_{i,s}\widehat{Z}^{W_{0}^{l}\delta x_{s}^{l-1}(\xi_{i})},\quad\sum_{i\in[m],s\in[t]}\lambda_{i,s}\widehat{Z}^{W_{0}^{l\top}dh_{s}^{l}(\xi_{i})}.$$

Using the covariance preservation property, we show that if these linear combinations degenerate (i.e., equal zero almost surely), then the corresponding linear combinations of the original features and gradients must also degenerate:

$$\sum_{i\in[m],s\in[t]}\lambda_{i,s}Z^{\delta x_{s}^{l-1}(\xi_{i})}\overset{a.s.}{=}0,\quad\sum_{i\in[m],s\in[t]}\lambda_{i,s}Z^{dh_{s}^{l}(\xi_{i})}\overset{a.s.}{=}0.$$

This connection through linear combinations allows us to transfer the non-degeneracy property from feature space to the induced Gaussian processes across layers, establishing that both forward and backward processes remain non-degenerate throughout training. This result reveals a fundamental connection between covariance structure and feature richness: the preservation of covariance relationships ensures that linear independence propagates through layers.
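The step from a degenerate Gaussian combination to a degenerate feature combination can be made explicit. The following is a sketch for the forward case, assuming the $\widehat{Z}$ variables are jointly Gaussian with zero mean; it uses only bilinearity of the covariance and the covariance identity from Section 5.2:

```latex
\begin{align*}
\mathrm{Var}\Big(\sum_{i,s}\lambda_{i,s}\widehat{Z}^{W_{0}^{l}\delta x_{s}^{l-1}(\xi_{i})}\Big)
&=\sum_{i,s}\sum_{j,r}\lambda_{i,s}\lambda_{j,r}\,
  \mathrm{Cov}\big(\widehat{Z}^{W_{0}^{l}\delta x_{s}^{l-1}(\xi_{i})},
                   \widehat{Z}^{W_{0}^{l}\delta x_{r}^{l-1}(\xi_{j})}\big)\\
&=\sum_{i,s}\sum_{j,r}\lambda_{i,s}\lambda_{j,r}\,
  \mathbb{E}\big[Z^{\delta x_{s}^{l-1}(\xi_{i})}Z^{\delta x_{r}^{l-1}(\xi_{j})}\big]\\
&=\mathbb{E}\Big[\Big(\sum_{i,s}\lambda_{i,s}Z^{\delta x_{s}^{l-1}(\xi_{i})}\Big)^{2}\Big].
\end{align*}
```

If the Gaussian combination is zero almost surely, the left-hand side vanishes, so the expectation of the squared feature combination is zero and the feature combination vanishes almost surely. The backward case is analogous, with $dh_{s}^{l}$ in place of $\delta x_{s}^{l-1}$ and $W_{0}^{l\top}$ in place of $W_{0}^{l}$.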

This directly contrasts with other parametrizations. In the NTK parametrization, since features stay close to initialization with $Z^{z_{t}(\xi)}=Z^{z_{0}(\xi)}$, the process necessarily becomes degenerate as it fails to capture new information during training. Our analysis shows that $\mu$P uniquely maintains the non-degeneracy of features across both space and time dimensions, enabling the network to learn rich and meaningful features throughout training.

Figure 2: Minimum eigenvalue analysis of second-layer features, combining features from the initial and final iterations, across different parametrizations. The $\mu$P parametrization maintains higher eigenvalues as width increases, demonstrating better preservation of feature richness. In contrast, the other parametrizations (NTP, IP, SP) show substantial decay in eigenvalues with increasing width, indicating degenerate feature learning (either failing to learn new feature information or learning degenerate features). This empirically validates our theoretical analysis that $\mu$P uniquely preserves non-degeneracy across both spatial and temporal dimensions, while NTP and SP fail to capture new information during training.

To empirically validate this theoretical finding, we analyze the minimum eigenvalue of the feature matrix constructed from the joint space-time features at Layer $2$, by appending initial and final representations, complementing our analysis of feature diversity in Figure 1. Under the same experimental setup with 3-layer MLPs trained on CIFAR-10, Figure 2 shows that the $\mu$P parametrization maintains higher eigenvalues across different network widths compared to other parametrizations. This aligns with our theoretical prediction and further strengthens the findings in Figure 1, where we observed $\mu$P’s unique ability to achieve both feature learning and feature richness.
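A minimal sketch of such a joint space-time eigenvalue computation is given below. It is not the authors' implementation: it assumes that "appending" means stacking the initial and final feature matrices along the sample axis and that the Gram matrix is normalized by the width, and the function name `joint_min_eigenvalue` is hypothetical. The toy data only illustrates why frozen features give a degenerate joint matrix.

```python
import numpy as np

def joint_min_eigenvalue(h_init, h_final):
    """Minimum eigenvalue of the Gram matrix of stacked initial and final features.

    h_init, h_final: arrays of shape (m, n), i.e. m samples of width-n features.
    Assumption: stacking is along the sample axis, giving a (2m, 2m) Gram
    matrix that is normalized by the width n.
    """
    joint = np.concatenate([h_init, h_final], axis=0)  # shape (2m, n)
    gram = joint @ joint.T / joint.shape[1]            # shape (2m, 2m)
    return float(np.linalg.eigvalsh(gram)[0])          # eigvalsh sorts ascending

rng = np.random.default_rng(0)
m, n = 10, 512
h0 = rng.standard_normal((m, n))

# Frozen features (final == initial): stacked rows repeat, so the Gram matrix is singular.
frozen = joint_min_eigenvalue(h0, h0.copy())
# Features that moved in new directions: the 2m stacked rows are linearly independent.
moved = joint_min_eigenvalue(h0, h0 + rng.standard_normal((m, n)))
print(f"frozen: {frozen:.2e}, moved: {moved:.2e}")
```

In this toy setting the frozen case returns a value that is zero up to floating-point error, while the case with genuinely new feature directions stays bounded away from zero.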

5.4 Evolution Framework

To rigorously track how these structural properties evolve throughout training, we need to carefully handle the natural flow of information in neural networks: forward propagation followed by backward propagation. In each iteration, the network first computes forward features through all layers, then calculates gradients backwards for parameter updates. This computational pattern naturally leads to a two-level filtration framework. We introduce a sequence of $\sigma$-algebras to track the evolution of random variables during training. Let $\mathcal{F}_{0}$ denote the initial condition:

$$\mathcal{F}_{0}=\sigma\big(\{Z^{h^{l}_{0}(\xi_{i})},Z^{x^{l}_{0}(\xi_{i})}\}_{i\in[m],l\in[L]},Z^{\widehat{W}_{0}^{L+1}}\big).$$

Then we define $\mathcal{F}_{t}$ to track all completed iterations up to time $t$, and an extended filtration $\mathcal{G}_{t}$ to capture the forward pass of the $(t+1)$-th iteration:

\begin{align}
\mathcal{F}_{t} &=\sigma\big(\mathcal{F}_{0},\{\widehat{Z}^{W_{0}^{l}\delta x_{s}^{l-1}(\xi_{i})}\}_{i\in[m],s\in[t],2\leq l\leq L},\{\widehat{Z}^{W_{0}^{l\top}dh_{s}^{l}(\xi_{i})}\}_{i\in[m],s\in[t],2\leq l\leq L}\big) \tag{5.5}\\
\mathcal{G}_{t} &=\sigma\big(\mathcal{F}_{t},\{\widehat{Z}^{W_{0}^{l}\delta x_{t+1}^{l-1}(\xi_{i})}\}_{i\in[m],2\leq l\leq L}\big) \tag{5.6}
\end{align}
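Reading off definitions (5.5) and (5.6), the two families interleave into a single increasing chain. This is a direct consequence of the definitions rather than an additional result, assuming only that $[t]\subseteq[t+1]$:

```latex
\[
\mathcal{F}_{t}\subseteq\mathcal{G}_{t}\subseteq\mathcal{F}_{t+1}
\qquad\text{for every } t\geq 0 .
\]
```

Here $\mathcal{G}_{t}$ adds only the forward Gaussian terms of iteration $t+1$ to $\mathcal{F}_{t}$, and $\mathcal{F}_{t+1}$ additionally contains the backward Gaussian terms of that iteration.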

This filtration structure allows us to precisely track how information flows through the network: $\mathcal{F}_{t}$ contains all information up to time $t$, while $\mathcal{G}_{t}$ extends this to include the forward pass information at time $t+1$ before its backward pass begins. This framework provides the following:

  1. Inductive Proof Structure: The filtration framework enables a structured inductive proof that follows the natural flow of computation in neural networks. We establish non-degeneracy in four steps, motivated by how information propagates through the network:

    • Step 1: Features in the first hidden layer, $\widehat{Z}^{W_{0}^{2}\delta x_{s}^{1}(\xi_{i})}$. This forms the base case, as it depends only on the input.

    • Step 2: Features in the remaining layers, $\widehat{Z}^{W_{0}^{l}\delta x_{s}^{l-1}(\xi_{i})}$. This step uses the non-degeneracy of the previous layers.

    • Step 3: Gradients in the last layer, $\widehat{Z}^{W_{0}^{L\top}dh_{s}^{L}(\xi_{i})}$. This step builds on the established feature properties.

    • Step 4: Gradients in the remaining layers, $\widehat{Z}^{W_{0}^{l\top}dh_{s}^{l}(\xi_{i})}$. This completes the backward-pass analysis.

    Each step leverages the non-degeneracy established in previous steps, creating a chain of dependency that mirrors the network’s computation graph.

  2. Conditional Analysis: The filtration enables precise decomposition of feature and gradient updates into new and historical information:

    \begin{align*}
    Z^{h^{l}_{s}(\xi_{i})} &=\underbrace{\widehat{Z}^{W_{0}^{l}\delta x^{l-1}_{s}(\xi_{i})}}_{\text{new randomness}}+\underbrace{\Delta_{s}(\xi_{i})}_{\text{history}}\\
    Z^{dx_{s}^{l}(\xi_{i})} &=\underbrace{\widehat{Z}^{W_{0}^{l+1\top}dh_{s}^{l+1}(\xi_{i})}}_{\text{new randomness}}+\underbrace{G_{s}(\xi_{i})}_{\text{history}}
    \end{align*}

    where $\Delta_{s}(\xi_{i})\in\mathcal{F}_{s-1}$ captures the accumulated feature history, and $G_{s}(\xi_{i})$ represents previous gradient information. This decomposition is crucial for our inductive proof: by focusing on the new randomness in each step, we can show that non-degeneracy is preserved when conditioned on all historical information.

  3. Non-degeneracy Preservation: By leveraging GOOD activation functions as introduced in Assumption 4.3 (e.g., Sigmoid, Tanh, SiLU) and the covariance structure, we show that non-degeneracy propagates forward in time. Specifically, if at time $t$ we have:

    $$\sum_{i\in[m],s\in[t]}\lambda_{i,s}\widehat{Z}^{W_{0}^{l}\delta x_{s}^{l-1}(\xi_{i})}\overset{a.s.}{\neq}0,\qquad\sum_{i\in[m],s\in[t]}\lambda_{i,s}\widehat{Z}^{W_{0}^{l\top}dh_{s}^{l}(\xi_{i})}\overset{a.s.}{\neq}0.$$

    Then, we show that these sums remain nonzero at time $t+1$ based on two key properties: (1) the non-degeneracy of Gaussian processes is preserved when they share the same covariance structure, and (2) GOOD activation functions, such as Sigmoid and SiLU, exhibit a crucial “non-collapsing” property, mapping distinct inputs to distinct outputs unless all combining coefficients are zero. Together, these properties ensure that features can evolve significantly during training while preserving their diversity, a fundamental distinction from the NTK parametrization, where features remain near their initialization.
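The order in which the four steps visit the Gaussian terms can be sketched as a simple enumeration. This is only a schematic of the proof order described in the list, not part of the proof itself. It assumes that iterations are indexed from $1$, that the four steps are applied once per iteration, and that Step 4 proceeds from layer $L-1$ down to layer $2$; the function name `induction_schedule` is hypothetical.

```python
from typing import Iterator, Tuple

def induction_schedule(num_layers: int, num_steps: int) -> Iterator[Tuple[int, str, int]]:
    """Yield (iteration s, kind, layer l) in the order the four steps visit them.

    kind is "forward" for the Gaussian term Z-hat^{W_0^l delta x_s^{l-1}} and
    "backward" for Z-hat^{W_0^{l T} dh_s^l}; layers range over 2..L as in (5.5).
    """
    for s in range(1, num_steps + 1):
        # Steps 1-2: forward terms, from the first hidden layer (l = 2) up to l = L.
        for l in range(2, num_layers + 1):
            yield (s, "forward", l)
        # Steps 3-4: backward terms, from the last layer (l = L) down to l = 2.
        for l in range(num_layers, 1, -1):
            yield (s, "backward", l)

for item in induction_schedule(num_layers=3, num_steps=2):
    print(item)
```

Under these assumptions, the forward entries of iteration $s$ correspond to the terms added when passing from $\mathcal{F}_{s-1}$ to $\mathcal{G}_{s-1}$, and the backward entries to those added when passing from $\mathcal{G}_{s-1}$ to $\mathcal{F}_{s}$.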

Now, we revisit the key technical challenges introduced at the beginning of this section and demonstrate how our framework addresses them.

Feature Evolution vs. Structural Stability: Unlike NTK, where features remain close to their initialization, $\mu$P enables substantial feature evolution. Our framework ensures that new randomness, represented by Gaussian features such as $\widehat{Z}^{W_{0}^{l}\delta x_{t+1}^{l-1}(\xi_{i})}$, enters the system with a well-defined structure that preserves non-degeneracy. This structured evolution prevents feature collapse while allowing representations to adapt dynamically, ensuring both expressivity and stability throughout training.

Cross-Layer Coupling: The interplay between layers introduces dependencies that can destabilize training. By leveraging a two-level filtration structure $\mathcal{F}_{t}$ and $\mathcal{G}_{t}$, our framework tracks both forward propagation $Z^{h^{l}_{s}}$ and backward propagation $Z^{dx_{s}^{l}}$, ensuring that updates in one layer do not collapse the feature space of others. This structure maintains well-defined covariance relationships across layers, allowing $\mu$P to support both deep feature learning and global convergence, distinguishing it from NTK and standard parametrizations.

6 Conclusion and Future Work

In this work, we establish a fundamental theoretical result: deep neural networks under $\mu$P parametrization can simultaneously achieve meaningful feature learning while preserving feature non-degeneracy. Through a rigorous analysis of Gaussian processes and their covariance structures, we show that features not only remain linearly independent throughout training but also undergo substantial evolution from their initialization. This provides insight into a fundamental question in deep learning theory: how neural networks can simultaneously learn expressive representations and achieve global convergence.

Our analysis establishes fundamental connections between covariance preservation and feature richness. By preventing feature degeneracy, our framework provides a rigorous foundation for understanding how overparameterized networks learn expressive representations. Moreover, our results highlight the crucial role of parametrization in enabling both stable training and meaningful feature evolution. These insights into how $\mu$P enables both feature learning and global convergence suggest promising directions for bridging the gap between theory and practical deep learning success.

Several promising directions for future work emerge from our analysis. First, extending our theoretical framework to transformer architectures, particularly the attention mechanism, would be valuable for understanding feature learning in modern language models. Second, our analysis of structural invariants could provide new perspectives on convergence rates beyond just global convergence, potentially informing optimization strategies in deep learning. Third, studying how our insights on feature non-degeneracy influence generalization bounds may yield deeper theoretical foundations for understanding the generalization properties of deep neural networks. Finally, exploring how $\mu$P interacts with more complex training paradigms, such as fine-tuning and self-supervised learning, could further enhance our understanding of deep network training dynamics in practical settings.

Appendix A Experimental Details

Experimental details for Figures 1 and 2. We conduct experiments using 3-layer MLPs with input dimension $n_{0}=3072$ (flattened CIFAR-10 images), three hidden layers of equal width $n_{1}=n_{2}=n_{3}=n$ with $n$ varying over $\{8,16,32,64,128,256,512,1024,2048,4096\}$, and output dimension 1. In each experiment, we use only 10 samples randomly selected from the airplane and automobile classes of CIFAR-10. All networks use SiLU activation functions and are trained on a binary classification task with $\pm 1$ targets for 1000 steps. We use a global learning rate $\eta=0.1$ across all parametrization schemes, with $10$ runs for each width setting. The learning rate $\eta=0.1$ is chosen to ensure stable training across parametrizations.

We implement the following parametrization schemes:

Standard Parametrization (SP):

$$\sigma_{\ell}=\sqrt{\frac{2}{n_{\ell-1}}};\quad\eta_{\ell}=\eta\cdot\frac{1}{n}$$

at all layers, with $\eta=0.1$.

Neural Tangent Parametrization (NTP):

$$\sigma_{\ell}=\sqrt{\frac{2}{n_{\ell-1}}};\quad\eta_{\ell}=\eta\cdot\frac{1}{n_{\ell-1}}$$

at all layers, with $\eta=0.1$.

Integrable Parametrization (IP): For initialization variances:

$$\sigma_{1}=\sqrt{\frac{2}{d}};\quad\sigma_{\ell}=\frac{\sqrt{2}}{n},\quad\ell\geq 2.$$

For learning rates:

$$\eta_{1}=\eta\cdot\frac{n}{d};\quad\eta_{2}=\eta_{3}=\eta;\quad\eta_{4}=\eta\cdot\frac{1}{n}.$$

Maximal Update Parametrization ($\mu$P): For initialization variances:

$$\sigma_{1}=\sqrt{\frac{2}{d}};\quad\sigma_{2}=\sigma_{3}=\sqrt{\frac{2}{n}};\quad\sigma_{4}=\frac{\sqrt{2}}{n}.$$

For learning rates:

\[ \eta_1 = \eta \cdot \frac{n}{d}; \quad \eta_2 = \eta_3 = \eta; \quad \eta_4 = \eta \cdot \frac{1}{n}. \]
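As a minimal sketch, the two sets of per-layer scalings above can be collected in a small helper. The function name `layer_hyperparams` and its return layout are hypothetical, and the sketch simply returns the $\sigma_\ell$ and $\eta_\ell$ values as written above for the four weight matrices, without assuming how a training script consumes them.

```python
import math

def layer_hyperparams(param, n, d, eta=0.1):
    """Return (sigmas, lrs) for the four weight matrices of a width-n network.

    `param` is "ip" or "mup"; `d` is the input dimension. The formulas are
    transcribed from the text; the function itself is a hypothetical helper.
    """
    sigma_in = math.sqrt(2.0 / d)
    if param == "ip":
        # IP: every layer after the first uses sqrt(2)/n.
        sigmas = [sigma_in] + [math.sqrt(2.0) / n] * 3
    elif param == "mup":
        # muP: hidden layers use sqrt(2/n); only the output layer uses sqrt(2)/n.
        sigmas = [sigma_in, math.sqrt(2.0 / n), math.sqrt(2.0 / n), math.sqrt(2.0) / n]
    else:
        raise ValueError(f"unknown parametrization: {param!r}")
    # The learning-rate scaling is stated identically for both parametrizations.
    lrs = [eta * n / d, eta, eta, eta / n]
    return sigmas, lrs

sigmas, lrs = layer_hyperparams("mup", n=256, d=64)
print(sigmas, lrs)
```

Under this reading, the only difference between the two settings at the level of these formulas is the initialization scale of the two hidden-to-hidden layers.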

Networks are trained with batch size 10 for 1000 steps, which is sufficient for all widths to reach stable feature representations with training loss below $0.05$. For each width configuration, we conduct $10$ independent trials with different random seeds ($42 \sim 53$) and report the mean values in Figures 1 and 2. To quantify feature properties, we measure two metrics:

  • Feature change: $\|h(x) - h^0(x)\|_2 / \|h^0(x)\|_2$, where $h^0$ represents features at initialization

  • Feature diversity: minimum eigenvalue of the Gram matrix $K_{ij} = \langle h(x_i), h(x_j) \rangle$ computed over batch samples

These measurements allow us to track both the evolution of features from their initialization state and the maintenance of feature richness throughout training.
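The two metrics can be sketched in NumPy as follows. This is an illustrative sketch rather than the code used for the experiments: the function names are hypothetical, the random arrays are stand-ins for real features, and the feature change is computed per sample because the aggregation over the batch is not specified here.

```python
import numpy as np

def feature_change(h, h0):
    """Relative feature change ||h - h0||_2 / ||h0||_2 for one feature vector."""
    return np.linalg.norm(h - h0) / np.linalg.norm(h0)

def feature_diversity(H):
    """Minimum eigenvalue of the Gram matrix K_ij = <h(x_i), h(x_j)>.

    H has shape (batch, width), one feature vector per row.
    """
    K = H @ H.T
    # eigvalsh handles symmetric matrices and returns eigenvalues in ascending order.
    return np.linalg.eigvalsh(K)[0]

rng = np.random.default_rng(0)
H0 = rng.standard_normal((10, 64))             # stand-in for features at initialization
H = H0 + 0.5 * rng.standard_normal((10, 64))   # stand-in for features after training
print(feature_change(H[0], H0[0]), feature_diversity(H))
```

A minimum eigenvalue close to zero indicates that the feature vectors in the batch are nearly linearly dependent.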

A.1 Additional Results on Activation Functions

To further examine the impact of activation functions, we conduct experiments with Tanh and ReLU under the same settings as described in Section A. Below, we present the feature evolution and diversity results for these activations.

Figure 3: Feature learning behavior for Tanh activation. Left: Feature change ($\|h(x) - h^0(x)\|_2 / \|h^0(x)\|_2$). Middle: Feature diversity (minimum eigenvalue of Gram matrix). Right: Minimum eigenvalue analysis by concatenating initial and final features.
Figure 4: Feature learning behavior for ReLU activation. Left: Feature change ($\|h(x) - h^0(x)\|_2 / \|h^0(x)\|_2$). Middle: Feature diversity (minimum eigenvalue of Gram matrix). Right: Minimum eigenvalue analysis by concatenating initial and final features.

Our theoretical analysis fully explains the feature learning behavior of Tanh networks, as confirmed by our experimental results in Figure 3. Tanh enables meaningful feature learning, while feature diversity decreases gradually as the width increases. This is consistent with our theoretical predictions, which account for the smooth and bounded nature of the Tanh activation.

While our theoretical analysis does not directly apply to ReLU due to its non-smooth nature, our experimental results in Figure 4 indicate that networks with ReLU activation still exhibit feature learning and maintain meaningful representations under Maximal Update Parametrization ($\mu$P). One possible explanation is that, although ReLU does not satisfy the smoothness assumptions required in our analysis, its piecewise linear structure still allows for non-trivial feature evolution in practice. Moreover, $\mu$P ensures that weight updates are appropriately scaled across layers, preventing degenerate training dynamics that could otherwise hinder learning in deep networks. Understanding the precise mechanisms behind ReLU's feature learning in the infinite-width setting remains an important direction for future theoretical work.

Appendix B More Details for $\mu$P Parametrization

Formally, the MLP definition (Yang and Hu, 2020, Table 1) in this section is

\[ h^1 = W^1 \xi \in \mathbb{R}^n, \quad x^l = \phi(h^l) \in \mathbb{R}^n, \quad h^{l+1} = W^{l+1} x^l \in \mathbb{R}^n, \quad f(\xi) = W^{L+1} x^L, \tag{B.1} \]

where $L > 1$ is an integer and $l \in \{1, \dots, L-1\}$. The $\mu$P for this $L$-hidden-layer MLP is then defined as follows (Yang and Hu, 2020).

  1. Initial weight matrices in the middle layers: $W_0^2, \ldots, W_0^L$, with each coordinate $(W_0^l)_{\alpha\beta} \sim \mathcal{N}(0, 1/n)$.

  2. Initial weight matrices in the input and output layers: input layer matrix $W_0^1 \in \mathbb{R}^{n \times d}$ and output layer matrix $\widehat{W}_0^{L+1} \coloneqq W_0^{L+1} n \in \mathbb{R}^{1 \times n}$, with each coordinate $(W_0^1)_{\alpha\beta}, (\widehat{W}_0^{L+1})_{\alpha\beta} \sim \mathcal{N}(0, 1)$.

  3. Initial model outputs: we define the scalars $f_0(\xi) \coloneqq W_0^{L+1} x_0^L(\xi)$ for any input $\xi$.
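The initialization above and the forward pass (B.1) can be sketched at finite width as follows. The helper names `init_mup` and `forward` are hypothetical, $\phi = \tanh$ is chosen only as an example of a smooth activation, and the sketch assumes $x^L = \phi(h^L)$ for the last hidden layer.

```python
import numpy as np

def init_mup(n, d, L, rng):
    """Sample initial weights for an L-hidden-layer MLP of width n."""
    W = {1: rng.standard_normal((n, d))}                  # input layer: N(0, 1)
    for l in range(2, L + 1):
        W[l] = rng.standard_normal((n, n)) / np.sqrt(n)   # middle layers: N(0, 1/n)
    W_hat = rng.standard_normal((1, n))                   # \hat{W}^{L+1}: N(0, 1)
    W[L + 1] = W_hat / n                                  # because \hat{W}^{L+1} = n W^{L+1}
    return W

def forward(W, xi, L, phi=np.tanh):
    """Forward pass of (B.1); returns the scalar f(xi) and the features x^L."""
    h = W[1] @ xi                      # h^1
    for l in range(1, L):
        x = phi(h)                     # x^l
        h = W[l + 1] @ x               # h^{l+1}
    x = phi(h)                         # x^L (assumed to follow the same rule)
    return (W[L + 1] @ x).item(), x

rng = np.random.default_rng(0)
W = init_mup(n=512, d=8, L=3, rng=rng)
f0, xL = forward(W, np.ones(8), L=3)
print(f0)
```

Because the output weights carry an extra factor of $1/n$, the initial output $f_0(\xi)$ in this sketch shrinks as the width grows.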

Assuming the same Assumption 4.3 for $\phi$, we can characterize the $Z$ variables in the infinite-width training dynamics of SGD for this $L$-hidden-layer MLP similarly as follows (Yang and Hu, 2020).

  1. For $z \in \{x^l, h^l\}_l$, we have

    \[ Z^{z_t(\xi)} = Z^{z_0(\xi)} + Z^{\delta z_1(\xi)} + \cdots + Z^{\delta z_t(\xi)} \tag{B.2} \]
  2. For $l \in [L]$, $x = x^l$, $h = h^l$, we have

    \[ Z^{\delta x_t(\xi)} = \phi(Z^{h_t(\xi)}) - \phi(Z^{h_{t-1}(\xi)}). \tag{B.3} \]
  3. For $h = h^1$, we have

    \[ Z^{\delta h_t(\xi)} = -\sum_{i \in [m]} \eta \mathring{\chi}_{t-1,i} \xi_i^\top \xi Z^{dh_{t-1}(\xi_i)}. \]
  4. For $2 \leq l \leq L$, $h = h^l$, $x = x^{l-1}$, $W = W^l$, we have

    \[ Z^{\delta h_t(\xi)} = \widehat{Z}^{W_0 \delta x_t(\xi)} + \dot{Z}^{W_0 \delta x_t(\xi)} - \eta \sum_{s=0}^{t-1} \sum_{i \in [m]} \mathring{\chi}_{s,i} Z^{dh_s(\xi_i)} \mathbb{E} Z^{x_s(\xi_i)} Z^{x_t(\xi)} \]

    where

    \[ \dot{Z}^{W_0 \delta x_t(\xi)} = \sum_{i \in [m]} \sum_{s=0}^{t-1} Z^{dh_s(\xi_i)} \mathbb{E} \frac{\partial Z^{\delta x_t(\xi)}}{\partial \widehat{Z}^{W_0^\top dh_s(\xi_i)}}. \]
  5. For the last-layer weight:

    \[ Z^{\widehat{W}_t^{L+1}} = Z^{\widehat{W}_0^{L+1}} - \eta \sum_{s=0}^{t-1} \sum_{i \in [m]} \mathring{\chi}_{s,i} Z^{x_s^L(\xi_i)} \tag{B.4} \]
  6. The output deltas have limits

    \[ \delta\mathring{f}_t(\xi) = \mathbb{E} Z^{\delta W_t^{L+1}} Z^{x_t^L(\xi)} + \mathbb{E} Z^{\widehat{W}_{t-1}^{L+1}} Z^{\delta x_t^L(\xi)} \tag{B.5} \]

    and

    \[ \mathring{f}_t(\xi) = \delta\mathring{f}_1(\xi) + \cdots + \delta\mathring{f}_t(\xi). \]
  7. For gradients:

    \begin{align}
    Z^{dx_t^L(\xi)} &= Z^{\widehat{W}_t^{L+1}} \tag{B.6} \\
    Z^{dh_t^l(\xi)} &= Z^{dx_t^l(\xi)} \phi'(Z^{h_t^l(\xi)}) \tag{B.7} \\
    Z^{dx_t^{l-1}(\xi)} &= \widehat{Z}^{W_0^{l\top} dh_t^l(\xi)} + \dot{Z}^{W_0^{l\top} dh_t^l(\xi)} - \eta \sum_{s=0}^{t-1} \sum_{i \in [m]} \mathring{\chi}_{s,i} Z^{x_s^{l-1}(\xi_i)} \mathbb{E} Z^{dh_s^l(\xi_i)} Z^{dh_t^l(\xi)} \tag{B.8}
    \end{align}

    where

    \[ \dot{Z}^{W_0^{l\top} dh_t^l(\xi)} = \sum_{i \in [m]} \sum_{s=0}^{t-1} Z^{x_s^{l-1}(\xi_i)} \mathbb{E} \frac{\partial Z^{dh_t^l(\xi)}}{\partial \widehat{Z}^{W_0^l x_s^{l-1}(\xi_i)}}. \]
  8. Loss derivative:

    \[ \mathring{\chi}_{t,i} = \mathcal{L}'(\mathring{f}_t, y_i, t, i) = \mathcal{L}'(\mathring{f}_t, y_i) \mathds{1}\{i \in \mathcal{B}_t\}. \]
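To make the role of the batch indicator in the loss derivative concrete, here is a small sketch. It assumes the squared loss $\mathcal{L}(f, y) = \frac{1}{2}(f - y)^2$, so that $\mathcal{L}'(f, y) = f - y$; the loss is not fixed in this appendix, and the function name `loss_derivatives` is hypothetical.

```python
def loss_derivatives(f, y, batch):
    """Return chi_{t,i} for all m training points at one step.

    f[i] is the current output on the i-th training input, y[i] its label,
    and `batch` holds the indices in the minibatch B_t.
    Assumes squared loss, so L'(f, y) = f - y (an illustrative choice).
    """
    batch = set(batch)
    # Points outside the minibatch get a zero derivative via the indicator.
    return [(fi - yi) if i in batch else 0.0
            for i, (fi, yi) in enumerate(zip(f, y))]

chi = loss_derivatives(f=[0.5, -1.0, 2.0], y=[1.0, 1.0, 1.0], batch=[0, 2])
print(chi)
```

Only the samples in $\mathcal{B}_t$ contribute to the sums over $i \in [m]$ in the update equations at step $t$.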

Intuition behind the entanglement term for the two-hidden-layer case. The inclusion of $W^\top$ in the backward pass substantially increases the complexity of the system by introducing multiplications between $W$ and certain nonlinear transformations of $W^\top$ in the forward pass, which necessitates the involved definitions of $\dot{Z}^{W x_t(\xi)}$ and $\dot{Z}^{W^\top d\bar{h}_t(\xi)}$. Since they are all "conditioned out" in our analysis, we only showcase the definition

\[ \dot{Z}^{W x_t(\xi)} = \sum_{j=1}^{m} \sum_{r=0}^{t-1} \theta_{r,j} Z^{d\bar{h}_r(\xi_j)} \]

to give a sense of the entanglement between $W$ and $W^\top$, where $\theta_{r,j}$ is calculated as follows. By definition, $Z^{x_t(\xi)}$ is constructed as

\[ Z^{x_t(\xi)} = \Phi\left(\widehat{Z}^{W^\top d\bar{h}_0(\xi_1)}, \dots, \widehat{Z}^{W^\top d\bar{h}_0(\xi_m)}, \ldots, \widehat{Z}^{W^\top d\bar{h}_{t-1}(\xi_1)}, \dots, \widehat{Z}^{W^\top d\bar{h}_{t-1}(\xi_m)}, Z^{U_0}\right) \]

for some function $\Phi : \mathbb{R}^{mt+1} \to \mathbb{R}$. Then

\[ \theta_{r,j} = \mathbb{E}\left[ \frac{\partial \Phi\left(\widehat{Z}^{W^\top d\bar{h}_0(\xi_1)}, \dots, \widehat{Z}^{W^\top d\bar{h}_0(\xi_m)}, \ldots, \widehat{Z}^{W^\top d\bar{h}_{t-1}(\xi_1)}, \dots, \widehat{Z}^{W^\top d\bar{h}_{t-1}(\xi_m)}, Z^{U_0}\right)}{\partial \widehat{Z}^{W^\top d\bar{h}_r(\xi_j)}} \right]. \]
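Coefficients of this expected-partial-derivative form are what Gaussian integration by parts (Stein's lemma) produces: for $Z \sim \mathcal{N}(0, \sigma^2)$ and smooth $\Phi$, $\mathbb{E}[Z \Phi(Z)] = \sigma^2 \mathbb{E}[\Phi'(Z)]$. The sketch below checks this one-dimensional identity by Monte Carlo. It is a toy illustration, not the tensor program computation itself; $\Phi = \tanh$ and $\sigma = 1.5$ are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 1.5
z = sigma * rng.standard_normal(1_000_000)   # samples of Z ~ N(0, sigma^2)

def phi(u):
    return np.tanh(u)

def dphi(u):
    return 1.0 - np.tanh(u) ** 2             # derivative of tanh

lhs = np.mean(z * phi(z))                    # estimates E[Z * Phi(Z)]
rhs = sigma**2 * np.mean(dphi(z))            # estimates sigma^2 * E[Phi'(Z)]
print(lhs, rhs)                              # the two estimates should nearly agree
```

The correlation between a Gaussian input and a nonlinear function of it is thus captured by an expected derivative, which is the same shape of quantity as $\theta_{r,j}$.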

Appendix C Proof of Theorem 4.5

We begin by describing three key lemmas, each highlighting a crucial aspect of our subsequent proof.

Lemma C.1.

Suppose random variables $\{u_k\}_{k\in[K]}$ and $\{v_k\}_{k\in[K]}$ satisfy $\mathbb{E}[u_i u_j]=\mathbb{E}[v_i v_j]$ for all $i,j$. Then

$$\sum_{k\in[K]}\alpha_k u_k\overset{a.s.}{=}0\;\Leftrightarrow\;\sum_{k\in[K]}\alpha_k v_k\overset{a.s.}{=}0.$$

Proof.

$\sum_{k\in[K]}\alpha_k u_k\overset{a.s.}{=}0$ implies $\mathbb{E}[(\sum_{k\in[K]}\alpha_k u_k)^2]=0$. Because $\{u_k\}_{k\in[K]}$ and $\{v_k\}_{k\in[K]}$ share the same second-moment matrix, we have that $\mathbb{E}[(\sum_{k\in[K]}\alpha_k v_k)^2]=0$, and hence $\sum_{k\in[K]}\alpha_k v_k\overset{a.s.}{=}0$. The converse follows by exchanging the roles of $u$ and $v$. ∎
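The mechanism behind Lemma C.1 is that $\mathbb{E}[(\sum_k \alpha_k u_k)^2]=\alpha^\top G\alpha$ depends on the family only through its second-moment matrix $G$. The following is a minimal numerical sketch of that identity, not part of the original proof. The two families `U` and `V`, the rotation angle, and all function names are hypothetical choices made for illustration: each variable is represented by its coefficient row over two independent standard Gaussians.

```python
import math


def gram(rows):
    """Second-moment matrix E[u_i u_j] when u_i = <rows[i], e>, e i.i.d. N(0, 1)."""
    return [[sum(a * b for a, b in zip(r, s)) for s in rows] for r in rows]


def second_moment(alpha, G):
    """E[(sum_k alpha_k u_k)^2] = alpha^T G alpha."""
    n = len(alpha)
    return sum(alpha[i] * G[i][j] * alpha[j] for i in range(n) for j in range(n))


def rotate(rows, angle):
    """Rotate every coefficient row in the plane; inner products are preserved."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c * x - s * y, s * x + c * y] for x, y in rows]


# Hypothetical family u: u_1 = e_1, u_2 = e_2, u_3 = e_1 + e_2.
U = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Hypothetical family v: different random variables with the same second moments.
V = rotate(U, 0.7)

G_u, G_v = gram(U), gram(V)
alpha_null = [1.0, 1.0, -1.0]   # u_1 + u_2 - u_3 = 0 almost surely
alpha_other = [1.0, 0.0, 0.0]   # u_1 alone is not zero

for name, alpha in [("null", alpha_null), ("other", alpha_other)]:
    print(name, second_moment(alpha, G_u), second_moment(alpha, G_v))
```

Up to floating-point error, a coefficient vector that annihilates one family yields a vanishing second moment for the other as well, which is exactly the equivalence the lemma states.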

Lemma C.2.

Suppose any level set of $\phi:\mathbb{R}\to\mathbb{R}$ is countable and $g_1,\ldots,g_K$ are jointly non-degenerate Gaussian. If $\mathbb{P}\big(\sum_i a_i\phi(g_i)=C\big)>0$ where $C$ is a constant, then $a_i=0$ for all $i$ and $C=0$.

Proof.

The event $\{\mathbb{P}\big(\sum_i a_i\phi(g_i)=C \mid g_2,\ldots,g_K\big)>0\}$ has positive probability only if $a_1=0$, because $g_1\mid g_2,\ldots,g_K$ is a non-degenerate Gaussian random variable. Applying the same reasoning inductively to the remaining variables, we conclude that $a_i=0$ for all $i\in[K]$, and hence $C=0$. ∎
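The conditioning step in the proof of Lemma C.2 can be spelled out as follows. This is a sketch under the lemma's hypotheses; the symbols $y$ and $S$ are notation introduced only for this illustration.

```latex
% Fix a realization of (g_2, ..., g_K) and suppose a_1 \neq 0. Then
\[
\Big\{\sum_{i} a_i \phi(g_i) = C\Big\}
= \big\{\phi(g_1) = y\big\}
= \{ g_1 \in S \},
\qquad
y := \frac{C - \sum_{i \ge 2} a_i \phi(g_i)}{a_1},
\quad
S := \phi^{-1}(\{y\}).
\]
% S is a level set of \phi, hence countable. Conditionally on (g_2, ..., g_K),
% g_1 is a non-degenerate Gaussian and therefore has a density, so
\[
\mathbb{P}\big(g_1 \in S \,\big|\, g_2, \ldots, g_K\big)
= \sum_{x \in S} \mathbb{P}\big(g_1 = x \,\big|\, g_2, \ldots, g_K\big) = 0 .
\]
```

Hence the conditional probability can be positive only when $a_1=0$, which is the claim used to start the induction.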

Lemma C.3.

Suppose $\phi$ satisfies Assumption 4.3. Moreover, suppose $g_1,\ldots,g_K$ are jointly non-degenerate Gaussian. If $\big(c_1+\sum_i a_i\phi(g_i)\big)\cdot\big(c_2+\sum_i b_i\phi'(g_i)\big)=C$, where $c_1,c_2,C$ are constants, then $C=0$ and either $a_i=0$ for all $i\in[K]$ or $b_i=0$ for all $i\in[K]$.

Remark C.4.

Considering the function tails, it is easy to prove that the Sigmoid function $\sigma(x)=\frac{1}{1+\exp(-x)}$, the smoothed ReLU function $\overline{\text{ReLU}}(x)=\log(1+\exp(x))=\int\sigma(x)\,dx$, and the SiLU (Sigmoid Linear Unit) function, defined as $\text{SiLU}(x)=x\cdot\sigma(x)$, all satisfy these assumptions. Notably, SiLU is employed in state-of-the-art open-source foundation models (Touvron et al., 2023a, b).
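For concreteness, the three activations named in Remark C.4 can be written down directly. The sketch below uses only the Python standard library and numerically checks the identity $\overline{\text{ReLU}}'(x)=\sigma(x)$ stated in the remark. It does not verify Assumption 4.3 itself, and the helper names and test points are illustrative choices.

```python
import math


def sigmoid(x):
    """Sigmoid: 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def softplus(x):
    """Smoothed ReLU: log(1 + exp(x)), written to avoid overflow for large |x|."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def silu(x):
    """SiLU: x * sigmoid(x)."""
    return x * sigmoid(x)


def central_difference(f, x, h=1e-5):
    """Symmetric finite-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


points = [-3.0, -1.0, 0.0, 0.5, 2.0]
for x in points:
    gap = abs(central_difference(softplus, x) - sigmoid(x))
    print(f"x={x:+.1f}  softplus={softplus(x):.6f}  silu={silu(x):+.6f}  "
          f"|softplus' - sigmoid|={gap:.2e}")
```

The printed gap stays at the level of finite-difference error, consistent with the smoothed ReLU being an antiderivative of the Sigmoid.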

Proof.

We first prove that $C=0$. Condition on the random variables $g_2,\ldots,g_K$ and denote $g:=g_1\mid(g_2,\ldots,g_K)$. Then $g$ is a non-degenerate univariate Gaussian.

Case 1: Suppose $C\neq 0$. In this scenario, $a_1,b_1$ cannot be zero; otherwise $\phi$ (or $\phi'$) would have an uncountable level set, contradicting our assumptions. Given that $(c_1'+a_1\phi(g))(c_2'+b_1\phi'(g))=C$ almost surely, we may rewrite:

$$\bigg(\frac{c_1'}{a_1}+\phi(g)\bigg)\bigg(\frac{c_2'}{b_1}+\phi'(g)\bigg)=\frac{C}{a_1b_1}.$$

Here $c_1',c_2'$ are constants (absorbing the conditioning on $g_2,\dots,g_K$). But this implies that for almost all $x\in\mathbb{R}$, $\big(\frac{c_1'}{a_1}+\phi(x)\big)\big(\frac{c_2'}{b_1}+\phi'(x)\big)=\frac{C}{a_1b_1}$, which contradicts the assumption on the activation function. Hence $C$ must be zero.

Case 2: Now consider $C=0$. Then at least one of the following holds with positive probability:

$$c_1+\sum_i a_i\phi(g_i)=0\quad\text{or}\quad c_2+\sum_i b_i\phi'(g_i)=0.$$

In either case, Lemma C.2 forces the corresponding coefficients to vanish: because the level sets are at most countable, the sum can equal a constant with positive probability only if all of its coefficients are zero. Concretely, if

$$\mathbb{P}\bigg(\sum_i a_i\phi(g_i)=C\,\Big|\,g_2,\ldots,g_K\bigg)>0,$$

then for those realizations we view $g_1$ (conditioned on $g_2,\dots,g_K$) as a non-degenerate univariate Gaussian. Holding $g_2,\dots,g_K$ fixed, the only way $\sum_i a_i\phi(g_i)$ can remain a constant over a positive-measure set of $g_1$ values is if $a_1=0$. Repeating this argument inductively for $g_2,g_3,\dots$ shows that $a_i=0$ for all $i\in[K]$; the argument for the $b_i$ is identical with $\phi'$ in place of $\phi$. Therefore, either all $a_i$ vanish or all $b_i$ vanish, completing the proof of this lemma. ∎

With these lemmas at hand, we now prove our main theorem by an inductive argument. In particular, we show that the following two families of Gaussian processes, introduced in Section 4, remain non-degenerate throughout training:

$$\{\widehat{Z}^{W_0^l\delta x_s^{l-1}(\xi_i)}\}_{i\in[m],\,s\in[t],\,2\leq l\leq L}, \tag{C.1}$$
$$\{\widehat{Z}^{W_0^{l\top}dh_s^l(\xi_i)}\}_{i\in[m],\,s\in[t],\,2\leq l\leq L}. \tag{C.2}$$

Recall that a Gaussian process is non-degenerate if its covariance matrix $C$ at any finite collection of points satisfies $\det(C)\neq 0$ (Adler and Taylor, 2009). Using the filtration framework introduced in Section 4, our proof follows the natural flow of computation in the network, proceeding layer by layer and separately handling forward and backward passes. We break this into four key steps, each building upon the results of previous steps:

  • Step 1: prove non-degeneracy for the features in the first hidden layer $\widehat{Z}^{W_0^2\delta x_s^1(\xi_i)}$. This forms our base case as it only depends on the input data and network initialization, providing the foundation for our inductive argument.

  • Step 2: prove non-degeneracy for the features in remaining layers $\widehat{Z}^{W_0^l\delta x_s^{l-1}(\xi_i)}$, $3\leq l\leq L$. This step leverages the non-degeneracy established in Step 1 and shows how it propagates through deeper layers of the network.

  • Step 3: prove non-degeneracy for the gradients in the last layer $\widehat{Z}^{W_0^{L\top}dh_s^L(\xi_i)}$. Here we transition from analyzing forward features to backward gradients, showing how the established feature properties ensure meaningful gradient flow.

  • Step 4: prove non-degeneracy for the gradients in remaining layers $\widehat{Z}^{W_0^{l\top}dh_s^l(\xi_i)}$, $2\leq l\leq L-1$. Finally, we complete our analysis by showing how gradient non-degeneracy propagates backward through the network, ensuring effective training dynamics at all layers.

The proof proceeds by induction on the time step $t$, where at each step we verify that these properties hold across all layers. This structure lets us establish the global property of non-degeneracy by tracking local changes at each layer and time step as information flows both forward and backward through the network during training. We now proceed with the detailed proof.
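The non-degeneracy criterion recalled above, $\det(C)\neq 0$ for the covariance matrix of a finite Gaussian family, can be checked numerically on toy examples. The sketch below is only an illustration of the definition, not a step of the proof. The two families, the tolerance, and the function names are hypothetical, and each Gaussian is represented by its coefficient row over independent standard Gaussians.

```python
def gram(rows):
    """Covariance matrix of zero-mean Gaussians z_i = <rows[i], e>, e i.i.d. N(0, 1)."""
    return [[sum(a * b for a, b in zip(r, s)) for s in rows] for r in rows]


def determinant(matrix, tol=1e-12):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    n = len(a)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < tol:
            return 0.0  # no usable pivot: the matrix is numerically singular
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det
        det *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return det


def is_nondegenerate(rows, tol=1e-12):
    """True when the covariance matrix of the family has a nonzero determinant."""
    return abs(determinant(gram(rows), tol)) > tol


# Hypothetical degenerate family: z_3 = z_1 + z_2.
degenerate = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0]]
# Hypothetical non-degenerate family: z_3 carries an independent component.
nondegenerate = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 1.0]]

print(is_nondegenerate(degenerate), is_nondegenerate(nondegenerate))
```

In the proof, the same criterion is applied to the families (C.1) and (C.2) at each time step rather than to hand-picked coefficient rows.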

Proof of Theorem 4.5.

Considering Trajectory Until Error Signals Vanish. Throughout this proof, we focus on the training trajectory up to the time when all error signals $\mathring{\chi}_{t,i}$ become zero. This is because once the error signals vanish, there are no further parameter updates, and the training dynamics remain static thereafter. Our analysis ensures that up to this point, the Gaussian processes governing the feature and gradient updates remain non-degenerate, thereby maintaining the linear independence of features across all layers.

Connecting $\widehat{Z}^{W\delta x}$ to $h^l$ and $x^l$. Recall from Section 3 that each pre-activation $h^l(\xi)$ and post-activation $x^l(\xi)$ can be decomposed into a primary Gaussian increment plus lower-order (history-dependent) terms in the infinite-width limit.

Because these additional terms do not alter the essential covariance structure when conditioned on past information (they vanish or become deterministic in the limit), the linear (in)dependence of $\{h^l(\xi)\}$ or $\{x^l(\xi)\}$ is governed by the non-degeneracy of $\{\widehat{Z}^{W_0^l\,\delta x_s^{l-1}(\xi)}\}$. Hence, showing that $\{\widehat{Z}^{W_0^l\,\delta x_s^{l-1}(\xi)}\}$ remains non-degenerate under conditioning on historical variables directly implies that $\{h^l(\xi)\}$ and $\{x^l(\xi)\}$ cannot collapse into a linearly dependent set.

Below, we provide an inductive argument to establish precisely this non-degeneracy at each step.

By definition, when $t=0$, $\{\widehat{Z}^{W_0^l\delta x_0^{l-1}(\xi_i)}\}$, $i\in[m]$, $2\leq l\leq L$, are independent and therefore non-degenerate Gaussian.

Now assume that the Gaussian process features defined in (C.1) and (C.2) are non-degenerate at time $t$, specifically for

$$\{\widehat{Z}^{W_0^l\delta x_s^{l-1}(\xi_i)}\}_{i\in[m],s\in[t]},\quad\{\widehat{Z}^{W_0^{l\top}dh_s^l(\xi_i)}\}_{i\in[m],s\in[t]},$$

where $2\leq l\leq L$.

Step 1: We first prove that $\{\widehat{Z}^{W_0^2\delta x_s^1(\xi_i)}\}_{i\in[m],s\in[t+1]}$ is non-degenerate. Suppose there exist $\{\lambda_{i,s}\}_{i\in[m],s\in[t+1]}$, not all zero, such that

$$\sum_{i\in[m],s\in[t+1]}\lambda_{i,s}\widehat{Z}^{W_0^2\delta x_s^1(\xi_i)}\overset{\text{a.s.}}{=}0.$$

Since $\{\widehat{Z}^{W_0^2\delta x_s^1(\xi_i)}\}_{i\in[m],s\in[t]}$ are non-degenerate, we conclude that $\{\lambda_{i,t+1}\}_{i\in[m]}$ are not all zero. Considering the second moment, we have that

$$\mathbb{E}\bigg[\bigg(\sum_{i\in[m],s\in[t+1]}\lambda_{i,s}\widehat{Z}^{W_0^2\delta x_s^1(\xi_i)}\bigg)^2\bigg]=0.$$

Because $\{\widehat{Z}^{W_0^2\delta x_s^1(\xi_i)}\}$ shares the same covariance matrix with $\{Z^{\delta x_s^1(\xi_i)}\}$, Lemma C.1 gives

$$\mathbb{E}\bigg[\bigg(\sum_{i\in[m],s\in[t+1]}\lambda_{i,s}Z^{\delta x_s^1(\xi_i)}\bigg)^2\bigg]=0\;\Rightarrow\;\sum_{i\in[m],s\in[t+1]}\lambda_{i,s}Z^{\delta x_s^1(\xi_i)}\overset{a.s.}{=}0.$$

Therefore by definition we have that

i[m],s[t+1]λi,s(ϕ(Zhs1(ξi))ϕ(Zhs11(ξi)))=a.s.0,\displaystyle\sum_{i\in[m],s\in[t+1]}\lambda_{i,s}\Big{(}\phi(Z^{h^{1}_{s}(\xi% _{i})})-\phi(Z^{h^{1}_{s-1}(\xi_{i})})\Big{)}\overset{a.s.}{=}0,∑ start_POSTSUBSCRIPT italic_i ∈ [ italic_m ] , italic_s ∈ [ italic_t + 1 ] end_POSTSUBSCRIPT italic_λ start_POSTSUBSCRIPT italic_i , italic_s end_POSTSUBSCRIPT ( italic_ϕ ( italic_Z start_POSTSUPERSCRIPT italic_h start_POSTSUPERSCRIPT 1 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT ( italic_ξ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) end_POSTSUPERSCRIPT ) - italic_ϕ ( italic_Z start_POSTSUPERSCRIPT italic_h start_POSTSUPERSCRIPT 1 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_s - 1 end_POSTSUBSCRIPT ( italic_ξ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) end_POSTSUPERSCRIPT ) ) start_OVERACCENT italic_a . italic_s . end_OVERACCENT start_ARG = end_ARG 0 , (C.3)

where $Z^{h^{1}_{s}(\xi_{i})}$ satisfies that

$$Z^{h^{1}_{s}(\xi_{i})}=Z^{h^{1}_{0}(\xi_{i})}-\sum_{j\in[m]}\eta\mathring{\chi}_{0,j}\xi_{j}^{\top}\xi_{i}Z^{dh^{1}_{0}(\xi_{j})}-\cdots-\sum_{j\in[m]}\eta\mathring{\chi}_{s-1,j}\xi_{j}^{\top}\xi_{i}Z^{dh^{1}_{s-1}(\xi_{j})}.$$

Plugging (3.6) and (3.7) into the above equation further gives the following reformulation of $Z^{h^{1}_{s}(\xi_{i})}$:

$$Z^{h^{1}_{s}(\xi_{i})}=\Delta_{s}(\xi_{i})-\sum_{j\in[m]}\eta\mathring{\chi}_{s,j}\xi_{j}^{\top}\xi_{i}\phi^{\prime}(Z^{h_{s-1}^{1}(\xi_{j})})\widehat{Z}^{{W^{2}_{0}}^{\top}dh^{2}_{s}(\xi_{j})}, \tag{C.4}$$

where $\Delta_{s}(\xi_{i})\in\mathcal{G}_{s-1}$ is a random variable. Notice that $Z^{h_{s-1}^{1}(\xi_{j})}\in\mathcal{G}_{s-1}$ and $Z^{h_{s}^{1}(\xi_{j})}\in\mathcal{G}_{s}$.

At least one of the $\{\mathring{\chi}_{s,j}\}_{j\in[m]}$ is nonzero; without loss of generality, assume $\mathring{\chi}_{s,k}\neq 0$. By the induction hypothesis, the non-degenerate property holds at time $t$. Therefore, $Z^{{W^{2}_{0}}^{\top}dh^{2}_{s}(\xi_{k})}$ conditioned on $\mathcal{G}_{t-1}\cup\{Z^{{W^{2}_{0}}^{\top}dh^{2}_{s}(\xi_{j})}\}_{j\neq k}$ is a non-degenerate Gaussian.
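The conditioning step uses a standard fact about jointly Gaussian families: when the joint covariance matrix is non-singular, each coordinate conditioned on the others remains Gaussian with strictly positive variance. The following is a minimal Python sketch of that fact, using the Schur-complement identity $\mathrm{Var}(X_k \mid X_{-k}) = \det(\Sigma)/\det(\Sigma_{-k,-k})$. The helper names `det` and `conditional_variance` and both example matrices are illustrative assumptions, not objects from the paper.

```python
def det(mat):
    """Determinant by Laplace expansion along the first row (fine for tiny matrices)."""
    n = len(mat)
    if n == 1:
        return float(mat[0][0])
    total = 0.0
    for j in range(n):
        minor = [row[:j] + row[j + 1:] for row in mat[1:]]
        total += (-1) ** j * mat[0][j] * det(minor)
    return total


def conditional_variance(cov, k):
    """Variance of coordinate k of a Gaussian vector given all other coordinates.

    Uses det(cov) / det(cov without row and column k), which assumes the
    remaining coordinates are themselves non-degenerate.
    """
    n = len(cov)
    rest = [[cov[a][b] for b in range(n) if b != k] for a in range(n) if a != k]
    return det(cov) / det(rest)


# Hypothetical non-singular covariance: every conditional variance is positive.
cov_nondegenerate = [[2.0, 1.0, 0.0], [1.0, 2.0, 1.0], [0.0, 1.0, 2.0]]
# Hypothetical singular covariance: X0 = X1 + X2 with X1, X2 independent unit Gaussians.
cov_degenerate = [[2.0, 1.0, 1.0], [1.0, 1.0, 0.0], [1.0, 0.0, 1.0]]

print(conditional_variance(cov_nondegenerate, 0))  # strictly positive
print(conditional_variance(cov_degenerate, 0))     # zero: X0 is determined by the rest
```

In the singular example the first coordinate is a deterministic function of the other two, which is exactly the situation the non-degeneracy hypothesis rules out.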

Plugging (C.4) into (C.3) and then conditioning on $\mathcal{G}_{t-1}\cup\{Z^{{W^{2}_{0}}^{\top}dh^{2}_{s}(\xi_{j})}\}_{j\neq k}$ gives that

$$\sum_{i\in[m]}\lambda_{i,t+1}\phi(\xi_{k}^{\top}\xi_{i}Z^{U}+c_{i})=C,$$

where $c_{i}$ and $C$ are constants and $Z^{U}$ is a non-degenerate univariate Gaussian random variable. Since $\phi$ meets the conditions in Assumption 4.3 and the dataset fulfills Assumption 4.1, so that the inner products and level sets behave as required, we can conclude that $\lambda_{i,t+1}=0$ for all $i\in[m]$. This is a contradiction. Therefore, $\{\widehat{Z}^{W_{0}^{2}\delta x_{s}^{1}(\xi_{i})}\}_{i\in[m],s\in[t+1]}$ is indeed non-degenerate.

Step 2: We prove the following is non-degenerate.

$$\{\widehat{Z}^{W_{0}^{l}\delta x_{s}^{l-1}(\xi_{i})}\}_{i\in[m],s\in[t+1]},\quad l\geq 3.$$

Suppose there exist $\{\lambda_{i,s}\}_{i\in[m],s\in[t+1]}$, not all zero, such that

$$\sum_{i\in[m],s\in[t+1]}\lambda_{i,s}\widehat{Z}^{W^{l}_{0}\delta x^{l-1}_{s}(\xi_{i})}\overset{\text{a.s.}}{=}0.$$

Since $\{\widehat{Z}^{W_{0}^{l}\delta x_{s}^{l-1}(\xi_{i})}\}_{i\in[m],s\in[t]}$ are non-degenerate, we conclude that $\{\lambda_{i,t+1}\}_{i\in[m]}$ are not all zero. Considering the second moment, we have that

$$\mathbb{E}\bigg[\bigg(\sum_{i\in[m],s\in[t+1]}\lambda_{i,s}\widehat{Z}^{W_{0}^{l}\delta x^{l-1}_{s}(\xi_{i})}\bigg)^{2}\bigg]=0.$$

Because $\{\widehat{Z}^{W^{l}_{0}\delta x^{l-1}_{s}(\xi_{i})}\}$ shares the same covariance matrix as $\{Z^{\delta x^{l-1}_{s}(\xi_{i})}\}$, by Lemma C.1 we have that

$$\mathbb{E}\bigg[\bigg(\sum_{i\in[m],s\in[t+1]}\lambda_{i,s}Z^{\delta x^{l-1}_{s}(\xi_{i})}\bigg)^{2}\bigg]=0\;\Rightarrow\;\sum_{i\in[m],s\in[t+1]}\lambda_{i,s}Z^{\delta x^{l-1}_{s}(\xi_{i})}\overset{\text{a.s.}}{=}0.$$
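The transfer between the two families works because the second moment of a linear combination depends only on the matrix of pairwise second moments: $\mathbb{E}[(\sum_a \lambda_a X_a)^2]=\lambda^{\top}M\lambda$ with $M_{ab}=\mathbb{E}[X_aX_b]$. Two families sharing $M$ therefore have the same vanishing combinations, and a combination with zero second moment is almost surely zero. A small Python sketch of the quadratic form follows; the function name `second_moment`, the matrix `shared_gram`, and the coefficient vectors are hypothetical and chosen only for illustration.

```python
def second_moment(lam, gram):
    """Return lam^T gram lam, i.e. E[(sum_a lam[a] * X_a)^2] when gram[a][b] = E[X_a X_b]."""
    n = len(lam)
    return sum(lam[a] * gram[a][b] * lam[b] for a in range(n) for b in range(n))


# Hypothetical second-moment matrix shared by two families of random variables.
# It is singular: it is realized, for example, by X0 = X1 + X2 with X1, X2
# independent and of unit second moment.
shared_gram = [[2.0, 1.0, 1.0], [1.0, 1.0, 0.0], [1.0, 0.0, 1.0]]

lam_null = [1.0, -1.0, -1.0]    # encodes X0 - X1 - X2
lam_generic = [1.0, 1.0, 0.0]   # encodes X0 + X1

print(second_moment(lam_null, shared_gram))     # vanishes for every family with this matrix
print(second_moment(lam_generic, shared_gram))  # strictly positive
```

Any family with this second-moment matrix satisfies the same linear relation almost surely, which is the mechanism the proof uses to pass from the $\widehat{Z}$ variables to the $Z$ variables.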

Therefore by definition we have that

$$\sum_{i\in[m],s\in[t+1]}\lambda_{i,s}\big[\phi(Z^{h^{l-1}_{s}(\xi_{i})})-\phi(Z^{h^{l-1}_{s-1}(\xi_{i})})\big]\overset{\text{a.s.}}{=}0, \tag{C.5}$$

where $Z^{h^{l-1}_{s}}$ satisfies that

$$Z^{h^{l-1}_{s}(\xi_{i})}=Z^{h^{l-1}_{0}(\xi_{i})}+Z^{\delta h^{l-1}_{1}(\xi_{i})}+\cdots+Z^{\delta h^{l-1}_{s}(\xi_{i})}.$$

A reformulation of the above update rule further gives that,

$$Z^{h^{l-1}_{s}(\xi_{i})}=\Delta_{s}(\xi_{i})+\widehat{Z}^{W_{0}^{l-1}\delta x^{l-2}_{s}(\xi_{i})}, \tag{C.6}$$

where $\Delta_{s}(\xi_{i})\in\mathcal{F}_{s-1}$ is a random variable. Notice that $\widehat{Z}^{W_{0}^{l-1}\delta x^{l-2}_{s}(\xi_{i})},Z^{h^{l-1}_{s}(\xi_{i})}\in\mathcal{F}_{s}$. Pick an arbitrary index $k$. Because the induction hypothesis gives the non-degenerate property at time $t$ for all layers, and we have already proved the non-degenerate property at time $t+1$ for layer $l-1$, conditioning (C.5) on $\sigma\big(\mathcal{F}_{t}\cup\{\widehat{Z}^{W_{0}^{l-1}\delta x^{l-2}_{t+1}(\xi_{j})}\}_{j\neq k}\big)$ gives that

$$\lambda_{k,t+1}\phi(Z^{U_{k}}+c_{k})=C_{k},$$

where $U_{k}$ is the non-degenerate univariate Gaussian random variable $\widehat{Z}^{W_{0}^{l-1}\delta x^{l-2}_{t+1}(\xi_{k})}\,\big|\,\sigma\big(\mathcal{F}_{t}\cup\{\widehat{Z}^{W_{0}^{l-1}\delta x^{l-2}_{t+1}(\xi_{j})}\}_{j\neq k}\big)$, and $c_{k}$ and $C_{k}$ are constants. By Assumption 4.3 on the activation function, we know that $\lambda_{k,t+1}=0$ for arbitrary $k\in[m]$. This is a contradiction. Therefore, $\{\widehat{Z}^{W_{0}^{l}\delta x_{s}^{l-1}(\xi_{i})}\}_{i\in[m],s\in[t+1]}$ is indeed non-degenerate.
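The last step can be checked numerically in a simple case: if $\phi(Z+c)$ has strictly positive variance for a non-degenerate Gaussian $Z$, then $\lambda\phi(Z+c)$ can be almost surely constant only when $\lambda=0$. The Monte Carlo sketch below uses `math.tanh` purely as a stand-in activation; whether it matches Assumption 4.3 is not established here, and the function name `activation_variance`, the shift, the sample size, and the seed are all illustrative assumptions.

```python
import math
import random
import statistics


def activation_variance(phi, shift, scale, n_samples=20000, seed=0):
    """Empirical variance of phi(scale * Z + shift) for a standard Gaussian Z."""
    rng = random.Random(seed)
    values = [phi(scale * rng.gauss(0.0, 1.0) + shift) for _ in range(n_samples)]
    return statistics.pvariance(values)


# Non-degenerate Gaussian input: the activation output is genuinely random.
var_nondegenerate = activation_variance(math.tanh, shift=0.3, scale=1.0)
# Degenerate input (zero variance): the output is a constant, so its variance is zero.
var_degenerate = activation_variance(math.tanh, shift=0.3, scale=0.0)

print(var_nondegenerate)
print(var_degenerate)
```

Since $\mathrm{Var}(\lambda\phi(Z+c))=\lambda^{2}\,\mathrm{Var}(\phi(Z+c))$, a positive variance in the first case leaves $\lambda=0$ as the only way for the product to be constant.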

Step 3: We prove the following gradients are non-degenerate.

$$\{\widehat{Z}^{W_{0}^{L\top}dh_{s}^{L}(\xi_{i})}\}_{i\in[m],s\in[t+1]}.$$

Suppose there exist $\{\lambda_{i,s}\}_{i\in[m],s\in[t+1]}$, not all zero, such that

$$\sum_{i\in[m],s\in[t+1]}\lambda_{i,s}\widehat{Z}^{W_{0}^{L\top}dh_{s}^{L}(\xi_{i})}\overset{\text{a.s.}}{=}0.$$

Since $\{\widehat{Z}^{W_{0}^{L\top}dh_{s}^{L}(\xi_{i})}\}_{i\in[m],s\in[t]}$ are non-degenerate, we conclude that $\{\lambda_{i,t+1}\}_{i\in[m]}$ are not all zero. Considering the second moment, we have that

$$\mathbb{E}\bigg[\bigg(\sum_{i\in[m],s\in[t+1]}\lambda_{i,s}\widehat{Z}^{W_{0}^{L\top}dh_{s}^{L}(\xi_{i})}\bigg)^{2}\bigg]=0.$$

Because $\{\widehat{Z}^{W_{0}^{L\top}dh_{s}^{L}(\xi_{i})}\}$ shares the same covariance matrix as $\{Z^{dh_{s}^{L}(\xi_{i})}\}$, by Lemma C.1 we have that

$$\mathbb{E}\bigg[\bigg(\sum_{i\in[m],s\in[t+1]}\lambda_{i,s}Z^{dh_{s}^{L}(\xi_{i})}\bigg)^{2}\bigg]=0\;\Rightarrow\;\sum_{i\in[m],s\in[t+1]}\lambda_{i,s}Z^{dh_{s}^{L}(\xi_{i})}\overset{\text{a.s.}}{=}0.$$

Therefore by definition we have that

$$\sum_{i\in[m],s\in[t+1]}\lambda_{i,s}Z^{dx_{s}^{L}(\xi_{i})}\phi^{\prime}(Z^{h_{s}^{L}(\xi_{i})})\overset{\text{a.s.}}{=}0, \tag{C.7}$$

where $Z^{dx_{s}^{L}(\xi)}$ satisfies

$$Z^{dx_{s}^{L}(\xi)}=Z^{\widehat{W}_{s}^{L+1}}=Z^{\widehat{W}_{0}^{L+1}}-\eta\sum_{s'=0}^{s-1}\sum_{i\in[m]}\mathring{\chi}_{s',i}Z^{x_{s'}^{L}(\xi_{i})}.$$

Reformulating the above update rule further gives

$$\begin{aligned}
Z^{dx_{s}^{L}(\xi)} &= \widetilde{\Delta}_{s}-\eta\sum_{i\in[m]}\mathring{\chi}_{s,i}\phi(Z^{h_{s}^{L}(\xi_{i})}),\\
Z^{h_{s}^{L}(\xi_{i})} &\overset{(i)}{=} \Delta_{s}+\widehat{Z}^{W_{0}^{L}\delta x^{L-1}_{s}(\xi_{i})},
\end{aligned}$$

where $\Delta_{s},\widetilde{\Delta}_{s}\in\mathcal{F}_{s}$ and (i) is due to (C.6). Notice that $Z^{dx_{s}^{L}},Z^{h_{s}^{L}(\xi_{i})}\in\mathcal{F}_{s}$. In (C.7), only $\widehat{Z}^{W_{0}^{L}\delta x^{L-1}_{t+1}(\xi_{i})}\in\mathcal{F}_{t+1}$ provides new randomness.
Because the induction hypothesis gives the non-degenerate property at time $t$ for all layers, and we have already proved the non-degenerate property of $\widehat{Z}^{W_{0}^{L}\delta x^{L-1}_{t+1}(\xi_{i})}$, conditioning (C.7) on $\mathcal{F}_{t}$ gives

$$\Big(C-\eta\sum_{i\in[m]}\mathring{\chi}_{t,i}\phi(Z^{U_{i}}+b_{i})\Big)\cdot\Big(\sum_{i\in[m],s\in[t+1]}\lambda_{i,t+1}\phi'(Z^{U_{i}}+b_{i})\Big)=C',$$

where $U_{i}=\widehat{Z}^{W_{0}^{L}\delta x^{L-1}_{t+1}(\xi_{i})}|\mathcal{F}_{t}$ and $b_{i},C,C'$ are all constants. Since the $\mathring{\chi}_{t,i}$ are not all zero, Lemma C.3 gives $\lambda_{i,t+1}=0$ for all $i\in[m]$, which is a contradiction. Therefore, $\{\widehat{Z}^{W_{0}^{L\top}dh_{s}^{L}(\xi_{i})}\}_{i\in[m],s\in[t+1]}$ is indeed non-degenerate.

Step 4: We prove the following gradients are non-degenerate.

$$\{\widehat{Z}^{W_{0}^{l\top}dh_{s}^{l}(\xi_{i})}\}_{i\in[m],s\in[t+1]},\quad 2\leq l\leq L-1.$$

Suppose there exist $\{\lambda_{i,s}\}_{i\in[m],s\in[t+1]}$, not all zero, such that

$$\sum_{i\in[m],s\in[t+1]}\lambda_{i,s}\widehat{Z}^{W_{0}^{l\top}dh_{s}^{l}(\xi_{i})}\overset{\text{a.s.}}{=}0.$$

Since $\{\widehat{Z}^{W_{0}^{l\top}dh_{s}^{l}(\xi_{i})}\}_{i\in[m],s\in[t]}$ are non-degenerate, we conclude that $\{\lambda_{i,t+1}\}_{i\in[m]}$ are not all zero. Considering the second moment, we have

$$\mathbb{E}\bigg[\bigg(\sum_{i\in[m],s\in[t+1]}\lambda_{i,s}\widehat{Z}^{W_{0}^{l\top}dh_{s}^{l}(\xi_{i})}\bigg)^{2}\bigg]=0.$$

Because $\{\widehat{Z}^{W_{0}^{l\top}dh_{s}^{l}(\xi_{i})}\}$ shares the same covariance matrix as $\{Z^{dh_{s}^{l}(\xi_{i})}\}$, Lemma C.1 gives

$$\mathbb{E}\bigg[\bigg(\sum_{i\in[m],s\in[t+1]}\lambda_{i,s}Z^{dh_{s}^{l}(\xi_{i})}\bigg)^{2}\bigg]=0\;\Rightarrow\;\sum_{i\in[m],s\in[t+1]}\lambda_{i,s}Z^{dh_{s}^{l}(\xi_{i})}\overset{a.s.}{=}0.$$

Therefore by definition we have that

$$\sum_{i\in[m],s\in[t+1]}\lambda_{i,s}Z^{dx_{s}^{l}(\xi_{i})}\phi'(Z^{h_{s}^{l}(\xi_{i})})\overset{a.s.}{=}0, \tag{C.8}$$

where $Z^{dx_{s}^{l}(\xi)}$ satisfies

$$Z^{dx_{s}^{l}(\xi_{i})}=\widehat{Z}^{W_{0}^{l+1\top}dh_{s}^{l+1}(\xi_{i})}+G_{s}(\xi_{i}). \tag{C.9}$$

Similar to Steps 1 and 2, we have that $Z^{dx_{s}^{l}(\xi_{i})}\in\mathcal{G}_{s}$ and $Z^{h_{s}^{l}(\xi_{i})}\in\mathcal{G}_{s-1}$. Therefore, when conditioning (C.8) on $\mathcal{G}_{t}$, only $\widehat{Z}^{W_{0}^{l+1\top}dh_{t+1}^{l+1}(\xi)}$ provides new randomness. Arbitrarily pick an index $j$.
Because the induction hypothesis gives the non-degenerate property at time $t$ for all layers, and we have already proved the non-degenerate property at time $t+1$ for layer $l+1$, conditioning (C.8) on $\mathcal{G}_{t}\cup\{\widehat{Z}^{W_{0}^{l+1\top}dh_{t+1}^{l+1}(\xi_{i})}\}_{i\neq j}$ gives

$$\lambda_{j,t+1}c_{j}Z^{U_{j}}=C_{j},$$

where $U_{j}$ is a non-degenerate univariate Gaussian random variable, and $c_{j}$ and $C_{j}$ are constants. By Assumption 4.3 on the activation function, we know that $c_{j}\neq 0$, which implies $\lambda_{j,t+1}=0$. Since $j$ was arbitrary, this holds for all $j\in[m]$, which is a contradiction. Therefore, $\{\widehat{Z}^{W_{0}^{l\top}dh_{s}^{l}(\xi_{i})}\}_{i\in[m],s\in[t+1]}$ is indeed non-degenerate.
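The second-moment argument used in Steps 3 and 4 rests on the identity $\mathbb{E}[(\sum_{k}\lambda_{k}Z_{k})^{2}]=\lambda^{\top}M\lambda$ with $M=\mathbb{E}[ZZ^{\top}]$, so a nontrivial linear combination vanishes almost surely exactly when $M$ is singular. The following Python sketch illustrates this numerically. It reads "non-degenerate" as "no nontrivial linear combination vanishes almost surely", and everything concrete in it is an assumption made for illustration rather than part of the proof: the toy random variables, the sample size, and the helper names `second_moment_matrix` and `min_quadratic_form` are hypothetical. A Monte Carlo estimate of this kind is only suggestive.

```python
import numpy as np

def second_moment_matrix(samples):
    """Empirical second-moment matrix E[Z Z^T] from samples of shape (n, k)."""
    return samples.T @ samples / samples.shape[0]

def min_quadratic_form(samples):
    """Minimum of lambda^T M lambda over unit vectors lambda.

    This equals the smallest eigenvalue of the symmetric matrix M.
    """
    eigenvalues = np.linalg.eigvalsh(second_moment_matrix(samples))  # ascending order
    return float(eigenvalues[0])

rng = np.random.default_rng(0)
g = rng.standard_normal((100_000, 2))

# Toy linearly independent family (hypothetical): tanh(g1), tanh(g2), g1 * g2.
independent = np.column_stack([np.tanh(g[:, 0]), np.tanh(g[:, 1]), g[:, 0] * g[:, 1]])

# Toy degenerate family: the third variable is an exact linear combination
# of the first two, so some nonzero lambda gives a zero second moment.
degenerate = np.column_stack([g[:, 0], g[:, 1], 2.0 * g[:, 0] - 3.0 * g[:, 1]])

lam_min_independent = min_quadratic_form(independent)
lam_min_degenerate = min_quadratic_form(degenerate)

print("independent family, smallest eigenvalue:", lam_min_independent)
print("degenerate family, smallest eigenvalue:", lam_min_degenerate)
```

In the independent case the smallest eigenvalue stays bounded away from zero, while in the degenerate case it is zero up to floating-point error.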

Proof of Corollary 4.6

Proof.

As stated in the main text, if the training parameters stop updating at time T𝑇Titalic_T, then the training loss must be zero.

By Theorem 4.5, the training trajectory remains non-degenerate throughout training. Suppose, for contradiction, that at time $T$ the training loss is still nonzero for some sample $(\xi_{i},y_{i})$. This implies that the error signal $\mathring{\chi}_{T,i}$ is nonzero. However, the non-degenerate trajectory ensures that a nonzero error signal $\mathring{\chi}_{T,i}$ would necessitate further parameter updates, contradicting the assumption that all parameter updates vanish at or after time $T$. Therefore, the training loss must be zero for all samples at time $T$, implying convergence to a global minimum. ∎

Appendix D Activation Functions with the GOOD Property

We now verify that many practical activation functions, especially those with exponential tails, satisfy the GOOD property introduced in Definition 4.2. By “exponential tail,” we mean that as $|x|\to\infty$, the function and/or its derivatives decay at least as fast as $e^{-c|x|}$ for some $c>0$. Representative examples include the sigmoid, tanh, SiLU, and GeLU. Below, we restate the full definition of GOOD and then show how each requirement is met by these exponential-tail activations.

Definition D.1 (Restatement of Definition 4.2).

An activation function $\phi:\mathbb{R}\to\mathbb{R}$ is called GOOD if it satisfies the following two conditions:

  (a) Non-constant decomposition. For any finite set of parameters $\{a_{i}\},\{b_{i}\},\{c_{i}\}$ such that $\exists k$ with $a_{k}b_{k}\neq 0$ and $|b_{i}|\neq|b_{j}|$ for all $i\neq j$, the function

    $$f(x)=\sum_{i=1}^{m}a_{i}\phi(b_{i}x+c_{i}) \tag{D.1}$$

    is not a constant function.

  (b) Non-degenerate product with derivative. For any real numbers $r_{1},r_{2}$, the product

    $$\bigl(r_{1}+\phi(x)\bigr)\bigl(r_{2}+\phi'(x)\bigr) \tag{D.2}$$

    is not almost everywhere (a.e.) constant on $\mathbb{R}$.
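The two conditions can be probed numerically on a grid. The sketch below is only a sanity check under stated assumptions and not a proof: the grid, the parameter values $a_{i},b_{i},c_{i},r_{1},r_{2}$, and the helper name `spread` are hypothetical choices. It evaluates (D.1) and (D.2) for the sigmoid and, as a contrast, evaluates (D.2) for ReLU with $r_{1}=0$ and $r_{2}=-1$, where one can check directly that the product $\mathrm{ReLU}(x)\,(\mathrm{ReLU}'(x)-1)$ vanishes for every $x\neq 0$.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_prime(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def relu(x):
    return np.maximum(x, 0.0)

def relu_prime(x):
    # Derivative away from x = 0; the grid below does not contain 0.
    return (x > 0).astype(float)

def spread(values):
    """Max minus min over the grid; zero means constant on the sampled points."""
    return float(np.max(values) - np.min(values))

x = np.linspace(-10.0, 10.0, 2000)  # an even number of points, so x = 0 is skipped

# Condition (a), eq. (D.1): hypothetical parameters with distinct |b_i| and a_k b_k != 0.
a = np.array([1.0, -2.0, 0.5])
b = np.array([0.5, 1.5, -3.0])
c = np.array([0.1, -0.2, 0.3])
f = sum(a_i * sigmoid(b_i * x + c_i) for a_i, b_i, c_i in zip(a, b, c))
spread_a_sigmoid = spread(f)

# Condition (b), eq. (D.2): hypothetical choice r1 = 0, r2 = -1.
r1, r2 = 0.0, -1.0
spread_b_sigmoid = spread((r1 + sigmoid(x)) * (r2 + sigmoid_prime(x)))
spread_b_relu = spread((r1 + relu(x)) * (r2 + relu_prime(x)))

print("sigmoid, (D.1) spread:", spread_a_sigmoid)
print("sigmoid, (D.2) spread:", spread_b_sigmoid)
print("ReLU,    (D.2) spread:", spread_b_relu)
```

A positive spread shows the function is not constant for that particular parameter choice; the definition itself quantifies over all admissible parameters, which a finite check cannot cover.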

Before analyzing each activation function in detail, we visualize these functions and their derivatives in Figure 5. These plots illustrate the key characteristics we will exploit in our proofs, particularly the exponential decay behavior in the tails. Note how most activation functions and their derivatives exhibit rapid decay as $|x|\to\infty$, with ReLU serving as a contrasting example that grows linearly.

Figure 5: Different activation functions (left) and their derivatives (right). Note the exponential decay behavior in the tails of $\sigma$, $\tanh$, SiLU, and GeLU, which is crucial for the GOOD property.

In the following subsections, we formally prove that these exponential-tail activations satisfy both conditions of Definition D.1.

D.1 Sigmoid and Tanh

Proposition D.2.

The sigmoid function $\sigma(x)=\frac{1}{1+\exp(-x)}$ satisfies both (a) and (b) in Definition D.1, hence is GOOD.

Proof.

We first prove condition (a). Without loss of generality, set $c_{i}=0$, as the shifts do not affect the tail behavior of the activation function. Define $\Omega=\{i\mid a_{i}\neq 0\}$, $A^{+}=\{i\in\Omega\mid b_{i}>0\}$ and $A^{-}=\{i\in\Omega\mid b_{i}<0\}$. Let $i^{*}=\mathop{\mathrm{argmin}}_{i\in\Omega}|b_{i}|$. If $b_{i^{*}}=0$, then the corresponding term is the constant $a_{i^{*}}\sigma(0)=a_{i^{*}}/2$, so we can redefine $\Omega\leftarrow\Omega\backslash\{i^{*}\}$ and $f\leftarrow f-a_{i^{*}}/2$ and reenter this proof. Thus we assume $b_{i^{*}}\neq 0$ without loss of generality.

We have:

$$f(x)=\sum_{i\in A^{+}}a_{i}\sigma(b_{i}x)+\sum_{i\in A^{-}}a_{i}\sigma(b_{i}x)=\sum_{i\in A^{+}}a_{i}-\sum_{i\in A^{+}}a_{i}[1-\sigma(b_{i}x)]+\sum_{i\in A^{-}}a_{i}\sigma(b_{i}x).$$

For $b_{i^{*}}<0$, we have, as $x\to\infty$:

$$\bigg|f(x)-\sum_{i\in A^{+}}a_{i}\bigg|=\bigg|\sum_{i\in A^{+}}a_{i}[1-\sigma(b_{i}x)]-\sum_{i\in A^{-}}a_{i}\sigma(b_{i}x)\bigg|=|a_{i^{*}}|\sigma(b_{i^{*}}x)+O(\exp(-Bx)),$$

where $B>|b_{i^{*}}|$. This dominant term cannot be cancelled unless $a_{i^{*}}=0$.

For bi>0subscript𝑏superscript𝑖0b_{i^{*}}>0italic_b start_POSTSUBSCRIPT italic_i start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT end_POSTSUBSCRIPT > 0:

|f(x)iA+ai|=|iA+ai[1σ(bix)]iAaiσ(bix)|=|ai|[1σ(bix)]+O(exp(Bx)),𝑓𝑥subscript𝑖superscript𝐴subscript𝑎𝑖subscript𝑖superscript𝐴subscript𝑎𝑖delimited-[]1𝜎subscript𝑏𝑖𝑥subscript𝑖superscript𝐴subscript𝑎𝑖𝜎subscript𝑏𝑖𝑥subscript𝑎superscript𝑖delimited-[]1𝜎subscript𝑏superscript𝑖𝑥𝑂𝐵𝑥\displaystyle|f(x)-\sum_{i\in A^{+}}a_{i}|=\bigg{|}\sum_{i\in A^{+}}a_{i}[1-% \sigma(b_{i}x)]-\sum_{i\in A^{-}}a_{i}\sigma(b_{i}x)\bigg{|}=|a_{i^{*}}|[1-% \sigma(b_{i^{*}}x)]+O(\exp(-Bx)),| italic_f ( italic_x ) - ∑ start_POSTSUBSCRIPT italic_i ∈ italic_A start_POSTSUPERSCRIPT + end_POSTSUPERSCRIPT end_POSTSUBSCRIPT italic_a start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | = | ∑ start_POSTSUBSCRIPT italic_i ∈ italic_A start_POSTSUPERSCRIPT + end_POSTSUPERSCRIPT end_POSTSUBSCRIPT italic_a start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT [ 1 - italic_σ ( italic_b start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_x ) ] - ∑ start_POSTSUBSCRIPT italic_i ∈ italic_A start_POSTSUPERSCRIPT - end_POSTSUPERSCRIPT end_POSTSUBSCRIPT italic_a start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_σ ( italic_b start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_x ) | = | italic_a start_POSTSUBSCRIPT italic_i start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT end_POSTSUBSCRIPT | [ 1 - italic_σ ( italic_b start_POSTSUBSCRIPT italic_i start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT end_POSTSUBSCRIPT italic_x ) ] + italic_O ( roman_exp ( - italic_B italic_x ) ) ,

where B>|bi|𝐵subscript𝑏superscript𝑖B>|b_{i^{*}}|italic_B > | italic_b start_POSTSUBSCRIPT italic_i start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT end_POSTSUBSCRIPT |. This shows f(x)𝑓𝑥f(x)italic_f ( italic_x ) cannot be constant unless aibi=0subscript𝑎superscript𝑖subscript𝑏superscript𝑖0a_{i^{*}}b_{i^{*}}=0italic_a start_POSTSUBSCRIPT italic_i start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT end_POSTSUBSCRIPT italic_b start_POSTSUBSCRIPT italic_i start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT end_POSTSUBSCRIPT = 0, contradicting our assumption.
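The tail argument for condition (a) can be checked numerically. The following is a minimal sketch and not part of the proof: the coefficients `a` and `b` are hypothetical values chosen only for illustration (all $a_i\neq 0$, with $b_{i^*}>0$). It confirms that, after rescaling by $\exp(|b_{i^*}|x)$, the residual $f(x)-\sum_{i\in A^{+}}a_i$ approaches $-a_{i^*}$, so the term with the smallest $|b_i|$ dominates the tail.

```python
import math

def sigmoid(z: float) -> float:
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

# Hypothetical coefficients (illustration only): nonzero a_i, distinct |b_i|.
a = [0.7, -1.3, 2.0]
b = [0.5, -1.0, 2.0]

def f(x: float) -> float:
    # f(x) = sum_i a_i * sigma(b_i * x)
    return sum(ai * sigmoid(bi * x) for ai, bi in zip(a, b))

# Limit of f as x -> +infinity: the sum of a_i over A^+ (indices with b_i > 0).
limit = sum(ai for ai, bi in zip(a, b) if bi > 0)

# Index i* with the smallest |b_i|; here i* = 0 and b_{i*} = 0.5 > 0.
i_star = min(range(len(b)), key=lambda i: abs(b[i]))

def scaled_residual(x: float) -> float:
    # (f(x) - limit) * exp(|b_{i*}| x); tends to -a_{i*} when b_{i*} > 0.
    return (f(x) - limit) * math.exp(abs(b[i_star]) * x)

for x in (10.0, 20.0, 40.0):
    print(x, scaled_residual(x))
```

Because the rescaled residual converges to a nonzero value, the residual itself is not identically zero for large $x$, which is the non-constancy used in the proof.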

For condition (b), we need to show that $(r_1+\sigma(x))(r_2+\sigma'(x))$ is not a.e. constant. Note that $\sigma'(x)=\sigma(x)(1-\sigma(x))$ decays exponentially as $|x|\to\infty$. A direct computation shows:

$$(r_1+\sigma(x))(r_2+\sigma'(x)) = r_1r_2 + r_2\sigma(x) + (r_1+\sigma(x))\,\sigma(x)(1-\sigma(x)) = r_1r_2 + (r_1+r_2)\sigma(x) + (1-r_1)\sigma(x)^2 - \sigma(x)^3. \tag{D.3}$$

Consider the expression in (D.3), which is a polynomial in $\sigma(x)$. Since $\sigma$ is a smooth, strictly increasing bijection from $\mathbb{R}$ onto $(0,1)$, the expression is a.e. constant only if this polynomial is constant on $(0,1)$, which requires the coefficients of all positive powers of $\sigma(x)$ to vanish. However, the coefficient of $\sigma(x)^3$ equals $-1$ for every choice of $r_1$ and $r_2$, so this is impossible. Thus $(r_1+\sigma(x))(r_2+\sigma'(x))$ cannot be constant almost everywhere. ∎
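Independently of the algebra, condition (b) for the sigmoid can be probed numerically. The sketch below is illustrative only: the pairs `(r1, r2)` and the grid are hypothetical choices, not values from the paper. It first checks the identity $\sigma'(x)=\sigma(x)(1-\sigma(x))$ against a central finite difference, then measures how much the product $(r_1+\sigma(x))(r_2+\sigma'(x))$ varies over the grid.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_prime(x: float) -> float:
    # Closed form sigma'(x) = sigma(x) * (1 - sigma(x)).
    s = sigmoid(x)
    return s * (1.0 - s)

def product(x: float, r1: float, r2: float) -> float:
    # The quantity from condition (b): (r1 + sigma(x)) * (r2 + sigma'(x)).
    return (r1 + sigmoid(x)) * (r2 + sigmoid_prime(x))

grid = [-8.0 + 0.25 * k for k in range(65)]  # x in [-8, 8]

# Compare the closed-form derivative with a central finite difference.
h = 1e-5
fd_error = max(
    abs((sigmoid(x + h) - sigmoid(x - h)) / (2 * h) - sigmoid_prime(x))
    for x in grid
)

# Hypothetical (r1, r2) pairs, including the degenerate-looking r1 = r2 = 0.
pairs = [(0.0, 0.0), (1.0, -1.0), (-0.5, 0.3)]
spreads = {}
for r1, r2 in pairs:
    values = [product(x, r1, r2) for x in grid]
    spreads[(r1, r2)] = max(values) - min(values)

print("finite-difference error:", fd_error)
print("spread of the product per (r1, r2):", spreads)
```

A strictly positive spread for each pair is consistent with the claim that the product is not constant; a finite grid cannot replace the proof, which covers all $r_1, r_2$.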

Remark D.3.

Since $\tanh(x)=2\sigma(2x)-1$ is an affine transformation of the sigmoid function $\sigma$, it inherits the same exponential-tail property and similarly meets both (a) and (b).
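A quick numerical check of the relation between $\tanh$ and $\sigma$ used in this remark, written as `2 * sigmoid(2 * x) - 1` in the sketch below. The grid is an arbitrary illustrative choice.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def tanh_via_sigmoid(x: float) -> float:
    # tanh expressed through the sigmoid: rescale the input, then shift and scale.
    return 2.0 * sigmoid(2.0 * x) - 1.0

grid = [-10.0 + 0.5 * k for k in range(41)]  # x in [-10, 10]
max_gap = max(abs(math.tanh(x) - tanh_via_sigmoid(x)) for x in grid)
print("max |tanh(x) - (2*sigmoid(2x) - 1)| on the grid:", max_gap)
```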

D.2 SiLU and GeLU

Proposition D.4.

The SiLU function $\mathrm{SiLU}(x)=x\sigma(x)$ is GOOD.

Proof.

Define $\Omega=\{i \mid a_i\neq 0\}$, $A^{+}=\{i\in\Omega \mid b_i>0\}$ and $A^{-}=\{i\in\Omega \mid b_i<0\}$. Let $i^*=\mathop{\mathrm{argmin}}_{i\in\Omega}|b_i|$. Using similar reasoning as in the sigmoid case, we assume $b_{i^*}\neq 0$ without loss of generality.

We have:

$$f(x) = \sum_{i\in A^{+}} a_i\phi(b_i x) + \sum_{i\in A^{-}} a_i\phi(b_i x) = \sum_{i\in A^{+}} a_ib_i x - \sum_{i\in A^{+}} a_i[b_i x-\phi(b_i x)] + \sum_{i\in A^{-}} a_i\phi(b_i x).$$

For $b_{i^*}<0$, we have:

$$\bigg|f(x)-\sum_{i\in A^{+}} a_ib_i x\bigg| = \bigg|\sum_{i\in A^{+}} a_i[b_i x-\phi(b_i x)] - \sum_{i\in A^{-}} a_i\phi(b_i x)\bigg| = \underbrace{|a_{i^*}|\phi(b_{i^*}x)}_{\mathrm{Majortail}} + O(x\exp(-Bx)),$$

where $B>|b_{i^*}|$.

For $b_{i^*}>0$, we have:

$$\bigg|f(x)-\sum_{i\in A^{+}} a_ib_i x\bigg| = \bigg|\sum_{i\in A^{+}} a_i[b_i x-\phi(b_i x)] - \sum_{i\in A^{-}} a_i\phi(b_i x)\bigg| = \underbrace{|a_{i^*}|[b_{i^*}x-\phi(b_{i^*}x)]}_{\mathrm{Majortail}} + O(x\exp(-Bx)),$$

where $B>|b_{i^*}|$. Note that $\mathrm{Majortail}$ is bounded by some constant and asymptotically $\mathrm{Majortail}=\Theta(x\exp(-|b_{i^*}|x))$. Therefore, $f(x)$ is constant only if $\sum_{i\in A^{+}} a_ib_i=0$ and $a_{i^*}b_{i^*}=0$, which contradicts our assumption.
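The asymptotic rate $\Theta(x\exp(-|b_{i^*}|x))$ of the major tail can be illustrated numerically. This is a sketch under stated assumptions: the values of `b` and `x` are hypothetical, and for $b<0$ the magnitude of $\phi(bx)$ is used so that both cases are compared on the same footing. The gap $z-\mathrm{SiLU}(z)$ is evaluated as $z\,\sigma(-z)$, an algebraically equal form that avoids cancellation for large $z$.

```python
import math

def sigmoid(z: float) -> float:
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def silu(z: float) -> float:
    return z * sigmoid(z)

def major_tail(b: float, x: float) -> float:
    # b > 0: gap between the linear asymptote b*x and SiLU(b*x),
    #        computed as z * sigmoid(-z), which equals z - SiLU(z).
    # b < 0: magnitude of SiLU(b*x), which vanishes as x -> +infinity.
    z = b * x
    return z * sigmoid(-z) if b > 0 else abs(silu(z))

def tail_ratio(b: float, x: float) -> float:
    # Ratio of the major tail to |b| * x * exp(-|b| x); tends to 1 as x grows.
    return major_tail(b, x) / (abs(b) * x * math.exp(-abs(b) * x))

for b_val, x_val in [(0.5, 10.0), (0.5, 40.0), (-2.0, 5.0), (-2.0, 10.0)]:
    print(b_val, x_val, tail_ratio(b_val, x_val))
```

A ratio that stays bounded away from $0$ and $\infty$ is what the $\Theta(\cdot)$ statement asserts; in this sketch it converges to $1$.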

For condition (b), we need to show that $(r_1+x\sigma(x))(r_2+\sigma(x)+x\sigma'(x))$ is not a.e. constant. Note that:

$$\phi'(x) = \sigma(x)+x\sigma'(x) = \sigma(x)+x\sigma(x)(1-\sigma(x)). \tag{D.4}$$

Then we have:

$$\begin{aligned}
&(r_1+x\sigma(x))(r_2+\sigma(x)+x\sigma(x)(1-\sigma(x)))\\
&= r_1r_2 + r_1\sigma(x) + r_1x\sigma(x)(1-\sigma(x))\\
&\quad + r_2x\sigma(x) + x\sigma(x)^2 + x^2\sigma(x)^2(1-\sigma(x)).
\end{aligned} \tag{D.5}$$

Consider the tail in (D.5). If this expression were constant, the coefficient of the $x^2$ term would have to vanish, which requires $\sigma(x)^2(1-\sigma(x))\equiv 0$. However, this is impossible as $\sigma(x)\in(0,1)$ for all $x$. Thus this product cannot be constant almost everywhere. ∎
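The derivative formula (D.4) and the expansion (D.5) can be sanity-checked numerically. The sketch below is illustrative only: the values of `r1`, `r2` and the grid are hypothetical. It compares (D.4) with a central finite difference of SiLU and compares the factored product with the six-term expansion.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def silu(x: float) -> float:
    return x * sigmoid(x)

def silu_prime(x: float) -> float:
    # (D.4): phi'(x) = sigma(x) + x * sigma(x) * (1 - sigma(x)).
    s = sigmoid(x)
    return s + x * s * (1.0 - s)

def product_factored(x: float, r1: float, r2: float) -> float:
    return (r1 + silu(x)) * (r2 + silu_prime(x))

def product_expanded(x: float, r1: float, r2: float) -> float:
    # Right-hand side of (D.5), term by term.
    s = sigmoid(x)
    return (r1 * r2 + r1 * s + r1 * x * s * (1.0 - s)
            + r2 * x * s + x * s ** 2 + x ** 2 * s ** 2 * (1.0 - s))

grid = [-10.0 + 0.5 * k for k in range(41)]  # x in [-10, 10]
r1, r2 = 0.4, -1.2                           # hypothetical constants

h = 1e-5
fd_error = max(
    abs((silu(x + h) - silu(x - h)) / (2 * h) - silu_prime(x)) for x in grid
)
expansion_error = max(
    abs(product_factored(x, r1, r2) - product_expanded(x, r1, r2)) for x in grid
)
print("finite-difference error for (D.4):", fd_error)
print("factored vs expanded, (D.5):", expansion_error)
```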

Remark D.5.

GeLU, defined by $\mathrm{GeLU}(x)=x\Phi(x)$ where $\Phi$ is the standard Gaussian CDF, similarly satisfies (a) and (b) because of its strong tail decay. Specifically, as $|x|\to\infty$, the gaps between GeLU (and its derivatives) and their limiting behavior, namely $0$ as $x\to-\infty$ and $x$ as $x\to+\infty$ for GeLU itself, exhibit Gaussian-like decay $O(e^{-x^2/2})$ up to polynomial factors in $|x|$, which is even stronger than the exponential decay of sigmoid and SiLU.
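The Gaussian-rate tail can be illustrated with the standard library. This sketch reads "decay" as the gap between GeLU and its asymptote ($x$ as $x\to+\infty$, $0$ as $x\to-\infty$), which is an interpretation adopted here for the check; the sample points are arbitrary. The gap is compared with the standard Gaussian density $e^{-x^2/2}/\sqrt{2\pi}$, using `math.erfc` to evaluate $1-\Phi(x)$ without cancellation.

```python
import math

def gelu(x: float) -> float:
    # GeLU(x) = x * Phi(x), with Phi the standard Gaussian CDF.
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def upper_tail(x: float) -> float:
    # 1 - Phi(x), evaluated through erfc for accuracy at large x.
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def gap_to_asymptote(x: float) -> float:
    # x > 0: x - GeLU(x) = x * (1 - Phi(x)).
    # x < 0: |GeLU(x)| = |x| * Phi(x) = |x| * (1 - Phi(|x|)).
    return abs(x) * upper_tail(abs(x))

def gaussian_density(x: float) -> float:
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def tail_ratio(x: float) -> float:
    # Gap divided by the Gaussian density; approaches 1 from below as |x| grows.
    return gap_to_asymptote(x) / gaussian_density(x)

for x in (2.0, 5.0, 10.0, -10.0):
    print(x, gap_to_asymptote(x), tail_ratio(x))
```

The ratio stays below $1$ and approaches it, so the gap is of order $e^{-x^2/2}$, faster than any rate $e^{-c|x|}$.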

Conclusion.

We have shown that key exponential-tail activations ($\sigma$, $\tanh$, SiLU, GeLU) fulfill both (a) and (b) in Definition D.1, and hence are GOOD. These results rely crucially on the exponential decay properties of these functions, which ensure that scaled copies cannot combine to yield constant functions. This ensures rich, non-degenerate behavior in our infinite-width analysis under $\mu$P scaling.

References

  • Adler and Taylor (2009) Adler, R. J. and Taylor, J. E. (2009). Random fields and geometry. Springer Science & Business Media.
  • Alemohammad et al. (2020) Alemohammad, S., Wang, Z., Balestriero, R. and Baraniuk, R. (2020). The recurrent neural tangent kernel. arXiv preprint arXiv:2006.10246.
  • Allen-Zhu et al. (2019) Allen-Zhu, Z., Li, Y. and Song, Z. (2019). A convergence theory for deep learning via over-parameterization. In International Conference on Machine Learning.
  • Arora et al. (2019a) Arora, S., Du, S., Hu, W., Li, Z. and Wang, R. (2019a). Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. In International Conference on Machine Learning.
  • Arora et al. (2019b) Arora, S., Du, S. S., Hu, W., Li, Z., Salakhutdinov, R. and Wang, R. (2019b). On exact computation with an infinitely wide neural net. In Advances in Neural Information Processing Systems.
  • Bordelon and Pehlevan (2022) Bordelon, B. and Pehlevan, C. (2022). Self-consistent dynamical field theory of kernel evolution in wide neural networks. Advances in Neural Information Processing Systems 35 32240–32256.
  • Cao and Gu (2019) Cao, Y. and Gu, Q. (2019). Generalization bounds of stochastic gradient descent for wide and deep neural networks. In Advances in Neural Information Processing Systems.
  • Chen et al. (2020) Chen, Z., Cao, Y., Gu, Q. and Zhang, T. (2020). A generalized neural tangent kernel analysis for two-layer neural networks. Advances in Neural Information Processing Systems 33 13363–13373.
  • Chen et al. (2021) Chen, Z., Cao, Y., Zou, D. and Gu, Q. (2021). How much over-parameterization is sufficient to learn deep relu networks? In International Conference on Learning Representations (ICLR).
  • Chizat and Bach (2018) Chizat, L. and Bach, F. (2018). On the global convergence of gradient descent for over-parameterized models using optimal transport. In Advances in neural information processing systems.
  • Du et al. (2019a) Du, S., Lee, J., Li, H., Wang, L. and Zhai, X. (2019a). Gradient descent finds global minima of deep neural networks. In International Conference on Machine Learning.
  • Du et al. (2019b) Du, S. S., Hou, K., Salakhutdinov, R. R., Poczos, B., Wang, R. and Xu, K. (2019b). Graph neural tangent kernel: Fusing graph neural networks with graph kernels. Advances in neural information processing systems 32.
  • Du et al. (2018) Du, S. S., Lee, J. D. and Tian, Y. (2018). When is a convolutional filter easy to learn? In International Conference on Learning Representations.
  • Du et al. (2019c) Du, S. S., Zhai, X., Poczos, B. and Singh, A. (2019c). Gradient descent provably optimizes over-parameterized neural networks. In International Conference on Learning Representations.
  • Fang et al. (2021) Fang, C., Lee, J., Yang, P. and Zhang, T. (2021). Modeling from features: a mean-field framework for over-parameterized deep neural networks. In Conference on learning theory. PMLR.
  • Geiger et al. (2020) Geiger, M., Spigler, S., Jacot, A. and Wyart, M. (2020). Disentangling feature and lazy training in deep neural networks. Journal of Statistical Mechanics: Theory and Experiment 2020 113301.
  • Hajjar et al. (2021) Hajjar, K., Chizat, L. and Giraud, C. (2021). Training integrable parameterizations of deep neural networks in the infinite-width limit. arXiv preprint arXiv:2110.15596.
  • Hendrycks and Gimpel (2016) Hendrycks, D. and Gimpel, K. (2016). Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415.
  • Hinton et al. (2012) Hinton, G., Deng, L., Yu, D., Dahl, G. E., Mohamed, A.-r., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P., Sainath, T. N. et al. (2012). Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine 29 82–97.
  • Hron et al. (2020) Hron, J., Bahri, Y., Sohl-Dickstein, J. and Novak, R. (2020). Infinite attention: Nngp and ntk for deep attention networks. In International Conference on Machine Learning. PMLR.
  • Jacot et al. (2018) Jacot, A., Gabriel, F. and Hongler, C. (2018). Neural tangent kernel: Convergence and generalization in neural networks. In Advances in neural information processing systems.
  • Krizhevsky et al. (2012) Krizhevsky, A., Sutskever, I. and Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems.
  • Lee et al. (2019) Lee, J., Xiao, L., Schoenholz, S. S., Bahri, Y., Sohl-Dickstein, J. and Pennington, J. (2019). Wide neural networks of any depth evolve as linear models under gradient descent. In Advances in Neural Information Processing Systems.
  • Littwin and Yang (2022) Littwin, E. and Yang, G. (2022). Adaptive optimization in the infinite-width limit. In The Eleventh International Conference on Learning Representations.
  • Mei et al. (2018) Mei, S., Montanari, A. and Nguyen, P.-M. (2018). A mean field view of the landscape of two-layer neural networks. Proceedings of the National Academy of Sciences 115 E7665–E7671.
  • Nguyen and Pham (2020) Nguyen, P.-M. and Pham, H. T. (2020). A rigorous framework for the mean field limit of multilayer neural networks. arXiv preprint arXiv:2001.11443.
  • Nitanda et al. (2022) Nitanda, A., Wu, D. and Suzuki, T. (2022). Convex analysis of the mean field langevin dynamics. In International Conference on Artificial Intelligence and Statistics. PMLR.
  • Pham and Nguyen (2021) Pham, H. T. and Nguyen, P.-M. (2021). Global convergence of three-layer neural networks in the mean field regime. arXiv preprint arXiv:2105.05228.
  • Rotskoff and Vanden-Eijnden (2018) Rotskoff, G. M. and Vanden-Eijnden, E. (2018). Neural networks as interacting particle systems: Asymptotic convexity of the loss landscape and universal scaling of the approximation error. arXiv preprint arXiv:1805.00915.
  • Silver et al. (2016) Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M. et al. (2016). Mastering the game of go with deep neural networks and tree search. Nature 529 484–489.
  • Sirignano and Spiliopoulos (2018) Sirignano, J. and Spiliopoulos, K. (2018). Mean field analysis of neural networks: A central limit theorem. arXiv preprint arXiv:1808.09372.
  • Touvron et al. (2023a) Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F. et al. (2023a). Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
  • Touvron et al. (2023b) Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S. et al. (2023b). Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
  • Woodworth et al. (2020) Woodworth, B., Gunasekar, S., Lee, J. D., Moroshko, E., Savarese, P., Golan, I., Soudry, D. and Srebro, N. (2020). Kernel and rich regimes in overparametrized models. In Conference on Learning Theory. PMLR.
  • Yang (2019a) Yang, G. (2019a). Scaling limits of wide neural networks with weight sharing: Gaussian process behavior, gradient independence, and neural tangent kernel derivation. arXiv preprint arXiv:1902.04760.
  • Yang (2019b) Yang, G. (2019b). Wide feedforward or recurrent neural networks of any architecture are gaussian processes. Advances in Neural Information Processing Systems 32.
  • Yang (2020a) Yang, G. (2020a). Tensor programs ii: Neural tangent kernel for any architecture. arXiv preprint arXiv:2006.14548.
  • Yang (2020b) Yang, G. (2020b). Tensor programs iii: Neural matrix laws. arXiv preprint arXiv:2009.10685.
  • Yang et al. (2021) Yang, G., Hu, E., Babuschkin, I., Sidor, S., Liu, X., Farhi, D., Ryder, N., Pachocki, J., Chen, W. and Gao, J. (2021). Tuning large neural networks via zero-shot hyperparameter transfer. Advances in Neural Information Processing Systems 34 17084–17097.
  • Yang and Hu (2020) Yang, G. and Hu, E. J. (2020). Feature learning in infinite-width neural networks. arXiv preprint arXiv:2011.14522.
  • Yang and Hu (2021) Yang, G. and Hu, E. J. (2021). Tensor programs iv: Feature learning in infinite-width neural networks. In International Conference on Machine Learning. PMLR.
  • Yang et al. (2023a) Yang, G., Simon, J. B. and Bernstein, J. (2023a). A spectral condition for feature learning. arXiv preprint arXiv:2310.17813.
  • Yang et al. (2023b) Yang, G., Yu, D., Zhu, C. and Hayou, S. (2023b). Tensor programs vi: Feature learning in infinite-depth neural networks. arXiv preprint arXiv:2310.02244.
  • Zou et al. (2019) Zou, D., Cao, Y., Zhou, D. and Gu, Q. (2019). Gradient descent optimizes over-parameterized deep ReLU networks. Machine Learning.
  • Zou and Gu (2019) Zou, D. and Gu, Q. (2019). An improved analysis of training over-parameterized deep neural networks. In Advances in Neural Information Processing Systems.