
StableSSM: Alleviating the Curse of Memory in State-space Models through Stable Reparameterization

Shida Wang    Qianxiao Li
Abstract

In this paper, we investigate the long-term memory learning capabilities of state-space models (SSMs) from the perspective of parameterization. We prove that state-space models without any reparameterization exhibit a memory limitation similar to that of traditional RNNs: the target relationships that can be stably approximated by state-space models must have an exponential decaying memory. Our analysis identifies this “curse of memory” as a result of the recurrent weights converging to a stability boundary, suggesting that a reparameterization technique can be effective. To this end, we introduce a class of reparameterization techniques for SSMs that effectively lift its memory limitations. Besides improving approximation capabilities, we further illustrate that a principled choice of reparameterization scheme can also enhance optimization stability. We validate our findings using synthetic datasets, language models and image classification.

Machine Learning, ICML, State-space Models, Curse of Memory

1 Introduction

Understanding long-term memory relationships is fundamental in sequence modeling. Capturing this prolonged memory is vital, especially in applications like time series prediction (Connor et al., 1994) and language modeling (Sutskever et al., 2011). Since their emergence, transformers (Vaswani et al., 2017) have become the go-to models for language representation tasks (Brown et al., 2020). However, a significant drawback lies in their computational complexity, which is asymptotically $O(T^2)$, where $T$ is the sequence length. This computational bottleneck has been a critical impediment to the further scaling-up of transformer models. State-space models such as S4 (Gu et al., 2022b), S5 (Smith et al., 2023), LRU (Orvieto et al., 2023b), RWKV (Peng et al., 2023), RetNet (Sun et al., 2023) and Mamba (Gu & Dao, 2023) offer an alternative approach. These models are of the recurrent type and excel in long-term memory learning. Their architecture is specifically designed to capture temporal dependencies over extended sequences, providing a robust solution for tasks requiring long-term memory (Tay et al., 2021). One of the advantages of state-space models over traditional RNNs lies in their computational efficiency, achieved through the application of parallel scan algorithms (Martin & Cundy, 2018) and the Fast Fourier Transform (FFT) (Tolimieri et al., 1989; Gu et al., 2022b). Traditional nonlinear RNNs are often plagued by slow forward and backward propagation, a limitation that state-space models circumvent by leveraging linear RNN blocks.

Traditional linear and nonlinear RNNs exhibit an asymptotically exponential decay in memory (Wang et al., 2023). This phenomenon explains the difficulty, in both approximation and optimization, of learning long-term memory using RNNs (also named the curse of memory). In practice, empirical results show that SSM variants such as S4 overcome some of these memory issues. The previous empirical results suggest that either (i) the “linear dynamics and nonlinear layerwise activation” structure or (ii) the parameterization inherent to S4 is pivotal in achieving the enhanced performance. In this paper, we determine which of the two is more important. We first prove an inverse approximation theorem showing that state-space models without reparameterization still suffer from the “curse of memory”, which is consistent with empirical results (Wang & Xue, 2023). This rules out point (i) as the reason for SSMs’ good long-term memory learning. A natural question then arises: is the reparameterization the key to learning long-term memory? We prove that a class of reparameterization functions $f$, which we call stable reparameterizations, enables the stable approximation of nonlinear functionals. This class includes the commonly used exponential and softplus reparameterizations. Furthermore, we ask whether S4’s parameterizations are optimal. We show, in a particular sense concerning optimization stability, that they are not. We propose the optimal one and demonstrate its stability via numerical experiments.

We summarize our main contributions as follows:

  1. We prove that, similar to RNNs, state-space models without reparameterization can only stably approximate targets with exponential decaying memory.

  2. We identify a class of stable reparameterizations that achieves the stable approximation of any nonlinear functionals. Both theoretical and empirical evidence highlight that stable reparameterization is crucial for long-term memory learning.

  3. From the optimization viewpoint, we propose gradient boundedness as the criterion and show that the gradients are bounded by a form that depends on the parameterization. Based on the gradient bound, we solve the resulting differential equation, derive the “best” reparameterization in the stability sense, and verify the stability of this new reparameterization against other parameterization schemes.

Notation.

We use boldface to represent sequences, while normal letters denote scalars, vectors, or functions. Throughout this paper, $\|\cdot\|$ denotes norms over sequences of vectors or over function(al)s, while $|\cdot|$ (with subscripts) denotes the norm of a number, vector, or weight tuple. Here $|x|_\infty := \max_i |x_i|$, $|x|_2 := \sqrt{\sum_i x_i^2}$, and $|x|_1 := \sum_i |x_i|$ are the usual max ($L_\infty$), $L_2$, and $L_1$ norms. We use $m$ to denote the hidden dimension.

2 Background

In this section, we first introduce state-space models and compare them to traditional nonlinear RNNs. Subsequently, we cast sequence modeling as a problem of nonlinear functional approximation. Specifically, we define the theoretical properties we expect from the targets. Moreover, we define the “curse of memory” phenomenon and provide a concise summary of prior theoretical definitions and results concerning RNNs.

2.1 State-space models

State-space models (SSMs) are a family of neural networks specialized in sequence modeling. Unlike Recurrent Neural Networks (RNNs) (Rumelhart et al., 1986), SSMs have layer-wise nonlinearity and linear dynamics within their hidden states. This structure facilitates accelerated computation using the FFT (Gu et al., 2022b) or a parallel scan (Martin & Cundy, 2018). With trainable weights $W \in \mathbb{R}^{m \times m}$, $U \in \mathbb{R}^{m \times d}$, $b, c \in \mathbb{R}^{m}$ and activation function $\sigma(\cdot)$, the simplest SSM maps a $d$-dimensional input sequence $\mathbf{x} = \{x_t\}$ to a 1-dimensional output sequence $\{\hat{y}_t\}$. To simplify our analysis, we utilize the continuous-time framework referenced in Li et al. (2020):

$$\frac{dh_t}{dt} = W h_t + U x_t + b, \quad h_{-\infty} = 0, \qquad \hat{y}_t = c^\top \sigma(h_t), \quad t \in \mathbb{R}. \qquad (3)$$

As detailed in Appendix A, the above form is a simplification of practical SSMs in the sense that practical SSMs can be realized by the stacking of Equation 3.

It is known that multi-layer state-space models are universal approximators (Wang & Xue, 2023; Orvieto et al., 2023a). In particular, when the nonlinearity is added layer-wise, it is sufficient (in the approximation sense) to use a real diagonal $W$ (Gu et al., 2022a; Li et al., 2022). In this paper, we only consider the real diagonal case and denote the matrix by $\Lambda = \mathrm{Diag}(\lambda_1, \dots, \lambda_m)$.

$$\frac{dh_t}{dt} = \Lambda h_t + U x_t + b. \qquad (4)$$

Compared with S4, the major differences lie in the initialization, such as HiPPO (Gu et al., 2020), and the parameter-saving methods, such as DPLR (Gu et al., 2022a) and NPLR (Gu et al., 2022b).
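To make the recurrence concrete, below is a minimal sketch of a single diagonal SSM layer (Equation 4), discretized with a forward-Euler step; the step size, weight shapes, and initialization are illustrative assumptions rather than the configuration used in our experiments.

```python
import numpy as np

def ssm_forward(x, Lambda, U, b, c, delta=0.01, sigma=np.tanh):
    """Single diagonal SSM layer: dh/dt = Lambda*h + U x_t + b, y_t = c^T sigma(h_t).

    x: (T, d) input sequence; Lambda: (m,) diagonal recurrent weights (negative
    for stability); U: (m, d); b, c: (m,). Returns the (T,) output sequence.
    """
    T, _ = x.shape
    h = np.zeros(Lambda.shape[0])
    y_hat = np.empty(T)
    for t in range(T):
        h = h + delta * (Lambda * h + U @ x[t] + b)   # forward-Euler step of Equation 4
        y_hat[t] = c @ sigma(h)                       # nonlinear readout
    return y_hat

# usage: eigenvalues kept strictly negative, e.g. via an exponential reparameterization
rng = np.random.default_rng(0)
m, d, T = 16, 2, 500
Lambda = -np.exp(rng.normal(size=m))
U, b, c = rng.normal(size=(m, d)), rng.normal(size=m), rng.normal(size=m) / m
y = ssm_forward(rng.normal(size=(T, d)), Lambda, U, b, c)
```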

2.2 Sequence modeling as nonlinear functional approximations

Sequence modeling aims to discern the association between an input series, represented as $\mathbf{x} = \{x_t\}$, and its corresponding output series, denoted as $\mathbf{y} = \{y_t\}$. The inputs are continuous bounded sequences vanishing at infinity: $\mathbf{x} \in \mathcal{X} = C_0(\mathbb{R}, \mathbb{R}^d)$ with norm $\|\mathbf{x}\|_\infty := \sup_{t \in \mathbb{R}} |x_t|_\infty$. It is assumed that the output sequence is determined from the inputs via a set of functionals, symbolized as

$$\mathbf{H} = \{H_t : \mathcal{X} \to \mathbb{R} : t \in \mathbb{R}\}, \qquad (5)$$

through the relationship $y_t = H_t(\mathbf{x})$. In essence, the challenge of sequential approximation boils down to estimating the desired functional sequence $\mathbf{H}$ using a different functional sequence $\widehat{\mathbf{H}}$, potentially from a predefined model space such as SSMs.

In this paper we focus on target functionals that are bounded, causal, continuous, regular, and time-homogeneous (time-shift invariant). Formal definitions are given in Section B.1. Continuity, boundedness, time-homogeneity, and causality are important properties for good sequence-to-sequence models to have. Linearity is an important simplification, as many theoretical results from functional analysis become available (Stein & Shakarchi, 2003). Without loss of generality, we assume that the nonlinear functionals satisfy $H_t(\mathbf{0}) = 0$; this can be achieved by studying $H_t^{\mathrm{adjusted}}(\mathbf{x}) = H_t(\mathbf{x}) - H_t(\mathbf{0})$.

2.3 Memory function, stable approximation and curse of memory

The concept of memory has been extensively explored in the academic literature, yet much of the previous work relies on heuristic approaches and empirical testing, particularly in the context of learning long-term memory (Poli et al., 2023). Here we study the memory property from a theoretical perspective.

Our study employs the extended framework proposed by Wang et al. (2023), which specifically focuses on nonlinear RNNs but does not address the case of state-space models. Within the same framework, slightly different memory function and decaying memory concepts enable us to explore the approximation capabilities of nonlinear functionals using SSMs.

Definition 2.1 (Memory function).

For bounded, causal, continuous, regular, and time-homogeneous nonlinear functional sequences $\mathbf{H} = \{H_t : t \in \mathbb{R}\}$ on $\mathcal{X}$, define the memory function of $\mathbf{H}$ over bounded Heaviside inputs $\mathbf{u}^x(t) = x \cdot \mathbf{1}_{\{t \geq 0\}}$ as

$$\mathcal{M}(\mathbf{H})(t) := \sup_{x \neq 0} \frac{\left| \frac{d}{dt} H_t(\mathbf{u}^x) \right|}{|x|_\infty + 1}. \qquad (6)$$

We add 1 in the memory function definition to make it more regular. The memory function of the target functionals is assumed to be finite for all $t \in \mathbb{R}$.

Definition 2.2 (Decaying memory).

The functional sequence $\mathbf{H}$ has decaying memory if

$$\lim_{t \to \infty} \mathcal{M}(\mathbf{H})(t) = 0. \qquad (7)$$

In particular, we say it has exponential (polynomial) decaying memory if there exists a constant $\beta > 0$ such that $\lim_{t \to \infty} e^{\beta t} \mathcal{M}(\mathbf{H})(t) = 0$ ($\lim_{t \to \infty} t^{\beta} \mathcal{M}(\mathbf{H})(t) = 0$).

Similar to Wang et al. (2023), this adjusted memory function definition is also compatible with the memory concept for linear functionals, which is based on the famous Riesz representation theorem (Theorem B.3 in Appendix B). In the linear functional case, this memory function is the impulse response function. It measures the decay speed of the memory of an impulse given at $t = 0$, and serves as a surrogate characterizing the model’s memorization of previous inputs in the hidden states $h_t$ and outputs $y_t$. While a large memory value $\mathcal{M}(t)$ does not mean the model at time $t$ has a clear memory of the previous input $x_0$, a small memory value $\mathcal{M}(t)$ means the model has forgotten the impulse input $x_0$. Therefore, having a slowly decaying memory function $\mathcal{M}(\cdot)$ is a necessary condition for a model with long-term memory. As shown in Section C.1, the nonlinear functionals constructed by state-space models are point-wise continuous over Heaviside inputs. Combined with time-homogeneity, we know that state-space models are nonlinear functionals with decaying memory (see Section C.2).
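As a concrete illustration of Definition 2.1, one can estimate the memory function of a given model numerically by driving it with Heaviside inputs and finite-differencing the output in time. The sketch below assumes the ssm_forward helper from the earlier snippet, replaces the supremum over $x$ with a maximum over a small grid of amplitudes, and is only an approximation of the definition.

```python
import numpy as np

def memory_function(model, T=2000, delta=0.01, d=2, amplitudes=(0.5, 1.0, 2.0)):
    """Finite-difference estimate of M(H)(t) from Definition 2.1.

    `model` maps a (T, d) input sequence to a (T,) output sequence, e.g. the
    ssm_forward sketch above with fixed weights. The sup over x is replaced by
    a max over a finite set of Heaviside amplitudes.
    """
    ts = np.arange(T - 1) * delta
    M = np.zeros(T - 1)
    for amp in amplitudes:
        u = amp * np.ones((T, d))                  # Heaviside input u^x(t) = x * 1_{t >= 0}
        y = model(u)
        dy_dt = np.abs(np.diff(y)) / delta         # |d/dt H_t(u^x)|
        M = np.maximum(M, dy_dt / (amp + 1.0))     # divide by |x|_inf + 1
    return ts, M

# usage (weights from the previous sketch); a rapidly vanishing M(t) indicates
# the model quickly forgets the Heaviside impulse applied at t = 0:
# ts, M = memory_function(lambda u: ssm_forward(u, Lambda, U, b, c))
```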

Definition 2.3 (Functional sequence approximation in Sobolev-type norm).

Given functional sequences $\mathbf{H}$ and $\widehat{\mathbf{H}}$, we consider the approximation in the following Sobolev-type norm (Section B.2):

$$\left\| \mathbf{H} - \widehat{\mathbf{H}} \right\|_{W^{1,\infty}} := \sup_t \left( \|H_t - \widehat{H}_t\|_\infty + \left\| \frac{dH_t}{dt} - \frac{d\widehat{H}_t}{dt} \right\|_\infty \right). \qquad (8)$$
Definition 2.4 (Perturbation error).

For a target $\mathbf{H}$ and parameterized model $\widehat{\mathbf{H}}(\cdot, \theta_m)$, $\theta_m = (\Lambda, U, b, c) \in \Theta_m := \{\mathbb{R}^{m \times m} \times \mathbb{R}^{m \times d} \times \mathbb{R}^{m} \times \mathbb{R}^{m}\}$, we define the perturbation error for hidden dimension $m$:

$$E_m(\beta) := \sup_{\tilde{\theta}_m \in \{\theta : |\theta - \theta_m|_2 \leq \beta\}} \left\| \mathbf{H} - \widehat{\mathbf{H}}(\cdot; \tilde{\theta}_m) \right\|_{W^{1,\infty}}. \qquad (10)$$

In particular, $\widetilde{\mathbf{H}}$ refers to the perturbed model $\widehat{\mathbf{H}}(\cdot; \tilde{\theta}_m)$. Moreover, $E(\beta) := \limsup_{m \to \infty} E_m(\beta)$ is the asymptotic perturbation error. The weight norm for an SSM is $|\theta|_2 := \max(|\Lambda|_2, |U|_2, |b|_2, |c|_2)$.

Based on the definition of perturbation error, we consider the stable approximation as introduced by Wang et al. (2023).

Definition 2.5 (Stable approximation).

Let $\beta_0 > 0$. A target functional sequence $\mathbf{H}$ admits a $\beta_0$-stable approximation if the perturbation error satisfies:

  1. $E(0) = 0$.

  2. $E(\beta)$ is continuous for $\beta \in [0, \beta_0]$.

The condition $E(0) = 0$ means that universal approximation is achieved by the hypothesis space. Stable approximation strengthens universal approximation by requiring the model to be robust against perturbations of the weights. Since stable approximation is a necessary requirement for the optimal parameters to be found by gradient-based optimization, it is a desirable assumption.

The “curse of memory” phenomenon, originally formulated for linear functionals and linear RNNs, is well-documented in prior research (Li et al., 2020, 2022; Jiang et al., 2023). It describes the phenomenon where targets approximated by linear, hardtanh, or tanh RNNs must exhibit exponential decaying memory. However, empirical observations suggest that state-space models, particularly the S4 variant, may possess favorable properties. Thus, it is crucial to ascertain whether the inherent limitations of RNNs can be circumvented by state-space models. Given the impressive performance of state-space models, notably S4, a pivotal question arises: does the model structure of state-space models overcome the “curse of memory”? In the subsequent section, we demonstrate that the model structure alone does not address the curse of memory phenomenon.

3 Main results

In this section, we first prove that, similar to traditional recurrent neural networks (Li et al., 2020; Wang et al., 2023), state-space models without reparameterization suffer from the “curse of memory” problem. This implies that the targets which can be stably approximated by SSMs must have exponential decaying memory. Our analysis reveals that the problem arises from the recurrent weights converging to a stability boundary when learning targets associated with long-term memory. We therefore introduce a class of stable reparameterization techniques to achieve stable approximation of targets with polynomial decaying memory.

Besides the benefit from the approximation perspective, we also discuss the optimization benefit of stable reparameterizations. We show that stable reparameterization makes the gradient scales more balanced, so the optimization of large models can be more stable.

3.1 Curse of memory in SSMs

Figure 1: State-space models without stable reparameterization cannot approximate targets with polynomial decaying memory. Panels: (a) SSM, (b) SoftplusSSM, (c) S4. In (a), the intersections of the lines shift towards the left as the hidden dimension $m$ increases. In (b), SSMs using softplus reparameterization achieve a stable approximation. In (c), S4 stably approximates the target with better stability.
Table 1: Impact of stable reparameterization on approximation and stable approximation. As the reparameterization does not change the hypothesis space of SSMs, both vanilla SSMs and StableSSM are universal approximators. Vanilla SSMs can only stably approximate targets with exponential decay, while StableSSM can stably approximate any target with decaying memory.

                                              Approximation                  Stable approximation
Without reparameterization (Vanilla SSM)      Universal (Wang & Xue, 2023)   Not universal (Thm 3.3)
With stable reparameterization (StableSSM)    Universal (Wang & Xue, 2023)   Universal (Thm 3.5)

In this section, we present a theorem demonstrating that the state-space structure does not alleviate the “curse of memory” phenomenon. State-space models consist of alternately stacked linear RNNs and nonlinear activations. Our result is established for both the shallow and the deep case (Remark C.3). As recurrent models, SSMs without reparameterization continue to exhibit the commonly observed exponential memory decay, as evidenced by empirical findings (Wang & Xue, 2023).

Assumption 3.1.

We assume the hidden states remain uniformly bounded for any input sequence $\mathbf{x}$, irrespective of the hidden dimension $m$. Specifically, this can be expressed as

$$\sup_m \sup_t |h_t|_\infty < \infty. \qquad (11)$$
Assumption 3.2.

We focus on strictly increasing, continuously differentiable nonlinear activations with Lipschitz constant $L_0$. This property holds for activations such as tanh, sigmoid, and softsign $\sigma(z) = \frac{z}{1+|z|}$.

Theorem 3.3 (Curse of memory in SSMs).

Assume $\mathbf{H}$ is a sequence of bounded, causal, continuous, regular, and time-homogeneous functionals on $\mathcal{X}$ with decaying memory. Suppose there exists a sequence of state-space models $\{\widehat{\mathbf{H}}(\cdot, \theta_m)\}_{m=1}^{\infty}$ that $\beta_0$-stably approximates $\mathbf{H}$ in the norm defined in Equation 8. Assume the model weights are uniformly bounded: $\theta_{\max} := \sup_m |\theta_m|_2 < \infty$. Then the memory function $\mathcal{M}(\mathbf{H})(t)$ of the target decays exponentially:

$$\mathcal{M}(\mathbf{H})(t) \leq (d+1) L_0 \theta_{\max}^{2} e^{-\beta t}, \quad t \geq 0,\ \beta < \beta_0. \qquad (12)$$

Here $d$ is the dimension of the input sequences. When generalized to the multi-layer case, the memory function bound induced from an $\ell$-layer SSM is: for some polynomial $P(t)$ of degree at most $\ell - 1$,

$$\mathcal{M}(\mathbf{H})(t) \leq (d+1) L_0^{\ell} \theta_{\max}^{\ell+1} P(t) e^{-\beta t}, \quad t \geq 0,\ \beta < \beta_0. \qquad (13)$$

The proof of Theorem 3.3 is provided in Section C.3. The (continuous-time) stability boundary (discussed in Remark C.1) for $\Lambda$ in state-space models (Equation 4) is $\max_{i \in [m]} \lambda_i(\Lambda) < 0$. This boundary comes from the stability criterion for linear time-invariant systems. Compared with previous results (Li et al., 2020; Wang et al., 2023), the main difference in the proof comes from Lemma C.10, as the activation is in the readout $y_t = c^\top \sigma(h_t)$. Our results provide a more accurate characterization of memory decay, in contrast to previous works that only offer qualitative estimates. A consequence of Theorem 3.3 is that if the target exhibits non-exponential decay (e.g., polynomial decay), the recurrent weights converge to the stability boundary, thereby making the approximation unstable. Finding optimal weights can then become challenging with gradient-based optimization methods, as the optimization process tends to become unstable as the model size increases. The numerical verification is presented in Figure 1 (a). The lines intersect and the intersection points shift towards 0, suggesting that the stable radius $\beta_0$ does not exist. Therefore SSMs without reparameterization cannot stably approximate targets with polynomial decaying memory.

3.2 Stable reparameterization and its advantage in approximation

The proof of Theorem 3.3 suggests that the “curse of memory” arises because the recurrent weights approach a stability boundary. Additionally, our numerical experiments (Figure 1 (c)) show that while plain state-space models suffer from the curse of memory, the commonly used S4 layer (with exponential reparameterization) ameliorates this issue. However, it is not the unique solution. Our findings highlight that the key to achieving a stable approximation is a stable reparameterization, which we define as follows:

Definition 3.4 (Stable reparameterization).

We say a reparameterization scheme $f: \mathbb{R} \to \mathbb{R}$ is stable if there exists a continuous function $g: [0, \infty) \to [0, \infty)$ with $g(0) = 0$ such that

$$\sup_{w} \left[ |f(w)| \sup_{|\tilde{w} - w| \leq \beta} \int_0^\infty \left| e^{f(\tilde{w}) t} - e^{f(w) t} \right| dt \right] \leq g(\beta). \qquad (14)$$

For example, the commonly used reparameterizations (Gu et al., 2022b; Smith et al., 2023) $f(w) = -e^{w}$ and $f(w) = -\log(1 + e^{w})$ are both stable. Verifications are provided in Remark C.4.
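For intuition, the stability condition in Equation 14 can also be checked numerically. Since $f(w), f(\tilde{w}) < 0$, the inner integral evaluates to $|1/f(\tilde{w}) - 1/f(w)|$, so the left-hand side reduces to $\sup_w \sup_{|\tilde{w} - w| \leq \beta} |f(w)/f(\tilde{w}) - 1|$. The sketch below (an illustrative NumPy check over finite grids, not a proof) evaluates this quantity for the exponential, softplus, and direct parameterizations.

```python
import numpy as np

def stability_gap(f, beta, w_grid, n_tilde=201):
    """Evaluate sup_w sup_{|w~-w|<=beta} |f(w)/f(w~) - 1|, the closed form of the
    left-hand side of Equation 14 when f(w), f(w~) < 0."""
    worst = 0.0
    for w in w_grid:
        w_tilde = np.linspace(w - beta, w + beta, n_tilde)
        vals = f(w_tilde)
        if np.any(vals >= 0):          # perturbation crosses the stability boundary
            return np.inf
        worst = max(worst, np.max(np.abs(f(w) / vals - 1.0)))
    return worst

cases = [
    ("exp",      lambda w: -np.exp(w),           np.linspace(-5, 5, 101)),
    ("softplus", lambda w: -np.log1p(np.exp(w)), np.linspace(-5, 5, 101)),
    ("direct",   lambda w: w,                    np.linspace(-5, -0.05, 101)),
]
for name, f, grid in cases:
    print(f"{name:9s} gap at beta=0.1: {stability_gap(f, 0.1, grid):.4f}")
# exp and softplus stay bounded (roughly e^beta - 1), while the direct
# parameterization returns inf once a perturbation can push f(w) past 0.
```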

As depicted in Figure 1 (b), state-space models with stable reparameterization can approximate targets exhibiting polynomial decay in memory. In particular, we prove that under a simplified perturbation setting (perturbing only the recurrent weights), any linear functional can be stably approximated by linear RNNs. This finding under the simplified setting is already significant, as the instability in learning long-term memory comes mainly from the recurrent weights.

Theorem 3.5 (Existence of stable approximation by stable reparameterization).

For any bounded, causal, continuous, regular, time-homogeneous linear functional $\mathbf{H}$, assume $\mathbf{H}$ is approximated by a sequence of linear RNNs $\{\widehat{\mathbf{H}}(\cdot, \theta_m)\}_{m=1}^{\infty}$ with stable reparameterization; then this approximation is a stable approximation.

The proof of Theorem 3.5 is in Section C.4. The generalization to nonlinear functionals with a Volterra-series representation can be achieved similarly (Remark C.5). Compared to Theorem 3.3, Theorem 3.5 underscores the role of stable reparameterization in achieving stable approximation of functionals with long-term memory. Although vanilla SSMs and StableSSM operate within the same hypothesis space, StableSSM demonstrates better stability in approximating any decaying-memory target (Table 1). In contrast, the vanilla SSM is limited to stably approximating targets characterized by exponential memory decay.

Figure 2: The scaling of the layer output bound $|\hat{y}| \leq \frac{c}{1 - \lambda}$ and the gradient bound $|\frac{d\hat{y}}{d\lambda}| \leq \frac{c}{(1 - \lambda)^2}$. The stability boundary is $\lambda = \pm 1$. When the model adapts to learn long-term memory (as $\lambda$ approaches 1), the gradient grows faster than the output. Techniques like layer normalization are insufficient to address this exploding-gradient issue effectively.

3.3 Optimization benefit of stable reparameterization

In the previous section, we discussed the approximation benefit of stable reparameterization in SSMs. Here we study the impact of different parameterizations on optimization stability, in particular, on the gradient scales.

As pointed out by Li et al. (2020, 2022), the approximation of linear functionals by linear RNNs can be reduced to the approximation of an $L_1$-integrable memory function $\rho(t)$ by exponential sums $\hat{\rho}(t) = \sum_{i=1}^{m} c_i e^{-\lambda_i t}$:

$$\rho(t) \approx \sum_{i=1}^{m} c_i e^{-\lambda_i t}, \quad \lambda_i > 0. \qquad (15)$$

Within this framework, $\lambda_i$ is interpreted as a decay mode. From the gradient-based optimization standpoint, and given that learning rates are shared across different decay modes, a fitting characterization of a “good parameterization” emerges: the gradient scale across different memory decay modes should be Lipschitz continuous with respect to the weight scale.

$$|\textrm{Gradient}| := \left| \frac{\partial \textrm{Loss}}{\partial \lambda_i} \right| \leq L |\lambda_i|. \qquad (16)$$

The Lipschitz constant is denoted by $L$. Without this property, the optimization process can be sensitive to the learning rate. We give a detailed discussion in Appendix D. In the following theorem, we characterize the relationship between gradient norms and the recurrent weight parameterization.

Theorem 3.6 (Parameterizations influence the gradient norm scale).

Assume the target functional sequence $\mathbf{H}$ is approximated by a sequence of SSMs $\widehat{\mathbf{H}}_m$. Suppose the (diagonal) recurrent weight matrix is parameterized via $f: \mathbb{R} \to \mathbb{R}$, $f(w) = \lambda$, where $w$ is the trainable weight and $\lambda$ is an eigenvalue of the recurrent weight matrix $\Lambda$. The gradient norm $G_f(w)$ of the weight $w$ is upper bounded by the following function:

$$G_f(w) := \left| \frac{\partial \textrm{Loss}}{\partial w} \right| \leq C_{\mathbf{H}, \widehat{\mathbf{H}}_m} \frac{|f'(w)|}{f(w)^2}. \qquad (17)$$

Here $C_{\mathbf{H}, \widehat{\mathbf{H}}_m}$ is independent of the parameterization $f$, provided that $\mathbf{H}$ and $\widehat{\mathbf{H}}_m$ are fixed. The discrete-time version is

$$G_f^{D}(w) := \left| \frac{\partial \textrm{Loss}}{\partial w} \right| \leq C_{\mathbf{H}, \widehat{\mathbf{H}}_m} \frac{|f'(w)|}{(1 - f(w))^2}. \qquad (18)$$

Refer to Section C.5 for the proof of Theorem 3.6. In Appendix E we summarize common reparameterization methods and corresponding gradient scale functions.
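To make the bound in Theorem 3.6 tangible, the following sketch evaluates the continuous-time gradient-scale function $|f'(w)|/f(w)^2$ for several reparameterizations, with the target-dependent constant $C_{\mathbf{H},\widehat{\mathbf{H}}_m}$ set to 1, derivatives taken by finite differences, and an illustrative grid of weights; the $a = 1, b = 0.5$ choice for the “best” scheme matches Section 3.4.

```python
import numpy as np

def grad_scale_bound(f, w, eps=1e-6):
    """Continuous-time gradient-scale bound |f'(w)| / f(w)^2 from Equation 17,
    with the constant C set to 1; f'(w) estimated by central differences."""
    df = (f(w + eps) - f(w - eps)) / (2 * eps)
    return np.abs(df) / f(w) ** 2

w = np.array([-2.0, -1.0, -0.5, -0.1, -0.01])        # trainable weights (all with f(w) < 0)
reparams = {
    "direct":   lambda w: w,                          # lambda = w
    "exp":      lambda w: -np.exp(w),
    "softplus": lambda w: -np.log1p(np.exp(w)),
    "best":     lambda w: -1.0 / (w ** 2 + 0.5),      # Equation 23 with a = 1, b = 0.5
}
for name, f in reparams.items():
    print(f"{name:9s}", np.round(grad_scale_bound(f, w), 3))
# The direct parameterization's bound 1/w^2 blows up as the eigenvalue approaches
# the stability boundary, whereas the "best" scheme gives 2|w|, matching Equation 16.
```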

Remark 3.7 (Generalization to multi-layer models).

We do not prove the gradient bound for the multi-layer case in this paper; here we sketch how to generalize it. Consider a specific layer in a multi-layer model. Without loss of generality, we have boundedness of the outputs of the previous layer and of the expected inputs for the next layer. If we take the outputs of the previous layer as the inputs and treat the expected inputs of the next layer as the outputs, the gradient of the recurrent weights for this layer obeys a gradient norm bound of the same form as Equation 17. This follows from the fact that the gradient of the selected layer remains unchanged, regardless of whether the remaining layers are frozen or not.

3.4 On the “best” parameterization in stability sense

Figure 3: Panels: (a) Linear functionals, (b) Language model. In panel (a), for the learning of linear functionals with polynomial decaying memory, we show the range of the gradient-over-weight scale during the training of state-space models. The “best” discrete parameterization $f(w) = 1 - \frac{1}{w^2 + 0.5}$ maintains the smallest $\max(\frac{|\textrm{grad}|}{|\textrm{weight}|})$, which is desirable when a large learning rate is used and is crucial for training stability. Similar results can be observed in the language modeling task in panel (b).

According to the criterion given in Equation 16, the “best” stable reparameterization should satisfy the following equation for some constant $L > 0$:

$$G_f(w) \leq C_{\mathbf{H}, \widehat{\mathbf{H}}_m} \frac{|f'(w)|}{f(w)^2} = L |w|. \qquad (19)$$

A sufficient condition for the above criterion is to find a function $f$ that satisfies the following equations for some real $a, b \in \mathbb{R}$:

$$\frac{f'(w)}{f(w)^2} = \frac{d\left(-\frac{1}{f(w)}\right)}{dw} = 2 a w, \qquad (20)$$
$$\frac{1}{f(w)} = -(a w^2 + b), \qquad (21)$$
$$\Rightarrow f(w) = -\frac{1}{a w^2 + b}. \qquad (22)$$

The first equation is solved by integrating $\frac{f'(w)}{f(w)^2}$. Therefore, the “best” parameterization under the Lipschitz requirement on the gradient is characterized by a function with two degrees of freedom; the stability requirement $f(w) \leq 0$ for all $w$ gives

$$f(w) = -\frac{1}{a w^2 + b}, \quad a > 0,\ b \geq 0. \qquad (23)$$

Similarly, the discrete-time case gives the solution $f(w) = 1 - \frac{1}{a w^2 + b}$. The stability of the linear RNN further requires $a > 0$ and $b \geq 0$. We choose $a = 1, b = 0.5$ because this ensures the stability of the hidden-state dynamics and the stable approximation condition in Equation 14. Notice that $\lim_{w \to 0} 1 - \frac{1}{w^2 + 0.5} = -1$, which does not cross the stability boundary $\lambda = -1$. It can be seen in Figure 6 that, compared with the direct and exponential reparameterizations, the softplus reparameterization is generally milder under this gradient-over-weight criterion. The “best” parameterization is optimal in the sense that it has a bounded gradient-over-weight ratio across different weights $w$ (different eigenvalues $\lambda$).
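As a quick sanity check on this derivation, the short symbolic sketch below (using SymPy, with $a$ and $b$ kept symbolic and $w$ taken positive so that $|w| = w$) confirms that the discrete-time choice $f(w) = 1 - \frac{1}{a w^2 + b}$ turns the bound of Equation 18 into a quantity proportional to $|w|$, so the gradient-over-weight ratio stays bounded.

```python
import sympy as sp

w, a, b = sp.symbols("w a b", positive=True)
f = 1 - 1 / (a * w**2 + b)                    # discrete-time "best" reparameterization
bound = sp.Abs(sp.diff(f, w)) / (1 - f)**2    # gradient-scale factor from Equation 18
print(sp.simplify(bound))                     # -> 2*a*w, proportional to |w| (Equation 16)
print(sp.simplify(bound / w))                 # -> 2*a, a bounded gradient-over-weight ratio
```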

Remark 3.8.

Apart from reparameterization, a simple yet effective method is gradient clipping. However, the clipped gradient is biased, so the effectiveness of gradient descent may be reduced. In contrast, reparameterization changes the scale of the gradient update by introducing the preconditioning term $\frac{f'(w)}{f(w)^2}$.

Figure 4: Language models on WikiText-103. Panels: (a) Gradient-over-weight ratio at initialization, (b) Training loss. In the left panel (a), we show the gradient-over-weight ratio ranges for different parameterizations of the recurrent weights in state-space models. The eigenvalues $\lambda$ are initialized to be the same; the only difference is the reparameterization function $f$. In the right panel (b), the “best” parameterization is more stable than the ReLU and exponential reparameterizations. Additional experiments for different learning rates are provided in Figure 7.

4 Numerical verifications

Based on the above analyses, we verify the theoretical statements over synthetic tasks and language models using WikiText-103. The additional numerical details are provided in Appendix F.

4.1 Synthetic tasks

Linear functionals have a clear structure, allowing us to study the differences between parameterizations. Similar to Li et al. (2020) and Wang et al. (2023), we consider linear functional targets $\mathbf{H}$ with the polynomial memory function $\rho(t) = \frac{1}{(t+1)^{1.1}}$: $y_t = H_t(\mathbf{x}) = \int_{-\infty}^{t} \rho(t - s) x_s \, ds$. We use state-space models with tanh activations to learn the sequence relationship. In Figure 3 (a), the eigenvalues $\lambda$ are initialized to be the same; the only difference is the reparameterization function $f(w)$. Training losses across the different reparameterization schemes are similar, but the gradient-over-weight ratios differ in scale.
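For reference, a minimal sketch of how such training pairs can be generated; the discretization step, sequence length, and input distribution below are illustrative assumptions rather than the configuration used in our experiments.

```python
import numpy as np

def polynomial_memory_target(x, delta=0.1):
    """Generate y_t = int_{-inf}^{t} rho(t - s) x_s ds with rho(t) = 1/(t+1)^{1.1},
    discretized as a causal convolution (inputs are zero before t = 0)."""
    T = x.shape[0]
    rho = 1.0 / (np.arange(T) * delta + 1.0) ** 1.1
    y = np.array([np.sum(rho[: t + 1][::-1] * x[: t + 1]) * delta for t in range(T)])
    return y

# usage: white-noise inputs of length 256; (x, y) form one training pair for the SSM
rng = np.random.default_rng(0)
x = rng.normal(size=256)
y = polynomial_memory_target(x)
```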

Table 2: Comparison of stability of different parameterizations over MNIST. The experiments conducted on the MNIST and CIFAR10 datasets were replicated three times, with the standard deviation of the test loss indicated in parentheses.
LR Direct Softplus Exp Best
5e-6 2.314384 (7.19932e-05) 2.241642 (0.001279) 2.241486 (0.001286) 2.241217 (0.001297)
5e-5 2.304331 (2.11817e-07) 0.779663 (0.001801) 0.774661 (0.001685) 0.765220 (0.001352)
5e-4 2.303190 (1.66387e-06) 0.094411 (0.000028) 0.093418 (0.000024) 0.091924 (0.000019)
5e-3 NaN 0.023795 (0.000004) 0.023820 (0.000003) 0.023475 (0.000002)
5e-2 NaN 0.802772 (1.69448) 0.868350 (1.55032) 0.089073 (0.000774)
5e-1 NaN 2.313510 (0.000014) 2.314244 (0.000025) 2.185477 (0.048238)
5e+0 NaN NaN NaN 199.013813 (50690.6)

4.2 Language models

Table 3: Comparison of stability of different parameterizations over CIFAR10
LR Direct Softplus Exp Best
5e-6 NaN 1.745752 (0.000006) 1.745816 (0.000009) 1.745290 (0.000011)
5e-5 NaN 1.220859 (0.000008) 1.218064 (0.000008) 1.215510 (0.000014)
5e-4 NaN 0.883649 (0.000898) 0.866817 (0.000328) 0.870412 (0.000442)
5e-3 NaN 1.449352 (0.000414) 1.567662 (0.021489) 1.364697 (0.013849)
5e-2 NaN 1.942372 (0.011317) 1.846173 (0.007990) 1.713892 (0.013426)
5e-1 NaN 37.802437 (3776.6383) 2.296230 (0.000984) 2.554265 (0.168649)
5e+0 NaN 540.621033 (NaN) NaN 615.374522 (30795.4)

In addition to the synthetic linear functional dataset, we further justify Theorem 3.6 by examining the gradient-over-weight ratios of language models built on state-space models (S5). In particular, we adopt the Hyena (Poli et al., 2023) architecture with the implicit convolution replaced by a simple real-weighted state-space model (Smith et al., 2023).

In Figure 4 (a), given the same initialization, we show that stable reparameterizations such as exponential, softplus, tanh, and “best” exhibit a narrower range of gradient-over-weight ratios than the direct and ReLU reparameterizations. Beyond the gradients at initialization, Figure 3 (b) shows the gradient-over-weight ratios during training. Stable reparameterization yields better-behaved ratios in the sense that the “best” reparameterization maintains the smallest $\max(\frac{|\textrm{grad}|}{|\textrm{weight}|})$. Specifically, as illustrated in Figure 4 (b) and Figure 7, while training with a large learning rate may render the exponential parameterization unstable, the “best” reparameterization $f(w) = 1 - \frac{1}{w^2 + 0.5}$ appears to enhance training stability.

Table 4: Comparison of parameterizations on the Long Range Arena.

                              Listops   Text    Retrieval   Image   Pathfinder   Pathx   Avg
Exp parameterization (S4)     59.60     86.82   90.90       88.65   94.2         96.35   86.09
Best parameterization         60.80     88.5    91.3        87.39   94.8         96.1    86.48

4.3 Image classification

Apart from the gradient-scale ranges shown in the language modeling experiments, we further compare the stability of different parameterization schemes over different initial learning rates. As shown in Table 2 and Table 3, the “best” parameterization can be trained with larger learning rates, while the exp/softplus parameterizations cannot be trained with the largest learning rate (lr = 5.0). Although the models exhibit comparable performance at lower learning rates, the “best” parameterization consistently outperforms the others across a range of learning rates. As training stability issues have been widely reported for larger models (see, e.g., https://github.com/state-spaces/mamba/issues/6 and https://github.com/state-spaces/mamba/issues/22), we believe the improved training stability is an important component in scaling up large language models.

4.4 Long Range Arena

We further verify the effectiveness of stable parameterization over the long range arena, as shown in Table 4. Both the exponential and best parameterizations demonstrate stability, yet the best parameterization delivers slightly superior average performance across the long range arena (LRA) (Tay et al., 2021) benchmark.

5 Related works

RNN

RNNs, introduced by Rumelhart et al. (1986), represent one of the earliest neural network architectures for modeling sequential relationships. Empirical findings by Bengio et al. (1994) shed light on the challenge of exponential decaying memory in RNNs. Various works (Hochreiter & Schmidhuber, 1997; Rusch & Mishra, 2022; Wang & Yan, 2023) have sought to improve the memory patterns of recurrent models. Theoretical approaches (Li et al., 2020, 2022; Wang et al., 2023) have been taken to study the exponential memory decay of RNNs. In this paper, we study state-space models, which are also recurrent. Our findings theoretically justify that, although SSM variants exhibit good numerical performance in long-sequence modeling (Gu et al., 2022b), simple SSMs also suffer from the “curse of memory”.

SSM

State-space models (Siivola & Honkela, 2003), previously discussed in control theory, have been widely used to study the dynamics of complex systems. Subsequent variants such as S4 (Gu et al., 2022b), S5 (Smith et al., 2023), RetNet (Sun et al., 2023) and Mamba (Gu & Dao, 2023) have significantly enhanced empirical performance. Notably, they excel in the long range arena (Tay et al., 2021), an area where transformers traditionally underperform. Contrary to the initial presumption, our investigation discloses that the ability to learn long-term memory is not derived from the linear RNN coupled with nonlinear layer-wise activations. Rather, our study underscores the benefits of stable reparameterization in both approximation and optimization.

Fading memory

This paper studies targets with decaying memory. A slightly different memory concept, fading memory, has been studied in the literature (Boyd et al., 1984; Boyd & Chua, 1985). A critical difference is that fading memory is defined with respect to a particular weight function, while decaying memory is defined without a specific weight function. Although both concepts characterize the speed at which the target's memory decays, they are distinct: there are examples with decaying memory but not fading memory (the peak-hold operator introduced in Boyd & Chua (1985)) and vice versa (examples with fading memory but not decaying memory are detailed in Appendix A.7 of Wang et al. (2023)).

6 Conclusion

In this paper, we study the intricacies of long-term memory learning in state-space models, specifically emphasizing the role of the parameterization of recurrent weights. We prove that state-space models without reparameterization fail to stably approximate targets that exhibit non-exponentially decaying memory. Our analysis indicates that this “curse of memory” phenomenon is caused by the eigenvalues of the recurrent weight matrices converging to the stability boundary. As an alternative, we introduce a class of stable reparameterizations as a robust solution to this challenge, which also partially explains the performance of S4. With stable reparameterization, state-space models can stably approximate any target with decaying memory. We also explore the optimization advantages associated with stable reparameterization, especially concerning the gradient-over-weight scale. Our results give theoretical support to the observed advantages of reparameterization in S4 and, moreover, provide a principled method to design the “best” reparameterization scheme in the sense of optimization stability. This paper shows that stable reparameterization not only enables the learning of targets with long-term memory but also enhances optimization stability.

Acknowledgements

This research is supported by the National Research Foundation, Singapore, under the NRF fellowship (project No. NRF-NRFF13-2021-0005). Shida Wang is supported by NUS-RMI Scholarship.

Impact Statement

This paper studies the approximation and optimization properties of parameterization in state-space models. This paper presents work whose goal is to advance the field of Machine Learning. There are minor potential societal consequences of our work, none of which we feel must be specifically highlighted here.

References

  • Bengio et al. (1994) Bengio, Y., Simard, P., and Frasconi, P. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157–166, March 1994. ISSN 1941-0093. doi: 10.1109/72.279181.
  • Boyd & Chua (1985) Boyd, S. and Chua, L. Fading memory and the problem of approximating nonlinear operators with Volterra series. IEEE Transactions on Circuits and Systems, 32(11):1150–1161, November 1985. ISSN 0098-4094. doi: 10.1109/TCS.1985.1085649.
  • Boyd et al. (1984) Boyd, S., Chua, L. O., and Desoer, C. A. Analytical Foundations of Volterra Series. IMA Journal of Mathematical Control and Information, 1(3):243–282, January 1984. ISSN 0265-0754. doi: 10.1093/imamci/1.3.243.
  • Brown et al. (2020) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., and Askell, A. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
  • Connor et al. (1994) Connor, J. T., Martin, R. D., and Atlas, L. E. Recurrent neural networks and robust time series prediction. IEEE transactions on neural networks, 5(2):240–254, 1994.
  • Gu & Dao (2023) Gu, A. and Dao, T. Mamba: Linear-Time Sequence Modeling with Selective State Spaces, December 2023.
  • Gu et al. (2020) Gu, A., Dao, T., Ermon, S., Rudra, A., and Ré, C. HiPPO: Recurrent Memory with Optimal Polynomial Projections. In Advances in Neural Information Processing Systems, volume 33, pp.  1474–1487. Curran Associates, Inc., 2020.
  • Gu et al. (2022a) Gu, A., Goel, K., Gupta, A., and Ré, C. On the Parameterization and Initialization of Diagonal State Space Models. Advances in Neural Information Processing Systems, 35:35971–35983, December 2022a.
  • Gu et al. (2022b) Gu, A., Goel, K., and Re, C. Efficiently Modeling Long Sequences with Structured State Spaces. In International Conference on Learning Representations, January 2022b.
  • Hochreiter (1998) Hochreiter, S. The Vanishing Gradient Problem During Learning Recurrent Neural Nets and Problem Solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 06(02):107–116, April 1998. ISSN 0218-4885, 1793-6411. doi: 10.1142/S0218488598000094.
  • Hochreiter & Schmidhuber (1997) Hochreiter, S. and Schmidhuber, J. Long Short-term Memory. Neural computation, 9:1735–80, December 1997. doi: 10.1162/neco.1997.9.8.1735.
  • Jiang et al. (2023) Jiang, H., Li, Q., Li, Z., and Wang, S. A Brief Survey on the Approximation Theory for Sequence Modelling. Journal of Machine Learning, 2(1):1–30, June 2023. ISSN 2790-203X, 2790-2048. doi: 10.4208/jml.221221.
  • Li et al. (2019) Li, Y., Wei, C., and Ma, T. Towards explaining the regularization effect of initial large learning rate in training neural networks. Advances in Neural Information Processing Systems, 32, 2019.
  • Li et al. (2020) Li, Z., Han, J., E, W., and Li, Q. On the Curse of Memory in Recurrent Neural Networks: Approximation and Optimization Analysis. In International Conference on Learning Representations, October 2020.
  • Li et al. (2022) Li, Z., Han, J., E, W., and Li, Q. Approximation and Optimization Theory for Linear Continuous-Time Recurrent Neural Networks. Journal of Machine Learning Research, 23(42):1–85, 2022. ISSN 1533-7928.
  • Martin & Cundy (2018) Martin, E. and Cundy, C. Parallelizing Linear Recurrent Neural Nets Over Sequence Length. In International Conference on Learning Representations, February 2018.
  • Merity et al. (2016) Merity, S., Xiong, C., Bradbury, J., and Socher, R. Pointer Sentinel Mixture Models. In International Conference on Learning Representations, 2016.
  • Orvieto et al. (2023a) Orvieto, A., De, S., Gulcehre, C., Pascanu, R., and Smith, S. L. On the universality of linear recurrences followed by nonlinear projections. arXiv preprint arXiv:2307.11888, 2023a.
  • Orvieto et al. (2023b) Orvieto, A., Smith, S. L., Gu, A., Fernando, A., Gulcehre, C., Pascanu, R., and De, S. Resurrecting recurrent neural networks for long sequences. In Proceedings of the 40th International Conference on Machine Learning, volume 202 of ICML’23, pp.  26670–26698. JMLR.org, July 2023b.
  • Peng et al. (2023) Peng, B., Alcaide, E., Anthony, Q., Albalak, A., Arcadinho, S., Cao, H., Cheng, X., Chung, M., Grella, M., GV, K. K., et al. RWKV: Reinventing RNNs for the transformer era. arXiv preprint arXiv:2305.13048, 2023.
  • Poli et al. (2023) Poli, M., Massaroli, S., Nguyen, E., Fu, D. Y., Dao, T., Baccus, S., Bengio, Y., Ermon, S., and Re, C. Hyena Hierarchy: Towards Larger Convolutional Language Models. In International Conference on Machine Learning, June 2023.
  • Rumelhart et al. (1986) Rumelhart, D. E., Hinton, G. E., and Williams, R. J. Learning representations by back-propagating errors. Nature, 323(6088):533–536, October 1986. ISSN 1476-4687. doi: 10.1038/323533a0.
  • Rusch & Mishra (2022) Rusch, T. K. and Mishra, S. Coupled Oscillatory Recurrent Neural Network (coRNN): An accurate and (gradient) stable architecture for learning long time dependencies. In International Conference on Learning Representations, February 2022.
  • Siivola & Honkela (2003) Siivola, V. and Honkela, A. A state-space method for language modeling. In 2003 IEEE Workshop on Automatic Speech Recognition and Understanding (IEEE Cat. No.03EX721), pp.  548–553, St Thomas, VI, USA, 2003. IEEE. ISBN 978-0-7803-7980-0. doi: 10.1109/ASRU.2003.1318499.
  • Smith et al. (2023) Smith, J. T. H., Warrington, A., and Linderman, S. Simplified State Space Layers for Sequence Modeling. In International Conference on Learning Representations, February 2023.
  • Smith & Topin (2019) Smith, L. N. and Topin, N. Super-convergence: Very fast training of neural networks using large learning rates. In Artificial Intelligence and Machine Learning for Multi-Domain Operations Applications, volume 11006, pp.  369–386. SPIE, 2019.
  • Stein & Shakarchi (2003) Stein, E. M. and Shakarchi, R. Princeton Lectures in Analysis. Princeton University Press Princeton, 2003.
  • Sun et al. (2023) Sun, Y., Dong, L., Huang, S., Ma, S., Xia, Y., Xue, J., Wang, J., and Wei, F. Retentive network: A successor to transformer for large language models. arXiv preprint arXiv:2307.08621, 2023.
  • Sutskever et al. (2011) Sutskever, I., Martens, J., and Hinton, G. Generating Text with Recurrent Neural Networks. In International Conference on Machine Learning, pp.  1017–1024, January 2011.
  • Tay et al. (2021) Tay, Y., Dehghani, M., Abnar, S., Shen, Y., Bahri, D., Pham, P., Rao, J., Yang, L., Ruder, S., and Metzler, D. Long Range Arena : A Benchmark for Efficient Transformers. In International Conference on Learning Representations, January 2021.
  • Tolimieri et al. (1989) Tolimieri, R., An, M., and Lu, C. Algorithms for Discrete Fourier Transform and Convolution. Signal Processing and Digital Filtering. Springer New York, New York, NY, 1989. ISBN 978-1-4757-3856-8 978-1-4757-3854-4. doi: 10.1007/978-1-4757-3854-4.
  • Vaswani et al. (2017) Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is All you Need. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017.
  • Wang & Xue (2023) Wang, S. and Xue, B. State-space models with layer-wise nonlinearity are universal approximators with exponential decaying memory. In Thirty-Seventh Conference on Neural Information Processing Systems, November 2023.
  • Wang & Yan (2023) Wang, S. and Yan, Z. Improve long-term memory learning through rescaling the error temporally. arXiv preprint arXiv:2307.11462, 2023.
  • Wang et al. (2023) Wang, S., Li, Z., and Li, Q. Inverse Approximation Theory for Nonlinear Recurrent Neural Networks. In The Twelfth International Conference on Learning Representations, October 2023.

Appendix A Graphical demonstration of state-space models as a stack of Equation 3

Here we show that Equation 3 corresponds to the practical instantiation of SSM-based models in the following sense: as shown in Figure 5, any practical instantiation of SSM-based models can be implemented as a stack of Equation 3. In particular, a pointwise shallow MLP can be realized by two layers of state-space models with layer-wise nonlinearity by setting the recurrent weights $W$ to be $0$.

Figure 5: An MLP can be realized by two-layer state-space models. The superscript indicates the layer while the subscript indicates the time index. It can be seen that the MLP is equivalent to SSMs having zero recurrent weights $W_1 = W_2 = 0$.
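For concreteness, the following sketch (a toy discrete-time instance with tanh activation; shapes and weights are illustrative assumptions, not the paper's exact setup) checks that stacking two SSM layers with zero recurrent weights reproduces a pointwise two-layer MLP:

```python
import torch

# A minimal sketch of the claim in Figure 5: with zero recurrent weights, two
# stacked SSM layers collapse to a pointwise two-layer MLP applied
# independently at each time step. Shapes and weights are arbitrary choices.

torch.manual_seed(0)
d, m, T = 4, 8, 16
U1, U2 = torch.randn(m, d), torch.randn(d, m)

def ssm_layer(x, W, U):
    """h_t = W h_{t-1} + U x_t followed by a pointwise nonlinearity."""
    h = torch.zeros(U.shape[0])
    outputs = []
    for t in range(x.shape[0]):
        h = W @ h + U @ x[t]
        outputs.append(torch.tanh(h))
    return torch.stack(outputs)

x = torch.randn(T, d)
two_layer_ssm = ssm_layer(ssm_layer(x, torch.zeros(m, m), U1), torch.zeros(d, d), U2)
mlp = torch.tanh(torch.tanh(x @ U1.T) @ U2.T)   # the same map, time step by time step
print(torch.allclose(two_layer_ssm, mlp))       # True: zero W removes all recurrence
```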

Appendix B Theoretical backgrounds

In this section, we collect the definitions for the theoretical statements.

B.1 Properties of targets

We first introduce the definitions of (sequences of) functionals as discussed in Wang et al. (2023).

Definition B.1.

Let $\mathbf{H} = \{H_t : \mathcal{X} \mapsto \mathbb{R};\; t \in \mathbb{R}\}$ be a sequence of functionals.

  1. (Linear) $H_t$ is a linear functional if for any $\lambda, \lambda' \in \mathbb{R}$ and $\mathbf{x}, \mathbf{x}' \in \mathcal{X}$, $H_t(\lambda \mathbf{x} + \lambda' \mathbf{x}') = \lambda H_t(\mathbf{x}) + \lambda' H_t(\mathbf{x}')$.

  2. (Continuous) $H_t$ is a continuous functional if for any $\mathbf{x}, \mathbf{x}' \in \mathcal{X}$, $\lim_{\mathbf{x}' \to \mathbf{x}} |H_t(\mathbf{x}') - H_t(\mathbf{x})| = 0$.

  3. (Bounded) $H_t$ is a bounded functional if the norm of the functional $\|H_t\|_{\infty} := \sup_{\{\mathbf{x} \neq 0\}} \frac{|H_t(\mathbf{x})|}{\|\mathbf{x}\|_{\infty} + 1} + |H_t(\mathbf{0})| < \infty$.

  4. (Time-homogeneous) $\mathbf{H} = \{H_t : t \in \mathbb{R}\}$ is time-homogeneous (or time-shift-equivariant) if the input-output relationship commutes with time shift: let $[S_{\tau}(\mathbf{x})]_t = x_{t-\tau}$ be a shift operator, then $\mathbf{H}(S_{\tau}\mathbf{x}) = S_{\tau}\mathbf{H}(\mathbf{x})$.

  5. (Causal) $H_t$ is a causal functional if it does not depend on future values of the input. That is, if $\mathbf{x}, \mathbf{x}'$ satisfy $x_t = x_t'$ for any $t \leq t_0$, then $H_t(\mathbf{x}) = H_t(\mathbf{x}')$ for any $t \leq t_0$.

  6. (Regular) $H_t$ is a regular functional if for any sequence $\{\mathbf{x}^{(n)} : n \in \mathbb{N}\}$ such that $x_s^{(n)} \to 0$ for almost every $s \in \mathbb{R}$, we have $\lim_{n \to \infty} H_t(\mathbf{x}^{(n)}) = 0$.
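As a sanity check of these definitions (not part of the paper's development), the short NumPy sketch below instantiates a toy causal convolution functional with an exponentially decaying kernel on discretized inputs and verifies the linearity, time-homogeneity, and causality properties; the kernel and horizon are arbitrary choices:

```python
import numpy as np

# A toy discretized functional, not from the paper: a causal convolution
# H_t(x) = sum_{s <= t} rho(t - s) x_s with an exponentially decaying kernel,
# checked against the Linear, Time-homogeneous and Causal items of Definition B.1.

T = 200
rho = np.exp(-0.1 * np.arange(T))            # decaying memory kernel (assumed)

def H(x):
    # causal convolution: the output at time t depends only on x[0..t]
    return np.convolve(x, rho)[: len(x)]

x, xp = np.random.randn(T), np.random.randn(T)

# Linear: H(2x + 3x') = 2 H(x) + 3 H(x')
assert np.allclose(H(2 * x + 3 * xp), 2 * H(x) + 3 * H(xp))

# Time-homogeneous: shifting the input shifts the output by the same lag
tau = 5
x_shift = np.concatenate([np.zeros(tau), x[:-tau]])
assert np.allclose(H(x_shift)[tau:], H(x)[:-tau])

# Causal: perturbing the input after t = 100 leaves outputs before t = 100 unchanged
x_future = x.copy()
x_future[100:] += 1.0
assert np.allclose(H(x)[:100], H(x_future)[:100])
print("Definition B.1 checks passed for the toy functional.")
```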

B.2 Approximation in Sobolev norm

Definition B.2.

In sequence modeling viewed as a nonlinear functional approximation problem, we consider the Sobolev norm of the functional sequence defined as follows:

$$\left\|\mathbf{H} - \widehat{\mathbf{H}}\right\|_{W^{1,\infty}} = \sup_t \left( \|H_t - \widehat{H}_t\|_{\infty} + \left\|\frac{dH_t}{dt} - \frac{d\widehat{H}_t}{dt}\right\|_{\infty} \right). \quad (24)$$

Here $\mathbf{H} = \{H_t : t \in \mathbb{R}\}$ is the target functional sequence to be approximated while $\widehat{\mathbf{H}} = \{\widehat{H}_t : t \in \mathbb{R}\}$ is the model we use.

In particular, the nonlinear functional operator norm is given by:

$$\|H_t\|_{\infty} := \sup_{\mathbf{x} \neq \mathbf{0}} \frac{|H_t(\mathbf{x})|}{\|\mathbf{x}\|_{\infty} + 1} + |H_t(\mathbf{0})|. \quad (25)$$

As $\mathbf{H}(\mathbf{0}) = 0$, $\|H_t\|_{\infty}$ reduces to $\sup_{\mathbf{x} \neq 0} \frac{|H_t(\mathbf{x})|}{\|\mathbf{x}\|_{\infty} + 1}$. If $\mathbf{H}$ is a linear functional, this definition is compatible with the common linear functional norm in Equation 39.

We check that the operator norm in Equation 25 is indeed a norm. Without loss of generality, we drop the time index for brevity.

  1. Triangle inequality: for nonlinear functionals $H_1$ and $H_2$,
  $$\|H_1 + H_2\|_{\infty} := \sup_{\mathbf{x} \neq \mathbf{0}} \frac{|(H_1 + H_2)(\mathbf{x})|}{\|\mathbf{x}\|_{\infty} + 1} \leq \sup_{\mathbf{x} \neq \mathbf{0}} \frac{|H_1(\mathbf{x})|}{\|\mathbf{x}\|_{\infty} + 1} + \sup_{\mathbf{x} \neq \mathbf{0}} \frac{|H_2(\mathbf{x})|}{\|\mathbf{x}\|_{\infty} + 1} = \|H_1\|_{\infty} + \|H_2\|_{\infty}. \quad (26\text{--}27)$$
  The inequality follows from the property of the supremum.

  2. Absolute homogeneity: for any real constant $s$ and nonlinear functional $H$,
  $$\|sH\|_{\infty} := \sup_{\mathbf{x} \neq \mathbf{0}} \frac{|(sH)(\mathbf{x})|}{\|\mathbf{x}\|_{\infty} + 1} = |s| \sup_{\mathbf{x} \neq \mathbf{0}} \frac{|H(\mathbf{x})|}{\|\mathbf{x}\|_{\infty} + 1} = |s| \|H\|_{\infty}. \quad (28)$$

  3. Positive definiteness: if $\|H\|_{\infty} = 0$, then for all non-zero inputs $\mathbf{x} \neq \mathbf{0}$ we have $H(\mathbf{x}) = 0$. As $H(\mathbf{0}) = 0$, we conclude that $H$ is the zero functional.

Property of nonlinear functional sequence norm

The functional product is defined element-wise: $(\mathbf{H}_1 \mathbf{H}_2)(\mathbf{x}) = \mathbf{H}_1(\mathbf{x}) \odot \mathbf{H}_2(\mathbf{x})$. As the (per-time) functional norm satisfies

$$\begin{aligned}
\|H_1 H_2\|_{\infty} &:= \sup_{\mathbf{x} \neq 0} \frac{|H_1(\mathbf{x}) H_2(\mathbf{x})|}{\|\mathbf{x}\|_{\infty} + 1} + |H_1(\mathbf{0}) H_2(\mathbf{0})| \\
&\leq \sup_{\mathbf{x} \neq \mathbf{0}} \frac{|H_1(\mathbf{x})|}{\|\mathbf{x}\|_{\infty} + 1} \frac{|H_2(\mathbf{x})|}{\|\mathbf{x}\|_{\infty} + 1} + |H_1(\mathbf{0})| \cdot |H_2(\mathbf{0})| \\
&\leq \sup_{\mathbf{x} \neq \mathbf{0}} \left( \frac{|H_1(\mathbf{x})|}{\|\mathbf{x}\|_{\infty} + 1} + |H_1(\mathbf{0})| \right) \sup_{\mathbf{x} \neq \mathbf{0}} \left( \frac{|H_2(\mathbf{x})|}{\|\mathbf{x}\|_{\infty} + 1} + |H_2(\mathbf{0})| \right) \\
&= \|H_1\|_{\infty} \|H_2\|_{\infty}, \qquad (29\text{--}32)
\end{aligned}$$

we therefore have

$$\begin{aligned}
\|\mathbf{H}_1 \mathbf{H}_2\|_{\infty} &= \sup_t \left( \|H_1 H_2\|_{\infty} + \left\| \frac{d(H_1 H_2)}{dt} \right\|_{\infty} \right) \\
&\leq \sup_t \left( \|H_1 H_2\|_{\infty} + \left\| H_1 \frac{dH_2}{dt} \right\|_{\infty} + \left\| \frac{dH_1}{dt} H_2 \right\|_{\infty} \right) \\
&\leq \sup_t \left( \|H_1\|_{\infty} \|H_2\|_{\infty} + \|H_1\|_{\infty} \left\| \frac{dH_2}{dt} \right\|_{\infty} + \left\| \frac{dH_1}{dt} \right\|_{\infty} \|H_2\|_{\infty} \right) \\
&\leq \sup_t \left( \|H_1\|_{\infty} + \left\| \frac{dH_1}{dt} \right\|_{\infty} \right) \sup_t \left( \|H_2\|_{\infty} + \left\| \frac{dH_2}{dt} \right\|_{\infty} \right) \\
&= \|\mathbf{H}_1\|_{\infty} \|\mathbf{H}_2\|_{\infty}. \qquad (33\text{--}37)
\end{aligned}$$

B.3 Riesz representation theorem for linear functional

Theorem B.3 (Riesz-Markov-Kakutani representation theorem).

Assume $H : C_0(\mathbb{R}, \mathbb{R}^d) \mapsto \mathbb{R}$ is a linear and continuous functional. Then there exists a unique, vector-valued, regular, countably additive signed measure $\mu$ on $\mathbb{R}$ such that

$$H(\mathbf{x}) = \int_{\mathbb{R}} x_s^{\top} d\mu(s) = \sum_{i=1}^{d} \int_{\mathbb{R}} x_{s,i}\, d\mu_i(s). \quad (38)$$

In addition, we have the linear functional norm

$$\|H\|_{\infty} := \sup_{\|\mathbf{x}\|_{\mathcal{X}} \leq 1} |H(\mathbf{x})| = \|\mu\|_1(\mathbb{R}) := \sum_i |\mu_i|(\mathbb{R}). \quad (39)$$

In particular, this linear functional norm is compatible with the norm considered for nonlinear functionals in Equation 25.
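As a concrete illustration (a standard special case stated here for intuition; it is not part of the proofs), suppose the representing measure has an integrable density, $d\mu(s) = \rho(s)\, ds$ with $\rho \in L^1(\mathbb{R}, \mathbb{R}^d)$. Then

$$H(\mathbf{x}) = \int_{\mathbb{R}} x_s^{\top} \rho(s)\, ds, \qquad \|H\|_{\infty} = \sum_{i=1}^{d} \int_{\mathbb{R}} |\rho_i(s)|\, ds.$$

This is the form realized by a linear state-space model $\frac{dh_t}{dt} = \Lambda h_t + U x_t$, $y_t = c^{\top} h_t$, whose density at time $t$ is $\rho_t(s) = U^{\top} e^{\Lambda^{\top}(t-s)} c$ for $s \leq t$ and zero otherwise, so the norm above is finite exactly when this memory kernel is integrable.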

Appendix C Proofs for theorems and lemmas

In Section C.1, we show that the nonlinear functionals defined by state-space models are point-wise continuous functionals at Heaviside inputs. In Section C.3, we give the proof of the exponentially decaying memory property of state-space models. In Section C.4, we prove that linear RNNs with stable reparameterization can stably approximate any linear functional; the target is no longer limited to having an exponentially decaying memory. The gradient norm estimate for the recurrent layer is included in Section C.5.

C.1 Proof for SSMs are point-wise continuous functionals

Proof.

Let $\mathbf{x}$ be any fixed Heaviside input. Assume $\lim_{k \to \infty} \|\mathbf{x}_k - \mathbf{x}\|_{\infty} = 0$. Let $h_{k,t}$ and $h_t$ be the hidden states for inputs $\mathbf{x}_k$ and $\mathbf{x}$. Without loss of generality, assume $t > 0$. In the following, $|\cdot|$ refers to the $p = \infty$ norm.

By the definition of the hidden-state dynamics and the triangle inequality, since $\sigma(\cdot)$ is Lipschitz continuous,

$$\begin{aligned}
\frac{d|h_{k,t} - h_t|}{dt} &\leq |\sigma(\Lambda h_{k,t} + U x_{k,t}) - \sigma(\Lambda h_t + U x_t)| \\
&\leq L |\Lambda h_{k,t} + U x_{k,t} - \Lambda h_t - U x_t| \\
&= L |\Lambda (h_{k,t} - h_t) + U (x_{k,t} - x_t)| \\
&\leq L \left( |\Lambda| |h_{k,t} - h_t| + |U| |x_{k,t} - x_t| \right). \qquad (40\text{--}43)
\end{aligned}$$

Here $L$ is the Lipschitz constant of the activation $\sigma$. Applying the Grönwall inequality to the above, we have:

$$|h_{k,t} - h_t| \leq \int_0^t e^{L|\Lambda|(t-s)} L |U| \, |x_{k,s} - x_s| \, ds. \quad (44)$$

As the inputs are bounded, by the dominated convergence theorem the right-hand side converges to $0$, therefore

$$\lim_{k \to \infty} |h_{k,t} - h_t| = 0, \quad \forall t. \quad (45)$$

Let $y_{k,t}$ and $y_t$ be the outputs for inputs $\mathbf{x}_k$ and $\mathbf{x}$. We then obtain the point-wise convergence of $\frac{dH_t}{dt}$ at $\mathbf{x}$:

$$\begin{aligned}
\lim_{k \to \infty} \left| \frac{dy_{k,t}}{dt} - \frac{dy_t}{dt} \right| &= \lim_{k \to \infty} \left| c^{\top} \left( \frac{dh_{k,t}}{dt} - \frac{dh_t}{dt} \right) \right| \\
&\leq \lim_{k \to \infty} |c| L \left( |\Lambda| |h_{k,t} - h_t| + |U| |x_{k,t} - x_t| \right) = 0. \qquad (46\text{--}47)
\end{aligned}$$
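The following numerical sketch (a toy diagonal SSM with tanh activation and forward-Euler integration; dimensions, inputs and step size are arbitrary assumptions, not from the paper) illustrates the Grönwall-type bound above: as the input perturbation shrinks, the hidden-state discrepancy at the final time shrinks with it.

```python
import numpy as np

# A numerical sketch of the continuity argument above: the toy dynamics are
# dh/dt = sigma(Lambda h + U x) with a Lipschitz activation, and we observe
# that the hidden-state gap decreases as the input gap decreases.

np.random.seed(0)
m, d, T, dt = 8, 2, 10.0, 1e-3
Lam = -np.random.rand(m) - 0.1              # diagonal recurrent weights
U = np.random.randn(m, d)
steps = int(T / dt)

def hidden_state(x_fn):
    h = np.zeros(m)
    for n in range(steps):
        t = n * dt
        h = h + dt * np.tanh(Lam * h + U @ x_fn(t))   # forward-Euler step
    return h

x = lambda t: np.array([np.sin(t), np.cos(t)])
h_T = hidden_state(x)
for eps in [1e-1, 1e-2, 1e-3]:
    x_k = lambda t, e=eps: x(t) + e          # perturbed input with ||x_k - x|| = eps
    gap = np.abs(hidden_state(x_k) - h_T).max()
    print(f"input gap {eps:.0e}  ->  hidden-state gap {gap:.2e}")
```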

C.2 Point-wise continuity leads to decaying memory

Here we give the proof of decaying memory based on the point-wise continuity of $\frac{dH_t}{dt}$ and the boundedness and time-homogeneity of $\mathbf{H}$:

Proof.

$$\lim_{t \to \infty} \left| \frac{dH_t}{dt}(\mathbf{u}^x) \right| = \lim_{t \to \infty} \left| \frac{dH_0}{dt}\left(x \cdot \bm{1}_{\{s \geq -t\}}\right) \right| = \left| \frac{dH_0}{dt}(\mathbf{x}) \right| = 0.$$

The first equality comes from time-homogeneity. The second equality is derived from the point-wise continuity, where the input $\mathbf{x}$ denotes the constant value $x$ for all time, $\mathbf{x} = x \cdot \bm{1}_{\{s \geq -\infty\}}$. The third equality is based on boundedness and time-homogeneity, as the output over a constant input should be finite and constant, $H_t(\mathbf{x}) = H_s(\mathbf{x})$ for all $s, t$. Therefore $\left|\frac{dH_0}{dt}(\mathbf{x})\right| = 0$. ∎

C.3 Proof for Theorem 3.3

The main idea of the proof is two-fold. First, we show in Lemma C.10 that state-space models with a strictly monotone activation have decaying memory. Next, the idea of analyzing the memory functions through a transform from $[0, \infty)$ to $(0, 1]$ is similar to previous works (Li et al., 2020, 2022; Wang et al., 2023). The remainder of the proof follows a standard approach, as the derivatives of the hidden states follow the rules of linear dynamical systems when Heaviside inputs are considered.

Proof.

Assume the inputs considered are uniformly bounded by $X_0$:

$$\|\mathbf{x}\|_{\infty} < X_0. \quad (48)$$

Define the derivative of the hidden states for the unperturbed model to be $v_{m,t} = \frac{dh_{m,t}}{dt}$. Similarly, $\tilde{v}_{m,t} = \frac{d\tilde{h}_{m,t}}{dt}$ is the derivative of the hidden states for the perturbed models.

Since each perturbed model has decaying memory and the target functional sequence $\mathbf{H}$ has a stable approximation, by Lemma C.10 we have

$$\lim_{t \to \infty} \tilde{v}_{m,t} = 0, \quad \forall m. \quad (49)$$

If the inputs are limited to Heaviside inputs, the derivative $\tilde{v}_{m,t}$ satisfies the following dynamics (notice that the hidden state satisfies $h_t = 0$ for $t \in (-\infty, 0]$):

$$\begin{aligned}
\frac{d\tilde{v}_{m,t}}{dt} &= \widetilde{\Lambda}_m \tilde{v}_{m,t}, \quad t \geq 0 \\
\tilde{v}_{m,0} &= \widetilde{\Lambda}_m h_0 + \widetilde{U}_m x_0 + \tilde{b}_m = \widetilde{U}_m x_0 + \tilde{b}_m \\
\Rightarrow \tilde{v}_{m,t} &= e^{\widetilde{\Lambda}_m t} \left( \widetilde{U}_m x_0 + \tilde{b}_m \right). \qquad (50\text{--}52)
\end{aligned}$$

Notice that the perturbed initial conditions $\tilde{v}_{m,0}$ are uniformly (in $m$) bounded:

$$\tilde{V}_0 := \sup_m |\tilde{v}_{m,0}|_2 = \sup_m |\widetilde{U}_m x_0 + \tilde{b}_m|_2 \leq d X_0 \left( \sup_m \|U_m\|_2 + \beta_0 \right) + \sup_m \|b_m\|_2 + \beta_0 < \infty. \qquad (53\text{--}57)$$

Here $d$ is the input sequence dimension.

Similarly, the unperturbed initial conditions satisfy:

$$V_0 := \sup_m |v_{m,0}|_2 = \sup_m |U_m x_0 + b_m|_2 \leq d X_0 \sup_m \|U_m\|_2 + \sup_m \|b_m\|_2 \leq (d X_0 + 1)\, \theta_{\max} < \infty. \qquad (58\text{--}63)$$

Select a sequence of perturbed recurrent matrices $\{\widetilde{\Lambda}_{m,k}\}_{k=1}^{\infty}$ satisfying the following two properties:

  1. $\widetilde{\Lambda}_{m,k}$ is hyperbolic, which means the real parts of the eigenvalues of the matrix are nonzero.

  2. $\lim_{k \to \infty} (\widetilde{\Lambda}_{m,k} - \Lambda_m) = \beta_0 I_m$.

Moreover, by Lemma C.11, we know that each hyperbolic matrix $\widetilde{\Lambda}_{m,k}$ is Hurwitz, as the system for $\tilde{v}_{m,t}$ is asymptotically stable:

$$\sup_m \max_{i \in [m]} \left( \lambda_i(\widetilde{\Lambda}_{m,k}) \right) < 0. \quad (64)$$

This is the stability boundary for the state-space models under perturbations.

Therefore, since $\Lambda_m$ is diagonal, the original unperturbed recurrent weight matrix $\Lambda_m$ satisfies the following eigenvalue inequality uniformly in $m$:

$$\sup_m \max_{i \in [m]} \left( \lambda_i(\Lambda_m) \right) \leq -\beta_0. \quad (65)$$

Therefore the model memory decays exponentially uniformly:

$$\begin{aligned}
\mathcal{M}(\widehat{\mathbf{H}}_m)(t) &:= \sup_{X_0} \frac{1}{X_0 + 1} \left| \frac{d}{dt} \hat{y}_{m,t} \right| \\
&= \sup_{X_0} \frac{1}{X_0 + 1} \left| c_m^{\top} \left[ \sigma'(h_{m,t}) \circ v_{m,t} \right] \right| \\
&\leq \sup_{X_0} \frac{1}{X_0 + 1} |c_m|_2 \, |\sigma'(h_{m,t}) \circ v_{m,t}|_2 \\
&\leq \sup_{X_0} \frac{1}{X_0 + 1} |c_m|_2 \cdot \sup_z |\sigma'(z)| \cdot |e^{-\beta_0 t} v_{m,0}|_2 \\
&\leq \sup_{X_0} \frac{1}{X_0 + 1} \Big( \sup_m |c_m|_2 \cdot \sup_z |\sigma'(z)| \cdot V_0 \Big) e^{-\beta_0 t} \\
&\leq \sup_{X_0} \frac{1}{X_0 + 1} \Big( \sup_m |c_m|_2 \cdot L_0 \cdot V_0 \Big) e^{-\beta_0 t} \qquad (66\text{--}71)
\end{aligned}$$
supX01X0+1(supm|cm|2L0V0)eβ0tabsentsubscriptsupremumsubscript𝑋01subscript𝑋01subscriptsupremum𝑚subscriptsubscript𝑐𝑚2subscript𝐿0subscript𝑉0superscript𝑒subscript𝛽0𝑡\displaystyle\leq\sup_{X_{0}}\frac{1}{X_{0}+1}\bigg{(}\sup_{m}|c_{m}|_{2}\cdot L% _{0}\cdot V_{0}\bigg{)}e^{-\beta_{0}t}≤ roman_sup start_POSTSUBSCRIPT italic_X start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT divide start_ARG 1 end_ARG start_ARG italic_X start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT + 1 end_ARG ( roman_sup start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT | italic_c start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT | start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ⋅ italic_L start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ⋅ italic_V start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ) italic_e start_POSTSUPERSCRIPT - italic_β start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT italic_t end_POSTSUPERSCRIPT (71)
supX0(supm|cm|2L0\displaystyle\leq\sup_{X_{0}}\bigg{(}\sup_{m}|c_{m}|_{2}\cdot L_{0}≤ roman_sup start_POSTSUBSCRIPT italic_X start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( roman_sup start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT | italic_c start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT | start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ⋅ italic_L start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT (72)
(X0X0+1d(supmUm2)+1X0+1(supmbm2)))eβ0t\displaystyle\cdot\Big{(}\frac{X_{0}}{X_{0}+1}d(\sup_{m}\|U_{m}\|_{2})+\frac{1% }{X_{0}+1}(\sup_{m}\|b_{m}\|_{2})\Big{)}\bigg{)}e^{-\beta_{0}t}⋅ ( divide start_ARG italic_X start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_ARG start_ARG italic_X start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT + 1 end_ARG italic_d ( roman_sup start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT ∥ italic_U start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) + divide start_ARG 1 end_ARG start_ARG italic_X start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT + 1 end_ARG ( roman_sup start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT ∥ italic_b start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) ) ) italic_e start_POSTSUPERSCRIPT - italic_β start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT italic_t end_POSTSUPERSCRIPT (73)
(supm|cm|2L0(dsupmUm2+supmbm2))eβ0tabsentsubscriptsupremum𝑚subscriptsubscript𝑐𝑚2subscript𝐿0𝑑subscriptsupremum𝑚subscriptnormsubscript𝑈𝑚2subscriptsupremum𝑚subscriptnormsubscript𝑏𝑚2superscript𝑒subscript𝛽0𝑡\displaystyle\leq\bigg{(}\sup_{m}|c_{m}|_{2}\cdot L_{0}\Big{(}d\sup_{m}\|U_{m}% \|_{2}+\sup_{m}\|b_{m}\|_{2}\Big{)}\bigg{)}e^{-\beta_{0}t}≤ ( roman_sup start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT | italic_c start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT | start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ⋅ italic_L start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ( italic_d roman_sup start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT ∥ italic_U start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT + roman_sup start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT ∥ italic_b start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) ) italic_e start_POSTSUPERSCRIPT - italic_β start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT italic_t end_POSTSUPERSCRIPT (74)
(d+1)L0θmax2eβ0tabsent𝑑1subscript𝐿0superscriptsubscript𝜃2superscript𝑒subscript𝛽0𝑡\displaystyle\leq(d+1)L_{0}\theta_{\max}^{2}e^{-\beta_{0}t}≤ ( italic_d + 1 ) italic_L start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT italic_θ start_POSTSUBSCRIPT roman_max end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_e start_POSTSUPERSCRIPT - italic_β start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT italic_t end_POSTSUPERSCRIPT (75)

The inequalities follow from vector norm properties, the Lipschitz continuity of $\sigma(z)$, and the uniform boundedness of the unperturbed initial conditions. Therefore the model memories are uniformly decaying.

By Lemma C.12, the target $\mathbf{H}$ has an exponentially decaying memory, as it is approximated by a sequence of models $\{\widehat{\mathbf{H}}_{m}\}_{m=1}^{\infty}$ with uniformly exponentially decaying memory. ∎
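To illustrate the bound in Equations 66–75 concretely, the following is a minimal numerical sketch (not part of the paper's experiments; the matrices $\Lambda$, $U$, $c$, the $\tanh$ activation, and all constants are illustrative assumptions). It simulates a single-layer SSM whose diagonal recurrent weights have eigenvalues bounded away from $0$ by $\beta_0$ and checks that the finite-difference memory function decays at least at rate $\beta_0$.

```python
# Minimal sketch: memory function of a single-layer SSM with eigenvalues
# bounded away from 0 by beta0 decays like e^{-beta0 t}.  All weights are
# illustrative random choices, not taken from the paper.
import numpy as np

rng = np.random.default_rng(0)
m, d = 64, 1                      # hidden width and input dimension
beta0 = 0.5                       # uniform eigenvalue gap: lambda_i <= -beta0
Lam = -(beta0 + rng.uniform(0.0, 2.0, size=m))   # diagonal, Hurwitz
U = rng.normal(size=(m, d)) / np.sqrt(m)
c = rng.normal(size=m) / np.sqrt(m)
sigma = np.tanh                    # Lipschitz activation

X0 = 1.0                           # constant (Heaviside) input x_t = X0 for t >= 0
dt, T = 1e-3, 20.0
ts = np.arange(0.0, T, dt)

h = np.zeros(m)
y_prev, memory = None, []
for t in ts:
    h = h + dt * (Lam * h + U @ np.full(d, X0))   # forward Euler step of dh/dt
    y = c @ sigma(h)
    if y_prev is not None:
        memory.append(abs(y - y_prev) / dt / (X0 + 1))  # |dy/dt| / (X0 + 1)
    y_prev = y

memory = np.array(memory)
rate = -np.polyfit(ts[1:], np.log(memory + 1e-30), 1)[0]
print(f"fitted decay rate {rate:.3f} (expected to be at least beta0 = {beta0})")
```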

Remark C.1.

When the approximation is unstable, the real parts of the eigenvalues of the recurrent weights cannot be bounded away from $0$ as in Equation 65. Since the stability of linear RNNs requires the real parts of the eigenvalues to be negative, the maximum of the real parts converges to $0$. This is the stability boundary of state-space models.

\[
\lim_{m\to\infty}\max_{i\in[m]}\lambda_{i}(\Lambda_{m})=0^{-}. \tag{76}
\]
Remark C.2.

The uniform weight bound is necessary in the following sense: since state-space models are universal approximators, they can approximate targets with long-term memory. However, if the target has a non-exponentially decaying (e.g., polynomially decaying) memory, the weight bound of the approximating sequence must grow exponentially in the sequence length $T$:

\[
\theta_{\max}^{2}\geq e^{\beta_{0}T}\,\frac{\mathcal{M}(\mathbf{H})(T)}{(d+1)L_{0}}. \tag{77}
\]

This result indicates that scaling up SSMs without reparameterization is inefficient for learning sequence relationships with large $T$ and long-term memory.
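As a small worked example of Equation 77 (with assumed constants $\beta_0$, $d$, $L_0$ and an assumed polynomially decaying memory $\mathcal{M}(\mathbf{H})(t)=(1+t)^{-1}$, none of which are values from the paper), the lower bound on $\theta_{\max}$ grows by many orders of magnitude over moderate horizons:

```python
# Illustrative computation of the lower bound in Equation (77) for a target
# with polynomially decaying memory M(H)(t) = (1+t)^(-1).  Constants are assumptions.
import math

beta0, d, L0 = 0.1, 1, 1.0
for T in [10, 50, 100, 200]:
    memory_T = 1.0 / (1.0 + T)                       # non-exponential memory at horizon T
    theta_sq = math.exp(beta0 * T) * memory_T / ((d + 1) * L0)
    print(f"T={T:4d}  theta_max >= {math.sqrt(theta_sq):.3e}")
```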

Remark C.3 (On the generalization to multi-layer cases).

We use the following two-layer state-space model to demonstrate how this result generalizes to the multi-layer case.

\begin{align}
\frac{dh_{t}}{dt}&=\Lambda_{1}h_{t}+U_{1}x_{t} \tag{78}\\
y_{t}&=\sigma(h_{t}) \tag{79}\\
\frac{ds_{t}}{dt}&=\Lambda_{2}s_{t}+U_{2}y_{t} \tag{80}\\
\hat{z}_{t}&=c^{\top}\sigma(s_{t}) \tag{81}
\end{align}

We then have the following memory function bound. For simplicity, we drop the subscript $m$ in $\Lambda_{1},\Lambda_{2},U_{1},U_{2}$.

\begin{align}
\mathcal{M}(\widehat{\mathbf{H}}_{m})(t)
&:=\sup_{X_{0}}\frac{1}{X_{0}+1}\left|\frac{d}{dt}\hat{z}_{m,t}\right|_{2} \tag{82}\\
&=\sup_{X_{0}}\frac{1}{X_{0}+1}\left|c^{\top}\Bigl(\sigma'(s_{m,t})\circ\frac{ds_{m,t}}{dt}\Bigr)\right|_{2} \tag{83}\\
&=\sup_{X_{0}}\frac{1}{X_{0}+1}\left|c^{\top}\Bigl(\sigma'(s_{m,t})\circ\int_{0}^{t}e^{\Lambda_{2}t_{1}}U_{2}\frac{dy_{m,t-t_{1}}}{dt}\,dt_{1}\Bigr)\right|_{2} \tag{84}\\
&=\sup_{X_{0}}\frac{1}{X_{0}+1}\left|c^{\top}\Bigl(\sigma'(s_{m,t})\circ\int_{0}^{t}e^{\Lambda_{2}t_{1}}U_{2}\bigl(\sigma'(h_{m,t-t_{1}})\circ v_{m,t-t_{1}}\bigr)\,dt_{1}\Bigr)\right|_{2} \tag{85}\\
&=\sup_{X_{0}}\frac{1}{X_{0}+1}\left|c^{\top}\Bigl(\sigma'(s_{m,t})\circ\int_{0}^{t}e^{\Lambda_{2}t_{1}}U_{2}\bigl(\sigma'(h_{m,t-t_{1}})\circ e^{\Lambda_{1}(t-t_{1})}v_{m,0}\bigr)\,dt_{1}\Bigr)\right|_{2} \tag{86}\\
&\leq\sup_{X_{0}}\frac{1}{X_{0}+1}|c|_{2}\,|\sigma'(s_{m,t})|_{2}\int_{0}^{t}|e^{\Lambda_{2}t_{1}}|_{2}\,|U_{2}|_{2}\bigl(|\sigma'(h_{m,t-t_{1}})|_{2}\,|e^{\Lambda_{1}(t-t_{1})}|_{2}\,V_{0}\bigr)\,dt_{1} \tag{87}\\
&\leq L_{0}^{2}\theta_{\max}^{2}\sup_{X_{0}}\frac{1}{X_{0}+1}\int_{0}^{t}|e^{\Lambda_{2}t_{1}}|_{2}\,|e^{\Lambda_{1}(t-t_{1})}|_{2}\,V_{0}\,dt_{1} \tag{88}\\
&\leq L_{0}^{2}\theta_{\max}^{3}\sup_{X_{0}}\frac{dX_{0}+1}{X_{0}+1}\int_{0}^{t}|e^{\Lambda_{2}t_{1}}|_{2}\,|e^{\Lambda_{1}(t-t_{1})}|_{2}\,dt_{1} \tag{89}\\
&\leq L_{0}^{2}\theta_{\max}^{3}\sup_{X_{0}}\frac{dX_{0}+1}{X_{0}+1}\int_{0}^{t}|e^{-\beta_{0}t_{1}}|_{2}\,|e^{-\beta_{0}(t-t_{1})}|_{2}\,dt_{1} \tag{90}\\
&\leq(d+1)L_{0}^{2}\theta_{\max}^{3}\,t\,e^{-\beta_{0}t}. \tag{91}
\end{align}

The first inequality comes from the Cauchy inequality ($|a\circ b|_{2}\leq|a|_{2}\cdot|b|_{2}$). The second inequality comes from the Lipschitz property of the activation $\sigma(\cdot)$ and the uniform bound on the weights. The third inequality comes from the bound on $V_{0}$ in Equation 58. The last inequality is a direct evaluation based on the eigenvalues of $\Lambda_{1}$ and $\Lambda_{2}$. Since the fast-decaying factor $e^{-\beta_{0}t}$ dominates, the remaining polynomial-scale components are absorbed into the polynomial factor $P$ below.

A further generalization of the memory bound to $\ell$-layer SSMs is the following: for some polynomial $P(t)$ of degree at most $\ell-1$,

\[
\mathcal{M}(\widehat{\mathbf{H}}_{m})(t)\leq(d+1)L_{0}^{\ell}\theta_{\max}^{\ell+1}P(t)e^{-\beta_{0}t}. \tag{92}
\]
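The following minimal sketch (again illustrative, not the paper's code) simulates the two-layer model of Equations 78–81 with diagonal Hurwitz recurrent weights and checks that the measured memory function stays within the $t\,e^{-\beta_{0}t}$ shape of Equation 91. All weights and constants below are arbitrary assumptions.

```python
# Two-layer SSM (Eqs. 78-81) with eigenvalues <= -beta0 in both layers;
# the memory function should be bounded by C * t * e^{-beta0 t} (Eq. 91).
import numpy as np

rng = np.random.default_rng(1)
m, beta0 = 32, 0.5
Lam1 = -(beta0 + rng.uniform(0.0, 1.0, size=m))   # layer-1 diagonal, Hurwitz
Lam2 = -(beta0 + rng.uniform(0.0, 1.0, size=m))   # layer-2 diagonal, Hurwitz
U1 = rng.normal(size=(m, 1)) / np.sqrt(m)
U2 = rng.normal(size=(m, m)) / m
c = rng.normal(size=m) / np.sqrt(m)
sigma = np.tanh

X0, dt, T = 1.0, 1e-3, 30.0
h, s = np.zeros(m), np.zeros(m)
ts = np.arange(0.0, T, dt)
z_prev, mem = None, []
for t in ts:
    h = h + dt * (Lam1 * h + U1 @ np.array([X0]))  # dh/dt = Lam1 h + U1 x
    y = sigma(h)                                   # layer-1 output
    s = s + dt * (Lam2 * s + U2 @ y)               # ds/dt = Lam2 s + U2 y
    z = c @ sigma(s)                               # z_hat = c^T sigma(s)
    if z_prev is not None:
        mem.append(abs(z - z_prev) / dt / (X0 + 1))
    z_prev = z

mem = np.array(mem)
bound_shape = ts[1:] * np.exp(-beta0 * ts[1:])     # the t * e^{-beta0 t} envelope
ratio = mem / np.maximum(bound_shape, 1e-30)
print(f"sup_t M(t) / (t e^(-beta0 t)) ≈ {ratio[10:].max():.3f} (finite, so the envelope holds)")
```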

C.4 Proof for Theorem 3.5

Proof.

Let the target linear functional be $H_{t}(\mathbf{x})=\int_{-\infty}^{t}\rho(t-s)x_{s}\,ds$, where $\rho$ is an $L_{1}$-integrable function. We consider a simplified model setting with only the parameters $c$ and $w$. Let $c_{i},w_{i}$ be the unperturbed weights and $\tilde{c}_{i},\tilde{w}_{i}$ be the perturbed weights. Just as $\rho$ is $L_{1}$-integrable, we note that $\int_{0}^{\infty}|c_{i}e^{f(w_{i})t}|\,dt=\frac{|c_{i}|}{|f(w_{i})|}$. For the sequence of models to be well defined, we require that they be uniformly (in $m$) absolutely integrable:

\[
\sup_{m}\sum_{i=1}^{m}\frac{|c_{i}|}{|f(w_{i})|}<\infty,\qquad
\sup_{m}\sum_{i=1}^{m}\frac{1}{|f(w_{i})|}<\infty. \tag{93}
\]

Given $|\tilde{w}-w|_{2}\leq\beta$ and $|\tilde{c}-c|_{2}\leq\beta$, the approximation error satisfies

\begin{align}
E_{m}(\beta)
&=\sup_{|\tilde{w}-w|_{2}\leq\beta,\,|\tilde{c}-c|_{2}\leq\beta}\int_{0}^{\infty}\left|\sum_{i=1}^{m}\tilde{c}_{i}e^{f(\tilde{w}_{i})t}-\rho(t)\right|dt \tag{94}\\
&\leq\sup_{|\tilde{w}-w|_{2}\leq\beta}\int_{0}^{\infty}\left|\sum_{i=1}^{m}c_{i}e^{f(w_{i})t}-\rho(t)\right|dt \notag\\
&\quad+\sup_{|\tilde{w}-w|_{2}\leq\beta}\int_{0}^{\infty}\left|\sum_{i=1}^{m}c_{i}e^{f(\tilde{w}_{i})t}-\sum_{i=1}^{m}c_{i}e^{f(w_{i})t}\right|dt \notag\\
&\quad+\sup_{|\tilde{w}-w|_{2}\leq\beta,\,|\tilde{c}-c|_{2}\leq\beta}\int_{0}^{\infty}\left|\sum_{i=1}^{m}(\tilde{c}_{i}-c_{i})e^{f(\tilde{w}_{i})t}\right|dt \tag{95--97}\\
&\leq\sup_{|\tilde{w}-w|_{2}\leq\beta}\int_{0}^{\infty}\left|\sum_{i=1}^{m}c_{i}e^{f(w_{i})t}-\rho(t)\right|dt \notag\\
&\quad+\sup_{|\tilde{w}-w|_{2}\leq\beta}\int_{0}^{\infty}\left|\sum_{i=1}^{m}c_{i}e^{f(\tilde{w}_{i})t}-\sum_{i=1}^{m}c_{i}e^{f(w_{i})t}\right|dt \notag\\
&\quad+\sup_{|\tilde{w}-w|_{2}\leq\beta,\,|\tilde{c}-c|_{2}\leq\beta}\int_{0}^{\infty}\sum_{i=1}^{m}\beta\left|e^{f(\tilde{w}_{i})t}-e^{f(w_{i})t}+e^{f(w_{i})t}\right|dt \tag{98--100}\\
&\leq E_{m}(0)+\sup_{|\tilde{w}-w|_{2}\leq\beta}\int_{0}^{\infty}\sum_{i=1}^{m}|c_{i}|\left|e^{f(\tilde{w}_{i})t}-e^{f(w_{i})t}\right|dt \notag\\
&\quad+\sup_{|\tilde{w}-w|_{2}\leq\beta,\,|\tilde{c}-c|_{2}\leq\beta}\int_{0}^{\infty}\beta\sum_{i=1}^{m}\left|e^{f(\tilde{w}_{i})t}-e^{f(w_{i})t}\right|dt+\int_{0}^{\infty}\beta\left|\sum_{i=1}^{m}e^{f(w_{i})t}\right|dt \tag{101--102}\\
&=E_{m}(0)+\sup_{|\tilde{w}-w|_{2}\leq\beta}\int_{0}^{\infty}\sum_{i=1}^{m}(|c_{i}|+\beta)\left|e^{f(\tilde{w}_{i})t}-e^{f(w_{i})t}\right|dt+\int_{0}^{\infty}\beta\left|\sum_{i=1}^{m}e^{f(w_{i})t}\right|dt \tag{103--104}\\
&=E_{m}(0)+\sum_{i=1}^{m}(|c_{i}|+\beta)\sup_{|\tilde{w}-w|_{2}\leq\beta}\int_{0}^{\infty}\left|e^{f(\tilde{w}_{i})t}-e^{f(w_{i})t}\right|dt+\int_{0}^{\infty}\beta\left|\sum_{i=1}^{m}e^{f(w_{i})t}\right|dt \tag{105}\\
&=E_{m}(0)+\sum_{i=1}^{m}(|c_{i}|+\beta)\sup_{|\tilde{w}_{i}-w_{i}|\leq\beta}\int_{0}^{\infty}\left|e^{f(\tilde{w}_{i})t}-e^{f(w_{i})t}\right|dt+\beta\sum_{i=1}^{m}\frac{1}{|f(w_{i})|} \tag{106}\\
&\leq E_{m}(0)+\sum_{i=1}^{m}(|c_{i}|+\beta)\frac{g(\beta)}{|f(w_{i})|}+\beta\sum_{i=1}^{m}\frac{1}{|f(w_{i})|} \tag{107}\\
&=E_{m}(0)+\sum_{i=1}^{m}\frac{g(\beta)(|c_{i}|+\beta)+\beta}{|f(w_{i})|}. \tag{108}
\end{align}

The first and third inequalities are the triangle inequality. The second inequality comes from the fact that $|\tilde{w}_{i}-w_{i}|\leq|\tilde{w}-w|_{2}\leq\beta$. The fourth inequality is achieved via the defining property of stable reparameterization: for some continuous function $g(\beta):[0,\infty)\to[0,\infty)$ with $g(0)=0$,

\[
\sup_{w}\left[|f(w)|\sup_{|\tilde{w}-w|\leq\beta}\int_{0}^{\infty}\left|e^{f(\tilde{w})t}-e^{f(w)t}\right|dt\right]\leq g(\beta). \tag{109}
\]

By the definition of stable approximation, $\lim_{m\to\infty}E_{m}(0)=0$. Moreover, by the integrability requirement in Equation 93, we have

\begin{align}
\lim_{\beta\to 0}E(\beta)
&=\lim_{\beta\to 0}\lim_{m\to\infty}E_{m}(\beta) \tag{110}\\
&\leq\lim_{\beta\to 0}\lim_{m\to\infty}E_{m}(0)+\left(\sup_{m}\sum_{i=1}^{m}\frac{|c_{i}|+\beta}{|f(w_{i})|}\right)\lim_{\beta\to 0}g(\beta)+\lim_{\beta\to 0}\beta\left(\sup_{m}\sum_{i=1}^{m}\frac{1}{|f(w_{i})|}\right) \tag{111}\\
&=0+0+0=0=E(0). \tag{112}
\end{align}
∎

Remark C.4.

Here we verify that the reparameterization methods satisfy the definition of stable reparameterization.

For the exponential reparameterization $f(w)=-e^{w}$, $w\in\mathbb{R}$:

\[
\sup_{|\tilde{w}-w|\leq\beta}\int_{0}^{\infty}\left|e^{f(\tilde{w})t}-e^{f(w)t}\right|dt=\frac{e^{\beta}-1}{|f(w)|}. \tag{113}
\]

For the softplus reparameterization $f(w)=-\log(1+e^{w})$, $w\in\mathbb{R}$: notice that $e^{-\beta}\log(1+e^{w})\leq\sup_{|\tilde{w}-w|\leq\beta}\log(1+e^{\tilde{w}})\leq e^{\beta}\log(1+e^{w})$, hence

\[
\sup_{|\tilde{w}-w|\leq\beta}\int_{0}^{\infty}\left|e^{f(\tilde{w})t}-e^{f(w)t}\right|dt\leq\frac{e^{\beta}-1}{|f(w)|}. \tag{114}
\]

For the “best” reparameterization $f(w)=-\frac{1}{aw^{2}+b}$, $w\in\mathbb{R}$, $a,b>0$: without loss of generality, let $w\geq 0$. Then

\begin{align}
\sup_{|\tilde{w}-w|\leq\beta}\int_{0}^{\infty}\left|e^{f(\tilde{w})t}-e^{f(w)t}\right|dt
&=\bigl|a(w+\beta)^{2}-aw^{2}\bigr| \tag{115}\\
&\leq\frac{\frac{a(\beta^{2}+2\beta w)}{aw^{2}+b}}{|f(w)|} \tag{116}\\
&\leq\frac{\frac{a(\beta^{2}+2\beta w)}{b}}{|f(w)|}. \tag{117}
\end{align}

Here $g(\beta)=\frac{a(\beta^{2}+2\beta w)}{b}$. The Müntz–Szász theorem indicates that selecting any non-zero constant $a$ does not affect the universality of linear RNNs.

For the case without reparameterization, $f(w)=w$ with $w<0$: for $0\leq\beta<-w$,

\[
\sup_{|\tilde{w}-w|\leq\beta}\int_{0}^{\infty}\left|e^{f(\tilde{w})t}-e^{f(w)t}\right|dt=\frac{\beta}{(-w-\beta)(-w)}=\frac{\beta}{(-w-\beta)|f(w)|}. \tag{118}
\]

Since $\frac{\beta}{-w-\beta}\to\infty$ as $w\to-\beta^{-}$, the supremum over $w$ is unbounded; therefore the direct parameterization is not a stable reparameterization.
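The following numerical sketch estimates the stability constant of Equation 109, $g(\beta)\approx\sup_{w}|f(w)|\int_{0}^{\infty}|e^{f(\tilde{w})t}-e^{f(w)t}|\,dt$, for the reparameterizations discussed above. It uses the closed form $\int_{0}^{\infty}|e^{-at}-e^{-bt}|\,dt=|1/a-1/b|$ for $a,b>0$; the grids over $w$, the choice $a=b=1$ for the “best” reparameterization, and the perturbation size $\beta$ are all illustrative assumptions.

```python
# Numerical estimate of g(beta) = sup_w |f(w)| * int_0^inf |e^{f(w~)t} - e^{f(w)t}| dt
# for several reparameterizations f.  For negative f the integral equals
# |1/|f(w~)| - 1/|f(w)||, which is used below.
import numpy as np

def g_of_beta(f, ws, beta, n_tilde=101):
    worst = 0.0
    for w in ws:
        w_tildes = np.linspace(w - beta, w + beta, n_tilde)
        fw, fwt = f(w), f(w_tildes)
        if np.any(fwt >= 0):          # perturbation crosses the stability boundary
            return np.inf
        integral = np.abs(1.0 / np.abs(fwt) - 1.0 / np.abs(fw)).max()
        worst = max(worst, np.abs(fw) * integral)
    return worst

beta = 0.1
reparams = {
    "exp      f(w) = -e^w":        (lambda w: -np.exp(w),            np.linspace(-3, 3, 61)),
    "softplus f(w) = -log(1+e^w)": (lambda w: -np.log1p(np.exp(w)),  np.linspace(-3, 3, 61)),
    "best     f(w) = -1/(w^2+1)":  (lambda w: -1.0 / (w**2 + 1.0),   np.linspace(-3, 3, 61)),
    "direct   f(w) = w  (w < 0)":  (lambda w: w,                     -np.linspace(beta + 1e-3, 3, 61)),
}
for name, (f, ws) in reparams.items():
    print(f"{name:30s}  g({beta}) ≈ {g_of_beta(f, ws, beta):.3f}")
```

On this grid the exponential, softplus, and “best” reparameterizations yield small finite values, while the direct parameterization blows up as $w$ approaches $-\beta$, matching Equation 118.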

Remark C.5 (On the generalization of existence of stable approximation to nonlinear functionals).

The previous results are established for the stable approximation of linear functionals by linear RNNs.

Here we show that this can be further extended to nonlinear functionals. According to the Volterra series representation, a nonlinear functional admits an expansion in terms of multi-layer compositions and element-wise products (Wang & Xue, 2023). Therefore, if the existence of a stable approximation is preserved under functional composition and polynomials, we can generalize the above argument to nonlinear functionals by working with these nonlinear functional representations.

Theorem C.6 (Boyd et al. (1984); Wang & Xue (2023)).

Any continuous time-invariant system with $x(t)$ as input and $y(t)$ as output can be expanded in a Volterra series as follows:

y(t)=ρ0+n=1N0t0tρn(τ1,,τn)j=1nx(tτj)dτj.𝑦𝑡subscript𝜌0superscriptsubscript𝑛1𝑁superscriptsubscript0𝑡superscriptsubscript0𝑡subscript𝜌𝑛subscript𝜏1subscript𝜏𝑛superscriptsubscriptproduct𝑗1𝑛𝑥𝑡subscript𝜏𝑗𝑑subscript𝜏𝑗y(t)=\rho_{0}+\sum_{n=1}^{N}\int_{0}^{t}\cdots\int_{0}^{t}\rho_{n}(\tau_{1},% \dots,\tau_{n})\prod_{j=1}^{n}x(t-\tau_{j})d\tau_{j}.italic_y ( italic_t ) = italic_ρ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT + ∑ start_POSTSUBSCRIPT italic_n = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT ∫ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ⋯ ∫ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT italic_ρ start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ( italic_τ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_τ start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ) ∏ start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT italic_x ( italic_t - italic_τ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) italic_d italic_τ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT . (119)

In particular, we call the expansion order $N$ the series' order.

Lemma C.7 (Stable approximation induced by polynomials of stable approximation).

Assume $\mathbf{H}_{1}$ and $\mathbf{H}_{2}$ can be stably approximated and let $f$ be some polynomial; then $f(\mathbf{H}_{1},\mathbf{H}_{2})$ can also be stably approximated.

Proof.

Let f(𝐇1,𝐇2)=i,jci,j𝐇1i𝐇2j𝑓subscript𝐇1subscript𝐇2subscript𝑖𝑗subscript𝑐𝑖𝑗superscriptsubscript𝐇1𝑖superscriptsubscript𝐇2𝑗f(\mathbf{H}_{1},\mathbf{H}_{2})=\sum_{i,j}c_{i,j}\mathbf{H}_{1}^{i}\mathbf{H}% _{2}^{j}italic_f ( bold_H start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , bold_H start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) = ∑ start_POSTSUBSCRIPT italic_i , italic_j end_POSTSUBSCRIPT italic_c start_POSTSUBSCRIPT italic_i , italic_j end_POSTSUBSCRIPT bold_H start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT bold_H start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT. The definition of functional product is by the element-wise product: (𝐇1𝐇2)(𝐱)=𝐇1(𝐱)𝐇2(𝐱)subscript𝐇1subscript𝐇2𝐱direct-productsubscript𝐇1𝐱subscript𝐇2𝐱(\mathbf{H}_{1}\mathbf{H}_{2})(\mathbf{x})=\mathbf{H}_{1}(\mathbf{x})\odot% \mathbf{H}_{2}(\mathbf{x})( bold_H start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT bold_H start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) ( bold_x ) = bold_H start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ( bold_x ) ⊙ bold_H start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ( bold_x ).

Em(β)subscript𝐸𝑚𝛽\displaystyle E_{m}(\beta)italic_E start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT ( italic_β ) =sup|θ~θ|βf(𝐇1,𝐇2)f(𝐇1(θ~),𝐇2(θ~))W1,absentsubscriptsupremum~𝜃𝜃𝛽subscriptnorm𝑓subscript𝐇1subscript𝐇2𝑓subscript𝐇1~𝜃subscript𝐇2~𝜃superscript𝑊1\displaystyle=\sup_{|\tilde{\theta}-\theta|\leq\beta}\|f(\mathbf{H}_{1},% \mathbf{H}_{2})-f(\mathbf{H}_{1}(\tilde{\theta}),\mathbf{H}_{2}(\tilde{\theta}% ))\|_{W^{1,\infty}}= roman_sup start_POSTSUBSCRIPT | over~ start_ARG italic_θ end_ARG - italic_θ | ≤ italic_β end_POSTSUBSCRIPT ∥ italic_f ( bold_H start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , bold_H start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) - italic_f ( bold_H start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ( over~ start_ARG italic_θ end_ARG ) , bold_H start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ( over~ start_ARG italic_θ end_ARG ) ) ∥ start_POSTSUBSCRIPT italic_W start_POSTSUPERSCRIPT 1 , ∞ end_POSTSUPERSCRIPT end_POSTSUBSCRIPT (120)
Em(0)+sup|θ~θ|βf(𝐇1(θ),𝐇2(θ))f(𝐇1(θ~),𝐇2(θ~))W1,absentsubscript𝐸𝑚0subscriptsupremum~𝜃𝜃𝛽subscriptnorm𝑓subscript𝐇1𝜃subscript𝐇2𝜃𝑓subscript𝐇1~𝜃subscript𝐇2~𝜃superscript𝑊1\displaystyle\leq E_{m}(0)+\sup_{|\tilde{\theta}-\theta|\leq\beta}\|f(\mathbf{% H}_{1}(\theta),\mathbf{H}_{2}(\theta))-f(\mathbf{H}_{1}(\tilde{\theta}),% \mathbf{H}_{2}(\tilde{\theta}))\|_{W^{1,\infty}}≤ italic_E start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT ( 0 ) + roman_sup start_POSTSUBSCRIPT | over~ start_ARG italic_θ end_ARG - italic_θ | ≤ italic_β end_POSTSUBSCRIPT ∥ italic_f ( bold_H start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ( italic_θ ) , bold_H start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ( italic_θ ) ) - italic_f ( bold_H start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ( over~ start_ARG italic_θ end_ARG ) , bold_H start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ( over~ start_ARG italic_θ end_ARG ) ) ∥ start_POSTSUBSCRIPT italic_W start_POSTSUPERSCRIPT 1 , ∞ end_POSTSUPERSCRIPT end_POSTSUBSCRIPT (121)
Em(0)+sup|θ~θ|βf(𝐇1(θ),𝐇2(θ))f(𝐇1(θ),𝐇2(θ~))W1,absentsubscript𝐸𝑚0subscriptsupremum~𝜃𝜃𝛽subscriptnorm𝑓subscript𝐇1𝜃subscript𝐇2𝜃𝑓subscript𝐇1𝜃subscript𝐇2~𝜃superscript𝑊1\displaystyle\leq E_{m}(0)+\sup_{|\tilde{\theta}-\theta|\leq\beta}\|f(\mathbf{% H}_{1}(\theta),\mathbf{H}_{2}(\theta))-f(\mathbf{H}_{1}(\theta),\mathbf{H}_{2}% (\tilde{\theta}))\|_{W^{1,\infty}}≤ italic_E start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT ( 0 ) + roman_sup start_POSTSUBSCRIPT | over~ start_ARG italic_θ end_ARG - italic_θ | ≤ italic_β end_POSTSUBSCRIPT ∥ italic_f ( bold_H start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ( italic_θ ) , bold_H start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ( italic_θ ) ) - italic_f ( bold_H start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ( italic_θ ) , bold_H start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ( over~ start_ARG italic_θ end_ARG ) ) ∥ start_POSTSUBSCRIPT italic_W start_POSTSUPERSCRIPT 1 , ∞ end_POSTSUPERSCRIPT end_POSTSUBSCRIPT (122)
+sup|θ~θ|βf(𝐇1(θ),𝐇2(θ~))f(𝐇1(θ~),𝐇2(θ~))W1,subscriptsupremum~𝜃𝜃𝛽subscriptnorm𝑓subscript𝐇1𝜃subscript𝐇2~𝜃𝑓subscript𝐇1~𝜃subscript𝐇2~𝜃superscript𝑊1\displaystyle\qquad\qquad+\sup_{|\tilde{\theta}-\theta|\leq\beta}\|f(\mathbf{H% }_{1}(\theta),\mathbf{H}_{2}(\tilde{\theta}))-f(\mathbf{H}_{1}(\tilde{\theta})% ,\mathbf{H}_{2}(\tilde{\theta}))\|_{W^{1,\infty}}+ roman_sup start_POSTSUBSCRIPT | over~ start_ARG italic_θ end_ARG - italic_θ | ≤ italic_β end_POSTSUBSCRIPT ∥ italic_f ( bold_H start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ( italic_θ ) , bold_H start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ( over~ start_ARG italic_θ end_ARG ) ) - italic_f ( bold_H start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ( over~ start_ARG italic_θ end_ARG ) , bold_H start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ( over~ start_ARG italic_θ end_ARG ) ) ∥ start_POSTSUBSCRIPT italic_W start_POSTSUPERSCRIPT 1 , ∞ end_POSTSUPERSCRIPT end_POSTSUBSCRIPT (123)
Em(0)+i0,j1ci,jj𝐇^1(θ)W1,i(𝐇^2(θ)W1,+Em𝐇2(β))j1Em𝐇2(β)absentsubscript𝐸𝑚0subscriptformulae-sequence𝑖0𝑗1subscript𝑐𝑖𝑗𝑗superscriptsubscriptnormsubscript^𝐇1𝜃superscript𝑊1𝑖superscriptsubscriptnormsubscript^𝐇2𝜃superscript𝑊1superscriptsubscript𝐸𝑚subscript𝐇2𝛽𝑗1superscriptsubscript𝐸𝑚subscript𝐇2𝛽\displaystyle\leq E_{m}(0)+\sum_{i\geq 0,j\geq 1}c_{i,j}j\|\widehat{\mathbf{H}% }_{1}(\theta)\|_{W^{1,\infty}}^{i}(\|\widehat{\mathbf{H}}_{2}(\theta)\|_{W^{1,% \infty}}+E_{m}^{\mathbf{H}_{2}}(\beta))^{j-1}E_{m}^{\mathbf{H}_{2}}(\beta)≤ italic_E start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT ( 0 ) + ∑ start_POSTSUBSCRIPT italic_i ≥ 0 , italic_j ≥ 1 end_POSTSUBSCRIPT italic_c start_POSTSUBSCRIPT italic_i , italic_j end_POSTSUBSCRIPT italic_j ∥ over^ start_ARG bold_H end_ARG start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ( italic_θ ) ∥ start_POSTSUBSCRIPT italic_W start_POSTSUPERSCRIPT 1 , ∞ end_POSTSUPERSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT ( ∥ over^ start_ARG bold_H end_ARG start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ( italic_θ ) ∥ start_POSTSUBSCRIPT italic_W start_POSTSUPERSCRIPT 1 , ∞ end_POSTSUPERSCRIPT end_POSTSUBSCRIPT + italic_E start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT start_POSTSUPERSCRIPT bold_H start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ( italic_β ) ) start_POSTSUPERSCRIPT italic_j - 1 end_POSTSUPERSCRIPT italic_E start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT start_POSTSUPERSCRIPT bold_H start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ( italic_β ) (124)
+i1,j0ci,ji(𝐇^1(θ)W1,+Em𝐇1(β))i1𝐇^2(θ)W1,jEm𝐇1(β).subscriptformulae-sequence𝑖1𝑗0subscript𝑐𝑖𝑗𝑖superscriptsubscriptnormsubscript^𝐇1𝜃superscript𝑊1superscriptsubscript𝐸𝑚subscript𝐇1𝛽𝑖1superscriptsubscriptnormsubscript^𝐇2𝜃superscript𝑊1𝑗superscriptsubscript𝐸𝑚subscript𝐇1𝛽\displaystyle\qquad\qquad+\sum_{i\geq 1,j\geq 0}c_{i,j}i(\|\widehat{\mathbf{H}% }_{1}(\theta)\|_{W^{1,\infty}}+E_{m}^{\mathbf{H}_{1}}(\beta))^{i-1}\|\widehat{% \mathbf{H}}_{2}(\theta)\|_{W^{1,\infty}}^{j}E_{m}^{\mathbf{H}_{1}}(\beta).+ ∑ start_POSTSUBSCRIPT italic_i ≥ 1 , italic_j ≥ 0 end_POSTSUBSCRIPT italic_c start_POSTSUBSCRIPT italic_i , italic_j end_POSTSUBSCRIPT italic_i ( ∥ over^ start_ARG bold_H end_ARG start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ( italic_θ ) ∥ start_POSTSUBSCRIPT italic_W start_POSTSUPERSCRIPT 1 , ∞ end_POSTSUPERSCRIPT end_POSTSUBSCRIPT + italic_E start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT start_POSTSUPERSCRIPT bold_H start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ( italic_β ) ) start_POSTSUPERSCRIPT italic_i - 1 end_POSTSUPERSCRIPT ∥ over^ start_ARG bold_H end_ARG start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ( italic_θ ) ∥ start_POSTSUBSCRIPT italic_W start_POSTSUPERSCRIPT 1 , ∞ end_POSTSUPERSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT italic_E start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT start_POSTSUPERSCRIPT bold_H start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ( italic_β ) . (125)

Therefore E(β)limmEm(β)<𝐸𝛽subscript𝑚subscript𝐸𝑚𝛽E(\beta)\leq\lim_{m\to\infty}E_{m}(\beta)<\inftyitalic_E ( italic_β ) ≤ roman_lim start_POSTSUBSCRIPT italic_m → ∞ end_POSTSUBSCRIPT italic_E start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT ( italic_β ) < ∞. The third inequality comes from Equation 33. ∎

C.5 Proof for Theorem 3.6

Proof.

For any $1\leq j\leq m$, assume the loss function used is the $L_{\infty}$ norm: $\textrm{Loss}=\sup_{t}\|H_{t}-\widehat{H}_{m,t}\|_{\infty}$. Notice that by time-homogeneity, $\textrm{Loss}=\|H_{t}-\widehat{H}_{m,t}\|_{\infty}$ for any $t$. This loss function is larger than the commonly used mean squared error, which is usually chosen in practice for smoothness reasons.

|Losswj|Losssubscript𝑤𝑗\displaystyle\left|\frac{\partial\textrm{Loss}}{\partial w_{j}}\right|| divide start_ARG ∂ Loss end_ARG start_ARG ∂ italic_w start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_ARG | =|HtH^m,twj|absentsubscriptnormsubscript𝐻𝑡subscript^𝐻𝑚𝑡subscript𝑤𝑗\displaystyle=\left|\frac{\partial\|H_{t}-\widehat{H}_{m,t}\|_{\infty}}{% \partial w_{j}}\right|= | divide start_ARG ∂ ∥ italic_H start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT - over^ start_ARG italic_H end_ARG start_POSTSUBSCRIPT italic_m , italic_t end_POSTSUBSCRIPT ∥ start_POSTSUBSCRIPT ∞ end_POSTSUBSCRIPT end_ARG start_ARG ∂ italic_w start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_ARG | (126)
=|sup𝐱1|Ht(𝐱)H^m,t(𝐱)|wj|absentsubscriptsupremumsubscriptnorm𝐱1subscript𝐻𝑡𝐱subscript^𝐻𝑚𝑡𝐱subscript𝑤𝑗\displaystyle=\left|\frac{\partial\sup_{\|\mathbf{x}\|_{\infty}\leq 1}|H_{t}(% \mathbf{x})-\widehat{H}_{m,t}(\mathbf{x})|}{\partial w_{j}}\right|= | divide start_ARG ∂ roman_sup start_POSTSUBSCRIPT ∥ bold_x ∥ start_POSTSUBSCRIPT ∞ end_POSTSUBSCRIPT ≤ 1 end_POSTSUBSCRIPT | italic_H start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_x ) - over^ start_ARG italic_H end_ARG start_POSTSUBSCRIPT italic_m , italic_t end_POSTSUBSCRIPT ( bold_x ) | end_ARG start_ARG ∂ italic_w start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_ARG | (127)
=|sup𝐱1|t(ρ(ts)i=1mcief(wi)(ts))xs𝑑s|wj|absentsubscriptsupremumsubscriptnorm𝐱1superscriptsubscript𝑡𝜌𝑡𝑠superscriptsubscript𝑖1𝑚subscript𝑐𝑖superscript𝑒𝑓subscript𝑤𝑖𝑡𝑠subscript𝑥𝑠differential-d𝑠subscript𝑤𝑗\displaystyle=\left|\frac{\partial\sup_{\|\mathbf{x}\|_{\infty}\leq 1}|\int_{-% \infty}^{t}(\rho(t-s)-\sum_{i=1}^{m}c_{i}e^{-f(w_{i})(t-s)})x_{s}ds|}{\partial w% _{j}}\right|= | divide start_ARG ∂ roman_sup start_POSTSUBSCRIPT ∥ bold_x ∥ start_POSTSUBSCRIPT ∞ end_POSTSUBSCRIPT ≤ 1 end_POSTSUBSCRIPT | ∫ start_POSTSUBSCRIPT - ∞ end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ( italic_ρ ( italic_t - italic_s ) - ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT italic_c start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_e start_POSTSUPERSCRIPT - italic_f ( italic_w start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) ( italic_t - italic_s ) end_POSTSUPERSCRIPT ) italic_x start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT italic_d italic_s | end_ARG start_ARG ∂ italic_w start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_ARG | (128)
=|t|ρ(ts)i=1mcief(wi)(ts)|𝑑swj|absentsuperscriptsubscript𝑡𝜌𝑡𝑠superscriptsubscript𝑖1𝑚subscript𝑐𝑖superscript𝑒𝑓subscript𝑤𝑖𝑡𝑠differential-d𝑠subscript𝑤𝑗\displaystyle=\left|\frac{\partial\int_{-\infty}^{t}|\rho(t-s)-\sum_{i=1}^{m}c% _{i}e^{-f(w_{i})(t-s)}|ds}{\partial w_{j}}\right|= | divide start_ARG ∂ ∫ start_POSTSUBSCRIPT - ∞ end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT | italic_ρ ( italic_t - italic_s ) - ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT italic_c start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_e start_POSTSUPERSCRIPT - italic_f ( italic_w start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) ( italic_t - italic_s ) end_POSTSUPERSCRIPT | italic_d italic_s end_ARG start_ARG ∂ italic_w start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_ARG | (129)
=|t|(ρ(ts)ijcief(wi)(ts))cjef(wj)(ts)|𝑑swj|absentsuperscriptsubscript𝑡𝜌𝑡𝑠subscript𝑖𝑗subscript𝑐𝑖superscript𝑒𝑓subscript𝑤𝑖𝑡𝑠subscript𝑐𝑗superscript𝑒𝑓subscript𝑤𝑗𝑡𝑠differential-d𝑠subscript𝑤𝑗\displaystyle=\left|\frac{\partial\int_{-\infty}^{t}|(\rho(t-s)-\sum_{i\neq j}% c_{i}e^{-f(w_{i})(t-s)})-c_{j}e^{-f(w_{j})(t-s)}|ds}{\partial w_{j}}\right|= | divide start_ARG ∂ ∫ start_POSTSUBSCRIPT - ∞ end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT | ( italic_ρ ( italic_t - italic_s ) - ∑ start_POSTSUBSCRIPT italic_i ≠ italic_j end_POSTSUBSCRIPT italic_c start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_e start_POSTSUPERSCRIPT - italic_f ( italic_w start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) ( italic_t - italic_s ) end_POSTSUPERSCRIPT ) - italic_c start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT italic_e start_POSTSUPERSCRIPT - italic_f ( italic_w start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) ( italic_t - italic_s ) end_POSTSUPERSCRIPT | italic_d italic_s end_ARG start_ARG ∂ italic_w start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_ARG | (130)
=|0|(ρ(s)ijcief(wi)s)cjef(wj)s|𝑑swj|absentsuperscriptsubscript0𝜌𝑠subscript𝑖𝑗subscript𝑐𝑖superscript𝑒𝑓subscript𝑤𝑖𝑠subscript𝑐𝑗superscript𝑒𝑓subscript𝑤𝑗𝑠differential-d𝑠subscript𝑤𝑗\displaystyle=\left|\frac{\partial\int_{0}^{\infty}|(\rho(s)-\sum_{i\neq j}c_{% i}e^{-f(w_{i})s})-c_{j}e^{-f(w_{j})s}|ds}{\partial w_{j}}\right|= | divide start_ARG ∂ ∫ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∞ end_POSTSUPERSCRIPT | ( italic_ρ ( italic_s ) - ∑ start_POSTSUBSCRIPT italic_i ≠ italic_j end_POSTSUBSCRIPT italic_c start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_e start_POSTSUPERSCRIPT - italic_f ( italic_w start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) italic_s end_POSTSUPERSCRIPT ) - italic_c start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT italic_e start_POSTSUPERSCRIPT - italic_f ( italic_w start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) italic_s end_POSTSUPERSCRIPT | italic_d italic_s end_ARG start_ARG ∂ italic_w start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_ARG | (131)
0||(ρ(s)ijcief(wi)s)cjef(wj)s|wj|𝑑sabsentsuperscriptsubscript0𝜌𝑠subscript𝑖𝑗subscript𝑐𝑖superscript𝑒𝑓subscript𝑤𝑖𝑠subscript𝑐𝑗superscript𝑒𝑓subscript𝑤𝑗𝑠subscript𝑤𝑗differential-d𝑠\displaystyle\leq\int_{0}^{\infty}\left|\frac{\partial|(\rho(s)-\sum_{i\neq j}% c_{i}e^{-f(w_{i})s})-c_{j}e^{-f(w_{j})s}|}{\partial w_{j}}\right|ds≤ ∫ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∞ end_POSTSUPERSCRIPT | divide start_ARG ∂ | ( italic_ρ ( italic_s ) - ∑ start_POSTSUBSCRIPT italic_i ≠ italic_j end_POSTSUBSCRIPT italic_c start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_e start_POSTSUPERSCRIPT - italic_f ( italic_w start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) italic_s end_POSTSUPERSCRIPT ) - italic_c start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT italic_e start_POSTSUPERSCRIPT - italic_f ( italic_w start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) italic_s end_POSTSUPERSCRIPT | end_ARG start_ARG ∂ italic_w start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_ARG | italic_d italic_s (132)
0||cjef(wj)s|wj|𝑑sabsentsuperscriptsubscript0subscript𝑐𝑗superscript𝑒𝑓subscript𝑤𝑗𝑠subscript𝑤𝑗differential-d𝑠\displaystyle\leq\int_{0}^{\infty}\left|\frac{\partial|c_{j}e^{-f(w_{j})s}|}{% \partial w_{j}}\right|ds≤ ∫ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∞ end_POSTSUPERSCRIPT | divide start_ARG ∂ | italic_c start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT italic_e start_POSTSUPERSCRIPT - italic_f ( italic_w start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) italic_s end_POSTSUPERSCRIPT | end_ARG start_ARG ∂ italic_w start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_ARG | italic_d italic_s (133)

The first equality is the definition of the loss function. The second equality comes from the definition of the linear functional norm. The third equality expands the linear functional and the linear RNN into convolution form. The fourth equality uses the fact that we can manually select the sign of $x_{s}$ to achieve the maximum value. The fifth equality separates the term depending on the variable $w_{j}$ from the rest. The sixth equality is a change of variables from $t-s$ to $s$. The first inequality is the triangle inequality. The last inequality drops the terms independent of the variable $w_{j}$.

|Losswj|Losssubscript𝑤𝑗\displaystyle\left|\frac{\partial\textrm{Loss}}{\partial w_{j}}\right|| divide start_ARG ∂ Loss end_ARG start_ARG ∂ italic_w start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_ARG | 0||cjef(wj)s|wj|𝑑sabsentsuperscriptsubscript0subscript𝑐𝑗superscript𝑒𝑓subscript𝑤𝑗𝑠subscript𝑤𝑗differential-d𝑠\displaystyle\leq\int_{0}^{\infty}\left|\frac{\partial|c_{j}e^{-f(w_{j})s}|}{% \partial w_{j}}\right|ds≤ ∫ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∞ end_POSTSUPERSCRIPT | divide start_ARG ∂ | italic_c start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT italic_e start_POSTSUPERSCRIPT - italic_f ( italic_w start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) italic_s end_POSTSUPERSCRIPT | end_ARG start_ARG ∂ italic_w start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_ARG | italic_d italic_s (134)
=|cjf(wj)|0ef(wj)ss𝑑sabsentsubscript𝑐𝑗superscript𝑓subscript𝑤𝑗superscriptsubscript0superscript𝑒𝑓subscript𝑤𝑗𝑠𝑠differential-d𝑠\displaystyle=|c_{j}f^{\prime}(w_{j})|\int_{0}^{\infty}e^{-f(w_{j})s}s\ ds= | italic_c start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT italic_f start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ( italic_w start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) | ∫ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∞ end_POSTSUPERSCRIPT italic_e start_POSTSUPERSCRIPT - italic_f ( italic_w start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) italic_s end_POSTSUPERSCRIPT italic_s italic_d italic_s (135)
=|cjf(wj)f(wj)|0ef(wj)s𝑑sabsentsubscript𝑐𝑗superscript𝑓subscript𝑤𝑗𝑓subscript𝑤𝑗superscriptsubscript0superscript𝑒𝑓subscript𝑤𝑗𝑠differential-d𝑠\displaystyle=\left|c_{j}\frac{f^{\prime}(w_{j})}{f(w_{j})}\right|\int_{0}^{% \infty}e^{-f(w_{j})s}ds= | italic_c start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT divide start_ARG italic_f start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ( italic_w start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) end_ARG start_ARG italic_f ( italic_w start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) end_ARG | ∫ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∞ end_POSTSUPERSCRIPT italic_e start_POSTSUPERSCRIPT - italic_f ( italic_w start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) italic_s end_POSTSUPERSCRIPT italic_d italic_s (136)
=|cjf(wj)f(wj)2|(1limsef(wj)s)=|cjf(wj)f(wj)2|.absentsubscript𝑐𝑗superscript𝑓subscript𝑤𝑗𝑓superscriptsubscript𝑤𝑗21subscript𝑠superscript𝑒𝑓subscript𝑤𝑗𝑠subscript𝑐𝑗superscript𝑓subscript𝑤𝑗𝑓superscriptsubscript𝑤𝑗2\displaystyle=\left|c_{j}\frac{f^{\prime}(w_{j})}{f(w_{j})^{2}}\right|(1-\lim_% {s\to\infty}e^{-f(w_{j})s})=\left|c_{j}\frac{f^{\prime}(w_{j})}{f(w_{j})^{2}}% \right|.= | italic_c start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT divide start_ARG italic_f start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ( italic_w start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) end_ARG start_ARG italic_f ( italic_w start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG | ( 1 - roman_lim start_POSTSUBSCRIPT italic_s → ∞ end_POSTSUBSCRIPT italic_e start_POSTSUPERSCRIPT - italic_f ( italic_w start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) italic_s end_POSTSUPERSCRIPT ) = | italic_c start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT divide start_ARG italic_f start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ( italic_w start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) end_ARG start_ARG italic_f ( italic_w start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG | . (137)

The first equality evaluates the derivative and extracts $|c_{j}f^{\prime}(w_{j})|$ from the integral. The second equality follows from integration by parts. The third equality evaluates the remaining integral.
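For completeness, the integration-by-parts step uses the standard identity (with $\alpha=f(w_{j})>0$):

\int_{0}^{\infty}s\,e^{-\alpha s}\,ds
=\left[-\frac{s}{\alpha}e^{-\alpha s}\right]_{0}^{\infty}+\frac{1}{\alpha}\int_{0}^{\infty}e^{-\alpha s}\,ds
=\frac{1}{\alpha}\int_{0}^{\infty}e^{-\alpha s}\,ds
=\frac{1}{\alpha^{2}}.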

In particular, notice that cjsubscript𝑐𝑗c_{j}italic_c start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT is a constant independent of the recurrent weight parameterization f𝑓fitalic_f:

H^m,t(𝐱)=ti=1mcief(wi)(ts)xsds.subscript^𝐻𝑚𝑡𝐱superscriptsubscript𝑡superscriptsubscript𝑖1𝑚subscript𝑐𝑖superscript𝑒𝑓subscript𝑤𝑖𝑡𝑠subscript𝑥𝑠𝑑𝑠\widehat{H}_{m,t}(\mathbf{x})=\int_{-\infty}^{t}\sum_{i=1}^{m}c_{i}e^{-f(w_{i}% )(t-s)}x_{s}ds.over^ start_ARG italic_H end_ARG start_POSTSUBSCRIPT italic_m , italic_t end_POSTSUBSCRIPT ( bold_x ) = ∫ start_POSTSUBSCRIPT - ∞ end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT italic_c start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_e start_POSTSUPERSCRIPT - italic_f ( italic_w start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) ( italic_t - italic_s ) end_POSTSUPERSCRIPT italic_x start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT italic_d italic_s . (138)

Therefore $c_{j}$ is a parameterization-independent value; we will denote it by $C_{\mathbf{H},\widehat{\mathbf{H}}_{m}}$.

Moreover, in the discrete setting, assume hk+1=f(w)hk+Uxksubscript𝑘1𝑓𝑤subscript𝑘𝑈subscript𝑥𝑘h_{k+1}=f(w)\circ h_{k}+Ux_{k}italic_h start_POSTSUBSCRIPT italic_k + 1 end_POSTSUBSCRIPT = italic_f ( italic_w ) ∘ italic_h start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT + italic_U italic_x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT,

|Losswj|Losssubscript𝑤𝑗\displaystyle\left|\frac{\partial\textrm{Loss}}{\partial w_{j}}\right|| divide start_ARG ∂ Loss end_ARG start_ARG ∂ italic_w start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_ARG | k=0||cjf(wj)k|wj|dsabsentsuperscriptsubscript𝑘0subscript𝑐𝑗𝑓superscriptsubscript𝑤𝑗𝑘subscript𝑤𝑗𝑑𝑠\displaystyle\leq\sum_{k=0}^{\infty}\left|\frac{\partial|c_{j}f(w_{j})^{k}|}{% \partial w_{j}}\right|ds≤ ∑ start_POSTSUBSCRIPT italic_k = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∞ end_POSTSUPERSCRIPT | divide start_ARG ∂ | italic_c start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT italic_f ( italic_w start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT | end_ARG start_ARG ∂ italic_w start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_ARG | italic_d italic_s (139)
=|cjf(wj)|k=1kf(wj)k1absentsubscript𝑐𝑗superscript𝑓subscript𝑤𝑗superscriptsubscript𝑘1𝑘𝑓superscriptsubscript𝑤𝑗𝑘1\displaystyle=|c_{j}f^{\prime}(w_{j})|\sum_{k=1}^{\infty}kf(w_{j})^{k-1}= | italic_c start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT italic_f start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ( italic_w start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) | ∑ start_POSTSUBSCRIPT italic_k = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∞ end_POSTSUPERSCRIPT italic_k italic_f ( italic_w start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT italic_k - 1 end_POSTSUPERSCRIPT (140)
=|cjf(wj)|(k=1f(wj)k1)2absentsubscript𝑐𝑗superscript𝑓subscript𝑤𝑗superscriptsuperscriptsubscript𝑘1𝑓superscriptsubscript𝑤𝑗𝑘12\displaystyle=|c_{j}f^{\prime}(w_{j})|\left(\sum_{k=1}^{\infty}f(w_{j})^{k-1}% \right)^{2}= | italic_c start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT italic_f start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ( italic_w start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) | ( ∑ start_POSTSUBSCRIPT italic_k = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∞ end_POSTSUPERSCRIPT italic_f ( italic_w start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT italic_k - 1 end_POSTSUPERSCRIPT ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT (141)
=|cjf(wj)(1f(wj))2|.absentsubscript𝑐𝑗superscript𝑓subscript𝑤𝑗superscript1𝑓subscript𝑤𝑗2\displaystyle=\left|c_{j}\frac{f^{\prime}(w_{j})}{(1-f(w_{j}))^{2}}\right|.= | italic_c start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT divide start_ARG italic_f start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ( italic_w start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) end_ARG start_ARG ( 1 - italic_f ( italic_w start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG | . (142)

So the gradient norm is bounded by

\left|\frac{\partial\textrm{Loss}}{\partial w_{j}}\right|\leq\frac{|c_{j}f^{\prime}(w_{j})|}{(1-f(w_{j}))^{2}}. (143)
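The series manipulation used in the discrete case is the standard geometric-series identity, valid when the discrete recurrent weight satisfies $0\leq f(w_{j})<1$:

\sum_{k=1}^{\infty}k\,f(w_{j})^{k-1}
=\frac{d}{dx}\left.\left(\sum_{k=0}^{\infty}x^{k}\right)\right|_{x=f(w_{j})}
=\frac{1}{(1-f(w_{j}))^{2}}
=\left(\sum_{k=1}^{\infty}f(w_{j})^{k-1}\right)^{2}.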

Nonlinear functionals

Now we show the generalization to nonlinear functionals: consider the Volterra series representation of the nonlinear functional.

Theorem C.8 ((Boyd et al., 1984)).

Any continuous time-invariant system with $x(t)$ as input and $y(t)$ as output can be expanded in a Volterra series as follows:

y(t)=h0+n=1N0t0thn(τ1,,τn)j=1nx(tτj)dτj.𝑦𝑡subscript0superscriptsubscript𝑛1𝑁superscriptsubscript0𝑡superscriptsubscript0𝑡subscript𝑛subscript𝜏1subscript𝜏𝑛superscriptsubscriptproduct𝑗1𝑛𝑥𝑡subscript𝜏𝑗𝑑subscript𝜏𝑗y(t)=h_{0}+\sum_{n=1}^{N}\int_{0}^{t}\cdots\int_{0}^{t}h_{n}(\tau_{1},\dots,% \tau_{n})\prod_{j=1}^{n}x(t-\tau_{j})d\tau_{j}.italic_y ( italic_t ) = italic_h start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT + ∑ start_POSTSUBSCRIPT italic_n = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT ∫ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ⋯ ∫ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT italic_h start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ( italic_τ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_τ start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ) ∏ start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT italic_x ( italic_t - italic_τ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) italic_d italic_τ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT . (144)

Here $N$ is the series' order. A linear functional is an order-1 Volterra series.

For simplicity, we only discuss the case $N=2$. Following the Hyena approach (Poli et al., 2023), we approximate the order-2 kernel $h_{2}(\tau_{1},\tau_{2})$ with its rank-1 approximation:

h2(τ1,τ2)=h2,1(τ1)h2,2(τ2).subscript2subscript𝜏1subscript𝜏2subscript21subscript𝜏1subscript22subscript𝜏2h_{2}(\tau_{1},\tau_{2})=h_{2,1}(\tau_{1})h_{2,2}(\tau_{2}).italic_h start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ( italic_τ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_τ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) = italic_h start_POSTSUBSCRIPT 2 , 1 end_POSTSUBSCRIPT ( italic_τ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) italic_h start_POSTSUBSCRIPT 2 , 2 end_POSTSUBSCRIPT ( italic_τ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) . (145)

Here $h_{2,1}$ and $h_{2,2}$ are again order-1 kernels which can be approximated by linear RNN kernels. In other words, the same gradient bound also holds for general nonlinear functionals and takes the following form:

Gf(w):=|Ew|=C𝐇,𝐇^m|f(w)|f(w)2.assignsubscript𝐺𝑓𝑤𝐸𝑤subscript𝐶𝐇subscript^𝐇𝑚superscript𝑓𝑤𝑓superscript𝑤2G_{f}(w):=\left|\frac{\partial E}{\partial w}\right|=C_{\mathbf{H},\widehat{% \mathbf{H}}_{m}}\frac{|f^{\prime}(w)|}{f(w)^{2}}.italic_G start_POSTSUBSCRIPT italic_f end_POSTSUBSCRIPT ( italic_w ) := | divide start_ARG ∂ italic_E end_ARG start_ARG ∂ italic_w end_ARG | = italic_C start_POSTSUBSCRIPT bold_H , over^ start_ARG bold_H end_ARG start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT end_POSTSUBSCRIPT divide start_ARG | italic_f start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ( italic_w ) | end_ARG start_ARG italic_f ( italic_w ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG . (146)

And the discrete version is

GfD(w):=|Ew|=C𝐇,𝐇^m|f(w)|(1f(w))2.assignsubscriptsuperscript𝐺𝐷𝑓𝑤𝐸𝑤subscript𝐶𝐇subscript^𝐇𝑚superscript𝑓𝑤superscript1𝑓𝑤2G^{D}_{f}(w):=\left|\frac{\partial E}{\partial w}\right|=C_{\mathbf{H},% \widehat{\mathbf{H}}_{m}}\frac{|f^{\prime}(w)|}{(1-f(w))^{2}}.italic_G start_POSTSUPERSCRIPT italic_D end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_f end_POSTSUBSCRIPT ( italic_w ) := | divide start_ARG ∂ italic_E end_ARG start_ARG ∂ italic_w end_ARG | = italic_C start_POSTSUBSCRIPT bold_H , over^ start_ARG bold_H end_ARG start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT end_POSTSUBSCRIPT divide start_ARG | italic_f start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ( italic_w ) | end_ARG start_ARG ( 1 - italic_f ( italic_w ) ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG . (147)
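These gradient scales can be compared numerically across common reparameterizations. The sketch below is an illustration only, written under the positive-decay-rate convention $e^{-f(w)s}$ used in the proof above (so the "best" reparameterization is written as the decay rate $f(w)=\frac{1}{aw^{2}+b}$); the constant $C_{\mathbf{H},\widehat{\mathbf{H}}_{m}}$ is set to $1$ and the probe weights are arbitrary. It shows $G_{f}(w)$ blowing up for the direct, exponential and softplus parameterizations as the decay rate vanishes, while for the "best" reparameterization it grows only linearly in $w$, so the gradient-over-weight ratio stays bounded.

```python
import numpy as np

C = 1.0      # parameterization-independent constant C_{H, hat H_m}, set to 1 for illustration
a, b = 1.0, 1.0

# Each entry: name -> (f, f'), where f(w) is the decay rate |f(w)| and f' its derivative.
params = {
    "direct":   (lambda w: w,                    lambda w: np.ones_like(w)),
    "exp":      (lambda w: np.exp(w),            lambda w: np.exp(w)),
    "softplus": (lambda w: np.log1p(np.exp(w)),  lambda w: 1.0 / (1.0 + np.exp(-w))),
    "best":     (lambda w: 1.0 / (a * w**2 + b), lambda w: -2.0 * a * w / (a * w**2 + b) ** 2),
}

# Continuous-time gradient scale G_f(w) = C |f'(w)| / f(w)^2 (Equation 146).
def G(f, df, w):
    return C * np.abs(df(w)) / f(w) ** 2

# Probe weights chosen (arbitrarily) so that the decay rate |f(w)| shrinks toward zero,
# i.e. the memory of the corresponding exponential kernel becomes long.
probes = {
    "direct":   np.array([1.0, 0.1, 0.01, 0.001]),
    "exp":      np.array([0.0, -2.0, -5.0, -7.0]),
    "softplus": np.array([0.0, -2.0, -5.0, -7.0]),
    "best":     np.array([1.0, 3.0, 10.0, 30.0]),
}

for name, (f, df) in params.items():
    w = probes[name]
    print(f"{name:9s} decay rates {np.round(f(w), 5)}  G_f(w) {np.round(G(f, df, w), 3)}")
```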

C.6 Lemmas

Lemma C.9.

If the activation $\sigma(\cdot)$ is a bounded, strictly increasing, continuously differentiable function over $\mathbb{R}$, then for all $C>0$ there exists $\epsilon_{C}>0$ such that $|\sigma^{\prime}(z)|\geq\epsilon_{C}$ for all $|z|\leq C$.

Proof.

Since $\sigma(\cdot)$ is strictly increasing and continuously differentiable, $\sigma^{\prime}(z)>0$ for all $z\in\mathbb{R}$. Since $\sigma^{\prime}(\cdot)$ is continuous, for any $C>0$ we have $\min_{|z|\leq C}\sigma^{\prime}(z)>0$. Define $\epsilon_{C}:=\frac{1}{2}\min_{|z|\leq C}\sigma^{\prime}(z)>0$; then the target statement is satisfied. ∎
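As a small illustration of the constant in the lemma (a sketch only; $\sigma=\tanh$ is just one admissible activation, and the grid is an arbitrary choice):

```python
import numpy as np

def eps_C(sigma_prime, C, n_grid=10001):
    """epsilon_C = (1/2) * min_{|z| <= C} sigma'(z), as in the proof of Lemma C.9."""
    z = np.linspace(-C, C, n_grid)
    return 0.5 * sigma_prime(z).min()

tanh_prime = lambda z: 1.0 - np.tanh(z) ** 2   # derivative of tanh
for C in [1.0, 2.0, 5.0]:
    # For tanh the minimum over |z| <= C is attained at |z| = C, so eps_C = (1 - tanh(C)^2)/2.
    print(C, eps_C(tanh_prime, C), 0.5 * (1.0 - np.tanh(C) ** 2))
```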

Lemma C.10.

Assume the target functional sequence has a $\beta_{0}$-stable approximation and the perturbed model has a decaying memory. Then $\tilde{v}_{m,t}\to 0$ as $t\to\infty$ for all $m$.

Proof.

For any m𝑚mitalic_m, fix Λ~msubscript~Λ𝑚\widetilde{\Lambda}_{m}over~ start_ARG roman_Λ end_ARG start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT and U~msubscript~𝑈𝑚\widetilde{U}_{m}over~ start_ARG italic_U end_ARG start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT. Since the perturbed model has a decaying memory,

limt|ddtH~m(𝐮x)|=limt|c(σ(h~m,t)dh~m,tdt)|=limt|c(σ(h~m,t)v~m,t)|=0.subscript𝑡𝑑𝑑𝑡subscript~𝐻𝑚superscript𝐮𝑥subscript𝑡superscript𝑐topsuperscript𝜎subscript~𝑚𝑡𝑑subscript~𝑚𝑡𝑑𝑡subscript𝑡superscript𝑐topsuperscript𝜎subscript~𝑚𝑡subscript~𝑣𝑚𝑡0\lim_{t\to\infty}\left|\frac{d}{dt}\widetilde{H}_{m}(\mathbf{u}^{x})\right|=% \lim_{t\to\infty}\left|c^{\top}(\sigma^{\prime}(\tilde{h}_{m,t})\circ\frac{d% \tilde{h}_{m,t}}{dt})\right|=\lim_{t\to\infty}\left|c^{\top}(\sigma^{\prime}(% \tilde{h}_{m,t})\circ\tilde{v}_{m,t})\right|=0.roman_lim start_POSTSUBSCRIPT italic_t → ∞ end_POSTSUBSCRIPT | divide start_ARG italic_d end_ARG start_ARG italic_d italic_t end_ARG over~ start_ARG italic_H end_ARG start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT ( bold_u start_POSTSUPERSCRIPT italic_x end_POSTSUPERSCRIPT ) | = roman_lim start_POSTSUBSCRIPT italic_t → ∞ end_POSTSUBSCRIPT | italic_c start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT ( italic_σ start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ( over~ start_ARG italic_h end_ARG start_POSTSUBSCRIPT italic_m , italic_t end_POSTSUBSCRIPT ) ∘ divide start_ARG italic_d over~ start_ARG italic_h end_ARG start_POSTSUBSCRIPT italic_m , italic_t end_POSTSUBSCRIPT end_ARG start_ARG italic_d italic_t end_ARG ) | = roman_lim start_POSTSUBSCRIPT italic_t → ∞ end_POSTSUBSCRIPT | italic_c start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT ( italic_σ start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ( over~ start_ARG italic_h end_ARG start_POSTSUBSCRIPT italic_m , italic_t end_POSTSUBSCRIPT ) ∘ over~ start_ARG italic_v end_ARG start_POSTSUBSCRIPT italic_m , italic_t end_POSTSUBSCRIPT ) | = 0 . (148)

By linear algebra, there exist m𝑚mitalic_m vectors {Δci}i=1msuperscriptsubscriptΔsubscript𝑐𝑖𝑖1𝑚\{\Delta c_{i}\}_{i=1}^{m}{ roman_Δ italic_c start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT } start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT, |Δci|<βsubscriptΔsubscript𝑐𝑖𝛽|\Delta c_{i}|_{\infty}<\beta| roman_Δ italic_c start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | start_POSTSUBSCRIPT ∞ end_POSTSUBSCRIPT < italic_β such that cm+Δc1subscript𝑐𝑚Δsubscript𝑐1c_{m}+\Delta c_{1}italic_c start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT + roman_Δ italic_c start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT, …, cm+Δcmsubscript𝑐𝑚Δsubscript𝑐𝑚c_{m}+\Delta c_{m}italic_c start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT + roman_Δ italic_c start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT form a basis of msuperscript𝑚\mathbb{R}^{m}blackboard_R start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT. We can then decompose any vector u𝑢uitalic_u into

u=k1(cm+Δc1)++km(cm+Δcm).𝑢subscript𝑘1subscript𝑐𝑚Δsubscript𝑐1subscript𝑘𝑚subscript𝑐𝑚Δsubscript𝑐𝑚u=k_{1}(c_{m}+\Delta c_{1})+\cdots+k_{m}(c_{m}+\Delta c_{m}).italic_u = italic_k start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ( italic_c start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT + roman_Δ italic_c start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) + ⋯ + italic_k start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT ( italic_c start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT + roman_Δ italic_c start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT ) . (149)

Taking the inner product of $u$ with $\sigma^{\prime}(\tilde{h}_{m,t})\circ\tilde{v}_{m,t}$, we have

limtu(σ(h~m,t)v~m,t)=i=1mkilimt(cm+Δci)(σ(h~m,t)v~m,t)=0subscript𝑡superscript𝑢topsuperscript𝜎subscript~𝑚𝑡subscript~𝑣𝑚𝑡superscriptsubscript𝑖1𝑚subscript𝑘𝑖subscript𝑡superscriptsubscript𝑐𝑚Δsubscript𝑐𝑖topsuperscript𝜎subscript~𝑚𝑡subscript~𝑣𝑚𝑡0\lim_{t\to\infty}u^{\top}(\sigma^{\prime}(\tilde{h}_{m,t})\circ\tilde{v}_{m,t}% )=\sum_{i=1}^{m}k_{i}\lim_{t\to\infty}(c_{m}+\Delta c_{i})^{\top}(\sigma^{% \prime}(\tilde{h}_{m,t})\circ\tilde{v}_{m,t})=0roman_lim start_POSTSUBSCRIPT italic_t → ∞ end_POSTSUBSCRIPT italic_u start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT ( italic_σ start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ( over~ start_ARG italic_h end_ARG start_POSTSUBSCRIPT italic_m , italic_t end_POSTSUBSCRIPT ) ∘ over~ start_ARG italic_v end_ARG start_POSTSUBSCRIPT italic_m , italic_t end_POSTSUBSCRIPT ) = ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT italic_k start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT roman_lim start_POSTSUBSCRIPT italic_t → ∞ end_POSTSUBSCRIPT ( italic_c start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT + roman_Δ italic_c start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT ( italic_σ start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ( over~ start_ARG italic_h end_ARG start_POSTSUBSCRIPT italic_m , italic_t end_POSTSUBSCRIPT ) ∘ over~ start_ARG italic_v end_ARG start_POSTSUBSCRIPT italic_m , italic_t end_POSTSUBSCRIPT ) = 0 (150)

As the above result holds for any vector u𝑢uitalic_u, we get

limt|σ(h~m,t)v~m,t|=0.subscript𝑡subscriptsuperscript𝜎subscript~𝑚𝑡subscript~𝑣𝑚𝑡0\lim_{t\to\infty}\left|\sigma^{\prime}(\tilde{h}_{m,t})\circ\tilde{v}_{m,t}% \right|_{\infty}=0.roman_lim start_POSTSUBSCRIPT italic_t → ∞ end_POSTSUBSCRIPT | italic_σ start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ( over~ start_ARG italic_h end_ARG start_POSTSUBSCRIPT italic_m , italic_t end_POSTSUBSCRIPT ) ∘ over~ start_ARG italic_v end_ARG start_POSTSUBSCRIPT italic_m , italic_t end_POSTSUBSCRIPT | start_POSTSUBSCRIPT ∞ end_POSTSUBSCRIPT = 0 . (151)

As required in Equation 11, the hidden states are uniformly (in $m$) bounded over bounded input sequences; that is, there exists a constant $C_{0}>0$ such that

supm,t|hm,t|<C0.subscriptsupremum𝑚𝑡subscriptsubscript𝑚𝑡subscript𝐶0\sup_{m,t}|h_{m,t}|_{\infty}<C_{0}.roman_sup start_POSTSUBSCRIPT italic_m , italic_t end_POSTSUBSCRIPT | italic_h start_POSTSUBSCRIPT italic_m , italic_t end_POSTSUBSCRIPT | start_POSTSUBSCRIPT ∞ end_POSTSUBSCRIPT < italic_C start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT . (152)

Since σ𝜎\sigmaitalic_σ is continuously differentiable and strictly increasing, by Lemma C.9, there exists ϵC0>0subscriptitalic-ϵsubscript𝐶00\epsilon_{C_{0}}>0italic_ϵ start_POSTSUBSCRIPT italic_C start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT > 0 such that

|σ(z)|>ϵC0,|z|C0.formulae-sequencesuperscript𝜎𝑧subscriptitalic-ϵsubscript𝐶0for-all𝑧subscript𝐶0|\sigma^{\prime}(z)|>\epsilon_{C_{0}},\quad\forall|z|\leq C_{0}.| italic_σ start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ( italic_z ) | > italic_ϵ start_POSTSUBSCRIPT italic_C start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT , ∀ | italic_z | ≤ italic_C start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT . (153)

Therefore

supt|σ(h~m,t)|>ϵC0.subscriptsupremum𝑡subscriptsuperscript𝜎subscript~𝑚𝑡subscriptitalic-ϵsubscript𝐶0\sup_{t}\left|\sigma^{\prime}(\tilde{h}_{m,t})\right|_{\infty}>\epsilon_{C_{0}}.roman_sup start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT | italic_σ start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ( over~ start_ARG italic_h end_ARG start_POSTSUBSCRIPT italic_m , italic_t end_POSTSUBSCRIPT ) | start_POSTSUBSCRIPT ∞ end_POSTSUBSCRIPT > italic_ϵ start_POSTSUBSCRIPT italic_C start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT . (154)

We get

limt|v~m,t|=0.subscript𝑡subscriptsubscript~𝑣𝑚𝑡0\lim_{t\to\infty}|\tilde{v}_{m,t}|_{\infty}=0.roman_lim start_POSTSUBSCRIPT italic_t → ∞ end_POSTSUBSCRIPT | over~ start_ARG italic_v end_ARG start_POSTSUBSCRIPT italic_m , italic_t end_POSTSUBSCRIPT | start_POSTSUBSCRIPT ∞ end_POSTSUBSCRIPT = 0 . (155)

Lemma C.11.

Consider a dynamical system with the following dynamics: h0=0subscript00h_{0}=0italic_h start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT = 0

dvtdt𝑑subscript𝑣𝑡𝑑𝑡\displaystyle\frac{dv_{t}}{dt}divide start_ARG italic_d italic_v start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_ARG start_ARG italic_d italic_t end_ARG =Λvt,absentΛsubscript𝑣𝑡\displaystyle=\Lambda v_{t},= roman_Λ italic_v start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , (156)
v0subscript𝑣0\displaystyle v_{0}italic_v start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT =Λh0+U~x0+b~=U~x0+b~.absentΛsubscript0~𝑈subscript𝑥0~𝑏~𝑈subscript𝑥0~𝑏\displaystyle=\Lambda h_{0}+\widetilde{U}x_{0}+\tilde{b}=\widetilde{U}x_{0}+% \tilde{b}.= roman_Λ italic_h start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT + over~ start_ARG italic_U end_ARG italic_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT + over~ start_ARG italic_b end_ARG = over~ start_ARG italic_U end_ARG italic_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT + over~ start_ARG italic_b end_ARG .

If $\Lambda\in\mathbb{R}^{m\times m}$ is diagonal and hyperbolic, and the system in Equation 156 satisfies $\lim_{t\to\infty}v_{t}=0$ over any bounded Heaviside input $\mathbf{u}^{x_{0}}$, $|x_{0}|_{\infty}<\infty$, then the matrix $\Lambda$ is Hurwitz.

Proof.

By integration we have the following explicit form:

vt=eΛtv0=eΛt(U~x0+b~).subscript𝑣𝑡superscript𝑒Λ𝑡subscript𝑣0superscript𝑒Λ𝑡~𝑈subscript𝑥0~𝑏v_{t}=e^{\Lambda t}v_{0}=e^{\Lambda t}(\widetilde{U}x_{0}+\tilde{b}).italic_v start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT = italic_e start_POSTSUPERSCRIPT roman_Λ italic_t end_POSTSUPERSCRIPT italic_v start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT = italic_e start_POSTSUPERSCRIPT roman_Λ italic_t end_POSTSUPERSCRIPT ( over~ start_ARG italic_U end_ARG italic_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT + over~ start_ARG italic_b end_ARG ) . (157)

The stability requirement is $\lim_{t\to\infty}|v_{t}|=0$ for all initial conditions $v_{0}=\widetilde{U}x_{0}+\tilde{b}$. Notice that, under perturbations of $\widetilde{U}$ and $\tilde{b}$, the set of initial points $\{v_{0}\}$ spans an $m$-dimensional set. Therefore the matrix $\Lambda$ is Hurwitz in the sense that all eigenvalues' real parts are negative. ∎
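A quick numerical sketch of this fact for a diagonal, hyperbolic $\Lambda$ (the eigenvalues below are arbitrary illustrative choices): when all eigenvalues have negative real part, $e^{\Lambda t}v_{0}$ decays for every initial condition, while a single eigenvalue with positive real part gives initial conditions that do not decay.

```python
import numpy as np

def decays(eigs, t_max=50.0, n_init=100, tol=1e-6, seed=0):
    """Check whether v_t = exp(Lambda t) v_0 -> 0 for random initial conditions,
    with Lambda = diag(eigs) (real diagonal, hence hyperbolic if no eigenvalue is 0)."""
    rng = np.random.default_rng(seed)
    v0 = rng.standard_normal((n_init, len(eigs)))
    vT = v0 * np.exp(np.asarray(eigs) * t_max)   # diagonal Lambda: elementwise decay
    return np.abs(vT).max() < tol

print(decays([-0.5, -1.0, -2.0]))   # Hurwitz: True
print(decays([-0.5, -1.0,  0.1]))   # one eigenvalue with positive real part: False
```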

Lemma C.12.

Consider a continuous function $f:[0,\infty)\to\mathbb{R}$ and assume it can be approximated universally by a sequence of continuous functions $\{f_{m}\}_{m=1}^{\infty}$:

limmsupt0|f(t)fm(t)|=0.subscript𝑚subscriptsupremum𝑡0𝑓𝑡subscript𝑓𝑚𝑡0\lim_{m\to\infty}\sup_{t\geq 0}|f(t)-f_{m}(t)|=0.roman_lim start_POSTSUBSCRIPT italic_m → ∞ end_POSTSUBSCRIPT roman_sup start_POSTSUBSCRIPT italic_t ≥ 0 end_POSTSUBSCRIPT | italic_f ( italic_t ) - italic_f start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT ( italic_t ) | = 0 . (158)

Assume the approximators fmsubscript𝑓𝑚f_{m}italic_f start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT are uniformly exponentially decaying with the same β0>0subscript𝛽00\beta_{0}>0italic_β start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT > 0:

limtsupm+eβ0t|fm(t)|0.subscript𝑡subscriptsupremum𝑚subscriptsuperscript𝑒subscript𝛽0𝑡subscript𝑓𝑚𝑡0\lim_{t\to\infty}\sup_{m\in\mathbb{N}_{+}}e^{\beta_{0}t}|f_{m}(t)|\to 0.roman_lim start_POSTSUBSCRIPT italic_t → ∞ end_POSTSUBSCRIPT roman_sup start_POSTSUBSCRIPT italic_m ∈ blackboard_N start_POSTSUBSCRIPT + end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_e start_POSTSUPERSCRIPT italic_β start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT italic_t end_POSTSUPERSCRIPT | italic_f start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT ( italic_t ) | → 0 . (159)

Then the function f𝑓fitalic_f is also decaying exponentially:

limteβt|f(t)|0,0<β<β0.formulae-sequencesubscript𝑡superscript𝑒𝛽𝑡𝑓𝑡0for-all0𝛽subscript𝛽0\lim_{t\to\infty}e^{\beta t}|f(t)|\to 0,\quad\forall 0<\beta<\beta_{0}.roman_lim start_POSTSUBSCRIPT italic_t → ∞ end_POSTSUBSCRIPT italic_e start_POSTSUPERSCRIPT italic_β italic_t end_POSTSUPERSCRIPT | italic_f ( italic_t ) | → 0 , ∀ 0 < italic_β < italic_β start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT . (160)

The proof is the same as that of Lemma A.11 in Wang et al. (2023). For completeness, we attach the proof here:

Proof.

Given a function fC([0,))𝑓𝐶0f\in C([0,\infty))italic_f ∈ italic_C ( [ 0 , ∞ ) ), we consider the transformation 𝒯f:[0,1]:𝒯𝑓01\mathcal{T}f:[0,1]\to\mathbb{R}caligraphic_T italic_f : [ 0 , 1 ] → blackboard_R defined as:

(𝒯f)(s)={0,s=0f(logsβ0)s,s(0,1].𝒯𝑓𝑠cases0missing-subexpression𝑠0𝑓𝑠subscript𝛽0𝑠missing-subexpression𝑠01(\mathcal{T}f)(s)=\left\{\begin{array}[]{lcl}0,&&{s=0}\\ \frac{f(-\frac{\log s}{\beta_{0}})}{s},&&{s\in(0,1].}\end{array}\right.( caligraphic_T italic_f ) ( italic_s ) = { start_ARRAY start_ROW start_CELL 0 , end_CELL start_CELL end_CELL start_CELL italic_s = 0 end_CELL end_ROW start_ROW start_CELL divide start_ARG italic_f ( - divide start_ARG roman_log italic_s end_ARG start_ARG italic_β start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_ARG ) end_ARG start_ARG italic_s end_ARG , end_CELL start_CELL end_CELL start_CELL italic_s ∈ ( 0 , 1 ] . end_CELL end_ROW end_ARRAY (161)

Under the change of variables s=eβ0t𝑠superscript𝑒subscript𝛽0𝑡s=e^{-\beta_{0}t}italic_s = italic_e start_POSTSUPERSCRIPT - italic_β start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT italic_t end_POSTSUPERSCRIPT, we have:

f(t)=eβ0t(𝒯f)(eβ0t),t0.formulae-sequence𝑓𝑡superscript𝑒subscript𝛽0𝑡𝒯𝑓superscript𝑒subscript𝛽0𝑡𝑡0f(t)=e^{-\beta_{0}t}(\mathcal{T}f)(e^{-\beta_{0}t}),\quad t\geq 0.italic_f ( italic_t ) = italic_e start_POSTSUPERSCRIPT - italic_β start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT italic_t end_POSTSUPERSCRIPT ( caligraphic_T italic_f ) ( italic_e start_POSTSUPERSCRIPT - italic_β start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT italic_t end_POSTSUPERSCRIPT ) , italic_t ≥ 0 . (162)

According to the uniform exponential decay assumption on $f_{m}$:

lims0+(𝒯fm)(s)=limtfm(t)eβ0t=limteβ0tfm(t)=0,subscript𝑠superscript0𝒯subscript𝑓𝑚𝑠subscript𝑡subscript𝑓𝑚𝑡superscript𝑒subscript𝛽0𝑡subscript𝑡superscript𝑒subscript𝛽0𝑡subscript𝑓𝑚𝑡0\lim_{s\to 0^{+}}(\mathcal{T}f_{m})(s)=\lim_{t\to\infty}\frac{f_{m}(t)}{e^{-% \beta_{0}t}}=\lim_{t\to\infty}e^{\beta_{0}t}f_{m}(t)=0,roman_lim start_POSTSUBSCRIPT italic_s → 0 start_POSTSUPERSCRIPT + end_POSTSUPERSCRIPT end_POSTSUBSCRIPT ( caligraphic_T italic_f start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT ) ( italic_s ) = roman_lim start_POSTSUBSCRIPT italic_t → ∞ end_POSTSUBSCRIPT divide start_ARG italic_f start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT ( italic_t ) end_ARG start_ARG italic_e start_POSTSUPERSCRIPT - italic_β start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT italic_t end_POSTSUPERSCRIPT end_ARG = roman_lim start_POSTSUBSCRIPT italic_t → ∞ end_POSTSUBSCRIPT italic_e start_POSTSUPERSCRIPT italic_β start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT italic_t end_POSTSUPERSCRIPT italic_f start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT ( italic_t ) = 0 , (163)

which implies 𝒯fmC([0,1])𝒯subscript𝑓𝑚𝐶01\mathcal{T}f_{m}\in C([0,1])caligraphic_T italic_f start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT ∈ italic_C ( [ 0 , 1 ] ).

For any β<β0𝛽subscript𝛽0\beta<\beta_{0}italic_β < italic_β start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT, let δ=β0β>0𝛿subscript𝛽0𝛽0\delta=\beta_{0}-\beta>0italic_δ = italic_β start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT - italic_β > 0. Next we have the following estimate

sups[0,1]|(𝒯fm1)(s)(𝒯fm2)(s)|subscriptsupremum𝑠01𝒯subscript𝑓subscript𝑚1𝑠𝒯subscript𝑓subscript𝑚2𝑠\displaystyle\sup_{s\in[0,1]}\left|(\mathcal{T}f_{m_{1}})(s)-(\mathcal{T}f_{m_% {2}})(s)\right|roman_sup start_POSTSUBSCRIPT italic_s ∈ [ 0 , 1 ] end_POSTSUBSCRIPT | ( caligraphic_T italic_f start_POSTSUBSCRIPT italic_m start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT ) ( italic_s ) - ( caligraphic_T italic_f start_POSTSUBSCRIPT italic_m start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUBSCRIPT ) ( italic_s ) | (164)
=\displaystyle== supt0|fm1(t)eβtfm2(t)eβt|subscriptsupremum𝑡0subscript𝑓subscript𝑚1𝑡superscript𝑒𝛽𝑡subscript𝑓subscript𝑚2𝑡superscript𝑒𝛽𝑡\displaystyle\sup_{t\geq 0}\left|\frac{f_{m_{1}}(t)}{e^{-\beta t}}-\frac{f_{m_% {2}}(t)}{e^{-\beta t}}\right|roman_sup start_POSTSUBSCRIPT italic_t ≥ 0 end_POSTSUBSCRIPT | divide start_ARG italic_f start_POSTSUBSCRIPT italic_m start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( italic_t ) end_ARG start_ARG italic_e start_POSTSUPERSCRIPT - italic_β italic_t end_POSTSUPERSCRIPT end_ARG - divide start_ARG italic_f start_POSTSUBSCRIPT italic_m start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( italic_t ) end_ARG start_ARG italic_e start_POSTSUPERSCRIPT - italic_β italic_t end_POSTSUPERSCRIPT end_ARG | (165)
\displaystyle\leq max{sup0tT0|fm1(t)eβtfm2(t)eβt|,C0eδT0}subscriptsupremum0𝑡subscript𝑇0subscript𝑓subscript𝑚1𝑡superscript𝑒𝛽𝑡subscript𝑓subscript𝑚2𝑡superscript𝑒𝛽𝑡subscript𝐶0superscript𝑒𝛿subscript𝑇0\displaystyle\max\left\{\sup_{0\leq t\leq T_{0}}\left|\frac{f_{m_{1}}(t)}{e^{-% \beta t}}-\frac{f_{m_{2}}(t)}{e^{-\beta t}}\right|,C_{0}e^{-\delta T_{0}}\right\}roman_max { roman_sup start_POSTSUBSCRIPT 0 ≤ italic_t ≤ italic_T start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT | divide start_ARG italic_f start_POSTSUBSCRIPT italic_m start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( italic_t ) end_ARG start_ARG italic_e start_POSTSUPERSCRIPT - italic_β italic_t end_POSTSUPERSCRIPT end_ARG - divide start_ARG italic_f start_POSTSUBSCRIPT italic_m start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( italic_t ) end_ARG start_ARG italic_e start_POSTSUPERSCRIPT - italic_β italic_t end_POSTSUPERSCRIPT end_ARG | , italic_C start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT italic_e start_POSTSUPERSCRIPT - italic_δ italic_T start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT } (166)
\displaystyle\leq max{eβT0sup0tT0|fm1(t)fm2(t)|,C0eδT0}superscript𝑒𝛽subscript𝑇0subscriptsupremum0𝑡subscript𝑇0subscript𝑓subscript𝑚1𝑡subscript𝑓subscript𝑚2𝑡subscript𝐶0superscript𝑒𝛿subscript𝑇0\displaystyle\max\left\{e^{\beta T_{0}}\sup_{0\leq t\leq T_{0}}\left|f_{m_{1}}% (t)-f_{m_{2}}(t)\right|,C_{0}e^{-\delta T_{0}}\right\}roman_max { italic_e start_POSTSUPERSCRIPT italic_β italic_T start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT roman_sup start_POSTSUBSCRIPT 0 ≤ italic_t ≤ italic_T start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT | italic_f start_POSTSUBSCRIPT italic_m start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( italic_t ) - italic_f start_POSTSUBSCRIPT italic_m start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( italic_t ) | , italic_C start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT italic_e start_POSTSUPERSCRIPT - italic_δ italic_T start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT } (167)

where C0subscript𝐶0C_{0}italic_C start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT is a constant uniform in m𝑚mitalic_m.

For any $\epsilon>0$, take $T_{0}=-\frac{\ln(\epsilon/C_{0})}{\delta}$; then $C_{0}e^{-\delta T_{0}}\leq\epsilon$. For sufficiently large $M$, which depends on $\epsilon$ and $T_{0}$, by universal approximation (Equation 158) we have, for all $m_{1},m_{2}\geq M$,

sup0tT0|fm1(t)fm2(t)|subscriptsupremum0𝑡subscript𝑇0subscript𝑓subscript𝑚1𝑡subscript𝑓subscript𝑚2𝑡\displaystyle\sup_{0\leq t\leq T_{0}}\left|f_{m_{1}}(t)-f_{m_{2}}(t)\right|roman_sup start_POSTSUBSCRIPT 0 ≤ italic_t ≤ italic_T start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT | italic_f start_POSTSUBSCRIPT italic_m start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( italic_t ) - italic_f start_POSTSUBSCRIPT italic_m start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( italic_t ) | eβT0ϵ,absentsuperscript𝑒𝛽subscript𝑇0italic-ϵ\displaystyle\leq e^{-\beta T_{0}}\epsilon,≤ italic_e start_POSTSUPERSCRIPT - italic_β italic_T start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT italic_ϵ , (168)
eβT0sup0tT0|fm1(t)fm2(t)|superscript𝑒𝛽subscript𝑇0subscriptsupremum0𝑡subscript𝑇0subscript𝑓subscript𝑚1𝑡subscript𝑓subscript𝑚2𝑡\displaystyle e^{\beta T_{0}}\sup_{0\leq t\leq T_{0}}\left|f_{m_{1}}(t)-f_{m_{% 2}}(t)\right|italic_e start_POSTSUPERSCRIPT italic_β italic_T start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT roman_sup start_POSTSUBSCRIPT 0 ≤ italic_t ≤ italic_T start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT | italic_f start_POSTSUBSCRIPT italic_m start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( italic_t ) - italic_f start_POSTSUBSCRIPT italic_m start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( italic_t ) | ϵ.absentitalic-ϵ\displaystyle\leq\epsilon.≤ italic_ϵ . (169)

Therefore, $\{f_m\}$ is a Cauchy sequence in $C([0,\infty))$.

Since $\{f_m\}$ is a Cauchy sequence in $C([0,\infty))$ equipped with the sup-norm, the above estimate implies that $\{\mathcal{T}f_m\}$ is a Cauchy sequence in $C([0,1])$ equipped with the sup-norm. By the completeness of $C([0,1])$, there exists $f^*\in C([0,1])$ with $f^*(0)=0$ such that

\lim_{m\to\infty}\sup_{s\in[0,1]}\left|(\mathcal{T}f_m)(s)-f^*(s)\right|=0.   (170)

Given any $s>0$, we have

f^*(s)=\lim_{m\to\infty}(\mathcal{T}f_m)(s)=(\mathcal{T}f)(s),   (171)

hence

\lim_{t\to\infty}e^{\beta t}f(t)=\lim_{s\to 0^+}(\mathcal{T}f)(s)=f^*(0)=0.   (172)

Appendix D Motivation for the gradient-over-weight Lipschitz criterion

Here we discuss the motivation for adopting gradient-over-weight boundedness as the criterion for the “best-in-stability” reparameterization. First, the “best” reparameterization is proposed to further improve optimization stability across memory patterns with different decay rates. The criterion “the gradient is Lipschitz in the weight” is a necessary condition for stability in the following sense:

1. Consider the function $f(x)=x^4$. Its gradient $g(x)=\frac{df}{dx}(x)=4x^3$ does not have a global Lipschitz constant over all input values $x$. Therefore, for any fixed positive learning rate $\eta$, there exists an initial point $x_0$ (for example $x_0=\sqrt{\frac{1}{2\eta}}+1$) such that convergence from $x_0$ cannot be achieved via the gradient descent step

   x_{k+1}=x_k-\eta g(x_k).   (173)

   It can be verified that convergence does not hold, as $|x_{k+1}|>|x_k|$ for all $k$ when $x_0=\sqrt{\frac{1}{2\eta}}+1$. This follows from the fact that $|x_k|\geq\sqrt{\frac{1}{2\eta}}$ and $\eta|g(x_k)|\geq 2|x_k|$ hold for all $k$ (see the numerical sketch after this list).

2. Consider the function $f(x)=x^2$. Its gradient $g(x)=2x$ is globally Lipschitz with constant $L=2$, and the same gradient descent step in Equation 173 converges for any $\eta\leq\frac{1}{2}$.

3. As the two examples above show, the criterion “the gradient is Lipschitz in the weight” governs convergence under large learning rates. Since larger learning rates are usually associated with faster convergence (Smith & Topin, 2019) and smaller generalization errors (Li et al., 2019), we believe the Lipschitz criterion is a suitable measure of optimization stability.

4. The gradient-over-weight ratio evaluated in Figure 4(a) is a numerical verification of our Theorem 3.4. The gradients of stable reparameterizations are less susceptible to the well-known exploding and vanishing gradient issues (Bengio et al., 1994; Hochreiter, 1998).
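The following minimal sketch illustrates items 1 and 2 above numerically; the learning rate $\eta=0.1$ and the number of steps are arbitrary choices made only for this illustration.

```python
import math

def gradient_descent(grad, x0, eta, steps):
    """Plain gradient descent: x_{k+1} = x_k - eta * grad(x_k)."""
    x = x0
    trajectory = [x]
    for _ in range(steps):
        x = x - eta * grad(x)
        trajectory.append(x)
    return trajectory

eta = 0.1  # illustrative learning rate
x0 = math.sqrt(1.0 / (2.0 * eta)) + 1.0  # the bad initial point from item 1

# Item 1: f(x) = x^4, gradient 4x^3 is not globally Lipschitz; |x_k| grows every step.
diverging = gradient_descent(lambda x: 4.0 * x ** 3, x0, eta, steps=5)
print(["%.2e" % x for x in diverging])

# Item 2: f(x) = x^2, gradient 2x is 2-Lipschitz; the same step converges for eta <= 1/2.
converging = gradient_descent(lambda x: 2.0 * x, x0, eta, steps=5)
print(["%.2e" % x for x in converging])
```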

Table 5: Summary of reparameterizations and the corresponding gradient norm functions in continuous and discrete time. Note that $G_f$ and $G_f^D$ are rescaled up to a constant $C_{\mathbf{H},\widehat{\mathbf{H}}}$.
Reparameterization | $f$ | $G_f$ or $G_f^D$
Continuous, ReLU | $-\mathrm{ReLU}(w)$ | $\frac{1}{w^{2}}\bm{1}_{\{w>0\}}$
Continuous, Exp | $-\exp(w)$ | $e^{-w}$
Continuous, Softplus | $-\log(1+\exp(w))$ | $\frac{\exp(w)}{(1+\exp(w))\log(1+\exp(w))^{2}}$
Continuous, “Best” (Ours) | $-\frac{1}{aw^{2}+b},\ a>0,\ b>0$ | $2a|w|$
Discrete, ReLU | $\exp(-\mathrm{ReLU}(w))$ | $\frac{\exp(-w)}{(1-\exp(-w))^{2}}\bm{1}_{\{w>0\}}$
Discrete, Exp | $\exp(-\exp(w))$ | $\frac{\exp(w-\exp(w))}{(1-\exp(-\exp(w)))^{2}}$
Discrete, Softplus | $\frac{1}{1+\exp(w)}$ | $e^{-w}$
Discrete, Tanh | $\tanh(w)=\frac{e^{2w}-1}{e^{2w}+1}$ | $e^{2w}$
Discrete, “Best” (Ours) | $1-\frac{1}{w^{2}+0.5}\in(-1,1)$ | $2|w|$
Figure 6: Gradient norm functions $G_f$ and $G_f^D$ of different parameterization methods. (a) Continuous time, $G_f(w)/|w|$. (b) Discrete time, $G_f^D(w)/|w|$. The “best” parameterization maintains a balanced gradient-over-weight ratio.

Appendix E Comparison of different recurrent weights parameterization schemes

Here we evaluate the gradient norm bound functions $G_f$ and $G_f^D$ for the different parameterization schemes in Table 5 and Figure 6.
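As an illustration of how a comparison such as Figure 6(a) can be reproduced, the following minimal sketch evaluates the continuous-time gradient-over-weight ratios $G_f(w)/|w|$ from Table 5; the weight grid and the constants $a=1$, $b=0.5$ for the “best” parameterization are assumptions made for this example.

```python
import numpy as np

w = np.linspace(0.1, 5.0, 200)  # assumed weight grid (avoiding w = 0)

# Continuous-time gradient norm functions G_f from Table 5 (up to the constant C_{H, H_hat}).
G_relu = 1.0 / w ** 2                     # f(w) = -ReLU(w), restricted to w > 0
G_exp = np.exp(-w)                        # f(w) = -exp(w)
G_softplus = np.exp(w) / ((1.0 + np.exp(w)) * np.log1p(np.exp(w)) ** 2)
a, b = 1.0, 0.5                           # assumed constants for the "best" parameterization
G_best = 2.0 * a * np.abs(w)              # f(w) = -1/(a w^2 + b)

# Gradient-over-weight ratios: the "best" parameterization keeps the ratio constant (2a),
# while the others blow up as w -> 0 or vanish as w grows.
for name, G in [("ReLU", G_relu), ("Exp", G_exp), ("Softplus", G_softplus), ("Best", G_best)]:
    ratio = G / np.abs(w)
    print(f"{name:8s} ratio range: [{ratio.min():.4f}, {ratio.max():.4f}]")
```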

On the Scenarios Where “Best” Parameterization is Preferable

There is no guarantee that the “best” parameterization will outperform the Exp/Softplus parameterizations when all models already exhibit good training stability. When the learning rate is fine-tuned (at 5e-4) for CIFAR10, the optimal performance of the “best” parameterization is worse than that of the Exp parameterization. This outcome is expected, since this paper focuses on training stability rather than generalization. The key insight from Tables 1 and 2 is that the “best” parameterization offers a theoretically grounded alternative to the Exp/Softplus parameterizations.

Appendix F Numerical details

In this section, the details of the numerical experiments are provided for completeness and reproducibility.

F.1 Synthetic task

We approximate a linear functional with linear RNNs in the one-dimensional-input, one-dimensional-output case. The synthetic linear functional is constructed with the polynomially decaying memory function $\rho(t)=\frac{1}{(t+1)^{1.1}}$. The sequence length is 100 and the total number of synthetic samples is 153600. The learning rate is 0.01 and the batch size is 512.
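A minimal sketch of how such a synthetic target can be generated is shown below; the i.i.d. Gaussian input distribution and the unit discretization step are assumptions for illustration rather than the exact data pipeline used in our experiments.

```python
import numpy as np

def memory_function(t):
    """Polynomially decaying memory: rho(t) = 1 / (t + 1)^{1.1}."""
    return 1.0 / (t + 1.0) ** 1.1

def linear_functional_targets(inputs):
    """y_t = sum_{s <= t} rho(t - s) * x_s, a linear functional with memory rho."""
    _, length = inputs.shape
    rho = memory_function(np.arange(length, dtype=np.float64))
    targets = np.zeros_like(inputs)
    for t in range(length):
        # causal convolution of the input with the memory function
        targets[:, t] = inputs[:, : t + 1] @ rho[: t + 1][::-1]
    return targets

rng = np.random.default_rng(0)
x = rng.standard_normal((512, 100))  # one batch of 1D input sequences (assumed Gaussian)
y = linear_functional_targets(x)
print(x.shape, y.shape)              # (512, 100) (512, 100)
```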

The perturbation list is $\beta\in[0,\ 10^{-3},\ 10^{-3}\cdot 2^{1/2},\ 10^{-3}\cdot 2^{2/2},\ \dots,\ 10^{-3}\cdot 2^{20/2}]$. Each evaluation of the perturbed error is averaged over 30 different weight perturbations to reduce the variance.
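For concreteness, the perturbation grid and the number of repetitions can be written as the following short sketch:

```python
# Perturbation magnitudes: 0, then 1e-3 * 2^{k/2} for k = 0, ..., 20.
betas = [0.0] + [1e-3 * 2 ** (k / 2) for k in range(21)]
NUM_PERTURBATIONS = 30  # independent weight perturbations averaged per beta
```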

F.2 Language models

The language modeling experiments are conducted on the WikiText-103 dataset (Merity et al., 2016). The model is based on the Hyena architecture with simple real-weight state-space models as the mixer (Poli et al., 2023; Smith et al., 2023). The batch size is 16, the total number of steps is 115200 (around 16 epochs), and the number of warmup steps is 1000. The optimizer is AdamW with a weight decay coefficient of 0.25. The learning rate for the recurrent layer is 0.004, while the learning rate for the other layers is 0.005.

Figure 7: The stability advantage of the “best” reparameterization (red line) is usually more pronounced when the learning rate is larger. (a) lr = 0.002: the “best” reparameterization is not optimal, but its final loss is comparable to Exp and Softplus. (b) lr = 0.01: the “best” reparameterization achieves the smallest loss.

In the main paper, we provide the training loss curve for learning rate 0.005, since the stability advantage of the “best” discrete-time parameterization $f(w)=1-\frac{1}{w^2+0.5}$ is most significant when the learning rate is large. In Figure 7, we further provide results for other learning rates (lr = 0.002, 0.010). Although the final loss of the “best” reparameterization is not always optimal, its training process exhibits enhanced stability compared with the other parameterization methods.
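For reference, the sketch below shows one way the discrete-time “best” reparameterization can be wired into a diagonal linear recurrence; the layer structure, dimensions, and initialization are illustrative assumptions and not the exact architecture used in the experiments.

```python
import torch
import torch.nn as nn

class StableDiagonalRecurrence(nn.Module):
    """Diagonal linear recurrence whose recurrent weights are produced by the
    discrete-time "best" reparameterization lambda = 1 - 1/(w^2 + 0.5), which
    keeps |lambda| <= 1 for every real w."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        self.w = nn.Parameter(torch.randn(hidden_dim))  # trainable raw weights

    def recurrent_weights(self) -> torch.Tensor:
        return 1.0 - 1.0 / (self.w ** 2 + 0.5)

    def forward(self, u: torch.Tensor) -> torch.Tensor:
        # u: (batch, time, hidden_dim); recurrence h_t = lambda * h_{t-1} + u_t
        lam = self.recurrent_weights()
        h = torch.zeros_like(u[:, 0])
        outputs = []
        for t in range(u.shape[1]):
            h = lam * h + u[:, t]
            outputs.append(h)
        return torch.stack(outputs, dim=1)

layer = StableDiagonalRecurrence(hidden_dim=8)
y = layer(torch.randn(2, 16, 8))
print(y.shape)  # torch.Size([2, 16, 8])
```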

F.3 On the stability of “best” reparameterization for large models

The previous experiment on WikiText-103 language modeling shows the advantage of stable reparameterizations over the unstable cases. We further verify the optimization stability of the “best” reparameterization in the following extreme setting: we construct a large-scale language model with 3B parameters and train it with a larger learning rate (lr = 0.01). As shown in the table below, the only convergent model is the one with the “best” reparameterization. We emphasize that the only difference between these models is the parameterization scheme for the recurrent weights; therefore the “best” reparameterization is the most stable one. (We repeat the experiment with three different seeds.)

Reparameterization | “Best” | Exp | Softplus | Direct
Convergent / total experiments | 3/3 | 0/3 | 0/3 | 0/3
Table 6: Stability of the “best” reparameterization at lr = 0.01. All other reparameterizations diverged within 100 steps, while the “best” reparameterization can be used to train the model.

F.4 Additional numerical results on associative recall

In this section, we study the performance of different stable reparameterizations on extremely long sequences (up to 131k). As shown in Table 7, the stable parameterizations outperform both the direct parameterization (no reparameterization) and simple clipping, and the advantage is more significant as the sequence length grows. All models are trained with exactly the same hyperparameters.

Reparameterization | Train acc, T=20 | Test acc, T=20 | Train acc, T=131k | Test acc, T=131k
“Best” | 57.95 | 99.8 | 53.57 | 100
Exp (S5) | 54.55 | 99.8 | 53.57 | 100
Clip | 50.0 | 76.6 | 13.91 | 9.4
Direct | 43.18 | 67.0 | 16.59 | 5.6
Table 7: Comparison of parameterizations on associative recall. The first two columns report train and test accuracy for sequence length 20 with vocabulary size 10; the last two columns report train and test accuracy for sequence length 131k with vocabulary size 30.
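For completeness, a minimal sketch of an associative-recall data generator is given below; the exact token layout and key/value encoding used in our experiments may differ, and the batch size and number of pairs are chosen only for illustration.

```python
import numpy as np

def associative_recall_batch(batch_size, num_pairs, vocab_size, rng):
    """Sketch of an associative-recall batch: num_pairs (key, value) pairs are
    interleaved into a sequence, followed by a query key drawn from the
    presented keys; the target is the value bound to that key."""
    keys = np.stack([rng.permutation(vocab_size)[:num_pairs] for _ in range(batch_size)])
    values = rng.integers(0, vocab_size, size=(batch_size, num_pairs))
    seq = np.empty((batch_size, 2 * num_pairs + 1), dtype=np.int64)
    seq[:, 0:2 * num_pairs:2] = keys    # key positions
    seq[:, 1:2 * num_pairs:2] = values  # value positions
    query = rng.integers(0, num_pairs, size=batch_size)
    seq[:, -1] = keys[np.arange(batch_size), query]   # query key at the end
    targets = values[np.arange(batch_size), query]    # value to recall
    return seq, targets

rng = np.random.default_rng(0)
x, y = associative_recall_batch(batch_size=4, num_pairs=9, vocab_size=10, rng=rng)
print(x.shape, y.shape)  # (4, 19) (4,)
```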