
From Generalization Analysis to Optimization Designs for State Space Models

Fusheng Liu
National University of Singapore
fusheng@u.nus.edu
   Qianxiao Li
National University of Singapore
qianxiao@nus.edu.sg
Abstract

A State Space Model (SSM) is a foundation model in time series analysis, which has recently been shown to be an alternative to transformers in sequence modeling. In this paper, we theoretically study the generalization of SSMs and propose improvements to training algorithms based on the generalization results. Specifically, we give a data-dependent generalization bound for SSMs, showing an interplay between the SSM parameters and the temporal dependencies of the training sequences. Leveraging the generalization bound, we (1) set up a scaling rule for model initialization based on the proposed generalization measure, which significantly improves the robustness of the output value scales on SSMs to different temporal patterns in the sequence data; (2) introduce a new regularization method for training SSMs to enhance the generalization performance. Numerical experiments are conducted to validate our results.

1 Introduction

Sequence modeling has been a long-standing research topic in many machine learning areas, such as speech recognition (Hinton et al., 2012), time series prediction (Li et al., 2019), and natural language processing (Devlin et al., 2019). Various machine learning models have been successfully applied in sequence modeling to handle different types of sequence data, ranging from the (probabilistic) Hidden Markov model (Baum and Petrie, 1966) to deep learning models, e.g., Recurrent Neural Networks (RNNs), Long Short-Term Memory units (Hochreiter and Schmidhuber, 1997), Gated Recurrent Units (Chung et al., 2014), and transformers (Vaswani et al., 2017). In this paper, we focus on the state space model (SSM), which has a simple mathematical expression: $h'(t) = Ah(t) + Bx(t)$, $y(t) = Ch(t) + Dx(t)$, where $h(t)$ is the hidden state, $x(t)$ is the input sequence, $y(t)$ is the output sequence, and $A, B, C, D$ are trainable parameters. To simplify the analysis, we omit the skip connection by letting $D = 0$. In fact, our analysis can also be applied to the case where $D$ is included (see the discussions in Section 4.2). Recent studies have demonstrated the power of SSMs in deep learning. For example, it was shown in Gu et al. (2022a) that with a new parameterization and a carefully chosen initialization, the structured state space sequence (S4) model achieved strong empirical results on image and language tasks. Following the S4 model, more variants of SSMs have been proposed, e.g., diagonal SSMs (Gu et al., 2022b, Gupta et al., 2022), S5 (Smith et al., 2023), H3 (Fu et al., 2023), GSS (Mehta et al., 2023), Hyena Hierarchy (Poli et al., 2023), and Mamba (Gu and Dao, 2023).

Theoretical analysis and understanding of the approximation and optimization of SSMs are well studied in the literature, e.g., (Li et al., 2021; 2022, Gu et al., 2022a; 2023). Since the SSM can be regarded as a continuous linear RNN model (Li et al., 2022), most generalization analysis of SSMs is based on the generalization theory of RNNs (Zhang et al., 2018, Chen et al., 2019, Tu et al., 2019). However, these previous works did not study the effects of the temporal dependencies in the sequence data on SSM generalization (see more details on the comparison in Section 4.1). As an attempt to understand the relationship between the temporal dependencies and the generalization performance, this paper aims to provide a generalization bound that connects the memory structure of the model with the temporal structure of the data. We can, in turn, use the proposed bound to guide us in designing new algorithms to improve optimization and generalization. Specifically, we discover two roles for the proposed generalization measure: (1) the generalization bound as an initialization scheme; (2) the generalization bound as a regularization method. The common initialization method for the S4 model and its variants follows the HiPPO framework (Gu et al., 2022a; 2023), which is based on the prerequisite that the training sequence data is stable. To improve the robustness of the output value scales on SSMs to different temporal patterns in the sequence data, we rescale the initialization of SSMs with respect to the generalization measure. This new initialization scheme makes the initial output value scales of SSMs more resilient to variations in the temporal patterns of the training data. Beyond the initialization setup, our generalization bound can also serve as a regularizer. Regularization methods like weight decay and dropout are widely applied to training SSMs, but the hidden state matrix $A$ is not regularized because its imaginary part controls the oscillating frequencies of the basis function $e^{At}B$ (Gu et al., 2022b). By taking into account the interaction between the SSM structure and the temporal dependencies, we introduce a new regularization method based on our bound, which can be applied to the hidden state space to improve the generalization performance. Combining the initialization scheme and the regularization method, our approach is applicable to various tasks, ranging from image classification to language processing, while introducing only a minimal computational overhead. To summarize, our contributions are as follows:

  • We provide a data-dependent generalization bound for SSMs by taking into account the temporal structure. Specifically, the generalization bound correlates with the memory structure of the model and the (auto)covariance process of the data. It indicates that instead of the weight or the data norm, it is the interplay between the memory structure and the temporal structure of the sequence data that influences the generalization.

  • Based on the proposed generalization bound, we set up an initialization scaling rule by adjusting the magnitude of the model parameters with respect to the generalization measure at initialization. This scaling rule improves the robustness of the initial output value scales on SSMs across different temporal patterns of the sequence data.

  • Apart from the initialization scheme, we design a new regularizer for SSMs. Unlike weight decay, our regularizer does not penalize the parameter norm but encourages the model to find a minimizer with a lower generalization bound to improve the generalization performance.

2 Related Works

Since a SSM is also a continuous linear RNN, there are three lines of related work: generalization of RNNs, temporal structure analysis on RNNs, and optimization of SSMs.

Generalization of RNNs. Existing works on the generalization of RNNs focus on the generalization error bound analysis. Specifically, in the early two works of Dasgupta and Sontag (1995) and Koiran and Sontag (1998), VC dimension-based generalization bounds were provided to show the learnability of RNNs. In recent studies, Zhang et al. (2018), Chen et al. (2019), Tu et al. (2019) proved norm-based generalization bounds, improving the VC dimension-based bounds by the Rademacher complexity technique (Bartlett and Mendelson, 2002) under the uniform-convergence framework. In the overparameterization settings, it was shown in Allen-Zhu and Li (2019) that RNNs can learn some concept class in polynomial time given that the model size is large enough. These generalization bounds, however, do not take into account the temporal dependencies and their effects on generalization. In this work, we provide a new generalization bound by combining the memory structure of the model and the temporal structure of the data.

Temporal structure analysis on RNNs. Sequence data has long-range temporal dependencies across the time domain, which notably set it apart from non-sequence data. Recent works have studied the effects of such temporal dependencies on the approximation and optimization of RNNs. For example, in the two works of Li et al. (2021; 2022), a “curse of memory” phenomenon was discovered when using linear RNNs to model temporal input-output relationships. In particular, when the target relationship between the input and output has long-term memory, both approximation and optimization become extremely challenging. In Wang et al. (2023), the “curse of memory” phenomenon on approximation and optimization was extended to non-linear RNNs based on the temporal relationships. In this paper, we conduct a fine-grained analysis of the effects of the temporal structure on the generalization of RNNs.

Optimization of SSMs. RNN optimization is known for two issues: training stability and computational cost (Bengio et al., 1994, Pascanu et al., 2013). To address these issues and capture long-range dependencies efficiently in sequence modeling, the S4 model was proposed with a new parameterization, initialization, and discretization (Gu et al., 2022a). Recent variants of the S4 model simplify the hidden state matrix to a diagonal matrix to enhance computational efficiency (Gu et al., 2022b, Gupta et al., 2022, Smith et al., 2023, Orvieto et al., 2023). Regularization methods are also applied to SSMs to prevent overfitting, such as dropout, weight decay, and the data continuity regularizer (Qu et al., 2023). However, a principled way to regularize and initialize the parameters remains to be explored. In this study, we design a new regularization and initialization scheme to improve both optimization and generalization.

3 Preliminaries

In this section, we briefly introduce the SSM in Section 3.1 and the motivation for optimization designs based on the generalization analysis in Section 3.2.

3.1 Introduction to SSMs

In this paper, we consider the following single-input single-output SSM,

$h'(t) = Ah(t) + Bx(t), \quad y(t) = Ch(t), \quad t \geq 0$   (1)

where $x$ is the input from an input space $\mathcal{X} := C_0(\mathbb{R}_{\geq 0}, \mathbb{R})$ (a linear space of continuous functions from $\mathbb{R}_{\geq 0}$ to $\mathbb{R}$ that vanish at infinity); $y(t) \in \mathbb{R}$ is the output at time $t$; $h(t) \in \mathbb{R}^{m}$ is the hidden state with $h(0) = 0$; and $A \in \mathbb{R}^{m \times m}, B \in \mathbb{R}^{m \times 1}, C \in \mathbb{R}^{1 \times m}$ are trainable parameters. Then (1) has the explicit solution $y(t) = \int_0^t \rho_{\theta}(s) x(t-s) \, ds$, where $\rho_{\theta}(s) := C e^{As} B$ with $\theta = (C, A, B)$. The function $\rho_{\theta}(s)$ captures the memory structure of the model and the temporal input-output relationship (Li et al., 2022). For the S4 model and its variants (Gu et al., 2022a; b, Gupta et al., 2022, Gu et al., 2023), (1) is usually discretized by the Zero-Order Hold method, i.e., given a timescale $\Delta \in \mathbb{R}$, $h_{k+1} = \bar{A} h_k + \bar{B} x_k$, $y_k = \bar{C} h_k$, $k = 0, 1, \ldots$, where $\bar{A} = e^{\Delta A}$, $\bar{B} = (\bar{A} - \mathbb{I}_m) A^{-1} B$, $\bar{C} = C$.
Then $y_k = \bar{C}\bar{A}^{k}\bar{B} x_0 + \bar{C}\bar{A}^{k-1}\bar{B} x_1 + \ldots + \bar{C}\bar{B} x_k = [\bar{K} * x]_k$, where $\bar{K} = (\bar{C}\bar{B}, \bar{C}\bar{A}\bar{B}, \ldots, \bar{C}\bar{A}^{k}\bar{B})$ and $*$ denotes convolution.
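To make the discretization concrete, here is a minimal NumPy sketch (with placeholder dimensions, timescale, and randomly drawn parameters, not the initialization used later in the paper) that applies the Zero-Order Hold rule above, builds the convolution kernel $\bar{K}$, and evaluates $y_k = [\bar{K} * x]_k$:

```python
import numpy as np
from scipy.linalg import expm

def zoh_discretize(A, B, C, delta):
    """Zero-Order Hold: A_bar = e^{delta*A}, B_bar = (A_bar - I) A^{-1} B, C_bar = C."""
    A_bar = expm(delta * A)
    B_bar = (A_bar - np.eye(A.shape[0])) @ np.linalg.solve(A, B)
    return A_bar, B_bar, C

def ssm_kernel(A_bar, B_bar, C_bar, length):
    """Kernel K_bar = (C_bar B_bar, C_bar A_bar B_bar, ..., C_bar A_bar^{length-1} B_bar)."""
    kernel, h = [], B_bar
    for _ in range(length):
        kernel.append((C_bar @ h).item())
        h = A_bar @ h
    return np.array(kernel)

rng = np.random.default_rng(0)
m, L, delta = 4, 16, 0.1                               # hidden size, sequence length, timescale
A = -np.eye(m) + 0.1 * rng.standard_normal((m, m))     # a (likely) stable hidden state matrix
B = rng.standard_normal((m, 1))
C = rng.standard_normal((1, m))

A_bar, B_bar, C_bar = zoh_discretize(A, B, C, delta)
K_bar = ssm_kernel(A_bar, B_bar, C_bar, L)

x = rng.standard_normal(L)                             # a toy input sequence
# Causal convolution: y_k = sum_{j=0}^{k} K_bar[j] * x[k-j]
y = np.array([np.dot(K_bar[:k + 1], x[k::-1]) for k in range(L)])
print(y[:4])
```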

3.2 Motivation: a linear regression model

In this subsection, we use a linear regression model on non-sequential data as an example to illustrate the connection between the generalization analysis and the optimization designs. This example then motivates us to extend the connection to SSMs on sequential data.

Linear regression. We consider a simple linear model $y = \theta^{\top} x$ with input $x \in \mathbb{R}^d$, output $y \in \mathbb{R}$, and parameter $\theta \in \mathbb{R}^d$. Let the training data $\{(x_i, y_i)\}_{i=1}^n$ be i.i.d. sampled from a distribution $\mathcal{D}$ such that $\|x_i\|_2 = r$ and $|y_i| \leq 1$ for all $i \in [1:n]$. Define the empirical risk $\mathcal{L}_n(\theta) := \frac{1}{n}\sum_{i=1}^n (\theta^{\top} x_i - y_i)^2$ and the population risk $\mathcal{L}_{\mathcal{D}}(\theta) := \mathbb{E}_{x,y}[(\theta^{\top} x - y)^2]$. Then, given a norm-constrained space $\Theta := \{\theta \in \mathbb{R}^d : \|\theta\|_2 \leq R\}$, with probability at least $1 - \delta$ over $\mathcal{D}$,

$\sup_{\theta \in \Theta} |\mathcal{L}_n(\theta) - \mathcal{L}_{\mathcal{D}}(\theta)| \leq (rR + 1)^2 \cdot \mathcal{O}\left(\sqrt{\log(1/\delta)/n}\right).$   (2)

This is a well-known norm-based generalization bound based on Rademacher theory (Mohri et al., 2012), and we provide a proof in Appendix B for completeness. Notice that the key term $r^2 R^2$ in the generalization bound (2) is also an upper bound for the magnitude of the linear model output, i.e., $\sup_{\theta\in\Theta} (\theta^{\top} x_i)^2 \leq r^2 R^2$. Thus, we connect the model stability with the generalization bound stability, and this connection induces an initialization scheme for the initialization $\theta^{(0)}$ by setting $\|\theta^{(0)}\|_2 \sim \mathcal{O}(1/r)$. In particular, if we normalize each input $x_i$ such that $r$ is also $\mathcal{O}(1)$, then $\|\theta^{(0)}\|_2 \sim \mathcal{O}(1)$. Since $\theta^{(0)} \in \mathbb{R}^d$, one possible initialization scheme is to draw $\theta^{(0)}$ from a uniform distribution $U[-1/\sqrt{d}, 1/\sqrt{d}]$, which corresponds to the Kaiming initialization (up to a constant) (He et al., 2015). When treating the term $r^2 R^2$ as a regularizer to improve generalization, we recover the weight decay method, i.e., $\ell_2$ regularization w.r.t. $\|\theta\|_2^2$. We summarize the above logic chain that connects the generalization analysis with optimization designs in Figure 1.
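To illustrate the two roles of the complexity term $r^2R^2$ on non-sequential data, the following small NumPy sketch (with made-up dimensions and a synthetic linear teacher; not an experiment from the paper) rescales the initialization so that $r\|\theta^{(0)}\|_2 = \mathcal{O}(1)$ and penalizes $\|\theta\|_2^2$ as weight decay during gradient descent:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, r = 256, 32, 5.0                                 # sample size, dimension, input norm (placeholders)

# Inputs with fixed norm r; labels in [-1, 1] from a hidden linear teacher.
x = rng.standard_normal((n, d))
x = r * x / np.linalg.norm(x, axis=1, keepdims=True)
theta_star = rng.standard_normal(d) / (np.sqrt(d) * r)
y = np.clip(x @ theta_star, -1.0, 1.0)

# Role 1 (initialization): Kaiming-style U[-1/sqrt(d), 1/sqrt(d)], rescaled by 1/r so r*||theta_0|| = O(1).
theta = rng.uniform(-1.0 / np.sqrt(d), 1.0 / np.sqrt(d), size=d) / r

# Role 2 (regularization): penalize the complexity term ||theta||_2^2, i.e., weight decay.
lr, lam = 0.05, 1e-3
for _ in range(200):
    grad = 2.0 / n * x.T @ (x @ theta - y) + 2.0 * lam * theta
    theta -= lr * grad

print("empirical risk:", np.mean((x @ theta - y) ** 2), "| r * ||theta||:", r * np.linalg.norm(theta))
```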

[Figure 1 diagram: Generalization estimate $\text{GenError}(\theta) \sim \text{Complexity}(\theta)/\sqrt{n}$ → Initialization scheme: set $\theta_0$ s.t. $\text{Complexity}(\theta_0) = \mathcal{O}(1)$; → Regularization method: penalize $\text{Complexity}(\theta)$.]
Figure 1: The logic diagram from generalization analysis to optimization designs.

Now for SSMs, we extend the generalization analysis from non-sequential data to sequential data by taking into account the temporal structure of the data. The linear regression example motivates us to apply the same logic diagram (Figure 1) to SSMs, which is exactly what we present in the rest of this paper.

4 Main results

In this section, we first give a generalization bound for SSMs in Section 4.1, then we design a new initialization scheme in Section 4.2 based on this proposed bound. Apart from the initialization scheme, we introduce a new regularization method in Section 4.3. Finally, we conduct experiments to test the initialization scheme and the regularization method in Section 4.4.

4.1 A generalization bound of SSMs

In this section, we present a generalization bound for the SSM (1) and reveal the effects of the temporal dependencies on the generalization performance. We show through a toy example that our bound gives a tighter estimate compared with previous norm-based bounds. Following the same notation as in Section 3.1, we define the empirical risk $R_n(\theta)$ and the population risk $R_x(\theta)$ as

$R_n(\theta) := \frac{1}{n}\sum_{i=1}^{n}\left|\int_0^T \rho_{\theta}(T-s) x_i(s) \, ds - y_i\right|^2, \quad R_x(\theta) := \mathbb{E}_x\left|\int_0^T \rho_{\theta}(T-s) x(s) \, ds - y\right|^2$

where $T > 0$ is some finite terminal time, the training sequences $\{x_i(t)\}_{i=1}^n$ are independently sampled from a stochastic process with mean $\mathbb{E}[x(t)] := \mu(t)$ and covariance $\mathbb{E}[(x(s)-\mu(s))(x(t)-\mu(t))] := K(s,t)$, and the label $y$ is generated by some underlying functional $H_T: \mathcal{X} \to \mathbb{R}$, i.e., $y = H_T(x)$. We assume that $|y| \leq 1$ for any $x \in \mathcal{X}$; otherwise, we truncate the value of the label to 1. Next, we make an assumption on the normalized process $\tilde{x}(t) := (x(t) - \mu(t))/\sqrt{K(t,t)}$:

Assumption 1.

The normalized process $\tilde{x}(t)$ is (1) almost surely Hölder continuous, i.e., $\exists L, H > 0$ s.t. $\forall s, t \in [0,T]$, $|\tilde{x}(s) - \tilde{x}(t)| \leq L|s-t|^{H}$ a.s.; and (2) $\sigma^2$-sub-Gaussian for every $t \in [0,T]$, i.e., $\exists \sigma > 0$ s.t. $\forall u > 0$, $P(|\tilde{x}(t)| \geq u) \leq 2\exp(-u^2/2\sigma^2)$ for any $t \in [0,T]$.

We defer the discussion of this assumption until after the statement of the main theorem. We now proceed to bound the generalization gap $|R_x(\theta) - R_n(\theta)|$ by establishing uniform convergence of the empirical risk to its corresponding population risk, as stated in the following theorem:

Theorem 1.

For a SSM $\int_0^T \rho_{\theta}(T-s)x(s)\,ds$, following the notations and settings in Sections 3.1 and 4.1, define $\psi(\Theta) := \sup_{\theta\in\Theta}\int_0^T \left|\rho_{\theta}(T-s)\right|\sqrt{K(s,s)}\,ds + \sup_{\theta\in\Theta}\left|\int_0^T \rho_{\theta}(T-s)\mu(s)\,ds\right|$. Then, under Assumption 1, given a parameter space $\Theta$ for $\theta$, for any $\delta \in (0,1)$, with probability at least $1-\delta$ over the training sequences,

$\sup_{\theta\in\Theta}\left|R_x(\theta) - R_n(\theta)\right| \leq (\psi(\Theta) + 1)^2 \cdot \mathcal{O}\left(\log^{3/2}(Tn/\delta)/\sqrt{n}\right).$   (3)

where $\mathcal{O}$ hides a constant that depends on $\sigma, L, H$. The proof is given in Appendix E. We see that this bound decreases to zero as the sample size $n \to \infty$, provided that the terminal time $T$ is finite and $\psi(\Theta)$ grows slower than $\sqrt{n}$. For example, when the data statistics (e.g., $\mu(s)$ and $K(s,s)$) are uniformly bounded along the time horizon, the exponential decay of the SSM function $\rho_{\theta}(s)$ implies that $\psi(\Theta)$ is finite, so the generalization bound is $\tilde{\mathcal{O}}(1/\sqrt{n})$. This shows that the mean and variance at each time position together play important roles in the generalization analysis.

Proof sketch. The proof is based on Rademacher theory (Bartlett and Mendelson, 2002). The main difficulty is to bound the Rademacher complexity of the SSM function $\int_0^T \rho_{\theta}(T-s)x(s)\,ds$ for a stochastic process $x(s)$. We first use the Hölder inequality to get an upper bound for the Rademacher complexity w.r.t. the normalized process $\tilde{x}(s)$; then, combining the Hölder continuity and the sub-Gaussian tail property in Assumption 1, we show the finiteness of $\sup_{s\in[0,T]}\tilde{x}(s)$. Finally, we use an $\varepsilon$-net argument to give an explicit bound for the Rademacher complexity, which finishes the proof.

Discussions of Assumption 1. This assumption contains two parts. Hölder continuity is used to bound $\sup_{s\in[0,T]}\tilde{x}(s)$ and the Rademacher complexity of the SSM function class. By the Kolmogorov continuity theorem (Stroock and Varadhan, 1997), Hölder continuity covers a wide range of random processes that satisfy certain moment inequalities. The sub-Gaussian property ensures that $\tilde{x}(s)$ is bounded on a finite time set with high probability; sub-Gaussian random variables include Gaussian and any bounded random variables. In particular, for image classification tasks with flattened image pixels, if the pixel values range over a finite set (e.g., integers from 0 to 255), then the Hölder continuity condition can be dropped. We leave more detailed discussions, with concrete examples that satisfy Assumption 1, to Appendix C.

Comparison to previous bounds. Since a SSM is also a continuous linear RNN model, we compare (3) with previous bounds for linear RNNs. A generalization bound $\widetilde{\mathcal{O}}\left(\|x\|_2 \|B\|_2 \|C\|_2 \|A\|_2/\sqrt{n}\right)$ is provided in Chen et al. (2019), where $\|x\|_2$ is the 2-norm of the discrete input sequence. In the continuous case, $\|x\|_2$ corresponds to the $L^2$ norm w.r.t. a Dirac measure. By changing the matrix 2-norm to the matrix 1-norm, Tu et al. (2019) show another similar generalization bound. These bounds separate the data complexity and the model complexity via the data norm and the model parameter norm individually, and do not account for the temporal dependencies across the time domain. In this work, we instead incorporate the temporal dependencies via the sequence statistics (mean and variance) to obtain a generalization bound. Next, we use a toy example to illustrate that our bound gives a tighter estimate. Given a stochastic process $\{x(t)\}_{t\in[0,T]}$ with mean $\mu(t)$ and covariance $K(s,t)$, we consider the following two upscale transformations (by increasing $T$ to $2T$):

  1. left zero padding: $x_1(t) = 0$ for $t \in [0, T)$; $x_1(t) = x(t-T)$ for $t \in [T, 2T]$;

  2. right zero padding: $x_2(t) = x(t)$ for $t \in [0, T]$; $x_2(t) = 0$ for $t \in (T, 2T]$.

Then the two SSM outputs are given by $y_i(2T) = \int_0^{2T}\rho_{\theta}(2T-s)x_i(s)\,ds$ for $i = 1, 2$. Hence,

$y_1(2T) = C\int_0^T e^{A(T-s)}Bx(s)\,ds, \quad y_2(2T) = Ce^{AT}\int_0^T e^{A(T-s)}Bx(s)\,ds.$

We see that the magnitudes of $y_1(2T)$ and $y_2(2T)$ differ by an exponential factor $e^{AT}$. Since all the eigenvalues of $A$ have negative real part, $y_2(2T) \to 0$ as $T$ increases. Hence, the right zero padding transformation degenerates the SSM function class to a zero function class for large $T$, inducing a minimal generalization gap that only contains the statistical sampling error (see (3) with $K(s,s) = \mu(s) = 0$). Therefore, a desired generalization bound should reflect the difference caused by these two different temporal dependencies. However, previous norm-based generalization bounds do not capture this difference, as the two transformations produce the same $L^2$ norm for the input sequence. Let us see what happens for our proposed generalization measure. For the left zero padding, the key term in (3) becomes

$\int_0^T \left|Ce^{A(T-s)}B\right|\sqrt{K(s,s)}\,ds + \left|\int_0^T Ce^{A(T-s)}B\mu(s)\,ds\right| + 1$   (4)

For the right zero padding, the key term in (3) becomes

$\int_0^T \left|Ce^{AT}e^{A(T-s)}B\right|\sqrt{K(s,s)}\,ds + \left|\int_0^T Ce^{AT}e^{A(T-s)}B\mu(s)\,ds\right| + 1$   (5)

The detailed derivations are given in Appendix D. By the same argument, our bound (3) indeed captures the difference in the magnitude of the generalization performance for these two sequence transformations. In particular, as $T \to \infty$, (5) reduces to 1, which yields a minimal generalization gap, as expected for the zero function class. In this sense, we obtain a tighter bound for SSMs.
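The gap between (4) and (5) can also be checked numerically. The sketch below (a stable diagonal $A$ and a zero-mean, unit-variance process, all placeholder choices) approximates both key terms with a Riemann sum; the right-padding term (5) collapses toward 1 as $T$ grows, while the left-padding term (4) does not:

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(0)
m = 4
A = np.diag(-rng.uniform(0.5, 1.5, m))        # stable diagonal hidden state matrix (placeholder)
B = rng.standard_normal((m, 1))
C = rng.standard_normal((1, m))

def rho(t):
    """Memory function rho_theta(t) = C e^{A t} B."""
    return (C @ expm(A * t) @ B).item()

def key_term(T, right_padding, n_grid=400):
    s = np.linspace(0.0, T, n_grid)
    ds = s[1] - s[0]
    mu = np.zeros_like(s)                     # mu(s) = 0
    sqrt_K = np.ones_like(s)                  # sqrt(K(s, s)) = 1
    # (5) carries the extra factor e^{AT}: C e^{AT} e^{A(T-s)} B = rho(2T - s).
    arg = (2 * T - s) if right_padding else (T - s)
    vals = np.array([rho(a) for a in arg])
    return np.sum(np.abs(vals) * sqrt_K) * ds + abs(np.sum(vals * mu) * ds) + 1.0

for T in (2.0, 5.0, 10.0):
    print(T, key_term(T, right_padding=False), key_term(T, right_padding=True))
```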

Zero-shot transferability. A benefit of SSMs is the zero-shot transferability to other sampling frequencies (i.e., the timescale measure in the continuous case). For example, for a SSM function $y_T = \int_0^T \rho_{\theta}(T-s)x(s)\,ds$, if we downscale the input sequence $x(s)$ to half the sampling frequency, then the SSM output becomes $y_T = \int_0^{T/2}\rho_{\theta}(T-2s)x(2s)\,ds$, which equals $\int_0^T \frac{1}{2}\rho_{\theta}(T-s)x(s)\,ds$. Now for a new SSM parameter $\tilde{\theta} = (2C, A, B)$, we have $\rho_{\tilde{\theta}}(s) = 2\rho_{\theta}(s)$, indicating that by simply modifying the SSM parameters, one can transfer the model to half the sampling frequency while keeping the output invariant. One advantage of our generalization measure is that it is also zero-shot transferable. To see this, we use the same example: under the downscaled sampling, both $\int_0^T |\rho_{\theta}(T-s)|\sqrt{K(s,s)}\,ds$ and $\left|\int_0^T \rho_{\theta}(T-s)\mu(s)\,ds\right|$ remain invariant for the new parameter $\tilde{\theta}$ because $\sqrt{K(s,s)}$ and $\mu(s)$ scale in the same way as $x(s)$. Similarly, other sampling frequencies are also zero-shot transferable for our generalization measure by simply adjusting the SSM parameters.
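The transfer argument above can be verified with a Riemann-sum approximation of the two integrals; in the toy check below (placeholder parameters and input signal), the output under half the sampling frequency with $\tilde{\theta} = (2C, A, B)$ matches the original output up to discretization error:

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(1)
m, T, N = 4, 4.0, 800
A = np.diag(-rng.uniform(0.5, 1.5, m))
B = rng.standard_normal((m, 1))
C = rng.standard_normal((1, m))

def rho(C_, t):                                  # rho_theta(t) = C e^{A t} B
    return (C_ @ expm(A * t) @ B).item()

x = lambda s: np.sin(3.0 * s) + 0.5 * np.cos(s)  # a placeholder input signal

# y_T = int_0^T rho_theta(T - s) x(s) ds, Riemann sum on [0, T].
s = np.linspace(0.0, T, N, endpoint=False)
y_orig = sum(rho(C, T - si) * x(si) for si in s) * (T / N)

# Half sampling frequency with theta_tilde = (2C, A, B): int_0^{T/2} rho_tilde(T - 2s) x(2s) ds.
s_half = np.linspace(0.0, T / 2, N // 2, endpoint=False)
y_transfer = sum(rho(2 * C, T - 2 * si) * x(2 * si) for si in s_half) * (T / 2 / (N // 2))

print(y_orig, y_transfer)                        # agree up to discretization error
```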

4.2 Generalization bound as an initialization scheme

In this section, we design a scaling rule for the SSM parameters at initialization based on the generalization bound (3). This new initialization scheme improves the robustness of the initial output value scales on SSMs across different temporal patterns of the sequence data.

Our proposed initialization scheme is built on the HiPPO-based initialization (Gu et al., 2023), which is a data-independent initialization method. Specifically, the HiPPO framework initializes the hidden state matrices $A, B$ to produce orthogonal basis functions, and the matrix $C$ to be standard normal for training stability. However, the argument for training stability relies on the prerequisite that the input sequence is constant along the time index (Gu et al. (2023, Corollary 3.4)), which limits its applicability since long-range dependencies may lead to very different temporal patterns in the input sequence. As the dashed lines in the left and right parts of Figure 2 show, the SSM output value scale and the loss value scale under the HiPPO-based initialization vary significantly across different temporal dependencies, making the loss values inconsistent during training. To address this issue, we follow the logic diagram in Figure 1 and adjust the generalization complexity to be $\mathcal{O}(1)$. Specifically, we extract the dominant term in the generalization bound (3):

$\tau(\theta) := \left(\int_0^T \left|\rho_{\theta}(T-s)\right|\sqrt{K(s,s)}\,ds + \left|\int_0^T \rho_{\theta}(T-s)\mu(s)\,ds\right|\right)^2.$   (6)

Notice that $\rho_{\theta}(s) = Ce^{As}B$, so if we rescale $C$ to $\xi C$ for some $\xi \in \mathbb{R}$, we have $\tau(\tilde{\theta}) = \xi^2 \cdot \tau(\theta)$ for $\tilde{\theta} = (\xi C, A, B)$. This induces a new initialization scheme: once the parameters $\theta = (C, A, B)$ are initialized by the HiPPO method, we rescale $C$ to $\tilde{C}$ such that

$\tilde{C} = \frac{1}{\sqrt{\tau(\theta)}} \cdot C.$   (7)

This rescaling method guarantees the SSM output value is bounded at initialization for any stochastic process that satisfies Assumption 1, ensuring the robustness of the initial loss value scales on SSMs across different temporal structures. We formalize the statement in Proposition 1.

Proposition 1.

Consider a SSM $\int_0^T \rho_{\theta}(T-s)x(s)\,ds$ with $\theta = (C, A, B)$. For any random process $x(s)$ satisfying Assumption 1, let $\tilde{C}$ be given by the rescaling method (7). Then, for $\tilde{\theta} := (\tilde{C}, A, B)$, we have $\mathbb{E}_x\left[\left|\int_0^T \rho_{\tilde{\theta}}(T-s)x(s)\,ds\right|\right] \leq \mathcal{O}(\sqrt{\log T})$. Here $\mathcal{O}$ hides a constant that only depends on $\sigma$ and $L$ as described in Assumption 1.

The proof is provided in Appendix F. Proposition 1 shows that the expected SSM output value is bounded for any stochastic process that satisfies Assumption 1, even when the input sequence is not almost surely bounded. This improves the robustness of the output value scales on SSMs in the sense that the scale of the output value does not depend on variations of the temporal structure. It is worth noting that, unlike normalization methods such as min-max normalization and standardization, our method only changes the model parameters. This is important because normalizing data values in language tasks can lead to loss of crucial information. For example, mathematical expressions like “$\max(1,9)=9$” have a contextual meaning, and normalization may destroy structured information that is essential for understanding.
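As a single-layer sanity check of the rescaling rule (7), the sketch below (placeholder parameters and a toy random process; not the paper's experimental setup) estimates $\mu(s)$ and $\sqrt{K(s,s)}$ from sampled trajectories, discretizes the dominant term $\tau(\theta)$ in (6), rescales $C$, and reports the empirical output scale, which stays $\mathcal{O}(1)$ after rescaling:

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(0)
m, T, N, n_traj = 4, 2.0, 200, 512
A = np.diag(-rng.uniform(0.5, 1.5, m))
B = rng.standard_normal((m, 1))
C = rng.standard_normal((1, m))

s = np.linspace(0.0, T, N, endpoint=False)
ds = T / N
rho = np.array([(C @ expm(A * (T - si)) @ B).item() for si in s])   # rho_theta(T - s) on the grid

# Toy training sequences: a drifting random-walk process (placeholder data).
X = 2.0 + np.cumsum(rng.standard_normal((n_traj, N)), axis=1) * np.sqrt(ds)
mu_hat = X.mean(axis=0)                      # estimate of mu(s)
std_hat = X.std(axis=0) + 1e-8               # estimate of sqrt(K(s, s))

# Dominant term tau(theta) of (6), discretized.
tau = (np.sum(np.abs(rho) * std_hat) * ds + abs(np.sum(rho * mu_hat) * ds)) ** 2

# Rescaling (7): C_tilde = C / sqrt(tau); the kernel is linear in C, so rho rescales identically.
rho_tilde = rho / np.sqrt(tau)
y = X @ rho_tilde * ds                       # SSM outputs at time T for each sampled trajectory
print("tau(theta):", tau, "| mean |output| after rescaling:", np.mean(np.abs(y)))
```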

Algorithm 1 Training an $\ell$-layer SSM with the scheme (7)
Require: training sequences with length $L$, model dimension $d$, projection matrix $C_i$ of the $i$-th layer, number of epochs $S$
1: for $s = 0$ to $S-1$ do
2:     if $s = 0$ then
3:         Sample a minibatch from the training sequences
4:         for $i = 1$ to $\ell$ do
5:             Compute mean $\mu_i \in \mathbb{R}^{L \times d}$ and variance $K_i \in \mathbb{R}^{L \times d}$ of the $i$-th layer's inputs along the batch dimension ▷ Inputs of the $i$-th layer depend on the model parameters of the first $i-1$ layers
6:             Calculate the SSM kernel $k_i \in \mathbb{R}^{L \times d}$ from the model parameters of the $i$-th layer
7:             $\tau_i \leftarrow \left[\,|k_i| * \sqrt{K_i} + |k_i * \mu_i|\,\right]_L \in \mathbb{R}^{d}$
8:             Average over the feature dimension: $\tau_i \leftarrow \|\tau_i\|_2^2 / d$
9:             Update: $C_i \leftarrow C_i / \sqrt{\tau_i}$
10:        end for
11:    end if
12:    Regular training procedure with the updated initialization
13: end for

Implementation for high-dimensional, multi-layer SSMs. In practical training, the SSMs used for tasks such as image classification or language processing are usually deep and high-dimensional ($d > 1$), while our initialization scheme (7) is designed for the one-dimensional shallow SSM. To extend it to high-dimensional SSMs, we empirically treat all features as independent and calculate $\tau(\theta)$ by its average along the feature dimension. For an $\ell$-layer SSM with initial projection matrices $C_1, \ldots, C_\ell$ at each layer, we first calculate the complexity measure $\tau_1$ for the first layer and rescale $C_1$ by $C_1/\sqrt{\tau_1}$. Then we calculate the complexity measure $\tau_2$ for the second layer from the updated input sequence of layer 2 and rescale $C_2$ by $C_2/\sqrt{\tau_2}$. We repeat this process until the last layer. We describe the complete procedure in Algorithm 1, where $|\cdot|$ and $\sqrt{\cdot}$ in Line 7 denote element-wise absolute value and element-wise square root, respectively, and $[\cdot]_L$ extracts the last position of an element obtained from the convolution. The extension of our theory to the multi-layer case is an interesting direction, which we leave for future work.
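A minimal NumPy sketch of the per-layer rescaling loop in Algorithm 1 (Lines 5-9) might look as follows; the layer interface (a `kernel()` method, a projection matrix `C`, and a `forward()` pass) is a hypothetical stand-in for whatever deep SSM implementation is actually used:

```python
import numpy as np

def rescale_projections(x_batch, layers):
    """One pass of the rescaling in Algorithm 1 (Lines 5-9) over a list of SSM layers.

    x_batch: array of shape (batch, L, d). Each layer is assumed to expose a
    kernel() -> (L, d) array, a projection matrix C, and a forward() method
    (a hypothetical interface, not the paper's actual code).
    """
    h = x_batch
    for layer in layers:
        mu = h.mean(axis=0)                        # (L, d) mean along the batch dimension
        std = np.sqrt(h.var(axis=0) + 1e-8)        # (L, d) element-wise sqrt of the variance
        k = layer.kernel()                         # (L, d) SSM kernel of this layer
        L = k.shape[0]
        tau_feat = np.empty(k.shape[1])
        for j in range(k.shape[1]):                # per-feature convolutions, last position [.]_L
            conv_abs = np.convolve(np.abs(k[:, j]), std[:, j])[L - 1]
            conv_mu = np.convolve(k[:, j], mu[:, j])[L - 1]
            tau_feat[j] = conv_abs + abs(conv_mu)
        tau = np.sum(tau_feat ** 2) / k.shape[1]   # average ||tau_i||_2^2 over the feature dimension
        layer.C = layer.C / np.sqrt(tau)           # rescale the projection matrix
        h = layer.forward(h)                       # inputs of the next layer use the rescaled C
    return layers
```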

Skip connections and nonlinearities. There are several gaps between the theory and the methodologies in this paper. The first is that the skip connection matrix D is omitted in our defined model (1). This does not affect our generalization bound, because we may write the explicit solution of (1) as y(t) = ∫_0^t (ρ_θ(s) + Dδ(s)) x(t−s) ds, where δ(·) is a delta function. In that case, the SSM is still a convolution model, but with the new kernel function ρ_θ(s) + Dδ(s). However, the initialization scheme (7) only adjusts C and requires the kernel function to be linear in C; hence (7) may not work well when Dx(t) is much larger than ∫_0^t ρ_θ(s) x(t−s) ds. We can still derive a proper rescaling scheme for this case: one straightforward way is to first calculate τ(θ) for a given initialization and then rescale C and D as C√τ(θ) and D/√τ(θ), respectively. This reinitialization guarantees that τ(θ) = 1 after rescaling. The second gap is that our theory is for single-layer linear SSMs. When nonlinearities are added, our generalization bound still holds for single-layer SSMs as long as the nonlinearity does not affect the Hölder condition and the sub-Gaussian property (Assumption 1). For Lipschitz (hence Hölder continuous) nonlinearities, there are known examples (see Appendix G) where the sub-Gaussian property is preserved after the nonlinearity.
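In discrete time the delta term contributes only at lag zero, so one way to account for D when computing τ(θ) is to fold it into the materialized kernel before the convolutions. A minimal sketch, under the assumption that D acts as a per-feature weight:

```python
def kernel_with_skip(k, D):
    # Discrete analogue of the kernel rho_theta(s) + D * delta(s): the skip
    # term only adds D at lag zero, i.e., to the first kernel entry.
    # k: materialized SSM kernel, shape (L, d); D: per-feature skip weights, shape (d,).
    k_new = k.clone()
    k_new[0] = k_new[0] + D
    return k_new
```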

4.3 Generalization bound as a regularization method

Algorithm 2 Training an ℓ-layer SSM with the scheme (8)
1: Input: training sequences with length L, model dimension d, initialization θ_0, loss function ℒ, regularization factor λ, optimizer OPT, number of epochs S
2: for s = 0 to S − 1 do
3:     Sample a minibatch from the training sequences
4:     Set the total complexity τ = 0
5:     for i = 1 to ℓ do
6:         Compute the mean μ_i ∈ ℝ^{L×d} and variance K_i ∈ ℝ^{L×d} of the inputs of the i-th layer along the batch dimension
7:         Calculate the SSM kernel k_i ∈ ℝ^{L×d} by the model parameters of the i-th layer
8:         τ_i ← [ |k_i| ∗ √K_i + |k_i ∗ μ_i| ]_L ∈ ℝ^d
9:         Averaging over the feature dimension: τ_i ← ‖τ_i‖₂² / d
10:        Add the complexity of the i-th layer: τ ← τ + τ_i
11:    end for
12:    Compute the regularized training loss: ℒ ← ℒ + λ · τ
13:    Parameter update: θ_{s+1} ← OPT(θ_s, ℒ)
14: end for
15: Output: updated model parameters θ_S
Figure 2: Effects of the initialization scheme (7) on the model output scale, the gradient norm and the training loss under different temporal dependencies, obtained by varying the moment coefficient b = 0.01, 0.1, 1. (Left) The output 𝔼_x[|y_L|] at initialization w.r.t. the Gaussian white noise sequence (x_1, …, x_L) for length L from 1 to 1000; (Middle) the gradient norm ‖∇R_n(θ)‖ at initialization w.r.t. the mean squared error (MSE) for varied sequence length; (Right) the training MSE curve for the Gaussian white noise with length L = 1000.

In addition to its role as an initialization scheme, the generalization measure can also be used as a regularizer. In this section, we utilize the bound (3) to design a regularization method that improves the generalization performance while introducing only a small extra computational cost. For the generalization bound (3), we use its dominant term (for large T), τ(θ) defined in (6), as the regularizer. The new empirical risk with regularization is then given by

R̃_n(θ) := R_n(θ) + λ · τ(θ),     (8)

where λ ≥ 0 is the regularization coefficient. When training multi-layer SSMs, we calculate the complexity τ(θ) in (8) for each layer and sum the per-layer terms into a total regularizer. We describe the complete training procedure in Algorithm 2, where the notation follows Algorithm 1.
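A minimal sketch of one regularized training step in the spirit of Algorithm 2, reusing the compute_tau helper sketched in Section 4.2; the layer interfaces and the prediction head are again placeholder assumptions.

```python
def regularized_loss(layers, head, x, targets, loss_fn, lam):
    # Forward pass through the SSM layers while accumulating the complexity
    # tau of every layer from the current minibatch statistics (Algorithm 2).
    tau_total = 0.0
    h = x                                        # (batch, L, d)
    for layer in layers:
        mu = h.mean(dim=0)                       # per-position mean of this layer's input
        K_diag = h.var(dim=0, unbiased=False)    # per-position variance
        tau_total = tau_total + compute_tau(layer.ssm_kernel(), mu, K_diag)
        h = layer(h)
    loss = loss_fn(head(h), targets)             # task loss R_n(theta)
    return loss + lam * tau_total                # regularized risk (8)
```

The returned loss is then differentiated and passed to the optimizer as usual.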

Computational cost analysis. From the training procedure in Algorithm 2, the newly introduced cost mainly comes from the convolutions between the SSM kernel and the sequence statistics (μ, K). These convolutions can be computed with the fast Fourier transform (Gu et al., 2022a) at complexity 𝒪(bdL log L), where b is the batch size, so the overall complexity of Algorithm 2 becomes 𝒪((b+2)dL log L), which is acceptable in practical training. A concrete comparison of running times on real datasets, confirming this, is given in Table 2.

4.4 Experiments

            Training loss (MSE)                 Test loss (MSE)                     Generalization measure ψ_Θ²/√n
            b=1         b=0.1       b=0.01      b=1         b=0.1       b=0.01      b=1         b=0.1       b=0.01
w/o (7, 8)  0.15±0.002  0.67±0.09   2.50±0.52   0.25±0.01   1.01±0.14   4.70±0.77   0.93±0.20   5.16±0.84   46.23±7.49
w (7)       0.11±0.006  0.27±0.02   0.20±0.01   0.20±0.003  0.75±0.05   1.06±0.12   0.45±0.03   2.36±0.44   7.19±1.60
w (8)       0.21±0.008  0.97±0.12   4.83±0.52   0.22±0.008  0.87±0.07   3.59±0.09   0.55±0.05   2.76±0.23   22.49±0.78
w (7, 8)    0.15±0.005  0.37±0.04   0.35±0.02   0.18±0.004  0.59±0.03   0.60±0.01   0.23±0.01   0.46±0.12   0.46±0.07
Table 1: Training loss, test loss and generalization measure ψ_Θ²/√n on Gaussian white noise sequences with different coefficients b after convergence. With the initialization scheme (7), SSMs achieve better optimization performance and are more robust in the final training loss across different temporal dependencies. With the regularization term (8), SSMs attain better generalization performance (lower test loss).
ListOps Text Retrieval Image Pathfinder PathX Average
S4-Legs w/o (7, 8) 61.16±0.32 88.69±0.07 91.21±0.17 87.41±0.14 95.89±0.10 96.97±0.31 86.89
w (7) 60.79±0.26 88.58±0.21 91.29±0.26 87.67±0.29 95.79±0.31 95.99±0.18 86.69
w (8) 61.63±0.10 88.80±0.27 91.17±0.17 88.27±0.14 96.02±0.16 97.18±0.20 87.18
w (7, 8) 61.04±0.25 88.53±0.04 91.21±0.31 88.63±0.21 95.92±0.45 96.51±0.53 86.97
Time / epoch, w/o (7, 8) 5min 39s 3min 24s 17min 21s 1min 55s 3min 25s 67min 41s 16min 34s
Time / epoch, w (8) 6min 03s 4min 03s 19min 19s 2min 08s 3min 50s 73min 10s 18min 6s
S4D-Legs w/o (7, 8) 60.80±0.39 87.87±0.03 90.68±0.14 86.69±0.29 94.87±0.06 97.34±0.07 86.38
w (7) 60.97±0.27 87.83±0.16 91.08±0.19 87.89±0.11 94.72±0.21 95.86±0.66 86.40
w (8) 61.32±0.43 88.02±0.06 91.10±0.11 87.98±0.09 95.04±0.07 97.46±0.15 86.82
w (7, 8) 61.48±0.09 88.19±0.42 91.25±0.17 88.12±0.25 94.93±0.30 95.63±0.48 86.60
Time / epoch, w/o (7, 8) 5min 10s 3min 07s 16min 37s 1min 42s 3min 02s 49min 39s 13min 13s
Time / epoch, w (8) 5min 33s 3min 13s 18min 43s 1min 56s 3min 28s 55min 33s 14min 44s
Table 2: Test accuracy and running time (per epoch on A100 GPU) on the LRA benchmark under different settings for different models. Mean and standard error are reported based on 3 independent runs.

This section contains experiments to demonstrate the effectiveness of the proposed initialization scheme (7) and the regularization method (8). We use a synthetic dataset and the Long Range Arena (LRA) benchmark (Tay et al., 2021) for numerical validations. To simplify the notation, we use w/o (7, 8), w (7), w (8) and w (7, 8) to represent the baseline training without rescaling and regularization, training with rescaling, training with regularization and training with both rescaling and regularization respectively.

A synthetic dataset. We consider a synthetic sequence dataset generated from Gaussian white noise. To more closely resemble real datasets, we generate training inputs by sampling from non-centered Gaussian white noise with mean μ(s) = 1 and covariance K(s, t) = (1/(|b|√π)) e^{−((s−t)/b)²}, which is a stationary Gaussian process satisfying Assumption 1 (see Section 4.1). Different temporal dependencies are then obtained by varying the coefficient b: as the magnitude of b decreases, the temporal dependence of the corresponding Gaussian white noise decreases as well. In particular, as b → 0, (1/(|b|√π)) e^{−(x/b)²} approaches the delta function δ(x), entailing zero temporal dependence for the sequence data.

In the following experiment, we generate the sequence data from the Gaussian white noise with b ∈ {1, 0.1, 0.01}. For each input sequence (x_1, …, x_L), the corresponding label is sin(x_{[L/2]}), i.e., the sine of the time-lagged input. We train the S4-Legs model (Gu et al., 2022a) (containing only the convolution layer) on the sequence data. More details on the experiment setup are provided in Appendix A.1. In Figure 2, we plot the model output 𝔼_x[|y_L|] and the gradient norm ‖∇R_n(θ)‖ at initialization, as well as the training loss (w (7)), for different temporal patterns obtained by varying the Gaussian white noise parameter b. The initialization scheme (7) improves the robustness of the output value scales (consistent with Proposition 1), of the gradient norm at initialization, and of the training loss value across different temporal structures. In Table 1, we report the training loss, test loss and the dominant generalization measure ψ_Θ²/√n after convergence. Comparing the final training loss with and without (7) in Table 1 (w/o (7, 8) vs w (7) and w (8) vs w (7, 8)), we see that the rescaling scheme (7) also improves the training performance and makes the final training error more robust to different temporal dependencies (as b varies). For the regularization method (8), comparing the final test loss with and without (8) in Table 1 (w/o (7, 8) vs w (8) and w (7) vs w (7, 8)) shows that our regularization method improves the generalization performance. Moreover, combining (7) and (8), the model achieves the best test performance across the various temporal structures of the sequence data. The positive correlation between the generalization measure ψ_Θ²/√n and the test loss across different b indicates that our generalization bound is able to capture different temporal dependencies.

LRA benchmark. For real datasets, we investigate the effects of the initialization scheme (7) and the regularization method (8) on the LRA benchmark, which contains 6 tasks ranging from image classification to language processing. We train two base models: a 6-layer S4-Legs (Goel et al., 2022) and a 6-layer S4D-Legs (Gu et al., 2022b). For the S4-Legs model the hidden state matrix A is a full matrix, while for the S4D-Legs model A is diagonal. We follow the training rules described in Gu et al. (2023). When training with the regularization (8), we vary the regularization coefficient λ over 10^{-3}, 10^{-4}, 10^{-5} for the ListOps, Text, Retrieval, Image and Pathfinder tasks; for the most challenging task, PathX, λ is taken from 10^{-4}, 10^{-5}, 10^{-6}. We report the best test accuracy when training with regularization (8), and we include the exact running time per epoch in Table 2. Note that our reproduction of the baseline numbers (w/o (7, 8)) differs slightly from the results in Gu et al. (2022b), because we do not use the same PyTorch and CUDA versions as suggested in the official codebase. These slight differences do not affect the scientific conclusions drawn in this paper.

Comparing the best test accuracy for w/o (7, 8) vs w (8) and w (7) vs w (7, 8) in Table 2, we see that adding the regularization (8) improves the generalization performance (test accuracy) in almost all tasks for both the S4-Legs and the S4D-Legs model. When only the initialization scheme is added (w (7) vs w/o (7, 8)), the rescaling method is less effective than in the synthetic case. This is because for the LRA benchmark we follow the original S4 paper (Gu et al., 2023) and add batch norm/layer norm to the model, which may itself help to decrease the temporal dependencies of the data, so the rescaling is not as effective as in the synthetic case. Nevertheless, combining the initialization scheme (7) with the regularization (8) still gives the best test performance in half of the tasks, indicating that our proposed optimization designs help to improve the generalization performance. We also compare the running time with and without the proposed optimization designs. Since (7) is applied before training and introduces no additional training complexity, we report the running time for w/o (7, 8) and w (8) in Table 2. The results show that the regularization brings only a small extra computational cost, matching the computational cost analysis in Section 4.3. An ablation study for the regularization coefficient λ is provided in Appendix A.2; the results in Table 5 and Table 6 show that the test accuracy is much more sensitive to λ for the Pathfinder and PathX tasks than for the other tasks, which aligns with the finding in Gu et al. (2023) that challenging tasks are more sensitive to hyperparameters. More details on the dataset descriptions and the experiment setup are given in Appendix A.2. Additional results for small S4-Legs and S4D-Legs models, with either smaller depth or smaller feature dimension, are included in Appendix A.3; Table 9 shows that the improvements for small models are more significant (e.g., nearly 2% on the most challenging PathX task for S4-Legs and >1% on the average accuracy for S4D-Legs). We also compare against different regularization schemes on both the synthetic and the real datasets: one baseline is filter norm regularization, i.e., regularizing the ℓ2 norm of the filter ρ_θ, and the other is weight decay on the hidden matrix A. Experiment results and details are given in Appendix A.4.

5 Discussions

In this work, we study the optimization and generalization of SSMs. Specifically, we give a data-dependent generalization bound that reveals the effect of the temporal dependencies of the sequence data on generalization. Based on the bound, we design two algorithms to improve the optimization and generalization of SSMs across different temporal patterns. The first is a new initialization scheme that rescales the initialization so that the generalization measure is normalized; it improves the robustness of SSM output scales across various temporal dependencies. The second is a new regularization method that enhances the generalization performance in sequence modeling at only a small extra computational cost. However, our theory does not apply to multi-layer SSMs, and we do not account for feature dependencies when calculating the generalization measure (6) for high-dimensional SSMs, simply treating all features as independent. Understanding the effects of depth and feature structure on the optimization and generalization of SSMs is an interesting direction, which we leave for future work.

6 Acknowledgement

This research is supported by the National Research Foundation, Singapore, under the NRF fellowship (project No. NRF-NRFF13-2021-0005).

References

  • Allen-Zhu and Li (2019) Zeyuan Allen-Zhu and Yuanzhi Li. Can SGD learn recurrent neural networks with provable generalization? Advances in Neural Information Processing Systems, 32, 2019.
  • Azmoodeh et al. (2014) Ehsan Azmoodeh, Tommi Sottinen, Lauri Viitasaari, and Adil Yazigi. Necessary and sufficient conditions for Hölder continuity of Gaussian processes. Statistics & Probability Letters, 94:230–235, 2014.
  • Bartlett and Mendelson (2002) Peter L Bartlett and Shahar Mendelson. Rademacher and gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463–482, 2002.
  • Baum and Petrie (1966) Leonard E Baum and Ted Petrie. Statistical inference for probabilistic functions of finite state Markov chains. The Annals of Mathematical Statistics, 37(6):1554–1563, 1966.
  • Bengio et al. (1994) Yoshua Bengio, Patrice Simard, and Paolo Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE transactions on neural networks, 5(2):157–166, 1994.
  • Boucheron et al. (2013) S. Boucheron, G. Lugosi, and P. Massart. Concentration Inequalities: A Nonasymptotic Theory of Independence. OUP Oxford, 2013.
  • Chen et al. (2019) Minshuo Chen, Xingguo Li, and Tuo Zhao. On generalization bounds of a family of recurrent neural networks. arXiv preprint arXiv:1910.12947, 2019.
  • Chung et al. (2014) Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555, 2014.
  • Dasgupta and Sontag (1995) Bhaskar Dasgupta and Eduardo Sontag. Sample complexity for learning recurrent perceptron mappings. Advances in Neural Information Processing Systems, 8, 1995.
  • Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186. Association for Computational Linguistics, June 2019.
  • Fu et al. (2023) Daniel Y Fu, Tri Dao, Khaled Kamal Saab, Armin W Thomas, Atri Rudra, and Christopher Re. Hungry hungry hippos: Towards language modeling with state space models. In The Eleventh International Conference on Learning Representations, 2023.
  • Goel et al. (2022) Karan Goel, Albert Gu, Chris Donahue, and Christopher Ré. It’s raw! audio generation with state-space models. In International Conference on Machine Learning, pages 7616–7633. PMLR, 2022.
  • Gu and Dao (2023) Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752, 2023.
  • Gu et al. (2022a) Albert Gu, Karan Goel, and Christopher Re. Efficiently modeling long sequences with structured state spaces. In International Conference on Learning Representations, 2022a.
  • Gu et al. (2022b) Albert Gu, Ankit Gupta, Karan Goel, and Christopher Ré. On the parameterization and initialization of diagonal state space models. Advances in Neural Information Processing Systems, 35, 2022b.
  • Gu et al. (2023) Albert Gu, Isys Johnson, Aman Timalsina, Atri Rudra, and Christopher Re. How to train your HIPPO: State space models with generalized orthogonal basis projections. In International Conference on Learning Representations, 2023.
  • Gupta et al. (2022) Ankit Gupta, Albert Gu, and Jonathan Berant. Diagonal state spaces are as effective as structured state spaces. In Advances in Neural Information Processing Systems, 2022.
  • He et al. (2015) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision, pages 1026–1034, 2015.
  • Hinton et al. (2012) Geoffrey Hinton, Li Deng, Dong Yu, George E Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N Sainath, et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal processing magazine, 29(6):82–97, 2012.
  • Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
  • Koiran and Sontag (1998) Pascal Koiran and Eduardo D Sontag. Vapnik–Chervonenkis dimension of recurrent neural networks. Discrete Applied Mathematics, 86(1):63–79, 1998.
  • Krizhevsky et al. (2009) Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
  • Ledoux and Talagrand (2013) Michel Ledoux and Michel Talagrand. Probability in Banach Spaces: isoperimetry and processes. Springer Science & Business Media, 2013.
  • Li et al. (2019) Shiyang Li, Xiaoyong Jin, Yao Xuan, Xiyou Zhou, Wenhu Chen, Yu-Xiang Wang, and Xifeng Yan. Enhancing the locality and breaking the memory bottleneck of transformer on time series forecasting. Advances in neural information processing systems, 32, 2019.
  • Li et al. (2021) Zhong Li, Jiequn Han, Weinan E, and Qianxiao Li. On the curse of memory in recurrent neural networks: Approximation and optimization analysis. In International Conference on Learning Representations, 2021.
  • Li et al. (2022) Zhong Li, Jiequn Han, Weinan E, and Qianxiao Li. Approximation and optimization theory for linear continuous-time recurrent neural networks. The Journal of Machine Learning Research, 23(1):1997–2081, 2022.
  • Linsley et al. (2018) Drew Linsley, Junkyung Kim, Vijay Veerabadran, Charles Windolf, and Thomas Serre. Learning long-range spatial dependencies with horizontal gated recurrent units. Advances in neural information processing systems, 31, 2018.
  • Maas et al. (2011) Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 142–150. Association for Computational Linguistics, June 2011.
  • Mehta et al. (2023) Harsh Mehta, Ankit Gupta, Ashok Cutkosky, and Behnam Neyshabur. Long range language modeling via gated state spaces. In The Eleventh International Conference on Learning Representations, 2023.
  • Mohri et al. (2012) Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning. The MIT Press, 2012.
  • Nangia and Bowman (2018) Nikita Nangia and Samuel R Bowman. ListOps: A diagnostic dataset for latent tree learning. arXiv preprint arXiv:1804.06028, 2018.
  • Orvieto et al. (2023) Antonio Orvieto, Samuel L Smith, Albert Gu, Anushan Fernando, Caglar Gulcehre, Razvan Pascanu, and Soham De. Resurrecting recurrent neural networks for long sequences. arXiv preprint arXiv:2303.06349, 2023.
  • Pascanu et al. (2013) Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training recurrent neural networks. In International conference on machine learning, pages 1310–1318. Pmlr, 2013.
  • Poli et al. (2023) Michael Poli, Stefano Massaroli, Eric Nguyen, Daniel Y Fu, Tri Dao, Stephen Baccus, Yoshua Bengio, Stefano Ermon, and Christopher Ré. Hyena hierarchy: Towards larger convolutional language models. arXiv preprint arXiv:2302.10866, 2023.
  • Qu et al. (2023) Eric Qu, Xufang Luo, and Dongsheng Li. Data continuity matters: Improving sequence modeling with lipschitz regularizer. In The Eleventh International Conference on Learning Representations, 2023.
  • Radev et al. (2009) Dragomir R. Radev, Pradeep Muthukrishnan, and Vahed Qazvinian. The ACL Anthology network corpus. In Proceedings of the 2009 Workshop on Text and Citation Analysis for Scholarly Digital Libraries (NLPIR4DL), pages 54–61. Association for Computational Linguistics, August 2009.
  • Smith et al. (2023) Jimmy T.H. Smith, Andrew Warrington, and Scott Linderman. Simplified state space layers for sequence modeling. In The Eleventh International Conference on Learning Representations, 2023.
  • Stroock and Varadhan (1997) Daniel W Stroock and SR Srinivasa Varadhan. Multidimensional diffusion processes, volume 233. Springer Science & Business Media, 1997.
  • Tay et al. (2021) Yi Tay, Mostafa Dehghani, Samira Abnar, Yikang Shen, Dara Bahri, Philip Pham, Jinfeng Rao, Liu Yang, Sebastian Ruder, and Donald Metzler. Long range arena : A benchmark for efficient transformers. In International Conference on Learning Representations, 2021.
  • Tu et al. (2019) Zhuozhuo Tu, Fengxiang He, and Dacheng Tao. Understanding generalization in recurrent neural networks. In International Conference on Learning Representations, 2019.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
  • Vershynin (2020) Roman Vershynin. High-dimensional probability. University of California, Irvine, 2020.
  • Wang et al. (2023) Shida Wang, Zhong Li, and Qianxiao Li. Inverse approximation theory for nonlinear recurrent neural networks. arXiv preprint arXiv:2305.19190, 2023.
  • Zhang et al. (2018) Jiong Zhang, Qi Lei, and Inderjit Dhillon. Stabilizing gradients for deep neural networks via efficient svd parameterization. In International Conference on Machine Learning, pages 5806–5814. PMLR, 2018.

Appendix A Experiments details

In this section, we provide more details on the experiments with the synthetic dataset and the LRA benchmark in Section 4.4.

A.1 The synthetic experiment

For the Gaussian white noise sequences, we generate 100 i.i.d. sequences for training and 1000 i.i.d. sequences for testing. The timescale of the discrete sequences is set to 1, i.e., to generate a Gaussian white noise sequence of length L, we sample from a multivariate normal distribution with mean 1 and covariance matrix K_{i,j} = h(i − j) for i, j ∈ [1 : L], where h(t) = (1/(|b|√π)) e^{−(t/b)²}. The model we use is the one-layer S4 model that contains only the FFTConv (fast Fourier transform convolution) layer, without activation and without the skip connection (D = 0) (Gu et al., 2022a). The state space dimension of the FFTConv layer is 64; other settings such as the discretization, the initialization and the parameterization follow the default settings in Gu et al. (2023), i.e., we use the ZOH discretization, the LegS initialization and the exponential parameterization for the hidden state matrix A.
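For reference, a short NumPy sketch (our own simplification of the description above) of how such a training set can be generated; the function and variable names are illustrative.

```python
import numpy as np

def make_white_noise_dataset(n_seq, L, b, seed=0):
    # Sample n_seq sequences of length L from the non-centered "Gaussian white
    # noise" above: mean 1 and covariance K[i, j] = h(i - j) with
    # h(t) = exp(-(t / b)^2) / (|b| * sqrt(pi)).
    rng = np.random.default_rng(seed)
    t = np.arange(L)
    K = np.exp(-(((t[:, None] - t[None, :]) / b) ** 2)) / (abs(b) * np.sqrt(np.pi))
    x = rng.multivariate_normal(mean=np.ones(L), cov=K, size=n_seq)   # (n_seq, L)
    y = np.sin(x[:, L // 2])            # label: sine of the mid-sequence input (Section 4.4)
    return x.astype(np.float32), y.astype(np.float32)

# e.g. 100 training and 1000 test sequences of length 1000 with b = 0.1
x_train, y_train = make_white_noise_dataset(100, 1000, b=0.1, seed=0)
x_test,  y_test  = make_white_noise_dataset(1000, 1000, b=0.1, seed=1)
```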

For the optimizer, we follow Gu et al. (2023) and set the optimizer by parameter groups. For the (ZOH) timescale Δ and the hidden state matrices A, B, we use the Adam optimizer with learning rate 0.001, while for the matrix C we use AdamW with learning rate 0.01 and weight decay 0.01. All parameters use a cosine annealing schedule. The batch size is 100 (full batch) and the number of training epochs is 100. The regularization coefficient λ used for training with (8) is set to 0.01 across all temporal patterns.
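The grouped optimizer setup described above can be realized, for instance, with per-group AdamW settings (a group with weight decay 0 behaves like Adam); the parameter attribute names below are placeholders, since the official S4 codebase organizes its parameter groups through its own registration mechanism.

```python
import torch

def build_optimizer_and_scheduler(model, epochs=100):
    # Group 1: ZOH timescale Delta and hidden state matrices A, B
    #          -> Adam-like update, lr = 0.001, no weight decay.
    # Group 2: projection matrix C -> AdamW, lr = 0.01, weight decay = 0.01.
    ssm_params = [model.log_dt, model.A, model.B]      # placeholder attribute names
    proj_params = [model.C]
    optimizer = torch.optim.AdamW([
        {"params": ssm_params,  "lr": 1e-3, "weight_decay": 0.0},
        {"params": proj_params, "lr": 1e-2, "weight_decay": 0.01},
    ])
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
    return optimizer, scheduler
```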

A.2 LRA benchmark

Datasets. The datasets in the LRA benchmark are (1) ListOps (Nangia and Bowman, 2018), a dataset made up of lists of mathematical operations with answers; (2) Text (Maas et al., 2011), a movie review dataset collected from IMDB, used for sentiment analysis; (3) Retrieval (Radev et al., 2009), a task of retrieving documents using byte-level texts from the ACL Anthology Network; (4) Image (Krizhevsky et al., 2009), a sequential CIFAR10 dataset used for sequence classification; (5) Pathfinder (Linsley et al., 2018), a task that requires a model to tell whether two points in an image are connected by a dashed path; (6) PathX, a task similar to Pathfinder but more challenging, with the image resolution increased from 32×32 to 128×128.

Models. The models are S4-Legs and S4D-Legs; both use the default LegS initialization. Discretization and model parameterization are set consistently with Gu et al. (2023). For the optimizer, we also follow the standard setup in Gu et al. (2023): the hidden state matrices are trained with a relatively small learning rate and no weight decay, while the other parameters are trained with AdamW at a larger learning rate. Let D, H, N denote the depth, feature dimension and hidden state space dimension, respectively; we summarize the model hyperparameters for S4-Legs and S4D-Legs in Table 3 and Table 4, respectively.

D𝐷Ditalic_D H𝐻Hitalic_H N𝑁Nitalic_N Dropout Learning rate Batch size Epochs Weight decay
ListOps 6 256 4 0 0.01 32 40 0.05
Text 6 256 4 0 0.01 16 32 0.05
Retrieval 6 256 4 0 0.01 64 20 0.05
Image 6 512 64 0.1 0.01 50 200 0.05
Pathfinder 6 256 64 0.0 0.004 64 200 0.05
PathX 6 256 64 0.0 0.0005 16 50 0.05
Table 3: List of the S4-Legs model hyperparameters for the LRA benchmark.
D𝐷Ditalic_D H𝐻Hitalic_H N𝑁Nitalic_N Dropout Learning rate Batch size Epochs Weight decay
ListOps 6 256 4 0 0.01 32 40 0.05
Text 6 256 4 0 0.01 16 32 0.05
Retrieval 6 256 4 0 0.01 64 20 0.05
Image 6 512 64 0.1 0.01 50 200 0.05
Pathfinder 6 256 64 0.0 0.004 64 200 0.05
PathX 6 256 64 0.0 0.0005 16 50 0.05
Table 4: List of the S4D-Legs model hyperparameters for the LRA benchmark.
ListOps
λ=0        61.16±0.32
λ=10^-5    61.36±0.30
λ=10^-4    61.11±0.10
λ=10^-3    61.63±0.10
Text
λ=0        88.69±0.07
λ=10^-5    88.80±0.27
λ=10^-4    88.66±0.20
λ=10^-3    88.71±0.12
Retrieval
λ=0        91.21±0.17
λ=10^-5    91.17±0.17
λ=10^-4    89.77±2.28
λ=10^-3    88.25±2.66
Image
λ=0        87.41±0.14
λ=10^-5    87.43±0.33
λ=10^-4    87.45±0.39
λ=10^-3    88.27±0.14
Pathfinder
λ=0        95.89±0.10
λ=10^-5    96.02±0.16
λ=10^-4    95.81±0.33
λ=10^-3    89.06±8.31
PathX
λ=0        96.97±0.31
λ=10^-6    97.18±0.20
λ=10^-5    97.16±0.13
λ=10^-4
Table 5: Test accuracy for S4-Legs on LRA benchmark by varying the regularization coefficient λ𝜆\lambdaitalic_λ.
ListOps
λ=0        60.80±0.39
λ=10^-5    60.85±0.62
λ=10^-4    60.80±0.44
λ=10^-3    61.32±0.43
Text
λ=0        87.87±0.03
λ=10^-5    87.64±0.17
λ=10^-4    87.87±0.36
λ=10^-3    88.02±0.06
Retrieval
λ=0        90.68±0.14
λ=10^-5    91.04±0.13
λ=10^-4    90.95±0.20
λ=10^-3    91.10±0.11
Image
λ=0        86.69±0.29
λ=10^-5    86.91±0.12
λ=10^-4    86.96±0.22
λ=10^-3    87.98±0.09
Pathfinder
λ=0        94.87±0.06
λ=10^-5    95.04±0.07
λ=10^-4    94.38±0.15
λ=10^-3    64.56±19.94
PathX
λ=0        97.34±0.07
λ=10^-6    97.32±0.14
λ=10^-5    97.46±0.15
λ=10^-4
Table 6: Test accuracy for S4D-Legs on LRA benchmark by varying the regularization coefficient λ𝜆\lambdaitalic_λ.
D𝐷Ditalic_D H𝐻Hitalic_H N𝑁Nitalic_N Dropout Learning rate Batch size Epochs Weight decay
ListOps 4 128 64 0 0.01 50 40 0.05
Text 4 128 64 0 0.01 50 50 0.0
Retrieval 4 96 4 0 0.01 64 20 0.05
Image 4 128 64 0.1 0.01 50 100 0.05
Pathfinder 6 128 64 0.0 0.004 64 40 0.01
PathX 4 96 64 0.0 0.0005 64 50 0.05
Table 7: List of the small S4-Legs model hyperparameters for the LRA benchmark.
D𝐷Ditalic_D H𝐻Hitalic_H N𝑁Nitalic_N Dropout Learning rate Batch size Epochs Weight decay
ListOps 4 128 64 0 0.01 50 40 0.05
Text 4 128 64 0 0.01 50 50 0.0
Retrieval 4 96 4 0 0.01 64 20 0.05
Image 4 128 64 0.1 0.01 50 100 0.05
Pathfinder 6 128 64 0.0 0.004 64 40 0.01
PathX 4 96 64 0.0 0.0005 64 50 0.05
Table 8: List of the small S4D-Legs model hyperparameters for the LRA benchmark.
ListOps Text Retrieval Image Pathfinder PathX Average
S4-Legs w/o (7, 8) 55.38±0.76 84.72±0.40 85.75±0.46 82.07±0.11 89.36±0.38 88.75±0.62 81.01
w (7) 53.72±1.59 85.21±0.21 84.47±1.50 83.71±0.21 89.16±1.38 88.96±1.62 80.87
w (8) 55.43±1.55 85.12±0.34 83.30±1.75 83.86±0.25 89.39±0.34 90.70±0.61 81.30
w (7, 8) 54.97±0.30 85.27±0.21 85.82±0.42 84.74±0.18 88.64±0.36 90.19±0.90 81.61
Time / epoch, w/o (7, 8) 2min 06s 50s 5min 57s 33s 2min 13s 10min 33s 3min 42s
Time / epoch, w (8) 2min 18s 52s 6min 28s 37s 2min 31s 11min 46s 4min 6s
S4D-Legs w/o (7, 8) 55.17±0.20 83.60±0.09 89.12±0.14 81.07±0.39 87.28±0.47 89.91±0.53 81.03
w (7) 55.80±0.11 85.30±0.10 89.32±0.17 82.35±0.56 88.00±0.82 90.15±0.86 81.82
w (8) 56.45±0.33 84.86±0.38 89.21±0.09 82.39±0.18 87.86±0.31 90.95±0.21 81.95
w (7, 8) 55.82±0.66 85.50±0.06 89.34±0.04 83.79±0.29 88.53±0.69 90.51±1.01 82.25
Time / epoch, w/o (7, 8) 1min 53s 47s 5min 40s 29s 2min 9min 52s 3min 27s
Time / epoch, w (8) 2min 11s 48s 6min 15s 34s 2min 16s 11min 05s 3min 52s
Table 9: Test accuracy and running time (per epoch on A100 GPU) on the LRA benchmark under different settings for small S4-Legs and S4D-Legs. Mean and standard error are reported based on 3 independent runs.
ListOps
λ=0        55.38±0.76
λ=10^-5    55.32±1.03
λ=10^-4    55.43±1.55
λ=10^-3    55.33±0.44
Text
λ=0        84.72±0.40
λ=10^-5    84.74±0.21
λ=10^-4    84.62±0.18
λ=10^-3    85.12±0.34
Retrieval
λ=0        85.75±0.46
λ=10^-5    83.30±1.75
λ=10^-4    82.71±1.18
λ=10^-3    82.09±0.41
Image
λ=0        82.07±0.11
λ=10^-5    82.80±0.32
λ=10^-4    82.98±0.15
λ=10^-3    83.86±0.25
Pathfinder
λ=0        89.36±0.38
λ=10^-5    89.39±0.34
λ=10^-4    89.20±0.19
λ=10^-3    50.54±0.01
PathX
λ=0        88.75±0.62
λ=10^-6    88.51±0.70
λ=10^-5    89.71±0.40
λ=10^-4    90.70±0.61
Table 10: Test accuracy for small S4-Legs on LRA benchmark by varying the regularization coefficient λ𝜆\lambdaitalic_λ.
ListOps
λ=0        55.17±0.20
λ=10^-5    56.45±0.33
λ=10^-4    56.03±1.36
λ=10^-3    55.48±0.50
Text
λ=0        83.60±0.09
λ=10^-5    84.13±0.48
λ=10^-4    84.48±0.20
λ=10^-3    84.86±0.38
Retrieval
λ=0        89.12±0.14
λ=10^-5    89.21±0.09
λ=10^-4    89.18±0.11
λ=10^-3    88.97±0.07
Image
λ=0        81.07±0.39
λ=10^-5    81.39±0.35
λ=10^-4    81.71±0.39
λ=10^-3    82.39±0.18
Pathfinder
λ=0        87.28±0.47
λ=10^-5    87.86±0.31
λ=10^-4    50.14±0.57
λ=10^-3    50.54±0.00
PathX
λ=0        89.91±0.53
λ=10^-6    89.79±0.65
λ=10^-5    90.95±0.21
λ=10^-4    86.32±1.53
Table 11: Test accuracy for small S4D-Legs on LRA benchmark by varying the regularization coefficient λ𝜆\lambdaitalic_λ.
Test loss (MSE)
         w/o (7, 8)   w (8)        Weight decay on A   Filter norm regularization
b=1      0.25±0.01    0.22±0.008   0.24±0.004          0.22±0.007
b=0.1    1.01±0.14    0.87±0.07    0.97±0.07           0.96±0.12
b=0.01   4.70±0.77    3.59±0.09    4.23±0.23           4.61±0.73
Table 12: Test loss for different regularization methods on synthetic data after convergence.
S4-Legs ListOps Text Retrieval Image Pathfinder PathX Avg
w/o (7, 8) 61.16±0.32 88.69±0.07 91.21±0.17 87.41±0.14 95.89±0.10 96.97±0.31 86.89
w (8) 61.63±0.10 88.80±0.27 91.17±0.17 88.27±0.14 96.02±0.16 97.18±0.20 87.18
Weight decay on A 49.90±0.67 86.58±0.91 91.21±0.17 87.65±0.16 96.00±0.09 97.22±0.05 84.76
Filter norm regularization 61.53±0.39 88.88±0.13 91.44±0.08 87.70±0.20 95.83±0.14 97.16±0.16 87.09
Table 13: Test accuracy for different regularization methods on the LRA benchmark for S4-Legs.

Ablation studies on λ. When training with the regularization method (8), we vary the regularization coefficient λ over magnitudes ranging from 10^{-6} to 10^{-3} and report the value at which the model performs best on the validation set. In Table 5 and Table 6, we report the test accuracy on the LRA benchmark with different λ for the S4-Legs and S4D-Legs models, respectively. From these results, we find that for both models adding the regularization helps the generalization performance (test accuracy) on all tasks except the Retrieval task trained with the S4-Legs model. In particular, the test accuracy is much more sensitive to the regularization coefficient λ for the Pathfinder and PathX tasks than for the other tasks. For example, the variance of the test accuracy for the Pathfinder task is very high at λ = 0.001, and for the PathX task neither the S4-Legs nor the S4D-Legs model is able to learn the task at λ = 0.0001. This high sensitivity of the model to the hyperparameter aligns with the numerical findings in Gu et al. (2023).

A.3 Additional experiment results for small SSMs

In this section, we include more experiment results for smaller S4-Legs and S4D-Legs models on the LRA benchmark. The best test accuracy and the running time for the small models are reported in Table 9. Details of the model sizes and hyperparameters are provided in Table 7 and Table 8, where the notation follows Table 3. The ablation studies on the regularization coefficient λ (without the initialization scheme (7)) for the small S4-Legs and S4D-Legs are given in Table 10 and Table 11.

From Table 9, comparing the test performance for w/o (7, 8) vs w (8) and w (7) vs w (7, 8), we see that the regularization scheme (8) improves the test performance on all tasks except the Retrieval task for S4-Legs. This is also confirmed by the ablation studies on the regularization coefficient λ in Table 10 and Table 11. Combining the initialization scheme (7) and the regularization method (8), more than half of the tasks achieve their best test accuracy, and for both S4-Legs and S4D-Legs, integrating the two methods yields the best average test accuracy across all 6 tasks in the LRA benchmark. Therefore, our methods also work for small SSMs, at only a little extra computational cost.

A.4 Comparisons with different regularization schemes

In this section, we add two additional regularization schemes for comparison.

  1. Filter norm regularization. We regularize the ℓ2 norm of the filter ρ_θ, i.e., when calculating the regularization measure τ(θ), we simply take μ(s) = 0 and K(s, s) = 1 to ignore the effects of the temporal structure of the data.

  2. Weight decay on the hidden matrix A. In the original S4(D) papers (Gu et al., 2022a; b; 2023), the default training methods do not apply weight decay to the hidden matrix A, and there is no known ablation study on the effect of weight decay on A. Here we add weight decay to compare with the proposed regularization schemes.

For the synthetic task, we follow the experiment settings in the main paper; the filter norm regularization results use the same training settings, and the weight decay results use the best weight decay coefficient among 10^{-3}, 10^{-2}, 10^{-1}, 10^{0}, 10^{1}. We report the test loss in Table 12. For the LRA benchmark, we also follow the same training setup as in the paper to compare the different regularization schemes on the S4-Legs model; the test accuracy for each task is shown in Table 13. On the synthetic data, our regularization scheme achieves the best performance among the compared schemes across different temporal structures. On the LRA benchmark, the proposed regularization scheme also achieves the best average accuracy across tasks. In particular, for the ListOps task, weight decay performs much worse than the other regularization methods.
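For completeness, the filter norm baseline from item 1 above can be obtained from the same code path as our regularizer by freezing the input statistics; a short sketch reusing the compute_tau helper from Section 4.2, with the same assumed layer interface:

```python
import torch

def filter_norm_penalty(layer):
    # Same as the proposed regularizer, but with mu(s) = 0 and K(s, s) = 1,
    # so the data's temporal structure no longer enters the measure.
    k = layer.ssm_kernel()                   # (L, d), assumed interface
    mu = torch.zeros_like(k)
    K_diag = torch.ones_like(k)
    return compute_tau(k, mu, K_diag)
```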

Appendix B Proof for the linear regression result in Section 3.2.

In this section, we give the proof for the generalization bound (2). The proof is based on the following uniform-convergence generalization bound in Mohri et al. (2012).

Lemma 1.

Consider a family of functions $\mathcal{F}$ mapping from $\mathcal{Z}$ to $[a,b]$. Let $\mathcal{D}$ denote the distribution according to which samples are drawn. Then for any $\delta>0$, with probability at least $1-\delta$ over the draw of an i.i.d. sample $S=\{z_{1},\dots,z_{n}\}$, the following holds for all $f\in\mathcal{F}$:

\[
\mathbb{E}_{z\sim\mathcal{D}}\left[f(z)\right]-\frac{1}{n}\sum_{i=1}^{n}f(z_{i})\leq 2\mathcal{R}_{S}(\mathcal{F})+3(b-a)\sqrt{\frac{\log(2/\delta)}{2n}},
\]

where $\mathcal{R}_{S}(\mathcal{F})$ is the empirical Rademacher complexity with respect to the sample $S$, defined as $\mathcal{R}_{S}(\mathcal{F})=\mathbb{E}_{\sigma}\left[\sup_{f\in\mathcal{F}}\frac{1}{n}\sum_{i=1}^{n}\sigma_{i}f(z_{i})\right]$, and $\{\sigma_{i}\}_{i=1}^{n}$ are i.i.d. random variables drawn uniformly from $\{-1,1\}$ with $P(\sigma_{i}=1)=P(\sigma_{i}=-1)=0.5$.

We also use Talagrand's contraction lemma (Ledoux and Talagrand, 2013).

Lemma 2.

Let $H$ be a hypothesis set of functions mapping $\mathcal{X}$ to $\mathbb{R}$ and $\Psi_{1},\ldots,\Psi_{m}$ be $\mu$-Lipschitz functions for some $\mu>0$. Then, for any sample $S$ of $m$ points $x_{1},\ldots,x_{m}\in\mathcal{X}$, the following inequality holds:

\[
\frac{1}{m}\mathbb{E}_{\sigma}\left[\sup_{h\in H}\sum_{i=1}^{m}\sigma_{i}\left(\Psi_{i}\circ h\right)\left(x_{i}\right)\right]\leq\frac{\mu}{m}\mathbb{E}_{\sigma}\left[\sup_{h\in H}\sum_{i=1}^{m}\sigma_{i}h\left(x_{i}\right)\right].
\]

Now we begin our proof:

Proof.

First, notice that for any $i\in[1:n]$ and $\theta\in\Theta$, we have

\[
(\theta^{\top}x_{i}-y_{i})^{2}\leq 2(\theta^{\top}x_{i})^{2}+2y_{i}^{2}\leq 2r^{2}R^{2}+2.
\]

Second, note that $(\theta^{\top}x_{i}-y_{i})^{2}$ is $2\sup_{\theta\in\Theta,i\in[1:n]}|\theta^{\top}x_{i}-y_{i}|$-Lipschitz (the maximum gradient norm) with respect to $\theta^{\top}x_{i}-y_{i}$, and we can bound this Lipschitz constant as

\[
2\sup_{\theta\in\Theta,i\in[1:n]}|\theta^{\top}x_{i}-y_{i}|\leq 2rR+2.
\]

Then by Lemma 2, the Rademacher complexity for the linear model is bounded as

\begin{align*}
\mathcal{R}_{S}(\mathcal{F})&=\frac{1}{n}\mathbb{E}_{\sigma}\left[\sup_{\|\theta\|_{2}\leq R}\sum_{i=1}^{n}\sigma_{i}(\theta^{\top}x_{i}-y_{i})^{2}\right]\\
&\leq\frac{2rR+2}{n}\mathbb{E}_{\sigma}\left[\sup_{\|\theta\|_{2}\leq R}\sum_{i=1}^{n}\sigma_{i}(\theta^{\top}x_{i}-y_{i})\right]\\
&=\frac{2rR+2}{n}\mathbb{E}_{\sigma}\left[\sup_{\|\theta\|_{2}\leq R}\sum_{i=1}^{n}\sigma_{i}\theta^{\top}x_{i}\right]\\
&\leq\frac{2R(rR+1)}{n}\mathbb{E}_{\sigma}\left\|\sum_{i=1}^{n}\sigma_{i}x_{i}\right\|\\
&\leq\frac{2R(rR+1)}{n}\sqrt{\mathbb{E}_{\sigma}\left\|\sum_{i=1}^{n}\sigma_{i}x_{i}\right\|^{2}}\\
&=\frac{2R(rR+1)}{n}\sqrt{\sum_{i=1}^{n}\|x_{i}\|^{2}}\\
&\leq\frac{2rR(rR+1)}{\sqrt{n}}.
\end{align*}

Combining with the function value bound, we get the desired bound (2) by Lemma 1. ∎
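The last steps of this chain (Jensen's inequality and $\sum_{i}\|x_{i}\|^{2}\leq nr^{2}$) are easy to check numerically. The following minimal sketch, with illustrative sizes $n$, $d$ and radius $r$ of our own choosing, estimates $\mathbb{E}_{\sigma}\left\|\sum_{i}\sigma_{i}x_{i}\right\|$ by Monte Carlo and compares it with the two upper bounds.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, r = 200, 10, 2.0                       # illustrative sample size, dimension, input radius

# inputs with ||x_i|| <= r
x = rng.normal(size=(n, d))
x *= r * rng.uniform(size=(n, 1)) / np.linalg.norm(x, axis=1, keepdims=True)

# Monte Carlo estimate of E_sigma || sum_i sigma_i x_i ||
sigma = rng.choice([-1.0, 1.0], size=(5000, n))
estimate = np.linalg.norm(sigma @ x, axis=1).mean()

print(estimate)                                          # E_sigma || sum_i sigma_i x_i ||
print(np.sqrt(np.sum(np.linalg.norm(x, axis=1) ** 2)))   # sqrt(sum_i ||x_i||^2)
print(r * np.sqrt(n))                                    # r * sqrt(n)
```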

Appendix C Detailed discussions of Assumption 1

In this section, we provide more discussion of Assumption 1 and give some concrete examples of stochastic processes that satisfy it. We first state the Kolmogorov continuity theorem in full.

Lemma 3 (Kolmogorov).

Let $\{X_{t}\}_{t\geq 0}$ be a real-valued stochastic process such that there exist positive constants $\alpha,\beta,C$ satisfying

\[
\mathbb{E}\left[\left|X_{t}-X_{s}\right|^{\alpha}\right]\leq C|t-s|^{1+\beta}
\]

for all $s,t\geq 0$. Then $X$ has a continuous modification which, with probability one, is locally $\gamma$-Hölder continuous for every $0<\gamma<\beta/\alpha$.

In the case of Brownian motion on $\mathbb{R}$, the choice of constants $\alpha=4,\beta=1,C=3$ works in the Kolmogorov continuity theorem, since $\mathbb{E}[|B_{t}-B_{s}|^{4}]=3|t-s|^{2}$. For Gaussian processes, the following theorem (Azmoodeh et al., 2014, Theorem 1) gives a necessary and sufficient condition for Hölder continuity.

Lemma 4.

A centered (mean zero) Gaussian process $X$ is Hölder continuous of any order $a<H$, i.e.,

\[
|X_{t}-X_{s}|\leq C_{\varepsilon}|t-s|^{H-\varepsilon},\quad\forall\varepsilon\in(0,H),
\]

if and only if there exist constants $c_{\varepsilon}$ such that

\[
\mathbb{E}\left[(X_{t}-X_{s})^{2}\right]\leq c_{\varepsilon}(t-s)^{2H-2\varepsilon},\quad\forall\varepsilon\in(0,H).
\]

For a stationary Gaussian process with covariance $K(s-t)$, the Hölder continuity (in expectation) condition above is equivalent to $1-K(s-t)/K(0)\leq c_{\alpha}(t-s)^{2\alpha}/2$ for any $\alpha\in(0,H)$, since the normalized process $\tilde{X}_{t}=(X_{t}-\mu(t))/\sqrt{K(0)}$ satisfies $\mathbb{E}[(\tilde{X}_{t}-\tilde{X}_{s})^{2}]=2\left(1-K(s-t)/K(0)\right)$. Combining these results, we see that any stationary Gaussian process with continuous mean $\mu(t)$, covariance $K(s-t)$, and $1-K(s-t)/K(0)\leq c_{\alpha}(t-s)^{2\alpha}/2$ for all $\alpha\in(0,H)$ satisfies the Hölder continuity condition in Assumption 1. As for the sub-Gaussian property, since the normalized Gaussian process $\tilde{X}_{t}$ is standard normal at each time $t$, any Gaussian process that satisfies the Hölder condition automatically satisfies the sub-Gaussian property in Assumption 1. Concrete examples include:

  • identical sequences: $x(t)=x$ for all $t\in[0,T]$, where $x\sim\mathcal{N}(0,1)$;

  • Gaussian white noise: $\mu(t)=0$, $K(s,t)=\frac{1}{|b|\sqrt{\pi}}e^{-((s-t)/b)^{2}}$ for some $b\neq 0$;

  • Ornstein-Uhlenbeck process: $\mu(t)=0$, $K(s,t)=e^{-|s-t|}$.
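As a numerical sanity check for the Ornstein-Uhlenbeck example, the following sketch (with an illustrative horizon $T$ and exponent $\varepsilon$ chosen for this sketch only) verifies the condition of Lemma 4 directly from the covariance: $\mathbb{E}[(X_{t}-X_{s})^{2}]=2(1-e^{-|t-s|})\leq c_{\varepsilon}|t-s|^{2H-2\varepsilon}$ on $[0,T]$ with $H=1/2$ and $c_{\varepsilon}=2T^{2\varepsilon}$.

```python
import numpy as np

T, H, eps = 10.0, 0.5, 0.1                 # illustrative horizon; eps in (0, H)
u = np.linspace(1e-6, T, 100_000)          # u = |t - s|

second_moment = 2.0 * (1.0 - np.exp(-u))   # E[(X_t - X_s)^2] for the OU covariance e^{-|s-t|}
c_eps = 2.0 * T ** (2 * eps)
bound = c_eps * u ** (2 * H - 2 * eps)

assert np.all(second_moment <= bound + 1e-12)   # condition of Lemma 4 holds on [0, T]
print(np.max(second_moment / bound))            # worst-case ratio, at most 1
```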

Relaxations of Assumption 1. In fact, Assumption 1 is used to bound the two key terms (13) in the proof of Theorem 1. In particular, the sub-Gaussian property in Assumption 1 guarantees that the input random process is bounded on a finite set of times with high probability. The Hölder condition then extends this boundedness to the whole interval $t\in[0,T]$. Thus, if the input random process takes values in a finite subset of $\mathbb{R}$, then the Hölder condition can be removed. For example, in computer vision tasks where the input image is flattened into a sequence, each pixel value ranges over a finite set (for an MNIST image, each pixel value is an integer between 0 and 255). In that case, the Hölder continuity condition in Assumption 1 can be dropped.

Appendix D Derivations for (4) and (5) in Section 4.1

For the left zero padding transformation, the key term in (3) becomes

\begin{align*}
&\int_{0}^{2T}\left|\rho_{\theta}(2T-t)\right|\sqrt{K_{1}(t,t)}\,dt+\left|\int_{0}^{2T}\rho_{\theta}(2T-t)\mu_{1}(t)\,dt\right|+1\\
=\;&\int_{0}^{T}\left|\rho_{\theta}(T-t)\right|\sqrt{K(t,t)}\,dt+\left|\int_{0}^{T}\rho_{\theta}(T-t)\mu(t)\,dt\right|+1.
\end{align*}

For the right zero padding transformation, the key term in (3) becomes

\begin{align*}
&\int_{0}^{2T}\left|\rho_{\theta}(2T-t)\right|\sqrt{K_{2}(t,t)}\,dt+\left|\int_{0}^{2T}\rho_{\theta}(2T-t)\mu_{2}(t)\,dt\right|+1\\
=\;&\int_{0}^{T}\left|\rho_{\theta}(2T-t)\right|\sqrt{K(t,t)}\,dt+\left|\int_{0}^{T}\rho_{\theta}(2T-t)\mu(t)\,dt\right|+1\\
=\;&\int_{0}^{T}\left|Ce^{AT}e^{A(T-t)}B\right|\sqrt{K(t,t)}\,dt+\left|\int_{0}^{T}Ce^{AT}e^{A(T-t)}B\mu(t)\,dt\right|+1.
\end{align*}

Then we get (4) and (5).
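To make the contrast between (4) and (5) concrete, the sketch below evaluates the two key terms for an illustrative scalar filter $\rho_{\theta}(t)=Ce^{At}B$ with a standardized input ($\mu\equiv 0$, $K(t,t)\equiv 1$); all parameter values are assumptions for illustration only. The right-padded term carries the extra damping factor $e^{AT}$, so its integral part shrinks as the state matrix becomes more stable.

```python
import numpy as np

A, B, C, T = -2.0, 1.0, 1.0, 1.0           # illustrative scalar SSM parameters, A < 0 (stable)
t = np.linspace(0.0, T, 2000)
dt = t[1] - t[0]

def key_term(weights):
    # int_0^T |weights(t)| sqrt(K(t,t)) dt + |int_0^T weights(t) mu(t) dt| + 1,
    # with mu(t) = 0 and K(t,t) = 1 (standardized input)
    return np.sum(np.abs(weights)) * dt + 1.0

rho_left = C * np.exp(A * (T - t)) * B                   # left padding keeps rho_theta(T - t), cf. (4)
rho_right = C * np.exp(A * T) * np.exp(A * (T - t)) * B  # right padding gives C e^{AT} e^{A(T-t)} B, cf. (5)

print(key_term(rho_left), key_term(rho_right))           # the right-padded term is damped by e^{AT}
```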

Appendix E Proof for Theorem 1

In this section, we will prove Theorem 1. Before moving into the formal proof, we first introduce some useful lemmas that help to build the proof.

The first lemma is Massart's lemma, which bounds the Rademacher complexity of a finite class.

Lemma 5 (Massart).

Let $\mathcal{A}$ be some finite subset of $\mathbb{R}^{m}$ and $\sigma_{1},\ldots,\sigma_{m}$ be independent Rademacher random variables. Let $r=\sup_{a\in\mathcal{A}}\|a\|$. Then we have

\[
\mathbb{E}_{\sigma}\left[\sup_{a\in\mathcal{A}}\sum_{i=1}^{m}\sigma_{i}a_{i}\right]\leq r\sqrt{2\log|\mathcal{A}|}.
\]
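A quick Monte Carlo check of Massart's lemma is given below; the finite set $\mathcal{A}$, its size, and the number of Rademacher draws are illustrative assumptions, and the empirical estimate of the left-hand side stays below $r\sqrt{2\log|\mathcal{A}|}$.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n_vectors = 50, 20                          # illustrative dimension and |A|
A = rng.normal(size=(n_vectors, m))            # finite subset of R^m, one vector per row
r = np.linalg.norm(A, axis=1).max()

sigma = rng.choice([-1.0, 1.0], size=(20_000, m))
lhs = (sigma @ A.T).max(axis=1).mean()         # Monte Carlo estimate of E_sigma sup_a <sigma, a>
rhs = r * np.sqrt(2.0 * np.log(n_vectors))

print(lhs, rhs)                                # the estimate stays below r * sqrt(2 log |A|)
```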

The second lemma is to bound the supremum of a stochastic process that is Hölder continuous and sub-Gaussian.

Lemma 6 (Hölder maximal inequality).

Suppose $\{X_{t}\}_{t\in[0,T]}$ is a centered Hölder-continuous process, i.e., there exist $L,H>0$ such that $|X_{s}-X_{t}|\leq L|s-t|^{H}$ for all $s,t\in[0,T]$. Suppose further that $X_{t}$ is $\sigma^{2}$-sub-Gaussian for every $t\in[0,T]$, i.e., $P\left(|X_{t}|\geq u\right)\leq 2\exp(-u^{2}/2\sigma^{2})$ for all $u>0$ and some $\sigma>0$. Then with probability at least $1-\delta$,

\[
\sup_{t\in[0,T]}|X_{t}|\leq L+\sigma\sqrt{2\log(2T/\delta)}.
\]
Proof.

The proof is based on the $\varepsilon$-net and covering number argument. We first discretize the time interval $[0,T]$ into $N$ parts $[0,T/N]\cup[T/N,2T/N]\cup\cdots\cup[(N-1)T/N,T]$. Then for any time $t\in[0,T]$, there exists a $\pi(t)\in\{0,T/N,\ldots,(N-1)T/N\}$ such that $|t-\pi(t)|\leq T/N$. Therefore, by Hölder continuity, we have

\[
\sup_{t\in[0,T]}|X_{t}|\leq\sup_{t\in[0,T]}|X_{t}-X_{\pi(t)}|+\sup_{t\in[0,T]}|X_{\pi(t)}|\leq L\left(\frac{T}{N}\right)^{H}+\max_{i\in[0:N-1]}|X_{iT/N}|.
\]

Since $X_{t}$ is sub-Gaussian for every time $t\in[0,T]$, for each $i\in[0:N-1]$, letting $u=\sigma\sqrt{2\log(2N/\delta)}$ gives, with probability at least $1-\delta/N$,

\[
|X_{iT/N}|\leq\sigma\sqrt{2\log(2N/\delta)}.
\]

Taking the union bound over all $i\in[0:N-1]$, we have with probability at least $1-\delta$,

\[
\max_{i\in[0:N-1]}|X_{iT/N}|\leq\sigma\sqrt{2\log(2N/\delta)}.
\]

Hence,

\[
\sup_{t\in[0,T]}|X_{t}|\leq L\left(\frac{T}{N}\right)^{H}+\sigma\sqrt{2\log(2N/\delta)}
\]

holds for every $N$. Taking $N=[T]+1$, so that $T/N\leq 1$, we get

\[
\sup_{t\in[0,T]}|X_{t}|\leq L+\sigma\sqrt{2\log(2T/\delta)}.
\]
∎
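As a numerical illustration of the union-bound step in this proof, the sketch below uses i.i.d. standard normal grid values as an illustrative $1$-sub-Gaussian example (the grid size, $\delta$, and the number of runs are assumptions of the sketch) and checks that $\max_{i}|X_{iT/N}|$ exceeds $\sigma\sqrt{2\log(2N/\delta)}$ on at most roughly a $\delta$ fraction of the runs.

```python
import numpy as np

rng = np.random.default_rng(2)
N, delta, sigma, n_runs = 100, 0.05, 1.0, 20_000   # illustrative grid size, failure level, runs

threshold = sigma * np.sqrt(2.0 * np.log(2.0 * N / delta))
X = sigma * rng.normal(size=(n_runs, N))           # grid values X_{iT/N}, one row per run
failure_rate = np.mean(np.abs(X).max(axis=1) > threshold)

print(failure_rate, delta)                         # empirical failure probability vs the target delta
```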

Now we are ready to prove the main result, Theorem 1.

Proof.

We let $g_{\theta}(x):=\int_{0}^{T}\rho_{\theta}(T-t)x(t)\,dt-y$; then the generalization gap is given by

\[
R_{x}(\theta)-R_{n}(\theta)=\mathbb{E}_{x}[g_{\theta}^{2}(x)]-\frac{g_{\theta}^{2}(x_{1})+\ldots+g_{\theta}^{2}(x_{n})}{n}.
\]

Now let the hypothesis space be $\mathcal{F}=\{x\mapsto g_{\theta}^{2}(x):\theta\in\Theta\}$; then its empirical Rademacher complexity is given by

\begin{align*}
\mathcal{R}_{S}(\mathcal{F})&=\mathbb{E}_{\sigma}\left[\sup_{\theta\in\Theta}\frac{1}{n}\sum_{i=1}^{n}\sigma_{i}g_{\theta}^{2}(x_{i})\right]\\
&=\frac{1}{n}\mathbb{E}_{\sigma}\left[\sup_{\theta\in\Theta}\sum_{i=1}^{n}\sigma_{i}\left|\int_{0}^{T}\rho_{\theta}(T-t)x_{i}(t)\,dt-y_{i}\right|^{2}\right].
\end{align*}

By Talagrand's contraction lemma (Lemma 2), since $g_{\theta}^{2}(x_{i})$ is $2\sup_{\theta\in\Theta,i\in[1:n]}|g_{\theta}(x_{i})|$-Lipschitz in $g_{\theta}(x_{i})$, we have

\begin{align*}
\mathcal{R}_{S}(\mathcal{F})&\leq 2\sup_{\theta\in\Theta,i\in[1:n]}|g_{\theta}(x_{i})|\cdot\frac{1}{n}\mathbb{E}_{\sigma}\left[\sup_{\theta\in\Theta}\sum_{i=1}^{n}\sigma_{i}\left(\int_{0}^{T}\rho_{\theta}(T-t)x_{i}(t)\,dt-y_{i}\right)\right]\\
&=\frac{2\sup_{\theta\in\Theta,i\in[1:n]}|g_{\theta}(x_{i})|}{n}\mathbb{E}_{\sigma}\left[\sup_{\theta\in\Theta}\int_{0}^{T}\rho_{\theta}(T-t)\sum_{i=1}^{n}\sigma_{i}x_{i}(t)\,dt\right].
\end{align*}

Now we separate the expectation into two parts, the unbiased part involving $x_{i}(t)-\mu(t)$ and the biased part involving $\mu(t)$, by noticing that

\begin{align*}
&\mathbb{E}_{\sigma}\left[\sup_{\theta\in\Theta}\int_{0}^{T}\rho_{\theta}(T-t)\sum_{i=1}^{n}\sigma_{i}x_{i}(t)\,dt\right]\\
=\;&\mathbb{E}_{\sigma}\left[\sup_{\theta\in\Theta}\int_{0}^{T}\rho_{\theta}(T-t)\sum_{i=1}^{n}\sigma_{i}(x_{i}(t)-\mu(t))\,dt+\int_{0}^{T}\rho_{\theta}(T-t)\sum_{i=1}^{n}\sigma_{i}\mu(t)\,dt\right]\\
\leq\;&\mathbb{E}_{\sigma}\left[\sup_{\theta\in\Theta}\int_{0}^{T}\rho_{\theta}(T-t)\sum_{i=1}^{n}\sigma_{i}(x_{i}(t)-\mu(t))\,dt\right]+\mathbb{E}_{\sigma}\left[\sup_{\theta\in\Theta}\int_{0}^{T}\rho_{\theta}(T-t)\sum_{i=1}^{n}\sigma_{i}\mu(t)\,dt\right].
\end{align*}

For the unbiased part, by Hölder's inequality, for any $p,q\in[1,\infty]$ such that $\frac{1}{p}+\frac{1}{q}=1$,

\begin{align}
&\mathbb{E}_{\sigma}\left[\sup_{\theta\in\Theta}\int_{0}^{T}\rho_{\theta}(T-t)\sum_{i=1}^{n}\sigma_{i}(x_{i}(t)-\mu(t))\,dt\right]\tag{9}\\
\leq\;&\sup_{\theta\in\Theta}\left(\int_{0}^{T}\left|\rho_{\theta}(T-t)\right|^{p}K^{p/2}(t,t)\,dt\right)^{1/p}\mathbb{E}_{\sigma}\left[\left(\int_{0}^{T}\left|\sum_{i=1}^{n}\sigma_{i}\frac{x_{i}(t)-\mu(t)}{\sqrt{K(t,t)}}\right|^{q}dt\right)^{1/q}\right].\nonumber
\end{align}

For the biased part,

\begin{align}
\mathbb{E}_{\sigma}\left[\sup_{\theta\in\Theta}\int_{0}^{T}\rho_{\theta}(T-t)\sum_{i=1}^{n}\sigma_{i}\mu(t)\,dt\right]&\leq\sup_{\theta\in\Theta}\left|\int_{0}^{T}\rho_{\theta}(T-t)\mu(t)\,dt\right|\mathbb{E}_{\sigma}\left[\left|\sum_{i=1}^{n}\sigma_{i}\right|\right]\tag{10}\\
&\leq\sup_{\theta\in\Theta}\left|\int_{0}^{T}\rho_{\theta}(T-t)\mu(t)\,dt\right|\sqrt{\mathbb{E}_{\sigma}\left[\left|\sum_{i=1}^{n}\sigma_{i}\right|^{2}\right]}\nonumber\\
&=\sqrt{n}\,\sup_{\theta\in\Theta}\left|\int_{0}^{T}\rho_{\theta}(T-t)\mu(t)\,dt\right|.\nonumber
\end{align}

Now for the unbiased part (9), we take $p=1,q=\infty$. Then we have

\begin{align}
&\mathbb{E}_{\sigma}\left[\sup_{\theta\in\Theta}\int_{0}^{T}\rho_{\theta}(T-t)\sum_{i=1}^{n}\sigma_{i}(x_{i}(t)-\mu(t))\,dt\right]\tag{11}\\
\leq\;&\sup_{\theta\in\Theta}\left(\int_{0}^{T}\left|\rho_{\theta}(T-t)\right|\sqrt{K(t,t)}\,dt\right)\mathbb{E}_{\sigma}\left[\sup_{t\in[0,T]}\left|\sum_{i=1}^{n}\sigma_{i}\frac{x_{i}(t)-\mu(t)}{\sqrt{K(t,t)}}\right|\right].\nonumber
\end{align}

Also by the same argument, note that

\begin{align}
&\sup_{\theta\in\Theta,i\in[1:n]}|g_{\theta}(x_{i})|\tag{12}\\
=\;&\sup_{\theta\in\Theta,i\in[1:n]}\left|\int_{0}^{T}\rho_{\theta}(T-t)x_{i}(t)\,dt-y_{i}\right|\nonumber\\
\leq\;&\sup_{\theta\in\Theta,i\in[1:n]}\left|\int_{0}^{T}\rho_{\theta}(T-t)(x_{i}(t)-\mu(t))\,dt\right|+\sup_{\theta\in\Theta}\left|\int_{0}^{T}\rho_{\theta}(T-t)\mu(t)\,dt\right|+1\nonumber\\
\leq\;&\sup_{\theta\in\Theta}\left(\int_{0}^{T}\left|\rho_{\theta}(T-t)\right|\sqrt{K(t,t)}\,dt\right)\sup_{i\in[1:n],t\in[0,T]}\left|\frac{x_{i}(t)-\mu(t)}{\sqrt{K(t,t)}}\right|+\sup_{\theta\in\Theta}\left|\int_{0}^{T}\Re(\rho_{\theta}(T-t))\mu(t)\,dt\right|+1.\nonumber
\end{align}

Thus, there are two terms that we need to bound:

\[
\sup_{i\in[1:n],t\in[0,T]}\left|\frac{x_{i}(t)-\mu(t)}{\sqrt{K(t,t)}}\right|,\qquad\mathbb{E}_{\sigma}\left[\sup_{t\in[0,T]}\left|\sum_{i=1}^{n}\sigma_{i}\frac{x_{i}(t)-\mu(t)}{\sqrt{K(t,t)}}\right|\right].\tag{13}
\]

For the first term, notice that the normalized Gaussian process $\frac{x_{i}(t)-\mu(t)}{\sqrt{K(t,t)}}$ is centered. By Assumption 1, it is Hölder continuous and $\sigma^{2}$-sub-Gaussian on $t\in[0,T]$. Therefore, we can directly apply Lemma 6 and get, for each $i=1,\ldots,n$, with probability at least $1-\delta/(3n)$,

\[
\sup_{t\in[0,T]}\left|\frac{x_{i}(t)-\mu(t)}{\sqrt{K(t,t)}}\right|\leq L+\sigma\sqrt{2\log(6Tn/\delta)}.
\]

Now, taking a union bound over $i=1,\ldots,n$, we get, with probability at least $1-\delta/3$,

\[
\sup_{i\in[1:n],\,t\in[0,T]}\left|\frac{x_{i}(t)-\mu(t)}{\sqrt{K(t,t)}}\right|\leq L+\sigma\sqrt{2\log(6Tn/\delta)}. \tag{14}
\]
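
As a sanity check on the scale of (14), the following is a minimal Monte Carlo sketch (not an experiment from the paper): it uses $n$ i.i.d. paths of a stationary Ornstein–Uhlenbeck process with unit marginal variance, so the normalized process coincides with the process itself; the constants $n$, $T$, $\delta$ and the placeholders $L=\sigma=1$ are all hypothetical.

```python
import numpy as np

# Minimal Monte Carlo sketch of the uniform bound (14), using a stationary OU process
# whose marginal variance is 1 (so the normalized process equals the process itself).
# All constants (n, T, dt, delta, and the placeholders sigma, L) are hypothetical.
rng = np.random.default_rng(0)
n, T, dt = 50, 10.0, 0.01
delta, sigma, L = 0.1, 1.0, 1.0
m = int(T / dt)

# Euler-Maruyama for dX = -X dt + sqrt(2) dW, started from the stationary law N(0, 1).
X = np.zeros((n, m))
X[:, 0] = rng.standard_normal(n)
for k in range(1, m):
    X[:, k] = (1.0 - dt) * X[:, k - 1] + np.sqrt(2.0 * dt) * rng.standard_normal(n)

empirical_sup = np.abs(X).max()                              # sup over paths and grid points
bound_14 = L + sigma * np.sqrt(2.0 * np.log(6.0 * T * n / delta))
print(f"empirical sup = {empirical_sup:.2f}, right-hand side of (14) = {bound_14:.2f}")
```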

For the second term, we apply the $\varepsilon$-net and covering number argument as in Lemma 6. We discretize the time interval $[0,T]$ into $N$ parts $[0,T/N]\cup[T/N,2T/N]\cup\cdots\cup[(N-1)T/N,T]$, so that every $t\in[0,T]$ lies in some sub-interval $[(k-1)T/N,kT/N]$ with $k\in[1:N]$. Then, for such $t$ and $k$, by the Hölder continuity of the normalized process in Assumption 1, we have

\begin{align*}
\left|\sum_{i=1}^{n}\sigma_{i}\frac{x_{i}(t)-\mu(t)}{\sqrt{K(t,t)}}\right|
&\leq\left|\sum_{i=1}^{n}\sigma_{i}\frac{x_{i}\left(\frac{(k-1)T}{N}\right)-\mu\left(\frac{(k-1)T}{N}\right)}{\sqrt{K\left(\frac{(k-1)T}{N},\frac{(k-1)T}{N}\right)}}\right|+\left|\sum_{i=1}^{n}\sigma_{i}\left(\frac{x_{i}\left(\frac{(k-1)T}{N}\right)-\mu\left(\frac{(k-1)T}{N}\right)}{\sqrt{K\left(\frac{(k-1)T}{N},\frac{(k-1)T}{N}\right)}}-\frac{x_{i}(t)-\mu(t)}{\sqrt{K(t,t)}}\right)\right|\\
&\leq\max_{k=1,\ldots,N}\left|\sum_{i=1}^{n}\sigma_{i}\frac{x_{i}\left(\frac{(k-1)T}{N}\right)-\mu\left(\frac{(k-1)T}{N}\right)}{\sqrt{K\left(\frac{(k-1)T}{N},\frac{(k-1)T}{N}\right)}}\right|+\|\sigma\|\sqrt{n}\,L\left(\frac{T}{N}\right)^{H}\\
&=\max_{k=1,\ldots,N}\left|\sum_{i=1}^{n}\sigma_{i}\frac{x_{i}\left(\frac{(k-1)T}{N}\right)-\mu\left(\frac{(k-1)T}{N}\right)}{\sqrt{K\left(\frac{(k-1)T}{N},\frac{(k-1)T}{N}\right)}}\right|+n\,L\left(\frac{T}{N}\right)^{H}.
\end{align*}

Then, by the Massart lemma (Lemma 5) and the sup-norm bound (14), with probability at least $1-\delta/3$,

\begin{align*}
\mathbb{E}_{\sigma}\left[\sup_{t\in[0,T]}\left|\sum_{i=1}^{n}\sigma_{i}\frac{x_{i}(t)-\mu(t)}{\sqrt{K(t,t)}}\right|\right]
&\leq\mathbb{E}_{\sigma}\left[\max_{k=1,\ldots,N}\left|\sum_{i=1}^{n}\sigma_{i}\frac{x_{i}\left(\frac{(k-1)T}{N}\right)-\mu\left(\frac{(k-1)T}{N}\right)}{\sqrt{K\left(\frac{(k-1)T}{N},\frac{(k-1)T}{N}\right)}}\right|\right]+n\,L\left(\frac{T}{N}\right)^{H}\\
&\leq\sqrt{2n\log N}\cdot\sup_{i\in[1:n],\,t\in[0,T]}\left|\frac{x_{i}(t)-\mu(t)}{\sqrt{K(t,t)}}\right|+n\,L\left(\frac{T}{N}\right)^{H}\\
&\leq\sqrt{2n\log N}\left(L+\sigma\sqrt{2\log(6Tn/\delta)}\right)+n\,L\left(\frac{T}{N}\right)^{H}.
\end{align*}

Since $N$ is an arbitrary positive integer, we take $N=[Tn^{1/H}]+1$ and obtain

\begin{align*}
\mathbb{E}_{\sigma}\left[\sup_{t\in[0,T]}\left|\sum_{i=1}^{n}\sigma_{i}\frac{x_{i}(t)-\mu(t)}{\sqrt{K(t,t)}}\right|\right]
&\leq\mathcal{O}\left(\sqrt{n\cdot\log N\cdot\log(Tn/\delta)}\right) \tag{15}\\
&\leq\mathcal{O}\left(\sqrt{n}\log(NTn/\delta)\right)\\
&=\mathcal{O}\left(\sqrt{n}\log(Tn/\delta)\right).
\end{align*}
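
To see why this choice of $N$ is convenient, note that $(T/N)^{H}\leq n^{-1}$, so the remainder $nL(T/N)^{H}\leq L$ stays of constant order while the Massart term grows only like $\sqrt{n\log N}$. A small numeric sketch with hypothetical constants:

```python
import numpy as np

# Numeric illustration of the choice N = [T n^(1/H)] + 1 above; all constants are hypothetical.
n, T, L, H = 10_000, 10.0, 1.0, 0.5
N = int(T * n ** (1.0 / H)) + 1

remainder = n * L * (T / N) ** H          # <= L, because (T/N)^H <= 1/n
massart = np.sqrt(2.0 * n * np.log(N))    # the dominating sqrt(2 n log N) factor
print(f"N = {N}, n*L*(T/N)^H = {remainder:.3f}, sqrt(2 n log N) = {massart:.1f}")
```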

Combining (14), (15), (10) and (11), we can further bound (12) as

\[
\sup_{\theta\in\Theta,\,i\in[1:n]}|g_{\theta}(x_{i})|\leq\sup_{\theta\in\Theta}\left(\int_{0}^{T}\left|\rho_{\theta}(T-t)\right|\sqrt{K(t,t)}\,dt\right)\cdot\mathcal{O}\left(\sqrt{\log(Tn/\delta)}\right)+\sup_{\theta\in\Theta}\left|\int_{0}^{T}\Re\!\left(\rho_{\theta}(T-t)\right)\mu(t)\,dt\right|+1. \tag{16}
\]

The Rademacher complexity is then further bounded as

\begin{align*}
\mathcal{R}_{S}(\mathcal{F})
&\leq\frac{2\sup_{\theta\in\Theta,\,i\in[1:n]}|g_{\theta}(x_{i})|}{n}\,\mathbb{E}_{\sigma}\left[\sup_{\theta\in\Theta}\int_{0}^{T}\rho_{\theta}(T-t)\sum_{i=1}^{n}\sigma_{i}x_{i}(t)\,dt\right]\\
&\leq\frac{2\sup_{\theta\in\Theta,\,i\in[1:n]}|g_{\theta}(x_{i})|}{n}\left(\sup_{\theta\in\Theta}\int_{0}^{T}\left|\rho_{\theta}(T-t)\right|\sqrt{K(t,t)}\,dt+\sup_{\theta\in\Theta}\left|\int_{0}^{T}\rho_{\theta}(T-t)\mu(t)\,dt\right|\right)\cdot\mathcal{O}\left(\sqrt{n}\log(Tn/\delta)\right)\\
&\leq\left(\sup_{\theta\in\Theta}\int_{0}^{T}\left|\rho_{\theta}(T-t)\right|\sqrt{K(t,t)}\,dt+\sup_{\theta\in\Theta}\left|\int_{0}^{T}\rho_{\theta}(T-t)\mu(t)\,dt\right|+1\right)^{2}\cdot\mathcal{O}\left(\frac{\log^{3/2}(Tn/\delta)}{\sqrt{n}}\right).
\end{align*}

Finally, by symmetrization of $R_{x}(\theta)-R_{n}(\theta)$, combined with (16) and (1), we have, with probability at least $1-\delta$,

\[
\sup_{\theta\in\Theta}\left|R_{x}(\theta)-R_{n}(\theta)\right|\leq\left(\sup_{\theta\in\Theta}\int_{0}^{T}\left|\rho_{\theta}(T-t)\right|\sqrt{K(t,t)}\,dt+\sup_{\theta\in\Theta}\left|\int_{0}^{T}\rho_{\theta}(T-t)\mu(t)\,dt\right|+1\right)^{2}\cdot\mathcal{O}\left(\frac{\log^{3/2}(Tn/\delta)}{\sqrt{n}}\right).
\]
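
As a side note on how the two $\theta$-dependent factors in this bound can be evaluated in practice, the following is a hedged numerical sketch. It assumes a diagonal SSM whose kernel has the form $\rho_{\theta}(s)=\Re\big(\sum_{j}c_{j}e^{a_{j}s}b_{j}\big)$ and uses placeholder choices for $\mu$, $K(t,t)$, and the parameter scales; it is not the paper's implementation.

```python
import numpy as np

# Hedged sketch: estimate the two data-dependent factors in the bound above by quadrature,
# assuming a diagonal SSM kernel rho_theta(s) = Re(sum_j c_j exp(a_j s) b_j).
# The state dimension, mu(t), and K(t, t) below are hypothetical placeholders.
rng = np.random.default_rng(0)
T, m, d = 10.0, 2000, 16
t = np.linspace(0.0, T, m)
a = -rng.uniform(0.1, 1.0, d) + 1j * rng.uniform(0.0, 5.0, d)   # stable diagonal entries
b = rng.standard_normal(d) / np.sqrt(d)
c = rng.standard_normal(d) / np.sqrt(d)

rho = np.real((c * b) @ np.exp(np.outer(a, T - t)))   # rho_theta(T - t) on the time grid
mu = np.sin(t)                                        # assumed mean function of the data
K_diag = np.ones_like(t)                              # assumed marginal variance K(t, t)

dt = t[1] - t[0]
term1 = np.sum(np.abs(rho) * np.sqrt(K_diag)) * dt    # int_0^T |rho_theta(T-t)| sqrt(K(t,t)) dt
term2 = abs(np.sum(rho * mu) * dt)                    # |int_0^T rho_theta(T-t) mu(t) dt|
print(f"factor 1 = {term1:.3f}, factor 2 = {term2:.3f}")
```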

Appendix F Proof of Proposition 1

Proof.

First, notice that by Hölder's inequality with $p=1,\ q=\infty$, we have

\begin{align*}
\mathbb{E}_{x}\left[\left|\int_{0}^{T}\rho_{\tilde{\theta}}(T-t)x(t)\,dt\right|\right]
&=\frac{\mathbb{E}_{x}\left[\left|\int_{0}^{T}\rho_{\theta}(T-t)x(t)\,dt\right|\right]}{\int_{0}^{T}\left|\rho_{\theta}(T-t)\right|\sqrt{K(t,t)}\,dt+\left|\int_{0}^{T}\rho_{\theta}(T-t)\mu(t)\,dt\right|}\\
&\leq\frac{\mathbb{E}_{x}\left[\left|\int_{0}^{T}\rho_{\theta}(T-t)(x(t)-\mu(t))\,dt\right|\right]+\left|\int_{0}^{T}\rho_{\theta}(T-t)\mu(t)\,dt\right|}{\int_{0}^{T}\left|\rho_{\theta}(T-t)\right|\sqrt{K(t,t)}\,dt+\left|\int_{0}^{T}\rho_{\theta}(T-t)\mu(t)\,dt\right|}\\
&\leq\frac{\int_{0}^{T}\left|\rho_{\theta}(T-t)\right|\sqrt{K(t,t)}\,dt\cdot\mathbb{E}_{x}\left[\sup_{t\in[0,T]}\left|\frac{x(t)-\mu(t)}{\sqrt{K(t,t)}}\right|\right]+\left|\int_{0}^{T}\rho_{\theta}(T-t)\mu(t)\,dt\right|}{\int_{0}^{T}\left|\rho_{\theta}(T-t)\right|\sqrt{K(t,t)}\,dt+\left|\int_{0}^{T}\rho_{\theta}(T-t)\mu(t)\,dt\right|}\\
&\leq\mathbb{E}_{x}\left[\sup_{t\in[0,T]}\left|\frac{x(t)-\mu(t)}{\sqrt{K(t,t)}}\right|\right]+1.
\end{align*}

We let $X_{t}:=\frac{x(t)-\mu(t)}{\sqrt{K(t,t)}}$; then by Assumption 1, $X_{t}$ is Hölder continuous and $\sigma^{2}$-sub-Gaussian for any $t\in[0,T]$. Again, we use an $\varepsilon$-net argument to bound $\mathbb{E}\left[\sup_{t\in[0,T]}|X_{t}|\right]$. We separate the time interval $[0,T]$ into $N$ parts $[0,T/N]\cup[T/N,2T/N]\cup\cdots\cup[(N-1)T/N,T]$; then for any time $t\in[0,T]$, there exists $\pi(t)\in\{0,T/N,\ldots,(N-1)T/N\}$ such that $|t-\pi(t)|\leq T/N$. Therefore, by Hölder continuity,

\begin{align*}
\mathbb{E}\left[\sup_{t\in[0,T]}|X_{t}|\right]
&\leq\mathbb{E}\left[\sup_{t\in[0,T]}|X_{t}-X_{\pi(t)}|\right]+\mathbb{E}\left[\sup_{t\in[0,T]}|X_{\pi(t)}|\right]\\
&\leq L\left(\frac{T}{N}\right)^{H}+\mathbb{E}\left[\max_{i\in[0:N-1]}|X_{iT/N}|\right].
\end{align*}

For the maximum over a finite class, notice that for any $u_{0}>0$,

\begin{align*}
\mathbb{E}\left[\max_{i\in[0:N-1]}|X_{iT/N}|\right]
&=\int_{0}^{\infty}P\left(\max_{i\in[0:N-1]}|X_{iT/N}|\geq u\right)du\\
&=\int_{0}^{u_{0}}P\left(\max_{i\in[0:N-1]}|X_{iT/N}|\geq u\right)du+\int_{u_{0}}^{\infty}P\left(\max_{i\in[0:N-1]}|X_{iT/N}|\geq u\right)du\\
&\leq u_{0}+\int_{u_{0}}^{\infty}\sum_{i=0}^{N-1}P\left(|X_{iT/N}|\geq u\right)du.
\end{align*}

Since $X_{iT/N}$ is $\sigma^{2}$-sub-Gaussian for every $i\in[0:N-1]$, for all $u_{0}>0$ we have

\begin{align*}
\mathbb{E}\left[\max_{i\in[0:N-1]}|X_{iT/N}|\right]
&\leq u_{0}+2N\int_{u_{0}}^{\infty}\exp\left(-\frac{u^{2}}{2\sigma^{2}}\right)du\\
&\leq u_{0}+2N\int_{u_{0}}^{\infty}\frac{u}{u_{0}}\exp\left(-\frac{u^{2}}{2\sigma^{2}}\right)du\\
&=u_{0}+\frac{2N\sigma^{2}}{u_{0}}\exp\left(-\frac{u_{0}^{2}}{2\sigma^{2}}\right).
\end{align*}

Rather than exactly minimizing the right-hand side over $u_{0}>0$, we simply take $u_{0}=\sigma\sqrt{2\log 2N}$, which gives

\[
\mathbb{E}\left[\max_{i\in[0:N-1]}|X_{iT/N}|\right]\leq\sigma\sqrt{2\log 2N}+\frac{\sigma}{\sqrt{2\log 2N}}\leq 2\sigma\sqrt{2\log 2N}.
\]
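
As a quick sanity check of this finite-maximum bound, here is a minimal Monte Carlo sketch, assuming the $X_{iT/N}$ are i.i.d. $\mathcal{N}(0,\sigma^{2})$ (the i.i.d. case already exhibits the $\sqrt{\log N}$ scale); $\sigma$, $N$, and the number of trials are hypothetical.

```python
import numpy as np

# Minimal Monte Carlo sketch of E[max_i |X_i|] <= 2*sigma*sqrt(2*log(2N)), assuming
# i.i.d. N(0, sigma^2) variables; sigma, N, and the number of trials are hypothetical.
rng = np.random.default_rng(0)
sigma, N, trials = 1.0, 1000, 2000
samples = sigma * rng.standard_normal((trials, N))
empirical = np.abs(samples).max(axis=1).mean()        # Monte Carlo estimate of E[max_i |X_i|]
bound = 2.0 * sigma * np.sqrt(2.0 * np.log(2.0 * N))
print(f"E[max_i |X_i|] ~ {empirical:.2f} <= {bound:.2f}")
```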

Plugging this back into the earlier bound, we get

\[
\mathbb{E}\left[\sup_{t\in[0,T]}|X_{t}|\right]\leq L\left(\frac{T}{N}\right)^{H}+2\sigma\sqrt{2\log 2N}.
\]

Since $N$ is an arbitrary positive integer, we simply take $N=[T]+1$ and finally obtain

\[
\mathbb{E}\left[\sup_{t\in[0,T]}|X_{t}|\right]\leq L+2\sigma\sqrt{2\log(2T+2)}=\mathcal{O}(\sqrt{\log T}).
\]

Appendix G Lipschitz functions of sub-Gaussian random variables

In this section, we collect some known examples of sub-Gaussian random variables that retain the sub-Gaussian property under a Lipschitz map.

  1. For a bounded random variable $X\in[0,1]$, let $f:[0,1]\rightarrow\mathbb{R}$ be quasi-convex, i.e., $\{x:f(x)\leq s\}$ is a convex set for all $s\in\mathbb{R}$. If $f$ is further Lipschitz on $[0,1]$, then $f(X)$ is sub-Gaussian. See Theorem 7.12 in Boucheron et al. (2013).

  2. For a sub-Gaussian random variable $X$ with density of the form $e^{-U(x)}$, where $U$ is twice continuously differentiable and $U^{\prime\prime}(x)>0$ for all $x\in\mathbb{R}$, if $f$ is a Lipschitz function, then $f(X)$ is also sub-Gaussian. See Theorem 5.2.15 in Vershynin (2020).
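
As a small numeric illustration of the second example (a sketch, not part of the cited theorems): take $X\sim\mathcal{N}(0,1)$, whose density has the form $e^{-U(x)}$ with $U(x)=x^{2}/2+\mathrm{const}$ and $U^{\prime\prime}(x)=1>0$, and the $1$-Lipschitz map $f(x)=|x|$. Gaussian concentration gives $P(|f(X)-\mathbb{E}f(X)|\geq u)\leq 2e^{-u^{2}/2}$, and the empirical tails below stay under this envelope.

```python
import numpy as np

# Numeric sketch for the second example: X ~ N(0,1) and the 1-Lipschitz map f(x) = |x|.
# The centered tails of f(X) stay below the Gaussian-type envelope 2*exp(-u^2/2).
rng = np.random.default_rng(0)
X = rng.standard_normal(1_000_000)
Y = np.abs(X) - np.mean(np.abs(X))                    # centered f(X)
for u in (1.0, 2.0, 3.0):
    empirical = np.mean(np.abs(Y) >= u)
    envelope = 2.0 * np.exp(-u ** 2 / 2.0)
    print(f"u = {u:.0f}: tail ~ {empirical:.5f} <= envelope {envelope:.5f}")
```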