
PID Control-Based Self-Healing to Improve the Robustness of Large Language Models

Zhuotong Chen ztchen@ucsb.edu
Department of Electrical and Computer Engineering, University of California, Santa Barbara
Zihu Wang zihu_wang@ucsb.edu
Department of Electrical and Computer Engineering, University of California, Santa Barbara
Yifan Yang yifanyang@ucsb.edu
Department of Computer Sciences, University of California, Santa Barbara
Qianxiao Li qianxiao@nus.edu.sg
Department of Mathematics, National University of Singapore, Singapore
Institute of High Performance Computing, A*STAR, Singapore
Zheng Zhang zhengzhang@ece.ucsb.edu
Department of Electrical and Computer Engineering, University of California, Santa Barbara
Abstract

Despite the effectiveness of deep neural networks in numerous natural language processing applications, recent findings have exposed the vulnerability of these language models when minor perturbations are introduced. While appearing semantically indistinguishable to humans, these perturbations can significantly reduce the performance of well-trained language models, raising concerns about the reliability of deploying them in safety-critical situations. In this work, we construct a computationally efficient self-healing process to correct undesired model behavior during online inference when perturbations are applied to input data. This is formulated as a trajectory optimization problem in which the internal states of the neural network layers are automatically corrected using a PID (Proportional-Integral-Derivative) control mechanism. The P controller targets immediate state adjustments, while the I and D controllers consider past states and future dynamical trends, respectively. We leverage the geometrical properties of the training data to design effective linear PID controllers. This approach reduces the computational cost to that of using just the P controller, instead of the full PID control. Further, we introduce an analytical method for approximating the optimal control solutions, enhancing the real-time inference capabilities of this controlled system. Moreover, we conduct a theoretical error analysis of the analytic solution in a simplified setting. The proposed PID control-based self-healing is a low-cost framework that improves the robustness of pre-trained large language models, whether standard or robustly trained, against a wide range of perturbations. A detailed implementation can be found at: https://github.com/zhuotongchen/PID-Control-Based-Self-Healing-to-Improve-the-Robustness-of-Large-Language-Models.

1 Introduction

The growth of data and advancements in computing power have heralded a new era in the field of natural language processing (NLP), significantly shaped by the advent of deep neural networks. One of the most influential innovations in this domain is the transformer architecture (Vaswani et al., 2017), which is the fundamental block of many successful large language models (LLMs) (Brown et al., 2020). This architecture has become the state-of-the-art in various NLP tasks, including sentiment analysis (Vinodhini & Chandrasekaran, 2012), text summarization (Nenkova & McKeown, 2012), and speech recognition (Hannun et al., 2014), among others.

However, many deep neural networks are vulnerable to malicious perturbations (Morris et al., 2020). While appearing semantically indistinguishable to humans, these perturbations can significantly degrade the performance of pre-trained LLMs. This vulnerability raises concerns about the reliability of deploying them in safety-critical situations, such as in clinical decision support systems, where LLMs serve a critical role in assisting healthcare professionals with patient care insights (Huang et al., 2019). In response to this challenge, there has been significant progress in developing algorithms to enhance model robustness against such perturbations (Yoo & Qi, 2021; Zhu et al., 2019; Wicker et al., 2021). Predominantly, these methods are rooted in the foundation of adversarial training (Madry et al., 2018), a method where pre-trained LLMs are fine-tuned (or trained from random initialization) to overcome the effects of specific adversarial perturbations. This is achieved by adjusting either the entire set of model parameters or a significant portion thereof (Hu et al., 2021). Despite its effectiveness, this approach raises three critical concerns. Firstly, adjusting model parameters using adversarial examples requires substantial computational resources (Zhang et al., 2019). Due to the discrete input space of NLP tasks, searching for an adversarial example generally involves solving a combinatorial optimization problem (Bernhard & Vygen, 2008), which suffers from an exponential growth in the number of feasible solutions as the size of the problem increases. Secondly, there exists a potential trade-off where improved adversarial robustness may lead to compromised performance on standard, natural datasets (He et al., 2021). Thirdly, and more problematically, adversarial training is less effective against unforeseen adversarial perturbations (Tramer & Boneh, 2019). 
This limitation becomes particularly noticeable when deploying LLMs in practice, where anticipating the potential adversarial attacks in advance is nearly impossible.

In this paper, we investigate the concept of a self-healing process as a cost-effective method to improve the robustness of pre-trained LLMs against a range of perturbations. The most well-known self-healing mechanism is probably the human immune system: B cells and T cells can work together to identify and kill many external attackers (e.g., bacteria) to maintain the health of the human body (Rajapakse & Groudine, 2011). In the context of machine learning, self-healing refers to the ability of a model to automatically identify and correct issues that may arise during its operation (Wang et al., 2021; Chen et al., 2022). To achieve this, we consider a pre-trained LLM (typically a composition of transformation blocks) as a discretization of a continuous dynamical system (E, 2017), which allows us to formulate the robustness issue of LLMs as a trajectory optimization problem (Hehn & D’Andrea, 2015). Our approach involves designing PID (Proportional-Integral-Derivative) controllers at hidden layers of a pre-trained LLM. A PID controller continuously calculates an error value as the difference between a desired reference and a measured process variable and applies a correction control signal based on proportional, integral, and derivative terms. More specifically, let the error value be the difference between a desired reference and the current state. If the error is large, the output of the P controller will be proportionately large, thereby making a significant adjustment and helping the controller respond quickly to errors. The I controller determines the present control output based on the integration of past errors, which ensures that even small errors are corrected over time. The D controller generates control signals based on the derivative of the error dynamics.
The combination of P, I, and D controllers quantifies undesired model behavior from present errors, past accumulated errors, and future error trends, and generates control signals to correct the errors. Figure 1 illustrates the proposed PID control-based self-healing framework. Given a $T$-layer LLM, time-dependent PID controllers (represented as $P_t$, $I_t$, and $D_t$) generate a feedback control based on the state $\mathbf{x}_t$ (to simplify the demonstration, only $\mathbf{x}_t$ is considered as the input for both I and D controllers). These feedback controls aim to remove the undesirable effects caused by input perturbations.

Figure 1: The structures of feed-forward deep neural network (highlighted in blue) and the proposed PID control method (highlighted in red).

The methodology of constructing a self-healing process to improve the robustness of deep neural networks was initially introduced in Chen et al. (2020) and its subsequent work Chen et al. (2022). It leveraged a closed-loop control method to detect and correct potential errors applied to input data. This method is a special case of P control in the proposed PID control framework, in which only the errors from the present states are corrected. However, the effectiveness of using only proportional controllers is limited, as it merely addresses the error at each time step, neglecting the overall error dynamics (we provide numerical evidence to show this in Section 3.4). Additionally, a major limitation of the closed-loop control method adopted in Chen et al. (2020; 2022) is its computational inefficiency. The optimal control solution requires simulating the Hamiltonian dynamics over several iterations during online inference (Pontryagin, 1987), which involves both forward and backward propagation of a deep neural network (Chen et al., 2020) (we discuss the details about simulating the Hamiltonian dynamics in Section 2.1). This inefficiency renders the self-healing framework impractical for deployment in large-scale LLMs, which may contain millions or even billions of parameters (Kenton & Toutanova, 2019; Liu et al., 2019; Brown et al., 2020). To address the challenges associated with the previously mentioned adversarial training and the existing closed-loop control method, this study presents three contributions:

  • We introduce a novel PID control framework to realize the self-healing capability to improve the robustness of LLMs during online inference. The proposed framework generalizes the conventional robustness improvement methods that predominantly focus on proportional errors. We demonstrate that employing all P, I, and D controllers can be as computationally efficient as single control schemes, achieved through special controller design.

  • We approximate the layer-wise transformations in the pre-trained LLM as linear orthogonal transformations and derive an analytical solution for generating PID control solutions. This analytical method yields a closed-form expression for the optimal solution, enhancing the speed of online inference. This acceleration is especially beneficial when integrating the self-healing framework into LLMs. While these approximations might not align with practical scenarios, our method exhibits superior robustness improvement in a variety of numerical experiments.

  • We derive a comprehensive error analysis of the controlled system, highlighting the robustness improvement of LLMs through PID control solutions. This analysis provides insight into the effectiveness of PID control in improving the robustness of LLMs in simplified settings, thereby contributing to the understanding of controlled systems.

1.1 Background on PID Control

Here we provide background knowledge on PID control, which is an essential building block of our proposed framework. A PID (Proportional, Integral, Derivative) controller is a feedback-based control system extensively utilized in industrial settings and numerous other domains where continuous adjustment is essential. Specifically, given a continuous dynamic system, typically described as an ordinary differential equation,

$$\dot{\mathbf{x}}_t=\Psi(\mathbf{x}_t,\mathbf{u}_t),\;\;\mathbf{x}_0\in\mathbb{R}^{d},$$

where $\dot{\mathbf{x}}_t$ represents the derivative with respect to time, while $\Psi:\mathbb{R}^{d}\times\mathbb{R}^{m}\rightarrow\mathbb{R}^{d}$ defines a function or vector field. The term $\mathbf{x}_0$ specifies the initial state. Moreover, $\mathbf{u}_t\in\mathbb{R}^{m}$ denotes the control input exerted on the dynamic system (e.g. in this work, the control input is applied linearly to the state $\mathbf{x}_t$). The PID control continuously computes an error term, $e_t$, by calculating the difference between a target reference $\mathbf{r}_t$ and the actual state variable $\mathbf{x}_t$,

$$e_t=\lVert\mathbf{r}_t-\mathbf{x}_t\rVert.$$

The construction for the target reference generally depends on the application. In this work, the target reference is constructed by a sequence of embedding manifolds (see Section 2.1), and the process variable represents the hidden state of a deep neural network during forward propagation.

Then, a PID controller applies a control $\mathbf{u}_t$ based on proportional, integral, and derivative terms to correct the measured error. In the continuous case,

$$\mathbf{u}_t=K_p e_t+K_i\int_{0}^{t}e(\tau)\,d\tau+K_d\frac{de_t}{dt},$$

where $K_p$, $K_i$, and $K_d$, all non-negative, denote the coefficients for the proportional, integral, and derivative terms respectively. In the PID control design,

  • The output from proportional control directly correlates with the current value of the error, $e_t$. Thus, a larger error will yield a proportionally larger control output, adjusted by the gain factor $K_p$. However, employing only proportional control leads to a persistent difference between the desired target and the actual process variable, since the generation of the proportional response necessitates the presence of an error.

  • The output from integral control considers the accumulation of past error values over time. This means that when a residual error remains after proportional control is applied, the integral control works to correct this residual error by leveraging the historical total error. This will result in the proportional effect diminishing as the error decreases, but this is compensated for by the growing integral effect.

  • The output from derivative control provides an estimate of the trend of the error, using its current rate of change as a basis. It effectively seeks to reduce the effect of the error by exerting a control influence generated by the rate of error change. Hence, the more rapid the error’s progression, the more intense the applied correction.

Although a PID controller has three control terms, some applications need only one or two terms to provide appropriate control. This is achieved by setting the unused parameters to zero and is called a PI, PD, P, or I controller in the absence of the other control actions. PI controllers are fairly common in applications where derivative action would be sensitive to measurement noise, but the integral term is often needed for the system to reach its target reference.
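
To make the roles of the three terms concrete, the following minimal sketch simulates a discrete-time PID loop on a toy scalar system $\dot{x}=-x+u$ with a constant reference. It is only an illustration and not part of the proposed method: the plant, the forward-Euler discretization, the gain values, the use of the signed scalar error $r-x$ (rather than the norm defined above), and the function name `simulate_pid` are all assumptions made for this example.

```python
def simulate_pid(kp, ki, kd, steps=2000, dt=0.01, target=1.0):
    """Run a discrete PID loop on the toy plant x' = -x + u; return the final error."""
    x = 0.0
    integral = 0.0
    prev_error = target - x  # makes the first derivative estimate zero
    for _ in range(steps):
        error = target - x                      # signed error r - x (assumption)
        integral += error * dt                  # accumulated past error (I term)
        derivative = (error - prev_error) / dt  # finite-difference error trend (D term)
        u = kp * error + ki * integral + kd * derivative
        x += dt * (-x + u)                      # forward-Euler step of the toy plant
        prev_error = error
    return target - x


# Hypothetical gains chosen only for illustration.
p_only_error = simulate_pid(kp=4.0, ki=0.0, kd=0.0)
pi_error = simulate_pid(kp=4.0, ki=6.0, kd=0.0)
pid_error = simulate_pid(kp=4.0, ki=6.0, kd=0.1)
print(p_only_error, pi_error, pid_error)
```

With these assumed gains, the P-only loop settles with a residual error of $1/(1+K_p)=0.2$, whereas adding the integral term drives the error to numerically zero. This matches the qualitative behavior described in the list above: proportional action alone needs a nonzero error to produce a nonzero control.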

2 The PID Control-Based Self-Healing Framework for Large Language Models

In this work, we use the concept of "self-healing" to describe the capability of an LLM to automatically correct errors that may arise. This idea has been studied in the domain of integrated circuits, primarily addressing errors attributable to variations in nano-scale fabrication processes. More specifically, the internal dynamics of an electronic circuit network can be understood through the lens of ordinary differential equations (Ho et al., 1975). This approach conceptualizes the circuit network in terms of state variables that track nodal voltages and branch currents as they change over time. Moreover, it is possible to create a self-healing system within the circuit network, enabling it to continuously monitor and optimize its performance throughout the lifetime of operation (Tang et al., 2012; Lee et al., 2012). A growing body of research (e.g., E, 2017; Haber & Ruthotto, 2017; Li et al., 2018b) demonstrates the connection between dynamical systems and deep neural networks. In our study, we interpret a pre-trained $T$-layer LLM as a form of discrete dynamical system,

$$\mathbf{x}_{t+1}=F_t(\mathbf{x}_t+\pi_t(\mathbf{x}_t),\bm{\theta}_t),\;\;\forall t=0,1,\ldots,T-1, \tag{1}$$

where $F_t(\cdot,\bm{\theta}_t):\mathbb{R}^{d}\rightarrow\mathbb{R}^{d}$ represents a transformer block parametrized by $\bm{\theta}_t$, $\pi_t:\mathbb{R}^{d}\rightarrow\mathbb{R}^{d}$ is a feedback controller that maps the current state $\mathbf{x}_t$ to a control action. We aim to construct feedback controllers $\overline{\pi}:=\{\pi_t\}_{t=0}^{T-1}$ to ensure that the controlled states ($\mathbf{x}_t+\pi_t(\mathbf{x}_t)$) yield the desired output when perturbations are applied to input data. This can be formulated as a trajectory optimization problem,

$$\min_{\overline{\pi}}\mathbb{E}_{(\mathbf{x}_0,y)\sim\mathcal{D}}\left[J(\mathbf{x}_0,y,\overline{\pi})\right]:=\min_{\overline{\pi}}\mathbb{E}_{(\mathbf{x}_0,y)\sim\mathcal{D}}\left[\Phi(\mathbf{x}_T,y)+\sum_{t=0}^{T-1}\mathcal{L}(\{\mathbf{x}_s\}_{s=0}^{t},\pi_t,f_t)\right],\;\;{\rm s.t.}\;\;\text{equation 1}, \tag{2}$$

where initial states and labels $(\mathbf{x}_0,y)$ are sampled from the underlying data distribution $\mathcal{D}$. The terminal loss $\Phi(\mathbf{x}_T,y)$ evaluates the discrepancy between the terminal state and a pre-defined destination set. In machine learning applications, this measures the consistency between the terminal state $\mathbf{x}_T$ (or its transformation) and the true label (e.g., cross-entropy loss). However, evaluating it is not feasible in general machine learning applications: during online inference, given an initial condition $\mathbf{x}_0$, its label $y$ is unknown and therefore cannot be used to optimize the states. Consequently, the terminal loss is dropped by setting it to zero. In this case, the optimal controllers $\{\pi_t\}_{t=0}^{T-1}$ minimize the cumulative running losses $\mathcal{L}(\{\mathbf{x}_s\}_{s=0}^{t},\pi_t,f_t)$, which assess the state trajectory and control using a certain measurement function $f_t$.
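
The controlled dynamics in equation 1 can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the authors' implementation: the parameters $\bm{\theta}_t$ are folded into each layer callable, and the two-dimensional `shear` layer, the two controllers, and the assumption that clean states lie on the first coordinate axis are all hypothetical choices made so that the effect of the feedback is easy to see.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Map = Callable[[Vector], Vector]


def controlled_forward(x0: Vector, blocks: Sequence[Map],
                       controllers: Sequence[Map]) -> List[Vector]:
    """Roll out x_{t+1} = F_t(x_t + pi_t(x_t)) and return [x_0, ..., x_T]."""
    states = [list(x0)]
    for block, controller in zip(blocks, controllers):
        x = states[-1]
        u = controller(x)                               # control action pi_t(x_t)
        controlled = [xi + ui for xi, ui in zip(x, u)]  # x_t + pi_t(x_t)
        states.append(block(controlled))                # F_t(., theta_t)
    return states


def shear(v: Vector) -> Vector:
    """Toy layer (assumption): leaks the second coordinate into the first."""
    return [v[0] + v[1], v[1]]


def zero_controller(v: Vector) -> Vector:
    """No control: recovers the plain feed-forward pass."""
    return [0.0, 0.0]


def projection_controller(v: Vector) -> Vector:
    """Cancels the second coordinate, assuming clean states lie on the first axis."""
    return [0.0, -v[1]]


blocks = [shear] * 3
perturbed_input = [1.0, 0.1]  # the clean input would be [1.0, 0.0]
uncontrolled = controlled_forward(perturbed_input, blocks, [zero_controller] * 3)[-1]
controlled = controlled_forward(perturbed_input, blocks, [projection_controller] * 3)[-1]
print(uncontrolled, controlled)
```

In this toy setting the uncontrolled perturbation accumulates across layers, while the controlled rollout returns to the clean terminal state `[1.0, 0.0]`. Real transformer blocks and controllers are of course far less benign than this example.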

In Section 2.1, we construct the running loss, which forms a crucial part of the objective function defined in equation 2. Minimizing this objective requires solving a complex optimization problem for each initial state during online inference, which presents a challenge to computational efficiency. To address this, Section 2.2 presents a more efficient algorithm for solving the optimization problem under specific assumptions. Furthermore, Section 2.3 presents a comprehensive theoretical error analysis for the proposed algorithm. Lastly, Section 2.4 provides additional details on the implementation of constructing PID controls.

2.1 PID Control Design via Embedding Manifolds

In analyzing an LLM through the lens of discrete dynamical systems, we observe that its state trajectory, governed by the composition of transformations, forms a lower-dimensional structure embedded in the ambient state space, a property known as the “manifold hypothesis” (Fefferman et al., 2016) (empirical evidence is shown in Table 7). This can be conceptualized as a sequence of embedding manifolds. We consider an $r$-dimensional smooth manifold embedded in $\mathbb{R}^{d}$ as $\{\mathbf{x}:f(\mathbf{x})=\mathbf{0}\}$, where $f:\mathbb{R}^{d}\rightarrow\mathbb{R}^{(d-r)}$ is a surjective mapping that can be used to measure the distance from a state $\mathbf{x}_t$ to the embedding manifold. For instance, $\lVert f(\mathbf{x})\rVert=0$ if $\mathbf{x}$ belongs to the embedding manifold, and $\lVert f(\mathbf{x})\rVert>0$ if $\mathbf{x}$ is outside the embedding manifold.
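
As a minimal sketch of such a mapping, consider the special case in which the manifold is a linear subspace, so that $f$ can be taken as the projection onto a set of orthonormal normal directions and $\lVert f(\mathbf{x})\rVert$ is the Euclidean distance to the subspace. The linearity assumption, the helper names `make_linear_embedding` and `distance_to_manifold`, and the toy line in $\mathbb{R}^{2}$ are illustrative choices, not definitions taken from the text above.

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]


def make_linear_embedding(normals: Sequence[Vector]) -> Callable[[Vector], Vector]:
    """Return f(x) = (n_1 . x, ..., n_{d-r} . x) for orthonormal normals n_i."""
    def f(x: Vector) -> Vector:
        return [sum(n_k * x_k for n_k, x_k in zip(n, x)) for n in normals]
    return f


def distance_to_manifold(f: Callable[[Vector], Vector], x: Vector) -> float:
    """Euclidean norm of f(x); zero exactly when x lies on {x : f(x) = 0}."""
    return math.sqrt(sum(c * c for c in f(x)))


# Toy manifold (assumption): the line x_0 = x_1 in R^2, so d = 2 and r = 1,
# with the single unit normal (1, -1) / sqrt(2).
s = 1.0 / math.sqrt(2.0)
f = make_linear_embedding([[s, -s]])
on_manifold = distance_to_manifold(f, [2.0, 2.0])   # point on the line
off_manifold = distance_to_manifold(f, [1.0, 0.0])  # point off the line
print(on_manifold, off_manifold)
```

Here $f$ maps $\mathbb{R}^{2}$ onto $\mathbb{R}^{1}$, so it is surjective with $d-r=1$ output dimension, as required by the definition above.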

At each time step $t$ (e.g., the $t^{\rm th}$ hidden states), we construct three embedding manifolds with three distinct surjective functions $f_t^{P}:\mathbb{R}^{d}\rightarrow\mathbb{R}^{(d-r)}$, $f_t^{I}:\mathbb{R}^{d}\rightarrow\mathbb{R}^{(d-r)}$, and $f_t^{D}:\mathbb{R}^{d}\rightarrow\mathbb{R}^{(d-r)}$, that represent the embedding manifolds of the state, the integration of past states, and the derivative of the state, respectively. In a discrete setting, integration corresponds to the accumulation of past states, while the derivative is approximated by the difference between two successive states. Under this setting, $f_t^{I}$ denotes the embedded manifold of past states, and $f_t^{D}$ represents the embedded manifold derived from the difference between two consecutive states. Given these embedding functions, we propose the following running loss to evaluate the controlled state at time step $t$,

$$
\begin{aligned}
\mathcal{L}(\{\mathbf{x}_s\}_{s=0}^{t},\pi_t,(f_t^{P},f_t^{I},f_t^{D})) :=\; & \frac{1}{2}\lVert f_t^{P}(\mathbf{x}_t+\pi_t(\mathbf{x}_t))\rVert_2^2+\frac{1}{2}\Big\lVert f_t^{I}\Big(\mathbf{x}_t+\pi_t(\mathbf{x}_t)+\sum_{s=0}^{t-1}\mathbf{x}_s\Big)\Big\rVert_2^2 \\
& +\frac{1}{2}\lVert f_t^{D}(\mathbf{x}_t+\pi_t(\mathbf{x}_t)-\mathbf{x}_{t-1})\rVert_2^2+\frac{c_t}{2}\lVert\pi_t(\mathbf{x}_t)\rVert_2^2,
\end{aligned} \tag{3}
$$

where the layer-dependent regularization coefficient $c_t$ penalizes large controls. The running loss consists of three components, each measuring the error of the controlled state through a distinct embedding function: proportional, integral, and derivative. Under this construction, the optimal controller yields a controlled state $(\mathbf{x}_t + \pi_t(\mathbf{x}_t))$ that is expected to align closely with the state embedding manifold, as evaluated by $f_t^P$. Additionally, the controller must ensure that the controlled state accumulated with the past states remains close to the embedding manifold of integrated states, as evaluated by $f_t^I$, and that the state's derivative aligns closely with the embedding manifold of state derivatives, as quantified by $f_t^D$.
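The running loss in equation 3 can be evaluated directly once the embedding functions are available. The following is a minimal sketch, not the authors' implementation: the function name `running_loss`, its argument layout, and the convention of treating the missing state $\mathbf{x}_{-1}$ as zero at $t = 0$ are all assumptions made here for illustration.

```python
import numpy as np

def running_loss(x_t, u_t, past_states, f_P, f_I, f_D, c_t):
    """Evaluate the running loss of equation 3 at one layer.

    past_states: list [x_0, ..., x_{t-1}] (may be empty at t = 0).
    f_P, f_I, f_D: callables standing in for the embedding functions.
    """
    controlled = x_t + u_t
    # Integral term: controlled state plus the sum of all past states.
    integrated = controlled + sum(past_states, np.zeros_like(x_t))
    # Derivative term: controlled state minus the previous state.
    # Assumption of this sketch: at t = 0 there is no x_{-1}, so zero is used.
    previous = past_states[-1] if past_states else np.zeros_like(x_t)
    derivative = controlled - previous
    return (0.5 * np.sum(f_P(controlled) ** 2)
            + 0.5 * np.sum(f_I(integrated) ** 2)
            + 0.5 * np.sum(f_D(derivative) ** 2)
            + 0.5 * c_t * np.sum(u_t ** 2))

# Toy usage with identity maps in place of the embedding functions.
identity = lambda v: v
x_t = np.array([1.0, 2.0])
history = [np.array([1.0, 0.0])]
loss_no_control = running_loss(x_t, np.zeros(2), history, identity, identity, identity, c_t=2.0)
```

In practice the three callables would measure the distance to the respective embedding manifolds; identity maps are used above only to keep the arithmetic easy to follow.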

The objective function defined in equation 2, with the running loss detailed in equation 3, can be solved via the dynamic programming principle (Bellman, 1952). However, the complexity of this method is exponential in the dimension of the state. To overcome this “curse of dimensionality”, we can reinterpret the optimal control problem through Pontryagin’s Maximum Principle and approximate its solution using the method of successive approximations (Chen et al., 2020). To begin with, we define the Hamiltonian $H(t, \{\mathbf{x}_s\}_{s=0}^{t}, \mathbf{p}_{t+1}, \bm{\theta}_t, \mathbf{u}_t)$ as

$$
H(t, \{\mathbf{x}_s\}_{s=0}^{t}, \mathbf{p}_{t+1}, \bm{\theta}_t, \mathbf{u}_t) := \mathbf{p}_{t+1}^\top \cdot F_t(\mathbf{x}_t + \mathbf{u}_t, \bm{\theta}_t) - \mathcal{L}(\{\mathbf{x}_s\}_{s=0}^{t}, \mathbf{u}_t, (f_t^P, f_t^I, f_t^D)),
$$

where $\mathbf{u}_t = \pi_t(\mathbf{x}_t)$ is a control solution. Pontryagin’s Maximum Principle consists of a two-point boundary value problem,

\begin{align}
\mathbf{x}_{t+1}^{\ast} &= \nabla_p H(t, \{\mathbf{x}_s\}_{s=0}^{t}, \mathbf{p}_{t+1}, \bm{\theta}_t, \mathbf{u}_t), & (\mathbf{x}_0, y) &\sim \mathcal{D}, \tag{4} \\
\mathbf{p}_{t}^{\ast} &= \nabla_x H(t, \{\mathbf{x}_s\}_{s=0}^{t}, \mathbf{p}_{t+1}, \bm{\theta}_t, \mathbf{u}_t), & \mathbf{p}_T &= \mathbf{0}, \tag{5}
\end{align}

plus a maximization condition of the Hamiltonian.

$$
H(t, \{\mathbf{x}_s\}_{s=0}^{t}, \mathbf{p}_{t+1}, \bm{\theta}_t, \mathbf{u}_t^{\ast}) \geq H(t, \{\mathbf{x}_s\}_{s=0}^{t}, \mathbf{p}_{t+1}, \bm{\theta}_t, \mathbf{u}_t), \quad \forall\, \mathbf{u}_t \text{ and } \forall\, t \in \mathcal{T}. \tag{6}
$$

To obtain a numerical solution, one can iterate three steps: propagate the forward dynamics in equation 4 to obtain all states $\{\mathbf{x}_t\}_{t=0}^{T-1}$, propagate the backward dynamics in equation 5 to compute the adjoint states $\{\mathbf{p}_t\}_{t=0}^{T-1}$, and maximize the Hamiltonian in equation 6 with the current states and adjoint states via gradient ascent. This iterative process is continued until convergence. Given an initial condition $\mathbf{x}_0$, Pontryagin’s Maximum Principle characterizes the optimal feedback control $\pi_t(\mathbf{x}_t)$ with an open-loop control $\mathbf{u}_t$. Computing this open-loop control requires both forward and backward propagation through a pre-trained deep neural network over several iterations during online inference.
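The forward, backward, and maximization steps described above can be sketched on a toy problem. The code below is an illustration under strong simplifying assumptions that are not taken from the paper: the dynamics are linear, $\mathbf{x}_{t+1} = \bm{\theta}_t(\mathbf{x}_t + \mathbf{u}_t)$, only the proportional term of the running loss is kept with a linear embedding $\mathbf{Q}_t$, and there is no terminal loss. The names `total_cost` and `successive_approximation`, the step size, and the iteration count are hypothetical choices.

```python
import numpy as np

def total_cost(x0, controls, thetas, Qs, c):
    """Sum of the (P-term only) running losses along the controlled trajectory."""
    x, cost = x0, 0.0
    for u, theta, Q in zip(controls, thetas, Qs):
        s = x + u
        cost += 0.5 * np.sum((Q @ s) ** 2) + 0.5 * c * np.sum(u ** 2)
        x = theta @ s
    return cost

def successive_approximation(x0, thetas, Qs, c, lr=0.05, n_iters=200):
    T, d = len(thetas), x0.shape[0]
    controls = [np.zeros(d) for _ in range(T)]
    for _ in range(n_iters):
        # Forward pass (equation 4): x_{t+1} = theta_t (x_t + u_t).
        states = [x0]
        for t in range(T):
            states.append(thetas[t] @ (states[t] + controls[t]))
        # Backward pass (equation 5), starting from p_T = 0.
        p_next = np.zeros(d)
        grads = [None] * T
        for t in reversed(range(T)):
            s = states[t] + controls[t]
            grad_x = thetas[t].T @ p_next - Qs[t].T @ (Qs[t] @ s)
            grads[t] = grad_x - c * controls[t]   # gradient of H w.r.t. u_t
            p_next = grad_x                       # p_t = gradient of H w.r.t. x_t
        # Gradient ascent on the Hamiltonian (equation 6).
        controls = [u + lr * g for u, g in zip(controls, grads)]
    return controls

# Toy problem: random orthogonal layers and random subspace projections.
rng = np.random.default_rng(0)
d, r, T, c = 4, 2, 3, 0.1
thetas = [np.linalg.qr(rng.standard_normal((d, d)))[0] for _ in range(T)]
bases = [np.linalg.qr(rng.standard_normal((d, r)))[0] for _ in range(T)]
Qs = [np.eye(d) - V @ V.T for V in bases]
x0 = rng.standard_normal(d)

cost_before = total_cost(x0, [np.zeros(d)] * T, thetas, Qs, c)
controls = successive_approximation(x0, thetas, Qs, c)
cost_after = total_cost(x0, controls, thetas, Qs, c)
```

Even in this small example every iteration needs one forward and one backward sweep over all layers, which is the cost that becomes prohibitive when each layer is a transformer block.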

However, solving Pontryagin’s Maximum Principle in this iterative manner is generally infeasible for LLMs. In the subsequent section, we construct an analytic solution under certain relaxation assumptions.

2.2 An Analytic Solution for Fast Inference

In this section, we develop an analytic solution to the optimal control problem defined in equation 2 under certain assumptions, which are summarized as follows:

  • Assumption 1: Both the embedding manifold and the layer-wise transformations are simplified as linear functions. In this case, the layer-wise transformation $F_t(\cdot)$ is linearized through a matrix $\bm{\theta}_t \in \mathbb{R}^{d \times d}$. A smooth embedding manifold is represented by a linear embedding subspace. This subspace is spanned by a set of basis vectors that form the columns of a matrix $\mathbf{V} \in \mathbb{R}^{d \times r}$, corresponding to an $r$-dimensional embedding subspace.

  • Assumption 2: Both the embedding manifold and the layer-wise transformations are orthogonal. In this case, the layer-wise transformations are represented by orthogonal matrices satisfying $\bm{\theta}_t^\top \bm{\theta}_t = \bm{\theta}_t \bm{\theta}_t^\top = \mathbf{I}$. Additionally, the bases $\mathbf{V}_t^P$, $\mathbf{V}_t^I$, and $\mathbf{V}_t^D$ are assumed to be mutually orthogonal,

    $$(\mathbf{V}_t^P)^\top \mathbf{V}_t^I = \mathbf{0}, \quad (\mathbf{V}_t^P)^\top \mathbf{V}_t^D = \mathbf{0}, \quad (\mathbf{V}_t^I)^\top \mathbf{V}_t^D = \mathbf{0}.$$

Under these assumptions, the computational cost of the proposed control algorithm is similar to that of forward propagation through the original model, so the PID control approach introduces negligible overhead. The negative impact of these assumptions is discussed in Section 3.4.

An analytic solution under Assumption 1.

In the special linear case, for the linear embedding subspaces linked to the state, state integration, and state derivative, we define the bases as $\mathbf{V}_t^P$, $\mathbf{V}_t^I$, and $\mathbf{V}_t^D$, respectively. Consequently, the embedding manifolds, represented by the surjective functions $f_t^P$, $f_t^I$, and $f_t^D$, are orthogonal projections $\mathbf{Q}_t^P$, $\mathbf{Q}_t^I$, and $\mathbf{Q}_t^D$, where

$$
\mathbf{Q}_t^P = \mathbf{I} - \mathbf{V}_t^P (\mathbf{V}_t^P)^\top, \quad \mathbf{Q}_t^I = \mathbf{I} - \mathbf{V}_t^I (\mathbf{V}_t^I)^\top, \quad \mathbf{Q}_t^D = \mathbf{I} - \mathbf{V}_t^D (\mathbf{V}_t^D)^\top.
$$
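As a small numerical illustration of these projections, the sketch below builds $\mathbf{Q} = \mathbf{I} - \mathbf{V}\mathbf{V}^\top$ from a basis with orthonormal columns and checks how it treats states inside and outside the embedding subspace. The helper name `complement_projector` and the use of a QR factorization of a random matrix to obtain the basis are assumptions made for this example; in the paper the basis is designed from the training data.

```python
import numpy as np

def complement_projector(V):
    """Return Q = I - V V^T for a basis V (d x r) with orthonormal columns."""
    d = V.shape[0]
    return np.eye(d) - V @ V.T

rng = np.random.default_rng(0)
d, r = 6, 2
# Hypothetical basis: orthonormal columns from a reduced QR factorization.
V, _ = np.linalg.qr(rng.standard_normal((d, r)))
Q = complement_projector(V)

x_on = V @ rng.standard_normal(r)   # a state lying in the embedding subspace
x_off = rng.standard_normal(d)      # a generic state

residual_on = np.linalg.norm(Q @ x_on)     # numerically zero
residual_off = np.linalg.norm(Q @ x_off)   # distance of x_off to the subspace
```

The norm $\lVert \mathbf{Q}\mathbf{x} \rVert_2$ is therefore the distance from $\mathbf{x}$ to the embedding subspace, which is what each term of the running loss penalizes in the linear case.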

The following proposition solves the objective function defined in equation 2 under linearity assumptions.

Proposition 1.

Consider the following objective function,

\begin{align}
\min_{\overline{\pi}} \mathbb{E}_{(\mathbf{x}_0, y) \sim \mathcal{D}} \left[ J(\mathbf{x}_0, y, \overline{\pi}) \right] &:= \min_{\overline{\pi}} \mathbb{E}_{(\mathbf{x}_0, y) \sim \mathcal{D}} \left[ \Phi(\mathbf{x}_T, y) + \sum_{t=0}^{T-1} \mathcal{L}(\{\mathbf{x}_s\}_{s=0}^{t}, \pi_t, (\mathbf{Q}_t^P, \mathbf{Q}_t^I, \mathbf{Q}_t^D)) \right], \notag \\
&\quad {\rm s.t.} \; \mathbf{x}_{t+1} = \bm{\theta}_t (\mathbf{x}_t + \pi_t(\mathbf{x}_t)). \tag{7}
\end{align}

The optimal value function, parametrized as $V(\mathbf{x}_t) = \mathbf{x}_t^\top \mathbf{P}_t \mathbf{x}_t$, satisfies the Riccati equation:

$$
\mathbf{P}_t = \frac{1}{2}\mathbf{Q}_t + \bm{\theta}_t^\top \mathbf{P}_{t+1} \bm{\theta}_t - \frac{1}{2} (\mathbf{Q}_t + 2\bm{\theta}_t^\top \mathbf{P}_{t+1} \bm{\theta}_t)^\top (\mathbf{Q}_t + 2\bm{\theta}_t^\top \mathbf{P}_{t+1} \bm{\theta}_t + c_t \mathbf{I})^{-1} (\mathbf{Q}_t + 2\bm{\theta}_t^\top \mathbf{P}_{t+1} \bm{\theta}_t). \tag{8}
$$

The optimal control solution is given by

$$
\pi_t(\mathbf{x}_t) = -(\mathbf{Q}_t + c \cdot \mathbf{I} + 2\bm{\theta}_t^\top \mathbf{P}_{t+1} \bm{\theta}_t)^{-1} (\mathbf{Q}_t + 2\bm{\theta}_t^\top \mathbf{P}_{t+1} \bm{\theta}_t)\, \mathbf{x}_t, \tag{9}
$$

where $\mathbf{Q}_t = \mathbf{Q}_t^P + \mathbf{Q}_t^I + \mathbf{Q}_t^D$.
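The backward recursion of equation 8 and the feedback gain of equation 9 can be sketched numerically as follows. This is an illustrative sketch rather than the released implementation: the function name `riccati_backward`, the terminal value $\mathbf{P}_T = \mathbf{0}$, and the random test problem are assumptions. The consistency check at the end additionally assumes that the running loss reduces to the quadratic form $\frac{1}{2}\mathbf{s}^\top \mathbf{Q}_t \mathbf{s} + \frac{c_t}{2}\lVert \mathbf{u}_t \rVert_2^2$ with $\mathbf{s} = \mathbf{x}_t + \mathbf{u}_t$, that is, without the history terms of the integral and derivative components.

```python
import numpy as np

def riccati_backward(thetas, Qs, cs):
    """Backward recursion of equation 8, assuming the terminal value P_T = 0.

    Returns value matrices P_0..P_T and gains K_t with pi_t(x) = -K_t x.
    """
    T, d = len(thetas), thetas[0].shape[0]
    P = [None] * (T + 1)
    K = [None] * T
    P[T] = np.zeros((d, d))
    for t in reversed(range(T)):
        coupled = thetas[t].T @ P[t + 1] @ thetas[t]
        G = Qs[t] + 2.0 * coupled
        # Gain of equation 9: (Q_t + c_t I + 2 theta^T P theta)^{-1} (Q_t + 2 theta^T P theta).
        K[t] = np.linalg.solve(G + cs[t] * np.eye(d), G)
        P[t] = 0.5 * Qs[t] + coupled - 0.5 * G.T @ K[t]
    return P, K

def random_projector(rng, d, r):
    """Hypothetical helper: I - V V^T for a random orthonormal basis V."""
    V, _ = np.linalg.qr(rng.standard_normal((d, r)))
    return np.eye(d) - V @ V.T

rng = np.random.default_rng(0)
d, r, T = 4, 1, 3
thetas = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(T)]
# Q_t = Q_t^P + Q_t^I + Q_t^D, here with three random subspaces per layer.
Qs = [sum(random_projector(rng, d, r) for _k in range(3)) for _t in range(T)]
cs = [0.5] * T

P, K = riccati_backward(thetas, Qs, cs)

# Roll out the closed loop and accumulate the simplified quadratic cost.
x0 = rng.standard_normal(d)
x, simulated_cost = x0, 0.0
for t in range(T):
    u = -K[t] @ x
    s = x + u
    simulated_cost += 0.5 * (s @ Qs[t] @ s) + 0.5 * cs[t] * (u @ u)
    x = thetas[t] @ s
predicted_cost = x0 @ P[0] @ x0
```

Under the stated simplification, `simulated_cost` and `predicted_cost` should agree up to floating-point error, which is a convenient sanity check on an implementation of the recursion.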

We provide an outline of the proof, with a detailed derivation available in Appendix 8. The optimal value function $V(\mathbf{x}_t)$ of the optimal control problem defined in equation 7 satisfies the Bellman optimality equation,

$$
V(\mathbf{x}_t) = \mathcal{L}(\{\mathbf{x}_s\}_{s=0}^{t}, \pi_t, (\mathbf{Q}_t^P, \mathbf{Q}_t^I, \mathbf{Q}_t^D)) + V(\mathbf{x}_{t+1}), \quad {\rm s.t.} \;\; \mathbf{x}_{t+1} = \bm{\theta}_t (\mathbf{x}_t + \pi_t(\mathbf{x}_t)).
$$

In the linear case, the optimal value function is parametrized by a quadratic function, expressed as $V(\mathbf{x}_t) = \mathbf{x}_t^\top \mathbf{P}_t \mathbf{x}_t$. By setting the derivative $\frac{dV(\mathbf{x}_t)}{d\pi_t(\mathbf{x}_t)}$ to zero, we arrive at the optimal control solution, as detailed in equation 9. Furthermore, the Riccati equation in equation 8 emerges from substituting this optimal control solution into the Bellman optimality equation.
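To make the two steps of this outline concrete, the following worked sketch carries them out under a simplification that is assumed here for brevity and is not claimed to be the appendix's derivation: the running loss is taken to be the quadratic form $\frac{1}{2}\mathbf{s}^\top \mathbf{Q}_t \mathbf{s} + \frac{c_t}{2}\mathbf{u}_t^\top \mathbf{u}_t$ with $\mathbf{s} = \mathbf{x}_t + \mathbf{u}_t$, which uses the fact that each orthogonal projection is symmetric and idempotent. The shorthand $\mathbf{G}_t$ is introduced only for this sketch.

```latex
% Sketch under the simplifying assumption stated above.
\begin{align*}
\mathbf{s} &:= \mathbf{x}_t + \mathbf{u}_t, \qquad
\mathbf{G}_t := \mathbf{Q}_t + 2\bm{\theta}_t^\top \mathbf{P}_{t+1} \bm{\theta}_t, \\
V(\mathbf{x}_t) &= \min_{\mathbf{u}_t} \Big[ \tfrac{1}{2}\mathbf{s}^\top \mathbf{Q}_t \mathbf{s}
  + \tfrac{c_t}{2}\mathbf{u}_t^\top \mathbf{u}_t
  + \mathbf{s}^\top \bm{\theta}_t^\top \mathbf{P}_{t+1} \bm{\theta}_t \mathbf{s} \Big]
  = \min_{\mathbf{u}_t} \Big[ \tfrac{1}{2}\mathbf{s}^\top \mathbf{G}_t \mathbf{s}
  + \tfrac{c_t}{2}\mathbf{u}_t^\top \mathbf{u}_t \Big], \\
\mathbf{0} &= \mathbf{G}_t(\mathbf{x}_t + \mathbf{u}_t) + c_t \mathbf{u}_t
  \;\Longrightarrow\;
  \mathbf{u}_t^{\ast} = -(\mathbf{G}_t + c_t \mathbf{I})^{-1} \mathbf{G}_t \mathbf{x}_t, \\
\mathbf{x}_t^\top \mathbf{P}_t \mathbf{x}_t &= \tfrac{1}{2}\mathbf{x}_t^\top
  \big[ \mathbf{G}_t - \mathbf{G}_t (\mathbf{G}_t + c_t \mathbf{I})^{-1} \mathbf{G}_t \big] \mathbf{x}_t .
\end{align*}
```

The third line has the form of equation 9 (where the regularization coefficient is written as $c$), and expanding $\frac{1}{2}\mathbf{G}_t = \frac{1}{2}\mathbf{Q}_t + \bm{\theta}_t^\top \mathbf{P}_{t+1} \bm{\theta}_t$ in the last line, together with the symmetry of $\mathbf{G}_t$, gives the form of equation 8.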

Remark 2.

As derived in equation 9, using a combination of P, I, and D controllers incurs the same computational cost as using a single type of control scheme. This is due to the linearity of the control process, where the orthogonal projections onto the state embedding, state integration embedding, and state derivative embeddings can be effectively merged. This results in a projection onto the intersecting subspace of the three linear embedding subspaces.

Starting with a pre-trained LLM, the layer-wise transformations can be linearized to form a linear dynamical system. From this, the parameters of the optimal value function $\mathbf{P}_t$ are computed using the discrete dynamical system outlined in equation 8. Subsequently, the optimal feedback control solution $\pi_t(\mathbf{x}_t)$ is constructed from equation 9. Although this method is feasible, the linearization of a series of transformer layers poses its own set of complexities. Moving forward, we propose an analytic solution that does not rely on linearizing the base model, under additional orthogonality assumptions.

An analytic solution under Assumption 2.

We further consider the scenario where both the embedding manifold and the layer-wise transformations are orthogonal. As a result, the linear combination of orthogonal projections, $\mathbf{Q}_t = \mathbf{Q}_t^P + \mathbf{Q}_t^I + \mathbf{Q}_t^D$, itself forms an orthogonal projection. Under these conditions, we can establish an analytic formulation for the optimal control solution, as detailed in the following proposition.

Proposition 3.

When the layer-wise transformations are represented by orthogonal matrices, and the bases of the state embedding, state integration embedding, and state derivative embedding are mutually orthogonal, the optimal feedback control $\pi_t(\mathbf{x}_t)$ can be computed as follows:

$$
\pi_t(\mathbf{x}_t) = -\mathbf{V}_t
\begin{bmatrix}
0 & 0 & \cdots & 0 & 0 \\
0 & 0 & \cdots & 0 & 0 \\
\vdots & \vdots & \ddots & 0 & 0 \\
0 & 0 & \cdots & 1 - \frac{c}{1 + \lambda_{t+1} + c} & 0 \\
0 & 0 & \cdots & 0 & 1 - \frac{c}{1 + \lambda_{t+1} + c}
\end{bmatrix}
\mathbf{V}_t^\top \mathbf{x}_t,
$$

where the time-varying parameter $\lambda_t$ is governed by the backward difference equation $\lambda_t = \frac{c(1 + \lambda_{t+1})}{1 + \lambda_{t+1} + c}$, with the terminal condition $\lambda_T = 0$.

A brief overview of the proof is presented here, while a more detailed derivation can be found in Appendix 8. The condition that the embedding subspaces are orthogonal guarantees that the linear combination $\mathbf{Q}_{t}=\mathbf{Q}_{t}^{P}+\mathbf{Q}_{t}^{I}+\mathbf{Q}_{t}^{D}$ forms an orthogonal projection. Moreover, the orthogonality of the layer-wise transformations simplifies the Riccati equation. This simplification leads to a recursive formulation of the control regularization.

When $c=0$, it holds that $\lambda_{t}=0$ for every $t$, and the optimal feedback control corresponds to the orthogonal projection onto the orthogonal complement of the linear subspace. On the other hand, for $c>0$, the approach yields a time-varying regularization in control across different layers. This analytical solution, which assumes linear orthogonality, is independent of the underlying model. Therefore, the time-variant control regularization $c_{t}$ can be pre-calculated prior to the inference process.
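Because the recursion depends only on $c$ and the number of layers $T$, the schedule can be tabulated once before inference. The following is a minimal sketch of that pre-computation; the function name `control_schedule` and its return layout are illustrative assumptions, not taken from the authors' code.

```python
def control_schedule(c: float, T: int):
    """Tabulate lambda_t and the per-layer control gain for layers t = 0..T-1.

    lam has length T + 1 with lam[T] = 0 (terminal condition).
    gain[t] = 1 - c / (1 + lam[t + 1] + c) is the diagonal entry that the
    controller applies to the directions outside the embedding subspace.
    """
    lam = [0.0] * (T + 1)
    # Backward difference equation: lambda_t = c (1 + lambda_{t+1}) / (1 + lambda_{t+1} + c)
    for t in range(T - 1, -1, -1):
        lam[t] = c * (1.0 + lam[t + 1]) / (1.0 + lam[t + 1] + c)
    gain = [1.0 - c / (1.0 + lam[t + 1] + c) for t in range(T)]
    return lam, gain


if __name__ == "__main__":
    lam, gain = control_schedule(c=1.0, T=2)
    print(lam, gain)
```

With $c=0$ every gain equals 1, which recovers the unregularized projection case described above; larger $c$ shrinks the gains and therefore weakens the correction.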

2.3 Theoretical Error Analysis

Under both Assumptions 1 and 2 defined in Section 2.2, the controlled dynamics, under some perturbations, are formulated as follows:

$$\overline{\mathbf{x}}_{t+1}=\bm{\theta}_{t}(\overline{\mathbf{x}}_{t}+\pi_{t}(\overline{\mathbf{x}}_{t})),\;\;\overline{\mathbf{x}}_{0}=\mathbf{x}_{0}+\mathbf{z},$$

where $\mathbf{z}$ represents an arbitrary perturbation decomposable into two mutually orthogonal components, $\mathbf{z}=\mathbf{z}^{\parallel}\oplus\mathbf{z}^{\perp}$: $\mathbf{z}^{\parallel}$, lying within the data embedding subspace, and $\mathbf{z}^{\perp}$, orthogonal to the data manifold. We denote the state trajectory in the absence of input perturbation and without any applied control by $\mathbf{x}_{t+1}=\bm{\theta}_{t}(\mathbf{x}_{t})$. The following theorem quantifies the error $\lVert\overline{\mathbf{x}}_{t}-\mathbf{x}_{t}\rVert_{2}^{2}$, which measures the difference between the perturbed state after control correction and the original state from unperturbed input data.

Theorem 4.

For any time step $t\geq 1$, assuming that each $\bm{\theta}_{t}$ is an orthogonal matrix, we have the following error computation:

$$\lVert\overline{\mathbf{x}}_{t}-\mathbf{x}_{t}\rVert_{2}^{2}=\prod_{s=0}^{t-1}\alpha_{s}^{2}\cdot\lVert\mathbf{z}^{\perp}\rVert_{2}^{2}+\lVert\mathbf{z}^{\parallel}\rVert_{2}^{2},$$

where $\alpha_{t}$ is a time-varying parameter defined in relation to the control regularization $c$, and $\lambda_{t}$ are auxiliary variables, as follows:

$$\alpha_{t}=\frac{c}{1+\lambda_{t+1}+c},\quad\lambda_{T}=0,\quad\lambda_{T-1}=\frac{c}{1+c},\quad\lambda_{t}=\frac{c(1+\lambda_{t+1})}{1+c+\lambda_{t+1}}.$$

The detailed derivation is provided in Appendix 9. This computation shows that the perturbation component lying in the orthogonal complement, $\mathbf{z}^{\perp}$, decays during forward propagation, whereas the component $\mathbf{z}^{\parallel}$ within the embedding subspace is left unchanged. Furthermore, when the control is regularized ($c>0$), the optimal control solution is derived by accounting for the interplay among the different transformation layers: the factor $\alpha_{t}$ applied at layer $t$ depends on the subsequent layers through $\lambda_{t+1}$.
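The error formula can be checked numerically in a toy setting. The sketch below is not the authors' experiment: it assumes the same embedding subspace (the first `r` coordinates) at every layer, and block-orthogonal layer maps that keep the subspace and its complement separate. All names and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, T, c = 6, 3, 4, 0.5   # state dim, subspace dim, number of layers, regularization

# Backward recursion for lambda_t, then alpha_t = c / (1 + lambda_{t+1} + c).
lam = np.zeros(T + 1)
for t in range(T - 1, -1, -1):
    lam[t] = c * (1 + lam[t + 1]) / (1 + lam[t + 1] + c)
alpha = c / (1 + lam[1:] + c)            # alpha[t] uses lam[t + 1]

# Toy assumption: projections onto a fixed subspace and its complement.
P_par = np.diag([1.0] * r + [0.0] * (d - r))
P_perp = np.eye(d) - P_par

def random_orthogonal(n):
    q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return q

def block_orthogonal():
    # Orthogonal map that sends the subspace to itself and the complement to itself.
    theta = np.zeros((d, d))
    theta[:r, :r] = random_orthogonal(r)
    theta[r:, r:] = random_orthogonal(d - r)
    return theta

thetas = [block_orthogonal() for _ in range(T)]

x = P_par @ rng.standard_normal(d)       # clean input lies in the subspace
z = rng.standard_normal(d)               # arbitrary perturbation
x_bar = x + z

for t in range(T):
    x = thetas[t] @ x                             # uncontrolled clean trajectory
    u = -(1 - alpha[t]) * (P_perp @ x_bar)        # pi_t(x_bar) in this toy basis
    x_bar = thetas[t] @ (x_bar + u)               # controlled perturbed trajectory

lhs = np.sum((x_bar - x) ** 2)
rhs = np.prod(alpha ** 2) * np.sum((P_perp @ z) ** 2) + np.sum((P_par @ z) ** 2)
print(lhs, rhs)
```

In this construction the two printed values agree up to floating-point error, since the complement component is scaled by $\alpha_{s}$ at each layer while the in-subspace component is only rotated.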

Theorem 4 outlines how errors in state computations at any given time step are influenced by input perturbations represented by $\mathbf{z}$. Although these perturbations live in the continuous domain $\mathbb{R}^{d}$, this setting fits real-world adversarial attacks on LLMs, which modify discrete elements, such as tokens or characters, in the input text. Modifying a word or substring through an adversarial attack leads to a discrepancy between the embedding sequences of the original and modified input tokens, which manifests as the perturbation vector $\mathbf{z}$ within the input embedding space. Specifically, the embedding manifolds, derived from unperturbed training data, capture the structure of this data in a lower-dimensional subspace. Adversarial examples, meanwhile, are designed to be semantically similar to the original input yet induce a marked divergence in the embedding space during the model’s forward propagation. Under these circumstances, the difference between the embedding sequences of the original input and the adversarial example can be quantified and adjusted within the PID control framework. We provide an empirical error computation for Theorem 4 in Section 3.4.

2.4 Additional Details for Constructing PID Control

Here, we provide details on implementing the proposed PID control method. We begin by constructing the linear embedding bases $\mathbf{V}_{t}^{P}$, $\mathbf{V}_{t}^{I}$, and $\mathbf{V}_{t}^{D}$ from the training dataset. In NLP tasks, the hidden states are generally represented as 2-dimensional matrices (sequences of embedding vectors), $\mathbf{X}_{t}\in\mathbb{R}^{l\times d}$, where $l$ denotes the temporal length. Given $N$ pieces of data sampled from the data distribution $\mathcal{D}$ ($N$ training data), we can concatenate the hidden states as a 3-way tensor $\mathcal{X}_{t}\in\mathbb{R}^{N\times l\times d}$, and apply Tucker decomposition (De Lathauwer et al., 2000b; a; Kolda & Bader, 2009) (also known as high-order singular value decomposition) to generate linear embedding bases along both the temporal and state embedding dimensions.

Algorithm 1 Tucker Decomposition.
Input: An $I$-way tensor $\mathcal{X}$.
Output: Core tensor $\mathcal{G}$, orthogonal bases $\mathbf{V}^{1}$, $\mathbf{V}^{2}$, $\cdots$, $\mathbf{V}^{I}$.
for $i=1$ to $I$ do
     $\mathbf{X}^{i}$ = Reshape($\mathcal{X}$, $i$), // Reshape the tensor along the $i^{\rm th}$ mode.
     $\mathbf{U}^{i}$, $\mathbf{S}^{i}$, $\mathbf{V}^{i}$ = SVD($\mathbf{X}^{i}$), // Perform singular value decomposition on the reshaped tensor.
     Save the singular vectors $\mathbf{V}^{i}$ as the orthogonal basis.
end for
$\mathcal{G}=\mathcal{X}$, // Initialize the tensor core with the $I$-way tensor $\mathcal{X}$.
for $i=1$ to $I$ do
     $\mathcal{G}$ = $\mathcal{G}\times_{i}\mathbf{V}^{i}$, // Multiply the core tensor by the $i^{\rm th}$ orthogonal basis.
end for

Tucker decomposition is an extension of the traditional singular value decomposition to higher-order tensors. Mathematically, Tucker decomposition represents an $I$-way tensor as $\mathcal{X}\approx\mathcal{G}\times_{1}\mathbf{V}^{(1)}\times_{2}\mathbf{V}^{(2)}\times_{3}\cdots\times_{I}\mathbf{V}^{(I)}$, where $\mathcal{G}$ is the core tensor, which governs the interaction between different modes, $\mathbf{V}^{i}$ are orthogonal bases corresponding to the principal components in each tensor mode, and $\times_{n}$ is the mode-$n$ tensor product. The mode-$n$ product of a tensor $\mathcal{A}\in\mathbb{R}^{I_{1}\times\dots\times I_{n}\times\dots\times I_{d}}$ with a matrix $\mathbf{U}\in\mathbb{R}^{J\times I_{n}}$ is defined as

$$\mathcal{B}=\mathcal{A}\times_{n}\mathbf{U}\Longleftrightarrow b_{i_{1}\dots i_{n-1}ji_{n+1}\dots i_{d}}=\sum_{i_{n}=1}^{I_{n}}a_{i_{1}\dots i_{n}\dots i_{d}}\cdot u_{ji_{n}},$$

where the $(i_{1},i_{2},\cdots,i_{d})$-th elements of $\mathcal{A}$ and $\mathcal{B}$ are denoted as $a_{i_{1}i_{2}\cdots i_{d}}$ and $b_{i_{1}i_{2}\cdots i_{d}}$, respectively.
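The mode-$n$ product defined above maps directly onto a tensor contraction. A minimal NumPy sketch follows; the function name `mode_n_product` is an illustrative choice, and the mode index is 0-based here, unlike the 1-based index in the text.

```python
import numpy as np

def mode_n_product(A: np.ndarray, U: np.ndarray, n: int) -> np.ndarray:
    """Mode-n product B = A x_n U for a 0-based mode index n.

    A has shape (I_1, ..., I_d) and U has shape (J, I_n); the result has the
    same shape as A except that axis n has length J.
    """
    if U.shape[1] != A.shape[n]:
        raise ValueError("U must have shape (J, I_n)")
    # Contract axis n of A with axis 1 of U; tensordot appends the new axis last.
    B = np.tensordot(A, U, axes=(n, 1))
    # Move the new axis (length J) back to position n.
    return np.moveaxis(B, -1, n)


if __name__ == "__main__":
    A = np.arange(24, dtype=float).reshape(2, 3, 4)
    U = np.ones((5, 3))
    print(mode_n_product(A, U, 1).shape)
```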

An implementation of Tucker decomposition is detailed in Algorithm 1. Along each of the $I$ modes, the concatenated high-dimensional states $\mathcal{X}$ are reshaped along the $i^{\rm th}$ dimension, which is used to compute the orthogonal basis from singular value decomposition. The core tensor $\mathcal{G}$ is computed by multiplying the states $\mathcal{X}$ with each of the $I$ bases along each mode. The low-rank reconstruction of the concatenated states $\mathcal{X}$ can be obtained by $\mathcal{G}\times_{1}\mathbf{V}^{1}\times_{2}\mathbf{V}^{2}\times_{3}\cdots\times_{I}\mathbf{V}^{I}$.
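The procedure of Algorithm 1 can be sketched in NumPy as below. This is an illustrative version under one particular convention, which may differ from the authors' code by a transpose: each unfolding places the chosen mode on the rows, the basis is taken from the left singular vectors, and the core is obtained by projecting with the transposed bases. The helper names `unfold`, `mode_product`, `hosvd`, and `reconstruct` are assumptions.

```python
import numpy as np

def unfold(X, mode):
    """Mode unfolding: rows index the chosen mode, columns all other modes."""
    return np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)

def mode_product(X, M, mode):
    """Multiply tensor X by matrix M (shape J x I_mode) along `mode`."""
    return np.moveaxis(np.tensordot(X, M, axes=(mode, 1)), -1, mode)

def hosvd(X, ranks=None):
    """Truncated higher-order SVD; returns (core, list of basis matrices)."""
    ranks = ranks or X.shape
    bases = []
    for mode, r in enumerate(ranks):
        U, _, _ = np.linalg.svd(unfold(X, mode), full_matrices=False)
        bases.append(U[:, :r])                 # I_mode x r, orthonormal columns
    core = X
    for mode, V in enumerate(bases):
        core = mode_product(core, V.T, mode)   # project onto each basis
    return core, bases

def reconstruct(core, bases):
    """Low-rank reconstruction: core x_1 V^1 x_2 V^2 ... x_I V^I."""
    X_hat = core
    for mode, V in enumerate(bases):
        X_hat = mode_product(X_hat, V, mode)
    return X_hat


if __name__ == "__main__":
    X = np.random.default_rng(0).standard_normal((4, 5, 6))
    core, bases = hosvd(X, ranks=(2, 3, 4))
    print(core.shape, np.linalg.norm(X - reconstruct(core, bases)))
```

With full ranks the reconstruction is exact up to rounding; truncating the ranks gives the low-rank approximation used to define the embedding subspace.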

Given a pre-trained LLM (naively trained or robustly trained), we collect the concatenated states from training data, which results in a set of 3-way tensors $\{\mathcal{X}_{t}\}_{t=0}^{T-1}$. Tucker decomposition is then applied to every $\mathcal{X}_{t}$ (see Algorithm 1). Extending this to integral and derivative controls is straightforward, as one can substitute the concatenated states $\mathcal{X}$ by either the summation of past states, $\mathcal{X}=\sum_{s=0}^{t}\mathcal{X}_{s}$, or the difference of two consecutive states, $\mathcal{X}=\mathcal{X}_{t}-\mathcal{X}_{t-1}$. Using the linear embedding bases $\mathbf{V}_{t}^{P}$, $\mathbf{V}_{t}^{I}$, and $\mathbf{V}_{t}^{D}$ obtained from Tucker decomposition, the feedback controller is constructed following Proposition 3.
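As a rough illustration of the three inputs to Tucker decomposition, the sketch below derives the proportional, integral, and derivative tensors from stacked hidden states. The array layout `(T, N, l, d)` and the function name `pid_state_tensors` are assumptions; the derivative term is only formed for $t\geq 1$, since the text does not specify a convention for $t=0$.

```python
import numpy as np

def pid_state_tensors(states: np.ndarray):
    """states has shape (T, N, l, d): hidden states of N samples at T layers.

    Returns the tensors that would be decomposed for each control term:
      proportional: the states themselves, X_t
      integral:     running sums over layers, sum_{s<=t} X_s
      derivative:   differences of consecutive layers, X_t - X_{t-1} (t >= 1)
    """
    proportional = states
    integral = np.cumsum(states, axis=0)
    derivative = np.diff(states, axis=0)   # entry t-1 holds X_t - X_{t-1}
    return proportional, integral, derivative


if __name__ == "__main__":
    states = np.random.default_rng(0).standard_normal((3, 2, 4, 5))
    p, i, d = pid_state_tensors(states)
    print(p.shape, i.shape, d.shape)
```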

3 Numerical Experiments

In this section, we first describe the experimental setup in Section 3.1. In Sections 3.2 and 3.3, we assess the performance of the proposed PID control framework against various baseline methods across multiple NLP tasks. Subsequently, in Section 3.4, an ablation study is conducted, providing exploratory justification for the proposed approach.

3.1 Experimental Setup

Evaluation methods:

To evaluate the robustness of the proposed PID control and the baselines, we consider both adversarial attack algorithms (e.g., A2T, PSO, TextBugger, TextFooler), applied to the SNLI (Bowman et al., 2015) and MNLI (Williams et al., 2018) datasets, and adversarial datasets (e.g., ANLI).

  • A2T (Attacking to Training (Yoo & Qi, 2021)) utilizes a cost-effective gradient-based technique to rank the significance of words. This approach encompasses the iterative replacement of each word with synonyms sourced from counter-fitted word embeddings.

  • PSO (Zang et al., 2020) exploits a population of interacting individuals to iteratively search for the optimal solution in the specific space.

  • TextBugger (Li et al., 2019) finds important words by computing the Jacobian matrix of the model and then chooses an optimal perturbation from the generated perturbations.

  • TextFooler (Jin et al., 2020) is the state-of-the-art word-level adversarial attack method to generate adversarial examples. This technique identifies the important words for the target model and subsequently prioritizes their replacement with the most semantically similar and grammatically correct words. This process continues until there is a discernible shift in the model’s prediction.

  • Adversarial NLI (ANLI) (Nie et al., 2020) is a large-scale NLI benchmark. This dataset was curated through an iterative process that incorporates both human and model inputs in an adversarial loop, targeting specific models for attack. The ANLI dataset is particularly potent as an adversarial tool, demonstrating a significant capability to diminish the accuracy of pre-trained models.

Baseline methods:

This study examines two baseline methods focused on adversarial training. The Naive adversarial training (AT), as proposed by Yoo & Qi (2021), employs the A2T attack for its adversarial training process. FreeLB, introduced by Zhu et al. (2019), implements adversarial training in language models during the fine-tuning stage, aiming to enhance both generalization and robustness. It is noteworthy that the PID control method, in contrast to these adversarial training-based approaches, offers a distinct perspective on enhancing model robustness without knowing the attack type in advance. It can be applied to models that have undergone adversarial training to further improve their robustness.

We fine-tune four baseline models using LoRA (Hu et al., 2021), namely DistilBERT (Sanh et al., 2019), BERT-large (Kenton & Toutanova, 2019), RoBERTaBase, and RoBERTaLarge (Liu et al., 2019).

PID control implementation details:

Using a pre-trained model (e.g., BERT), we select training data that this model can accurately predict. Next, we simulate forward propagation using the pre-trained model on this specific set of training data, which generates a collection of 3-dimensional tensors, denoted as $\{\mathcal{X}_{t}\}_{t=0}^{T-1}$. Following this, we employ Algorithm 1 on each tensor to determine the basis for a linear embedding subspace (see Section 2.4). The dimension of this subspace is chosen based on the criterion that it must account for 99% of the total variance observed (this is done by accumulating the singular values). Finally, the optimal solution outlined in Proposition 3 is implemented to generate a time-dependent control regularization parameter.
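The 99% criterion can be sketched as a cumulative-sum threshold over the singular values. The text does not say whether the singular values or their squares are accumulated; the sketch below assumes squared singular values as the measure of variance, and the function name `rank_for_variance` is illustrative.

```python
import numpy as np

def rank_for_variance(singular_values, threshold=0.99):
    """Smallest k such that the k leading singular values reach `threshold`.

    Assumption: variance is measured by squared singular values, sorted in
    descending order as returned by np.linalg.svd.
    """
    s = np.asarray(singular_values, dtype=float)
    energy = np.cumsum(s ** 2) / np.sum(s ** 2)
    k = int(np.searchsorted(energy, threshold)) + 1
    return min(k, len(s))   # guard against rounding when threshold is 1.0


if __name__ == "__main__":
    X = np.random.default_rng(0).standard_normal((200, 32))
    s = np.linalg.svd(X, compute_uv=False)
    print(rank_for_variance(s))
```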

Threat model:

In this work, we consider word/token level adversarial attacks that manipulate discrete tokens or characters in the input text. We verify this from both empirical and theoretical perspectives. In the numerical experiments, we consider a range of adversarial attacks, including A2T, PSO, TextBugger, and TextFooler. These adversarial attacks aim to cause misclassification by modifying the tokens of the input string while maintaining the same semantic meaning.

3.2 Robustness Against Adversarial Examples

[Figure 2 panels: (a) Distilbert (SNLI), (b) RoBERTa (SNLI), (c) Distilbert (MNLI), (d) RoBERTa (MNLI)]

Figure 2: (a) and (b) are radar plots that summarize Distilbert and RoBERTaLarge in Table 8 for SNLI dataset, respectively. (c) and (d) are radar plots that summarize Distilbert and RoBERTaLarge in Table 9 for MNLI dataset, respectively.

Here, we empirically validate the robustness improvement of employing the proposed PID control on pre-trained LLMs. Figure 2 (a) presents a radar plot comparing the baseline and controlled models, using the DistilBERT architecture evaluated on the SNLI dataset. It shows that PID control significantly improves model robustness against all four types of perturbations, with a negligible impact on performance on unaltered data (denoted as None). Shifting to a different model architecture, Figure 2 (b) shows that applying the proposed PID control approach to the RoBERTa model yields analogous enhancements in robustness. For more challenging scenarios, Figures 2 (c) and (d) detail the performance on the MNLI dataset for both the DistilBERT and RoBERTa architectures. The increased complexity of the MNLI dataset poses additional challenges in creating embedding subspaces, making it more difficult to accurately represent the state, state integration, and state derivatives with linear embeddings. Despite this, the PID control method consistently maintains robustness improvements. The plots show that the controlled models exhibit increased resistance to a broader spectrum of linguistic perturbations and complexities, without significant trade-offs in overall accuracy. This underlines the efficacy of PID control in enhancing model robustness across different architectures and datasets.

More detailed comparisons of performance between baseline and controlled models on the SNLI and MNLI datasets are provided in Tables 8 and 9 in Appendix 10. When a method’s accuracy surpasses others by more than 1%, it is highlighted in red. It is evident that the proposed PID control method significantly enhances the robustness of both standard and robustly trained LLMs. The enhancement is more pronounced in standard trained models, which are generally more vulnerable to adversarial attacks. On average, the PID control method yields an improvement of nearly 10% in standard models and about 5% in robustly trained models, including both AT and FreeLB training.

In addition, we present the numerical results of OPT-1.3B, a decoder-based large language model that contains 1.3 billion parameters. For the proposed PID control, we follow the same P-D (proportional-derivative) control implementation used in all numerical experiments in the paper. Table 1 demonstrates that the controlled OPT-1.3B model consistently improves robustness against all four types of adversarial attacks. Specifically, on the SNLI dataset, the average improvement is over 20% compared with the base model. On the more challenging MNLI dataset, with only a 2.5% accuracy drop on the unperturbed testing dataset, the improvement reaches 21% against the TextBugger attack, and 11% on both A2T and PSO attacks.

Table 1: Measurement on SNLI and MNLI datasets: baseline model / controlled model

| Dataset | Model | None | A2T | PSO | TextBugger | TextFooler |
|---|---|---|---|---|---|---|
| SNLI | OPT | 91.24 / 88.69 | 49.15 / 63.28 | 48.00 / 60.06 | 17.57 / 41.79 | 16.64 / 44.70 |
| MNLI | OPT | 86.89 / 84.27 | 54.47 / 65.87 | 45.14 / 59.08 | 24.12 / 45.97 | 21.68 / 49.13 |

3.3 Robustness Against Adversarial Dataset

In this study, we assess the effectiveness of the PID control approach in an adversarial Natural Language Inference (NLI) task. The ANLI dataset is created through an iterative process involving both humans and models, aimed at improving natural language understanding. Initially, human annotators create examples that challenge the current best-performing models. These difficult examples, intended to reveal more weaknesses, are then incorporated into the training set to enhance the models. This cycle of identifying and addressing weaknesses is repeated across several rounds, each producing an increasingly complex adversarial dataset (ANLI consists of three rounds of development and test datasets). Unlike the evaluation using adversarial examples described in Section 3.2, the ANLI dataset is pre-constructed by human annotators. In contrast, adversarial examples from Section 3.2 are created in relation to the specific characteristics of the underlying classifier.

The evaluation with the ANLI dataset encompasses both baseline and controlled models, utilizing the development and test datasets. The results obtained from the ANLI dataset are outlined in Table 2. ANLI involves three progressively challenging rounds. The baseline model shows a decline in performance with increasing difficulty from round 1 to round 3. Conversely, the PID control demonstrates a more pronounced improvement in performance as the challenge increases. Specifically, the proposed control method yields a mean performance improvement of 1.0783%, with a 95% confidence interval of 0.0564% to 2.1004%.

Table 2: Measurement on ANLI dataset: baseline model / controlled model

| Model | r1 | r2 | r3 |
|---|---|---|---|
| RoBERTaLarge (Dev) | 72.60 / 72.65 | 50.99 / 52.33 | 40.99 / 43.31 |
| RoBERTaLarge (Test) | 72.79 / 72.60 | 48.19 / 49.39 | 40.66 / 42.41 |
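The mean improvement and confidence interval quoted for ANLI can be reproduced approximately from the six entries of Table 2. The sketch below assumes the interval is a two-sided Student-t interval over the six controlled-minus-baseline differences; the text does not state how the interval was computed, so this is an illustrative check rather than the authors' procedure.

```python
import math
import statistics

# (baseline, controlled) accuracies copied from Table 2: Dev r1-r3, then Test r1-r3.
pairs = [(72.60, 72.65), (50.99, 52.33), (40.99, 43.31),
         (72.79, 72.60), (48.19, 49.39), (40.66, 42.41)]
gains = [controlled - baseline for baseline, controlled in pairs]

mean_gain = statistics.mean(gains)
# Standard error from the sample standard deviation (n - 1 in the denominator).
std_err = statistics.stdev(gains) / math.sqrt(len(gains))

T_CRIT = 2.5706   # two-sided 95% Student-t critical value for 5 degrees of freedom
ci_low = mean_gain - T_CRIT * std_err
ci_high = mean_gain + T_CRIT * std_err
print(f"mean = {mean_gain:.4f}, 95% CI = [{ci_low:.4f}, {ci_high:.4f}]")
```

Under this assumption the result agrees with the reported mean and interval to within rounding.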

3.4 Ablation Study

This section provides exploratory justifications for the proposed PID control framework.

Justification of the selection of a P-D control scheme.

We begin with a comparative analysis of various control schemes, emphasizing the benefits of implementing multiple controllers over the single use of Proportional (P) control as previously explored in (Chen et al., 2020). Table 3 compares the robustness performance of Proportional (P), Proportional-Integral (P-I), Proportional-Derivative (P-D), and Proportional-Integral-Derivative (P-I-D) control schemes across different model architectures and training methodologies. The P-D control scheme surpasses the others in most scenarios, underscoring the efficacy of the proposed PID control framework, which expands upon the limited capability of earlier P control (closed-loop control) methods. The mean improvement from employing Proportional-Derivative (P-D) control over Proportional (P) control is 2.35%, with a 95% confidence interval of 1.677% to 3.0121%. This validates the choice of P-D control.

The reason why P-D outperforms P-I-D is mainly due to noise sensitivity and hyperparameter tuning.

Noise sensitivity: The integral term has the potential to aggregate errors across multiple hidden layers, incorporating noise inherent in the embedding manifolds, as well as the distributional shift between the training and testing datasets. When substantial noise is present in each hidden layer, the integral component, which depends on the embedding manifold of accumulated past states, may lead to instability during model inference. Conversely, a Proportional-Derivative (P-D) controller, lacking the integral component, tends to perform better under such noisy conditions because it does not accumulate this noise.

Hyperparameter tuning: In traditional PID control design, selecting the appropriate control gains, denoted as $K_{p}$, $K_{i}$, and $K_{d}$ for proportional, integral, and derivative controls respectively, presents a notable challenge. These gains are crucial for achieving a balance among the different types of controls. Typically, the calibration of these gains is empirical, with the aim of optimizing the performance of PID control. Our method follows a similar strategy, determining the gains through experimentation with training data. Given that our hyperparameter search space only contains 0 and 0.5 for each control gain, this results in the values $K_{p}=0.5$, $K_{d}=0.5$, and $K_{i}=0$. A more principled method would entail adjusting these hyperparameters through numerical optimization, treating the control gains as adjustable variables. The development of a more sophisticated strategy for tuning the control gains is left for future work.

Table 3: Measurement on SNLI dataset: P / P-I / P-D / P-I-D

Distilbert

| Attack | Standard training | Adversarial training | FreeLB |
| --- | --- | --- | --- |
| A2T | 60.11 / 58.24 / 62.31 / 61.30 | 72.09 / 71.12 / 71.81 / 71.97 | 59.63 / 57.90 / 62.95 / 61.88 |
| PSO | 53.39 / 52.67 / 54.96 / 53.96 | 55.80 / 54.72 / 57.87 / 56.35 | 54.40 / 53.91 / 56.86 / 55.89 |
| TextBugger | 37.15 / 37.15 / 40.26 / 39.54 | 38.91 / 38.98 / 41.64 / 40.98 | 32.79 / 32.24 / 37.80 / 36.21 |
| TextFooler | 36.81 / 34.84 / 41.73 / 38.98 | 40.15 / 38.21 / 43.81 / 41.12 | 32.49 / 31.22 / 39.64 / 37.39 |

RoBERTaBase

| Attack | Standard training | Adversarial training | FreeLB |
| --- | --- | --- | --- |
| A2T | 61.78 / 60.81 / 64.11 / 61.94 | 76.63 / 75.82 / 77.08 / 76.28 | 65.93 / 65.04 / 68.85 / 66.59 |
| PSO | 53.34 / 52.94 / 54.40 / 53.35 | 55.49 / 54.76 / 56.45 / 54.99 | 53.64 / 52.99 / 55.24 / 53.55 |
| TextBugger | 40.46 / 39.00 / 43.20 / 40.53 | 41.71 / 40.22 / 43.35 / 41.33 | 39.72 / 38.24 / 42.75 / 40.17 |
| TextFooler | 33.19 / 32.10 / 37.35 / 33.83 | 34.48 / 32.47 / 39.39 / 35.29 | 30.75 / 29.77 / 36.81 / 32.21 |

BERT-large

| Attack | Standard training | Adversarial training | FreeLB |
| --- | --- | --- | --- |
| A2T | 75.75 / 75.68 / 75.54 / 75.60 | 86.13 / 86.03 / 85.76 / 85.92 | 78.16 / 78.16 / 78.21 / 78.04 |
| PSO | 67.72 / 67.69 / 67.55 / 67.60 | 70.21 / 70.26 / 70.38 / 70.25 | 65.49 / 65.46 / 65.56 / 65.46 |
| TextBugger | 64.59 / 64.53 / 64.41 / 64.36 | 69.74 / 69.89 / 69.55 / 69.62 | 59.28 / 59.35 / 59.29 / 59.27 |
| TextFooler | 58.48 / 58.25 / 58.27 / 58.12 | 65.43 / 65.25 / 65.27 / 65.10 | 55.34 / 55.27 / 55.26 / 55.14 |

RoBERTaLarge

| Attack | Standard training | Adversarial training | FreeLB |
| --- | --- | --- | --- |
| A2T | 65.10 / 64.89 / 64.95 / 64.38 | 81.91 / 81.72 / 81.62 / 81.80 | 70.38 / 70.40 / 71.30 / 70.51 |
| PSO | 55.83 / 55.04 / 56.70 / 55.31 | 57.99 / 57.29 / 59.71 / 58.18 | 56.23 / 55.60 / 57.20 / 56.18 |
| TextBugger | 44.61 / 42.20 / 42.43 / 41.29 | 45.00 / 43.54 / 44.74 / 43.53 | 44.52 / 43.11 / 44.42 / 43.21 |
| TextFooler | 36.63 / 35.52 / 37.29 / 35.39 | 39.64 / 37.06 / 42.44 / 39.87 | 37.56 / 35.97 / 38.59 / 36.71 |

Discussion on the linearity and orthogonality assumptions.

Here we discuss the negative impact of violating the assumptions made to derive the analytic solution. Through empirical evaluations, we show that violating the main assumptions has increasingly adverse effects as the embedding manifolds become less able to capture the complex, high-dimensional states. More specifically, applying regularization to the control solutions can mitigate these inaccuracies. However, as the precision of the embedding manifolds decreases, a greater degree of regularization is required, which complicates the optimal control problems. This added complexity makes the negative impact of violating the main assumptions more significant.

Our validation approach involves a performance comparison between the proposed analytic solution and the implementation of Pontryagin’s Maximum Principle (PMP), an iterative solver that operates without the need for additional assumptions. Pontryagin’s Maximum Principle provides the necessary conditions for an optimal control solution, typically offering a robust approximation of such solutions. We extend this comparison by creating linear embedding subspaces with varying thresholds for accumulated variance, specifically aiming to capture 99%, 95%, 90%, and 85% of the variance in the underlying states. As the variance threshold is lowered, the accuracy of these embedding subspaces decreases, which makes the optimal control problems harder to solve. The performance comparison, detailed in Table 4, includes three LLMs across five evaluation tasks. These tasks include a standard scenario with no perturbation and four adversarial attacks: A2T, PSO, TextBugger, and TextFooler. The results reveal that while the performance difference between the analytic solution and Pontryagin’s Maximum Principle is negligible at higher accuracy levels (e.g., 99% variance), the scenario changes significantly at lower accuracies (e.g., 90% and 85% variance). In these instances, employing Pontryagin’s Maximum Principle, which operates independently of simplifying assumptions, yields noticeably better control solutions.
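One way to build a linear embedding subspace for a given variance threshold is to keep the leading principal directions of the collected hidden states until their cumulative variance reaches the threshold. The sketch below assumes this PCA-style construction with mean-centering; the function name `subspace_basis` and the synthetic data are illustrative assumptions, not the released implementation.

```python
import numpy as np


def subspace_basis(states, variance_threshold):
    """Return an orthonormal basis capturing `variance_threshold` of the variance.

    states: array of shape (n_samples, d), one hidden state per row.
    """
    centered = states - states.mean(axis=0, keepdims=True)
    # Rows of vt are principal directions; singular values come in descending order.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    explained = np.cumsum(singular_values**2) / np.sum(singular_values**2)
    # Smallest number of directions whose cumulative variance reaches the threshold.
    rank = int(np.searchsorted(explained, variance_threshold) + 1)
    rank = min(rank, len(singular_values))
    return vt[:rank].T  # shape (d, rank)


rng = np.random.default_rng(0)
# Synthetic states with decaying variance per coordinate (illustrative only).
states = rng.normal(size=(500, 32)) * np.linspace(3.0, 0.1, 32)
ranks = {t: subspace_basis(states, t).shape[1] for t in (0.99, 0.95, 0.90, 0.85)}
print(ranks)
```

Lowering the threshold can only shrink the subspace (or leave its rank unchanged), so more of the state falls outside it and is treated as error by the controller.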

Table 4: Performance Comparison (Analytic Solution / PMP). Columns 99% to 85% give the variance threshold of the embedding subspaces.

Distilbert

| Attack | Base | 99% | 95% | 90% | 85% |
| --- | --- | --- | --- | --- | --- |
| None | 87.23 | 85.88 / 86.52 | 68.92 / 80.21 | 34.24 / 54.06 | 34.28 / 46.75 |
| A2T | 53.89 | 61.75 / 60.87 | 57.93 / 62.88 | 34.10 / 44.17 | 34.28 / 42.01 |
| PSO | 49.84 | 54.33 / 52.80 | 56.21 / 58.22 | 34.13 / 46.27 | 34.28 / 42.55 |
| TextBugger | 24.73 | 40.35 / 36.69 | 43.89 / 42.50 | 36.24 / 39.02 | 34.28 / 42.20 |
| TextFooler | 24.69 | 40.28 / 36.13 | 49.05 / 46.69 | 34.37 / 39.56 | 34.28 / 40.80 |

RoBERTaBase

| Attack | Base | 99% | 95% | 90% | 85% |
| --- | --- | --- | --- | --- | --- |
| None | 90.87 | 90.10 / 90.59 | 85.09 / 89.63 | 64.60 / 85.82 | 40.02 / 76.71 |
| A2T | 58.36 | 63.82 / 62.19 | 65.12 / 66.44 | 51.19 / 66.86 | 37.56 / 60.41 |
| PSO | 51.44 | 54.36 / 52.98 | 59.97 / 56.75 | 52.55 / 59.18 | 37.91 / 59.07 |
| TextBugger | 35.90 | 43.03 / 40.53 | 46.74 / 46.41 | 38.72 / 47.36 | 34.84 / 42.43 |
| TextFooler | 27.03 | 37.18 / 33.47 | 47.16 / 42.79 | 41.40 / 46.60 | 35.76 / 45.49 |

RoBERTaLarge

| Attack | Base | 99% | 95% | 90% | 85% |
| --- | --- | --- | --- | --- | --- |
| None | 92.39 | 91.98 / 92.11 | 86.50 / 91.68 | 66.40 / 90.53 | 44.54 / 86.52 |
| A2T | 59.40 | 64.64 / 63.15 | 67.17 / 65.03 | 54.67 / 64.99 | 41.54 / 61.94 |
| PSO | 52.15 | 56.62 / 54.35 | 62.14 / 55.51 | 55.13 / 58.41 | 41.15 / 58.85 |
| TextBugger | 33.72 | 42.39 / 39.18 | 47.48 / 41.59 | 41.30 / 43.77 | 35.49 / 40.32 |
| TextFooler | 26.43 | 36.92 / 32.27 | 48.79 / 37.14 | 47.09 / 41.43 | 40.47 / 41.97 |

Computational wall time comparison.

Here we discuss the computational overhead of the proposed PID control method. Specifically, we compare the computational wall time of the base model without any controls applied, the proposed analytic solution, and Pontryagin’s Maximum Principle (PMP) employed in the previous closed-loop control approach. As shown in Table 5, across all four models the analytic solution (the controlled model) adds only a small amount of wall time during inference, less than 1.5% over the base model. In contrast, solving the PMP increases the wall time to roughly 6 to 10 times that of the base model.

Table 5: Computational Wall Time

Wall time (s) of 10,000 test samples (averaged over 5 experiments)

| Method | Distilbert | RoBERTaBase | RoBERTaLarge | OPT |
| --- | --- | --- | --- | --- |
| Base model | 6.3751 | 11.7756 | 36.0178 | 123.5379 |
| Controlled model | 6.4221 | 11.8620 | 36.5051 | 124.1090 |
| PMP | 62.2667 | 81.2795 | 263.7920 | 757.0649 |
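The relative overhead implied by Table 5 can be computed directly from the reported wall times. The short script below only restates the table values; the helper name `relative_overhead` is an illustrative choice.

```python
# Wall times in seconds, copied from Table 5.
wall_time = {
    "Distilbert":   {"base": 6.3751,   "controlled": 6.4221,   "pmp": 62.2667},
    "RoBERTaBase":  {"base": 11.7756,  "controlled": 11.8620,  "pmp": 81.2795},
    "RoBERTaLarge": {"base": 36.0178,  "controlled": 36.5051,  "pmp": 263.7920},
    "OPT":          {"base": 123.5379, "controlled": 124.1090, "pmp": 757.0649},
}


def relative_overhead(base, other):
    """Extra wall time of `other` relative to `base`, as a fraction of `base`."""
    return (other - base) / base


overheads = {
    model: {
        "controlled": relative_overhead(t["base"], t["controlled"]),
        "pmp": relative_overhead(t["base"], t["pmp"]),
    }
    for model, t in wall_time.items()
}

for model, o in overheads.items():
    print(f"{model}: controlled +{o['controlled']:.2%}, PMP +{o['pmp']:.0%}")
```

For every model the controlled model stays within about 1.4% of the base wall time, while the PMP solver adds more than five times the base wall time on top of it.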

Error computation of the main theorem.

We provide the details of the error computation outlined in Theorem 4. Our objective is to demonstrate that the accuracy of the error computation specified in Theorem 4 diminishes with the addition of more layers to the language model. This decrease in accuracy is due to the assumptions of linearity and orthogonality. According to these assumptions, the transformations applied to each layer of a language model merely rotate the hidden state without altering its magnitude. However, as the model incorporates more of these layer-wise transformations, the accuracy of these assumptions starts to decrease.
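The role of the orthogonality assumption can be illustrated numerically. In the sketch below, which uses random matrices rather than actual language model layers, a stack of exactly orthogonal transformations leaves the norm of the hidden state unchanged, whereas a small deviation from orthogonality in each layer (a hypothetical 5% Gaussian perturbation) changes the norm, and the change can compound with depth.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_layers = 16, 24


def random_orthogonal(dim, rng):
    """Orthogonal matrix from the QR decomposition of a Gaussian matrix."""
    q, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
    return q


x = rng.normal(size=d)
norm_in = np.linalg.norm(x)

# Exactly orthogonal layers: the norm is preserved by every layer.
h = x.copy()
for _ in range(n_layers):
    h = random_orthogonal(d, rng) @ h
norm_orthogonal = np.linalg.norm(h)

# Slightly non-orthogonal layers (hypothetical perturbation size of 0.05).
g = x.copy()
for _ in range(n_layers):
    layer = random_orthogonal(d, rng) + 0.05 * rng.normal(size=(d, d))
    g = layer @ g
norm_perturbed = np.linalg.norm(g)

print(norm_in, norm_orthogonal, norm_perturbed)
```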

Table 6 presents the calculation of the absolute difference between the actual error and the error estimate as per Theorem 4. It is evident that with all types of adversarial perturbations (A2T, PSO, TextBugger, and TextFooler), the increase in the number of layers within the language model (with 6 layers representing Distilbert, 12 layers symbolizing RoBERTaBase, and 24 layers signifying RoBERTaLarge) leads to a rise in the absolute error. This indicates a decline in the precision of the error estimation.

Table 6: Error Comparison (difference between Theorem 4 and true error)

| Attack | 6 Layers | 12 Layers | 24 Layers |
| --- | --- | --- | --- |
| A2T | 3.2189 | 4.3062 | 5.1566 |
| PSO | 1.9156 | 2.6087 | 3.3047 |
| TextBugger | 3.2189 | 4.3062 | 5.1566 |
| TextFooler | 3.1348 | 4.2894 | 5.2915 |

Justification on the effectiveness of the PID control framework.

Here we provide evidence that the PID control framework can improve model robustness against adversarial examples. As detailed in Theorem 4, the working principle of the PID control framework is based on two facts:

  • There exists an embedding structure in the state at every layer. Table 7 verifies the existence of lower-dimensional embedding subspaces. For the OPT-1.3B LLM (only the first 6 layers are shown), the dimensions of the proportional, integral, and derivative embedding subspaces are presented. As can be seen, the dimension of the proportional embedding is around 350 on average in a 2048-dimensional space. The integral and derivative embedding subspaces also have low dimensions compared with the full space.

  • The sequence of states from adversarially perturbed input deviates from the true embedding structures. Table 7 presents the error (measured in 2-norm) detected by the combination of P, I, and D embedding subspaces. As can be seen, the error introduced by the perturbation tends to be amplified as it propagates into deeper layers, and the embedding subspaces can effectively detect these errors at all layers.
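The error detected by a linear embedding subspace can be sketched as the 2-norm of the component of a state that lies outside the subspace. The example below is a simplified illustration on synthetic data with a single subspace; the function name `embedding_error` is an assumption, and the combination of the P, I, and D subspaces used for Table 7 is not reproduced here.

```python
import numpy as np


def embedding_error(state, basis):
    """2-norm of the component of `state` orthogonal to span(basis).

    basis: (d, r) matrix with orthonormal columns.
    """
    residual = state - basis @ (basis.T @ state)
    return float(np.linalg.norm(residual))


rng = np.random.default_rng(0)
d, r = 64, 8
basis, _ = np.linalg.qr(rng.normal(size=(d, r)))  # orthonormal columns, shape (d, r)

clean = basis @ rng.normal(size=r)            # lies exactly in the subspace
perturbed = clean + 0.5 * rng.normal(size=d)  # hypothetical input perturbation

clean_error = embedding_error(clean, basis)
perturbed_error = embedding_error(perturbed, basis)
print(clean_error, perturbed_error)
```

A state on the subspace yields an error that is zero up to floating-point precision, while a perturbed state yields a strictly positive error, which is the signal the controller acts on.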

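Table 7: Ranks and Embedding Errors

Ranks: OPT-1.3B (first 6 layers)

| Subspace | Layer 1 | Layer 2 | Layer 3 | Layer 4 | Layer 5 | Layer 6 |
| --- | --- | --- | --- | --- | --- | --- |
| Proportional | 191 / 2048 | 148 / 2048 | 224 / 2048 | 337 / 2048 | 451 / 2048 | 549 / 2048 |
| Integral | 191 / 2048 | 181 / 2048 | 180 / 2048 | 212 / 2048 | 253 / 2048 | 298 / 2048 |
| Derivative | 191 / 2048 | 355 / 2048 | 1001 / 2048 | 1352 / 2048 | 1494 / 2048 | 1535 / 2048 |

Embedding error: OPT-1.3B (first 6 layers)

| Model | Layer 1 | Layer 2 | Layer 3 | Layer 4 | Layer 5 | Layer 6 |
| --- | --- | --- | --- | --- | --- | --- |
| OPT | 1.8237 | 9.5877 | 6.9810 | 6.3207 | 7.5278 | 16.6526 |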

4 Related Works

We delve into the existing body of literature surrounding robustness issues in NLP tasks (Section 4.1). Moreover, we explore the realm of machine learning from an optimal control perspective, emphasizing its relevance and applicability to the task at hand (Section 4.2).

4.1 Robustness in NLP

In recent years, a variety of approaches have been developed for generating effective adversarial attacks in the context of NLP. Traditionally, text attacks are produced through direct modifications to sentences at the character level, word level, sentence level, or a combination of these Ren et al. (2019); Li et al. (2018a); Xu et al. (2021). Studies such as those by Liu et al. (2020b); Bělohlávek et al. (2018); Zhu et al. (2019) explore methods for generating attacks within the text embedding space. An alternative approach by Wallace et al. (2019) involves prepending the same tokens to all texts for attacks. Moving away from adding perturbations to samples, La Malfa & Kwiatkowska (2022); Wang et al. (2020) integrate generative models for the creation of attacks.

In response to adversarial attacks, most defense strategies primarily rely on the principles of adversarial training Morris et al. (2020); Jiang et al. (2019); Wu et al. (2022). Given that adversarial training is inherently resource-intensive, both in terms of computation and time, studies such as Zhu et al. (2019); Zhang et al. (2019); Shafahi et al. (2019) have proposed methods to expedite the training process. However, despite the efficacy of adversarial training, there’s evidence suggesting that fine-tuning model parameters can diminish performance on clean data He et al. (2021). Additionally, it has been observed that adversarial training often produces suboptimal results when faced with novel, unforeseen perturbations Tramer & Boneh (2019). Adversarial training is analogous to open-loop control, as it leverages a set of fixed controls (e.g. model parameters) for any new, unobserved data. In contrast, the proposed PID control approach dynamically focuses on inputs using feedback controls.

4.2 Deep Learning with Optimal Control

Recent studies have increasingly focused on the connections between dynamical systems and deep learning (E, 2017; Haber & Ruthotto, 2017; Li et al., 2018b). These studies have established a robust theoretical framework, offering a novel perspective for understanding deep learning methodologies through the lens of optimal control theory (Liu & Theodorou, 2019). The works of Li et al. (2018b) and Li & Hao (2018) have bridged the gap between the classical back-propagation algorithm and optimal control theory. This intersection notably highlights how Pontryagin’s Maximum Principle (Kirk, 1970), a cornerstone of control theory, aligns closely with gradient-based training methods in neural networks.

In advancing this line of inquiry, E et al. (2018) developed the theoretical foundations for interpreting deep learning within the optimal control framework. These insights have laid the groundwork for subsequent research that applies optimal control principles to address key challenges in deep learning. A notable example is the work by Liu et al. (2020a), which introduced high-order optimization strategies rooted in differential dynamic programming. This methodology has been instrumental in improving the convergence rate and stability of the training process. Moreover, the closed-loop control framework has been proposed to improve the model’s robustness against adversarial attacks (Chen et al., 2020; 2022) and to address fairness issues (Chen et al., 2023).

5 Discussion and Future Works

The wider perspective of the proposed method on the trustworthy ML:

The presented PID control approach generalizes previous closed-loop control approaches with additional integral and derivative controllers. This development leads to more flexible control schemes: derivative controllers are more effective when the underlying states change rapidly, while integral controllers play a more significant role when lower-dimensional embedding structures can be constructed in accumulated states. Such flexibility in control design broadens the applicability of the control framework across a variety of trustworthy ML applications.

This work paves the way for the development of robust large language models. Presently, many large language models face challenges related to trustworthiness, including biases against minority groups in natural language generation tasks. In principle, by constructing state embedding manifolds that capture desired model behaviors, the PID control framework can be employed to adjust any unwanted behaviors in the model. This idea is similar to prompt engineering techniques used to modify input strings for achieving specific outcomes from models. These avenues will be explored further in future research.

How does this complement the research on adversarial attacks?

The PID control framework leads to a new method to generate adversarial attacks. In the current work, the aim is to improve model robustness by minimizing the objective function defined in Equation 2. On the contrary, maximizing this loss w.r.t. some input perturbation is equivalent to generating adversarial examples. This can be an optimal control-based adversarial attack algorithm.

Can the same method be used for vision problems?

This method is applicable to computer vision problems. Typically, in deep convolutional neural networks, both the input and hidden states lie in extremely high-dimensional spaces, where the embedding manifolds for these states tend to exist. This assumption aligns with the "manifold hypothesis," which is based on the characteristics of real-world image data, and is further supported by empirical evidence as demonstrated in the studies by Chen et al. (2020) and Chen et al. (2022). Once the embedding manifolds for both input and hidden states are constructed, it becomes possible to formulate the optimal control objective function as outlined in Equations 2 and 3. By aiming to minimize this objective function, there is a potential to significantly improve the robustness of the model.

How does this complement the existing trustworthy ML literature?

The current body of research on trustworthy machine learning predominantly emphasizes adversarial training, which leads to two significant challenges. Firstly, the process of modifying model parameters with adversarial examples demands extensive computational resources. In the context of natural language processing tasks, identifying an adversarial example typically entails solving a combinatorial optimization problem, which suffers from an exponential growth in the number of feasible solutions as the size of the problem increases. Secondly, adversarial training’s efficacy diminishes in the face of unexpected adversarial attacks. This shortcoming is especially evident in the real-world application of large language models, where predicting potential adversarial attacks beforehand is unfeasible. The suggested PID control framework is designed to overcome these challenges by offering two key advantages: 1) It does not significantly increase the inference time when compared to the base model, and 2) It leverages the embedding structure of unperturbed states, making it robust against various unforeseen adversarial attacks.

Extension to other trustworthy issues.

The proposed PID framework focuses on improving the robustness of pre-trained models through a set of linear embedding subspaces. These subspaces effectively encapsulate the embedding structure of the underlying states. This framework may be broadened to tackle various trustworthy concerns in machine learning, including issues related to fairness (Chen et al., 2023). To achieve this, it is necessary to develop embedding subspaces that are capable of capturing embedding structures that are invariant against demographic information. Our next objective is to demonstrate how the PID control framework can be adapted to manage and resolve fairness-related challenges.

Analytic solution on the nonlinear dynamics.

In our approach, we determine the optimal control solution using an analytical method as outlined in Proposition 3. This analytical method is based on the assumption that the linearization of each layer transformation in the pre-trained model is orthogonal. Consequently, this leads to a time-varying control regularization across layers that is independent of the pre-trained model. This method has yielded satisfactory empirical outcomes. Nonetheless, there is a need for a more precise analytical solution that accounts for the intrinsic properties of the underlying model. To obtain a more refined analytical solution for the optimal control problem, our next step involves linearizing the non-linear layer transformations in the pre-trained model and subsequently using the Riccati equation to generate the optimal control solution.
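To make the proposed direction concrete, the following is a minimal sketch of the standard backward Riccati recursion for a finite-horizon, discrete-time linear-quadratic problem. It is not the method used in this work. The assumptions are: the linearized layers are given as matrices $A_t$, the control enters through $B_t = A_t$ (mirroring a control added to the state before the layer), the state cost is the projector onto the complement of a single hypothetical embedding subspace shared by all layers, and the control regularization is a fixed multiple of the identity. All names and sizes are illustrative.

```python
import numpy as np


def riccati_gains(A_list, B_list, Q_list, R, Q_final):
    """Backward Riccati recursion for finite-horizon discrete-time LQR.

    Dynamics: x_{t+1} = A_t x_t + B_t u_t
    Cost:     sum_t (x_t^T Q_t x_t + u_t^T R u_t) + x_T^T Q_final x_T
    Returns the gains K_t (with u_t = -K_t x_t) and the initial cost-to-go matrix P_0.
    """
    P = Q_final
    gains = []
    for A, B, Q in zip(reversed(A_list), reversed(B_list), reversed(Q_list)):
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + A.T @ P @ (A - B @ K)
        gains.append(K)
    return gains[::-1], P


def rollout_cost(x0, A_list, B_list, Q_list, R, Q_final, gains=None):
    """Total cost of a trajectory, with feedback control if gains are given."""
    x, cost = x0, 0.0
    for t, (A, B, Q) in enumerate(zip(A_list, B_list, Q_list)):
        u = -gains[t] @ x if gains is not None else np.zeros(B.shape[1])
        cost += x @ Q @ x + u @ R @ u
        x = A @ x + B @ u
    return float(cost + x @ Q_final @ x)


rng = np.random.default_rng(0)
d, r, T = 6, 2, 4
V, _ = np.linalg.qr(rng.normal(size=(d, r)))  # hypothetical embedding basis
Q = np.eye(d) - V @ V.T                       # penalizes the off-subspace component
R = 0.1 * np.eye(d)                           # control regularization (assumed value)
A_list = [np.linalg.qr(rng.normal(size=(d, d)))[0] for _ in range(T)]
B_list = A_list                               # control added before each layer
Q_list = [Q] * T

gains, P0 = riccati_gains(A_list, B_list, Q_list, R, Q)
x0 = rng.normal(size=d)
cost_free = rollout_cost(x0, A_list, B_list, Q_list, R, Q)
cost_lqr = rollout_cost(x0, A_list, B_list, Q_list, R, Q, gains)
print(cost_free, cost_lqr)
```

Unlike the analytic solution under the orthogonality assumption, the gains here depend on the layer matrices themselves, which is what would allow the control to account for the underlying model.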

6 Conclusion

Our study has introduced a novel PID control framework to improve neural network robustness against (unforeseen) input perturbations, outperforming traditional adversarial training methods. This approach maintains computational efficiency, enhances robustness in large language models, and allows for rapid online inference. Our comprehensive error analysis has confirmed the framework’s effectiveness in simulated environments, contributing significantly to neural network security and robustness, and paving the way for more reliable NLP models in critical applications.

7 Acknowledgements

Z. Chen and Z. Zhang are supported by the NSF grant #2107321 under the CCF division. Q. Li is supported by the National Research Foundation, Singapore, under the NRF fellowship (project No. NRF-NRFF13-2021-0005).

References

  • Bellman (1952) Richard Bellman. On the theory of dynamic programming. Proceedings of the National Academy of Sciences of the United States of America, 38(8):716, 1952.
  • Bělohlávek et al. (2018) Petr Bělohlávek, Ondřej Plátek, Zdeněk Žabokrtskỳ, and Milan Straka. Using adversarial examples in natural language processing. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), 2018.
  • Bernhard & Vygen (2008) Korte Bernhard and Jens Vygen. Combinatorial optimization: Theory and algorithms. Springer, Third Edition, 2005., 2008.
  • Bowman et al. (2015) Samuel Bowman, Gabor Angeli, Christopher Potts, and Christopher D Manning. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp.  632–642, 2015.
  • Brown et al. (2020) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
  • Chen et al. (2020) Zhuotong Chen, Qianxiao Li, and Zheng Zhang. Towards robust neural networks via close-loop control. In International Conference on Learning Representations, 2020.
  • Chen et al. (2022) Zhuotong Chen, Qianxiao Li, and Zheng Zhang. Self-healing robust neural networks via closed-loop control. The Journal of Machine Learning Research, 23(1):14329–14382, 2022.
  • Chen et al. (2023) Zhuotong Chen, Qianxiao Li, and Zheng Zhang. Asymptotically fair participation in machine learning models: an optimal control perspective. arXiv preprint arXiv:2311.10223, 2023.
  • De Lathauwer et al. (2000a) Lieven De Lathauwer, Bart De Moor, and Joos Vandewalle. On the best rank-1 and rank-$(r_1, r_2, \cdots, r_n)$ approximation of higher-order tensors. SIAM journal on Matrix Analysis and Applications, 21(4):1324–1342, 2000a.
  • De Lathauwer et al. (2000b) Lieven De Lathauwer, Bart De Moor, and Joos Vandewalle. A multilinear singular value decomposition. SIAM journal on Matrix Analysis and Applications, 21(4):1253–1278, 2000b.
  • E (2017) Weinan E. A proposal on machine learning via dynamical systems. Communications in Mathematics and Statistics, 5(1):1–11, 2017.
  • E et al. (2018) Weinan E, Jiequn Han, and Qianxiao Li. A mean-field optimal control formulation of deep learning. Research in the Mathematical Sciences, 1(6):1–41, 2018.
  • Fefferman et al. (2016) Charles Fefferman, Sanjoy Mitter, and Hariharan Narayanan. Testing the manifold hypothesis. Journal of the American Mathematical Society, 29(4):983–1049, 2016.
  • Haber & Ruthotto (2017) Eldad Haber and Lars Ruthotto. Stable architectures for deep neural networks. Inverse Problems, 34(1):014004, 2017.
  • Hannun et al. (2014) Awni Hannun, Carl Case, Jared Casper, Bryan Catanzaro, Greg Diamos, Erich Elsen, Ryan Prenger, Sanjeev Satheesh, Shubho Sengupta, Adam Coates, et al. Deep speech: Scaling up end-to-end speech recognition. arXiv preprint arXiv:1412.5567, 2014.
  • He et al. (2021) Tianxing He, Jun Liu, Kyunghyun Cho, Myle Ott, Bing Liu, James Glass, and Fuchun Peng. Analyzing the forgetting problem in pretrain-finetuning of open-domain dialogue response models. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pp.  1121–1133, 2021.
  • Hehn & D’Andrea (2015) Markus Hehn and Raffaello D’Andrea. Real-time trajectory generation for quadrocopters. IEEE Transactions on Robotics, 31(4):877–892, 2015.
  • Ho et al. (1975) Chung-Wen Ho, Albert Ruehli, and Pierce Brennan. The modified nodal approach to network analysis. IEEE Transactions on circuits and systems, 22(6):504–509, 1975.
  • Hu et al. (2021) Edward J Hu, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2021.
  • Huang et al. (2019) Kexin Huang, Jaan Altosaar, and Rajesh Ranganath. Clinicalbert: Modeling clinical notes and predicting hospital readmission. arXiv preprint arXiv:1904.05342, 2019.
  • Jiang et al. (2019) Haoming Jiang, Pengcheng He, Weizhu Chen, Xiaodong Liu, Jianfeng Gao, and Tuo Zhao. Smart: Robust and efficient fine-tuning for pre-trained natural language models through principled regularized optimization. arXiv preprint arXiv:1911.03437, 2019.
  • Jin et al. (2020) Di Jin, Zhijing Jin, Joey Tianyi Zhou, and Peter Szolovits. Is bert really robust? a strong baseline for natural language attack on text classification and entailment. In Proceedings of the AAAI conference on artificial intelligence, volume 34, pp.  8018–8025, 2020.
  • Kenton & Toutanova (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT, pp.  4171–4186, 2019.
  • Kirk (1970) Donald E Kirk. Optimal control theory: an introduction. Springer, 1970.
  • Kolda & Bader (2009) Tamara G Kolda and Brett W Bader. Tensor decompositions and applications. SIAM review, 51(3):455–500, 2009.
  • La Malfa & Kwiatkowska (2022) Emanuele La Malfa and Marta Kwiatkowska. The king is naked: on the notion of robustness for natural language processing. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pp.  11047–11057, 2022.
  • Lee et al. (2012) Jangjoon Lee, Srikar Bhagavatula, Swarup Bhunia, Kaushik Roy, and Byunghoo Jung. Self-healing design in deep scaled cmos technologies. Journal of Circuits, Systems, and Computers, 21(06):1240011, 2012.
  • Li et al. (2019) J Li, S Ji, T Du, B Li, and T Wang. Textbugger: Generating adversarial text against real-world applications. In 26th Annual Network and Distributed System Security Symposium, 2019.
  • Li et al. (2018a) Jinfeng Li, Shouling Ji, Tianyu Du, Bo Li, and Ting Wang. Textbugger: Generating adversarial text against real-world applications. arXiv preprint arXiv:1812.05271, 2018a.
  • Li & Hao (2018) Qianxiao Li and Shuji Hao. An optimal control approach to deep learning and applications to discrete-weight neural networks. In International Conference on Machine Learning, pp.  2985–2994. PMLR, 2018.
  • Li et al. (2018b) Qianxiao Li, Long Chen, Cheng Tai, and Weinan E. Maximum principle based algorithms for deep learning. Journal of Machine Learning Research, 18(165):1–29, 2018b. URL http://jmlr.org/papers/v18/17-653.html.
  • Liu & Theodorou (2019) Guan-Horng Liu and Evangelos A Theodorou. Deep learning theory review: An optimal control and dynamical systems perspective. arXiv preprint arXiv:1908.10920, 2019.
  • Liu et al. (2020a) Guan-Horng Liu, Tianrong Chen, and Evangelos A Theodorou. Differential dynamic programming neural optimizer. arXiv preprint arXiv:2002.08809, 2020a.
  • Liu et al. (2020b) Hui Liu, Yongzheng Zhang, Yipeng Wang, Zheng Lin, and Yige Chen. Joint character-level word embedding and adversarial stability training to defend adversarial text. In Proceedings of the AAAI conference on artificial intelligence, volume 34, pp.  8384–8391, 2020b.
  • Liu et al. (2019) Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692, 2019.
  • Madry et al. (2018) Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations, 2018.
  • Morris et al. (2020) John X Morris, Eli Lifland, Jin Yong Yoo, Jake Grigsby, Di Jin, and Yanjun Qi. Textattack: A framework for adversarial attacks, data augmentation, and adversarial training in nlp. arXiv preprint arXiv:2005.05909, 2020.
  • Nenkova & McKeown (2012) Ani Nenkova and Kathleen McKeown. A survey of text summarization techniques. Mining text data, pp.  43–76, 2012.
  • Nie et al. (2020) Yixin Nie, Adina Williams, Emily Dinan, Mohit Bansal, Jason Weston, and Douwe Kiela. Adversarial nli: A new benchmark for natural language understanding. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp.  4885–4901, 2020.
  • Pontryagin (1987) Lev Semenovich Pontryagin. Mathematical theory of optimal processes. CRC press, 1987.
  • Rajapakse & Groudine (2011) Indika Rajapakse and Mark Groudine. On emerging nuclear order. Journal of Cell Biology, 192(5):711–721, 2011.
  • Ren et al. (2019) Shuhuai Ren, Yihe Deng, Kun He, and Wanxiang Che. Generating natural language adversarial examples through probability weighted word saliency. In Proceedings of the 57th annual meeting of the association for computational linguistics, pp.  1085–1097, 2019.
  • Sanh et al. (2019) Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108, 2019.
  • Shafahi et al. (2019) Ali Shafahi, Mahyar Najibi, Mohammad Amin Ghiasi, Zheng Xu, John Dickerson, Christoph Studer, Larry S Davis, Gavin Taylor, and Tom Goldstein. Adversarial training for free! Advances in Neural Information Processing Systems, 32, 2019.
  • Tang et al. (2012) Adrian Tang, Frank Hsiao, David Murphy, I-Ning Ku, Jenny Liu, Sandeep D’Souza, Ning-Yi Wang, Hao Wu, Yen-Hsiang Wang, Mandy Tang, et al. A low-overhead self-healing embedded system for ensuring high yield and long-term sustainability of 60ghz 4gb/s radio-on-a-chip. In 2012 IEEE International Solid-State Circuits Conference, pp.  316–318. IEEE, 2012.
  • Tramer & Boneh (2019) Florian Tramer and Dan Boneh. Adversarial training and robustness for multiple perturbations. Advances in neural information processing systems, 32, 2019.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
  • Vinodhini & Chandrasekaran (2012) G Vinodhini and RM Chandrasekaran. Sentiment analysis and opinion mining: a survey. International Journal, 2(6):282–292, 2012.
  • Wallace et al. (2019) Eric Wallace, Shi Feng, Nikhil Kandpal, Matt Gardner, and Sameer Singh. Universal adversarial triggers for attacking and analyzing nlp. arXiv preprint arXiv:1908.07125, 2019.
  • Wang et al. (2021) Ren Wang, Tianqi Chen, Stephen Lindsly, Cooper Stansbury, Alnawaz Rehemtulla, Indika Rajapakse, and Alfred Hero. RAILS: A robust adversarial immune-inspired learning system. arXiv preprint arXiv:2107.02840, 2021.
  • Wang et al. (2020) Tianlu Wang, Xuezhi Wang, Yao Qin, Ben Packer, Kang Li, Jilin Chen, Alex Beutel, and Ed Chi. Cat-gen: Improving robustness in nlp models via controlled adversarial text generation. arXiv preprint arXiv:2010.02338, 2020.
  • Wicker et al. (2021) Matthew Wicker, Luca Laurenti, Andrea Patane, Zhuotong Chen, Zheng Zhang, and Marta Kwiatkowska. Bayesian inference with certifiable adversarial robustness. In International Conference on Artificial Intelligence and Statistics, pp.  2431–2439. PMLR, 2021.
  • Williams et al. (2018) Adina Williams, Nikita Nangia, and Samuel Bowman. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp.  1112–1122, 2018.
  • Wu et al. (2022) Hongqiu Wu, Yongxiang Liu, Hanwen Shi, Min Zhang, et al. Toward adversarial training on contextualized language representation. In The Eleventh International Conference on Learning Representations, 2022.
  • Xu et al. (2021) Ying Xu, Xu Zhong, Antonio Jimeno Yepes, and Jey Han Lau. Grey-box adversarial attack and defence for sentiment classification. arXiv preprint arXiv:2103.11576, 2021.
  • Yoo & Qi (2021) Jin Yong Yoo and Yanjun Qi. Towards improving adversarial training of nlp models. In Findings of the Association for Computational Linguistics: EMNLP 2021, pp.  945–956, 2021.
  • Zang et al. (2020) Yuan Zang, Fanchao Qi, Chenghao Yang, Zhiyuan Liu, Meng Zhang, Qun Liu, and Maosong Sun. Word-level textual adversarial attacking as combinatorial optimization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp.  6066–6080, 2020.
  • Zhang et al. (2019) Dinghuai Zhang, Tianyuan Zhang, Yiping Lu, Zhanxing Zhu, and Bin Dong. You only propagate once: Accelerating adversarial training via maximal principle. Advances in Neural Information Processing Systems, 32, 2019.
  • Zhu et al. (2019) Chen Zhu, Yu Cheng, Zhe Gan, Siqi Sun, Tom Goldstein, and Jingjing Liu. Freelb: Enhanced adversarial training for natural language understanding. In International Conference on Learning Representations, 2019.

8 Appendix A

In this section, we elaborate on the derivation of the analytic solution as presented in Propositions 1 and 3. Let $\bm{\theta}_t$ represent the $t^{\text{th}}$ linear transformation, and $\pi_t:\mathbb{R}^d\rightarrow\mathbb{R}^d$ denote the PID controller. The controlled dynamical system can be expressed as:

$$\mathbf{x}_{t+1}=\bm{\theta}_t(\mathbf{x}_t+\pi_t(\mathbf{x}_t)),$$

where the control action is added to the current state. Recall the running loss defined in equation 3,

$$
\begin{aligned}
&\mathcal{L}(\{\mathbf{x}_s\}_{s=0}^{t},\pi_t,(f_t^P,f_t^I,f_t^D)) \\
&\quad:=\frac{1}{2}\lVert f_t^P(\mathbf{x}_t+\pi_t(\mathbf{x}_t))\rVert_2^2+\frac{1}{2}\Big\lVert f_t^I\Big(\mathbf{x}_t+\pi_t(\mathbf{x}_t)+\sum_{s=0}^{t-1}\mathbf{x}_s\Big)\Big\rVert_2^2+\frac{1}{2}\lVert f_t^D(\mathbf{x}_t+\pi_t(\mathbf{x}_t)-\mathbf{x}_{t-1})\rVert_2^2+\frac{c_t}{2}\lVert\pi_t(\mathbf{x}_t)\rVert_2^2.
\end{aligned}
$$

We consider the surjective mappings $f_t^P$, $f_t^I$, and $f_t^D$ as orthogonal projections. Let $\mathbf{Q}_t^P$, $\mathbf{Q}_t^I$, and $\mathbf{Q}_t^D$ be the orthogonal projections associated with $f_t^P$, $f_t^I$, and $f_t^D$, respectively. Assuming a uniformly bounded state space with $\max_{\mathbf{x}\in\mathcal{X}}\lVert\mathbf{x}\rVert_2^2\leq B$, the running loss can be bounded as follows,

$$
\begin{aligned}
&\mathcal{L}(\{\mathbf{x}_s\}_{s=0}^{t},\pi_t,(\mathbf{Q}_t^P,\mathbf{Q}_t^I,\mathbf{Q}_t^D)) \\
&=\frac{1}{2}\lVert\mathbf{Q}_t^P(\mathbf{x}_t+\pi_t(\mathbf{x}_t))\rVert_2^2+\frac{1}{2}\Big\lVert\mathbf{Q}_t^I\Big(\mathbf{x}_t+\pi_t(\mathbf{x}_t)+\sum_{s=0}^{t-1}\mathbf{x}_s\Big)\Big\rVert_2^2+\frac{1}{2}\lVert\mathbf{Q}_t^D(\mathbf{x}_t+\pi_t(\mathbf{x}_t)-\mathbf{x}_{t-1})\rVert_2^2+\frac{c_t}{2}\lVert\pi_t(\mathbf{x}_t)\rVert_2^2, \\
&\leq\frac{1}{2}\lVert\mathbf{Q}_t^P(\mathbf{x}_t+\pi_t(\mathbf{x}_t))\rVert_2^2+\frac{1}{2}\lVert\mathbf{Q}_t^I(\mathbf{x}_t+\pi_t(\mathbf{x}_t))\rVert_2^2+\frac{1}{2}\Big\lVert\mathbf{Q}_t^I\Big(\sum_{s=0}^{t-1}\mathbf{x}_s\Big)\Big\rVert_2^2+\frac{1}{2}\lVert\mathbf{Q}_t^D(\mathbf{x}_t+\pi_t(\mathbf{x}_t))\rVert_2^2+\frac{1}{2}\lVert\mathbf{Q}_t^D\mathbf{x}_{t-1}\rVert_2^2 \\
&\quad+\frac{c_t}{2}\lVert\pi_t(\mathbf{x}_t)\rVert_2^2, \\
&\leq\frac{1}{2}\lVert\mathbf{Q}_t^P(\mathbf{x}_t+\pi_t(\mathbf{x}_t))\rVert_2^2+\frac{1}{2}\lVert\mathbf{Q}_t^I(\mathbf{x}_t+\pi_t(\mathbf{x}_t))\rVert_2^2+\frac{1}{2}\lVert\mathbf{Q}_t^D(\mathbf{x}_t+\pi_t(\mathbf{x}_t))\rVert_2^2+\frac{1}{2}\Big\lVert\sum_{s=0}^{t-1}\mathbf{x}_s\Big\rVert_2^2+\frac{1}{2}\lVert\mathbf{x}_{t-1}\rVert_2^2 \\
&\quad+\frac{c_t}{2}\lVert\pi_t(\mathbf{x}_t)\rVert_2^2, \\
&\leq\frac{1}{2}\lVert\mathbf{Q}_t^P(\mathbf{x}_t+\pi_t(\mathbf{x}_t))\rVert_2^2+\frac{1}{2}\lVert\mathbf{Q}_t^I(\mathbf{x}_t+\pi_t(\mathbf{x}_t))\rVert_2^2+\frac{1}{2}\lVert\mathbf{Q}_t^D(\mathbf{x}_t+\pi_t(\mathbf{x}_t))\rVert_2^2+\frac{c_t}{2}\lVert\pi_t(\mathbf{x}_t)\rVert_2^2+\frac{TB}{2},
\end{aligned}
\tag{10}
$$

where $T$ represents the maximum number of layers of the neural network, and $B$ is the uniform upper bound for the state space.
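The controlled dynamics and the running loss with projection matrices can be sketched numerically as follows. This is a minimal illustration and not the authors' implementation: the helper names (`orthogonal_projection`, `running_loss`, `controlled_step`) are hypothetical, the projectors are built from arbitrary bases purely for demonstration, and at $t=0$ the history sum and the previous state are assumed to be zero.

```python
import numpy as np


def orthogonal_projection(basis):
    """Return the orthogonal projector onto the column space of `basis`.

    `basis` is a (d, k) array assumed to have full column rank.
    """
    q, _ = np.linalg.qr(basis)  # reduced QR: q has orthonormal columns
    return q @ q.T


def controlled_step(theta_t, x_t, control):
    """One step of the controlled linear system x_{t+1} = theta_t (x_t + pi_t(x_t))."""
    return theta_t @ (x_t + control)


def running_loss(states, control, proj_p, proj_i, proj_d, c_t):
    """Evaluate the running loss at layer t = len(states) - 1.

    states  : list [x_0, ..., x_t] of 1-D state vectors
    control : the control action pi_t(x_t)
    Assumption: for t = 0 the history sum and x_{t-1} are taken to be zero.
    """
    x_t = states[-1]
    corrected = x_t + control
    if len(states) > 1:
        history = np.sum(states[:-1], axis=0)  # sum_{s=0}^{t-1} x_s
        previous = states[-2]                  # x_{t-1}
    else:
        history = np.zeros_like(x_t)
        previous = np.zeros_like(x_t)
    p_term = 0.5 * np.sum((proj_p @ corrected) ** 2)
    i_term = 0.5 * np.sum((proj_i @ (corrected + history)) ** 2)
    d_term = 0.5 * np.sum((proj_d @ (corrected - previous)) ** 2)
    regulariser = 0.5 * c_t * np.sum(control ** 2)
    return p_term + i_term + d_term + regulariser
```

A typical call builds three projectors once, then evaluates `running_loss` layer by layer while propagating the state with `controlled_step`.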

With $\mathbf{Q}_t=\mathbf{Q}_t^P+\mathbf{Q}_t^I+\mathbf{Q}_t^D$, the following lemma derives the analytic solution for the PID control $\pi_t(\mathbf{x}_t)$.

See 1

Proof.

In the objective function defined in equation 7, the terminal loss $\Phi(\mathbf{x}_T,y)$ quantifies the discrepancy between the terminal state $\mathbf{x}_T$ and the true label $y$. However, in general machine learning applications, the true label $y$ remains unknown during online inference, so the terminal loss is set to zero. Recalling equation 10, the running loss is defined as

$$
\begin{aligned}
&\mathcal{L}(\{\mathbf{x}_s\}_{s=0}^{t},\pi_t,(\mathbf{Q}_t^P,\mathbf{Q}_t^I,\mathbf{Q}_t^D)) \\
&\quad:=\frac{1}{2}\lVert\mathbf{Q}_t^P(\mathbf{x}_t+\pi_t(\mathbf{x}_t))\rVert_2^2+\frac{1}{2}\Big\lVert\mathbf{Q}_t^I\Big(\mathbf{x}_t+\pi_t(\mathbf{x}_t)+\sum_{s=0}^{t-1}\mathbf{x}_s\Big)\Big\rVert_2^2+\frac{1}{2}\lVert\mathbf{Q}_t^D(\mathbf{x}_t+\pi_t(\mathbf{x}_t)-\mathbf{x}_{t-1})\rVert_2^2+\frac{c_t}{2}\lVert\pi_t(\mathbf{x}_t)\rVert_2^2.
\end{aligned}
$$

Consequently, the optimal value function $V(\mathbf{x}_t)$ satisfies

$$
\begin{aligned}
V(\mathbf{x}_t)=\min_{\pi_t}\;&\frac{1}{2}(\mathbf{Q}_t^P\mathbf{x}_t+\mathbf{Q}_t^P\pi_t(\mathbf{x}_t))^{\top}(\mathbf{Q}_t^P\mathbf{x}_t+\mathbf{Q}_t^P\pi_t(\mathbf{x}_t))+\frac{1}{2}(\mathbf{Q}_t^I\mathbf{x}_t+\mathbf{Q}_t^I\pi_t(\mathbf{x}_t))^{\top}(\mathbf{Q}_t^I\mathbf{x}_t+\mathbf{Q}_t^I\pi_t(\mathbf{x}_t)) \\
&+\frac{1}{2}(\mathbf{Q}_t^D\mathbf{x}_t+\mathbf{Q}_t^D\pi_t(\mathbf{x}_t))^{\top}(\mathbf{Q}_t^D\mathbf{x}_t+\mathbf{Q}_t^D\pi_t(\mathbf{x}_t))+\frac{c}{2}\cdot\pi_t(\mathbf{x}_t)^{\top}\pi_t(\mathbf{x}_t)+V(\mathbf{x}_{t+1}), \\
{\rm s.t.}\;&\mathbf{x}_{t+1}=\bm{\theta}_t(\mathbf{x}_t+\pi_t(\mathbf{x}_t)).
\end{aligned}
$$
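As a numerical sanity check on this recursion, the sketch below evaluates the right-hand side for a candidate control and compares a closed-form gradient with a central finite-difference estimate. It rests on assumptions that are not stated at this point of the text: the value function is taken to be quadratic, $V(\mathbf{x})=\mathbf{x}^{\top}\mathbf{P}\mathbf{x}$ with a symmetric matrix $\mathbf{P}$, and the projectors are random. All function names are hypothetical.

```python
import numpy as np


def random_projector(rng, dim, rank):
    """Random orthogonal projector of the given rank (illustrative only)."""
    q, _ = np.linalg.qr(rng.standard_normal((dim, rank)))
    return q @ q.T


def bellman_rhs(u, x, projs, c, theta, p_next):
    """Right-hand side of the value recursion for a candidate control u.

    Assumption: V(x_{t+1}) = x_{t+1}^T P_{t+1} x_{t+1} with symmetric P_{t+1}.
    """
    z = x + u
    stage = sum(0.5 * np.sum((q @ z) ** 2) for q in projs) + 0.5 * c * (u @ u)
    x_next = theta @ z
    return stage + x_next @ p_next @ x_next


def bellman_rhs_grad(u, x, projs, c, theta, p_next):
    """Closed-form gradient in u; uses Q^T Q = Q for orthogonal projectors."""
    z = x + u
    q_sum = sum(projs)  # Q^P + Q^I + Q^D
    return q_sum @ z + c * u + 2.0 * theta.T @ p_next @ theta @ z


def finite_difference_grad(f, u, eps=1e-6):
    """Central finite-difference estimate of the gradient of f at u."""
    grad = np.zeros_like(u)
    for i in range(u.size):
        step = np.zeros_like(u)
        step[i] = eps
        grad[i] = (f(u + step) - f(u - step)) / (2.0 * eps)
    return grad
```

Because every term is quadratic in the control, the two gradients should agree up to floating-point rounding for any symmetric choice of `p_next`.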

Taking the derivative of the right-hand side with respect to $\pi_t(\mathbf{x}_t)$ yields

$$
\begin{aligned}
\frac{dV(\mathbf{x}_t)}{d\pi_t(\mathbf{x}_t)}
&=\mathbf{Q}_t^P\mathbf{x}_t+\mathbf{Q}_t^P\pi_t(\mathbf{x}_t)+\mathbf{Q}_t^I\mathbf{x}_t+\mathbf{Q}_t^I\pi_t(\mathbf{x}_t)+\mathbf{Q}_t^D\mathbf{x}_t+\mathbf{Q}_t^D\pi_t(\mathbf{x}_t)+c\pi_t(\mathbf{x}_t)+\bigg(\frac{d\mathbf{x}_{t+1}}{d\pi_t(\mathbf{x}_t)}\bigg)^{\top}\frac{dV(\mathbf{x}_{t+1})}{d\mathbf{x}_{t+1}}, \\
&=\mathbf{Q}_t^P\mathbf{x}_t+\mathbf{Q}_t^P\pi_t(\mathbf{x}_t)+\mathbf{Q}_t^I\mathbf{x}_t+\mathbf{Q}_t^I\pi_t(\mathbf{x}_t)+\mathbf{Q}_t^D\mathbf{x}_t+\mathbf{Q}_t^D\pi_t(\mathbf{x}_t)+c\pi_t(\mathbf{x}_t)+2\bm{\theta}_t^{\top}\mathbf{P}_{t+1}\mathbf{x}_{t+1}, \\
&=(\mathbf{Q}_t^P+\mathbf{Q}_t^I+\mathbf{Q}_t^D)\mathbf{x}_t+(\mathbf{Q}_t^P+\mathbf{Q}_t^I+\mathbf{Q}_t^D)\pi_t(\mathbf{x}_t)+c\pi_t(\mathbf{x}_t)+2\bm{\theta}_t^{\top}\mathbf{P}_{t+1}\bm{\theta}_t\mathbf{x}_t+2\bm{\theta}_t^{\top}\mathbf{P}_{t+1}\bm{\theta}_t\pi_t(\mathbf{x}_t),
\end{aligned}
$$
start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT bold_P start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT bold_italic_θ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT italic_π start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) ,
=𝐐t𝐱t+𝐐tπt(𝐱t)+cπt(𝐱t)+2𝜽t𝐏t+1𝜽t𝐱t+2𝜽t𝐏t+1𝜽tπt(𝐱t),absentsubscript𝐐𝑡subscript𝐱𝑡subscript𝐐𝑡subscript𝜋𝑡subscript𝐱𝑡𝑐subscript𝜋𝑡subscript𝐱𝑡2superscriptsubscript𝜽𝑡topsubscript𝐏𝑡1subscript𝜽𝑡subscript𝐱𝑡2superscriptsubscript𝜽𝑡topsubscript𝐏𝑡1subscript𝜽𝑡subscript𝜋𝑡subscript𝐱𝑡\displaystyle=\mathbf{Q}_{t}\mathbf{x}_{t}+\mathbf{Q}_{t}\pi_{t}(\mathbf{x}_{t% })+c\pi_{t}(\mathbf{x}_{t})+2\bm{\theta}_{t}^{\top}\mathbf{P}_{t+1}\bm{\theta}% _{t}\mathbf{x}_{t}+2\bm{\theta}_{t}^{\top}\mathbf{P}_{t+1}\bm{\theta}_{t}\pi_{% t}(\mathbf{x}_{t}),= bold_Q start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT + bold_Q start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT italic_π start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) + italic_c italic_π start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) + 2 bold_italic_θ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT bold_P start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT bold_italic_θ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT + 2 bold_italic_θ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT bold_P start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT bold_italic_θ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT italic_π start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) ,

where 𝐐t=𝐐tP+𝐐tI+𝐐tDsubscript𝐐𝑡superscriptsubscript𝐐𝑡𝑃superscriptsubscript𝐐𝑡𝐼superscriptsubscript𝐐𝑡𝐷\mathbf{Q}_{t}=\mathbf{Q}_{t}^{P}+\mathbf{Q}_{t}^{I}+\mathbf{Q}_{t}^{D}bold_Q start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT = bold_Q start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_P end_POSTSUPERSCRIPT + bold_Q start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_I end_POSTSUPERSCRIPT + bold_Q start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_D end_POSTSUPERSCRIPT.

Setting the derivative $\frac{dV(\mathbf{x}_t)}{d\pi_t(\mathbf{x}_t)}$ to $\mathbf{0}$ results in the optimal control $\pi_t^{\ast}(\mathbf{x}_t)$ (as shown in equation 9),

$$
\pi_t^{\ast}(\mathbf{x}_t) = -(\mathbf{Q}_t + c \cdot \mathbf{I} + 2\bm{\theta}_t^{\top} \mathbf{P}_{t+1} \bm{\theta}_t)^{-1} (\mathbf{Q}_t + 2\bm{\theta}_t^{\top} \mathbf{P}_{t+1} \bm{\theta}_t)\,\mathbf{x}_t.
$$
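As a numerical sanity check of this closed form, the following NumPy sketch confirms that the gradient derived above vanishes at $\pi_t^{\ast}(\mathbf{x}_t)$. It assumes the dynamics $\mathbf{x}_{t+1} = \bm{\theta}_t(\mathbf{x}_t + \pi_t(\mathbf{x}_t))$ implied by the expansion above. The dimensions, the seed, the value of $c$, the randomly drawn projections and $\mathbf{P}_{t+1}$, and the helper names `random_projection` and `optimal_control` are illustrative assumptions and are not taken from the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n, r, c = 6, 2, 0.5  # state dimension, subspace rank, control penalty (all assumed)

def random_projection(n, r, rng):
    """Return I - V V^T for a random n x r orthonormal basis V (hypothetical helper)."""
    V, _ = np.linalg.qr(rng.standard_normal((n, r)))
    return np.eye(n) - V @ V.T

def optimal_control(Q, theta, P_next, c, x):
    """Closed form pi* = -(Q + cI + 2 th^T P th)^{-1} (Q + 2 th^T P th) x."""
    A = Q + 2.0 * theta.T @ P_next @ theta
    return -np.linalg.solve(A + c * np.eye(len(x)), A @ x)

Q = sum(random_projection(n, r, rng) for _ in range(3))  # Q^P + Q^I + Q^D
theta, _ = np.linalg.qr(rng.standard_normal((n, n)))     # random orthogonal layer
M = rng.standard_normal((n, n))
P_next = M @ M.T                                         # symmetric PSD stand-in for P_{t+1}
x = rng.standard_normal(n)

pi_star = optimal_control(Q, theta, P_next, c, x)
x_next = theta @ (x + pi_star)

# Gradient of the objective with respect to pi, evaluated at pi*.
grad = Q @ x + Q @ pi_star + c * pi_star + 2.0 * theta.T @ P_next @ x_next
print("gradient norm at pi*:", np.linalg.norm(grad))
```

The sketch solves the linear system with `np.linalg.solve` rather than forming the inverse explicitly, which is the usual choice for numerical stability. The printed norm should be at the level of floating-point round-off.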

Parametrizing the value function $V(\mathbf{x}_t)$ as $\mathbf{x}_t^{\top} \mathbf{P}_t \mathbf{x}_t$ and considering the optimal control solution in equation 9, we can convert the expression of the value function as follows,

$$
\begin{aligned}
\mathbf{x}_t^{\top} \mathbf{P}_t \mathbf{x}_t
&= \min_{\pi_t} \frac{1}{2} (\mathbf{Q}_t^P \mathbf{x}_t + \mathbf{Q}_t^P \pi_t(\mathbf{x}_t))^{\top} (\mathbf{Q}_t^P \mathbf{x}_t + \mathbf{Q}_t^P \pi_t(\mathbf{x}_t)) + \frac{1}{2} (\mathbf{Q}_t^I \mathbf{x}_t + \mathbf{Q}_t^I \pi_t(\mathbf{x}_t))^{\top} (\mathbf{Q}_t^I \mathbf{x}_t + \mathbf{Q}_t^I \pi_t(\mathbf{x}_t)) \\
&\quad + \frac{1}{2} (\mathbf{Q}_t^D \mathbf{x}_t + \mathbf{Q}_t^D \pi_t(\mathbf{x}_t))^{\top} (\mathbf{Q}_t^D \mathbf{x}_t + \mathbf{Q}_t^D \pi_t(\mathbf{x}_t)) + \frac{c}{2} \cdot \pi_t(\mathbf{x}_t)^{\top} \pi_t(\mathbf{x}_t) + \mathbf{x}_{t+1}^{\top} \mathbf{P}_{t+1} \mathbf{x}_{t+1}, \\
&= \frac{1}{2} \mathbf{x}_t^{\top} (\mathbf{Q}_t^P + \mathbf{Q}_t^I + \mathbf{Q}_t^D + 2\bm{\theta}_t^{\top} \mathbf{P}_{t+1} \bm{\theta}_t)\,\mathbf{x}_t + \frac{1}{2} (\pi_t^{\ast}(\mathbf{x}_t))^{\top} (\mathbf{Q}_t^P + \mathbf{Q}_t^I + \mathbf{Q}_t^D + c\mathbf{I} + 2\bm{\theta}_t^{\top} \mathbf{P}_{t+1} \bm{\theta}_t)\,\pi_t^{\ast}(\mathbf{x}_t) \\
&\quad + \mathbf{x}_t^{\top} (\mathbf{Q}_t^P + \mathbf{Q}_t^I + \mathbf{Q}_t^D + 2\bm{\theta}_t^{\top} \mathbf{P}_{t+1} \bm{\theta}_t)\,\pi_t^{\ast}(\mathbf{x}_t), \\
&= \frac{1}{2} \mathbf{x}_t^{\top} (\mathbf{Q}_t + 2\bm{\theta}_t^{\top} \mathbf{P}_{t+1} \bm{\theta}_t)\,\mathbf{x}_t + \frac{1}{2} (\pi_t^{\ast}(\mathbf{x}_t))^{\top} (\mathbf{Q}_t + c\mathbf{I} + 2\bm{\theta}_t^{\top} \mathbf{P}_{t+1} \bm{\theta}_t)\,\pi_t^{\ast}(\mathbf{x}_t) + \mathbf{x}_t^{\top} (\mathbf{Q}_t + 2\bm{\theta}_t^{\top} \mathbf{P}_{t+1} \bm{\theta}_t)\,\pi_t^{\ast}(\mathbf{x}_t),
\end{aligned}
$$

where $\pi_t^{\ast}(\mathbf{x}_t)$ is the optimal control solution leading to the minimum, and $\mathbf{Q}_t = \mathbf{Q}_t^P + \mathbf{Q}_t^I + \mathbf{Q}_t^D$. For the second term in the above, recall the optimal control solution $\pi_t^{\ast}(\mathbf{x}_t)$ from equation 9,

$$
\begin{aligned}
&\frac{1}{2} (\pi_t^{\ast}(\mathbf{x}_t))^{\top} (\mathbf{Q}_t + c \cdot \mathbf{I} + 2\bm{\theta}_t^{\top} \mathbf{P}_{t+1} \bm{\theta}_t)\,\pi_t^{\ast}(\mathbf{x}_t) \\
&= -\frac{1}{2} \Big( (\mathbf{Q}_t + c \cdot \mathbf{I} + 2\bm{\theta}_t^{\top} \mathbf{P}_{t+1} \bm{\theta}_t)^{-1} (\mathbf{Q}_t + 2\bm{\theta}_t^{\top} \mathbf{P}_{t+1} \bm{\theta}_t)\,\mathbf{x}_t \Big)^{\top} (\mathbf{Q}_t + c \cdot \mathbf{I} + 2\bm{\theta}_t^{\top} \mathbf{P}_{t+1} \bm{\theta}_t)\,\pi_t^{\ast}(\mathbf{x}_t), \\
&= -\frac{1}{2} \mathbf{x}_t^{\top} (\mathbf{Q}_t + 2\bm{\theta}_t^{\top} \mathbf{P}_{t+1} \bm{\theta}_t)\,\pi_t^{\ast}(\mathbf{x}_t),
\end{aligned}
$$

where the last step uses the fact that $(\mathbf{Q}_t + c \cdot \mathbf{I} + 2\bm{\theta}_t^{\top} \mathbf{P}_{t+1} \bm{\theta}_t)^{-1}$ is symmetric. Therefore,

$$
\begin{aligned}
\mathbf{x}_t^{\top} \mathbf{P}_t \mathbf{x}_t
&= \frac{1}{2} \mathbf{x}_t^{\top} (\mathbf{Q}_t + 2\bm{\theta}_t^{\top} \mathbf{P}_{t+1} \bm{\theta}_t)\,\mathbf{x}_t - \frac{1}{2} \mathbf{x}_t^{\top} (\mathbf{Q}_t + 2\bm{\theta}_t^{\top} \mathbf{P}_{t+1} \bm{\theta}_t)\,\pi_t^{\ast}(\mathbf{x}_t) + \mathbf{x}_t^{\top} (\mathbf{Q}_t + 2\bm{\theta}_t^{\top} \mathbf{P}_{t+1} \bm{\theta}_t)\,\pi_t^{\ast}(\mathbf{x}_t), \\
&= \frac{1}{2} \mathbf{x}_t^{\top} (\mathbf{Q}_t + 2\bm{\theta}_t^{\top} \mathbf{P}_{t+1} \bm{\theta}_t)\,\mathbf{x}_t + \frac{1}{2} \mathbf{x}_t^{\top} (\mathbf{Q}_t + 2\bm{\theta}_t^{\top} \mathbf{P}_{t+1} \bm{\theta}_t)\,\pi_t^{\ast}(\mathbf{x}_t),
\end{aligned}
$$

which results in the algebraic Riccati equation

$$
\mathbf{P}_t = \frac{1}{2} \mathbf{Q}_t + \bm{\theta}_t^{\top} \mathbf{P}_{t+1} \bm{\theta}_t - \frac{1}{2} (\mathbf{Q}_t + 2\bm{\theta}_t^{\top} \mathbf{P}_{t+1} \bm{\theta}_t)^{\top} (\mathbf{Q}_t + 2\bm{\theta}_t^{\top} \mathbf{P}_{t+1} \bm{\theta}_t + c\mathbf{I})^{-1} (\mathbf{Q}_t + 2\bm{\theta}_t^{\top} \mathbf{P}_{t+1} \bm{\theta}_t).
$$
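The recursion above can be run backward in time, and the result can be checked against the objective it was derived from. The NumPy sketch below does this for randomly generated layers. The horizon `T`, the terminal condition $\mathbf{P}_T = \mathbf{0}$, the dimensions, the seed, and the helper names `projection` and `riccati_step` are assumptions made for illustration only. The check confirms that $\mathbf{x}_0^{\top} \mathbf{P}_0 \mathbf{x}_0$ equals the one-step objective evaluated at the optimal control.

```python
import numpy as np

rng = np.random.default_rng(1)
n, r, c, T = 5, 2, 0.5, 4  # assumed state size, subspace rank, penalty, horizon

def projection(n, r, rng):
    """Return I - V V^T for a random n x r orthonormal basis V (hypothetical helper)."""
    V, _ = np.linalg.qr(rng.standard_normal((n, r)))
    return np.eye(n) - V @ V.T

def riccati_step(Q, theta, P_next, c):
    """One backward step: P_t = A/2 - A^T (A + cI)^{-1} A / 2, A = Q + 2 th^T P_{t+1} th."""
    A = Q + 2.0 * theta.T @ P_next @ theta
    return 0.5 * A - 0.5 * A.T @ np.linalg.solve(A + c * np.eye(len(A)), A)

Qs = [sum(projection(n, r, rng) for _ in range(3)) for _ in range(T)]
thetas = [np.linalg.qr(rng.standard_normal((n, n)))[0] for _ in range(T)]

P = [None] * (T + 1)
P[T] = np.zeros((n, n))  # assumed terminal condition, not stated in the text above
for t in reversed(range(T)):
    P[t] = riccati_step(Qs[t], thetas[t], P[t + 1], c)

# Compare x^T P_0 x with the objective at the optimal control for t = 0.
x = rng.standard_normal(n)
A = Qs[0] + 2.0 * thetas[0].T @ P[1] @ thetas[0]
pi_star = -np.linalg.solve(A + c * np.eye(n), A @ x)
z = x + pi_star
x_next = thetas[0] @ z
bellman = 0.5 * z @ Qs[0] @ z + 0.5 * c * pi_star @ pi_star + x_next @ P[1] @ x_next
value = x @ P[0] @ x
print(value, bellman)
```

The two printed numbers should agree up to round-off. Here the three projections are drawn independently, so the check does not rely on any orthogonality between their subspaces.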

In our analysis, we focus on a specific scenario where each linear transformation $\bm{\theta}_t$ is both orthogonal and full-rank. This implies that the linear transformations satisfy the condition $\bm{\theta}_t^{\top} \bm{\theta}_t = \bm{\theta}_t \bm{\theta}_t^{\top} = \mathbf{I}$ for all $t$ in the considered range.

Recall that 𝐐t=𝐐tP+𝐐tI+𝐐tDsubscript𝐐𝑡superscriptsubscript𝐐𝑡𝑃superscriptsubscript𝐐𝑡𝐼superscriptsubscript𝐐𝑡𝐷\mathbf{Q}_{t}=\mathbf{Q}_{t}^{P}+\mathbf{Q}_{t}^{I}+\mathbf{Q}_{t}^{D}bold_Q start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT = bold_Q start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_P end_POSTSUPERSCRIPT + bold_Q start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_I end_POSTSUPERSCRIPT + bold_Q start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_D end_POSTSUPERSCRIPT, where

$$
\mathbf{Q}_{t}^{P}=\mathbf{I}-\mathbf{V}_{t}^{P}(\mathbf{V}_{t}^{P})^{\top},\qquad
\mathbf{Q}_{t}^{I}=\mathbf{I}-\mathbf{V}_{t}^{I}(\mathbf{V}_{t}^{I})^{\top},\qquad
\mathbf{Q}_{t}^{D}=\mathbf{I}-\mathbf{V}_{t}^{D}(\mathbf{V}_{t}^{D})^{\top},
$$

are orthogonal projections corresponding to the linear embedding subspaces of the state, the state integration, and the state derivative. For simplicity, we assume that the bases $\mathbf{V}_{t}^{P}$, $\mathbf{V}_{t}^{I}$, and $\mathbf{V}_{t}^{D}$ are mutually orthogonal, meaning that

$$
(\mathbf{V}_{t}^{P})^{\top}\mathbf{V}_{t}^{I}=\mathbf{0},\qquad
(\mathbf{V}_{t}^{P})^{\top}\mathbf{V}_{t}^{D}=\mathbf{0},\qquad
(\mathbf{V}_{t}^{I})^{\top}\mathbf{V}_{t}^{D}=\mathbf{0}.
$$

Based on this assumption, the combination $\mathbf{Q}_{t}$ of the three orthogonal projections is itself an orthogonal projection,

$$
\begin{aligned}
\mathbf{Q}_{t}
&=\mathbf{V}_{t}^{P}\begin{bmatrix}0&0&\cdots&0&0\\ 0&0&\cdots&0&0\\ \vdots&\vdots&\ddots&0&0\\ 0&0&\cdots&1&0\\ 0&0&\cdots&0&1\end{bmatrix}(\mathbf{V}_{t}^{P})^{\top}
+\mathbf{V}_{t}^{I}\begin{bmatrix}0&0&\cdots&0&0\\ 0&0&\cdots&0&0\\ \vdots&\vdots&\ddots&0&0\\ 0&0&\cdots&1&0\\ 0&0&\cdots&0&1\end{bmatrix}(\mathbf{V}_{t}^{I})^{\top}
+\mathbf{V}_{t}^{D}\begin{bmatrix}0&0&\cdots&0&0\\ 0&0&\cdots&0&0\\ \vdots&\vdots&\ddots&0&0\\ 0&0&\cdots&1&0\\ 0&0&\cdots&0&1\end{bmatrix}(\mathbf{V}_{t}^{D})^{\top},\\
&=\mathbf{V}_{t}\begin{bmatrix}0&0&\cdots&0&0\\ 0&0&\cdots&0&0\\ \vdots&\vdots&\ddots&0&0\\ 0&0&\cdots&1&0\\ 0&0&\cdots&0&1\end{bmatrix}\mathbf{V}_{t}^{\top},
\end{aligned}
$$

where $\mathbf{V}_{t}$ represents the basis for the intersection of $\mathbf{V}_{t}^{P}$, $\mathbf{V}_{t}^{I}$, and $\mathbf{V}_{t}^{D}$.

Lemma 5.

Consider a $T$-layer neural network characterized by orthogonal linear transformations. The solution to the algebraic Riccati equation, as delineated in equation 8, is given by

$$
\mathbf{P}_{t}=\frac{1}{2}\mathbf{V}_{t}\begin{bmatrix}0&0&\cdots&0&0\\ 0&0&\cdots&0&0\\ \vdots&\vdots&\ddots&0&0\\ 0&0&\cdots&\lambda_{t}&0\\ 0&0&\cdots&0&\lambda_{t}\end{bmatrix}\mathbf{V}_{t}^{\top},
\tag{11}
$$

where the parameter $\lambda_{t}$ is governed by the backward difference equation $\lambda_{t}=\frac{c(1+\lambda_{t+1})}{1+\lambda_{t+1}+c}$, with the terminal condition $\lambda_{T}=0$.
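The backward recursion for $\lambda_{t}$ is cheap to tabulate. The following is a minimal Python sketch, not the authors' implementation; the function name `lambda_schedule` and the example values of `c` and `T` are illustrative assumptions.

```python
def lambda_schedule(c: float, T: int) -> list[float]:
    """Return [lambda_0, ..., lambda_T] from the backward recursion.

    Assumes c > 0 so that the denominator is always positive.
    """
    lam = [0.0] * (T + 1)  # lambda_T = 0 at the last layer
    for t in range(T - 1, -1, -1):
        nxt = lam[t + 1]
        lam[t] = c * (1.0 + nxt) / (1.0 + nxt + c)
    return lam


# Illustrative values only: c = 1 and a 5-layer network.
for t, value in enumerate(lambda_schedule(c=1.0, T=5)):
    print(t, round(value, 6))
```

For instance, with $c=1$ the first backward step gives $\lambda_{T-1}=c/(1+c)=0.5$ and the next gives $\lambda_{T-2}=0.6$. For $c>0$ the update map is a contraction on $\lambda\geq 0$, so in deep networks the early-layer values approach the fixed point $\lambda^{\ast}=(\sqrt{1+4c}-1)/2$, which solves $\lambda^{2}+\lambda=c$.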

Proof.

The proof proceeds by induction on $t$. Recall the algebraic Riccati equation 8. Given the terminal condition $\mathbf{P}_{T}=\mathbf{0}$, the equation for $t=T-1$ is

$$
\begin{aligned}
\mathbf{P}_{T-1}
&=\frac{1}{2}\mathbf{Q}_{T-1}-\frac{1}{2}\mathbf{Q}_{T-1}^{\top}(\mathbf{Q}_{T-1}+c\mathbf{I})^{-1}\mathbf{Q}_{T-1},\\
&=\frac{1}{2}\mathbf{V}_{T-1}\begin{bmatrix}0&0&\cdots&0&0\\ 0&0&\cdots&0&0\\ \vdots&\vdots&\ddots&0&0\\ 0&0&\cdots&\frac{c}{1+c}&0\\ 0&0&\cdots&0&\frac{c}{1+c}\end{bmatrix}\mathbf{V}_{T-1}^{\top}.
\end{aligned}
$$
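The second equality can be checked one eigenvalue at a time. As a sketch, assume $c>0$ (so that $\mathbf{Q}_{T-1}+c\mathbf{I}$ is invertible) and use the fact that the orthogonal projection $\mathbf{Q}_{T-1}$ is symmetric with eigenvalues $q\in\{0,1\}$; the scalar $q$ is notation introduced here for illustration only. Along each eigenvector of $\mathbf{Q}_{T-1}$, the right-hand side reduces to

```latex
\[
\frac{1}{2}\,q-\frac{1}{2}\,\frac{q^{2}}{q+c}
=\frac{1}{2}\,\frac{cq}{q+c}
=\begin{cases}
0, & q=0,\\[2pt]
\dfrac{1}{2}\,\dfrac{c}{1+c}, & q=1,
\end{cases}
\]
```

which matches equation 11 with $\lambda_{T-1}=\frac{c}{1+c}$.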

Suppose the claim holds for $t+1$, that is,

$$
\mathbf{P}_{t+1}=\frac{1}{2}\mathbf{V}_{t+1}\begin{bmatrix}0&0&\cdots&0&0\\ 0&0&\cdots&0&0\\ \vdots&\vdots&\ddots&0&0\\ 0&0&\cdots&\lambda_{t+1}&0\\ 0&0&\cdots&0&\lambda_{t+1}\end{bmatrix}\mathbf{V}_{t+1}^{\top}.
$$

Given that $\bm{\theta}_{t}^{\top}\bm{\theta}_{t}=\bm{\theta}_{t}\bm{\theta}_{t}^{\top}=\mathbf{I}$, we have $\bm{\theta}_{t}^{\top}\mathbf{V}_{t+1}=\mathbf{V}_{t}$, in which case $\mathbf{Q}_{t}$ and $\bm{\theta}_{t}^{\top}\mathbf{P}_{t+1}\bm{\theta}_{t}$ share the same basis $\mathbf{V}_{t}$. Recall the algebraic Riccati equation 8,

$$
\begin{aligned}
\mathbf{P}_{t}
&=\frac{1}{2}\mathbf{Q}_{t}+\bm{\theta}_{t}^{\top}\mathbf{P}_{t+1}\bm{\theta}_{t}-\frac{1}{2}(\mathbf{Q}_{t}+2\bm{\theta}_{t}^{\top}\mathbf{P}_{t+1}\bm{\theta}_{t})^{\top}(\mathbf{Q}_{t}+2\bm{\theta}_{t}^{\top}\mathbf{P}_{t+1}\bm{\theta}_{t}+c\mathbf{I})^{-1}(\mathbf{Q}_{t}+2\bm{\theta}_{t}^{\top}\mathbf{P}_{t+1}\bm{\theta}_{t}),\\
&=\frac{1}{2}\mathbf{V}_{t}\begin{bmatrix}0&\cdots&0\\ 0&\cdots&0\\ \vdots&\ddots&0\\ 0&\cdots&1+\lambda_{t+1}\end{bmatrix}\mathbf{V}_{t}^{\top}
-\frac{1}{2}\mathbf{V}_{t}\begin{bmatrix}0&\cdots&0\\ 0&\cdots&0\\ \vdots&\ddots&0\\ 0&\cdots&(1+\lambda_{t+1})^{2}(1+\lambda_{t+1}+c)^{-1}\end{bmatrix}\mathbf{V}_{t}^{\top},\\
&=\frac{1}{2}\mathbf{V}_{t}\begin{bmatrix}0&0&\cdots&0\\ 0&0&\cdots&0\\ \vdots&\vdots&\ddots&0\\ 0&0&\cdots&\lambda_{t}=\frac{c(1+\lambda_{t+1})}{1+\lambda_{t+1}+c}\end{bmatrix}\mathbf{V}_{t}^{\top}.
\end{aligned}
$$
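The induction can also be checked numerically. The sketch below is an illustration under stated assumptions, not the authors' code: it draws random orthogonal $\bm{\theta}_{t}$ by QR factorization, builds bases that satisfy $\bm{\theta}_{t}^{\top}\mathbf{V}_{t+1}=\mathbf{V}_{t}$, takes $\mathbf{Q}_{t}=\mathbf{V}_{t}\mathbf{D}\mathbf{V}_{t}^{\top}$ with a 0/1 diagonal $\mathbf{D}$, and compares the matrix recursion of equation 8 with the closed form of equation 11. The names `random_orthogonal` and `check_lemma5`, the sizes `d`, `r`, `T`, and the value of `c` are hypothetical choices.

```python
import numpy as np


def random_orthogonal(d, rng):
    """Draw a d x d orthogonal matrix via QR of a Gaussian matrix."""
    q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    return q


def check_lemma5(d=6, r=4, T=5, c=0.7, seed=0):
    """Return the largest entrywise gap between recursion and closed form."""
    rng = np.random.default_rng(seed)
    thetas = [random_orthogonal(d, rng) for _ in range(T)]

    # Bases chosen so that theta_t^T V_{t+1} = V_t for every layer.
    V = [None] * (T + 1)
    V[T] = random_orthogonal(d, rng)
    for t in range(T - 1, -1, -1):
        V[t] = thetas[t].T @ V[t + 1]

    # Diagonal pattern of Q_t: r zeros followed by (d - r) ones.
    D = np.diag(np.r_[np.zeros(r), np.ones(d - r)])
    eye = np.eye(d)

    P = np.zeros((d, d))  # terminal condition P_T = 0
    lam = 0.0             # lambda_T = 0
    worst = 0.0
    for t in range(T - 1, -1, -1):
        Q = V[t] @ D @ V[t].T
        carried = thetas[t].T @ P @ thetas[t]  # theta_t^T P_{t+1} theta_t
        M = Q + 2.0 * carried
        P = 0.5 * Q + carried - 0.5 * M.T @ np.linalg.solve(M + c * eye, M)
        lam = c * (1.0 + lam) / (1.0 + lam + c)
        closed_form = 0.5 * lam * Q            # (1/2) V_t (lambda_t D) V_t^T
        worst = max(worst, float(np.abs(P - closed_form).max()))
    return worst


print(check_lemma5())
```

Under these assumptions the returned gap should be at the level of floating-point round-off, since both sides share the eigenvectors $\mathbf{V}_{t}$ and differ only through the scalar recursion for $\lambda_{t}$.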

Combining the optimal control solution in equation 9 with Lemma 5, we reach the following analytic formulation.

See Proposition 3.

9 Appendix B

Recalling the optimal control formulation in Proposition 3, we define a control gain matrix $\mathbf{K}_{t}$ as

$$
\mathbf{K}_{t}=-\mathbf{V}_{t}\begin{bmatrix}0&0&\cdots&0\\ 0&0&\cdots&0\\ \vdots&\vdots&\ddots&0\\ 0&0&\cdots&1-\frac{c}{1+\lambda_{t+1}+c}\end{bmatrix}\mathbf{V}_{t}^{\top}.
$$

Let $\bm{\theta}_{t}$ represent the $t^{\text{th}}$ linear transformation, and let $\pi:\mathbb{R}^{d}\rightarrow\mathbb{R}^{d}$ be the closed-loop controller. We denote the unperturbed state at time $t$ as $\mathbf{x}_{t}$, and the controlled state with perturbation $\mathbf{z}$ applied at the initial condition as $\overline{\mathbf{x}}_{t}$,

$$
\overline{\mathbf{x}}_{t+1}=\bm{\theta}_{t}\big(\overline{\mathbf{x}}_{t}+\pi_{t}(\overline{\mathbf{x}}_{t})\big),\qquad \overline{\mathbf{x}}_{0}=\mathbf{x}_{0}+\mathbf{z}.
$$

The difference between the controlled system with perturbation applied at the initial condition and the uncontrolled system without perturbation is

$$
\begin{aligned}
\overline{\mathbf{x}}_{t+1}-\mathbf{x}_{t+1}
&=\bm{\theta}_{t}\big(\overline{\mathbf{x}}_{t}+\pi_{t}(\overline{\mathbf{x}}_{t})-\mathbf{x}_{t}\big),\\
&=\bm{\theta}_{t}(\overline{\mathbf{x}}_{t}-\mathbf{K}_{t}\overline{\mathbf{x}}_{t}-\mathbf{x}_{t}),\\
&=\bm{\theta}_{t}(\mathbf{I}-\mathbf{K}_{t})\overline{\mathbf{x}}_{t}-\bm{\theta}_{t}\mathbf{x}_{t}+\bm{\theta}_{t}\mathbf{K}_{t}\mathbf{x}_{t},\\
&=\bm{\theta}_{t}(\mathbf{I}-\mathbf{K}_{t})(\overline{\mathbf{x}}_{t}-\mathbf{x}_{t}),
\end{aligned}
\tag{12}
$$

where $\bm{\theta}_{t}\mathbf{K}_{t}\mathbf{x}_{t}=\mathbf{0}$ since $\mathbf{x}_{t}$ is in the null space of the control gain matrix $\mathbf{K}_{t}$.
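The error recursion in equation 12 can be simulated on a toy orthogonal network. The sketch below is illustrative only and rests on assumptions that are not fixed by the text: random orthogonal layers, bases propagated as $\mathbf{V}_{t+1}=\bm{\theta}_{t}\mathbf{V}_{t}$, an unperturbed input in the span of the first `r` columns of $\mathbf{V}_{0}$, and $\mathbf{I}-\mathbf{K}_{t}$ taken directly in the diagonal form $\mathbf{V}_{t}\,\mathrm{diag}(1,\dots,1,\alpha_{t},\dots,\alpha_{t})\,\mathbf{V}_{t}^{\top}$ with $\alpha_{t}=\frac{c}{1+\lambda_{t+1}+c}$, which is the form used in Lemma 6. The function name `simulate_error` and all numeric values are hypothetical.

```python
import numpy as np


def simulate_error(d=6, r=4, T=5, c=0.7, seed=0):
    """Return (error norms per layer, max mismatch against equation 12)."""
    rng = np.random.default_rng(seed)
    thetas = [np.linalg.qr(rng.standard_normal((d, d)))[0] for _ in range(T)]
    V = np.linalg.qr(rng.standard_normal((d, d)))[0]  # V_0

    # lambda_0, ..., lambda_T from the backward recursion, lambda_T = 0.
    lam = np.zeros(T + 1)
    for t in range(T - 1, -1, -1):
        lam[t] = c * (1.0 + lam[t + 1]) / (1.0 + lam[t + 1] + c)

    x = V[:, :r] @ rng.standard_normal(r)  # unperturbed state, null space of K_0
    z = rng.standard_normal(d)             # perturbation at the input
    x_bar = x + z

    errors = [float(np.linalg.norm(x_bar - x))]
    mismatch = 0.0
    for t in range(T):
        alpha = c / (1.0 + lam[t + 1] + c)
        diag = np.r_[np.ones(r), np.full(d - r, alpha)]
        I_minus_K = V @ np.diag(diag) @ V.T
        predicted = thetas[t] @ I_minus_K @ (x_bar - x)  # right side of eq. 12
        x_bar = thetas[t] @ (I_minus_K @ x_bar)          # controlled step
        x = thetas[t] @ x                                # uncontrolled step
        mismatch = max(mismatch, float(np.linalg.norm((x_bar - x) - predicted)))
        errors.append(float(np.linalg.norm(x_bar - x)))
        V = thetas[t] @ V                                # V_{t+1} = theta_t V_t
    return errors, mismatch


errors, mismatch = simulate_error()
print(errors, mismatch)
```

In this setting the error norm cannot grow from one layer to the next, because $\bm{\theta}_{t}$ preserves norms and $\mathbf{I}-\mathbf{K}_{t}$ leaves the component along the first `r` columns unchanged while scaling the remaining component by $\alpha_{t}<1$.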

Lemma 6.

For $t\geq 0$, we have

$$
\mathbf{I}-\mathbf{K}_{t}=\alpha_{t}\cdot\mathbf{I}+(1-\alpha_{t})\cdot\mathbf{P}_{t},
$$

where $\mathbf{P}_{t}:=\mathbf{V}_{t}(\mathbf{V}_{t})^{\top}$ and $\alpha_{t}=\frac{c}{1+\lambda_{t+1}+c}$.

Proof.

Recalling equation 12, $\mathbf{I}-\mathbf{K}_{t}$ can be expressed as

$$
\mathbf{I}-\mathbf{K}_{t}=\mathbf{V}_{t}\begin{bmatrix}1&0&\cdots&0\\ 0&1&\cdots&0\\ \vdots&\vdots&\ddots&0\\ 0&0&\cdots&\frac{c}{1+\lambda_{t+1}+c}\end{bmatrix}\mathbf{V}_{t}^{\top},
$$

where the first $r$ diagonal elements are $1$, and the last $(d-r)$ diagonal elements are $\frac{c}{1+\lambda_{t+1}+c}$. Denoting the first $r$ columns of $\mathbf{V}_{t}$ by $\mathbf{V}_{t}^{r}$ and the last $(d-r)$ columns by $\hat{\mathbf{V}}_{t}^{r}$, it can be further shown that

$$
\begin{aligned}
\mathbf{I}-\mathbf{K}_{t}
&=\mathbf{V}_{t}^{r}(\mathbf{V}_{t}^{r})^{\top}+\frac{c}{1+\lambda_{t+1}+c}\big(\hat{\mathbf{V}}_{t}^{r}(\hat{\mathbf{V}}_{t}^{r})^{\top}\big),\\
&=\mathbf{P}_{t}+\alpha_{t}\big(\mathbf{I}-\mathbf{P}_{t}\big),\\
&=\alpha_{t}\cdot\mathbf{I}+(1-\alpha_{t})\cdot\mathbf{P}_{t},
\end{aligned}
$$

where $\alpha_{t}=\frac{c}{1+\lambda_{t+1}+c}$. ∎
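The identity just derived can be checked numerically. The following is a minimal NumPy sketch, not the authors' implementation: the dimensions `d`, `r` and the values of `c` and `lam` (standing in for $\lambda_{t+1}$) are arbitrary illustrative assumptions, and the orthogonal matrix `V` is drawn at random.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 6, 2            # ambient dimension and embedding rank (illustrative values)
c, lam = 0.5, 2.0      # hypothetical regularization c and multiplier lambda_{t+1}
alpha = c / (1.0 + lam + c)

# Random orthogonal V; its first r columns play the role of V_t^r.
V, _ = np.linalg.qr(rng.standard_normal((d, d)))
V_r, V_hat = V[:, :r], V[:, r:]
P = V_r @ V_r.T        # orthogonal projection onto span(V_r)

# V diag(1, ..., 1, alpha, ..., alpha) V^T with r ones and (d - r) alphas.
diag = np.concatenate([np.ones(r), np.full(d - r, alpha)])
lhs = V @ np.diag(diag) @ V.T

# The two equivalent forms from the derivation.
mid = V_r @ V_r.T + alpha * (V_hat @ V_hat.T)
rhs = alpha * np.eye(d) + (1.0 - alpha) * P

print(np.allclose(lhs, mid), np.allclose(lhs, rhs))
```

Because $\alpha_t \in (0,1)$ whenever $c>0$ and $1+\lambda_{t+1}>0$, the matrix leaves components in the embedding subspace untouched and shrinks the orthogonal components by the factor $\alpha_t$.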

In this formulation, the input state space $Z$ is decomposed into a direct sum of two orthogonal subspaces, $Z=Z^{\parallel}\oplus Z^{\perp}$. Here $Z^{\parallel}$ is the linear embedding subspace that contains the input data, i.e., $\mathbf{x}_{0}\in Z^{\parallel}$ for all pairs $(\mathbf{x},y)$ sampled from the distribution $\mathcal{D}$, and $Z^{\perp}$ is the orthogonal complement of $Z^{\parallel}$. Analogously, the state space at any timestep $t$ is decomposed as $Z_{t}=Z_{t}^{\parallel}\oplus Z_{t}^{\perp}$.
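As an illustration of this decomposition, the sketch below splits an arbitrary state into its components in $Z^{\parallel}$ and $Z^{\perp}$. It is a hedged example: the synthetic data matrix `X0`, the dimensions, and the use of an SVD to obtain a basis of the embedding subspace are assumptions made for the demonstration only.

```python
import numpy as np

rng = np.random.default_rng(1)
d, r = 5, 2

# Hypothetical clean inputs x_0 lying in an r-dimensional subspace of R^d.
basis = rng.standard_normal((d, r))
X0 = basis @ rng.standard_normal((r, 100))   # columns are samples

# Orthonormal basis of the embedding subspace from the SVD of the data.
U, S, _ = np.linalg.svd(X0, full_matrices=False)
V_r = U[:, :r]
P = V_r @ V_r.T                  # orthogonal projection onto Z^par

z = rng.standard_normal(d)       # arbitrary (e.g. perturbed) state
z_par = P @ z                    # component in Z^par
z_perp = (np.eye(d) - P) @ z     # component in Z^perp

print(np.allclose(z_par + z_perp, z), abs(z_par @ z_perp) < 1e-10)
```

The clean inputs satisfy `P @ X0 == X0` up to rounding, while a perturbation generally has a nonzero `z_perp` component.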

Lemma 7.

For $t\geq 0$, let $\mathbf{P}_{t}^{s}$ be defined as follows,

\[
\begin{cases}\mathbf{P}_{t}^{0}:=\mathbf{P}_{t},\\ \mathbf{P}_{t}^{(s+1)}:=\bm{\theta}_{t-s-1}^{-1}\mathbf{P}_{t}^{s}\bm{\theta}_{t-s-1},\quad s=0,1,\ldots,t-1.\end{cases}
\]

Then

  1. $\mathbf{P}_{t}^{s}$ is a projection.

  2. $\mathbf{P}_{t}^{s}$ is a projection onto $Z^{\parallel}_{t-s}$, i.e., $\mathrm{range}(\mathbf{P}_{t}^{s})=Z^{\parallel}_{t-s}$.

  3. If all $\bm{\theta}_{t}$ are orthogonal, then $\mathbf{P}_{t}^{t}=\mathbf{P}_{0}$, $\forall t$, where $\mathbf{P}_{0}$ is the orthogonal projection onto $Z^{\parallel}_{0}$.

Proof.
  1. We prove it by induction on $s$ for each $t$. For $s=0$, $\mathbf{P}_{t}^{0}=\mathbf{P}_{t}$, which is a projection by definition. Suppose the claim holds for $s$, i.e., $\mathbf{P}_{t}^{s}=\mathbf{P}_{t}^{s}\mathbf{P}_{t}^{s}$ (recall that $\mathbf{P}$ is a projection if $\mathbf{P}=\mathbf{P}^{2}$). Then for $(s+1)$,

    \begin{align*}
    (\mathbf{P}_{t}^{s+1})^{2} &= \big(\bm{\theta}_{t-s-1}^{-1}\mathbf{P}_{t}^{s}\bm{\theta}_{t-s-1}\big)^{2},\\
    &= \bm{\theta}_{t-s-1}^{-1}\big(\mathbf{P}_{t}^{s}\big)^{2}\bm{\theta}_{t-s-1},\\
    &= \bm{\theta}_{t-s-1}^{-1}\mathbf{P}_{t}^{s}\bm{\theta}_{t-s-1},\\
    &= \mathbf{P}_{t}^{s+1}.
    \end{align*}
  2. We prove it by induction on $s$ for each $t$. For $s=0$, $\mathbf{P}_{t}^{0}=\mathbf{P}_{t}$, which is the orthogonal projection onto $Z^{\parallel}_{t}$. Suppose the claim holds for $s$, i.e., $\mathbf{P}_{t}^{s}$ is a projection onto $Z^{\parallel}_{t-s}$. Then for $(s+1)$, $\mathbf{P}_{t}^{s+1}=\bm{\theta}_{t-s-1}^{-1}\mathbf{P}_{t}^{s}\bm{\theta}_{t-s-1}$, which implies

    \begin{align*}
    \mathrm{range}(\mathbf{P}_{t}^{s+1}) &= \mathrm{range}(\bm{\theta}_{t-s-1}^{-1}\mathbf{P}_{t}^{s}),\\
    &= \{\bm{\theta}_{t-s-1}^{-1}\mathbf{x}:\mathbf{x}\in Z^{\parallel}_{t-s}\},\\
    &= Z^{\parallel}_{t-s-1}.
    \end{align*}

    The first equality holds because $\bm{\theta}_{t-s-1}$ is invertible, and the last one uses $Z^{\parallel}_{t-s}=\bm{\theta}_{t-s-1}Z^{\parallel}_{t-s-1}$.
  3. If every $\bm{\theta}_{t}$ is orthogonal, then $\bm{\theta}_{t}^{-1}=\bm{\theta}_{t}^{\top}$. Since $\mathbf{P}_{t}^{0}=\mathbf{P}_{t}$ is symmetric, induction on $s$ gives

    \begin{align*}
    \mathbf{P}_{t}^{s+1} &= \bm{\theta}_{t-s-1}^{-1}\mathbf{P}_{t}^{s}\bm{\theta}_{t-s-1},\\
    &= \bm{\theta}_{t-s-1}^{\top}\mathbf{P}_{t}^{s}\bm{\theta}_{t-s-1},\\
    &= (\mathbf{P}_{t}^{s+1})^{\top}.
    \end{align*}

    Hence $\mathbf{P}_{t}^{s+1}$ is symmetric and, by items 1 and 2, it is an orthogonal projection onto $Z^{\parallel}_{t-s-1}$. In particular, $\mathbf{P}_{t}^{t}$ is an orthogonal projection onto $Z^{\parallel}_{0}$. Since the orthogonal projection onto a given subspace is unique, $\mathbf{P}_{t}^{t}=\mathbf{P}_{0}$, $\forall t$.
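The three claims of Lemma 7 can be sanity-checked numerically. The sketch below is illustrative only: the helper names `orth_proj`, `conjugated_projections`, and `subspace_bases` are hypothetical, the layers are random matrices, and the subspaces are assumed to propagate as $Z^{\parallel}_{s+1}=\bm{\theta}_{s}Z^{\parallel}_{s}$, which is the relation used in the proof of item 2.

```python
import numpy as np

def orth_proj(B):
    """Orthogonal projection onto the column space of B (full column rank assumed)."""
    Q, _ = np.linalg.qr(B)
    return Q @ Q.T

def conjugated_projections(thetas, P_t):
    """Return [P_t^0, ..., P_t^t] with P_t^{s+1} = theta_{t-s-1}^{-1} P_t^s theta_{t-s-1}."""
    t = len(thetas)
    Ps = [P_t]
    for s in range(t):
        th = thetas[t - s - 1]
        Ps.append(np.linalg.solve(th, Ps[-1] @ th))  # th^{-1} (P th)
    return Ps

def subspace_bases(thetas, B0):
    """Bases of Z_0, ..., Z_t under the assumption Z_{s+1} = theta_s Z_s."""
    bases = [B0]
    for th in thetas:
        bases.append(th @ bases[-1])
    return bases

rng = np.random.default_rng(2)
d, r, t = 5, 2, 3
B0 = rng.standard_normal((d, r))

# Items 1 and 2: generic invertible (non-orthogonal) layers.
thetas = [np.eye(d) + 0.1 * rng.standard_normal((d, d)) for _ in range(t)]
bases = subspace_bases(thetas, B0)
Ps = conjugated_projections(thetas, orth_proj(bases[t]))
idempotent = all(np.allclose(P @ P, P) for P in Ps)
range_ok = all(
    np.allclose(orth_proj(bases[t - s]) @ Ps[s], Ps[s])   # range inside Z_{t-s}
    and np.linalg.matrix_rank(Ps[s]) == r                  # and of full dimension r
    for s in range(t + 1)
)

# Item 3: orthogonal layers recover the orthogonal projection onto Z_0.
Qs = [np.linalg.qr(rng.standard_normal((d, d)))[0] for _ in range(t)]
bases_q = subspace_bases(Qs, B0)
Ps_q = conjugated_projections(Qs, orth_proj(bases_q[t]))
recovers_P0 = np.allclose(Ps_q[t], orth_proj(B0))

print(idempotent, range_ok, recovers_P0)
```

For non-orthogonal layers the matrices `Ps[s]` with `s > 0` are in general not symmetric, which is why they are oblique rather than orthogonal projections.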

The following lemma uses the concept of oblique projection to show a recursive relationship that projects the $t^{\text{th}}$ state space of Eq. (12) back to the input data space.

Lemma 8.

Define for $0\leq s\leq t$,

\[
\mathbf{G}_{t}^{s}:=\alpha_{t}\cdot\mathbf{I}+(1-\alpha_{t})\mathbf{P}_{t}^{s}.
\]

Then, Eq. (12) can be written as

\[
\overline{\mathbf{x}}_{t}-\mathbf{x}_{t}=(\bm{\theta}_{t-1}\bm{\theta}_{t-2}\cdots\bm{\theta}_{0})(\mathbf{G}_{t-1}^{t-1}\mathbf{G}_{t-2}^{t-2}\cdots\mathbf{G}_{0}^{0})(\overline{\mathbf{x}}_{0}-\mathbf{x}_{0}),\quad t\geq 1.
\]
Proof.

We prove it by induction on $t$. For $t=1$, by the definition of $\mathbf{G}_{t}^{s}$ and the transformation from Lemma 6,

\begin{align*}
\overline{\mathbf{x}}_{1}-\mathbf{x}_{1} &= \bm{\theta}_{0}(\mathbf{I}-\mathbf{K}_{0})(\overline{\mathbf{x}}_{0}-\mathbf{x}_{0}), && \text{Eq. (12)},\\
&= \bm{\theta}_{0}(\alpha_{0}\cdot\mathbf{I}+(1-\alpha_{0})\cdot\mathbf{P}_{0})(\overline{\mathbf{x}}_{0}-\mathbf{x}_{0}),\\
&= \bm{\theta}_{0}\mathbf{G}_{0}^{0}(\overline{\mathbf{x}}_{0}-\mathbf{x}_{0}).
\end{align*}

Suppose that it is true for $(\overline{\mathbf{x}}_{t}-\mathbf{x}_{t})$. Then, by Lemma 6, we have

\begin{align*}
\overline{\mathbf{x}}_{t+1}-\mathbf{x}_{t+1} &= \bm{\theta}_{t}(\mathbf{I}-\mathbf{K}_{t})(\overline{\mathbf{x}}_{t}-\mathbf{x}_{t}),\\
&= \bm{\theta}_{t}(\alpha_{t}\cdot\mathbf{I}+(1-\alpha_{t})\cdot\mathbf{P}_{t})(\overline{\mathbf{x}}_{t}-\mathbf{x}_{t}), && \text{Lemma 6},\\
&= \bm{\theta}_{t}\mathbf{G}_{t}^{0}(\bm{\theta}_{t-1}\bm{\theta}_{t-2}\cdots\bm{\theta}_{0})(\mathbf{G}_{t-1}^{t-1}\mathbf{G}_{t-2}^{t-2}\cdots\mathbf{G}_{0}^{0})(\overline{\mathbf{x}}_{0}-\mathbf{x}_{0}). \tag{13}
\end{align*}

Recall the definitions $\mathbf{P}_{t}^{(s+1)}:=\bm{\theta}_{t-s-1}^{-1}\mathbf{P}_{t}^{s}\bm{\theta}_{t-s-1}$ and $\mathbf{G}_{t}^{s}:=\alpha_{t}\cdot\mathbf{I}+(1-\alpha_{t})\mathbf{P}_{t}^{s}$. Then

\begin{align*}
\mathbf{G}_{t}^{s+1} &= \alpha_{t}\cdot\mathbf{I}+(1-\alpha_{t})\cdot\mathbf{P}_{t}^{(s+1)},\\
&= \alpha_{t}\cdot\mathbf{I}+(1-\alpha_{t})\cdot\bm{\theta}_{t-s-1}^{-1}\mathbf{P}_{t}^{s}\bm{\theta}_{t-s-1},\\
&= \bm{\theta}_{t-s-1}^{-1}\big(\alpha_{t}\cdot\mathbf{I}+(1-\alpha_{t})\cdot\mathbf{P}_{t}^{s}\big)\bm{\theta}_{t-s-1},\\
&= \bm{\theta}_{t-s-1}^{-1}\mathbf{G}_{t}^{s}\bm{\theta}_{t-s-1},
\end{align*}

so $\mathbf{G}_{t}^{s}$ obeys the same conjugation relation as the oblique projections $\mathbf{P}_{t}^{s}$. Multiplying both sides on the left by $\bm{\theta}_{t-s-1}$ gives

\[
\bm{\theta}_{t-s-1}\mathbf{G}_{t}^{(s+1)}=\mathbf{G}_{t}^{s}\bm{\theta}_{t-s-1}.
\]

Applying the above identity repeatedly, for $s=0,1,\ldots,t-1$, to Eq. (13) results in

\begin{align*}
\overline{\mathbf{x}}_{t+1}-\mathbf{x}_{t+1} &= \bm{\theta}_{t}\mathbf{G}_{t}^{0}(\bm{\theta}_{t-1}\bm{\theta}_{t-2}\cdots\bm{\theta}_{0})(\mathbf{G}_{t-1}^{t-1}\mathbf{G}_{t-2}^{t-2}\cdots\mathbf{G}_{0}^{0})(\overline{\mathbf{x}}_{0}-\mathbf{x}_{0}),\\
&= (\bm{\theta}_{t}\bm{\theta}_{t-1})\mathbf{G}_{t}^{1}(\bm{\theta}_{t-2}\bm{\theta}_{t-3}\cdots\bm{\theta}_{0})(\mathbf{G}_{t-1}^{t-1}\mathbf{G}_{t-2}^{t-2}\cdots\mathbf{G}_{0}^{0})(\overline{\mathbf{x}}_{0}-\mathbf{x}_{0}),\\
&= (\bm{\theta}_{t}\bm{\theta}_{t-1}\bm{\theta}_{t-2})\mathbf{G}_{t}^{2}(\bm{\theta}_{t-3}\bm{\theta}_{t-4}\cdots\bm{\theta}_{0})(\mathbf{G}_{t-1}^{t-1}\mathbf{G}_{t-2}^{t-2}\cdots\mathbf{G}_{0}^{0})(\overline{\mathbf{x}}_{0}-\mathbf{x}_{0}),\\
&\;\;\vdots\\
&= (\bm{\theta}_{t}\bm{\theta}_{t-1}\cdots\bm{\theta}_{0})(\mathbf{G}_{t}^{t}\mathbf{G}_{t-1}^{t-1}\cdots\mathbf{G}_{0}^{0})(\overline{\mathbf{x}}_{0}-\mathbf{x}_{0}).
\end{align*}
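The factorization in Lemma 8 can be compared against the layer-by-layer recursion on a small random instance. This is a sketch under stated assumptions rather than the paper's code: the layers `thetas`, the values `alphas`, the helper `G`, and the propagation rule $Z^{\parallel}_{t+1}=\bm{\theta}_{t}Z^{\parallel}_{t}$ used to build each $\mathbf{P}_{t}$ are all illustrative choices.

```python
import numpy as np

def orth_proj(B):
    """Orthogonal projection onto the column space of B."""
    Q, _ = np.linalg.qr(B)
    return Q @ Q.T

rng = np.random.default_rng(3)
d, r, T = 5, 2, 4
thetas = [np.eye(d) + 0.1 * rng.standard_normal((d, d)) for _ in range(T)]
alphas = rng.uniform(0.1, 0.9, size=T)     # hypothetical alpha_t values in (0, 1)

# P_t is the orthogonal projection onto Z_t, assuming Z_{t+1} = theta_t Z_t.
bases = [rng.standard_normal((d, r))]
for th in thetas:
    bases.append(th @ bases[-1])
Ps = [orth_proj(B) for B in bases[:T]]

def G(t, s):
    """G_t^s = alpha_t I + (1 - alpha_t) P_t^s, with P_t^s built by s conjugations."""
    P = Ps[t]
    for k in range(s):
        th = thetas[t - k - 1]
        P = np.linalg.solve(th, P @ th)    # theta^{-1} P theta
    return alphas[t] * np.eye(d) + (1.0 - alphas[t]) * P

e0 = rng.standard_normal(d)                # initial state difference

# Recursion: e_{t+1} = theta_t G_t^0 e_t.
e = e0.copy()
for t in range(T):
    e = thetas[t] @ G(t, 0) @ e

# Closed form: (theta_{T-1} ... theta_0)(G_{T-1}^{T-1} ... G_0^0) e_0.
Theta, Gprod = np.eye(d), np.eye(d)
for t in range(T):
    Theta = thetas[t] @ Theta
    Gprod = G(t, t) @ Gprod
closed = Theta @ Gprod @ e0

# Commutation identity theta_{t-s-1} G_t^{s+1} = G_t^s theta_{t-s-1} for t=2, s=0.
commute_ok = np.allclose(thetas[1] @ G(2, 1), G(2, 0) @ thetas[1])

print(np.allclose(e, closed), commute_ok)
```

The check does not require orthogonal layers: the factorization relies only on the conjugation relation between $\mathbf{G}_{t}^{s+1}$ and $\mathbf{G}_{t}^{s}$.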

Lemma 9.

Let

\[
\mathbf{F}_{t}:=\mathbf{G}_{t-1}^{(t-1)}\mathbf{G}_{t-2}^{(t-2)}\cdots\mathbf{G}_{0}^{0},\quad t\geq 1.
\]

Then,

\[
\mathbf{F}_{t}=\prod_{s=0}^{t-1}\alpha_{s}\cdot\mathbf{I}+\Big(1-\prod_{s=0}^{t-1}\alpha_{s}\Big)\cdot\mathbf{P}_{0}.
\]
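As a sanity check on the statement of Lemma 9, the closed form can be compared with the explicit product on a random instance. The sketch below assumes orthogonal layers, so that $\mathbf{P}_{t}^{t}=\mathbf{P}_{0}$ by item 3 of Lemma 7; the layers, the values `alphas`, and the dimensions are hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
d, r, T = 5, 2, 4

# Assumption: orthogonal layers, so that P_t^t = P_0 (Lemma 7, item 3).
thetas = [np.linalg.qr(rng.standard_normal((d, d)))[0] for _ in range(T)]
alphas = rng.uniform(0.1, 0.9, size=T)     # hypothetical alpha_t values in (0, 1)

B0, _ = np.linalg.qr(rng.standard_normal((d, r)))   # orthonormal basis of Z_0
P0 = B0 @ B0.T

F = np.eye(d)      # accumulates G_{t-1}^{t-1} ... G_0^0
B = B0             # orthonormal basis of Z_t (orthogonal maps keep it orthonormal)
for t in range(T):
    P = B @ B.T                        # P_t^0 = P_t
    for k in range(t):                 # t conjugations give P_t^t
        th = thetas[t - k - 1]
        P = th.T @ P @ th              # theta^{-1} = theta^T for orthogonal theta
    G_tt = alphas[t] * np.eye(d) + (1.0 - alphas[t]) * P
    F = G_tt @ F
    B = thetas[t] @ B                  # basis of Z_{t+1} = theta_t Z_t

a = np.prod(alphas)
closed = a * np.eye(d) + (1.0 - a) * P0

print(np.allclose(F, closed))
```

Since each $\alpha_{s}\in(0,1)$, the product $\prod_{s}\alpha_{s}$ decays with depth, so $\mathbf{F}_{t}$ approaches the projection $\mathbf{P}_{0}$ as more controlled layers are composed.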
Proof.

We prove it by induction on $t$. Recall the definition $\mathbf{G}_t^s := \alpha_t \cdot \mathbf{I} + (1 - \alpha_t) \cdot \mathbf{P}_t^s$. When $t = 1$,

$$\mathbf{F}_1 = \mathbf{G}_0^0 = \alpha_0 \cdot \mathbf{I} + (1 - \alpha_0) \cdot \mathbf{P}_0.$$

Suppose that the claim holds for $t$, that is,

$$\mathbf{F}_t = \prod_{s=0}^{t-1} \alpha_s \cdot \mathbf{I} + \Big(1 - \prod_{s=0}^{t-1} \alpha_s\Big) \cdot \mathbf{P}_0.$$

Then, for $(t+1)$,

$$\begin{aligned}
\mathbf{F}_{t+1} &= \mathbf{G}_t^t \mathbf{F}_t, \\
&= \big(\alpha_t \cdot \mathbf{I} + (1 - \alpha_t) \cdot \mathbf{P}_t^t\big) \mathbf{F}_t, \\
&= \big(\alpha_t \cdot \mathbf{I} + (1 - \alpha_t) \cdot \mathbf{P}_t^t\big) \Big(\prod_{s=0}^{t-1} \alpha_s \cdot \mathbf{I} + \Big(1 - \prod_{s=0}^{t-1} \alpha_s\Big) \cdot \mathbf{P}_0\Big), \\
&= \prod_{s=0}^{t} \alpha_s \cdot \mathbf{I} + \alpha_t \Big(1 - \prod_{s=0}^{t-1} \alpha_s\Big) \cdot \mathbf{P}_0 + (1 - \alpha_t) \prod_{s=0}^{t-1} \alpha_s \cdot \mathbf{P}_t^t + (1 - \alpha_t) \Big(1 - \prod_{s=0}^{t-1} \alpha_s\Big) \cdot \mathbf{P}_t^t \mathbf{P}_0.
\end{aligned}$$

Recall Lemma 7: if all $\bm{\theta}_t$ are orthogonal, then $\mathbf{P}_t^t = \mathbf{P}_0$. Since the projection $\mathbf{P}_0$ is idempotent, it follows that $\mathbf{P}_t^t \mathbf{P}_0 = \mathbf{P}_0 \mathbf{P}_0 = \mathbf{P}_0$. Hence,

$$\begin{aligned}
\mathbf{F}_{t+1} &= \prod_{s=0}^{t} \alpha_s \cdot \mathbf{I} + \alpha_t \Big(1 - \prod_{s=0}^{t-1} \alpha_s\Big) \cdot \mathbf{P}_0 + (1 - \alpha_t) \prod_{s=0}^{t-1} \alpha_s \cdot \mathbf{P}_0 + (1 - \alpha_t) \Big(1 - \prod_{s=0}^{t-1} \alpha_s\Big) \cdot \mathbf{P}_0, \\
&= \prod_{s=0}^{t} \alpha_s \cdot \mathbf{I} + \bigg(\alpha_t - \prod_{s=0}^{t} \alpha_s + \prod_{s=0}^{t-1} \alpha_s - \prod_{s=0}^{t} \alpha_s + 1 - \alpha_t - \prod_{s=0}^{t-1} \alpha_s + \prod_{s=0}^{t} \alpha_s\bigg) \cdot \mathbf{P}_0, \\
&= \prod_{s=0}^{t} \alpha_s \cdot \mathbf{I} + \bigg(1 - \prod_{s=0}^{t} \alpha_s\bigg) \cdot \mathbf{P}_0.
\end{aligned}$$
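Lemma 9 can be sanity-checked numerically. The following is a minimal pure-Python sketch, assuming (as in the orthogonal case used in the proof) that every $\mathbf{P}_s^s$ equals one fixed projection $\mathbf{P}_0$. The helper names (`rank_one_projection`, `lemma9_gap`) and the example values of $\alpha_s$ are illustrative assumptions and are not taken from the paper's implementation.

```python
def identity(d):
    """d x d identity matrix as a list of rows."""
    return [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]


def matmul(A, B):
    """Plain matrix product of two square matrices."""
    n = len(B)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(len(A))]


def combo(a, A, b, B):
    """Elementwise a*A + b*B."""
    return [[a * x + b * y for x, y in zip(ra, rb)] for ra, rb in zip(A, B)]


def rank_one_projection(v):
    """Orthogonal projection onto span{v}: v v^T / (v^T v). Illustrative P_0."""
    n = sum(x * x for x in v)
    return [[vi * vj / n for vj in v] for vi in v]


def lemma9_gap(alphas, P0):
    """Max entrywise gap between G_{t-1} ... G_0 and the closed form of Lemma 9."""
    I = identity(len(P0))
    F, prod = I, 1.0
    for a in alphas:                        # s = 0, 1, ..., t-1
        G = combo(a, I, 1.0 - a, P0)        # G_s = a_s I + (1 - a_s) P_0
        F = matmul(G, F)                    # F_{s+1} = G_s F_s
        prod *= a
    closed = combo(prod, I, 1.0 - prod, P0)
    return max(abs(x - y) for rf, rc in zip(F, closed) for x, y in zip(rf, rc))


gap = lemma9_gap([0.9, 0.5, 0.7, 0.2], rank_one_projection([1.0, 2.0, 2.0]))
```

For a genuine projection the gap stays at floating-point rounding level. If `P0` is replaced by a matrix that is not idempotent, the closed form no longer matches the product, which is exactly where the proof uses $\mathbf{P}_t^t \mathbf{P}_0 = \mathbf{P}_0$.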

See 4

Proof.

The input perturbation $\mathbf{z} = \overline{\mathbf{x}}_0 - \mathbf{x}_0$ can be decomposed as $\mathbf{z} = \mathbf{z}^{\parallel} + \mathbf{z}^{\perp}$, where $\mathbf{z}^{\parallel} \in Z^{\parallel}_0$ and $\mathbf{z}^{\perp} \in Z^{\perp}_0$, and $\mathbf{z}^{\parallel}$ and $\mathbf{z}^{\perp}$ are vectors such that

  • $\mathbf{z}^{\parallel} \cdot \mathbf{z}^{\perp} = 0$ almost surely.

  • $\mathbf{z}^{\parallel}$, $\mathbf{z}^{\perp}$ have uncorrelated components.

  • $\mathbf{z}^{\parallel} \in Z^{\parallel}$, and $\mathbf{z}^{\perp} \in \mathcal{Z}^{\perp}$.

Since the layer transformations $\bm{\theta}_t$ are orthogonal matrices for all $t$, recall the dynamical system Eq. (12) and Lemma 8,

$$\begin{aligned}
\lVert \overline{\mathbf{x}}_t - \mathbf{x}_t \rVert_2^2 &= \lVert \bm{\theta}_t (\mathbf{I} - \mathbf{K}_t) \bm{\theta}_{t-1} (\mathbf{I} - \mathbf{K}_{t-1}) \cdots \bm{\theta}_0 (\mathbf{I} - \mathbf{K}_0) \mathbf{z} \rVert_2^2, \\
&= \lVert (\bm{\theta}_{t-1} \bm{\theta}_{t-2} \cdots \bm{\theta}_0) (\mathbf{G}_{t-1}^{t-1} \cdots \mathbf{G}_0^0) \mathbf{z} \rVert_2^2, \\
&= \lVert (\mathbf{G}_{t-1}^{t-1} \cdots \mathbf{G}_0^0) \mathbf{z} \rVert_2^2.
\end{aligned} \tag{14}$$

For the term $\lVert (\mathbf{G}_{t-1}^{t-1} \cdots \mathbf{G}_0^0) \mathbf{z} \rVert_2^2$, recall Lemma 9,

$$\begin{aligned}
&\lVert (\mathbf{G}_{t-1}^{t-1} \cdots \mathbf{G}_0^0) \mathbf{z} \rVert_2^2 \\
&= \Big\lVert \Big(\prod_{s=0}^{t-1} \alpha_s \cdot \mathbf{I} + \Big(1 - \prod_{s=0}^{t-1} \alpha_s\Big) \mathbf{P}_0\Big) \mathbf{z} \Big\rVert_2^2, \\
&= \Big\lVert \prod_{s=0}^{t-1} \alpha_s \cdot \mathbf{z} + \Big(1 - \prod_{s=0}^{t-1} \alpha_s\Big) \cdot \mathbf{z}^{\parallel} \Big\rVert_2^2, \\
&= \Big(\prod_{s=0}^{t-1} \alpha_s\Big)^2 \cdot \lVert \mathbf{z} \rVert_2^2 + \Big(1 - \prod_{s=0}^{t-1} \alpha_s\Big)^2 \cdot \lVert \mathbf{z}^{\parallel} \rVert_2^2 + 2 \Big(\prod_{s=0}^{t-1} \alpha_s\Big) \Big(1 - \prod_{s=0}^{t-1} \alpha_s\Big) \mathbf{z}^{\top} \mathbf{z}^{\parallel}, \\
&= \Big(\prod_{s=0}^{t-1} \alpha_s\Big)^2 \cdot \big(\lVert \mathbf{z}^{\parallel} \rVert_2^2 + \lVert \mathbf{z}^{\perp} \rVert_2^2\big) + \Big(1 - \prod_{s=0}^{t-1} \alpha_s\Big)^2 \cdot \lVert \mathbf{z}^{\parallel} \rVert_2^2 + 2 \Big(\prod_{s=0}^{t-1} \alpha_s\Big) \Big(1 - \prod_{s=0}^{t-1} \alpha_s\Big) (\mathbf{z}^{\parallel} + \mathbf{z}^{\perp})^{\top} \mathbf{z}^{\parallel}, \\
&= \prod_{s=0}^{t-1} \alpha_s^2 \cdot \lVert \mathbf{z}^{\perp} \rVert_2^2 + \bigg(\prod_{s=0}^{t-1} \alpha_s^2 + \Big(1 - \prod_{s=0}^{t-1} \alpha_s\Big)^2 + 2 \Big(\prod_{s=0}^{t-1} \alpha_s\Big) \Big(1 - \prod_{s=0}^{t-1} \alpha_s\Big)\bigg) \cdot \lVert \mathbf{z}^{\parallel} \rVert_2^2, \\
&= \prod_{s=0}^{t-1} \alpha_s^2 \cdot \lVert \mathbf{z}^{\perp} \rVert_2^2 + \lVert \mathbf{z}^{\parallel} \rVert_2^2,
\end{aligned}$$

where the last step uses the fact that the bracketed coefficient is the perfect square $\big(\prod_{s=0}^{t-1} \alpha_s + 1 - \prod_{s=0}^{t-1} \alpha_s\big)^2 = 1$.

Recall the error computation in Eq. (14),

$$\lVert \overline{\mathbf{x}}_{\epsilon,t} - \mathbf{x}_t \rVert_2^2 = \prod_{s=0}^{t-1} \alpha_s^2 \cdot \lVert \mathbf{z}^{\perp} \rVert_2^2 + \lVert \mathbf{z}^{\parallel} \rVert_2^2.$$
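The final identity can be checked on concrete vectors. The sketch below is a hedged illustration in pure Python: it assumes the decomposition $\mathbf{z} = \mathbf{z}^{\parallel} + \mathbf{z}^{\perp}$ is given explicitly with $\mathbf{P}_0 \mathbf{z} = \mathbf{z}^{\parallel}$, and the function names and example vectors are hypothetical.

```python
import math


def controlled_error_sq(alphas, z_par, z_perp):
    """Squared norm of (p*I + (1-p)*P0) z, where p = prod(alphas) and P0 z = z_par."""
    p = math.prod(alphas)
    # (p*I + (1-p)*P0) z = p*(z_par + z_perp) + (1-p)*z_par, computed per component
    out = [p * (a + b) + (1.0 - p) * a for a, b in zip(z_par, z_perp)]
    return sum(c * c for c in out)


def closed_form_error_sq(alphas, z_par, z_perp):
    """Closed form: prod(alphas)^2 * ||z_perp||^2 + ||z_par||^2."""
    p = math.prod(alphas)
    return p ** 2 * sum(b * b for b in z_perp) + sum(a * a for a in z_par)


# Illustrative vectors with z_par orthogonal to z_perp (dot product is 0).
z_par, z_perp = [2.0, 1.0, 0.0], [-1.0, 2.0, 2.0]
alphas = [0.9, 0.5, 0.7]
direct = controlled_error_sq(alphas, z_par, z_perp)
closed = closed_form_error_sq(alphas, z_par, z_perp)
```

The two values agree up to rounding. The closed form also makes the behaviour explicit: as the product of the $\alpha_s$ shrinks toward zero, the orthogonal component $\mathbf{z}^{\perp}$ is suppressed, while the component $\mathbf{z}^{\parallel}$ is left unchanged.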

10 Appendix C

Tables 8 and 9 report detailed numerical results on the SNLI and MNLI datasets, respectively. Table 8 shows that the PD control mechanism significantly improves the robustness of both the standard and robustly trained baseline models against every type of adversarial attack. Specifically, applying PD control to the standard-trained Distilbert model raises its accuracy by 15% and 16% under TextBugger and TextFooler attacks, respectively, while incurring only a 1% accuracy reduction on the original, unperturbed dataset. Moreover, the robustly trained Distilbert model, obtained with adversarial training (AT), gains a further 12% in accuracy under both TextBugger and TextFooler attacks once PD control is added. For the MNLI dataset, the comparison between baseline models and the proposed control framework in Table 9 shows that the PD controller improves the standard-trained Distilbert's resistance to perturbations by nearly 10% on average. This improvement is smaller for robustly trained models, with an average gain of 5% across all types of perturbations.

Table 8: Accuracy (%) on the SNLI dataset, reported as baseline model / controlled model.

Standard models

| Model | None | A2T | PSO | TextBugger | TextFooler |
| --- | --- | --- | --- | --- | --- |
| Distilbert | 87.24 / 86.05 | 53.89 / 62.31 | 49.84 / 54.96 | 24.73 / 40.26 | 24.69 / 41.73 |
| RoBERTaBase | 90.87 / 90.64 | 58.36 / 64.11 | 51.44 / 54.40 | 35.90 / 43.20 | 27.03 / 37.35 |
| BERT-large | 90.36 / 89.75 | 74.18 / 75.54 | 66.84 / 67.55 | 64.13 / 64.41 | 56.37 / 58.27 |
| RoBERTaLarge | 92.39 / 92.05 | 59.40 / 64.95 | 52.15 / 56.70 | 33.72 / 42.43 | 26.43 / 37.29 |

Robust models (trained with AT)

| Model | None | A2T | PSO | TextBugger | TextFooler |
| --- | --- | --- | --- | --- | --- |
| Distilbert | 86.74 / 85.81 | 71.78 / 71.81 | 52.85 / 57.87 | 29.63 / 41.64 | 31.59 / 43.81 |
| RoBERTaBase | 90.65 / 89.87 | 76.28 / 77.08 | 53.85 / 56.45 | 35.43 / 43.35 | 29.64 / 39.39 |
| BERT-large | 90.29 / 90.33 | 86.02 / 85.76 | 69.23 / 70.38 | 69.17 / 69.55 | 63.78 / 65.27 |
| RoBERTaLarge | 92.10 / 91.62 | 81.11 / 81.62 | 55.28 / 59.71 | 34.15 / 44.74 | 28.74 / 42.44 |

Robust models (trained with FreeLB)

| Model | None | A2T | PSO | TextBugger | TextFooler |
| --- | --- | --- | --- | --- | --- |
| Distilbert | 85.68 / 84.50 | 57.75 / 62.95 | 52.53 / 56.86 | 26.68 / 37.80 | 25.47 / 39.64 |
| RoBERTaBase | 91.31 / 90.67 | 64.23 / 68.85 | 52.22 / 55.24 | 34.08 / 42.75 | 24.80 / 36.81 |
| BERT-large | 90.81 / 90.72 | 77.64 / 78.21 | 64.72 / 65.56 | 58.21 / 59.29 | 53.31 / 56.26 |
| RoBERTaLarge | 92.37 / 92.26 | 67.53 / 71.30 | 53.37 / 57.20 | 34.64 / 44.42 | 27.55 / 38.59 |

Table 9: Accuracy (%) on the MNLI dataset, reported as baseline model / controlled model.

Standard models

| Model | None | A2T | PSO | TextBugger | TextFooler |
| --- | --- | --- | --- | --- | --- |
| Distilbert | 79.39 / 76.98 | 59.43 / 64.61 | 51.81 / 59.49 | 36.02 / 47.34 | 38.78 / 50.62 |
| RoBERTaBase | 86.66 / 85.84 | 59.60 / 63.39 | 49.77 / 53.05 | 34.76 / 40.68 | 31.43 / 40.22 |
| BERT-large | 84.92 / 84.79 | 77.38 / 77.96 | 69.71 / 70.41 | 65.11 / 65.54 | 64.80 / 65.91 |
| RoBERTaLarge | 89.71 / 89.40 | 62.85 / 67.93 | 51.19 / 56.77 | 37.18 / 45.11 | 32.81 / 43.27 |

Robust models (trained with AT)

| Model | None | A2T | PSO | TextBugger | TextFooler |
| --- | --- | --- | --- | --- | --- |
| Distilbert | 79.70 / 76.50 | 66.52 / 67.71 | 57.34 / 62.90 | 40.22 / 50.00 | 45.62 / 54.93 |
| RoBERTaBase | 86.55 / 85.54 | 64.52 / 66.70 | 53.41 / 56.78 | 35.61 / 40.88 | 34.27 / 43.41 |
| BERT-large | 84.90 / 85.02 | 81.37 / 81.69 | 73.05 / 74.05 | 68.27 / 68.91 | 71.69 / 72.80 |
| RoBERTaLarge | 90.10 / 89.51 | 76.94 / 78.36 | 59.34 / 64.44 | 41.21 / 48.34 | 40.69 / 49.33 |

Robust models (trained with FreeLB)

| Model | None | A2T | PSO | TextBugger | TextFooler |
| --- | --- | --- | --- | --- | --- |
| Distilbert | 78.76 / 75.33 | 64.10 / 66.25 | 58.03 / 62.94 | 38.87 / 49.50 | 43.58 / 52.88 |
| RoBERTaBase | 86.10 / 85.59 | 61.60 / 65.31 | 51.69 / 54.93 | 36.06 / 42.42 | 33.27 / 42.21 |
| BERT-large | 85.32 / 85.62 | 79.34 / 79.64 | 72.25 / 72.68 | 65.90 / 66.58 | 67.44 / 68.49 |
| RoBERTaLarge | 90.18 / 89.81 | 67.28 / 71.61 | 53.27 / 58.04 | 36.40 / 44.91 | 32.83 / 43.84 |
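
The per-attack gains discussed in this appendix can be recomputed directly from the table entries. The short Python sketch below does this for one row; the accuracy pairs are copied from the standard Distilbert row of Table 9, while the dictionary layout and function names are illustrative assumptions.

```python
# (baseline, controlled) accuracy in %, copied from Table 9, standard Distilbert.
mnli_distilbert_standard = {
    "A2T": (59.43, 64.61),
    "PSO": (51.81, 59.49),
    "TextBugger": (36.02, 47.34),
    "TextFooler": (38.78, 50.62),
}


def gains(row):
    """Accuracy gain (controlled minus baseline) for each attack, in points."""
    return {attack: ctrl - base for attack, (base, ctrl) in row.items()}


def mean_gain(row):
    """Average gain over all attacks in the row."""
    g = gains(row)
    return sum(g.values()) / len(g)


per_attack = gains(mnli_distilbert_standard)
average = mean_gain(mnli_distilbert_standard)
```

For this row the average gain comes out to roughly 9.0 accuracy points over the four attacks. The same helper can be applied to any other row by swapping in its baseline / controlled pairs.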

11 Appendix D

An optimal control framework for robust deep neural networks.

We start with a description of the dynamical system approach to machine learning. In the dynamical system framework, we consider the input $\mathbf{x}_0 \in \mathcal{X}$ as the initial condition of a system of difference equations,

$$\mathbf{x}_{t+1} = F_t(\mathbf{x}_t + \pi_t(\mathbf{x}_t), \bm{\theta}_t),$$

where $F_t$ represents a time-varying difference equation, $\bm{\theta}_t$ are the model parameters of $F_t$, and $\pi_t: \mathbb{R}^d \rightarrow \mathbb{R}^d$ is a feedback controller that maps the current state $\mathbf{x}_t$ to a control action. The goal of optimal control is to design the feedback controllers $\{\pi_t\}_{t=0}^{T-1}$ such that some objectives are satisfied. This can be represented as the following objective function,

$$\begin{aligned}
&\min_{\overline{\pi}} \mathbb{E}_{(\mathbf{x}_0, y) \sim \mathcal{D}} \left[ J(\mathbf{x}_0, y, \overline{\pi}) \right] := \min_{\overline{\pi}} \mathbb{E}_{(\mathbf{x}_0, y) \sim \mathcal{D}} \left[ \Phi(\mathbf{x}_T, y) + \sum_{t=0}^{T-1} \mathcal{L}(\{\mathbf{x}_s\}_{s=0}^{t}, \pi_t, f_t) \right], \\
&\text{s.t.} \;\; \mathbf{x}_{t+1} = F_t(\mathbf{x}_t + \pi_t(\mathbf{x}_t), \bm{\theta}_t),
\end{aligned}$$

where $\Phi(\cdot)$ and $\mathcal{L}(\cdot)$ represent the terminal and running losses, respectively. In general, real-world objectives can be structured as terminal and running loss functions, so optimizing such an objective function aligns with achieving a real-world goal. Take the development of autonomous vehicles as an example: the task is to guide the vehicle to a destination along a specific route. This can be approached as an optimal control problem, where reaching the destination is expressed as a terminal loss function, and the deviation from the planned route is captured by a running loss function.
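To make the controlled dynamics and the objective concrete, the following is a minimal sketch in Python with NumPy. It rolls out $\mathbf{x}_{t+1} = F_t(\mathbf{x}_t + \pi_t(\mathbf{x}_t), \bm{\theta}_t)$ and evaluates $J$ for a single sample. Everything specific here is an assumption for illustration and is not taken from the paper: the `tanh` layers, the squared-error terminal loss, the placeholder running loss that only penalizes the control magnitude, and the two example controllers.

```python
import numpy as np

def controlled_rollout(x0, layers, controllers):
    """Apply x_{t+1} = F_t(x_t + pi_t(x_t)); return states x_0..x_T and the controls."""
    states, controls = [x0], []
    for F_t, pi_t in zip(layers, controllers):
        u_t = pi_t(states[-1])                 # control action from the current state
        states.append(F_t(states[-1] + u_t))   # layer acts on the corrected state
        controls.append(u_t)
    return states, controls

def objective(states, controls, y, terminal_loss, running_loss):
    """J = Phi(x_T, y) + sum_{t=0}^{T-1} L({x_s}_{s=0}^t, pi_t(x_t))."""
    total = terminal_loss(states[-1], y)
    for t, u_t in enumerate(controls):
        total += running_loss(states[: t + 1], u_t)
    return total

# Toy setup (all hypothetical): T tanh layers in dimension d, with theta_t = W_t.
rng = np.random.default_rng(0)
d, T = 4, 3
weights = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(T)]
layers = [lambda x, W=W: np.tanh(W @ x) for W in weights]
x0, y = rng.standard_normal(d), np.zeros(d)

terminal_loss = lambda x_T, y: 0.5 * np.sum((x_T - y) ** 2)
running_loss = lambda history, u: 0.05 * np.sum(u ** 2)  # placeholder: control penalty only

zero_controllers = [lambda x: np.zeros_like(x)] * T   # pi_t = 0: the uncontrolled network
shrink_controllers = [lambda x: -0.1 * x] * T         # an arbitrary linear feedback

states_zero, controls_zero = controlled_rollout(x0, layers, zero_controllers)
states_ctrl, controls_ctrl = controlled_rollout(x0, layers, shrink_controllers)
J_zero = objective(states_zero, controls_zero, y, terminal_loss, running_loss)
J_ctrl = objective(states_ctrl, controls_ctrl, y, terminal_loss, running_loss)
print(J_zero, J_ctrl)
```

With the zero controllers the rollout reduces to the ordinary forward pass, which gives a quick sanity check that the control enters only as an additive correction to each layer's input.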

The essential task of supervised learning is to approximate some function

$$
F:\mathcal{X}\rightarrow\mathcal{Y},\;\;\mathrm{where}\;\; F = F_{T-1}\circ F_{T-2}\circ\cdots\circ F_1\circ F_0,
$$

which maps inputs in $\mathcal{X}\subseteq\mathbb{R}^d$ (e.g., images, natural language sequences) to labels in $\mathcal{Y}$ (e.g., categories, numerical predictions). The objective of developing robust deep neural networks can be formulated within an optimal control framework. Here, the aim is to minimize the discrepancy between the model's predictions and the true labels through a terminal loss function, $\Phi(\mathbf{x}_T, y)$, by implementing feedback controllers. However, this ideal scenario is challenged by the practical limitation that true labels are unavailable during inference. Consequently, we instead pursue robust model predictions through the design of running loss functions. In this work, we introduce a running loss function that assesses the quality of the controlled state at each timestep $t$,

$$
\begin{aligned}
\mathcal{L}(\{\mathbf{x}_s\}_{s=0}^{t}, \pi_t, (f_t^P, f_t^I, f_t^D)) := {} & \frac{1}{2}\Big\lVert f_t^P\big(\mathbf{x}_t + \pi_t(\mathbf{x}_t)\big)\Big\rVert_2^2 + \frac{1}{2}\Big\lVert f_t^I\Big(\mathbf{x}_t + \pi_t(\mathbf{x}_t) + \sum_{s=0}^{t-1}\mathbf{x}_s\Big)\Big\rVert_2^2 \\
& + \frac{1}{2}\Big\lVert f_t^D\big(\mathbf{x}_t + \pi_t(\mathbf{x}_t) - \mathbf{x}_{t-1}\big)\Big\rVert_2^2 + \frac{c_t}{2}\lVert \pi_t(\mathbf{x}_t)\rVert_2^2.
\end{aligned}
$$

This running loss function measures the difference between the controlled state and certain embedding manifolds. These manifolds capture the structural, integrative, and derivative aspects of state embeddings. Ideally, the states from unperturbed input samples should align with these embedding manifolds. Thus, when perturbations are introduced to the input, the running loss assesses the quality of the state, and minimizing it adjusts the states to correct the effects of such perturbations. With both terminal and running loss functions defined, solving this optimal control problem amounts to generating feedback controllers $\{\pi_t\}_{t=0}^{T-1}$ such that the controlled state trajectory of a perturbed input behaves similarly to that of its unperturbed counterpart.
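The running loss can be illustrated with a short NumPy sketch under one strong assumption: each of $f_t^P$, $f_t^I$, $f_t^D$ is taken to be a linear map $\mathbf{x}\mapsto Q\mathbf{x}$, where $Q$ projects onto the orthogonal complement of a subspace, so that $Q\mathbf{x}=\mathbf{0}$ exactly when $\mathbf{x}$ lies on that (linear) manifold. The subspaces below are random and purely hypothetical; the helper names `complement_projector`, `running_loss`, and `greedy_control` are also ours. Under this assumption the loss is a strictly convex quadratic in the control $\mathbf{u}=\pi_t(\mathbf{x}_t)$ whenever $c_t>0$, and setting its gradient to zero gives a linear system for the control that minimizes the loss at a single timestep. This per-step minimizer ignores the effect of $\mathbf{u}$ on later states, so it should be read as an illustration of the loss, not as the solution of the full trajectory optimization.

```python
import numpy as np

def complement_projector(basis):
    """Orthogonal projector onto the complement of span(basis); basis has shape (d, r)."""
    q, _ = np.linalg.qr(basis)
    return np.eye(basis.shape[0]) - q @ q.T

def running_loss(u, x_t, past_sum, x_prev, Qp, Qi, Qd, c):
    """Running loss with linear f^P, f^I, f^D (an assumption); u is the control pi_t(x_t)."""
    z = x_t + u                                            # controlled state
    return 0.5 * (np.sum((Qp @ z) ** 2)                    # P term: current state
                  + np.sum((Qi @ (z + past_sum)) ** 2)     # I term: accumulated states
                  + np.sum((Qd @ (z - x_prev)) ** 2)       # D term: state difference
                  + c * np.sum(u ** 2))                    # control regularization

def greedy_control(x_t, past_sum, x_prev, Qp, Qi, Qd, c):
    """Minimize the running loss over u at one timestep by solving grad_u L = 0."""
    Ap, Ai, Ad = Qp.T @ Qp, Qi.T @ Qi, Qd.T @ Qd
    lhs = Ap + Ai + Ad + c * np.eye(x_t.size)              # positive definite for c > 0
    rhs = -(Ap @ x_t + Ai @ (x_t + past_sum) + Ad @ (x_t - x_prev))
    return np.linalg.solve(lhs, rhs)

# Hypothetical data: three states x_0, x_1, x_2 in dimension d; we control at t = 2.
rng = np.random.default_rng(0)
d, r, c = 6, 2, 0.1
Qp, Qi, Qd = (complement_projector(rng.standard_normal((d, r))) for _ in range(3))
history = [rng.standard_normal(d) for _ in range(3)]
x_t, x_prev = history[-1], history[-2]
past_sum = np.sum(history[:-1], axis=0)                    # sum_{s=0}^{t-1} x_s

u_star = greedy_control(x_t, past_sum, x_prev, Qp, Qi, Qd, c)
loss_zero = running_loss(np.zeros(d), x_t, past_sum, x_prev, Qp, Qi, Qd, c)
loss_star = running_loss(u_star, x_t, past_sum, x_prev, Qp, Qi, Qd, c)
print(loss_zero, loss_star)
```

The sketch starts at $t=2$ so that both $\sum_{s=0}^{t-1}\mathbf{x}_s$ and $\mathbf{x}_{t-1}$ are available; how the derivative term is handled at $t=0$ is not specified in this passage and is left open here.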