
CaRiNG: Learning Temporal Causal Representation under
Non-Invertible Generation Process

Guangyi Chen    Yifan Shen    Zhenhao Chen    Xiangchen Song    Yuewen Sun    Weiran Yao    Xiao Liu    Kun Zhang
Abstract

Identifying the underlying time-delayed latent causal processes in sequential data is vital for grasping temporal dynamics and for downstream reasoning. While some recent methods can robustly identify these latent causal variables, they rely on the strict assumption that the generation process from latent variables to observed data is invertible. However, this assumption is often hard to satisfy in real-world applications that involve information loss. For instance, visual perception projects a 3D space onto 2D images, and the persistence of vision mixes historical data into current perceptions. To address this challenge, we establish an identifiability theory that allows for the recovery of independent latent components even when they come from a nonlinear and non-invertible mix. Using this theory as a foundation, we propose a principled approach, CaRiNG, to learn the Causal Representation of Non-invertible Generative temporal data with identifiability guarantees. Specifically, we utilize temporal context to recover lost latent information and apply the conditions in our theory to guide the training process. Through experiments conducted on synthetic datasets, we validate that our CaRiNG method reliably identifies the causal process, even when the generation process is non-invertible. Moreover, we demonstrate that our approach considerably improves temporal understanding and reasoning in practical applications. Code can be accessed through https://github.com/sanshuiii/CaRiNG.

Machine Learning, ICML

1 Introduction

Sequential data, including video, stock, and climate observations, are integral to our daily lives. Gaining an understanding of the causal dynamics in such time series data has always been a crucial challenge (Berzuini et al., 2012; Ghysels et al., 2016; Friston, 2009) and has attracted considerable attention. The core of this task is to identify the underlying causal dynamics in the data we observe.

Towards this goal, we focus on Independent Component Analysis (ICA) (Hyvärinen & Oja, 2000), a classical method for recovering latent signals from mixed observations. Recent advancements in nonlinear ICA (Hyvarinen & Morioka, 2016, 2017; Hyvarinen et al., 2019; Khemakhem et al., 2020; Sorrenson et al., 2020; Hälvä & Hyvarinen, 2020) have yielded robust theoretical evidence for the identifiability of latent variables, and enabled the use of deep neural networks to address complex scenarios. For example, by assuming that the latent variables in the data generation process are mutually independent, and by leveraging auxiliary side information such as the time index, domain index, or class label, Hyvarinen & Morioka (2017), Hyvarinen et al. (2019), and Hälvä & Hyvarinen (2020) have demonstrated strong identifiability results. Hälvä et al. (2021), Klindt et al. (2020), Yao et al. (2022b, a), and Lachapelle et al. (2022) further extend this nonlinear ICA framework to time-delayed dynamical systems, which allow temporal transitions among the latent variables.

However, these nonlinear ICA-based methods usually assume that the mixing function (the generation process from sources to observations) is invertible, which may be difficult to satisfy in real-world scenarios, such as the 3D-to-2D projection in visual perception. Figure 1 (a) and (b) give two intuitive examples from real videos that illustrate how non-invertibility arises. In (a), when object occlusions occur, information about the obstructed object is lost in the generation process at the current time step, which causes non-invertibility. In (b), the persistence of vision introduces non-invertibility, since the mixing process at the current time step uses historical information. We further find that violating the invertibility assumption may cause nonlinear ICA methods to identify the latent variables poorly. In part (c) of Figure 1, we show that the performance of TDRL, a typical nonlinear ICA-based method that makes the invertibility assumption, degrades markedly in identifying the latent variables as non-invertibility increases. This motivates us to extend current nonlinear ICA methods to handle non-invertible mixing functions.


Figure 1: Motivations for the non-invertible generation process. (a) Occlusions introduce non-invertibility because the measured observation cannot cover the obstructed objects. (b) Vision persistence, shown with the high-speed movement of a crashing car, describes a generation process that jointly involves the current and previous states, which causes non-invertibility. (c) The identifiability of conventional methods, such as TDRL (Yao et al., 2022a) (blue), drops drastically as non-invertibility increases, while the identifiability of our method (orange) still holds. The levels of non-invertibility are defined by removing 0, 1/3, and 2/3 of the dimensions of $\mathbf{z}_t$ when generating $\mathbf{x}_t$. For example, when the dimension of $\mathbf{z}_t$ is 6, 2/3 non-invertibility means that we remove 4 variables of $\mathbf{z}_t$ and use only 2 variables to generate $\mathbf{x}_t$.

In this paper, to tackle the challenges above, we propose to leverage the temporal context for retrieving missing information caused by the non-invertible mixing function, mirroring the intuitive mechanisms of human perception. For instance, when we encounter an object with occlusion, our natural inclination is to draw from historical data to reconstruct the obscured portion. We demonstrate that, even when the generation process is non-invertible, the derived latent causal representation remains identifiable if the latent variables can be expressed as an arbitrary function combining the current observation with its history. Built upon this identification theorem, we introduce a principled approach, named CaRiNG, that learns the function to integrate historical data to compensate for the latent information lost due to non-invertibility. This approach extends the Sequential Variational Autoencoder (Sequential VAE (Chung et al., 2015; Li & Mandt, 2018)) with two distinct modifications. Firstly, it incorporates history (or context) information directly into the encoder. Specifically, we transform step-to-step mapping (from current observation to the current latent variable) into sequence-to-step mapping (from current observation and temporal context to the current latent variable). Secondly, a specialized prior module is introduced to determine the prior distribution of latent variables using the normalizing flow (Dinh et al., 2016), ensuring the imposition of an independent noise condition. We evaluate our method using both synthetic and real-world data. Using synthetic data, we design datasets with a non-invertible mixing function to measure identifiability. For real-world applications, CaRiNG is deployed in a traffic accident reasoning task, a scenario in which the intricate traffic dynamics introduce considerable non-invertibility. 
Experimental outcomes reveal that our method significantly outperforms other temporal representation learning methods for identifying causal representations amid non-invertible generation processes. Furthermore, this causal representation has proven instrumental in enhancing video reasoning tasks.

Key Insights and Contributions of our research include:

  • To the best of our knowledge, this paper presents the first identifiability theorem that accommodates a non-invertible generation process, which complements the existing body of nonlinear ICA theory.

  • We present a principled approach, CaRiNG, to learn the latent causal representation from temporal data under non-invertible generation processes with identifiability guarantees, by integrating temporal context information to recover the lost information.

  • Our evaluations across synthetic and real-world datasets demonstrate CaRiNG’s effectiveness for learning the identifiable latent causal representation, leading to enhancements in video reasoning tasks.

2 Problem Setup

2.1 Non-invertible Temporal Generative Process

Denote $\mathbf{X}=\{\mathbf{x}_1,\mathbf{x}_2,\dots,\mathbf{x}_T\}$ as the observed $d$-dimensional time series data at $T$ discrete time steps. Each observation $\mathbf{x}_t\in\mathbb{R}^d$ is generated from a nonlinear mixing function $\mathbf{g}$ that maps $r+1$ adjacent latent variables $\mathbf{z}_{t:t-r}$ to $\mathbf{x}_t$, where $\mathbf{z}_{t:t-r}$ refers to $\{\mathbf{z}_t,\mathbf{z}_{t-1},\cdots,\mathbf{z}_{t-r}\}$. We have $\mathbf{z}_t\in\mathbb{R}^n$. For every $i\in\{1,\dots,n\}$, the variable $z_{it}$ of $\mathbf{z}_t$ is derived from a stationary, non-parametric time-delayed causal relation:

$$\underbrace{\mathbf{x}_t=\mathbf{g}(\mathbf{z}_{t:t-r})}_{\text{Nonlinear mixing}},\qquad \underbrace{z_{it}=f_i\left(\{z_{j,t'}\mid z_{j,t'}\in\mathbf{Pa}(z_{it})\},\epsilon_{it}\right)}_{\text{Stationary non-parametric transition}}. \tag{1}$$

Note that with non-parametric causal transitions, the noise term $\epsilon_{it}\sim p_{\epsilon_i}$ (where $p_{\epsilon_i}$ denotes the distribution of $\epsilon_{it}$) and the time-delayed parents $\mathbf{Pa}(z_{it})$ of $z_{it}$ (i.e., the set of latent factors that directly cause $z_{it}$) interact and are transformed in an arbitrarily nonlinear way to generate $z_{it}$. $\tau$ denotes the transition time lag. The components of $\mathbf{z}_t$ are mutually independent conditional on the history variables $\mathbf{Pa}(\mathbf{z}_t)$.
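To make the two parts of Eq 1 concrete, the following is a minimal simulation sketch, not the authors' data generator. It assumes $n=2$ latent components, a mixing lag of $r=1$, lag-1 transitions, Gaussian noise, and hand-picked functions; the names `transition`, `mixing`, and `simulate` are hypothetical.

```python
import math
import random


def transition(z_prev, eps):
    """Hypothetical stationary transition: each z_it is a nonlinear function
    of its lag-1 parents and its own noise term eps_it."""
    z0 = math.tanh(0.8 * z_prev[0] - 0.3 * z_prev[1]) + (0.5 + 0.2 * abs(z_prev[0])) * eps[0]
    z1 = math.tanh(0.5 * z_prev[0] + 0.6 * z_prev[1]) + (0.5 + 0.2 * abs(z_prev[1])) * eps[1]
    return [z0, z1]


def mixing(z_t, z_tm1):
    """Hypothetical mixing g with r = 1: four latent numbers are squeezed into
    two observed numbers, so x_t alone cannot determine z_t."""
    return [z_t[0] + z_tm1[1], z_t[1] * z_tm1[0]]


def simulate(num_steps, seed=0):
    """Return (latents, observations) for num_steps time steps."""
    rng = random.Random(seed)
    z_prev = [rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)]  # assumed initial state
    latents, observations = [], []
    for _ in range(num_steps):
        # Independent noise terms keep the components of z_t conditionally
        # independent given z_{t-1}.
        eps = [rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)]
        z_t = transition(z_prev, eps)
        latents.append(z_t)
        observations.append(mixing(z_t, z_prev))
        z_prev = z_t
    return latents, observations


# Two different latent pairs (z_t, z_{t-1}) collide on the same observation.
same_x = mixing([1.0, 0.0], [0.0, 0.0]) == mixing([0.0, 0.0], [0.0, 1.0])
```

The collision stored in `same_x` is the point of the example: a single observation does not pin down the latent state. This particular mixing function was chosen only to show that effect, and the sketch does not check whether the observation history would suffice to recover the latents.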

In this case, one cannot recover $\mathbf{z}_t$ from $\mathbf{x}_t$ alone due to the non-invertibility of $\mathbf{g}$. Without extra assumptions, $\mathbf{z}_t$ is not identifiable. We therefore assume that there exists a time lag $\mu$ and a nonlinear function $\mathbf{m}$ that maps a series of observations to the latent variable $\mathbf{z}_t$, i.e.,

$$\mathbf{z}_t=\mathbf{m}(\mathbf{x}_{t:t-\mu}). \tag{2}$$

Once we successfully recover the information lost due to non-invertibility from the context, the classical nonlinear ICA algorithm can be used to solve this problem.
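One practical reading of Eq 2 is that whatever model approximates $\mathbf{m}$ receives a window of observations rather than a single frame. Below is a minimal sketch of that windowing step only. It assumes observations are stored in a list indexed by time and that the first $\mu$ steps are skipped for lack of history; both choices are assumptions, and `context_windows` is a hypothetical helper.

```python
def context_windows(observations, mu):
    """Pair each time index t >= mu with the window [x_t, x_{t-1}, ..., x_{t-mu}]."""
    if mu < 0:
        raise ValueError("mu must be non-negative")
    windows = []
    for t in range(mu, len(observations)):
        window = [observations[t - k] for k in range(mu + 1)]
        windows.append((t, window))
    return windows


# Toy scalar observations x_0, ..., x_4 with context length mu = 2.
toy_windows = context_windows([10, 11, 12, 13, 14], mu=2)
# toy_windows == [(2, [12, 11, 10]), (3, [13, 12, 11]), (4, [14, 13, 12])]
```

Setting `mu=0` reduces each window to the current observation, which corresponds to the step-to-step mapping used when the mixing function is assumed invertible.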

2.2 Identification of the Latent Causal Processes

Definition 1 (Identifiable Latent Causal Process).

Let $\mathbf{X}=\{\mathbf{x}_1,\mathbf{x}_2,\dots,\mathbf{x}_T\}$ be a sequence of observed variables generated by the true temporally causal latent processes specified by $(f_i,p(\epsilon_i),\mathbf{g})$ given in Eq 1. A learned generative model $(\hat{f}_i,\hat{p}(\epsilon_i),\hat{\mathbf{g}})$ is observationally equivalent to $(f_i,p(\epsilon_i),\mathbf{g})$ if the model distribution $p_{\hat{f}_i,\hat{p}_{\epsilon},\hat{\mathbf{g}}}(\mathbf{x}_{1:T})$ matches the data distribution $p_{f_i,p_{\epsilon},\mathbf{g}}(\mathbf{x}_{1:T})$ for any value of $\mathbf{x}_t$. We say latent causal processes are identifiable if observational equivalence can lead to a version of latent variable $\mathbf{z}_t=\mathbf{m}(\mathbf{x}_{t:t-\mu})$ up to permutation $\pi$ and component-wise invertible transformation $\mathcal{T}$:

$$\begin{aligned} & p_{\hat{f}_i,\hat{p}_{\epsilon_i},\hat{\mathbf{g}}}(\mathbf{x}_{1:T}) = p_{f_i,p_{\epsilon_i},\mathbf{g}}(\mathbf{x}_{1:T}) \\ \Rightarrow\ & \hat{\mathbf{m}}(\mathbf{x}_{t:t-\mu}) = (\mathcal{T}\circ\pi\circ\mathbf{m})(\mathbf{x}_{t:t-\mu}),\quad \forall\,\mathbf{x}_{t:t-\mu}\in\mathbb{R}^{\mu+1}. \end{aligned} \tag{3}$$

Different from the existing literature, we involve $\mathbf{m}$ in the above definition since it serves implicitly as a property of the mixing function $\mathbf{g}$, although it does not explicitly participate in the generation process. Furthermore, the identifiability of $\mathbf{g}$ is different. In previous nonlinear ICA methods (Yao et al., 2022a; Hyvarinen & Morioka, 2017), the mixing function $\mathbf{g}$ is identifiable. However, in our case, the mixing function is not identifiable because non-invertibility causes information loss. Instead, we can obtain a component-wise transformation of a permuted version of the latent variables, $\hat{\mathbf{z}}_t=\hat{\mathbf{m}}(\mathbf{x}_{t:t-\mu})$. Once $\mathbf{z}_t$ is identifiable, the latent causal relations are also identifiable, up to a permutation $\pi$ and component-wise invertible transformation $\mathcal{T}$, i.e., $\hat{\mathbf{f}}=\mathcal{T}\circ\pi\circ\mathbf{f}$. This is because, in a time-delayed causally sufficient system, the conditional independence relations fully characterize time-delayed causal relations when we assume no latent causal confounders in the (latent) causal processes.
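The equivalence class in Definition 1 suggests a simple empirical diagnostic when ground-truth latents are available, as in simulation. The sketch below is an illustration added under stated assumptions rather than a procedure described in this section: it searches over permutations and scores each one with the absolute Spearman rank correlation, which ignores strictly monotone component-wise maps (continuous invertible maps on the real line are of this kind). All helper names are hypothetical, ties are assumed absent, and the brute-force search is only practical for small $n$.

```python
import itertools
import math
import random


def ranks(values):
    """Rank positions of the values (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order):
        result[index] = rank
    return result


def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def spearman(a, b):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(a), ranks(b))


def best_permutation_score(z_true, z_hat):
    """Return (score, perm), where perm[i] is the estimated component matched
    to true component i and score is the mean absolute rank correlation."""
    n = len(z_true[0])
    cols_true = [[row[i] for row in z_true] for i in range(n)]
    cols_hat = [[row[i] for row in z_hat] for i in range(n)]
    best_score, best_perm = -1.0, None
    for perm in itertools.permutations(range(n)):
        score = sum(abs(spearman(cols_true[i], cols_hat[perm[i]])) for i in range(n)) / n
        if score > best_score:
            best_score, best_perm = score, perm
    return best_score, best_perm


# Toy check: the "estimate" swaps the two components and applies an
# increasing map to one and a decreasing map to the other.
rng = random.Random(0)
z_true = [[rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)] for _ in range(200)]
z_hat = [[math.exp(z[1]), -z[0] ** 3] for z in z_true]
score, perm = best_permutation_score(z_true, z_hat)
```

In the toy check the estimate lies in the same equivalence class as the truth, so the score is at its maximum and `perm` recovers the swap. A score well below 1 would indicate that the estimated components still mix several true components.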

Figure 2: An intuitive illustration of a moving football with a visual persistence effect. Considering the generating process $\mathbf{x}_t=\mathbf{g}(\mathbf{z}_{t:t-r})$, $\mathbf{x}_t$ denotes the observed football with motion blur, and $\mathbf{z}_t$ denotes the position and phase of the ball. Recovering the latent variables from a single observation is difficult, which illustrates the non-invertibility.

2.3 Illustrations of the Problem Setup

Intuitive Illustration with Visual Persistence. Consider a rapidly moving ball on a two-dimensional plane, as depicted in Figure 2. The horizontal and vertical coordinates of the ball’s position at any given moment can be represented by the latent variable $\mathbf{z}_t\in\mathbb{R}^2$. We assume that the ball follows a curved trajectory constrained by the nonlinear function $\mathbf{f}$ as it moves.

Suppose that we observe the ball with a visual persistence effect, where each observation $\mathbf{x}_t$ captures several consecutive latent variables as $\mathbf{x}_t=\mathbf{g}(\mathbf{z}_{<t})$. The mixing function $\mathbf{g}$ refers to the weighted sum of the images obtained through multiple exposures, which is what a person ultimately observes as $\mathbf{x}_t$. In this case, the invertibility of the mapping from $\mathbf{z}_t$ to $\mathbf{x}_t$ is compromised since the current frame also contains the latent information from previous frames.

Mathematical Illustration. Besides, we provide a mathematical example to demonstrate the existence of the function $\mathbf{m}$ in Eq 2. Following the concept of visual persistence, let the current observation be a weakened previous observation overlaid with the current image of the object, i.e., $\mathbf{x}_t=\mathbf{z}_t+\frac{1}{2}\mathbf{x}_{t-1}=\sum_{i=0}^{\infty}\left(\frac{1}{2}\right)^{i}\mathbf{z}_{t-i}$ (Wolford, 1993). Given an extra observation, the current latent variable can be rewritten as $\mathbf{z}_t=\mathbf{x}_t-\frac{1}{2}\mathbf{x}_{t-1}$. Thereby we can easily recover latent variables that cannot be obtained from a single observation, i.e., $\mathbf{z}_t=\mathbf{m}(\mathbf{x}_{t:t-1})=\mathbf{x}_t-\frac{1}{2}\mathbf{x}_{t-1}$.
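The recovery in this example can be checked numerically. The sketch below assumes scalar latents for brevity (each coordinate of the ball can be treated the same way) and an initial observation of zero before the sequence starts; `persist` and `recover` are hypothetical names.

```python
import random


def persist(latents, decay=0.5):
    """Observation with visual persistence: x_t = z_t + decay * x_{t-1},
    with the observation before the first step assumed to be 0."""
    observations, x_prev = [], 0.0
    for z_t in latents:
        x_t = z_t + decay * x_prev
        observations.append(x_t)
        x_prev = x_t
    return observations


def recover(x_t, x_prev, decay=0.5):
    """The map m from the example: z_t = x_t - decay * x_{t-1}."""
    return x_t - decay * x_prev


rng = random.Random(0)
latents = [rng.uniform(-1.0, 1.0) for _ in range(20)]
observations = persist(latents)

# Recover z_t for t >= 1 from the pair (x_t, x_{t-1}).
recovered = [recover(observations[t], observations[t - 1]) for t in range(1, 20)]
max_error = max(abs(r - z) for r, z in zip(recovered, latents[1:]))
```

Here `max_error` stays at the level of floating-point rounding, which reflects that two consecutive observations determine the current latent exactly in this linear example, while a single observation does not.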

Illustration of time-delayed temporal relations. Here, we assume that the time series system contains only time-delayed temporal relations. In other words, instantaneous relations are outside the scope of this discussion. Generally speaking, a group of objects that have instantaneous relations with each other is treated as a single variable. For example, within a video sequence, a ball in motion may be conceptualized as a cluster of pixels that move consistently and simultaneously (instantaneous relations). This pattern can help distinguish the ball from other objects, which potentially provides a principle for extracting concepts from time series data such as videos and motion sequences.

3 Identifiability Theory

In this section, we demonstrate that, given certain mild conditions, the learned causal representation $\mathbf{z}_t$ is identifiable up to permutation and a component-wise transformation. This holds even if the mixing function $\mathbf{g}$ is non-invertible. Firstly, we present the identifiability results when faced with a non-invertible mixing function and stationary transitions. Subsequently, we address the gap between permutation-scaling Jacobian and identifiability. Lastly, by leveraging side information such as the domain index and label, we illustrate how identifiability can be achieved even in a non-stationary context. The proofs are available in Appendix A1.

3.1 Identifiability under Non-Invertible Generative Process

Without loss of generality, we first consider a simplified case with $\tau=r+1$ and context length $\mu$, which yields the following process:

$$\mathbf{x}_t=\mathbf{g}(\mathbf{z}_{t:t-r}),\quad z_{it}=f_i\left(\mathbf{z}_{t-1:t-r-1},\epsilon_{it}\right), \tag{4}$$

where a function $\mathbf{m}$ satisfying $\mathbf{z}_t=\mathbf{m}(\mathbf{x}_{t:t-\mu})$ exists. When $r=0$, the time delay is present only in the transitions and is absent from the generation process. Taking $r>0$ leads to a more intricate scenario, where the mixing function encompasses not just the latent causal variables of the current time step but also the information of previous steps; we term this the Time-delayed Mixing Process. Such a scenario is compelling because it acknowledges that the mixing process can be influenced by time-delayed effects. Human visual perception provides a fitting example: the phenomenon known as the persistence of vision reveals that humans retain impressions of a visual stimulus even after its cessation (Coltheart, 1980). The extension to any time lag $\tau$ is discussed in Appendix A1.4.

Theorem 1 (Identifiability under Non-invertible Generative Process).

For a series of observations 𝐱tdsubscript𝐱𝑡superscript𝑑\mathbf{x}_{t}\in\mathbb{R}^{d}bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT and estimated latent variables 𝐳^tnsubscript^𝐳𝑡superscript𝑛\mathbf{\hat{z}}_{t}\in\mathbb{R}^{n}over^ start_ARG bold_z end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT, suppose there exists function 𝐠^,𝐦^^𝐠^𝐦\mathbf{\hat{g}},\mathbf{\hat{m}}over^ start_ARG bold_g end_ARG , over^ start_ARG bold_m end_ARG which is subject to observational equivalence,

𝐱t=𝐠^(𝐳^t:tr),𝐳^t=𝐦^(𝐱t:tμ).formulae-sequencesubscript𝐱𝑡^𝐠subscript^𝐳:𝑡𝑡𝑟subscript^𝐳𝑡^𝐦subscript𝐱:𝑡𝑡𝜇\mathbf{x}_{t}=\mathbf{\hat{g}}(\mathbf{\hat{z}}_{t:t-r}),\quad\mathbf{\hat{z}% }_{t}=\mathbf{\hat{m}}(\mathbf{x}_{t:t-\mu}).bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT = over^ start_ARG bold_g end_ARG ( over^ start_ARG bold_z end_ARG start_POSTSUBSCRIPT italic_t : italic_t - italic_r end_POSTSUBSCRIPT ) , over^ start_ARG bold_z end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT = over^ start_ARG bold_m end_ARG ( bold_x start_POSTSUBSCRIPT italic_t : italic_t - italic_μ end_POSTSUBSCRIPT ) . (5)

If assumptions

  • (Smooth and Positive Density) the probability density function of latent variables is third-order differentiable and positive in Rnsuperscript𝑅𝑛R^{n}italic_R start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT,

  • (conditional independence) the components of 𝐳^tsubscript^𝐳𝑡\mathbf{\hat{z}}_{t}over^ start_ARG bold_z end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT are mutually independent conditional on 𝐳^t1:tr1subscript^𝐳:𝑡1𝑡𝑟1\mathbf{\hat{z}}_{t-1:t-r-1}over^ start_ARG bold_z end_ARG start_POSTSUBSCRIPT italic_t - 1 : italic_t - italic_r - 1 end_POSTSUBSCRIPT,

  • (sufficiency) let $\eta_{kt}\triangleq\log p(z_{kt}|\mathbf{z}_{t-1:t-r-1})$, and

    $$\mathbf{v}_{lt}\triangleq\Big(\frac{\partial^{2}\eta_{1t}}{\partial z_{1t}\partial z_{l,t-r-1}},...,\frac{\partial^{2}\eta_{nt}}{\partial z_{nt}\partial z_{l,t-r-1}},\frac{\partial^{3}\eta_{1t}}{\partial z_{1t}^{2}\partial z_{l,t-r-1}},...,\frac{\partial^{3}\eta_{nt}}{\partial z_{nt}^{2}\partial z_{l,t-r-1}}\Big)^{\intercal}, \tag{6}$$

    for $l=1,2,\cdots,n$. For each value of $\mathbf{z}_{t}$, there exist $2n$ different values of $z_{l,t-r-1}$ such that the $2n$ vector functions $\mathbf{v}_{lt}\in\mathbf{R}^{2n}$ are linearly independent,

are satisfied, then $\mathbf{z}_{t}$ must be a component-wise transformation of a permuted version of $\mathbf{\hat{z}}_{t}$ with regard to context $\{\mathbf{x}_{j}\mid\forall j=t,t-1,\cdots,t-\mu-r\}$.

The proof of Theorem 1 can be found in Appendix A1.1. It is inspired by (Yao et al., 2022a), which follows the line of (Hyvarinen et al., 2019).
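To make the sufficiency condition more concrete, the following minimal sketch evaluates $\mathbf{v}_{lt}$ in closed form for the smallest case $n=1$, $r=0$, assuming a Gaussian transition $p(z_{t}|z_{t-1})=\mathcal{N}(m(z_{t-1}),\sigma(z_{t-1})^{2})$. The functions `m`, `sigma`, and all helper names are illustrative assumptions and are not part of CaRiNG. Under this assumption, a constant noise scale makes the third-order entry vanish, so the $2n=2$ vectors cannot be linearly independent, whereas a history-dependent noise scale can satisfy the condition.

```python
import math

def v_vector(z, u, m, dm, sigma, dsigma):
    """Return v = (d2 eta/dz du, d3 eta/dz^2 du) for n = 1, r = 0.

    eta(z, u) = log N(z; m(u), sigma(u)^2), where z = z_t and u = z_{t-1}.
    Closed forms (derived by hand for this Gaussian toy case):
      d2 eta / dz du   = m'(u)/sigma^2 + 2 (z - m(u)) sigma'(u)/sigma^3
      d3 eta / dz^2 du = 2 sigma'(u)/sigma^3
    """
    s = sigma(u)
    second = dm(u) / s**2 + 2.0 * (z - m(u)) * dsigma(u) / s**3
    third = 2.0 * dsigma(u) / s**3
    return (second, third)

def det2(a, b):
    """Determinant of the 2x2 matrix whose columns are a and b."""
    return a[0] * b[1] - a[1] * b[0]

# Hypothetical transition mean shared by both toy models.
m = math.tanh
dm = lambda u: 1.0 - math.tanh(u) ** 2

z_t = 0.3            # fixed value of z_t
u1, u2 = 0.5, 1.5    # 2n = 2 different values of z_{t-1}

# Case 1: constant noise scale (sigma' = 0), so the third-order entry is zero.
v1 = v_vector(z_t, u1, m, dm, lambda u: 1.0, lambda u: 0.0)
v2 = v_vector(z_t, u2, m, dm, lambda u: 1.0, lambda u: 0.0)
det_homoscedastic = det2(v1, v2)

# Case 2: noise scale that depends on the history.
sigma = lambda u: 1.0 + 0.5 * u * u
dsigma = lambda u: u
w1 = v_vector(z_t, u1, m, dm, sigma, dsigma)
w2 = v_vector(z_t, u2, m, dm, sigma, dsigma)
det_heteroscedastic = det2(w1, w2)

print(det_homoscedastic, det_heteroscedastic)
```

A zero determinant means the two vectors are linearly dependent, so the first toy model violates the sufficiency condition at this point while the second does not.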

Besides, the nonstationary transition can also help to improve the identifiability of CaRiNG. As shown in the sufficiency assumption in Theorem 1, the identifiability relies on the sufficient changes of the conditional distribution $p(z_{kt}|\mathbf{z}_{t-1:t-r-1})$. When the distribution of the noise term varies between different domains, the domain index can serve as an auxiliary variable to improve this sufficiency since both domain dynamics and history variables can provide changes. More discussions are in Appendix A1.6.

3.2 Continuity for Permutation Invariance

In this subsection, we will introduce permutation invariance for further discussion.

Definition 2 (Permutation Invariance).

Following Definition 1, if $\pi$ is a fixed permutation and $\mathcal{T}$ is a component-wise invertible transformation which may vary across different time steps, we call this identifiability under Permutation Invariance.

Let us further consider a more general scenario, with $\mathbf{x}_{t}\in\mathcal{X}\subseteq\mathbb{R}^{d}$ and $\mathbf{z}_{t}\in\mathcal{Z}\subseteq\mathbb{R}^{n}$, i.e., the probability density of $\mathbf{z}_{t}$ does not have to be non-zero everywhere in $\mathbb{R}^{n}$. To establish identifiability, numerous existing nonlinear ICA-based methods (Yao et al., 2022b, a; Hyvarinen et al., 2019; Hälvä et al., 2021) utilize the Jacobian matrix, denoted by $\mathbf{H}=\frac{\partial\mathbf{z}}{\partial\hat{\mathbf{z}}}$, which captures the relationship between ground truth and estimated latent variables. These methods propose that the learned latent variables are identifiable if $\mathbf{H}_{ij}\cdot\mathbf{H}_{ik}=0$ for $j\not=k$ (with only a single non-zero element in each row or column). $\mathbf{H}$ corresponds to the Jacobian matrix of the function $\mathbf{h}\triangleq\mathbf{m}\circ\hat{\mathbf{g}}$ in our scenario (or $\mathbf{g}^{-1}\circ\hat{\mathbf{g}}$ for the general scenario).

However, it is crucial to highlight an often overlooked shortcoming: this condition alone is insufficient to establish identifiability when dealing with nonlinear generation processes. In linear ICA, the Jacobian remains constant, so this condition indeed equates to identifiability. In nonlinear ICA, however, the Jacobian matrix is a function of $\mathbf{\hat{z}}$ and can vary with different $\mathbf{\hat{z}}$ values, so the position of the non-zero entry in each row, and hence the implied permutation, may change across the domain. Concurrently with our work, Lachapelle et al. (2023) also identified the difference between local and global disentanglement, and achieved global disentanglement in the additive decoding case. In contrast, we demonstrate identifiability under permutation invariance and focus on a more general case without block-specific decoder assumptions. A comprehensive discussion is available in Appendix A1.5.

To solve this issue in the nonlinear system, we provide two more assumptions. The domain $\mathcal{\hat{Z}}$ of $\mathbf{\hat{z}}$ should be path-connected, i.e., for any $\mathbf{a},\mathbf{b}\in\mathcal{\hat{Z}}$, there exists a continuous path connecting $\mathbf{a}$ and $\mathbf{b}$ with all points of the path in $\mathcal{\hat{Z}}$. In addition, the function $\mathbf{h}$ is second-order differentiable and satisfies the non-degeneracy condition.
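The role of path-connectedness can be illustrated with a small hypothetical construction (ours, not taken from the paper or its appendix). The toy map `h` below is invertible and its Jacobian has exactly one non-zero entry per row at every point, yet it acts as the identity on one piece of a disconnected domain and as a coordinate swap on the other, so no single fixed permutation describes it.

```python
def h(z_hat):
    """Toy invertible map on a domain made of two disconnected pieces.

    Piece A: z_hat[0] in (0, 1); piece B: z_hat[0] in (2, 3); z_hat[1] in (0, 1).
    """
    a, b = z_hat
    if a < 1.5:
        return (a, b)   # identity on piece A
    return (b, a)       # coordinate swap on piece B

def jacobian(f, x, eps=1e-6):
    """Central finite-difference Jacobian; entry [i][j] = d f_i / d x_j."""
    n = len(x)
    jac = [[0.0] * n for _ in range(n)]
    for j in range(n):
        plus = list(x)
        minus = list(x)
        plus[j] += eps
        minus[j] -= eps
        f_plus, f_minus = f(tuple(plus)), f(tuple(minus))
        for i in range(n):
            jac[i][j] = (f_plus[i] - f_minus[i]) / (2.0 * eps)
    return jac

def support(jac, tol=1e-8):
    """0/1 pattern marking the non-zero Jacobian entries."""
    return [[1 if abs(v) > tol else 0 for v in row] for row in jac]

pattern_a = support(jacobian(h, (0.5, 0.5)))   # a point in piece A
pattern_b = support(jacobian(h, (2.5, 0.5)))   # a point in piece B
print(pattern_a, pattern_b)
```

Both patterns satisfy the single-non-zero-entry condition, but they encode different permutations. On a path-connected domain, continuity of the Jacobian together with non-degeneracy rules out such a switch.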

For clarification, the condition that a function $\mathbf{h}:\mathbb{R}^{n}\rightarrow\mathbb{R}^{n}$ is invertible, or equivalently the non-vanishing of the determinant of the Jacobian matrix $\mathbf{H}_{h}$, is called the non-degeneracy condition. We first define the partially invertible function, and then give the non-degeneracy condition on it.

Definition 3 (Partial Invertibility).

A function $\mathbf{z}=\mathbf{h}(\mathbf{\hat{z}},\mathbf{c})$, where $\mathbf{z},\mathbf{\hat{z}}\in\mathbb{R}^{n}$ and $\mathbf{c}\in\mathbb{R}^{m}$, is partially invertible if and only if, for any given $\mathbf{c}$, the remaining map $\mathbf{h}_{\mathbf{c}}:\mathbb{R}^{n}\rightarrow\mathbb{R}^{n}$ is invertible.

Definition 4 (Non-degeneracy Condition of Partially Invertible Functions).

The non-degeneracy condition of a partially invertible function $\mathbf{z}=\mathbf{h}(\mathbf{\hat{z}},\mathbf{c})$ is that for any given $\mathbf{c}$, the determinant of the Jacobian matrix $\mathbf{H}_{\mathbf{h}_{\mathbf{c}}}$ of $\mathbf{h}_{\mathbf{c}}$ is always non-zero.
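As a hedged illustration of Definitions 3 and 4, the sketch below uses a hypothetical map `h` with $n=2$ and scalar side information $c$ (our own toy choice). For every fixed $c$ it can be inverted in $\mathbf{\hat{z}}$, and the determinant of its Jacobian with respect to $\mathbf{\hat{z}}$ is $e^{c}(3\hat{z}_{2}^{2}+1)>0$, so the non-degeneracy condition holds.

```python
import math

def h(z_hat, c):
    """Hypothetical partially invertible map with scalar side information c."""
    z1, z2 = z_hat
    return (math.exp(c) * z1 + c, z2 ** 3 + z2 + c * z1)

def h_inverse(z, c):
    """Invert h in z_hat for a fixed c (bisection on the monotone cubic)."""
    y1, y2 = z
    z1 = (y1 - c) / math.exp(c)
    target = y2 - c * z1
    lo, hi = -abs(target) - 1.0, abs(target) + 1.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if mid ** 3 + mid < target:
            lo = mid
        else:
            hi = mid
    return (z1, 0.5 * (lo + hi))

def jacobian_det_wrt_z_hat(z_hat, c):
    """Determinant of d h_c / d z_hat in closed form.

    The Jacobian is [[exp(c), 0], [c, 3*z2^2 + 1]], which is lower-triangular.
    """
    _, z2 = z_hat
    return math.exp(c) * (3.0 * z2 ** 2 + 1.0)

grid = [-2.0, -0.5, 0.0, 0.5, 2.0]
min_det = min(jacobian_det_wrt_z_hat((z1, z2), c)
              for c in grid for z1 in grid for z2 in grid)
recovered = h_inverse(h((0.7, -1.2), 0.4), 0.4)
print(min_det, recovered)
```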

Lemma 1 (Disentanglement with Continuity).

For a second-order differentiable invertible function $\mathbf{h}$ defined on a path-connected domain $\mathcal{\hat{Z}}\subseteq\mathbb{R}^{n}$ which satisfies $\mathbf{z}=\mathbf{h}(\mathbf{\hat{z}})$, suppose the non-degeneracy condition holds. If there exists at most one non-zero entry in each row of the Jacobian matrix $\mathbf{H}=\frac{\partial\mathbf{z}}{\partial\mathbf{\hat{z}}}$, the identifiability under Permutation Invariance can be established.

Furthermore, when the Jacobian matrix is not only a function of $\mathbf{\hat{z}}$ but is also influenced by side information $\mathbf{c}$, the identifiability can be guaranteed under mild extra conditions.

Lemma 2 (Disentanglement with Continuity under Side Information).

For a second-order differentiable invertible function $\mathbf{h}$ defined on a path-connected domain $\mathcal{\hat{Z}}\times\mathcal{C}\subseteq\mathbb{R}^{n+m}$ which satisfies $\mathbf{z}=\mathbf{h}(\mathbf{\hat{z}},\mathbf{c})$, suppose the non-degeneracy condition holds. If there exists at most one non-zero entry in each row of the Jacobian matrix $\mathbf{H}(\mathbf{c})=\frac{\partial\mathbf{z}}{\partial\mathbf{\hat{z}}}$, the identifiability under Permutation Invariance can be established.

With Lemma 2, we can further extend Theorem 1 to guarantee permutation invariance even when the probability density of $\mathbf{z}$ is not positive everywhere on $\mathbb{R}^{n}$, as long as appropriate continuity conditions are satisfied. This serves as a valuable complement to the existing theory of nonlinear ICA, which further relaxes the required assumptions. This relaxation enhances the robustness of CaRiNG and makes it more adaptable to diverse and complex data, thus improving its applicability in practical settings.

Proposition 1.

For a series of observations $\mathbf{x}_{t}\in\mathcal{X}\subseteq\mathbb{R}^{d}$ and estimated latent variables $\mathbf{\hat{z}}_{t}\in\mathcal{Z}\subseteq\mathbb{R}^{n}$, suppose there exist functions $\mathbf{\hat{g}},\mathbf{\hat{m}}$ which are subject to observational equivalence, i.e.,

$$\mathbf{x}_{t}=\mathbf{\hat{g}}(\mathbf{\hat{z}}_{t:t-r}),\quad\mathbf{\hat{z}}_{t}=\mathbf{\hat{m}}(\mathbf{x}_{t:t-\mu}), \tag{7}$$

where $\mathbf{g},\mathbf{\hat{g}},\mathbf{m},\mathbf{\hat{m}}$ are second-order differentiable. In addition, if the same assumptions as in Theorem 1 are satisfied, then the identifiability of $\mathbf{z}_{t}$ under Permutation Invariance can be established.

Figure 3: The overall framework of CaRiNG. It consists of three main modules: the sequence-to-step encoder, the step-to-step decoder, and the transition prior module, which are represented as SeqEnc, StepDec, and $\hat{\mathbf{f}}^{-1}_{\mathbf{z}}$ in different colors, respectively. The model is trained with both $\mathcal{L}_{\text{Recon}}$ and $\mathcal{L}_{\text{KLD}}$.

4 Approach

Given our results on identifiability, we introduce our CaRiNG approach. This aims to estimate the latent causal dynamics presented in Eq 1, even when faced with a non-invertible mixing procedure. To achieve this, CaRiNG builds upon the Sequential Variational Auto-Encoders (Sequential VAE (Chung et al., 2015; Li & Mandt, 2018)) and incorporates three primary modules: the sequence-to-step encoder (SeqEnc), the step-to-step decoder (StepDec), and the transition prior module ($\hat{\mathbf{f}}^{-1}_{\mathbf{z}}$). Through Sequential VAE, we ensure the reconstruction capability from latent variables to observed variables. Meanwhile, in contrast to the Gaussian prior in VAEs, our method employs normalizing flow to control the prior distribution, ensuring that the latent variables satisfy the assumed conditional independence. During the training phase, we integrate the conditions from Sec. 3 as constraints and adopt two corresponding loss functions.

Overall Framework. As visualized in Figure 3, our framework starts by acquiring the latent causal representation via a sequence-to-step encoder, whose input and output are a sequence of observations $\mathbf{x}_{t:t-\mu}$ and the estimated latent variable $\mathbf{\hat{z}}_{t}$. Formally, it denotes the inference process of $q(\mathbf{\hat{z}}_{t}|\mathbf{x}_{t:t-\mu})$, which corresponds to the function $\mathbf{m}$ in Eq 2. Following this, observations are generated from the latent space through a step-to-step decoder $p(\mathbf{\hat{x}}_{t}|\mathbf{\hat{z}}_{t})$, which implements the mixing function $\mathbf{g}$ mentioned in Eq 1. To learn the independent latent variables, we apply a constraint using the KL divergence between the posterior distribution of the learned latent variables and a prior distribution that is subject to our conditional independence assumption in Theorem 1. Estimating the prior distribution motivates us to utilize a normalizing flow that converts the prior distribution into Gaussian noise, represented as $\hat{\epsilon}_{it}=\hat{f}^{-1}_{i}(\hat{z}_{it},\hat{\mathbf{z}}_{t-1:t-\tau})$. Moreover, a reconstruction loss between the ground-truth and generated observations is integrated for model training. Each module and loss is detailed in the remainder of this section.

Sequence-to-Step Encoder and Step-to-Step Decoder. Drawing inspiration from the capability of the human visual system, we utilize temporal context to reclaim the information lost due to non-invertible generation. The human visual system adeptly fills in occluded segments by recognizing coherent motion cues (Palmer, 1999; Wertheimer, 1938; Spelke, 1990). Assuming there is a function that captures all latent information from the current observation and its temporal context, i.e., $\mathbf{m}$ exists, we can retrieve the latent causal process with identifiability. Various nonlinear models are suitable for estimating this function, taking a sequence of observations $\mathbf{x}_{t:t-\mu}$ with a lag of $\mu$ as input and yielding the estimated latent representation of the current time step as output. In our experiments, we utilize both a Multi-Layer Perceptron (MLP) (Werbos, 1974) and a Transformer (Vaswani et al., 2017), catering to different complexities. Given the estimated latent variable $\hat{\mathbf{z}}_{t}$, a step-to-step decoder is employed to generate the current observation $\mathbf{x}_{t}$. For practical implementation, one MLP is sufficient.
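The input format of the sequence-to-step encoder can be sketched as follows. This is a minimal illustration in plain Python, assuming the first $\mu$ time steps are simply dropped because they lack a full context; the helper name `context_windows` is hypothetical, and padding would be an equally plausible choice.

```python
def context_windows(xs, mu):
    """Pair each time step t >= mu with its window x_{t:t-mu}.

    xs: list of observation vectors x_0, ..., x_{T-1}.
    Returns a list of (t, window), where window = [x_t, x_{t-1}, ..., x_{t-mu}].
    """
    if mu < 0:
        raise ValueError("mu must be non-negative")
    windows = []
    for t in range(mu, len(xs)):
        window = [xs[t - k] for k in range(mu + 1)]
        windows.append((t, window))
    return windows

# Toy sequence of six 2-dimensional observations.
xs = [[float(t), float(t) ** 2] for t in range(6)]
windows = context_windows(xs, mu=2)
print(len(windows), windows[0])
```

Each window would be flattened (for an MLP) or treated as a token sequence (for a Transformer) before being mapped to the estimate of the current latent variable, while the decoder only receives that single-step estimate.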

Transition Prior Module. To uphold the conditional independence assumption, we propose to minimize the KL divergence between the posterior distribution and a hard-coded prior distribution with this property. The constraint indicates that the current latent variables are mutually independent, conditioned on historical latent variables. Formally, by hard-coding the prior distribution we enforce the components of $\mathbf{\hat{z}}_{t}|\mathbf{\hat{z}}_{t-1:t-\tau}$ to be mutually independent. By minimizing the KL divergence, we expect the posterior to be subject to the assumption as well, i.e., the components of $\mathbf{\hat{z}}_{t}|\mathbf{\hat{x}}_{t:t-\mu},\mathbf{\hat{z}}_{t-1:t-\tau}$ are mutually independent. Direct estimation of the prior, which has an arbitrary density function, poses challenges. As a solution, we introduce a transition prior module that facilitates the estimation of the prior using a normalizing flow. Specifically, the prior is represented through a Gaussian distribution combined with the Jacobian matrix of the transition module.

Formally presented, the transition prior module is represented as $\hat{\epsilon}_{it}=\hat{f}^{-1}_{i}(\hat{z}_{it},\mathbf{\hat{z}}_{t-1:t-\tau})$. Subsequently, the joint distribution is decomposed as a product of the noise distribution and the determinant of the Jacobian matrix, formulated as $p([\mathbf{\hat{z}}_{t-1:t-\tau},\mathbf{\hat{z}}_{t}])=p([\mathbf{\hat{z}}_{t-1:t-\tau},\mathbf{\hat{\epsilon}}_{t}])\times|\mathbf{J}|$, with $\mathbf{J}=\begin{bmatrix}\mathbb{I}_{n\tau}&\mathbf{0}\\ \mathbf{0}&\mathrm{diag}\left(\frac{\partial\hat{\epsilon}_{it}}{\partial\hat{z}_{it}}\right)\end{bmatrix}$, where $[\cdot]$ denotes concatenation. Leveraging this joint distribution, we derive the prior as

$$\begin{aligned}
&\log p(\mathbf{\hat{z}}_{t}|\mathbf{\hat{z}}_{t-1:t-\tau}) \\
&=\log p([\mathbf{\hat{z}}_{t},\mathbf{\hat{z}}_{t-1:t-\tau}])-\log p(\mathbf{\hat{z}}_{t-1:t-\tau}) \\
&=\log p([\mathbf{\hat{\epsilon}}_{t},\mathbf{\hat{z}}_{t-1:t-\tau}])+\log|\mathbf{J}|-\log p(\mathbf{\hat{z}}_{t-1:t-\tau}) \\
&=\log p(\mathbf{\hat{\epsilon}}_{t}|\mathbf{\hat{z}}_{t-1:t-\tau})+\log|\mathbf{J}| \\
&=\log p(\mathbf{\hat{\epsilon}}_{t})+\log|\mathbf{J}| \\
&=\sum_{i}\log p(\hat{\epsilon}_{it})+\log|\mathbf{J}|\quad\text{: Conditional independence} \\
&=\sum_{i}\left(\log p(\hat{\epsilon}_{it})+\log\left|\frac{\partial\hat{\epsilon}_{it}}{\partial\hat{z}_{it}}\right|\right)\quad\text{: Lower-triangular}.
\end{aligned} \tag{8}$$

The transition prior module can be efficiently executed using an MLP, transforming the latent variables $\mathbf{\hat{z}}_{t:t-\tau}$ into $\mathbf{\hat{\epsilon}}_{t}$.
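The last line of Eq 8 can be sketched in a few lines of plain Python. The affine inverse flow below is a hypothetical stand-in for the learned MLP $\hat{f}^{-1}_{i}$ (an assumption made purely for illustration), chosen because the resulting prior is a Gaussian whose log-density can be written down directly and used as a check. The absolute value of the derivative is taken so that the logarithm is also defined for decreasing flows.

```python
import math

def standard_normal_logpdf(e):
    """Log-density of a standard Gaussian noise term."""
    return -0.5 * (e * e + math.log(2.0 * math.pi))

def log_transition_prior(z_t, z_prev, inverse_flow):
    """Eq 8: sum_i [ log p(eps_i) + log |d eps_i / d z_it| ].

    inverse_flow(i, z_it, z_prev) must return (eps_i, d eps_i / d z_it).
    """
    total = 0.0
    for i, z_it in enumerate(z_t):
        eps, deps_dz = inverse_flow(i, z_it, z_prev)
        total += standard_normal_logpdf(eps) + math.log(abs(deps_dz))
    return total

def affine_inverse_flow(i, z_it, z_prev):
    """Hypothetical stand-in for f_i^{-1}: eps = (z - mean_i(z_prev)) / scale_i."""
    mean = math.tanh(sum(z_prev)) * (i + 1)
    scale = 0.5 + 0.25 * i
    return (z_it - mean) / scale, 1.0 / scale

def gaussian_logpdf(x, mean, scale):
    """Log-density of N(mean, scale^2), used only as a reference value."""
    return standard_normal_logpdf((x - mean) / scale) - math.log(scale)

z_prev = [0.2, -0.4, 0.1]   # history with lag tau = 1
z_t = [0.3, -1.0, 0.8]
log_prior = log_transition_prior(z_t, z_prev, affine_inverse_flow)

# For an affine flow, the prior is Gaussian, so both values should agree.
reference = sum(
    gaussian_logpdf(z, math.tanh(sum(z_prev)) * (i + 1), 0.5 + 0.25 * i)
    for i, z in enumerate(z_t)
)
print(log_prior, reference)
```

In a learned model, the derivative would typically be obtained by automatic differentiation of the flow network rather than in closed form.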

Figure 4: Qualitative comparisons between baselines (especially TDRL) and CaRiNG in the setting of Non-invertible Generation. (a) MCC matrix for all 3 latent variables; (b) The scatter plots between the estimated and ground-truth latent variables (only the aligned variables are plotted); (c) The validation MCC curves of CaRiNG and other baselines.

Optimization. We train CaRiNG using the Evidence Lower BOund (ELBO) objective, which is written as follows:

$$\begin{split}
&\text{ELBO}\triangleq\mathbb{E}_{q_{\phi}(\mathbf{Z}|\mathbf{X})}[\log p_{\theta}(\mathbf{X}|\mathbf{Z})]-D_{KL}(q_{\phi}(\mathbf{Z}|\mathbf{X})||p_{\theta}(\mathbf{Z}))\\
&=\underbrace{\mathbb{E}_{q_{\phi}(\mathbf{Z}|\mathbf{X})}\sum_{t=1}^{T}\log p_{\theta}(\mathbf{x}_{t}|\mathbf{z}_{t})}_{-\mathcal{L}_{\text{Recon}}}\\
&+\underbrace{\mathbb{E}_{q_{\phi}(\mathbf{Z}|\mathbf{X})}\left[\sum_{t=1}^{T}\log p_{\theta}(\mathbf{z}_{t}|\mathbf{z}_{t-1:t-\tau})-\sum_{t=1}^{T}\log q_{\phi}(\mathbf{z}_{t}|\mathbf{x}_{t:t-\mu})\right]}_{-\mathcal{L}_{\text{KLD}}}.
\end{split} \tag{9}$$

For the reconstruction likelihood $\mathcal{L}_{\text{Recon}}$, we utilize the mean-squared error (MSE) to measure the discrepancy between the generated and original observations. When computing the KL divergence $\mathcal{L}_{\text{KLD}}$, we resort to a sampling method, given that the prior distribution lacks an explicit form. To elaborate, the posterior is produced by the encoder, while the prior is defined as in Eq 8.
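The two loss terms can be sketched for a single time step as follows. This is a simplified illustration under stated assumptions: the posterior is taken to be a diagonal Gaussian, `log_prior` is any callable returning $\log p(\mathbf{z}_{t}|\cdot)$ (in CaRiNG this role is played by Eq 8), and all function names are hypothetical. A standard-normal prior is used only because its KL divergence has a closed form to compare against.

```python
import math
import random

def gaussian_logpdf(x, mean, std):
    """Log-density of a univariate Gaussian N(mean, std^2)."""
    return -0.5 * (((x - mean) / std) ** 2 + math.log(2.0 * math.pi)) - math.log(std)

def sample_posterior(mean, std, rng):
    """Reparameterized sample from a diagonal Gaussian posterior."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mean, std)]

def recon_loss(x, x_hat):
    """Mean-squared error between the original and generated observation."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)

def kld_estimate(z, post_mean, post_std, log_prior):
    """Single-sample estimate of KL(q || p) = E_q[log q(z) - log p(z)]."""
    log_q = sum(gaussian_logpdf(zi, m, s) for zi, m, s in zip(z, post_mean, post_std))
    return log_q - log_prior(z)

rng = random.Random(0)
post_mean, post_std = [0.1, -0.3], [0.5, 0.8]
z = sample_posterior(post_mean, post_std, rng)

# Sanity check: if the prior equals the posterior, every sample gives zero.
same_prior = lambda zs: sum(
    gaussian_logpdf(zi, m, s) for zi, m, s in zip(zs, post_mean, post_std))
kld_same = kld_estimate(z, post_mean, post_std, same_prior)

# A standard-normal prior, averaged over many posterior samples.
std_normal = lambda zs: sum(gaussian_logpdf(zi, 0.0, 1.0) for zi in zs)
samples = [sample_posterior(post_mean, post_std, rng) for _ in range(20000)]
kld_mc = sum(kld_estimate(s, post_mean, post_std, std_normal)
             for s in samples) / len(samples)

# Closed-form KL between diagonal Gaussians, for comparison only.
kld_exact = sum(0.5 * (s * s + m * m - 1.0) - math.log(s)
                for m, s in zip(post_mean, post_std))
mse = recon_loss([1.0, 2.0], [1.5, 1.0])
print(kld_same, kld_mc, kld_exact, mse)
```

The sampled estimate approaches the closed-form value as the number of samples grows, which is why sampling remains usable when the prior has no explicit form.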

Table 1: MCC scores (with standard deviations over 4 seeds) of CaRiNG and baselines on NG and NG-TDMP settings.
Method           NG               NG-TDMP
CaRiNG           0.933 ±0.010     0.921 ±0.010
TDRL             0.627 ±0.009     0.837 ±0.068
LEAP             0.651 ±0.019     0.704 ±0.005
SlowVAE          0.362 ±0.041     0.398 ±0.037
PCL              0.507 ±0.091     0.489 ±0.095
betaVAE          0.551 ±0.007     0.437 ±0.021
SKD              0.489 ±0.077     0.381 ±0.084
iVAE             0.391 ±0.686     0.553 ±0.097
SequentialVAE    0.750 ±0.035     0.847 ±0.019

5 Experiments

We conducted the experiments in two simulated environments, utilizing the available ground truth latent variables to evaluate identifiability. Subsequently, we assessed CaRiNG on a real-world VideoQA task, SUTD-TrafficQA (Xu et al., 2021), to verify its capability in representing complex and non-invertible traffic events.

5.1 Simulation Experiments

Dataset and experimental settings. To evaluate whether CaRiNG can learn the causal process and identify the latent variables under a non-invertible scenario, we design a series of simulation experiments based on a random causal structure with a given sample size and variable size. We provide two experimental settings, NG and NG-TDMP, which simulate the scenarios in Theorem 1 with $r=0$ (non-invertible generation) and $r>0$ (time-delayed mixing process), respectively. In particular, for NG, we simulate the visual perception system by using 3 ground-truth latent dimensions to represent the 3D real world and 2 measured variables to represent the 2D observation, so the generation is non-invertible. For NG-TDMP, we simulate the persistence of vision, which involves the previous latent variables in the current mixing process. This means that even if the dimension of the observation is not reduced, the generation process is still non-invertible due to the time-delayed mixing. More details of the data generation process can be found in Appendix A2.1.
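A rough sketch of what an NG-style simulation might look like is given below. It is not the generator described in Appendix A2.1: the transition function, the mixing function, the noise level, and the name `simulate_ng` are all arbitrary illustrative choices. It only reproduces the structural property stated above, namely 3 latent variables with time-delayed dynamics mixed into 2 observed variables.

```python
import math
import random

def simulate_ng(T, seed=0, noise_std=0.1):
    """Toy non-invertible generation: 3 latent variables, 2 observed ones.

    This is NOT the generator from the paper's appendix; the transition
    and mixing functions below are arbitrary illustrative choices.
    """
    rng = random.Random(seed)
    z = [rng.gauss(0.0, 1.0) for _ in range(3)]
    latents, observations = [], []
    for _ in range(T):
        # Time-delayed latent transition z_t = f(z_{t-1}) + noise.
        z = [
            math.tanh(0.8 * z[0] + 0.3 * z[1]) + noise_std * rng.gauss(0.0, 1.0),
            math.tanh(0.5 * z[1] - 0.4 * z[2]) + noise_std * rng.gauss(0.0, 1.0),
            math.tanh(0.6 * z[2] + 0.2 * z[0]) + noise_std * rng.gauss(0.0, 1.0),
        ]
        # Mixing from R^3 to R^2 with r = 0: information about z_t is lost.
        x = [z[0] + 0.5 * z[2] ** 2, math.sin(z[1]) + z[0] * z[2]]
        latents.append(list(z))
        observations.append(x)
    return latents, observations

latents, observations = simulate_ng(T=100)
print(len(latents), len(latents[0]), len(observations[0]))
```

Because the observation has fewer dimensions than the latent state, a single $\mathbf{x}_{t}$ cannot determine $\mathbf{z}_{t}$, which is the situation the temporal context is meant to resolve.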

Evaluation metrics. We apply the standard evaluation metric in the field of ICA, the Mean Correlation Coefficient (MCC), to evaluate the identifiability of CaRiNG. MCC measures the recovery of latent factors from the absolute values of the correlation coefficients between every ground-truth factor and every estimated latent variable. It first calculates the Pearson correlation coefficients to measure the relationship and then adjusts the order with an assignment algorithm. The MCC score is a value from 0 to 1, where a higher score denotes better identifiability.
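The metric described above can be sketched in plain Python as follows. This is a minimal illustration rather than the evaluation code used in the experiments: the assignment step is done by brute force over all permutations, which is only practical for a handful of latent variables (an assignment solver such as the Hungarian algorithm would normally replace it), and the names `pearson` and `mcc` are our own.

```python
import itertools
import math
import random

def pearson(a, b):
    """Pearson correlation coefficient between two equally long samples."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)

def mcc(true_latents, est_latents):
    """Mean correlation coefficient under the best one-to-one matching.

    Both arguments are lists of n components, each a list of samples.
    """
    n = len(true_latents)
    corr = [[abs(pearson(t, e)) for e in est_latents] for t in true_latents]
    best = max(
        sum(corr[i][perm[i]] for i in range(n))
        for perm in itertools.permutations(range(n))
    )
    return best / n

# Toy check: the estimate is a permuted, component-wise affine copy of the truth.
rng = random.Random(0)
true = [[rng.gauss(0.0, 1.0) for _ in range(200)] for _ in range(3)]
est = [
    [-2.0 * v + 1.0 for v in true[2]],
    [3.0 * v for v in true[0]],
    [v - 5.0 for v in true[1]],
]
score = mcc(true, est)
print(score)
```

Note that Pearson correlation only rewards linear agreement, so a nonlinear component-wise transformation would score below 1 with this sketch even though it is allowed by the identifiability definition.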

Baseline methods. We compare CaRiNG with a series of baseline methods. BetaVAE (Higgins et al., 2017) is the most basic baseline, which ignores the temporal dependency and cannot utilize any auxiliary information. SlowVAE (Klindt et al., 2020) and PCL (Hyvarinen & Morioka, 2017) provide identifiability results but are limited by the assumption of independent sources. iVAE (Khemakhem et al., 2020) leverages nonstationarity (auxiliary information) to achieve identifiability. It is important to note that iVAE requires additional domain labels as input; in our experiments, we simply used time indices as the domain label. In addition, LEAP (Yao et al., 2022b) and TDRL (Yao et al., 2022a) allow for learning causal processes but assume an invertible generation process. Besides, we also compare CaRiNG with other temporal representation learning methods that are not based on ICA, such as Sequential VAE (Chung et al., 2015) and SKD (Berman et al., 2022), in which the disentangled representation has no identifiability guarantee.

Table 2: Results on the SUTD-TrafficQA dataset. The cross-modality matching parts of TDRL and CaRiNG are based on HCRN.
Method Year Accuracy (%)
I3D+LSTM CVPR2017 33.21
HCRN CVPR2020 36.26
VQAC ICCV2021 36.00
MASN ACL2021 36.03
DualVGR TMM2021 36.07
Eclipse CVPR2021 37.05
CMCIR TPAMI2023 38.58
TDRL NeurIPS2022 37.32
CaRiNG - 41.22

Quantitative results. The performance of CaRiNG and other baseline methods in both the NG and NG-TDMP scenarios is presented in Table 1. First, all baseline nonlinear ICA methods yield unsatisfactory MCC scores in both scenarios, including the strong TDRL baseline, which previously obtained good results in invertible settings, as shown in Figure 4 (c). As shown in Figure 4 (a), TDRL cannot recover the latent variable lost through non-invertible generation (MCC=0.03 for that variable). This is also illustrated by the scatter plots in Figure 4 (b), which show that the estimated and ground-truth variables are independent on that dimension. Interestingly, we find that the Sequential VAE method works better than other methods that do not use the temporal context, which also demonstrates the necessity of temporal context for addressing the non-invertibility issue. However, we still find that constraining the conditional independence further improves performance, which shows the effect of the KL part. Furthermore, CaRiNG consistently delivers robust identifiability outcomes in both settings. This suggests that leveraging temporal context significantly enhances identifiability when faced with non-invertible generation processes. Lastly, performance in the NG scenario is better than that in the NG-TDMP scenario, reflecting the increased complexity introduced by the time-delayed mixing process.

5.2 Real-world Experiments

Dataset and experimental settings. The SUTD-TrafficQA dataset (Xu et al., 2021) is a comprehensive resource tailored for video event understanding in traffic scenarios, notably characterized by numerous occlusions among traffic agents. It consists of 10,090 videos and provides 62,535 human-annotated QA pairs. Among them, 56,460 QA pairs are used for training and the remaining 6,075 QA pairs are used for testing. The dataset challenges models with six reasoning tasks: “Basic Understanding” is designed for grasping traffic dynamics. “Event Forecasting” and “Reverse Reasoning” evaluate the temporal prediction ability. “Introspection”, “Attribution”, and “Counterfactual Inference” require the model to understand the causal dynamics and conduct reasoning. All tasks are formulated as multiple-choice questions (evaluated by accuracy) without limiting the number of candidate answers, and demand a deep comprehension of traffic events and their underlying causality.

Baseline methods. The primary method we benchmark against is TDRL (Yao et al., 2022a), which lets us evaluate representation ability in the complex and non-invertible traffic environment. Additionally, we compare CaRiNG with state-of-the-art VideoQA methods, including I3D+LSTM (Carreira & Zisserman, 2017), HCRN (Le et al., 2020), VQAC (Kim et al., 2021), MASN (Seo et al., 2021), DualVGR (Wang et al., 2021), Eclipse (Xu et al., 2021), and CMCIR (Liu et al., 2023). In our approach, CaRiNG is leveraged to identify latent causal dynamics, while HCRN serves as the basic model for question answering. Further implementation details are provided in the Appendix.

Quantitative results. Performance comparisons for the six question types on SUTD-TrafficQA are summarized in Table 2. CaRiNG achieves an accuracy of 41.22%, a relative improvement of nearly 6.8% over the next best method, CMCIR. Notably, compared with TDRL, which lacks temporal context, CaRiNG is considerably better at representing complex, non-invertible traffic events. Compared with the HCRN baseline, which employs the same cross-modality matching module, our approach raises the accuracy by 4.96 points through causal representation learning. Although CMCIR (Liu et al., 2023) uses the Swin-Transformer-L (Liu et al., 2021) pretrained on the ImageNet-22K dataset as the frame-level appearance extractor and the video Swin-B (Liu et al., 2022) pretrained on Kinetics-600 as the clip-level motion feature extractor (both more powerful than ours), CaRiNG with simple ResNet101 (He et al., 2016) features still outperforms it by 2.64 points on average. More analysis on TrafficQA and another evaluation on Volleyball (Ibrahim et al., 2016) can be found in Appendix A3 and A4, respectively.
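As a quick sanity check, the margins quoted above can be re-derived from the accuracies in Table 2. The short script below is only an arithmetic illustration; the dictionary `acc` simply copies four rows of that table.

```python
# Accuracies (%) copied from Table 2.
acc = {"HCRN": 36.26, "TDRL": 37.32, "CMCIR": 38.58, "CaRiNG": 41.22}

# Absolute gain over the next best method (CMCIR), in percentage points.
gain_over_cmcir = acc["CaRiNG"] - acc["CMCIR"]
# The same gain expressed relative to CMCIR's accuracy.
relative_gain = 100.0 * gain_over_cmcir / acc["CMCIR"]
# Absolute gain over HCRN, which shares the cross-modality matching module.
gain_over_hcrn = acc["CaRiNG"] - acc["HCRN"]

print(f"{gain_over_cmcir:.2f} points, {relative_gain:.1f}% relative, {gain_over_hcrn:.2f} over HCRN")
# prints "2.64 points, 6.8% relative, 4.96 over HCRN"
```

The 6.8% figure is therefore a relative improvement over CMCIR, while 2.64 and 4.96 are absolute differences in accuracy points.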

6 Conclusion

In this paper, we have proposed learning temporal causal representations under a non-invertible generation process, motivated by a common requirement of temporal systems such as the visual perception process. We have established identifiability theories that allow for recovering the latent causal process under a nonlinear and non-invertible mixing function. Furthermore, based on this theory, we have proposed our approach, CaRiNG, which leverages the temporal context to estimate the lost latent information. We have conducted a series of simulated experiments to verify the identifiability results of CaRiNG under non-invertible generation and evaluated the learned representation in a complex and non-invertible traffic environment with real-world VideoQA tasks.

Impact Statement

This study introduces both a theoretical framework and a practical approach for extracting causal representations from time-series data. Such advancements enable the development of more transparent and interpretable models, enhancing our grasp of causal dynamics in real-world settings. This approach may benefit many real-world applications, including healthcare, autonomous driving, and finance, but it could also be used illegally. For example, within the financial sphere, it can be harnessed to decipher ever-evolving market trends, optimizing predictions and thereby influencing investment and risk management decisions. However, it is imperative to note that any misjudgment of causal relationships could lead to detrimental consequences in these domains. Thus, causal links must be established with precision to prevent skewed or biased inferences.

Theoretically, though allowing for the non-invertible generation process, our theoretical assumptions still fall short of fully capturing the intricacies of real-world scenarios. For example, identifiability requires the absence of instantaneous causal relations, i.e., relying solely on time-delayed influences within the latent causal dynamics. Furthermore, we operate under the presumption that the number of variables remains consistent across different time steps, signifying that no agents enter or exit the environment. Moving forward, we aim to broaden our framework to ensure identifiability in more general settings, embracing instantaneous causal dynamics and the flexibility for variables to either enter or exit.

In our experiments, we evaluate our approach with both simulated and real-world datasets. However, our simulation relies predominantly on data points, creating a gap from real-world data. Concurrently, the real datasets lack the presence of ground truth latent variables. In the future, we plan to develop a benchmark specifically tailored for the causal representation learning task. This benchmark will harness the capabilities of game engines and renderers to produce videos embedded with ground-truth latent variables.

Acknowledgments

We would like to acknowledge the support from NSF Grant 2229881, the National Institutes of Health (NIH) under Contract R01HL159805, and grants from Apple Inc., KDDI Research Inc., Quris AI, and Florin Court Capital.

References

  • Berman et al. (2022) Berman, N., Naiman, I., and Azencot, O. Multifactor sequential disentanglement via structured koopman autoencoders. In The Eleventh International Conference on Learning Representations, 2022.
  • Berzuini et al. (2012) Berzuini, C., Dawid, P., and Bernardinelli, L. Causality: Statistical perspectives and applications. John Wiley & Sons, 2012.
  • Cai & Xie (2019) Cai, R. and Xie, F. Triad constraints for learning causal structure of latent variables. Advances in neural information processing systems, 2019.
  • Carreira & Zisserman (2017) Carreira, J. and Zisserman, A. Quo vadis, action recognition? a new model and the kinetics dataset. In proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.  6299–6308, 2017.
  • Choi et al. (2011) Choi, M. J., Tan, V. Y., Anandkumar, A., and Willsky, A. S. Learning latent tree graphical models. Journal of Machine Learning Research, 12:1771–1812, 2011.
  • Chung et al. (2015) Chung, J., Kastner, K., Dinh, L., Goel, K., Courville, A. C., and Bengio, Y. A recurrent latent variable model for sequential data. Advances in neural information processing systems, 28, 2015.
  • Coltheart (1980) Coltheart, M. The persistences of vision. Philosophical Transactions of the Royal Society of London. B, Biological Sciences, 290(1038):57–69, 1980.
  • Dinh et al. (2016) Dinh, L., Sohl-Dickstein, J., and Bengio, S. Density estimation using real nvp. arXiv preprint arXiv:1605.08803, 2016.
  • Drton et al. (2017) Drton, M., Lin, S., Weihs, L., and Zwiernik, P. Marginal likelihood and model selection for gaussian latent tree and forest models. 2017.
  • Fraccaro et al. (2016) Fraccaro, M., Sønderby, S. K., Paquet, U., and Winther, O. Sequential neural models with stochastic layers, 2016.
  • Friston (2009) Friston, K. Causal modelling and brain connectivity in functional magnetic resonance imaging. PLoS biology, 7(2):e1000033, 2009.
  • Ghysels et al. (2016) Ghysels, E., Hill, J. B., and Motegi, K. Testing for granger causality with mixed frequency data. Journal of Econometrics, 192(1):207–230, 2016.
  • Hälvä & Hyvarinen (2020) Hälvä, H. and Hyvarinen, A. Hidden markov nonlinear ica: Unsupervised learning from nonstationary time series. In Conference on Uncertainty in Artificial Intelligence, pp. 939–948. PMLR, 2020.
  • Hälvä et al. (2021) Hälvä, H., Corff, S. L., Lehéricy, L., So, J., Zhu, Y., Gassiat, E., and Hyvarinen, A. Disentangling identifiable features from noisy data with structured nonlinear ica. arXiv preprint arXiv:2106.09620, 2021.
  • Hartford et al. (2022) Hartford, J., Ahuja, K., Bengio, Y., and Sridhar, D. Beyond the injective assumption in causal representation learning. 2022.
  • He et al. (2016) He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.  770–778, 2016.
  • Higgins et al. (2017) Higgins, I., Matthey, L., Pal, A., Burgess, C., Glorot, X., Botvinick, M., Mohamed, S., and Lerchner, A. beta-VAE: Learning basic visual concepts with a constrained variational framework. In International Conference on Learning Representations, 2017. URL https://openreview.net/forum?id=Sy2fzU9gl.
  • Huang et al. (2022) Huang, B., Low, C. J. H., Xie, F., Glymour, C., and Zhang, K. Latent hierarchical causal structure discovery with rank constraints. arXiv preprint arXiv:2210.01798, 2022.
  • Hyvarinen & Morioka (2016) Hyvarinen, A. and Morioka, H. Unsupervised feature extraction by time-contrastive learning and nonlinear ica. Advances in Neural Information Processing Systems, 29:3765–3773, 2016.
  • Hyvarinen & Morioka (2017) Hyvarinen, A. and Morioka, H. Nonlinear ica of temporally dependent stationary sources. In Artificial Intelligence and Statistics, pp.  460–469. PMLR, 2017.
  • Hyvärinen & Oja (2000) Hyvärinen, A. and Oja, E. Independent component analysis: algorithms and applications. Neural networks, 13(4-5):411–430, 2000.
  • Hyvarinen et al. (2019) Hyvarinen, A., Sasaki, H., and Turner, R. Nonlinear ica using auxiliary variables and generalized contrastive learning. In The 22nd International Conference on Artificial Intelligence and Statistics, pp.  859–868. PMLR, 2019.
  • Ibrahim et al. (2016) Ibrahim, M. S., Muralidharan, S., Deng, Z., Vahdat, A., and Mori, G. A hierarchical deep temporal model for group activity recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.  1971–1980, 2016.
  • Khemakhem et al. (2020) Khemakhem, I., Kingma, D., Monti, R., and Hyvarinen, A. Variational autoencoders and nonlinear ica: A unifying framework. In International Conference on Artificial Intelligence and Statistics, pp.  2207–2217. PMLR, 2020.
  • Kim et al. (2021) Kim, N., Ha, S. J., and Kang, J.-W. Video question answering using language-guided deep compressed-domain video feature. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.  1708–1717, 2021.
  • Klindt et al. (2020) Klindt, D., Schott, L., Sharma, Y., Ustyuzhaninov, I., Brendel, W., Bethge, M., and Paiton, D. Towards nonlinear disentanglement in natural data with temporal sparse coding. arXiv preprint arXiv:2007.10930, 2020.
  • Kong et al. (2022) Kong, L., Xie, S., Yao, W., Zheng, Y., Chen, G., Stojanov, P., Akinwande, V., and Zhang, K. Partial disentanglement for domain adaptation. In International conference on machine learning, pp. 11455–11472. PMLR, 2022.
  • Kummerfeld & Ramsey (2016) Kummerfeld, E. and Ramsey, J. Causal clustering for 1-factor measurement models. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pp.  1655–1664, 2016.
  • Lachapelle et al. (2022) Lachapelle, S., Rodriguez, P., Sharma, Y., Everett, K. E., Le Priol, R., Lacoste, A., and Lacoste-Julien, S. Disentanglement via mechanism sparsity regularization: A new principle for nonlinear ica. In Conference on Causal Learning and Reasoning, pp. 428–484. PMLR, 2022.
  • Lachapelle et al. (2023) Lachapelle, S., Mahajan, D., Mitliagkas, I., and Lacoste-Julien, S. Additive decoders for latent variables identification and cartesian-product extrapolation, 2023.
  • Le et al. (2020) Le, T. M., Le, V., Venkatesh, S., and Tran, T. Hierarchical conditional relation networks for video question answering. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.  9972–9981, 2020.
  • Li & Mandt (2018) Li, Y. and Mandt, S. Disentangled sequential autoencoder. arXiv preprint arXiv:1803.02991, 2018.
  • Lippe et al. (2022a) Lippe, P., Magliacane, S., Löwe, S., Asano, Y. M., Cohen, T., and Gavves, E. icitris: Causal representation learning for instantaneous temporal effects. arXiv preprint arXiv:2206.06169, 2022a.
  • Lippe et al. (2022b) Lippe, P., Magliacane, S., Löwe, S., Asano, Y. M., Cohen, T., and Gavves, E. Citris: Causal identifiability from temporal intervened sequences. In International Conference on Machine Learning, pp. 13557–13603. PMLR, 2022b.
  • Liu et al. (2023) Liu, Y., Li, G., and Lin, L. Cross-modal causal relational reasoning for event-level visual question answering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.
  • Liu et al. (2021) Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., and Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.  10012–10022, 2021.
  • Liu et al. (2022) Liu, Z., Hu, H., Lin, Y., Yao, Z., Xie, Z., Wei, Y., Ning, J., Cao, Y., Zhang, Z., Dong, L., et al. Swin transformer v2: Scaling up capacity and resolution. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.  12009–12019, 2022.
  • Palmer (1999) Palmer, S. E. Vision science: Photons to phenomenology. MIT press, 1999.
  • Pearl (1988) Pearl, J. Probabilistic reasoning in intelligent systems: networks of plausible inference. Morgan kaufmann, 1988.
  • Rezende & Mohamed (2015) Rezende, D. and Mohamed, S. Variational inference with normalizing flows. In International conference on machine learning, pp. 1530–1538. PMLR, 2015.
  • Seo et al. (2021) Seo, A., Kang, G.-C., Park, J., and Zhang, B.-T. Attend what you need: Motion-appearance synergistic networks for video question answering. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, pp.  6167–6177, 2021.
  • Shimizu et al. (2009) Shimizu, S., Hoyer, P. O., and Hyvärinen, A. Estimation of linear non-gaussian acyclic models for latent factors. Neurocomputing, 72(7-9):2024–2027, 2009.
  • Silva et al. (2006) Silva, R., Scheines, R., Glymour, C., Spirtes, P., and Chickering, D. M. Learning the structure of linear latent variable models. Journal of Machine Learning Research, 7(2), 2006.
  • Sorrenson et al. (2020) Sorrenson, P., Rother, C., and Köthe, U. Disentanglement by nonlinear ica with general incompressible-flow networks (gin). arXiv preprint arXiv:2001.04872, 2020.
  • Spearman (1928) Spearman, C. Pearson’s contribution to the theory of two factors. British Journal of Psychology, 19(1):95, 1928.
  • Spelke (1990) Spelke, E. S. Principles of object perception. Cognitive Science, 14, 1990.
  • Vaswani et al. (2017) Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. Advances in neural information processing systems, 30, 2017.
  • Wang et al. (2021) Wang, J., Bao, B., and Xu, C. Dualvgr: A dual-visual graph reasoning unit for video question answering. IEEE Transactions on Multimedia, 2021.
  • Werbos (1974) Werbos, P. Beyond regression: New tools for prediction and analysis in the behavioral sciences. PhD thesis, Committee on Applied Mathematics, Harvard University, Cambridge, MA, 1974.
  • Wertheimer (1938) Wertheimer, M. Laws of organization in perceptual forms. 1938.
  • Wolford (1993) Wolford, G. A model of visible persistence based on linear systems. Canadian Psychology/Psychologie canadienne, 34(2):162, 1993.
  • Xie et al. (2020) Xie, F., Cai, R., Huang, B., Glymour, C., Hao, Z., and Zhang, K. Generalized independent noise condition for estimating latent variable causal graphs. arXiv preprint arXiv:2010.04917, 2020.
  • Xie et al. (2022) Xie, F., Huang, B., Chen, Z., He, Y., Geng, Z., and Zhang, K. Identification of linear non-gaussian latent hierarchical structure. In International Conference on Machine Learning, pp. 24370–24387. PMLR, 2022.
  • Xu et al. (2021) Xu, L., Huang, H., and Liu, J. Sutd-trafficqa: A question answering benchmark and an efficient network for video reasoning over traffic events. In CVPR, pp.  9878–9888, 2021.
  • Yao et al. (2022a) Yao, W., Chen, G., and Zhang, K. Temporally disentangled representation learning. In Advances in Neural Information Processing Systems, 2022a. URL https://openreview.net/forum?id=Vi-sZWNA_Ue.
  • Yao et al. (2022b) Yao, W., Sun, Y., Ho, A., Sun, C., and Zhang, K. Learning temporally causal latent processes from general temporal data. In International Conference on Learning Representations, 2022b. URL https://openreview.net/forum?id=RDlLMjLJXdq.
  • Zhang (2004) Zhang, N. L. Hierarchical latent class models for cluster analysis. The Journal of Machine Learning Research, 5:697–723, 2004.
  • Zhou et al. (2022) Zhou, H., Kadav, A., Shamsian, A., Geng, S., Lai, F., Zhao, L., Liu, T., Kapadia, M., and Graf, H. P. Composer: compositional reasoning of group activity in videos with keypoint-only modality. In European Conference on Computer Vision, pp.  249–266. Springer, 2022.
  • Ziegler & Rush (2019) Ziegler, Z. and Rush, A. Latent normalizing flows for discrete sequences. In International Conference on Machine Learning, pp. 7673–7682. PMLR, 2019.

Appendix for
 
“Learning Temporal Causal Representation under Non-Invertible Generation Process”

Appendix A1 Identifiability Theory

A1.1 Proof for Theorem 1

Let us first shed light on the identifiability theory on the special case with $\tau=r+1$, i.e.,

$$\mathbf{x}_{t}=\mathbf{g}(\mathbf{z}_{t:t-r}),\quad z_{it}=f_{i}\left(\mathbf{z}_{t-1:t-r-1},\epsilon_{it}\right),\quad\mathbf{z}_{t}=\mathbf{m}(\mathbf{x}_{t:t-\mu}). \tag{A1}$$
Theorem A1 (Identifiability under Non-invertible Generative Process).

For a series of observations $\mathbf{x}_{t}\in\mathbb{R}^{d}$ and estimated latent variables $\mathbf{\hat{z}}_{t}\in\mathbb{R}^{n}$, suppose there exist functions $\mathbf{\hat{g}},\mathbf{\hat{m}}$ that are subject to observational equivalence,

$$\mathbf{x}_{t}=\mathbf{\hat{g}}(\mathbf{\hat{z}}_{t:t-r}),\quad\mathbf{\hat{z}}_{t}=\mathbf{\hat{m}}(\mathbf{x}_{t:t-\mu}). \tag{A2}$$

If assumptions

  • (Smooth and Positive Density) the probability density function of latent variables is third-order differentiable and positive in $\mathbb{R}^{n}$,

  • (conditional independence) the components of $\mathbf{\hat{z}}_{t}$ are mutually independent conditional on $\mathbf{\hat{z}}_{t-1:t-r-1}$,

  • (sufficiency) let $\eta_{kt}\triangleq\log p(z_{kt}|\mathbf{z}_{t-1:t-r-1})$, and

    $$\mathbf{v}_{lt}\triangleq\Big(\frac{\partial^{2}\eta_{1t}}{\partial z_{1t}\partial z_{l,t-r-1}},...,\frac{\partial^{2}\eta_{nt}}{\partial z_{nt}\partial z_{l,t-r-1}},\frac{\partial^{3}\eta_{1t}}{\partial z_{1t}^{2}\partial z_{l,t-r-1}},...,\frac{\partial^{3}\eta_{nt}}{\partial z_{nt}^{2}\partial z_{l,t-r-1}}\Big)^{\intercal}, \tag{A3}$$

    for $l=1,2,\cdots,n$. For each value of $\mathbf{z}_{t}$, there exist $2n$ different values of $z_{l,t-r-1}$ such that the $2n$ vector functions $\mathbf{v}_{lt}\in\mathbb{R}^{2n}$ are linearly independent,

are satisfied, then $\mathbf{z}_{t}$ must be a component-wise transformation of a permuted version of $\mathbf{\hat{z}}_{t}$ with regard to context $\{\mathbf{x}_{j}\mid\forall j=t,t-1,\cdots,t-\mu-r\}$.
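To give some intuition for the sufficiency assumption, the sketch below evaluates the vector in Eq A3 for a scalar toy model ($n=1$, $r=0$) and checks linear independence at two values of $z_{t-1}$. Everything in it is an illustrative assumption rather than part of the theorem: the Gaussian transition $z_t = f(z_{t-1}) + \sigma(z_{t-1})\,\epsilon_t$, the function names, and the reading of the condition as "the vectors evaluated at $2n$ points have non-zero determinant". For this model, $\frac{\partial^{2}\eta}{\partial z_t \partial z_{t-1}} = \frac{f'}{\sigma^{2}} + \frac{2(z_t - f)\sigma'}{\sigma^{3}}$ and $\frac{\partial^{3}\eta}{\partial z_t^{2} \partial z_{t-1}} = \frac{2\sigma'}{\sigma^{3}}$.

```python
import math


def v_vector(z_t, z_prev, f, df, sigma, dsigma):
    """Return (d2 eta / dz_t dz_prev, d3 eta / dz_t^2 dz_prev) for the toy model
    z_t = f(z_prev) + sigma(z_prev) * eps with standard Gaussian eps (an assumption)."""
    s = sigma(z_prev)
    d2 = df(z_prev) / s**2 + 2.0 * (z_t - f(z_prev)) * dsigma(z_prev) / s**3
    d3 = 2.0 * dsigma(z_prev) / s**3
    return (d2, d3)


def det2(u, v):
    """Determinant of the 2x2 matrix with columns u and v."""
    return u[0] * v[1] - u[1] * v[0]


z_t, a, b = 0.3, -1.0, 1.0  # one value of z_t and two candidate values of z_{t-1}

# Constant noise scale: the third-order term vanishes, so the two vectors are collinear.
homo = [v_vector(z_t, u, math.tanh, lambda x: 1.0 - math.tanh(x) ** 2,
                 lambda x: 1.0, lambda x: 0.0) for u in (a, b)]
det_homo = det2(*homo)

# Noise scale modulated by z_{t-1}: the two vectors span R^2.
hetero = [v_vector(z_t, u, lambda x: x, lambda x: 1.0,
                   lambda x: math.exp(0.5 * x), lambda x: 0.5 * math.exp(0.5 * x)) for u in (a, b)]
det_hetero = det2(*hetero)
```

Under these assumptions, `det_homo` is zero while `det_hetero` is not, which suggests that the history must influence more than the conditional mean of the latent transition for the condition to hold in this toy case.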

Proof.

For any $t$, combining Eq A1 and Eq A2 gives

$$\begin{aligned}\mathbf{z}_{t}&=\mathbf{m}(\mathbf{x}_{t:t-\mu})\\&=\mathbf{m}(\mathbf{\hat{g}}(\mathbf{\hat{z}}_{t},\mathbf{\hat{z}}_{t-1:t-r}),\mathbf{x}_{t-1:t-\mu})\\&=\mathbf{m}(\mathbf{\hat{g}}(\mathbf{\hat{z}}_{t},\mathbf{\hat{m}}(\mathbf{x}_{t-1:t-\mu-1}),\cdots,\mathbf{\hat{m}}(\mathbf{x}_{t-r:t-\mu-r})),\mathbf{x}_{t-1:t-\mu}),\end{aligned} \tag{A4}$$

as well as $\mathbf{\hat{z}}_{t}=\mathbf{\hat{m}}(\mathbf{g}(\mathbf{z}_{t},\mathbf{m}(\mathbf{x}_{t-1:t-\mu-1}),\cdots,\mathbf{m}(\mathbf{x}_{t-r:t-\mu-r})),\mathbf{x}_{t-1:t-\mu})$ similarly. From Eq A4, we have a unified partially invertible function $\mathbf{z}_{t}=\mathbf{h}(\mathbf{\hat{z}}_{t}|\mathbf{x}_{t-1:t-\mu-r})$, where $\mathbf{h}=\mathbf{m}\circ\hat{\mathbf{g}}$, with Jacobian $\frac{\partial\mathbf{z}_{t}}{\partial\mathbf{\hat{z}}_{t}}=\mathbf{H}_{t}(\mathbf{\hat{z}}_{t};\mathbf{x}_{t-1:t-\mu-r})$.
By partially invertible, we mean that $\mathbf{z}$ and $\mathbf{\hat{z}}$ are in one-to-one correspondence for any fixed context observations $\mathbf{x}_{t-1:t-\mu-r}$. Note also that since $\mathbf{g},\mathbf{\hat{g}},\mathbf{m},\mathbf{\hat{m}}$ are second-order differentiable, the nested $\mathbf{h}$ is also second-order differentiable. Let us consider the mapping from joint distribution $(\mathbf{\hat{z}}_{t},\mathbf{x}_{t-1:t-\mu-r-1})$ to $(\mathbf{z}_{t},\mathbf{x}_{t-1:t-\mu-r-1})$, i.e.,

P(𝐳t,𝐱t1:tμr1)=P(𝐳^t,𝐱t1:tμr1)/|𝐉t|,𝑃subscript𝐳𝑡subscript𝐱:𝑡1𝑡𝜇𝑟1𝑃subscript^𝐳𝑡subscript𝐱:𝑡1𝑡𝜇𝑟1subscript𝐉𝑡P(\mathbf{z}_{t},\mathbf{x}_{t-1:t-\mu-r-1})=P(\mathbf{\hat{z}}_{t},\mathbf{x}% _{t-1:t-\mu-r-1})\,/\,|\mathbf{J}_{t}|,italic_P ( bold_z start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , bold_x start_POSTSUBSCRIPT italic_t - 1 : italic_t - italic_μ - italic_r - 1 end_POSTSUBSCRIPT ) = italic_P ( over^ start_ARG bold_z end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , bold_x start_POSTSUBSCRIPT italic_t - 1 : italic_t - italic_μ - italic_r - 1 end_POSTSUBSCRIPT ) / | bold_J start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT | , (A5)

where

𝐉t=[𝐳t𝐳^t𝟎𝐈],subscript𝐉𝑡matrixsubscript𝐳𝑡subscript^𝐳𝑡0𝐈\mathbf{J}_{t}=\begin{bmatrix}\frac{\partial\mathbf{z}_{t}}{\partial\mathbf{% \hat{z}}_{t}}&\mathbf{0}\\ *&\mathbf{I}\end{bmatrix},bold_J start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT = [ start_ARG start_ROW start_CELL divide start_ARG ∂ bold_z start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_ARG start_ARG ∂ over^ start_ARG bold_z end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_ARG end_CELL start_CELL bold_0 end_CELL end_ROW start_ROW start_CELL ∗ end_CELL start_CELL bold_I end_CELL end_ROW end_ARG ] , (A6)

which is a block lower-triangular matrix, where $\mathbf{I}$ denotes the identity matrix and $*$ denotes an arbitrary matrix. Thus, its determinant satisfies $|\mathbf{J}_{t}|=|\frac{\partial\mathbf{z}_{t}}{\partial\mathbf{\hat{z}}_{t}}|=|\mathbf{H}_{t}|$. Dividing both sides of Eq. A5 by $P(\mathbf{x}_{t-1:t-\mu-r-1})$ gives

$$\textbf{LHS}=P(\mathbf{z}_{t}|\mathbf{x}_{t-1:t-\mu-r-1})=P(\mathbf{z}_{t}|\mathbf{z}_{t-1:t-r-1}), \tag{A7}$$

since $\mathbf{z}_{t}$ and $\mathbf{x}_{t-1:t-\mu-r-1}$ are independent conditioned on $\mathbf{z}_{t-1:t-r-1}$. Similarly, on the right-hand side, $P(\mathbf{\hat{z}}_{t}|\mathbf{x}_{t-1:t-\mu-r-1})=P(\mathbf{\hat{z}}_{t}|\mathbf{\hat{z}}_{t-1:t-r-1})$ holds as well, which yields

$$P(\mathbf{z}_{t}|\mathbf{z}_{t-1:t-r-1})=P(\mathbf{\hat{z}}_{t}|\mathbf{\hat{z}}_{t-1:t-r-1})\,/\,|\mathbf{H}_{t}|. \tag{A8}$$
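The determinant identity used in this step, that a block lower-triangular matrix with an identity block has the determinant of its upper-left block, can be checked numerically. The following is a minimal sketch in plain Python; the matrices `H` and `B` are hypothetical stand-ins for $\frac{\partial\mathbf{z}_{t}}{\partial\mathbf{\hat{z}}_{t}}$ and the arbitrary block $*$, and are not taken from the paper.

```python
def det(m):
    """Determinant by Laplace expansion along the first row (fine for tiny matrices)."""
    if len(m) == 1:
        return m[0][0]
    total = 0
    for j, a in enumerate(m[0]):
        minor = [row[:j] + row[j + 1:] for row in m[1:]]
        total += (-1) ** j * a * det(minor)
    return total


def block_lower_triangular(H, B):
    """Assemble [[H, 0], [B, I]] where H is n x n and B is m x n."""
    m = len(B)
    top = [list(row) + [0] * m for row in H]
    bottom = [list(B[i]) + [1 if i == j else 0 for j in range(m)] for i in range(m)]
    return top + bottom


H = [[2, 1], [5, 3]]            # hypothetical Jacobian block dz_t / dz_hat_t
B = [[7, -4], [9, 6], [0, 8]]   # arbitrary block, plays the role of "*"
J = block_lower_triangular(H, B)

# The off-diagonal block B has no influence on the determinant.
print(det(J), det(H))
```

Changing `B` leaves `det(J)` unchanged, which is why only $|\mathbf{H}_{t}|$ survives in Eq. A8.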

By direct observation, if the components of $\hat{\mathbf{z}}_{t}$ are mutually independent given $\hat{\mathbf{z}}_{t-1:t-r-1}$, then for any $i\neq j$, $\hat{z}_{it}$ and $\hat{z}_{jt}$ are conditionally independent given $(\hat{\mathbf{z}}_{t}\setminus\{\hat{z}_{it},\hat{z}_{jt}\})\cup\hat{\mathbf{z}}_{t-1:t-r-1}$. This mutual independence of the components of $\hat{\mathbf{z}}_{t}$ given $\hat{\mathbf{z}}_{t-1:t-r-1}$ implies two things:

  • $\hat{z}_{it}$ is independent of $\hat{\mathbf{z}}_{t}\setminus\{\hat{z}_{it},\hat{z}_{jt}\}$ conditional on $\hat{\mathbf{z}}_{t-1:t-r-1}$. Formally,

    $$p(\hat{z}_{it}\,|\,\hat{\mathbf{z}}_{t-1:t-r-1})=p(\hat{z}_{it}\,|\,(\hat{\mathbf{z}}_{t}\setminus\{\hat{z}_{it},\hat{z}_{jt}\})\cup\hat{\mathbf{z}}_{t-1:t-r-1}).$$
  • $\hat{z}_{it}$ is independent of $\hat{\mathbf{z}}_{t}\setminus\{\hat{z}_{it}\}$ conditional on $\hat{\mathbf{z}}_{t-1:t-r-1}$. Formally,

    $$p(\hat{z}_{it}\,|\,\hat{\mathbf{z}}_{t-1:t-r-1})=p(\hat{z}_{it}\,|\,(\hat{\mathbf{z}}_{t}\setminus\{\hat{z}_{it}\})\cup\hat{\mathbf{z}}_{t-1:t-r-1}).$$

From these two equations, we can derive:

$$p(\hat{z}_{it}\,|\,(\hat{\mathbf{z}}_{t}\setminus\{\hat{z}_{it}\})\cup\hat{\mathbf{z}}_{t-1:t-r-1})=p(\hat{z}_{it}\,|\,(\hat{\mathbf{z}}_{t}\setminus\{\hat{z}_{it},\hat{z}_{jt}\})\cup\hat{\mathbf{z}}_{t-1:t-r-1}),$$

which implies that $\hat{z}_{it}$ and $\hat{z}_{jt}$ are conditionally independent given $(\hat{\mathbf{z}}_{t}\setminus\{\hat{z}_{it},\hat{z}_{jt}\})\cup\hat{\mathbf{z}}_{t-1:t-r-1}$ for $i\neq j$. This conditional independence in turn implies

$$\frac{\partial^{2}\log p(\hat{\mathbf{z}}_{t},\hat{\mathbf{z}}_{t-1:t-r-1})}{\partial\hat{z}_{it}\partial\hat{z}_{jt}}=0,$$

assuming the cross second-order derivative exists.

Given that $p(\hat{\mathbf{z}}_{t},\hat{\mathbf{z}}_{t-1:t-r-1})=p(\hat{\mathbf{z}}_{t}\,|\,\hat{\mathbf{z}}_{t-1:t-r-1})\,p(\hat{\mathbf{z}}_{t-1:t-r-1})$ and that $p(\hat{\mathbf{z}}_{t-1:t-r-1})$ does not depend on $\hat{z}_{it}$ or $\hat{z}_{jt}$, the above equality is equivalent to

$$\frac{\partial^{2}\log p(\hat{\mathbf{z}}_{t}\,|\,\hat{\mathbf{z}}_{t-1:t-r-1})}{\partial\hat{z}_{it}\partial\hat{z}_{jt}}=0. \tag{A9}$$
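As a sanity check on Eq. A9, the vanishing cross derivative can be seen directly in the simplest case, where the components of $\hat{\mathbf{z}}_{t}$ are mutually independent given the history so that the conditional density factorizes. The worked step below is a sketch under that factorization assumption, not a replacement for the argument above:

```latex
\begin{align*}
\log p(\hat{\mathbf{z}}_{t}\,|\,\hat{\mathbf{z}}_{t-1:t-r-1})
  &= \sum_{k=1}^{n}\log p(\hat{z}_{kt}\,|\,\hat{\mathbf{z}}_{t-1:t-r-1}), \\
\frac{\partial \log p(\hat{\mathbf{z}}_{t}\,|\,\hat{\mathbf{z}}_{t-1:t-r-1})}{\partial \hat{z}_{it}}
  &= \frac{\partial \log p(\hat{z}_{it}\,|\,\hat{\mathbf{z}}_{t-1:t-r-1})}{\partial \hat{z}_{it}}, \\
\frac{\partial^{2} \log p(\hat{\mathbf{z}}_{t}\,|\,\hat{\mathbf{z}}_{t-1:t-r-1})}{\partial \hat{z}_{it}\,\partial \hat{z}_{jt}}
  &= 0 \qquad (i \neq j).
\end{align*}
```

The second line keeps only the $k=i$ summand, which does not involve $\hat{z}_{jt}$, so differentiating once more with respect to $\hat{z}_{jt}$ gives zero.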

Using Eq. A8, the conditional log-density can be expressed as:

$$\log p(\hat{\mathbf{z}}_{t}\,|\,\hat{\mathbf{z}}_{t-1:t-r-1})=\log p(\mathbf{z}_{t}\,|\,\mathbf{z}_{t-1:t-r-1})+\log|\mathbf{H}_{t}|=\sum_{k=1}^{n}\eta_{kt}+\log|\mathbf{H}_{t}|. \tag{A10}$$
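The sign of the $\log|\mathbf{H}_{t}|$ term in Eq. A10 is easy to get wrong, so a one-dimensional check may help. The sketch below assumes a hypothetical scalar case with no history dependence: $z\sim N(0,1)$ and $z=H\hat{z}$ for a constant `H`, so that $\hat{z}\sim N(0,1/H^{2})$. It compares the change-of-variables formula against the closed-form log-density.

```python
import math


def log_normal_pdf(x, sigma):
    """Log-density of a zero-mean normal with standard deviation sigma."""
    return -math.log(sigma * math.sqrt(2 * math.pi)) - x ** 2 / (2 * sigma ** 2)


H = 2.0  # hypothetical scalar Jacobian dz / dz_hat, i.e. z = H * z_hat


def log_p_zhat_change_of_variables(z_hat):
    # Scalar analogue of Eq. A10: log p(z_hat) = log p(z) + log|H|
    return log_normal_pdf(H * z_hat, 1.0) + math.log(abs(H))


def log_p_zhat_closed_form(z_hat):
    # z_hat = z / H with z ~ N(0, 1), so z_hat ~ N(0, 1 / H^2)
    return log_normal_pdf(z_hat, 1.0 / abs(H))


gaps = [
    abs(log_p_zhat_change_of_variables(v) - log_p_zhat_closed_form(v))
    for v in (-1.5, 0.0, 0.7)
]
print(max(gaps))  # should be at floating-point noise level
```

The two expressions agree only when $\log|H|$ is added rather than subtracted, matching the form of Eq. A10.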

Its partial derivative w.r.t. $\hat{z}_{it}$ is:

$$\begin{aligned}\frac{\partial\log p(\hat{\mathbf{z}}_{t}\,|\,\hat{\mathbf{z}}_{t-1:t-r-1})}{\partial\hat{z}_{it}}&=\sum_{k=1}^{n}\frac{\partial\eta_{kt}}{\partial z_{kt}}\cdot\frac{\partial z_{kt}}{\partial\hat{z}_{it}}+\frac{\partial\log|\mathbf{H}_{t}|}{\partial\hat{z}_{it}}\\&=\sum_{k=1}^{n}\frac{\partial\eta_{kt}}{\partial z_{kt}}\cdot\mathbf{H}_{kit}+\frac{\partial\log|\mathbf{H}_{t}|}{\partial\hat{z}_{it}}.\end{aligned}$$

The second-order cross derivative can be depicted as:

$$\frac{\partial^{2}\log p(\hat{\mathbf{z}}_{t}\,|\,\hat{\mathbf{z}}_{t-1:t-r-1})}{\partial\hat{z}_{it}\partial\hat{z}_{jt}}=\sum_{k=1}^{n}\Big(\frac{\partial^{2}\eta_{kt}}{\partial z_{kt}^{2}}\cdot\mathbf{H}_{kit}\mathbf{H}_{kjt}+\frac{\partial\eta_{kt}}{\partial z_{kt}}\cdot\frac{\partial\mathbf{H}_{kit}}{\partial\hat{z}_{jt}}\Big)+\frac{\partial^{2}\log|\mathbf{H}_{t}|}{\partial\hat{z}_{it}\partial\hat{z}_{jt}}. \tag{A11}$$

According to Eq. A9, the right-hand side of the above equation is identically 0. Therefore, for each index $l=1,\dots,n$ and each value of $z_{l,t-r-1}$, its partial derivative with respect to $z_{l,t-r-1}$ is also 0. That is,

$$\sum_{k=1}^{n}\Big(\frac{\partial^{3}\eta_{kt}}{\partial z_{kt}^{2}\partial z_{l,t-r-1}}\cdot\mathbf{H}_{kit}\mathbf{H}_{kjt}+\frac{\partial^{2}\eta_{kt}}{\partial z_{kt}\partial z_{l,t-r-1}}\cdot\frac{\partial\mathbf{H}_{kit}}{\partial\hat{z}_{jt}}\Big)\equiv 0, \tag{A12}$$

where we used the fact that the entries of $\mathbf{H}_{t}$ do not depend on $z_{l,t-r-1}$. For any given value of $\mathbf{z}_{t}$, there exist at least $2n$ different values of $\mathbf{v}_{lt}$ that are linearly independent. For the above equation to hold, we must therefore have $\mathbf{H}_{kit}\mathbf{H}_{kjt}=0$ for $i\neq j$. In other words, each row of $\mathbf{H}_{t}$ has at most a single non-zero entry, and $\mathbf{z}_{t}$ must be a component-wise transformation of a permuted version of $\mathbf{\hat{z}}_{t}$. ∎
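The structural conclusion of the proof can be made concrete with a small check. The sketch below is illustrative only: the function names and the two example Jacobians are hypothetical. It tests whether a matrix satisfies $\mathbf{H}_{kit}\mathbf{H}_{kjt}=0$ for all $i\neq j$ and, if so, reads off which estimated component each true component depends on.

```python
def pairwise_products_vanish(H, tol=1e-12):
    """True if H[k][i] * H[k][j] is zero (up to tol) for every row k and all i != j."""
    for row in H:
        for i in range(len(row)):
            for j in range(i + 1, len(row)):
                if abs(row[i] * row[j]) > tol:
                    return False
    return True


def component_assignment(H, tol=1e-12):
    """For each row k, the column of its single non-zero entry (None otherwise)."""
    assignment = []
    for row in H:
        nonzero = [i for i, v in enumerate(row) if abs(v) > tol]
        assignment.append(nonzero[0] if len(nonzero) == 1 else None)
    return assignment


# Hypothetical Jacobians dz_t / dz_hat_t evaluated at one point.
H_permuted = [[0.0, 2.5, 0.0], [-1.0, 0.0, 0.0], [0.0, 0.0, 0.3]]
H_mixed = [[1.0, 0.5, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

print(pairwise_products_vanish(H_permuted), component_assignment(H_permuted))
print(pairwise_products_vanish(H_mixed))
```

For `H_permuted` the assignment is a permutation of the component indices, which is the identifiability-up-to-permutation structure; `H_mixed` fails because its first row mixes two estimated components. When the matrix is additionally invertible, every row must have exactly one non-zero entry and the columns must be distinct.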

Note that in the proof of Theorem A1, we require the transition lag $\tau$ to be larger than the mixing lag $r=1$. When a mixing lag exists, the identifiability guarantee requires dynamic information from earlier time steps. As long as the inequality $\tau>r$ is satisfied, the parameter $\tau$ can be extended to arbitrary values following a modification similar to that in Appendix A1.4.

A1.2 Discussion for the sufficiency assumption.

This assumption describes the changeability of the latent variables. Taking video understanding as an example, the latent variables may represent concepts. The linear independence of the latent variables means that each concept has a characteristic that cannot be linearly represented by the others. To further illustrate the sufficiency assumption, we give two examples (Yao et al., 2022a) showing when the assumption holds and when it does not.

One distribution that violates this assumption is additive Gaussian noise. Denote by $\mathbf{z}_{h}$ the historical parents, and let $z_{kt}=q_{k}(\mathbf{z}_{h})+\epsilon_{kt}$, where $\epsilon_{kt}\sim N(0,1)$. In this case, we have $\eta_{kt}=\log P(z_{kt}|\mathbf{z}_{h})=-\log\sqrt{2\pi}-\frac{(z_{kt}-q_{k}(\mathbf{z}_{h}))^{2}}{2}$, so $\frac{\partial^{2}\eta_{kt}}{\partial z_{kt}^{2}}=-1$ is constant and its derivative with respect to any historical variable is identically zero, which violates the assumption.
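The Gaussian failure case can be checked numerically. The sketch below is a plain-Python illustration under the unit-variance setup described above; the scalar `q` is a stand-in for $q_{k}(\mathbf{z}_{h})$, and the helper names are hypothetical. It estimates the curvature of $\eta_{kt}$ in $z_{kt}$ by a central finite difference for several values of `z` and `q`.

```python
import math


def eta(z, q):
    """Log-density of z = q + eps with eps ~ N(0, 1); q stands in for q_k(z_h)."""
    return -math.log(math.sqrt(2 * math.pi)) - (z - q) ** 2 / 2


def second_derivative_in_z(z, q, h=1e-3):
    """Central finite-difference estimate of d^2 eta / dz^2."""
    return (eta(z + h, q) - 2 * eta(z, q) + eta(z - h, q)) / h ** 2


# Varying q mimics varying the history z_h through q_k.
curvatures = [
    second_derivative_in_z(z, q)
    for z in (-1.0, 0.3, 2.0)
    for q in (-2.0, 0.0, 1.5)
]
print(curvatures)
```

Every estimate is the same constant (approximately $-1$) regardless of `q`, so the curvature carries no information about the history and its derivative with respect to any historical variable vanishes.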

Conversely, suppose $\epsilon_{kt}$ follows a zero-mean generalized normal distribution, $P(\epsilon_{kt})\propto e^{-\lambda|\epsilon_{kt}|^{\beta}}$, with $\lambda>0$, $\beta>2$, and $\beta\neq 3$. Let $z_{kt}=q_{k}(\mathbf{z}_{h})+\epsilon_{kt}$, where $q_{k}$ is a linear function. If for each $z_{kt}$ there exists at least one $k^{\prime}$ such that $c_{kk^{\prime}}=\frac{\partial z_{kt}}{\partial z_{k^{\prime},t-1}}\neq 0$, then the sufficiency assumption holds.

In this case, we have

$$\frac{\partial^{3}\eta_{kt}}{\partial z_{kt}^{2}\partial z_{k^{\prime},t-1}}=-\lambda\,\text{sgn}(\epsilon_{kt})\,\beta(\beta-1)(\beta-2)|\epsilon_{kt}|^{\beta-3}c_{kk^{\prime}} \tag{A13}$$

and

$$\frac{\partial^{2}\eta_{kt}}{\partial z_{kt}\partial z_{k^{\prime},t-1}}=-\lambda\beta(\beta-1)|\epsilon_{kt}|^{\beta-2}c_{kk^{\prime}}. \tag{A14}$$

We know that $|\epsilon_{lt}|^{\beta-2}$ and $|\epsilon_{lt}|^{\beta-3}$ are linearly independent since their ratio $|\epsilon_{lt}|$ is not constant. Moreover, $|\epsilon_{lt}|^{\beta-2}$ and $|\epsilon_{lt}|^{\beta-3}$, with $l=1,2,\cdots,n$, are $2n$ linearly independent functions because they involve different arguments. Suppose there exist $\alpha_{l1},\alpha_{l2}$ for $l=1,2,\cdots,n$ such that the weighted sum of the $\mathbf{v}_{l,t}$ is zero. Then, for any $k$ we have

$$\alpha_{k1}c_{kk^{\prime}}|\epsilon_{kt}|^{\beta-2}+\alpha_{k2}c_{kk^{\prime}}|\epsilon_{kt}|^{\beta-3}+\sum_{l\neq k}\big(\alpha_{l1}c_{lk^{\prime}}|\epsilon_{lt}|^{\beta-2}+\alpha_{l2}c_{lk^{\prime}}|\epsilon_{lt}|^{\beta-3}\big)=0. \tag{A15}$$

Since the functions $|\epsilon_{lt}|^{\beta-2}$ and $|\epsilon_{lt}|^{\beta-3}$ with $l=1,2,\cdots,n$ are linearly independent and $c_{kk'}\neq 0$, for the above equation to hold we must have $\alpha_{k1}=\alpha_{k2}=0$. As this applies to any $k$, we know that $\alpha_{l1}$ and $\alpha_{l2}$ must be 0 for all $l=1,2,\cdots,n$. That is, $\{\mathbf{v}_{lt}\}$ is linearly independent. Thus, the sufficiency assumption holds.
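The linear-independence claim for the two power functions can be checked numerically. The following is a minimal sketch, not part of the original derivation: the value of `beta`, the sampling range, and all variable names are assumptions chosen for illustration. If the sampled matrix has full column rank, then no non-trivial combination of the two functions can vanish identically.

```python
import numpy as np

# Assumption: beta stands for the exponent parameter in the text; 2.5 is arbitrary.
beta = 2.5
rng = np.random.default_rng(0)
eps = rng.uniform(0.5, 2.0, size=200)  # sample points kept away from 0

# Each column holds one of the two functions evaluated at the sample points.
F = np.column_stack([np.abs(eps) ** (beta - 2), np.abs(eps) ** (beta - 3)])

# Full column rank on the samples means F @ alpha = 0 forces alpha = 0,
# so the two functions cannot be linearly dependent.
rank = np.linalg.matrix_rank(F)

# The ratio of the two functions is |eps|, which is not constant.
ratio = F[:, 0] / F[:, 1]
print(rank, ratio.min(), ratio.max())
```

A rank of 2 here only certifies independence for the sampled `beta`; it is a sanity check, not a proof.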

Please note that the sufficiency assumption is crucial to the identifiability theory, yet not that restrictive. Even if it is not completely satisfied, we can still obtain some subspace identifiability (Kong et al., 2022).

A1.3 Discussion for the cross-time disentanglement

This section demonstrates how entanglement between variables across time steps is prevented. Generally speaking, if information is lost in the transition from $\mathbf{z}_t$ to $\mathbf{x}_t$, we have to borrow information from context such as $\mathbf{z}_{t-1}$ to recover it. It is natural to receive information from $\mathbf{z}_{t-1}$ in order to find the best estimator.

Specifically, let us consider a generating process $x_{it}=g_i(\mathbf{z}_t)$, $z_{it}=f_i(\mathbf{z}_{t-1},\epsilon_{it})$, where $\mathbf{x}_t\in\mathbb{R}^d$ and $\mathbf{z}_t\in\mathbb{R}^n$. Since $x_{it}$ can be fully characterized by $\mathbf{z}_t$ and is not a function of $\mathbf{z}_{t-1}$, we have $\frac{\partial x_{i,t}}{\partial\mathbf{z}_{t-1}}=0$. For the estimation process $x_{i,t}=\hat{g}_i(\hat{\mathbf{z}}_t)$, we have $\frac{\partial x_{i,t}}{\partial\mathbf{z}_{t-1}}=\frac{\partial x_{i,t}}{\partial\hat{\mathbf{z}}_t}\cdot\frac{\partial\hat{\mathbf{z}}_t}{\partial\mathbf{z}_{t-1}}=0$ for $i=1,2,\cdots,d$. Formally, we have the equation $\frac{\partial\mathbf{x}_t}{\partial\hat{\mathbf{z}}_t}\cdot\frac{\partial\hat{\mathbf{z}}_t}{\partial\mathbf{z}_{t-1}}=0$, i.e.,

$$\begin{bmatrix}\frac{\partial x_{1,t}}{\partial\hat{z}_{1t}}&\cdots&\frac{\partial x_{1,t}}{\partial\hat{z}_{nt}}\\ \vdots&\ddots&\vdots\\ \frac{\partial x_{d,t}}{\partial\hat{z}_{1t}}&\cdots&\frac{\partial x_{d,t}}{\partial\hat{z}_{nt}}\end{bmatrix}\cdot\begin{bmatrix}\frac{\partial\hat{z}_{1t}}{\partial z_{1,t-1}}&\cdots&\frac{\partial\hat{z}_{1t}}{\partial z_{n,t-1}}\\ \vdots&\ddots&\vdots\\ \frac{\partial\hat{z}_{nt}}{\partial z_{1,t-1}}&\cdots&\frac{\partial\hat{z}_{nt}}{\partial z_{n,t-1}}\end{bmatrix}=\mathbf{0}. \tag{A16}$$

If there exist at least $n$ different indices $i$ for which the derivatives of $x_{it}$ with respect to $\hat{\mathbf{z}}_t$, viewed as $n$ vector functions $\begin{bmatrix}\frac{\partial x_{i,t}}{\partial\hat{z}_{1t}}&\cdots&\frac{\partial x_{i,t}}{\partial\hat{z}_{nt}}\end{bmatrix}$, are linearly independent, then

$$\begin{bmatrix}\frac{\partial\hat{z}_{1t}}{\partial z_{1,t-1}}&\cdots&\frac{\partial\hat{z}_{1t}}{\partial z_{n,t-1}}\\ \vdots&\ddots&\vdots\\ \frac{\partial\hat{z}_{nt}}{\partial z_{1,t-1}}&\cdots&\frac{\partial\hat{z}_{nt}}{\partial z_{n,t-1}}\end{bmatrix}=\mathbf{0} \tag{A17}$$

holds true. If the rank $R(\frac{\partial\mathbf{x}_t}{\partial\hat{\mathbf{z}}_t})$ of the matrix $\frac{\partial\mathbf{x}_t}{\partial\hat{\mathbf{z}}_t}$ is less than $n$, the entanglement can only happen on the remaining $n-R(\frac{\partial\mathbf{x}_t}{\partial\hat{\mathbf{z}}_t})$ dimensions.

Let us consider two extreme cases. First, if the mixing function $\mathbf{g}$ is invertible, we have $R(\frac{\partial\mathbf{x}_t}{\partial\hat{\mathbf{z}}_t})=n$, and entanglement between time steps is prevented. As another extreme case, in the NG setting mentioned in Appendix A2.1, one dimension of the latent variable is totally lost during the mixing process. In this case, we have to use $\mathbf{z}_{t-1}$ for its estimation. Even in this case, the unnecessary entanglements are prevented as well.
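The rank argument behind Eq A16 and Eq A17 can be illustrated with a small linear-algebra sketch. This is only an illustration under assumptions of our own: the Jacobian $\frac{\partial\mathbf{x}_t}{\partial\hat{\mathbf{z}}_t}$ is replaced by a random constant matrix, the sizes `d` and `n` are arbitrary, and `null_space_basis` is a hypothetical helper. Every column of $\frac{\partial\hat{\mathbf{z}}_t}{\partial\mathbf{z}_{t-1}}$ must lie in the null space of that Jacobian, so the dimension of the null space bounds how much cross-time entanglement is possible.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 3  # hypothetical sizes: d observed dimensions, n latent dimensions


def null_space_basis(A, tol=1e-10):
    """Orthonormal basis of {b : A @ b = 0}, read off from the SVD of A."""
    _, s, vt = np.linalg.svd(A)
    rank = int(np.sum(s > tol))
    return vt[rank:].T  # shape (n, n - rank)


# Case 1: full column rank (rank n). The null space is trivial, so
# A @ B = 0 forces B = 0, i.e. no dependence of z_hat_t on z_{t-1}.
A_full = rng.normal(size=(d, n))
N_full = null_space_basis(A_full)

# Case 2: the last latent dimension is completely lost in the mixing,
# modelled here by a zero column. The null space is one-dimensional.
A_lossy = A_full.copy()
A_lossy[:, -1] = 0.0
N_lossy = null_space_basis(A_lossy)

print(N_full.shape, N_lossy.shape)
print(np.round(np.abs(N_lossy[:, 0]), 6))
```

In the second case the only admissible direction is the lost coordinate itself, which matches the statement that entanglement can only occur on the remaining $n-R$ dimensions.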

A1.4 Extension to Multiple Lags

Multiple Transition Time Lag $\tau$. For the sake of simplicity, we consider only one special case with $\tau=r+1$ in Theorem A1. Our identifiability theorem can actually be extended to arbitrary lags directly. For any given $\tau$, according to modularity, the conclusion at Eq A7 becomes $\textbf{LHS}=P(\mathbf{z}_t|\mathbf{x}_{t-1:t-\mu-r-\tau})=P(\mathbf{z}_t|\mathbf{z}_{t-1:t-r-\tau})$. Similarly, $\textbf{RHS}=P(\hat{\mathbf{z}}_t|\mathbf{x}_{t-1:t-\mu-r-\tau})=P(\hat{\mathbf{z}}_t|\hat{\mathbf{z}}_{t-1:t-r-\tau})$ holds true as well. In addition, some modifications are needed in the sufficiency assumption, i.e., we re-define $\eta_{kt}\triangleq\log p(z_{kt}|\mathbf{z}_{t-1:t-r-\tau})$, and there should be at least $2n$ linearly independent vectors $\mathbf{v}$ with regard to $z_{lt'}$, where $l=1,2,\cdots,n$ and $t-\tau\leq t'\leq t-r-1$. No extra changes are needed.

Infinite Mixing Lag $r$. Theorem A1 can also be easily extended to infinite mixing lag since $\hat{\mathbf{z}}_t=\mathbf{h}(\mathbf{z}_t;\mathbf{x}_{<t})$ still exists when $r\rightarrow\infty$, and the theorem still holds true.

A1.5 Continuity for Permutation Invariance

Let us first give an extreme example to illustrate the importance of extra constraints for identifiability when the probability density of $\mathbf{z}_t$ is not non-zero everywhere in $\mathbb{R}^n$. Consider 4 independent random variables $u,v,x,y$, each following a standard normal distribution. Suppose that there exists an invertible function $(x,y)=\mathbf{h}(u,v)$ satisfying

$$\begin{cases}x=\mathbb{I}(x+y>0)\cdot u+\mathbb{I}(x+y\leq 0)\cdot v\\ y=\mathbb{I}(x+y>0)\cdot v+\mathbb{I}(x+y\leq 0)\cdot u.\end{cases} \tag{A18}$$

Notice that the Jacobian from $(u,v)$ to $(x,y)$ contains at most one non-zero entry for each column or row. However, the result $(x,y)$ is still entangled, and the identifiability of $(u,v)$ is not achieved. What if we now denote the latent variable as $\hat{\mathbf{z}}=(u,v)$, the estimated latent variable as $\mathbf{z}=(x,y)$, and the transition process with two mixing functions as $\mathbf{h}=\mathbf{g}^{-1}\circ\hat{\mathbf{g}}$?

In the literature of nonlinear ICA, the gap between $\mathbf{H}_{ij}\cdot\mathbf{H}_{ik}=0$ when $j\neq k$ and identifiability is rarely discussed. In linear ICA, since the Jacobian is a constant matrix, these two statements are equivalent. Nevertheless, in nonlinear ICA, $\mathbf{H}=\frac{\partial\mathbf{z}}{\partial\hat{\mathbf{z}}}$ is not a constant but a function of $\hat{\mathbf{z}}$, which may lead to the failure of identifiability as shown in Eq A18.
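The counterexample in Eq A18 can be simulated directly. The sketch below is our own illustration (sample size, seed, and variable names are assumptions): it uses the fact that $x+y=u+v$ in both branches, so the indicator can be evaluated on $u+v$. Away from the boundary $u+v=0$, the Jacobian is either the identity or the swap matrix, yet $x$ copies $u$ on one half of the samples and $v$ on the other half.

```python
import numpy as np

rng = np.random.default_rng(0)
u = rng.standard_normal(100_000)
v = rng.standard_normal(100_000)

# Eq A18: keep (u, v) when the sum is positive, swap the two otherwise.
keep = (u + v) > 0
x = np.where(keep, u, v)
y = np.where(keep, v, u)

# Local Jacobians d(x, y)/d(u, v) on the two regions: each has exactly
# one non-zero entry per row and per column.
J_keep = np.array([[1.0, 0.0], [0.0, 1.0]])
J_swap = np.array([[0.0, 1.0], [1.0, 0.0]])

# x and y are empirically uncorrelated with unit variance, so the sparsity
# pattern and the marginals alone do not reveal the mixing.
corr_xy = np.corrcoef(x, y)[0, 1]
frac_x_from_u = keep.mean()  # fraction of samples where x is a copy of u

print(round(corr_xy, 3), round(frac_x_from_u, 3), round(x.std(), 3))
```

Because roughly half of the samples of `x` come from `u` and half from `v`, no single permutation of $(u,v)$ recovers $(x,y)$ globally, which is the failure the paragraph above describes.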

Counterexamples can still be easily constructed even if the function $\mathbf{h}$ is continuous. For brevity, let us denote a piecewise-linear indicator function as $f(u,v)=\min(\max(0,u+v+0.5),1)$, and define $\mathbf{h}$ as

$$\begin{cases}x=f(u,v)\cdot u+(1-f(u,v))\cdot v\\ y=f(u,v)\cdot v+(1-f(u,v))\cdot u.\end{cases} \tag{A19}$$

When $u,v,x,y$ are independent uniform distributions on $[-2,-1]\cup[1,2]$, all conditions are still satisfied while the identifiability cannot be achieved.

To fill this gap, we provide two more assumptions. The domain $\hat{\mathcal{Z}}$ of $\hat{\mathbf{z}}$ should be path-connected, i.e., for any $\hat{\mathbf{z}}^{(1)},\hat{\mathbf{z}}^{(2)}\in\hat{\mathcal{Z}}$, there exists a continuous path connecting $\hat{\mathbf{z}}^{(1)}$ and $\hat{\mathbf{z}}^{(2)}$ with all points of the path in $\hat{\mathcal{Z}}$. In addition, the derivative of the function $\mathbf{h}$ is not zero for any value of $\hat{\mathbf{z}}\in\hat{\mathcal{Z}}$.

Lemma A1 (Disentanglement with Continuity).

For a second-order differentiable invertible function $\mathbf{h}$ defined on a path-connected domain $\hat{\mathcal{Z}}\subseteq\mathbb{R}^n$ which satisfies $\mathbf{z}=\mathbf{h}(\hat{\mathbf{z}})$, suppose the non-degeneracy condition holds. If there exists at most one non-zero entry in each row of the Jacobian matrix $\mathbf{H}=\frac{\partial\mathbf{z}}{\partial\hat{\mathbf{z}}}$, the identifiability under Permutation Invariance can be established.

Proof.

For any row $i$, $\frac{\partial\mathbf{z}_i}{\partial\hat{\mathbf{z}}}=[\frac{\partial\mathbf{z}_i}{\partial\hat{\mathbf{z}}_1},\frac{\partial\mathbf{z}_i}{\partial\hat{\mathbf{z}}_2},\ldots,\frac{\partial\mathbf{z}_i}{\partial\hat{\mathbf{z}}_n}]\in\mathbb{R}^n$ is an $n$-dimensional vector. Its image lies in the set $\bigcup_{k=1}^{n}\left\{(\frac{\partial\mathbf{z}_i}{\partial\hat{\mathbf{z}}_1},\frac{\partial\mathbf{z}_i}{\partial\hat{\mathbf{z}}_2},\ldots,\frac{\partial\mathbf{z}_i}{\partial\hat{\mathbf{z}}_n})\in\mathbb{R}^n:\frac{\partial\mathbf{z}_i}{\partial\hat{\mathbf{z}}_j}=0\text{ for all }j\neq k,\text{ and }\frac{\partial\mathbf{z}_i}{\partial\hat{\mathbf{z}}_k}\neq 0\right\}$, since there exists at most one non-zero entry in each row of the Jacobian matrix $\mathbf{H}=\frac{\partial\mathbf{z}}{\partial\hat{\mathbf{z}}}$ and, according to the non-degeneracy condition, the derivative of the function $\mathbf{h}$ is not zero for any value.

We use proof by contradiction. Suppose there exist two different samples $\mathbf{a},\mathbf{b}\in\hat{\mathcal{Z}}\subseteq\mathbb{R}^n$ with different non-zero entries $j\neq k$ such that

$$\left[\frac{\partial z_i}{\partial\hat{\mathbf{z}}}\bigg|_{\hat{\mathbf{z}}=\mathbf{a}}\right]_j\neq 0,\quad\left[\frac{\partial z_i}{\partial\hat{\mathbf{z}}}\bigg|_{\hat{\mathbf{z}}=\mathbf{b}}\right]_k\neq 0 \tag{A20}$$

where $[\cdot]_j$ refers to the $j$-th entry of a vector. Their values lie respectively within $\left\{(0,0,\ldots,\frac{\partial z_i}{\partial\hat{z}_j},0,\ldots,0)\in\mathbb{R}^n:\frac{\partial z_i}{\partial\hat{z}_j}\neq 0\right\}$ and $\left\{(0,0,\ldots,\frac{\partial z_i}{\partial\hat{z}_k},0,\ldots,0)\in\mathbb{R}^n:\frac{\partial z_i}{\partial\hat{z}_k}\neq 0\right\}$. Clearly, there is no path from $\frac{\partial z_i}{\partial\hat{\mathbf{z}}}\big|_{\hat{\mathbf{z}}=\mathbf{a}}$ to $\frac{\partial z_i}{\partial\hat{\mathbf{z}}}\big|_{\hat{\mathbf{z}}=\mathbf{b}}$ within the union of these sets, because any such path would have to pass through the origin, which is excluded. Since $\mathbf{h}$ is a second-order differentiable invertible function, its derivative $\mathbf{h}'$ is also differentiable and hence continuous. Because $\hat{\mathcal{Z}}\subseteq\mathbb{R}^n$ is a path-connected domain, the image of $\frac{\partial z_i}{\partial\hat{\mathbf{z}}}$ is also path-connected. This contradicts the fact that there is no path from $\frac{\partial z_i}{\partial\hat{\mathbf{z}}}\big|_{\hat{\mathbf{z}}=\mathbf{a}}$ to $\frac{\partial z_i}{\partial\hat{\mathbf{z}}}\big|_{\hat{\mathbf{z}}=\mathbf{b}}$, which establishes the proof. ∎

When it comes to a partially invertible function with regard to side information $\mathbf{c}$, the proof is the same with only a modification of the conditions. That is, the path-connected domain assumption is applied to $(\mathbf{z},\mathbf{c})$, and the differentiability assumption is extended to both $\mathbf{z}$ and $\mathbf{c}$, i.e., $\frac{\partial^2 z_i}{\partial a\,\partial b}$ exists for $a,b\in\{z\mid\mathbf{z}_i\}\times\{c\mid\mathbf{c}_i\}$ when $a\neq b$.

Let’s further review the example we provided earlier. Examples in Eq A18 and Eq A19 respectively demonstrate the scenarios where the assumptions of differentiability and connectivity fail, leading to the breakdown of identifiability.

Lemma A2 (Disentanglement with Continuity under Side Information).

For a second-order differentiable invertible function $\mathbf{h}$ defined on a path-connected domain $\hat{\mathcal{Z}}\times\mathcal{C}\subseteq\mathbb{R}^{n+m}$ which satisfies $\mathbf{z}=\mathbf{h}(\hat{\mathbf{z}},\mathbf{c})$, suppose the non-degeneracy condition holds. If there exists at most one non-zero entry in each row of the Jacobian matrix $\mathbf{H}(\mathbf{c})=\frac{\partial\mathbf{z}}{\partial\hat{\mathbf{z}}}$, the identifiability under Permutation Invariance can be established.

Proof.

Suppose there exist two different samples $\mathbf{a},\mathbf{b}\in\mathcal{\hat{Z}}\times\mathcal{C}\subseteq\mathbb{R}^{n+m}$ with different non-zero entries $j\neq k$ such that

$$\left[\frac{\partial z_{i}}{\partial(\hat{\mathbf{z}},\mathbf{c})}\bigg|_{(\hat{\mathbf{z}},\mathbf{c})=\mathbf{a}}\right]_{j}\neq 0,\quad\left[\frac{\partial z_{i}}{\partial(\hat{\mathbf{z}},\mathbf{c})}\bigg|_{(\hat{\mathbf{z}},\mathbf{c})=\mathbf{b}}\right]_{k}\neq 0. \tag{A21}$$

Similar to Lemma A1, there exists no path between them because they are blocked in $\mathcal{\hat{Z}}$ alone. In the same way, since $\mathbf{h}$ is a second-order differentiable invertible function and the non-degeneracy condition holds, the image of $\frac{\partial z_{i}}{\partial(\mathbf{\hat{z}},\mathbf{c})}$ is also path-connected. This yields a contradiction, and the proof is established.

A1.6 Identifiability Benefits from Non-Stationarity

We can further leverage the advantage of non-stationary data for identifiability. We rewrite $\mathbf{v}_{lt}$, which is defined in Eq A3, as $\mathbf{s}_{lt}(u_{r})$ in the $u_{r}$ context as

$$\mathbf{s}_{lt}(u_{r})\triangleq\Big(\frac{\partial^{2}\eta_{1t}(u_{r})}{\partial z_{1t}\partial z_{l,t-r-1}},\ldots,\frac{\partial^{2}\eta_{nt}(u_{r})}{\partial z_{nt}\partial z_{l,t-r-1}},\frac{\partial^{3}\eta_{1t}(u_{r})}{\partial z_{1t}^{2}\partial z_{l,t-r-1}},\ldots,\frac{\partial^{3}\eta_{nt}(u_{r})}{\partial z_{nt}^{2}\partial z_{l,t-r-1}}\Big)^{\intercal}. \tag{A22}$$

We also consider the subtracted version $\mathring{\mathbf{s}}_{t}(u_{r})$, obtained by subtracting the terms at $u_{0}$ from those at $u_{r}$ without taking the derivative with respect to $z_{l,t-r-1}$, as

$$\begin{aligned}\mathring{\mathbf{s}}_{t}(u_{r})\triangleq\Big(&\frac{\partial\eta_{1t}(u_{r})}{\partial z_{1t}}-\frac{\partial\eta_{1t}(u_{0})}{\partial z_{1t}},\ldots,\frac{\partial\eta_{nt}(u_{r})}{\partial z_{nt}}-\frac{\partial\eta_{nt}(u_{0})}{\partial z_{nt}},\\ &\frac{\partial^{2}\eta_{1t}(u_{r})}{\partial z_{1t}^{2}}-\frac{\partial^{2}\eta_{1t}(u_{0})}{\partial z_{1t}^{2}},\ldots,\frac{\partial^{2}\eta_{nt}(u_{r})}{\partial z_{nt}^{2}}-\frac{\partial^{2}\eta_{nt}(u_{0})}{\partial z_{nt}^{2}}\Big)^{\intercal}.\end{aligned} \tag{A23}$$

As provided below, in our case, the identifiability of $\mathbf{z}_{t}$ is guaranteed by the linear independence of the whole function vectors $\mathbf{s}_{lt}(u_{r})$ and $\mathring{\mathbf{s}}_{t}(u_{r})$, with $l=1,2,\ldots,n$ and every $u_{r}$. This linear independence is generally a much stronger condition. Theorem A1 can be considered as a special case where the number of domains $u_{r}$ is $1$. In this case, only $\mathbf{s}_{lt}(u_{0})$ in Eq A22 is utilized, but $2n$ values of $z_{l,t-r-1}$ are required. Otherwise, in the non-stationary case, the domain information $u_{r}$ can increase the variability of $\mathbf{s}_{lt}(u_{r})$. Besides, $\mathring{\mathbf{s}}_{t}(u_{r})$ in Eq A23 can also help to find more independent vectors to satisfy the sufficiency assumption.
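To make the sufficiency condition concrete, the sketch below evaluates $\mathbf{s}_{lt}$ for a toy one-dimensional transition ($n=1$, a single domain) at $2n=2$ values of the lagged latent and tests linear independence through a $2\times 2$ determinant. The two Gaussian transitions, the hand-derived derivatives, and all function names are illustrative assumptions and are not taken from the paper's experiments.

```python
import math


def s_heteroscedastic(z, w):
    """s_{lt} for the toy log-density eta(z, w) = -0.5*exp(w)*(z - w)**2 + 0.5*w.

    This is a Gaussian transition whose mean and precision both depend on the
    lagged latent w (an assumption made for illustration).
    Returns (d^2 eta / dz dw, d^3 eta / dz^2 dw), derived by hand.
    """
    return (math.exp(w) * (1.0 - z + w), -math.exp(w))


def s_homoscedastic(z, w):
    """s_{lt} for eta(z, w) = -0.5*(z - w)**2, a constant-variance Gaussian.

    Here d^2 eta / dz dw = 1 and d^3 eta / dz^2 dw = 0 for every (z, w).
    """
    return (1.0, 0.0)


def det2(u, v):
    """Determinant of the 2x2 matrix with rows u and v (zero iff collinear)."""
    return u[0] * v[1] - u[1] * v[0]


z = 0.3             # fixed value of the current latent z_t
w1, w2 = -1.0, 0.5  # two values of the lagged latent z_{t-1}

det_hetero = det2(s_heteroscedastic(z, w1), s_heteroscedastic(z, w2))
det_homo = det2(s_homoscedastic(z, w1), s_homoscedastic(z, w2))

print("heteroscedastic vectors independent:", det_hetero != 0.0)
print("homoscedastic vectors independent:", det_homo != 0.0)
```

In this toy case the constant-variance Gaussian always produces collinear vectors, so two values of the lagged latent cannot satisfy the condition, whereas letting the variance depend on the past latent gives two linearly independent vectors.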

Corollary A1 (Identifiability under Non-Stationary Process).

Suppose $\mathbf{x}_{t}=\mathbf{g}(\mathbf{z}_{t:t-r})$, $\mathbf{z}_{t}=\mathbf{m}(\mathbf{x}_{t:t-\mu})$, and that the conditional distribution $p(z_{kt}\,|\,\mathbf{z}_{t-1:t-r-1},\mathbf{u})$ may change across $a+1$ values of the auxiliary variable $\mathbf{u}$, denoted by $u_{0}$, $u_{1}$, …, $u_{a}$. Suppose the components of $\mathbf{z}_{t}$ are mutually independent conditional on $\mathbf{z}_{t-1:t-r-1}$ with each auxiliary variable. Assume that the components of $\hat{\mathbf{z}}_{t}$ are also mutually independent conditional on $\hat{\mathbf{z}}_{t-1:t-r-1}$. Suppose the domain is path-connected, $\mathbf{m},\mathbf{\hat{m}},\mathbf{g},\mathbf{\hat{g}}$ are second-order differentiable, and their combination satisfies the non-degeneracy condition. If there exist $2n$ different values of the function vectors $\mathbf{s}_{lt}(u_{r})$ or $\mathring{\mathbf{s}}_{t}(u_{r})$, with $l=1,2,\ldots,n$ and every $u_{r}$, that are linearly independent, then $\hat{\mathbf{z}}_{t}$ is a permuted invertible component-wise transformation of $\mathbf{z}_{t}$.

Proof.

For any $t$ we have

$$\begin{aligned}\mathbf{z}_{t}&=\mathbf{m}(\mathbf{x}_{t:t-\mu})\\ &=\mathbf{m}(\mathbf{\hat{g}}(\mathbf{\hat{z}}_{t},\mathbf{\hat{z}}_{t-1:t-r}),\mathbf{x}_{t-1:t-\mu})\\ &=\mathbf{m}(\mathbf{\hat{g}}(\mathbf{\hat{z}}_{t},\mathbf{\hat{m}}(\mathbf{x}_{t-1:t-\mu-1}),\cdots,\mathbf{\hat{m}}(\mathbf{x}_{t-r:t-\mu-r})),\mathbf{x}_{t-1:t-\mu}),\end{aligned} \tag{A24}$$

as well as $\mathbf{\hat{z}}_{t}=\mathbf{\hat{m}}(\mathbf{g}(\mathbf{z}_{t},\mathbf{m}(\mathbf{x}_{t-1:t-\mu-1}),\cdots,\mathbf{m}(\mathbf{x}_{t-r:t-\mu-r})),\mathbf{x}_{t-1:t-\mu})$ similarly. Thus, we have a unified partially invertible function $\mathbf{z}_{t}=\mathbf{h}(\mathbf{\hat{z}}_{t}|\mathbf{x}_{t-1:t-\mu-r})$, where $\mathbf{h}=\mathbf{m}\circ\hat{\mathbf{g}}$, with Jacobian $\frac{\partial\mathbf{z}_{t}}{\partial\mathbf{\hat{z}}_{t}}=\mathbf{H}_{t}(\mathbf{\hat{z}}_{t};\mathbf{x}_{t-1:t-\mu-r})$. Let us consider the mapping from the joint distribution of $(\mathbf{\hat{z}}_{t},\mathbf{x}_{t-1:t-\mu-r-1})$ to that of $(\mathbf{z}_{t},\mathbf{x}_{t-1:t-\mu-r-1})$, i.e.,

$$P(\mathbf{z}_{t},\mathbf{x}_{t-1:t-\mu-r-1})=P(\mathbf{\hat{z}}_{t},\mathbf{x}_{t-1:t-\mu-r-1})\,/\,|\mathbf{J}_{t}|, \tag{A25}$$

where

$$\mathbf{J}_{t}=\begin{bmatrix}\frac{\partial\mathbf{z}_{t}}{\partial\mathbf{\hat{z}}_{t}}&\mathbf{0}\\ *&\mathbf{I}\end{bmatrix}, \tag{A26}$$

which is a lower block-triangular matrix, where $\mathbf{I}$ denotes the identity matrix and $*$ denotes an arbitrary matrix. Thus, we have the determinant $|\mathbf{J}_{t}|=|\frac{\partial\mathbf{z}_{t}}{\partial\mathbf{\hat{z}}_{t}}|=|\mathbf{H}_{t}|$. Dividing both sides of Eq A25 by $P(\mathbf{x}_{t-1:t-\mu-r-1},u_{r})$ gives

$$\textbf{LHS}=P(\mathbf{z}_{t}|\mathbf{x}_{t-1:t-\mu-r-1},u_{r})=P(\mathbf{z}_{t}|\mathbf{z}_{t-1:t-r-1},u_{r}), \tag{A27}$$

since $\mathbf{z}_{t}$ and $\mathbf{x}_{t-1:t-\mu-r-1}$ are independent conditioned on $\mathbf{z}_{t-1:t-r-1}$ with any auxiliary variable $u_{r}$. Similarly, $\textbf{RHS}=P(\mathbf{\hat{z}}_{t}|\mathbf{x}_{t-1:t-\mu-r-1},u_{r})=P(\mathbf{\hat{z}}_{t}|\mathbf{\hat{z}}_{t-1:t-r-1},u_{r})$ holds true as well, which yields

$$P(\mathbf{z}_{t}|\mathbf{z}_{t-1:t-r-1},u_{r})=P(\mathbf{\hat{z}}_{t}|\mathbf{\hat{z}}_{t-1:t-r-1},u_{r})\,/\,|\mathbf{H}_{t}|. \tag{A28}$$
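The determinant identity $|\mathbf{J}_{t}|=|\mathbf{H}_{t}|$ that carries Eq A25 into Eq A28 can be checked on a small instance. The sketch below assembles a matrix with the block structure of Eq A26 and compares determinants using exact rational arithmetic; the helper names `det` and `block_lower_triangular` and the numeric entries are made up for illustration.

```python
from fractions import Fraction


def det(matrix):
    """Determinant via Gaussian elimination with exact rational arithmetic."""
    a = [[Fraction(x) for x in row] for row in matrix]
    n = len(a)
    sign = 1
    result = Fraction(1)
    for col in range(n):
        # Find a row at or below the diagonal with a non-zero pivot.
        pivot = next((r for r in range(col, n) if a[r][col] != 0), None)
        if pivot is None:
            return Fraction(0)
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            sign = -sign  # a row swap flips the sign of the determinant
        result *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return sign * result


def block_lower_triangular(H, star):
    """Assemble [[H, 0], [star, I]], the block structure of Eq A26."""
    n, m = len(H), len(star)
    top = [list(H[i]) + [0] * m for i in range(n)]
    bottom = [
        list(star[i]) + [1 if j == i else 0 for j in range(m)]
        for i in range(m)
    ]
    return top + bottom


H = [[2, 1], [0, 3]]                # stands in for the Jacobian H_t
star = [[5, -7], [4, 9], [1, 1]]    # arbitrary lower-left block
J = block_lower_triangular(H, star)

print(det(J) == det(H))
```

Whatever is placed in the lower-left block, the determinant of the assembled matrix equals that of the upper-left block, which is why the conditioning variables drop out of the change-of-variables factor.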

With conditional independence, we have

$$\frac{\partial^{2}\log p(\hat{\mathbf{z}}_{t}\,|\,\hat{\mathbf{z}}_{t-1:t-r-1},u_{r})}{\partial\hat{z}_{it}\partial\hat{z}_{jt}}=0. \tag{A29}$$

Using Eq A28, the log-density can be expressed as

$$\log p(\hat{\mathbf{z}}_{t}\,|\,\hat{\mathbf{z}}_{t-1:t-r-1},u_{r})=\log p({\mathbf{z}}_{t}\,|\,{\mathbf{z}}_{t-1:t-r-1},u_{r})+\log|\mathbf{H}_{t}|=\sum_{k=1}^{n}\eta_{kt}(u_{r})+\log|\mathbf{H}_{t}|. \tag{A30}$$

The second-order derivative is

$$\sum_{k=1}^{n}\Big(\frac{\partial^{2}\eta_{kt}(u_{r})}{\partial z_{kt}^{2}}\cdot\mathbf{H}_{kit}\mathbf{H}_{kjt}+\frac{\partial\eta_{kt}(u_{r})}{\partial z_{kt}}\cdot\frac{\partial\mathbf{H}_{kit}}{\partial\hat{z}_{jt}}\Big)+\frac{\partial^{2}\log|\mathbf{H}_{t}|}{\partial\hat{z}_{it}\partial\hat{z}_{jt}}\equiv 0. \tag{A31}$$
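The step from Eq A30 to Eq A31 is an application of the chain rule. As a sketch, assuming (as the sum in Eq A30 suggests) that each $\eta_{kt}(u_{r})$ depends on $\hat{\mathbf{z}}_{t}$ only through $z_{kt}$, and treating the conditioning history and $u_{r}$ as fixed, the per-component derivatives read:

```latex
\begin{aligned}
\frac{\partial \eta_{kt}(u_r)}{\partial \hat{z}_{it}}
  &= \frac{\partial \eta_{kt}(u_r)}{\partial z_{kt}} \cdot \mathbf{H}_{kit},
  \qquad \mathbf{H}_{kit} = \frac{\partial z_{kt}}{\partial \hat{z}_{it}}, \\
\frac{\partial^{2} \eta_{kt}(u_r)}{\partial \hat{z}_{it}\,\partial \hat{z}_{jt}}
  &= \frac{\partial^{2} \eta_{kt}(u_r)}{\partial z_{kt}^{2}} \cdot \mathbf{H}_{kit}\mathbf{H}_{kjt}
   + \frac{\partial \eta_{kt}(u_r)}{\partial z_{kt}} \cdot \frac{\partial \mathbf{H}_{kit}}{\partial \hat{z}_{jt}}.
\end{aligned}
```

Summing over $k$ and adding the second derivative of the $\log|\mathbf{H}_{t}|$ term from Eq A30 gives the left-hand side of Eq A31, which vanishes for $i\neq j$ by Eq A29.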

The right-hand side of Eq A31 is identically 0. Therefore, for each index $l$ ranging from 1 to $n$, and every associated value of $z_{l,t-r-1}$, its partial derivative with respect to $z_{l,t-r-1}$ remains 0. That is,

$$\sum_{k=1}^{n}\Big(\frac{\partial^{3}\eta_{kt}(u_{r})}{\partial z_{kt}^{2}\partial z_{l,t-r-1}}\cdot\mathbf{H}_{kit}\mathbf{H}_{kjt}+\frac{\partial^{2}\eta_{kt}(u_{r})}{\partial z_{kt}\partial z_{l,t-r-1}}\cdot\frac{\partial\mathbf{H}_{kit}}{\partial\hat{z}_{jt}}\Big)\equiv 0, \tag{A32}$$

where we leveraged the fact that the entries of $\mathbf{H}_{t}$ do not depend on $z_{l,t-r-1}$.

We again start from Eq A31. Using the fact that $\mathbf{H}_{t}$ is not affected by the auxiliary variable, we can subtract the equation with $u_{0}$ from that with $u_{r}$. We have

$$0=\sum_{k=1}^{n}\Big(\Big(\frac{\partial^{2}\eta_{kt}(u_{r})}{\partial z_{kt}^{2}}-\frac{\partial^{2}\eta_{kt}(u_{0})}{\partial z_{kt}^{2}}\Big)\cdot\mathbf{H}_{kit}\mathbf{H}_{kjt}+\Big(\frac{\partial\eta_{kt}(u_{r})}{\partial z_{kt}}-\frac{\partial\eta_{kt}(u_{0})}{\partial z_{kt}}\Big)\cdot\frac{\partial\mathbf{H}_{kit}}{\partial\hat{z}_{jt}}\Big). \tag{A33}$$

Considering any given value of $\mathbf{z}_{t}$, there exist at least $2n$ different values of $\mathbf{s}_{lt}$ or $\mathring{\mathbf{s}}_{t}$, corresponding to Eq A32 and Eq A33 respectively, such that they are linearly independent. To make the above equation hold true, one has to set $\mathbf{H}_{kit}\mathbf{H}_{kjt}=0$ for $i\neq j$. In other words, each row of $\mathbf{H}_{t}$ consists of at most a single non-zero entry, and $\mathbf{z}_{t}$ must be a component-wise transformation of a permuted version of $\hat{\mathbf{z}}_{t}$. ∎

Appendix A2 Synthetic experiments

A2.1 Synthetic Dataset Generation

In this section, we give two representative simulation settings, for NG and NG-TDMP respectively, to reveal the identifiability results. For each synthetic dataset, we set the latent dimension to $3$, i.e., $\mathbf{z}_{t}\in\mathcal{Z}\subseteq\mathbb{R}^{3}$.

Non-invertible Generation

For NG, we set the transition lag as $\tau=1$. We first generate $10{,}000$ data points from a uniform distribution as the initial state $\mathbf{z}_{0}\sim U(0,1)$. For $t=1,\cdots,9$, each latent variable $\mathbf{z}_{t}$ is generated from the preceding latent variable $\mathbf{z}_{t-1}$ through a nonlinear function $\mathbf{f}$ with non-additive, zero-mean Gaussian noise $\mathbf{\epsilon}_{t}$ ($\sigma=0.1$), i.e., $\mathbf{z}_{t}=\mathbf{f}(\mathbf{z}_{t-1},\epsilon_{t})$. To introduce non-invertibility, the mixing function $\mathbf{g}$ uses only the first two entries of the latent variables to generate the 2-d observation $\mathbf{x}_{t}=\mathbf{g}(z_{1,t},z_{2,t})\in\mathcal{X}\subseteq\mathbb{R}^{2}$.
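The NG generation steps above can be sketched in NumPy as follows. This is only an illustration under stated assumptions: the text does not specify $\mathbf{f}$ or $\mathbf{g}$, so the random weights `W_f` and `W_g`, the leaky-ReLU transition with multiplicative noise, and the tanh mixing are all hypothetical placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T, Z_DIM, X_DIM = 10_000, 10, 3, 2   # sequences, steps t = 0..9, latent dim, observed dim

# Hypothetical weights: the text does not specify f or g.
W_f = rng.normal(size=(Z_DIM, Z_DIM)) / np.sqrt(Z_DIM)
W_g = rng.normal(size=(2, X_DIM))

def leaky_relu(a, slope=0.2):
    return np.where(a > 0, a, slope * a)

def transition(z_prev, eps):
    # The noise rescales the state instead of being added to it (non-additive).
    return leaky_relu(z_prev @ W_f) * np.exp(eps)

def mixing(z):
    # Only the first two latent entries are used, so the third one is lost.
    return np.tanh(z[..., :2] @ W_g)

z = np.empty((N, T, Z_DIM))
z[:, 0] = rng.uniform(0.0, 1.0, size=(N, Z_DIM))    # z_0 ~ U(0, 1)
for t in range(1, T):
    eps = rng.normal(0.0, 0.1, size=(N, Z_DIM))      # zero-mean Gaussian, sigma = 0.1
    z[:, t] = transition(z[:, t - 1], eps)
x = mixing(z)                                        # shape (N, T, 2)
```

Because `mixing` never reads the third latent entry, two latent sequences that differ only in that entry produce identical observations, which is the information loss the NG setting is meant to exhibit.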

Time-Delayed Mixing Process

For NG-TDMP, we set the transition lag as $\tau=1$ and the mixing lag as $r=2$. Similar to the Non-invertible Generation scenario, we generate the initial states from a uniform distribution and the subsequent latent variables following a nonlinear transition function. The noise is also introduced in a nonlinear Gaussian ($\sigma=0.1$) way. The mixing process is a nonlinear function of $\mathbf{z}_{t}$ plus side information from the previous steps $\mathbf{z}_{t-1:t-2}$, i.e.,

$$\mathbf{x}_{t}=A_{3\times 3}\cdot\sigma\big(B_{3\times 3}\cdot\sigma(C_{3\times 3}\cdot\mathbf{z}_{t})\big)+\begin{bmatrix}0\\ 0\\ D_{3\times 1}\mathbf{z}_{t-1}+E_{3\times 1}\mathbf{z}_{t-2}\end{bmatrix}, \tag{A34}$$

where $\sigma$ refers to the ReLU function and the capital letters refer to matrices. Note that we make two modifications to show the advantage of CaRiNG. First, we consider a larger mixing lag because it is a much more difficult scenario to handle, with more influence from the mixing process and less dynamic information from the transition. We run experiments in both scenarios with different transition and mixing lags. Second, we find that omitting the time-lagged latent variables from the decoder leads to a smaller model that is more stable and easier to train. Refer to Table A1 for a detailed ablation study.

| Setting | $\tau=1, r=2$ | $\tau=2, r=1$ |
|---|---|---|
| CaRiNG | 0.9436 | 0.9131 |
| CaRiNG (lagged decoder) | 0.9250 | 0.9220 |
| TDRL | 0.8947 | 0.7519 |

Table A1: Ablation study on different settings for NG-TDMP. (a) The second column is a more difficult scenario compared to the first, where the performance of CaRiNG remains good while that of the baseline decreases significantly. (b) Omitting the time-lagged latent variables in the decoder does not damage the performance much, while yielding a much simpler model.
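The time-delayed mixing of Eq A34 can be sketched as below. The weights are random placeholders, and `D` and `E` are treated as length-3 vectors that map a 3-d latent to a single scalar so that the side term fits in the third output entry; both choices are assumptions made for illustration, since the source gives only the structure of the equation.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(a):
    return np.maximum(a, 0.0)

# Hypothetical random weights; Eq A34 fixes only the shapes and the structure.
A, B, C = (rng.normal(size=(3, 3)) for _ in range(3))
D, E = rng.normal(size=3), rng.normal(size=3)   # each maps a 3-d latent to one scalar

def mix_tdmp(z_t, z_tm1, z_tm2):
    """x_t = A relu(B relu(C z_t)) + [0, 0, D.z_{t-1} + E.z_{t-2}]."""
    instantaneous = A @ relu(B @ relu(C @ z_t))
    side = np.array([0.0, 0.0, D @ z_tm1 + E @ z_tm2])
    return instantaneous + side
```

Only the third observed coordinate carries information from $\mathbf{z}_{t-1}$ and $\mathbf{z}_{t-2}$; the first two depend on $\mathbf{z}_{t}$ alone.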

Post-processing Procedure

During the generating process, we did not explicitly enforce the data to meet the constraint $\mathbf{z}_{t}=\mathbf{m}(\mathbf{x}_{t:t-\mu})$. Instead, we implement a checker to filter the data that is qualified. To be more precise, we run a linear regression from $\mathbf{x}_{t:t-\mu}$ to $\mathbf{z}_{t}$ to estimate how much information about the latent variables can be recovered from the observation series in the best case. We choose the smallest $\mu$ for which the amount of recoverable information is acceptable. We set $\mu=2$ for NG and $\mu=4$ for NG-TDMP.
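A minimal sketch of such a checker is given below, assuming a single sequence, an ordinary least-squares fit with intercept, and $R^2$ as the measure of recoverable information. The function names and the threshold of 0.95 are hypothetical; the source does not state what counts as "acceptable".

```python
import numpy as np

def stack_context(x, mu):
    """x: (T, x_dim) -> (T - mu, (mu + 1) * x_dim); the row for time t holds [x_t, ..., x_{t-mu}]."""
    T = x.shape[0]
    return np.concatenate([x[mu - k : T - k] for k in range(mu + 1)], axis=1)

def recoverability_r2(x, z, mu):
    """R^2 of the least-squares fit from x_{t:t-mu} (plus an intercept) to z_t."""
    feats = stack_context(x, mu)
    feats = np.hstack([feats, np.ones((feats.shape[0], 1))])
    target = z[mu:]
    coef, *_ = np.linalg.lstsq(feats, target, rcond=None)
    ss_res = ((target - feats @ coef) ** 2).sum()
    ss_tot = ((target - target.mean(axis=0)) ** 2).sum()
    return 1.0 - ss_res / ss_tot

def smallest_mu(x, z, max_mu, threshold=0.95):
    """Smallest context length whose fit reaches the (assumed) acceptance threshold."""
    for mu in range(max_mu + 1):
        if recoverability_r2(x, z, mu) >= threshold:
            return mu
    return None
```

For a toy process where the second latent entry equals the previous observation, `smallest_mu` returns 1: the current observation alone is insufficient, but one step of context recovers the latent exactly.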

A2.2 Implementation Details

Network Architecture

To implement the Sequence-to-Step encoder, we leverage torch.unfold to generate the nesting observations. Let us denote $\mathbf{x}_{t}^{(\mu)}=[\mathbf{x}_{t},\cdots,\mathbf{x}_{t-\mu}]$ as inputs. For the time steps that do not exist, we simply pad them with zeros. Refer to Table A2 for the detailed network architecture.
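The construction of the nested inputs $\mathbf{x}_{t}^{(\mu)}$ with zero padding can be illustrated as follows. This sketch uses NumPy's `sliding_window_view` as a stand-in for the unfold operation mentioned above, so it is an analogue rather than the authors' implementation; the function name `nested_observations` and the flattening order are assumptions.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def nested_observations(x, mu):
    """x: (BS, T, x_dim) -> (BS, T, (mu + 1) * x_dim), ordered [x_t, x_{t-1}, ..., x_{t-mu}]."""
    bs, T, d = x.shape
    # Prepend mu zero steps so that every t has a full window.
    padded = np.concatenate([np.zeros((bs, mu, d), dtype=x.dtype), x], axis=1)
    win = sliding_window_view(padded, mu + 1, axis=1)   # (BS, T, x_dim, mu + 1), oldest step first
    win = win[..., ::-1]                                # newest step first
    return np.moveaxis(win, -1, 2).reshape(bs, T, (mu + 1) * d)
```

Under this reading, the encoder input dimension i_dim in Table A2 would be $(\mu+1)$ times the observation dimension.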

Training Details

The models were implemented in PyTorch 1.11.0. An AdamW optimizer is used for training this network. We set the learning rate as $0.001$ and the mini-batch size as $64$. We train each model under four random seeds ($770, 771, 772, 773$) and report the overall performance with mean and standard deviation across different random seeds.

Table A2: Architecture details. BS: batch size, T: length of time series, i_dim: input dimension, o_dim: output dimension, z_dim: latent dimension, LeakyReLU: Leaky Rectified Linear Unit.

| Configuration | Description | Output |
|---|---|---|
| 1. Sequence-to-Step Encoder | Encoder for Synthetic Data | |
| Input: $\mathbf{x}^{(\mu)}_{1:T}$ | Observed time series | BS × T × i_dim |
| Dense | 128 neurons, LeakyReLU | BS × T × 128 |
| Dense | 128 neurons, LeakyReLU | BS × T × 128 |
| Dense | 128 neurons, LeakyReLU | BS × T × 128 |
| Dense | Temporal embeddings | BS × T × z_dim |
| 2. Step-to-Step Decoder | Decoder for Synthetic Data | |
| Input: $\hat{\mathbf{z}}_{1:T}$ | Sampled latent variables | BS × T × z_dim |
| Dense | 128 neurons, LeakyReLU | BS × T × 128 |
| Dense | 128 neurons, LeakyReLU | BS × T × 128 |
| Dense | i_dim neurons, reconstructed $\hat{\mathbf{x}}_{1:T}$ | BS × T × o_dim |
| 3. Factorized Inference Network | Bidirectional Inference Network | |
| Input | Sequential embeddings | BS × T × z_dim |
| Bottleneck | Compute mean and variance of posterior | $\mathbf{\mu}_{1:T},\mathbf{\sigma}_{1:T}$ |
| Reparameterization | Sequential sampling | $\hat{\mathbf{z}}_{1:T}$ |
| 4. Modular Prior | Nonlinear Transition Prior Network | |
| Input | Sampled latent variable sequence $\hat{\mathbf{z}}_{1:T}$ | BS × T × z_dim |
| InverseTransition | Compute estimated residuals $\hat{\epsilon}_{it}$ | BS × T × z_dim |
| JacobianCompute | Compute $\log\left(\lvert\det\left(\mathbf{J}\right)\rvert\right)$ | BS |

Table A3: MCC scores of synthetic datasets with higher dimension.

| Dimension | CaRiNG | TDRL |
|---|---|---|
| 6 | 0.9199 | 0.6329 |
| 12 | 0.9366 | 0.6155 |
| 18 | 0.7175 | 0.5265 |

A2.3 Exploration on higher dimension

To demonstrate the scalability of our method, we have included experiments with higher dimensions. We keep the experimental setup consistent with NG and set the dimensions of the latent variables to $6, 12, 18$ and those of the observations to $4, 8, 12$, respectively. The transition function is a permutation function with a shift of $2, 4, 6$ dimensions, respectively. As shown in Table A3, CaRiNG achieves a consistent improvement over the baseline TDRL across these dimensions. When the dimension becomes too high (18), the performance of both CaRiNG and TDRL drops because of the complexity of the model, but CaRiNG still benefits from contextual information. This suggests that CaRiNG scales to higher-dimensional latent variables and remains reasonably robust to their dimensionality.
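For readers unfamiliar with the MCC scores reported in Table A3, the sketch below shows one common way such a score is computed: the mean absolute Pearson correlation between true and estimated latents under the best one-to-one matching. This is an assumed, simplified variant; the authors' evaluation code may differ (for example, by using Spearman correlation or the Hungarian algorithm), and the brute-force search over permutations used here is only practical for small dimensions.

```python
import itertools
import numpy as np

def mcc(z_true, z_est):
    """z_true, z_est: (num_samples, dim). Mean absolute correlation under the best matching."""
    d = z_true.shape[1]
    # Cross-correlation block: true component i versus estimated component j.
    corr = np.abs(np.corrcoef(z_true.T, z_est.T)[:d, d:])
    return max(
        corr[np.arange(d), list(perm)].mean()
        for perm in itertools.permutations(range(d))
    )
```

An estimate that is a permuted, rescaled copy of the true latents scores close to 1, whereas an estimate that mixes components scores lower.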

A2.4 Model Selection with Varying $\mu$

In this subsection, we discuss the preliminary experiment that was instrumental in the model selection process for our application in the NG-TDMP settings. The experiment focused on evaluating the performance of the model with varying lengths of time lag $\mu$.

Our findings indicate that an increase in $\mu$ does not always correlate with enhanced model performance. We observed that the effectiveness of each latent variable diminishes as the time lag $\mu$ increases. In practical applications, this motivates a strategy of model selection where an appropriate value of $\mu$ is chosen based on the model’s performance. The following table summarizes our experimental results:

| $\mu$ | 3 | 4 | 5 |
|---|---|---|---|
| Accuracy (%) | 0.88 | 0.92 | 0.92 |

Table A4: Impact of varying $\mu$ on model performance in NG-TDMP settings.

These results suggest that while a larger $\mu$ might imply a more extensive recovery of context information, it can also introduce inefficiencies in information recovery, potentially adding noise and impeding model training.
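One simple selection rule consistent with this discussion is to take the smallest $\mu$ whose score matches the best observed score. The rule and the `tol` parameter are assumptions made for illustration; the scores are copied from Table A4.

```python
scores = {3: 0.88, 4: 0.92, 5: 0.92}   # mu -> score, values copied from Table A4

def select_mu(scores, tol=0.0):
    """Smallest mu whose score is within `tol` of the best score (assumed rule)."""
    best = max(scores.values())
    return min(mu for mu, s in scores.items() if s >= best - tol)

print(select_mu(scores))   # prints 4
```

With these values the rule picks $\mu=4$, since $\mu=5$ adds context without improving the score.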

Appendix A3 Real-world Experiments on TrafficQA

A3.1 Implementation Details

We choose HCRN (Le et al., 2020) (without the classification head) as the encoder backbone of CaRiNG on the real-world dataset SUTD-TrafficQA. Given that HCRN is an encoder that calculates the cross attention between visual input and text input sequentially, we apply a decoder, which shares the same structure as the Step-to-Step Decoder shown in Table A2, to reconstruct the visual feature embedded with the temporal information. For the transition prior, we use the Modular Prior shown in Table A2. This encoder-decoder structure can guide the model to learn the hidden representation with identifiability guarantees under the non-invertible generation process.
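The Modular Prior in Table A2 computes estimated residuals $\hat{\epsilon}_{it}$ and a log-determinant term. A hedged reading of how these two quantities combine, following the standard change-of-variables rule, is sketched below. The symbol $r_i$ for the learned inverse transition of the $i$-th component is notation assumed here for illustration; the source names only the InverseTransition and JacobianCompute modules.

```latex
\hat{\epsilon}_{it} = r_i\big(\hat{z}_{it}, \hat{\mathbf{z}}_{t-\tau:t-1}\big), \qquad
\log p\big(\hat{\mathbf{z}}_t \mid \hat{\mathbf{z}}_{t-\tau:t-1}\big)
  = \sum_{i=1}^{n} \log p\big(\hat{\epsilon}_{it}\big) + \log\lvert\det(\mathbf{J})\rvert, \qquad
\mathbf{J} = \frac{\partial \hat{\boldsymbol{\epsilon}}_t}{\partial \hat{\mathbf{z}}_t}.
```

Under the assumption that each $r_i$ takes only $\hat{z}_{it}$ from the current step, $\mathbf{J}$ is diagonal and the log-determinant reduces to $\sum_{i}\log\lvert\partial r_i/\partial\hat{z}_{it}\rvert$.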

A3.2 More Qualitative Results

As shown in Figure A1, we provide some positive examples and also failure cases to analyze our model. From the top two examples, we can see that our method handles occlusions well. From the bottom right one, we see that our model can also handle blurred frames. However, when the alignment between the visual and textual domains is difficult, the model may fail.

Figure A1: Qualitative results on the SUTD-TrafficQA dataset. We provide some positive examples and also failure cases to analyze our model.

A3.3 Computation Cost Comparison

We compare the computational cost of CaRiNG with that of HCRN to analyze its efficiency. Table A5 gives a detailed comparison of the number of parameters, training time, and inference efficiency. Note that while CaRiNG requires a longer training time because a normalizing flow is applied to calculate the Jacobian matrix, its inference efficiency remains on par with HCRN, as the normalizing flow is used only for calculating the KL loss and not during inference.

| Method | HCRN | CaRiNG |
|---|---|---|
| Number of Parameters | 42,278,786 | 43,721,954 |
| Training Time per Epoch | 6 min 54 s | 13 min 26 s |
| Inference Time per Epoch | 49 s | 49 s |

Table A5: Comparative Analysis of HCRN and CaRiNG Models

This analysis shows that CaRiNG roughly doubles the training time per epoch (13 min 26 s versus 6 min 54 s) while leaving the inference time unchanged at 49 s per epoch, which supports its practical applicability in scenarios where inference time is critical.
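The relative overheads implied by Table A5 can be worked out directly. The snippet below only restates the table's numbers; the variable names are illustrative.

```python
def to_seconds(minutes, seconds):
    return 60 * minutes + seconds

# Values copied from Table A5.
hcrn = {"params": 42_278_786, "train_s": to_seconds(6, 54), "infer_s": 49}
caring = {"params": 43_721_954, "train_s": to_seconds(13, 26), "infer_s": 49}

param_overhead = caring["params"] / hcrn["params"] - 1.0   # relative increase in parameters
train_ratio = caring["train_s"] / hcrn["train_s"]          # training-time multiplier
infer_ratio = caring["infer_s"] / hcrn["infer_s"]          # inference-time multiplier

print(f"parameters: +{100 * param_overhead:.2f}%")   # prints parameters: +3.41%
print(f"training time: x{train_ratio:.2f}")          # prints training time: x1.95
print(f"inference time: x{infer_ratio:.2f}")         # prints inference time: x1.00
```

In other words, CaRiNG adds about 1.4 million parameters (roughly 3.4%) and about 1.95 times the training time per epoch, with no change in inference time.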

A3.4 Evaluation of Identifiability in the QA Benchmark

In the context of real-world applications, particularly in scenarios lacking ground truth for rigorous metrics like MCC, alternative evaluation strategies become essential. We leverage proxy metrics to assess the performance of the proposed algorithm, focusing on two pivotal aspects: disentanglement and reconstruction ability of the learned representations. Intuitively, as delineated in Theorem A1 and detailed in Section 4, a representation can be considered identifiable if it possesses the dual capability of fully reconstructing the observation while also achieving disentanglement. Thus, as a supplement to the accuracy we used before, we benchmark disentanglement and reconstruction ability as side evidence to support that the improvement is caused by better identifiability.

We use the ELBO loss as a proxy metric to evaluate the identifiability. Figure A2 illustrates our method’s performance compared to the baseline TDRL method. The results clearly show that our approach exhibits superior disentanglement and reconstruction abilities. This evidence suggests that the advantage of our proposed algorithm is primarily attributed to its enhanced identifiability and effective disentanglement of data representations.

Figure A2: Comparative analysis of disentanglement and reconstruction abilities of different methods.

A3.5 Parameter analysis on $\tau$

In this section, we present the results of our parameter analysis conducted on the SUTD-TrafficQA dataset, focusing on the impact of varying the time lag $\tau$. The study aimed to assess the robustness of our model to changes in the time lag parameter. As the table below illustrates, the model demonstrates consistent accuracy across different values of $\tau$, indicating robustness to the variation in time lag.

| $\tau$ | 1 | 2 | 3 |
|---|---|---|---|
| Accuracy (%) | 41.22 | 41.23 | 41.27 |

Table A6: Parameter analysis results of $\tau$ on model accuracy in the SUTD-TrafficQA dataset.

Appendix A4 Real-world Experiments on the Volleyball Dataset

A4.1 Dataset

The Volleyball dataset (Ibrahim et al., 2016) is a video action recognition dataset with 4,830 clips from 55 videos. There are 8 group activity labels, consisting of 4 main activities (set, spike, pass, win-point), each divided into two subgroups, left and right. Two input formats are provided: RGB videos and keypoint time series. In our setting, we simply use the keypoints as the input. We use the ’original’ split of the Volleyball dataset, in which all videos were randomly assigned, consisting of 39 training videos and 16 testing videos. We adopt this dataset because of the complex occlusions in sports footage, which align with our non-invertible generation setting.

A4.2 Implementation Details

The method is implemented using a VAE network. Specifically, the Sequence-to-Step Encoder processes the data by first flattening the features from all time steps. Then, following (Zhou et al., 2022), we apply a Composer to incorporate the interactions with fine-grained information. Subsequently, we aggregate the contextual information through an MLP, mapping from $\mathbb{R}^{T\times d}$ to $\mathbb{R}^{d}$. The Step-to-Step Decoder is also an MLP network, mapping from $\mathbb{R}^{d}$ to $\mathbb{R}^{d}$. We adopt the same Modular Prior network as in Table A2. For the implementation of TDRL, the only difference is the removal of temporal dependencies during the encoding process of the model (that is, the contextual information is not aggregated).

A4.3 Results and Analysis

As shown in Table A7, CaRiNG achieves consistent performance improvements on both person and group activity accuracy. This indicates that the temporal context is useful for modeling temporal dynamics. Although the goal of this task is group activity recognition, the person activity accuracy improves more. This is not surprising, since our method ensures better disentanglement and identification of the latent variables of the group activity, which contain the information about individual persons.

| Method | CaRiNG | TDRL |
|---|---|---|
| Group Activity Top1 Accuracy (%) | 93.044 | 92.895 |
| Group Activity Top3 Accuracy (%) | 99.028 | 98.280 |
| Person Activity Top1 Accuracy (%) | 74.551 | 73.286 |
| Person Activity Top3 Accuracy (%) | 98.087 | 96.634 |

Table A7: Model accuracy in the Volleyball dataset.
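The Top1 and Top3 accuracies in Table A7 follow the usual top-$k$ definition: a prediction counts as correct if the true label is among the $k$ highest-scoring classes. A minimal sketch, with a hypothetical function name and toy inputs, is:

```python
import numpy as np

def topk_accuracy(logits, labels, k):
    """Percentage of samples whose true label is among the k highest-scoring classes."""
    topk = np.argsort(-logits, axis=1)[:, :k]           # indices of the k largest scores per row
    hits = (topk == labels[:, None]).any(axis=1)
    return 100.0 * hits.mean()
```

By construction the top-3 accuracy is never lower than the top-1 accuracy, which matches the pattern in the table.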

Appendix A5 Related Work

A5.1 Causal Discovery with Latent Variables

Some studies have aimed to discover causally related latent variables. For example, (Silva et al., 2006; Kummerfeld & Ramsey, 2016; Huang et al., 2022) leverage the vanishing Tetrad conditions (Spearman, 1928) or rank constraints to identify latent variables in linear-Gaussian models, and (Shimizu et al., 2009; Cai & Xie, 2019; Xie et al., 2020, 2022) draw upon non-Gaussianity in their analysis for linear, non-Gaussian scenarios. Furthermore, some methods aim to find the structure beyond the latent variables, resulting in a hierarchical structure. Some hierarchical model-based approaches assume tree-like configurations, such as (Pearl, 1988; Zhang, 2004; Choi et al., 2011; Drton et al., 2017), while other methods assume a broader hierarchical structure (Xie et al., 2022; Huang et al., 2022). However, these methods remain confined to linear frameworks and face escalating challenges with intricate datasets, such as videos.

A5.2 Nonlinear ICA for Time Series Data

Nonlinear ICA represents an alternative methodology to identify latent causal variables within time series data. Such methods leverage auxiliary data, like class labels and domain indices, and impose independence constraints to facilitate the identifiability of latent variables. To illustrate: Time-contrastive learning (TCL (Hyvarinen & Morioka, 2016)) adopts the independent sources premise and capitalizes on the variability in variance across different data segments. Permutation-based contrastive learning (PCL (Hyvarinen & Morioka, 2017)) puts forth a learning paradigm that distinguishes genuine independent sources from their permuted counterparts. Furthermore, i-VAE (Khemakhem et al., 2020) utilizes deep neural networks, VAEs, to closely approximate the joint distribution encompassing observed and auxiliary non-stationary regimes. Recent work, exemplified by LEAP (Yao et al., 2022b), has tackled both stationary and non-stationary scenarios in tandem. In the stationary context, LEAP postulates a linear non-Gaussian generative process. For the non-stationary context, it assumes a nonlinear generative process, gaining leverage from auxiliary variables. Advancing beyond LEAP, TDRL (Yao et al., 2022a) initially extends the linear non-Gaussian generative assumption to a nonlinear formulation for stationary scenarios. Subsequently, it broadens the non-stationary framework to accommodate structural shifts, global alterations, and combinations thereof. Additionally, CITRIS (Lippe et al., 2022b, a) champions the use of intervention target data to precisely identify scalar and multi-dimensional latent causal factors. However, a common thread across these methodologies is the presumption of an invertible generative process, a stance that often deviates from the realities of actual data.
Besides, (Hartford et al., 2022) demonstrates that under a non-invertible scenario without extra information, identifiability can be only achieved in a subspace where bijective mapping exists. Their work provides additional support for the importance of addressing non-invertibility.

A5.3 Temporal modeling

Sequential Variational Autoencoders have gained significant popularity for their applications in temporal modeling, including generation, representation, and prediction. Variational RNN (Chung et al., 2015) introduces Variational Autoencoders into Recurrent Neural Networks, enabling variational inference on time series data. SRNN (Fraccaro et al., 2016) further utilizes the concept of SSM (State Space Model) for temporal modeling. In addition, SKD (Berman et al., 2022) utilizes a structured Koopman autoencoder to achieve multifactor sequential disentanglement. However, none of these methods incorporates a transition function for capturing the temporal dynamics of multivariate data. By integrating a transition function with independent noise through normalizing flow (Rezende & Mohamed, 2015; Ziegler & Rush, 2019), our model can effectively track and represent the causal relations of latent variables over time. Such enhancement positions CaRiNG as a method focused on learning causal representations with clear identifiability guarantees, marking a departure from the generation-centric objectives commonly seen in traditional VAE-based methods.