
Preventing Model Collapse in Gaussian Process Latent Variable Models

Ying Li    Zhidi Lin    Feng Yin    Michael Minyi Zhang
Abstract

Gaussian process latent variable models (GPLVMs) are a versatile family of unsupervised learning models commonly used for dimensionality reduction. However, common challenges in modeling data with GPLVMs include inadequate kernel flexibility and improper selection of the projection noise, leading to a type of model collapse characterized by vague latent representations that do not reflect the underlying data structure. This paper addresses these issues in two ways. First, we theoretically examine the impact of projection variance on model collapse through the lens of a linear GPLVM. Second, we tackle model collapse due to inadequate kernel flexibility by integrating the spectral mixture (SM) kernel and a differentiable random Fourier feature (RFF) kernel approximation, which ensures computational scalability and efficiency through off-the-shelf automatic differentiation tools for learning the kernel hyperparameters, projection variance, and latent representations within the variational inference framework. The proposed GPLVM, named advisedRFLVM, is evaluated across diverse datasets and consistently outperforms various salient competing models, including state-of-the-art variational autoencoders (VAEs) and other GPLVM variants, in terms of informative latent representations and missing data imputation.

Machine Learning, ICML

1 Introduction

A latent variable model (LVM) represents each observed datum $\mathbf{y}_i \in \mathbb{R}^M$ using a low-dimensional latent variable $\mathbf{x}_i \in \mathbb{R}^Q$, where $Q \ll M$. As a classic tool in statistical analysis, LVMs unveil hidden structures within the data, providing valuable insights into intricate systems across various domains (Bishop, 2006), such as signal processing (Zarzoso et al., 2010) and economics (Aigner et al., 1984).

One of the critical aspects of an LVM is the choice of mapping function from the latent variables to the observed variables. A series of early works assumed that the mapping is linear, as seen in factor analysis (Kim & Mueller, 1978), principal component analysis (PCA) (Pearson, 1901; Tipping & Bishop, 1999), and canonical correlation analysis (CCA) (Hotelling, 1936), among others. However, the linearity assumption limits the capacity of these models to capture complex, nonlinear patterns in the data, rendering them incapable of providing an optimal latent representation for complex data sets. To tackle this issue, more advanced methods replace the linear map with a nonlinear mapping module: the variational autoencoder (VAE) (Kingma & Welling, 2013, 2019) uses neural networks, while the Gaussian process latent variable model (GPLVM) (Lawrence, 2005; Titsias & Lawrence, 2010) employs the Gaussian process (GP) (Rasmussen & Williams, 2006). Both provide enhanced capacity for capturing nonlinear relationships.

GPLVMs benefit from the incorporation of the GP, which offers enhanced interpretability through explicit uncertainty calibration and interpretable kernel functions (Theodoridis, 2020; Cheng et al., 2022). Additionally, the implicit regularization imposed by the GP prior prevents GPLVMs from severe overfitting (Lotfi et al., 2022; Wilson & Izmailov, 2020). Consequently, GPLVMs often achieve superior performance in practice, even with small sample sizes. Owing to these favorable properties, GPLVMs have been used in various applications, such as intrusion detection (Abolhasanzadeh, 2015), image recognition (Eleftheriadis et al., 2013; Li et al., 2017), human pose estimation (Ek et al., 2008), and image-text retrieval (Song et al., 2015).

Despite the popularity of GPLVM and the recent efforts dedicated to enhancing its learning and inference capabilities (Titsias & Lawrence, 2010; Gundersen et al., 2021; Ramchandran et al., 2021; de Souza et al., 2021; Lalchand et al., 2022; Zhang et al., 2023), the existing work still lacks an in-depth understanding of how to optimally learn a compact and informative latent representation using the GPLVM. This ambiguity hinders our ability to overcome “model collapse” (see Definition 2.1), which manifests in practical implementations as vague learned latent representations. This paper elucidates the two key factors that lead to model collapse: the improper selection of model projection noise and inadequate kernel flexibility. To this end, we propose a new GPLVM that is immune to model collapse. Our contributions are:

  • We provide a theoretical investigation of the impact that projection variance has on encouraging model collapse through the lens of linear GPLVMs. Our empirical validation further demonstrates the relevance of these analyses to general GPLVMs. These findings collectively emphasize the importance of learning the model projection variance.

  • We propose a novel GPLVM that integrates a spectral mixture (SM) kernel (Wilson & Adams, 2013), capable of approximating arbitrary stationary kernels, to overcome model collapse arising from inadequate kernel flexibility. To reduce computational complexity and avoid introducing additional parameters like those in inducing point-based sparse GP methods (Titsias, 2009; Hensman et al., 2013), we leverage a differentiable random Fourier feature (RFF) approximation for the SM kernel (Jung et al., 2022; Lopez-Paz et al., 2014). This deliberate introduction of differentiability in the RFF approximation allows us to readily use modern off-the-shelf automatic differentiation tools (Paszke et al., 2019) to efficiently and scalably learn the kernel hyperparameters, projection variance, and latent representations of the proposed GPLVM within a variational inference framework (Bishop, 2006).

  • Our proposed GPLVM is subjected to rigorous evaluation across diverse datasets, consistently outperforming various models, including the state-of-the-art (SOTA) VAEs and some representative GPLVM variants. Specifically, it excels in learning compact and informative latent representations, addressing the issues of model collapse in existing GPLVMs.

2 Preliminaries

Gaussian Process. The GP is a generalization of the Gaussian distribution defined across infinite index sets (Rasmussen & Williams, 2006), thereby enabling the specification of a distribution over functions $f: \mathbb{R}^Q \mapsto \mathbb{R}$. A GP is fully characterized by its mean function $\mu(\mathbf{x})$, frequently set to zero, and its covariance function, a.k.a. kernel function, $k(\mathbf{x}, \mathbf{x}'; \bm{\theta}_{gp})$, where $\bm{\theta}_{gp}$ is a set of hyperparameters that needs to be tuned for model selection. According to the definition of the GP, the function values $\mathbf{f} = \{f(\mathbf{x}_i)\}_{i=1}^{N}$ at any finite set of points $\mathbf{X} = \{\mathbf{x}_i\}_{i=1}^{N}$ follow a joint Gaussian distribution, i.e.,

$$\mathbf{f} \mid \mathbf{X} = \mathcal{N}(\mathbf{f} \mid \bm{0}, \mathbf{K}), \tag{1}$$

where $\mathbf{K}$ denotes the covariance matrix evaluated on the finite input $\mathbf{X}$ with $[\mathbf{K}]_{i,j} = k(\mathbf{x}_i, \mathbf{x}_j)$. Given the observed function values $\mathbf{f}$ at the input $\mathbf{X}$, the GP predictive distribution, $p(f(\mathbf{x}_*) \mid \mathbf{x}_*, \mathbf{f}, \mathbf{X})$, at any new input $\mathbf{x}_*$ is Gaussian, fully characterized by the posterior mean $\xi$ and the posterior variance $\Xi$. Concretely,

$$\xi(\mathbf{x}_*) = \mathbf{K}_{\mathbf{x}_*, \mathbf{X}} \mathbf{K}^{-1} \mathbf{f}, \tag{2a}$$

$$\Xi(\mathbf{x}_*) = k(\mathbf{x}_*, \mathbf{x}_*) - \mathbf{K}_{\mathbf{x}_*, \mathbf{X}} \mathbf{K}^{-1} \mathbf{K}_{\mathbf{x}_*, \mathbf{X}}^{\top}, \tag{2b}$$

where $\mathbf{K}_{\mathbf{x}_*, \mathbf{X}}$ is the cross-covariance matrix evaluated on the new input $\mathbf{x}_*$ and the observed input $\mathbf{X}$.
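As an illustration of Eqs. (2a) and (2b), the following is a minimal NumPy sketch of the GP posterior at new inputs. The RBF kernel, the function names `rbf_kernel` and `gp_posterior`, and the small `jitter` added to the diagonal for numerical stability are assumptions made for this example; they are not prescribed by the text.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel (an illustrative choice of k)."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return variance * np.exp(-0.5 * np.maximum(sq, 0.0) / lengthscale**2)

def gp_posterior(X, f, X_star, kernel=rbf_kernel, jitter=1e-8):
    """Posterior mean (Eq. 2a) and pointwise variance (Eq. 2b) at the rows of X_star."""
    K = kernel(X, X) + jitter * np.eye(X.shape[0])   # jitter: numerical assumption
    K_star = kernel(X_star, X)                       # cross-covariance K_{x*, X}
    L = np.linalg.cholesky(K)                        # K = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, f))   # K^{-1} f
    V = np.linalg.solve(L, K_star.T)                       # L^{-1} K_{x*, X}^T
    mean = K_star @ alpha
    var = np.diag(kernel(X_star, X_star)) - np.sum(V**2, axis=0)
    return mean, var

# Noise-free observations of a toy function at five inputs.
X = np.linspace(-2.0, 2.0, 5)[:, None]
f = np.sin(2.0 * X[:, 0])
mean, var = gp_posterior(X, f, X)   # predicting back at the training inputs
```

At the training inputs the posterior mean reproduces `f` and the variance shrinks to roughly the jitter level, which is the interpolation behavior Eqs. (2a) and (2b) imply for noise-free observations.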

Spectral Mixture Kernel. The behavior of a GP-distributed function is largely determined by the choice of the kernel function. However, selecting an appropriate kernel by hand for complex applications is considerably challenging. By Bochner’s theorem, any stationary kernel and its spectral density are Fourier duals (Bochner, 1934), so one popular family of kernel learning methods approximates the spectral density of the underlying stationary kernel. In the spectral mixture (SM) kernel (Wilson & Adams, 2013), the underlying spectral density is approximated using a Gaussian mixture:

$$\begin{aligned}
s_i(\mathbf{w}) &= \frac{\mathcal{N}(\mathbf{w} \mid \bm{\mu}_i, \operatorname{diag}(\bm{\sigma}_i^2)) + \mathcal{N}(-\mathbf{w} \mid \bm{\mu}_i, \operatorname{diag}(\bm{\sigma}_i^2))}{2}, \\
p_{\mathrm{sm}}(\mathbf{w}) &= \sum_{i=1}^{m} \alpha_i s_i(\mathbf{w}),
\end{aligned} \tag{3}$$

where $\alpha_i$ is the mixture weight, $\bm{\mu}_i \in \mathbb{R}^Q$ and $\bm{\sigma}_i^2 \in \mathbb{R}^Q$ are the mean and variance of the $i$-th Gaussian density, and $m$ is the number of mixture components. Taking the inverse Fourier transform, we readily get the SM kernel,

$$k_{\mathrm{sm}}(\mathbf{x}, \mathbf{x}') = \sum_{i=1}^{m} \alpha_i \exp\left(-2\pi^2 \|\bm{\sigma}_i^{\top}(\mathbf{x} - \mathbf{x}')\|^2\right) \cos\left(2\pi \bm{\mu}_i^{\top}(\mathbf{x} - \mathbf{x}')\right),$$

where $\bm{\theta}_{\mathrm{sm}} = \{\alpha_i, \bm{\mu}_i, \bm{\sigma}_i^2\}_{i=1}^{m}$ is the set of hyperparameters. Because Gaussian mixtures are dense in the set of distributions, the SM kernel can approximate any stationary kernel arbitrarily well (Wilson & Adams, 2013).
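The SM kernel above can be evaluated directly from its hyperparameters. The sketch below is a minimal NumPy version under one stated assumption: the exponent is read elementwise, i.e. as $\sum_q \sigma_{iq}^2 (x_q - x'_q)^2$, which is what the inverse Fourier transform of a diagonal-covariance Gaussian yields. The function name `sm_kernel` and the toy hyperparameter values are hypothetical.

```python
import numpy as np

def sm_kernel(X1, X2, alpha, mu, sigma):
    """Spectral mixture kernel matrix between the rows of X1 (n1, Q) and X2 (n2, Q).

    alpha: (m,) mixture weights
    mu:    (m, Q) spectral means
    sigma: (m, Q) spectral standard deviations (square roots of sigma_i^2)
    """
    tau = X1[:, None, :] - X2[None, :, :]               # (n1, n2, Q) pairwise differences
    # Elementwise reading of the exponent: sum_q sigma_{iq}^2 * tau_q^2
    sq = np.einsum("abq,iq->iab", tau**2, sigma**2)     # (m, n1, n2)
    phase = np.einsum("abq,iq->iab", tau, mu)           # (m, n1, n2): mu_i^T tau
    comps = np.exp(-2.0 * np.pi**2 * sq) * np.cos(2.0 * np.pi * phase)
    return np.einsum("i,iab->ab", alpha, comps)         # weighted sum over components

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 2))                 # six latent points, Q = 2
alpha = np.array([0.7, 0.3])                    # m = 2 components (toy values)
mu = np.array([[0.0, 0.0], [0.5, 0.2]])
sigma = np.array([[0.3, 0.3], [0.1, 0.4]])
K = sm_kernel(X, X, alpha, mu, sigma)
```

With a single component and $\bm{\mu}_1 = \bm{0}$, the cosine factor is one and the expression reduces to an RBF-type kernel, so the SM family contains the preliminary kernel as a special case.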

Gaussian Process Latent Variable Models. The GPLVM is a generative model where each observed datum $\mathbf{y}_i \in \mathbb{R}^M$ is generated through a noisy Gaussian process from a latent variable $\mathbf{x}_i \in \mathbb{R}^Q$ (Lawrence, 2005):

$$\mathbf{y}_i = f(\mathbf{x}_i) + \bm{v}_i, \quad \bm{v}_i \sim \mathcal{N}(\bm{0}, \sigma^2 \mathbf{I}_M), \tag{4}$$

where $f(\cdot)$ follows a zero-mean GP prior, and $\sigma^2$ is the projection variance, which can be interpreted as the information lost in dimensionality reduction. A standard normal density is conventionally assigned as the prior on the latent variable, $\mathbf{x}_i \sim \mathcal{N}(\bm{0}, \mathbf{I}_Q)$. Given $N$ observations $\mathbf{Y} \in \mathbb{R}^{N \times M}$ from the GPLVM, the marginal likelihood after integrating out the latent GP is

$$p(\mathbf{Y} \mid \mathbf{X}) = \prod_{j=1}^{M} \mathcal{N}(\mathbf{y}_{:,j} \mid \bm{0}, \mathbf{K} + \sigma^2 \mathbf{I}_N), \tag{5}$$

where $\mathbf{y}_{:,j} \in \mathbb{R}^N$ denotes the $j$-th column of $\mathbf{Y}$. Consequently, the maximum likelihood estimate (MLE) of the latent variables $\mathbf{X}$ can be obtained by solving the following optimization problem,

$$\hat{\mathbf{X}} = \arg\max_{\mathbf{X}} L(\mathbf{X}) = \arg\max_{\mathbf{X}} \log p(\mathbf{Y} \mid \mathbf{X}), \tag{6}$$

using, e.g., gradient-based methods (Kingma & Ba, 2014).
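To make the objective in Eqs. (5) and (6) concrete, here is a minimal NumPy sketch that evaluates $\log p(\mathbf{Y} \mid \mathbf{X})$ for a given kernel matrix. The function name `gplvm_log_marginal`, the Cholesky-based evaluation, and the toy data are assumptions for illustration; the linear kernel in the usage lines is simply a convenient choice.

```python
import numpy as np

def gplvm_log_marginal(Y, K, sigma2):
    """log p(Y | X) from Eq. (5), given the N x N kernel matrix K built from X."""
    N, M = Y.shape
    C = K + sigma2 * np.eye(N)                  # covariance shared by every column of Y
    L = np.linalg.cholesky(C)                   # C = L L^T
    A = np.linalg.solve(L, Y)                   # L^{-1} Y, so sum(A**2) = tr(Y^T C^{-1} Y)
    log_det = 2.0 * np.sum(np.log(np.diag(L)))  # log |C|
    return -0.5 * (M * N * np.log(2.0 * np.pi) + M * log_det + np.sum(A**2))

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 2))     # N = 5 latent points, Q = 2
Y = rng.standard_normal((5, 3))     # M = 3 observed dimensions
K = X @ X.T                         # linear kernel, chosen only for this example
value = gplvm_log_marginal(Y, K, sigma2=0.5)
```

Because all $M$ columns share the same covariance, one Cholesky factorization serves the whole product in Eq. (5); a gradient-based optimizer would then differentiate this quantity with respect to $\mathbf{X}$.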

In the context of GPLVM, the primary objective is to obtain a compact and informative latent representation of the observed data. Unlike the general definition of model collapse in machine learning models, which is primarily characterized by a gradual shift toward homogeneous output and increased deviations from accurate predictions (Bau et al., 2019), model collapse in GPLVM is closely tied to the effectiveness of latent variable inference, as outlined below:

Definition 2.1 (Model Collapse).

When the latent variables in GPLVMs become more homogeneous and/or their crucial feature details are sacrificed or distorted, we identify this phenomenon as model collapse.

Definition 2.1 posits that two distinct manifestations of model collapse can be identified: distortion and homogeneity. Distortion occurs when the latent manifold, representing the underlying data structure, is warped or twisted, failing to accurately describe the underlying data structures. Homogeneity, on the other hand, manifests as a reduction in diversity among latent variables, resulting in a loss of crucial data features.

Figure 1: Top: Latent function estimation using GPLVM with preliminary or advanced/flexible kernels. Bottom: (a) 2-D S-shape latent manifold learned by the proposed advisedRFLVM. (b) 2-D S-shape latent manifold learned using a preliminary (RBF) kernel. (c) 2-D S-shape latent manifold learned without optimizing the projection variance. Panels (a)–(c) also show histograms along the different dimensions of the learned latent manifold.

3 Causes of Model Collapse

In this section, we will elucidate that the distortion and homogeneity in the latent manifold are attributed to two crucial factors: improper selection of projection variance and inadequate kernel function flexibility. To further illustrate these concepts, Figs. 1(b) and 1(c) depict examples where the learned latent manifolds are distorted and homogeneous, respectively.

3.1 Projection Variance Matters

This subsection investigates the impact of projection variance on encouraging model collapse. To this end, we scrutinize the stationary points with respect to the latent variables $\mathbf{X}$ and establish their connection to the projection variance. However, computing the stationary points is intractable in general because GPLVMs are non-convex and nonlinear. We therefore analyze the linear GPLVM, which assumes that the kernel function used in the GPLVM is the inner product kernel, i.e., $k(\mathbf{x}, \mathbf{x}') = \mathbf{x}^{\top} \mathbf{x}'$. This simplified GPLVM is also known as the dual probabilistic principal component analysis (DPPCA) model (Lawrence, 2005). See more details in App. A.1. The main analyses are outlined below.

Theorem 3.1.

Given the maximization problem in Eq. (6), the stationary points, $\hat{\mathbf{X}}$, in the case of the linear GPLVM are:

$$\hat{\mathbf{X}} = \mathbf{U}_Q \left(\bm{\Lambda}_Q - \sigma^2 \mathbf{I}_Q\right)^{1/2} \mathbf{R}, \tag{7}$$

where $\mathbf{U}_Q \triangleq [\mathbf{u}_1, \ldots, \mathbf{u}_Q] \in \mathbb{R}^{N \times Q}$ represents arbitrary eigenvectors of $\frac{1}{M}\mathbf{Y}\mathbf{Y}^{\top}$, $\mathbf{R} \in \mathbb{R}^{Q \times Q}$ is an arbitrary orthogonal matrix, and $\bm{\Lambda}_Q \in \mathbb{R}^{Q \times Q}$ is a diagonal matrix with:

$$[\bm{\Lambda}_Q]_{i,i} = \begin{cases} \lambda_i, & \text{the eigenvalue corresponding to } \mathbf{u}_i, \text{ or} \\ \sigma^2. & \end{cases}$$
Proof.

See App. A.2 or App. A in (Lawrence, 2005). ∎
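Theorem 3.1 can be checked numerically. The sketch below builds $\hat{\mathbf{X}}$ from Eq. (7) using the top-$Q$ eigenpairs and verifies that the gradient of $\log p(\mathbf{Y} \mid \mathbf{X})$ vanishes there. The function names are hypothetical, the data are random, and the gradient expression is obtained by standard matrix calculus on Eq. (5) with $\mathbf{K} = \mathbf{X}\mathbf{X}^{\top}$ rather than quoted from the paper.

```python
import numpy as np

def linear_gplvm_stationary_point(Y, Q, sigma2, R=None):
    """X_hat = U_Q (Lambda_Q - sigma2 I)^{1/2} R from Eq. (7), using the top-Q eigenpairs.

    Assumes sigma2 is below the Q-th largest eigenvalue, so the square root is real.
    """
    M = Y.shape[1]
    eigvals, eigvecs = np.linalg.eigh(Y @ Y.T / M)      # ascending eigenvalues
    lam = eigvals[::-1][:Q]                             # Q largest eigenvalues
    U = eigvecs[:, ::-1][:, :Q]                         # matching eigenvectors
    if np.any(lam <= sigma2):
        raise ValueError("sigma2 must be smaller than the retained eigenvalues")
    R = np.eye(Q) if R is None else R                   # any orthogonal matrix works
    return U @ np.diag(np.sqrt(lam - sigma2)) @ R

def log_lik_grad(X, Y, sigma2):
    """Gradient of log p(Y | X) w.r.t. X for the linear kernel K = X X^T."""
    N, M = Y.shape
    C = X @ X.T + sigma2 * np.eye(N)
    Cinv_X = np.linalg.solve(C, X)
    # d/dX [ -M/2 log|C| - 1/2 tr(C^{-1} Y Y^T) ] = C^{-1} Y Y^T C^{-1} X - M C^{-1} X
    return np.linalg.solve(C, Y @ (Y.T @ Cinv_X)) - M * Cinv_X

rng = np.random.default_rng(0)
Y = rng.standard_normal((6, 40))                        # N = 6, M = 40 (toy data)
X_hat = linear_gplvm_stationary_point(Y, Q=2, sigma2=0.1)
grad_norm = np.linalg.norm(log_lik_grad(X_hat, Y, sigma2=0.1))
```

The gradient norm is zero up to floating-point error for any orthogonal `R`, which reflects the rotational ambiguity of the latent space stated in the theorem.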

Theorem 3.1 reveals that the stationary point, $\hat{\mathbf{X}}$, depends on the projection variance $\sigma^2$ and the eigenvalues of $\frac{1}{M}\mathbf{Y}\mathbf{Y}^{\top}$. However, it remains unclear which specific values of $\sigma^2$ may trigger model collapse. Our findings, summarized in the following propositions, provide additional insight into the impact of $\sigma^2$ on the type of the stationary point and the cause of model collapse.

Proposition 3.2.

In the case that $\sigma^2$ equals its maximum likelihood estimate, $\hat{\sigma}^2$:

$$\sigma^2 = \hat{\sigma}^2 = \frac{1}{N - Q'} \sum_{j=Q'+1}^{N} \lambda_j, \tag{8}$$

where $Q'$ is the number of eigenvalues retained in $\bm{\Lambda}_Q$ from $\frac{1}{M}\mathbf{Y}\mathbf{Y}^{\top}$, then the only stable maximum is the global optimum. (Footnote 1: In this case, the stationary points comprise only saddle points and the global optimum; no local optimum exists.)

Proof.

See App. A.3. ∎

Proposition 3.2 suggests that adhering to the principle $\sigma^2 = \hat{\sigma}^2$ during the optimization of the log marginal likelihood (see Eq. (6) or Eq. (22)) is expected to yield the global optimum, thereby mitigating the risk of model collapse.
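For a concrete reading of Eq. (8), the following minimal sketch computes $\hat{\sigma}^2$ as the average of the discarded eigenvalues. It assumes the eigenvalues are indexed in descending order, so that the $Q'$ retained ones are the largest; the function name and the toy matrix are hypothetical.

```python
import numpy as np

def mle_projection_variance(Y, Q_retained):
    """sigma_hat^2 from Eq. (8): mean of the N - Q' smallest eigenvalues of Y Y^T / M."""
    M = Y.shape[1]
    eigvals = np.sort(np.linalg.eigvalsh(Y @ Y.T / M))[::-1]   # descending order
    return float(np.mean(eigvals[Q_retained:]))                # average the discarded ones

# Toy data constructed so that Y Y^T / M has eigenvalues (5, 3, 1, 1).
target = np.array([5.0, 3.0, 1.0, 1.0])
Y_demo = np.diag(np.sqrt(4.0 * target))        # N = M = 4
sigma2_hat = mle_projection_variance(Y_demo, Q_retained=2)
```

In this toy case the two discarded eigenvalues are both one, so the estimate equals their common value; in general it is the average variance left unexplained by the retained directions.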

Proposition 3.3.

If $\sigma^2 \in (\lambda^o_{Q-q+1}, \lambda^o_{Q-q})$, $q = 1, \ldots, Q-1$, where $\lambda^o_q$ denotes the $q$-th largest eigenvalue of $\frac{1}{M}\mathbf{Y}\mathbf{Y}^{\top}$, then the only stable maximum is the local optimum, with the maximizer $\hat{\mathbf{X}}$ having $q$ zero columns. In addition, when $q = Q$, i.e., $\sigma^2 > \lambda_1^o$, the only stable maximum occurs when $\hat{\mathbf{X}} = \mathbf{0}$ (i.e., homogeneity).

If $\sigma^2 < \lambda^o_N$, the stationary points comprise a cluster of local minimum points, accompanied by the emergence of zero columns in $\hat{\mathbf{X}}$.

Proof.

See App. A.4. ∎

Proposition 3.3 implies that an improper choice of $\sigma^2$ can hinder the optimization process, preventing it from reaching the optimum and leading to a loss of information (homogeneity) in $\hat{\mathbf{X}}$, i.e., the undesirable model collapse.
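The first part of Proposition 3.3 can be illustrated with a few lines of code: for a fixed $\sigma^2$, the number of zero columns $q$ equals the number of top-$Q$ eigenvalues that fall below $\sigma^2$. This is a sketch of that counting rule only; the function name and eigenvalues are made up for the example, and the case $\sigma^2 < \lambda^o_N$ is not covered.

```python
import numpy as np

def num_collapsed_columns(eigvals, Q, sigma2):
    """Number q of zero columns in X_hat implied by Proposition 3.3 for a fixed sigma2."""
    # lambda^o_1 >= ... >= lambda^o_Q: the Q largest eigenvalues of Y Y^T / M
    top = np.sort(np.asarray(eigvals, dtype=float))[::-1][:Q]
    # Each retained eigenvalue below sigma2 is replaced by sigma2 in Lambda_Q,
    # which zeroes the corresponding column of X_hat in Eq. (7).
    return int(np.sum(top < sigma2))

eigvals = [5.0, 3.0, 1.0, 0.2]                 # hypothetical spectrum, N = 4
counts = [num_collapsed_columns(eigvals, Q=3, sigma2=s2) for s2 in (0.5, 2.0, 4.0, 6.0)]
```

As the fixed $\sigma^2$ crosses successive eigenvalues, one more latent dimension is lost each time, until all $Q$ columns vanish once $\sigma^2 > \lambda^o_1$.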

The aforementioned findings in the linear GPLVM underscore the importance of learning the projection variance $\sigma^2$ and demonstrate how this learning can help mitigate the risk of model collapse. While it is challenging to generalize these results to the broader GPLVM framework due to the model’s non-convexity and nonlinearity, they still offer valuable insights into the role of projection variance in preventing model collapse within general GPLVMs (see § 6.1).

3.2 Kernel Function Flexibility Matters

The occurrence of model collapse is closely linked to the choice of kernel function as well, as the kernel plays a key role in learning the underlying mapping $f(\mathbf{x})$ in GPLVMs. In particular, if the learned mapping function characterized by the GP posterior diverges from the underlying one, there is a significant possibility that the estimated latent manifold will become distorted or lose crucial feature details, resulting in model collapse.

This phenomenon is depicted in Fig. 1, where the limited flexibility of the preliminary kernels prevents them from adequately exploring the corresponding reproducing kernel Hilbert space (RKHS) to capture the structure of the underlying function $f(\mathbf{x})$ (Theodoridis, 2020). Consequently, the preliminary (RBF) kernel can only roughly fit the underlying function, leading to a distorted latent manifold; see the top of Fig. 1 and the associated latent manifold estimate in Fig. 1(b), where the RBF kernel struggles to fit a function that exhibits short-term irregularities.

Conversely, employing a flexible kernel capable of approximating arbitrary kernels allows for thorough exploration of the kernel space, enabling the automatic discovery of the most suitable kernel to capture hidden and possibly complex data patterns and structures, such as periodicity and long tails (Wilson & Adams, 2013; Duvenaud, 2014). This enhances the capacity to effectively learn the underlying mapping functions and estimate an accurate latent manifold, as evidenced by the learned function using a flexible (SM) kernel in Fig. 1 (top sub-figure) and the latent manifold estimate in Fig. 1(a).

In summary, Fig. 1 demonstrates the importance of kernel flexibility in GPLVMs for mitigating model collapse (distortion) in practice. In this paper, we will employ a kernel capable of approximating arbitrary stationary kernels, namely the SM kernel (Wilson & Adams, 2013). In the next section, we detail our proposed GPLVM that incorporates the SM kernel while learning projection variance to prevent model collapse.

4 Preventing Model Collapse

Integrating the general GPLVM with the SM kernel poses two distinct challenges: 1) high computational costs and 2) intractable model learning (de Souza et al., 2021; Jung et al., 2022; Chang et al., 2023). Specifically, the computational complexity of training the GPLVM with the SM kernel scales as $\mathcal{O}(N^{3})$ with $N$ data points (Rasmussen & Williams, 2006), rendering it prohibitive in the context of big data. To tackle the scalability issue of the GPLVM, one representative variational method presented by Titsias & Lawrence (2010) utilizes sparse GPs based on inducing points (Titsias, 2009). However, this variational method is computationally tractable only for a limited set of preliminary kernel functions, such as the RBF kernel. Recent work has tried to enhance the scalability and flexibility of the GPLVM by using the stochastic variational inference approach proposed by Hensman et al. (2013) (Lalchand et al., 2022; de Souza et al., 2021; Ramchandran et al., 2021). Despite these endeavors, the need to optimize additional inducing points still leads to an increased computational burden and the risk of getting stuck in suboptimal solutions. Thus, despite the enhanced model capability, these models often struggle to achieve their theoretical potential for addressing model collapse (see § 6).

To address the aforementioned issues, we resort to the variational inference technique (Jordan et al., 1999) and a random Fourier features (RFF) approximation (Jung et al., 2022; Rahimi & Recht, 2007), which enable us to learn the SM kernel-embedded GPLVM efficiently and scalably without introducing extra parameters (inducing points) as required in sparse GP-based methods (Titsias & Lawrence, 2010; Lalchand et al., 2022). The vanilla RFF approximates any stationary kernel $k(\mathbf{x},\mathbf{x}')$ using Monte Carlo integration (Rahimi & Recht, 2007), i.e.,

\begin{align}
k(\mathbf{x},\mathbf{x}') \approx \varphi(\mathbf{x})^{\top}\varphi(\mathbf{x}'), \quad \varphi(\mathbf{x}) \triangleq \sqrt{\frac{2}{L}} \Big[ & \sin(2\pi\mathbf{w}_{1}^{\top}\mathbf{x}), \cos(2\pi\mathbf{w}_{1}^{\top}\mathbf{x}), \ldots, \tag{9} \\
& \sin(2\pi\mathbf{w}_{L/2}^{\top}\mathbf{x}), \cos(2\pi\mathbf{w}_{L/2}^{\top}\mathbf{x}) \Big] \notag
\end{align}

where $\{\mathbf{w}_{l}\}_{l=1}^{L/2}$ are $L/2$ i.i.d. spectral points drawn from the density function $p(\mathbf{w})$ of the associated kernel function $k(\mathbf{x},\mathbf{x}')$, and $L$ is a positive even integer.
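As a minimal sketch of Eq. (9), the NumPy snippet below builds the feature map and compares the resulting kernel estimate with an exact kernel. The RBF kernel, its lengthscale, and the function names `rff_features` and `rbf_kernel` are illustrative assumptions, not part of the paper. Under the $2\pi$ convention of Eq. (9), the spectral density of an RBF kernel with lengthscale $\ell$ is a zero-mean Gaussian with standard deviation $1/(2\pi\ell)$.

```python
import numpy as np

def rff_features(X, W):
    """Feature map of Eq. (9). X: (N, Q) inputs, W: (L/2, Q) spectral points."""
    proj = 2.0 * np.pi * X @ W.T          # (N, L/2) values of 2*pi*w_l^T x
    half = W.shape[0]                     # L/2, so sqrt(1/half) == sqrt(2/L)
    # Sine and cosine blocks are concatenated instead of interleaved;
    # the inner product phi(x)^T phi(x') is identical either way.
    return np.sqrt(1.0 / half) * np.hstack([np.sin(proj), np.cos(proj)])

def rbf_kernel(X, lengthscale):
    """Exact RBF kernel, used here only as an illustrative stationary kernel."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * sq / lengthscale**2)

rng = np.random.default_rng(0)
N, Q, L, lengthscale = 5, 2, 20000, 1.5   # hypothetical sizes
X = rng.normal(size=(N, Q))
# Spectral points of the RBF kernel under the 2*pi convention.
W = rng.normal(scale=1.0 / (2.0 * np.pi * lengthscale), size=(L // 2, Q))

Phi = rff_features(X, W)                  # (N, L)
K_hat = Phi @ Phi.T                       # Monte Carlo kernel estimate
max_err = np.max(np.abs(K_hat - rbf_kernel(X, lengthscale)))
print("max abs error:", max_err)
```

The error shrinks at the usual Monte Carlo rate as $L$ grows, which is the trade-off between accuracy and feature dimension that the rest of the section relies on.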

Leveraging the RFF approximation, we can obtain the following SM kernel-embedded GPLVM:

\begin{align}
\mathbf{y}_{:,j} &\sim \mathcal{N}(\bm{0},\ \varphi(\mathbf{X})\varphi(\mathbf{X})^{\top} + \sigma^{2}\mathbf{I}_{N}), \quad j = 1,\ldots,M, \tag{10a} \\
\mathbf{w}_{l} &\sim p_{\mathrm{sm}}(\mathbf{w}), \quad l = 1,\ldots,L/2, \tag{10b} \\
\mathbf{x}_{i} &\sim \mathcal{N}(\mathbf{0},\mathbf{I}_{Q}), \quad i = 1,\ldots,N, \tag{10c}
\end{align}

ensuring both computational scalability and modeling flexibility. (Footnote 2: Similar to Gundersen et al. (2021), we consider $\mathbf{W}$ as part of the data-generating process. We then constrain its prior $p(\mathbf{W})$ to be Gaussian mixtures, thereby defining SM kernels. For a detailed interpretation of $\mathbf{W}$, see App. B.2.) The following subsections further detail our proposed variational inference algorithm, which keeps learning tractable and effective in addressing model collapse.

4.1 Approximate Bayesian Inference

Given the SM kernel-embedded GPLVM defined in Eq. (10), we utilize the variational inference technique (Theodoridis, 2020) to learn the model hyperparameters $\bm{\theta} = [\bm{\theta}_{\mathrm{sm}}, \sigma^{2}]$, aiming to mitigate the risk of model collapse. Specifically, we can immediately obtain the joint distribution of the GPLVM in Eq. (10) as

\begin{align}
p(\mathbf{Y},\mathbf{X},\mathbf{W}) &= p(\mathbf{X})\,p(\mathbf{W})\,p(\mathbf{Y}|\mathbf{X},\mathbf{W}) \tag{11} \\
&= p(\mathbf{W})\prod_{i=1}^{N}p(\mathbf{x}_{i})\prod_{j=1}^{M}p(\mathbf{y}_{:,j}|\mathbf{X},\mathbf{W}), \notag
\end{align}

where $p(\mathbf{W}) = \prod_{l=1}^{L/2}p_{\mathrm{sm}}(\mathbf{w})$ is the joint distribution of the spectral points. The variational inference method constructs a variational lower bound $\mathcal{L}$ of the log marginal likelihood whose slack is the Kullback–Leibler (KL) divergence between the variational distribution and the true posterior: $\log p(\mathbf{Y}) - \mathcal{L} = \operatorname{KL}[q(\mathbf{X},\mathbf{W})\,\|\,p(\mathbf{X},\mathbf{W}|\mathbf{Y})]$. By maximizing $\mathcal{L}$ w.r.t. $q(\cdot)$, we improve the quality of the approximation (Cao et al., 2023; Cheng et al., 2022).

For this purpose, we introduce the following variational distribution to approximate the posterior over all the latent variables, $\{\mathbf{W},\mathbf{X}\}$:

\begin{equation}
q(\mathbf{X},\mathbf{W}) \triangleq p(\mathbf{W})\,q(\mathbf{X}) = p(\mathbf{W})\prod_{i=1}^{N}q(\mathbf{x}_{i}), \tag{12}
\end{equation}

where $q(\mathbf{X}) = \prod_{i=1}^{N}\mathcal{N}(\mathbf{x}_{i}|\bm{\mu}_{i},\bm{S}_{i})$, and $\bm{\mu}_{i}\in\mathbb{R}^{Q}$, $\bm{S}_{i}\in\mathbb{R}^{Q\times Q}$ are the associated free variational parameters. The variational distribution $q(\mathbf{W})$ is constrained to be the prior distribution, which is essentially equivalent to explicitly assuming that $q(\mathbf{W})$ is a Gaussian mixture. See App. B.2 for detailed discussions on this equivalence and on other, more complex variational distributions of $\mathbf{W}$. The variational lower bound for simultaneous learning and inference can now be derived; it is summarized in the following theorem.

Theorem 4.1.

With the model joint distribution in Eq. (11) and the assumed variational distribution in Eq. (12), the evidence lower bound (ELBO), $\mathcal{L} = \mathbb{E}_{q(\mathbf{X},\mathbf{W})}\left[\log p(\mathbf{Y},\mathbf{X},\mathbf{W}) - \log q(\mathbf{X},\mathbf{W})\right]$, for the joint learning and inference is

\begin{align*}
\mathcal{L} &= \mathbb{E}_{q(\mathbf{X},\mathbf{W})}\left[\log\frac{p(\mathbf{W})\prod_{i=1}^{N}p(\mathbf{x}_{i})\prod_{j=1}^{M}p(\mathbf{y}_{:,j}|\mathbf{X},\mathbf{W})}{p(\mathbf{W})\prod_{i=1}^{N}q(\mathbf{x}_{i})}\right] \\
&= \underbrace{\sum_{j=1}^{M}\mathbb{E}_{q(\mathbf{X},\mathbf{W})}\left[\log p(\mathbf{y}_{:,j}|\mathbf{X},\mathbf{W})\right]}_{\text{Term 1: data reconstruction}} - \underbrace{\sum_{i=1}^{N}\operatorname{KL}(q(\mathbf{x}_{i})\,\|\,p(\mathbf{x}_{i}))}_{\text{Term 2: regularization}}.
\end{align*}

Here, the first term corresponds to the data reconstruction error, which encourages any latent variables $\mathbf{X}$ and $\mathbf{W}$ sampled from the variational distribution $q(\mathbf{X},\mathbf{W})$ to accurately reconstruct the observations. The second term is a regularizer for $q(\mathbf{X})$, which discourages significant deviations of $q(\mathbf{X})$ from the prior $p(\mathbf{X})$.

For the evaluation of $\mathcal{L}$, the second term can be evaluated analytically due to the Gaussian nature of the distributions. The first term needs to be handled numerically with Monte Carlo estimation, i.e.,

\begin{align}
\text{Term 1} &= \sum_{j=1}^{M}\mathbb{E}_{q(\mathbf{X},\mathbf{W})}\left[\log p(\mathbf{y}_{:,j}|\mathbf{X},\mathbf{W})\right] \tag{13a} \\
&\approx \sum_{j=1}^{M}\frac{1}{I}\sum_{i=1}^{I}\log\mathcal{N}(\mathbf{y}_{:,j}|\bm{0},\ \hat{\mathbf{K}}_{\mathrm{sm}}^{(i)} + \sigma^{2}\mathbf{I}_{N}), \tag{13b}
\end{align}

where $I$ denotes the number of Monte Carlo samples drawn from $q(\mathbf{X})$ and $p(\mathbf{W})$, and $\hat{\mathbf{K}}_{\mathrm{sm}}$ is the SM kernel matrix approximation constructed by the feature map $\varphi(\cdot)$. See App. B.1 for more computational details.
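The analytic evaluation of Term 2 can be sketched as follows. With the standard normal prior of Eq. (10c), each summand has the well-known closed form $\operatorname{KL}(\mathcal{N}(\bm{\mu}_i,\bm{S}_i)\,\|\,\mathcal{N}(\mathbf{0},\mathbf{I}_Q)) = \tfrac{1}{2}\left[\operatorname{tr}(\bm{S}_i) + \bm{\mu}_i^{\top}\bm{\mu}_i - Q - \log\det\bm{S}_i\right]$. The function names below are hypothetical, and the snippet is an illustration rather than the authors' implementation.

```python
import numpy as np

def kl_to_standard_normal(mu, S):
    """KL( N(mu, S) || N(0, I_Q) ) for one latent point. mu: (Q,), S: (Q, Q)."""
    Q = mu.shape[0]
    sign, logdet = np.linalg.slogdet(S)
    if sign <= 0:
        raise ValueError("S must be positive definite")
    return 0.5 * (np.trace(S) + mu @ mu - Q - logdet)

def regularization_term(mus, Ss):
    """Term 2 of the ELBO: the sum of per-point KL divergences."""
    return sum(kl_to_standard_normal(m, S) for m, S in zip(mus, Ss))

# Hypothetical variational parameters for N = 2 latent points with Q = 2.
mus = [np.zeros(2), np.array([1.0, 0.0])]
Ss = [np.eye(2), np.diag([2.0, 0.5])]
term2 = regularization_term(mus, Ss)
print("Term 2:", term2)
```

The first point matches the prior exactly and contributes nothing, so Term 2 only penalizes latent points whose variational posterior moves away from $\mathcal{N}(\mathbf{0},\mathbf{I}_Q)$.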

Note that in Eq. (10b), we need to sample $\mathbf{w}_{l}$ from a Gaussian mixture, which involves first generating an index $i$ from the discrete probability distribution $P(i) = \alpha_{i}/\sum_{j=1}^{m}\alpha_{j}$, $i = 1,\ldots,m$, and then drawing the sample $\mathbf{w}_{l}$ from $s_{i}(\mathbf{w})$. However, due to the difficulty of reparameterizing the discrete distribution over mixture weights (Graves, 2016), maximizing the ELBO w.r.t. the weights $\alpha_{i}$ using modern off-the-shelf automatic differentiation tools (e.g., PyTorch (Paszke et al., 2019)) becomes challenging. To this end, we leverage a differentiable RFF feature map construction approach developed for GP regression models by Jung et al. (2022) to ensure inherent differentiability w.r.t. the mixture weights.
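The two-stage sampling described above can be sketched in NumPy as follows. The sketch assumes that each component $s_i(\mathbf{w})$ is a Gaussian with mean $\bm{\mu}_i$ and diagonal covariance; the function name `sample_mixture` and all numerical values are hypothetical. The point of the illustration is that the component draw can be written as a smooth transform of noise, whereas the index draw is a discrete step through which gradients w.r.t. the weights do not flow.

```python
import numpy as np

def sample_mixture(alphas, mus, sigmas, num_samples, rng):
    """Two-stage draw from sum_i P(i) N(mu_i, diag(sigma_i^2)).

    alphas: (m,) weights, mus: (m, Q) means, sigmas: (m, Q) standard deviations.
    """
    probs = alphas / alphas.sum()                  # P(i) = alpha_i / sum_j alpha_j
    # Stage 1: discrete index draw (not a smooth function of alphas).
    idx = rng.choice(len(alphas), size=num_samples, p=probs)
    # Stage 2: reparameterized Gaussian draw from the selected component.
    eps = rng.standard_normal((num_samples, mus.shape[1]))
    return mus[idx] + sigmas[idx] * eps, idx

rng = np.random.default_rng(0)
alphas = np.array([1.0, 3.0])
mus = np.array([[0.0, 0.0], [2.0, -1.0]])
sigmas = np.array([[0.1, 0.1], [0.2, 0.3]])
W, idx = sample_mixture(alphas, mus, sigmas, 20000, rng)
print("fraction drawn from component 1:", np.mean(idx == 1))
```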

4.2 Differentiable RFF Approximation for SM Kernel

Rather than directly sampling from the Gaussian mixture, we first apply the vanilla RFF to get the corresponding feature map $\varphi_{i}(\mathbf{x}) \triangleq \sqrt{\alpha_{i}}\cdot\varphi(\mathbf{x};\{\mathbf{w}_{l}^{(i)}\}_{l=1}^{L/2})$, $i = 1,\ldots,m$, for each mixture component, where the reparameterization trick (Kingma & Welling, 2019) is employed to sample $\mathbf{w}_{l}^{(i)}$ from $s_{i}(\mathbf{w})$. Subsequently, stacking the $m$ feature maps yields the new RFF approximation for the SM kernel, denoted as $\phi(\mathbf{x})$, i.e.,

\begin{equation}
\phi(\mathbf{x}) = \left[\varphi_{1}(\mathbf{x})^{\top}, \varphi_{2}(\mathbf{x})^{\top}, \ldots, \varphi_{m}(\mathbf{x})^{\top}\right]^{\top} \in \mathbb{R}^{mL\times 1}. \tag{14}
\end{equation}

It can be shown that $\phi(\mathbf{x})^{\top}\phi(\mathbf{x}')$ is an unbiased estimator of the SM kernel characterized by the hyperparameters $\bm{\theta}_{\mathrm{sm}} = \{\alpha_{i},\bm{\mu}_{i},\bm{\sigma}^{2}_{i}\}_{i=1}^{m}$. The result is stated in the following proposition (Lopez-Paz et al., 2014).

Proposition 4.2.

Let $\mathbf{W} = \{\mathbf{w}_{1}^{(i)},\mathbf{w}_{2}^{(i)},\ldots,\mathbf{w}_{L/2}^{(i)}\}_{i=1}^{m}$ be the spectral points sampled from the distribution $p(\mathbf{W}) = \prod_{i=1}^{m}\prod_{l=1}^{L/2}s_{i}(\mathbf{w})$ using the reparameterization trick (Kingma & Welling, 2019). With the RFF feature map constructed in Eq. (14), given any inputs $\mathbf{x}$ and $\mathbf{x}'$, $\phi(\mathbf{x})^{\top}\phi(\mathbf{x}')$ is an unbiased estimator of $k_{\mathrm{sm}}(\mathbf{x},\mathbf{x}')$ with the hyperparameters $\bm{\theta}_{\mathrm{sm}}$, i.e.,

\begin{equation}
\mathbb{E}_{p(\mathbf{W})}\left[\phi(\mathbf{x})^{\top}\phi(\mathbf{x}')\right] = k_{\mathrm{sm}}(\mathbf{x},\mathbf{x}';\bm{\theta}_{\mathrm{sm}}) \tag{15}
\end{equation}
Proof.

See App. C.1
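Proposition 4.2 can be checked numerically with a short sketch. It assumes the SM kernel takes the Wilson & Adams (2013) form $k_{\mathrm{sm}}(\bm{\tau}) = \sum_{i}\alpha_{i}\exp(-2\pi^{2}\bm{\tau}^{\top}\operatorname{diag}(\bm{\sigma}_{i}^{2})\bm{\tau})\cos(2\pi\bm{\mu}_{i}^{\top}\bm{\tau})$ and that $s_{i}(\mathbf{w})$ is the Gaussian $\mathcal{N}(\bm{\mu}_{i},\operatorname{diag}(\bm{\sigma}_{i}^{2}))$; the function names and hyperparameter values are hypothetical. Note how each weight $\alpha_{i}$ enters only through the smooth factor $\sqrt{\alpha_{i}}$, with no discrete index draw.

```python
import numpy as np

def sm_kernel(X, alphas, mus, sigmas):
    """Exact SM kernel matrix (assumed Wilson & Adams form). sigmas are std devs."""
    tau = X[:, None, :] - X[None, :, :]                       # (N, N, Q)
    K = np.zeros((X.shape[0], X.shape[0]))
    for a, mu, sig in zip(alphas, mus, sigmas):
        decay = np.exp(-2.0 * np.pi**2 * np.sum((tau * sig) ** 2, axis=-1))
        K += a * decay * np.cos(2.0 * np.pi * (tau @ mu))
    return K

def sm_feature_map(X, alphas, mus, sigmas, L, rng):
    """Stacked feature map of Eq. (14); returns an (N, m*L) matrix."""
    blocks = []
    for a, mu, sig in zip(alphas, mus, sigmas):
        eps = rng.standard_normal((L // 2, X.shape[1]))
        W_i = mu + sig * eps                  # reparameterized draw from s_i(w)
        proj = 2.0 * np.pi * X @ W_i.T        # (N, L/2)
        phi_i = np.sqrt(2.0 / L) * np.hstack([np.sin(proj), np.cos(proj)])
        blocks.append(np.sqrt(a) * phi_i)     # alpha_i acts as a smooth scale
    return np.hstack(blocks)

rng = np.random.default_rng(0)
N, Q, L = 4, 2, 20000                         # hypothetical sizes
X = rng.normal(size=(N, Q))
alphas = np.array([0.6, 0.4])
mus = np.array([[0.2, 0.0], [0.5, 0.3]])
sigmas = np.array([[0.1, 0.2], [0.3, 0.1]])

Phi = sm_feature_map(X, alphas, mus, sigmas, L, rng)
K_hat = Phi @ Phi.T
max_err = np.max(np.abs(K_hat - sm_kernel(X, alphas, mus, sigmas)))
print("max abs error:", max_err)
```

Because the index draw is gone, an automatic differentiation framework can propagate gradients to the weights, means, and variances alike; the price is a feature dimension of $mL$ instead of $L$.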

In fact, given inputs $\mathbf{X}$ and the new feature map defined in Eq. (14), we can further characterize the approximation error bound for the constructed SM kernel matrix approximation, $\hat{\mathbf{K}}_{\mathrm{sm}} = \Phi_{\mathrm{sm}}(\mathbf{X})\Phi_{\mathrm{sm}}(\mathbf{X})^{\top}$, where the random feature matrix is $\Phi_{\mathrm{sm}}(\mathbf{X}) = \left[\phi(\mathbf{x}_{1}),\ldots,\phi(\mathbf{x}_{N})\right]^{\top} \in \mathbb{R}^{N\times mL}$ (Jung et al., 2022; Lopez-Paz et al., 2014).

Theorem 4.3.

For all small $\epsilon > 0$, the approximation error between the underlying SM kernel matrix $\mathbf{K}_{\mathrm{sm}}$ and its RFF approximation $\hat{\mathbf{K}}_{\mathrm{sm}}$ is characterized by

\begin{equation}
P\left(\left\|\hat{\mathbf{K}}_{\mathrm{sm}} - \mathbf{K}_{\mathrm{sm}}\right\|_{2} \geq \epsilon\right) \leq N\exp\left(\frac{-3\epsilon^{2}L}{2Na\left(6\left\|\mathbf{K}_{\mathrm{sm}}\right\|_{2} + 3Na\sqrt{m} + 8\epsilon\right)}\right), \tag{16}
\end{equation}

where $a = \sqrt{\sum_{i=1}^{m}\alpha_{i}^{2}}$ and $\|\cdot\|_{2}$ denotes the matrix spectral norm.

Proof.

See App. C.2
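To get a feel for Eq. (16), the sketch below evaluates its right-hand side and inverts it for the number of features $L$ that drives the bound below a target probability $\delta$. The helper names, the chosen values, and the crude estimate $\|\mathbf{K}_{\mathrm{sm}}\|_{2} \leq N\sum_{i}\alpha_{i}$ (valid for nonnegative weights) are assumptions made only for this illustration.

```python
import numpy as np

def _denominator(eps, N, alphas, K_norm):
    """Denominator inside the exponential of Eq. (16)."""
    alphas = np.asarray(alphas, dtype=float)
    a = np.sqrt(np.sum(alphas**2))
    return 2.0 * N * a * (6.0 * K_norm + 3.0 * N * a * np.sqrt(alphas.size) + 8.0 * eps)

def rff_error_bound(eps, N, L, alphas, K_norm):
    """Right-hand side of Eq. (16); values >= 1 mean the bound is vacuous."""
    return N * np.exp(-3.0 * eps**2 * L / _denominator(eps, N, alphas, K_norm))

def min_features(eps, delta, N, alphas, K_norm):
    """Smallest even L for which the bound in Eq. (16) is at most delta."""
    L = int(np.ceil(_denominator(eps, N, alphas, K_norm) * np.log(N / delta) / (3.0 * eps**2)))
    return L + (L % 2)                         # Eq. (9) requires an even L

alphas = [0.6, 0.4]
N, eps, delta = 100, 5.0, 0.05                 # hypothetical setting
K_norm = N * sum(alphas)                       # crude upper bound on the spectral norm
L_min = min_features(eps, delta, N, alphas, K_norm)
print("L needed:", L_min, "bound:", rff_error_bound(eps, N, L_min, alphas, K_norm))
```

Since $L$ appears linearly in the exponent, the failure probability decays exponentially in the number of random features, while the required $L$ grows with $N$ and with the weight norm $a$.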

Input: Dataset $\mathbf{Y}$; initialized model hyperparameters $\bm{\theta}$ and variational parameters $\bm{\zeta}$.
while iterations not terminated do
    Sample $\mathbf{X}$ from $q(\mathbf{X}) = \prod_{i=1}^{N}\mathcal{N}(\mathbf{x}_{i}|\bm{\mu}_{i},\bm{S}_{i})$ using the reparameterization trick
    Sample $\mathbf{W}$ from $p(\mathbf{W}) = \prod_{i=1}^{m}\prod_{l=1}^{L/2}s_{i}(\mathbf{w})$ using the reparameterization trick
    Construct $\Phi_{\mathrm{sm}}(\mathbf{X})$ using the sampled $\mathbf{X}$ and $\mathbf{W}$
    Evaluate Term 1 of $\mathcal{L}$ through Eq. (13)
    Evaluate Term 2 of $\mathcal{L}$ analytically
    Maximize $\mathcal{L}$ and update $\bm{\theta}$, $\bm{\zeta}$ using Adam
end while
Output: $\bm{\theta}$, $\bm{\zeta}$.
Algorithm 1 advisedRFLVM: Auto-Differentiable Variational Inference for SM-Embedded RFLVMs

Beyond the theoretical guarantees of the approximation, the new feature map in Eq. (14) offers a crucial advantage: it renders the variational lower bound $\mathcal{L}$ differentiable w.r.t. the mixture weights $\alpha_{i}$, so that automatic differentiation tools apply directly to hyperparameter optimization. Leveraging the new feature map, we can apply gradient-based methods (e.g., Adam (Kingma & Ba, 2014)) to maximize $\mathcal{L}$ w.r.t. the model hyperparameters $\bm{\theta}$ and the variational parameters $\bm{\zeta} = \{\bm{\mu}_{i},\bm{S}_{i}\}_{i=1}^{N}$. The pseudocode in Algorithm 1 outlines the implementation of the proposed method, called auto-differentiable variational inference for SM-embedded RFF-LVM, abbreviated as advisedRFLVM. It is noteworthy that for scenarios where $N \gg mL$, the computational complexity per iteration of advisedRFLVM scales as $\mathcal{O}(N(mL)^{2})$, as elaborated in App. B.1. Notably, this computational complexity aligns with that of the inducing point-based sparse GP method (Titsias & Lawrence, 2010). However, advisedRFLVM enhances the capacity of the GPLVM and removes the need to optimize inducing points, resulting in a lightweight optimization problem while alleviating model collapse.
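One standard way to reach a per-iteration cost of $\mathcal{O}(N(mL)^{2})$ when evaluating the Gaussian log-density in Eq. (13b) is to combine the Woodbury identity with the matrix determinant lemma, so that only an $mL \times mL$ system is solved. The sketch below illustrates that route; whether App. B.1 follows exactly this derivation is an assumption, and the function names are hypothetical. A dense $\mathcal{O}(N^{3})$ reference is included only to check the result.

```python
import numpy as np

def loglik_lowrank(Y, Phi, sigma2):
    """sum_j log N(y_{:,j} | 0, Phi Phi^T + sigma2 I_N) in O(N D^2), D = Phi.shape[1]."""
    N, M = Y.shape
    D = Phi.shape[1]
    A = Phi.T @ Phi + sigma2 * np.eye(D)       # (D, D) matrix, O(N D^2) to form
    B = Phi.T @ Y                              # (D, M)
    _, logdet_A = np.linalg.slogdet(A)
    # Matrix determinant lemma: log det(Phi Phi^T + sigma2 I_N).
    logdet_K = (N - D) * np.log(sigma2) + logdet_A
    # Woodbury identity: tr(Y^T K^{-1} Y) without forming the N x N matrix.
    quad = (np.sum(Y**2) - np.sum(B * np.linalg.solve(A, B))) / sigma2
    return -0.5 * (M * N * np.log(2.0 * np.pi) + M * logdet_K + quad)

def loglik_dense(Y, Phi, sigma2):
    """Reference O(N^3) evaluation of the same quantity."""
    N, M = Y.shape
    K = Phi @ Phi.T + sigma2 * np.eye(N)
    _, logdet_K = np.linalg.slogdet(K)
    quad = np.sum(Y * np.linalg.solve(K, Y))
    return -0.5 * (M * N * np.log(2.0 * np.pi) + M * logdet_K + quad)

rng = np.random.default_rng(0)
N, D, M, sigma2 = 50, 6, 3, 0.3                # hypothetical sizes, D plays the role of mL
Phi = rng.normal(size=(N, D))
Y = rng.normal(size=(N, M))
fast, slow = loglik_lowrank(Y, Phi, sigma2), loglik_dense(Y, Phi, sigma2)
print(fast, slow)
```

In an autodiff framework the same expression stays differentiable w.r.t. $\sigma^{2}$, the kernel hyperparameters, and the sampled latent points, which is what allows the projection variance to be learned jointly with everything else.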

5 Related Work

We have already described the main differences between our method and inducing points-based methods throughout the paper, e.g., in § 4. Below we briefly introduce other related work on latent variable modeling and refer the reader to App. D for more details.

VAEs.

Variational autoencoders (VAEs) (Kingma & Welling, 2013) integrate LVMs, typically modeled by neural networks, with variational inference (Bishop, 2006), empowering the model to generate novel data. Unfortunately, despite the considerable success demonstrated by VAEs in generative tasks (Kingma & Welling, 2013; Zhao et al., 2020; Nakagawa et al., 2023; Tran et al., 2023), they struggle to capture the underlying compact and informative latent representations of the observed data, resulting in the well-known posterior collapse issue (Menon et al., 2022; Wang & Liu, 2022; Lucas et al., 2019; Razavi et al., 2019), a facet of model collapse (see App. D). This phenomenon is partially attributed to overfitting, which stems from optimizing a large number of parameters in the VAE encoder and leads to homogeneous latent spaces (Bowman et al., 2016; Sønderby et al., 2016; Zhu et al., 2023).

RFLVMs.

In addition to inducing points-based GPLVMs, the random feature latent variable model (RFLVM) is a GPLVM variant that adopts the RFF approximation of the kernel function and leverages a Dirichlet process (DP) mixture of Gaussians to learn the associated spectral density (Rahimi & Recht, 2007; Oliva et al., 2016; Gundersen et al., 2021; Zhang et al., 2023). Although the RFLVM can approximate arbitrary stationary kernels, its effectiveness in addressing model collapse might be compromised by the “rich-get-richer” property inherent in the DP mixture prior (Gundersen et al., 2021), which imposes a strong assumption on the data generation process (Poux-Médard et al., 2023). A comprehensive comparison between our advisedRFLVM and the SOTA models can be found in Table 3, App. D.

6 Experiments

We showcase the impact of the projection variance and kernel flexibility on model collapse in § 6.1 and § 6.2. In § 6.3 and § 6.4, we further corroborate the superior performance of advisedRFLVM in latent representation learning on various real-world datasets. More experimental details can be found in App. E, and the code is publicly available at https://github.com/zhidilin/advisedGPLVM.

6.1 Projection Variance Matters

To evaluate the impact of the projection variance in a general GPLVM, we apply advisedRFLVM to the MNIST dataset (LeCun, 1998). We quantify the degree of model collapse under two configurations of $\sigma^2$: learned and fixed. The degree of model collapse is evaluated by counting the number of zero-columns in the learned latent variable $\hat{\mathbf{X}}$ and measuring its k-nearest neighbors (KNN) classification accuracy. Detailed results are depicted in Fig. 2.
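The zero-column count can be sketched as follows. The section does not say how "zero" is decided numerically, so the absolute tolerance `tol` below is an assumption, as are the function name `count_zero_columns` and the toy matrix.

```python
def count_zero_columns(X, tol=1e-6):
    """Count latent dimensions (columns) whose entries are all numerically zero.

    `tol` is an assumed threshold; the paper does not specify one here.
    """
    if not X:
        return 0
    n_cols = len(X[0])
    return sum(
        1 for q in range(n_cols)
        if all(abs(row[q]) <= tol for row in X)
    )


# Toy latent matrix with N = 3 points and Q = 3 dimensions (illustrative values).
X_hat = [
    [0.8, 0.0, -1.2],
    [-0.3, 1e-9, 0.4],
    [1.1, 0.0, 0.0],
]
num_zc = count_zero_columns(X_hat)  # only the middle column has collapsed
```

A larger count means more latent dimensions carry no information, which is the collapse symptom plotted on the left of Fig. 2.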

On the left-hand side of Fig. 2, we observe that, when $\sigma^2$ is fixed, the latent variable learned by advisedRFLVM rapidly collapses to zero as the value of $\sigma^2$ increases. This observation aligns with the findings for the linear GPLVM (see Proposition 3.3). Additionally, the inferior KNN accuracy depicted on the right-hand side of Fig. 2 illustrates that, without learning $\sigma^2$, the proposed advisedRFLVM tends to recover a vague and uninformative latent representation. In stark contrast, advisedRFLVM with a learned $\sigma^2$ effectively mitigates the risk of model collapse, irrespective of the initialization value of $\sigma^2$ or the metric employed. This supports our hypothesis regarding the importance of learning $\sigma^2$ to prevent model collapse in general GPLVMs.

Figure 2: Left: The number of zero-columns (NUM-ZC for short) in the latent variable $\mathbf{X}$ versus the initialization value of $\sigma^2$ (denoted INIT-$\sigma^2$). Right: KNN classification accuracy against INIT-$\sigma^2$. Standard deviation is calculated over five experiments.

Figure 3: Left: Learned latent manifold in the “RBF+periodic” dataset. Right: $\mathrm{R}^2$ score of the different models on the two datasets.

Table 1: Classification accuracy evaluated by fitting a KNN classifier ($k=1$) with five-fold cross-validation. Mean and standard deviation are computed over five experiments, and the top performance is in bold.

| Dataset | PCA | LDA | Isomap | HPF | BGPLVM | GPLVM-SVI |
|---|---|---|---|---|---|---|
| Bridges | 0.841 ± 0.007 | 0.668 ± 0.053 | 0.797 ± 0.025 | 0.544 ± 0.109 | 0.818 ± 0.037 | 0.796 ± 0.019 |
| CIFAR-10 | 0.267 ± 0.002 | 0.227 ± 0.006 | 0.272 ± 0.006 | 0.208 ± 0.006 | 0.271 ± 0.014 | 0.251 ± 0.012 |
| MNIST | 0.365 ± 0.012 | 0.233 ± 0.026 | 0.444 ± 0.021 | 0.314 ± 0.040 | 0.567 ± 0.033 | 0.344 ± 0.054 |
| Montreal | 0.678 ± 0.013 | 0.602 ± 0.028 | 0.709 ± 0.005 | 0.618 ± 0.001 | 0.725 ± 0.012 | 0.676 ± 0.010 |
| Newsgroups | 0.392 ± 0.005 | 0.391 ± 0.018 | 0.397 ± 0.010 | 0.334 ± 0.019 | 0.385 ± 0.010 | 0.378 ± 0.018 |
| Yale | 0.543 ± 0.008 | 0.338 ± 0.023 | 0.588 ± 0.017 | 0.511 ± 0.019 | 0.553 ± 0.036 | 0.521 ± 0.015 |

| Dataset | VAE | NBVAE | DCA | CVQ-VAE | RFLVM | advisedRFLVM |
|---|---|---|---|---|---|---|
| Bridges | 0.751 ± 0.016 | 0.758 ± 0.038 | 0.702 ± 0.036 | 0.688 ± 0.013 | 0.846 ± 0.039 | 0.846 ± 0.015 |
| CIFAR-10 | 0.266 ± 0.002 | 0.259 ± 0.005 | 0.255 ± 0.019 | 0.224 ± 0.012 | 0.284 ± 0.103 | 0.290 ± 0.006 |
| MNIST | 0.643 ± 0.021 | 0.281 ± 0.012 | 0.171 ± 0.075 | 0.128 ± 0.005 | 0.602 ± 0.055 | 0.795 ± 0.015 |
| Montreal | 0.668 ± 0.012 | 0.716 ± 0.009 | 0.685 ± 0.716 | 0.646 ± 0.003 | 0.769 ± 0.010 | 0.789 ± 0.013 |
| Newsgroups | 0.385 ± 0.002 | 0.398 ± 0.010 | 0.399 ± 0.034 | 0.356 ± 0.019 | 0.413 ± 0.009 | 0.418 ± 0.007 |
| Yale | 0.611 ± 0.020 | 0.456 ± 0.046 | 0.284 ± 0.054 | 0.338 ± 0.002 | 0.653 ± 0.067 | 0.765 ± 0.010 |

6.2 S-shaped Latent Manifold Learning

Next, we demonstrate the importance of kernel flexibility in preventing model collapse, using two synthetic datasets, each consisting of $N = 500$ observations with $M = 100$ dimensions. Both datasets are generated from a GPLVM with a two-dimensional (2-D) latent $S$-shaped manifold, but with distinct kernel configurations. One employs a basic RBF kernel, while the other uses a more complex combination of an RBF kernel and a periodic kernel (Rasmussen & Williams, 2006). We compare our advisedRFLVM with three GPLVM variants: BGPLVM (Titsias & Lawrence, 2010), GPLVM-SVI (Lalchand et al., 2022), and RFLVM (Zhang et al., 2023; Gundersen et al., 2021). For BGPLVM and GPLVM-SVI, the default setting (see App. E) is used, except that the number of inducing points is selected from the set $\{6, 10, 20, 30, 60, 120\}$ as the value that yields the best inference performance.

Figure 3 reports the results for the $S$-shaped manifold learning, where the coefficient of determination ($\mathrm{R}^2$ score) (Chicco et al., 2021) is used to quantify the “closeness” between the inferred manifold (after a post-hoc affine transformation) and the ground-truth manifold. The results indicate that advisedRFLVM and RFLVM consistently outperform BGPLVM and GPLVM-SVI on both synthetic datasets. GPLVM-SVI exhibits the worst performance and BGPLVM shows fluctuating performance, although in some realizations they can reasonably estimate the shape of $\mathbf{X}$ (see the left illustration in Fig. 3). The fluctuating performance of BGPLVM and GPLVM-SVI suggests that optimizing the additional inducing points (variational parameters) can complicate the learning process and cause such instability.
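The alignment procedure is not spelled out in this section. One plausible reading, stated here as an assumption rather than as the authors' exact protocol, is a least-squares affine fit of the inferred latents $\hat{\mathbf{x}}_i$ onto the ground truth $\mathbf{x}_i$, followed by a pooled $\mathrm{R}^2$ score:

\[
(\hat{\mathbf{W}}, \hat{\mathbf{b}}) = \arg\min_{\mathbf{W}, \mathbf{b}} \sum_{i=1}^{N} \big\| \mathbf{x}_i - \mathbf{W}\hat{\mathbf{x}}_i - \mathbf{b} \big\|_2^2,
\qquad
\mathrm{R}^2 = 1 - \frac{\sum_{i=1}^{N} \big\| \mathbf{x}_i - \hat{\mathbf{W}}\hat{\mathbf{x}}_i - \hat{\mathbf{b}} \big\|_2^2}{\sum_{i=1}^{N} \big\| \mathbf{x}_i - \bar{\mathbf{x}} \big\|_2^2},
\qquad
\bar{\mathbf{x}} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{x}_i .
\]

Here $\mathbf{W} \in \mathbb{R}^{2 \times 2}$ and $\mathbf{b} \in \mathbb{R}^{2}$ are illustrative symbols that do not appear in the paper. Because the fit includes an intercept, this pooled score lies in $[0, 1]$, with $1$ meaning the inferred manifold matches the ground truth up to an affine map. Other multi-output conventions, such as averaging per-dimension $\mathrm{R}^2$ scores, would give slightly different numbers.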

The performance gain of advisedRFLVM and RFLVM can be attributed to kernel flexibility, which is particularly evident when the dataset is generated from the underlying GPLVM with a hybrid of an RBF kernel and a periodic kernel. This validates the crucial role of kernel flexibility in preventing model collapse. Nevertheless, advisedRFLVM consistently outperforms the RFLVM, although the RFLVM is theoretically capable of approximating arbitrary stationary kernels as well. This discrepancy may stem from the biased assumption of DP priors for the spectral densities in RFLVM (Zhang et al., 2023). Such bias can lead to unfair exposure for the density weights, resulting in only a few effective densities and a degenerate approximation capacity (Gundersen et al., 2021). Moreover, the RFLVM relies on MCMC sampling, which may be less efficient for inference in this setting than the ELBO optimization used by advisedRFLVM.

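Table 2: Missing data imputation on the MNIST and Brendan datasets. Columns give the proportion of missing entries; (↑) means higher is better and (↓) means lower is better.

| Dataset | Metric | Model | 0% | 10% | 30% | 60% |
|---|---|---|---|---|---|---|
| MNIST | KNN acc (↑) | VAE | 0.715 | 0.689 | 0.660 | 0.585 |
| MNIST | KNN acc (↑) | BGPLVM | 0.603 | 0.598 | 0.541 | 0.476 |
| MNIST | KNN acc (↑) | RFLVM | 0.602 | 0.391 | 0.345 | 0.273 |
| MNIST | KNN acc (↑) | advisedRFLVM | 0.806 | 0.802 | 0.777 | 0.636 |
| MNIST | Test MSE (↓) | VAE | 0.035 | 0.038 | 0.045 | 0.068 |
| MNIST | Test MSE (↓) | BGPLVM | 0.048 | 0.040 | 0.057 | 0.098 |
| MNIST | Test MSE (↓) | RFLVM | 0.066 | 0.067 | 0.070 | 0.120 |
| MNIST | Test MSE (↓) | advisedRFLVM | 0.025 | 0.028 | 0.039 | 0.068 |
| Brendan | Test MSE (↓) | VAE | 0.005 | 0.009 | 0.043 | 0.150 |
| Brendan | Test MSE (↓) | BGPLVM | 0.006 | 0.041 | 0.087 | 0.197 |
| Brendan | Test MSE (↓) | RFLVM | 0.010 | 0.015 | 0.049 | 0.153 |
| Brendan | Test MSE (↓) | advisedRFLVM | 0.003 | 0.009 | 0.045 | 0.152 |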

6.3 Real Dataset Evaluation

This subsection further demonstrates the ability of advisedRFLVM to capture the latent space on multiple real-world datasets (see Table 1), where the dataset sizes of MNIST and CIFAR are reduced to accommodate the high computational complexity of RFLVM (see App. E.1 for further details). For each dataset, we hold out the labels and use them to evaluate the estimated latent space with a KNN classifier under five-fold cross-validation. In addition to the GPLVM variants used in § 6.2, we also include several recent VAEs (Kingma & Welling, 2019; Zhao et al., 2020; Eraslan et al., 2019; Zheng & Vedaldi, 2023) and classic dimensionality reduction methods. The KNN classification accuracy results for all the competing methods are presented in Table 1.
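The evaluation protocol can be illustrated with a small standard-library sketch. It is not the authors' evaluation code: the function name `one_nn_cv_accuracy`, the unstratified fold assignment, and the toy two-cluster data are all assumptions made for illustration.

```python
import random


def one_nn_cv_accuracy(X, labels, n_folds=5, seed=0):
    """Mean accuracy of a 1-nearest-neighbour classifier under k-fold cross-validation."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[f::n_folds] for f in range(n_folds)]  # simple unstratified split (assumed)
    accuracies = []
    for f in range(n_folds):
        test = folds[f]
        train = [i for g in range(n_folds) if g != f for i in folds[g]]
        correct = 0
        for i in test:
            # Nearest training point in squared Euclidean distance.
            nearest = min(train, key=lambda j: sum((a - b) ** 2 for a, b in zip(X[i], X[j])))
            correct += labels[nearest] == labels[i]
        accuracies.append(correct / len(test))
    return sum(accuracies) / n_folds


# Toy latent space: two well-separated clusters of 20 points each (illustrative only).
rng = random.Random(1)
X_hat = [[rng.gauss(c * 10.0, 0.5), rng.gauss(0.0, 0.5)] for c in (0, 1) for _ in range(20)]
labels = [c for c in (0, 1) for _ in range(20)]
acc = one_nn_cv_accuracy(X_hat, labels)
```

The labels are used only for scoring, never for learning the latent variables, so a high accuracy indicates that class structure is preserved in the unsupervised latent space.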

The results demonstrate that advisedRFLVM consistently achieves the highest KNN accuracy across all datasets. This suggests that the latent variables estimated by advisedRFLVM are more informative than those of the other methods. The inferior performance of the four classic methods, PCA (Wold et al., 1987; Pearson, 1901), hierarchical Poisson factorization (HPF) (Gopalan et al., 2015), latent Dirichlet allocation (LDA) (Blei et al., 2003), and Isomap (Balasubramanian & Schwartz, 2002), is primarily attributed to their limited model flexibility.

For the VAE models, despite their impressive approximation capabilities through neural network-based decoders and encoders (Kingma & Welling, 2019), they often fall short in latent space learning. This is because optimizing numerous neural network parameters can result in overfitting, steering these deterministic neural networks toward poor latent spaces. In contrast, GPLVM variants avoid the need for neural network parameter optimization. More importantly, the inherent regularization imposed by the GP prior mitigates overfitting and thus enhances the generalization capability for latent space learning (Wilson & Izmailov, 2020). GPLVM-based models are therefore expected to attain higher KNN accuracy. Nevertheless, the results in Table 1 show that BGPLVM and GPLVM-SVI only attain performance comparable to PCA. This is mainly attributed to their inherently inadequate kernel flexibility and the additional optimization burden of the variational parameters. RFLVM consistently exhibits slightly inferior performance compared to advisedRFLVM, primarily due to the unfair exposure of density weights and the inefficient and unscalable MCMC inference algorithm mentioned in § 6.2 and § 5. We also conducted additional simulations on larger datasets. The results, presented in Appendix E.4.4, emphasize the superiority of advisedRFLVM over state-of-the-art variants regardless of the dataset size.

6.4 Missing Data Imputation

This subsection further evaluates the performance of advisedRFLVM in the task of imputing missing data on two image datasets, namely MNIST and Brendan (Roweis & Saul, 2000). Specifically, we randomly hold out a certain proportion (0%, 10%, 30%, and 60%) of the elements in the observed data matrix $\mathbf{Y}$, and subsequently use advisedRFLVM to estimate the latent variables $\mathbf{X}$ from the incomplete dataset (denoted as $\mathbf{Y}_{obs}$). We then impute the missing values $\mathbf{Y}_{miss}$ by their posterior mean $\hat{\mathbf{Y}}_{miss} = \mathbb{E}[\mathbf{Y}_{miss} \mid \mathbf{X}, \mathbf{Y}_{obs}, -]$. The imputation performance is evaluated through the mean square error (MSE) between $\hat{\mathbf{Y}}_{miss}$ and the ground-truth $\mathbf{Y}_{miss}$. Additionally, KNN classification accuracy is reported for the MNIST dataset to illustrate the latent representation learning results. Table 2 presents the performance of advisedRFLVM against competing methods.
The results indicate that advisedRFLVM outperforms most competitors in reconstructing observations and recovering latent representations, regardless of the proportion of missing data. Although the VAE exhibits reconstruction capabilities comparable to advisedRFLVM, it still lags behind in recovering informative latent variables due to its potential overfitting and inherent posterior collapse issues (Wang & Liu, 2022). More details about the reconstruction performance of advisedRFLVM are provided in App. E.4.2, showing its superior ability to restore missing pixels.
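The hold-out and scoring steps described above can be sketched as follows. This is a minimal illustration with hypothetical helper names (`random_mask`, `masked_mse`) and toy data; the stand-in `Y_hat` replaces the model's posterior-mean imputation, which is not implemented here. How the 0% column of Table 2 is scored is not stated in this section, so the sketch covers only the case with at least one held-out entry.

```python
import random


def random_mask(n_rows, n_cols, missing_rate, seed=0):
    """Boolean mask where True marks an entry held out uniformly at random."""
    rng = random.Random(seed)
    cells = [(i, j) for i in range(n_rows) for j in range(n_cols)]
    held_out = set(rng.sample(cells, round(missing_rate * len(cells))))
    return [[(i, j) in held_out for j in range(n_cols)] for i in range(n_rows)]


def masked_mse(Y_true, Y_pred, mask):
    """Mean squared error restricted to the held-out entries."""
    errors = [
        (Y_true[i][j] - Y_pred[i][j]) ** 2
        for i, row in enumerate(mask)
        for j, missing in enumerate(row)
        if missing
    ]
    if not errors:
        raise ValueError("mask contains no held-out entries")
    return sum(errors) / len(errors)


# Toy data matrix with N = 3 rows and M = 4 columns (illustrative values).
Y = [[0.0, 1.0, 0.5, 0.2], [0.9, 0.1, 0.3, 0.7], [0.4, 0.6, 0.8, 1.0]]
mask = random_mask(3, 4, missing_rate=0.3)
Y_hat = [[v + 0.1 for v in row] for row in Y]  # stand-in for the imputed values
mse = masked_mse(Y, Y_hat, mask)
```

Only the entries flagged by the mask contribute to the error, which mirrors the comparison between $\hat{\mathbf{Y}}_{miss}$ and $\mathbf{Y}_{miss}$ in the text.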

7 Conclusions

We have introduced advisedRFLVM to address model collapse due to inadequate kernel flexibility and inappropriate projection variance selection in GPLVMs. By integrating the SM kernel and the differentiable RFF approximation, advisedRFLVM not only enhances model flexibility but also enables the use of modern automatic differentiation tools for optimizing essential parameters, including the projection variance, within the variational inference framework. Empirical results across diverse datasets corroborate the superiority of advisedRFLVM in learning compact and informative latent representations, highlighting the importance of learning the projection variance and of kernel flexibility in mitigating model collapse. Furthermore, our model outperforms various state-of-the-art latent variable models, including VAEs and other GPLVM variants. In future work, we will focus on further enhancing the variational inference algorithm presented in this paper. We hope to scale up our LVM to scenarios with massive datasets as an efficient alternative to resource-intensive deep learning models.

Acknowledgements

The authors would like to thank the anonymous referees for their valuable comments that improved the quality of the paper. The work of Feng Yin was supported by the NSFC under Grant No. 62271433, and in part by the Shenzhen Science and Technology Program under Grant No. JCYJ20220530143806016. The work of Michael Minyi Zhang was supported by the HKU-URC Seed Fund for Basic Research for New Staff.

Impact Statement

This work introduces a novel probabilistic latent variable model tailored to effectively capture the underlying structures of the observed data, which allows us to provide informative but concise foundational knowledge for analyzing highly complex tasks, such as the analysis of social issues, research on human behavior, and exploration of cognitive mechanisms. Technically, by theoretically analyzing the impact of the projection variance on model collapse, this work will strengthen the understanding of the “default” learning of the projection variance among researchers and engineers more broadly. We also carefully examine the impact of kernel flexibility. Together, these examinations of the potential reasons for model collapse enhance model interpretability, which is crucial for safety-critical systems such as autonomous driving and intelligent healthcare.

Limitations and future works.

Our model faces limitations in handling out-of-distribution data, which requires explicitly learning an encoding function from observable data points into a latent representation. One potential solution is to assume a $\mathbf{Y}$-dependent parametric variational distribution of the latent variables, $q(\mathbf{X} \mid \mathbf{Y})$, where the parameters of the distribution are modeled by an encoder network that takes the observation $\mathbf{Y}$ as input. Consequently, once training is complete, the encoder network can be employed to infer the latent variables of out-of-distribution data. Another limitation is that, despite the reduced complexity (linear in $N$), the practical training time of our method may still be prohibitive for massive datasets.

References

  • Abolhasanzadeh (2015) Abolhasanzadeh, B. Gaussian process latent variable model for dimensionality reduction in intrusion detection. In 2015 23rd Iranian Conference on Electrical Engineering, pp.  674–678. IEEE, 2015.
  • Aigner et al. (1984) Aigner, D. J., Hsiao, C., Kapteyn, A., and Wansbeek, T. Latent variable models in econometrics. Handbook of econometrics, 2:1321–1393, 1984.
  • Balasubramanian & Schwartz (2002) Balasubramanian, M. and Schwartz, E. L. The Isomap algorithm and topological stability. Science, 295(5552):7–7, 2002.
  • Bau et al. (2019) Bau, D., Zhu, J.-Y., Wulff, J., Peebles, W., Strobelt, H., Zhou, B., and Torralba, A. Seeing what a GAN cannot generate. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.  4502–4511, 2019.
  • Bishop (2006) Bishop, C. M. Pattern Recognition and Machine Learning. Springer, 2006.
  • Blei et al. (2003) Blei, D. M., Ng, A. Y., and Jordan, M. I. Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan):993–1022, 2003.
  • Bochner (1934) Bochner, S. A theorem on Fourier-Stieltjes integrals. Bulletin of the American Mathematical Society, 40(4):271–276, 1934.
  • Bowman et al. (2016) Bowman, S., Vilnis, L., Vinyals, O., Dai, A., Jozefowicz, R., and Bengio, S. Generating sentences from a continuous space. In Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning, pp.  10–21, 2016.
  • Buitinck et al. (2013) Buitinck, L., Louppe, G., Blondel, M., Pedregosa, F., Mueller, A., Grisel, O., Niculae, V., Prettenhofer, P., Gramfort, A., Grobler, J., Layton, R., VanderPlas, J., Joly, A., Holt, B., and Varoquaux, G. API design for machine learning software: experiences from the scikit-learn project. In ECML PKDD Workshop: Languages for Data Mining and Machine Learning, pp.  108–122, 2013.
  • Cao et al. (2023) Cao, J., Kang, M., Jimenez, F., Sang, H., Schaefer, F. T., and Katzfuss, M. Variational sparse inverse Cholesky approximation for latent Gaussian processes via double Kullback-Leibler minimization. In International Conference on Machine Learning, pp. 3559–3576. PMLR, 2023.
  • Chang et al. (2023) Chang, P. E., Verma, P., John, S., Solin, A., and Khan, M. E. Memory-based dual Gaussian processes for sequential learning. In International Conference on Machine Learning, pp. 4035–4054. PMLR, 2023.
  • Cheng et al. (2022) Cheng, L., Yin, F., Theodoridis, S., Chatzis, S., and Chang, T.-H. Rethinking Bayesian learning for data analysis: The art of prior and inference in sparsity-aware modeling. IEEE Signal Processing Magazine, 39(6):18–52, 2022.
  • Chicco et al. (2021) Chicco, D., Warrens, M. J., and Jurman, G. The coefficient of determination R-squared is more informative than SMAPE, MAE, MAPE, MSE and RMSE in regression analysis evaluation. PeerJ Computer Science, 7:e623, 2021.
  • de Souza et al. (2021) de Souza, D., Mesquita, D., Gomes, J. P., and Mattos, C. L. Learning GPLVM with arbitrary kernels using the unscented transformation. In International Conference on Artificial Intelligence and Statistics, pp.  451–459. PMLR, 2021.
  • Duvenaud (2014) Duvenaud, D. Automatic model construction with Gaussian processes. PhD thesis, University of Cambridge, 2014.
  • Ek et al. (2008) Ek, C. H., Torr, P. H. S., and Lawrence, N. D. Gaussian process latent variable models for human pose estimation. In Machine Learning for Multimodal Interaction, pp. 132–143. Springer, 2008.
  • Eleftheriadis et al. (2013) Eleftheriadis, S., Rudovic, O., and Pantic, M. Shared Gaussian process latent variable model for multi-view facial expression recognition. In International Symposium on Visual Computing, pp. 527–538. Springer, 2013.
  • Eraslan et al. (2019) Eraslan, G., Simon, L. M., Mircea, M., Mueller, N. S., and Theis, F. J. Single-cell RNA-seq denoising using a deep count autoencoder. Nature communications, 10(1):390, 2019.
  • Gopalan et al. (2015) Gopalan, P., Hofman, J. M., and Blei, D. M. Scalable recommendation with hierarchical Poisson factorization. In Conference on Uncertainty in Artificial Intelligence, pp. 326–335, 2015.
  • Graves (2016) Graves, A. Stochastic backpropagation through mixture density distributions. arXiv preprint arXiv:1607.05690, 2016.
  • Gulrajani et al. (2016) Gulrajani, I., Kumar, K., Ahmed, F., Taiga, A. A., Visin, F., Vazquez, D., and Courville, A. PixelVAE: A latent variable model for natural images. In International Conference on Learning Representations, 2016.
  • Gundersen et al. (2021) Gundersen, G., Zhang, M., and Engelhardt, B. Latent variable modeling with random features. In International Conference on Artificial Intelligence and Statistics, pp.  1333–1341. PMLR, 2021.
  • Hensman et al. (2013) Hensman, J., Fusi, N., and Lawrence, N. D. Gaussian processes for big data. In Conference on Uncertainty in Artificial Intelligence, pp. 282–290, Arlington, Virginia, USA, 2013.
  • Hotelling (1936) Hotelling, H. Relations between two sets of variates. Biometrika, 1936.
  • Jin et al. (2017) Jin, C., Ge, R., Netrapalli, P., Kakade, S. M., and Jordan, M. I. How to escape saddle points efficiently. In International Conference on Machine Learning, pp. 1724–1732. PMLR, 2017.
  • Jordan et al. (1999) Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., and Saul, L. K. An introduction to variational methods for graphical models. Machine Learning, 37:183–233, 1999.
  • Jung et al. (2022) Jung, Y., Song, K., and Park, J. Efficient approximate inference for stationary kernel on frequency domain. In International Conference on Machine Learning, pp. 10502–10538. PMLR, 2022.
  • Kim & Mueller (1978) Kim, J.-O. and Mueller, C. W. Factor analysis: Statistical Methods and Practical Issues, volume 14. Sage, 1978.
  • Kingma & Ba (2014) Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Kingma & Welling (2013) Kingma, D. P. and Welling, M. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
  • Kingma & Welling (2019) Kingma, D. P. and Welling, M. An introduction to variational autoencoders. Foundations and Trends® in Machine Learning, 12(4):307–392, 2019.
  • Lalchand et al. (2022) Lalchand, V., Ravuri, A., and Lawrence, N. D. Generalised GPLVM with stochastic variational inference. In International Conference on Artificial Intelligence and Statistics, pp.  7841–7864. PMLR, 2022.
  • Lawrence (2005) Lawrence, N. Probabilistic non-linear principal component analysis with Gaussian process latent variable models. Journal of Machine Learning Research, 6(60):1783–1816, 2005.
  • LeCun (1998) LeCun, Y. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.
  • Li et al. (2017) Li, J., Zhang, B., and Zhang, D. Shared autoencoder Gaussian process latent variable model for visual classification. IEEE Transactions on Neural Networks and Learning Systems, 29(9):4272–4286, 2017.
  • Lopez-Paz et al. (2014) Lopez-Paz, D., Sra, S., Smola, A., Ghahramani, Z., and Schölkopf, B. Randomized nonlinear component analysis. In International Conference on Machine Learning, pp. 1359–1367. PMLR, 2014.
  • Lotfi et al. (2022) Lotfi, S., Izmailov, P., Benton, G., Goldblum, M., and Wilson, A. G. Bayesian model selection, the marginal likelihood, and generalization. In International Conference on Machine Learning, pp. 14223–14247. PMLR, 2022.
  • Lucas et al. (2019) Lucas, J., Tucker, G., Grosse, R. B., and Norouzi, M. Don’t blame the ELBO! A linear VAE perspective on posterior collapse. Advances in Neural Information Processing Systems, 32, 2019.
  • Menon et al. (2022) Menon, S., Blei, D., and Vondrick, C. Forget-me-not! Contrastive critics for mitigating posterior collapse. In Conference on Uncertainty in Artificial Intelligence, pp. 1360–1370. PMLR, 2022.
  • Nakagawa et al. (2023) Nakagawa, N., Togo, R., Ogawa, T., and Haseyama, M. Gromov-Wasserstein autoencoders. In Proceedings of International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=sbS10BCtc7.
  • Oliva et al. (2016) Oliva, J. B., Dubey, A., Wilson, A. G., Póczos, B., Schneider, J., and Xing, E. P. Bayesian nonparametric kernel-learning. In International Conference on Artificial Intelligence and Statistics, pp.  1078–1086. PMLR, 2016.
  • Paszke et al. (2019) Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al. PyTorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems, 32, 2019.
  • Pearson (1901) Pearson, K. LIII. on lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 2(11):559–572, 1901.
  • Poux-Médard et al. (2023) Poux-Médard, G., Velcin, J., and Loudcher, S. Powered Dirichlet process-controlling the “rich-get-richer” assumption in Bayesian clustering. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp.  611–626. Springer, 2023.
  • Rahimi & Recht (2007) Rahimi, A. and Recht, B. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems, pp. 1177–1184, 2007.
  • Ramchandran et al. (2021) Ramchandran, S., Koskinen, M., and Lähdesmäki, H. Latent Gaussian process with composite likelihoods and numerical quadrature. In International Conference on Artificial Intelligence and Statistics, pp.  3718–3726. PMLR, 2021.
  • Rasmussen & Williams (2006) Rasmussen, C. E. and Williams, C. K. I. Gaussian Processes for Machine Learning. MIT Press, 2006.
  • Razavi et al. (2019) Razavi, A., Oord, A. v. d., Poole, B., and Vinyals, O. Preventing posterior collapse with delta-VAEs. arXiv preprint arXiv:1901.03416, 2019.
  • Roweis & Saul (2000) Roweis, S. T. and Saul, L. K. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323–2326, 2000.
  • Sønderby et al. (2016) Sønderby, C. K., Raiko, T., Maaløe, L., Sønderby, S. K., and Winther, O. Ladder variational autoencoders. Advances in Neural Information Processing Systems, 29, 2016.
  • Song et al. (2015) Song, G., Wang, S., Huang, Q., and Tian, Q. Similarity Gaussian process latent variable model for multi-modal data analysis. In Proceedings of the IEEE International Conference on Computer Vision, pp.  4050–4058, 2015.
  • Theodoridis (2020) Theodoridis, S. Machine Learning: A Bayesian and Optimization Perspective. Academic Press, 2nd edition, 2020.
  • Tipping & Bishop (1999) Tipping, M. E. and Bishop, C. M. Probabilistic principal component analysis. Journal of the Royal Statistical Society Series B: Statistical Methodology, 61(3):611–622, 1999.
  • Titsias (2009) Titsias, M. Variational learning of inducing variables in sparse Gaussian processes. In International Conference on Artificial Intelligence and Statistics, pp.  567–574. PMLR, 2009.
  • Titsias & Lawrence (2010) Titsias, M. and Lawrence, N. D. Bayesian Gaussian process latent variable model. In International Conference on Artificial Intelligence and Statistics, pp.  844–851. PMLR, 2010.
  • Tran et al. (2023) Tran, B.-H., Shahbaba, B., Mandt, S., and Filippone, M. Fully Bayesian autoencoders with latent sparse Gaussian processes. In International Conference on Machine Learning, pp. 34409–34430. PMLR, 23–29 Jul 2023.
  • Tropp (2015) Tropp, J. A. An introduction to matrix concentration inequalities. Foundations and Trends® in Machine Learning, 8(1-2):1–230, 2015.
  • Wang et al. (2021) Wang, Y., Blei, D., and Cunningham, J. P. Posterior collapse and latent variable non-identifiability. Advances in Neural Information Processing Systems, 34:5443–5455, 2021.
  • Wang & Liu (2022) Wang, Z. and Liu, Z. Posterior collapse of a linear latent variable model. In Advances in Neural Information Processing Systems, 2022.
  • Wilson & Adams (2013) Wilson, A. and Adams, R. Gaussian process kernels for pattern discovery and extrapolation. In International Conference on Machine Learning, pp. 1067–1075. PMLR, 2013.
  • Wilson & Izmailov (2020) Wilson, A. G. and Izmailov, P. Bayesian deep learning and a probabilistic perspective of generalization. In Proceedings of the 34th International Conference on Neural Information Processing Systems, pp.  4697–4708, 2020.
  • Wold et al. (1987) Wold, S., Esbensen, K., and Geladi, P. Principal component analysis. Chemometrics and intelligent laboratory systems, 2(1-3):37–52, 1987.
  • Yang et al. (2017) Yang, Z., Hu, Z., Salakhutdinov, R., and Berg-Kirkpatrick, T. Improved variational autoencoders for text modeling using dilated convolutions. In International conference on machine learning, pp. 3881–3890. PMLR, 2017.
  • Zarzoso et al. (2010) Zarzoso, V., Moreau, E., Gribonval, R., and Vincent, E. Latent Variable Analysis and Signal Separation. Springer, 2010.
  • Zhang et al. (2023) Zhang, M. M., Gundersen, G. W., and Engelhardt, B. E. Bayesian non-linear latent variable modeling via random fourier features. arXiv preprint arXiv:2306.08352, 2023.
  • Zhao et al. (2020) Zhao, H., Rai, P., Du, L., Buntine, W., Phung, D., and Zhou, M. Variational autoencoders for sparse and overdispersed discrete data. In International Conference on Artificial Intelligence and Statistics, pp.  1684–1694. PMLR, 2020.
  • Zheng & Vedaldi (2023) Zheng, C. and Vedaldi, A. Online clustered codebook. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.  22798–22807, 2023.
  • Zhu et al. (2023) Zhu, H., Balsells-Rodas, C., and Li, Y. Markovian Gaussian process variational autoencoders. In International Conference on Machine Learning, pp. 42938–42961. PMLR, 2023.

Appendix A Model Collapse Mechanism Revelation

In § A.1, we introduce dual probabilistic principal component analysis (DPPCA) (Lawrence, 2005) and establish its connection with the linear GPLVM. Building upon this connection, § A.2 derives Theorem 3.1, which characterizes the form of the stationary points. By further examining the optimization landscape around these stationary points, we prove Proposition 3.2 and Proposition 3.3 in § A.3 and § A.4, respectively.

A.1 Special Case of GPLVM: Dual Probabilistic Principal Component Analysis (DPPCA)

In DPPCA (Lawrence, 2005), each observed data point $\mathbf{y}_i \in \mathbb{R}^{M}$ is generated from a latent variable $\mathbf{x}_i \in \mathbb{R}^{Q}$ through a linear transformation $\mathbf{A} \in \mathbb{R}^{M \times Q}$, i.e.,

\begin{align}
\mathbf{y}_i &\sim \mathcal{N}\left(\mathbf{A}\mathbf{x}_i, \sigma^2 \mathbf{I}_M\right), \tag{17a}\\
p(\mathbf{A}) &\sim \prod^{M} \mathcal{N}\left(\mathbf{0}, \mathbf{I}_Q\right), \tag{17b}
\end{align}

where $\sigma^2$ denotes the projection variance, which quantifies the uncertainty of the projection. For $N$ observed data points in DPPCA, collected in $\mathbf{Y} \in \mathbb{R}^{N \times M}$, the marginal likelihood obtained by marginalizing out the transformation matrix $\mathbf{A}$ is

$$\mathbf{y}_{:,j} \mid \mathbf{X} \sim \mathcal{N}\left(\mathbf{0}, \mathbf{X}\mathbf{X}^{\top} + \sigma^2 \mathbf{I}_N\right), \quad j = 1, \ldots, M, \tag{18}$$

where $\mathbf{y}_{:,j}$ denotes the $j$-th column of the observed data $\mathbf{Y}$. Consequently, the maximum likelihood estimate (MLE) of the latent variables, denoted as $\hat{\mathbf{X}}_{\text{DPPCA}}$, can be obtained by maximizing the log-likelihood implied by Eq. (18), e.g., with gradient-based methods:

$$\hat{\mathbf{X}}_{\text{DPPCA}} = \operatorname*{arg\,max}_{\mathbf{X}} \ \log \prod_{j=1}^{M} \mathcal{N}\left(\mathbf{y}_{:,j} \mid \mathbf{0}, \mathbf{X}\mathbf{X}^{\top} + \sigma^2 \mathbf{I}_N\right). \tag{19}$$

Building upon Eq. (19) and the optimization problem given in Eq. (6), a connection between GPLVM and DPPCA can be established (Lawrence, 2005), encapsulated in the following corollary:

Corollary A.1.

Assuming the kernel function in GPLVM is defined as the inner product kernel with $k(\mathbf{x}, \mathbf{x}') = \mathbf{x}^{\top}\mathbf{x}'$, the stationary points for the linear GPLVM, as expressed in Eq. (6), are identical to the stationary points of DPPCA, $\hat{\mathbf{X}}_{\text{DPPCA}}$.

Proof.

If the kernel function is the inner product kernel, i.e., $k(\mathbf{x}, \mathbf{x}') = \mathbf{x}^{\top}\mathbf{x}'$, the marginal likelihood of the linear GPLVM can be written as

$$p(\mathbf{Y} \mid \mathbf{X}) = \prod_{j=1}^{M} \mathcal{N}\left(\mathbf{y}_{:,j} \mid \mathbf{0}, \mathbf{X}\mathbf{X}^{\top} + \sigma^2 \mathbf{I}_N\right). \tag{20}$$

Then, the stationary points of the linear GPLVM, $\hat{\mathbf{X}}$, are given by

$$\hat{\mathbf{X}} = \operatorname*{arg\,max}_{\mathbf{X}} \log p(\mathbf{Y} \mid \mathbf{X}) = \operatorname*{arg\,max}_{\mathbf{X}} \ M\left\{-\frac{N}{2}\log 2\pi - \frac{1}{2}\log\left|\mathbf{X}\mathbf{X}^{\top} + \sigma^2 \mathbf{I}_N\right|\right\} - \frac{1}{2}\operatorname{tr}\left(\left(\mathbf{X}\mathbf{X}^{\top} + \sigma^2 \mathbf{I}_N\right)^{-1}\mathbf{Y}\mathbf{Y}^{\top}\right). \tag{21}$$

The stationary points of DPPCA, $\hat{\mathbf{X}}_{\text{DPPCA}}$, given in Eq. (19), can be rewritten as

$$\hat{\mathbf{X}}_{\text{DPPCA}} = \operatorname*{arg\,max}_{\mathbf{X}} \ M\left\{-\frac{N}{2}\log 2\pi - \frac{1}{2}\log\left|\mathbf{X}\mathbf{X}^{\top} + \sigma^2 \mathbf{I}_N\right|\right\} - \frac{1}{2}\operatorname{tr}\left(\left(\mathbf{X}\mathbf{X}^{\top} + \sigma^2 \mathbf{I}_N\right)^{-1}\mathbf{Y}\mathbf{Y}^{\top}\right). \tag{22}$$

Since the objectives in Eq. (21) and Eq. (22) coincide, the stationary points of the linear GPLVM are identical to the stationary points of DPPCA. ∎
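As an illustrative numerical check (not part of the original derivation), the sketch below evaluates the objective shared by Eq. (21) and Eq. (22) and compares it with the sum of the column-wise Gaussian log densities from Eq. (20). It assumes NumPy and synthetic random data; the function names `log_marginal_likelihood` and `columnwise_log_likelihood` are hypothetical helpers introduced only for this example.

```python
import numpy as np

def log_marginal_likelihood(X, Y, sigma2):
    """Closed-form objective of Eqs. (21)-(22) for the linear kernel."""
    N, M = Y.shape
    K = X @ X.T + sigma2 * np.eye(N)
    _, logdet = np.linalg.slogdet(K)
    quad = np.trace(np.linalg.solve(K, Y @ Y.T))  # tr(K^{-1} Y Y^T)
    return M * (-0.5 * N * np.log(2 * np.pi) - 0.5 * logdet) - 0.5 * quad

def columnwise_log_likelihood(X, Y, sigma2):
    """Sum over columns j of log N(y_{:,j} | 0, X X^T + sigma2 I_N), Eq. (20)."""
    N, M = Y.shape
    K = X @ X.T + sigma2 * np.eye(N)
    _, logdet = np.linalg.slogdet(K)
    total = 0.0
    for j in range(M):
        y = Y[:, j]
        total += -0.5 * (N * np.log(2 * np.pi) + logdet + y @ np.linalg.solve(K, y))
    return total

# Synthetic data (an assumption of this sketch, not data from the paper).
rng = np.random.default_rng(0)
N, M, Q = 6, 4, 2
X = rng.normal(size=(N, Q))
Y = rng.normal(size=(N, M))

closed_form = log_marginal_likelihood(X, Y, sigma2=0.5)
columnwise = columnwise_log_likelihood(X, Y, sigma2=0.5)
print(closed_form, columnwise)  # the two values should agree up to round-off
```

The agreement of the two values reflects the identity $\sum_{j} \mathbf{y}_{:,j}^{\top}\mathbf{K}^{-1}\mathbf{y}_{:,j} = \operatorname{tr}(\mathbf{K}^{-1}\mathbf{Y}\mathbf{Y}^{\top})$ used implicitly in the proof.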

A.2 Proof of Theorem 3.1

This subsection derives the stationary points of the linear GPLVM in detail. Our derivation largely follows that of Lawrence (2005), with minor differences.

Proof.

Recall that the log marginal likelihood can be expressed as

$$L \triangleq M\left\{-\frac{N}{2}\log 2\pi - \frac{1}{2}\log\left|\mathbf{X}\mathbf{X}^{\top} + \sigma^2 \mathbf{I}_N\right|\right\} - \frac{1}{2}\operatorname{tr}\left(\left(\mathbf{X}\mathbf{X}^{\top} + \sigma^2 \mathbf{I}_N\right)^{-1}\mathbf{Y}\mathbf{Y}^{\top}\right). \tag{23}$$

Defining $\mathbf{K} \triangleq \mathbf{X}\mathbf{X}^{\top} + \sigma^2 \mathbf{I}_N$, Eq. (23) can be rewritten as

$$L = M\left\{-\frac{N}{2}\log 2\pi - \frac{1}{2}\log|\mathbf{K}|\right\} - \frac{1}{2}\operatorname{tr}\left(\mathbf{K}^{-1}\mathbf{Y}\mathbf{Y}^{\top}\right). \tag{24}$$

Taking the gradient of Eq. (24) with respect to $\mathbf{X}$, we have

$$\frac{\partial L}{\partial \mathbf{X}} = \mathbf{K}^{-1}\mathbf{Y}\mathbf{Y}^{\top}\mathbf{K}^{-1}\mathbf{X} - M\mathbf{K}^{-1}\mathbf{X}. \tag{25}$$

Setting this gradient to zero and left-multiplying both sides by $\frac{1}{M}\mathbf{K}$, the stationary points of Eq. (24) must satisfy

$$\frac{1}{M}\mathbf{Y}\mathbf{Y}^{\top}\mathbf{K}^{-1}\mathbf{X} = \mathbf{X}. \tag{26}$$

According to Lemma B.2, we have

$$\mathbf{K}^{-1}\mathbf{X} = \left[\mathbf{X}\mathbf{X}^{\top} + \sigma^2 \mathbf{I}_N\right]^{-1}\mathbf{X} = \mathbf{X}\left[\mathbf{X}^{\top}\mathbf{X} + \sigma^2 \mathbf{I}_Q\right]^{-1}. \tag{27}$$

Let $\mathbf{X} = \mathbf{U}\mathbf{L}\mathbf{V}^{\top}$ be the singular value decomposition (SVD) of $\mathbf{X}$, where $\mathbf{U} \in \mathbb{R}^{N \times Q}$, $\mathbf{L} = \operatorname{diag}(l_1, l_2, \ldots, l_Q) \in \mathbb{R}^{Q \times Q}$ is a diagonal matrix, and $\mathbf{V} \in \mathbb{R}^{Q \times Q}$. Substituting the SVD into Eq. (27) and the result into Eq. (26), we have

\begin{align}
& \frac{1}{M}\mathbf{Y}\mathbf{Y}^{\top}\mathbf{U}\mathbf{L}\left[\mathbf{L}^2 + \sigma^2 \mathbf{I}_Q\right]^{-1}\mathbf{V}^{\top} = \mathbf{U}\mathbf{L}\mathbf{V}^{\top}, \tag{28a}\\
\Rightarrow\qquad & \frac{1}{M}\mathbf{Y}\mathbf{Y}^{\top}\mathbf{U}\mathbf{L}\left[\mathbf{L}^2 + \sigma^2 \mathbf{I}_Q\right]^{-1} = \mathbf{U}\mathbf{L}, \tag{28b}\\
\Rightarrow\qquad & \frac{1}{M}\mathbf{Y}\mathbf{Y}^{\top}\mathbf{U}\mathbf{L} = \mathbf{U}\left(\sigma^2 \mathbf{I}_Q + \mathbf{L}^2\right)\mathbf{L}. \tag{28c}
\end{align}
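The two algebraic steps behind Eq. (28a) can be checked numerically. The following is a minimal sketch, assuming NumPy and a random synthetic $\mathbf{X}$ (neither comes from the paper): it compares both sides of the identity in Eq. (27) and then confirms that $\mathbf{K}^{-1}\mathbf{X}$ equals $\mathbf{U}\mathbf{L}[\mathbf{L}^2 + \sigma^2\mathbf{I}_Q]^{-1}\mathbf{V}^{\top}$ under the thin SVD.

```python
import numpy as np

# Synthetic latent matrix (an assumption of this sketch).
rng = np.random.default_rng(1)
N, Q, sigma2 = 7, 3, 0.3
X = rng.normal(size=(N, Q))

# Both sides of Eq. (27): K^{-1} X versus X (X^T X + sigma2 I_Q)^{-1}.
lhs = np.linalg.solve(X @ X.T + sigma2 * np.eye(N), X)
rhs = X @ np.linalg.inv(X.T @ X + sigma2 * np.eye(Q))

# Thin SVD X = U L V^T, then U L (L^2 + sigma2 I_Q)^{-1} V^T as used in Eq. (28a).
U, l, Vt = np.linalg.svd(X, full_matrices=False)
svd_form = U @ np.diag(l / (l**2 + sigma2)) @ Vt

err_identity = np.max(np.abs(lhs - rhs))
err_svd = np.max(np.abs(lhs - svd_form))
print(err_identity, err_svd)  # both should be at round-off level
```

The right-hand side of Eq. (27) only requires inverting a $Q \times Q$ matrix, which is why this form is convenient when $Q \ll N$.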

Then, we have:

  • If $l_i \neq 0$, it indicates that $\frac{1}{M}\mathbf{Y}\mathbf{Y}^{\top}\mathbf{u}_i = \mathbf{u}_i(\sigma^2 + l_i^2)$, implying that $\mathbf{u}_i$ is an eigenvector of $\frac{1}{M}\mathbf{Y}\mathbf{Y}^{\top}$ corresponding to the eigenvalue $\lambda_i = \sigma^2 + l_i^2$.

  • If $l_i = 0$, the vector $\mathbf{u}_i$ is arbitrary. We can set it to be an eigenvector of $\frac{1}{M}\mathbf{Y}\mathbf{Y}^{\top}$ for consistency.

Consequently, all potential stationary solutions for $\mathbf{X}$ can be written as

$$\hat{\mathbf{X}} = \mathbf{U}_Q\left(\bm{\Lambda}_Q - \sigma^2 \mathbf{I}_Q\right)^{1/2}\mathbf{R}, \tag{29}$$

where $\mathbf{U}_Q \in \mathbb{R}^{N \times Q}$ is a matrix whose columns are eigenvectors of $\frac{1}{M}\mathbf{Y}\mathbf{Y}^{\top}$, $\mathbf{R} \in \mathbb{R}^{Q \times Q}$ is an arbitrary orthogonal matrix, and $\bm{\Lambda}_Q \in \mathbb{R}^{Q \times Q}$ is a diagonal matrix with

$$[\bm{\Lambda}_Q]_{i,i} = \begin{cases} \lambda_i, & \text{the eigenvalue corresponding to } \mathbf{u}_i, \text{ or} \\ \sigma^2. & \end{cases} \tag{30}$$
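To make Eq. (29) and Eq. (30) concrete, the sketch below builds two candidate stationary points and evaluates the gradient in Eq. (25) at each. It is an illustration under stated assumptions rather than the authors' code: NumPy, synthetic data, a fixed $\sigma^2$ chosen below the selected eigenvalues, and hypothetical helper names `gradient_wrt_X` and `stationary_X`. The second candidate sets one diagonal entry of $\bm{\Lambda}_Q$ to $\sigma^2$, which zeroes out one latent direction.

```python
import numpy as np

def gradient_wrt_X(X, Y, sigma2):
    """Gradient of the log marginal likelihood with respect to X, Eq. (25)."""
    N, M = Y.shape
    K_inv = np.linalg.inv(X @ X.T + sigma2 * np.eye(N))
    return K_inv @ Y @ Y.T @ K_inv @ X - M * K_inv @ X

def stationary_X(eigvecs, diag_entries, sigma2, R):
    """Candidate stationary point U_Q (Lambda_Q - sigma2 I_Q)^{1/2} R, Eq. (29)."""
    scale = np.sqrt(np.clip(diag_entries - sigma2, 0.0, None))
    return eigvecs @ np.diag(scale) @ R

# Synthetic data (an assumption of this sketch).
rng = np.random.default_rng(2)
N, M, Q = 6, 8, 2
Y = rng.normal(size=(N, M))
eigvals, eigvecs = np.linalg.eigh(Y @ Y.T / M)  # eigenvalues in ascending order

# Any eigenpairs with eigenvalue above sigma2 qualify, not only the top-Q ones.
idx = [N - 1, N - 3]
sigma2 = 0.5 * eigvals[idx].min()
R, _ = np.linalg.qr(rng.normal(size=(Q, Q)))  # arbitrary orthogonal matrix

# Case 1: both diagonal entries of Lambda_Q are eigenvalues.
X_full = stationary_X(eigvecs[:, idx], eigvals[idx], sigma2, R)
# Case 2: the second diagonal entry equals sigma2, see Eq. (30).
collapsed_diag = np.array([eigvals[idx[0]], sigma2])
X_collapsed = stationary_X(eigvecs[:, idx], collapsed_diag, sigma2, R)

grad_norm_full = np.linalg.norm(gradient_wrt_X(X_full, Y, sigma2))
grad_norm_collapsed = np.linalg.norm(gradient_wrt_X(X_collapsed, Y, sigma2))
rank_collapsed = np.linalg.matrix_rank(X_collapsed)
print(grad_norm_full, grad_norm_collapsed, rank_collapsed)
```

Both gradient norms should vanish up to round-off, while the second candidate has rank one: it is a stationary point whose latent representation has lost a dimension.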

A.3 Proof of Proposition 3.2

A.3.1 Auxiliary Theorem

Before delving into the proof, we first characterize the stationary point of $\sigma^2$ in the linear GPLVM, as summarized in the following theorem.

Theorem A.2.

Given $\hat{\mathbf{X}}$, the stationary points of the projection variance, denoted as $\hat{\sigma}^2$, can be obtained by solving the following optimization problem:

$$\max_{\sigma^2} \log P(\mathbf{Y} \mid \hat{\mathbf{X}}). \tag{31}$$

It turns out that $\hat{\sigma}^2$ takes the following form:

$$\hat{\sigma}^2 = \frac{1}{N - Q'}\sum_{j=Q'+1}^{N}\lambda_j, \tag{32}$$

where $Q'$ is the number of eigenvalues retained in $\bm{\Lambda}_Q$.

Proof.

To obtain the stationary point for $\sigma^2$, we substitute the stationary point for $\mathbf{X}$, defined in Eq. (7), into the log marginal likelihood function in Eq. (24), which gives

$$L = -\frac{M}{2}\left\{N\log 2\pi + \sum_{j=1}^{Q'}\log(\lambda_j) + (N - Q')\log\sigma^2 + \frac{1}{\sigma^2}\sum_{j=Q'+1}^{N}\lambda_j + Q'\right\}, \tag{33}$$

where $Q'$ represents the number of diagonal entries $[\bm{\Lambda}_Q]_{i,i}$, $i \in \{1, \ldots, Q\}$, that are not equal to $\sigma^2$; see Eq. (30). Consequently, $\lambda_1, \ldots, \lambda_{Q'}$ denote the eigenvalues associated with the eigenvectors "retained" in $\mathbf{X}$, while $\lambda_{Q'+1}, \ldots, \lambda_N$ refer to the eigenvalues that are "discarded".

By taking the gradient of Eq. (33) with respect to $\sigma^2$ and setting it to zero, we obtain:

$$\hat{\sigma}^2 = \frac{1}{N - Q'}\sum_{j=Q'+1}^{N}\lambda_j.$$

Remark A.3.

Note that the eigenvalues $\{\lambda_{Q'+1}, \ldots, \lambda_N\}$ can be interpreted as the discarded/lost information in the inverse projection process ($\mathbf{Y} \rightarrow \mathbf{X}$), and the corresponding eigenvectors are treated as discarded vectors.
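The closed form in Eq. (32) can be sanity-checked by evaluating Eq. (33) on a grid of $\sigma^2$ values. The sketch below is illustrative only: it assumes NumPy and synthetic data, takes the retained eigenvalues to be the $Q'$ largest for concreteness, and uses a hypothetical helper `profile_log_likelihood` that simply evaluates the expression in Eq. (33).

```python
import numpy as np

def profile_log_likelihood(sigma2, eigvals, q_retained, M):
    """Evaluate Eq. (33); eigvals lists the retained eigenvalues first."""
    N = eigvals.size
    retained, discarded = eigvals[:q_retained], eigvals[q_retained:]
    inner = (N * np.log(2 * np.pi)
             + np.sum(np.log(retained))
             + (N - q_retained) * np.log(sigma2)
             + np.sum(discarded) / sigma2
             + q_retained)
    return -0.5 * M * inner

# Synthetic data (an assumption of this sketch).
rng = np.random.default_rng(3)
N, M, q_retained = 6, 8, 2
Y = rng.normal(size=(N, M))
eigvals = np.linalg.eigvalsh(Y @ Y.T / M)[::-1]  # descending order

# Eq. (32): the average of the discarded eigenvalues.
sigma2_hat = eigvals[q_retained:].mean()

# Grid search around sigma2_hat; the grid spacing is 0.005 * sigma2_hat.
grid = sigma2_hat * np.linspace(0.5, 1.5, 201)
values = np.array([profile_log_likelihood(s, eigvals, q_retained, M) for s in grid])
best_on_grid = grid[np.argmax(values)]
value_at_hat = profile_log_likelihood(sigma2_hat, eigvals, q_retained, M)
print(sigma2_hat, best_on_grid)
```

The grid maximizer should coincide with $\hat{\sigma}^2$ up to the grid resolution, consistent with the interpretation of $\hat{\sigma}^2$ as the average discarded variance.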

In addition, with Theorem A.2, we can immediately get the following corollary.

Corollary A.4.

If $\bm{\Lambda}_Q$ contains the first $Q$ principal eigenvalues of $\frac{1}{M}\mathbf{Y}\mathbf{Y}^{\top}$, then the corresponding stationary point is the global maximum, which can be written as

\begin{align}
(\sigma^2)^{\star} &= \frac{1}{N - Q}\sum_{j=Q+1}^{N}\lambda_j^{o}, \tag{34a}\\
\mathbf{X}^{\star} &= \mathbf{U}_Q^{\star}\left(\bm{\Lambda}_Q^{\star} - (\sigma^2)^{\star}\mathbf{I}_Q\right)^{1/2}\mathbf{R}, \tag{34b}
\end{align}

where $\left[\lambda_1^{o}, \ldots, \lambda_N^{o}\right]$ denotes the eigenvalues of $\frac{1}{M}\mathbf{Y}\mathbf{Y}^{\top}$, ordered such that $\lambda_1^{o} \geq \lambda_2^{o} \geq \cdots \geq \lambda_N^{o}$. Additionally, the columns of $\mathbf{U}_Q^{\star} \in \mathbb{R}^{N \times Q}$ are the first $Q$ principal eigenvectors of $\frac{1}{M}\mathbf{Y}\mathbf{Y}^{\top}$, with the associated eigenvalues $\bm{\Lambda}_Q^{\star} = \operatorname{diag}(\lambda_1^{o}, \lambda_2^{o}, \ldots, \lambda_Q^{o})$. The optimal projection variance, $(\sigma^2)^{\star}$, represents the average variance lost in the projection process.

Proof.

With the stationary point of σ2superscript𝜎2\sigma^{2}italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT given in Eq.(32), the log marginal likelihood, given in Eq. (24), becomes

L=M2{j=1Qlog(λj)+(NQ)log(1NQj=Q+1Nλj)+Nlog(2π)+N}.𝐿𝑀2superscriptsubscript𝑗1superscript𝑄subscript𝜆𝑗𝑁superscript𝑄1𝑁superscript𝑄superscriptsubscript𝑗superscript𝑄1𝑁subscript𝜆𝑗𝑁2𝜋𝑁\displaystyle L=-\frac{M}{2}\left\{\sum_{j=1}^{Q^{\prime}}\log(\lambda_{j})+(N% -Q^{\prime})\log\left(\frac{1}{N-Q^{\prime}}\sum_{j=Q^{\prime}+1}^{N}\lambda_{% j}\right)+N\log(2\pi)+N\right\}.italic_L = - divide start_ARG italic_M end_ARG start_ARG 2 end_ARG { ∑ start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_Q start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT roman_log ( start_ARG italic_λ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_ARG ) + ( italic_N - italic_Q start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) roman_log ( divide start_ARG 1 end_ARG start_ARG italic_N - italic_Q start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT end_ARG ∑ start_POSTSUBSCRIPT italic_j = italic_Q start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT + 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT italic_λ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) + italic_N roman_log ( start_ARG 2 italic_π end_ARG ) + italic_N } . (35)

Because the sum of the log-eigenvalues $\sum_{j=1}^{N}\log(\lambda_j)$ is constant (given the data $\mathbf{Y}$), maximizing Eq. (35) for a fixed $Q'$ is equivalent to minimizing the following quantity

$$E = \log\left(\frac{1}{N - Q'}\sum_{i=Q'+1}^{N}\lambda_i\right) - \frac{1}{N - Q'}\sum_{i=Q'+1}^{N}\log(\lambda_i), \tag{36}$$

which relies solely on the discarded eigenvalues and is non-negative by Jensen's inequality. Remarkably, minimizing $E$ requires only that the discarded $\lambda_j$ be contiguous within the spectrum of the ordered eigenvalues of the matrix $\frac{1}{M}\mathbf{Y}\mathbf{Y}^\top$. In addition, however, Eq. (29) imposes the condition that $\lambda_j > \sigma^2$ for all $j \in \{1, 2, \ldots, Q'\}$. Consequently, based on Eq. (32), the smallest eigenvalue must be among the discarded ones. This suffices to establish that $E$ is minimized when $\lambda_{Q'+1}, \ldots, \lambda_N$ are the smallest $N - Q'$ eigenvalues. As a result, the likelihood $L$ is maximized when $\lambda_1, \ldots, \lambda_{Q'}$ are the largest eigenvalues of the matrix $\frac{1}{M}\mathbf{Y}\mathbf{Y}^\top$.
It is worth noting that the maximum of $L$ with respect to $Q'$ is attained when the sums in Eq. (36) contain the fewest terms. This occurs when $Q' = Q$, which ensures that none of the $l_i$, $i \in \{1, \ldots, Q\}$, is zero. ∎
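The argument above can be checked numerically on a toy spectrum. The following is a minimal sketch, not the authors' code: the eigenvalues, `M`, and `q_prime` are hypothetical values, and the helper names are introduced here for illustration only. It enumerates every way of retaining `q_prime` eigenvalues, keeps the splits that satisfy the feasibility condition (every retained eigenvalue exceeds the average of the discarded ones), and evaluates Eq. (35) and Eq. (36) on each.

```python
import itertools
import math


def sigma2_hat(discarded):
    """Average of the discarded eigenvalues (the stationary projection variance)."""
    return sum(discarded) / len(discarded)


def log_marginal(retained, discarded, M):
    """Eq. (35) evaluated for one split of the spectrum."""
    N = len(retained) + len(discarded)
    return -0.5 * M * (
        sum(math.log(lam) for lam in retained)
        + len(discarded) * math.log(sigma2_hat(discarded))
        + N * math.log(2.0 * math.pi)
        + N
    )


def jensen_gap(discarded):
    """Eq. (36): log of the mean minus the mean of the logs."""
    mean_log = sum(math.log(lam) for lam in discarded) / len(discarded)
    return math.log(sigma2_hat(discarded)) - mean_log


def feasible_splits(eigvals, q_prime):
    """Yield (retained, discarded) splits with every retained eigenvalue > sigma^2."""
    indices = range(len(eigvals))
    for keep in itertools.combinations(indices, q_prime):
        retained = [eigvals[k] for k in keep]
        discarded = [eigvals[k] for k in indices if k not in keep]
        if min(retained) > sigma2_hat(discarded):
            yield retained, discarded


eigvals = [5.0, 3.0, 2.0, 1.0, 0.5]  # hypothetical spectrum of (1/M) Y Y^T
M, q_prime = 10, 2                    # hypothetical sizes
splits = list(feasible_splits(eigvals, q_prime))
best_retained, best_discarded = max(
    splits, key=lambda s: log_marginal(s[0], s[1], M)
)
print(best_retained, jensen_gap(best_discarded))
```

On this toy spectrum the feasible split with the highest likelihood retains the two largest eigenvalues. The feasibility filter matters: without it, a split that discards the two largest eigenvalues can yield a smaller value of $E$.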

A.3.2 Proof of Proposition 3.2

1) Outline of Proof.

Without loss of generality, we assume the rotation matrix $\mathbf{R} = \mathbf{I}_Q$ in Eq. (7), Theorem 3.1, resulting in the stationary points of the latent variable

$$\hat{\mathbf{X}} = \mathbf{U}_Q\left(\bm{\Lambda}_Q - \hat{\sigma}^2\mathbf{I}_Q\right)^{1/2}. \tag{37}$$

Based upon this form, we seek to explore the structure of the optimization landscape around $\hat{\mathbf{X}}$ by examining the variation trend of the log marginal likelihood $L$ at $\hat{\mathbf{X}}$ in the space spanned by the discarded vectors, denoted as $\operatorname{Span}(\mathbf{U}_D)$, where

$$\mathbf{U}_D \triangleq \left[\mathbf{u}_{Q'+1}, \ldots, \mathbf{u}_N\right].$$

Intuitively, if $L$ decreases from a stationary point along every axis in $\operatorname{Span}(\mathbf{U}_D)$, then the corresponding stationary point can be regarded as a local or global optimum; conversely, if $L$ increases along every axis, it can be regarded as a local minimum. If $L$ increases along some axes and decreases along others within $\operatorname{Span}(\mathbf{U}_D)$, the corresponding stationary point can be recognized as a saddle point.

2) Quantitative Analysis.

To quantitatively analyze the variation in $L$ at $\hat{\mathbf{X}}$ within $\operatorname{Span}(\mathbf{U}_D)$, we introduce a small perturbation to the $i$-th column of $\hat{\mathbf{X}}$ in the form of $\epsilon\mathbf{u}_j$, resulting in the perturbed stationary point $\hat{\mathbf{X}}^{\epsilon}$ with

$$[\hat{\mathbf{X}}^{\epsilon}]_{:,i} = \hat{\mathbf{x}}_i + \epsilon\mathbf{u}_j, \quad i = 1, 2, \ldots, Q, \tag{38}$$

where $\hat{\mathbf{x}}_i$ denotes, with a slight abuse of notation, the $i$-th column of $\hat{\mathbf{X}}$, $\epsilon$ is an arbitrarily small positive constant, and $\mathbf{u}_j$, $j \in \{Q'+1, \ldots, N\}$, represents a principal axis in $\operatorname{Span}(\mathbf{U}_D)$. The variation trend from $L(\hat{\mathbf{X}})$ to $L(\hat{\mathbf{X}}^{\epsilon})$ can be determined by examining the sign of the dot product of the perturbation $\mathbf{u}_j$ with the gradient at $\hat{\mathbf{x}}_i + \epsilon\mathbf{u}_j$. More precisely, when the sign is positive, $L$ increases as $\hat{\mathbf{x}}_i$ shifts in the direction of $\mathbf{u}_j$; when the sign is negative, $L$ decreases.
For clarity, let us denote the sign of the dot product as $\operatorname{sgn}(D_{ij})$, where $D_{ij}$ denotes the dot product and is expressed as

$$D_{ij} = \mathbf{u}_j^\top\left\{\mathbf{K}^{-1}\mathbf{Y}\mathbf{Y}^\top\mathbf{K}^{-1}\left(\hat{\mathbf{x}}_i + \epsilon\mathbf{u}_j\right) - M\mathbf{K}^{-1}\left(\hat{\mathbf{x}}_i + \epsilon\mathbf{u}_j\right)\right\}, \tag{39}$$

with $\mathbf{K} = \hat{\mathbf{X}}^{\epsilon}(\hat{\mathbf{X}}^{\epsilon})^\top + \hat{\sigma}^2\mathbf{I}_N$.

According to Lemma B.2 and Eq. (37), we have

$$\begin{aligned} \mathbf{K}^{-1}\hat{\mathbf{X}}^{\epsilon} &= \hat{\mathbf{X}}^{\epsilon}\left[(\hat{\mathbf{X}}^{\epsilon})^\top\hat{\mathbf{X}}^{\epsilon} + \hat{\sigma}^2\mathbf{I}_Q\right]^{-1} \\ &= \hat{\mathbf{X}}^{\epsilon}\left[\bm{\Lambda}_Q^{\epsilon}\right]^{-1}, \end{aligned} \tag{40}$$

where $\bm{\Lambda}_Q^{\epsilon}$ is a diagonal matrix with

$$[\bm{\Lambda}_Q^{\epsilon}]_{k,k} = \begin{cases} [\bm{\Lambda}_Q]_{k,k}, & \forall k \neq i, \\ [\bm{\Lambda}_Q]_{i,i} + \epsilon^2, & \text{otherwise}. \end{cases} \tag{41}$$

Checking the $i$-th column of the matrices on both sides of Eq. (40), we find that

$$\mathbf{K}^{-1}\left(\hat{\mathbf{x}}_i + \epsilon\mathbf{u}_j\right) = \frac{\hat{\mathbf{x}}_i + \epsilon\mathbf{u}_j}{[\bm{\Lambda}_Q]_{i,i} + \epsilon^2}. \tag{42}$$
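Eq. (42) says that the perturbed column is an eigenvector of $\mathbf{K}$ with eigenvalue $[\bm{\Lambda}_Q]_{i,i} + \epsilon^2$, which is easy to confirm numerically without inverting $\mathbf{K}$. The sketch below assumes a hypothetical four-dimensional example in which the eigenvectors are the standard basis vectors; the spectrum, `eps`, and the indices `i`, `j` are illustrative choices, not values from the paper.

```python
import math


def matvec(A, v):
    """Dense matrix-vector product for lists of lists."""
    return [sum(a * b for a, b in zip(row, v)) for row in A]


# Hypothetical setup: eigenvectors are the standard basis, so u_k = e_k.
eigvals = [5.0, 3.0, 1.0, 0.5]        # assumed spectrum of (1/M) Y Y^T
N, Q = 4, 2
sigma2 = sum(eigvals[Q:]) / (N - Q)   # average of the discarded eigenvalues
eps, i, j = 1e-2, 0, 2                # perturb column i along discarded axis j

# Columns of X-hat from Eq. (37) with U_Q = [e_1, ..., e_Q].
columns = []
for k in range(Q):
    col = [0.0] * N
    col[k] = math.sqrt(eigvals[k] - sigma2)
    columns.append(col)
columns[i][j] += eps                  # Eq. (38): x_i + eps * u_j

# K = X^eps (X^eps)^T + sigma^2 I_N.
K = [
    [sum(c[r] * c[s] for c in columns) + (sigma2 if r == s else 0.0)
     for s in range(N)]
    for r in range(N)
]

v = columns[i]
scale = eigvals[i] + eps ** 2         # [Lambda_Q]_{i,i} + eps^2
Kv = matvec(K, v)

# Eq. (42) is equivalent to K v = scale * v, so the residual should vanish.
residual = max(abs(kv - scale * x) for kv, x in zip(Kv, v))
print(residual)
```

Checking $\mathbf{K}\mathbf{v} = ([\bm{\Lambda}_Q]_{i,i} + \epsilon^2)\mathbf{v}$ rather than applying $\mathbf{K}^{-1}$ keeps the example free of a linear solver; the two statements are equivalent because $\mathbf{K}$ is invertible.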

Substituting $\mathbf{K}^{-1}\left(\hat{\mathbf{x}}_i + \epsilon\mathbf{u}_j\right) = \frac{\hat{\mathbf{x}}_i + \epsilon\mathbf{u}_j}{[\bm{\Lambda}_Q]_{i,i} + \epsilon^2}$ into the first term of Eq. (39) yields

\begin{align}
D_{ij} &= M\mathbf{u}_j^\top\mathbf{K}^{-1}\left\{\frac{1}{M}\mathbf{Y}\mathbf{Y}^\top\frac{\hat{\mathbf{x}}_i + \epsilon\mathbf{u}_j}{[\bm{\Lambda}_Q]_{i,i} + \epsilon^2} - (\hat{\mathbf{x}}_i + \epsilon\mathbf{u}_j)\right\}, \nonumber \\
&= M\mathbf{u}_j^\top\mathbf{K}^{-1}\left\{\frac{1}{M}\mathbf{Y}\mathbf{Y}^\top\frac{1}{[\bm{\Lambda}_Q]_{i,i} + \epsilon^2} - \mathbf{I}_N\right\}\hat{\mathbf{x}}_i + M\mathbf{u}_j^\top\mathbf{K}^{-1}\left\{\frac{1}{M}\mathbf{Y}\mathbf{Y}^\top\frac{1}{[\bm{\Lambda}_Q]_{i,i} + \epsilon^2} - \mathbf{I}_N\right\}\epsilon\mathbf{u}_j, \tag{43} \\
&\approx M\mathbf{u}_j^\top\mathbf{K}^{-1}\left\{\frac{1}{M}\mathbf{Y}\mathbf{Y}^\top\frac{1}{[\bm{\Lambda}_Q]_{i,i}} - \mathbf{I}_N\right\}\hat{\mathbf{x}}_i + \epsilon M\mathbf{u}_j^\top\mathbf{K}^{-1}\left\{\frac{1}{M}\mathbf{Y}\mathbf{Y}^\top\frac{1}{[\bm{\Lambda}_Q]_{i,i}} - \mathbf{I}_N\right\}\mathbf{u}_j. \tag{44}
\end{align}

According to Eq. (26), we have

$$\frac{1}{M}\mathbf{Y}\mathbf{Y}^\top\frac{\hat{\mathbf{x}}_i}{[\bm{\Lambda}_Q]_{i,i}} = \hat{\mathbf{x}}_i. \tag{45}$$

Therefore, Eq. (44) can be rewritten as

$$D_{ij} = \epsilon M\left(\frac{\lambda_j}{[\bm{\Lambda}_Q]_{i,i}} - 1\right)\mathbf{u}_j^\top\mathbf{K}^{-1}\mathbf{u}_j, \tag{46}$$

where $\lambda_j$ represents the eigenvalue corresponding to $\mathbf{u}_j$.
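The step from Eq. (44) to Eq. (46) can be spelled out as follows. This is a sketch that uses only Eq. (45) and the assumption, consistent with the definitions above, that $\mathbf{u}_j$ is an eigenvector of $\frac{1}{M}\mathbf{Y}\mathbf{Y}^\top$ with eigenvalue $\lambda_j$.

```latex
\begin{align*}
\left\{\frac{1}{M}\mathbf{Y}\mathbf{Y}^\top\frac{1}{[\bm{\Lambda}_Q]_{i,i}} - \mathbf{I}_N\right\}\hat{\mathbf{x}}_i
  &= \mathbf{0}
  && \text{(by Eq. (45))}, \\
\left\{\frac{1}{M}\mathbf{Y}\mathbf{Y}^\top\frac{1}{[\bm{\Lambda}_Q]_{i,i}} - \mathbf{I}_N\right\}\mathbf{u}_j
  &= \left(\frac{\lambda_j}{[\bm{\Lambda}_Q]_{i,i}} - 1\right)\mathbf{u}_j
  && \text{(since } \tfrac{1}{M}\mathbf{Y}\mathbf{Y}^\top\mathbf{u}_j = \lambda_j\mathbf{u}_j\text{)}.
\end{align*}
```

The first identity removes the first term of Eq. (44), and the second turns the remaining term into the scalar factor multiplying $\mathbf{u}_j^\top\mathbf{K}^{-1}\mathbf{u}_j$ in Eq. (46).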

Because $\mathbf{K}^{-1}$ is positive definite, $\operatorname{sgn}(D_{ij})$ in Eq. (46) depends solely on

$$\operatorname{sgn}\left(\frac{\lambda_j}{[\bm{\Lambda}_Q]_{i,i}} - 1\right), \tag{47}$$

implying that the type of stationary points is dictated by the discarded and retained eigenvalues. Specifically,

  • (i)

    For $\hat{\mathbf{x}}_i$, $\forall i = 1, \ldots, Q$, if $[\bm{\Lambda}_Q]_{i,i} > \lambda_j$, $\forall j \in \{Q'+1, \ldots, N\}$, then the corresponding stationary point should be recognized as a local or global optimum point;

  • (ii)

    For $\hat{\mathbf{x}}_i$, $\forall i = 1, \ldots, Q$, if $[\bm{\Lambda}_Q]_{i,i} < \lambda_j$, $\forall j \in \{Q'+1, \ldots, N\}$, then the corresponding stationary point should be recognized as a local minimum point;

  • (iii)

    For $\hat{\mathbf{x}}_i$, $i = 1, \ldots, Q$, if $[\bm{\Lambda}_Q]_{i,i} > \lambda_j$ and $[\bm{\Lambda}_Q]_{i,i} < \lambda_k$ for some $j, k \in \{Q'+1, \ldots, N\}$, then the corresponding stationary point should be identified as a saddle point.
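The three cases can be condensed into a small classifier that inspects the sign in Eq. (47) for every retained/discarded pair. This is an illustrative sketch only: the function name and the string labels are hypothetical, mixed signs across any pairs are treated as a saddle point, and ties between eigenvalues are reported as undetermined because the analysis does not cover them.

```python
def classify_stationary_point(retained, discarded):
    """Classify a stationary point from the signs in Eq. (47).

    `retained` holds the diagonal entries [Lambda_Q]_{i,i}; `discarded` holds
    the eigenvalues lambda_j of the discarded axes. A positive sign means L
    increases when column i moves along u_j; a negative sign means it decreases.
    """
    signs = set()
    for lam_i in retained:
        for lam_j in discarded:
            value = lam_j / lam_i - 1.0
            signs.add((value > 0) - (value < 0))
    if 0 in signs:
        return "undetermined"  # equal eigenvalues are outside the analysis
    if signs == {-1}:
        return "optimum"       # L decreases along every discarded axis
    if signs == {1}:
        return "minimum"       # L increases along every discarded axis
    return "saddle"            # mixed behaviour


# Hypothetical spectra for illustration.
print(classify_stationary_point([5.0, 3.0], [1.0, 0.5]))       # case (i)
print(classify_stationary_point([0.5, 1.0], [5.0, 3.0]))       # case (ii)
print(classify_stationary_point([5.0, 1.0], [3.0, 0.5]))       # case (iii)
print(classify_stationary_point([5.0, 1.5], [3.0, 1.0, 0.5]))  # entry equal to the discarded mean
```

In the last call the second diagonal entry equals the average of the discarded eigenvalues, so it lies above some of them and below others, and the point is classified as a saddle.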

3) Final Results.

If $[\bm{\Lambda}_Q]_{i,i} = \lambda_i$, $\forall i \in \{1, \ldots, Q\}$, the stationary point represents a global optimum when

$$\lambda_i > \lambda_j, \quad \forall i \in \{1, \ldots, Q\}, \text{ and } \forall j \in \{Q'+1, \ldots, N\}.$$

However, if there exists a pair with $\lambda_i < \lambda_j$, these stationary points correspond to saddle points. Additionally, when $[\bm{\Lambda}_Q]_{i,i} = \hat{\sigma}^2$ for some $i \in \{1, \ldots, Q\}$, the associated stationary points are deemed saddle points due to the existence of cases where

$$\hat{\sigma}^2 < \lambda_j, \quad j \in \{Q'+1, \ldots, N\},$$

considering that $\hat{\sigma}^2$ is the average of the discarded eigenvalues. Because saddle points can be escaped efficiently, they are generally regarded as unstable stationary points (Jin et al., 2017). Therefore, during the optimization process, when we set $\sigma^2 = \hat{\sigma}^2$, the only stable maximum point is the global optimum point.

Remark A.5.

The analysis does not account for the equality of eigenvalues. This is because: (1) equality among the first $Q$ principal eigenvalues does not influence the presented analysis; (2) the equality of all discarded eigenvalues is trivial.

A.4 Proof of Proposition 3.3

Proof.

Suppose the projection variance $\sigma^2$ takes a value within the range $(\lambda_Q^o, \lambda_{Q-1}^o)$, where $\lambda_Q^o$ and $\lambda_{Q-1}^o$ represent the $Q$-th and $(Q-1)$-th principal eigenvalues of $\frac{1}{M}\mathbf{Y}\mathbf{Y}^\top$, respectively. In this scenario, the eigenvectors with associated eigenvalues less than $\lambda_Q^o$ are unambiguously discarded. Furthermore, in such a case, the only stable local optimum point (other stationary points, manifested as saddle points, are unstable, as discussed in App. A.3) comprises the following $\bm{\Lambda}_Q$,

$$[\bm{\Lambda}_Q]_{i,i} = \begin{cases} \lambda_i^o, & \text{for } i \in \{1, \ldots, Q-1\}, \text{ or} \\ \sigma^2. \end{cases}$$

It is evident that, for either $[\bm{\Lambda}_Q]_{i,i} = \sigma^2$ or $[\bm{\Lambda}_Q]_{i,i} = \lambda_i^o$, we have $[\bm{\Lambda}_Q]_{i,i} > \lambda_j^o$ for all $i \in \{1, \ldots, Q\}$ and for all discarded indices $j \in \{Q, \ldots, N\}$, so the corresponding stationary point is a local optimum, with one zero-column in $\hat{\mathbf{X}}$.

If the projection variance falls within the range $(\lambda_{Q-1}^{o},\lambda_{Q-2}^{o})$, the only stable local optimum point comprises the following $\bm{\Lambda}_{Q}$,

\[
[\bm{\Lambda}_{Q}]_{i,i}=\left\{\begin{aligned} &\lambda_{i}^{o},\text{ for }i\in 1,\ldots,Q-2,\text{ or}\\ &\sigma^{2},\end{aligned}\right.
\]

with two zero-columns in $\mathbf{X}$. By deduction, when $\sigma^{2}>\lambda_{1}^{o}$, the only stable local optimum point comprises the $\bm{\Lambda}_{Q}$ with $[\bm{\Lambda}_{Q}]_{i,i}=\sigma^{2}$ for all $i\in 1,\ldots,Q$, with $\mathbf{X}=\mathbf{0}$.

It is worth noting that we deliberately avoid considering the equality of any of the $Q$ principal eigenvalues to streamline the quantitative analysis, as introducing such equality might exacerbate the complexity and hasten the occurrence of model collapse. For instance, when the projection variance falls within the range $(\lambda_{Q}^{o},\lambda_{Q-1}^{o})$ and there exist two eigenvectors with eigenvalues equal to $\lambda_{Q}^{o}$, the stable local optimum point entails two zero-columns.

Suppose σ2<λNsuperscript𝜎2subscript𝜆𝑁\sigma^{2}<\lambda_{N}italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT < italic_λ start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT, then, there exist a set of local minima point characterized by the following 𝚲Qsubscript𝚲𝑄\bm{\Lambda}_{Q}bold_Λ start_POSTSUBSCRIPT italic_Q end_POSTSUBSCRIPT,

[𝚲Q]i,i={λio, for iNQ+k,,N, or ,σ2,[\bm{\Lambda}_{Q}]_{i,i}=\left\{\begin{aligned} &\lambda_{i}^{o},\text{ for }i% \in N-Q+k,...,N,\text{ or },\\ &\sigma^{2},\end{aligned}\right.[ bold_Λ start_POSTSUBSCRIPT italic_Q end_POSTSUBSCRIPT ] start_POSTSUBSCRIPT italic_i , italic_i end_POSTSUBSCRIPT = { start_ROW start_CELL end_CELL start_CELL italic_λ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_o end_POSTSUPERSCRIPT , for italic_i ∈ italic_N - italic_Q + italic_k , … , italic_N , or , end_CELL end_ROW start_ROW start_CELL end_CELL start_CELL italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT , end_CELL end_ROW

where k𝑘kitalic_k represents the last k𝑘kitalic_k principal eigenvalues that are selected. It is also noteworthy that these local minima point444Equality among the last Q𝑄Qitalic_Q principal eigenvalues does not impact the analysis presented. will feature k𝑘kitalic_k zero-columns in 𝐗𝐗{\mathbf{X}}bold_X.
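The case analysis above, for $\sigma^{2}$ exceeding some of the top $Q$ eigenvalues, can be checked numerically. The following is a minimal NumPy sketch rather than the authors' implementation. It assumes the standard dual probabilistic PCA form of the stationary point, $\hat{\mathbf{X}}=\mathbf{U}_{Q}(\bm{\Lambda}_{Q}-\sigma^{2}\mathbf{I})^{1/2}$ up to rotation; the function name `stable_latent_solution` and the synthetic spectrum are hypothetical choices made for illustration.

```python
import numpy as np

def stable_latent_solution(Y, Q, sigma2):
    """Return (X_hat, n_zero_cols) for a linear GPLVM with fixed sigma2.

    Assumption: X_hat = U_Q (Lambda_Q - sigma2 * I)^{1/2} (dual PPCA form),
    where Lambda_Q keeps lambda_i if lambda_i > sigma2 and sigma2 otherwise.
    """
    N, M = Y.shape
    eigvals, eigvecs = np.linalg.eigh(Y @ Y.T / M)   # ascending eigenvalues
    top = np.argsort(eigvals)[::-1][:Q]              # indices of the Q largest
    lam, U = eigvals[top], eigvecs[:, top]
    lam_Q = np.maximum(lam, sigma2)                  # diagonal of Lambda_Q
    X_hat = U * np.sqrt(lam_Q - sigma2)              # scale each column
    n_zero_cols = int(np.sum(lam <= sigma2))
    return X_hat, n_zero_cols

# Synthetic data whose matrix (1/M) Y Y^T has a known spectrum (hypothetical values).
rng = np.random.default_rng(0)
N, M, Q = 6, 8, 3
spectrum = np.array([5.0, 4.0, 3.0, 2.0, 1.0, 0.5])
U0, _ = np.linalg.qr(rng.standard_normal((N, N)))
V0, _ = np.linalg.qr(rng.standard_normal((M, N)))   # orthonormal columns
Y = U0 @ np.diag(np.sqrt(M * spectrum)) @ V0.T

for sigma2 in (0.1, 3.5, 4.5, 6.0):
    _, n_zero = stable_latent_solution(Y, Q, sigma2)
    print(f"sigma2={sigma2}: {n_zero} zero column(s)")
```

With $Q=3$ and top eigenvalues $5,4,3$, each eigenvalue that $\sigma^{2}$ overtakes removes one latent dimension, and $\sigma^{2}>\lambda_{1}^{o}$ gives $\hat{\mathbf{X}}=\mathbf{0}$. The sketch does not cover the separate family of local minima for $\sigma^{2}<\lambda_{N}$.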

Appendix B Modeling and Variational Approximation

B.1 ELBO Derivation and Evaluation

\begin{align*}
\mathcal{L} &= \mathbb{E}_{q(\mathbf{X},\mathbf{W})}\left[\log\frac{p(\mathbf{Y},\mathbf{X},\mathbf{W})}{q(\mathbf{X},\mathbf{W})}\right] \\
&= \mathbb{E}_{q(\mathbf{X},\mathbf{W})}\left[\log\frac{p(\mathbf{W})\prod_{i=1}^{N}p(\mathbf{x}_{i})\prod_{j=1}^{M}p(\mathbf{y}_{:,j}|\mathbf{X},\mathbf{W})}{p(\mathbf{W})\prod_{i=1}^{N}q(\mathbf{x}_{i})}\right] \\
&= \underbrace{\sum_{j=1}^{M}\mathbb{E}_{q(\mathbf{X},\mathbf{W})}\left[\log p(\mathbf{y}_{:,j}|\mathbf{X},\mathbf{W})\right]}_{\text{Term 1: data reconstruction}}\underbrace{-\sum_{i=1}^{N}\operatorname{KL}(q(\mathbf{x}_{i})\|p(\mathbf{x}_{i}))}_{\text{Term 2: regularization}} \\
&\approx \sum_{j=1}^{M}\frac{1}{I}\sum_{i=1}^{I}\log\mathcal{N}(\mathbf{y}_{:,j}|\bm{0},\hat{\mathbf{K}}_{\mathrm{sm}}^{(i)}+\sigma^{2}\mathbf{I}_{N})-\frac{1}{2}\sum_{i=1}^{N}\Big[\operatorname{tr}(\mathbf{S}_{i})+\bm{\mu}_{i}^{\top}\bm{\mu}_{i}-\log|\mathbf{S}_{i}|-Q\Big] \\
&\approx \sum_{j=1}^{M}\frac{1}{I}\sum_{i=1}^{I}\left\{-\frac{N}{2}\log 2\pi-\frac{1}{2}\log\left|\hat{\mathbf{K}}_{\mathrm{sm}}^{(i)}+\sigma^{2}\mathbf{I}_{N}\right|-\frac{1}{2}\mathbf{y}_{:,j}^{\top}\left(\hat{\mathbf{K}}_{\mathrm{sm}}^{(i)}+\sigma^{2}\mathbf{I}_{N}\right)^{-1}\mathbf{y}_{:,j}\right\} \\
&\quad -\frac{1}{2}\sum_{i=1}^{N}\Big[\operatorname{tr}(\mathbf{S}_{i})+\bm{\mu}_{i}^{\top}\bm{\mu}_{i}-\log|\mathbf{S}_{i}|-Q\Big]
\end{align*}

where $\mathbf{S}_{i}$ is typically assumed to be a diagonal matrix. Note that $\hat{\mathbf{K}}_{\mathrm{sm}}^{(i)}=\Phi_{\mathrm{sm}}(\mathbf{X}^{(i)};\mathbf{W})\Phi_{\mathrm{sm}}(\mathbf{X}^{(i)};\mathbf{W})^{\top}$, where $\Phi_{\mathrm{sm}}(\mathbf{X}^{(i)};\mathbf{W})\in\mathbb{R}^{N\times mL}$.
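As a small illustration of Term 2, the closed-form regularizer in the last line of the derivation can be evaluated directly when each $\mathbf{S}_{i}$ is diagonal. This is a sketch under the assumption that the prior is $p(\mathbf{x}_{i})=\mathcal{N}(\mathbf{0},\mathbf{I}_{Q})$, which is what the bracketed expression corresponds to; the function name `kl_regularizer` and the array layout are hypothetical.

```python
import numpy as np

def kl_regularizer(mu, s_diag):
    """Sum over i of KL( N(mu_i, diag(s_i)) || N(0, I_Q) ).

    mu, s_diag: arrays of shape (N, Q); s_diag holds the (positive)
    diagonal entries of each S_i. Assumes a standard normal prior.
    """
    Q = mu.shape[1]
    trace_term = s_diag.sum(axis=1)            # tr(S_i)
    mean_term = (mu ** 2).sum(axis=1)          # mu_i^T mu_i
    logdet_term = np.log(s_diag).sum(axis=1)   # log|S_i| for diagonal S_i
    return 0.5 * np.sum(trace_term + mean_term - logdet_term - Q)
```

The ELBO subtracts this quantity, so it vanishes only when every $q(\mathbf{x}_{i})$ equals the prior and grows as the variational means move away from the origin.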

Lemma B.1.

Suppose $\mathbf{A}$ is an invertible $n$-by-$n$ matrix and $\mathbf{U},\mathbf{V}$ are $n$-by-$m$ matrices. Then the following determinant equality holds.

\[
\left|\mathbf{A}+\mathbf{U}\mathbf{V}^{\top}\right|=\left|\mathbf{I}_{m}+\mathbf{V}^{\top}\mathbf{A}^{-1}\mathbf{U}\right|\left|\mathbf{A}\right|
\]
Lemma B.2 (Woodbury matrix identity).

Suppose $\mathbf{A}$ is an invertible $n$-by-$n$ matrix and $\mathbf{U},\mathbf{V}$ are $n$-by-$m$ matrices. Then

\[
\left(\mathbf{A}+\mathbf{U}\mathbf{V}^{\top}\right)^{-1}=\mathbf{A}^{-1}-\mathbf{A}^{-1}\mathbf{U}\left(\mathbf{I}_{m}+\mathbf{V}^{\top}\mathbf{A}^{-1}\mathbf{U}\right)^{-1}\mathbf{V}^{\top}\mathbf{A}^{-1}
\]

According to the above two lemmas (Rasmussen & Williams, 2006), in the case that $N\gg mL$, we can compute the determinant and inversion of $\hat{\mathbf{K}}_{\mathrm{sm}}^{(i)}+\sigma^{2}\mathbf{I}_{N}$, reducing the computational complexity of the ELBO evaluation from the original $\mathcal{O}(N^{3})$ to $\mathcal{O}(N(mL)^{2})$.

\begin{align}
\left|\hat{\mathbf{K}}_{\mathrm{sm}}^{(i)}+\sigma^{2}\mathbf{I}_{N}\right| &= \left|\mathbf{I}_{mL}+\frac{1}{\sigma^{2}}\Phi_{\mathrm{sm}}^{\top}\Phi_{\mathrm{sm}}\right|\left|\sigma^{2}\mathbf{I}_{N}\right|=\sigma^{2N}\left|\mathbf{I}_{mL}+\frac{1}{\sigma^{2}}\Phi_{\mathrm{sm}}^{\top}\Phi_{\mathrm{sm}}\right|, \tag{48} \\
\left(\hat{\mathbf{K}}_{\mathrm{sm}}^{(i)}+\sigma^{2}\mathbf{I}_{N}\right)^{-1} &= \frac{1}{\sigma^{2}}\left[\mathbf{I}_{N}-\Phi_{\mathrm{sm}}\left(\sigma^{2}\mathbf{I}_{mL}+\Phi_{\mathrm{sm}}^{\top}\Phi_{\mathrm{sm}}\right)^{-1}\Phi_{\mathrm{sm}}^{\top}\right]. \tag{49}
\end{align}
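To make the complexity reduction concrete, the sketch below evaluates one Gaussian log-density term of the ELBO using only an $mL\times mL$ factorization, and compares it against the dense $\mathcal{O}(N^{3})$ computation. It is an illustrative NumPy sketch, not the authors' code: the function names are hypothetical, and a random matrix `Phi` stands in for the feature matrix $\Phi_{\mathrm{sm}}$, whose construction is not reproduced here.

```python
import numpy as np

def lowrank_gaussian_logpdf(y, Phi, sigma2):
    """log N(y | 0, Phi Phi^T + sigma2 I_N) via the two lemmas above.

    Phi has shape (N, D) with D = mL; only a D x D matrix is factorized.
    """
    N, D = Phi.shape
    A = sigma2 * np.eye(D) + Phi.T @ Phi          # D x D
    L = np.linalg.cholesky(A)                     # A = L L^T
    # log|Phi Phi^T + sigma2 I_N| = (N - D) log sigma2 + log|A|
    logdet = (N - D) * np.log(sigma2) + 2.0 * np.sum(np.log(np.diag(L)))
    # y^T (.)^{-1} y = (y^T y - (Phi^T y)^T A^{-1} (Phi^T y)) / sigma2
    z = np.linalg.solve(L, Phi.T @ y)             # general solver on the factor
    quad = (y @ y - z @ z) / sigma2
    return -0.5 * (N * np.log(2.0 * np.pi) + logdet + quad)

def dense_gaussian_logpdf(y, Phi, sigma2):
    """Reference O(N^3) evaluation with the full N x N covariance."""
    N = Phi.shape[0]
    K = Phi @ Phi.T + sigma2 * np.eye(N)
    _, logdet = np.linalg.slogdet(K)
    quad = y @ np.linalg.solve(K, y)
    return -0.5 * (N * np.log(2.0 * np.pi) + logdet + quad)

rng = np.random.default_rng(1)
Phi = rng.standard_normal((50, 6)) / np.sqrt(6)   # stand-in features, N=50, mL=6
y = rng.standard_normal(50)
fast = lowrank_gaussian_logpdf(y, Phi, sigma2=0.3)
slow = dense_gaussian_logpdf(y, Phi, sigma2=0.3)
print(fast, slow)
```

The two values should agree to numerical precision. Forming $\Phi_{\mathrm{sm}}^{\top}\Phi_{\mathrm{sm}}$ costs $\mathcal{O}(N(mL)^{2})$ and dominates the low-rank path when $N\gg mL$.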

B.2 Interpretation of Modeling and Variational Distribution

In Eqs. (10) and (11), we have modeled the spectral points $\mathbf{W}$ as a part of the data generation process. However, this might cause some confusion, which is clarified as follows.

  • If we have selected the kernel function, the probability model for the data, i.e., Eqs. (10) and (11), can be interpreted as independent of $p(\mathbf{W})$, as it is inherent to the kernel function.

  • In this paper, we provide another interpretation: Following the setting of RFLVM by Gundersen et al. (2021), we consider the data-generating process for observations $\mathbf{Y}$ as outlined in Eq. (10) or (11), which depends on $\mathbf{W}$. Subsequently, we constrain its prior $p(\mathbf{W})$ to be a Gaussian mixture, which defines the prior SM kernel function. This alternative perspective is explained as follows:

    • Let us explicitly assume a parametric variational distribution $q_{\bm{\eta}}(\mathbf{W})$ that approximates $p(\mathbf{W}|\mathbf{Y})$ and is itself a Gaussian mixture (and thus still defines an SM kernel) with parameters denoted as $\bm{\eta}$. In this case, Eq. (12) becomes:

      \[
      q(\mathbf{X},\mathbf{W})=q_{\bm{\eta}}(\mathbf{W})q(\mathbf{X}).
      \]

      Combining the joint distribution in Eq. (11), we derive the following ELBO:

      \begin{align}
      \mathcal{L} &= \mathbb{E}_{q(\mathbf{X},\mathbf{W})}\left[\log\frac{p(\mathbf{X})p_{\bm{\theta}}(\mathbf{W})p_{\sigma}(\mathbf{Y}|\mathbf{X},\mathbf{W})}{q(\mathbf{X})q_{\bm{\eta}}(\mathbf{W})}\right] \tag{50} \\
      &= \mathbb{E}_{q(\mathbf{X},\mathbf{W})}\left[\log p_{\sigma}(\mathbf{Y}|\mathbf{X},\mathbf{W})\right]-\operatorname{KL}(q(\mathbf{X})\|p(\mathbf{X}))-\operatorname{KL}(q_{\bm{\eta}}(\mathbf{W})\|p_{\bm{\theta}}(\mathbf{W})). \notag
      \end{align}

      In this ELBO, the prior distribution $p_{\bm{\theta}}(\mathbf{W})$ appears only in the last KL divergence term. Maximizing the ELBO with respect to $\bm{\theta}$ therefore yields $\bm{\theta}=\bm{\eta}$, so that the last KL divergence term becomes $0$. Ultimately, this aligns with the optimization objective in our paper.

  • More complicated $p(\mathbf{W}|\mathbf{Y})$ approximations. It is also possible to make the variational distribution of the spectral points $\mathbf{Y}$-dependent, i.e., $q(\mathbf{W}|\mathbf{Y})$, for example a parametric Gaussian mixture or another distribution.

    • Gaussian mixture: Suppose we use a parametric variational distribution $q_{\bm{\eta}}(\mathbf{W}|\mathbf{Y})$ of the form

      \[
      q_{\bm{\eta}}(\mathbf{W}|\mathbf{Y})=\prod_{l=1}^{L/2}\sum_{i=1}^{m}\alpha_{i}\mathcal{N}_{\bm{\eta}_{i}}(\mu_{i},\sigma_{i}^{2}), \tag{51}
      \]

      where $(\alpha_{i},\mu_{i},\sigma_{i}^{2})$ in each mixture component is modeled by an encoder parametrized by $\bm{\eta}_{i}$ with $\mathbf{Y}$ as input. Similarly, we can get the ELBO:

      \begin{align}
      \mathcal{L} &= \mathbb{E}_{q(\mathbf{X},\mathbf{W}|\mathbf{Y})}\left[\log\frac{p(\mathbf{X})p_{\bm{\theta}}(\mathbf{W})p_{\sigma}(\mathbf{Y}|\mathbf{X},\mathbf{W})}{q(\mathbf{X})q_{\bm{\eta}}(\mathbf{W}|\mathbf{Y})}\right] \tag{52} \\
      &= \mathbb{E}_{q(\mathbf{X},\mathbf{W}|\mathbf{Y})}\left[\log p_{\sigma}(\mathbf{Y}|\mathbf{X},\mathbf{W})\right]-\operatorname{KL}(q(\mathbf{X})\|p(\mathbf{X}))-\operatorname{KL}(q_{\bm{\eta}}(\mathbf{W}|\mathbf{Y})\|p_{\bm{\theta}}(\mathbf{W})). \notag
      \end{align}

      When maximizing the ELBO, the last KL divergence term will also be $0$. The difference between the remaining terms and our objective function lies in the first term, where $\mathbf{W}$ includes information learnt from $\mathbf{Y}$. This potentially enhances the kernel selection process and contributes to preventing model collapse, though it comes at the cost of increased computational complexity from the encoder evaluation.

    • Other distribution forms: In this case, the variational inference algorithm heavily depends on the specific form of $q(\mathbf{W}|\mathbf{Y})$. While this variational distribution can be more general, such an assumption generally introduces greater intractability, making the evaluation of the ELBO more challenging. Employing Monte Carlo sampling to approximate the ELBO in such scenarios could result in larger approximation variances compared to the case where $q(\mathbf{W})=p(\mathbf{W})$, thus potentially leading to less robust model performance.
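The argument that the last KL term in Eq. (50) vanishes at the optimum can be illustrated with a toy computation. The sketch below makes a simplifying assumption: a single univariate Gaussian replaces the Gaussian mixture (the KL between mixtures has no closed form), and the grid of candidate prior parameters is hypothetical. It only shows that, for fixed $\bm{\eta}$, the term $-\operatorname{KL}(q_{\bm{\eta}}\|p_{\bm{\theta}})$ is maximized at $\bm{\theta}=\bm{\eta}$.

```python
import numpy as np

def kl_gauss(mu_q, var_q, mu_p, var_p):
    """KL( N(mu_q, var_q) || N(mu_p, var_p) ) for univariate Gaussians."""
    return 0.5 * (np.log(var_p / var_q) + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)

eta = (0.7, 0.2)  # hypothetical variational parameters (mean, variance)

# Candidate prior parameters theta = (mean, variance) on a coarse grid.
grid = [(m, v)
        for m in np.linspace(0.0, 1.4, 15)
        for v in np.linspace(0.05, 0.5, 10)]

# theta enters the ELBO only through -KL, so maximizing over theta
# is the same as minimizing the KL term.
best_theta = min(grid, key=lambda th: kl_gauss(*eta, *th))
print(best_theta, kl_gauss(*eta, *best_theta))
```

The grid minimizer coincides with `eta` up to floating-point rounding, where the KL value is zero; every other grid point gives a strictly positive KL.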

Appendix C Auto-differentiable SM Kernel using RFF Approximation

C.1 Proof of Proposition 4.2

Proof.

With the RFF feature map defined in Eq. (14), we can write down the inner product of the feature maps

\[
\phi\left(\mathbf{x};\mathbf{W}\right)^{\top}\phi\left(\mathbf{x}^{\prime};\mathbf{W}\right)=\sum_{i=1}^{m}\alpha_{i}\sum_{l=1}^{L/2}\frac{2}{L}\cos\left(2\pi\mathbf{w}_{l}^{(i)\top}(\mathbf{x}-\mathbf{x}^{\prime})\right) \tag{53}
\]

where $\mathbf{W}\triangleq\{\mathbf{w}_{1}^{(i)},\mathbf{w}_{2}^{(i)},\ldots,\mathbf{w}_{L/2}^{(i)}\}_{i=1}^{m}$, and each $\mathbf{w}_{l}^{(i)}$ is sampled i.i.d. from the symmetric distribution

\[
s_{i}(\mathbf{w})=\frac{\mathcal{N}(\mathbf{w}|\bm{\mu}_{i},\operatorname{diag}(\bm{\sigma}_{i}^{2}))+\mathcal{N}(-\mathbf{w}|\bm{\mu}_{i},\operatorname{diag}(\bm{\sigma}_{i}^{2}))}{2}
\]

using the reparameterization trick (Kingma & Welling, 2019). Taking the expectation w.r.t. $p\left(\mathbf{W}\right)=\prod_{i=1}^{m}\prod_{l=1}^{L/2}s_{i}(\mathbf{w})$, we can get

\begin{align}
&\mathbb{E}_{p\left(\mathbf{W}\right)}\left[\phi\left(\mathbf{x};\mathbf{W}\right)^{\top}\phi\left(\mathbf{x}^{\prime};\mathbf{W}\right)\right]=\mathbb{E}_{p\left(\mathbf{W}\right)}\left[\sum_{i=1}^{m}\alpha_{i}\sum_{l=1}^{L/2}\frac{2}{L}\cos\left(2\pi\mathbf{w}_{l}^{(i)\top}(\mathbf{x}-\mathbf{x}^{\prime})\right)\right] \notag \\
&=\sum_{i=1}^{m}\alpha_{i}\mathbb{E}_{p\left(\mathbf{w}_{1:L/2}^{(i)}\right)}\left[\sum_{l=1}^{L/2}\frac{2}{L}\cos\left(2\pi\mathbf{w}_{l}^{(i)\top}(\mathbf{x}-\mathbf{x}^{\prime})\right)\right] && (\text{linearity of expectation}) \tag{54a} \\
&=\sum_{i=1}^{m}\alpha_{i}\mathbb{E}_{s_{i}(\mathbf{w})}\left[\cos\left(2\pi\mathbf{w}_{1}^{(i)\top}(\mathbf{x}-\mathbf{x}^{\prime})\right)\right] && (\text{i.i.d. of }\mathbf{w}^{(i)}_{l}) \tag{54b} \\
&=\sum_{i=1}^{m}\alpha_{i}\mathbb{E}_{s_{i}(\mathbf{w})}\left[\frac{\exp\left(2\pi j\mathbf{w}_{1}^{(i)\top}(\mathbf{x}-\mathbf{x}^{\prime})\right)+\exp\left(-2\pi j\mathbf{w}_{1}^{(i)\top}(\mathbf{x}-\mathbf{x}^{\prime})\right)}{2}\right] && (\text{Euler's identity}) \tag{54c}
\end{align}
=i=1mαiki(𝐱,𝐱;𝝁i,𝝈𝒊𝟐)absentsuperscriptsubscript𝑖1𝑚subscript𝛼𝑖subscript𝑘𝑖𝐱superscript𝐱subscript𝝁𝑖superscriptsubscript𝝈𝒊2\displaystyle=\sum_{i=1}^{m}\alpha_{i}k_{i}({\mathbf{x}},{\mathbf{x}}^{\prime}% ;\bm{\mu}_{i},\bm{\sigma_{i}^{2}})= ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT italic_α start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_k start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( bold_x , bold_x start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ; bold_italic_μ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , bold_italic_σ start_POSTSUBSCRIPT bold_italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT bold_2 end_POSTSUPERSCRIPT ) (symmetrity of si(𝐰))symmetrity of subscript𝑠𝑖𝐰\displaystyle(\text{symmetrity of }s_{i}({\mathbf{w}}))( symmetrity of italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( bold_w ) ) (54d)
=ksm(𝐱,𝐱;{αi,𝝁i,𝝈𝒊𝟐}i=1m)absentsubscript𝑘sm𝐱superscript𝐱superscriptsubscriptsubscript𝛼𝑖subscript𝝁𝑖superscriptsubscript𝝈𝒊2𝑖1𝑚\displaystyle=k_{\text{sm}}({\mathbf{x}},{\mathbf{x}}^{\prime};\{\alpha_{i},% \bm{\mu}_{i},\bm{\sigma_{i}^{2}}\}_{i=1}^{m})= italic_k start_POSTSUBSCRIPT sm end_POSTSUBSCRIPT ( bold_x , bold_x start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ; { italic_α start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , bold_italic_μ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , bold_italic_σ start_POSTSUBSCRIPT bold_italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT bold_2 end_POSTSUPERSCRIPT } start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT ) (SM kernel definition)SM kernel definition\displaystyle(\text{SM kernel definition})( SM kernel definition ) (54e)

Hence, $\phi(\mathbf{x};\mathbf{W})^{\top}\phi(\mathbf{x}';\mathbf{W})$ is an unbiased estimator of the SM kernel with parameters $\{\alpha_i,\bm{\mu}_i,\bm{\sigma}_i^2\}_{i=1}^{m}$. ∎
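As a numerical sanity check of Eq. (54), the following NumPy sketch compares a Monte Carlo average of $\phi(\mathbf{x};\mathbf{W})^{\top}\phi(\mathbf{x}';\mathbf{W})$ with a closed-form SM kernel. It assumes the usual SM kernel with diagonal covariances, $k_{\text{sm}}(\mathbf{x},\mathbf{x}')=\sum_i\alpha_i\exp(-2\pi^2\sum_q\tau_q^2\sigma_{iq}^2)\cos(2\pi\bm{\mu}_i^{\top}\bm{\tau})$ with $\bm{\tau}=\mathbf{x}-\mathbf{x}'$, and a symmetrised Gaussian spectral density $s_i(\mathbf{w})$. The function names, parameter values, and sample sizes are illustrative choices, not taken from the paper.

```python
import numpy as np

def sm_kernel(x, x2, alpha, mu, sigma2):
    """Closed-form SM kernel (assumed form, diagonal covariances)."""
    tau = x - x2
    decay = np.exp(-2.0 * np.pi**2 * (sigma2 @ tau**2))        # shape (m,)
    return float(np.sum(alpha * decay * np.cos(2.0 * np.pi * (mu @ tau))))

def sample_frequencies(rng, mu, sigma2, half_L):
    """Draw w_l^(i) from a symmetrised Gaussian with modes at +mu_i and -mu_i."""
    m, Q = mu.shape
    signs = rng.choice([-1.0, 1.0], size=(m, half_L, 1))
    noise = rng.standard_normal((m, half_L, Q)) * np.sqrt(sigma2)[:, None, :]
    return signs * mu[:, None, :] + noise                       # shape (m, L/2, Q)

def rff_features(x, W, alpha):
    """phi(x; W): per component, sqrt(2 alpha_i / L) * [cos(2 pi w^T x), sin(2 pi w^T x)]."""
    m, half_L, _ = W.shape
    proj = 2.0 * np.pi * (W @ x)                                # shape (m, L/2)
    scale = np.sqrt(alpha / half_L)[:, None]                    # 2 alpha_i / L = alpha_i / (L/2)
    return np.concatenate([scale * np.cos(proj), scale * np.sin(proj)], axis=1).ravel()

rng = np.random.default_rng(0)
alpha = np.array([0.7, 0.3])                                    # hypothetical mixture weights
mu = np.array([[0.5, -0.2], [1.0, 0.4]])                        # hypothetical spectral means
sigma2 = np.array([[0.05, 0.10], [0.02, 0.03]])                 # hypothetical spectral variances
x, x2 = np.array([0.3, -0.1]), np.array([-0.2, 0.4])

half_L, n_draws = 50, 4000
estimates = []
for _ in range(n_draws):
    W = sample_frequencies(rng, mu, sigma2, half_L)             # fresh frequencies per draw
    estimates.append(rff_features(x, W, alpha) @ rff_features(x2, W, alpha))

exact = sm_kernel(x, x2, alpha, mu, sigma2)
mc_mean = float(np.mean(estimates))
print(f"exact = {exact:.4f}, Monte Carlo mean = {mc_mean:.4f}")
```

The two printed values should agree up to Monte Carlo error, which shrinks as `n_draws` grows.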

C.2 Proof of Theorem 4.3

Proof.

A similar theorem has been proven for the Gaussian process regression model; see Proposition 3.1 in (Jung et al., 2022) and Theorem 3 in (Lopez-Paz et al., 2014). For ease of reference, we follow these existing results and give the proof below.

• To prove Theorem 4.3, we first introduce the following lemma, the matrix Bernstein inequality (Tropp, 2015).

Lemma C.1 (Matrix Bernstein Inequality).

Consider a finite sequence $\{\bm{X}_i\}$ of independent, random, Hermitian matrices with dimension $N$. Assume that

$$\mathbb{E}[\bm{X}_i]=\mathbf{0} \text{ and } \|\bm{X}_i\|_2\leq H \text{ for each index } i,$$

where $\|\cdot\|_2$ denotes the matrix spectral norm. Introduce the random matrix $\bm{Y}=\sum_i\bm{X}_i$, and let $v(\bm{Y})$ be the matrix variance statistic of the sum:

$$v(\bm{Y})=\left\|\mathbb{E}[\bm{Y}^2]\right\|=\left\|\sum_i\mathbb{E}[\bm{X}_i^2]\right\|.$$

Then we have

$$\mathbb{E}\left[\|\bm{Y}\|_2\right]\leq\sqrt{2v(\bm{Y})\log N}+\frac{1}{3}H\log N. \tag{55}$$

Furthermore, for all $\epsilon\geq 0$,

$$P\left\{\|\bm{Y}\|_2\geq\epsilon\right\}\leq N\cdot\exp\left(\frac{-\epsilon^2/2}{v(\bm{Y})+H\epsilon/3}\right). \tag{56}$$
Proof.

A proof of Lemma C.1 is given in Theorem 6.6.1, § 6.6, of (Tropp, 2015). ∎

• Next, we show how to apply Lemma C.1 to prove Theorem 4.3.

1). Factorization of Approximation Error Matrix.

With the constructed SM kernel matrix approximation $\hat{\mathbf{K}}_{\mathrm{sm}}=\Phi_{\mathrm{sm}}(\mathbf{X})\Phi_{\mathrm{sm}}(\mathbf{X})^{\top}$, where the random feature matrix is $\Phi_{\mathrm{sm}}(\mathbf{X})=\left[\phi(\mathbf{x}_1),\ldots,\phi(\mathbf{x}_N)\right]^{\top}\in\mathbb{R}^{N\times mL}$, we have the following approximation error matrix:

$$\mathbf{E}=\hat{\mathbf{K}}_{\mathrm{sm}}-\mathbf{K}_{\mathrm{sm}}. \tag{57}$$

We are going to show that $\mathbf{E}$ can be factorized as

$$\mathbf{E}=\sum_{i=1}^{m}\sum_{l=1}^{L/2}\mathbf{E}_l^{(i)}, \tag{58}$$

where $\{\mathbf{E}_l^{(i)}\}$ is a sequence of independent, random, Hermitian matrices with dimension $N$.

Specifically, we define $\mathbf{Z}_l^{(i)}$ as

$$\mathbf{Z}_l^{(i)}=\left[\exp\left(2\pi j\mathbf{w}_l^{(i)\top}\mathbf{x}_1\right),\ldots,\exp\left(2\pi j\mathbf{w}_l^{(i)\top}\mathbf{x}_N\right)\right]^{\top}\in\mathbb{C}^{N\times 1},\quad\text{where } \mathbf{w}_l^{(i)}\sim s_i(\mathbf{w}), \tag{59}$$

and we can show that

\begin{align}
[\hat{\mathbf{K}}_{\mathrm{sm}}]_{h,g}
&= \sum_{i=1}^{m}\frac{2\alpha_i}{L}\sum_{l=1}^{L/2}\cos\left(2\pi\mathbf{w}_l^{(i)\top}(\mathbf{x}_h-\mathbf{x}_g)\right) \tag{60}\\
&= \sum_{i=1}^{m}\sum_{l=1}^{L/2}\frac{2\alpha_i}{L}\operatorname{Re}\left(\exp\left(2\pi j\mathbf{w}_l^{(i)\top}(\mathbf{x}_h-\mathbf{x}_g)\right)\right) \notag\\
&= \sum_{i=1}^{m}\sum_{l=1}^{L/2}\frac{2\alpha_i}{L}\operatorname{Re}\left(\left[\mathbf{Z}_l^{(i)}\mathbf{Z}_l^{(i)*}\right]_{h,g}\right), \notag
\end{align}

where $\mathbf{Z}_l^{(i)*}$ is the conjugate transpose of $\mathbf{Z}_l^{(i)}$. Thus, we have $\hat{\mathbf{K}}_{\mathrm{sm}}=\sum_{i=1}^{m}\sum_{l=1}^{L/2}\frac{2\alpha_i}{L}\operatorname{Re}(\mathbf{Z}_l^{(i)}\mathbf{Z}_l^{(i)*})$. Based on this factorization and Eq. (54) in Proposition 4.2, we have that

$$\mathbf{K}_{\mathrm{sm}}=\sum_{i=1}^{m}\sum_{l=1}^{L/2}\frac{2\alpha_i}{L}\mathbb{E}\left[\operatorname{Re}\left(\mathbf{Z}_l^{(i)}\mathbf{Z}_l^{(i)*}\right)\right].$$

Therefore, the approximation error matrix $\mathbf{E}$ can be factorized as $\mathbf{E}=\sum_{i=1}^{m}\sum_{l=1}^{L/2}\mathbf{E}_l^{(i)}$, where

$$\mathbf{E}_l^{(i)}=\frac{2\alpha_i}{L}\left(\operatorname{Re}\left(\mathbf{Z}_l^{(i)}\mathbf{Z}_l^{(i)*}\right)-\mathbb{E}\left[\operatorname{Re}\left(\mathbf{Z}_l^{(i)}\mathbf{Z}_l^{(i)*}\right)\right]\right) \tag{61}$$

is a sequence of independent, random, Hermitian matrices with dimension $N$ that satisfy the condition $\mathbb{E}[\mathbf{E}_l^{(i)}]=\mathbf{0}$.
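The factorization of $\hat{\mathbf{K}}_{\mathrm{sm}}$ used in this step is a purely algebraic identity, so it can be checked numerically for any set of frequencies. The sketch below builds the feature matrix $\Phi_{\mathrm{sm}}(\mathbf{X})$ and compares $\Phi_{\mathrm{sm}}(\mathbf{X})\Phi_{\mathrm{sm}}(\mathbf{X})^{\top}$ with the sum of rank-one terms $\frac{2\alpha_i}{L}\operatorname{Re}(\mathbf{Z}_l^{(i)}\mathbf{Z}_l^{(i)*})$. The sizes, the random inputs, and the standard-normal stand-in frequencies are assumptions made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
N, Q, m, half_L = 6, 2, 3, 4                 # hypothetical sizes; L features per component
L = 2 * half_L
X = rng.standard_normal((N, Q))              # latent inputs x_1, ..., x_N
alpha = rng.uniform(0.1, 1.0, size=m)        # mixture weights alpha_i
W = rng.standard_normal((m, half_L, Q))      # stand-in frequencies; the identity holds for any W

# Feature matrix Phi_sm(X) of shape (N, m * L).
proj = 2.0 * np.pi * np.einsum("ilq,nq->nil", W, X)          # shape (N, m, L/2)
scale = np.sqrt(2.0 * alpha / L)[None, :, None]
Phi = np.concatenate([scale * np.cos(proj), scale * np.sin(proj)], axis=2).reshape(N, m * L)
K_hat = Phi @ Phi.T

# Sum of the terms (2 alpha_i / L) * Re(Z Z^*), with Z defined as in Eq. (59).
K_sum = np.zeros((N, N))
for i in range(m):
    for l in range(half_L):
        Z = np.exp(1j * 2.0 * np.pi * (X @ W[i, l]))[:, None]   # complex column, shape (N, 1)
        K_sum += (2.0 * alpha[i] / L) * np.real(Z @ Z.conj().T)

max_gap = float(np.max(np.abs(K_hat - K_sum)))
print(f"largest entrywise gap: {max_gap:.2e}")
```

The gap should be at floating-point rounding level, which confirms that each summand is a real symmetric $N\times N$ matrix whose sum reproduces $\hat{\mathbf{K}}_{\mathrm{sm}}$.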

We next find an upper bound for $\|\mathbf{E}_l^{(i)}\|_2$.

2). Upper Bound for $\|\mathbf{E}_l^{(i)}\|_2$.

\begin{align}
\|\mathbf{E}_l^{(i)}\|_2
&= \frac{2\alpha_i}{L}\left\|\operatorname{Re}\left(\mathbf{Z}_l^{(i)}\mathbf{Z}_l^{(i)*}\right)-\mathbb{E}\left[\operatorname{Re}\left(\mathbf{Z}_l^{(i)}\mathbf{Z}_l^{(i)*}\right)\right]\right\|_2 \tag{62a}\\
&\leq \frac{2\alpha_i}{L}\left(\left\|\operatorname{Re}\left(\mathbf{Z}_l^{(i)}\mathbf{Z}_l^{(i)*}\right)\right\|_2+\left\|\mathbb{E}\left[\operatorname{Re}\left(\mathbf{Z}_l^{(i)}\mathbf{Z}_l^{(i)*}\right)\right]\right\|_2\right) && \text{(triangle inequality)} \tag{62b}\\
&\leq \frac{2\alpha_i}{L}\left(\left\|\operatorname{Re}\left(\mathbf{Z}_l^{(i)}\mathbf{Z}_l^{(i)*}\right)\right\|_2+\mathbb{E}\left[\left\|\operatorname{Re}\left(\mathbf{Z}_l^{(i)}\mathbf{Z}_l^{(i)*}\right)\right\|_2\right]\right) && \text{(Jensen's inequality)} \tag{62c}\\
&\leq \frac{2a}{L}\left(2N+2N\right) \tag{62d}\\
&= \frac{2a}{L}\,4N, \tag{62e}
\end{align}

where $a=\sqrt{\sum_{i=1}^{m}\alpha_i^2}$ and

\begin{align}
\bm{c}_l^{(i)} &= \left[\cos\left(2\pi\mathbf{w}_l^{(i)\top}\mathbf{x}_1\right),\ldots,\cos\left(2\pi\mathbf{w}_l^{(i)\top}\mathbf{x}_N\right)\right]^{\top}\in\mathbb{R}^{N\times 1}, \tag{63a}\\
\bm{s}_l^{(i)} &= \left[\sin\left(2\pi\mathbf{w}_l^{(i)\top}\mathbf{x}_1\right),\ldots,\sin\left(2\pi\mathbf{w}_l^{(i)\top}\mathbf{x}_N\right)\right]^{\top}\in\mathbb{R}^{N\times 1}, \tag{63b}\\
\operatorname{Re}\left(\mathbf{Z}_l^{(i)}\mathbf{Z}_l^{(i)*}\right) &= \bm{c}_l^{(i)}\bm{c}_l^{(i)\top}+\bm{s}_l^{(i)}\bm{s}_l^{(i)\top}, \tag{63c}
\end{align}

and, for the last inequality in Eq. (62), we use $\alpha_i\leq a$ together with the fact that

$$\left\|\operatorname{Re}\left(\mathbf{Z}_l^{(i)}\mathbf{Z}_l^{(i)*}\right)\right\|_2=\sup_{\|\bm{v}\|_2^2=1}\bm{v}^{\top}\left(\bm{c}_l^{(i)}\bm{c}_l^{(i)\top}+\bm{s}_l^{(i)}\bm{s}_l^{(i)\top}\right)\bm{v}\leq 2N.$$
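The identity in Eq. (63c) and the spectral-norm bound above can be checked numerically for a single frequency. The following sketch is a minimal illustration under assumed sizes, with a standard-normal stand-in for $\mathbf{w}_l^{(i)}$; none of the values come from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)
N, Q = 8, 3                                   # hypothetical number of points and latent dimension
X = rng.standard_normal((N, Q))               # hypothetical latent inputs
w = rng.standard_normal(Q)                    # a single stand-in frequency

angles = 2.0 * np.pi * (X @ w)
Z = np.exp(1j * angles)[:, None]              # complex column vector, Eq. (59)
c = np.cos(angles)[:, None]                   # Eq. (63a)
s = np.sin(angles)[:, None]                   # Eq. (63b)

R = np.real(Z @ Z.conj().T)
identity_gap = float(np.max(np.abs(R - (c @ c.T + s @ s.T))))   # checks Eq. (63c)
spec_norm = float(np.linalg.norm(R, 2))       # spectral norm = largest singular value

print(f"identity gap = {identity_gap:.2e}, spectral norm = {spec_norm:.4f}, bound 2N = {2 * N}")
```

Because $\bm{c}\bm{c}^{\top}+\bm{s}\bm{s}^{\top}$ is positive semidefinite with trace $N$, its largest eigenvalue cannot exceed $N$, so the $2N$ bound used in the proof holds with room to spare.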

Next, we bound the variance, $\left\|\sum_{i=1}^{m}\sum_{l=1}^{L/2}\mathbb{E}[(\mathbf{E}_l^{(i)})^2]\right\|_2$.

3). Upper Bound for the Variance, $\left\|\sum_{i=1}^{m}\sum_{l=1}^{L/2}\mathbb{E}[(\mathbf{E}_l^{(i)})^2]\right\|_2$.

We first have the following bound:

\begin{align}
\frac{L^{2}}{4\alpha_{i}^{2}}\mathbb{E}\left[\left(\mathbf{E}_{l}^{(i)}\right)^{2}\right]
&= \mathbb{E}\left[\operatorname{Re}\left(\mathbf{Z}_{l}^{(i)}\mathbf{Z}_{l}^{(i)*}\right)^{2}\right]-\left(\mathbb{E}\left[\operatorname{Re}\left(\mathbf{Z}_{l}^{(i)}\mathbf{Z}_{l}^{(i)*}\right)\right]\right)^{2} \tag{64a}\\
&\preccurlyeq \mathbb{E}\left[\operatorname{Re}\left(\mathbf{Z}_{l}^{(i)}\mathbf{Z}_{l}^{(i)*}\right)^{2}\right] \tag{64b}\\
&= \mathbb{E}\left[\left(\bm{c}_{l}^{(i)\top}\bm{c}_{l}^{(i)}\right)\bm{c}_{l}^{(i)}\bm{c}_{l}^{(i)\top}+\left(\bm{s}_{l}^{(i)\top}\bm{s}_{l}^{(i)}\right)\bm{s}_{l}^{(i)}\bm{s}_{l}^{(i)\top}+\left(\bm{s}_{l}^{(i)\top}\bm{c}_{l}^{(i)}\right)\left(\bm{s}_{l}^{(i)}\bm{c}_{l}^{(i)\top}+\bm{c}_{l}^{(i)}\bm{s}_{l}^{(i)\top}\right)\right] \tag{64c}\\
&\preccurlyeq N\mathbb{E}\left[\bm{c}_{l}^{(i)}\bm{c}_{l}^{(i)\top}+\bm{s}_{l}^{(i)}\bm{s}_{l}^{(i)\top}\right]+\mathbb{E}\left[\left(\bm{s}_{l}^{(i)\top}\bm{c}_{l}^{(i)}\right)\left(\bm{s}_{l}^{(i)}\bm{c}_{l}^{(i)\top}+\bm{c}_{l}^{(i)}\bm{s}_{l}^{(i)\top}\right)\right] \tag{64d}\\
&= N\mathbb{E}\left[\operatorname{Re}\left(\mathbf{Z}_{l}^{(i)}\mathbf{Z}_{l}^{(i)*}\right)\right]+\mathbb{E}\left[\left(\bm{s}_{l}^{(i)\top}\bm{c}_{l}^{(i)}\right)\left(\bm{s}_{l}^{(i)}\bm{c}_{l}^{(i)\top}+\bm{c}_{l}^{(i)}\bm{s}_{l}^{(i)\top}\right)\right] \tag{64e}
\end{align}

where the notation $\mathbf{A}\preccurlyeq\mathbf{B}$ denotes that $\mathbf{B}-\mathbf{A}$ is a positive semidefinite (PSD) matrix, and the inequality in Eq. (64b) holds because $\left(\mathbb{E}\left[\operatorname{Re}\left(\mathbf{Z}_{l}^{(i)}\mathbf{Z}_{l}^{(i)*}\right)\right]\right)^{2}$ is a PSD matrix. The inequality in Eq. (64d) holds because

\begin{align}
& N\mathbb{E}\left[\bm{c}_{l}^{(i)}\bm{c}_{l}^{(i)\top}+\bm{s}_{l}^{(i)}\bm{s}_{l}^{(i)\top}\right]-\mathbb{E}\left[\left(\bm{c}_{l}^{(i)\top}\bm{c}_{l}^{(i)}\right)\bm{c}_{l}^{(i)}\bm{c}_{l}^{(i)\top}+\left(\bm{s}_{l}^{(i)\top}\bm{s}_{l}^{(i)}\right)\bm{s}_{l}^{(i)}\bm{s}_{l}^{(i)\top}\right] \tag{65}\\
&= \mathbb{E}\left[\left(\bm{s}_{l}^{(i)\top}\bm{s}_{l}^{(i)}\right)\bm{c}_{l}^{(i)}\bm{c}_{l}^{(i)\top}+\left(\bm{c}_{l}^{(i)\top}\bm{c}_{l}^{(i)}\right)\bm{s}_{l}^{(i)}\bm{s}_{l}^{(i)\top}\right]\quad\ldots\quad\footnotesize{\left[\text{due to }\left(\bm{c}_{l}^{(i)\top}\bm{c}_{l}^{(i)}+\bm{s}_{l}^{(i)\top}\bm{s}_{l}^{(i)}\right)=N\right]} \notag
\end{align}

is a PSD matrix.
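The expansion in Eq. (64c) and the PSD claim in Eq. (65) can be checked numerically for a single draw. The sketch below assumes that the feature vector has the form $\mathbf{Z}=\exp(\mathrm{i}\mathbf{X}\bm{w})=\bm{c}+\mathrm{i}\bm{s}$, so that $\bm{c}^{\top}\bm{c}+\bm{s}^{\top}\bm{s}=N$; the sizes, the inputs `X`, and the frequency `w` are illustrative stand-ins, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
N, Q = 6, 2                      # illustrative sizes (assumed)
X = rng.normal(size=(N, Q))      # stand-in latent inputs
w = rng.normal(size=Q)           # one stand-in spectral frequency

# Assumed feature form: Z = exp(i X w) = c + i s, hence c_n^2 + s_n^2 = 1.
c, s = np.cos(X @ w), np.sin(X @ w)
re_zz = np.outer(c, c) + np.outer(s, s)          # Re(Z Z^*)

# Expansion of Re(Z Z^*)^2 used in Eq. (64c).
expanded = ((c @ c) * np.outer(c, c) + (s @ s) * np.outer(s, s)
            + (s @ c) * (np.outer(s, c) + np.outer(c, s)))
identity_ok = np.allclose(re_zz @ re_zz, expanded)

# Difference in Eq. (65), evaluated before taking expectations.
lhs = N * re_zz - ((c @ c) * np.outer(c, c) + (s @ s) * np.outer(s, s))
rhs = (s @ s) * np.outer(c, c) + (c @ c) * np.outer(s, s)
difference_ok = np.allclose(lhs, rhs)

# rhs is a nonnegative combination of rank-one PSD matrices, so its
# smallest eigenvalue is nonnegative up to floating-point round-off.
min_eig = np.linalg.eigvalsh(rhs).min()
print(identity_ok, difference_ok, min_eig >= -1e-9)
```

Because the difference is PSD for every draw, it remains PSD after taking the expectation, which is the step used in Eq. (64d).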

Then we are able to bound the variance, $\left\|\sum_{i=1}^{m}\sum_{l=1}^{L/2}\mathbb{E}[(\mathbf{E}_{l}^{(i)})^{2}]\right\|_{2}$, as

\begin{align}
&\left\|\sum_{i=1}^{m}\sum_{l=1}^{L/2}\mathbb{E}[(\mathbf{E}_{l}^{(i)})^{2}]\right\|_{2} \tag{66}\\
&\leq\left\|\sum_{i=1}^{m}\sum_{l=1}^{L/2}\frac{4\alpha_{i}^{2}}{L^{2}}\left(N\mathbb{E}\left[\operatorname{Re}\left(\mathbf{Z}_{l}^{(i)}\mathbf{Z}_{l}^{(i)*}\right)\right]+\mathbb{E}\left[\left(\bm{s}_{l}^{(i)\top}\bm{c}_{l}^{(i)}\right)\left(\bm{s}_{l}^{(i)}\bm{c}_{l}^{(i)\top}+\bm{c}_{l}^{(i)}\bm{s}_{l}^{(i)\top}\right)\right]\right)\right\|_{2} \notag\\
&\leq\frac{2a}{L}\left\|\sum_{i=1}^{m}\alpha_{i}\left(N\mathbb{E}\left[\operatorname{Re}\left(\mathbf{Z}_{l}^{(i)}\mathbf{Z}_{l}^{(i)*}\right)\right]+\mathbb{E}\left[\left(\bm{s}_{l}^{(i)\top}\bm{c}_{l}^{(i)}\right)\left(\bm{s}_{l}^{(i)}\bm{c}_{l}^{(i)\top}+\bm{c}_{l}^{(i)}\bm{s}_{l}^{(i)\top}\right)\right]\right)\right\|_{2} \notag\\
&\leq\frac{2a}{L}\left(N\left\|\mathbf{K}_{\mathrm{sm}}\right\|_{2}+\sum_{i=1}^{m}\alpha_{i}\left\|\mathbb{E}\left[\left(\bm{s}_{l}^{(i)\top}\bm{c}_{l}^{(i)}\right)\left(\bm{s}_{l}^{(i)}\bm{c}_{l}^{(i)\top}+\bm{c}_{l}^{(i)}\bm{s}_{l}^{(i)\top}\right)\right]\right\|_{2}\right)\quad\text{(triangle inequality)} \notag\\
&\leq\frac{2a}{L}\left(N\left\|\mathbf{K}_{\mathrm{sm}}\right\|_{2}+\sum_{i=1}^{m}\alpha_{i}\mathbb{E}\left[\left\|\left(\bm{s}_{l}^{(i)\top}\bm{c}_{l}^{(i)}\right)\left(\bm{s}_{l}^{(i)}\bm{c}_{l}^{(i)\top}+\bm{c}_{l}^{(i)}\bm{s}_{l}^{(i)\top}\right)\right\|_{2}\right]\right)\quad\text{(Jensen's inequality)} \notag\\
&\leq\frac{2a}{L}\left(N\left\|\mathbf{K}_{\mathrm{sm}}\right\|_{2}+\frac{N}{2}\sum_{i=1}^{m}\alpha_{i}\mathbb{E}\left[\left\|\bm{s}_{l}^{(i)}\bm{c}_{l}^{(i)\top}+\bm{c}_{l}^{(i)}\bm{s}_{l}^{(i)\top}\right\|_{2}\right]\right)\quad\left(|\bm{s}_{l}^{(i)\top}\bm{c}_{l}^{(i)}|\leq\frac{N}{2}\right) \notag\\
&\leq\frac{2aN}{L}\left(\left\|\mathbf{K}_{\mathrm{sm}}\right\|_{2}+\frac{N}{2}a\sqrt{m}\right) \notag
\end{align}

where the last inequality holds because

\begin{equation}
\mathbb{E}\left[\left\|\bm{s}_{l}^{(i)}\bm{c}_{l}^{(i)\top}+\bm{c}_{l}^{(i)}\bm{s}_{l}^{(i)\top}\right\|_{2}\right]=\sup_{\|\bm{v}\|_{2}^{2}=1}\mathbb{E}\left[\left\|\bm{v}^{\top}\left(\bm{s}_{l}^{(i)}\bm{c}_{l}^{(i)\top}+\bm{c}_{l}^{(i)}\bm{s}_{l}^{(i)\top}\right)\bm{v}\right\|_{2}\right]\leq N, \tag{67}
\end{equation}

and $\sum_{i=1}^{m}\alpha_{i}\leq a\sqrt{m}$ by the Cauchy–Schwarz inequality.
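The scalar facts used in the last steps of Eq. (66) can be spelled out as follows. This is a sketch under two assumptions: $\bm{c}^{\top}\bm{c}+\bm{s}^{\top}\bm{s}=N$ holds for every draw (as the annotation in Eq. (65) indicates), and the constant $a$, which is defined earlier in the proof and not restated here, upper-bounds $\|\bm{\alpha}\|_{2}$ with $\bm{\alpha}=(\alpha_{1},\ldots,\alpha_{m})^{\top}$. The indices $l$ and $(i)$ are dropped for brevity.

```latex
\begin{align*}
|\bm{s}^{\top}\bm{c}|
  &\leq \|\bm{s}\|_{2}\,\|\bm{c}\|_{2}
   \leq \tfrac{1}{2}\left(\|\bm{s}\|_{2}^{2}+\|\bm{c}\|_{2}^{2}\right)
   = \tfrac{N}{2}, \\
\left\|\bm{s}\bm{c}^{\top}+\bm{c}\bm{s}^{\top}\right\|_{2}
  &= \sup_{\|\bm{v}\|_{2}=1}\left|\bm{v}^{\top}\left(\bm{s}\bm{c}^{\top}+\bm{c}\bm{s}^{\top}\right)\bm{v}\right|
   = \sup_{\|\bm{v}\|_{2}=1} 2\,|\bm{v}^{\top}\bm{s}|\,|\bm{c}^{\top}\bm{v}|
   \leq 2\,\|\bm{s}\|_{2}\,\|\bm{c}\|_{2}
   \leq N, \\
\sum_{i=1}^{m}\alpha_{i}
  &= \mathbf{1}^{\top}\bm{\alpha}
   \leq \|\mathbf{1}\|_{2}\,\|\bm{\alpha}\|_{2}
   = \sqrt{m}\,\|\bm{\alpha}\|_{2}
   \leq a\sqrt{m}.
\end{align*}
```

The first line combines the Cauchy–Schwarz and AM–GM inequalities, and the second uses the fact that the spectral norm of a real symmetric matrix equals the supremum of its absolute quadratic form over unit vectors. Both bounds hold for every draw, so they also hold in expectation.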

4). Final Result.

Next, we apply the derived upper bounds, Eqs. (62) and (66), to $H$ and $v(\bm{Y})$ in Lemma C.1,

\begin{equation}
P\left(\left\|\hat{\mathbf{K}}_{\mathrm{sm}}-\mathbf{K}_{\mathrm{sm}}\right\|_{2}\geq\epsilon\right)\leq N\exp\left(\frac{-3\epsilon^{2}L}{2Na\left(6\left\|\mathbf{K}_{\mathrm{sm}}\right\|_{2}+3Na\sqrt{m}+8\epsilon\right)}\right) \tag{68}
\end{equation}

which completes the proof of Theorem 4.3.
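To get a feel for how the bound in Eq. (68) scales with the number of random features $L$, it can be evaluated directly. The following is a minimal sketch: the function names and all numerical values in the usage lines are hypothetical, and `k_norm` stands for $\|\mathbf{K}_{\mathrm{sm}}\|_{2}$. The second helper simply inverts Eq. (68) algebraically to find the smallest $L$ for which the right-hand side drops below a target level $\delta$.

```python
import math

def _denominator(eps, N, a, m, k_norm):
    # 2 N a (6 ||K_sm||_2 + 3 N a sqrt(m) + 8 eps), as in Eq. (68).
    return 2.0 * N * a * (6.0 * k_norm + 3.0 * N * a * math.sqrt(m) + 8.0 * eps)

def sm_rff_tail_bound(eps, N, L, a, m, k_norm):
    """Right-hand side of Eq. (68); values above 1 are vacuous."""
    return N * math.exp(-3.0 * eps**2 * L / _denominator(eps, N, a, m, k_norm))

def features_needed(eps, delta, N, a, m, k_norm):
    """Smallest integer L for which the right-hand side of Eq. (68) is <= delta."""
    exact = _denominator(eps, N, a, m, k_norm) * math.log(N / delta) / (3.0 * eps**2)
    return math.ceil(exact)

# Hypothetical settings, chosen only to illustrate the call pattern.
settings = dict(N=10, a=1.0, m=2, k_norm=5.0)
for L in (100, 1000, 10000):
    print(L, sm_rff_tail_bound(eps=1.0, L=L, **settings))
print(features_needed(eps=1.0, delta=0.05, **settings))
```

The bound decays exponentially in $L$, while the denominator grows quadratically in $N$ through the $3Na\sqrt{m}$ term, so larger datasets need more features for the same guarantee.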

Appendix D Extended Related Work

VAEs.

As a facet of model collapse, posterior collapse in variational autoencoders (VAEs) occurs when the variational posterior distribution of the latent variables approaches the prior, resulting in a failure to exploit the valuable knowledge embedded in the observed data. Numerous approaches have been proposed to tackle this issue, the most commonly adopted heuristic being annealing of the KL term in the ELBO objective (Bowman et al., 2016; Sønderby et al., 2016). Specifically, Gulrajani et al. (2016) suggest that posterior collapse is induced by a high-capacity decoder, which can map any noise vector to the desired target $\mathbf{X}$. Motivated by this hypothesis, Gulrajani et al. (2016) and Yang et al. (2017) propose reducing the capacity of the decoder to obtain better representations, albeit at the cost of reduced generative capability. Another line of work (Lucas et al., 2019; Wang & Liu, 2022; Wang et al., 2021) claims that posterior collapse is partially attributable to a suboptimal choice of likelihood variance, which aligns with our findings in the context of the Bayesian nonparametric GPLVM. Nevertheless, although these works on posterior collapse agree with our findings, the primary objective in VAEs is to improve generative capacity, whereas our emphasis lies in recovering compact and informative latent representations.

GPLVMs.

This paper focuses on GPLVMs (Lawrence, 2005), which use a GP to model the nonlinear function in an LVM, obviating the need to optimize a large number of neural network parameters while alleviating overfitting and generalization issues (Wilson & Izmailov, 2020). The seminal GPLVM was proposed by Lawrence (2005). Subsequently, Titsias & Lawrence (2010) introduced the Bayesian formulation of the GPLVM, which variationally integrates out the latent variables. However, this model is computationally efficient only with specific elementary kernel functions, such as the radial basis function (RBF) kernel (Rasmussen & Williams, 2006), which significantly constrains the model capacity of the GPLVM and leads to model collapse. Recent work has focused on enhancing the scalability and flexibility of the GPLVM (Lalchand et al., 2022; de Souza et al., 2021), as well as ensuring compatibility with various likelihoods (Ramchandran et al., 2021). Despite the relevance of these efforts, inference in these models relies on inducing-point-based sparse GPs (Titsias, 2009). This requires optimizing additional inducing points, which increases the computational burden and the risk of getting stuck in suboptimal solutions. Consequently, despite their enhanced capacity, these models often struggle to reach their theoretical potential for addressing model collapse.

Table 3: A summary of relevant LVMs, where $N$ and $M$ denote the number of observations and the observation dimension, respectively, while $U$, $m$, and $L$ represent the number of inducing points, the number of mixture components in the SM kernel, and the dimension of the random features, respectively.
Model | Scalable model | Advanced kernel | Probabilistic mapping | Bayesian inference of latent variables | Computational complexity | # parameters | Reference
VAE | - | - | - | Kingma & Welling (2019)
NBVAE | - | - | - | Zhao et al. (2020)
DCA | - | - | - | Eraslan et al. (2019)
CVQ-VAE | - | - | - | Zheng & Vedaldi (2023)
GPLVM | $\mathcal{O}(N^{3})$ | $N(N+Q)+C$ | Lawrence (2005)
BGPLVM | $\mathcal{O}(NU^{2})$ | $Q(1+U+N+NQ)+C$ | Titsias & Lawrence (2010)
GPLVM-SVI | $\mathcal{O}(MU^{3})$ | $U(M+MU+Q)+2NQ+C$ | Lalchand et al. (2022)
RFLVM | $\mathcal{O}(NM^{2}L)$ | $NQ+L(Q+M+\frac{Q^{2}}{2})+2M+C$ | Zhang et al. (2023)
advisedRFLVM | $\mathcal{O}(N(mL)^{2})$ | $Q(N+NQ+2m)+m+C$ | This work

Appendix E Experiment Details

E.1 Data Descriptions and Preprocessing

We first describe the detailed parameter settings for the two synthetic $S$-shaped datasets used in § 6.2. The datasets are generated from a GPLVM with different kernel configurations, which are listed below:

  • Dataset with RBF kernel:

    \begin{equation}
    k_{\mathrm{rbf}}(\mathbf{x},\mathbf{x}^{\prime})=\ell_{o}\exp\left(-\frac{(\mathbf{x}-\mathbf{x}^{\prime})^{2}}{2\ell_{l}^{2}}\right), \tag{69}
    \end{equation}

    with outputscale o=1subscript𝑜1\ell_{o}=1roman_ℓ start_POSTSUBSCRIPT italic_o end_POSTSUBSCRIPT = 1 and lengthscale l=1subscript𝑙1\ell_{l}=1roman_ℓ start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT = 1.

  • Dataset with a hybrid (RBF+periodic) kernel:

    \begin{align}
    k_{\mathrm{hybrid}}(\mathbf{x},\mathbf{x}^{\prime}) &= k_{\mathrm{rbf}}(\mathbf{x},\mathbf{x}^{\prime})+k_{\mathrm{periodic}}(\mathbf{x},\mathbf{x}^{\prime}), \tag{70a}\\
    k_{\mathrm{rbf}}(\mathbf{x},\mathbf{x}^{\prime}) &= \ell_{o}\exp\left(-\frac{(\mathbf{x}-\mathbf{x}^{\prime})^{2}}{2\ell_{l}^{2}}\right), && \text{with } \ell_{o}=0.5,\ \ell_{l}=1; \tag{70b}\\
    k_{\mathrm{periodic}}(\mathbf{x},\mathbf{x}^{\prime}) &= \ell_{o}\exp\left(-\frac{2\sin^{2}\left(\frac{\mathbf{x}-\mathbf{x}^{\prime}}{p}\right)}{\ell_{l}^{2}}\right), && \text{with } \ell_{o}=0.5,\ \ell_{l}=1,\ p=4.5. \tag{70c}
    \end{align}
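
As a minimal sketch, the kernels in Eqs. (69) and (70) can be written in NumPy as follows. The function names, the treatment of multi-dimensional latent inputs (squared differences and $\sin^{2}$ terms summed over latent dimensions), the jitter value, and the sampling helper `sample_gp_outputs` are illustrative assumptions; the construction of the $S$-shaped latent inputs themselves is not specified here and is therefore not shown.

```python
import numpy as np


def sq_dists(X, Z):
    """Pairwise squared Euclidean distances between rows of X (N, Q) and Z (P, Q)."""
    diff = X[:, None, :] - Z[None, :, :]
    return np.sum(diff ** 2, axis=-1)


def k_rbf(X, Z, outputscale=1.0, lengthscale=1.0):
    """RBF kernel of Eq. (69); defaults are the values stated for the RBF dataset."""
    return outputscale * np.exp(-sq_dists(X, Z) / (2.0 * lengthscale ** 2))


def k_periodic(X, Z, outputscale=0.5, lengthscale=1.0, period=4.5):
    """Periodic kernel of Eq. (70c).

    Assumption: for Q > 1 the sin^2 term is summed over latent dimensions.
    """
    diff = X[:, None, :] - Z[None, :, :]
    s = np.sum(np.sin(diff / period) ** 2, axis=-1)
    return outputscale * np.exp(-2.0 * s / lengthscale ** 2)


def k_hybrid(X, Z):
    """Hybrid kernel of Eq. (70a) with the parameter values of Eqs. (70b)-(70c)."""
    return k_rbf(X, Z, outputscale=0.5, lengthscale=1.0) + k_periodic(X, Z)


def sample_gp_outputs(X, kernel, n_outputs, jitter=1e-6, seed=0):
    """Draw n_outputs independent zero-mean GP functions at the latent points X.

    Hypothetical helper: the jitter and the noise-free sampling are assumptions.
    """
    rng = np.random.default_rng(seed)
    K = kernel(X, X) + jitter * np.eye(X.shape[0])
    chol = np.linalg.cholesky(K)
    return chol @ rng.standard_normal((X.shape[0], n_outputs))
```

For example, `sample_gp_outputs(X, k_hybrid, n_outputs=M)` would map an `(N, Q)` array of latent points to an `(N, M)` array of synthetic observations under these assumptions.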

Next, we describe the real-world datasets. Large-scale datasets are downsampled to a smaller size to accommodate the high computational complexity of RFLVM (Gundersen et al., 2021).

  • bridges: We use the daily count of bicycles crossing each of the four East River bridges in New York City (https://data.cityofnewyork.us/Transportation/Bicycle-Counts-for-East-River-Bridges/gua4-p9wg). Because the dataset has no explicit labels, we assign binary weekday-versus-weekend labels, since bicycle counts vary systematically between weekdays and weekends.

  • cifar-10: To create a final dataset of size 2000, we subsampled 400 images from each of the classes [airplane, automobile, bird, cat, deer]. These images were then resized from $32\times 32$ pixels to $20\times 20$ pixels and converted to grayscale. Test performance of different models on the full dataset can be found in § E.4.4.

  • mnist: The dataset size was reduced by randomly selecting $1000$ images. Test performance of different models on the full dataset can be found in § E.4.4.

  • montreal: We analyze the daily count of cyclists on eight bicycle lanes in Montreal (http://donnees.ville.montreal.qc.ca/dataset/f170fecc-18db-44bc-b4fe-5b0b6d2c7297/resource/64c26fd3-0bdf-45f8-92c6-715a9c852a7b). Given the absence of explicit labels, we use the four seasons as labels, as seasonality is correlated with bicycle counts.

  • newsgroups: We use the 20 Newsgroups Dataset (http://qwone.com/~jason/20Newsgroups/), with classes limited to comp.sys.mac.hardware, sci.med, and alt.atheism. The vocabulary is restricted to words with document frequencies in the range of $10\%$ to $90\%$.

  • yale: We use the Yale Faces Dataset (http://vision.ucsd.edu/content/yale-face-database), with subject IDs as labels.

  • brendan: This dataset comprises 2000 images, each with a size of $20\times 28$ pixels, depicting the face of Brendan (https://cs.nyu.edu/~roweis/data/frey_rawface.mat).
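
The preprocessing steps described above can be sketched as follows. This is a minimal illustration rather than the pipeline actually used: all function names are hypothetical, the grayscale conversion by plain channel averaging is an assumption, the inclusive document-frequency bounds are an assumption, and image resizing is omitted because the interpolation method is not stated.

```python
import datetime as dt

import numpy as np


def subsample_per_class(labels, classes, n_per_class, seed=0):
    """Return sorted indices with n_per_class randomly chosen items per class."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    picked = [
        rng.choice(np.flatnonzero(labels == c), size=n_per_class, replace=False)
        for c in classes
    ]
    return np.sort(np.concatenate(picked))


def to_grayscale(images):
    """Average the channel axis of an (N, H, W, 3) array.

    Assumption: equal channel weights; the weights used in the paper are unknown.
    """
    return np.asarray(images, dtype=float).mean(axis=-1)


def weekend_labels(dates):
    """Binary labels: 1 for Saturday or Sunday, 0 for weekdays."""
    return np.array([1 if d.weekday() >= 5 else 0 for d in dates])


def filter_by_document_frequency(counts, low=0.10, high=0.90):
    """Keep vocabulary columns whose document frequency lies in [low, high].

    counts is a (documents, vocabulary) array of word counts.
    """
    counts = np.asarray(counts)
    doc_freq = (counts > 0).mean(axis=0)
    keep = (doc_freq >= low) & (doc_freq <= high)
    return counts[:, keep], keep
```

Under these assumptions, `subsample_per_class(labels, classes=[0, 1, 2, 3, 4], n_per_class=400)` would select 2000 image indices in the manner described for cifar-10, given integer class labels.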

E.2 Benchmark Methods Descriptions

E.3 Default Hyperparameter Configurations

Table 4: Default hyperparameter settings.
parameter | value
# mixture densities in SM kernel ($m$) | $2$
dim. of random feature ($L$) | $50$
dim. of latent space ($Q$) | $2$
optimizer | Adam (Kingma & Ba, 2014)
learning rate | $0.005$
beta | $(0.9, 0.99)$
# iterations | $10000$

Tab. 4 lists the default hyperparameter settings used by advisedRFLVM. These settings are used in most experiments; the exception is the experiment shown on the left side of Fig. 2, where the dimensionality of the latent space is set to $50$ to illustrate the variability in the number of zero-columns within the latent variables.
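
For concreteness, the settings in Tab. 4 can be collected in a small configuration object, together with a plain NumPy version of the Adam update (Kingma & Ba, 2014) showing where the learning rate, the beta pair, and the iteration count enter. The names `DefaultConfig` and `adam_minimize`, the value of `eps`, and the toy quadratic objective in the usage note are illustrative assumptions; this is not the training code of advisedRFLVM.

```python
from dataclasses import dataclass

import numpy as np


@dataclass(frozen=True)
class DefaultConfig:
    """Default settings from Tab. 4 (field names are hypothetical)."""
    num_mixtures: int = 2            # m
    num_random_features: int = 50    # L
    latent_dim: int = 2              # Q
    learning_rate: float = 0.005
    betas: tuple = (0.9, 0.99)
    num_iterations: int = 10000


def adam_minimize(grad_fn, x0, cfg, eps=1e-8):
    """Run Adam for cfg.num_iterations steps using gradients from grad_fn."""
    x = np.asarray(x0, dtype=float).copy()
    m = np.zeros_like(x)  # first-moment estimate
    v = np.zeros_like(x)  # second-moment estimate
    beta1, beta2 = cfg.betas
    for t in range(1, cfg.num_iterations + 1):
        g = grad_fn(x)
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g ** 2
        m_hat = m / (1.0 - beta1 ** t)  # bias correction
        v_hat = v / (1.0 - beta2 ** t)
        x = x - cfg.learning_rate * m_hat / (np.sqrt(v_hat) + eps)
    return x
```

As a usage example, `adam_minimize(lambda x: 2.0 * (x - 3.0), np.zeros(2), DefaultConfig())` minimizes the toy objective $\sum_i (x_i-3)^2$ with the default settings.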

E.4 Additional Results

E.4.1 S-shaped Latent Manifold Estimation

Figure 4: (Left) $R^{2}$ against the number of mixture densities in the SM kernel ($m$). (Right) $R^{2}$ versus the dimensionality of the random feature ($L/2$).

To validate the rationale behind our parameter selection, this section evaluates advisedRFLVM in terms of manifold visualization and $R^{2}$ scores across various values of $m$ and $L/2$. Fig. 4 reports the $R^{2}$ scores, and Fig. 5 and Fig. 6 visualize the latent manifolds recovered by advisedRFLVM. The results indicate that choosing $m=2$ and $L/2=50$ yields the lowest computational complexity while maintaining comparable performance.
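
As a rough sketch of how an $R^{2}$ score between a recovered latent manifold and the ground-truth one might be computed, the snippet below first aligns the estimate to the ground truth with an affine least-squares map and then evaluates the coefficient of determination. The alignment step and the function name `aligned_r2` are assumptions made for illustration; the exact evaluation protocol is not restated in this appendix.

```python
import numpy as np


def aligned_r2(X_true, X_est):
    """R^2 of X_true regressed on X_est via an affine least-squares map.

    X_true and X_est are (N, Q) arrays. Because latent variables are only
    identifiable up to transformations, an affine alignment is assumed here.
    """
    X_true = np.asarray(X_true, dtype=float)
    X_est = np.asarray(X_est, dtype=float)
    # Append a column of ones so the fitted map includes an offset.
    A = np.hstack([X_est, np.ones((X_est.shape[0], 1))])
    coef, *_ = np.linalg.lstsq(A, X_true, rcond=None)
    residual = X_true - A @ coef
    ss_res = np.sum(residual ** 2)
    ss_tot = np.sum((X_true - X_true.mean(axis=0)) ** 2)
    return 1.0 - ss_res / ss_tot
```

Under this definition, an estimate that differs from the ground truth only by an invertible affine transformation scores close to $1$, whereas an unrelated estimate scores close to $0$.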

E.4.2 Missing Data Imputation

To illustrate the capability of advisedRFLVM in missing data imputation, Fig. 7 and Fig. 8 visualize the reconstructed observed data, highlighting its ability to restore missing pixels.

Figure 5 (panels (a) to (l)): Latent manifold learning results with $L/2=25$ and different $m$.
Figure 6 (panels (a) to (l)): Latent manifold learning results with $m=2$ and different $L$.
Figure 7 (panels (a) to (d)): MNIST reconstruction task with missing pixels. From left to right: ground truth, training images, reconstructions.
Figure 8 (panels (a) to (d)): Brendan faces reconstruction task with missing pixels. From left to right: ground truth, training images, reconstructions.

E.4.3 KNN Classification Accuracy with Varying $K$

Table 5: KNN classification accuracy using different numbers of nearest neighbors ($K$ values). We ran this classification using 5-fold cross validation.
methods | VAE | advisedRFLVM
$K$-value | 1 2 3 4 5 6 7 8 9 10 | 1 2 3 4 5 6 7 8 9 10
bridges | 0.780 0.794 0.766 0.789 0.794 0.799 0.804 0.808 0.780 0.776 | 0.846 0.846 0.902 0.902 0.907 0.888 0.893 0.898 0.879 0.903
cifar | 0.256 0.260 0.266 0.274 0.280 0.282 0.291 0.296 0.293 0.300 | 0.300 0.310 0.309 0.340 0.335 0.342 0.350 0.357 0.365 0.358
mnist | 0.631 0.614 0.657 0.646 0.677 0.670 0.674 0.671 0.673 0.669 | 0.801 0.780 0.819 0.824 0.823 0.813 0.812 0.800 0.802 0.800
montreal | 0.649 0.655 0.683 0.662 0.699 0.705 0.718 0.718 0.696 0.712 | 0.799 0.759 0.796 0.802 0.815 0.787 0.777 0.755 0.768 0.759
yale | 0.667 0.667 0.672 0.667 0.636 0.630 0.600 0.576 0.558 0.552 | 0.757 0.703 0.721 0.745 0.727 0.691 0.685 0.673 0.655 0.642
newsgroups | 0.381 0.389 0.384 0.397 0.402 0.409 0.406 0.410 0.399 0.404 | 0.401 0.403 0.399 0.412 0.419 0.408 0.414 0.414 0.426 0.424
methods | BGPLVM | RFLVM
$K$-value | 1 2 3 4 5 6 7 8 9 10 | 1 2 3 4 5 6 7 8 9 10
bridges | 0.836 0.808 0.813 0.794 0.818 0.808 0.837 0.832 0.832 0.813 | 0.860 0.859 0.869 0.892 0.869 0.869 0.883 0.888 0.883 0.887
cifar | 0.262 0.278 0.279 0.293 0.291 0.295 0.282 0.290 0.288 0.294 | 0.270 0.288 0.288 0.288 0.306 0.310 0.320 0.326 0.331 0.333
mnist | 0.573 0.585 0.611 0.622 0.627 0.636 0.640 0.645 0.652 0.648 | 0.592 0.567 0.591 0.603 0.634 0.624 0.638 0.634 0.633 0.633
montreal | 0.752 0.771 0.759 0.771 0.787 0.778 0.800 0.800 0.793 0.787 | 0.778 0.818 0.819 0.844 0.809 0.831 0.806 0.815 0.806 0.790
yale | 0.558 0.515 0.545 0.558 0.527 0.503 0.503 0.467 0.485 0.461 | 0.576 0.515 0.612 0.564 0.564 0.576 0.576 0.558 0.582 0.576
newsgroups | 0.388 0.374 0.406 0.397 0.392 0.395 0.404 0.397 0.396 0.403 | 0.404 0.394 0.411 0.425 0.412 0.413 0.417 0.424 0.416 0.426
Table 6: KNN classification accuracy using different numbers of nearest neighbors ($K$ values) on larger datasets. We ran this classification using 5-fold cross validation.
methods | VAE | advisedRFLVM
$K$-value | 1 2 3 4 5 6 7 8 9 10 | 1 2 3 4 5 6 7 8 9 10
f-cifar | 0.157 0.151 0.157 0.166 0.174 0.178 0.180 0.187 0.188 0.190 | 0.172 0.161 0.177 0.181 0.194 0.199 0.203 0.209 0.213 0.214
fd-cifar | 0.263 0.266 0.279 0.285 0.293 0.297 0.302 0.304 0.309 0.312 | 0.321 0.323 0.344 0.359 0.368 0.370 0.377 0.384 0.390 0.391
f-mnist | 0.728 0.728 0.756 0.766 0.774 0.775 0.778 0.782 0.782 0.783 | 0.794 0.796 0.831 0.838 0.845 0.847 0.850 0.852 0.851 0.852
methods | BGPLVM | Isomap
$K$-value | 1 2 3 4 5 6 7 8 9 10 | 1 2 3 4 5 6 7 8 9 10
f-cifar | 0.138 0.132 0.140 0.145 0.154 0.156 0.159 0.161 0.162 0.163 | 0.144 0.142 0.147 0.157 0.159 0.163 0.165 0.168 0.170 0.173
fd-cifar | 0.250 0.260 0.258 0.264 0.279 0.277 0.281 0.285 0.287 0.287 | 0.264 0.270 0.279 0.287 0.291 0.292 0.296 0.301 0.305 0.305
f-mnist | 0.414 0.420 0.433 0.449 0.455 0.464 0.466 0.470 0.473 0.474 | 0.456 0.468 0.493 0.504 0.514 0.524 0.529 0.534 0.535 0.540
methods | NBVAE | CVQ-VAE
$K$-value | 1 2 3 4 5 6 7 8 9 10 | 1 2 3 4 5 6 7 8 9 10
f-cifar | 0.134 0.137 0.140 0.147 0.152 0.157 0.156 0.155 0.161 0.162 | 0.101 0.098 0.099 0.101 0.102 0.102 0.101 0.100 0.098 0.096
fd-cifar | 0.252 0.248 0.255 0.264 0.273 0.277 0.282 0.287 0.291 0.292 | 0.203 0.201 0.200 0.199 0.201 0.200 0.201 0.199 0.197 0.200
f-mnist | 0.502 0.502 0.533 0.548 0.557 0.566 0.571 0.577 0.579 0.582 | 0.104 0.107 0.107 0.104 0.105 0.106 0.102 0.103 0.103 0.103

Tab. 5 reports the KNN results for ten different choices of $K$, where the setting $K=1$ matches the configuration used by Gundersen et al. (2021). The results show that our method outperforms the benchmarks on most datasets regardless of the value of $K$. In the exceptional cases, advisedRFLVM still achieves performance comparable to RFLVM on some relatively simple datasets, e.g., the bridges, montreal, and newsgroups datasets.
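
The evaluation protocol behind Tab. 5 and Tab. 6 (KNN classification on the learned latent representations with 5-fold cross validation) can be sketched in plain NumPy as below. The function names, the Euclidean distance, the unstratified random fold assignment, and the tie-breaking rule (smallest label wins) are assumptions made for illustration; labels are assumed to be non-negative integers, and the dense distance matrix is only suitable for small datasets.

```python
import numpy as np


def knn_predict(X_train, y_train, X_test, k):
    """Predict labels by majority vote among the k nearest training points."""
    # Pairwise squared Euclidean distances, shape (n_test, n_train).
    d = ((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=-1)
    nearest = np.argsort(d, axis=1)[:, :k]
    votes = y_train[nearest]
    # bincount(...).argmax() breaks ties in favor of the smallest label.
    return np.array([np.bincount(row).argmax() for row in votes])


def knn_cv_accuracy(X, y, k, n_folds=5, seed=0):
    """Mean KNN accuracy over n_folds random (unstratified) folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    accuracies = []
    for i in range(n_folds):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != i])
        pred = knn_predict(X[train], y[train], X[test], k)
        accuracies.append(np.mean(pred == y[test]))
    return float(np.mean(accuracies))
```

Given a latent array `X_latent` of shape `(N, Q)` and integer labels `y` (both hypothetical names), `[knn_cv_accuracy(X_latent, y, k) for k in range(1, 11)]` would produce one row of such a table under these assumptions.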

E.4.4 Larger Datasets Extension

To ensure a fair evaluation of deep learning methods, such as the various VAE variants, we conducted comprehensive comparisons on larger datasets, including the full mnist and cifar datasets. The results are summarized in Tab. 6, where f-cifar and f-mnist denote the full cifar and mnist datasets, respectively, and fd-cifar denotes the full cifar dataset with each image downsampled to $20\times 20$ pixels. Our empirical results show a significant performance improvement for both VAE and advisedRFLVM when applied to larger datasets. Notably, advisedRFLVM consistently outperforms the other benchmarks across datasets of varying sizes, highlighting its advantage over state-of-the-art variants irrespective of the dataset size.