
Towards a Scalable Reference-Free Evaluation of Generative Models

Azim Ospanov, Jingwei Zhang, Mohammad Jalali,
Xuenan Cao, Andrej Bogdanov, Farzan Farnia

Department of Computer Science and Engineering, The Chinese University of Hong Kong, aospanov9@cse.cuhk.edu.hk
Department of Computer Science and Engineering, The Chinese University of Hong Kong, jwzhang22@cse.cuhk.edu.hk
Department of Electrical and Computer Engineering, Isfahan University of Technology, mjalali@ec.iut.ac.ir
Department of Cultural and Religious Studies, The Chinese University of Hong Kong, xuenancao@cuhk.edu.hk
School of Electrical Engineering and Computer Science, University of Ottawa, abogdano@uottawa.ca
Department of Computer Science and Engineering, The Chinese University of Hong Kong, farnia@cse.cuhk.edu.hk

Abstract

While standard evaluation scores for generative models are mostly reference-based, a reference-dependent assessment of generative models could be generally difficult due to the unavailability of applicable reference datasets. Recently, the reference-free entropy scores, VENDI [1] and RKE [2], have been proposed to evaluate the diversity of generated data. However, estimating these scores from data leads to significant computational costs for large-scale generative models. In this work, we leverage the random Fourier features framework to reduce the computational cost and propose the Fourier-based Kernel Entropy Approximation (FKEA) method. We utilize FKEA’s approximated eigenspectrum of the kernel matrix to efficiently estimate the mentioned entropy scores. Furthermore, we show the application of FKEA’s proxy eigenvectors to reveal the method’s identified modes in evaluating the diversity of produced samples. We provide a stochastic implementation of the FKEA assessment algorithm with a complexity $O(n)$ growing linearly with the sample size $n$. We extensively evaluate FKEA’s numerical performance in application to standard image, text, and video datasets. Our empirical results indicate the method’s scalability and interpretability applied to large-scale generative models. The codebase is available at https://github.com/aziksh-ospanov/FKEA.

1 Introduction

A quantitative comparison of generative models requires evaluation metrics to measure the quality and diversity of the models’ produced data. Since the introduction of variational autoencoders (VAEs) [3], generative adversarial networks (GANs) [4], and diffusion models [5] that led to impressive empirical results in the last decade, several evaluation scores have been proposed to assess generative models learned by different training methods and architectures. Due to the key role of evaluation criteria in comparing generative models, they have been extensively studied in the literature.

While various statistical methods have been applied to measure the fidelity and variety of a generative model’s produced data, the standard scores commonly perform a reference-based evaluation of generative models, i.e., they quantify the characteristics of generated samples in comparison to a reference distribution. The reference distribution is usually chosen to be either the distribution of samples in the test data partition or a comprehensive dataset containing a significant fraction of real-world sample types, e.g. ImageNet [6] for evaluating image-based generative models.

To provide well-known examples of reference-dependent metrics, note that the distance scores, Fréchet Inception Distance (FID) [7] and Kernel Inception Distance (KID) [8], are explicitly reference-based, measuring the distance between the generative and reference distributions. Similarly, the standard quality/diversity score pairs, Precision/Recall [9, 10] and Density/Coverage [11], perform the evaluation in comparison to a reference dataset. Even the seemingly reference-free Inception Score (IS) [12] can be viewed as implicitly reference-based, since it quantifies the variety and fidelity of data based on the labels and confidence scores assigned by an ImageNet pre-trained neural net, where ImageNet implicitly plays the role of the reference dataset. The reference-based nature of these evaluation scores is desired in many instances including standard image-based generative models, where either a sufficiently large test set or a comprehensive reference dataset such as ImageNet is available for the reference-based evaluation.

On the other hand, a reference-based assessment of generative models may not always be feasible, because the selection of a reference distribution may be challenging in a general learning scenario. For example, in prompt-based generative models where the data are created in response to a user’s input text prompts, the generated data could follow an a priori unknown distribution depending on the specific distribution of the user’s input prompts. Figure 1 illustrates colorful elephant images generated from text prompts. Each elephant image corresponds to an unusual color choice, such as luminous yellow or neon pink. Even though a reference dataset such as ImageNet [6] contains elephants, its images do not feature additional elements such as neon colors. A proper reference-based evaluation of every user’s generated data would require a distinct reference dataset, which may not be available to the user at assessment time. Moreover, finding a comprehensive text or video dataset to serve as the reference set would be more difficult than for image data, because the greater length of text and video samples could significantly contribute to their variety, requiring an impractically large reference set to cover all text or video sample types.

Figure 1: Reference-free diversity evaluation of "colorful elephant" samples generated by text-to-image StableDiffusion [13]. FKEA’s top-4 sample clusters and approximated VENDI and RKE are shown.

The discussed challenging scenarios of conducting a reference-based evaluation highlight the need for reference-free assessment methods that remain functional in the absence of a reference dataset. Recently, entropy-based diversity evaluation scores, the VENDI metric family [1, 14] and the RKE score [2], have been proposed to address the need for reference-free assessment metrics. These scores calculate the entropy of the eigenvalues of a kernel similarity matrix for the generated data. Based on the theoretical results in [2], the evaluation process of these scores can be interpreted as an unsupervised identification of the generative model’s produced sample clusters, followed by the entropy calculation for the frequencies of the detected clusters.

While the VENDI and RKE entropy scores provide reference-free assessments of generative models, estimating these scores from generated data could incur significant computational costs. In this work, we show that computing the precise RKE and VENDI scores would require at least $\Omega(n^2)$ and $\Omega(n^{2.376})$ computations for a sample size $n$, respectively.¹ While the randomized projection methods in [15, 1] can reduce the computational costs to $O(n^2)$ for a general $\mathrm{VENDI}_\alpha$ score, the quadratic growth would be a barrier to the method’s application to large $n$ values. Although the computational expenses could be reduced by limiting the sample size $n$, an insufficient sample size would lead to significant error in estimating the entropy scores.

¹ This computational complexity is the minimum known achievable cost for multiplying $n \times n$ matrices, which we prove to lower-bound the complexity of computing matrix-based entropy scores.

To overcome the challenges of computing the scores, we leverage the random Fourier features (RFFs) framework [16] and develop a scalable entropy-based evaluation method that can be efficiently applied to large sample sizes. Our proposed method, Fourier-based Kernel Entropy Approximation (FKEA), is designed to approximate the kernel covariance matrix using the RFFs drawn from the Fourier transform-inverse of a target shift-invariant kernel. We prove that using a Fourier feature size $r = \mathcal{O}\bigl(\frac{\log n}{\epsilon^2}\bigr)$, FKEA computes the eigenspace of the kernel matrix within an $\epsilon$-bounded error. Furthermore, we demonstrate the application of the eigenvectors of FKEA’s proxy kernel matrix for identifying the sample clusters used in the reference-free evaluation of entropic diversity.

Finally, we present numerical results of the entropy-based evaluation of standard generative models using the baseline eigendecomposition and our proposed FKEA methods. In our experiments, the baseline spectral decomposition algorithm could not efficiently scale to sample sizes above a few tens of thousands. On the other hand, our stochastic implementation of the FKEA method scaled to large sample sizes. Utilizing the standard embeddings of image, text, and video data, we tested the FKEA assessment while computing the sample clusters and their frequencies in application to large-scale datasets and generative models. Here is a summary of our work’s main contributions:

  • Characterizing the computational complexity of the kernel entropy scores of generative models,

  • Developing the Fourier-based FKEA method to approximate the kernel covariance eigenspace and entropy of generated data,

  • Proving guarantees on FKEA’s required size of random Fourier features indicating a complexity logarithmically growing with the dataset size,

  • Providing numerical results on FKEA’s reference-free assessment of large-scale image-, text-, and video-based datasets and generative models.

2 Related Work

Evaluation of deep generative models. The assessment of generative models has been widely studied in the literature. The existing scores either quantify a distance between the distributions of real and generated data, as in FID [7] and KID [8] scores, or attempt to measure the quality and diversity of the trained generative models, including the Inception Score [12], quality/diversity metric pairs Precision/Recall [9, 10] and Density/Coverage [11]. The mentioned scores are reference-based, while in this work we focus on reference-free metrics. Also, we note that the evaluation of memorization and novelty has received great attention, and several scores including the authenticity score [17], the feature likelihood divergence [18], and the rarity score [19] have been proposed to quantify the generalizability and novelty of generated samples. Note that the evaluation of novelty and generalization is, by nature, reference-based. On the other hand, our study focuses on the diversity of data which can be evaluated in a reference-free way as discussed in [1, 2].

Role of embedding in quantitative evaluation. Following the discussion in [20], we utilize DinoV2 [21] image embeddings in most of our image experiments, as the results of [20] indicate that DinoV2 can yield scores more aligned with the human notion of diversity. As noted in [22], it is possible to utilize other non-ImageNet feature spaces such as CLIP [23] and SwAV [24], as opposed to InceptionV3 [25], to further improve metrics such as FID. In this work, we mainly focus on the DinoV2 feature space, while we note that other feature spaces are also compatible with entropy-based diversity evaluation.

Diversity assessment for text-based models. To quantify the diversity of text data, the n-gram-based methods are commonly used in the literature. A well-known metric is the BLEU score [26], which is based on the geometric average of n-gram precision scores times the Brevity Penalty. To adapt BLEU score to measure text diversity, [27] proposes the Self-BLEU score, calculating the average BLEU score of various generated samples. To further isolate and measure diversity, N-Gram Diversity scores [28, 29, 30] were proposed and defined by a ratio between the number of unique n-grams and overall number of n-grams in the text. Other prominent metrics include Homogenization (ROUGE-L) [31], FBD [32] and Compression Ratios [33].
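As a concrete illustration of the N-gram diversity ratio described above, the following minimal sketch counts unique versus total n-grams. The function name `ngram_diversity` and the whitespace tokenization are assumptions made for illustration only; the cited works may tokenize and aggregate differently.

```python
def ngram_diversity(text: str, n: int = 2) -> float:
    """Ratio of unique n-grams to total n-grams (hypothetical helper)."""
    tokens = text.split()  # assumption: simple whitespace tokenization
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0  # text shorter than n tokens has no n-grams
    return len(set(ngrams)) / len(ngrams)


repetitive = "the cat sat the cat sat the cat sat"
varied = "a quick brown fox jumps over the lazy dog"
print(ngram_diversity(repetitive), ngram_diversity(varied))
```

A repetitive text yields a ratio well below one, while a text with no repeated n-grams yields exactly one.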

Kernel PCA, Spectral Clustering, and Random Fourier Features. Kernel PCA [34] is a well-studied method of dimensionality reduction that utilizes the eigendecomposition of the kernel matrix, similar to the kernel-based diversity evaluation methods in [1, 2]. The related papers [35, 36] study the connections between kernel PCA and spectral clustering. Also, the analysis of random Fourier features [16] for performing scalable kernel PCA has been studied in [37, 38, 39, 40, 41]. We note that while the mentioned works characterize the complexity of estimating the eigenvectors, our analysis focuses on the complexity of computing the kernel matrix’s eigenvalues via Fourier features, as we primarily seek to quantify the diversity of generated data using the kernel matrix’s eigenvalues.

3 Preliminaries

Consider a generative model $\mathcal{G}$ generating random samples $\mathbf{x}_1,\ldots,\mathbf{x}_n \in \mathbb{R}^d$ following the model’s probability distribution $P_{\mathcal{G}}$. In our analysis, we assume the $n$ generated samples are independently drawn from $P_{\mathcal{G}}$. Note that in VAEs [3] and GANs [4], the generative model $\mathcal{G}$ is a deterministic function $G:\mathbb{R}^r \rightarrow \mathbb{R}^d$ mapping an $r$-dimensional latent random vector $\mathbf{Z} \sim P_Z$ from a known distribution $P_Z$ to $G(\mathbf{Z})$ distributed according to $P_{\mathcal{G}}$. On the other hand, in diffusion models, $\mathcal{G}$ represents an iterative random process that generates a sample from $P_{\mathcal{G}}$. The goal of a sample-based diversity evaluation of generative model $\mathcal{G}$ is to quantify the variety of its generated data $\mathbf{x}_1,\ldots,\mathbf{x}_n$.

3.1 Kernel Function, Kernel Covariance Matrix, and Matrix-based Rényi Entropy

Following standard definitions, $k:\mathbb{R}^d \times \mathbb{R}^d \rightarrow \mathbb{R}$ is called a kernel function if for every integer $n \in \mathbb{N}$ and set of inputs $\mathbf{x}_1,\ldots,\mathbf{x}_n \in \mathbb{R}^d$, the kernel similarity matrix $K = \bigl[k(\mathbf{x}_i,\mathbf{x}_j)\bigr]_{n\times n}$ is positive semi-definite. We call a kernel function $k$ normalized if for every input $\mathbf{x}$ we have $k(\mathbf{x},\mathbf{x}) = 1$. A well-known example of a normalized kernel function is the Gaussian kernel $k_{\text{Gaussian}(\sigma^2)}$ with bandwidth parameter $\sigma^2$ defined as

$$k_{\text{Gaussian}(\sigma^2)}\bigl(\mathbf{x},\mathbf{x}'\bigr) \,:=\, \exp\Bigl(-\frac{\bigl\|\mathbf{x}-\mathbf{x}'\bigr\|_2^2}{2\sigma^2}\Bigr)$$
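The Gaussian kernel definition can be checked numerically. The sketch below builds the kernel similarity matrix for toy data and verifies the two properties stated above: unit diagonal (normalization) and positive semi-definiteness. The helper name `gaussian_kernel_matrix`, the random data, and the bandwidth value are illustrative assumptions, not taken from the paper’s codebase.

```python
import numpy as np


def gaussian_kernel_matrix(X: np.ndarray, sigma: float) -> np.ndarray:
    """Return K[i, j] = exp(-||x_i - x_j||^2 / (2 sigma^2)) for the rows of X."""
    sq_norms = np.sum(X ** 2, axis=1)
    # Pairwise squared distances via ||a||^2 + ||b||^2 - 2 a.b
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    sq_dists = np.maximum(sq_dists, 0.0)  # clip tiny negatives from round-off
    return np.exp(-sq_dists / (2.0 * sigma ** 2))


rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))               # toy data, not from the paper
K = gaussian_kernel_matrix(X, sigma=2.0)

print(np.allclose(np.diag(K), 1.0))        # normalized kernel: k(x, x) = 1
print(np.linalg.eigvalsh(K).min() > -1e-8) # PSD up to numerical round-off
```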

For every kernel function $k$, there exists a feature map $\phi:\mathbb{R}^d \rightarrow \mathbb{R}^m$ such that $k(\mathbf{x},\mathbf{x}') = \langle \phi(\mathbf{x}), \phi(\mathbf{x}') \rangle$ is the inner product of the $m$-dimensional feature maps $\phi(\mathbf{x})$ and $\phi(\mathbf{x}')$. Given a kernel $k$ with feature map $\phi$, we define the kernel covariance matrix $C_X \in \mathbb{R}^{m\times m}$ of a distribution $P_X$ as

$$C_X \,:=\, \mathbb{E}_{\mathbf{X}\sim P_X}\Bigl[\phi(\mathbf{X})\phi(\mathbf{X})^\top\Bigr] \,=\, \int p_X(\mathbf{x})\,\phi(\mathbf{x})\phi(\mathbf{x})^\top \,\mathrm{d}\mathbf{x}$$

The above matrix $C_X$ is positive semi-definite with non-negative eigenvalues. Furthermore, assuming a normalized kernel $k$, it can be seen that the eigenvalues of $C_X$ will add up to $1$ (i.e., it has unit trace $\mathrm{Tr}(C_X) = 1$), providing a probability model. Therefore, one can consider the entropy of $C_X$’s eigenvalues as a quantification of the diversity of distribution $P_X$ based on the kernel similarity score $k$. Here, we review the general family of Rényi entropy used to define VENDI and RKE scores.
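The unit-trace property follows from the linearity of trace and expectation together with the feature-map identity. The short derivation below is a standard argument spelled out for convenience, assuming a normalized kernel with $k(\mathbf{x},\mathbf{x}) = 1$:

$$\mathrm{Tr}(C_X) \,=\, \mathbb{E}_{\mathbf{X}\sim P_X}\Bigl[\mathrm{Tr}\bigl(\phi(\mathbf{X})\phi(\mathbf{X})^\top\bigr)\Bigr] \,=\, \mathbb{E}_{\mathbf{X}\sim P_X}\Bigl[\phi(\mathbf{X})^\top\phi(\mathbf{X})\Bigr] \,=\, \mathbb{E}_{\mathbf{X}\sim P_X}\bigl[k(\mathbf{X},\mathbf{X})\bigr] \,=\, 1.$$

Since the eigenvalues are non-negative and sum to one, they can be read as a probability vector.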

Definition 1.

For a positive semi-definite matrix $C_X \in \mathbb{R}^{m\times m}$ with eigenvalues $\lambda_1,\ldots,\lambda_m$, the order-$\alpha$ Rényi entropy $H_\alpha(C_X)$ for $\alpha > 0$ is defined as

$$H_\alpha(C_X) \,:=\, \frac{1}{1-\alpha}\log\Bigl(\,\sum_{i=1}^{m}\lambda_i^{\alpha}\,\Bigr)$$
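To make Definition 1 concrete, the sketch below evaluates the order-$\alpha$ entropy on two toy spectra. The function name `renyi_entropy` and the example eigenvalue vectors are assumptions for illustration. The formula is undefined at $\alpha = 1$, so the code falls back to the standard $\alpha \to 1$ limit (the Shannon entropy of the eigenvalues), which is a well-known fact not stated in the definition itself.

```python
import numpy as np


def renyi_entropy(eigvals: np.ndarray, alpha: float) -> float:
    """Order-alpha Rényi entropy of a non-negative spectrum that sums to one."""
    lam = np.asarray(eigvals, dtype=float)
    lam = lam[lam > 0]                    # zero eigenvalues contribute nothing
    if np.isclose(alpha, 1.0):
        # alpha -> 1 limit: Shannon entropy of the eigenvalues
        return float(-np.sum(lam * np.log(lam)))
    return float(np.log(np.sum(lam ** alpha)) / (1.0 - alpha))


uniform = np.full(4, 0.25)                    # four equally weighted modes
skewed = np.array([0.97, 0.01, 0.01, 0.01])   # one dominant mode

print(np.exp(renyi_entropy(uniform, 2.0)))    # effective number of modes
print(np.exp(renyi_entropy(skewed, 2.0)))
```

Exponentiating the entropy gives an effective mode count: the uniform spectrum counts all four modes, while the skewed spectrum counts barely more than one.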

To estimate the entropy scores from finite empirical samples $\mathbf{x}_1,\ldots,\mathbf{x}_n$, we consider the empirical kernel covariance matrix $\widehat{C}_X$ defined as $\widehat{C}_X := \frac{1}{n}\sum_{i=1}^{n}\phi(\mathbf{x}_i)\phi(\mathbf{x}_i)^\top$. This matrix provides an empirical estimation of the population kernel covariance matrix $C_X$.

It can be seen that the $m\times m$ empirical matrix $\widehat{C}_X$ and the normalized kernel matrix $\frac{1}{n}K = \frac{1}{n}\bigl[k(\mathbf{x}_i,\mathbf{x}_j)\bigr]_{n\times n}$ share the same non-zero eigenvalues. Therefore, to compute the matrix-based entropy of the empirical covariance matrix $\widehat{C}_X$, one can equivalently compute the entropy of the eigenvalues of the kernel similarity matrix $K$. This approach results in the definition of the VENDI and RKE diversity scores: [1] defines the family of VENDI scores as

$$\mathrm{VENDI}_\alpha(\mathbf{x}_1,\ldots,\mathbf{x}_n) \,:=\, \exp\Bigl(H_\alpha\bigl(\tfrac{1}{n}K\bigr)\Bigr) \,=\, \Bigl(\sum_{i=1}^{n}\lambda_i^{\alpha}\Bigr)^{\frac{1}{1-\alpha}},$$

where $\lambda_1,\ldots,\lambda_n$ denote the eigenvalues of the kernel matrix $\frac{1}{n}K$. Also, [2] proposes the RKE score, which is the special order-2 Rényi entropy, $\mathrm{RKE}(\mathbf{x}_1,\ldots,\mathbf{x}_n) = \exp\bigl(H_2(\frac{1}{n}K)\bigr)$. To compute RKE without computing the eigenvalues, [2] points out the RKE score reduces to the Frobenius norm $\|\cdot\|_F$ of the kernel matrix as follows:

$$\mathrm{RKE}(\mathbf{x}_1,\ldots,\mathbf{x}_n) \,=\, \Bigl\|\frac{1}{n}K\Bigr\|_F^{-2} \,=\, \Bigl(\,\frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}k\bigl(\mathbf{x}_i,\mathbf{x}_j\bigr)^2\,\Bigr)^{-1}$$
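The Frobenius-norm shortcut holds because $\sum_i \lambda_i^2 = \mathrm{Tr}\bigl((\frac{1}{n}K)^2\bigr) = \|\frac{1}{n}K\|_F^2$ for a symmetric matrix. The sketch below computes the order-2 score both ways on toy data and checks that they agree. The function names `vendi_alpha` and `rke`, the random data, and the bandwidth are assumptions for illustration; this is a direct dense computation, not the authors’ implementation.

```python
import numpy as np


def vendi_alpha(K: np.ndarray, alpha: float) -> float:
    """exp(H_alpha(K / n)) from the eigenvalues of the normalized kernel matrix."""
    n = K.shape[0]
    lam = np.linalg.eigvalsh(K / n)       # dense eigendecomposition, cubic in n
    lam = lam[lam > 1e-12]                # drop numerically zero eigenvalues
    if np.isclose(alpha, 1.0):
        # alpha -> 1 limit: exponential of the Shannon entropy
        return float(np.exp(-np.sum(lam * np.log(lam))))
    return float(np.sum(lam ** alpha) ** (1.0 / (1.0 - alpha)))


def rke(K: np.ndarray) -> float:
    """Inverse squared Frobenius norm of K / n; needs no eigendecomposition."""
    n = K.shape[0]
    return float(1.0 / np.sum((K / n) ** 2))


rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))             # toy data, not from the paper
sq = np.sum(X ** 2, axis=1)
sq_dists = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
K = np.exp(-sq_dists / (2.0 * 1.5 ** 2))  # Gaussian kernel with sigma = 1.5

print(vendi_alpha(K, 2.0), rke(K))        # the two values should agree
```

For a normalized kernel, both scores lie between 1 (all samples identical) and $n$ (all samples mutually dissimilar).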

3.2 Shift-Invariant Kernels and Random Fourier Features

A kernel function k𝑘kitalic_k is called shift-invariant, if there exists a function κ:d:𝜅superscript𝑑\kappa:\mathbb{R}^{d}\rightarrow\mathbb{R}italic_κ : blackboard_R start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT → blackboard_R such that k(𝐱,𝐱)=κ(𝐱𝐱)𝑘𝐱superscript𝐱𝜅𝐱superscript𝐱k(\mathbf{x},\mathbf{x}^{\prime})=\kappa(\mathbf{x}-\mathbf{x}^{\prime})italic_k ( bold_x , bold_x start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) = italic_κ ( bold_x - bold_x start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) for every 𝐱,𝐱d𝐱superscript𝐱superscript𝑑\mathbf{x},\mathbf{x}^{\prime}\in\mathbb{R}^{d}bold_x , bold_x start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT. Bochner’s theorem proves that a function κ:d:𝜅superscript𝑑\kappa:\mathbb{R}^{d}\rightarrow\mathbb{R}italic_κ : blackboard_R start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT → blackboard_R will lead to a shift-invariant kernel similarity score κ(x𝐱)𝜅𝑥superscript𝐱\kappa(x-\mathbf{x}^{\prime})italic_κ ( italic_x - bold_x start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) between 𝐱,𝐱𝐱superscript𝐱\mathbf{x},\mathbf{x}^{\prime}bold_x , bold_x start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT if and only if its Fourier transform κ^:d:^𝜅superscript𝑑\widehat{\kappa}:\mathbb{R}^{d}\rightarrow\mathbb{R}over^ start_ARG italic_κ end_ARG : blackboard_R start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT → blackboard_R is non-negative everywhere (i.e, κ^(𝝎)0^𝜅𝝎0\widehat{\kappa}(\bm{\omega})\geq 0over^ start_ARG italic_κ end_ARG ( bold_italic_ω ) ≥ 0 for every 𝝎𝝎\bm{\omega}bold_italic_ω). Note that the Fourier transform κ^^𝜅\widehat{\kappa}over^ start_ARG italic_κ end_ARG is defined as

κ^(𝝎):=1(2π)dκ(𝐱)exp(i𝝎𝐱)d𝐱assign^𝜅𝝎1superscript2𝜋𝑑𝜅𝐱𝑖superscript𝝎top𝐱differential-d𝐱\widehat{\kappa}(\bm{\omega})\,:=\,\frac{1}{(2\pi)^{d}}\int{\kappa}(\mathbf{x}% )\exp\bigl{(}-i\bm{\omega}^{\top}\mathbf{x}\bigr{)}\mathrm{d}\mathbf{x}over^ start_ARG italic_κ end_ARG ( bold_italic_ω ) := divide start_ARG 1 end_ARG start_ARG ( 2 italic_π ) start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT end_ARG ∫ italic_κ ( bold_x ) roman_exp ( - italic_i bold_italic_ω start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT bold_x ) roman_d bold_x

Specifically, Bochner’s theorem shows that the Fourier transform $\widehat{\kappa}$ of a normalized shift-invariant kernel $k(\mathbf{x},\mathbf{x}')=\kappa(\mathbf{x}-\mathbf{x}')$, where $\kappa(0)=1$, is a probability density function (PDF). The framework of random Fourier features (RFFs) [16] utilizes independent samples drawn from the PDF $\widehat{\kappa}$ to approximate the kernel function. Here, given independent samples $\bm{\omega}_1,\ldots,\bm{\omega}_r\sim\widehat{\kappa}$, we form the following proxy feature map $\widetilde{\phi}_r:\mathbb{R}^d\rightarrow\mathbb{R}^{2r}$:

$$\widetilde{\phi}_r(\mathbf{x})\,=\,\frac{1}{\sqrt{r}}\Bigl[\cos(\bm{\omega}_1^\top\mathbf{x}),\,\sin(\bm{\omega}_1^\top\mathbf{x}),\,\ldots,\,\cos(\bm{\omega}_r^\top\mathbf{x}),\,\sin(\bm{\omega}_r^\top\mathbf{x})\Bigr]. \tag{1}$$

As demonstrated in [16, 42], the $2r$-dimensional proxy map $\widetilde{\phi}_r$ can approximate the kernel function as $k(\mathbf{x},\mathbf{x}')=\mathbb{E}_{\bm{\omega}\sim\widehat{\kappa}}\bigl[\cos(\bm{\omega}^\top\mathbf{x})\cos(\bm{\omega}^\top\mathbf{x}')+\sin(\bm{\omega}^\top\mathbf{x})\sin(\bm{\omega}^\top\mathbf{x}')\bigr]\approx\widetilde{\phi}_r(\mathbf{x})^\top\widetilde{\phi}_r(\mathbf{x}')$.
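The approximation above can be checked numerically. The following is a minimal sketch, not the authors' implementation: it assumes the Gaussian kernel $k(\mathbf{x},\mathbf{x}')=\exp(-\|\mathbf{x}-\mathbf{x}'\|^2/(2\sigma^2))$, whose Fourier transform is the Gaussian density $\mathcal{N}(\mathbf{0},\sigma^{-2}I_d)$. The helper name `rff_features` and the values of `d`, `r`, and `sigma` are illustrative choices.

```python
import numpy as np

def rff_features(X, omegas):
    """Map rows of X (n, d) to the 2r-dimensional proxy features of Eq. (1)."""
    r = omegas.shape[0]
    proj = X @ omegas.T                       # (n, r): entries omega_j^T x_i
    feats = np.empty((X.shape[0], 2 * r))
    feats[:, 0::2] = np.cos(proj)             # 1st, 3rd, ... entries: cosines
    feats[:, 1::2] = np.sin(proj)             # 2nd, 4th, ... entries: sines
    return feats / np.sqrt(r)

rng = np.random.default_rng(0)
d, r, sigma = 5, 4000, 2.0                    # illustrative values (assumptions)
x, y = rng.normal(size=d), rng.normal(size=d)

# Assumed Gaussian kernel: frequencies are drawn from N(0, I / sigma^2).
omegas = rng.normal(scale=1.0 / sigma, size=(r, d))

exact = np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))
phi = rff_features(np.stack([x, y]), omegas)
approx = float(phi[0] @ phi[1])               # phi(x)^T phi(y)
abs_err = abs(exact - approx)
print(f"exact={exact:.4f}  approx={approx:.4f}  |error|={abs_err:.4f}")
```

Because $\cos^2+\sin^2=1$, every feature vector has unit norm, so the proxy kernel reproduces $\kappa(\mathbf{0})=1$ exactly on the diagonal; only the off-diagonal entries carry approximation error, which shrinks as $r$ grows.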

4 Computational Complexity of VENDI & RKE Scores

As discussed, computing $\mathrm{RKE}$ and general $\mathrm{VENDI}_\alpha$ scores requires computing the order-$\alpha$ entropy of the kernel matrix $\frac{1}{n}K$. Using the standard definition of the $\alpha$-norm $\|\mathbf{v}\|_\alpha=\bigl(\sum_{i=1}^n|v_i|^\alpha\bigr)^{1/\alpha}$, we observe that the computation of the $\mathrm{VENDI}_\alpha$ score is equivalent to computing the $\alpha$-norm $\|\bm{\lambda}\|_\alpha$ of the $n$-dimensional eigenvalue vector $\bm{\lambda}=[\lambda_1,\ldots,\lambda_n]$, where $\lambda_1\geq\cdots\geq\lambda_n$ are the sorted eigenvalues of the normalized kernel matrix $\frac{1}{n}K$.

In the following theorem, we prove that except for order $\alpha=2$, which is the $\mathrm{RKE}$ score, computing any other $\mathrm{VENDI}_\alpha$ score is at least as expensive as computing the product of two $n\times n$ matrices. Therefore, the theorem suggests that the computational complexity of every other member of the VENDI family is lower-bounded by the cost of multiplying $n\times n$ matrices, for which the best known bound is $O(n^{2.367})$.

In the theorem, we suppose $\mathcal{B}$ is any fixed set of “basis” functions. A circuit $\mathcal{C}$ is a directed acyclic graph, each of whose internal nodes is labeled by a gate coming from the set $\mathcal{B}$. A subset of gates are designated as outputs of $\mathcal{C}$. A circuit with $n$ source nodes and $m$ outputs computes a function from $\mathbb{R}^n$ to $\mathbb{R}^m$ by evaluating the gate at each internal node in topological order. The size of a circuit is the number of gates. Also, $\nabla\mathcal{B}$ is the basis consisting of the gradients of all functions in $\mathcal{B}$. We defer the proofs of the theorems to the Appendix.

Theorem 1.

If $\mathrm{VENDI}_\alpha(K)$ for $\alpha\neq 2$ is computable by a circuit $\mathcal{C}$ of size $s(n)$ over basis $\mathcal{B}$, then $n\times n$ matrices can be multiplied by a circuit $\mathcal{C}'$ of size $O(s(n))$ over basis $\nabla\mathcal{B}\cup\{+,\times\}$.

Remark 1.

The smallest known circuits for multiplying $n\times n$ matrices have size $\Theta(n^\omega)$, where $\omega\approx 2.367$. Despite tremendous research efforts, only minor improvements have been obtained in recent years. There is evidence that $\omega$ is bounded away from 2 for certain classes of circuits [43, 44]. In contrast, $\mathcal{S}_2$ is computable in quadratic time $\Theta(n^2)$ in the basis $\mathcal{B}=\{\times,+,\log\}$.

The above discussion indicates that, except for $\mathrm{RKE}(\mathbf{x}_1,\ldots,\mathbf{x}_n)$, i.e. the order-2 Rényi entropy, whose computational complexity $\Theta(n^2)$ grows quadratically with the sample size, the other members of the VENDI family $\mathrm{VENDI}_\alpha$ would have a super-quadratic complexity on the order of $\mathcal{O}(n^{2.367})$. In practice, the computation of $\mathrm{VENDI}_\alpha$ scores is performed by the eigendecomposition of the $n\times n$ kernel matrix, which requires $O(n^3)$ computations for precise computation and $O(n^2M)$ computations using a randomized projection onto an $M$-dimensional space [15, 1].
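The gap between the two regimes can be made concrete with a small sketch. It assumes a Gaussian kernel and synthetic data, and it uses the formula $\bigl(\sum_i\lambda_i^\alpha\bigr)^{1/(1-\alpha)}$ for the order-$\alpha$ score, with the Shannon-entropy limit at $\alpha=1$. The function names are hypothetical and not taken from any reference implementation. The order-2 score only needs the Frobenius norm of $\frac{1}{n}K$, which visits each entry once, while other orders go through a full eigendecomposition.

```python
import numpy as np

def gaussian_kernel_matrix(X, sigma):
    """n x n Gaussian kernel matrix (assumed kernel for this sketch)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def rke_exact(K):
    """Order-2 score from the Frobenius norm of K/n: O(n^2), no eigenvalues."""
    n = K.shape[0]
    return float(1.0 / np.sum((K / n) ** 2))

def vendi_exact(K, alpha):
    """General order-alpha score from all eigenvalues of K/n: O(n^3)."""
    n = K.shape[0]
    lam = np.clip(np.linalg.eigvalsh(K / n), 0.0, None)
    lam = lam[lam > 1e-12]                     # drop numerically zero eigenvalues
    if np.isclose(alpha, 1.0):                 # alpha -> 1: Shannon entropy limit
        return float(np.exp(-np.sum(lam * np.log(lam))))
    return float(np.sum(lam ** alpha) ** (1.0 / (1.0 - alpha)))

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))                  # synthetic data (assumption)
K = gaussian_kernel_matrix(X, sigma=1.5)
rke = rke_exact(K)
vendi_2 = vendi_exact(K, 2.0)
vendi_1 = vendi_exact(K, 1.0)
print(f"RKE={rke:.3f}  VENDI_2={vendi_2:.3f}  VENDI_1={vendi_1:.3f}")
```

The two order-2 computations agree because the squared Frobenius norm of a symmetric matrix equals the sum of its squared eigenvalues.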

5 A Scalable Fourier-based Method for Computing Kernel Entropy Scores

As we showed earlier, the complexity of computing $\mathrm{RKE}$ and $\mathrm{VENDI}$ scores grows at least quadratically with the sample size $n$. This super-linear growth can hinder their application to large-scale datasets and generative models with potentially hundreds of sample types. In such cases, a proper entropy estimation should be performed over potentially hundreds of thousands of data points, where the quadratic complexity of the scores would be a significant barrier toward their accurate estimation.

Here, we consider a shift-invariant kernel $k(\mathbf{x},\mathbf{x}')=\kappa(\mathbf{x}-\mathbf{x}')$ where $\kappa(\mathbf{0})=1$ and propose applying the random Fourier features (RFF) framework [16] to perform an efficient approximation of the RKE and VENDI scores. To do this, we utilize the Fourier transform $\widehat{\kappa}$ that, according to Bochner’s theorem, is a valid PDF, and we independently generate $\bm{\omega}_1,\ldots,\bm{\omega}_r\stackrel{\text{iid}}{\sim}\widehat{\kappa}$. Note that in the case of the Gaussian kernel $k_{\text{Gaussian}(\sigma^2)}$, the corresponding PDF will be an isotropic Gaussian $\mathcal{N}(\mathbf{0},\frac{1}{\sigma^2}I_d)$ with zero mean and covariance matrix $\frac{1}{\sigma^2}I_d$.
Then, we consider the RFF proxy feature map $\widetilde{\phi}_r:\mathbb{R}^d\rightarrow\mathbb{R}^{2r}$ as defined in (1) and define the proxy kernel covariance matrix $\widetilde{C}_{X,r}\in\mathbb{R}^{2r\times 2r}$:

$$\widetilde{C}_{X,r}\,=\,\frac{1}{n}\sum_{i=1}^{n}\,\widetilde{\phi}_r\bigl(\mathbf{x}_i\bigr)\,\widetilde{\phi}_r\bigl(\mathbf{x}_i\bigr)^\top \tag{2}$$

Note that the $2r\times 2r$ matrix $\widetilde{C}_{X,r}$ has the same non-zero eigenvalues as the $n\times n$ RFF proxy kernel matrix $\frac{1}{n}\widetilde{K}_r$, and can therefore be utilized to approximate the eigenvalues of the original $n\times n$ kernel matrix $\frac{1}{n}K$. Based on this observation, we propose the Fourier-based Kernel Entropy Approximation (FKEA) method to approximate the $\mathrm{RKE}$ and $\mathrm{VENDI}_\alpha$ scores as follows:

\begin{align}
\mathrm{FKEA}\text{-}\mathrm{RKE}\bigl(\mathbf{x}_1,\ldots,\mathbf{x}_n\bigr)\, &=\,\exp\bigl(H_2(\widetilde{C}_{X,r})\bigr)\,=\,\bigl\|\widetilde{C}_{X,r}\bigr\|_F^{-2}, \tag{3}\\
\mathrm{FKEA}\text{-}\mathrm{VENDI}_\alpha\bigl(\mathbf{x}_1,\ldots,\mathbf{x}_n\bigr)\, &=\,\exp\bigl(H_\alpha(\widetilde{C}_{X,r})\bigr)\,=\,\Bigl(\,\sum_{i=1}^{2r}\,\widetilde{\lambda}_{r,i}^{\,\alpha}\,\Bigr)^{\frac{1}{1-\alpha}} \tag{4}
\end{align}

Note that in the above, $\widetilde{\lambda}_{r,i}$ denotes the $i$th eigenvalue of the $2r\times 2r$ matrix $\widetilde{C}_{X,r}$. We remark that the computation of both $\mathrm{FKEA}\text{-}\mathrm{RKE}$ and $\mathrm{FKEA}\text{-}\mathrm{VENDI}_\alpha$ can be done by a stochastic algorithm which computes the proxy covariance matrix (2) by summing the sample-based $2r\times 2r$ matrix terms, and then computes the resulting matrix’s Frobenius norm for the $\mathrm{RKE}$ score or all $2r$ eigenvalues of the matrix for a general $\mathrm{VENDI}_\alpha$ with $\alpha\neq 2$.
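The stochastic procedure described above can be sketched as follows. This is an illustrative NumPy version under stated assumptions (Gaussian kernel, synthetic batches, and hypothetical helper names `rff_features`, `fkea_covariance`, `fkea_rke`, `fkea_vendi`); it is not the released codebase. The proxy covariance matrix of (2) is accumulated batch by batch, so memory stays at $O(r^2)$ regardless of the sample size $n$.

```python
import numpy as np

def rff_features(X, omegas):
    """Proxy feature map of Eq. (1), applied to each row of X."""
    r = omegas.shape[0]
    proj = X @ omegas.T
    feats = np.empty((X.shape[0], 2 * r))
    feats[:, 0::2] = np.cos(proj)
    feats[:, 1::2] = np.sin(proj)
    return feats / np.sqrt(r)

def fkea_covariance(batches, omegas):
    """Accumulate Eq. (2) one batch at a time; cost is linear in n."""
    two_r = 2 * omegas.shape[0]
    cov = np.zeros((two_r, two_r))
    n = 0
    for batch in batches:
        phi = rff_features(batch, omegas)
        cov += phi.T @ phi                     # sum of phi(x_i) phi(x_i)^T
        n += batch.shape[0]
    return cov / n

def fkea_rke(cov):
    """Eq. (3): inverse squared Frobenius norm of the proxy covariance."""
    return float(1.0 / np.sum(cov ** 2))

def fkea_vendi(cov, alpha):
    """Eq. (4), with the Shannon-entropy limit used at alpha = 1."""
    lam = np.clip(np.linalg.eigvalsh(cov), 0.0, None)
    lam = lam[lam > 1e-12]
    if np.isclose(alpha, 1.0):
        return float(np.exp(-np.sum(lam * np.log(lam))))
    return float(np.sum(lam ** alpha) ** (1.0 / (1.0 - alpha)))

rng = np.random.default_rng(0)
d, r, sigma = 8, 256, 3.0                      # illustrative values (assumptions)
omegas = rng.normal(scale=1.0 / sigma, size=(r, d))        # Gaussian kernel case
batches = [rng.normal(size=(500, d)) for _ in range(4)]    # hypothetical data stream
cov = fkea_covariance(batches, omegas)
score_rke = fkea_rke(cov)
score_vendi_15 = fkea_vendi(cov, 1.5)
print(f"FKEA-RKE={score_rke:.3f}  FKEA-VENDI_1.5={score_vendi_15:.3f}")
```

Each feature vector has unit norm, so the accumulated matrix has trace 1 and its eigenvalues can be read as a probability vector, which is what the entropy formulas require.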

Therefore, to show the FKEA method’s scalability, we need to bound the required RFF size $2r$ for an accurate approximation of the original $n\times n$ kernel matrix. The following theorem proves that a feature size of $\mathcal{O}\bigl(\frac{\log n}{\epsilon^2}\bigr)$ suffices for an $\epsilon$-accurate approximation of the matrix’s eigenspace.

Theorem 2.

Consider a shift-invariant kernel $k(\mathbf{x},\mathbf{x}')=\kappa(\mathbf{x}-\mathbf{x}')$ where $\kappa(\mathbf{0})=1$. Suppose $\bm{\omega}_1,\ldots,\bm{\omega}_r\sim\widehat{\kappa}$ are independently drawn from PDF $\widehat{\kappa}$. Let $\lambda_1\geq\ldots\geq\lambda_n$ be the sorted eigenvalues of the normalized kernel matrix $\frac{1}{n}K=\frac{1}{n}\bigl[k(\mathbf{x}_i,\mathbf{x}_j)\bigr]_{n\times n}$.
Also, consider the eigenvalues $\widetilde{\lambda}_1\geq\ldots\geq\widetilde{\lambda}_{2r}$ of the random matrix $\widetilde{C}_{X,r}$ with their corresponding eigenvectors $\widetilde{\mathbf{v}}_1,\ldots,\widetilde{\mathbf{v}}_{2r}$. Let $\widetilde{\lambda}_j=0$ for every $j>2r$. Then, for every $\delta>0$, the following holds with probability at least $1-\delta$:

$$\sqrt{\sum_{i=1}^{n}\bigl(\widetilde{\lambda}_i-\lambda_i\bigr)^2}\,\leq\,\sqrt{\frac{8\log(n/2\delta)}{r}}\qquad\text{and}\qquad\sqrt{\sum_{i=1}^{n}\Bigl\|\frac{1}{n}K\widehat{\mathbf{v}}_i-\lambda_i\widehat{\mathbf{v}}_i\Bigr\|_2^2}\,\leq\,\sqrt{\frac{32\log(n/2\delta)}{r}},$$

where $\widehat{\mathbf{v}}_i:=\sum_{j=1}^{r}\sin\bigl(\widetilde{\mathbf{v}}_{2j}^\top\mathbf{x}_i\bigr)\widetilde{\mathbf{v}}_{2j}+\cos\bigl(\widetilde{\mathbf{v}}_{2j-1}^\top\mathbf{x}_i\bigr)\widetilde{\mathbf{v}}_{2j-1}$ is the $i$th proxy eigenvector for $\frac{1}{n}K$.
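The eigenvalue bound of Theorem 2 can be sanity-checked on a small synthetic problem. The sketch below is illustrative only: it assumes a Gaussian kernel, reads $\log(n/2\delta)$ as $\log\bigl(n/(2\delta)\bigr)$, and uses arbitrary values for `n`, `d`, `r`, `sigma`, and `delta`. A single random draw cannot verify a probabilistic statement, but it shows the scale of the error relative to the bound.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, r, sigma, delta = 200, 3, 500, 1.0, 0.1  # illustrative values (assumptions)
X = rng.normal(size=(n, d))

# Eigenvalues of the exact normalized Gaussian kernel matrix K / n, descending.
sq = np.sum(X ** 2, axis=1)
d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
lam = np.sort(np.linalg.eigvalsh(np.exp(-d2 / (2 * sigma ** 2)) / n))[::-1]

# Eigenvalues of the 2r x 2r proxy covariance matrix, descending.
omegas = rng.normal(scale=1.0 / sigma, size=(r, d))
proj = X @ omegas.T
# Column order (all cosines, then all sines) does not change the eigenvalues.
phi = np.hstack([np.cos(proj), np.sin(proj)]) / np.sqrt(r)
lam_tilde = np.sort(np.linalg.eigvalsh(phi.T @ phi / n))[::-1]

def pad(v, length):
    """Zero-pad a descending eigenvalue vector to a common length."""
    return np.pad(v, (0, length - v.size))

m = max(n, 2 * r)
err = float(np.linalg.norm(pad(lam_tilde, m) - pad(lam, m)))
bound = float(np.sqrt(8 * np.log(n / (2 * delta)) / r))
print(f"eigenvalue error={err:.4f}  bound={bound:.4f}")
```

Here $2r>n$, so the proxy matrix has at most $n$ non-zero eigenvalues and the zero-padding only affects entries that are already numerically zero.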

Corollary 1.

In the setting of Theorem 2, the following approximation guarantees hold for $\mathrm{RKE}$ and $\mathrm{VENDI}_\alpha$ scores:

  • For every $\mathrm{VENDI}_\alpha$ with $\alpha\geq 2$, including $\mathrm{RKE}$ for $\alpha=2$, the following dimension-independent bound holds with probability at least $1-\delta$:

    $$\Bigl|\mathrm{FKEA}\text{-}\mathrm{VENDI}_\alpha\bigl(\mathbf{x}_1,\ldots,\mathbf{x}_n\bigr)^{\frac{1-\alpha}{\alpha}}\,-\,\mathrm{VENDI}_\alpha\bigl(\mathbf{x}_1,\ldots,\mathbf{x}_n\bigr)^{\frac{1-\alpha}{\alpha}}\Bigr|\,\leq\,\sqrt{\frac{8\log(n/2\delta)}{r}}$$
  • For every $\mathrm{VENDI}_\alpha$ with $1\leq\alpha<2$, assuming a finite dimension for the kernel feature map $\mathrm{dim}(\phi)=m$, the following bound holds with probability at least $1-\delta$:

    $$\Bigl|\mathrm{FKEA}\text{-}\mathrm{VENDI}_\alpha\bigl(\mathbf{x}_1,\ldots,\mathbf{x}_n\bigr)^{\frac{1-\alpha}{\alpha}}\,-\,\mathrm{VENDI}_\alpha\bigl(\mathbf{x}_1,\ldots,\mathbf{x}_n\bigr)^{\frac{1-\alpha}{\alpha}}\Bigr|\,\leq\,m^{\frac{1}{\alpha}-\frac{1}{2}}\sqrt{\frac{8\log(n/2\delta)}{r}}$$
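One way to see why the exponent $\frac{1-\alpha}{\alpha}$ appears in both bounds is sketched below. This is an informal outline under stated assumptions, not necessarily the route taken in the Appendix proof: it takes $\alpha\neq 1$, uses the form $\bigl(\sum_i\lambda_i^\alpha\bigr)^{1/(1-\alpha)}$ of the score, and, in the second case, assumes the difference vector $\widetilde{\bm{\lambda}}-\bm{\lambda}$ has at most $m$ non-zero entries.

```latex
\begin{align*}
\mathrm{VENDI}_\alpha^{\frac{1-\alpha}{\alpha}}
  &= \Bigl(\sum_{i}\lambda_i^{\alpha}\Bigr)^{\frac{1}{1-\alpha}\cdot\frac{1-\alpha}{\alpha}}
   = \|\bm{\lambda}\|_\alpha, \\
\Bigl|\,\|\widetilde{\bm{\lambda}}\|_\alpha-\|\bm{\lambda}\|_\alpha\Bigr|
  &\le \|\widetilde{\bm{\lambda}}-\bm{\lambda}\|_\alpha
   \le
   \begin{cases}
     \|\widetilde{\bm{\lambda}}-\bm{\lambda}\|_2, & \alpha\ge 2,\\[2pt]
     m^{\frac{1}{\alpha}-\frac{1}{2}}\,\|\widetilde{\bm{\lambda}}-\bm{\lambda}\|_2, & 1<\alpha<2 .
   \end{cases}
\end{align*}
```

The first inequality is the reverse triangle inequality for the $\alpha$-norm. The second uses the monotonicity of $\ell_p$ norms for $\alpha\geq 2$ and Hölder's inequality on an $m$-sparse vector for $\alpha<2$. Combining either case with the eigenvalue bound of Theorem 2 gives right-hand sides of the stated form.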
Remark 2.

According to the theoretical results in [45], the top-$t$ eigenvectors of the kernel covariance matrix $C_X$ will correspond to the mean of the modes of a mixture distribution with $t$ well-separable modes. Theorem 2 shows that, for every $1\leq i\leq 2r$, FKEA provides the proxy score function $\widetilde{u}_i:\mathbb{R}^d\rightarrow\mathbb{R}$ that assigns a likelihood score for an input $\mathbf{x}$ to belong to the $i$th identified mode:

$$\widetilde{u}_i(\mathbf{x})\,=\,\sum_{j=1}^{r}\sin\bigl(\widetilde{\mathbf{v}}_{2j}^\top\mathbf{x}\bigr)\widetilde{\mathbf{v}}_{2j,i}+\cos\bigl(\widetilde{\mathbf{v}}_{2j-1}^\top\mathbf{x}\bigr)\widetilde{\mathbf{v}}_{2j-1,i} \tag{5}$$

Therefore, one can compute the above FKEA-based score for each of the $2r$ eigenvectors over a sample set, and use the samples with the highest scores according to every $\widetilde{u}_i$ to characterize the $i$th sample cluster captured by the FKEA method.
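The mode-identification step can be illustrated on toy data. The sketch below makes a simplifying assumption that should be read as an interpretation rather than the exact form of (5): the score of a sample for mode $i$ is taken to be the inner product between its RFF feature vector and the $i$th eigenvector of the proxy covariance matrix. The two synthetic clusters, the Gaussian kernel, and the helper name `top_samples` are likewise hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, sigma = 2, 500, 1.0                       # illustrative values (assumptions)

# Two well-separated synthetic clusters of unequal size (hypothetical data).
X = np.vstack([rng.normal(0.0, 0.1, size=(60, d)),
               rng.normal(6.0, 0.1, size=(40, d))])
labels = np.array([0] * 60 + [1] * 40)

# RFF features in the cos/sin interleaved order of Eq. (1).
omegas = rng.normal(scale=1.0 / sigma, size=(r, d))
proj = X @ omegas.T
phi = np.empty((X.shape[0], 2 * r))
phi[:, 0::2], phi[:, 1::2] = np.cos(proj), np.sin(proj)
phi /= np.sqrt(r)

# Eigenvectors of the proxy covariance matrix, sorted by decreasing eigenvalue.
cov = phi.T @ phi / X.shape[0]
eigvals, eigvecs = np.linalg.eigh(cov)          # eigh returns ascending order
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

def top_samples(mode, k=10):
    """Indices of the k samples with the largest |score| for one mode."""
    scores = phi @ eigvecs[:, mode]             # assumed score: feature projection
    return np.argsort(-np.abs(scores))[:k]      # |.| removes eigenvector sign ambiguity

top_mode_0 = labels[top_samples(0)]
top_mode_1 = labels[top_samples(1)]
print("mode 0 labels:", top_mode_0, " mode 1 labels:", top_mode_1)
```

With unequal cluster sizes the two leading eigenvalues are well separated, so each leading eigenvector concentrates on one cluster and its top-scoring samples come from that cluster.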

Figure 2: RFF-based identified clusters used in FKEA Evaluation in single-colored MNIST [46] dataset with pixel embedding, Fourier feature dimension $2r=4000$ and bandwidth $\sigma=7$. The graphs indicate increase in FKEA RKE/VENDI diversity metrics with increasing number of labels.

6 Numerical Results

We evaluated the FKEA method on several image, text, and video datasets to assess its performance in quantifying diversity in different domains. In the experiments, we computed the empirical covariance matrix of $2r$-dimensional Fourier features with a Gaussian kernel with bandwidth parameter $\sigma$ tuned for each dataset, and then applied FKEA approximation for the $\mathrm{VENDI}_1$, $\mathrm{VENDI}_{1.5}$, and the $\mathrm{RKE}$ (same as $\mathrm{VENDI}_2$) scores. Experiments were conducted on 8 RTX3090 GPUs. We interpreted the modes identified by FKEA entropy-based diversity evaluation using the eigenvectors of the proxy covariance matrix as discussed in Remark 2. For each eigenvector, we presented the training data with maximum eigenfunction values corresponding to the eigenvector.

6.1 Experimental Results on Image Data

To investigate the FKEA method’s diversity evaluation in settings where we know the ground-truth clusters and their quantity, we simulated an experiment on the colored MNIST [46] data with the images of 10 colored digits as shown in Figure 2. We evaluated the FKEA approximations of the diversity scores while including samples from $t$ digits for $t\in\{1,\ldots,10\}$. The plots in Figure 2 indicate the increasing trend of the scores and FKEA’s tight approximations of the scores. Also, we show the top-25 training samples with the highest scores according to the top-10 FKEA eigenvectors, showing the method captured the ground-truth clusters.

We conducted an experiment on the ImageNet data to monitor the scores’ behavior evaluated for 50k samples from an increasing number of ImageNet labels. Figure 3 shows the increasing trend of the scores as well as the top-9 samples representing the top-4 identified clusters used for the entropy calculation. Figures 3, 4, and 5 illustrate the top FKEA modes with corresponding parameters. Each mode represents a specific characteristic common to its top images. Table 1 evaluates the diversity scores of various ImageNet GAN models using the FKEA method applied to VENDI-1 and RKE, with potential extension to the entire VENDI family. The comparison includes baseline diversity metrics such as Inception Score [12], FID [7], Improved Precision/Recall [10], and Density/Coverage [11]. Also, Figure 7 presents the FKEA approximated entropy scores with different truncation factors in StyleGAN3 [47] on 30k generated data for each truncation factor, showing that the diversity scores increase with the truncation factor.

We defer discussing the results on generated synthetic images to the Appendix.

Refer to caption
Figure 3: RFF-based identified clusters used in FKEA Evaluation in ImageNet[6] dataset with DinoV2 embedding, Fourier feature dimension 2r=16k2𝑟16𝑘2r=16k2 italic_r = 16 italic_k and Gaussian Kernel bandwidth σ=25𝜎25\sigma=25italic_σ = 25. The graphs indicate increase in FKEA diversity metrics with increasing number of labels per 50k samples.
Refer to caption
Figure 4: RFF-based identified clusters used in FKEA Evaluation in FFHQ[48] dataset with DinoV2 embedding, Fourier feature dimension 2r=16k2𝑟16𝑘2r=16k2 italic_r = 16 italic_k and Gaussian Kernel bandwidth σ=20𝜎20\sigma=20italic_σ = 20.
Refer to caption
Figure 5: RFF-based identified clusters used in FKEA Evaluation in AFHQ[49] dataset with DinoV2 embedding, Fourier feature dimension 2r=16k2𝑟16𝑘2r=16k2 italic_r = 16 italic_k and Gaussian Kernel bandwidth σ=20𝜎20\sigma=20italic_σ = 20.
Refer to caption
Figure 6: RFF-based identified clusters used in FKEA Evaluation in MS COCO [50] dataset with DinoV2 embedding, Fourier feature dimension 2r=16k2𝑟16𝑘2r=16k2 italic_r = 16 italic_k and Gaussian Kernel bandwidth σ=22𝜎22\sigma=22italic_σ = 22.
Figure 7: FKEA metrics behavior under different truncation factors $\psi$ of StyleGAN3 [47] generated FFHQ samples.
| Method | IS ↑ | FID ↓ | Precision ↑ | Recall ↑ | Density ↑ | Coverage ↑ | FKEA VENDI-1 ↑ | FKEA RKE ↑ |
|---|---|---|---|---|---|---|---|---|
| Dataset (100k) | - | - | - | - | - | - | 9176.9 | 996.7 |
| ADM [51] | 542.6 | 11.12 | 0.78 | 0.79 | 0.88 | 0.89 | 8360.3 | 633.4 |
| ADMG [51] | 659.3 | 5.63 | 0.87 | 0.84 | 0.80 | 0.85 | 8524.2 | 811.5 |
| ADMG-ADMU [51] | 701.6 | 4.78 | 0.90 | 0.73 | 1.20 | 0.96 | 8577.6 | 839.8 |
| BigGAN [52] | 696.4 | 7.91 | 0.81 | 0.44 | 0.99 | 0.57 | 7120.5 | 492.4 |
| DiT-XL-2 [53] | 743.2 | 3.56 | 0.92 | 0.84 | 1.16 | 0.97 | 8626.5 | 855.8 |
| GigaGAN [54] | 678.8 | 4.29 | 0.89 | 0.74 | 0.74 | 0.70 | 8432.5 | 671.6 |
| LDM [55] | 734.4 | 4.75 | 0.93 | 0.76 | 1.04 | 0.93 | 8573.7 | 811.9 |
| Mask-GIT [56] | 717.4 | 5.66 | 0.91 | 0.72 | 1.01 | 0.82 | 8557.4 | 759.5 |
| RQ-Transformer [57] | 558.3 | 9.57 | 0.80 | 0.76 | 0.77 | 0.59 | 8078.4 | 512.1 |
| StyleGAN-XL [58] | 675.4 | 4.34 | 0.89 | 0.74 | 1.18 | 0.96 | 8171.9 | 703.5 |
Table 1: Evaluated scores for ImageNet generative models. The Gaussian kernel bandwidth parameter chosen for RKE, VENDI, FKEA-VENDI and FKEA-RKE is $\sigma=25$ and the Fourier feature dimension is $2r=16k$. The scores were obtained by running the GitHub code of [20] on pre-generated 50k samples.

6.2 Experimental Results on Text and Video Data

To perform experiments on text data with a known clustering ground truth, we generated 500,000 paragraphs using GPT-3.5 [59] about 100 randomly selected countries (5k samples per country). In the experiments, the text embedding used was text-embedding-3-large [60, 61, 59]. We evaluated the diversity scores over data subsets of size 50k with different numbers of mentioned countries. Figure 8 shows the growing trend of the diversity scores when including more countries. The figure also shows the countries mentioned in the top-6 modes provided by the FKEA-based principal eigenvectors, which indicates that the RFF-based clustering of countries correlates with their continent and geographical location. We discuss the numerical results on the non-synthetic text datasets CNN/Dailymail [62, 63] and CMU Movie Corpus [64] in the Appendix.
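The mode tables in this section can be read as follows: each principal eigenvector assigns a score to every sample, the highest-scoring samples are taken as representatives of that mode, and the share of each ground-truth label among them is tabulated. The sketch below illustrates this reading on toy data. It is an assumption-laden illustration rather than the authors' pipeline: the helper names (`top_mode_samples`, `label_shares`), the sign-fixing rule for eigenvectors, the Gaussian-kernel frequency convention, and the three synthetic clusters standing in for country paragraphs are all hypothetical.

```python
import numpy as np
from collections import Counter

def rff_features(x, r, sigma, rng):
    """n x 2r random Fourier features for an assumed Gaussian kernel of bandwidth sigma."""
    omega = rng.normal(scale=1.0 / sigma, size=(x.shape[1], r))
    proj = x @ omega
    return np.hstack([np.cos(proj), np.sin(proj)]) / np.sqrt(r)

def top_mode_samples(phi, num_modes=3, top_k=50):
    """Return the leading eigenvalues and, per mode, indices of the top_k scoring samples."""
    cov = phi.T @ phi / phi.shape[0]
    eigvals, eigvecs = np.linalg.eigh(cov)        # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:num_modes]
    modes = []
    for j in order:
        scores = phi @ eigvecs[:, j]              # per-sample score along eigenvector j
        if abs(scores.min()) > scores.max():      # eigenvectors are defined up to sign
            scores = -scores
        modes.append(np.argsort(scores)[::-1][:top_k])
    return eigvals[order], modes

def label_shares(indices, labels):
    """Fraction of each label among the selected samples, most frequent first."""
    counts = Counter(int(labels[i]) for i in indices)
    return {lab: c / len(indices) for lab, c in counts.most_common()}

# Toy stand-in for text embeddings: three separated clusters of unequal size.
rng = np.random.default_rng(0)
sizes = [600, 300, 100]
labels = np.repeat(np.arange(3), sizes)
centers = 10.0 * np.eye(3, 8)
x = centers[labels] + 0.05 * rng.normal(size=(labels.size, 8))

phi = rff_features(x, r=512, sigma=1.0, rng=rng)
eigvals, modes = top_mode_samples(phi)
shares = [label_shares(idx, labels) for idx in modes]
```

The cluster sizes are deliberately unequal so that the leading eigenvalues are well separated; with equal-sized clusters the eigenvalues nearly coincide and the eigenvectors may mix several clusters, which is one reason a mode can contain more than one label.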

In Table 2 we extend FKEA to the non-synthetic Wikipedia dataset. To visualise the results, we used the YAKE [65] algorithm to extract and present the identified unigram and bigram keywords. Identified mode 1 correlates most with historical figures, events and places. Mode 2 clusters smaller villages and rural regions together. Mode 3 is exclusively about people in sports, such as athletes and referees. Mode 4 covers various music bands and albums. Lastly, mode 5 presents articles about sports events, such as football leagues.

| Mode #1 | Mode #2 | Mode #3 | Mode #4 | Mode #5 |
|---|---|---|---|---|
| Grosse Pointe | Ishkli | Gerson Garca | Girugamesh (album) | 2009 WPSL season |
| Mark Scharf | Khazora | Valentin Mogilny | Japonesque (album) | 2012 Milwaukee… |
| Alexander McKee | Sis, Azerbaijan | Gerald Lehner (referee) | Documentaly | 2020 San Antonio FC… |
| Clay Huffman | Zasun | Dmitri Nezhelev | EX-Girl | 2020 HFX Wanderers FC… |
| Ravenna, Ohio | Zaravat | Grigori Ivanov | Indie 2000 | 2020 Sporting Kansas City… |
| C. M. Eddy Jr. | Bogat | Leonidas Morakis | Triangle (Perfume album) | FC Tucson |
| Hornell, New York | Yakkakhona | Jos Luis Alonso Ber… | Waste Management (album) | 2008 K League |
| Larchmont, New York | Yava, Tajikistan | Giovanni Gasperini | Fush Yu Mang | 2011–12 New Zealand Football… |
| Robert Hague | Ikizyak | Mohamed Chab | Fantastipo (song) | 2008–09 Melbourne Victory FC… |
| General Hershy Bar | Khushikat | Louis Darmanin | Xtort | 2012 Pittsburgh Power season |

Keywords

| Mode #1 | Mode #2 | Mode #3 | Mode #4 | Mode #5 |
|---|---|---|---|---|
| London | populated places | players category | music video | American football |
| American History | Maplandia.com Category | Association football | album | players Category |
| University Press | municipality | FIFA World | studio album | Football League |
| United States | village | World Cup | Records albums | League |
| World War | Osh Region | Summer Olympics | Singles Chart | League Soccer |

Table 2: Top 5 Wikipedia dataset modes with corresponding eigenvalues, using text-embedding-3-large embeddings and bandwidth $\sigma=1.0$.

For video data experiments, we considered two standard video datasets, UCF101 [66] and Kinetics-400 [67]. Following the video evaluation literature [68, 69], we used the I3D pre-trained model [70] as the embedding, which maps a video sample to a 1024-dimensional vector. As shown in Figure 9, increasing the number of video classes of test samples led to an increase in the FKEA-approximated diversity metrics. Also, while the samples identified for the first cluster look broad, the subsequent modes appear more specific. We discuss the results on the Kinetics dataset in the Appendix.

Figure 8: FKEA diversity metrics with the increasing number of countries in the synthetic dataset.
| Mode #1 | Mode #2 | Mode #3 | Mode #4 | Mode #5 | Mode #6 |
|---|---|---|---|---|---|
| Burkina Faso 34% | Argentina 77% | Azerbaijan 100% | Cambodia 94% | Belarus 100% | Bolivia 97% |
| Benin 23% | Chile 23% | | Afghanistan 6% | | Ecuador 3% |
| Chad 22% | | | | | |
| Burundi 13% | | | | | |
| Cameroon 8% | | | | | |

Table 3: Top 6 synthetic countries dataset modes with text-embedding-3-large embedding, Fourier feature dimension $2r=8000$ and Gaussian kernel bandwidth $\sigma=0.9$. The table summarises the mentions of each country in the top 100 paragraphs identified for the eigenvectors corresponding to each mode.
Figure 9: RFF-based identified clusters used in FKEA evaluation on the UCF101 dataset with I3D embedding. Panels: (a) Mode #1, (b) Mode #2, (c) Mode #3, (d) Mode #4. The graphs indicate an increase in FKEA diversity metrics with more classes.

7 Limitations

Incompatibility with non-shift-invariant kernels. Our analysis targets shift-invariant kernels and does not apply to general kernel functions, such as polynomial kernels. In practice, many ML algorithms rely on simpler kernels that may not have the shift-invariance property. Due to the specifics of the FKEA framework, we cannot directly extend the work to such kernels. We leave the framework’s extension to other kernel functions for future studies.

Reliance on Embeddings. FKEA clustering and diversity assessment metrics rely on the quality of the underlying embedding space. Depending on the training and pre-training datasets, the semantic clustering properties may change. We leave an in-depth study of the embedding space behavior for future research.

8 Conclusion

In this work, we proposed the Fourier-based Kernel Entropy Approximation (FKEA) method to efficiently approximate the kernel-based entropy scores $\mathrm{VENDI}_{\alpha}$ and $\mathrm{RKE}$. The application of FKEA results in a scalable reference-free evaluation of generative models, which could be utilized in applications where no reference data is available for evaluation. A future direction of our work is to study the sample complexity of the matrix-based entropy scores and of FKEA’s approximation under high-dimensional kernel feature maps, e.g., the Gaussian kernel. Also, analyzing the role of the feature embedding in the method’s application to text and video data would be an interesting topic for future exploration.

References

  • [1] Dan Friedman and Adji Bousso Dieng. The vendi score: A diversity evaluation metric for machine learning. In Transactions on Machine Learning Research, 2023.
  • [2] Mohammad Jalali, Cheuk Ting Li, and Farzan Farnia. An information-theoretic evaluation of generative models in learning multi-modal distributions. In Advances in Neural Information Processing Systems, volume 36, pages 9931–9943, 2023.
  • [3] Diederik P. Kingma and Max Welling. Auto-encoding variational bayes. In International Conference on Learning Representations (ICLR), 2013.
  • [4] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, volume 27. Curran Associates, Inc., 2014.
  • [5] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020.
  • [6] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet large scale visual recognition challenge. In International Journal of Computer Vision (IJCV), number 3, pages 211–252, 2015.
  • [7] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local nash equilibrium. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2018.
  • [8] Mikołaj Bińkowski, Danica J Sutherland, Michael Arbel, and Arthur Gretton. Demystifying mmd gans. In International Conference on Learning Representations, 2018.
  • [9] Mehdi S. M. Sajjadi, Olivier Bachem, Mario Lucic, Olivier Bousquet, and Sylvain Gelly. Assessing generative models via precision and recall. In Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018.
  • [10] Tuomas Kynkäänniemi, Tero Karras, Samuli Laine, Jaakko Lehtinen, and Timo Aila. Improved precision and recall metric for assessing generative models. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.
  • [11] Muhammad Ferjad Naeem, Seong Joon Oh, Youngjung Uh, Yunjey Choi, and Jaejun Yoo. Reliable fidelity and diversity metrics for generative models. In Proceedings of the 37th International Conference on Machine Learning, volume 119 of ICML’20, pages 7176–7185. JMLR.org, 2020.
  • [12] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. In Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc., 2016.
  • [13] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models, 2021.
  • [14] Amey Pasarkar and Adji Bousso Dieng. Cousins of the vendi score: A family of similarity-based diversity metrics for science and machine learning. In International Conference on Artificial Intelligence and Statistics. PMLR, 2024.
  • [15] Yuxin Dong, Tieliang Gong, Shujian Yu, and Chen Li. Optimal randomized approximations for matrix-based rényi’s entropy. IEEE Transactions on Information Theory, 2023.
  • [16] Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems, volume 20. Curran Associates, Inc., 2007.
  • [17] Ahmed M. Alaa, Boris van Breugel, Evgeny Saveliev, and Mihaela van der Schaar. How Faithful is your Synthetic Data? Sample-level Metrics for Evaluating and Auditing Generative Models, July 2022. arXiv:2102.08921 [cs, stat].
  • [18] Marco Jiralerspong, Joey Bose, Ian Gemp, Chongli Qin, Yoram Bachrach, and Gauthier Gidel. Feature likelihood score: Evaluating the generalization of generative models using samples. Advances in Neural Information Processing Systems, 36, 2024.
  • [19] Jiyeon Han, Hwanil Choi, Yunjey Choi, Junho Kim, Jung-Woo Ha, and Jaesik Choi. Rarity score: A new metric to evaluate the uncommonness of synthesized images. arXiv preprint arXiv:2206.08549, 2022.
  • [20] George Stein, Jesse Cresswell, Rasa Hosseinzadeh, Yi Sui, Brendan Ross, Valentin Villecroze, Zhaoyan Liu, Anthony L Caterini, Eric Taylor, and Gabriel Loaiza-Ganem. Exposing flaws of generative model evaluation metrics and their unfair treatment of diffusion models. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural Information Processing Systems, volume 36, pages 3732–3784. Curran Associates, Inc., 2023.
  • [21] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V. Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mido Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Herve Jegou, Julien Mairal, Patrick Labatut, Armand Joulin, and Piotr Bojanowski. DINOv2: Learning robust visual features without supervision. In Transactions on Machine Learning Research, 2023.
  • [22] Tuomas Kynkäänniemi, Tero Karras, Miika Aittala, Timo Aila, and Jaakko Lehtinen. The Role of ImageNet Classes in Fréchet Inception Distance. September 2022.
  • [23] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning Transferable Visual Models From Natural Language Supervision. In International Conference on Machine Learning, pages 8748–8763. arXiv, February 2021. arXiv:2103.00020 [cs].
  • [24] Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, and Armand Joulin. Unsupervised Learning of Visual Features by Contrasting Cluster Assignments. In Advances in Neural Information Processing Systems, volume 33, pages 9912–9924. Curran Associates, Inc., 2020.
  • [25] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2818–2826. ISSN: 1063-6919.
  • [26] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ACL ’02, pages 311–318. Association for Computational Linguistics, 2002.
  • [27] Yaoming Zhu, Sidi Lu, Lei Zheng, Jiaxian Guo, Weinan Zhang, Jun Wang, and Yong Yu. Texygen: A benchmarking platform for text generation models. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, SIGIR ’18, pages 1097–1100. Association for Computing Machinery, 2018.
  • [28] Vishakh Padmakumar and He He. Does Writing with Language Models Reduce Content Diversity?, March 2024. arXiv:2309.05196 [cs].
  • [29] Xiang Lisa Li, Ari Holtzman, Daniel Fried, Percy Liang, Jason Eisner, Tatsunori Hashimoto, Luke Zettlemoyer, and Mike Lewis. Contrastive Decoding: Open-ended Text Generation as Optimization. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors, Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12286–12312, Toronto, Canada, July 2023. Association for Computational Linguistics.
  • [30] Clara Meister, Tiago Pimentel, Gian Wiher, and Ryan Cotterell. Locally Typical Sampling. Transactions of the Association for Computational Linguistics, 11:102–121, 2023.
  • [31] Chin-Yew Lin and Franz Josef Och. Automatic Evaluation of Machine Translation Quality Using Longest Common Subsequence and Skip-Bigram Statistics. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL-04), pages 605–612, Barcelona, Spain, July 2004.
  • [32] Ehsan Montahaei, Danial Alihosseini, and Mahdieh Soleymani Baghshah. Jointly Measuring Diversity and Quality in Text Generation Models. arXiv, May 2019. arXiv:1904.03971 [cs, stat].
  • [33] Chantal Shaib, Joe Barrow, Jiuding Sun, Alexa F. Siu, Byron C. Wallace, and Ani Nenkova. Standardizing the measurement of text diversity: A tool and a comparative analysis of scores.
  • [34] Bernhard Schölkopf, Alexander Smola, and Klaus-Robert Müller. Nonlinear Component Analysis as a Kernel Eigenvalue Problem. Neural Computation, 10(5):1299–1319, July 1998.
  • [35] Yoshua Bengio, Pascal Vincent, Jean-François Paiement, O Delalleau, M Ouimet, and N LeRoux. Learning eigenfunctions of similarity: linking spectral clustering and kernel pca. Technical report, Technical Report 1232, Departement d’Informatique et Recherche Oprationnelle …, 2003.
  • [36] Yoshua Bengio, Pascal Vincent, Jean-François Paiement, Olivier Delalleau, Marie Ouimet, and Nicolas Le Roux. Spectral clustering and kernel PCA are learning eigenfunctions, volume 1239. Citeseer, 2003.
  • [37] Radha Chitta, Rong Jin, and Anil K Jain. Efficient kernel clustering using random fourier features. In 2012 IEEE 12th International Conference on Data Mining, pages 161–170. IEEE, 2012.
  • [38] Mina Ghashami, Daniel J Perry, and Jeff Phillips. Streaming kernel principal component analysis. In Artificial intelligence and statistics, pages 1365–1374. PMLR, 2016.
  • [39] Enayat Ullah, Poorya Mianjy, Teodor Vanislavov Marinov, and Raman Arora. Streaming kernel pca with $o(\sqrt{n})$ random features. Advances in Neural Information Processing Systems, 31, 2018.
  • [40] Bharath K Sriperumbudur and Nicholas Sterge. Approximate kernel pca: Computational versus statistical trade-off. The Annals of Statistics, 50(5):2713–2736, 2022.
  • [41] Daniel Gedon, Antônio H Ribeiro, Niklas Wahlström, and Thomas B Schön. Invertible kernel pca with random fourier features. IEEE Signal Processing Letters, 30:563–567, 2023.
  • [42] Danica J. Sutherland and Jeff Schneider. On the Error of Random Fourier Features. In Uncertainty in Artificial Intelligence (UAI) 2015. arXiv, June 2015. arXiv:1506.02785 [cs, stat].
  • [43] Josh Alman. Limits on the universal method for matrix multiplication. Theory of Computing, 17(1):1–30, 2021.
  • [44] Matthias Christandl, Péter Vrana, and Jeroen Zuiddam. Barriers for fast matrix multiplication from irreversibility. Theory of Computing, 17(2):1–32, 2021.
  • [45] Jingwei Zhang, Cheuk Ting Li, and Farzan Farnia. An interpretable evaluation of entropy-based novelty of generative models. In International Conference on Machine Learning (ICML 2024).
  • [46] Li Deng. The MNIST database of handwritten digit images for machine learning research [best of the web]. In IEEE Signal Processing Magazine, volume 29, pages 141–142, 2012.
  • [47] Tero Karras, Miika Aittala, Samuli Laine, Erik Härkönen, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Alias-free generative adversarial networks. In Advances in Neural Information Processing Systems, volume 34, pages 852–863. Curran Associates, Inc.
  • [48] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. 43(12):4217–4228. Publisher: IEEE Computer Society.
  • [49] Yunjey Choi, Youngjung Uh, Jaejun Yoo, and Jung-Woo Ha. Stargan v2: Diverse image synthesis for multiple domains. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2020.
  • [50] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In David Fleet, Tomas Pajdla, Bernt Schiele, and Tinne Tuytelaars, editors, European Conference on Computer Vision (ECCV), volume 8693 of Lecture Notes in Computer Science, pages 740–755. Springer International Publishing, 2014.
  • [51] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat GANs on image synthesis. In Advances in Neural Information Processing Systems, volume 34, pages 8780–8794. Curran Associates, Inc.
  • [52] Andrew Brock, Jeff Donahue, and K. Simonyan. Large scale GAN training for high fidelity natural image synthesis.
  • [53] William Peebles and Saining Xie. Scalable diffusion models with transformers.
  • [54] Minguk Kang, Jun-Yan Zhu, Richard Zhang, Jaesik Park, Eli Shechtman, Sylvain Paris, and Taesung Park. Scaling up gans for text-to-image synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
  • [55] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 10674–10685. IEEE Computer Society.
  • [56] Huiwen Chang, Han Zhang, Lu Jiang, Ce Liu, and William T. Freeman. Maskgit: Masked generative image transformer. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2022.
  • [57] Doyup Lee, Chiheon Kim, Saehoon Kim, Minsu Cho, and Wook-Shin Han. Autoregressive image generation using residual quantization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11523–11532, 2022.
  • [58] Axel Sauer, Katja Schwarz, and Andreas Geiger. Stylegan-xl: Scaling stylegan to large diverse datasets. In ACM SIGGRAPH 2022 Conference Proceedings, volume abs/2201.00273, 2022.
  • [59] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. In Advances in Neural Information Processing Systems, volume 33, pages 1877–1901. Curran Associates, Inc.
  • [60] OpenAI. text-embedding-3-large. https://platform.openai.com/docs/models/embeddings, 2024.
  • [61] OpenAI. GPT-4 Technical Report, March 2024. arXiv:2303.08774 [cs].
  • [62] Karl Moritz Hermann, Tomás Kociský, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. Teaching machines to read and comprehend. In NIPS, pages 1693–1701, 2015.
  • [63] Abigail See, Peter J. Liu, and Christopher D. Manning. Get to the point: Summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1073–1083, Vancouver, Canada, July 2017. Association for Computational Linguistics.
  • [64] David Bamman, Brendan O’Connor, and Noah A. Smith. Learning latent personas of film characters. In ACL 2013, 2013.
  • [65] Ricardo Campos, Vítor Mangaravite, Arian Pasquali, Alípio Jorge, Célia Nunes, and Adam Jatowt. YAKE! keyword extraction from single documents using multiple local features. In Information Sciences, volume 509, pages 257–289, 2020.
  • [66] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. Ucf101: A dataset of 101 human actions classes from videos in the wild, 2012.
  • [67] Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, Mustafa Suleyman, and Andrew Zisserman. The kinetics human action video dataset, 2017.
  • [68] Masaki Saito, Shunta Saito, Masanori Koyama, and Sosuke Kobayashi. Train sparsely, generate densely: Memory-efficient unsupervised training of high-resolution temporal gan. International Journal of Computer Vision, 128(10):2586–2606, Nov 2020.
  • [69] Thomas Unterthiner, Sjoerd van Steenkiste, Karol Kurach, Raphael Marinier, Marcin Michalski, and Sylvain Gelly. Towards accurate generative models of video: A new metric and challenges, 2019.
  • [70] João Carreira and Andrew Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4724–4733, 2017.
  • [71] S. Linnainmaa. The representation of the cumulative rounding error of an algorithm as a taylor expansion of the local rounding errors. Master’s thesis, University of Helsinki, 1970.
  • [72] W. Baur and V. Strassen. The complexity of partial derivatives. Theoretical Computer Science, 22:317 – 330, 1983.
  • [73] Alan J Hoffman and Helmut W Wielandt. The variation of the spectrum of a normal matrix. In Selected Papers Of Alan J Hoffman: With Commentary, pages 118–120. World Scientific, 2003.
  • [74] Rewon Child. Very deep VAEs generalize autoregressive models and can outperform them on images.
  • [75] Ceyuan Yang, Yujun Shen, Yinghao Xu, and Bolei Zhou. Data-efficient instance generation from instance discrimination. In Advances in Neural Information Processing Systems, volume 34, pages 9378–9390. Curran Associates, Inc.

Appendix A Proofs

A.1 Proof of Theorem 1

The proof of Theorem 1 combines three ingredients. The first is the relation between the circuit size of a function $C$ and that of its partial derivatives $\nabla C = (\partial C/\partial x_{1}, \dots, \partial C/\partial x_{n})$.

Lemma 1.

The function $\nabla C$ has a circuit over basis $\nabla B \cup \{+, \times\}$ whose size is within a constant factor of the size of $C$.

Lemma 1 is a feature of the backpropagation algorithm [71, 72]. This is a linear-time algorithm for constructing a circuit for $\nabla C$ given the circuit $C$ as input. In contrast, the forward propagation algorithm allows efficient computation of a single (partial) derivative even for circuits with multivalued outputs, giving the second ingredient:

Lemma 2.

Let $C$ be a circuit over basis $B$ and $t$ be an input to $C$. There exists a circuit that computes the derivative $\partial g/\partial t$ for every gate $g$ of $C$ over basis $\nabla B \cup \{+, \times\}$ whose size is within a constant factor of the size of $C$.
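To make the forward propagation idea behind Lemma 2 concrete, the sketch below evaluates a small arithmetic circuit while carrying, for every gate, both its value and its derivative with respect to one chosen input $t$. Each gate costs a constant number of extra additions and multiplications, which is the constant-factor overhead the lemma refers to. The `Dual` class, the example `circuit`, and the restriction to the gates $+$ and $\times$ are illustrative assumptions and not part of the paper.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Dual:
    """A gate value paired with its derivative with respect to the chosen input t."""
    value: float
    deriv: float

    def __add__(self, other):
        other = _lift(other)
        # Sum rule: (g + h)' = g' + h'
        return Dual(self.value + other.value, self.deriv + other.deriv)

    __radd__ = __add__

    def __mul__(self, other):
        other = _lift(other)
        # Product rule: (g * h)' = g' * h + g * h'
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)

    __rmul__ = __mul__

def _lift(x):
    """Treat plain numbers as constants, i.e. gates with zero derivative in t."""
    return x if isinstance(x, Dual) else Dual(float(x), 0.0)

def circuit(t, x):
    """A hypothetical three-gate circuit over the basis {+, x}."""
    g1 = t * x      # gate 1
    g2 = g1 + t     # gate 2
    g3 = g2 * g2    # gate 3
    return [g1, g2, g3]

# Seed the chosen input t with derivative 1; x is treated as a constant.
gates = circuit(Dual(2.0, 1.0), 3.0)
derivatives = [g.deriv for g in gates]
```

A single forward pass therefore yields $\partial g/\partial t$ for all gates at once, in contrast to backpropagation, which yields the derivatives of one output with respect to all inputs.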

The last ingredient is the following identity. For a scalar function $f$ over the complex numbers and a matrix $X$ diagonalizable as $U\Lambda U^{T}$, we define $f(X)$ to be the function $Uf(\Lambda)U^{T}$, where $f$ is applied entry-wise to the diagonal matrix $\Lambda$.

Lemma 3.

For every $f$ that is analytic over an open domain $\Omega$ containing all sufficiently large complex numbers and every matrix $X$ whose spectrum is contained in $\Omega$, $\nabla \mathrm{Tr}(f(X)) = f'(X)$.

We first illustrate the proof in the special case when $\|X\|$ is within the radius of convergence of $f$. Namely, $f(x)$ is represented by the absolutely convergent series $\sum f^{(k)}(0)x^{k}/k!$ for all $|x| \leq \rho$. Then $f(X) = \sum f^{(k)}(0)X^{k}/k!$ assuming $\|X\| \leq \rho$. By linearity (and using the fact that derivatives preserve radius of convergence) it is sufficient to show that

$$\nabla \mathrm{Tr}X^{k} = \frac{dX^{k}}{dX}, \tag{6}$$

which can be verified by explicit calculation: both sides equal $kX^{k-1}$. This is sufficient to establish Theorem 1 for all integer $\alpha > 2$.
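The identity $\nabla \mathrm{Tr}X^{k} = kX^{k-1}$ can also be checked numerically. The sketch below compares a central finite-difference gradient of $\mathrm{Tr}X^{k}$ with $kX^{k-1}$ on a random symmetric matrix. It is only a sanity check under stated assumptions: the matrix size, the choice $k=3$, the step size, and the helper names are arbitrary, and the matrix is taken symmetric so that no transpose appears in the gradient.

```python
import numpy as np

def trace_power(x, k):
    """Tr(X^k) for a square matrix X."""
    return np.trace(np.linalg.matrix_power(x, k))

def numerical_gradient(f, x, eps=1e-6):
    """Entry-wise central finite differences of a scalar function of a matrix."""
    grad = np.zeros_like(x)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            step = np.zeros_like(x)
            step[i, j] = eps
            grad[i, j] = (f(x + step) - f(x - step)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
a = rng.normal(size=(4, 4))
x = (a + a.T) / 2                 # symmetric test matrix
k = 3

numeric = numerical_gradient(lambda m: trace_power(m, k), x)
analytic = k * np.linalg.matrix_power(x, k - 1)
max_error = float(np.max(np.abs(numeric - analytic)))
```

The two gradients should agree up to finite-difference error, so `max_error` is expected to be many orders of magnitude smaller than the entries of the gradient itself.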

Proof of Lemma 3.

The Cauchy integral formula for matrices yields the representation

$$f(X) = \frac{1}{2\pi i}\int_{C} f(z)(zI - X)^{-1}\,dz,$$

for any closed curve $C$ whose interior contains the spectrum of $X$. As $(zI - X)^{-1}$ is continuous along $C$, we can write

$$\nabla \mathrm{Tr}f(X) = \frac{1}{2\pi i}\int_{C} f(z)\,\nabla \mathrm{Tr}(zI - X)^{-1}\,dz. \tag{7}$$

Choosing $C$ to be a circle of radius $\rho$ larger than the spectral norm of $X$, for all $z$ of magnitude $\rho$ we have the identity

$$(zI - X)^{-1} = z^{-1}(I - z^{-1}X)^{-1} = z^{-1}\sum_{k=0}^{\infty} z^{-k}X^{k}.$$

As the series $\sum z^{-k}\nabla \mathrm{Tr}X^{k} = \sum kz^{-k}X^{k-1}$ converges absolutely in spectral norm, using (6) we obtain the identity $\nabla \mathrm{Tr}(zI - X)^{-1} = d(zI - X)^{-1}/dX$, namely the lemma holds for the function $f(X) = (zI - X)^{-1}$. Plugging into (7) and exchanging the order of integration and differentiation proves the lemma. ∎

Proof of Theorem 1.

Assume $\mathrm{Tr}\rho^{\alpha}$ (resp., $-\mathrm{Tr}\rho\log\rho$) has circuit size $s(d)$. By Lemma 1 and Lemma 3, $\nabla\mathrm{Tr}\rho^{\alpha}=\alpha\rho^{\alpha-1}$ (resp., $-\nabla\mathrm{Tr}\rho\log\rho=-\log\rho-1/\ln 2$) has circuit size $O(s(d))$. For every symmetric matrix $X$ and sufficiently small $t$, the matrix $\rho=I+tX$ is positive semi-definite. By Lemma 2, the $R^{d^{2}}$-valued second derivative $\partial^{2}/\partial t^{2}$ of this gradient along $\rho=I+tX$ has circuit size $O(s(d))$. The value of this function at $t=0$ is $\alpha(\alpha-1)(\alpha-2)X^{2}$ (resp., $X^{2}$), namely the square of the input matrix $X$ up to a constant factor. Finally, computing the product $AB$ reduces to squaring the symmetric matrix

$$\begin{pmatrix} & A^{T} & B \\ A & & \\ B^{T} & & \end{pmatrix}.$$ ∎
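The two reductions used in this proof can be checked numerically. The following is a minimal NumPy sketch, not part of the paper's codebase: the helper names `sym_power`, `second_derivative_at_zero`, and `product_via_squaring` are hypothetical, the exponent $\alpha=2.5$ is an arbitrary illustrative choice, and the second derivative is approximated by a central finite difference rather than computed by a circuit.

```python
import numpy as np

def sym_power(S, p):
    """Real power of a symmetric positive-definite matrix via eigendecomposition."""
    w, V = np.linalg.eigh(S)
    return (V * w**p) @ V.T  # scales column j of V by w[j]**p

def second_derivative_at_zero(X, alpha, h=1e-3):
    """Central finite difference of t -> alpha * (I + tX)^(alpha - 1) at t = 0."""
    I = np.eye(X.shape[0])
    g = lambda t: alpha * sym_power(I + t * X, alpha - 1)
    return (g(h) - 2 * g(0.0) + g(-h)) / h**2

def product_via_squaring(A, B):
    """Read A @ B off the square of M = [[0, A^T, B], [A, 0, 0], [B^T, 0, 0]]."""
    d = A.shape[0]
    Z = np.zeros((d, d))
    M = np.block([[Z, A.T, B],
                  [A, Z, Z],
                  [B.T, Z, Z]])
    M2 = M @ M
    # Block row 2 of M is [A, 0, 0] and block column 3 is [B; 0; 0],
    # so block (2, 3) of M^2 is exactly A @ B.
    return M2[d:2 * d, 2 * d:3 * d]

rng = np.random.default_rng(0)
d, alpha = 4, 2.5  # illustrative values, not taken from the paper

S = rng.standard_normal((d, d))
X = (S + S.T) / 2
X /= np.linalg.norm(X, 2)  # spectral norm 1, so I + tX stays positive definite for |t| < 1

approx = second_derivative_at_zero(X, alpha)
exact = alpha * (alpha - 1) * (alpha - 2) * (X @ X)
print("finite-difference error:", np.max(np.abs(approx - exact)))

A = rng.standard_normal((d, d))
B = rng.standard_normal((d, d))
print("A @ B recovered:", np.allclose(product_via_squaring(A, B), A @ B))
```

The first check illustrates why the second derivative yields $X^{2}$ up to the constant $\alpha(\alpha-1)(\alpha-2)$, and the second shows that one squaring of a $3d\times 3d$ symmetric matrix suffices to obtain an arbitrary product $AB$.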

A.2 Proof of Theorem 2

Assuming that the shift-invariant kernel $k(\mathbf{x},\mathbf{x}')=\kappa(\mathbf{x}-\mathbf{x}')$ is normalized (i.e., $\kappa(\mathbf{0})=1$), the Fourier transform $\widehat{\kappa}$ is a valid PDF according to Bochner’s theorem and also an even function because $\kappa$ takes real values. Then, we have

$$\begin{aligned}
k(\mathbf{x},\mathbf{x}') &= \kappa_{\sigma}(\mathbf{x}-\mathbf{x}') \\
&\stackrel{(a)}{=} \int \widehat{\kappa_{\sigma}}(\bm{\omega})\exp\bigl(i\bm{\omega}^{\top}(\mathbf{x}-\mathbf{x}')\bigr)\,\mathrm{d}\bm{\omega} \\
&\stackrel{(b)}{=} \int \widehat{\kappa_{\sigma}}(\bm{\omega})\cos\bigl(\bm{\omega}^{\top}(\mathbf{x}-\mathbf{x}')\bigr)\,\mathrm{d}\bm{\omega} \\
&= \mathbb{E}_{\bm{\omega}\sim\widehat{\kappa}}\Bigl[\cos\bigl(\bm{\omega}^{\top}(\mathbf{x}-\mathbf{x}')\bigr)\Bigr] \\
&= \mathbb{E}_{\bm{\omega}\sim\widehat{\kappa}}\Bigl[\cos\bigl(\bm{\omega}^{\top}\mathbf{x}\bigr)\cos\bigl(\bm{\omega}^{\top}\mathbf{x}'\bigr)+\sin\bigl(\bm{\omega}^{\top}\mathbf{x}\bigr)\sin\bigl(\bm{\omega}^{\top}\mathbf{x}'\bigr)\Bigr]
\end{aligned}$$

Here, (a) comes from the synthesis property of the Fourier transform. (b) holds since $\widehat{\kappa_{\sigma}}$ is an even function, resulting in a zero imaginary term in the Fourier synthesis.

Therefore, since $\bigl|\cos\bigl(\bm{\omega}^{\top}\mathbf{y}\bigr)\bigr|\leq 1$ for all $\bm{\omega}$ and $\mathbf{y}$, one can apply Hoeffding’s inequality to show for independently drawn $\bm{\omega}_{1},\ldots,\bm{\omega}_{r}\stackrel{\mathrm{iid}}{\sim}\widehat{\kappa}$ the following probably correct bound holds:

$$\mathbb{P}\biggl(\Bigl|\frac{1}{r}\sum_{i=1}^{r}\cos\bigl(\bm{\omega}_{i}^{\top}(\mathbf{x}-\mathbf{x}')\bigr)-\mathbb{E}_{\bm{\omega}\sim\widehat{\kappa}}\Bigl[\cos\bigl(\bm{\omega}^{\top}(\mathbf{x}-\mathbf{x}')\bigr)\Bigr]\Bigr|\geq\epsilon\biggr)\leq 2\exp\Bigl(-\frac{r\epsilon^{2}}{2}\Bigr)$$

Therefore, as the identity $\cos(a-b)=\cos(a)\cos(b)+\sin(a)\sin(b)$ reveals $\frac{1}{r}\sum_{i=1}^{r}\cos\bigl(\bm{\omega}_{i}^{\top}(\mathbf{x}-\mathbf{x}')\bigr)=\widetilde{\phi}_{r}(\mathbf{x})^{\top}\widetilde{\phi}_{r}(\mathbf{x}')$, the above bound can be rewritten as

$$\mathbb{P}\biggl(\Bigl|\widetilde{\phi}_{r}(\mathbf{x})^{\top}\widetilde{\phi}_{r}(\mathbf{x}')-k(\mathbf{x},\mathbf{x}')\Bigr|\geq\epsilon\biggr)\leq 2\exp\Bigl(-\frac{r\epsilon^{2}}{2}\Bigr).$$

Also, $\widetilde{k}_{r}(\mathbf{x},\mathbf{x}')=\widetilde{\phi}_{r}(\mathbf{x})^{\top}\widetilde{\phi}_{r}(\mathbf{x}')$ is by definition a normalized kernel, implying that

$$\forall\mathbf{x}\in\mathbb{R}^{d}:\quad\widetilde{\phi}_{r}(\mathbf{x})^{\top}\widetilde{\phi}_{r}(\mathbf{x})-k(\mathbf{x},\mathbf{x})=0.$$
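The pairwise bound above can be illustrated with a small Monte Carlo experiment. The sketch below assumes the Gaussian kernel $k(\mathbf{x},\mathbf{x}')=\exp(-\|\mathbf{x}-\mathbf{x}'\|_2^2/2\sigma^2)$, whose Fourier transform is the density of $\mathcal{N}(\mathbf{0},\sigma^{-2}I)$, and a feature map that stacks the $r$ cosines and $r$ sines scaled by $1/\sqrt{r}$, as suggested by the cosine identity above. The function names and the values of `n`, `d`, `r`, `sigma`, and `delta` are illustrative assumptions, not settings from the paper.

```python
import numpy as np

def rff_features(X, omegas):
    """Map each row of X to 2r random Fourier features (r cosines, then r sines)."""
    r = omegas.shape[0]
    proj = X @ omegas.T  # shape (n, r), entries omega_i^T x
    return np.hstack([np.cos(proj), np.sin(proj)]) / np.sqrt(r)

def gaussian_kernel(X, sigma):
    """Exact Gaussian kernel matrix exp(-||x - x'||^2 / (2 sigma^2))."""
    sq = np.sum(X**2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2 * (X @ X.T), 0.0)
    return np.exp(-d2 / (2 * sigma**2))

rng = np.random.default_rng(0)
n, d, r, sigma = 50, 8, 4000, 2.0  # illustrative sizes
X = rng.standard_normal((n, d))
omegas = rng.standard_normal((r, d)) / sigma  # omega ~ N(0, I / sigma^2)

Phi = rff_features(X, omegas)
K_tilde = Phi @ Phi.T           # proxy kernel matrix
K = gaussian_kernel(X, sigma)   # exact kernel matrix
max_err = np.max(np.abs(K_tilde - K))

# Hoeffding plus a union bound over the n(n-1)/2 off-diagonal pairs:
# P(max_err >= eps) <= n (n - 1) exp(-r eps^2 / 2) =: delta.
delta = 0.01
eps = np.sqrt(2 * np.log(n * (n - 1) / delta) / r)

print("diagonal of proxy kernel is 1:", np.allclose(np.diag(K_tilde), 1.0))
print("max entrywise error:", max_err)
print("high-probability bound eps:", eps)
```

Because $\cos^2+\sin^2=1$, the diagonal of the proxy kernel matrix equals one exactly, which is the normalization property stated above. The observed entrywise error is typically well below the Hoeffding-based value of `eps`, since the bound is a worst-case guarantee.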

As a result, one can apply the union bound to combine the above inequalities and show for every sample set $\mathbf{x}_{1},\ldots,\mathbf{x}_{n}$:

$$\mathbb{P}\biggl(\max_{1\leq i,j\leq n}\Bigl(\widetilde{\phi}_{r}(\mathbf{x}_{i})^{\top}\widetilde{\phi}_{r}(\mathbf{x}_{j})-k_{\mathrm{Gaussian}(\sigma^{2})}(\mathbf{x}_{i},\mathbf{x}_{j})\Bigr)^{2}\geq\epsilon^{2}\biggr)\leq 2\binom{n}{2}\exp\Bigl(-\frac{r\epsilon^{2}}{2}\Bigr).$$

Considering the normalized kernel matrix $\frac{1}{n}K=\frac{1}{n}\bigl[k(\mathbf{x}_{i},\mathbf{x}_{j})\bigr]_{1\leq i,j\leq n}$ and the proxy normalized kernel matrix $\frac{1}{n}\widetilde{K}=\frac{1}{n}\bigl[\widetilde{\phi}_{r}(\mathbf{x}_{i})^{\top}\widetilde{\phi}_{r}(\mathbf{x}_{j})\bigr]_{1\leq i,j\leq n}$, the above inequality implies that

$$\begin{aligned}
&\mathbb{P}\Bigl(\bigl\|\frac{1}{n}\widetilde{K}-\frac{1}{n}K\bigr\|^{2}_{F}\,\geq\,n^{2}\frac{\epsilon^{2}}{n^{2}}\Bigr)\;\leq\;\binom{n}{2}\exp\Bigl(-\frac{r\epsilon^{2}}{2}\Bigr) \\
\Longrightarrow\quad&\mathbb{P}\Bigl(\bigl\|\frac{1}{n}\widetilde{K}-\frac{1}{n}K\bigr\|_{F}\,\geq\,\epsilon\Bigr)\,<\,\frac{n^{2}}{2}\exp\Bigl(-\frac{r\epsilon^{2}}{2}\Bigr).
\end{aligned}\tag{8}$$

Leveraging the eigenvalue-perturbation bound in [73], we can translate the above bound to the following for the sorted eigenvalues $\lambda_{1}\geq\cdots\geq\lambda_{n}$ of $\frac{1}{n}K$ and the sorted eigenvalues $\widetilde{\lambda}_{1}\geq\cdots\geq\widetilde{\lambda}_{n}$ of $\frac{1}{n}\widetilde{K}$:

$$\sqrt{\sum_{i=1}^{n}(\widetilde{\lambda}_{i}-\lambda_{i})^{2}}\,\leq\,\Bigl\|\frac{1}{n}\widetilde{K}-\frac{1}{n}K\Bigr\|_{F}$$

which shows

$$\mathbb{P}\Bigl(\sqrt{\sum_{i=1}^{r'}(\widetilde{\lambda}_{i}-\lambda_{i})^{2}}\,\geq\,\epsilon\Bigr)\,\leq\,\frac{n^{2}}{2}\exp\Bigl(-\frac{r\epsilon^{2}}{2}\Bigr)\tag{9}$$

Defining $\delta=\frac{n^{2}}{2}\exp\Bigl(-\frac{r\epsilon^{2}}{2}\Bigr)$, i.e., $\epsilon=\sqrt{\frac{8\log\bigl(n/2\delta\bigr)}{r}}$, leads to

$$\mathbb{P}\Bigl(\sqrt{\sum_{i=1}^{n}(\widetilde{\lambda}_{i}-\lambda_{i})^{2}}\,\leq\,\epsilon\Bigr)\,\geq\,1-\delta.\tag{10}$$

Noting that the normalized proxy kernel matrix $\frac{1}{n}\widetilde{K}$ and the proxy kernel covariance matrix $\widetilde{C}_{X}$ share identical non-zero eigenvalues, together with the above bound, completes the proof of the first part of Theorem 2.
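Both facts used in this step can be verified numerically: the $n\times n$ matrix $\frac{1}{n}\widetilde{K}$ and the $2r\times 2r$ proxy covariance matrix share their non-zero eigenvalues, and the sorted eigenvalue gap is controlled by the Frobenius norm of the matrix difference. The sketch below is a hedged illustration assuming a Gaussian kernel and $\widetilde{C}_{X}=\frac{1}{n}\Phi^{\top}\Phi$, where $\Phi$ stacks the feature vectors $\widetilde{\phi}_{r}(\mathbf{x}_{i})^{\top}$ as rows. The helper names and problem sizes are hypothetical.

```python
import numpy as np

def gaussian_kernel(X, sigma):
    """Exact Gaussian kernel matrix exp(-||x - x'||^2 / (2 sigma^2))."""
    sq = np.sum(X**2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2 * (X @ X.T), 0.0)
    return np.exp(-d2 / (2 * sigma**2))

def rff_features(X, omegas):
    """Rows are the 2r-dimensional random Fourier feature vectors."""
    proj = X @ omegas.T
    return np.hstack([np.cos(proj), np.sin(proj)]) / np.sqrt(omegas.shape[0])

def desc_eigs(M):
    """Eigenvalues of a symmetric matrix in descending order."""
    return np.linalg.eigvalsh(M)[::-1]

rng = np.random.default_rng(1)
n, d, r, sigma = 40, 5, 10, 1.5  # illustrative sizes with 2r <= n
X = rng.standard_normal((n, d))
Phi = rff_features(X, rng.standard_normal((r, d)) / sigma)  # shape (n, 2r)

K_n = gaussian_kernel(X, sigma) / n  # normalized kernel matrix, n x n
K_tilde_n = Phi @ Phi.T / n          # normalized proxy kernel matrix, n x n
C_tilde = Phi.T @ Phi / n            # proxy covariance matrix, 2r x 2r

lam = desc_eigs(K_n)
lam_tilde = desc_eigs(K_tilde_n)
lam_cov = desc_eigs(C_tilde)

# The top 2r eigenvalues of the n x n proxy matrix match those of the
# 2r x 2r covariance matrix; the remaining n - 2r are numerically zero.
shared = np.allclose(lam_tilde[:2 * r], lam_cov)
rest_zero = np.allclose(lam_tilde[2 * r:], 0.0)

# Eigenvalue-perturbation bound for symmetric matrices with sorted spectra.
gap = np.sqrt(np.sum((lam_tilde - lam) ** 2))
frob = np.linalg.norm(K_tilde_n - K_n, "fro")

print("shared non-zero eigenvalues:", shared, "| remaining are zero:", rest_zero)
print("eigenvalue gap:", gap, "<= Frobenius norm:", frob)
```

This is the reason the eigenvalues can be obtained from the $2r\times 2r$ matrix alone: the cost of the eigendecomposition then depends on the number of features rather than on the sample size $n$.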

Concerning Theorem 2’s approximation guarantee for the eigenvectors, note that for each eigenvector $\widehat{\mathbf{v}}_{i}$ of the proxy kernel matrix $\frac{1}{n}\widetilde{K}$, the following holds:

$$\begin{aligned}
\Bigl\|\frac{1}{n}K\widehat{\mathbf{v}}_{i}-\lambda_{i}\widehat{\mathbf{v}}_{i}\Bigr\|_{2}\,&\leq\,\Bigl\|\frac{1}{n}K\widehat{\mathbf{v}}_{i}-\widetilde{\lambda}_{i}\widehat{\mathbf{v}}_{i}\Bigr\|_{2}+\Bigl\|\widetilde{\lambda}_{i}\widehat{\mathbf{v}}_{i}-\lambda_{i}\widehat{\mathbf{v}}_{i}\Bigr\|_{2} \\
&=\,\Bigl\|\bigl(\frac{1}{n}K-\frac{1}{n}\widetilde{K}\bigr)\widehat{\mathbf{v}}_{i}\Bigr\|_{2}+\bigl|\widetilde{\lambda}_{i}-\lambda_{i}\bigr|
\end{aligned}$$

where the equality uses $\frac{1}{n}\widetilde{K}\widehat{\mathbf{v}}_{i}=\widetilde{\lambda}_{i}\widehat{\mathbf{v}}_{i}$ and $\|\widehat{\mathbf{v}}_{i}\|_{2}=1$. Therefore, applying Young’s inequality, $(a+b)^{2}\leq 2a^{2}+2b^{2}$, shows that

$$\begin{aligned}
\Bigl\|\frac{1}{n}K\widehat{\mathbf{v}}_{i}-\lambda_{i}\widehat{\mathbf{v}}_{i}\Bigr\|^{2}_{2}\,&\leq\,2\Bigl\|\bigl(\frac{1}{n}K-\frac{1}{n}\widetilde{K}\bigr)\widehat{\mathbf{v}}_{i}\Bigr\|^{2}_{2}+2\bigl(\widetilde{\lambda}_{i}-\lambda_{i}\bigr)^{2} \\
&=\,2\mathrm{Tr}\Bigl(\widehat{\mathbf{v}}^{\top}_{i}\bigl(\frac{1}{n}K-\frac{1}{n}\widetilde{K}\bigr)^{2}\widehat{\mathbf{v}}_{i}\Bigr)+2\bigl(\widetilde{\lambda}_{i}-\lambda_{i}\bigr)^{2} \\
&=\,2\mathrm{Tr}\Bigl(\widehat{\mathbf{v}}_{i}\widehat{\mathbf{v}}^{\top}_{i}\bigl(\frac{1}{n}K-\frac{1}{n}\widetilde{K}\bigr)^{2}\Bigr)+2\bigl(\widetilde{\lambda}_{i}-\lambda_{i}\bigr)^{2},
\end{aligned}$$

which implies that

$$\begin{aligned}
\sum_{i=1}^{n}\Bigl\|\frac{1}{n}K\widehat{\mathbf{v}}_{i}-\lambda_{i}\widehat{\mathbf{v}}_{i}\Bigr\|^{2}_{2}\,&\leq\,2\mathrm{Tr}\Bigl(\bigl(\sum_{i=1}^{n}\widehat{\mathbf{v}}_{i}\widehat{\mathbf{v}}^{\top}_{i}\bigr)\bigl(\frac{1}{n}K-\frac{1}{n}\widetilde{K}\bigr)^{2}\Bigr)+2\sum_{i=1}^{n}\bigl(\widetilde{\lambda}_{i}-\lambda_{i}\bigr)^{2} \\
&=\,2\mathrm{Tr}\Bigl(\bigl(\frac{1}{n}K-\frac{1}{n}\widetilde{K}\bigr)^{2}\Bigr)+2\sum_{i=1}^{n}\bigl(\widetilde{\lambda}_{i}-\lambda_{i}\bigr)^{2} \\
&=\,2\Bigl\|\frac{1}{n}K-\frac{1}{n}\widetilde{K}\Bigr\|^{2}_{F}+2\sum_{i=1}^{n}\bigl(\widetilde{\lambda}_{i}-\lambda_{i}\bigr)^{2} \\
&\leq\,4\Bigl\|\frac{1}{n}K-\frac{1}{n}\widetilde{K}\Bigr\|^{2}_{F}.
\end{aligned}$$

Here the second line uses $\sum_{i=1}^{n}\widehat{\mathbf{v}}_{i}\widehat{\mathbf{v}}_{i}^{\top}=I$ for the orthonormal eigenvectors, and the last step uses the eigenvalue-perturbation bound. Combined with (8) evaluated at $\epsilon/2$, the above proves that

$$\begin{aligned}
\mathbb{P}\Bigl(\sqrt{\sum_{i=1}^{n}\Bigl\|\frac{1}{n}K\widehat{\mathbf{v}}_{i}-\lambda_{i}\widehat{\mathbf{v}}_{i}\Bigr\|^{2}_{2}}\geq\epsilon\Bigr)\,&\leq\,\mathbb{P}\Bigl(\bigl\|\frac{1}{n}\widetilde{K}-\frac{1}{n}K\bigr\|_{F}\,\geq\,\frac{\epsilon}{2}\Bigr) \\
&<\,\frac{n^{2}}{2}\exp\Bigl(-\frac{r\epsilon^{2}}{8}\Bigr).
\end{aligned}$$

Therefore, considering the provided definition $\delta=\frac{n^{2}}{2}\exp\Bigl(-\frac{r\epsilon^{2}}{2}\Bigr)$, i.e., $2\epsilon=\sqrt{\frac{32\log\bigl(n/2\delta\bigr)}{r}}$, we will have the following, which completes the proof:

$$\mathbb{P}\Bigl(\sqrt{\sum_{i=1}^{n}\Bigl\|\frac{1}{n}K\widehat{\mathbf{v}}_{i}-\lambda_{i}\widehat{\mathbf{v}}_{i}\Bigr\|^{2}_{2}}\leq 2\epsilon\Bigr)\,\geq\,1-\delta.$$
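The deterministic core of this argument, namely that the summed eigenvector residuals are at most twice the Frobenius distance between the two normalized kernel matrices, can be checked directly. The sketch below is a minimal illustration under the same assumptions as before (Gaussian kernel, cosine and sine random Fourier features). The function `residual_and_bound` and all sizes are hypothetical choices for illustration.

```python
import numpy as np

def gaussian_kernel(X, sigma):
    """Exact Gaussian kernel matrix exp(-||x - x'||^2 / (2 sigma^2))."""
    sq = np.sum(X**2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2 * (X @ X.T), 0.0)
    return np.exp(-d2 / (2 * sigma**2))

def rff_features(X, omegas):
    """Rows are the 2r-dimensional random Fourier feature vectors."""
    proj = X @ omegas.T
    return np.hstack([np.cos(proj), np.sin(proj)]) / np.sqrt(omegas.shape[0])

def residual_and_bound(K_n, K_tilde_n):
    """Return (sqrt of summed squared residuals, 2 * Frobenius distance).

    Eigenvectors come from the proxy matrix and eigenvalues from the exact
    matrix; both spectra are sorted in the same (ascending) order, which
    gives the same pairing as sorting both in descending order.
    """
    lam = np.linalg.eigvalsh(K_n)          # exact eigenvalues, ascending
    _, V_hat = np.linalg.eigh(K_tilde_n)   # proxy eigenvectors as orthonormal columns
    residuals = K_n @ V_hat - V_hat * lam  # column i: K_n v_i - lam_i v_i
    lhs = np.sqrt(np.sum(residuals**2))
    rhs = 2 * np.linalg.norm(K_n - K_tilde_n, "fro")
    return lhs, rhs

rng = np.random.default_rng(2)
n, d, r, sigma = 60, 6, 200, 1.5  # illustrative sizes
X = rng.standard_normal((n, d))
Phi = rff_features(X, rng.standard_normal((r, d)) / sigma)

lhs, rhs = residual_and_bound(gaussian_kernel(X, sigma) / n, Phi @ Phi.T / n)
print("residual norm:", lhs, "<= 2 * Frobenius distance:", rhs)
```

The inequality holds for any pair of symmetric matrices, so the randomness of the Fourier features only enters through the probability that the Frobenius distance is small, which is what the bound (8) controls.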

Appendix B Additional Numerical Results

B.1 Synthetic Image Dataset Modes

In addition to clustering real image datasets, we also applied FKEA with a varying Gaussian kernel bandwidth parameter $\sigma$ to synthetic datasets produced by generative models. The results are presented for the LDM, VDVAE, StyleGAN-XL, and InsGen models trained on FFHQ samples.
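As a rough illustration of how such clusters can be read off from random Fourier features, the sketch below eigendecomposes the RFF covariance matrix and lists the samples that score highest along each leading eigenvector. It is a minimal sketch and not the released FKEA code: the function names, the cosine/sine feature map with $r$ sampled frequencies, and the scoring rule (projection of a sample's feature vector onto the eigenvector) are assumptions made for illustration.

```python
import numpy as np


def rff_features(x, r, sigma, rng):
    """Assumed feature map: r Gaussian frequencies, cos and sin parts (2r dims)."""
    omega = rng.normal(scale=1.0 / sigma, size=(x.shape[1], r))
    proj = x @ omega
    return np.hstack([np.cos(proj), np.sin(proj)]) / np.sqrt(r)


def top_mode_samples(x, r=256, sigma=1.0, num_modes=3, top_k=5, seed=0):
    """Return [(eigenvalue, indices of top_k samples)] for the leading modes."""
    rng = np.random.default_rng(seed)
    phi = rff_features(x, r, sigma, rng)           # shape (n, 2r)
    cov = phi.T @ phi / x.shape[0]                 # shape (2r, 2r), trace 1
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1][:num_modes]  # largest eigenvalues first
    modes = []
    for j in order:
        scores = phi @ eigvecs[:, j]
        # Eigenvector signs are arbitrary: orient so the dominant entry is positive.
        if scores[np.argmax(np.abs(scores))] < 0:
            scores = -scores
        top = np.argsort(scores)[::-1][:top_k]
        modes.append((float(eigvals[j]), top))
    return modes


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    sizes = [120, 60, 30]                          # toy clusters of unequal size
    centers = np.array([[0, 0, 0, 0], [10, 0, 0, 0], [0, 10, 0, 0]], dtype=float)
    data = np.vstack([c + 0.1 * rng.normal(size=(s, 4)) for c, s in zip(centers, sizes)])
    for val, idx in top_mode_samples(data, r=500, sigma=1.0):
        print(f"eigenvalue {val:.3f}: samples {idx.tolist()}")
```

In this toy setting the bandwidth controls how finely the data are split: a smaller `sigma` separates nearby groups into distinct modes, while a larger one merges them, which is why the bandwidth has to be tuned per embedding.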

Figure 10: RFF-based identified clusters used in FKEA evaluation of the LDM [55] generative model on FFHQ with DINOv2 embeddings and bandwidth $\sigma=20$.
Figure 11: RFF-based identified clusters used in FKEA evaluation of the VDVAE [74] generative model on FFHQ with DINOv2 embeddings and bandwidth $\sigma=20$.
Figure 12: RFF-based identified clusters used in FKEA evaluation of the InsGen [75] generative model on FFHQ with DINOv2 embeddings and bandwidth $\sigma=20$.
Figure 13: RFF-based identified clusters used in FKEA evaluation of the StyleGAN-XL [58] generative model on FFHQ with DINOv2 embeddings and bandwidth $\sigma=20$.

B.2 Effect of other embeddings on FKEA clustering

Although DINOv2 is the primary embedding in our experiments, we acknowledge that other embedding models, such as SwAV [24] and CLIP [23], can also be used. The resulting clusters differ from the original DINOv2 clusters and require separate tuning of the bandwidth parameter. In our experiments, the SwAV embedding emphasizes object placement, such as animals in grass or against white backgrounds, as seen in Figure 14. CLIP, on the other hand, clusters by object type, such as birds, dogs, or bugs, as seen in Figure 15. These results indicate that using FKEA with other embeddings slightly changes the features that drive the clustering; however, it does not hinder the performance of RFF-based clustering with the FKEA method.

Figure 14: RFF-based identified clusters used in FKEA evaluation of the SwAV embedding on ImageNet with bandwidth $\sigma=0.8$.
Figure 15: RFF-based identified clusters used in FKEA evaluation of the CLIP embedding on ImageNet with bandwidth $\sigma=7.0$.

B.3 Text Dataset Modes

To understand the applicability and effectiveness of the FKEA method beyond images, we extended our study to text datasets. We observed that clustering text data poses a more challenging task compared to image data. This increased difficulty arises from the ambiguity in defining clear separability factors within text, a contrast to the more visually distinguishable criteria in images. The process of evaluating text clusters is not straightforward and often varies significantly based on human judgment and perception.

To visualise the results, we use the YAKE [65] algorithm to extract the keywords in each text mode and present the identified unigram and bigram keywords. The results show that the method carries over to text datasets and that the identified clusters are meaningful.

Table 4 outlines the largest modes identified within a news dataset analyzed using the FKEA method, with a detailed focus on the content themes of each mode. The most dominant mode is associated with topics related to crime and police activities, indicating a frequent coverage area in the dataset. Mode 2 is closely correlated with President Obama, reflecting a significant focus on political coverage. Mode 3 pertains to dieting, which suggests a presence of health and lifestyle topics. Mode 4 is linked to environmental disasters, highlighting the dataset’s attention to ecological and crisis-related news. Finally, Mode 5 deals with plane crash accidents, underscoring the coverage of major transportation incidents.

| Mode #1 | Mode #2 | Mode #3 | Mode #4 | Mode #5 |
|---|---|---|---|---|
| police | President Obama | size | people | died |
| British police | Obama | year | severe weather | family |
| police officer | Barack Obama | weight | Death toll | plane crash |
| Police found | President | dress size | heavy rain | mother |
| family | White House | stone | Environment Agency | plane |
| found | Obama administration | Slimming World | million people | found |
| told police | Obama calls | lost | rain | people |
| court | House | lose weight | flood warnings | children |
| arrested | United States | diet | people dead | hospital |
| home | Obama plan | model | people killed | found dead |

Table 4: Top 5 CNN/Dailymail 3.0.0 [62][63] dataset modes with corresponding eigenvalues, using text-embedding-3-large embeddings and bandwidth $\sigma=0.8$.

Table 5 delineates the distribution of genres and production types within a dataset of movie summaries analyzed using the FKEA method. The first mode predominantly covers drama TV shows without focusing on any specific subtopic, indicating a broad categorization within this genre. From mode 2 onwards, the features become more distinct and defined. Mode 2 specifically represents Bollywood movies, with a significant emphasis on the Romance genre. Mode 3 is dedicated to clustering comedy shows. Mode 4 is exclusively associated with cartoons, evidenced by keywords such as "Tom & Jerry". Lastly, mode 5 clusters together detective and crime fiction shows.

| Mode #1 | Mode #2 | Mode #3 | Mode #4 | Mode #5 |
|---|---|---|---|---|
| The House on Tele.. | Anand | Bring Your Smile Along | Chhota Bheem… | Walk a Crooked Mile |
| Seems Like Old Times | I Love You | The Girl Most Likely | Duck Amuck | Assignment to Kill |
| Shadows and Fog | Toh Baat Pakki | Hips, Hips, Hooray! | Hare-Abian Nights | The Crime of the Century |
| Obsession | Abodh | Lady Be Good | Porky’s Five and Ten | Murder at Glen Athol |
| Milk Money | Khulay Aasman… | The Courtship of Eddie’s… | Sock-a-Doodle-Do | Guns |
| Very Bad Things | Kasthuri Maan | You Live and Learn | Buccaneer Bunny | Because of the Cats |
| Blame It on the Bell… | Chhaya | Dames | Hare Lift | The House of Hate |
| The Miracle Man | Yeh Dillagi | Painting the Clouds… | Scrap Happy Daffy | The Ace of Scotland Yard |
| The Sleeping Tiger | Deva | Pin Up Girl | Hic-cup Pup | The World Gone Mad |
| The Scapegoat | Bhalobasa Bha… | Too Young to Kiss | The Goofy Gophers | Firepower |
| Keywords | | | | |
| mystery | hindi film | musical | animation | crime |
| noir | romance | theme songs | Tom & Jerry | murder |
| kidnap | love | city | Spike | detective |
| crime | marriage | romance | adventure | investigation |
| police | daughter | comedy | | killer |

Table 5: Top 5 CMU Movie Summary Corpus [64] dataset modes with corresponding eigenvalues, using text-embedding-3-large embeddings and bandwidth $\sigma=0.8$.

B.4 Video Dataset Modes

In this section, we present additional experiments on the Kinetics-400 [67] video dataset. This dataset comprises 400 human action categories, each with a minimum of 400 video clips depicting the action. As with standard video evaluation metrics, we used the pre-trained I3D model [70], which maps each video to a 1024-dimensional feature vector. In Figure 16, the first mode captures broader concepts, while the other modes focus on specific ones. The plots also indicate that increasing the number of classes from 40 to 400 increases the FKEA metrics.
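The trend described above, where the scores grow with the number of classes, can be reproduced on toy data. The sketch below is an illustration under stated assumptions rather than the released FKEA code: it takes VENDI to be the exponential of the Shannon entropy of the eigenvalues and RKE to be the inverse of the sum of squared eigenvalues (both definitions are assumptions based on [1] and [2]), evaluates them on the eigenvalues of the RFF covariance matrix, and uses hypothetical helper names with synthetic Gaussian clusters standing in for I3D video features.

```python
import numpy as np


def rff_features(x, r, sigma, rng):
    """Assumed feature map: r Gaussian frequencies, cos and sin parts (2r dims)."""
    omega = rng.normal(scale=1.0 / sigma, size=(x.shape[1], r))
    proj = x @ omega
    return np.hstack([np.cos(proj), np.sin(proj)]) / np.sqrt(r)


def fkea_scores(x, r=500, sigma=1.0, seed=0):
    """Return (VENDI, RKE) computed from the RFF covariance eigenvalues."""
    rng = np.random.default_rng(seed)
    phi = rff_features(x, r, sigma, rng)
    cov = phi.T @ phi / x.shape[0]                 # (2r, 2r) matrix with unit trace
    lam = np.clip(np.linalg.eigvalsh(cov), 0.0, None)
    lam = lam / lam.sum()                          # guard against round-off
    pos = lam[lam > 0]
    vendi = float(np.exp(-np.sum(pos * np.log(pos))))  # exp of Shannon entropy
    rke = float(1.0 / np.sum(lam ** 2))                # inverse of squared Frobenius norm
    return vendi, rke


def clustered_data(num_classes, per_class=50, dim=16, seed=0):
    """Toy stand-in for class-structured embeddings: tight, well-separated clusters."""
    rng = np.random.default_rng(seed)
    centers = rng.normal(scale=5.0, size=(num_classes, dim))
    return np.vstack([c + 0.1 * rng.normal(size=(per_class, dim)) for c in centers])


if __name__ == "__main__":
    for k in (4, 16):
        vendi, rke = fkea_scores(clustered_data(k))
        print(f"{k} classes: VENDI = {vendi:.2f}, RKE = {rke:.2f}")
```

Only the $2r \times 2r$ covariance matrix is eigendecomposed here, so the cost of the spectral step does not depend on the number of samples once the features have been accumulated.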

(a) Mode #1
(b) Mode #2
(c) Mode #3
(d) Mode #4
(e) Mode #5
(f) Mode #6
Figure 16: RFF-based identified clusters used in FKEA evaluation on the Kinetics-400 dataset with I3D embeddings. Plots indicate that increasing the number of classes from 40 to 400 results in an increase in the FKEA metrics.