1 Wyss Institute for Biologically Inspired Engineering
2 Department of Systems Biology, Harvard University
* Corresponding authors: ggowri@g.harvard.edu, allon_klein@hms.harvard.edu
Senior authors: A.M.K. and P.Y. co-supervised this work.

Approximating mutual information of high-dimensional variables using learned representations

Gokul Gowri, Xiao-Kang Lun, Allon M. Klein, Peng Yin
Abstract

Mutual information (MI) is a general measure of statistical dependence with widespread application across the sciences. However, estimating MI between multi-dimensional variables is challenging because the number of samples necessary to converge to an accurate estimate scales unfavorably with dimensionality. In practice, existing techniques can reliably estimate MI in up to tens of dimensions, but fail in higher dimensions, where sufficient sample sizes are infeasible. Here, we explore the idea that underlying low-dimensional structure in high-dimensional data can be exploited to faithfully approximate MI in high-dimensional settings with realistic sample sizes. We develop a method that we call latent MI (LMI) approximation, which applies a nonparametric MI estimator to low-dimensional representations learned by a simple, theoretically motivated model architecture. Using several benchmarks, we show that unlike existing techniques, LMI can approximate MI well for variables with $>10^{3}$ dimensions if their dependence structure has low intrinsic dimensionality. Finally, we showcase LMI on two open problems in biology. First, we approximate MI between protein language model (pLM) representations of interacting proteins, and find that pLMs encode non-trivial information about protein-protein interactions. Second, we quantify cell fate information contained in single-cell RNA-seq (scRNA-seq) measurements of hematopoietic stem cells, and find a sharp transition during neutrophil differentiation when fate information captured by scRNA-seq increases dramatically.

1 Introduction

Mutual information is a universal dependence measure which has been used to describe relationships between variables in a wide variety of complex systems: developing embryos [1], artificial neural networks [2], flocks of birds [3], and more. Its widespread use can be attributed to at least two of its appealing properties: equitability and interpretability.

Many dependence measures are inequitable, meaning that they are biased toward relationships of specific forms [4]. For example, Pearson correlations quantify the strength of linear relationships, and Spearman correlations quantify the strength of monotonic relationships. Inequitability can be particularly problematic for complex systems, where relationships can be nonlinear, non-monotonic, or involve higher-order interactions between multidimensional variables [5]. Mutual information (MI) stands out as an equitable measure that can capture relationships of any form, and generalizes across continuous, discrete, and multidimensional variables [6]. And when scaled consistently, MI provides a universal currency in interpretable units – which can be understood as the number of ‘bits’ of information shared between variables [7]. MI can also be interpreted through decomposition into pointwise mutual information (pMI) [8, 9], which attributes dependence to specific pairs of values.

MI can be defined as the Kullback-Leibler divergence, $D_{\text{KL}}$, of a joint distribution from the product of its marginals. For absolutely continuous random variables $X, Y$ defined over $\mathcal{X}, \mathcal{Y}$, with joint distribution $P_{XY}$ and marginal distributions $P_X, P_Y$,

$$I(X;Y) = D_{\text{KL}}(P_{XY} \,\|\, P_X \otimes P_Y) = \int_{\mathcal{X}} \int_{\mathcal{Y}} P_{XY}(x,y) \log \frac{P_{XY}(x,y)}{P_X(x) P_Y(y)} \, dy \, dx \tag{1}$$
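As a concrete check of Eq. 1, the sketch below evaluates the double integral numerically for a standard bivariate Gaussian with correlation $\rho$ and compares it to the well-known closed form $-\frac{1}{2}\log_2(1-\rho^2)$ bits. The function names, grid width, and grid resolution are illustrative choices and are not taken from the paper.

```python
import numpy as np


def gaussian_mi_closed_form(rho):
    """MI in bits between unit-variance jointly Gaussian scalars with correlation rho."""
    return -0.5 * np.log2(1.0 - rho ** 2)


def gaussian_mi_by_integration(rho, half_width=8.0, n_grid=801):
    """Approximate the double integral of Eq. 1 with a Riemann sum on a square grid."""
    grid = np.linspace(-half_width, half_width, n_grid)
    step = grid[1] - grid[0]
    x, y = np.meshgrid(grid, grid, indexing="ij")
    # Log densities are written analytically so that log(0) never occurs.
    log_px = -0.5 * x ** 2 - 0.5 * np.log(2.0 * np.pi)
    log_py = -0.5 * y ** 2 - 0.5 * np.log(2.0 * np.pi)
    quad = (x ** 2 - 2.0 * rho * x * y + y ** 2) / (1.0 - rho ** 2)
    log_pxy = -0.5 * quad - np.log(2.0 * np.pi) - 0.5 * np.log(1.0 - rho ** 2)
    # Integrand of Eq. 1: P_XY * log(P_XY / (P_X * P_Y)), in nats.
    integrand = np.exp(log_pxy) * (log_pxy - log_px - log_py)
    return integrand.sum() * step ** 2 / np.log(2.0)  # convert nats to bits


print(gaussian_mi_closed_form(0.8), gaussian_mi_by_integration(0.8))
```

Grid integration of this kind is only practical in very low dimensions, which is one way to see why sample-based estimators are needed for the settings considered here.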

In practice, $P_{XY}$ is often unknown, and $I(X;Y)$ must be estimated from observations $\{(x_i, y_i)\}$ that sparsely sample $P_{XY}$. While nonparametric MI estimators have been remarkably successful for variables with a single dimension [10, 11, 12, 13], MI estimation for high-dimensional variables remains a significant challenge. Nonparametric MI estimation suffers from the curse of dimensionality: accurate estimation requires a number of samples that scales exponentially with the dimensionality of the variables [14].

An exciting recent approach to scaling MI estimates to high dimension is the use of variational bounds on KL divergence to reduce the MI estimation problem to a gradient descent optimization problem [15, 16]. MI estimators based on variational bounds indeed empirically perform well for data with ones to tens of dimensions [12], but still suffer from the curse of dimensionality [14, 17], and can exhibit high variance [16, 18]. To our knowledge, no techniques have been shown to reliably estimate MI in practice for variables with hundreds or thousands of dimensions – a regime relevant to many fields, including genomics, neuroscience, ecology, and machine learning [19, 20, 5, 21].

More generally, it has been shown that no technique can accurately estimate MI from finite samples without making strong assumptions about the distribution from which samples are drawn [17], resulting in a fundamental tension between the theoretical appeal of MI and the practical difficulty of its estimation. One way to resolve this tension is to develop alternative measures of statistical dependence, which retain desirable properties of MI, but are feasible to estimate. Sliced MI, which is the average of MI estimates on random low-dimensional linear projections (“slices”) of high-dimensional data, is an example of such an approach [14, 22]. While sliced MI is an appealing measure for information-theoretic objective functions in machine learning [23, 24], it does not retain the interpretability (in bits) of classical MI, and is inequitable, as it quantifies information that can be extracted through linear projections [14].

Here, we take a complementary approach to sliced measures. Rather than considering alternatives to classical MI, we ask if it is possible to make strong, yet reasonable, assumptions about data which enable feasible MI estimation. In this work, we explore the usefulness of the empirically supported assumption that complex systems have underlying low-dimensional structure [5].

Specifically, we propose latent mutual information (LMI) approximation, which applies a nonparametric MI estimator to mutually informative compressed representations of high-dimensional variables. To learn such representations, we design a simple neural network architecture motivated by information-theoretic principles. We demonstrate, using synthetic multivariate Gaussian data, that LMI approximation can be stable for variables reaching thousands of dimensions, provided their dependence has low-dimensional structure. We then introduce an approach for resampling real data to generate benchmark datasets of two high-dimensional variables where ground truth mutual information is known. Using this approach, we evaluate the ability of LMI to capture statistical dependence in two types of real data: images and protein sequence embeddings. Finally, we apply LMI to two open problems in biology: quantifying interaction information in protein language model embeddings, and quantifying cell fate information in the gene expression of mouse hematopoietic stem cells.

2 Approach

Figure 1: Workflow of latent MI approximation. a) Embed high-dimensional data in low-dimensional space such that mutually informative structure is preserved. b) The KSG estimator [10] is used to estimate MI by averaging over pointwise MI (pMI) contributions.

Our goal is to approximate MI from high-dimensional data using low-dimensional representations which capture dependence structure. Our specific approach is to use neural networks to map variables $X, Y$ to low-dimensional representations $Z_x, Z_y$. Then, we use the well-established nonparametric MI estimator introduced in [10] to estimate $\hat{I}(Z_x; Z_y)$.

The central challenge here is to learn $Z_x, Z_y$ such that $\hat{I}(Z_x; Z_y) \approx I(X;Y)$. One sensible approach would be to use autoencoders [25] or other popular nonlinear dimensionality reduction techniques [26, 27] to compress each variable separately. While such an approach could yield a good approximation if compression is perfectly lossless, it can result in a poor approximation if compression is lossy. An illustrative example is two variables each with thousands of independent dimensions but a single pair of strongly dependent dimensions: two separate lossy compressions are unlikely to preserve the rare dependent components.

We can make this intuition precise using properties of entropy and MI [7]. For simplicity, let us consider an approximation using only one compressed variable, $Z_y = f(Y)$. By rewriting MI in terms of differential entropy, denoted $h$, we see that for absolutely continuous $X, Y$ with finite differential entropy,

$$I(X;Y) - I(X;Z_y) = h(X) - h(X|Y) - h(X) + h(X|Z_y) = h(X|Z_y) - h(X|Y) \tag{2}$$

In the case of lossless compression, $h(X|Z_y) = h(X|Y)$, so

$$I(X;Y) - I(X;Z_y) = h(X|Z_y) - h(X|Y) = 0 \tag{3}$$

However, if compression does not perfectly preserve information, it is possible that $h(X|Z_y) \gg h(X|Y)$. Since $h(X|Y)$ is intrinsic to the data and independent of learned representations, minimizing $h(X|Z_y)$ is equivalent to minimizing $I(X;Y) - I(X;Z_y)$. This points to an approach to learn representations suitable for approximating $I(X;Y)$: regularizing a pair of autoencoders to learn compressed representations $Z_x = f(X)$ and $Z_y = g(Y)$ while minimizing $h(X|Z_y)$ and $h(Y|Z_x)$.
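The identity in Eq. 2 can be checked on a toy example. The sketch below uses discrete variables, so Shannon entropy $H$ stands in for differential entropy $h$; this is an illustrative analog rather than the continuous setting of the text. Here $Y$ packs one bit that is informative about $X$ together with one pure-noise bit, and the two hypothetical compressions `keep_signal` and `keep_noise` each retain only one of them.

```python
from collections import defaultdict
from math import log2


def entropy(pmf):
    """Shannon entropy in bits of a dict mapping outcomes to probabilities."""
    return -sum(p * log2(p) for p in pmf.values() if p > 0)


def marginal(joint, axis):
    """Marginalize a joint pmf {(x, y): p} onto coordinate `axis` (0 or 1)."""
    out = defaultdict(float)
    for pair, p in joint.items():
        out[pair[axis]] += p
    return dict(out)


def mutual_information(joint):
    """I(X;Y) = H(X) + H(Y) - H(X,Y)."""
    return entropy(marginal(joint, 0)) + entropy(marginal(joint, 1)) - entropy(joint)


def conditional_entropy(joint):
    """H(X|Y) = H(X,Y) - H(Y)."""
    return entropy(joint) - entropy(marginal(joint, 1))


def compress(joint, f):
    """Joint pmf of (X, f(Y))."""
    out = defaultdict(float)
    for (x, y), p in joint.items():
        out[(x, f(y))] += p
    return dict(out)


def information_gap(joint, f):
    """Both sides of the discrete analog of Eq. 2 for Z_y = f(Y)."""
    compressed = compress(joint, f)
    lhs = mutual_information(joint) - mutual_information(compressed)
    rhs = conditional_entropy(compressed) - conditional_entropy(joint)
    return lhs, rhs


# X is a fair bit, S copies X with 10% flips, B is an independent fair bit, Y = 2*S + B.
JOINT = {}
for x in (0, 1):
    for s in (0, 1):
        for b in (0, 1):
            JOINT[(x, 2 * s + b)] = 0.5 * (0.9 if s == x else 0.1) * 0.5


def keep_signal(y):
    return y // 2  # retains S, the component that depends on X


def keep_noise(y):
    return y % 2  # retains B, which is independent of X


print(information_gap(JOINT, keep_signal), information_gap(JOINT, keep_noise))
```

Both compressions discard one bit of $Y$, but only the one that drops the dependent component incurs a gap, which mirrors the motivation for regularizing the representations toward preserving dependence.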

Because directly minimizing conditional entropies is intractable, we instead minimize a convenient proxy, which is the mean-squared error (MSE) loss of networks that predict one variable from another. The connection between conditional entropy and reconstruction loss has long been appreciated as a way to interpret autoencoders through an information-theoretic lens [28, 29]. Here, we observe that this connection can be applied to learn representations which lend themselves to MI estimation. We explicitly show that minimizing cross-prediction loss from $Z_x$ to $Y$ is equivalent to minimizing an upper bound on the conditional entropy $h(Y|Z_x)$ in Appendix A.1.1, Theorem 1.

Applying cross-predictive regularization to a pair of autoencoders results in a network architecture (Fig. 1) with one encoder for each variable and four decoders which reconstruct each variable from each latent code. We train the networks by minimizing the sum of the MSE reconstruction loss for each decoder. More precisely, for variables $X, Y$ with dimensionality $d_X, d_Y$, we optimize encoders $E_X, E_Y$, and decoders $D_{XX}, D_{XY}, D_{YY}, D_{YX}$ to minimize $\mathcal{L}_{AEC} = \mathcal{L}_{AE} + \mathcal{L}_{C}$, where

$$\mathcal{L}_{AE} = \frac{1}{d_X}\mathbb{E}\left[\|X - D_{XX}(E_X(X))\|_2^2\right] + \frac{1}{d_Y}\mathbb{E}\left[\|Y - D_{YY}(E_Y(Y))\|_2^2\right] \tag{4}$$
$$\mathcal{L}_{C} = \frac{1}{d_X}\mathbb{E}\left[\|X - D_{YX}(E_Y(Y))\|_2^2\right] + \frac{1}{d_Y}\mathbb{E}\left[\|Y - D_{XY}(E_X(X))\|_2^2\right] \tag{5}$$
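To make the bookkeeping in Eqs. 4 and 5 concrete, the following sketch evaluates $\mathcal{L}_{AEC}$ on a batch, with expectations replaced by sample means. The encoders and decoders are passed in as plain callables; the function names and the random linear maps in the demo are hypothetical stand-ins, not the multilayer perceptrons used in the paper, and no training is performed.

```python
import numpy as np


def mse_per_dim(target, prediction):
    """(1/d) * E[||target - prediction||_2^2], with E taken as the batch mean."""
    squared_norms = np.sum((target - prediction) ** 2, axis=1)
    return np.mean(squared_norms) / target.shape[1]


def aec_loss(x, y, enc_x, enc_y, dec_xx, dec_xy, dec_yy, dec_yx):
    """Sum of autoencoding (Eq. 4) and cross-prediction (Eq. 5) losses.

    dec_ab maps the latent code of variable A to a reconstruction of variable B.
    """
    z_x, z_y = enc_x(x), enc_y(y)
    loss_ae = mse_per_dim(x, dec_xx(z_x)) + mse_per_dim(y, dec_yy(z_y))  # Eq. 4
    loss_c = mse_per_dim(x, dec_yx(z_y)) + mse_per_dim(y, dec_xy(z_x))   # Eq. 5
    return loss_ae + loss_c


# Demo with untrained random linear maps (illustrative only).
rng = np.random.default_rng(0)
x = rng.standard_normal((64, 20))
y = rng.standard_normal((64, 30))
w_ex, w_ey = rng.standard_normal((20, 8)), rng.standard_normal((30, 8))
w_xx, w_xy = rng.standard_normal((8, 20)), rng.standard_normal((8, 30))
w_yy, w_yx = rng.standard_normal((8, 30)), rng.standard_normal((8, 20))
demo_loss = aec_loss(
    x, y,
    lambda v: v @ w_ex, lambda v: v @ w_ey,
    lambda z: z @ w_xx, lambda z: z @ w_xy,
    lambda z: z @ w_yy, lambda z: z @ w_yx,
)
print(demo_loss)
```

Dividing each term by the variable's dimensionality keeps the two variables on a comparable scale when $d_X$ and $d_Y$ differ.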

This is not the only way one could regularize autoencoders to preserve mutually informative structure. We design and empirically characterize some alternatives in Appendix A.2. We find that multiple regularization approaches can be effective, but cross-prediction comes with unique benefits. For example, cross-predictive networks can be dissected to attribute high-dimensional MI estimates to specific dimensions, as demonstrated in Appendix A.2.3.

While the specific architecture of encoders and decoders could be carefully chosen for each estimation problem (e.g. convolutional layers for image data), here we use multilayer perceptrons with a priori determined hidden layer sizes for all problems. This is intentional: a useful MI estimator should not need extensive parameter selection. Every LMI estimate shown in this paper (excluding Appendix) uses the default parameters of our library, equivalent to running lmi(X_samples,Y_samples).

To ensure that optimizing based on cross-reconstruction does not introduce spurious dependence due to overfitting, we learn representations and estimate MI on different subsets of the data. That is, for $N$ joint samples, we train the network using a subset of $N/2$ samples, then estimate MI by applying the estimator of [10] to latent representations of the remaining $N/2$ samples. A high-level overview of an MI estimate using LMI approximation is given in Algorithm 1.

We also state and prove some basic properties of LMI approximation, namely that $I(Z_x; Z_y) \leq I(X;Y)$, and that $I(Z_x; Z_y) = 0$ if $I(X;Y) = 0$, in Appendix 1.3, Theorems 2 and 3.

Algorithm 1 Estimating MI using LMI Approximation
$N$ joint samples $\{(x_i, y_i)\}_{i=1}^{N}$ of random variables $X, Y$
Encoders $E_X, E_Y$, decoders $D_{XX}, \ldots, D_{YY}$ parameterized by $\theta_1, \ldots, \theta_6$
randomly split into two subsets of $N/2$ samples, $\mathcal{D}_{\text{train}}, \mathcal{D}_{\text{est}}$
optimize $\theta_1, \ldots, \theta_6$ to minimize $\mathcal{L}_{AEC}$ on $\mathcal{D}_{\text{train}}$
encode $\mathcal{D}_{\text{est}}$ using $E_X, E_Y$ with parameters $\theta_1, \theta_2$, yielding $\{(Z^x_i, Z^y_i)\}_{i=1}^{N/2}$
return $\hat{I}_{KSG}(\{(Z^x_i, Z^y_i)\}_{i=1}^{N/2})$
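The split, encode, and estimate steps of Algorithm 1 can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the paper's library: `fit_encoders` is a hypothetical callable standing in for training the cross-predictive network, `ksg_mi_bits` is a brute-force $O(N^2)$ version of the first estimator from [10] (max-norm, $k$ nearest neighbors), and the default `k=3` is an illustrative choice.

```python
import numpy as np


def _chebyshev(a):
    """Pairwise max-norm distances between the rows of `a`."""
    return np.abs(a[:, None, :] - a[None, :, :]).max(axis=2)


def ksg_mi_bits(zx, zy, k=3):
    """Brute-force KSG estimate of I(Zx; Zy) in bits."""
    n = zx.shape[0]
    dx, dy = _chebyshev(zx), _chebyshev(zy)
    dz = np.maximum(dx, dy)                      # max-norm distance in the joint space
    np.fill_diagonal(dz, np.inf)                 # a point is not its own neighbor
    eps = np.sort(dz, axis=1)[:, k - 1]          # distance to the k-th joint neighbor
    # Count marginal neighbors strictly within eps, excluding the point itself.
    nx = np.maximum((dx < eps[:, None]).sum(axis=1) - 1, 0)
    ny = np.maximum((dy < eps[:, None]).sum(axis=1) - 1, 0)
    # Digamma at positive integers: psi(m) = -gamma + H_{m-1}, with H_0 = 0.
    harmonic = np.concatenate([[0.0], np.cumsum(1.0 / np.arange(1, n + 1))])
    psi_k = -np.euler_gamma + harmonic[k - 1]
    psi_n = -np.euler_gamma + harmonic[n - 1]
    psi_nx = -np.euler_gamma + harmonic[nx]      # psi(nx + 1)
    psi_ny = -np.euler_gamma + harmonic[ny]      # psi(ny + 1)
    nats = psi_k + psi_n - np.mean(psi_nx + psi_ny)
    return nats / np.log(2.0)


def lmi_style_estimate(x, y, fit_encoders, k=3, seed=0):
    """Train on one half of the samples, estimate MI on latents of the other half."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(x.shape[0])
    half = len(perm) // 2
    train, est = perm[:half], perm[half:]
    # Hypothetical training routine: returns two callables mapping data to latents.
    enc_x, enc_y = fit_encoders(x[train], y[train])
    return ksg_mi_bits(enc_x(x[est]), enc_y(y[est]), k=k)
```

With identity encoders this reduces to plain KSG on the held-out half, which is a convenient sanity check on low-dimensional data before substituting a trained network.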

3 Empirical evaluation

Next, we empirically study the effectiveness of LMI approximation with a 16-dimensional latent space (8 dimensions per variable), in comparison with three popular estimators: the nonparametric estimator from [10], referred to as KSG, and the variational bound estimators from [15, 30], referred to as MINE and InfoNCE, respectively (implementation details in Appendix A4).

3.1 Evaluating mutual information estimators on synthetic data

We first consider the problem of MI estimation between multivariate Gaussian distributions, because ground truth MI can be analytically computed, and dimensionality can be easily tuned. We consider the scalability of MI estimators with increasing dimensionality of two kinds: the ambient dimensionality of the data, denoted $d$, and the intrinsic dimensionality of the dependence structure, denoted $k$. We benchmark the performance of estimators in the regime of high ambient dimensionality and low intrinsic dimensionality. Specifically, we consider variables with $d = 10$ to $d = 5 \cdot 10^{3}$ ambient dimensions and $k = 1$ to $k = 9$ dimensional dependence structure.

To generate samples from two $d$-dimensional random variables $X, Y$ with $k$-dimensional dependence structure, we sample $d$ bivariate Gaussians and concatenate the first components to construct samples of $X$, and concatenate the second components to construct $Y$. By choosing the covariance of each of the bivariate Gaussians, $I(X;Y)$ and $k$ can be tuned. To enforce a $k$-dimensional dependence structure, we can choose covariance matrices such that $\text{Cor}(X_i, Y_i) = \rho\, \textbf{Heav}(i - k)$ where Heav is the Heaviside step function, and $\text{Cor}(X_i, Y_j) = 0 \ \forall i \neq j$. The exact sampling procedure we use for these experiments is given in Appendix A.3.1, Algorithm 4.
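A minimal sketch of this kind of construction is given below. It is not the paper's Algorithm 4: it assumes unit-variance components, places the $k$ dependent pairs in the first $k$ coordinates, and gives all of them the same correlation $\rho$. Because the $d$ pairs are mutually independent, the total MI is the sum of the per-pair Gaussian MIs, $-\frac{k}{2}\log_2(1-\rho^2)$ bits, which can be inverted to pick $\rho$ for a target MI. All function names are illustrative.

```python
import numpy as np


def true_mi_bits(rho, k):
    """Ground-truth MI in bits for k independent pairs, each with correlation rho."""
    return -0.5 * k * np.log2(1.0 - rho ** 2)


def rho_for_target_mi(mi_bits, k):
    """Per-pair correlation that spreads `mi_bits` evenly over k dependent pairs."""
    return np.sqrt(1.0 - 2.0 ** (-2.0 * mi_bits / k))


def sample_gaussian_pairs(n, d, k, rho, rng):
    """Draw n samples of d-dimensional X, Y whose dependence is confined to k pairs."""
    x = rng.standard_normal((n, d))
    noise = rng.standard_normal((n, d))
    corr = np.zeros(d)
    corr[:k] = rho  # assumption: the dependent pairs occupy the first k coordinates
    # Each Y_i = corr_i * X_i + sqrt(1 - corr_i^2) * noise_i has unit variance
    # and correlation corr_i with X_i; all cross-coordinate correlations are zero.
    y = corr * x + np.sqrt(1.0 - corr ** 2) * noise
    return x, y


rho = rho_for_target_mi(1.0, k=2)
x, y = sample_gaussian_pairs(n=2000, d=100, k=2, rho=rho, rng=np.random.default_rng(0))
print(x.shape, y.shape, true_mi_bits(rho, k=2))
```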

Results of benchmarking MI estimators using these synthetic datasets are given in Fig. 2. For estimates from $N = 2 \cdot 10^{3}$ samples, we find that, as expected, the performance of existing estimators degrades with $d$, with near complete failure for $d > 100$ (Fig. 2a-c, 2f-h). In contrast, applying LMI approximation results in stable estimates up to $d = 5 \cdot 10^{3}$ ambient dimensions (Fig. 2d, 2i). The faithfulness of LMI approximation instead degrades with increasing $k$. Nonetheless, LMI approximation is the most accurate of the tested estimators in 83% of settings by absolute accuracy and in 87% of settings by relative accuracy (Fig. 2e, 2j).

Figure 2: MI estimator performance scaling with increasing dimensionality. a) - d) Absolute accuracy measured by mean-squared error over 10 estimates per setting, with ground truth MI between 0 and 2 bits, and $2 \cdot 10^{3}$ samples per estimate. e) Estimator with highest absolute accuracy in each setting. Ties broken randomly. f) - i) Relative accuracy measured by Kendall $\tau$ rank correlation of estimates with ground truth. j) Estimator with highest relative accuracy in each setting. Ties broken randomly.

3.1.1 Empirically quantifying convergence rates of MI estimators on synthetic data

The principle enabling the scalability of LMI approximation is that the number of samples it requires to converge is limited by $k$ rather than $d$ when $k \ll d$. We empirically demonstrate this by quantifying the convergence rates of MI estimators on the synthetic Gaussian datasets described above. We generate datasets with sample numbers in $N \in [10^{2}, 10^{4}]$, and ambient dimensionalities in $d \in [1, 50]$, each with a single correlated dimension between variables ($k = 1$), and 1 bit MI. For each estimator and each ambient dimensionality $d$, we empirically determine the number of samples required to achieve $|I(X;Y) - \hat{I}(X;Y)| < \epsilon$ bits, with linear interpolation between tested sample numbers.
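One way to read the interpolation step is sketched below. It assumes the interpolation is linear in $N$ (rather than in $\log N$) and reports the first crossing below $\epsilon$; both are assumptions about details the text does not spell out, and the function name is illustrative.

```python
def samples_required(sample_sizes, abs_errors, eps):
    """Interpolated N at which the absolute error first drops below eps.

    sample_sizes: increasing tested sample numbers.
    abs_errors:   |I - I_hat| in bits measured at each tested sample number.
    Returns None if the error never drops below eps within the tested range.
    """
    if abs_errors[0] < eps:
        return float(sample_sizes[0])
    for i in range(1, len(sample_sizes)):
        e_prev, e_curr = abs_errors[i - 1], abs_errors[i]
        if e_prev >= eps > e_curr:
            # Linear interpolation between the two bracketing sample numbers.
            frac = (e_prev - eps) / (e_prev - e_curr)
            return sample_sizes[i - 1] + frac * (sample_sizes[i] - sample_sizes[i - 1])
    return None


print(samples_required([100, 1000, 10000], [0.5, 0.2, 0.05], eps=0.1))
```

A `None` result corresponds to the case where more samples than the largest tested $N$ would be needed.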

As expected, methods that do not explicitly learn low-dimensional representations (InfoNCE, MINE, KSG) require increasing numbers of samples as $d$ grows to estimate MI with error below $\epsilon = 0.1$ (Fig. 3a). KSG fails to estimate MI for $d \geq 13$ for any $N$, while MINE and InfoNCE scale slightly better, failing for $d \geq 25$ and $d \geq 37$ respectively. In contrast, the sample requirements of LMI remain qualitatively stable: no more than $4 \cdot 10^{3}$ samples are necessary for an accurate estimate.

While the convergence behavior of LMI is mostly unaffected by varying $d$, it is sensitive to varying $k$. When the same experiment is performed with increasing numbers of correlated dimensions at the limit where $k = d$, the convergence behavior of LMI is no longer favorable compared to other estimators (Fig. 3c). The performance of all estimators dramatically decreases with $k$, such that a larger error tolerance must be chosen for informative convergence estimates. The dependence of variational bound estimator convergence on $k$ is, to our knowledge, not explained by existing theory [16, 17]. In the intermediate case of $k = \lfloor 0.1 \cdot d \rfloor$ (Fig. 3b), we find that LMI convergence is fast with low $k$, but becomes slow as $k$ grows, nonetheless remaining favorable compared to other estimators.

Figure 3: Number of samples required to achieve $|I(X;Y) - \hat{I}(X;Y)| < \epsilon$. a) Data with low-rank dependence structure, with $\epsilon = 0.1$. b) Moderate-rank dependence structure, with $\epsilon = 0.2$. c) Full-rank dependence structure, with $\epsilon = 0.4$. “+” marker indicates that $N > 10^{4}$ samples are required for accurate estimates for all larger $d$.

3.2 Evaluating mutual information estimators on resampled real-world data

While the empirical results on multivariate Gaussians are reassuring, they are not representative of performance on real data, where low intrinsic dimensionality is not known a priori, and distributions can be non-Gaussian. To better understand the behavior of LMI in more realistic settings, we introduce a technique for creating benchmark datasets by resampling real-world data. Briefly, we use correspondences between discrete labels and complex data (e.g. digit labels and digit images in MNIST) to transform simple discrete distributions into realistic high-dimensional distributions.

Specifically, we draw samples from a bivariate Bernoulli vector, $\mathbf{L} = [L_x, L_y] \in \{0,1\}^{2}$, with prescribed pairwise correlation $\text{Cor}(L_x, L_y) = \rho$, where each value corresponds to a discrete label of a set of samples in a high-dimensional dataset (e.g. $0$ and $1$ correspond to images of 0s and 1s in MNIST). For each sample of $\mathbf{L}$, we replace each component with a high-dimensional sample drawn at random, without replacement, from those matching the label. For the example of MNIST, this transforms samples from $\mathbf{L}$ into pairs of images of 0s and 1s, represented as samples of random vectors $X, Y \in \mathbb{R}^{784}$.
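The resampling step might look like the sketch below. It assumes balanced marginals, $P(L_x = 1) = P(L_y = 1) = 1/2$, which the text does not state, and it interprets "without replacement" as never reusing a source sample across either variable. The names `bernoulli_joint` and `resample_pairs` are illustrative.

```python
import numpy as np


def bernoulli_joint(rho):
    """2x2 joint pmf of (L_x, L_y) with correlation rho, assuming balanced marginals."""
    same, diff = (1.0 + rho) / 4.0, (1.0 - rho) / 4.0
    return np.array([[same, diff], [diff, same]])


def resample_pairs(data, labels, rho, n_pairs, rng):
    """Turn correlated binary labels into pairs of high-dimensional samples.

    data:   array of shape (n_source, dim); labels: array of 0/1 of length n_source.
    Returns (x, y, l_x, l_y), where row i of x (resp. y) has label l_x[i] (resp. l_y[i]).
    """
    flat = rng.choice(4, size=n_pairs, p=bernoulli_joint(rho).ravel())
    l_x, l_y = np.divmod(flat, 2)                # flat index = 2 * l_x + l_y
    wanted = np.concatenate([l_x, l_y])          # labels needed for x rows, then y rows
    chosen = np.empty(2 * n_pairs, dtype=int)
    for label in (0, 1):
        pool = rng.permutation(np.flatnonzero(labels == label))
        mask = wanted == label
        if mask.sum() > pool.size:
            raise ValueError("not enough source samples with label %d" % label)
        chosen[mask] = pool[: mask.sum()]        # each source row is used at most once
    return data[chosen[:n_pairs]], data[chosen[n_pairs:]], l_x, l_y
```

Because every output row is drawn from the pool matching its label, the only dependence between the two high-dimensional variables is the one inherited from the labels.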

Under the assumption that discrete labels can be uniquely identified by high-dimensional vectors, high-dimensional MI is identical to the discrete label MI. That is, assuming $H(L_{x}|X)=H(L_{y}|Y)=0$, then $I(X;Y)=I(L_{x};L_{y})$ (shown in Appendix A.3.2, Theorem 4). And using our knowledge of $\rho$, $I(L_{x};L_{y})$ can be analytically computed.
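As a worked example of the analytic computation, the sketch below evaluates $I(L_{x};L_{y})$ from $\rho$. It assumes that both labels have Bernoulli(0.5) marginals, which the text does not state; under that assumption the joint probabilities are $(1\pm\rho)/4$ and the result equals $1-H_{b}((1+\rho)/2)$ bits, where $H_{b}$ is the binary entropy. The function name is hypothetical.

```python
import math

def bernoulli_pair_mi_bits(rho):
    """I(L_x; L_y) in bits for two Bernoulli(0.5) labels with correlation rho."""
    p_same = (1.0 + rho) / 4.0  # P(0,0) = P(1,1)
    p_diff = (1.0 - rho) / 4.0  # P(0,1) = P(1,0)
    mi = 0.0
    for p in (p_same, p_same, p_diff, p_diff):
        if p > 0.0:  # terms with zero probability contribute nothing
            mi += p * math.log2(p / 0.25)  # product of marginals is 0.5 * 0.5
    return mi

for rho in (0.0, 0.5, 1.0):
    print(rho, bernoulli_pair_mi_bits(rho))
```

Under this assumption, sweeping $\rho$ from 0 to 1 covers true MI values from 0 to 1 bit.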

We resample two different source datasets: (1) “binary” subset of MNIST, containing only images of 0s and 1s, with $5000$ samples and $784$ dimensions and (2) embeddings of a subset of protein sequences from E. coli and A. thaliana proteins, with $4402$ samples and $1024$ dimensions. For both source datasets, we validate the $I(L_{x};L_{y})\approx I(X;Y)$ approximation in Appendix A.3.3.

For each source dataset, we generate 200 benchmark datasets with true MI ranging from 0 to 1 bits. We estimate MI on each dataset with each estimator (Fig. 4a, 4b), and quantify absolute accuracy (MSE), relative accuracy (rank correlation with ground truth), and runtime for each estimator (Fig. 4c). For both types of source data, we find that variational bound estimators have high variance, in line with previous observations [16]. On protein embedding datasets, variational estimators nearly always fail to estimate nonzero values – resulting in a rank correlation below 0 for InfoNCE. The KSG estimator, while achieving high relative accuracy, systematically underestimates MI, resulting in low absolute accuracy. Furthermore, the amount by which it underestimates true MI is different between the two datasets – indicating inequitability. In contrast, LMI approximation yields estimates consistently close to the ground truth, with high relative and absolute accuracy.
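The distinction between absolute and relative accuracy can be made concrete with a short sketch. It assumes that the rank correlation is Spearman's (the text does not specify which rank correlation is used), and the toy estimates are made up for illustration rather than taken from Fig. 4.

```python
import numpy as np

def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    ranks = np.empty(len(values))
    ranks[order] = np.arange(1, len(values) + 1)
    for v in np.unique(values):
        tied = values == v
        ranks[tied] = ranks[tied].mean()
    return ranks

def mse(true_mi, est_mi):
    """Absolute accuracy: mean squared error against the ground truth."""
    diff = np.asarray(est_mi, dtype=float) - np.asarray(true_mi, dtype=float)
    return float(np.mean(diff ** 2))

def spearman(true_mi, est_mi):
    """Relative accuracy: Pearson correlation of the ranks."""
    return float(np.corrcoef(average_ranks(true_mi), average_ranks(est_mi))[0, 1])

true_mi = np.linspace(0.0, 1.0, 200)  # 200 benchmark datasets, 0 to 1 bits
underestimate = 0.5 * true_mi         # toy estimator that halves every value
print(mse(true_mi, underestimate), spearman(true_mi, underestimate))
```

The toy estimator preserves the ordering of the datasets, so its rank correlation is 1, yet its MSE is nonzero. This is the pattern described above for an estimator with high relative accuracy and low absolute accuracy.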

Figure 4: Performance of MI estimators on resampled real datasets. a) Estimates on resampled pairs of MNIST digits, with $5\cdot 10^{3}$ samples and $784$ dimensions. b) Estimates on resampled pairs of ProtTrans5 sequence embeddings, with $4.4\cdot 10^{3}$ samples and $1024$ dimensions. c) Statistics of estimator accuracy and runtime (in seconds), for each dataset type.

4 Applications

4.1 Quantifying interaction information in protein language model embeddings

Figure 5: Quantifying dependence between participants of protein interactions. a) - b) MI estimates between interaction partners, compared to randomly permuted data. c) - d) ROC curves of density ratio classifier distinguishing annotated interacting pairs from unannotated “negative” samples, for all pairs of $170$ held-out proteins. Averages over 20 random hold-out splits.

Pretrained protein language models (pLMs) have recently seen widespread use, largely due to their convenient representations of protein sequences (vectors in $\mathbb{R}^{N}$, typically with $N\approx 10^{3}$) which can be used for transfer learning on downstream tasks [31, 32, 33]. While it is known that pLM sequence embeddings contain significant information about protein structure [34], it is not clear how well existing pLMs encode functional information. Recent work has shown that pLMs fail to capture some important functional properties, such as thermostability [35]. The extent to which pLM embeddings contain information about protein-protein interactions, which are essential to protein function, remains unclear. Here, we use an information-theoretic approach to quantify interaction information contained in 1024-dimensional sequence embeddings from the ProtTrans5 model [32].

We study two types of protein-protein interactions: kinase-target and ligand-receptor interactions. For both, the OmniPath database [36] contains lists of thousands of annotated pairs of interacting proteins. We consider each annotated pair to be a sample from a joint “interaction” distribution over pairs of sequence embeddings. For example, for kinase-target interactions, we consider kinase and target sequence embeddings as random variables $K\in\mathbb{R}^{1024},T\in\mathbb{R}^{1024}$, with joint distribution $P_{KT}(k,t)$. Then, using joint samples $\{(k_{1},t_{1}),\ldots,(k_{N},t_{N})\}$ we estimate mutual information between interaction partners, $I(K;T)$. We analogously estimate $I(L;R)$ for ligand-receptor interactions.

If pLM embeddings capture interaction information, MI between interaction partners should be significantly above 0 bits. Applying LMI approximation, we estimate $I(L;R)\approx 2.8\text{ bits}$ and $I(K;T)\approx 0.8\text{ bits}$. To test the significance of these values, we estimated MI from shuffled data, and found $I_{\text{shuff}}(K;T)\approx I_{\text{shuff}}(L;R)\approx 0$ (both mean and standard deviation $<0.05$), across 20 random shuffles. These results indicate that pLM embeddings contain information about both types of interactions, and contain more information about ligand-receptor interactions than kinase-target interactions. In contrast, existing estimators yield far lower estimates, with MINE estimates indicating independence for both types of interactions (Fig. 5a, 5b). To validate the LMI estimates of dependence, we next operationally verify the presence of interaction information.
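The shuffle control can be sketched generically as follows. The estimator passed in here is a stand-in (the closed-form MI of one-dimensional jointly Gaussian data), not LMI, and the function names and toy data are hypothetical. Only the permutation logic is the point of the example.

```python
import numpy as np

def gaussian_mi_bits(x, y):
    """Stand-in estimator: MI of 1-D jointly Gaussian data from its correlation."""
    r = np.corrcoef(x, y)[0, 1]
    return float(-0.5 * np.log2(1.0 - r ** 2))

def shuffled_mi_null(x, y, estimator, n_shuffles=20, seed=0):
    """Mean and standard deviation of MI estimates after breaking the pairing."""
    rng = np.random.default_rng(seed)
    null = []
    for _ in range(n_shuffles):
        perm = rng.permutation(len(y))
        # Permuting y keeps both marginals but destroys the dependence.
        null.append(estimator(x, y[perm]))
    null = np.asarray(null)
    return float(null.mean()), float(null.std())

rng = np.random.default_rng(0)
x = rng.normal(size=2000)
y = x + 0.5 * rng.normal(size=2000)  # toy dependent pair
observed = gaussian_mi_bits(x, y)
null_mean, null_std = shuffled_mi_null(x, y, gaussian_mi_bits)
print(observed, null_mean, null_std)
```

An observed estimate far above the shuffled mean, relative to the shuffled standard deviation, is the kind of evidence used above to argue that the dependence is not an estimator artifact.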

If protein-protein interactions can be predicted for a set of held-out proteins based on sequence embeddings, then sequence embeddings must contain interaction information. To see if this is the case, we extend LMI to predict protein interactions from sequence embeddings. For ligand-receptor prediction, our goal is to predict whether a held-out pair of sequence embeddings $(l,r)$ is an annotated ligand-receptor pair. One way to do this is estimating the log density ratio $\log\frac{P_{LR}(l,r)}{P_{L}(l)\cdot P_{R}(r)}$, and setting a threshold above which sequence pairs are predicted to be annotated interactions.

We make a simple modification to the KSG estimator to yield the desired density ratio estimates (given in Algorithm 2), and use these estimates (with latent approximation) to predict interaction annotations. For 20 different random splits of 170 held-out proteins, we use density ratio estimates to classify all $2.89\cdot 10^{4}$ pairs of held-out proteins as interacting or non-interacting. The receiver operating characteristic (ROC) curves for predictions of both interaction types are shown in Fig. 5c, 5d, with mean AUC-ROC scores of 0.87 and 0.74 for kinase-target and ligand-receptor interactions respectively. These results demonstrate that protein interactions can be predicted better than random chance using ProtTrans5 embeddings, suggesting that pLM embeddings capture information about kinase-target and ligand-receptor interactions. And in line with the LMI estimates, ligand-receptor interactions are better predicted than kinase-target interactions.
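For readers unfamiliar with how a score such as a log density ratio is turned into an AUC-ROC value, the sketch below computes it directly from its definition: the probability that a randomly chosen positive pair receives a higher score than a randomly chosen negative pair. The scores and labels are toy values, the function name is hypothetical, and this is not the evaluation code behind Fig. 5.

```python
import numpy as np

def roc_auc(scores, is_positive):
    """AUC-ROC as P(score_pos > score_neg), counting ties as one half."""
    scores = np.asarray(scores, dtype=float)
    is_positive = np.asarray(is_positive, dtype=bool)
    pos = scores[is_positive]
    neg = scores[~is_positive]
    # All positive-negative score differences; fine for a few 10^4 pairs.
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)

n_heldout = 170
n_pairs = n_heldout * n_heldout  # 28900 = 2.89e4 candidate pairs per split

toy_scores = [0.1, 0.4, 0.35, 0.8]        # e.g. estimated log density ratios
toy_labels = [False, False, True, True]   # annotated interaction or not
print(n_pairs, roc_auc(toy_scores, toy_labels))
```

Because the AUC depends only on the ordering of the scores, it does not require choosing a particular threshold on the log density ratio.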

Algorithm 2 k-nearest neighbor log density ratio estimator
Input: joint samples $\{(x_{i},y_{i})\}_{i=1}^{N}$
Input: query point $(q_{x},q_{y})$
let $(r_{x},r_{y})\leftarrow$ $k$-th nearest neighbor sample of $(q_{x},q_{y})$ in joint space (default $k=3$)
\\ compute Chebyshev distance
let $d\leftarrow||(q_{x},q_{y})-(r_{x},r_{y})||_{\infty}$
let $n_{x}\leftarrow 0,n_{y}\leftarrow 0$
\\ count neighbors within $d$ in marginal spaces
for each $(x_{i},y_{i})$ do
     if $||q_{x}-x_{i}||_{\infty}<d$ then
         $n_{x}$ += 1
     end if
     if $||q_{y}-y_{i}||_{\infty}<d$ then
         $n_{y}$ += 1
     end if
end for
\\ return estimate of $\log\frac{p(q_{x},q_{y})}{p(q_{x})p(q_{y})}$
return $\psi(k)+\psi(N)-\psi(n_{x})-\psi(n_{y})$, where $\psi$ is the digamma function
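A vectorized Python sketch of Algorithm 2 is given below for illustration. It is not taken from the lmi library. The function names are hypothetical, the digamma function is evaluated only at positive integers through its harmonic-sum identity, and the clamp of $n_{x}$ and $n_{y}$ to at least 1 is an added guard that Algorithm 2 does not specify. The query point is assumed to be held out from the samples, and the result is in nats.

```python
import numpy as np

EULER_GAMMA = 0.5772156649015329

def digamma_int(n):
    """Digamma at a positive integer: psi(n) = -gamma + sum_{j=1}^{n-1} 1/j."""
    if n < 1:
        raise ValueError("digamma_int needs a positive integer")
    return -EULER_GAMMA + sum(1.0 / j for j in range(1, n))

def knn_log_density_ratio(x, y, q_x, q_y, k=3):
    """Estimate log p(q_x, q_y) / (p(q_x) p(q_y)) in nats, following Algorithm 2."""
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    y = np.asarray(y, dtype=float).reshape(len(y), -1)
    q_x = np.asarray(q_x, dtype=float).reshape(-1)
    q_y = np.asarray(q_y, dtype=float).reshape(-1)
    # Chebyshev (max-norm) distance from the query in each marginal space.
    d_x = np.abs(x - q_x).max(axis=1)
    d_y = np.abs(y - q_y).max(axis=1)
    # The joint-space Chebyshev distance is the larger of the two marginal ones.
    d_joint = np.maximum(d_x, d_y)
    d = np.partition(d_joint, k - 1)[k - 1]  # distance to the k-th nearest neighbor
    # Count samples strictly within d in each marginal space.
    n_x = int(np.sum(d_x < d))
    n_y = int(np.sum(d_y < d))
    # Guard (an assumption, not part of Algorithm 2): psi is undefined at 0.
    n_x, n_y = max(n_x, 1), max(n_y, 1)
    return digamma_int(k) + digamma_int(len(x)) - digamma_int(n_x) - digamma_int(n_y)

rng = np.random.default_rng(0)
x = rng.normal(size=(2000, 1))
y = x + 0.3 * rng.normal(size=(2000, 1))  # toy dependent pair
likely = knn_log_density_ratio(x, y, [1.5], [1.5])     # consistent with y ~ x
unlikely = knn_log_density_ratio(x, y, [1.5], [-1.5])  # inconsistent with y ~ x
print(likely, unlikely)
```

In this toy setting the query that agrees with the dependence receives a positive score and the query that contradicts it receives a negative one, which is the property exploited when thresholding the estimate to predict interactions.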

4.2 Identifying cell fate information in hematopoietic stem cells

Single cell RNA sequencing (scRNA-seq) measures the expression of $g\approx 10^{4}$ genes in single cells, which can be thought of as samples of a gene expression state variable $X\in\mathbb{R}^{g}$. These samples can be used to infer a probability distribution over gene expression states, $P_{X}$. To study the dynamics of gene expression, one approach is to make measurements at multiple timepoints $t_{1},\ldots,t_{N}$, which can be thought of as samples of random variables $X_{i}\in\mathbb{R}^{g}$. Lineage tracing is a technique where clonally related cells can be labelled with barcodes, allowing cells sampled at one timepoint to be matched with their “twins” from another. When combined with scRNA-seq, lineage tracing can be thought to provide samples from the joint distribution $P_{X_{1},\ldots,X_{N}}$ [37].

One fundamental question about cellular dynamics is whether the time evolution of gene expression state depends entirely on the current gene expression state, that is, whether the “fate” of a cell can be formally modeled as a Markov chain $X_{i}\to X_{i+1}$. In some cases, cell behavior may be a function of hidden variables, resulting in non-Markovian dynamics. Using the data processing inequality (DPI) [7], we know that if gene expression dynamics are Markovian, then $I(X_{i};X_{i+1})\geq I(X_{i};X_{i+2})$, and if the DPI does not hold, then gene expression dynamics are non-Markovian. This can, in principle, be tested using samples from the joint distribution $P_{X_{1},\ldots,X_{N}}$. Due to the difficulty of high-dimensional MI estimation, previous work has used heuristic alternatives to indirectly test the Markovian assumption for lineage-traced scRNA-seq (LT-seq) data [38]. Here, we will use LMI to explicitly estimate high-dimensional MI from LT-seq data, and test the Markovian assumption for gene expression states.

We study a previously published LT-seq dataset of in vitro differentiating mouse hematopoietic stem cells [38]. The dataset includes sister cells which are separated at day 2 of the experiment and allowed to differentiate in separate wells until day 6, with cell states sampled on both days 2 and 6. Under the Markovian assumption, this can be modeled as $X_{2}\to X_{6}$, $X_{2}\to X_{6^{\prime}}$, and by the DPI we should have $I(X_{2};X_{6})\geq I(X_{6};X_{6^{\prime}})$. Using LMI, we estimate that $I(X_{2};X_{6})\approx 0.31\pm 0.02\text{ bits}$ and $I(X_{6};X_{6^{\prime}})\approx 0.98\pm 0.01\text{ bits}$ (mean $\pm$ SEM), over 20 random pairings of clonally related cells. This indicates that gene expression states are non-Markovian, in line with prior findings [38, 39].
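For completeness, the inequality tested here follows from the chain rule for mutual information. The derivation below is the standard argument, written out under the Markovian assumption that the two sister cells are conditionally independent given their shared day-2 state:

```latex
% Markov assumption: X_6 <- X_2 -> X_{6'}, i.e. X_6 and X_{6'} are
% conditionally independent given X_2, so I(X_6; X_{6'} | X_2) = 0.
\begin{align*}
I(X_6; X_2, X_{6'}) &= I(X_6; X_2) + I(X_6; X_{6'} \mid X_2) = I(X_6; X_2), \\
I(X_6; X_2, X_{6'}) &= I(X_6; X_{6'}) + I(X_6; X_2 \mid X_{6'}) \geq I(X_6; X_{6'}), \\
\Rightarrow \quad I(X_2; X_6) &\geq I(X_6; X_{6'}).
\end{align*}
```

With the estimates reported above, $0.31$ bits versus $0.98$ bits, the inequality fails by about $0.67$ bits, which is large compared with the reported SEMs of $0.02$ and $0.01$ bits.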

Cell fate information manifests in transcriptomes sometime between days 2 and 6. By decomposing LMI estimates into pointwise contributions (Appendix A.4.3, Algorithm 5), we can determine precisely when this hidden information emerges. Generally, we see that pMI increases along differentiation trajectories (Fig. 6b). Along the neutrophil trajectory, we quantitatively compare cell fate information with neutrophil pseudotime computed by graph smoothing [38], and find that pMI begins to rapidly increase around a pseudotime value of $35\cdot 10^{3}$, which is roughly aligned with the transition from granulocyte-myeloid progenitor to promyelocyte, as defined in [38].

Figure 6: Quantifying cell fate information in single cell transcriptomes. a) 2D SPRING embedding [40] of lineage-traced single cell RNA-sequencing data from [38]. b) Pointwise decomposition of MI between sister cells across timepoints, as estimated using LMI. Hue applied to early timepoint cells, late timepoint and unbarcoded cells in grey. Star indicates neutrophil pseudotime value $35\cdot 10^{3}$. c) Smoothed pMI (rolling average over 100 cells) across neutrophil differentiation trajectory. Vertical line denotes neutrophil pseudotime value $35\cdot 10^{3}$.

5 Discussion

In this paper, we tested the hypothesis that low-dimensional structure can enable scalable estimation of MI from high-dimensional data. We introduced LMI approximation, which applies a nonparametric MI estimator to low-dimensional representations learned by neural networks. We quantified the effectiveness of LMI approximation, using multiple approaches spanning over 3000 benchmark datasets. Our results suggest that, unlike existing techniques, LMI approximation is generally effective for high-dimensional data with low-dimensional structure, even if the number of available samples remains relatively low – a regime where many real datasets reside [5].

We used LMI to study two open problems in biology. We showed one example where LMI enables the use of information-theoretic ideas to study the dynamics of gene expression. In the original study [38], the authors had indeed wished to estimate MI but were unable to do so and resorted to heuristic approaches. LMI may similarly help identify dependence in cellular dynamics in other systems [39, 41, 42]. In a different subfield of biology, we showed an example of using LMI to quantify functional information learned by pLMs. Our results suggest that nontrivial protein-protein interaction information is learned by ProtTrans5, motivating the development of interaction prediction tools based on pLMs. As the number of large pLMs grows [31, 32, 33, 43], information-theoretic approaches using LMI could help benchmark and compare models.

More broadly, LMI may be able to help bridge the gap between theory and experimental measurements in the study of complex systems [2, 44, 45, 46]. With the goal of making high-dimensional MI estimation accessible to practitioners, we provide an open-source implementation of LMI approximation along with the code necessary to reproduce the results of this paper (see Code reproducibility).

Limitations

The most prominent limitation of LMI follows directly from its motivating assumption: it will fail to capture dependence structure with dimensionality larger than its learned representation. As a result, it is very easy to design synthetic datasets on which LMI will fail entirely (see Fig. 3c). However, it is likely that many real datasets (beyond those explored in this paper) will be amenable to LMI approximation, as there is strong evidence that complex systems generally have low intrinsic dimensionality [5]. Our implementation of LMI approximation also inherits some limitations of the KSG estimator, notably that it fails for strongly dependent (near deterministically related) variables [17, 47]. To overcome this, it may be possible to apply previously developed corrections [47]. Finally, despite reassuring empirical results, few theoretical properties of LMI have been derived. This is an important line of future work, which could help precisely identify settings where LMI approximation is effective.

Broader impacts

MI estimators have been used to quantify moral and legal fairness [24, 48, 49]. LMI approximation is not universally faithful, and more generally no MI estimator can be universally accurate [17]. MI estimates must be interpreted with great care when applied to human lives. The experiments (and pilot iterations) in this paper were performed on a single NVIDIA RTX 3090, and resulted in an estimated 45.36 kg CO2eq of emissions, as calculated using [50].

Code reproducibility

The lmi Python library and the code necessary to reproduce the results from this paper are available at https://anonymous.4open.science/r/latent-mutual-information-BCBF. Documentation for the lmi library can be found at https://latentmi.readthedocs.io/en/latest/.

Author contributions

G.G. conceived of, designed, and performed the study, analyzed the data, and wrote the paper. X.L. contributed to study design. A.M.K. contributed to the design of the study and the analysis of the data, wrote the paper, and supervised the study. P.Y. supervised the study. All authors reviewed, edited, and approved the paper.

Acknowledgments and Disclosure of Funding

We thank Caroline Holmes, Sean McGeary, and Pippa Richter for thoughtful discussions. This work is supported by funding from NIH Pioneer Award DP1GM133052, R01HG012926 to P.Y., and the Molecular Robotics Initiative at the Wyss Institute.

References

  • [1] Julien O Dubuis, Gasper Tkacik, Eric F Wieschaus, Thomas Gregor, and William Bialek. Positional information, in bits. Proc. Natl. Acad. Sci. U. S. A., 110(41):16301–16308, October 2013.
  • [2] Ravid Shwartz-Ziv and Naftali Tishby. Opening the black box of deep neural networks via information. arXiv [cs.LG], March 2017.
  • [3] Matthijs Meijers, Sosuke Ito, and Pieter Rein Ten Wolde. Behavior of information flow near criticality. Phys Rev E, 103(1):L010102, January 2021.
  • [4] David N Reshef, Yakir A Reshef, Hilary K Finucane, Sharon R Grossman, Gilean McVean, Peter J Turnbaugh, Eric S Lander, Michael Mitzenmacher, and Pardis C Sabeti. Detecting novel associations in large data sets. Science, 334(6062):1518–1524, December 2011.
  • [5] Vincent Thibeault, Antoine Allard, and Patrick Desrosiers. The low-rank hypothesis of complex systems. Nat. Phys., pages 1–9, January 2024.
  • [6] Justin B Kinney and Gurinder S Atwal. Equitability, mutual information, and the maximal information coefficient. Proc. Natl. Acad. Sci. U. S. A., 111(9):3354–3359, March 2014.
  • [7] Thomas M Cover and Joy A Thomas. Elements of Information Theory. John Wiley & Sons, Incorporated, 2006.
  • [8] Yao-Hung Hubert Tsai, Han Zhao, Makoto Yamada, Louis-Philippe Morency, and Ruslan Salakhutdinov. Neural methods for point-wise dependency estimation. arXiv [cs.LG], June 2020.
  • [9] Xianghao Kong, Ollie Liu, Han Li, Dani Yogatama, and Greg Ver Steeg. Interpretable diffusion via information decomposition. arXiv [cs.LG], October 2023.
  • [10] Alexander Kraskov, Harald Stögbauer, and Peter Grassberger. Estimating mutual information. Phys. Rev. E Stat. Nonlin. Soft Matter Phys., 69(6 Pt 2):066138, June 2004.
  • [11] Caroline M Holmes and Ilya Nemenman. Estimation of mutual information for real-valued data with error bars and controlled bias. Phys Rev E, 100(2-1):022404, August 2019.
  • [12] Paweł Czyż, Frederic Grabowski, Julia E Vogt, Niko Beerenwinkel, and Alexander Marx. Beyond normal: On the evaluation of mutual information estimators. arXiv [stat.ML], June 2023.
  • [13] Thalia E Chan, Michael P H Stumpf, and Ann C Babtie. Gene regulatory network inference from single-cell data using multivariate information measures. Cell Syst, 5(3):251–267.e3, September 2017.
  • [14] Ziv Goldfeld and Kristjan Greenewald. Sliced mutual information: A scalable measure of statistical dependence. arXiv [cs.IT], October 2021.
  • [15] Mohamed Ishmael Belghazi, Aristide Baratin, Sai Rajeswar, Sherjil Ozair, Yoshua Bengio, Aaron Courville, and R Devon Hjelm. MINE: Mutual information neural estimation. arXiv [cs.LG], January 2018.
  • [16] Ben Poole, Sherjil Ozair, Aaron van den Oord, Alexander A Alemi, and George Tucker. On variational bounds of mutual information. arXiv [cs.LG], May 2019.
  • [17] David McAllester and Karl Stratos. Formal limitations on the measurement of mutual information. In International Conference on Artificial Intelligence and Statistics, pages 875–884. PMLR, June 2020.
  • [18] Jiaming Song and Stefano Ermon. Understanding the limitations of variational mutual information estimators. arXiv [cs.LG], October 2019.
  • [19] Allon M Klein, Linas Mazutis, Ilke Akartuna, Naren Tallapragada, Adrian Veres, Victor Li, Leonid Peshkin, David A Weitz, and Marc W Kirschner. Droplet barcoding for single-cell transcriptomics applied to embryonic stem cells. Cell, 161(5):1187–1201, May 2015.
  • [20] Surya Ganguli and Haim Sompolinsky. Compressed sensing, sparsity, and dimensionality in neuronal information processing and data analysis. Annu. Rev. Neurosci., 35:485–508, April 2012.
  • [21] Ziv Goldfeld, Ewout van den Berg, Kristjan Greenewald, Igor Melnyk, Nam Nguyen, Brian Kingsbury, and Yury Polyanskiy. Estimating information flow in deep neural networks. arXiv [cs.LG], October 2018.
  • [22] Ziv Goldfeld, Kristjan Greenewald, Theshani Nuradha, and Galen Reeves. k-sliced mutual information: A quantitative study of scalability with dimension. arXiv [cs.IT], June 2022.
  • [23] Dor Tsur, Ziv Goldfeld, and Kristjan Greenewald. Max-sliced mutual information. arXiv [cs.LG], September 2023.
  • [24] Yanzhi Chen, Wei-Der Sun, Yingzhen Li, and Adrian Weller. Scalable infomin learning. Adv. Neural Inf. Process. Syst., abs/2302.10701, February 2023.
  • [25] G E Hinton and R R Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, July 2006.
  • [26] Etienne Becht, Leland McInnes, John Healy, Charles-Antoine Dutertre, Immanuel W H Kwok, Lai Guan Ng, Florent Ginhoux, and Evan W Newell. Dimensionality reduction for visualizing single-cell data using UMAP. Nat. Biotechnol., December 2018.
  • [27] Kevin R Moon, David van Dijk, Zheng Wang, Scott Gigante, Daniel B Burkhardt, William S Chen, Kristina Yim, Antonia van den Elzen, Matthew J Hirn, Ronald R Coifman, Natalia B Ivanova, Guy Wolf, and Smita Krishnaswamy. Visualizing structure and transitions in high-dimensional biological data. Nat. Biotechnol., 37(12):1482–1492, December 2019.
  • [28] Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th international conference on Machine learning, ICML ’08, pages 1096–1103, New York, NY, USA, July 2008. Association for Computing Machinery.
  • [29] R Devon Hjelm, Alex Fedorov, Samuel Lavoie-Marchildon, Karan Grewal, Phil Bachman, Adam Trischler, and Yoshua Bengio. Learning deep representations by mutual information estimation and maximization. arXiv [stat.ML], August 2018.
  • [30] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv [cs.LG], July 2018.
  • [31] Alexander Rives, Joshua Meier, Tom Sercu, Siddharth Goyal, Zeming Lin, Jason Liu, Demi Guo, Myle Ott, C Lawrence Zitnick, Jerry Ma, and Rob Fergus. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proc. Natl. Acad. Sci. U. S. A., 118(15), April 2021.
  • [32] Ahmed Elnaggar, Michael Heinzinger, Christian Dallago, Ghalia Rihawi, Yu Wang, Llion Jones, Tom Gibbs, Tamas Feher, Christoph Angerer, Martin Steinegger, Debsindhu Bhowmik, and Burkhard Rost. ProtTrans: Towards cracking the language of life’s code through self-supervised deep learning and high performance computing. arXiv [cs.LG], July 2020.
  • [33] Ethan C Alley, Grigory Khimulya, Surojit Biswas, Mohammed AlQuraishi, and George M Church. Unified rational protein engineering with sequence-based deep representation learning. Nat. Methods, 16(12):1315–1322, December 2019.
  • [34] Zhidian Zhang, Hannah K Wayment-Steele, Garyk Brixi, Haobo Wang, Matteo Dal Peraro, Dorothee Kern, and Sergey Ovchinnikov. Protein language models learn evolutionary statistics of interacting sequence motifs. bioRxiv, page 2024.01.30.577970, January 2024.
  • [35] Francesca-Zhoufan Li, Ava P Amini, Yisong Yue, Kevin K Yang, and Alex X Lu. Feature reuse and scaling: Understanding transfer learning with protein language models. bioRxiv, page 2024.02.05.578959, February 2024.
  • [36] Dénes Türei, Alberto Valdeolivas, Lejla Gul, Nicolàs Palacio-Escat, Michal Klein, Olga Ivanova, Márton Ölbei, Attila Gábor, Fabian Theis, Dezső Módos, Tamás Korcsmáros, and Julio Saez-Rodriguez. Integrated intra- and intercellular signaling knowledge for multicellular omics analysis. Mol. Syst. Biol., 17(3):e9923, March 2021.
  • [37] Shou-Wen Wang, Michael J Herriges, Kilian Hurley, Darrell N Kotton, and Allon M Klein. CoSpar identifies early cell fate biases from single-cell transcriptomic and lineage information. Nat. Biotechnol., February 2022.
  • [38] Caleb Weinreb, Alejo Rodriguez-Fraticelli, Fernando D Camargo, and Allon M Klein. Lineage tracing on transcriptional landscapes links state to fate during differentiation. Science, 367(6479), February 2020.
  • [39] Kunal Jindal, Mohd Tayyab Adil, Naoto Yamaguchi, Xue Yang, Helen C Wang, Kenji Kamimoto, Guillermo C Rivera-Gonzalez, and Samantha A Morris. Single-cell lineage capture across genomic modalities with CellTag-multi reveals fate-specific gene regulatory changes. Nat. Biotechnol., September 2023.
  • [40] Caleb Weinreb, Samuel Wolock, and Allon M Klein. SPRING: a kinetic interface for visualizing high dimensional single-cell expression data. Bioinformatics, 34(7):1246–1248, April 2018.
  • [41] Duncan M Chadly, Kirsten L Frieda, Chen Gui, Leslie Klock, Martin Tran, Margaret Y Sui, Yodai Takei, Remco Bouckaert, Carlos Lois, Long Cai, and Michael B Elowitz. Reconstructing cell histories in space with image-readable base editor recording. bioRxiv, page 2024.01.03.573434, January 2024.
  • [42] Daniel E Wagner and Allon M Klein. Lineage tracing meets single-cell omics: opportunities and challenges. Nat. Rev. Genet., 21(7):410–427, July 2020.
  • [43] Zeming Lin, Halil Akin, Roshan Rao, Brian Hie, Zhongkai Zhu, Wenting Lu, Nikita Smetanin, Robert Verkuil, Ori Kabeli, Yaniv Shmueli, Allan Dos Santos Costa, Maryam Fazel-Zarandi, Tom Sercu, Salvatore Candido, and Alexander Rives. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science, 379(6637):1123–1130, March 2023.
  • [44] David B Brückner and Gašper Tkačik. Information content and optimization of self-organized developmental systems. arXiv [physics.bio-ph], December 2023.
  • [45] Gašper Tkačik and William Bialek. Information processing in living systems. Annu. Rev. Condens. Matter Phys., 7(1):89–117, March 2016.
  • [46] D Bray. Protein molecules as computational elements in living cells. Nature, 376(6538):307–312, July 1995.
  • [47] Shuyang Gao, Greg Ver Steeg, and Aram Galstyan. Efficient estimation of mutual information for strongly dependent variables. arXiv [cs.IT], November 2014.
  • [48] Jaewoong Cho, Gyeongjo Hwang, and Changho Suh. A fair classifier using mutual information. In 2020 IEEE International Symposium on Information Theory (ISIT), pages 2521–2526. IEEE, June 2020.
  • [49] Jian Kang, Tiankai Xie, Xintao Wu, Ross Maciejewski, and Hanghang Tong. InfoFair: Information-theoretic intersectional fairness. In 2022 IEEE International Conference on Big Data (Big Data), pages 1455–1464. IEEE, December 2022.
  • [50] Alexandre Lacoste, Alexandra Luccioni, Victor Schmidt, and Thomas Dandres. Quantifying the carbon emissions of machine learning. arXiv [cs.CY], October 2019.
  • [51] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Yee Whye Teh and Mike Titterington, editors, Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, volume 9 of Proceedings of Machine Learning Research, pages 249–256, Chia Laguna Resort, Sardinia, Italy, 2010. PMLR.
  • [52] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zach DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An imperative style, high-performance deep learning library. arXiv [cs.LG], December 2019.
  • [53] Lukas Heumos, Anna C Schaar, Christopher Lance, Anastasia Litinetskaya, Felix Drost, Luke Zappia, Malte D Lücken, Daniel C Strobl, Juan Henao, Fabiola Curion, Single-cell Best Practices Consortium, Herbert B Schiller, and Fabian J Theis. Best practices for single-cell analysis across modalities. Nat. Rev. Genet., pages 1–23, March 2023.

Appendix A Appendix / supplemental material

In this appendix, we first elaborate on the theory and implementation details of the LMI approximation. We then discuss alternative approaches one could use to implement the LMI approximation, and describe the details, assumptions, and motivations of our empirical evaluation benchmarks. Finally, we provide details relevant to reproducibility: specifically, our implementations of existing estimators and data preprocessing methods (all of which are also included in our code supplement).

A.1 Theory and implementation of LMI

A.1.1 Theoretically motivating the cross-predictive representation learning architecture

The core theoretical underpinning of the network architecture used in the LMI approximation is that the cross-predictive mean-squared loss is a proxy for conditional entropy. Here, we show this explicitly.

Theorem 1.

Let $X=[X_1,\ldots,X_d]$ and $Z=[Z_1,\ldots,Z_k]$ be absolutely continuous random vectors in $\mathbb{R}^d$ and $\mathbb{R}^k$ respectively with finite differential entropy. Let $f_\theta:\mathbb{R}^k\to\mathbb{R}^d$ be a function (a neural network parameterized by $\theta$) to estimate $\hat{X}=f_\theta(Z)$. For any $\theta$,

$$h(X|Z) \leq \alpha + \frac{1}{2}\log \text{MSE}(\hat{X},X) \tag{6}$$

where $\alpha$ is a positive constant and $\text{MSE}(\hat{X},X)=\frac{1}{d}\sum_i \mathbb{E}[(X_i-\hat{X}_i)^2]$.

Proof.

From the chain rule for differential entropy, we can bound

$$h(X|Z) \leq \sum_i h(X_i|Z) \tag{7}$$

Because the maximum entropy distribution with fixed variance is Gaussian [7], we can bound

$$\sum_i h(X_i|Z) \leq \sum_i \frac{1}{2}\log\left(2\pi e\,\text{Var}(X_i|Z)\right) \tag{8}$$
$$= \sum_i \frac{1}{2}\log\left(2\pi e\,\mathbb{E}\big[(X_i-\mathbb{E}[X_i|Z])^2\big]\right) \tag{9}$$

Because the conditional expectation $\mathbb{E}[X_i|Z]$ is the estimator of $X_i$ with minimum mean-squared error [7],

$$\sum_i \frac{1}{2}\log\left(2\pi e\,\mathbb{E}\big[(X_i-\mathbb{E}[X_i|Z])^2\big]\right) \leq \sum_i \frac{1}{2}\log\left(2\pi e\,\mathbb{E}\big[(X_i-\hat{X}_i)^2\big]\right) \tag{10}$$

So with positive constant $\alpha$,

$$h(X|Z) \leq \alpha + \frac{1}{2}\log \text{MSE}(\hat{X},X) \tag{11}$$
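
The passage from (10) to (11) is not spelled out. One way to fill it in, sketched here under the assumption that Jensen's inequality for the concave logarithm is the intended step (the explicit value of $\alpha$ below is an assumption of this sketch), is:

```latex
\begin{align*}
\sum_{i=1}^{d} \frac{1}{2}\log\left(2\pi e\,\mathbb{E}\big[(X_i-\hat{X}_i)^2\big]\right)
&= \frac{d}{2}\log(2\pi e) + \frac{d}{2}\cdot\frac{1}{d}\sum_{i=1}^{d}\log \mathbb{E}\big[(X_i-\hat{X}_i)^2\big] \\
&\leq \frac{d}{2}\log(2\pi e) + \frac{d}{2}\log\left(\frac{1}{d}\sum_{i=1}^{d}\mathbb{E}\big[(X_i-\hat{X}_i)^2\big]\right) \\
&= \frac{d}{2}\log(2\pi e) + \frac{d}{2}\log \text{MSE}(\hat{X},X).
\end{align*}
```

Under this reading, $\alpha = \frac{d}{2}\log(2\pi e) > 0$. For $d=1$ the right-hand side is exactly the form in (11); for $d>1$ the logarithmic term carries a factor of $d$, which does not change the qualitative conclusion that lowering the cross-predictive MSE tightens the upper bound on $h(X|Z)$.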

In LMI, $Z$ corresponds to the latent representation of one input variable ($Y$), and $f$ corresponds to the decoder, which aims to reconstruct the other variable $X$ from the latent code $Z$. This result is very similar to some used in information-theoretic interpretations of autoencoders [28, 29], and can be thought of as a continuous analog of Fano's inequality [7]. The $d=k=1$ case of this bound is given as a Corollary to Theorem 8.6.6 in [7].
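
The $d=k=1$ case can be checked in closed form. The sketch below assumes standard jointly Gaussian $X, Z$ with correlation $\rho$ and a linear predictor $\hat{X}=aZ$ (both illustrative choices, not part of the method), for which $h(X|Z)=\frac{1}{2}\log(2\pi e(1-\rho^2))$ and $\text{MSE}=1-2a\rho+a^2$. All function names are hypothetical.

```python
import math

def cond_entropy_gaussian(rho):
    """h(X|Z) in nats for standard jointly Gaussian X, Z with correlation rho."""
    return 0.5 * math.log(2 * math.pi * math.e * (1 - rho ** 2))

def mse_linear(a, rho):
    """E[(X - a Z)^2] for standard jointly Gaussian X, Z with correlation rho."""
    return 1 - 2 * a * rho + a * a

def entropy_bound(mse):
    """Right-hand side of the bound for d = 1, with alpha = 0.5 * log(2 pi e)."""
    return 0.5 * math.log(2 * math.pi * math.e) + 0.5 * math.log(mse)

rho = 0.8
h = cond_entropy_gaussian(rho)
slopes = [i / 10 for i in range(-20, 21)]
# Gap between the bound and the true conditional entropy for each predictor slope.
gaps = {a: entropy_bound(mse_linear(a, rho)) - h for a in slopes}
```

In this setting the gap is non-negative for every slope and vanishes at $a=\rho$, where the predictor equals the conditional expectation.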

A.1.2 Implementation of the representation learning architecture

There are many ways one could implement the high-level network architecture suggested by Theorem 1, with different encoder and decoder architectures, and choices of hyperparameters. Here, we will describe our design choices. Our motivating philosophy was that the implementation details should not need to be tuned for specific estimation problems.

All LMI estimates presented in the main text use programmatically determined parameters, requiring no user input beyond joint samples. By default, all latent representations have 16 dimensions (8 for each variable). Each encoder and decoder is a multilayer perceptron (MLP) with three hidden layers, whose sizes are determined as a function of the dimensionality of the input variable. For a variable with dimensionality $d$, the encoder has hidden layer sizes $L, L/2, L/4$ with $L=\max(2^{\lfloor \log_2(d) \rfloor}, 1024)$. Decoders have the same structure inverted, with hidden layer sizes $L/4, L/2, L$.
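
As a small illustration, the layer-size rule can be written as a helper. This is a sketch that follows the formula exactly as printed above; the function names are hypothetical and are not taken from the code supplement.

```python
def encoder_hidden_sizes(d, width_floor=1024):
    """Hidden layer sizes (L, L/2, L/4) with L = max(2^floor(log2 d), width_floor)."""
    if d < 1:
        raise ValueError("dimensionality must be a positive integer")
    largest_power_of_two = 1 << (d.bit_length() - 1)  # 2^floor(log2 d) for integer d
    L = max(largest_power_of_two, width_floor)
    return [L, L // 2, L // 4]

def decoder_hidden_sizes(d, width_floor=1024):
    """Decoders use the same sizes in reverse order: (L/4, L/2, L)."""
    return encoder_hidden_sizes(d, width_floor)[::-1]
```

For example, `encoder_hidden_sizes(5000)` gives `[4096, 2048, 1024]`, while any input with fewer than 2048 dimensions gives `[1024, 512, 256]`.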

All MLP activations are Leaky ReLUs with negative slope 0.2, except the last layers of decoders, which have no activation. Cross-decoders are trained with 50% dropout after each activation layer. All weights are initialized using Xavier uniform initialization [51], and optimized using Adam, with hyperparameters listed in Table 1. Models are trained with a batch size of 512, a 1:1 train-validation split, and a maximum of 300 epochs, using the early stopping procedure given in Algorithm 3.
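
The pieces described above can be sketched without a deep learning framework. The NumPy code below is only an illustration of a decoder-style forward pass (Xavier uniform weights, Leaky ReLU with slope 0.2, inverted dropout, no activation on the last layer); it is not the PyTorch implementation used for the experiments, zero-initialized biases are an assumption, and all names are hypothetical.

```python
import numpy as np

def xavier_uniform(fan_in, fan_out, rng):
    """Xavier (Glorot) uniform initialization: U(-b, b) with b = sqrt(6 / (fan_in + fan_out))."""
    bound = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-bound, bound, size=(fan_in, fan_out))

def leaky_relu(x, negative_slope=0.2):
    return np.where(x > 0, x, negative_slope * x)

def init_mlp(sizes, rng):
    """One (weights, bias) pair per layer; biases start at zero (an assumption)."""
    return [(xavier_uniform(m, n, rng), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def forward(params, x, rng=None, dropout=0.0):
    """Forward pass; dropout is applied after each activation only when rng is given."""
    for i, (W, b) in enumerate(params):
        x = x @ W + b
        if i < len(params) - 1:  # the last layer has no activation
            x = leaky_relu(x)
            if rng is not None and dropout > 0:
                keep = rng.random(x.shape) >= dropout
                x = x * keep / (1.0 - dropout)  # inverted dropout scaling
    return x
```

A decoder mapping an 8-dimensional latent code to a 100-dimensional variable would then be built as `init_mlp([8, 256, 512, 1024, 100], rng)`.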

All models are implemented in PyTorch [52]. All experiments in this paper were run on a single NVIDIA RTX 3090 GPU.

| Parameter | Value |
|---|---|
| Learning rate ($\alpha$) | $10^{-4}$ |
| $\beta_1$ | $0.9$ |
| $\beta_2$ | $0.999$ |
| Epsilon ($\epsilon$) | $10^{-7}$ |

Table 1: Adam optimizer parameters used in LMI.

Algorithm 3 Early Stopping
procedure EarlyStopping(model, validation_losses, patience = 30)
     best_loss ← ∞
     patience_counter ← 0
     best_model ← None
     while patience_counter < patience do
          current_loss ← validation_loss(model)
          if current_loss < best_loss then
               best_loss ← current_loss
               patience_counter ← 0
               best_model ← model
          else
               patience_counter ← patience_counter + 1
          end if
     end while
     return best_loss, best_model
end procedure
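
A runnable version of this procedure might look as follows. This is a sketch rather than the implementation in the code supplement: the callables `train_one_epoch` and `validation_loss` are hypothetical placeholders, the per-epoch training step and the 300-epoch cap are made explicit, and the best model is stored as a deep copy so that later updates do not overwrite it.

```python
import copy

def train_with_early_stopping(model, train_one_epoch, validation_loss,
                              patience=30, max_epochs=300):
    """Train until validation loss fails to improve for `patience` consecutive epochs."""
    best_loss = float("inf")
    best_model = None
    patience_counter = 0
    for _ in range(max_epochs):
        if patience_counter >= patience:
            break
        train_one_epoch(model)                # one pass over the training split
        current_loss = validation_loss(model)
        if current_loss < best_loss:
            best_loss = current_loss
            best_model = copy.deepcopy(model)  # snapshot, not a reference
            patience_counter = 0
        else:
            patience_counter += 1
    return best_loss, best_model
```

Storing a snapshot matters in frameworks where training mutates the model in place; keeping only a reference would return the last model rather than the best one.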

A.1.3 Theoretical properties of LMI approximation

The error of MI estimates using the LMI approximation can be broadly attributed to two sources: error due to the representation approximation (i.e. $|I(X;Y) - I(Z_x;Z_y)|$), and classical MI estimation error (i.e. $|\hat{I}_{KSG}(Z_x;Z_y) - I(Z_x;Z_y)|$). Here, we will explore the first source by deriving some basic properties of the representation approximation $I(Z_x;Z_y)$.

Theorem 2.

Let $X, Y$ be random vectors in $\mathbb{R}^d$ and $Z_x = f_\theta(X)$, $Z_y = g_\phi(Y)$ where $f, g$ are neural networks parameterized by $\theta, \phi$. For any $\theta, \phi$,

$$I(Z_x;Z_y) \leq I(X;Y) \tag{12}$$
Proof.

Because $Z_y = g_\phi(Y)$, we have that $X \to Y \to Z_y$ form a Markov chain. From the data processing inequality [7],

$$I(X;Y) \geq I(X;Z_y) \tag{13}$$

Similarly, we have $Z_y \to X \to Z_x$, and

$$I(X;Y) \geq I(X;Z_y) \geq I(Z_x;Z_y) \tag{14}$$
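
Theorem 2 can be checked numerically on a small discrete example, where MI can be computed exactly. In the sketch below, deterministic coarse-grainings of the rows and columns of a joint probability table play the role of $f_\theta$ and $g_\phi$; the table, the maps, and the function names are all illustrative.

```python
import numpy as np

def mutual_information(joint):
    """Exact MI in bits of a discrete joint distribution given as a 2-D table."""
    joint = np.asarray(joint, dtype=float)
    joint = joint / joint.sum()
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log2(joint[mask] / (px @ py)[mask])))

def coarse_grain(joint, row_map, col_map):
    """Joint distribution of (f(X), g(Y)) for deterministic maps f and g."""
    joint = np.asarray(joint, dtype=float)
    out = np.zeros((max(row_map) + 1, max(col_map) + 1))
    for i, r in enumerate(row_map):
        for j, c in enumerate(col_map):
            out[r, c] += joint[i, j]
    return out

rng = np.random.default_rng(0)
p_xy = rng.random((6, 6))
p_xy /= p_xy.sum()
# Each variable is mapped from 6 states down to 3 states.
p_z = coarse_grain(p_xy, row_map=[0, 0, 1, 1, 2, 2], col_map=[0, 1, 1, 0, 2, 2])
```

Comparing `mutual_information(p_z)` with `mutual_information(p_xy)` illustrates the inequality; applying the same maps to a product distribution leaves the MI at zero.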

Theorem 3.

Let $X, Y$ be independent random vectors in $\mathbb{R}^d$, such that $I(X;Y) = 0$. Let $Z_x = f_\theta(X)$, $Z_y = g_\phi(Y)$ where $f, g$ are neural networks parameterized by $\theta, \phi$. For any $\theta, \phi$,

$$I(Z_x;Z_y) = 0 \tag{15}$$
Proof.

Due to the nonnegativity of mutual information, we know

$$I(Z_x;Z_y) \geq 0 \tag{16}$$

From Theorem 2, we have

$$I(Z_x;Z_y) \leq I(X;Y) = 0 \tag{17}$$

So,

$$0 \leq I(Z_x;Z_y) \leq 0 \tag{18}$$
$$I(Z_x;Z_y) = 0 \tag{19}$$

A.2 Alternate approaches to latent MI approximation

The broad goal of LMI, to estimate $I(X;Y)$ using $I(Z_y;Z_x)$ where $Z_x, Z_y$ are low-dimensional representations, is quite general and could be approached in many ways beyond cross-predictive regularization. Next, we will discuss some alternative approaches, empirically explore their performance, and finally, show one unique advantage of the cross-predictive representation learning architectures.

A.2.1 Alternate methods of regularizing autoencoders for MI estimation

Another approach to learn $Z_x, Z_y$ such that $\hat{I}(Z_x;Z_y) \approx I(X;Y)$ is to regularize autoencoders to maximize $I(Z_x;Z_y)$. This approach is sensible because the data processing inequality ensures that $I(Z_x;Z_y) \leq I(X;Y)$. So maximizing $I(Z_x;Z_y)$ is equivalent to minimizing the approximation error $I(X;Y) - I(Z_x;Z_y)$.

While directly maximizing $I(Z_x;Z_y)$ is intractable, we can build on the variational bounds explored in [15, 30]. We can add a loss term $\mathcal{L}_{\text{MINE}}(Z_x;Z_y)$ or $\mathcal{L}_{\text{InfoNCE}}(Z_x;Z_y)$ to our autoencoder loss functions to regularize latent codes to preserve mutually informative structure.
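
To make the InfoNCE term concrete, the sketch below computes the standard InfoNCE lower bound from a matrix of critic scores, where entry $(i, j)$ scores the pairing of sample $i$ of $Z_x$ with sample $j$ of $Z_y$ and the diagonal holds the true pairs. The form of the critic is not specified here, so the score matrix is taken as given; the function names are hypothetical.

```python
import numpy as np

def infonce_lower_bound(scores):
    """InfoNCE lower bound on MI (in nats) from an (n, n) matrix of critic scores.

    scores[i, j] is the critic value for (zx_i, zy_j); diagonal entries are true pairs.
    """
    scores = np.asarray(scores, dtype=float)
    n = scores.shape[0]
    # Numerically stable log-sum-exp over each row.
    row_max = scores.max(axis=1, keepdims=True)
    log_sum_exp = row_max[:, 0] + np.log(np.exp(scores - row_max).sum(axis=1))
    return float(np.mean(np.diag(scores) - log_sum_exp) + np.log(n))

def infonce_loss(scores):
    """Regularization term to minimize: the negative of the lower bound."""
    return -infonce_lower_bound(scores)
```

The bound can never exceed $\log n$ for a batch of $n$ pairs, which is one reason variational estimates of this kind saturate when the true MI is large relative to the batch size.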

We implement both of these approaches, and benchmark them using the approach described in Figure 2 and Section 3.1. We find that the estimates from these regularization approaches perform similarly to, but slightly worse than, those from cross-predictive regularization.

Figure 7: Experiment from Figure 2, with alternate regularization approaches. (a) Absolute error, measured by MSE. (b) Relative error, measured by Kendall $\tau$.

A.2.2 Comparing latent nonparametric and latent variational MI estimation

After learning low-dimensional representations, there are multiple estimators that could be used for latent MI approximation. This appendix evaluates two methods: the KSG nonparametric estimator and the InfoNCE variational estimator. Nonparametric nearest-neighbor estimators have several advantages in low-dimensional settings: they generally require far fewer samples for accurate estimation [12], and yield pointwise mutual information decompositions (see Algorithm 5).
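
For readers unfamiliar with the nearest-neighbor approach, the following is a minimal brute-force sketch of the first KSG algorithm (Kraskov, Stögbauer, and Grassberger), written with NumPy only. It is meant to illustrate the estimator on small low-dimensional samples, not to reproduce the implementation used for the experiments; the function names are hypothetical, and the $O(N^2)$ distance matrices limit it to a few thousand samples.

```python
import numpy as np

def digamma_int(n):
    """Digamma at positive integers: psi(n) = -gamma + sum_{j=1}^{n-1} 1/j."""
    n = np.asarray(n)
    harmonic = np.concatenate(([0.0], np.cumsum(1.0 / np.arange(1, n.max()))))
    return -np.euler_gamma + harmonic[n - 1]

def ksg_mi(x, y, k=3):
    """KSG (algorithm 1) estimate of I(X;Y) in nats, using the max-norm."""
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    y = np.asarray(y, dtype=float).reshape(len(y), -1)
    n = len(x)
    # Pairwise Chebyshev distances in each marginal space and in the joint space.
    dx = np.abs(x[:, None, :] - x[None, :, :]).max(axis=2)
    dy = np.abs(y[:, None, :] - y[None, :, :]).max(axis=2)
    dz = np.maximum(dx, dy)
    np.fill_diagonal(dz, np.inf)            # a point is not its own neighbor
    eps = np.sort(dz, axis=1)[:, k - 1]     # distance to the k-th joint neighbor
    # Count marginal neighbors strictly within eps, excluding the point itself.
    nx = (dx < eps[:, None]).sum(axis=1) - 1
    ny = (dy < eps[:, None]).sum(axis=1) - 1
    return float(digamma_int(k) + digamma_int(n)
                 - np.mean(digamma_int(nx + 1) + digamma_int(ny + 1)))
```

Dividing the result by $\ln 2$ converts it to bits. The per-sample terms inside the mean are what make pointwise decompositions possible.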

For completeness, we empirically compare both options here on one particular estimate. For a Gaussian dataset generated by Algorithm 4 with $d=100$, $k=1$, $N=5\cdot 10^{3}$ and 1 bit MI, we train autoencoders regularized by InfoNCE loss to maximize $I(Z_x;Z_y)$. After each training epoch, we measure LMI as estimated using latent KSG, and using latent InfoNCE. Both are plotted for 100 trials in Appendix Figure 8. The latent KSG estimate converges quickly to the true value, while the latent InfoNCE estimate converges somewhat slowly to a value below the ground truth.

Figure 8: Convergence of multiple latent estimation approaches during training. Bold lines indicate averages over 100 trials. Ground truth is 1 bit MI.

A.2.3 Interpreting decoders with element-wise reconstruction error

Beyond performance differences, one benefit of using cross-predictive networks to regularize latent representations is that the cross-decoders themselves are useful. For example, by inspecting the dimension-wise reconstruction error of decoders, we can attribute an MI estimate to the predictability of certain dimensions. For a network trained on the “binary” MNIST dataset with $L_x = L_y$, visualizing the dimension-wise reconstruction error of a cross-predictive decoder reveals the pixels that contain information about digit identity (Fig. 9).

Pixels with low reconstruction error are likely to be “well-explained” by the other high-dimensional variable, while pixels with high reconstruction error are poorly explained. However, this reasoning is not universally applicable. Dimensions with no variation do not contribute to MI, but have very low reconstruction error. In Fig. 9, the outermost pixels are examples of this.
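
The dimension-wise error is simple to compute from held-out samples and their cross-decoded reconstructions. The sketch below also shows one possible way to handle the caveat about constant dimensions, by reporting the fraction of variance explained and marking zero-variance dimensions as undefined; that normalization is an illustrative addition, not something used in the paper, and the function names are hypothetical.

```python
import numpy as np

def dimensionwise_mse(x, x_hat):
    """Mean squared reconstruction error for each dimension (column)."""
    x = np.asarray(x, dtype=float)
    x_hat = np.asarray(x_hat, dtype=float)
    return np.mean((x - x_hat) ** 2, axis=0)

def explained_variance_per_dimension(x, x_hat, var_tol=1e-8):
    """1 - MSE / variance per dimension; NaN where the dimension does not vary."""
    x = np.asarray(x, dtype=float)
    err = dimensionwise_mse(x, x_hat)
    var = x.var(axis=0)
    explained = np.full(var.shape, np.nan)
    varying = var > var_tol
    explained[varying] = 1.0 - err[varying] / var[varying]
    return explained
```

For image data such as MNIST, reshaping either output to the image dimensions gives a map like the one in the figure.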

Figure 9: Pixel-wise reconstruction error of cross-decoders in paired binary MNIST dataset where $L_x = L_y$.

A.3 Details of experimental evaluation benchmarks

In this section, we will describe the details of our experimental evaluation benchmarks from Section 3. First, we will describe how we generate multivariate Gaussian datasets. Then, we will provide theoretical and empirical validation for the cluster-based benchmarking approach.

A.3.1 Generating multivariate Gaussian datasets with low-dimensional dependence structure

The algorithm used to sample multivariate Gaussians in Figure 2 is given in Algorithm 4. Briefly, we generate samples of $k$ correlated dimensions by sampling $k$ bivariate Gaussians. We then complete the remaining $d-k$ ambient dimensions using half “redundant” dimensions (copies of the $k$ correlated dimensions) and half “nuisance” dimensions (independent univariate Gaussians).
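
The covariance needed to reach a target MI follows from the closed-form MI of a bivariate Gaussian. As a short worked derivation (with $\sigma_x^2$ and $\sigma_y^2$ denoting the two marginal variances and $\rho$ the covariance, notation introduced here for illustration), each of the $k$ independent pairs must carry $b/k$ bits:

```latex
\begin{align*}
I(X_i;Y_i) &= -\frac{1}{2}\log_2\left(1 - \frac{\rho^2}{\sigma_x^2\sigma_y^2}\right) = \frac{b}{k} \\
\Longrightarrow\quad 1 - \frac{\rho^2}{\sigma_x^2\sigma_y^2} &= 2^{-2b/k} \\
\Longrightarrow\quad \rho &= \sqrt{\sigma_x^2\sigma_y^2}\,\sqrt{1 - 2^{-2b/k}}.
\end{align*}
```

Redundant dimensions are deterministic copies and nuisance dimensions are independent of everything else, so neither changes the total, which remains $k \cdot b/k = b$ bits.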

Algorithm 4 Generating multivariate Gaussian datasets with low-dimensional dependence structure
Input: ambient dimensionality $d$
Input: dependence structure dimensionality $k$
Input: number of nuisance dimensions $n$ (default $(d-k)/2$)
Input: number of samples $N$
Input: ground truth MI $b$
// Compute covariance matrix to yield dependence $b$
let $\rho$ ← $\sqrt{6 \cdot 3.5} \cdot \sqrt{1 - 2^{-2b/k}}$
let $\Sigma$ ← $[[6, \rho], [\rho, 3.5]]$
// Sample bivariate Gaussians for $k$ dependent dimensions
for $i$ in $1..k$ do
     let $X_i, Y_i$ ← $N$ samples from $\mathcal{N}(0, \Sigma)$
end for
// Duplicate random dimensions for redundant dimensions
for $i$ in $1..(d-(k+n))$ do
     let $r$ ← $\text{unif}([1..k])$
     let $X_{i+k}, Y_{i+k}$ ← $X_r, Y_r$
end for
// Sample univariate Gaussians for nuisance dimensions
for $i$ in $0..(n-1)$ do
     let $X_{d-i}$ ← $N$ samples from $\mathcal{N}(0, 1)$
     let $Y_{d-i}$ ← $N$ samples from $\mathcal{N}(0, 1)$
end for
return $[X_1, \ldots, X_d]$, $[Y_1, \ldots, Y_d]$
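
A NumPy sketch of this procedure is given below. It assumes the second marginal variance is 3.5, consistent with the factor $\sqrt{6 \cdot 3.5}$ in the expression for $\rho$, and it uses zero-based column indices; the function name and the `seed` argument are hypothetical and not taken from the code supplement.

```python
import numpy as np

def generate_gaussian_dataset(d, k, N, b, n=None, seed=None):
    """Sample (X, Y), each of shape (N, d), with b bits of MI carried by k dimensions."""
    rng = np.random.default_rng(seed)
    if n is None:
        n = (d - k) // 2                      # default number of nuisance dimensions
    # Covariance that gives each of the k pairs b / k bits of MI.
    rho = np.sqrt(6 * 3.5) * np.sqrt(1 - 2 ** (-2 * b / k))
    cov = np.array([[6.0, rho], [rho, 3.5]])
    X = np.empty((N, d))
    Y = np.empty((N, d))
    # k dependent dimensions: independent bivariate Gaussian pairs.
    for i in range(k):
        pair = rng.multivariate_normal(np.zeros(2), cov, size=N)
        X[:, i], Y[:, i] = pair[:, 0], pair[:, 1]
    # Redundant dimensions: copies of randomly chosen dependent dimensions.
    for i in range(d - (k + n)):
        r = rng.integers(k)
        X[:, k + i], Y[:, k + i] = X[:, r], Y[:, r]
    # Nuisance dimensions: independent standard Gaussians in the last n columns.
    X[:, d - n:] = rng.normal(size=(N, n))
    Y[:, d - n:] = rng.normal(size=(N, n))
    return X, Y
```

For example, `generate_gaussian_dataset(d=100, k=1, N=5000, b=1.0)` matches the setting used in the convergence comparison of Section A.2.2.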

A.3.2 Theoretical justification for label MI approximation of high-dimensional MI

For the benchmarking setup described in Section 3.2, we have $L_y \to L_x \to X$ and $L_x \to L_y \to Y$. We will show that $I(X;Y) = I(L_x;L_y)$ under the condition that $H(L_x|X) = H(L_y|Y) = 0$.

Theorem 4.

Let $L_x, L_y$ be Bernoulli random variables. Let $X, Y$ be absolutely continuous random vectors in $\mathbb{R}^N$ such that $I(L_x;Y|L_y) = I(L_y;X|L_x) = 0$ and $H(L_x|X) = H(L_y|Y) = 0$. Then,

$$I(X;Y) = I(L_x;L_y) \tag{20}$$
Proof.

Due to conditional independence,

I(Lx;Ly)=I(Lx;Ly)+I(Lx;Y|Ly)𝐼subscript𝐿𝑥subscript𝐿𝑦𝐼subscript𝐿𝑥subscript𝐿𝑦𝐼subscript𝐿𝑥conditional𝑌subscript𝐿𝑦I(L_{x};L_{y})=I(L_{x};L_{y})+I(L_{x};Y|L_{y})italic_I ( italic_L start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT ; italic_L start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT ) = italic_I ( italic_L start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT ; italic_L start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT ) + italic_I ( italic_L start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT ; italic_Y | italic_L start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT ) (21)

Using the chain rule for mutual information [7],

I(Lx;Ly)=I(Lx;Ly)+I(Lx;Y|Ly)=I(Lx;Y)+I(Lx;Ly|Y)𝐼subscript𝐿𝑥subscript𝐿𝑦𝐼subscript𝐿𝑥subscript𝐿𝑦𝐼subscript𝐿𝑥conditional𝑌subscript𝐿𝑦𝐼subscript𝐿𝑥𝑌𝐼subscript𝐿𝑥conditionalsubscript𝐿𝑦𝑌I(L_{x};L_{y})=I(L_{x};L_{y})+I(L_{x};Y|L_{y})=I(L_{x};Y)+I(L_{x};L_{y}|Y)italic_I ( italic_L start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT ; italic_L start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT ) = italic_I ( italic_L start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT ; italic_L start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT ) + italic_I ( italic_L start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT ; italic_Y | italic_L start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT ) = italic_I ( italic_L start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT ; italic_Y ) + italic_I ( italic_L start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT ; italic_L start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT | italic_Y ) (22)

Applying the chain rule again,

$$I(L_x;L_y) = I(L_x;Y) + I(L_x;L_y \mid Y) = I(X;Y) + I(L_x;Y \mid X) + I(L_x;L_y \mid Y) \tag{23}$$

Due to the non-negativity of mutual information we have

$$I(L_x;L_y) \geq I(X;Y) \tag{24}$$

Because $L_x, L_y$ are discrete, we can bound

$$I(L_x;L_y) \leq I(X;Y) + H(L_x \mid X) + H(L_y \mid Y) \tag{25}$$

So if $H(L_x \mid X) = H(L_y \mid Y) = 0$, we have

$$I(L_x;L_y) \leq I(X;Y) \tag{26}$$

Combining this with (24), we have

$$I(X;Y) = I(L_x;L_y) \tag{27}$$
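The identity can be sanity-checked numerically on a discrete analogue. This is an assumption made purely for illustration, since the theorem itself concerns absolutely continuous $X, Y$. In the sketch below, each observation is a pair (label, noise symbol), so the label is a deterministic function of the observation, and the noise is drawn independently of everything else. The joint label distribution `p_labels` and the noise distribution `noise` are hypothetical.

```python
import itertools
import math

def mutual_information(joint):
    """Exact MI in bits of a joint pmf given as {(a, b): probability}."""
    pa, pb = {}, {}
    for (a, b), p in joint.items():
        pa[a] = pa.get(a, 0.0) + p
        pb[b] = pb.get(b, 0.0) + p
    return sum(p * math.log2(p / (pa[a] * pb[b]))
               for (a, b), p in joint.items() if p > 0)

# Hypothetical joint distribution of the Bernoulli labels (L_x, L_y).
p_labels = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

# Noise symbols; X = (L_x, noise_x) and Y = (L_y, noise_y), so each label is
# recoverable from its observation: H(L_x | X) = H(L_y | Y) = 0.
noise = {0: 0.5, 1: 0.3, 2: 0.2}

p_xy = {}
for (lx, ly), p in p_labels.items():
    for (nx, px), (ny, py) in itertools.product(noise.items(), repeat=2):
        p_xy[((lx, nx), (ly, ny))] = p * px * py

mi_labels = mutual_information(p_labels)
mi_observations = mutual_information(p_xy)
print(f"I(L_x; L_y) = {mi_labels:.6f} bits")
print(f"I(X; Y)     = {mi_observations:.6f} bits")
```

The two printed values agree up to floating-point error, because the noise carries no information about either label.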

A.3.3 Validating the assumptions of cluster-based benchmarking setups

The effectiveness of the benchmarking setup in Section 3.2 relies on the assumption that $H(L_x \mid X) \approx H(L_y \mid Y) \approx 0$. We provide evidence that this is the case for MNIST 0s and 1s, and for sequence embeddings of E. coli and A. thaliana proteins. First, we provide qualitative evidence by showing that label clusters (digits and species, respectively) are well separated on UMAP visualizations of each dataset (Figure 7). This indicates that the label of a sample can be reliably determined from its high-dimensional representation (such that $H(L_x \mid X) \approx 0$). We make this notion quantitatively precise by showing that logistic regression can predict the labels of held-out samples with high accuracy. Over 100 random $1:1$ train-test splits, logistic regression achieves mean validation accuracy $>0.98$ with standard error of the mean $<10^{-3}$ for both datasets. Logistic regression classifiers are trained with an $L_2$ penalty and $\lambda=1$.
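The split-and-score procedure described above can be sketched as follows. This is a minimal illustration, not the code used for the paper: the logistic regression is a plain NumPy gradient-descent stand-in, the penalty convention (adding $\frac{\lambda}{2}\|w\|^2$ to the summed log-loss) is an assumption, and the data are synthetic Gaussian clusters rather than MNIST or protein embeddings. All function names are hypothetical.

```python
import numpy as np

def fit_logistic_l2(X, y, lam=1.0, lr=0.1, n_iter=500):
    """Binary logistic regression with an L2 penalty, fit by gradient descent."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(n_iter):
        z = np.clip(X @ w + b, -30, 30)
        p = 1.0 / (1.0 + np.exp(-z))          # predicted P(label = 1)
        w -= lr * (X.T @ (p - y) + lam * w) / n
        b -= lr * np.mean(p - y)
    return w, b

def split_accuracies(X, y, n_splits=100, seed=0):
    """Held-out accuracy over random 1:1 train-test splits."""
    rng = np.random.default_rng(seed)
    accs = []
    for _ in range(n_splits):
        perm = rng.permutation(len(y))
        train, test = perm[: len(y) // 2], perm[len(y) // 2:]
        w, b = fit_logistic_l2(X[train], y[train])
        pred = (X[test] @ w + b > 0).astype(int)
        accs.append(np.mean(pred == y[test]))
    return np.array(accs)

# Synthetic stand-in: two separated Gaussian clusters in 50 dimensions.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=400)
features = rng.normal(size=(400, 50)) + (labels[:, None] - 0.5)

accs = split_accuracies(features, labels, n_splits=20)
sem = accs.std(ddof=1) / np.sqrt(len(accs))
print(f"mean accuracy = {accs.mean():.3f}, SEM = {sem:.4f}")
```

High held-out accuracy of a linear classifier is evidence that the label is nearly a deterministic function of the representation, which is what the assumption $H(L_x \mid X) \approx 0$ requires.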

Figure 10: Validating assumptions of cluster-based benchmarking. (a) UMAP of binary MNIST subset. (b) UMAP of proteome subsets.

A.4 Experimental reproducibility details

Here, we will briefly summarize key details necessary for the reproduction of experiments in this paper. All of this information, and more details, can be found (albeit in a less easily readable form) in our code supplement.

A.4.1 Preprocessing ProtTrans5 embeddings

We downloaded ProtTrans5 embeddings of all H. sapiens, A. thaliana, and E. coli proteins directly from the UniProt database. Embeddings are from prottrans_t5_xl_u50 [32]. All proteins longer than $12 \cdot 10^{3}$ residues are excluded. These embeddings are then unit variance normalized, and values are clipped at 10 and -10.
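The normalization and clipping step can be sketched as below. The text does not say whether features are also mean-centered or whether scaling is per feature, so both choices here are assumptions, and `scale_and_clip` is a hypothetical helper rather than a function from the code supplement.

```python
import numpy as np

def scale_and_clip(X, clip=10.0, center=True, eps=1e-12):
    """Scale each feature (column) to unit variance, then clip to [-clip, clip].

    Centering and per-feature (rather than global) scaling are assumptions.
    """
    X = np.asarray(X, dtype=float)
    if center:
        X = X - X.mean(axis=0)
    std = X.std(axis=0)
    X = X / np.where(std > eps, std, 1.0)   # leave constant features unscaled
    return np.clip(X, -clip, clip)

rng = np.random.default_rng(0)
emb = rng.normal(loc=3.0, scale=5.0, size=(1000, 8))  # stand-in for embeddings
emb[0, 0] = 1e4                                        # an extreme outlier
out = scale_and_clip(emb)
print(out.shape, out.min(), out.max())
```

Clipping after scaling limits the influence of rare extreme values, such as the injected outlier above, on downstream distance computations.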

A.4.2 Preprocessing hematopoiesis lineage tracing scRNA-seq data

We first downloaded all data from Experiment 1 of the data repository from [38]. Then, we preprocessed using the Scanpy best practices [53], normalizing total reads per cell to $10^{4}$, log transforming, filtering for the $10^{3}$ most highly variable genes, and finally unit variance normalizing and clipping values at 10 and -10. We used the inferred diffusion pseudotime and SPRING embeddings computed in [38]. For pseudotime analysis, we omit cells with pseudotime value below $10^{4}$, because many are Lymphoid-fated rather than Neutrophil-fated. To generate joint samples of clones between two conditions, we identified all clonal barcodes that appeared in both conditions, and randomly sampled a single cell with each barcode from each condition. Because there are often several cells with the same clonal barcode in the same sample, there are many possible random pairings of clonally related cells.
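The clonal pairing step can be sketched as follows. The function name, the list-of-barcodes input format, and the example barcodes are hypothetical; the sketch also assumes that cells without a barcode have already been filtered out.

```python
import random
from collections import defaultdict

def pair_clones(barcodes_a, barcodes_b, seed=None):
    """Pair one random cell per shared clonal barcode from each condition.

    barcodes_a / barcodes_b: clonal barcode of each cell in conditions A / B.
    Returns a list of (index_in_a, index_in_b) pairs, one per shared barcode.
    """
    rng = random.Random(seed)
    cells_a, cells_b = defaultdict(list), defaultdict(list)
    for i, bc in enumerate(barcodes_a):
        cells_a[bc].append(i)
    for j, bc in enumerate(barcodes_b):
        cells_b[bc].append(j)
    shared = sorted(set(cells_a) & set(cells_b))
    return [(rng.choice(cells_a[bc]), rng.choice(cells_b[bc])) for bc in shared]

# Hypothetical barcodes for cells observed in two conditions.
early = ["c1", "c1", "c2", "c3", "c3", "c3"]
late = ["c3", "c1", "c1", "c4"]
pairs = pair_clones(early, late, seed=0)
print(pairs)
```

Calling `pair_clones` with different seeds yields different valid pairings, which reflects the many possible random pairings of clonally related cells noted above.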

A.4.3 MINE and InfoNCE implementation details

We use the implementations of MINE and InfoNCE from [12], with their default parameter choices. To summarize, architectures have two hidden layers with sizes (16, 8) and are optimized using Adam.

Choosing critic architectures

Because the benchmark tasks of [12] are dramatically different from those considered in this work (tens of dimensions as opposed to thousands), we consider the possibility that the parameters of MINE and InfoNCE used in [12] are not suitable for our use case.

To determine if other critic architectures could be more suitable, we first tested two-layer architectures with layer sizes $(L, L/2)$ for various $L$ from 16 to 1024 on multivariate Gaussian datasets. We considered variables with dimensions 10, 100, and 1000, with 1 bit MI, 1-dimensional dependence structure, and $5 \cdot 10^{3}$ samples. We find that increasing $L$ does not improve estimation quality, and that the architecture used by [12] is indeed the best of those tested in all three settings (Fig. 10). We suspect that this is because increasing model complexity does not help when sample size remains small.
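A data set of this kind can be sketched as below. The construction is an assumption: the dependence is placed in a single pair of coordinates with correlation $\rho$, using the closed form $I = -\frac{1}{2}\log_2(1-\rho^2)$ bits for a bivariate Gaussian, while all remaining coordinates are independent noise. The benchmark in the paper may construct the 1-dimensional dependence differently, and `correlated_gaussians` is a hypothetical helper.

```python
import numpy as np

def correlated_gaussians(n_samples, dim, mi_bits=1.0, seed=0):
    """Sample X, Y in R^dim whose dependence lives in one coordinate pair.

    Only X[:, 0] and Y[:, 0] are correlated; all other coordinates are
    independent standard normal noise, so I(X; Y) = -0.5 * log2(1 - rho^2).
    """
    rho = np.sqrt(1.0 - 2.0 ** (-2.0 * mi_bits))
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n_samples, dim))
    Y = rng.normal(size=(n_samples, dim))
    Y[:, 0] = rho * X[:, 0] + np.sqrt(1.0 - rho**2) * Y[:, 0]
    return X, Y, rho

for dim in (10, 100, 1000):
    X, Y, rho = correlated_gaussians(5000, dim)
    r = np.corrcoef(X[:, 0], Y[:, 0])[0, 1]
    mi_check = -0.5 * np.log2(1.0 - rho**2)
    print(dim, X.shape, f"sample corr = {r:.3f}", f"MI = {mi_check:.3f} bits")
```

For 1 bit of MI the required correlation is $\rho = \sqrt{3}/2 \approx 0.866$, regardless of the ambient dimension.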

Figure 11: Variational bound estimators with increasing critic complexity, evaluated on multivariate Gaussians with 1-dimensional dependence structure. Each estimation problem has $5 \cdot 10^{3}$ samples and ground truth MI of 1 bit.

On the MNIST benchmarking setup, we verify that the advantage of LMI estimation over MINE and InfoNCE does not arise merely from model complexity, by using critics with the same complexity as the LMI encoders. In line with the hypothesis that large critic architectures are not suitable for the small sample size regime, we find that with high-complexity critics, MINE and InfoNCE fail entirely on the MNIST benchmark (Fig. 11).

Figure 12: MNIST benchmarking for neural estimators with critic complexity equivalent to LMI encoders, over 20 datasets with true MI between 0 and 1.

A.4.4 KSG implementation details for pointwise decompositions

We implement KSG with $k=3$ nearest neighbors. For protein interaction data, due to large sample numbers resulting in high computational cost, we obtain KSG estimates by averaging over estimates on batches of data containing $10^{3}$ samples.

We slightly adjust the KSG estimator to yield pointwise mutual information estimates. While the original KSG estimator [10] takes a sample expectation over computed pointwise mutual information values, we simply return an array of pointwise estimates rather than the average. For completeness, the algorithm is given in Algorithm 5.

Algorithm 5 KSG estimator for pointwise estimates
Input: joint samples $\{(x_i, y_i)\}_{i=1}^{N}$
Input: parameter $k$ (default $k=3$)
let pmis $\leftarrow [\,]$
for each $(x_i, y_i)$ do
     find $k$-th nearest neighbor in joint space $(x_k, y_k)$
     compute Chebyshev distance $d = \|(x_k, y_k) - (x_i, y_i)\|_{\infty}$
     let $n_x \leftarrow 0$, $n_y \leftarrow 0$
     for each $(x_j, y_j)$ do
         if $\|x_j - x_i\|_{\infty} < d$ then
              $n_x$ += 1
         end if
         if $\|y_j - y_i\|_{\infty} < d$ then
              $n_y$ += 1
         end if
     end for
     pmis.append( $\psi(k) + \psi(N) - \psi(n_x) - \psi(n_y)$ )
end for
return pmis
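The steps of Algorithm 5 can be sketched in Python as follows. This is a brute-force $O(N^2)$ illustration rather than the implementation in the code supplement; the helper names are hypothetical, the digamma function is evaluated only at positive integers, and the data are assumed continuous, so that no two samples coincide and $d > 0$. The usage example also shows the batch averaging described above, on synthetic Gaussians with 1 bit of MI.

```python
import numpy as np

def digamma_int(n):
    """Digamma at a positive integer: psi(n) = -gamma + sum_{m=1}^{n-1} 1/m."""
    return -np.euler_gamma + np.sum(1.0 / np.arange(1, n))

def ksg_pointwise(x, y, k=3):
    """Pointwise KSG estimates (in nats) following the steps of Algorithm 5.

    x: (N, dx) array, y: (N, dy) array of paired samples.
    """
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n = len(x)
    # Pairwise Chebyshev distances in each marginal space and the joint space.
    dx = np.abs(x[:, None, :] - x[None, :, :]).max(axis=2)
    dy = np.abs(y[:, None, :] - y[None, :, :]).max(axis=2)
    dxy = np.maximum(dx, dy)
    pmis = np.empty(n)
    for i in range(n):
        # Index 0 of the sorted distances is the point itself (distance 0).
        d = np.sort(dxy[i])[k]
        n_x = np.count_nonzero(dx[i] < d)   # count includes point i itself
        n_y = np.count_nonzero(dy[i] < d)
        pmis[i] = (digamma_int(k) + digamma_int(n)
                   - digamma_int(n_x) - digamma_int(n_y))
    return pmis

rng = np.random.default_rng(0)
rho = np.sqrt(0.75)                         # correlation giving 1 bit of MI
x = rng.normal(size=(2000, 1))
y = rho * x + np.sqrt(1 - rho**2) * rng.normal(size=(2000, 1))

# Average the estimate over batches of 10^3 samples.
batch_size = 1000
batch_means = [ksg_pointwise(x[s:s + batch_size], y[s:s + batch_size]).mean()
               for s in range(0, len(x), batch_size)]
mi_bits = np.mean(batch_means) / np.log(2)
print(f"KSG estimate: {mi_bits:.3f} bits")
```

Averaging the returned array recovers the usual scalar KSG estimate, while the individual entries can be inspected per sample.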

NeurIPS Paper Checklist

  1. Claims

     Question: Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope?

     Answer: [Yes]

     Justification: The results stated in the abstract and introduction are summaries of the experimental results shown in Sections 3 and 4 (figures 2-6).

  2. Limitations

     Question: Does the paper discuss the limitations of the work performed by the authors?

     Answer: [Yes]

     Justification: There is a subsection of the Discussion dedicated to limitations of the LMI approach. These limitations are considered throughout the paper. For instance, in the abstract, we highlight the necessity of low dimensional intrinsic dependence structure.

  3. Theory Assumptions and Proofs

     Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?

     Answer: [Yes]

     Justification: Assumptions about distributions (such as absolute continuity, finite entropies) are specified in theoretical analysis (Theorems 1-4).

  4. Experimental Result Reproducibility

     Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)?

     Answer: [Yes]

     Justification: Key details necessary to reproduce the experimental results are given in the appendix. All experiments can be reproduced using the supplementary code. It is carefully annotated to match each result in the paper to a specific Jupyter notebook.

  5. Open access to data and code

     Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?

     Answer: [Yes]

     Justification: As stated before, all experiments can be reproduced using the supplementary code. It is carefully annotated to match each result in the paper to a specific Jupyter notebook. The code reproducibility “best practices” were followed.

  6. Experimental Setting/Details

     Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results?

     Answer: [Yes]

     Justification: All training and test details are provided in the Appendix sections on implementation, and can be found in the supplementary code.

  7. Experiment Statistical Significance

     Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?

     Answer: [Yes]

     Justification: We report SEMs and standard deviations in the text when relevant and feasible. No plots contain error bars.

  8. Experiments Compute Resources

     Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?

     Answer: [Yes]

     Justification: We report the hardware used for all experiments, and the projected overall environmental impact of all experiments (failed and pilot) reported in the paper. We do not provide further granularity because the experiments have rather modest compute requirements: the entirety of the experiments in this paper can be reproduced using a commercial NVIDIA GPU (RTX 3090) in about one day (by running every Jupyter notebook in the code supplement).

  9. Code Of Ethics

     Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics https://neurips.cc/public/EthicsGuidelines?

     Answer: [Yes]

     Justification: We have reviewed and adhered to the guidelines. We mention potential social and environmental impacts in the Discussion section.

  10. Broader Impacts

     Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?

     Answer: [Yes]

     Justification: We discuss the potential social impact of MI estimators.

  11. Safeguards

     Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?

     Answer: [N/A]

     Justification: We do not release data or models with high risk for misuse.

  12. Licenses for existing assets

     Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?

     Answer: [Yes]

     Justification: We cite the library used to implement the LMI approximation (PyTorch).

  13. New Assets

     Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?

     Answer: [Yes]

     Justification: The code developed for latent MI approximation (the lmi library provided in the supplement) is well documented.

  14. Crowdsourcing and Research with Human Subjects

     Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?

     Answer: [N/A]

     Justification: This work does not involve human subjects or crowdsourcing.

  15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects

     Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?

     Answer: [N/A]

     Justification: This work does not involve human subjects or crowdsourcing.
