-
Entropic Optimal Transport Eigenmaps for Nonlinear Alignment and Joint Embedding of High-Dimensional Datasets
Authors:
Boris Landa,
Yuval Kluger,
Rong Ma
Abstract:
Embedding high-dimensional data into a low-dimensional space is an indispensable component of data analysis. In numerous applications, it is necessary to align and jointly embed multiple datasets from different studies or experimental conditions. Such datasets may share underlying structures of interest but exhibit individual distortions, resulting in misaligned embeddings using traditional techni…
▽ More
Embedding high-dimensional data into a low-dimensional space is an indispensable component of data analysis. In numerous applications, it is necessary to align and jointly embed multiple datasets from different studies or experimental conditions. Such datasets may share underlying structures of interest but exhibit individual distortions, resulting in misaligned embeddings using traditional techniques. In this work, we propose \textit{Entropic Optimal Transport (EOT) eigenmaps}, a principled approach for aligning and jointly embedding a pair of datasets with theoretical guarantees. Our approach leverages the leading singular vectors of the EOT plan matrix between two datasets to extract their shared underlying structure and align the datasets accordingly in a common embedding space. We interpret our approach as an inter-data variant of the classical Laplacian eigenmaps and diffusion maps embeddings, showing that it enjoys many favorable analogous properties. We then analyze a data-generative model where two observed high-dimensional datasets share latent variables on a common low-dimensional manifold, but each dataset is subject to data-specific translation, scaling, nuisance structures, and noise. We show that in a high-dimensional asymptotic regime, the EOT plan recovers the shared manifold structure by approximating a kernel function evaluated at the locations of the latent variables. Subsequently, we provide a geometric interpretation of our embedding by relating it to the eigenfunctions of population-level operators encoding the density and geometry of the shared manifold. Finally, we showcase the performance of our approach for data integration and embedding through simulations and analyses of real-world biological data, demonstrating its advantages over alternative methods in challenging scenarios.
△ Less
Submitted 1 July, 2024;
originally announced July 2024.
-
The Dyson Equalizer: Adaptive Noise Stabilization for Low-Rank Signal Detection and Recovery
Authors:
Boris Landa,
Yuval Kluger
Abstract:
Detecting and recovering a low-rank signal in a noisy data matrix is a fundamental task in data analysis. Typically, this task is addressed by inspecting and manipulating the spectrum of the observed data, e.g., thresholding the singular values of the data matrix at a certain critical level. This approach is well-established in the case of homoskedastic noise, where the noise variance is identical…
▽ More
Detecting and recovering a low-rank signal in a noisy data matrix is a fundamental task in data analysis. Typically, this task is addressed by inspecting and manipulating the spectrum of the observed data, e.g., thresholding the singular values of the data matrix at a certain critical level. This approach is well-established in the case of homoskedastic noise, where the noise variance is identical across the entries. However, in numerous applications, the noise can be heteroskedastic, where the noise characteristics may vary considerably across the rows and columns of the data. In this scenario, the spectral behavior of the noise can differ significantly from the homoskedastic case, posing various challenges for signal detection and recovery. To address these challenges, we develop an adaptive normalization procedure that equalizes the average noise variance across the rows and columns of a given data matrix. Our proposed procedure is data-driven and fully automatic, supporting a broad range of noise distributions, variance patterns, and signal structures. We establish that in many cases, this procedure enforces the standard spectral behavior of homoskedastic noise -- the Marchenko-Pastur (MP) law, allowing for simple and reliable detection of signal components. Furthermore, we demonstrate that our approach can substantially improve signal recovery in heteroskedastic settings by manipulating the spectrum after normalization. Lastly, we apply our method to single-cell RNA sequencing and spatial transcriptomics data, showcasing accurate fits to the MP law after normalization. Our approach relies on recent results in random matrix theory, which describe the resolvent of the noise via the so-called Dyson equation. By leveraging this relation, we can accurately infer the noise level in each row and each column directly from the resolvent of the data.
△ Less
Submitted 19 June, 2023;
originally announced June 2023.
-
Robust Inference of Manifold Density and Geometry by Doubly Stochastic Scaling
Authors:
Boris Landa,
Xiuyuan Cheng
Abstract:
The Gaussian kernel and its traditional normalizations (e.g., row-stochastic) are popular approaches for assessing similarities between data points. Yet, they can be inaccurate under high-dimensional noise, especially if the noise magnitude varies considerably across the data, e.g., under heteroskedasticity or outliers. In this work, we investigate a more robust alternative -- the doubly stochasti…
▽ More
The Gaussian kernel and its traditional normalizations (e.g., row-stochastic) are popular approaches for assessing similarities between data points. Yet, they can be inaccurate under high-dimensional noise, especially if the noise magnitude varies considerably across the data, e.g., under heteroskedasticity or outliers. In this work, we investigate a more robust alternative -- the doubly stochastic normalization of the Gaussian kernel. We consider a setting where points are sampled from an unknown density on a low-dimensional manifold embedded in high-dimensional space and corrupted by possibly strong, non-identically distributed, sub-Gaussian noise. We establish that the doubly stochastic affinity matrix and its scaling factors concentrate around certain population forms, and provide corresponding finite-sample probabilistic error bounds. We then utilize these results to develop several tools for robust inference under general high-dimensional noise. First, we derive a robust density estimator that reliably infers the underlying sampling density and can substantially outperform the standard kernel density estimator under heteroskedasticity and outliers. Second, we obtain estimators for the pointwise noise magnitudes, the pointwise signal magnitudes, and the pairwise Euclidean distances between clean data points. Lastly, we derive robust graph Laplacian normalizations that accurately approximate various manifold Laplacians, including the Laplace Beltrami operator, improving over traditional normalizations in noisy settings. We exemplify our results in simulations and on real single-cell RNA-sequencing data. For the latter, we show that in contrast to traditional methods, our approach is robust to variability in technical noise levels across cell types.
△ Less
Submitted 10 July, 2023; v1 submitted 16 September, 2022;
originally announced September 2022.
-
Bi-stochastically normalized graph Laplacian: convergence to manifold Laplacian and robustness to outlier noise
Authors:
Xiuyuan Cheng,
Boris Landa
Abstract:
Bi-stochastic normalization provides an alternative normalization of graph Laplacians in graph-based data analysis and can be computed efficiently by Sinkhorn-Knopp (SK) iterations. This paper proves the convergence of bi-stochastically normalized graph Laplacian to manifold (weighted-)Laplacian with rates, when $n$ data points are i.i.d. sampled from a general $d$-dimensional manifold embedded in…
▽ More
Bi-stochastic normalization provides an alternative normalization of graph Laplacians in graph-based data analysis and can be computed efficiently by Sinkhorn-Knopp (SK) iterations. This paper proves the convergence of bi-stochastically normalized graph Laplacian to manifold (weighted-)Laplacian with rates, when $n$ data points are i.i.d. sampled from a general $d$-dimensional manifold embedded in a possibly high-dimensional space. Under certain joint limit of $n \to \infty$ and kernel bandwidth $ε\to 0$, the point-wise convergence rate of the graph Laplacian operator (under 2-norm) is proved to be $ O( n^{-1/(d/2+3)})$ at finite large $n$ up to log factors, achieved at the scaling of $ε\sim n^{-1/(d/2+3)} $. When the manifold data are corrupted by outlier noise, we theoretically prove the graph Laplacian point-wise consistency which matches the rate for clean manifold data plus an additional term proportional to the boundedness of the inner-products of the noise vectors among themselves and with data vectors. Motivated by our analysis, which suggests that not exact bi-stochastic normalization but an approximate one will achieve the same consistency rate, we propose an approximate and constrained matrix scaling problem that can be solved by SK iterations with early termination. Numerical experiments support our theoretical results and show the robustness of bi-stochastically normalized graph Laplacian to high-dimensional outlier noise.
△ Less
Submitted 26 January, 2023; v1 submitted 22 June, 2022;
originally announced June 2022.
-
Biwhitening Reveals the Rank of a Count Matrix
Authors:
Boris Landa,
Thomas T. C. K. Zhang,
Yuval Kluger
Abstract:
Estimating the rank of a corrupted data matrix is an important task in data analysis, most notably for choosing the number of components in PCA. Significant progress on this task was achieved using random matrix theory by characterizing the spectral properties of large noise matrices. However, utilizing such tools is not straightforward when the data matrix consists of count random variables, e.g.…
▽ More
Estimating the rank of a corrupted data matrix is an important task in data analysis, most notably for choosing the number of components in PCA. Significant progress on this task was achieved using random matrix theory by characterizing the spectral properties of large noise matrices. However, utilizing such tools is not straightforward when the data matrix consists of count random variables, e.g., Poisson, in which case the noise can be heteroskedastic with an unknown variance in each entry. In this work, we consider a Poisson random matrix with independent entries, and propose a simple procedure termed \textit{biwhitening} for estimating the rank of the underlying signal matrix (i.e., the Poisson parameter matrix) without any prior knowledge. Our approach is based on the key observation that one can scale the rows and columns of the data matrix simultaneously so that the spectrum of the corresponding noise agrees with the standard Marchenko-Pastur (MP) law, justifying the use of the MP upper edge as a threshold for rank selection. Importantly, the required scaling factors can be estimated directly from the observations by solving a matrix scaling problem via the Sinkhorn-Knopp algorithm. Aside from the Poisson, our approach is extended to families of distributions that satisfy a quadratic relation between the mean and the variance, such as the generalized Poisson, binomial, negative binomial, gamma, and many others. This quadratic relation can also account for missing entries in the data. We conduct numerical experiments that corroborate our theoretical findings, and showcase the advantage of our approach for rank estimation in challenging regimes. Furthermore, we demonstrate the favorable performance of our approach on several real datasets of single-cell RNA sequencing (scRNA-seq), High-Throughput Chromosome Conformation Capture (Hi-C), and document topic modeling.
△ Less
Submitted 2 November, 2021; v1 submitted 25 March, 2021;
originally announced March 2021.
-
Scaling positive random matrices: concentration and asymptotic convergence
Authors:
Boris Landa
Abstract:
It is well known that any positive matrix can be scaled to have prescribed row and column sums by multiplying its rows and columns by certain positive scaling factors (which are unique up to a positive scalar). This procedure is known as matrix scaling, and has found numerous applications in operations research, economics, image processing, and machine learning. In this work, we investigate the be…
▽ More
It is well known that any positive matrix can be scaled to have prescribed row and column sums by multiplying its rows and columns by certain positive scaling factors (which are unique up to a positive scalar). This procedure is known as matrix scaling, and has found numerous applications in operations research, economics, image processing, and machine learning. In this work, we investigate the behavior of the scaling factors and the resulting scaled matrix when the matrix to be scaled is random. Specifically, letting $\widetilde{A}\in\mathbb{R}^{M\times N}$ be a positive and bounded random matrix whose entries assume a certain type of independence, we provide a concentration inequality for the scaling factors of $\widetilde{A}$ around those of $A = \mathbb{E}[\widetilde{A}]$. This result is employed to bound the convergence rate of the scaling factors of $\widetilde{A}$ to those of $A$, as well as the concentration of the scaled version of $\widetilde{A}$ around the scaled version of $A$ in operator norm, as $M,N\rightarrow\infty$. When the entries of $\widetilde{A}$ are independent, $M=N$, and all prescribed row and column sums are $1$ (i.e., doubly-stochastic matrix scaling), both of the previously-mentioned bounds are $\mathcal{O}(\sqrt{\log N / N})$ with high probability. We demonstrate our results in several simulations.
△ Less
Submitted 11 December, 2020;
originally announced December 2020.
-
Local Two-Sample Testing over Graphs and Point-Clouds by Random-Walk Distributions
Authors:
Boris Landa,
Rihao Qu,
Joseph Chang,
Yuval Kluger
Abstract:
Rejecting the null hypothesis in two-sample testing is a fundamental tool for scientific discovery. Yet, aside from concluding that two samples do not come from the same probability distribution, it is often of interest to characterize how the two distributions differ. Given samples from two densities $f_1$ and $f_0$, we consider the task of localizing occurrences of the inequality $f_1 > f_0$. To…
▽ More
Rejecting the null hypothesis in two-sample testing is a fundamental tool for scientific discovery. Yet, aside from concluding that two samples do not come from the same probability distribution, it is often of interest to characterize how the two distributions differ. Given samples from two densities $f_1$ and $f_0$, we consider the task of localizing occurrences of the inequality $f_1 > f_0$. To avoid the challenges associated with high-dimensional space, we propose a general hypothesis testing framework where hypotheses are formulated adaptively to the data by conditioning on the combined sample from the two densities. We then investigate a special case of this framework where the notion of locality is captured by a random walk on a weighted graph constructed over this combined sample. We derive a tractable testing procedure for this case employing a type of scan statistic, and provide non-asymptotic lower bounds on the power and accuracy of our test to detect whether $f_1>f_0$ in a local sense. Furthermore, we characterize the test's consistency according to a certain problem-hardness parameter, and show that our test achieves the minimax detection rate for this parameter. We conduct numerical experiments to validate our method, and demonstrate our approach on two real-world applications: detecting and localizing arsenic well contamination across the United States, and analyzing two-sample single-cell RNA sequencing data from melanoma patients.
△ Less
Submitted 7 September, 2021; v1 submitted 6 November, 2020;
originally announced November 2020.
-
Doubly-Stochastic Normalization of the Gaussian Kernel is Robust to Heteroskedastic Noise
Authors:
Boris Landa,
Ronald R. Coifman,
Yuval Kluger
Abstract:
A fundamental step in many data-analysis techniques is the construction of an affinity matrix describing similarities between data points. When the data points reside in Euclidean space, a widespread approach is to from an affinity matrix by the Gaussian kernel with pairwise distances, and to follow with a certain normalization (e.g. the row-stochastic normalization or its symmetric variant). We d…
▽ More
A fundamental step in many data-analysis techniques is the construction of an affinity matrix describing similarities between data points. When the data points reside in Euclidean space, a widespread approach is to from an affinity matrix by the Gaussian kernel with pairwise distances, and to follow with a certain normalization (e.g. the row-stochastic normalization or its symmetric variant). We demonstrate that the doubly-stochastic normalization of the Gaussian kernel with zero main diagonal (i.e., no self loops) is robust to heteroskedastic noise. That is, the doubly-stochastic normalization is advantageous in that it automatically accounts for observations with different noise variances. Specifically, we prove that in a suitable high-dimensional setting where heteroskedastic noise does not concentrate too much in any particular direction in space, the resulting (doubly-stochastic) noisy affinity matrix converges to its clean counterpart with rate $m^{-1/2}$, where $m$ is the ambient dimension. We demonstrate this result numerically, and show that in contrast, the popular row-stochastic and symmetric normalizations behave unfavorably under heteroskedastic noise. Furthermore, we provide examples of simulated and experimental single-cell RNA sequence data with intrinsic heteroskedasticity, where the advantage of the doubly-stochastic normalization for exploratory analysis is evident.
△ Less
Submitted 25 January, 2021; v1 submitted 30 May, 2020;
originally announced June 2020.
-
KLT Picker: Particle Picking Using Data-Driven Optimal Templates
Authors:
Amitay Eldar,
Boris Landa,
Yoel Shkolnisky
Abstract:
Particle picking is currently a critical step in the cryo-EM single particle reconstruction pipeline. Despite extensive work on this problem, for many data sets it is still challenging, especially for low SNR micrographs. We present the KLT (Karhunen Loeve Transform) picker, which is fully automatic and requires as an input only the approximated particle size. In particular, it does not require an…
▽ More
Particle picking is currently a critical step in the cryo-EM single particle reconstruction pipeline. Despite extensive work on this problem, for many data sets it is still challenging, especially for low SNR micrographs. We present the KLT (Karhunen Loeve Transform) picker, which is fully automatic and requires as an input only the approximated particle size. In particular, it does not require any manual picking. Our method is designed especially to handle low SNR micrographs. It is based on learning a set of optimal templates through the use of multi-variate statistical analysis via the Karhunen Loeve Transform. We evaluate the KLT picker on publicly available data sets and present high-quality results with minimal manual effort.
△ Less
Submitted 12 December, 2019;
originally announced December 2019.
-
Method of moments for 3-D single particle ab initio modeling with non-uniform distribution of viewing angles
Authors:
Nir Sharon,
Joe Kileel,
Yuehaw Khoo,
Boris Landa,
Amit Singer
Abstract:
Single-particle reconstruction in cryo-electron microscopy (cryo-EM) is an increasingly popular technique for determining the 3-D structure of a molecule from several noisy 2-D projections images taken at unknown viewing angles. Most reconstruction algorithms require a low-resolution initialization for the 3-D structure, which is the goal of ab initio modeling. Suggested by Zvi Kam in 1980, the me…
▽ More
Single-particle reconstruction in cryo-electron microscopy (cryo-EM) is an increasingly popular technique for determining the 3-D structure of a molecule from several noisy 2-D projections images taken at unknown viewing angles. Most reconstruction algorithms require a low-resolution initialization for the 3-D structure, which is the goal of ab initio modeling. Suggested by Zvi Kam in 1980, the method of moments (MoM) offers one approach, wherein low-order statistics of the 2-D images are computed and a 3-D structure is estimated by solving a system of polynomial equations. Unfortunately, Kam's method suffers from restrictive assumptions, most notably that viewing angles should be distributed uniformly. Often unrealistic, uniformity entails the computation of higher-order correlations, as in this case first and second moments fail to determine the 3-D structure. In the present paper, we remove this hypothesis, by permitting an unknown, non-uniform distribution of viewing angles in MoM. Perhaps surprisingly, we show that this case is statistically easier than the uniform case, as now first and second moments generically suffice to determine low-resolution expansions of the molecule. In the idealized setting of a known, non-uniform distribution, we find an efficient provable algorithm inverting first and second moments. For unknown, non-uniform distributions, we use non-convex optimization methods to solve for both the molecule and distribution.
△ Less
Submitted 23 November, 2019; v1 submitted 11 July, 2019;
originally announced July 2019.
-
Multi-reference factor analysis: low-rank covariance estimation under unknown translations
Authors:
Boris Landa,
Yoel Shkolnisky
Abstract:
We consider the problem of estimating the covariance matrix of a random signal observed through unknown translations (modeled by cyclic shifts) and corrupted by noise. Solving this problem allows to discover low-rank structures masked by the existence of translations (which act as nuisance parameters), with direct application to Principal Components Analysis (PCA). We assume that the underlying si…
▽ More
We consider the problem of estimating the covariance matrix of a random signal observed through unknown translations (modeled by cyclic shifts) and corrupted by noise. Solving this problem allows to discover low-rank structures masked by the existence of translations (which act as nuisance parameters), with direct application to Principal Components Analysis (PCA). We assume that the underlying signal is of length $L$ and follows a standard factor model with mean zero and $r$ normally-distributed factors. To recover the covariance matrix in this case, we propose to employ the second- and fourth-order shift-invariant moments of the signal known as the $\textit{power spectrum}$ and the $\textit{trispectrum}$. We prove that they are sufficient for recovering the covariance matrix (under a certain technical condition) when $r<\sqrt{L}$. Correspondingly, we provide a polynomial-time procedure for estimating the covariance matrix from many (translated and noisy) observations, where no explicit knowledge of $r$ is required, and prove the procedure's statistical consistency. While our results establish that covariance estimation is possible from the power spectrum and the trispectrum for low-rank covariance matrices, we prove that this is not the case for full-rank covariance matrices. We conduct numerical experiments that corroborate our theoretical findings, and demonstrate the favorable performance of our algorithms in various settings, including in high levels of noise.
△ Less
Submitted 21 September, 2020; v1 submitted 1 June, 2019;
originally announced June 2019.
-
Rank-one Multi-Reference Factor Analysis
Authors:
Yariv Aizenbud,
Boris Landa,
Yoel Shkolnisky
Abstract:
In recent years, there is a growing need for processing methods aimed at extracting useful information from large datasets. In many cases the challenge is to discover a low-dimensional structure in the data, often concealed by the existence of nuisance parameters and noise. Motivated by such challenges, we consider the problem of estimating a signal from its scaled, cyclically-shifted and noisy ob…
▽ More
In recent years, there is a growing need for processing methods aimed at extracting useful information from large datasets. In many cases the challenge is to discover a low-dimensional structure in the data, often concealed by the existence of nuisance parameters and noise. Motivated by such challenges, we consider the problem of estimating a signal from its scaled, cyclically-shifted and noisy observations. We focus on the particularly challenging regime of low signal-to-noise ratio (SNR), where different observations cannot be shift-aligned. We show that an accurate estimation of the signal from its noisy observations is possible, and derive a procedure which is proved to consistently estimate the signal. The asymptotic sample complexity (the number of observations required to recover the signal) of the procedure is $1/\operatorname{SNR}^4$. Additionally, we propose a procedure which is experimentally shown to improve the sample complexity by a factor equal to the signal's length. Finally, we present numerical experiments which demonstrate the performance of our algorithms, and corroborate our theoretical findings.
△ Less
Submitted 4 June, 2019; v1 submitted 29 May, 2019;
originally announced May 2019.
-
The steerable graph Laplacian and its application to filtering image data-sets
Authors:
Boris Landa,
Yoel Shkolnisky
Abstract:
In recent years, improvements in various image acquisition techniques gave rise to the need for adaptive processing methods, aimed particularly for large datasets corrupted by noise and deformations. In this work, we consider datasets of images sampled from a low-dimensional manifold (i.e. an image-valued manifold), where the images can assume arbitrary planar rotations. To derive an adaptive and…
▽ More
In recent years, improvements in various image acquisition techniques gave rise to the need for adaptive processing methods, aimed particularly for large datasets corrupted by noise and deformations. In this work, we consider datasets of images sampled from a low-dimensional manifold (i.e. an image-valued manifold), where the images can assume arbitrary planar rotations. To derive an adaptive and rotation-invariant framework for processing such datasets, we introduce a graph Laplacian (GL)-like operator over the dataset, termed ${\textit{steerable graph Laplacian}}$. Essentially, the steerable GL extends the standard GL by accounting for all (infinitely-many) planar rotations of all images. As it turns out, similarly to the standard GL, a properly normalized steerable GL converges to the Laplace-Beltrami operator on the low-dimensional manifold. However, the steerable GL admits an improved convergence rate compared to the GL, where the improved convergence behaves as if the intrinsic dimension of the underlying manifold is lower by one. Moreover, it is shown that the steerable GL admits eigenfunctions of the form of Fourier modes (along the orbits of the images' rotations) multiplied by eigenvectors of certain matrices, which can be computed efficiently by the FFT. For image datasets corrupted by noise, we employ a subset of these eigenfunctions to "filter" the dataset via a Fourier-like filtering scheme, essentially using all images and their rotations simultaneously. We demonstrate our filtering framework by de-noising simulated single-particle cryo-EM image datasets.
△ Less
Submitted 7 August, 2018; v1 submitted 6 February, 2018;
originally announced February 2018.
-
Steerable Principal Components for Space-Frequency Localized Images
Authors:
Boris Landa,
Yoel Shkolnisky
Abstract:
This paper describes a fast and accurate method for obtaining steerable principal components from a large dataset of images, assuming the images are well localized in space and frequency. The obtained steerable principal components are optimal for expanding the images in the dataset and all of their rotations. The method relies upon first expanding the images using a series of two-dimensional Prol…
▽ More
This paper describes a fast and accurate method for obtaining steerable principal components from a large dataset of images, assuming the images are well localized in space and frequency. The obtained steerable principal components are optimal for expanding the images in the dataset and all of their rotations. The method relies upon first expanding the images using a series of two-dimensional Prolate Spheroidal Wave Functions (PSWFs), where the expansion coefficients are evaluated using a specially designed numerical integration scheme. Then, the expansion coefficients are used to construct a rotationally-invariant covariance matrix which admits a block-diagonal structure, and the eigen-decomposition of its blocks provides us with the desired steerable principal components. The proposed method is shown to be faster then existing methods, while providing appropriate error bounds which guarantee its accuracy.
△ Less
Submitted 9 August, 2018; v1 submitted 9 August, 2016;
originally announced August 2016.