Search | arXiv e-print repository

Is Model Collapse Inevitable? Breaking the Curse of Recursion by Accumulating Real and Synthetic Data

Authors: Matthias Gerstgrasser, Rylan Schaeffer, Apratim Dey, Rafael Rafailov, Henry Sleight, John Hughes, Tomasz Korbak, Rajashree Agrawal, Dhruv Pai, Andrey Gromov, Daniel A. Roberts, Diyi Yang, David L. Donoho, Sanmi Koyejo

Abstract: The proliferation of generative models, combined with pretraining on web-scale data, raises a timely question: what happens when these models are trained on their own generated outputs? Recent investigations into model-data feedback loops proposed that such loops would lead to a phenomenon termed model collapse, under which performance progressively degrades with each model-data feedback iteration… ▽ More The proliferation of generative models, combined with pretraining on web-scale data, raises a timely question: what happens when these models are trained on their own generated outputs? Recent investigations into model-data feedback loops proposed that such loops would lead to a phenomenon termed model collapse, under which performance progressively degrades with each model-data feedback iteration until fitted models become useless. However, those studies largely assumed that new data replace old data over time, where an arguably more realistic assumption is that data accumulate over time. In this paper, we ask: what effect does accumulating data have on model collapse? We empirically study this question by pretraining sequences of language models on text corpora. We confirm that replacing the original real data by each generation's synthetic data does indeed tend towards model collapse, then demonstrate that accumulating the successive generations of synthetic data alongside the original real data avoids model collapse; these results hold across a range of model sizes, architectures, and hyperparameters. We obtain similar results for deep generative models on other types of real data: diffusion models for molecule conformation generation and variational autoencoders for image generation. To understand why accumulating data can avoid model collapse, we use an analytically tractable framework introduced by prior work in which a sequence of linear models are fit to the previous models' outputs. Previous work used this framework to show that if data are replaced, the test error increases with the number of model-fitting iterations; we extend this argument to prove that if data instead accumulate, the test error has a finite upper bound independent of the number of iterations, meaning model collapse no longer occurs. △ Less

Submitted 29 April, 2024; v1 submitted 1 April, 2024; originally announced April 2024.

arXiv:2310.00865 [pdf, other]

Data Science at the Singularity

Authors: David Donoho

Abstract: A purported `AI Singularity' has been in the public eye recently. Mass media and US national political attention focused on `AI Doom' narratives hawked by social media influencers. The European Commission is announcing initiatives to forestall `AI Extinction'. In my opinion, `AI Singularity' is the wrong narrative for what's happening now; recent happenings signal something else entirely. Somethin… ▽ More A purported `AI Singularity' has been in the public eye recently. Mass media and US national political attention focused on `AI Doom' narratives hawked by social media influencers. The European Commission is announcing initiatives to forestall `AI Extinction'. In my opinion, `AI Singularity' is the wrong narrative for what's happening now; recent happenings signal something else entirely. Something fundamental to computation-based research really changed in the last ten years. In certain fields, progress is dramatically more rapid than previously, as the fields undergo a transition to frictionless reproducibility (FR). This transition markedly changes the rate of spread of ideas and practices, affects mindsets, and erases memories of much that came before. The emergence of frictionless reproducibility follows from the maturation of 3 data science principles in the last decade. Those principles involve data sharing, code sharing, and competitive challenges, however implemented in the particularly strong form of frictionless open services. Empirical Machine Learning (EML) is todays leading adherent field, and its consequent rapid changes are responsible for the AI progress we see. Still, other fields can and do benefit when they adhere to the same principles. Many rapid changes from this maturation are misidentified. The advent of FR in EML generates a steady flow of innovations; this flow stimulates outsider intuitions that there's an emergent superpower somewhere in AI. This opens the way for PR to push worrying narratives: not only `AI Extinction', but also the supposed monopoly of big tech on AI research. The helpful narrative observes that the superpower of EML is adherence to frictionless reproducibility practices; these practices are responsible for the striking progress in AI that we see everywhere. △ Less

Submitted 1 October, 2023; originally announced October 2023.

Comments: 1 Figure

arXiv:2308.01839 [pdf, other]

doi 10.1073/pnas.2313719121

Is your data alignable? Principled and interpretable alignability testing and integration of single-cell data

Authors: Rong Ma, Eric D. Sun, David Donoho, James Zou

Abstract: Single-cell data integration can provide a comprehensive molecular view of cells, and many algorithms have been developed to remove unwanted technical or biological variations and integrate heterogeneous single-cell datasets. Despite their wide usage, existing methods suffer from several fundamental limitations. In particular, we lack a rigorous statistical test for whether two high-dimensional si… ▽ More Single-cell data integration can provide a comprehensive molecular view of cells, and many algorithms have been developed to remove unwanted technical or biological variations and integrate heterogeneous single-cell datasets. Despite their wide usage, existing methods suffer from several fundamental limitations. In particular, we lack a rigorous statistical test for whether two high-dimensional single-cell datasets are alignable (and therefore should even be aligned). Moreover, popular methods can substantially distort the data during alignment, making the aligned data and downstream analysis difficult to interpret. To overcome these limitations, we present a spectral manifold alignment and inference (SMAI) framework, which enables principled and interpretable alignability testing and structure-preserving integration of single-cell data with the same type of features. SMAI provides a statistical test to robustly assess the alignability between datasets to avoid misleading inference, and is justified by high-dimensional statistical theory. On a diverse range of real and simulated benchmark datasets, it outperforms commonly used alignment methods. Moreover, we show that SMAI improves various downstream analyses such as identification of differentially expressed genes and imputation of single-cell spatial transcriptomics, providing further biological insights. SMAI's interpretability also enables quantification and a deeper understanding of the sources of technical confounders in single-cell data. △ Less

Submitted 29 February, 2024; v1 submitted 3 August, 2023; originally announced August 2023.

Journal ref: Proceedings of the National Academy of Sciences, 2024, 121(10) e2313719121

arXiv:2210.04488 [pdf, other]

Optimal Eigenvalue Shrinkage in the Semicircle Limit

Authors: David L. Donoho, Michael J. Feldman

Abstract: Modern datasets are trending towards ever higher dimension. In response, recent theoretical studies of covariance estimation often assume the proportional-growth asymptotic framework, where the sample size $n$ and dimension $p$ are comparable, with $n, p \rightarrow \infty $ and $γ_n = p/n \rightarrow γ> 0$. Yet, many datasets -- perhaps most -- have very different numbers of rows and columns. We… ▽ More Modern datasets are trending towards ever higher dimension. In response, recent theoretical studies of covariance estimation often assume the proportional-growth asymptotic framework, where the sample size $n$ and dimension $p$ are comparable, with $n, p \rightarrow \infty $ and $γ_n = p/n \rightarrow γ> 0$. Yet, many datasets -- perhaps most -- have very different numbers of rows and columns. We consider instead the disproportional-growth asymptotic framework, where $n, p \rightarrow \infty$ and $γ_n \rightarrow 0$ or $γ_n \rightarrow \infty$. Either disproportional limit induces novel behavior unseen within previous proportional and fixed-$p$ analyses. We study the spiked covariance model, with theoretical covariance a low-rank perturbation of the identity. For each of 15 different loss functions, we exhibit in closed form new optimal shrinkage and thresholding rules. Our optimal procedures demand extensive eigenvalue shrinkage and offer substantial performance benefits over the standard empirical covariance estimator. Practitioners may ask whether to view their data as arising within (and apply the procedures of) the proportional or disproportional frameworks. Conveniently, it is possible to remain {\it framework agnostic}: one unified set of closed-form shrinkage rules (depending only on the aspect ratio $γ_n$ of the given data) offers full asymptotic optimality under either framework. At the heart of the phenomena we explore is the spiked Wigner model, in which a low-rank matrix is perturbed by symmetric noise. Exploiting a connection to the spiked covariance model as $γ_n \rightarrow 0$, we derive optimal eigenvalue shrinkage rules for estimation of the low-rank component, of independent and fundamental interest. △ Less

Submitted 30 July, 2023; v1 submitted 10 October, 2022; originally announced October 2022.

arXiv:2106.07053 [pdf, other]

Convex Sparse Blind Deconvolution

Authors: Qingyun Sun, David Donoho

Abstract: In the blind deconvolution problem, we observe the convolution of an unknown filter and unknown signal and attempt to reconstruct the filter and signal. The problem seems impossible in general, since there are seemingly many more unknowns than knowns . Nevertheless, this problem arises in many application fields; and empirically, some of these fields have had success using heuristic methods -- eve… ▽ More In the blind deconvolution problem, we observe the convolution of an unknown filter and unknown signal and attempt to reconstruct the filter and signal. The problem seems impossible in general, since there are seemingly many more unknowns than knowns . Nevertheless, this problem arises in many application fields; and empirically, some of these fields have had success using heuristic methods -- even economically very important ones, in wireless communications and oil exploration. Today's fashionable heuristic formulations pose non-convex optimization problems which are then attacked heuristically as well. The fact that blind deconvolution can be solved under some repeatable and naturally-occurring circumstances poses a theoretical puzzle. To bridge the gulf between reported successes and theory's limited understanding, we exhibit a convex optimization problem that -- assuming signal sparsity -- can convert a crude approximation to the true filter into a high-accuracy recovery of the true filter. Our proposed formulation is based on L1 minimization of inverse filter outputs. We give sharp guarantees on performance of the minimizer assuming sparsity of signal, showing that our proposal precisely recovers the true inverse filter, up to shift and rescaling. There is a sparsity/initial accuracy tradeoff: the less accurate the initial approximation, the greater we rely on sparsity to enable exact recovery. To our knowledge this is the first reported tradeoff of this kind. We consider it surprising that this tradeoff is independent of dimension. We also develop finite-$N$ guarantees, for highly accurate reconstruction under $N\geq O(k \log(k) )$ with high probability. We further show stable approximation when the true inverse filter is infinitely long and extend our guarantees to the case where the observations are contaminated by stochastic or adversarial noise. △ Less

Submitted 13 June, 2021; originally announced June 2021.

arXiv:2106.02073 [pdf, other]

Neural Collapse Under MSE Loss: Proximity to and Dynamics on the Central Path

Authors: X. Y. Han, Vardan Papyan, David L. Donoho

Abstract: The recently discovered Neural Collapse (NC) phenomenon occurs pervasively in today's deep net training paradigm of driving cross-entropy (CE) loss towards zero. During NC, last-layer features collapse to their class-means, both classifiers and class-means collapse to the same Simplex Equiangular Tight Frame, and classifier behavior collapses to the nearest-class-mean decision rule. Recent works d… ▽ More The recently discovered Neural Collapse (NC) phenomenon occurs pervasively in today's deep net training paradigm of driving cross-entropy (CE) loss towards zero. During NC, last-layer features collapse to their class-means, both classifiers and class-means collapse to the same Simplex Equiangular Tight Frame, and classifier behavior collapses to the nearest-class-mean decision rule. Recent works demonstrated that deep nets trained with mean squared error (MSE) loss perform comparably to those trained with CE. As a preliminary, we empirically establish that NC emerges in such MSE-trained deep nets as well through experiments on three canonical networks and five benchmark datasets. We provide, in a Google Colab notebook, PyTorch code for reproducing MSE-NC and CE-NC: at https://colab.research.google.com/github/neuralcollapse/neuralcollapse/blob/main/neuralcollapse.ipynb. The analytically-tractable MSE loss offers more mathematical opportunities than the hard-to-analyze CE loss, inspiring us to leverage MSE loss towards the theoretical investigation of NC. We develop three main contributions: (I) We show a new decomposition of the MSE loss into (A) terms directly interpretable through the lens of NC and which assume the last-layer classifier is exactly the least-squares classifier; and (B) a term capturing the deviation from this least-squares classifier. (II) We exhibit experiments on canonical datasets and networks demonstrating that term-(B) is negligible during training. This motivates us to introduce a new theoretical construct: the central path, where the linear classifier stays MSE-optimal for feature activations throughout the dynamics. (III) By studying renormalized gradient flow along the central path, we derive exact dynamics that predict NC. △ Less

Submitted 9 May, 2022; v1 submitted 3 June, 2021; originally announced June 2021.

Comments: ICLR 2022 Outstanding Paper Prize & Oral. Appendix contains [A] empirical experiments, [B-D] proofs of theoretical results, and [E] survey of related works examining Neural Collapse

arXiv:2103.03218 [pdf, ps, other]

The Impossibility Region for Detecting Sparse Mixtures using the Higher Criticism

Authors: David L. Donoho, Alon Kipnis

Abstract: Consider a multiple hypothesis testing setting involving rare/weak effects: relatively few tests, out of possibly many, deviate from their null hypothesis behavior. Summarizing the significance of each test by a P-value, we construct a global test against the null using the Higher Criticism (HC) statistics of these P-values. We calibrate the rare/weak model using parameters controlling the asympto… ▽ More Consider a multiple hypothesis testing setting involving rare/weak effects: relatively few tests, out of possibly many, deviate from their null hypothesis behavior. Summarizing the significance of each test by a P-value, we construct a global test against the null using the Higher Criticism (HC) statistics of these P-values. We calibrate the rare/weak model using parameters controlling the asymptotic distribution of non-null P-values near zero. We derive a region in the parameter space where the HC test is asymptotically powerless. Our derivation involves very different tools than previously used to show the powerlessness of HC, relying on properties of the empirical processes underlying HC. In particular, our result applies to situations where HC is not asymptotically optimal, or when the asymptotically detectable region of the parameter space is unknown. △ Less

Submitted 19 October, 2021; v1 submitted 15 February, 2021; originally announced March 2021.

MSC Class: 2010; Primary: 62H17; 62H15

arXiv:2101.03517 [pdf, other]

Three-dimensional simulations of X-ray cavities inflated by radio galaxies

Authors: Michael D. Smith andJustin Donohoe

Abstract: Vast cavities in the intergalactic medium are excavated by radio galaxies. The cavities appear as such in X-ray images because the external medium has been swept up, leaving a hot but low density bubble surrounding the radio lobes. We explore here the predicted thermal X-ray emission from a large set of high-resolution three dimensional simulations of radio galaxies driven by supersonic jets. We a… ▽ More Vast cavities in the intergalactic medium are excavated by radio galaxies. The cavities appear as such in X-ray images because the external medium has been swept up, leaving a hot but low density bubble surrounding the radio lobes. We explore here the predicted thermal X-ray emission from a large set of high-resolution three dimensional simulations of radio galaxies driven by supersonic jets. We assume adiabatic non-relativistic hydrodynamics with injected straight and precessing jets of supersonic gas emitted from nozzles. Images of X-ray Bremsstrahlung emission tend to generate oval cavities in the soft keV bands and leading arcuate structures in hard X-rays. However, the cavity shape is sensitive to the jet-ambient density contrast, varying from concave-shaped at $η= 0.1$ to convex for $η= 0.0001$ where $η$ is the jet/ambient density ratio. We find lateral ribs in the soft X-rays in certain cases and propose this as an explanation for those detected in the vicinity of Cygnus\,A. In bi-lobed or X-shaped sources and in curved or deflected jets, the strongest X-ray emission is not associated with the hotspot but with the relic lobe or deflection location. This is because the hot high-pressure and dense high-compression regions do not coincide. Directed toward the observer, the cavity becomes a deep round hole surrounded by circular ripples. With short radio-mode outbursts with a duty cycle of 10\% , the intracluster medium simmers with low Mach number shocks widely dissipating the jet energy in between active jet episodes. △ Less

Submitted 10 January, 2021; originally announced January 2021.

Comments: 14 pages, 15 figures, 2 table; accepted for publication to Monthly Notices of the Total Astronomical Society

arXiv:2009.12297 [pdf, other]

doi 10.1214/22-AOS2232

ScreeNOT: Exact MSE-Optimal Singular Value Thresholding in Correlated Noise

Authors: David L. Donoho, Matan Gavish, Elad Romanov

Abstract: We derive a formula for optimal hard thresholding of the singular value decomposition in the presence of correlated additive noise; although it nominally involves unobservables, we show how to apply it even where the noise covariance structure is not a-priori known or is not independently estimable. The proposed method, which we call ScreeNOT, is a mathematically solid alternative to Cattell's e… ▽ More We derive a formula for optimal hard thresholding of the singular value decomposition in the presence of correlated additive noise; although it nominally involves unobservables, we show how to apply it even where the noise covariance structure is not a-priori known or is not independently estimable. The proposed method, which we call ScreeNOT, is a mathematically solid alternative to Cattell's ever-popular but vague Scree Plot heuristic from 1966. ScreeNOT has a surprising oracle property: it typically achieves exactly, in large finite samples, the lowest possible MSE for matrix recovery, on each given problem instance - i.e. the specific threshold it selects gives exactly the smallest achievable MSE loss among all possible threshold choices for that noisy dataset and that unknown underlying true low rank model. The method is computationally efficient and robust against perturbations of the underlying covariance structure. Our results depend on the assumption that the singular values of the noise have a limiting empirical distribution of compact support; this model, which is standard in random matrix theory, is satisfied by many models exhibiting either cross-row correlation structure or cross-column correlation structure, and also by many situations where there is inter-element correlation structure. Simulations demonstrate the effectiveness of the method even at moderate matrix sizes. The paper is supplemented by ready-to-use software packages implementing the proposed algorithm: package ScreeNOT in Python (via PyPI) and R (via CRAN). △ Less

Submitted 26 March, 2023; v1 submitted 25 September, 2020; originally announced September 2020.

Journal ref: Annals of Statistics, 2023

arXiv:2008.08186 [pdf, other]

doi 10.1073/pnas.2015509117

Prevalence of Neural Collapse during the terminal phase of deep learning training

Authors: Vardan Papyan, X. Y. Han, David L. Donoho

Abstract: Modern practice for training classification deepnets involves a Terminal Phase of Training (TPT), which begins at the epoch where training error first vanishes; During TPT, the training error stays effectively zero while training loss is pushed towards zero. Direct measurements of TPT, for three prototypical deepnet architectures and across seven canonical classification datasets, expose a pervasi… ▽ More Modern practice for training classification deepnets involves a Terminal Phase of Training (TPT), which begins at the epoch where training error first vanishes; During TPT, the training error stays effectively zero while training loss is pushed towards zero. Direct measurements of TPT, for three prototypical deepnet architectures and across seven canonical classification datasets, expose a pervasive inductive bias we call Neural Collapse, involving four deeply interconnected phenomena: (NC1) Cross-example within-class variability of last-layer training activations collapses to zero, as the individual activations themselves collapse to their class-means; (NC2) The class-means collapse to the vertices of a Simplex Equiangular Tight Frame (ETF); (NC3) Up to rescaling, the last-layer classifiers collapse to the class-means, or in other words to the Simplex ETF, i.e. to a self-dual configuration; (NC4) For a given activation, the classifier's decision collapses to simply choosing whichever class has the closest train class-mean, i.e. the Nearest Class Center (NCC) decision rule. The symmetric and very simple geometry induced by the TPT confers important benefits, including better generalization performance, better robustness, and better interpretability. △ Less

Submitted 21 August, 2020; v1 submitted 18 August, 2020; originally announced August 2020.

arXiv:2007.01958 [pdf, other]

doi 10.1214/21-AOS2158

Higher Criticism to Compare Two Large Frequency Tables, with sensitivity to Possible Rare and Weak Differences

Authors: David L. Donoho, Alon Kipnis

Abstract: We adapt Higher Criticism (HC) to the comparison of two frequency tables which may -- or may not -- exhibit moderate differences between the tables in some unknown, relatively small subset out of a large number of categories. Our analysis of the power of the proposed HC test quantifies the rarity and size of assumed differences and applies moderate deviations-analysis to determine the asymptotic p… ▽ More We adapt Higher Criticism (HC) to the comparison of two frequency tables which may -- or may not -- exhibit moderate differences between the tables in some unknown, relatively small subset out of a large number of categories. Our analysis of the power of the proposed HC test quantifies the rarity and size of assumed differences and applies moderate deviations-analysis to determine the asymptotic powerfulness/powerlessness of our proposed HC procedure. Our analysis considers the null hypothesis of no difference in underlying generative model against a rare/weak perturbation alternative, in which the frequencies of $N^{1-β}$ out of the $N$ categories are perturbed by $r(\log N)/2n$ in the Hellinger distance; here $n$ is the size of each sample. Our proposed Higher Criticism (HC) test for this setting uses P-values obtained from $N$ exact binomial tests. We characterize the asymptotic performance of the HC-based test in terms of the rarity parameter $β$ and the perturbation intensity parameter $r$. Specifically, we derive a region in the $(β,r)$-plane where the test asymptotically has maximal power, while having asymptotically no power outside this region. Our analysis distinguishes between cases in which the counts in both tables are low, versus cases in which counts are high, corresponding to the cases of sparse and dense frequency tables. The phase transition curve of HC in the high-counts regime matches formally the curve delivered by HC in a two-sample normal means model. △ Less

Submitted 21 June, 2022; v1 submitted 3 July, 2020; originally announced July 2020.

MSC Class: 62H17; 62H15; 62G10

Journal ref: Annals of Statistics 2022, Vol. 50, No. 3, 1447-1472

arXiv:1906.03742 [pdf, other]

Degrees of Freedom Analysis of Unrolled Neural Networks

Authors: Morteza Mardani, Qingyun Sun, Vardan Papyan, Shreyas Vasanawala, John Pauly, David Donoho

Abstract: Unrolled neural networks emerged recently as an effective model for learning inverse maps appearing in image restoration tasks. However, their generalization risk (i.e., test mean-squared-error) and its link to network design and train sample size remains mysterious. Leveraging the Stein's Unbiased Risk Estimator (SURE), this paper analyzes the generalization risk with its bias and variance compon… ▽ More Unrolled neural networks emerged recently as an effective model for learning inverse maps appearing in image restoration tasks. However, their generalization risk (i.e., test mean-squared-error) and its link to network design and train sample size remains mysterious. Leveraging the Stein's Unbiased Risk Estimator (SURE), this paper analyzes the generalization risk with its bias and variance components for recurrent unrolled networks. We particularly investigate the degrees-of-freedom (DOF) component of SURE, trace of the end-to-end network Jacobian, to quantify the prediction variance. We prove that DOF is well-approximated by the weighted \textit{path sparsity} of the network under incoherence conditions on the trained weights. Empirically, we examine the SURE components as a function of train sample size for both recurrent and non-recurrent (with many more parameters) unrolled networks. Our key observations indicate that: 1) DOF increases with train sample size and converges to the generalization risk for both recurrent and non-recurrent schemes; 2) recurrent network converges significantly faster (with less train samples) compared with non-recurrent scheme, hence recurrence serves as a regularization for low sample size regimes. △ Less

Submitted 9 June, 2019; originally announced June 2019.

arXiv:1901.08705 [pdf, other]

Ambitious Data Science Can Be Painless

Authors: Hatef Monajemi, Riccardo Murri, Eric Jonas, Percy Liang, Victoria Stodden, David L. Donoho

Abstract: Modern data science research can involve massive computational experimentation; an ambitious PhD in computational fields may do experiments consuming several million CPU hours. Traditional computing practices, in which researchers use laptops or shared campus-resident resources, are inadequate for experiments at the massive scale and varied scope that we now see in data science. On the other hand,… ▽ More Modern data science research can involve massive computational experimentation; an ambitious PhD in computational fields may do experiments consuming several million CPU hours. Traditional computing practices, in which researchers use laptops or shared campus-resident resources, are inadequate for experiments at the massive scale and varied scope that we now see in data science. On the other hand, modern cloud computing promises seemingly unlimited computational resources that can be custom configured, and seems to offer a powerful new venue for ambitious data-driven science. Exploiting the cloud fully, the amount of work that could be completed in a fixed amount of time can expand by several orders of magnitude. As potentially powerful as cloud-based experimentation may be in the abstract, it has not yet become a standard option for researchers in many academic disciplines. The prospect of actually conducting massive computational experiments in today's cloud systems confronts the potential user with daunting challenges. Leading considerations include: (i) the seeming complexity of today's cloud computing interface, (ii) the difficulty of executing an overwhelmingly large number of jobs, and (iii) the difficulty of monitoring and combining a massive collection of separate results. Starting a massive experiment `bare-handed' seems therefore highly problematic and prone to rapid `researcher burn out'. New software stacks are emerging that render massive cloud experiments relatively painless. Such stacks simplify experimentation by systematizing experiment definition, automating distribution and management of tasks, and allowing easy harvesting of results and documentation. In this article, we discuss several painless computing stacks that abstract away the difficulties of massive experimentation, thereby allowing a proliferation of ambitious experiments for scientific discovery. △ Less

Submitted 24 January, 2019; originally announced January 2019.

Comments: Submitted to Harvard Data Science Review

arXiv:1810.07403 [pdf, ps, other]

Optimal Covariance Estimation for Condition Number Loss in the Spiked Model

Authors: David L. Donoho, Behrooz Ghorbani

Abstract: We study estimation of the covariance matrix under relative condition number loss $κ(Σ^{-1/2} \hatΣ Σ^{-1/2})$, where $κ(Δ)$ is the condition number of matrix $Δ$, and $\hatΣ$ and $Σ$ are the estimated and theoretical covariance matrices. Optimality in $κ$-loss provides optimal guarantees in two stylized applications: Multi-User Covariance Estimation and Multi-Task Linear Discriminant Analysis. We… ▽ More We study estimation of the covariance matrix under relative condition number loss $κ(Σ^{-1/2} \hatΣ Σ^{-1/2})$, where $κ(Δ)$ is the condition number of matrix $Δ$, and $\hatΣ$ and $Σ$ are the estimated and theoretical covariance matrices. Optimality in $κ$-loss provides optimal guarantees in two stylized applications: Multi-User Covariance Estimation and Multi-Task Linear Discriminant Analysis. We assume the so-called spiked covariance model for $Σ$, and exploit recent advances in understanding that model, to derive a nonlinear shrinker which is asymptotically optimal among orthogonally-equivariant procedures. In our asymptotic study, the number of variables $p$ is comparable to the number of observations $n$. The form of the optimal nonlinearity depends on the aspect ratio $γ=p/n$ of the data matrix and on the top eigenvalue of $Σ$. For $γ> 0.618...$, even dependence on the top eigenvalue can be avoided. The optimal shrinker has two notable properties. First, when $p/n \rightarrow γ\gg 1$ is large, it shrinks even very large eigenvalues substantially, by a factor $1/(1+γ)$. Second, even for moderate $γ$, certain highly statistically significant eigencomponents will be completely suppressed. We show that when $γ\gg 1$ is large, purely diagonal covariance matrices can be optimal, despite the top eigenvalues being large and the empirical eigenvalues being highly statistically significant. This aligns with practitioner experience. We identify intuitively reasonable procedures with small worst-case relative regret - the simplest being generalized soft thresholding having threshold at the bulk edge and slope $(1+γ)^{-1}$ above the bulk. For $γ< 2$ it has at most a few percent relative regret. △ Less

Submitted 17 October, 2018; originally announced October 2018.

Comments: 85 pages, 4 figures

arXiv:1806.03963 [pdf, other]

Neural Proximal Gradient Descent for Compressive Imaging

Authors: Morteza Mardani, Qingyun Sun, Shreyas Vasawanala, Vardan Papyan, Hatef Monajemi, John Pauly, David Donoho

Abstract: Recovering high-resolution images from limited sensory data typically leads to a serious ill-posed inverse problem, demanding inversion algorithms that effectively capture the prior information. Learning a good inverse mapping from training data faces severe challenges, including: (i) scarcity of training data; (ii) need for plausible reconstructions that are physically feasible; (iii) need for fa… ▽ More Recovering high-resolution images from limited sensory data typically leads to a serious ill-posed inverse problem, demanding inversion algorithms that effectively capture the prior information. Learning a good inverse mapping from training data faces severe challenges, including: (i) scarcity of training data; (ii) need for plausible reconstructions that are physically feasible; (iii) need for fast reconstruction, especially in real-time applications. We develop a successful system solving all these challenges, using as basic architecture the recurrent application of proximal gradient algorithm. We learn a proximal map that works well with real images based on residual networks. Contraction of the resulting map is analyzed, and incoherence conditions are investigated that drive the convergence of the iterates. Extensive experiments are carried out under different settings: (a) reconstructing abdominal MRI of pediatric patients from highly undersampled Fourier-space data and (b) superresolving natural face images. Our key findings include: 1. a recurrent ResNet with a single residual block unrolled from an iterative algorithm yields an effective proximal which accurately reveals MR image details. 2. Our architecture significantly outperforms conventional non-recurrent deep ResNets by 2dB SNR; it is also trained much more rapidly. 3. It outperforms state-of-the-art compressed-sensing Wavelet-based methods by 4dB SNR, with 100x speedups in reconstruction time. △ Less

Submitted 1 June, 2018; originally announced June 2018.

Comments: arXiv admin note: text overlap with arXiv:1711.10046

arXiv:1711.10046 [pdf, other]

Recurrent Generative Adversarial Networks for Proximal Learning and Automated Compressive Image Recovery

Authors: Morteza Mardani, Hatef Monajemi, Vardan Papyan, Shreyas Vasanawala, David Donoho, John Pauly

Abstract: Recovering images from undersampled linear measurements typically leads to an ill-posed linear inverse problem, that asks for proper statistical priors. Building effective priors is however challenged by the low train and test overhead dictated by real-time tasks; and the need for retrieving visually "plausible" and physically "feasible" images with minimal hallucination. To cope with these challe… ▽ More Recovering images from undersampled linear measurements typically leads to an ill-posed linear inverse problem, that asks for proper statistical priors. Building effective priors is however challenged by the low train and test overhead dictated by real-time tasks; and the need for retrieving visually "plausible" and physically "feasible" images with minimal hallucination. To cope with these challenges, we design a cascaded network architecture that unrolls the proximal gradient iterations by permeating benefits from generative residual networks (ResNet) to modeling the proximal operator. A mixture of pixel-wise and perceptual costs is then deployed to train proximals. The overall architecture resembles back-and-forth projection onto the intersection of feasible and plausible images. Extensive computational experiments are examined for a global task of reconstructing MR images of pediatric patients, and a more local task of superresolving CelebA faces, that are insightful to design efficient architectures. Our observations indicate that for MRI reconstruction, a recurrent ResNet with a single residual block effectively learns the proximal. This simple architecture appears to significantly outperform the alternative deep ResNet architecture by 2dB SNR, and the conventional compressed-sensing MRI by 4dB SNR with 100x faster inference. For image superresolution, our preliminary results indicate that modeling the denoising proximal demands deep ResNets. △ Less

Submitted 27 November, 2017; originally announced November 2017.

Comments: 11 pages, 11 figures

arXiv:1702.03062 [pdf, other]

Sparsity/Undersampling Tradeoffs in Anisotropic Undersampling, with Applications in MR Imaging/Spectroscopy

Authors: Hatef Monajemi, David L. Donoho

Abstract: We study anisotropic undersampling schemes like those used in multi-dimensional NMR spectroscopy and MR imaging, which sample exhaustively in certain time dimensions and randomly in others. Our analysis shows that anisotropic undersampling schemes are equivalent to certain block-diagonal measurement systems. We develop novel exact formulas for the sparsity/undersampling tradeoffs in such measure… ▽ More We study anisotropic undersampling schemes like those used in multi-dimensional NMR spectroscopy and MR imaging, which sample exhaustively in certain time dimensions and randomly in others. Our analysis shows that anisotropic undersampling schemes are equivalent to certain block-diagonal measurement systems. We develop novel exact formulas for the sparsity/undersampling tradeoffs in such measurement systems. Our formulas predict finite-N phase transition behavior differing substantially from the well known asymptotic phase transitions for classical Gaussian undersampling. Extensive empirical work shows that our formulas accurately describe observed finite-N behavior, while the usual formulas based on universality are substantially inaccurate. We also vary the anisotropy, keeping the total number of samples fixed, and for each variation we determine the precise sparsity/undersampling tradeoff (phase transition). We show that, other things being equal, the ability to recover a sparse object decreases with an increasing number of exhaustively-sampled dimensions. △ Less

Submitted 16 March, 2018; v1 submitted 9 February, 2017; originally announced February 2017.

arXiv:1702.01830 [pdf, other]

Incoherence of Partial-Component Sampling in multidimensional NMR

Authors: Hatef Monajemi, David L. Donoho, Jeffrey C. Hoch, Adam D. Schuyler

Abstract: In NMR spectroscopy, undersampling in the indirect dimensions causes reconstruction artifacts whose size can be bounded using the so-called {\it coherence}. In experiments with multiple indirect dimensions, new undersampling approaches were recently proposed: random phase detection (RPD) \cite{Maciejewski11} and its generalization, partial component sampling (PCS) \cite{Schuyler13}. The new approa… ▽ More In NMR spectroscopy, undersampling in the indirect dimensions causes reconstruction artifacts whose size can be bounded using the so-called {\it coherence}. In experiments with multiple indirect dimensions, new undersampling approaches were recently proposed: random phase detection (RPD) \cite{Maciejewski11} and its generalization, partial component sampling (PCS) \cite{Schuyler13}. The new approaches are fully aware of the fact that high-dimensional experiments generate hypercomplex-valued free induction decays; they randomly acquire only certain low-dimensional components of each high-dimensional hypercomplex entry. We provide a classification of various hypercomplex-aware undersampling schemes, and define a hypercomplex-aware coherence appropriate for such undersampling schemes; we then use it to quantify undersampling artifacts of RPD and various PCS schemes. △ Less

Submitted 6 February, 2017; originally announced February 2017.

arXiv:1606.00925 [pdf, other]

Convolutional Imputation of Matrix Networks

Authors: Qingyun Sun, Mengyuan Yan David Donoho, Stephen Boyd

Abstract: A matrix network is a family of matrices, with relatedness modeled by a weighted graph. We consider the task of completing a partially observed matrix network. We assume a novel sampling scheme where a fraction of matrices might be completely unobserved. How can we recover the entire matrix network from incomplete observations? This mathematical problem arises in many applications including medica… ▽ More A matrix network is a family of matrices, with relatedness modeled by a weighted graph. We consider the task of completing a partially observed matrix network. We assume a novel sampling scheme where a fraction of matrices might be completely unobserved. How can we recover the entire matrix network from incomplete observations? This mathematical problem arises in many applications including medical imaging and social networks. To recover the matrix network, we propose a structural assumption that the matrices have a graph Fourier transform which is low-rank. We formulate a convex optimization problem and prove an exact recovery guarantee for the optimization problem. Furthermore, we numerically characterize the exact recovery regime for varying rank and sampling rate and discover a new phase transition phenomenon. Then we give an iterative imputation algorithm to efficiently solve the optimization problem and complete large scale matrix networks. We demonstrate the algorithm with a variety of applications such as MRI and Facebook user network. △ Less

Submitted 7 June, 2018; v1 submitted 2 June, 2016; originally announced June 2016.

Comments: Accepted by ICML 2018

arXiv:1503.02106 [pdf, other]

Variance Breakdown of Huber (M)-estimators: $n/p \rightarrow m \in (1,\infty)$

Authors: David L. Donoho, Andrea Montanari

Abstract: A half century ago, Huber evaluated the minimax asymptotic variance in scalar location estimation, $ \min_ψ\max_{F \in {\cal F}_ε} V(ψ, F) = \frac{1}{I(F_ε^*)} $, where $V(ψ,F)$ denotes the asymptotic variance of the $(M)$-estimator for location with score function $ψ$, and $I(F_ε^*)$ is the minimal Fisher information $ \min_{{\cal F}_ε} I(F)$ over the class of $ε$-Contaminated Normal distribution… ▽ More A half century ago, Huber evaluated the minimax asymptotic variance in scalar location estimation, $ \min_ψ\max_{F \in {\cal F}_ε} V(ψ, F) = \frac{1}{I(F_ε^*)} $, where $V(ψ,F)$ denotes the asymptotic variance of the $(M)$-estimator for location with score function $ψ$, and $I(F_ε^*)$ is the minimal Fisher information $ \min_{{\cal F}_ε} I(F)$ over the class of $ε$-Contaminated Normal distributions. We consider the linear regression model $Y = Xθ_0 + W$, $W_i\sim_{\text{i.i.d.}}F$, and iid Normal predictors $X_{i,j}$, working in the high-dimensional-limit asymptotic where the number $n$ of observations and $p$ of variables both grow large, while $n/p \rightarrow m \in (1,\infty)$; hence $m$ plays the role of `asymptotic number of observations per parameter estimated'. Let $V_m(ψ,F)$ denote the per-coordinate asymptotic variance of the $(M)$-estimator of regression in the $n/p \rightarrow m$ regime. Then $V_m \neq V$; however $V_m \rightarrow V$ as $m \rightarrow \infty$. In this paper we evaluate the minimax asymptotic variance of the Huber $(M)$-estimate. The statistician minimizes over the family $(ψ_λ)_{λ> 0}$ of all tunings of Huber $(M)$-estimates of regression, and Nature maximizes over gross-error contaminations $F \in {\cal F}_ε$. Suppose that $I(F_ε^*) \cdot m > 1$. Then $ \min_λ\max_{F \in {\cal F}_ε} V_m(ψ_λ, F) = \frac{1}{I(F_ε^*) - 1/m} $. Strikingly, if $I(F_ε^*) \cdot m \leq 1$, then the minimax asymptotic variance is $+\infty$. The breakdown point is where the Fisher information per parameter equals unity. △ Less

Submitted 6 March, 2015; originally announced March 2015.

Comments: Based on a lecture delivered at a special colloquium honoring the 50th anniversary of the Seminar für Statistik (SfS) at ETH Zürich, November 25, 2014

MSC Class: 62C20; 62J05; 62G35

arXiv:1410.4743 [pdf, ps, other]

doi 10.1214/14-STS506

Higher Criticism for Large-Scale Inference, Especially for Rare and Weak Effects

Authors: David Donoho, Jiashun Jin

Abstract: In modern high-throughput data analysis, researchers perform a large number of statistical tests, expecting to find perhaps a small fraction of significant effects against a predominantly null background. Higher Criticism (HC) was introduced to determine whether there are any nonzero effects; more recently, it was applied to feature selection, where it provides a method for selecting useful predic… ▽ More In modern high-throughput data analysis, researchers perform a large number of statistical tests, expecting to find perhaps a small fraction of significant effects against a predominantly null background. Higher Criticism (HC) was introduced to determine whether there are any nonzero effects; more recently, it was applied to feature selection, where it provides a method for selecting useful predictive features from a large body of potentially useful features, among which only a rare few will prove truly useful. In this article, we review the basics of HC in both the testing and feature selection settings. HC is a flexible idea, which adapts easily to new situations; we point out simple adaptions to clique detection and bivariate outlier detection. HC, although still early in its development, is seeing increasing interest from practitioners; we illustrate this with worked examples. HC is computationally effective, which gives it a nice leverage in the increasingly more relevant "Big Data" settings we see today. We also review the underlying theoretical "ideology" behind HC. The Rare/Weak (RW) model is a theoretical framework simultaneously controlling the size and prevalence of useful/significant items among the useless/null bulk. The RW model shows that HC has important advantages over better known procedures such as False Discovery Rate (FDR) control and Family-wise Error control (FwER), in particular, certain optimality properties. We discuss the rare/weak phase diagram, a way to visualize clearly the class of RW settings where the true signals are so rare or so weak that detection and feature selection are simply impossible, and a way to understand the known optimality properties of HC. △ Less

Submitted 10 April, 2015; v1 submitted 17 October, 2014; originally announced October 2014.

Comments: Published at http://dx.doi.org/10.1214/14-STS506 in the Statistical Science (http://www.imstat.org/sts/) by the Institute of Mathematical Statistics (http://www.imstat.org)

Report number: IMS-STS-STS506

Journal ref: Statistical Science 2015, Vol. 30, No. 1, 1-25

arXiv:1405.7511 [pdf, ps, other]

Optimal Shrinkage of Singular Values

Authors: Matan Gavish, David L. Donoho

Abstract: We consider recovery of low-rank matrices from noisy data by shrinkage of singular values, in which a single, univariate nonlinearity is applied to each of the empirical singular values. We adopt an asymptotic framework, in which the matrix size is much larger than the rank of the signal matrix to be recovered, and the signal-to-noise ratio of the low-rank piece stays constant. For a variety of lo… ▽ More We consider recovery of low-rank matrices from noisy data by shrinkage of singular values, in which a single, univariate nonlinearity is applied to each of the empirical singular values. We adopt an asymptotic framework, in which the matrix size is much larger than the rank of the signal matrix to be recovered, and the signal-to-noise ratio of the low-rank piece stays constant. For a variety of loss functions, including Mean Square Error (MSE - square Frobenius norm), the nuclear norm loss and the operator norm loss, we show that in this framework there is a well-defined asymptotic loss that we evaluate precisely in each case. In fact, each of the loss functions we study admits a unique admissible shrinkage nonlinearity dominating all other nonlinearities. We provide a general method for evaluating these optimal nonlinearities, and demonstrate our framework by working out simple, explicit formulas for the optimal nonlinearities in the Frobenius, nuclear and operator norm cases. For example, for a square low-rank n-by-n matrix observed in white noise with level $σ$, the optimal nonlinearity for MSE loss simply shrinks each data singular value $y$ to $\sqrt{y^2-4nσ^2 }$ (or to 0 if $y<2\sqrt{n}σ$). This optimal nonlinearity guarantees an asymptotic MSE of $2nrσ^2$, which compares favorably with optimally tuned hard thresholding and optimally tuned soft thresholding, providing guarantees of $3nrσ^2$ and $6nrσ^2$, respectively. Our general method also allows one to evaluate optimal shrinkers numerically to arbitrary precision. As an example, we compute optimal shrinkers for the Schatten-p norm loss, for any p>0. △ Less

Submitted 15 May, 2016; v1 submitted 29 May, 2014; originally announced May 2014.

arXiv:1311.0851 [pdf, ps, other]

Optimal Shrinkage of Eigenvalues in the Spiked Covariance Model

Authors: David L. Donoho, Matan Gavish, Iain M. Johnstone

Abstract: We show that in a common high-dimensional covariance model, the choice of loss function has a profound effect on optimal estimation. In an asymptotic framework based on the Spiked Covariance model and use of orthogonally invariant estimators, we show that optimal estimation of the population covariance matrix boils down to design of an optimal shrinker $η$ that acts elementwise on the sample eigen… ▽ More We show that in a common high-dimensional covariance model, the choice of loss function has a profound effect on optimal estimation. In an asymptotic framework based on the Spiked Covariance model and use of orthogonally invariant estimators, we show that optimal estimation of the population covariance matrix boils down to design of an optimal shrinker $η$ that acts elementwise on the sample eigenvalues. Indeed, to each loss function there corresponds a unique admissible eigenvalue shrinker $η^*$ dominating all other shrinkers. The shape of the optimal shrinker is determined by the choice of loss function and, crucially, by inconsistency of both eigenvalues and eigenvectors of the sample covariance matrix. Details of these phenomena and closed form formulas for the optimal eigenvalue shrinkers are worked out for a menagerie of 26 loss functions for covariance estimation found in the literature, including the Stein, Entropy, Divergence, Frechet, Bhattacharya/Matusita, Frobenius Norm, Operator Norm, Nuclear Norm and Condition Number losses. △ Less

Submitted 4 June, 2017; v1 submitted 4 November, 2013; originally announced November 2013.

arXiv:1310.7320 [pdf, other]

High Dimensional Robust M-Estimation: Asymptotic Variance via Approximate Message Passing

Authors: David Donoho, Andrea Montanari

Abstract: In a recent article (Proc. Natl. Acad. Sci., 110(36), 14557-14562), El Karoui et al. study the distribution of robust regression estimators in the regime in which the number of parameters p is of the same order as the number of samples n. Using numerical simulations and `highly plausible' heuristic arguments, they unveil a striking new phenomenon. Namely, the regression coefficients contain an ext… ▽ More In a recent article (Proc. Natl. Acad. Sci., 110(36), 14557-14562), El Karoui et al. study the distribution of robust regression estimators in the regime in which the number of parameters p is of the same order as the number of samples n. Using numerical simulations and `highly plausible' heuristic arguments, they unveil a striking new phenomenon. Namely, the regression coefficients contain an extra Gaussian noise component that is not explained by classical concepts such as the Fisher information matrix. We show here that that this phenomenon can be characterized rigorously techniques that were developed by the authors to analyze the Lasso estimator under high-dimensional asymptotics. We introduce an approximate message passing (AMP) algorithm to compute M-estimators and deploy state evolution to evaluate the operating characteristics of AMP and so also M-estimates. Our analysis clarifies that the `extra Gaussian noise' encountered in this problem is fundamentally similar to phenomena already studied for regularized least squares in the setting n<p. △ Less

Submitted 15 November, 2013; v1 submitted 28 October, 2013; originally announced October 2013.

Comments: 32 pages, 5 figures (v2 contains numerical simulations)

arXiv:1305.5870 [pdf, other]

The Optimal Hard Threshold for Singular Values is 4/sqrt(3)

Authors: Matan Gavish, David L. Donoho

Abstract: We consider recovery of low-rank matrices from noisy data by hard thresholding of singular values, where singular values below a prescribed threshold $λ$ are set to 0. We study the asymptotic MSE in a framework where the matrix size is large compared to the rank of the matrix to be recovered, and the signal-to-noise ratio of the low-rank piece stays constant. The AMSE-optimal choice of hard thresh… ▽ More We consider recovery of low-rank matrices from noisy data by hard thresholding of singular values, where singular values below a prescribed threshold $λ$ are set to 0. We study the asymptotic MSE in a framework where the matrix size is large compared to the rank of the matrix to be recovered, and the signal-to-noise ratio of the low-rank piece stays constant. The AMSE-optimal choice of hard threshold, in the case of n-by-n matrix in noise level σ, is simply $(4/\sqrt{3}) \sqrt{n}σ\approx 2.309 \sqrt{n}σ$ when $σ$ is known, or simply $2.858\cdot y_{med}$ when $σ$ is unknown, where $y_{med}$ is the median empirical singular value. For nonsquare $m$ by $n$ matrices with $m \neq n$, these thresholding coefficients are replaced with different provided constants. In our asymptotic framework, this thresholding rule adapts to unknown rank and to unknown noise level in an optimal manner: it is always better than hard thresholding at any other value, no matter what the matrix is that we are trying to recover, and is always better than ideal Truncated SVD (TSVD), which truncates at the true rank of the low-rank matrix we are trying to recover. Hard thresholding at the recommended value to recover an n-by-n matrix of rank r guarantees an AMSE at most $3nrσ^2$. In comparison, the guarantee provided by TSVD is $5nrσ^2$, the guarantee provided by optimally tuned singular value soft thresholding is $6nrσ^2$, and the best guarantee achievable by any shrinkage of the data singular values is $2nrσ^2$. Empirical evidence shows that these AMSE properties of the $4/\sqrt{3}$ thresholding rule remain valid even for relatively small n, and that performance improvement over TSVD and other shrinkage rules is substantial, turning it into the practical hard threshold of choice. △ Less

Submitted 4 June, 2014; v1 submitted 24 May, 2013; originally announced May 2013.

arXiv:1304.2085 [pdf, ps, other]

doi 10.1214/14-AOS1257

Minimax risk of matrix denoising by singular value thresholding

Authors: David Donoho, Matan Gavish

Abstract: An unknown $m$ by $n$ matrix $X_0$ is to be estimated from noisy measurements $Y=X_0+Z$, where the noise matrix $Z$ has i.i.d. Gaussian entries. A popular matrix denoising scheme solves the nuclear norm penalization problem $\operatorname {min}_X\|Y-X\|_F^2/2+λ\|X\|_*$, where $\|X\|_*$ denotes the nuclear norm (sum of singular values). This is the analog, for matrices, of $\ell_1$ penalization in… ▽ More An unknown $m$ by $n$ matrix $X_0$ is to be estimated from noisy measurements $Y=X_0+Z$, where the noise matrix $Z$ has i.i.d. Gaussian entries. A popular matrix denoising scheme solves the nuclear norm penalization problem $\operatorname {min}_X\|Y-X\|_F^2/2+λ\|X\|_*$, where $\|X\|_*$ denotes the nuclear norm (sum of singular values). This is the analog, for matrices, of $\ell_1$ penalization in the vector case. It has been empirically observed that if $X_0$ has low rank, it may be recovered quite accurately from the noisy measurement $Y$. In a proportional growth framework where the rank $r_n$, number of rows $m_n$ and number of columns $n$ all tend to $\infty$ proportionally to each other ($r_n/m_n\rightarrow ρ$, $m_n/n\rightarrow β$), we evaluate the asymptotic minimax MSE $\mathcal {M}(ρ,β)=\lim_{m_n,n\rightarrow \infty}\inf_λ\sup_{\operatorname {rank}(X)\leq r_n}\operatorname {MSE}(X_0,\hat{X}_λ)$. Our formulas involve incomplete moments of the quarter- and semi-circle laws ($β=1$, square case) and the Marčenko-Pastur law ($β<1$, nonsquare case). For finite $m$ and $n$, we show that MSE increases as the nonzero singular values of $X_0$ grow larger. As a result, the finite-$n$ worst-case MSE, a quantity which can be evaluated numerically, is achieved when the signal $X_0$ is "infinitely strong." The nuclear norm penalization problem is solved by applying soft thresholding to the singular values of $Y$. We also derive the minimax threshold, namely the value $λ^*(ρ)$, which is the optimal place to threshold the singular values. All these results are obtained for general (nonsquare, nonsymmetric) real matrices. Comparable results are obtained for square symmetric nonnegative-definite matrices. △ Less

Submitted 4 November, 2014; v1 submitted 7 April, 2013; originally announced April 2013.

Comments: Published in at http://dx.doi.org/10.1214/14-AOS1257 the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org)

Report number: IMS-AOS-AOS1257

Journal ref: Annals of Statistics 2014, Vol. 42, No. 6, 2413-2440

arXiv:1302.2758 [pdf, other]

doi 10.1051/0004-6361/201321257

Sparsity and the Bayesian Perspective

Authors: J. -L. Starck, D. L. Donoho, M. J. Fadili, A. Rassat

Abstract: Sparsity has been recently introduced in cosmology for weak-lensing and CMB data analysis for different applications such as denoising, component separation or inpainting (i.e. filling the missing data or the mask). Although it gives very nice numerical results, CMB sparse inpainting has been severely criticized by top researchers in cosmology, based on arguments derived from a Bayesian perspectiv… ▽ More Sparsity has been recently introduced in cosmology for weak-lensing and CMB data analysis for different applications such as denoising, component separation or inpainting (i.e. filling the missing data or the mask). Although it gives very nice numerical results, CMB sparse inpainting has been severely criticized by top researchers in cosmology, based on arguments derived from a Bayesian perspective. Trying to understand their point of view, we realize that interpreting a regularization penalty term as a prior in a Bayesian framework can lead to erroneous conclusions. This paper is by no means against the Bayesian approach, which has proven to be very useful for many applications, but warns about a Bayesian-only interpretation in data analysis, which can be misleading in some cases. △ Less

Submitted 15 February, 2013; v1 submitted 12 February, 2013; originally announced February 2013.

Comments: Accepted in A&A

arXiv:1302.2331 [pdf, other]

doi 10.1073/pnas.1306110110

The Phase Transition of Matrix Recovery from Gaussian Measurements Matches the Minimax MSE of Matrix Denoising

Authors: David L. Donoho, Matan Gavish, Andrea Montanari

Abstract: Let $X_0$ be an unknown $M$ by $N$ matrix. In matrix recovery, one takes $n < MN$ linear measurements $y_1,..., y_n$ of $X_0$, where $y_i = \Tr(a_i^T X_0)$ and each $a_i$ is a $M$ by $N$ matrix. For measurement matrices with Gaussian i.i.d entries, it known that if $X_0$ is of low rank, it is recoverable from just a few measurements. A popular approach for matrix recovery is Nuclear Norm Minimizat… ▽ More Let $X_0$ be an unknown $M$ by $N$ matrix. In matrix recovery, one takes $n < MN$ linear measurements $y_1,..., y_n$ of $X_0$, where $y_i = \Tr(a_i^T X_0)$ and each $a_i$ is a $M$ by $N$ matrix. For measurement matrices with Gaussian i.i.d entries, it known that if $X_0$ is of low rank, it is recoverable from just a few measurements. A popular approach for matrix recovery is Nuclear Norm Minimization (NNM). Empirical work reveals a \emph{phase transition} curve, stated in terms of the undersampling fraction $δ(n,M,N) = n/(MN)$, rank fraction $ρ=r/N$ and aspect ratio $β=M/N$. Specifically, a curve $δ^* = δ^*(ρ;β)$ exists such that, if $δ> δ^*(ρ;β)$, NNM typically succeeds, while if $δ< δ^*(ρ;β)$, it typically fails. An apparently quite different problem is matrix denoising in Gaussian noise, where an unknown $M$ by $N$ matrix $X_0$ is to be estimated based on direct noisy measurements $Y = X_0 + Z$, where the matrix $Z$ has iid Gaussian entries. It has been empirically observed that, if $X_0$ has low rank, it may be recovered quite accurately from the noisy measurement $Y$. A popular matrix denoising scheme solves the unconstrained optimization problem $\text{min} \| Y - X \|_F^2/2 + λ\|X\|_* $. When optimally tuned, this scheme achieves the asymptotic minimax MSE $\cM(ρ) = \lim_{N \goto \infty} \inf_λ\sup_{\rank(X) \leq ρ\cdot N} MSE(X,\hat{X}_λ)$. We report extensive experiments showing that the phase transition $δ^*(ρ)$ in the first problem coincides with the minimax risk curve $\cM(ρ)$ in the second problem, for {\em any} rank fraction $0 < ρ< 1$. △ Less

Submitted 10 February, 2013; originally announced February 2013.

arXiv:1112.0708 [pdf, other]

Information-Theoretically Optimal Compressed Sensing via Spatial Coupling and Approximate Message Passing

Authors: David L. Donoho, Adel Javanmard, Andrea Montanari

Abstract: We study the compressed sensing reconstruction problem for a broad class of random, band-diagonal sensing matrices. This construction is inspired by the idea of spatial coupling in coding theory. As demonstrated heuristically and numerically by Krzakala et al. \cite{KrzakalaEtAl}, message passing algorithms can effectively solve the reconstruction problem for spatially coupled measurements with un… ▽ More We study the compressed sensing reconstruction problem for a broad class of random, band-diagonal sensing matrices. This construction is inspired by the idea of spatial coupling in coding theory. As demonstrated heuristically and numerically by Krzakala et al. \cite{KrzakalaEtAl}, message passing algorithms can effectively solve the reconstruction problem for spatially coupled measurements with undersampling rates close to the fraction of non-zero coordinates. We use an approximate message passing (AMP) algorithm and analyze it through the state evolution method. We give a rigorous proof that this approach is successful as soon as the undersampling rate $δ$ exceeds the (upper) Rényi information dimension of the signal, $\uRenyi(p_X)$. More precisely, for a sequence of signals of diverging dimension $n$ whose empirical distribution converges to $p_X$, reconstruction is with high probability successful from $\uRenyi(p_X)\, n+o(n)$ measurements taken according to a band diagonal matrix. For sparse signals, i.e., sequences of dimension $n$ and $k(n)$ non-zero entries, this implies reconstruction from $k(n)+o(n)$ measurements. For `discrete' signals, i.e., signals whose coordinates take a fixed finite set of values, this implies reconstruction from $o(n)$ measurements. The result is robust with respect to noise, does not apply uniquely to random signals, but requires the knowledge of the empirical distribution of the signal $p_X$. △ Less

Submitted 18 January, 2013; v1 submitted 3 December, 2011; originally announced December 2011.

Comments: 60 pages, 7 figures, Sections 3,5 and Appendices A,B are added. The stability constant is quantified (cf Theorem 1.7)

arXiv:1111.1041 [pdf, other]

Accurate Prediction of Phase Transitions in Compressed Sensing via a Connection to Minimax Denoising

Authors: David Donoho, Iain Johnstone, Andrea Montanari

Abstract: Compressed sensing posits that, within limits, one can undersample a sparse signal and yet reconstruct it accurately. Knowing the precise limits to such undersampling is important both for theory and practice. We present a formula that characterizes the allowed undersampling of generalized sparse objects. The formula applies to Approximate Message Passing (AMP) algorithms for compressed sensing, w… ▽ More Compressed sensing posits that, within limits, one can undersample a sparse signal and yet reconstruct it accurately. Knowing the precise limits to such undersampling is important both for theory and practice. We present a formula that characterizes the allowed undersampling of generalized sparse objects. The formula applies to Approximate Message Passing (AMP) algorithms for compressed sensing, which are here generalized to employ denoising operators besides the traditional scalar soft thresholding denoiser. This paper gives several examples including scalar denoisers not derived from convex penalization -- the firm shrinkage nonlinearity and the minimax nonlinearity -- and also nonscalar denoisers -- block thresholding, monotone regression, and total variation minimization. Let the variables eps = k/N and delta = n/N denote the generalized sparsity and undersampling fractions for sampling the k-generalized-sparse N-vector x_0 according to y=Ax_0. Here A is an n\times N measurement matrix whose entries are iid standard Gaussian. The formula states that the phase transition curve delta = delta(eps) separating successful from unsuccessful reconstruction of x_0 by AMP is given by: delta = M(eps| Denoiser), where M(eps| Denoiser) denotes the per-coordinate minimax mean squared error (MSE) of the specified, optimally-tuned denoiser in the directly observed problem y = x + z. In short, the phase transition of a noiseless undersampling problem is identical to the minimax MSE in a denoising problem. △ Less

Submitted 7 January, 2013; v1 submitted 4 November, 2011; originally announced November 2011.

Comments: 71 pages, 32 pdf figures

arXiv:1103.1943 [pdf, other]

Compressed Sensing over $\ell_p$-balls: Minimax Mean Square Error

Authors: David Donoho, Iain Johnstone, Arian Maleki, Andrea Montanari

Abstract: We consider the compressed sensing problem, where the object $x_0 \in \bR^N$ is to be recovered from incomplete measurements $y = Ax_0 + z$; here the sensing matrix $A$ is an $n \times N$ random matrix with iid Gaussian entries and $n < N$. A popular method of sparsity-promoting reconstruction is $\ell^1$-penalized least-squares reconstruction (aka LASSO, Basis Pursuit). It is currently popular… ▽ More We consider the compressed sensing problem, where the object $x_0 \in \bR^N$ is to be recovered from incomplete measurements $y = Ax_0 + z$; here the sensing matrix $A$ is an $n \times N$ random matrix with iid Gaussian entries and $n < N$. A popular method of sparsity-promoting reconstruction is $\ell^1$-penalized least-squares reconstruction (aka LASSO, Basis Pursuit). It is currently popular to consider the strict sparsity model, where the object $x_0$ is nonzero in only a small fraction of entries. In this paper, we instead consider the much more broadly applicable $\ell_p$-sparsity model, where $x_0$ is sparse in the sense of having $\ell_p$ norm bounded by $ξ\cdot N^{1/p}$ for some fixed $0 < p \leq 1$ and $ξ> 0$. We study an asymptotic regime in which $n$ and $N$ both tend to infinity with limiting ratio $n/N = δ\in (0,1)$, both in the noisy ($z \neq 0$) and noiseless ($z=0$) cases. Under weak assumptions on $x_0$, we are able to precisely evaluate the worst-case asymptotic minimax mean-squared reconstruction error (AMSE) for $\ell^1$ penalized least-squares: min over penalization parameters, max over $\ell_p$-sparse objects $x_0$. We exhibit the asymptotically least-favorable object (hardest sparse signal to recover) and the maximin penalization. Our explicit formulas unexpectedly involve quantities appearing classically in statistical decision theory. Occurring in the present setting, they reflect a deeper connection between penalized $\ell^1$ minimization and scalar soft thresholding. This connection, which follows from earlier work of the authors and collaborators on the AMP iterative thresholding algorithm, is carefully explained. Our approach also gives precise results under weak-$\ell_p$ ball coefficient constraints, as we show here. △ Less

Submitted 23 March, 2011; v1 submitted 10 March, 2011; originally announced March 2011.

Comments: 41 pages, 11 pdf figures

arXiv:1004.3006 [pdf, ps, other]

Microlocal Analysis of the Geometric Separation Problem

Authors: David L. Donoho, Gitta Kutyniok

Abstract: Image data are often composed of two or more geometrically distinct constituents; in galaxy catalogs, for instance, one sees a mixture of pointlike structures (galaxy superclusters) and curvelike structures (filaments). It would be ideal to process a single image and extract two geometrically `pure' images, each one containing features from only one of the two geometric constituents. This seems t… ▽ More Image data are often composed of two or more geometrically distinct constituents; in galaxy catalogs, for instance, one sees a mixture of pointlike structures (galaxy superclusters) and curvelike structures (filaments). It would be ideal to process a single image and extract two geometrically `pure' images, each one containing features from only one of the two geometric constituents. This seems to be a seriously underdetermined problem, but recent empirical work achieved highly persuasive separations. We present a theoretical analysis showing that accurate geometric separation of point and curve singularities can be achieved by minimizing the $\ell_1$ norm of the representing coefficients in two geometrically complementary frames: wavelets and curvelets. Driving our analysis is a specific property of the ideal (but unachievable) representation where each content type is expanded in the frame best adapted to it. This ideal representation has the property that important coefficients are clustered geometrically in phase space, and that at fine scales, there is very little coherence between a cluster of elements in one frame expansion and individual elements in the complementary frame. We formally introduce notions of cluster coherence and clustered sparsity and use this machinery to show that the underdetermined systems of linear equations can be stably solved by $\ell_1$ minimization; microlocal phase space helps organize the calculations that cluster coherence requires. △ Less

Submitted 18 April, 2010; originally announced April 2010.

Comments: 59 pages, 9 figures

Report number: Technical Report No. 2010-01, Statistics Department, Stanford University

arXiv:1004.1218 [pdf, other]

The Noise-Sensitivity Phase Transition in Compressed Sensing

Authors: David L. Donoho, Arian Maleki, Andrea Montanari

Abstract: Consider the noisy underdetermined system of linear equations: y=Ax0 + z0, with n x N measurement matrix A, n < N, and Gaussian white noise z0 ~ N(0,σ^2 I). Both y and A are known, both x0 and z0 are unknown, and we seek an approximation to x0. When x0 has few nonzeros, useful approximations are obtained by l1-penalized l2 minimization, in which the reconstruction \hxl solves min || y - Ax||^2/2… ▽ More Consider the noisy underdetermined system of linear equations: y=Ax0 + z0, with n x N measurement matrix A, n < N, and Gaussian white noise z0 ~ N(0,σ^2 I). Both y and A are known, both x0 and z0 are unknown, and we seek an approximation to x0. When x0 has few nonzeros, useful approximations are obtained by l1-penalized l2 minimization, in which the reconstruction \hxl solves min || y - Ax||^2/2 + λ||x||_1. Evaluate performance by mean-squared error (MSE = E ||\hxl - x0||_2^2/N). Consider matrices A with iid Gaussian entries and a large-system limit in which n,N\to\infty with n/N \to δand k/n \to ρ. Call the ratio MSE/σ^2 the noise sensitivity. We develop formal expressions for the MSE of \hxl, and evaluate its worst-case formal noise sensitivity over all types of k-sparse signals. The phase space 0 < δ, ρ< 1 is partitioned by curve ρ= \rhoMSE(δ) into two regions. Formal noise sensitivity is bounded throughout the region ρ< \rhoMSE(δ) and is unbounded throughout the region ρ> \rhoMSE(δ). The phase boundary ρ= \rhoMSE(δ) is identical to the previously-known phase transition curve for equivalence of l1 - l0 minimization in the k-sparse noiseless case. Hence a single phase boundary describes the fundamental phase transitions both for the noiseless and noisy cases. Extensive computational experiments validate the predictions of this formalism, including the existence of game theoretical structures underlying it. Underlying our formalism is the AMP algorithm introduced earlier by the authors. Other papers by the authors detail expressions for the formal MSE of AMP and its close connection to l1-penalized reconstruction. Here we derive the minimax formal MSE of AMP and then read out results for l1-penalized reconstruction. △ Less

Submitted 7 April, 2010; originally announced April 2010.

Comments: 40 pages, 13 pdf figures

arXiv:0911.4222 [pdf, other]

Message Passing Algorithms for Compressed Sensing: II. Analysis and Validation

Authors: David L. Donoho, Arian Maleki, Andrea Montanari

Abstract: In a recent paper, the authors proposed a new class of low-complexity iterative thresholding algorithms for reconstructing sparse signals from a small set of linear measurements \cite{DMM}. The new algorithms are broadly referred to as AMP, for approximate message passing. This is the second of two conference papers describing the derivation of these algorithms, connection with related literatur… ▽ More In a recent paper, the authors proposed a new class of low-complexity iterative thresholding algorithms for reconstructing sparse signals from a small set of linear measurements \cite{DMM}. The new algorithms are broadly referred to as AMP, for approximate message passing. This is the second of two conference papers describing the derivation of these algorithms, connection with related literature, extensions of original framework, and new empirical evidence. This paper describes the state evolution formalism for analyzing these algorithms, and some of the conclusions that can be drawn from this formalism. We carried out extensive numerical simulations to confirm these predictions. We present here a few representative results. △ Less

Submitted 21 November, 2009; originally announced November 2009.

Comments: 5 pages, 3 pdf figures, IEEE Information Theory Workshop, Cairo 2010

arXiv:0911.4219 [pdf, ps, other]

Message Passing Algorithms for Compressed Sensing: I. Motivation and Construction

Authors: David L. Donoho, Arian Maleki, Andrea Montanari

Abstract: In a recent paper, the authors proposed a new class of low-complexity iterative thresholding algorithms for reconstructing sparse signals from a small set of linear measurements \cite{DMM}. The new algorithms are broadly referred to as AMP, for approximate message passing. This is the first of two conference papers describing the derivation of these algorithms, connection with the related litera… ▽ More In a recent paper, the authors proposed a new class of low-complexity iterative thresholding algorithms for reconstructing sparse signals from a small set of linear measurements \cite{DMM}. The new algorithms are broadly referred to as AMP, for approximate message passing. This is the first of two conference papers describing the derivation of these algorithms, connection with the related literature, extensions of the original framework, and new empirical evidence. In particular, the present paper outlines the derivation of AMP from standard sum-product belief propagation, and its extension in several directions. We also discuss relations with formal calculations based on statistical mechanics methods. △ Less

Submitted 21 November, 2009; originally announced November 2009.

Comments: 5 pages, IEEE Information Theory Workshop, Cairo 2010

arXiv:0909.0777 [pdf, other]

doi 10.1109/JSTSP.2009.2039176

Optimally Tuned Iterative Reconstruction Algorithms for Compressed Sensing

Authors: Arian Maleki, David L. Donoho

Abstract: We conducted an extensive computational experiment, lasting multiple CPU-years, to optimally select parameters for two important classes of algorithms for finding sparse solutions of underdetermined systems of linear equations. We make the optimally tuned implementations available at {\tt sparselab.stanford.edu}; they run `out of the box' with no user tuning: it is not necessary to select thresh… ▽ More We conducted an extensive computational experiment, lasting multiple CPU-years, to optimally select parameters for two important classes of algorithms for finding sparse solutions of underdetermined systems of linear equations. We make the optimally tuned implementations available at {\tt sparselab.stanford.edu}; they run `out of the box' with no user tuning: it is not necessary to select thresholds or know the likely degree of sparsity. Our class of algorithms includes iterative hard and soft thresholding with or without relaxation, as well as CoSaMP, subspace pursuit and some natural extensions. As a result, our optimally tuned algorithms dominate such proposals. Our notion of optimality is defined in terms of phase transitions, i.e. we maximize the number of nonzeros at which the algorithm can successfully operate. We show that the phase transition is a well-defined quantity with our suite of random underdetermined linear systems. Our tuning gives the highest transition possible within each class of algorithms. △ Less

Submitted 3 September, 2009; originally announced September 2009.

Comments: 12 pages, 14 figures

arXiv:0907.3574 [pdf, ps, other]

doi 10.1073/pnas.0909892106

Message Passing Algorithms for Compressed Sensing

Authors: David L. Donoho, Arian Maleki, Andrea Montanari

Abstract: Compressed sensing aims to undersample certain high-dimensional signals, yet accurately reconstruct them by exploiting signal characteristics. Accurate reconstruction is possible when the object to be recovered is sufficiently sparse in a known basis. Currently, the best known sparsity-undersampling tradeoff is achieved when reconstructing by convex optimization -- which is expensive in importan… ▽ More Compressed sensing aims to undersample certain high-dimensional signals, yet accurately reconstruct them by exploiting signal characteristics. Accurate reconstruction is possible when the object to be recovered is sufficiently sparse in a known basis. Currently, the best known sparsity-undersampling tradeoff is achieved when reconstructing by convex optimization -- which is expensive in important large-scale applications. Fast iterative thresholding algorithms have been intensively studied as alternatives to convex optimization for large-scale problems. Unfortunately known fast algorithms offer substantially worse sparsity-undersampling tradeoffs than convex optimization. We introduce a simple costless modification to iterative thresholding making the sparsity-undersampling tradeoff of the new algorithms equivalent to that of the corresponding convex optimization procedures. The new iterative-thresholding algorithms are inspired by belief propagation in graphical models. Our empirical measurements of the sparsity-undersampling tradeoff for the new algorithms agree with theoretical calculations. We show that a state evolution formalism correctly derives the true sparsity-undersampling tradeoff. There is a surprising agreement between earlier calculations based on random convex polytopes and this new, apparently very different theoretical formalism. △ Less

Submitted 21 July, 2009; originally announced July 2009.

Comments: 6 pages paper + 9 pages supplementary information, 13 eps figure. Submitted to Proc. Natl. Acad. Sci. USA

arXiv:0906.2530 [pdf, other]

doi 10.1098/rsta.2009.0152

Observed Universality of Phase Transitions in High-Dimensional Geometry, with Implications for Modern Data Analysis and Signal Processing

Authors: David L. Donoho, Jared Tanner

Abstract: We review connections between phase transitions in high-dimensional combinatorial geometry and phase transitions occurring in modern high-dimensional data analysis and signal processing. In data analysis, such transitions arise as abrupt breakdown of linear model selection, robust data fitting or compressed sensing reconstructions, when the complexity of the model or the number of outliers incre… ▽ More We review connections between phase transitions in high-dimensional combinatorial geometry and phase transitions occurring in modern high-dimensional data analysis and signal processing. In data analysis, such transitions arise as abrupt breakdown of linear model selection, robust data fitting or compressed sensing reconstructions, when the complexity of the model or the number of outliers increases beyond a threshold. In combinatorial geometry these transitions appear as abrupt changes in the properties of face counts of convex polytopes when the dimensions are varied. The thresholds in these very different problems appear in the same critical locations after appropriate calibration of variables. These thresholds are important in each subject area: for linear modelling, they place hard limits on the degree to which the now-ubiquitous high-throughput data analysis can be successful; for robustness, they place hard limits on the degree to which standard robust fitting methods can tolerate outliers before breaking down; for compressed sensing, they define the sharp boundary of the undersampling/sparsity tradeoff in undersampling theorems. Existing derivations of phase transitions in combinatorial geometry assume the underlying matrices have independent and identically distributed (iid) Gaussian elements. In applications, however, it often seems that Gaussianity is not required. We conducted an extensive computational experiment and formal inferential analysis to test the hypothesis that these phase transitions are {\it universal} across a range of underlying matrix ensembles. The experimental results are consistent with an asymptotic large-$n$ universality across matrix ensembles; finite-sample universality can be rejected. △ Less

Submitted 14 June, 2009; originally announced June 2009.

Comments: 47 pages, 24 figures, 10 tables

arXiv:0812.2263 [pdf, other]

doi 10.1098/rsta.2009.0129

Feature selection by Higher Criticism thresholding: optimal phase diagram

Authors: David Donoho, Jiashun Jin

Abstract: We consider two-class linear classification in a high-dimensional, low-sample size setting. Only a small fraction of the features are useful, the useful features are unknown to us, and each useful feature contributes weakly to the classification decision -- this setting was called the rare/weak model (RW Model). We select features by thresholding feature $z$-scores. The threshold is set by {\it… ▽ More We consider two-class linear classification in a high-dimensional, low-sample size setting. Only a small fraction of the features are useful, the useful features are unknown to us, and each useful feature contributes weakly to the classification decision -- this setting was called the rare/weak model (RW Model). We select features by thresholding feature $z$-scores. The threshold is set by {\it higher criticism} (HC). Let $\pee_i$ denote the $P$-value associated to the $i$-th $z$-score and $\pee_{(i)}$ denote the $i$-th order statistic of the collection of $P$-values. The HC threshold (HCT) is the order statistic of the $z$-score corresponding to index $i$ maximizing $(i/n - \pee_{(i)})/\sqrt{\pee_{(i)}(1-\pee_{(i)})}$. The ideal threshold optimizes the classification error. In \cite{PNAS} we showed that HCT was numerically close to the ideal threshold. We formalize an asymptotic framework for studying the RW model, considering a sequence of problems with increasingly many features and relatively fewer observations. We show that along this sequence, the limiting performance of ideal HCT is essentially just as good as the limiting performance of ideal thresholding. Our results describe two-dimensional {\it phase space}, a two-dimensional diagram with coordinates quantifying "rare" and "weak" in the RW model. Phase space can be partitioned into two regions -- one where ideal threshold classification is successful, and one where the features are so weak and so rare that it must fail. Surprisingly, the regions where ideal HCT succeeds and fails make the exact same partition of the phase diagram. Other threshold methods, such as FDR threshold selection, are successful in a substantially smaller region of the phase space than either HCT or Ideal thresholding. △ Less

Submitted 11 December, 2008; originally announced December 2008.

Comments: 4 figures, 24 pages

MSC Class: 62H30 (Primary); 62H15; 62G20 (Secondary); 62G32

arXiv:0807.3590 [pdf, ps, other]

Counting the Faces of Randomly-Projected Hypercubes and Orthants, with Applications

Authors: David L. Donoho, Jared Tanner

Abstract: Let $A$ be an $n$ by $N$ real valued random matrix, and $\h$ denote the $N$-dimensional hypercube. For numerous random matrix ensembles, the expected number of $k$-dimensional faces of the random $n$-dimensional zonotope $A\h$ obeys the formula $E f_k(A\h) /f_k(\h) = 1-P_{N-n,N-k}$, where $P_{N-n,N-k}$ is a fair-coin-tossing probability. The formula applies, for example, where the columns of… ▽ More Let $A$ be an $n$ by $N$ real valued random matrix, and $\h$ denote the $N$-dimensional hypercube. For numerous random matrix ensembles, the expected number of $k$-dimensional faces of the random $n$-dimensional zonotope $A\h$ obeys the formula $E f_k(A\h) /f_k(\h) = 1-P_{N-n,N-k}$, where $P_{N-n,N-k}$ is a fair-coin-tossing probability. The formula applies, for example, where the columns of $A$ are drawn i.i.d. from an absolutely continuous symmetric distribution. The formula exploits Wendel's Theorem\cite{We62}. Let $\po$ denote the positive orthant; the expected number of $k$-faces of the random cone$A \po$ obeys $ {\cal E} f_k(A\po) /f_k(\po) = 1 - P_{N-n,N-k}$. The formula applies to numerous matrix ensembles, including those with iid random columns from an absolutely continuous, centrally symmetric distribution. There is an asymptotically sharp threshold in the behavior of face counts of the projected hypercube; thresholds known for projecting the simplex and the cross-polytope, occur at very different locations. We briefly consider face counts of the projected orthant when $A$ does not have mean zero; these do behave similarly to those for the projected simplex. We consider non-random projectors of the orthant; the 'best possible' $A$ is the one associated with the first $n$ rows of the Fourier matrix. These geometric face-counting results have implications for signal processing, information theory, inverse problems, and optimization. Most of these flow in some way from the fact that face counting is related to conditions for uniqueness of solutions of underdetermined systems of linear equations. △ Less

Submitted 22 July, 2008; originally announced July 2008.

Comments: 21 pages, 3 figures

MSC Class: 52A22; 52B05; 52B11; 52B12; 62E20; 68P30; 68P25; 68W20; 68W40; 94B20; 94B35; 94B65; 94B70

arXiv:cs/0702105 [pdf]

doi 10.1109/ISIT.2006.261816

The Simplest Solution to an Underdetermined System of Linear Equations

Authors: David Donoho, Hossein Kakavand, James Mammen

Abstract: Consider a d*n matrix A, with d<n. The problem of solving for x in y=Ax is underdetermined, and has infinitely many solutions (if there are any). Given y, the minimum Kolmogorov complexity solution (MKCS) of the input x is defined to be an input z (out of many) with minimum Kolmogorov-complexity that satisfies y=Az. One expects that if the actual input is simple enough, then MKCS will recover th… ▽ More Consider a d*n matrix A, with d<n. The problem of solving for x in y=Ax is underdetermined, and has infinitely many solutions (if there are any). Given y, the minimum Kolmogorov complexity solution (MKCS) of the input x is defined to be an input z (out of many) with minimum Kolmogorov-complexity that satisfies y=Az. One expects that if the actual input is simple enough, then MKCS will recover the input exactly. This paper presents a preliminary study of the existence and value of the complexity level up to which such a complexity-based recovery is possible. It is shown that for the set of all d*n binary matrices (with entries 0 or 1 and d<n), MKCS exactly recovers the input for an overwhelming fraction of the matrices provided the Kolmogorov complexity of the input is O(d). A weak converse that is loose by a log n factor is also established for this case. Finally, we investigate the difficulty of finding a matrix that has the property of recovering inputs with complexity of O(d) using MKCS. △ Less

Submitted 19 February, 2007; originally announced February 2007.

Comments: Proceedings of the IEEE International Symposium on Information Theory Seattle, Washington, July 9-14, 2006

arXiv:math/0612422 [pdf, ps, other]

doi 10.1214/08-AOS604

Does median filtering truly preserve edges better than linear filtering?

Authors: Ery Arias-Castro, David L. Donoho

Abstract: Image processing researchers commonly assert that "median filtering is better than linear filtering for removing noise in the presence of edges." Using a straightforward large-$n$ decision-theory framework, this folk-theorem is seen to be false in general. We show that median filtering and linear filtering have similar asymptotic worst-case mean-squared error (MSE) when the signal-to-noise ratio… ▽ More Image processing researchers commonly assert that "median filtering is better than linear filtering for removing noise in the presence of edges." Using a straightforward large-$n$ decision-theory framework, this folk-theorem is seen to be false in general. We show that median filtering and linear filtering have similar asymptotic worst-case mean-squared error (MSE) when the signal-to-noise ratio (SNR) is of order 1, which corresponds to the case of constant per-pixel noise level in a digital signal. To see dramatic benefits of median smoothing in an asymptotic setting, the per-pixel noise level should tend to zero (i.e., SNR should grow very large). We show that a two-stage median filtering using two very different window widths can dramatically outperform traditional linear and median filtering in settings where the underlying object has edges. In this two-stage procedure, the first pass, at a fine scale, aims at increasing the SNR. The second pass, at a coarser scale, correctly exploits the nonlinearity of the median. Image processing methods based on nonlinear partial differential equations (PDEs) are often said to improve on linear filtering in the presence of edges. Such methods seem difficult to analyze rigorously in a decision-theoretic framework. A popular example is mean curvature motion (MCM), which is formally a kind of iterated median filtering. Our results on iterated median filtering suggest that some PDE-based methods are candidates to rigorously outperform linear filtering in an asymptotic framework. △ Less

Submitted 20 April, 2009; v1 submitted 14 December, 2006; originally announced December 2006.

Comments: Published in at http://dx.doi.org/10.1214/08-AOS604 the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org)

Report number: IMS-AOS-AOS604 MSC Class: 62G08; 62G20 (Primary) 60G35 (Secondary)

Journal ref: Annals of Statistics 2009, Vol. 37, No. 3, 1172-1206

arXiv:astro-ph/0610958 [pdf, ps, other]

doi 10.1111/j.1365-2966.2006.11237.x

Multi-scale morphology of the galaxy distribution

Authors: Enn Saar, Vicent J. Martinez, Jean-Luc Starck, David L. Donoho

Abstract: Many statistical methods have been proposed in the last years for analyzing the spatial distribution of galaxies. Very few of them, however, can handle properly the border effects of complex observational sample volumes. In this paper, we first show how to calculate the Minkowski Functionals (MF) taking into account these border effects. Then we present a multiscale extension of the MF which giv… ▽ More Many statistical methods have been proposed in the last years for analyzing the spatial distribution of galaxies. Very few of them, however, can handle properly the border effects of complex observational sample volumes. In this paper, we first show how to calculate the Minkowski Functionals (MF) taking into account these border effects. Then we present a multiscale extension of the MF which gives us more information about how the galaxies are spatially distributed. A range of examples using Gaussian random fields illustrate the results. Finally we have applied the Multiscale Minkowski Functionals (MMF) to the 2dF Galaxy Redshift Survey data. The MMF clearly indicates an evolution of morphology with scale. We also compare the 2dF real catalog with mock catalogs and found that Lambda-CDM simulations roughly fit the data, except at the finest scale. △ Less

Submitted 31 October, 2006; originally announced October 2006.

Comments: 17 pages, 19 figures, accepted for publication in MNRAS

Journal ref: Mon.Not.Roy.Astron.Soc.374:1030-1044,2007

arXiv:math/0607364 [pdf, ps, other]

Counting faces of randomly-projected polytopes when the projection radically lowers dimension

Authors: David L. Donoho, Jared Tanner

Abstract: This paper develops asymptotic methods to count faces of random high-dimensional polytopes. Beyond its intrinsic interest, our conclusions have surprising implications - in statistics, probability, information theory, and signal processing - with potential impacts in practical subjects like medical imaging and digital communications. Three such implications concern: convex hulls of Gaussian poin… ▽ More This paper develops asymptotic methods to count faces of random high-dimensional polytopes. Beyond its intrinsic interest, our conclusions have surprising implications - in statistics, probability, information theory, and signal processing - with potential impacts in practical subjects like medical imaging and digital communications. Three such implications concern: convex hulls of Gaussian point clouds, signal recovery from random projections, and how many gross errors can be efficiently corrected from Gaussian error correcting codes. △ Less

Submitted 26 September, 2006; v1 submitted 15 July, 2006; originally announced July 2006.

Comments: 56 pages

MSC Class: 52A22; 52B05; 52B11; 52B12; 62E20; 68P30; 68P25; 68W20; 68W40; 94B20 94B35; 94B65; 94B70

arXiv:math/0605513 [pdf, ps, other]

doi 10.1214/009053605000000787

Adaptive multiscale detection of filamentary structures in a background of uniform random points

Authors: Ery Arias-Castro, David L. Donoho, Xiaoming Huo

Abstract: We are given a set of $n$ points that might be uniformly distributed in the unit square $[0,1]^2$. We wish to test whether the set, although mostly consisting of uniformly scattered points, also contains a small fraction of points sampled from some (a priori unknown) curve with $C^α$-norm bounded by $β$. An asymptotic detection threshold exists in this problem; for a constant $T_-(α,β)>0$, if th… ▽ More We are given a set of $n$ points that might be uniformly distributed in the unit square $[0,1]^2$. We wish to test whether the set, although mostly consisting of uniformly scattered points, also contains a small fraction of points sampled from some (a priori unknown) curve with $C^α$-norm bounded by $β$. An asymptotic detection threshold exists in this problem; for a constant $T_-(α,β)>0$, if the number of points sampled from the curve is smaller than $T_-(α,β)n^{1/(1+α)}$, reliable detection is not possible for large $n$. We describe a multiscale significant-runs algorithm that can reliably detect concentration of data near a smooth curve, without knowing the smoothness information $α$ or $β$ in advance, provided that the number of points on the curve exceeds $T_*(α,β)n^{1/(1+α)}$. This algorithm therefore has an optimal detection threshold, up to a factor $T_*/T_-$. At the heart of our approach is an analysis of the data by counting membership in multiscale multianisotropic strips. The strips will have area $2/n$ and exhibit a variety of lengths, orientations and anisotropies. The strips are partitioned into anisotropy classes; each class is organized as a directed graph whose vertices all are strips of the same anisotropy and whose edges link such strips to their ``good continuations.'' The point-cloud data are reduced to counts that measure membership in strips. Each anisotropy graph is reduced to a subgraph that consist of strips with significant counts. The algorithm rejects $\mathbf{H}_0$ whenever some such subgraph contains a path that connects many consecutive significant counts. △ Less

Submitted 18 May, 2006; originally announced May 2006.

Comments: Published at http://dx.doi.org/10.1214/009053605000000787 in the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org)

Report number: IMS-AOS-AOS0097 MSC Class: 62M30 (Primary) 62G10; 62G20 (Secondary)

Journal ref: Annals of Statistics 2006, Vol. 34, No. 1, 326-349

arXiv:math/0603673 [pdf, ps, other]

Correction. Connect The Dots: How Many Random Points Can A Regular Curve Pass Through?

Authors: E. Arias-Castro, D. L. Donoho, X. Huo, C. A. Tovey

Abstract: Correction for Adv. in Appl. Probab. 37, no. 3 (2005), 571-603 Correction for Adv. in Appl. Probab. 37, no. 3 (2005), 571-603 △ Less

Submitted 28 March, 2006; originally announced March 2006.

Comments: 2 pages, 1 figure

MSC Class: 60D05; 62M40

arXiv:math/0602311 [pdf, ps, other]

doi 10.1214/009053606000000920

Asymptotic minimaxity of False Discovery Rate thresholding for sparse exponential data

Authors: David Donoho, Jiashun Jin

Abstract: We apply FDR thresholding to a non-Gaussian vector whose coordinates X_i, i=1,..., n, are independent exponential with individual means $μ_i$. The vector $μ=(μ_i)$ is thought to be sparse, with most coordinates 1 but a small fraction significantly larger than 1; roughly, most coordinates are simply `noise,' but a small fraction contain `signal.' We measure risk by per-coordinate mean-squared err… ▽ More We apply FDR thresholding to a non-Gaussian vector whose coordinates X_i, i=1,..., n, are independent exponential with individual means $μ_i$. The vector $μ=(μ_i)$ is thought to be sparse, with most coordinates 1 but a small fraction significantly larger than 1; roughly, most coordinates are simply `noise,' but a small fraction contain `signal.' We measure risk by per-coordinate mean-squared error in recovering $\log(μ_i)$, and study minimax estimation over parameter spaces defined by constraints on the per-coordinate p-norm of $\log(μ_i)$: $\frac{1}{n}\sum_{i=1}^n\log^p(μ_i)\leq η^p$. We show for large n and small $η$ that FDR thresholding can be nearly Minimax. The FDR control parameter 0<q<1 plays an important role: when $q\leq 1/2$, the FDR estimator is nearly minimax, while choosing a fixed q>1/2 prevents near minimaxity. These conclusions mirror those found in the Gaussian case in Abramovich et al. [Ann. Statist. 34 (2006) 584--653]. The techniques developed here seem applicable to a wide range of other distributional assumptions, other loss measures and non-i.i.d. dependency structures. △ Less

Submitted 1 August, 2007; v1 submitted 14 February, 2006; originally announced February 2006.

Comments: Published at http://dx.doi.org/10.1214/009053606000000920 in the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org)

Report number: IMS-AOS-AOS0150 MSC Class: 62H12; 62C20 (Primary) 62G20; 62C10; 62C12. (Secondary)

Journal ref: Annals of Statistics 2006, Vol. 34, No. 6, 2980-3018

arXiv:astro-ph/0508326 [pdf, ps, other]

doi 10.1086/497125

Morphology of the galaxy distribution from wavelet denoising

Authors: V. J. Martinez, J. -L. Starck, E. Saar, D. L. Donoho, S. Reynolds, P. de la Cruz, S. Paredes

Abstract: We have developed a method based on wavelets to obtain the true underlying smooth density from a point distribution. The goal has been to reconstruct the density field in an optimal way ensuring that the morphology of the reconstructed field reflects the true underlying morphology of the point field which, as the galaxy distribution, has a genuinely multiscale structure, with near-singular behav… ▽ More We have developed a method based on wavelets to obtain the true underlying smooth density from a point distribution. The goal has been to reconstruct the density field in an optimal way ensuring that the morphology of the reconstructed field reflects the true underlying morphology of the point field which, as the galaxy distribution, has a genuinely multiscale structure, with near-singular behavior on sheets, filaments and hotspots. If the discrete distributions are smoothed using Gaussian filters, the morphological properties tend to be closer to those expected for a Gaussian field. The use of wavelet denoising provide us with a unique and more accurate morphological description. △ Less

Submitted 15 August, 2005; originally announced August 2005.

Comments: Accepted for publication in ApJ

Journal ref: Astrophys.J.634:744-755,2005

arXiv:math/0505374 [pdf, ps, other]

Adapting to Unknown Sparsity by controlling the False Discovery Rate

Authors: Felix Abramovich, Yoav Benjamini, David L. Donoho, Iain M. Johnstone

Abstract: We attempt to recover an $n$-dimensional vector observed in white noise, where $n$ is large and the vector is known to be sparse, but the degree of sparsity is unknown. We consider three different ways of defining sparsity of a vector: using the fraction of nonzero terms; imposing power-law decay bounds on the ordered entries; and controlling the $\ell_p$ norm for $p$ small. We obtain a procedur… ▽ More We attempt to recover an $n$-dimensional vector observed in white noise, where $n$ is large and the vector is known to be sparse, but the degree of sparsity is unknown. We consider three different ways of defining sparsity of a vector: using the fraction of nonzero terms; imposing power-law decay bounds on the ordered entries; and controlling the $\ell_p$ norm for $p$ small. We obtain a procedure which is asymptotically minimax for $\ell^r$ loss, simultaneously throughout a range of such sparsity classes. The optimal procedure is a data-adaptive thresholding scheme, driven by control of the {\it False Discovery Rate} (FDR). FDR control is a relatively recent innovation in simultaneous testing, ensuring that at most a certain fraction of the rejected null hypotheses will correspond to false rejections. In our treatment, the FDR control parameter $q_n$ also plays a determining role in asymptotic minimaxity. If $q = \lim q_n \in [0,1/2]$ and also $q_n > γ/\log(n)$ we get sharp asymptotic minimaxity, simultaneously, over a wide range of sparse parameter spaces and loss functions. On the other hand, $ q = \lim q_n \in (1/2,1]$, forces the risk to exceed the minimax risk by a factor growing with $q$. To our knowledge, this relation between ideas in simultaneous inference and asymptotic decision theory is new. Our work provides a new perspective on a class of model selection rules which has been introduced recently by several authors. These new rules impose complexity penalization of the form $2 \cdot \log({potential model size} / {actual model size})$. We exhibit a close connection with FDR-controlling procedures under stringent control of the false discovery rate. △ Less

Submitted 18 May, 2005; originally announced May 2005.

Comments: This is a complete version of a paper to appear in Annals of Statitistics. The paper in AoS has certain proofs abbreviated that are given here in detail

MSC Class: 62F10; 62G12

arXiv:astro-ph/0503374 [pdf, ps, other]

doi 10.1155/ASP.2005.2470

Cosmological non-Gaussian Signature Detection: Comparing Performance of Different Statistical Tests

Authors: J. Jin, J. -L. Starck, D. L. Donoho, N. Aghanim, O. Forni

Abstract: Currently, it appears that the best method for non-Gaussianity detection in the Cosmic Microwave Background (CMB) consists in calculating the kurtosis of the wavelet coefficients. We know that wavelet-kurtosis outperforms other methods such as the bispectrum, the genus, ridgelet-kurtosis and curvelet-kurtosis on an empirical basis, but relatively few studies have compared other transform-based s… ▽ More Currently, it appears that the best method for non-Gaussianity detection in the Cosmic Microwave Background (CMB) consists in calculating the kurtosis of the wavelet coefficients. We know that wavelet-kurtosis outperforms other methods such as the bispectrum, the genus, ridgelet-kurtosis and curvelet-kurtosis on an empirical basis, but relatively few studies have compared other transform-based statistics, such as extreme values, or more recent tools such as Higher Criticism (HC), or proposed `best possible' choices for such statistics. In this paper we consider two models for transform-domain coefficients: (a) a power-law model, which seems suited to the wavelet coefficients of simulated cosmic strings; and (b) a sparse mixture model, which seems suitable for the curvelet coefficients of filamentary structure. For model (a), if power-law behavior holds with finite 8-th moment, excess kurtosis is an asymptotically optimal detector, but if the 8-th moment is not finite, a test based on extreme values is asymptotically optimal. For model (b), if the transform coefficients are very sparse, a recent test, Higher Criticism, is an optimal detector, but if they are dense, kurtosis is an optimal detector. Empirical wavelet coefficients of simulated cosmic strings have power-law character, infinite 8-th moment, while curvelet coefficients of the simulated cosmic strings are not very sparse. In all cases, excess kurtosis seems to be an effective test in moderate-resolution imagery. △ Less

Submitted 16 March, 2005; originally announced March 2005.

Comments: Manuscript will all figures can be download at: http://jstarck.free.fr/HC04.pdf

Showing 1–50 of 53 results for author: Donoho, D