Search | arXiv e-print repository

Learning sources of variability from high-dimensional observational studies

Authors: Eric W. Bridgeford, Jaewon Chung, Brian Gilbert, Sambit Panda, Adam Li, Cencheng Shen, Alexandra Badea, Brian Caffo, Joshua T. Vogelstein

Abstract: Causal inference studies whether the presence of a variable influences an observed outcome. As measured by quantities such as the "average treatment effect," this paradigm is employed across numerous biological fields, from vaccine and drug development to policy interventions. Unfortunately, the majority of these methods are often limited to univariate outcomes. Our work generalizes causal estiman… ▽ More Causal inference studies whether the presence of a variable influences an observed outcome. As measured by quantities such as the "average treatment effect," this paradigm is employed across numerous biological fields, from vaccine and drug development to policy interventions. Unfortunately, the majority of these methods are often limited to univariate outcomes. Our work generalizes causal estimands to outcomes with any number of dimensions or any measurable space, and formulates traditional causal estimands for nominal variables as causal discrepancy tests. We propose a simple technique for adjusting universally consistent conditional independence tests and prove that these tests are universally consistent causal discrepancy tests. Numerical experiments illustrate that our method, Causal CDcorr, leads to improvements in both finite sample validity and power when compared to existing strategies. Our methods are all open source and available at github.com/ebridge2/cdcorr. △ Less

Submitted 28 November, 2023; v1 submitted 25 July, 2023; originally announced July 2023.

arXiv:2303.17589 [pdf, other]

Polarity is all you need to learn and transfer faster

Authors: Qingyang Wang, Michael A. Powell, Ali Geisa, Eric W. Bridgeford, Joshua T. Vogelstein

Abstract: Natural intelligences (NIs) thrive in a dynamic world - they learn quickly, sometimes with only a few samples. In contrast, artificial intelligences (AIs) typically learn with a prohibitive number of training samples and computational power. What design principle difference between NI and AI could contribute to such a discrepancy? Here, we investigate the role of weight polarity: development proce… ▽ More Natural intelligences (NIs) thrive in a dynamic world - they learn quickly, sometimes with only a few samples. In contrast, artificial intelligences (AIs) typically learn with a prohibitive number of training samples and computational power. What design principle difference between NI and AI could contribute to such a discrepancy? Here, we investigate the role of weight polarity: development processes initialize NIs with advantageous polarity configurations; as NIs grow and learn, synapse magnitudes update, yet polarities are largely kept unchanged. We demonstrate with simulation and image classification tasks that if weight polarities are adequately set a priori, then networks learn with less time and data. We also explicitly illustrate situations in which a priori setting the weight polarities is disadvantageous for networks. Our work illustrates the value of weight polarities from the perspective of statistical and computational efficiency during learning. △ Less

Submitted 30 May, 2023; v1 submitted 29 March, 2023; originally announced March 2023.

Comments: ICML camera-ready

arXiv:2302.14186 [pdf, other]

Approximately optimal domain adaptation with Fisher's Linear Discriminant

Authors: Hayden S. Helm, Ashwin De Silva, Joshua T. Vogelstein, Carey E. Priebe, Weiwei Yang

Abstract: We propose a class of models based on Fisher's Linear Discriminant (FLD) in the context of domain adaptation. The class is the convex combination of two hypotheses: i) an average hypothesis representing previously seen source tasks and ii) a hypothesis trained on a new target task. For a particular generative setting we derive the optimal convex combination of the two models under 0-1 loss, propos… ▽ More We propose a class of models based on Fisher's Linear Discriminant (FLD) in the context of domain adaptation. The class is the convex combination of two hypotheses: i) an average hypothesis representing previously seen source tasks and ii) a hypothesis trained on a new target task. For a particular generative setting we derive the optimal convex combination of the two models under 0-1 loss, propose a computable approximation, and study the effect of various parameter settings on the relative risks between the optimal hypothesis, hypothesis i), and hypothesis ii). We demonstrate the effectiveness of the proposed optimal classifier in the context of EEG- and ECG-based classification settings and argue that the optimal classifier can be computed without access to direct information from any of the individual source tasks. We conclude by discussing further applications, limitations, and possible future directions. △ Less

Submitted 1 March, 2024; v1 submitted 27 February, 2023; originally announced February 2023.

arXiv:2208.10967 [pdf, other]

The Value of Out-of-Distribution Data

Authors: Ashwin De Silva, Rahul Ramesh, Carey E. Priebe, Pratik Chaudhari, Joshua T. Vogelstein

Abstract: We expect the generalization error to improve with more samples from a similar task, and to deteriorate with more samples from an out-of-distribution (OOD) task. In this work, we show a counter-intuitive phenomenon: the generalization error of a task can be a non-monotonic function of the number of OOD samples. As the number of OOD samples increases, the generalization error on the target task imp… ▽ More We expect the generalization error to improve with more samples from a similar task, and to deteriorate with more samples from an out-of-distribution (OOD) task. In this work, we show a counter-intuitive phenomenon: the generalization error of a task can be a non-monotonic function of the number of OOD samples. As the number of OOD samples increases, the generalization error on the target task improves before deteriorating beyond a threshold. In other words, there is value in training on small amounts of OOD data. We use Fisher's Linear Discriminant on synthetic datasets and deep networks on computer vision benchmarks such as MNIST, CIFAR-10, CINIC-10, PACS and DomainNet to demonstrate and analyze this phenomenon. In the idealistic setting where we know which samples are OOD, we show that these non-monotonic trends can be exploited using an appropriately weighted objective of the target and OOD empirical risk. While its practical utility is limited, this does suggest that if we can detect OOD samples, then there may be ways to benefit from them. When we do not know which samples are OOD, we show how a number of go-to strategies such as data-augmentation, hyper-parameter optimization, and pre-training are not enough to ensure that the target generalization error does not deteriorate with the number of OOD samples in the dataset. △ Less

Submitted 13 July, 2023; v1 submitted 23 August, 2022; originally announced August 2022.

Comments: Previous versions of this work have been presented at the Out-of-Distribution Generalization in Computer Vision (OOD-CV) Workshop (ECCV 2022) and the Workshop on Distribution Shifts (NeurIPS 2022)

Journal ref: Proceedings of the 40th International Conference on Machine Learning, PMLR 202:7366-7389, 2023

arXiv:2208.03211 [pdf, other]

Why do networks have inhibitory/negative connections?

Authors: Qingyang Wang, Michael A. Powell, Ali Geisa, Eric Bridgeford, Carey E. Priebe, Joshua T. Vogelstein

Abstract: Why do brains have inhibitory connections? Why do deep networks have negative weights? We propose an answer from the perspective of representation capacity. We believe representing functions is the primary role of both (i) the brain in natural intelligence, and (ii) deep networks in artificial intelligence. Our answer to why there are inhibitory/negative weights is: to learn more functions. We pro… ▽ More Why do brains have inhibitory connections? Why do deep networks have negative weights? We propose an answer from the perspective of representation capacity. We believe representing functions is the primary role of both (i) the brain in natural intelligence, and (ii) deep networks in artificial intelligence. Our answer to why there are inhibitory/negative weights is: to learn more functions. We prove that, in the absence of negative weights, neural networks with non-decreasing activation functions are not universal approximators. While this may be an intuitive result to some, to the best of our knowledge, there is no formal theory, in either machine learning or neuroscience, that demonstrates why negative weights are crucial in the context of representation capacity. Further, we provide insights on the geometric properties of the representation space that non-negative deep networks cannot represent. We expect these insights will yield a deeper understanding of more sophisticated inductive priors imposed on the distribution of weights that lead to more efficient biological and machine learning. △ Less

Submitted 17 August, 2023; v1 submitted 5 August, 2022; originally announced August 2022.

Comments: ICCV2023 camera-ready

arXiv:2201.13001 [pdf, other]

Deep Discriminative to Kernel Density Graph for In- and Out-of-distribution Calibrated Inference

Authors: Jayanta Dey, Haoyin Xu, Will LeVine, Ashwin De Silva, Tyler M. Tomita, Ali Geisa, Tiffany Chu, Jacob Desman, Joshua T. Vogelstein

Abstract: Deep discriminative approaches like random forests and deep neural networks have recently found applications in many important real-world scenarios. However, deploying these learning algorithms in safety-critical applications raises concerns, particularly when it comes to ensuring confidence calibration for both in-distribution and out-of-distribution data points. Many popular methods for in-distr… ▽ More Deep discriminative approaches like random forests and deep neural networks have recently found applications in many important real-world scenarios. However, deploying these learning algorithms in safety-critical applications raises concerns, particularly when it comes to ensuring confidence calibration for both in-distribution and out-of-distribution data points. Many popular methods for in-distribution (ID) calibration, such as isotonic and Platt's sigmoidal regression, exhibit excellent ID calibration performance. However, these methods are not calibrated for the entire feature space, leading to overconfidence in the case of out-of-distribution (OOD) samples. On the other end of the spectrum, existing out-of-distribution (OOD) calibration methods generally exhibit poor in-distribution (ID) calibration. In this paper, we address ID and OOD calibration problems jointly. We leveraged the fact that deep models, including both random forests and deep-nets, learn internal representations which are unions of polytopes with affine activation functions to conceptualize them both as partitioning rules of the feature space. We replace the affine function in each polytope populated by the training data with a Gaussian kernel. Our experiments on both tabular and vision benchmarks show that the proposed approaches obtain well-calibrated posteriors while mostly preserving or improving the classification accuracy of the original algorithm for ID region, and extrapolate beyond the training data to handle OOD inputs appropriately. △ Less

Submitted 7 June, 2024; v1 submitted 31 January, 2022; originally announced January 2022.

arXiv:2201.07372 [pdf, other]

Prospective Learning: Principled Extrapolation to the Future

Authors: Ashwin De Silva, Rahul Ramesh, Lyle Ungar, Marshall Hussain Shuler, Noah J. Cowan, Michael Platt, Chen Li, Leyla Isik, Seung-Eon Roh, Adam Charles, Archana Venkataraman, Brian Caffo, Javier J. How, Justus M Kebschull, John W. Krakauer, Maxim Bichuch, Kaleab Alemayehu Kinfu, Eva Yezerets, Dinesh Jayaraman, Jong M. Shin, Soledad Villar, Ian Phillips, Carey E. Priebe, Thomas Hartung, Michael I. Miller , et al. (18 additional authors not shown)

Abstract: Learning is a process which can update decision rules, based on past experience, such that future performance improves. Traditionally, machine learning is often evaluated under the assumption that the future will be identical to the past in distribution or change adversarially. But these assumptions can be either too optimistic or pessimistic for many problems in the real world. Real world scenari… ▽ More Learning is a process which can update decision rules, based on past experience, such that future performance improves. Traditionally, machine learning is often evaluated under the assumption that the future will be identical to the past in distribution or change adversarially. But these assumptions can be either too optimistic or pessimistic for many problems in the real world. Real world scenarios evolve over multiple spatiotemporal scales with partially predictable dynamics. Here we reformulate the learning problem to one that centers around this idea of dynamic futures that are partially learnable. We conjecture that certain sequences of tasks are not retrospectively learnable (in which the data distribution is fixed), but are prospectively learnable (in which distributions may be dynamic), suggesting that prospective learning is more difficult in kind than retrospective learning. We argue that prospective learning more accurately characterizes many real world problems that (1) currently stymie existing artificial intelligence solutions and/or (2) lack adequate explanations for how natural intelligences solve them. Thus, studying prospective learning will lead to deeper insights and solutions to currently vexing challenges in both natural and artificial intelligences. △ Less

Submitted 13 July, 2023; v1 submitted 18 January, 2022; originally announced January 2022.

Comments: Accepted at the 2nd Conference on Lifelong Learning Agents (CoLLAs), 2023

arXiv:2111.05366 [pdf, other]

Graph Matching via Optimal Transport

Authors: Ali Saad-Eldin, Benjamin D. Pedigo, Carey E. Priebe, Joshua T. Vogelstein

Abstract: The graph matching problem seeks to find an alignment between the nodes of two graphs that minimizes the number of adjacency disagreements. Solving the graph matching is increasingly important due to it's applications in operations research, computer vision, neuroscience, and more. However, current state-of-the-art algorithms are inefficient in matching very large graphs, though they produce good… ▽ More The graph matching problem seeks to find an alignment between the nodes of two graphs that minimizes the number of adjacency disagreements. Solving the graph matching is increasingly important due to it's applications in operations research, computer vision, neuroscience, and more. However, current state-of-the-art algorithms are inefficient in matching very large graphs, though they produce good accuracy. The main computational bottleneck of these algorithms is the linear assignment problem, which must be solved at each iteration. In this paper, we leverage the recent advances in the field of optimal transport to replace the accepted use of linear assignment algorithms. We present GOAT, a modification to the state-of-the-art graph matching approximation algorithm "FAQ" (Vogelstein, 2015), replacing its linear sum assignment step with the "Lightspeed Optimal Transport" method of Cuturi (2013). The modification provides improvements to both speed and empirical matching accuracy. The effectiveness of the approach is demonstrated in matching graphs in simulated and real data examples. △ Less

Submitted 9 November, 2021; originally announced November 2021.

arXiv:2110.08483 [pdf, other]

Simplest Streaming Trees

Authors: Haoyin Xu, Jayanta Dey, Sambit Panda, Joshua T. Vogelstein

Abstract: Decision forests, including random forests and gradient boosting trees, remain the leading machine learning methods for many real-world data problems, especially on tabular data. However, most of the current implementations only operate in batch mode, and therefore cannot incrementally update when more data arrive. Several previous works developed streaming trees and ensembles to overcome this lim… ▽ More Decision forests, including random forests and gradient boosting trees, remain the leading machine learning methods for many real-world data problems, especially on tabular data. However, most of the current implementations only operate in batch mode, and therefore cannot incrementally update when more data arrive. Several previous works developed streaming trees and ensembles to overcome this limitation. Nonetheless, we found that those state-of-the-art algorithms suffer from a number of drawbacks, including low accuracy on some problems and high memory usage on others. We therefore developed the simplest possible extension of decision trees: given new data, simply update existing trees by continuing to grow them, and replace some old trees with new ones to control the total number of trees. In a benchmark suite containing 72 classification problems (the OpenML-CC18 data suite), we illustrate that our approach, Stream Decision Forest (SDF), does not suffer from either of the aforementioned limitations. On those datasets, we also demonstrate that our approach often performs as well, and sometimes even better, than conventional batch decision forest algorithm. Thus, SDFs establish a simple standard for streaming trees and forests that could readily be applied to many real-world problems. △ Less

Submitted 24 October, 2023; v1 submitted 16 October, 2021; originally announced October 2021.

arXiv:2109.14501 [pdf, other]

Towards a theory of out-of-distribution learning

Authors: Jayanta Dey, Ali Geisa, Ronak Mehta, Tyler M. Tomita, Hayden S. Helm, Haoyin Xu, Eric Eaton, Jeffery Dick, Carey E. Priebe, Joshua T. Vogelstein

Abstract: Learning is a process wherein a learning agent enhances its performance through exposure of experience or data. Throughout this journey, the agent may encounter diverse learning environments. For example, data may be presented to the leaner all at once, in multiple batches, or sequentially. Furthermore, the distribution of each data sample could be either identical and independent (iid) or non-iid… ▽ More Learning is a process wherein a learning agent enhances its performance through exposure of experience or data. Throughout this journey, the agent may encounter diverse learning environments. For example, data may be presented to the leaner all at once, in multiple batches, or sequentially. Furthermore, the distribution of each data sample could be either identical and independent (iid) or non-iid. Additionally, there may exist computational and space constraints for the deployment of the learning algorithms. The complexity of a learning task can vary significantly, depending on the learning setup and the constraints imposed upon it. However, it is worth noting that the current literature lacks formal definitions for many of the in-distribution and out-of-distribution learning paradigms. Establishing proper and universally agreed-upon definitions for these learning setups is essential for thoroughly exploring the evolution of ideas across different learning scenarios and deriving generalized mathematical bounds for these learners. In this paper, we aim to address this issue by proposing a chronological approach to defining different learning tasks using the provably approximately correct (PAC) learning framework. We will start with in-distribution learning and progress to recently proposed lifelong or continual learning. We employ consistent terminology and notation to demonstrate how each of these learning frameworks represents a specific instance of a broader, more generalized concept of learnability. Our hope is that this work will inspire a universally agreed-upon approach to quantifying different types of learning, fostering greater understanding and progress in the field. △ Less

Submitted 7 June, 2024; v1 submitted 29 September, 2021; originally announced September 2021.

arXiv:2108.13637 [pdf, other]

When are Deep Networks really better than Decision Forests at small sample sizes, and how?

Authors: Haoyin Xu, Kaleab A. Kinfu, Will LeVine, Sambit Panda, Jayanta Dey, Michael Ainsworth, Yu-Chung Peng, Madi Kusmanov, Florian Engert, Christopher M. White, Joshua T. Vogelstein, Carey E. Priebe

Abstract: Deep networks and decision forests (such as random forests and gradient boosted trees) are the leading machine learning methods for structured and tabular data, respectively. Many papers have empirically compared large numbers of classifiers on one or two different domains (e.g., on 100 different tabular data settings). However, a careful conceptual and empirical comparison of these two strategies… ▽ More Deep networks and decision forests (such as random forests and gradient boosted trees) are the leading machine learning methods for structured and tabular data, respectively. Many papers have empirically compared large numbers of classifiers on one or two different domains (e.g., on 100 different tabular data settings). However, a careful conceptual and empirical comparison of these two strategies using the most contemporary best practices has yet to be performed. Conceptually, we illustrate that both can be profitably viewed as "partition and vote" schemes. Specifically, the representation space that they both learn is a partitioning of feature space into a union of convex polytopes. For inference, each decides on the basis of votes from the activated nodes. This formulation allows for a unified basic understanding of the relationship between these methods. Empirically, we compare these two strategies on hundreds of tabular data settings, as well as several vision and auditory settings. Our focus is on datasets with at most 10,000 samples, which represent a large fraction of scientific and biomedical datasets. In general, we found forests to excel at tabular and structured data (vision and audition) with small sample sizes, whereas deep nets performed better on structured data with larger sample sizes. This suggests that further gains in both scenarios may be realized via further combining aspects of forests and networks. We will continue revising this technical report in the coming months with updated results. △ Less

Submitted 2 November, 2021; v1 submitted 31 August, 2021; originally announced August 2021.

arXiv:2107.11732 [pdf, other]

Federated Causal Inference in Heterogeneous Observational Data

Authors: Ruoxuan Xiong, Allison Koenecke, Michael Powell, Zhu Shen, Joshua T. Vogelstein, Susan Athey

Abstract: We are interested in estimating the effect of a treatment applied to individuals at multiple sites, where data is stored locally for each site. Due to privacy constraints, individual-level data cannot be shared across sites; the sites may also have heterogeneous populations and treatment assignment mechanisms. Motivated by these considerations, we develop federated methods to draw inference on the… ▽ More We are interested in estimating the effect of a treatment applied to individuals at multiple sites, where data is stored locally for each site. Due to privacy constraints, individual-level data cannot be shared across sites; the sites may also have heterogeneous populations and treatment assignment mechanisms. Motivated by these considerations, we develop federated methods to draw inference on the average treatment effects of combined data across sites. Our methods first compute summary statistics locally using propensity scores and then aggregate these statistics across sites to obtain point and variance estimators of average treatment effects. We show that these estimators are consistent and asymptotically normal. To achieve these asymptotic properties, we find that the aggregation schemes need to account for the heterogeneity in treatment assignments and in outcomes across sites. We demonstrate the validity of our federated methods through a comparative study of two large medical claims databases. △ Less

Submitted 2 April, 2023; v1 submitted 25 July, 2021; originally announced July 2021.

arXiv:2106.02701 [pdf, other]

Hidden Markov Modeling for Maximum Likelihood Neuron Reconstruction

Authors: Thomas L. Athey, Daniel J. Tward, Ulrich Mueller, Joshua T. Vogelstein, Michael I. Miller

Abstract: Recent advances in brain clearing and imaging have made it possible to image entire mammalian brains at sub-micron resolution. These images offer the potential to assemble brain-wide atlases of neuron morphology, but manual neuron reconstruction remains a bottleneck. Several automatic reconstruction algorithms exist, but most focus on single neuron images. In this paper, we present a probabilistic… ▽ More Recent advances in brain clearing and imaging have made it possible to image entire mammalian brains at sub-micron resolution. These images offer the potential to assemble brain-wide atlases of neuron morphology, but manual neuron reconstruction remains a bottleneck. Several automatic reconstruction algorithms exist, but most focus on single neuron images. In this paper, we present a probabilistic reconstruction method, ViterBrain, which combines a hidden Markov state process that encodes neuron geometry with a random field appearance model of neuron fluorescence. Our method utilizes dynamic programming to compute the global maximizers of what we call the "most probable" neuron path. Our most probable estimation method models the task of reconstructing neuronal processes in the presence of other neurons, and thus is applicable in images with several neurons. Our method operates on image segmentations in order to leverage cutting edge computer vision technology. We applied our algorithm to imperfect image segmentations where false negatives severed neuronal processes, and showed that it can follow axons in the presence of noise or nearby neurons. Additionally, it creates a framework where users can intervene to, for example, fit start and endpoints. The code used in this work is available in our open-source Python package brainlit. △ Less

Submitted 27 January, 2022; v1 submitted 4 June, 2021; originally announced June 2021.

arXiv:2104.01532 [pdf, other]

doi 10.3389/fninf.2021.704627

Fitting Splines to Axonal Arbors Quantifies Relationship between Branch Order and Geometry

Authors: Thomas L. Athey, Jacopo Teneggi, Joshua T. Vogelstein, Daniel Tward, Ulrich Mueller, Michael I. Miller

Abstract: Neuromorphology is crucial to identifying neuronal subtypes and understanding learning. It is also implicated in neurological disease. However, standard morphological analysis focuses on macroscopic features such as branching frequency and connectivity between regions, and often neglects the internal geometry of neurons. In this work, we treat neuron trace points as a sampling of differentiable cu… ▽ More Neuromorphology is crucial to identifying neuronal subtypes and understanding learning. It is also implicated in neurological disease. However, standard morphological analysis focuses on macroscopic features such as branching frequency and connectivity between regions, and often neglects the internal geometry of neurons. In this work, we treat neuron trace points as a sampling of differentiable curves and fit them with a set of branching B-splines. We designed our representation with the Frenet-Serret formulas from differential geometry in mind. The Frenet-Serret formulas completely characterize smooth curves, and involve two parameters, curvature and torsion. Our representation makes it possible to compute these parameters from neuron traces in closed form. These parameters are defined continuously along the curve, in contrast to other parameters like tortuosity which depend on start and end points. We applied our method to a dataset of cortical projection neurons traced in two mouse brains, and found that the parameters are distributed differently between primary, collateral, and terminal axon branches, thus quantifying geometric differences between different components of an axonal arbor. The results agreed in both brains, further validating our representation. The code used in this work can be readily applied to neuron traces in SWC format and is available in our open-source Python package brainlit: http://brainlit.neurodata.io/. △ Less

Submitted 5 June, 2021; v1 submitted 3 April, 2021; originally announced April 2021.

Journal ref: Front. Neuroinform. 15 (2021)

arXiv:2011.06557 [pdf, other]

A partition-based similarity for classification distributions

Authors: Hayden S. Helm, Ronak D. Mehta, Brandon Duderstadt, Weiwei Yang, Christoper M. White, Ali Geisa, Joshua T. Vogelstein, Carey E. Priebe

Abstract: Herein we define a measure of similarity between classification distributions that is both principled from the perspective of statistical pattern recognition and useful from the perspective of machine learning practitioners. In particular, we propose a novel similarity on classification distributions, dubbed task similarity, that quantifies how an optimally-transformed optimal representation for a… ▽ More Herein we define a measure of similarity between classification distributions that is both principled from the perspective of statistical pattern recognition and useful from the perspective of machine learning practitioners. In particular, we propose a novel similarity on classification distributions, dubbed task similarity, that quantifies how an optimally-transformed optimal representation for a source distribution performs when applied to inference related to a target distribution. The definition of task similarity allows for natural definitions of adversarial and orthogonal distributions. We highlight limiting properties of representations induced by (universally) consistent decision rules and demonstrate in simulation that an empirical estimate of task similarity is a function of the decision rule deployed for inference. We demonstrate that for a given target distribution, both transfer efficiency and semantic similarity of candidate source distributions correlate with empirical task similarity. △ Less

Submitted 12 November, 2020; originally announced November 2020.

arXiv:2007.13843 [pdf, other]

Robust Similarity and Distance Learning via Decision Forests

Authors: Tyler M. Tomita, Joshua T. Vogelstein

Abstract: Canonical distances such as Euclidean distance often fail to capture the appropriate relationships between items, subsequently leading to subpar inference and prediction. Many algorithms have been proposed for automated learning of suitable distances, most of which employ linear methods to learn a global metric over the feature space. While such methods offer nice theoretical properties, interpret… ▽ More Canonical distances such as Euclidean distance often fail to capture the appropriate relationships between items, subsequently leading to subpar inference and prediction. Many algorithms have been proposed for automated learning of suitable distances, most of which employ linear methods to learn a global metric over the feature space. While such methods offer nice theoretical properties, interpretability, and computationally efficient means for implementing them, they are limited in expressive capacity. Methods which have been designed to improve expressiveness sacrifice one or more of the nice properties of the linear methods. To bridge this gap, we propose a highly expressive novel decision forest algorithm for the task of distance learning, which we call Similarity and Metric Random Forests (SMERF). We show that the tree construction procedure in SMERF is a proper generalization of standard classification and regression trees. Thus, the mathematical driving forces of SMERF are examined via its direct connection to regression forests, for which theory has been developed. Its ability to approximate arbitrary distances and identify important features is empirically demonstrated on simulated data sets. Last, we demonstrate that it accurately predicts links in networks. △ Less

Submitted 21 August, 2020; v1 submitted 27 July, 2020; originally announced July 2020.

Comments: Submitted to NeurIPS 2020

arXiv:2005.11890 [pdf, other]

mvlearn: Multiview Machine Learning in Python

Authors: Ronan Perry, Gavin Mischler, Richard Guo, Theodore Lee, Alexander Chang, Arman Koul, Cameron Franz, Hugo Richard, Iain Carmichael, Pierre Ablin, Alexandre Gramfort, Joshua T. Vogelstein

Abstract: As data are generated more and more from multiple disparate sources, multiview data sets, where each sample has features in distinct views, have ballooned in recent years. However, no comprehensive package exists that enables non-specialists to use these methods easily. mvlearn is a Python library which implements the leading multiview machine learning methods. Its simple API closely follows that… ▽ More As data are generated more and more from multiple disparate sources, multiview data sets, where each sample has features in distinct views, have ballooned in recent years. However, no comprehensive package exists that enables non-specialists to use these methods easily. mvlearn is a Python library which implements the leading multiview machine learning methods. Its simple API closely follows that of scikit-learn for increased ease-of-use. The package can be installed from Python Package Index (PyPI) and the conda package manager and is released under the MIT open-source license. The documentation, detailed examples, and all releases are available at https://mvlearn.github.io/. △ Less

Submitted 25 May, 2021; v1 submitted 24 May, 2020; originally announced May 2020.

Comments: 6 pages, 2 figures, 1 table

arXiv:2005.10700 [pdf, other]

Distance-based Positive and Unlabeled Learning for Ranking

Authors: Hayden S. Helm, Amitabh Basu, Avanti Athreya, Youngser Park, Joshua T. Vogelstein, Carey E. Priebe, Michael Winding, Marta Zlatic, Albert Cardona, Patrick Bourke, Jonathan Larson, Marah Abdin, Piali Choudhury, Weiwei Yang, Christopher W. White

Abstract: Learning to rank -- producing a ranked list of items specific to a query and with respect to a set of supervisory items -- is a problem of general interest. The setting we consider is one in which no analytic description of what constitutes a good ranking is available. Instead, we have a collection of representations and supervisory information consisting of a (target item, interesting items set)… ▽ More Learning to rank -- producing a ranked list of items specific to a query and with respect to a set of supervisory items -- is a problem of general interest. The setting we consider is one in which no analytic description of what constitutes a good ranking is available. Instead, we have a collection of representations and supervisory information consisting of a (target item, interesting items set) pair. We demonstrate analytically, in simulation, and in real data examples that learning to rank via combining representations using an integer linear program is effective when the supervision is as light as "these few items are similar to your item of interest." While this nomination task is quite general, for specificity we present our methodology from the perspective of vertex nomination in graphs. The methodology described herein is model agnostic. △ Less

Submitted 28 September, 2022; v1 submitted 19 May, 2020; originally announced May 2020.

Comments: 21 pages, 5 figures

arXiv:2004.12926 [pdf]

A New Age of Computing and the Brain

Authors: Polina Golland, Jack Gallant, Greg Hager, Hanspeter Pfister, Christos Papadimitriou, Stefan Schaal, Joshua T. Vogelstein

Abstract: The history of computer science and brain sciences are intertwined. In his unfinished manuscript "The Computer and the Brain," von Neumann debates whether or not the brain can be thought of as a computing machine and identifies some of the similarities and differences between natural and artificial computation. Turing, in his 1950 article in Mind, argues that computing devices could ultimately emu… ▽ More The history of computer science and brain sciences are intertwined. In his unfinished manuscript "The Computer and the Brain," von Neumann debates whether or not the brain can be thought of as a computing machine and identifies some of the similarities and differences between natural and artificial computation. Turing, in his 1950 article in Mind, argues that computing devices could ultimately emulate intelligence, leading to his proposed Turing test. Herbert Simon predicted in 1957 that most psychological theories would take the form of a computer program. In 1976, David Marr proposed that the function of the visual system could be abstracted and studied at computational and algorithmic levels that did not depend on the underlying physical substrate. In December 2014, a two-day workshop supported by the Computing Community Consortium (CCC) and the National Science Foundation's Computer and Information Science and Engineering Directorate (NSF CISE) was convened in Washington, DC, with the goal of bringing together computer scientists and brain researchers to explore these new opportunities and connections, and develop a new, modern dialogue between the two research communities. Specifically, our objectives were: 1. To articulate a conceptual framework for research at the interface of brain sciences and computing and to identify key problems in this interface, presented in a way that will attract both CISE and brain researchers into this space. 2. To inform and excite researchers within the CISE research community about brain research opportunities and to identify and explain strategic roles they can play in advancing this initiative. 3. To develop new connections, conversations and collaborations between brain sciences and CISE researchers that will lead to highly relevant and competitive proposals, high-impact research, and influential publications. △ Less

Submitted 27 April, 2020; originally announced April 2020.

Comments: A Computing Community Consortium (CCC) workshop report, 24 pages

Report number: ccc2014report_5

arXiv:2004.12908 [pdf, other]

A Simple Lifelong Learning Approach

Authors: Joshua T. Vogelstein, Jayanta Dey, Hayden S. Helm, Will LeVine, Ronak D. Mehta, Tyler M. Tomita, Haoyin Xu, Ali Geisa, Qingyang Wang, Gido M. van de Ven, Chenyu Gao, Weiwei Yang, Bryan Tower, Jonathan Larson, Christopher M. White, Carey E. Priebe

Abstract: In lifelong learning, data are used to improve performance not only on the present task, but also on past and future (unencountered) tasks. While typical transfer learning algorithms can improve performance on future tasks, their performance on prior tasks degrades upon learning new tasks (called forgetting). Many recent approaches for continual or lifelong learning have attempted to maintain perf… ▽ More In lifelong learning, data are used to improve performance not only on the present task, but also on past and future (unencountered) tasks. While typical transfer learning algorithms can improve performance on future tasks, their performance on prior tasks degrades upon learning new tasks (called forgetting). Many recent approaches for continual or lifelong learning have attempted to maintain performance on old tasks given new tasks. But striving to avoid forgetting sets the goal unnecessarily low. The goal of lifelong learning should be to use data to improve performance on both future tasks (forward transfer) and past tasks (backward transfer). In this paper, we show that a simple approach -- representation ensembling -- demonstrates both forward and backward transfer in a variety of simulated and benchmark data scenarios, including tabular, vision (CIFAR-100, 5-dataset, Split Mini-Imagenet, and Food1k), and speech (spoken digit), in contrast to various reference algorithms, which typically failed to transfer either forward or backward, or both. Moreover, our proposed approach can flexibly operate with or without a computational budget. △ Less

Submitted 11 June, 2024; v1 submitted 27 April, 2020; originally announced April 2020.

arXiv:1912.12150 [pdf, other]

doi 10.1080/10618600.2021.1938585

The Chi-Square Test of Distance Correlation

Authors: Cencheng Shen, Sambit Panda, Joshua T. Vogelstein

Abstract: Distance correlation has gained much recent attention in the data science community: the sample statistic is straightforward to compute and asymptotically equals zero if and only if independence, making it an ideal choice to discover any type of dependency structure given sufficient sample size. One major bottleneck is the testing process: because the null distribution of distance correlation depe… ▽ More Distance correlation has gained much recent attention in the data science community: the sample statistic is straightforward to compute and asymptotically equals zero if and only if independence, making it an ideal choice to discover any type of dependency structure given sufficient sample size. One major bottleneck is the testing process: because the null distribution of distance correlation depends on the underlying random variables and metric choice, it typically requires a permutation test to estimate the null and compute the p-value, which is very costly for large amount of data. To overcome the difficulty, in this paper we propose a chi-square test for distance correlation. Method-wise, the chi-square test is non-parametric, extremely fast, and applicable to bias-corrected distance correlation using any strong negative type metric or characteristic kernel. The test exhibits a similar testing power as the standard permutation test, and can be utilized for K-sample and partial testing. Theory-wise, we show that the underlying chi-square distribution well approximates and dominates the limiting null distribution in upper tail, prove the chi-square test can be valid and universally consistent for testing independence, and establish a testing power inequality with respect to the permutation test. △ Less

Submitted 14 May, 2021; v1 submitted 27 December, 2019; originally announced December 2019.

Comments: 21 pages, 4 figures, 1 table

Journal ref: Journal of Computational and Graphical Statistics 31(1), 254-262, 2022

arXiv:1910.08883 [pdf, other]

High-dimensional and universally consistent k-sample tests

Authors: Sambit Panda, Cencheng Shen, Ronan Perry, Jelle Zorn, Antoine Lutz, Carey E. Priebe, Joshua T. Vogelstein

Abstract: The k-sample testing problem involves determining whether $k$ groups of data points are each drawn from the same distribution. The standard method for k-sample testing in biomedicine is Multivariate analysis of variance (MANOVA), despite that it depends on strong, and often unsuitable, parametric assumptions. Moreover, independence testing and k-sample testing are closely related, and several univ… ▽ More The k-sample testing problem involves determining whether $k$ groups of data points are each drawn from the same distribution. The standard method for k-sample testing in biomedicine is Multivariate analysis of variance (MANOVA), despite that it depends on strong, and often unsuitable, parametric assumptions. Moreover, independence testing and k-sample testing are closely related, and several universally consistent high-dimensional independence tests such as distance correlation (Dcorr) and Hilbert-Schmidt-Independence-Criterion (Hsic) enjoy solid theoretical and empirical properties. In this paper, we prove that independence tests achieve universally consistent k-sample testing and that k-sample statistics such as Energy and Maximum Mean Discrepancy (MMD) are precisely equivalent to Dcorr. An empirical evaluation of nonparametric independence tests showed that they generally perform better than the popular MANOVA test, even in Gaussian distributed scenarios. The evaluation included several popular independence statistics and covered a comprehensive set of simulations. Additionally, the testing approach was extended to perform multiway and multilevel tests, which were demonstrated in a simulated study as well as a real-world fMRI brain scans with a set of attributes. △ Less

Submitted 11 October, 2023; v1 submitted 19 October, 2019; originally announced October 2019.

arXiv:1909.11799 [pdf, other]

Manifold Oblique Random Forests: Towards Closing the Gap on Convolutional Deep Networks

Authors: Adam Li, Ronan Perry, Chester Huynh, Tyler M. Tomita, Ronak Mehta, Jesus Arroyo, Jesse Patsolic, Benjamin Falk, Joshua T. Vogelstein

Abstract: Decision forests (Forests), in particular random forests and gradient boosting trees, have demonstrated state-of-the-art accuracy compared to other methods in many supervised learning scenarios. In particular, Forests dominate other methods in tabular data, that is, when the feature space is unstructured, so that the signal is invariant to a permutation of the feature indices. However, in structur… ▽ More Decision forests (Forests), in particular random forests and gradient boosting trees, have demonstrated state-of-the-art accuracy compared to other methods in many supervised learning scenarios. In particular, Forests dominate other methods in tabular data, that is, when the feature space is unstructured, so that the signal is invariant to a permutation of the feature indices. However, in structured data lying on a manifold (such as images, text, and speech) deep networks (Networks), specifically convolutional deep networks (ConvNets), tend to outperform Forests. We conjecture that at least part of the reason for this is that the input to Networks is not simply the feature magnitudes, but also their indices. In contrast, naive Forest implementations fail to explicitly consider feature indices. A recently proposed Forest approach demonstrates that Forests, for each node, implicitly sample a random matrix from some specific distribution. These Forests, like some classes of Networks, learn by partitioning the feature space into convex polytopes corresponding to linear functions. We build on that approach and show that one can choose distributions in a manifold-aware fashion to incorporate feature locality. We demonstrate the empirical performance on data whose features live on three different manifolds: a torus, images, and time-series. Moreover, we demonstrate its strength in multivariate simulated settings and also show superiority in predicting surgical outcome in epilepsy patients and predicting movement direction from raw stereotactic EEG data from non-motor brain regions. In all simulations and real data, Manifold Oblique Random Forest (MORF) algorithm outperforms approaches that ignore feature space structure and challenges the performance of ConvNets. Moreover, MORF runs fast and maintains interpretability and theoretical justification. △ Less

Submitted 5 September, 2022; v1 submitted 25 September, 2019; originally announced September 2019.

Comments: Updated manuscript based on review at SIMODS

MSC Class: 68T05

arXiv:1909.02688 [pdf, other]

AutoGMM: Automatic and Hierarchical Gaussian Mixture Modeling in Python

Authors: Thomas L. Athey, Tingshan Liu, Benjamin D. Pedigo, Joshua T. Vogelstein

Abstract: Background: Gaussian mixture modeling is a fundamental tool in clustering, as well as discriminant analysis and semiparametric density estimation. However, estimating the optimal model for any given number of components is an NP-hard problem, and estimating the number of components is in some respects an even harder problem. Findings: In R, a popular package called mclust addresses both of these p… ▽ More Background: Gaussian mixture modeling is a fundamental tool in clustering, as well as discriminant analysis and semiparametric density estimation. However, estimating the optimal model for any given number of components is an NP-hard problem, and estimating the number of components is in some respects an even harder problem. Findings: In R, a popular package called mclust addresses both of these problems. However, Python has lacked such a package. We therefore introduce AutoGMM, a Python algorithm for automatic Gaussian mixture modeling, and its hierarchical version, HGMM. AutoGMM builds upon scikit-learn's AgglomerativeClustering and GaussianMixture classes, with certain modifications to make the results more stable. Empirically, on several different applications, AutoGMM performs approximately as well as mclust, and sometimes better. Conclusions: AutoMM, a freely available Python package, enables efficient Gaussian mixture modeling by automatically selecting the initialization, number of clusters and covariance constraints. △ Less

Submitted 12 August, 2021; v1 submitted 5 September, 2019; originally announced September 2019.

arXiv:1908.06486 [pdf, other]

Independence Testing for Temporal Data

Authors: Cencheng Shen, Jaewon Chung, Ronak Mehta, Ting Xu, Joshua T. Vogelstein

Abstract: Temporal data are increasingly prevalent in modern data science. A fundamental question is whether two time series are related or not. Existing approaches often have limitations, such as relying on parametric assumptions, detecting only linear associations, and requiring multiple tests and corrections. While many non-parametric and universally consistent dependence measures have recently been prop… ▽ More Temporal data are increasingly prevalent in modern data science. A fundamental question is whether two time series are related or not. Existing approaches often have limitations, such as relying on parametric assumptions, detecting only linear associations, and requiring multiple tests and corrections. While many non-parametric and universally consistent dependence measures have recently been proposed, directly applying them to temporal data can inflate the p-value and result in an invalid test. To address these challenges, this paper introduces the temporal dependence statistic with block permutation to test independence between temporal data. Under proper assumptions, the proposed procedure is asymptotically valid and universally consistent for testing independence between stationary time series, and capable of estimating the optimal dependence lag that maximizes the dependence. Moreover, it is compatible with a rich family of distance and kernel based dependence measures, eliminates the need for multiple testing, and exhibits excellent testing power in various simulation settings. △ Less

Submitted 27 May, 2024; v1 submitted 18 August, 2019; originally announced August 2019.

Comments: 19 pages main + 6 pages appendix

Journal ref: Transactions on Machine Learning Research, 2024

arXiv:1907.03335 [pdf, other]

Graphyti: A Semi-External Memory Graph Library for FlashGraph

Authors: Disa Mhembere, Da Zheng, Carey E. Priebe, Joshua T. Vogelstein, Randal Burns

Abstract: Graph datasets exceed the in-memory capacity of most standalone machines. Traditionally, graph frameworks have overcome memory limitations through scale-out, distributing computing. Emerging frameworks avoid the network bottleneck of distributed data with Semi-External Memory (SEM) that uses a single multicore node and operates on graphs larger than memory. In SEM, $\mathcal{O}(m)$ data resides on… ▽ More Graph datasets exceed the in-memory capacity of most standalone machines. Traditionally, graph frameworks have overcome memory limitations through scale-out, distributing computing. Emerging frameworks avoid the network bottleneck of distributed data with Semi-External Memory (SEM) that uses a single multicore node and operates on graphs larger than memory. In SEM, $\mathcal{O}(m)$ data resides on disk and $\mathcal{O}(n)$ data in memory, for a graph with $n$ vertices and $m$ edges. For developers, this adds complexity because they must explicitly encode I/O within applications. We present principles that are critical for application developers to adopt in order to achieve state-of-the-art performance, while minimizing I/O and memory for algorithms in SEM. We present them in Graphyti, an extensible parallel SEM graph library built on FlashGraph and available in Python via pip. In SEM, Graphyti achieves 80% of the performance of in-memory execution and retains the performance of FlashGraph, which outperforms distributed engines, such as PowerGraph and Galois. △ Less

Submitted 7 July, 2019; originally announced July 2019.

arXiv:1907.02844 [pdf, other]

Geodesic Learning via Unsupervised Decision Forests

Authors: Meghana Madhyastha, Percy Li, James Browne, Veronika Strnadova-Neeley, Carey E. Priebe, Randal Burns, Joshua T. Vogelstein

Abstract: Geodesic distance is the shortest path between two points in a Riemannian manifold. Manifold learning algorithms, such as Isomap, seek to learn a manifold that preserves geodesic distances. However, such methods operate on the ambient dimensionality, and are therefore fragile to noise dimensions. We developed an unsupervised random forest method (URerF) to approximately learn geodesic distances in… ▽ More Geodesic distance is the shortest path between two points in a Riemannian manifold. Manifold learning algorithms, such as Isomap, seek to learn a manifold that preserves geodesic distances. However, such methods operate on the ambient dimensionality, and are therefore fragile to noise dimensions. We developed an unsupervised random forest method (URerF) to approximately learn geodesic distances in linear and nonlinear manifolds with noise. URerF operates on low-dimensional sparse linear combinations of features, rather than the full observed dimensionality. To choose the optimal split in a computationally efficient fashion, we developed a fast Bayesian Information Criterion statistic for Gaussian mixture models. We introduce geodesic precision-recall curves which quantify performance relative to the true latent manifold. Empirical results on simulated and real data demonstrate that URerF is robust to high-dimensional noise, where as other methods, such as Isomap, UMAP, and FLANN, quickly deteriorate in such settings. In particular, URerF is able to estimate geodesic distances on a real connectome dataset better than other approaches. △ Less

Submitted 5 July, 2019; originally announced July 2019.

arXiv:1907.02088 [pdf, other]

hyppo: A Multivariate Hypothesis Testing Python Package

Authors: Sambit Panda, Satish Palaniappan, Junhao Xiong, Eric W. Bridgeford, Ronak Mehta, Cencheng Shen, Joshua T. Vogelstein

Abstract: We introduce hyppo, a unified library for performing multivariate hypothesis testing, including independence, two-sample, and k-sample testing. While many multivariate independence tests have R packages available, the interfaces are inconsistent and most are not available in Python. hyppo includes many state of the art multivariate testing procedures. The package is easy-to-use and is flexible eno… ▽ More We introduce hyppo, a unified library for performing multivariate hypothesis testing, including independence, two-sample, and k-sample testing. While many multivariate independence tests have R packages available, the interfaces are inconsistent and most are not available in Python. hyppo includes many state of the art multivariate testing procedures. The package is easy-to-use and is flexible enough to enable future extensions. The documentation and all releases are available at https://hyppo.neurodata.io. △ Less

Submitted 1 April, 2021; v1 submitted 3 July, 2019; originally announced July 2019.

Comments: 5 pages, 1 figure

arXiv:1907.00325 [pdf, other]

Random Forests for Adaptive Nearest Neighbor Estimation of Information-Theoretic Quantities

Authors: Ronan Perry, Ronak Mehta, Richard Guo, Eva Yezerets, Jesús Arroyo, Mike Powell, Hayden Helm, Cencheng Shen, Joshua T. Vogelstein

Abstract: Information-theoretic quantities, such as conditional entropy and mutual information, are critical data summaries for quantifying uncertainty. Current widely used approaches for computing such quantities rely on nearest neighbor methods and exhibit both strong performance and theoretical guarantees in certain simple scenarios. However, existing approaches fail in high-dimensional settings and when… ▽ More Information-theoretic quantities, such as conditional entropy and mutual information, are critical data summaries for quantifying uncertainty. Current widely used approaches for computing such quantities rely on nearest neighbor methods and exhibit both strong performance and theoretical guarantees in certain simple scenarios. However, existing approaches fail in high-dimensional settings and when different features are measured on different scales.We propose decision forest-based adaptive nearest neighbor estimators and show that they are able to effectively estimate posterior probabilities, conditional entropies, and mutual information even in the aforementioned settings.We provide an extensive study of efficacy for classification and posterior probability estimation, and prove certain forest-based approaches to be consistent estimators of the true posteriors and derived information-theoretic quantities under certain assumptions. In a real-world connectome application, we quantify the uncertainty about neuron type given various cellular features in the Drosophila larva mushroom body, a key challenge for modern neuroscience. △ Less

Submitted 5 October, 2021; v1 submitted 30 June, 2019; originally announced July 2019.

arXiv:1906.10026 [pdf, other]

Inference for multiple heterogeneous networks with a common invariant subspace

Authors: Jesús Arroyo, Avanti Athreya, Joshua Cape, Guodong Chen, Carey E. Priebe, Joshua T. Vogelstein

Abstract: The development of models for multiple heterogeneous network data is of critical importance both in statistical network theory and across multiple application domains. Although single-graph inference is well-studied, multiple graph inference is largely unexplored, in part because of the challenges inherent in appropriately modeling graph differences and yet retaining sufficient model simplicity to… ▽ More The development of models for multiple heterogeneous network data is of critical importance both in statistical network theory and across multiple application domains. Although single-graph inference is well-studied, multiple graph inference is largely unexplored, in part because of the challenges inherent in appropriately modeling graph differences and yet retaining sufficient model simplicity to render estimation feasible. This paper addresses exactly this gap, by introducing a new model, the common subspace independent-edge (COSIE) multiple random graph model, which describes a heterogeneous collection of networks with a shared latent structure on the vertices but potentially different connectivity patterns for each graph. The COSIE model encompasses many popular network representations, including the stochastic blockmodel. The model is both flexible enough to meaningfully account for important graph differences and tractable enough to allow for accurate inference in multiple networks. In particular, a joint spectral embedding of adjacency matrices - the multiple adjacency spectral embedding (MASE) - leads, in a COSIE model, to simultaneous consistent estimation of underlying parameters for each graph. Under mild additional assumptions, MASE estimates satisfy asymptotic normality and yield improvements for graph eigenvalue estimation and hypothesis testing. In both simulated and real data, the COSIE model and the MASE embedding can be deployed for a number of subsequent network inference tasks, including dimensionality reduction, classification, hypothesis testing and community detection. Specifically, when MASE is applied to a dataset of connectomes constructed through diffusion magnetic resonance imaging, the result is an accurate classification of brain scans by patient and a meaningful determination of heterogeneity across scans of different subjects. △ Less

Submitted 22 August, 2020; v1 submitted 24 June, 2019; originally announced June 2019.

arXiv:1904.05329 [pdf, other]

GraSPy: Graph Statistics in Python

Authors: Jaewon Chung, Benjamin D. Pedigo, Eric W. Bridgeford, Bijan K. Varjavand, Hayden S. Helm, Joshua T. Vogelstein

Abstract: We introduce GraSPy, a Python library devoted to statistical inference, machine learning, and visualization of random graphs and graph populations. This package provides flexible and easy-to-use algorithms for analyzing and understanding graphs with a scikit-learn compliant API. GraSPy can be downloaded from Python Package Index (PyPi), and is released under the Apache 2.0 open-source license. The… ▽ More We introduce GraSPy, a Python library devoted to statistical inference, machine learning, and visualization of random graphs and graph populations. This package provides flexible and easy-to-use algorithms for analyzing and understanding graphs with a scikit-learn compliant API. GraSPy can be downloaded from Python Package Index (PyPi), and is released under the Apache 2.0 open-source license. The documentation and all releases are available at https://neurodata.io/graspy. △ Less

Submitted 14 August, 2019; v1 submitted 29 March, 2019; originally announced April 2019.

Journal ref: Journal of Machine Learning Research 20.158 (2019): 1-7

arXiv:1902.09527 [pdf, other]

clusterNOR: A NUMA-Optimized Clustering Framework

Authors: Disa Mhembere, Da Zheng, Carey E. Priebe, Joshua T. Vogelstein, Randal Burns

Abstract: Clustering algorithms are iterative and have complex data access patterns that result in many small random memory accesses. The performance of parallel implementations suffer from synchronous barriers for each iteration and skewed workloads. We rethink the parallelization of clustering for modern non-uniform memory architectures (NUMA) to maximizes independent, asynchronous computation. We elimina… ▽ More Clustering algorithms are iterative and have complex data access patterns that result in many small random memory accesses. The performance of parallel implementations suffer from synchronous barriers for each iteration and skewed workloads. We rethink the parallelization of clustering for modern non-uniform memory architectures (NUMA) to maximizes independent, asynchronous computation. We eliminate many barriers, reduce remote memory accesses, and maximize cache reuse. We implement the 'Clustering NUMA Optimized Routines' (clusterNOR) extensible parallel framework that provides algorithmic building blocks. The system is generic, we demonstrate nine modern clustering algorithms that have simple implementations. clusterNOR includes (i) in-memory, (ii) semi-external memory, and (iii) distributed memory execution, enabling computation for varying memory and hardware budgets. For algorithms that rely on Euclidean distance, clusterNOR defines an updated Elkan's triangle inequality pruning algorithm that uses asymptotically less memory so that it works on billion-point data sets. clusterNOR extends and expands the scope of the 'knor' library for k-means clustering by generalizing underlying principles, providing a uniform programming interface and expanding the scope to hierarchical and linear algebraic classes of algorithms. The compound effect of our optimizations is an order of magnitude improvement in speed over other state-of-the-art solutions, such as Spark's MLlib and Apple's Turi. △ Less

Submitted 17 January, 2021; v1 submitted 24 February, 2019; originally announced February 2019.

Comments: arXiv admin note: Journal version of arXiv:1606.08905

arXiv:1812.00029 [pdf, other]

Learning Interpretable Characteristic Kernels via Decision Forests

Authors: Sambit Panda, Cencheng Shen, Joshua T. Vogelstein

Abstract: Decision forests are widely used for classification and regression tasks. A lesser known property of tree-based methods is that one can construct a proximity matrix from the tree(s), and these proximity matrices are induced kernels. While there has been extensive research on the applications and properties of kernels, there is relatively little research on kernels induced by decision forests. We c… ▽ More Decision forests are widely used for classification and regression tasks. A lesser known property of tree-based methods is that one can construct a proximity matrix from the tree(s), and these proximity matrices are induced kernels. While there has been extensive research on the applications and properties of kernels, there is relatively little research on kernels induced by decision forests. We construct Kernel Mean Embedding Random Forests (KMERF), which induce kernels from random trees and/or forests using leaf-node proximity. We introduce the notion of an asymptotically characteristic kernel, and prove that KMERF kernels are asymptotically characteristic for both discrete and continuous data. Because KMERF is data-adaptive, we suspected it would outperform kernels selected a priori on finite sample data. We illustrate that KMERF nearly dominates current state-of-the-art kernel-based tests across a diverse range of high-dimensional two-sample and independence testing settings. Furthermore, our forest-based approach is interpretable, and provides feature importance metrics that readily distinguish important dimensions, unlike other high-dimensional non-parametric testing procedures. Hence, this work demonstrates the decision forest-based kernel can be more powerful and more interpretable than existing methods, flying in the face of conventional wisdom of the trade-off between the two. △ Less

Submitted 28 September, 2023; v1 submitted 30 November, 2018; originally announced December 2018.

arXiv:1808.07801 [pdf, other]

doi 10.1073/pnas.1814462116

On a 'Two Truths' Phenomenon in Spectral Graph Clustering

Authors: Carey E. Priebe, Youngser Park, Joshua T. Vogelstein, John M. Conroy, Vince Lyzinski, Minh Tang, Avanti Athreya, Joshua Cape, Eric Bridgeford

Abstract: Clustering is concerned with coherently grouping observations without any explicit concept of true groupings. Spectral graph clustering - clustering the vertices of a graph based on their spectral embedding - is commonly approached via K-means (or, more generally, Gaussian mixture model) clustering composed with either Laplacian or Adjacency spectral embedding (LSE or ASE). Recent theoretical resu… ▽ More Clustering is concerned with coherently grouping observations without any explicit concept of true groupings. Spectral graph clustering - clustering the vertices of a graph based on their spectral embedding - is commonly approached via K-means (or, more generally, Gaussian mixture model) clustering composed with either Laplacian or Adjacency spectral embedding (LSE or ASE). Recent theoretical results provide new understanding of the problem and solutions, and lead us to a 'Two Truths' LSE vs. ASE spectral graph clustering phenomenon convincingly illustrated here via a diffusion MRI connectome data set: the different embedding methods yield different clustering results, with LSE capturing left hemisphere/right hemisphere affinity structure and ASE capturing gray matter/white matter core-periphery structure. △ Less

Submitted 11 February, 2019; v1 submitted 23 August, 2018; originally announced August 2018.

Journal ref: PNAS 116 (2019) 5995-6000

arXiv:1806.07300 [pdf, other]

Forest Packing: Fast, Parallel Decision Forests

Authors: James Browne, Tyler M. Tomita, Disa Mhembere, Randal Burns, Joshua T. Vogelstein

Abstract: Machine learning has an emerging critical role in high-performance computing to modulate simulations, extract knowledge from massive data, and replace numerical models with efficient approximations. Decision forests are a critical tool because they provide insight into model operation that is critical to interpreting learned results. While decision forests are trivially parallelizable, the travers… ▽ More Machine learning has an emerging critical role in high-performance computing to modulate simulations, extract knowledge from massive data, and replace numerical models with efficient approximations. Decision forests are a critical tool because they provide insight into model operation that is critical to interpreting learned results. While decision forests are trivially parallelizable, the traversals of tree data structures incur many random memory accesses and are very slow. We present memory packing techniques that reorganize learned forests to minimize cache misses during classification. The resulting layout is hierarchical. At low levels, we pack the nodes of multiple trees into contiguous memory blocks so that each memory access fetches data for multiple trees. At higher levels, we use leaf cardinality to identify the most popular paths through a tree and collocate those paths in cache lines. We extend this layout with out-of-order execution and cache-line prefetching to increase memory throughput. Together, these optimizations increase the performance of classification in ensembles by a factor of four over an optimized C++ implementation and a actor of 50 over a popular R language implementation. △ Less

Submitted 19 June, 2018; originally announced June 2018.

arXiv:1806.05514 [pdf, other]

doi 10.1007/s10182-020-00378-1

The Exact Equivalence of Distance and Kernel Methods for Hypothesis Testing

Authors: Cencheng Shen, Joshua T. Vogelstein

Abstract: Distance-based tests, also called "energy statistics", are leading methods for two-sample and independence tests from the statistics community. Kernel-based tests, developed from "kernel mean embeddings", are leading methods for two-sample and independence tests from the machine learning community. A fixed-point transformation was previously proposed to connect the distance methods and kernel meth… ▽ More Distance-based tests, also called "energy statistics", are leading methods for two-sample and independence tests from the statistics community. Kernel-based tests, developed from "kernel mean embeddings", are leading methods for two-sample and independence tests from the machine learning community. A fixed-point transformation was previously proposed to connect the distance methods and kernel methods for the population statistics. In this paper, we propose a new bijective transformation between metrics and kernels. It simplifies the fixed-point transformation, inherits similar theoretical properties, allows distance methods to be exactly the same as kernel methods for sample statistics and p-value, and better preserves the data structure upon transformation. Our results further advance the understanding in distance and kernel-based tests, streamline the code base for implementing these tests, and enable a rich literature of distance-based and kernel-based methodologies to directly communicate with each other. △ Less

Submitted 14 September, 2020; v1 submitted 14 June, 2018; originally announced June 2018.

Comments: 24 pages main + 7 pages appendix, 3 figures

Journal ref: AStA Advances in Statistical Analysis 105(3), 385-403, 2021

arXiv:1710.09859 [pdf, other]

doi 10.1109/TPAMI.2020.2998120

Kernel k-Groups via Hartigan's Method

Authors: Guilherme França, Maria L. Rizzo, Joshua T. Vogelstein

Abstract: Energy statistics was proposed by Sz\' ekely in the 80's inspired by Newton's gravitational potential in classical mechanics and it provides a model-free hypothesis test for equality of distributions. In its original form, energy statistics was formulated in Euclidean spaces. More recently, it was generalized to metric spaces of negative type. In this paper, we consider a formulation for the clust… ▽ More Energy statistics was proposed by Sz\' ekely in the 80's inspired by Newton's gravitational potential in classical mechanics and it provides a model-free hypothesis test for equality of distributions. In its original form, energy statistics was formulated in Euclidean spaces. More recently, it was generalized to metric spaces of negative type. In this paper, we consider a formulation for the clustering problem using a weighted version of energy statistics in spaces of negative type. We show that this approach leads to a quadratically constrained quadratic program in the associated kernel space, establishing connections with graph partitioning problems and kernel methods in machine learning. To find local solutions of such an optimization problem, we propose kernel k-groups, which is an extension of Hartigan's method to kernel spaces. Kernel k-groups is cheaper than spectral clustering and has the same computational cost as kernel k-means (which is based on Lloyd's heuristic) but our numerical results show an improved performance, especially in higher dimensions. Moreover, we verify the efficiency of kernel k-groups in community detection in sparse stochastic block models which has fascinating applications in several areas of science. △ Less

Submitted 11 June, 2020; v1 submitted 26 October, 2017; originally announced October 2017.

Comments: several improvements; connections with community detection and stochastic block model. Matches published version

Journal ref: IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020

arXiv:1703.03862 [pdf, other]

doi 10.1109/TPAMI.2019.2948619

Joint Embedding of Graphs

Authors: Shangsi Wang, Jesús Arroyo, Joshua T. Vogelstein, Carey E. Priebe

Abstract: Feature extraction and dimension reduction for networks is critical in a wide variety of domains. Efficiently and accurately learning features for multiple graphs has important applications in statistical inference on graphs. We propose a method to jointly embed multiple undirected graphs. Given a set of graphs, the joint embedding method identifies a linear subspace spanned by rank one symmetric… ▽ More Feature extraction and dimension reduction for networks is critical in a wide variety of domains. Efficiently and accurately learning features for multiple graphs has important applications in statistical inference on graphs. We propose a method to jointly embed multiple undirected graphs. Given a set of graphs, the joint embedding method identifies a linear subspace spanned by rank one symmetric matrices and projects adjacency matrices of graphs into this subspace. The projection coefficients can be treated as features of the graphs, while the embedding components can represent vertex features. We also propose a random graph model for multiple graphs that generalizes other classical models for graphs. We show through theory and numerical experiments that under the model, the joint embedding method produces estimates of parameters with small errors. Via simulation experiments, we demonstrate that the joint embedding method produces features which lead to state of the art performance in classifying graphs. Applying the joint embedding method to human brain graphs, we find it extracts interpretable features with good prediction accuracy in different tasks. △ Less

Submitted 17 October, 2019; v1 submitted 10 March, 2017; originally announced March 2017.

arXiv:1612.00356 [pdf, other]

A Large Deformation Diffeomorphic Approach to Registration of CLARITY Images via Mutual Information

Authors: Kwame S. Kutten, Nicolas Charon, Michael I. Miller, J. T. Ratnanather, Jordan Matelsky, Alexander D. Baden, Kunal Lillaney, Karl Deisseroth, Li Ye, Joshua T. Vogelstein

Abstract: CLARITY is a method for converting biological tissues into translucent and porous hydrogel-tissue hybrids. This facilitates interrogation with light sheet microscopy and penetration of molecular probes while avoiding physical slicing. In this work, we develop a pipeline for registering CLARIfied mouse brains to an annotated brain atlas. Due to the novelty of this microscopy technique it is impract… ▽ More CLARITY is a method for converting biological tissues into translucent and porous hydrogel-tissue hybrids. This facilitates interrogation with light sheet microscopy and penetration of molecular probes while avoiding physical slicing. In this work, we develop a pipeline for registering CLARIfied mouse brains to an annotated brain atlas. Due to the novelty of this microscopy technique it is impractical to use absolute intensity values to align these images to existing standard atlases. Thus we adopt a large deformation diffeomorphic approach for registering images via mutual information matching. Furthermore we show how a cascaded multi-resolution approach can improve registration quality while reducing algorithm run time. As acquired image volumes were over a terabyte in size, they were far too large for work on personal computers. Therefore the NeuroData computational infrastructure was deployed for multi-resolution storage and visualization of these images and aligned annotations on the web. △ Less

Submitted 11 August, 2017; v1 submitted 1 December, 2016; originally announced December 2016.

arXiv:1611.05479 [pdf, other]

doi 10.1371/journal.pcbi.1005493

Probabilistic Fluorescence-Based Synapse Detection

Authors: Anish K. Simhal, Cecilia Aguerrebere, Forrest Collman, Joshua T. Vogelstein, Kristina D. Micheva, Richard J. Weinberg, Stephen J. Smith, Guillermo Sapiro

Abstract: Brain function results from communication between neurons connected by complex synaptic networks. Synapses are themselves highly complex and diverse signaling machines, containing protein products of hundreds of different genes, some in hundreds of copies, arranged in precise lattice at each individual synapse. Synapses are fundamental not only to synaptic network function but also to network deve… ▽ More Brain function results from communication between neurons connected by complex synaptic networks. Synapses are themselves highly complex and diverse signaling machines, containing protein products of hundreds of different genes, some in hundreds of copies, arranged in precise lattice at each individual synapse. Synapses are fundamental not only to synaptic network function but also to network development, adaptation, and memory. In addition, abnormalities of synapse numbers or molecular components are implicated in most mental and neurological disorders. Despite their obvious importance, mammalian synapse populations have so far resisted detailed quantitative study. In human brains and most animal nervous systems, synapses are very small and very densely packed: there are approximately 1 billion synapses per cubic millimeter of human cortex. This volumetric density poses very substantial challenges to proteometric analysis at the critical level of the individual synapse. The present work describes new probabilistic image analysis methods for single-synapse analysis of synapse populations in both animal and human brains. △ Less

Submitted 16 November, 2016; originally announced November 2016.

Comments: Current awaiting peer review

arXiv:1606.08905 [pdf, other]

knor: A NUMA-Optimized In-Memory, Distributed and Semi-External-Memory k-means Library

Authors: Disa Mhembere, Da Zheng, Carey E. Priebe, Joshua T. Vogelstein, Randal Burns

Abstract: k-means is one of the most influential and utilized machine learning algorithms. Its computation limits the performance and scalability of many statistical analysis and machine learning tasks. We rethink and optimize k-means in terms of modern NUMA architectures to develop a novel parallelization scheme that delays and minimizes synchronization barriers. The \textit{k-means NUMA Optimized Routine}… ▽ More k-means is one of the most influential and utilized machine learning algorithms. Its computation limits the performance and scalability of many statistical analysis and machine learning tasks. We rethink and optimize k-means in terms of modern NUMA architectures to develop a novel parallelization scheme that delays and minimizes synchronization barriers. The \textit{k-means NUMA Optimized Routine} (\textsf{knor}) library has (i) in-memory (\textsf{knori}), (ii) distributed memory (\textsf{knord}), and (iii) semi-external memory (\textsf{knors}) modules that radically improve the performance of k-means for varying memory and hardware budgets. \textsf{knori} boosts performance for single machine datasets by an order of magnitude or more. \textsf{knors} improves the scalability of k-means on a memory budget using SSDs. \textsf{knors} scales to billions of points on a single machine, using a fraction of the resources that distributed in-memory systems require. \textsf{knord} retains \textsf{knori}'s performance characteristics, while scaling in-memory through distributed computation in the cloud. \textsf{knor} modifies Elkan's triangle inequality pruning algorithm such that we utilize it on billion-point datasets without the significant memory overhead of the original algorithm. We demonstrate \textsf{knor} outperforms distributed commercial products like H$_2$O, Turi (formerly Dato, GraphLab) and Spark's MLlib by more than an order of magnitude for datasets of $10^7$ to $10^9$ points. △ Less

Submitted 24 June, 2017; v1 submitted 28 June, 2016; originally announced June 2016.

arXiv:1605.02060 [pdf, other]

doi 10.1117/12.2227444

Deformably Registering and Annotating Whole CLARITY Brains to an Atlas via Masked LDDMM

Authors: Kwame S. Kutten, Joshua T. Vogelstein, Nicolas Charon, Li Ye, Karl Deisseroth, Michael I. Miller

Abstract: The CLARITY method renders brains optically transparent to enable high-resolution imaging in the structurally intact brain. Anatomically annotating CLARITY brains is necessary for discovering which regions contain signals of interest. Manually annotating whole-brain, terabyte CLARITY images is difficult, time-consuming, subjective, and error-prone. Automatically registering CLARITY images to a pre… ▽ More The CLARITY method renders brains optically transparent to enable high-resolution imaging in the structurally intact brain. Anatomically annotating CLARITY brains is necessary for discovering which regions contain signals of interest. Manually annotating whole-brain, terabyte CLARITY images is difficult, time-consuming, subjective, and error-prone. Automatically registering CLARITY images to a pre-annotated brain atlas offers a solution, but is difficult for several reasons. Removal of the brain from the skull and subsequent storage and processing cause variable non-rigid deformations, thus compounding inter-subject anatomical variability. Additionally, the signal in CLARITY images arises from various biochemical contrast agents which only sparsely label brain structures. This sparse labeling challenges the most commonly used registration algorithms that need to match image histogram statistics to the more densely labeled histological brain atlases. The standard method is a multiscale Mutual Information B-spline algorithm that dynamically generates an average template as an intermediate registration target. We determined that this method performs poorly when registering CLARITY brains to the Allen Institute's Mouse Reference Atlas (ARA), because the image histogram statistics are poorly matched. Therefore, we developed a method (Mask-LDDMM) for registering CLARITY images, that automatically find the brain boundary and learns the optimal deformation between the brain and atlas masks. Using Mask-LDDMM without an average template provided better results than the standard approach when registering CLARITY brains to the ARA. The LDDMM pipelines developed here provide a fast automated way to anatomically annotate CLARITY images. Our code is available as open source software at http://NeuroData.io. △ Less

Submitted 6 May, 2016; originally announced May 2016.

Journal ref: Proc. SPIE 9896 Optics, Photonics and Digital Technologies for Imaging Applications IV (2016)

arXiv:1604.06414 [pdf, other]

FlashR: R-Programmed Parallel and Scalable Machine Learning using SSDs

Authors: Da Zheng, Disa Mhembere, Joshua T. Vogelstein, Carey E. Priebe, Randal Burns

Abstract: R is one of the most popular programming languages for statistics and machine learning, but the R framework is relatively slow and unable to scale to large datasets. The general approach for speeding up an implementation in R is to implement the algorithms in C or FORTRAN and provide an R wrapper. FlashR takes a different approach: it executes R code in parallel and scales the code beyond memory c… ▽ More R is one of the most popular programming languages for statistics and machine learning, but the R framework is relatively slow and unable to scale to large datasets. The general approach for speeding up an implementation in R is to implement the algorithms in C or FORTRAN and provide an R wrapper. FlashR takes a different approach: it executes R code in parallel and scales the code beyond memory capacity by utilizing solid-state drives (SSDs) automatically. It provides a small number of generalized operations (GenOps) upon which we reimplement a large number of matrix functions in the R base package. As such, FlashR parallelizes and scales existing R code with little/no modification. To reduce data movement between CPU and SSDs, FlashR evaluates matrix operations lazily, fuses operations at runtime, and uses cache-aware, two-level matrix partitioning. We evaluate FlashR on a variety of machine learning and statistics algorithms on inputs of up to four billion data points. FlashR out-of-core tracks closely the performance of FlashR in-memory. The R code for machine learning algorithms executed in FlashR outperforms the in-memory execution of H2O and Spark MLlib by a factor of 2-10 and outperforms Revolution R Open by more than an order of magnitude. △ Less

Submitted 18 May, 2017; v1 submitted 21 April, 2016; originally announced April 2016.

arXiv:1604.03629 [pdf, other]

Quantifying mesoscale neuroanatomy using X-ray microtomography

Authors: Eva L. Dyer, William Gray Roncal, Hugo L. Fernandes, Doga Gürsoy, Vincent De Andrade, Rafael Vescovi, Kamel Fezzaa, Xianghui Xiao, Joshua T. Vogelstein, Chris Jacobsen, Konrad P. Körding, Narayanan Kasthuri

Abstract: Methods for resolving the 3D microstructure of the brain typically start by thinly slicing and staining the brain, and then imaging each individual section with visible light photons or electrons. In contrast, X-rays can be used to image thick samples, providing a rapid approach for producing large 3D brain maps without sectioning. Here we demonstrate the use of synchrotron X-ray microtomography (… ▽ More Methods for resolving the 3D microstructure of the brain typically start by thinly slicing and staining the brain, and then imaging each individual section with visible light photons or electrons. In contrast, X-rays can be used to image thick samples, providing a rapid approach for producing large 3D brain maps without sectioning. Here we demonstrate the use of synchrotron X-ray microtomography ($μ$CT) for producing mesoscale $(1~μm^3)$ resolution brain maps from millimeter-scale volumes of mouse brain. We introduce a pipeline for $μ$CT-based brain mapping that combines methods for sample preparation, imaging, automated segmentation of image volumes into cells and blood vessels, and statistical analysis of the resulting brain structures. Our results demonstrate that X-ray tomography promises rapid quantification of large brain volumes, complementing other brain mapping and connectomics efforts. △ Less

Submitted 26 July, 2016; v1 submitted 12 April, 2016; originally announced April 2016.

Comments: 28 pages, 9 figures

arXiv:1506.03410 [pdf, other]

Sparse Projection Oblique Randomer Forests

Authors: Tyler M. Tomita, James Browne, Cencheng Shen, Jaewon Chung, Jesse L. Patsolic, Benjamin Falk, Jason Yim, Carey E. Priebe, Randal Burns, Mauro Maggioni, Joshua T. Vogelstein

Abstract: Decision forests, including Random Forests and Gradient Boosting Trees, have recently demonstrated state-of-the-art performance in a variety of machine learning settings. Decision forests are typically ensembles of axis-aligned decision trees; that is, trees that split only along feature dimensions. In contrast, many recent extensions to decision forests are based on axis-oblique splits. Unfortuna… ▽ More Decision forests, including Random Forests and Gradient Boosting Trees, have recently demonstrated state-of-the-art performance in a variety of machine learning settings. Decision forests are typically ensembles of axis-aligned decision trees; that is, trees that split only along feature dimensions. In contrast, many recent extensions to decision forests are based on axis-oblique splits. Unfortunately, these extensions forfeit one or more of the favorable properties of decision forests based on axis-aligned splits, such as robustness to many noise dimensions, interpretability, or computational efficiency. We introduce yet another decision forest, called "Sparse Projection Oblique Randomer Forests" (SPORF). SPORF uses very sparse random projections, i.e., linear combinations of a small subset of features. SPORF significantly improves accuracy over existing state-of-the-art algorithms on a standard benchmark suite for classification with >100 problems of varying dimension, sample size, and number of classes. To illustrate how SPORF addresses the limitations of both axis-aligned and existing oblique decision forest methods, we conduct extensive simulated experiments. SPORF typically yields improved performance over existing decision forests, while mitigating computational efficiency and scalability and maintaining interpretability. SPORF can easily be incorporated into other ensemble methods such as boosting to obtain potentially similar gains. △ Less

Submitted 3 October, 2019; v1 submitted 10 June, 2015; originally announced June 2015.

Comments: 31 pages; submitted to Journal of Machine Learning Research for review

MSC Class: 68T10 ACM Class: I.5.2

Journal ref: Journal of Machine Learning Research 21(104), 1-39, 2020

arXiv:1411.6880 [pdf, other]

An Automated Images-to-Graphs Framework for High Resolution Connectomics

Authors: William Gray Roncal, Dean M. Kleissas, Joshua T. Vogelstein, Priya Manavalan, Kunal Lillaney, Michael Pekala, Randal Burns, R. Jacob Vogelstein, Carey E. Priebe, Mark A. Chevillet, Gregory D. Hager

Abstract: Reconstructing a map of neuronal connectivity is a critical challenge in contemporary neuroscience. Recent advances in high-throughput serial section electron microscopy (EM) have produced massive 3D image volumes of nanoscale brain tissue for the first time. The resolution of EM allows for individual neurons and their synaptic connections to be directly observed. Recovering neuronal networks by m… ▽ More Reconstructing a map of neuronal connectivity is a critical challenge in contemporary neuroscience. Recent advances in high-throughput serial section electron microscopy (EM) have produced massive 3D image volumes of nanoscale brain tissue for the first time. The resolution of EM allows for individual neurons and their synaptic connections to be directly observed. Recovering neuronal networks by manually tracing each neuronal process at this scale is unmanageable, and therefore researchers are developing automated image processing modules. Thus far, state-of-the-art algorithms focus only on the solution to a particular task (e.g., neuron segmentation or synapse identification). In this manuscript we present the first fully automated images-to-graphs pipeline (i.e., a pipeline that begins with an imaged volume of neural tissue and produces a brain graph without any human interaction). To evaluate overall performance and select the best parameters and methods, we also develop a metric to assess the quality of the output graphs. We evaluate a set of algorithms and parameters, searching possible operating points to identify the best available brain graph for our assessment metric. Finally, we deploy a reference end-to-end version of the pipeline on a large, publicly available data set. This provides a baseline result and framework for community analysis and future algorithm development and testing. All code and data derivatives have been made publicly available toward eventually unlocking new biofidelic computational primitives and understanding of neuropathologies. △ Less

Submitted 30 April, 2015; v1 submitted 25 November, 2014; originally announced November 2014.

Comments: 13 pages, first two authors contributed equally V2: Added additional experiments and clarifications; added information on infrastructure and pipeline environment

arXiv:1411.2158 [pdf, ps, other]

doi 10.1093/biomet/asx008

Covariate-assisted spectral clustering

Authors: Norbert Binkiewicz, Joshua T. Vogelstein, Karl Rohe

Abstract: Biological and social systems consist of myriad interacting units. The interactions can be represented in the form of a graph or network. Measurements of these graphs can reveal the underlying structure of these interactions, which provides insight into the systems that generated the graphs. Moreover, in applications such as connectomics, social networks, and genomics, graph data are accompanied b… ▽ More Biological and social systems consist of myriad interacting units. The interactions can be represented in the form of a graph or network. Measurements of these graphs can reveal the underlying structure of these interactions, which provides insight into the systems that generated the graphs. Moreover, in applications such as connectomics, social networks, and genomics, graph data are accompanied by contextualizing measures on each node. We utilize these node covariates to help uncover latent communities in a graph, using a modification of spectral clustering. Statistical guarantees are provided under a joint mixture model that we call the node-contextualized stochastic blockmodel, including a bound on the mis-clustering rate. The bound is used to derive conditions for achieving perfect clustering. For most simulated cases, covariate-assisted spectral clustering yields results superior to regularized spectral clustering without node covariates and to an adaptation of canonical correlation analysis. We apply our clustering method to large brain graphs derived from diffusion MRI data, using the node locations or neurological region membership as covariates. In both cases, covariate-assisted spectral clustering yields clusters that are easier to interpret neurologically. △ Less

Submitted 30 October, 2016; v1 submitted 8 November, 2014; originally announced November 2014.

Comments: 28 pages, 4 figures, includes substantial changes to theoretical results

Journal ref: Biometrika, Volume 104, Issue 2, 1 June 2017, Pages 361-377

arXiv:1404.4800 [pdf, other]

Automatic Annotation of Axoplasmic Reticula in Pursuit of Connectomes

Authors: Ayushi Sinha, William Gray Roncal, Narayanan Kasthuri, Ming Chuang, Priya Manavalan, Dean M. Kleissas, Joshua T. Vogelstein, R. Jacob Vogelstein, Randal Burns, Jeff W. Lichtman, Michael Kazhdan

Abstract: In this paper, we present a new pipeline which automatically identifies and annotates axoplasmic reticula, which are small subcellular structures present only in axons. We run our algorithm on the Kasthuri11 dataset, which was color corrected using gradient-domain techniques to adjust contrast. We use a bilateral filter to smooth out the noise in this data while preserving edges, which highlights… ▽ More In this paper, we present a new pipeline which automatically identifies and annotates axoplasmic reticula, which are small subcellular structures present only in axons. We run our algorithm on the Kasthuri11 dataset, which was color corrected using gradient-domain techniques to adjust contrast. We use a bilateral filter to smooth out the noise in this data while preserving edges, which highlights axoplasmic reticula. These axoplasmic reticula are then annotated using a morphological region growing algorithm. Additionally, we perform Laplacian sharpening on the bilaterally filtered data to enhance edges, and repeat the morphological region growing algorithm to annotate more axoplasmic reticula. We track our annotations through the slices to improve precision, and to create long objects to aid in segment merging. This method annotates axoplasmic reticula with high precision. Our algorithm can easily be adapted to annotate axoplasmic reticula in different sets of brain data by changing a few thresholds. The contribution of this work is the introduction of a straightforward and robust pipeline which annotates axoplasmic reticula with high precision, contributing towards advancements in automatic feature annotations in neural EM data. △ Less

Submitted 16 April, 2014; originally announced April 2014.

Comments: 2 pages, 1 figure

arXiv:1403.3724 [pdf, other]

VESICLE: Volumetric Evaluation of Synaptic Interfaces using Computer vision at Large Scale

Authors: William Gray Roncal, Michael Pekala, Verena Kaynig-Fittkau, Dean M. Kleissas, Joshua T. Vogelstein, Hanspeter Pfister, Randal Burns, R. Jacob Vogelstein, Mark A. Chevillet, Gregory D. Hager

Abstract: An open challenge problem at the forefront of modern neuroscience is to obtain a comprehensive mapping of the neural pathways that underlie human brain function; an enhanced understanding of the wiring diagram of the brain promises to lead to new breakthroughs in diagnosing and treating neurological disorders. Inferring brain structure from image data, such as that obtained via electron microscopy… ▽ More An open challenge problem at the forefront of modern neuroscience is to obtain a comprehensive mapping of the neural pathways that underlie human brain function; an enhanced understanding of the wiring diagram of the brain promises to lead to new breakthroughs in diagnosing and treating neurological disorders. Inferring brain structure from image data, such as that obtained via electron microscopy (EM), entails solving the problem of identifying biological structures in large data volumes. Synapses, which are a key communication structure in the brain, are particularly difficult to detect due to their small size and limited contrast. Prior work in automated synapse detection has relied upon time-intensive biological preparations (post-staining, isotropic slice thicknesses) in order to simplify the problem. This paper presents VESICLE, the first known approach designed for mammalian synapse detection in anisotropic, non-post-stained data. Our methods explicitly leverage biological context, and the results exceed existing synapse detection methods in terms of accuracy and scalability. We provide two different approaches - one a deep learning classifier (VESICLE-CNN) and one a lightweight Random Forest approach (VESICLE-RF) to offer alternatives in the performance-scalability space. Addressing this synapse detection challenge enables the analysis of high-throughput imaging data soon expected to reach petabytes of data, and provide tools for more rapid estimation of brain-graphs. Finally, to facilitate community efforts, we developed tools for large-scale object detection, and demonstrated this framework to find $\approx$ 50,000 synapses in 60,000 $μm ^3$ (220 GB on disk) of electron microscopy data. △ Less

Submitted 7 September, 2015; v1 submitted 14 March, 2014; originally announced March 2014.

Comments: v4: added clarifying figures and updates for readability. v3: fixed metadata. 11 pp v2: Added CNN classifier, significant changes to improve performance and generalization

Journal ref: Proceedings of the British Machine Vision Conference (BMVC), pages 81.1-81.13. BMVA Press, September 2015

arXiv:1312.4875 [pdf, other]

doi 10.1109/GlobalSIP.2013.6736878

MIGRAINE: MRI Graph Reliability Analysis and Inference for Connectomics

Authors: William Gray Roncal, Zachary H. Koterba, Disa Mhembere, Dean M. Kleissas, Joshua T. Vogelstein, Randal Burns, Anita R. Bowles, Dimitrios K. Donavos, Sephira Ryman, Rex E. Jung, Lei Wu, Vince Calhoun, R. Jacob Vogelstein

Abstract: Currently, connectomes (e.g., functional or structural brain graphs) can be estimated in humans at $\approx 1~mm^3$ scale using a combination of diffusion weighted magnetic resonance imaging, functional magnetic resonance imaging and structural magnetic resonance imaging scans. This manuscript summarizes a novel, scalable implementation of open-source algorithms to rapidly estimate magnetic resona… ▽ More Currently, connectomes (e.g., functional or structural brain graphs) can be estimated in humans at $\approx 1~mm^3$ scale using a combination of diffusion weighted magnetic resonance imaging, functional magnetic resonance imaging and structural magnetic resonance imaging scans. This manuscript summarizes a novel, scalable implementation of open-source algorithms to rapidly estimate magnetic resonance connectomes, using both anatomical regions of interest (ROIs) and voxel-size vertices. To assess the reliability of our pipeline, we develop a novel nonparametric non-Euclidean reliability metric. Here we provide an overview of the methods used, demonstrate our implementation, and discuss available user extensions. We conclude with results showing the efficacy and reliability of the pipeline over previous state-of-the-art. △ Less

Submitted 17 December, 2013; originally announced December 2013.

Comments: Published as part of 2013 IEEE GlobalSIP conference

Showing 1–50 of 54 results for author: Vogelstein, J T