
Showing 1–50 of 54 results for author: Vogelstein, J T

Searching in archive cs.
  1. arXiv:2307.13868  [pdf, other]

    stat.ME cs.LG stat.ML

    Learning sources of variability from high-dimensional observational studies

    Authors: Eric W. Bridgeford, Jaewon Chung, Brian Gilbert, Sambit Panda, Adam Li, Cencheng Shen, Alexandra Badea, Brian Caffo, Joshua T. Vogelstein

    Abstract: Causal inference studies whether the presence of a variable influences an observed outcome. As measured by quantities such as the "average treatment effect," this paradigm is employed across numerous biological fields, from vaccine and drug development to policy interventions. Unfortunately, the majority of these methods are often limited to univariate outcomes. Our work generalizes causal estiman…

    Submitted 28 November, 2023; v1 submitted 25 July, 2023; originally announced July 2023.

  2. arXiv:2303.17589  [pdf, other]

    cs.LG cs.CV cs.NE q-bio.NC

    Polarity is all you need to learn and transfer faster

    Authors: Qingyang Wang, Michael A. Powell, Ali Geisa, Eric W. Bridgeford, Joshua T. Vogelstein

    Abstract: Natural intelligences (NIs) thrive in a dynamic world - they learn quickly, sometimes with only a few samples. In contrast, artificial intelligences (AIs) typically learn with a prohibitive number of training samples and computational power. What design principle difference between NI and AI could contribute to such a discrepancy? Here, we investigate the role of weight polarity: development proce…

    Submitted 30 May, 2023; v1 submitted 29 March, 2023; originally announced March 2023.

    Comments: ICML camera-ready

  3. arXiv:2302.14186  [pdf, other]

    eess.SP cs.LG stat.AP stat.ME stat.ML

    Approximately optimal domain adaptation with Fisher's Linear Discriminant

    Authors: Hayden S. Helm, Ashwin De Silva, Joshua T. Vogelstein, Carey E. Priebe, Weiwei Yang

    Abstract: We propose a class of models based on Fisher's Linear Discriminant (FLD) in the context of domain adaptation. The class is the convex combination of two hypotheses: i) an average hypothesis representing previously seen source tasks and ii) a hypothesis trained on a new target task. For a particular generative setting we derive the optimal convex combination of the two models under 0-1 loss, propos…

    Submitted 1 March, 2024; v1 submitted 27 February, 2023; originally announced February 2023.

  4. arXiv:2208.10967  [pdf, other]

    cs.LG cs.AI cs.CV stat.ML

    The Value of Out-of-Distribution Data

    Authors: Ashwin De Silva, Rahul Ramesh, Carey E. Priebe, Pratik Chaudhari, Joshua T. Vogelstein

    Abstract: We expect the generalization error to improve with more samples from a similar task, and to deteriorate with more samples from an out-of-distribution (OOD) task. In this work, we show a counter-intuitive phenomenon: the generalization error of a task can be a non-monotonic function of the number of OOD samples. As the number of OOD samples increases, the generalization error on the target task imp…

    Submitted 13 July, 2023; v1 submitted 23 August, 2022; originally announced August 2022.

    Comments: Previous versions of this work have been presented at the Out-of-Distribution Generalization in Computer Vision (OOD-CV) Workshop (ECCV 2022) and the Workshop on Distribution Shifts (NeurIPS 2022)

    Journal ref: Proceedings of the 40th International Conference on Machine Learning, PMLR 202:7366-7389, 2023

  5. arXiv:2208.03211  [pdf, other]

    cs.LG cs.AI cs.NE

    Why do networks have inhibitory/negative connections?

    Authors: Qingyang Wang, Michael A. Powell, Ali Geisa, Eric Bridgeford, Carey E. Priebe, Joshua T. Vogelstein

    Abstract: Why do brains have inhibitory connections? Why do deep networks have negative weights? We propose an answer from the perspective of representation capacity. We believe representing functions is the primary role of both (i) the brain in natural intelligence, and (ii) deep networks in artificial intelligence. Our answer to why there are inhibitory/negative weights is: to learn more functions. We pro…

    Submitted 17 August, 2023; v1 submitted 5 August, 2022; originally announced August 2022.

    Comments: ICCV2023 camera-ready

  6. arXiv:2201.13001  [pdf, other]

    cs.LG cs.AI cs.DS q-bio.NC stat.ML

    Deep Discriminative to Kernel Density Graph for In- and Out-of-distribution Calibrated Inference

    Authors: Jayanta Dey, Haoyin Xu, Will LeVine, Ashwin De Silva, Tyler M. Tomita, Ali Geisa, Tiffany Chu, Jacob Desman, Joshua T. Vogelstein

    Abstract: Deep discriminative approaches like random forests and deep neural networks have recently found applications in many important real-world scenarios. However, deploying these learning algorithms in safety-critical applications raises concerns, particularly when it comes to ensuring confidence calibration for both in-distribution and out-of-distribution data points. Many popular methods for in-distr…

    Submitted 7 June, 2024; v1 submitted 31 January, 2022; originally announced January 2022.

  7. arXiv:2201.07372  [pdf, other]

    cs.LG cs.AI

    Prospective Learning: Principled Extrapolation to the Future

    Authors: Ashwin De Silva, Rahul Ramesh, Lyle Ungar, Marshall Hussain Shuler, Noah J. Cowan, Michael Platt, Chen Li, Leyla Isik, Seung-Eon Roh, Adam Charles, Archana Venkataraman, Brian Caffo, Javier J. How, Justus M Kebschull, John W. Krakauer, Maxim Bichuch, Kaleab Alemayehu Kinfu, Eva Yezerets, Dinesh Jayaraman, Jong M. Shin, Soledad Villar, Ian Phillips, Carey E. Priebe, Thomas Hartung, Michael I. Miller, et al. (18 additional authors not shown)

    Abstract: Learning is a process which can update decision rules, based on past experience, such that future performance improves. Traditionally, machine learning is often evaluated under the assumption that the future will be identical to the past in distribution or change adversarially. But these assumptions can be either too optimistic or pessimistic for many problems in the real world. Real world scenari…

    Submitted 13 July, 2023; v1 submitted 18 January, 2022; originally announced January 2022.

    Comments: Accepted at the 2nd Conference on Lifelong Learning Agents (CoLLAs), 2023

  8. arXiv:2111.05366  [pdf, other]

    stat.ML cs.LG math.CO

    Graph Matching via Optimal Transport

    Authors: Ali Saad-Eldin, Benjamin D. Pedigo, Carey E. Priebe, Joshua T. Vogelstein

    Abstract: The graph matching problem seeks to find an alignment between the nodes of two graphs that minimizes the number of adjacency disagreements. Solving the graph matching problem is increasingly important due to its applications in operations research, computer vision, neuroscience, and more. However, current state-of-the-art algorithms are inefficient in matching very large graphs, though they produce good…

    Submitted 9 November, 2021; originally announced November 2021.

  9. arXiv:2110.08483  [pdf, other]

    cs.LG cs.AI cs.DS

    Simplest Streaming Trees

    Authors: Haoyin Xu, Jayanta Dey, Sambit Panda, Joshua T. Vogelstein

    Abstract: Decision forests, including random forests and gradient boosting trees, remain the leading machine learning methods for many real-world data problems, especially on tabular data. However, most of the current implementations only operate in batch mode, and therefore cannot incrementally update when more data arrive. Several previous works developed streaming trees and ensembles to overcome this lim…

    Submitted 24 October, 2023; v1 submitted 16 October, 2021; originally announced October 2021.

  10. arXiv:2109.14501  [pdf, other]

    stat.ML cs.AI cs.LG

    Towards a theory of out-of-distribution learning

    Authors: Jayanta Dey, Ali Geisa, Ronak Mehta, Tyler M. Tomita, Hayden S. Helm, Haoyin Xu, Eric Eaton, Jeffery Dick, Carey E. Priebe, Joshua T. Vogelstein

    Abstract: Learning is a process wherein a learning agent enhances its performance through exposure to experience or data. Throughout this journey, the agent may encounter diverse learning environments. For example, data may be presented to the learner all at once, in multiple batches, or sequentially. Furthermore, the distribution of each data sample could be either identical and independent (iid) or non-iid…

    Submitted 7 June, 2024; v1 submitted 29 September, 2021; originally announced September 2021.

  11. arXiv:2108.13637  [pdf, other]

    cs.LG cs.AI q-bio.NC stat.ML

    When are Deep Networks really better than Decision Forests at small sample sizes, and how?

    Authors: Haoyin Xu, Kaleab A. Kinfu, Will LeVine, Sambit Panda, Jayanta Dey, Michael Ainsworth, Yu-Chung Peng, Madi Kusmanov, Florian Engert, Christopher M. White, Joshua T. Vogelstein, Carey E. Priebe

    Abstract: Deep networks and decision forests (such as random forests and gradient boosted trees) are the leading machine learning methods for structured and tabular data, respectively. Many papers have empirically compared large numbers of classifiers on one or two different domains (e.g., on 100 different tabular data settings). However, a careful conceptual and empirical comparison of these two strategies…

    Submitted 2 November, 2021; v1 submitted 31 August, 2021; originally announced August 2021.

  12. arXiv:2107.11732  [pdf, other]

    cs.LG econ.EM q-bio.QM stat.ME

    Federated Causal Inference in Heterogeneous Observational Data

    Authors: Ruoxuan Xiong, Allison Koenecke, Michael Powell, Zhu Shen, Joshua T. Vogelstein, Susan Athey

    Abstract: We are interested in estimating the effect of a treatment applied to individuals at multiple sites, where data is stored locally for each site. Due to privacy constraints, individual-level data cannot be shared across sites; the sites may also have heterogeneous populations and treatment assignment mechanisms. Motivated by these considerations, we develop federated methods to draw inference on the…

    Submitted 2 April, 2023; v1 submitted 25 July, 2021; originally announced July 2021.

  13. arXiv:2106.02701  [pdf, other]

    cs.CV

    Hidden Markov Modeling for Maximum Likelihood Neuron Reconstruction

    Authors: Thomas L. Athey, Daniel J. Tward, Ulrich Mueller, Joshua T. Vogelstein, Michael I. Miller

    Abstract: Recent advances in brain clearing and imaging have made it possible to image entire mammalian brains at sub-micron resolution. These images offer the potential to assemble brain-wide atlases of neuron morphology, but manual neuron reconstruction remains a bottleneck. Several automatic reconstruction algorithms exist, but most focus on single neuron images. In this paper, we present a probabilistic…

    Submitted 27 January, 2022; v1 submitted 4 June, 2021; originally announced June 2021.

  14. arXiv:2104.01532  [pdf, other]

    q-bio.NC cs.MS math.DG

    Fitting Splines to Axonal Arbors Quantifies Relationship between Branch Order and Geometry

    Authors: Thomas L. Athey, Jacopo Teneggi, Joshua T. Vogelstein, Daniel Tward, Ulrich Mueller, Michael I. Miller

    Abstract: Neuromorphology is crucial to identifying neuronal subtypes and understanding learning. It is also implicated in neurological disease. However, standard morphological analysis focuses on macroscopic features such as branching frequency and connectivity between regions, and often neglects the internal geometry of neurons. In this work, we treat neuron trace points as a sampling of differentiable cu…

    Submitted 5 June, 2021; v1 submitted 3 April, 2021; originally announced April 2021.

    Journal ref: Front. Neuroinform. 15 (2021)

  15. arXiv:2011.06557  [pdf, other]

    stat.ML cs.LG stat.ME

    A partition-based similarity for classification distributions

    Authors: Hayden S. Helm, Ronak D. Mehta, Brandon Duderstadt, Weiwei Yang, Christopher M. White, Ali Geisa, Joshua T. Vogelstein, Carey E. Priebe

    Abstract: Herein we define a measure of similarity between classification distributions that is both principled from the perspective of statistical pattern recognition and useful from the perspective of machine learning practitioners. In particular, we propose a novel similarity on classification distributions, dubbed task similarity, that quantifies how an optimally-transformed optimal representation for a…

    Submitted 12 November, 2020; originally announced November 2020.

  16. arXiv:2007.13843  [pdf, other]

    stat.ML cs.IR cs.LG cs.SI

    Robust Similarity and Distance Learning via Decision Forests

    Authors: Tyler M. Tomita, Joshua T. Vogelstein

    Abstract: Canonical distances such as Euclidean distance often fail to capture the appropriate relationships between items, subsequently leading to subpar inference and prediction. Many algorithms have been proposed for automated learning of suitable distances, most of which employ linear methods to learn a global metric over the feature space. While such methods offer nice theoretical properties, interpret…

    Submitted 21 August, 2020; v1 submitted 27 July, 2020; originally announced July 2020.

    Comments: Submitted to NeurIPS 2020

  17. arXiv:2005.11890  [pdf, other]

    stat.ML cs.LG stat.CO

    mvlearn: Multiview Machine Learning in Python

    Authors: Ronan Perry, Gavin Mischler, Richard Guo, Theodore Lee, Alexander Chang, Arman Koul, Cameron Franz, Hugo Richard, Iain Carmichael, Pierre Ablin, Alexandre Gramfort, Joshua T. Vogelstein

    Abstract: As data are generated more and more from multiple disparate sources, multiview data sets, where each sample has features in distinct views, have ballooned in recent years. However, no comprehensive package exists that enables non-specialists to use these methods easily. mvlearn is a Python library which implements the leading multiview machine learning methods. Its simple API closely follows that…

    Submitted 25 May, 2021; v1 submitted 24 May, 2020; originally announced May 2020.

    Comments: 6 pages, 2 figures, 1 table

  18. arXiv:2005.10700  [pdf, other]

    cs.LG cs.IR stat.ML

    Distance-based Positive and Unlabeled Learning for Ranking

    Authors: Hayden S. Helm, Amitabh Basu, Avanti Athreya, Youngser Park, Joshua T. Vogelstein, Carey E. Priebe, Michael Winding, Marta Zlatic, Albert Cardona, Patrick Bourke, Jonathan Larson, Marah Abdin, Piali Choudhury, Weiwei Yang, Christopher W. White

    Abstract: Learning to rank -- producing a ranked list of items specific to a query and with respect to a set of supervisory items -- is a problem of general interest. The setting we consider is one in which no analytic description of what constitutes a good ranking is available. Instead, we have a collection of representations and supervisory information consisting of a (target item, interesting items set)…

    Submitted 28 September, 2022; v1 submitted 19 May, 2020; originally announced May 2020.

    Comments: 21 pages, 5 figures

  19. arXiv:2004.12926  [pdf]

    cs.CY cs.AI q-bio.NC

    A New Age of Computing and the Brain

    Authors: Polina Golland, Jack Gallant, Greg Hager, Hanspeter Pfister, Christos Papadimitriou, Stefan Schaal, Joshua T. Vogelstein

    Abstract: The histories of computer science and brain sciences are intertwined. In his unfinished manuscript "The Computer and the Brain," von Neumann debates whether or not the brain can be thought of as a computing machine and identifies some of the similarities and differences between natural and artificial computation. Turing, in his 1950 article in Mind, argues that computing devices could ultimately emu…

    Submitted 27 April, 2020; originally announced April 2020.

    Comments: A Computing Community Consortium (CCC) workshop report, 24 pages

    Report number: ccc2014report_5

  20. arXiv:2004.12908  [pdf, other]

    cs.AI cs.LG stat.ML

    A Simple Lifelong Learning Approach

    Authors: Joshua T. Vogelstein, Jayanta Dey, Hayden S. Helm, Will LeVine, Ronak D. Mehta, Tyler M. Tomita, Haoyin Xu, Ali Geisa, Qingyang Wang, Gido M. van de Ven, Chenyu Gao, Weiwei Yang, Bryan Tower, Jonathan Larson, Christopher M. White, Carey E. Priebe

    Abstract: In lifelong learning, data are used to improve performance not only on the present task, but also on past and future (unencountered) tasks. While typical transfer learning algorithms can improve performance on future tasks, their performance on prior tasks degrades upon learning new tasks (called forgetting). Many recent approaches for continual or lifelong learning have attempted to maintain perf…

    Submitted 11 June, 2024; v1 submitted 27 April, 2020; originally announced April 2020.

  21. arXiv:1912.12150  [pdf, other]

    stat.ML cs.LG math.ST stat.ME

    The Chi-Square Test of Distance Correlation

    Authors: Cencheng Shen, Sambit Panda, Joshua T. Vogelstein

    Abstract: Distance correlation has gained much recent attention in the data science community: the sample statistic is straightforward to compute and asymptotically equals zero if and only if the variables are independent, making it an ideal choice to discover any type of dependency structure given sufficient sample size. One major bottleneck is the testing process: because the null distribution of distance correlation depe…

    Submitted 14 May, 2021; v1 submitted 27 December, 2019; originally announced December 2019.

    Comments: 21 pages, 4 figures, 1 table

    Journal ref: Journal of Computational and Graphical Statistics 31(1), 254-262, 2022

  22. arXiv:1910.08883  [pdf, other]

    stat.ML cs.LG

    High-dimensional and universally consistent k-sample tests

    Authors: Sambit Panda, Cencheng Shen, Ronan Perry, Jelle Zorn, Antoine Lutz, Carey E. Priebe, Joshua T. Vogelstein

    Abstract: The k-sample testing problem involves determining whether $k$ groups of data points are each drawn from the same distribution. The standard method for k-sample testing in biomedicine is multivariate analysis of variance (MANOVA), even though it depends on strong, and often unsuitable, parametric assumptions. Moreover, independence testing and k-sample testing are closely related, and several univ…

    Submitted 11 October, 2023; v1 submitted 19 October, 2019; originally announced October 2019.

  23. arXiv:1909.11799  [pdf, other]

    cs.LG stat.ML

    Manifold Oblique Random Forests: Towards Closing the Gap on Convolutional Deep Networks

    Authors: Adam Li, Ronan Perry, Chester Huynh, Tyler M. Tomita, Ronak Mehta, Jesus Arroyo, Jesse Patsolic, Benjamin Falk, Joshua T. Vogelstein

    Abstract: Decision forests (Forests), in particular random forests and gradient boosting trees, have demonstrated state-of-the-art accuracy compared to other methods in many supervised learning scenarios. In particular, Forests dominate other methods in tabular data, that is, when the feature space is unstructured, so that the signal is invariant to a permutation of the feature indices. However, in structur…

    Submitted 5 September, 2022; v1 submitted 25 September, 2019; originally announced September 2019.

    Comments: Updated manuscript based on review at SIMODS

    MSC Class: 68T05

  24. arXiv:1909.02688  [pdf, other]

    cs.LG stat.ML

    AutoGMM: Automatic and Hierarchical Gaussian Mixture Modeling in Python

    Authors: Thomas L. Athey, Tingshan Liu, Benjamin D. Pedigo, Joshua T. Vogelstein

    Abstract: Background: Gaussian mixture modeling is a fundamental tool in clustering, as well as discriminant analysis and semiparametric density estimation. However, estimating the optimal model for any given number of components is an NP-hard problem, and estimating the number of components is in some respects an even harder problem. Findings: In R, a popular package called mclust addresses both of these p…

    Submitted 12 August, 2021; v1 submitted 5 September, 2019; originally announced September 2019.

  25. arXiv:1908.06486  [pdf, other]

    stat.ML cs.LG stat.ME

    Independence Testing for Temporal Data

    Authors: Cencheng Shen, Jaewon Chung, Ronak Mehta, Ting Xu, Joshua T. Vogelstein

    Abstract: Temporal data are increasingly prevalent in modern data science. A fundamental question is whether two time series are related or not. Existing approaches often have limitations, such as relying on parametric assumptions, detecting only linear associations, and requiring multiple tests and corrections. While many non-parametric and universally consistent dependence measures have recently been prop…

    Submitted 27 May, 2024; v1 submitted 18 August, 2019; originally announced August 2019.

    Comments: 19 pages main + 6 pages appendix

    Journal ref: Transactions on Machine Learning Research, 2024

  26. arXiv:1907.03335  [pdf, other]

    cs.DC cs.DB

    Graphyti: A Semi-External Memory Graph Library for FlashGraph

    Authors: Disa Mhembere, Da Zheng, Carey E. Priebe, Joshua T. Vogelstein, Randal Burns

    Abstract: Graph datasets exceed the in-memory capacity of most standalone machines. Traditionally, graph frameworks have overcome memory limitations through scale-out, distributing computing. Emerging frameworks avoid the network bottleneck of distributed data with Semi-External Memory (SEM) that uses a single multicore node and operates on graphs larger than memory. In SEM, $\mathcal{O}(m)$ data resides on…

    Submitted 7 July, 2019; originally announced July 2019.

  27. arXiv:1907.02844  [pdf, other]

    stat.ML cs.IR cs.LG stat.ME

    Geodesic Learning via Unsupervised Decision Forests

    Authors: Meghana Madhyastha, Percy Li, James Browne, Veronika Strnadova-Neeley, Carey E. Priebe, Randal Burns, Joshua T. Vogelstein

    Abstract: Geodesic distance is the length of the shortest path between two points in a Riemannian manifold. Manifold learning algorithms, such as Isomap, seek to learn a manifold that preserves geodesic distances. However, such methods operate on the ambient dimensionality, and are therefore fragile to noise dimensions. We developed an unsupervised random forest method (URerF) to approximately learn geodesic distances in…

    Submitted 5 July, 2019; originally announced July 2019.

  28. arXiv:1907.02088  [pdf, other]

    stat.CO cs.MS stat.ME stat.ML

    hyppo: A Multivariate Hypothesis Testing Python Package

    Authors: Sambit Panda, Satish Palaniappan, Junhao Xiong, Eric W. Bridgeford, Ronak Mehta, Cencheng Shen, Joshua T. Vogelstein

    Abstract: We introduce hyppo, a unified library for performing multivariate hypothesis testing, including independence, two-sample, and k-sample testing. While many multivariate independence tests have R packages available, the interfaces are inconsistent and most are not available in Python. hyppo includes many state of the art multivariate testing procedures. The package is easy-to-use and is flexible eno…

    Submitted 1 April, 2021; v1 submitted 3 July, 2019; originally announced July 2019.

    Comments: 5 pages, 1 figure

  29. arXiv:1907.00325  [pdf, other]

    cs.LG stat.ML

    Random Forests for Adaptive Nearest Neighbor Estimation of Information-Theoretic Quantities

    Authors: Ronan Perry, Ronak Mehta, Richard Guo, Eva Yezerets, Jesús Arroyo, Mike Powell, Hayden Helm, Cencheng Shen, Joshua T. Vogelstein

    Abstract: Information-theoretic quantities, such as conditional entropy and mutual information, are critical data summaries for quantifying uncertainty. Current widely used approaches for computing such quantities rely on nearest neighbor methods and exhibit both strong performance and theoretical guarantees in certain simple scenarios. However, existing approaches fail in high-dimensional settings and when…

    Submitted 5 October, 2021; v1 submitted 30 June, 2019; originally announced July 2019.

  30. arXiv:1906.10026  [pdf, other]

    stat.ME cs.SI math.ST

    Inference for multiple heterogeneous networks with a common invariant subspace

    Authors: Jesús Arroyo, Avanti Athreya, Joshua Cape, Guodong Chen, Carey E. Priebe, Joshua T. Vogelstein

    Abstract: The development of models for multiple heterogeneous network data is of critical importance both in statistical network theory and across multiple application domains. Although single-graph inference is well-studied, multiple graph inference is largely unexplored, in part because of the challenges inherent in appropriately modeling graph differences and yet retaining sufficient model simplicity to…

    Submitted 22 August, 2020; v1 submitted 24 June, 2019; originally announced June 2019.

  31. arXiv:1904.05329  [pdf, other]

    cs.SI stat.ML stat.OT

    GraSPy: Graph Statistics in Python

    Authors: Jaewon Chung, Benjamin D. Pedigo, Eric W. Bridgeford, Bijan K. Varjavand, Hayden S. Helm, Joshua T. Vogelstein

    Abstract: We introduce GraSPy, a Python library devoted to statistical inference, machine learning, and visualization of random graphs and graph populations. This package provides flexible and easy-to-use algorithms for analyzing and understanding graphs with a scikit-learn compliant API. GraSPy can be downloaded from Python Package Index (PyPI), and is released under the Apache 2.0 open-source license. The…

    Submitted 14 August, 2019; v1 submitted 29 March, 2019; originally announced April 2019.

    Journal ref: Journal of Machine Learning Research 20.158 (2019): 1-7

  32. arXiv:1902.09527  [pdf, other]

    cs.DC

    clusterNOR: A NUMA-Optimized Clustering Framework

    Authors: Disa Mhembere, Da Zheng, Carey E. Priebe, Joshua T. Vogelstein, Randal Burns

    Abstract: Clustering algorithms are iterative and have complex data access patterns that result in many small random memory accesses. The performance of parallel implementations suffers from synchronous barriers for each iteration and skewed workloads. We rethink the parallelization of clustering for modern non-uniform memory architectures (NUMA) to maximize independent, asynchronous computation. We elimina…

    Submitted 17 January, 2021; v1 submitted 24 February, 2019; originally announced February 2019.

    Comments: arXiv admin note: Journal version of arXiv:1606.08905

  33. arXiv:1812.00029  [pdf, other]

    stat.ML cs.LG

    Learning Interpretable Characteristic Kernels via Decision Forests

    Authors: Sambit Panda, Cencheng Shen, Joshua T. Vogelstein

    Abstract: Decision forests are widely used for classification and regression tasks. A lesser known property of tree-based methods is that one can construct a proximity matrix from the tree(s), and these proximity matrices are induced kernels. While there has been extensive research on the applications and properties of kernels, there is relatively little research on kernels induced by decision forests. We c…

    Submitted 28 September, 2023; v1 submitted 30 November, 2018; originally announced December 2018.

  34. On a 'Two Truths' Phenomenon in Spectral Graph Clustering

    Authors: Carey E. Priebe, Youngser Park, Joshua T. Vogelstein, John M. Conroy, Vince Lyzinski, Minh Tang, Avanti Athreya, Joshua Cape, Eric Bridgeford

    Abstract: Clustering is concerned with coherently grouping observations without any explicit concept of true groupings. Spectral graph clustering - clustering the vertices of a graph based on their spectral embedding - is commonly approached via K-means (or, more generally, Gaussian mixture model) clustering composed with either Laplacian or Adjacency spectral embedding (LSE or ASE). Recent theoretical resu…

    Submitted 11 February, 2019; v1 submitted 23 August, 2018; originally announced August 2018.

    Journal ref: PNAS 116 (2019) 5995-6000

  35. arXiv:1806.07300  [pdf, other]

    cs.PF cs.DC

    Forest Packing: Fast, Parallel Decision Forests

    Authors: James Browne, Tyler M. Tomita, Disa Mhembere, Randal Burns, Joshua T. Vogelstein

    Abstract: Machine learning has an emerging critical role in high-performance computing to modulate simulations, extract knowledge from massive data, and replace numerical models with efficient approximations. Decision forests are a critical tool because they provide insight into model operation that is critical to interpreting learned results. While decision forests are trivially parallelizable, the travers…

    Submitted 19 June, 2018; originally announced June 2018.

  36. The Exact Equivalence of Distance and Kernel Methods for Hypothesis Testing

    Authors: Cencheng Shen, Joshua T. Vogelstein

    Abstract: Distance-based tests, also called "energy statistics", are leading methods for two-sample and independence tests from the statistics community. Kernel-based tests, developed from "kernel mean embeddings", are leading methods for two-sample and independence tests from the machine learning community. A fixed-point transformation was previously proposed to connect the distance methods and kernel meth…

    Submitted 14 September, 2020; v1 submitted 14 June, 2018; originally announced June 2018.

    Comments: 24 pages main + 7 pages appendix, 3 figures

    Journal ref: AStA Advances in Statistical Analysis 105(3), 385-403, 2021

  37. arXiv:1710.09859  [pdf, other]

    stat.ML cs.CV cs.DS cs.LG math.ST

    Kernel k-Groups via Hartigan's Method

    Authors: Guilherme França, Maria L. Rizzo, Joshua T. Vogelstein

    Abstract: Energy statistics was proposed by Székely in the 1980s, inspired by Newton's gravitational potential in classical mechanics, and it provides a model-free hypothesis test for equality of distributions. In its original form, energy statistics was formulated in Euclidean spaces. More recently, it was generalized to metric spaces of negative type. In this paper, we consider a formulation for the clust…

    Submitted 11 June, 2020; v1 submitted 26 October, 2017; originally announced October 2017.

    Comments: several improvements; connections with community detection and stochastic block model. Matches published version

    Journal ref: IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020

  38. arXiv:1703.03862  [pdf, other]

    stat.AP cs.LG stat.ML

    Joint Embedding of Graphs

    Authors: Shangsi Wang, Jesús Arroyo, Joshua T. Vogelstein, Carey E. Priebe

    Abstract: Feature extraction and dimension reduction for networks is critical in a wide variety of domains. Efficiently and accurately learning features for multiple graphs has important applications in statistical inference on graphs. We propose a method to jointly embed multiple undirected graphs. Given a set of graphs, the joint embedding method identifies a linear subspace spanned by rank one symmetric…

    Submitted 17 October, 2019; v1 submitted 10 March, 2017; originally announced March 2017.

  39. arXiv:1612.00356  [pdf, other]

    cs.CV

    A Large Deformation Diffeomorphic Approach to Registration of CLARITY Images via Mutual Information

    Authors: Kwame S. Kutten, Nicolas Charon, Michael I. Miller, J. T. Ratnanather, Jordan Matelsky, Alexander D. Baden, Kunal Lillaney, Karl Deisseroth, Li Ye, Joshua T. Vogelstein

    Abstract: CLARITY is a method for converting biological tissues into translucent and porous hydrogel-tissue hybrids. This facilitates interrogation with light sheet microscopy and penetration of molecular probes while avoiding physical slicing. In this work, we develop a pipeline for registering CLARIfied mouse brains to an annotated brain atlas. Due to the novelty of this microscopy technique it is impract…

    Submitted 11 August, 2017; v1 submitted 1 December, 2016; originally announced December 2016.

  40. Probabilistic Fluorescence-Based Synapse Detection

    Authors: Anish K. Simhal, Cecilia Aguerrebere, Forrest Collman, Joshua T. Vogelstein, Kristina D. Micheva, Richard J. Weinberg, Stephen J. Smith, Guillermo Sapiro

    Abstract: Brain function results from communication between neurons connected by complex synaptic networks. Synapses are themselves highly complex and diverse signaling machines, containing protein products of hundreds of different genes, some in hundreds of copies, arranged in precise lattice at each individual synapse. Synapses are fundamental not only to synaptic network function but also to network deve…

    Submitted 16 November, 2016; originally announced November 2016.

    Comments: Currently awaiting peer review

  41. arXiv:1606.08905  [pdf, other]

    cs.DC

    knor: A NUMA-Optimized In-Memory, Distributed and Semi-External-Memory k-means Library

    Authors: Disa Mhembere, Da Zheng, Carey E. Priebe, Joshua T. Vogelstein, Randal Burns

    Abstract: k-means is one of the most influential and utilized machine learning algorithms. Its computation limits the performance and scalability of many statistical analysis and machine learning tasks. We rethink and optimize k-means in terms of modern NUMA architectures to develop a novel parallelization scheme that delays and minimizes synchronization barriers. The k-means NUMA Optimized Routine…

    Submitted 24 June, 2017; v1 submitted 28 June, 2016; originally announced June 2016.

  42. arXiv:1605.02060  [pdf, other]

    q-bio.QM cs.CV

    Deformably Registering and Annotating Whole CLARITY Brains to an Atlas via Masked LDDMM

    Authors: Kwame S. Kutten, Joshua T. Vogelstein, Nicolas Charon, Li Ye, Karl Deisseroth, Michael I. Miller

    Abstract: The CLARITY method renders brains optically transparent to enable high-resolution imaging in the structurally intact brain. Anatomically annotating CLARITY brains is necessary for discovering which regions contain signals of interest. Manually annotating whole-brain, terabyte CLARITY images is difficult, time-consuming, subjective, and error-prone. Automatically registering CLARITY images to a pre…

    Submitted 6 May, 2016; originally announced May 2016.

    Journal ref: Proc. SPIE 9896 Optics, Photonics and Digital Technologies for Imaging Applications IV (2016)

  43. arXiv:1604.06414  [pdf, other]

    cs.DC

    FlashR: R-Programmed Parallel and Scalable Machine Learning using SSDs

    Authors: Da Zheng, Disa Mhembere, Joshua T. Vogelstein, Carey E. Priebe, Randal Burns

    Abstract: R is one of the most popular programming languages for statistics and machine learning, but the R framework is relatively slow and unable to scale to large datasets. The general approach for speeding up an implementation in R is to implement the algorithms in C or FORTRAN and provide an R wrapper. FlashR takes a different approach: it executes R code in parallel and scales the code beyond memory c…

    Submitted 18 May, 2017; v1 submitted 21 April, 2016; originally announced April 2016.

  44. arXiv:1604.03629  [pdf, other]

    q-bio.QM cs.CV

    Quantifying mesoscale neuroanatomy using X-ray microtomography

    Authors: Eva L. Dyer, William Gray Roncal, Hugo L. Fernandes, Doga Gürsoy, Vincent De Andrade, Rafael Vescovi, Kamel Fezzaa, Xianghui Xiao, Joshua T. Vogelstein, Chris Jacobsen, Konrad P. Körding, Narayanan Kasthuri

    Abstract: Methods for resolving the 3D microstructure of the brain typically start by thinly slicing and staining the brain, and then imaging each individual section with visible light photons or electrons. In contrast, X-rays can be used to image thick samples, providing a rapid approach for producing large 3D brain maps without sectioning. Here we demonstrate the use of synchrotron X-ray microtomography (…

    Submitted 26 July, 2016; v1 submitted 12 April, 2016; originally announced April 2016.

    Comments: 28 pages, 9 figures

  45. arXiv:1506.03410  [pdf, other]

    stat.ML cs.LG

    Sparse Projection Oblique Randomer Forests

    Authors: Tyler M. Tomita, James Browne, Cencheng Shen, Jaewon Chung, Jesse L. Patsolic, Benjamin Falk, Jason Yim, Carey E. Priebe, Randal Burns, Mauro Maggioni, Joshua T. Vogelstein

    Abstract: Decision forests, including Random Forests and Gradient Boosting Trees, have recently demonstrated state-of-the-art performance in a variety of machine learning settings. Decision forests are typically ensembles of axis-aligned decision trees; that is, trees that split only along feature dimensions. In contrast, many recent extensions to decision forests are based on axis-oblique splits. Unfortuna…

    Submitted 3 October, 2019; v1 submitted 10 June, 2015; originally announced June 2015.

    Comments: 31 pages; submitted to Journal of Machine Learning Research for review

    MSC Class: 68T10 ACM Class: I.5.2

    Journal ref: Journal of Machine Learning Research 21(104), 1-39, 2020

  46. arXiv:1411.6880  [pdf, other]

    q-bio.QM cs.CV

    An Automated Images-to-Graphs Framework for High Resolution Connectomics

    Authors: William Gray Roncal, Dean M. Kleissas, Joshua T. Vogelstein, Priya Manavalan, Kunal Lillaney, Michael Pekala, Randal Burns, R. Jacob Vogelstein, Carey E. Priebe, Mark A. Chevillet, Gregory D. Hager

    Abstract: Reconstructing a map of neuronal connectivity is a critical challenge in contemporary neuroscience. Recent advances in high-throughput serial section electron microscopy (EM) have produced massive 3D image volumes of nanoscale brain tissue for the first time. The resolution of EM allows for individual neurons and their synaptic connections to be directly observed. Recovering neuronal networks by m…

    Submitted 30 April, 2015; v1 submitted 25 November, 2014; originally announced November 2014.

    Comments: 13 pages, first two authors contributed equally V2: Added additional experiments and clarifications; added information on infrastructure and pipeline environment

  47. arXiv:1411.2158  [pdf, ps, other]

    stat.ML cs.LG math.ST stat.ME

    Covariate-assisted spectral clustering

    Authors: Norbert Binkiewicz, Joshua T. Vogelstein, Karl Rohe

    Abstract: Biological and social systems consist of myriad interacting units. The interactions can be represented in the form of a graph or network. Measurements of these graphs can reveal the underlying structure of these interactions, which provides insight into the systems that generated the graphs. Moreover, in applications such as connectomics, social networks, and genomics, graph data are accompanied b…

    Submitted 30 October, 2016; v1 submitted 8 November, 2014; originally announced November 2014.

    Comments: 28 pages, 4 figures, includes substantial changes to theoretical results

    Journal ref: Biometrika, Volume 104, Issue 2, 1 June 2017, Pages 361-377

  48. arXiv:1404.4800  [pdf, other]

    cs.CV

    Automatic Annotation of Axoplasmic Reticula in Pursuit of Connectomes

    Authors: Ayushi Sinha, William Gray Roncal, Narayanan Kasthuri, Ming Chuang, Priya Manavalan, Dean M. Kleissas, Joshua T. Vogelstein, R. Jacob Vogelstein, Randal Burns, Jeff W. Lichtman, Michael Kazhdan

    Abstract: In this paper, we present a new pipeline which automatically identifies and annotates axoplasmic reticula, which are small subcellular structures present only in axons. We run our algorithm on the Kasthuri11 dataset, which was color corrected using gradient-domain techniques to adjust contrast. We use a bilateral filter to smooth out the noise in this data while preserving edges, which highlights…

    Submitted 16 April, 2014; originally announced April 2014.

    Comments: 2 pages, 1 figure

  49. arXiv:1403.3724  [pdf, other]

    cs.CV cs.CE q-bio.QM

    VESICLE: Volumetric Evaluation of Synaptic Interfaces using Computer vision at Large Scale

    Authors: William Gray Roncal, Michael Pekala, Verena Kaynig-Fittkau, Dean M. Kleissas, Joshua T. Vogelstein, Hanspeter Pfister, Randal Burns, R. Jacob Vogelstein, Mark A. Chevillet, Gregory D. Hager

    Abstract: An open challenge problem at the forefront of modern neuroscience is to obtain a comprehensive mapping of the neural pathways that underlie human brain function; an enhanced understanding of the wiring diagram of the brain promises to lead to new breakthroughs in diagnosing and treating neurological disorders. Inferring brain structure from image data, such as that obtained via electron microscopy…

    Submitted 7 September, 2015; v1 submitted 14 March, 2014; originally announced March 2014.

    Comments: v4: added clarifying figures and updates for readability. v3: fixed metadata. 11 pp v2: Added CNN classifier, significant changes to improve performance and generalization

    Journal ref: Proceedings of the British Machine Vision Conference (BMVC), pages 81.1-81.13. BMVA Press, September 2015

  50. MIGRAINE: MRI Graph Reliability Analysis and Inference for Connectomics

    Authors: William Gray Roncal, Zachary H. Koterba, Disa Mhembere, Dean M. Kleissas, Joshua T. Vogelstein, Randal Burns, Anita R. Bowles, Dimitrios K. Donavos, Sephira Ryman, Rex E. Jung, Lei Wu, Vince Calhoun, R. Jacob Vogelstein

    Abstract: Currently, connectomes (e.g., functional or structural brain graphs) can be estimated in humans at $\approx 1~mm^3$ scale using a combination of diffusion weighted magnetic resonance imaging, functional magnetic resonance imaging and structural magnetic resonance imaging scans. This manuscript summarizes a novel, scalable implementation of open-source algorithms to rapidly estimate magnetic resona…

    Submitted 17 December, 2013; originally announced December 2013.

    Comments: Published as part of 2013 IEEE GlobalSIP conference