-
STimage-1K4M: A histopathology image-gene expression dataset for spatial transcriptomics
Authors:
Jiawen Chen,
Muqing Zhou,
Wenrong Wu,
Jinwei Zhang,
Yun Li,
Didong Li
Abstract:
Recent advances in multi-modal algorithms have driven and been driven by the increasing availability of large image-text datasets, leading to significant strides in various fields, including computational pathology. However, in most existing medical image-text datasets, the text typically provides high-level summaries that may not sufficiently describe sub-tile regions within a large pathology image. For example, an image might cover an extensive tissue area containing cancerous and healthy regions, but the accompanying text might only specify that this image is a cancer slide, lacking the nuanced details needed for in-depth analysis. In this study, we introduce STimage-1K4M, a novel dataset designed to bridge this gap by providing genomic features for sub-tile images. STimage-1K4M contains 1,149 images derived from spatial transcriptomics data, which captures gene expression information at the level of individual spatial spots within a pathology image. Specifically, each image in the dataset is broken down into smaller sub-image tiles, with each tile paired with 15,000-30,000 dimensional gene expressions. With 4,293,195 pairs of sub-tile images and gene expressions, STimage-1K4M offers unprecedented granularity, paving the way for a wide range of advanced research in multi-modal data analysis and innovative applications in computational pathology and beyond.
Submitted 20 June, 2024; v1 submitted 10 June, 2024;
originally announced June 2024.
-
PhyloGFN: Phylogenetic inference with generative flow networks
Authors:
Mingyang Zhou,
Zichao Yan,
Elliot Layne,
Nikolay Malkin,
Dinghuai Zhang,
Moksh Jain,
Mathieu Blanchette,
Yoshua Bengio
Abstract:
Phylogenetics is a branch of computational biology that studies the evolutionary relationships among biological entities. Its long history and numerous applications notwithstanding, inference of phylogenetic trees from sequence data remains challenging: the high complexity of tree space poses a significant obstacle for the current combinatorial and probabilistic techniques. In this paper, we adopt the framework of generative flow networks (GFlowNets) to tackle two core problems in phylogenetics: parsimony-based and Bayesian phylogenetic inference. Because GFlowNets are well-suited for sampling complex combinatorial structures, they are a natural choice for exploring and sampling from the multimodal posterior distribution over tree topologies and evolutionary distances. We demonstrate that our amortized posterior sampler, PhyloGFN, produces diverse and high-quality evolutionary hypotheses on real benchmark datasets. PhyloGFN is competitive with prior works in marginal likelihood estimation and achieves a closer fit to the target distribution than state-of-the-art variational inference methods. Our code is available at https://github.com/zmy1116/phylogfn.
Submitted 24 March, 2024; v1 submitted 12 October, 2023;
originally announced October 2023.
-
AmadeusGPT: a natural language interface for interactive animal behavioral analysis
Authors:
Shaokai Ye,
Jessy Lauer,
Mu Zhou,
Alexander Mathis,
Mackenzie W. Mathis
Abstract:
The process of quantifying and analyzing animal behavior involves translating the naturally occurring descriptive language of their actions into machine-readable code. Yet, codifying behavior analysis is often challenging without deep understanding of animal behavior and technical machine learning knowledge. To bridge this gap, we introduce AmadeusGPT: a natural language interface that turns natural language descriptions of behaviors into machine-executable code. Large-language models (LLMs) such as GPT3.5 and GPT4 allow for interactive language-based queries that are potentially well suited for interactive behavior analysis. However, the comprehension capability of these LLMs is limited by the context window size, which prevents them from remembering distant conversations. To overcome the context window limitation, we implement a novel dual-memory mechanism to allow communication between short-term and long-term memory using symbols as context pointers for retrieval and saving. Concretely, users directly use language-based definitions of behavior and our augmented GPT develops code based on the core AmadeusGPT API, which contains machine learning, computer vision, spatio-temporal reasoning, and visualization modules. Users can then interactively refine results, and seamlessly add new behavioral modules as needed. We benchmark AmadeusGPT and show we can produce state-of-the-art performance on the MABE 2022 behavior challenge tasks. Note that an end-user would not need to write any code to achieve this. Thus, collectively AmadeusGPT presents a novel way to merge deep biological knowledge, large-language models, and core computer vision modules into a more naturally intelligent system. Code and demos can be found at: https://github.com/AdaptiveMotorControlLab/AmadeusGPT.
Submitted 10 July, 2023;
originally announced July 2023.
-
Rethinking pose estimation in crowds: overcoming the detection information-bottleneck and ambiguity
Authors:
Mu Zhou,
Lucas Stoffl,
Mackenzie Weygandt Mathis,
Alexander Mathis
Abstract:
Frequent interactions between individuals are a fundamental challenge for pose estimation algorithms. Current pipelines either use an object detector together with a pose estimator (top-down approach), or localize all body parts first and then link them to predict the pose of individuals (bottom-up). Yet, when individuals closely interact, top-down methods are ill-defined due to overlapping individuals, and bottom-up methods often falsely infer connections to distant bodyparts. Thus, we propose a novel pipeline called bottom-up conditioned top-down pose estimation (BUCTD) that combines the strengths of bottom-up and top-down methods. Specifically, we propose to use a bottom-up model as the detector, which in addition to an estimated bounding box provides a pose proposal that is fed as condition to an attention-based top-down model. We demonstrate the performance and efficiency of our approach on animal and human pose estimation benchmarks. On CrowdPose and OCHuman, we outperform previous state-of-the-art models by a significant margin. We achieve 78.5 AP on CrowdPose and 48.5 AP on OCHuman, an improvement of 8.6% and 7.8% over the prior art, respectively. Furthermore, we show that our method strongly improves the performance on multi-animal benchmarks involving fish and monkeys. The code is available at https://github.com/amathislab/BUCTD
Submitted 30 September, 2023; v1 submitted 13 June, 2023;
originally announced June 2023.
-
Statistical Issues and Recommendations for Clinical Trials Conducted During the COVID-19 Pandemic
Authors:
R. Daniel Meyer,
Bohdana Ratitch,
Marcel Wolbers,
Olga Marchenko,
Hui Quan,
Daniel Li,
Chrissie Fletcher,
Xin Li,
David Wright,
Yue Shentu,
Stefan Englert,
Wei Shen,
Jyotirmoy Dey,
Thomas Liu,
Ming Zhou,
Norman Bohidar,
Peng-Liang Zhao,
Michael Hale
Abstract:
The COVID-19 pandemic has had and continues to have major impacts on planned and ongoing clinical trials. Its effects on trial data create multiple potential statistical issues. The scale of impact is unprecedented, but when viewed individually, many of the issues are well defined and feasible to address. A number of strategies and recommendations are put forward to assess and address issues related to estimands, missing data, validity and modifications of statistical analysis methods, need for additional analyses, ability to meet objectives and overall trial interpretability.
Submitted 20 May, 2020;
originally announced May 2020.
-
A Discussion on the Algorithm Design of Electrical Impedance Tomography for Biomedical Applications
Authors:
Mingyong Zhou,
Hongyu Zhu
Abstract:
In this paper, we present a discussion on the algorithm design of Electrical Impedance Tomography (EIT) for biomedical applications. Based on the Maxwell differential equations and the derived finite element (FE) linear equations, we first investigate the possibility of estimating the matrix that contains the impedance values based on Singular Value Decomposition (SVD) approximations. Second, based on the biomedical properties, we further explore the possibility of recovering the impedance values uniquely by injecting various types of multi-frequency currents. Injecting various types of multi-frequency currents leads to a set of different measured voltage configurations, thus enhancing the possibility of uniquely recovering the impedance values in a stable way, under the assumption that biological cells respond differently to the different types of injected currents.
Submitted 14 January, 2019;
originally announced January 2019.
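As a reading aid for the SVD approximations mentioned in the abstract above, the following is a generic truncated-SVD solution of a linear system. It is a sketch under assumptions, not the authors' formulation: the system $A x = b$, the truncation rank $k$, and all symbols are introduced here for illustration only.

```latex
% Assumed generic setting: A x = b, with A of rank r (symbols are illustrative).
A = U \Sigma V^{\top} = \sum_{i=1}^{r} \sigma_i \, u_i v_i^{\top},
\qquad
x_k = \sum_{i=1}^{k} \frac{u_i^{\top} b}{\sigma_i} \, v_i,
\quad 1 \le k \le r .
```

Keeping only the $k$ largest singular values discards the directions in which division by a small $\sigma_i$ would amplify measurement noise, which is the usual reason truncation is used to stabilise ill-posed inverse problems.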
-
Bayesian multi-domain learning for cancer subtype discovery from next-generation sequencing count data
Authors:
Ehsan Hajiramezanali,
Siamak Zamani Dadaneh,
Alireza Karbalayghareh,
Mingyuan Zhou,
Xiaoning Qian
Abstract:
Precision medicine aims for personalized prognosis and therapeutics by utilizing recent genome-scale high-throughput profiling techniques, including next-generation sequencing (NGS). However, translating NGS data faces several challenges. First, NGS count data are often overdispersed, requiring appropriate modeling. Second, compared to the number of involved molecules and system complexity, the number of available samples for studying complex disease, such as cancer, is often limited, especially considering disease heterogeneity. The key question is whether we may integrate available data from all different sources or domains to achieve reproducible disease prognosis based on NGS count data. In this paper, we develop a Bayesian Multi-Domain Learning (BMDL) model that derives domain-dependent latent representations of overdispersed count data based on hierarchical negative binomial factorization for accurate cancer subtyping even if the number of samples for a specific cancer type is small. Experimental results from both our simulated and NGS datasets from The Cancer Genome Atlas (TCGA) demonstrate the promising potential of BMDL for effective multi-domain learning without "negative transfer" effects often seen in existing multi-task learning and transfer learning methods.
Submitted 22 October, 2018;
originally announced October 2018.
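The overdispersion that the abstract above attributes to NGS count data can be illustrated with a plain negative binomial distribution. The sketch below is not the BMDL model; it assumes the parameterisation P(k) = Γ(k+r) / (k! Γ(r)) · p^k · (1−p)^r, and the function names `nb_log_pmf` and `nb_moments` are hypothetical. It checks numerically that the variance exceeds the mean, which a Poisson model cannot reproduce.

```python
import math


def nb_log_pmf(k, r, p):
    """Log-probability of count k under a negative binomial NB(r, p).

    Assumed parameterisation (an illustration, not taken from the paper):
    P(k) = Gamma(k + r) / (k! * Gamma(r)) * p**k * (1 - p)**r
    """
    return (math.lgamma(k + r) - math.lgamma(k + 1) - math.lgamma(r)
            + k * math.log(p) + r * math.log1p(-p))


def nb_moments(r, p, k_max=5000):
    """Mean and variance of NB(r, p), summed numerically over k < k_max."""
    probs = [math.exp(nb_log_pmf(k, r, p)) for k in range(k_max)]
    mean = sum(k * q for k, q in enumerate(probs))
    var = sum((k - mean) ** 2 * q for k, q in enumerate(probs))
    return mean, var


# Illustrative values only: dispersion r = 2, probability p = 0.8.
mean, var = nb_moments(r=2.0, p=0.8)
print(f"mean={mean:.3f} variance={var:.3f}")
```

Under this parameterisation the mean is r·p/(1−p) and the variance is mean + mean²/r, so a smaller r gives stronger overdispersion relative to a Poisson distribution with the same mean.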
-
Differential Expression Analysis of Dynamical Sequencing Count Data with a Gamma Markov Chain
Authors:
Ehsan Hajiramezanali,
Siamak Zamani Dadaneh,
Paul de Figueiredo,
Sing-Hoi Sze,
Mingyuan Zhou,
Xiaoning Qian
Abstract:
Next-generation sequencing (NGS) to profile temporal changes in living systems is gaining more attention for deriving better insights into the underlying biological mechanisms compared to traditional static sequencing experiments. Nonetheless, the majority of existing statistical tools for analyzing NGS data lack the capability of exploiting the richer information embedded in temporal data. Several recent tools have been developed to analyze such data but they typically impose strict model assumptions, such as smoothness on gene expression dynamic changes. To capture a broader range of gene expression dynamic patterns, we develop the gamma Markov negative binomial (GMNB) model that integrates a gamma Markov chain into a negative binomial distribution model, allowing flexible temporal variation in NGS count data. Using Bayes factors, GMNB enables more powerful temporal gene differential expression analysis across different phenotypes or treatment conditions. In addition, it naturally handles the heterogeneity of sequencing depth in different samples, removing the need for ad-hoc normalization. Efficient Gibbs sampling inference of the GMNB model parameters is achieved by exploiting novel data augmentation techniques. Extensive experiments on both simulated and real-world RNA-seq data show that GMNB outperforms existing methods in both receiver operating characteristic (ROC) and precision-recall (PR) curves of differential expression analysis results.
Submitted 7 March, 2018;
originally announced March 2018.
-
BNP-Seq: Bayesian Nonparametric Differential Expression Analysis of Sequencing Count Data
Authors:
Siamak Zamani Dadaneh,
Xiaoning Qian,
Mingyuan Zhou
Abstract:
We perform differential expression analysis of high-throughput sequencing count data under a Bayesian nonparametric framework, removing sophisticated ad-hoc pre-processing steps commonly required in existing algorithms. We propose to use the gamma (beta) negative binomial process, which takes into account different sequencing depths using sample-specific negative binomial probability (dispersion) parameters, to detect differentially expressed genes by comparing the posterior distributions of gene-specific negative binomial dispersion (probability) parameters. These model parameters are inferred by borrowing statistical strength across both the genes and samples. Extensive experiments on both simulated and real-world RNA sequencing count data show that the proposed differential expression analysis algorithms clearly outperform previously proposed ones in terms of the areas under both the receiver operating characteristic and precision-recall curves.
Submitted 2 May, 2017; v1 submitted 13 August, 2016;
originally announced August 2016.
-
Multichannel Electrophysiological Spike Sorting via Joint Dictionary Learning & Mixture Modeling
Authors:
David E. Carlson,
Joshua T. Vogelstein,
Qisong Wu,
Wenzhao Lian,
Mingyuan Zhou,
Colin R. Stoetzner,
Daryl Kipke,
Douglas Weber,
David B. Dunson,
Lawrence Carin
Abstract:
We propose a construction for joint feature learning and clustering of multichannel extracellular electrophysiological data across multiple recording periods for action potential detection and discrimination ("spike sorting"). Our construction improves over the previous state-of-the-art principally in four ways. First, via sharing information across channels, we can better distinguish between single-unit spikes and artifacts. Second, our proposed "focused mixture model" (FMM) elegantly deals with units appearing, disappearing, or reappearing over multiple recording days, an important consideration for any chronic experiment. Third, by jointly learning features and clusters, we improve performance over previous attempts that proceeded via a two-stage ("frequentist") learning process. Fourth, by directly modeling spike rate, we improve detection of sparsely spiking neurons. Moreover, our Bayesian construction seamlessly handles missing data. We present state-of-the-art performance without requiring manual tuning of many hyper-parameters on both a public dataset with partial ground truth and a new experimental dataset.
Submitted 4 August, 2013; v1 submitted 2 April, 2013;
originally announced April 2013.
-
Spatial-temporal correlations in the process to self-organized criticality
Authors:
C. B. Yang,
X. Cai,
Z. M. Zhou
Abstract:
A new type of spatial-temporal correlation in the approach to self-organized criticality is investigated for two simple models of biological evolution. The behavior of changes in the position with minimum barrier is shown to be quantitatively different in the two models. Different results for the correlation are given for the two models. We argue that the correlation can be used, together with the power-law distributions, as a criterion for self-organized criticality.
Submitted 28 March, 2000;
originally announced March 2000.