Search | arXiv e-print repository

DFDRNN: A dual-feature based neural network for drug repositioning

Authors: Enqiang Zhu, Xiang Li, Chanjuan Liu, Nikhil R. Pal

Abstract: Drug repositioning is an economically efficient strategy used to discover new indications for existing drugs beyond their original approvals, expanding their applicability and usage to address challenges in disease treatment. In recent years, deep-learning techniques for drug repositioning have gained much attention. While most deep learning-based research methods focus on encoding drugs and diseases by extracting feature information from neighbors in the network, they often pay little attention to the potential relationships between the features of drugs and diseases, leading to imprecise encoding of drugs and diseases. To address this, we design a dual-feature drug repositioning neural network (DFDRNN) model to achieve precise encoding of drugs and diseases. DFDRNN uses two features to represent drugs and diseases: the similarity feature and the association feature. The model incorporates a self-attention mechanism to design two dual-feature extraction modules for precisely encoding drugs and diseases: the intra-domain dual-feature extraction (IntraDDFE) module and the inter-domain dual-feature extraction (InterDDFE) module. The IntraDDFE module extracts features from a single domain (drug or disease domain), while the InterDDFE module extracts features from the mixed domain (drug and disease domain). In particular, the feature is changed by InterDDFE, ensuring a precise encoding of drugs and diseases. Finally, a cross-dual-domain decoder is designed to predict drug-disease associations in both the drug and disease domains. Compared to six state-of-the-art methods, DFDRNN outperforms others on four benchmark datasets, with an average AUROC of 0.946 and an average AUPR of 0.597.

Submitted 16 July, 2024; originally announced July 2024.

arXiv:2406.08506 [pdf, other]

RGFN: Synthesizable Molecular Generation Using GFlowNets

Authors: Michał Koziarski, Andrei Rekesh, Dmytro Shevchuk, Almer van der Sloot, Piotr Gaiński, Yoshua Bengio, Cheng-Hao Liu, Mike Tyers, Robert A. Batey

Abstract: Generative models hold great promise for small molecule discovery, significantly increasing the size of search space compared to traditional in silico screening libraries. However, most existing machine learning methods for small molecule generation suffer from poor synthesizability of candidate compounds, making experimental validation difficult. In this paper we propose Reaction-GFlowNet (RGFN), an extension of the GFlowNet framework that operates directly in the space of chemical reactions, thereby allowing out-of-the-box synthesizability while maintaining comparable quality of generated candidates. We demonstrate that with the proposed set of reactions and building blocks, it is possible to obtain a search space of molecules orders of magnitude larger than existing screening libraries coupled with low cost of synthesis. We also show that the approach scales to very large fragment libraries, further increasing the number of potential molecules. We demonstrate the effectiveness of the proposed approach across a range of oracle models, including pretrained proxy models and GPU-accelerated docking.

Submitted 1 June, 2024; originally announced June 2024.

arXiv:2405.20313 [pdf, other]

Sequence-Augmented SE(3)-Flow Matching For Conditional Protein Backbone Generation

Authors: Guillaume Huguet, James Vuckovic, Kilian Fatras, Eric Thibodeau-Laufer, Pablo Lemos, Riashat Islam, Cheng-Hao Liu, Jarrid Rector-Brooks, Tara Akhound-Sadegh, Michael Bronstein, Alexander Tong, Avishek Joey Bose

Abstract: Proteins are essential for almost all biological processes and derive their diverse functions from complex 3D structures, which are in turn determined by their amino acid sequences. In this paper, we exploit the rich biological inductive bias of amino acid sequences and introduce FoldFlow-2, a novel sequence-conditioned SE(3)-equivariant flow matching model for protein structure generation. FoldFlow-2 presents substantial new architectural features over the previous FoldFlow family of models including a protein large language model to encode sequence, a new multi-modal fusion trunk that combines structure and sequence representations, and a geometric transformer based decoder. To increase diversity and novelty of generated samples -- crucial for de-novo drug design -- we train FoldFlow-2 at scale on a new dataset that is an order of magnitude larger than PDB datasets of prior works, containing both known proteins in PDB and high-quality synthetic structures achieved through filtering. We further demonstrate the ability to align FoldFlow-2 to arbitrary rewards, e.g. increasing secondary structures diversity, by introducing a Reinforced Finetuning (ReFT) objective. We empirically observe that FoldFlow-2 outperforms previous state-of-the-art protein structure-based generative models, improving over RFDiffusion in terms of unconditional generation across all metrics including designability, diversity, and novelty across all protein lengths, as well as exhibiting generalization on the task of equilibrium conformation sampling. Finally, we demonstrate that a fine-tuned FoldFlow-2 makes progress on challenging conditional design tasks such as designing scaffolds for the VHH nanobody.

Submitted 30 May, 2024; originally announced May 2024.

Comments: preprint

arXiv:2405.19221 [pdf]

Domain adaptation in small-scale and heterogeneous biological datasets

Authors: Seyedmehdi Orouji, Martin C. Liu, Tal Korem, Megan A. K. Peters

Abstract: Machine learning techniques are steadily becoming more important in modern biology, and are used to build predictive models, discover patterns, and investigate biological problems. However, models trained on one dataset are often not generalizable to other datasets from different cohorts or laboratories, due to differences in the statistical properties of these datasets. These could stem from technical differences, such as the measurement technique used, or from relevant biological differences between the populations studied. Domain adaptation, a type of transfer learning, can alleviate this problem by aligning the statistical distributions of features and samples among different datasets so that similar models can be applied across them. However, a majority of state-of-the-art domain adaptation methods are designed to work with large-scale data, mostly text and images, while biological datasets often suffer from small sample sizes, and possess complexities such as heterogeneity of the feature space. This Review aims to synthetically discuss domain adaptation methods in the context of small-scale and highly heterogeneous biological data. We describe the benefits and challenges of domain adaptation in biological research and critically discuss some of its objectives, strengths, and weaknesses through key representative methodologies. We argue for the incorporation of domain adaptation techniques to the computational biologist's toolkit, with further development of customized approaches.

Submitted 29 May, 2024; originally announced May 2024.

Comments: main manuscript + supplement

arXiv:2405.01616 [pdf, other]

Generative Active Learning for the Search of Small-molecule Protein Binders

Authors: Maksym Korablyov, Cheng-Hao Liu, Moksh Jain, Almer M. van der Sloot, Eric Jolicoeur, Edward Ruediger, Andrei Cristian Nica, Emmanuel Bengio, Kostiantyn Lapchevskyi, Daniel St-Cyr, Doris Alexandra Schuetz, Victor Ion Butoi, Jarrid Rector-Brooks, Simon Blackburn, Leo Feng, Hadi Nekoei, SaiKrishna Gottipati, Priyesh Vijayan, Prateek Gupta, Ladislav Rampášek, Sasikanth Avancha, Pierre-Luc Bacon, William L. Hamilton, Brooks Paige, Sanchit Misra, et al. (9 additional authors not shown)

Abstract: Despite substantial progress in machine learning for scientific discovery in recent years, truly de novo design of small molecules which exhibit a property of interest remains a significant challenge. We introduce LambdaZero, a generative active learning approach to search for synthesizable molecules. Powered by deep reinforcement learning, LambdaZero learns to search over the vast space of molecules to discover candidates with a desired property. We apply LambdaZero with molecular docking to design novel small molecules that inhibit the enzyme soluble Epoxide Hydrolase 2 (sEH), while enforcing constraints on synthesizability and drug-likeliness. LambdaZero provides an exponential speedup in terms of the number of calls to the expensive molecular docking oracle, and LambdaZero de novo designed molecules reach docking scores that would otherwise require the virtual screening of a hundred billion molecules. Importantly, LambdaZero discovers novel scaffolds of synthesizable, drug-like inhibitors for sEH. In in vitro experimental validation, a series of ligands from a generated quinazoline-based scaffold were synthesized, and the lead inhibitor N-(4,6-di(pyrrolidin-1-yl)quinazolin-2-yl)-N-methylbenzamide (UM0152893) displayed sub-micromolar enzyme inhibition of sEH.

Submitted 2 May, 2024; originally announced May 2024.

arXiv:2404.05553 [pdf, other]

Alljoined1 -- A dataset for EEG-to-Image decoding

Authors: Jonathan Xu, Bruno Aristimunha, Max Emanuel Feucht, Emma Qian, Charles Liu, Tazik Shahjahan, Martyna Spyra, Steven Zifan Zhang, Nicholas Short, Jioh Kim, Paula Perdomo, Ricky Renfeng Mao, Yashvir Sabharwal, Michael Ahedor Moaz Shoura, Adrian Nestor

Abstract: We present Alljoined1, a dataset built specifically for EEG-to-Image decoding. Recognizing that an extensive and unbiased sampling of neural responses to visual stimuli is crucial for image reconstruction efforts, we collected data from 8 participants looking at 10,000 natural images each. We have currently gathered 46,080 epochs of brain responses recorded with a 64-channel EEG headset. The dataset combines response-based stimulus timing, repetition between blocks and sessions, and diverse image classes with the goal of improving signal quality. For transparency, we also provide data quality scores. We publicly release the dataset and all code at https://linktr.ee/alljoined1.

Submitted 14 May, 2024; v1 submitted 8 April, 2024; originally announced April 2024.

Comments: 8 Pages, 6 Figures

ACM Class: I.5.1; I.6.3; I.2.6; K.3.2

arXiv:2403.14801 [pdf]

Assessing the Utility of Large Language Models for Phenotype-Driven Gene Prioritization in Rare Genetic Disorder Diagnosis

Authors: Junyoung Kim, Jingye Yang, Kai Wang, Chunhua Weng, Cong Liu

Abstract: Phenotype-driven gene prioritization is a critical process in the diagnosis of rare genetic disorders for identifying and ranking potential disease-causing genes based on observed physical traits or phenotypes. While traditional approaches rely on curated knowledge graphs with phenotype-gene relations, recent advancements in large language models have opened doors to the potential of AI predictions through extensive training on diverse corpora and complex models. This study conducted a comprehensive evaluation of five large language models, including two from the Generative Pre-trained Transformer series and three from the Llama2 series, assessing their performance across three key metrics: task completeness, gene prediction accuracy, and adherence to required output structures. Various experiments explored combinations of models, prompts, input types, and task difficulty levels. Our findings reveal that even the best-performing LLM, GPT-4, achieved an accuracy of 16.0%, which still lags behind traditional bioinformatics tools. Prediction accuracy increased with the parameter/model size. A similar increasing trend was observed for the task completion rate, with complicated prompts more likely to increase task completeness in models smaller than GPT-4. However, complicated prompts are more likely to decrease the structure compliance rate, although no prompt effects were observed on GPT-4. Compared to HPO term-based input, LLMs were also able to achieve better than random prediction accuracy by taking free-text input, but slightly lower than with the HPO input. Bias analysis showed that certain genes, such as MECP2, CDKL5, and SCN1A, are more likely to be top-ranked, potentially explaining the variances observed across different datasets. This study provides valuable insights into the integration of LLMs within genomic analysis, contributing to the ongoing discussion on the utilization of advanced LLMs in clinical workflows.

Submitted 2 April, 2024; v1 submitted 21 March, 2024; originally announced March 2024.

Comments: 56 pages, 6 figures, 6 tables, 2 supplementary tables

arXiv:2403.09560 [pdf, other]

Self-Consistency Training for Density-Functional-Theory Hamiltonian Prediction

Authors: He Zhang, Chang Liu, Zun Wang, Xinran Wei, Siyuan Liu, Nanning Zheng, Bin Shao, Tie-Yan Liu

Abstract: Predicting the mean-field Hamiltonian matrix in density functional theory is a fundamental formulation to leverage machine learning for solving molecular science problems. Yet, its applicability is limited by insufficient labeled data for training. In this work, we highlight that Hamiltonian prediction possesses a self-consistency principle, based on which we propose self-consistency training, an exact training method that does not require labeled data. It distinguishes the task from predicting other molecular properties by the following benefits: (1) it enables the model to be trained on a large amount of unlabeled data, hence addresses the data scarcity challenge and enhances generalization; (2) it is more efficient than running DFT to generate labels for supervised training, since it amortizes DFT calculation over a set of queries. We empirically demonstrate the better generalization in data-scarce and out-of-distribution scenarios, and the better efficiency over DFT labeling. These benefits push forward the applicability of Hamiltonian prediction to an ever-larger scale.

Submitted 5 June, 2024; v1 submitted 14 March, 2024; originally announced March 2024.

Comments: Accepted by ICML 2024

arXiv:2402.11776 [pdf, other]

Early feasibility of an embedded bi-directional brain-computer interface for ambulation

Authors: Jeffrey Lim, Po T. Wang, Wonjoon Sohn, Claudia Serrano-Amenos, Mina Ibrahim, Derrick Lin, Shravan Thaploo, Susan J. Shaw, Michelle Armacost, Hui Gong, Brian Lee, Darrin Lee, Richard A. Andersen, Payam Heydari, Charles Y. Liu, Zoran Nenadic, An H. Do

Abstract: Current treatments for paraplegia induced by spinal cord injury (SCI) are often limited by the severity of the injury. The accompanying loss of sensory and motor functions often results in reliance on wheelchairs, which in turn causes reduced quality of life and increased risk of co-morbidities. While brain-computer interfaces (BCIs) for ambulation have shown promise in restoring or replacing lower extremity motor functions, none so far have simultaneously implemented sensory feedback functions. Additionally, many existing BCIs for ambulation rely on bulky external hardware that makes them ill-suited for non-research settings. Here, we present an embedded bi-directional BCI (BDBCI) that restores motor function by enabling neural control over a robotic gait exoskeleton (RGE) and delivers sensory feedback via direct cortical electrical stimulation (DCES) in response to RGE leg swing. A first demonstration with this system was performed with a single subject implanted with electrocorticography electrodes, achieving an average lag-optimized cross-correlation of 0.80$\pm$0.08 between cues and decoded states over 5 runs.

Submitted 18 February, 2024; originally announced February 2024.

Comments: 5 pages, 6 figures, two tables, also submitted to IEEE EMBC 2024 conference

MSC Class: 92C55

arXiv:2402.06191 [pdf, other]

The Berkeley Single Cell Computational Microscopy (BSCCM) Dataset

Authors: Henry Pinkard, Cherry Liu, Fanice Nyatigo, Daniel A. Fletcher, Laura Waller

Abstract: Computational microscopy, in which hardware and algorithms of an imaging system are jointly designed, shows promise for making imaging systems that cost less, perform more robustly, and collect new types of information. Often, the performance of computational imaging systems, especially those that incorporate machine learning, is sample-dependent. Thus, standardized datasets are an essential tool for comparing the performance of different approaches. Here, we introduce the Berkeley Single Cell Computational Microscopy (BSCCM) dataset, which contains over ~12,000,000 images of 400,000 individual white blood cells. The dataset contains images captured with multiple illumination patterns on an LED array microscope and fluorescent measurements of the abundance of surface proteins that mark different cell types. We hope this dataset will provide a valuable resource for the development and testing of new algorithms in computational microscopy and computer vision with practical biomedical applications.

Submitted 9 February, 2024; originally announced February 2024.

arXiv:2401.12616 [pdf]

The stability and instability of the language control network: a longitudinal resting-state functional magnetic resonance imaging study

Authors: Zilong Li, Cong Liu, Xin Pan, Guosheng Ding, Ruiming Wang

Abstract: The language control network is vital among language-related networks responsible for solving the problem of multiple language switching. Researchers have expressed concerns about the instability of the language control network when exposed to external influences (e.g., long-term second language learning). However, some studies have suggested that the language control network is stable. Therefore, whether the language control network is stable or not remains unclear. In the present study, we directly evaluated the stability and instability of the language control network using resting-state functional magnetic resonance imaging (rs-fMRI). We employed cohorts of Chinese first-year college students majoring in English who underwent second language (L2) acquisition courses at a university and those who did not. Two resting-state fMRI scans were acquired approximately 1 year apart. We found that the language control network was both moderately stable and unstable. We further investigated the morphological coexistence patterns of stability and instability within the language control network. First, we extracted connections representing stability and plasticity from the entire network. We then evaluated whether the coexistence patterns were modular (stability and instability involve different brain regions) or non-modular (stability and plasticity involve the same brain regions but have unique connectivity patterns). We found that both stability and instability coexisted in a non-modular pattern. Compared with the non-English major group, the English major group has a more non-modular coexistence pattern. These findings provide preliminary evidence of the coexistence of stability and instability in the language control network.

Submitted 7 March, 2024; v1 submitted 23 January, 2024; originally announced January 2024.

arXiv:2401.12498 [pdf, other]

Understanding Cellular Noise with Optical Perturbation and Deep Learning

Authors: Chuanbo Liu, Yu Fu, Lu Lin, Elliot L. Elson, Jin Wang

Abstract: Noise plays a crucial role in the regulation of cellular and organismal function and behavior. Exploring noise's impact is key to understanding fundamental biological processes, such as gene expression, signal transduction, and the mechanisms of development and evolution. Currently, a comprehensive method to quantify dynamical behavior of cellular noise within these biochemical systems is lacking. In this study, we introduce an optically-controlled perturbation system utilizing the light-sensitive Phytochrome B (PhyB) from \textit{Arabidopsis thaliana}, which enables precise noise modulation with high spatial-temporal resolution. Our system exhibits exceptional sensitivity to light, reacting consistently to pulsed light signals, distinguishing it from other photoreceptor-based promoter systems that respond to a single light wavelength. To characterize our system, we developed a stochastic model for phytochromes that accounts for photoactivation/deactivation, thermal reversion, and the dynamics of the light-activated gene promoter system. To precisely control our system, we determined the rate constants for this model using an omniscient deep neural network that can directly map rate constant combinations to time-dependent state joint distributions. By adjusting the activation rates through light intensity and degradation rates via N-terminal mutagenesis, we illustrate that our optically-controlled perturbation can effectively modulate molecular expression level as well as noise. Our results highlight the potential of employing an optically-controlled gene perturbation system as a noise-controlled stimulus source. This approach, when combined with the analytical capabilities of a sophisticated deep neural network, enables the accurate estimation of rate constants from observational data in a broad range of biochemical reaction networks.

Submitted 23 January, 2024; originally announced January 2024.

Comments: 33 pages, 4 figures

arXiv:2401.11360 [pdf]

PepHarmony: A Multi-View Contrastive Learning Framework for Integrated Sequence and Structure-Based Peptide Encoding

Authors: Ruochi Zhang, Haoran Wu, Chang Liu, Huaping Li, Yuqian Wu, Kewei Li, Yifan Wang, Yifan Deng, Jiahui Chen, Fengfeng Zhou, Xin Gao

Abstract: Recent advances in protein language models have catalyzed significant progress in peptide sequence representation. Despite extensive exploration in this field, pre-trained models tailored for peptide-specific needs remain largely unaddressed due to the difficulty in capturing the complex and sometimes unstable structures of peptides. This study introduces a novel multi-view contrastive learning framework PepHarmony for the sequence-based peptide encoding task. PepHarmony innovatively combines both sequence- and structure-level information into a sequence-level encoding module through contrastive learning. We carefully select datasets from the Protein Data Bank (PDB) and AlphaFold database to encompass a broad spectrum of peptide sequences and structures. The experimental data highlights PepHarmony's exceptional capability in capturing the intricate relationship between peptide sequences and structures compared with the baseline and fine-tuned models. The robustness of our model is confirmed through extensive ablation studies, which emphasize the crucial roles of contrastive loss and strategic data sorting in enhancing predictive performance. The proposed PepHarmony framework serves as a notable contribution to peptide representations, and offers valuable insights for future applications in peptide drug discovery and peptide engineering. We have made all the source code utilized in this study publicly accessible via GitHub at https://github.com/zhangruochi/PepHarmony or http://www.healthinformaticslab.org/supp/.

Submitted 20 January, 2024; originally announced January 2024.

Comments: 25 pages, 5 figures, 3 tables

arXiv:2401.06823 [pdf, other]

Interpretable deep learning in single-cell omics

Authors: Manoj M Wagle, Siqu Long, Carissa Chen, Chunlei Liu, Pengyi Yang

Abstract: Recent developments in single-cell omics technologies have enabled the quantification of molecular profiles in individual cells at an unparalleled resolution. Deep learning, a rapidly evolving sub-field of machine learning, has instilled a significant interest in single-cell omics research due to its remarkable success in analysing heterogeneous high-dimensional single-cell omics data. Nevertheless, the inherent multi-layer nonlinear architecture of deep learning models often makes them 'black boxes' as the reasoning behind predictions is often unknown and not transparent to the user. This has stimulated an increasing body of research for addressing the lack of interpretability in deep learning models, especially in single-cell omics data analyses, where the identification and understanding of molecular regulators are crucial for interpreting model predictions and directing downstream experimental validations. In this work, we introduce the basics of single-cell omics technologies and the concept of interpretable deep learning. This is followed by a review of the recent interpretable deep learning models applied to various single-cell omics research. Lastly, we highlight the current limitations and discuss potential future directions. We anticipate this review to bring together the single-cell and machine learning research communities to foster future development and application of interpretable deep learning in single-cell omics research.

Submitted 11 January, 2024; originally announced January 2024.

arXiv:2401.06199 [pdf, other]

xTrimoPGLM: Unified 100B-Scale Pre-trained Transformer for Deciphering the Language of Protein

Authors: Bo Chen, Xingyi Cheng, Pan Li, Yangli-ao Geng, Jing Gong, Shen Li, Zhilei Bei, Xu Tan, Boyan Wang, Xin Zeng, Chiming Liu, Aohan Zeng, Yuxiao Dong, Jie Tang, Le Song

Abstract: Protein language models have shown remarkable success in learning biological information from protein sequences. However, most existing models are limited by either autoencoding or autoregressive pre-training objectives, which makes them struggle to handle protein understanding and generation tasks concurrently. We propose a unified protein language model, xTrimoPGLM, to address these two types of tasks simultaneously through an innovative pre-training framework. Our key technical contribution is an exploration of the compatibility and the potential for joint optimization of the two types of objectives, which has led to a strategy for training xTrimoPGLM at an unprecedented scale of 100 billion parameters and 1 trillion training tokens. Our extensive experiments reveal that 1) xTrimoPGLM significantly outperforms other advanced baselines in 18 protein understanding benchmarks across four categories. The model also facilitates an atomic-resolution view of protein structures, leading to an advanced 3D structural prediction model that surpasses existing language model-based tools. 2) xTrimoPGLM not only can generate de novo protein sequences following the principles of natural ones, but also can perform programmable generation after supervised fine-tuning (SFT) on curated sequences. These results highlight the substantial capability and versatility of xTrimoPGLM in understanding and generating protein sequences, contributing to the evolving landscape of foundation models in protein science.

Submitted 11 January, 2024; originally announced January 2024.

arXiv:2312.15320 [pdf]

GestaltMML: Enhancing Rare Genetic Disease Diagnosis through Multimodal Machine Learning Combining Facial Images and Clinical Texts

Authors: Da Wu, Jingye Yang, Cong Liu, Tzung-Chien Hsieh, Elaine Marchi, Justin Blair, Peter Krawitz, Chunhua Weng, Wendy Chung, Gholson J. Lyon, Ian D. Krantz, Jennifer M. Kalish, Kai Wang

Abstract: Individuals with suspected rare genetic disorders often undergo multiple clinical evaluations, imaging studies, laboratory tests and genetic tests, to find a possible answer over a prolonged period of time. Addressing this "diagnostic odyssey" thus has substantial clinical, psychosocial, and economic benefits. Many rare genetic diseases have distinctive facial features, which can be used by artificial intelligence algorithms to facilitate clinical diagnosis, in prioritizing candidate diseases to be further examined by lab tests or genetic assays, or in helping the phenotype-driven reinterpretation of genome/exome sequencing data. Existing methods using frontal facial photos were built on conventional Convolutional Neural Networks (CNNs), rely exclusively on facial images, and cannot capture non-facial phenotypic traits and demographic information essential for guiding accurate diagnoses. Here we introduce GestaltMML, a multimodal machine learning (MML) approach solely based on the Transformer architecture. It integrates facial images, demographic information (age, sex, ethnicity), and clinical notes (optionally, a list of Human Phenotype Ontology terms) to improve prediction accuracy. Furthermore, we also evaluated GestaltMML on a diverse range of datasets, including 528 diseases from the GestaltMatcher Database, several in-house datasets of Beckwith-Wiedemann syndrome (BWS, over-growth syndrome with distinct facial features), Sotos syndrome (overgrowth syndrome with overlapping features with BWS), NAA10-related neurodevelopmental syndrome, Cornelia de Lange syndrome (multiple malformation syndrome), and KBG syndrome (multiple malformation syndrome). Our results suggest that GestaltMML effectively incorporates multiple modalities of data, greatly narrowing candidate genetic diagnoses of rare diseases and may facilitate the reinterpretation of genome/exome sequencing data.

Submitted 21 April, 2024; v1 submitted 23 December, 2023; originally announced December 2023.

Comments: Significant revisions

arXiv:2311.15156 [pdf, other]

xTrimoGene: An Efficient and Scalable Representation Learner for Single-Cell RNA-Seq Data

Authors: Jing Gong, Minsheng Hao, Xingyi Cheng, Xin Zeng, Chiming Liu, Jianzhu Ma, Xuegong Zhang, Taifeng Wang, Le Song

Abstract: Advances in high-throughput sequencing technology have led to significant progress in measuring gene expressions at the single-cell level. The amount of publicly available single-cell RNA-seq (scRNA-seq) data is already surpassing 50M records for humans with each record measuring 20,000 genes. This highlights the need for unsupervised representation learning to fully ingest these data, yet classical transformer architectures are prohibitive to train on such data in terms of both computation and memory. To address this challenge, we propose a novel asymmetric encoder-decoder transformer for scRNA-seq data, called xTrimoGene$^α$ (or xTrimoGene for short), which leverages the sparse characteristic of the data to scale up the pre-training. This scalable design of xTrimoGene reduces FLOPs by one to two orders of magnitude compared to classical transformers while maintaining high accuracy, enabling us to train the largest transformer models over the largest scRNA-seq dataset today. Our experiments also show that the performance of xTrimoGene improves as we scale up the model sizes, and it also leads to SOTA performance over various downstream tasks, such as cell type annotation, perturb-seq effect prediction, and drug combination prediction. xTrimoGene model is now available for use as a service via the following link: https://api.biomap.com/xTrimoGene/apply.

Submitted 24 February, 2024; v1 submitted 25 November, 2023; originally announced November 2023.

Comments: Accepted by NeurIPS 2023

arXiv:2311.02594 [pdf, other]

scBeacon: single-cell biomarker extraction via identifying paired cell clusters across biological conditions with contrastive siamese networks

Authors: Chenyu Liu, Yong Jin Kweon, Jun Ding

Abstract: Despite the breakthroughs in biomarker discovery facilitated by differential gene analysis, challenges remain, particularly at the single-cell level. Traditional methodologies heavily rely on user-supplied cell annotations, focusing on individually expressed data, often neglecting the critical interactions between biological conditions, such as healthy versus diseased states. In response, here we introduce scBeacon, an innovative framework built upon a deep contrastive siamese network. scBeacon pioneers an unsupervised approach, adeptly identifying matched cell populations across varied conditions, enabling a refined differential gene analysis. By utilizing a VQ-VAE framework, a contrastive siamese network, and a greedy iterative strategy, scBeacon effectively pinpoints differential genes that hold potential as key biomarkers. Comprehensive evaluations on a diverse array of datasets validate scBeacon's superiority over existing single-cell differential gene analysis tools. Its precision and adaptability underscore its significant role in enhancing diagnostic accuracy in biomarker discovery. With the emphasis on the importance of biomarkers in diagnosis, scBeacon is positioned to be a pivotal asset in the evolution of personalized medicine and targeted treatments.

Submitted 27 December, 2023; v1 submitted 5 November, 2023; originally announced November 2023.

arXiv:2308.06294 [pdf]

Enhancing Phenotype Recognition in Clinical Notes Using Large Language Models: PhenoBCBERT and PhenoGPT

Authors: Jingye Yang, Cong Liu, Wendy Deng, Da Wu, Chunhua Weng, Yunyun Zhou, Kai Wang

Abstract: We hypothesize that large language models (LLMs) based on the transformer architecture can enable automated detection of clinical phenotype terms, including terms not documented in the HPO. In this study, we developed two types of models: PhenoBCBERT, a BERT-based model, utilizing Bio+Clinical BERT as its pre-trained model, and PhenoGPT, a GPT-based model that can be initialized from diverse GPT models, including open-source versions such as GPT-J, Falcon, and LLaMA, as well as closed-source versions such as GPT-3 and GPT-3.5. We compared our methods with PhenoTagger, a recently developed HPO recognition tool that combines rule-based and deep learning methods. We found that our methods can extract more phenotype concepts, including novel ones not characterized by HPO. We also performed case studies on biomedical literature to illustrate how new phenotype information can be recognized and extracted. We compared current BERT-based versus GPT-based models for phenotype tagging, in multiple aspects including model architecture, memory usage, speed, accuracy, and privacy protection. We also discussed the addition of a negation step and an HPO normalization layer to the transformer models for improved HPO term tagging. In conclusion, PhenoBCBERT and PhenoGPT enable the automated discovery of phenotype terms from clinical notes and biomedical literature, facilitating automated downstream tasks to derive new biological insights on human diseases.

Submitted 9 November, 2023; v1 submitted 10 August, 2023; originally announced August 2023.

arXiv:2307.14907 [pdf, other]

Weakly Supervised AI for Efficient Analysis of 3D Pathology Samples

Authors: Andrew H. Song, Mane Williams, Drew F. K. Williamson, Guillaume Jaume, Andrew Zhang, Bowen Chen, Robert Serafin, Jonathan T. C. Liu, Alex Baras, Anil V. Parwani, Faisal Mahmood

Abstract: Human tissue and its constituent cells form a microenvironment that is fundamentally three-dimensional (3D). However, the standard-of-care in pathologic diagnosis involves selecting a few two-dimensional (2D) sections for microscopic evaluation, risking sampling bias and misdiagnosis. Diverse methods for capturing 3D tissue morphologies have been developed, but they have yet had little translation to clinical practice; manual and computational evaluations of such large 3D data have so far been impractical and/or unable to provide patient-level clinical insights. Here we present Modality-Agnostic Multiple instance learning for volumetric Block Analysis (MAMBA), a deep-learning-based platform for processing 3D tissue images from diverse imaging modalities and predicting patient outcomes. Archived prostate cancer specimens were imaged with open-top light-sheet microscopy or microcomputed tomography and the resulting 3D datasets were used to train risk-stratification networks based on 5-year biochemical recurrence outcomes via MAMBA. With the 3D block-based approach, MAMBA achieves an area under the receiver operating characteristic curve (AUC) of 0.86 and 0.74, superior to 2D traditional single-slice-based prognostication (AUC of 0.79 and 0.57), suggesting superior prognostication with 3D morphological features. Further analyses reveal that the incorporation of greater tissue volume improves prognostic performance and mitigates risk prediction variability from sampling bias, suggesting the value of capturing larger extents of heterogeneous 3D morphology. With the rapid growth and adoption of 3D spatial biology and pathology techniques by researchers and clinicians, MAMBA provides a general and efficient framework for 3D weakly supervised learning for clinical decision support and can help to reveal novel 3D morphological biomarkers for prognosis and therapeutic response.

Submitted 27 July, 2023; originally announced July 2023.

arXiv:2306.11715 [pdf, other]

Multi-Fidelity Active Learning with GFlowNets

Authors: Alex Hernandez-Garcia, Nikita Saxena, Moksh Jain, Cheng-Hao Liu, Yoshua Bengio

Abstract: In the last decades, the capacity to generate large amounts of data in science and engineering applications has been growing steadily. Meanwhile, the progress in machine learning has turned it into a suitable tool to process and utilise the available data. Nonetheless, many relevant scientific and engineering problems present challenges where current machine learning methods cannot yet efficiently leverage the available data and resources. For example, in scientific discovery, we are often faced with the problem of exploring very large, high-dimensional spaces, where querying a high fidelity, black-box objective function is very expensive. Progress in machine learning methods that can efficiently tackle such problems would help accelerate currently crucial areas such as drug and materials discovery. In this paper, we propose the use of GFlowNets for multi-fidelity active learning, where multiple approximations of the black-box function are available at lower fidelity and cost. GFlowNets are recently proposed methods for amortised probabilistic inference that have proven efficient for exploring large, high-dimensional spaces and can hence be practical in the multi-fidelity setting too. Here, we describe our algorithm for multi-fidelity active learning with GFlowNets and evaluate its performance in both well-studied synthetic tasks and practically relevant applications of molecular discovery. Our results show that multi-fidelity active learning with GFlowNets can efficiently leverage the availability of multiple oracles with different costs and fidelities to accelerate scientific discovery and engineering design.

Submitted 20 June, 2023; originally announced June 2023.

Comments: Code: https://github.com/nikita-0209/mf-al-gfn

arXiv:2306.05445 [pdf, other]

Towards Predicting Equilibrium Distributions for Molecular Systems with Deep Learning

Authors: Shuxin Zheng, Jiyan He, Chang Liu, Yu Shi, Ziheng Lu, Weitao Feng, Fusong Ju, Jiaxi Wang, Jianwei Zhu, Yaosen Min, He Zhang, Shidi Tang, Hongxia Hao, Peiran Jin, Chi Chen, Frank Noé, Haiguang Liu, Tie-Yan Liu

Abstract: Advances in deep learning have greatly improved structure prediction of molecules. However, many macroscopic observations that are important for real-world applications are not functions of a single molecular structure, but rather determined from the equilibrium distribution of structures. Traditional methods for obtaining these distributions, such as molecular dynamics simulation, are computationally expensive and often intractable. In this paper, we introduce a novel deep learning framework, called Distributional Graphormer (DiG), in an attempt to predict the equilibrium distribution of molecular systems. Inspired by the annealing process in thermodynamics, DiG employs deep neural networks to transform a simple distribution towards the equilibrium distribution, conditioned on a descriptor of a molecular system, such as a chemical graph or a protein sequence. This framework enables efficient generation of diverse conformations and provides estimations of state densities. We demonstrate the performance of DiG on several molecular tasks, including protein conformation sampling, ligand structure sampling, catalyst-adsorbate sampling, and property-guided structure generation. DiG presents a significant advancement in methodology for statistically understanding molecular systems, opening up new research opportunities in molecular science.

Submitted 8 June, 2023; originally announced June 2023.

Comments: 80 pages, 11 figures

arXiv:2303.12311 [pdf, other]

Frozen Language Model Helps ECG Zero-Shot Learning

Authors: Jun Li, Che Liu, Sibo Cheng, Rossella Arcucci, Shenda Hong

Abstract: The electrocardiogram (ECG) is one of the most commonly used non-invasive, convenient medical monitoring tools that assist in the clinical diagnosis of heart diseases. Recently, deep learning (DL) techniques, particularly self-supervised learning (SSL), have demonstrated great potential in the classification of ECG. SSL pre-training has achieved competitive performance with only a small amount of annotated data after fine-tuning. However, current SSL methods rely on the availability of annotated data and are unable to predict labels not existing in fine-tuning datasets. To address this challenge, we propose Multimodal ECG-Text Self-supervised pre-training (METS), the first work to utilize auto-generated clinical reports to guide ECG SSL pre-training. We use a trainable ECG encoder and a frozen language model to embed paired ECG and automatically machine-generated clinical reports separately. The SSL aims to maximize the similarity between the paired ECG and auto-generated report while minimizing the similarity between the ECG and other reports. In downstream classification tasks, METS achieves around 10% improvement in performance without using any annotated data via zero-shot classification, compared to other supervised and SSL baselines that rely on annotated data. Furthermore, METS achieves the highest recall and F1 scores on the MIT-BIH dataset, despite MIT-BIH containing different classes of ECG compared to the pre-trained dataset. The extensive experiments have demonstrated the advantages of using ECG-Text multimodal self-supervised learning in terms of generalizability, effectiveness, and efficiency.

Submitted 22 March, 2023; originally announced March 2023.

arXiv:2211.14480 [pdf]

Machine Learning Algorithms for Predicting in-Hospital Mortality in Patients with ST-Segment Elevation Myocardial Infar

Authors: Ding Tao, Chen Liu, Shihan Wan

Abstract: Acute myocardial infarction (AMI) is one of the most severe manifestations of coronary artery disease. ST-segment elevation myocardial infarction (STEMI) is the most serious type of AMI. We proposed to develop a machine learning algorithm based on the home page of electronic medical record (HPEMR) for predicting in-hospital mortality of patients with STEMI in the early stage. Methods: This observational study applied clinical information collected between 2013 and 2017 from 7 tertiary hospitals in Shenzhen, China. The patients' STEMI data were used to train 4 different machine learning algorithms to predict in-hospital mortality among the patients with STEMI, including Logistic Regression, Support Vector Machine, Gradient Boosting Decision Tree, and Artificial Neural Network. Results: A total of 5865 patients with STEMI were enrolled in our study. The model was developed by considering 3 types of variables, which included demographic data, diagnosis and comorbidities, and hospitalization information based on HPEMR. The association of selected features using univariate logistic regression was reported. Specifically, for the comorbidities, atrial fibrillation (OR: 11.0; 95% CI: 5.64 - 20.2), acute renal failure (OR: 9.75; 95% CI: 3.81 - 25.0), type 2 diabetic nephropathy (OR: 5.45; 95% CI: 1.57 - 19.0), acute heart failure (OR: 6.05; 95% CI: 1.99 - 14.9), and cardiac function grade IV (OR: 28.6; 95% CI: 20.6 - 39.6) were found to be associated with high odds of death. Within the test dataset, our model showed a good discrimination ability as measured by area under the receiver operating characteristic curve (AUC; 0.879) (95% CI: 0.825 - 0.933).

Submitted 25 November, 2022; originally announced November 2022.

Comments: 20 pages, 3 figures, 3 tables

arXiv:2211.07294 [pdf, other]

A universal DNA computing model for solving NP-hard subset problems

Authors: Enqiang Zhu, Xianhang Luo, Chanjuan Liu, Xiaolong Shi, Jin Xu

Abstract: DNA computing, a nontraditional computing mechanism, provides a feasible and effective method for solving NP-hard problems because of the vast parallelism and high-density storage of DNA molecules. Although DNA computing has been exploited to solve various intractable computational problems, such as the Hamiltonian path problem, SAT problem, and graph coloring problem, there has been little discussion of designing universal DNA computing-based models, which can solve a class of problems. In this paper, by leveraging the dynamic and enzyme-free properties of DNA strand displacement, we propose a universal model named DCMSubset for solving subset problems in graph theory. The model aims to find a minimum (or maximum) set satisfying given constraints. For each element x involved in a given problem, DCMSubset uses an exclusive single-stranded DNA molecule to model x as well as a specific DNA complex to model the relationship between x and other elements. Based on the proposed model, we conducted simulation and biochemical experiments on three kinds of subset problems, a minimum dominating set, maximum independent set, and minimum vertex cover. We observed that DCMSubset can also be used to solve the graph coloring problem. Moreover, we extended DCMSubset to a model for solving the SAT problem. The results of experiments showed the feasibility and universality of the proposed method. Our results highlighted the potential for DNA strand displacement to act as a computation tool to solve NP-hard problems.

Submitted 15 November, 2022; v1 submitted 14 November, 2022; originally announced November 2022.

arXiv:2210.16993 [pdf, other]

STN: a new tensor network method to identify stimulus category from brain activity pattern

Authors: Chunyu Liu, Jiacai Zhang

Abstract: Neural decoding is still a challenge and hot topic in neurocomputing science. Recently, many studies have shown that brain network patterns contain rich spatial and temporal structure information, which represents the activation information of the brain under external stimuli. The traditional method extracts brain network features directly with a common machine learning method, then puts these features into the classifier to decode external stimuli. However, this method cannot effectively extract the multi-dimensional structural information that is hidden in the brain network. Tensor researchers have shown that the tensor decomposition model can fully mine unique spatio-temporal structure characteristics in multi-dimensional structure data. This research proposed a stimulus constrained tensor brain model (STN), which involves the tensor decomposition idea and stimulus category constraint information. The model was verified on real neuroimaging data sets (MEG and fMRI). The experimental results show that the STN model achieves more than 11.06% and 18.46% on accuracy matrix compared with other methods on two modal data sets. These results imply the superiority of the STN model in extracting discriminative characteristics, especially for decoding object stimuli with semantic information.

Submitted 22 November, 2022; v1 submitted 30 October, 2022; originally announced October 2022.

Comments: 12 pages

Report number: EFI-94-11

arXiv:2208.09559 [pdf]

Neural network facilitated ab initio derivation of linear formula: A case study on formulating the relationship between DNA motifs and gene expression

Authors: Chengyu Liu, Wei Wang

Abstract: Developing models with high interpretability and even deriving formulas to quantify relationships between biological data is an emerging need. We propose here a framework for ab initio derivation of sequence motifs and linear formula using a new approach based on the interpretable neural network model called contextual regression model. We showed that this linear model could predict gene expression levels using promoter sequences with a performance comparable to deep neural network models. We uncovered a list of 300 motifs with important regulatory roles on gene expression and showed that they also had significant contributions to cell-type specific gene expression in 154 diverse cell types. This work illustrates the possibility of deriving formulas to represent biology laws that may not be easily elucidated. (https://github.com/Wang-lab-UCSD/Motif_Finding_Contextual_Regression)

Submitted 19 August, 2022; originally announced August 2022.

arXiv:2208.05228 [pdf]

Current and perspective sensing methods for monkeypox virus: a reemerging zoonosis in its infancy

Authors: Ijaz Gul, Changyue Liu, Yuan Xi, Zhicheng Du, Shiyao Zhai, Zhengyang Lei, Chen Qun, Muhammad Akmal Raheem, Qian He, Zhang Haihui, Canyang Zhang, Runming Wang, Sanyang Han, Du Ke, Peiwu Qin

Abstract: Objectives: The review is dedicated to evaluate the current monkeypox virus (MPXV) detection methods, discuss their pros and cons, and provide recommended solutions to the problems. Methods: The literature for this review is identified through searches in PubMed, Web of Science, Google Scholar, ResearchGate, and Science Direct advanced search for articles published in English without any start date until June, 2022, by use of the terms "monkeypox virus" or "poxvirus" along with "diagnosis"; "PCR"; "real-time PCR"; "LAMP"; "RPA"; "immunoassay"; "reemergence"; "biothreat"; "endemic", and "multi-country outbreak" and also, by tracking citations of the relevant papers. The most relevant articles are included in the review. Results: Our literature review shows that PCR is the gold standard method for MPXV detection. In addition, loop-mediated isothermal amplification (LAMP) and recombinase polymerase amplification (RPA) have been reported as alternatives to PCR. Immunodiagnostics, whole particle detection, and image-based detection are the non-nucleic acid-based MPXV detection modalities. Conclusions: PCR is easy to leverage and adapt for a quick response to an outbreak, but the PCR-based MPXV detection approaches may not be suitable for marginalized settings. Limited progress has been made towards innovations in MPXV diagnostics, providing room for the development of novel detection techniques for this virus.

Submitted 10 August, 2022; originally announced August 2022.

Comments: 36 pages, 5 figures, 1 table

arXiv:2204.01847 [pdf, other]

Bayesian Sequential Stacking Algorithm for Concurrently Designing Molecules and Synthetic Reaction Networks

Authors: Qi Zhang, Chang Liu, Stephen Wu, Ryo Yoshida

Abstract: In the last few years, de novo molecular design using machine learning has made great technical progress but its practical deployment has not been as successful. This is mostly owing to the cost and technical difficulty of synthesizing such computationally designed molecules. To overcome such barriers, various methods for synthetic route design using deep neural networks have been studied intensively in recent years. However, little progress has been made in designing molecules and their synthetic routes simultaneously. Here, we formulate the problem of simultaneously designing molecules with the desired set of properties and their synthetic routes within the framework of Bayesian inference. The design variables consist of a set of reactants in a reaction network and its network topology. The design space is extremely large because it consists of all combinations of purchasable reactants, often in the order of millions or more. In addition, the designed reaction networks can adopt any topology beyond simple multistep linear reaction routes. To solve this hard combinatorial problem, we present a powerful sequential Monte Carlo algorithm that recursively designs a synthetic reaction network by sequentially building up single-step reactions. In a case study of designing drug-like molecules based on commercially available compounds, compared with heuristic combinatorial search methods, the proposed method shows overwhelming performance in terms of computational efficiency and coverage and novelty with respect to existing compounds.

Submitted 1 March, 2022; originally announced April 2022.

arXiv:2111.13068 [pdf]

doi 10.1016/j.ultras.2022.106749

Invariant in variants

Authors: Cong Liu, Chen-Wu Wu

Abstract: The coronavirus Covid-19 mutates quickly in the pandemic, leaving people struggling to verify and improve the effectiveness of the vaccine based on biochemistry. Is there any physical invariant in the variants of such kind of pathogen that could be taken advantage of to ease the tensions? To this point, extensive numerical experiments based on continuity mechanics were carried out to discover the vibration modes and the range of natural frequency of coronavirus Covid-19. Such an invariant could help us in developing some flexible technique to deactivate the coronavirus, such as resonantly breaking the viral spike by ultrasound wave. The fundamental mechanisms governing such a process are demonstrated via solving the coupled equations of acoustics and dynamics, and thereafter the technique strategies are proposed to efficiently realize the concept.

Submitted 25 November, 2021; originally announced November 2021.

arXiv:2111.10011 [pdf, other]

Game-environment feedback dynamics for voluntary prisoner's dilemma games

Authors: Bin-Quan Li, Cong Liu, Zhi-Xi Wu, Jian-Yue Guan

Abstract: Recently, eco-evolutionary game theory, which describes the coupled dynamics of strategies and environment, has attracted great attention. At the same time, most of the current work is focused on the classic two-player two-strategy game. In this work, we study multi-strategy eco-evolutionary game theory, which is an extension of the framework. For simplicity, we focus on the voluntary participation Prisoner's dilemma game. For the general class of payoff-dependent feedback dynamics, we show the conditions for the existence and stability of internal equilibria by using the replicator dynamics. The internal equilibrium points include two-strategy coexistence states, three-strategy coexistence states, persistent oscillation states and interior saddle points. These states are determined by the relative feedback strength and payoff matrix, and are independent of the relative feedback speed and initial state. In particular, the three-strategy coexistence provides a new mechanism for maintaining biodiversity in biology, ecology, and sociology. Besides, we find that this three-strategy model returns to the persistent oscillation state of the two-strategy model when there is no defective strategy at the initial moment.

Submitted 18 November, 2021; originally announced November 2021.
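
The replicator dynamics named in the abstract above can be illustrated with a minimal sketch. It covers only the strategy update for three strategies (cooperate, defect, loner) with a fixed payoff matrix and omits the environment feedback that the paper studies. The payoff values `b` and `sigma`, the step size, and the function name `replicator_step` are assumptions chosen for illustration, not values taken from the paper.

```python
def replicator_step(x, payoff, dt):
    """One explicit Euler step of dx_i/dt = x_i * (f_i - mean fitness)."""
    n = len(x)
    # Expected payoff of each strategy against the current population mix.
    fitness = [sum(payoff[i][j] * x[j] for j in range(n)) for i in range(n)]
    mean_fitness = sum(xi * fi for xi, fi in zip(x, fitness))
    return [xi + dt * xi * (fi - mean_fitness) for xi, fi in zip(x, fitness)]


# Hypothetical payoffs; rows and columns are ordered cooperate, defect, loner.
b, sigma = 1.5, 0.3
payoff = [
    [1.0, 0.0, sigma],      # cooperator vs. cooperator, defector, loner
    [b, 0.0, sigma],        # defector
    [sigma, sigma, sigma],  # loner always receives the fixed payoff sigma
]

x = [0.4, 0.3, 0.3]  # initial strategy frequencies, summing to 1
for _ in range(1000):
    x = replicator_step(x, payoff, dt=0.01)

# A strategy that is absent initially stays absent under this update rule.
no_defectors = replicator_step([0.5, 0.0, 0.5], payoff, dt=0.01)
```

Because each update multiplies a frequency by a growth factor, a population that starts with no defectors never acquires any, which is the structural reason a three-strategy model of this kind can reduce to a two-strategy one.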

arXiv:2110.14329 [pdf]

doi 10.1186/s13059-021-02544-3

Feature selection revisited in the single-cell era

Authors: Pengyi Yang, Hao Huang, Chunlei Liu

Abstract: Feature selection techniques are essential for high-dimensional data analysis. In the last two decades, their popularity has been fuelled by the increasing availability of high-throughput biomolecular data where high-dimensionality is a common data property. Recent advances in biotechnologies enable global profiling of various molecular and cellular features at single-cell resolution, resulting in large-scale datasets with increased complexity. These technological developments have led to a resurgence in feature selection research and application in the single-cell field. Here, we revisit feature selection techniques and summarise recent developments. We review their versatile application to a range of single-cell data types including those generated from traditional cytometry and imaging technologies and the latest array of single-cell omics technologies. We highlight some of the challenges and future directions on which feature selection could have a significant impact. Finally, we consider the scalability and make general recommendations on the utility of each type of feature selection method. We hope this review serves as a reference point to stimulate future research and application of feature selection in the single-cell era.

Submitted 27 October, 2021; originally announced October 2021.

Journal ref: Genome Biology 22, 321 (2021)

arXiv:2109.10258 [pdf]

Arterial blood pressure waveform in liver transplant surgery possesses variability of morphology reflecting recipients' acuity and predicting short term outcomes

Authors: Shen-Chih Wang, Chien-Kun Ting, Cheng-Yen Chen, Chin-Su Liu, Niang-Cheng Lin, Che-Chuan Loon, Hau-Tieng Wu, Yu-Ting Lin

Abstract: Background: We investigated clinical information underneath the beat-to-beat fluctuation of the arterial blood pressure (ABP) waveform morphology. We proposed the Dynamical Diffusion Map algorithm (DDMap) to quantify the variability of morphology. The underlying physiology could be the compensatory mechanisms involving complex interactions between various physiological mechanisms to regulate the cardiovascular system. As a liver transplant surgery contains distinct periods, we investigated its clinical behavior in different surgical steps. Methods: Our study used DDmap algorithm, based on unsupervised manifold learning, to obtain a quantitative index for the beat-to-beat variability of morphology. We examined the correlation between the variability of ABP morphology and disease acuity as indicated by Model for End-Stage Liver Disease (MELD) scores, the postoperative laboratory data, and 4 early allograft failure (EAF) scores. Results: Among the 85 enrolled patients, the variability of morphology obtained during the presurgical phase was best correlated with MELD-Na scores. The neohepatic phase variability of morphology was associated with EAF scores as well as postoperative bilirubin levels, international normalized ratio, aspartate aminotransferase levels, and platelet count. Furthermore, variability of morphology presents more associations with the above clinical conditions than the common BP measures and their BP variability indices.
Conclusions: The variability of morphology obtained during the presurgical phase is indicative of patient acuity, whereas those during the neohepatic phase are indicative of short-term surgical outcomes.

Submitted 1 July, 2023; v1 submitted 21 September, 2021; originally announced September 2021.

Comments: 5 figures and 1 table

arXiv:2104.07062 [pdf, other]

Decoding of the Walking States and Step Rates from Cortical Electrocorticogram Signals

Authors: Po T. Wang, Colin M. McCrimmon, Susan J. Shaw, Hui Gong, Luis A. Chui, Payam Heydari, Charles Y. Liu, An H. Do, Zoran Nenadic

Abstract: Brain-computer interfaces (BCIs) have shown promising results in restoring motor function to individuals with spinal cord injury. These systems have traditionally focused on the restoration of upper extremity function; however, the lower extremities have received relatively little attention. Early feasibility studies used noninvasive electroencephalogram (EEG)-based BCIs to restore walking function to people with paraplegia. However, the limited spatiotemporal resolution of EEG signals restricted the application of these BCIs to elementary gait tasks, such as the initiation and termination of walking. To restore more complex gait functions, BCIs must accurately decode additional degrees of freedom from brain signals. In this study, we used subdurally recorded electrocorticogram (ECoG) signals from able-bodied subjects to design a decoder capable of predicting the walking state and step rate information. We recorded ECoG signals from the motor cortices of two individuals as they walked on a treadmill at different speeds. Our offline analysis demonstrated that the state information could be decoded from >16 minutes of ECoG data with an unprecedented accuracy of 99.8%. Additionally, using a Bayesian filter approach, we achieved an average correlation coefficient between the decoded and true step rates of 0.934. When combined, these decoders may yield decoding accuracies sufficient to safely operate present-day walking prostheses.

Submitted 14 April, 2021; originally announced April 2021.

arXiv:2104.04672 [pdf, other]

doi 10.1109/ISBI48211.2021.9433808

Deep Learning Identifies Neuroimaging Signatures of Alzheimer's Disease Using Structural and Synthesized Functional MRI Data

Authors: Nanyan Zhu, Chen Liu, Xinyang Feng, Dipika Sikka, Sabrina Gjerswold-Selleck, Scott A. Small, Jia Guo

Abstract: Current neuroimaging techniques provide paths to investigate the structure and function of the brain in vivo and have made great advances in understanding Alzheimer's disease (AD). However, the group-level analyses prevalently used for investigation and understanding of the disease are not applicable for diagnosis of individuals. More recently, deep learning, which can efficiently analyze large-scale complex patterns in 3D brain images, has helped pave the way for computer-aided individual diagnosis by providing accurate and automated disease classification. Great progress has been made in classifying AD with deep learning models developed upon increasingly available structural MRI data. The lack of scale-matched functional neuroimaging data prevents such models from being further improved by observing functional changes in pathophysiology. Here we propose a potential solution by first learning a structural-to-functional transformation in brain MRI, and further synthesizing spatially matched functional images from large-scale structural scans. We evaluated our approach by building computational models to discriminate patients with AD from healthy normal subjects and demonstrated a performance boost after combining the structural and synthesized functional brain images into the same model.
Furthermore, our regional analyses identified the temporal lobe to be the most predictive structural-region and the parieto-occipital lobe to be the most predictive functional-region of our model, which are both in concordance with previous group-level neuroimaging findings. Together, we demonstrate the potential of deep learning with large-scale structural and synthesized functional MRI to impact AD classification and to identify AD's neuroimaging signatures.

Submitted 28 May, 2021; v1 submitted 9 April, 2021; originally announced April 2021.

Comments: Published in IEEE ISBI 2021. Available at https://ieeexplore.ieee.org/document/9433808

arXiv:2103.00408 [pdf, ps, other]

Absolute quantification of real-time PCR data with stage signal difference analysis

Authors: Chuanbo Liu, Jin Wang

Abstract: Real-time PCR, or Real-time Quantitative PCR (qPCR), is an effective approach to quantify nucleic acid samples. Given the complicated reaction system along with thermal cycles, there has been long-term confusion on accurately calculating the initial nucleic acid amounts from the fluorescence signals. Although many improved algorithms have been proposed, the classical threshold method is still the primary choice in routine applications. In this study, we first illustrate the origin of the linear relationship between the threshold value and the logarithm of the initial nucleic acid amount by reconstructing the PCR reaction process with stochastic simulations. We then develop a new method for the absolute quantification of nucleic acid samples with qPCR. By monitoring the fluorescence signal changes in every stage of the thermal cycle, we are able to calculate a representation of the step-wise efficiency change. This is the first work to calculate the PCR efficiency change directly from the fluorescence signal, without fitting or sophisticated analysis. Our results reveal that the efficiency change during the PCR process is complicated and cannot be modeled simply by a monotone function. Based on the calculated efficiency, we illustrate a new absolute qPCR analysis method for accurately determining nucleic acid amount. The efficiency problem is completely avoided in this new method.

Submitted 28 February, 2021; originally announced March 2021.
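
The linear relationship between the threshold cycle and the logarithm of the initial amount, which the abstract above attributes to the classical threshold method, can be sketched under the textbook assumption of a constant per-cycle amplification efficiency E, so that the amount after c cycles is N0 * (1 + E)^c. This is not the stage signal difference method proposed in the paper. The function names, threshold value, and dilution series are hypothetical.

```python
import math


def threshold_cycle(n0, threshold, efficiency=1.0):
    """Cycle at which n0 * (1 + efficiency)**c reaches the threshold."""
    return math.log(threshold / n0) / math.log(1.0 + efficiency)


def initial_amount(ct, threshold, efficiency=1.0):
    """Invert the model: recover the starting amount from a threshold cycle."""
    return threshold / (1.0 + efficiency) ** ct


threshold = 1e10                    # assumed detection threshold (copies)
dilutions = [1e2, 1e3, 1e4, 1e5]    # assumed ten-fold standard series
cts = [threshold_cycle(n0, threshold) for n0 in dilutions]

# Slope of Ct against log10(N0); equals -1 / log10(1 + E) under this model.
slope = (cts[-1] - cts[0]) / (math.log10(dilutions[-1]) - math.log10(dilutions[0]))
recovered = initial_amount(cts[1], threshold)
```

With perfect doubling (E = 1) the slope is -1 / log10(2), roughly -3.32 cycles per ten-fold dilution. Any departure from constant efficiency breaks this straight line, which is the weakness the abstract says the new method avoids.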

arXiv:2103.00405 [pdf, other]

Parallel implementations of random time algorithm for chemical network stochastic simulations

Authors: Chuanbo Liu, Jin Wang

Abstract: In this study, we have developed a parallel version of the random time simulation algorithm. First, we give a rigorous basis for the random time description of the stochastic process of chemical reaction network time evolution. We then review the random time simulation algorithm and give implementations of the parallel version of the next reaction random time algorithm. The discussion of computational complexity suggests an $M$-fold reduction in computation time (where $M$ is the connection number of the network) for the random time simulation algorithm compared with other exact stochastic simulation algorithms, such as the Gillespie algorithm. For large-scale systems, such as the protein-protein interaction network, $M$ is on the order of $10^8$. We further demonstrate the power of random time simulation with a GPGPU parallel implementation, which achieved roughly 100-fold acceleration compared with CPU implementations. Therefore, the stochastic simulation method we developed here can be of great application value for simulating the time evolution of large-scale networks.

Submitted 28 February, 2021; originally announced March 2021.
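
For context on the baseline named in the abstract above, here is a minimal sketch of the Gillespie direct method, a standard exact stochastic simulation algorithm. It is not the random time algorithm of the paper. The two-species reversible network, the rate constants, and the function name `gillespie` are assumptions made for illustration.

```python
import math
import random


def gillespie(state, reactions, t_end, rng):
    """Gillespie direct method.

    reactions is a list of (propensity_function, state_change) pairs, where
    state_change maps a species name to its integer change in copy number.
    Returns a list of (time, state) snapshots.
    """
    t = 0.0
    state = dict(state)
    trajectory = [(t, dict(state))]
    while True:
        # This sketch recomputes every propensity at each step.
        propensities = [rate(state) for rate, _ in reactions]
        total = sum(propensities)
        if total <= 0.0:
            break  # no reaction can fire
        # Waiting time to the next event is exponential with rate `total`.
        t += -math.log(1.0 - rng.random()) / total
        if t > t_end:
            break
        # Choose which reaction fires, with probability a_i / total.
        pick = rng.random() * total
        cumulative = 0.0
        for a, (_, change) in zip(propensities, reactions):
            cumulative += a
            if pick < cumulative:
                for species, delta in change.items():
                    state[species] += delta
                break
        trajectory.append((t, dict(state)))
    return trajectory


# Hypothetical reversible isomerisation A <-> B with assumed rate constants.
k_forward, k_backward = 1.0, 0.5
reactions = [
    (lambda s: k_forward * s["A"], {"A": -1, "B": +1}),
    (lambda s: k_backward * s["B"], {"A": +1, "B": -1}),
]
trajectory = gillespie({"A": 100, "B": 0}, reactions, t_end=5.0, rng=random.Random(0))
```

Each event here requires a pass over all reactions, both to recompute propensities and to select the firing reaction, so the per-event work grows with the size of the network.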

arXiv:2007.14965 [pdf, other]

Antiviral Drug-Membrane Permeability: the Viral Envelope and Cellular Organelles

Authors: Changjiang Liu, Paolo Elvati, Angela Violi

Abstract: To shorten the time required to find effective new drugs, like antivirals, a key parameter to consider is membrane permeability, as a compound intended for an intracellular target with poor permeability will have low efficacy. Here, we present a computational model that considers both drug characteristics and membrane properties for the rapid assessment of drugs permeability through the coronavirus envelope and various cellular membranes. We analyze 79 drugs that are considered as potential candidates for the treatment of SARS-CoV-2 and determine their time of permeation in different organelle membranes grouped by viral baits and mammalian processes. The computational results are correlated with experimental data, present in the literature, on bioavailability of the drugs, showing a negative correlation between fast permeation and most promising drugs. This model represents an important tool capable of evaluating how permeability affects the ability of compounds to reach both intended and unintended intracellular targets in an accurate and rapid way. The method is general and flexible and can be employed for a variety of molecules, from small drugs to nanoparticles, as well to a variety of biological membranes.

Submitted 29 July, 2020; originally announced July 2020.

arXiv:2005.04224 [pdf]

doi 10.1080/17538947.2020.1809723

Taking the pulse of COVID-19: A spatiotemporal perspective

Authors: Chaowei Yang, Dexuan Sha, Qian Liu, Yun Li, Hai Lan, Weihe Wendy Guan, Tao Hu, Zhenlong Li, Zhiran Zhang, John Hoot Thompson, Zifu Wang, David Wong, Shiyang Ruan, Manzhu Yu, Douglas Richardson, Luyao Zhang, Ruizhi Hou, You Zhou, Cheng Zhong, Yifei Tian, Fayez Beaini, Kyla Carte, Colin Flynn, Wei Liu, Dieter Pfoser, et al. (10 additional authors not shown)

Abstract: The sudden outbreak of the Coronavirus disease (COVID-19) swept across the world in early 2020, triggering the lockdowns of several billion people across many countries, including China, Spain, India, the U.K., Italy, France, Germany, and most states of the U.S. The transmission of the virus accelerated rapidly with the most confirmed cases in the U.S., and New York City became an epicenter of the pandemic by the end of March. In response to this national and global emergency, the NSF Spatiotemporal Innovation Center brought together a taskforce of international researchers and assembled implemented strategies to rapidly respond to this crisis, for supporting research, saving lives, and protecting the health of global citizens. This perspective paper presents our collective view on the global health emergency and our effort in collecting, analyzing, and sharing relevant data on global policy and government responses, geospatial indicators of the outbreak and evolving forecasts; in developing research capabilities and mitigation measures with global scientists, promoting collaborative research on outbreak dynamics, and reflecting on the dynamic responses from human societies.

Submitted 8 May, 2020; originally announced May 2020.

Comments: 27 pages, 18 figures. International Journal of Digital Earth (2020)

arXiv:2001.05551 [pdf, other]

Substituting Gadolinium in Brain MRI Using DeepContrast

Authors: Haoran Sun, Xueqing Liu, Xinyang Feng, Chen Liu, Nanyan Zhu, Sabrina J. Gjerswold-Selleck, Hong-Jian Wei, Pavan S. Upadhyayula, Angeliki Mela, Cheng-Chia Wu, Peter D. Canoll, Andrew F. Laine, J. Thomas Vaughan, Scott A. Small, Jia Guo

Abstract: Cerebral blood volume (CBV) is a hemodynamic correlate of oxygen metabolism and reflects brain activity and function. High-resolution CBV maps can be generated using the steady-state gadolinium-enhanced MRI technique. Such a technique requires an intravenous injection of exogenous gadolinium based contrast agent (GBCA) and recent studies suggest that the GBCA can accumulate in the brain after frequent use. We hypothesize that endogenous sources of contrast might exist within the most conventional and commonly acquired structural MRI, potentially obviating the need for exogenous contrast. Here, we test this hypothesis by developing and optimizing a deep learning algorithm, which we call DeepContrast, in mice. We find that DeepContrast performs equally well as exogenous GBCA in mapping CBV of the normal brain tissue and enhancing glioblastoma. Together, these studies validate our hypothesis that a deep learning approach can potentially replace the need for GBCAs in brain MRI.

Submitted 15 January, 2020; originally announced January 2020.

Journal ref: The IEEE International Symposium on Biomedical Imaging (ISBI) 2020

arXiv:2001.05548 [pdf, other]

Segmentation with Residual Attention U-Net and an Edge-Enhancement Approach Preserves Cell Shape Features

Authors: Nanyan Zhu, Chen Liu, Zakary S. Singer, Tal Danino, Andrew F. Laine, Jia Guo

Abstract: The ability to extrapolate gene expression dynamics in living single cells requires robust cell segmentation, and one of the challenges is the amorphous or irregularly shaped cell boundaries. To address this issue, we modified the U-Net architecture to segment cells in fluorescence widefield microscopy images and quantitatively evaluated its performance. We also proposed a novel loss function approach that emphasizes the segmentation accuracy on cell boundaries and encourages shape feature preservation. With a 97% sensitivity, 93% specificity, 91% Jaccard similarity, and 95% Dice coefficient, our proposed method called Residual Attention U-Net with edge-enhancement surpassed the state-of-the-art U-Net in segmentation performance as evaluated by the traditional metrics. More remarkably, the same proposed candidate also performed the best in terms of the preservation of valuable shape features, namely area, eccentricity, major axis length, solidity and orientation. These improvements on shape feature preservation can serve as useful assets for downstream cell tracking and quantification of changes in cell statistics or features over time.

Submitted 15 January, 2020; originally announced January 2020.

Comments: 7 pages, 4 figures, 1 table. Nanyan Zhu and Chen Liu share equal contribution and are listed as co-first authors

ACM Class: I.4.6; I.4.7; I.5.1; I.5.2
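
The four traditional metrics quoted in the abstract above (sensitivity, specificity, Jaccard similarity, Dice coefficient) have standard definitions in terms of pixel-wise true and false positives and negatives. The sketch below computes them for two flattened binary masks. The function name and the toy masks are illustrative assumptions, and the code assumes both foreground and background pixels are present so that no denominator is zero.

```python
def segmentation_metrics(pred, truth):
    """Pixel-wise overlap metrics for two equal-length binary masks."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of pixels")
    tp = sum(1 for p, t in zip(pred, truth) if p and t)          # true positives
    tn = sum(1 for p, t in zip(pred, truth) if not p and not t)  # true negatives
    fp = sum(1 for p, t in zip(pred, truth) if p and not t)      # false positives
    fn = sum(1 for p, t in zip(pred, truth) if not p and t)      # false negatives
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "jaccard": tp / (tp + fp + fn),
        "dice": 2 * tp / (2 * tp + fp + fn),
    }


# Toy flattened masks (1 = cell pixel, 0 = background), for illustration only.
pred = [1, 1, 1, 0, 0, 0, 1, 0]
truth = [1, 1, 0, 0, 0, 0, 1, 1]
metrics = segmentation_metrics(pred, truth)
```

Dice and Jaccard are monotonically related by Dice = 2J / (1 + J), so they always rank two segmentations of the same image in the same order. The abstract's 91% Jaccard and 95% Dice are consistent with that relation.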

arXiv:1912.10749 [pdf]

doi 10.1088/1741-2552/abc8d4

SpikeDeep-Classifier: A deep-learning based fully automatic offline spike sorting algorithm

Authors: Muhammad Saif-ur-Rehman, Omair Ali, Robin Lienkaemper, Sussane Dyck, Marita Metzler, Yaroslav Parpaley, Joerg Wellmer, Charles Liu, Brian Lee, Spencer Kellis, Richard Andersen, Ioannis Iossifidis, Tobias Glasmachers, Christian Klaes

Abstract: Objective. Recent advancements in electrode designs and micro-fabrication technology have allowed the existence of microelectrode arrays with hundreds of channels for single-cell recordings. In such electrophysiological recordings, each implanted micro-electrode can record the activities of more than one neuron in its vicinity. Recording the activities of multiple neurons may also be referred to as multiple unit activity. However, for any further analysis, the main goal is to isolate the activity of each recorded neuron, which is then called single-unit activity (SUA). This process may also be referred to as spike sorting or spike classification. Recent approaches to extract SUA are time consuming, mainly due to the requirement of human intervention at various stages of the spike sorting pipeline. Lack of standardization is another drawback of the currently available approaches. Therefore, in this study we proposed a standard spike sorter: SpikeDeep-Classifier, a fully automatic spike sorting algorithm. Approach. We proposed a novel spike sorting pipeline, based on a set of supervised and unsupervised learning algorithms. We used supervised, deep learning-based algorithms for extracting meaningful channels and removing background activities (noise) from the extracted channels. We also showed that the process of clustering becomes straightforward once the noise/artifact is completely removed from the data. Therefore, in the next stage, we applied a simple clustering algorithm (K-means) with a predefined maximum number of clusters.
Lastly, we used a similarity-based criterion to keep distinct clusters and merge similar-looking clusters. Main results. We evaluated our algorithm on a dataset collected from two different species (humans and non-human primates (NHPs)) without any retraining. We also validated our algorithm on two publicly available labeled datasets.

Submitted 23 December, 2019; originally announced December 2019.

Comments: 33 Pages, 14 Figures, 10 Tables

arXiv:1907.00329 [pdf, other]

Prediction of Small Molecule Kinase Inhibitors for Chemotherapy Using Deep Learning

Authors: Niranjan Balachandar, Christine Liu, Winston Wang

Abstract: The current state of cancer therapeutics has been moving away from one-size-fits-all cytotoxic chemotherapy, and towards a more individualized and specific approach involving the targeting of each tumor's genetic vulnerabilities. Different tumors, even of the same type, may be more reliant on certain cellular pathways more than others. With modern advancements in our understanding of cancer genome sequencing, these pathways can be discovered. Investigating each of the millions of possible small molecule inhibitors for each kinase in vitro, however, would be extremely expensive and time consuming. This project focuses on predicting the inhibition activity of small molecules targeting 8 different kinases using multiple deep learning models. We trained fingerprint-based MLPs and simplified molecular-input line-entry specification (SMILES)-based recurrent neural networks (RNNs) and molecular graph convolutional networks (GCNs) to accurately predict inhibitory activity targeting these 8 kinases.

Submitted 30 June, 2019; originally announced July 2019.

Comments: 15 pages, 8 figures, 3 tables

arXiv:1903.12331 [pdf]

doi 10.1002/mp.14255

A Deep Dive into Understanding Tumor Foci Classification using Multiparametric MRI Based on Convolutional Neural Network

Authors: Weiwei Zong, Joon Lee, Chang Liu, Eric Carver, Aharon Feldman, Branislava Janic, Mohamed Elshaikh, Milan Pantelic, David Hearshen, Indrin Chetty, Benjamin Movsas, Ning Wen

Abstract: Deep learning models have had a great success in disease classifications using large data pools of skin cancer images or lung X-rays. However, data scarcity has been the roadblock of applying deep learning models directly on prostate multiparametric MRI (mpMRI). Although model interpretation has been heavily studied for natural images for the past few years, there has been a lack of interpretation of deep learning models trained on medical images. This work designs a customized workflow for the small and imbalanced data set of prostate mpMRI where features were extracted from a deep learning model and then analyzed by a traditional machine learning classifier. In addition, this work contributes to revealing how deep learning models interpret mpMRI for prostate cancer patients stratification.

Submitted 14 May, 2020; v1 submitted 28 March, 2019; originally announced March 2019.

arXiv:1901.07467 [pdf, ps, other]

Enhancing Blood Glucose Prediction with Meal Absorption and Physical Exercise Information

Authors: Chengyuan Liu, Josep Vehi, Nick Oliver, Pantelis Georgiou, Pau Herrero

Abstract: Objective: Numerous glucose prediction algorithms have been proposed to empower type 1 diabetes (T1D) management. Most of these algorithms only account for inputs such as glucose, insulin and carbohydrate, which limits their performance. Here, we present a novel glucose prediction algorithm which, in addition to standard inputs, accounts for meal absorption and physical exercise information to enhance prediction accuracy. Methods: A compartmental model of glucose-insulin dynamics combined with a deconvolution technique for state estimation is employed for glucose prediction. In silico data from the 10 adult subjects of the UVa-Padova simulator, and clinical data from 10 adults with T1D were used. Finally, a comparison against a validated glucose prediction algorithm based on a latent variable with exogenous input (LVX) model is provided. Results: For a prediction horizon of 60 minutes, accounting for meal absorption and physical exercise improved glucose forecasting accuracy. In particular, root mean square error (mg/dL) went from 26.68 to 23.89, p<0.001 (in silico data); and from 37.02 to 35.96, p<0.001 (clinical data - only meal information). Such improvement in accuracy was translated into significant improvements on hypoglycaemia and hyperglycaemia prediction. Finally, the performance of the proposed algorithm is statistically superior to that of the LVX algorithm (26.68 vs. 32.80, p<0.001 (in silico data); 37.02 vs. 49.17, p<0.01 (clinical data)).
Conclusion: Taking into account meal absorption and physical exercise information improves glucose prediction accuracy.

Submitted 13 December, 2018; originally announced January 2019.

Comments: 10 pages, 5 figures, 8 tables and one appendix

arXiv:1812.11001 [pdf]

Multivariate MR Biomarkers Better Predict Cognitive Dysfunction in Mouse Models of Alzheimers Disease

Authors: Alexandra Badea, Natalie A Delpratt, RJ Anderson, Russell Dibb, Yi Qi, Hongjiang Wei, Chunlei Liu, William C Wetsel, Brian B Avants, Carol Colton

Abstract: To understand multifactorial conditions such as Alzheimers disease (AD) we need brain signatures that predict the impact of multiple pathologies and their interactions. To help uncover the relationships between brain circuits and cognitive markers we have used mouse models that represent, at least in part, the complex interactions altered in AD. In particular, we aimed to understand the relationship between vulnerable brain circuits and memory deficits measured in the Morris water maze, and we tested several predictive modeling approaches. We used in vivo manganese enhanced MRI voxel based analyses to reveal regional differences in volume (morphometry), signal intensity (activity), and magnetic susceptibility (iron deposition, demyelination). These regions included the hippocampus, olfactory areas, entorhinal cortex and cerebellum. The image based properties of these regions were used to predict spatial memory. We next used eigenanatomy, which reduces dimensionality to produce sets of regions that explain the variance in the data. For each imaging marker, eigenanatomy revealed networks underpinning a range of cognitive functions including memory, motor function, and associative learning. Finally, the integration of multivariate markers in a supervised sparse canonical correlation approach outperformed single predictor models and had significant correlates to spatial memory.
Among a priori selected regions, the fornix also provided good predictors, raising the possibility of investigating how disease propagation within brain networks leads to cognitive deterioration. Our results support that modeling approaches integrating multivariate imaging markers provide sensitive predictors of AD-like behaviors. Such strategies for mapping brain circuits responsible for behaviors may help in the future predict disease progression, or response to interventions.

Submitted 28 December, 2018; originally announced December 2018.

Comments: 23 pages, 3 Tables, 6 Figures; submitted for publication

arXiv:1811.02923 [pdf]

Universal Spike Classifier

Authors: Muhammad Saif-ur-Rehman, Robin Lienkämper, Yaroslav Parpaley, Jörg Wellmer, Charles Liu, Brian Lee, Spencer Kellis, Richard Andersen, Ioannis Iossifidis, Tobias Glasmachers, Christian Klaes

Abstract: In electrophysiology, microelectrodes are the primary source for recording neural data of single neurons (single unit activity). These microelectrodes can be implanted individually, or in the form of microelectrode arrays consisting of hundreds of electrodes. During recordings, some channels capture the activity of neurons, which is usually contaminated with external artifacts and noise. Another considerable fraction of channels does not record any neural data, but only external artifacts and noise. Therefore, automatic identification and tracking of channels containing neural data is of great significance and can accelerate the process of analysis, e.g. automatic selection of meaningful channels during offline and online spike sorting. Another important aspect is the selection of meaningful channels during online decoding in brain-computer interface applications, where threshold crossing events are usually used for feature extraction, even though they do not necessarily correspond to neural events. Here, we propose a novel algorithm based on a newly introduced method of feature vector extraction and a supervised deep learning method: a universal spike classifier (USC). The USC enables us to address both of the issues raised above. The USC uses the standard architecture of convolutional neural networks (Conv net). It takes a batch of waveforms, instead of a single waveform, as input, propagates it through the multilayered structure, and finally classifies it as a channel containing neural spike data or artifacts.
We have trained the USC model on data recorded from a single tetraplegic patient with Utah arrays implanted in different brain areas. This trained model was then evaluated without retraining on the data collected from six epileptic patients implanted with depth electrodes and two tetraplegic patients implanted with two Utah arrays, individually.

Submitted 7 November, 2018; originally announced November 2018.

Comments: 21 Pages, 12 Figures

arXiv:1810.08726 [pdf, other]

SL$^2$MF: Predicting Synthetic Lethality in Human Cancers via Logistic Matrix Factorization

Authors: Yong Liu, Min Wu, Chenghao Liu, Xiao-Li Li, Jie Zheng

Abstract: Synthetic lethality (SL) is a promising concept for the discovery of novel anti-cancer drug targets. However, wet-lab experiments for detecting SLs face various challenges, such as high cost and low consistency across platforms or cell lines. Therefore, computational prediction methods are needed to address these issues. This paper proposes a novel SL prediction method, named SL2MF, which employs logistic matrix factorization to learn latent representations of genes from the observed SL data. The probability that two genes form an SL pair is modeled by the linear combination of gene latent vectors. As known SL pairs are more trustworthy than unknown pairs, we design importance weighting schemes to assign higher importance weights to known SL pairs and lower importance weights to unknown pairs in SL2MF. Moreover, we also incorporate biological knowledge about genes from protein-protein interaction (PPI) data and Gene Ontology (GO). In particular, we calculate the similarity between genes based on their GO annotations and topological properties in the PPI network. Extensive experiments on the SL interaction data from the SynLethDB database have been conducted to demonstrate the effectiveness of SL2MF.

Submitted 19 October, 2018; originally announced October 2018.
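
The weighted logistic matrix factorization idea described in the SL2MF abstract above can be illustrated with a minimal sketch. This is not the authors' implementation: the inner-product link between latent vectors, the plain gradient descent loop, the function name `fit_weighted_logistic_mf`, and every hyperparameter value are assumptions chosen for illustration, and the PPI and GO similarity terms mentioned in the abstract are left out.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fit_weighted_logistic_mf(S, rank=2, known_weight=5.0, lr=0.02,
                             reg=0.01, epochs=1000, seed=0):
    """Fit gene latent vectors U so that sigmoid(U @ U.T) approximates S.

    S is a symmetric 0/1 matrix of known SL pairs. Known pairs get the
    (hypothetical) importance weight `known_weight`; unknown pairs get 1.
    """
    rng = np.random.default_rng(seed)
    n = S.shape[0]
    U = 0.1 * rng.standard_normal((n, rank))
    W = np.where(S > 0, known_weight, 1.0)
    np.fill_diagonal(W, 0.0)  # a gene paired with itself is ignored
    for _ in range(epochs):
        P = sigmoid(U @ U.T)
        G = W * (P - S)  # gradient of the weighted log-loss w.r.t. the logits
        U -= lr * ((G + G.T) @ U + reg * U)
    return U

# Toy data (made up): four genes, known SL pairs (0, 1) and (2, 3).
S = np.zeros((4, 4))
for i, j in [(0, 1), (2, 3)]:
    S[i, j] = S[j, i] = 1.0

U = fit_weighted_logistic_mf(S)
P = sigmoid(U @ U.T)  # predicted SL probabilities for every gene pair
```

Raising `known_weight` makes the fit trust observed SL pairs more than unobserved ones, which is the role the abstract assigns to the importance weighting schemes.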

arXiv:1810.07263 [pdf, ps, other]

doi 10.1063/1.5050808

Discrete Flux and Velocity Fields of Probability and Their Global Maps in Reaction Systems

Authors: Anna Terebus, Chun Liu, Jie Liang

Abstract: Stochasticity plays important roles in reaction systems. Vector fields of probability flux and velocity characterize time-varying and steady-state properties of these systems, including high probability paths, barriers, checkpoints among different stable regions, as well as mechanisms of dynamic switching among them. However, conventional fluxes on continuous space are ill-defined and are problematic at boundaries of the state space or when copy numbers are small. By re-defining the derivative and divergence operators based on the discrete nature of reactions, we introduce new formulations of discrete fluxes. Our flux model fully accounts for the discreteness of both the state space and the jump processes of reactions. The reactional discrete flux satisfies the continuity equation and describes the behavior of the system evolving along directions of reactions. The species discrete flux directly describes the dynamic behavior in the state space of the reactants such as the transfer of probability mass. With the relationship between these two fluxes specified, we show how to construct time-evolving and steady-state global flow-maps of probability flux and velocity in the directions of every species at every microstate, and how they are related to the outflow and inflow of probability fluxes when tracing out reaction trajectories. We also describe how to impose proper conditions enabling exact quantification of flux and velocity in the boundary regions, without the difficulty of enforcing artificial reflecting conditions.
We illustrate the computation of probability flux and velocity using three model systems, namely, the birth-death process, the bistable Schlögl model, and the oscillating Schnakenberg model.

Submitted 16 October, 2018; originally announced October 2018.

Comments: 21 pages, 5 figures
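
As a rough illustration of probability flux on a discrete state space, the sketch below treats the birth-death process named in the abstract above. It is a textbook master-equation calculation, not the flux formulation of the paper: the rate constants, the truncation of the state space at `n_max`, and the function names are all assumptions. The net flux across the edge between copy numbers n and n+1 is taken as the birth flux out of n minus the death flux out of n+1.

```python
def birth_death_steady_state(k_birth, k_death, n_max):
    """Steady-state distribution over copy numbers 0..n_max.

    Uses the recursion p(n+1) = p(n) * k_birth / (k_death * (n + 1)),
    then normalises. Truncating at n_max is an assumption of this sketch.
    """
    weights = [1.0]
    for n in range(n_max):
        weights.append(weights[-1] * k_birth / (k_death * (n + 1)))
    total = sum(weights)
    return [w / total for w in weights]

def net_flux(p, k_birth, k_death):
    """Net probability flux across each edge n -> n+1."""
    return [k_birth * p[n] - k_death * (n + 1) * p[n + 1]
            for n in range(len(p) - 1)]

# Hypothetical rate constants chosen only for the example.
k_birth, k_death = 2.0, 1.0
p = birth_death_steady_state(k_birth, k_death, n_max=30)
flux = net_flux(p, k_birth, k_death)
mean_copy_number = sum(n * pn for n, pn in enumerate(p))
```

For this one-species system the net flux vanishes on every edge at steady state, so it is mainly a sanity check; the bistable and oscillating models listed in the abstract are where non-trivial flux patterns would appear.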

arXiv:1808.04443 [pdf, other]

Spatial and Spectral Features Fusion for EEG Classification during Motor Imagery in BCI

Authors: Chuanqi Tan, Fuchun Sun, Wenchang Zhang, Shaobo Liu, Chunfang Liu

Abstract: Brain computer interface (BCI) is the only way for some special patients to communicate with the outside world, and it provides a direct control channel between the brain and external devices. As a non-invasive interface, scalp electroencephalography (EEG) has significant potential to be a major input signal for future BCI systems. Traditional methods only focus on a particular feature in the EEG signal, which limits the practical applications of EEG-based BCI. In this paper, we propose an algorithm for EEG classification with the ability to fuse multiple features. First, we use the common spatial pattern (CSP) as the spatial feature and wavelet coefficients as the spectral feature. Second, we fuse these features with a fusion algorithm in an orchestrated way to improve the accuracy of classification. Our algorithms are applied to dataset IVa from BCI Competition III. The experimental results suggest that our algorithm performs better than traditional methods.

Submitted 6 August, 2018; originally announced August 2018.

Comments: International Conference on Biomedical and Health Informatics (BHI 2017)
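
The spatial feature named in the abstract above, the common spatial pattern (CSP), can be sketched as follows. This is a generic CSP computation by whitening and eigendecomposition, not the authors' pipeline: the wavelet features and the fusion step are not shown, and the function names, trial shapes, and synthetic data are assumptions made for the example.

```python
import numpy as np

def mean_covariance(trials):
    """Average trace-normalised spatial covariance.

    trials has shape (n_trials, n_channels, n_samples).
    """
    covs = []
    for x in trials:
        c = x @ x.T
        covs.append(c / np.trace(c))
    return np.mean(covs, axis=0)

def csp_filters(trials_a, trials_b):
    """Return CSP filters (rows) sorted by decreasing class-a variance."""
    ca, cb = mean_covariance(trials_a), mean_covariance(trials_b)
    d, e = np.linalg.eigh(ca + cb)
    whiten = np.diag(d ** -0.5) @ e.T      # whitens the composite covariance
    lam, b = np.linalg.eigh(whiten @ ca @ whiten.T)
    order = np.argsort(lam)[::-1]
    return b[:, order].T @ whiten, lam[order]

def log_variance_features(trial, filters):
    """Normalised log-variance of the spatially filtered trial."""
    z = filters @ trial
    var = z.var(axis=1)
    return np.log(var / var.sum())

# Synthetic two-class data (made up): class a is strong on channel 0,
# class b is strong on channel 1.
rng = np.random.default_rng(0)
scale_a = np.array([3.0, 1.0, 1.0])[:, None]
scale_b = np.array([1.0, 3.0, 1.0])[:, None]
trials_a = rng.standard_normal((20, 3, 200)) * scale_a
trials_b = rng.standard_normal((20, 3, 200)) * scale_b

W, lam = csp_filters(trials_a, trials_b)
features_a = log_variance_features(trials_a[0], W)
features_b = log_variance_features(trials_b[0], W)
```

In practice only the first and last few rows of `W` are usually kept, since they carry the variance ratios that differ most between the two classes.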

Showing 1–50 of 70 results for author: Liu, C