
NeuroSynth: MRI-Derived Neuroanatomical Generative Models and Associated Dataset of 18,000 Samples

Authors: Sai Spandana Chintapalli, Rongguang Wang, Zhijian Yang, Vasiliki Tassopoulou, Fanyang Yu, Vishnu Bashyam, Guray Erus, Pratik Chaudhari, Haochang Shou, Christos Davatzikos

Abstract: Availability of large and diverse medical datasets is often challenged by privacy and data sharing restrictions. For successful application of machine learning techniques for disease diagnosis, prognosis, and precision medicine, large amounts of data are necessary for model building and optimization. To help overcome such limitations in the context of brain MRI, we present NeuroSynth: a collection of generative models of normative regional volumetric features derived from structural brain imaging. NeuroSynth models are trained on real brain imaging regional volumetric measures from the iSTAGING consortium, which encompasses over 40,000 MRI scans across 13 studies, incorporating covariates such as age, sex, and race. Leveraging NeuroSynth, we produce and offer 18,000 synthetic samples spanning the adult lifespan (ages 22-90 years), alongside the model's capability to generate unlimited data. Experimental results indicate that samples generated from NeuroSynth agree with the distributions obtained from real data. Most importantly, the generated normative data significantly enhance the accuracy of downstream machine learning models on tasks such as disease classification. Data and models are available at: https://huggingface.co/spaces/rongguangw/neuro-synth.

Submitted 17 July, 2024; originally announced July 2024.

arXiv:2407.09811 [pdf, other]

CellAgent: An LLM-driven Multi-Agent Framework for Automated Single-cell Data Analysis

Authors: Yihang Xiao, Jinyi Liu, Yan Zheng, Xiaohan Xie, Jianye Hao, Mingzhi Li, Ruitao Wang, Fei Ni, Yuxiao Li, Jintian Luo, Shaoqing Jiao, Jiajie Peng

Abstract: Single-cell RNA sequencing (scRNA-seq) data analysis is crucial for biological research, as it enables the precise characterization of cellular heterogeneity. However, manual manipulation of various tools to achieve desired outcomes can be labor-intensive for researchers. To address this, we introduce CellAgent (http://cell.agent4science.cn/), an LLM-driven multi-agent framework, specifically designed for the automatic processing and execution of scRNA-seq data analysis tasks, providing high-quality results with no human intervention. Firstly, to adapt general LLMs to the biological field, CellAgent constructs LLM-driven biological expert roles - planner, executor, and evaluator - each with specific responsibilities. Then, CellAgent introduces a hierarchical decision-making mechanism to coordinate these biological experts, effectively driving the planning and step-by-step execution of complex data analysis tasks. Furthermore, we propose a self-iterative optimization mechanism, enabling CellAgent to autonomously evaluate and optimize solutions, thereby guaranteeing output quality. We evaluate CellAgent on a comprehensive benchmark dataset encompassing dozens of tissues and hundreds of distinct cell types. Evaluation results consistently show that CellAgent effectively identifies the most suitable tools and hyperparameters for single-cell analysis tasks, achieving optimal performance. This automated framework dramatically reduces the workload for science data analyses, bringing us into the "Agent for Science" era.

Submitted 13 July, 2024; originally announced July 2024.

arXiv:2407.06334 [pdf, other]

Double-Ended Synthesis Planning with Goal-Constrained Bidirectional Search

Authors: Kevin Yu, Jihye Roh, Ziang Li, Wenhao Gao, Runzhong Wang, Connor W. Coley

Abstract: Computer-aided synthesis planning (CASP) algorithms have demonstrated expert-level abilities in planning retrosynthetic routes to molecules of low to moderate complexity. However, current search methods assume the sufficiency of reaching arbitrary building blocks, failing to address the common real-world constraint where using specific molecules is desired. To this end, we present a formulation of synthesis planning with starting material constraints. Under this formulation, we propose Double-Ended Synthesis Planning (DESP), a novel CASP algorithm under a bidirectional graph search scheme that interleaves expansions from the target and from the goal starting materials to ensure constraint satisfiability. The search algorithm is guided by a goal-conditioned cost network learned offline from a partially observed hypergraph of valid chemical reactions. We demonstrate the utility of DESP in improving solve rates and reducing the number of search expansions by biasing synthesis planning towards expert goals on multiple new benchmarks. DESP can make use of existing one-step retrosynthesis models, and we anticipate its performance to scale as these one-step model capabilities improve.

Submitted 8 July, 2024; originally announced July 2024.

Comments: 10 pages main, 4 figures

arXiv:2407.01649 [pdf, other]

FAFE: Immune Complex Modeling with Geodesic Distance Loss on Noisy Group Frames

Authors: Ruidong Wu, Ruihan Guo, Rui Wang, Shitong Luo, Yue Xu, Jiahan Li, Jianzhu Ma, Qiang Liu, Yunan Luo, Jian Peng

Abstract: Despite the striking success of general protein folding models such as AlphaFold2 (AF2; Jumper et al., 2021), the accurate computational modeling of antibody-antigen complexes remains a challenging task. In this paper, we first analyze AF2's primary loss function, known as the Frame Aligned Point Error (FAPE), and raise a previously overlooked issue: FAPE tends to suffer from vanishing gradients on high-rotational-error targets. To address this fundamental limitation, we propose a novel geodesic loss called Frame Aligned Frame Error (FAFE, denoted as F2E to distinguish from FAPE), which enables the model to better optimize both the rotational and translational errors between two frames. We then prove that F2E can be reformulated as a group-aware geodesic loss, which translates the optimization of the residue-to-residue error to optimizing group-to-group geodesic frame distance. By fine-tuning AF2 with our proposed new loss function, we attain a correct rate of 52.3% (DockQ > 0.23) on an evaluation set and a 43.8% correct rate on a subset with low homology, a substantial improvement over AF2 of 182% and 100%, respectively.

Submitted 1 July, 2024; originally announced July 2024.

arXiv:2404.19041 [pdf, other]

Stochastic dynamics of two-compartment models with regulatory mechanisms for hematopoiesis

Authors: Ren-Yi Wang, Marek Kimmel, Guodong Pang

Abstract: We present an asymptotic analysis of a stochastic two-compartmental cell proliferation system with regulatory mechanisms. We model the system as a state-dependent birth and death process. Proliferation of hematopoietic stem cells (HSCs) is regulated by population density of HSC-derived clones and differentiation of HSC is regulated by population density of HSCs. By scaling up the initial population, we show the density of dynamics converges in distribution to the solution of a system of ordinary differential equations (ODEs). The system of ODEs has a unique non-trivial equilibrium that is globally stable. Furthermore, we show the scaled fluctuation of the population converges in law to a linear diffusion with time-dependent coefficients. With initial data being Gaussian, the limit is a Gauss-Markov process, and it behaves like the FCLT limit under equilibrium with constant coefficients at large times. This is proved by establishing exponential convergence in the 2-Wasserstein metric for the associated Gaussian measures in a $\mathcal{L}_2$ Hilbert space. We apply our results to analyze and compare two regulatory mechanisms in the hematopoietic system. Simulations are conducted to verify our large-scale and long-time approximation of the dynamics. We demonstrate some regulatory mechanisms are efficient (converge to steady state rapidly) but not effective (have large fluctuation around the steady state).

Submitted 5 May, 2024; v1 submitted 29 April, 2024; originally announced April 2024.

arXiv:2403.02603 [pdf, other]

Drug resistance revealed by in silico deep mutational scanning and mutation tracker

Authors: Dong Chen, Gengzhuo Liu, Hongyan Du, Junjie Wee, Rui Wang, Jiahui Chen, Jana Shen, Guo-Wei Wei

Abstract: As COVID-19 enters its fifth year, it continues to pose a significant global health threat, with the constantly mutating SARS-CoV-2 virus challenging drug effectiveness. A comprehensive understanding of virus-drug interactions is essential for predicting and improving drug effectiveness, especially in combating drug resistance during the pandemic. In response, the Path Laplacian Transformer-based Prospective Analysis Framework (PLFormer-PAF) has been proposed, integrating historical data analysis and predictive modeling strategies. This dual-strategy approach utilizes path topology to transform protein-ligand complexes into topological sequences, enabling the use of advanced large language models for analyzing protein-ligand interactions and enhancing its reliability with factual insights garnered from historical data. It has shown unparalleled performance in predicting binding affinity tasks across various benchmarks, including specific evaluations related to SARS-CoV-2, and assesses the impact of virus mutations on drug efficacy, offering crucial insights into potential drug resistance. The predictions align with observed mutation patterns in SARS-CoV-2, indicating that the widespread use of the Pfizer drug has led to viral evolution and reduced drug efficacy. PLFormer-PAF's capabilities extend beyond identifying drug-resistant strains, positioning it as a key tool in drug discovery research and the development of new therapeutic strategies against fast-mutating viruses like COVID-19.

Submitted 4 March, 2024; originally announced March 2024.

arXiv:2401.12616 [pdf]

The stability and instability of the language control network: a longitudinal resting-state functional magnetic resonance imaging study

Authors: Zilong Li, Cong Liu, Xin Pan, Guosheng Ding, Ruiming Wang

Abstract: The language control network is vital among language-related networks responsible for solving the problem of multiple language switching. Researchers have expressed concerns about the instability of the language control network when exposed to external influences (e.g., long-term second language learning). However, some studies have suggested that the language control network is stable. Therefore, whether the language control network is stable or not remains unclear. In the present study, we directly evaluated the stability and instability of the language control network using resting-state functional magnetic resonance imaging (rs-fMRI). We employed cohorts of Chinese first-year college students majoring in English who underwent second language (L2) acquisition courses at a university and those who did not. Two resting-state fMRI scans were acquired approximately 1 year apart. We found that the language control network was both moderately stable and unstable. We further investigated the morphological coexistence patterns of stability and instability within the language control network. First, we extracted connections representing stability and plasticity from the entire network. We then evaluated whether the coexistence patterns were modular (stability and instability involve different brain regions) or non-modular (stability and plasticity involve the same brain regions but have unique connectivity patterns). We found that both stability and instability coexisted in a non-modular pattern. Compared with the non-English major group, the English major group had a more non-modular coexistence pattern. These findings provide preliminary evidence of the coexistence of stability and instability in the language control network.

Submitted 7 March, 2024; v1 submitted 23 January, 2024; originally announced January 2024.

arXiv:2401.09517 [pdf]

Dimensional Neuroimaging Endophenotypes: Neurobiological Representations of Disease Heterogeneity Through Machine Learning

Authors: Junhao Wen, Mathilde Antoniades, Zhijian Yang, Gyujoon Hwang, Ioanna Skampardoni, Rongguang Wang, Christos Davatzikos

Abstract: Machine learning has been increasingly used to obtain individualized neuroimaging signatures for disease diagnosis, prognosis, and response to treatment in neuropsychiatric and neurodegenerative disorders. Therefore, it has contributed to a better understanding of disease heterogeneity by identifying disease subtypes that present significant differences in various brain phenotypic measures. In this review, we first present a systematic literature overview of studies using machine learning and multimodal MRI to unravel disease heterogeneity in various neuropsychiatric and neurodegenerative disorders, including Alzheimer disease, schizophrenia, major depressive disorder, autism spectrum disorder, multiple sclerosis, as well as their potential in transdiagnostic settings. Subsequently, we summarize relevant machine learning methodologies and discuss an emerging paradigm which we call dimensional neuroimaging endophenotype (DNE). DNE dissects the neurobiological heterogeneity of neuropsychiatric and neurodegenerative disorders into a low dimensional yet informative, quantitative brain phenotypic representation, serving as a robust intermediate phenotype (i.e., endophenotype) largely reflecting underlying genetics and etiology. Finally, we discuss the potential clinical implications of the current findings and envision future research avenues.

Submitted 17 January, 2024; originally announced January 2024.

arXiv:2311.14077 [pdf, other]

RetroDiff: Retrosynthesis as Multi-stage Distribution Interpolation

Authors: Yiming Wang, Yuxuan Song, Minkai Xu, Rui Wang, Hao Zhou, Weiying Ma

Abstract: Retrosynthesis poses a fundamental challenge in biopharmaceuticals, aiming to aid chemists in finding appropriate reactant molecules and synthetic pathways given determined product molecules. With the reactant and product represented as 2D graphs, retrosynthesis constitutes a conditional graph-to-graph generative task. Inspired by the recent advancements in discrete diffusion models for graph generation, we introduce Retrosynthesis Diffusion (RetroDiff), a novel diffusion-based method designed to address this problem. However, integrating a diffusion-based graph-to-graph framework while retaining essential chemical reaction template information presents a notable challenge. Our key innovation is to develop a multi-stage diffusion process. In this method, we decompose the retrosynthesis procedure to first sample external groups from the dummy distribution given products and then generate the external bonds to connect the products and generated groups. Interestingly, such a generation process is exactly the reverse of the widely adopted semi-template retrosynthesis procedure, i.e., from reaction center identification to synthon completion, which significantly reduces the error accumulation. Experimental results on the benchmark have demonstrated the superiority of our method over all other semi-template methods.

Submitted 23 November, 2023; originally announced November 2023.

arXiv:2311.06185 [pdf, other]

An Automated Pipeline for Tumour-Infiltrating Lymphocyte Scoring in Breast Cancer

Authors: Adam J Shephard, Mostafa Jahanifar, Ruoyu Wang, Muhammad Dawood, Simon Graham, Kastytis Sidlauskas, Syed Ali Khurram, Nasir M Rajpoot, Shan E Ahmed Raza

Abstract: Tumour-infiltrating lymphocytes (TILs) are considered valuable prognostic markers in both triple-negative and human epidermal growth factor receptor 2 (HER2) positive breast cancer. In this study, we introduce an innovative deep learning pipeline based on the Efficient-UNet architecture to predict the TILs score for breast cancer whole-slide images (WSIs). We first segment tumour and stromal regions in order to compute a tumour bulk mask. We then detect TILs within the tumour-associated stroma, generating a TILs score by closely mirroring the pathologist's workflow. Our method exhibits state-of-the-art performance in segmenting tumour/stroma areas and TILs detection, as demonstrated by internal cross-validation on the TiGER Challenge training dataset and evaluation on the final leaderboards. Additionally, our TILs score proves competitive in predicting survival outcomes within the same challenge, underscoring the clinical relevance and potential of our automated TILs scoring pipeline as a breast cancer prognostic tool.

Submitted 21 November, 2023; v1 submitted 10 November, 2023; originally announced November 2023.

Comments: 5 pages, 1 figure, 2 tables

arXiv:2311.05486 [pdf, other]

Disease Gene Prioritization With Quantum Walks

Authors: Harto Saarinen, Mark Goldsmith, Rui-Sheng Wang, Joseph Loscalzo, Sabrina Maniscalco

Abstract: Disease gene prioritization assigns scores to genes or proteins according to their likely relevance for a given disease based on a provided set of seed genes. Here, we describe a new algorithm for disease gene prioritization based on continuous-time quantum walks using the adjacency matrix of a protein-protein interaction (PPI) network. Our algorithm can be seen as a quantum version of a previous method known as the diffusion kernel, but, importantly, has higher performance in predicting disease genes, and also permits the encoding of seed node self-loops into the underlying Hamiltonian, which offers yet another boost in performance. We demonstrate the success of our proposed method by comparing it to several well-known gene prioritization methods on three disease sets, across seven different PPI networks. In order to compare these methods, we use cross-validation and examine the mean reciprocal ranks and recall values. We further validate our method by performing an enrichment analysis of the predicted genes for coronary artery disease. We also investigate the impact of adding self-loops to the seeds, and argue that they allow the quantum walker to remain more local to low-degree seed nodes.

Submitted 9 November, 2023; originally announced November 2023.

Comments: 12 pages, 9 figures
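
The abstract above compares prioritization methods by mean reciprocal rank and recall. As a minimal sketch of those two metrics (not the authors' evaluation code), the Python below assumes each cross-validation fold yields a ranked gene list and one held-out gene; the function names, the gene labels, and the cutoff `k` are hypothetical.

```python
from typing import List, Set


def mean_reciprocal_rank(rankings: List[List[str]], held_out: List[str]) -> float:
    """Average of 1/rank of each held-out gene in its own ranking.

    A held-out gene that does not appear in its ranking contributes 0.
    """
    total = 0.0
    for ranked, gene in zip(rankings, held_out):
        if gene in ranked:
            total += 1.0 / (ranked.index(gene) + 1)  # ranks are 1-based
    return total / len(held_out)


def recall_at_k(ranked: List[str], true_genes: Set[str], k: int) -> float:
    """Fraction of the true disease genes found among the top-k ranked genes."""
    return len(set(ranked[:k]) & true_genes) / len(true_genes)


# Hypothetical toy data: two folds, each with its own ranking and held-out gene.
toy_rankings = [["G1", "G2", "G3"], ["G2", "G3", "G1"]]
toy_held_out = ["G1", "G3"]
mrr = mean_reciprocal_rank(toy_rankings, toy_held_out)      # (1/1 + 1/2) / 2
recall = recall_at_k(["G1", "G2", "G3"], {"G1", "G3"}, k=2)  # 1 of 2 found
```

Here the first fold ranks its held-out gene first and the second fold ranks it second, so the mean reciprocal rank is 0.75; the recall at k=2 is 0.5.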

arXiv:2310.18377 [pdf, other]

Large-scale Foundation Models and Generative AI for BigData Neuroscience

Authors: Ran Wang, Zhe Sage Chen

Abstract: Recent advances in machine learning have made revolutionary breakthroughs in computer games, image and natural language understanding, and scientific discovery. Foundation models and large-scale language models (LLMs) have recently achieved human-like intelligence thanks to BigData. With the help of self-supervised learning (SSL) and transfer learning, these models may potentially reshape the landscapes of neuroscience research and make a significant impact on the future. Here we present a mini-review on recent advances in foundation models and generative AI models as well as their applications in neuroscience, including natural language and speech, semantic memory, brain-machine interfaces (BMIs), and data augmentation. We argue that this paradigm-shift framework will open new avenues for many neuroscience research directions and discuss the accompanying challenges and opportunities.

Submitted 26 October, 2023; originally announced October 2023.

arXiv:2309.03425 [pdf, ps, other]

doi 10.1016/j.biosystems.2024.10519

Exponential Cell Division and Allometric Scaling in Metabolic Ecology

Authors: Jia-Xu Han, Zhuangdong Bai, Rui-Wu Wang

Abstract: One of the most fundamental rules in metabolic ecology is the allometric equation, a power-law scaling that describes the connection between body measurements and body size. The biological dynamics behind this essentially empirical allometric equation, however, have yet to be properly addressed at the cell level. To fill the gap between biological processes at the cell level and allometric scaling in metabolic ecology, we simply assume that cells divide in two without limitation, so that cell numbers increase exponentially during an organism's lifetime. Two synchronous exponential increases can generate a power-law scaling between body mass and an organ's weight. The power-law scaling between body mass and metabolic rate may also be obtained by substituting the weight of erythrocytes for an organ's weight. Based on the same assumption, the dynamics of cell proliferation reveal a complex exponential scaling between body mass and longevity rather than the previously reported power-law scaling. In other words, there is a quadratic relationship between longevity and the logarithm of body mass. In these relationships, all parameters can be explained by indices of cell division and the embryo.

Submitted 6 September, 2023; originally announced September 2023.
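
The claim in the abstract above that two synchronous exponential increases produce a power law can be checked with a short derivation. This is a sketch under assumed notation, not the paper's own: $M(t)$ is body mass, $W(t)$ is an organ's weight, and $M_0$, $W_0$, $a$, $b$ are hypothetical initial values and growth rates.

```latex
% Assumed notation: M_0, W_0 > 0 are initial values; a, b > 0 are growth rates.
\begin{align}
  M(t) &= M_0 e^{a t}, \qquad W(t) = W_0 e^{b t}
  && \text{(two synchronous exponentials)} \\
  t &= \frac{1}{a} \ln\frac{M}{M_0}
  && \text{(solve the first equation for } t\text{)} \\
  W &= W_0 \exp\!\left(\frac{b}{a} \ln\frac{M}{M_0}\right)
     = W_0 \left(\frac{M}{M_0}\right)^{b/a}
  && \text{(substitute into the second)} \\
  \ln W &= \frac{b}{a} \ln M + \ln W_0 - \frac{b}{a} \ln M_0
  && \text{(linear in log-log coordinates)}
\end{align}
```

Eliminating time leaves a power law in $M$ whose exponent is the ratio of the two growth rates, $b/a$.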

arXiv:2309.03093 [pdf]

doi 10.1016/j.isci.2024.109055

Effect of Feedback between Environment and Finite Population

Authors: Jia-Xu Han, Rui-Wu Wang

Abstract: Natural selection implies that any organism, including human beings, will evolve to improve its fitness advantage, and that the selected genotype or phenotype in the equilibrium state will not vary over time. However, the evolutionary process of biological organisms in reality is greatly affected by environmental change and historical accidents. In this research, we construct a co-evolutionary system to investigate the impact of species-environment feedback. For an invasive species or mutation, positive feedback is detrimental to the success of the invasion because positive feedback benefits a large number of individuals, whereas negative feedback benefits the invasion because negative feedback disadvantages a large number of individuals. In the case of a competition between two species with initially equal numbers of individuals, both positive and negative feedback will favor the species with low fitness, increasing its chances of taking over the whole population. The reason for this is that feedback allows initially inferior species to have greater fitness than initially dominating species in the early stages, emphasizing the importance of early random accidents. Our findings emphasize the significance of the evolutionary path driven by species-environment feedback.

Submitted 22 October, 2023; v1 submitted 5 September, 2023; originally announced September 2023.

arXiv:2308.06920 [pdf, other]

ChatGPT in Drug Discovery: A Case Study on Anti-Cocaine Addiction Drug Development with Chatbots

Authors: Rui Wang, Hongsong Feng, Guo-Wei Wei

Abstract: The birth of ChatGPT, a cutting-edge language model-based chatbot developed by OpenAI, ushered in a new era in AI. However, due to potential pitfalls, its role in rigorous scientific research is not clear yet. This paper vividly showcases its innovative application within the field of drug discovery. Focused specifically on developing anti-cocaine addiction drugs, the study employs GPT-4 as a virtual guide, offering strategic and methodological insights to researchers working on generative models for drug candidates. The primary objective is to generate optimal drug-like molecules with desired properties. By leveraging the capabilities of ChatGPT, the study introduces a novel approach to the drug discovery process. This symbiotic partnership between AI and researchers transforms how drug development is approached. Chatbots become facilitators, steering researchers towards innovative methodologies and productive paths for creating effective drug candidates. This research sheds light on the collaborative synergy between human expertise and AI assistance, wherein ChatGPT's cognitive abilities enhance the design and development of potential pharmaceutical solutions. This paper not only explores the integration of advanced AI in drug discovery but also reimagines the landscape by advocating for AI-powered chatbots as trailblazers in revolutionizing therapeutic innovation.

Submitted 19 October, 2023; v1 submitted 13 August, 2023; originally announced August 2023.

arXiv:2308.03175 [pdf, other]

Adapting Machine Learning Diagnostic Models to New Populations Using a Small Amount of Data: Results from Clinical Neuroscience

Authors: Rongguang Wang, Guray Erus, Pratik Chaudhari, Christos Davatzikos

Abstract: Machine learning (ML) has shown great promise for revolutionizing a number of areas, including healthcare. However, it is also facing a reproducibility crisis, especially in medicine. ML models that are carefully constructed from and evaluated on a training set might not generalize well on data from different patient populations or acquisition instrument settings and protocols. We tackle this problem in the context of neuroimaging of Alzheimer's disease (AD), schizophrenia (SZ) and brain aging. We develop a weighted empirical risk minimization approach that optimally combines data from a source group (e.g., subjects stratified by attributes such as sex, age group, race and clinical cohort) to make predictions on a target group (e.g., the other sex or another age group) using a small fraction (10%) of data from the target group. We apply this method to multi-source data of 15,363 individuals from 20 neuroimaging studies to build ML models for diagnosis of AD and SZ, and estimation of brain age. We found that this approach achieves substantially better accuracy than existing domain adaptation techniques: it obtains area under curve greater than 0.95 for AD classification, area under curve greater than 0.7 for SZ classification and mean absolute error less than 5 years for brain age prediction on all target groups, achieving robustness to variations of scanners, protocols, and demographic or clinical characteristics. In some cases, it is even better than training on all data from the target group, because it leverages the diversity and size of a larger training set. We also demonstrate the utility of our models for prognostic tasks such as predicting disease progression in individuals with mild cognitive impairment. Critically, our brain age prediction models lead to new clinical insights regarding correlations with neurophysiological tests.
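The weighted empirical risk minimization idea in the abstract above can be illustrated with a minimal sketch. This is not the authors' implementation: the mixing weight `alpha`, the logistic-regression model, the plain gradient-descent fit, and the synthetic two-feature data are all assumptions made for illustration. The objective shown is (1 - alpha) times the mean source loss plus alpha times the mean loss on the small target sample.

```python
import numpy as np

def weighted_erm_logistic(Xs, ys, Xt, yt, alpha=0.5, lr=0.5, steps=2000):
    """Fit logistic regression on (1 - alpha) * mean source loss + alpha * mean target loss.

    alpha is a hypothetical mixing weight; the abstract does not say how the
    paper chooses its weights.
    """
    X = np.vstack([Xs, Xt])
    y = np.concatenate([ys, yt]).astype(float)
    # Per-sample weights: each group's weights sum to its share of the objective.
    w = np.concatenate([
        np.full(len(ys), (1.0 - alpha) / len(ys)),
        np.full(len(yt), alpha / len(yt)),
    ])
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ theta)))   # predicted probabilities
        theta -= lr * (X.T @ (w * (p - y)))      # gradient of the weighted log-loss
    return theta

def make_group(rng, n, shift):
    """Synthetic group: two shifted Gaussian features plus a bias column."""
    X = rng.normal(size=(n, 2)) + shift
    y = (X[:, 0] + X[:, 1] > 1.0).astype(int)    # same labeling rule in every group
    return np.hstack([X, np.ones((n, 1))]), y

rng = np.random.default_rng(0)
Xs, ys = make_group(rng, 1000, shift=0.0)        # large source group
Xt_all, yt_all = make_group(rng, 500, shift=1.0) # target group with covariate shift
Xt, yt = Xt_all[:50], yt_all[:50]                # 10% of the target data for fitting
theta = weighted_erm_logistic(Xs, ys, Xt, yt, alpha=0.5)
pred = (Xt_all[50:] @ theta > 0).astype(int)
acc = float(np.mean(pred == yt_all[50:]))        # accuracy on held-out target data
```

Setting `alpha` to 0 recovers source-only training and setting it to 1 recovers training on the small target sample alone; intermediate values trade off the two.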

Submitted 6 August, 2023; originally announced August 2023.

arXiv:2307.00751 [pdf, other]

Population Age Group Sensitivity for COVID-19 Infections with Deep Learning

Authors: Md Khairul Islam, Tyler Valentine, Royal Wang, Levi Davis, Matt Manner, Judy Fox

Abstract: The COVID-19 pandemic has created unprecedented challenges for governments and healthcare systems worldwide, highlighting the critical importance of understanding the factors that contribute to virus transmission. This study aimed to identify the most influential age groups in COVID-19 infection rates at the US county level using the Modified Morris Method and deep learning for time series. Our approach involved training the state-of-the-art time-series model Temporal Fusion Transformer on different age groups as a static feature and the population vaccination status as the dynamic feature. We analyzed the impact of those age groups on COVID-19 infection rates by perturbing individual input features and ranked them based on their Morris sensitivity scores, which quantify their contribution to COVID-19 transmission rates. The findings are verified using ground truth data from the CDC and US Census, which provide the true infection rates for each age group. The results suggest that young adults were the most influential age group in COVID-19 transmission at the county level between March 1, 2020, and November 27, 2021. These results can inform public health policies and interventions, such as targeted vaccination strategies, to better control the spread of the virus. Our approach demonstrates the utility of feature sensitivity analysis in identifying critical factors contributing to COVID-19 transmission and can be applied in other public health domains.
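The perturb-and-rank procedure described in the abstract above can be sketched with a basic elementary-effects calculation. This is an illustrative simplification, not the paper's Modified Morris Method: the function `morris_scores`, the step size `delta`, and the linear `toy_model` standing in for a trained Temporal Fusion Transformer are all hypothetical.

```python
import numpy as np

def morris_scores(model, X, delta=0.1):
    """Mean absolute change in model output per unit perturbation of each feature."""
    base = model(X)
    scores = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        Xp = X.copy()
        Xp[:, j] += delta                         # perturb one feature at a time
        scores[j] = np.mean(np.abs(model(Xp) - base)) / delta
    return scores

def toy_model(X):
    # Hypothetical stand-in for a trained predictor: feature 0 matters most,
    # feature 1 not at all.
    return 3.0 * X[:, 0] + 0.0 * X[:, 1] - 1.0 * X[:, 2]

rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 3))
scores = morris_scores(toy_model, X)
ranking = np.argsort(-scores)                     # most to least influential feature
```

For a linear model the score of each feature equals the absolute value of its coefficient, which makes the toy case easy to check by hand; for a nonlinear model the score depends on where in the input space the perturbations are applied.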

Submitted 3 July, 2023; originally announced July 2023.

arXiv:2306.07484 [pdf, other]

Multi-objective Molecular Optimization for Opioid Use Disorder Treatment Using Generative Network Complex

Authors: Hongsong Feng, Rui Wang, Chang-Guo Zhan, Guo-Wei Wei

Abstract: Opioid Use Disorder (OUD) has emerged as a significant global public health issue, with complex multifaceted conditions. Due to the lack of effective treatment options for various conditions, there is a pressing need for the discovery of new medications. In this study, we propose a deep generative model that combines a stochastic differential equation (SDE)-based diffusion modeling with the latent space of a pretrained autoencoder model. The molecular generator enables efficient generation of molecules that are effective on multiple targets, specifically the mu, kappa, and delta opioid receptors. Furthermore, we assess the ADMET (absorption, distribution, metabolism, excretion, and toxicity) properties of the generated molecules to identify drug-like compounds. To enhance the pharmacokinetic properties of some lead compounds, we employ a molecular optimization approach. We obtain a diverse set of drug-like molecules. We construct binding affinity predictors by integrating molecular fingerprints derived from autoencoder embeddings, transformer embeddings, and topological Laplacians with advanced machine learning algorithms. Further experimental studies are needed to evaluate the pharmacological effects of these drug-like compounds for OUD treatment. Our machine learning platform serves as a valuable tool in designing and optimizing effective molecules for addressing OUD.

Submitted 12 June, 2023; originally announced June 2023.

arXiv:2305.15453 [pdf]

Drugst.One -- A plug-and-play solution for online systems medicine and network-based drug repurposing

Authors: Andreas Maier, Michael Hartung, Mark Abovsky, Klaudia Adamowicz, Gary D. Bader, Sylvie Baier, David B. Blumenthal, Jing Chen, Maria L. Elkjaer, Carlos Garcia-Hernandez, Mohamed Helmy, Markus Hoffmann, Igor Jurisica, Max Kotlyar, Olga Lazareva, Hagai Levi, Markus List, Sebastian Lobentanzer, Joseph Loscalzo, Noel Malod-Dognin, Quirin Manz, Julian Matschinske, Miles Mee, Mhaned Oubounyt, Alexander R. Pico , et al. (14 additional authors not shown)

Abstract: In recent decades, the development of new drugs has become increasingly expensive and inefficient, and the molecular mechanisms of most pharmaceuticals remain poorly understood. In response, computational systems and network medicine tools have emerged to identify potential drug repurposing candidates. However, these tools often require complex installation and lack intuitive visual network mining capabilities. To tackle these challenges, we introduce Drugst.One, a platform that assists specialized computational medicine tools in becoming user-friendly, web-based utilities for drug repurposing. With just three lines of code, Drugst.One turns any systems biology software into an interactive web tool for modeling and analyzing complex protein-drug-disease networks. Demonstrating its broad adaptability, Drugst.One has been successfully integrated with 21 computational systems medicine tools. Available at https://drugst.one, Drugst.One has significant potential for streamlining the drug discovery process, allowing researchers to focus on essential aspects of pharmaceutical treatment research.

Submitted 4 July, 2023; v1 submitted 24 May, 2023; originally announced May 2023.

Comments: 45 pages, 6 figures, 7 tables

arXiv:2302.08695 [pdf, other]

A Countable-Type Branching Process Model for the Tug-of-War Cancer Cell Dynamics

Authors: Ren-Yi Wang, Marek Kimmel

Abstract: We consider a time-continuous Markov branching process of proliferating cells with a countable collection of types. Among-type transitions are inspired by the Tug-of-War process introduced in McFarland et al. as a mathematical model for competition of advantageous driver mutations and deleterious passenger mutations in cancer cells. We introduce a version of the model in which a driver mutation pushes the type of the cell $L$ units up, while a passenger mutation pulls it $1$ unit down. The distribution of time to division depends on the type (fitness) of the cell, which is an integer. The extinction probability given any initial cell type is strictly less than $1$, which allows us to investigate the transition between types (type transition) in an infinitely long cell lineage. The analysis leads to the result that under driver dominance, the type transition process escapes to infinity, while under passenger dominance, it leads to a limit distribution. Implications in cancer cell dynamics and population genetics are discussed.

Submitted 4 February, 2024; v1 submitted 16 February, 2023; originally announced February 2023.

arXiv:2212.01575 [pdf]

Multi-view deep learning based molecule design and structural optimization accelerates the SARS-CoV-2 inhibitor discovery

Authors: Chao Pang, Yu Wang, Yi Jiang, Ruheng Wang, Ran Su, Leyi Wei

Abstract: In this work, we propose MEDICO, a Multi-viEw Deep generative model for molecule generation, structural optimization, and the SARS-CoV-2 Inhibitor disCOvery. To the best of our knowledge, MEDICO is the first-of-its-kind graph generative model that can generate molecular graphs similar to the structure of targeted molecules, with a multi-view representation learning framework to sufficiently and adaptively learn comprehensive structural semantics from targeted molecular topology and geometry. We show that MEDICO significantly outperforms the state-of-the-art methods in generating valid, unique, and novel molecules under benchmarking comparisons. In particular, we show that the multi-view deep learning model enables us to generate not only molecules structurally similar to the targeted molecules but also molecules with desired chemical properties, demonstrating the strong capability of our model in exploring the chemical space deeply. Moreover, case study results on targeted molecule generation for the SARS-CoV-2 main protease (Mpro) show that by integrating molecule docking into our model as a chemical prior, we successfully generate new small molecules with desired drug-like properties for the Mpro, potentially accelerating the de novo design of Covid-19 drugs. Further, we apply MEDICO to the structural optimization of three well-known Mpro inhibitors (N3, 11a, and GC376) and achieve ~88% improvement in their binding affinity to Mpro, demonstrating the application value of our model for the development of therapeutics for SARS-CoV-2 infection.

Submitted 3 December, 2022; originally announced December 2022.

arXiv:2210.16640 [pdf]

doi 10.1109/JBHI.2020.3002805

2D and 3D CT Radiomic Features Performance Comparison in Characterization of Gastric Cancer: A Multi-center Study

Authors: Lingwei Meng, Di Dong, Xin Chen, Mengjie Fang, Rongpin Wang, Jing Li, Zaiyi Liu, Jie Tian

Abstract: Objective: Radiomics, an emerging tool for medical image analysis, has the potential to precisely characterize gastric cancer (GC). Whether to use one-slice 2D annotation or whole-volume 3D annotation remains a long-standing debate, especially for heterogeneous GC. We comprehensively compared 2D and 3D radiomic features' representation and discrimination capacity regarding GC, via three tasks. Methods: A total of 539 GC patients from four centers were retrospectively enrolled and divided into the training and validation cohorts. From 2D or 3D regions of interest (ROIs) annotated by radiologists, radiomic features were extracted respectively. Feature selection and model construction procedures were customized for each combination of two modalities (2D or 3D) and three tasks. Subsequently, six machine learning models (Model_2D^LNM, Model_3D^LNM; Model_2D^LVI, Model_3D^LVI; Model_2D^pT, Model_3D^pT) were derived and evaluated to reflect modalities' performances in characterizing GC. Furthermore, we performed an auxiliary experiment to assess modalities' performances when resampling spacing is different. Results: Regarding three tasks, the yielded areas under the curve (AUCs) were: Model_2D^LNM's 0.712 (95% confidence interval, 0.613-0.811), Model_3D^LNM's 0.680 (0.584-0.775); Model_2D^LVI's 0.677 (0.595-0.761), Model_3D^LVI's 0.615 (0.528-0.703); Model_2D^pT's 0.840 (0.779-0.901), Model_3D^pT's 0.813 (0.747-0.879). Moreover, the auxiliary experiment indicated that Models_2D are statistically more advantageous than Models_3D with different resampling spacings. Conclusion: Models constructed with 2D radiomic features revealed comparable performances with those constructed with 3D features in characterizing GC. Significance: Our work indicated that time-saving 2D annotation would be the better choice in GC, and provided a related reference for further radiomics-based research.

Submitted 29 October, 2022; originally announced October 2022.

Comments: Published in IEEE Journal of Biomedical and Health Informatics

Journal ref: IEEE.J.Biomed.Health.Inf. 25 (2021) 755-763

arXiv:2210.09485 [pdf, other]

Emerging dominant SARS-CoV-2 variants

Authors: Jiahui Chen, Rui Wang, Yuta Hozumi, Gengzhuo Liu, Yuchi Qiu, Xiaoqi Wei, Guo-Wei Wei

Abstract: Accurate and reliable forecasting of emerging dominant severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) variants enables policymakers and vaccine makers to get prepared for future waves of infections. The last three waves of SARS-CoV-2 infections caused by dominant variants Omicron (BA.1), BA.2, and BA.4/BA.5 were accurately foretold by our artificial intelligence (AI) models built with biophysics, genotyping of viral genomes, experimental data, algebraic topology, and deep learning. Based on newly available experimental data, we analyzed the impacts of all possible viral spike (S) protein receptor-binding domain (RBD) mutations on the SARS-CoV-2 infectivity. Our analysis sheds light on viral evolutionary mechanisms, i.e., natural selection through infectivity strengthening and antibody resistance. We forecast that BA.2.10.4, BA.2.75, BQ.1.1, and particularly, BA.2.75+R346T, have high potential to become new dominant variants to drive the next surge.

Submitted 17 October, 2022; originally announced October 2022.

arXiv:2208.05228 [pdf]

Current and perspective sensing methods for monkeypox virus: a reemerging zoonosis in its infancy

Authors: Ijaz Gul, Changyue Liu, Yuan Xi, Zhicheng Du, Shiyao Zhai, Zhengyang Lei, Chen Qun, Muhammad Akmal Raheem, Qian He, Zhang Haihui, Canyang Zhang, Runming Wang, Sanyang Han, Du Ke, Peiwu Qin

Abstract: Objectives: This review evaluates the current monkeypox virus (MPXV) detection methods, discusses their pros and cons, and provides recommended solutions to the problems. Methods: The literature for this review was identified through searches in PubMed, Web of Science, Google Scholar, ResearchGate, and Science Direct advanced search for articles published in English without any start date until June 2022, by use of the terms "monkeypox virus" or "poxvirus" along with "diagnosis"; "PCR"; "real-time PCR"; "LAMP"; "RPA"; "immunoassay"; "reemergence"; "biothreat"; "endemic", and "multi-country outbreak", and also by tracking citations of the relevant papers. The most relevant articles are included in the review. Results: Our literature review shows that PCR is the gold standard method for MPXV detection. In addition, loop-mediated isothermal amplification (LAMP) and recombinase polymerase amplification (RPA) have been reported as alternatives to PCR. Immunodiagnostics, whole particle detection, and image-based detection are the non-nucleic acid-based MPXV detection modalities. Conclusions: PCR is easy to leverage and adapt for a quick response to an outbreak, but the PCR-based MPXV detection approaches may not be suitable for marginalized settings. Limited progress has been made towards innovations in MPXV diagnostics, providing room for the development of novel detection techniques for this virus.

Submitted 10 August, 2022; originally announced August 2022.

Comments: 36 pages, 5 figures, 1 table

arXiv:2205.00532 [pdf, other]

Persistent Laplacian projected Omicron BA.4 and BA.5 to become new dominating variants

Authors: Jiahui Chen, Yuchi Qiu, Rui Wang, Guo-Wei Wei

Abstract: Due to its high transmissibility, Omicron BA.1 ousted the Delta variant to become a dominating variant in late 2021 and was replaced by more transmissible Omicron BA.2 in March 2022. An important question is which new variants will dominate in the future. Topology-based deep learning models have had tremendous success in forecasting emerging variants in the past. However, topology is insensitive to homotopic shape variations in virus-human protein-protein binding, which are crucial to viral evolution and transmission. This challenge is tackled with persistent Laplacian, which is able to capture both the topology and shape of data. Persistent Laplacian-based deep learning models are developed to systematically evaluate variant infectivity. Our comparative analysis of Alpha, Beta, Gamma, Delta, Lambda, Mu, and Omicron BA.1, BA.1.1, BA.2, BA.2.11, BA.2.12.1, BA.3, BA.4, and BA.5 unveils that Omicron BA.2.11, BA.2.12.1, BA.3, BA.4, and BA.5 are more contagious than BA.2. In particular, BA.4 and BA.5 are about 36% more infectious than BA.2 and are projected to become new dominating variants by natural selection. Moreover, the proposed models outperform the state-of-the-art methods on three major benchmark datasets for mutation-induced protein-protein binding free energy changes.

Submitted 1 May, 2022; originally announced May 2022.

arXiv:2204.09119 [pdf]

doi 10.1038/s41377-022-01004-2

Optically-generated focused ultrasound for noninvasive brain stimulation with ultrahigh precision

Authors: Yueming Li, Ying Jiang, Lu Lan, Xiaowei Ge, Ran Cheng, Yuewei Zhan, Guo Chen, Linli Shi, Runyu Wang, Nan Zheng, Chen Yang, Ji-Xin Cheng

Abstract: High precision neuromodulation is a powerful tool to decipher neurocircuits and treat neurological diseases. Current non-invasive neuromodulation methods offer limited precision at the millimeter level. Here, we report optically-generated focused ultrasound (OFUS) for non-invasive brain stimulation with ultrahigh precision. OFUS is generated by a soft optoacoustic pad (SOAP) fabricated through embedding candle soot nanoparticles in a curved polydimethylsiloxane film. SOAP generates a transcranial ultrasound focus at 15 MHz with an ultrahigh lateral resolution of 83 µm, which is two orders of magnitude smaller than that of conventional transcranial-focused ultrasound (tFUS). Here, we show effective OFUS neurostimulation in vitro with a single ultrasound cycle. We demonstrate submillimeter transcranial stimulation of the mouse motor cortex in vivo. An acoustic energy of 0.6 mJ/cm^2, four orders of magnitude less than that of tFUS, is sufficient for successful OFUS neurostimulation. OFUS offers new capabilities for neuroscience studies and disease treatments by delivering a focus with ultrahigh precision non-invasively.

Submitted 3 November, 2022; v1 submitted 19 April, 2022; originally announced April 2022.

Comments: 36 pages, 5 main figures, 13 supplementary figures

Journal ref: Light Sci Appl 11, 321 (2022)

arXiv:2202.03632 [pdf, other]

doi 10.34133/research.0153

ECRECer: Enzyme Commission Number Recommendation and Benchmarking based on Multiagent Dual-core Learning

Authors: Zhenkun Shi, Qianqian Yuan, Ruoyu Wang, Hoaran Li, Xiaoping Liao, Hongwu Ma

Abstract: Enzyme Commission (EC) numbers, which associate a protein sequence with the biochemical reactions it catalyzes, are essential for the accurate understanding of enzyme functions and cellular metabolism. Many ab-initio computational approaches were proposed to predict EC numbers for given input sequences directly. However, the prediction performance (accuracy, recall, precision), usability, and efficiency of existing methods still have much room to be improved. Here, we report ECRECer, a cloud platform for accurately predicting EC numbers based on novel deep learning techniques. To build ECRECer, we evaluate different protein representation methods and adopt a protein language model for protein sequence embedding. After embedding, we propose a multi-agent hierarchy deep learning-based framework to learn the proposed tasks in a multi-task manner. Specifically, we used an extreme multi-label classifier to perform the EC prediction and employed a greedy strategy to integrate and fine-tune the final model. Comparative analyses against four representative methods demonstrate that ECRECer delivers the highest performance, improving accuracy and F1 score by 70% and 20% over the state-of-the-art, respectively. With ECRECer, we can annotate numerous enzymes in the Swiss-Prot database with incomplete EC numbers to their full fourth level. Taking UniProt protein "A0A0U5GJ41" (1.14.-.-) as an example, ECRECer annotated it with "1.14.11.38", which is supported by further protein structure analysis based on AlphaFold2. Finally, we established a webserver (https://ecrecer.biodesign.ac.cn) and provided an offline bundle to improve usability.

Submitted 7 February, 2022; originally announced February 2022.

Comments: 16 pages, 14 figures

Report number: research.0153; MSC Class: I.2.6

Journal ref: Research. 2023:6;0153

arXiv:2112.03266 [pdf, other]

Contrastive Cycle Adversarial Autoencoders for Single-cell Multi-omics Alignment and Integration

Authors: Xuesong Wang, Zhihang Hu, Tingyang Yu, Ruijie Wang, Yumeng Wei, Juan Shu, Jianzhu Ma, Yu Li

Abstract: Multi-modality data are ubiquitous in biology, especially now that we have entered the multi-omics era, when we can measure the same biological object (cell) from different aspects (omics) to provide a more comprehensive insight into the cellular system. When dealing with such multi-omics data, the first step is to determine the correspondence among different modalities. In other words, we should match data from different spaces corresponding to the same object. This problem is particularly challenging in the single-cell multi-omics scenario because such data are very sparse with extremely high dimensions. Secondly, matched single-cell multi-omics data are rare and hard to collect. Furthermore, due to the limitations of the experimental environment, the data are usually highly noisy. To promote single-cell multi-omics research, we overcome the above challenges, proposing a novel framework to align and integrate single-cell RNA-seq data and single-cell ATAC-seq data. Our approach can efficiently map the above data with high sparsity and noise from different spaces to a low-dimensional manifold in a unified space, making the downstream alignment and integration straightforward. Compared with the other state-of-the-art methods, our method performs better in both simulated and real single-cell data. The proposed method is helpful for single-cell multi-omics research. The improvement for integration on the simulated data is significant.

Submitted 13 December, 2021; v1 submitted 5 December, 2021; originally announced December 2021.

arXiv:2112.01318 [pdf, other]

Omicron (B.1.1.529): Infectivity, vaccine breakthrough, and antibody resistance

Authors: Jiahui Chen, Rui Wang, Nancy Benovich Gilby, Guo-Wei Wei

Abstract: The latest severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) variant Omicron (B.1.1.529) has triggered panic responses around the world due to its contagious and vaccine-escape mutations. The essential infectivity and antibody resistance of the SARS-CoV-2 variant are determined by its mutations on the spike (S) protein receptor-binding domain (RBD). However, a complete experimental evaluation of Omicron might take weeks or even months. Here, we present a comprehensive quantitative analysis of Omicron's infectivity, vaccine breakthrough, and antibody resistance. An artificial intelligence (AI) model, which has been trained with tens of thousands of experimental data points and extensively validated by experimental data on SARS-CoV-2, reveals that Omicron may be over ten times more contagious than the original virus or about twice as infectious as the Delta variant. Based on 132 three-dimensional (3D) structures of antibody-RBD complexes, we unveil that Omicron may be twice as likely to escape current vaccines as the Delta variant. The Food and Drug Administration (FDA)-approved monoclonal antibodies (mAbs) from Eli Lilly may be seriously compromised. Omicron may also diminish the efficacy of mAbs from Celltrion and Rockefeller University. However, its impact on the Regeneron mAb cocktail appears to be mild.

Submitted 1 December, 2021; originally announced December 2021.

arXiv:2110.04626 [pdf, other]

The evolution of the mechanisms of SARS-CoV-2 evolution revealing vaccine-resistant mutations in Europe and America

Authors: Rui Wang, Jiahui Chen, Guo-Wei Wei

Abstract: The importance of understanding SARS-CoV-2 evolution cannot be overemphasized. Recent studies confirm that natural selection is the dominating mechanism of SARS-CoV-2 evolution, which favors mutations that strengthen viral infectivity. We demonstrate that vaccine-breakthrough or antibody-resistant mutations provide a new mechanism of viral evolution. Specifically, vaccine-resistant mutation Y449S in the spike (S) protein receptor-binding domain (RBD), which occurred in co-mutation [Y449S, N501Y], has reduced infectivity compared to the original SARS-CoV-2 but can disrupt existing antibodies that neutralize the virus. By tracing the evolutionary trajectories of vaccine-resistant mutations in over 1.9 million SARS-CoV-2 genomes, we reveal that the occurrence and frequency of vaccine-resistant mutations correlate strongly with the vaccination rates in Europe and America. We anticipate that as a complementary transmission pathway, vaccine-resistant mutations will become a dominating mechanism of SARS-CoV-2 evolution when most of the world's population is vaccinated. Our study sheds light on SARS-CoV-2 evolution and transmission and enables the design of the next-generation mutation-proof vaccines and antibody drugs.

Submitted 9 October, 2021; originally announced October 2021.

Comments: 11 pages, 4 figures

arXiv:2109.08148 [pdf, other]

Review of the mechanisms of SARS-CoV-2 evolution and transmission

Authors: Jiahui Chen, Rui Wang, Guo-Wei Wei

Abstract: The mechanism of SARS-CoV-2 evolution and transmission is elusive and its understanding, a prerequisite to forecast emerging variants, is of paramount importance. SARS-CoV-2 evolution is driven by the mechanisms at molecular and organism scales and regulated by the transmission pathways at the population scale. In this review, we show that infectivity-based natural selection was discovered as the mechanism for SARS-CoV-2 evolution and transmission in July 2020. In April 2021, we proved beyond all doubt that such a natural selection via infectivity-based transmission pathway remained the sole mechanism for SARS-CoV-2 evolution. However, we reveal that antibody-disruptive co-mutations [Y449S, N501Y] debuted as a new vaccine-resistant transmission pathway of viral evolution in highly vaccinated populations a few months ago. Over one year ago, we foresaw that mutations on spike protein RBD residues 452 and 501 would "have high chances to mutate into significantly more infectious COVID-19 strains". Mutations on these residues underpin prevailing SARS-CoV-2 variants Alpha, Beta, Gamma, Delta, Epsilon, Theta, Kappa, Lambda, and Mu at present and are expected to be vital to emerging variants. We anticipate that viral evolution will combine RBD co-mutations at these two sites, creating future variants that are tens of times more infectious than the original SARS-CoV-2. Additionally, two complementary transmission pathways of viral evolution, infectivity-based and vaccine-resistant, will prolong our battle with COVID-19 for years. We predict that RBD co-mutations [A411S, L452R, T478K], [L452R, T478K, N501Y], [L452R, T478K, E484K, N501Y], [K417N, L452R, T478K], and [P384L, K417N, E484K, N501Y] will have high chances to grow into dominating variants due to their high infectivity and/or strong ability to break through current vaccines, calling for the development of new vaccines and antibody therapies.

Submitted 15 September, 2021; originally announced September 2021.

Comments: 15 pages, 10 figures

arXiv:2109.04509 [pdf, other]

Emerging vaccine-breakthrough SARS-CoV-2 variants

Authors: Rui Wang, Jiahui Chen, Yuta Hozumi, Changchuan Yin, Guo-Wei Wei

Abstract: The recent global surge in COVID-19 infections has been fueled by new SARS-CoV-2 variants, namely Alpha, Beta, Gamma, Delta, etc. The molecular mechanism underlying such surge is elusive due to 4,653 non-degenerate mutations on the spike protein, which is the target of most COVID-19 vaccines. The understanding of the molecular mechanism of transmission and evolution is a prerequisite to foresee the trend of emerging vaccine-breakthrough variants and the design of mutation-proof vaccines and monoclonal antibodies. We integrate the genotyping of 1,489,884 SARS-CoV-2 genome isolates, 130 human antibodies, tens of thousands of mutational data points, topological data analysis, and deep learning to reveal SARS-CoV-2 evolution mechanism and forecast emerging vaccine-escape variants. We show that infectivity-strengthening and antibody-disruptive co-mutations on the S protein RBD can quantitatively explain the infectivity and virulence of all prevailing variants. We demonstrate that Lambda is as infectious as Delta but is more vaccine-resistant. We analyze emerging vaccine-breakthrough co-mutations in 20 countries, including the United Kingdom, the United States, Denmark, Brazil, and Germany. We envision that natural selection through infectivity will continue to be the main mechanism for viral evolution among unvaccinated populations, while antibody-disruptive co-mutations will fuel the future growth of vaccine-breakthrough variants among fully vaccinated populations. Finally, we have identified the co-mutations that have a great likelihood of becoming dominant: [A411S, L452R, T478K], [L452R, T478K, N501Y], [V401L, L452R, T478K], [K417N, L452R, T478K], [L452R, T478K, E484K, N501Y], and [P384L, K417N, E484K, N501Y]. We predict they, particularly the last four, will break through existing vaccines. We foresee an urgent need to develop new vaccines that target these co-mutations.

Submitted 9 September, 2021; originally announced September 2021.

Comments: 15 pages, 5 figures

arXiv:2107.13219 [pdf]

doi 10.1016/j.isci.2022.104673

Lifespan associations of resting-state brain functional networks with ADHD symptoms

Authors: Rong Wang, Yongchen Fan, Ying Wu, Yu-Feng Zang, Changsong Zhou

Abstract: Attention-deficit/hyperactivity disorder (ADHD) is increasingly being diagnosed in both children and adults, but the neural mechanisms that underlie its distinct symptoms and whether children and adults share the same mechanism remain poorly understood. Here, we used a nested-spectral partition (NSP) approach to study the resting-state brain functional networks of ADHD patients (n=97) and healthy controls (HCs, n=97) across the lifespan (7-50 years). Compared to the linear lifespan associations of brain functional segregation and integration with age in HCs, ADHD patients have a quadratic association in the whole brain and in most functional systems, whereas the limbic system dominantly affected by ADHD has a linear association. Furthermore, the limbic system better predicts hyperactivity, and the salient attention system better predicts inattention. These predictions are shared in children and adults with ADHD. Our findings reveal a lifespan association of brain networks with ADHD symptoms and provide potential shared neural bases of distinct ADHD symptoms in children and adults.

Submitted 22 April, 2022; v1 submitted 28 July, 2021; originally announced July 2021.

Comments: 28 pages, 4 figures

Journal ref: 2022

arXiv:2104.04188 [pdf]

doi 10.1021/acsami.1c00495

Ammonia-induced Calcium Phosphate Nanostructure: A Potential Assay for Studying Osteoporosis and Bone Metastasis

Authors: Sijia Chen, Qiong Wang, Felipe Eltit, Yubin Guo, Michael Cox, Rizhi Wang

Abstract: Osteoclastic resorption of bone plays a central role in both osteoporosis and bone metastasis. A reliable in vitro assay that simulates osteoclastic resorption in vivo would significantly speed up the process of developing effective therapeutic solutions for those diseases. Here we reported the development of a novel and robust nano-structured calcium phosphate coating with unique functions on the track-etched porous membrane by using an ammonia-induced mineralization (AiM) technique. The calcium phosphate coating uniformly covers one side of the PET membrane, enabling testing for osteoclastic resorption. The track-etched pores in the PET membrane allow calcium phosphate mineral pins to grow inside, which, on one hand, enhances coating integration with the membrane substrate, and on the other hand provides diffusion channels for delivering drugs from the lower chamber of a double-chamber cell culture system. The application of the processed calcium phosphate coating was first demonstrated as a drug screening device by using alendronate, a widely used drug for osteoporosis. It was confirmed that the delivery of alendronate significantly decreased both the number of monocyte-differentiated osteoclasts and coating resorption. To demonstrate the application in studying bone metastasis, we delivered PC3 prostate cancer conditioned medium and confirmed that both the differentiation of monocytes into osteoclasts and the osteoclastic resorption of the calcium phosphate coating were significantly enhanced. This novel assay thus provides a new platform for studying osteoclastic activities and assessing drug efficacy in vitro.

Submitted 9 April, 2021; originally announced April 2021.

Journal ref: ACS Applied Materials & Interfaces Manuscript ID: am-2021-004953.R2

arXiv:2103.08023 [pdf, other]

Vaccine-escape and fast-growing mutations in the United Kingdom, the United States, Singapore, Spain, South Africa, and other COVID-19-devastated countries

Authors: Rui Wang, Jiahui Chen, Kaifu Gao, Guo-Wei Wei

Abstract: Recently, the SARS-CoV-2 variants from the United Kingdom (UK), South Africa, and Brazil have received much attention for their increased infectivity, potentially high virulence, and possible threats to existing vaccines and antibody therapies. The question remains if there are other more infectious variants transmitted around the world. We carry out a large-scale study of 252,874 SARS-CoV-2 genome isolates from patients to identify many other rapidly growing mutations on the spike (S) protein receptor-binding domain (RBD). We reveal that 88 out of 95 significant mutations that were observed more than 10 times strengthen the binding between the RBD and the host angiotensin-converting enzyme 2 (ACE2), indicating the virus evolves toward more infectious variants. In particular, we discover new fast-growing RBD mutations N439K, L452R, S477N, S477R, and N501T that also enhance the RBD and ACE2 binding. We further unveil that mutation N501Y involved in United Kingdom (UK), South Africa, and Brazil variants may moderately weaken the binding between the RBD and many known antibodies, while mutations E484K and K417N found in South Africa and Brazilian variants can potentially disrupt the binding between the RBD and many known antibodies. Among three newly identified fast-growing RBD mutations, L452R, which is now known as part of the California variant B.1.427, and N501T are able to effectively weaken the binding of many known antibodies with the RBD. Finally, we hypothesize that RBD mutations that can simultaneously make SARS-CoV-2 more infectious and disrupt the existing antibodies, called vaccine escape mutations, will pose an imminent threat to the current crop of vaccines. A list of most likely vaccine escape mutations is given, including N501Y, L452R, E484K, N501T, S494P, and K417N.

Submitted 21 March, 2021; v1 submitted 14 March, 2021; originally announced March 2021.

Comments: 20 pages, 13 figures

arXiv:2103.00475 [pdf]

doi 10.1073/pnas.2022288118

Segregation, integration and balance of large-scale resting brain networks configure different cognitive abilities

Authors: Rong Wang, Mianxin Liu, Xinhong Cheng, Ying Wu, Andrea Hildebrandt, Changsong Zhou

Abstract: Diverse cognitive processes set different demands on locally segregated and globally integrated brain activity. However, it remains unclear how resting brains configure their functional organization to balance the demands on network segregation and integration to best serve cognition. Here, we use an eigenmode-based approach to identify hierarchical modules in functional brain networks, and quantify the functional balance between network segregation and integration. In a large sample of healthy young adults (n=991), we combine the whole-brain resting state functional magnetic resonance imaging (fMRI) data with a mean-field model on the structural network derived from diffusion tensor imaging and demonstrate that resting brain networks are on average close to a balanced state. This state allows for a balanced time dwelling at segregated and integrated configurations, and highly flexible switching between them. Furthermore, we employ structural equation modelling to estimate general and domain-specific cognitive phenotypes from nine tasks, and demonstrate that network segregation, integration and their balance in resting brains predict individual differences in diverse cognitive phenotypes. More specifically, stronger integration is associated with better general cognitive ability, stronger segregation fosters crystallized intelligence and processing speed, and an individual's tendency towards balance supports better memory. Our findings provide a comprehensive and deep understanding of the brain's functioning principles in supporting diverse functional demands and cognitive abilities, and advance modern network neuroscience theories of human cognition.

Submitted 28 February, 2021; originally announced March 2021.

Comments: 36 pages, 5 figures

arXiv:2102.00971 [pdf, other]

Methodology-centered review of molecular modeling, simulation, and prediction of SARS-CoV-2

Authors: Kaifu Gao, Rui Wang, Jiahui Chen, Limei Cheng, Jaclyn Frishcosy, Yuta Huzumi, Yuchi Qiu, Tom Schluckbier, Guo-Wei Wei

Abstract: The deadly coronavirus disease 2019 (COVID-19) pandemic caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has gone out of control globally. Despite much effort by scientists, medical experts, and society in general, the slow progress on drug discovery and antibody therapeutic development, the unknown possible side effects of the existing vaccines, and the high transmission rate of SARS-CoV-2 remind us of the sad reality that our current understanding of the transmission, infectivity, and evolution of SARS-CoV-2 is unfortunately very limited. The major limitation is the lack of mechanistic understanding of viral-host cell interactions, the viral regulation, protein-protein interactions, including antibody-antigen binding, protein-drug binding, host immune response, etc. This limitation will likely haunt the scientific community for a long time and have a devastating consequence in combating COVID-19 and other pathogens. Notably, compared to the long-cycle, high-cost, and safety-demanding molecular-level experiments, the theoretical and computational studies are economical, speedy, and easy to perform. There exists a tsunami of literature on molecular modeling, simulation, and prediction of SARS-CoV-2 that has become impossible to be fully covered in a review. To provide the reader a quick update about the status of molecular modeling, simulation, and prediction of SARS-CoV-2, we present a comprehensive and systematic methodology-centered narrative in the nick of time. Aspects such as molecular modeling, Monte Carlo (MC) methods, structural bioinformatics, machine learning, deep learning, and mathematical approaches are included in this review. This review will be beneficial to researchers who are looking for ways to contribute to SARS-CoV-2 studies and those who are assessing the current status in the field.

Submitted 1 February, 2021; originally announced February 2021.

Comments: 99 pages, 17 figures

arXiv:2012.15268 [pdf, other]

UMAP-assisted $K$-means clustering of large-scale SARS-CoV-2 mutation datasets

Authors: Yuta Hozumi, Rui Wang, Changchuan Yin, Guo-Wei Wei

Abstract: Coronavirus disease 2019 (COVID-19) caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has a worldwide devastating effect. The understanding of evolution and transmission of SARS-CoV-2 is of paramount importance for the COVID-19 control, combating, and prevention. Due to the rapid growth of both the number of SARS-CoV-2 genome sequences and the number of unique mutations, the phylogenetic analysis of SARS-CoV-2 genome isolates faces an emergent large-data challenge. We introduce a dimension-reduced $k$-means clustering strategy to tackle this challenge. We examine the performance and effectiveness of three dimension-reduction algorithms: principal component analysis (PCA), t-distributed stochastic neighbor embedding (t-SNE), and uniform manifold approximation and projection (UMAP). By using four benchmark datasets, we found that UMAP is the best-suited technique due to its stable, reliable, and efficient performance, its ability to improve clustering accuracy, especially for large Jaccard distance-based datasets, and its superior clustering visualization. The UMAP-assisted $k$-means clustering enables us to shed light on increasingly large datasets from SARS-CoV-2 genome isolates.

Submitted 30 December, 2020; originally announced December 2020.

Comments: 30 pages, 10 figures
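The abstract of the entry above refers to Jaccard distance-based datasets. As a minimal sketch, assuming each genome isolate is represented as a set of mutation labels (the example labels and the function names `jaccard_distance` and `distance_matrix` are illustrative and not taken from the paper), the pairwise distance matrix that a dimension-reduction and $k$-means pipeline might take as input could be built like this:

```python
from itertools import combinations


def jaccard_distance(a, b):
    """Return 1 - |a & b| / |a | b|; two empty sets are treated as identical."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def distance_matrix(samples):
    """Build a symmetric pairwise Jaccard distance matrix as nested lists."""
    n = len(samples)
    matrix = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        d = jaccard_distance(samples[i], samples[j])
        matrix[i][j] = matrix[j][i] = d
    return matrix


# Hypothetical isolates, each described by the mutations it carries.
isolates = [
    {"L452R", "T478K"},
    {"L452R", "T478K", "N501Y"},
    {"E484K", "N501Y"},
]
matrix = distance_matrix(isolates)
```

Isolates that share most of their mutations end up close together, and isolates with no mutation in common sit at the maximum distance of 1. The dimension-reduction and clustering steps themselves are not shown here.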

arXiv:2011.10616 [pdf, other]

Bridging Physics-based and Data-driven modeling for Learning Dynamical Systems

Authors: Rui Wang, Danielle Maddix, Christos Faloutsos, Yuyang Wang, Rose Yu

Abstract: How can we learn a dynamical system to make forecasts, when some variables are unobserved? For instance, in COVID-19, we want to forecast the number of infected and death cases but we do not know the count of susceptible and exposed people. While mechanistic compartment models are widely used in epidemic modeling, data-driven models are emerging for disease forecasting. We first formalize the learning of physics-based models as AutoODE, which leverages automatic differentiation to estimate the model parameters. Through a benchmark study on COVID-19 forecasting, we notice that physics-based mechanistic models significantly outperform deep learning. Our method obtains a 57.4% reduction in mean absolute errors for 7-day ahead COVID-19 forecasting compared with the best deep learning competitor. Such performance differences highlight the generalization problem in dynamical system learning due to distribution shift. We identify two scenarios where distribution shift can occur: changes in data domain and changes in parameter domain (system dynamics). Through systematic experiments on several dynamical systems, we found that deep learning models fail to forecast well under both scenarios. While much research on distribution shift has focused on changes in the data domain, our work calls attention to rethink generalization for learning dynamical systems.

Submitted 29 April, 2021; v1 submitted 20 November, 2020; originally announced November 2020.
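The entry above contrasts compartment models with data-driven forecasting. As a rough illustration of what a compartment model computes, the sketch below integrates a textbook SEIR model with forward Euler. It is not AutoODE: the paper estimates parameters by automatic differentiation, whereas here the rates `beta`, `sigma`, and `gamma` are simply assumed, the example values are made up, and deaths are not modeled separately.

```python
def simulate_seir(s0, e0, i0, r0, beta, sigma, gamma, days, dt=0.1):
    """Integrate a textbook SEIR model with forward Euler; return daily (S, E, I, R)."""
    n = s0 + e0 + i0 + r0  # total population, conserved by the model
    s, e, i, r = float(s0), float(e0), float(i0), float(r0)
    steps_per_day = round(1 / dt)
    trajectory = [(s, e, i, r)]
    for _ in range(days):
        for _ in range(steps_per_day):
            new_exposed = beta * s * i / n * dt   # S -> E
            new_infectious = sigma * e * dt       # E -> I
            new_removed = gamma * i * dt          # I -> R
            s -= new_exposed
            e += new_exposed - new_infectious
            i += new_infectious - new_removed
            r += new_removed
        trajectory.append((s, e, i, r))
    return trajectory


# Illustrative values only: 1,000 people, one initial infectious case.
trajectory = simulate_seir(
    s0=999, e0=0, i0=1, r0=0, beta=0.5, sigma=0.2, gamma=0.1, days=30
)
```

In a setting like the one the abstract describes, only the infectious and removed counts would be observed, and the susceptible and exposed columns of `trajectory` would be the hidden state.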

arXiv:2010.06357 [pdf, other]

Prediction and mitigation of mutation threats to COVID-19 vaccines and antibody therapies

Authors: Jiahui Chen, Kaifu Gao, Rui Wang, Guowei Wei

Abstract: Antibody therapeutics and vaccines are among our last resort to end the raging COVID-19 pandemic. They, however, are prone to over 5,000 mutations on the spike (S) protein uncovered by a Mutation Tracker based on over 200,000 genome isolates. It is imperative to understand how mutations would impact vaccines and antibodies in the development. In this work, we study the mechanism, frequency, and ratio of mutations on the S protein. Additionally, we use 56 antibody structures and analyze their 2D and 3D characteristics. Moreover, we predict the mutation-induced binding free energy (BFE) changes for the complexes of S protein and antibodies or ACE2. By integrating genetics, biophysics, deep learning, and algebraic topology, we reveal that most of 462 mutations on the receptor-binding domain (RBD) will weaken the binding of S protein and antibodies and disrupt the efficacy and reliability of antibody therapies and vaccines. A list of 31 vaccine escape mutants is identified, while many other disruptive mutations are detailed as well. We also unveil that about 65% of existing RBD mutations, including those variants recently found in the United Kingdom (UK) and South Africa, are binding-strengthening mutations, resulting in more infectious COVID-19 variants. We discover the disparity between the extreme values of RBD mutation-induced BFE strengthening and weakening of the bindings with antibodies and ACE2, suggesting that SARS-CoV-2 is at an advanced stage of evolution for human infection, while the human immune system is able to produce optimized antibodies. This discovery implies the vulnerability of current vaccines and antibody drugs to new mutations. Our predictions were validated by comparison with more than 1,400 deep mutations on the S protein RBD. Our results show the urgent need to develop new mutation-resistant vaccines and antibodies and to prepare for seasonal vaccinations.

Submitted 9 March, 2021; v1 submitted 13 October, 2020; originally announced October 2020.

Comments: 28 pages, 17 figures

arXiv:2010.02141 [pdf, other]

AdaLead: A simple and robust adaptive greedy search algorithm for sequence design

Authors: Sam Sinai, Richard Wang, Alexander Whatley, Stewart Slocum, Elina Locane, Eric D. Kelsic

Abstract: Efficient design of biological sequences will have a great impact across many industrial and healthcare domains. However, discovering improved sequences requires solving a difficult optimization problem. Traditionally, this challenge was approached by biologists through a model-free method known as "directed evolution", the iterative process of random mutation and selection. As the ability to build models that capture the sequence-to-function map improves, such models can be used as oracles to screen sequences before running experiments. In recent years, interest in better algorithms that effectively use such oracles to outperform model-free approaches has intensified. These span from approaches based on Bayesian Optimization, to regularized generative models and adaptations of reinforcement learning. In this work, we implement an open-source Fitness Landscape EXploration Sandbox (FLEXS: github.com/samsinai/FLEXS) environment to test and evaluate these algorithms based on their optimality, consistency, and robustness. Using FLEXS, we develop an easy-to-implement, scalable, and robust evolutionary greedy algorithm (AdaLead). Despite its simplicity, we show that AdaLead is a remarkably strong benchmark that out-competes more complex state of the art approaches in a variety of biologically motivated sequence design challenges.

Submitted 5 October, 2020; originally announced October 2020.
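The abstract above describes directed evolution as an iterative process of random mutation and selection, with a model acting as an oracle to screen candidates. The following is a toy sketch of that generic loop only. It is not AdaLead and does not use the FLEXS API; the function `directed_evolution`, the target string, and `toy_oracle` are all hypothetical stand-ins chosen for illustration.

```python
import random


def directed_evolution(start, oracle, alphabet, rounds, pool_size, rng):
    """Repeat mutate-then-select, keeping the best sequence seen so far."""
    best = start
    best_score = oracle(best)
    history = [best_score]
    for _ in range(rounds):
        # Propose single-site random mutants of the current best sequence.
        candidates = []
        for _ in range(pool_size):
            pos = rng.randrange(len(best))
            letter = rng.choice(alphabet)
            candidates.append(best[:pos] + letter + best[pos + 1:])
        # Screen the pool with the oracle and keep a strict improvement only.
        top = max(candidates, key=oracle)
        top_score = oracle(top)
        if top_score > best_score:
            best, best_score = top, top_score
        history.append(best_score)
    return best, history


# Toy oracle (hypothetical): fraction of positions matching a fixed target.
TARGET = "ACGTACGT"


def toy_oracle(seq):
    return sum(a == b for a, b in zip(seq, TARGET)) / len(TARGET)


best, history = directed_evolution(
    "AAAAAAAA", toy_oracle, "ACGT", rounds=50, pool_size=20, rng=random.Random(0)
)
```

Passing a seeded `random.Random` instance keeps a run reproducible, and `history` records the best oracle score after each round, so it never decreases.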

arXiv:2009.08491 [pdf]

doi 10.1016/j.jsb.2020.107606

Globular structure of the hypermineralized tissue in human femoral neck

Authors: Qiong Wang, Tengteng Tang, David Cooper, Felipe Eltit, Peter Fratzl, Pierre Guy, Rizhi Wang

Abstract: Bone becomes more fragile with ageing. Among many structural changes, a thin layer of highly mineralized and brittle tissue covers part of the external surface of the thin femoral neck cortex in older people and has been proposed to increase hip fragility. However, there have been very limited reports on this hypermineralized tissue in the femoral neck, especially on its ultrastructure. Such information is critical to understanding both the mineralization process and its contributions to hip fracture. Here, we use multiple advanced techniques to characterize the ultrastructure of the hypermineralized tissue in the neck across various length scales. Synchrotron radiation micro-CT found larger but less densely distributed cellular lacunae in hypermineralized tissue than in lamellar bone. When examined under FIB-SEM, the hypermineralized tissue was mainly composed of mineral globules with sizes varying from submicron to a few microns. Nano-sized channels were present within the mineral globules and oriented with the surrounding organic matrix. Transmission electron microscopy showed the apatite inside globules was poorly crystalline, while that at the boundaries between the globules had well-defined lattice structure with crystallinity similar to the apatite mineral in lamellar bone. No preferred mineral orientation was observed either inside each globule or at the boundaries. Collectively, we conclude based on these new observations that the hypermineralized tissue is non-lamellar and has less organized mineral, which may contribute to the high brittleness of the tissue.

Submitted 17 September, 2020; originally announced September 2020.

Journal ref: Journal of Structural Biology, Volume 212, Issue 2, 1 November 2020, 107606

arXiv:2009.00107 [pdf]

Data Mining and Analytical Models to Predict and Identify Adverse Drug-drug Interactions

Authors: Ricky Wang

Abstract: The use of multiple drugs accounts for almost 30% of all hospital admissions and is the 5th leading cause of death in America. Since over 30% of all adverse drug events (ADEs) are thought to be caused by drug-drug interactions (DDI), better identification and prediction of administration of known DDIs in primary and secondary care could reduce the number of patients seeking urgent care in hospitals, resulting in substantial savings for health systems worldwide along with better public health. However, current DDI prediction models are prone to confounding biases along with either inaccurate or a lack of access to longitudinal data from Electronic Health Records (EHR) and other drug information such as FDA Adverse Event Reporting System (FAERS) which continue to be the main barriers in measuring the prevalence of DDI and characterizing the phenomenon in medical care. In this review, analytical models including Label Propagation using drug side effect data and Supervised Learning DDI Prediction model using Drug-Gene interactions (DGIs) data are discussed. Improved identification of DDIs in both of these models compared to previous versions is highlighted while limitations that include bias, inaccuracy, and insufficient data are also assessed. A case study of Psoriasis DDI prediction by DGI data using Random Forest Classifier was studied. Transfer Matrix Recurrent Neural Networks (TM-RNN) that address the above limitations are discussed in future works.

Submitted 31 August, 2020; originally announced September 2020.

Comments: 21 pages

arXiv:2008.07488 [pdf, other]

Host immune response driving SARS-CoV-2 evolution

Authors: Rui Wang, Yuta Hozumi, Yong-Hui Zheng, Changchuan Yin, Guo-Wei Wei

Abstract: The transmission and evolution of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) are of paramount importance to the controlling and combating of coronavirus disease 2019 (COVID-19) pandemic. Currently, nearly 15,000 SARS-CoV-2 single mutations have been recorded, having a great ramification to the development of diagnostics, vaccines, antibody therapies, and drugs. However, little is known about SARS-CoV-2 evolutionary characteristics and general trend. In this work, we present a comprehensive genotyping analysis of existing SARS-CoV-2 mutations. We reveal that host immune response via APOBEC and ADAR gene editing gives rise to nearly 65% of recorded mutations. Additionally, we show that children under age five and the elderly may be at high risk from COVID-19 because of their overreacting to the viral infection. Moreover, we uncover that populations of Oceania and Africa react significantly more intensively to SARS-CoV-2 infection than those of Europe and Asia, which may explain why African Americans were shown to be at increased risk of dying from COVID-19, in addition to their high risk of getting sick from COVID-19 caused by systemic health and social inequities. Finally, our study indicates that for two viral genome sequences of the same origin, their evolution order may be determined from the ratio of mutation type C$>$T over T$>$C.

Submitted 20 August, 2020; v1 submitted 17 August, 2020; originally announced August 2020.

Comments: 22 pages, 15 figures
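The last sentence of the abstract above mentions the ratio of C>T over T>C mutations. As a small sketch of how such a ratio could be tallied, assuming two already-aligned nucleotide strings of equal length (the function names and the example sequences are made up, and how the paper uses the ratio to order sequences is not reproduced here):

```python
from collections import Counter


def substitution_counts(reference, sample):
    """Count single-nucleotide substitutions, keyed like 'C>T', between aligned sequences."""
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    counts = Counter()
    for ref_base, alt_base in zip(reference, sample):
        # Skip matches and any non-ACGT symbols such as gaps or N.
        if ref_base != alt_base and ref_base in "ACGT" and alt_base in "ACGT":
            counts[f"{ref_base}>{alt_base}"] += 1
    return counts


def c_to_t_ratio(counts):
    """Return (C>T count) / (T>C count), or None when no T>C substitution exists."""
    if counts["T>C"] == 0:
        return None
    return counts["C>T"] / counts["T>C"]


# Hypothetical aligned fragments: two C>T changes and one T>C change.
counts = substitution_counts("ACCTGCTA", "ATCTGTCA")
ratio = c_to_t_ratio(counts)
```

Returning `None` when there is no T>C substitution avoids a division by zero and leaves the handling of that case to the caller.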

arXiv:2007.12692 [pdf, other]

Characterizing SARS-CoV-2 mutations in the United States

Authors: Rui Wang, Jiahui Chen, Kaifu Gao, Yuta Hozumi, Changchuan Yin, Guo-Wei Wei

Abstract: The severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has been mutating since it was first sequenced in early January 2020. The genetic variants have developed into a few distinct clusters with different properties. Since the United States (US) has the highest number of viral infected patients globally, it is essential to understand the US SARS-CoV-2. Using genotyping, sequence-alignment, time-evolution, $k$-means clustering, protein-folding stability, algebraic topology, and network theory, we reveal that the US SARS-CoV-2 has four substrains and five top US SARS-CoV-2 mutations were first detected in China (2 cases), Singapore (2 cases), and the United Kingdom (1 case). The next three top US SARS-CoV-2 mutations were first detected in the US. These eight top mutations belong to two disconnected groups. The first group consisting of 5 concurrent mutations is prevailing, while the other group with three concurrent mutations gradually fades out. Our analysis suggests that female immune systems are more active than those of males in responding to SARS-CoV-2 infections. We identify that one of the top mutations, 27964C$>$T-(S24L) on ORF8, has an unusually strong gender dependence. Based on the analysis of all mutations on the spike protein, we further uncover that three of four US SARS-CoV-2 substrains become more infectious. Our study calls for effective viral control and containment strategies in the US.

Submitted 24 July, 2020; originally announced July 2020.

Comments: 31 pages, 20 figures, and 4 tables

arXiv:2007.01344 [pdf, other]

Decoding asymptomatic COVID-19 infection and transmission

Authors: Rui Wang, Yuta Hozumi, Changchuan Yin, Guo-Wei Wei

Abstract: Coronavirus disease 2019 (COVID-19) is continuously devastating public health and the world economy. One of the major challenges in controlling the COVID-19 outbreak is its asymptomatic infection and transmission, which are elusive and difficult to defend against in most situations. The pathogenicity and virulence of asymptomatic COVID-19 remain mysterious. Based on the genotyping of 20656 Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2) genome isolates, we reveal that asymptomatic infection is linked to the SARS-CoV-2 11083G>T mutation, i.e., the leucine (L) to phenylalanine (F) substitution at residue 37 (L37F) of nonstructural protein 6 (NSP6). By analyzing the distribution of 11083G>T in various countries, we unveil that 11083G>T may correlate with the hypotoxicity of SARS-CoV-2. Moreover, we show a global decaying tendency of the 11083G>T mutation ratio, indicating that 11083G>T hinders SARS-CoV-2 transmission capacity. Sequence alignment found that both NSP6 and the residue 37 neighborhoods are relatively conserved over a few coronaviral species, indicating their importance in regulating host cell autophagy to undermine innate cellular defense against viral infection. Using machine learning and topological data analysis, we demonstrate that mutation L37F has made NSP6 energetically less stable. The rigidity and flexibility index and several network models suggest that mutation L37F may have compromised the NSP6 function, leading to a relatively weak SARS-CoV subtype. This assessment is in good agreement with our genotyping of SARS-CoV-2 evolution and transmission across various countries and regions over the past few months.

Submitted 2 July, 2020; originally announced July 2020.

Comments: 18 pages, 5 figures

arXiv:2006.10584 [pdf, other]

Review of COVID-19 Antibody Therapies

Authors: Jiahui Chen, Kaifu Gao, Rui Wang, Duc Duy Nguyen, Guo-Wei Wei

Abstract: Under the global health emergency caused by coronavirus disease 2019 (COVID-19), efficient and specific therapies are urgently needed. Compared with traditional small-molecule drugs, antibody therapies are relatively easy to develop and as specific as vaccines in targeting severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), and thus have attracted much attention in the past few months. This work reviews seven existing antibodies for the SARS-CoV-2 spike (S) protein with three-dimensional (3D) structures deposited in the Protein Data Bank. Five antibody structures associated with SARS-CoV are evaluated for their potential in neutralizing SARS-CoV-2. The interactions of these antibodies with the S protein receptor-binding domain (RBD) are compared with those of angiotensin-converting enzyme 2 (ACE2) and RBD complexes. Because the experimental binding affinities differ by orders of magnitude, we introduce topological data analysis (TDA), a variety of network models, and deep learning to analyze the binding strength and therapeutic potential of the aforementioned fourteen antibody-antigen complexes. The current COVID-19 antibody clinical trials, which are not limited to the S protein target, are also reviewed.

Submitted 18 June, 2020; originally announced June 2020.

Comments: 30 pages, 10 figures, 5 tables

arXiv:2006.05002 [pdf, other]

Determination and estimation of optimal quarantine duration for infectious diseases with application to data analysis of COVID-19

Authors: Ruoyu Wang, Qihua Wang

Abstract: Quarantine is a commonly used non-pharmaceutical intervention during outbreaks of infectious diseases. A key problem in implementing quarantine measures is determining the duration of quarantine. In this paper, a policy with optimal quarantine duration is developed. The policy suggests different quarantine durations for individuals with different characteristics. The policy is optimal in the sense that it minimizes the average quarantine duration of uninfected people under the constraint that the probability of symptom presentation for infected people attains a given value close to 1. The optimal solution for the quarantine duration is obtained and estimated by statistical methods, with application to the analysis of COVID-19 data.

Submitted 28 April, 2021; v1 submitted 8 June, 2020; originally announced June 2020.

Journal ref: Biometrics, 2022

arXiv:2005.14669 [pdf, other]

Mutations strengthened SARS-CoV-2 infectivity

Authors: Jiahui Chen, Rui Wang, Menglun Wang, Guo-Wei Wei

Abstract: Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) infectivity is a major concern in coronavirus disease 2019 (COVID-19) prevention and economic reopening. However, rigorous determination of SARS-CoV-2 infectivity is essentially impossible owing to its continuous evolution, with over 13752 single nucleotide polymorphism (SNP) variants in six different subtypes. We develop an advanced machine learning algorithm based on algebraic topology to quantitatively evaluate the binding affinity changes of the SARS-CoV-2 spike glycoprotein (S protein) and the host angiotensin-converting enzyme 2 (ACE2) receptor following the mutations. Based on mutation-induced binding affinity changes, we reveal that five out of six SARS-CoV-2 subtypes have become either moderately or slightly more infectious, while one subtype has weakened its infectivity. We find that SARS-CoV-2 is slightly more infectious than SARS-CoV according to computed S protein-ACE2 binding affinity changes. Based on a systematic evaluation of all possible 3686 future mutations on the S protein receptor-binding domain (RBD), we show that most likely future mutations will make SARS-CoV-2 more infectious. Combining sequence alignment, probability analysis, and binding affinity calculation, we predict that a few residues on the receptor-binding motif (RBM), i.e., 452, 489, 500, 501, and 505, have very high chances to mutate into significantly more infectious COVID-19 strains.

Submitted 27 May, 2020; originally announced May 2020.

Comments: 24 pages, 2 tables and 19 figures

arXiv:2005.13653 [pdf, other]

Unveiling the molecular mechanism of SARS-CoV-2 main protease inhibition from 92 crystal structures

Authors: Duc D Nguyen, Kaifu Gao, Jiahui Chen, Rui Wang, Guo-Wei Wei

Abstract: Currently, there are no effective antiviral drugs or vaccines for coronavirus disease 2019 (COVID-19) caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). Due to its high conservation and low similarity with human genes, the SARS-CoV-2 main protease (M$^{\text{pro}}$) is one of the most favorable drug targets. However, the current understanding of the molecular mechanism of M$^{\text{pro}}$ inhibition is limited by the lack of reliable binding affinity ranking and prediction for existing structures of M$^{\text{pro}}$-inhibitor complexes. This work integrates mathematics and deep learning (MathDL) to provide a reliable ranking of the binding affinities of 92 SARS-CoV-2 M$^{\text{pro}}$ inhibitor structures. We reveal that the Gly143 residue in M$^{\text{pro}}$ is the most attractive site to form hydrogen bonds, followed by Cys145, Glu166, and His163. We also identify 45 targeted covalent bonding inhibitors. Validation on the PDBbind v2016 core set benchmark shows that MathDL achieves the top performance, with a Pearson's correlation coefficient ($R_p$) of 0.858. Most importantly, MathDL is validated on a carefully curated SARS-CoV-2 inhibitor dataset with an averaged $R_p$ as high as 0.751, which supports the reliability of the present binding affinity prediction. The present binding affinity ranking, interaction analysis, and fragment decomposition offer a foundation for future drug discovery efforts.

Submitted 27 May, 2020; originally announced May 2020.

Comments: 17 pages, 8 figures, 3 tables
