Search | arXiv e-print repository

Evaluating representation learning on the protein structure universe

Authors: Arian R. Jamasb, Alex Morehead, Chaitanya K. Joshi, Zuobai Zhang, Kieran Didi, Simon V. Mathis, Charles Harris, Jian Tang, Jianlin Cheng, Pietro Lio, Tom L. Blundell

Abstract: We introduce ProteinWorkshop, a comprehensive benchmark suite for representation learning on protein structures with Geometric Graph Neural Networks. We consider large-scale pre-training and downstream tasks on both experimental and predicted structures to enable the systematic evaluation of the quality of the learned structural representation and their usefulness in capturing functional relations… ▽ More We introduce ProteinWorkshop, a comprehensive benchmark suite for representation learning on protein structures with Geometric Graph Neural Networks. We consider large-scale pre-training and downstream tasks on both experimental and predicted structures to enable the systematic evaluation of the quality of the learned structural representation and their usefulness in capturing functional relationships for downstream tasks. We find that: (1) large-scale pretraining on AlphaFold structures and auxiliary tasks consistently improve the performance of both rotation-invariant and equivariant GNNs, and (2) more expressive equivariant GNNs benefit from pretraining to a greater extent compared to invariant models. We aim to establish a common ground for the machine learning and computational biology communities to rigorously compare and advance protein structure representation learning. Our open-source codebase reduces the barrier to entry for working with large protein structure datasets by providing: (1) storage-efficient dataloaders for large-scale structural databases including AlphaFoldDB and ESM Atlas, as well as (2) utilities for constructing new tasks from the entire PDB. ProteinWorkshop is available at: github.com/a-r-j/ProteinWorkshop. △ Less

Submitted 19 June, 2024; originally announced June 2024.

Comments: ICLR 2024

arXiv:2406.10146 [pdf]

Multimodal Radiomics Model for Predicting Gold Nanoparticles Accumulation in Mouse Tumors

Authors: Jiajia Tang, Jie Zhang, Jiulou Zhang, Yuxia Tang, Hao Ni, Shouju Wang

Abstract: Background: Nanoparticles can accumulate in solid tumors, serving as diagnostic or therapeutic agents for cancer. Clinical translation is challenging due to low accumulation in tumors and heterogeneity between tumor types and individuals. Tools to identify this heterogeneity and predict nanoparticle accumulation are needed. Advanced imaging techniques combined with radiomics and AI may offer a sol… ▽ More Background: Nanoparticles can accumulate in solid tumors, serving as diagnostic or therapeutic agents for cancer. Clinical translation is challenging due to low accumulation in tumors and heterogeneity between tumor types and individuals. Tools to identify this heterogeneity and predict nanoparticle accumulation are needed. Advanced imaging techniques combined with radiomics and AI may offer a solution. Methods: 183 mice were used to create seven subcutaneous tumor models, with three sizes (15nm, 40nm, 70nm) of gold nanoparticles injected via the tail vein. Accumulation was measured using ICP-OES. Data were divided into training and test sets (7:3). Tumors were categorized into high and low uptake groups based on the median value of the training set. Before injection, multimodal imaging data (CT, B-mode ultrasound, SWE, CEUS) were acquired, and radiomics features extracted. LASSO and RFE algorithms built a radiomics signature. This, along with tumor type and mean values from CT and SWE, constructed the best model using SVM. For each tumor in the test set, the radiomics signature predicted gold nanoparticle uptake. Model performance was evaluated by AUC. Results: Significant variability in gold nanoparticle accumulation was observed among tumors (P < 0.001). The median accumulation in the training set was 3.37% ID/g. Nanoparticle size was not a main determinant of uptake (P > 0.05). The composite model based on radiomics signature outperformed the basic model in both training (AUC 0.93 vs. 0.68) and testing (0.78 vs. 0.61) datasets. Conclusion: The composite model identifies tumor heterogeneity and predicts high uptake of gold nanoparticles, improving patient stratification and supporting nanomedicine's clinical application. △ Less

Submitted 14 June, 2024; originally announced June 2024.

arXiv:2406.05347 [pdf, other]

MSAGPT: Neural Prompting Protein Structure Prediction via MSA Generative Pre-Training

Authors: Bo Chen, Zhilei Bei, Xingyi Cheng, Pan Li, Jie Tang, Le Song

Abstract: Multiple Sequence Alignment (MSA) plays a pivotal role in unveiling the evolutionary trajectories of protein families. The accuracy of protein structure predictions is often compromised for protein sequences that lack sufficient homologous information to construct high quality MSA. Although various methods have been proposed to generate virtual MSA under these conditions, they fall short in compre… ▽ More Multiple Sequence Alignment (MSA) plays a pivotal role in unveiling the evolutionary trajectories of protein families. The accuracy of protein structure predictions is often compromised for protein sequences that lack sufficient homologous information to construct high quality MSA. Although various methods have been proposed to generate virtual MSA under these conditions, they fall short in comprehensively capturing the intricate coevolutionary patterns within MSA or require guidance from external oracle models. Here we introduce MSAGPT, a novel approach to prompt protein structure predictions via MSA generative pretraining in the low MSA regime. MSAGPT employs a simple yet effective 2D evolutionary positional encoding scheme to model complex evolutionary patterns. Endowed by this, its flexible 1D MSA decoding framework facilitates zero or few shot learning. Moreover, we demonstrate that leveraging the feedback from AlphaFold2 can further enhance the model capacity via Rejective Fine tuning (RFT) and Reinforcement Learning from AF2 Feedback (RLAF). Extensive experiments confirm the efficacy of MSAGPT in generating faithful virtual MSA to enhance the structure prediction accuracy. The transfer learning capabilities also highlight its great potential for facilitating other protein tasks. △ Less

Submitted 10 June, 2024; v1 submitted 8 June, 2024; originally announced June 2024.

arXiv:2405.18605 [pdf, ps, other]

BioBERT-based Deep Learning and Merged ChemProt-DrugProt for Enhanced Biomedical Relation Extraction

Authors: Bridget T. McInnes, Jiawei Tang, Darshini Mahendran, Mai H. Nguyen

Abstract: This paper presents a methodology for enhancing relation extraction from biomedical texts, focusing specifically on chemical-gene interactions. Leveraging the BioBERT model and a multi-layer fully connected network architecture, our approach integrates the ChemProt and DrugProt datasets using a novel merging strategy. Through extensive experimentation, we demonstrate significant performance improv… ▽ More This paper presents a methodology for enhancing relation extraction from biomedical texts, focusing specifically on chemical-gene interactions. Leveraging the BioBERT model and a multi-layer fully connected network architecture, our approach integrates the ChemProt and DrugProt datasets using a novel merging strategy. Through extensive experimentation, we demonstrate significant performance improvements, particularly in CPR groups shared between the datasets. The findings underscore the importance of dataset merging in augmenting sample counts and improving model accuracy. Moreover, the study highlights the potential of automated information extraction in biomedical research and clinical practice. △ Less

Submitted 28 May, 2024; originally announced May 2024.

arXiv:2405.00751 [pdf, other]

F$^3$low: Frame-to-Frame Coarse-grained Molecular Dynamics with SE(3) Guided Flow Matching

Authors: Shaoning Li, Yusong Wang, Mingyu Li, Jian Zhang, Bin Shao, Nanning Zheng, Jian Tang

Abstract: Molecular dynamics (MD) is a crucial technique for simulating biological systems, enabling the exploration of their dynamic nature and fostering an understanding of their functions and properties. To address exploration inefficiency, emerging enhanced sampling approaches like coarse-graining (CG) and generative models have been employed. In this work, we propose a \underline{Frame-to-Frame} genera… ▽ More Molecular dynamics (MD) is a crucial technique for simulating biological systems, enabling the exploration of their dynamic nature and fostering an understanding of their functions and properties. To address exploration inefficiency, emerging enhanced sampling approaches like coarse-graining (CG) and generative models have been employed. In this work, we propose a \underline{Frame-to-Frame} generative model with guided \underline{Flow}-matching (F$3$low) for enhanced sampling, which (a) extends the domain of CG modeling to the SE(3) Riemannian manifold; (b) retreating CGMD simulations as autoregressively sampling guided by the former frame via flow-matching models; (c) targets the protein backbone, offering improved insights into secondary structure formation and intricate folding pathways. Compared to previous methods, F$3$low allows for broader exploration of conformational space. The ability to rapidly generate diverse conformations via force-free generative paradigm on SE(3) paves the way toward efficient enhanced sampling methods. △ Less

Submitted 1 May, 2024; originally announced May 2024.

Comments: Accepted by ICLR 2024 GEM workshop

arXiv:2402.10433 [pdf, other]

Fusing Neural and Physical: Augment Protein Conformation Sampling with Tractable Simulations

Authors: Jiarui Lu, Zuobai Zhang, Bozitao Zhong, Chence Shi, Jian Tang

Abstract: The protein dynamics are common and important for their biological functions and properties, the study of which usually involves time-consuming molecular dynamics (MD) simulations in silico. Recently, generative models has been leveraged as a surrogate sampler to obtain conformation ensembles with orders of magnitude faster and without requiring any simulation data (a "zero-shot" inference). Howev… ▽ More The protein dynamics are common and important for their biological functions and properties, the study of which usually involves time-consuming molecular dynamics (MD) simulations in silico. Recently, generative models has been leveraged as a surrogate sampler to obtain conformation ensembles with orders of magnitude faster and without requiring any simulation data (a "zero-shot" inference). However, being agnostic of the underlying energy landscape, the accuracy of such generative model may still be limited. In this work, we explore the few-shot setting of such pre-trained generative sampler which incorporates MD simulations in a tractable manner. Specifically, given a target protein of interest, we first acquire some seeding conformations from the pre-trained sampler followed by a number of physical simulations in parallel starting from these seeding samples. Then we fine-tuned the generative model using the simulation trajectories above to become a target-specific sampler. Experimental results demonstrated the superior performance of such few-shot conformation sampler at a tractable computational cost. △ Less

Submitted 11 March, 2024; v1 submitted 15 February, 2024; originally announced February 2024.

Comments: Published at the GEM workshop, ICLR 2024

arXiv:2402.07955 [pdf, other]

ProtIR: Iterative Refinement between Retrievers and Predictors for Protein Function Annotation

Authors: Zuobai Zhang, Jiarui Lu, Vijil Chenthamarakshan, Aurélie Lozano, Payel Das, Jian Tang

Abstract: Protein function annotation is an important yet challenging task in biology. Recent deep learning advancements show significant potential for accurate function prediction by learning from protein sequences and structures. Nevertheless, these predictor-based methods often overlook the modeling of protein similarity, an idea commonly employed in traditional approaches using sequence or structure ret… ▽ More Protein function annotation is an important yet challenging task in biology. Recent deep learning advancements show significant potential for accurate function prediction by learning from protein sequences and structures. Nevertheless, these predictor-based methods often overlook the modeling of protein similarity, an idea commonly employed in traditional approaches using sequence or structure retrieval tools. To fill this gap, we first study the effect of inter-protein similarity modeling by benchmarking retriever-based methods against predictors on protein function annotation tasks. Our results show that retrievers can match or outperform predictors without large-scale pre-training. Building on these insights, we introduce a novel variational pseudo-likelihood framework, ProtIR, designed to improve function predictors by incorporating inter-protein similarity modeling. This framework iteratively refines knowledge between a function predictor and retriever, thereby combining the strengths of both predictors and retrievers. ProtIR showcases around 10% improvement over vanilla predictor-based methods. Besides, it achieves performance on par with protein language model-based methods, yet without the need for massive pre-training, highlighting the efficacy of our framework. Code will be released upon acceptance. △ Less

Submitted 10 February, 2024; originally announced February 2024.

arXiv:2402.05856 [pdf, other]

Structure-Informed Protein Language Model

Authors: Zuobai Zhang, Jiarui Lu, Vijil Chenthamarakshan, Aurélie Lozano, Payel Das, Jian Tang

Abstract: Protein language models are a powerful tool for learning protein representations through pre-training on vast protein sequence datasets. However, traditional protein language models lack explicit structural supervision, despite its relevance to protein function. To address this issue, we introduce the integration of remote homology detection to distill structural information into protein language… ▽ More Protein language models are a powerful tool for learning protein representations through pre-training on vast protein sequence datasets. However, traditional protein language models lack explicit structural supervision, despite its relevance to protein function. To address this issue, we introduce the integration of remote homology detection to distill structural information into protein language models without requiring explicit protein structures as input. We evaluate the impact of this structure-informed training on downstream protein function prediction tasks. Experimental results reveal consistent improvements in function annotation accuracy for EC number and GO term prediction. Performance on mutant datasets, however, varies based on the relationship between targeted properties and protein structures. This underscores the importance of considering this relationship when applying structure-aware training to protein function prediction tasks. Code and model weights are available at https://github.com/DeepGraphLearning/esm-s. △ Less

Submitted 7 February, 2024; originally announced February 2024.

arXiv:2401.17123 [pdf, other]

Unsupervised Discovery of Steerable Factors When Graph Deep Generative Models Are Entangled

Authors: Shengchao Liu, Chengpeng Wang, Jiarui Lu, Weili Nie, Hanchen Wang, Zhuoxinran Li, Bolei Zhou, Jian Tang

Abstract: Deep generative models (DGMs) have been widely developed for graph data. However, much less investigation has been carried out on understanding the latent space of such pretrained graph DGMs. These understandings possess the potential to provide constructive guidelines for crucial tasks, such as graph controllable generation. Thus in this work, we are interested in studying this problem and propos… ▽ More Deep generative models (DGMs) have been widely developed for graph data. However, much less investigation has been carried out on understanding the latent space of such pretrained graph DGMs. These understandings possess the potential to provide constructive guidelines for crucial tasks, such as graph controllable generation. Thus in this work, we are interested in studying this problem and propose GraphCG, a method for the unsupervised discovery of steerable factors in the latent space of pretrained graph DGMs. We first examine the representation space of three pretrained graph DGMs with six disentanglement metrics, and we observe that the pretrained representation space is entangled. Motivated by this observation, GraphCG learns the steerable factors via maximizing the mutual information between semantic-rich directions, where the controlled graph moving along the same direction will share the same steerable factors. We quantitatively verify that GraphCG outperforms four competitive baselines on two graph DGMs pretrained on two molecule datasets. Additionally, we qualitatively illustrate seven steerable factors learned by GraphCG on five pretrained DGMs over five graph datasets, including two for molecules and three for point clouds. △ Less

Submitted 29 January, 2024; originally announced January 2024.

arXiv:2401.11447 [pdf, other]

Sequential Model for Predicting Patient Adherence in Subcutaneous Immunotherapy for Allergic Rhinitis

Authors: Yin Li, Yu Xiong, Wenxin Fan, Kai Wang, Qingqing Yu, Liping Si, Patrick van der Smagt, Jun Tang, Nutan Chen

Abstract: Objective: Subcutaneous Immunotherapy (SCIT) is the long-lasting causal treatment of allergic rhinitis (AR). How to enhance the adherence of patients to maximize the benefit of allergen immunotherapy (AIT) plays a crucial role in the management of AIT. This study aims to leverage novel machine learning models to precisely predict the risk of non-adherence of AR patients and related local symptom s… ▽ More Objective: Subcutaneous Immunotherapy (SCIT) is the long-lasting causal treatment of allergic rhinitis (AR). How to enhance the adherence of patients to maximize the benefit of allergen immunotherapy (AIT) plays a crucial role in the management of AIT. This study aims to leverage novel machine learning models to precisely predict the risk of non-adherence of AR patients and related local symptom scores in three years SCIT. Methods: The research develops and analyzes two models, sequential latent-variable model (SLVM) of Stochastic Latent Actor-Critic (SLAC) and Long Short-Term Memory (LSTM) evaluating them based on scoring and adherence prediction capabilities. Results: Excluding the biased samples at the first time step, the predictive adherence accuracy of the SLAC models is from 60\% to 72\%, and for LSTM models, it is 66\% to 84\%, varying according to the time steps. The range of Root Mean Square Error (RMSE) for SLAC models is between 0.93 and 2.22, while for LSTM models it is between 1.09 and 1.77. Notably, these RMSEs are significantly lower than the random prediction error of 4.55. Conclusion: We creatively apply sequential models in the long-term management of SCIT with promising accuracy in the prediction of SCIT nonadherence in AR patients. While LSTM outperforms SLAC in adherence prediction, SLAC excels in score prediction for patients undergoing SCIT for AR. The state-action-based SLAC adds flexibility, presenting a novel and effective approach for managing long-term AIT. △ Less

Submitted 19 July, 2024; v1 submitted 21 January, 2024; originally announced January 2024.

Comments: Frontiers in Pharmacology, research topic: Methods and Metrics to Measure Medication Adherence

arXiv:2401.06199 [pdf, other]

xTrimoPGLM: Unified 100B-Scale Pre-trained Transformer for Deciphering the Language of Protein

Authors: Bo Chen, Xingyi Cheng, Pan Li, Yangli-ao Geng, Jing Gong, Shen Li, Zhilei Bei, Xu Tan, Boyan Wang, Xin Zeng, Chiming Liu, Aohan Zeng, Yuxiao Dong, Jie Tang, Le Song

Abstract: Protein language models have shown remarkable success in learning biological information from protein sequences. However, most existing models are limited by either autoencoding or autoregressive pre-training objectives, which makes them struggle to handle protein understanding and generation tasks concurrently. We propose a unified protein language model, xTrimoPGLM, to address these two types of… ▽ More Protein language models have shown remarkable success in learning biological information from protein sequences. However, most existing models are limited by either autoencoding or autoregressive pre-training objectives, which makes them struggle to handle protein understanding and generation tasks concurrently. We propose a unified protein language model, xTrimoPGLM, to address these two types of tasks simultaneously through an innovative pre-training framework. Our key technical contribution is an exploration of the compatibility and the potential for joint optimization of the two types of objectives, which has led to a strategy for training xTrimoPGLM at an unprecedented scale of 100 billion parameters and 1 trillion training tokens. Our extensive experiments reveal that 1) xTrimoPGLM significantly outperforms other advanced baselines in 18 protein understanding benchmarks across four categories. The model also facilitates an atomic-resolution view of protein structures, leading to an advanced 3D structural prediction model that surpasses existing language model-based tools. 2) xTrimoPGLM not only can generate de novo protein sequences following the principles of natural ones, but also can perform programmable generation after supervised fine-tuning (SFT) on curated sequences. These results highlight the substantial capability and versatility of xTrimoPGLM in understanding and generating protein sequences, contributing to the evolving landscape of foundation models in protein science. △ Less

Submitted 11 January, 2024; originally announced January 2024.

arXiv:2312.15252 [pdf, other]

DTIAM: A unified framework for predicting drug-target interactions, binding affinities and activation/inhibition mechanisms

Authors: Zhangli Lu, Chuqi Lei, Kaili Wang, Libo Qin, Jing Tang, Min Li

Abstract: Accurate and robust prediction of drug-target interactions (DTIs) plays a vital role in drug discovery. Despite extensive efforts have been invested in predicting novel DTIs, existing approaches still suffer from insufficient labeled data and cold start problems. More importantly, there is currently a lack of studies focusing on elucidating the mechanism of action (MoA) between drugs and targets.… ▽ More Accurate and robust prediction of drug-target interactions (DTIs) plays a vital role in drug discovery. Despite extensive efforts have been invested in predicting novel DTIs, existing approaches still suffer from insufficient labeled data and cold start problems. More importantly, there is currently a lack of studies focusing on elucidating the mechanism of action (MoA) between drugs and targets. Distinguishing the activation and inhibition mechanisms is critical and challenging in drug development. Here, we introduce a unified framework called DTIAM, which aims to predict interactions, binding affinities, and activation/inhibition mechanisms between drugs and targets. DTIAM learns drug and target representations from large amounts of label-free data through self-supervised pre-training, which accurately extracts the substructure and contextual information of drugs and targets, and thus benefits the downstream prediction based on these representations. DTIAM achieves substantial performance improvement over other state-of-the-art methods in all tasks, particularly in the cold start scenario. Moreover, independent validation demonstrates the strong generalization ability of DTIAM. All these results suggested that DTIAM can provide a practically useful tool for predicting novel DTIs and further distinguishing the MoA of candidate drugs. DTIAM, for the first time, provides a unified framework for accurate and robust prediction of drug-target interactions, binding affinities, and activation/inhibition mechanisms. △ Less

Submitted 23 December, 2023; originally announced December 2023.

arXiv:2312.00080 [pdf, other]

PDB-Struct: A Comprehensive Benchmark for Structure-based Protein Design

Authors: Chuanrui Wang, Bozitao Zhong, Zuobai Zhang, Narendra Chaudhary, Sanchit Misra, Jian Tang

Abstract: Structure-based protein design has attracted increasing interest, with numerous methods being introduced in recent years. However, a universally accepted method for evaluation has not been established, since the wet-lab validation can be overly time-consuming for the development of new algorithms, and the $\textit{in silico}$ validation with recovery and perplexity metrics is efficient but may not… ▽ More Structure-based protein design has attracted increasing interest, with numerous methods being introduced in recent years. However, a universally accepted method for evaluation has not been established, since the wet-lab validation can be overly time-consuming for the development of new algorithms, and the $\textit{in silico}$ validation with recovery and perplexity metrics is efficient but may not precisely reflect true foldability. To address this gap, we introduce two novel metrics: refoldability-based metric, which leverages high-accuracy protein structure prediction models as a proxy for wet lab experiments, and stability-based metric, which assesses whether models can assign high likelihoods to experimentally stable proteins. We curate datasets from high-quality CATH protein data, high-throughput $\textit{de novo}$ designed proteins, and mega-scale experimental mutagenesis experiments, and in doing so, present the $\textbf{PDB-Struct}$ benchmark that evaluates both recent and previously uncompared protein design methods. Experimental results indicate that ByProt, ProteinMPNN, and ESM-IF perform exceptionally well on our benchmark, while ESM-Design and AF-Design fall short on the refoldability metric. We also show that while some methods exhibit high sequence recovery, they do not perform as well on our new benchmark. Our proposed benchmark paves the way for a fair and comprehensive evaluation of protein design methods in the future. Code is available at https://github.com/WANG-CR/PDB-Struct. △ Less

Submitted 29 November, 2023; originally announced December 2023.

Comments: 13 pages

arXiv:2310.10697 [pdf]

Synthetic IMU Datasets and Protocols Can Simplify Fall Detection Experiments and Optimize Sensor Configuration

Authors: Jie Tang, Bin He, Junkai Xu, Tian Tan, Zhipeng Wang, Yanmin Zhou, Shuo Jiang

Abstract: Falls represent a significant cause of injury among the elderly population. Extensive research has been devoted to the utilization of wearable IMU sensors in conjunction with machine learning techniques for fall detection. To address the challenge of acquiring costly training data, this paper presents a novel method that generates a substantial volume of synthetic IMU data with minimal real fall e… ▽ More Falls represent a significant cause of injury among the elderly population. Extensive research has been devoted to the utilization of wearable IMU sensors in conjunction with machine learning techniques for fall detection. To address the challenge of acquiring costly training data, this paper presents a novel method that generates a substantial volume of synthetic IMU data with minimal real fall experiments. First, unmarked 3D motion capture technology is employed to reconstruct human movements. Subsequently, utilizing the biomechanical simulation platform Opensim and forward kinematic methods, an ample amount of training data from various body segments can be custom generated. An LSTM model is trained, achieving testing accuracies of 91.99% and 86.62% on two distinct datasets of actual fall-related IMU data, demonstrated the comparable performance of models trained using genuine IMU data. Building upon the simulation framework, this paper further optimized the single IMU attachment position and multiple IMU combinations on fall detection. The proposed method simplifies fall detection data acquisition experiments, provides novel venue for generating low cost synthetic data in scenario where acquiring data for machine learning is challenging and paves the way for customizing machine learning configurations. △ Less

Submitted 16 October, 2023; originally announced October 2023.

Comments: 11 pages, 7 figures

arXiv:2306.09375 [pdf, other]

Symmetry-Informed Geometric Representation for Molecules, Proteins, and Crystalline Materials

Authors: Shengchao Liu, Weitao Du, Yanjing Li, Zhuoxinran Li, Zhiling Zheng, Chenru Duan, Zhiming Ma, Omar Yaghi, Anima Anandkumar, Christian Borgs, Jennifer Chayes, Hongyu Guo, Jian Tang

Abstract: Artificial intelligence for scientific discovery has recently generated significant interest within the machine learning and scientific communities, particularly in the domains of chemistry, biology, and material discovery. For these scientific problems, molecules serve as the fundamental building blocks, and machine learning has emerged as a highly effective and powerful tool for modeling their g… ▽ More Artificial intelligence for scientific discovery has recently generated significant interest within the machine learning and scientific communities, particularly in the domains of chemistry, biology, and material discovery. For these scientific problems, molecules serve as the fundamental building blocks, and machine learning has emerged as a highly effective and powerful tool for modeling their geometric structures. Nevertheless, due to the rapidly evolving process of the field and the knowledge gap between science (e.g., physics, chemistry, & biology) and machine learning communities, a benchmarking study on geometrical representation for such data has not been conducted. To address such an issue, in this paper, we first provide a unified view of the current symmetry-informed geometric methods, classifying them into three main categories: invariance, equivariance with spherical frame basis, and equivariance with vector frame basis. Then we propose a platform, coined Geom3D, which enables benchmarking the effectiveness of geometric strategies. Geom3D contains 16 advanced symmetry-informed geometric representation models and 14 geometric pretraining methods over 46 diverse datasets, including small molecules, proteins, and crystalline materials. We hope that Geom3D can, on the one hand, eliminate barriers for machine learning researchers interested in exploring scientific problems; and, on the other hand, provide valuable guidance for researchers in computational chemistry, structural biology, and materials science, aiding in the informed selection of representation techniques for specific applications. △ Less

Submitted 15 June, 2023; originally announced June 2023.

arXiv:2306.07505 [pdf]

Deep learning radiomics for assessment of gastroesophageal varices in people with compensated advanced chronic liver disease

Authors: Lan Wang, Ruiling He, Lili Zhao, Jia Wang, Zhengzi Geng, Tao Ren, Guo Zhang, Peng Zhang, Kaiqiang Tang, Chaofei Gao, Fei Chen, Liting Zhang, Yonghe Zhou, Xin Li, Fanbin He, Hui Huan, Wenjuan Wang, Yunxiao Liang, Juan Tang, Fang Ai, Tingyu Wang, Liyun Zheng, Zhongwei Zhao, Jiansong Ji, Wei Liu , et al. (22 additional authors not shown)

Abstract: Objective: Bleeding from gastroesophageal varices (GEV) is a medical emergency associated with high mortality. We aim to construct an artificial intelligence-based model of two-dimensional shear wave elastography (2D-SWE) of the liver and spleen to precisely assess the risk of GEV and high-risk gastroesophageal varices (HRV). Design: A prospective multicenter study was conducted in patients with… ▽ More Objective: Bleeding from gastroesophageal varices (GEV) is a medical emergency associated with high mortality. We aim to construct an artificial intelligence-based model of two-dimensional shear wave elastography (2D-SWE) of the liver and spleen to precisely assess the risk of GEV and high-risk gastroesophageal varices (HRV). Design: A prospective multicenter study was conducted in patients with compensated advanced chronic liver disease. 305 patients were enrolled from 12 hospitals, and finally 265 patients were included, with 1136 liver stiffness measurement (LSM) images and 1042 spleen stiffness measurement (SSM) images generated by 2D-SWE. We leveraged deep learning methods to uncover associations between image features and patient risk, and thus conducted models to predict GEV and HRV. Results: A multi-modality Deep Learning Risk Prediction model (DLRP) was constructed to assess GEV and HRV, based on LSM and SSM images, and clinical information. Validation analysis revealed that the AUCs of DLRP were 0.91 for GEV (95% CI 0.90 to 0.93, p < 0.05) and 0.88 for HRV (95% CI 0.86 to 0.89, p < 0.01), which were significantly and robustly better than canonical risk indicators, including the value of LSM and SSM. Moreover, DLPR was better than the model using individual parameters, including LSM and SSM images. In HRV prediction, the 2D-SWE images of SSM outperform LSM (p < 0.01). Conclusion: DLRP shows excellent performance in predicting GEV and HRV over canonical risk indicators LSM and SSM. Additionally, the 2D-SWE images of SSM provided more information for better accuracy in predicting HRV than the LSM. △ Less

Submitted 12 June, 2023; originally announced June 2023.

arXiv:2306.03117 [pdf, other]

Str2Str: A Score-based Framework for Zero-shot Protein Conformation Sampling

Authors: Jiarui Lu, Bozitao Zhong, Zuobai Zhang, Jian Tang

Abstract: The dynamic nature of proteins is crucial for determining their biological functions and properties, for which Monte Carlo (MC) and molecular dynamics (MD) simulations stand as predominant tools to study such phenomena. By utilizing empirically derived force fields, MC or MD simulations explore the conformational space through numerically evolving the system via Markov chain or Newtonian mechanics… ▽ More The dynamic nature of proteins is crucial for determining their biological functions and properties, for which Monte Carlo (MC) and molecular dynamics (MD) simulations stand as predominant tools to study such phenomena. By utilizing empirically derived force fields, MC or MD simulations explore the conformational space through numerically evolving the system via Markov chain or Newtonian mechanics. However, the high-energy barrier of the force fields can hamper the exploration of both methods by the rare event, resulting in inadequately sampled ensemble without exhaustive running. Existing learning-based approaches perform direct sampling yet heavily rely on target-specific simulation data for training, which suffers from high data acquisition cost and poor generalizability. Inspired by simulated annealing, we propose Str2Str, a novel structure-to-structure translation framework capable of zero-shot conformation sampling with roto-translation equivariant property. Our method leverages an amortized denoising score matching objective trained on general crystal structures and has no reliance on simulation data during both training and inference. Experimental results across several benchmarking protein systems demonstrate that Str2Str outperforms previous state-of-the-art generative structure prediction models and can be orders of magnitude faster compared to long MD simulations. Our open-source implementation is available at https://github.com/lujiarui/Str2Str △ Less

Submitted 11 March, 2024; v1 submitted 5 June, 2023; originally announced June 2023.

Comments: Published as a conference paper at ICLR 2024, see https://openreview.net/forum?id=C4BikKsgmK

arXiv:2306.01794 [pdf, other]

DiffPack: A Torsional Diffusion Model for Autoregressive Protein Side-Chain Packing

Authors: Yangtian Zhang, Zuobai Zhang, Bozitao Zhong, Sanchit Misra, Jian Tang

Abstract: Proteins play a critical role in carrying out biological functions, and their 3D structures are essential in determining their functions. Accurately predicting the conformation of protein side-chains given their backbones is important for applications in protein structure prediction, design and protein-protein interactions. Traditional methods are computationally intensive and have limited accurac… ▽ More Proteins play a critical role in carrying out biological functions, and their 3D structures are essential in determining their functions. Accurately predicting the conformation of protein side-chains given their backbones is important for applications in protein structure prediction, design and protein-protein interactions. Traditional methods are computationally intensive and have limited accuracy, while existing machine learning methods treat the problem as a regression task and overlook the restrictions imposed by the constant covalent bond lengths and angles. In this work, we present DiffPack, a torsional diffusion model that learns the joint distribution of side-chain torsional angles, the only degrees of freedom in side-chain packing, by diffusing and denoising on the torsional space. To avoid issues arising from simultaneous perturbation of all four torsional angles, we propose autoregressively generating the four torsional angles from $χ_1$ to $χ_4$ and training diffusion models for each torsional angle. We evaluate the method on several benchmarks for protein side-chain packing and show that our method achieves improvements of $11.9\%$ and $13.5\%$ in angle accuracy on CASP13 and CASP14, respectively, with a significantly smaller model size ($60\times$ fewer parameters). Additionally, we show the effectiveness of our method in enhancing side-chain predictions in the AlphaFold2 model. Code is available at https://github.com/DeepGraphLearning/DiffPack. △ Less

Submitted 15 February, 2024; v1 submitted 1 June, 2023; originally announced June 2023.

Comments: 37th Conference on Neural Information Processing Systems (NeurIPS 2023)

arXiv:2305.18407 [pdf, other]

A Group Symmetric Stochastic Differential Equation Model for Molecule Multi-modal Pretraining

Authors: Shengchao Liu, Weitao Du, Zhiming Ma, Hongyu Guo, Jian Tang

Abstract: Molecule pretraining has quickly become the go-to schema to boost the performance of AI-based drug discovery. Naturally, molecules can be represented as 2D topological graphs or 3D geometric point clouds. Although most existing pertaining methods focus on merely the single modality, recent research has shown that maximizing the mutual information (MI) between such two modalities enhances the molec… ▽ More Molecule pretraining has quickly become the go-to schema to boost the performance of AI-based drug discovery. Naturally, molecules can be represented as 2D topological graphs or 3D geometric point clouds. Although most existing pertaining methods focus on merely the single modality, recent research has shown that maximizing the mutual information (MI) between such two modalities enhances the molecule representation ability. Meanwhile, existing molecule multi-modal pretraining approaches approximate MI based on the representation space encoded from the topology and geometry, thus resulting in the loss of critical structural information of molecules. To address this issue, we propose MoleculeSDE. MoleculeSDE leverages group symmetric (e.g., SE(3)-equivariant and reflection-antisymmetric) stochastic differential equation models to generate the 3D geometries from 2D topologies, and vice versa, directly in the input space. It not only obtains tighter MI bound but also enables prosperous downstream tasks than the previous work. By comparing with 17 pretraining baselines, we empirically verify that MoleculeSDE can learn an expressive representation with state-of-the-art performance on 26 out of 32 downstream tasks. △ Less

Submitted 28 May, 2023; originally announced May 2023.

arXiv:2304.12825 [pdf, other]

GraphVF: Controllable Protein-Specific 3D Molecule Generation with Variational Flow

Authors: Fang Sun, Zhihao Zhan, Hongyu Guo, Ming Zhang, Jian Tang

Abstract: Designing molecules that bind to specific target proteins is a fundamental task in drug discovery. Recent models leverage geometric constraints to generate ligand molecules that bind cohesively with specific protein pockets. However, these models cannot effectively generate 3D molecules with 2D skeletal curtailments and property constraints, which are pivotal to drug potency and development. To ta… ▽ More Designing molecules that bind to specific target proteins is a fundamental task in drug discovery. Recent models leverage geometric constraints to generate ligand molecules that bind cohesively with specific protein pockets. However, these models cannot effectively generate 3D molecules with 2D skeletal curtailments and property constraints, which are pivotal to drug potency and development. To tackle this challenge, we propose GraphVF, a variational flow-based framework that combines 2D topology and 3D geometry, for controllable generation of binding 3D molecules. Empirically, our method achieves state-of-the-art binding affinity and realistic sub-structural layouts for protein-specific generation. In particular, GraphVF represents the first controllable geometry-aware, protein-specific molecule generation method, which can generate binding 3D molecules with tailored sub-structures and physio-chemical properties. Our code is available at https://github.com/Franco-Solis/GraphVF-code. △ Less

Submitted 23 February, 2023; originally announced April 2023.

Comments: 15 pages, 8 figures

arXiv:2303.06275 [pdf, other]

A Systematic Study of Joint Representation Learning on Protein Sequences and Structures

Authors: Zuobai Zhang, Chuanrui Wang, Minghao Xu, Vijil Chenthamarakshan, Aurélie Lozano, Payel Das, Jian Tang

Abstract: Learning effective protein representations is critical in a variety of tasks in biology such as predicting protein functions. Recent sequence representation learning methods based on Protein Language Models (PLMs) excel in sequence-based tasks, but their direct adaptation to tasks involving protein structures remains a challenge. In contrast, structure-based methods leverage 3D structural informat… ▽ More Learning effective protein representations is critical in a variety of tasks in biology such as predicting protein functions. Recent sequence representation learning methods based on Protein Language Models (PLMs) excel in sequence-based tasks, but their direct adaptation to tasks involving protein structures remains a challenge. In contrast, structure-based methods leverage 3D structural information with graph neural networks and geometric pre-training methods show potential in function prediction tasks, but still suffers from the limited number of available structures. To bridge this gap, our study undertakes a comprehensive exploration of joint protein representation learning by integrating a state-of-the-art PLM (ESM-2) with distinct structure encoders (GVP, GearNet, CDConv). We introduce three representation fusion strategies and explore different pre-training techniques. Our method achieves significant improvements over existing sequence- and structure-based methods, setting new state-of-the-art for function annotation. This study underscores several important design choices for fusing protein sequence and structure information. Our implementation is available at https://github.com/DeepGraphLearning/ESM-GearNet. △ Less

Submitted 18 October, 2023; v1 submitted 10 March, 2023; originally announced March 2023.

arXiv:2303.00233 [pdf, other]

doi 10.1145/3583780.3615061

Single-Cell Multimodal Prediction via Transformers

Authors: Wenzhuo Tang, Hongzhi Wen, Renming Liu, Jiayuan Ding, Wei Jin, Yuying Xie, Hui Liu, Jiliang Tang

Abstract: The recent development of multimodal single-cell technology has made the possibility of acquiring multiple omics data from individual cells, thereby enabling a deeper understanding of cellular states and dynamics. Nevertheless, the proliferation of multimodal single-cell data also introduces tremendous challenges in modeling the complex interactions among different modalities. The recently advance… ▽ More The recent development of multimodal single-cell technology has made the possibility of acquiring multiple omics data from individual cells, thereby enabling a deeper understanding of cellular states and dynamics. Nevertheless, the proliferation of multimodal single-cell data also introduces tremendous challenges in modeling the complex interactions among different modalities. The recently advanced methods focus on constructing static interaction graphs and applying graph neural networks (GNNs) to learn from multimodal data. However, such static graphs can be suboptimal as they do not take advantage of the downstream task information; meanwhile GNNs also have some inherent limitations when deeply stacking GNN layers. To tackle these issues, in this work, we investigate how to leverage transformers for multimodal single-cell data in an end-to-end manner while exploiting downstream task information. In particular, we propose a scMoFormer framework which can readily incorporate external domain knowledge and model the interactions within each modality and cross modalities. Extensive experiments demonstrate that scMoFormer achieves superior performance on various benchmark datasets. Remarkably, scMoFormer won a Kaggle silver medal with the rank of 24/1221 (Top 2%) without ensemble in a NeurIPS 2022 competition. Our implementation is publicly available at Github. △ Less

Submitted 13 October, 2023; v1 submitted 1 March, 2023; originally announced March 2023.

Comments: CIKM 2023

Journal ref: In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management (CIKM 23), 2023, Birmingham, United Kingdom

arXiv:2302.04611 [pdf, other]

A Text-guided Protein Design Framework

Authors: Shengchao Liu, Yanjing Li, Zhuoxinran Li, Anthony Gitter, Yutao Zhu, Jiarui Lu, Zhao Xu, Weili Nie, Arvind Ramanathan, Chaowei Xiao, Jian Tang, Hongyu Guo, Anima Anandkumar

Abstract: Current AI-assisted protein design mainly utilizes protein sequential and structural information. Meanwhile, there exists tremendous knowledge curated by humans in the text format describing proteins' high-level functionalities. Yet, whether the incorporation of such text data can help protein design tasks has not been explored. To bridge this gap, we propose ProteinDT, a multi-modal framework tha… ▽ More Current AI-assisted protein design mainly utilizes protein sequential and structural information. Meanwhile, there exists tremendous knowledge curated by humans in the text format describing proteins' high-level functionalities. Yet, whether the incorporation of such text data can help protein design tasks has not been explored. To bridge this gap, we propose ProteinDT, a multi-modal framework that leverages textual descriptions for protein design. ProteinDT consists of three subsequent steps: ProteinCLAP which aligns the representation of two modalities, a facilitator that generates the protein representation from the text modality, and a decoder that creates the protein sequences from the representation. To train ProteinDT, we construct a large dataset, SwissProtCLAP, with 441K text and protein pairs. We quantitatively verify the effectiveness of ProteinDT on three challenging tasks: (1) over 90\% accuracy for text-guided protein generation; (2) best hit ratio on 12 zero-shot text-guided protein editing tasks; (3) superior performance on four out of six protein property prediction benchmarks. △ Less

Submitted 12 August, 2024; v1 submitted 9 February, 2023; originally announced February 2023.

arXiv:2301.12040 [pdf, other]

ProtST: Multi-Modality Learning of Protein Sequences and Biomedical Texts

Authors: Minghao Xu, Xinyu Yuan, Santiago Miret, Jian Tang

Abstract: Current protein language models (PLMs) learn protein representations mainly based on their sequences, thereby well capturing co-evolutionary information, but they are unable to explicitly acquire protein functions, which is the end goal of protein representation learning. Fortunately, for many proteins, their textual property descriptions are available, where their various functions are also descr… ▽ More Current protein language models (PLMs) learn protein representations mainly based on their sequences, thereby well capturing co-evolutionary information, but they are unable to explicitly acquire protein functions, which is the end goal of protein representation learning. Fortunately, for many proteins, their textual property descriptions are available, where their various functions are also described. Motivated by this fact, we first build the ProtDescribe dataset to augment protein sequences with text descriptions of their functions and other important properties. Based on this dataset, we propose the ProtST framework to enhance Protein Sequence pre-training and understanding by biomedical Texts. During pre-training, we design three types of tasks, i.e., unimodal mask prediction, multimodal representation alignment and multimodal mask prediction, to enhance a PLM with protein property information with different granularities and, at the same time, preserve the PLM's original representation power. On downstream tasks, ProtST enables both supervised learning and zero-shot prediction. We verify the superiority of ProtST-induced PLMs over previous ones on diverse representation learning benchmarks. Under the zero-shot setting, we show the effectiveness of ProtST on zero-shot protein classification, and ProtST also enables functional protein retrieval from a large-scale database without any function annotation. △ Less

Submitted 4 July, 2023; v1 submitted 27 January, 2023; originally announced January 2023.

Comments: Accpeted by ICML 2023 (Oral), code and data released

arXiv:2212.10789 [pdf, other]

Multi-modal Molecule Structure-text Model for Text-based Retrieval and Editing

Authors: Shengchao Liu, Weili Nie, Chengpeng Wang, Jiarui Lu, Zhuoran Qiao, Ling Liu, Jian Tang, Chaowei Xiao, Anima Anandkumar

Abstract: There is increasing adoption of artificial intelligence in drug discovery. However, existing studies use machine learning to mainly utilize the chemical structures of molecules but ignore the vast textual knowledge available in chemistry. Incorporating textual knowledge enables us to realize new drug design objectives, adapt to text-based instructions and predict complex biological activities. Her… ▽ More There is increasing adoption of artificial intelligence in drug discovery. However, existing studies use machine learning to mainly utilize the chemical structures of molecules but ignore the vast textual knowledge available in chemistry. Incorporating textual knowledge enables us to realize new drug design objectives, adapt to text-based instructions and predict complex biological activities. Here we present a multi-modal molecule structure-text model, MoleculeSTM, by jointly learning molecules' chemical structures and textual descriptions via a contrastive learning strategy. To train MoleculeSTM, we construct a large multi-modal dataset, namely, PubChemSTM, with over 280,000 chemical structure-text pairs. To demonstrate the effectiveness and utility of MoleculeSTM, we design two challenging zero-shot tasks based on text instructions, including structure-text retrieval and molecule editing. MoleculeSTM has two main properties: open vocabulary and compositionality via natural language. In experiments, MoleculeSTM obtains the state-of-the-art generalization ability to novel biochemical concepts across various benchmarks. △ Less

Submitted 29 January, 2024; v1 submitted 21 December, 2022; originally announced December 2022.

arXiv:2210.12385 [pdf, other]

Deep Learning in Single-Cell Analysis

Authors: Dylan Molho, Jiayuan Ding, Zhaoheng Li, Hongzhi Wen, Wenzhuo Tang, Yixin Wang, Julian Venegas, Wei Jin, Renming Liu, Runze Su, Patrick Danaher, Robert Yang, Yu Leo Lei, Yuying Xie, Jiliang Tang

Abstract: Single-cell technologies are revolutionizing the entire field of biology. The large volumes of data generated by single-cell technologies are high-dimensional, sparse, heterogeneous, and have complicated dependency structures, making analyses using conventional machine learning approaches challenging and impractical. In tackling these challenges, deep learning often demonstrates superior performan… ▽ More Single-cell technologies are revolutionizing the entire field of biology. The large volumes of data generated by single-cell technologies are high-dimensional, sparse, heterogeneous, and have complicated dependency structures, making analyses using conventional machine learning approaches challenging and impractical. In tackling these challenges, deep learning often demonstrates superior performance compared to traditional machine learning methods. In this work, we give a comprehensive survey on deep learning in single-cell analysis. We first introduce background on single-cell technologies and their development, as well as fundamental concepts of deep learning including the most popular deep architectures. We present an overview of the single-cell analytic pipeline pursued in research applications while noting divergences due to data sources or specific applications. We then review seven popular tasks spanning through different stages of the single-cell analysis pipeline, including multimodal integration, imputation, clustering, spatial domain identification, cell-type deconvolution, cell segmentation, and cell-type annotation. Under each task, we describe the most recent developments in classical and deep learning methods and discuss their advantages and disadvantages. Deep learning tools and benchmark datasets are also summarized for each task. Finally, we discuss the future directions and the most recent challenges. This survey will serve as a reference for biologists and computer scientists, encouraging collaborations. △ Less

Submitted 5 November, 2022; v1 submitted 22 October, 2022; originally announced October 2022.

Comments: 77 pages, 11 figures, 15 tables, deep learning, single-cell analysis

arXiv:2210.08761 [pdf, other]

Protein Sequence and Structure Co-Design with Equivariant Translation

Authors: Chence Shi, Chuanrui Wang, Jiarui Lu, Bozitao Zhong, Jian Tang

Abstract: Proteins are macromolecules that perform essential functions in all living organisms. Designing novel proteins with specific structures and desired functions has been a long-standing challenge in the field of bioengineering. Existing approaches generate both protein sequence and structure using either autoregressive models or diffusion models, both of which suffer from high inference costs. In thi… ▽ More Proteins are macromolecules that perform essential functions in all living organisms. Designing novel proteins with specific structures and desired functions has been a long-standing challenge in the field of bioengineering. Existing approaches generate both protein sequence and structure using either autoregressive models or diffusion models, both of which suffer from high inference costs. In this paper, we propose a new approach capable of protein sequence and structure co-design, which iteratively translates both protein sequence and structure into the desired state from random initialization, based on context features given a priori. Our model consists of a trigonometry-aware encoder that reasons geometrical constraints and interactions from context features, and a roto-translation equivariant decoder that translates protein sequence and structure interdependently. Notably, all protein amino acids are updated in one shot in each translation step, which significantly accelerates the inference process. Experimental results across multiple tasks show that our model outperforms previous state-of-the-art baselines by a large margin, and is able to design proteins of high fidelity as regards both sequence and structure, with running time orders of magnitude less than sampling-based methods. △ Less

Submitted 2 March, 2023; v1 submitted 17 October, 2022; originally announced October 2022.

Comments: Published as a conference paper at ICLR 2023, see https://openreview.net/forum?id=pRCMXcfdihq

arXiv:2210.06069 [pdf, other]

E3Bind: An End-to-End Equivariant Network for Protein-Ligand Docking

Authors: Yangtian Zhang, Huiyu Cai, Chence Shi, Bozitao Zhong, Jian Tang

Abstract: In silico prediction of the ligand binding pose to a given protein target is a crucial but challenging task in drug discovery. This work focuses on blind flexible selfdocking, where we aim to predict the positions, orientations and conformations of docked molecules. Traditional physics-based methods usually suffer from inaccurate scoring functions and high inference cost. Recently, data-driven met… ▽ More In silico prediction of the ligand binding pose to a given protein target is a crucial but challenging task in drug discovery. This work focuses on blind flexible selfdocking, where we aim to predict the positions, orientations and conformations of docked molecules. Traditional physics-based methods usually suffer from inaccurate scoring functions and high inference cost. Recently, data-driven methods based on deep learning techniques are attracting growing interest thanks to their efficiency during inference and promising performance. These methods usually either adopt a two-stage approach by first predicting the distances between proteins and ligands and then generating the final coordinates based on the predicted distances, or directly predicting the global roto-translation of ligands. In this paper, we take a different route. Inspired by the resounding success of AlphaFold2 for protein structure prediction, we propose E3Bind, an end-to-end equivariant network that iteratively updates the ligand pose. E3Bind models the protein-ligand interaction through careful consideration of the geometric constraints in docking and the local context of the binding site. Experiments on standard benchmark datasets demonstrate the superior performance of our end-to-end trainable model compared to traditional and recently-proposed deep learning methods. △ Less

Submitted 1 June, 2023; v1 submitted 12 October, 2022; originally announced October 2022.

Comments: International Conference on Learning Representations (ICLR 2023)

arXiv:2209.15315 [pdf, other]

FusionRetro: Molecule Representation Fusion via In-Context Learning for Retrosynthetic Planning

Authors: Songtao Liu, Zhengkai Tu, Minkai Xu, Zuobai Zhang, Lu Lin, Rex Ying, Jian Tang, Peilin Zhao, Dinghao Wu

Abstract: Retrosynthetic planning aims to devise a complete multi-step synthetic route from starting materials to a target molecule. Current strategies use a decoupled approach of single-step retrosynthesis models and search algorithms, taking only the product as the input to predict the reactants for each planning step and ignoring valuable context information along the synthetic route. In this work, we pr… ▽ More Retrosynthetic planning aims to devise a complete multi-step synthetic route from starting materials to a target molecule. Current strategies use a decoupled approach of single-step retrosynthesis models and search algorithms, taking only the product as the input to predict the reactants for each planning step and ignoring valuable context information along the synthetic route. In this work, we propose a novel framework that utilizes context information for improved retrosynthetic planning. We view synthetic routes as reaction graphs and propose to incorporate context through three principled steps: encode molecules into embeddings, aggregate information over routes, and readout to predict reactants. Our approach is the first attempt to utilize in-context learning for retrosynthesis prediction in retrosynthetic planning. The entire framework can be efficiently optimized in an end-to-end fashion and produce more practical and accurate predictions. Comprehensive experiments demonstrate that by fusing in the context information over routes, our model significantly improves the performance of retrosynthetic planning over baselines that are not context-aware, especially for long synthetic routes. Code is available at https://github.com/SongtaoLiu0823/FusionRetro. △ Less

Submitted 31 May, 2023; v1 submitted 30 September, 2022; originally announced September 2022.

Comments: Accepted by ICML 2023

arXiv:2207.05561 [pdf, other]

Brain-inspired Graph Spiking Neural Networks for Commonsense Knowledge Representation and Reasoning

Authors: Hongjian Fang, Yi Zeng, Jianbo Tang, Yuwei Wang, Yao Liang, Xin Liu

Abstract: How neural networks in the human brain represent commonsense knowledge, and complete related reasoning tasks is an important research topic in neuroscience, cognitive science, psychology, and artificial intelligence. Although the traditional artificial neural network using fixed-length vectors to represent symbols has gained good performance in some specific tasks, it is still a black box that lac… ▽ More How neural networks in the human brain represent commonsense knowledge, and complete related reasoning tasks is an important research topic in neuroscience, cognitive science, psychology, and artificial intelligence. Although the traditional artificial neural network using fixed-length vectors to represent symbols has gained good performance in some specific tasks, it is still a black box that lacks interpretability, far from how humans perceive the world. Inspired by the grandmother-cell hypothesis in neuroscience, this work investigates how population encoding and spiking timing-dependent plasticity (STDP) mechanisms can be integrated into the learning of spiking neural networks, and how a population of neurons can represent a symbol via guiding the completion of sequential firing between different neuron populations. The neuron populations of different communities together constitute the entire commonsense knowledge graph, forming a giant graph spiking neural network. Moreover, we introduced the Reward-modulated spiking timing-dependent plasticity (R-STDP) mechanism to simulate the biological reinforcement learning process and completed the related reasoning tasks accordingly, achieving comparable accuracy and faster convergence speed than the graph convolutional artificial neural networks. For the fields of neuroscience and cognitive science, the work in this paper provided the foundation of computational modeling for further exploration of the way the human brain represents commonsense knowledge. For the field of artificial intelligence, this paper indicated the exploration direction for realizing a more robust and interpretable neural network by constructing a commonsense knowledge representation and reasoning spiking neural networks with solid biological plausibility. △ Less

Submitted 11 July, 2022; originally announced July 2022.

arXiv:2207.00821 [pdf]

PGMG: A Pharmacophore-Guided Deep Learning Approach for Bioactive Molecular Generation

Authors: Huimin Zhu, Renyi Zhou, Jing Tang, Min Li

Abstract: The rational design of novel molecules with desired bioactivity is a critical but challenging task in drug discovery, especially when treating a novel target family or understudied targets. Here, we propose PGMG, a pharmacophore-guided deep learning approach for bioactivate molecule generation. Through the guidance of pharmacophore, PGMG provides a flexible strategy to generate bioactive molecules… ▽ More The rational design of novel molecules with desired bioactivity is a critical but challenging task in drug discovery, especially when treating a novel target family or understudied targets. Here, we propose PGMG, a pharmacophore-guided deep learning approach for bioactivate molecule generation. Through the guidance of pharmacophore, PGMG provides a flexible strategy to generate bioactive molecules with structural diversity in various scenarios using a trained variational autoencoder. We show that PGMG can generate molecules matching given pharmacophore models while maintaining a high level of validity, uniqueness, and novelty. In the case studies, we demonstrate the application of PGMG to generate bioactive molecules in ligand-based and structure-based drug de novo design, as well as in lead optimization scenarios. Overall, the flexibility and effectiveness of PGMG make it a useful tool for accelerating the drug discovery process. △ Less

Submitted 2 July, 2022; originally announced July 2022.

arXiv:2206.13602 [pdf, other]

Molecular Geometry Pretraining with SE(3)-Invariant Denoising Distance Matching

Authors: Shengchao Liu, Hongyu Guo, Jian Tang

Abstract: Molecular representation pretraining is critical in various applications for drug and material discovery due to the limited number of labeled molecules, and most existing work focuses on pretraining on 2D molecular graphs. However, the power of pretraining on 3D geometric structures has been less explored. This is owing to the difficulty of finding a sufficient proxy task that can empower the pret… ▽ More Molecular representation pretraining is critical in various applications for drug and material discovery due to the limited number of labeled molecules, and most existing work focuses on pretraining on 2D molecular graphs. However, the power of pretraining on 3D geometric structures has been less explored. This is owing to the difficulty of finding a sufficient proxy task that can empower the pretraining to effectively extract essential features from the geometric structures. Motivated by the dynamic nature of 3D molecules, where the continuous motion of a molecule in the 3D Euclidean space forms a smooth potential energy surface, we propose GeoSSL, a 3D coordinate denoising pretraining framework to model such an energy landscape. Further by leveraging an SE(3)-invariant score matching method, we propose GeoSSL-DDM in which the coordinate denoising proxy task is effectively boiled down to denoising the pairwise atomic distances in a molecule. Our comprehensive experiments confirm the effectiveness and robustness of our proposed method. △ Less

Submitted 28 February, 2023; v1 submitted 27 June, 2022; originally announced June 2022.

arXiv:2206.08005 [pdf, other]

Evaluating Self-Supervised Learning for Molecular Graph Embeddings

Authors: Hanchen Wang, Jean Kaddour, Shengchao Liu, Jian Tang, Joan Lasenby, Qi Liu

Abstract: Graph Self-Supervised Learning (GSSL) provides a robust pathway for acquiring embeddings without expert labelling, a capability that carries profound implications for molecular graphs due to the staggering number of potential molecules and the high cost of obtaining labels. However, GSSL methods are designed not for optimisation within a specific domain but rather for transferability across a vari… ▽ More Graph Self-Supervised Learning (GSSL) provides a robust pathway for acquiring embeddings without expert labelling, a capability that carries profound implications for molecular graphs due to the staggering number of potential molecules and the high cost of obtaining labels. However, GSSL methods are designed not for optimisation within a specific domain but rather for transferability across a variety of downstream tasks. This broad applicability complicates their evaluation. Addressing this challenge, we present "Molecular Graph Representation Evaluation" (MOLGRAPHEVAL), generating detailed profiles of molecular graph embeddings with interpretable and diversified attributes. MOLGRAPHEVAL offers a suite of probing tasks grouped into three categories: (i) generic graph, (ii) molecular substructure, and (iii) embedding space properties. By leveraging MOLGRAPHEVAL to benchmark existing GSSL methods against both current downstream datasets and our suite of tasks, we uncover significant inconsistencies between inferences drawn solely from existing datasets and those derived from more nuanced probing. These findings suggest that current evaluation methodologies fail to capture the entirety of the landscape. △ Less

Submitted 18 October, 2023; v1 submitted 16 June, 2022; originally announced June 2022.

Comments: Camera ready, NeurIPS Benchmark 2023

arXiv:2205.11279 [pdf, other]

Tyger: Task-Type-Generic Active Learning for Molecular Property Prediction

Authors: Kuangqi Zhou, Kaixin Wang, Jiashi Feng, Jian Tang, Tingyang Xu, Xinchao Wang

Abstract: How to accurately predict the properties of molecules is an essential problem in AI-driven drug discovery, which generally requires a large amount of annotation for training deep learning models. Annotating molecules, however, is quite costly because it requires lab experiments conducted by experts. To reduce annotation cost, deep Active Learning (AL) methods are developed to select only the most… ▽ More How to accurately predict the properties of molecules is an essential problem in AI-driven drug discovery, which generally requires a large amount of annotation for training deep learning models. Annotating molecules, however, is quite costly because it requires lab experiments conducted by experts. To reduce annotation cost, deep Active Learning (AL) methods are developed to select only the most representative and informative data for annotating. However, existing best deep AL methods are mostly developed for a single type of learning task (e.g., single-label classification), and hence may not perform well in molecular property prediction that involves various task types. In this paper, we propose a Task-type-generic active learning framework (termed Tyger) that is able to handle different types of learning tasks in a unified manner. The key is to learn a chemically-meaningful embedding space and perform active selection fully based on the embeddings, instead of relying on task-type-specific heuristics (e.g., class-wise prediction probability) as done in existing works. Specifically, for learning the embedding space, we instantiate a querying module that learns to translate molecule graphs into corresponding SMILES strings. Furthermore, to ensure that samples selected from the space are both representative and informative, we propose to shape the embedding space by two learning objectives, one based on domain knowledge and the other leveraging feedback from the task learner (i.e., model that performs the learning task at hand). We conduct extensive experiments on benchmark datasets of different task types. Experimental results show that Tyger consistently achieves high AL performance on molecular property prediction, outperforming baselines by a large margin. We also perform ablative experiments to verify the effectiveness of each component in Tyger. △ Less

Submitted 23 May, 2022; originally announced May 2022.

arXiv:2205.11274 [pdf, other]

Single-cell gene regulatory network analysis for mixed cell populations with applications to COVID-19 single cell data

Authors: Junjie Tang, Changhu Wang, Feiyi Xiao, Ruibin Xi

Abstract: Gene regulatory network (GRN) refers to the complex network formed by regulatory interactions between genes in living cells. In this paper, we consider inferring GRNs in single cells based on single cell RNA sequencing (scRNA-seq) data. In scRNA-seq, single cells are often profiled from mixed populations and their cell identities are unknown. A common practice for single cell GRN analysis is to fi… ▽ More Gene regulatory network (GRN) refers to the complex network formed by regulatory interactions between genes in living cells. In this paper, we consider inferring GRNs in single cells based on single cell RNA sequencing (scRNA-seq) data. In scRNA-seq, single cells are often profiled from mixed populations and their cell identities are unknown. A common practice for single cell GRN analysis is to first cluster the cells and infer GRNs for every cluster separately. However, this two-step procedure ignores uncertainty in the clustering step and thus could lead to inaccurate estimation of the networks. To address this problem, we propose to model scRNA-seq by the mixture multivariate Poisson log-normal (MPLN) distribution. The precision matrices of the MPLN are the GRNs of different cell types and can be jointly estimated by maximizing MPLN's lasso-penalized log-likelihood. We show that the MPLN model is identifiable and the resulting penalized log-likelihood estimator is consistent. To avoid the intractable optimization of the MPLN's log-likelihood, we develop an algorithm called VMPLN based on the variational inference method. Comprehensive simulation and real scRNA-seq data analyses reveal that VMPLN performs better than the state-of-the-art single cell GRN methods. △ Less

Submitted 23 May, 2022; originally announced May 2022.

Comments: 95 pages,28 figures

arXiv:2203.04695 [pdf, other]

Structured Multi-task Learning for Molecular Property Prediction

Authors: Shengchao Liu, Meng Qu, Zuobai Zhang, Huiyu Cai, Jian Tang

Abstract: Multi-task learning for molecular property prediction is becoming increasingly important in drug discovery. However, in contrast to other domains, the performance of multi-task learning in drug discovery is still not satisfying as the number of labeled data for each task is too limited, which calls for additional data to complement the data scarcity. In this paper, we study multi-task learning for… ▽ More Multi-task learning for molecular property prediction is becoming increasingly important in drug discovery. However, in contrast to other domains, the performance of multi-task learning in drug discovery is still not satisfying as the number of labeled data for each task is too limited, which calls for additional data to complement the data scarcity. In this paper, we study multi-task learning for molecular property prediction in a novel setting, where a relation graph between tasks is available. We first construct a dataset (ChEMBL-STRING) including around 400 tasks as well as a task relation graph. Then to better utilize such relation graph, we propose a method called SGNN-EBM to systematically investigate the structured task modeling from two perspectives. (1) In the \emph{latent} space, we model the task representations by applying a state graph neural network (SGNN) on the relation graph. (2) In the \emph{output} space, we employ structured prediction with the energy-based model (EBM), which can be efficiently trained through noise-contrastive estimation (NCE) approach. Empirical results justify the effectiveness of SGNN-EBM. Code is available on https://github.com/chao1224/SGNN-EBM. △ Less

Submitted 5 October, 2022; v1 submitted 22 February, 2022; originally announced March 2022.

arXiv:2203.02923 [pdf, other]

GeoDiff: a Geometric Diffusion Model for Molecular Conformation Generation

Authors: Minkai Xu, Lantao Yu, Yang Song, Chence Shi, Stefano Ermon, Jian Tang

Abstract: Predicting molecular conformations from molecular graphs is a fundamental problem in cheminformatics and drug discovery. Recently, significant progress has been achieved with machine learning approaches, especially with deep generative models. Inspired by the diffusion process in classical non-equilibrium thermodynamics where heated particles will diffuse from original states to a noise distributi… ▽ More Predicting molecular conformations from molecular graphs is a fundamental problem in cheminformatics and drug discovery. Recently, significant progress has been achieved with machine learning approaches, especially with deep generative models. Inspired by the diffusion process in classical non-equilibrium thermodynamics where heated particles will diffuse from original states to a noise distribution, in this paper, we propose a novel generative model named GeoDiff for molecular conformation prediction. GeoDiff treats each atom as a particle and learns to directly reverse the diffusion process (i.e., transforming from a noise distribution to stable conformations) as a Markov chain. Modeling such a generation process is however very challenging as the likelihood of conformations should be roto-translational invariant. We theoretically show that Markov chains evolving with equivariant Markov kernels can induce an invariant distribution by design, and further propose building blocks for the Markov kernels to preserve the desirable equivariance property. The whole framework can be efficiently trained in an end-to-end fashion by optimizing a weighted variational lower bound to the (conditional) likelihood. Experiments on multiple benchmarks show that GeoDiff is superior or comparable to existing state-of-the-art approaches, especially on large molecules. △ Less

Submitted 6 March, 2022; originally announced March 2022.

Comments: Published as a conference paper at ICLR 2022 (https://openreview.net/forum?id=PzcvxEMzvQC)

arXiv:2112.06567 [pdf, other]

doi 10.1093/bib/bbac279

Implications of Topological Imbalance for Representation Learning on Biomedical Knowledge Graphs

Authors: Stephen Bonner, Ufuk Kirik, Ola Engkvist, Jian Tang, Ian P Barrett

Abstract: Adoption of recently developed methods from machine learning has given rise to creation of drug-discovery knowledge graphs (KG) that utilize the interconnected nature of the domain. Graph-based modelling of the data, combined with KG embedding (KGE) methods, are promising as they provide a more intuitive representation and are suitable for inference tasks such as predicting missing links. One comm… ▽ More Adoption of recently developed methods from machine learning has given rise to creation of drug-discovery knowledge graphs (KG) that utilize the interconnected nature of the domain. Graph-based modelling of the data, combined with KG embedding (KGE) methods, are promising as they provide a more intuitive representation and are suitable for inference tasks such as predicting missing links. One common application is to produce ranked lists of genes for a given disease, where the rank is based on the perceived likelihood of association between the gene and the disease. It is thus critical that these predictions are not only pertinent but also biologically meaningful. However, KGs can be biased either directly due to the underlying data sources that are integrated or due to modeling choices in the construction of the graph, one consequence of which is that certain entities can get topologically overrepresented. We demonstrate the effect of these inherent structural imbalances, resulting in densely-connected entities being highly ranked no matter the context. We provide support for this observation across different datasets, models as well as predictive tasks. Further, we present various graph perturbation experiments which yield more support to the observation that KGE models can be more influenced by the frequency of entities rather than any biological information encoded within the relations. Our results highlight the importance of data modeling choices, and emphasizes the need for practitioners to be mindful of these issues when interpreting model outputs and during KG composition. △ Less

Submitted 18 March, 2022; v1 submitted 13 December, 2021; originally announced December 2021.

Comments: Briefings in Bioinformatics, 2022

arXiv:2110.07728 [pdf, other]

Pre-training Molecular Graph Representation with 3D Geometry

Authors: Shengchao Liu, Hanchen Wang, Weiyang Liu, Joan Lasenby, Hongyu Guo, Jian Tang

Abstract: Molecular graph representation learning is a fundamental problem in modern drug and material discovery. Molecular graphs are typically modeled by their 2D topological structures, but it has been recently discovered that 3D geometric information plays a more vital role in predicting molecular functionalities. However, the lack of 3D information in real-world scenarios has significantly impeded the… ▽ More Molecular graph representation learning is a fundamental problem in modern drug and material discovery. Molecular graphs are typically modeled by their 2D topological structures, but it has been recently discovered that 3D geometric information plays a more vital role in predicting molecular functionalities. However, the lack of 3D information in real-world scenarios has significantly impeded the learning of geometric graph representation. To cope with this challenge, we propose the Graph Multi-View Pre-training (GraphMVP) framework where self-supervised learning (SSL) is performed by leveraging the correspondence and consistency between 2D topological structures and 3D geometric views. GraphMVP effectively learns a 2D molecular graph encoder that is enhanced by richer and more discriminative 3D geometry. We further provide theoretical insights to justify the effectiveness of GraphMVP. Finally, comprehensive experiments show that GraphMVP can consistently outperform existing graph SSL methods. △ Less

Submitted 29 May, 2022; v1 submitted 7 October, 2021; originally announced October 2021.

arXiv:2108.07435 [pdf, other]

Modeling Protein Using Large-scale Pretrain Language Model

Authors: Yijia Xiao, Jiezhong Qiu, Ziang Li, Chang-Yu Hsieh, Jie Tang

Abstract: Protein is linked to almost every life process. Therefore, analyzing the biological structure and property of protein sequences is critical to the exploration of life, as well as disease detection and drug discovery. Traditional protein analysis methods tend to be labor-intensive and time-consuming. The emergence of deep learning models makes modeling data patterns in large quantities of data poss… ▽ More Protein is linked to almost every life process. Therefore, analyzing the biological structure and property of protein sequences is critical to the exploration of life, as well as disease detection and drug discovery. Traditional protein analysis methods tend to be labor-intensive and time-consuming. The emergence of deep learning models makes modeling data patterns in large quantities of data possible. Interdisciplinary researchers have begun to leverage deep learning methods to model large biological datasets, e.g. using long short-term memory and convolutional neural network for protein sequence classification. After millions of years of evolution, evolutionary information is encoded in protein sequences. Inspired by the similarity between natural language and protein sequences, we use large-scale language models to model evolutionary-scale protein sequences, encoding protein biology information in representation. Significant improvements are observed in both token-level and sequence-level tasks, demonstrating that our large-scale model can accurately capture evolution information from pretraining on evolutionary-scale individual sequences. Our code and model are available at https://github.com/THUDM/ProteinLM. △ Less

Submitted 7 December, 2021; v1 submitted 17 August, 2021; originally announced August 2021.

Comments: Accepted paper in Pretrain@KDD 2021 (The International Workshop on Pretraining: Algorithms, Architectures, and Applications)

arXiv:2106.05388 [pdf]

Neurological Consequences of COVID-19 Infection

Authors: Jiabin Tang, Shivani Patel, Steve Gentleman, Paul Matthews

Abstract: COVID-19 infections have well described systemic manifestations, especially respiratory problems. There are currently no specific treatments or vaccines against the current strain. With higher case numbers, a range of neurological symptoms are becoming apparent. The mechanisms responsible for these are not well defined, other than those related to hypoxia and microthrombi. We speculate that sustai… ▽ More COVID-19 infections have well described systemic manifestations, especially respiratory problems. There are currently no specific treatments or vaccines against the current strain. With higher case numbers, a range of neurological symptoms are becoming apparent. The mechanisms responsible for these are not well defined, other than those related to hypoxia and microthrombi. We speculate that sustained systemic immune activation seen with SARS-CoV-2 may also cause secondary autoimmune activation in the CNS. Patients with chronic neurological diseases may be at higher risk because of chronic secondary respiratory disease and potentially poor nutritional status. Here, we review the impact of COVID-19 on people with chronic neurological diseases and potential mechanisms. We believe special attention to protecting people with neurodegenerative disease is warranted. We are concerned about a possible delayed pandemic in the form of an increased burden of neurodegenerative disease after acceleration of pathology by systemic COVID-19 infections. △ Less

Submitted 9 June, 2021; originally announced June 2021.

Comments: 19 pages, 4 figures

arXiv:2105.07246 [pdf, other]

An End-to-End Framework for Molecular Conformation Generation via Bilevel Programming

Authors: Minkai Xu, Wujie Wang, Shitong Luo, Chence Shi, Yoshua Bengio, Rafael Gomez-Bombarelli, Jian Tang

Abstract: Predicting molecular conformations (or 3D structures) from molecular graphs is a fundamental problem in many applications. Most existing approaches are usually divided into two steps by first predicting the distances between atoms and then generating a 3D structure through optimizing a distance geometry problem. However, the distances predicted with such two-stage approaches may not be able to con… ▽ More Predicting molecular conformations (or 3D structures) from molecular graphs is a fundamental problem in many applications. Most existing approaches are usually divided into two steps by first predicting the distances between atoms and then generating a 3D structure through optimizing a distance geometry problem. However, the distances predicted with such two-stage approaches may not be able to consistently preserve the geometry of local atomic neighborhoods, making the generated structures unsatisfying. In this paper, we propose an end-to-end solution for molecular conformation prediction called ConfVAE based on the conditional variational autoencoder framework. Specifically, the molecular graph is first encoded in a latent space, and then the 3D structures are generated by solving a principled bilevel optimization program. Extensive experiments on several benchmark data sets prove the effectiveness of our proposed approach over existing state-of-the-art approaches. Code is available at \url{https://github.com/MinkaiXu/ConfVAE-ICML21}. △ Less

Submitted 2 June, 2021; v1 submitted 15 May, 2021; originally announced May 2021.

Comments: Accepted by ICML 2021

arXiv:2105.03902 [pdf, other]

Learning Gradient Fields for Molecular Conformation Generation

Authors: Chence Shi, Shitong Luo, Minkai Xu, Jian Tang

Abstract: We study a fundamental problem in computational chemistry known as molecular conformation generation, trying to predict stable 3D structures from 2D molecular graphs. Existing machine learning approaches usually first predict distances between atoms and then generate a 3D structure satisfying the distances, where noise in predicted distances may induce extra errors during 3D coordinate generation.… ▽ More We study a fundamental problem in computational chemistry known as molecular conformation generation, trying to predict stable 3D structures from 2D molecular graphs. Existing machine learning approaches usually first predict distances between atoms and then generate a 3D structure satisfying the distances, where noise in predicted distances may induce extra errors during 3D coordinate generation. Inspired by the traditional force field methods for molecular dynamics simulation, in this paper, we propose a novel approach called ConfGF by directly estimating the gradient fields of the log density of atomic coordinates. The estimated gradient fields allow directly generating stable conformations via Langevin dynamics. However, the problem is very challenging as the gradient fields are roto-translation equivariant. We notice that estimating the gradient fields of atomic coordinates can be translated to estimating the gradient fields of interatomic distances, and hence develop a novel algorithm based on recent score-based generative models to effectively estimate these gradients. Experimental results across multiple tasks show that ConfGF outperforms previous state-of-the-art baselines by a significant margin. △ Less

Submitted 7 June, 2021; v1 submitted 9 May, 2021; originally announced May 2021.

Comments: ICML 2021, Long talk

arXiv:2012.05716 [pdf, other]

Utilising Graph Machine Learning within Drug Discovery and Development

Authors: Thomas Gaudelet, Ben Day, Arian R. Jamasb, Jyothish Soman, Cristian Regep, Gertrude Liu, Jeremy B. R. Hayter, Richard Vickers, Charles Roberts, Jian Tang, David Roblin, Tom L. Blundell, Michael M. Bronstein, Jake P. Taylor-King

Abstract: Graph Machine Learning (GML) is receiving growing interest within the pharmaceutical and biotechnology industries for its ability to model biomolecular structures, the functional relationships between them, and integrate multi-omic datasets - amongst other data types. Herein, we present a multidisciplinary academic-industrial review of the topic within the context of drug discovery and development… ▽ More Graph Machine Learning (GML) is receiving growing interest within the pharmaceutical and biotechnology industries for its ability to model biomolecular structures, the functional relationships between them, and integrate multi-omic datasets - amongst other data types. Herein, we present a multidisciplinary academic-industrial review of the topic within the context of drug discovery and development. After introducing key terms and modelling approaches, we move chronologically through the drug development pipeline to identify and summarise work incorporating: target identification, design of small molecules and biologics, and drug repurposing. Whilst the field is still emerging, key milestones including repurposed drugs entering in vivo studies, suggest graph machine learning will become a modelling framework of choice within biomedical machine learning. △ Less

Submitted 10 February, 2021; v1 submitted 9 December, 2020; originally announced December 2020.

Comments: 19 pages, 7 figures, 2 tables

arXiv:2009.10931 [pdf]

doi 10.1038/s41598-021-02353-5

Drug repurposing for COVID-19 using graph neural network and harmonizing multiple evidence

Authors: Kanglin Hsieh, Yinyin Wang, Luyao Chen, Zhongming Zhao, Sean Savitz, Xiaoqian Jiang, Jing Tang, Yejin Kim

Abstract: Amid the pandemic of 2019 novel coronavirus disease (COVID-19) infected by SARS-CoV-2, a vast amount of drug research for prevention and treatment has been quickly conducted, but these efforts have been unsuccessful thus far. Our objective is to prioritize repurposable drugs using a drug repurposing pipeline that systematically integrates multiple SARS-CoV-2 and drug interactions, deep graph neura… ▽ More Amid the pandemic of 2019 novel coronavirus disease (COVID-19) infected by SARS-CoV-2, a vast amount of drug research for prevention and treatment has been quickly conducted, but these efforts have been unsuccessful thus far. Our objective is to prioritize repurposable drugs using a drug repurposing pipeline that systematically integrates multiple SARS-CoV-2 and drug interactions, deep graph neural networks, and in-vitro/population-based validations. We first collected all the available drugs (n= 3,635) involved in COVID-19 patient treatment through CTDbase. We built a SARS-CoV-2 knowledge graph based on the interactions among virus baits, host genes, pathways, drugs, and phenotypes. A deep graph neural network approach was used to derive the candidate representation based on the biological interactions. We prioritized the candidate drugs using clinical trial history, and then validated them with their genetic profiles, in vitro experimental efficacy, and electronic health records. We highlight the top 22 drugs including Azithromycin, Atorvastatin, Aspirin, Acetaminophen, and Albuterol. We further pinpointed drug combinations that may synergistically target COVID-19. In summary, we demonstrated that the integration of extensive interactions, deep neural networks, and rigorous validation can facilitate the rapid identification of candidate drugs for COVID-19 treatment. This is a post-peer-review, pre-copyedit version of an article published in Scientific Reports The final authenticated version is available online at: https://www.nature.com/articles/s41598-021-02353-5 △ Less

Submitted 1 February, 2022; v1 submitted 23 September, 2020; originally announced September 2020.

Comments: 13 pages

Journal ref: Sci Rep 11, 23179 (2021)

arXiv:1912.12587 [pdf]

Highly fluorescent copper nanoclusters for sensing and bioimaging

Authors: Yu An, Ying Ren, Jing Tang, Jun Chen, Baisong Chang

Abstract: Metal nanoclusters (NCs), typically consisting of a few to tens of metal atoms, bridge the gap between organometallic compounds and crystalline metal nanoparticles. As their size approaches the Fermi wavelength of electrons, metal NCs exhibit discrete energy levels, which in turn results in the emergence of intriguing physical and chemical (or physicochemical) properties, especially strong fluores… ▽ More Metal nanoclusters (NCs), typically consisting of a few to tens of metal atoms, bridge the gap between organometallic compounds and crystalline metal nanoparticles. As their size approaches the Fermi wavelength of electrons, metal NCs exhibit discrete energy levels, which in turn results in the emergence of intriguing physical and chemical (or physicochemical) properties, especially strong fluorescence. In the past few decades, dramatic growth has been witnessed in the development of different types of noble metal NCs (mainly AuNCs and AgNCs). However, compared with noble metals, copper is a relatively earth-abundant and cost-effective metal. Theoretical and experimental studies have shown that copper NCs (CuNCs) possess unique catalytic and photoluminescent properties. In this context, CuNCs are emerging as a new class of nontoxic, economic, and effective phosphors and catalysts, drawing significant interest across the life and medical sciences. To highlight these achievements, this review begins by providing an overview of a multitude of factors that play central roles in the fluorescence of CuNCs. Additionally, a critical perspective of how the aggregation of CuNCs can efficiently improve the florescent stability, tunability, and intensity is also discussed. Following, we present representative applications of CuNCs in detection and bioimaging. Finally, we outline current challenges and our perspective on the development of CuNCs. △ Less

Submitted 29 December, 2019; originally announced December 2019.

arXiv:1907.04765 [pdf, other]

Evaluating bird collision risk of a high-speed railway crossing the habitat of the crested ibis (Nipponia nippon) in Qinling Mountains, China

Authors: Han Hu, Junqing Tang, Yi Wang, Hongfeng Zhang, Dong Wu, Yingchun Lin, Lina Su, Yan Liu, Wei Zhang, Chao Wang, Xiaomin Wu

Abstract: Bird collisions with high-speed transport modes is a vital topic on vehicle safety and wildlife protection, especially when high-speed trains, with an average speed of 250km/h, have to run across the habitat of an endangered bird species. This paper evaluates the bird-train collision risk associated with a recent high-speed railway project in Qinling Mountains, China, for the crested ibis (Nipponi… ▽ More Bird collisions with high-speed transport modes is a vital topic on vehicle safety and wildlife protection, especially when high-speed trains, with an average speed of 250km/h, have to run across the habitat of an endangered bird species. This paper evaluates the bird-train collision risk associated with a recent high-speed railway project in Qinling Mountains, China, for the crested ibis (Nipponia nippon) and other local bird species. Using line transect surveys and walking monitoring techniques, we surveyed the population abundance, spatial-temporal distributions, and bridge-crossing behaviors of the birds in the study area. The results show that: (1) The crested ibis and the egret were the two most abundant waterfowl species in the study area. The RAI of these two species were about 43.69% and 42.91%, respectively; (2) Crested ibises overall habitat closer to the railway bridge. 91.63% of them were firstly detected within the range of 0m to 25m of the vicinity of the bridge; (3) the ratio between crossing over and under the railway bridge was about 7:3. Crested ibises were found to prefer to fly over the railway bridge (89.29% of the total crossing activities observed for this species). Egrets were more likely to cross the railway below the bridge, and they accounted for 60.27% of the total observations of crossing under the bridge. We recommend that, while the collision risk of crested ibises could be low, barrier-like structures, such as fences, should still be considered to promote the conservation of multiple bird species in the area. This paper provides a practical case for railway ecology studies in China. To our best knowledge, this is the first high-speed railway project that takes protecting crested ibises as one of the top priorities, and exemplifies the recent nationwide initiative towards the construction of "eco-civilization" in the country. △ Less

Submitted 10 July, 2019; originally announced July 2019.

Comments: 25 pages, 6 figures, preprint for submission to Transportation Research Part D: Transport and Environment

arXiv:1906.01513 [pdf]

doi 10.1002/bem.22148

Custom Edge-Element FEM Solver and its Application to Eddy-Current Simulation of Realistic 2M-Element Human Brain Phantom

Authors: Wuliang Yin, Mingyang Lu, Jiawei Tang, Qian Zhao, Zhijie Zhang, Kai Li, Yan Han, Anthony Peyton

Abstract: Extensive research papers of three-dimensional computational techniques are widely used for the investigation of human brain pathophysiology. Eddy current analyzing could provide an indication of conductivity change within a biological body. A significant obstacle to current trend analyses is the development of a numerically stable and efficiency-finite element scheme that performs well at low fre… ▽ More Extensive research papers of three-dimensional computational techniques are widely used for the investigation of human brain pathophysiology. Eddy current analyzing could provide an indication of conductivity change within a biological body. A significant obstacle to current trend analyses is the development of a numerically stable and efficiency-finite element scheme that performs well at low frequency and does not require a large number of degrees of freedom. Here, a custom finite element method (FEM) solver based on edge elements is proposed using the weakly coupled theory, which separates the solution into two steps. First, the background field (the magnetic vector potential on each edge) is calculated and stored. Then, the electric scalar potential on each node is obtained by FEM based on Galerkin formulations. Consequently, the electric field and eddy current distribution in the object can be obtained. This solver is more efficient than typical commercial solvers since it reduces the vector eddy current equation to a scalar one, and reduces the meshing domain to just the eddy current region. It can therefore tackle complex eddy current calculations for models with much larger numbers of elements, such as those encountered in eddy current computation in biological tissues. An example is presented with a realistic human brain mesh of 2 million elements. In addition, with this solver, the equivalent magnetic field induced from the excitation coil is applied, and therefore there is no need to mesh the excitation coil. In combination, these significantly increase the efficiency of the solver. △ Less

Submitted 30 May, 2019; originally announced June 2019.

arXiv:1905.00534 [pdf, other]

Drug-Drug Adverse Effect Prediction with Graph Co-Attention

Authors: Andreea Deac, Yu-Hsiang Huang, Petar Veličković, Pietro Liò, Jian Tang

Abstract: Complex or co-existing diseases are commonly treated using drug combinations, which can lead to higher risk of adverse side effects. The detection of polypharmacy side effects is usually done in Phase IV clinical trials, but there are still plenty which remain undiscovered when the drugs are put on the market. Such accidents have been affecting an increasing proportion of the population (15% in th… ▽ More Complex or co-existing diseases are commonly treated using drug combinations, which can lead to higher risk of adverse side effects. The detection of polypharmacy side effects is usually done in Phase IV clinical trials, but there are still plenty which remain undiscovered when the drugs are put on the market. Such accidents have been affecting an increasing proportion of the population (15% in the US now) and it is thus of high interest to be able to predict the potential side effects as early as possible. Systematic combinatorial screening of possible drug-drug interactions (DDI) is challenging and expensive. However, the recent significant increases in data availability from pharmaceutical research and development efforts offer a novel paradigm for recovering relevant insights for DDI prediction. Accordingly, several recent approaches focus on curating massive DDI datasets (with millions of examples) and training machine learning models on them. Here we propose a neural network architecture able to set state-of-the-art results on this task---using the type of the side-effect and the molecular structure of the drugs alone---by leveraging a co-attentional mechanism. In particular, we show the importance of integrating joint information from the drug pairs early on when learning each drug's representation. △ Less

Submitted 1 May, 2019; originally announced May 2019.

Comments: 8 pages, 5 figures

arXiv:1903.05947 [pdf, other]

Designing wildlife crossing structures for ungulates in a desert landscape: A case study in China

Authors: Bin Zhang, Junqing Tang, Yi Wang, Hongfeng Zhang, Gang Xu, Yu Lin, Xiaomin Wu

Abstract: This paper reports on the design of wildlife crossing structures (WCSs) along a new expressway in China, which exemplifies the country's increasing efforts on wildlife protection in infrastructure projects. The expert knowledge and field surveys were used to determine the target species in the study area and the quantity, locations, size, and type of the WCSs. The results on relative abundance ind… ▽ More This paper reports on the design of wildlife crossing structures (WCSs) along a new expressway in China, which exemplifies the country's increasing efforts on wildlife protection in infrastructure projects. The expert knowledge and field surveys were used to determine the target species in the study area and the quantity, locations, size, and type of the WCSs. The results on relative abundance index and encounter rate showed that the ibex (\textit{Capra ibex}), argali sheep (Ovis ammon), and goitered gazelle (Gazella subgutturosa) are the main ungulates in the study area. Among them, the goitered gazelle is the most widely distributed species. WCSs were proposed based on the estimated crossing hotspots. The mean deviation distance between those hotspots and their nearest proposed WCSs is around 341m. In addition, those 16 proposed underpass WCSs have a width of no less than 12m and height of no lower than 3.5m, which is believed to be sufficient for ungulates in the area. Given the limited availability of high-resolution movement data and wildlife-vehicle collision data during road's early design stage, the approach demonstrated in this paper facilitates practical spatial planning and provides insights into designing WCSs in a desert landscape. △ Less

Submitted 14 March, 2019; originally announced March 2019.

Comments: 20 pages, 7 figures, Submit to Transportation Research Part D

Showing 1–50 of 60 results for author: Tang, J