-
3D-MolT5: Towards Unified 3D Molecule-Text Modeling with 3D Molecular Tokenization
Authors:
Qizhi Pei,
Lijun Wu,
Kaiyuan Gao,
Jinhua Zhu,
Rui Yan
Abstract:
The integration of molecule and language has garnered increasing attention in molecular science. Recent advancements in Language Models (LMs) have demonstrated potential for the comprehensive modeling of molecule and language. However, existing works exhibit notable limitations. Most overlook the modeling of 3D information, which is crucial for understanding molecular structures and functions. While some attempts have been made to leverage external structure encoding modules to inject 3D molecular information into LMs, obvious difficulties, such as modality alignment and separate tuning, hinder the integration of molecular structure and language text. To bridge this gap, we propose 3D-MolT5, a unified framework designed to model both 1D molecular sequence and 3D molecular structure. The key innovation lies in our methodology for mapping fine-grained 3D substructure representations (based on 3D molecular fingerprints) to a specialized 3D token vocabulary for 3D-MolT5. This 3D structure token vocabulary enables the seamless combination of 1D sequence and 3D structure representations in a tokenized format, allowing 3D-MolT5 to encode molecular sequence (SELFIES), molecular structure, and text sequences within a unified architecture. We further introduce 1D and 3D joint pre-training to enhance the model's comprehension of these diverse modalities in a joint representation space and to improve the foundation model's generalization to various tasks. Through instruction tuning on multiple downstream datasets, 3D-MolT5 outperforms existing methods in molecular property prediction, molecule captioning, and text-based molecule generation tasks. Our code will be available on GitHub soon.
Submitted 9 June, 2024;
originally announced June 2024.
-
FABind+: Enhancing Molecular Docking through Improved Pocket Prediction and Pose Generation
Authors:
Kaiyuan Gao,
Qizhi Pei,
Jinhua Zhu,
Kun He,
Lijun Wu
Abstract:
Molecular docking is a pivotal process in drug discovery. Traditional techniques rely on extensive sampling and simulation governed by physical principles, and are often slow and costly. The advent of deep learning-based approaches has shown significant promise, offering increases in both accuracy and efficiency. Building upon the foundational work of FABind, a model designed with a focus on speed and accuracy, we present FABind+, an enhanced iteration that substantially improves on the performance of its predecessor. We identify pocket prediction as a critical bottleneck in molecular docking and propose a novel methodology that significantly refines pocket prediction, thereby streamlining the docking process. Furthermore, we introduce modifications to the docking module to enhance its pose generation capabilities. In an effort to bridge the gap with conventional sampling/generative methods, we incorporate a simple yet effective sampling technique coupled with a confidence model, requiring only minor adjustments to the regression framework of FABind. Experimental results and analysis reveal that FABind+ remarkably outperforms the original FABind, achieves competitive state-of-the-art performance, and delivers insightful modeling strategies. This demonstrates that FABind+ represents a substantial step forward in molecular docking and drug discovery. Our code is available at https://github.com/QizhiPei/FABind.
Submitted 7 April, 2024; v1 submitted 29 March, 2024;
originally announced March 2024.
-
Leveraging Biomolecule and Natural Language through Multi-Modal Learning: A Survey
Authors:
Qizhi Pei,
Lijun Wu,
Kaiyuan Gao,
Jinhua Zhu,
Yue Wang,
Zun Wang,
Tao Qin,
Rui Yan
Abstract:
The integration of biomolecular modeling with natural language (BL) has emerged as a promising interdisciplinary area at the intersection of artificial intelligence, chemistry and biology. This approach leverages the rich, multifaceted descriptions of biomolecules contained within textual data sources to enhance our fundamental understanding and enable downstream computational tasks such as biomolecule property prediction. The fusion of the nuanced narratives expressed through natural language with the structural and functional specifics of biomolecules described via various molecular modeling techniques opens new avenues for comprehensively representing and analyzing biomolecules. By incorporating the contextual language data that surrounds biomolecules into their modeling, BL aims to capture a holistic view encompassing both the symbolic qualities conveyed through language and quantitative structural characteristics. In this review, we provide an extensive analysis of recent advancements achieved through cross modeling of biomolecules and natural language. (1) We begin by outlining the technical representations of biomolecules employed, including sequences, 2D graphs, and 3D structures. (2) We then examine in depth the rationale and key objectives underlying effective multi-modal integration of language and molecular data sources. (3) We subsequently survey the practical applications enabled to date in this developing research area. (4) We also compile and summarize the available resources and datasets to facilitate future work. (5) Looking ahead, we identify several promising research directions worthy of further exploration and investment to continue advancing the field. The related resources and contents are kept up to date at https://github.com/QizhiPei/Awesome-Biomolecule-Language-Cross-Modeling.
Submitted 5 March, 2024; v1 submitted 3 March, 2024;
originally announced March 2024.
-
BioT5+: Towards Generalized Biological Understanding with IUPAC Integration and Multi-task Tuning
Authors:
Qizhi Pei,
Lijun Wu,
Kaiyuan Gao,
Xiaozhuan Liang,
Yin Fang,
Jinhua Zhu,
Shufang Xie,
Tao Qin,
Rui Yan
Abstract:
Recent research trends in computational biology have increasingly focused on integrating text and bio-entity modeling, especially in the context of molecules and proteins. However, previous efforts like BioT5 faced challenges in generalizing across diverse tasks and lacked a nuanced understanding of molecular structures, particularly in their textual representations (e.g., IUPAC). This paper introduces BioT5+, an extension of the BioT5 framework, tailored to enhance biological research and drug discovery. BioT5+ incorporates several novel features: integration of IUPAC names for molecular understanding, inclusion of extensive bio-text and molecule data from sources like bioRxiv and PubChem, multi-task instruction tuning for generality across tasks, and a numerical tokenization technique for improved processing of numerical data. These enhancements allow BioT5+ to bridge the gap between molecular representations and their textual descriptions, providing a more holistic understanding of biological entities and largely improving the grounded reasoning of bio-text and bio-sequences. The model is pre-trained and fine-tuned across a large number of experiments, covering 3 types of problems (classification, regression, generation), 15 kinds of tasks, and 21 total benchmark datasets, and demonstrates remarkable performance and state-of-the-art results in most cases. BioT5+ stands out for its ability to capture intricate relationships in biological data, thereby contributing significantly to bioinformatics and computational biology. Our code is available at https://github.com/QizhiPei/BioT5.
Submitted 31 May, 2024; v1 submitted 27 February, 2024;
originally announced February 2024.
-
BioT5: Enriching Cross-modal Integration in Biology with Chemical Knowledge and Natural Language Associations
Authors:
Qizhi Pei,
Wei Zhang,
Jinhua Zhu,
Kehan Wu,
Kaiyuan Gao,
Lijun Wu,
Yingce Xia,
Rui Yan
Abstract:
Recent advancements in biological research leverage the integration of molecules, proteins, and natural language to enhance drug discovery. However, current models exhibit several limitations, such as the generation of invalid molecular SMILES, underutilization of contextual information, and equal treatment of structured and unstructured knowledge. To address these issues, we propose BioT5, a comprehensive pre-training framework that enriches cross-modal integration in biology with chemical knowledge and natural language associations. BioT5 utilizes SELFIES for 100% robust molecular representations and extracts knowledge from the surrounding context of bio-entities in unstructured biological literature. Furthermore, BioT5 distinguishes between structured and unstructured knowledge, leading to more effective utilization of information. After fine-tuning, BioT5 shows superior performance across a wide range of tasks, demonstrating its strong capability of capturing underlying relations and properties of bio-entities. Our code is available at https://github.com/QizhiPei/BioT5.
Submitted 28 January, 2024; v1 submitted 11 October, 2023;
originally announced October 2023.
-
FABind: Fast and Accurate Protein-Ligand Binding
Authors:
Qizhi Pei,
Kaiyuan Gao,
Lijun Wu,
Jinhua Zhu,
Yingce Xia,
Shufang Xie,
Tao Qin,
Kun He,
Tie-Yan Liu,
Rui Yan
Abstract:
Modeling the interaction between proteins and ligands and accurately predicting their binding structures is a critical yet challenging task in drug discovery. Recent advancements in deep learning have shown promise in addressing this challenge, with sampling-based and regression-based methods emerging as two prominent approaches. However, these methods have notable limitations. Sampling-based methods often suffer from low efficiency due to the need for generating multiple candidate structures for selection. On the other hand, regression-based methods offer fast predictions but may experience decreased accuracy. Additionally, the variation in protein sizes often requires external modules for selecting suitable binding pockets, further impacting efficiency. In this work, we propose FABind, an end-to-end model that combines pocket prediction and docking to achieve accurate and fast protein-ligand binding. FABind incorporates a unique ligand-informed pocket prediction module, which is also leveraged for docking pose estimation. The model further enhances the docking process by incrementally integrating the predicted pocket to optimize protein-ligand binding, reducing discrepancies between training and inference. Through extensive experiments on benchmark datasets, our proposed FABind demonstrates strong advantages in terms of effectiveness and efficiency compared to existing methods. Our code is available at https://github.com/QizhiPei/FABind.
Submitted 8 January, 2024; v1 submitted 10 October, 2023;
originally announced October 2023.
-
SSM-DTA: Breaking the Barriers of Data Scarcity in Drug-Target Affinity Prediction
Authors:
Qizhi Pei,
Lijun Wu,
Jinhua Zhu,
Yingce Xia,
Shufang Xie,
Tao Qin,
Haiguang Liu,
Tie-Yan Liu,
Rui Yan
Abstract:
Accurate prediction of Drug-Target Affinity (DTA) is of vital importance in early-stage drug discovery, facilitating the identification of drugs that can effectively interact with specific targets and regulate their activities. While wet experiments remain the most reliable method, they are time-consuming and resource-intensive, resulting in limited data availability that poses challenges for deep learning approaches. Existing methods have primarily focused on developing techniques based on the available DTA data, without adequately addressing the data scarcity issue. To overcome this challenge, we present the SSM-DTA framework, which incorporates three simple yet highly effective strategies: (1) A multi-task training approach that combines DTA prediction with masked language modeling (MLM) using paired drug-target data. (2) A semi-supervised training method that leverages large-scale unpaired molecules and proteins to enhance drug and target representations. This approach differs from previous methods that only employed molecules or proteins in pre-training. (3) The integration of a lightweight cross-attention module to improve the interaction between drugs and targets, further enhancing prediction accuracy. Through extensive experiments on benchmark datasets such as BindingDB, DAVIS, and KIBA, we demonstrate the superior performance of our framework. Additionally, we conduct case studies on specific drug-target binding activities, virtual screening experiments, drug feature visualizations, and real-world applications, all of which showcase the significant potential of our work. In conclusion, our proposed SSM-DTA framework addresses the data limitation challenge in DTA prediction and yields promising results, paving the way for more efficient and accurate drug discovery processes. Our code is available at https://github.com/QizhiPei/SSM-DTA.
Submitted 17 October, 2023; v1 submitted 20 June, 2022;
originally announced June 2022.
-
An evaluation of machine learning techniques to predict the outcome of children treated for Hodgkin-Lymphoma on the AHOD0031 trial: A report from the Children's Oncology Group
Authors:
Cédric Beaulac,
Jeffrey S. Rosenthal,
Qinglin Pei,
Debra Friedman,
Suzanne Wolden,
David Hodgson
Abstract:
In this manuscript we analyze a data set containing information on children with Hodgkin Lymphoma (HL) enrolled on a clinical trial. Treatments received and survival status were collected together with other covariates such as demographics and clinical measurements. Our main task is to explore the potential of machine learning (ML) algorithms in a survival analysis context in order to improve over the Cox Proportional Hazard (CoxPH) model. We discuss the weaknesses of the CoxPH model that we would like to improve upon, and then we introduce multiple algorithms, from well-established ones to state-of-the-art models, that address these issues. We then compare every model according to the concordance index and the Brier score. Finally, we produce a series of recommendations, based on our experience, for practitioners who would like to benefit from the recent advances in artificial intelligence.
Submitted 26 March, 2021; v1 submitted 15 January, 2020;
originally announced January 2020.
-
Coding Capacity of Purkinje Cells with Different Schemes of Morphological Reduction
Authors:
Lingling An,
Yuanhong Tang,
Quan Wang,
Qingqi Pei,
Ran Wei,
Huiyuan Duan,
Jian K. Liu
Abstract:
The brain as a neuronal system has a very complex structure with a large diversity of neuronal types. The most basic complexity is seen in neuronal morphology, which usually has a complex tree-like structure with dendritic spines distributed along the branches. For simulating a large-scale network with spiking neurons, a simple point neuron, such as the integrate-and-fire neuron, is often used. However, recent experimental evidence suggests that the computational ability of a single neuron is largely enhanced by its morphological structure, in particular by various types of dendritic dynamics. As morphology reduction of detailed biophysical models is one of the classic questions in systems neuroscience, much effort has been made to simulate a neuron with a few compartments that include the interaction between soma and dendritic spines. Yet novel reduction methods are still needed to deal with complex dendritic trees. Here, using ten individual Purkinje cells of the cerebellum from three species (guinea pig, mouse, and rat), we consider four types of reduction methods and study their effects on the coding capacity of Purkinje cells in terms of firing rate, timing coding, spiking pattern, and modulated firing under different stimulation protocols. We find that reduction performance varies depending on individual cells and species; however, all reduction methods can preserve, to some degree, the firing activity of the full model of the Purkinje cell. Therefore, when simulating a large-scale network of neurons, one has to choose a proper type of reduced neuronal model depending on the questions addressed.
Submitted 25 April, 2019;
originally announced May 2019.