DRAK: Unlocking Molecular Insights with Domain-Specific Retrieval-Augmented Knowledge in LLMs

Authors: Jinzhe Liu, Xiangsheng Huang, Zhuo Chen, Yin Fang

Abstract: Large Language Models (LLMs) encounter challenges with the unique syntax of specific domains, such as biomolecules. Existing fine-tuning or modality alignment techniques struggle to bridge the domain knowledge gap and understand complex molecular data, limiting LLMs' progress in specialized fields. To overcome these limitations, we propose an expandable and adaptable non-parametric knowledge injection framework named Domain-specific Retrieval-Augmented Knowledge (DRAK), aimed at enhancing reasoning capabilities in specific domains. Utilizing knowledge-aware prompts and gold label-induced reasoning, DRAK has developed profound expertise in the molecular domain and the capability to handle a broad spectrum of analysis tasks. We evaluated two distinct forms of DRAK variants, proving that DRAK exceeds previous benchmarks on six molecular tasks within the Mol-Instructions dataset. Extensive experiments have underscored DRAK's formidable performance and its potential to unlock molecular insights, offering a unified paradigm for LLMs to tackle knowledge-intensive tasks in specific domains. Our code will be available soon.

Submitted 4 March, 2024; originally announced June 2024.

Comments: Ongoing work; 11 pages, 6 Figures, 2 Tables

arXiv:2405.12144 [pdf]

Alterations of electrocortical activity during hand movements induced by motor cortex glioma

Authors: Yihan Wu, Tao Chang, Siliang Chen, Xiaodong Niu, Yu Li, Yuan Fang, Lei Yang, Yixuan Zong, Yaoxin Yang, Yuehua Li, Mengsong Wang, Wen Yang, Yixuan Wu, Chen Fu, Xia Fang, Yuxin Quan, Xilin Peng, Qiang Sun, Marc M. Van Hulle, Yanhui Liu, Ning Jiang, Dario Farina, Yuan Yang, Jiayuan He, Qing Mao

Abstract: Glioma cells can reshape functional neuronal networks by hijacking neuronal synapses, leading to partial or complete neurological dysfunction. These mechanisms have been previously explored for language functions. However, the impact of glioma on sensorimotor functions is still unknown. Therefore, we recruited a control group of patients with unaffected motor cortex and a group of patients with glioma-infiltrated motor cortex, and recorded high-density electrocortical signals during finger movement tasks. The results showed that glioma suppresses task-related synchronization in the high-gamma band and reduces the power across all frequency bands. The resulting atypical motor information transmission model with discrete signaling pathways and delayed responses disrupts the stability of neuronal encoding patterns for finger movement kinematics across various temporal-spatial scales. These findings demonstrate that gliomas functionally invade neural circuits within the motor cortex. This result advances our understanding of motor function processing in chronic disease states, which is important to advance the surgical strategies and neurorehabilitation approaches for patients with malignant gliomas.

Submitted 20 May, 2024; originally announced May 2024.

arXiv:2405.06178 [pdf, other]

ACTION: Augmentation and Computation Toolbox for Brain Network Analysis with Functional MRI

Authors: Yuqi Fang, Junhao Zhang, Linmin Wang, Qianqian Wang, Mingxia Liu

Abstract: Functional magnetic resonance imaging (fMRI) has been increasingly employed to investigate functional brain activity. Many fMRI-related software/toolboxes have been developed, providing specialized algorithms for fMRI analysis. However, existing toolboxes seldom consider fMRI data augmentation, which is quite useful, especially in studies with limited or imbalanced data. Moreover, current studies usually focus on analyzing fMRI using conventional machine learning models that rely on human-engineered fMRI features, without investigating deep learning models that can automatically learn data-driven fMRI representations. In this work, we develop an open-source toolbox, called Augmentation and Computation Toolbox for braIn netwOrk aNalysis (ACTION), offering comprehensive functions to streamline fMRI analysis. The ACTION is a Python-based and cross-platform toolbox with graphical user-friendly interfaces. It enables automatic fMRI augmentation, covering blood-oxygen-level-dependent (BOLD) signal augmentation and brain network augmentation. Many popular methods for brain network construction and network feature extraction are included. In particular, it supports constructing deep learning models, which leverage large-scale auxiliary unlabeled data (3,800+ resting-state fMRI scans) for model pretraining to enhance model performance for downstream tasks. To facilitate multi-site fMRI studies, it is also equipped with several popular federated learning strategies. Furthermore, it enables users to design and test custom algorithms through scripting, greatly improving its utility and extensibility. We demonstrate the effectiveness and user-friendliness of ACTION on real fMRI data and present the experimental results. The software, along with its source code and manual, can be accessed online.

Submitted 9 May, 2024; originally announced May 2024.

Comments: 14 pages, 5 figures, 5 tables

arXiv:2402.17810 [pdf, other]

BioT5+: Towards Generalized Biological Understanding with IUPAC Integration and Multi-task Tuning

Authors: Qizhi Pei, Lijun Wu, Kaiyuan Gao, Xiaozhuan Liang, Yin Fang, Jinhua Zhu, Shufang Xie, Tao Qin, Rui Yan

Abstract: Recent research trends in computational biology have increasingly focused on integrating text and bio-entity modeling, especially in the context of molecules and proteins. However, previous efforts like BioT5 faced challenges in generalizing across diverse tasks and lacked a nuanced understanding of molecular structures, particularly in their textual representations (e.g., IUPAC). This paper introduces BioT5+, an extension of the BioT5 framework, tailored to enhance biological research and drug discovery. BioT5+ incorporates several novel features: integration of IUPAC names for molecular understanding, inclusion of extensive bio-text and molecule data from sources like bioRxiv and PubChem, the multi-task instruction tuning for generality across tasks, and a numerical tokenization technique for improved processing of numerical data. These enhancements allow BioT5+ to bridge the gap between molecular representations and their textual descriptions, providing a more holistic understanding of biological entities, and largely improving the grounded reasoning of bio-text and bio-sequences. The model is pre-trained and fine-tuned with a large number of experiments, including 3 types of problems (classification, regression, generation), 15 kinds of tasks, and 21 total benchmark datasets, demonstrating the remarkable performance and state-of-the-art results in most cases. BioT5+ stands out for its ability to capture intricate relationships in biological data, thereby contributing significantly to bioinformatics and computational biology. Our code is available at https://github.com/QizhiPei/BioT5.

Submitted 31 May, 2024; v1 submitted 27 February, 2024; originally announced February 2024.

Comments: Accepted by ACL 2024 (Findings)

arXiv:2306.16780 [pdf, other]

Graph Sampling-based Meta-Learning for Molecular Property Prediction

Authors: Xiang Zhuang, Qiang Zhang, Bin Wu, Keyan Ding, Yin Fang, Huajun Chen

Abstract: Molecular property is usually observed with a limited number of samples, and researchers have considered property prediction as a few-shot problem. One important fact that has been ignored by prior works is that each molecule can be recorded with several different properties simultaneously. To effectively utilize many-to-many correlations of molecules and properties, we propose a Graph Sampling-based Meta-learning (GS-Meta) framework for few-shot molecular property prediction. First, we construct a Molecule-Property relation Graph (MPG): molecule and properties are nodes, while property labels decide edges. Then, to utilize the topological information of MPG, we reformulate an episode in meta-learning as a subgraph of the MPG, containing a target property node, molecule nodes, and auxiliary property nodes. Third, as episodes in the form of subgraphs are no longer independent of each other, we propose to schedule the subgraph sampling process with a contrastive loss function, which considers the consistency and discrimination of subgraphs. Extensive experiments on 5 commonly-used benchmarks show GS-Meta consistently outperforms state-of-the-art methods by 5.71%-6.93% in ROC-AUC and verify the effectiveness of each proposed module. Our code is available at https://github.com/HICAI-ZJU/GS-Meta.

Submitted 29 June, 2023; originally announced June 2023.

Comments: Accepted by IJCAI 2023

arXiv:2306.14080 [pdf, other]

Leveraging Brain Modularity Prior for Interpretable Representation Learning of fMRI

Authors: Qianqian Wang, Wei Wang, Yuqi Fang, P. -T. Yap, Hongtu Zhu, Hong-Jun Li, Lishan Qiao, Mingxia Liu

Abstract: Resting-state functional magnetic resonance imaging (rs-fMRI) can reflect spontaneous neural activities in brain and is widely used for brain disorder analysis. Previous studies propose to extract fMRI representations through diverse machine/deep learning methods for subsequent analysis. But the learned features typically lack biological interpretability, which limits their clinical utility. From the view of graph theory, the brain exhibits a remarkable modular structure in spontaneous brain functional networks, with each module comprised of functionally interconnected brain regions-of-interest (ROIs). However, most existing learning-based methods for fMRI analysis fail to adequately utilize such brain modularity prior. In this paper, we propose a Brain Modularity-constrained dynamic Representation learning (BMR) framework for interpretable fMRI analysis, consisting of three major components: (1) dynamic graph construction, (2) dynamic graph learning via a novel modularity-constrained graph neural network (MGNN), (3) prediction and biomarker detection for interpretable fMRI analysis. Especially, three core neurocognitive modules (i.e., salience network, central executive network, and default mode network) are explicitly incorporated into the MGNN, encouraging the nodes/ROIs within the same module to share similar representations. To further enhance discriminative ability of learned features, we also encourage the MGNN to preserve the network topology of input graphs via a graph topology reconstruction constraint. Experimental results on 534 subjects with rs-fMRI scans from two datasets validate the effectiveness of the proposed method. The identified discriminative brain ROIs and functional connectivities can be regarded as potential fMRI biomarkers to aid in clinical diagnosis.

Submitted 24 June, 2023; originally announced June 2023.

arXiv:2306.08018 [pdf, other]

Mol-Instructions: A Large-Scale Biomolecular Instruction Dataset for Large Language Models

Authors: Yin Fang, Xiaozhuan Liang, Ningyu Zhang, Kangwei Liu, Rui Huang, Zhuo Chen, Xiaohui Fan, Huajun Chen

Abstract: Large Language Models (LLMs), with their remarkable task-handling capabilities and innovative outputs, have catalyzed significant advancements across a spectrum of fields. However, their proficiency within specialized domains such as biomolecular studies remains limited. To address this challenge, we introduce Mol-Instructions, a comprehensive instruction dataset designed for the biomolecular domain. Mol-Instructions encompasses three key components: molecule-oriented instructions, protein-oriented instructions, and biomolecular text instructions. Each component aims to improve the understanding and prediction capabilities of LLMs concerning biomolecular features and behaviors. Through extensive instruction tuning experiments on LLMs, we demonstrate the effectiveness of Mol-Instructions in enhancing large models' performance in the intricate realm of biomolecular studies, thus fostering progress in the biomolecular research community. Mol-Instructions is publicly available for ongoing research and will undergo regular updates to enhance its applicability.

Submitted 4 March, 2024; v1 submitted 13 June, 2023; originally announced June 2023.

Comments: ICLR 2024. Project homepage: https://github.com/zjunlp/Mol-Instructions

arXiv:2303.08015 [pdf, ps, other]

Molecular Communication for Quorum Sensing Inspired Cooperative Drug Delivery

Authors: Yuting Fang, Stuart T. Johnston, Matt Faria, Xinyu Huang, Andrew W. Eckford, Jamie Evans

Abstract: A cooperative drug delivery system is proposed, where quorum sensing (QS), a density-dependent bacterial behavior coordination mechanism, is employed by synthetic bacterium-based nanomachines (B-NMs) for controllable drug delivery. In our proposed system, drug delivery is only triggered when there are enough QS molecules, which in turn only happens when there are enough B-NMs. This allows the proposed system to achieve a high release rate of drug molecules from a high number of B-NMs when the population density of B-NMs may not be known. Analytical expressions for i) the expected activation probability of the B-NM due to randomly-distributed B-NMs and ii) the expected aggregate absorption rate of drug molecules due to randomly-distributed QS activated B-NMs are derived. Analytical results are verified by particle-based simulations. The derived results can help to predict and control the impact of environmental factors (e.g. diffusion coefficient and degradation rate) on the absorption rate of drug molecules since rigorous diffusion-based molecular channels are considered. Our results show that the activation probability at the B-NM increases as this B-NM is located closer to the center of the B-NM population and the aggregate absorption rate of the drug molecules non-linearly increases as the population density increases.

Submitted 14 February, 2023; originally announced March 2023.

Comments: 9 pages; 9 figures

arXiv:2303.04902 [pdf]

doi 10.1002/hbm.26672

Inter-brain substrates of role switching during mother-child interaction

Authors: Yamin Li, Saishuang Wu, Jiayang Xu, Haiwa Wang, Qi Zhu, Wen Shi, Yue Fang, Fan Jiang, Shanbao Tong, Yunting Zhang, Xiaoli Guo

Abstract: Mother-child interaction is highly dynamic and reciprocal. Switching roles in these back-and-forth interactions serves as a crucial feature of reciprocal behaviors while the underlying neural entrainment is still not well-studied. Here, we designed a role-controlled cooperative task with dual EEG recording to study how differently two brains interact when mothers and children hold different roles. When children were actors and mothers were observers, mother-child inter-brain synchrony emerged within the theta oscillations and the frontal lobe, which highly correlated with children's attachment to their mothers. When their roles were reversed, this synchrony was shifted to the alpha oscillations and the central area and associated with mothers' perception of their relationship with their children. The results suggested an observer-actor neural alignment within the actor's oscillations, which was modulated by the actor-toward-observer emotional bonding. Our findings contribute to the understanding of how inter-brain synchrony is established and dynamically changed during mother-child reciprocal interaction.

Submitted 8 March, 2023; originally announced March 2023.

arXiv:2210.13225 [pdf, other]

Biologically Plausible Variational Policy Gradient with Spiking Recurrent Winner-Take-All Networks

Authors: Zhile Yang, Shangqi Guo, Ying Fang, Jian K. Liu

Abstract: One stream of reinforcement learning research is exploring biologically plausible models and algorithms to simulate biological intelligence and fit neuromorphic hardware. Among them, reward-modulated spike-timing-dependent plasticity (R-STDP) is a recent branch with good potential in energy efficiency. However, current R-STDP methods rely on heuristic designs of local learning rules, thus requiring task-specific expert knowledge. In this paper, we consider a spiking recurrent winner-take-all network, and propose a new R-STDP method, spiking variational policy gradient (SVPG), whose local learning rules are derived from the global policy gradient and thus eliminate the need for heuristic designs. In experiments of MNIST classification and Gym InvertedPendulum, our SVPG achieves good training performance, and also presents better robustness to various kinds of noises than conventional methods.

Submitted 21 October, 2022; originally announced October 2022.

Comments: Accepted to BMVC 2022

arXiv:2206.11255 [pdf, other]

Attention-aware contrastive learning for predicting T cell receptor-antigen binding specificity

Authors: Yiming Fang, Xuejun Liu, Hui Liu

Abstract: It has been verified that only a small fraction of the neoantigens presented by MHC class I molecules on the cell surface can elicit T cells. The limitation can be attributed to the binding specificity of T cell receptor (TCR) to peptide-MHC complex (pMHC). Computational prediction of T cell binding to neoantigens is a challenging and unresolved task. In this paper, we propose an attentive-mask contrastive learning model, ATMTCR, for inferring TCR-antigen binding specificity. For each input TCR sequence, we used a Transformer encoder to transform it into a latent representation, and then masked a proportion of residues guided by attention weights to generate its contrastive view. Pretraining on large-scale TCR CDR3 sequences, we verified that contrastive learning significantly improved the prediction performance of TCR binding to peptide-MHC complex (pMHC). Beyond the detection of important amino acids and their locations in the TCR sequence, our model can also extract high-order semantic information underlying the TCR-antigen binding specificity. In comparison experiments conducted on two independent datasets, our method achieved better performance than other existing algorithms. Moreover, we effectively identified important amino acids and their positional preferences through attention weights, which indicated the interpretability of our proposed model.

Submitted 17 May, 2022; originally announced June 2022.

arXiv:2206.06035 [pdf, other]

doi 10.1016/j.cag.2022.07.005

SHREC 2022: Protein-ligand binding site recognition

Authors: Luca Gagliardi, Andrea Raffo, Ulderico Fugacci, Silvia Biasotti, Walter Rocchia, Hao Huang, Boulbaba Ben Amor, Yi Fang, Yuanyuan Zhang, Xiao Wang, Charles Christoffer, Daisuke Kihara, Apostolos Axenopoulos, Stelios Mylonas, Petros Daras

Abstract: This paper presents the methods that have participated in the SHREC 2022 contest on protein-ligand binding site recognition. The prediction of protein-ligand binding regions is an active research domain in computational biophysics and structural biology and plays a relevant role for molecular docking and drug design. The goal of the contest is to assess the effectiveness of computational methods in recognizing ligand binding sites in a protein based on its geometrical structure. Performances of the segmentation algorithms are analyzed according to two evaluation scores describing the capacity of a putative pocket to contact a ligand and to pinpoint the correct binding region. Although some methods perform remarkably well, we show that simple non-machine-learning approaches remain very competitive against data-driven algorithms. In general, the task of pocket detection remains a challenging learning problem which suffers from intrinsic difficulties due to the lack of negative examples (data imbalance problem).

Submitted 24 August, 2022; v1 submitted 13 June, 2022; originally announced June 2022.

Journal ref: Computers & Graphics 107 (2022) 20-31

arXiv:2204.12725 [pdf]

The low-entropy hydration shell at the binding site of spike RBD determines the contagiousness of SARS-CoV-2 variants

Authors: Lin Yang, Shuai Guo, Chengyu Hou, Jiacheng Li, Liping Shi, Chenchen Liao, Rongchun Shi, Xiaoliang Ma, Bing Zheng, Yi Fang, Lin Ye, Xiaodong He

Abstract: The infectivity of SARS-CoV-2 depends on the binding affinity of the receptor-binding domain (RBD) of the spike protein with the angiotensin converting enzyme 2 (ACE2) receptor. The calculated RBD-ACE2 binding energies indicate that the difference in transmission efficiency of SARS-CoV-2 variants cannot be fully explained by electrostatic interactions, hydrogen-bond interactions, van der Waals interactions, internal energy, and nonpolar solvation energies. Here, we demonstrate that low-entropy regions of hydration shells around proteins drive hydrophobic attraction between shape-matched low-entropy regions of the hydration shells, which essentially coordinates protein-protein binding in rotational-configurational space of mutual orientations and determines the binding affinity. An innovative method was used to identify the low-entropy regions of the hydration shells of the RBDs of multiple SARS-CoV-2 variants and the ACE2. We observed integral low-entropy regions of hydration shells covering the binding sites of the RBDs and matching in shape to the low-entropy region of hydration shell at the binding site of the ACE2. The RBD-ACE2 binding is thus found to be guided by hydrophobic collapse between the shape-matched low-entropy regions of the hydration shells. A measure of the low-entropy of the hydration shells can be obtained by counting the number of hydrophilic groups expressing hydrophilicity within the binding sites. The low-entropy level of hydration shells at the binding site of a spike protein is found to be an important indicator of the contagiousness of the coronavirus.

Submitted 27 April, 2022; originally announced April 2022.

Comments: 27 pages, 14 figures

arXiv:2202.10605 [pdf]

Space Layout of Low-entropy Hydration Shells Guides Protein Binding

Authors: Lin Yang, Shuai Guo, Chengyu Hou, Chencheng Liao, Jiacheng Li, Liping Shi, Xiaoliang Ma, Shenda Jiang, Bing Zheng, Yi Fang, Lin Ye, Xiaodong He

Abstract: Protein-protein binding enables orderly and lawful biological self-organization, and is therefore considered a miracle of nature. Protein-protein binding is steered by electrostatic forces, hydrogen bonding, van der Waals force, and hydrophobic interactions. Among these physical forces, only the hydrophobic interactions can be considered as long-range intermolecular attractions between proteins in intracellular and extracellular fluid. Low-entropy regions of hydration shells around proteins drive hydrophobic attraction among them that essentially coordinate protein-protein docking in rotational-conformational space of mutual orientations at the guidance stage of the binding. Here, an innovative method was developed for identifying the low-entropy regions of hydration shells of given proteins, and we discovered that the largest low-entropy regions of hydration shells on proteins typically cover the binding sites. According to an analysis of determined protein complex structures, shape matching between the largest low-entropy hydration shell region of a protein and that of its partner at the binding sites is revealed as a regular pattern. Protein-protein binding is thus found to be mainly guided by hydrophobic collapse between the shape-matched low-entropy hydration shells that is verified by bioinformatics analyses of hundreds of structures of protein complexes. A simple algorithm is developed to precisely predict protein binding sites.

Submitted 21 February, 2022; originally announced February 2022.

arXiv:2202.10587 [pdf, other]

Knowledge-informed Molecular Learning: A Survey on Paradigm Transfer

Authors: Yin Fang, Zhuo Chen, Xiaohui Fan, Ningyu Zhang

Abstract: Machine learning, notably deep learning, has significantly propelled molecular investigations within the biochemical sphere. Traditionally, modeling for such research has centered around a handful of paradigms. For instance, the prediction paradigm is frequently deployed for tasks such as molecular property prediction. To enhance the generation and decipherability of purely data-driven models, scholars have integrated biochemical domain knowledge into these molecular study models. This integration has sparked a surge in paradigm transfer, which is solving one molecular learning task by reformulating it as another one. With the emergence of Large Language Models, these paradigms have demonstrated an escalating trend towards harmonized unification. In this work, we delineate a literature survey focused on knowledge-informed molecular learning from the perspective of paradigm transfer. We classify the paradigms, scrutinize their methodologies, and dissect the contribution of domain knowledge. Moreover, we encapsulate prevailing trends and identify intriguing avenues for future exploration in molecular learning.

Submitted 5 September, 2023; v1 submitted 17 February, 2022; originally announced February 2022.

Comments: 8 pages, 3 figures

arXiv:2112.00544 [pdf, other]

Molecular Contrastive Learning with Chemical Element Knowledge Graph

Authors: Yin Fang, Qiang Zhang, Haihong Yang, Xiang Zhuang, Shumin Deng, Wen Zhang, Ming Qin, Zhuo Chen, Xiaohui Fan, Huajun Chen

Abstract: Molecular representation learning contributes to multiple downstream tasks such as molecular property prediction and drug design. To properly represent molecules, graph contrastive learning is a promising paradigm as it utilizes self-supervision signals and has no requirements for human annotations. However, prior works fail to incorporate fundamental domain knowledge into graph semantics and thus ignore the correlations between atoms that have common attributes but are not directly connected by bonds. To address these issues, we construct a Chemical Element Knowledge Graph (KG) to summarize microscopic associations between elements and propose a novel Knowledge-enhanced Contrastive Learning (KCL) framework for molecular representation learning. KCL framework consists of three modules. The first module, knowledge-guided graph augmentation, augments the original molecular graph based on the Chemical Element KG. The second module, knowledge-aware graph representation, extracts molecular representations with a common graph encoder for the original molecular graph and a Knowledge-aware Message Passing Neural Network (KMPNN) to encode complex information in the augmented molecular graph. The final module is a contrastive objective, where we maximize agreement between these two views of molecular graphs. Extensive experiments demonstrated that KCL obtained superior performances against state-of-the-art baselines on eight molecular datasets. Visualization experiments properly interpret what KCL has learned from atoms and attributes in the augmented molecular graphs. Our codes and data are available at https://github.com/ZJU-Fangyin/KCL.

Submitted 10 March, 2022; v1 submitted 1 December, 2021; originally announced December 2021.

Comments: Accepted in AAAI 2022 Main track

arXiv:2103.13047 [pdf, other]

Knowledge-aware Contrastive Molecular Graph Learning

Authors: Yin Fang, Haihong Yang, Xiang Zhuang, Xin Shao, Xiaohui Fan, Huajun Chen

Abstract: Leveraging domain knowledge including fingerprints and functional groups in molecular representation learning is crucial for chemical property prediction and drug discovery. When modeling the relation between graph structure and molecular properties implicitly, existing works can hardly capture structural or property changes and complex structure, with much smaller atom vocabulary and highly frequent atoms. In this paper, we propose the Contrastive Knowledge-aware GNN (CKGNN) for self-supervised molecular representation learning to fuse domain knowledge into molecular graph representation. We explicitly encode domain knowledge via a knowledge-aware molecular encoder under the contrastive learning framework, ensuring that the generated molecular embeddings are equipped with chemical domain knowledge to distinguish molecules with similar chemical formulas but dissimilar functions. Extensive experiments on 8 public datasets demonstrate the effectiveness of our model with a 6% absolute improvement on average against strong competitors. Ablation study and further investigation also verify the best of both worlds: incorporation of chemical domain knowledge into self-supervised learning.

Submitted 24 March, 2021; originally announced March 2021.

Comments: 7 pages, 3 figures

arXiv:2103.10012 [pdf, ps, other]

doi 10.1109/TNSE.2021.3075222

Age-Stratified COVID-19 Spread Analysis and Vaccination: A Multitype Random Network Approach

Authors: Xianhao Chen, Guangyu Zhu, Lan Zhang, Yuguang Fang, Linke Guo, Xinguang Chen

Abstract: The risk for severe illness and mortality from COVID-19 significantly increases with age. As a result, age-stratified modeling for COVID-19 dynamics is the key to studying how to reduce hospitalizations and mortality from COVID-19. By taking advantage of network theory, we develop an age-stratified epidemic model for COVID-19 in complex contact networks. Specifically, we present an extension of the standard SEIR (susceptible-exposed-infectious-removed) compartmental model, called the age-stratified SEAHIR (susceptible-exposed-asymptomatic-hospitalized-infectious-removed) model, to capture the spread of COVID-19 over multitype random networks with general degree distributions. We derive several key epidemiological metrics and then propose an age-stratified vaccination strategy to decrease the mortality and hospitalizations. Through extensive study, we discover that the outcome of vaccination prioritization depends on the reproduction number R0. Specifically, the elderly should be prioritized only when R0 is relatively high. If ongoing intervention policies, such as universal masking, could suppress R0 at a relatively low level, prioritizing the high-transmission age group (i.e., adults aged 20-39) is most effective to reduce both mortality and hospitalizations. These conclusions provide useful recommendations for age-based vaccination prioritization for COVID-19.

Submitted 18 March, 2021; originally announced March 2021.

Comments: 11 pages, 9 figures

arXiv:2007.10848 [pdf, other]

doi 10.1093/bib/bbaa303

Recent Advances in Network-based Methods for Disease Gene Prediction

Authors: Sezin Kircali Ata, Min Wu, Yuan Fang, Le Ou-Yang, Chee Keong Kwoh, Xiao-Li Li

Abstract: Disease-gene association through Genome-wide association study (GWAS) is an arduous task for researchers. Investigating single nucleotide polymorphisms (SNPs) that correlate with specific diseases needs statistical analysis of associations. Considering the huge number of possible mutations, in addition to its high cost, another important drawback of GWAS analysis is the large number of false-positives. Thus, researchers search for more evidence to cross-check their results through different sources. To provide the researchers with alternative low-cost disease-gene association evidence, computational approaches come into play. Since molecular networks are able to capture complex interplay among molecules in diseases, they become one of the most extensively used data for disease-gene association prediction. In this survey, we aim to provide a comprehensive and up-to-date review of network-based methods for disease gene prediction. We also conduct an empirical analysis on 14 state-of-the-art methods. To summarize, we first elucidate the task definition for disease gene prediction. Secondly, we categorize existing network-based efforts into network diffusion methods, traditional machine learning methods with handcrafted graph features and graph representation learning methods. Thirdly, an empirical analysis is conducted to evaluate the performance of the selected methods across seven diseases. We also provide distinguishing findings about the discussed methods based on our empirical analysis. Finally, we highlight potential research directions for future studies on disease gene prediction.

Submitted 17 December, 2020; v1 submitted 19 July, 2020; originally announced July 2020.

Comments: Published in Briefings in Bioinformatics, 05 December 2020

Journal ref: Briefings in Bioinformatics (2020)

arXiv:1902.06032 [pdf]

Molecular Phylogeny of Chinese Thuidiaceae with emphasis on Thuidium and Pelekium

Authors: Qi-Ying Cai, Bi-Cai Guan, Gang Ge, Yan-Ming Fang

Abstract: We present a molecular phylogenetic investigation of Thuidiaceae, with emphasis on Thuidium and Pelekium. Three chloroplast sequences (trnL-F, rps4, and atpB-rbcL) and one nuclear sequence (ITS) were analyzed. Data partitions were analyzed separately and in combination by employing MP (maximum parsimony) and Bayesian methods. The influence of data conflict in combined analyses was further explored by two methods: the incongruence length difference (ILD) test and the partition addition bootstrap alteration approach (PABA). Based on the results, ITS 1 & 2 had a crucial effect on phylogenetic reconstruction in this study, and more chloroplast sequences should be combined into the analyses, given their stability for reconstruction within genera of pleurocarpous mosses. We support that Helodiaceae, including Actinothuidium, Bryochenea, and Helodium, still belongs to Thuidiaceae, and the monophyletic Thuidiaceae s. lat. should also include several genera (or species) from Leskeaceae such as Haplocladium and Leskea. In the Thuidiaceae, Thuidium and Pelekium were resolved as two separate monophyletic groups. The results from molecular phylogeny were supported by the crucial morphological characters in Thuidiaceae s. lat., Thuidium and Pelekium.

Submitted 15 February, 2019; originally announced February 2019.

Comments: 20 pages, 4 tables, 3 figures

arXiv:1902.00486 [pdf]

Differentiation of skin incision and laparoscopic trocar insertion via quantifying transient bradycardia measured by electrocardiogram

Authors: Cheng-Hsi Chang, Yue-Lin Fang, Yu-Jung Wang, Hau-tieng Wu, Yu-Ting Lin

Abstract: Background. Most surgical procedures involve structures deeper than the skin. However, the difference in surgical noxious stimulation between skin incision and laparoscopic trocar insertion is unknown. By analyzing instantaneous heart rate (IHR) calculated from the electrocardiogram, in particular the transient bradycardia in response to surgical stimuli, this study investigates surgical noxious stimuli arising from skin incision and laparoscopic trocar insertion. Methods. Thirty-five patients undergoing laparoscopic cholecystectomy were enrolled in this prospective observational study. Sequential surgical steps including umbilical skin incision (11 mm), umbilical trocar insertion (11 mm), xiphoid skin incision (5 mm), xiphoid trocar insertion (5 mm), subcostal skin incision (3 mm), and subcostal trocar insertion (3 mm) were investigated. IHR was derived from electrocardiography and calculated by the modern time-varying power spectrum. Similar to the classical heart rate variability analysis, the time-varying low frequency power (tvLF), time-varying high frequency power (tvHF), and tvLF-to-tvHF ratio (tvLHR) were calculated. Prediction probability (PK) analysis and global pointwise F-test were used to compare the performance between indices and the heart rate readings from the patient monitor. Results. Analysis of IHR showed that surgical stimulus elicits a transient bradycardia, followed by the increase of heart rate. Transient bradycardia is more significant in trocar insertion than skin incision. The IHR change quantifies differential responses to different surgical intensity. Serial PK analysis demonstrates de-sensitization in skin incision, but not in laparoscopic trocar insertion. Conclusions. Quantitative indices present the transient bradycardia introduced by noxious stimulation. The results indicate different effects between skin incision and trocar insertion.

Submitted 1 February, 2019; originally announced February 2019.

Comments: One table and 4 figures

arXiv:1812.00191 [pdf, ps, other]

Expected Density of Cooperative Bacteria in a 2D Quorum Sensing Based Molecular Communication System

Authors: Yuting Fang, Adam Noel, Andrew W. Eckford, Nan Yang

Abstract: The exchange of small molecular signals within microbial populations is generally referred to as quorum sensing (QS). QS is ubiquitous in nature and enables microorganisms to respond to fluctuations in living environments by working together. In this study, a QS-based molecular communication system within a microbial population in a two-dimensional (2D) environment is analytically modeled. Microorganisms are randomly distributed on a 2D circle where each one releases molecules at random times. The number of molecules observed at each randomly-distributed bacterium is first derived by characterizing the diffusion and degradation of signaling molecules within the population. Using the derived result and some approximation, the expected density of cooperative bacteria is derived. Our model captures the basic features of QS. The analytical results for noisy signal propagation agree with simulation results where the Brownian motion of molecules is simulated by a particle-based method. Therefore, we anticipate that our model can be used to predict the density of cooperative bacteria in a variety of QS-coordinated activities, e.g., biofilm formation and antibiotic resistance.

Submitted 13 September, 2019; v1 submitted 1 December, 2018; originally announced December 2018.

Comments: 7 pages, 7 figures; This work has been accepted by IEEE Globecom 2019

arXiv:1803.04640 [pdf, other]

Bayesian Detection of Abnormal ADS in Mutant Caenorhabditis elegans Embryos

Authors: Wei Liang, Yuxiao Yang, Yusi Fang, Zhongying Zhao, Jie Hu

Abstract: Cell division timing is critical for cell fate specification and morphogenesis during embryogenesis. How division timings are regulated among cells during development is poorly understood. Here we focus on the comparison of asynchrony of division between sister cells (ADS) between wild-type and mutant individuals of Caenorhabditis elegans. Since the replicate number of mutant individuals of each mutated gene, usually one, is far smaller than that of wild-type, direct comparison of two distributions of ADS between wild-type and mutant type, such as the Kolmogorov-Smirnov test, is not feasible. On the other hand, we find that sometimes ADS is correlated with the life span of the corresponding mother cell in wild-type. Hence, we apply a semiparametric Bayesian quantile regression method to estimate the 95% confidence interval curve of ADS with respect to the life span of the mother cell of wild-type individuals. Then, mutant-type ADSs outside the corresponding confidence interval are selected as abnormal with a significance level of 0.05. A simulation study demonstrates the accuracy of our method, and Gene Enrichment Analysis validates the results on real data sets.

Submitted 13 March, 2018; originally announced March 2018.

arXiv:1711.04870 [pdf, other]

Using Game Theory for Real-Time Behavioural Dynamics in Microscopic Populations with Noisy Signalling

Authors: Adam Noel, Yuting Fang, Nan Yang, Dimitrios Makrakis, Andrew W. Eckford

Abstract: This paper introduces the application of game theory to understand noisy real-time signalling and the resulting behavioural dynamics in microscopic populations such as bacteria and other cells. It presents a bridge between the fields of molecular communication and microscopic game theory. Molecular communication uses conventional communication engineering theory and techniques to study and design systems that use chemical molecules as information carriers. Microscopic game theory models interactions within and between populations of cells and microorganisms. Integrating these two fields provides unique opportunities to understand and control microscopic populations that have imperfect signal propagation. Two examples, namely bacteria quorum sensing and tumour cell signalling, are presented with potential games to demonstrate the application of this approach. Finally, a case study of bacteria resource sharing demonstrates how noisy signalling can alter the distribution of behaviour.

Submitted 4 February, 2019; v1 submitted 13 November, 2017; originally announced November 2017.

Comments: 10 pages, 10 figures, 1 table. Submitted for publication

arXiv:1708.06305 [pdf, other]

doi 10.1109/ITW.2017.8278046

Effect of Local Population Uncertainty on Cooperation in Bacteria

Authors: Adam Noel, Yuting Fang, Nan Yang, Dimitrios Makrakis, Andrew W. Eckford

Abstract: Bacteria populations rely on mechanisms such as quorum sensing to coordinate complex tasks that cannot be achieved by a single bacterium. Quorum sensing is used to measure the local bacteria population density, and it controls cooperation by ensuring that a bacterium only commits the resources for cooperation when it expects its neighbors to reciprocate. This paper proposes a simple model for sharing a resource in a bacterial environment, where knowledge of the population influences each bacterium's behavior. Game theory is used to model the behavioral dynamics, where the net payoff (i.e., utility) for each bacterium is a function of its current behavior and that of the other bacteria. The game is first evaluated with perfect knowledge of the population. Then, the unreliability of diffusion introduces uncertainty in the local population estimate and changes the perceived payoffs. The results demonstrate the sensitivity to the system parameters and how population uncertainty can overcome a lack of explicit coordination.

Submitted 21 August, 2017; originally announced August 2017.

Comments: 5 pages, 6 figures. Will be presented as an invited paper at the 2017 IEEE Information Theory Workshop in November 2017 in Kaohsiung, Taiwan

arXiv:1405.2362 [pdf]

doi 10.1109/CNNA.2014.6888657

Image Segmentation Using Frequency Locking of Coupled Oscillators

Authors: Yan Fang, Matthew J. Cotter, Donald M. Chiarulli, Steven P. Levitan

Abstract: Synchronization of coupled oscillators is observed at multiple levels of neural systems, and has been shown to play an important function in visual perception. We propose a computing system based on locally coupled oscillator networks for image segmentation. The system can serve as the preprocessing front-end of an image processing pipeline where the common frequencies of clusters of oscillators reflect the segmentation results. To demonstrate the feasibility of our design, the system is simulated and tested on a human face image dataset and its performance is compared with traditional intensity threshold based algorithms. Our system shows both better performance and higher noise tolerance than traditional methods.

Submitted 9 May, 2014; originally announced May 2014.

Comments: 7 pages, 14 figures, the 51st Design Automation Conference 2014, Work in Progress Poster Session

ACM Class: C.1.3

arXiv:1202.1358 [pdf, other]

Protein Folding: The Gibbs Free Energy

Authors: Yi Fang

Abstract: The fundamental law for protein folding is the Thermodynamic Principle: the amino acid sequence of a protein determines its native structure and the native structure has the minimum Gibbs free energy. If all chemical problems can be answered by quantum mechanics, there should be a quantum mechanics derivation of the Gibbs free energy formula G(X) for every possible conformation X of the protein. We apply quantum statistics to derive such a formula. For simplicity, only monomeric self-folding globular proteins are covered. We point out some immediate applications of the formula. We show that the formula explains the observed phenomena very well. It gives a unified explanation of both folding and denaturation; it explains why the hydrophobic effect is the driving force of protein folding and clarifies the role played by hydrogen bonding; it explains the successes and deficiencies of various surface area models. The formula also gives a clear kinetic force of the folding: $F_i(X) = -\nabla_{x_i} G(X)$. This also gives a natural way to perform the ab initio prediction of protein structure, minimizing G(X) by Newton's fastest descending method.

Submitted 8 April, 2012; v1 submitted 7 February, 2012; originally announced February 2012.

Comments: 18 pages, 2 figures

MSC Class: 82C10

Showing 1–27 of 27 results for author: Fang, Y