-
Quantum Long Short-Term Memory for Drug Discovery
Authors:
Liang Zhang,
Yin Xu,
Mohan Wu,
Liang Wang,
Hua Xu
Abstract:
Quantum computing combined with machine learning (ML) is an extremely promising research area, with numerous studies demonstrating that quantum machine learning (QML) is expected to solve scientific problems more effectively than classical ML. In this work, we successfully apply QML to drug discovery, showing that QML can significantly improve model performance and achieve faster convergence compared to classical ML. Moreover, we demonstrate that the accuracy of the QML model improves as the number of qubits increases. We also introduce noise to the QML model and find that it has little effect on our experimental conclusions, illustrating the high robustness of the QML model. This work highlights the potential of quantum computing to yield significant benefits for scientific advancement as the number of qubits increases and their quality improves in the future.
Submitted 29 July, 2024;
originally announced July 2024.
-
Unifying Sequences, Structures, and Descriptions for Any-to-Any Protein Generation with the Large Multimodal Model HelixProtX
Authors:
Zhiyuan Chen,
Tianhao Chen,
Chenggang Xie,
Yang Xue,
Xiaonan Zhang,
Jingbo Zhou,
Xiaomin Fang
Abstract:
Proteins are fundamental components of biological systems and can be represented through various modalities, including sequences, structures, and textual descriptions. Despite the advances in deep learning and scientific large language models (LLMs) for protein research, current methodologies predominantly focus on limited specialized tasks -- often predicting one protein modality from another. These approaches restrict the understanding and generation of multimodal protein data. In contrast, large multimodal models have demonstrated potential capabilities in generating any-to-any content like text, images, and videos, thus enriching user interactions across various domains. Integrating these multimodal model technologies into protein research offers significant promise by potentially transforming how proteins are studied. To this end, we introduce HelixProtX, a system built upon the large multimodal model, aiming to offer a comprehensive solution to protein research by supporting any-to-any protein modality generation. Unlike existing methods, it allows for the transformation of any input protein modality into any desired protein modality. The experimental results affirm the advanced capabilities of HelixProtX, not only in generating functional descriptions from amino acid sequences but also in executing critical tasks such as designing protein sequences and structures from textual descriptions. Preliminary findings indicate that HelixProtX consistently achieves superior accuracy across a range of protein-related tasks, outperforming existing state-of-the-art models. By integrating multimodal large models into protein research, HelixProtX opens new avenues for understanding protein biology, thereby promising to accelerate scientific discovery.
Submitted 12 July, 2024;
originally announced July 2024.
-
FAFE: Immune Complex Modeling with Geodesic Distance Loss on Noisy Group Frames
Authors:
Ruidong Wu,
Ruihan Guo,
Rui Wang,
Shitong Luo,
Yue Xu,
Jiahan Li,
Jianzhu Ma,
Qiang Liu,
Yunan Luo,
Jian Peng
Abstract:
Despite the striking success of general protein folding models such as AlphaFold2 (AF2; Jumper et al., 2021), the accurate computational modeling of antibody-antigen complexes remains a challenging task. In this paper, we first analyze AF2's primary loss function, known as the Frame Aligned Point Error (FAPE), and raise a previously overlooked issue: FAPE tends to suffer from vanishing gradients on targets with high rotational error. To address this fundamental limitation, we propose a novel geodesic loss called Frame Aligned Frame Error (FAFE, denoted as F2E to distinguish it from FAPE), which enables the model to better optimize both the rotational and translational errors between two frames. We then prove that F2E can be reformulated as a group-aware geodesic loss, which translates the optimization of the residue-to-residue error into optimizing the group-to-group geodesic frame distance. By fine-tuning AF2 with our proposed loss function, we attain a correct rate of 52.3% (DockQ > 0.23) on an evaluation set and a correct rate of 43.8% on a subset with low homology, substantial improvements over AF2 of 182% and 100%, respectively.
Submitted 1 July, 2024;
originally announced July 2024.
-
Multimodal Data Integration for Precision Oncology: Challenges and Future Directions
Authors:
Huajun Zhou,
Fengtao Zhou,
Chenyu Zhao,
Yingxue Xu,
Luyang Luo,
Hao Chen
Abstract:
The essence of precision oncology lies in its commitment to tailoring targeted treatments and care measures to each patient based on the individual characteristics of the tumor. The inherent heterogeneity of tumors necessitates gathering information from diverse data sources to provide valuable insights from various perspectives, fostering a holistic comprehension of the tumor. Over the past decade, multimodal data integration technology for precision oncology has made significant strides, showcasing remarkable progress in understanding the intricate details within heterogeneous data modalities. These strides have exhibited tremendous potential for improving clinical decision-making and model interpretation, contributing to the advancement of cancer care and treatment. Given the rapid progress that has been achieved, we provide a comprehensive overview of about 300 papers detailing cutting-edge multimodal data integration techniques in precision oncology. In addition, we summarize the primary clinical applications that have reaped significant benefits, including early assessment, diagnosis, prognosis, and biomarker discovery. Finally, drawing on the findings of this survey, we present an in-depth analysis that explores the pivotal challenges and reveals essential pathways for future research in the field of multimodal data integration for precision oncology.
Submitted 27 June, 2024;
originally announced June 2024.
-
Peptide Vaccine Design by Evolutionary Multi-Objective Optimization
Authors:
Dan-Xuan Liu,
Yi-Heng Xu,
Chao Qian
Abstract:
Peptide vaccines are growing in significance for fighting diverse diseases. Machine learning has improved the identification of peptides that can trigger immune responses, and the main challenge of peptide vaccine design now lies in selecting an effective subset of peptides due to the allelic diversity among individuals. Previous works mainly formulated this task as a constrained optimization problem, aiming to maximize the expected number of peptide-Major Histocompatibility Complex (peptide-MHC) bindings across a broad range of populations by selecting a subset of diverse peptides with limited size; and employed a greedy algorithm, whose performance, however, may be limited due to the greedy nature. In this paper, we propose a new framework PVD-EMO based on Evolutionary Multi-objective Optimization, which reformulates Peptide Vaccine Design as a bi-objective optimization problem that maximizes the expected number of peptide-MHC bindings and minimizes the number of selected peptides simultaneously, and employs a Multi-Objective Evolutionary Algorithm (MOEA) to solve it. We also incorporate warm-start and repair strategies into MOEAs to improve efficiency and performance. We prove that the warm-start strategy ensures that PVD-EMO maintains the same worst-case approximation guarantee as the previous greedy algorithm, and meanwhile, the EMO framework can help avoid local optima. Experiments on a peptide vaccine design for COVID-19, caused by the SARS-CoV-2 virus, demonstrate the superiority of PVD-EMO.
Submitted 9 June, 2024;
originally announced June 2024.
-
HelixFold-Multimer: Elevating Protein Complex Structure Prediction to New Heights
Authors:
Xiaomin Fang,
Jie Gao,
Jing Hu,
Lihang Liu,
Yang Xue,
Xiaonan Zhang,
Kunrui Zhu
Abstract:
While monomer protein structure prediction tools boast impressive accuracy, the prediction of protein complex structures remains a daunting challenge in the field. This challenge is particularly pronounced in scenarios involving complexes with protein chains from different species, such as antigen-antibody interactions, where accuracy often falls short. Limited by the accuracy of complex prediction, tasks based on precise protein-protein interaction analysis also face obstacles. In this report, we highlight the ongoing advancements of our protein complex structure prediction model, HelixFold-Multimer, underscoring its enhanced performance. HelixFold-Multimer provides precise predictions for diverse protein complex structures, especially in therapeutic protein interactions. Notably, HelixFold-Multimer achieves remarkable success in antigen-antibody and peptide-protein structure prediction, greatly surpassing AlphaFold 3. HelixFold-Multimer is now available for public use on the PaddleHelix platform, offering both a general version and an antigen-antibody version. Researchers can conveniently access and utilize this service for their development needs.
Submitted 17 May, 2024; v1 submitted 15 April, 2024;
originally announced April 2024.
-
UAlign: Pushing the Limit of Template-free Retrosynthesis Prediction with Unsupervised SMILES Alignment
Authors:
Kaipeng Zeng,
Bo Yang,
Xin Zhao,
Yu Zhang,
Fan Nie,
Xiaokang Yang,
Yaohui Jin,
Yanyan Xu
Abstract:
Motivation: Retrosynthesis planning poses a formidable challenge in the organic chemical industry. Single-step retrosynthesis prediction, a crucial step in the planning process, has witnessed a surge in interest in recent years due to advancements in AI for science. Various deep learning-based methods have been proposed for this task in recent years, incorporating diverse levels of additional chemical knowledge dependency.
Results: This paper introduces UAlign, a template-free graph-to-sequence pipeline for retrosynthesis prediction. By combining graph neural networks and Transformers, our method can more effectively leverage the inherent graph structure of molecules. Based on the fact that the majority of molecule structures remain unchanged during a chemical reaction, we propose a simple yet effective SMILES alignment technique to facilitate the reuse of unchanged structures for reactant generation. Extensive experiments show that our method substantially outperforms state-of-the-art template-free and semi-template-based approaches. Importantly, our template-free method matches or even surpasses the effectiveness of established powerful template-based methods.
Scientific contribution: We present a novel graph-to-sequence template-free retrosynthesis prediction pipeline that overcomes the limitations of Transformer-based methods in molecular representation learning and insufficient utilization of chemical information. We propose an unsupervised learning mechanism for establishing product-atom correspondence with reactant SMILES tokens, achieving even better results than supervised SMILES alignment methods. Extensive experiments demonstrate that UAlign significantly outperforms state-of-the-art template-free methods and rivals or surpasses template-based approaches, with up to 5% (top-5) and 5.4% (top-10) increased accuracy over the strongest baseline.
Submitted 19 April, 2024; v1 submitted 24 March, 2024;
originally announced April 2024.
-
Prediction of vaccination coverage level in the heterogeneous mixing population
Authors:
Fan Bai,
Qianyu Chen,
Yizhuo Xu
Abstract:
Population heterogeneity is a key factor in modeling the transmission of disease and has a large impact on the outcome of the transmission. To investigate the decision-making process in a heterogeneous mixing population regarding whether or not to be vaccinated, we propose a modeling framework that combines epidemic models with game-theoretical analysis. We consider two sources of heterogeneity in this paper: different activity levels and different relative vaccination costs. Interestingly, if both sources of heterogeneity are considered, there exists a finite number of Nash equilibria (evolutionarily stable strategies (ESS)) of the vaccination game, whereas if only the difference in activity levels is considered, there are infinitely many Nash equilibria. In the latter case, the decision-making process becomes highly sensitive to the initial condition. In public health management applications, the inclusion of population heterogeneity significantly complicates the prediction of the overall vaccine coverage level.
Submitted 29 February, 2024;
originally announced February 2024.
-
FGBERT: Function-Driven Pre-trained Gene Language Model for Metagenomics
Authors:
ChenRui Duan,
Zelin Zang,
Yongjie Xu,
Hang He,
Zihan Liu,
Zijia Song,
Ju-Sheng Zheng,
Stan Z. Li
Abstract:
Metagenomic data, comprising mixed multi-species genomes, are prevalent in diverse environments like oceans and soils, significantly impacting human health and ecological functions. However, current research relies on K-mer representations, limiting the capture of structurally relevant gene contexts. To address these limitations and further our understanding of complex relationships between metagenomic sequences and their functions, we introduce a protein-based gene representation as a context-aware and structure-relevant tokenizer. Our approach includes Masked Gene Modeling (MGM) for gene group-level pre-training, providing insights into inter-gene contextual information, and Triple Enhanced Metagenomic Contrastive Learning (TEM-CL) for gene-level pre-training to model gene sequence-function relationships. MGM and TEM-CL constitute our novel metagenomic language model FGBERT, pre-trained on 100 million metagenomic sequences. We demonstrate the superiority of our proposed FGBERT on eight datasets.
Submitted 24 February, 2024;
originally announced February 2024.
-
Accelerating Discovery of Novel and Bioactive Ligands With Pharmacophore-Informed Generative Models
Authors:
Weixin Xie,
Jianhang Zhang,
Qin Xie,
Chaojun Gong,
Youjun Xu,
Luhua Lai,
Jianfeng Pei
Abstract:
Deep generative models have made significant advances in accelerating drug discovery by generating bioactive chemicals against desired targets. Nevertheless, most generated compounds that have been validated for potent bioactivity exhibit unsatisfactory levels of structural novelty, thereby providing limited inspiration to human medicinal chemists. The challenge faced by generative models lies in producing compounds that are both bioactive and novel, rather than merely making minor modifications to known actives present in the training set. Recognizing the utility of pharmacophores in facilitating scaffold hopping, we developed TransPharmer, an innovative generative model that integrates ligand-based interpretable pharmacophore fingerprints with a generative pre-training transformer (GPT) for de novo molecule generation. TransPharmer demonstrates superior performance across tasks involving unconditioned distribution learning, de novo generation and scaffold elaboration under pharmacophoric constraints. Its distinct exploration mode within the local chemical space renders it particularly useful for scaffold hopping, producing compounds that are structurally novel while pharmaceutically related. The efficacy of TransPharmer is validated through two case studies involving the dopamine receptor D2 (DRD2) and polo-like kinase 1 (PLK1). Notably, in the case of PLK1, three out of four synthesized designed compounds exhibit submicromolar activities, with the most potent one, IIP0943, demonstrating a potency of 5.1 nM. Featuring a new scaffold of 4-(benzo[b]thiophen-7-yloxy)pyrimidine, IIP0943 also exhibits high selectivity for PLK1. These results demonstrate that TransPharmer is a powerful tool for the discovery of novel and bioactive ligands.
Submitted 2 January, 2024;
originally announced January 2024.
-
A rigorous benchmarking of methods for SARS-CoV-2 lineage abundance estimation in wastewater
Authors:
Viorel Munteanu,
Victor Gordeev,
Michael Saldana,
Eva Aßmann,
Justin Maine Su,
Nicolae Drabcinski,
Oksana Zlenko,
Maryna Kit,
Felicia Iordachi,
Khooshbu Kantibhai Patel,
Abdullah Al Nahid,
Likhitha Chittampalli,
Yidian Xu,
Pavel Skums,
Shelesh Agrawal,
Martin Hölzer,
Adam Smith,
Alex Zelikovsky,
Serghei Mangul
Abstract:
In light of the continuous transmission and evolution of SARS-CoV-2 coupled with a significant decline in clinical testing, there is a pressing need for scalable, cost-effective, long-term, passive surveillance tools to effectively monitor viral variants circulating in the population. Wastewater genomic surveillance of SARS-CoV-2 has emerged as an alternative to clinical genomic surveillance, allowing continuous monitoring of the prevalence of viral lineages in communities of various sizes at a fraction of the time, cost, and logistic effort, and serving as an early warning system for emerging variants, which is critical for developed communities and especially for underserved ones. Importantly, lineage prevalence estimates obtained with this approach aren't distorted by biases related to clinical testing accessibility and participation. However, the relative performance of bioinformatics methods used to measure relative lineage abundances from wastewater sequencing data is unknown, preventing both the research community and public health authorities from making informed decisions regarding computational tool selection. Here, we perform comprehensive benchmarking of 18 bioinformatics methods for estimating the relative abundance of SARS-CoV-2 (sub)lineages in wastewater by using data from 36 in vitro mixtures of synthetic lineage and sublineage genomes. In addition, we use simulated data from 78 mixtures of lineages and sublineages co-occurring in the clinical setting with proportions mirroring their prevalence ratios observed in real data. Importantly, we investigate how the accuracy of the evaluated methods is affected by the sequencing technology used, the associated error rate, the read length, and the read depth, as well as by the exposure of the synthetic RNA mixtures to wastewater, with the goal of capturing the effects induced by the wastewater matrix, including RNA fragmentation and degradation.
Submitted 21 January, 2024; v1 submitted 29 September, 2023;
originally announced September 2023.
-
Survey of Consciousness Theory from Computational Perspective
Authors:
Zihan Ding,
Xiaoxi Wei,
Yidan Xu
Abstract:
Human consciousness has been a long-lasting mystery for centuries, while machine intelligence and consciousness is an arduous pursuit. Researchers have developed diverse theories for interpreting the consciousness phenomenon in human brains from different perspectives and levels. This paper surveys several main branches of consciousness theories originating from different subjects, including information theory, quantum physics, cognitive psychology, physiology and computer science, with the aim of bridging these theories from a computational perspective. It also discusses the existing evaluation metrics of consciousness and the possibility for current computational models to be conscious. Breaking the mystery of consciousness can be an essential step in building general artificial intelligence with computing machines.
Submitted 18 September, 2023;
originally announced September 2023.
-
Revive, Restore, Revitalize: An Eco-economic Methodology for Maasai Mara
Authors:
Yipeng Xu,
He Sun,
Junfeng Zhu
Abstract:
The Maasai Mara in Kenya, renowned for its biodiversity, is witnessing ecosystem degradation and species endangerment due to intensified human activities. Addressing this, we introduce a dynamic system harmonizing ecological and human priorities. Our agent-based model replicates the Maasai Mara savanna ecosystem, incorporating 71 animal species, 10 human classifications, and 2 natural resource types. The model employs the metabolic rate-mass relationship for animal energy dynamics, logistic curves for animal growth, individual interactions for food web simulation, and human intervention impacts. Algorithms like fitness proportional selection and particle swarm mimic organism preferences for resources. To guide preservation activities, we formulated 21 management strategies encompassing tourism, transportation, taxation, environmental conservation, research, diplomacy, and poaching, employing a game-theoretic framework. Using the TOPSIS method, we prioritized four key developmental indicators: environmental health, research advancement, economic growth, and security. The interplay of 16 factors determines these indicators, each influenced by our policies to varying degrees. By evaluating the policies' repercussions, we aim to mitigate adverse animal-human interactions and equitably address human concerns. We classified the policy impacts into three categories: Environmental Preservation, Economic Prosperity, and Holistic Development. By applying these policy groupings to our ecosystem model, we tracked the effects on the intricate animal-human-resource dynamics. Utilizing the entropy weight method, we assessed the efficacy of these policy clusters over a decade, identifying the optimal blend emphasizing both environmental conservation and economic progression.
Submitted 11 September, 2023;
originally announced September 2023.
-
MC-NN: An End-to-End Multi-Channel Neural Network Approach for Predicting Influenza A Virus Hosts and Antigenic Types
Authors:
Yanhua Xu,
Dominik Wojtczak
Abstract:
Influenza poses a significant threat to public health, particularly among the elderly, young children, and people with underlying diseases. The manifestation of severe conditions, such as pneumonia, highlights the importance of preventing the spread of influenza. An accurate and cost-effective prediction of the host and antigenic subtypes of influenza A viruses is essential to addressing this issue, particularly in resource-constrained regions. In this study, we propose a multi-channel neural network model to predict the host and antigenic subtypes of influenza A viruses from hemagglutinin and neuraminidase protein sequences. Our model was trained on a comprehensive data set of complete protein sequences and evaluated on various test data sets of complete and incomplete sequences. The results demonstrate the potential and practicality of using multi-channel neural networks in predicting the host and antigenic subtypes of influenza A viruses from both full and partial protein sequences.
Submitted 21 February, 2024; v1 submitted 8 June, 2023;
originally announced June 2023.
-
Development and Evaluation of Conformal Prediction Methods for QSAR
Authors:
Yuting Xu,
Andy Liaw,
Robert P. Sheridan,
Vladimir Svetnik
Abstract:
The quantitative structure-activity relationship (QSAR) regression model is a commonly used technique for predicting biological activities of compounds using their molecular descriptors. Predictions from QSAR models can help, for example, to optimize molecular structure; prioritize compounds for further experimental testing; and estimate their toxicity. In addition to the accurate estimation of the activity, it is highly desirable to obtain some estimate of the uncertainty associated with the prediction, e.g., calculate a prediction interval (PI) containing the true molecular activity with a pre-specified probability, say 70%, 90% or 95%. The challenge is that most machine learning (ML) algorithms that achieve superior predictive performance require some add-on methods for estimating uncertainty of their prediction. The development of these algorithms is an active area of research by statistical and ML communities but their implementation for QSAR modeling remains limited. Conformal prediction (CP) is a promising approach. It is agnostic to the prediction algorithm and can produce valid prediction intervals under some weak assumptions on the data distribution. We proposed computationally efficient CP algorithms tailored to the most advanced ML models, including Deep Neural Networks and Gradient Boosting Machines. The validity and efficiency of proposed conformal predictors are demonstrated on a diverse collection of QSAR datasets as well as simulation studies.
Submitted 3 April, 2023;
originally announced April 2023.
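To illustrate the conformal prediction idea summarized in the abstract above, here is a minimal sketch of split (inductive) conformal prediction for regression. It is a generic textbook variant and is not claimed to be the authors' algorithm; the function name `split_conformal_interval` and the toy numbers are hypothetical, and the underlying QSAR model is assumed to have already produced predictions for a held-out calibration set.

```python
import math


def split_conformal_interval(cal_true, cal_pred, new_pred, alpha=0.1):
    """Return a (lower, upper) prediction interval with nominal coverage 1 - alpha.

    cal_true / cal_pred: observed activities and model predictions on a
    held-out calibration set (hypothetical inputs; any regression model works).
    new_pred: the model's point prediction for a new compound.
    """
    # Nonconformity score: absolute residual on each calibration example.
    scores = sorted(abs(y - p) for y, p in zip(cal_true, cal_pred))
    n = len(scores)
    # Rank of the finite-sample-corrected quantile: ceil((n + 1) * (1 - alpha)).
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this coverage level: the interval is unbounded.
        return (-math.inf, math.inf)
    q = scores[k - 1]
    return (new_pred - q, new_pred + q)


if __name__ == "__main__":
    # Toy calibration data (made up for illustration only).
    observed = [1.0, 2.0, 3.0, 4.0]
    predicted = [1.1, 1.8, 3.3, 3.6]
    print(split_conformal_interval(observed, predicted, new_pred=2.0, alpha=0.5))
```

The interval width here is the same for every compound; variants that scale the residual by a per-compound uncertainty estimate give adaptive widths, at the cost of needing such an estimate from the model.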
-
Label Propagation via Random Walk for Training Robust Thalamus Nuclei Parcellation Model from Noisy Annotations
Authors:
Anqi Feng,
Yuan Xue,
Yuli Wang,
Chang Yan,
Zhangxing Bian,
Muhan Shao,
Jiachen Zhuo,
Rao P. Gullapalli,
Aaron Carass,
Jerry L. Prince
Abstract:
Data-driven thalamic nuclei parcellation depends on high-quality manual annotations. However, the small size of thalamic nuclei and the low contrast between them yield annotations that are often incomplete, noisy, or ambiguously labelled. To train a robust thalamic nuclei parcellation model with noisy annotations, we propose a label propagation algorithm based on the random walker method to refine the annotations before model training. A two-step model was trained to generate first the whole thalamus mask and then the nuclei masks. We conducted experiments on a mild traumatic brain injury (mTBI) dataset with noisy thalamic nuclei annotations. Our model outperforms current state-of-the-art thalamic nuclei parcellations by a clear margin. We believe our method can also facilitate the training of other parcellation models with noisy labels.
Submitted 30 March, 2023;
originally announced March 2023.
-
Machine Learning for Flow Cytometry Data Analysis
Authors:
Yanhua Xu
Abstract:
Flow cytometry is mainly used for detecting the characteristics of a number of biochemical substances based on the expression of specific markers in cells. It is particularly useful for detecting membrane surface receptors, antigens, ions, or during DNA/RNA expression. Not only can it be employed as a biomedical research tool for recognising distinctive types of cells in mixed populations, but it can also be used as a diagnostic tool for classifying abnormal cell populations connected with disease. Modern flow cytometers can rapidly analyse tens of thousands of cells at the same time while also measuring multiple parameters from a single cell. However, the rapid development of flow cytometers makes it challenging for conventional analysis methods to interpret flow cytometry data. Researchers need to be able to distinguish interesting-looking cell populations manually in multi-dimensional data collected from millions of cells. Thus, it is essential to find a robust approach for analysing flow cytometry data automatically, specifically for identifying cell populations. This thesis mainly aims to uncover the potential shortcomings of current automated-gating algorithms on both real and synthetic datasets. Three representative automated clustering algorithms are selected to be applied, compared and evaluated with completely and partially automated gating. A subspace clustering algorithm, ProClus, is also implemented in this thesis. ProClus does not perform well on flow cytometry data, but it is still a useful algorithm for detecting noise.
Submitted 15 March, 2023;
originally announced March 2023.
-
Characterize the non-Gaussian diffusion property of cerebrospinal fluid using Diffusion Kurtosis Imaging and explore its diagnostic efficacy for Alzheimer's disease
Authors:
Yingnan Xue,
Min Wen,
Qiong Ye
Abstract:
Differentiating Alzheimer's disease (AD) patients from healthy controls (HCs) remains a challenge. Changes in protein levels in the cerebrospinal fluid (CSF) of AD patients have been reported in the literature. Macromolecules hinder the movement of water in CSF and lead to non-Gaussian diffusion. Diffusion kurtosis imaging (DKI) is a commonly used technique for quantifying non-Gaussian diffusivity. In this study, we used DKI to evaluate the non-Gaussian diffusion of CSF in AD patients and HCs. Between-group differences were explored. In addition, we built a prediction model using cross-validated Support Vector Machines (SVM) and achieved excellent performance. The validated area under the receiver operating characteristic curve (AUC) is in the range of 0.96-1.00, and the rate of correct prediction is in the range of 87.1%-90.0%.
Submitted 23 February, 2023;
originally announced February 2023.
-
Protein Language Models and Structure Prediction: Connection and Progression
Authors:
Bozhen Hu,
Jun Xia,
Jiangbin Zheng,
Cheng Tan,
Yufei Huang,
Yongjie Xu,
Stan Z. Li
Abstract:
The prediction of protein structures from sequences is an important task for function prediction, drug design, and understanding related biological processes. Recent advances have proved the power of language models (LMs) in processing protein sequence databases; these models inherit the advantages of attention networks and capture useful information in learning representations for proteins. The past two years have witnessed remarkable success in tertiary protein structure prediction (PSP), including evolution-based and single-sequence-based PSP. It seems that instead of using energy-based models and sampling procedures, protein language model (pLM)-based pipelines have emerged as mainstream paradigms in PSP. Despite the fruitful progress, the PSP community needs a systematic and up-to-date survey to help bridge the gap between LMs in the natural language processing (NLP) and PSP domains and introduce their methodologies, advancements and practical applications. To this end, in this paper, we first introduce the similarities between protein and human languages that allow LMs to be extended to pLMs and applied to protein databases. Then, we systematically review recent advances in LMs and pLMs from the perspectives of network architectures, pre-training strategies, applications, and commonly used protein databases. Next, different types of methods for PSP are discussed, particularly how the pLM-based architectures function in the process of protein folding. Finally, we identify challenges faced by the PSP community and foresee promising research directions along with the advances of pLMs. This survey aims to be a hands-on guide for researchers to understand PSP methods, develop pLMs and tackle challenging problems in this field for practical purposes.
Submitted 29 November, 2022;
originally announced November 2022.
-
Data-driven generation of 4D velocity profiles in the aneurysmal ascending aorta
Authors:
Simone Saitta,
Ludovica Maga,
Chloe Armour,
Emiliano Votta,
Declan P. O'Regan,
M. Yousuf Salmasi,
Thanos Athanasiou,
Jonathan W. Weinsaft,
Xiao Yun Xu,
Selene Pirola,
Alberto Redaelli
Abstract:
Numerical simulations of blood flow are a valuable tool to investigate the pathophysiology of ascending thoracic aortic aneurysms (ATAA). To accurately reproduce hemodynamics, computational fluid dynamics (CFD) models must employ realistic inflow boundary conditions (BCs). However, the limited availability of in vivo velocity measurements still makes researchers resort to idealized BCs. In this study we generated and thoroughly characterized a large dataset of synthetic 4D aortic velocity profiles suitable to be used as BCs for CFD simulations. 4D flow MRI scans of 30 subjects with ATAA were processed to extract cross-sectional planes along the ascending aorta, ensuring spatial alignment among all planes and interpolating all velocity fields to a reference configuration. Velocity profiles of the clinical cohort were extensively characterized by computing flow morphology descriptors of both spatial and temporal features. By exploiting principal component analysis (PCA), a statistical shape model (SSM) of 4D aortic velocity profiles was built and a dataset of 437 synthetic cases with realistic properties was generated. Comparison between clinical and synthetic datasets showed that the synthetic data presented characteristics similar to those of the clinical population in terms of key morphological parameters. The average velocity profile qualitatively resembled a parabolic-shaped profile, but was quantitatively characterized by more complex flow patterns which an idealized profile would not replicate. Statistically significant correlations were found between PCA principal modes of variation and flow descriptors. We built a data-driven generative model of 4D aortic velocity profiles, suitable to be used in computational studies of blood flow. The proposed software system also allows mapping any of the generated velocity profiles to the inlet plane of any virtual subject given its coordinate set.
Submitted 1 November, 2022;
originally announced November 2022.
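The PCA-based statistical shape model mentioned in the abstract above can be sketched generically as follows. This is an illustration under stated assumptions rather than the authors' pipeline: each subject's aligned velocity field is assumed to be flattened into one row vector, random numbers stand in for the clinical data, and new cases are drawn by sampling Gaussian mode coefficients, a sampling scheme the abstract does not specify. The function names `fit_shape_model` and `sample_synthetic` are hypothetical.

```python
import numpy as np


def fit_shape_model(samples, n_modes):
    """Fit a PCA shape model: mean vector, principal modes, per-mode std."""
    mean = samples.mean(axis=0)
    centered = samples - mean
    # Rows of vt are orthonormal principal directions of the centered data.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    modes = vt[:n_modes]
    # Standard deviation of the data projected onto each retained mode.
    std = singular_values[:n_modes] / np.sqrt(samples.shape[0] - 1)
    return mean, modes, std


def sample_synthetic(mean, modes, std, n_new, rng):
    """Draw new cases as mean + sum_k b_k * mode_k with b_k ~ N(0, std_k^2)."""
    coeffs = rng.standard_normal((n_new, std.size)) * std
    return mean + coeffs @ modes


rng = np.random.default_rng(0)
# Placeholder for 30 subjects, each flattened to a 50-dimensional vector.
clinical = rng.normal(size=(30, 50))

mean, modes, std = fit_shape_model(clinical, n_modes=5)
synthetic = sample_synthetic(mean, modes, std, n_new=437, rng=rng)
print(synthetic.shape)
```

Restricting the sampling to a few leading modes keeps synthetic cases inside the variability observed in the cohort; how many modes to retain is a modelling choice that this sketch fixes arbitrarily at five.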
-
One Transformer Can Understand Both 2D & 3D Molecular Data
Authors:
Shengjie Luo,
Tianlang Chen,
Yixian Xu,
Shuxin Zheng,
Tie-Yan Liu,
Liwei Wang,
Di He
Abstract:
Unlike vision and language data, which usually have a unique format, molecules can naturally be characterized using different chemical formulations. One can view a molecule as a 2D graph or define it as a collection of atoms located in a 3D space. For molecular representation learning, most previous works designed neural networks only for a particular data format, making the learned models likely to fail for other data formats. We believe a general-purpose neural network model for chemistry should be able to handle molecular tasks across data modalities. To achieve this goal, in this work, we develop a novel Transformer-based Molecular model called Transformer-M, which can take molecular data of 2D or 3D formats as input and generate meaningful semantic representations. Using the standard Transformer as the backbone architecture, Transformer-M develops two separate channels to encode 2D and 3D structural information and incorporates them with the atom features in the network modules. When the input data is in a particular format, the corresponding channel will be activated, and the other will be disabled. By training on 2D and 3D molecular data with properly designed supervised signals, Transformer-M automatically learns to leverage knowledge from different data modalities and correctly capture the representations. We conducted extensive experiments for Transformer-M. All empirical results show that Transformer-M can simultaneously achieve strong performance on 2D and 3D tasks, suggesting its broad applicability. The code and models will be made publicly available at https://github.com/lsj2408/Transformer-M.
Submitted 27 March, 2023; v1 submitted 4 October, 2022;
originally announced October 2022.
-
Widely Used and Fast De Novo Drug Design by a Protein Sequence-Based Reinforcement Learning Model
Authors:
Yaqin Li,
Lingli Li,
Yongjin Xu,
Yi Yu
Abstract:
De novo molecular design has facilitated the exploration of large chemical space to accelerate drug discovery. Structure-based de novo methods can overcome the data scarcity of active ligands by incorporating drug-target interaction into deep generative architectures. However, these strategies are bottlenecked by the small fraction of experimentally determined protein or complex structures. In addition, molecular generation is computationally expensive due to the 3D representations of both molecule and protein. Here, we demonstrate a widely used and fast protein sequence-based reinforcement learning (RL) model for drug discovery. In the generative model, one of the reward components, a binding affinity predictor, is based on the 1D protein sequence and molecular SMILES. As a proof of concept, the RL model was utilized to design molecules for four targets. The generated compounds showed bioactivities, as validated by both QSAR and molecular docking with experimental 3D binding pockets. We also found that the performance of the generated molecules depends on the choice of training data source for the binding predictor. Furthermore, drug design for a kinase without any experimental structure, CDK20, was studied with our model. With only the 1D protein sequence as input, the generated novel compounds showed favorable binding affinity based on the AlphaFold-predicted structure.
Submitted 14 August, 2022;
originally announced September 2022.
-
Dynamics of COVID-19 models with asymptomatic infections and quarantine measures
Authors:
Songbai Guo,
Yuling Xue,
Xiliang Li,
Zuohuan Zheng
Abstract:
Considering the propagation characteristics of COVID-19 in different regions, the dynamics analysis and numerical demonstration of long-term and short-term models of COVID-19 are carried out, respectively. The long-term model is devoted to investigating the global stability of a COVID-19 model with asymptomatic infections and quarantine measures. By using the limit system of the model and the Lyapunov function method, it is shown that the COVID-19-free equilibrium $V^0$ is globally asymptotically stable if the control reproduction number $\mathcal{R}_{c}<1$ and globally attractive if $\mathcal{R}_{c}=1$, which means that COVID-19 will die out; the COVID-19 equilibrium $V^{\ast}$ is globally asymptotically stable if $\mathcal{R}_{c}>1$, which means that COVID-19 will be persistent. In particular, to obtain the local stability of $V^{\ast}$, we use proof by contradiction and the properties of complex modulus with some novel details, and we prove the weak persistence of the system to obtain the global attractivity of $V^{\ast}$. Moreover, the final size of the corresponding short-term model is calculated and the stability of its multiple equilibria is analyzed. Numerical simulations of COVID-19 cases show that quarantine measures and asymptomatic infections have a non-negligible impact on the transmission of COVID-19.
Submitted 6 November, 2022; v1 submitted 12 September, 2022;
originally announced September 2022.
-
HelixFold: An Efficient Implementation of AlphaFold2 using PaddlePaddle
Authors:
Guoxia Wang,
Xiaomin Fang,
Zhihua Wu,
Yiqun Liu,
Yang Xue,
Yingfei Xiang,
Dianhai Yu,
Fan Wang,
Yanjun Ma
Abstract:
Accurate protein structure prediction can significantly accelerate the development of life science. The accuracy of AlphaFold2, a frontier end-to-end structure prediction system, is already close to that of the experimental determination techniques. Due to the complex model architecture and large memory consumption, it requires substantial computational resources and time to implement the training and inference of AlphaFold2 from scratch. Running the original AlphaFold2 is expensive for most individuals and institutions. Therefore, reducing this cost could accelerate the development of life science. We implement AlphaFold2 using PaddlePaddle, namely HelixFold, to improve training and inference speed and reduce memory consumption. The performance is improved by operator fusion, tensor fusion, and hybrid parallelism computation, while the memory is optimized through Recompute, BFloat16, and memory read/write in-place. Compared with the original AlphaFold2 (implemented with Jax) and OpenFold (implemented with PyTorch), HelixFold needs only 7.5 days to complete the full end-to-end training and only 5.3 days when using hybrid parallelism, while both AlphaFold2 and OpenFold take about 11 days. HelixFold saves 1x training time. We verified that HelixFold's accuracy could be on par with AlphaFold2 on the CASP14 and CAMEO datasets. HelixFold's code is available on GitHub for free download: https://github.com/PaddlePaddle/PaddleHelix/tree/dev/apps/protein_folding/helixfold, and we also provide stable web services at https://paddlehelix.baidu.com/app/drug/protein/forecast.
Submitted 13 July, 2022; v1 submitted 12 July, 2022;
originally announced July 2022.
-
Multi-channel neural networks for predicting influenza A virus hosts and antigenic types
Authors:
Yanhua Xu,
Dominik Wojtczak
Abstract:
Influenza occurs every season and occasionally causes pandemics. Despite its low mortality rate, influenza is a major public health concern, as it can be complicated by severe diseases like pneumonia. A fast, accurate and low-cost method to predict the origin host and subtype of influenza viruses could help reduce virus transmission and benefit resource-poor areas. In this work, we propose multi-channel neural networks to predict antigenic types and hosts of influenza A viruses with hemagglutinin and neuraminidase protein sequences. An integrated data set containing complete protein sequences was used to produce a pre-trained model, and two other data sets were used for testing the model's performance. One test set contained complete protein sequences, and the other contained incomplete protein sequences. The results suggest that multi-channel neural networks are applicable and promising for predicting influenza A virus hosts and antigenic subtypes with complete and partial protein sequences.
Submitted 29 July, 2022; v1 submitted 8 June, 2022;
originally announced June 2022.
-
Accurate Virus Identification with Interpretable Raman Signatures by Machine Learning
Authors:
Jiarong Ye,
Yin-Ting Yeh,
Yuan Xue,
Ziyang Wang,
Na Zhang,
He Liu,
Kunyan Zhang,
RyeAnne Ricker,
Zhuohang Yu,
Allison Roder,
Nestor Perea Lopez,
Lindsey Organtini,
Wallace Greene,
Susan Hafenstein,
Huaguang Lu,
Elodie Ghedin,
Mauricio Terrones,
Shengxi Huang,
Sharon Xiaolei Huang
Abstract:
Rapid identification of newly emerging or circulating viruses is an important first step toward managing the public health response to potential outbreaks. A portable virus capture device coupled with label-free Raman Spectroscopy holds the promise of fast detection by rapidly obtaining the Raman signature of a virus followed by a machine learning approach applied to recognize the virus based on its Raman spectrum, which is used as a fingerprint. We present such a machine learning approach for analyzing Raman spectra of human and avian viruses. A Convolutional Neural Network (CNN) classifier specifically designed for spectral data achieves very high accuracy for a variety of virus type or subtype identification tasks. In particular, it achieves 99% accuracy for classifying influenza virus type A vs. type B, 96% accuracy for classifying four subtypes of influenza A, 95% accuracy for differentiating enveloped and non-enveloped viruses, and 99% accuracy for differentiating avian coronavirus (infectious bronchitis virus, IBV) from other avian viruses. Furthermore, interpretation of neural net responses in the trained CNN model using a full-gradient algorithm highlights Raman spectral ranges that are most important to virus identification. By correlating ML-selected salient Raman ranges with the signature ranges of known biomolecules and chemical functional groups (for example, amide, amino acid, carboxylic acid), we verify that our ML model effectively recognizes the Raman signatures of proteins, lipids and other vital functional groups present in different viruses and uses a weighted combination of these signatures to identify viruses.
Submitted 5 June, 2022;
originally announced June 2022.
-
A novel analysis approach of uniform persistence for a COVID-19 model with quarantine and standard incidence rate
Authors:
Songbai Guo,
Yuling Xue,
Xiliang Li,
Zuohuan Zheng
Abstract:
A coronavirus disease 2019 (COVID-19) model with quarantine and standard incidence rate is first developed, then a novel analysis approach for finding the ultimate lower bound of COVID-19 infectious individuals is proposed, which means that the COVID-19 pandemic is uniformly persistent if the control reproduction number $\mathcal{R}_{c}>1$. This approach can be applied to other related biomathematical models, and some existing works can be improved by using it. In addition, the COVID-19-free equilibrium $V^0$ is locally asymptotically stable (LAS) if $\mathcal{R}_{c}<1$ and linearly stable if $\mathcal{R}_{c}=1$, while $V^0$ is unstable if $\mathcal{R}_{c}>1$.
Submitted 31 October, 2022; v1 submitted 31 May, 2022;
originally announced May 2022.
-
MolMiner: You only look once for chemical structure recognition
Authors:
Youjun Xu,
Jinchuan Xiao,
Chia-Han Chou,
Jianhang Zhang,
Jintao Zhu,
Qiwan Hu,
Hemin Li,
Ningsheng Han,
Bingyu Liu,
Shuaipeng Zhang,
Jinyu Han,
Zhen Zhang,
Shuhao Zhang,
Weilin Zhang,
Luhua Lai,
Jianfeng Pei
Abstract:
Molecular structures are always depicted in 2D printed form in scientific documents like journal papers and patents. However, these 2D depictions are not machine-readable. Due to a backlog of decades and an increasing amount of printed literature, there is a high demand for the translation of printed depictions into machine-readable formats, which is known as Optical Chemical Structure Recognition (OCSR). Most OCSR systems developed over the last three decades follow a rule-based approach where the key step of vectorization of the depiction is based on the interpretation of vectors and nodes as bonds and atoms. Here, we present a practical software tool, MolMiner, which is primarily built using deep neural networks originally developed for semantic segmentation and object detection to recognize atom and bond elements from documents. These recognized elements can be easily connected as a molecular graph with a distance-based construction algorithm. We carefully evaluate our software on four benchmark datasets, where it achieves state-of-the-art performance. Various real application scenarios are also tested, yielding satisfactory outcomes. The free download links of Mac and Windows versions are available: Mac: https://molminer-cdn.iipharma.cn/pharma-mind/artifact/latest/mac/PharmaMind-mac-latest-setup.dmg and Windows: https://molminer-cdn.iipharma.cn/pharma-mind/artifact/latest/win/PharmaMind-win-latest-setup.exe
Submitted 22 May, 2022;
originally announced May 2022.
-
A Survey on Deep Graph Generation: Methods and Applications
Authors:
Yanqiao Zhu,
Yuanqi Du,
Yinkai Wang,
Yichen Xu,
Jieyu Zhang,
Qiang Liu,
Shu Wu
Abstract:
Graphs are ubiquitous in encoding relational information of real-world objects in many domains. Graph generation, whose purpose is to generate new graphs from a distribution similar to the observed graphs, has received increasing attention thanks to the recent advances of deep learning models. In this paper, we conduct a comprehensive review of the existing literature on deep graph generation, from a variety of emerging methods to its wide application areas. Specifically, we first formulate the problem of deep graph generation and discuss its differences from several related graph learning tasks. Secondly, we divide the state-of-the-art methods into three categories based on model architectures and summarize their generation strategies. Thirdly, we introduce three key application areas of deep graph generation. Lastly, we highlight challenges and opportunities in the future study of deep graph generation. We hope that our survey will be useful for researchers and practitioners who are interested in this exciting and rapidly developing field.
Submitted 6 December, 2022; v1 submitted 13 March, 2022;
originally announced March 2022.
-
Multimodal Pre-Training Model for Sequence-based Prediction of Protein-Protein Interaction
Authors:
Yang Xue,
Zijing Liu,
Xiaomin Fang,
Fan Wang
Abstract:
Protein-protein interactions (PPIs) are essential for many biological processes where two or more proteins physically bind together to achieve their functions. Modeling PPIs is useful for many biomedical applications, such as vaccine design, antibody therapeutics, and peptide drug discovery. Pre-training a protein model to learn effective representations is critical for PPIs. Most pre-training models for PPIs are sequence-based, naively applying the language models used in natural language processing to amino acid sequences. More advanced works utilize the structure-aware pre-training technique, taking advantage of the contact maps of known protein structures. However, neither sequences nor contact maps can fully characterize the structures and functions of the proteins, which are closely related to the PPI problem. Inspired by this insight, we propose a multimodal protein pre-training model with three modalities: sequence, structure, and function (S2F). Notably, instead of using contact maps to learn the amino acid-level rigid structures, we encode the structure feature with the topology complex of point clouds of heavy atoms. This allows our model to learn structural information about not only the backbones but also the side chains. Moreover, our model incorporates the knowledge from the functional description of proteins extracted from literature or manual annotations. Our experiments show that S2F learns protein embeddings that achieve good performance on a variety of PPI tasks, including cross-species PPI, antibody-antigen affinity prediction, antibody neutralization prediction for SARS-CoV-2, and mutation-driven binding affinity change prediction.
Submitted 9 December, 2021;
originally announced December 2021.
-
Major Depressive Disorder Recognition and Cognitive Analysis Based on Multi-layer Brain Functional Connectivity Networks
Authors:
Xiaofang Sun,
Xiangwei Zheng,
Yonghui Xu,
Lizhen Cui,
Bin Hu
Abstract:
With the increase in major depressive disorder (MDD), many researchers have paid attention to its recognition and treatment. Existing MDD recognition algorithms always use a single time-frequency domain method, but such a method is too simple and is not conducive to simulating the complex link relationship between brain functions. To solve this problem, this paper proposes a recognition method based on multi-layer brain functional connectivity networks (MBFCN) for major depressive disorder and conducts cognitive analysis. Cognitive analysis based on the proposed MBFCN finds that the Alpha-Beta1 frequency band is the key sub-band for recognizing MDD. The connections between the right prefrontal lobe and the temporal lobe in extremely depressed disorders (EDD) are deficient in the brain functional connectivity networks (BFCN) based on the phase lag index (PLI). Furthermore, potential biomarkers can be found through the significance analysis of depression features and PHQ-9.
Submitted 1 November, 2021;
originally announced November 2021.
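The phase lag index (PLI) referred to in the abstract above is commonly defined as $|\langle \operatorname{sign}(\sin \Delta\phi(t)) \rangle|$, where $\Delta\phi$ is the instantaneous phase difference between two signals. The sketch below computes it for a pair of synthetic sinusoids; it is a generic illustration, not the MBFCN method, and it assumes phases are obtained from an FFT-based analytic signal. The function names `analytic_signal` and `phase_lag_index` are hypothetical.

```python
import numpy as np


def analytic_signal(x):
    """FFT-based analytic signal of a real 1-D array (discrete Hilbert transform)."""
    n = x.size
    spectrum = np.fft.fft(x)
    gain = np.zeros(n)
    gain[0] = 1.0
    if n % 2 == 0:
        gain[n // 2] = 1.0
        gain[1:n // 2] = 2.0
    else:
        gain[1:(n + 1) // 2] = 2.0
    # Zeroing negative frequencies and doubling positive ones yields x + i*H[x].
    return np.fft.ifft(spectrum * gain)


def phase_lag_index(x, y):
    """PLI = |mean(sign(sin(phase_x - phase_y)))|, a value between 0 and 1."""
    phase_x = np.angle(analytic_signal(x))
    phase_y = np.angle(analytic_signal(y))
    return float(np.abs(np.mean(np.sign(np.sin(phase_x - phase_y)))))


# Two 10 Hz sinusoids sampled at 1 kHz for one second (synthetic stand-in for EEG).
t = np.arange(1000) / 1000.0
leader = np.sin(2 * np.pi * 10 * t)
follower = np.sin(2 * np.pi * 10 * t - np.pi / 4)  # constant quarter-pi lag

pli_lagged = phase_lag_index(leader, follower)
pli_same = phase_lag_index(leader, leader)
print(pli_lagged, pli_same)
```

A consistent non-zero lag drives the index toward 1, whereas zero-lag coupling, such as that produced by a common source, contributes nothing to it. Computing the index for every channel pair gives the weighted adjacency matrix of a functional connectivity network.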
-
Deep Learning Model of Dock by Dock Process Significantly Accelerate the Process of Docking-based Virtual Screening
Authors:
Wei Ma,
Qin Xie,
Jianhang Zhang,
Shiliang Li,
Youjun Xu,
Xiaobing Deng,
Weilin Zhang
Abstract:
Docking-based virtual screening (VS process) selects ligands with potential pharmacological activities from millions of molecules using computational docking methods, which could greatly reduce the number of compounds for experimental screening, shorten the research period and save research costs. However, a majority of compounds with low docking scores could waste most of the computational resources. Herein, we report a novel and practical docking-based machine learning method called MLDDM (Machine Learning Docking-by-Docking Models). It is composed of a regression model and a classification model that simulate a classical docking-by-docking protocol usually applied in many virtual screening projects. MLDDM could quickly eliminate compounds with low docking scores, and the retained compounds with potentially high docking scores would be passed on to a real docking program for further examination. We demonstrated that MLDDM has a good ability to identify active compounds in case studies for 10 specific protein targets. Compared to a pure docking-by-docking based VS protocol, the VS process with MLDDM can achieve a speedup of over 120 times on average, and the consistency rate with the corresponding docking-by-docking VS protocol is above 0.8. Therefore, it is promising for examining ultra-large compound libraries in the current big data era.
Submitted 25 October, 2021; v1 submitted 21 October, 2021;
originally announced October 2021.
-
CRNNTL: convolutional recurrent neural network and transfer learning for QSAR modelling
Authors:
Yaqin Li,
Yongjin Xu,
Yi Yu
Abstract:
In this study, we propose the convolutional recurrent neural network and transfer learning (CRNNTL) for QSAR modelling. The method was inspired by the applications of polyphonic sound detection and electrocardiogram classification. Our strategy takes advantage of both convolutional and recurrent neural networks for feature extraction, as well as the data augmentation method. Herein, CRNNTL is evaluated on 20 benchmark datasets in comparison with baseline methods. In addition, one isomer-based dataset is used to elucidate its ability for both local and global feature extraction. Then, the knowledge transfer performance of CRNNTL is tested, especially for small biological activity datasets. Finally, different latent representations from other types of AEs were used to study the versatility of our model. The results show the effectiveness of CRNNTL using different latent representations. Moreover, efficient knowledge transfer is achieved to overcome data scarcity considering binding site similarity between different targets.
Submitted 7 September, 2021;
originally announced September 2021.
-
Supervised multi-specialist topic model with applications on large-scale electronic health record data
Authors:
Ziyang Song,
Xavier Sumba Toral,
Yixin Xu,
Aihua Liu,
Liming Guo,
Guido Powell,
Aman Verma,
David Buckeridge,
Ariane Marelli,
Yue Li
Abstract:
Motivation: Electronic health record (EHR) data provides a new avenue to elucidate disease comorbidities and latent phenotypes for precision medicine. To fully exploit its potential, a realistic data generative process of the EHR data needs to be modelled. We present MixEHR-S to jointly infer specialist-disease topics from the EHR data. As the key contribution, we model the specialist assignments and ICD-coded diagnoses as the latent topics based on the patient's underlying disease topic mixture in a novel unified supervised hierarchical Bayesian topic model. For efficient inference, we developed a closed-form collapsed variational inference algorithm to learn the model distributions of MixEHR-S. We applied MixEHR-S to two independent large-scale EHR databases in Quebec with three targeted applications: (1) Congenital Heart Disease (CHD) diagnostic prediction among 154,775 patients; (2) Chronic obstructive pulmonary disease (COPD) diagnostic prediction among 73,791 patients; (3) future insulin treatment prediction among 78,712 patients diagnosed with diabetes as a means to assess disease exacerbation. In all three applications, MixEHR-S conferred clinically meaningful latent topics among the most predictive latent topics and achieved superior target prediction accuracy compared to the existing methods, providing opportunities for prioritizing high-risk patients for healthcare services. MixEHR-S source code and scripts of the experiments are freely available at https://github.com/li-lab-mcgill/mixehrS
Submitted 3 May, 2021;
originally announced May 2021.
-
Integration of Unpaired Single-cell Chromatin Accessibility and Gene Expression Data via Adversarial Learning
Authors:
Yang Xu,
Andrew Jeremiah Strick
Abstract:
Deep learning has empowered analysis for single-cell sequencing data in many ways and has generated deep understanding about a range of complex cellular systems. As the booming single-cell sequencing technologies bring a surge of high-dimensional data that come from different sources and represent cellular systems with different features, there is an equivalent rise in the challenge of integrating single-cell sequencing data across modalities. Here, we present a novel adversarial approach to integrate single-cell chromatin accessibility and gene expression data in a semi-supervised manner. We demonstrate that our method substantially improves data integration over a simple adversarial domain adaptation approach, and it also outperforms two state-of-the-art (SOTA) methods.
Submitted 25 April, 2021;
originally announced April 2021.
-
Bayesian data assimilation for estimating epidemic evolution: a COVID-19 study
Authors:
Xian Yang,
Shuo Wang,
Yuting Xing,
Ling Li,
Richard Yi Da Xu,
Karl J. Friston,
Yike Guo
Abstract:
The evolution of epidemiological parameters, such as the instantaneous reproduction number Rt, is important for understanding the transmission dynamics of infectious diseases. Current estimates of time-varying epidemiological parameters often face problems such as lagging observations, averaging inference, and improper quantification of uncertainties. To address these problems, we propose a Bayesian data assimilation framework for time-varying parameter estimation. Specifically, this framework is applied to Rt estimation, resulting in the state-of-the-art DARt system. With DARt, time misalignment caused by lagging observations is tackled by incorporating observation delays into the joint inference of infections and Rt; the drawback of averaging is overcome by instantaneously updating upon new observations and developing a model selection mechanism that captures abrupt changes; the uncertainty is quantified and reduced by employing Bayesian smoothing. We validate the performance of DARt and demonstrate its power in revealing the transmission dynamics of COVID-19. The proposed approach provides a promising solution for accurate and timely estimation of transmission dynamics from reported data.
Submitted 24 October, 2021; v1 submitted 22 December, 2020;
originally announced January 2021.
-
EDGE COVID-19: A Web Platform to generate submission-ready genomes for SARS-CoV-2 sequencing efforts
Authors:
Chien-Chi Lo,
Migun Shakya,
Karen Davenport,
Mark Flynn,
Adán Myers y Gutiérrez,
Bin Hu,
Po-E Li,
Elais Player Jackson,
Yan Xu,
Patrick S. G. Chain
Abstract:
Genomics has become an essential technology for surveilling emerging infectious disease outbreaks. A wide range of technologies and strategies for pathogen genome enrichment and sequencing are being used by laboratories worldwide, together with different, and sometimes ad hoc, analytical procedures for generating genome sequences. As a result, public repositories now contain non-standard entries of varying quality. A standardized analytical process for consensus genome sequence determination, particularly for outbreaks such as the ongoing COVID-19 pandemic, is critical to provide a solid genomic basis for epidemiological analyses and well-informed decision making. To address this need, we have developed a bioinformatic workflow to standardize the analysis of SARS-CoV-2 sequencing data generated with either the Illumina or Oxford Nanopore platforms. Using an intuitive web-based interface, this workflow automates SARS-CoV-2 reference-based genome assembly, variant calling, and lineage determination, and provides the ability to submit the consensus sequence and necessary metadata to GenBank or GISAID. Given a raw Illumina or Oxford Nanopore FASTQ read file, this web-based platform enables non-bioinformatics experts to automatically produce a SARS-CoV-2 genome that is ready for submission to GISAID or GenBank.
Availability: https://edge-covid19.edgebioinformatics.org; https://github.com/LANL-Bioinformatics/EDGE/tree/SARS-CoV2
Submitted 24 June, 2021; v1 submitted 14 June, 2020;
originally announced June 2020.
-
A Public Website for the Automated Assessment and Validation of SARS-CoV-2 Diagnostic PCR Assays
Authors:
Po-E Li,
Adán Myers y Gutiérrez,
Karen Davenport,
Mark Flynn,
Bin Hu,
Chien-Chi Lo,
Elais Player Jackson,
Migun Shakya,
Yan Xu,
Jason Gans,
Patrick S. G. Chain
Abstract:
Summary: Polymerase chain reaction-based assays are the current gold standard for detecting and diagnosing SARS-CoV-2. However, as SARS-CoV-2 mutates, we need to constantly assess whether existing PCR-based assays will continue to detect all known viral strains. To enable the continuous monitoring of SARS-CoV-2 assays, we have developed a web-based assay validation algorithm that checks existing PCR-based assays against the ever-expanding genome databases for SARS-CoV-2 using both thermodynamic and edit-distance metrics. The assay screening results are displayed as a heatmap, showing the number of mismatches between each detection and each SARS-CoV-2 genome sequence. Using a mismatch threshold to define detection failure, assay performance is summarized with the true positive rate (recall) to simplify assay comparisons. Availability: https://covid19.edgebioinformatics.org/#/assayValidation. Contact: Jason Gans (jgans@lanl.gov) and Patrick Chain (pchain@lanl.gov)
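The summary statistic described in this abstract can be illustrated with a minimal sketch. It assumes each assay is represented only by its mismatch count against each genome; the function name `assay_recall`, the threshold value, and the sample counts are all hypothetical, and the thermodynamic metric mentioned in the abstract is not modelled here.

```python
from typing import Sequence


def assay_recall(mismatches: Sequence[int], max_mismatches: int) -> float:
    """Return the fraction of genomes an assay is still expected to detect.

    A genome counts as detected when its mismatch count does not exceed
    max_mismatches (a hypothetical threshold chosen for illustration).
    """
    if not mismatches:
        raise ValueError("at least one genome is required")
    detected = sum(1 for m in mismatches if m <= max_mismatches)
    return detected / len(mismatches)


# Hypothetical mismatch counts for one assay against five genomes.
counts = [0, 1, 0, 4, 2]
recall = assay_recall(counts, max_mismatches=2)
print(f"true positive rate: {recall:.2f}")
```

Collapsing one row of the heatmap into a single rate in this way is what allows assays to be compared directly, at the cost of hiding which genomes are missed.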
Submitted 8 June, 2020;
originally announced June 2020.
-
Noise induces continuous and noncontinuous transitions in neuronal interspike intervals range
Authors:
P R Protachevicz,
M S Santos,
E G Seifert,
E C Gabrick,
F S Borges,
R R Borges,
J Trobia,
J D Szezech Jr,
K C Iarosz,
I L Caldas,
C G Antonopoulos,
Y Xu,
R L Viana,
A M Batista
Abstract:
Noise appears in the brain due to various sources, such as ionic channel fluctuations and synaptic events. This noise affects the activities of the brain and influences neuron action potentials. Stochastic differential equations have been used to model firing patterns of neurons subject to noise. In this work, we consider perturbing noise in the adaptive exponential integrate-and-fire (AEIF) neuron. The AEIF is a two-dimensional model that describes different neuronal firing patterns by varying its parameters. Noise is added in the equation related to the membrane potential. We show that a noise current can induce continuous and noncontinuous transitions in neuronal interspike intervals. Moreover, we show that the noncontinuous transition occurs mainly for parameters close to the border between tonic spiking and burst activities of the neuron without noise.
Submitted 29 May, 2020;
originally announced May 2020.
-
Estimating the Number of Infected Cases in COVID-19 Pandemic
Authors:
Donghui Yan,
Ying Xu,
Pei Wang
Abstract:
The COVID-19 pandemic has caused major disturbance to human life. An important reason behind the widespread social anxiety is the huge uncertainty about the pandemic. A fundamental uncertainty is how many or what percentage of people have been infected. There are published and frequently updated data on various statistics of the pandemic, at local, country or global level. However, due to various reasons, many cases were not included in those reported numbers. We propose a structured approach for the estimation of the number of unreported cases, where we distinguish cases that arrive late in the reported numbers from those that had mild or no symptoms and thus were not captured by any medical system at all. We use post-report data for the estimation of the former and population matching for the latter. We estimate that the reported number of infected cases in the US should be corrected by multiplying by a factor of 220.54% as of Apr 20, 2020, while the infection ratio out of the US population is estimated to be 0.53%, implying a case mortality rate of 2.85%, which is close to the 3.4% suggested by the WHO in Mar 2020. Towards the end of the summer of 2020, the overall infection ratio of the US rises to 2.49% while the case mortality decreases to 2.09%, and the ratio of asymptomatic cases out of all infected cases reduces from the pre-summer 35-40% to around 20-25%.
Submitted 3 March, 2021; v1 submitted 24 May, 2020;
originally announced May 2020.
-
Implications of the virus-encoded miRNA and host miRNA in the pathogenicity of SARS-CoV-2
Authors:
Zhi Liu,
Jianwei Wang,
Yuyu Xu,
Mengchen Guo,
Kai Mi,
Rui Xu,
Yang Pei,
Qiangkun Zhang,
Xiaoting Luan,
Zhibin Hu,
Xingyin Liu
Abstract:
The outbreak of COVID-19 caused by SARS-CoV-2 has rapidly spread worldwide and has caused over 1,400,000 infections and 80,000 deaths. There are currently no drugs or vaccines with proven efficacy for its prevention, and little is known about the pathogenicity mechanism of SARS-CoV-2 infection. Previous studies showed that both virus- and host-derived microRNAs (miRNAs) play crucial roles in the pathology of virus infection. In this study, we use computational approaches to scan the SARS-CoV-2 genome for putative miRNAs and predict the virus miRNA targets on the virus and human genomes as well as the host miRNA targets on the virus genome. Furthermore, we explore miRNA-involved dysregulation caused by the virus infection. Our results indicate that the immune response and cytoskeleton organization are two of the most notable biological processes regulated by the infection-modulated miRNAs. Notably, we found that hsa-miR-4661-3p was predicted to target the S gene of SARS-CoV-2, and that a virus-encoded miRNA, MR147-3p, could enhance the expression of TMPRSS2, thereby strengthening SARS-CoV-2 infection in the gut. The study may provide important clues for the mechanisms of pathogenesis of SARS-CoV-2.
Submitted 9 April, 2020;
originally announced April 2020.
-
COVID-19 Evolves in Human Hosts
Authors:
Yanni Li,
Bing Liu,
Zhi Wang,
Jiangtao Cui,
Kaicheng Yao,
Pengfan Lv,
Yulong Shen,
Yueshen Xu,
Yuanfang Guan,
Xiaoke Ma
Abstract:
Today, we are all threatened by an unprecedented pandemic: COVID-19. How different is it from other coronaviruses? Will it be attenuated or become more virulent? Which animals may be its original host? In this study, we collected and analyzed nearly thirty thousand publicly available complete genome sequences for COVID-19 virus from 79 different countries, the previously known flu-causing coronaviruses (HCov-229E, HCov-OC43, HCov-NL63 and HCov-HKU1) and the lethal, pathogenic viruses, SARS, MERS, Victoria, Lassa, Yamagata, Ebola, and Dengue. We found strong similarities between the current circulating COVID-19 and SARS and MERS, as well as COVID-19 in rhinolophines and pangolins. In contrast, COVID-19 shares little similarity with the flu-causing coronaviruses and the other known viruses. Strikingly, we observed that the divergence of COVID-19 strains isolated from human hosts has steadily increased from December 2019 to May 2020, suggesting COVID-19 is actively evolving in human hosts. In this paper, we first propose a novel MLCS algorithm NP-MLCS1 for big sequence analysis, which can calculate the common model for COVID-19 complete genome sequences to provide important information for vaccine and antibody development. Geographic and time-course analysis of the evolution trees of human COVID-19 reveals possible evolutionary paths among strains from 79 countries. This finding has important implications for the management of COVID-19 and the development of vaccines and medications.
Submitted 15 August, 2020; v1 submitted 11 March, 2020;
originally announced March 2020.
-
3D Deep Learning Enables Fast Imaging of Spines through Scattering Media by Temporal Focusing Microscopy
Authors:
Zhun Wei,
Josiah R. Boivin,
Yi Xue,
Xudong Chen,
Peter T. C. So,
Elly Nedivi,
Dushan N. Wadduwage
Abstract:
Today the gold standard for in vivo imaging through scattering tissue is the point-scanning two-photon microscope (PSTPM). Especially in neuroscience, PSTPM is widely used for deep-tissue imaging in the brain. However, due to sequential scanning, PSTPM is slow. Temporal focusing microscopy (TFM), on the other hand, focuses femtosecond pulsed laser light temporally, while keeping wide-field illumination, and is consequently much faster. However, due to the use of a camera detector, TFM suffers from the scattering of emission photons. As a result, TFM produces images of poor spatial resolution and signal-to-noise ratio (SNR), burying fluorescent signals from small structures such as dendritic spines. In this work, we present a data-driven deep learning approach to improve resolution and SNR of TFM images. Using a 3D convolutional neural network (CNN) we build a map from TFM to PSTPM modalities, to enable fast TFM imaging while maintaining high-resolution through scattering media. We demonstrate this approach for in vivo imaging of dendritic spines on pyramidal neurons in the mouse visual cortex. We show that our trained network rapidly outputs high-resolution images that recover biologically relevant features previously buried in the scattered fluorescence in the TFM images. In vivo imaging that combines TFM and the proposed 3D convolution neural network is one to two orders of magnitude faster than PSTPM but retains the high resolution and SNR necessary to analyze small fluorescent structures. The proposed 3D convolution deep network could also be potentially beneficial for improving the performance of many speed-demanding deep-tissue imaging applications such as in vivo voltage imaging.
Submitted 24 December, 2019;
originally announced January 2020.
-
Automatic Retrosynthetic Pathway Planning Using Template-free Models
Authors:
Kangjie Lin,
Youjun Xu,
Jianfeng Pei,
Luhua Lai
Abstract:
We present an attention-based Transformer model for automatic retrosynthesis route planning. Our approach starts from reactant prediction of single-step organic reactions for given products, followed by Monte Carlo tree search-based automatic retrosynthetic pathway prediction. Trained on two datasets from the United States patent literature, our models achieved a top-1 prediction accuracy of over 54.6% and 63.0% with more than 95% and 99.6% validity rate of SMILES, respectively, which, to our knowledge, is the best reported to date. We also demonstrate the application potential of our model by successfully performing multi-step retrosynthetic route planning for four case products, i.e., the antiseizure drug Rufinamide, a novel allosteric activator, an inhibitor of human acute-myeloid-leukemia cells and a complex intermediate of a drug candidate. Further, by using heuristic Monte Carlo tree search, we achieved automatic retrosynthetic pathway searching and successfully reproduced published synthesis pathways. In summary, our model has achieved state-of-the-art performance on single-step retrosynthetic prediction and provides a novel strategy for automatic retrosynthetic pathway planning.
Submitted 21 May, 2019;
originally announced June 2019.
-
Dual Graph-Laplacian PCA: A Closed-Form Solution for Bi-clustering to Find "Checkerboard" Structures on Gene Expression Data
Authors:
Jin-Xing Liu,
Chun-Mei Feng,
Xiang-Zhen Kong,
Yong Xu
Abstract:
In the context of cancer, internal "checkerboard" structures are normally found in the matrices of gene expression data, which correspond to genes that are significantly up- or down-regulated in patients with specific types of tumors. In this paper, we propose a novel method, called dual graph-regularization principal component analysis (DGPCA). The main innovation of this method is that it simultaneously considers the internal geometric structures of the condition manifold and the gene manifold. Specifically, we obtain principal components (PCs) to represent the data and approximate the cluster membership indicators through Laplacian embedding. This new method is endowed with internal geometric structures, such as the condition manifold and gene manifold, which are both suitable for bi-clustering. A closed-form solution is provided for DGPCA. We apply this new method to simultaneously cluster genes and conditions (e.g., different samples) with the aim of finding internal "checkerboard" structures on gene expression data, if they exist. Then, we use this new method to identify regulatory genes under the particular conditions and to compare the results with those of other state-of-the-art PCA-based methods. Promising results on gene expression data have been verified by extensive experiments.
Submitted 21 January, 2019;
originally announced January 2019.
-
Directed Non-Targeted Mass Spectrometry and Chemical Networking for Discovery of Eicosanoids
Authors:
Jeramie D. Watrous,
Teemu Niiranen,
Kim A. Lagerborg,
Mir Henglin,
Yong-Jian Xu,
Sonia Sharma,
Ramachandran S. Vasan,
Martin G. Larson,
Aaron Armando,
Oswald Quehenberger,
Edward A. Dennis,
Susan Cheng,
Mohit Jain
Abstract:
Eicosanoids and related species are critical, small bioactive mediators of human physiology and inflammation. While ~1100 distinct eicosanoids have been predicted to exist, to date, fewer than 150 of these molecules have been measured in humans, limiting our understanding of eicosanoids and their role in human biology. Using a directed non-targeted mass spectrometry approach in conjunction with computational chemical networking of spectral fragmentation patterns, we find over 500 discrete chemical signals highly consistent with known and putative eicosanoids in human plasma, including 46 putative novel molecules not previously described, thereby greatly expanding the breadth of prior analytical strategies. In plasma samples from 1500 individuals, we find members of this expanded eicosanoid library hold close association with markers of inflammation, as well as clinical characteristics linked with inflammation, including advancing age and obesity. These experimental and computational approaches enable discovery of new chemical entities and will shed important insight into the role of bioactive molecules in human disease.
Submitted 4 June, 2018;
originally announced June 2018.
-
Deep Reinforcement Learning of Cell Movement in the Early Stage of C. elegans Embryogenesis
Authors:
Zi Wang,
Dali Wang,
Chengcheng Li,
Yichi Xu,
Husheng Li,
Zhirong Bao
Abstract:
Cell movement in the early phase of C. elegans development is regulated by a highly complex process in which a set of rules and connections are formulated at distinct scales. Previous efforts have shown that agent-based, multi-scale modeling systems can integrate physical and biological rules and provide new avenues to study developmental systems. However, the application of these systems to model cell movement is still challenging and requires a comprehensive understanding of regulation networks at the right scales. Recent developments in deep learning and reinforcement learning provide an unprecedented opportunity to explore cell movement using 3D time-lapse images. We present a deep reinforcement learning approach within an agent-based modeling (ABM) system to characterize cell movement in C. elegans embryogenesis. Our modeling system captures the complexity of cell movement patterns in the embryo and overcomes the local optimization problem encountered by traditional rule-based ABM that uses greedy algorithms. We tested our model with two real developmental processes: the anterior movement of the Cpaaa cell via intercalation and the rearrangement of the left-right asymmetry. In the first case, model results showed that Cpaaa's intercalation is an active directional cell movement caused by the continuous effects from a longer distance, as opposed to a passive movement caused by neighbor cell movements. This is because the learning-based simulation found that a passive movement model could not lead Cpaaa to the predefined destination. In the second case, a leader-follower mechanism explained the collective cell movement pattern well. These results showed that our approach of introducing deep reinforcement learning into ABM can test regulatory mechanisms by exploring cell migration paths from a reverse-engineering perspective. This model opens new doors to explore large datasets generated by live imaging.
Submitted 2 March, 2018; v1 submitted 14 January, 2018;
originally announced January 2018.
-
Sleep Stage Classification Based on Multi-level Feature Learning and Recurrent Neural Networks via Wearable Device
Authors:
Xin Zhang,
Weixuan Kou,
Eric I-Chao Chang,
He Gao,
Yubo Fan,
Yan Xu
Abstract:
This paper proposes a practical approach for automatic sleep stage classification based on a multi-level feature learning framework and Recurrent Neural Network (RNN) classifier using heart rate and wrist actigraphy derived from a wearable device. The feature learning framework is designed to extract low- and mid-level features. Low-level features capture temporal and frequency domain properties and mid-level features learn compositions and structural information of signals. Since sleep staging is a sequential problem with long-term dependencies, we take advantage of RNNs with Bidirectional Long Short-Term Memory (BLSTM) architectures for sequence data learning. To simulate the actual situation of daily sleep, experiments are conducted with a resting group in which sleep is recorded in resting state, and a comprehensive group in which both resting sleep and non-resting sleep are included. We evaluate the algorithm based on an eight-fold cross validation to classify five sleep stages (W, N1, N2, N3, and REM). The proposed algorithm achieves weighted precision, recall and F1 score of 58.0%, 60.3%, and 58.2% in the resting group and 58.5%, 61.1%, and 58.5% in the comprehensive group, respectively. Various comparison experiments demonstrate the effectiveness of feature learning and BLSTM. We further explore the influence of depth and width of RNNs on performance. Our method is specifically designed for wearable devices and is expected to be applicable for long-term sleep monitoring at home. Without using too much prior domain knowledge, our method has the potential to generalize to sleep disorder detection.
Submitted 2 November, 2017;
originally announced November 2017.
-
Mining Functional Modules by Multiview-NMF of Phenome-Genome Association
Authors:
YaoGong Zhang,
YingJie Xu,
Xin Fan,
YuXiang Hong,
Jiahui Liu,
ZhiCheng He,
YaLou Huang,
MaoQiang Xie
Abstract:
Background: Mining gene modules from genomic data is an important step to detect gene members of pathways or other relations such as protein-protein interactions. In this work, we explore the plausibility of detecting gene modules by factorizing gene-phenotype associations from a phenotype ontology rather than the conventionally used gene expression data. In particular, the hierarchical structure of the ontology has not been sufficiently utilized in clustering genes, even though functionally related genes are consistently associated with phenotypes on the same path in the phenotype ontology. Results: We propose a hierarchical Nonnegative Matrix Factorization (NMF)-based method, called Consistent Multiple Nonnegative Matrix Factorization (CMNMF), to factorize the genome-phenome association matrix at two levels of the hierarchical structure in the phenotype ontology for mining gene functional modules. CMNMF constrains the gene clusters from the association matrices at two consecutive levels to be consistent, since the genes are annotated with both the child phenotype and the parent phenotype in the consecutive levels. CMNMF also restricts the identified phenotype clusters to be densely connected in the phenotype ontology hierarchy. In the experiments on mining functionally related genes from the mouse phenotype ontology and the human phenotype ontology, CMNMF effectively improved clustering performance over the baseline methods. Gene ontology enrichment analysis was also conducted to reveal interesting gene modules. Conclusions: Utilizing the information in the hierarchical structure of the phenotype ontology, CMNMF can identify functional gene modules with more biological significance than the conventional methods. CMNMF could also be a better tool for predicting members of gene pathways and protein-protein interactions. Availability: https://github.com/nkiip/CMNMF
Submitted 10 May, 2017;
originally announced May 2017.
-
DeepMetabolism: A Deep Learning System to Predict Phenotype from Genome Sequencing
Authors:
Weihua Guo,
You Xu,
Xueyang Feng
Abstract:
Life science is entering a new era of petabyte-level sequencing data. Converting such big data to biological insights represents a huge challenge for computational analysis. To this end, we developed DeepMetabolism, a biology-guided deep learning system to predict cell phenotypes from transcriptomics data. By integrating unsupervised pre-training with supervised training, DeepMetabolism is able to predict phenotypes with high accuracy (PCC>0.92), high speed (<30 min for >100 GB data using a single GPU), and high robustness (tolerate up to 75% noise). We envision DeepMetabolism to bridge the gap between genotype and phenotype and to serve as a springboard for applications in synthetic biology and precision medicine.
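The accuracy figure above is given as a PCC threshold. Assuming PCC here denotes the Pearson correlation coefficient between predicted and measured phenotype values (the abstract does not expand the abbreviation), a minimal sketch of that computation might look as follows; the function name `pearson_cc` and the sample values are hypothetical.

```python
import math


def pearson_cc(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")

    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    # Sum of products of deviations from the means (unnormalised covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)

    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical predicted vs. measured phenotype values (not from the paper).
predicted = [0.10, 0.35, 0.50, 0.80]
measured = [0.12, 0.30, 0.55, 0.78]
print(f"PCC = {pearson_cc(predicted, measured):.3f}")
```

A value close to 1 indicates that predictions rise and fall with the measurements, although it does not by itself show that the two agree in absolute scale.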
Submitted 8 May, 2017;
originally announced May 2017.