Search | arXiv e-print repository

Human-level molecular optimization driven by mol-gene evolution

Authors: Jiebin Fang, Churu Mao, Yuchen Zhu, Xiaoming Chen, Chang-Yu Hsieh, Zhongjun Ma

Abstract: De novo molecule generation allows the search for more drug-like hits across a vast chemical space. However, lead optimization is still required, and the process of optimizing molecular structures faces the challenge of balancing structural novelty with pharmacological properties. This study introduces the Deep Genetic Molecular Modification Algorithm (DGMM), which brings structure modification to the level of medicinal chemists. A discrete variational autoencoder (D-VAE) is used in DGMM to encode molecules as a quantization code, termed mol-gene, which incorporates deep learning into genetic algorithms for flexible structural optimization. The mol-gene allows for the discovery of pharmacologically similar but structurally distinct compounds, and reveals the trade-offs of structural optimization in drug discovery. We demonstrate the effectiveness of the DGMM in several applications.

Submitted 12 June, 2024; originally announced June 2024.

arXiv:2308.11666 [pdf, other]

Generalized dimension reduction approach for heterogeneous networked systems with time-delay

Authors: Cheng Ma, Gyorgy Korniss, Boleslaw K. Szymanski, Jianxi Gao

Abstract: Networks of interconnected agents are essential to study complex networked systems' state evolution, stability, resilience, and control. Nevertheless, the high dimensionality and nonlinear dynamics are vital factors preventing us from theoretically analyzing them. Recently, dimension-reduction approaches reduced the system's size by mapping the original system to a one-dimensional system such that only one effective representative can capture its macroscopic dynamics. However, the approaches dramatically fail as the network becomes heterogeneous and has multiple community structures. Here, we bridge the gap by developing a generalized dimension reduction approach, which enables us to map the original system to an $m$-dimensional system that consists of $m$ interacting components. Notably, validation on various dynamical models shows that this approach accurately predicts the original system state and the tipping point, if any. Furthermore, the numerical results demonstrate that this approach approximates the system evolution and identifies the critical points for complex networks with time delay.

Submitted 22 August, 2023; originally announced August 2023.

Comments: 26 pages, 11 figures

arXiv:2306.13089 [pdf, other]

GIMLET: A Unified Graph-Text Model for Instruction-Based Molecule Zero-Shot Learning

Authors: Haiteng Zhao, Shengchao Liu, Chang Ma, Hannan Xu, Jie Fu, Zhi-Hong Deng, Lingpeng Kong, Qi Liu

Abstract: Molecule property prediction has gained significant attention in recent years. The main bottleneck is the label insufficiency caused by expensive lab experiments. In order to alleviate this issue and to better leverage textual knowledge for tasks, this study investigates the feasibility of employing natural language instructions to accomplish molecule-related tasks in a zero-shot setting. We discover that existing molecule-text models perform poorly in this setting due to inadequate treatment of instructions and limited capacity for graphs. To overcome these issues, we propose GIMLET, which unifies language models for both graph and text data. By adopting generalized position embedding, our model is extended to encode both graph structures and instruction text without additional graph encoding modules. GIMLET also decouples encoding of the graph from task instructions in the attention mechanism, enhancing the generalization of graph features across novel tasks. We construct a dataset consisting of more than two thousand molecule tasks with corresponding instructions derived from task descriptions. We pretrain GIMLET on the molecule tasks along with instructions, enabling the model to transfer effectively to a broad range of tasks. Experimental results demonstrate that GIMLET significantly outperforms molecule-text baselines in instruction-based zero-shot learning, even achieving results close to supervised GNN models on tasks such as toxcast and muv.

Submitted 22 October, 2023; v1 submitted 28 May, 2023; originally announced June 2023.

arXiv:2303.05564 [pdf, other]

Novel Tetrahedral Human Phantoms for Space Radiation Dose Assessment

Authors: Chesal MA, Blue RS, Aunon-Chancellor SA, Chancellor JC

Abstract: Space radiation remains one of the primary hazards to spaceflight crews. The unique nature of the intravehicular radiation spectrum makes prediction of biological outcomes difficult, with computational simulation-based efforts stymied by lack of computational resources or accurate modeling capabilities. Recent advancements in both Monte Carlo simulations and computational human phantom developments have allowed for complex radiation simulations and dosimetric calculations to be performed for numerous applications. In this work, advanced tetrahedral-type human phantoms were exposed to a simulated spectrum of particles equivalent to a single day's exposure in the International Space Station in Low Earth Orbit. 3D Monte Carlo techniques were used to produce and simulate the radiation spectra. Organ absorbed dose, average energy deposition, and the whole-body integral dose were determined for a male and female phantom. Results were then extrapolated for two long-term scenarios: a 6-9 month mission on the International Space Station and a 3-year mission to Mars. The whole-body integral doses for the male and female models were found to be 0.2985 ± 0.0002 mGy/day and 0.3050 ± 0.0002 mGy/day, respectively, which is within 10% of recorded dose values from the International Space Station. This work presents a novel approach to assess absorbed dose from space-like radiation fields using high-fidelity computational phantoms, highlighting the utility of complex models for space radiation research.

Submitted 23 August, 2023; v1 submitted 9 March, 2023; originally announced March 2023.

Comments: 10 pages, 3 figures, 2 tables

arXiv:2302.12563 [pdf, other]

Retrieved Sequence Augmentation for Protein Representation Learning

Authors: Chang Ma, Haiteng Zhao, Lin Zheng, Jiayi Xin, Qintong Li, Lijun Wu, Zhihong Deng, Yang Lu, Qi Liu, Lingpeng Kong

Abstract: Protein language models have excelled in a variety of tasks, ranging from structure prediction to protein engineering. However, proteins are highly diverse in functions and structures, and current state-of-the-art models including the latest version of AlphaFold rely on Multiple Sequence Alignments (MSA) to feed in the evolutionary knowledge. Despite their success, heavy computational overheads, as well as the de novo and orphan proteins remain great challenges in protein representation learning. In this work, we show that MSA-augmented models inherently belong to retrieval-augmented methods. Motivated by this finding, we introduce Retrieved Sequence Augmentation (RSA) for protein representation learning without additional alignment or pre-processing. RSA links query protein sequences to a set of sequences with similar structures or properties in the database and combines these sequences for downstream prediction. We show that protein language models benefit from the retrieval enhancement on both structure prediction and property prediction tasks, with a 5% improvement on MSA Transformer on average while being 373 times faster. In addition, we show that our model can transfer to new protein domains better and outperforms MSA Transformer on de novo protein prediction. Our study fills a much-encountered gap in protein prediction and brings us a step closer to demystifying the domain knowledge needed to understand protein sequences. Code is available on https://github.com/HKUNLP/RSA.

Submitted 24 February, 2023; originally announced February 2023.

arXiv:2212.01384 [pdf, other]

KGML-xDTD: A Knowledge Graph-based Machine Learning Framework for Drug Treatment Prediction and Mechanism Description

Authors: Chunyu Ma, Zhihan Zhou, Han Liu, David Koslicki

Abstract: Background: Computational drug repurposing is a cost- and time-efficient approach that aims to identify new therapeutic targets or diseases (indications) of existing drugs/compounds. It is especially critical for emerging and/or orphan diseases due to its cheaper investment and shorter research cycle compared with traditional wet-lab drug discovery approaches. However, the underlying mechanisms of action (MOAs) between repurposed drugs and their target diseases remain largely unknown, which is still a main obstacle for computational drug repurposing methods to be widely adopted in clinical settings. Results: In this work, we propose KGML-xDTD: a Knowledge Graph-based Machine Learning framework for explainably predicting Drugs Treating Diseases. It is a two-module framework that not only predicts the treatment probabilities between drugs/compounds and diseases but also biologically explains them via knowledge graph (KG) path-based, testable mechanisms of action (MOAs). We leverage knowledge-and-publication based information to extract biologically meaningful "demonstration paths" as the intermediate guidance in the Graph-based Reinforcement Learning (GRL) path-finding process. Comprehensive experiments and case study analyses show that the proposed framework can achieve state-of-the-art performance in both predictions of drug repurposing and recapitulation of human-curated drug MOA paths.
Conclusions: KGML-xDTD is the first model framework that can offer KG-path explanations for drug repurposing predictions by leveraging the combination of prediction outcomes and existing biological knowledge and publications. We believe it can effectively reduce "black-box" concerns and increase prediction confidence for drug repurposing based on predicted path-based explanations, and further accelerate the process of drug discovery for emerging diseases.

Submitted 25 April, 2023; v1 submitted 30 November, 2022; originally announced December 2022.

arXiv:2012.08580 [pdf, other]

PANTHER: Pathway Augmented Nonnegative Tensor factorization for HighER-order feature learning

Authors: Yuan Luo, Chengsheng Mao

Abstract: Genetic pathways usually encode molecular mechanisms that can inform targeted interventions. It is often challenging for existing machine learning approaches to jointly model genetic pathways (higher-order features) and variants (atomic features), and present to clinicians interpretable models. In order to build more accurate and better interpretable machine learning models for genetic medicine, we introduce Pathway Augmented Nonnegative Tensor factorization for HighER-order feature learning (PANTHER). PANTHER selects informative genetic pathways that directly encode molecular mechanisms. We apply genetically motivated constrained tensor factorization to group pathways in a way that reflects molecular mechanism interactions. We then train a softmax classifier for disease types using the identified pathway groups. We evaluated PANTHER against multiple state-of-the-art constrained tensor/matrix factorization models, as well as group guided and Bayesian hierarchical models. PANTHER outperforms all state-of-the-art comparison models significantly (p<0.05). Our experiments on large scale Next Generation Sequencing (NGS) and whole-genome genotyping datasets also demonstrated wide applicability of PANTHER. We performed feature analysis in predicting disease types, which suggested insights and benefits of the identified pathway groups.

Submitted 15 December, 2020; originally announced December 2020.

Comments: Accepted by 35th AAAI Conference on Artificial Intelligence (AAAI 2021)

arXiv:2011.11808 [pdf, other]

doi 10.1038/s42005-021-00758-2

Universality of noise-induced resilience restoration in spatially-extended ecological systems

Authors: Cheng Ma, Gyorgy Korniss, Boleslaw K. Szymanski, Jianxi Gao

Abstract: Many systems may switch to an undesired state due to internal failures or external perturbations, of which critical transitions toward degraded ecosystem states are a prominent example. Resilience restoration focuses on the ability of spatially-extended systems and the required time to recover to their desired states under stochastic environmental conditions. While mean-field approaches may guide recovery strategies by indicating the conditions needed to destabilize undesired states, these approaches do not accurately capture the transition process toward the desired state of spatially-extended systems in stochastic environments. The difficulty is rooted in the lack of mathematical tools to analyze systems with high dimensionality, nonlinearity, and stochastic effects. We bridge this gap by developing new mathematical tools that employ nucleation theory in spatially-embedded systems to advance resilience restoration. We examine our approach on systems following mutualistic dynamics and diffusion models, finding that systems may exhibit single-cluster or multi-cluster phases depending on their sizes and noise strengths, and also construct a new scaling law governing the restoration time for arbitrary system size and noise strength in two-dimensional systems. This approach is not limited to ecosystems and has applications in various dynamical systems, from biology to infrastructural systems.

Submitted 9 September, 2021; v1 submitted 23 November, 2020; originally announced November 2020.

Comments: 31 pages, 7 figures

Journal ref: Communications Physics, vol. 4:262, Dec. 10, 2021

arXiv:2005.09769 [pdf, ps, other]

Controlling the Hidden Growth of COVID-19

Authors: Xiubin Bruce Wang, Chaolun Ma

Abstract: The COVID-19 pandemic has plagued the world for months. The U.S. has taken measures to counter it. On a daily basis, newly confirmed cases have been reported. In the early days, these numbers showed an increasing trend. Recently, the numbers have been generally flattened out. This report tries to estimate the hidden number of currently alive infections in the population by using the confirmed cases. A major result indicates an existing infections estimate at about 10-50 times the daily confirmed new cases, with the stringent social distancing policy tipping to the upper end of this range. It clarifies the relationship between the infection rate and the test rate to put the epidemic under control, which says that the test rate shall keep up at the same pace as infection rate to prevent an outbreak. This relationship is meaningful in the wake of business re-opening in the U.S. and the world. The report also reveals the connections of all the measures taken to the epidemic spread. A stratified sampling method is proposed to add to the current tool kits of epidemic control. Again, this report is a summary of some straight observations and thoughts, not through a thorough study backed with field data. The results appear obvious and suitable for general education to interested policymakers and the public.

Submitted 19 May, 2020; originally announced May 2020.

Comments: 13 pages, 1 figure

arXiv:1908.01472 [pdf]

On the theoretical prediction of microalgae growth for parallel flow

Authors: C. Y. Ma

Abstract: Established microalgae growth models are semi-empirical or currently contain a considerable number of fitting coefficients. Therefore, the predictive ability of these models is reduced by the numerous fitting coefficients. Furthermore, the predicted results of the established models are dependent on the size of the photobioreactor (PBR), light intensity, flow and concentration field. The growth mechanism of microalgae is not clearly understood in PBR cultivation. It is difficult to predict the microalgae growth by theoretical methods, owing to the aforementioned factors. We developed an exploratory bridging microalgae growth model to predict the microalgae growth rate in PBRs by using the nondimensional method, which is effective in fluid dynamics and heat transfer. The analytical solution of the growth rate was obtained for the parallel flow. The nondimensional growth rate is expressed as a function of the Reynolds number and Schmidt number, and can be used for arbitrary parallel flow because the solution is expressed in nondimensional quantities. The theoretically predicted growth rate is compared with the experimentally measured microalgae growth rate on the order of magnitude. The nondimensional method is successfully applied to the microalgae growth problem for the first time. The general nondimensional solution can unify the numerous experimental data for different laboratory conditions, and give a direction for the disorder of the microalgae growth problem.
The nondimensional solution may be useful to explain the growth mechanism of microalgae and design large-scale PBRs for microalgae biofuel production. The significance of the work is to give a theoretical foundation and methodology of biological theory of microalgae growth.

Submitted 10 May, 2022; v1 submitted 5 August, 2019; originally announced August 2019.

Comments: 24 pages, 4 figures, 1 table

arXiv:1903.00197 [pdf]

Outcome-Driven Clustering of Acute Coronary Syndrome Patients using Multi-Task Neural Network with Attention

Authors: Eryu Xia, Xin Du, Jing Mei, Wen Sun, Suijun Tong, Zhiqing Kang, Jian Sheng, Jian Li, Changsheng Ma, Jianzeng Dong, Shaochun Li

Abstract: Cluster analysis aims at separating patients into phenotypically heterogeneous groups and defining therapeutically homogeneous patient subclasses. It is an important approach in data-driven disease classification and subtyping. Acute coronary syndrome (ACS) is a syndrome due to sudden decrease of coronary artery blood flow, where disease classification would help to inform therapeutic strategies and provide prognostic insights. Here we conducted outcome-driven cluster analysis of ACS patients, which jointly considers treatment and patient outcome as indicators for patient state. A multi-task neural network with attention was used as a modeling framework, including learning of the patient state, cluster analysis, and feature importance profiling. Seven patient clusters were discovered. The clusters have different characteristics, as well as different risk profiles to the outcome of in-hospital major adverse cardiac events. The results demonstrate cluster analysis using an outcome-driven multi-task neural network as promising for patient classification and subtyping.

Submitted 27 March, 2019; v1 submitted 1 March, 2019; originally announced March 2019.

arXiv:1811.02757 [pdf, other]

Early Prediction of Acute Kidney Injury in Critical Care Setting Using Clinical Notes

Authors: Yikuan Li, Liang Yao, Chengsheng Mao, Anand Srivastava, Xiaoqian Jiang, Yuan Luo

Abstract: Acute kidney injury (AKI) in critically ill patients is associated with significant morbidity and mortality. Development of novel methods to identify patients with AKI earlier will allow for testing of novel strategies to prevent or reduce the complications of AKI. We developed data-driven prediction models to estimate the risk of new AKI onset. We generated models from clinical notes within the first 24 hours following intensive care unit (ICU) admission extracted from Medical Information Mart for Intensive Care III (MIMIC-III). From the clinical notes, we generated clinically meaningful word and concept representations and embeddings, respectively. Five supervised learning classifiers and a knowledge-guided deep learning architecture were used to construct prediction models. The best configuration yielded a competitive AUC of 0.779. Our work suggests that natural language processing of clinical notes can be applied to assist clinicians in identifying the risk of incident AKI onset in critically ill patients upon admission to the ICU.

Submitted 9 November, 2018; v1 submitted 6 November, 2018; originally announced November 2018.

Comments: 4 pages, 3 figures, accepted by BIBM 2018

arXiv:1809.10681 [pdf]

Cancer classification and pathway discovery using non-negative matrix factorization

Authors: Zexian Zeng, Andy Vo, Chengsheng Mao, Susan E Clare, Seema A Khan, Yuan Luo

Abstract: Extracting genetic information from a full range of sequencing data is important for understanding diseases. We propose a novel method to effectively explore the landscape of genetic mutations and aggregate them to predict cancer type. We used multinomial logistic regression, nonsmooth non-negative matrix factorization (nsNMF), and support vector machine (SVM) to utilize the full range of sequencing data, aiming at better aggregating genetic mutations and improving their power in predicting cancer types. Specifically, we introduced a classifier to distinguish cancer types using somatic mutations obtained from whole-exome sequencing data. Mutations were identified from multiple cancers and scored using SIFT, PP2, and CADD, and grouped at the individual gene level. The nsNMF was then applied to reduce dimensionality and to obtain coefficient and basis matrices. A feature matrix was derived from the obtained matrices to train a classifier for cancer type classification with the SVM model. We have demonstrated that the classifier was able to distinguish the cancer types with reasonable accuracy. In five-fold cross-validations using mutation counts as features, the average prediction accuracy was 77.1% (SEM=0.1%), significantly outperforming baselines and outperforming models using mutation scores as features. Using the factor matrices derived from the nsNMF, we identified multiple genes and pathways that are significantly associated with each cancer type.
This study presents a generic and complete pipeline to study the associations between somatic mutations and cancers. The discovered genes and pathways associated with each cancer type can lead to biological insights. The proposed method can be adapted to other studies for disease classification and pathway discovery.

Submitted 8 October, 2018; v1 submitted 27 September, 2018; originally announced September 2018.

Comments: 8 pages, 5 figures, conference

arXiv:1806.01217 [pdf]

Efficient Genomic Interval Queries Using Augmented Range Trees

Authors: Chengsheng Mao, Alal Eran, Yuan Luo

Abstract: Efficient large-scale annotation of genomic intervals is essential for personal genome interpretation in the realm of precision medicine. There are 13 possible relations between two intervals according to Allen's interval algebra. Conventional interval trees are routinely used to identify the genomic intervals satisfying a coarse relation with a query interval, but cannot support efficient query for more refined relations such as all Allen's relations. We design and implement a novel approach to address this unmet need. Through rewriting Allen's interval relations, we transform an interval query to a range query, then adapt and utilize the range trees for querying. We implement two types of range trees: a basic 2-dimensional range tree (2D-RT) and an augmented range tree with fractional cascading (RTFC) and compare them with the conventional interval tree (IT). Theoretical analysis shows that RTFC can achieve the best time complexity for interval queries regarding all Allen's relations among the three trees. We also perform comparative experiments on the efficiency of RTFC, 2D-RT and IT in querying noncoding element annotations in a large collection of personal genomes. Our experimental results show that 2D-RT is more efficient than IT for interval queries regarding most of Allen's relations, and that RTFC is even more efficient than 2D-RT.
The results demonstrate that RTFC is an efficient data structure for querying large-scale datasets regarding Allen's relations between genomic intervals, such as those required by interpreting genome-wide variation in large populations.
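As an illustration of the 13 relations of Allen's interval algebra mentioned in this abstract, the following is a minimal Python sketch. It is not the authors' implementation: the function names `allen_relation` and `during_query`, the tuple representation of intervals, and the assumption that every interval satisfies start < end are all choices made here for illustration. The second function only hints at the idea of treating an interval as a point (start, end) so that a relation such as "during" becomes a rectangular range condition; it uses a linear scan rather than a range tree.

```python
def allen_relation(a, b):
    """Name the Allen relation of interval a to interval b.

    Intervals are (start, end) tuples with start < end (assumed, not checked).
    """
    (s1, e1), (s2, e2) = a, b
    if e1 < s2:
        return "before"
    if e2 < s1:
        return "after"
    if e1 == s2:
        return "meets"
    if e2 == s1:
        return "met-by"
    # From here on the two intervals share interior points.
    if s1 == s2 and e1 == e2:
        return "equals"
    if s1 == s2:
        return "starts" if e1 < e2 else "started-by"
    if e1 == e2:
        return "finishes" if s1 > s2 else "finished-by"
    if s2 < s1 and e1 < e2:
        return "during"
    if s1 < s2 and e2 < e1:
        return "contains"
    return "overlaps" if s1 < s2 else "overlapped-by"


def during_query(intervals, query):
    """Return intervals strictly inside `query`.

    Viewing each interval as a 2-D point (start, end), "during" is the
    range condition start > qs and end < qe. A range tree would answer
    this without scanning every interval; this sketch simply filters.
    """
    qs, qe = query
    return [(s, e) for (s, e) in intervals if s > qs and e < qe]
```

For example, `allen_relation((3, 5), (1, 8))` yields `"during"`, and `during_query([(3, 5), (0, 2), (2, 9)], (1, 8))` keeps only `(3, 5)`.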

Submitted 4 June, 2018; originally announced June 2018.

Comments: 4 figures, 4 tables

arXiv:1805.05008 [pdf]

Integrating Hypertension Phenotype and Genotype with Hybrid Non-negative Matrix Factorization

Authors: Yuan Luo, Chengsheng Mao, Yiben Yang, Fei Wang, Faraz S. Ahmad, Donna Arnett, Marguerite R. Irvin, Sanjiv J. Shah

Abstract: Hypertension is a heterogeneous syndrome in need of improved subtyping using phenotypic and genetic measurements so that patients in different subtypes share similar pathophysiologic mechanisms and respond more uniformly to targeted treatments. Existing machine learning approaches often face challenges in integrating phenotype and genotype information and presenting to clinicians an interpretable model. We aim to provide informed patient stratification by introducing Hybrid Non-negative Matrix Factorization (HNMF) on phenotype and genotype matrices. HNMF simultaneously approximates the phenotypic and genetic matrices using different appropriate loss functions, and generates patient subtypes, phenotypic groups and genetic groups. Unlike previous methods, HNMF approximates the phenotypic matrix under Frobenius loss, and the genetic matrix under Kullback-Leibler (KL) loss. We propose an alternating projected gradient method to solve the approximation problem. Simulation shows HNMF converges fast and accurately to the true factor matrices. On a real-world clinical dataset, we used the patient factor matrix as features to predict main cardiac mechanistic outcomes. We compared HNMF with six different models using phenotype or genotype features alone, with or without NMF, or using joint NMF with only one type of loss. HNMF significantly outperforms all comparison models. HNMF also reveals intuitive phenotype-genotype interactions that characterize cardiac abnormalities.

Submitted 18 May, 2018; v1 submitted 14 May, 2018; originally announced May 2018.

Comments: fixed some presentation errors

arXiv:1711.00045 [pdf]

Retention Time of Peptides in Liquid Chromatography Is Well Estimated upon Deep Transfer Learning

Authors: Chunwei Ma, Zhiyong Zhu, Jun Ye, Jiarui Yang, Jianguo Pei, Shaohang Xu, Chang Yu, Fan Mo, Bo Wen, Siqi Liu

Abstract: A fully automatic prediction for peptide retention time (RT) in liquid chromatography (LC), termed as DeepRT, was developed using deep learning approach, an ensemble of Residual Network (ResNet) and Long Short-Term Memory (LSTM). In contrast to the traditional predictor based on the hand-crafted features for peptides, DeepRT learns features from raw amino acid sequences and makes relatively accurate prediction of peptide RTs with 0.987 R2 for unmodified peptides. Furthermore, by virtue of transfer learning, DeepRT enables utilization of the peptides datasets generated from different LC conditions and of different modification status, resulting in the RT prediction of 0.992 R2 for unmodified peptides and 0.978 R2 for post-translationally modified peptides. Even though chromatographic behaviors of peptides are quite complicated, the study here demonstrated that peptide RT prediction could be largely improved by deep transfer learning. The DeepRT software is freely available at https://github.com/horsepurve/DeepRT, under Apache2 open source License.

Submitted 31 October, 2017; originally announced November 2017.

Comments: 13-page research article

arXiv:1710.11430 [pdf]

DeepQuality: Mass Spectra Quality Assessment via Compressed Sensing and Deep Learning

Authors: Chunwei Ma

Abstract: Motivation: Mass spectrometry-based proteomics is among the most commonly used methods for scrutinizing proteomic profiles in different organs for biological or medical researches. All the proteomic analyses including peptide/protein identification and quantification, differential expression analysis, biomarker discovery and so on are all based on the matching of mass spectra with peptide sequences, which is significantly influenced by the quality of the spectra, such as the peak numbers, noisy peaks, signal-to-noise ratios, etc. Hence, it is crucial to assess the quality of the spectra in order for filtering and/or post-processing after identification. The handcrafted features representing spectra quality, however, need human expertise to design and are difficult to optimize, and thus the existing assessing algorithms are still lacking in accuracy. Thus, there is a critical need for the robust and adaptive algorithm for mass spectra quality assessment. Results: We have developed a novel mass spectrum assessment software DeepQuality, based on the state-of-the-art compressed sensing and deep learning algorithms. We evaluated the algorithm on two publicly available tandem MS data sets, resulting in the AUC of 0.96 and 0.92, respectively, a significant improvement compared with the AUC of 0.85 and 0.91 of the existing method SpectrumQuality v2.0. Availability: Software available at https://github.com/horsepurve/DeepQuality

Submitted 31 October, 2017; originally announced October 2017.

Comments: four-page technical brief

arXiv:1705.05368 [pdf]

DeepRT: deep learning for peptide retention time prediction in proteomics

Authors: Chunwei Ma, Zhiyong Zhu, Jun Ye, Jiarui Yang, Jianguo Pei, Shaohang Xu, Ruo Zhou, Chang Yu, Fan Mo, Bo Wen, Siqi Liu

Abstract: Accurate predictions of peptide retention times (RT) in liquid chromatography have many applications in mass spectrometry-based proteomics. Herein, we present DeepRT, a deep learning based software for peptide retention time prediction. DeepRT automatically learns features directly from the peptide sequences using the deep convolutional Neural Network (CNN) and Recurrent Neural Network (RNN) model, which eliminates the need to use hand-crafted features or rules. After the feature learning, principal component analysis (PCA) was used for dimensionality reduction, then three conventional machine learning methods were utilized to perform modeling. Two published datasets were used to evaluate the performance of DeepRT and we demonstrate that DeepRT greatly outperforms previous state-of-the-art approaches ELUDE and GPTime.

Submitted 15 May, 2017; originally announced May 2017.

Showing 1–18 of 18 results for author: Ma, C