-
Unifying Sequences, Structures, and Descriptions for Any-to-Any Protein Generation with the Large Multimodal Model HelixProtX
Authors:
Zhiyuan Chen,
Tianhao Chen,
Chenggang Xie,
Yang Xue,
Xiaonan Zhang,
Jingbo Zhou,
Xiaomin Fang
Abstract:
Proteins are fundamental components of biological systems and can be represented through various modalities, including sequences, structures, and textual descriptions. Despite the advances in deep learning and scientific large language models (LLMs) for protein research, current methodologies predominantly focus on limited specialized tasks -- often predicting one protein modality from another. These approaches restrict the understanding and generation of multimodal protein data. In contrast, large multimodal models have demonstrated potential capabilities in generating any-to-any content like text, images, and videos, thus enriching user interactions across various domains. Integrating these multimodal model technologies into protein research offers significant promise by potentially transforming how proteins are studied. To this end, we introduce HelixProtX, a system built upon the large multimodal model, aiming to offer a comprehensive solution to protein research by supporting any-to-any protein modality generation. Unlike existing methods, it allows for the transformation of any input protein modality into any desired protein modality. The experimental results affirm the advanced capabilities of HelixProtX, not only in generating functional descriptions from amino acid sequences but also in executing critical tasks such as designing protein sequences and structures from textual descriptions. Preliminary findings indicate that HelixProtX consistently achieves superior accuracy across a range of protein-related tasks, outperforming existing state-of-the-art models. By integrating multimodal large models into protein research, HelixProtX opens new avenues for understanding protein biology, thereby promising to accelerate scientific discovery.
Submitted 12 July, 2024;
originally announced July 2024.
-
DRAK: Unlocking Molecular Insights with Domain-Specific Retrieval-Augmented Knowledge in LLMs
Authors:
Jinzhe Liu,
Xiangsheng Huang,
Zhuo Chen,
Yin Fang
Abstract:
Large Language Models (LLMs) encounter challenges with the unique syntax of specific domains, such as biomolecules. Existing fine-tuning or modality alignment techniques struggle to bridge the domain knowledge gap and understand complex molecular data, limiting LLMs' progress in specialized fields. To overcome these limitations, we propose an expandable and adaptable non-parametric knowledge injection framework named Domain-specific Retrieval-Augmented Knowledge (DRAK), aimed at enhancing reasoning capabilities in specific domains. Utilizing knowledge-aware prompts and gold label-induced reasoning, DRAK has developed profound expertise in the molecular domain and the capability to handle a broad spectrum of analysis tasks. We evaluated two distinct forms of DRAK variants, proving that DRAK exceeds previous benchmarks on six molecular tasks within the Mol-Instructions dataset. Extensive experiments have underscored DRAK's formidable performance and its potential to unlock molecular insights, offering a unified paradigm for LLMs to tackle knowledge-intensive tasks in specific domains. Our code will be available soon.
Submitted 4 March, 2024;
originally announced June 2024.
-
BEACON: Benchmark for Comprehensive RNA Tasks and Language Models
Authors:
Yuchen Ren,
Zhiyuan Chen,
Lifeng Qiao,
Hongtai Jing,
Yuchen Cai,
Sheng Xu,
Peng Ye,
Xinzhu Ma,
Siqi Sun,
Hongliang Yan,
Dong Yuan,
Wanli Ouyang,
Xihui Liu
Abstract:
RNA plays a pivotal role in translating genetic instructions into functional outcomes, underscoring its importance in biological processes and disease mechanisms. Despite the emergence of numerous deep learning approaches for RNA, particularly universal RNA language models, there remains a significant lack of standardized benchmarks to assess the effectiveness of these methods. In this study, we introduce the first comprehensive RNA benchmark BEACON (BEnchmArk for COmprehensive RNA Task and Language Models). First, BEACON comprises 13 distinct tasks derived from extensive previous work covering structural analysis, functional studies, and engineering applications, enabling a comprehensive assessment of the performance of methods on various RNA understanding tasks. Second, we examine a range of models, including traditional approaches like CNNs, as well as advanced RNA foundation models based on language models, offering valuable insights into the task-specific performances of these models. Third, we investigate the vital RNA language model components from the tokenizer and positional encoding aspects. Notably, our findings emphasize the superiority of single nucleotide tokenization and the effectiveness of Attention with Linear Biases (ALiBi) over traditional positional encoding methods. Based on these insights, a simple yet strong baseline called BEACON-B is proposed, which can achieve outstanding performance with limited data and computational resources. The datasets and source code of our benchmark are available at https://github.com/terry-r123/RNABenchmark.
Submitted 14 June, 2024;
originally announced June 2024.
-
Advancing High Resolution Vision-Language Models in Biomedicine
Authors:
Zekai Chen,
Arda Pekis,
Kevin Brown
Abstract:
Multi-modal learning has significantly advanced generative AI, especially in vision-language modeling. Innovations like GPT-4V and open-source projects such as LLaVA have enabled robust conversational agents capable of zero-shot task completions. However, applying these technologies in the biomedical field presents unique challenges. Recent initiatives like LLaVA-Med have started to adapt instruction-tuning for biomedical contexts using large datasets such as PMC-15M. Our research offers three key contributions: (i) we present a new instruct dataset enriched with medical image-text pairs from Claude3-Opus and LLaMA3 70B, (ii) we propose a novel image encoding strategy using hierarchical representations to improve fine-grained biomedical visual comprehension, and (iii) we develop the Llama3-Med model, which achieves state-of-the-art zero-shot performance on biomedical visual question answering benchmarks, with an average performance improvement of over 10% compared to previous methods. These advancements provide more accurate and reliable tools for medical professionals, bridging gaps in current multi-modal conversational assistants and promoting further innovations in medical AI.
Submitted 12 June, 2024;
originally announced June 2024.
-
A Fine-tuning Dataset and Benchmark for Large Language Models for Protein Understanding
Authors:
Yiqing Shen,
Zan Chen,
Michail Mamalakis,
Luhan He,
Haiyang Xia,
Tianbin Li,
Yanzhou Su,
Junjun He,
Yu Guang Wang
Abstract:
The parallels between protein sequences and natural language in their sequential structures have inspired the application of large language models (LLMs) to protein understanding. Despite the success of LLMs in NLP, their effectiveness in comprehending protein sequences remains an open question, largely due to the absence of datasets linking protein sequences to descriptive text. Researchers have then attempted to adapt LLMs for protein understanding by integrating a protein sequence encoder with a pre-trained LLM. However, this adaptation raises a fundamental question: "Can LLMs, originally designed for NLP, effectively comprehend protein sequences as a form of language?" Current datasets fall short in addressing this question due to the lack of a direct correlation between protein sequences and corresponding text descriptions, limiting the ability to train and evaluate LLMs for protein understanding effectively. To bridge this gap, we introduce ProteinLMDataset, a dataset specifically designed for further self-supervised pretraining and supervised fine-tuning (SFT) of LLMs to enhance their capability for protein sequence comprehension. Specifically, ProteinLMDataset includes 17.46 billion tokens for pretraining and 893,000 instructions for SFT. Additionally, we present ProteinLMBench, the first benchmark dataset consisting of 944 manually verified multiple-choice questions for assessing the protein understanding capabilities of LLMs. ProteinLMBench incorporates protein-related details and sequences in multiple languages, establishing a new standard for evaluating LLMs' abilities in protein comprehension. The large language model InternLM2-7B, pretrained and fine-tuned on the ProteinLMDataset, outperforms GPT-4 on ProteinLMBench, achieving the highest accuracy score.
Submitted 8 July, 2024; v1 submitted 8 June, 2024;
originally announced June 2024.
-
Unbending strategies shepherd cooperation and suppress extortion in spatial populations
Authors:
Zijie Chen,
Yuxin Geng,
Xingru Chen,
Feng Fu
Abstract:
Evolutionary game dynamics on networks typically consider the competition among simple strategies such as cooperation and defection in the Prisoner's Dilemma and summarize the effect of population structure as network reciprocity. However, it remains largely unknown regarding the evolutionary dynamics involving multiple powerful strategies typically considered in repeated games, such as the zero-determinant (ZD) strategies that are able to enforce a linear payoff relationship between them and their co-players. Here, we consider the evolutionary dynamics of always cooperate (AllC), extortionate ZD (extortioners), and unbending players in lattice populations based on the commonly used death-birth updating. Out of the class of unbending strategies, we consider a particular candidate, PSO Gambler, a machine-learning-optimized memory-one strategy, which can foster reciprocal cooperation and fairness among extortionate players. We derive analytical results under weak selection and rare mutations, including pairwise fixation probabilities and long-term frequencies of strategies. In the absence of the third unbending type, extortioners can achieve a half-half split in equilibrium with unconditional cooperators for sufficiently large extortion factors. However, the presence of unbending players fundamentally changes the dynamics and tilts the system to favor unbending cooperation. Most surprisingly, extortioners cannot dominate at all regardless of how large their extortion factor is, and the long-term frequency of unbending players is maintained almost as a constant. Our analytical method is applicable to studying the evolutionary dynamics of multiple strategies in structured populations. Our work provides insights into the interplay between network reciprocity and direct reciprocity, revealing the role of unbending strategies in enforcing fairness and suppressing extortion.
Submitted 29 May, 2024;
originally announced May 2024.
-
Combining Radiomics and Machine Learning Approaches for Objective ASD Diagnosis: Verifying White Matter Associations with ASD
Authors:
Junlin Song,
Yuzhuo Chen,
Yuan Yao,
Zetong Chen,
Renhao Guo,
Lida Yang,
Xinyi Sui,
Qihang Wang,
Xijiao Li,
Aihua Cao,
Wei Li
Abstract:
Autism Spectrum Disorder is a condition characterized by atypical brain development leading to impairments in social skills, communication abilities, repetitive behaviors, and sensory processing. There have been many studies combining brain MRI images with machine learning algorithms to achieve objective diagnosis of autism, but the correlation between white matter and autism has not been fully utilized. To address this gap, we develop a computer-aided diagnostic model focusing on white matter regions in brain MRI by employing radiomics and machine learning methods. This study introduced a MultiUNet model for segmenting white matter, leveraging the UNet architecture and utilizing manually segmented MRI images as the training data. Subsequently, we extracted white matter features using the Pyradiomics toolkit and applied different machine learning models such as Support Vector Machine, Random Forest, Logistic Regression, and K-Nearest Neighbors to predict autism. The prediction sets all exceeded 80% accuracy. Additionally, we employed a Convolutional Neural Network to analyze segmented white matter images, achieving a prediction accuracy of 86.84%. Notably, Support Vector Machine demonstrated the highest prediction accuracy at 89.47%. These findings not only underscore the efficacy of the models but also establish a link between white matter abnormalities and autism. Our study contributes to a comprehensive evaluation of various diagnostic models for autism and introduces a computer-aided diagnostic algorithm for early and objective autism diagnosis based on MRI white matter regions.
Submitted 25 May, 2024;
originally announced May 2024.
-
Du-IN: Discrete units-guided mask modeling for decoding speech from Intracranial Neural signals
Authors:
Hui Zheng,
Hai-Teng Wang,
Wei-Bang Jiang,
Zhong-Tao Chen,
Li He,
Pei-Yang Lin,
Peng-Hu Wei,
Guo-Guang Zhao,
Yun-Zhe Liu
Abstract:
Invasive brain-computer interfaces have garnered significant attention due to their high performance. The current intracranial stereoElectroEncephaloGraphy (sEEG) foundation models typically build univariate representations based on a single channel. Some of them further use Transformer to model the relationship among channels. However, due to the locality and specificity of brain computation, their performance on more difficult tasks, e.g., speech decoding, which demands intricate processing in specific brain regions, is yet to be fully investigated. We hypothesize that building multi-variate representations within certain brain regions can better capture the specific neural processing. To explore this hypothesis, we collect a well-annotated Chinese word-reading sEEG dataset, targeting language-related brain networks, over 12 subjects. Leveraging this benchmark dataset, we developed the Du-IN model that can extract contextual embeddings from specific brain regions through discrete codebook-guided mask modeling. Our model achieves SOTA performance on the downstream 61-word classification task, surpassing all baseline models. Model comparison and ablation analysis reveal that our design choices, including (i) multi-variate representation by fusing channels in vSMC and STG regions and (ii) self-supervision by discrete codebook-guided mask modeling, significantly contribute to these performances. Collectively, our approach, inspired by neuroscience findings, capitalizing on multi-variate neural representation from specific brain regions, is suitable for invasive brain modeling. It marks a promising neuro-inspired AI approach in BCI.
Submitted 19 May, 2024;
originally announced May 2024.
-
Dynamics of antibody binding and neutralization during viral infection
Authors:
Zhenying Chen,
Hasan Ahmed,
Cora Hirst,
Rustom Antia
Abstract:
In vivo in infection, virions are constantly produced and die rapidly. In contrast, most antibody binding assays do not include such features. Motivated by this, we considered virions with n=100 binding sites in simple mathematical models with and without the production of virions. In the absence of viral production, at steady state, the distribution of virions by the number of sites bound is given by a binomial distribution, with the proportion being a simple function of antibody affinity (Kon/Koff) and concentration; this generalizes to a multinomial distribution in the case of two or more kinds of antibodies. In the presence of viral production, the role of affinity is replaced by an infection analog of affinity (IAA), with IAA=Kon/(Koff+dv+r), where dv is the virus decaying rate and r is the infection growth rate. Because in vivo dv can be large, the amount of binding as well as the effect of Koff on binding are substantially reduced. When neutralization is added, the effect of Koff is similarly small which may help explain the relatively high Koff reported for many antibodies. We next show that the n+2-dimensional model used for neutralization can be simplified to a 2-dimensional model. This provides some justification for the simple models that have been used in practice. A corollary of our results is that an unexpectedly large effect of Koff in vivo may point to mechanisms of neutralization beyond stoichiometry. Our results suggest reporting Kon and Koff separately, rather than focusing on affinity, until the situation is better resolved both experimentally and theoretically.
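The relationship between affinity and the infection analog of affinity (IAA) described in this abstract can be illustrated with a short Python sketch. Only IAA = Kon/(Koff + dv + r) and the binomial distribution over n = 100 sites come from the abstract. The per-site bound probability p = K·A/(1 + K·A) is the standard equilibrium occupancy and is an assumption here, since the abstract only calls it "a simple function" of affinity and concentration; the function names and the numeric rate values are hypothetical and chosen purely for illustration.

```python
from math import comb


def bound_fraction(k_on, k_off, conc, d_v=0.0, r=0.0):
    """Per-site probability that a binding site is occupied.

    Assumed form: p = K*A / (1 + K*A), where A is the antibody concentration
    and K = k_on / (k_off + d_v + r). With d_v = r = 0 (no viral production)
    K reduces to the ordinary affinity k_on / k_off; otherwise it is the
    infection analog of affinity (IAA) quoted in the abstract.
    """
    effective_affinity = k_on / (k_off + d_v + r)
    x = effective_affinity * conc
    return x / (1.0 + x)


def sites_bound_pmf(n, p):
    """Binomial distribution of the number of bound sites on one virion."""
    return [comb(n, k) * p**k * (1.0 - p) ** (n - k) for k in range(n + 1)]


if __name__ == "__main__":
    # Illustrative (made-up) rates in arbitrary, mutually consistent units.
    k_on, k_off, conc = 1.0, 1.0, 1.0
    p_assay = bound_fraction(k_on, k_off, conc)                    # no production
    p_in_vivo = bound_fraction(k_on, k_off, conc, d_v=8.0, r=1.0)  # with production
    pmf = sites_bound_pmf(100, p_in_vivo)
    mean_bound = sum(k * w for k, w in enumerate(pmf))
    print(f"assay p = {p_assay:.3f}, in vivo p = {p_in_vivo:.3f}")
    print(f"mean sites bound in vivo (n=100): {mean_bound:.2f}")
```

Because dv and r enter the denominator alongside Koff, a large dv dominates the sum, which is one way to see why the abstract reports a reduced effect of Koff on binding in vivo.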
Submitted 15 May, 2024;
originally announced May 2024.
-
Bayesian-Guided Generation of Synthetic Microbiomes with Minimized Pathogenicity
Authors:
Nisha Pillai,
Bindu Nanduri,
Michael J Rothrock Jr.,
Zhiqian Chen,
Mahalingam Ramkumar
Abstract:
Synthetic microbiomes offer new possibilities for modulating microbiota, to address the barriers in multidrug resistance (MDR) research. We present a Bayesian optimization approach to enable efficient searching over the space of synthetic microbiome variants to identify candidates predictive of reduced MDR. Microbiome datasets were encoded into a low-dimensional latent space using autoencoders. Sampling from this space allowed generation of synthetic microbiome signatures. Bayesian optimization was then implemented to select variants for biological screening to maximize identification of designs with restricted MDR pathogens based on minimal samples. Four acquisition functions were evaluated: expected improvement, upper confidence bound, Thompson sampling, and probability of improvement. Based on each strategy, synthetic samples were prioritized according to their MDR detection. Expected improvement, upper confidence bound, and probability of improvement consistently produced synthetic microbiome candidates with significantly fewer searches than Thompson sampling. By combining deep latent space mapping and Bayesian learning for efficient guided screening, this study demonstrated the feasibility of creating bespoke synthetic microbiomes with customized MDR profiles.
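For readers unfamiliar with the four acquisition functions named in this abstract, the following Python sketch gives their textbook forms for a Gaussian posterior with mean `mu` and standard deviation `sigma` at a candidate point. It is not the authors' implementation: the maximization convention (for example, scoring a candidate by its negated predicted MDR), the exploration parameters `xi` and `kappa`, and all function names are assumptions made for illustration.

```python
import random
from math import erf, exp, pi, sqrt


def _normal_pdf(z):
    return exp(-0.5 * z * z) / sqrt(2.0 * pi)


def _normal_cdf(z):
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def expected_improvement(mu, sigma, best, xi=0.0):
    """Expected gain over the best score observed so far (maximization)."""
    gain = mu - best - xi
    if sigma <= 0.0:
        return max(gain, 0.0)  # no posterior uncertainty left
    z = gain / sigma
    return gain * _normal_cdf(z) + sigma * _normal_pdf(z)


def probability_of_improvement(mu, sigma, best, xi=0.0):
    """Posterior probability that the candidate beats the best score."""
    gain = mu - best - xi
    if sigma <= 0.0:
        return 1.0 if gain > 0.0 else 0.0
    return _normal_cdf(gain / sigma)


def upper_confidence_bound(mu, sigma, kappa=2.0):
    """Optimistic score: posterior mean plus kappa standard deviations."""
    return mu + kappa * sigma


def thompson_draw(mu, sigma, rng):
    """One posterior sample; the candidate with the largest draw is chosen."""
    return rng.gauss(mu, sigma)


if __name__ == "__main__":
    # Hypothetical candidates given as (posterior mean, posterior std).
    candidates = [(0.2, 0.1), (0.0, 0.5), (0.3, 0.05)]
    best_so_far = 0.25
    rng = random.Random(0)
    for mu, sigma in candidates:
        print(
            round(expected_improvement(mu, sigma, best_so_far), 4),
            round(probability_of_improvement(mu, sigma, best_so_far), 4),
            round(upper_confidence_bound(mu, sigma), 4),
            round(thompson_draw(mu, sigma, rng), 4),
        )
```

Expected improvement, probability of improvement, and the upper confidence bound are deterministic given the posterior, whereas Thompson sampling ranks candidates by a random draw, so its selections vary from run to run.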
Submitted 29 April, 2024;
originally announced May 2024.
-
Path-GPTOmic: A Balanced Multi-modal Learning Framework for Survival Outcome Prediction
Authors:
Hongxiao Wang,
Yang Yang,
Zhuo Zhao,
Pengfei Gu,
Nishchal Sapkota,
Danny Z. Chen
Abstract:
For predicting cancer survival outcomes, standard approaches in clinical research are often based on two main modalities: pathology images for observing cell morphology features, and genomic data (e.g., bulk RNA-seq) for quantifying gene expression. However, existing pathology-genomic multi-modal algorithms face significant challenges: (1) valuable biological insights regarding genes and gene-gene interactions are frequently overlooked; (2) one modality often dominates the optimization process, causing inadequate training for the other modality. In this paper, we introduce a new multi-modal "Path-GPTOmic" framework for cancer survival outcome prediction. First, to extract valuable biological insights, we regulate the embedding space of a foundation model, scGPT, initially trained on single-cell RNA-seq data, making it adaptable for bulk RNA-seq data. Second, to address the imbalance-between-modalities problem, we propose a gradient modulation mechanism tailored to the Cox partial likelihood loss for survival prediction. The contributions of the modalities are dynamically monitored and adjusted during the training process, so that both modalities are sufficiently trained. Evaluated on two TCGA (The Cancer Genome Atlas) datasets, our model achieves substantially improved survival prediction accuracy.
Submitted 17 March, 2024;
originally announced March 2024.
-
Controlled Variable Selection from Summary Statistics Only? A Solution via GhostKnockoffs and Penalized Regression
Authors:
Zhaomeng Chen,
Zihuai He,
Benjamin B. Chu,
Jiaqi Gu,
Tim Morrison,
Chiara Sabatti,
Emmanuel Candès
Abstract:
Identifying which variables do influence a response while controlling false positives pervades statistics and data science. In this paper, we consider a scenario in which we only have access to summary statistics, such as the values of marginal empirical correlations between each dependent variable of potential interest and the response. This situation may arise due to privacy concerns, e.g., to avoid the release of sensitive genetic information. We extend GhostKnockoffs (He et al. [2022]) and introduce variable selection methods based on penalized regression achieving false discovery rate (FDR) control. We report empirical results in extensive simulation studies, demonstrating enhanced performance over previous work. We also apply our methods to genome-wide association studies of Alzheimer's disease, and evidence a significant improvement in power.
Submitted 20 February, 2024;
originally announced February 2024.
-
Diffusion-Driven Generative Framework for Molecular Conformation Prediction
Authors:
Bobin Yang,
Jie Deng,
Zhenghan Chen,
Ruoxue Wu
Abstract:
The task of deducing three-dimensional molecular configurations from their two-dimensional graph representations holds paramount importance in the fields of computational chemistry and pharmaceutical development. The rapid advancement of machine learning, particularly within the domain of deep generative networks, has revolutionized the precision of predictive modeling in this context. Traditional approaches often adopt a two-step strategy: initially estimating interatomic distances and subsequently refining the spatial molecular structure by solving a distance geometry problem. However, this sequential approach occasionally falls short in accurately capturing the intricacies of local atomic arrangements, thereby compromising the fidelity of the resulting structural models. Addressing these limitations, this research introduces a cutting-edge generative framework named \method{}. This framework is grounded in the principles of diffusion observed in classical non-equilibrium thermodynamics. \method{} views atoms as discrete entities and excels in guiding the reversal of diffusion, transforming a distribution of stochastic noise back into coherent molecular structures through a process akin to a Markov chain. This transformation commences with the initial representation of a molecular graph in an abstract latent space, culminating in the realization of three-dimensional structures via a sophisticated bilevel optimization scheme meticulously tailored to meet the specific requirements of the task. One of the formidable challenges in this modeling endeavor involves preserving roto-translational invariance to ensure that the generated molecular conformations adhere to the laws of physics. Extensive experimental evaluations confirm the efficacy of the proposed \method{} in comparison to state-of-the-art methods.
Submitted 21 January, 2024; v1 submitted 22 December, 2023;
originally announced January 2024.
-
GlycoNMR: Dataset and benchmarks for NMR chemical shift prediction of carbohydrates with graph neural networks
Authors:
Zizhang Chen,
Ryan Paul Badman,
Lachele Foley,
Robert Woods,
Pengyu Hong
Abstract:
Molecular representation learning (MRL) is a powerful tool for bridging the gap between machine learning and chemical sciences, as it converts molecules into numerical representations while preserving their chemical features. These encoded representations serve as a foundation for various downstream biochemical studies, including property prediction and drug design. MRL has had great success with proteins and general biomolecule datasets. Yet, in the growing sub-field of glycoscience (the study of carbohydrates, where longer carbohydrates are also called glycans), MRL methods have been barely explored. This under-exploration can be primarily attributed to the limited availability of comprehensive and well-curated carbohydrate-specific datasets and a lack of Machine learning (ML) pipelines specifically tailored to meet the unique problems presented by carbohydrate data. Since interpreting and annotating carbohydrate-specific data is generally more complicated than protein data, domain experts are usually required to get involved. The existing MRL methods, predominately optimized for proteins and small biomolecules, also cannot be directly used in carbohydrate applications without special modifications. To address this challenge, accelerate progress in glycoscience, and enrich the data resources of the MRL community, we introduce GlycoNMR. GlycoNMR contains two laboriously curated datasets with 2,609 carbohydrate structures and 211,543 annotated nuclear magnetic resonance (NMR) chemical shifts for precise atomic-level prediction. We tailored carbohydrate-specific features and adapted existing MRL models to tackle this problem effectively. For illustration, we benchmark four modified MRL models on our new datasets.
Submitted 29 November, 2023; v1 submitted 28 November, 2023;
originally announced November 2023.
-
DP-DCAN: Differentially Private Deep Contrastive Autoencoder Network for Single-cell Clustering
Authors:
Huifa Li,
Jie Fu,
Zhili Chen,
Xiaomin Yang,
Haitao Liu,
Xinpeng Ling
Abstract:
Single-cell RNA sequencing (scRNA-seq) is important to transcriptomic analysis of gene expression. Recently, deep learning has facilitated the analysis of high-dimensional single-cell data. Unfortunately, deep learning models may leak sensitive information about users. As a result, Differential Privacy (DP) is increasingly used to protect privacy. However, existing DP methods usually perturb whole neural networks to achieve differential privacy, and hence result in great performance overheads. To address this challenge, in this paper, we take advantage of the uniqueness of the autoencoder that it outputs only the dimension-reduced vector in the middle of the network, and design a Differentially Private Deep Contrastive Autoencoder Network (DP-DCAN) by partial network perturbation for single-cell clustering. Since only partial network is added with noise, the performance improvement is obvious and twofold: one part of network is trained with less noise due to a bigger privacy budget, and the other part is trained without any noise. Experimental results of six datasets have verified that DP-DCAN is superior to the traditional DP scheme with whole network perturbation. Moreover, DP-DCAN demonstrates strong robustness to adversarial attacks.
Submitted 13 May, 2024; v1 submitted 6 November, 2023;
originally announced November 2023.
-
Large-scale Foundation Models and Generative AI for BigData Neuroscience
Authors:
Ran Wang,
Zhe Sage Chen
Abstract:
Recent advances in machine learning have made revolutionary breakthroughs in computer games, image and natural language understanding, and scientific discovery. Foundation models and large-scale language models (LLMs) have recently achieved human-like intelligence thanks to BigData. With the help of self-supervised learning (SSL) and transfer learning, these models may potentially reshape the landscapes of neuroscience research and make a significant impact on the future. Here we present a mini-review on recent advances in foundation models and generative AI models as well as their applications in neuroscience, including natural language and speech, semantic memory, brain-machine interfaces (BMIs), and data augmentation. We argue that this paradigm-shift framework will open new avenues for many neuroscience research directions and discuss the accompanying challenges and opportunities.
Submitted 26 October, 2023;
originally announced October 2023.
-
Second-order group knockoffs with applications to GWAS
Authors:
Benjamin B Chu,
Jiaqi Gu,
Zhaomeng Chen,
Tim Morrison,
Emmanuel Candes,
Zihuai He,
Chiara Sabatti
Abstract:
Conditional testing via the knockoff framework allows one to identify -- among a large number of possible explanatory variables -- those that carry unique information about an outcome of interest, and also provides a false discovery rate guarantee on the selection. This approach is particularly well suited to the analysis of genome-wide association studies (GWAS), which have the goal of identifying genetic variants that influence traits of medical relevance.
While conditional testing can be both more powerful and precise than traditional GWAS analysis methods, its vanilla implementation encounters a difficulty common to all multivariate analysis methods: it is challenging to distinguish among multiple, highly correlated regressors. This impasse can be overcome by shifting the object of inference from single variables to groups of correlated variables. To achieve this, it is necessary to construct "group knockoffs." While successful examples are already documented in the literature, this paper substantially expands the set of algorithms and software for group knockoffs. We focus in particular on second-order knockoffs, for which we describe correlation matrix approximations that are appropriate for GWAS data and that result in considerable computational savings. We illustrate the effectiveness of the proposed methods with simulations and with the analysis of albuminuria data from the UK Biobank.
The described algorithms are implemented in an open-source Julia package Knockoffs.jl, for which both R and Python wrappers are available.
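The false discovery rate guarantee mentioned in this abstract comes from the selection step of the knockoff filter. The following is a minimal Python sketch of the standard knockoff+ threshold, assuming the feature statistics `W` (one per variable, or one per group in the group setting) have already been computed. The function name `knockoff_plus_select` is hypothetical, and this is not the Knockoffs.jl API or the paper's knockoff construction.

```python
import numpy as np

def knockoff_plus_select(W, q=0.1):
    """Return indices selected by the knockoff+ filter at target FDR q.

    W: knockoff feature statistics; large positive values are evidence
       for the original variable, negative values for its knockoff.
    """
    W = np.asarray(W, dtype=float)
    # Candidate thresholds are the nonzero magnitudes of the statistics.
    candidates = np.sort(np.unique(np.abs(W[W != 0])))
    for t in candidates:
        # Estimated number of false discoveries; the +1 is the "knockoff+" correction.
        false_est = 1 + np.sum(W <= -t)
        selected = max(1, np.sum(W >= t))
        if false_est / selected <= q:
            return np.flatnonzero(W >= t)
    # No threshold meets the target level: select nothing.
    return np.array([], dtype=int)
```

The smallest threshold whose estimated false discovery proportion is at most `q` is used, so the selection is as large as the guarantee allows.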
Submitted 3 March, 2024; v1 submitted 23 October, 2023;
originally announced October 2023.
-
SI-SD: Sleep Interpreter through awake-guided cross-subject Semantic Decoding
Authors:
Hui Zheng,
Zhong-Tao Chen,
Hai-Teng Wang,
Jian-Yang Zhou,
Lin Zheng,
Pei-Yang Lin,
Yun-Zhe Liu
Abstract:
Understanding semantic content from brain activity during sleep represents a major goal in neuroscience. While studies in rodents have shown spontaneous neural reactivation of memories during sleep, capturing the semantic content of human sleep poses a significant challenge due to the absence of well-annotated sleep datasets and the substantial differences in neural patterns between wakefulness and sleep. To address these challenges, we designed a novel cognitive neuroscience experiment and collected a comprehensive, well-annotated electroencephalography (EEG) dataset from 134 subjects during both wakefulness and sleep. Leveraging this benchmark dataset, we developed SI-SD that enhances sleep semantic decoding through the position-wise alignment of neural latent sequence between wakefulness and sleep. In the 15-way classification task, our model achieves 24.12% and 21.39% top-1 accuracy on unseen subjects for NREM 2/3 and REM sleep, respectively, surpassing all other baselines. With additional fine-tuning, decoding performance improves to 30.32% and 31.65%, respectively. Besides, inspired by previous neuroscientific findings, we systematically analyze how the "Slow Oscillation" event impacts decoding performance in NREM 2/3 sleep -- decoding performance on unseen subjects further improves to 40.02%. Together, our findings and methodologies contribute to a promising neuro-AI framework for decoding brain activity during sleep.
Submitted 19 May, 2024; v1 submitted 28 September, 2023;
originally announced September 2023.
-
Empowering Precision Medicine: AI-Driven Schizophrenia Diagnosis via EEG Signals: A Comprehensive Review from 2002-2023
Authors:
Mahboobeh Jafari,
Delaram Sadeghi,
Afshin Shoeibi,
Hamid Alinejad-Rokny,
Amin Beheshti,
David López García,
Zhaolin Chen,
U. Rajendra Acharya,
Juan M. Gorriz
Abstract:
Schizophrenia (SZ) is a prevalent mental disorder characterized by cognitive, emotional, and behavioral changes. Symptoms of SZ include hallucinations, illusions, delusions, lack of motivation, and difficulties in concentration. Diagnosing SZ involves employing various tools, including clinical interviews, physical examinations, psychological evaluations, the Diagnostic and Statistical Manual of Mental Disorders (DSM), and neuroimaging techniques. Electroencephalography (EEG) recording is a significant functional neuroimaging modality that provides valuable insights into brain function during SZ. However, EEG signal analysis poses challenges for neurologists and scientists due to the presence of artifacts, long-term recordings, and the utilization of multiple channels. To address these challenges, researchers have introduced artificial intelligence (AI) techniques, encompassing conventional machine learning (ML) and deep learning (DL) methods, to aid in SZ diagnosis. This study reviews papers focused on SZ diagnosis utilizing EEG signals and AI methods. The introduction section provides a comprehensive explanation of SZ diagnosis methods and intervention techniques. Subsequently, review papers in this field are discussed, followed by an introduction to the AI methods employed for SZ diagnosis and a summary of relevant papers presented in tabular form. Additionally, this study reports on the most significant challenges encountered in SZ diagnosis, as identified through a review of papers in this field. Future directions to overcome these challenges are also addressed. The discussion section examines the specific details of each paper, culminating in the presentation of conclusions and findings.
Submitted 14 September, 2023;
originally announced September 2023.
-
Current and future directions in network biology
Authors:
Marinka Zitnik,
Michelle M. Li,
Aydin Wells,
Kimberly Glass,
Deisy Morselli Gysi,
Arjun Krishnan,
T. M. Murali,
Predrag Radivojac,
Sushmita Roy,
Anaïs Baudot,
Serdar Bozdag,
Danny Z. Chen,
Lenore Cowen,
Kapil Devkota,
Anthony Gitter,
Sara Gosline,
Pengfei Gu,
Pietro H. Guzzi,
Heng Huang,
Meng Jiang,
Ziynet Nesibe Kesimoglu,
Mehmet Koyuturk,
Jian Ma,
Alexander R. Pico,
Nataša Pržulj
, et al. (12 additional authors not shown)
Abstract:
Network biology is an interdisciplinary field bridging computational and biological sciences that has proved pivotal in advancing the understanding of cellular functions and diseases across biological systems and scales. Although the field has been around for two decades, it remains nascent. It has witnessed rapid evolution, accompanied by emerging challenges. These challenges stem from various factors, notably the growing complexity and volume of data together with the increased diversity of data types describing different tiers of biological organization. We discuss prevailing research directions in network biology and highlight areas of inference and comparison of biological networks, multimodal data integration and heterogeneous networks, higher-order network analysis, machine learning on networks, and network-based personalized medicine. Following the overview of recent breakthroughs across these five areas, we offer a perspective on the future directions of network biology. Additionally, we offer insights into scientific communities, educational initiatives, and the importance of fostering diversity within the field. This paper establishes a roadmap for an immediate and long-term vision for network biology.
Submitted 11 June, 2024; v1 submitted 15 September, 2023;
originally announced September 2023.
-
Shape-conditioned 3D Molecule Generation via Equivariant Diffusion Models
Authors:
Ziqi Chen,
Bo Peng,
Srinivasan Parthasarathy,
Xia Ning
Abstract:
Ligand-based drug design aims to identify novel drug candidates of similar shapes with known active molecules. In this paper, we formulated an in silico shape-conditioned molecule generation problem to generate 3D molecule structures conditioned on the shape of a given molecule. To address this problem, we developed a translation- and rotation-equivariant shape-guided generative model ShapeMol. ShapeMol consists of an equivariant shape encoder that maps molecular surface shapes into latent embeddings, and an equivariant diffusion model that generates 3D molecules based on these embeddings. Experimental results show that ShapeMol can generate novel, diverse, drug-like molecules that retain 3D molecular shapes similar to the given shape condition. These results demonstrate the potential of ShapeMol in designing drug candidates of desired 3D shapes binding to protein target pockets.
Submitted 16 October, 2023; v1 submitted 22 August, 2023;
originally announced August 2023.
-
MoCLIM: Towards Accurate Cancer Subtyping via Multi-Omics Contrastive Learning with Omics-Inference Modeling
Authors:
Ziwei Yang,
Zheng Chen,
Yasuko Matsubara,
Yasushi Sakurai
Abstract:
Precision medicine fundamentally aims to establish causality between dysregulated biochemical mechanisms and cancer subtypes. Omics-based cancer subtyping has emerged as a revolutionary approach, as different levels of omics record the biochemical products of multistep processes in cancers. This paper focuses on fully exploiting the potential of multi-omics data to improve cancer subtyping outcomes, and to that end develops MoCLIM, a representation learning framework. MoCLIM independently extracts the informative features from distinct omics modalities. Using a unified representation informed by contrastive learning of different omics modalities, we can cluster the subtypes of a given cancer well in a lower-dimensional latent space. This contrast can be interpreted as a projection of inter-omics inference observed in biological networks. Experimental results on six cancer datasets demonstrate that our approach significantly improves data fit and subtyping performance in fewer high-dimensional cancer instances. Moreover, our framework incorporates various medical evaluations as the final component, providing high interpretability in medical analysis.
Submitted 24 August, 2023; v1 submitted 17 August, 2023;
originally announced August 2023.
-
Digital Twin Brain: a simulation and assimilation platform for whole human brain
Authors:
Wenlian Lu,
Longbin Zeng,
Xin Du,
Wenyong Zhang,
Shitong Xiang,
Huarui Wang,
Jiexiang Wang,
Mingda Ji,
Yubo Hou,
Minglong Wang,
Yuhao Liu,
Zhongyu Chen,
Qibao Zheng,
Ningsheng Xu,
Jianfeng Feng
Abstract:
In this work, we present a computing platform named digital twin brain (DTB) that can simulate spiking neuronal networks at the scale of the whole human brain and, more importantly, with a personalized biological brain structure. In comparison to most brain simulations with a homogeneous global structure, we highlight that the sparseness, couplingness and heterogeneity in the sMRI, DTI and PET data of the brain have an essential impact on the efficiency of brain simulation. Scaling experiments show that DTB simulation of the human brain is a communication-intensive and memory-access-intensive computing system rather than a computation-intensive one. We utilize a number of optimization techniques to balance and integrate the computation loads and communication traffic from the heterogeneous biological structure to the general GPU-based HPC and achieve leading simulation performance for the whole human brain-scaled spiking neuronal networks. On the other hand, the biological structure, equipped with a mesoscopic data assimilation, enables the DTB to investigate brain cognitive function by a reverse-engineering method, which is demonstrated by a digital experiment of visual evaluation on the DTB. Furthermore, we believe that the developing DTB will be a promising and powerful platform for a large number of research directions, including brain-inspired intelligence, brain disease medicine and brain-machine interfaces.
Submitted 2 August, 2023;
originally announced August 2023.
-
Deep Neural Networks and Brain Alignment: Brain Encoding and Decoding (Survey)
Authors:
Subba Reddy Oota,
Zijiao Chen,
Manish Gupta,
Raju S. Bapi,
Gael Jobard,
Frederic Alexandre,
Xavier Hinaut
Abstract:
Can we obtain insights about the brain using AI models? How is the information in deep learning models related to brain recordings? Can we improve AI models with the help of brain recordings? Such questions can be tackled by studying brain recordings like functional magnetic resonance imaging (fMRI). As a first step, the neuroscience community has contributed several large cognitive neuroscience datasets related to passive reading/listening/viewing of concept words, narratives, pictures, and movies. Encoding and decoding models using these datasets have also been proposed in the past two decades. These models serve as additional tools for basic cognitive science and neuroscience research. Encoding models aim at generating fMRI brain representations given a stimulus automatically. They have several practical applications in evaluating and diagnosing neurological conditions and thus may also help design therapies for brain damage. Decoding models solve the inverse problem of reconstructing the stimuli given the fMRI. They are useful for designing brain-machine or brain-computer interfaces. Inspired by the effectiveness of deep learning models for natural language processing, computer vision, and speech, several neural encoding and decoding models have been recently proposed. In this survey, we will first discuss popular representations of language, vision and speech stimuli, and present a summary of neuroscience datasets. Further, we will review popular deep learning based encoding and decoding architectures and note their benefits and limitations. Finally, we will conclude with a summary and discussion about future trends. Given the large amount of recently published work in the computational cognitive neuroscience (CCN) community, we believe that this survey enables an entry point for DNN researchers to diversify into CCN research.
Submitted 8 July, 2024; v1 submitted 17 July, 2023;
originally announced July 2023.
-
Sulcal Pattern Matching with the Wasserstein Distance
Authors:
Zijian Chen,
Soumya Das,
Moo K. Chung
Abstract:
We present a unified computational framework for modeling the sulcal patterns of the human brain obtained from magnetic resonance images. The Wasserstein distance is used to align the sulcal patterns nonlinearly. These patterns are topologically different across subjects, making the pattern matching a challenge. We work out the mathematical details and develop the gradient descent algorithms for estimating the deformation field. We further quantify the image registration performance. This method is applied in identifying the differences between male and female sulcal patterns.
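For readers unfamiliar with the Wasserstein distance used in this abstract, the following is a minimal Python sketch of the one-dimensional case between two equal-size samples, where the optimal transport plan simply pairs sorted values. It is a toy illustration only: the function name `wasserstein_1d` is hypothetical, and the paper's setting (sulcal patterns on brain surfaces, with a deformation field estimated by gradient descent) is not reproduced here.

```python
import numpy as np

def wasserstein_1d(x, y, p=2):
    """p-Wasserstein distance between two equal-size 1D empirical samples."""
    x = np.sort(np.asarray(x, dtype=float))
    y = np.sort(np.asarray(y, dtype=float))
    if x.shape != y.shape:
        raise ValueError("this sketch assumes samples of equal size")
    # In 1D the optimal coupling matches the i-th smallest x to the i-th smallest y.
    return float(np.mean(np.abs(x - y) ** p) ** (1.0 / p))
```

Shifting every point of a sample by a constant `c` gives a distance of `|c|`, which is a quick sanity check on any implementation.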
Submitted 1 July, 2023;
originally announced July 2023.
-
Functional-Group-Based Diffusion for Pocket-Specific Molecule Generation and Elaboration
Authors:
Haitao Lin,
Yufei Huang,
Odin Zhang,
Lirong Wu,
Siyuan Li,
Zhiyuan Chen,
Stan Z. Li
Abstract:
In recent years, AI-assisted drug design methods have been proposed to generate molecules given the pockets' structures of target proteins. Most of them are atom-level-based methods, which consider atoms as basic components and generate atom positions and types. In this way, however, it is hard to generate realistic fragments with complicated structures. To solve this, we propose D3FG, a functional-group-based diffusion model for pocket-specific molecule generation and elaboration. D3FG decomposes molecules into two categories of components: functional groups defined as rigid bodies and linkers as mass points. Together, the two kinds of components can form complicated fragments that enhance ligand-protein interactions.
To be specific, in the diffusion process, D3FG diffuses the data distribution of the positions, orientations, and types of the components into a prior distribution; in the generative process, the noise is gradually removed from the three variables by denoisers parameterized with designed equivariant graph neural networks. In the experiments, our method can generate molecules with more realistic 3D structures, competitive affinities toward the protein targets, and better drug properties. Besides, as a solution to the new task of molecule elaboration, D3FG could generate molecules with high affinities based on existing ligands and the hotspots of target proteins.
Submitted 18 March, 2024; v1 submitted 30 May, 2023;
originally announced June 2023.
-
Mol-Instructions: A Large-Scale Biomolecular Instruction Dataset for Large Language Models
Authors:
Yin Fang,
Xiaozhuan Liang,
Ningyu Zhang,
Kangwei Liu,
Rui Huang,
Zhuo Chen,
Xiaohui Fan,
Huajun Chen
Abstract:
Large Language Models (LLMs), with their remarkable task-handling capabilities and innovative outputs, have catalyzed significant advancements across a spectrum of fields. However, their proficiency within specialized domains such as biomolecular studies remains limited. To address this challenge, we introduce Mol-Instructions, a comprehensive instruction dataset designed for the biomolecular domain. Mol-Instructions encompasses three key components: molecule-oriented instructions, protein-oriented instructions, and biomolecular text instructions. Each component aims to improve the understanding and prediction capabilities of LLMs concerning biomolecular features and behaviors. Through extensive instruction tuning experiments on LLMs, we demonstrate the effectiveness of Mol-Instructions in enhancing large models' performance in the intricate realm of biomolecular studies, thus fostering progress in the biomolecular research community. Mol-Instructions is publicly available for ongoing research and will undergo regular updates to enhance its applicability.
Submitted 4 March, 2024; v1 submitted 13 June, 2023;
originally announced June 2023.
-
Energy landscape reveals the underlying mechanism of cancer-adipose conversion with gene network models
Authors:
Zihao Chen,
Jia Lu,
Xing-Ming Zhao,
Haiyang Yu,
Chunhe Li
Abstract:
Cancer is a systemic heterogeneous disease involving complex molecular networks. Tumor formation involves epithelial-mesenchymal transition (EMT), which promotes both metastasis and plasticity of cancer cells. Recent experiments proposed that cancer cells can be transformed into adipocytes with combination drugs. However, the underlying mechanisms for how these drugs work from a molecular network perspective remain elusive. To reveal the mechanism of cancer-adipose conversion (CAC), we adopt a systems biology approach by combining mathematical modeling and molecular experiments based on the underlying molecular regulatory network. We identified four types of attractors which correspond to epithelial (E), mesenchymal (M), adipose (A) and partial/intermediate EMT (P) cell states on the CAC landscape. Landscape and transition path results illustrate that the intermediate states play critical roles in the cancer-to-adipose transition. Through a landscape control strategy, we identified two new therapeutic strategies for drug combinations to promote CAC. We further verified these predictions by molecular experiments in different cell lines. Our combined computational and experimental approach provides a powerful tool to explore molecular mechanisms for cell fate transitions in cancer networks. Our results revealed the underlying mechanism for intermediate cell states governing the CAC, and identified new potential drug combinations to induce cancer adipogenesis.
Submitted 21 May, 2023;
originally announced May 2023.
-
T-Cell Receptor Optimization with Reinforcement Learning and Mutation Policies for Precision Immunotherapy
Authors:
Ziqi Chen,
Martin Renqiang Min,
Hongyu Guo,
Chao Cheng,
Trevor Clancy,
Xia Ning
Abstract:
T cells monitor the health status of cells by identifying foreign peptides displayed on their surface. T-cell receptors (TCRs), which are protein complexes found on the surface of T cells, are able to bind to these peptides. This process is known as TCR recognition and constitutes a key step for immune response. Optimizing TCR sequences for TCR recognition represents a fundamental step towards the development of personalized treatments to trigger immune responses killing cancerous or virus-infected cells. In this paper, we formulated the search for these optimized TCRs as a reinforcement learning (RL) problem, and presented a framework TCRPPO with a mutation policy using proximal policy optimization. TCRPPO mutates TCRs into effective ones that can recognize given peptides. TCRPPO leverages a reward function that combines the likelihoods of mutated sequences being valid TCRs measured by a new scoring function based on deep autoencoders, with the probabilities of mutated sequences recognizing peptides from a peptide-TCR interaction predictor. We compared TCRPPO with multiple baseline methods and demonstrated that TCRPPO significantly outperforms all the baseline methods to generate positive binding and valid TCRs. These results demonstrate the potential of TCRPPO for both precision immunotherapy and peptide-recognizing TCR motif discovery.
Submitted 2 March, 2023;
originally announced March 2023.
-
Language Models are Few-shot Learners for Prognostic Prediction
Authors:
Zekai Chen,
Mariann Micsinai Balan,
Kevin Brown
Abstract:
Clinical prediction is an essential task in the healthcare industry. However, the recent success of transformers, on which large language models are built, has not been extended to this domain. In this research, we explore the use of transformers and language models in prognostic prediction for immunotherapy using real-world patients' clinical data and molecular profiles. This paper investigates the potential of transformers to improve clinical prediction compared to conventional machine learning approaches and addresses the challenge of few-shot learning in predicting rare disease areas. The study benchmarks the efficacy of baselines and language models on prognostic prediction across multiple cancer types and investigates the impact of different pretrained language models under few-shot regimes. The results demonstrate significant improvements in accuracy and highlight the potential of NLP in clinical research to improve early detection and intervention for different diseases.
Submitted 4 May, 2023; v1 submitted 24 February, 2023;
originally announced February 2023.
-
AI of Brain and Cognitive Sciences: From the Perspective of First Principles
Authors:
Luyao Chen,
Zhiqiang Chen,
Longsheng Jiang,
Xiang Liu,
Linlu Xu,
Bo Zhang,
Xiaolong Zou,
Jinying Gao,
Yu Zhu,
Xizi Gong,
Shan Yu,
Sen Song,
Liangyi Chen,
Fang Fang,
Si Wu,
Jia Liu
Abstract:
We have witnessed the great success of AI in various applications, including image classification, game playing, protein structure analysis, language translation, and content generation. Despite these powerful applications, there are still many tasks in our daily life that are rather simple to humans but pose great challenges to AI. These include image and language understanding, few-shot learning, abstract concepts, and low-energy cost computing. Thus, learning from the brain is still a promising way that can shed light on the development of next-generation AI. The brain is arguably the only known intelligent machine in the universe, which is the product of evolution for animals surviving in the natural environment. At the behavior level, psychology and cognitive sciences have demonstrated that human and animal brains can execute very intelligent high-level cognitive functions. At the structure level, cognitive and computational neurosciences have unveiled that the brain has extremely complicated but elegant network forms to support its functions. Over the years, people have been gathering knowledge about the structure and functions of the brain, and this process has accelerated recently along with the initiation of giant brain projects worldwide. Here, we argue that the general principles of brain functions are the most valuable things to inspire the development of AI. These general principles are the standard rules by which the brain extracts, represents, manipulates, and retrieves information, and here we call them the first principles of the brain. This paper collects six such first principles. They are attractor network, criticality, random network, sparse coding, relational memory, and perceptual learning. On each topic, we review its biological background, fundamental property, potential application to AI, and future development.
Submitted 19 January, 2023;
originally announced January 2023.
-
Graph neural networks to learn joint representations of disjoint molecular graphs
Authors:
Chen Shao,
Zhou Chen,
Pascal Friederich
Abstract:
Graph neural networks are widely used to learn global representations of graphs, which are then used for regression or classification tasks. Typically, the graphs in such data sets are connected, i.e. each training sample consists of a single internally connected graph associated with a global label. However, there is a wide variety of as yet unconsidered but application-relevant tasks in which labels are assigned to sets of disjoint graphs, which requires the generation of global representations of disjoint graphs. In this paper, we present a new data set of chemical reactions that illustrates this task. Each sample consists of a pair of disjoint molecular graphs and a joint label representing a scalar measure associated with the chemical reaction of the molecules. We show initial results of graph neural networks that are able to solve the task within a combinatorial subset of the data set but do not generalize well to the full data set and unseen (sub)graphs.
Submitted 30 October, 2022; v1 submitted 14 October, 2022;
originally announced October 2022.
-
Corticosteroid Activation of Atlantic Sea Lamprey Corticoid Receptor: Allosteric Regulation by the N-terminal Domain
Authors:
Yoshinao Katsu,
Xiaozhi Lin,
Ruigeng Ji,
Ze Chen,
Yui Kamisaka,
Koto Bamba,
Michael E. Baker
Abstract:
Lampreys are jawless fish that evolved about 550 million years ago at the base of the vertebrate line. Modern lampreys contain a corticoid receptor (CR), the common ancestor of the glucocorticoid receptor (GR) and mineralocorticoid receptor (MR), which first appear in cartilaginous fish, such as sharks. Until recently, 344 amino acids at the amino terminus of adult lamprey CR were not present in the lamprey CR sequence in GenBank. A search of the recently sequenced lamprey germline genome identified two CR sequences, CR1 and CR2, containing the 344 previously un-identified amino acids at the amino terminus. CR1 also contains a novel four amino acid insertion in the DNA-binding domain (DBD). We studied corticosteroid activation of CR1 and CR2 and found their strongest response was to 11-deoxycorticosterone and 11-deoxycortisol, the two circulating corticosteroids in lamprey. Based on steroid specificity, both CRs are close to elephant shark MR and distant from elephant shark GR. HEK293 cells transfected with full-length CR1 or CR2 and the MMTV promoter have about 3-fold higher steroid-mediated activation compared to HEK293 cells transfected with these CRs and the TAT3 promoter. Deletion of the amino-terminal domain (NTD) of lamprey CR1 and CR2 to form truncated CRs decreased transcriptional activation by about 70% in HEK293 cells transfected with MMTV, but increased transcription by about 6-fold in cells transfected with TAT3, indicating that the promoter has an important effect on NTD regulation of CR transcription by corticosteroids.
Submitted 8 October, 2022;
originally announced October 2022.
-
Classify Respiratory Abnormality in Lung Sounds Using STFT and a Fine-Tuned ResNet18 Network
Authors:
Zizhao Chen,
Hongliang Wang,
Chia-Hui Yeh,
Xilin Liu
Abstract:
Recognizing patterns in lung sounds is crucial to detecting and monitoring respiratory diseases. Current techniques for analyzing respiratory sounds demand domain experts and are subject to interpretation. Hence an accurate and automatic respiratory sound classification system is desired. In this work, we took a data-driven approach to classify abnormal lung sounds. We compared the performance of three different feature extraction techniques, namely the short-time Fourier transform (STFT), Mel spectrograms, and Wav2vec, as well as three different classifiers: pre-trained ResNet18, LightCNN, and Audio Spectrogram Transformer. Our key contributions include the benchmarking of different audio feature extractors and neural-network-based classifiers, and the implementation of a complete pipeline using STFT and a fine-tuned ResNet18 network. The proposed method achieved Harmonic Scores of 0.89, 0.80, 0.71, and 0.36 for tasks 1-1, 1-2, 2-1, and 2-2, respectively, on the testing sets in the IEEE BioCAS 2022 Grand Challenge on Respiratory Sound Classification.
Submitted 29 August, 2022;
originally announced August 2022.
-
Spectrum of non-Hermitian deep-Hebbian neural networks
Authors:
Zijian Jiang,
Ziming Chen,
Tianqi Hou,
Haiping Huang
Abstract:
Neural networks with recurrent asymmetric couplings are important for understanding how episodic memories are encoded in the brain. Here, we integrate the experimental observation of a wide synaptic integration window into our model of sequence retrieval in continuous-time dynamics. The model with non-normal neuron interactions is studied theoretically by deriving a random matrix theory of the Jacobian matrix of the neural dynamics. The spectrum bears several distinct features, such as the breaking of rotational symmetry about the origin and the emergence of nested voids within the spectrum boundary. The spectral density is thus highly non-uniformly distributed in the complex plane. The random matrix theory also predicts a transition to chaos. In particular, the edge of chaos provides computational benefits for the sequential retrieval of memories. Our work provides a systematic study of time-lagged correlations with arbitrary time delays, and thus can inspire future studies of a broad class of memory models, and even big data analysis of biological time series.
Submitted 16 January, 2023; v1 submitted 24 August, 2022;
originally announced August 2022.
-
Mechanics of Morphogenesis in Neural Development: in vivo, in vitro, and in silico
Authors:
Joseph Sutlive,
Hamed Seyyedhosseinzadeh,
Zheng Ao,
Haning Xiu,
Kun Gou,
Feng Guo,
Zi Chen
Abstract:
Morphogenesis in the central nervous system has received intensive attention, as elucidating its fundamental mechanisms will shed light on the physiology and pathophysiology of the developing central nervous system. Morphogenesis of the central nervous system is a vast topic that includes important morphogenetic events such as neurulation and cortical folding. Here we review three types of methods used to improve our understanding of morphogenesis of the central nervous system: in vivo experiments, organoids (in vitro), and computational models (in silico). In vivo experiments are used to explore cellular- and tissue-level mechanics and to interpret their roles in neurulation morphogenesis. Recent advances in human brain organoids have provided new opportunities to study morphogenesis and neurogenesis and to compensate for the limitations of in vivo experiments, as organoid models are able to recapitulate some critical neural morphogenetic processes during early human brain development. Due to the complexity and costs of in vivo and in vitro studies, a variety of computational models have been developed and used to explain the formation and morphogenesis of brain structures. We review and discuss the pros and cons of these methods and their usage in studies of morphogenesis of the central nervous system. Notably, none of these methods alone is sufficient to unveil the biophysical mechanisms of morphogenesis, which calls for interdisciplinary approaches that combine these methods in order to test hypotheses and generate new insights into both normal and abnormal development of the central nervous system.
Submitted 21 July, 2022;
originally announced July 2022.
-
PSP: Million-level Protein Sequence Dataset for Protein Structure Prediction
Authors:
Sirui Liu,
Jun Zhang,
Haotian Chu,
Min Wang,
Boxin Xue,
Ningxi Ni,
Jialiang Yu,
Yuhao Xie,
Zhenyu Chen,
Mengyun Chen,
Yuan Liu,
Piya Patra,
Fan Xu,
Jie Chen,
Zidong Wang,
Lijiang Yang,
Fan Yu,
Lei Chen,
Yi Qin Gao
Abstract:
Proteins are essential components of human life, and their structures are important for function and mechanism analysis. Recent work has shown the potential of AI-driven methods for protein structure prediction. However, the development of new models is restricted by the lack of datasets and benchmark training procedures. To the best of our knowledge, the existing open source datasets fall far short of the needs of modern protein sequence-structure related research. To solve this problem, we present the first million-level protein structure prediction dataset with high coverage and diversity, named PSP. This dataset consists of 570k true structure sequences (10TB) and 745k complementary distillation sequences (15TB). In addition, we provide the benchmark training procedure for a SOTA protein structure prediction model on this dataset. We validate the utility of this dataset for training by participating in the CAMEO contest, in which our model won first place. We hope our PSP dataset, together with the training benchmark, can enable a broader community of AI/biology researchers to pursue AI-driven protein-related research.
Submitted 24 June, 2022;
originally announced June 2022.
-
Automated Cancer Subtyping via Vector Quantization Mutual Information Maximization
Authors:
Zheng Chen,
Lingwei Zhu,
Ziwei Yang,
Takashi Matsubara
Abstract:
Cancer subtyping is crucial for understanding the nature of tumors and providing suitable therapy. However, existing labelling methods are medically controversial, and have driven the process of subtyping away from teaching signals. Moreover, cancer genetic expression profiles are high-dimensional, scarce, and have complicated dependence, thereby posing a serious challenge to existing subtyping models for outputting sensible clustering. In this study, we propose a novel clustering method for exploiting genetic expression profiles and distinguishing subtypes in an unsupervised manner. The proposed method adaptively learns categorical correspondence from latent representations of expression profiles to the subtypes output by the model. By maximizing the problem-agnostic mutual information between input expression profiles and output subtypes, our method can automatically decide a suitable number of subtypes. Through experiments, we demonstrate that our proposed method can refine existing controversial labels and, by further medical analysis, that this refinement has a high correlation with cancer survival rates.
Submitted 14 November, 2022; v1 submitted 21 June, 2022;
originally announced June 2022.
-
$\mathsf{G^2Retro}$ as a Two-Step Graph Generative Models for Retrosynthesis Prediction
Authors:
Ziqi Chen,
Oluwatosin R. Ayinde,
James R. Fuchs,
Huan Sun,
Xia Ning
Abstract:
Retrosynthesis is a procedure where a target molecule is transformed into potential reactants and thus the synthesis routes can be identified. Recently, computational approaches have been developed to accelerate the design of synthesis routes. In this paper, we develop a generative framework $\mathsf{G^2Retro}$ for one-step retrosynthesis prediction. $\mathsf{G^2Retro}$ imitates the reversed logic of synthetic reactions. It first predicts the reaction centers in the target molecules (products), identifies the synthons needed to assemble the products, and transforms these synthons into reactants. $\mathsf{G^2Retro}$ defines a comprehensive set of reaction center types, and learns from the molecular graphs of the products to predict potential reaction centers. To complete synthons into reactants, $\mathsf{G^2Retro}$ considers all the involved synthon structures and the product structures to identify the optimal completion paths, and accordingly attaches small substructures sequentially to the synthons. Here we show that $\mathsf{G^2Retro}$ is able to better predict the reactants for given products in the benchmark dataset than the state-of-the-art methods.
Submitted 5 June, 2023; v1 submitted 10 June, 2022;
originally announced June 2022.
-
Masked Image Modeling Advances 3D Medical Image Analysis
Authors:
Zekai Chen,
Devansh Agarwal,
Kshitij Aggarwal,
Wiem Safta,
Samit Hirawat,
Venkat Sethuraman,
Mariann Micsinai Balan,
Kevin Brown
Abstract:
Recently, masked image modeling (MIM) has gained considerable attention due to its capacity to learn from vast amounts of unlabeled data and has been demonstrated to be effective on a wide variety of vision tasks involving natural images. Meanwhile, the potential of self-supervised learning in modeling 3D medical images is anticipated to be immense due to the high quantities of unlabeled images, and the expense and difficulty of obtaining quality labels. However, MIM's applicability to medical images remains uncertain. In this paper, we demonstrate that masked image modeling approaches can also advance 3D medical image analysis in addition to natural images. We study how masked image modeling strategies improve performance from the viewpoint of 3D medical image segmentation as a representative downstream task: i) compared to naive contrastive learning, masked image modeling approaches accelerate the convergence of supervised training even further (1.40$\times$) and ultimately produce a higher Dice score; ii) predicting raw voxel values with a high masking ratio and a relatively small patch size is a non-trivial self-supervised pretext task for medical image modeling; iii) a lightweight decoder or projection head design for reconstruction is powerful for masked image modeling on 3D medical images, speeding up training and reducing cost; iv) finally, we also investigate the effectiveness of MIM methods under different practical scenarios in which different image resolutions and labeled data ratios are applied.
Submitted 23 August, 2022; v1 submitted 25 April, 2022;
originally announced April 2022.
-
Multi-Tier Platform for Cognizing Massive Electroencephalogram
Authors:
Zheng Chen,
Lingwei Zhu,
Ziwei Yang,
Renyuan Zhang
Abstract:
An end-to-end platform assembling multiple tiers is built for precisely cognizing brain activities. Being fed massive electroencephalogram (EEG) data, the time-frequency spectrograms are conventionally projected into the episode-wise feature matrices (seen as tier-1). A spiking neural network (SNN) based tier is designed to distill the principal information in terms of spike-streams from the rare features, which maintains the temporal implication in the nature of EEGs. The proposed tier-3 transposes the time and space domains of spike patterns from the SNN and feeds the transposed pattern-matrices into an artificial neural network (ANN, specifically a Transformer) known as tier-4, where a special spanning topology is proposed to match the two-dimensional input form. In this manner, cognition such as classification is conducted with high accuracy. As a proof of concept, the sleep stage scoring problem is demonstrated by introducing multiple EEG datasets, the largest comprising 42,560 hours recorded from 5,793 subjects. In our experiments, the platform achieves an overall accuracy of 87% using EEG alone, which is 2% higher than the state of the art. Moreover, our multi-tier methodology offers visible and graphical interpretations of the temporal characteristics of EEG by identifying the critical episodes, which is in demand in neurodynamics but hardly appears in conventional cognition scenarios.
Submitted 20 April, 2022;
originally announced April 2022.
-
Embedding of Functional Human Brain Networks on a Sphere
Authors:
Moo K. Chung,
Zijian Chen
Abstract:
Human brain activity is often measured using the blood-oxygen-level dependent (BOLD) signals obtained through functional magnetic resonance imaging (fMRI). The strength of connectivity between brain regions is then measured as a Pearson correlation matrix. As the number of brain regions increases, the dimension of the matrix increases. It becomes extremely cumbersome even to visualize and quantify such weighted complete networks. To remedy the problem, we propose to embed brain networks onto a sphere, which is a Riemannian manifold with constant positive curvature. The Matlab code for the spherical embedding is available at https://github.com/laplcebeltrami/sphericalMDS.
Submitted 19 May, 2022; v1 submitted 7 April, 2022;
originally announced April 2022.
-
Cancer Subtyping via Embedded Unsupervised Learning on Transcriptomics Data
Authors:
Ziwei Yang,
Lingwei Zhu,
Zheng Chen,
Ming Huang,
Naoaki Ono,
MD Altaf-Ul-Amin,
Shigehiko Kanaya
Abstract:
Cancer is one of the deadliest diseases worldwide. Accurate diagnosis and classification of cancer subtypes are indispensable for effective clinical treatment. Promising results on automatic cancer subtyping systems have been published recently with the emergence of various deep learning methods. However, such automatic systems often overfit the data due to the high dimensionality and scarcity. In this paper, we propose to investigate automatic subtyping from an unsupervised learning perspective by directly constructing the underlying data distribution itself, hence sufficient data can be generated to alleviate the issue of overfitting. Specifically, we bypass the strong Gaussianity assumption that typically exists but fails in the unsupervised learning subtyping literature due to small-sized samples by vector quantization. Our proposed method better captures the latent space features and models the cancer subtype manifestation on a molecular basis, as demonstrated by the extensive experimental results.
Submitted 2 April, 2022;
originally announced April 2022.
-
Modern Views of Machine Learning for Precision Psychiatry
Authors:
Zhe Sage Chen,
Prathamesh Kulkarni,
Isaac R. Galatzer-Levy,
Benedetta Bigio,
Carla Nasca,
Yu Zhang
Abstract:
In light of the NIMH's Research Domain Criteria (RDoC), the advent of functional neuroimaging, novel technologies and methods provide new opportunities to develop precise and personalized prognosis and diagnosis of mental disorders. Machine learning (ML) and artificial intelligence (AI) technologies are playing an increasingly critical role in the new era of precision psychiatry. Combining ML/AI with neuromodulation technologies can potentially provide explainable solutions in clinical practice and effective therapeutic treatment. Advanced wearable and mobile technologies also call for the new role of ML/AI for digital phenotyping in mobile mental health. In this review, we provide a comprehensive review of the ML methodologies and applications by combining neuroimaging, neuromodulation, and advanced mobile technologies in psychiatry practice. Additionally, we review the role of ML in molecular phenotyping and cross-species biomarker identification in precision psychiatry. We further discuss explainable AI (XAI) and causality testing in a closed-human-in-the-loop manner, and highlight the ML potential in multimedia information extraction and multimodal data fusion. Finally, we discuss conceptual and practical challenges in precision psychiatry and highlight ML opportunities in future research.
Submitted 11 July, 2022; v1 submitted 4 April, 2022;
originally announced April 2022.
-
Optimize Deep Learning Models for Prediction of Gene Mutations Using Unsupervised Clustering
Authors:
Zihan Chen,
Xingyu Li,
Miaomiao Yang,
Hong Zhang,
Xu Steven Xu
Abstract:
Deep learning has become the mainstream methodological choice for analyzing and interpreting whole-slide digital pathology images (WSIs). It is commonly assumed that tumor regions carry most predictive information. In this paper, we propose an unsupervised clustering-based multiple-instance learning method and apply it to develop deep-learning models for prediction of gene mutations using WSIs from three cancer types in The Cancer Genome Atlas (TCGA) studies (CRC, LUAD, and HNSCC). We show that unsupervised clustering of image patches can help identify predictive patches, exclude patches lacking predictive information, and therefore improve prediction of gene mutations in all three cancer types, compared with the WSI-based method without selection of image patches and with models based only on tumor regions. Additionally, our proposed algorithm outperformed two recently published baseline algorithms that leverage unsupervised clustering to assist model prediction. The unsupervised-clustering-based approach for mutation prediction allows identification of the spatial regions related to mutation of a specific gene via the resolved probability scores, highlighting the heterogeneity of a predicted genotype in the tumor microenvironment. Finally, our study also demonstrates that selection of tumor regions of WSIs is not always the best way to identify patches for prediction of gene mutations, and that other tissue types in the tumor microenvironment may provide better prediction ability for gene mutations than tumor tissues.
Submitted 24 April, 2022; v1 submitted 31 March, 2022;
originally announced April 2022.
-
Knowledge-informed Molecular Learning: A Survey on Paradigm Transfer
Authors:
Yin Fang,
Zhuo Chen,
Xiaohui Fan,
Ningyu Zhang
Abstract:
Machine learning, notably deep learning, has significantly propelled molecular investigations within the biochemical sphere. Traditionally, modeling for such research has centered around a handful of paradigms. For instance, the prediction paradigm is frequently deployed for tasks such as molecular property prediction. To enhance the generation and decipherability of purely data-driven models, scholars have integrated biochemical domain knowledge into these molecular study models. This integration has sparked a surge in paradigm transfer, which is solving one molecular learning task by reformulating it as another one. With the emergence of Large Language Models, these paradigms have demonstrated an escalating trend towards harmonized unification. In this work, we delineate a literature survey focused on knowledge-informed molecular learning from the perspective of paradigm transfer. We classify the paradigms, scrutinize their methodologies, and dissect the contribution of domain knowledge. Moreover, we encapsulate prevailing trends and identify intriguing avenues for future exploration in molecular learning.
Submitted 5 September, 2023; v1 submitted 17 February, 2022;
originally announced February 2022.
-
AGMI: Attention-Guided Multi-omics Integration for Drug Response Prediction with Graph Neural Networks
Authors:
Ruiwei Feng,
Yufeng Xie,
Minshan Lai,
Danny Z. Chen,
Ji Cao,
Jian Wu
Abstract:
Accurate drug response prediction (DRP) is a crucial yet challenging task in precision medicine. This paper presents a novel Attention-Guided Multi-omics Integration (AGMI) approach for DRP, which first constructs a Multi-edge Graph (MeG) for each cell line, and then aggregates multi-omics features to predict drug response using a novel structure, called Graph edge-aware Network (GeNet). For the first time, our AGMI approach explores gene constraint based multi-omics integration for DRP with the whole-genome using GNNs. Empirical experiments on the CCLE and GDSC datasets show that our AGMI largely outperforms state-of-the-art DRP methods by 8.3%--34.2% on four metrics. Our data and code are available at https://github.com/yivan-WYYGDSG/AGMI.
Submitted 9 January, 2022; v1 submitted 15 December, 2021;
originally announced December 2021.
-
HelixMO: Sample-Efficient Molecular Optimization in Scene-Sensitive Latent Space
Authors:
Zhiyuan Chen,
Xiaomin Fang,
Zixu Hua,
Yueyang Huang,
Fan Wang,
Hua Wu
Abstract:
Efficient exploration of the chemical space to search for candidate drugs that satisfy various constraints is a fundamental task of drug discovery. Advanced deep generative methods attempt to optimize the molecules in the compact latent space instead of the discrete original space, but the mapping between the original and latent spaces is always kept unchanged during the entire optimization process. The unchanged mapping makes it challenging for those methods to adapt quickly to various optimization scenes and leads to a great demand for assessed molecules (samples) to provide optimization direction, which is a considerable expense for drug discovery. To this end, we design a sample-efficient molecular generative method, HelixMO, which explores the scene-sensitive latent space to promote sample efficiency. The scene-sensitive latent space focuses more on modeling the promising molecules by dynamically adjusting the space mapping, leveraging the correlations between the general and scene-specific characteristics during the optimization process. Extensive experiments demonstrate that HelixMO can achieve competitive performance with only a few assessed samples on four molecular optimization scenes. Ablation studies verify the positive impact of the scene-specific latent space, which is capable of identifying the critical characteristics of the promising molecules. We also deployed HelixMO on the website PaddleHelix (https://paddlehelix.baidu.com/app/drug/drugdesign/forecast) to provide a drug design service.
Submitted 16 November, 2022; v1 submitted 30 November, 2021;
originally announced December 2021.
-
Molecular Contrastive Learning with Chemical Element Knowledge Graph
Authors:
Yin Fang,
Qiang Zhang,
Haihong Yang,
Xiang Zhuang,
Shumin Deng,
Wen Zhang,
Ming Qin,
Zhuo Chen,
Xiaohui Fan,
Huajun Chen
Abstract:
Molecular representation learning contributes to multiple downstream tasks such as molecular property prediction and drug design. To properly represent molecules, graph contrastive learning is a promising paradigm because it uses self-supervision signals and requires no human annotations. However, prior work fails to incorporate fundamental domain knowledge into graph semantics and thus ignores the correlations between atoms that have common attributes but are not directly connected by bonds. To address these issues, we construct a Chemical Element Knowledge Graph (KG) to summarize microscopic associations between elements and propose a novel Knowledge-enhanced Contrastive Learning (KCL) framework for molecular representation learning. The KCL framework consists of three modules. The first module, knowledge-guided graph augmentation, augments the original molecular graph based on the Chemical Element KG. The second module, knowledge-aware graph representation, extracts molecular representations with a common graph encoder for the original molecular graph and a Knowledge-aware Message Passing Neural Network (KMPNN) that encodes complex information in the augmented molecular graph. The final module is a contrastive objective, where we maximize agreement between these two views of molecular graphs. Extensive experiments demonstrate that KCL achieves superior performance over state-of-the-art baselines on eight molecular datasets. Visualization experiments interpret what KCL has learned from atoms and attributes in the augmented molecular graphs. Our code and data are available at https://github.com/ZJU-Fangyin/KCL.
Submitted 10 March, 2022; v1 submitted 1 December, 2021;
originally announced December 2021.
-
Effect of He Self-organized pattern plasma-activated media with different conductivity on cancer cells
Authors:
Zhitong Chen
Abstract:
The self-organized pattern (SOP) phenomenon is prevalent in plasma, but how SOP discharge affects the reactive species generated in plasma-activated media (PAM) for cancer therapy is poorly documented. This study focuses on the effect of SOP discharge modes on reactive oxygen and nitrogen species (ROS, RNS) in He SOP plasma-activated media with different conductivity (saline solution and deionized (DI) water), and applies these media to breast cancer MDA-MB-231 and pancreatic cancer BxPC-3 cells. Optical emission spectroscopy and fluorimetric analysis were used to identify and quantify the ROS and RNS generated in He SOP plasma-activated saline solution and DI water. Furthermore, He SOP plasma discharge modes are capable of efficiently controlling the ROS and RNS concentrations in the plasma-activated saline solution and DI water, which contribute to the cytotoxic effect. In addition, stainless steel and copper were each used as the lower electrode to compare their effects on cell viability. Taken together, our findings provide insight into potential mechanisms involved in cell death after treatment with He SOP plasma-activated media.
Submitted 2 November, 2021;
originally announced November 2021.