-
EEG-Deformer: A Dense Convolutional Transformer for Brain-computer Interfaces
Authors:
Yi Ding,
Yong Li,
Hao Sun,
Rui Liu,
Chengxuan Tong,
Cuntai Guan
Abstract:
Effectively learning the temporal dynamics in electroencephalogram (EEG) signals is challenging yet essential for decoding brain activities using brain-computer interfaces (BCIs). Although Transformers are popular for their long-term sequential learning ability in the BCI field, most methods combining Transformers with convolutional neural networks (CNNs) fail to capture the coarse-to-fine temporal dynamics of EEG signals. To overcome this limitation, we introduce EEG-Deformer, which incorporates two main novel components into a CNN-Transformer: (1) a Hierarchical Coarse-to-Fine Transformer (HCT) block that integrates a Fine-grained Temporal Learning (FTL) branch into Transformers, effectively discerning coarse-to-fine temporal patterns; and (2) a Dense Information Purification (DIP) module, which utilizes multi-level, purified temporal information to enhance decoding accuracy. Comprehensive experiments on three representative cognitive tasks consistently verify the generalizability of our proposed EEG-Deformer, demonstrating that it either outperforms existing state-of-the-art methods or is comparable to them. Visualization results show that EEG-Deformer learns from neurophysiologically meaningful brain regions for the corresponding cognitive tasks. The source code can be found at https://github.com/yi-ding-cs/EEG-Deformer.
Submitted 25 April, 2024;
originally announced May 2024.
-
Domain Adaptive and Fine-grained Anomaly Detection for Single-cell Sequencing Data and Beyond
Authors:
Kaichen Xu,
Yueyang Ding,
Suyang Hou,
Weiqiang Zhan,
Nisang Chen,
Jun Wang,
Xiaobo Sun
Abstract:
Fine-grained anomalous cell detection from affected tissues is critical for clinical diagnosis and pathological research. Single-cell sequencing data provide unprecedented opportunities for this task. However, current anomaly detection methods struggle to handle domain shifts prevalent in multi-sample and multi-domain single-cell sequencing data, leading to suboptimal performance. Moreover, these methods fall short of distinguishing anomalous cells into pathologically distinct subtypes. In response, we propose ACSleuth, a novel, reconstruction deviation-guided generative framework that integrates the detection, domain adaptation, and fine-grained annotation of anomalous cells into a methodologically cohesive workflow. Notably, we present the first theoretical analysis of using reconstruction deviations output by generative models for anomaly detection in lieu of domain shifts. This analysis informs our development of a novel and superior maximum mean discrepancy-based anomaly scorer in ACSleuth. Extensive benchmarks over various single-cell data and other types of tabular data demonstrate ACSleuth's superiority over the state-of-the-art methods in identifying and subtyping anomalies in multi-sample and multi-domain contexts. Our code is available at https://github.com/Catchxu/ACsleuth.
Submitted 29 April, 2024; v1 submitted 26 April, 2024;
originally announced April 2024.
-
Perceptual learning in contour detection transfer across changes in contour path and orientation
Authors:
Yue Ding,
Hongqiao Shi,
Shuang Song,
Yonghui Wang,
Ya Li
Abstract:
The integration of local elements into shape contours is critical for target detection and identification in cluttered scenes. Previous studies have shown that observers can learn to use image regularities for contour integration and target identification. However, we still know little about the generalization of perceptual learning in contour integration. Specifically, it is still unclear whether training on a contour detection task transfers to an untrained contour type, path, or orientation. In a series of four experiments, human perceptual learning in contour detection was studied using psychophysical methods. We trained participants to detect contours in cluttered scenes over several days, which resulted in a significant improvement in sensitivity to the trained contour type. This improved sensitivity was highly specific to contour type, but transferred across changes in contour path and contour orientation. These results suggest that short-term training improves the ability to integrate specific types of contours by optimizing the ability of the visual system to extract specific image regularities. The differential specificity and generalization across different stimulus features may support the involvement of both low-level and higher-level visual areas in perceptual learning in contour detection. These findings provide further insights into the nature and the brain plasticity mechanism of contour integration learning.
Submitted 18 March, 2024;
originally announced March 2024.
-
AdaMR: Adaptable Molecular Representation for Unified Pre-training Strategy
Authors:
Yan Ding,
Hao Cheng,
Ziliang Ye,
Ruyi Feng,
Wei Tian,
Peng Xie,
Juan Zhang,
Zhongze Gu
Abstract:
We propose Adjustable Molecular Representation (AdaMR), a novel large-scale unified pre-training strategy for small-molecule drugs. AdaMR utilizes a granularity-adjustable molecular encoding strategy, which is accomplished through a pre-training task termed molecular canonicalization, setting it apart from recent large-scale molecular models. This adaptability in granularity enriches the model's learning capability at multiple levels and improves its performance in multi-task scenarios. Specifically, the substructure-level molecular representation preserves information about specific atom groups or arrangements, which influence chemical properties and functionalities. This proves advantageous for tasks such as property prediction. Simultaneously, the atomic-level representation, combined with generative molecular canonicalization pre-training tasks, enhances validity, novelty, and uniqueness in generative tasks. All of these features work together to give AdaMR outstanding performance on a range of downstream tasks. We fine-tuned our proposed pre-trained model on six molecular property prediction tasks (MoleculeNet datasets) and two generative tasks (ZINC250K datasets), achieving state-of-the-art (SOTA) results on five out of eight tasks.
Submitted 27 April, 2024; v1 submitted 28 December, 2023;
originally announced January 2024.
-
Morphological Profiling for Drug Discovery in the Era of Deep Learning
Authors:
Qiaosi Tang,
Ranjala Ratnayake,
Gustavo Seabra,
Zhe Jiang,
Ruogu Fang,
Lina Cui,
Yousong Ding,
Tamer Kahveci,
Jiang Bian,
Chenglong Li,
Hendrik Luesch,
Yanjun Li
Abstract:
Morphological profiling is a valuable tool in phenotypic drug discovery. The advent of high-throughput automated imaging has enabled the capture of a wide range of morphological features of cells or organisms in response to perturbations at single-cell resolution. Concurrently, significant advances in machine learning and deep learning, especially in computer vision, have led to substantial improvements in analyzing large-scale high-content images at high throughput. These efforts have facilitated understanding of compound mechanism-of-action (MOA), drug repurposing, and characterization of cell morphodynamics under perturbation, ultimately contributing to the development of novel therapeutics. In this review, we provide a comprehensive overview of the recent advances in the field of morphological profiling. We summarize the image profiling analysis workflow, survey a broad spectrum of analysis strategies encompassing feature engineering- and deep learning-based approaches, and introduce publicly available benchmark datasets. We place a particular emphasis on the application of deep learning in this pipeline, covering cell segmentation, image representation learning, and multimodal learning. Additionally, we illuminate the application of morphological profiling in phenotypic drug discovery and highlight potential challenges and opportunities in this field.
Submitted 15 January, 2024; v1 submitted 13 December, 2023;
originally announced December 2023.
-
The QUATRO Application Suite: Quantum Computing for Models of Human Cognition
Authors:
Raghavendra Pradyumna Pothukuchi,
Leon Lufkin,
Yu Jun Shen,
Alejandro Simon,
Rome Thorstenson,
Bernardo Eilert Trevisan,
Michael Tu,
Mudi Yang,
Ben Foxman,
Viswanatha Srinivas Pothukuchi,
Gunnar Epping,
Thi Ha Kyaw,
Bryant J Jongkees,
Yongshan Ding,
Jerome R Busemeyer,
Jonathan D Cohen,
Abhishek Bhattacharjee
Abstract:
Research progress in quantum computing has, thus far, focused on a narrow set of application domains. Expanding the suite of quantum application domains is vital for the discovery of new software toolchains and architectural abstractions. In this work, we unlock a new class of applications ripe for quantum computing research -- computational cognitive modeling. Cognitive models are critical to understanding and replicating human intelligence. Our work connects computational cognitive models to quantum computer architectures for the first time. We release QUATRO, a collection of quantum computing applications from cognitive models. The development and execution of QUATRO shed light on gaps in the quantum computing stack that need to be closed to ease programming and drive performance. Among several contributions, we propose and study ideas pertaining to quantum cloud scheduling (using data from gate- and annealing-based quantum computers), parallelization, and more. We expect our research to lay the groundwork for more versatile quantum computer systems in the future.
Submitted 8 December, 2023; v1 submitted 1 September, 2023;
originally announced September 2023.
-
SBSM-Pro: Support Bio-sequence Machine for Proteins
Authors:
Yizheng Wang,
Yixiao Zhai,
Yijie Ding,
Quan Zou
Abstract:
Proteins play a pivotal role in biological systems. The use of machine learning algorithms for protein classification can assist and even guide biological experiments, offering crucial insights for biotechnological applications. We introduce the Support Bio-Sequence Machine for Proteins (SBSM-Pro), a model purpose-built for the classification of biological sequences. This model starts with raw sequences and groups amino acids based on their physicochemical properties. It incorporates sequence alignment to measure the similarities between proteins and uses a novel multiple kernel learning (MKL) approach to integrate various types of information, utilizing support vector machines for classification prediction. The results indicate that our model demonstrates commendable performance across ten datasets in terms of the identification of protein function and posttranslational modification. This research not only exemplifies state-of-the-art work in protein classification but also paves avenues for new directions in this domain, representing a beneficial endeavor in the development of platforms tailored for the classification of biological sequences. SBSM-Pro is available for access at http://lab.malab.cn/soft/SBSM-Pro/.
Submitted 4 November, 2023; v1 submitted 20 August, 2023;
originally announced August 2023.
-
Semi-supervised Cooperative Learning for Multiomics Data Fusion
Authors:
Daisy Yi Ding,
Xiaotao Shen,
Michael Snyder,
Robert Tibshirani
Abstract:
Multiomics data fusion integrates diverse data modalities, ranging from transcriptomics to proteomics, to gain a comprehensive understanding of biological systems and enhance predictions on outcomes of interest related to disease phenotypes and treatment responses. Cooperative learning, a recently proposed method, unifies the commonly used fusion approaches, including early and late fusion, and offers a systematic framework for leveraging the shared underlying relationships across omics to strengthen signals. However, the challenge of acquiring large-scale labeled data remains, and there are cases where multiomics data are available but annotated labels are absent. To harness the potential of unlabeled multiomics data, we introduce semi-supervised cooperative learning. By utilizing an "agreement penalty", our method incorporates the additional unlabeled data in the learning process and achieves consistently superior predictive performance on simulated data and a real multiomics study of aging. It offers an effective solution to multiomics data fusion in settings with both labeled and unlabeled data and maximizes the utility of available data resources, with the potential of significantly improving predictive models for diagnostics and therapeutics in an increasingly multiomics world.
Submitted 2 August, 2023;
originally announced August 2023.
-
Machine Learning-guided Lipid Nanoparticle Design for mRNA Delivery
Authors:
Daisy Yi Ding,
Yuhui Zhang,
Yuan Jia,
Jiuzhi Sun
Abstract:
While RNA technologies hold immense therapeutic potential in a range of applications from vaccination to gene editing, the broad implementation of these technologies is hindered by the challenge of delivering these agents effectively. Lipid nanoparticles (LNPs) have emerged as one of the most widely used delivery agents, but their design optimization relies on laborious and costly experimental methods. We propose to optimize LNP design in silico with machine learning (ML) models. On a curated dataset of 622 LNPs from published studies, we demonstrate the effectiveness of our model in predicting the transfection efficiency of unseen LNPs, with the multilayer perceptron achieving a classification accuracy of 98% on the test set. Our work represents a pioneering effort in combining ML and LNP design, offering significant potential for improving screening efficiency by computationally prioritizing LNP candidates for experimental validation and accelerating the development of effective mRNA delivery systems.
Submitted 28 August, 2023; v1 submitted 2 August, 2023;
originally announced August 2023.
-
Inactivated COVID-19 Vaccination did not affect In vitro fertilization (IVF) / Intra-Cytoplasmic Sperm Injection (ICSI) cycle outcomes
Authors:
Qi Wan,
Ying Ling Yao,
XingYu Lv,
Li Hong Geng,
Yue Wang,
Enoch Appiah Adu-Gyamfi,
Xue Jiao Wang,
Yue Qian,
Juan Yang,
Ming Xing Chend,
Zhao Hui Zhong,
Yuan Li,
Yu Bin Ding
Abstract:
Background: The objective of this study is to evaluate the impact of COVID-19 inactivated vaccine administration on the outcomes of in vitro fertilization (IVF) and intracytoplasmic sperm injection (ICSI) cycles in infertile couples in China. Methods: We collected data from the CYART prospective cohort, which included couples undergoing IVF treatment from January 2021 to September 2022 at Sichuan Jinxin Xinan Women & Children's Hospital. Based on whether they received vaccination before ovarian stimulation, the couples were divided into the vaccination group and the non-vaccination group. We compared the laboratory parameters and pregnancy outcomes between the two groups. Findings: After performing propensity score matching (PSM), the analysis demonstrated similar clinical pregnancy rates, biochemical pregnancy and ongoing pregnancy rates between vaccinated and unvaccinated women. No significant disparities were found in terms of embryo development and laboratory parameters among the groups. Moreover, male vaccination had no impact on patient performance or pregnancy outcomes in assisted reproductive technology treatments. Additionally, there were no significant differences observed in the effects of vaccination on embryo development and pregnancy outcomes among couples undergoing ART. Interpretation: The findings suggest that COVID-19 vaccination did not have a significant effect on patients undergoing IVF/ICSI with fresh embryo transfer. Therefore, it is recommended that couples should receive COVID-19 vaccination as scheduled to help mitigate the COVID-19 pandemic.
Submitted 13 June, 2023;
originally announced June 2023.
-
Generalizability of PRS313 for breast cancer risk amongst non-Europeans in a Los Angeles biobank
Authors:
Helen Shang,
Yi Ding,
Vidhya Venkateswaran,
Kristin Boulier,
Nikhita Kathuria-Prakash,
Parisa Boodaghi Malidarreh,
Jacob M. Luber,
Bogdan Pasaniuc
Abstract:
Polygenic risk scores (PRS) summarize the combined effect of common risk variants and are associated with breast cancer risk in patients without identifiable monogenic risk factors. One of the most well-validated PRSs in breast cancer to date is PRS313, which was developed from a Northern European biobank but has shown attenuated performance in non-European ancestries. We further investigate the generalizability of the PRS313 for American women of European (EA), African (AFR), Asian (EAA), and Latinx (HL) ancestry within one institution with a singular EHR system, genotyping platform, and quality control process. We found that the PRS313 achieved overlapping areas under the ROC curve (AUCs) in females of Latinx (AUC, 0.68; 95% CI, 0.65-0.71) and European ancestry (AUC, 0.70; 95% CI, 0.69-0.71) but lower AUCs for the AFR and EAA populations (AFR: AUC, 0.61; 95% CI, 0.56-0.65; EAA: AUC, 0.64; 95% CI, 0.60-0.68). While PRS313 is associated with Hormone Positive (HR+) disease in European Americans (OR, 1.42; 95% CI, 1.16-1.64), for Latinx females it may instead be associated with Human Epidermal Growth Factor Receptor 2 (HER2+) disease (OR, 2.52; 95% CI, 1.35-4.70), although, due to small numbers, additional studies are needed. In summary, we found that PRS313 was significantly associated with breast cancer but with attenuated accuracy in women of African and Asian descent within a singular health system in Los Angeles. Our work further highlights the need for additional validation in diverse cohorts prior to clinical implementation of polygenic risk scores.
Submitted 5 May, 2023;
originally announced May 2023.
-
CancerGPT: Few-shot Drug Pair Synergy Prediction using Large Pre-trained Language Models
Authors:
Tianhao Li,
Sandesh Shetty,
Advaith Kamath,
Ajay Jaiswal,
Xianqian Jiang,
Ying Ding,
Yejin Kim
Abstract:
Large pre-trained language models (LLMs) have been shown to have significant potential in few-shot learning across various fields, even with minimal training data. However, their ability to generalize to unseen tasks in more complex fields, such as biology, has yet to be fully evaluated. LLMs can offer a promising alternative approach for biological inference, particularly in cases where structured data and sample size are limited, by extracting prior knowledge from text corpora. Our proposed few-shot learning approach uses LLMs to predict the synergy of drug pairs in rare tissues that lack structured data and features. Our experiments, which involved seven rare tissues from different cancer types, demonstrated that the LLM-based prediction model achieved significant accuracy with very few or zero samples. Our proposed model, the CancerGPT (with $\sim$ 124M parameters), was even comparable to the larger fine-tuned GPT-3 model (with $\sim$ 175B parameters). Our research is the first to tackle drug pair synergy prediction in rare tissues with limited data. We are also the first to utilize an LLM-based prediction model for biological reaction prediction tasks.
Submitted 17 April, 2023;
originally announced April 2023.
-
Bridging the Gap between Chemical Reaction Pretraining and Conditional Molecule Generation with a Unified Model
Authors:
Bo Qiang,
Yiran Zhou,
Yuheng Ding,
Ningfeng Liu,
Song Song,
Liangren Zhang,
Bo Huang,
Zhenming Liu
Abstract:
Chemical reactions are the fundamental building blocks of drug design and organic chemistry research. In recent years, there has been a growing need for a large-scale deep-learning framework that can efficiently capture the basic rules of chemical reactions. In this paper, we propose a unified framework that addresses both the reaction representation learning and molecule generation tasks, which allows for a more holistic approach. Inspired by the organic chemistry mechanism, we develop a novel pretraining framework that enables us to incorporate inductive biases into the model. Our framework achieves state-of-the-art results on challenging downstream tasks. By possessing chemical knowledge, our generative framework overcomes the limitations of current molecule generation models that rely on a small number of reaction templates. In extensive experiments, our model generates synthesizable drug-like structures of high quality. Overall, our work presents a significant step toward a large-scale deep-learning framework for a variety of reaction-based applications.
Submitted 7 March, 2024; v1 submitted 13 March, 2023;
originally announced March 2023.
-
Cooperative learning for multiview analysis
Authors:
Daisy Yi Ding,
Shuangning Li,
Balasubramanian Narasimhan,
Robert Tibshirani
Abstract:
We propose a new method for supervised learning with multiple sets of features ("views"). The multiview problem is especially important in biology and medicine, where "-omics" data such as genomics, proteomics and radiomics are measured on a common set of samples. Cooperative learning combines the usual squared error loss of predictions with an "agreement" penalty to encourage the predictions from different data views to agree. By varying the weight of the agreement penalty, we get a continuum of solutions that include the well-known early and late fusion approaches. Cooperative learning chooses the degree of agreement (or fusion) in an adaptive manner, using a validation set or cross-validation to estimate test set prediction error. One version of our fitting procedure is modular, where one can choose different fitting mechanisms (e.g. lasso, random forests, boosting, neural networks) appropriate for different data views. In the setting of cooperative regularized linear regression, the method combines the lasso penalty with the agreement penalty, yielding feature sparsity. The method can be especially powerful when the different data views share some underlying relationship in their signals that can be exploited to boost the signals. We show that cooperative learning achieves higher predictive accuracy on simulated data and a real multiomics example of labor onset prediction. Leveraging aligned signals and allowing flexible fitting mechanisms for different modalities, cooperative learning offers a powerful approach to multiomics data fusion.
Submitted 3 September, 2022; v1 submitted 22 December, 2021;
originally announced December 2021.
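The combination of squared error loss and agreement penalty described in this abstract can be written out for the two-view case. The following is a sketch under assumed notation, not a formula quoted from the listing: $X$ and $Z$ denote the two data views, $f_X$ and $f_Z$ the view-specific prediction functions, $y$ the response, and $\rho \ge 0$ the weight of the agreement penalty.

```latex
\min_{f_X,\, f_Z}\;
\mathbb{E}\left[
  \tfrac{1}{2}\bigl(y - f_X(X) - f_Z(Z)\bigr)^2
  + \tfrac{\rho}{2}\bigl(f_X(X) - f_Z(Z)\bigr)^2
\right]
```

Under this notation, setting $\rho = 0$ drops the penalty and leaves an ordinary joint fit on both views (early fusion), while larger $\rho$ pushes the two view-specific predictions toward each other.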
-
Using Deep Learning Sequence Models to Identify SARS-CoV-2 Divergence
Authors:
Yanyi Ding,
Zhiyi Kuang,
Yuxin Pei,
Jeff Tan,
Ziyu Zhang,
Joseph Konan
Abstract:
SARS-CoV-2 is an upper respiratory system RNA virus that has caused over 3 million deaths and infected over 150 million people worldwide as of May 2021. With thousands of strains sequenced to date, SARS-CoV-2 mutations pose significant challenges to scientists in keeping pace with vaccine development and public health measures. Therefore, an efficient method of identifying the divergence of lab samples from patients would greatly aid the documentation of SARS-CoV-2 genomics. In this study, we propose a neural network model that leverages recurrent and convolutional units to directly take in amino acid sequences of spike proteins and classify corresponding clades. We also compare our model's performance with Bidirectional Encoder Representations from Transformers (BERT) pre-trained on a protein database. Our approach has the potential of providing a more computationally efficient alternative to current homology-based intra-species differentiation.
Submitted 12 November, 2021;
originally announced November 2021.
-
Surveillance Testing for Rapid Detection of Outbreaks in Facilities
Authors:
Yanyue Ding,
Sudesh K. Agrawal,
Jincheng Cao,
Lauren Meyers,
John J. Hasenbein
Abstract:
This paper develops an agent-based disease spread model on a contact network to guide surveillance testing in small to moderate facilities such as nursing homes and meat-packing plants. The model employs Monte Carlo simulations of viral spread sample paths in the contact network. The original motivation was to detect COVID-19 outbreaks quickly in such facilities, but the model can be applied to any communicable disease. In particular, the model provides guidance on how many tests to administer each day and on the importance of the testing order among staff or workers.
Submitted 30 September, 2021;
originally announced October 2021.
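The idea of simulating spread sample paths on a contact network and asking how many daily tests are needed can be illustrated with a toy sketch. Everything below is an assumption made for illustration and not the authors' model: the function names, the ring-shaped network, perfect tests, no recovery, and uniformly random daily testing are all hypothetical simplifications.

```python
import random


def detection_day(contacts, p_transmit, tests_per_day, seed, max_days=60):
    """Simulate one sample path; return the day the outbreak is first detected.

    contacts: dict mapping person id -> list of neighbour ids (hypothetical network).
    Assumes perfect tests, no recovery, and uniformly random daily testing.
    """
    rng = random.Random(seed)
    people = sorted(contacts)
    infected = {rng.choice(people)}  # a single random index case
    for day in range(1, max_days + 1):
        # Surveillance step: test a random subset of people.
        tested = rng.sample(people, min(tests_per_day, len(people)))
        if any(person in infected for person in tested):
            return day
        # Spread step: each infected person infects each susceptible
        # neighbour with probability p_transmit.
        newly_infected = {
            neighbour
            for person in sorted(infected)
            for neighbour in contacts[person]
            if neighbour not in infected and rng.random() < p_transmit
        }
        infected |= newly_infected
    return max_days  # not detected within the horizon (censored)


def mean_detection_day(contacts, p_transmit, tests_per_day, n_paths=1000):
    """Monte Carlo estimate of the expected detection day over n_paths sample paths."""
    days = [
        detection_day(contacts, p_transmit, tests_per_day, seed)
        for seed in range(n_paths)
    ]
    return sum(days) / n_paths


# A toy ring network of 20 people, each in contact with two neighbours.
ring = {i: [(i - 1) % 20, (i + 1) % 20] for i in range(20)}
for k in (1, 5, 20):
    print(k, mean_detection_day(ring, p_transmit=0.2, tests_per_day=k))
```

Comparing the estimated detection day across values of `tests_per_day` is the kind of question the abstract describes. In this sketch, testing all 20 people every day detects the index case on day 1 by construction; a testing order among staff could be explored by replacing the random sample with a fixed schedule.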
-
Relational graph convolutional networks for predicting blood-brain barrier penetration of drug molecules
Authors:
Yan Ding,
Xiaoqian Jiang,
Yejin Kim
Abstract:
Evaluating the blood-brain barrier (BBB) permeability of drug molecules is a critical step in brain drug development. Traditional methods for the evaluation require complicated in vitro or in vivo testing. Alternatively, in silico predictions based on machine learning have proved to be a cost-efficient way to complement the in vitro and in vivo methods. However, the performance of the established models has been limited by their incapability of dealing with the interactions between drugs and proteins, which play an important role in the mechanism behind the BBB penetrating behaviors. To address this limitation, we employed the relational graph convolutional network (RGCN) to handle the drug-protein interactions as well as the properties of each individual drug. The RGCN model achieved an overall accuracy of 0.872, an AUROC of 0.919 and an AUPRC of 0.838 for the testing dataset with the drug-protein interactions and the Mordred descriptors as the input. Introducing drug-drug similarity to connect structurally similar drugs in the data graph further improved the testing results, giving an overall accuracy of 0.876, an AUROC of 0.926 and an AUPRC of 0.865. In particular, the RGCN model was found to greatly outperform the LightGBM base model when evaluated with the drugs whose BBB penetration was dependent on drug-protein interactions. Our model is expected to provide high-confidence predictions of BBB permeability for drug prioritization in the experimental screening of BBB-penetrating drugs.
Submitted 6 April, 2022; v1 submitted 4 July, 2021;
originally announced July 2021.
-
Optimal Control of the SIR Model with Constrained Policy, with an Application to COVID-19
Authors:
Yujia Ding,
Henry Schellhorn
Abstract:
This article considers the optimal control of the SIR model with both transmission and treatment uncertainty. It follows the model presented in Gatto and Schellhorn (2021). We make four significant improvements on the latter paper. First, we prove the existence of a solution to the model. Second, our interpretation of the control is more realistic: while in Gatto and Schellhorn the control $α$ is the proportion of the population that takes a basic dose of treatment, so that $α>1$ occurs only if some patients take more than a basic dose, in our paper $α$ is constrained between zero and one and thus represents the proportion of the population undergoing treatment. Third, we provide a complete solution for the moderate infection regime (with constant treatment). Fourth, we give a thorough interpretation of the control in the moderate infection regime, while Gatto and Schellhorn focussed on the interpretation of the low infection regime. Finally, we compare the efficiency of our control in curbing the COVID-19 epidemic with that of other types of control.
Submitted 18 May, 2021;
originally announced May 2021.
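To make the role of a treatment proportion $α$ constrained to $[0, 1]$ concrete, the following is a minimal sketch only. It assumes a plain deterministic SIR model integrated with forward Euler, in which treatment adds `alpha * eta` to the removal rate of infected individuals. It is not the stochastic model or the optimal control derived in the paper; the function name `simulate_sir` and the parameters `beta`, `gamma` and `eta` are hypothetical.

```python
def simulate_sir(beta, gamma, eta, alpha, s0, i0, days, dt=0.1):
    """Forward-Euler SIR with a constant treatment proportion alpha in [0, 1].

    Illustrative assumption (not taken from the paper): treatment raises the
    removal rate of infected individuals from gamma to gamma + alpha * eta.
    Returns the final (s, i, r) fractions and the peak infected fraction.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie between zero and one")
    s, i, r = s0, i0, 1.0 - s0 - i0
    peak = i
    for _ in range(int(round(days / dt))):
        new_infections = beta * s * i * dt
        removals = (gamma + alpha * eta) * i * dt
        s -= new_infections
        i += new_infections - removals
        r += removals
        peak = max(peak, i)
    return s, i, r, peak


if __name__ == "__main__":
    for alpha in (0.0, 0.5, 1.0):
        *_, peak = simulate_sir(0.3, 0.1, 0.1, alpha, s0=0.99, i0=0.01, days=200)
        print(f"alpha={alpha:.1f}  peak infected fraction={peak:.3f}")
```

Under these assumed dynamics, a larger constant `alpha` lowers the effective reproduction number `beta / (gamma + alpha * eta)` and therefore the epidemic peak.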
-
LGGNet: Learning from Local-Global-Graph Representations for Brain-Computer Interface
Authors:
Yi Ding,
Neethu Robinson,
Chengxuan Tong,
Qiuhao Zeng,
Cuntai Guan
Abstract:
Neuropsychological studies suggest that co-operative activities among different brain functional areas drive high-level cognitive processes. To learn the brain activities within and among different functional areas of the brain, we propose LGGNet, a novel neurologically inspired graph neural network, to learn local-global-graph representations of electroencephalography (EEG) for Brain-Computer Interface (BCI). The input layer of LGGNet comprises a series of temporal convolutions with multi-scale 1D convolutional kernels and kernel-level attentive fusion. It captures temporal dynamics of EEG which then serves as input to the proposed local and global graph-filtering layers. Using a defined neurophysiologically meaningful set of local and global graphs, LGGNet models the complex relations within and among functional areas of the brain. Under the robust nested cross-validation settings, the proposed method is evaluated on three publicly available datasets for four types of cognitive classification tasks, namely, the attention, fatigue, emotion, and preference classification tasks. LGGNet is compared with state-of-the-art methods, such as DeepConvNet, EEGNet, R2G-STNN, TSception, RGNN, AMCNN-DGCN, HRNN and GraphNet. The results show that LGGNet outperforms these methods, and the improvements are statistically significant (p<0.05) in most cases. They also indicate that bringing neuroscience prior knowledge into neural network design improves classification performance. The source code can be found at https://github.com/yi-ding-cs/LGG
Submitted 5 December, 2022; v1 submitted 5 May, 2021;
originally announced May 2021.
-
Minimal invariant regions and minimal globally attracting regions for toric differential inclusions
Authors:
Yida Ding,
Abhishek Deshpande,
Gheorghe Craciun
Abstract:
Toric differential inclusions occur as key dynamical systems in the context of the Global Attractor Conjecture. We introduce the notions of minimal invariant regions and minimal globally attracting regions for toric differential inclusions. We describe a procedure for constructing explicitly the minimal invariant and minimal globally attracting regions for two-dimensional toric differential inclusions. In particular, we obtain invariant regions and globally attracting regions for two-dimensional weakly reversible or endotactic dynamical systems (even if they have time-dependent parameters).
Submitted 15 June, 2020;
originally announced June 2020.
-
Towards a terramechanics for bio-inspired locomotion in granular environments
Authors:
Chen Li,
Yang Ding,
Nick Gravish,
Ryan D. Maladen,
Andrew Masse,
Paul B. Umbanhowar,
Haldun Komsuoglu,
Daniel E. Koditschek,
Daniel I. Goldman
Abstract:
Granular media (GM) present locomotor challenges for terrestrial and extraterrestrial devices because they can flow and solidify in response to localized intrusion of wheels, limbs, and bodies. While the development of airplanes and submarines is aided by understanding of hydrodynamics, fundamental theory does not yet exist to describe the complex interactions of locomotors with GM. In this paper, we use experimental, computational, and theoretical approaches to develop a terramechanics for bio-inspired locomotion in granular environments. We use a fluidized bed to prepare GM with a desired global packing fraction, and use empirical force measurements and the Discrete Element Method (DEM) to elucidate interaction mechanics during locomotion-relevant intrusions in GM such as vertical penetration and horizontal drag. We develop a resistive force theory (RFT) to account for more complex intrusions. We use these force models to understand the locomotor performance of two bio-inspired robots moving on and within GM.
Submitted 31 October, 2019;
originally announced November 2019.
-
DIMM-SC: A Dirichlet mixture model for clustering droplet-based single cell transcriptomic data
Authors:
Zhe Sun,
Ting Wang,
Ke Deng,
Xiao-Feng Wang,
Robert Lafyatis,
Ying Ding,
Ming Hu,
Wei Chen
Abstract:
Motivation: Single cell transcriptome sequencing (scRNA-Seq) has become a revolutionary tool to study cellular and molecular processes at single cell resolution. Among existing technologies, the recently developed droplet-based platform enables efficient parallel processing of thousands of single cells with direct counting of transcript copies using Unique Molecular Identifier (UMI). Despite the technology advances, statistical methods and computational tools are still lacking for analyzing droplet-based scRNA-Seq data. Particularly, model-based approaches for clustering large-scale single cell transcriptomic data are still under-explored. Methods: We developed DIMM-SC, a Dirichlet Mixture Model for clustering droplet-based Single Cell transcriptomic data. This approach explicitly models UMI count data from scRNA-Seq experiments and characterizes variations across different cell clusters via a Dirichlet mixture prior. An expectation-maximization algorithm is used for parameter inference. Results: We performed comprehensive simulations to evaluate DIMM-SC and compared it with existing clustering methods such as K-means, CellTree and Seurat. In addition, we analyzed public scRNA-Seq datasets with known cluster labels and in-house scRNA-Seq datasets from a study of systemic sclerosis with prior biological knowledge to benchmark and validate DIMM-SC. Both simulation studies and real data applications demonstrated that overall, DIMM-SC achieves substantially improved clustering accuracy and much lower clustering variability compared to other existing clustering methods. More importantly, as a model-based approach, DIMM-SC is able to quantify the clustering uncertainty for each single cell, facilitating rigorous statistical inference and biological interpretations, which are typically unavailable from existing clustering methods.
Submitted 6 April, 2017;
originally announced April 2017.
-
Wavy membranes and the growth rate of a planar chemical garden: Enhanced diffusion and bioenergetics
Authors:
Yang Ding,
Bruno Batista,
Oliver Steinbock,
Julyan H. E. Cartwright,
Silvana S. S. Cardoso
Abstract:
In order to model ion transport across protocell membranes in Hadean hydrothermal vents, we consider both theoretically and experimentally the planar growth of a precipitate membrane formed at the interface between two parallel fluid streams in a two-dimensional microfluidic reactor. The growth rate of the precipitate is found to be proportional to the square root of time, which is characteristic of diffusive transport. However, the dependence of the growth rate on the concentrations of hydroxide and metal ions is approximately linear and quadratic, respectively. We show that such a difference in ionic transport dynamics arises from the enhanced transport of metal ions across a thin gel layer present at the surface of the precipitate. The fluctuations in transverse velocity in this wavy porous gel layer allow an enhanced transport of the cation, so that the effective diffusivity is about an order of magnitude higher than that expected from molecular diffusion alone. Our theoretical predictions are in excellent agreement with our laboratory measurements of the growth of a manganese hydroxide membrane in a microfluidic channel, and this enhanced transport is thought to have been needed to account for the bioenergetics of the first single-celled organisms.
Submitted 26 September, 2016;
originally announced September 2016.
-
Imputation of truncated p-values for meta-analysis methods and its genomic application
Authors:
Shaowu Tang,
Ying Ding,
Etienne Sibille,
Jeffrey S. Mogil,
William R. Lariviere,
George C. Tseng
Abstract:
Microarray analysis to monitor expression activities in thousands of genes simultaneously has become routine in biomedical research during the past decade. A tremendous number of expression profiles have been generated and stored in the public domain, and information integration by meta-analysis to detect differentially expressed (DE) genes has become popular to obtain increased statistical power and validated findings. Methods that aggregate transformed $p$-value evidence have been widely used in genomic settings, among which Fisher's and Stouffer's methods are the most popular ones. In practice, raw data and $p$-values of DE evidence are often not available in genomic studies that are to be combined. Instead, only the detected DE gene lists under a certain $p$-value threshold (e.g., DE genes with $p$-value $<0.001$) are reported in journal publications. The truncated $p$-value information makes the aforementioned meta-analysis methods inapplicable and researchers are forced to apply a less efficient vote counting method or naïvely drop the studies with incomplete information. The purpose of this paper is to develop effective meta-analysis methods for such situations with partially censored $p$-values. We developed and compared three imputation methods - mean imputation, single random imputation and multiple imputation - for a general class of evidence aggregation methods of which Fisher's and Stouffer's methods are special examples. The null distribution of each method was analytically derived and subsequent inference and genomic analysis frameworks were established. Simulations were performed to investigate the type I error, power and the control of false discovery rate (FDR) for (correlated) gene expression data. The proposed methods were applied to several genomic applications in colorectal cancer, pain and liquid association analysis of major depressive disorder (MDD). The results showed that imputation methods outperformed existing naïve approaches. Mean imputation and multiple imputation methods performed the best and are recommended for future applications.
Submitted 19 January, 2015;
originally announced January 2015.
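For reference, the two aggregation methods named in the abstract above can be sketched for the complete-data case, where every study reports an exact $p$-value. This is a minimal illustration using the textbook definitions and only the Python standard library. It does not implement the paper's imputation methods for truncated $p$-values, and the function names `fisher_combine` and `stouffer_combine` are hypothetical.

```python
import math
from statistics import NormalDist


def fisher_combine(p_values):
    """Fisher's method: X = -2 * sum(ln p_i) is chi-square with 2k d.o.f. under the null."""
    k = len(p_values)
    half_stat = -sum(math.log(p) for p in p_values)  # X / 2
    # Closed-form chi-square survival function for even degrees of freedom 2k:
    # P(X > x) = exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return math.exp(-half_stat) * total


def stouffer_combine(p_values):
    """Stouffer's method: Z = sum(Phi^{-1}(1 - p_i)) / sqrt(k) is standard normal under the null."""
    std_normal = NormalDist()
    k = len(p_values)
    # inv_cdf requires arguments strictly between 0 and 1, so each p_i must be too.
    z = sum(std_normal.inv_cdf(1.0 - p) for p in p_values) / math.sqrt(k)
    return 1.0 - std_normal.cdf(z)


if __name__ == "__main__":
    study_p_values = [0.01, 0.04, 0.20]  # made-up one-sided p-values for one gene
    print("Fisher combined p:  ", fisher_combine(study_p_values))
    print("Stouffer combined p:", stouffer_combine(study_p_values))
```

Both functions need the exact value of each $p_i$. When a study reports only that a gene passed a threshold, neither statistic can be computed directly, which is the gap the abstract describes.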
-
Epistasis not needed to explain low dN/dS
Authors:
David M. McCandlish,
Etienne Rajon,
Premal Shah,
Yang Ding,
Joshua B. Plotkin
Abstract:
An important question in molecular evolution is whether an amino acid that occurs at a given position makes an independent contribution to fitness, or whether its effect depends on the state of other loci in the organism's genome, a phenomenon known as epistasis. In a recent letter to Nature, Breen et al. (2012) argued that epistasis must be "pervasive throughout protein evolution" because the observed ratio between the per-site rates of non-synonymous and synonymous substitutions (dN/dS) is much lower than would be expected in the absence of epistasis. However, when calculating the expected dN/dS ratio in the absence of epistasis, Breen et al. assumed that all amino acids observed in a protein alignment at any particular position have equal fitness. Here, we relax this unrealistic assumption and show that any dN/dS value can in principle be achieved at a site, without epistasis. Furthermore, for all nuclear and chloroplast genes in the Breen et al. dataset, we show that the observed dN/dS values and the observed patterns of amino acid diversity at each site are jointly consistent with a non-epistatic model of protein evolution.
Submitted 20 December, 2012;
originally announced December 2012.
-
Semantic Inference using Chemogenomics Data for Drug Discovery
Authors:
Qian Zhu,
Yuyin Sun,
Sashikiran Challa,
Ying Ding,
Michael S. Lajiness,
David J. Wild
Abstract:
Background: Semantic Web Technology (SWT) makes it possible to integrate and search the large volume of life science datasets in the public domain, as demonstrated by well-known linked data projects such as LODD, Bio2RDF, and Chem2Bio2RDF. Integration of these sets creates large networks of information. We have previously described a tool called WENDI for aggregating information pertaining to new chemical compounds, effectively creating evidence paths relating the compounds to genes, diseases and so on. In this paper we examine the utility of automatically inferring new compound-disease associations (and thus new links in the network) based on semantically marked-up versions of these evidence paths, rule-sets and inference engines.
Results: Through the implementation of a semantic inference algorithm, rule set, Semantic Web methods (RDF, OWL and SPARQL) and new interfaces, we have created a new tool called Chemogenomic Explorer that uses networks of ontologically annotated RDF statements along with deductive reasoning tools to infer new associations between the query structure and genes and diseases from WENDI results. The tool then permits interactive clustering and filtering of these evidence paths.
Conclusions: We present a new aggregate approach to inferring links between chemical compounds and diseases using semantic inference. This approach allows multiple evidence paths between compounds and diseases to be identified using a rule-set and semantically annotated data, and for these evidence paths to be clustered to show overall evidence linking the compound to a disease. We believe this is a powerful approach, because it allows compound-disease relationships to be ranked by the amount of evidence supporting them.
Submitted 23 June, 2011;
originally announced June 2011.
-
Finding Complex Biological Relationships in Recent PubMed Articles Using Bio-LDA
Authors:
Huijun Wang,
Ying Ding,
Jie Tang,
Xiao Dong,
Bing He,
Judy Qiu,
David J. Wild
Abstract:
The overwhelming amount of available scholarly literature in the life sciences poses significant challenges to scientists wishing to keep up with important developments related to their research, but also provides a useful resource for the discovery of recent information concerning genes, diseases, compounds and the interactions between them. In this paper, we describe an algorithm called Bio-LDA that uses extracted biological terminology to automatically identify latent topics, and provides a variety of measures to uncover putative relations among topics and bio-terms. Relationships identified using those approaches are combined with existing data in life science datasets to provide additional insight. Three case studies demonstrate the utility of the Bio-LDA model, including association prediction, association search and connectivity map generation. This combined approach offers new opportunities for knowledge discovery in many areas of biology including target identification, lead hopping and drug repurposing.
Submitted 27 March, 2011;
originally announced March 2011.
-
Chem2Bio2RDF: A Linked Open Data Portal for Chemical Biology
Authors:
Bin Chen,
David J Wild,
Qian Zhu,
Ying Ding,
Xiao Dong,
Madhuvanthi Sankaranarayanan,
Huijun Wang,
Yuyin Sun
Abstract:
The Chem2Bio2RDF portal is a Linked Open Data (LOD) portal for systems chemical biology that aims to facilitate drug discovery. It converts around 25 different datasets on genes, compounds, drugs, pathways, side effects, diseases, and MEDLINE/PubMed documents into RDF triples and links them to other LOD bubbles, such as Bio2RDF, LODD and DBPedia. The portal is based on D2R server and provides a SPARQL endpoint, but adds a few unique features, such as an RDF faceted browser, a user-friendly SPARQL query generator, a MEDLINE/PubMed cross-validation service, and a Cytoscape visualization plugin. Three use cases demonstrate the functionality and usability of this portal.
Submitted 21 December, 2010;
originally announced December 2010.