EEG-Deformer: A Dense Convolutional Transformer for Brain-computer Interfaces

Authors: Yi Ding, Yong Li, Hao Sun, Rui Liu, Chengxuan Tong, Cuntai Guan

Abstract: Effectively learning the temporal dynamics in electroencephalogram (EEG) signals is challenging yet essential for decoding brain activities using brain-computer interfaces (BCIs). Although Transformers are popular for their long-term sequential learning ability in the BCI field, most methods combining Transformers with convolutional neural networks (CNNs) fail to capture the coarse-to-fine temporal dynamics of EEG signals. To overcome this limitation, we introduce EEG-Deformer, which incorporates two main novel components into a CNN-Transformer: (1) a Hierarchical Coarse-to-Fine Transformer (HCT) block that integrates a Fine-grained Temporal Learning (FTL) branch into Transformers, effectively discerning coarse-to-fine temporal patterns; and (2) a Dense Information Purification (DIP) module, which utilizes multi-level, purified temporal information to enhance decoding accuracy. Comprehensive experiments on three representative cognitive tasks consistently verify the generalizability of our proposed EEG-Deformer, demonstrating that it either outperforms existing state-of-the-art methods or is comparable to them. Visualization results show that EEG-Deformer learns from neurophysiologically meaningful brain regions for the corresponding cognitive tasks. The source code can be found at https://github.com/yi-ding-cs/EEG-Deformer.

Submitted 25 April, 2024; originally announced May 2024.

Comments: 10 pages, 9 figures. This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

arXiv:2404.17454 [pdf, other]

Domain Adaptive and Fine-grained Anomaly Detection for Single-cell Sequencing Data and Beyond

Authors: Kaichen Xu, Yueyang Ding, Suyang Hou, Weiqiang Zhan, Nisang Chen, Jun Wang, Xiaobo Sun

Abstract: Fine-grained anomalous cell detection from affected tissues is critical for clinical diagnosis and pathological research. Single-cell sequencing data provide unprecedented opportunities for this task. However, current anomaly detection methods struggle to handle domain shifts prevalent in multi-sample and multi-domain single-cell sequencing data, leading to suboptimal performance. Moreover, these methods fall short of distinguishing anomalous cells into pathologically distinct subtypes. In response, we propose ACSleuth, a novel, reconstruction deviation-guided generative framework that integrates the detection, domain adaptation, and fine-grained annotating of anomalous cells into a methodologically cohesive workflow. Notably, we present the first theoretical analysis of using reconstruction deviations output by generative models for anomaly detection in lieu of domain shifts. This analysis informs us to develop a novel and superior maximum mean discrepancy-based anomaly scorer in ACSleuth. Extensive benchmarks over various single-cell data and other types of tabular data demonstrate ACSleuth's superiority over the state-of-the-art methods in identifying and subtyping anomalies in multi-sample and multi-domain contexts. Our code is available at https://github.com/Catchxu/ACsleuth.

Submitted 29 April, 2024; v1 submitted 26 April, 2024; originally announced April 2024.

Comments: 17 pages, 2 figures. Accepted by IJCAI 2024

arXiv:2403.11516 [pdf]

Perceptual learning in contour detection transfer across changes in contour path and orientation

Authors: Yue Ding, Hongqiao Shi, Shuang Song, Yonghui Wang, Ya Li

Abstract: The integration of local elements into shape contours is critical for target detection and identification in cluttered scenes. Previous studies have shown that observers can learn to use image regularities for contour integration and target identification. However, we still know little about the generalization of perceptual learning in contour integration. Specifically, whether training on a contour detection task transfers to an untrained contour type, path, or orientation is still unclear. In a series of four experiments, human perceptual learning in contour detection was studied using psychophysical methods. We trained participants to detect contours in cluttered scenes over several days, which resulted in a significant improvement in sensitivity to the trained contour type. This improved sensitivity was highly specific to contour type, but transferred across changes in contour path and contour orientation. These results suggest that short-term training improves the ability to integrate specific types of contours by optimizing the ability of the visual system to extract specific image regularities. The differential specificity and generalization across different stimulus features may support the involvement of both low-level and higher-level visual areas in perceptual learning in contour detection. These findings provide further insights into understanding the nature and the brain plasticity mechanism of contour integration learning.

Submitted 18 March, 2024; originally announced March 2024.

arXiv:2401.06166 [pdf]

AdaMR: Adaptable Molecular Representation for Unified Pre-training Strategy

Authors: Yan Ding, Hao Cheng, Ziliang Ye, Ruyi Feng, Wei Tian, Peng Xie, Juan Zhang, Zhongze Gu

Abstract: We propose Adjustable Molecular Representation (AdaMR), a novel large-scale unified pre-training strategy for small-molecule drugs. AdaMR utilizes a granularity-adjustable molecular encoding strategy, which is accomplished through a pre-training job termed molecular canonicalization, setting it apart from recent large-scale molecular models. This adaptability in granularity enriches the model's learning capability at multiple levels and improves its performance in multi-task scenarios. Specifically, the substructure-level molecular representation preserves information about specific atom groups or arrangements, influencing chemical properties and functionalities. This proves advantageous for tasks such as property prediction. Simultaneously, the atomic-level representation, combined with generative molecular canonicalization pre-training tasks, enhances validity, novelty, and uniqueness in generative tasks. All of these features work together to give AdaMR outstanding performance on a range of downstream tasks. We fine-tuned our proposed pre-trained model on six molecular property prediction tasks (MoleculeNet datasets) and two generative tasks (ZINC250K datasets), achieving state-of-the-art (SOTA) results on five out of eight tasks.

Submitted 27 April, 2024; v1 submitted 28 December, 2023; originally announced January 2024.

arXiv:2312.07899 [pdf]

Morphological Profiling for Drug Discovery in the Era of Deep Learning

Authors: Qiaosi Tang, Ranjala Ratnayake, Gustavo Seabra, Zhe Jiang, Ruogu Fang, Lina Cui, Yousong Ding, Tamer Kahveci, Jiang Bian, Chenglong Li, Hendrik Luesch, Yanjun Li

Abstract: Morphological profiling is a valuable tool in phenotypic drug discovery. The advent of high-throughput automated imaging has enabled the capturing of a wide range of morphological features of cells or organisms in response to perturbations at single-cell resolution. Concurrently, significant advances in machine learning and deep learning, especially in computer vision, have led to substantial improvements in analyzing large-scale high-content images at high throughput. These efforts have facilitated understanding of compound mechanism-of-action (MOA), drug repurposing, and characterization of cell morphodynamics under perturbation, ultimately contributing to the development of novel therapeutics. In this review, we provide a comprehensive overview of the recent advances in the field of morphological profiling. We summarize the image profiling analysis workflow, survey a broad spectrum of analysis strategies encompassing feature engineering- and deep learning-based approaches, and introduce publicly available benchmark datasets. We place a particular emphasis on the application of deep learning in this pipeline, covering cell segmentation, image representation learning, and multimodal learning. Additionally, we illuminate the application of morphological profiling in phenotypic drug discovery and highlight potential challenges and opportunities in this field.

Submitted 15 January, 2024; v1 submitted 13 December, 2023; originally announced December 2023.

Comments: 44 pages, 5 figures, 5 tables

arXiv:2309.00597 [pdf, other]

The QUATRO Application Suite: Quantum Computing for Models of Human Cognition

Authors: Raghavendra Pradyumna Pothukuchi, Leon Lufkin, Yu Jun Shen, Alejandro Simon, Rome Thorstenson, Bernardo Eilert Trevisan, Michael Tu, Mudi Yang, Ben Foxman, Viswanatha Srinivas Pothukuchi, Gunnar Epping, Thi Ha Kyaw, Bryant J Jongkees, Yongshan Ding, Jerome R Busemeyer, Jonathan D Cohen, Abhishek Bhattacharjee

Abstract: Research progress in quantum computing has, thus far, focused on a narrow set of application domains. Expanding the suite of quantum application domains is vital for the discovery of new software toolchains and architectural abstractions. In this work, we unlock a new class of applications ripe for quantum computing research -- computational cognitive modeling. Cognitive models are critical to understanding and replicating human intelligence. Our work connects computational cognitive models to quantum computer architectures for the first time. We release QUATRO, a collection of quantum computing applications from cognitive models. The development and execution of QUATRO shed light on gaps in the quantum computing stack that need to be closed to ease programming and drive performance. Among several contributions, we propose and study ideas pertaining to quantum cloud scheduling (using data from gate- and annealing-based quantum computers), parallelization, and more. In the long run, we expect our research to lay the groundwork for more versatile quantum computer systems.

Submitted 8 December, 2023; v1 submitted 1 September, 2023; originally announced September 2023.

arXiv:2308.10275 [pdf, other]

SBSM-Pro: Support Bio-sequence Machine for Proteins

Authors: Yizheng Wang, Yixiao Zhai, Yijie Ding, Quan Zou

Abstract: Proteins play a pivotal role in biological systems. The use of machine learning algorithms for protein classification can assist and even guide biological experiments, offering crucial insights for biotechnological applications. We introduce the Support Bio-Sequence Machine for Proteins (SBSM-Pro), a model purpose-built for the classification of biological sequences. This model starts with raw sequences and groups amino acids based on their physicochemical properties. It incorporates sequence alignment to measure the similarities between proteins and uses a novel multiple kernel learning (MKL) approach to integrate various types of information, utilizing support vector machines for classification prediction. The results indicate that our model demonstrates commendable performance across ten datasets in terms of the identification of protein function and posttranslational modification. This research not only exemplifies state-of-the-art work in protein classification but also paves avenues for new directions in this domain, representing a beneficial endeavor in the development of platforms tailored for the classification of biological sequences. SBSM-Pro is available for access at http://lab.malab.cn/soft/SBSM-Pro/.

Submitted 4 November, 2023; v1 submitted 20 August, 2023; originally announced August 2023.

Comments: 38 pages, 9 figures

arXiv:2308.01458 [pdf, other]

Semi-supervised Cooperative Learning for Multiomics Data Fusion

Authors: Daisy Yi Ding, Xiaotao Shen, Michael Snyder, Robert Tibshirani

Abstract: Multiomics data fusion integrates diverse data modalities, ranging from transcriptomics to proteomics, to gain a comprehensive understanding of biological systems and enhance predictions on outcomes of interest related to disease phenotypes and treatment responses. Cooperative learning, a recently proposed method, unifies the commonly used fusion approaches, including early and late fusion, and offers a systematic framework for leveraging the shared underlying relationships across omics to strengthen signals. However, the challenge of acquiring large-scale labeled data remains, and there are cases where multiomics data are available but in the absence of annotated labels. To harness the potential of unlabeled multiomics data, we introduce semi-supervised cooperative learning. By utilizing an "agreement penalty", our method incorporates the additional unlabeled data in the learning process and achieves consistently superior predictive performance on simulated data and a real multiomics study of aging. It offers an effective solution to multiomics data fusion in settings with both labeled and unlabeled data and maximizes the utility of available data resources, with the potential of significantly improving predictive models for diagnostics and therapeutics in an increasingly multiomics world.

Submitted 2 August, 2023; originally announced August 2023.

Comments: The 2023 ICML Workshop on Machine Learning for Multimodal Healthcare Data. arXiv admin note: text overlap with arXiv:2112.12337

arXiv:2308.01402 [pdf, other]

Machine Learning-guided Lipid Nanoparticle Design for mRNA Delivery

Authors: Daisy Yi Ding, Yuhui Zhang, Yuan Jia, Jiuzhi Sun

Abstract: While RNA technologies hold immense therapeutic potential in a range of applications from vaccination to gene editing, the broad implementation of these technologies is hindered by the challenge of delivering these agents effectively. Lipid nanoparticles (LNPs) have emerged as one of the most widely used delivery agents, but their design optimization relies on laborious and costly experimental methods. We propose to optimize LNP design in silico with machine learning models. On a curated dataset of 622 LNPs from published studies, we demonstrate the effectiveness of our model in predicting the transfection efficiency of unseen LNPs, with the multilayer perceptron achieving a classification accuracy of 98% on the test set. Our work represents a pioneering effort in combining ML and LNP design, offering significant potential for improving screening efficiency by computationally prioritizing LNP candidates for experimental validation and accelerating the development of effective mRNA delivery systems.

Submitted 28 August, 2023; v1 submitted 2 August, 2023; originally announced August 2023.

Comments: The 2023 ICML Workshop on Computational Biology

arXiv:2306.07652 [pdf]

Inactivated COVID-19 Vaccination did not affect In vitro fertilization (IVF) / Intra-Cytoplasmic Sperm Injection (ICSI) cycle outcomes

Authors: Qi Wan, Ying Ling Yao, XingYu Lv, Li Hong Geng, Yue Wang, Enoch Appiah Adu-Gyamfi, Xue Jiao Wang, Yue Qian, Juan Yang, Ming Xing Chend, Zhao Hui Zhong, Yuan Li, Yu Bin Ding

Abstract: Background: The objective of this study is to evaluate the impact of COVID-19 inactivated vaccine administration on the outcomes of in vitro fertilization (IVF) and intracytoplasmic sperm injection (ICSI) cycles in infertile couples in China. Methods: We collected data from the CYART prospective cohort, which included couples undergoing IVF treatment from January 2021 to September 2022 at Sichuan Jinxin Xinan Women & Children's Hospital. Based on whether they received vaccination before ovarian stimulation, the couples were divided into the vaccination group and the non-vaccination group. We compared the laboratory parameters and pregnancy outcomes between the two groups. Findings: After performing propensity score matching (PSM), the analysis demonstrated similar clinical pregnancy rates, biochemical pregnancy and ongoing pregnancy rates between vaccinated and unvaccinated women. No significant disparities were found in terms of embryo development and laboratory parameters among the groups. Moreover, male vaccination had no impact on patient performance or pregnancy outcomes in assisted reproductive technology treatments. Additionally, there were no significant differences observed in the effects of vaccination on embryo development and pregnancy outcomes among couples undergoing ART. Interpretation: The findings suggest that COVID-19 vaccination did not have a significant effect on patients undergoing IVF/ICSI with fresh embryo transfer. Therefore, it is recommended that couples should receive COVID-19 vaccination as scheduled to help mitigate the COVID-19 pandemic.

Submitted 13 June, 2023; originally announced June 2023.

Comments: 26 pages, 4 figures and 5 tables

arXiv:2305.03893 [pdf]

Generalizability of PRS313 for breast cancer risk amongst non-Europeans in a Los Angeles biobank

Authors: Helen Shang, Yi Ding, Vidhya Venkateswaran, Kristin Boulier, Nikhita Kathuria-Prakash, Parisa Boodaghi Malidarreh, Jacob M. Luber, Bogdan Pasaniuc

Abstract: Polygenic risk scores (PRS) summarize the combined effect of common risk variants and are associated with breast cancer risk in patients without identifiable monogenic risk factors. One of the most well-validated PRSs in breast cancer to date is PRS313, which was developed from a Northern European biobank but has shown attenuated performance in non-European ancestries. We further investigate the generalizability of the PRS313 for American women of European (EA), African (AFR), Asian (EAA), and Latinx (HL) ancestry within one institution with a singular EHR system, genotyping platform, and quality control process. We found that the PRS313 achieved overlapping Areas under the ROC Curve (AUCs) in females of Latinx (AUC, 0.68; 95% CI, 0.65-0.71) and European ancestry (AUC, 0.70; 95% CI, 0.69-0.71) but lower AUCs for the AFR and EAA populations (AFR: AUC, 0.61; 95% CI, 0.56-0.65; EAA: AUC, 0.64; 95% CI, 0.60-0.68). While PRS313 is associated with Hormone Positive (HR+) disease in European Americans (OR, 1.42; 95% CI, 1.16-1.64), for Latinx females, it may instead be associated with Human Epidermal Growth Factor Receptor 2 (HER2+) disease (OR, 2.52; 95% CI, 1.35-4.70), although due to small numbers, additional studies are needed. In summary, we found that PRS313 was significantly associated with breast cancer but with attenuated accuracy in women of African and Asian descent within a singular health system in Los Angeles. Our work further highlights the need for additional validation in diverse cohorts prior to clinical implementation of polygenic risk scores.

Submitted 5 May, 2023; originally announced May 2023.

Comments: 27 pages, 2 figures

arXiv:2304.10946 [pdf, other]

CancerGPT: Few-shot Drug Pair Synergy Prediction using Large Pre-trained Language Models

Authors: Tianhao Li, Sandesh Shetty, Advaith Kamath, Ajay Jaiswal, Xianqian Jiang, Ying Ding, Yejin Kim

Abstract: Large pre-trained language models (LLMs) have been shown to have significant potential in few-shot learning across various fields, even with minimal training data. However, their ability to generalize to unseen tasks in more complex fields, such as biology, has yet to be fully evaluated. LLMs can offer a promising alternative approach for biological inference, particularly in cases where structured data and sample size are limited, by extracting prior knowledge from text corpora. Our proposed few-shot learning approach uses LLMs to predict the synergy of drug pairs in rare tissues that lack structured data and features. Our experiments, which involved seven rare tissues from different cancer types, demonstrated that the LLM-based prediction model achieved significant accuracy with very few or zero samples. Our proposed model, the CancerGPT (with $\sim$ 124M parameters), was even comparable to the larger fine-tuned GPT-3 model (with $\sim$ 175B parameters). Our research is the first to tackle drug pair synergy prediction in rare tissues with limited data. We are also the first to utilize an LLM-based prediction model for biological reaction prediction tasks.

Submitted 17 April, 2023; originally announced April 2023.

arXiv:2303.06965 [pdf, other]

doi 10.1038/s42256-023-00764-9

Bridging the Gap between Chemical Reaction Pretraining and Conditional Molecule Generation with a Unified Model

Authors: Bo Qiang, Yiran Zhou, Yuheng Ding, Ningfeng Liu, Song Song, Liangren Zhang, Bo Huang, Zhenming Liu

Abstract: Chemical reactions are the fundamental building blocks of drug design and organic chemistry research. In recent years, there has been a growing need for a large-scale deep-learning framework that can efficiently capture the basic rules of chemical reactions. In this paper, we propose a unified framework that addresses both the reaction representation learning and molecule generation tasks, which allows for a more holistic approach. Inspired by the organic chemistry mechanism, we develop a novel pretraining framework that enables us to incorporate inductive biases into the model. Our framework achieves state-of-the-art results on challenging downstream tasks. By possessing chemical knowledge, our generative framework overcomes the limitations of current molecule generation models that rely on a small number of reaction templates. In the extensive experiments, our model generates synthesizable drug-like structures of high quality. Overall, our work presents a significant step toward a large-scale deep-learning framework for a variety of reaction-based applications.

Submitted 7 March, 2024; v1 submitted 13 March, 2023; originally announced March 2023.

arXiv:2112.12337 [pdf, other]

doi 10.1073/pnas.2202113119

Cooperative learning for multiview analysis

Authors: Daisy Yi Ding, Shuangning Li, Balasubramanian Narasimhan, Robert Tibshirani

Abstract: We propose a new method for supervised learning with multiple sets of features ("views"). The multiview problem is especially important in biology and medicine, where "-omics" data such as genomics, proteomics and radiomics are measured on a common set of samples. Cooperative learning combines the usual squared error loss of predictions with an "agreement" penalty to encourage the predictions from different data views to agree. By varying the weight of the agreement penalty, we get a continuum of solutions that include the well-known early and late fusion approaches. Cooperative learning chooses the degree of agreement (or fusion) in an adaptive manner, using a validation set or cross-validation to estimate test set prediction error. One version of our fitting procedure is modular, where one can choose different fitting mechanisms (e.g. lasso, random forests, boosting, neural networks) appropriate for different data views. In the setting of cooperative regularized linear regression, the method combines the lasso penalty with the agreement penalty, yielding feature sparsity. The method can be especially powerful when the different data views share some underlying relationship in their signals that can be exploited to boost the signals. We show that cooperative learning achieves higher predictive accuracy on simulated data and a real multiomics example of labor onset prediction. Leveraging aligned signals and allowing flexible fitting mechanisms for different modalities, cooperative learning offers a powerful approach to multiomics data fusion.

Submitted 3 September, 2022; v1 submitted 22 December, 2021; originally announced December 2021.
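
The objective described in this abstract (squared error loss plus an "agreement" penalty between the per-view predictions) can be illustrated for two views with plain linear fits. The following is a minimal sketch, not the authors' implementation: it assumes two feature matrices `X` and `Z`, a penalty weight `rho`, and no lasso term, and it solves the problem as one augmented least-squares system. The function name `cooperative_least_squares` is hypothetical.

```python
import numpy as np

def cooperative_least_squares(X, Z, y, rho):
    """Two-view linear fit with an agreement penalty (illustrative sketch).

    Minimizes 0.5 * ||y - X a - Z b||^2 + 0.5 * rho * ||X a - Z b||^2
    over coefficient vectors a and b. The lasso penalty mentioned in the
    abstract is deliberately omitted here (an assumption of this sketch).
    """
    n = X.shape[0]
    s = np.sqrt(rho)
    # Stack the data-fit rows on top of the agreement rows so that the
    # penalized objective becomes an ordinary least-squares problem.
    A = np.block([[X, Z], [-s * X, s * Z]])
    b = np.concatenate([y, np.zeros(n)])
    theta, *_ = np.linalg.lstsq(A, b, rcond=None)
    p = X.shape[1]
    return theta[:p], theta[p:]
```

With `rho = 0` the agreement rows vanish and the fit reduces to ordinary least squares on the concatenated views, which corresponds to early fusion; larger values of `rho` pull the two views' predictions toward each other.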

arXiv:2111.06593 [pdf, other]

Using Deep Learning Sequence Models to Identify SARS-CoV-2 Divergence

Authors: Yanyi Ding, Zhiyi Kuang, Yuxin Pei, Jeff Tan, Ziyu Zhang, Joseph Konan

Abstract: SARS-CoV-2 is an upper respiratory system RNA virus that has caused over 3 million deaths and infected over 150 million people worldwide as of May 2021. With thousands of strains sequenced to date, SARS-CoV-2 mutations pose significant challenges to scientists in keeping pace with vaccine development and public health measures. Therefore, an efficient method of identifying the divergence of lab samples from patients would greatly aid the documentation of SARS-CoV-2 genomics. In this study, we propose a neural network model that leverages recurrent and convolutional units to directly take in amino acid sequences of spike proteins and classify corresponding clades. We also compared our model's performance with Bidirectional Encoder Representations from Transformers (BERT) pre-trained on a protein database. Our approach has the potential of providing a more computationally efficient alternative to current homology-based intra-species differentiation.

Submitted 12 November, 2021; originally announced November 2021.

arXiv:2110.00170 [pdf, other]

Surveillance Testing for Rapid Detection of Outbreaks in Facilities

Authors: Yanyue Ding, Sudesh K. Agrawal, Jincheng Cao, Lauren Meyers, John J. Hasenbein

Abstract: This paper develops an agent-based disease spread model on a contact network to guide surveillance testing in small to moderate facilities such as nursing homes and meat-packing plants. The model employs Monte Carlo simulations of viral spread sample paths in the contact network. The original motivation was to detect COVID-19 outbreaks quickly in such facilities, but the model can be applied to any communicable disease. In particular, the model provides guidance on how many tests to administer each day and on the importance of the testing order among staff or workers.

Submitted 30 September, 2021; originally announced October 2021.

Comments: 21 pages, 14 figures and 3 tables. Submitted to Health Care Management Science

arXiv:2107.06773 [pdf]

Relational graph convolutional networks for predicting blood-brain barrier penetration of drug molecules

Authors: Yan Ding, Xiaoqian Jiang, Yejin Kim

Abstract: Evaluating the blood-brain barrier (BBB) permeability of drug molecules is a critical step in brain drug development. Traditional methods for the evaluation require complicated in vitro or in vivo testing. Alternatively, in silico predictions based on machine learning have proved to be a cost-efficient way to complement the in vitro and in vivo methods. However, the performance of the established models has been limited by their incapability of dealing with the interactions between drugs and proteins, which play an important role in the mechanism behind the BBB penetrating behaviors. To address this limitation, we employed the relational graph convolutional network (RGCN) to handle the drug-protein interactions as well as the properties of each individual drug. The RGCN model achieved an overall accuracy of 0.872, an AUROC of 0.919 and an AUPRC of 0.838 for the testing dataset with the drug-protein interactions and the Mordred descriptors as the input. Introducing drug-drug similarity to connect structurally similar drugs in the data graph further improved the testing results, giving an overall accuracy of 0.876, an AUROC of 0.926 and an AUPRC of 0.865. In particular, the RGCN model was found to greatly outperform the LightGBM base model when evaluated with the drugs whose BBB penetration was dependent on drug-protein interactions. Our model is expected to provide high-confidence predictions of BBB permeability for drug prioritization in the experimental screening of BBB-penetrating drugs.

Submitted 6 April, 2022; v1 submitted 4 July, 2021; originally announced July 2021.

arXiv:2105.08848 [pdf, other]

Optimal Control of the SIR Model with Constrained Policy, with an Application to COVID-19

Authors: Yujia Ding, Henry Schellhorn

Abstract: This article considers the optimal control of the SIR model with both transmission and treatment uncertainty. It follows the model presented in Gatto and Schellhorn (2021). We make four significant improvements on the latter paper. First, we prove the existence of a solution to the model. Second, our interpretation of the control is more realistic: while in Gatto and Schellhorn the control $α$ is the proportion of the population that takes a basic dose of treatment, so that $α>1$ occurs only if some patients take more than a basic dose, in our paper, $α$ is constrained between zero and one, and thus represents the proportion of the population undergoing treatment. Third, we provide a complete solution for the moderate infection regime (with constant treatment). Fourth, we give a thorough interpretation of the control in the moderate infection regime, while Gatto and Schellhorn focussed on the interpretation of the low infection regime. Finally, we compare the efficiency of our control to curb the COVID-19 epidemic to other types of control.

Submitted 18 May, 2021; originally announced May 2021.

Comments: 28 pages, 2 figures

arXiv:2105.02786 [pdf, other]

LGGNet: Learning from Local-Global-Graph Representations for Brain-Computer Interface

Authors: Yi Ding, Neethu Robinson, Chengxuan Tong, Qiuhao Zeng, Cuntai Guan

Abstract: Neuropsychological studies suggest that co-operative activities among different brain functional areas drive high-level cognitive processes. To learn the brain activities within and among different functional areas of the brain, we propose LGGNet, a novel neurologically inspired graph neural network, to learn local-global-graph representations of electroencephalography (EEG) for Brain-Computer Interface (BCI). The input layer of LGGNet comprises a series of temporal convolutions with multi-scale 1D convolutional kernels and kernel-level attentive fusion. It captures temporal dynamics of EEG which then serves as input to the proposed local and global graph-filtering layers. Using a defined neurophysiologically meaningful set of local and global graphs, LGGNet models the complex relations within and among functional areas of the brain. Under the robust nested cross-validation settings, the proposed method is evaluated on three publicly available datasets for four types of cognitive classification tasks, namely, the attention, fatigue, emotion, and preference classification tasks. LGGNet is compared with state-of-the-art methods, such as DeepConvNet, EEGNet, R2G-STNN, TSception, RGNN, AMCNN-DGCN, HRNN and GraphNet. The results show that LGGNet outperforms these methods, and the improvements are statistically significant (p<0.05) in most cases. The results show that bringing neuroscience prior knowledge into neural network design yields an improvement of classification performance. The source code can be found at https://github.com/yi-ding-cs/LGG

Submitted 5 December, 2022; v1 submitted 5 May, 2021; originally announced May 2021.

Comments: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

arXiv:2006.08735 [pdf, ps, other]

Minimal invariant regions and minimal globally attracting regions for toric differential inclusions

Authors: Yida Ding, Abhishek Deshpande, Gheorghe Craciun

Abstract: Toric differential inclusions occur as key dynamical systems in the context of the Global Attractor Conjecture. We introduce the notions of minimal invariant regions and minimal globally attracting regions for toric differential inclusions. We describe a procedure for constructing explicitly the minimal invariant and minimal globally attracting regions for two-dimensional toric differential inclusions. In particular, we obtain invariant regions and globally attracting regions for two-dimensional weakly reversible or endotactic dynamical systems (even if they have time-dependent parameters).

Submitted 15 June, 2020; originally announced June 2020.

Comments: 29 pages, 15 figures

MSC Class: 37N25; 80A30; 92C45; 92E20; 14M25

arXiv:1911.00075 [pdf]

doi 10.1061/9780784412190.031

Towards a terramechanics for bio-inspired locomotion in granular environments

Authors: Chen Li, Yang Ding, Nick Gravish, Ryan D. Maladen, Andrew Masse, Paul B. Umbanhowar, Haldun Komsuoglu, Daniel E. Koditschek, Daniel I. Goldman

Abstract: Granular media (GM) present locomotor challenges for terrestrial and extraterrestrial devices because they can flow and solidify in response to localized intrusion of wheels, limbs, and bodies. While the development of airplanes and submarines is aided by understanding of hydrodynamics, fundamental theory does not yet exist to describe the complex interactions of locomotors with GM. In this paper, we use experimental, computational, and theoretical approaches to develop a terramechanics for bio-inspired locomotion in granular environments. We use a fluidized bed to prepare GM with a desired global packing fraction, and use empirical force measurements and the Discrete Element Method (DEM) to elucidate interaction mechanics during locomotion-relevant intrusions in GM such as vertical penetration and horizontal drag. We develop a resistive force theory (RFT) to account for more complex intrusions. We use these force models to understand the locomotor performance of two bio-inspired robots moving on and within GM.

Submitted 31 October, 2019; originally announced November 2019.

Journal ref: ASCE Earth & Space Conference (2012), 264-273

arXiv:1704.02007 [pdf]

DIMM-SC: A Dirichlet mixture model for clustering droplet-based single cell transcriptomic data

Authors: Zhe Sun, Ting Wang, Ke Deng, Xiao-Feng Wang, Robert Lafyatis, Ying Ding, Ming Hu, Wei Chen

Abstract: Motivation: Single cell transcriptome sequencing (scRNA-Seq) has become a revolutionary tool to study cellular and molecular processes at single cell resolution. Among existing technologies, the recently developed droplet-based platform enables efficient parallel processing of thousands of single cells with direct counting of transcript copies using Unique Molecular Identifier (UMI). Despite the technology advances, statistical methods and computational tools are still lacking for analyzing droplet-based scRNA-Seq data. Particularly, model-based approaches for clustering large-scale single cell transcriptomic data are still under-explored. Methods: We developed DIMM-SC, a Dirichlet Mixture Model for clustering droplet-based Single Cell transcriptomic data. This approach explicitly models UMI count data from scRNA-Seq experiments and characterizes variations across different cell clusters via a Dirichlet mixture prior. An expectation-maximization algorithm is used for parameter inference. Results: We performed comprehensive simulations to evaluate DIMM-SC and compared it with existing clustering methods such as K-means, CellTree and Seurat. In addition, we analyzed public scRNA-Seq datasets with known cluster labels and in-house scRNA-Seq datasets from a study of systemic sclerosis with prior biological knowledge to benchmark and validate DIMM-SC. Both simulation studies and real data applications demonstrated that overall, DIMM-SC achieves substantially improved clustering accuracy and much lower clustering variability compared to other existing clustering methods. More importantly, as a model-based approach, DIMM-SC is able to quantify the clustering uncertainty for each single cell, facilitating rigorous statistical inference and biological interpretations, which are typically unavailable from existing clustering methods.

Submitted 6 April, 2017; originally announced April 2017.

arXiv:1609.08099 [pdf, other]

doi 10.1073/pnas.1607828113

Wavy membranes and the growth rate of a planar chemical garden: Enhanced diffusion and bioenergetics

Authors: Yang Ding, Bruno Batista, Oliver Steinbock, Julyan H. E. Cartwright, Silvana S. S. Cardoso

Abstract: In order to model ion transport across protocell membranes in Hadean hydrothermal vents, we consider both theoretically and experimentally the planar growth of a precipitate membrane formed at the interface between two parallel fluid streams in a two-dimensional microfluidic reactor. The growth rate of the precipitate is found to be proportional to the square root of time, which is characteristic of diffusive transport. However, the dependence of the growth rate on the concentrations of hydroxide and metal ions is approximately linear and quadratic, respectively. We show that such a difference in ionic transport dynamics arises from the enhanced transport of metal ions across a thin gel layer present at the surface of the precipitate. The fluctuations in transverse velocity in this wavy porous gel layer allow an enhanced transport of the cation, so that the effective diffusivity is about an order of magnitude higher than that expected from molecular diffusion alone. Our theoretical predictions are in excellent agreement with our laboratory measurements of the growth of a manganese hydroxide membrane in a microfluidic channel, and this enhanced transport is thought to have been needed to account for the bioenergetics of the first single-celled organisms.

Submitted 26 September, 2016; originally announced September 2016.

Journal ref: PNAS, vol. 113, no. 33, 9182-9186, 2016

arXiv:1501.04415 [pdf, ps, other]

doi 10.1214/14-AOAS747

Imputation of truncated p-values for meta-analysis methods and its genomic application

Authors: Shaowu Tang, Ying Ding, Etienne Sibille, Jeffrey S. Mogil, William R. Lariviere, George C. Tseng

Abstract: Microarray analysis to monitor expression activities in thousands of genes simultaneously has become routine in biomedical research during the past decade. A tremendous amount of expression profiles are generated and stored in the public domain and information integration by meta-analysis to detect differentially expressed (DE) genes has become popular to obtain increased statistical power and validated findings. Methods that aggregate transformed $p$-value evidence have been widely used in genomic settings, among which Fisher's and Stouffer's methods are the most popular ones. In practice, raw data and $p$-values of DE evidence are often not available in genomic studies that are to be combined. Instead, only the detected DE gene lists under a certain $p$-value threshold (e.g., DE genes with $p$-value${}<0.001$) are reported in journal publications. The truncated $p$-value information makes the aforementioned meta-analysis methods inapplicable and researchers are forced to apply a less efficient vote counting method or naïvely drop the studies with incomplete information. The purpose of this paper is to develop effective meta-analysis methods for such situations with partially censored $p$-values. We developed and compared three imputation methods - mean imputation, single random imputation and multiple imputation - for a general class of evidence aggregation methods of which Fisher's and Stouffer's methods are special examples. The null distribution of each method was analytically derived and subsequent inference and genomic analysis frameworks were established. Simulations were performed to investigate the type I error, power and the control of false discovery rate (FDR) for (correlated) gene expression data. The proposed methods were applied to several genomic applications in colorectal cancer, pain and liquid association analysis of major depressive disorder (MDD). The results showed that imputation methods outperformed existing naïve approaches. Mean imputation and multiple imputation methods performed the best and are recommended for future applications.

Submitted 19 January, 2015; originally announced January 2015.

Comments: Published in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org) at http://dx.doi.org/10.1214/14-AOAS747

Report number: IMS-AOAS-AOAS747

Journal ref: Annals of Applied Statistics 2014, Vol. 8, No. 4, 2150-2174

arXiv:1212.5239 [pdf]

doi 10.1038/nature12219

Epistasis not needed to explain low dN/dS

Authors: David M. McCandlish, Etienne Rajon, Premal Shah, Yang Ding, Joshua B. Plotkin

Abstract: An important question in molecular evolution is whether an amino acid that occurs at a given position makes an independent contribution to fitness, or whether its effect depends on the state of other loci in the organism's genome, a phenomenon known as epistasis. In a recent letter to Nature, Breen et al. (2012) argued that epistasis must be "pervasive throughout protein evolution" because the observed ratio between the per-site rates of non-synonymous and synonymous substitutions (dN/dS) is much lower than would be expected in the absence of epistasis. However, when calculating the expected dN/dS ratio in the absence of epistasis, Breen et al. assumed that all amino acids observed in a protein alignment at any particular position have equal fitness. Here, we relax this unrealistic assumption and show that any dN/dS value can in principle be achieved at a site, without epistasis. Furthermore, for all nuclear and chloroplast genes in the Breen et al. dataset, we show that the observed dN/dS values and the observed patterns of amino acid diversity at each site are jointly consistent with a non-epistatic model of protein evolution.

Submitted 20 December, 2012; originally announced December 2012.

Comments: This manuscript is in response to "Epistasis as the primary factor in molecular evolution" by Breen et al. Nature 490, 535-538 (2012)

Journal ref: McCandlish, D. M., Rajon, E., Shah, P., Ding, Y., & Plotkin, J. B. (2013). The role of epistasis in protein evolution. Nature, 497(7451), E1-E2

arXiv:1106.4880 [pdf]

doi 10.1186/1471-2105-12-256

Semantic Inference using Chemogenomics Data for Drug Discovery

Authors: Qian Zhu, Yuyin Sun, Sashikiran Challa, Ying Ding, Michael S. Lajiness, David J. Wild

Abstract: Background: Semantic Web Technology (SWT) makes it possible to integrate and search the large volume of life science datasets in the public domain, as demonstrated by well-known linked data projects such as LODD, Bio2RDF, and Chem2Bio2RDF. Integration of these sets creates large networks of information. We have previously described a tool called WENDI for aggregating information pertaining to new chemical compounds, effectively creating evidence paths relating the compounds to genes, diseases and so on. In this paper we examine the utility of automatically inferring new compound-disease associations (and thus new links in the network) based on semantically marked-up versions of these evidence paths, rule-sets and inference engines. Results: Through the implementation of a semantic inference algorithm, rule set, Semantic Web methods (RDF, OWL and SPARQL) and new interfaces, we have created a new tool called Chemogenomic Explorer that uses networks of ontologically annotated RDF statements along with deductive reasoning tools to infer new associations between the query structure and genes and diseases from WENDI results. The tool then permits interactive clustering and filtering of these evidence paths. Conclusions: We present a new aggregate approach to inferring links between chemical compounds and diseases using semantic inference. This approach allows multiple evidence paths between compounds and diseases to be identified using a rule-set and semantically annotated data, and for these evidence paths to be clustered to show overall evidence linking the compound to a disease. We believe this is a powerful approach, because it allows compound-disease relationships to be ranked by the amount of evidence supporting them.

Submitted 23 June, 2011; originally announced June 2011.

Comments: 23 pages, 9 figures, 4 tables

arXiv:1103.5181 [pdf]

doi 10.1371/journal.pone.0017243

Finding Complex Biological Relationships in Recent PubMed Articles Using Bio-LDA

Authors: Huijun Wang, Ying Ding, Jie Tang, Xiao Dong, Bing He, Judy Qiu, David J. Wild

Abstract: The overwhelming amount of available scholarly literature in the life sciences poses significant challenges to scientists wishing to keep up with important developments related to their research, but also provides a useful resource for the discovery of recent information concerning genes, diseases, compounds and the interactions between them. In this paper, we describe an algorithm called Bio-LDA that uses extracted biological terminology to automatically identify latent topics, and provides a variety of measures to uncover putative relations among topics and bio-terms. Relationships identified using those approaches are combined with existing data in life science datasets to provide additional insight. Three case studies demonstrate the utility of the Bio-LDA model, including association predication, association search and connectivity map generation. This combined approach offers new opportunities for knowledge discovery in many areas of biology including target identification, lead hopping and drug repurposing.

Submitted 27 March, 2011; originally announced March 2011.

Comments: 14 pages, 8 figures, 10 tables

Journal ref: PLoS ONE (2011) 6(3): e17243

arXiv:1012.4759 [pdf]

Chem2Bio2RDF: A Linked Open Data Portal for Chemical Biology

Authors: Bin Chen, David J Wild, Qian Zhu, Ying Ding, Xiao Dong, Madhuvanthi Sankaranarayanan, Huijun Wang, Yuyin Sun

Abstract: The Chem2Bio2RDF portal is a Linked Open Data (LOD) portal for systems chemical biology that aims to facilitate drug discovery. It converts around 25 different datasets on genes, compounds, drugs, pathways, side effects, diseases, and MEDLINE/PubMed documents into RDF triples and links them to other LOD bubbles, such as Bio2RDF, LODD and DBPedia. The portal is based on the D2R server and provides a SPARQL endpoint, but adds a few unique features, such as an RDF faceted browser, a user-friendly SPARQL query generator, a MEDLINE/PubMed cross-validation service, and a Cytoscape visualization plugin. Three use cases demonstrate the functionality and usability of this portal.

Submitted 21 December, 2010; originally announced December 2010.

Comments: 8 pages, 10 figures

ACM Class: D.2.12

Showing 1–28 of 28 results for author: Ding, Y