-
A Fine-tuning Dataset and Benchmark for Large Language Models for Protein Understanding
Authors:
Yiqing Shen,
Zan Chen,
Michail Mamalakis,
Luhan He,
Haiyang Xia,
Tianbin Li,
Yanzhou Su,
Junjun He,
Yu Guang Wang
Abstract:
The parallels between protein sequences and natural language in their sequential structures have inspired the application of large language models (LLMs) to protein understanding. Despite the success of LLMs in NLP, their effectiveness in comprehending protein sequences remains an open question, largely due to the absence of datasets linking protein sequences to descriptive text. Researchers have then attempted to adapt LLMs for protein understanding by integrating a protein sequence encoder with a pre-trained LLM. However, this adaptation raises a fundamental question: "Can LLMs, originally designed for NLP, effectively comprehend protein sequences as a form of language?" Current datasets fall short in addressing this question due to the lack of a direct correlation between protein sequences and corresponding text descriptions, limiting the ability to train and evaluate LLMs for protein understanding effectively. To bridge this gap, we introduce ProteinLMDataset, a dataset specifically designed for further self-supervised pretraining and supervised fine-tuning (SFT) of LLMs to enhance their capability for protein sequence comprehension. Specifically, ProteinLMDataset includes 17.46 billion tokens for pretraining and 893,000 instructions for SFT. Additionally, we present ProteinLMBench, the first benchmark dataset consisting of 944 manually verified multiple-choice questions for assessing the protein understanding capabilities of LLMs. ProteinLMBench incorporates protein-related details and sequences in multiple languages, establishing a new standard for evaluating LLMs' abilities in protein comprehension. The large language model InternLM2-7B, pretrained and fine-tuned on the ProteinLMDataset, outperforms GPT-4 on ProteinLMBench, achieving the highest accuracy score.
Submitted 8 July, 2024; v1 submitted 8 June, 2024;
originally announced June 2024.
-
Matrix dissimilarities based on differences in moments and sparsity
Authors:
Tuobang Li
Abstract:
Generating a dissimilarity matrix is typically the first step in big data analysis. Although numerous methods exist, such as Euclidean distance, Minkowski distance, Manhattan distance, Bray-Curtis dissimilarity, Jaccard similarity and Dice dissimilarity, it remains unclear which factors drive dissimilarity between groups. In this paper, we introduce an approach based on differences in moments and sparsity. We show that this method can delineate the key factors underlying group differences. For example, in biology, mean dissimilarity indicates differences driven by up- or down-regulated gene expression, standard deviation dissimilarity reflects the heterogeneity of response to treatment, and sparsity dissimilarity corresponds to differences prompted by the activation or silencing of genes. Through extensive reanalysis of genome, transcriptome, proteome, metabolome, immune profiling, microbiome, and social science datasets, we demonstrate insights not captured in previous studies. For instance, the reanalysis shows that the sparsity dissimilarity is as effective as the mean dissimilarity in predicting the alleviation effects of a COVID-19 drug, suggesting that sparsity dissimilarity is highly meaningful.
Submitted 4 June, 2024;
originally announced June 2024.
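As a rough illustration of the idea in the abstract above, the sketch below compares two groups of measurements for one feature by their difference in mean, standard deviation, and sparsity. The abstract does not give the paper's exact formulas, so the definitions used here (population standard deviation, sparsity as the fraction of exact zeros) and the function name `moment_sparsity_differences` are assumptions made for illustration only.

```python
from statistics import mean, pstdev


def moment_sparsity_differences(group_a, group_b):
    """Return (mean, std, sparsity) differences of group_b relative to group_a.

    Assumed definitions (not taken from the paper):
    - std is the population standard deviation
    - sparsity is the fraction of values that are exactly zero
    """
    def sparsity(values):
        return sum(1 for v in values if v == 0) / len(values)

    return (
        mean(group_b) - mean(group_a),
        pstdev(group_b) - pstdev(group_a),
        sparsity(group_b) - sparsity(group_a),
    )


# Hypothetical expression values for one gene in two conditions.
control = [0.0, 2.0, 4.0, 6.0]
treated = [0.0, 0.0, 8.0, 8.0]
d_mean, d_sd, d_sparsity = moment_sparsity_differences(control, treated)
print(d_mean, d_sd, d_sparsity)
```

Keeping the three differences separate, rather than folding them into one distance, is what lets each one be read as a distinct driver of group dissimilarity.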
-
No winners: Performance of lung cancer prediction models depends on screening-detected, incidental, and biopsied pulmonary nodule use cases
Authors:
Thomas Z. Li,
Kaiwen Xu,
Aravind Krishnan,
Riqiang Gao,
Michael N. Kammer,
Sanja Antic,
David Xiao,
Michael Knight,
Yency Martinez,
Rafael Paez,
Robert J. Lentz,
Stephen Deppen,
Eric L. Grogan,
Thomas A. Lasko,
Kim L. Sandler,
Fabien Maldonado,
Bennett A. Landman
Abstract:
Statistical models for predicting lung cancer have the potential to facilitate earlier diagnosis of malignancy and avoid invasive workup of benign disease. Many models have been published, but comparative studies of their utility in different clinical settings in which patients would arguably most benefit are scarce. This study retrospectively evaluated promising predictive models for lung cancer prediction in three clinical settings: lung cancer screening with low-dose computed tomography, incidentally detected pulmonary nodules, and nodules deemed suspicious enough to warrant a biopsy. We leveraged 9 cohorts (n=898, 896, 882, 219, 364, 117, 131, 115, 373) from multiple institutions to assess the area under the receiver operating characteristic curve (AUC) of validated models including logistic regressions on clinical variables and radiologist nodule characterizations, artificial intelligence on chest CTs, longitudinal imaging AI, and multi-modal approaches. We implemented each model from their published literature, re-training the models if necessary, and curated each cohort from primary data sources. We observed that model performance varied greatly across clinical use cases. No single predictive model emerged as a clear winner across all cohorts, but certain models excelled in specific clinical contexts. Single timepoint chest CT AI performed well in lung screening, but struggled to generalize to other clinical settings. Longitudinal imaging and multimodal models demonstrated comparatively promising performance on incidentally-detected nodules. However, when applied to nodules that underwent biopsy, all models underperformed. These results underscore the strengths and limitations of 8 validated predictive models and highlight promising directions towards personalized, noninvasive lung cancer diagnosis.
Submitted 16 May, 2024;
originally announced May 2024.
-
Synthesis and stability of biomolecules in C-H-O-N fluids at extreme pressure-temperature conditions
Authors:
Tao Li,
Nore Stolte,
Renbiao Tao,
Dimitri A. Sverjensky,
Isabelle Daniel,
Ding Pan
Abstract:
How life started on Earth is an unsolved mystery. There are various hypotheses for the location ranging from outer space to the seafloor, subseafloor or potentially deeper. Here, we applied extensive ab initio molecular dynamics (AIMD) simulations to study chemical reactions between NH3, H2O, H2, and CO at pressures (P) and temperatures (T) approximating the conditions of Earth's upper mantle (i.e., 10-13 GPa, 1000-1400 K). Contrary to the previous assumptions that larger organic molecules might readily disintegrate in aqueous solutions at extreme P-T conditions, we found that many organic compounds formed without any catalysts and persisted in C-H-O-N fluids under these extreme conditions, including glycine, ribose, urea, and uracil-like molecules. Particularly, our free energy calculations showed that the C-N bond is thermodynamically stable at 10 GPa and 1400 K. Moreover, while the pyranose (six-membered-ring) form of ribose is more stable than the furanose (five-membered-ring) form at ambient conditions, we observed the predominant formation of the five-membered-ring form of ribose at extreme conditions, which is consistent with the exclusive incorporation of $\beta$-D-ribofuranose in RNA. We have uncovered a previously unexplored pathway through which the building blocks of biomolecules may have originated in early Earth and other planets. Our findings contribute to an evolving understanding of the fundamental conditions necessary for life to arise.
Submitted 13 July, 2024; v1 submitted 8 May, 2024;
originally announced May 2024.
-
Infer metabolic directions and magnitudes from moment differences of mass-weighted intensity distributions
Authors:
Tuobang Li
Abstract:
Metabolic pathways are fundamental maps in biochemistry that detail how molecules are transformed through various reactions. Metabolomics refers to the large-scale study of small molecules. High-throughput, untargeted, mass spectrometry-based metabolomics experiments typically depend on libraries for structural annotation, which is necessary for pathway analysis. However, only a small fraction of spectra can be matched to known structures in these libraries and only a portion of annotated metabolites can be associated with specific pathways, considering that numerous pathways are yet to be discovered. The complexity of metabolic pathways, where a single compound can play a part in multiple pathways, poses an additional challenge. This study introduces a different concept: mass-weighted intensity distribution, which is the empirical distribution of the intensities times their associated m/z values. Analysis of COVID-19 and mouse brain datasets shows that by estimating the differences of the point estimates of these distributions, it becomes possible to infer the metabolic directions and magnitudes without requiring knowledge of the exact chemical structures of these compounds and their related pathways. The overall metabolic momentum map, named the momentome, has the potential to bypass the current bottleneck and provide fresh insights into metabolomics studies. This brief report thus provides a mathematical framing for a classic biological concept.
Submitted 28 February, 2024; v1 submitted 22 February, 2024;
originally announced February 2024.
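The mass-weighted intensity distribution described in the abstract above (each intensity multiplied by its m/z value) can be sketched in a few lines. The abstract does not say which point estimator the report uses, so the default of the arithmetic mean, the helper names, and the sample values below are all assumptions for illustration.

```python
from statistics import mean


def mass_weighted_intensities(mz_values, intensities):
    """Multiply each intensity by its associated m/z value."""
    if len(mz_values) != len(intensities):
        raise ValueError("mz_values and intensities must have the same length")
    return [mz * intensity for mz, intensity in zip(mz_values, intensities)]


def location_shift(mz_values, intensities_a, intensities_b, estimator=mean):
    """Difference of a point estimate between two mass-weighted distributions.

    The sign gives a direction and the absolute value a magnitude.
    The choice of estimator (mean by default) is an assumption.
    """
    weighted_a = mass_weighted_intensities(mz_values, intensities_a)
    weighted_b = mass_weighted_intensities(mz_values, intensities_b)
    return estimator(weighted_b) - estimator(weighted_a)


# Hypothetical peaks shared by two samples.
mz = [100.0, 200.0, 300.0]
sample_a = [1.0, 1.0, 1.0]
sample_b = [1.0, 2.0, 3.0]
shift = location_shift(mz, sample_a, sample_b)
print(shift)
```

Passing `statistics.median` as `estimator` gives a more outlier-resistant variant of the same comparison.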
-
A Survey of Generative AI for de novo Drug Design: New Frontiers in Molecule and Protein Generation
Authors:
Xiangru Tang,
Howard Dai,
Elizabeth Knight,
Fang Wu,
Yunyang Li,
Tianxiao Li,
Mark Gerstein
Abstract:
Artificial intelligence (AI)-driven methods can vastly improve the historically costly drug design process, with various generative models already in widespread use. Generative models for de novo drug design, in particular, focus on the creation of novel biological compounds entirely from scratch, representing a promising future direction. Rapid development in the field, combined with the inherent complexity of the drug design process, creates a difficult landscape for new researchers to enter. In this survey, we organize de novo drug design into two overarching themes: small molecule and protein generation. Within each theme, we identify a variety of subtasks and applications, highlighting important datasets, benchmarks, and model architectures and comparing the performance of top models. We take a broad approach to AI-driven drug design, allowing for both micro-level comparisons of various methods within each subtask and macro-level observations across different fields. We discuss parallel challenges and approaches between the two applications and highlight future directions for AI-driven de novo drug design as a whole. An organized repository of all covered sources is available at https://github.com/gersteinlab/GenAI4Drug.
Submitted 26 June, 2024; v1 submitted 13 February, 2024;
originally announced February 2024.
-
Tensor formalism for predicting synaptic connections with ensemble modeling or optimization
Authors:
Tirthabir Biswas,
Tianzhi Lambus Li,
James E. Fitzgerald
Abstract:
Theoretical neuroscientists often try to understand how the structure of a neural network relates to its function by focusing on structural features that would either follow from optimization or occur consistently across possible implementations. Both optimization theories and ensemble modeling approaches have repeatedly proven their worth, and it would simplify theory building considerably if predictions from both theory types could be derived and tested simultaneously. Here we show how tensor formalism from theoretical physics can be used to unify and solve many optimization and ensemble modeling approaches to predicting synaptic connectivity from neuronal responses. We specifically focus on analyzing the solution space of synaptic weights that allow a threshold-linear neural network to respond in a prescribed way to a limited number of input conditions. For optimization purposes, we compute the synaptic weight vector that minimizes an arbitrary quadratic loss function. For ensemble modeling, we identify synaptic weight features that occur consistently across all solutions bounded by an arbitrary ellipsoid. We derive a common solution to this suite of nonlinear problems by showing how each of them reduces to an equivalent linear problem that can be solved analytically. Although identifying the equivalent linear problem is nontrivial, our tensor formalism provides an elegant geometrical perspective that allows us to solve the problem approximately in an analytical way or exactly using numeric methods. The final algorithm is applicable to a wide range of interesting neuroscience problems, and the associated geometric insights may carry over to other scientific problems that require constrained optimization.
Submitted 18 June, 2024; v1 submitted 31 October, 2023;
originally announced October 2023.
-
On the Mathematics of RNA Velocity II: Algorithmic Aspects
Authors:
Tiejun Li,
Yizhuo Wang,
Guoguo Yang,
Peijie Zhou
Abstract:
In a previous paper [CSIAM Trans. Appl. Math. 2 (2021), 1-55], the authors proposed a theoretical framework for the analysis of RNA velocity, which is a promising concept in scRNA-seq data analysis to reveal the cell state-transition dynamical processes underlying snapshot data. The current paper is devoted to the algorithmic study of some key components in the RNA velocity workflow. Four important points are addressed in this paper: (1) We construct a rational time-scale fixation method which can determine the global gene-shared latent time for cells. (2) We present an uncertainty quantification strategy for the inferred parameters obtained through the EM algorithm. (3) We establish the optimal criterion for the choice of velocity kernel bandwidth with respect to the sample size in the downstream analysis and discuss its implications. (4) We propose a temporal distance estimation approach between two cell clusters along the cellular development path. Some illustrative numerical tests are also carried out to verify our analysis. These results are intended to provide tools and insights for the further development of RNA velocity type methods.
Submitted 9 June, 2023;
originally announced June 2023.
-
CancerGPT: Few-shot Drug Pair Synergy Prediction using Large Pre-trained Language Models
Authors:
Tianhao Li,
Sandesh Shetty,
Advaith Kamath,
Ajay Jaiswal,
Xianqian Jiang,
Ying Ding,
Yejin Kim
Abstract:
Large pre-trained language models (LLMs) have been shown to have significant potential in few-shot learning across various fields, even with minimal training data. However, their ability to generalize to unseen tasks in more complex fields, such as biology, has yet to be fully evaluated. LLMs can offer a promising alternative approach for biological inference, particularly in cases where structured data and sample size are limited, by extracting prior knowledge from text corpora. Our proposed few-shot learning approach uses LLMs to predict the synergy of drug pairs in rare tissues that lack structured data and features. Our experiments, which involved seven rare tissues from different cancer types, demonstrated that the LLM-based prediction model achieved significant accuracy with very few or zero samples. Our proposed model, the CancerGPT (with $\sim$ 124M parameters), was even comparable to the larger fine-tuned GPT-3 model (with $\sim$ 175B parameters). Our research is the first to tackle drug pair synergy prediction in rare tissues with limited data. We are also the first to utilize an LLM-based prediction model for biological reaction prediction tasks.
Submitted 17 April, 2023;
originally announced April 2023.
-
Connectivity based Real-Time fMRI Neurofeedback Training in Youth with a History of Major Depressive Disorder
Authors:
Xiaofu He,
Diana Rodriguez Moreno,
Zhenghua Hou,
Keely Cheslack-Postava,
Yanni Jiang,
Tong Li,
Ronit Kishon,
Larry Amsel,
George Musa,
Zhishun Wang,
Christina W. Hoven
Abstract:
Background: Real-time functional magnetic resonance imaging neurofeedback (rtfMRI-nf) has proven to be a powerful technique to help subjects to gauge and enhance emotional control. Traditionally, rtfMRI-nf has focused on emotional regulation through self-regulation of the amygdala. Recently, rtfMRI studies have observed that regulation of a target brain region is accompanied by connectivity changes beyond the target region. Therefore, the aim of the present study is to investigate the use of connectivity between the amygdala and prefrontal regions as the target of neurofeedback training in healthy individuals and subjects with a lifetime history of major depressive disorder (MDD) performing an emotion regulation task. Method: Ten remitted MDD subjects and twelve healthy controls (HC) performed an emotion regulation task in 4 runs of rtfMRI-nf training followed by one transfer run without neurofeedback conducted in a single session. The functional connectivity between the amygdala and prefrontal cortex was presented as a feedback bar concurrent with the emotion regulation task. Participants' emotional state was measured by the Positive and Negative Affect Schedule (PANAS) prior to and following the rtfMRI-nf. Psychological assessments were used to determine subjects' history of depression. Results: Participants with a history of MDD showed a trend of decreasing functional connectivity across the four rtfMRI-nf runs, and there was a marginally significant interaction between the MDD history and number of training runs. The HC group showed a significant increase of frontal cortex activation between the second and third neurofeedback runs. Comparing PANAS scores before and after connectivity-based rtfMRI-nf, we observed a significant decrease in negative PANAS score in the whole group overall, and a significant decrease in positive PANAS score in the MDD group alone.
Submitted 24 January, 2023;
originally announced January 2023.
-
Statistical learning methods for neuroimaging data analysis with applications
Authors:
Hongtu Zhu,
Tengfei Li,
Bingxin Zhao
Abstract:
The aim of this paper is to provide a comprehensive review of statistical challenges in neuroimaging data analysis from neuroimaging techniques to large-scale neuroimaging studies to statistical learning methods. We briefly review eight popular neuroimaging techniques and their potential applications in neuroscience research and clinical translation. We delineate the four common themes of neuroimaging data and review major image processing analysis methods for processing neuroimaging data at the individual level. We briefly review four large-scale neuroimaging-related studies and a consortium on imaging genomics and discuss four common themes of neuroimaging data analysis at the population level. We review nine major population-based statistical analysis methods and their associated statistical challenges and present recent progress in statistical methodology to address these challenges.
Submitted 17 October, 2022;
originally announced October 2022.
-
Disentangled Wasserstein Autoencoder for T-Cell Receptor Engineering
Authors:
Tianxiao Li,
Hongyu Guo,
Filippo Grazioli,
Mark Gerstein,
Martin Renqiang Min
Abstract:
In protein biophysics, the separation between the functionally important residues (forming the active site or binding surface) and those that create the overall structure (the fold) is a well-established and fundamental concept. Identifying and modifying those functional sites is critical for protein engineering but computationally non-trivial, and requires significant domain knowledge. To automate this process from a data-driven perspective, we propose a disentangled Wasserstein autoencoder with an auxiliary classifier, which isolates the function-related patterns from the rest with theoretical guarantees. This enables one-pass protein sequence editing and improves the understanding of the resulting sequences and editing actions involved. To demonstrate its effectiveness, we apply it to T-cell receptors (TCRs), a well-studied structure-function case. We show that our method can be used to alter the function of TCRs without changing the structural backbone, outperforming several competing methods in generation quality and efficiency, and requiring only 10% of the running time needed by baseline models. To our knowledge, this is the first approach that utilizes disentangled representations for TCR engineering.
Submitted 16 October, 2023; v1 submitted 14 October, 2022;
originally announced October 2022.
-
Time-distance vision transformers in lung cancer diagnosis from longitudinal computed tomography
Authors:
Thomas Z. Li,
Kaiwen Xu,
Riqiang Gao,
Yucheng Tang,
Thomas A. Lasko,
Fabien Maldonado,
Kim Sandler,
Bennett A. Landman
Abstract:
Features learned from single radiologic images are unable to provide information about whether and how much a lesion may be changing over time. Time-dependent features computed from repeated images can capture those changes and help identify malignant lesions by their temporal behavior. However, longitudinal medical imaging presents the unique challenge of sparse, irregular time intervals in data acquisition. While self-attention has been shown to be a versatile and efficient learning mechanism for time series and natural images, its potential for interpreting temporal distance between sparse, irregularly sampled spatial features has not been explored. In this work, we propose two interpretations of a time-distance vision transformer (ViT) by using (1) vector embeddings of continuous time and (2) a temporal emphasis model to scale self-attention weights. The two algorithms are evaluated based on benign versus malignant lung cancer discrimination of synthetic pulmonary nodules and lung screening computed tomography studies from the National Lung Screening Trial (NLST). Experiments evaluating the time-distance ViTs on synthetic nodules show a fundamental improvement in classifying irregularly sampled longitudinal images when compared to standard ViTs. In cross-validation on screening chest CTs from the NLST, our methods (0.785 and 0.786 AUC respectively) significantly outperform a cross-sectional approach (0.734 AUC) and match the discriminative performance of the leading longitudinal medical imaging algorithm (0.779 AUC) on benign versus malignant classification. This work represents the first self-attention-based framework for classifying longitudinal medical images. Our code is available at https://github.com/tom1193/time-distance-transformer.
Submitted 4 September, 2022;
originally announced September 2022.
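The second variant mentioned in the abstract above, scaling self-attention weights by temporal distance, can be illustrated with a small pure-Python sketch. This is not the authors' temporal emphasis model: the exponential decay, the fixed `decay` rate, the renormalization step, and the function name are assumptions chosen only to show how irregular acquisition times could modulate attention.

```python
import math


def time_scaled_attention(scores, times, decay=0.1):
    """Softmax each row of `scores`, then down-weight pairs far apart in time.

    scores: n x n list of raw attention logits
    times:  n acquisition times (any consistent unit)
    decay:  assumed exponential decay rate per unit of time distance
    Rows are renormalized so that each still sums to 1.
    """
    weights = []
    for i, row in enumerate(scores):
        row_max = max(row)  # subtract the max for numerical stability
        exp_row = [math.exp(s - row_max) for s in row]
        total = sum(exp_row)
        scaled = [
            (e / total) * math.exp(-decay * abs(times[i] - times[j]))
            for j, e in enumerate(exp_row)
        ]
        norm = sum(scaled)
        weights.append([s / norm for s in scaled])
    return weights


# Three scans acquired at irregular intervals, with uniform raw scores.
times = [0.0, 1.0, 10.0]
scores = [[0.0] * 3 for _ in range(3)]
weights = time_scaled_attention(scores, times)
print(weights[0])
```

With uniform scores, the first scan attends most to itself, less to the scan one time unit away, and least to the scan ten units away, which is the qualitative behavior a time-distance weighting is meant to produce.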
-
Calcium oscillation on homogeneous and heterogeneous networks of ryanodine receptor
Authors:
Zhong-Xue Gao,
Tian-Tian Li,
Han-Yu Jiang,
Jun He
Abstract:
Calcium oscillation is an important form of calcium homeostasis, imbalance of which is the key mechanism of initiation and progression of many major diseases. The formation and maintenance of calcium homeostasis are closely related to the spatial distribution of calcium channels. In the current paper, a theoretical framework is established by abstracting the spatial distribution of the calcium channels as a nonlinear biological complex network with calcium channels as nodes and Ca$^{2+}$ as edges. A dynamical model for a ryanodine receptor (RyR) is adopted to investigate the effect of spatial distribution on calcium oscillation. The mean-field model can be well reproduced from the complete graph and dense Erdős-Rényi network. The synchronization of RyRs is found to be important for generating a global calcium oscillation. The clique graph with a cluster structure cannot produce a global oscillation due to the failure of synchronization between clusters. A more realistic geometric network is constructed in a two-dimensional plane based on the experimental information about the RyR arrangement of clusters and the frequency distribution of cluster sizes. Unlike the clique graph, the geometric network can generate the global oscillation with reasonable parameters. The simulation also suggests that the existence of small clusters and rogue RyRs plays an important role in the maintenance of global calcium oscillation by keeping large clusters synchronized. Such results support the heterogeneous distribution of RyRs with different-size clusters, which is helpful for understanding recent observations with super-resolution nanoscale imaging techniques. The current theoretical framework can also be extended to investigate other phenomena in calcium signal transduction.
Submitted 1 February, 2023; v1 submitted 24 July, 2022;
originally announced July 2022.
-
A Deep Learning Approach to Predicting Ventilator Parameters for Mechanically Ventilated Septic Patients
Authors:
Zhijun Zeng,
Zhen Hou,
Ting Li,
Lei Deng,
Jianguo Hou,
Xinran Huang,
Jun Li,
Meirou Sun,
Yunhan Wang,
Qiyu Wu,
Wenhao Zheng,
Hua Jiang,
Qi Wang
Abstract:
We develop a deep learning approach to predicting a set of ventilator parameters for a mechanically ventilated septic patient using a long short-term memory (LSTM) recurrent neural network (RNN) model. We focus on short-term predictions of a set of ventilator parameters for the septic patient in the emergency intensive care unit (EICU). The short-term predictability of the model provides attending physicians with early warnings to make timely adjustments to the treatment of the patient in the EICU. The patient-specific deep learning model can be trained on any given critically ill patient, making it an intelligent aide for physicians to use in emergent medical situations.
Submitted 20 February, 2022;
originally announced February 2022.
-
Modeling bacterial flagellar motor with new structure information: Rotational dynamics of two interacting protein nano-rings
Authors:
Yuansheng Cao,
Tairan Li,
Yuhai Tu
Abstract:
In this article, we develop a mathematical model for the rotary bacterial flagellar motor (BFM) based on the recently discovered structure of the stator complex (MotA$_5$MotB$_2$). The structure suggested that the stator also rotates. The BFM is modeled as two rotating nano-rings that interact with each other. Specifically, translocation of protons through the stator complex drives rotation of the MotA pentamer ring, which in turn drives rotation of the FliG ring in the rotor via interactions between the MotA ring of the stator and the FliG ring of the rotor. Preliminary results from the structure-informed model are consistent with the observed torque-speed relation. More importantly, the model predicts distinctive rotor and stator dynamics and their load dependence, which may be tested by future experiments. Possible approaches to verify and improve the model to further understanding of the molecular mechanism for torque generation in BFM are also discussed.
Submitted 11 February, 2022;
originally announced February 2022.
-
Leaving Flatland: Advances in 3D behavioral measurement
Authors:
Jesse D. Marshall,
Tianqing Li,
Joshua H. Wu,
Timothy W. Dunn
Abstract:
Animals move in three dimensions (3D). Thus, 3D measurement is necessary to report the true kinematics of animal movement. Existing 3D measurement techniques draw on specialized hardware, such as motion capture or depth cameras, as well as deep multi-view and monocular computer vision. Continued advances at the intersection of deep learning and computer vision will facilitate 3D tracking across more anatomical features, with less training data, in additional species, and within more natural, occlusive environments. 3D behavioral measurement enables unique applications in phenotyping, investigating the neural basis of behavior, and designing artificial agents capable of imitating animal behavior.
Submitted 3 December, 2021;
originally announced December 2021.
-
Phylogenetic Study of 2019-nCoV by Using Alignment Free Method (Evolutionary Bifurcation of Novel Coronavirus Mutants)
Authors:
Yang Gao,
Tao Li,
Liaofu Luo
Abstract:
The phylogenetic tree of SARS-CoV-2 (nCov-19) viruses is reconstructed according to the similarity of genome sequences. The tree topology of Betacoronavirus is remarkably consistent with biologists' systematics. Because the tree construction contains enough information about virus mutants, it is suitable for studying the evolutionary relationship between novel coronavirus mutants transmitted among humans. The emergence of 14 main mutants is studied, and these strains can be classified into eight bifurcations of the phylogenetic tree. Three types of virus mutation are found, namely, the mutation among sub-branches of the same branch, the off-root mutation and the root-oriented mutation between large branches of the tree. From the viewpoint of the relation between viral mutation and host selection, we find that individuals with low immunity provide a special environment for the positive natural selection of virus evolution. This gives a mechanism to explain why large mutations between two distant branches generally occur in the nCov-19 phylogenetic tree. The finding is helpful for formulating strategies to control the spread of COVID-19.
Submitted 28 January, 2022; v1 submitted 2 March, 2020;
originally announced March 2020.
-
Simulating the Spread of Epidemics in China on the Multi-layer Transportation Network: Beyond the Coronavirus in Wuhan
Authors:
Tianyi Li
Abstract:
Based on the SEIR model and the modeling of urban transportation networks, a general-purpose simulator for the spread of epidemics in Chinese cities is built. The Chinese public transportation system between over 340 prefectural-level cities is modeled as a multi-layer bi-partite network, with layers representing different means of transportation (airlines, railways, sail routes and buses), and nodes divided into two categories (central cities, peripheral cities). At each city, an open-system SEIR model tracks the local spread of the disease, with population in- and out-flow exchanging with the overlying transportation network. The model accounts for (1) different transmissivities of the epidemic on different transportation media, (2) the transit of inbound flow at cities, (3) cross-infection on public transportation vehicles due to path overlap, and the realistic considerations that (4) the infected population are not entering public transportation and (5) the recovered population are not subject to repeated infections. The model could be used to simulate the city-level spread in China (and potentially other countries) of an arbitrary epidemic, characterized by its basic reproduction number, incubation period, infection period and zoonotic force, originated from any Chinese prefectural-level city(s), during the period before effective government interventions are implemented. Flowmaps are input into the system to trigger inter-city dynamics, assuming different flow strength, determined from empirical observation, within/between the bi-partite divisions of nodes. The model is used to simulate the 2019 Coronavirus epidemic in Wuhan; it shows that the framework is robust and reliable, and simulated results match public city-level datasets to an extraordinary extent.
Submitted 27 February, 2020;
originally announced February 2020.
-
The Alzheimer's Disease Prediction Of Longitudinal Evolution (TADPOLE) Challenge: Results after 1 Year Follow-up
Authors:
Razvan V. Marinescu,
Neil P. Oxtoby,
Alexandra L. Young,
Esther E. Bron,
Arthur W. Toga,
Michael W. Weiner,
Frederik Barkhof,
Nick C. Fox,
Arman Eshaghi,
Tina Toni,
Marcin Salaterski,
Veronika Lunina,
Manon Ansart,
Stanley Durrleman,
Pascal Lu,
Samuel Iddi,
Dan Li,
Wesley K. Thompson,
Michael C. Donohue,
Aviv Nahon,
Yarden Levy,
Dan Halbersberg,
Mariya Cohen,
Huiling Liao,
Tengfei Li,
et al. (71 additional authors not shown)
Abstract:
We present the findings of "The Alzheimer's Disease Prediction Of Longitudinal Evolution" (TADPOLE) Challenge, which compared the performance of 92 algorithms from 33 international teams at predicting the future trajectory of 219 individuals at risk of Alzheimer's disease. Challenge participants were required to make a prediction, for each month of a 5-year future time period, of three key outcomes: clinical diagnosis, Alzheimer's Disease Assessment Scale Cognitive Subdomain (ADAS-Cog13), and total volume of the ventricles. The methods used by challenge participants included multivariate linear regression, machine learning methods such as support vector machines and deep neural networks, as well as disease progression models. No single submission was best at predicting all three outcomes. For clinical diagnosis and ventricle volume prediction, the best algorithms strongly outperform simple baselines in predictive ability. However, for ADAS-Cog13 no single submitted prediction method was significantly better than random guesswork. Two ensemble methods based on taking the mean and median over all predictions, obtained top scores on almost all tasks. Better than average performance at diagnosis prediction was generally associated with the additional inclusion of features from cerebrospinal fluid (CSF) samples and diffusion tensor imaging (DTI). On the other hand, better performance at ventricle volume prediction was associated with inclusion of summary statistics, such as the slope or maxima/minima of biomarkers. TADPOLE's unique results suggest that current prediction algorithms provide sufficient accuracy to exploit biomarkers related to clinical diagnosis and ventricle volume, for cohort refinement in clinical trials for Alzheimer's disease. However, results call into question the usage of cognitive test scores for patient selection and as a primary endpoint in clinical trials.
Submitted 27 December, 2021; v1 submitted 9 February, 2020;
originally announced February 2020.
-
On an enhancement of RNA probing data using Information Theory
Authors:
Thomas J. X. Li,
Christian M. Reidys
Abstract:
Identifying the secondary structure of an RNA is crucial for understanding its diverse regulatory functions. This paper focuses on how to enhance target identification in a Boltzmann ensemble of structures via chemical probing data. We employ an information-theoretic approach to solve the problem by considering a variant of the Rényi-Ulam game. Our framework is centered around the ensemble tree, a hierarchical bi-partition of the input ensemble that is constructed by recursively querying whether or not a base pair of maximum information entropy is contained in the target. These queries are answered by relating local to global probing data, employing the modularity in RNA secondary structures. We show that the leaves of the tree consist of sub-samples exhibiting a distinguished structure with high probability. In particular, for a Boltzmann ensemble incorporating probing data, which is well established in the literature, the probability of our framework correctly identifying the target in the leaf is greater than $90\%$.
Submitted 12 September, 2019;
originally announced September 2019.
-
The block spectrum of RNA pseudoknot structures
Authors:
Thomas J. X. Li,
Christina S. Burris,
Christian M. Reidys
Abstract:
In this paper we analyze the length-spectrum of blocks in $γ$-structures. $γ$-structures are a class of RNA pseudoknot structures that plays a key role in the context of polynomial time RNA folding. A $γ$-structure is constructed by nesting and concatenating specific building components having topological genus at most $γ$. A block is a substructure enclosed by crossing maximal arcs with respect to the partial order induced by nesting. We show that, in uniformly generated $γ$-structures, there is a significant gap in this length-spectrum, i.e., there asymptotically almost surely exists a unique longest block of length at least $n-O(n^{1/2})$ and that with high probability any other block has finite length. For fixed $γ$, we prove that the length of the longest block converges to a discrete limit law, and that the distribution of short blocks of given length tends to a negative binomial distribution in the limit of long sequences. We refine this analysis to the length spectrum of blocks of specific pseudoknot types, such as H-type and kissing hairpins. Our results generalize the rainbow spectrum on secondary structures by the first and third authors and are being put into context with the structural prediction of long non-coding RNAs.
Submitted 12 June, 2018;
originally announced June 2018.
-
The rainbow-spectrum of RNA secondary structures
Authors:
Thomas J. X. Li,
Christian M. Reidys
Abstract:
In this paper we analyze the length-spectrum of rainbows in RNA secondary structures. A rainbow in a secondary structure is a maximal arc with respect to the partial order induced by nesting. We show that there is a significant gap in this length-spectrum. We prove that there asymptotically almost surely exists a unique longest rainbow of length at least $n-O(n^{1/2})$ and that with high probability any other rainbow has finite length. We show that the distribution of the length of the longest rainbow converges to a discrete limit law and that, for finite $k$, the distribution of rainbows of length $k$ becomes, for large $n$, a negative binomial distribution. We then put the results of this paper into context, comparing the analytical results with those observed in RNA minimum free energy structures and biological RNA structures, and relate our findings to the sparsification of folding algorithms.
Submitted 8 June, 2018;
originally announced June 2018.
-
Quantifying the accuracy of ancestral state prediction in a phylogenetic tree under maximum parsimony
Authors:
Lina Herbst,
Thomas Li,
Mike Steel
Abstract:
In phylogenetic studies, biologists often wish to estimate the ancestral discrete character state at an interior vertex $v$ of an evolutionary tree $T$ from the states that are observed at the leaves of the tree. A simple and fast estimation method, maximum parsimony, takes the ancestral state at $v$ to be any state that minimises the number of state changes in $T$ required to explain its evolution on $T$. In this paper, we investigate the reconstruction accuracy of this estimation method further, under a simple symmetric model of state change, and obtain a number of new results, both for 2-state characters and for $r$-state characters ($r>2$). Our results rely on establishing new identities and inequalities, based on a coupling argument that involves a simpler 'coin toss' approach to ancestral state reconstruction.
Submitted 1 May, 2018;
originally announced May 2018.
-
Discrete Dynamic Causal Modeling and Its Relationship with Directed Information
Authors:
Zhe Wang,
Yu Zheng,
David C. Zhu,
Jian Ren,
Tongtong Li
Abstract:
This paper explores discrete Dynamic Causal Modeling (DDCM) and its relationship with Directed Information (DI). We prove the conditional equivalence between DDCM and DI in characterizing the causal relationship between two brain regions. The theoretical results are demonstrated using fMRI data obtained under both resting state and stimulus-based state. Our numerical analysis is consistent with that reported in a previous study.
Submitted 18 September, 2017;
originally announced September 2017.
-
Bi-phase age-related brain gray matter magnetic resonance T1rho relaxation time change
Authors:
Yao T Li,
Huang Hua,
Zhizheng Zhang,
Puxuan Lu,
Weitian Chen,
Yixiang J Wang
Abstract:
Objectives: To investigate the normative value and age-related change of brain magnetic resonance T1rho relaxation at 1.5 T. Methods: 20 males (age: 40.7+/-15.5 years, range: 22-68 years) and 22 females (age: 38.5+/-14.8 years, range: 21-62 years) were scanned at 1.5 Tesla using a 3D fluid-suppressed turbo spin echo sequence. Regions of interest (ROIs) were obtained by atlas-based tissue segmentation, and T1rho was calculated by fitting the mean value to a monoexponential model. Correlation between T1rho relaxation of brain gray matter regions and age was investigated. Results: A regional difference among individual gray matter areas was noted, with the hippocampus (98.37+/-5.37 msec) and amygdala (94.95+/-4.34 msec) having the highest measurements, and the pallidum (83.81+/-5.49) and putamen (83.93+/-4.76) the lowest. T1rho values decreased slowly (mean slope: -0.256) and significantly (p<0.05) with age in gray matter for subjects younger than 40 years old, while for subjects older than 40 years old there was no significant correlation between T1rho relaxation and age. Conclusion: T1rho relaxation demonstrates a bi-phase change with age in adults of 22-68 years.
Submitted 25 October, 2016;
originally announced October 2016.
-
Comment on "On Uniqueness of SDE Decomposition in A-type Stochastic Integration" [arXiv:1603.07927v1]
Authors:
Peijie Zhou,
Tiejun Li
Abstract:
The uniqueness issue of the SDE decomposition theory proposed by Ao and his co-workers has recently been discussed. A comprehensive study investigating connections among different landscape theories [J. Chem. Phys. 144, 094109 (2016)] pointed out that the decomposition is generally not unique, while Ao et al. (arXiv:1603.07927v1) argue that such conclusions are "incorrect" because of the missing boundary conditions. In this comment, we combine a literature review with concrete examples to show that concrete and effective boundary conditions have not been proposed to guarantee uniqueness; hence the arguments in [arXiv:1603.07927v1] are not sufficient. Moreover, we show that the "uniqueness" of the O-U process decomposition referred to in the YTA paper cannot serve as a counterexample to ZL's result, since additional assumptions have been made implicitly beyond the original SDE decomposition framework, which cannot be applied to general nonlinear cases. Some other issues, such as the failure of the gradient expansion method, are also discussed. Our demonstration contributes to a better understanding of the relevant papers as well as the SDE decomposition theory.
Submitted 21 September, 2016;
originally announced September 2016.
-
Statistics of topological RNA structures
Authors:
Thomas J. X. Li,
Christian M. Reidys
Abstract:
In this paper we study properties of topological RNA structures, i.e., RNA contact structures with cross-serial interactions that are filtered by their topological genus. RNA secondary structures within this framework are topological structures having genus zero. We derive a new bivariate generating function whose singular expansion allows us to analyze the distributions of arcs, stacks, hairpin-, interior- and multi-loops. We then extend this analysis to H-type pseudoknots, kissing hairpins as well as $3$-knots and compute their respective expectation values. Finally we discuss our results and put them into context with data obtained by uniformly sampling structures of fixed genus.
Submitted 22 June, 2016;
originally announced June 2016.
-
RNA secondary structures having a compatible sequence of certain nucleotide ratios
Authors:
Christopher L. Barrett,
Thomas J. X. Li,
Christian M. Reidys
Abstract:
Given a random RNA secondary structure, $S$, we study RNA sequences having fixed ratios of nucleotides that are compatible with $S$. We perform this analysis for RNA secondary structures subject to various base pairing rules and minimum arc- and stack-length restrictions. Our main result reads as follows: in the simplex of the nucleotide ratios there exists a convex region in which, in the limit of long sequences, a random structure asymptotically almost surely (a.a.s.) has a compatible sequence with these ratios and outside of which a.a.s. a random structure has no such compatible sequence. We localize this region for RNA secondary structures subject to various base pairing rules and minimum arc- and stack-length restrictions. In particular, for GC-sequences having a ratio of G nucleotides smaller than $1/3$, a random RNA secondary structure without any minimum arc- and stack-length restrictions has a.a.s. no such compatible sequence. For sequences having a ratio of G nucleotides larger than $1/3$, a random RNA secondary structure has a.a.s. such compatible sequences. We discuss our results in the context of various families of RNA structures.
Submitted 11 March, 2016;
originally announced March 2016.
-
Non-Scanning Fiber-Optic Near-Infrared Beam Led to Two-Photon Optogenetic Stimulation In-Vivo
Authors:
Kamal R. Dhakal,
Ling Gu,
Shivaranjani Shivalingaiah,
Torry S. Dennis,
Samara A. Morris-Bobzean,
Ting Li,
Linda I. Perrotti,
Samarendra K. Mohanty
Abstract:
Stimulation of specific neurons expressing opsins in a targeted region to manipulate brain function has proved to be a powerful tool in neuroscience. However, the use of visible light for optogenetic stimulation is invasive due to low penetration depth and tissue damage owing to larger absorption and scattering. Here, we report, for the first time, in-depth non-scanning fiber-optic two-photon optogenetic stimulation (FO-TPOS) of neurons in-vivo in transgenic mouse models. In order to optimize the deep-brain stimulation strategy, we characterized two-photon activation efficacy at different near-infrared laser parameters. The significantly enhanced in-depth stimulation efficiency of FO-TPOS as compared to a conventional single-photon beam was demonstrated both by experiments and Monte Carlo simulation. The non-scanning FO-TPOS technology will lead to better understanding of the in-vivo neural circuitry because this technology permits more precise and less invasive anatomical delivery of stimulation.
Submitted 25 February, 2016;
originally announced February 2016.
-
Realization of Waddington's Metaphor: Potential Landscape, Quasi-potential, A-type Integral and Beyond
Authors:
Peijie Zhou,
Tiejun Li
Abstract:
Motivated by Waddington's famous epigenetic landscape metaphor in developmental biology, biophysicists and applied mathematicians have made different proposals to realize this metaphor in a rationalized way. We adopt comprehensive perspectives to systematically investigate three different but closely related realizations in recent literature: namely the potential landscape theory from the steady state distribution of stochastic differential equations (SDEs), the quasi-potential from the large deviation theory, and the construction through SDE decomposition and A-type integral. The connections among these theories are established in this paper. We demonstrate that the quasi-potential is the zero noise limit of the potential landscape. We also show that the potential function in the third proposal coincides with the quasi-potential. The most probable transition path by minimizing the Onsager-Machlup or Freidlin-Wentzell action functional is discussed as well. Furthermore, we compare the difference between local and global quasi-potential through the exchange of limit order for time and noise amplitude. As a consequence of such explorations, we arrive at the existence result for the SDE decomposition while denying its uniqueness in general cases. It is also clarified that the A-type integral is more appropriate to be applied to the decomposed SDEs rather than the original one. Our results contribute to a better understanding of existing landscape theories for biological systems.
Submitted 6 November, 2015;
originally announced November 2015.
-
Exposing ambiguities in a relation-extraction gold standard with crowdsourcing
Authors:
Tong Shu Li,
Benjamin M. Good,
Andrew I. Su
Abstract:
Semantic relation extraction is one of the frontiers of biomedical natural language processing research. Gold standards are key tools for advancing this research. It is challenging to generate these standards because of the high cost of expert time and the difficulty in establishing agreement between annotators. We implemented and evaluated a microtask crowdsourcing approach that can produce a gold standard for extracting drug-disease relations. The aggregated crowd judgment agreed with expert annotations from a pre-existing corpus on 43 of 60 sentences tested. The levels of crowd agreement varied in a similar manner to the levels of agreement among the original expert annotators. This work reinforces the power of crowdsourcing in the process of assembling gold standards for relation extraction. Further, it highlights the importance of exposing the levels of agreement between human annotators, expert or crowd, in gold standard corpora as these are reproducible signals indicating ambiguities in the data or in the annotation guidelines.
Submitted 22 May, 2015;
originally announced May 2015.
-
Electronic health record phenotyping improves detection and screening of type 2 diabetes in the general United States population: A cross-sectional, unselected, retrospective study
Authors:
Ariana E. Anderson,
Wesley T. Kerr,
April Thames,
Tong Li,
Jiayang Xiao,
Mark S. Cohen
Abstract:
Objectives: In the United States, 25% of people with type 2 diabetes are undiagnosed. Conventional screening models use limited demographic information to assess risk. We evaluated whether electronic health record (EHR) phenotyping could improve diabetes screening, even when records are incomplete and data are not recorded systematically across patients and practice locations. Methods: In this cross-sectional, retrospective study, data from 9,948 US patients between 2009 and 2012 were used to develop a pre-screening tool to predict current type 2 diabetes, using multivariate logistic regression. We compared (1) a full EHR model containing prescribed medications, diagnoses, and traditional predictive information, (2) a restricted EHR model where medication information was removed, and (3) a conventional model containing only traditional predictive information (BMI, age, gender, hypertensive and smoking status). We additionally used a random forests classification model to judge whether including additional EHR information could increase the ability to detect patients with type 2 diabetes on new patient samples. Results: Using a patient's full or restricted EHR to detect diabetes was superior to using basic covariates alone (p<0.001). The random forests model replicated on out-of-bag data. Migraines and cardiac dysrhythmias were negatively associated with type 2 diabetes, while acute bronchitis and herpes zoster were positively associated, among other factors. Conclusions: EHR phenotyping resulted in markedly superior detection of type 2 diabetes in a general US population, could increase the efficiency and accuracy of disease screening, and is capable of picking up signals in real-world records.
Submitted 10 January, 2015;
originally announced January 2015.
-
Isoform reconstruction using short RNA-Seq reads by maximum likelihood is NP-hard
Authors:
Tianyang Li,
Rui Jiang,
Xuegong Zhang
Abstract:
Maximum likelihood is a popular technique for isoform reconstruction. Here, we show that isoform reconstruction using short RNA-Seq reads by maximum likelihood is NP-hard.
Submitted 4 May, 2013;
originally announced May 2013.
-
Transition Path, Quasi-potential Energy Landscape and Stability of Genetic Switches
Authors:
Cheng Lv,
Xiaoguang Li,
Fangting Li,
Tiejun Li
Abstract:
One of the fundamental cellular processes governed by genetic regulatory networks is the transition among different states under intrinsic and extrinsic noise. Based on a two-state genetic switching model with positive feedback, we develop a framework to understand metastability in gene expression. This framework includes identifying the transition path, reconstructing the global quasi-potential energy landscape, and analyzing the uphill and downhill transition paths. It is successfully used to investigate the stability of genetic switching models and the fluctuation properties in different regimes of gene expression with positive feedback. The quasi-potential energy landscape, which is the rationalized version of the Waddington potential, provides a quantitative tool to understand metastability in more general biological processes with intrinsic noise.
Submitted 17 December, 2012;
originally announced December 2012.
-
The topological filtration of $γ$-structures
Authors:
Thomas J. X. Li,
Christian M. Reidys
Abstract:
In this paper we study $γ$-structures filtered by topological genus. $γ$-structures are a class of RNA pseudoknot structures that plays a key role in the context of polynomial time folding of RNA pseudoknot structures. A $γ$-structure is composed of specific building blocks that have topological genus less than or equal to $γ$, where composition means concatenation and nesting of such blocks. Our main results are the derivation of a new bivariate generating function for $γ$-structures via symbolic methods, the singularity analysis of the solutions, and a central limit theorem for the distribution of topological genus in $γ$-structures of given length. In our derivation, specific bivariate polynomials play a central role. Their coefficients count particular motifs of fixed topological genus, and they are of relevance in the context of genus recursion and novel folding algorithms.
Submitted 6 February, 2012;
originally announced February 2012.
-
Combinatorial analysis of interacting RNA molecules
Authors:
Thomas J. X. Li,
Christian M. Reidys
Abstract:
Recently, several minimum free energy (MFE) folding algorithms for predicting the joint structure of two interacting RNA molecules have been proposed. Their folding targets are interaction structures, which can be represented as diagrams with two backbones drawn horizontally on top of each other such that (1) intramolecular and intermolecular bonds are noncrossing and (2) there is no "zig-zag" configuration. This paper studies joint structures with arc-length at least four in which both interior and exterior stack-lengths are at least two (no isolated arcs). The key idea in this paper is to consider a new type of shape, based on which joint structures can be derived via symbolic enumeration. Our results imply simple asymptotic formulas for the number of joint structures with surprisingly small exponential growth rates. They are of interest in the context of designing prediction algorithms for RNA-RNA interactions.
Submitted 21 June, 2010;
originally announced June 2010.
-
Combinatorics of RNA-RNA interaction
Authors:
Thomas J. X. Li,
Christian M. Reidys
Abstract:
RNA-RNA binding is an important phenomenon observed for many classes of non-coding RNAs and plays a crucial role in a number of regulatory processes. Recently, several MFE folding algorithms for predicting the joint structure of two interacting RNA molecules have been proposed. Here, joint structure means that in a diagram representation the intramolecular bonds of each partner are pseudoknot-free, that the intermolecular binding pairs are noncrossing, and that there is no so-called "zig-zag" configuration. This paper presents the combinatorics of RNA interaction structures, including their generating function, singularity analysis, and explicit recurrence relations. In particular, our results imply simple asymptotic formulas for the number of joint structures.
Submitted 15 June, 2010;
originally announced June 2010.
-
Force unfolding kinetics of RNA using optical tweezers. II. Modeling experiments
Authors:
M. Manosas,
J. -D. Wen,
P. T. X. Li,
S. B. Smith,
C. Bustamante,
I. Tinoco, Jr.,
F. Ritort
Abstract:
By exerting mechanical force, it is possible to unfold/refold RNA molecules one at a time. In a small range of forces, an RNA molecule can hop between the folded and the unfolded state with force-dependent kinetic rates. Here, we introduce a mesoscopic model to analyze the hopping kinetics of RNA hairpins in an optical tweezers setup. The model includes different elements of the experimental setup (beads, handles and RNA sequence) and limitations of the instrument (time lag of the force-feedback mechanism and finite bandwidth of data acquisition). We investigated the influence of the instrument on the measured hopping rates. Results from the model are in good agreement with the experiments reported in the companion article (1). The comparison between theory and experiments allowed us to infer the values of the intrinsic molecular rates of the RNA hairpin alone and to search for the optimal experimental conditions for the measurements. We conclude that long handles and soft laser traps represent the best conditions to extract rate estimates that are closest to the intrinsic molecular rates. The methodology and rationale presented here can be applied to other experimental setups and other molecules.
Submitted 4 July, 2007;
originally announced July 2007.
-
Force unfolding kinetics of RNA using optical tweezers. I. Effects of experimental variables on measured results
Authors:
J. -D. Wen,
M. Manosas,
P. T. X. Li,
S. B. Smith,
C. Bustamante,
F. Ritort,
I. Tinoco Jr
Abstract:
Experimental variables of optical tweezers instrumentation that affect RNA folding/unfolding kinetics were investigated. A model RNA hairpin, P5ab, was attached to two micron-sized beads through hybrid RNA/DNA handles; one bead was trapped by dual-beam lasers and the other was held by a micropipette. Several experimental variables were changed while measuring the unfolding/refolding kinetics, including handle lengths, trap stiffness, and modes of force applied to the molecule. In constant-force mode, where the tension applied to the RNA was maintained through feedback control, the measured rate coefficients varied within 40% when the handle lengths were changed 10-fold (1.1 to 10.2 kbp); they increased two- to three-fold when the trap stiffness was lowered to one third (from 0.1 to 0.035 pN/nm). In the passive mode, without feedback control and where the force applied to the RNA varied in response to the end-to-end distance change of the tether, the RNA hopped between a high-force folded state and a low-force unfolded state. In this mode, the rates increased up to two-fold with longer handles or softer traps. Overall, the measured rates remained within the same order of magnitude over the wide range of conditions studied. In the companion paper (1), we analyze how the measured kinetic parameters differ from the intrinsic molecular rates of the RNA, and thus how to obtain the molecular rates.
Submitted 4 July, 2007;
originally announced July 2007.