-
Bayesian estimation of Differential Transcript Usage from RNA-seq data
Authors:
Panagiotis Papastamoulis,
Magnus Rattray
Abstract:
Next generation sequencing allows the identification of genes consisting of differentially expressed transcripts, a term which usually refers to changes in the overall expression level. A specific type of differential expression is differential transcript usage (DTU) and targets changes in the relative within gene expression of a transcript. The contribution of this paper is to: (a) extend the use…
▽ More
Next generation sequencing allows the identification of genes consisting of differentially expressed transcripts, a term which usually refers to changes in the overall expression level. A specific type of differential expression is differential transcript usage (DTU) and targets changes in the relative within gene expression of a transcript. The contribution of this paper is to: (a) extend the use of cjBitSeq to the DTU context, a previously introduced Bayesian model which is originally designed for identifying changes in overall expression levels and (b) propose a Bayesian version of DRIMSeq, a frequentist model for inferring DTU. cjBitSeq is a read based model and performs fully Bayesian inference by MCMC sampling on the space of latent state of each transcript per gene. BayesDRIMSeq is a count based model and estimates the Bayes Factor of a DTU model against a null model using Laplace's approximation. The proposed models are benchmarked against the existing ones using a recent independent simulation study as well as a real RNA-seq dataset. Our results suggest that the Bayesian methods exhibit similar performance with DRIMSeq in terms of precision/recall but offer better calibration of False Discovery Rate.
△ Less
Submitted 28 September, 2017; v1 submitted 11 January, 2017;
originally announced January 2017.
-
Identifying stochastic oscillations in single-cell live imaging time series using Gaussian processes
Authors:
Nick E. Phillips,
Cerys Manning,
Nancy Papalopulu,
Magnus Rattray
Abstract:
Multiple biological processes are driven by oscillatory gene expression at different time scales. Pulsatile dynamics are thought to be widespread, and single-cell live imaging of gene expression has lead to a surge of dynamic, possibly oscillatory, data for different gene networks. However, the regulation of gene expression at the level of an individual cell involves reactions between finite numbe…
▽ More
Multiple biological processes are driven by oscillatory gene expression at different time scales. Pulsatile dynamics are thought to be widespread, and single-cell live imaging of gene expression has lead to a surge of dynamic, possibly oscillatory, data for different gene networks. However, the regulation of gene expression at the level of an individual cell involves reactions between finite numbers of molecules, and this can result in inherent randomness in expression dynamics, which blurs the boundaries between aperiodic fluctuations and noisy oscillators. Thus, there is an acute need for an objective statistical method for classifying whether an experimentally derived noisy time series is periodic. Here we present a new data analysis method that combines mechanistic stochastic modelling with the powerful methods of non-parametric regression with Gaussian processes. Our method can distinguish oscillatory gene expression from random fluctuations of non-oscillatory expression in single-cell time series, despite peak-to-peak variability in period and amplitude of single-cell oscillations. We show that our method outperforms the Lomb-Scargle periodogram in successfully classifying cells as oscillatory or non-oscillatory in data simulated from a simple genetic oscillator model and in experimental data. Analysis of bioluminescent live cell imaging shows a significantly greater number of oscillatory cells when luciferase is driven by a {\it Hes1} promoter (10/19), which has previously been reported to oscillate, than the constitutive MoMuLV 5' LTR (MMLV) promoter (0/25). The method can be applied to data from any gene network to both quantify the proportion of oscillating cells within a population and to measure the period and quality of oscillations. It is publicly available as a MATLAB package.
△ Less
Submitted 25 May, 2017; v1 submitted 23 August, 2016;
originally announced August 2016.
-
Inferring the perturbation time from biological time course data
Authors:
Jing Yang,
Christopher A. Penfold,
Murray R. Grant,
Magnus Rattray
Abstract:
Time course data are often used to study the changes to a biological process after perturbation. Statistical methods have been developed to determine whether such a perturbation induces changes over time, e.g. comparing a perturbed and unperturbed time course dataset to uncover differences. However, existing methods do not provide a principled statistical approach to identify the specific time whe…
▽ More
Time course data are often used to study the changes to a biological process after perturbation. Statistical methods have been developed to determine whether such a perturbation induces changes over time, e.g. comparing a perturbed and unperturbed time course dataset to uncover differences. However, existing methods do not provide a principled statistical approach to identify the specific time when the two time course datasets first begin to diverge after a perturbation; we call this the perturbation time. Estimation of the perturbation time for different variables in a biological process allows us to identify the sequence of events following a perturbation and therefore provides valuable insights into likely causal relationships.
In this paper, we propose a Bayesian method to infer the perturbation time given time course data from a wild-type and perturbed system. We use a non-parametric approach based on Gaussian Process regression. We derive a probabilistic model of noise-corrupted and replicated time course data coming from the same profile before the perturbation time and diverging after the perturbation time. The likelihood function can be worked out exactly for this model and the posterior distribution of the perturbation time is obtained by a simple histogram approach, without recourse to complex approximate inference algorithms. We validate the method on simulated data and apply it to study the transcriptional change occurring in Arabidopsis following inoculation with P. syringae pv. tomato DC3000 versus the disarmed strain DC3000hrpA.
An R package, DEtime, implementing the method is available at https://github.com/ManchesterBioinference/DEtime along with the data and code required to reproduce all the results.
△ Less
Submitted 4 February, 2016;
originally announced February 2016.
-
Genome-wide modelling of transcription kinetics reveals patterns of RNA production delays
Authors:
Antti Honkela,
Jaakko Peltonen,
Hande Topa,
Iryna Charapitsa,
Filomena Matarese,
Korbinian Grote,
Hendrik G. Stunnenberg,
George Reid,
Neil D. Lawrence,
Magnus Rattray
Abstract:
Genes with similar transcriptional activation kinetics can display very different temporal mRNA profiles due to differences in transcription time, degradation rate and RNA processing kinetics. Recent studies have shown that a splicing-associated RNA production delay can be significant. We introduce a joint model of transcriptional activation and mRNA accumulation which can be used for inference of…
▽ More
Genes with similar transcriptional activation kinetics can display very different temporal mRNA profiles due to differences in transcription time, degradation rate and RNA processing kinetics. Recent studies have shown that a splicing-associated RNA production delay can be significant. We introduce a joint model of transcriptional activation and mRNA accumulation which can be used for inference of transcription rate, RNA production delay and degradation rate given genome-wide data from high-throughput sequencing time course experiments. We combine a mechanistic differential equation model with a non-parametric statistical modelling approach allowing us to capture a broad range of activation kinetics, and use Bayesian parameter estimation to quantify the uncertainty in the estimates of the kinetic parameters. We apply the model to data from estrogen receptor (ER-α) activation in the MCF-7 breast cancer cell line. We use RNA polymerase II (pol-II) ChIP-Seq time course data to characterise transcriptional activation and mRNA-Seq time course data to quantify mature transcripts. We find that 11% of genes with a good signal in the data display a delay of more than 20 minutes between completing transcription and mature mRNA production. The genes displaying these long delays are significantly more likely to be short. We also find a statistical association between high delay and late intron retention in pre-mRNA data, indicating significant splicing-associated production delays in many genes.
△ Less
Submitted 16 July, 2015; v1 submitted 3 March, 2015;
originally announced March 2015.
-
Fast and accurate approximate inference of transcript expression from RNA-seq data
Authors:
James Hensman,
Panagiotis Papastamoulis,
Peter Glaus,
Antti Honkela,
Magnus Rattray
Abstract:
Motivation: Assigning RNA-seq reads to their transcript of origin is a fundamental task in transcript expression estimation. Where ambiguities in assignments exist due to transcripts sharing sequence, e.g. alternative isoforms or alleles, the problem can be solved through probabilistic inference. Bayesian methods have been shown to provide accurate transcript abundance estimates compared to compet…
▽ More
Motivation: Assigning RNA-seq reads to their transcript of origin is a fundamental task in transcript expression estimation. Where ambiguities in assignments exist due to transcripts sharing sequence, e.g. alternative isoforms or alleles, the problem can be solved through probabilistic inference. Bayesian methods have been shown to provide accurate transcript abundance estimates compared to competing methods. However, exact Bayesian inference is intractable and approximate methods such as Markov chain Monte Carlo (MCMC) and Variational Bayes (VB) are typically used. While providing a high degree of accuracy and modelling flexibility, standard implementations can be prohibitively slow for large datasets and complex transcriptome annotations.
Results: We propose a novel approximate inference scheme based on VB and apply it to an existing model of transcript expression inference from RNA-seq data. Recent advances in VB algorithmics are used to improve the convergence of the algorithm beyond the standard Variational Bayes Expectation Maximisation (VBEM) algorithm. We apply our algorithm to simulated and biological datasets, demonstrating a significant increase in speed with only very small loss in accuracy of expression level estimation. We carry out a comparative study against seven popular alternative methods and demonstrate that our new algorithm provides excellent accuracy and inter-replicate consistency while remaining competitive in computation time.
Availability: The methods were implemented in R and C++, and are available as part of the BitSeq project at \url{https://github.com/BitSeq}. The method is also available through the BitSeq Bioconductor package. The source code to reproduce all simulation results can be accessed via \url{https://github.com/BitSeq/BitSeqVB_benchmarking}.
△ Less
Submitted 30 June, 2015; v1 submitted 18 December, 2014;
originally announced December 2014.
-
A Bayesian model selection approach for identifying differentially expressed transcripts from RNA-Seq data
Authors:
Panagiotis Papastamoulis,
Magnus Rattray
Abstract:
Recent advances in molecular biology allow the quantification of the transcriptome and scoring transcripts as differentially or equally expressed between two biological conditions. Although these two tasks are closely linked, the available inference methods treat them separately: a primary model is used to estimate expression and its output is post-processed using a differential expression model.…
▽ More
Recent advances in molecular biology allow the quantification of the transcriptome and scoring transcripts as differentially or equally expressed between two biological conditions. Although these two tasks are closely linked, the available inference methods treat them separately: a primary model is used to estimate expression and its output is post-processed using a differential expression model. In this paper, both issues are simultaneously addressed by proposing the joint estimation of expression levels and differential expression: the unknown relative abundance of each transcript can either be equal or not between two conditions. A hierarchical Bayesian model builds upon the BitSeq framework and the posterior distribution of transcript expression and differential expression is inferred using Markov Chain Monte Carlo (MCMC). It is shown that the proposed model enjoys conjugacy for fixed dimension variables, thus the full conditional distributions are analytically derived. Two samplers are constructed, a reversible jump MCMC sampler and a collapsed Gibbs sampler, and the latter is found to perform best. A cluster representation of the aligned reads to the transcriptome is introduced, allowing parallel estimation of the marginal posterior distribution of subsets of transcripts under reasonable computing time. The proposed algorithm is benchmarked against alternative methods using synthetic datasets and applied to real RNA-sequencing data. Source code is available online (https://github.com/mqbssppe/cjBitSeq).
△ Less
Submitted 26 September, 2016; v1 submitted 9 December, 2014;
originally announced December 2014.
-
Fast Approximate Inference of Transcript Expression Levels from RNA-seq Data
Authors:
James Hensman,
Peter Glaus,
Antti Honkela,
Magnus Rattray
Abstract:
Motivation: The mapping of RNA-seq reads to their transcripts of origin is a fundamental task in transcript expression estimation and differential expression scoring. Where ambiguities in mapping exist due to transcripts sharing sequence, e.g. alternative isoforms or alleles, the problem becomes an instance of non-trivial probabilistic inference. Bayesian inference in such a problem is intractable…
▽ More
Motivation: The mapping of RNA-seq reads to their transcripts of origin is a fundamental task in transcript expression estimation and differential expression scoring. Where ambiguities in mapping exist due to transcripts sharing sequence, e.g. alternative isoforms or alleles, the problem becomes an instance of non-trivial probabilistic inference. Bayesian inference in such a problem is intractable and approximate methods must be used such as Markov chain Monte Carlo (MCMC) and Variational Bayes. Standard implementations of these methods can be prohibitively slow for large datasets and complex gene models.
Results: We propose an approximate inference scheme based on Variational Bayes applied to an existing model of transcript expression inference from RNA-seq data. We apply recent advances in Variational Bayes algorithmics to improve the convergence of the algorithm beyond the standard variational expectation-maximisation approach. We apply our algorithm to simulated and biological datasets, demonstrating that the increase in speed requires only a small trade-off in accuracy of expression level estimation.
Availability: The methods were implemented in R and C++, and are available as part of the BitSeq project at https://code.google.com/p/bitseq/. The methods will be made available through the BitSeq Bioconductor package at the next stable release.
△ Less
Submitted 27 January, 2015; v1 submitted 27 August, 2013;
originally announced August 2013.
-
Inference of RNA Polymerase II Transcription Dynamics from Chromatin Immunoprecipitation Time Course Data
Authors:
Ciira wa Maina,
Antti Honkela,
Filomena Matarese,
Korbinian Grote,
Hendrik G. Stunnenberg,
George Reid,
Neil D. Lawrence,
Magnus Rattray
Abstract:
Gene transcription mediated by RNA polymerase II (pol-II) is a key step in gene expression. The dynamics of pol-II moving along the transcribed region influence the rate and timing of gene expression. In this work we present a probabilistic model of transcription dynamics which is fitted to pol-II occupancy time course data measured using ChIP-Seq. The model can be used to estimate transcription s…
▽ More
Gene transcription mediated by RNA polymerase II (pol-II) is a key step in gene expression. The dynamics of pol-II moving along the transcribed region influence the rate and timing of gene expression. In this work we present a probabilistic model of transcription dynamics which is fitted to pol-II occupancy time course data measured using ChIP-Seq. The model can be used to estimate transcription speed and to infer the temporal pol-II activity profile at the gene promoter. Model parameters are estimated using either maximum likelihood estimation or via Bayesian inference using Markov chain Monte Carlo sampling. The Bayesian approach provides confidence intervals for parameter estimates and allows the use of priors that capture domain knowledge, e.g. the expected range of transcription speeds, based on previous experiments. The model describes the movement of pol-II down the gene body and can be used to identify the time of induction for transcriptionally engaged genes. By clustering the inferred promoter activity time profiles, we are able to determine which genes respond quickly to stimuli and group genes that share activity profiles and may therefore be co-regulated. We apply our methodology to biological data obtained using ChIP-seq to measure pol-II occupancy genome-wide when MCF-7 human breast cancer cells are treated with estradiol (E2). The transcription speeds we obtain agree with those obtained previously for smaller numbers of genes with the advantage that our approach can be applied genome-wide. We validate the biological significance of the pol-II promoter activity clusters by investigating cluster-specific transcription factor binding patterns and determining canonical pathway enrichment. We find that rapidly induced genes are enriched for both estrogen receptor alpha (ER$α$) and FOXA1 binding in their proximal promoter regions.
△ Less
Submitted 5 March, 2014; v1 submitted 20 March, 2013;
originally announced March 2013.
-
Identifying differentially expressed transcripts from RNA-seq data with biological variation
Authors:
Peter Glaus,
Antti Honkela,
Magnus Rattray
Abstract:
Motivation: High-throughput sequencing enables expression analysis at the level of individual transcripts. The analysis of transcriptome expression levels and differential expression estimation requires a probabilistic approach to properly account for ambiguity caused by shared exons and finite read sampling as well as the intrinsic biological variance of transcript expression.
Results: We prese…
▽ More
Motivation: High-throughput sequencing enables expression analysis at the level of individual transcripts. The analysis of transcriptome expression levels and differential expression estimation requires a probabilistic approach to properly account for ambiguity caused by shared exons and finite read sampling as well as the intrinsic biological variance of transcript expression.
Results: We present BitSeq (Bayesian Inference of Transcripts from Sequencing data), a Bayesian approach for estimation of transcript expression level from RNA-seq experiments. Inferred relative expression is represented by Markov chain Monte Carlo (MCMC) samples from the posterior probability distribution of a generative model of the read data. We propose a novel method for differential expression analysis across replicates which propagates uncertainty from the sample-level model while modelling biological variance using an expression-level-dependent prior. We demonstrate the advantages of our method using simulated data as well as an RNA-seq dataset with technical and biological replication for both studied conditions.
Availability: The implementation of the transcriptome expression estimation and differential expression analysis, BitSeq, has been written in C++.
△ Less
Submitted 5 March, 2012; v1 submitted 5 September, 2011;
originally announced September 2011.
-
RNA-based Phylogenetic Methods: Application to Mammalian Mitochondrial RNA Sequences
Authors:
Cendrine Hudelot,
Vivek Gowri-Shankar,
Howsun Jow,
Magnus Rattray,
Paul G. Higgs
Abstract:
The PHASE software package allows phylogenetic tree construction with a number of evolutionary models designed specifically for use with RNA sequences that have conserved secondary structure. Evolution in the paired regions of RNAs occurs via compensatory substitutions, hence changes on either side of a pair are correlated. Accounting for this correlation is important for phylogenetic inference…
▽ More
The PHASE software package allows phylogenetic tree construction with a number of evolutionary models designed specifically for use with RNA sequences that have conserved secondary structure. Evolution in the paired regions of RNAs occurs via compensatory substitutions, hence changes on either side of a pair are correlated. Accounting for this correlation is important for phylogenetic inference because it affects the likelihood calculation. In the present study we use the complete set of tRNA and rRNA sequences from 69 complete mammalian mitochondrial genomes. The likelihood calculation uses two evolutionary models simultaneously for different parts of the sequence: a paired-site model for the paired sites and a single-site model for the unpaired sites. We use Bayesian phylogenetic methods and a Markov chain Monte Carlo algorithm is used to obtain the most probable trees and posterior probabilities of clades. The results are well resolved for almost all the important branches on the mammalian tree. They support the arrangement of mammalian orders within the four supra-ordinal clades that have been identified by studies of much larger data sets mainly comprising nuclear genes. Groups such as the hedgehogs and the murid rodents, which have been problematic in previous studies with mitochondrial proteins, appear in their expected position with the other members of their order. Our choice of genes and evolutionary model appears to be more reliable and less subject to biases caused by variation in base composition than previous studies with mitochondrial genomes.
△ Less
Submitted 23 April, 2004;
originally announced April 2004.
-
The Evolution of tRNA-Leu Genes in Animal Mitochondrial Genomes
Authors:
P. G. Higgs,
D. Jameson,
H. Jow,
M. Rattray
Abstract:
Animal mitochondrial genomes usually have two transfer RNAs for Leucine: one, with anticodon UAG, translates the four-codon family CUN, whilst the other, with anticodon UAA, translates the two-codon family UUR. These two genes must differ at the third anticodon position, but in some species the genes differ at many additional sites, indicating that these genes have been independent for a long ti…
▽ More
Animal mitochondrial genomes usually have two transfer RNAs for Leucine: one, with anticodon UAG, translates the four-codon family CUN, whilst the other, with anticodon UAA, translates the two-codon family UUR. These two genes must differ at the third anticodon position, but in some species the genes differ at many additional sites, indicating that these genes have been independent for a long time. Duplication and deletion of genes in mitochondrial genomes occurs frequently during the evolution of the Metazoa. If a tRNA-Leu gene were duplicated and a substitution occurred in the anticodon, this would effectively turn one type of tRNA into the other. The original copy of the second tRNA type might then be lost by a deletion elsewhere in the genome. There are several groups of species in which the two tRNA-Leu genes occur next to one another (or very close) on the genome, which suggests that tandem duplication has occurred. Here we use RNA-specific phylogenetic methods to determine evolutionary trees for both genes. We present evidence that the process of duplication, anticodon mutation and deletion of tRNA-Leu genes has occurred at least five times during the evolution of the Metazoa - once in the common ancestor of all Protostomes, once in the common ancestor of Echinoderms and Hemichordates, once in the hermit crab, and twice independently in Molluscs.
△ Less
Submitted 23 October, 2003;
originally announced October 2003.
-
Cumulant Dynamics of a Population under Multiplicative Selection, Mutation and Drift
Authors:
Magnus Rattray,
Jonathan L. Shapiro
Abstract:
We revisit the classical population genetics model of a population evolving under multiplicative selection, mutation and drift. The number of beneficial alleles in a multi-locus system can be considered a trait under exponential selection. Equations of motion are derived for the cumulants of the trait distribution in the diffusion limit and under the assumption of linkage equilibrium. Because of…
▽ More
We revisit the classical population genetics model of a population evolving under multiplicative selection, mutation and drift. The number of beneficial alleles in a multi-locus system can be considered a trait under exponential selection. Equations of motion are derived for the cumulants of the trait distribution in the diffusion limit and under the assumption of linkage equilibrium. Because of the additive nature of cumulants, this reduces to the problem of determining equations of motion for the expected allele distribution cumulants at each locus. The cumulant equations form an infinite dimensional linear system and in an authored appendix Adam Prugel-Bennett provides a closed form expression for these equations. We derive approximate solutions which are shown to describe the dynamics well for a broad range of parameters. In particular, we introduce two approximate analytical solutions: (1) Perturbation theory is used to solve the dynamics for weak selection and arbitrary mutation rate. The resulting expansion for the system's eigenvalues reduces to the known diffusion theory results for the limiting cases with either mutation or selection absent. (2) For low mutation rates we observe a separation of time-scales between the slowest mode and the rest which allows us to develop an approximate analytical solution for the dominant slow mode. The solution is consistent with the perturbation theory result and provides a good approximation for much stronger selection intensities.
△ Less
Submitted 5 June, 2001; v1 submitted 23 July, 1999;
originally announced July 1999.