RNA Bioinformatics
Methods in Molecular Biology
Series Editor
John M. Walker
School of Life and Medical Sciences
University of Hertfordshire
Hatfield, Hertfordshire, AL10 9AB, UK
Edited by
Ernesto Picardi
Department of Biosciences, Biotechnology and Biopharmaceutics,
University of Bari, Bari, Italy
Editor
Ernesto Picardi
Department of Biosciences
Biotechnology and Biopharmaceutics
University of Bari
Bari, Italy
Institute of Biomembranes and Bioenergetics
National Research Council
Bari, Italy
National Institute of Biostructures and Biosystems (INBB)
Rome, Italy
RNA is a versatile nucleic acid polymer with a structure analogous to single-stranded DNA,
although its backbone contains ribose rather than deoxyribose and the base thymine (T)
is replaced by uracil (U). In contrast to DNA, RNA molecules are less stable and are
characterized by secondary and tertiary structures, which underlie their diverse functions.
According to the central dogma of molecular biology, RNA is directly involved in the flow
of genetic information from DNA to proteins. However, recent developments in
molecular biology indicate that RNA molecules have a plethora of functional roles and are
indispensable for living organisms and cell homeostasis. Indeed, on the basis of their
functions, RNA molecules can be divided into two groups, coding and noncoding. While the
coding fraction is represented by messenger RNAs (mRNAs), noncoding RNAs (ncRNAs)
include at least two main classes: (1) structural ncRNAs such as transfer RNAs (tRNAs),
ribosomal RNAs (rRNAs), and small nucleolar RNAs (snoRNAs); (2) regulatory ncRNAs such as
microRNAs (miRNAs), piwi-interacting RNAs (piRNAs), and long noncoding RNAs
(lncRNAs).
A large fraction of eukaryotic transcriptomes consists of ncRNAs that play relevant
biological roles, such as the regulation of gene expression in normal as well as pathological
conditions. ncRNA functions are generally mediated by interactions with other RNA molecules,
DNA regions, or proteins. In all cases, RNA secondary and tertiary structures are essential
for correct function and interaction. However, the characterization of RNA molecules is
a challenging task and, thus, over the past decades a variety of bioinformatics tools have been
developed. Nowadays, RNA bioinformatics represents one of the most active fields of
bioinformatics and computational biology.
The interest towards RNA bioinformatics has increased rapidly thanks to the advent of
recent high-throughput sequencing technologies that enable the investigation of complete
transcriptomes at single nucleotide resolution.
The present book has been conceived with the aim of providing an overview of RNA
bioinformatics methodologies, starting from “classical” technologies to predict secondary
and tertiary structures, to novel strategies and algorithms based on massive RNA sequenc-
ing. Indeed, the content of the book is organized as follows:
– Part I—RNA secondary and tertiary structures. This section includes chapters devoted
to main computational algorithms to predict, draw, and edit secondary and tertiary
RNA structures or infer RNA-RNA interactions.
– Part II—Analysis of high-throughput RNA sequencing data. The aim of this section
is to provide a global overview of current methodologies to handle and analyze large
sequencing datasets generated by next-generation sequencing (NGS) technologies.
The section includes chapters on quality control of RNA sequencing data,
the mapping of RNA-Seq reads onto complete reference genomes, and gene
expression profiling. In addition, methodologies to investigate posttranscriptional
events such as alternative splicing and RNA editing, as well as entire meta-transcriptomes,
are described in detail.
– Part III—Web resources for RNA data analysis. Finally, this section provides chapters
about several available web resources to work with RNA data without specific computer
requirements (hardware and software) or specialized bioinformatics skills.
Hoping that the book content meets the reader's expectations, I would like to
acknowledge those who helped make this book possible: all chapter authors for their work and
excellent contributions; the Series Editor, John Walker, for his constant support and
suggestions; my wife Angela and daughter Adele for their patience and encouragement.
Special thanks are addressed to my mentor Graziano Pesole for his always indispensable
paternal suggestions.
Finally, I would like to dedicate this effort to my parents, since they have always believed in
me, and to the memory of my first supervisor Carla Quagliariello, who introduced me for the
first time to the wonderful and fascinating world of RNA bioinformatics.
Abstract
Determining the RNA secondary structure from sequence data by computational prediction is a
long-standing problem. Its solution has been approached in two distinct ways. If a multiple sequence
alignment of a collection of homologous sequences is available, the comparative method uses phylogeny to
determine conserved base pairs that are more likely to form as a result of billions of years of evolution than
by chance. In the case of single sequences, recursive algorithms that compute minimum free energy structures
by using empirically derived energy parameters have been developed. This latter approach of RNA folding
prediction by energy minimization is widely used to predict RNA secondary structure from sequence. For
a significant number of RNA molecules, the secondary structure of the RNA molecule is indicative of its
function, and its computational prediction by minimizing its free energy is important for its functional
analysis. A general method for free energy minimization to predict RNA secondary structures is dynamic
programming, although other optimization methods have been developed as well, along with empirically
derived energy parameters. In this chapter, we introduce and illustrate by examples the approach of free
energy minimization to predict RNA secondary structures.
Key words RNA folding prediction, Free energy minimization methods, RNA free energy parameters,
RNA bioinformatics, RNA secondary structure prediction
1 Introduction
Ernesto Picardi (ed.), RNA Bioinformatics, Methods in Molecular Biology, vol. 1269,
DOI 10.1007/978-1-4939-2291-8_1, © Springer Science+Business Media New York 2015
4 Alexander Churkin et al.
2 Materials
2.1 Biological Sequences
We obtained the RNA sequences for our examples from the studies
published by You et al. [22] and Krol et al. [23]. This is by no
means exhaustive, as there are many different ways to obtain biological
sequences, e.g., from a sequencing experiment that investigates a
certain organism or from a known database such as Rfam [3], to
name but a few. Let us briefly describe these biological sequences
and their significance. The first sequence used for illustration, the
5BSL3.2 HCV taken from [22], is a functionally important stem-loop
structure for virus replication that appears in the coding
region of the hepatitis C virus RNA-dependent RNA polymerase,
NS5B. The 5BSL3.2 is about 50 bases in length and is part of a
larger predicted cruciform structure (5BSL3). It was confirmed
experimentally that the 5BSL3.2 consists of an 8-bp lower helix, a
6-bp upper helix, a 12-base terminal loop, and an 8-bp internal
loop. Mutagenesis experiments were performed to investigate the
relationship between its structure and function. In addition, the
size of the sequence is amenable to energy minimization prediction
methods (see Note 1). Thus, the sequence of the 5BSL3.2 HCV
provides a good test bed for energy minimization prediction methods,
since the structure of both the wildtype sequence and several
of its mutants has already been investigated by structure probing
experiments. The second sequence used for illustration, an miRNA
precursor taken from [23], is a hairpin loop structure of about 70
bases in length. It is important as a participant in gene regulation
and its structure was verified experimentally, thus providing another
good test bed for energy minimization prediction methods.
For convenience, the sequence of the 5BSL3.2 HCV is shown
below:
AGCGGGGGAGACAUAUAUCACAGCCUGUCUCGUGCCCGACCCCGC
Its two-point mutants that were experimentally tested in [22] are
clearly specified in the text herein and the relevant figure captions.
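As a small illustration of the mutant notation used in the figures of this chapter, the sketch below applies the two-point mutation C31A-U33G to the wildtype sequence shown above. It assumes that positions are numbered from 1 starting at the first base of the printed sequence; the helper name `apply_mutations` is hypothetical and is not part of any folding package. The check on the reference base guards against an off-by-one numbering.

```python
# Wildtype 5BSL3.2 HCV sequence as printed in this section (45 nt).
WILDTYPE = "AGCGGGGGAGACAUAUAUCACAGCCUGUCUCGUGCCCGACCCCGC"


def apply_mutations(seq, mutations):
    """Apply point mutations written as <old base><1-based position><new base>.

    Hypothetical helper for illustration only; raises ValueError if the
    stated old base does not match the sequence at that position.
    """
    bases = list(seq)
    for mutation in mutations:
        old, pos, new = mutation[0], int(mutation[1:-1]), mutation[-1]
        if bases[pos - 1] != old:
            raise ValueError(
                f"{mutation}: expected {old} at position {pos}, "
                f"found {bases[pos - 1]}"
            )
        bases[pos - 1] = new
    return "".join(bases)


# Two-point mutant C31A-U33G discussed in the text.
mutant = apply_mutations(WILDTYPE, ["C31A", "U33G"])
print(mutant)
```

The resulting string can be pasted into a folding webserver in the same way as the wildtype sequence.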
2.2 Computational Hardware
The most widely used energy minimization software for predicting
RNA secondary structures from sequence can be accessed using
webservers, e.g., [9, 11], requiring no special computational
resources. With the exception of some specific large-scale tasks that
may require whole genome scans, or specific cases of software
that may require a massively parallel computational platform [13],
almost all tasks that utilize energy minimization prediction can be
approached with a standard PC. If one prefers to download a
2.3 Computational Software
When downloading the most widely used energy minimization software
packages such as UNAFold/mfold or the Vienna RNA package,
some common requirements are a Perl interpreter (Perl 5.6.1 or later)
and application programming interfaces (APIs)
such as OpenMP and OpenGL. Specifications can be found on the
instructions page that is available for each individual package.
3 Methods
3.1 Predicting RNA Secondary Structures
We start by describing the basic procedure of predicting RNA secondary
structures by energy minimization given an initial sequence.
The format for the input sequence is similar in almost all software
packages that have been developed to solve this problem. The user
can either select to manually insert a sequence consisting of the
Free Energy Minimization to Predict RNA Secondary Structures 7
Fig. 1 Free energy minimization for the secondary structure prediction of wild-
type 5BSL3.2 HCV. UNAFold version 3.8 with mfold utils version 4.6 was used to
predict and generate the secondary structure drawing
Fig. 2 Energy dot plot for the secondary structure prediction of wildtype 5BSL3.2 HCV. Using mfold utils version
4.6, the percent of suboptimality was increased to 30 to capture the fourth suboptimal solution, for an illustra-
tive reason that will become clear after generating and examining Fig. 3
Fig. 3 Free energy minimization for the secondary structure prediction of a spe-
cific two-point mutant 5BSL3.2 HCV (C31A-U33G). UNAFold version 3.8 with
mfold utils version 4.6 was used to predict and generate the secondary structure
drawing. One can observe the similarity in shape between the structure obtained
and the fourth suboptimal solution displayed in the dot plot of Fig. 2
3.2 Predicting RNA Deleterious Mutations
The first example above served to illustrate the most basic use of free
energy minimization to predict RNA structures. Having also
observed in this example the relationship between suboptimal
solutions of the wildtype and their possible appearance as an optimal
solution of a near-neighbor mutant, we now focus on a second
example that utilizes this concept to efficiently predict deleterious
mutations in the sequence that may cause a dramatic change in
structure. As detailed in a review on this topic [24], there are several
ways to approach this problem [25, 26]. The method described in
[25] relies on Vienna’s RNAfold [10] for the direct problem of
RNA structure prediction, Vienna’s RNAsubopt [16] for calculating
all suboptimal solutions, and the concept described above of a
suboptimal solution becoming an optimal one. To experiment, the
user can now access the RNAmute webserver [27] and insert the
same example sequence of the 5BSL3.2 HCV that was used as input
3.3 Computational RNA Design
The final example is taken from the field of RNA design. The
inverse RNA folding problem for designing sequences that
fold into a given RNA secondary structure was introduced in [10].
The approach to solve it by stochastic optimization relies on the
solution of the direct problem. Initially, a seed sequence is chosen,
after which a local search strategy (or global, as in [28]) is used to
mutate the seed and apply repeatedly the direct problem of RNA
Fig. 4 Predicted rearranging mutation in the 5BSL3.2 wildtype. The RNAmute webserver was used, relying
on RNAfold, RNAsubopt, and other programs from the Vienna RNA package. The output two-point mutant
C31A-U33G happens to coincide with the experimental result that was illustrated in Fig. 3
Fig. 5 A tree-graph illustration for a coarse grain representation [29] of RNA secondary structures
Fig. 6 Sequence design with a motif constraint by using repetitively free-energy minimization of RNA second-
ary structures. (a) Initial input screen in the RNAfbinv application for designing sequences. (b) Motif selection
screen in the RNAfbinv application for designing sequences
Fig. 7 Sequence design of an artificial example by energy minimization. Best result and 20th result obtained
with RNAfbinv according to base-pair distance from the wildtype structure are displayed
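The base-pair distance used to rank designed sequences in the figure captions can be computed directly from dot-bracket strings. The sketch below assumes the common definition (the number of base pairs present in exactly one of the two structures) and pseudoknot-free dot-bracket input of equal length. The function names are illustrative and are not taken from RNAfbinv or the Vienna RNA package.

```python
def parse_dot_bracket(structure):
    """Return the set of base pairs (i, j), 0-based, of a dot-bracket string."""
    stack, pairs = [], set()
    for pos, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(pos)
        elif symbol == ")":
            if not stack:
                raise ValueError("unbalanced ')' at position %d" % pos)
            pairs.add((stack.pop(), pos))
        elif symbol != ".":
            raise ValueError("unexpected symbol %r" % symbol)
    if stack:
        raise ValueError("unbalanced '(' in structure")
    return pairs


def base_pair_distance(structure_a, structure_b):
    """Count the base pairs that occur in exactly one of the two structures."""
    if len(structure_a) != len(structure_b):
        raise ValueError("structures must have the same length")
    # Symmetric difference of the two base-pair sets.
    return len(parse_dot_bracket(structure_a) ^ parse_dot_bracket(structure_b))


# Toy structures (not taken from the chapter's examples).
print(base_pair_distance("((..))", "(....)"))
```

A distance of zero means that the two structures contain exactly the same base pairs.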
In the future, the uses for such free energy minimization applications
that involve computational RNA design are expected to grow con-
siderably, with the increase of experimental evidence regarding the
functional importance of certain RNA motifs. Also, these applica-
tions can be used to detect known noncoding RNAs in new loca-
tions when the query RNA pattern is suspected to possess more
conservation in structure and less conservation in sequence. Results
of the flexible computational RNA design procedure are sequences
that can be searched for efficiently by using sequence-based
methods.
Fig. 8 Sequence design of an miRNA precursor by energy minimization. Best result obtained with RNAfbinv
according to the base-pair distance from the wildtype structure is shown with the depicted motif constraint
4 Notes
Acknowledgments
The authors would like to thank Idan Gabdank and Assaf Avihoo
for their assistance in this study. This work was partially supported
by the Kreitman Foundation at Ben-Gurion University.
References
1. Brion P, Westhof E (1997) Hierarchy and dynamics of RNA folding. Annu Rev Biophys Biomol Struct 26:113–137
2. Tinoco I, Bustamante C (1999) How RNA folds. J Mol Biol 293:271–281
3. Griffiths-Jones S, Bateman A, Marshall M, Khanna A, Eddy SR (2003) Rfam: an RNA family database. Nucleic Acids Res 31:439–441
4. Nussinov R, Pieczenik G, Grigg JR, Kleitman DJ (1978) Algorithms for loop matchings. SIAM J Appl Math 35:68–82
5. Waterman MS, Smith TF (1978) RNA secondary structure: a complete mathematical analysis. Math Biosci 42:257–266
6. Nussinov R, Jacobson AB (1980) Fast algorithm for predicting the secondary structure of single stranded RNA. Proc Natl Acad Sci U S A 77(11):6309–6313
7. Zuker M, Stiegler P (1981) Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information. Nucleic Acids Res 9:133–148
8. Zuker M, Sankoff D (1984) RNA secondary structures and their prediction. Bull Math Biol 46:591–621
9. Zuker M (2003) Mfold web server for nucleic acid folding and hybridization prediction. Nucleic Acids Res 31:3406–3415
10. Hofacker IL, Fontana W, Stadler PF, Bonhoeffer LS, Tacker M, Schuster P (1994) Fast folding and comparison of RNA secondary structures. Monatsh Chem 125:167–188
11. Hofacker IL (2003) Vienna RNA secondary structure server. Nucleic Acids Res 31:3429–3431
12. Mathews DH, Sabina J, Zuker M, Turner DH (1999) Expanded sequence dependence of
Abstract
It has been well accepted that the RNA secondary structures of most functional non-coding RNAs
(ncRNAs) are closely related to their functions and are conserved during evolution. Hence, prediction of
conserved secondary structures from evolutionarily related sequences is one important task in RNA bioin-
formatics; the methods are useful not only to further functional analyses of ncRNAs but also to improve
the accuracy of secondary structure predictions and to find novel functional RNAs from the genome.
In this review, I focus on common secondary structure prediction from a given aligned RNA sequence, in
which one secondary structure whose length is equal to that of the input alignment is predicted. I system-
atically review and classify existing tools and algorithms for the problem, by utilizing the information
employed in the tools and by adopting a unified viewpoint based on maximum expected gain (MEG)
estimators. I believe that this classification will allow a deeper understanding of each tool and provide users
with useful information for selecting tools for common secondary structure predictions.
1 Introduction
Ernesto Picardi (ed.), RNA Bioinformatics, Methods in Molecular Biology, vol. 1269,
DOI 10.1007/978-1-4939-2291-8_2, © Springer Science+Business Media New York 2015
18 Michiaki Hamada
Fig. 1 (a) Conventional RNA secondary structure prediction, in which the input is an individual RNA sequence
and the output is an RNA secondary structure of the sequence. (b) Common (or consensus) RNA secondary
structure prediction in which the input is a multiple sequence alignment of RNA sequences and the output is
an RNA secondary structure whose length is equal to the length of the alignment. The secondary structure is
called the common (or consensus) secondary structure
Fig. 2 An evaluation procedure for the predicted common RNA secondary structure of an input alignment, in
which the reference secondary structure of each RNA sequence in the alignment is given. This procedure is
based on the idea that a common secondary structure should reflect as many of the secondary structures of
each RNA sequence in the input alignment as possible. Mapped RNA secondary structures without gaps are
computed by getting rid of base-pairs that correspond to gaps. Note that it is difficult to compare a predicted
common secondary structure with a reference common secondary structure, because the reference common
RNA secondary structure for an arbitrary input alignment is not available in general. In most studies of com-
mon secondary structure prediction, evaluation is conducted by using this procedure or a variant. See [23] for
a more detailed discussion of evaluation procedures for Problem 1
TN, FP, and FN for each mapped secondary structure y(map) with
respect to the reference secondary structure (TP, TN, FP, and FN are
the respective numbers of true-positive, true-negative, false-positive,
and false-negative base-pairs of a predicted secondary structure with
respect to the reference structures). Finally, calculate the evaluation
measures sensitivity (SEN), positive predictive value (PPV), and
Matthews correlation coefficient (MCC) for the sum of TP, TN, FP,
and FN over all the RNA sequences in the alignment A.
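The final step of the procedure can be sketched in a few lines. The code below assumes the standard definitions SEN = TP/(TP + FN), PPV = TP/(TP + FP), and MCC = (TP·TN − FP·FN)/sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN)), which the text above names but does not spell out. The function name, the convention of returning 0.0 for an empty denominator, and the example counts are illustrative assumptions.

```python
import math


def accuracy_measures(counts):
    """Compute (SEN, PPV, MCC) from per-sequence base-pair counts.

    counts is a non-empty list of (TP, TN, FP, FN) tuples, one per RNA
    sequence in the alignment; the counts are summed before the measures
    are calculated, as described in the evaluation procedure.
    """
    tp, tn, fp, fn = (sum(column) for column in zip(*counts))
    sen = tp / (tp + fn) if tp + fn else 0.0
    ppv = tp / (tp + fp) if tp + fp else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sen, ppv, mcc


# Made-up counts for two sequences, for illustration only.
sen, ppv, mcc = accuracy_measures([(3, 10, 1, 2), (2, 12, 0, 1)])
print(sen, ppv, mcc)
```

Summing the counts first, rather than averaging per-sequence measures, means that longer sequences with more base-pairs contribute more to the final values.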
Figure 2 shows an illustrative example of Evaluation
Procedure 1. Note that there exist a few variants of this procedure
(e.g., [5]).
In this review, I aim to classify the existing tools (or algorithms)
for Problem 1; these tools are summarized in Table 1, which
includes all the tools for common secondary structure prediction
as of 17th June 2013. To achieve this aim, I describe the information
that is often utilized in common secondary structure
predictions, and classify tools from a unified viewpoint based on
maximum expected gain (MEG) estimators. I also explain the
relations between the MEG estimators and Evaluation Procedure 1.
This review is organized as follows. In Subheading 2, I sum-
marize the information that is commonly utilized when designing
algorithms for common secondary structure prediction. In
Subheading 3, several concepts to be utilized in the classification of
tools are presented, and the currently available tools are classified
within this framework.
Table 1
List of tools for common secondary structure prediction from aligned sequences
Table 2
Comparison of tools (in Table 1) for common secondary structure predictions from aligned sequences
Name SA WS TS ML CV PI MI MR EI Gain Prob. dist.
(Without pseudoknot)
CentroidAlifold ✓ ✓ ✓ ✓ ✓ ✓ ✓e γ-cent Anyf
ConStruct ✓ ✓ ✓ ✓ ✓ na na
KNetFold ✓ ✓ ✓ na na
McCaskill-MEA ✓ ✓ Contra Av(Mc)
PETfold ✓ ✓ ✓ Contra Pf+Av(Mc)
Pfold ✓ ✓ ✓ Delta Pf
PhyloRNAalifold ✓ ✓ ✓ Delta Ra
PPfold ✓ ✓ ✓ ✓ Delta Pf
RNAalifold ✓ ✓ ✓ ✓ Delta Ra
RSpredict ✓ ✓ ✓ ✓ na na
(With pseudoknot)
hxmatch ✓ ✓ ✓ na na
ILM ✓ ✓ ✓ ✓ na na
IPKnot ✓ ✓ ✓ ✓ ✓ ✓ γ-cent Anyf
MIfold ✓g ✓ na na
a Type of software available. SA stand alone, WS web server, TS thermodynamic stability (Subheading 2.1.1)
b In the “Information used” columns, ML machine learning (Subheading 2.1.2); CV covariation (Subheading 2.3); PI phylogenetic (evolutionary) information (Subheading 2.4); MI mutual information (Subheading 2.2); MR majority rule (Subheading 2.5); EI experimental information (Subheading 2.1.3)
c In the column “Gain,” γ-cent: γ-centroid-type gain function (Subheading 3.3.2); contra: CONTRAfold-type gain function (Subheading 3.3.3)
d In the column “Prob. dist.,” Pf Pfold model (Subheading 3.2.2); Ra RNAalipffold model (Subheading 3.2.1); Av(Mc) averaged probability distribution with McCaskill model (Subheading 3.2.3); “+” indicates a mixture distribution (of several models). “na” means “Not available” due to no use of probabilistic models
e If the method proposed in [16] is used, experimental information derived from SHAPE [36] and PARS [31] is easily incorporated in CentroidAlign
f CentroidAlifold and IPKnot can employ a mixed distribution given by an arbitrary combination of RNAalipffold, Pfold, and an averaged probability distribution based on the McCaskill or CONTRAfold models
g MATLAB codes are available
2 Materials
2.1 Fitness to Each Sequence in the Input Alignment
The common RNA secondary structure should be a representative
secondary structure among RNA sequences in the alignment.
Therefore, the fitness of a predicted common secondary structure
to each RNA sequence in the alignment is useful information. In
particular, in Evaluation Procedure 1, the fitness of a predicted
common secondary structure to each RNA sequence is evaluated.
This fitness is based on probabilistic models for RNA secondary
structures of individual RNA sequences, such as the energy-based
and machine learning models shown in Subheadings 2.1.1
and 2.1.2, respectively. These models provide a probability
distribution of secondary structures of a given RNA sequence. (p(θ | x)
denotes a probability distribution of RNA secondary structures for
a given RNA sequence x.)
2.1.1 Thermodynamic Stability: Energy-Based Models
Turner’s energy model [38] is an energy-based model, which considers
the thermodynamic stability of RNA secondary structures.
This model is widely utilized in RNA secondary structure predictions,
in which experimentally determined energy parameters
[38, 39, 75] are employed. In the model, structures with a lower
free energy are more stable than those with a higher free energy.
Note that Turner’s energy model leads to a probabilistic model for
RNA sequences, providing a probability distribution of secondary
structures, which is called the McCaskill model [40].
2.1.2 Machine Learning (ML) Models
In addition to the energy-based models described in the previous
subsection, probabilistic models for RNA secondary structures
based on machine learning (ML) approaches have been proposed.
In contrast to the energy-based models, machine learning models
can automatically learn parameters from training data (i.e., a set of
RNA sequences with secondary structures). There are several models
based on machine learning which adopt different approaches:
(1) Stochastic context free grammar (SCFG) models [10]; (2) the
CONTRAfold model [9] (a conditional random field model); (3)
the Boltzmann Likelihood (BL) model [1–3]; and (4) non-parametric
Bayesian models [56].
See Rivas et al. [51] for detailed comparisons of probabilistic
models for RNA secondary structures.
2.2 Mutual Information
The mutual information of the ith and jth columns in the input
alignment is defined by
\[
M_{ij} = \sum_{X,Y} f_{ij}(XY) \log \frac{f_{ij}(XY)}{f_i(X)\, f_j(Y)}
       = \mathrm{KL}\bigl( f_{ij}(XY) \,\|\, f_i(X)\, f_j(Y) \bigr) \tag{1}
\]
where fi(X) is the frequency of base X at alignment position i;
fij(XY) is the joint frequency of finding X in the ith column and Y
in the jth column, and KL(⋅ || ⋅) denotes the Kullback–Leibler
divergence between two probability distributions. As a result, the
complete set of mutual information can be represented as an upper
triangular matrix: {Mij}i < j.
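The definition in Eq. 1 can be illustrated with a short sketch that estimates the frequencies directly from the alignment columns. The choice of logarithm base 2, the treatment of gap characters as ordinary symbols, and the function name are assumptions made for this example; individual tools may handle these details differently.

```python
import math
from collections import Counter


def mutual_information(alignment, i, j):
    """Mutual information (in bits) between columns i and j (0-based).

    alignment is a list of equal-length aligned sequences. Frequencies are
    plain relative counts; gaps are counted like any other symbol.
    """
    n = len(alignment)
    f_i = Counter(seq[i] for seq in alignment)
    f_j = Counter(seq[j] for seq in alignment)
    f_ij = Counter((seq[i], seq[j]) for seq in alignment)
    mi = 0.0
    for (x, y), count in f_ij.items():
        p_xy = count / n
        # Only observed (x, y) combinations contribute to the sum.
        mi += p_xy * math.log2(p_xy / ((f_i[x] / n) * (f_j[y] / n)))
    return mi


# Toy two-column alignment in which the columns covary perfectly.
toy = ["GC", "CG", "AU", "UA"]
print(mutual_information(toy, 0, 1))
```

Two perfectly conserved columns give a mutual information of zero, which is why conserved base-pairs are invisible to this score.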
Note that the mutual information score makes no use of base-
pairing rules of RNA secondary structure. In particular, mutual
information does not account for consistent non-compensatory
mutations at all, although information about them would be useful
when predicting common secondary structures as described in the
next subsection.
2.3 Sequence Covariation of Base-Pairs
Because secondary structures of ncRNAs are related to their functions,
mutations that preserve base-pairs (i.e., covariations of a
base-pair) often occur during evolution. Figure 3 shows an example
of covariation of base-pairs of tRNA sequences, in which many
covariations of base-pairs are found, especially in the stem parts in
the tRNA structure.
The covariation of the ith and jth columns in the input alignment
is evaluated by the averaged number of compensatory mutations
defined by
\[
C_{ij} = \frac{2}{N(N-1)} \sum_{(x,y) \in A} d_{ij}^{x,y}\, P_{ij}^{x}\, P_{ij}^{y} \tag{2}
\]
where $N$ is the number of sequences in the input alignment $A$. For
an RNA sequence $x$ in $A$, $P_{ij}^{x} = 1$ if $x_i$ and $x_j$ form a base-pair and
$P_{ij}^{x} = 0$ otherwise, and
\[
d_{ij}^{x,y} = 2 - \delta(x_i, y_i) - \delta(x_j, y_j) \tag{3}
\]
where $\delta$ is the delta function: $\delta(a, b) = 1$ only if $a = b$, and $\delta(a, b) = 0$
otherwise.
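Equations 2 and 3 can be illustrated with the following sketch. It assumes that the sum runs over unordered pairs of distinct sequences, which matches the normalization 2/(N(N − 1)), and that Watson–Crick and G–U wobble pairs count as base-pairs. Both points, as well as the function name, are assumptions for this example rather than statements about any particular tool.

```python
from itertools import combinations

# Assumed set of admissible base-pairs (Watson-Crick plus G-U wobble).
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def covariation(alignment, i, j):
    """Averaged number of compensatory mutations between columns i and j.

    alignment is a list of at least two equal-length aligned sequences;
    i and j are 0-based column indices.
    """
    n = len(alignment)
    total = 0
    for x, y in combinations(alignment, 2):
        # The product P_ij^x * P_ij^y: both sequences must form a base-pair.
        if (x[i], x[j]) in PAIRS and (y[i], y[j]) in PAIRS:
            # d_ij^{x,y} = 2 - delta(x_i, y_i) - delta(x_j, y_j)
            total += 2 - (x[i] == y[i]) - (x[j] == y[j])
    return 2.0 * total / (n * (n - 1))


# Toy alignment: columns 0 and 3 pair as G-C, A-U, and G-U.
toy = ["GAAC", "AAAU", "GAAU"]
print(covariation(toy, 0, 3))
```

Unlike mutual information, this score rewards only those changes that keep the two columns able to pair.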
For instance, RNAalifold [65] uses the information of covaria-
tion in combination with the thermodynamic stability of common
secondary structures.
RNA Secondary Structure Prediction from Multi-Aligned Sequences 25
a
          (((((((..((((.........)))).(((((.......))))).....(((((......
Seq1/1-74 GGGCCUGUAGCUCAGAGGAUUAGAGCACGUGGCUACGAACCACGGUGUCGGGGGUUCGAA 60
Seq2/1-74 GGGCUAUUAGCUCAGUUGGUUAGAGCGCACCCCUGAUAAGGGUGAGGUCGCUGAUUCGAA 60
Seq3/1-72 GGCGCCGUGGCGCAGUGGA--AGCGCGCAGGGCUCAUAACCCUGAUGUCCUCGGAUCGAA 58
Seq5/1-72 GCGUUGGUGGUAUAGUGGUG-AGCAUAGCUGCCUUCCAAGCA-GUUGACCCGGGUUCGAU 58
Seq4/1-68 ACUCCCUUAGUAUAAUU----AAUAUAACUGACUUCCAAUUA-GUAGAUUCUGAAU-AAA 54
          .........10........20........30........40........50.........
          .)))))))))))).
Seq1/1-74 UCCCUCCUCGCCCA 74
Seq2/1-74 UUCAGCAUAGCCCA 74
Seq3/1-72 ACCGAGCGGCGCUA 72
Seq5/1-72 UCCCGGCCAACGCA 72
Seq4/1-68 CCCAGAAGAGAGUA 68
          .........70...
b
[Secondary structure drawing; not recoverable from the extracted text]
Fig. 3 An example of covariation of base-pairs in an alignment of tRNA sequences: (a) a multiple alignment of
tRNA sequences and (b) a predicted common secondary structure of the alignment. The figures are taken from
the output of an example on the RNAalifold [5] Web Server (http://rna.tbi.univie.ac.at/cgi-bin/RNAalifold.cgi)
2.5 Majority Rule of Base-Pairs
Kiryu et al. [32] proposed the use of the majority rule of base-pairs
in predictions of common secondary structures. This rule
states that base-pairs supported by many RNA sequences should
be included in a predicted common secondary structure.
Specifically, Kiryu et al. utilized an averaged probability
distribution of secondary structures among RNA sequences to
predict a common RNA secondary structure from a given alignment
(Subheading 3.2.3).
The aim of this approach is to mitigate alignment errors,
because the effects of minor alignment errors can be disregarded
in the prediction of common secondary structures.
3 Methods
3.1 MEG Estimators
In this section, I classify the existing algorithms for common RNA
secondary structure prediction (shown in Table 1) from a unified
viewpoint, based on a previous study [23] in which the following
type of estimator [18, 24] was employed:
\[
\hat{y} = \operatorname*{argmax}_{y \in \mathcal{S}(A)} \sum_{\theta \in \mathcal{S}(A)} G(\theta, y)\, p(\theta \mid A) \tag{4}
\]
where $\mathcal{S}(A)$ denotes the set of possible secondary structures with
length |A| (the length of the alignment A), G(θ, y) is called a “gain
function” and returns a measure of the similarity between two common
secondary structures, and p(θ | A) is a probability distribution
on $\mathcal{S}(A)$. This type of estimator is an MEG estimator; when the
gain function is designed according to accuracy measures for target
problems, the MEG estimator is often called a “maximum expected
accuracy (MEA) estimator” [18] (see Note 2).
In the following, a common secondary structure $\theta \in \mathcal{S}(A)$ is
represented as an upper triangular matrix $\theta = \{\theta_{ij}\}_{1 \le i < j \le |A|}$. In this
matrix θij = 1 if the ith column in A forms a base-pair with the jth
column in A, and θij = 0 otherwise.
The choices of p(θ | A) and G(θ, y) are described in
Subheadings 3.2 and 3.3, respectively.
\[
p^{(\mathrm{RNAalipffold})}(\theta \mid A) = \frac{1}{Z(T)} \exp\left( \frac{-E(\theta, A)}{kT} + \mathrm{Cov}(\theta, A) \right) \tag{5}
\]
3.2.2 Pfold Model
The Pfold model [34, 35] incorporates phylogenetic (evolutionary)
information about the input alignments into a probabilistic
distribution of common secondary structures:
\[
p^{(\mathrm{pfold})}(\theta \mid A) = p^{(\mathrm{pfold})}(\theta \mid A, T, M) = \frac{p(A \mid \theta, T)\, p(\theta \mid M)}{p(A \mid T, M)} \tag{6}
\]
where T is a phylogenetic tree, A is the input data (i.e., an align-
ment), M is a prior model for secondary structures (based on
SCFGs; see Note 3). Unless the original phylogenetic tree T is
obtained, T is taken to be the ML estimate of the tree, TML, given
the model M and the alignment A.
\[
p^{(\mathrm{ave})}(\theta \mid A) = \frac{1}{n} \sum_{x \in A} p(\theta \mid x) \tag{7}
\]
where p(θ | x) is a probabilistic model for RNA secondary struc-
tures, for example, the McCaskill model [40], the CONTRAfold
model [9], the BL model [1–3], and others [56]. Note that nei-
ther covariation nor phylogenetic information about alignments is
considered in this probability distribution.
In [32], the authors utilized the McCaskill model as a probabilistic
model for individual RNA sequences (i.e., they used it as p(θ | x)
in Eq. 7), and showed that their method was more robust with
respect to alignment errors than RNAalifold and Pfold. Using
averaged probability distributions for RNA sequences in an input
alignment is compatible with Evaluation Procedure 1. See Hamada
et al. [23] for a detailed discussion.
3.2.4 Mixture of Several Distributions
Hamada et al. [23] pointed out that arbitrary information can be
incorporated into common secondary structure predictions by
utilizing a mixture of probability distributions. For example, the
probability distribution
3.3 Choice of Gain Functions G(θ, y)
A choice of the gain function in MEG estimators corresponds to the
decoding method (i.e., prediction of one final common secondary
structure from the distribution) of common RNA secondary structures,
given a probabilistic model of common secondary structures.
3.3.1 The Kronecker Delta Function
A straightforward choice of the gain function is the Kronecker
delta function:
3.3.2 The γ-Centroid-Type Gain Function [19]
It is known that the probability of the ML estimation is extremely
small, due to the immense number of secondary structures that
could be predicted; this fact is known as the “uncertainty” of the
solution, and often leads to issues in bioinformatics [17]. Because
the MEG estimator with the delta function considers only the
solution with the highest probability, it is affected by this uncertainty.
A choice of gain function that partially overcomes this uncertainty
of solutions is the γ-centroid-type gain function [19]:
3.3.4 Remarks About Choice of Gain Function
From a theoretical viewpoint, the γ-centroid-type gain function is more appropriate for Evaluation Procedure 1 than either the delta function or the CONTRAfold-type gain function (see Note 6), which is also supported by several empirical (computational) experiments. See Hamada et al. [23] for a detailed discussion.
Another choice of the gain function is MCC (or F-score), which balances SEN and PPV of base-pairs; the MEG estimator with this gain function leads to an algorithm that maximizes pseudo-expected accuracy [22]. In addition, the estimator with this gain function includes only one parameter for predicting secondary structure (see Note 7).
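The balance between SEN and PPV mentioned above can be made concrete with a small sketch. It assumes the usual definitions (SEN = TP / number of reference base-pairs, PPV = TP / number of predicted base-pairs, F-score = their harmonic mean); the function name and the representation of a structure as a set of index pairs are illustrative choices, not part of any tool discussed in this chapter.

```python
def base_pair_accuracy(reference_pairs, predicted_pairs):
    """Return (SEN, PPV, F-score) for two collections of base-pairs (i, j).

    Assumed definitions: SEN = TP / |reference|, PPV = TP / |predicted|,
    F-score = harmonic mean of SEN and PPV.
    """
    ref = set(reference_pairs)
    pred = set(predicted_pairs)
    true_positives = len(ref & pred)  # base-pairs present in both structures
    sen = true_positives / len(ref) if ref else 0.0
    ppv = true_positives / len(pred) if pred else 0.0
    f_score = 2 * sen * ppv / (sen + ppv) if (sen + ppv) > 0 else 0.0
    return sen, ppv, f_score
```

Predicting more base-pairs tends to raise SEN and lower PPV, which is why a single score combining the two is convenient as a gain function.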
3.4 Computation of Common Secondary Structure Through MEG Estimators
3.4.1 MEG Estimator with Delta Function
The MEG estimator with the delta function (that predicts the common secondary structure with the highest probability with respect to a given probabilistic model) can be computed by employing a CYK (Cocke–Younger–Kasami)-type algorithm. For example, see [65] for details.
3.4.2 MEG Estimator with γ-Centroid (or CONTRAfold) Type Gain Function
The MEG estimator with the γ-centroid-type (or CONTRAfold-type) gain function is computed based on “base-pairing probability matrices” (BPPMs) (see Note 8) and Nussinov-style dynamic programming (DP) [9, 23]:
$$
M_{i,j} = \max
\begin{cases}
M_{i+1,j} \\
M_{i,j-1} \\
M_{i+1,j-1} + S_{ij} \\
\max_{k} \left[ M_{i,k} + M_{k+1,j} \right]
\end{cases}
\tag{12}
$$
where $M_{i,j}$ is the optimal score of the subsequence $x_{i \cdots j}$ and $S_{ij}$ is a score computed from the BPPM(s). For instance, for the γ-centroid estimator with the RNAalipffold model, the score $S_{ij}$ is equal to $S_{ij} = (\gamma + 1)\, p_{ij}^{(\mathrm{alipffold})} - 1$, where $p_{ij}^{(\mathrm{alipffold})}$ is the base-pairing probability with respect to the RNAalipffold model. This DP algorithm maximizes the sum of (base-pairing) probabilities $p_{ij}^{(\mathrm{alipffold})}$ which are larger than $1/(\gamma + 1)$, and requires $O(|A|^3)$ time.
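The recursion in Eq. 12 can be sketched in a few lines of Python. This is a minimal illustration rather than the CentroidAlifold implementation: the function name, the list-of-lists BPPM layout (0-based, indexed by alignment column), and the absence of a minimum hairpin-loop length are all assumptions made for brevity.

```python
def gamma_centroid_fold(bppm, gamma=1.0):
    """Nussinov-style DP (Eq. 12) with scores S_ij = (gamma + 1) * p_ij - 1.

    bppm[i][j] (i < j) is a base-pairing probability; indices are 0-based.
    Returns (optimal score, sorted list of predicted base-pairs).
    Assumption: no minimum loop length is enforced.
    """
    n = len(bppm)
    if n == 0:
        return 0.0, []

    def score(i, j):
        return (gamma + 1.0) * bppm[i][j] - 1.0

    def inner(M, i, j):
        # Score of the interval enclosed by pair (i, j); empty intervals give 0.
        return M[i + 1][j - 1] if i + 1 <= j - 1 else 0.0

    # M[i][j] is the best score on columns i..j; single columns score 0.
    M = [[0.0] * n for _ in range(n)]
    for span in range(1, n):
        for i in range(n - span):
            j = i + span
            best = max(M[i + 1][j], M[i][j - 1], inner(M, i, j) + score(i, j))
            for k in range(i + 1, j):  # bifurcation into two sub-intervals
                best = max(best, M[i][k] + M[k + 1][j])
            M[i][j] = best

    # Traceback: recompute the same expressions to find which case was used.
    pairs, stack = [], [(0, n - 1)]
    while stack:
        i, j = stack.pop()
        if i >= j:
            continue
        if M[i][j] == M[i + 1][j]:
            stack.append((i + 1, j))
        elif M[i][j] == M[i][j - 1]:
            stack.append((i, j - 1))
        elif M[i][j] == inner(M, i, j) + score(i, j):
            pairs.append((i, j))
            stack.append((i + 1, j - 1))
        else:
            for k in range(i + 1, j):
                if M[i][j] == M[i][k] + M[k + 1][j]:
                    stack.append((i, k))
                    stack.append((k + 1, j))
                    break
    return M[0][n - 1], sorted(pairs)
```

Because a pair only improves the score when $S_{ij} > 0$, the traceback never reports a base-pair whose probability is at or below $1/(\gamma + 1)$; larger γ therefore yields more predicted base-pairs.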
3.4.3 MEG Estimators with Averaged Probability Distributions
The MEG estimator with an averaged probability distribution (Subheading 3.2.3) can be computed by using averaged base-pairing probabilities, $\{p_{ij}\}_{i<j}$:
$$
p_{ij}^{(\mathrm{ave})} = \frac{1}{n} \sum_{x \in A} p_{ij}^{(x)} \tag{13}
$$
where
3.4.4 MEG Estimators with a Mixture Distribution
The MEG estimator with a mixture of distributions (Subheading 3.2.4) and the delta function (Subheading 3.3.1) cannot be computed efficiently. However, if the γ-centroid-type (or CONTRAfold-type) gain function is utilized, the prediction can be conducted using a DP recursion similar to that in Eq. 12. For instance, the DP recursion of the γ-centroid-type gain function with respect to Eq. 8 is equivalent to the one in Eq. 12 with $S_{ij} = (\gamma + 1)\, p_{ij}^{*} - 1$, where
$$
p_{ij}^{*} = w_1 \cdot p_{ij}^{(\mathrm{pfold})} + w_2 \cdot p_{ij}^{(\mathrm{alipffold})} + \frac{w_3}{n} \sum_{x \in A} p_{ij}^{(x)}. \tag{15}
$$
In the above, $p_{ij}^{(\mathrm{pfold})}$ and $p_{ij}^{(\mathrm{alipffold})}$ are base-pairing probabilities for the Pfold and RNAalipffold models, respectively, and $\{p_{ij}^{(x)}\}$ is a base-pairing probability matrix with respect to a probabilistic model for secondary structures of a single RNA sequence x (McCaskill or CONTRAfold model). Note that the total computational time of CentroidAlifold with a mixture of distributions still remains $O(n|A|^3)$.
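As a rough illustration of Eq. 15, the sketch below mixes three sources of base-pairing probabilities into a single matrix that could then be passed to a Nussinov-style DP. It assumes that every matrix is already indexed by alignment column (per-sequence matrices projected onto the alignment, with gap columns set to 0) and that the weights sum to 1; the function name and default weights are hypothetical.

```python
def mixture_bppm(p_pfold, p_alipffold, per_sequence_bppms, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Combine BPPMs as in Eq. 15: w1*pfold + w2*alipffold + (w3/n)*sum_x p^(x).

    All matrices are square lists of lists with the same size, indexed by
    alignment column (an assumption of this sketch).
    """
    if not per_sequence_bppms:
        raise ValueError("at least one per-sequence BPPM is required")
    w1, w2, w3 = weights
    n = len(per_sequence_bppms)  # number of sequences in the alignment
    size = len(p_pfold)
    mixed = [[0.0] * size for _ in range(size)]
    for i in range(size):
        for j in range(i + 1, size):
            averaged = sum(bppm[i][j] for bppm in per_sequence_bppms) / n
            mixed[i][j] = w1 * p_pfold[i][j] + w2 * p_alipffold[i][j] + w3 * averaged
    return mixed
```

Only the upper triangle (i < j) is filled, which is all the DP recursion needs. Setting `weights=(0, 0, 1)` reduces the result to the averaged matrix of Eq. 13.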
3.4.5 MEG Estimators with Probability Distribution Including Pseudoknots
Using probability distributions of secondary structures with pseudoknots in MEG estimators generally incurs a higher computational cost [42]. To overcome this, IPknot [57], for example, utilizes an approximation method for determining the probability distribution along with integer linear programming for predicting a final common secondary structure.
3.5 A Classification of Tools for Problem 1
Table 1 shows a comprehensive list of tools for common secondary structure prediction from aligned RNA sequences (in alphabetical order within groups that do or do not consider pseudoknots (see Note 9)). To the best of my knowledge, Table 1 is a complete list of tools for Problem 1 as of 17 June 2013.
32 Michiaki Hamada
3.6.2 Improvement of RNA Secondary Structure Predictions Using Common Secondary Structure
Although several studies have been conducted on RNA secondary structure prediction for a single RNA sequence [9, 19, 38, 47], the accuracy is still limited, especially for long RNA sequences. By employing comparative approaches that use homologous sequence information, the accuracy of RNA secondary structure prediction can be improved. In many cases, homologous RNA sequences of the target RNA sequence are obtained, and one would like to know the common secondary structure of those sequences. Gardner and Giegerich [13] introduced three approaches for comparative analysis of RNA sequences, and common secondary structure prediction is essentially utilized in the first of these. However, if the aim is to improve the accuracy of secondary structure predictions, common secondary structure prediction is not always the best solution, because it is not designed to predict the optimal secondary structure of a specific target RNA sequence. If you have a target RNA sequence for which the secondary structure is to be predicted, the approach adopted by the CentroidHomfold [21, 25] software is more appropriate than a method based on common RNA secondary structure prediction.
3.6.3 How to Incorporate Several Pieces of Information in Algorithms
As shown in this review, there are two ways to incorporate several pieces of information into an algorithm for common secondary structure prediction. The first approach is to modify the (internal) algorithm itself in order to handle the additional information. For example, PhyloRNAalifold [14] incorporates phylogenetic information into the RNAalifold algorithm by modifying the internal algorithm and PPfold [63] modifies the Pfold algorithm to handle experimental information. The drawbacks of this approach are the relatively large implementation cost and the heuristic combination of the information.
The second approach, adopted in CentroidAlifold [23], is promising because it can easily incorporate many pieces of information into predictions whenever a base-pairing probability matrix is available. Because the approach depends only on base-pairing probability matrices, and not on the detailed design of the algorithm, it is easy to implement an algorithm using a mixture of distributions.
Moreover, a method to update a base-pairing probability
matrix (computed using sequence information only) which incor-
porates experimental information [16] has recently been proposed.
The method is independent of the probabilistic models of RNA
secondary structures, and is suitable for incorporating experimen-
tal information into common RNA secondary structure predic-
tion. A more sophisticated method by Washietl et al. [68] can also
be used to incorporate experimental information into common
secondary structure predictions, because it produces a BPPM that
takes experimental information into account.
3.6.4 A Problem that Is Mathematically Related to Problem 1
Problem 1, which is considered in this paper, can be extended to predictions of RNA–RNA interactions, another important task in RNA bioinformatics (e.g., [29, 50]).
4 Notes
Acknowledgement
References
1. Andronescu M, Condon A, Hoos HH, Mathews DH, Murphy KP (2007) Efficient parameter estimation for RNA secondary structure prediction. Bioinformatics 23(13):19–28
2. Andronescu M, Condon A, Hoos HH, Mathews DH, Murphy KP (2010) Computational approaches for RNA energy parameter estimation. RNA 16(12):2304–2318
3. Andronescu MS, Pop C, Condon AE (2010) Improved free energy parameters for RNA pseudoknotted secondary structure prediction. RNA 16(1):26–42
4. Balik A, Penn AC, Nemoda Z, Greger IH (2013) Activity-regulated RNA editing in select neuronal subfields in hippocampus. Nucl Acids Res 41(2):1124–1134
5. Bernhart SH, Hofacker IL, Will S, Gruber AR, Stadler PF (2008) RNAalifold: improved consensus structure prediction for RNA alignments. BMC Bioinform 9:474
6. Bindewald E, Shapiro BA (2006) RNA secondary structure prediction from sequence alignments using a network of k-nearest neighbor classifiers. RNA 12(3):342–352
7. Burge SW, Daub J, Eberhardt R, Tate J, Barquist L, Nawrocki EP, Eddy SR, Gardner PP, Bateman A (2013) Rfam 11.0: 10 years of RNA families. Nucl Acids Res 41(Database issue):D226–D232
8. Carvalho LE, Lawrence CE (2008) Centroid estimation in discrete high-dimensional spaces with applications in biology. Proc Natl Acad Sci USA 105(9):3209–3214
9. Do CB, Woods DA, Batzoglou S (2006) CONTRAfold: RNA secondary structure prediction without physics-based models. Bioinformatics 22(14):e90–e98
10. Dowell RD, Eddy SR (2004) Evaluation of several lightweight stochastic context-free grammars for RNA secondary structure prediction. BMC Bioinform 5:71
11. Esteller M (2011) Non-coding RNAs in human disease. Nat Rev Genet 12(12):861–874
12. Freyhult E, Moulton V, Gardner P (2005) Predicting RNA structure using mutual information. Appl Bioinform 4(1):53–59
13. Gardner PP, Giegerich R (2004) A comprehensive comparison of comparative RNA structure prediction approaches. BMC Bioinform 5:140
14. Ge P, Zhang S (2013) Incorporating phylogenetic-based covarying mutations into RNAalifold for RNA consensus structure prediction. BMC Bioinform 14(1):142
15. Gruber AR, Findeiss S, Washietl S, Hofacker IL, Stadler PF (2010) RNAz 2.0: improved noncoding RNA detection. Pac Symp Biocomput 15:69–79
16. Hamada M (2012) Direct updating of an RNA base-pairing probability matrix with marginal probability constraints. J Comput Biol 19(12):1265–1276
17. Hamada M (2014) Fighting against uncertainty: an essential issue in bioinformatics. Briefings Bioinform 15(5):748–767
18. Hamada M, Asai K (2012) A classification of bioinformatics algorithms from the viewpoint of maximizing expected accuracy (MEA). J Comput Biol 19(5):532–549
19. Hamada M, Kiryu H, Sato K, Mituyama T, Asai K (2009) Prediction of RNA secondary structure using generalized centroid estimators. Bioinformatics 25(4):465–473
20. Hamada M, Sato K, Kiryu H, Mituyama T, Asai K (2009) CentroidAlign: fast and accurate aligner for structured RNAs by maximizing expected sum-of-pairs score. Bioinformatics 25(24):3236–3243
21. Hamada M, Sato K, Kiryu H, Mituyama T, Asai K (2009) Predictions of RNA secondary structure by combining homologous sequence information. Bioinformatics 25(12):i330–i338
22. Hamada M, Sato K, Asai K (2010) Prediction of RNA secondary structure by maximizing pseudo-expected accuracy. BMC Bioinform 11:586
23. Hamada M, Sato K, Asai K (2011) Improving the accuracy of predicting secondary structure for aligned RNA sequences. Nucl Acids Res 39(2):393–402
24. Hamada M, Kiryu H, Iwasaki W, Asai K (2011) Generalized centroid estimators in bioinformatics. PLoS ONE 6(2):e16450
25. Hamada M, Yamada K, Sato K, Frith MC, Asai K (2011) CentroidHomfold-LAST: accurate prediction of RNA secondary structure using automatically collected homologous sequences. Nucl Acids Res 39(Web Server issue):W100–W106
26. Hofacker IL (2007) RNA consensus structure prediction with RNAalifold. Methods Mol Biol 395:527–544
27. Hofacker IL, Fekete M, Stadler PF (2002) Secondary structure prediction for aligned RNA sequences. J Mol Biol 319(5):1059–1066
28. Jager D, Pernitzsch SR, Richter AS, Backofen R, Sharma CM, Schmitz RA (2012) An archaeal sRNA targeting cis- and trans-encoded mRNAs via two distinct domains. Nucl Acids Res 40(21):10964–10979
29. Kato Y, Sato K, Hamada M, Watanabe Y, Asai K, Akutsu T (2010) RactIP: fast and accurate prediction of RNA-RNA interaction using integer programming. Bioinformatics 26(18):i460–i466
30. Katoh K, Toh H (2008) Improved accuracy of multiple ncRNA alignment by incorporating structural information into a MAFFT-based framework. BMC Bioinform 9:212
31. Kertesz M, Wan Y, Mazor E, Rinn JL, Nutter RC, Chang HY, Segal E (2010) Genome-wide measurement of RNA secondary structure in yeast. Nature 467(7311):103–107
32. Kiryu H, Kin T, Asai K (2007) Robust prediction of consensus secondary structures using averaged base-pairing probability matrices. Bioinformatics 23(4):434–441
33. Klein RJ, Eddy SR (2003) RSEARCH: finding homologs of single structured RNA sequences. BMC Bioinform 4:44
34. Knudsen B, Hein J (1999) RNA secondary structure prediction using stochastic context-free grammars and evolutionary history. Bioinformatics 15(6):446–454
35. Knudsen B, Hein J (2003) Pfold: RNA secondary structure prediction using stochastic context-free grammars. Nucl Acids Res 31(13):3423–3428
36. Low JT, Weeks KM (2010) SHAPE-directed RNA secondary structure prediction. Methods 52(2):150–158
37. Luck R, Graf S, Steger G (1999) ConStruct: a tool for thermodynamic controlled prediction of conserved secondary structure. Nucl Acids Res 27(21):4208–4217
38. Mathews DH, Sabina J, Zuker M, Turner DH (1999) Expanded sequence dependence of thermodynamic parameters improves prediction of RNA secondary structure. J Mol Biol 288(5):911–940
39. Mathews DH, Disney MD, Childs JL, Schroeder SJ, Zuker M, Turner DH (2004) Incorporating chemical modification constraints into a dynamic programming algorithm for prediction of RNA secondary structure. Proc Natl Acad Sci USA 101(19):7287–7292
40. McCaskill JS (1990) The equilibrium partition function and base-pair binding probabilities for RNA secondary structure. Biopolymers 29(6–7):1105–1119
41. Meer EJ, Wang DO, Kim S, Barr I, Guo F, Martin KC (2012) Identification of a cis-acting element that localizes mRNA to synapses. Proc Natl Acad Sci 109(12):4639–4644
42. Nebel ME, Weinberg F (2012) Algebraic and combinatorial properties of common RNA pseudoknot classes with applications. J Comput Biol 19(10):1134–1150
43. Novikova IV, Hennelly SP, Sanbonmatsu KY (2012) Structural architecture of the human long non-coding RNA, steroid receptor RNA activator. Nucl Acids Res 40(11):5034–5051
44. Pang PS, Elazar M, Pham EA, Glenn JS (2011) Simplified RNA secondary structure mapping by automation of SHAPE data analysis. Nucl Acids Res 39(22):e151
45. Pauli A, Rinn JL, Schier AF (2011) Non-coding RNAs as regulators of embryogenesis. Nat Rev Genet 12(2):136–149
46. Penn AC, Balik A, Greger IH (2013) Steric antisense inhibition of AMPA receptor Q/R editing reveals tight coupling to intronic editing sites and splicing. Nucl Acids Res 41(2):1113–1123
47. Proctor JR, Meyer IM (2013) COFOLD: an RNA secondary structure prediction method that takes co-transcriptional folding into account. Nucl Acids Res 41(9):e102
48. Puton T, Kozlowski LP, Rother KM, Bujnicki JM (2013) CompaRNA: a server for continuous benchmarking of automated methods for RNA secondary structure prediction. Nucl Acids Res 41(7):4307–4323
49. Qureshi IA, Mehler MF (2012) Emerging roles of non-coding RNAs in brain evolution, development, plasticity and disease. Nat Rev Neurosci 13(8):528–541
50. Richter AS, Backofen R (2012) Accessibility and conservation: general features of bacterial small RNA–mRNA interactions? RNA Biol 9(7):954–965
51. Rivas E, Lang R, Eddy SR (2012) A range of complex probabilistic models for RNA secondary structure prediction that includes the nearest-neighbor model and more. RNA 18(2):193–212
52. Ruan J, Stormo GD, Zhang W (2004) An iterated loop matching approach to the prediction of RNA secondary structures with pseudoknots. Bioinformatics 20(1):58–66
53. Sahraeian SM, Yoon BJ (2011) PicXAA-R: efficient structural alignment of multiple RNA sequences using a greedy approach. BMC Bioinform 12(Suppl 1):S38
54. Sankoff D (1985) Simultaneous solution of the RNA folding alignment and protosequence problems. SIAM J Appl Math 45:810–825
55. Sato K, Hamada M, Asai K, Mituyama T (2009) CENTROIDFOLD: a web server for RNA secondary structure prediction. Nucl Acids Res 37(Web Server issue):W277–W280
56. Sato K, Hamada M, Mituyama T, Asai K, Sakakibara Y (2010) A non-parametric Bayesian approach for predicting RNA secondary structures. J Bioinform Comput Biol 8(4):727–742
57. Sato K, Kato Y, Hamada M, Akutsu T, Asai K (2011) IPknot: fast and accurate prediction of RNA secondary structures with pseudoknots using integer programming. Bioinformatics 27(13):85–93
58. Sato K, Kato Y, Akutsu T, Asai K, Sakakibara Y (2012) DAFS: simultaneous aligning and folding of RNA sequences via dual decomposition. Bioinformatics 28(24):3218–3224
59. Seemann SE, Gorodkin J, Backofen R (2008) Unifying evolutionary and thermodynamic information for RNA folding of multiple alignments. Nucl Acids Res 36(20):6355–6362
60. Seemann SE, Menzel P, Backofen R, Gorodkin J (2011) The PETfold and PETcofold web servers for intra- and intermolecular structures of multiple RNA sequences. Nucl Acids Res 39(Web Server issue):W107–W111
61. Seemann SE, Richter AS, Gesell T, Backofen R, Gorodkin J (2011) PETcofold: predicting conserved interactions and structures of two multiple alignments of RNA sequences. Bioinformatics 27(2):211–219
62. Spirollari J, Wang JT, Zhang K, Bellofatto V, Park Y, Shapiro BA (2009) Predicting consensus structures for RNA alignments via pseudo-energy minimization. Bioinform Biol Insights 3:51–69
63. Sukosd Z, Knudsen B, Kjems J, Pedersen CN (2012) PPfold 3.0: fast RNA secondary structure prediction using phylogeny and auxiliary data. Bioinformatics 28(20):2691–2692
64. Underwood JG, Uzilov AV, Katzman S, Onodera CS, Mainzer JE, Mathews DH, Lowe TM, Salama SR, Haussler D (2010) FragSeq: transcriptome-wide RNA structure probing using high-throughput sequencing. Nat Methods 7(12):995–1001
65. Washietl S, Hofacker IL (2004) Consensus folding of aligned sequences as a new measure for the detection of functional RNAs by comparative genomics. J Mol Biol 342(1):19–30
66. Washietl S, Hofacker IL, Lukasser M, Huttenhofer A, Stadler PF (2005) Mapping of conserved RNA secondary structures predicts thousands of functional noncoding RNAs in the human genome. Nat Biotechnol 23(11):1383–1390
67. Washietl S, Hofacker IL, Stadler PF (2005) Fast and reliable prediction of noncoding RNAs. Proc Natl Acad Sci USA 102(7):2454–2459
68. Washietl S, Hofacker IL, Stadler PF, Kellis M (2012) RNA folding with soft constraints: reconciliation of probing data and thermodynamic secondary structure prediction. Nucl Acids Res 40(10):4261–4272
69. Watts JM, Dang KK, Gorelick RJ, Leonard CW, Bess JW, Swanstrom R, Burch CL, Weeks KM (2009) Architecture and secondary structure of an entire HIV-1 RNA genome. Nature 460(7256):711–716
70. Wei D, Alpert LV, Lawrence CE (2011) RNAG: a new Gibbs sampler for predicting RNA secondary structure for unaligned sequences. Bioinformatics 27(18):2486–2493
71. Wilm A, Higgins DG, Notredame C (2008) R-Coffee: a method for multiple alignment of non-coding RNA. Nucl Acids Res 36(9):e52
72. Wilm A, Linnenbrink K, Steger G (2008) ConStruct: improved construction of RNA consensus structures. BMC Bioinform 9:219
73. Witwer C, Hofacker IL, Stadler PF (2004) Prediction of consensus RNA secondary structures including pseudoknots. IEEE/ACM Trans Comput Biol Bioinform 1(2):66–77
74. Wong KM, Suchard MA, Huelsenbeck JP (2008) Alignment uncertainty and genomic analysis. Science 319(5862):473–476
75. Xia T, SantaLucia J, Burkard ME, Kierzek R, Schroeder SJ, Jiao X, Cox C, Turner DH (1998) Thermodynamic parameters for an expanded nearest-neighbor model for formation of RNA duplexes with Watson–Crick base-pairs. Biochemistry 37(42):14719–14735
76. Yonemoto H, Asai K, Hamada M (2013) CentroidAlign-Web: a fast and accurate multiple aligner for long non-coding RNAs. Int J Mol Sci 14(3):6144–6156
Chapter 3
Abstract
RNA alignment is an important step in the annotation and characterization of unknown RNAs, and several methods have been developed to meet the need for fast and accurate alignments. Because the performance of alignment methods is affected by the features of the input RNAs, finding the most suitable method is not trivial. Indeed, no available method clearly outperforms the others. Here we present a simple workflow to help choose the most suitable method for RNA pairwise alignment. We tested the performance of six algorithms, based on different approaches, on datasets created by merging publicly available datasets of known or curated RNA secondary structure annotations with datasets of curated RNA alignments. Then, we simulated the frequent case where the secondary structure is unknown by using the same alignment datasets but ignoring the known structure and instead predicting it. In conclusion, the proposed workflow for pairwise RNA alignment depends on the input RNA primary sequence identity and the availability of reliable secondary structures.
Key words RNA alignment, RNA structure comparison, RNA sequence–structure relationship, RNA
functional annotation, Computational biology
1 Introduction
Ernesto Picardi (ed.), RNA Bioinformatics, Methods in Molecular Biology, vol. 1269,
DOI 10.1007/978-1-4939-2291-8_3, © Springer Science+Business Media New York 2015
40 Eugenio Mattei et al.
2 Materials
2.1 Computational Methods for RNA Alignment Tested in This Work
LocARNA [10] uses a folding and aligning strategy to perform pairwise alignments. The latest version (v. 1.7.10) can be downloaded from http://www.bioinf.uni-freiburg.de/Software/LocARNA/ and requires the Vienna package [16], which can be downloaded from http://www.tbi.univie.ac.at/~ronny/RNA/index.html. LocARNA can also be accessed through a Web interface (http://rna.informatik.uni-freiburg.de:8080/LocARNA/Input.jsp). Detailed installation instructions are included in the package (see Note 1). LocARNA uses a two-step procedure to align RNAs. First, a base pair probability matrix for each input sequence is computed using RNAfold (included in the Vienna package). Then the two matrices are used as a guide to find the optimal alignment.
RNA StrAT [15] employs a tree-based strategy to perform
pairwise structural alignments. The latest version (v. 7.1) is
available upon request; instructions and contact information can
be found at http://www-lbit.iro.umontreal.ca/rnastrat/. RNA
StrAT uses a three-step procedure to align RNA secondary struc-
tures. Firstly, the two input secondary structures are broken down
into stems and stem-loop structures and then these substructures
are compared using a variant of the tree edit distance. The scores obtained from the previous calculation are then used to guide the alignment of the stem and stem-loop structures. During the alignment, the nucleotide sequence is also taken into account to improve alignment performance.
needle from the EMBOSS package [17] performs RNA
sequence alignments using an implementation of the Needleman–
Wunsch algorithm with affine gaps. needle can be downloaded
from http://emboss.open-bio.org/html/adm/ch01s01.html
(see Note 2); a detailed guide is included in the Web page.
2.2 Test Datasets
We built three datasets to test the alignment performance of the six selected alignment methods. We retrieved curated secondary structures from the RNA STRAND [18] and RNAspa [19] datasets. RNA STRAND integrates information about known RNA secondary structures of any type and from different organisms, retrieved from several public databases. The RNAspa dataset, instead, is a collection of curated secondary structures from Rfam. Datasets of curated sequence alignments were retrieved from RNASTAR [20] and bralibase II [5]. Bralibase II is a collection of RNA alignment datasets proposed for benchmarking alignment algorithms. Among the available datasets supplied by bralibase, we selected dataset 2, which includes pairwise tRNA alignments. RNASTAR includes refined Rfam alignments that were manually curated using structural information from the Protein Data Bank, PDB [21]. Since these curated alignment datasets do not provide secondary structure annotation, we combined the alignment datasets with the secondary structure datasets to obtain a collection of curated alignments of known RNA structures. More specifically, we used RNA STRAND secondary structures for RNASTAR alignments, RNAspa secondary structures for bralibase alignments, and finally the remaining RNAs in RNAspa for Rfam alignments.
The sum-of-pairs score (SPS) [5] was employed as the measure to evaluate the performance of the alignment methods. SPS is defined as the number of correct pairs (aligned nucleotide pairs found in the reference alignment, i.e., the correctly aligned positions) over the total number of predicted pairs, and it can be considered a measure of the sensitivity of a method. An SPS of 0 indicates two completely different alignments (i.e., the alignment reconstructed by the algorithm is completely wrong); conversely, a score of 1 indicates that the reconstructed alignment is identical to the reference alignment.
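The SPS definition above can be illustrated with a short sketch for pairwise alignments. It follows the wording of this section (correct pairs divided by predicted pairs); other formulations normalize by the number of pairs in the reference alignment instead. The function names and the use of `-` as the gap character are assumptions of this example.

```python
def aligned_pairs(row_a, row_b, gap="-"):
    """Return the set of (i, j) residue index pairs aligned in the same column."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    pairs = set()
    i = j = 0  # 0-based positions in the ungapped sequences
    for char_a, char_b in zip(row_a, row_b):
        a_is_residue = char_a != gap
        b_is_residue = char_b != gap
        if a_is_residue and b_is_residue:
            pairs.add((i, j))
        if a_is_residue:
            i += 1
        if b_is_residue:
            j += 1
    return pairs


def sum_of_pairs_score(reference, predicted):
    """SPS as worded in the text: correct pairs / predicted pairs.

    `reference` and `predicted` are (row_a, row_b) tuples of gapped strings.
    """
    reference_pairs = aligned_pairs(*reference)
    predicted_pairs = aligned_pairs(*predicted)
    if not predicted_pairs:
        return 0.0
    return len(reference_pairs & predicted_pairs) / len(predicted_pairs)
```

For example, comparing the reference `("ACGU-A", "AC-UGA")` with the ungapped prediction `("ACGUA", "ACUGA")` gives 3 correct pairs out of 5 predicted, i.e., an SPS of 0.6.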
2.3 Hardware Requirements
No specific hardware is required to handle and align RNA sequences and structures. The proposed methods work well on normal desktop computers running Linux. Nevertheless, all the methods described in this paper have a Web-based interface that can be used instead of the command line version.
3 Methods
3.1 Comparing Algorithms Performance
We ran six different RNA alignment tools on the datasets presented in Subheading 2.2, namely needle, LocARNA, RNA StrAT, RNAforester (included in the Vienna package), RNAdistance (also included in the Vienna package), and gardenia [14]. These algorithms can be divided into three classes according to their approach: sequence-based (needle), folding and aligning (LocARNA), and tree-based (all the others). We used needle as a measure of how important the secondary structure is for the alignment of two RNAs: the less accurate the sequence-based alignment, the more the secondary structure can contribute to guiding the alignment.
Figure 2 shows the results of the structure-based aligners
at different levels of sequence identity, measured as SPS (see
Subheading 2.2). Specifically, when sequence identity is lower than
Fig. 1 The proposed workflow, showing the required steps to find the best align-
ment approach according to the features of the input RNA sequences
RNA Global Pairwise Alignment 43
Fig. 2 Alignment performances of the six tested methods on datasets composed by curated structures and
alignments. Employed methods are needle, LocARNA, RNA StrAT, gardenia, RNAforester, RNAdistance.
Alignment accuracy is evaluated as the sum-of-pairs (SPS, described in Subheading 2.2), at different intervals
of sequence identity as measured using needle
Fig. 3 Alignment performances of the six methods using predicted secondary structures obtained using
RNAfold
For the more frequent cases where structural annotations are not available, LocARNA is the tool of choice, although a sequence alignment can also be sufficient, depending on the user's needs.
3.2 Finding Sequence Identity with Needle
The command to perform a sequence alignment with needle is:
needle -asequence rna1 -bsequence rna2
where rna1 and rna2 are two plain text files containing the input sequences. The two input files can be in different formats; the widely used FASTA format is shown below:
>S1
ACCAGGUGAAAUUCCUGGACCGACGGUUAAAGUCCGG
The output is printed on screen but it can be saved into a file
(in different formats) using the following command:
needle -asequence rna1 -bsequence rna2 -outfile out
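Once the output has been saved to a file, the sequence identity can be extracted programmatically. The sketch below assumes that the report contains a header line of the form `# Identity: a/b (c%)`, as in the default EMBOSS pairwise report; if a different output format was requested, the pattern will need adjusting. The function name is hypothetical.

```python
import re


def parse_needle_identity(report_text):
    """Return the fractional identity (0..1) from a needle report.

    Assumes a line such as '# Identity:      30/40 (75.0%)' is present;
    this layout is an assumption about the default pairwise report.
    """
    match = re.search(r"^#\s*Identity:\s*(\d+)\s*/\s*(\d+)", report_text, re.MULTILINE)
    if match is None:
        raise ValueError("no '# Identity:' line found in the report")
    identical = int(match.group(1))  # number of identical aligned positions
    length = int(match.group(2))     # alignment length
    return identical / length if length else 0.0
```

A typical use would be `parse_needle_identity(open("out").read())`, where `out` is the file written with `-outfile`.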
3.3 Aligning Using LocARNA
The command to perform an alignment with LocARNA is:
locarna rna1 rna2
where rna1 and rna2 are two plain text files containing the input
sequences (and, optionally, their secondary structure, or a set of
structural constraints, as described later). The two input files must
be formatted as shown below:
S1 ACCAGGUGAAAUUCCUGGACCGACGGUUAAAGUCCGG
Secondary structure annotation can be supplied in the stan-
dard dot-bracket notation as a new line. The input files must then
be modified as shown below:
S1 ACCAGGUGAAAUUCCUGGACCGACGGUUAAAGUCCGG
#S .(((((.......))))).(((((.......)).)))
The secondary structure is used by LocARNA to compute a base pair probability matrix satisfying structural constraints. Among the parameters that can help in tuning the alignment, LocARNA allows the user to choose the substitution matrix with the switch ‘--ribosum-file=<path_to_the_matrix>’; the default is the RIBOSUM85-60 matrix. Another useful parameter is ‘--clustal=<file>’, which saves the output of the program in ClustalW [22] format to the specified file. By default the output is printed on screen.
Table 1
Mean running time (in seconds) per sequence
Acknowledgements
References
1. Mercer TR, Gerhardt DJ, Dinger ME et al (2012) Targeted RNA sequencing reveals the deep complexity of the human transcriptome. Nat Biotechnol 30:99–104. doi:10.1038/nbt.2024
2. Cabili MN, Trapnell C, Goff L et al (2011) Integrative annotation of human large intergenic noncoding RNAs reveals global properties and specific subclasses. Genes Dev 25:1915–1927. doi:10.1101/gad.17446611
3. Baker M (2011) Long noncoding RNAs: the search for function. Nat Methods 8:379–383. doi:10.1038/nmeth0511-379
4. Burge SW, Daub J, Eberhardt R et al (2013) Rfam 11.0: 10 years of RNA families. Nucleic Acids Res 41:D226–D232. doi:10.1093/nar/gks1005
5. Gardner PP, Wilm A, Washietl S (2005) A benchmark of multiple sequence alignment programs upon structural RNAs. Nucleic Acids Res 33:2433–2439. doi:10.1093/nar/gki541
6. Wan Y, Kertesz M, Spitale RC et al (2011) Understanding the transcriptome through RNA structure. Nat Rev Genet 12:641–655. doi:10.1038/nrg3049
7. Sankoff D (1985) Simultaneous solution of the RNA folding, alignment and protosequence problems. SIAM J Appl Math 45:810–825
8. Havgaard JH, Torarinsson E, Gorodkin J (2007) Fast pairwise structural RNA alignments by pruning of the dynamical programming matrix. PLoS Comput Biol 3:1896–1908. doi:10.1371/journal.pcbi.0030193
9. Harmanci AO, Sharma G, Mathews DH (2007) Efficient pairwise RNA structure prediction using probabilistic alignment constraints in Dynalign. BMC Bioinformatics 8:130. doi:10.1186/1471-2105-8-130
10. Will S, Reiche K, Hofacker IL et al (2007) Inferring noncoding RNA families and classes by means of genome-scale structure-based clustering. PLoS Comput Biol 3:e65. doi:10.1371/journal.pcbi.0030065
11. Dowell RD, Eddy SR (2006) Efficient pairwise RNA structure prediction and alignment using sequence alignment constraints. BMC Bioinformatics 7:400. doi:10.1186/1471-2105-7-400
12. Taneda A (2010) Multi-objective pairwise RNA sequence alignment. Bioinformatics 26:2383–2390. doi:10.1093/bioinformatics/btq439
13. Notredame C, Higgins DG (1996) SAGA: sequence alignment by genetic algorithm. Nucleic Acids Res 24:1515–1524
14. Blin G, Denise A, Dulucq S et al (2007) Alignments of RNA structures. IEEE ACM Trans Comput Biol Bioinformatics 7:309–322. doi:10.1109/TCBB.2008.28
15. Guignon V, Chauve C, Hamel S (2005) An edit distance between RNA stem-loops. In: Consens MP, Navarro G (eds) SPIRE. Springer, Heidelberg, pp 335–347
16. Lorenz R, Bernhart SH, Höner Zu Siederdissen C et al (2011) ViennaRNA Package 2.0. Algorithms Mol Biol 6:26. doi:10.1186/1748-7188-6-26
17. Rice P, Longden I, Bleasby A (2000) EMBOSS: the European molecular biology open software suite. Trends Genet 16:276–277
18. Andronescu M, Bereg V, Hoos HH, Condon A (2008) RNA STRAND: the RNA secondary structure and statistical analysis database. BMC Bioinformatics 9:340. doi:10.1186/1471-2105-9-340
19. Horesh Y, Doniger T, Michaeli S, Unger R (2007) RNAspa: a shortest path approach for comparative prediction of the secondary structure of ncRNA molecules. BMC Bioinformatics 8:366. doi:10.1186/1471-2105-8-366
20. Widmann J, Stombaugh J, McDonald D et al (2012) RNASTAR: an RNA STructural Alignment Repository that provides insight into the evolution of natural and artificial RNAs. RNA 18:1319–1327. doi:10.1261/rna.032052.111
21. Berman HM, Kleywegt GJ, Nakamura H, Markley JL (2013) The future of the protein data bank. Biopolymers 99:218–222. doi:10.1002/bip.22132
22. Larkin MA, Blackshields G, Brown NP et al (2007) Clustal W and Clustal X version 2.0. Bioinformatics 23:2947–2948. doi:10.1093/bioinformatics/btm404
Chapter 4
Abstract
RNA secondary structure plays critical roles in several biological processes. For example, many trans-acting
noncoding RNA genes and cis-acting RNA regulatory elements present functional motifs, conserved both
in structure and sequence, that can hardly be detected by primary sequence analysis alone. We describe
here how conserved secondary structure motifs shared by functionally related RNA sequences can be
detected through the software tool RNAProfile. RNAProfile takes as input a set of unaligned RNA
sequences expected to share a common motif, and outputs the regions that are most conserved through-
out the sequences, according to a similarity measure that takes into account both the sequence of the
regions and the secondary structure they can form according to base-pairing and thermodynamic rules.
The method is split into two parts. First, it identifies candidate regions within the input sequences,
and associates with each region a locally optimal secondary structure. Then, it compares candidate regions
to one another, both at sequence and structure level, and builds motifs exploring the search space through
a greedy heuristic. We provide a detailed guide to the different parameters that can be employed, and usage
examples showing the different software capabilities.
Key words RNAProfile, RNA untranslated sequences, RNA secondary structure, UTR, Modtools,
Posttranscriptional regulation
1 Introduction
Ernesto Picardi (ed.), RNA Bioinformatics, Methods in Molecular Biology, vol. 1269,
DOI 10.1007/978-1-4939-2291-8_4, © Springer Science+Business Media New York 2015
50 Federico Zambelli and Giulio Pavesi
2 Materials
3 Methods
3.1 The Algorithm
We present here a summary of the main steps performed by the algorithm; a more detailed
description can be found in [3] and the related Supplementary Materials.
3.1.1 Candidate Regions Selection
Given a set of RNA sequences expected to share a structural motif, the first step is to
identify a set of candidate regions from each sequence, containing potential secondary
structure motifs. Since RNA secondary structure can be decomposed into hairpins, the only
required parameter is the number of hairpins expected to form the motif, with a single
hairpin as default. No other constraints, such as the size and number of features like
loops, stacks, and connecting elements, or a threshold for the folding energy, are
required. The prediction of the secondary structure of the input sequences is performed
only locally. That is, given as input the number h of hairpins that have to be contained
in the secondary structure of the motif, the algorithm selects from each input sequence
the regions whose predicted optimal secondary structure contains exactly h hairpins. If
this parameter is not available either, the analysis can simply be iterated starting from
a single hairpin and increasing h at each run. In the absence of further constraints, the
algorithm examines all the possible substrings of each input sequence. The general idea is
that a structural motif should correspond to a local free energy optimum for the region
forming it, and its formation should thus depend solely on local interactions among the
nucleotides of the region and not on the presence of other structures elsewhere in the
sequence.
Moreover, searching for a fixed number of hairpins within each region of an RNA sequence
of length n reduces the computational complexity from the exponential time required to
enumerate all the potential secondary structures that an RNA sequence can form [14] to
polynomial time. Notice also that the region length is not predetermined, since the
algorithm processes every possible substring of suitable size in order to check whether
it contains the required number of hairpins. Finally, this approach also takes advantage
of the fact that the folding parameters used by energy-based RNA secondary structure
prediction methods are usually more reliable when applied to small regions of limited
size (10–50 nt).
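The selection criterion can be illustrated with a small Python sketch. It is not RNAProfile's code: it only shows how the number of hairpins could be read off a structure in bracket notation, and how the substrings of a sequence within given length bounds could be enumerated. The folding step itself is left out, and the function names and length bounds are assumptions made for the illustration.

```python
import re

# A hairpin loop is an opening bracket closed directly after a run of
# unpaired positions, i.e. "(" + dots + ")" with nothing paired in between.
HAIRPIN = re.compile(r"\(\.*\)")


def count_hairpins(structure: str) -> int:
    """Count hairpin loops in a dot-bracket string."""
    return len(HAIRPIN.findall(structure))


def candidate_windows(sequence: str, min_len: int, max_len: int):
    """Yield (start, substring) for every substring within the length bounds."""
    n = len(sequence)
    for start in range(n):
        for length in range(min_len, max_len + 1):
            if start + length > n:
                break
            yield start, sequence[start:start + length]


print(count_hairpins("(((.(((((......))))))))"))      # a single hairpin
print(count_hairpins("((..((...))..((....))..))"))    # two hairpins
print(sum(1 for _ in candidate_windows("ACGUACGUAC", 4, 6)))
```

In a complete pipeline each window would first be folded by an energy-based predictor, and only windows whose predicted structure contains exactly h hairpins would be kept.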
RNAProfile 53
3.1.2 Building Motifs
After being selected from the input sequences, together with their secondary structure,
candidate regions are compared to one another in order to build groups of “most similar”
regions likely to form a conserved motif. These groups contain exactly one region from
each input sequence. Comparisons are made by computing pairwise alignments of the regions,
with a scoring function that takes into account similarity at the sequence and structure
level at the same time. Given k input sequences of size n, the selection step returns O(n)
candidate regions per sequence, with O(n^k) possible candidate motifs that can be built by
selecting one region per sequence. Thus, exhaustive enumeration of all possible region
combinations would be computationally unfeasible, and a greedy heuristic has been
introduced in order to explore the solution space.
Briefly, the heuristic works as follows. Given a set of k sequences S = {S_1, S_2, …, S_k}
and their associated sets of regions R = {R_1, R_2, …, R_k}, RNAProfile first computes all
the pairwise alignments between the regions from R_1 and those from R_2. All the
alignments are scored and ranked, and the p highest scoring alignments are kept. Each of
the alignments is described by an alignment profile, which summarizes the
sequence-structure similarity of the two aligned regions. At the second step, regions from
R_3 are aligned with the p profiles that were built at the previous step. The resulting
alignments, which now contain three regions, are again scored and ranked, and the best p
alignments are kept. This step is iterated until the last set of regions R_k has been
reached. The alignments built at the last step thus contain exactly one region per input
sequence, and correspond to the motifs output by the algorithm. Also, a fitness value is
associated with each region included in an alignment, estimating how well the region fits
the motif profile defined by the alignment itself. A positive fitness value indicates that
the region is likely to represent an instance of the motif described by the profile, while
a negative value suggests the opposite. The rationale is that, since there is no guarantee
that all the sequences of S share an instance of the motif, a motif profile could include
regions that do not represent instances of the motif, and hence have little similarity to
the other regions building the motif.
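The greedy exploration described above resembles a beam search, and can be sketched in a few lines of Python. This is a toy illustration under strong assumptions: regions are plain strings, and the hypothetical `toy_score` function (mean pairwise string similarity) stands in for the sequence-structure profile score actually used by RNAProfile, which is not reproduced here.

```python
from difflib import SequenceMatcher
from itertools import combinations


def toy_score(regions):
    """Placeholder score: mean pairwise string similarity of the regions."""
    pairs = list(combinations(regions, 2))
    return sum(SequenceMatcher(None, a, b).ratio() for a, b in pairs) / len(pairs)


def greedy_motifs(region_sets, p=2):
    """Keep the p best partial groups while adding one sequence at a time."""
    # First step: every combination of one region from R_1 and one from R_2.
    beam = [(r1, r2) for r1 in region_sets[0] for r2 in region_sets[1]]
    beam = sorted(beam, key=toy_score, reverse=True)[:p]
    # Following steps: extend each kept group with every region of the next set.
    for regions in region_sets[2:]:
        extended = [group + (r,) for group in beam for r in regions]
        beam = sorted(extended, key=toy_score, reverse=True)[:p]
    return [(group, toy_score(group)) for group in beam]


# Three toy "sequences", each reduced to two candidate regions.
region_sets = [
    ["CAGUGC", "AAAAAA"],
    ["CAGUGU", "GGGGGG"],
    ["CAGUGC", "UUUUUU"],
]
best_group, best_score = greedy_motifs(region_sets, p=2)[0]
print(best_group, round(best_score, 2))
```

With p partial groups kept at each step, the number of comparisons grows linearly with the number of sequences instead of exponentially, at the price of possibly missing the optimal combination.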
When the motif is expected to appear only in a small subset (less than half) of the input
sequences, the algorithm can be run employing a different method for the construction of
motif profiles. For each i from 1 to k − 1, the selected regions R_i of a sequence S_i are
aligned to all the regions R_j of the other sequences with j > i, and again the p highest
scoring profiles are kept. Thus, instead of
3.2 Running RNAProfile
The command for using RNAProfile with default settings is:
> ./rnaprofile -f <filename>
where <filename> is the input file.
3.2.1 Input
Input files can be of two types. The first and most common type is a multi-FASTA file
containing the RNA sequences to be analyzed. A careful selection of the sequences is
important and may have a strong impact on the output. When possible, it is useful to avoid
including long sequences (e.g., full mRNAs). Rather, it is advisable to include only the
portion that is more likely to contain a functional motif. For example, when motifs appear
within mRNA untranslated regions, it can be a good strategy to omit the coding portion of
the transcripts, in order to exclude the bias induced by the higher sequence conservation
usually found in coding sequences (see Note).
The second kind of input file accepted by RNAProfile contains preprocessed sequences or
regions, for which a secondary structure is already available. In this case, the secondary
structure prediction and region selection steps of the algorithm are bypassed. The file
contains, for each input sequence, the sequence itself in FASTA format, followed by a line
starting with “!” and the sequence name, and then by its regions, each on one line and
followed on the next line by the associated secondary structure in bracket notation:
>Sequence1
CAGTCAGTACGTCTGACAGTCAGTACATGCTCGATGGTACGTATGCATGCGTGT
CATCAGTCTGAGTCAGTACTGACGTAGTCAGTCTGACTGACGTATGCAGTCTGA
!Sequence1
GTTCGTCCTCAGTGCAGGGCAAC
(((.(((((......))))))))
AACTTCAGCTACAGTGTTAGCTAAGTT
(((((.(((((......))))))))))
CCACAGGCTCAGTGTGGTCTTGG
(((.(((((......))))))))
GCCTTCTGCACCAGTGTGTGTAAAGGC
(((((.(((((......))))))))))
GCCTTCTGCGCCAGTGTGTGTAAAGGC
(((((.(((((......))))))))))
TAATTGCAAACGCAGTGCCGTTTCAATTG
((((((.(((((......)))))))))))
>Sequence2
CTACTGACAGTCAGTCATGCGTACAGTGTCAGTCATGCAGTCAGTACCGTACGTA
CATGACGTCATGCATGCATGCAGTCAGTCATGCAGTCATGCATGCATGCAGTCAG
!Sequence2
CCACAGGCTCAGTGTGGTCTTGG
(((.(((((......))))))))
GCCTTCTGCACCAGTGTGTGTAAAGGC
(((((.(((((......))))))))))
GCCTTCTGCGCCAGTGTGTGTAAAGGC
(((((.(((((......))))))))))
TAATTGCAAACGCAGTGCCGTTTCAATTG
((((((.(((((......)))))))))))
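As a reading aid for the preprocessed format shown above, the following Python sketch parses such a file into a dictionary. It is a minimal illustration inferred from the example only: the function name, the returned data layout, and the length check are assumptions, and RNAProfile itself may accept variations not handled here.

```python
def parse_preprocessed(text):
    """Return {name: {"sequence": str, "regions": [(region, structure), ...]}}."""
    records = {}
    current = None
    in_regions = False
    pending_region = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">") or line.startswith("!"):
            if pending_region is not None:
                raise ValueError("region without a structure line")
            # ">" opens the sequence part, "!" opens the list of regions.
            current = records.setdefault(line[1:], {"sequence": "", "regions": []})
            in_regions = line.startswith("!")
        elif current is None:
            raise ValueError("data found before any header line")
        elif not in_regions:
            current["sequence"] += line          # sequence may span several lines
        elif pending_region is None:
            pending_region = line                # first line of a region/structure pair
        else:
            if len(line) != len(pending_region):
                raise ValueError("structure and region lengths differ")
            current["regions"].append((pending_region, line))
            pending_region = None
    if pending_region is not None:
        raise ValueError("region without a structure line")
    return records


sample = """>Sequence1
CAGTCAGTACGTCTGACAGTCAGTACATGCTCGATGGTACGTATGCATGCGTGT
!Sequence1
GTTCGTCCTCAGTGCAGGGCAAC
(((.(((((......))))))))
CCACAGGCTCAGTGTGGTCTTGG
(((.(((((......))))))))
"""
records = parse_preprocessed(sample)
print(len(records["Sequence1"]["regions"]))
```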
3.2.2 Parameters
Default settings have been determined by running several tests, with the aim of speeding
up the computation while still obtaining reliable results. All input parameters can
nevertheless be modified and fine-tuned by users as follows.
-A <int>: run the program in “all versus all” mode, useful when only a subset of the input
sequences is expected to share a motif, as discussed in Subheading 3.1.2, or when the
default run did not yield significant results. The algorithm stops when profiles
containing <int> regions have been generated. The best profiles for each iteration are
output in any case.
-H <int>: sets the number of hairpins that the secondary structure associated with each
candidate region must contain. Default is a single hairpin. A good strategy can be to run
RNAProfile on the same set of sequences, increasing the value of H by one at each run, and
stopping when the score of the best profile found with n hairpins is smaller than that of
the best profile found with n − 1 hairpins.
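The iterative strategy suggested for the -H option can be sketched in Python as follows. The helper `build_command` only assembles an argument list from the -f, -H and -A options described above; `best_hairpin_count` takes a hypothetical callable `run` that is assumed to launch the program for a given number of hairpins and return the score of the best profile. How that score is obtained from the result file, and the upper bound `max_hairpins`, are assumptions of this sketch.

```python
def build_command(filename, hairpins=1, all_vs_all=None, binary="./rnaprofile"):
    """Assemble the argument list from the -f, -H and -A options."""
    cmd = [binary, "-f", filename, "-H", str(hairpins)]
    if all_vs_all is not None:
        cmd += ["-A", str(all_vs_all)]
    return cmd


def best_hairpin_count(run, max_hairpins=5):
    """Increase H until the best score drops below that of the previous run."""
    best_h, best_score = 1, run(1)
    for h in range(2, max_hairpins + 1):
        score = run(h)
        if score < best_score:
            break                      # stopping rule from the text
        best_h, best_score = h, score
    return best_h, best_score


# Stand-in for real runs: made-up best scores for H = 1, 2, 3.
fake_scores = {1: 2.1, 2: 2.6, 3: 1.9}
print(build_command("input.fna", hairpins=2))
print(best_hairpin_count(fake_scores.get, max_hairpins=3))
```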
3.2.3 Output
Results are written to a file, whose name is shown on the screen at the end of the run.
A result file has the following format:
ALIGNMENT RESULTS
Input file: mouse+human.ferritin.fna
Number of profiles saved at each step: 100 Max number of
profiles originating from the same profile (or region): 10
Region minimum length: 20 Region maximum length: 40
Energy threshold: 0
Max difference in length between regions: 10
Random alignment: yes (Seed used: 1065538283)
Best profiles:
Profile 1. Score: 2.66
(profile data here)
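When several runs have to be compared, the profile scores can be collected from the result file with a short script. The Python sketch below relies only on the "Profile N. Score: X" header line visible in the excerpt above; the rest of the file layout is not assumed, and the function name is hypothetical.

```python
import re

# Matches header lines such as "Profile 1. Score: 2.66".
PROFILE_LINE = re.compile(r"^Profile\s+(\d+)\.\s+Score:\s*(-?\d+(?:\.\d+)?)")


def profile_scores(report_text):
    """Return {profile number: score} for every profile header found."""
    scores = {}
    for line in report_text.splitlines():
        match = PROFILE_LINE.match(line.strip())
        if match:
            scores[int(match.group(1))] = float(match.group(2))
    return scores


report = """ALIGNMENT RESULTS
Input file: mouse+human.ferritin.fna
Best profiles:
Profile 1. Score: 2.66
(profile data here)
"""
print(profile_scores(report))
```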
3.3 Usage Examples
In these examples the algorithm has been run iteratively with default parameters, unless
otherwise specified, starting with h = 1 (h being the number of hairpins in candidate
regions) and increasing h by 1 until the best profile found with h hairpins had a lower
score than the best profile reported using h − 1 hairpins.
3.3.1 Iron Responsive Element
The iron responsive element (IRE) is a well-known mRNA structural element (e.g., [15–17])
that has also been associated with neurodegenerative diseases (e.g., [18]). It consists of
a conserved hairpin structure present in the UTRs of transcripts coding for proteins
involved in cellular iron metabolism. IREs usually present an unpaired nucleotide
(cytosine) at the 5′ of the stem, or a cytosine nucleotide and two additional bases at the
5′ opposing one free 3′ nucleotide. The iron responsive element is therefore a good
benchmark for testing the functionality of RNAProfile. A dataset with the full mRNA
sequences of human and mouse ferritin (light and heavy chain) and aminolevulinate synthase
2 was prepared. To add experimental noise, sequences from three human ferritin
pseudogenes, hence not likely to contain an IRE motif, were also included in the dataset.
As can be seen in Fig. 2, the algorithm correctly identifies the IREs as building the
highest scoring motif. Candidate motif regions from pseudogenes instead have a negative
fitness value, strongly hinting that they are not actual instances of the IRE motif
described by the profile.
3.3.2 Atypical IREs
Not all IREs fall into the two types described in the previous example. A new dataset with
sequences including atypical IREs found in the 5′-UTR of mRNAs from fruitfly, bullfrog,
starfish, and crayfish was prepared. Eight 5′-UTRs from plant ferritin mRNAs, lacking the
IRE, were also added to the dataset to test the “all versus all” functionality of the
algorithm. In fact, only one third of the sequences of this dataset actually contain an
IRE. The algorithm was thus run with the -A option active, looking for motifs containing
up to six regions (appearing in at most half of the sequences). As shown in Fig. 3, four
regions containing four IRE instances form the highest scoring motif. The bottom helix has
a different length in the four instances, while the starfish and crayfish IREs both lack a
base pair at the bottom of the topmost helix.
Fig. 2 IRE motif occurrences reported by RNAProfile, with respective energy and fitness values. Notice the
negative fitness values for the instances reported in pseudogenes
The crayfish element does not present the unpaired cytosine usually
considered the distinctive signature of IREs. However, all these ele-
ments have been experimentally validated as functional IREs in [19].
3.3.3 Translation Control Element
The Nanos protein determines the correct anterior/posterior patterning in fruitfly embryos
[20]. The translation of its mRNA is repressed in the bulk cytoplasm and activated only in
the posterior domain. Translational control depends on an RNA secondary structure feature
called translation control element (TCE), found in the 3′-UTR of Nanos mRNA. This
structure is Y shaped and is bound by the protein Smaug, leading to translational
repression [21]. For this test RNAProfile was run on the 3′-UTR sequences of Nanos mRNA
homologs from D. melanogaster, D. virilis, and D. simulans. The score of the highest
scoring profile remained stable when searching for motifs with one or two hairpins, and
the best profile found with one hairpin was contained within the best profile found with
two hairpins. The latter captures the two hairpins forming the Y shape, as seen in Fig. 4.
Fig. 3 Four IRE instances found in the “unconventional IRE” dataset building the highest scoring motif. Allowing
the algorithm to reach the imposed limit of six regions per motif lowered the score of the motif itself by includ-
ing two unlikely instances, as also shown by the low fitness of the two additional sequences
Fig. 4 The Nanos TCE element identified by RNAProfile, composed of two hairpins building a Y-shaped secondary
structure
3.4 Single Sequence RNA Secondary Structure Prediction
In some cases, it might be helpful to have a tool for secondary structure prediction
available to work together with motif finding tools like RNAProfile. In fact, while
functional RNA secondary structures may differ from the ones predicted according to
minimal energy or other criteria, tools like mfold [22] can be constrained to predict
structures that include fixed elements in some regions, like those identified by
RNAProfile. In this way the predicted structure for the whole molecule might be a more
accurate representation of the possible in vivo secondary structure.
3.5 Secondary Structure Visualization Tools
RNAProfile output consists of a text file where the structure motifs are described using
the conventional “dot-and-brackets” notation, where round brackets correspond to paired
bases, and dots to the unpaired ones. While this notation is practical, a graphical
representation of the RNA motifs is more suitable when producing figures for articles or
oral presentations. Tools like PseudoViewer [23] and JViz.Rna [24] can be very useful for
this task, since they can draw a model of the secondary structure starting from the
dot-bracket notation.
4 Note
References
1. Sabin LR, Delás MJ, Hannon GJ (2013) Dogma derailed: the many influences of RNA on the genome. Mol Cell 49(5):783–794
2. Dieterich C, Stadler PF (2012) Computational biology of RNA interactions. Wiley Interdiscip Rev RNA 4(1):107–120
3. Pavesi G, Mauri G, Stefani M et al (2004) RNAProfile: an algorithm for finding conserved secondary structure motifs in unaligned RNA sequences. Nucleic Acids Res 32(10):3258–3269
4. Rabani M, Kertesz M, Segal E (2008) Computational prediction of RNA structural motifs involved in posttranscriptional regulatory processes. Proc Natl Acad Sci U S A 105(39):14885–14890
5. Hiller M, Pudimat R, Busch A et al (2006) Using RNA secondary structures to guide sequence motif finding towards single-stranded regions. Nucleic Acids Res 34(17):e117
6. Bafna V, Tang H, Zhang S (2006) Consensus folding of unaligned RNA sequences revisited. J Comput Biol 13(2):283–295
7. Mokrejs M, Vopálenský V, Kolenaty O et al (2006) IRESite: the database of experimentally verified IRES structures (www.iresite.org). Nucleic Acids Res 34(Database issue):D125–D130
8. Burge SW, Daub J, Eberhardt R et al (2013) Rfam 11.0: 10 years of RNA families. Nucleic Acids Res 41(Database issue):D226–D232
9. Grillo G, Turi A, Licciulli F et al (2010) UTRdb and UTRsite (RELEASE 2010): a collection of sequences and regulatory motifs of the untranslated regions of eukaryotic mRNAs. Nucleic Acids Res 38(Database issue):D75–D80
10. Reuter JS, Mathews DH (2010) RNAstructure: software for RNA secondary structure prediction and analysis. BMC Bioinformatics 11:129
11. Lorenz R, Bernhart SH, Höner Zu Siederdissen C et al (2011) ViennaRNA Package 2.0. Algorithms Mol Biol 6:26
12. Witwer C, Hofacker IL, Stadler PF (2004) Prediction of consensus RNA secondary structures including pseudoknots. IEEE/ACM Trans Comput Biol Bioinformatics 1(2):66–77
13. Mignone F, Gissi C, Liuni S et al (2002) Untranslated regions of mRNAs. Genome Biol 3(3):REVIEWS0004
14. Waterman MS (1995) Introduction to computational biology. Chapman & Hall, London
15. Hentze MW, Muckenthaler MU, Galy B et al (2010) Two to tango: regulation of mammalian iron metabolism. Cell 142(1):24–38
16. Tandara L, Salamunic I (2012) Iron metabolism: current facts and future directions. Biochem Med 22(3):311–328
17. Ma J, Haldar S, Khan MA et al (2012) Fe2+ binds iron responsive element-RNA, selectively changing protein-binding affinities and regulating mRNA repression and activation. Proc Natl Acad Sci U S A 109(22):8417–8422
18. Cahill CM, Lahiri DK, Huang X et al (2009) Amyloid precursor protein and alpha synuclein translation, implications for iron and inflammation in neurodegenerative diseases. Biochim Biophys Acta 1790(7):615–628
19. Huang TS, Melefors O, Lind MI et al (1999) An atypical iron-responsive element (IRE) within crayfish ferritin mRNA and an iron regulatory protein 1 (IRP1)-like protein from crayfish hepatopancreas. Insect Biochem Mol Biol 29(1):1–9
20. Lehmann R, Nüsslein-Volhard C (1991) The maternal gene nanos has a central role in posterior pattern formation of the Drosophila embryo. Development 112(3):679–691
21. Crucs S, Chatterjee S, Gavis ER (2000) Overlapping but distinct RNA elements control repression and activation of nanos translation. Mol Cell 5(3):457–467
22. Zuker M (2003) Mfold web server for nucleic acid folding and hybridization prediction. Nucleic Acids Res 31(13):3406–3415
23. Byun Y, Han K (2009) PseudoViewer3: generating planar drawings of large-scale RNA structures with pseudoknots. Bioinformatics 25(11):1435–1437
24. Wiese KC, Glen E, Vasudevan A (2005) JViz.Rna—a Java tool for RNA secondary structure visualization. IEEE Trans Nanobioscience 4(3):212–218
Chapter 5
Abstract
In RNA biology, secondary structure diagrams are essential to communicate functional hypotheses and
summarize structural data visually, either as drafts or as finalized publication-ready figures. While
many tools are currently available to automate the production of such diagrams, their capabilities are usually
partial, making it hard for a user to decide which to use in a given context. In this chapter, we guide the reader
through the steps involved in the production of expressive publication-quality illustrations featuring the RNA
secondary structure. We present major existing representations and layouts, and give precise instructions to
produce them using available free software, including jViz.RNA, the PseudoViewer, RILogo, R-chie,
RNAplot, R2R, and VARNA. We describe the file formats and structural descriptions accepted by popular
RNA visualization tools. We also provide command lines and Python scripts to ease the user’s access to
advanced features. Finally, we discuss and illustrate alternative approaches to visualize the secondary structure
in the presence of probing data, pseudoknots, RNA–RNA interactions, and comparative data.
Key words RNA visualization, Secondary structure, Graph drawing, Pseudoknots, Non-canonical
motifs, Structure-informed multiple sequence alignments
1 Introduction
Electronic Supplementary Material: The online version of this chapter (doi: 10.1007/978-1-4939-
2291-8_5) contains supplementary material, which is available to authorized users
Ernesto Picardi (ed.), RNA Bioinformatics, Methods in Molecular Biology, vol. 1269,
DOI 10.1007/978-1-4939-2291-8_5, © Springer Science+Business Media New York 2015
64 Yann Ponty and Fabrice Leclerc
1.2 Objectives of RNA Visualization
The secondary structure can be naturally represented as a graph whose vertices are
individual nucleotides. In this representation, edges are expected to connect consecutive
nucleotides in the sugar-phosphate backbone, or pairs of positions involved in a base-pair
mediated by hydrogen bonds. Besides these formal requirements, additional properties have
been identified as desirable by earlier work [25]:
Modularity: Inspection of the drawing should suggest a decomposition of the structure into
functional domains. For instance, helices should be easy to identify.
1.3 Existing Tools
Many tools are now available for visualizing the secondary structure of RNA. These tools
differ in many respects, depending on their intended application and functionalities.
Among the distinguishing features of available tools is the presence of a graphical user
interface, allowing for convenient post-processing before resorting to a time-consuming
general-purpose editor. Moreover, while certain formats may be converted into others in a
lossless manner, most conversions may lose some format-specific data, and support for a
rich variety of input formats is therefore desirable. The output format of a software tool
is also of great importance, as generated figures typically require further
post-processing before meeting the quality criteria expected by publishers. Therefore,
vector formats such as Portable Document Format (PDF) or Scalable Vector Graphics (SVG)
should always be preferred to bitmap formats.
Table 1 summarizes the foremost tools available for visualizing the secondary structure.
Which of the available alternatives should be preferred will typically depend on the
intended application. We hope that this overview of existing tools and representations
will assist readers in their choice of the right visualization, and ultimately ease the
production of more informative pictures.
1.4 Outline
In this chapter, we focus on the necessary steps involved in the production of
publication-quality illustrations involving the RNA secondary structure. After introducing
the main data sources and file formats, we describe a preferred workflow, and stress the
advantages of vector graphics manipulation. Then, we describe the major representations
and layouts proposed for the secondary structure. We also provide simple command lines to
invoke the main tools. Finally, we describe how to address the specific visualization
needs arising from exemplary use-cases: chemical probing experiments, RNA pseudoknots and
interactions, and multiple-sequence alignments.
Table 1
Main features of selected tools offering a visualization of the RNA secondary structure
Name Platform(s) [BWMLa] Ease of use Layouts Interactivity Pseudoknots Ext. sec. str.b Outputc References
jViz.RNA •••• ++ Circular + + • EPS PNG [29, 32]
Linear
Graph
PseudoViewer •• ∘∘ ++ Graph ++ ∘ EPS SVG PNG [5]
GIF
RNAMovies •••• ++ Graph + ∘ SVG PNG [17]
JPEG GIF
2 Materials
2.1 Data Sources
At the primary structure level, including sequences and alignments, the RFAM database [12]
is the authoritative source of RNA data. Unfortunately, this centralized authority has no
counterpart at the secondary structure level, leading to a wealth of specialized
repositories focusing on RNAs of specific functions. This situation is mostly due to the
young history of RNA computational biology, coupled with the fact that, up until recently,
RNA structural data was relatively scarce compared to proteins. Consequently, databases
have typically arisen from a variety of domain-specific efforts, which have not yet
undergone standardization.
Of notable interest at the secondary structure level is the STRAND database [1], a
collection of secondary structures found in, or inferred from, a variety of databases.
However, the purpose of this database, which was initially assembled to train new energy
models from existing sequence/structure data and benchmark computational prediction tools,
conflicts with the goal of completeness, and secondary structure data must currently be
sought within a variety of specialized databases. One should however note that recent
initiatives, such as the RNA ontology consortium [21] or the RNA Central project [2], have
led to proposals for more rational organizations of RNA structural data, and a solution to
these issues may be found in the near future (Fig. 1).
Fig. 1 Main input and output formats supported by the major visualization tools for the RNA secondary
structure (tools shown: jViz.RNA, PseudoViewer, RNAMovies, RILogo, R-chie, RNAplot, R2R, S2S, VARNA)
Fig. 2 Stockholm formatted file fragment, excerpted from RFAM seed alignment for miRNA mir-22. RFAM
ID: RF00653
2.2 RNA Secondary Structure File Formats
Reflecting the diversity of biological contexts and computational use-cases involving RNA
secondary structure, many formats were proposed over time to support its description.
Aside from the ubiquitous FASTA format, extended by the popular Vienna RNA software
package [14] to include a dot-bracket encoding of the secondary structure, the STOCKHOLM
format is used within RFAM to present a WUSS-formatted consensus secondary structure,
possibly including pseudoknots. The BPSEQ and CONNECT formats represent an adjacency list,
indicating the partner (if any) of each position.
Notably, the RNAML format [30] represents a unifying effort of the community to represent
virtually any sort of RNA-related data. While this format is not widely used to represent
the secondary structure because of its intrinsic verbosity, it is clearly the format of
choice to represent and manipulate the extended secondary structure, including
non-canonical base-pairs.
2.2.1 STOCKHOLM Format
STOCKHOLM format files are arguably the preferred representation for RNA multiple sequence
alignments. As illustrated in Fig. 2, a STOCKHOLM file breaks the global alignment of a
set of sequences into portions of bounded width. Unique identifying prefixes (Organism
name/Accession ID) are associated with each portion. Additionally, a set of mark-up lines,
recognizable by their # prefix, specify various additional information, including
accession identifiers, experimental parameters, bibliographical references, etc.
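Because the alignment is split into portions, a reader of the format has to concatenate the chunks belonging to the same identifier. The Python sketch below illustrates this on a made-up two-block alignment. It assumes the usual Stockholm conventions (a "# STOCKHOLM 1.0" first line, consensus structure on "#=GC SS_cons" mark-up lines, "//" as terminator), ignores all other mark-up, and uses a hypothetical function name.

```python
def parse_stockholm(text):
    """Concatenate the blocks of a Stockholm alignment.

    Returns (sequences, ss_cons): a dict name -> aligned sequence, and the
    consensus structure gathered from the '#=GC SS_cons' mark-up lines.
    """
    sequences, ss_cons = {}, ""
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line == "//":
            continue
        fields = line.split()
        if fields[:2] == ["#=GC", "SS_cons"] and len(fields) == 3:
            ss_cons += fields[2]
        elif line.startswith("#"):
            continue  # header and other mark-up lines are skipped in this sketch
        elif len(fields) == 2:
            name, chunk = fields
            sequences[name] = sequences.get(name, "") + chunk
    return sequences, ss_cons


# Made-up alignment split into two portions of width 4.
toy = """# STOCKHOLM 1.0
#=GF ID toy
seq1/1-8      GGCA
seq2/1-8      GGUA
#=GC SS_cons  <<..

seq1/1-8      UGCC
seq2/1-8      UACC
#=GC SS_cons  ..>>
//
"""
sequences, ss_cons = parse_stockholm(toy)
print(sequences, ss_cons)
```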
RNA 2D Visualization 69
Fig. 3 Dot-parentheses (aka dot-bracket) notation for the minimal free-energy structure of the Rat Alanine
tRNA, predicted using RNAFold
2.2.2 Vienna RNA Dot-Parentheses (aka dot-bracket) and Pseudobase Notations
In this format, the RNA sequence is coupled with a dot-parenthesis string where matching
pairs of opening and closing parentheses identify base-pairs. This expression is well
parenthesized, meaning that any opening parenthesis can be unambiguously associated with a
closing parenthesis, inducing a set of non-crossing base-pairs. For instance, in Fig. 3,
the bases at the first and ante-penultimate positions are base-paired.
Since matching pairs of parentheses cannot unambiguously identify crossing base-pairs,
pseudoknots cannot be represented strictly within this format. Therefore this format was
extended by PseudoBase to include support for multiple parenthesis systems, allowing
crossing interactions to be represented. In Fig. 4, two crossing helices, forming an
H-type pseudoknot, are initiated by base-pairs at positions (1604, 1623) and (1615, 1630),
respectively.
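The matching rule behind these notations can be made concrete with a stack-based parser, sketched here in Python. Each bracket type gets its own stack, so that pairs written with different bracket types are allowed to cross. The set of accepted bracket characters is an assumption of this sketch (any other character is treated as unpaired), and the example string is made up.

```python
# Closing bracket -> opening bracket, one entry per parenthesis system.
BRACKETS = {")": "(", "]": "[", "}": "{", ">": "<"}


def parse_pairs(structure):
    """Return sorted 1-based (i, j) pairs; each bracket type has its own stack."""
    stacks = {opener: [] for opener in BRACKETS.values()}
    pairs = []
    for pos, char in enumerate(structure, start=1):
        if char in stacks:
            stacks[char].append(pos)
        elif char in BRACKETS:
            stack = stacks[BRACKETS[char]]
            if not stack:
                raise ValueError(f"unmatched {char!r} at position {pos}")
            pairs.append((stack.pop(), pos))
    leftovers = sorted(p for stack in stacks.values() for p in stack)
    if leftovers:
        raise ValueError(f"unclosed brackets at positions {leftovers}")
    return sorted(pairs)


def crossing(p, q):
    """True if the two base-pairs cross, as in a pseudoknot."""
    (i, j), (k, l) = sorted([p, q])
    return i < k < j < l


pairs = parse_pairs("((..[[..))..]]")
print(pairs)
print(crossing(pairs[0], pairs[2]))
```

A string using a single bracket type always yields non-crossing pairs, which is why a second system is needed as soon as two helices cross.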
2.2.3 BPSEQ Format
The BPSeq format is an alternative to the CT format introduced by the Comparative RNA Web
site (CRW, hosted at the University of Texas at Austin by the Robin Gutell Lab). It
essentially consists of a simplified version of the CT format. As illustrated in Fig. 5,
any BPSeq file starts with four self-explanatory lines identifying the data source and
content, and is followed by the structure, specified as a sequence of space-separated
triplets (x, y, z), where:
x Position;
y IUPAC code for base;
z Base-pairing partner position (0 if unpaired).
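A minimal Python reader for such triplets is sketched below. Instead of counting header lines, it keeps only lines made of three fields whose first and last are integers, which is a simplifying assumption. The header lines of the toy input are made up for the illustration and do not reproduce an actual CRW file.

```python
def parse_bpseq(text):
    """Return (sequence, pairs) from BPSEQ-formatted text."""
    bases, partner = [], {}
    for line in text.splitlines():
        fields = line.split()
        # Anything that is not an "x y z" triplet is treated as a header line.
        if len(fields) != 3 or not (fields[0].isdigit() and fields[2].isdigit()):
            continue
        pos, base, mate = int(fields[0]), fields[1], int(fields[2])
        bases.append(base)
        partner[pos] = mate
    # Each pair must be declared symmetrically at both positions.
    for pos, mate in partner.items():
        if mate and partner.get(mate) != pos:
            raise ValueError(f"inconsistent partner for position {pos}")
    pairs = sorted((i, j) for i, j in partner.items() if i < j)
    return "".join(bases), pairs


toy_bpseq = """Filename: toy.bpseq
Organism: hypothetical
Accession Number: none
Citation and related information: none
1 G 8
2 G 7
3 A 0
4 A 0
5 A 0
6 A 0
7 C 2
8 C 1
"""
print(parse_bpseq(toy_bpseq))
```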
2.2.4 CONNECT (CT) Format
This format was introduced by Mfold [37], the historical tool for the ab initio prediction
of RNA secondary structure, and is still used to date by several prediction tools. After
an initial header consisting of the sequence
Fig. 4 Pseudobase notation for the Gag/pro ribosomal frameshift site of Bovine Leukemia Virus. Source: pseu-
dobase, entry# PKB1
Fig. 5 Fragment of a BPSEQ formatted 5 s rRNA inferred using comparative modeling. Source: comparative
RNA web site
a Position;
b IUPAC code for base;
c Position of previous base in the backbone (5′–3′ order, 0 is used if
first position);
d Position of next base in the backbone (5′–3′ order, 0 is used if last
position);
e Base-pairing partner position (0 if unpaired);
Fig. 6 CONNECT format for the Mfold 3.7 [37] predicted structure of the human let-7 pre-miRNA
Fig. 7 Fragment of a three-dimensional model for the Oceanobacillus iheyensis Group II intron (PDBID:3IGI)
f Position (duplicated).
This format can be gently abused to allow for the description of pseudoknots, although
such motifs cannot be predicted by Mfold.
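The six columns (a to f) listed above can be produced from a sequence and a list of base-pairs with a few lines of Python. This is only a sketch: the header line written here, made of the sequence length followed by a free-text title, is an assumption about the header layout, and the function name is hypothetical.

```python
def to_ct(sequence, pairs, title="toy"):
    """Write a CT-style description: header, then one six-column line per base."""
    n = len(sequence)
    partner = [0] * (n + 1)            # index 0 unused, positions are 1-based
    for i, j in pairs:
        partner[i], partner[j] = j, i
    lines = [f"{n} {title}"]           # assumed header: length and title
    for pos, base in enumerate(sequence, start=1):
        previous = pos - 1             # 0 for the first position
        following = pos + 1 if pos < n else 0
        lines.append(f"{pos} {base} {previous} {following} {partner[pos]} {pos}")
    return "\n".join(lines)


print(to_ct("GGAAAACC", [(1, 8), (2, 7)]))
```

Since each line only records the partner of one position, crossing pairs can be written in exactly the same way, which is how the format can be stretched to describe pseudoknots.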
2.2.5 PDB File Format
The PDB format is a comprehensive text-formatted representation
of macromolecules used by the authoritative eponymous repository
of experimentally derived 3D models [3]. Originally introduced to
represent protein structure, it has been enriched over the years to
include fine details of the experimental protocol used for any
structure derivation. However, it does not support detailed information
regarding base-pairing positions, and is mostly used by RNA
software as a raw geometrical descriptor of a 3D model, as illustrated
in Fig. 7.
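To make the notion of a raw geometrical descriptor concrete, the following Python 3 sketch extracts one atom per residue (here the phosphorus atom) from the fixed-width ATOM records of a PDB file. The function name `backbone_trace` and the two-residue input with made-up coordinates are assumptions for illustration; the records are generated with a format string so that the columns fall where the parser expects them.

```python
def backbone_trace(pdb_text, atom_name="P"):
    """Collect (chain, residue number, residue name, x, y, z) for one
    atom type, read from fixed-width ATOM/HETATM records."""
    trace = []
    for line in pdb_text.splitlines():
        if not line.startswith(("ATOM", "HETATM")):
            continue
        if line[12:16].strip() != atom_name:   # atom name: columns 13-16
            continue
        trace.append((line[21],                # chain identifier: column 22
                      int(line[22:26]),        # residue number: columns 23-26
                      line[17:20].strip(),     # residue name: columns 18-20
                      float(line[30:38]),      # x: columns 31-38
                      float(line[38:46]),      # y: columns 39-46
                      float(line[46:54])))     # z: columns 47-54
    return trace


# Hypothetical records (made-up coordinates), laid out on PDB columns
RECORD = "ATOM  %5d %-4s %3s %s%4d    %8.3f%8.3f%8.3f"
toy_pdb = "\n".join([
    RECORD % (1, "P", "G", "A", 1, 1.0, 2.0, 3.0),
    RECORD % (2, "C4'", "G", "A", 1, 1.5, 2.5, 3.5),
    RECORD % (3, "P", "C", "A", 2, 4.0, 5.0, 6.0),
])
trace = backbone_trace(toy_pdb)
print(trace)
```

Base-pairs are not listed explicitly in such a file; they have to be inferred from the coordinates by an annotation tool.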
Fig. 8 RNAML formatted fragment of an all-atom 3D model for the Oceanobacillus iheyensis Group II intron
(PDBID: 3IGI, also shown in Fig. 7), produced by the RNAView software. The base-pair XML sections are
automatically annotated geometrically from the PDB file, and represent base-pairing positions. Such base-pairs
may be non-canonical, i.e. they may involve non-standard pairs of nucleotides, interacting edges, or orientation.
Here, positions 2 and 260 form a canonical base-pair (U–A, both Watson-Crick edges, cis orientation), while
positions 4 and 259 form a non-canonical base-pair (U–U, Sugar/Watson-Crick edges, cis orientation)
2.2.6 RNAML Format
RNAML [30] is an XML format, introduced to address the dual need
to unify data representations related to RNA, and to represent
novel important features of its structure (e.g. non-canonical
base-pairs and motifs). Although still challenged in the former goal by
less structured, domain-specific, formats (arguably because of the
intrinsic verbosity arising from its ambitious goals), RNAML has
established itself as the format of choice for an enriched symbolic
description of the tertiary structure, as illustrated in Fig. 8.
Accordingly, it is currently supported by most automated methods
capturing non-canonical interactions and motifs.
3 Methods
[Fig. 9 diagram. (a) Annotation tools (RNAView, MC-Annotate, FR3D) and (extended) secondary structure
file formats (Stockholm, Vienna/DBN, BPSeq, Connect, RNAML); (b) RNA-aware tools: interactive editors,
command-line tools, and web-based tools (VARNA, R-chie, PseudoViewer); (c) post-processing with vector
graphics editors (Inkscape, Adobe Illustrator)]
Fig. 9 Typical workflow for the production of publication-quality diagrams. The preferred execution (blue
arrows, dashed arrow indicates a functionality which is not widely available) will start with a scripted
production of a data-rich initial draft, followed by an interactive refinement within a specialized GUI, and
conclude with a limited post-processing session using a general-purpose vector graphics editor
3.2 Representations, Layouts, and Basic Usage
Many representations have been proposed to display the secondary
structure of RNA, each with its strengths and limitations. In this
section, we give an overview of the existing propositions and, for
each representation and tool, we provide minimal command lines
and/or Python scripts to obtain them.
3.2.1 Linear Layout
The linear layout represents base-pairs as arcs drawn over and/or
under a linearly drawn RNA sequence. It arguably constitutes one
of the easiest ways to represent the secondary structure, and can be
thought of as a genomic perspective over the secondary structure.
Among the strengths of this representation, one counts an easy
identification of helices as sets of nested arcs, and a convenient
(unbiased) representation of pseudoknots. This representation may
also be easily adapted into a side-by-side visual comparison of two
or several candidate structures for a given RNA, provided an align-
ment is available at the sequence level. In such a joint representa-
tion, shared base-pairs are easy to identify, as illustrated in Fig. 10.
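The identification of helices as sets of nested arcs can be sketched in a few lines of Python 3. The code below is a minimal illustration for pseudoknot-free dot-parenthesis input; the function names and the toy structure are assumptions for this example, and the arc height is computed under the assumption that arcs are drawn as semicircles.

```python
def parse_pairs(structure):
    """Base-pairs (i, j), 0-based, of a pseudoknot-free dot-parenthesis string."""
    stack, pairs = [], []
    for i, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(i)
        elif symbol == ")":
            pairs.append((stack.pop(), i))
    return sorted(pairs)


def helices(pairs):
    """Group base-pairs into helices, i.e. maximal runs of directly
    nested arcs (i, j), (i+1, j-1), ..."""
    groups = []
    for i, j in pairs:  # pairs are sorted by increasing 5' position
        if groups and groups[-1][-1] == (i - 1, j + 1):
            groups[-1].append((i, j))
        else:
            groups.append([(i, j)])
    return groups


def arc_height(i, j, spacing=1.0):
    """Height of a semicircular arc joining positions i and j."""
    return spacing * (j - i) / 2.0


toy = "((..(((...)))..))"
found = helices(parse_pairs(toy))
for helix in found:
    print(helix, [arc_height(i, j) for i, j in helix])
```

In this toy example the outer helix produces arcs of height 8 and 7 while the inner one stays below 4.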
However, the linear representation suffers from a poor density
of information. Indeed, the height of arcs associated with base-pairs
typically grows proportionally to the distance between its associated
Fig. 10 ON (above) and OFF (below) states of the pbuE adenine riboswitch [19], jointly represented using a linear
layout created using VARNA (see actual command in Note 1) [7], followed by minimal post-processing. Shared
structural elements (blue arcs) are easily identified in this representation, a property which is not satisfied by their
typical graph layout (gray boxes)
3.2.2 Circular Layout
The circular layout constitutes a modified version of the linear
layout, in which the sequence is drawn along a circle. In this
representation, base-pairs are drawn either as arcs or as chords of the
circle, as illustrated in Fig. 11.
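The geometry of this layout is elementary, as the following Python 3 sketch suggests. The starting point (top of the circle), the clockwise direction and the function names are arbitrary choices made for this illustration, and do not reflect the conventions of any particular tool.

```python
import math


def circular_coords(n, radius=1.0):
    """Place n bases evenly on a circle, starting at the top and
    proceeding clockwise (arbitrary conventions)."""
    coords = []
    for k in range(n):
        angle = math.pi / 2 - 2 * math.pi * k / n
        coords.append((radius * math.cos(angle), radius * math.sin(angle)))
    return coords


def chord_length(i, j, n, radius=1.0):
    """Length of the chord representing a base-pair between positions i and j."""
    return 2 * radius * abs(math.sin(math.pi * (j - i) / n))


coords = circular_coords(8)
(x1, y1), (x2, y2) = coords[0], coords[4]
print(math.hypot(x2 - x1, y2 - y1), chord_length(0, 4, 8))
```

Whatever the distance between two paired positions, the chord never exceeds the diameter of the circle.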
This representation is usually more compact than its linear
counterpart (albeit only by a constant factor), and puts more
Fig. 11 Left: RFAM consensus sequence/structure for the Hepatitis delta virus ribozyme family (RFAM id:
RF00094), drawn using VARNA [7]; Right: comparative model for the 16S ribosomal RNA in Thermus
thermophilus (source: Comparative RNA Web site [6]), drawn using jViz.RNA [32]
Fig. 12 Various layout strategies for the display of (extended/consensus) RNA secondary structures as (outer)
planar graphs, also known as squiggle plots. Source: three-dimensional model of the TPP riboswitch (PDB
id:2HOM), and associated RFAM family (RFAM: RF00059)
3.2.3 Squiggle Plots: Planar Graph Representations
This representation attempts to draw the secondary structure as a
schematic of its three-dimensional structure. Helices are almost
universally represented as straight, ladder-like segments. Beyond
this norm, layout strategies largely differ, especially with respect to
the layout of multiple loops (3-way junctions and more), as
summarized in Fig. 12. This representation is usually preferred to
illustrate functional scenarios.
RNAView (Fig. 12a) draws the secondary structure of an RNA
(PDB model) as a direct 2D projection of the three-dimensional
model, chosen to maximize the spread of the diagram. This strat-
egy results in drawings which, despite being usually self-overlap-
ping, are often indicative of the relative orientation of helices for
small RNAs, as illustrated in Fig. 12a. The software is also men-
tioned in Subheading 2.2 (Fig. 8) for its capacity to annotate the
base-pairs of a 3D model. After downloading/installing the soft-
ware from http://ndbserver.rutgers.edu, a PostScript projec-
tion can be produced from an input RNA (PDB format file XXXX.
pdb), by simply running the command:
# Running VARNA on the ext. sec. str. (RNAML -> Vector graphics)
java -cp VARNAvX-Y.jar fr.orsay.lri.varna.applications.VARNAcmd \
  -i XXXX.pdb.xml -o YYYY.svg
3.2.4 Tree Layout
In the absence of crossing interactions such as pseudoknots, RNA
secondary structures can be unambiguously decomposed in a vari-
ety of ways, leading to tree-like objects. Such a decomposition of
the conformation space is at the core of any dynamic programming
scheme, a strategy for solving combinatorial optimization prob-
lems which is especially popular in RNA bioinformatics. For
instance, atomic contributions in the Turner energy model [22]
are entirely determined by the joint knowledge of internal nodes (a
base-pair) in combination with their immediate children.
Consequently, this representation may be useful to illustrate the
principles underlying RNA algorithms.
Unfortunately, there are currently few available options to
produce such a representation, and the most realistic option consists
in using the general-purpose graph visualization software
GraphViz [11], which requires a DOT-formatted file as input.
Consequently, a secondary structure, typically denoted
by a dot-parenthesis expression (see Subheading 2.2.2), will require
some conversion. We provide in Note 3 a minimal Python (2.x)
script to convert a secondary structure into a DOT-formatted file,
dumped to the standard output.
Invoking this code for a given sequence/structure, while
capturing the output through standard I/O redirection, one obtains a file
which can then be fed to the dot utility of GraphViz (see Fig. 13).
This utility can then be configured to produce quality pictures in
virtually any format, including many vector formats such as PDF and SVG.
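The tree decomposition itself, independently of any drawing software, can be sketched as follows in Python 3. This is only an illustration of the principle (base-pairs as internal nodes, unpaired bases as leaves); the dictionary-based node representation, the function names and the toy input are assumptions made for this example.

```python
def structure_to_tree(seq, secstr):
    """Build a tree from a pseudoknot-free dot-parenthesis structure.
    Each node is a dict with a label and an ordered list of children."""
    root = {"label": "Root", "children": []}
    stack = [root]
    for base, symbol in zip(seq, secstr):
        if symbol == "(":
            node = {"label": base, "children": []}
            stack[-1]["children"].append(node)
            stack.append(node)          # subsequent positions are nested inside
        elif symbol == ")":
            node = stack.pop()
            node["label"] += base       # e.g. "G" + "C" gives the pair label "GC"
        else:
            stack[-1]["children"].append({"label": base, "children": []})
    return root


def render(node, depth=0):
    """Indented text rendering of the tree, one node per line."""
    lines = ["  " * depth + node["label"]]
    for child in node["children"]:
        lines.extend(render(child, depth + 1))
    return lines


tree = structure_to_tree("GCAAAGCU", "((...)).")
print("\n".join(render(tree)))
```

Each internal node, taken together with its immediate children, describes one loop of the structure, which is the unit on which the energy contributions mentioned above are defined.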
3.3 Mapping Probing Data onto Secondary Structure Models
Footprinting techniques are widely applied to map the 2D and 3D
structures of RNA using chemical and/or enzymatic probes.
They provide useful information about the relative accessibility of
the RNA to a given probe at the nucleotide resolution. Depending
on the physico-chemical conditions or the presence of different
molecular partners (RNA, proteins, antibiotics, etc.), the chemical
or enzymatic probing data give a snapshot of the RNA structure
Fig. 13 Planar graph layout using VARNA [7] and tree representation using the default options of the dot [11]
algorithm of GraphViz [10] of the secondary structure of a transfer RNA (PDB id: 1TN2:A, annotated using
RNAView [34])
Fig. 14 2D structure representations integrating chemical probing of the Pab21 HACA sRNP from P. abyssi
(RFAM: RF00065), as generated by VARNA. (a) 2D structure representation of the HACA RNA motif highlighting
the key structural features. (b) Free RNA Chemical probing indicating the accessible positions for Pb cleavage
on a scale from 1 to 3 in the intensity of the reaction. (c) Alternate representation showing the accessibility
using a color map. (d–f) Modifications of the RNA probing data due to the successive binding of L7Ae, CBF5
and NOP10, and the RNA substrate. Green pins: initial probing on RNA; red pins: increase in reactivity; blue
pins: decrease in reactivity; violet regions: newly protected positions
# Marking nucleotide 10
RNAplot --post "10 cmark" < XXX.txt
# Drawing backbone for region (10,15) using red color (r=1, g=b=0) and line thickness 2
RNAplot --pre "10 15 2 1. 0. 0. omark" < XXX.txt
# Defining a custom PS macro and using it to fill base number 10 in blue (r=g=0, b=1)
RNAplot --pre "/cfmark {setrgbcolor newpath 1 sub coor exch get aload pop fsize 2 div 0 360 arc fill} bind def 10 0. 0. 1. cfmark" < XXX.txt
import os, sys

# Linear interpolation between col1 (at minval) and col2 (at maxval)
def getColor(val, minval, maxval, col1=(1., 1., 1.), col2=(0.0, 0.8, 0.0)):
    (r1, g1, b1), (r2, g2, b2) = col1, col2
    span = float(maxval) - float(minval)
    k = (float(val) - float(minval)) / (span if span != 0 else 1.0)
    l = 1. - k
    return (r1*l + r2*k, g1*l + g2*k, b1*l + b2*k)

# Builds one "position r g b cfmark" PostScript instruction per value
def formatValuesAsPS(values):
    minval, maxval = min(values), max(values)
    valtab = ["%s %s cfmark" % (i+1, "%.2f %.2f %.2f" % getColor(v, minval, maxval))
              for i, v in enumerate(values)]
    return " ".join(valtab)

# Invokes RNAplot to display a color map
# Arg1: path to DBN file
# Arg2: list of comma-separated values (eg "1,2,3")
if __name__ == "__main__":
    inputFile = sys.argv[1]
    # Convert to floats so that min/max compare numbers rather than strings
    values = [float(v) for v in sys.argv[2].split(",")]
    extraMacro = "/cfmark {setrgbcolor newpath 1 sub coor exch get aload pop fsize 2 div 0 360 arc fill} bind def"
    os.system("RNAplot --pre \"%s %s\" < %s" % (extraMacro, formatValuesAsPS(values), inputFile))
3.4 Displaying Pseudoknots and Interactions
Pseudoknots and RNA–RNA interactions depart from the tree-like
paradigm, and therefore constitute a challenge to computational
methods. Most prediction algorithms address the problem by
restricting the search space. However, visualization tools cannot
arbitrarily enforce such restrictions. This narrows down the choice
of suitable layouts and tools to a few options.
3.4.1 Pseudoknots
Basic representations, such as the linear and circular representations,
do not require a complicated layout, and therefore remain largely
unaffected by the presence of pseudoknots. From a software designer's
perspective, supporting pseudoknots for these representations is then
mostly a matter of being able to parse suitable input formats.
jViz.RNA (Fig. 15b), R-chie and VARNA (Fig. 15d) may be used
without much additional effort. In particular, the latter also supports
more complicated structures, such as those featuring multiple
partners for a given nucleotide, as illustrated by Fig. 15d.
However, a large subset of existing pseudoknots can be
represented planarly as squiggle plots, i.e. in such a way that base-pairs and
backbone are non-crossing. To create such representations,
PseudoViewer is the only fully satisfactory option. Pseudoknots
are laid out within dedicated boxes that are embedded in a tree-like
general layout. As shown in Fig. 15a, the result is aesthetically
pleasing even in the absence of user intervention, its only drawback being
a substantial backbone distortion. Alternatively, the spring layout of
jViz.RNA can be used to obtain decent squiggle plots, as illustrated
in Fig. 15c, usually after some manual disentangling by the user.
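Whether a given set of base-pairs departs from the tree-like paradigm can be tested directly: two base-pairs (i, j) and (k, l) cross whenever i < k < j < l. The Python 3 sketch below applies this definition naively to every couple of base-pairs; the function names and the toy H-type pseudoknot are assumptions made for this illustration.

```python
def crossing_pairs(pairs):
    """Return all couples of base-pairs ((i, j), (k, l)) with i < k < j < l."""
    ordered = sorted((min(p), max(p)) for p in pairs)
    found = []
    for a, (i, j) in enumerate(ordered):
        for k, l in ordered[a + 1:]:
            if i < k < j < l:
                found.append(((i, j), (k, l)))
    return found


def is_pseudoknotted(pairs):
    """True if at least two base-pairs cross."""
    return bool(crossing_pairs(pairs))


nested = [(0, 9), (1, 8), (2, 7)]
h_type = [(0, 6), (1, 5), (3, 10), (4, 9)]   # toy H-type pseudoknot
print(is_pseudoknotted(nested), is_pseudoknotted(h_type))
print(crossing_pairs(h_type))
```

The quadratic enumeration is sufficient for structures of typical size, and the list of crossing couples indicates which helices a layout algorithm would need to treat specially.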
>organism1
GCGGGGUGCGC&AGGACCCACUCCU
>organism2
CGGCCCGGCCG&AGCGGGCCGCGCU
>structure
(((AABBB)))&((((aabbb))))
Fig. 15 Pseudoknotted secondary structure of the Hepatitis Delta virus ribozyme (PseudoBase entry PKB75),
rendered as a planar squiggle plot by the PseudoViewer (a), as a circular diagram and as a force-driven spring
layout by jViz.RNA (b and c). Extended secondary structure of a variant (PDB id: 1SJ3), drawn linearly by
VARNA (d)
Fig. 16 Available representations for RNA–RNA interactions: consensus secondary structures of the H/ACA
guide and its target sequence from the 16S rRNA (RFAM: RF00065) automatically generated by RILogo
(Panel a), or as a manually edited pseudoknotted structure using R2R (Panel b) and VARNA (Panel c)
3.5 Visualizing Structure-Informed RNA Alignments
The identification of novel families of structured ncRNAs relies
heavily on the implementation of a comparative approach. This approach
starts with a multiple sequence alignment, from which a draft
consensus secondary structure is identified. This draft is then used to refine
the alignment, and retrieve novel homologs, upon which the
secondary structure can be further refined. Iterating these steps leads to a
final structure-informed multiple sequence alignment, whose quality
must be assessed before concluding on the existence of a new
functional family. For that purpose, a global visualization of the
conservation/covariation levels in the context of the structure is essential.
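To fix ideas on what conservation and covariation levels may mean in practice, the following Python 3 sketch computes two deliberately simple statistics on a toy alignment: the frequency of the most common symbol in a column, and, for a pair of columns, the fraction of canonical pairs together with the number of distinct canonical pair types observed. These measures, the function names and the three-sequence alignment are assumptions made for illustration, and are not the scoring schemes implemented by R2R or R-chie.

```python
from collections import Counter

CANONICAL = {"AU", "UA", "GC", "CG", "GU", "UG"}


def column_conservation(alignment, col):
    """Fraction of sequences carrying the most frequent non-gap symbol."""
    symbols = [s[col].upper() for s in alignment if s[col] not in ".-"]
    if not symbols:
        return 0.0
    return Counter(symbols).most_common(1)[0][1] / len(alignment)


def pair_summary(alignment, i, j):
    """(fraction of canonical pairs, number of distinct canonical pair
    types) observed at columns i and j. Two or more distinct types at a
    well-paired position hint at covariation."""
    observed = [(s[i] + s[j]).upper() for s in alignment]
    canonical = [p for p in observed if p in CANONICAL]
    return len(canonical) / len(alignment), len(set(canonical))


toy_alignment = ["GCAAAGC",
                 "GGAAACC",
                 "AGAAACU"]
print(column_conservation(toy_alignment, 2))   # loop column
print(pair_summary(toy_alignment, 0, 6))       # outermost base-pair
```

In this toy alignment the outermost pair is always canonical but realized both as G–C and A–U, the kind of compensatory change that supports a predicted helix.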
The R2R software is particularly suited to generate
representations of structure-informed RNA alignments. Besides allowing one to
decorate a squiggle plot using conservation and covariation levels,
Fig. 17 Consensus sequence and secondary structures of the H/ACA guide RNA from the snoR9 family (RFAM:
RF00065) generated by R2R (see Note 5 for a sequence of commands) from the RFAM seed alignment. Four
individual secondary structures from the seed alignment are annotated to display the common structural
features (internal loop, K-loop, ANA box) and the subtle differences in sequence and loop size. The modular
sub-structures correspond to two subfamilies: ILOOP75 and ILOOP85 where the 5′ guide sequence of the
internal loop contains either 7 or 8 nucleotides, respectively
Fig. 18 Structure-informed representation of the RFAM multiple-sequence alignment of the snoR9 family
(RFAM: RF00065), produced using R-Chie
Rscript rchie.R --msafile XXXX.txt --colour1 "#4DAF4A" \
  --msacol "#00A651,#0072BC,#00B9F2,#F15A22,#231F20,#AAAAAA,#DA6FAB" --pdf \
  --output="out.pdf" --format1 "vienna" YYYY.txt
4 Notes
# Top
java -cp VARNAvx-y.jar fr.orsay.lri.varna.applications.VARNAcmd \
  -sequenceDBN "GGGCCCGGCUCCCGCCCUCUCCGGGGAAUCGUGAACCGGGGGUUCCGGCCGGGCCUACA" \
  -structureDBN "(((((((((.......(.(((((((((.....)))))))))).....)))))))))..." \
  -highlightRegion "10-16:radius=10,fill=#acf269,outline=#acf269;43-47:radius=10,fill=#acf269,outline=#acf269" \
  -o RF00065_testarc1.svg -algorithm line -drawBases False -spaceBetweenBases 0.6 -bpStyle line
# Bottom
java -cp VARNAvx-y.jar fr.orsay.lri.varna.applications.VARNAcmd \
  -sequenceDBN "GGGCCCGGCUCCCGCCCUCUCCGGGGAAUCGUGAACCGGGGGUUCCGGCCGGGCCUACA" \
  -structureDBN "(((((((((.......(.(((((((((.....)))))))))).....)))))))))..." \
  -chemProb "10-11:glyph=pin,dir=out,intensity=3,color=#f90000;17-18:glyph=pin,dir=out,intensity=1,color=#f90000;" \
  -highlightRegion "26-27:radius=10,fill=#d2ccf4,outline=#d2ccf4;31-37:radius=10,fill=#d2ccf4,outline=#d2ccf4" \
  -o RF00065_testarc2.svg -algorithm line -drawBases False -spaceBetweenBases 0.6 -bpStyle line
import sys

# Get custom styles by changing these lines (cf GraphViz manual)
STYLE_DEFAULT = "shape=\"rectangle\", style=filled, margin=\"0,0\", fontsize=20, color=grey40, fontcolor=grey20, fillcolor=grey90, fontname=\"Helvetica\""
STYLE_UNPAIRED = "shape=\"circle\", color=blue, fillcolor=aliceblue"
STYLE_PAIRED = "shape=\"hexagon\""
STYLE_EDGES = "color=grey50"

# Converts an RNA sequence/secondary structure (dot-parenthesis notation) into a DOT-formatted
# GraphViz input, written into a previously opened file f.
# f.write is used instead of the print statement so that the code runs under both Python 2 and 3.
def drawAsDOT(seq, secstr, f):
    f.write("digraph rna {\n node [%s];\n edge [%s];\n -1 [label=\"Root\"];\n" % (STYLE_DEFAULT, STYLE_EDGES))
    stack = [-1]
    for i in range(len(secstr)):
        k = stack[-1]
        if secstr[i] == "(":
            # Opening position: new internal node, child of the enclosing base-pair
            stack.append(i)
            f.write("%s -> %s;\n" % (k, i))
        elif secstr[i] == ")":
            # Closing position: label the internal node with both bases of the pair
            stack.pop()
            f.write("%s [label=\"%s\",%s];\n" % (k, seq[k] + seq[i], STYLE_PAIRED))
        else:
            # Unpaired position: leaf attached to the enclosing base-pair
            f.write("%s -> %s;\n" % (k, i))
            f.write("%s [label=\"%s\",%s];\n" % (i, seq[i], STYLE_UNPAIRED))
    f.write("}\n")

# Assumed invocation: python script.py SEQUENCE STRUCTURE > out.dot
if __name__ == "__main__":
    drawAsDOT(sys.argv[1], sys.argv[2], sys.stdout)
# panel B
java -cp VARNAvx-y.jar fr.orsay.lri.varna.applications.VARNAcmd \
  -sequenceDBN "GGGCCCGGCUCCCGCCCUCUCCGGGGAAUCGUGAACCGGGGGUUCCGGCCGGGCCUACA" \
  -structureDBN "(((((((((.......(.(((((((((.....)))))))))).....)))))))))..." \
  -chemProb "9-10:glyph=pin,dir=out,intensity=1,color=#3e844b;10-11:glyph=pin,dir=out,intensity=1,color=#3e844b;" \
  -o RF00065_panelB.svg -algorithm radiate -flat true -drawBases False -spaceBetweenBases 0.6
# panel C
java -cp VARNAvx-y.jar fr.orsay.lri.varna.applications.VARNAcmd \
  -sequenceDBN "GGGCCCGGCUCCCGCCCUCUCCGGGGAAUCGUGAACCGGGGGUUCCGGCCGGGCCUACA" \
  -structureDBN "(((((((((.......(.(((((((((.....)))))))))).....)))))))))..." \
  -colorMap "0;0;0;0;0;0;0;0;0;1;1;2;3;2;1;1;1;1;3;3;1;1;0;0;0;2;2;2;3;3;1;3;2;1;1;1;1;0;0;0;0;0;1;3;3;3;1;0;0;0;0;0;0;0;0;0;1;1;3" \
  -colorMapMax 3 -colorMapMin 0 -colorMapStyle green \
  -o RF00065_panelC.svg -algorithm radiate -flat true -drawBases True -spaceBetweenBases 0.6
# panels D, E, F
java -cp VARNAvx-y.jar fr.orsay.lri.varna.applications.VARNAcmd \
  -sequenceDBN "GGGCCCGGCUCCCGCCCUCUCCGGGGAAUCGUGAACCGGGGGUUCCGGCCGGGCCUACA" \
  -structureDBN "(((((((((.......(.(((((((((.....)))))))))).....)))))))))..." \
  -chemProb "9-10:glyph=pin,dir=out,intensity=1,color=#3e844b;10-11:glyph=pin,dir=out,intensity=1,color=#3e844b;" \
  -highlightRegion "26-27:radius=10,fill=#d2ccf4,outline=#d2ccf4;31-37:radius=10,fill=#d2ccf4,outline=#d2ccf4" \
  -o RF00065_panelD.svg -algorithm radiate -flat true -drawBases False -spaceBetweenBases 0.6
$ r2r --GSC-weighted-consensus RF00065_seed.sto RF00065_seed.cons.sto $GSCparams
# STOCKHOLM 1.0
#=GF R2R ignore_ss_except_for_pairs _1 outline_only_bp
#=GF R2R place_explicit 2 2--- 45 1 0 0 0 -90
#=GF R2R place_explicit n n--- 45 1 0 0 0 -90
#=GF R2R place_explicit m m--- 45 1 0 0 0 -90
#=GF R2R delcols 0
#=GF R2R tick_label l guide
#=GF R2R tick_label K K-Loop
#=GF R2R tick_label a ANA box
#=GF R2R tick_label g target
#=GF R2R keep q r
#=GF R2R var_backbone_range 4 5
#=GF R2R outline_nuc L
#=GF R2R outline_nuc l
#=GF R2R outline_nuc G
#=GF R2R outline_nuc n
#=GF R2R outline_nuc j
#=GF R2R circle_nuc 2 rgb:0,0,0
#=GF R2R circle_nuc C rgb:0,0,0
#=GF R2R nuc_color 2 rgb:0,0,0
#=GF R2R nuc_color C rgb:0,0,0
#=GF R2R var_backbone_range_size_fake_nucs 3 e f
#=GF R2R shade_along_backbone KLOOP:K rgb:0,129,255
#=GF R2R shade_along_backbone ANA:C rgb:200,200,200
#=GF R2R outline_along_backbone ILOOP:Y rgb:255,228,196
#=GF R2R outline_along_backbone ILOOP:P rgb:193,255,193
#=GF R2R outline_along_backbone ILOOP:Q rgb:193,255,193
#=GF R2R outline_along_backbone ILOOP:X rgb:193,255,193
#=GF R2R shade_along_backbone ILOOP:Y rgb:255,228,196
#=GF R2R shade_along_backbone ILOOP:P rgb:193,255,193
#=GF R2R shade_along_backbone ILOOP:Q rgb:193,255,193
#=GF R2R shade_along_backbone ILOOP:X rgb:193,255,193
//
# STOCKHOLM 1.0
#=GF ID snoR9
#=GF AC RF00065
#=GF DE Small nucleolar RNA snoR9
#=GF AU Bateman A, Daub J
#=GF GA 50.0
#=GF NC 49.8
#=GF TC 68.7
#=GF SE Bateman A
#=GF SS Published; PMID:12032319
#=GF TP Gene; snRNA; snoRNA; CD-box;
#=GF BM cmbuild -F CM SEED; cmcalibrate --mpi -s 1 CM
#=GF BM cmsearch -Z 274931 -E 1000000 --toponly CM SEQDB
#=GF DR SO:0000593 SO:C_D_box_snoRNA
#=GF DR GO:0006396 GO:RNA processing
#=GF DR GO:0005730 GO:nucleolus
#=GF RN [1]
#=GF RM 12032319
#=GF RT Noncoding RNA genes identified in AT-rich hyperthermophiles.
#=GF RA Klein RJ, Misulovin Z, Eddy SR;
#=GF RL Proc Natl Acad Sci U S A 2002;99:7542-7547.
#=GF CC snoRNA R9 is a member of the C/D class of snoRNA which contain
#=GF CC the C (UGAUGA) and D (CUGA) box motifs. R9 was identified in a
#=GF CC computational screen in AT-rich hyperthermophiles [1]. R9 was
#=GF CC found to overlap with the smaller snoRNA R19 which is currently a
#=GF CC member of Pyrococcus C/D box snoRNA family Rfam:RF00095.
#=GF WK http://en.wikipedia.org/wiki/Small_nucleolar_RNA_snoR9
#=GF SQ 5
Pyrococcus_furiosus GGGCCCGGUU.CCCGCCCUCUCCGGGGAAUCGUGAACCGGGGGUUCCGACCGGGCCCACA..AUGGGAUGAUGACCUUUUGCUUUACUGAACACAUGAUGACCACGCCCUUCGCUGAC.CUAAAUAUUUGAC
Pyrococcus_abyssi_GE GGGCCCGGCU.CCCGCCCUCUCCGGGGAAUCGUGAACCGGGGGUUCCGGCCGGGCCUACA..G..UUAUGAUGAACUUUUGCUUUGCUGAUGUGGUGAUGAGCACGCCCUUCGCUGAUACUCUCUCGUCCAU
Pyrococcus_horikoshi CGGCCCGGUU.CCCGCCCUCUCCGGGGAAUCGUGAACCGGGGGUUCCGACCGGGCCGACA..GG.GGAUGAAGAGCUUUUGCUUUGCUGAGCAGAUGAUGACCACGCCCUUCGCUGAC.CU.GCUAUUUGAC
P. furiosus GGGCCCGGUU.CCCGCCCUCUCCGGGGAAUCGUGAACCGGGGGUUCCGACCGGGCCCACA..AUGGGAUGAUGACCUUUUGCUUUACUGAACACAUGAUGACCACGCCCUUCGCUGAC.CUAAAUAUUUGAC
Thermococcus_kodakar GGGCCUGGCGUCCCGCCCUCCCCGGGGAAACGUGAACCGGGGCUUCCUGCCAGGCCUACACCGGGGGAUGAAGAGCUUUUGCUUUGCUGAC..UGUGAUGAGCACGCCCUUCACUGACCCCGUAUCAGCUCU
#=GC SS_cons <<<<<<<<<........<.<<<<<<<<<.....>>>>>>>>>>.....>>>>>>>>>...........................................................................
#=GC RF gGGCCCGGcu.CCCgCCCUCUCCGGGGAAUCGUGAACCGGGGGuUCCggCCGGGCCcACA..auguuAUGAUGAaCUUUUGCUUUaCUGAagagaUGAUGAgCACGCCCUUCGCUGAu.CUaaaUauUugAu
#=GR Pyrococcus_furiosus DEL_COLS ..............................................................----------------------------------------------------------------------
#=GR Pyrococcus_abyssi_GE DEL_COLS ..............................................................----------------------------------------------------------------------
#=GR Pyrococcus_horikoshi DEL_COLS ..............................................................----------------------------------------------------------------------
#=GR P. furiosus DEL_COLS ..............................................................----------------------------------------------------------------------
#=GR Thermococcus_kodakar DEL_COLS ............................................................------------------------------------------------------------------------
#=GC R2R_LABEL p........1......2...........3..4...QQQQqQQQ5...6RRRRrRRRs...7......................................................................8
#=GC R2R_XLABEL_KLOOP ..........................KKKKKKKKK.................................................................................................
#=GC R2R_XLABEL_KLOOPL ................................k...................................................................................................
#=GC R2R_XLABEL_ILOOP .........IIIIIIII..........................JJJJJ....................................................................................
#=GC R2R_XLABEL_ILOOPL .............i...............................j......................................................................................
#=GC R2R_XLABEL_ANA .........................................................CCC........................................................................
#=GC R2R_XLABEL_ANAL ...........................................................c........................................................................
#=GC SUBFAM_LABEL_TERM ..........x.........................................................................................................................
#=GC SUBFAM_ILOOP75_R2R_LABEL -------<..........B>---------------------<.KLMNO.>----------------------------------------------------------------------------------
#=GC SUBFAM_ILOOP85_R2R_LABEL -------<..........B>---------------------<.KLMNO.>----------------------------------------------------------------------------------
#=GF R2R keep p s
#=GF R2R var_backbone_range 1 2
#=GF R2R var_backbone_range 3 4
#=GF R2R var_backbone_range 5 6
#=GF R2R var_backbone_range 7 8
#=GF R2R shade_along_backbone ILOOP:I rgb:0,255,0
#=GF R2R shade_along_backbone ILOOP:J rgb:0,255,0
#=GF R2R shade_along_backbone KLOOP:K rgb:0,129,255
#=GF R2R shade_along_backbone ANA:C rgb:200,200,200
#=GF R2R tick_label ANAL:c ANA box
#=GF R2R tick_label KLOOPL:k K-Loop
#=GF SUBFAM_REGEX_PRED ILOOP75 TERM [.]
#=GF SUBFAM_REGEX_PRED ILOOP85 TERM [A-Z]
#=GF SUBFAM_ILOOP75_R2R no5
#=GF SUBFAM_ILOOP75_R2R shade_along_backbone ILOOP:I rgb:0,255,0
#=GF SUBFAM_ILOOP75_R2R tick_label ILOOPL:i 7 nt
#=GF SUBFAM_ILOOP75_R2R outline_nuc ILOOP:I
#=GF SUBFAM_ILOOP75_R2R var_backbone_range_size_fake_nucs 1 M M 5 nt
#=GF SUBFAM_ILOOP75_R2R var_backbone_range_size_fake_nucs 1 B B
#=GF SUBFAM_ILOOP75_R2R var_backbone_range_size_fake_nucs 1 K K
#=GF SUBFAM_ILOOP75_R2R var_backbone_range_size_fake_nucs 1 L L
#=GF SUBFAM_ILOOP75_R2R var_backbone_range_size_fake_nucs 1 N N
#=GF SUBFAM_ILOOP75_R2R var_backbone_range_size_fake_nucs 1 O O
#=GF SUBFAM_ILOOP85_R2R no5
#=GF SUBFAM_ILOOP85_R2R shade_along_backbone ILOOP:I rgb:0,255,0
#=GF SUBFAM_ILOOP85_R2R tick_label ILOOPL:i 8 nt
#=GF SUBFAM_ILOOP85_R2R outline_nuc ILOOP:I
#=GF SUBFAM_ILOOP85_R2R var_backbone_range_size_fake_nucs 1 M M 5 nt
#=GF SUBFAM_ILOOP85_R2R var_backbone_range_size_fake_nucs 1 B B
#=GF SUBFAM_ILOOP85_R2R var_backbone_range_size_fake_nucs 1 K K
#=GF SUBFAM_ILOOP85_R2R var_backbone_range_size_fake_nucs 1 L L
#=GF SUBFAM_ILOOP85_R2R var_backbone_range_size_fake_nucs 1 N N
#=GF SUBFAM_ILOOP85_R2R var_backbone_range_size_fake_nucs 1 O O
RF00065_guide_target.cons.sto
RF00065_guide_target.sto oneseq Pa21-S892
RF00065_guide_target-hybrid.cons.sto
Acknowledgements
References
1. Andronescu M, Bereg V, Hoos H, Condon A (2008) RNA STRAND: the RNA secondary structure and statistical analysis database. BMC Bioinform 9(1):340. doi:10.1186/1471-2105-9-340
2. Bateman A, Agrawal S, Birney E, Bruford EA, Bujnicki JM, Cochrane G, Cole JR, Dinger ME, Enright AJ, Gardner PP, Gautheret D, Griffiths-Jones S, Harrow J, Herrero J, Holmes IH, Huang H-D, Kelly KA, Kersey P, Kozomara A, Lowe TM, Marz M, Moxon S, Pruitt KD, Samuelsson T, Stadler PF, Vilella AJ, Vogel J-H, Williams KP, Wright MW, Zwieb C (2011) RNAcentral: a vision for an international database of RNA sequences. RNA 17(11):1941–1946. doi:10.1261/rna.2750811
3. Berman HM, Olson WK, Beveridge DL, Westbrook J, Gelbin A, Demeny T, Hsieh SH, Srinivasan AR, Schneider B (1992) The nucleic acid database. A comprehensive relational database of three-dimensional structures of nucleic acids. Biophys J 63(3):751–759. doi:10.1016/S0006-3495(92)81649-1
4. Bruccoleri RE, Heinrich G (1988) An improved algorithm for nucleic acid secondary structure display. Comput Appl Biosci 4(1):167–173. doi:10.1093/bioinformatics/4.1.167
5. Byun Y, Han K (2009) PseudoViewer3: generating planar drawings of large-scale RNA structures with pseudoknots. Bioinformatics 25(11):1435–1437. doi:10.1093/bioinformatics/btp252
6. Cannone JJ, Subramanian S, Schnare MN, Collett JR, D’Souza LM, Du Y, Feng B, Lin N, Madabusi LV, Muller KM, Pande N, Shang Z, Yu N, Gutell RR (2002) The comparative RNA web (CRW) site: an online database of comparative sequence and structure information for ribosomal, intron, and other RNAs. BMC Bioinform 3(2). doi:10.1186/1471-2105-3-15
7. Darty K, Denise A, Ponty Y (2009) VARNA: interactive drawing and editing of the RNA secondary structure. Bioinformatics 25(15):1974–1975. doi:10.1093/bioinformatics/btp250
8. Fourmann J-B, Tillault A-S, Blaud M, Leclerc F, Branlant C, Charpentier B (2013) Comparative study of RNA conformational dynamics during assembly of two box H/ACA ribonucleoprotein pseudouridine-synthases revealing uncorrelated efficiencies in assembly and activity. PLoS ONE 8(7):e70313. doi:10.1371/journal.pone.0070313
9. Fourmy D, Yoshizawa S (2012) Protein-RNA footprinting: an evolving tool. Wiley Interdiscip Rev RNA 3(4):557–566. doi:10.1002/wrna.1119
10. Gansner ER, North SC (2000) An open graph visualization system and its applications to software engineering. Softw – Pract Exp 30(11):1203–1233. doi:10.1002/1097-024X(200009)30:11<1203::AID-SPE338>3.3.CO;2-E
11. Gansner ER, Koutsofios E, North SC, Vo KP (1993) A technique for drawing directed graphs. IEEE Trans Softw Eng 19(3):214–230. doi:10.1109/32.221135
12. Griffiths-Jones S, Bateman A, Marshall M, Khanna A, Eddy SR (2003) Rfam: an RNA family database. Nucl Acids Res 31(1):439–441. doi:10.1093/nar/gkg006
13. Gruber AR, Lorenz R, Bernhart SH, Neuböck R, Hofacker IL (2008) The Vienna RNA websuite. Nucl Acids Res 36(Web Server issue):W70–W74. doi:10.1093/nar/gkn188
14. Hofacker IL, Fontana W, Stadler PF, Bonhoeffer LS, Tacker M, Schuster P (1994) Fast folding and comparison of RNA secondary structures. Monatshefte für Chemie/Chem Mon 125(2):167–188. doi:10.1007/BF00818163
15. Jossinet F, Westhof E (2005) Sequence to structure (S2S): display, manipulate and interconnect RNA data from sequence to structure. Bioinformatics 21(15):3320–3321. doi:10.1093/bioinformatics/bti504
16. Jossinet F, Ludwig TE, Westhof E (2010) Assemble: an interactive graphical tool to analyze and build RNA architectures at the 2D and 3D levels. Bioinformatics 26(16):2057–2059. doi:10.1093/bioinformatics/btq321
100 Yann Ponty and Fabrice Leclerc
17. Kaiser A, Krüger J, Evers DJ (2007) RNA mov- ary structure. Bioinformatics 19(2):299–300.
ies 2: sequential animation of RNA secondary doi:10.1093/bioinformatics/19.2. 299
structures. Nucl Acids Res 35(Web Server 28. Schlatterer JC, Brenowitz M (2009)
issue):W330–W334. doi:10.1093/nar/gkm309 Complementing global measures of RNA fold-
18. Lai D, Proctor JR, Zhu JYA, Meyer IM (2012) ing with local reports of backbone solvent
R-CHIE: a web server and R package for visu- accessibility by time resolved hydroxyl radical
alizing RNA secondary structures. Nucl Acids footprinting. Methods 49(2):142–147.
Res 40(12):e95. doi:10.1093/nar/gks241 doi:10.1016/j. ymeth.2009.04.019
19. Lemay J-F, Lafontaine DA (2007) Core 29. Shabash B, Wiese KC, Glen E (2012)
requirements of the adenine riboswitch Improving the portability and performance
aptamer for ligand binding. RNA 13(3):339– of jViz.RNA – a dynamic RNA visualization
350. doi:10. 1261/rna.142007 software. In: PRIB, pp 82–93. doi:10.1007/
20. Leonard CW, Hajdin CE, Karabiber F, 978-3-642-34123-6\_8
Mathews DH, Favorov OV, Dokholyan NV, 30. Waugh A, Gendron P, Altman R, Brown JW,
Weeks KM (2013) Principles for understand- Case D, Gautheret D, Harvey SC, Leontis N,
ing the accuracy of SHAPE-directed RNA Westbrook J, Westhof E, Zuker M, Major F
structure modeling. Biochemistry 52(4):588– (2002) RNAML: a standard syntax for
595. doi:10.1021/bi300755u exchanging RNA information. RNA 8(6):
21. Leontis NB, Altman RB, Berman HM, Brenner 707–717. doi:10.1017/ S1355838202028017
SE, Brown JW, Engelke DR, Harvey SC, 31. Weinberg Z, Breaker RR (2011) R2R–software
Holbrook SR, Jossinet F, Lewis SE, Major F, to speed the depiction of aesthetic consensus
Mathews DH, Richardson JS, Williamson JR, RNA secondary structures. BMC Bioinform
Westhof E (2006) The RNA ontology consor- 12:3. doi: 10.1186/1471-2105-12-3
tium: an open invitation to the RNA community. 32. Wiese KC, Glen E, Vasudevan A (2005) JViz.
RNA 12(4):533–541. doi:10.1261/rna. 2343206 Rna–a Java tool for RNA secondary structure
22. Mathews DH, Sabina J, Zuker M, Turner DH visualization. IEEE Trans Nanobiosci 4(3):212–
(1999) Expanded sequence dependence of 218. doi:10.1109/TNB.2005. 853646
thermodynamic parameters improvespredic- 33. Xu Z, Culver GM (2009) Chemical probing of
tion of RNA secondary structure. J Mol Biol RNA and RNA/protein complexes. Methods
288:911–940. doi:10.1006/jmbi.1999. 2700 Enzymol 468:147–165. doi:10.1016/ S0076-
23. Menzel P, Seemann SE, Gorodkin J (2012) 6879(09) 68008-3
RILogo: visualizing RNA–RNA interactions. 34. Yang H, Jossinet F, Leontis N, Chen L,
Bioinformatics 28(19):2523–2526. doi:10. Westbrook J, Berman HM, Westhof E (2003)
1093/bioinformatics/ bts461 Tools for the automatic identification and
24. Mortimer SA, Trapnell C, Aviran S, Pachter L, classification of RNA base pairs. Nucl Acids
Lucks JB (2012) SHAPE-Seq: high-through- Res 31(13):3450–3460. doi:10.1093/nar/
put RNA structure analysis. Curr Protoc gkg529
Chem Biol 4(4):275–297. doi:10.1002/ 35. Zarringhalam K, Meyer MM, Dotu I, Chuang
9780470559277. ch120019 JH, Clote P (2012) Integrating chemical foot-
25. Muller G, Gaspin C, Etienne A, Westhof E printing data into RNA secondary structure pre-
(1993) Automatic display of RNA secondary diction. PLoS ONE 7(10):e45160. doi:10.1371/
structures. Comput Appl Biosci 9(5):551–561. journal.pone. 0045160
doi:10.1093/bioinformatics/9.5. 551 36. Zhang C, Darnell RB (2011) Mapping in vivo
26. Ouyang Z, Snyder MP, Chang HY (2012) protein-RNA interactions at single-nucleotide
SeqFold: genome-scale reconstruction of RNA resolution from HITS-CLIP data. Nat Biotechnol
secondary structure integrating high-through- 29(7):607–614. doi:10.1038/nbt.1873
put sequencing data. Genome Res. 37. Zuker M, Stiegler P (1981) Optimal computer
doi:10.1101/gr.138545.112 folding of large RNA sequencesusing thermo-
27. Rijk PD, Wuyts J, Wachter RD (2003) RnaViz dynamics and auxiliary information. Nucl Acids
2: an improved representation of RNA second- Res 9:133–148. doi:10.1093/nar/9.1.133
Chapter 6
Abstract
Modeling the three-dimensional structure of RNAs is a milestone toward better understanding and
prediction of the molecular functions of nucleic acids. Physics-based approaches and molecular dynamics
simulations are not tractable on large molecules with all-atom models. To address this issue, coarse-grained
models of RNA three-dimensional structures have been developed. In this chapter, we describe a graphical
model based on the Leontis–Westhof extended base-pair classification. This representation of RNA
structures enables us to identify highly conserved structural motifs with complex nucleotide interactions in
structure databases. Further, we show how to take advantage of this knowledge to quickly and simply
predict three-dimensional structures of large RNA molecules.
Key words Tertiary structure, RNA motifs, Extended secondary structure, Base-pair classification,
Modeling, Prediction
1 Introduction
Ernesto Picardi (ed.), RNA Bioinformatics, Methods in Molecular Biology, vol. 1269,
DOI 10.1007/978-1-4939-2291-8_6, © Springer Science+Business Media New York 2015
102 Jérôme Waldispühl and Vladimir Reinharz
2 Material
2.2 Automatic Annotation of 3D Structures
The rnaview software is used to identify all base-pairing interactions
present in an RNA tertiary structure [17]. This program can be downloaded at
http://ndbserver.rutgers.edu/ndbmodule/services/download/rnaview.html
and is currently available for Linux, UNIX, SUN, and MAC systems.
2.3 RNA Secondary Structure Prediction Tools
The prediction of RNA secondary structures is performed with the
RNAfold program available in the Vienna RNA package [20, 25].
The software suite is available at http://www.tbi.univie.ac.at/RNA/
and can run under LINUX, UNIX, MAC, and WINDOWS systems. In this
chapter, we used version 2.1.2 of the Vienna RNA package. Web
services are also available at http://rna.tbi.univie.ac.at/.
2.4 Insertion of RNA Motifs in Secondary Structure
We perform the insertion of RNA motifs inside predicted RNA
secondary structures with the RNA-MoIP software [21] available
at http://csb.cs.mcgill.ca/RNAMoIP/. Note that this software
requires installing the Gurobi solver (http://www.gurobi.com/),
which is free for academic users.
2.5 Building RNA 3D Structures from RNA Interaction Graphs
We use the MC-Sym software to build a tertiary structure from a
base-pairing interaction network [9]. MC-Sym is part of the MC-Tools
package available at: http://www.major.iric.ca/MajorLabEn/MC-Tools.html.
3 Methods
3.1 Classification of Base-Pairing Interactions
Watson–Crick (C–G and A–U) and Wobble (G–U) base pairs are
the most common types of interactions. They create a scaffold for
the tertiary structure. Nonetheless, the analysis of RNA crystal
structures revealed a diversity of base-pairing interactions that goes
far beyond these canonical base pairs. In order to facilitate the
description of RNA structures, Leontis and Westhof proposed a
Fig. 1 Base modeling and Leontis–Westhof base-pair classification. The base of a nucleotide is represented
with a right triangle. The hypotenuse represents the Hoogsteen edge (noted “H”), and the other sides are
associated with the Watson–Crick edge (noted “W”) and Sugar edge (noted “S”). This figure represents all 12
base-pair interactions with cis or trans orientation
3.3 RNA Motifs
The modeling of RNA tertiary structures as networks of base-pairing
interactions revealed recurrent motifs conserved across
families of structures. These structural patterns form a basis for
better understanding the complex organization of nucleotides
inside hairpins, internal loops, and beyond (see the previous section for
a definition of the secondary structure elements). Several methodologies
and databases have been developed to mine and store these
data. Among the most popular repositories, we count the RNA 3D
3.4 Prediction of RNA Tertiary Structures from Sequence Data
Modeling RNA tertiary structure is the first step toward predicting
RNA 3D structures. We now describe how the knowledge accumulated
in motif databases can be used together with RNA secondary
structure prediction methods to predict the structure of
large RNA molecules from sequence data only. The strategy presented
here works in two steps. First, we predict a secondary structure
using classical software such as RNAfold [20] and refine this
Fig. 3 Example of 3D RNA motif insertions in a secondary structure. In green, we show the position of the
hairpin motif “1F7Y.B.6”, and in blue we indicate the position of the internal loop motif “1FKA.A.51”. On the left
side of the motif IDs, we display a 3D structure of the motif together with its base-pairing interaction graph
3.5 Prediction of a Secondary Structure Scaffold
Our first objective is to create a base-pairing network from sequence
data. Since the majority of base pairs in RNA structures are cis
Watson–Crick base pairs, we first predict a secondary structure
(without pseudoknots) that will be used to build a scaffold of the
interaction graph. Secondary structures (without pseudoknots) can
be deterministically predicted with RNAfold, or stochastically generated
with RNAsubopt. The command line used to predict the
minimum free energy (MFE) secondary structure with RNAfold is:
RNAfold --noPS < input.fasta
Here, input.fasta is a text file (FASTA format recommended)
that stores your input sequence. The --noPS flag is not mandatory,
but it prevents the program from generating a PostScript file drawing
the predicted secondary structure. The program returns the input
sequence with its MFE secondary structure in bracket format on
the line below.
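As an illustration, the bracket-format output described above can be turned into an explicit list of base pairs. The following Python sketch assumes the usual layout (an optional FASTA header, the sequence, then the structure followed by the energy in parentheses); the function names and the sample text, including its energy value, are made up for this example and are not part of the Vienna RNA package.

```python
import re


def parse_rnafold_output(text):
    """Split RNAfold-style text into (sequence, structure, energy).

    Assumes: optional '>' header, one sequence line, then one line
    holding the bracket string and the energy in parentheses.
    """
    lines = [ln.strip() for ln in text.splitlines()]
    lines = [ln for ln in lines if ln and not ln.startswith(">")]
    sequence = lines[0]
    structure, _, rest = lines[1].partition(" ")
    match = re.search(r"\(\s*(-?\d+(?:\.\d+)?)\s*\)\s*$", rest)
    energy = float(match.group(1)) if match else None
    return sequence, structure, energy


def base_pairs(structure):
    """Return the sorted list of 1-based (i, j) pairs of a bracket string."""
    stack, pairs = [], []
    for pos, char in enumerate(structure, start=1):
        if char == "(":
            stack.append(pos)
        elif char == ")":
            if not stack:
                raise ValueError("unbalanced ')' at position %d" % pos)
            pairs.append((stack.pop(), pos))
    if stack:
        raise ValueError("unbalanced '(' in structure")
    return sorted(pairs)


# Hypothetical output text (the energy value is invented for the example).
sample = ">demo\nGGGAAACCC\n(((...))) ( -1.20)\n"
seq, struct, mfe = parse_rnafold_output(sample)
print(seq, struct, mfe, base_pairs(struct))
```

The list of pairs is a convenient starting point for the scaffold of the interaction graph, since each pair corresponds to one cis Watson–Crick (or wobble) edge.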
However, as discussed in previous chapters, a single minimum-energy
structure does not always provide the best secondary structure
prediction. Instead, it is recommended to perform a deeper
exploration of the conformational space and to consider suboptimal
structures [29, 30]. This approach to secondary structure prediction,
originally implemented in mfold [30, 31] and Sfold [32],
is available with the RNAsubopt program in the Vienna RNA
package. The command line for running RNAsubopt is:
RNAsubopt -e 3 < input.fasta
Here, the “-e” option specifies the depth of the suboptimal
search. More specifically, this argument indicates the range
(in kcal/mol) above the MFE within which all suboptimal structures
are returned. The appropriate range depends on the MFE of the
sequence and should be adjusted on a case-by-case basis. In our
experiments, a value of 3 kcal/mol generated 25 secondary structures,
which appears to provide a good
representation of the suboptimal conformational landscape.
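To make the role of the energy range concrete, the sketch below reads RNAsubopt-style text and keeps the structures within a chosen distance of the best energy. It assumes that each structure line holds a bracket string followed by its energy in kcal/mol and simply skips every other line; the helper names and the sample text are illustrative assumptions, not output copied from the program.

```python
def parse_subopt(text):
    """Collect (structure, energy) pairs from RNAsubopt-style text.

    A line is treated as a structure line when its first field contains
    only the characters '(', ')' and '.'; other lines are ignored.
    """
    structures = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and set(fields[0]) <= set("()."):
            structures.append((fields[0], float(fields[1])))
    return structures


def within_range(structures, delta):
    """Keep structures whose energy is at most `delta` above the best one."""
    best = min(energy for _, energy in structures)
    return [s for s, energy in structures if energy - best <= delta]


# Hypothetical text with invented energies, for illustration only.
sample = """>demo
GGGAAACCC   -120    300
(((...)))  -1.20
.((...)).  -0.70
.........   0.90
"""
candidates = parse_subopt(sample)
print(len(candidates), within_range(candidates, 1.0))
```

Counting the structures kept for a few values of the range is a quick way to choose a depth that yields a manageable number of candidate scaffolds.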
The list of suboptimal secondary structures generated by
RNAsubopt (available at http://csb.cs.mcgill.ca/RNAMoIP)
provides a description of the set of potential secondary structure
backbones. It is used as is in the next step.
It is worth noting that other software could have been used to
generate the initial secondary structures. RNAstructure [33, 34],
mfold [30, 35], and Sfold [32, 36] make predictions similar to those of
RNAsubopt. Further, recent software such as MC-Fold [9] and
Modeling and Predicting RNA Three-Dimensional Structures 111
3.6 Prediction of a Complete Base-Pairing Interaction Graph with Motif Insertion
We describe how to use RNA-MoIP [21] to insert local motifs into
secondary structures generated with RNAsubopt. By default,
RNA-MoIP inserts motifs from a repository built with
RNA3Dmotif [18]. This repository of nonredundant motifs is
included in the distribution of RNA-MoIP and named
“No_Redondance_DESC” (see Note 1). Nonetheless, advanced users
can also either build an up-to-date motif repository themselves
using RNA3Dmotif, or use databases available on the RNA 3D
Hub [19].
RNA-MoIP is a flexible tool that allows modifications of the
secondary structure to permit the insertion of motifs. In particular,
the program can remove a bounded number of base
pairs from the input secondary structure. This feature is particularly
helpful if the predicted secondary structure contains incorrectly
predicted base pairs that prevent motifs from being inserted.
Let us assume that you work in a directory that contains the
RNA-MoIP program (i.e., the Python script named “RNAMoIP.py”)
and that Gurobi has been properly installed. The command
line to insert motifs in the sequence and secondary structure with
RNA-MoIP is:
gurobi.sh RNAMoIP.py -s <sequence> -ss <structure_list> -d <path_to_repository>
Here, the argument <sequence> is a string representing the
primary structure of the RNA, <structure_list> is the
name of the file that stores the secondary structures in bracket format
generated by RNAsubopt, and <path_to_repository> is the
location of the motif repository. The default repository is stored in
the directory named “CATALOGUE,” which
is distributed with the RNA-MoIP package at
http://csb.cs.mcgill.ca/RNAMoIP/. Assuming that the directory “CATALOGUE” is
in your current directory, the value of <path_to_repository> is the
string “./CATALOGUE/No_Redondance_DESC/”.
RNA-MoIP accepts an additional parameter to control the
maximum fraction of base pairs that can be removed. This parameter
is set with a floating-point number (between 0 and 1) through
the option -r. By default, RNA-MoIP allows up to 30 % (i.e., -r
0.3) of base pairs to be removed. This is a reasonable choice, as the
positive predictive value (PPV) of base-pair prediction is roughly
60 % for classical secondary structure predictors such as RNAfold
and mfold [38]. Nonetheless, users can decrease this value if they
are confident in their predicted secondary structure.
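When RNA-MoIP is called from a script, the command line above can be assembled programmatically. The Python sketch below only builds the argument list from the options described in this section (-s, -ss, -d, -r) and checks the range of the -r value; the function name and the file name "subopt.txt" are hypothetical, and nothing is executed.

```python
def build_rnamoip_command(sequence, structure_file, repository, max_removed=0.3):
    """Return the RNA-MoIP command as a list of arguments.

    `max_removed` is the fraction of base pairs that may be removed
    (option -r), which must lie between 0 and 1.
    """
    if not 0.0 <= max_removed <= 1.0:
        raise ValueError("max_removed must be between 0 and 1")
    return [
        "gurobi.sh", "RNAMoIP.py",
        "-s", sequence,
        "-ss", structure_file,
        "-d", repository,
        "-r", str(max_removed),
    ]


# "subopt.txt" is a placeholder for the file written from RNAsubopt output.
command = build_rnamoip_command(
    "GGGAAACCC", "subopt.txt", "./CATALOGUE/No_Redondance_DESC/", 0.2
)
print(" ".join(command))
```

Once RNA-MoIP and Gurobi are installed, such a list could be passed to subprocess.run(command, check=True) from the directory that contains RNAMoIP.py.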
C-2DU6.D.2-12-22-1
C-3CUL.D.6-51-61-1
C-2DU3.D.3-30-38-1
C-2DU5.D.1-6-10-1
C-2DU5.D.1-24-27-2
C-2DU5.D.1-41-48-3
C-2DU5.D.1-64-66-4
D-15-19
D-14-20
D-13-21
D-26-42
D-52-60
D-7-65
Table 1
Description of motifs inserted by RNA-MoIP in the suboptimal
secondary structures generated by RNAsubopt for the tRNA(Cys)
from Archaeoglobus fulgidus
//===================RiboseRestraints===================
ribose_rst
(
structure
method = ccm,
pucker = C3p_endo,
threshold = 2.0
)
3.8 Retrieving, Optimizing, Analyzing and Visualizing Structures
Once submitted, the script will run on MC-Sym servers and the
URL of a result page will be returned to the user. This page offers
multiple options to optimize, analyze, and retrieve the results. To
access these options, the user must open the web page
“commands.html” located in the results directory.
The energy minimization of the MC-Sym structure is probably
one of the most important options. We recommend running it
before analyzing or visualizing the results. The “steepest
descent” option returns satisfactory results in a short time, but
more sophisticated and slower techniques (e.g., simulated annealing)
are also available.
Clustering of structures using the k-means algorithm is another
useful option that enables the user to automatically identify the most
significant structures in the set of structures returned by MC-Sym.
PDB models of all predicted structures are available at the root
of the directory. Each model can be visualized with a molecular
viewer such as PyMOL or Jmol. Figure 5 shows an example of a
structure predicted with our RNA-MoIP and MC-Sym pipeline,
aligned with the experimental structure [24].
4 Notes
References
1. Bekaert M et al (2003) Towards a computational model for -1 eukaryotic frameshifting sites. Bioinformatics 19(3):327–335
2. Vitreschak AG et al (2004) Riboswitches: the oldest mechanism for the regulation of gene expression? Trends Genet 20(1):44–50
3. Szewczak AA et al (1993) The conformation of the sarcin/ricin loop from 28S ribosomal RNA. Proc Natl Acad Sci U S A 90(20):9581–9585
4. Sponer J et al (2012) Chapter 6 molecular dynamics simulations of RNA molecules, in innovations in biomolecular modeling and simulations. R Soc Chem 2:129–155
5. Bernauer J et al (2011) Fully differentiable coarse-grained and all-atom knowledge-based potentials for RNA structure evaluation. RNA 17(6):1066–1075
6. Ding F et al (2008) Ab initio RNA folding by discrete molecular dynamics: from structure prediction to folding mechanisms. RNA 14(6):1164–1173
7. Jonikas MA et al (2009) Coarse-grained modeling of large RNA molecules with knowledge-based potentials and structural filters. RNA 15(2):189–199
8. Poursina M et al (2011) Strategies for articulated multibody-based adaptive coarse grain simulation of RNA. Methods Enzymol 487:73–98
9. Parisien M, Major F (2008) The MC-Fold and MC-Sym pipeline infers RNA structure from sequence data. Nature 452(7183):51–55
10. Martinez HM, Maizel JV Jr, Shapiro BA (2008) RNA2D3D: a program for generating, viewing, and comparing 3-dimensional models of RNA. J Biomol Struct Dyn 25(6):669–683
11. Zhao Y et al (2012) Automated and fast building of three-dimensional RNA structures. Sci Rep 2:734
12. Das R, Baker D (2007) Automated de novo prediction of native-like RNA tertiary structures. Proc Natl Acad Sci U S A 104(37):14664–14669
13. Das R, Karanicolas J, Baker D (2010) Atomic accuracy in predicting and designing noncanonical RNA structure. Nat Methods 7(4):291–294
14. Wang Z, Xu J (2011) A conditional random fields method for RNA sequence-structure relationship modeling and conformation sampling. Bioinformatics 27(13):i102–i110
15. Rother M et al (2011) ModeRNA: a tool for comparative modeling of RNA 3D structure. Nucleic Acids Res 39(10):4007–4022
16. Leontis NB, Westhof E (2001) Geometric nomenclature and classification of RNA base pairs. RNA 7(4):499–512
17. Yang H et al (2003) Tools for the automatic identification and classification of RNA base pairs. Nucleic Acids Res 31(13):3450–3460
18. Djelloul M, Denise A (2008) Automated motif extraction and classification in RNA tertiary structures. RNA 14(12):2489–2497
19. Leontis N, Zirbel CL (2012) Nonredundant 3D structure datasets for RNA knowledge extraction and benchmarking. In: Leontis N, Westhof E (eds) RNA 3D structure analysis and prediction. Springer, Berlin, pp 281–298
20. Hofacker IL et al (1994) Fast folding and comparison of RNA secondary structures. Monatsh Chem 125(2):167–188
21. Reinharz V, Major F, Waldispuhl J (2012) Towards 3D structure prediction of large RNA molecules: an integer programming framework to insert local 3D motifs in RNA secondary structure. Bioinformatics 28(12):i207–i214
22. Berman HM et al (1992) The nucleic acid database. A comprehensive relational database of three-dimensional structures of nucleic acids. Biophys J 63(3):751–759
23. Bernstein FC et al (1977) The Protein Data Bank: a computer-based archival file for macromolecular structures. J Mol Biol 112(3):535–542
24. Fukunaga R, Yokoyama S (2007) Structural insights into the first step of RNA-dependent cysteine biosynthesis in archaea. Nat Struct Mol Biol 14(4):272–279
25. Lorenz R et al (2011) ViennaRNA Package 2.0. Algorithms Mol Biol 6:26
26. Waugh A et al (2002) RNAML: a standard syntax for exchanging RNA information. RNA 8(6):707–717
27. Lemieux S, Major F (2002) RNA canonical and non-canonical base pairing types: a recognition method and complete repertoire. Nucleic Acids Res 30(19):4250–4263
28. Chojnowski G, Walen T, Bujnicki JM (2014) RNA Bricks–a database of RNA 3D motifs and their interactions. Nucleic Acids Res 42(1):D123–D131
29. Ding Y, Chan CY, Lawrence CE (2005) RNA secondary structure prediction by centroids in a Boltzmann weighted ensemble. RNA 11(8):1157–1166
30. Zuker M (1989) On finding all suboptimal foldings of an RNA molecule. Science 244(4900):48–52
31. Zuker M, Mathews DH, Turner DH (1999) Algorithms and thermodynamics for RNA secondary structure prediction: a practical guide. In: Barciszewski J, Clark BFC (eds) RNA biochemistry and biotechnology. Springer, Netherlands, pp 11–43
32. Ding Y, Lawrence CE (2003) A statistical sampling algorithm for RNA secondary structure prediction. Nucleic Acids Res 31(24):7280–7301
33. Reuter JS, Mathews DH (2010) RNAstructure: software for RNA secondary structure prediction and analysis. BMC Bioinformatics 11:129
34. Bellaousov S et al (2013) RNAstructure: Web servers for RNA secondary structure prediction and analysis. Nucleic Acids Res 41(Web Server issue):W471–W474
35. Zuker M (2003) Mfold web server for nucleic acid folding and hybridization prediction. Nucleic Acids Res 31(13):3406–3415
36. Ding Y, Chan CY, Lawrence CE (2004) Sfold web server for statistical folding and rational design of nucleic acids. Nucleic Acids Res 32(Web Server issue):W135–W141
37. Honer zu Siederdissen C et al (2011) A folding algorithm for extended RNA secondary structures. Bioinformatics 27(13):i129–i136
38. Do CB, Woods DA, Batzoglou S (2006) CONTRAfold: RNA secondary structure prediction without physics-based models. Bioinformatics 22(14):e90–e98
Chapter 7
Abstract
Interaction between two RNA molecules plays a crucial role in many medical and biological processes such
as gene expression regulation. In this process, an RNA molecule prohibits the translation of another RNA
molecule by establishing stable interactions with it. Several algorithms have been developed to predict the
structure of the RNA–RNA interaction. High computational time is a common challenge in most of
these algorithms. In this context, a heuristic method is introduced to accurately predict the interaction
between two RNAs based on minimum free energy (MFE). This algorithm uses a few dot matrices to
find the secondary structure of each RNA and the binding sites between the two RNAs. Furthermore, a parallel
version of this method is presented. We describe the algorithm’s concurrency and parallelism for a
multicore chip. The proposed algorithm has been applied to several datasets including CopA-CopT,
R1inv-R2inv, Tar-Tar*, DIS-DIS, and IncRNA54-RepZ in Escherichia coli bacteria. The method has high validity
and efficiency, and it runs in low computational time in comparison to other approaches.
Key words RNA secondary structure, RNA–RNA interaction, Heuristic, Parallel, Minimum free energy
1 Introduction
Ernesto Picardi (ed.), RNA Bioinformatics, Methods in Molecular Biology, vol. 1269,
DOI 10.1007/978-1-4939-2291-8_7, © Springer Science+Business Media New York 2015
124 Soheila Montaseri
and thus each set of consecutive unpaired bases can interact with
other complementary unpaired sets. Here, we have also implemented
a parallel heuristic approach for RNA–RNA interaction prediction.
All diagonals in each dot matrix can be
considered independently of each other. Therefore, the sub-diagonals
on different diagonals and their minimum free energy
(MFE) are calculated concurrently. The sub-diagonals are sorted in
parallel based on their length and MFE.
2 Materials
2.1 Dataset
The proposed algorithm has been applied to several standard
datasets such as CopA-CopT, R1inv-R2inv, Tar-Tar*, DIS-DIS,
and IncRNA54-RepZ in Escherichia coli bacteria [14]. These RNA
pairs have kissing hairpin structures.
Before presenting our algorithm, we introduce some definitions and
notation. An RNA molecule consists of a
sequence of four nucleotides: adenine (A), cytosine (C), guanine
(G), and uracil (U). Thus, an RNA sequence is written as
$R = r_1 r_2 \dots r_n$ in the 5′–3′ direction, where $|R| = n$ and
$r_i \in \{A, C, G, U\}$ ($1 \le i \le n$). The reverse of $R$ is
written as $r_n r_{n-1} \dots r_1$ in the 3′–5′ direction. A subsequence
$r_{ij} = r_i r_{i+1} \dots r_j$ of the sequence $R$ starts at position $i$
and ends at position $j$ in the 5′–3′ direction. The reverse of $r_{ij}$ is
defined as $r_j r_{j-1} \dots r_i$ in the 3′–5′ direction. The RNA structure is
formed by the creation of hydrogen bonds between complementary
bases (A–U, C–G).
An RNA secondary structure is composed of stems and single
regions. Each stem contains some consecutive base pairs such as
$r_i \cdot r_j$ and $r_{i+1} \cdot r_{j-1}$. Let $r_{ij}$ and $r_{kl}$ be two
subsequences of the sequence $R$ that form a stem, so that $r_{ij}$ is bonded to
the reverse of $r_{kl}$ base by base. For the stem, each base in $r_{ij}$ is
represented by '(' and each base in $r_{kl}$ by ')'. A single
region is a loop or a single strand. Each end of a loop is linked to a
stem, whereas only one end of a single strand is connected to a stem.
Each base in a single region is represented by '.'. Thus the secondary
structure of the RNA $R$ is written as a string over the characters
'(', ')', and '.'.
An RNA–RNA interaction structure is composed of stems in
each RNA and hybrids between the two RNAs. Each hybrid contains
some consecutive base pairs between the two RNAs, none of which is
involved in a base pair within its own RNA. Let $r'_{ij}$ and $r''_{kl}$ be two
single regions of the RNA secondary structures of $R'$ and the reverse of $R''$,
respectively, that are involved in a hybrid. Thus the subsequence $r'_{ij}$ is
bonded to the reverse of $r''_{kl}$ base by base. For the hybrid, each base in
$r'_{ij}$ is represented by '[' and each base in $r''_{kl}$ by ']'.
2.2 Computational Hardware
The proposed method is run on an Intel(R) Core(TM)2 Duo
processor T6670 at 2.20 GHz with 4 GB RAM to predict the interaction
structures. The parallel version of the algorithm exploits parallelism
on a multicore system. Experiments are performed on an Intel(R)
Core(TM) i7 2600K CPU at 3.40 GHz with 8 GB RAM.
2.3 Computational Software
The heuristic method for RNA–RNA interaction prediction is
implemented in C++. The parallel heuristic approach is implemented
in MATLAB.
3 Methods
3.1 RNA Secondary Structure Prediction
In this section, a heuristic algorithm is presented for RNA–RNA
interaction prediction. This method has two main steps. In the first
step, an RNA secondary structure is constructed for each RNA
sequence as follows:
1. A stem dot matrix $M^R_{n \times n}$ is constructed for the RNA sequence
$R = r_1 r_2 \dots r_n$ ($|R| = n$) as follows:
\[
M^R[i, j] =
\begin{cases}
1 & \text{if } (r_i, r_j) \in \{(A, U), (U, A), (C, G), (G, C)\}, \\
0 & \text{otherwise,}
\end{cases}
\]
where $r_i$ and $r_j$ ($1 \le i, j \le n$) are the $i$-th nucleotide in the
sequence $R$ and the $j$-th nucleotide in the reverse of $R$,
respectively.
2. In the stem dot matrix $M^R_{n \times n}$, each right-skewed run of
consecutive 1’s that is parallel to the main diagonal is a stem
sub-diagonal. The set of stem sub-diagonals for the RNA $R$ is
formed as follows:
\[
D^R = \{ \langle i, j, k, l \rangle \mid 1 \le i \le k \le n \ \&\ 1 \le j \le l \le n \},
\]
where $(i, j)$ and $(k, l)$ are the start and end positions of a stem
sub-diagonal, respectively. Let $d^R \in D^R$ and $d^R = \langle i, j, k, l \rangle$. Then
$d^R$ shows how the subsequence $r_{ik}$ is bonded to the reverse of $r_{jl}$ in
$R$. If there are $i'$ and $j'$ such that $i < i' < k$, $j < j' < l$, and $i' + j' = n + 1$,
then $d^R$ has to be removed and the two stem sub-diagonals
$\langle i, j, i', j' \rangle$ and $\langle i' + 1, j' + 1, k, l \rangle$ have to be inserted into the set $D^R$.
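The two steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the C++ implementation used in this chapter: the function names and the min_len parameter are hypothetical, only the four pairs listed in the matrix definition are treated as complementary, and the splitting rule at i′ + j′ = n + 1 is left out for brevity.

```python
COMPLEMENTARY = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C")}


def stem_dot_matrix(seq):
    """M[i][j] = 1 if seq[i] can pair with the j-th base of reversed seq."""
    rev = seq[::-1]
    return [[1 if (a, b) in COMPLEMENTARY else 0 for b in rev] for a in seq]


def sub_diagonals(matrix, min_len=1):
    """Runs of consecutive 1's parallel to the main diagonal.

    Each run is reported as a 1-based tuple (i, j, k, l): it starts at
    cell (i, j) and ends at cell (k, l). The matrix must be non-empty.
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    runs = []
    for offset in range(-(n_rows - 1), n_cols):   # offset = column - row
        i, j = max(0, -offset), max(0, offset)    # first cell of the diagonal
        start = None
        while i < n_rows and j < n_cols:
            if matrix[i][j]:
                if start is None:
                    start = (i, j)
            elif start is not None:
                runs.append((start[0] + 1, start[1] + 1, i, j))
                start = None
            i += 1
            j += 1
        if start is not None:                     # run reaches the border
            runs.append((start[0] + 1, start[1] + 1, i, j))
    return [r for r in runs if r[2] - r[0] + 1 >= min_len]


matrix = stem_dot_matrix("GGGAAACCC")
print(sub_diagonals(matrix, min_len=3))
```

For the toy sequence GGGAAACCC, the two runs of length 3 describe the same stem seen from its 5′ and 3′ strands, because the column index runs over the reversed sequence.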
RNA-RNA Interaction Prediction 127
Table 1
MFE of all two adjacent base pairs
3.2 RNA–RNA Interaction Prediction
In the second step, the interaction structure between the two
structures $S'$ and $S''$ is calculated as follows:
1. A hybrid dot matrix $M^{R', R''}_{n \times m}$ is built from the two secondary
structures of the RNA sequences $R' = r'_1, \dots, r'_n$ and $R'' = r''_1, \dots, r''_m$
($|R'| = n$, $|R''| = m$), as follows:
\[
M^{R', R''}[i, j] =
\begin{cases}
1 & \text{if } (r'_i, r''_j) \in \{(A, U), (U, A), (C, G), (G, C)\} \ \&\ (s'_i, s''_j) = (\text{'.'}, \text{'.'}), \\
0 & \text{otherwise,}
\end{cases}
\]
where $(r'_i, r''_j)$ are the $i$-th and $j$-th nucleotides and $(s'_i, s''_j)$ are
the $i$-th and $j$-th structure characters in the sequence $R'$ and the reverse of $R''$,
respectively ($1 \le i \le n$, $1 \le j \le m$).
2. In the hybrid dot matrix $M^{R', R''}_{n \times m}$, each right-skewed run of
consecutive 1’s that is parallel to the main diagonal is a
hybrid sub-diagonal. The set of hybrid sub-diagonals for the two
RNAs $R'$ and $R''$ is formed as follows:
\[
D^{R', R''} = \{ \langle i, j, k, l \rangle \mid 1 \le i \le k \le n \ \&\ 1 \le j \le l \le m \},
\]
where $(i, j)$ and $(k, l)$ are the start and the end positions of the
hybrid sub-diagonal, respectively. Each $d^{R', R''} = \langle i, j, k, l \rangle \in D^{R', R''}$
indicates that the subsequence $r'_{ik}$ in $R'$ is bonded to $r''_{jl}$ in
the reverse of $R''$.
The remaining steps are similar to steps 3, 4, and 5 in
Subheading 3.1.
The time and space complexity of the algorithm are $O(k^2 \log k^2)$
and $O(k^2)$, respectively, where $k$ is the sum of the lengths of the
two RNAs.
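As a small illustration of the hybrid dot matrix, the Python sketch below marks a cell only when the two bases are complementary and both are unpaired ('.') in their own secondary structure. The function name is hypothetical, and the input structures are the stems of Example 1 with the interaction brackets replaced by dots.

```python
COMPLEMENTARY = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C")}


def hybrid_dot_matrix(seq1, struct1, seq2, struct2):
    """Hybrid dot matrix between seq1 and the reverse of seq2.

    Cell [i][j] is 1 when base i of seq1 and base j of reversed seq2 are
    complementary and both are unpaired in their secondary structures.
    """
    rev_seq2, rev_struct2 = seq2[::-1], struct2[::-1]
    return [
        [
            1
            if (a, b) in COMPLEMENTARY and sa == "." and sb == "."
            else 0
            for b, sb in zip(rev_seq2, rev_struct2)
        ]
        for a, sa in zip(seq1, struct1)
    ]


# Sequences of Example 1; only the intramolecular stems are given as input.
matrix = hybrid_dot_matrix(
    "AGUACCGAAAACU", "(((.......)))",
    "CCGUUUGAGGUCGG", "(((........)))",
)
print(matrix[3][3], matrix[4][4], matrix[5][5])  # cells of the ACC/GGU hybrid
```

Rows that belong to a stem contain only zeros, so hybrid sub-diagonals can form only between single regions, as required by the definition.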
Example 1
Let R′ = AGUACCGAAAACU and R″ = CCGUUUGAGGUCGG
be two RNA sequences. The interaction between the two RNAs can be
shown as follows:
5′-AGUACCGAAAACU-3′    5′-CCGUUUGAGGUCGG-3′
   ((([[[.[[[)))          (((]]]..]]])))
Here, each RNA has one stem. In the left-hand RNA, the stem is
formed by bonding between AGU and the reverse of
ACU. Consecutive opening square brackets and their corresponding closing
square brackets denote an interaction region between the two RNAs.
Here, there are two interaction regions between the sequences R′
and R″. The first one is formed by binding between ACC and the
reverse of GGU. The second one is formed by binding between
AAA and the reverse of UUU.
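The notation of Example 1 can be checked mechanically. The Python sketch below extracts the intramolecular pairs from the round brackets and the hybrid pairs from the square brackets; it assumes that hybrids are antiparallel and do not cross, so that the first '[' of R′ pairs with the last ']' of R″. The function names are illustrative only.

```python
COMPLEMENTARY = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C")}


def stem_pairs(structure):
    """1-based (i, j) pairs encoded by '(' and ')' within one RNA."""
    stack, pairs = [], []
    for pos, char in enumerate(structure, start=1):
        if char == "(":
            stack.append(pos)
        elif char == ")":
            pairs.append((stack.pop(), pos))
    return sorted(pairs)


def hybrid_pairs(struct1, struct2):
    """1-based (i, j) pairs between two RNAs encoded by '[' and ']'.

    Assumption: the k-th '[' of the first RNA pairs with the k-th ']'
    of the second RNA counted from its 3' end.
    """
    opens = [p for p, c in enumerate(struct1, start=1) if c == "["]
    closes = [p for p, c in enumerate(struct2, start=1) if c == "]"]
    if len(opens) != len(closes):
        raise ValueError("unbalanced hybrid brackets")
    return list(zip(opens, reversed(closes)))


r1, s1 = "AGUACCGAAAACU", "((([[[.[[[)))"
r2, s2 = "CCGUUUGAGGUCGG", "(((]]]..]]])))"

print(stem_pairs(s1))
print(hybrid_pairs(s1, s2))
# Every reported pair should join complementary bases.
print(all((r1[i - 1], r1[j - 1]) in COMPLEMENTARY for i, j in stem_pairs(s1)))
print(all((r1[i - 1], r2[j - 1]) in COMPLEMENTARY for i, j in hybrid_pairs(s1, s2)))
```

The two hybrids found this way are positions 4–6 of R′ (ACC) with positions 9–11 of R″ (GGU), and positions 8–10 of R′ (AAA) with positions 4–6 of R″ (UUU).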
3.3 Parallel Algorithm
Here, the parallel version of the heuristic algorithm is described. Let N
be the number of created threads. Each dot matrix is constructed
in parallel. Notice that an element (x, y) in the dot matrix depends
on the set of elements (x, j) and (i, y) for each 1 ≤ i, j ≤ n, as shown
in Fig. 1. This dependency pattern allows the main diagonal
and the other diagonals parallel to it to be considered independently of
each other. The diagonals are therefore divided among the threads. Figure 2
Fig. 1 Dependency of element (x, y) on a set of elements (x, j) and (i, y) for each
1 ≤ i, j ≤ n
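The idea of dividing the diagonals among threads can be sketched as follows. This Python example only illustrates the decomposition (one independent task per diagonal, shared among N worker threads); it is not the MATLAB implementation used in this chapter, the function names are hypothetical, and Python threads are not expected to reproduce the reported speedups.

```python
from concurrent.futures import ThreadPoolExecutor


def diagonal_runs(matrix, offset):
    """Runs of consecutive 1's on one diagonal (offset = column - row).

    Runs are returned as 1-based tuples (i, j, k, l).
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    i, j = max(0, -offset), max(0, offset)
    runs, start = [], None
    while i < n_rows and j < n_cols:
        if matrix[i][j]:
            if start is None:
                start = (i, j)
        elif start is not None:
            runs.append((start[0] + 1, start[1] + 1, i, j))
            start = None
        i += 1
        j += 1
    if start is not None:
        runs.append((start[0] + 1, start[1] + 1, i, j))
    return runs


def parallel_sub_diagonals(matrix, n_threads=4):
    """Process every diagonal as an independent task on n_threads workers."""
    offsets = range(-(len(matrix) - 1), len(matrix[0]))
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        per_diagonal = list(pool.map(lambda c: diagonal_runs(matrix, c), offsets))
    return [run for runs in per_diagonal for run in runs]


toy = [[1, 1, 0],
       [0, 1, 1],
       [1, 0, 1]]
print(parallel_sub_diagonals(toy, n_threads=2))
```

Because no diagonal reads or writes cells of another diagonal, the result does not depend on the number of threads or on the order in which the tasks finish.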
3.4 Algorithm One of the limitations of the proposed algorithm is that we have
Limitations first obtained the length of sub-diagonals and then computed their
MFEs for RNA–RNA interaction prediction, while the interaction
between two RNAs is formed based on MFE.
To evaluate the prediction accuracy of the method, sensitivity,
specificity, and F-measure have been employed. Let TP, FP, and
FN be the number of correctly predicted base pairs, the number of
false predicted base pairs, and the number of unpredicted base
pairs, respectively. So the sensitivity (Sn), specificity (Sp), and
F-measure (F) are defined as follows:
Sn = TP / (TP + FN ) , Sp = TP / (TP + FP ) , F = ( 2 * Sn * Sp ) / ( Sn + Sp ) .
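The three measures can be computed directly from sets of base pairs. The sketch below is a minimal illustration of the formulas above; the function names and the representation of a base pair as an `(i, j)` tuple are assumptions made for the example.

```python
def count_pairs(predicted, reference):
    """Return (TP, FP, FN) for two collections of base pairs given as (i, j) tuples."""
    predicted, reference = set(predicted), set(reference)
    tp = len(predicted & reference)   # correctly predicted base pairs
    fp = len(predicted - reference)   # falsely predicted base pairs
    fn = len(reference - predicted)   # base pairs that were not predicted
    return tp, fp, fn

def accuracy_metrics(tp, fp, fn):
    """Sensitivity, specificity (as defined in the text), and F-measure."""
    sn = tp / (tp + fn) if tp + fn else 0.0
    sp = tp / (tp + fp) if tp + fp else 0.0
    f = 2 * sn * sp / (sn + sp) if sn + sp else 0.0
    return sn, sp, f
```

With TP = 6, FP = 2, and FN = 4, the sensitivity is 0.6, the specificity is 0.75, and the F-measure is about 0.667.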
We have compared the prediction accuracy of our parallel method,
TPIRNA, in these criteria with some similar approaches such as
InRNAs, RNAup [12] and EBM in Table 2. As it is shown, the
Table 2
Comparison of sensitivity, specificity, and F-measure
Table 3
Comparison of running times of TPIRNA with 1, 2, 4, and 8 threads and also EBM
Table 4
Comparison of time and space complexity of some methods
Acknowledgements
References
1. Huang FWD, Qin J, Reidys CM et al (2010) Target prediction and a statistical sampling algorithm for RNA-RNA interaction. Bioinformatics 26:175–181
2. Salari R, Backofen R, Sahinalp SC (2010) Fast prediction of RNA-RNA interaction. Algorithms Mol Biol 5:5–15
3. Tafer H, Hofacker IL (2008) RNAplex: a fast tool for RNA-RNA interaction search. Bioinformatics 24:2657–2663
4. Mneimneh S (2009) On the approximation of optimal structures for RNA-RNA interaction. Trans Comput Biol Bioinform 6:682–688
5. Andronescu M, Zhang ZC, Condon A (2005) Secondary structure prediction of interacting RNA molecules. J Mol Biol 345:987–1001
6. Bernhart S, Tafer H, Mückstein U et al (2006) Partition function and base pairing probabilities of RNA heterodimers. Algorithms Mol Biol 1–3
7. Dirks R, Bios J, Schaeffer JM et al (2007) Thermodynamic analysis of interacting nucleic acid strands. Soc Ind Appl Math 49:65–88
8. Rehmsmeier M, Steffen P, Hochsmann M et al (2004) Fast and effective prediction of microRNA/target duplexes. RNA 10:1507–1517
9. Markham NR, Zuker M (2008) UNAFold: software for nucleic acid folding and hybridization. Methods Mol Biol 453:3–31
10. Alkan C, Karakoc E, Nadeau JH et al (2006) RNA-RNA interaction prediction and antisense RNA target search. J Comput Biol 13:267–282
11. Mückstein U, Tafer H, Bernhart S et al (2009) Translational control by RNA-RNA interaction: Improved computation of RNA-RNA binding thermodynamics. Bioinform Res Dev 13:114–127
12. Mückstein U, Tafer H, Hackermuller J et al (2006) Thermodynamics of RNA-RNA binding. Bioinformatics 22:1177–1182
13. Busch A, Richter AS, Backofen R (2008) IntaRNA: efficient prediction of bacterial sRNA targets incorporating target site accessibility and seed regions. Bioinformatics 24:2849–2856
14. Kato Y, Akutsu T, Seki H (2009) A grammatical approach to RNA-RNA interaction prediction. Pattern Recogn 42:531–538
15. Aksay C, Karakoc E, Kin Ho C et al (2007) ncRNA discovery and functional identification via sequence motifs. Technical Report TR 1–9
16. Selim GA (1989) The design and analysis of parallel algorithms. ISBN 0-33-23005b-3
17. Seemann SE, Richter AS, Gesell T et al (2011) PETcofold: predicting conserved interactions and structures of two multiple alignments of RNA sequences. Bioinformatics 2:211–219
18. Chitsaz H, Salari R, Sahinalp SC et al (2009) A partition function algorithm for interacting nucleic acid strands. Bioinformatics 25:i365–i373
19. Li AX, Marz M, Qin J et al (2011) RNA–RNA interaction prediction based on multiple sequence alignments. Bioinformatics 4:456–463
Part II
Abstract
Direct sequencing of complementary DNA (cDNA) using high-throughput sequencing technologies (RNA-seq) is widely used and allows for a more comprehensive understanding of the transcriptome than microarrays. In theory, RNA-seq should be able to precisely identify and quantify all RNA species, small or large, at low or high abundance. However, RNA-seq is a complicated, multistep process involving reverse transcription, amplification, fragmentation, purification, adaptor ligation, and sequencing. Improper operations at any of these steps could produce biased or even unusable data. Additionally, RNA-seq intrinsic biases (such as GC bias and nucleotide composition bias) and transcriptome complexity can also make data imperfect. Therefore, comprehensive quality assessment is the first and most critical step for all downstream analyses and results interpretation. This chapter discusses the most widely used quality control metrics including sequence quality, sequencing depth, read duplication rates (clonal reads), alignment quality, nucleotide composition bias, PCR bias, GC bias, rRNA and mitochondria contamination, coverage uniformity, etc.
1 Introduction
Ernesto Picardi (ed.), RNA Bioinformatics, Methods in Molecular Biology, vol. 1269,
DOI 10.1007/978-1-4939-2291-8_8, © Springer Science+Business Media New York 2015
138 Xing Li et al.
1.1 Raw Sequence Quality
Phred quality score (Q) was originally developed by the program Phred to measure base-calling reliability from Sanger sequencing chromatograms [20, 21]. It is defined as Q = −10 × log10(P), where P is the probability of erroneous base calling. For example, a Phred quality score of 30 means the chance that this base is called incorrectly is 1 in 1,000. Although the Phred program is rarely used in the next-generation sequencing field, the Phred or Phred-like quality score has become widely accepted to characterize the quality of DNA sequences (see Note 1). Most often, Phred scores are reported as their corresponding ASCII characters (33–126, or “!” to “~”) (see Note 2 for FASTQ format), but SOLiD still uses numbers to represent quality scores.
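The relation between Q and P can be checked with a few lines of code. This is a minimal sketch of the definition above; the function names are hypothetical, and the default ASCII offset of 33 is an assumption that corresponds to the 33–126 character range mentioned in the text.

```python
import math

def phred_to_error_prob(q):
    """Probability of an erroneous base call for Phred score q: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred score for error probability p: Q = -10 * log10(P)."""
    return -10 * math.log10(p)

def decode_quality(qual, offset=33):
    """Convert a string of ASCII quality characters to integer scores."""
    return [ord(c) - offset for c in qual]
```

Here `phred_to_error_prob(30)` gives 0.001, matching the 1 in 1,000 example, and the characters "!" and "~" decode to the two ends of the range, 0 and 93.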
There is no gold standard to tell if the quality of a particular sequence is good or bad, as this depends on the purpose of the study. For example, compared to expression profiling, variant calling tasks require much higher sequence quality. In general, scores over 30 indicate very good quality, 20–30 indicate reasonably good quality, and <20 indicate poor quality. Parallel boxplots visualize the “per nucleotide quality score” by summarizing Phred qualities for all reads at each position (Fig. 1a) [22, 23]. In addition, one can also calculate the average quality score per read (“per sequence quality score”) and check the quality score distribution of all sequences (Fig. 1b).
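Both summaries can be sketched from already decoded quality scores. The example below is only illustrative: the function names are hypothetical, reads are assumed to have equal length, and the median stands in for the full boxplot statistics that tools such as FastQC report.

```python
from statistics import mean, median

def per_position_median(reads):
    """Median Phred score at each read position (reads must have equal length)."""
    return [median(column) for column in zip(*reads)]

def per_read_mean(reads):
    """Average Phred score of each read ("per sequence quality score")."""
    return [mean(read) for read in reads]

def classify(q):
    """Rule of thumb from the text: >30 very good, 20-30 reasonably good, <20 poor."""
    if q > 30:
        return "very good"
    if q >= 20:
        return "reasonably good"
    return "poor"
```

For two reads with scores `[30, 40, 20]` and `[10, 40, 30]`, the per-position medians are 20, 40, and 25, and the first read has a mean score of 30.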
Fig. 1 (a) Parallel boxplot showing “per nucleotide quality score.” All reads are overlaid together, and then summarize Phred quality score (Y-axis) for each position of read from 5′ to 3′ end (X-axis). (b) “per sequence quality score” distribution. For each read, “per sequence quality score” is calculated as the average Phred quality score (X-axis) across all nucleotides; the Y-axis shows density
1.3 Duplicate Sequences (PCR Duplication)
Read duplication rate is affected by read length, sequencing depth, transcript abundance, and PCR amplification. Supposing the sequencing library is purely random and the read length is 36 bp, the chance of getting a duplicated read is $1/4^{72}$ (or $4.5 \times 10^{-44}$); this chance is still slim even if the sequencing depth reaches hundreds of millions.
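The figure quoted above can be verified numerically. The sketch below only evaluates the expression as it is stated in the text; the function name is hypothetical, and the depth of 500 million reads is an arbitrary stand-in for "hundreds of millions".

```python
from fractions import Fraction

def inverse_power_of_four(exponent):
    """Exact value of 1 / 4^exponent."""
    return Fraction(1, 4 ** exponent)

chance = inverse_power_of_four(72)        # expression quoted in the text
scaled = float(chance) * 500_000_000      # still negligible at an assumed depth of 5 x 10^8

print(f"{float(chance):.2e}")
print(f"{scaled:.2e}")
```

Even after multiplying by the assumed depth, the value stays many orders of magnitude below 1, which is the point the paragraph makes.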
Fig. 2 (a) Diagram showing nucleotide composition bias at the beginning of reads. All reads are overlaid together, and then calculate nucleotide frequency (Y-axis) for each position of read (X-axis). Four nucleotides (A, T, G, C) were indicated using different colors. (b) “per sequence GC content” distribution (X-axis: GC content; Y-axis: density)
1.4 Descriptive Statistics
Mapping statistics are the simplest and most intuitive way to assess whether RNA sequencing was successful. These include mappability (the number of reads aligned to the reference genome), the number of reads aligned to unique locations in the genome, the number of splice-mapped reads, and the number of reads mapped to mitochondria. It is difficult to derive reasonable or even empirical thresholds to determine whether a particular RNA sequencing run was successful, because these metrics depend on read length, sequencing depth, bioinformatic analysis parameters, sample preparation protocol, and tissue type. For example, compared with shorter reads, longer reads will have better mappability, a lower duplication rate, a higher proportion aligned to unique genome locations, and more spliced reads given the same sequencing depth. For the same sequencing depth and same read length, number of splice reads
Assessing RNA-seq data quality 141
1.5 rRNA/tRNA Contamination
The goal of most RNA-seq studies is to interrogate functional messenger RNA (mRNA). However, structural RNAs such as ribosomal RNA (rRNA) and transfer RNA (tRNA) are the most abundant RNA species and constitute 60–90 % of total RNA in a cell. To avoid having these RNAs dominate the sequencing data, it is necessary to remove these RNA species before preparing libraries for deep sequencing. Two approaches have been used to enrich mRNA. The first approach starts with total RNA that has been depleted of rRNA by using a set of oligos that bind to rRNA (such as RiboMinus™), and the second method selects for transcripts by isolating poly-A RNA as the starting material for the construction of sequencing libraries.
Even with ribosome depletion, a fair amount of ribosomal sequences may still remain in the raw data. Small amounts of rRNA contamination will not be a detriment to downstream analyses. However, a larger amount of ribosomal reads usually suggests that rRNA depletion was inefficient or failed, and additional sequencing may be necessary. Assessing rRNA contamination is straightforward: either align reads to the reference genome and then count how many reads map to ribosomal genes, or align reads directly to ribosomal RNA sequences.
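The first option, counting reads that fall inside annotated ribosomal genes, can be sketched as below. The data structures are assumptions made for the example: alignments are reduced to `(chromosome, position)` pairs and rRNA genes to inclusive `(start, end)` intervals per chromosome; a real analysis would read these from the alignment and annotation files.

```python
def rrna_fraction(read_positions, rrna_intervals):
    """Fraction of aligned reads whose position falls inside an rRNA interval.

    read_positions: iterable of (chrom, pos) pairs
    rrna_intervals: dict mapping chrom -> list of inclusive (start, end) tuples
    """
    total = 0
    ribosomal = 0
    for chrom, pos in read_positions:
        total += 1
        if any(start <= pos <= end for start, end in rrna_intervals.get(chrom, [])):
            ribosomal += 1
    return ribosomal / total if total else 0.0
```

A linear scan over the intervals is enough for a sketch; for genome-scale data an interval index would be the more practical choice.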
1.6 Saturation Test of Sequencing Depth
RNA-seq experiments are diverse in their aims and design goals. The amount of sequencing needed for a given sample is determined by the goals of the experiment. For gene expression profiling, where we are interested in finding quantitative differences of known genes between groups, modest sequencing depth is good enough (e.g., 30 million paired-end reads with length >30 bp for mammalian genomes). But for studies that involve investigation of alternative splicing, gene fusion detection, and novel transcript identification, deeper sequencing is required to adequately cover not just the exons but also the exon–exon junctions. The ENCODE consortium recommends a minimum of 100–200 million 2 × 76 bp or longer reads for mammalian genomes.
The saturation test is an approach to determine whether the current sequencing depth is deep enough to satisfy a particular purpose. It is fundamentally important because if sequencing was unsaturated, estimated gene expression metrics such as RPKM (Reads Per Kilobase of exon per Million mapped reads) will be unstable and low-abundance isoforms will remain undiscovered. In practice, we resample 5, 10, …, 100 % of the total mapped reads and RPKMs are
Fig. 3 Saturation test of sequencing depth. (a) Saturation test using RPKM (gene expression measurement). RPKMs were recalculated for each resampled subset (blue dots) to test if RPKM values enter a steady state (or saturated). (b) Saturation test using detected splice junctions (blue: annotated junction, red: novel junction). Horizontal dashed line indicates all annotated junctions encoded in reference gene models. In both panels the X-axis is the resampling percentage; the Y-axis is RPKM in (a) and the number of junctions in (b)
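The resampling idea behind the RPKM saturation test can be sketched as follows. This is an illustration under stated assumptions, not the implementation of any particular tool: each mapped read is reduced to the name of the gene it was assigned to, exon lengths are given in bp, and nested subsets are taken as prefixes of one shuffled read list.

```python
import random
from collections import Counter

def rpkm(read_count, exon_length_bp, total_mapped_reads):
    """Reads per kilobase of exon per million mapped reads."""
    return read_count * 1e9 / (exon_length_bp * total_mapped_reads)

def saturation_curve(read_genes, exon_lengths, percentages=range(5, 101, 5), seed=0):
    """RPKM of every gene recomputed on 5, 10, ..., 100 % of the mapped reads."""
    shuffled = list(read_genes)
    random.Random(seed).shuffle(shuffled)
    curve = {}
    for pct in percentages:
        k = len(shuffled) * pct // 100          # number of reads in this subset
        counts = Counter(shuffled[:k])
        curve[pct] = {
            gene: rpkm(counts[gene], length, k) if k else 0.0
            for gene, length in exon_lengths.items()
        }
    return curve
```

If the RPKM of a gene changes little between, say, 80 % and 100 % of the reads, its estimate can be regarded as saturated at the current depth.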
The use of such replicates comes into play for experiments that
involve comparison of two or more groups for differential expres-
sion analysis. It is recommended to have at least two biological
replicates per group in order to statistically determine the signifi-
cantly differentially expressed genes.
Evaluating the reproducibility between replicates is straightforward. Most often scatter plots are used to visualize the reproducibility between expression measurements such as RPKM or FPKM (Fragments Per Kilobase of exon per Million reads). Logarithmic transformation of RPKM is necessary because of the large dynamic range of RPKM values. After logarithmic transformation, expression values roughly follow a normal distribution, and good replicates show a high Pearson’s correlation coefficient (Fig. 4).
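A minimal sketch of this comparison is given below. The function names are hypothetical, and dropping genes with zero expression in either replicate before taking the logarithm is an assumption of the example (a pseudocount is another common choice).

```python
import math

def pearson(xs, ys):
    """Pearson's correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

def log10_correlation(rep1, rep2):
    """Correlation of log10(RPKM) values; genes with a zero value are skipped."""
    pairs = [(math.log10(a), math.log10(b)) for a, b in zip(rep1, rep2) if a > 0 and b > 0]
    xs, ys = zip(*pairs)
    return pearson(xs, ys)
```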
1.8 Coverage Uniformity
Gene body coverage describes the overall read density over the mRNA regions (both UTR exon and CDS exon). Ideally, each base has the same chance to be sequenced, and each site within the gene body has similar coverage. However, read density profiles can be affected by library preparation protocol, PCR amplification, RNA degradation, genome complexity and the underlying gene
Fig. 4 Scatter plot showing reproducibility between two RNA-seq datasets (technical replicates). Each blue dot represents a gene, and the red dashed line is linear regression line (Y-axis: Replicate-2, RPKM, log10; r = 0.9904)
1.9 Reads Distribution (Intron, Exon, UTR, etc.)
After mapping reads to a reference genome, we can calculate the fraction of reads assigned to exons (including both UTR and CDS exons), introns, and intergenic regions based on the provided gene model. In ideal conditions and for well-annotated organisms, most of the reads in RNA-seq data should be mapped to exonic regions. However, in practice, a considerable number of reads are mapped to intronic or intergenic regions. Apart from mapping artifacts, intergenic/intronic reads mainly come from DNA contamination, pre-mRNAs, new isoforms, or novel transcripts. Some UTR regions are overrepresented (i.e., higher read density) because of DNA repeats or PCR bias, but most often they have lower read density due to RNA degradation. In particular, when a poly-A RNA-seq protocol was used, reads are biased (i.e., overrepresented) in the 3′UTR.
Fig. 5 Coverage uniformity over gene body. All transcripts were scaled into the same length (100 nucleotides) and then reads coverage (Y-axis: read number) was calculated for each position (X-axis: percentile of gene body) from 5′ to 3′ end
2 Notes
References
1. Mortazavi A, Williams BA, McCue K et al (2008) Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods 5:621–628. doi:10.1038/nmeth.1226
2. Marioni JCJ, Mason CEC, Mane SMS et al (2008) RNA-seq: an assessment of technical reproducibility and comparison with gene expression arrays. Gene Dev 18:1509–1517. doi:10.1101/gr.079558.108
3. Wang Z, Gerstein M, Snyder M (2009) RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet 10:57–63. doi:10.1038/nrg2484
4. Wilhelm BT, Landry J-R (2009) RNA-Seq—quantitative measurement of expression through massively parallel RNA-sequencing. Methods 48:249–257. doi:10.1016/j.ymeth.2009.03.016
5. Wang ET, Sandberg R, Luo S et al (2008) Alternative isoform regulation in human tissue transcriptomes. Nature 456:470–476. doi:10.1038/nature07509
6. Katz Y, Wang ET, Airoldi EM, Burge CB (2010) Analysis and design of RNA sequencing experiments for identifying isoform regulation. Nat Methods 7:1009–1015. doi:10.1038/nmeth.1528
7. Trapnell C, Williams BA, Pertea G et al (2010) Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat Biotechnol 28:511–515. doi:10.1038/nbt.1621
8. Cabili MN, Trapnell C, Goff L et al (2011) Integrative annotation of human large intergenic noncoding RNAs reveals global properties and specific subclasses. Gene Dev 25:1915–1927. doi:10.1101/gad.17446611
9. Guttman M, Garber M, Levin JZ et al (2010) Ab initio reconstruction of cell type-specific transcriptomes in mouse reveals the conserved multi-exonic structure of lincRNAs. Nat Biotechnol 28:503–510. doi:10.1038/nbt.1633
10. Prensner JRJ, Iyer MKM, Balbin OAO et al (2011) Transcriptome sequencing across a prostate cancer cohort identifies PCAT-1, an unannotated lincRNA implicated in disease progression. Nat Biotechnol 29:742–749. doi:10.1038/nbt.1914
11. Kannan K, Wang L, Wang J et al (2011) Recurrent chimeric RNAs enriched in human prostate cancer identified by deep sequencing. Proc Natl Acad Sci U S A 108:9172–9177. doi:10.1073/pnas.1100489108
12. Pflueger D, Terry S, Sboner A et al (2011) Discovery of non-ETS gene fusions in human prostate cancer using next-generation RNA sequencing. Gene Dev 21:56–67. doi:10.1101/gr.110684.110
13. Edgren H, Murumagi A, Kangaspeska S et al (2011) Identification of fusion genes in breast cancer by paired-end RNA-sequencing. Genome Biol 12:R6. doi:10.1186/gb-2011-12-1-r6
14. Peng ZZ, Cheng YY, Tan BC-MB et al (2012) Comprehensive analysis of RNA-Seq data reveals extensive RNA editing in a human transcriptome. Nat Biotechnol 30:253–260. doi:10.1038/nbt.2122
15. Bahn JHJ, Lee J-HJ, Li GG et al (2012) Accurate identification of A-to-I RNA editing in human by transcriptome sequencing. Gene Dev 22:142–150. doi:10.1101/gr.124107.111
16. Park EE, Williams BB, Wold BJB, Mortazavi AA (2012) RNA editing in the human ENCODE RNA-seq data. Gene Dev 22:1626–1633. doi:10.1101/gr.134957.111
17. Ramaswami G, Zhang R, Piskol R et al (2013) Identifying RNA editing sites using RNA sequencing data alone. Nat Methods. doi:10.1038/nmeth.2330
18. Benjamini Y, Speed TP (2012) Summarizing and correcting the GC content bias in high-throughput sequencing. Nucleic Acids Res 40:e72. doi:10.1093/nar/gks001
19. Hansen KD, Brenner SE, Dudoit S (2010) Biases in Illumina transcriptome sequencing caused by random hexamer priming. Nucleic Acids Res 38:e131. doi:10.1093/nar/gkq224
20. Ewing B, Hillier L, Wendl MC, Green P (1998) Base-calling of automated sequencer traces using phred. I. Accuracy assessment. Genome Res 8(3):175–85
21. Ewing B, Green P (1998) Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Res 8(3):186–94
22. Babraham Bioinformatics – FastQC a quality control tool for high throughput sequence data. http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
23. Wang L, Wang S, Li W (2012) RSeQC: quality control of RNA-seq experiments. Bioinformatics. Oxford, England. doi:10.1093/bioinformatics/bts356
24. Levin JZ, Yassour M, Adiconis X et al (2010) Comprehensive comparative analysis of strand-specific RNA sequencing methods. Nat Methods. doi:10.1038/nmeth.1491
25. Cock PJ, Fields CJ, Goto N, Heuer ML, Rice PM (2010) The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants. Nucleic Acids Res. 38(6):1767-71. doi:10.1093/nar/gkp1137
Chapter 9
Abstract
Mapping RNA-Seq data to a genome is not the same as mapping DNA-Seq data, because junction reads span two exons and have no identical match in the reference genome. In this chapter, we describe a junction read aligner, SpliceMap, that is based on an algorithm of “half-read seeding” and “seeding extension.” Four analysis steps are integrated in SpliceMap (half-read mapping, seeding selection, seeding extension and junction search, and paired-end filtering), and all tuning parameters of these steps are editable in a single configuration file. Thus, SpliceMap can be executed by a single command. While we describe the analysis steps of SpliceMap, we illustrate how to choose the parameters according to the research interest and RNA-Seq data quality, using human brain RNA-Seq data as an example.
Key words Junction reads, Junction, SpliceMap, Half-read seeding, Seeding extension
1 Introduction
Ernesto Picardi (ed.), RNA Bioinformatics, Methods in Molecular Biology, vol. 1269,
DOI 10.1007/978-1-4939-2291-8_9, © Springer Science+Business Media New York 2015
148 Kin Fai Au
2 Material
2.1 RNA-Seq Dataset
In order to illustrate the procedure of SpliceMap, we use a human brain RNA-Seq dataset from Illumina’s Human BodyMap 2.0 project. It was released with Ensembl (release 62). The RNA was from a 77-year-old Caucasian female. This RNA-Seq dataset contains 80,946,860 50 bp paired-end reads in FASTQ format. Each read in FASTQ format contains four lines:
@HWI-BRUNOP16X_0001:3:1:9465:1024#0/1
TTAACAGTCGAGAGTGTGCTGAGAACTTAGACGGGATTTGGTAGGCCAAG
+HWI-BRUNOP16X_0001:3:1:9465:1024#0/1
TQFLTSSTTTggggfJKTSUcccccgggggggeggggggggfgffgggge
The first line is the read name and begins with the symbol “@.” The second line is the raw sequence. The third line begins with the symbol “+” and holds extra information that was added during sequencing or preliminary analysis. In this example, the third line is just a copy of the read name. The fourth line contains the quality score of each base. The quality score encodes the probability that the corresponding base call is incorrect. There are three types of quality scores: phred-33, phred-64, and Solexa formats. Phred-33 format encodes Phred quality scores from 0 to 93 using the ASCII characters from 33 to 126. Phred-64 format encodes Phred quality scores from 0 to 62 using the ASCII characters from 64 to 126. Solexa format encodes Solexa quality scores from −5 to 62 using the ASCII characters from 59 to 126. Each character in the quality score line represents an integer sequencing quality, which can be converted to the probability of a false base call. Please be aware that the same character in the three quality score formats can encode different probabilities of a false base call. Therefore, SpliceMap requires a proper setting of the quality score format in the configuration file. Our human brain data set uses phred-64 format (see Subheading 3).
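The effect of the quality format can be illustrated on the example read above. This is a sketch under stated assumptions: the function names are hypothetical, and the Solexa conversion uses the odds-based definition from the FASTQ format description (reference 25 of the previous chapter), in which the error probability is 1 / (1 + 10^(Q/10)).

```python
# ASCII offsets of the three quality formats described in the text
OFFSETS = {"phred-33": 33, "phred-64": 64, "solexa": 64}

RECORD = [
    "@HWI-BRUNOP16X_0001:3:1:9465:1024#0/1",
    "TTAACAGTCGAGAGTGTGCTGAGAACTTAGACGGGATTTGGTAGGCCAAG",
    "+HWI-BRUNOP16X_0001:3:1:9465:1024#0/1",
    "TQFLTSSTTTggggfJKTSUcccccgggggggeggggggggfgffgggge",
]

def parse_fastq_record(lines):
    """Split one four-line FASTQ record into (name, sequence, quality string)."""
    name, seq, plus, qual = (line.rstrip("\n") for line in lines)
    if not name.startswith("@") or not plus.startswith("+") or len(seq) != len(qual):
        raise ValueError("malformed FASTQ record")
    return name[1:], seq, qual

def error_probabilities(qual, quality_format):
    """Probability of a false base call for each character of a quality string."""
    scores = [ord(c) - OFFSETS[quality_format] for c in qual]
    if quality_format == "solexa":
        return [1 / (1 + 10 ** (q / 10)) for q in scores]
    return [10 ** (-q / 10) for q in scores]
```

The first quality character, "T", corresponds to an error probability of 0.01 under phred-64 but to a far smaller one under phred-33, which is why the format must be set correctly.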
2.2 Reference Genome
In this chapter, we use hg19 as the reference human genome, which is in FASTA format, with each chromosome in a separate
RNA-Seq Mapping 149
3 Methods
A junction read spans two exons and thus its sequence can be split into two exonic fragments. Mapping a junction read is the process of finding where to split the read and how to map the two exonic fragments to the genome. Intuitively, at least one of the two exonic fragments is at least half as long as the original junction read. Therefore, at least one half of a junction read comes from a single exon and can be mapped directly to the genome. This mapping hit becomes a seeding that narrows the search region for the remaining sequence. The second part is to extend the seeding alignment to complete the mapping of the corresponding exonic fragment and then find the mapping of the other exonic fragment in a small neighboring region on the same chromosome (Fig. 1a). SpliceMap contains four steps: half-read mapping, seeding selection, seeding extension and junction search, and paired-end filtering (Fig. 1b).
SpliceMap integrates all steps in an easy one-button tool. The parametric settings of each step are simply editable in the configuration file “run.cfg.” In order to use SpliceMap on your own data, you should follow this protocol:
1. Obtain the genome files in the format “chr1.fa, chr2.fa, …” and also the corresponding Bowtie index (see Subheading 2.2).
2. Create an empty directory; this will be the working directory.
3. Copy “run.cfg” from the SpliceMap package to the working directory.
4. Edit run.cfg to include paths to your data files and genome directories.
5. Edit the parametric settings for each step, according to the data quality and research interest.
6. Execute “runSpliceMap run.cfg” while in your working directory.
7. After a certain time, execution will conclude. You can find the results in the “output” directory.
At first, the information of SpliceMap input (data and genome)
should be edited correctly in run.cfg. The location of the reference
genome is required:
# Directory of the chromosome files in FASTA format
# Each chromosome should be in a separate file (can be concatenated)
# ie. chr1.fa, chr2.fa, …
# (single value)
genome_dir = /home/kinfai/genome/hg19/
# (optional)
# (single value)
chromosome_wildcard = chr*.fa
The other input of SpliceMap is the RNA-Seq data. The loca-
tion, data format and the quality format (see Subheading 2.1)
should be defined (see Note 2):
# These are the two lists of sequencer reads files.
# "reads_list2" can be commented out if reads are not paired-end.
# Make sure the order of both lists are the same!
# Also, "reads_list1" must be the first mate of the pair of reads.
# Note: pair-reads should be in the "forward-reverse" format.
# (multiple values)
> reads_list1
/scratch1/0-2-kinfai/50bp_mRNA_Seq/FCA/s_3_1_sequence.txt
<
> reads_list2
/scratch1/0-2-kinfai/50bp_mRNA_Seq/FCA/s_3_2_sequence.txt
<
# Format of the sequencer reads, also make sure reads are
# not split over multiple lines.
# Choices are: FASTA, FASTQ, RAW
# (single value)
read_format = FASTQ
# Format of the quality string if FASTQ is used
# Choices are:
# phred-33--Phred base 33 (same as Sanger format) [default]
# phred-64--Phred base 64 (same as Illumina 1.3+)
# solexa --Format used by solexa machines
# (single value)
quality_format = phred-64
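To check a configuration before a long run, the file can be read back programmatically. The parser below is a hypothetical sketch inferred only from the excerpts shown above (comment lines starting with "#", `key = value` settings, and lists enclosed between "> name" and "<"); it is not SpliceMap's own parser and may not cover every construct of a real run.cfg.

```python
def parse_cfg(text):
    """Return (settings, lists) parsed from run.cfg-style text."""
    settings, lists, current = {}, {}, None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue                      # skip blank lines and comments
        if line.startswith(">"):
            current = line[1:].strip()    # start of a multi-value list
            lists[current] = []
        elif line == "<":
            current = None                # end of the current list
        elif current is not None:
            lists[current].append(line)
        elif "=" in line:
            key, value = line.split("=", 1)
            settings[key.strip()] = value.strip()
    return settings, lists
```

A simple consistency check is then possible, for instance that `reads_list1` and `reads_list2` contain the same number of files and that `quality_format` is one of the three documented choices.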
The other parametric settings in run.cfg tune the performance of the four steps of SpliceMap. They are illustrated in the following subsections using the human brain RNA-Seq data as an example. Some parameters are required while others are optional. If optional parameters start with the symbol “#,” they are commented out.
3.1 Half-Read Mapping
Many Second Generation Sequencing platforms provide reads that are not shorter than 50 bp. Thus, the half reads are not shorter than 25 bp, which is long enough for reliable alignment to the genome by regular short read aligners. SpliceMap is compatible with Bowtie, ELAND, and SeqMap for this mapping. In this example, we select Bowtie:
# The short reads mapper used
# choices are "bowtie", "eland" or "seqmap"
# Bowtie is recommended
# this can be commented out if the mapping has already
been done and the
3.2 Seeding Selection
The mapping hits of the half reads are used to narrow the search region of junction read mapping. But not all mapping hits of half reads are used. Because there exist many repeat regions in the genome, especially in mammalian genomes, a half read may be mapped to multiple places. If all mapping hits are passed to the downstream mapping process, some half reads with extremely many repeats could increase the computational intensity greatly. To balance the running time and sensitivity, we need to set a tolerance limit of multiple mappability, which is determined by the complexity of the target genome. According to the experience of human and mouse RNA-seq analyses, we set the number of multi-hits not higher than 10:
# Maximum number of multi-hits
# If a 25-mer seed has more than this many multi-hits, it will be discarded.
# (optional) Default is 10
# (single value)
max_multi_hit = 10
If two repeat regions are very close to each other, the half-read mapping hits from the two regions are likely to cross over each other in the downstream mapping process. This could introduce many incorrect mapping hits of junction reads. We therefore need to set a constraint on the distance between mapping hits from the same half read. Since the downstream mapping process is only performed within a region of the 99.9th percentile of intron size (400,000 bp in human, see Fig. 2 and Subheading 3.3) around the half-read mapping hits, we exclude those hits that are within this region:
# Maximum intron size, this is absolute 99th-percentile maximum.
# Introns beyond this size will be ignored.
# (optional) If you don't set this, we will assume a mammalian genome (400,000)
# (single value)
max_intron = 400000
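The two selection rules of this subsection can be sketched together. The code is an interpretation of the description above, not SpliceMap's implementation: a half read is assumed to be represented by a list of `(chromosome, position)` hits, a half read with more than `max_multi_hit` hits is discarded, and any hit that has another hit of the same half read within `max_intron` bp on the same chromosome is excluded.

```python
def select_seeds(hits, max_multi_hit=10, max_intron=400000):
    """Filter the mapping hits of one half read.

    hits: list of (chrom, pos) tuples for a single half read (assumed layout).
    """
    if len(hits) > max_multi_hit:
        return []                         # too repetitive: discard the seed
    kept = []
    for i, (chrom, pos) in enumerate(hits):
        crowded = any(
            j != i and other_chrom == chrom and abs(other_pos - pos) <= max_intron
            for j, (other_chrom, other_pos) in enumerate(hits)
        )
        if not crowded:
            kept.append((chrom, pos))
    return kept
```

A half read with hits at chr1:1,000, chr1:200,000, and chr2:5,000 would keep only the chr2 hit under this interpretation, because the two chr1 hits lie within 400,000 bp of each other.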
3.3 Seeding Extension and Junction Search
The half-read mapping hit is extended base by base until it reaches the canonical splice signal GT-AG. The extended hit is considered a possible hit of one exonic fragment of the junction read. Then, the other exonic fragment is searched for in the neighboring regions. The two exonic fragments of a junction read are separated by an intron, so the search region is narrowed down to the 99.9th percentile of intron size (400,000 bp in human, see Fig. 2 and Subheading 3.2).
Based on the data quality, we set a mismatch tolerance for the entire read mapping. This value should be determined by the error rate (please see the chapter “Quality controls of massive RNA sequencing data”) but should not be smaller than seed_mismatch (see Subheading 3.1). If seed_mismatch is bigger than read_mismatch, then SpliceMap will output an empty result.
# Maximum number of mismatches allowed in entire read
# No limit on value, however SpliceMap can only identify reads with
# a maximum of 2 mismatches per 25bp.
# Default is 2.
# (optional)
# (single value)
read_mismatch = 2
Some very long reads may have a few low-quality bases at the two ends, while the remaining part is long and can be mapped reliably. To rescue these reads from the constraint of read_mismatch, an extra tolerance of unmappable termini is set. This parameter is different from the clipping parameters in Subheading 3.1. Base clipping should be used if some terminal bases are of very low quality for the majority of reads. If only a small proportion of reads have quality problems at their termini (especially for very long reads), then max_clip_allowed is used instead. In this chapter we use 50 bp reads as an example, so this optional parameter is commented out by starting with a symbol “#”:
Fig. 2 The cumulative distribution function of intron lengths in human genome (RefSeq annotation)
3.4 Paired-End Filtering
Paired-end information can help to exclude some false mapping hits. If paired-end data are input, the paired-end filter is applied by default. First, three types of “good hits” are defined. If the two halves of a full read are mapped to two locations that differ by exactly the half-read length, this mapping hit of the full read is called an “exonic hit.” If a half-read mapping can be extended to a reasonable length (not shorter than the full length minus 10), this mapping hit is defined as an “extension hit.” If a read can be mapped as a junction read by the processes above, then it is a “junction hit.” Three constraints are used for paired-end filtering:
1. Both hits from a pair of reads are good hits.
2. The mapping hits from a pair of reads are on the same chromosome and are not separated by more than the 99.9th percentile of intron size (400,000 bp in human).
3. The directions and the positional order of the mapping hits from a pair of reads are consistent with the sequencing platform setting.
If the mapping hits of both mates of a pair of reads pass the three constraints, then they are reported in a SAM file.
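The three constraints can be expressed as one predicate. This is a sketch under stated assumptions: the `Hit` record and its field names are hypothetical, and the third constraint is interpreted as the "forward-reverse" orientation mentioned in the configuration comments (the upstream mate on the "+" strand and the downstream mate on the "−" strand); other platform settings would need a different rule.

```python
from dataclasses import dataclass

GOOD_KINDS = {"exonic", "extension", "junction"}

@dataclass
class Hit:
    chrom: str
    pos: int
    strand: str   # "+" or "-"
    kind: str     # "exonic", "extension", "junction", or anything else

def passes_pair_filter(mate1, mate2, max_intron=400000):
    """Apply the three paired-end constraints to the hits of two mates."""
    # 1. Both hits are good hits.
    if mate1.kind not in GOOD_KINDS or mate2.kind not in GOOD_KINDS:
        return False
    # 2. Same chromosome, not further apart than the maximum intron size.
    if mate1.chrom != mate2.chrom or abs(mate1.pos - mate2.pos) > max_intron:
        return False
    # 3. Orientation and order (assumed forward-reverse layout).
    left, right = sorted((mate1, mate2), key=lambda hit: hit.pos)
    return left.strand == "+" and right.strand == "-"
```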
3.5 Mapping Results: SAM File
The mapping results are written by SpliceMap to a SAM (Sequence Alignment/Map) file, “good_hits.sam” (see Note 3). If a read is multiply mapped, then there will be multiple entries in the SAM file. Each line of the file good_hits.sam contains the following columns, separated by tabs:
QNAME | FLAG | RNAME | POS | MAPQ | CIGAR | MRNM | MPOS | ISIZE | SEQ | QUAL | OPTIONAL
where QNAME is the name of each query copied from the FASTA or FASTQ file. A unique index is attached to the name;
FLAG is an integer that represents the information of the read. The explanation of each flag can be found at http://picard.sourceforge.net/explain-flags.html;
RNAME is the name of the reference sequence on which the mapping hit is located. For example, the reference sequences of hg19 have chromosome names such as “chr1, chr2,…”;
POS is the mapping location;
MAPQ is 255 if uniquely mapped and 0 if multiply mapped;
CIGAR is the mapping pattern (details can be found in http://samtools.sourceforge.net/SAM1.pdf);
MRNM and MPOS are the name of the reference sequence and the position at which the mapping hit of the mate read (the other read of the pair) is located.
ISIZE is the distance between the mapping hits of the two mates of the paired-end reads.
SEQ and QUAL are the raw sequence and its base quality string.
OPTIONAL contains the following items:
XS—Strand of junction
XC—Number of bases clipped
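The column layout above can be read programmatically. The following is a minimal Python sketch, not part of SpliceMap, that splits one alignment line into the named columns. The example record is invented for illustration, and the TAG:TYPE:VALUE spelling of the optional fields (e.g., XS:A:-) is assumed from the general SAM convention rather than taken from SpliceMap output.

```python
from typing import Any, Dict

# Column names as listed in the text for good_hits.sam.
SAM_COLUMNS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
               "MRNM", "MPOS", "ISIZE", "SEQ", "QUAL"]
INT_COLUMNS = {"FLAG", "POS", "MAPQ", "MPOS", "ISIZE"}


def parse_sam_line(line: str) -> Dict[str, Any]:
    """Split one tab-separated alignment line into named fields."""
    parts = line.rstrip("\n").split("\t")
    if len(parts) < len(SAM_COLUMNS):
        raise ValueError("expected at least 11 tab-separated columns")
    record: Dict[str, Any] = {}
    for name, value in zip(SAM_COLUMNS, parts):
        record[name] = int(value) if name in INT_COLUMNS else value
    # Optional items (assumed TAG:TYPE:VALUE form), e.g. XS and XC.
    record["OPTIONAL"] = {}
    for tag in parts[len(SAM_COLUMNS):]:
        key, _type, value = tag.split(":", 2)
        record["OPTIONAL"][key] = value
    # Per the text: MAPQ is 255 for unique hits and 0 for multiple hits.
    record["unique"] = record["MAPQ"] == 255
    return record


# Hypothetical record, invented for illustration only.
example = ("read1_0\t99\tchr1\t14362\t255\t20M1000N30M\t=\t15800\t1488\t"
           + "A" * 50 + "\t" + "I" * 50 + "\tXS:A:-\tXC:i:0")
rec = parse_sam_line(example)
print(rec["RNAME"], rec["POS"], rec["unique"], rec["OPTIONAL"])
```

Converting the numeric columns to integers at parse time makes later filters (for instance on MAPQ or ISIZE) simple comparisons.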
3.6 Junction Detection from Junction Read Alignment
The mapping hits of junction reads can be converted to junction detection and reported in a bed file “junction_color.bed.” The bed format is described in http://genome.ucsc.edu/FAQ/FAQformat.html#format1. SpliceMap tags each junction with a snippet of text describing the reliability of the junction (the fourth column in bed file). The format of this snippet is
(nR)[width_nNR](nUR/nMR)
nR—Number of reads supporting this junction.
width—Range of different right lengths supporting this junction,
the larger the better.
nNR—Number of nonredundant reads supporting this junction.
nUR—Number of uniquely mappable reads supporting this
junction.
nMR—Number of multiply mappable reads supporting this
junction.
Using the parameters above, a number of junction filters packaged with SpliceMap can control the specificity of the reported junctions. Different combinations of filters can be applied to tune the balance between specificity and sensitivity. These filters only work on the junctions (.bed files).
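To make the reliability snippet concrete, here is a small Python sketch that parses a tag of the form (nR)[width_nNR](nUR/nMR) and applies a simple threshold. It is an illustration only: the literal punctuation is assumed from the template above, the example tag is invented, and the helper `keep_junction` with its ">= threshold" semantics is a hypothetical stand-in, not the logic of the filters shipped with SpliceMap.

```python
import re
from typing import NamedTuple


class JunctionTag(NamedTuple):
    n_reads: int          # nR
    width: int            # width
    n_nonredundant: int   # nNR
    n_unique: int         # nUR
    n_multi: int          # nMR


# Literal layout assumed from the template (nR)[width_nNR](nUR/nMR).
_TAG_RE = re.compile(r"^\((\d+)\)\[(\d+)_(\d+)\]\((\d+)/(\d+)\)$")


def parse_junction_tag(tag: str) -> JunctionTag:
    """Parse the fourth bed column written for each junction."""
    match = _TAG_RE.match(tag.strip())
    if match is None:
        raise ValueError(f"unrecognised junction tag: {tag!r}")
    return JunctionTag(*(int(group) for group in match.groups()))


def keep_junction(tag: JunctionTag, min_nnr: int = 2,
                  require_unique: bool = True) -> bool:
    """Hypothetical filter: enough nonredundant reads, and optionally
    at least one uniquely mappable supporting read."""
    if tag.n_nonredundant < min_nnr:
        return False
    return tag.n_unique > 0 or not require_unique


# Invented example tag.
tag = parse_junction_tag("(12)[30_8](10/2)")
print(tag, keep_junction(tag))
```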
3.7 Results from Human Brain RNA-Seq Data
Using the settings above (6 threads for Bowtie and 3 cores for SpliceMap), it took 69,580.6 s (19 h) to finish all SpliceMap steps. In total, 134,961,718 mapping hits from 127,349,786 reads are
reported in good_hits.sam. The mappable rate is 78.66 %
(127,349,786/(2 × 80,946,860)). Then, 193,127 nonredundant
junctions (junction_color.bed) are found from these mapping hits.
190,487 junctions pass the filter uniqueJunctionFilter:
uniqueJunctionFilter junction_color.bed junction_nUM_color.bed
Or we can apply nnrFilter:
nnrFilter junction_color.bed junction_nNR_color.bed 2
139,190 junctions remain after the nonredundant read filter.
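The summary figures quoted in this subsection can be re-derived with a few lines of Python. This is only a worked check of the arithmetic using the numbers reported in the text; the variable names are chosen here for illustration.

```python
# Numbers as reported in the text.
read_pairs = 80_946_860
mapped_reads = 127_349_786
mapping_hits = 134_961_718
junctions_total = 193_127
junctions_unique_filter = 190_487
junctions_nnr_filter = 139_190
runtime_seconds = 69_580.6

total_reads = 2 * read_pairs                      # two mates per pair
mappable_rate = 100 * mapped_reads / total_reads  # percentage of mapped reads
hits_per_read = mapping_hits / mapped_reads       # > 1 because of multiple hits
unique_kept = 100 * junctions_unique_filter / junctions_total
nnr_kept = 100 * junctions_nnr_filter / junctions_total
runtime_hours = runtime_seconds / 3600

print(f"mappable rate: {mappable_rate:.2f} %")
print(f"hits per mapped read: {hits_per_read:.3f}")
print(f"junctions kept by uniqueJunctionFilter: {unique_kept:.1f} %")
print(f"junctions kept by nnrFilter: {nnr_kept:.1f} %")
print(f"runtime: {runtime_hours:.1f} h")
```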
4 Notes
input, the files listed under “> reads_list1” and “> reads_list2”
should be in the consistent order. Here is an example:
# These are the two lists of sequencer reads files.
# "reads_list2" can be commented out if reads are not paired-end.
# Make sure the order of both lists are the same!
# Also, "reads_list1" must be the first pair.
# Note: pair-reads should be in the "forward-reverse" format.
# (multiple values)
> reads_list1
/scratch1/0-2-kinfai/50bp_mRNA_Seq/FCA/s_3_1_sequence.txt
/scratch1/0-2-kinfai/50bp_mRNA_Seq/FCA/s_4_1_sequence.txt
/scratch1/0-2-kinfai/50bp_mRNA_Seq/FCA/s_1_1_sequence.txt
<
> reads_list2
/scratch1/0-2-kinfai/50bp_mRNA_Seq/FCA/s_3_2_sequence.txt
/scratch1/0-2-kinfai/50bp_mRNA_Seq/FCA/s_4_2_sequence.txt
/scratch1/0-2-kinfai/50bp_mRNA_Seq/FCA/s_1_2_sequence.txt
<
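Because the two lists must be in the same order, it can be worth checking a configuration before a long run. The Python sketch below parses the "> name ... <" blocks shown in the example and verifies that the mates pair up. It is not part of SpliceMap, and the pairing test assumes the Illumina-style naming of the example, in which mate files differ only by "_1_sequence" versus "_2_sequence".

```python
from typing import Dict, List


def parse_lists(text: str) -> Dict[str, List[str]]:
    """Collect the values between '> name' and '<', skipping '#' comments."""
    lists: Dict[str, List[str]] = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith(">"):
            current = line[1:].strip()
            lists[current] = []
        elif line == "<":
            current = None
        elif current is not None:
            lists[current].append(line)
    return lists


def mates_consistent(first: List[str], second: List[str]) -> bool:
    """Assumed naming rule: mates differ only in _1_sequence vs _2_sequence."""
    if len(first) != len(second):
        return False
    return all(a.replace("_1_sequence", "_2_sequence") == b
               for a, b in zip(first, second))


config = """
# (multiple values)
> reads_list1
/scratch1/0-2-kinfai/50bp_mRNA_Seq/FCA/s_3_1_sequence.txt
/scratch1/0-2-kinfai/50bp_mRNA_Seq/FCA/s_4_1_sequence.txt
<
> reads_list2
/scratch1/0-2-kinfai/50bp_mRNA_Seq/FCA/s_3_2_sequence.txt
/scratch1/0-2-kinfai/50bp_mRNA_Seq/FCA/s_4_2_sequence.txt
<
"""
lists = parse_lists(config)
print(mates_consistent(lists["reads_list1"], lists["reads_list2"]))
```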
References
1. Pruitt KD, Tatusova T, Maglott DR (2005) paired-end RNA-seq data by SpliceMap.
NCBI reference sequences (RefSeq): a curated Nucleic Acids Res 38:4570–4578
non-redundant sequence database of genomes, 4. Langmead B, Trapnell C, Pop M, Salzberg SL
transcripts and proteins. Nucleic Acids Res 33: (2009) Ultrafast and memory-efficient align-
501–504 ment of short DNA sequences to the human
2. Curwen V, Eyras E, Andrews TD, Clarke L, genome. Genome Biol 10:R25
Mongin E, Searle SM, Clamp M (2004) The 5. Jiang H, Wong WH (2008) SeqMap: mapping
Ensembl automatic gene annotation system. massive amount of oligonucleotides to the
Genome Res 14:942–950 genome. Bioinformatics 24:2395–2396
3. Au KF, Jiang H, Lin L, Xing Y, Wong WH
(2010) Detection of splice junctions from
Chapter 10
Abstract
Massive Parallel Sequencing methods (MPS) can extend and improve the knowledge obtained by conven-
tional microarray technology, both for mRNAs and noncoding RNAs. Although RNA quality and library
preparation protocols are the main source of variability, the bioinformatics pipelines for RNA-seq data
analysis are very complex and the choice of different tools at each stage of the analysis can significantly
affect the overall results. In this chapter we describe the pipelines we use to detect miRNA and mRNA
differential expression.
Key words Coding genes, Differential expression, Annotated genes, RNAseq workflow
1 Introduction
1.1 RNA-Seq Criticalities
The biological question is the first and probably the most important point to be considered. Although RNA-seq provides a massive amount of information, if the experimental setting is not driven by a clear biological question, the extraction of valuable biological knowledge could be very challenging. The biological question is very tightly connected with the experimental design, since only in
Ernesto Picardi (ed.), RNA Bioinformatics, Methods in Molecular Biology, vol. 1269,
DOI 10.1007/978-1-4939-2291-8_10, © Springer Science+Business Media New York 2015
164 Raffaele A. Calogero and Francesca Zolezzi
2 Materials
2.1 Deep Sequencing Data
All the cDNA libraries were subjected to indexed sequencing runs on an Illumina HiSeq 2000. We used 51 nts single-end reads for miRNA-seq because the use of 50 nts long reads makes the detection and removal of linkers easier. In the case of mRNA-seq we use paired-end sequencing runs of 2 × 51 cycles as the best compromise between sequencing cost and sequence specificity. Sequence directionality can provide valuable information, especially when whole transcriptome analysis is performed, i.e., rRNA depletion followed by sequencing of coding and noncoding RNAs.
2.2 Computational Hardware
Being a relatively small lab, we preferred a multicore solution to a cluster solution for RNA-seq analysis, since managing multiple cores on a single machine is easier than managing a cluster. We run our analyses on two AMD machines (64/48 cores), each with 512 GB of RAM and running SUSE Linux Enterprise 11/10. We also decided to invest significantly in storage. Our storage is made of 30 × 2 TB SATA disks and 6 × 2 TB hot spare disks configured as RAID 1 + 0, which provides reasonable protection against hardware failure for the data under analysis. We also have 6 × 2 TB SATA disks configured as RAID 5 for long-term storage of raw data, i.e., fastq files.
2.3 Computational Software
As the aligner for RNA-seq projects we use STAR because of its high performance [10]. We use SHRiMP [11] for miRNA-seq projects, because it has specific alignment parameters for miRNAs and it was one of the aligners showing the best sensitivity in miRNA-seq benchmark experiments [4].
As the reference for miRNA alignment we use the latest version of the miRNA hairpins from miRBase (mirbase.org). The fasta file encompassing the hairpins of interest can be easily generated with the R script shown below:
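# Download the current miRBase hairpin sequences and decompress them
download.file("ftp://mirbase.org/pub/mirbase/CURRENT/hairpin.fa.gz", "hairpin.fa.gz", mode="wb")
system("gunzip hairpin.fa.gz")
library(Rsamtools)
library(GenomicRanges)
library(Biostrings)
# Read the hairpins (RNA alphabet) and keep the human (hsa) entries
hairpins <- readRNAStringSet("hairpin.fa")
hsa <- hairpins[grep("hsa", names(hairpins))]
# Convert U to T so that the reference uses the DNA alphabet
hsa <- DNAStringSet(hsa)
# Write the reference in FASTA format
writeXStringSet(hsa, "hsa.fa", format="fasta")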
Fig. 1 Table browser at the UCSC Genome Browser. The table browser allows annotation data to be exported in various formats. Shown here is the table selection used to generate a GTF file for the human transcriptome
3 Methods
3.1 Preprocessing
Fastq files are the standard output of high-throughput sequencers, including Illumina sequencers. Before aligning them to the reference genome it is important to check the overall quality of the data. This can be done using FastQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/), which is a stand-alone Java tool that allows the quality check of fastq as well as bam files.
For miRNA data analysis it is essential to remove linkers from the fastq data. There are multiple tools to do it. We normally use the perl script provided by the mirTools suite (http://centre.bioinformatics.zj.cn/mirtools/adaptortrim.php).
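To illustrate what linker removal does to a read, here is a minimal Python sketch of 3′ adapter trimming by exact matching. It is not the mirTools script: the function name, the minimum overlap of 8 bases, and the adapter sequence used in the example are assumptions made for illustration, and real trimmers also tolerate sequencing errors in the adapter.

```python
from typing import Tuple


def trim_adapter(read: str, qual: str, adapter: str,
                 min_overlap: int = 8) -> Tuple[str, str]:
    """Remove the 3' adapter (and anything after it) from a read.

    A full adapter is searched anywhere in the read; otherwise a partial
    adapter (at least min_overlap bases) is searched at the read end.
    """
    idx = read.find(adapter)
    if idx == -1:
        longest = min(len(adapter) - 1, len(read))
        for k in range(longest, min_overlap - 1, -1):
            if read.endswith(adapter[:k]):
                idx = len(read) - k
                break
    if idx == -1:
        return read, qual          # no adapter found, read left untouched
    return read[:idx], qual[:idx]  # sequence and qualities stay in sync


# Illustrative sequences (assumed, not taken from the chapter).
adapter = "TGGAATTCTCGGGTGCCAAGG"
mirna = "TGAGGTAGTAGGTTGTATAGTT"
read = mirna + adapter + "AACTCCAG"
trimmed, trimmed_qual = trim_adapter(read, "I" * len(read), adapter)
print(len(read), "->", len(trimmed), trimmed)
```

With 51-nt reads, an insert of about 22 nt leaves room for the complete linker, which is why the full-adapter search usually succeeds before the partial-match fallback is needed.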
3.2 Sequence Alignment
The first step in an RNA-seq pipeline is the alignment of the fastq data to a reference dataset. miRNA data alignment is done with SHRiMP [11]. The reference database for the alignment is created with the following code:
Transcriptome Quantification 167
3.3 Postprocessing
In the case of miRNA-seq the sam file generated by SHRiMP is filtered to keep only the alignments with at least 16 contiguous perfect matches. The script below associates the number of detected counts to each miRNA present in the human hairpins dataset:
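library(Rsamtools)
library(Biostrings)
# Convert the SHRiMP SAM output to a sorted, indexed BAM file (my.bam)
asBam("my.sam", "my")
# Load only the reference name and the CIGAR string of each alignment
mybam <- scanBam("my.bam", param=ScanBamParam(what=c("rname","cigar")))
# Split each CIGAR on soft-clipping (S) and keep the length of the M block
cigar <- strsplit(mybam[[1]]$cigar, "S")
cigar1 <- sapply(cigar, function(x){
good <- sub("M","",x[grep("M",x)])
})
cigar1 <- as.numeric(cigar1)
# Keep the hairpin names of alignments with at least 16 contiguous matches
reads <- as.character(mybam[[1]]$rname[which(cigar1 >= 16)])
# Human hairpins used as the reference set of miRNAs
hairpins <- readRNAStringSet("hairpin.fa")
hsa <- hairpins[grep("hsa", names(hairpins))]
hsa <- DNAStringSet(hsa)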
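The same filtering idea can be sketched in Python directly on SAM text, which may help when checking the R script on a few lines. This is an illustrative alternative, not the authors' code: the function names are hypothetical, the input lines are invented, and an "M" run in a CIGAR string is treated as a contiguous match, although in the SAM specification "M" can also cover mismatches.

```python
import re
from collections import Counter
from typing import Iterable

_CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def longest_match_run(cigar: str) -> int:
    """Length of the longest M block in a CIGAR string (0 if none)."""
    runs = [int(n) for n, op in _CIGAR_OP.findall(cigar) if op == "M"]
    return max(runs, default=0)


def count_hairpin_hits(sam_lines: Iterable[str], min_run: int = 16) -> Counter:
    """Count alignments per reference hairpin with an M run >= min_run."""
    counts: Counter = Counter()
    for line in sam_lines:
        if line.startswith("@"):        # header line
            continue
        fields = line.rstrip("\n").split("\t")
        rname, cigar = fields[2], fields[5]
        if rname == "*":                # unmapped read
            continue
        if longest_match_run(cigar) >= min_run:
            counts[rname] += 1
    return counts


# Invented alignment lines, for illustration only.
sam = [
    "@SQ\tSN:hsa-let-7a-1\tLN:80",
    "r1\t0\thsa-let-7a-1\t6\t255\t22M\t*\t0\t0\t*\t*",
    "r2\t0\thsa-let-7a-1\t6\t255\t3S18M1S\t*\t0\t0\t*\t*",
    "r3\t0\thsa-mir-21\t8\t255\t10M1I8M\t*\t0\t0\t*\t*",
    "r4\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*",
]
counts = count_hairpin_hits(sam)
print(counts)
```

Parsing every CIGAR operation, rather than splitting on "S" only, keeps the filter well defined when an alignment contains insertions or deletions.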
3.4 Differential Expression Analysis
Differential expression analysis at the miRNA level and at the gene level can be done by applying the same statistical methods. We have observed that the best tools for miRNA-seq differential expression are DESeq [20] and baySeq [4, 23]. Their use is relatively simple, and a detailed description of each tool can be opened with the following commands:
library(DESeq)
openVignette("DESeq")
library(baySeq)
openVignette("baySeq")
The counts for each experiment can be generated as described above in the postprocessing step (see Subheading 3.3).
DESeq estimates the variance in digital data and tests for dif-
ferential expression [20]. The method implemented in DESeq uses
the mean as a good predictor of the variance; that is, genes with a
similar expression level also have similar variance across replicates.
Hence, it predicts the variance from the mean. This estimation is done by calculating, for each gene, the sample mean and variance within replicates and then fitting a curve to these data. The test statistic assesses differences between the base means of the two conditions.
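The idea of predicting the variance from the mean can be illustrated with a toy Python example. The sketch below computes per-gene means and variances across replicates and fits a single dispersion parameter alpha in the negative binomial relation var = mu + alpha·mu². This simple least-squares fit is a stand-in chosen for brevity, not the regression actually implemented in DESeq, and the counts are invented.

```python
from statistics import mean, variance
from typing import Sequence, Tuple


def mean_and_variance(counts: Sequence[float]) -> Tuple[float, float]:
    """Sample mean and (unbiased) sample variance across replicates."""
    return mean(counts), variance(counts)


def fit_dispersion(means: Sequence[float], variances: Sequence[float]) -> float:
    """Least-squares fit of alpha in var = mu + alpha * mu**2."""
    num = sum((v - m) * m ** 2 for m, v in zip(means, variances))
    den = sum(m ** 4 for m in means)
    if den == 0:
        return 0.0
    return max(num / den, 0.0)   # a negative estimate is clipped to zero


def predicted_variance(mu: float, alpha: float) -> float:
    """Variance predicted from the mean by the fitted curve."""
    return mu + alpha * mu ** 2


# Invented normalized counts: one list of replicates per gene.
counts = {
    "geneA": [8, 12, 10],
    "geneB": [90, 120, 105],
    "geneC": [950, 1100, 1010],
}
stats = {gene: mean_and_variance(c) for gene, c in counts.items()}
alpha = fit_dispersion([m for m, _ in stats.values()],
                       [v for _, v in stats.values()])
for gene, (mu, var) in stats.items():
    print(gene, round(mu, 1), round(var, 1),
          round(predicted_variance(mu, alpha), 1))
```

Sharing one fitted curve across genes is what allows a variance estimate to be obtained even when each gene has only a handful of replicates.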
baySeq is based on the NB (negative binomial) model.
Specifically, it estimates the empirical distribution on the parame-
ters of the NB distribution by bootstrapping from the data and the
subsequent acquisition of posterior likelihoods, thus estimating the
proportions of differentially expressed counts.
Both methods perform quite well on miRNA-seq data. In contrast, in an analysis involving a much larger set of features, as in the case of gene-level analysis, DESeq is often overly conservative [8]. Furthermore, modifying the parameters of DESeq can have significant effects on the results of the differential expression analysis, and the recommended parameters [20] are the best choice. baySeq can produce highly variable results when the differentially expressed genes are up-regulated in one condition compared to the other [8]. It is important to note that, with a very low number of replicates (i.e., 2), both DESeq and baySeq show poor FDR control [8].
As indicated above for the detection of alternative splicing
events we use an exon-level approach: DEXSeq [22]. The first
point to be highlighted is how counts are associated to exons.
Exons are fragmented on the basis of their association to a specific
isoform (Fig. 2) and the read counts are associated to them on the
Fig. 2 DEXSeq exon binning. Exons are fragmented on the basis of their associa-
tion to isoforms
4 Note
Acknowledgement
References
1. Maher CA, Kumar-Sinha C, Cao X, Kalyana-Sundaram S, Han B, Jing X, Sam L, Barrette T, Palanisamy N, Chinnaiyan AM (2009) Transcriptome sequencing to detect gene fusions in cancer. Nature 458(7234):97–101
2. McGettigan PA (2013) Transcriptomics in the RNA-seq era. Curr Opin Chem Biol 17(1):4–11
3. Arribere JA, Gilbert WV (2013) Roles for transcript leaders in translation and mRNA decay revealed by transcript leader sequencing. Genome Res 23(6):977–987
4. Cordero F, Beccuti M, Arigoni M, Donatelli S, Calogero RA (2012) Optimizing a massive parallel sequencing workflow for quantitative miRNA expression analysis. PloS One 7(2):e31630
5. Carrara M, Beccuti M, Lazzarato F, Cavallo F, Cordero F, Donatelli S, Calogero RA (2013) State-of-the-art fusion-finder algorithms sensitivity and specificity. Biomed Res Int 2013:340620
6. Carrara M, Beccuti M, Cavallo F, Donatelli S, Lazzarato F, Cordero F, Calogero RA (2013) State of art fusion-finder algorithms are suitable to detect transcription-induced chimeras in normal tissues? BMC Bioinformatics 14(Suppl 7):S2
7. Hansen KD, Wu Z, Irizarry RA, Leek JT (2011) Sequencing technology does not eliminate biological variability. Nat Biotechnol 29(7):572–573
8. Soneson C, Delorenzi M (2013) A comparison of methods for differential expression analysis of RNA-seq data. BMC Bioinformatics 14:91
9. Dillies MA, Rau A, Aubert J, Hennequet-Antier C, Jeanmougin M, Servant N, Keime C, Marot G, Castel D, Estelle J et al (2012) A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis. Brief Bioinform 14(6):671–683
10. Dobin A, Davis CA, Schlesinger F, Drenkow J, Zaleski C, Jha S, Batut P, Chaisson M, Gingeras TR (2013) STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29(1):15–21
11. Rumble SM, Lacroute P, Dalca AV, Fiume M, Sidow A, Brudno M (2009) SHRiMP: accurate mapping of short color-space reads. PLoS Comput Biol 5(5):e1000386
12. Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S, Ellis B, Gautier L, Ge Y, Gentry J et al (2004) Bioconductor: open software development for computational biology and bioinformatics. Genome Biol 5(10):R80
13. Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B (2008) Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods 5(7):621–628
14. Dohm JC, Lottaz C, Borodina T, Himmelbauer H (2008) Substantial biases in ultra-short read data sets from high-throughput DNA sequencing. Nucleic Acids Res 36(16):e105
15. Hansen KD, Brenner SE, Dudoit S (2010) Biases in Illumina transcriptome sequencing caused by random hexamer priming. Nucleic Acids Res 38(12):e131
16. Wu Z, Wang X, Zhang X (2011) Using non-uniform read distribution models to improve isoform expression inference in RNA-Seq. Bioinformatics 27(4):502–508
17. Trapnell C, Williams BA, Pertea G, Mortazavi A, Kwan G, van Baren MJ, Salzberg SL, Wold BJ, Pachter L (2010) Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat Biotechnol 28(5):511–515
18. Li B, Dewey CN (2011) RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC Bioinformatics 12:323
19. Mostafavi S, Battle A, Zhu X, Urban AE, Levinson D, Montgomery SB, Koller D (2013) Normalizing RNA-sequencing data by modeling hidden covariates with prior knowledge. PloS One 8(7):e68141
20. Anders S, Huber W (2010) Differential expression analysis for sequence count data. Genome Biol 11(10):R106
21. Anders S, McCarthy DJ, Chen Y, Okoniewski M, Smyth GK, Huber W, Robinson MD (2013) Count-based differential expression analysis of RNA sequencing data using R and Bioconductor. Nat Protoc 8(9):1765–1786
22. Anders S, Reyes A, Huber W (2012) Detecting differential usage of exons from RNA-seq data. Genome Res 22(10):2008–2017
23. Hardcastle TJ, Kelly KA (2010) baySeq: empirical Bayesian methods for identifying differential expression in sequence count data. BMC Bioinformatics 11:422
24. Robinson MD, McCarthy DJ, Smyth GK (2010) edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26(1):139–140
25. Sanges R, Cordero F, Calogero RA (2007) oneChannelGUI: a graphical interface to bioconductor tools, designed for life scientists who are not familiar with R language. Bioinformatics 23(24):3406–3408
Chapter 11
Abstract
Alternative Splicing (AS) is the molecular phenomenon whereby multiple transcripts are produced from
the same gene locus. As a consequence, it is responsible for the expansion of eukaryotic transcriptomes.
Aberrant AS is involved in the onset and progression of several human diseases. Therefore, the character-
ization of exon–intron structure of a gene and the detection of corresponding transcript isoforms is an
extremely relevant biological task. Nonetheless, the computational prediction of AS events and of the repertoire of alternative transcripts is still a challenging issue.
Hereafter we introduce PIntron, a software package to predict the exon-intron structure and the full-
length isoforms of a gene given a genomic region and a set of transcripts (ESTs and/or mRNAs). The
software is open source and available at http://pintron.algolab.eu. PIntron has been designed for (and
extensively tested on) a standard workstation without requiring dedicated expensive hardware. It easily
manages large genomic regions and more than 20,000 ESTs, achieving good accuracy as shown in an
experimental evaluation performed on 112 well-annotated genes selected from the ENCODE human
regions used as training set in the EGASP competition.
Key words Alternative splicing, Spliced alignment, Gene structure, Expressed isoforms, Software
package
1 Introduction
Ernesto Picardi (ed.), RNA Bioinformatics, Methods in Molecular Biology, vol. 1269,
DOI 10.1007/978-1-4939-2291-8_11, © Springer Science+Business Media New York 2015
174 Paola Bonizzoni et al.
2 Materials
2.1 Installation
PIntron has been developed and intensively tested on Linux and should easily run also on OS X. It can be downloaded from http://pintron.algolab.eu as a TAR or ZIP archive and the following additional software is required: Python v3.1 or newer, Perl v5 and a recent version of the standard GNU toolchain (the C compiler “gcc” and the build tool “make”, in particular). All these prerequisites can be easily installed on popular Linux distributions through their package manager.
Transcriptome Assembly and Alternative Splicing Analysis 175
2.2 Input Data
PIntron takes as input the genomic sequence of a gene locus and a set of expressed sequences (mostly ESTs and mRNAs). In particular, it requires two input files: a FASTA file containing the genomic sequence (genomic file) and a MultiFASTA file containing the EST/mRNA sequences (transcript file). All nucleotide sequences must be in uppercase letters.
In addition, the genomic file must contain the nucleotide
sequence on the same strand of the input gene and its FASTA
header must be in the format “>chrZZ:START:END:STRAND”,
where ZZ is the reference chromosome, START and END are the
start and end coordinates of the genomic region on the reference
chromosome, and STRAND is +1 (or, simply, 1) for a gene on the
plus strand of the chromosome or -1 for a gene on the minus
strand of the chromosome.
An example of genomic file is shown in Fig. 1 for human TP53
gene. Header characteristics are highlighted in boxes.
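Since a malformed header is a common cause of failures, a quick pre-check of the genomic file header can be useful. The Python sketch below validates a header of the form “>chrZZ:START:END:STRAND” and returns its parts. It is not part of PIntron; the function name is hypothetical and the coordinates in the example are invented for illustration.

```python
from typing import Tuple


def parse_genomic_header(header: str) -> Tuple[str, int, int, str]:
    """Validate '>chrZZ:START:END:STRAND' and return (chrom, start, end, strand)."""
    if not header.startswith(">"):
        raise ValueError("a FASTA header must start with '>'")
    fields = header[1:].strip().split(":")
    if len(fields) != 4:
        raise ValueError("expected four ':'-separated fields")
    chrom, start, end, strand = fields
    if not chrom.startswith("chr") or len(chrom) == len("chr"):
        raise ValueError("the first field must look like chrZZ")
    if strand not in {"+1", "1", "-1"}:
        raise ValueError("STRAND must be +1 (or 1) or -1")
    # int() raises ValueError if START or END is not a number
    return chrom, int(start), int(end), "+" if strand in {"+1", "1"} else "-"


def is_uppercase_sequence(sequence: str) -> bool:
    """The text requires all nucleotide sequences to be in uppercase."""
    return sequence != "" and sequence == sequence.upper()


# Invented coordinates, for illustration only.
print(parse_genomic_header(">chr17:7565097:7590856:-1"))
print(is_uppercase_sequence("ACGTNacgt"))
```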
The transcript file must contain input sequences in MultiFASTA format (typically a UniGene file), and each header should include the substring “/gb = XXXXXX”, where XXXXXX is a unique identifier (like the GenBank identifier associated with UniGene sequences). The substring “/clone_end = YY” is optional and specifies the read strand (YY). In particular, YY = 3′ means plus strand (or 5′→3′), while YY = 5′ means minus strand (or 3′→5′). In the latter case, the input transcript should be reversed
Fig. 2 Example of input transcript file (ests.txt) for human TP53 gene: EST sequence CN342738 from Unigene
cluster Hs.437460
In all cases in which the file cds is not present or the CDS
annotation is not provided or the isoform has been obtained by
assembling spliced ESTs, PIntron tries to annotate the coding
sequence by computing the left-most Open Reading Frame (ORF)
of at least 100 bp and having a non-weak context. If all predicted
ORFs of at least 100 bp have a weak context, then the left-most
ORF (if any) is reported. The ORF context [7] is given by a small
window “XNNATGY” of seven bases around the start codon ATG
(underlined). The context is strong if both X and Y are purines
(A or G), medium if only one of X and Y is a purine, and weak if
neither X nor Y is a purine. Symbol N stands for any base.
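The context rule lends itself to a compact check. The following Python sketch classifies the window “XNNATGY” around a given ATG as strong, medium, or weak exactly as defined above. It is an illustration, not PIntron's code: the function name is hypothetical, and raising an error when the window does not fit in the sequence is an assumption, since the text does not say how such cases are handled.

```python
PURINES = {"A", "G"}


def orf_context(sequence: str, atg_pos: int) -> str:
    """Classify the 'XNNATGY' window around the ATG at 0-based atg_pos.

    X is the base three positions upstream of the A of ATG and
    Y is the base immediately after the G of ATG.
    """
    if sequence[atg_pos:atg_pos + 3] != "ATG":
        raise ValueError("no ATG at the given position")
    if atg_pos < 3 or atg_pos + 3 >= len(sequence):
        raise ValueError("the seven-base window does not fit in the sequence")
    x = sequence[atg_pos - 3]
    y = sequence[atg_pos + 3]
    purine_count = (x in PURINES) + (y in PURINES)
    return {2: "strong", 1: "medium", 0: "weak"}[purine_count]


# Short invented sequences with an ATG at position 6.
print(orf_context("GCCACCATGGCG", 6))  # X = A, Y = G
print(orf_context("CCCACCATGTCC", 6))  # X = A, Y = T
print(orf_context("CCCTCCATGTCC", 6))  # X = T, Y = T
```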
3 Methods
3.1 Description of the Pipeline
The PIntron pipeline (Fig. 4) is composed of two steps: (1) reconstruction of the exon–intron gene structure by computing transcript-to-genome alignments (spliced alignments), and (2) assembly of potentially expressed full-length isoforms.
3.2.2 Managing Runtime Errors
The pipeline could prematurely terminate if a runtime error occurs. There are two common causes of runtime errors: wrong input file formats and exhaustion of computational resources. PIntron attempts to check whether the input files are strictly compliant with their required format (as described in Subheading 2) and, if it is not able to parse them correctly, terminates with a runtime error (or with unexpected/partial results). As a consequence, in case of runtime errors or unexpected results, it is important to check the formats of the input files and rerun the pipeline. The other most common cause of runtime errors is the exhaustion of computational resources. PIntron has been designed to run on standard mid-range workstations and, as such, has strict default limits on computational resources (running time and amount of memory). Default settings are conservative and prevent the depletion of all computational resources of the workstation.
The inspection of log files (see Note 2) or the customization of default computational limits (see Note 3) could help in identifying and removing the causes of a runtime error. However, if the user is not able to track down the cause of a runtime error, it can be reported using the issue tracker at the PIntron website (http://pintron.algolab.eu/). The report should be as complete as possible and should include all data needed to reproduce the error. Moreover, results from all intermediate steps should be attached to the report. By default, these (temporary) files are deleted at the end of the execution. To prevent this, it is possible to use the option --keep-intermediate-files (-k, in short).
3.3 Output Data
PIntron produces its output in two standard formats: GTF (Gene Transfer Format) and JSON (JavaScript Object Notation). More precisely, it outputs a file describing the predicted full-length isoforms in GTF format and a JSON file reporting all details of the prediction (such as the exon–intron gene structure and the predicted full-length isoforms along with their annotations). GTF is a well-known standard format in bioinformatics used to describe gene features, while the JSON format allows all results to be described in a single file. The JSON format has been chosen since it is human-readable and very easy to parse by means of libraries that are available for almost all programming languages (Ruby, Python, JavaScript, Perl and so on).
3.3.1 The GTF Output File
A GTF file is composed of records with nine tab-separated fields, and each record represents a feature. Here we omit the details of the format and refer the reader to the official documentation (http://mblab.wustl.edu/GTF22.html).
The GTF output file is specified by the option --gtf and con-
tains all predicted full-length isoforms if the execution option
--strict-GTF-compliance has not been specified, or only the subset
of the CDS-annotated isoforms, otherwise. A full-length isoform is
described in the GTF output as composed of features on the
genome. A feature can be an exon, a 5′ untranslated region, a 3′
untranslated region, a coding sequence, a start codon, or a stop codon.
Below, there is a description of GTF fields produced by
PIntron:
● seqname is the reference chromosome.
● source is always the string “PIntron”, since PIntron is the
program generating the feature.
● feature is the string describing the feature represented by the
GTF record, and its value belongs to the set {“exon”, “3UTR”,
“5UTR”, “CDS”, “start_codon”, “stop_codon”}.
● start is the (1-based) starting position of the feature on the
plus strand of the reference chromosome (that is specified by
the seqname field).
● end is the (1-based) ending position of the feature on the plus
strand of the reference chromosome (that is specified by the
seqname field).
● score is always a dot “.”, since PIntron does not produce this
value.
● strand is the strand of the input gene.
● frame can assume one of the following values:
– 0, if the value of the field feature is “start_codon” or
“stop_codon” (the feature is a start or a stop codon).
– 0, if the value of the field feature is “CDS” (the feature is
a coding sequence), and its first base is the first base of a
codon.
– 1, if the value of the field feature is “CDS” (the feature is
a coding sequence), and its first base is the second base of a
codon.
– 2, if the value of the field feature is “CDS” (the feature is
a coding sequence), and its first base is the third base of a
codon.
– A dot “.” (the frame is not defined), if the value of the field
feature is “exon” or “5UTR” or “3UTR”.
● attribute specifies the gene and the full-length isoform con-
taining the feature. More in detail:
– The value of the tag gene_id is the gene HUGO
name specified by the execution option --gene
(see Subheading 3.2).
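The field rules above can be summarized in a short Python sketch that assembles one nine-field record. It follows the frame definition given in this section (the codon position of the first base of a CDS feature) and is not PIntron's implementation. The helper names, the example coordinates, and the transcript identifier are hypothetical; the transcript_id tag itself is the standard GTF tag for the isoform.

```python
FEATURES = {"exon", "3UTR", "5UTR", "CDS", "start_codon", "stop_codon"}


def frame_field(feature: str, coding_bases_before: int = 0) -> str:
    """Frame value as described in the text.

    coding_bases_before is the number of coding bases of the isoform that
    precede the first base of this CDS feature (0 if it starts a codon).
    """
    if feature not in FEATURES:
        raise ValueError(f"unknown feature: {feature}")
    if feature in {"start_codon", "stop_codon"}:
        return "0"
    if feature == "CDS":
        return str(coding_bases_before % 3)
    return "."  # exon, 5UTR, 3UTR: the frame is not defined


def gtf_record(seqname: str, feature: str, start: int, end: int, strand: str,
               gene_id: str, transcript_id: str,
               coding_bases_before: int = 0) -> str:
    """Build one tab-separated GTF line with source 'PIntron' and score '.'."""
    attribute = f'gene_id "{gene_id}"; transcript_id "{transcript_id}";'
    fields = [seqname, "PIntron", feature, str(start), str(end), ".",
              strand, frame_field(feature, coding_bases_before), attribute]
    return "\t".join(fields)


# Invented coordinates and isoform name, for illustration only.
line = gtf_record("chr17", "CDS", 7579312, 7579590, "-", "TP53", "TP53.1",
                  coding_bases_before=4)
print(line)
```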
3.3.2 The JSON Output File
A JSON file is a (key, value) dictionary, where value can be in turn a dictionary, a list, or a simple value (like a string or a number). In the following, we describe the JSON structure of the PIntron output file, while we refer the interested reader to the official documentation (http://www.json.org/) for a complete description of the format.
We have chosen key names that are self-explanatory, and we
have exploited the nesting nature of the JSON format to encode a
gene and its sets of introns and of full-length isoforms. The JSON file
produced by PIntron is specified by the execution option --output
(see Subheading 3.2). The value of a key ending with a question
mark (“?”) can be true or false, since it represents the answer to a
given question.
The whole PIntron prediction is represented by a dictionary
giving the predicted exon–intron gene structure and the full-length
isoforms potentially expressed by the gene. In particular, the top-
level dictionary gives the following data:
● The input genomic sequence (key = “genome”) that in turn is
represented by a dictionary specifying:
– The sequence identifier (key = “sequence_id”) through the
FASTA header of the genomic file (without the symbol “>”).
– The gene strand (key = “strand”), which is “+” if the gene is
on the plus strand of the chromosome, otherwise it is “-”.
– The sequence length (key = “length”).
● The number of input transcripts which have been successfully
aligned to the genomic sequence (key = “number_of_
processed_transcripts”).
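As an illustration of how easily this structure can be consumed, the Python sketch below builds a tiny document containing only the keys described so far and reads it back with the standard json module. The values are invented, and a real PIntron output file contains further keys that are not shown here.

```python
import json

# Invented values; only the keys described in the text so far are used.
prediction = {
    "genome": {
        "sequence_id": "chr17:7565097:7590856:-1",
        "strand": "-",
        "length": 25760,
    },
    "number_of_processed_transcripts": 120,
}


def summarize(json_text: str) -> str:
    """Return a one-line summary of the top-level genomic information."""
    data = json.loads(json_text)
    genome = data["genome"]
    return (f"{genome['sequence_id']} (strand {genome['strand']}, "
            f"{genome['length']} bp): "
            f"{data['number_of_processed_transcripts']} transcripts aligned")


json_text = json.dumps(prediction, indent=2)
summary = summarize(json_text)
print(summary)
```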
4 Notes
References
Abstract
The advent of deep sequencing technologies has greatly improved the study of complex eukaryotic genomes and transcriptomes, providing the unique opportunity to investigate posttranscriptional molecular mechanisms such as alternative splicing and RNA editing at single base-pair resolution. RNA editing by
adenosine deamination (A-to-I) is widespread in humans and can lead to a variety of biological effects
depending on the RNA type or the RNA region involved in the editing modification.
Hereafter, we describe an easy and reproducible computational protocol for the identification of
candidate RNA editing sites in human using deep transcriptome (RNA-Seq) and genome (DNA-Seq)
sequencing data.
Key words RNA editing, A-to-I editing, Deep sequencing, Bioinformatics, Genomics, RNA-Seq,
DNA-Seq
1 Introduction
Ernesto Picardi (ed.), RNA Bioinformatics, Methods in Molecular Biology, vol. 1269,
DOI 10.1007/978-1-4939-2291-8_12, © Springer Science+Business Media New York 2015
190 Ernesto Picardi et al.
2 Materials
2.1.2 Short Read Mapper
Although a plethora of mapping programs has been released up to now, we obtained accurate results using GSNAP [9], freely available for Unix/Linux or Mac OS X systems from http://research-pub.gene.com/gmap/ (see Note 1 for basic installation on Linux/Unix systems). Nonetheless, other tools to perform the fast alignment of RNA reads onto the reference genome could be used, even though the selected mapping software may affect the detection of RNA editing [7]. Note that the mapping program should print out alignments in the standard SAM format [10].
2.1.3 Mandatory Software
REDItools requires the Python interpreter (at least version 2.6) (see Note 2). In Linux/Unix or Mac OS X it should be preinstalled. Alternatively, a copy can be downloaded from http://www.python.org/download/ and installed according to instructions reported in the web page. To correctly read alignments in BAM files, the external python module pysam (at least version 0.6) needs to be installed (available from https://github.com/pysam-developers/pysam). In addition, SAMtools should be downloaded and installed from http://samtools.sourceforge.net/ (version ≥0.1.18) [10].
Detection of RNA Editing Sites 191
2.1.4 Optional Software
Sometimes optional software could be required to improve final results. Indeed, mismapping errors, which are quite frequent and represent the first source of false RNA editing calls, can be mitigated by realigning reads (only the ones carrying mismatches) using Blat [7]. The complete Blat suite, including the gfServer and gfClient executables as well as the related documentation, can be obtained from the following UCSC web page: http://hgdownload.cse.ucsc.edu/admin/exe/.
Duplicated reads due to PCR could be marked in BAM files before running REDItools, even though this practice is still an open question (especially at the RNA level).
Duplicated reads can be marked using the Picard MarkDuplicates tool.
$ java -Xmx2g -jar MarkDuplicates.jar INPUT=myINPUT.bam OUTPUT=myOUTPUT.bam METRICS_FILE=myMetrics.txt REMOVE_DUPLICATES=false ASSUME_SORTED=true
The complete Picard package can be downloaded from http://picard.sourceforge.net (version ≥1.57).
3 Methods
3.1 Mapping of RNA-Seq Reads
The mapping of RNA-Seq reads can be carried out using GSNAP and providing known splice junctions (from RefSeq, Gencode or other specialized databases) (see Note 3) to improve the global alignment process.
The command line for paired-end reads against the human genome assembly hg19 (see Note 4) is:
$ gsnap -d hg19 -s known-splicesites -E 1000 -n 1 -Q -O --nofails -A sam --gunzip --split-output=outputGsnap -a paired R1.fastq.gz R2.fastq.gz
The option “-a paired” enables a procedure to remove adapters in paired-end reads (useful in case of not preprocessed data). The options “-t” and “-B” set the number of working threads and the amount of RAM available for allocating genomic indexes, respectively. They speed up the mapping process and should be tuned according to the hardware running GSNAP (see Note 5). For the human genome, 4 GB of RAM should be sufficient to preload indices in memory (option “-B 4”).
At the end of the run, GSNAP will create nine separate output files in the standard SAM format (thanks to the option --split-output), one for each alignment type (using specific filename suffixes) (see Note 6). For an accurate RNA editing detection, only the file with suffix “.concordant_uniq” will be taken into account for downstream analyses. Indeed, it includes unique and concordant alignments in order to exclude reads mapping to multiple genomic locations with the same number of mismatches.
3.2 SAM to BAM REDItools accept pre-aligned reads in the standard BAM format
Conversion (see Note 7). For this reason, after the mapping, SAM alignments
need to be converted in the BAM format by means of SAMtools:
$ samtools view -bS outputGsnap.concordant_uniq > outputGsnap.concordant_uniq.bam
$ samtools sort outputGsnap.concordant_uniq.bam outputGsnap.concordant_uniq.sorted
$ samtools index outputGsnap.concordant_uniq.sorted.bam
Detection of RNA Editing Sites 193
3.3 Blat Correction (Optional)
Although GSNAP is quite accurate in aligning paired-end reads
onto the complete human genome [11], several reads could be
ambiguously mapped, leading to false mismatches and, thus,
erroneous RNA editing calls. Misalignment errors can be mitigated
by realigning reads carrying mismatches with the classical Blat
program. The list of problematic reads can be generated using
the REDItoolBlatCorrection.py script and the following command
line:
$ REDItoolBlatCorrection.py -i outputGsnap.concordant_uniq.sorted.nodup.bam -f hg19.fa -F hg19.2bit -o BlatCorrection -V -T
At the end of the run, the script will generate a directory
named “BlatCorrection” and, inside, several “.bad” files including
the list of reads prone to mismapping.
3.4 Detection of RNA Editing Candidates Using Matched DNA-Seq and RNA-Seq Data
REDItoolDnaRna.py is the main script to identify RNA editing
candidates using matched DNA-Seq and RNA-Seq data. In particular,
it inspects all genomic regions covered by RNA-Seq reads
position-by-position looking for nucleotide changes between the
reference genome and RNA sequences. DNA-Seq data are
employed to support the presence of RNA editing events, excluding
potential SNPs or somatic mutations.
REDItoolDnaRna.py requires at least three input files: RNA-
Seq alignments in BAM format, DNA-Seq alignments in BAM for-
mat and the reference genome in FASTA format. All inputs need
to be indexed using SAMtools (see Note 8).
Once all input data are ready, a list of RNA editing candidates
can be obtained with the following command line:
$ REDItoolDnaRna.py -i RNAseq.bam -j DNAseq.bam -f hg19.fa -o myOUTPUT -b BlatCorrection -c 10,10 -m 20,20 -q 25,25 -u -a 6-0 -v 3 -N 0.0 -n 0.1
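The position-by-position logic behind this command can be illustrated with a toy example. The sketch below is not the REDItools implementation: it takes the bases observed at one position and applies a coverage threshold (as in -c), a minimum number of supporting RNA reads (as in -v) and a minimum RNA editing frequency (as in -n). The parameter max_dna_freq is a hypothetical simplification that rejects changes also seen in DNA; it is not the -N option.

```python
from collections import Counter

def call_candidate(ref, rna_bases, dna_bases, min_cov=(10, 10),
                   min_reads=3, min_rna_freq=0.1, max_dna_freq=0.0):
    """Toy per-position test; min_cov is (dna, rna), as in option -c."""
    if len(dna_bases) < min_cov[0] or len(rna_bases) < min_cov[1]:
        return None  # not enough coverage
    rna, dna = Counter(rna_bases), Counter(dna_bases)
    variants = [(n, b) for b, n in rna.items() if b != ref]
    if not variants:
        return None  # invariant position
    alt_reads, alt = max(variants)  # most frequent non-reference base
    rna_freq = alt_reads / len(rna_bases)
    dna_freq = dna[alt] / len(dna_bases)
    if alt_reads < min_reads or rna_freq < min_rna_freq:
        return None  # weakly supported change
    if dna_freq > max_dna_freq:
        return None  # change also present in DNA: likely a SNP
    return ref + alt, round(rna_freq, 3)

# Hypothetical pileups: 6 of 20 RNA reads show G, DNA shows only A.
print(call_candidate("A", "A" * 14 + "G" * 6, "A" * 30))  # → ('AG', 0.3)
```

The same change observed in half of the DNA reads would be rejected as a likely SNP.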
194 Ernesto Picardi et al.
3.5 Exploring Known RNA Editing Sites in RNA-Seq Data
The REDItoolKnown.py script has been developed to explore the
impact of RNA editing on a given RNA-Seq dataset using known
editing events annotated in available databases such as DARNED [12]
and RADAR [13] or collected from supplementary materials of a
variety of publications. REDItoolKnown.py requires at least three
input files: RNA-Seq alignments in BAM format, the reference
genome in FASTA format and a list of known RNA editing sites
(see Note 11).
Once all input data are ready, RNA editing can be explored by
means of the following command line:
$ REDItoolKnown.py -i RNAseq.bam -f hg19.fa -l known_editing_sites.txt.gz -o myOUTPUT
where RNAseq.bam is the file of aligned RNA-Seq reads gen-
erated by the methodology described in Subheadings 3.1 and 3.2,
hg19.fa is the reference human genome assembly hg19 in FASTA
format, known_editing_sites.txt.gz is the table file containing
known RNA editing positions and myOUTPUT is the directory
including all output files.
A detailed description of available parameters and options is in
Table 2.
Table 1
Parameters required for REDItoolDnaRna.py
Parameter Description
-i RNA-Seq BAM file
-j DNA-Seq BAM files separated by comma or folder containing BAM files. Note that
each chromosome must be present in a single BAM file only
-I Sort input RNA-Seq BAM file
-J Sort input DNA-Seq BAM file
-f Reference genome in FASTA format. Note that chromosome names must match
chromosome names in BAM files
-C It indicates how many bases to load in RAM during the execution [100,000 by default]
-k List of chromosomes to skip separated by comma. A file in which each line contains a
chromosome name can also be provided
-t It indicates how many processes should be launched [1 by default]
-o Output folder in which all results will be stored [rediFolder_XXXX by default in which
XXXX is a random number generated at each run]
-F It indicates the name of the internal folder containing output tables
-M If selected, pileup-like files are generated
-c Minimum read coverage for DNA and RNA reads, respectively (dna,rna) [10,10 by
default]
-Q Fastq offset value (dna,rna) [33,33 by default]. For Illumina FASTQ 1.3+, 64 should
be used
-q Minimum quality score for DNA and RNA reads, respectively (dna,rna) [25,25 by
default]. This option can influence the sequencing depth but is very useful to
exclude sequencing errors
-m Minimum mapping quality score for DNA and RNA reads, respectively (dna,rna)
[25,25 by default]. This parameter can mitigate errors due to misplaced reads
-O Minimum homopolymeric length for DNA and RNA reads, respectively (dna,rna) [5,5
by default]. Homopolymeric regions may confound alignment tools leading to
misalignments. As a consequence, substitutions occurring in such regions of a
predefined length (generally equal to or higher than five bases) should be excluded
-s For strand oriented RNA-Seq reads, it indicates which read has the orientation of the
RNA. Available values are: 1 for read1 in line with RNA; 2 for read2 in line with
RNA. Option 1 is equivalent to fr-secondstrand library, whereas option 2
corresponds to fr-firststrand library. The strand information is essential to exclude
substitutions occurring in antisense RNAs
-g It specifies how strand should be deduced. Valid options are: 1 maxValue and 2
useConfidence (the strand is assigned if over a prefixed frequency set by -x option)
-x Strand confidence value [0.70 by default]. It is used if -g 2 option is selected
-S If selected, it performs the strand correction. Once the strand has been inferred, only
bases according to this strand will be printed out
-G Infer strand by GFF annotation (must be in GFF and sorted, otherwise the -X option
should be used). Sorting requires grep and sort unix executables
-K GFF File with positions to exclude (must be in GFF and sorted, otherwise the -X
option should be used). Sorting requires grep and sort unix executables
-T Work only on given GFF positions (must be in GFF and sorted, otherwise the -X
option should be used). Sorting requires grep and sort unix executables
-X Sort annotation files. It requires grep and sort unix executables
-e Exclude multi hits in RNA-Seq
-E Exclude multi hits in DNA-Seq
-d Exclude duplicated reads in RNA-Seq. PCR duplicates need to be marked in the input
BAM
-D Exclude duplicated reads in DNA-Seq. PCR duplicates need to be marked in the input
BAM
-p Use paired concordant reads only in RNA-Seq. It should be activated in case of paired
end reads
-P Use paired concordant reads only in DNA-Seq. It should be activated in case of paired
end reads
-u Consider mapping quality in RNA-Seq
-U Consider mapping quality in DNA-Seq
-a Trim x bases up and y bases down per read [0–0] in RNA-Seq
-A Trim x bases up and y bases down per read [0–0] in DNA-Seq
-b Blat folder containing problematic reads in RNA-Seq
-B Blat folder containing problematic reads in DNA-Seq
-l Remove substitutions in homopolymeric regions in RNA-Seq
-L Remove substitutions in homopolymeric regions in DNA-Seq
-v Minimum number of reads supporting the variation for RNA-Seq [3 by default]
-n Minimum editing frequency for RNA-Seq [0.1 by default]
-N Minimum variation frequency for DNA-Seq [0.1 by default]
-z Exclude positions with multiple changes in RNA-Seq
-Z Exclude positions with multiple changes in DNA-Seq
-W Select RNA-Seq positions with defined changes (separated by comma, for example:
AG,TC) [all by default]
-R Exclude invariant RNA-Seq positions
-V Exclude sites not supported by DNA-Seq
-w File containing splice sites annotations
-r Number of bases near splice sites to explore [4 by default]
--gzip Gzip output files
-h, --help Print out these options
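The interplay between the -Q offset and the -q quality threshold listed in Table 1 can be sketched as follows. This is a minimal illustration of Phred decoding, not REDItools code, and the function names are hypothetical.

```python
def phred_scores(quality_string, offset=33):
    """Decode an ASCII-encoded quality string into Phred scores."""
    return [ord(ch) - offset for ch in quality_string]

def passing_bases(read, quality_string, min_quality=25, offset=33):
    """Keep (index, base) pairs whose quality is at least min_quality."""
    scores = phred_scores(quality_string, offset)
    return [(i, base)
            for i, (base, q) in enumerate(zip(read, scores))
            if q >= min_quality]

print(phred_scores("I5#"))              # → [40, 20, 2]
print(passing_bases("ACG", "I5#"))      # → [(0, 'A')]
print(phred_scores("h", offset=64))     # → [40]
```

Using the wrong offset shifts every score by 31, so a threshold of 25 would either keep nearly all bases or discard nearly all of them.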
Table 2
Parameters required for REDItoolKnown.py
Parameter Description
-i RNA-Seq BAM file
-I Sort input BAM file
-f Reference genome in FASTA format. Note that chromosome names must match
chromosome names in the BAM file
-l List of known RNA editing events (see Note 11)
-C It indicates how many bases to load in RAM during the execution [100,000 by default]
-k List of chromosomes to skip separated by comma. A file in which each line contains a
chromosome name can also be provided
-t It indicates how many processes should be launched [1 by default]
-o Output folder in which all results will be stored [rediFolder_XXXX by default in which
XXXX is a random number generated at each run]
-F It indicates the name of the internal folder containing output tables
-c Minimum read coverage for RNA reads [10 by default]
-Q Fastq offset value [33 by default]. For Illumina FASTQ 1.3+, 64 should be used
-q Minimum quality score for RNA reads [25 by default]
-m Minimum mapping quality score for RNA reads [25 by default]
-O Minimum homopolymeric length [5 by default]
-s For strand oriented RNA-Seq reads, it indicates which read has the orientation of the
RNA. Available values are: 1 for read1 in line with RNA; 2 for read2 in line with
RNA. Option 1 is equivalent to fr-secondstrand library, whereas option 2 corresponds
to fr-firststrand library. The strand information is essential to exclude substitutions
occurring in antisense RNAs
-g It specifies how strand should be deduced. Valid options are: 1 maxValue and 2
useConfidence (the strand is assigned if over a prefixed frequency set by -x option)
-x Strand confidence value [0.70 by default]. It is used if -g 2 option is selected
-S If selected, it performs the strand correction. Once the strand has been inferred, only
bases according to this strand will be printed out
-G Infer strand by GFF annotation (must be in GFF and sorted, otherwise the -X option
should be used). Sorting requires grep and sort unix executables
-X Sort annotation files. It requires grep and sort unix executables
-K File with positions to exclude in the format chromosome name[TAB or space]coordinate
-e Exclude multi hits
-d Exclude duplicated reads in RNA-Seq. PCR duplicates need to be marked in the
input BAM
-p Use paired concordant reads only in RNA-Seq. It should be activated in case
of paired end reads
-u Consider mapping quality
-T Trim x bases up and y bases down per RNA read [0–0]
-B Blat folder containing problematic reads in RNA-Seq
-U Remove substitutions in homopolymeric regions
-v Minimum number of reads supporting the variation for RNA-Seq [3 by default].
-n Minimum editing frequency [0.1 by default]
-E Exclude positions with multiple changes
-P File containing splice sites annotations
-r Number of bases near splice sites to explore [4 by default]
-h Print out these options
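The homopolymer filter described in the tables (option -O together with -l/-L in Table 1, or -U in Table 2) can be illustrated with a short sketch. How REDItools defines the homopolymeric context internally is not specified here, so the code below simply assumes that a position is excluded when it lies inside a run of identical bases of at least the given length.

```python
def in_homopolymer(sequence, index, min_length=5):
    """True if sequence[index] lies in a run of one base of length >= min_length."""
    base = sequence[index]
    start = index
    while start > 0 and sequence[start - 1] == base:
        start -= 1  # extend the run to the left
    end = index
    while end < len(sequence) - 1 and sequence[end + 1] == base:
        end += 1    # extend the run to the right
    return (end - start + 1) >= min_length

# Hypothetical sequence with a run of six T bases at indexes 3-8.
seq = "ACGTTTTTTAC"
print(in_homopolymer(seq, 5))  # → True
print(in_homopolymer(seq, 1))  # → False
```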
3.7 Downstream Analyses
All output positions reported in REDItool tables can be sorted,
filtered, and annotated using auxiliary scripts provided in the package.
The complete list of available accessory scripts can be found at
the web page http://code.google.com/p/reditools/.
An important issue in RNA editing detection is the exclusion
of SNP sites, even when a DNA-Seq dataset is available. Indeed, positions
well supported by RNA-Seq reads might not be equally well supported
by DNA-Seq reads and, thus, some RNA editing candidates
could be genuine SNPs. REDItools include the accessory
FilterTable.py script that is able to filter out known SNPs as well
as other preferred positions according to specialized databases such as
dbSNP (see Note 12). A typical command line is:
$ FilterTable.py -i myREDItoolTable -s dbsnp.gff.gz -S snp -E -p -o myREDItoolTable.nosnp
where myREDItoolTable is the output table from a REDItool
script, dbsnp.gff.gz is the file containing genomic SNPs in GFF
format (bgzipped using bgzip from SAMtools) and
myREDItoolTable.nosnp is the final table. Option -S indicates the feature
to filter out (according to GFF file), -E prevents the printing of
filtered positions and -p reports simple statistics in the standard
output.
FilterTable.py can also be used to filter out (or in) editing candidates
in repetitive regions, such as Alu-rich regions, employing annotations
stored in the RepeatMasker table from UCSC. In this case, a
sample command line is:
$ FilterTable.py -i myREDItoolTable -s dbsnp.gff.gz -F alu -E -p -o myREDItoolTable.alu
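The filtering idea behind these two commands can be sketched in a few lines of Python. This is an assumption-laden illustration rather than the FilterTable.py code: candidates are represented as hypothetical (chromosome, position, change) tuples, annotations are plain GFF lines (columns 1, 3, 4 and 5 hold the sequence name, feature type, start and end, with 1-based inclusive coordinates), and all positions are kept in memory, whereas the real script works on a bgzipped file.

```python
def load_positions(gff_lines, feature):
    """Collect (chromosome, position) pairs covered by a given GFF feature."""
    positions = set()
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        cols = line.rstrip("\n").split("\t")
        chrom, ftype = cols[0], cols[2]
        start, end = int(cols[3]), int(cols[4])
        if ftype == feature:
            positions.update((chrom, p) for p in range(start, end + 1))
    return positions

def filter_out(candidates, excluded):
    """Drop candidates whose (chromosome, position) is in the excluded set."""
    return [c for c in candidates if (c[0], c[1]) not in excluded]

# Hypothetical annotation lines and candidate sites.
gff = [
    "chr1\tdbsnp\tsnp\t100\t100\t.\t+\t.\tID=rs1",
    "chr1\trmsk\talu\t200\t210\t.\t+\t.\tID=AluY",
]
candidates = [("chr1", 100, "AG"), ("chr1", 150, "AG"), ("chr1", 205, "AG")]
no_snp = filter_out(candidates, load_positions(gff, "snp"))
print(no_snp)  # → [('chr1', 150, 'AG'), ('chr1', 205, 'AG')]
```

Keeping only the candidates that fall inside a feature, for example Alu elements, would use the same position set with the membership test inverted.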
Table 3
Example of a REDItool output table
4 Notes
Acknowledgments
References
1. Gott JM, Emeson RB (2000) Functions and mechanisms of RNA editing. Annu Rev Genet 34:499–531
2. Gerhard DS, Wagner L, Feingold EA, Shenmen CM, Grouse LH, Schuler G, Klein SL, Old S, Rasooly R, Good P, Guyer M, Peck AM, Derge JG, Lipman D, Collins FS, Jang W, Sherry S, Feolo M, Misquitta L, Lee E, Rotmistrovsky K, Greenhut SF, Schaefer CF, Buetow K, Bonner TI, Haussler D, Kent J, Kiekhaus M, Furey T, Brent M, Prange C, Schreiber K, Shapiro N, Bhat NK, Hopkins RF, Hsie F, Driscoll T, Soares MB, Casavant TL, Scheetz TE, Brownstein MJ, Usdin TB, Toshiyuki S, Carninci P, Piao Y, Dudekula DB, Ko MS, Kawakami K, Suzuki Y, Sugano S, Gruber CE, Smith MR, Simmons B, Moore T, Waterman R, Johnson SL, Ruan Y, Wei CL, Mathavan S, Gunaratne PH, Wu J, Garcia AM, Hulyk SW, Fuh E, Yuan Y, Sneed A, Kowis C, Hodgson A, Muzny DM, McPherson J, Gibbs RA, Fahey J, Helton E, Ketteman M, Madan A, Rodrigues S, Sanchez A, Whiting M, Madari A, Young AC, Wetherby KD, Granite SJ, Kwong PN, Brinkley CP, Pearson RL, Bouffard GG, Blakesly RW, Green ED, Dickson MC, Rodriguez AC, Grimwood J, Schmutz J, Myers RM, Butterfield YS, Griffith M, Griffith OL, Krzywinski MI, Liao N, Morin R, Palmquist D, Petrescu AS, Skalska U, Smailus DE, Stott JM, Schnerch A, Schein JE, Jones SJ, Holt RA, Baross A, Marra MA, Clifton S, Makowski KA, Bosak S, Malek J (2004) The status, quality, and expansion of the NIH full-length cDNA project: the Mammalian Gene Collection (MGC). Genome Res 14(10B):2121–2127
3. Keegan LP, Gallo A, O’Connell MA (2001) The many roles of an RNA editor. Nat Rev Genet 2(11):869–878. doi:10.1038/35098584
4. Silberberg G, Lundin D, Navon R, Ohman M (2012) Deregulation of the A-to-I RNA editing mechanism in psychiatric disorders. Hum Mol Genet 21(2):311–321. doi:10.1093/hmg/ddr461
5. Gallo A, Locatelli F (2011) ADARs: allies or enemies? The importance of A-to-I RNA editing in human disease: from cancer to HIV-1. Biol Rev Camb Philos Soc 87(1):95–110
6. Picardi E, Horner DS, Chiara M, Schiavon R, Valle G, Pesole G (2010) Large-scale detection and analysis of RNA editing in grape mtDNA by RNA deep-sequencing. Nucleic Acids Res 38(14):4755–4767. doi:10.1093/nar/gkq202
7. Ramaswami G, Lin W, Piskol R, Tan MH, Davis C, Li JB (2012) Accurate identification of human Alu and non-Alu RNA editing sites. Nat Methods 9(6):579–581. doi:10.1038/nmeth.1982
8. Picardi E, Pesole G (2013) REDItools: high-throughput RNA editing detection made easy. Bioinformatics 29(14):1813–1814. doi:10.1093/bioinformatics/btt287
9. Wu TD, Nacu S (2011) Fast and SNP-tolerant detection of complex variants and splicing in short reads. Bioinformatics 26(7):873–881
10. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R (2009) The sequence alignment/map format and SAMtools. Bioinformatics 25(16):2078–2079
11. Engstrom PG, Steijger T, Sipos B, Grant GR, Kahles A, Alioto T, Behr J, Bertone P, Bohnert R, Campagna D, Davis CA, Dobin A, Gingeras TR, Goldman N, Guigo R, Harrow J, Hubbard TJ, Jean G, Kosarev P, Li S, Liu J, Mason CE, Molodtsov V, Ning Z, Ponstingl H, Prins JF, Ratsch G, Ribeca P, Seledtsov I, Solovyev V, Valle G, Vitulo N, Wang K, Wu TD, Zeller G (2013) Systematic evaluation of spliced alignment programs for RNA-seq data. Nat Methods 10(12):1185–1191. doi:10.1038/nmeth.2722
12. Kiran A, Baranov PV (2010) DARNED: a DAtabase of RNa EDiting in humans. Bioinformatics 26(14):1772–1776. doi:10.1093/bioinformatics/btq285
13. Ramaswami G, Li JB (2014) RADAR: a rigorously annotated database of A-to-I RNA editing. Nucleic Acids Res 42(Database issue):D109–D113. doi:10.1093/nar/gkt996
Chapter 13
Abstract
Computational methods for miRNA target prediction are currently undergoing extensive review and
evaluation. There is still a great need for improvement of these tools and bioinformatics approaches are
looking towards high-throughput experiments in order to validate predictions. The combination of large-
scale techniques with computational tools will not only provide greater credence to computational predic-
tions but also lead to the better understanding of specific biological questions. Current miRNA target
prediction tools utilize probabilistic learning algorithms, machine learning methods and even empirical
biologically defined rules in order to build models based on experimentally verified miRNA targets. Large-
scale protein downregulation assays and next-generation sequencing (NGS) are now being used to validate
methodologies and compare the performance of existing tools. Tools that exhibit greater correlation
between computational predictions and protein downregulation or RNA downregulation are considered
the state of the art. Moreover, efficiency in prediction of miRNA targets that are concurrently verified
experimentally provides additional validity to computational predictions and further highlights the com-
petitive advantage of specific tools and their efficacy in extracting biologically significant results. In this
review paper, we discuss the computational methods for miRNA target prediction and provide a detailed
comparison of methodologies and features utilized by each specific tool. Moreover, we provide an over-
view of current state-of-the-art high-throughput methods used in miRNA target prediction.
Key words MiRNA target prediction, Computational tools, Databases, High-throughput methods,
Biological features of miRNA target recognition
1 Introduction
1.1 State of the Art
MicroRNAs (miRNAs) belong to a recently identified group of the
large family of noncoding RNAs [1]. The mature miRNA is usually
19–27 nt long and is derived from a larger precursor that folds into
an imperfect stem-loop structure. The mode of action of the
mature miRNA in mammalian systems is dependent on complementary
base pairing primarily to the 3′-UTR region of the
target mRNA, thereafter causing the inhibition of translation and/
or the degradation of the mRNA.
Ernesto Picardi (ed.), RNA Bioinformatics, Methods in Molecular Biology, vol. 1269,
DOI 10.1007/978-1-4939-2291-8_13, © Springer Science+Business Media New York 2015
208 Anastasis Oulas et al.
1.2 The Biology of miRNA Target Prediction
1.2.1 Background
As mentioned earlier, the mode of action of the mature miRNA in
mammalian systems is dependent on complementary base pairing
to the 3′UTR region of the target mRNA, thereafter causing
the inhibition of translation and/or the degradation of the
mRNA. Despite the body of evidence supporting a role of miRNAs
in cancer, their exact mechanism of action remains to be
elucidated because the downstream targets of miRNAs implicated
in cancer have yet to be defined. Towards this goal, miRNA target
prediction tools can offer a first indication as to which target genes
are regulated by miRNAs, thus providing new insights regarding
their specific functions and guiding future experiments.
1.2.2 Biological Features of miRNA–mRNA Interactions
Early work on the principles of microRNA target recognition was
conducted by Julius Brennecke et al. [14]. Based on experimental
data, they separate the target sites into two classes according to their
base pairing motifs: “5′ dominant” sites, which depend
critically on pairing to the miRNA 5′ end, and “3′ compensatory”
sites, which include seed matches of four to six base pairs and
seeds of seven to eight bases that contain G–U base pairs, single
nucleotide bulges, or mismatches, together with extensive pairing to the
3′ end of the miRNA.
Additionally, they distinguish two subgroups of 5′ dominant
sites, canonical and seed. Canonical sites have good pairing to both
5′ and 3′ ends of the miRNA. Their repression efficiency was not
affected by a mismatch at position 1, 9, 10 or at the 3′ region, but
miRNA Targets Prediction 211
Fig. 1 Categories of miRNA–mRNA interactions. (a) Canonical are 7–8 nt seed-matched sites. They are divided
into three groups, 7mer-A1, 7mer-m8, and 8mer. (b) Atypical are sites with productive 3′ pairing. They consist
of two groups, 3′ supplementary and 3′ compensatory. (c) Marginal are 6 nt sites matching the seed region.
(d) Recently characterized interactions that do not belong to the other categories (“centered” and “G-bulge”
sites) or that occur in the mRNA’s coding region (CDS). Vertical dashes indicate contiguous Watson–
Crick pairing. Underlined numbers on the miRNA sequence represent the seed site
1.2.3 Biological Features Used by miRNA Target Prediction Algorithms
A summary of the features that are widely used by miRNA target
prediction methods is given below. In general, these features
can be grouped into three categories:
(1) Sequence: this feature allows for the detection of base pairing
between the miRNA and the target site and moreover pinpoints the
exact location of the matching nucleotides (e.g., a perfect “seed”
match (nucleotides 2–7 at the 5′ part of a miRNA) or compensatory
binding at the 3′ region).
(2) Thermodynamics: the minimum free energy (ΔG) of
the miRNA–mRNA hybrid structure is very important for predicting
target sites. Programs such as RNAcofold [43] or RNAhybrid
[44, 45] can predict hybrid structures between two
RNA molecules based on matching base pairing and furthermore
yield a minimum free energy for the overall structure. Since
miRNAs bind to their targets in a very specific and stable manner,
it is anticipated that the ΔG will be low. In fact, minimum free
energy calculations from experimentally predicted target sites
display a very low value for ΔG [46], analogous to structures that
display stable binding and interaction.
(3) Conservation: target sites are functional sequences in the
3′UTR of the transcribed mRNA. This fact makes target sites
subject to evolutionary conservation across various organisms. A
conservation analysis using multiz full genome alignment files
provided by UCSC (http://genome.ucsc.edu/) shows that ~70 % of
experimentally predicted human miRNA target sites are in fact
conserved across eight other organisms studied (including
non-mammalian species). When used as a feature for predicting new
target sites, conservation may provide additional support for predictions.
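The sequence feature can be made concrete with a short sketch that scans a 3′UTR for perfect matches to the seed (miRNA nucleotides 2–7) and labels each hit with the site names used in Fig. 1. The labeling rules are an assumption based on the common convention in the literature (7mer-m8: additional pairing to miRNA nucleotide 8; 7mer-A1: an adenosine opposite miRNA nucleotide 1; 8mer: both) and are not spelled out in this text. The let-7a sequence and the toy UTR are illustrative inputs, and real tools add many further criteria.

```python
COMPLEMENT = str.maketrans("ACGU", "UGCA")

def reverse_complement(rna):
    """Reverse complement of an RNA string (A, C, G, U only)."""
    return rna.translate(COMPLEMENT)[::-1]

def seed_sites(mirna, utr):
    """Return (0-based start, site type) for each perfect match to miRNA nt 2-7."""
    seed_match = reverse_complement(mirna[1:7])  # pairs with miRNA nt 2-7
    m8_match = reverse_complement(mirna[7])      # pairs with miRNA nt 8
    sites = []
    pos = utr.find(seed_match)
    while pos != -1:
        # The base pairing with nt 8 lies just upstream of the seed match;
        # the base opposite nt 1 lies just downstream of it.
        has_m8 = pos > 0 and utr[pos - 1] == m8_match
        has_a1 = pos + 6 < len(utr) and utr[pos + 6] == "A"
        if has_m8 and has_a1:
            kind = "8mer"
        elif has_m8:
            kind = "7mer-m8"
        elif has_a1:
            kind = "7mer-A1"
        else:
            kind = "6mer"
        sites.append((pos, kind))
        pos = utr.find(seed_match, pos + 1)
    return sites

let7a = "UGAGGUAGUAGGUUGUAUAGUU"  # example miRNA; check against miRBase
print(seed_sites(let7a, "AAACUACCUCAGGG"))  # → [(4, '8mer')]
```

Input sequences are assumed to be uppercase RNA; DNA-style UTR sequences would need T replaced by U first.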
2 Materials
3 Methods
3.1 miRNA Online Databases
One of the earliest online databases for miRNA is miRBase [47]
(http://www.mirbase.org). The current version of miRBase (v19)
provides an elaborate repository which includes ~19,000 miRNA
entries and ~22,000 mature miRNA sequences from over 160 species.
Moreover, all entries are extensively annotated using information
such as hairpin and mature sequences, genomic location, and relevant
paper(s) supporting the entry. In addition to providing
functional information for each miRNA entry, miRBase provides
links to popular miRNA target prediction tools and databases, such
as microRNA.org [48], DIANA-microT [49] and TargetScan [23,
50], as well as miRecords [51] and DIANA-TarBase [52]. miRBase
also offers additional information about
miRNA families and clusters, external links to Entrez Gene [53],
Rfam [54], and HGNC [55], as well as links to deep sequencing
experiments describing miRNA entries.
Databases archiving expression data for miRNAs in various
tissues and cell lines are becoming an absolute necessity in the new
era of RNAi. Data of this sort are often used in combination with
prediction algorithms, and are ideal for obtaining initial clues and
directions for biologically driven hypotheses that can be concurrently
investigated in greater depth. One of the first databases describing
such data is smiRNAdb (http://www.mirz.unibas.ch/cloningprofiles)
[56]. smiRNAdb provides an elaborate miRNA expression map
for humans and rodents [56] derived from sequencing ~250 small
RNA libraries from 26 different cell types and organs. This database
offers various analysis tools, such as clustering analysis and
principal component analysis, to compare expression
levels and profiles of miRNAs under different experimental conditions.
Although computational prediction of miRNA gene targets
is a valuable asset, predictions should be supported by experi-
mental verification. Techniques used to verify target prediction
results can be either small or large scale [57]. The most widely
used small-scale, wet lab experimentation techniques are reporter
gene assays, followed by small-scale miRNA and/or mRNA
expression assessment via northern blotting or qPCR. ELISA
immunoassays or Western blotting can also be used to detect
changes in protein concentration of gene targets. High-throughput
proteomics methods like stable isotope labeling with
amino acids in cell culture (SILAC) and transcriptomics methods
like microarrays and sequencing-based methodologies, such as
HITS-CLIP, PAR-CLIP, Degradome-Seq and RNA-Seq, can
further be used for large-scale miRNA target validation (described
in detail below).
Numerous databases that contain curated experimentally veri-
fied miRNA targets are currently readily available. The first
comprehensive repository of experimentally validated targets was
3.2 miRNA Target Prediction Tools in Detail
This section describes some widely used microRNA target prediction
tools in detail, with particular emphasis on the target site categories
each tool is specialized to recognize. We aim to provide an
overview of the most recently published tools and also to describe
3.2.1 PicTar (http://pictar.mdc-berlin.de/)
PicTar is a target prediction tool suitable for predicting “5′-dominant”
target sites. However, it allows for targets with imperfect
seed matches provided that they pass a heuristically defined
binding-energy threshold. Additionally, PicTar implements a maximum
likelihood approach to incorporate the combinatorial nature of
miRNA targeting. The program also implements cross-species
conservation constraints. It requires conservation between at least
five species for the portion of the target site that binds to the
miRNA seed. Conservation amounts to a seed match occurring at
overlapping positions in a cross-species UTR alignment. A more
recent version of PicTar provides precompiled target predictions
on the mouse genome based on comparative analyses of 17 vertebrate
genomes [8].
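The conservation criterion described above (a seed match at overlapping positions of a cross-species UTR alignment, in at least five species) can be illustrated with the following sketch. It is not PicTar's implementation: the alignment is a hypothetical dictionary of gapped RNA strings of equal length, the motif is the UTR sequence complementary to the miRNA seed, and overlap is tested on alignment columns.

```python
def site_columns(aligned, motif):
    """Alignment column spans (first, last) of motif hits in the ungapped sequence."""
    cols = [i for i, ch in enumerate(aligned) if ch != "-"]
    seq = "".join(aligned[i] for i in cols)
    spans = []
    pos = seq.find(motif)
    while pos != -1:
        spans.append((cols[pos], cols[pos + len(motif) - 1]))
        pos = seq.find(motif, pos + 1)
    return spans

def conserved_sites(alignment, reference, motif, min_species=5):
    """Reference hits overlapped by a hit in at least min_species species."""
    kept = []
    for start, end in site_columns(alignment[reference], motif):
        # Count species (reference included) with a hit on overlapping columns.
        n = sum(
            any(s <= end and e >= start for s, e in site_columns(row, motif))
            for row in alignment.values()
        )
        if n >= min_species:
            kept.append((start, end, n))
    return kept

# Hypothetical five-species alignment; "UACCUC" matches the let-7 seed.
alignment = {
    "human":   "AAUACCUCGG",
    "mouse":   "AAUACCUCGG",
    "rat":     "A-UACCUCGG",
    "dog":     "AAUACCUCGG",
    "chicken": "AAUAC-CUCG",
}
print(conserved_sites(alignment, "human", "UACCUC"))  # → [(2, 7, 5)]
```

Raising min_species above the number of species carrying the site returns an empty list, which mirrors how a stricter conservation requirement reduces the number of predictions.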
3.2.4 PITA (http://genie.weizmann.ac.il/pubs/mir07/)
PITA uses standard settings to identify initial seeds for each miRNA
in 3′UTRs, applies a target accessibility model to each putative site,
and then combines sites for the same miRNA to find a total interaction
score for the miRNA and the UTR. This model also adds a
new dimension to miRNA target prediction, namely, the “flank
sites.” “Flank sites” are sites around the seed, which are required
to be unpaired. It has been found that a flank of 3 upstream and 15
downstream nucleotides results in optimal model performance,
reported to surpass that of the other methods. The model’s performance
is reported to be better than other methods even without using
the “3-15 flank” option [4].
4 Notes
Table 1
Databases and tools utilized for miRNA target predictions, including the function attributed to each
application as well as a link to the site where the application can be found
References
1. Lee RC, Feinbaum RL, Ambros V (1993) The C. elegans heterochronic gene lin-4 encodes small RNAs with antisense complementarity to lin-14. Cell 75(5):843–854
2. Huttenhofer A, Vogel J (2006) Experimental approaches to identify non-coding RNAs. Nucleic Acids Res 34(2):635–646
3. Miranda KC, Huynh T, Tay Y et al (2006) A pattern-based method for the identification of MicroRNA binding sites and their corresponding heteroduplexes. Cell 126(6):1203–1217
4. Kertesz M, Iovino N, Unnerstall U et al (2007) The role of site accessibility in microRNA target recognition. Nat Genet 39(10):1278–1284
5. Griffiths-Jones S, Grocock RJ, van Dongen S et al (2006) miRBase: microRNA sequences, targets and gene nomenclature. Nucleic Acids Res 34(Database issue):D140–D144
6. Betel D, Wilson M, Gabow A et al (2008) The microRNA.org resource: targets and expression. Nucleic Acids Res 36(Database issue):D149–D153
7. Lewis BP, Burge CB, Bartel DP (2005) Conserved seed pairing, often flanked by adenosines, indicates that thousands of human genes are microRNA targets. Cell 120(1):15–20
8. Krek A, Grun D, Poy MN et al (2005) Combinatorial microRNA target predictions. Nat Genet 37(5):495–500
9. Maragkakis M, Reczko M, Simossis VA et al (2009) DIANA-microT web server: elucidating microRNA functions through target prediction. Nucleic Acids Res 37(Web Server issue):W273–W276
10. Maragkakis M, Alexiou P, Papadopoulos GL et al (2009) Accurate microRNA target prediction correlates with protein repression levels. BMC Bioinform 10:295
11. Khvorova A, Reynolds A, Jayasena SD (2003) Functional siRNAs and miRNAs exhibit strand bias. Cell 115(2):209–216
12. Lee Y, Jeon K, Lee JT et al (2002) MicroRNA maturation: stepwise processing and subcellular localization. EMBO J 21(17):4663–4670
13. Lim LP, Glasner ME, Yekta S et al (2003) Vertebrate microRNA genes. Science 299(5612):1540
14. Brennecke J, Stark A, Russell RB et al (2005) Principles of microRNA-target recognition. PLoS Biol 3(3):e85
15. Helvik SA, Snove O Jr, Saetrom P (2006) Reliable prediction of Drosha processing sites improves microRNA gene prediction. Bioinformatics 23(2):142–149
16. Hertel J, Stadler PF (2006) Hairpins in a Haystack: recognizing microRNA precursors in comparative genomics data. Bioinformatics 22(14):e197–e202
17. Lim LP, Lau NC, Weinstein EG et al (2003) The microRNAs of Caenorhabditis elegans. Genes Dev 16(8):991–1008
18. Sewer A, Paul N, Landgraf P et al (2005) Identification of clustered microRNAs using an ab initio prediction method. BMC Bioinform 6:267–281
19. Yousef M, Nebozhyn M, Shatkay H et al (2006) Combining multi-species genomic data for microRNA identification using a Naive Bayes classifier. Bioinformatics 22(11):1325–1334
20. Kiriakidou M, Nelson PT, Kouranov A et al (2004) A combined computational-experimental approach predicts human microRNA targets. Genes Dev 18(10):1165–1178
21. Friedman RC, Farh KK, Burge CB et al (2009) Most mammalian mRNAs are conserved targets of microRNAs. Genome Res 19(1):92–105
22. Witkos TM, Koscianska E, Krzyzosiak WJ (2011) Practical aspects of microRNA target prediction. Curr Mol Med 11(2):93–109
23. Lewis BP, Shih IH, Jones-Rhoades MW et al (2003) Prediction of mammalian microRNA targets. Cell 115(7):787–798
24. Enright AJ, John B, Gaul U et al (2003) MicroRNA targets in Drosophila. Genome Biol 5(1):R1
25. Long D, Lee R, Williams P et al (2007) Potent effect of target structure on microRNA function. Nat Struct Mol Biol 14(4):287–294
26. Marin RM, Vanicek J (2011) Efficient use of accessibility in microRNA target prediction. Nucleic Acids Res 39(1):19–29
27. Grimson A, Farh KK, Johnston WK et al (2007) MicroRNA targeting specificity in mammals: determinants beyond seed pairing. Mol Cell 27(1):91–105
28. Schmidt T, Mewes HW, Stumpflen V (2009) A novel putative miRNA target enhancer signal. PLoS One 4(7):e6473
29. Baek D, Villen J, Shin C et al (2008) The impact of microRNAs on protein output. Nature 455(7209):64–71
30. Selbach M, Schwanhausser B, Thierfelder N et al (2008) Widespread changes in protein synthesis induced by microRNAs. Nature 455(7209):58–63
31. Friedlander MR, Chen W, Adamidi C et al (2008) Discovering microRNAs from deep sequencing data using miRDeep. Nat Biotechnol 26(4):407–415
32. Cimmino A, Calin GA, Fabbri M et al (2005) miR-15 and miR-16 induce apoptosis by targeting BCL2. Proc Natl Acad Sci U S A 102(39):13944–13949
33. Mayr C, Hemann MT, Bartel DP (2007) Disrupting the pairing between let-7 and Hmga2 enhances oncogenic transformation. Science 315(5818):1576–1579
functional sites with centered pairing. Mol Cell 38(6):789–802
39. Chi SW, Zang JB, Mele A et al (2009) Argonaute HITS-CLIP decodes microRNA-mRNA interaction maps. Nature 460(7254):479–486
40. Chi SW, Hannon GJ, Darnell RB (2012) An alternative mode of microRNA target recognition. Nat Struct Mol Biol 19(3):321–327
41. Tay Y, Zhang J, Thomson AM et al (2008) MicroRNAs to Nanog, Oct4 and Sox2 coding regions modulate embryonic stem cell differentiation. Nature 455(7216):1124–1128
42. Lal A, Navarro F, Maher CA et al (2009) miR-24 Inhibits cell proliferation by targeting E2F2, MYC, and other cell-cycle genes via binding to “seedless” 3′UTR microRNA recognition elements. Mol Cell 35(5):610–625
43. Hofacker IL (2003) Vienna RNA secondary structure server. Nucleic Acids Res 31(13):3429–3431
44. Rehmsmeier M, Steffen P, Hochsmann M et al (2004) Fast and effective prediction of microRNA/target duplexes. RNA 10(10):1507–1517
45. Kruger J, Rehmsmeier M (2006) RNAhybrid: microRNA target prediction easy, fast and flexible. Nucleic Acids Res 34(Web Server issue):W451–W454
46. Rusinov V, Baev V, Minkov IN et al (2005) MicroInspector: a web tool for detection of miRNA binding sites in an RNA sequence. Nucleic Acids Res 33(Web Server issue):W696–W700
47. Kozomara A, Griffiths-Jones S (2010) miRBase: integrating microRNA annotation and deep-sequencing data. Nucleic Acids Res 39(Database issue):D152–D157
48. Betel D, Koppal A, Agius P et al (2010) Comprehensive modeling of microRNA targets predicts functional non-conserved and non-canonical sites. Genome Biol 11(8):R90
34. Sylvestre Y, De Guire V, Querido E et al (2007) 49. Reczko M, Maragkakis M, Alexiou P et al
An E2F/miR-20a autoregulatory feedback (2012) Functional microRNA targets in pro-
loop. J Biol Chem 282(4):2135–2143 tein coding sequences. Bioinformatics
35. Papadopoulos GL, Reczko M, Simossis VA 28(6):771–776
et al (2009) The database of experimentally 50. Garcia DM, Baek D, Shin C et al (2011) Weak
supported targets: a functional update of seed-pairing stability and high target-site abun-
TarBase. Nucleic Acids Res 37(Database dance decrease the proficiency of lsy-6 and
issue):D155–D158 other microRNAs. Nat Struct Mol Biol
36. Lee Y, Yang X, Huang Y et al (2010) Network 18(10):1139–1146
modeling identifies molecular functions tar- 51. Xiao F, Zuo Z, Cai G et al (2009) miRecords:
geted by miR-204 to suppress head and neck an integrated resource for microRNA-target
tumor metastasis. PLoS Comput Biol interactions. Nucleic Acids Res 37(Database
6(4):e1000730 issue):D105–D110
37. Bartel DP (2009) MicroRNAs: target recogni- 52. Vergoulis T, Vlachos IS, Alexiou P et al (2012)
tion and regulatory functions. Cell TarBase 6.0: capturing the exponential growth
136(2):215–233 of miRNA targets with experimental support.
38. Shin C, Nam JW, Farh KK et al (2010) Nucleic Acids Res 40(Database issue):
Expanding the microRNA targeting code: D222–D229
228 Anastasis Oulas et al.
53. Maglott D, Ostell J, Pruitt KD et al (2005) of microRNAs. Nucleic Acids Res 36(Database
Entrez Gene: gene-centered information at issue):D159–D164
NCBI. Nucleic Acids Res 39(Database 68. Shannon P, Markiel A, Ozier O et al (2003)
issue):D52–D57 Cytoscape: a software environment for inte-
54. Griffiths-Jones S, Moxon S, Marshall M et al grated models of biomolecular interaction net-
(2005) Rfam: annotating non-coding RNAs in works. Genome Res 13(11):2498–2504
complete genomes. Nucleic Acids Res 69. Jiang Q, Wang Y, Hao Y et al (2009) miR2Dis-
33(Database issue):D121–D124 ease: a manually curated database for
55. Seal RL, Gordon SM, Lush MJ et al (2011) microRNA deregulation in human disease.
genenames.org: the HGNC resources in 2011. Nucleic Acids Res 37(Database
Nucleic Acids Res 39(Database issue):D98–D104
issue):D514–D519 70. Lu M, Zhang Q, Deng M et al (2008) An anal-
56. Landgraf P, Rusu M, Sheridan R et al (2007) A ysis of human microRNA and disease associa-
mammalian microRNA expression atlas based tions. PLoS One 3(10):e3420
on small RNA library sequencing. Cell 71. Ruepp A, Kowarsch A, Schmidl D et al (2010)
129(7):1401–1414 PhenomiR: a knowledgebase for microRNA
57. Thomson DW, Bracken CP, Goodall GJ (2011) expression in diseases and biological processes.
Experimental strategies for microRNA target Genome Biol 11(1):R6
identification. Nucleic Acids Res 72. Hiard S, Charlier C, Coppieters W et al (2010)
39(16):6845–6853 Patrocles: a database of polymorphic miRNA-
58. Hsu SD, Lin FM, Wu WY et al (2010) miRTar- mediated gene regulation in vertebrates.
Base: a database curates experimentally vali- Nucleic Acids Res 38(Database
dated microRNA-target interactions. Nucleic issue):D640–D651
Acids Res 39(Database issue):D163–D169 73. Yang Q, Qiu C, Yang J et al (2011) miREnvi-
59. Naeem H, Kuffner R, Csaba G et al (2010) ronment database: providing a bridge for
miRSel: automated extraction of associations microRNAs, environmental factors and pheno-
between microRNAs and genes from the bio- types. Bioinformatics 27(23):3329–3330
medical literature. BMC Bioinform 11:135 74. Friedlander MR, Mackowiak SD, Li N et al
60. Yang JH, Li JH, Shao P et al (2010) starBase: (2012) miRDeep2 accurately identifies known
a database for exploring microRNA-mRNA and hundreds of novel microRNA genes in
interaction maps from Argonaute CLIP-Seq seven animal clades. Nucleic Acids Res
and Degradome-Seq data. Nucleic Acids Res 40(1):37–52
39(Database issue):D202–D209 75. Berezikov E, Robine N, Samsonova A et al
61. John B, Enright AJ, Aravin A et al (2004) (2011) Deep annotation of Drosophila mela-
Human MicroRNA targets. PLoS Biol nogaster microRNAs yields insights into their
2(11):e363 processing, modification, and emergence.
62. Oulas A, Karathanasis N, Louloupi A et al Genome Res 21(2):203–215
(2012) A new microRNA target prediction 76. Li N, You X, Chen T et al (2013) Global profil-
tool identifies a novel interaction of a putative ing of miRNAs and the hairpin precursors:
miRNA with CCND2. RNA Biol insights into miRNA processing and novel
9(9):1196–1207 miRNA discovery. Nucleic Acids Res
63. Vlachos IS, Kostoulas N, Vergoulis T et al 41(6):3619–3634
(2012) DIANA miRPath v. 2.0: investigating 77. Chou CH, Lin FM, Chou MT et al (2013) A
the combinatorial effect of microRNAs in computational approach for identifying
pathways. Nucleic Acids Res 40(Web Server microRNA-target interactions using high-
issue):W498–W504 throughput CLIP and PAR-CLIP sequencing.
64. Kowarsch A, Preusse M, Marr C et al (2011) BMC Genomics 14(Suppl 1):S2
miTALOS: analyzing the tissue-specific regula- 78. Mathelier A, Carbone A (2010) MIReNA:
tion of signaling pathways by human and finding microRNAs with high accuracy and no
mouse microRNAs. RNA 17(5):809–819 learning at genome scale and from deep
65. Backes C, Keller A, Kuentzer J et al (2007) sequencing data. Bioinformatics
GeneTrail–advanced gene set enrichment anal- 26(18):2226–2234
ysis. Nucleic Acids Res 35(Web Server 79. Konig J, Zarnack K, Luscombe NM et al
issue):W186–W192 (2011) Protein-RNA interactions: new
66. Cho S, Jun Y, Lee S et al (2010) miRGator genomic technologies and perspectives. Nat
v2.0: an integrated system for functional inves- Rev Genet 13(2):77–83
tigation of microRNAs. Nucleic Acids Res 80. Hafner M, Landthaler M, Burger L et al (2010)
39(Database issue):D158–D162 Transcriptome-wide identification of RNA-
67. Nam S, Kim B, Shin S et al (2008) miRGator: binding protein and microRNA target sites by
an integrated system for functional annotation PAR-CLIP. Cell 141(1):129–141
miRNA Targets Prediction 229
81. Khorshid M, Rodak C, Zavolan M (2011) 86. Vlachos IS, Hatzigeorgiou AG (2013) Online
CLIPZ: a database and analysis environment resources for miRNA analysis. Clin Biochem
for experimentally determined binding sites of 46(10–11):879–900. doi:10.1016/j.clinbio-
RNA-binding proteins. Nucleic Acids Res chem.2013.03.006, Epub 2013 Mar 18.
39(Database issue):D245–D252 Review. PMID: 23518312 [PubMed—indexed
82. Wu J, Liu Q, Wang X, Zheng J, Wang T, You for MEDLINE]
M, Sheng Sun Z, Shi Q (2013), mirTools 2.0 87. Oulas A, Boutla A, Gkirtzou K et al (2009)
for non-coding RNA discovery, profiling, and Prediction of novel microRNA genes in cancer-
functional annotation based on high-through- associated genomic regions – a combined com-
put sequencing. RNA Biol 10(7): 1087–1092 putational and experimental approach. Nucleic
83. Hackenberg M, Rodriguez-Ezpeleta N, Acids Res 37(10):3276–3287
Aransay AM (2011) miRanalyzer: an update on 88. Simoneau M, Aboulkassim TO, LaRue H et al
the detection and analysis of microRNAs in (1999) Four tumor suppressor loci on chro-
high-throughput sequencing experiments. mosome 9q in bladder cancer: evidence for two
Nucleic Acids Res 39(Web Server novel candidate regions at 9q22.3 and 9q31.
issue):W132–W138 Oncogene 18(1):157–163
84. Langmead B, Trapnell C, Pop M et al (2009) 89. Han Y, Chen J, Zhao X et al (2011) MicroRNA
Ultrafast and memory-efficient alignment of expression signatures of bladder cancer
short DNA sequences to the human genome. revealed by deep sequencing. PLoS One 6(3):
Genome Biol 10(3):R25 e18286
85. Zhao W, Liu W, Tian D et al (2011) wapRNA: 90. Kapranov P, Cheng J, Dike S et al (2007) RNA
a web-based application for the processing of maps reveal new RNA classes and a possible
RNA sequences. Bioinformatics 27(21):3 function for pervasive transcription. Science
076–3077 316(5830):1484–1488
Chapter 14
Abstract
Deep sequencing has many possible applications; one of them is the identification and quantification of
RNA editing sites. The most common type of RNA editing is adenosine to inosine (A-to-I) editing.
A prerequisite for this editing process is a double-stranded RNA (dsRNA) structure. Such dsRNAs are
formed as part of the microRNA (miRNA) maturation process, and it is therefore expected that miRNAs
are affected by A-to-I editing. Indeed, tens of editing sites were found in miRNAs, some of which change
the miRNA binding specificity. Here, we describe a protocol for the identification of RNA editing sites in
mature miRNAs using deep sequencing data.
1 Introduction
Ernesto Picardi (ed.), RNA Bioinformatics, Methods in Molecular Biology, vol. 1269,
DOI 10.1007/978-1-4939-2291-8_14, © Springer Science+Business Media New York 2015
232 Shahar Alon and Eli Eisenberg
2 Materials
3 Methods
3.1 Filtering Low-Quality Reads and Trimming Sequence Adapters
The Fastq file of the sequencing reads could be the raw reads without
any filter, or filtered reads using Illumina software tools. In the
latter case, proceed directly to Subheading 3.2. Otherwise, continue
with this step. Raw sequencing reads likely contain parts of the
adapter sequence. Therefore, these sequences must be identified
and removed. Moreover, low-quality reads (as defined by the read
quality score) are unlikely to be informative and therefore should be
removed. There are several published tools for raw Fastq filtering
[15, 16], or one can use our in-house filtering script “Process_reads.pl.”
The script allows the user to define “low-quality reads” by
choosing (a) a quality score cutoff and (b) the maximum number of
locations allowed to have lower quality score compared to the chosen
cutoff. For example, one can use a cutoff of 20 for the Phred
quality score (which ranges from 0 to 40) and the maximum number
of locations can be set to 3 [13]. The script also trims the adapters
(see Note 3) and the resulting trimmed read can be filtered if it
is too long or too short (as defined by the user). For example, as the
expected length of mature miRNAs is ~21 bases, one can remove
reads with length longer than 28 bases or shorter than 15 bases.
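The two filters described above can be sketched in a few lines of Python. This is only an illustration of the logic, not the “Process_reads.pl” script itself: the function names are hypothetical, the read is assumed to be adapter-trimmed already, and the quality string is assumed to use the Phred+33 ASCII encoding.

```python
# Illustrative sketch of the read filter; not the authors' Process_reads.pl.
# Assumption: quality strings are Phred+33 encoded.

def count_low_quality(quality, cutoff=20, offset=33):
    """Number of positions whose Phred score is below the cutoff."""
    return sum(1 for ch in quality if ord(ch) - offset < cutoff)


def keep_read(sequence, quality, cutoff=20, max_low=3, min_len=15, max_len=28):
    """True if an adapter-trimmed read passes the length and quality filters."""
    if not (min_len <= len(sequence) <= max_len):
        return False
    return count_low_quality(quality, cutoff) <= max_low
```

With the example values from the text, a 21-base read with three positions below Phred 20 is kept, while a read with four such positions, or a read of 30 bases, is discarded.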
3.2 Aligning the Reads Against the Genome
The filtered and trimmed reads should be aligned against the
genome of interest and not against an miRNA sequence database
[13]. We require a unique best alignment, that is, the reads cannot
be aligned to other locations in the genome with the same number
of mismatches (Fig. 1). Lastly, we recommend using only alignments
with up to one mismatch (in our datasets, allowing up to
two mismatches did not aid in detecting more editing sites; instead,
it added unreliable alignments). Taken together, these steps largely
solve the cross-mapping problem that significantly hinders the
identification of true editing sites in mature miRNAs [9, 13].
The last two bases (the 3′ end) of mature miRNAs undergo
extensive adenylation and uridylation [5]. It is thus recommended
that these bases not be considered in the alignment. Naturally,
doing so prevents detection of editing at these locations. However,
not taking this measure while still demanding a low number of
mismatches will severely reduce the number of alignments obtained.
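The selection rules above (ignore the last two bases, allow at most one mismatch, require a unique best alignment) can be illustrated with a short Python sketch. It assumes that the candidate alignments of a read are already available as (location, mismatches) pairs; the function names and this data layout are hypothetical and do not reproduce any aligner's interface.

```python
# Illustrative sketch of the alignment selection rules; names are hypothetical.

def trim_3prime(sequence, quality, n=2):
    """Drop the last n bases (and their qualities) before alignment."""
    if n <= 0:
        return sequence, quality
    return sequence[:-n], quality[:-n]


def unique_best_hit(hits, max_mismatches=1):
    """hits: list of (location, mismatches) pairs for one read.

    Returns the location of the best alignment if it has at most
    max_mismatches and no other alignment has the same number of
    mismatches; otherwise returns None.
    """
    if not hits:
        return None
    best = min(mismatches for _, mismatches in hits)
    if best > max_mismatches:
        return None
    best_locations = [loc for loc, mismatches in hits if mismatches == best]
    return best_locations[0] if len(best_locations) == 1 else None
```

A read with a perfect match at one locus and a one-mismatch match elsewhere is kept at the perfect-match locus, whereas a read with two one-mismatch alignments is discarded as ambiguous.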
3.3 Mapping the Mismatches Against the Pre-miRNA Sequences
The purpose of this step is to move from reads aligned against the
genome (the end-point of the previous step) to counts of each of
the four possible nucleotides at each position along the pre-miRNA
sequence, for all the pre-miRNAs. Performing this transformation
will allow us to focus our analysis only on bona fide miRNA and to
use, in the following step, binomial statistics to detect significant
modifications inside them.
We use pre-built files for the transformation: the alignment of
the pre-miRNA sequences against the genome and the mature/
star sequences of miRNAs. Taken together, these files give the
location of the pre-miRNA and the mature/star miRNA inside the
genome (see Notes 9 and 10).
The pre-built files and the Bowtie output file from the previous
step are the input for the script “Analyze_mutation.pl.” A key
user-defined input for this script is the minimum quality score allowed
at the location of the mismatch in order for it to be counted.
Naturally, the higher this number is, the lower the probability of a
sequencing error. However, a very high quality score cutoff
will leave only a small number of counted modifications. We suggest
using 30 as the minimum quality score allowed (see also below and [13]).
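The transformation from aligned reads to per-position nucleotide counts can be sketched as follows. This is a simplified illustration rather than the logic of “Analyze_mutation.pl”: each read is assumed to be given as its 0-based start offset within the pre-miRNA together with its sequence and Phred+33 quality string, and, as a simplification, the quality cutoff is applied to every base and not only to mismatching ones.

```python
# Illustrative sketch; not the authors' Analyze_mutation.pl.
# Assumptions: 0-based offsets within the pre-miRNA, Phred+33 qualities,
# and the quality cutoff applied to every counted base.
from collections import defaultdict


def count_bases(reads, min_quality=30, offset=33):
    """reads: iterable of (start, sequence, quality) tuples.

    Returns a dict mapping each pre-miRNA position to the number of
    A, C, G, and T bases observed there with sufficient quality.
    """
    counts = defaultdict(lambda: {"A": 0, "C": 0, "G": 0, "T": 0})
    for start, sequence, quality in reads:
        for i, (base, q) in enumerate(zip(sequence, quality)):
            if base in ("A", "C", "G", "T") and ord(q) - offset >= min_quality:
                counts[start + i][base] += 1
    return counts
```

The resulting table has exactly the shape needed for the binomial test of the next step: for each position, the number of reads supporting each of the four nucleotides.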
Running this script on large sequencing datasets (large Bowtie
output files) requires a large amount of memory. A way
around this hurdle is to divide the sequencing dataset into several
smaller files; afterwards the output files can be merged
(see Subheading 3.4). We suggest dividing the Bowtie output file
if the number of reads is higher than 2.5 million when using 8 GB
of RAM.
1. As in our example the Bowtie output file has ~8 million reads,
dividing the file is required (see Note 11):
$ split -l 2500000 sra_data_filtered.output part_
This command will divide the Bowtie output file into 4 files:
part_aa, part_ab, part_ac, and part_ad.
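The arithmetic behind the split, and the idea of merging the partial results afterwards, can be illustrated with a small Python sketch. The in-memory layout used here (a dict of per-position nucleotide counts for each part) is an assumption made for illustration and is not the format of the scripts' output files.

```python
# Illustrative sketch; the count-table layout is a hypothetical example.
from math import ceil


def number_of_parts(total_reads, reads_per_part=2_500_000):
    """How many files are produced when splitting into fixed-size parts."""
    return ceil(total_reads / reads_per_part)


def merge_counts(parts):
    """Sum per-position nucleotide counts obtained from separate parts."""
    merged = {}
    for part in parts:
        for position, bases in part.items():
            target = merged.setdefault(position, {})
            for base, n in bases.items():
                target[base] = target.get(base, 0) + n
    return merged
```

For ~8 million reads and 2.5 million reads per part, `number_of_parts` gives 4, matching the four files listed above. Because the counts are simple sums, processing the parts separately and merging them gives the same table as processing the whole file at once.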
3.4 Using Binomial Statistics to Remove Sequencing Errors
In this step binomial statistics are applied to the output file of the
previous step to separate sequencing errors from statistically
significant modifications (Fig. 2).
Importantly, binomial statistics do not require any arbitrary
expression level filter. They are well suited even for low-expressed
miRNAs with a low number of sequencing reads, and the P-values
computed reflect the absolute number of reads detected, small or large
as the case may be.
This analysis is performed for every position (except the last
two positions of the miRNA due to the extensive adenylation and
uridylation) in every mature/star miRNA separately. As multiple
tests are performed, the resulting P-value for each position must be
corrected accordingly. The script performing this analysis,
“Binomial_analysis.pl,” gives the user an option to use Bonferroni
or Benjamini–Hochberg corrections.
1. Run the script “Binomial_analysis.pl” using the files from the
previous step (see Notes 14 and 15):
$ perl Binomial_analysis.pl main_output_a.txt main_output_b.txt main_output_c.txt main_output_d.txt > binomial_output.txt
RUN TIME: <1 min (using Intel Core i7 processor).
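The statistical idea of this step can be sketched in Python. This is a minimal illustration and not the “Binomial_analysis.pl” script: it assumes that the null hypothesis is a per-base sequencing error probability derived from the Phred cutoff (Phred 30 corresponds to an error probability of 0.001), that the P-value is the upper binomial tail for the observed number of mismatching reads, and that the two corrections are applied in their standard textbook form. All function names are hypothetical.

```python
# Illustrative sketch; not the authors' Binomial_analysis.pl.
from math import exp, lgamma, log, log1p


def phred_to_error(q):
    """Error probability implied by a Phred quality score."""
    return 10 ** (-q / 10)


def binomial_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p), with 0 < p < 1.

    Computed in log space so that large read counts do not overflow.
    """
    def log_pmf(i):
        return (lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
                + i * log(p) + (n - i) * log1p(-p))
    return min(1.0, sum(exp(log_pmf(i)) for i in range(k, n + 1)))


def bonferroni(pvalues):
    """Bonferroni-corrected P-values."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted P-values, in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # from the largest P-value to the smallest
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

For example, `binomial_tail(100, 10, phred_to_error(30))` is the probability of seeing at least 10 mismatching reads out of 100 if each base were wrong with probability 0.001. The test uses only the read counts at the position, which is why no expression level cutoff is needed.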
Identification of Editing Sites in Mature miRNAs 237
3.5 Removing SNPs from the List of Statistically Significant Modifications
Known SNPs may be filtered from the statistically significant
modifications detected in the previous step. This can be done using
Galaxy (http://galaxy.psu.edu/).
1. In the Galaxy site choose the “Use Galaxy” option.
2. Use the “Get Data” link, and then “UCSC Main” to down-
load data from the UCSC Table Browser.
3. To download the mouse SNP dataset, choose “Mouse” in the
“genome” field and “Variation and Repeats” in the “group”
field. Make sure that the “track” field is set to “SNPs” and the
“region” field is set to “genome.” Then press “get output”
and afterwards press “send query to Galaxy.”
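Conceptually, the SNP filter is a set-difference on genomic coordinates. The following Python sketch shows the idea outside Galaxy; the tuple layout is hypothetical, and it assumes that the modification sites and the SNP positions have already been brought to the same coordinate convention (mixing 0-based and 1-based positions would silently produce wrong results).

```python
# Illustrative sketch; the data layout is a hypothetical example.

def remove_known_snps(sites, snps):
    """Drop candidate sites that coincide with a known SNP.

    sites: list of tuples whose first two fields are (chromosome, position).
    snps:  iterable of (chromosome, position) pairs in the same convention.
    """
    known = set(snps)
    return [site for site in sites if (site[0], site[1]) not in known]
```

Sites that survive this step are significant modifications that cannot be explained by a known genomic polymorphism.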
4 Notes
Acknowledgments
References
1. Nishikura K (2010) Functions and regulation of RNA editing by ADAR deaminases. Annu Rev Biochem 79:321–349
2. Yang W, Chendrimada TP, Wang Q et al (2006) Modulation of microRNA processing and expression through RNA editing by ADAR deaminases. Nat Struct Mol Biol 13:13–21
3. Bartel DP (2004) MicroRNAs: genomics, biogenesis, mechanism, and function. Cell 116:281–297
4. Kawahara Y, Zinshteyn B, Sethupathy P et al (2007) Redirection of silencing targets by adenosine-to-inosine editing of miRNAs. Science 315:1137–1140
5. Burroughs AM, Ando Y, de Hoon MJL et al (2010) A comprehensive survey of 3′ animal miRNA modification events and a possible role for 3′ adenylation in modulating miRNA targeting effectiveness. Genome Res 20:1398–1410
6. Ekdahl Y, Farahani HS, Behm M et al (2012) A-to-I editing of microRNAs in the mammalian brain increases during development. Genome Res 22:1477–1487
7. Warf MB, Shepherd BA, Johnson WE et al (2012) Effects of ADARs on small RNA processing pathways in C. elegans. Genome Res 22:1488–1498
8. Vesely C, Tauber S, Sedlazeck FJ et al (2012) Adenosine deaminases that act on RNA induce reproducible changes in abundance and sequence of embryonic miRNAs. Genome Res 22:1468–1476
9. de Hoon MJL, Taft RJ, Hashimoto T et al (2010) Cross-mapping and the identification of editing sites in mature microRNAs in high-throughput sequencing libraries. Genome Res 20:257–264
10. Li M, Wang IX, Li Y et al (2011) Widespread RNA and DNA sequence differences in the human transcriptome. Science 333:53–58
11. Lin W, Piskol R, Tan MH et al (2012) Comment on “Widespread RNA and DNA sequence differences in the human transcriptome.” Science 335:1302
12. Piskol R, Peng Z, Wang J et al (2013) Lack of evidence for existence of noncanonical RNA editing. Nat Biotechnol 31:19–20
13. Alon S, Mor E, Vigneault F et al (2012) Systematic identification of edited microRNAs in the human brain. Genome Res 22:1533–1540
14. Langmead B, Trapnell C, Pop M et al (2009) Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 10:R25
15. Stocks MB, Moxon S, Mapleson D et al (2012) The UEA sRNA Workbench: a suite of tools for analysing and visualising next generation sequencing microRNA and small RNA datasets. Bioinformatics 28:2059–2061
16. Zhu E, Zhao F, Xu G et al (2010) mirTools: microRNA profiling and discovery based on high-throughput sequencing. Nucleic Acids Res 38:W392–W397
17. Zuker M, Stiegler P (1981) Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information. Nucleic Acids Res 9:133–148
Chapter 15
Abstract
RNA-Seq technology allows the rapid analysis of whole transcriptomes by taking
advantage of next-generation sequencing (NGS) platforms. Moreover, with the
constant decrease in the cost of NGS analysis, RNA-Seq is becoming very popular
and widespread. Unfortunately, data analysis is quite demanding in terms of
the bioinformatic skills and infrastructure required, which limits the
potential users of this method.
Here we describe the complete analysis of sample data, from raw sequences to
data mining of results, using the NGS-Trex platform, a fully automatic
analysis workflow that requires little user interaction. Used through a web
interface, NGS-Trex processes data and profiles the transcriptome of the
samples, identifying expressed genes, transcripts, and new and known splice
variants. It also detects differentially expressed genes and transcripts
across different experiments.
1 Introduction
Ernesto Picardi (ed.), RNA Bioinformatics, Methods in Molecular Biology, vol. 1269,
DOI 10.1007/978-1-4939-2291-8_15, © Springer Science+Business Media New York 2015
244 Ilenia Boria et al.
2 Materials
2.1 RNA-Seq Datasets
For our sample analysis we will process raw RNA-Seq data obtained
by Birzele et al. [7]. Briefly, the authors have used Illumina’s
Genome Analyzer platform to explore the differences in transcriptome
profile between resting and activated human CD4+ T-helper
and regulatory T-cells, including differential expression of genes
and splice variants.
Raw sequence files, corresponding to the four samples examined
by Birzele et al., are available at the NCBI Sequence Read Archive
(SRA) under accession number SRP006674. To download sequences
An Automatic Analysis Workflow for RNA-Seq Data 245
Table 1
List of datasets described by Birzele et al. Each dataset is the result of the forward and reverse
paired end sequences merged together
3 Methods
Fig. 1 Schema of NGS-Trex procedure for RNA-Seq data analysis (image modified from original in NGS-Trex
web page)
The demo account is fully functional but has two main limitations: (1) the
size of each sample is limited to 10 Gb; (2) projects and files are
“world accessible,” so anyone can read and modify our data.
3.1 Data Submission
3.1.1 File Upload
The first step is to upload files to the NGS-Trex server. This is the only
procedure requiring an external tool: we need an FTP
client (such as FireFTP, FileZilla, or any other client supporting
the sftp transfer protocol) to upload sequences. The exact procedure
depends on the client used, but we have to provide the server name
(www.ngs-trex.org) and our credentials (username and password).
Only fasta and fastq formats (*.fa, *.tfa, *.fasta, *.fna, *.fq, *.fastq)
are supported by the system.
3.1.2 Create a New Project
By clicking the Create New Project link we can fill a form with the
project name (we enter “SRP006674”), a brief description (“Cell
line datasets—Birzele et al. [7] NAR 39(18)”), and we can select
the reference genome to be used for the analysis (we select “Homo
sapiens (NCBI Build 36)” to use the same release used in the Birzele
et al. study). On form submission we are redirected to the project main
page. This page includes three sections: (1) the project summary,
with the description of the project and a summary of the already
analyzed datasets; (2) the list of datasets belonging to the project
(which is empty for newly created projects); and (3) the Add
Dataset form.
3.1.3 Create Datasets
At this stage we need to create the required datasets by selecting
the Add Dataset panel.
The Select file to upload link opens a window that shows a list
of the previously uploaded files. It is also possible to upload files
through this interface but, given the size of the files involved, it
is much more reliable to use the FTP approach described above.
As previously noted, each dataset has two files. NGS-Trex does
not explicitly deal with paired end sequences but treats forward
and reverse sequences as single stranded. For this reason we need
to assign the two distinct files to the same dataset. There is no need
to preprocess the files, as we can select both (holding the “ctrl” key,
or the “cmd” key on macOS, while selecting files). For example, we can
select “SRR192536_1” and “SRR192536_2”, then right-click
on one of the highlighted files and choose the “Select” menu item.
We then assign “CD4_activated” as label and “CD4 Th activated—
Birzele et al. [7]” as short description and press “Add Dataset”
button.
We can repeat the same procedure for all the other datasets.
The Datasets panel of the SRP006674 project now shows a table with
our four datasets: “CD4_activated”, “TREG_activated”,
“CD4_resting”, and “TREG_resting”. We are ready to analyze our data.
3.2 Analysis
Although NGS-Trex analysis can be run with default options, it is
possible to tune many parameters through a simple form accessible
via the Set Analysis Params link provided, for each sample, in the
Datasets panel of the project main page. Through this form it is
possible to optimize the three main steps of the NGS-Trex analysis
workflow: (1) sequence preprocessing, (2) mapping criteria to
align reads to the reference genome, and (3) annotation strategy to
assign mapped reads to annotated genes and transcripts.
3.2.1 Sequences Preprocessing
With sequence preprocessing we can instruct the system to remove
cloning linkers and define the rules to deal with reverse stranded
reads.
“Which kind of data do you want to analyze?” Based on the plat-
form used to obtain the raw data, we can currently select “454” or
“Illumina”. This option controls the algorithm used for the map-
ping of the reads onto the reference genome.
“Cut-off value for percentage of read length that should be of given
quality” (default: 70 %). This threshold controls the number of bases
of the reads that satisfy the quality threshold defined in the “Cut-off
value for quality score for high-quality filtering” value (see Note 2).
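The interplay between the two quality settings can be illustrated with a short Python sketch. It is an interpretation of the description above and not NGS-Trex code: the function name is hypothetical, quality strings are assumed to be Phred+33 encoded, and the quality score cutoff is left as a required argument because its default value is not stated here.

```python
# Illustrative sketch of the percentage-based quality filter; not NGS-Trex code.
# Assumption: quality strings are Phred+33 encoded.

def passes_quality_fraction(quality, min_score, min_percent=70, offset=33):
    """True if at least min_percent % of the bases reach min_score."""
    if not quality:
        return False
    good = sum(1 for ch in quality if ord(ch) - offset >= min_score)
    # Integer comparison avoids floating-point edge cases at exactly 70 %.
    return good * 100 >= min_percent * len(quality)
```

With the default of 70 %, a 10-base read needs at least 7 bases at or above the chosen quality score to be retained.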
“Do you have linker/cloning sequences to remove?” By answering Yes
a form where we can insert the sequence of the 5′ and 3′ linkers
appears. (Both linkers have to be written in 5′–3′ orientation.) It is
also possible to trim some extra nucleotides between the linker and
the main sequence. The system searches for exact matches between
the provided linker sequences and the reads.
“Look for linker in Reverse (if not found in forward)?” If sequences
are not oriented, we can expect to find the 5′ and 3′ linkers in
reverse order as well.
“Reverse complement sequence if linker found reversed?” If linkers
are found in reverse order, the read can simply be trimmed
(leaving its orientation unchanged) or it can be
reverse-complemented to reflect the correct order of the linkers. This
option can be critical for the subsequent Annotation step (as described
below), as it allows proper orientation of the reads and thus makes
the Annotation procedure more reliable.
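The linker options above can be summarized in a small Python sketch. It reflects one reading of the description (exact matching of a 5′ linker at the start and a 3′ linker at the end of the read, with an optional second attempt on the reverse complement) and is not the NGS-Trex implementation; all names are hypothetical, and the trimming of extra nucleotides next to the linkers is omitted.

```python
# Illustrative sketch of linker trimming; not NGS-Trex code.
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(sequence):
    return sequence.translate(_COMPLEMENT)[::-1]


def _trim_forward(read, linker5, linker3):
    """Return the insert if both linkers match exactly, else None."""
    if (len(read) >= len(linker5) + len(linker3)
            and read.startswith(linker5) and read.endswith(linker3)):
        return read[len(linker5):len(read) - len(linker3)]
    return None


def trim_linkers(read, linker5, linker3,
                 look_reverse=True, revcomp_if_reversed=True):
    """Trim linkers, optionally searching the reverse orientation too.

    Both linkers are given in 5'-3' orientation. Returns the trimmed
    insert, or None if the linkers are not found.
    """
    insert = _trim_forward(read, linker5, linker3)
    if insert is not None:
        return insert
    if look_reverse:
        insert = _trim_forward(reverse_complement(read), linker5, linker3)
        if insert is not None:
            # Either report the insert in linker orientation, or restore
            # the orientation in which the read was sequenced.
            return insert if revcomp_if_reversed else reverse_complement(insert)
    return None
```

When the linkers are found only in the reverse orientation, the two settings differ only in the strand on which the insert is reported, which is what matters for the later orientation-aware annotation.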
For the four datasets of the SRP006674 project, we used the
default options in trimming step (since we have no adapters to
remove). No quality filtering was applied as described in Birzele et al.
3.2.2 Mapping
“Which trimmed sequences should be mapped?” (default: Map all
Reads (also those without linkers)). We can define which sequences
are compared to the genome after the trimming step by choosing
among the available options. We can proceed with the analysis
using all reads, or we can filter out reads on the basis of the
number of linkers identified. In our example this option is irrelevant
and we leave the default value.
“Only consider Reads mapping with similarity” (default: 95 %). We
can define the cut-off value for minimum sequence similarity
between the reads and the genome.
“Only consider Reads mapping with coverage” (default: 90 %). We
can set the minimum overlap of the read with the genome. The
value can be set as percentage or as number of nucleotides referred
to the length of trimmed read.
“Allow a mismatch of up to N nt at the 5′end of the read”. This
parameter filters out the reads with more than N unaligned bases
at the 5′ end. “0” is the most restrictive option while “-1” does not
apply any restriction.
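The three mapping thresholds can be combined as in the following Python sketch. The exact definitions used by NGS-Trex are not given here, so the sketch makes explicit assumptions: similarity is taken as matching bases over aligned bases, coverage as aligned bases over the trimmed read length (both in percent), and the 5′ limit is disabled by default in the sketch. The function name and argument layout are hypothetical.

```python
# Illustrative sketch of the mapping filters; definitions are assumptions.

def keep_alignment(read_length, aligned_length, matches, unaligned_5p,
                   min_similarity=95.0, min_coverage=90.0,
                   max_5p_unaligned=-1):
    """True if an alignment satisfies similarity, coverage, and 5' limits.

    max_5p_unaligned = 0 is the most restrictive setting;
    max_5p_unaligned = -1 applies no restriction.
    """
    if read_length <= 0 or aligned_length <= 0:
        return False
    similarity = 100.0 * matches / aligned_length
    coverage = 100.0 * aligned_length / read_length
    if similarity < min_similarity or coverage < min_coverage:
        return False
    if max_5p_unaligned >= 0 and unaligned_5p > max_5p_unaligned:
        return False
    return True
```

For instance, a 100-nt read aligned over 95 nt with 93 matches passes the default similarity and coverage thresholds, but is rejected if it has two unaligned bases at the 5′ end and the 5′ limit is set to 0.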
Fig. 2 Read to gene assignment schema reporting proximal reads as “P”, genic reads as “G”, transcript reads
as “T”, and extragenic reads as “O”. The bounding limit for “P” reads is labeled as “a” and the minimum over-
lap with the gene for “G” assignment is labeled as “b” (image modified from original by Boria et al. (2013))
In more detail:
“Gene upstream/downstream region classify Read as ‘Proximal’”.
Bounding limit upstream and downstream of a gene within which a read
is classified as “P”. The default value is 0 nt, which means that “P”
reads are not considered.
“Minimum overlap to classify Read as ‘Genic’”. Minimum overlap
between a read and a gene to classify the read as “G”. The default value
(1 nt) is very tolerant, as it implies that a read overlapping a gene
by a single nucleotide is labeled as “G”.
“Minimum similarity between Read and Transcript” (default:
95 %). Similarity cut-off value between the read and the RefSeq
sequence for labeling the read as “T”.
“Extension to assign Reads to Transcript” (default: 10 nt). A read
overlapping the RefSeq transcript at its 5′ or 3′ end can extend it
by up to N nt. If the read extends the transcript beyond this
threshold, it is not assigned to the transcript but is treated as a
putative (extended) variant of the annotated transcript.
“Trim to assign Reads to Transcript” (default: 3 nt). The maximum
number of unaligned nucleotides allowed at read ends.
Unlike the previous filter, this one is restricted to reads not aligning
at RefSeq ends.
All sequences mapping outside gene coordinates, beyond the
defined surrounding region, are tagged as “O” (out of gene).
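The classification into “T”, “G”, “P”, and “O” can be sketched for a single read and a single gene as follows. This is a simplified interpretation, not NGS-Trex code: coordinates are assumed to be closed, 1-based intervals on the same chromosome, the similarity to the RefSeq transcript is assumed to be computed elsewhere and passed in (or None if the read does not match a transcript), and the extension and trim rules are left out.

```python
# Illustrative sketch of read-to-gene classification; a simplification.

def classify_read(read_start, read_end, gene_start, gene_end,
                  proximal=0, min_overlap=1,
                  transcript_similarity=None, min_similarity=95.0):
    """Return "T", "G", "P", or "O" for one read against one gene."""
    overlap = min(read_end, gene_end) - max(read_start, gene_start) + 1
    if overlap >= min_overlap:
        if (transcript_similarity is not None
                and transcript_similarity >= min_similarity):
            return "T"
        return "G"
    # proximal = 0 (the default) means that "P" reads are not considered.
    if (proximal > 0
            and read_end >= gene_start - proximal
            and read_start <= gene_end + proximal):
        return "P"
    return "O"
```

With a gene at 1000-2000, a read at 2001-2050 is “O” under the default settings but becomes “P” when the upstream/downstream region is set to 100 nt.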
“Are sequences oriented?” If reads are oriented (default: N) the rela-
tive orientation between the read and the gene is used to optimize
the assignment process. This can help the assignment of multiple
mapping reads or overlapping genes which are two main problems
of the annotation process [8].
By selecting Y (default value) in the “Try to solve ambiguities?”
field, NGS-Trex attempts to solve ambiguities by using the relative
orientation of genes and reads and the confidence level of the read
assignment, ranked from highest to lowest as “T”, “G”, “P”
(Fig. 2). See [6] for details.
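One way to picture the ambiguity resolution is the following Python sketch. It is a hypothetical simplification of the behavior described above, not the NGS-Trex algorithm: candidates on the same strand as the read are preferred when reads are oriented, then the highest confidence label wins, and a tie is left unresolved.

```python
# Illustrative sketch of ambiguity resolution; a hypothetical simplification.
_CONFIDENCE = {"T": 3, "G": 2, "P": 1}


def resolve_assignment(candidates, oriented=False):
    """candidates: list of (gene, label, same_strand) tuples for one read.

    Returns the chosen gene, or None if the ambiguity cannot be solved.
    """
    if oriented:
        same_strand = [c for c in candidates if c[2]]
        if same_strand:
            candidates = same_strand
    if not candidates:
        return None
    best = max(_CONFIDENCE.get(label, 0) for _, label, _ in candidates)
    winners = [gene for gene, label, _ in candidates
               if _CONFIDENCE.get(label, 0) == best]
    return winners[0] if len(winners) == 1 else None
```

For two overlapping genes, a read labeled “T” for one and “G” for the other is assigned to the first; if the read is oriented and only the “G” gene lies on its strand, the orientation takes precedence in this sketch.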
3.2.4 Sequences Post-processing
Once the analysis setup form has been filled in as described, we click
the “Upload parameters” button and then “submit job to analysis queue”.
The progress of each dataset analysis is shown in the corresponding
status column.
3.3 Data Mining
Upon completion of the analysis process (the time required
depends on the size of the datasets and on the load of the server),
the status column of the datasets table is set to “Done”, enabling us
to explore the results.
A first overview of the obtained results is displayed as a sum-
mary table in the project main page. For each analyzed dataset
the table shows the total number of reads in the sample, the reads
3.3.1 Statistics
By clicking the “Statistics” link we can access, for each dataset,
several statistical summaries:
“Analysis Parameters”. Displays a table summarizing all the param-
eters used for the analysis.
“Mapping”. Flowchart showing the flow of reads among the dif-
ferent steps of the mapping procedure. This diagram makes it easy
to monitor the quality of the sequences, as it shows how many
reads do not map (or map with low quality) with the reference
genome. Information about rRNA like sequences, about repetitive
sequences, and spliced reads is also reported.
“Annotation”. Flowchart which describes reads assignment at
gene and transcript level.
“Results Overview”. Table with a summary of some information
about the dataset, such as the number of identified genes, transcripts,
new introns, and differentially expressed genes. Table 2 shows an
example of the results overview for the dataset “CD4_activated”.
Table 2
Results overview for dataset “CD4_activated” as reported in the Statistics
page of NGS-Trex
Fig. 3 “Mapping” diagram for the dataset “CD4_activated” as reported in the Statistics page of NGS-Trex
3.3.2 Query Results
Through the “Query Results” link of the project navigation bar
(top left side of the page), it is possible to obtain the most useful
information by performing several queries on the data. The
“Query Results” page offers several panels that we will explore in
detail.
In the “Query Gene” panel we can search for a single gene using either
the Entrez Gene Identifier (EntrezGeneID) or the HUGO
accession.
The result is a short summary showing the count of reads
assigned to the gene and the count of both annotated and new
introns identified by the analysis. Furthermore, the differential
expression pattern of the selected gene across the different datasets
is provided.
In the “Advanced Search” panel we can filter the genes with three
indexes: depth, coverage, and focus index. Depth is the maximum
number of overlapping reads assigned to a gene. Coverage is the
total number of reads assigned to the gene. Focus index is the ratio
between depth and coverage. If the reads are distributed along the
gene, the depth value is lower than the coverage and the focus index
is low; if all reads are focused on the same region, the depth value is
similar to the coverage value and the focus index tends to 1.
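The three indexes can be illustrated with a minimal Python sketch. It is not the NGS-Trex implementation: the function name `focus_index`, the representation of reads as inclusive (start, end) intervals, and the example coordinates are all assumptions made for illustration.

```python
def focus_index(reads):
    """Return (depth, coverage, focus) for a list of (start, end) read intervals.

    Assumptions (not taken from NGS-Trex): intervals are inclusive,
    depth is the largest number of reads overlapping any single position,
    and coverage is simply the number of reads assigned to the gene.
    """
    if not reads:
        return 0, 0, 0.0
    # Sweep line: +1 where a read starts, -1 just after it ends.
    events = []
    for start, end in reads:
        events.append((start, 1))
        events.append((end + 1, -1))
    events.sort()  # at equal positions, -1 sorts before +1
    depth = current = 0
    for _, delta in events:
        current += delta
        depth = max(depth, current)
    coverage = len(reads)
    return depth, coverage, depth / coverage


# Hypothetical reads spread along the gene: low focus index.
spread = [(1, 50), (100, 150), (200, 250), (300, 350)]
# Hypothetical reads piled on the same region: focus index close to 1.
piled = [(100, 150), (105, 155), (110, 160), (120, 170)]

print(focus_index(spread))
print(focus_index(piled))
```

In the first case no two reads overlap, so the depth stays at 1 while the coverage is 4; in the second case all four reads share a common region, so depth and coverage coincide.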
“Reads mapping within the gene boundaries but not overlapping
with known exons should be considered?” To carry out the search we
must also decide whether we want to use all reads assigned to genes
(G,P,T) answering “Yes” or (with a more restrictive approach) to
limit the search to reads labeled as “T” according to the previously
described classification (by answering “No”). “All selected samples
should satisfy criteria on the same gene?” option searches genes
satisfying the filtering criteria in all selected samples (“Yes”) or in
at least one of the selected samples (“No”).
For each gene the result table shows: the eg_id, the HUGO
name, the sequence coverage, the sequencing depth, the focus
index, and the RPKM value.
“DE Genes” and “DE Introns” panels are very similar. They allow
querying for genes or introns that are differentially expressed between
a reference and the other datasets. We need to select the reference
sample, pValue, minimum number of reads, and relative expression level
(i.e., we can select genes or introns which are overexpressed in the
reference dataset). As for the advanced search it is possible to limit the query
254 Ilenia Boria et al.
Fig. 4 Relative position of reads with respect to CDS and UTRs regions (Reads 5′ UTR = A, Reads 5′ UTR-
CDS = D, Reads CDS = B, Reads 3′ UTR-CDS = E, Reads 3′ UTR = C) in Transcripts panel (image modified from
original in NGS-Trex web page)
4 Notes
References
1. Wang RL, Biales A, Bencic D, Lattier D, Kostich M, Villeneuve D, Ankley GT, Lazorchak J, Toth G (2008) DNA microarray application in ecotoxicology: experimental design, microarray scanning and factors affecting transcriptional profiles in a small fish species. Environ Toxicol Chem 27:652–663
2. Wang L, Feng Z, Wang X, Zhang X (2010) DEGseq: an R package for identifying differentially expressed genes from RNA-Seq data. Bioinformatics 26:136–138
3. Mutz KO, Heilkenbrinker A, Lönne M, Walter JG, Stahl F (2013) Transcriptome analysis using next-generation sequencing. Curr Opin Biotechnol 24(1):22–30
4. Goecks J, Nekrutenko A, Taylor J, Team G (2010) Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biol 11(8):R86
5. Halbritter F, Vaidya HJ, Tomlinson SR (2011) GeneProf: analysis of highthroughput sequencing experiments. Nat Methods 9:7–8
6. Boria I, Boatti L, Pesole G, Mignone F (2013) NGS-Trex: Next Generation Sequencing Transcriptome profile explorer. BMC Bioinformatics 14(Suppl 7):S10
7. Birzele F, Fauti T, Stahl H, Lenter MC, Simon E, Knebel D, Weith A, Hildebrandt T, Mennerich D (2011) Next-generation insights into regulatory T cells: expression profiling and FoxP3 occupancy in Human. Nucleic Acids Res 39(18):7946–7960
8. Costa V, Angelini C, De Feis I, Ciccodicola A (2010) Uncovering the complexity of transcriptomes with RNA-Seq. J Biomed Biotechnol 2010:1–20
Chapter 16
Abstract
In recent years, thanks to the essential support provided by the Next-Generation Sequencing (NGS)
technologies, Metagenomics is enabling the direct access to the taxonomic and functional composition of
mixed microbial communities living in any environmental niche, without the prerequisite to isolate or
culture the single organisms. This approach has already been successfully applied for the analysis of many
habitats, such as water or soil natural environments, also characterized by extreme physical and chemical
conditions, food supply chains, and animal organisms, including humans. A shotgun sequencing approach
allows the investigation of both organism and gene diversity. However, if the purpose is limited to exploring the
taxonomic complexity, an amplicon-based approach, based on PCR-targeted sequencing of selected
genetic species markers, commonly named “meta-barcodes”, is desirable. Among the genomic regions
most widely used for the discrimination of bacterial organisms, in some cases up to the species level, some
hypervariable domains of the gene coding for the 16S rRNA occupy a prominent place.
The amplification of a certain meta-barcode from a microbial community through the use of PCR prim-
ers able to work in the entire considered taxonomic group is the first task after the extraction of the total
DNA. Generally, this step is followed by the high-throughput sequencing of the resulting amplicons libraries
by means of a selected NGS platform. Finally, the interpretation of the huge amount of produced data
requires appropriate bioinformatics tools and know-how in addition to efficient computational resources.
Here a computational methodology suitable for the taxonomic characterization of 454 meta-barcode
sequences is described in detail. In particular, a dataset covering the V1–V3 region belonging to the bacte-
rial 16S rRNA coding gene and produced in the Human Microbiome Project (HMP) from a palatine
tonsils sample is analyzed. The proposed exercise includes the basic steps to manage raw sequencing data,
remove amplification and pyrosequencing errors, and finally map sequences on the taxonomy.
1 Introduction
Ernesto Picardi (ed.), RNA Bioinformatics, Methods in Molecular Biology, vol. 1269,
DOI 10.1007/978-1-4939-2291-8_16, © Springer Science+Business Media New York 2015
258 Fosso Bruno et al.
2 Materials
2.1 HMP (Human Microbiome Project) Data Set
The proposed protocol is executed here starting from a publicly
available dataset of bacterial 16S rRNA gene sequences stored in
the NCBI Sequence Read Archive (SRA) collection. In particular,
the dataset was produced by the Human Microbiome Project (HMP),
in which several body sites, including gastrointestinal and female
urogenital tracts, oral cavity, nasal and pharyngeal tracts, and skin,
were studied by pyrosequencing the V1–V3 and V3–V5 hypervariable
regions of the 16S rRNA gene. The SRR331235 datarun, covering the
V1–V3 region and obtained from the palatine tonsils sample of a male
subject, was chosen for the exercise described below. The forward
primer ATTACCGCGGCTGCTGG was used for the unidirectional
pyrosequencing, as reported in the SRA collection experiment page at
http://www.ncbi.nlm.nih.gov/sra?term=SRR331235. Moreover, a
nine-nucleotide tag (TCGAGGAAC) was inserted upstream of the
V1–V3 region [20] in order to discriminate the reads belonging to
the different samples concurrently sequenced in the same run.
The SRR331235 datarun was downloaded as a binary sra file
of about 5.3 Mb. The dataset consists of 4,504 reads with an aver-
age length of 535.80 nt.
Since the data provided by the 454 technology are in the
binary Standard Flowgram Format (SFF), the downloaded sra file
was converted into the sff format by using the sratoolkit (see Note 1).
The aim of this step was to reproduce in the exercise the case in which
the researcher has to analyze directly the data produced by this
type of platform.
The information included in the sff file can be visualized using
the Roche SFFtools by converting it into a textual flat file, as extensively
described in the next section. In particular, two principal sections
can be found in this file: a common header section including some
general information about the sequencing output (such as the
number of reads stored in the file and the number of nucleotide
flows performed during the sequencing) [21] and a read-specific
2.2 Computational Hardware
The management and processing of large metagenomic datasets
require powerful storage and computational capabilities; therefore,
Linux and Unix-based machines are recommended.
Meta-Barcoding Taxonomic Profiling 263
2.3 Computational Software
The protocol described below includes the use of the following
software:
1. SFFtools, a part of Roche Data Analysis package, is not freely
available but 454 users can request it from http://454.com/
products/analysis-software/index.asp. In order to use
SFFtools to manage the sff file, the installation of the whole
Roche Data Analysis package is required on a Linux operating
system by using root privileges.
2. AmpliconNoise, available at http://code.google.com/p/ampliconnoise/,
is a collection of programs used to reduce the base errors
intrinsically introduced during the 454 amplicon sequencing
procedure [23]. It is supported on Linux and Mac OS X platforms,
but particular attention is required during installation and
configuration because some environment variables must be set
(as fully described in the documentation file provided with the
installation package). Moreover, AmpliconNoise is designed to
work on multi-processor/core systems using the Message Passing
Interface (MPI) libraries, which must be installed and configured
beforehand. The MPI libraries allow the process to be parallelised
on multi-processor/core machines for High Performance
Computing (HPC).
3. FastQC, a Java-based tool, which is available at
http://www.bioinformatics.babraham.ac.uk/projects/fastqc/ and used for
a simple and quick quality check of the sequences output
directly by the sequencer (also referred to as raw data).
4. RDP classifier is a naïve Bayesian classifier for fast taxonomic
assignment of sequences and confidence estimation
(Assignment procedures and confidence values estimation are
amply described in Subheading 3.3). It can be downloaded
from http://rdp-classifier.sourceforge.net.
5. Python, an object-oriented programming language available at
http://www.python.org. It is preinstalled in several Linux-based
Operating Systems (e.g., Ubuntu) and in Mac OS X but not in
Windows Operating Systems. During the proposed exercise
two Python modules are used: BioPython (http://biopython.
org/wiki/Main_Page) and Numpy (http://www.numpy.org).
3 Methods
3.1 Raw Data Management and Evaluation
The first step of the protocol consists in the conversion of the
binary sff file produced by the 454 sequencing platform into two
different file formats, a textual flat file and a fastq file, and in the
statistical evaluation of the raw data.
The conversion of the sff file into a flat textual file is performed
by the Roche SFFtools software. The produced file will be submit-
ted to the denoising step. Some alternative methods that are shown
in Note 2 are available to manage the sff file.
The command used in the bash terminal to achieve this con-
version is the following:
$ sffinfo SRR331235.sff > SRR331235.sff.txt
The sff is also converted into a fastq file suitable to perform the
raw data quality check by means of FastQC. This operation is car-
ried out in the Python shell by using the BioPython module and
consists of two steps:
(a) the import of the BioPython SeqIO module, which is useful to
manage biological data files. First start the Python shell:
$ python
And then:
>>> from Bio import SeqIO
(b) the conversion of the sff file into a fastq file by using the function
SeqIO.convert, which requires four arguments: the file to be
converted, its format, the output file name, and the output file
format. The command to perform this conversion is the
following:
>>> SeqIO.convert("SRR331235.sff", "sff", "SRR331235.fastq", "fastq")
Afterwards, the produced fastq file is processed by means of
FastQC, by using the following command:
$ fastqc SRR331235.fastq
FastQC performs several statistical evaluations on the sequenc-
ing data and builds an html report file divided in 12 subsections.
The descriptions of the latter are included in several different html
files (you can open them using any Internet browser) contained in
the “help” folder, available within the FastQC package.
For example, the “Basic Statistics” subsection obtained in this
exercise reports that the dataset under investigation consists of
4,504 sequences ranging from 86 to 1,049 nucleotides, with a GC
content of 52 %. Furthermore, the “Per base sequence quality”
subsection shows a chart of the range of quality values across all
bases at each sequence position in the fastq file (Fig. 1). In Fig. 1 it is
possible to observe a reduction of the quality towards the 3′ end. This
behavior is due to the intrinsic limits of the sequencing technology,
which tends to lose sensitivity in the last steps of the process.
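To make the “Basic Statistics” figures more concrete, the following Python sketch computes the same kind of summary (number of reads, length range, GC content) directly from a fastq stream. It is only an illustration and not FastQC's own code: the function name `fastq_basic_stats` and the two toy records are assumptions, the GC percentage is computed over all bases pooled together, and the parser assumes the plain four-line-per-record fastq layout.

```python
import io


def fastq_basic_stats(handle):
    """Compute read count, min/max length and GC % from a FASTQ stream.

    Assumes the simple four-line-per-record FASTQ layout
    (sequence lines are not wrapped).
    """
    count = 0
    min_len = max_len = None
    gc = total = 0
    for line_no, line in enumerate(handle):
        if line_no % 4 != 1:  # only the second line of each record is sequence
            continue
        seq = line.strip().upper()
        count += 1
        length = len(seq)
        min_len = length if min_len is None else min(min_len, length)
        max_len = length if max_len is None else max(max_len, length)
        gc += seq.count("G") + seq.count("C")
        total += length
    gc_percent = 100.0 * gc / total if total else 0.0
    return count, min_len, max_len, gc_percent


# Tiny made-up example (not the SRR331235 data).
example = io.StringIO(
    "@read1\nACGTACGT\n+\nIIIIIIII\n"
    "@read2\nGGCC\n+\nIIII\n"
)
print(fastq_basic_stats(example))
```

On a real file the same function could be called as `fastq_basic_stats(open("SRR331235.fastq"))`; small differences from the FastQC report are possible because FastQC's internal conventions are not reproduced here.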
3.2 Denoising
The second step of the protocol, called denoising, consists in
the application of AmpliconNoise. AmpliconNoise is a suite of
programs designed to reduce the errors introduced in the sequences
by both amplification and pyrosequencing before mapping them
on the taxonomy.
In order to explain all the procedures of this phase exhaustively,
the manual procedure is shown, but the same steps can be performed
by means of the bash scripts supplied in the installation
package. Moreover, AmpliconNoise can alternatively be applied
within the Mothur suite [24] (see Note 3).
The processing starts with a filtering step applied to remove
low quality reads using the Perl script FlowsMinMax.pl.
In the bash terminal, type:
$ FlowsMinMax.pl TCGAGGAACATTACCGCGGCTGCTGG test_book < SRR331235.sff.txt
Fig. 1 The per base sequence quality chart produced by FastQC for the dataset analyzed in this exercise (datarun
SRR331235 produced by the Human Microbiome Project (HMP)). The y-axis reports the quality score (0 to 40) and
the x-axis the base position bins (from 1 to 900-999)
“-c” is the cutoff for the initial clustering step and “-s” controls the
cluster size (60.0 is a good compromise between noise removal
and computational requirements). At the end of the PyroNoiseM
processing, a number of files labelled with the value of the “-out”
parameter (C000_s60_c01) are produced:
● a fasta file reporting the denoised sequences with their
accession numbers, formed as output_label_index_weight.
Weight indicates the number of raw sequences associated with
each denoised one and index is just a number which uniquely
identifies the specific denoised sequence (C000_s60_c01_cd.fa);
● a quality file for the denoised sequences (C000_s60_c01_cd.qual).
It is a fasta-like file where, for each denoised sequence, the
accession number and the estimated Phred score of each base
are reported;
● a mapping file in which the accession numbers of the raw reads
are associated with their corresponding denoised sequences
(C000_s60_c01_cd.mapping);
● a directory (C000_s60_c01) containing a fasta file for each
denoised sequence. In each file the unique sequences, produced
by the Perl script FastaUnique, and the corresponding
reads are listed.
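The accession naming scheme described above (output_label_index_weight) makes it easy to check how many raw reads are represented by the denoised sequences. The following Python sketch sums the trailing weights of the fasta headers. The function name `total_weight` and the example headers are hypothetical; the only assumption taken from the text is that the weight is the last underscore-separated field of the accession.

```python
import io


def total_weight(fasta_handle):
    """Sum the trailing weight of every '>label_index_weight' header.

    Returns (number of denoised sequences, number of raw reads they represent).
    """
    n_sequences = 0
    n_reads = 0
    for line in fasta_handle:
        if line.startswith(">"):
            accession = line[1:].split()[0]
            # The weight is the last underscore-separated field.
            n_reads += int(accession.rsplit("_", 1)[1])
            n_sequences += 1
    return n_sequences, n_reads


# Made-up headers following the naming scheme described above.
example = io.StringIO(
    ">C000_s60_c01_0_12\nACGTACGT\n"
    ">C000_s60_c01_1_3\nACGTTCGT\n"
    ">C000_s60_c01_2_1\nACGAACGT\n"
)
print(total_weight(example))  # (3, 16)
```

Applied to a real output file, the second value should correspond to the number of reads that entered the denoising step, which gives a quick sanity check on the run.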
The sequence and mapping data produced by PyroNoiseM are
then copied and renamed as test_book_pyro_denoised.fa and
test_book_pyro_denoised.mapping, respectively, using the
following commands:
$ cp C000/C000_s60_c01_cd.fa test_book_pyro_denoised.fa
$ cp C000/C000_s60_c01.mapping test_book_pyro_denoised.mapping
Subsequently, the denoised sequences stored in the
test_book_pyro_denoised.fa file are truncated at 400 nt (see the
AmpliconNoise documentation) by typing:
$ Truncate.pl 400 < test_book_pyro_denoised.fa > test_book_pyro_denoised_T400.fa
The sequences contained in the produced
test_book_pyro_denoised_T400.fa file are processed in order to remove
the barcode and the primer sequences before the SeqNoise application,
by means of sed, a stream editor which works using regular expressions.
In the exercise case, it is possible to substitute (s) the barcode
and the primer at the 5′ end (^) of each sequence with an empty
string by typing:
$ sed 's/^TCGAGGAACATTACCGCGGCTGCTGG//' test_book_pyro_denoised_T400.fa > test_book_pyro_denoised_T400_P.fa
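The effect of the truncation and of the sed substitution on a single sequence can be sketched in Python as follows. This is an illustration only, not a replacement for Truncate.pl or sed: the function name `truncate_and_strip` and the synthetic read are assumptions, while the barcode, the primer, and the 400 nt length come from the protocol.

```python
import re

BARCODE = "TCGAGGAAC"          # nine-nucleotide tag from the protocol
PRIMER = "ATTACCGCGGCTGCTGG"   # primer from the protocol


def truncate_and_strip(seq, length=400, prefix=BARCODE + PRIMER):
    """Truncate a sequence to `length` nt, then drop a leading barcode+primer."""
    truncated = seq[:length]
    # '^' anchors the pattern at the 5' end, as in the sed expression.
    return re.sub("^" + re.escape(prefix), "", truncated)


# Synthetic read: 26 nt of barcode+primer followed by 400 nt of made-up sequence.
read = BARCODE + PRIMER + "ACGT" * 100
clean = truncate_and_strip(read)
print(len(read), len(clean))  # 426 374
```

Because barcode and primer together span 26 nt, a sequence truncated at 400 nt cannot be longer than 374 nt once they are removed.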
(30.0) for the clustering size (for more information refer to the
AmpliconNoise documentation). The “-out” flag
(test_book_denoised_T400_s25_c08) is used to label the output data
produced by SeqNoise, which are contained in the following files:
● a fasta file reporting the denoised sequences with their
accession numbers, formed as output_label_index_weight.
Weight indicates the number of raw sequences clustered in each
denoised one and index is just a number which uniquely
identifies the specific cluster
(test_book_denoised_T400_s25_c08_cd.fa);
● a mapping file in which the accession numbers of the sequences
produced by the PyroNoiseM computation are associated with
their corresponding denoised sequences obtained by SeqNoise
(test_book_denoised_T400_s25_c08.mapping);
● a directory (test_book_denoised_T400_s25_c08) containing
a fasta file for each sequence derived from the SeqNoise
denoising procedure. In each file the unique sequences and
the corresponding reads, all used to generate the denoised
sequences, are listed;
● if the -min parameter is added to the command line, an
optional mapping file
(test_book_denoised_T400_s25_c08_cd.mapping), in which the
accession numbers of the raw reads are associated with the
corresponding denoised sequence derived from the full denoising
procedure, is created.
In the exercise described here, after the complete denoising
procedure, AmpliconNoise identifies 399 real sequences, which are
theoretically unaffected by sequencing and amplification errors,
starting from the 3,229 reads that pass the initial filtering step.
3.3 Taxonomic Classification
The third step of the protocol consists in the taxonomic classification
of the denoised sequences by using the RDP classifier.
RDP classifier [28] is a naïve Bayesian classifier which works
rapidly on short sequences, without the need to align them to
a taxonomically annotated reference database. It is developed by
the same team which maintains RDP II [29] and it is able to assign
sequences from kingdom to genus level.
Like other textual Bayesian classifiers, it works in a feature space
consisting of all the 8 nt long substrings that can be found
in a sequence, whatever their position. In this system W = {w1,…,wd}
is defined as the set of all possible 8 nt long substrings that can
be retrieved in a given sequence. Given a collection of N sequences,
n(wi) represents the number of times that wi is observed in the set
N. The expected probability of observing wi in the entire
set N is:
$$P_i = \frac{n(w_i) + 0.5}{N + 1}$$
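The expression for P_i can be illustrated with a short Python sketch. It is not the RDP classifier code: the function names `words` and `word_priors` and the three toy sequences are made up, and the sketch assumes that n(w_i) counts the sequences in which the word w_i occurs at least once.

```python
WORD_SIZE = 8


def words(seq, k=WORD_SIZE):
    """Return the set of distinct k-nt substrings of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def word_priors(sequences, k=WORD_SIZE):
    """Return {word: P_i} with P_i = (n(w_i) + 0.5) / (N + 1).

    Assumption: n(w_i) is the number of sequences containing w_i.
    """
    n_total = len(sequences)
    counts = {}
    for seq in sequences:
        for w in words(seq, k):
            counts[w] = counts.get(w, 0) + 1
    return {w: (n + 0.5) / (n_total + 1) for w, n in counts.items()}


corpus = ["ACGTACGTAC", "ACGTACGTTT", "GGGGGGGGGG"]  # made-up sequences
priors = word_priors(corpus)
print(priors["ACGTACGT"])  # (2 + 0.5) / (3 + 1) = 0.625
print(priors["GGGGGGGG"])  # (1 + 0.5) / (3 + 1) = 0.375
```

The 0.5 and 1 terms keep every probability strictly between 0 and 1, so that a word absent from the collection (n = 0) still receives a small non-zero value.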
$ python
# First, import the modules needed for the analysis
>>> from Bio import SeqIO
>>> import numpy
# an empty list is created to store the sequence lengths
>>> size = []
# the length of each sequence is stored in the size list by using the
# multi-fasta parser SeqIO.parse
>>> for sequence in SeqIO.parse("test_book_denoised_T400_s25_c08_cd.fa", "fasta"):
...     size.append(len(sequence))
...
# the average length is estimated using numpy.mean and the computed
# value is formatted with 2 decimal positions
>>> mean = "%.2f" % numpy.mean(size)
# the standard deviation is estimated using numpy.std and the computed
# value is formatted with 2 decimal positions
>>> sd = "%.2f" % numpy.std(size)
# estimation of the minimum length
>>> min_length = min(size)
# estimation of the maximum length
>>> max_length = max(size)
# finally, the estimated values are printed
>>> print("The denoised sequences are %d" % len(size))
The denoised sequences are 399
>>> print("the average length is %s and the standard deviation is %s" % (mean, sd))
the average length is 345.62 and the standard deviation is 43.60
>>> print("the shortest sequence is long %d while the longest %d" % (min_length, max_length))
the shortest sequence is long 205 while the longest 374
The previous operations show that the dataset considered in the
exercise contains sequences shorter than 250 nt. Accordingly, a
confidence threshold of 0.5 is set for the RDP classifier by using
the “--conf” parameter. The flag “--hier_outfile” produces the output
file containing the assignment count for each annotated taxonomic
rank and the flag “--assign_outfile” specifies the output file
containing the assignment details for each sequence.
    acc = fields[0]
    # the number of reads is the last field of the accession
    s = acc.split("_")
    reads = int(s[-1])
    stop = len(fields)
    index = 2
    while index < stop:
        node = fields[index]
        rank = fields[index+1]
        conf = float(fields[index+2])
        if conf >= 0.5:
            name2rank.setdefault(rank,set())
            name2rank[rank].add(node)
            name2reads.setdefault(node,0)
            name2reads[node] += reads
            index += 3
        else:
            # ranks below the first low-confidence assignment are skipped
            index = stop
#now we can print the data in a csv file
csv=open("data_summary_results.csv","w")
for rank in name2rank.keys():
    csv.write(rank+"\n")
    for node in name2rank[rank]:
        csv.write(node+"\t"+str(name2reads[node])+"\n")
csv.close()
The script can be applied simply by typing “python extract_
data.py” and in few seconds it produces the “data_summary_
results.csv” file.
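The core logic of the script can also be exercised on a few in-memory lines, which is convenient for checking it before running on real output. The sketch below is a self-contained variant, not the original extract_data.py: the function name `summarize_assignments` and the example lines are hypothetical, and the assumed layout is tab-separated with the accession first, one extra field, and then repeated (name, rank, confidence) triplets.

```python
def summarize_assignments(lines, min_conf=0.5):
    """Sum read weights per taxon from RDP-like assignment lines.

    Assumed layout (tab-separated): accession, one extra field, then
    repeated (name, rank, confidence) triplets; the accession ends in
    '_<weight>' as in the denoised fasta file.
    """
    name2rank = {}
    name2reads = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        reads = int(fields[0].split("_")[-1])
        for i in range(2, len(fields) - 2, 3):
            name, rank, conf = fields[i], fields[i + 1], float(fields[i + 2])
            if conf < min_conf:
                break  # deeper ranks are not considered either
            name2rank.setdefault(rank, set()).add(name)
            name2reads[name] = name2reads.get(name, 0) + reads
    return name2rank, name2reads


# Made-up assignment lines for two denoised sequences (weights 10 and 5).
example_lines = [
    "seq_0_10\t\tBacteria\tdomain\t1.0\tFirmicutes\tphylum\t0.9\tBacilli\tclass\t0.4",
    "seq_1_5\t\tBacteria\tdomain\t1.0\tProteobacteria\tphylum\t0.8",
]
ranks, counts = summarize_assignments(example_lines)
print(sorted(counts.items()))
# [('Bacteria', 15), ('Firmicutes', 10), ('Proteobacteria', 5)]
```

In the first example line the class-level assignment has a confidence of 0.4, below the 0.5 threshold, so it is not counted, while the domain and phylum assignments above it are.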
According to the produced data, Firmicutes (1,099 reads,
34.03 %) and Proteobacteria (1,083 reads, 33.54 %) are the most
represented phyla, while Bacteroidetes, Fusobacteria and
Actinobacteria correspond only to 15.89 % (513 reads), 20.13 %
(650 reads) and 6.41 % (207 reads), respectively. These data
are consistent with those obtained in previous human microbiota
surveys [31, 32].
4 Notes
the “-O” option is used to redirect the output in folder (in our
case “sff_file_dir”).
2. As reported in AmpliconNoise documentation, there are some
alternative methods to manage the sff file such as by using the
free software Flower [33] or the process_sff.py script included in
QIIME [34]. Moreover, the Mothur suite provides the com-
mand sffinfo.
3. In Mothur suite, PyroNoise and SeqNoise are re-implemented
in shhh.flows and shhh.seqs commands, respectively.
4. The whole AmpliconNoise analysis is computationally very heavy.
For this reason, the several steps are performed by using the
MPI libraries in order to parallelise the job on 8/10 nodes. By
means of these measures, the pyro- and seq-noise run times
can be reduced to 30 and 20 min, respectively, for the tested
dataset.
References
1. Hemme CL, Deng Y, Gentry TJ et al (2010) Metagenomic insights into evolution of a heavy metal-contaminated groundwater microbial community. ISME J 4(5):660–672
2. Ottman N, Smidt H, de Vos WM et al (2012) The function of our microbiota: who is out there and what do they do? Front Cell Infect Microbiol 2:104
3. Dutton RJ, Turnbaugh PJ (2012) Taking a metagenomic view of human nutrition. Curr Opin Clin Nutr Metab Care 15(5):448–454
4. Knight R, Jansson J, Field D et al (2012) Unlocking the potential of metagenomics through replicated experimental design. Nat Biotechnol 30(6):513–520
5. Barnard D, Casanueva A, Tuffin M et al (2010) Extremophiles in biofuel synthesis. Environ Technol 31(8–9):871–888
6. Shokralla S, Spall JL, Gibson JF et al (2012) Next-generation sequencing technologies for environmental DNA research. Mol Ecol 21:1794–1805
7. Luo C, Tsementzi D, Kyrpides N et al (2012) Direct comparisons of Illumina vs. Roche 454 sequencing technologies on the same microbial community DNA sample. PLoS One 7:e30087
8. Taberlet P, Coissac E, Pompanon F et al (2012) Towards next-generation biodiversity assessment using DNA metabarcoding. Mol Ecol 21(8):2045–2050
9. Blaalid R, Kumar S, Nilsson RH et al (2013) ITS1 versus ITS2 as DNA metabarcodes for fungi. Mol Ecol Resour 13(2):218–224
10. Santamaria M, Fosso B, Consiglio A et al (2012) Reference databases for taxonomic assignment in metagenomics. Brief Bioinform 13(6):682–695
11. Tringe SG, Hugenholtz P (2008) A renaissance for the pioneering 16S rRNA gene. Curr Opin Microbiol 11(5):442–446
12. Nilsson RH, Kristiansson E, Ryberg M et al (2008) Intraspecific ITS variability in the kingdom fungi as expressed in the international sequence databases and its implications for molecular species identification. Evol Bioinform Online 4:193–201
13. Teeling H, Glöckner FO (2012) Current opportunities and challenges in microbial metagenome analysis - a bioinformatic perspective. Brief Bioinform 13(6):728–742
14. Gilbert JA, Field D, Swift P et al (2010) The taxonomic and functional diversity of microbes at a temperate coastal site: a 'multi-omic' study of seasonal and diel temporal variation. PLoS One 5(11):e15545
15. Bazinet AL, Cummings MP (2012) A comparative evaluation of sequence classification programs. http://drum.lib.umd.edu/handle/1903/13346
16. Simon C, Daniel R (2011) Metagenomic analyses: past and future trends. Appl Environ Microbiol 77(4):1153–1161
17. DeSantis TZ, Hugenholtz P, Larsen N et al (2006) Greengenes, a chimera-checked 16S rRNA gene database and workbench compatible with ARB. Appl Environ Microbiol 72(7):5069–5072
18. Cole JR, Chai B, Marsh TL et al (2003) Ribosomal Database Project. The ribosomal database project (RDP-II): previewing a new autoaligner that allows regular updates and the new prokaryotic taxonomy. Nucleic Acids Res 31(1):442–443
19. Pruesse E, Quast C, Knittel K et al (2007) SILVA: a comprehensive online resource for quality checked and aligned ribosomal RNA sequence data compatible with ARB. Nucleic Acids Res 35(21):7188–7196
20. Roche Applied Sciences (2008) Genome sequencer data analysis software manual. Roche Diagnostics GmbH, Germany
21. Metzker ML (2010) Sequencing Technologies - the Next Generation. Nat Rev Genet 11(1):31–46
22. Ewing B, Green P (1998) Base-calling of automated sequencer traces using phred II error probabilities. Genome Res 8(3):186–194
23. Quince C, Lanzen A, Davenport RJ et al (2011) Removing noise from pyrosequenced amplicons. BMC Bioinformatics 12:38
24. Schloss PD (2009) A high-throughput DNA sequence aligner for microbial ecology studies. PLoS One 4(12):e8230
25. Balzer S, Malde K, Lanzén A et al (2010) Characteristics of 454 pyrosequencing data - enabling realistic simulation with flowsim. Bioinformatics 26(18):i420–i425
26. Huse SM, Huber JA, Morrison HG et al (2007) Accuracy and quality of massively parallel DNA pyrosequencing. Genome Biol 8(7):R143
27. Chuong BD, Batzoglou S (2008) What is the expectation maximization algorithm? Nat Biotechnol 26(8):897–899
28. Wang Q, Garrity GM, Tiedje JM et al (2007) Naïve bayesian classifier for rapid assignment of rRNA sequences into the new bacterial taxonomy. Appl Environ Microbiol 73(16):5261–5267
29. Cole JR, Wang Q, Cardenas E et al (2009) The Ribosomal Database Project: improved alignments and new tools for rRNA analysis. Nucleic Acids Res 37(Database issue):D141–D145. doi:10.1093/nar/gkn879
30. Claesson MJ, O'Sullivan O, Wang Q et al (2009) Comparative analysis of pyrosequencing and a phylogenetic microarray for exploring microbial community structures in the human distal intestine. PLoS One 4(8):e6669
31. Gosalbes MJ, Abellan JJ, Durbán A et al (2012) Metagenomics of human microbiome: beyond 16s rDNA. Clin Microbiol Infect 18(4):47–49
32. Andersson AF, Lindberg M, Jakobsson H et al (2008) Comparative analysis of human gut microbiota by barcoded pyrosequencing. PLoS One 3(7):e2836
33. Malde K (2011) Flower: extracting information from pyrosequencing data. Bioinformatics 27(7):1041–1042
34. Caporaso JG, Kuczynski J, Stombaugh J et al (2010) QIIME allows analysis of high-throughput community sequencing data. Nat Methods 7(5):335–336
Chapter 17
Abstract
Metatranscriptomic data contributes another piece of the puzzle to understanding the phylogenetic struc-
ture and function of a community of organisms. High-quality total RNA is a bountiful mixture of ribosomal,
transfer, messenger and other noncoding RNAs, where each family of RNA is vital to answering questions
concerning the hidden microbial world. Software tools designed for deciphering metatranscriptomic data
fall under two main categories: the first is to reassemble millions of short nucleotide fragments produced by
high-throughput sequencing technologies into the original full-length transcriptomes for all organisms
within a sample, and the second is to taxonomically classify the organisms and determine their individual
functional roles within a community. Species identification is mainly established using the ribosomal RNA
genes, whereas the behavior and functionality of a community is revealed by the messenger RNA of the
expressed genes. Numerous chemical and computational methods exist to separate families of RNA prior to
conducting further downstream analyses, primarily suitable for isolating mRNA or rRNA from a total RNA
sample. In this chapter, we demonstrate a computational technique for filtering rRNA from total RNA using
the software SortMeRNA. Additionally, we propose a post-processing pipeline using the latest software
tools to conduct further studies on the filtered data, including the reconstruction of mRNA transcripts for
functional analyses and phylogenetic classification of a community using the ribosomal RNA.
1 Introduction
Ernesto Picardi (ed.), RNA Bioinformatics, Methods in Molecular Biology, vol. 1269,
DOI 10.1007/978-1-4939-2291-8_17, © Springer Science+Business Media New York 2015
280 Evguenia Kopylova et al.
2 Materials
2.1 Software
The software tool SortMeRNA is a fast and accurate filter for identifying
and sorting rRNA in a set of metatranscriptomic reads.
SortMeRNA is written in C++ and freely distributed under the GPL
license as a stand-alone version or as a Galaxy wrapper. Both distri-
butions, including the user manual with installation instructions,
can be downloaded from http://bioinfo.lifl.fr/RNA/sortmerna/.
SortMeRNA supports multi-threading and has been tested on
Linux (Ubuntu, Fedora, CentOS, and Debian) and Mac OS 10.6.8
systems. For compilation, a g++ compiler version 4.3 or higher is
required. Our experiments were performed using version 1.8 of
SortMeRNA, on an Intel(R) Xeon(R) CPU W3520 2.67 GHz
machine with 8 GB of RAM and one thread.
2.2 Ribosomal RNA Databases
To filter out rRNA, SortMeRNA performs a search against a database
of known rRNAs, such as the public rRNA databases SILVA
[9], Greengenes [10], Rfam [11], and RDP-II [12], which provide
access to millions of rRNA sequences. SortMeRNA can be used
with any of these databases, any fraction of them, or with any custom
database. The only requirement is that sequences are in the
FASTA format. For easier use, the distribution package of
SortMeRNA also includes eight nonredundant representative
databases for 16S bacteria and archaea, 18S eukarya, 23S bacteria
and archaea, 28S eukarya, 5S bacteria/archaea, and 5.8S eukarya,
which were derived from the raw SILVA rRNA database sets using
the tools ARB [13] and UCLUST [14]. We first remove contaminated
sequences, such as chimeric rRNA and partial mRNA, and
then use the tool UCLUST to construct a nonredundant database
according to the identity threshold of the curated sequences.
Table 1
List of examples of relevant software which can be used for performing read quality control

Filter                     Program
Vector/Adapter clipping    TagCleaner [16], FASTX-Toolkit
Low quality trimming       PRINSEQ [17], FASTX-Toolkit
Low complexity regions     dust [18]
Read length                PRINSEQ
Clone removal              PRINSEQ
Error correction           Coral [19]
2. Remove adapters and primers from the whole read and low-quality
nucleotides from both ends (while the quality value is lower
than 20), then continue with the longest sequence free of
adapters and low-quality bases.
3. Remove sequences between the second unknown nucleotide
(N) and the end of the read. This step is done on trimmed
(adapters + quality) reads.
4. Discard reads shorter than 30 nucleotides after trimming.
5. Remove reads that mapped onto run quality control sequences
(PhiX genome).
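The quality-related parts of steps 2 to 4 above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the pipeline actually used for the dataset: the function name trim_read is hypothetical, quality values are assumed to be already decoded to integers, the second N itself is assumed to be removed together with the rest of the read, and adapter clipping and PhiX removal are left out because they require external tools.

```python
def trim_read(seq, quals, min_quality=20, min_length=30):
    """Apply the quality part of step 2, then steps 3 and 4, to one read.

    Returns (sequence, qualities) or None if the read is discarded.
    Adapter/primer clipping and PhiX removal are intentionally not covered.
    """
    # Step 2 (quality part): trim bases with quality < min_quality from both ends.
    start, end = 0, len(seq)
    while start < end and quals[start] < min_quality:
        start += 1
    while end > start and quals[end - 1] < min_quality:
        end -= 1
    seq, quals = seq[start:end], quals[start:end]

    # Step 3: cut the read from the second unknown nucleotide (N) onwards.
    # Assumption: the second N itself is removed as well.
    first_n = seq.find("N")
    if first_n != -1:
        second_n = seq.find("N", first_n + 1)
        if second_n != -1:
            seq, quals = seq[:second_n], quals[:second_n]

    # Step 4: discard reads shorter than min_length after trimming.
    if len(seq) < min_length:
        return None
    return seq, quals


# Hypothetical read: 5 low-quality bases at each end around 40 good bases.
example = trim_read("A" * 5 + "C" * 40 + "G" * 5, [10] * 5 + [30] * 40 + [10] * 5)
print(example[0] if example else "discarded")
```

In practice, the tools listed in Table 1 perform these operations far more efficiently and handle FASTQ quality encodings directly.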
3 Methods
3.1.2 How to Run SortMeRNA
We filter the 16S rRNA from our ERA236149 dataset using our
two previously indexed databases. The command line is as follows:
sortmerna -n 2 --db ./rRNA_databases/silva-bac-16s-database-id85.fasta \
    ./rRNA_databases/silva-arc-16S-database-id95.fasta \
    --I ERA236149.fastq --accept rrna --bydbs --other nonrrna --log bilan
Where the options correspond to:
Input:
-n <N> is the number <N> of databases to search,
--db <string> is for a space separated list of <N> databases,
--I <filename> is for the input reads,
Output:
--accept <filename without extension> is for the output file(s)
of matching reads,
--bydbs is to output the matching reads to <N> different files
(one for each database),
--other <filename without extension> is the output file for
non-matching reads,
--log <filename> gives a statistics file for the percentages of
reads matching to each of the <N> databases.
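When the filtering step is part of a scripted workflow, the command line can be assembled programmatically from the options listed above. The following Python sketch only builds and prints the argument list; it uses exclusively the options documented in this section for version 1.8, and the helper name build_sortmerna_command is hypothetical. Other SortMeRNA versions may use different option names, so the user manual should be checked before running it.

```python
import shlex


def build_sortmerna_command(databases, reads, accept="rrna", other="nonrrna",
                            log="bilan", by_database=True):
    """Return the argument list for the SortMeRNA call described in the text."""
    args = ["sortmerna", "-n", str(len(databases)), "--db", *databases,
            "--I", reads, "--accept", accept]
    if by_database:
        args.append("--bydbs")          # one output file per database
    args += ["--other", other, "--log", log]
    return args


cmd = build_sortmerna_command(
    ["./rRNA_databases/silva-bac-16s-database-id85.fasta",
     "./rRNA_databases/silva-arc-16S-database-id95.fasta"],
    "ERA236149.fastq",
)
# Print the command so that it can be inspected before being executed.
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run once SortMeRNA is installed and the databases have been indexed.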
Paired reads: paired reads are managed using the options --paired-in
and --paired-out. Although SortMeRNA does not use parity
information during filtering, when run in multi-threading mode,
the output of matching and non-matching reads can lose the parity
order of the original file. To maintain parity of reads when one read
matches and the other does not, the option --paired-in will output
both reads of the pair into the --accept file, whereas the option
--paired-out will output both reads into the --other file. If neither
option is set, the output of SortMeRNA may not
maintain parity order.
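The behaviour of the two pairing options can be summarized as a small decision function. The sketch below is our own paraphrase of the description above, not SortMeRNA code; the function name route_pair and the string labels are hypothetical.

```python
def route_pair(read1_matches, read2_matches, mode=None):
    """Return the output ('accept' or 'other') chosen for each read of a pair.

    mode is 'paired-in', 'paired-out', or None when neither option is set.
    """
    if mode not in (None, "paired-in", "paired-out"):
        raise ValueError("mode must be None, 'paired-in' or 'paired-out'")
    if read1_matches == read2_matches or mode is None:
        # Both reads agree, or no pairing option: each read follows its own result.
        return tuple("accept" if m else "other"
                     for m in (read1_matches, read2_matches))
    # Exactly one read matches: keep the pair together in a single file.
    target = "accept" if mode == "paired-in" else "other"
    return (target, target)


for mode in (None, "paired-in", "paired-out"):
    print(mode, route_pair(True, False, mode))
```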
3.1.3 Output
From our example above, there will be four output files: two files
of reads classified as rRNA (one for the 16S bacteria database and one
for the 16S archaea database), one file for the non-rRNA reads, and one
file for the overall statistics of the ERA236149 dataset. Sorting
~21 million reads took 1.5 h using one thread.
According to the statistics file, approximately 27 % of these reads were
classified as 16S rRNA against one of the two reference databases.
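To put these figures into perspective, the approximate number of rRNA reads and the single-thread throughput can be derived with a few lines of arithmetic. The inputs are the rounded values quoted above, so the results are rough estimates only.

```python
total_reads = 21_000_000      # "~21 million reads" (approximate)
rrna_fraction = 0.27          # "approximately 27 %"
runtime_hours = 1.5           # reported run time with one thread

rrna_reads = total_reads * rrna_fraction
reads_per_second = total_reads / (runtime_hours * 3600)

print(f"16S rRNA reads: about {rrna_reads / 1e6:.1f} million")
print(f"non-matching reads: about {(total_reads - rrna_reads) / 1e6:.1f} million")
print(f"throughput: about {reads_per_second:.0f} reads per second")
```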
3.1.4 Advanced Options
Memory management: The size of ERA236149.fastq is 5.2 GB. By
default, SortMeRNA works internally with 1 GB of memory allocated
for the reads, without any required user intervention to split
the file into multiple parts. If the user has sufficient memory to
load all 5.2 GB of reads into memory, the option -m <M> allows
3.2.1 Installation
The Galaxy wrapper for SortMeRNA is available on the Toolshed,
the place for sharing Galaxy tools with any Galaxy instance. Any
local administrator may install SortMeRNA by browsing the
Toolshed, and Galaxy takes care of downloading and installing
SortMeRNA, as well as indexing the ribosomal RNA databases
provided with SortMeRNA.
3.2.2 Input
The Galaxy main graphical interface has three parts (see Fig. 1): the
left panel lists all the available tools; the right panel is the history of
all the steps of the current analysis and their outputs; and the central
panel displays all of the input, output, and parameter criteria for
launching SortMeRNA.
The input FASTA or FASTQ file of reads must be first imported
to the Galaxy history. Small files may be uploaded from the local
hard drive, by selecting the “Upload file from your computer” in
the “Get data” section; bigger files can be transferred from an
accessible URL or by FTP. Once done, our file “ERA236149.fasta”
and its associated metadata are visible in the right panel history.
We bring up SortMeRNA by selecting “Filter with SortMeRNA”
in the left panel. We select the desired reads file from the
history—only FASTA/FASTQ files can be selected. Most of the
options selected by default are appropriate for our experiment;
we tick the 16S databases. We want to retrieve both the classified
reads and non-rRNA reads, as well as the statistics file, so we tick
all three options.
3.2.3 Output
The SortMeRNA outputs are automatically imported to the history
as Galaxy datasets, and are recognized as FASTA/FASTQ
files, which can be used for further analysis inside Galaxy.
3.2.4 Advanced Options
It is possible to use personal databases to filter our reads against.
We must first import them to Galaxy as FASTA files, just like the
reads file. In the SortMeRNA interface, we then select “Databases
from your history” as “Databases to query,” and add as many files
as we want. The indexation of the rRNA sequences file will be
automatically invoked by Galaxy before running SortMeRNA—
note however that the index must be computed each time, which
is time-consuming. Alternatively, more databases can be added to
the default list by the Galaxy administrator.
3.3 Further Analyses
Once SortMeRNA has been run, one might want to perform further
analyses of the filtered dataset. In this section we touch on
various stages of filtered data manipulation, such as the reconstruction
of full-length RNA from short reads, taxonomic assignment
for ribosomal RNA, and functional analysis of messenger RNA
using well-maintained protein databases. Please refer to Fig. 2
for an illustration of the pipeline. Many of the tools shown in this
pipeline are also available in Galaxy.
[Fig. 2, pipeline diagram. Stages shown: RAW DATA (metatranscriptomic reads, total RNA); 2.1 CLEAN (PRINSEQ, FASTX-Toolkit, digital normalization, see khmer); 2.2 SORT (SortMeRNA), separating mRNA and rRNA; PROCESS (2.3 ASSEMBLE, 2.4 MAP, ASSEMBLE/MAP; tools shown: Oases, TopHat2, Trinity, BLAST, khmer, EMIRGE, CLC Genomics Workbench (CLC Bio)); ANALYZE (2.5 FUNCTION, 2.6 TAXONOMY; tools shown: FragGeneScan, Glimmer, MG-RAST, MEGAN 4, mothur, QIIME, RDP Classifier, ARB)]
using the program EMIRGE [23], which can also provide relative
abundances of rRNA in the community sample. For reconstruct-
ing full-length mRNA transcripts from a metatranscriptome,
we describe two work-around methods applicable for raw or
Metatranscriptomic Data Analysis 287
3.3.3 Functional Analysis of mRNA
For assessing the potential functions of mRNA sequences, it is routine
to search the reads or assembled transcripts (see Subheading 3.3.1)
for homologs in a public reference database such as the NCBI non-redundant
protein database (NCBI-nr) [34], the RDP-II database,
KEGG (for enzymes and pathway networks) [35], or SEED
(protein sequences with known functional roles) [36]. The
MG-RAST [37] server and MEGAN4 [38] are two services providing
comprehensive tools for the annotation and functional
3.3.4 Taxonomic Classification of rRNA
Due to the high conservation of rRNA genes between species and
their presence in all cells, the 16S and 18S small subunit rRNAs
serve as signatures to infer whole organism phylogeny [43]. For a
reliable representation of species diversity, short reads covering
rRNA variable regions are more accurately classified than those
covering more highly conserved regions. Therefore, tools that are
more sensitive to sequence variation will yield a higher
taxonomic resolution of a given metatranscriptome. In the following
two paragraphs, we describe two methods for rRNA classification,
either by using the raw unassembled short reads, or with an
assembled set of full-length rRNA genes.
Short rRNA fragments (reads): MEGAN4 implements visual and
interactive phylogenetic trees from a given set of rRNA reads, and
allows the user to carefully examine the alignment of each read
with its assigned species. To use MEGAN4, the user must first run
a BLASTN search to compare a set of reads to the SILVA database
and then import the BLAST results file to calculate taxonomic clas-
sification using the NCBI taxonomy. Although the searching step
can be computationally expensive for large sets of reads and relaxed
parameter settings, the BLAST algorithm can be efficient at finding
the reads covering variable regions in low coverage samples.
Acknowledgments
References
1. Kapranov P et al (2007) RNA maps reveal new RNA classes and a possible function for pervasive transcription. Science 316(5830):1484–1488
2. Velculescu VE et al (1995) Serial analysis of gene expression. Science 270(5235):484–487
3. Shiraki T et al (2003) Cap analysis gene expression for high-throughput analysis of transcriptional starting point and identification of promoter usage. Proc Natl Acad Sci U S A 100(26):15776–15781
4. Janda JM, Abbott SL (2007) 16S rRNA gene sequencing for bacterial identification in the diagnostic laboratory: pluses, perils, and pitfalls. J Clin Microbiol 45(9):2761–2764
5. Sorek R, Cossart P (2010) Prokaryotic transcriptomics: a new view on regulation, physiology and pathogenicity. Nat Rev Genet 11(1):9–16
6. Boissinot K, Huletsky A, Peytavi R et al (2007) Rapid exonuclease digestion of PCR-amplified targets for improved microarray hybridization. Clin Chem 53(11):2020–2023
7. Yi H, Cho YJ, Won S et al (2011) Duplex-specific nuclease efficiently removes rRNA for prokaryotic RNA-seq. Nucleic Acids Res 39(20):e140
8. Kopylova E, Noe L, Touzet H (2012) SortMeRNA: fast and accurate filtering of ribosomal RNAs in metatranscriptomic data. Bioinformatics 28(24):3211–3217
9. Quast C, Pruesse E, Yilmaz P et al (2013) The SILVA ribosomal RNA gene database project: improved data processing and web-based tools. Nucleic Acids Res 41(D1):D590–D596
10. DeSantis TZ, Hugenholtz P, Larsen N et al (2006) Greengenes, a chimera-checked 16S rRNA gene database and workbench compatible with ARB. Appl Environ Microbiol 72(7):5069–5072
11. Griffiths-Jones S, Bateman A, Marshall M et al (2003) Rfam: an RNA family database. Nucleic Acids Res 31(1):439–441
12. Cole JR, Wang Q, Cardenas E et al (2008) The Ribosomal Database Project: improved alignments and new tools for rRNA analysis. Nucleic Acids Res 37:D141–D145
13. Ludwig W, Strunk O, Westram R et al (2004) ARB: a software environment for sequence data. Nucleic Acids Res 32(4):1363–1371
14. Edgar RC (2010) Search and clustering orders of magnitude faster than BLAST. Bioinformatics 26(19):2460–2461
15. Brown CT, Howe A, Zhang Q et al (2013) A reference-free algorithm for computational normalization of shotgun sequencing data. https://www.e-biogenouest.org/resources/46
16. Schmieder R, Lim YW, Rohwer F et al (2010) TagCleaner: identification and removal of tag sequences from genomic and metagenomic datasets. BMC Bioinformatics 11:341
17. Schmieder R, Edwards R (2011) Quality control and preprocessing of metagenomic datasets. Bioinformatics 27(6):863–864
18. Morgulis A, Gertz EM, Schäffer AA et al (2006) A fast and symmetric DUST implementation to mask low-complexity DNA sequences. J Comput Biol 13(5):1028–1040
19. Salmela L, Schroder J (2011) Correcting errors in short reads by multiple alignments. Bioinformatics 27(11):1455–1461
20. Goecks J, Nekrutenko A, Taylor J, The Galaxy Team (2010) Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biol 11(8):R86. doi:10.1186/gb-2010-11-8-r86
21. Radax R, Rattei T, Lanzen A et al (2012) Metatranscriptomics of the marine sponge Geodia barretti: tackling phylogeny and function of its microbial community. Environ Microbiol 14(5):1308–1324
22. Fan L, McElroy K, Thomas T (2012) Reconstruction of ribosomal RNA genes from metagenomic data. PLoS One 7(6):e39948. doi:10.1371/journal.pone.0039948
23. Miller CS, Baker BJ, Thomas BC et al (2011) EMIRGE: reconstruction of full-length ribosomal genes from microbial community short read sequencing data. Genome Biol 12(5):R44
24. Luo R, Liu B, Xie Y et al (2012) SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler. GigaScience. doi:10.1186/2047-217X-1-18
25. Mason OU, Hazen TC, Borglin S et al (2012) Metagenome, metatranscriptome and single-cell sequencing reveal microbial response to Deepwater Horizon oil spill. ISME J 6(9):1715–1727
26. Sommer DD, Delcher AL, Salzberg SL et al (2007) Minimus: a fast, lightweight genome assembler. BMC Bioinformatics. doi:10.1186/1471-2105-8-64
27. Schulz MH, Zerbino DR, Vingron M et al (2012) Oases: robust de novo RNA-seq assembly across the dynamic range of expression levels. Bioinformatics 28(8):1086–1092
28. Grabherr MG, Haas BJ, Yassour M et al (2011) Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nat Biotechnol 29:644–652
29. Pell J, Hintze A, Canino-Koning R et al (2012) Scaling metagenome sequence assembly with probabilistic de Bruijn graphs. Proc Natl Acad Sci U S A. doi:10.1073/pnas.1121464109
30. Altschul SF, Gish W, Miller W et al (1990) Basic local alignment search tool. J Mol Biol 215(3):403–410
31. Langmead B, Trapnell C, Pop M et al (2009) Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 10:R25
32. Kim D, Pertea G, Trapnell C et al (2013) TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biol 14(4):R36
33. Langmead B, Salzberg SL (2012) Fast gapped-read alignment with Bowtie 2. Nat Methods 9(4):357–359
34. Pruitt KD, Tatusova T, Maglott DR (2005) NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res 33:D501–D504
35. Kanehisa M, Goto S, Sato Y et al (2012) KEGG for integration and interpretation of large-scale molecular datasets. Nucleic Acids Res 40:D109–D114
36. Overbeek R, Begley T, Butler RM et al (2005) The subsystems approach to genome annotation and its use in the project to annotate 1000 genomes. Nucleic Acids Res 33(17):5691–5702
37. Meyer F, Paarmann D, D’Souza M et al (2008) The metagenomics RAST server – a public resource for the automatic phylogenetic and functional analysis of metagenomes. BMC Bioinformatics 9:386. doi:10.1186/1471-2105-9-386
38. Hudson DH, Mitra S, Ruscheweyh HJ et al (2011) Integrative analysis of environmental sequences using MEGAN4. Genome Res 21(9):1552–1560
39. Mitra S, Rupek P, Richter DC et al (2011) Functional analysis of metagenomes and metatranscriptomes using SEED and KEGG. BMC Bioinformatics 12(Suppl 1):S21
40. Rho M, Tang H, Ye Y (2010) FragGeneScan: predicting genes in short and error-prone reads. Nucleic Acids Res 38(20):e191
41. Delcher AL, Bratke KA, Powers EC et al (2007) Identifying bacterial genes and endosymbiont DNA with Glimmer. Bioinformatics 23(6):673–679
42. Lin X, Hong C, Xiaohua H et al (2006) Average gene length is highly conserved in prokaryotes and eukaryotes and diverges only between the two kingdoms. Mol Biol Evol 23(6):1107–1108
43. Lane DJ, Pace B, Olsen GJ et al (1985) Rapid determination of 16S ribosomal RNA sequences for phylogenetic analyses. Proc Natl Acad Sci U S A 82(20):6955–6959
44. Schloss PD, Westcott SL, Ryabin T et al (2009) Introducing mothur: open-source, platform-independent, community-supported software for describing and comparing microbial communities. Appl Environ Microbiol 75(23):7537–7541
45. Caporaso JG, Kuczynski J, Stombaugh J et al (2010) QIIME allows analysis of high-throughput community sequencing data. Nat Methods 7(5):335–336
46. Wang Q, Garrity GM, Tiedje JM et al (2007) Naive Bayesian classifier for rapid assignment of rRNA sequences into the new bacterial taxonomy. Appl Environ Microbiol 73(16):5261–5267
Chapter 18
Abstract
Next-generation sequencing (NGS) technologies have opened new avenues of unprecedented power for
research in molecular biology and genetics. In particular, their application to the study of RNA-binding
proteins (RBPs), extracted through immunoprecipitation (RIP), makes it possible to sequence and characterize all
RNAs that were found to be bound in vivo by a given RBP (RIP-Seq). On the other hand, NGS-based
experiments, including RIP-Seq, produce millions of short sequence fragments that have to be processed
with suitable bioinformatic tools and methods to recover and/or quantify the original sequence sample. In
this chapter we provide a survey of different approaches that can be taken for the analysis of RIP-Seq data
and the identification of the RNAs bound by a given RBP.
1 Introduction
Ernesto Picardi (ed.), RNA Bioinformatics, Methods in Molecular Biology, vol. 1269,
DOI 10.1007/978-1-4939-2291-8_18, © Springer Science+Business Media New York 2015
294 Federico Zambelli and Giulio Pavesi
2 Materials
3 Methods
3.1 RIP-Seq and Its Variants
In general, the steps that have to be followed for an IP experiment
are similar regardless of the protein studied or the antibody
employed. From a bioinformatic point of view, the main difference
lies in whether and how the protein is cross-linked (fixed) to
the DNA/RNA before the IP [4]. For RBPs, this may sometimes
lead to different choices in the bioinformatic treatment of the
data. Sequencing of the RNA sample with cross-linking of RBPs
is often referred to in the literature as CLIP-Seq or HITS-CLIP [18].
PAR-CLIP [17], in turn, is based on the incorporation
of photoreactive ribonucleoside analogs, such as 4-thiouridine
(4-SU) and 6-thioguanosine (6-SG), into nascent RNA transcripts.
3.3 Sequencing RNAs: Mapping Against the Genome
As with most NGS applications for which the genome of the organism
studied is available, the first step of the analysis consists of
mapping the sequence reads on the genome, that is, finding on the
genome the corresponding sequence region or, in the case of RNAs,
the region that was transcribed into the RNA itself. The sequence
read length currently available allows for unambiguous mapping of
nearly all the sequence reads not coming from repetitive DNA
regions, and sequence quality is good enough that at most two
or three substitutions per read need to be allowed.
Another aspect that has to be kept in mind is that if mature
RNAs are sequenced, a sizable fraction of the sequence reads
obtained will cover an exon–exon junction on the RNA. Hence,
mapping reads of this kind without allowing for insertions on the
genome will not find any match, since the alignment of the read
would have to be split by the intervening intron(s), as shown in
Fig. 1. However, fast and reliable sequence mapping tools for
Fig. 1 Result of the mapping of RNA-Seq reads derived from mature mRNAs on
the genome. Reads that covered exon–exon junctions will be split on the genome,
and their mapping will encompass the intervening intron
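The split mapping of a junction-spanning read shown in Fig. 1 can be illustrated with a short Python sketch. It assumes a hypothetical transcript on the forward strand whose exons are given as half-open genomic intervals, and it converts a read position expressed in transcript coordinates into the genomic blocks it covers. Real spliced aligners must of course discover these blocks from the sequence itself; the function name and coordinates are illustrative only.

```python
def split_read_over_exons(exons, transcript_start, read_length):
    """Map a read given in transcript coordinates onto genomic blocks.

    exons: list of (genome_start, genome_end) half-open intervals, in
    transcript order on the forward strand.
    Returns the list of (genome_start, genome_end) blocks covered by the read.
    """
    blocks = []
    offset = 0                     # transcript coordinate where the current exon starts
    remaining = read_length
    position = transcript_start
    for exon_start, exon_end in exons:
        exon_length = exon_end - exon_start
        if remaining > 0 and position < offset + exon_length:
            block_start = exon_start + (position - offset)
            block_end = min(exon_end, block_start + remaining)
            blocks.append((block_start, block_end))
            remaining -= block_end - block_start
            position += block_end - block_start
        offset += exon_length
    if remaining > 0:
        raise ValueError("read extends past the end of the transcript")
    return blocks


# Hypothetical gene with two 100-nt exons separated by a 300-nt intron.
exons = [(100, 200), (500, 600)]
print(split_read_over_exons(exons, 80, 50))   # read spanning the junction
print(split_read_over_exons(exons, 10, 50))   # read inside the first exon
```

A read that yields more than one block cannot be placed by an aligner that requires a single contiguous match on the genome.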
3.4 RIP-Seq: Using Read Counts
Starting from the considerations outlined in the previous sections,
at the end of the mapping phase the mapping coordinates of each
sequence read can be intersected with the genomic coordinates of
exons, and of the respective transcripts or genes to which they
belong. This step, which can be performed by simple in-house
developed scripts, or by using a module contained in more general-purpose
analysis tools, yields the basic information needed to assess
enrichment, that is, a “read count” associated with each gene or
transcript annotated on the genome, in turn proportional to the
enrichment of the transcript in the samples sequenced. At this
point, if a control experiment is available, a simple but effective
strategy we implemented with good results [19], borrowing from
the analysis of SAGE data [20], evaluates differential enrichment
for a gene or transcript with respect to the control by using a 2 × 2
Chi-Square test. This kind of test compares the relative
enrichment of the gene or transcript across the two experiments,
that is, it evaluates whether the relative enrichment of the gene differs
and whether enough reads are associated with the gene to indicate a
significant difference rather than a random fluctuation of the data deriving
from low read counts. Since a separate statistical test is performed
on each gene or transcript, suitable corrections for multiple testing,
such as Bonferroni or Benjamini–Hochberg, should then be
applied to the resulting p-values, controlling the family-wise error rate
and the false discovery rate, respectively.
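A minimal sketch of this strategy, using only the Python standard library, is given below. It is not the code used in [19]: the 2 × 2 table layout (reads on the gene versus all other reads, in the IP versus the control), the function names, and the read counts are assumptions made for illustration. The p-value uses the fact that, for one degree of freedom, the upper tail of the Chi-Square distribution equals erfc(sqrt(x/2)).

```python
import math


def chi_square_2x2(gene_ip, total_ip, gene_ctrl, total_ctrl):
    """Return (statistic, p_value) for the 2 x 2 table
    [reads on gene, all other reads] x [IP sample, control sample]."""
    a, b = gene_ip, total_ip - gene_ip
    c, d = gene_ctrl, total_ctrl - gene_ctrl
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0, 1.0
    statistic = n * (a * d - b * c) ** 2 / denominator
    # Upper tail of the Chi-Square distribution with 1 degree of freedom.
    return statistic, math.erfc(math.sqrt(statistic / 2.0))


def bonferroni(p_values):
    """Bonferroni adjustment (controls the family-wise error rate)."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values (false discovery rate), input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):            # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical read counts: gene -> (reads in IP, reads in control).
total_ip, total_ctrl = 1_000_000, 1_000_000
counts = {"geneA": (300, 100), "geneB": (50, 48)}
p_values = [chi_square_2x2(ip, total_ip, ctrl, total_ctrl)[1]
            for ip, ctrl in counts.values()]
for gene, p_adj in zip(counts, benjamini_hochberg(p_values)):
    print(gene, f"{p_adj:.3g}")
```

Because the test works on raw counts, the totals of the two libraries enter the table directly and no separate normalization step is needed in this sketch.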
RIP-Seq Data Analysis 299
3.5 Treating RIP-Seq as RNA-Seq
RNA-Seq, that is, sequencing of whole transcriptomes, has become
the de facto standard also for expression studies, aimed not only at
the reconstruction of the repertoire of alternative splicing and
RNAs expressed by a given cell line or tissue but, more simply, given
an existing gene/transcript annotation, at the quantification of the
transcript level of genes, possibly at the single isoform level. The
application of RNA-Seq has boomed in the last few years, and several
different approaches have been introduced for its analysis. For
example, if a reliable gene annotation is available, DESeq identifies
differentially expressed genes starting from read counts associated
with them, with an approach similar to the one described in the
previous section. Instead of a Chi-Square test, a negative binomial
distribution is employed, and mean and variance of read counts are
linked by local regression. DEXSeq performs similar computations
at the single exon level, in order to identify differentially enriched
alternative splicing events. RSEM starts from the mapping of reads against
a reference transcriptome, and outputs, for each gene or transcript
of the annotation, abundance estimates (including estimation of the
relative abundance of alternative transcripts of the same gene),
95 % credibility intervals, and visualization files. Differentially
enriched genes or isoforms can in turn be identified through the
EBSeq package included in the RSEM software distribution.
The Cufflinks pipeline is also currently one of the most widely
used approaches in RNA-Seq analysis. Without delving into the
details, starting from the mapping of sequence reads against the
genome, an estimate of the transcript level of each gene, and of
each annotated isoform of each gene, can be computed, expressed
as Fragments Per Kilobase of exon per Million fragments mapped
(FPKM), a measure which is already normalized with respect to
the transcript length and the overall number of sequence reads
produced by an experiment. FPKM values associated with a given
gene or transcript can be compared across two or more samples,
and tested for significant differences denoting “under-” or “over-expression”
in the different conditions. Cufflinks returns plenty of
additional information, together with the FPKM values, like the
raw read counts associated with genes and transcripts, estimates of
the statistical significance of differences in gene (or isoform) expression,
or alternative promoter usage.
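The normalization behind the FPKM measure follows directly from its name: the fragment count is divided by the transcript length in kilobases and by the number of mapped fragments in millions. The sketch below shows only this arithmetic on hypothetical numbers; Cufflinks itself estimates fragment counts per isoform with a statistical model that is not reproduced here.

```python
def fpkm(fragments, transcript_length_bp, total_mapped_fragments):
    """Fragments Per Kilobase of exon per Million fragments mapped."""
    kilobases = transcript_length_bp / 1e3
    millions = total_mapped_fragments / 1e6
    return fragments / (kilobases * millions)


# Hypothetical transcript: 500 fragments, 2,000 nt long, 20 million mapped fragments.
print(fpkm(500, 2000, 20_000_000))

# The same fragment count on a transcript twice as long gives half the FPKM.
print(fpkm(500, 4000, 20_000_000))
```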
Regardless of the approach, it is straightforward to see how
methods for RNA-Seq analysis can be applied to a sample extracted
by RIP-Seq. The reliability of the results depends on the distribu-
tion of the reads that were produced in the RIP-Seq sample, that
is, how the RBP-bound RNAs are enriched with respect to the
control, which in turn depends on the specificity of the antibody
employed. Also, RNA-Seq analysis tools need a suitable sequenc-
ing depth, and at least two replicates, to calculate the significance
of enrichment in a reliable way. An interesting feature of this kind
of analysis is that the FPKM (or analogous) values are computed at
the single transcript and isoform level, that is, the tools return an abundance
estimate and differential expression for each of the alternative transcripts
of a gene. This might be of great interest if the RBP
studied is, for example, a splicing factor, or if it is known or suspected
to bind only some specific alternatively spliced transcripts.
Fig. 2 A typical ChIP-Seq enrichment profile corresponding to a DNA bound region. The actual protein-binding
site is usually located in correspondence to a local maximum (peak) of the sequencing coverage plot in the region
3.7 RIPSeeker
A recent tool that has been explicitly designed for RIP-Seq data
analysis is RIPSeeker, which however shares some similarities with
other ChIP-Seq analysis methods, especially those developed to
identify regions enriched for histone modifications. Once again,
the starting point of the analysis is the mapping of sequence reads
on the genome performed with TopHat, allowing for read split
mapping across different exons. The genome is then in turn divided
into “bins” of a given size, and the read count in each bin is com-
puted. The general idea is to first identify candidate RBP-bound
enriched bins by training a hidden Markov model (HMM) with an
Expectation Maximization procedure. Thus, the trained HMM
yields for each bin the probability of being a “RBP-bound region,”
or not. Once the HMM has been trained, the genome is scanned
with the HMM itself, yielding the Viterbi (highest likelihood) par-
tition of the genome into “RBP-bound” or “not bound” regions
given by merging of consecutive “bound” or “unbound” bins.
Comparison to other methods for RNA-Seq or ChIP-Seq seems to
demonstrate the advantages offered by this tool. As for other meth-
ods for NGS data processing, the usage is quite straightforward,
needing as input only the result of the read mapping, with most of
the needed parameters calculated automatically (like bin size) or set
to suitable default values.
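The general idea of binning followed by a two-state HMM and a Viterbi partition can be sketched as follows. This is a deliberately simplified illustration and not the RIPSeeker implementation: emissions are modeled here with Poisson distributions, and the rates, transition probabilities, and start probabilities are fixed by hand instead of being trained by Expectation Maximization. All names and numbers are assumptions.

```python
import math


def bin_counts(read_starts, genome_length, bin_size):
    """Count reads per fixed-size bin from their mapping start positions."""
    counts = [0] * ((genome_length + bin_size - 1) // bin_size)
    for start in read_starts:
        counts[start // bin_size] += 1
    return counts


def log_poisson(k, rate):
    """Log-probability of observing k reads under a Poisson with the given rate."""
    return k * math.log(rate) - rate - math.lgamma(k + 1)


def viterbi_two_state(counts, rates=(1.0, 10.0), stay=0.9, start=(0.9, 0.1)):
    """Most likely state path (0 = not bound, 1 = RBP-bound) for the bin counts."""
    if not counts:
        return []
    log_trans = [[math.log(stay), math.log(1 - stay)],
                 [math.log(1 - stay), math.log(stay)]]
    score = [math.log(start[s]) + log_poisson(counts[0], rates[s]) for s in (0, 1)]
    backpointers = []
    for k in counts[1:]:
        new_score, pointers = [], []
        for s in (0, 1):
            best_prev = max((0, 1), key=lambda p: score[p] + log_trans[p][s])
            pointers.append(best_prev)
            new_score.append(score[best_prev] + log_trans[best_prev][s]
                             + log_poisson(k, rates[s]))
        score = new_score
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max((0, 1), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


def merge_bins(path, bin_size):
    """Merge consecutive bins with the same state into (start, end, state) regions."""
    regions = []
    for i, state in enumerate(path):
        if regions and regions[-1][2] == state:
            regions[-1] = (regions[-1][0], (i + 1) * bin_size, state)
        else:
            regions.append((i * bin_size, (i + 1) * bin_size, state))
    return regions


counts = [0, 1, 0, 12, 9, 11, 1, 0]          # hypothetical read counts per 100-nt bin
print(merge_bins(viterbi_two_state(counts), 100))
```

In a real analysis the emission and transition parameters would be estimated from the data, which is the role of the training step described above.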
4 Notes
References
1. Horner DS, Pavesi G, Castrignanò T et al (2010) Bioinformatics approaches for genomics and post genomics applications of next-generation sequencing. Brief Bioinform 11(2):181–197
2. Marioni JC, Mason CE, Mane SM et al (2008) RNA-seq: an assessment of technical reproducibility and comparison with gene expression arrays. Genome Res 18(9):1509–1517
3. Glisovic T, Bachorik JL, Yong J et al (2008) RNA-binding proteins and post-transcriptional gene regulation. FEBS Lett 582(14):1977–1986
4. Collas P (2010) The current state of chromatin immunoprecipitation. Mol Biotechnol 45(1):87–100
5. Furey TS (2012) ChIP-seq and beyond: new and improved methodologies to detect and characterize protein-DNA interactions. Nat Rev Genet 13(12):840–852
6. Yuan J, Muljo SA (2013) Exploring the RNA world in hematopoietic cells through the lens of RNA-binding proteins. Immunol Rev 253(1):290–303
7. Pepke S, Wold B, Mortazavi A (2009) Computation for ChIP-seq and RNA-seq studies. Nat Methods 6(11 Suppl):S22–S32
8. Trapnell C, Pachter L, Salzberg SL (2009) TopHat: discovering splice junctions with RNA-Seq. Bioinformatics 25(9):1105–1111
9. Langmead B, Trapnell C, Pop M, Salzberg SL (2009) Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 10(3):R25
10. Wu TD, Nacu S (2010) Fast and SNP-tolerant detection of complex variants and splicing in short reads. Bioinformatics 26(7):873–881
11. Li B, Dewey CN (2011) RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC Bioinform 12:323
12. Leng N, Dawson JA, Thomson JA, Ruotti V, Rissman AI, Smits BM, Haag JD, Gould MN, Stewart RM, Kendziorski C (2013) EBSeq: an empirical Bayes hierarchical model for inference in RNA-seq experiments. Bioinformatics 29(8):1035–1043
13. Anders S, Huber W (2010) Differential expression analysis for sequence count data. Genome Biol 11(10):R106
14. Anders S, Reyes A, Huber W (2012) Detecting differential usage of exons from RNA-seq data. Genome Res 22(10):2008–2017
15. Trapnell C, Roberts A, Goff L et al (2012) Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nat Protoc 7(3):562–578
16. Li Y, Zhao DY, Greenblatt JF (2013) RIPSeeker: a statistical package for identifying protein-associated transcripts from RIP-seq experiments. Nucleic Acids Res 41(8):e94
17. Corcoran DL, Georgiev S, Mukherjee N et al (2011) PARalyzer: definition of RNA binding sites from PAR-CLIP short-read sequence data. Genome Biol 12(8):R79
18. Licatalosi DD, Mele A, Fak JJ et al (2008) HITS-CLIP yields genome-wide insights into brain alternative RNA processing. Nature 456(7221):464–469
19. Mihailovic M, Wurth L, Zambelli F et al (2012) Widespread generation of alternative UTRs contributes to sex-specific RNA binding by UNR. RNA 18(1):53–64
20. Michiels EM, Oussoren E, Van Groenigen M et al (1999) Genes differentially expressed in medulloblastoma and fetal brain. Physiol Genomics 1(2):83–91
21. Micsinai M, Parisi F, Strino F et al (2012) Picking ChIP-seq peak detectors for analyzing chromatin modification experiments. Nucleic Acids Res 40(9):e70
22. Zhang Y, Liu T, Meyer CA et al (2008) Model-based analysis of ChIP-Seq (MACS). Genome Biol 9(9):R137
Part III
Abstract
The ViennaRNA package is a widely used collection of programs for thermodynamic RNA secondary
structure prediction. Over the years, many additional tools have been developed building on the core
programs of the package to also address issues related to noncoding RNA detection, RNA folding kinetics,
or efficient sequence design considering RNA-RNA hybridizations. The ViennaRNA web services provide
easy and user-friendly web access to these tools. This chapter describes how to use this online platform to
perform tasks such as prediction of minimum free energy structures, prediction of RNA-RNA hybrids, or
noncoding RNA detection. The ViennaRNA web services can be used free of charge and can be accessed
via http://rna.tbi.univie.ac.at.
Key words RNA secondary structure prediction, RNA-RNA interaction prediction, Noncoding
RNA detection, Sequence design
1 Introduction
Ernesto Picardi (ed.), RNA Bioinformatics, Methods in Molecular Biology, vol. 1269,
DOI 10.1007/978-1-4939-2291-8_19, © Springer Science+Business Media New York 2015
308 Andreas R. Gruber et al.
Fig. 1 Outline of the ViennaRNA web services. Based on their functionalities, programs are grouped into five
categories
2 Materials
2.1 Hardware and Software Requirements
Use of the ViennaRNA web services does not require installation
of particular software other than a common web browser that supports
JavaScript. We recommend, however, using modern web
browsers such as Mozilla Firefox, Google Chrome, or Safari since
these browsers natively support rendering of image files in SVG
format and do not require installation of additional plug-ins.
2.2 Input Data Formats
Tools from the ViennaRNA web services require input of DNA or
RNA sequences in either FASTA or ClustalW format, two widely
used formats for storing sequence information. The user can paste
sequences into the corresponding field or upload files via a file
upload dialog. Unless DNA thermodynamic parameters are specifi-
cally selected, sequences will automatically be converted to RNA
sequences (replacing T residues by U) upon submission of the job.
The programs of the ViennaRNA package do not have any intrin-
sic restriction on the length or number of sequences that can be
processed; certain restrictions do apply for the web services,
though. Limitations have been chosen gracefully and are stated on
the web pages of the individual services.
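For users who prepare their input with a script before pasting or uploading it, the FASTA handling and the DNA-to-RNA conversion described above can be mimicked as follows. This is a minimal sketch with hypothetical function names; it is not part of the ViennaRNA package, and sequence lines that precede the first header line are simply ignored.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) tuples."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def to_rna(sequence):
    """Replace T residues by U, preserving upper and lower case."""
    return sequence.replace("T", "U").replace("t", "u")


example = ">seq1 test\nACGT\nTTGA\n>seq2\nggtt\n"
for name, seq in read_fasta(example):
    print(name, to_rna(seq))
```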
3 Methods
3.1 Secondary Structure Prediction with the RNAfold Web Server
3.1.1 Description and Scope of the Program
The RNAfold server offers an interface to the most basic secondary structure
prediction program within the ViennaRNA web services. In short, the RNAfold
algorithm uses a set of thermodynamic energy parameters in a dynamic programming
framework to calculate the minimum free energy (MFE) and its associated secondary
structure of a single RNA molecule. In addition, the partition function of the
equilibrium ensemble of all secondary structures is computed, which in turn allows
base pair probabilities and reliability measures for the MFE prediction to be
inferred.
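To give a feeling for the dynamic programming framework, the following Python sketch implements a much simpler stand-in: it maximizes the number of canonical base pairs (a Nussinov-style recursion) instead of minimizing free energy. It does not use thermodynamic parameters and is not the RNAfold algorithm; the function name and the minimum hairpin loop size of three are assumptions of this sketch.

```python
# Canonical Watson-Crick and GU wobble pairs.
CANONICAL = {("A", "U"), ("U", "A"), ("G", "C"),
             ("C", "G"), ("G", "U"), ("U", "G")}
MIN_LOOP = 3  # assumed minimum number of unpaired bases in a hairpin loop


def max_base_pairs(seq):
    """Return the maximum number of nested canonical base pairs in seq."""
    n = len(seq)
    # best[i][j] holds the optimum for the subsequence seq[i..j].
    best = [[0] * n for _ in range(n)]
    for span in range(MIN_LOOP + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]  # case 1: position j stays unpaired
            for k in range(i, j - MIN_LOOP):
                # case 2: position j pairs with position k
                if (seq[k], seq[j]) in CANONICAL:
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score
    return best[0][n - 1] if n else 0
```

An energy-based method replaces the "+1 per base pair" score by loop energies, but the decomposition into smaller subsequences follows the same idea.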
3.1.2 Description of Input Data and Program Options
The input sequence for the RNAfold web server can be uploaded from the client
computer as a FASTA file or simply typed/pasted into the input field of the web
interface. Without changing the default options, the server computes the MFE and
partition function for the ensemble of canonical structures, i.e., structures
without helices of size 1. However, structures having such isolated base pairs can
be included in the prediction by unchecking the “avoid isolated base pairs” option.
Since a weak GU pair at the end of a helix is likely to unfold, especially if the
helix branches off a large loop, exclusion of such conformations usually yields more
stable structures. Therefore, for some predictions, it might be a good idea to use
the appropriate checkbox in the basic options. The auxiliary constraint folding
feature of RNAfold allows for an even bigger restriction of the secondary structure
space. Here, a pseudo-dot-bracket notation consisting of structure constraint
symbols specifies whether a nucleotide must be unpaired (symbol “x”), should be
paired with another unspecified base (symbols “|”, “<”, “>”), or has a predefined
pairing partner (symbols “(”, “)”).
Note that the web server also allows for MFE-only prediction; however, relying on
this single representative of the ground state may be misleading. Unless the
sequence length is very small, the partition function and the ensemble properties
derived from it (1) provide a reliability measure of the prediction and (2) take
into account that an RNA molecule in thermodynamic equilibrium is in constant flux
between the ground state and several distinct low-energy structures.
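The constraint notation described above lends itself to a simple sanity check before a job is submitted. The Python sketch below is illustrative only: `check_constraint` is a hypothetical helper, and the use of “.” for positions without any constraint is an assumption not stated in the text.

```python
# Symbols named in the text, plus "." for "no constraint" (an assumption).
ALLOWED = set(".x|<>()")


def check_constraint(sequence, constraint):
    """Return the enforced (i, j) pairs (0-based) or raise ValueError."""
    if len(sequence) != len(constraint):
        raise ValueError("constraint and sequence must have equal length")
    unknown = set(constraint) - ALLOWED
    if unknown:
        raise ValueError(f"unknown constraint symbols: {sorted(unknown)}")
    stack, pairs = [], []
    for pos, symbol in enumerate(constraint):
        if symbol == "(":
            stack.append(pos)
        elif symbol == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {pos + 1}")
            pairs.append((stack.pop(), pos))
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1] + 1}")
    return sorted(pairs)
```

For example, the constraint `(x.....|)` for a 9-nt sequence enforces a pair between the first and last nucleotide, keeps the second nucleotide unpaired, and requires the eighth nucleotide to pair with some unspecified partner.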
Fig. 2 Sample RNAfold output depicting the minimum free energy secondary structure, the centroid structure,
and graphical representations thereof
3.2 Prediction of Consensus Secondary Structures with the RNAalifold Web Server
3.2.1 Description and Scope of the Program
For certain classes of functional RNA molecules, secondary structure elements are
often evolutionarily more conserved than the underlying primary sequence. The
accuracy of single-sequence RNA secondary structure prediction tools is limited,
irrespective of the approach or method employed [18]. Consensus structure prediction
of a set of (aligned) sequences is thus a commonly used way to increase the
predictive power of structure prediction tools and to spot evolutionarily conserved,
often functional structural motifs.
In addition to thermodynamic folding, consensus structure
prediction relies on signals of sequence variation. In particular,
consistent mutations (one base in a base pair is changed with
respect to a reference sequence) and compensatory mutations
(both bases are changed, e.g., a G-C base pair in one sequence and
an A-U base pair in the other) are counted as supporting evidence
for a conserved secondary structure, while base pairs that can only
be formed in a single or few sequences contradict the formation of
a conserved structural motif.
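The distinction between consistent mutations, compensatory mutations, and counterexamples can be made concrete with a short Python sketch that inspects two alignment columns. This is an illustration, not RNAalifold's scoring code; the function names are hypothetical, and ignoring gap-gap combinations is an assumption of this sketch.

```python
CANONICAL = {"AU", "UA", "GC", "CG", "GU", "UG"}


def mutation_kind(pair_a, pair_b):
    """Classify the change between two base pairs given as 2-letter strings."""
    changed = sum(x != y for x, y in zip(pair_a, pair_b))
    return {0: "identical", 1: "consistent", 2: "compensatory"}[changed]


def column_pair_evidence(col_i, col_j):
    """Summarise the evidence for a base pair between two alignment columns.

    col_i and col_j hold one character per aligned sequence.
    """
    pair_types = set()
    counterexamples = 0
    for a, b in zip(col_i.upper(), col_j.upper()):
        pair = a + b
        if pair in CANONICAL:
            pair_types.add(pair)
        elif pair != "--":  # gap-gap is ignored here (assumption)
            counterexamples += 1
    return {
        "pair_types": sorted(pair_types),
        "covariation": len(pair_types) > 1,
        "counterexamples": counterexamples,
    }
```

Columns that show several different canonical pair types and no counterexamples support a conserved base pair, whereas columns with many counterexamples argue against it.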
ViennaRNA Web Services 313
3.2.2 Description of Input Data and Program Options
The input for RNAalifold consists of multiple sequence alignments in FASTA or
ClustalW format. Information on the phylogenetic tree underlying the alignment is
currently not considered by the algorithm. It must be noted that the quality of the
structure prediction is directly influenced by the quality of the alignment. For
average pairwise sequence identities above 75 %, alignments generated by simple
sequence-only-based alignment tools such as ClustalW
(http://www.ebi.ac.uk/Tools/msa/clustalw2) are sufficient for good results. For a
set of highly divergent sequences, we recommend using an alignment program that
considers both sequence and secondary structures such as LocARNA
(http://rna.informatik.uni-freiburg.de/LocARNA).
RNAalifold has two ways of scoring evolutionary conservation
of base pairs. The first approach simply counts occurrences of
canonical base pairs and counterexamples for each base pair. The
algorithm then only considers base pairs with less than three coun-
terexamples. The other scoring scheme uses RIBOSUM scores,
which are substitution matrices for RNA sequences, to compute an
alignment-like score for each possible base pair. The main advan-
tage of the RIBOSUM score is that it does not penalize counterex-
amples as harshly as the simple scoring scheme. For alignments with
many sequences, this feature will allow RNAalifold to infer a con-
sensus structure even if there are some counterexamples, possibly
due to faulty alignment choices. The new version of RNAalifold [6]
also introduced an alternate way of dealing with gap characters in
the aligned sequences, with gaps not contributing to the energy
computation. We recommend the default setting which is the new
RNAalifold version with RIBOSUM scoring, but other variants can
be chosen as well.
Many of the parameters of RNAfold apply to RNAalifold as well, but there are two
parameters unique to RNAalifold. The parameter “Weight of covariance term” controls
the weighting of energy contributions relative to contributions from base pairs that
support structural conservation. The default value of 1 implies equal contribution,
while a value below 1 will down-weight the conservation score.
On the other hand, “Penalty for non-compatible sequences” controls the impact of
counterexamples. If this value is set to 0, non-supporting base pairs will not be
penalized. Both parameters are accessible via the advanced options field.
3.2.3 Description of the Program Output
The RNAalifold output page provides the consensus secondary structure in dot-bracket
notation and the computed minimum free energy, which actually is a pseudo-energy
with contributions coming from both sequence-averaged free energy minimization and
base pair conservation bonus energies. Each of the individual contributions is
listed separately. The textual representation of the secondary structure is assisted
by graphical output in the form of a structure-annotated alignment and a
visualization of the secondary structure highlighting consistent and compensatory
mutations (Fig. 3).
3.3.2 Description of Input Data and Program Options
RNAcofold takes two RNA sequences, termed sequence A and sequence B, as its input
and will, by default, compute the partition function and base pair probabilities
not only of the heterodimer (i.e., the structure formed by a dimer of the two input
sequences, hence denoted as AB) but also of the homodimers (AA and BB), which are
the structures that two molecules of sequence A or sequence B will form. Binding of
two molecules is always a concentration-dependent process. Therefore, the RNAcofold
server offers two ways of predicting concentrations of dimers. It either varies the
monomer concentrations using predefined values or uses a list of concentrations
provided by the user. In the latter case, two values separated by white space have
to be provided per line.
Fig. 3 Sample RNAalifold output. (a) Visualization of the consensus secondary structure predicted by RNAalifold.
The number of different types of base pairs observed for a particular pair of positions in the alignment is color
coded. Pale colors indicate contradicting evidence, meaning that not all sequences in the alignment can form
the particular base pair. Red indicates base pairs that show conservation at nucleotide level in all sequences,
while base pairs colored in green, for example, are supported by compensatory/consistent mutations; in this
case, three different types of base pairs are observed. (b) Structure-annotated multiple sequence alignment.
The same color-coding scheme as discussed above applies here
3.3.3 Description of the Program Output
The RNAcofold output page, by default, displays the heterodimer structure in
dot-bracket notation with an “&” sign separating the two sequences. In addition, dot
plots of the heterodimer structure (AB) as well as the two monomer structures (A, B)
and the two homodimer structures (AA, BB) are provided. In the dot plots of the
dimers, a cross sign is used to indicate the cut point that separates the two
molecules. The output page also contains ensemble free energies of all five species,
as well as the energy that is gained by forming the heterodimer compared to the two
monomers, denoted as “Delta G for heterodimer binding.”
If the option “Vary concentration” was selected, concentration dependency plots
showing the relative concentrations for each dimer (AA, BB, AB) and the monomers
(A, B) will be generated. If the user specified concentrations, results will be
provided in the form of a table.
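Two small pieces of the RNAcofold workflow can be sketched in Python: reading a user-supplied concentration list (two whitespace-separated values per line) and computing the energy gained by heterodimer formation from the ensemble free energies. Both functions are hypothetical helpers, and the formula assumes that “Delta G for heterodimer binding” is the free energy of AB minus the free energies of the monomers A and B.

```python
def parse_concentrations(text):
    """Parse lines of 'conc_A conc_B' into a list of float pairs."""
    pairs = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        fields = line.split()
        if not fields:
            continue  # ignore empty lines
        if len(fields) != 2:
            raise ValueError(f"line {line_number}: expected two values")
        pairs.append((float(fields[0]), float(fields[1])))
    return pairs


def heterodimer_binding_energy(g_ab, g_a, g_b):
    """Energy gained by forming AB from monomers A and B (same unit as input).

    Assumed definition: G(AB) - (G(A) + G(B)); negative values favor binding.
    """
    return g_ab - (g_a + g_b)
```

With made-up ensemble free energies of -12.5 (AB), -4.0 (A), and -3.0 (B) kcal/mol, the sketch yields a binding energy of -5.5 kcal/mol.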
3.4 RNA-RNA Interaction Prediction: RNAup
3.4.1 Description and Scope of the Program
RNAup takes another approach for the prediction of intermolecular RNA-RNA
interactions [9]. It basically models RNA-RNA interaction as a two-step process,
which resembles a simplified approach of what is actually happening in vivo when
RNA-RNA hybrids form. In the first step, the interaction sites have to be made
accessible (i.e., single stranded) in order to engage in intermolecular base
pairing. In the second step, the supposed interaction sites of two RNA molecules can
then form base pairs between each other. The RNAup algorithm follows this model by
first computing the opening energy for every sequence window of length w.
Computation of interaction energies for all intervals with a maximum length of W is
then accomplished in a second step. Lastly, results from these two steps are merged
to obtain the total interaction energy, which is minimized to find the best site of
interaction. This approach allows any form of structure around the interaction
sites, thus being able to also predict, for example, the kissing hairpin structural
motif. However, the drawback of this approach is that only a single interaction site
can be taken into account. Therefore, RNAup is mainly used for the prediction of
interactions of short RNAs on long target RNAs such as bacterial sRNAs on mRNAs.
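The final merging step can be illustrated with a few lines of Python. The sketch assumes that opening energies and duplex energies have already been computed for a handful of candidate sites; the dictionary keys and the function name are hypothetical, and the numbers in the usage note are made up.

```python
def best_interaction_site(candidates):
    """Pick the candidate site with the lowest total interaction energy.

    Each candidate is a dict with the keys 'site' (any label), 'duplex'
    (energy gained by hybridization, usually negative), and 'open_long' and
    'open_short' (energies needed to make the site accessible in each
    molecule, usually positive).
    """
    def total(candidate):
        return (candidate["duplex"]
                + candidate["open_long"]
                + candidate["open_short"])

    best = min(candidates, key=total)
    return best["site"], total(best)
```

A site with a duplex energy of -14.0 kcal/mol but an opening energy of 8.5 kcal/mol (total -5.5) loses against a site with a duplex energy of -12.0 kcal/mol and an opening energy of 4.0 kcal/mol (total -8.0), which shows why accessibility matters.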
3.4.3 Description of the Program Output
The RNAup output page will report the location of the interaction sites in both
sequences and the interaction secondary structure in dot-bracket notation. The
interaction free energy is composed of multiple contributions (energy gained from
duplex formation and energies needed to open existing secondary structures in the
two molecules), which are separately listed. The output is also accompanied by two
plots depicting the interaction free energy (red) and the energy needed to open
existing secondary structure along the longer of the two sequences. The original
RNAup output text file is also provided, which lists detailed energy values for each
nucleotide and, if selected, detailed energy contributions of individual loop types.
3.5.2 Description of Input Data and Program Options
Input for the RNAinverse server is the desired secondary structure in dot-bracket
notation (see Subheading 3.10 for more details) and optionally an RNA sequence
string of equal length. If no sequence is provided, RNAinverse will generate a
random sequence. Note that when specifying the sequence, the letter “N” can be used
to indicate a random base. Certain restrictions also apply to the secondary
structure. A hairpin loop has to consist of at least three unpaired bases, and
crossing base pairs, i.e., pseudo-knots, are not allowed. A sample input of a
sequence-structure pair is given below:
.((((...))))...((((...)))).
NANNNNNNNNNUNNNNNNNNNNNNNNN
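The restrictions on the input can be checked programmatically before submission. The following Python sketch is an illustration under stated assumptions: `validate_design_input` is a hypothetical helper, the accepted sequence alphabet (A, C, G, U, N) is assumed, and pseudo-knots are excluded simply by accepting only “.”, “(”, and “)” in the structure.

```python
def validate_design_input(structure, sequence=None, min_hairpin=3):
    """Check a dot-bracket structure and an optional start sequence.

    Returns the (structure, sequence) pair; a missing sequence is replaced
    by a string of 'N' characters of equal length.
    """
    if set(structure) - set(".()"):
        raise ValueError("structure may only contain '.', '(' and ')'")
    stack = []
    for pos, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(pos)
        elif symbol == ")":
            if not stack:
                raise ValueError("unbalanced structure: unmatched ')'")
            opening = stack.pop()
            # Any pair enclosing fewer than min_hairpin positions would
            # close a hairpin loop that is too short.
            if pos - opening - 1 < min_hairpin:
                raise ValueError("hairpin loop has fewer than three bases")
    if stack:
        raise ValueError("unbalanced structure: unmatched '('")
    if sequence is None:
        sequence = "N" * len(structure)
    sequence = sequence.upper()
    if len(sequence) != len(structure):
        raise ValueError("sequence and structure must have equal length")
    if set(sequence) - set("ACGUN"):
        raise ValueError("sequence may only contain A, C, G, U and N")
    return structure, sequence
```

Calling the function with only a structure returns that structure together with an all-N sequence, which corresponds to letting the server choose a random start sequence.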
3.5.3 Description of the Program Output
The results page lists the generated sequences and the minimum free energy of the
predicted structure. Links to the RNAfold server that automatically paste the
sequence into the input field are provided for each sequence. This allows the user
to explore the predicted secondary structure and folding potential of the sequences
in more detail. We recommend performing this analysis, especially for sequences
listed under “Results for thermodynamic ensemble prediction.” The procedure of
optimizing the frequency in the thermodynamic ensemble may lead to deviation of the
minimum free energy structure from the desired structure. When multiple sequences
are generated, results are presented in sortable tables, which allow convenient
sorting, e.g., by free energy.
3.6 Exploring Folding Kinetics with the Barriers Server
3.6.1 Description and Scope of the Program
Some RNA molecules are known to take a very long time to adopt their ground-state
structure after transcription, or never adopt it at all. These RNAs usually get
trapped in a so-called metastable conformation, either due to co-transcriptional
folding or while refolding from a denatured state. Approaches like RNAfold that
consider the RNA molecule to be in thermodynamic equilibrium can therefore not be
used to predict this behavior since they exclude folding kinetics by design.
However, when all secondary structures an RNA molecule can adopt are known, they can
be interpreted as points in a landscape with their corresponding free energy marking
the elevation. A move set, e.g., inserting/removing a base pair, then defines the
neighborhood of any given structure. Finally, folding kinetics is given by how RNA
molecules explore this landscape, which in turn can be modeled as a Markov process.
Given an initial population density of the structure space, the population density
at an arbitrary point in time can then be computed mathematically. However, the
number of secondary structures an RNA molecule can fold into grows exponentially
with its sequence length, making the solution impractical to compute even for very
short RNAs of about 50 nt. Thus, a widely used approach to circumvent this
limitation is to heavily shrink the state space by lumping secondary structure
states together into so-called macro states. One possibility of macro-state
generation is to use barrier trees, where all structures that reach the same local
minimum via consecutive steepest descent moves belong to the same macro state.
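The assignment of structures to macro states by steepest descent can be sketched on a toy landscape in Python. The sketch is not the barriers program: the states are arbitrary labels with made-up energies, the neighborhood is given explicitly instead of being derived from a move set, and ties are broken by label, which is an assumption of this example.

```python
def assign_macro_states(energies, neighbors):
    """Map every state to the local minimum reached by steepest descent.

    energies:  dict mapping state -> energy
    neighbors: dict mapping state -> list of neighboring states
    """
    def descend(state):
        while True:
            lower = [n for n in neighbors[state]
                     if energies[n] < energies[state]]
            if not lower:
                return state  # no lower neighbor: local minimum reached
            # Move to the lowest neighbor; ties are broken by label.
            state = min(lower, key=lambda n: (energies[n], n))

    return {state: descend(state) for state in energies}
```

For a linear chain of states a-b-c-d-e with energies -3, -1, 0, -2, and -4, states a and b fall into the macro state of the local minimum a, while c, d, and e fall into the macro state of e.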
3.6.2 Description of Input Data and Program Options
A single RNA sequence with at most 100 nt in length may be used as input for the
barriers server. In order to keep the barrier tree simple, the user may set the
maximum number of leaves (default 50). Furthermore, local minima that do not exhibit
an energy barrier higher than a certain user-defined threshold can be merged. This
allows for keeping the number of leaves in the barrier tree small even for RNAs with
high diversity of low-energy structures. The number of secondary structures
generated by RNAsubopt can be limited again by omitting structures with isolated
base pairs, which is the default, or removing all structures where a weak GU pair
terminates a helix.
3.6.3 Description of the Program Output
Among the results of the barriers web server calculations are the number of
secondary structures obtained by RNAsubopt as well as the resulting barrier tree in
graphical form. Additionally, the refolding path between any of the ten lowest free
energy macro states can be displayed with an animated SVG image. In this image, the
energy profile in the form of a height plot and a circular and a regular structure
representation are shown for each intermediate structure along the refolding path.
The folding kinetics section of the output consists of the
downloadable transition rate matrix of the Markov process and
four options that control the kinetic simulation. First, the initial
structure, i.e., the structure that has a population density equal to
100 % at the start of the simulation, can be chosen from the ten
lowest free energy macro states. The time span of the simulation,
which is measured in arbitrary time steps, can be set via the “Start
time” and “Stop time” values. Usually, the default values should
suffice; however, for RNAs which populate a metastable state for
too long, the stop time may have to be set to a value considerably
higher than one million. In any case, the “Stop time” has to be set
such that the plot displays macro-state number 1 with highest den-
sity at the end of the simulation. If no further changes appear in the
population density over time, the system has reached equilibrium.
To keep the simulation plot clean from local minima which are
sparsely populated, the treekin options allow for ignoring all macro
states which do not exceed a predefined threshold.
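The kinetic simulation can be pictured as the time evolution of population densities under a transition rate matrix. The Python sketch below uses simple explicit Euler steps, which is a technique swapped in for clarity and not necessarily how the server solves the Markov process; the function name, the rate values in the usage note, and the step size are assumptions.

```python
def simulate_populations(rates, p0, dt, steps):
    """Propagate population densities with explicit Euler steps.

    rates[i][j] is the transition rate from macro state i to macro state j
    (diagonal entries are ignored); p0 is the initial population density.
    The product of dt and the largest rate should be well below 1.
    """
    p = list(p0)
    n = len(p)
    for _ in range(steps):
        flow = [0.0] * n
        for i in range(n):
            for j in range(n):
                if i != j:
                    flux = rates[i][j] * p[i] * dt
                    flow[i] -= flux  # population leaving state i
                    flow[j] += flux  # population arriving in state j
        p = [x + f for x, f in zip(p, flow)]
    return p
```

Starting with 100 % of the population in state 0 and the rates `[[0.0, 0.1], [0.3, 0.0]]`, the densities approach 0.75 and 0.25 once the simulated time span is long enough, and they no longer change after that, which corresponds to the equilibrium criterion mentioned above.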
3.7 Noncoding RNA Gene Prediction with the RNAz Server and Related Tools
3.7.1 Description and Scope of the Program
Detection of functional RNA structures and noncoding RNA (ncRNA) genes is a
challenging task. Simply folding the sequence of interest and assessing its minimum
free energy is not a sufficiently good approach, since the MFE is highly dependent
on the G + C content of the sequence and the length of the sequence. Nevertheless,
functional RNA structures often show elevated thermodynamic stability when compared
to a set of randomized sequences of the same length and base composition [13].
Moreover, functional RNAs are also often found to be evolutionarily conserved [12].
Aiming for accurate prediction of specifically the class of thermodynamically stable
and evolutionarily conserved RNA structures, the ViennaRNA group developed RNAz
[13, 15] and the RNAz server for easy and user-friendly access to the RNAz
toolbox [16].
In detail, the RNAz approach measures thermodynamic stability
in terms of a normalized z-score of folding energies. It indicates how
many standard deviations the structure of a given sequence is more
or less stable than that of random sequences of the same length and
base composition. Negative z-scores thus indicate that a sequence is
more stable than expected by chance. The second component is the
evaluation of the structural conservation, which requires that the
input for RNAz is an alignment of at least two sequences. Structural
conservation is measured by the structure conservation index (SCI),
which evaluates structural conservation indirectly through folding
energies rather than on the structure level itself (see ref. 14 for
details). In theory, a SCI above 1 indicates structural conservation
supported by compensatory mutations. The SCI is, however,
strongly influenced by the alignment quality with respect to second-
ary structures, and as a rule of thumb, a SCI close to or above the
average pairwise identity of the sequences in the alignment can be
considered as good. While this is valuable information to know when
evaluating individual RNAz predictions in detail, we generally advise
the user to assess predictions via the RNA class probability, which is
calculated by support vector machine classification combining ther-
modynamic stability and structural conservation.
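The two measures described above can be written down in a few lines of Python. This is a sketch under stated assumptions and not the RNAz implementation: the background energies are supplied explicitly as a list of MFE values of randomized sequences, the sample standard deviation is used, and the SCI is taken as the consensus folding energy divided by the mean MFE of the individual sequences.

```python
from statistics import mean, stdev


def folding_z_score(native_mfe, background_mfes):
    """Number of standard deviations the native MFE lies from the background.

    Negative values mean the native sequence is more stable than expected.
    """
    return (native_mfe - mean(background_mfes)) / stdev(background_mfes)


def structure_conservation_index(consensus_energy, single_mfes):
    """Consensus folding energy relative to the mean single-sequence MFE."""
    return consensus_energy / mean(single_mfes)
```

With a native MFE of -30 kcal/mol and background energies of -20, -22, and -18 kcal/mol, the z-score is -5. A consensus energy of -18 kcal/mol for two sequences folding at -20 and -16 kcal/mol gives an SCI of 1.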
3.7.2 Description of Input Data and Program Options
Detection of functional RNA structures with the RNAz server is a multistep process.
As stated before, RNAz requires multiple sequence alignments as input. Alignments
can be provided in various formats such as ClustalW or MAF
(http://genome.ucsc.edu/FAQ/FAQformat.html) but have to consist of at least two
sequences and should not be less than 50 nt in length. Multiple alignments in
ClustalW have to be separated by a header line starting with “CLUSTAL W.” After
successful upload of the alignments to the server, the user will be redirected to
the analysis options page (a screenshot is shown in Fig. 4). The RNAz version hosted
at the
Fig. 4 Screenshot of the RNAz analysis options page. For each field, detailed help is available by clicking on
the question mark icons
3.7.3 Description of the Program Output
The RNAz output page is clearly structured, providing links to detailed analyses for
each alignment analyzed. In particular, for each alignment screened, the SCI,
z-score, and RNA class probability are reported among other features. Each
prediction is accompanied by a structure-annotated alignment and visualization of
the consensus secondary structure (see Subheading 3.2.3 for a detailed description
of these plots). We would like to draw the attention of the reader to the excellent
help pages of the RNAz server, which discuss many of the analysis options and the
RNAz output in great detail. In default mode, RNAz screens the alignment in both
sense and antisense direction for putatively conserved functional structures.
Considering the hit with the higher probability of the two screened windows is a
good first choice. For a more detailed analysis, we refer the user to specialized
tools such as RNAstrand ([21]; http://www.bioinf.uni-leipzig.de/Software/RNAstrand).
Results are kept on the server for 30 days and can be accessed by the URL outlined
on the results page. RNAz is also available as a stand-alone program and can be
obtained from http://www.tbi.univie.ac.at/~wash/RNAz. For additional information on
noncoding RNA gene detection, we refer the interested reader to ref. 22.
3.7.4 Related Tools for Noncoding RNA Prediction
While RNAz considers both thermodynamic stability and evolutionary conservation, the
structure conservation analysis (SCA) server exclusively evaluates multiple sequence
alignments with respect to conserved structural elements [14]. This is done by
calculating various measures like the SCI, base pair distance, or tree-editing
distances and comparing the values obtained from the alignment of choice against a
set of randomized alignments generated by shuffling alignment columns while
preserving gap patterns [12]. The output page features a graphical representation of
the calculated scores together with z-scores and empirical p-values for statistical
assessment. Note that for similarity measures like the SCI, a positive z-score
indicates structural conservation, while for distance measures like the base pair
distance, a negative z-score is indicative of structural conservation.
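The randomization step can be approximated with a short Python sketch that shuffles alignment columns only among columns sharing the same gap pattern. It is a simplified illustration, not the procedure of ref. 12 in full, and the function name is hypothetical.

```python
import random
from collections import defaultdict


def shuffle_columns(alignment, rng=None):
    """Shuffle alignment columns among columns with identical gap patterns.

    alignment is a list of equally long, gapped sequences ('-' marks a gap).
    """
    rng = rng or random.Random()
    columns = list(zip(*alignment))
    # Group column indices by their gap pattern.
    groups = defaultdict(list)
    for index, column in enumerate(columns):
        pattern = tuple(ch == "-" for ch in column)
        groups[pattern].append(index)
    shuffled = list(columns)
    for indices in groups.values():
        permuted = indices[:]
        rng.shuffle(permuted)
        for target, source in zip(indices, permuted):
            shuffled[target] = columns[source]
    return ["".join(row) for row in zip(*shuffled)]
```

Scoring many such shuffled alignments with the measure of interest yields the background distribution against which the score of the original alignment can be compared.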
The Bcheck web server is a specialized tool for finding RNaseP
genes [17]. The input simply consists of a FASTA sequence one
wants to screen for a potential RNaseP gene, while the output page
contains the predicted subsequence together with secondary struc-
ture annotation derived from fitting the RNaseP covariance model.
3.8 Other Services Hosted at the ViennaRNA Web Services
So far, we have described the most commonly used programs of the ViennaRNA package
and tools for ncRNA detection. The ViennaRNA web services also offer access to a
series of other specialized tools.
3.9 For the Advanced User: Additional Parameter Options
Interfaces to the core programs of the ViennaRNA package all have an advanced
options panel. Clicking on it will expand the field, showing options to control
energy parameters, dangling ends, and folding temperature. Default RNA energy
parameters are those of the Turner 2004 model [26]. Usage of the Turner 1999 model
is only recommended when reevaluating the folding of RNA sequences from publications
that used the Turner 1999 energy model. Folding of DNA sequences is also possible,
which can be done by selecting “DNA parameters” (Mathews model, 2004). Folding of
DNA/RNA hybrids is currently not implemented. Turner 2004 parameters were measured
at 37 °C, which is also set as default temperature. Other temperatures can be
entered and energy parameters will be extrapolated accordingly, but please note that
for temperatures far from 37 °C, results will become increasingly unreliable.
Dangling ends are nucleotides that stack onto the ends of helices. Programs of the
ViennaRNA package have several options for including dangling ends in the
computations. The default setting is that dangling end energies are considered for
both sides of a helix in any case. Other options are to turn off dangling end energy
contributions or to extend the dangling end model to also include coaxial stacking
of adjacent helices in a multi-loop. For some programs, it is also possible to
specify that the RNA sequence should be treated as a circular RNA molecule. As
circular RNA molecules cannot adopt all conformations of their linear counterparts
due to tension in the exterior loop, a switch to correct the prediction for this
case is available.
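One common way to extrapolate a free energy measured at 37 °C to another temperature is to assume that enthalpy and entropy do not change with temperature. The Python sketch below illustrates this idea; it is not taken from the ViennaRNA source, the function name is hypothetical, and the energy values in the usage note are made up.

```python
T37_KELVIN = 310.15  # 37 degrees Celsius in Kelvin


def rescale_free_energy(dg37, dh, temperature_celsius):
    """Extrapolate a free energy given at 37 C to another temperature.

    dg37: free energy at 37 C, dh: enthalpy (same unit as dg37).
    Assumes temperature-independent enthalpy and entropy.
    """
    temperature = temperature_celsius + 273.15
    ds = (dh - dg37) / T37_KELVIN  # entropy from dG = dH - T * dS at 37 C
    return dh - temperature * ds
```

For a stacking contribution with a free energy of -3.3 kcal/mol at 37 °C and an enthalpy of -10.0 kcal/mol, the extrapolated free energy at 25 °C is more negative than at 37 °C. The further the target temperature is from 37 °C, the less the underlying assumption can be trusted, in line with the caveat above.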
3.11 Insights into Energy Contributions with RNAeval Web Server
Underlying the secondary structure prediction algorithms of the ViennaRNA package is
the so-called loop-based energy model or nearest-neighbor model. The principle of
this model is that any RNA secondary structure can uniquely be decomposed into a set
of loops. In this model, stacked base pairs that build the helices of RNA secondary
structures are also represented as loops, as a special case of interior loops with
no unpaired nucleotide between two base pairs. For users interested in the energy
contributions of individual structural elements to the total minimum free energy, we
recommend the RNAeval server. It takes as input a sequence-structure pair and lists
the individual energy contributions according to the loop-based energy model.
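The loop decomposition itself, without any energies, can be sketched in Python. The code below assumes a balanced, pseudo-knot-free dot-bracket string; the function names and the loop labels are choices of this sketch, and bulges are reported under the label "interior".

```python
def pair_table(structure):
    """Return a list where entry i is the pairing partner of i, or -1."""
    table = [-1] * len(structure)
    stack = []
    for pos, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(pos)
        elif symbol == ")":
            opening = stack.pop()
            table[opening], table[pos] = pos, opening
    return table


def enclosed_pairs(table, start, end):
    """Base pairs directly enclosed in the open interval (start, end)."""
    pairs, pos = [], start + 1
    while pos < end:
        if table[pos] > pos:
            pairs.append((pos, table[pos]))
            pos = table[pos] + 1  # skip everything inside this pair
        else:
            pos += 1
    return pairs


def decompose(structure):
    """List loops as (kind, closing_pair, directly_enclosed_pairs)."""
    table = pair_table(structure)
    outer = enclosed_pairs(table, -1, len(structure))
    loops = [("exterior", None, outer)]
    todo = list(outer)
    while todo:
        i, j = todo.pop()
        inner = enclosed_pairs(table, i, j)
        if not inner:
            kind = "hairpin"
        elif len(inner) == 1:
            # A stack is an interior loop without unpaired nucleotides.
            kind = "stack" if inner[0] == (i + 1, j - 1) else "interior"
        else:
            kind = "multi"
        loops.append((kind, (i, j), inner))
        todo.extend(inner)
    return loops
```

Every base pair closes exactly one loop, so a structure with n base pairs decomposes into n loops plus the exterior loop.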
3.12 Obtaining High-Quality Images for Publication
Images displayed on the results pages are either in a bitmap format such as PNG or
in a vector format such as SVG. We do not recommend using images in PNG format for
figures in publications, but advise the user to instead download the corresponding
file in PDF or EPS format, which preserves the quality of the image when scaled.
Images in these file formats can easily be postprocessed and edited with free image
tools such as Inkscape. If this is not a viable option for the user, results pages
also provide links named “IMAGE CONVERTER” to obtain PNG or GIF files at higher,
user-specified resolutions than those displayed on the results pages.
3.13 The ViennaRNA Package: Stand-Alone Version
A stand-alone version of the ViennaRNA package can be obtained from
http://www.tbi.univie.ac.at/RNA. The ViennaRNA package has been developed and tested
under the operating system Linux and therefore is best operated under Linux or a
Unix-like
3.14 Related Web Services
The Nucleic Acids Research web server index
(http://www.oxfordjournals.org/nar/webserver/subcat/10/90) lists many tools for the
purpose of RNA secondary structure prediction and related tasks and offers a good
starting point for exploring other tools.
Among the services listed, we would like to specifically highlight the Freiburg RNA
tools (http://rna.informatik.uni-freiburg.de), a comprehensive set of programs that
provides, among other services, excellent tools for the prediction of RNA-RNA
interactions [28].
Acknowledgments
References
1. Hofacker IL, Fontana W, Stadler P, Bonhoeffer S, Tacker M, Schuster P (1994) Fast folding and comparison of RNA secondary structures. Monatsh Chem 125:167–188
2. Lorenz R, Bernhart SH, Höner Zu Siederdissen C, Tafer H, Flamm C, Stadler PF, Hofacker IL (2011) ViennaRNA Package 2.0. Algorithm Mol Biol 6:26
3. Hofacker IL (2003) Vienna RNA secondary structure server. Nucleic Acids Res 31:3429–3431
4. Gruber AR, Lorenz R, Bernhart SH, Neuböck R, Hofacker IL (2008) The Vienna RNA websuite. Nucleic Acids Res 36:W70–W74
5. Hofacker IL, Fekete M, Stadler PF (2002) Secondary structure prediction for aligned RNA sequences. J Mol Biol 319:1059–1066
6. Bernhart SH, Hofacker IL, Will S, Gruber AR, Stadler PF (2008) RNAalifold: improved consensus structure prediction for RNA alignments. BMC Bioinform 9:474
7. Flamm C, Hofacker IL, Stadler PF, Wolfinger MT (2002) Barrier trees of degenerate landscapes. Z Phys Chem 216:155
8. Bernhart SH, Tafer H, Mückstein U, Flamm C, Stadler PF, Hofacker IL (2006) Partition function and base pairing probabilities of RNA heterodimers. Algorithm Mol Biol 1:3
9. Mückstein U, Tafer H, Hackermüller J, Bernhart SH, Stadler PF, Hofacker IL (2006) Thermodynamics of RNA-RNA binding. Bioinformatics 22:1177–1182
10. Eggenhofer F, Tafer H, Stadler PF, Hofacker IL (2011) RNApredator: fast accessibility-based prediction of sRNA targets. Nucleic Acids Res 39:W149–W154
11. Tafer H, Ameres SL, Obernosterer G, Gebeshuber CA, Schroeder R, Martinez J, Hofacker IL (2008) The impact of target site accessibility on the design of effective siRNAs. Nat Biotechnol 26:578–583
12. Washietl S, Hofacker IL (2004) Consensus folding of aligned sequences as a new measure for the detection of functional RNAs by comparative genomics. J Mol Biol 342:19–30
13. Washietl S, Hofacker IL, Stadler PF (2005) Fast and reliable prediction of noncoding RNAs. Proc Natl Acad Sci U S A 102:2454–2459
14. Gruber AR, Bernhart SH, Hofacker IL, Washietl S (2008) Strategies for measuring evolutionary conservation of RNA secondary structures. BMC Bioinform 9:122
15. Gruber AR, Findeiß S, Washietl S, Hofacker IL, Stadler PF (2010) RNAz 2.0: improved noncoding RNA detection. Pac Symp Biocomput 2010:69
16. Gruber AR, Neuböck R, Hofacker IL, Washietl S (2007) The RNAz web server: prediction of thermodynamically stable and evolutionarily conserved RNA structures. Nucleic Acids Res 35:W335–W338
17. Yusuf D, Marz M, Stadler PF, Hofacker IL (2010) Bcheck: a wrapper tool for detecting RNase P RNA genes. BMC Genomics 11:432
18. Rivas E (2013) The four ingredients of single-sequence RNA secondary structure prediction: a unifying perspective. RNA Biol 10:59–70
19. Wuchty S, Walter F, Hofacker IL, Schuster P et al (1999) Complete suboptimal folding of RNA and the stability of secondary structures. Biopolymers 49:145–165
20. Wolfinger MT, Svrcek-Seiler AW, Flamm C, Hofacker IL, Stadler PF (2004) Efficient computation of RNA folding dynamics. J Phys Math Gen 37:4731
21. Reiche K, Stadler PF (2007) RNAstrand: reading direction of structured RNAs in multiple sequence alignments. Algorithm Mol Biol 2:6
22. Washietl S (2010) Sequence and structure analysis of noncoding RNAs. Methods Mol Biol 609:285–306
23. Hofacker IL, Tafer H (2010) Designing optimal siRNA based on target site accessibility. Methods Mol Biol 623:137–154
24. Eggenhofer F, Hofacker IL, Höner Zu Siederdissen C (2013) CMCompare webserver: comparing RNA families via covariance models. Nucleic Acids Res 41:W499–W503
25. Gruber AR, Fallmann J, Kratochvill F, Kovarik P, Hofacker IL (2011) AREsite: a database for the comprehensive investigation of AU-rich elements. Nucleic Acids Res 39:D66–D69
26. Mathews DH, Disney MD, Childs JL, Schroeder SJ, Zuker M, Turner DH (2004) Incorporating chemical modification constraints into a dynamic programming algorithm for prediction of RNA secondary structure. Proc Natl Acad Sci U S A 101:7287–7292
27. Hofacker I, Lorenz R (2014) Predicting RNA structure: advances and limitations. Methods Mol Biol 1086:1
28. Smith C, Heyne S, Richter AS, Will S, Backofen R (2010) Freiburg RNA tools: a web server integrating INTARNA, EXPARNA and LOCARNA. Nucleic Acids Res 38:W373–W377
Chapter 20
Abstract
Revealing the impact of A-to-I RNA editing in RNA-Seq experiments is relevant in humans because RNA
editing can influence gene expression. In addition, its deregulation has been linked to a variety of human
diseases. Exploiting the RNA editing potential in complete RNA-Seq datasets, however, is a challenging
task. Indeed, no dedicated software is available, and sometimes deep computational skills and appropriate
hardware resources are required. To explore the impact of known RNA editing events in massive transcrip-
tome sequencing experiments, we developed the ExpEdit web service application. In the present work, we
provide an overview of ExpEdit as well as methodologies to investigate known RNA editing in human
RNA-Seq datasets.
Key words RNA editing, A-to-I editing, Deep sequencing, Bioinformatics, Genomics, RNA-Seq
1 Introduction
Ernesto Picardi (ed.), RNA Bioinformatics, Methods in Molecular Biology, vol. 1269,
DOI 10.1007/978-1-4939-2291-8_20, © Springer Science+Business Media New York 2015
328 Mattia D’Antonio et al.
2 Materials
2.1 Required Software and Hardware Resources
The management of large datasets generated from deep sequencing
experiments typically requires powerful hardware resources. Hundreds of
GBs of free disk space are needed to store inputs, outputs, and
intermediate temporary files. Bioinformatics software is often expensive
in terms of computational resources and requires a large amount of RAM
(at least 16–24 GB) and several CPUs/cores to be executed in a
reasonable time.
To handle several inputs, such as different lanes from the same
experiment, analysis workflows need to be parallelized on multiple
servers or serialized to analyze one input at a time in full automation.
Web applications overcome these hardware requirements, and the final
user only needs a web browser and a stable Internet connection.
Input files can be uploaded directly or transferred by providing a web
link. In the first case, they have to be stored locally, and the upload
process could be time consuming depending on file size and connection
speed. Alternatively, inputs can be transferred by providing a web link
to the location of data stored on remote servers. Using web links, the
transfer is independent of the user's connection speed. A further
benefit of web applications is that analysis results can be downloaded
by the user and stored locally to perform off-line investigations.
RNA Editing Detection by ExpEdit 329
For our aim, that is, the investigation of known RNA editing events in
RNA-Seq datasets, the user needs a standard desktop or laptop computer
with a stable Internet connection and an updated browser such as
Internet Explorer, Firefox, or Safari.
The ExpEdit service is available at
http://bioinformatics.cineca.it/expedit, and a free registration is
required to upload data and start computational analyses.
3 Methods
3.1 Hands On: Using the ExpEdit GUI
Every ExpEdit action can be executed through an interactive GUI
available at http://bioinformatics.cineca.it/expedit/, and any popular
web browser can be used to perform a complete analysis. Hereafter, we
provide guidelines to create an ExpEdit study/project, upload data, run
an entire analysis, and inspect results.
3.1.1 Creating a New Study/Project
The first step to submit a dataset to ExpEdit is the creation of a new
project/study that will contain one or more input files as well as
analysis results. To register a new study, a few pieces of information
are required, such as a title, a description, and an access level
(private, group, or public) (Fig. 1).
3.1.2 Uploading Input Files
The ExpEdit upload engine offers several options: web upload, web link,
and Dropbox. The FTP protocol is planned but not yet implemented. Web
Upload supports up to 12 GB file size on 64-bit operating systems and up
to 2 GB on 32-bit operating systems. It is implemented in JavaScript and
allows the user to upload several
Fig. 1 Creation of a new ExpEdit project. Through the web interface, the user can create new projects by
providing minimal information
3.1.3 Creating a New Analysis
To create a new analysis, one or more files have to be selected. Then, a
name and a description can be provided. Before launching ExpEdit,
several parameters can be tuned to customize the analysis, even though
default parameters are suitable for many experiments.
ExpEdit includes four categories of parameters:
1. Common to many pipeline modules such as the human reference database
(hg18 and hg19 assemblies are allowed).
2. For quality check and filtering.
Fig. 2 Uploading files in ExpEdit. Input files can be added to an existing project by the web interface or by
providing a web link or by importing data from a Dropbox account
Fig. 3 Creation of a new analysis. The analysis workflow can be configured by tuning several parameters
3.1.4 Monitoring an Analysis
Once all parameters have been customized, the analysis can be submitted
to the queue system. The execution process is fully automated with a
complex dispatching architecture integrated with a TORQUE Resource
Manager. The user can follow the analysis progress through a monitor
page, which displays the list of pipeline steps and the intermediate
output results (these files can be visualized and/or downloaded, also
during the analysis execution). For each step, the running status and
the algorithm used, with its version, are shown (Fig. 4). In case of an
execution fault, the running status is marked as error. At the
completion of each step of the pipeline, all generated output files are
listed and displayed in the monitoring page. In this way, the user can
visualize or download all intermediate files. The command line and the
estimated running time per task are also provided.
3.1.5 Analysis Results
At the end of the pipeline, all results are parsed and stored in a
dedicated and optimized database. In addition, an e-mail containing a
link to the available results is automatically sent to the user.
Fig. 4 Monitoring an analysis. During the execution, each computational task can be inspected to monitor the
job status and intermediate output files
3.1.6 Result Details
Final ExpEdit results are provided in a detailed table that can be
filtered, sorted, and downloaded (Fig. 6). Every site in the table is
identified by a genomic position (chromosome, coordinate, and strand)
annotated with relevant biological information. Given a reference base,
ExpEdit reports the total number of reads supporting the specific site
and the number of reads supporting each
Fig. 5 Visualization of summary results. After each run, ExpEdit shows simple summary tables (quality checks
on the top and detected editing sites on the bottom)
Fig. 6 Visualization of detailed results. RNA editing sites are shown in dynamic tables
3.1.7 Intersections and Unions
When two or more analyzed samples are available, union and intersection
operations can be performed. A union operation may be helpful to merge
results obtained from biological replicates, while intersections may be
useful to filter out nonspecific results. Unions and intersections can
be combined to build complex operations involving multiple lanes.
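The set logic behind these operations can be sketched in a few lines of
Python. This is only an illustration under stated assumptions, not the
ExpEdit implementation: the sample names and site tuples below are
hypothetical, and each site is keyed by chromosome, coordinate, and
strand, as in the result table.

```python
# Hypothetical editing sites; each site is (chromosome, coordinate, strand).
replicate_1 = {("chr1", 1000, "+"), ("chr2", 2500, "-"), ("chr4", 158, "+")}
replicate_2 = {("chr1", 1000, "+"), ("chr3", 42, "+")}
other_sample = {("chr1", 1000, "+"), ("chr4", 158, "+"), ("chr9", 77, "-")}

# Union: merge the sites detected in the two biological replicates.
merged = replicate_1 | replicate_2

# Intersection: keep only the merged sites also seen in another sample.
shared = merged & other_sample

print(sorted(merged))
print(sorted(shared))
```

Chaining the two operators in this way corresponds to the combined
operations described above.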
3.2 Case Study
In this section, we provide an example of ExpEdit analysis using raw
sequencing data obtained from the study published by Au et al. [5] and
stored in the NCBI SRA database under the accession SRP002274.
Input reads can be uploaded in ExpEdit using the web link facility and
the following URLs:
http://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/
ByRun/sra/SRR/SRR039/SRR039628/SRR039628.sra
http://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/
ByRun/sra/SRR/SRR039/SRR039629/SRR039629.sra
http://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/
ByRun/sra/SRR/SRR039/SRR039630/SRR039630.sra
http://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/
ByRun/sra/SRR/SRR039/SRR039631/SRR039631.sra
http://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/
ByRun/sra/SRR/SRR039/SRR039632/SRR039632.sra
http://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/
ByRun/sra/SRR/SRR039/SRR039633/SRR039633.sra
SRA archives will be automatically extracted by the ExpEdit
preprocessing facility and converted into regular FASTQ files.
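Because the six links differ only in the run accession, they can be
generated rather than typed. The sketch below assumes the directory
layout visible in the URLs above (three-letter prefix, first six
characters, full accession); the helper name sra_url is ours, and the
NCBI paths may have changed since publication.

```python
BASE = "http://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByRun/sra"


def sra_url(run):
    """Build the web link for a run accession such as SRR039628."""
    # Layout assumed from the URLs listed in the text:
    # <prefix>/<first six characters>/<accession>/<accession>.sra
    return f"{BASE}/{run[:3]}/{run[:6]}/{run}/{run}.sra"


# Runs SRR039628 to SRR039633, as listed in the text.
runs = [f"SRR0396{n}" for n in range(28, 34)]
urls = [sra_url(run) for run in runs]

for url in urls:
    print(url)
```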
Now a new analysis can be created and executed by selecting
the human genome assembly hg19/build37 and tuning parame-
ters in the following way:
– Quality check and filtering: default values for cutoffs, length
(70) and quality (20). No primer/adaptor library has to be
selected.
– Genome alignment: the paired-end library has to be config-
ured as unstranded with a mate inner distance of 200 bp and a
mate standard deviation of 20. Default values should be used
Acknowledgments
References
1. Gott JM, Emeson RB (2000) Functions and mechanisms of RNA editing. Annu Rev Genet 34:499–531
2. Kiran A, Baranov PV (2010) DARNED: a DAtabase of RNa EDiting in humans. Bioinformatics 26(14):1772–1776. doi:10.1093/bioinformatics/btq285, Epub 2010 Jun 14
3. Ramaswami G, Li JB (2014) RADAR: a rigorously annotated database of A-to-I RNA editing. Nucleic Acids Res 42(D1):D109–D113. doi:10.1093/nar/gkt996
4. Picardi E, D’Antonio M, Carrabino D, Castrignanò T, Pesole G (2011) ExpEdit: a webserver to explore human RNA editing in RNA-Seq experiments. Bioinformatics 27(9):1311–1312. doi:10.1093/bioinformatics/btr117
5. Au KF, Jiang H, Lin L, Xing Y, Wong WH (2010) Detection of splice junctions from paired-end RNA-seq data by SpliceMap. Nucleic Acids Res 38(14):4570–8. doi:10.1093/nar/gkq211
6. Cock PJ, Fields CJ, Goto N, Heuer ML, Rice PM (2010) The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants. Nucleic Acids Res 38(6):1767–71. doi:10.1093/nar/gkp1137
7. Leinonen R, Sugawara H, Shumway M, International Nucleotide Sequence Database Collaboration (2011) The sequence read archive. Nucleic Acids Res 39(Database issue):D19–D21
8. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, 1000 Genome Project Data Processing Subgroup (2009) The sequence alignment/map format and SAMtools. Bioinformatics 25(16):2078–2079. doi:10.1093/bioinformatics/btp352
9. Patel RK, Jain M (2012) NGS QC toolkit: a toolkit for quality control of next generation sequencing data. PLoS One 7(2):e30619. doi:10.1371/journal.pone.0030619
10. Trapnell C, Pachter L, Salzberg SL (2009) TopHat: discovering splice junctions with RNA-Seq. Bioinformatics 25(9):1105–1111. doi:10.1093/bioinformatics/btp120
11. Kim D, Pertea G, Trapnell C, Pimentel H, Kelley R, Salzberg SL (2013) TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biol 14(4):R36. doi:10.1186/gb-2013-14-4-r36
12. Langmead B, Salzberg S (2012) Fast gapped-read alignment with Bowtie 2. Nat Methods 9(4):357–359. doi:10.1038/nmeth.1923
Chapter 21
Abstract
Gene expression regulatory elements are scattered in gene promoters and pre-mRNAs. In particular, RNA
elements lying in untranslated regions (5′ and 3′UTRs) are poorly studied because of their peculiar fea-
tures (i.e., a combination of primary and secondary structure elements) which also pose remarkable com-
putational challenges. Several years ago, we began collecting experimentally characterized UTR regulatory
elements, developing the specialized database UTRsite. This paper describes the detailed guidelines to
annotate cis-regulatory elements in 5′ and 3′ UnTranslated Regions (UTRs) by computational analyses,
retracing all main steps used by UTRsite curators.
Key words UTR, cis-Regulatory elements, RNA secondary structure, Sequence alignments, Perl,
Motif pattern
1 Introduction
Ernesto Picardi (ed.), RNA Bioinformatics, Methods in Molecular Biology, vol. 1269,
DOI 10.1007/978-1-4939-2291-8_21, © Springer Science+Business Media New York 2015
340 Matteo Giulietti et al.
2 Materials
2.1 Hardware and Software
The annotation workflow described in this paper can be executed on a
regular computer without specific hardware resources. However, it must
have a stable Internet connection to download executables or to access
dedicated Web services. Although any operating system, such as Linux,
Mac OS X, or Windows, is suitable, the Perl interpreter needs to be
preinstalled as well as the BioPerl package. An updated Perl interpreter
for Windows and Mac OS X can be downloaded from
www.activestate.com/activeperl. Linux users can install a copy via apt
or yum tools. The BioPerl package can be downloaded from
www.bioperl.org.
3 Methods
3.1 Literature Search
The UTRsite entries describe the known regulatory elements located in
the 5′UTR, the 3′UTR, or both, whose functional role and structure have
been experimentally assessed and reported in the literature. Therefore,
the PubMed database is the starting point for UTRsite data collection
and for identifying new regulatory elements. The literature search is a
crucial and time-consuming process.
The literature information related to UTR regulatory elements can be
obtained using a variety of PubMed queries providing keywords like
“UTR” or “untranslated regions” or “5′UTR” or “3′UTR” and different
combinations of them (by means of the Boolean operators AND, OR, and
NOT). Limits can also be activated to reduce the number of records and
false positives, increasing the search specificity. Papers describing
the same functional element are examined in order to extract all
available information about primary sequence, secondary structure, RNA
localization (5′ or 3′UTR), biological role, RNA-binding proteins, and
experimental procedures. In general, the functional characterization of
UTR regulatory elements includes gene expression assays using
recombinant or modified UTRs, site-specific mutagenesis, RNase
protection, and chemical probing experiments.
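Such a query can also be assembled programmatically. The following
sketch builds a search URL for the NCBI E-utilities esearch service; it
only constructs the URL and does not contact the server. The extra term
"regulatory element" and the retmax value are illustrative assumptions,
not part of the UTRsite curation protocol.

```python
from urllib.parse import urlencode

# Keywords from the text, combined with the Boolean operator OR.
keywords = ['"UTR"', '"untranslated regions"', '"5\'UTR"', '"3\'UTR"']

# Assumption: narrow the search with an additional AND clause.
term = "(" + " OR ".join(keywords) + ") AND regulatory element"

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
params = {"db": "pubmed", "term": term, "retmax": 20}

# urlencode percent-encodes quotes and parentheses and turns spaces into "+".
url = ESEARCH + "?" + urlencode(params)
print(url)
```

Opening the printed URL in a browser returns the matching PubMed
identifiers as XML, which can then be reviewed manually.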
Here, we present an explanatory example focusing on the DNA polymerase
beta stem loop II regulatory element (POLB), which has been identified
in a PubMed search using the aforementioned terms [9, 15–17]. Related
literature describes this element as a 3′UTR cis-acting element involved
in the posttranscriptional regulation of the DNA polymerase beta gene.
It is 42 nt long and has a stem-loop secondary structure that is
fundamental for the interaction between the gene transcript and the
Hax-1 protein which, in turn, is responsible for at least two
mechanisms: localization in the perinuclear space and stabilization of
the otherwise unstable mRNA.
3.2 Detection and Extraction of Orthologous Sequences
Once the UTR regulatory element has been identified from the literature
and the relative sequence has been retrieved, the NCBI BLAST tool should
be used to search for all available orthologous sequences. We suggest
performing BLAST searches using as query the whole mRNA sequence
including the element (or its RefSeq identifier), the UTR sequence
alone, or simply the extracted motif sequence. Multiple queries are
essential because the motif alone is generally short and, thus, BLAST
may return many nonsignificant and undesirable alignments. All detected
database hits must be carefully checked in order to define a reliable
set of orthologous sequences, all containing the regulatory element
under investigation, and then to investigate its conservation profile at
the level of primary and secondary structure, which will be used to
obtain an optimal definition of the regulatory pattern.
3.3 Alignment of Orthologous Sequences
In the next step, the retrieved UTR sequences need to be multi-aligned.
For this task we suggest the tool DIALIGN, paying attention to the
content of the input file in order to exclude mRNAs that do not have an
annotated UTR (for example, mRNAs from predicted RefSeq). In addition,
it could be useful to include in the input the sequence of the element
obtained from the literature to facilitate its localization in
orthologous sequences and improve the final multi-alignment.
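Excluding predicted RefSeq mRNAs before alignment is easy to automate,
because RefSeq model mRNAs carry the XM_ accession prefix. The sketch
below is a minimal pure-Python filter, not part of the UTRsite workflow;
it assumes that the accession is the first token of each FASTA header,
and the function name and example records are hypothetical.

```python
def drop_predicted(fasta_text, prefixes=("XM_",)):
    """Return FASTA text without records whose accession has a model prefix."""
    kept = []
    keep = True
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            # Assumption: the accession is the first token of the header.
            fields = line[1:].split()
            accession = fields[0] if fields else ""
            keep = not accession.startswith(prefixes)
        if keep:
            kept.append(line)
    return "\n".join(kept)


# Hypothetical records used only to exercise the filter.
example = """>NM_000001.1 hypothetical curated mRNA
ACGTACGT
>XM_000002.1 hypothetical predicted mRNA
GGGGCCCC
>NM_000003.2 another hypothetical curated mRNA
TTTTAAAA"""

filtered = drop_predicted(example)
print(filtered)
```

A curated record without an annotated full-length UTR would still pass
this filter, so the manual check described above remains necessary.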
In the case of very divergent sequences, the alignment software could
produce biased results. To limit false alignments, we suggest working in
a hierarchical way, first aligning sequences belonging to closely
related species and then adding further sequences one at a time.
Although alignment programs are very accurate, manual refinement is
almost always needed to obtain a reliable multiple alignment.
In our example on DNA polymerase beta stem loop II regula-
tory element we used DIALIGN with default parameters. Only 30
sequences out of about 100 orthologous mRNAs derived from
BLAST searches could be aligned with the rat element. This is
mainly due to the presence of mRNAs without annotated full-
length 3′UTRs. Some sequences, however, have to be discarded
for insufficient sequence similarity to the rat element.
The multiple alignment is also important to deduce the taxonomic range,
as shown in Fig. 1. In this case, all 30 sequences belong to the class
Mammalia.
3.5 Generation of the Motif Pattern and Assessment of its Sensitivity/Specificity
Once a reliable multiple alignment is obtained, taking also into account
the secondary structure of the UTR element, a motif pattern can be
modelled using the PatSearch software (see Note 3) and its specific
syntax rules. First, the multiple alignment has to be
Fig. 1 Multiple alignment generated by DIALIGN and manually refined. The first column reports the GenBank GI
identifier and the organism. The character “|” in the alignment is used to distinguish stems from loops.
Mismatched nucleotides in stems are underlined. Colored nucleotides are used to facilitate the alignment
visualization
Fig. 2 The secondary structure of the rat DNA polymerase beta stem loop II element
(a) and the inferred consensus structure from the multiple alignment (b)
3.6 Submit the Annotated UTR Element to UTRsite
Significant patterns and new UTR elements can be submitted to the
UTRsite Web service (see Note 5). Each entry in UTRsite has several
sections including all relevant information characterizing the element,
such as that obtained using the above methodology.
The UTRsite entry corresponding to the DNA polymerase beta stem loop II
regulatory element is shown in Fig. 3 and can be directly inspected at
the following Web page:
http://utrsite.ba.itb.cnr.it/index.php/UTRSite%20signal/Signal/action/view/frmUTRSiteID/U0049
4 Notes
Fig. 3 UTRsite entry (Accession ID U0049) for the DNA polymerase beta stem loop II regulatory element
References
1. Wang C, Zhang MQ, Zhang Z (2013) Computational identification of active enhancers in model organisms. Genomics Proteomics Bioinformatics 11:142–150
2. Keene JD (2007) RNA regulons: coordination of post-transcriptional events. Nat Rev Genet 8:533–543
3. Piva F, Giulietti M, Burini AB et al (2012) SpliceAid 2: a database of human splicing factors expression data and RNA target motifs. Hum Mutat 33:81–85
4. Piva F, Giulietti M, Nocchi L et al (2009) SpliceAid: a database of experimental RNA target motifs bound by splicing proteins in humans. Bioinformatics 25:1211–1213
5. Giulietti M, Piva F, D’Antonio M et al (2013) SpliceAid-F: a database of human splicing factors and their RNA-binding sites. Nucleic Acids Res 41:D125–D131
6. Grillo G, Turi A, Licciulli F et al (2010) UTRdb and UTRsite (RELEASE 2010): a collection of sequences and regulatory motifs of the untranslated regions of eukaryotic mRNAs. Nucleic Acids Res 38:D75–D80
7. Mathews DH, Moss WN, Turner DH (2010) Folding and finding RNA secondary structure. Cold Spring Harb Perspect Biol 2:a003665
8. Hamilton RS, Davis I (2011) Identifying and searching for conserved RNA localisation signals. Methods Mol Biol 714:447–466
9. Sarnowska E, Grzybowska EA, Sobczak K et al (2007) Hairpin structure within the 3′UTR of DNA polymerase beta mRNA acts as a post-transcriptional regulatory element and interacts with Hax-1. Nucleic Acids Res 35:5499–5510
10. Johnson M, Zaretskaya I, Raytselis Y et al (2008) NCBI BLAST: a better web interface. Nucleic Acids Res 36:W5–W9
11. Morgenstern B (2004) DIALIGN: multiple DNA and protein sequence alignment at BiBiServ. Nucleic Acids Res 32:W33–W36
12. Smith C, Heyne S, Richter AS et al (2010) Freiburg RNA tools: a web server integrating INTARNA, EXPARNA and LOCARNA. Nucleic Acids Res 38:W373–W377
13. Grillo G, Licciulli F, Liuni S et al (2003) PatSearch: a program for the detection of patterns and structural motifs in nucleotide sequences. Nucleic Acids Res 31:3608–3612
14. Pesole G, Liuni S, D’Souza M (2000) PatSearch: a pattern matcher software that finds functional elements in nucleotide and protein sequences and assesses their statistical significance. Bioinformatics 16:439–450
15. Grzybowska EA, Zayat V, Konopinski R et al (2013) HAX-1 is a nucleocytoplasmic shuttling protein with a possible role in mRNA processing. FEBS J 280:256–272
16. Nowak R, Siedlecki JA, Kaczmarek L et al (1989) Levels and size complexity of DNA polymerase beta mRNA in rat regenerating liver and other organs. Biochim Biophys Acta 1008:203–207
17. Konopinski R, Nowak R, Siedlecki JA (1996) Alternative polyadenylation of the gene transcripts encoding a rat DNA polymerase beta. Gene 176:191–195
18. Hofacker IL (2007) RNA consensus structure prediction with RNAalifold. Methods Mol Biol 395:527–544
19. Nawrocki EP, Eddy SR (2013) Infernal 1.1: 100-fold faster RNA homology searches. Bioinformatics 29:2933–2935
20. Wei D, Alpert LV, Lawrence CE (2011) RNAG: a new Gibbs sampler for predicting RNA secondary structure for unaligned sequences. Bioinformatics 27:2486–2493
Chapter 22
Abstract
The primary task of the Rfam database is to collate experimentally validated noncoding RNA (ncRNA)
sequences from the published literature and facilitate the prediction and annotation of new homologues in
novel nucleotide sequences. We group homologous ncRNA sequences into “families” and related families
are further grouped into “clans.” We collate and manually curate data cross-references for these families
from other databases and external resources. Our Web site offers researchers a simple interface to Rfam and
provides tools with which to annotate their own sequences using our covariance models (CMs), through our
tools for searching, browsing, and downloading information on Rfam families. In this chapter, we will work
through examples of annotating a query sequence, collating family information, and searching for data.
Key words Noncoding RNA, Rfam, Multiple sequence alignment, Secondary structure, Covariance
model, Infernal, Families, Clans
1 Introduction
Ernesto Picardi (ed.), RNA Bioinformatics, Methods in Molecular Biology, vol. 1269,
DOI 10.1007/978-1-4939-2291-8_22, © Springer Science+Business Media New York 2015
350 Jennifer Daub et al.
1.2 Clans
Rfam has used “clans” to group related families since Rfam release 10.0
in 2010 [8]. Families within a clan are proposed to be derived from a
common ancestor or to be related in sequence and structure, but are
maintained as separate families due to distinct functions.
Evidence for relationship between families is established using
computational tools (SCOOP (Simple Comparison Of Outputs
Program) [9], PRC (Profile comparer) [10]), and curation-based
approaches (literature searches and curation from specialized data-
bases). Within a clan, the same sequence may be a member of more
than one family, which is contrary to our usual curation rule that mul-
tiple families cannot annotate the same sequence region.
Clans are given stable accessions of the format CLXXXXX and
identifiers, as for families. For each clan, we provide a curation
summary that includes supporting literature, and we collate sec-
ondary structure images, PDB (Protein Data Bank) mappings, and
alignments for each of the families in the clan. Users can access
these using the four tabs provided on each clan page (Fig. 2).
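When handling accessions in scripts, it can help to tell family and clan
accessions apart before querying the Web site. The sketch below assumes
that the XXXXX placeholder stands for five digits, as in the accessions
RF00162 and CL00012 used later in this chapter; the function name is
hypothetical.

```python
import re

# Assumption: "XXXXX" stands for exactly five digits.
FAMILY_RE = re.compile(r"RF\d{5}")
CLAN_RE = re.compile(r"CL\d{5}")


def accession_type(accession):
    """Classify a string as a family accession, a clan accession, or neither."""
    if FAMILY_RE.fullmatch(accession):
        return "family"
    if CLAN_RE.fullmatch(accession):
        return "clan"
    return None


for value in ("RF00162", "CL00012", "SAM"):
    print(value, accession_type(value))
```

Identifiers such as SAM are not accessions, so they fall through to the
last branch and need a separate lookup.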
Rfam: Annotating Non-Coding RNA 351
2 Materials
2.1 Computational Software
The Web site is compatible with all recent browsers and makes use of
several Java applets, such as Jalview for viewing sequence alignments
and AstexViewer for visualizing three-dimensional structures.
2.2 Example Sequence
In this chapter, we provide a guide to using the Rfam Web site, based on
an example sequence which is a short region from the genome of
Clostridium tetani E88 (AE015927.1/1226050-1226250), annotated as
belonging to family RF00162, in clan CL00012. This sequence can also be
found on our ftp site:
ftp://ftp.sanger.ac.uk/pub/databases/Rfam/RNA_Methods_Mol_Biol/C_tetani_AE015927_fragment.fa
>AE015927.1/1226050-1226250 Clostridium tetani E88, complete genome
AAATAAAAAAACACTTCGATAAGAAGTGTTTATAGCAAAA
TAAACTTTTCTTATCTTTCAGGAAATTTCCTGCTGGA
GTTAGCACCTTTATACTTTGTGTATAGGTTGCTGGGT
TTCATAGGGCCAGTCCCTCAACCGCTCTTGATAAGAT
ATTTTAAATTATTTATTCAATTTAAAGCA
TATTATATATTTAATTATTTC
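A quick sanity check on the example record is to compare the sequence
length with the coordinates in its header. The sketch below assumes that
the coordinates are 1-based and inclusive, which is our reading of the
header rather than something stated in the text.

```python
record = """>AE015927.1/1226050-1226250 Clostridium tetani E88, complete genome
AAATAAAAAAACACTTCGATAAGAAGTGTTTATAGCAAAA
TAAACTTTTCTTATCTTTCAGGAAATTTCCTGCTGGA
GTTAGCACCTTTATACTTTGTGTATAGGTTGCTGGGT
TTCATAGGGCCAGTCCCTCAACCGCTCTTGATAAGAT
ATTTTAAATTATTTATTCAATTTAAAGCA
TATTATATATTTAATTATTTC"""

lines = record.splitlines()

# Header name has the form <accession>/<start>-<end>.
name = lines[0][1:].split()[0]
accession, coords = name.split("/")
start, end = (int(x) for x in coords.split("-"))

# Sequence lines may be wrapped at any width; join them before counting.
sequence = "".join(lines[1:])

# Assumption: coordinates are inclusive at both ends.
expected = abs(end - start) + 1
print(accession, len(sequence), expected)
```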
3 Methods
There are multiple ways to access data using the Rfam Web site;
Figure 3 summarizes some of the suggested routes to finding
information about a family, clan, or sequence of interest.
3.1 Sequence Searching
3.1.1 Searching with a Nucleotide Sequence
You can access the main sequence search page using the SEARCH link in
the navigation bar, which appears at the top of every page in the Rfam
Web site. The interactive sequence search can only be used for
nucleotide sequences of fewer than 10,000 bases. You can see
documentation on the restrictions on sequences under the More… link. If
you have multiple sequences or if you prefer to retrieve results as
plain text, you can use the Batch search facility, which runs searches
offline and returns results via email.
In the interactive sequence search, the query sequence is scanned
against a library of Rfam family sequences using WU-BLAST with
an E-value threshold of 1.0 (see Note 1). Any matches are then
searched against the corresponding CM using the hand-curated bit
score gathering threshold (GA) for the family (see Note 2).
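The two-stage logic described above (a fast prefilter followed by a
stricter model-based score) can be sketched as follows. This is an
illustration only, not the Rfam search code: every E-value, bit score,
and GA value below is hypothetical, and we assume that hits with an
E-value at or below the cutoff pass the prefilter.

```python
E_VALUE_CUTOFF = 1.0  # BLAST prefilter threshold stated in the text

# Hypothetical prefilter hits: (family accession, BLAST E-value).
blast_hits = [("RF00162", 1e-12), ("RF00634", 0.4), ("RF01725", 3.0)]

# Hypothetical CM bit scores and gathering thresholds (GA) per family.
cm_scores = {"RF00162": 88.1, "RF00634": 21.5}
ga_threshold = {"RF00162": 46.0, "RF00634": 40.0}

# Stage 1: keep families whose BLAST E-value passes the prefilter.
candidates = [fam for fam, evalue in blast_hits if evalue <= E_VALUE_CUTOFF]

# Stage 2: report only candidates whose CM score reaches the family's GA.
reported = [fam for fam in candidates if cm_scores[fam] >= ga_threshold[fam]]

print(candidates)
print(reported)
```

The prefilter keeps the interactive search fast, while the
family-specific GA threshold decides which matches are finally reported.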
Paste the example sequence into the query box and press
Submit—your results should be returned within a few seconds.
The results page summarizes the number of hits and the query
sequence used. You are also provided with a URL to bookmark if
you wish to return to this result later.
3.1.2 Searching Using a Sequence Accession
Rather than searching for matches to ENA sequences, the Rfam Web site
allows you to quickly look up the sequence using its ENA accession. Note
that the sequence must exist in the ENA version from which the current
Rfamseq derives. You can use the Look up sequence box under the sequence
search form, or you can use the Jump to field found on every tabbed page
in the Rfam Web site (see Notes 3 and 4).
Using the Look up sequence box, we can see that there are 181
annotations in Rfam for our example sequence, AE015927.1 (this
number is for the full sequence rather than the fragment that we
searched in Subheading 3.1.1). The results are displayed in a tabu-
lar format, ordered by default according to the order of coordi-
nates on the query sequence. You can sort the table on the different
columns of data by clicking on the table headers (see Note 5).
Browsing down the results to the match with coordinates
1226200-1226097, we find the SAM annotation that we identi-
fied by searching the sequence above. If we now sort the table by
Rfam familyID and scroll down to the SAM riboswitches, we can
see that Rfam annotates four different instances of the SAM family
(RF00162) on this sequence. Note the bit scores and coordinates
of the different annotations (three are annotated on the minus
strand and one on the positive).
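The strand and length of a match can be read directly from its
coordinates. The sketch below assumes that a start coordinate larger
than the end coordinate denotes a minus-strand match, as with the
1226200-1226097 hit above, and that coordinates are inclusive; the
function name is hypothetical.

```python
def describe_hit(start, end):
    """Return (strand, length) for a match given its start and end."""
    # Assumption: start > end marks a match on the minus strand.
    strand = "+" if start <= end else "-"
    # Assumption: both coordinates are included in the match.
    length = abs(end - start) + 1
    return strand, length


print(describe_hit(1226200, 1226097))  # ('-', 104)
```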
3.2 Finding Family Information
3.2.1 Family Pages
Entering the accession RF00162 or the family ID SAM in the Jump to box
will take us directly to the summary tab of the family page for this
specific family (see Fig. 1). The accession, identifier, and description
are displayed prominently at the top of all family pages. If the family
is in a clan, the other members of the clan will also be displayed in
the header of the summary page. We can see immediately that family
RF00162 belongs to clan CL00012 (clan ID “SAM”), which contains two
other families: SAM-I-IV-variant (RF01725) and SAM-IV (RF00634).
Since 2007, Rfam has linked every family to the article in
Wikipedia that best describes that family [12]. The Wikipedia arti-
cle is displayed in the main body of the Summary tab and should
provide the user with preliminary background information, links,
and literature sources that are relevant to this ncRNA family.
If you browse through the different tabs (Fig. 1), you can col-
late pertinent information on the family itself and the sequences in
the full alignment, such as, under the Alignments tab, the number
of sequences in the seed and full alignments (433 and 4,757,
respectively). In the same tab, you can view or download the seed
or full alignments and sequences in several formats, with the option
of having sequences annotated using sequence accessions or species
identifiers.
You can browse multiple graphical representations of the pre-
dicted secondary structure and sequence conservation for the seed,
under the Secondary structures tab. In addition to the stem-loop
representations, we also provide R-chie diagrams [13] and a link for
viewing the structure interactively using the VARNA applet [14].
You can find the predicted phylogenetic trees for seed and full
alignments in the Trees tab, while under the Species tab, you can
browse, download, and align different taxonomic subsets of
sequences from the full alignment (see Subheading 3.2.2).
Any PDB sequence mappings we have made to this family can
be viewed in the Structures tab. Note that multiple PDB sequence
regions may be mapped to the same ENA sequence region. In the
case of RF00162, for example, we map 20 fragments from differ-
ent PDB sequences and chains onto nine regions from five differ-
ent ENA sequences. You can view these sequence fragments in the
context of the three-dimensional PDB structure using the
AstexViewer applet [15], which allows you to manipulate and
rotate the structure to view the sequence region and toggle
between different views of surface transparency.
The Database references tab provides external database cross-
references, such as links to Gene Ontology (GO) and Sequence
Ontology (SO) [16, 17]. Under the Curation tab, you will find
important information about the curation of the Rfam family,
including our family type classification, family source references,
authors, and family building parameters, and you can also down-
load the covariance model. Note that our example family has been
using the size slider to make it easier to find the right node. When
you identify C. tetani, the tooltip will indicate “Clostridium tetani
[species], 1 sequence, 1 species.” After clicking to select the node,
the segment turns red, and the selected sequences are registered at
the bottom of the control panel on the right. You can download
these sequences in FASTA format or have them aligned to the
CM. Select Align to CM for our C. tetani selection and save the
resulting Stockholm formatted alignment of the four SAM annota-
tions indicated in Subheading 3.1.2.
The sunburst tool allows you to choose any arbitrary set of
species: you can select or exclude sequences for any taxonomic
division or combination of divisions you choose. For example, you
may choose to download all of the regions for all of the Clostridia
taxonomic class (425 sequences from 265 species) or for all of the
phylum Firmicutes (2,218 sequences from 677 species).
Alternatively, you could choose to select all of the phylum
Firmicutes excluding all the Clostridia apart from C. tetani.
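The include/exclude logic of such a selection can be expressed as a
simple filter over lineages. The sketch below is not the sunburst
implementation; the sequence identifiers are hypothetical, and the
function and parameter names are ours.

```python
# Hypothetical records: (sequence id, lineage from phylum to species).
records = [
    ("seq1", ("Firmicutes", "Clostridia", "Clostridium tetani")),
    ("seq2", ("Firmicutes", "Clostridia", "Clostridium botulinum")),
    ("seq3", ("Firmicutes", "Bacilli", "Bacillus subtilis")),
    ("seq4", ("Proteobacteria", "Gammaproteobacteria", "Escherichia coli")),
]


def select(records, include, exclude=(), keep=()):
    """Select ids in `include` taxa, minus `exclude` taxa unless in `keep`."""
    chosen = []
    for seq_id, lineage in records:
        taxa = set(lineage)
        if not taxa & set(include):
            continue  # outside the selected divisions
        if taxa & set(exclude) and not taxa & set(keep):
            continue  # excluded, and not explicitly retained
        chosen.append(seq_id)
    return chosen


# All Firmicutes, excluding the Clostridia apart from C. tetani.
picked = select(
    records,
    include={"Firmicutes"},
    exclude={"Clostridia"},
    keep={"Clostridium tetani"},
)
print(picked)
```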
For comparison, if you browse to the sunburst for family
RF01725, you can clearly see the difference in the taxonomic dis-
tribution between these two families. If you locate the Firmicutes,
you will see that there are only five sequences from five species
present in this phylum in RF01725.
3.3 Text Searching
3.3.1 ncRNA Type (Type Search)
Under the main SEARCH tab, select the Entry type tab; you are then
presented with a hierarchical representation of the Rfam family types.
There are three top-level categories in this structure, under which all
families are classified: gene, cis-reg, and intron. Use the More… link
to read more details about our classifications and what they mean within
Rfam.
Our example family is of the type “Cis-reg; riboswitch;.” This
means that our family is a cis-regulatory element of the subtype
riboswitch. To view all the families that Rfam annotates in the
“Cis-reg; riboswitch;” category, simply check the box next to ribo-
switch and press Submit. In the results page, you will see we have
26 families in this category. We already know our example family
RF00162 is in a clan with two others in this list, but some of these
other riboswitch families may also be of interest. This simple list of
checkboxes is very flexible and allows you to select/deselect any
combination of ncRNA categories depending on what data you are
looking for.
3.3.2 Annotations by Species, Class, and Phylum (Taxonomy Searching)
Under the main SEARCH link, select the Taxonomy tab. The operation
of this taxonomy search box is relatively complex and is explained
in more detail under the More… link. Briefly, to view all of the
family annotations that Rfam makes to Clostridium tetani (regardless
of sequence accession), you can simply add Clostridium tetani to the
search box, and you will be taken to a summary of all families that
we annotate in any sequence from this species. You can also check
the lower box on this page to identify families that
3.3.3 Genome Annotation, by Species Name or by Sequence Accession (Jump to and Browse)
If you know the sequence accession of your genome of interest
(e.g., the Clostridium tetani ENA accession is AE015927.1), you
can use the Jump to box to go directly to the genome pages for that
species (in the case of higher eukaryotes, you would use a chromosome
sequence accession, e.g., CM000209.2 for mouse chromosome 1).
If you do not know the genome accession or chromosome
accession for your species of interest, you can browse the list of
complete genomes that are annotated by Rfam, via the BROWSE
link in the navigation bar (see Subheading 3.4).
The genome page gives the total number of annotations (in
this case 181), the number of different families (32) found on
sequences from that genome, and some basic information and sta-
tistics on the genome itself. To view the annotations, visit the chro-
mosome tab. Since this is a bacterial genome and a complete
sequence, there is only one chromosome sequence. In the case of
higher eukaryotes, you would find the multiple chromosomes
listed here separately. From the chromosome tab, you can choose
to view the annotations or download them in GFF format.
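Once the annotations are downloaded in GFF format, the totals reported on the genome page (number of annotations and number of distinct families) can be recomputed locally. The sketch below is a minimal example under stated assumptions: it expects nine tab-separated columns and a GFF3-style `key=value` attribute column, and the attribute key "Name" as well as the toy records are hypothetical, so inspect the downloaded file and adjust the key accordingly.

```python
from collections import Counter


def summarize_gff(lines, key="Name"):
    """Count annotations per family, keyed by one attribute of column 9."""
    counts = Counter()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue  # not a well-formed feature line
        attrs = dict(
            field.split("=", 1)
            for field in cols[8].split(";")
            if "=" in field
        )
        counts[attrs.get(key, "unknown")] += 1
    return counts


# Toy records (hypothetical coordinates and family names)
toy_gff = [
    "##gff-version 3",
    "AE015927.1\tRfam\tncRNA\t100\t200\t50.0\t+\t.\tName=SAM",
    "AE015927.1\tRfam\tncRNA\t900\t1000\t61.2\t-\t.\tName=SAM",
    "AE015927.1\tRfam\tncRNA\t1500\t1575\t70.1\t+\t.\tName=tRNA",
]
family_counts = summarize_gff(toy_gff)
print(sum(family_counts.values()), "annotations in", len(family_counts), "families")
```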
3.3.4 PubMed ID, Author, PDB Structure, and Others (Keyword Search)
All of our curated family and clan data are indexed, so you can use
multiple criteria to identify families of interest. The Keyword search
box appears in the top right corner of every page in the Rfam
Web site, as well as under the Keyword tab in the main search page. The
keyword search allows you to perform a simple text search against the
textual data from several different sections of the Rfam database:
● Rfam: accessions, identifiers, descriptions, type, comment lines
from Rfam families
● Wikipedia: text from Wikipedia articles linked to Rfam families
● Literature: literature reference titles, authors, and PubMed
identifiers
360 Jennifer Daub et al.
3.4 Browsing
As an alternative to directed searching, you can use the BROWSE
link in the navigation toolbar to view our families listed under five
different criteria: families (listed by family ID), annotated
genomes (listed by species name), clans (listed by clan ID), families
with structures (listed by family ID), and families with Wikipedia
articles (listed by Wikipedia article title).
3.6 RESTful Interface
We provide a RESTful interface to enable users to interact
programmatically with the services provided by Rfam. A detailed
description of this is beyond the scope of this chapter, but full
documentation, including sample queries and scripts, is available
via our help pages.
4 Notes
Acknowledgments
References
1. Cochrane G, Alako B, Amid C, Bower L, Cerdeno-Tarraga A, Cleland I, Gibson R, Goodgame N, Jang M, Kay S et al (2013) Facing growth in the European Nucleotide Archive. Nucleic Acids Res 41(Database issue):D30–D35
2. Torarinsson E, Lindgreen S (2008) WAR: Webserver for aligning structural RNAs. Nucleic Acids Res 36(Web server issue):W79–W84
3. Zuker M (2003) Mfold web server for nucleic acid folding and hybridization prediction. Nucleic Acids Res 31(13):3406–3415
4. Nawrocki EP, Kolbe DL, Eddy SR (2009) Infernal 1.0: inference of RNA alignments. Bioinformatics 25(10):1335–1337
5. Gardner PP (2009) The use of covariance models to annotate RNAs in whole genomes. Brief Funct Genomic Proteomic 8(6):444–450
6. Nawrocki EP, Eddy SR (2013) Computational identification of functional RNA homologs in metagenomic data. RNA Biol 10(7):1170–1179
7. Burge SW, Daub J, Eberhardt R, Tate J, Barquist L, Nawrocki EP, Eddy SR, Gardner PP, Bateman A (2013) Rfam 11.0: 10 years of RNA families. Nucleic Acids Res 41(Database issue):D226–D232
8. Gardner PP, Daub J, Tate J, Moore BL, Osuch IH, Griffiths-Jones S, Finn RD, Nawrocki EP, Kolbe DL, Eddy SR et al (2011) Rfam: Wikipedia, clans and the “decimal” release. Nucleic Acids Res 39(Database issue):D141–D145
9. Bateman A, Finn RD (2007) SCOOP: a simple method for identification of novel protein superfamily relationships. Bioinformatics 23(7):809–814
10. Madera M (2008) Profile comparer: a program for scoring and aligning profile hidden Markov models. Bioinformatics 24(22):2630–2631
11. Infernal User Guide. ftp://selab.janelia.org/pub/software/infernal/Userguide.pdf
12. Daub J, Gardner PP, Tate J, Ramskold D, Manske M, Scott WG, Weinberg Z, Griffiths-Jones S, Bateman A (2008) The RNA WikiProject: community annotation of RNA families. RNA 14(12):2462–2464
13. Lai D, Proctor JR, Zhu JY, Meyer IM (2012) R-CHIE: a web server and R package for visualizing RNA secondary structures. Nucleic Acids Res 40(12):e95
14. Darty K, Denise A, Ponty Y (2009) VARNA: interactive drawing and editing of the RNA secondary structure. Bioinformatics 25(15):1974–1975
15. Hartshorn MJ (2002) AstexViewer: a visualisation aid for structure-based drug design. J Comput Aided Mol Des 16(12):871–881
16. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT et al (2000) Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 25(1):25–29
17. Mungall CJ, Batchelor C, Eilbeck K (2011) Evolution of the Sequence Ontology terms and relationships. J Biomed Inform 44(1):87–93
18. Zhang J, Haider S, Baran J, Cros A, Guberman JM, Hsu J, Liang Y, Yao L, Kasprzyk A (2011) BioMart: a data federation framework for large collaborative projects. Database 2011:bar038
Chapter 23
Abstract
Alternative splicing (AS) is a basic molecular phenomenon that increases the functional complexity of
higher eukaryotic transcriptomes. Indeed, through AS individual gene loci can generate multiple RNAs
from the same pre-mRNA. AS has been investigated in a variety of clinical and pathological studies, such
as those on transcriptome regulation in cancer. In humans, recent work based on massive RNA sequencing
indicates that >95 % of pre-mRNAs are processed to yield multiple transcripts. Given the biological
relevance of AS, several computational efforts have been made, leading to the implementation of novel
algorithms and specialized databases. Here we describe the web application ASPicDB, which allows the
recovery of detailed biological information about the splicing mechanism. ASPicDB provides powerful
querying systems to interrogate AS events at gene, transcript, and protein levels. Finally, ASPicDB includes
web visualization instruments to browse and export results for further off-line analyses.
Key words Alternative splicing, Transcriptome, Transcription, Alternative isoforms, Database, Web
tool, Transcript prediction, Bioinformatics, Transcriptomics
1 Introduction
Ernesto Picardi (ed.), RNA Bioinformatics, Methods in Molecular Biology, vol. 1269,
DOI 10.1007/978-1-4939-2291-8_23, © Springer Science+Business Media New York 2015
366 Mattia D’Antonio et al.
2 Materials
3 Methods
3.1 Simple Search Form
The simple search form comprises three different panels, as shown
in Fig. 1. The user can query by gene ID, by Gene Ontology term,
or by any keyword of interest.
3.2 Advanced Search Form
The advanced search form allows the combination of different
search criteria in order to increase the specificity of the query.
Five different panels are available to perform queries on different
biological aspects, and each panel contains a set of common
and specific features.
Every module includes the following three fields: Organism,
Accession, and Coordinates. Only the Organism field is mandatory,
and queries on multiple organisms are currently not possible.
Through the Coordinates fields, users can select the genomic regions
in which the search is performed.
In addition to all common filters, in the gene section (Fig. 2),
users can query:
● Keywords matching gene descriptions (e.g., “apoptosis”)
● RNA regions in which splicing events are located (i.e., 5′ UTR;
CDS; 3′ UTR)
● Type of splicing events (e.g., exon skipping, intron retention,
etc.)
● Specific donor-acceptor splice sites (e.g., GT-AG)
● Intron class (e.g., U2, U12)
● Number of alternative proteins and transcripts (range)
● Number of independent splicing events (range)
ASPicDB for Alternative Splicing Analysis 369
The exon retrieval form enables the user to find single exons
that match a set of user-defined criteria, such as:
● Minimal and maximal exon length
● Exon class (initial, internal, terminal)
● Features of flanking introns (donor-acceptor sites, intron class)
● Number of ESTs supporting flanking introns
● Presence of poly-A tail or poly-A signal annotated in the exon
To retrieve particular splice sites (introns), the user can cus-
tomize the following filters:
● Specific donor-acceptor sites
● Intron class (U12, U2, not determined)
● Number of ESTs supporting the intron
● Intron length (interval)
● Minimal and maximal repeat site length in the intron sequence
● Presence of ambiguous splicing sites (i.e., the intron boundar-
ies cannot be precisely defined due to an ambiguous alignment
between cDNA and gDNA)
Finally, the protein search form is divided into two sections: the
first searches for proteins with given structural or functional
properties; the second looks for genes encoding alternative
proteins that match user-defined criteria.
The first section allows queries on:
● Pfam class
● Protein class (e.g., globular, transmembrane)
● Number of functional domains
● Cellular localization for globular proteins (e.g., nucleus, cyto-
plasm, mitochondria)
● Presence of GPI anchors and signal peptides
The second panel contains a set of checkboxes to select the
features that must differ among the alternative proteins encoded
by the resulting genes. For instance, by selecting “subcellular
localization,” the application will fetch all genes that encode
alternative proteins that differ in subcellular localization (e.g.,
some proteins localized in the nucleus and others in the cytoplasm).
query length, and the matching score. Results are also linked to
the corresponding ASPicDB entries.
This facility could be very useful in combination with experi-
mental techniques to investigate specific transcripts or embedded
features in diverse conditions.
Fig. 4 Structure view for the human tumor protein p63 (TP63)
that do not have canonical splicing sites (black dots at ends) are
represented by orange dotted lines and are labeled “fuzzy introns.”
The Predicted Transcripts section provides a graphical repre-
sentation of the predicted transcripts. Only the most reliable
introns (i.e., those supported by a perfectly aligned EST with
canonical donor and acceptor sites or those supported by two or
more ESTs) are used for the assembly process. Figure 5 shows a
“Transcript View” snapshot for the ADSL human gene where
exons (thick-colored strips) and introns (black junction regions)
are clearly identifiable, along with other functional regions such as
5′ UTR (yellow strips) and 3′ UTR (sky blue). Interestingly,
peptide sequences identified by mass spectrometry [28] that support
specific coding exons or splice junctions are shown at the top of the
Transcript View as small green boxes.
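The intron reliability criterion stated above (one perfectly aligned EST with canonical sites, or at least two supporting ESTs) can be written as a simple predicate. The following Python sketch is only an illustration of that rule, not ASPicDB code: the `Intron` record, its field names, and the decision to treat only GT-AG as canonical are assumptions made for the example.

```python
from dataclasses import dataclass

# Assumption: only GT-AG counts as canonical in this sketch.
CANONICAL_SITES = {("GT", "AG")}


@dataclass
class Intron:
    donor: str         # first two intron bases, e.g. "GT"
    acceptor: str      # last two intron bases, e.g. "AG"
    est_count: int     # number of ESTs supporting the intron
    perfect_est: bool  # True if at least one supporting EST aligns perfectly


def is_reliable(intron):
    """One perfect EST with canonical sites, or two or more supporting ESTs."""
    canonical = (intron.donor, intron.acceptor) in CANONICAL_SITES
    return (intron.perfect_est and canonical) or intron.est_count >= 2


# Hypothetical introns covering each branch of the rule
introns = [
    Intron("GT", "AG", 1, True),   # canonical, one perfect EST
    Intron("GT", "AG", 1, False),  # canonical, one imperfect EST
    Intron("AT", "AC", 3, False),  # non-canonical, three ESTs
    Intron("GC", "AG", 1, True),   # non-canonical, one perfect EST
]
reliable = [intron for intron in introns if is_reliable(intron)]
print(len(reliable), "of", len(introns), "introns would enter the assembly")
```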
Hovering the mouse over a specific exon or intron displays a
tooltip with additional information, while a click opens a popup
page containing the genomic sequence of the selected element.
Furthermore, two icons to the right of each transcript are linked to
the genomic sequence of the transcript and the corresponding
protein (if the transcript has a coding sequence).
A 3′ terminal arrow marks polyadenylated transcripts that may
also contain a poly(A) site (AAUAAA or its accepted variants), and
a hyphen marks the position of mapping CAGE [29] tags which
identify transcription initiation sites. In Fig. 6, five of the alterna-
tive transcripts of the BRAF gene are shown. Since the fourth
doesn’t contain an annotated CDS, its exons are displayed in gray
Fig. 5 Visualization of predicted transcripts for the adenylosuccinate lyase (ADSL) encoding gene
Fig. 6 Visualization of alternative transcripts predicted for B-Raf proto-oncogene serine/threonine kinase (BRAF)
Fig. 7 Visualization of predicted functional domains for A-Raf proto-oncogene, serine/threonine kinase (ARAF)
3.5 Exporting Results
The possibility of exporting search results is crucial to allow further
analysis and cross-validation with different tools. ASPicDB includes
different download options and formats.
Search result tables contain a link to the download section,
where it is possible to explore the element (i.e., gene, transcript,
exon, intron, or protein) sequences. Every element has different
download options, e.g., an intron can be extended with exonic
sequences preceding and following the donor-acceptor sites (exten-
sion length is customizable by the user). For each transcript, exon
sequences are downloadable, as well as its alternative protein.
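The idea of extending an intron with the exonic sequence on either side can be sketched as follows. This is a hedged illustration of the concept only: the function name, the 1-based inclusive coordinate convention, and the toy sequence are assumptions of the example and do not describe how ASPicDB computes its downloads.

```python
def extended_intron(genome, intron_start, intron_end, flank=10):
    """Return the intron plus `flank` bases on each side.

    Coordinates are 1-based and inclusive; the flanks are clipped at the
    ends of the sequence.
    """
    left = max(intron_start - 1 - flank, 0)
    right = min(intron_end + flank, len(genome))
    return genome[left:right]


# Toy locus: 10-nt exon, 12-nt intron (GT...AG), 10-nt exon
exon1, intron, exon2 = "ACGTACGTAC", "GTAAGTTTTCAG", "GGCCTTAAGG"
locus = exon1 + intron + exon2

print(extended_intron(locus, 11, 22, flank=0))  # intron only
print(extended_intron(locus, 11, 22, flank=3))  # intron with 3 exonic bases per side
```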
Since the result sets can be considerably large, the extraction of
the sequence data can be time-consuming. To avoid long waiting
times, the user can just leave an email address, where a web link
with the data will be sent, upon query completion.
A second layer of data export is available in the different Result
Visualization sections. In Gene Information all coding and transcript
sequences are available for download, as well as the prediction
result in a GTF (Gene Transfer Format) file. The GTF contains
the list of predicted transcripts, exons, and introns. For each
transcript all its exons and introns are listed, with a report of
the ESTs supporting each intron.
Moreover, in Predicted Transcripts, it’s possible to download
sequences for each exon, intron, transcript, and encoded proteins.
Proteins can also be fetched in the Protein Table. In Predicted Splice
sites an overall export process is available: users can download all
the multiple EST alignments against genome splicing sites.
Acknowledgments
References
1. Wang ET, Sandberg R, Luo S, Khrebtukova I, Zhang L, Mayr C, Kingsmore SF, Schroth GP, Burge CB (2008) Alternative isoform regulation in human tissue transcriptomes. Nature 456:470–476
2. Pan Q, Shai O, Lee LJ, Frey BJ, Blencowe BJ (2008) Deep surveying of alternative splicing complexity in the human transcriptome by high-throughput sequencing. Nat Genet 40(12):1413–1415
gene and transcripts annotation. Genome Biol 7 (Suppl 1), S12, 1–14
28. Desiere F, Deutsch EW, King NL, Nesvizhskii AI, Mallick P, Eng J, Chen S, Eddes J, Loevenich SN, Aebersold R (2006) The PeptideAtlas project. Nucleic Acids Res 34(Database issue):D655–D658
29. Kawaji H, Kasukawa T, Fukuda S, Katayama S, Kai C, Kawai J, Carninci P, Hayashizaki Y (2006) CAGE basic/analysis databases: the CAGE resource for comprehensive promoter analysis. Nucleic Acids Res 34(Database issue):D632–D636
30. Hubbard T, Barker D, Birney E, Cameron G, Chen Y, Clark L, Cox T, Cuff J, Curwen V, Down T, Durbin R, Eyras E, Gilbert J, Hammond M, Huminiecki L, Kasprzyk A, Lehvaslaiho H, Lijnzaad P, Melsopp C, Mongin E, Pettett R, Pocock M, Potter S, Rust A, Schmidt E, Searle S, Slater G, Smith J, Spooner W, Stabenau A, Stalker J, Stupka E, Ureta-Vidal A, Vastrik I, Clamp M (2002) The Ensembl genome database project. Nucleic Acids Res 30(1):38–41
31. Pierleoni A, Martelli PL, Fariselli P, Casadio R (2006) BaCelLo: a balanced subcellular localization predictor. Bioinformatics 22(14):e408–e416
32. Fariselli P, Finocchiaro G, Casadio R (2003) SPEPlip: the detection of signal peptide and lipoprotein cleavage sites. Bioinformatics 19(18):2498–2499
33. Pierleoni A, Martelli PL, Casadio R (2008) PredGPI: a GPI-anchor predictor. BMC Bioinformatics 9:392–395. doi:10.1186/1471-2105-9-392
Chapter 24
Abstract
Alternative splicing (AS) is a eukaryotic principle to derive more than one RNA product from transcribed
genes by removing distinct subsets of introns from a premature polymer. We know today that this process
is highly regulated and makes up a large part of the differences between species, cell types, and states. The
key to comparing AS across different genes or organisms is to tokenize the AS phenomenon into atomic
units, so-called AS events. These events are then usually grouped by common patterns to investigate the
underlying molecular mechanisms that drive their regulation. However, attempts to decompose loci with
AS observations into events are often hampered by the use of a limited set of a priori defined event patterns,
which cannot describe all AS configurations and therefore cannot decompose the phenomenon
exhaustively.
In this chapter, we describe working scenarios of AStalavista, a computational method that reports all
AS events reflected by transcript annotations. We show how to practically employ AStalavista to study AS
variation in complex transcriptomes, as characterized by the human GENCODE annotation. Our exam-
ples demonstrate how the inherent and universal AStalavista paradigm allows for an automatic delineation
of AS events in custom gene datasets. Additionally, we sketch an example of an AStalavista use case includ-
ing next-generation sequencing data (RNA-Seq) to enrich the landscape of discovered AS events.
Key words Gene expression, RNA processing, Alternative splicing, AS event, Definition of alternative
splicing, Transcriptome annotation, RNA-seq, Splicing nomenclature, AS code, Bioinformatics
1 Introduction
Ernesto Picardi (ed.), RNA Bioinformatics, Methods in Molecular Biology, vol. 1269,
DOI 10.1007/978-1-4939-2291-8_24, © Springer Science+Business Media New York 2015
380 Sylvain Foissac and Michael Sammeth
2 Materials
2.1 Installing AStalavista
The AStalavista package can be accessed from the website
http://astalavista.sammeth.net
where the source code is also available from the “Download” section.
To install, just unzip/untar the bundle once downloaded, for
instance, by the command
tar xzf AStalavista-3.2.tgz
The command creates a folder named “AStalavista” (with the
corresponding version number) that contains all the files from the
tarball: documentation, license information, and program files.
AStalavista is executed by wrapper scripts located in the “bin”
subfolder within the installation directory, either “astalavista.bat”
(for Windows platforms) or “astalavista” (for UNIX-based operat-
ing systems, including OSX).
The “--help” flag provides a brief description of the software,
its version, and the available options: flags to handle verbosity,
parallelization, output behavior, and available tools (“--list-tools”;
see below).
AStalavista-3.2/bin/astalavista --help
AStalavista v3.2 (Flux Library: 1.22)
-------Documentation & Issue Tracker-------
Analysis of Alternative Splicing Events 381
2.2 Obtaining the GENCODE Annotation
The full GENCODE transcriptome annotation dataset [3] can be
downloaded from the project’s ftp site by the command:
wget ftp://ftp.sanger.ac.uk/pub/gencode/release_12/gencode.v12.annotation.gtf.gz
and subsequently, the downloaded gzip archive is decompressed by
gunzip gencode.v12.annotation.gtf.gz
Table 1
Examples of distinct AS events with the corresponding AS patterns
   Traditional name   Exon-intron structure   AS code   Example
A  Exon skipping      (diagram)               0,1-2^    PTBP1 polypyrimidine tract-binding protein 1, ninth exon
3.1.2 The AS Code
AS events are commonly classified in different categories based on
the observed differences in the exon-intron structure between
the respective transcripts. Typically, most studies employ the terms
3.2.2 Interpreting the Results
From the input annotation in gtf format, AStalavista produces a file
with events that are likewise stored as gtf features: start (column 4)
and end (column 5) of the event are the genomic coordinates of
the delimiting common sites [10] or, in the case of ASE events, of
the first/last variable site; the nature of these delimiting
sites is detailed in the “flanks” attribute, which adopts the same
“AS code” notation as used for “splice chain” and “structure”
(see below):
chr1 Gencode_v12 as_event 12227 12721 . + . transcript_id "ENST00000518655.2,ENST00000456328.2/ENST00000515242.2"; gene_id "chr1:11869-14412W"; flanks "12227^,12721^"; structure "1-,2-"; splice_chain "12595-,12613-"; dimension "2_2"; degree "4";
The attribute “degree” summarizes the number of variable
sites in the event, which is naturally increasing with longer events
or events of more variants. Additionally, the attribute “dimension”
provides information about the reported event with respect to the
underlying complete event: the dimensionality is denoted as a
string of the form “X_Y” where X is the number of variants in the
reported event and Y is the number of variants in the corresponding
complete event. All events with X = Y are complete, and those with
X < Y are not (X > Y is not possible by definition).
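The “X_Y” convention of the “dimension” attribute is easy to check in code. The short Python sketch below splits the string and reports whether the event is complete; the function name is chosen for this example and is not part of AStalavista.

```python
def parse_dimension(value):
    """Split an "X_Y" dimension string into (x, y, is_complete)."""
    x, y = (int(part) for part in value.split("_"))
    if x > y:
        # By definition a reported event cannot have more variants
        # than the complete event it belongs to.
        raise ValueError(f"invalid dimension {value!r}: X exceeds Y")
    return x, y, x == y


print(parse_dimension("2_2"))  # the example event above is complete
print(parse_dimension("2_3"))  # a pairwise event taken from a 3-variant complete event
```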
The “transcript_id” attribute describes a comma-separated list
of the transcript identifiers for each of the variant structures of the
event; if there is more than one transcript supporting a certain vari-
ant of the event, the corresponding identifiers are furthermore
separated by a “/” separator. The attribute “splice_chain” describes
the genomic coordinates of each variable site in the event, with the
variants in the same comma-separated order as the identifiers of
“transcript_id”.
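To make the layout of these attributes concrete, the following Python sketch parses the example event line shown above and pairs each variant's transcript identifiers with its splice chain. It is a minimal sketch that assumes the usual tab-separated gtf columns and `key "value";` attribute syntax; the function and dictionary key names are invented for the example.

```python
import re

# Matches gtf attributes of the form: key "value";
ATTRIBUTE = re.compile(r'(\S+) "([^"]*)";')


def parse_as_event(line):
    """Parse one as_event gtf line into coordinates, variants, and splice chains."""
    cols = line.rstrip("\n").split("\t", 8)
    attrs = dict(ATTRIBUTE.findall(cols[8]))
    return {
        "chrom": cols[0],
        "start": int(cols[3]),
        "end": int(cols[4]),
        "strand": cols[6],
        # one list of transcript IDs per variant ("/" separates IDs within a variant)
        "variants": [ids.split("/") for ids in attrs["transcript_id"].split(",")],
        # one splice chain per variant, in the same order
        "splice_chain": attrs["splice_chain"].split(","),
        "attributes": attrs,
    }


example = "\t".join([
    "chr1", "Gencode_v12", "as_event", "12227", "12721", ".", "+", ".",
    'transcript_id "ENST00000518655.2,ENST00000456328.2/ENST00000515242.2"; '
    'gene_id "chr1:11869-14412W"; flanks "12227^,12721^"; structure "1-,2-"; '
    'splice_chain "12595-,12613-"; dimension "2_2"; degree "4";',
])
event = parse_as_event(example)
for ids, chain in zip(event["variants"], event["splice_chain"]):
    print(ids, chain)
```

Here the first variant is supported by one transcript and the second by two, which is exactly what the “,” and “/” separators encode.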
As introduced previously, the AS code nomenclature specifies
after the genomic coordinate of each site also the site type by a
certain symbol: “^” denotes a splice donor, “-” a splice acceptor,
“[” a transcription start, and “]” a cleavage site. Together with
the name of the chromosome/contig, the value of “splice_chain”
can be considered as a unique identifier of the event, because it
describes every event univocally by the morphology of its sites as
localized by corresponding genomic coordinates. Therefore, also
events delimited by the same flanks (start/end tuples) can still
differ by their splice chains.
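The site symbols of the AS code can be decoded mechanically. The sketch below turns one variant's splice chain into a list of (coordinate, site type) pairs; it is a hedged illustration in which the function name is hypothetical and the handling of chains with several consecutive sites (coordinate followed directly by the next coordinate and symbol) is an assumption about the format.

```python
import re

SITE_TYPES = {
    "^": "splice donor",
    "-": "splice acceptor",
    "[": "transcription start",
    "]": "cleavage site",
}

# A genomic coordinate immediately followed by one site symbol
SITE = re.compile(r"(\d+)([\^\-\[\]])")


def parse_chain(chain):
    """Decode a splice chain such as "12595-" into (coordinate, site type) pairs."""
    return [(int(pos), SITE_TYPES[symbol]) for pos, symbol in SITE.findall(chain)]


print(parse_chain("12595-"))
print(parse_chain("12227^"))
print(parse_chain(""))  # a variant without variable sites yields an empty list
```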
[Figure: The AS Code Landscape (Gencode v12), a chart of event counts per AS code pattern (1-2^3-,4-; 1-2^4-,3-; 0,1-2^3-4^; 1-,2-; 0,1-2^; 1^,2^; 0,1^2-)]
3.3 Discovery of Novel AS Events by RNA-Seq Experiments
3.3.1 Employing AStalavista for NGS Data
AStalavista can also analyze NGS data from RNA sequencing
experiments (RNA-Seq). The main conceptual difference between
transcriptome annotations like GENCODE and RNA-Seq data is
that the latter provide only partial information on the transcriptome:
RNA-Seq reads are limited to parts of expressed genes,
because current NGS technologies still cannot read the sequences
of entire RNA molecules, and transcripts are therefore fragmented
before sequencing. Although RNA-Seq reads correspond to fragments
and cannot be considered full-length transcripts, they
can provide valuable information about AS through the potentially
novel splice junctions they harbor: if the exon-exon junction of a
spliced transcript is captured in a read, the position of the
corresponding intron can be revealed by aligning the read sequence
to the genomic reference.
Dedicated NGS alignment methods, as implemented in software
like GEM [12] or STAR [5], can generate such spliced
alignments, in which introns correspond to gaps flanked by exonic
regions. This experimentally derived evidence about splicing
events can be analyzed by AStalavista, either separately or in
combination with a reference annotation, to identify novel alternative
splicing events that are not captured by the reference gene set.
An interesting and original aspect of the latter approach is that no
transcript assembly is required for the analysis; indeed, AStalavista
considers reads as independent pieces of information and makes
no assumptions about which of them originate from the same RNA
molecule and which from independent molecules. In the next
section, we employ this strategy to illustrate how AStalavista can
enrich a repertoire of AS events by processing RNA-Seq data.
3.3.2 Processing RNA-Seq Data
We converted the format of the bed file with mapped reads
downloaded earlier from the CRG dashboard and produced an
AStalavista-compatible gtf file by the commands:
bed=SID38226_SID38227_SJ.bed; gtf=${bed%.bed}.gtf
cat $bed | awk -v bed=$bed -v OFS="\t" '{print $1,bed,"exon",$2,$2,$5,$6,".","gene_id \"SJ"NR"\"; transcript_id \"SJ"NR"\";"; print $1,bed,"exon",$3,$3,$5,$6,".","gene_id \"SJ"NR"\"; transcript_id \"SJ"NR"\";"}' > $gtf
Finally, we concatenated the gtf representation of the NGS
mappings thus obtained with the GENCODE annotation from our
previous example:
cat gencode.v12.annotation.gtf $gtf > Gencode_rnaseq.gtf
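For readers who prefer Python to awk, the conversion of splice junctions from bed to gtf described in this section can be sketched as below. This is an illustrative re-expression of the same idea, not a replacement shipped with AStalavista: each bed record yields two one-base “exon” features (at the junction start and end) that share a transcript identifier numbered by input line, and the function name and toy record are made up for the example.

```python
def junctions_to_gtf(bed_lines, source):
    """Turn each splice-junction bed record into two one-base gtf exon features."""
    gtf_lines = []
    for number, line in enumerate(bed_lines, start=1):
        fields = line.rstrip("\n").split("\t")
        chrom, start, end = fields[0], fields[1], fields[2]
        score, strand = fields[4], fields[5]
        attributes = f'gene_id "SJ{number}"; transcript_id "SJ{number}";'
        # one feature at each end of the junction, same transcript_id
        for position in (start, end):
            gtf_lines.append("\t".join(
                [chrom, source, "exon", position, position, score, strand, ".", attributes]
            ))
    return gtf_lines


# Toy junction record (hypothetical coordinates and score)
toy_bed = ["chr1\t12227\t12612\tjunction1\t7\t+\n"]
toy_gtf = junctions_to_gtf(toy_bed, "SJ.bed")
print("\n".join(toy_gtf))
```

Because both features carry the same transcript identifier, the gap between them is interpreted as an intron, which is the only information a spliced read contributes.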
The joint dataset was then processed with AStalavista in order
to compare the AS events obtained with the results
derived from the GENCODE annotation alone:
./bin/astalavista -t asta -i Gencode_rnaseq.gtf
3.3.3 Identifying Novel AS Events
After processing the NGS-enriched GENCODE annotation with
AStalavista, we characterized events involving RNA-Seq reads that
were not detected using the annotation alone. We found 342 such
novel internal AS events (ASI). Figure 2 shows that the distribution
of patterns among these additionally found events closely follows the
one described by GENCODE events, although some rank positions are
permuted: evidence from RNA-Seq split mappings indicates that among
the simple AS patterns, alternative exon boundaries (89 donor and
79 acceptor events) are most underestimated by the current
GENCODE annotation, followed by alternatively skipped exons
(53 novel alternative exons); furthermore, RNA-Seq data pinpoint
177 novel complex events of 34 distinct patterns (Fig. 2).
Figure 2 also summarizes the number and nature of AS events
that are revealed by RNA-Seq introns; many describe the skipping of
(multiple) exons: 53 “0,1-2^” (1 exon), 21 “0,1-2^3-4^”
(2 exons), 11 “0,1-2^3-4^5-6^” (3 exons), etc. Although our analysis
focuses on a small set of high-confidence events, we observe that more
than a third of the exons predicted to be skipped (20 out of 53) exhibit
a frame-preserving length, i.e., the size of the exon is a multiple of 3.
These events include, for instance, the third exon in the TMCO1
(transmembrane domain protein) locus, where additional EST
(Expressed Sequence Tag) evidence (BP308568) and complementary
computational gene models (ENST00000476173) also support
the skipping of the corresponding 60 nt exon; as another example,
we find NGS evidence for the skipping of the 36 nt long exon number
10 in the TRIP (thyroid hormone receptor interacting protein)
gene, which so far has not been discovered by EST evidence.
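The frame-preservation criterion used here is a simple divisibility test. The Python sketch below applies it to the two exon lengths quoted in the text and recomputes the reported fraction; the function and variable names are chosen for the example only.

```python
def is_frame_preserving(exon_length):
    """An exon can be skipped without shifting the reading frame
    when its length is a multiple of 3."""
    return exon_length % 3 == 0


# Exon lengths quoted in the text
quoted_exons = {"TMCO1 exon 3": 60, "TRIP exon 10": 36}
for name, length in quoted_exons.items():
    print(name, length, is_frame_preserving(length))

# Share of skipped exons with frame-preserving length (20 out of 53)
fraction = 20 / 53
print(f"{fraction:.1%}")
```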
[Figure: Novel Events (RNA-Seq), a chart of novel event counts per AS code pattern (1-2^3-,4-; 0,1-2^3-4^; 1-,2-; 0,1-2^; 1^,2^; 0,1^2-)]
References
1. Kornblihtt AR, Schor IE, Alló M (2013) Alternative splicing: a pivotal step between eukaryotic transcription and translation. Nat Rev Mol Cell Biol 14:153–165
2. Foissac S, Sammeth M (2007) ASTALAVISTA: dynamic and flexible analysis of alternative splicing events in custom gene datasets. Nucleic Acids Res 35:W297
3. Harrow J, Frankish A, Gonzalez JM (2012) GENCODE: the reference human genome annotation for The ENCODE Project. Genome Res 22:1760
4. Dunham I, Birney E, Lajoie BR, Sanyal A (2012) An integrated encyclopedia of DNA elements in the human genome. Nature 489:57
Abstract
RNA interference (RNAi) is a powerful tool for the regulation of gene expression. Small exogenous
noncoding RNAs (ncRNAs) such as siRNA and shRNA are the active silencing agents, intended to target
and cleave complementary mRNAs in a specific way. They are widely and successfully employed in func-
tional studies, and several ongoing and already completed siRNA-based clinical trials suggest encouraging
results in the regulation of overexpressed genes in disease.
siRNAs share many aspects of their biogenesis and function with miRNAs, small ncRNA molecules
transcribed from endogenous genes which are able to repress the expression of target mRNAs by either
inhibiting their translation or promoting their degradation. Although siRNA and artificial miRNA mole-
cules can significantly reduce the expression of overexpressed target genes, cancer and other diseases can
also be triggered or sustained by upregulated miRNAs.
Thus, in recent years, molecular tools for miRNA silencing, such as antagomiRs and miRNA
sponges, have been developed. These molecules have shown their efficacy in the derepression of genes
downregulated by overexpressed miRNAs. In particular, while a single antagomiR is able to inhibit a single
complementary miRNA, an artificial sponge construct usually contains one or more binding sites for one
or more miRNAs and functions by competing with the natural targets of these miRNAs. As a consequence,
natural miRNA targets are reexpressed at their physiological level.
In this chapter we review the most successful methods for the computational design of siRNAs,
antagomiRs, and miRNA sponges and describe the most popular tools that implement them.
Key words RNAi, siRNA, shRNA, antagomiR, miRNA, Sponge, Gene expression
1 Introduction
Ernesto Picardi (ed.), RNA Bioinformatics, Methods in Molecular Biology, vol. 1269,
DOI 10.1007/978-1-4939-2291-8_25, © Springer Science+Business Media New York 2015
394 Alessandro Laganà et al.
2 Materials
3 Methods
3.1 Design of Functional siRNA
From a computational point of view, siRNA design is the process
of choosing a functional binding site on a target mRNA sequence,
which will correspond to the sense strand of the siRNA under
design (typically 21–23 nt long) [35]. The siRNA antisense
sequence is obtained as the complement to the sense strand.
Symmetric 3′ overhangs, usually dTdT, are added to improve the
stability of the duplex and to facilitate RISC loading, ensuring equal
ratios of sense and antisense strand incorporation [6, 36, 37].
Other overhang sequences are acceptable, but some combinations,
such as GG, should be avoided. The efficacy of RNAi is mostly
determined by sequence-specific factors which affect the stability
of the duplex ends. siRNA duplexes often have asymmetric loading
of the antisense versus sense strands [38, 39]. The strand whose 5′
end is thermodynamically less stable is preferentially incorporated
into the RISC.
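The strand-selection rule above can be illustrated with a rough sketch. It assumes that the number of G/C pairs in the terminal base pairs of the duplex is an acceptable stand-in for end stability; design tools typically rely on nearest-neighbor free energies instead, which are not modeled here. The function names and the 4-bp window (`end_len`) are illustrative assumptions, not taken from the cited works.

```python
def gc_count(seq):
    """Number of G or C bases in a sequence (RNA or DNA alphabet)."""
    return sum(base in "GC" for base in seq.upper())


def favored_strand(sense_core, end_len=4):
    """Guess which strand is preferentially loaded into the RISC.

    sense_core is the base-paired part of the sense strand, written
    5'->3' and without overhangs. The 5' end of the antisense strand
    pairs with the 3' end of the sense strand, so the two duplex ends
    are the first and last end_len bases of sense_core.
    Fewer G/C pairs is used as a crude proxy for a less stable end.
    """
    sense_5p_end = gc_count(sense_core[:end_len])
    antisense_5p_end = gc_count(sense_core[-end_len:])
    if antisense_5p_end < sense_5p_end:
        return "antisense"
    if sense_5p_end < antisense_5p_end:
        return "sense"
    return "undetermined"
```

A candidate for which `favored_strand` returns "antisense" would, under this simplified model, be expected to load the intended guide strand.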
Elbashir et al. suggest choosing the 23-nt sequence motif
AA(N19)TT as the binding site (N19 means any combination of 19
nucleotides), where (N19)TT corresponds to the sense strand of
the siRNA, while the complement to AA(N19) corresponds to the
antisense strand (see Fig. 1a) [6].
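As a minimal sketch of this motif rule, the following code scans a target sequence for every AA(N19)TT window and derives the two strands as described above. It assumes the target is supplied in the DNA alphabet (U is converted to T) and reports both strands in that alphabet; the function names are illustrative only and do not come from any published tool.

```python
import re

# Translation table for complementing DNA bases.
DNA_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA-alphabet sequence."""
    return seq.translate(DNA_COMPLEMENT)[::-1]


def find_aa_n19_tt_sites(target):
    """Return (start, sense, antisense) for each AA(N19)TT window.

    start     0-based position of the 23-nt window in the target
    sense     (N19)TT, i.e. the last 21 nt of the window
    antisense complement to AA(N19), written 5'->3'
    """
    target = target.upper().replace("U", "T")
    sites = []
    # A lookahead is used so that overlapping windows are all reported.
    for match in re.finditer(r"(?=(AA[ACGT]{19}TT))", target):
        window = match.group(1)
        sense = window[2:]
        antisense = reverse_complement(window[:21])
        sites.append((match.start(), sense, antisense))
    return sites
```

Because the complement of the leading AA is TT, both reported strands are 21 nt long and end in a TT overhang, matching the symmetric dTdT overhangs mentioned earlier.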
Many different features associated with functional siRNAs have
been identified in the past years, and some of them are now widely
accepted as standard rules in siRNA design and are implemented in
the majority of design tools. They can be classified into four different
categories: (1) general binding rules, (2) nucleotide composition
rules, (3) specific positional rules, and (4) thermodynamics
rules. Table 1 summarizes categories (1), (2), and (4), while rules
in category (3) are represented in Fig. 1b.
Fig. 1 siRNA design rules. (a) An example of target region in an mRNA sequence and the corresponding siRNA
duplex with 3′ overhangs. (b) Specific positional rules for siRNA design. The darker cells represent positions
on the siRNA antisense (AS) and sense (S) strands. The light gray cells contain specific rules for the corresponding
positions. For each set of rules, references are given (Ref). The striped gray background indicates
inconsistencies of the rules, due to the different experiments that they come from
General binding rules refer to factors such as the position of
the binding sites in the target transcript. For example, the target
region should preferably be between 50 and 100 nt downstream of
the start codon, and the middle of the coding sequence should be
avoided. Another rule suggests pooling of four or five siRNA
duplexes per target gene, in order to ensure a stronger repression.
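The positional preference mentioned above can be sketched as a simple window enumeration. The code assumes that `start_codon_pos` is the 0-based index of the first base of the annotated start codon and that distances are counted from that base; the text does not state the exact reference point, so both are assumptions, as are the function name and the 21-nt default site length.

```python
def windows_after_start(mrna, start_codon_pos, site_len=21, lo=50, hi=100):
    """List candidate sites starting lo..hi nt downstream of the start codon.

    Returns (offset, site) pairs, where offset is the distance from the
    start codon to the first base of the site. Windows that would run
    past the end of the sequence are skipped.
    """
    candidates = []
    for offset in range(lo, hi + 1):
        begin = start_codon_pos + offset
        site = mrna[begin:begin + site_len]
        if len(site) == site_len:
            candidates.append((offset, site))
    return candidates
```

The resulting candidates would still need to be filtered with the composition, positional, and thermodynamic rules discussed in the rest of this section.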
Another class of rules concerns the siRNA nucleotide composition.
A major feature, implemented by every design tool, is the
G/C content, which should typically be in the range of 30–55%,
although values as low as 25% or as high as 79% are still associated
with functional siRNAs.
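A minimal sketch of a G/C content filter based on the ranges quoted above follows. The three class labels and the function names are illustrative assumptions; individual design tools apply their own thresholds.

```python
def gc_fraction(seq):
    """Fraction of G and C bases in a non-empty sequence."""
    if not seq:
        raise ValueError("sequence must not be empty")
    seq = seq.upper()
    return sum(base in "GC" for base in seq) / len(seq)


def gc_class(seq, preferred=(0.30, 0.55), tolerated=(0.25, 0.79)):
    """Classify a candidate siRNA by its G/C content.

    preferred  typical range (30-55%)
    tolerated  wider range still associated with functional siRNAs
    """
    gc = gc_fraction(seq)
    if preferred[0] <= gc <= preferred[1]:
        return "preferred"
    if tolerated[0] <= gc <= tolerated[1]:
        return "tolerated"
    return "rejected"
```

For example, a 19-nt candidate with 10 G/C bases (about 53%) falls in the typical range, whereas one with 13 G/C bases (about 68%) is only in the wider range.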
Other features in this category include the presence/absence
of particular motifs in the antisense strand and the absence of
internal repeats.
Specific positional rules are the most numerous and regard the
selection of nucleotides to prefer or avoid in specific positions of
either the sense or the antisense strand of the duplex. For example,
Computational Design of RNAi Molecules
Table 1 Rules for siRNA design
3.2 siRNA Design Online Resources
Table 2 provides a list of tools for the automated design of siRNA
and shRNA sequences. Most tools have user-friendly interfaces
which don’t require any additional specifications aside from the
target sequence.
OptiRNAi 2.0 is a fast tool which predicts 21–23-nt RNAi
target sites on a user-provided sequence using the criteria described
by Elbashir et al. in 2001 and Reynolds et al. in 2004 [36, 40, 41].
The program generates a list of up to ten siRNA target sites, each
with a score indicating how well it matches the considered
features. The tool doesn’t return the actual siRNA antisense
sequence, which has to be manually derived as the reverse complement
of the binding site.
Like OptiRNAi, siDirect 2 accepts just the target
sequence as input. It returns a table with a list of potential binding
sites (including the 2-nt overhang) [42]. For each site, the corresponding
siRNA duplex strands are given, together with the melting
temperature (Tm) of the seed-target duplex, as a measure of
thermodynamic stability. The seed-target duplex is formed between
positions 2–8 of the siRNA guide strand (counted from the 5′ end) and its
target mRNA site.
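The seed definition above can be made concrete with a short sketch that extracts positions 2–8 of a guide strand and the mRNA site that pairs with them. It does not compute the melting temperature, which would require nearest-neighbor thermodynamic parameters. The function names are illustrative and are not part of siDirect.

```python
# Translation table for complementing RNA bases.
RNA_COMPLEMENT = str.maketrans("ACGU", "UGCA")


def seed_region(guide):
    """Positions 2-8 (1-based, from the 5' end) of a guide strand."""
    guide = guide.upper().replace("T", "U")
    return guide[1:8]


def seed_target_site(guide):
    """mRNA sequence (5'->3') that pairs perfectly with the seed."""
    return seed_region(guide).translate(RNA_COMPLEMENT)[::-1]


def count_seed_sites(transcript, guide):
    """Count (possibly overlapping) seed matches in a transcript."""
    transcript = transcript.upper().replace("T", "U")
    site = seed_target_site(guide)
    return sum(
        transcript.startswith(site, i)
        for i in range(len(transcript) - len(site) + 1)
    )
```

Counting seed matches in unintended transcripts with `count_seed_sites` gives a rough first impression of the off-target potential of a guide strand; it is not a substitute for the off-target lists reported by dedicated tools.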
Other details provided include the list of potential off-target
genes for the guide and passenger strands and a graphical view of
the siRNA binding sites in the target sequence.
siRNA scales is another design tool that accepts as input a
target sequence and returns a list of all possible 19-nt long
siRNA sequences, together with the predicted percentage of target mRNA
copies present in the cells after siRNA-directed cleavage as a
Table 2 Tools for siRNA design
Table 3 siRNA databases
3.3 siRNA Databases
Several sources of siRNA sequences are publicly available online.
Here we provide a brief overview of some of them, which are listed
in Table 3.
NCBI’s RNAi resource page allows easy access to the RNAi
probes (siRNA/shRNA) stored in the NCBI database. For each
probe, details about the sequence, the targets, and the hairpin, in
case of an shRNA, are given. Queries are submitted through the
standard NCBI interface, which filters the results by automatically
adding the keywords “gene silencing” to the query.
The MIT/ICBP siRNA Database is a comprehensive database
which stores and distributes information on validated siRNAs and
shRNAs.
Currently the database contains siRNA and shRNA sequences
against over 100 genes from three different sources: (1) sequences
designed and tested by MIT researchers, (2) sequences designed
by Qiagen and tested by Natasha Caplen’s group at the NCI, and
(3) sequences designed by Greg Hannon and Steve Elledge and
tested by the ICBP and CGAP programs at the NCI. The database
can be searched by keywords (e.g., target gene name) or browsed
by gene name and siRNA ID. The results include links to NCBI
probe pages. The website also has a section for the submission of
new validated reagents. Sequences are available for human and
mouse.
HuSiDa is a database that contains sequences of published
functional siRNA molecules targeting human genes and important
technical details of the corresponding gene silencing experiments
[56]. The database is searchable by different terms, such as gene
name, cell line, transfection methods, siRNA source, and siRNA
sequence.
siRecords archives experimentally tested siRNAs collected from the
literature [57]. Different data are available for each siRNA, such as
its sequence and the alignment with the target gene, the cell types
or tissues in which it was tested, the forms of the siRNA agents
3.5 Design of Effective miRNA Sponges
Ebert et al. first introduced miRNA sponges in 2007, as an alternative
to chemically modified antisense oligonucleotides for miRNA
inhibition [34]. Sponges contain multiple binding sites for endogenous
miRNAs and function by “absorbing” and distracting them
from their natural targets, thus representing a useful tool to probe
miRNA functions in a variety of experimental systems.
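A basic sponge of the kind described above can be sketched as a concatenation of binding sites. The code assumes perfectly complementary sites, six sites in total, and a 4-nt spacer between them (the layout of the basic sponge in Fig. 2a); the spacer sequence "AAUU" is an arbitrary placeholder, and bulged sites are not modeled. The function names are illustrative only.

```python
# Translation table for complementing RNA bases.
RNA_COMPLEMENT = str.maketrans("ACGU", "UGCA")


def binding_site(mirna):
    """Perfectly complementary binding site (5'->3') for a miRNA."""
    mirna = mirna.upper().replace("T", "U")
    return mirna.translate(RNA_COMPLEMENT)[::-1]


def build_sponge(mirnas, n_sites=6, spacer="AAUU"):
    """Concatenate n_sites binding sites separated by a spacer.

    When several miRNAs are given, their sites are used in rotation,
    so that the sponge carries sites for each of them.
    """
    if not mirnas:
        raise ValueError("at least one miRNA sequence is required")
    sites = [binding_site(m) for m in mirnas]
    chosen = [sites[i % len(sites)] for i in range(n_sites)]
    return spacer.join(chosen)
```

The function returns the sponge as an RNA sequence; for cloning into an expression vector, the corresponding DNA sequence would be used instead.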
Sponges can be easily cloned into expression vectors and tran-
siently transfected into cultured cells in order to efficiently derepress
miRNA targets. They can also be delivered by virus-based vectors,
Fig. 2 miRNA sponge constructs. (a) Basic sponge with six miRNA binding sites separated by 4-nt spacers.
(b) Perfect miRNA binding site on a sponge. (c) Bulged miRNA binding site on a sponge. (d) Prototype decoy
consisting of a short hairpin molecule where the loop exposes a binding site for an miRNA. (e) Tough decoy (TuD)
with two exposed miRNA binding sites. (f) Synthetic tough decoy (S-TuD) consisting of two fully 2′-O-methylated
RNA strands exposing an miRNA binding site each
References
1. Lee RC, Feinbaum RL, Ambros V (1993) The C. elegans heterochronic gene lin-4 encodes small RNAs with antisense complementarity to lin-14. Cell 75:843–854
2. Fire A, Xu S, Montgomery M, Kostas S, Driver S, Mello C (1998) Potent and specific genetic interference by double-stranded RNA in Caenorhabditis elegans. Nature 391:806
3. Tomari Y, Zamore PD (2005) Perspective: machines for RNAi. Genes Dev 19:517–529
4. Chu C-Y, Rana TM (2007) Small RNAs: regulators and guardians of the genome. J Cell Physiol 213:412–419
5. Kutter C, Svoboda P (2008) miRNA, siRNA, piRNA: knowns of the unknown. RNA Biol 5:181–188
6. Elbashir SM, Harborth J, Lendeckel W, Yalcin A, Weber K, Tuschl T (2001) Duplexes of 21-nucleotide RNAs mediate RNA interference in cultured mammalian cells. Nature 411:494–498
7. Brummelkamp TR, Bernards R, Agami R (2002) A system for stable expression of short interfering RNAs in mammalian cells. Science 296:550–553
8. Paddison PJ, Caudy AA, Bernstein E, Hannon GJ, Conklin DS (2002) Short hairpin RNAs (shRNAs) induce sequence-specific silencing in mammalian cells. Genes Dev 16:948–958
9. Bernstein E, Caudy AA, Hammond SM, Hannon GJ (2001) Role for a bidentate ribonuclease in the initiation step of RNA interference. Nature 409:363–366
10. Bartel DP (2004) MicroRNAs: genomics, biogenesis, mechanism, and function. Cell 116:281–297
11. Borchert GM, Lanier W, Davidson BL (2006) RNA polymerase III transcribes human microRNAs. Nat Struct Mol Biol 13:1097–1101
12. Monteys AM, Spengler RM, Wan J, Tecedor L, Lennox KA, Xing Y, Davidson BL (2010) Structure and activity of putative intronic miRNAs promoters. RNA 16:495. doi:10.1261/rna.1731910
13. Ozsolak F, Poling LL, Wang Z, Liu H, Liu XS, Roeder RG, Zhang X, Song JS, Fisher DE (2008) Chromatin structure analyses identify miRNA promoters. Genes Dev 22:3172–3183
14. Zeng Y, Yi R, Cullen BR (2005) Recognition and cleavage of primary microRNA precursors by the nuclear processing enzyme Drosha. EMBO J 24:138–148
15. Lee Y, Ahn C, Han J, Choi H, Kim J, Yim J, Lee J, Provost P, Rådmark O, Kim S et al (2003) The nuclear RNase III Drosha initiates microRNA processing. Nature 425:415–419
16. Gregory RI, Yan K-P, Amuthan G, Chendrimada T, Doratotaj B, Cooch N, Shiekhattar R (2004) The Microprocessor complex mediates the genesis of microRNAs. Nature 432:235–240
17. Lund E, Güttinger S, Calado A, Dahlberg JE, Kutay U (2004) Nuclear export of microRNA precursors. Science 303:95–98
18. Yi R, Qin Y, Macara IG, Cullen BR (2003) Exportin-5 mediates the nuclear export of pre-microRNAs and short hairpin RNAs. Genes Dev 17:3011–3016
19. Hamilton AJ, Baulcombe DC (1999) A species of small antisense RNA in posttranscriptional gene silencing in plants. Science 286:950–952
20. Zamore PD, Tuschl T, Sharp PA, Bartel DP (2000) RNAi: double-stranded RNA directs the ATP-dependent cleavage of mRNA at 21 to 23 nucleotide intervals. Cell 101:25–33
21. Elbashir SM, Lendeckel W, Tuschl T (2001) RNA interference is mediated by 21- and 22-nucleotide RNAs. Genes Dev 15:188–200
22. Matranga C, Tomari Y, Shin C, Bartel DP, Zamore PD (2005) Passenger-strand cleavage facilitates assembly of siRNA into Ago2-containing RNAi enzyme complexes. Cell 123:607–620
23. Miyoshi K, Tsukumo H, Nagami T, Siomi H, Siomi MC (2005) Slicer function of Drosophila Argonautes and its involvement in RISC formation. Genes Dev 19:2837–2848
24. Rand TA, Petersen S, Du F, Wang X (2005) Argonaute2 cleaves the anti-guide strand of siRNA during RISC activation. Cell 123:621–629
25. Gregory RI, Chendrimada TP, Cooch N, Shiekhattar R (2005) Human RISC couples MicroRNA biogenesis and posttranscriptional gene silencing. Cell 123:631–640
26. Maniataki E, Mourelatos Z (2005) A human, ATP-independent, RISC assembly machine fueled by pre-miRNA. Genes Dev 19:2979–2990
27. Okamura K, Phillips MD, Tyler DM, Duan H, Chou Y-T, Lai EC (2008) The regulatory activity of microRNA* species has substantial influence on microRNA and 3′ UTR evolution. Nat Struct Mol Biol 15:354–363
28. Gunsalus KC, Piano F (2005) RNAi as a tool to study cell biology: building the genome-phenome bridge. Curr Opin Cell Biol 17:3–8
29. Kim DH, Rossi JJ (2007) Strategies for silencing human disease using RNA interference. Nat Rev Genet 8:173–184
30. Hadj-Slimane R, Lepelletier Y, Lopez N, Garbay C, Raynaud F (2007) Short interfering RNA (siRNA), a novel therapeutic tool acting on angiogenesis. Biochimie 89:1234–1244
31. de Fougerolles A, Vornlocher H-P, Maraganore J, Lieberman J (2007) Interfering with disease: a progress report on siRNA-based therapeutics. Nat Rev Drug Discov 6:443–453
32. Krützfeldt J, Rajewsky N, Braich R, Rajeev KG, Tuschl T, Manoharan M, Stoffel M (2005) Silencing of microRNAs in vivo with ‘antagomirs’. Nature 438:685–689
33. Scherr M, Eder M (2007) Gene silencing by small regulatory RNAs in mammalian cells. Cell Cycle 6:444
34. Ebert MS, Neilson JR, Sharp PA (2007) MicroRNA sponges: competitive inhibitors of small RNAs in mammalian cells. Nat Methods 4:721–726
35. Liu Q, Zhou H, Zhu R, Xu Y, Cao Z (2014) Reconsideration of in silico siRNA design from a perspective of heterogeneous data integration: problems and solutions. Brief Bioinform 15:292. doi:10.1093/bib/bbs073
36. Elbashir SM, Martinez J, Patkaniowska A, Lendeckel W, Tuschl T (2001) Functional anatomy of siRNAs for mediating efficient RNAi in Drosophila melanogaster embryo lysate. EMBO J 20:6877–6888
37. Strapps WR, Pickering V, Muiru GT, Rice J, Orsborn S, Polisky BA, Sachs A, Bartz SR (2010) The siRNA sequence and guide strand overhangs are determinants of in vivo duration of silencing. Nucleic Acids Res 38:4788–4797
38. Khvorova A, Reynolds A, Jayasena SD (2003) Functional siRNAs and miRNAs exhibit strand bias. Cell 115:209–216
65. Peek AS (2007) Improving model predictions for RNA interference activities that use support vector machine regression by combining and filtering features. BMC Bioinform 8:182
66. Ui-Tei K, Naito Y, Takahashi F, Haraguchi T, Ohki-Hamazaki H, Juni A, Ueda R, Saigo K (2004) Guidelines for the selection of highly effective siRNA sequences for mammalian and chick RNA interference. Nucleic Acids Res 32:936–948
67. Shabalina SA, Spiridonov AN, Ogurtsov AY (2006) Computational models with thermodynamic and composition features improve siRNA design. BMC Bioinform 7:65
68. Amarzguioui M, Prydz H (2004) An algorithm for selection of functional siRNA sequences. Biochem Biophys Res Commun 316:1050–1058
69. Chalk AM, Wahlestedt C, Sonnhammer ELL (2004) Improved and automated prediction of effective siRNA. Biochem Biophys Res Commun 319:264–274
70. Klingelhoefer JW, Moutsianas L, Holmes C (2009) Approximate Bayesian feature selection on a large meta-dataset offers novel insights on factors that effect siRNA potency. Bioinformatics 25:1594–1601
71. Liu Q, Zhou H, Cui J, Cao Z, Xu Y (2012) Reconsideration of in-silico siRNA design based on feature selection: a cross-platform data integration perspective. PLoS One 7:e37879
72. Wang L, Huang C, Yang JY (2010) Predicting siRNA potency with random forests and support vector machines. BMC Genomics 11:S2