Data Retrieval

There are three main data retrieval systems for molecular biology databases: Sequence Retrieval System (SRS), Entrez, and DBGET. SRS provides access to over 80 biological databases developed at EBI, Entrez integrates databases from NCBI, and DBGET is part of the Japanese GenomeNet service. These systems allow text searches across multiple databases and provide links to relevant information matching search criteria. There are also data mining tools that retrieve data from genomic databases and visualization tools for proteomic databases, including tools for homology, protein function, sequence analysis, and structural analysis.

Uploaded by

Ayesha Khan
Copyright
© All Rights Reserved

Data retrieval means obtaining data from a database management system such as an ODBMS.

The retrieved data may be stored in a file, printed, or viewed on the screen. A query language,
such as Structured Query Language (SQL), is used to prepare the queries.
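
As a minimal sketch of preparing a query in SQL and viewing the retrieved data on screen, the example below uses Python's built-in sqlite3 module. The table name, column names, and records are hypothetical and serve only as an illustration.

```python
import sqlite3

# Hypothetical in-memory table of sequence records (all values are illustrative).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sequences (accession TEXT PRIMARY KEY, organism TEXT, length INTEGER)"
)
conn.executemany(
    "INSERT INTO sequences VALUES (?, ?, ?)",
    [
        ("SEQ001", "Homo sapiens", 1200),
        ("SEQ002", "Mus musculus", 950),
        ("SEQ003", "Homo sapiens", 430),
    ],
)


def retrieve_by_organism(connection, organism):
    """Return (accession, length) rows for one organism, longest first."""
    query = (
        "SELECT accession, length FROM sequences "
        "WHERE organism = ? ORDER BY length DESC"
    )
    return connection.execute(query, (organism,)).fetchall()


# The retrieved rows are viewed on the screen here; they could equally be
# written to a file or printed.
for accession, length in retrieve_by_organism(conn, "Homo sapiens"):
    print(accession, length)
```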

Database searching and retrieval tools:

The amount of biologically relevant data is increasing so rapidly that knowing how to access
and search this information is essential.

There are three data retrieval systems of relevance to molecular biologists:

1. Sequence Retrieval System (SRS),
2. Entrez,
3. DBGET

These systems allow text searching of multiple molecular biology databases and provide links to
relevant information for entries that match the search criteria. The three systems differ in the
databases they search and the links they have to other information.

Sequence Retrieval System (SRS):

• SRS is a homogeneous interface to over 80 biological databases, developed at
the European Bioinformatics Institute (EBI) at Hinxton, UK.
• It includes databases of sequences, metabolic pathways, transcription factors, application
results (like BLAST, SSEARCH, FASTA), protein 3-D structures, genomes, mappings,
mutations, and locus-specific mutations.
• The web page listing all the databases contains a link to a description page for each database,
including the date on which it was last updated. One or more of the databases are selected to
search before entering a query.
• After getting results, choose an alignment algorithm (like CLUSTALW, PHYLIP), enter
parameters, and run it.
• The SRS is highly recommended for use.

Entrez:

• Entrez is a molecular biology database and retrieval system.
• It was developed by the National Center for Biotechnology Information (NCBI).
• It is an entry point for exploring distinct but integrated databases. Of the three text-based
database systems, Entrez is the easiest to use, but it also offers more limited information to
search.
• Entrez is both an indexing and a retrieval system, holding data from various sources for
biomedical research.
• Entrez is composed of nucleotide sequences from PDB and GenBank, protein sequences
from SWISS-PROT, translated GenBank, PIR, PRF, and PDB, and associated abstracts and
citations from PubMed.
• The Entrez system can provide views of gene and protein sequences and chromosome maps.
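
A text search against Entrez can also be issued programmatically. The sketch below only builds a search URL and makes no network request; it assumes NCBI's E-utilities `esearch` endpoint, which is not described in these notes, and the database name and search term are illustrative.

```python
from urllib.parse import urlencode

# Assumption: the public NCBI E-utilities base URL.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_esearch_url(database, term, max_results=20):
    """Build an Entrez text-search URL for one database (no request is sent)."""
    params = {"db": database, "term": term, "retmax": max_results}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


url = build_esearch_url("protein", "insulin AND human[Organism]", max_results=5)
print(url)
```

Fetching the URL would return a list of matching record identifiers, which can then be used to retrieve the linked entries.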

DBGET:

• The integrated database retrieval system DBGET/LinkDB is the backbone of the Japanese
GenomeNet service.
• DBGET is used to search and extract entries from a wide range of molecular biology
databases, while LinkDB is used to search and compute links between entries in different
databases.
• The WWW version of DBGET/LinkDB at GenomeNet is integrated with other search tools,
such as BLAST, FASTA and MOTIF, and with local helper applications, such as RasMol.

Data retrieval tools:

There is data-mining software that retrieves data from genomic sequence databases, and there are
visualization tools to analyze and retrieve information from proteomic databases. These are

• Homology and similarity tools,
• Protein functional analysis tools,
• Structural analysis tools,
• Sequence analysis tools.

Homology and Similarity Tools:
Homologous sequences are sequences that are related by divergence from a common ancestor.
The degree of similarity between two sequences can be measured, whereas their homology is
either true or false. This set of tools can be used to identify similarities between
novel query sequences of unknown structure and function and database sequences whose
structure and function have been elucidated.
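
To illustrate that similarity is a measurable quantity, the sketch below computes percent identity over two sequences that are assumed to be already aligned (equal length, with "-" marking gaps). Skipping gapped columns is one convention among several, and the sequences are made up.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of identical residues over the ungapped aligned columns."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # columns with a gap are not counted
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


print(percent_identity("ACGT-ACGT", "ACGTTACCT"))
```

A high percent identity is evidence for homology, but the number by itself does not establish it.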
Protein Function Analysis tools:
This group of programs allows you to compare your protein sequence to the secondary (or
derived) protein databases that contain information on motifs, signatures and protein domains.
Highly significant hits against these different pattern databases allow you to approximate the
biochemical function of your query protein.
Structural Analysis tools:
This set of tools allows you to compare structures with the known structure databases. The
function of a protein is more directly a consequence of its structure than of its sequence, with
structural homologs tending to share functions. The determination of a protein's 2D/3D structure
is crucial in the study of its function.
Sequence Analysis tools:
This set of tools allows you to carry out further, more detailed analysis on your query sequence,
including evolutionary analysis and the identification of mutations, hydropathy regions, CpG islands
and compositional biases. The identification of these and other biological properties provides clues
that aid the search to elucidate the specific function of your sequence.
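
Two of the properties mentioned above, compositional bias and CpG enrichment, reduce to simple counts. The sketch below assumes the commonly used observed/expected CpG ratio, (number of CG dinucleotides x sequence length) / (number of C x number of G); the test sequence is made up, and real CpG-island tools apply such measures over sliding windows with additional thresholds.

```python
def gc_content(seq):
    """Fraction of G and C bases in a DNA sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def cpg_observed_expected(seq):
    """Observed/expected CpG ratio: (#CG * length) / (#C * #G)."""
    seq = seq.upper()
    c_count = seq.count("C")
    g_count = seq.count("G")
    if c_count == 0 or g_count == 0:
        return 0.0
    return seq.count("CG") * len(seq) / (c_count * g_count)


sequence = "CGCGCGAT"
print(gc_content(sequence), cpg_observed_expected(sequence))
```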

Some examples of Bioinformatics Tools:


BLAST:
BLAST (Basic Local Alignment Search Tool) comes under the category of homology and
similarity tools.

It is a set of search programs used to perform fast similarity searches regardless of
whether the query is for protein or DNA. A nucleotide query can be compared against the
nucleotide sequences in a database, and a protein database can be searched to find a match
against the queried protein sequence. NCBI has also introduced a
queuing system for BLAST (QBLAST) that allows users to retrieve results at their
convenience and format their results multiple times with different formatting options.
Depending on the type of sequences to compare, there are different programs:

• blastp compares an amino acid query sequence against a protein sequence database
• blastn compares a nucleotide query sequence against a nucleotide sequence database
• blastx compares a nucleotide query sequence translated in all reading frames against a
protein sequence database
• tblastn compares a protein query sequence against a nucleotide sequence database
dynamically translated in all reading frames
• tblastx compares the six-frame translations of a nucleotide query sequence against the
six-frame translations of a nucleotide sequence database.
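
The choice among the five programs listed above follows directly from the query and database types. The helper below is a hypothetical illustration of that decision, not part of BLAST itself; the function name and the `translate_both` flag are assumptions made for this sketch.

```python
def choose_blast_program(query_type, database_type, translate_both=False):
    """Pick a BLAST program name from the query and database sequence types.

    query_type and database_type are "protein" or "nucleotide".
    translate_both applies only to nucleotide-vs-nucleotide searches and
    selects tblastx (six-frame translation of both sides) instead of blastn.
    """
    valid = {"protein", "nucleotide"}
    if query_type not in valid or database_type not in valid:
        raise ValueError("sequence types must be 'protein' or 'nucleotide'")
    if translate_both and (query_type, database_type) != ("nucleotide", "nucleotide"):
        raise ValueError("tblastx needs a nucleotide query and a nucleotide database")
    if query_type == "protein":
        return "blastp" if database_type == "protein" else "tblastn"
    if database_type == "protein":
        return "blastx"
    return "tblastx" if translate_both else "blastn"


print(choose_blast_program("nucleotide", "protein"))
```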

FASTA:
FASTA is an alignment program for protein and DNA sequences created by Pearson and Lipman in 1988.
The program is one of the many heuristic algorithms proposed to speed up sequence comparison.
The basic idea is to add a fast prescreen step to locate the highly matching segments between two
sequences, and then extend these matching segments to local alignments using more rigorous
algorithms such as Smith-Waterman.
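
The prescreen idea can be sketched as follows: index the short words (k-tuples) of one sequence, look up the words of the other, and count hits per diagonal (query position minus target position). The diagonal with the most hits marks the region worth extending with a rigorous algorithm. This is a simplified illustration under those assumptions, not the FASTA program's actual scoring; the function name and sequences are made up.

```python
from collections import defaultdict


def best_diagonal(query, target, k=2):
    """Return (diagonal, hit_count) for the diagonal with the most shared k-tuples.

    The diagonal is query_position - target_position. Returns None when the
    two sequences share no k-tuple.
    """
    # Index every k-tuple of the target by its start positions.
    positions = defaultdict(list)
    for j in range(len(target) - k + 1):
        positions[target[j:j + k]].append(j)

    # Count word hits per diagonal.
    diagonal_hits = defaultdict(int)
    for i in range(len(query) - k + 1):
        for j in positions.get(query[i:i + k], ()):
            diagonal_hits[i - j] += 1

    if not diagonal_hits:
        return None
    return max(diagonal_hits.items(), key=lambda item: item[1])


print(best_diagonal("ACGTAC", "TTACGTAC", k=3))
```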
EMBOSS:
EMBOSS (European Molecular Biology Open Software Suite) is a sequence-analysis software package. It
can work with data in a range of formats and also retrieve sequence data transparently from the
Web. Extensive libraries are also provided with this package, allowing other scientists to release
their software as open source. It provides a set of sequence-analysis programs, and also supports
all UNIX platforms.
ClustalW:
It is a fully automated sequence alignment tool for DNA and protein sequences. It returns the
best match over the total length of the input sequences, be they protein or nucleic acid.
Bioinformatics tools for analysis of DNA:

RasMol:
It is a powerful research tool to display the structure of DNA, proteins, and smaller molecules.
Protein Explorer, a derivative of RasMol, is an easier-to-use program.

• WebAct- This is the web version of ACT (Artemis Comparison Tool), a DNA sequence
comparison viewer based on Artemis. (http://www.webact.org).

• BASys- It is known as the Bacterial Annotation System. It is a tool which supports
automated and in-depth annotation. (http://basys.ca/basys/cgi/submit.pl).

Electronic PCR:

Identifies sequence tagged sites (STSs) within DNA sequences.

Open Reading Frame Finder (ORF Finder):

Suggests potential open reading frames in a DNA sequence.
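
What "suggesting potential open reading frames" involves can be sketched in a few lines. The example scans only the three forward reading frames for an ATG followed in-frame by a stop codon; it is an illustration under those assumptions, not NCBI's ORF Finder, which also considers the reverse strand and alternative genetic codes. The function name, the `min_codons` parameter, and the test sequence are made up.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=2):
    """Return (start, end, sequence) for forward-strand ORFs.

    An ORF runs from an ATG to the first in-frame stop codon (inclusive).
    min_codons is the minimum number of codons before the stop codon.
    Coordinates are 0-based, end-exclusive.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(dna) - 2, 3):
            codon = dna[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos
            elif start is not None and codon in STOP_CODONS:
                if (pos - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3, dna[start:pos + 3]))
                start = None
    return orfs


print(find_orfs("CCATGAAATGATT"))
```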

Splign:

Computes alignments of cDNAs to genomic nucleotide sequences.

OSIRIS:

Facilitates the assessment of multiplex short tandem repeat (STR) DNA profiles based on
laboratory-specific protocols.

• LALIGN- It finds multiple matching sub-segments in two sequences and reports the
% identity for each sub-segment. (http://www.lalign.org).

• GraphAlin- It presents the output in graphical and numerical form as the % identity between
two proteins, or RNA or DNA molecules. (http://www.graphalin.org).

• GeneOrder- It is an ideal tool for the alignment of small GenBank genome sequences (up to
0.25 Mb). Its new version is GeneOrder 3.0. (http://www.genesorder.org).

• CoreGenes- It is designed to analyze two to five genomes simultaneously. It also generates a
table of related genes, i.e. orthologs and putative orthologs. It has a limit of 0.35 Mb.
(http://www.coregenes.org).

Phenotype-Genotype Integrator (PheGenI):

Finds human phenotype/genotype relationships with queries by phenotype,
chromosome location, gene, and SNP identifiers.

BLAST RefSeqGene:

Finds regions of local similarity between query sequences and genomic sequences in the
RefSeqGene/LRG set.

VecScreen:

Identifies segments of a nucleotide sequence that may be of vector origin.

Clustal Omega (EBI):

Multiple sequence alignment programs for DNA or proteins.

Clustal W- PBIL:

Multiple sequence alignment programs for DNA.

GENIO/Logo:

Graphic representation of an amino acid or DNA/RNA multiple sequence alignment.


Bioinformatics tools for analysis of protein:

A. Protein structure Databases

Protein Data Bank (PDB):

PDB is a very large universal repository for the processing and distribution of 3-dimensional
structure data of macromolecules. The information in PDB is derived from a variety of tools and
experiments, such as NMR, X-ray crystallography, cryo-electron microscopy and theoretical modeling.
The database gives users access to structural data and provides methods for visualizing
structures and downloading structural information.[7]

NCBI Structure Database (MMDB):

It includes a database of experimentally determined 3D structures of biomolecules. Most of these
data are derived from X-ray crystallography and NMR spectroscopy. The database provides biologists
with broad information on the biological functions of proteins, on the mechanisms related to their
functions and on the relationships between biomolecules and their evolutionary history. Additionally,
this database supports comparative analysis of the 3D structures of proteins. The NCBI Structure
Database is also called MMDB (Molecular Modeling Database) and includes 3D structures of
macromolecules and visualization tools for comparative analysis of proteins.[8]

Database and tools for protein structure visualization:

Cn3D:

Cn3D ("see in 3-D") is a viewer of structures and sequence alignments for the MMDB database. It
facilitates viewing of 3-D structures and of sequence-structure or structure-structure alignments.
It serves as a helper application for the browser. Files can be downloaded to the PC and the
application can be launched.

SWISS PDB Viewer:

It facilitates the analysis of several proteins simultaneously. The proteins are superimposed on
each other in order to analyze structural alignments and to compare their active sites, their
amino acid mutations, and the angles, distances and H bonds between their atoms. This viewer is
linked to the Swiss-Model server.[10]

Chemscape Chime, RasMol and Protein Explorer:

These are among the usual tools for visualization of protein structure. They can read molecular
structure files from PDB. Chemscape Chime serves as a plug-in that permits structure visualization
within the browser. Protein Explorer likewise serves as a plug-in that permits viewing of protein
structure within the browser. Both of these applications, Chemscape Chime and Protein Explorer, are
derived from RasMol.[11]

Mage and Kinemages:

This is another tool for protein structure visualization. It can rotate the entire image in real
time, display parts by turning them off and on, select points for identification, and animate the
change between different forms.[6]

PDBsum:

It is a database that provides a largely graphical, illustrated summary of the main information on
each biomolecular structure from the Protein Data Bank. It consists of images of the structure,
detailed structural analysis derived from the PROMOTIF program, schematic graphs of interactions,
and a summary of PROCHECK results.[12]

Protein structure alignment tools:

VAST (Vector Alignment Search Tool):

It is a tool produced by NCBI that identifies proteins with similar 3D structures. It is thus a
structure similarity search service.[13]

DALI:

It is a computational protein structure alignment tool used for the comparison of protein
structures in 3D.[14]

B. Domain architecture Databases

Conserved Domain Database (CDD):

It is a database containing sequence alignments and profiles that represent protein domains
conserved during the course of molecular evolution.[15]

CDART (Conserved Domain Architecture Retrieval Tool):

It is used for searching for proteins having similar domain architectures.[16]

C. Bioinformatics tools for plotting protein-ligand interactions

Ligplot:

It is used to find interactions between a protein and a ligand. Hydrogen bonds and hydrophobic
contacts can also be represented in this tool.[17]

D. Approaches for classification of proteins

Classification of proteins by several databases is usually on the basis of their structural
similarities. Both structural and evolutionary relationships are factors in their classification.
Several levels exist in the hierarchy of proteins, but the main levels considered are family,
superfamily and fold.

Family: At this level proteins are grouped together into families having a clear and known
evolutionary relatedness, so it is called the clear evolutionary relationship level.

Superfamily: At this level proteins have low sequence identities, but their structural and
functional characters suggest a common evolutionary origin, so the level is called the probable
common evolutionary origin level.

Fold: At this level the proteins may not share a common evolutionary origin, but they have
structural similarities arising from the physics and chemistry of proteins, which favor certain
chain topologies and packing arrangements. This level is therefore also called the major
structural similarity level.

SCOP:

It is a database for the structural classification of proteins. It provides a comprehensive
classification of the structural and evolutionary relationships between proteins with known
structures.[18]

CATH (Class, Architecture, Topology and Homologous superfamily):

This database provides a hierarchical classification of the domain structures of proteins, which
clusters proteins at four different levels: C, A, T and H, meaning Class, Architecture, Topology
and Homologous superfamily, respectively.

PROSPECT:
PROSPECT (PROtein Structure Prediction and Evaluation Computer ToolKit) is a protein-
structure prediction system that employs a computational technique called protein threading to
construct a protein's 3-D model.

STRING: STRING stands for Search Tool for the Retrieval of Interacting Genes/Proteins. It
draws on high-throughput experimental data, the mining of databases and literature, and
predictions based on genomic context analysis. It assembles them in a common reference set, and
presents evidence in a consistent and intuitive web interface. (http://string.embl.org).

YASPIN: It is built on three individual web servers: cons-PPISP, PINUP, and Promate. It is
known as a meta web server and is used for protein-protein interaction site prediction.
(http://www.yaspin.org).

SPLIT: The Transmembrane Protein Topology Prediction Server provides a modified hydrophobic
moment index and clear, colorful output including beta reference (http://www.split).

OCTOPUS: This tool uses a novel combination of hidden Markov models and artificial neural
networks. It predicts the correct topology for 94% of the dataset of 124 sequences with known
structures. (http://octopus.org).

Swiss-Prot:

It contains annotated or commented sequences; that is, each sequence has been
reviewed, documented and linked to other databases.

TrEMBL:

Translation of EMBL Nucleotide Sequence Database includes the translation of all
coding sequences derived from EMBL-Bank that have not yet been annotated in
Swiss-Prot.

PDB:
Protein Data Bank is the 3-D tertiary structure database of proteins that have been
crystallized. External link: PDB (http://www.rcsb.org/pdb/ )
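
Structure entries downloaded in the legacy PDB text format store each atom as a fixed-column ATOM record. The sketch below assumes the standard column layout of that format (serial in columns 7-11, atom name in 13-16, residue name in 18-20, chain in 22, residue number in 23-26, x/y/z in 31-54). The sample record is built in code, with made-up values, so that the column alignment stays explicit.

```python
def format_atom_record(serial, atom_name, res_name, chain_id, res_seq, x, y, z):
    """Build the first 54 columns of a PDB ATOM record (illustrative values only)."""
    return (
        f"{'ATOM':<6}{serial:>5} {atom_name:<4} {res_name:>3} {chain_id}"
        f"{res_seq:>4}    {x:8.3f}{y:8.3f}{z:8.3f}"
    )


def parse_atom_record(line):
    """Extract fields from an ATOM record using fixed column slices."""
    return {
        "serial": int(line[6:11]),
        "atom_name": line[12:16].strip(),
        "res_name": line[17:20].strip(),
        "chain_id": line[21],
        "res_seq": int(line[22:26]),
        "x": float(line[30:38]),
        "y": float(line[38:46]),
        "z": float(line[46:54]),
    }


record = format_atom_record(1, " N  ", "MET", "A", 1, 27.340, 24.430, 2.614)
atom = parse_atom_record(record)
print(atom["res_name"], atom["atom_name"], atom["x"], atom["y"], atom["z"])
```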

COPIA:
COPIA (COnsensus Pattern Identification and Analysis) is a protein structure analysis tool for
discovering motifs (conserved regions) in a family of protein sequences. Such motifs can then be
used to determine membership of the family for new protein sequences, to predict the secondary and
tertiary structure and function of proteins, and to study the evolutionary history of the sequences.

Amino acid Explorer:

Explores amino acid properties, substitutions and functions.

BLAST:

Finds regions of local similarity between biological sequences.

BLAST Link (BLink):

Displays the results of a precomputed BLAST search of a protein against all other protein
sequences at NCBI.

CD Tree:

Classifies protein sequences and investigates their evolutionary relationships.

Cn3D:

Displays and manipulates 3-dimensional structures and alignments from the structure databases.

COBALT:

Performs protein multiple sequence alignment.

Concise Microbial Protein BLAST:

Finds regions of local similarity between query proteins and proteins from complete microbial
(prokaryotic) genomes.

CDART:

It stands for Conserved Domain Architecture Retrieval Tool. It displays the functional
domains that make up a given protein sequence.

CD Search:

Identifies the conserved domains present in a protein sequence.

VAST:

It stands for Vector Alignment Search Tool. It identifies similar 3-dimensional protein structures.


PIR:

Protein Information Resource is divided into four sub-bases that have a decreasing annotation
level. External link: PIR (http://pir.georgetown.edu/ )

INTERPRO:

It integrates information from various secondary databases such as
PROSITE, providing links to other databases and more extensive information. External link:
INTERPRO ( http://www.ebi.ac.uk/interpro/index.html )

Tools for RNA analysis:

General tools
These tools perform normalization and calculate the abundance of each gene expressed in a
sample.[48] RPKM, FPKM and TPM[49] are some of the units employed to quantify expression.
Some software is also designed to study the variability of gene expression between samples
(differential expression). Quantitative and differential studies are largely determined by the quality of
read alignment and the accuracy of isoform reconstruction. Several studies are available comparing
differential expression methods.[50][51][52]
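
The expression units named above can be computed from raw read counts and transcript lengths. The sketch below assumes the commonly cited definitions: RPKM = reads x 10^9 / (total mapped reads x length in bp), and TPM = the length-normalized read rate of a gene, rescaled so that all genes sum to one million. Gene names, counts, and lengths are made up.

```python
def rpkm(counts, lengths_bp):
    """Reads per kilobase of transcript per million mapped reads."""
    total_reads = sum(counts.values())
    if total_reads == 0:
        return {gene: 0.0 for gene in counts}
    return {
        gene: counts[gene] * 1e9 / (total_reads * lengths_bp[gene])
        for gene in counts
    }


def tpm(counts, lengths_bp):
    """Transcripts per million: per-base read rates rescaled to sum to 1e6."""
    rates = {gene: counts[gene] / lengths_bp[gene] for gene in counts}
    scale = sum(rates.values())
    if scale == 0:
        return {gene: 0.0 for gene in counts}
    return {gene: rates[gene] / scale * 1e6 for gene in counts}


counts = {"geneA": 100, "geneB": 300}
lengths_bp = {"geneA": 1000, "geneB": 3000}
print(rpkm(counts, lengths_bp))
print(tpm(counts, lengths_bp))
```

In this example geneB has three times the reads of geneA but is also three times as long, so both genes receive the same value in either unit. TPM values always sum to one million within a sample, which makes them easier to compare across samples than RPKM.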

• ABSSeq: a new RNA-Seq analysis method based on modelling absolute expression
differences.
• ALDEx2 is a tool for comparative analysis of high-throughput sequencing data. ALDEx2
uses compositional data analysis and can be applied to RNA-seq, 16S rRNA gene sequencing,
metagenomic sequencing, and selective growth experiments.
• Alexa-Seq is a pipeline that makes it possible to perform gene expression analysis,
transcript-specific expression analysis, exon junction expression analysis and quantitative
alternative analysis. It allows wide alternative expression visualization, statistics and graphs.
• ARH-seq – identification of differential splicing in RNA-seq data.
• ASC[53]
• Ballgown
• BaySeq is a Bioconductor package to identify differential expression using next-generation
sequencing data, via empirical Bayesian methods. There is an option of using the "snow"
package for parallelisation of computer data processing, recommended when dealing with large
data sets.
• GMNB[54] is a Bayesian method for temporal gene differential expression analysis across
different phenotypes or treatment conditions that naturally handles the heterogeneity of
sequencing depth in different samples, removing the need for ad-hoc normalization.
• BBSeq
• BitSeq (Bayesian Inference of Transcripts from Sequencing Data) is an application for
inferring expression levels of individual transcripts from sequencing (RNA-Seq) data and
estimating differential expression (DE) between conditions.
• CEDER: accurate detection of differentially expressed genes by combining significance of
exons using RNA-Seq.
• CPTRA: the CPTRA package is for analyzing transcriptome sequencing data from different
sequencing platforms. It combines advantages of 454, Illumina GAII, or other platforms and can
perform sequence tag alignment and annotation, and expression quantification tasks.
• casper is a Bioconductor package to quantify expression at the isoform level. It combines
informative data summaries, flexible estimation of experimental biases and statistical
precision considerations which (reportedly) provide substantial reductions in estimation error.
• Cufflinks/Cuffdiff is appropriate to measure global de novo transcript isoform expression. It
performs assembly of transcripts and estimation of abundances, and determines differential
expression (Cuffdiff) and regulation in RNA-Seq samples.[55]
• DESeq is a Bioconductor package to perform differential gene expression analysis based on
the negative binomial distribution.
• DEGSeq
• Derfinder: annotation-agnostic differential expression analysis of RNA-seq data at base-pair
resolution via the DER Finder approach.
• DEvis is a powerful, integrated solution for the analysis of differential expression data. Using
DESeq2 as a framework, DEvis provides a wide variety of tools for data manipulation,
visualization, and project management.
• DEXSeq is a Bioconductor package that finds differential exon usage based on
RNA-Seq exon counts between samples. DEXSeq employs the negative binomial distribution and
provides options for visualization and exploration of the results.
• DEXUS is a Bioconductor package that identifies differentially expressed genes in RNA-Seq
data under all possible study designs such as studies without replicates, without sample groups,
and with unknown conditions.[56] In contrast to other methods, DEXUS does not need replicates
to detect differentially expressed transcripts, since the replicates (or conditions) are estimated by
the EM method for each transcript.
• DGEclust is a Python package for clustering expression data from RNA-seq, CAGE and
other NGS assays using a Hierarchical Dirichlet Process Mixture Model. The estimated cluster
configurations can be post-processed in order to identify differentially expressed genes and for
generating gene- and sample-wise dendrograms and heatmaps.[57]
• DiffSplice is a method for differential expression detection and visualization, not dependent
on gene annotations. This method is based on the identification of alternative splicing modules
(ASMs) that diverge in the different isoforms. A non-parametric test is applied to each ASM to
identify significant differential transcription with a measured false discovery rate.
• EBSeq is a Bioconductor package for identifying genes and isoforms differentially expressed
(DE) across two or more biological conditions in an RNA-seq experiment. It can also be used to
identify DE contigs after performing de novo transcriptome assembly. While performing DE
analysis on isoforms or contigs, different isoform/contig groups have varying estimation
uncertainties. EBSeq models the varying uncertainties using an empirical Bayes model with
different priors.
• EdgeR is an R package for analysis of differential expression of data from DNA sequencing
methods, like RNA-Seq, SAGE or ChIP-Seq data. edgeR employs statistical methods based
on the negative binomial distribution as a model for count variability.
• EdgeRun: an R package for sensitive, functionally relevant differential expression discovery
using an unconditional exact test.
• EQP: the exon quantification pipeline (EQP) is a comprehensive approach to the quantification
of gene, exon and junction expression from RNA-seq data.
• ESAT: the End Sequence Analysis Toolkit (ESAT) is specially designed to be applied for
quantification of annotation of specialized RNA-Seq gene libraries that target the 5' or 3' ends of
transcripts.
• eXpress performs transcript-level RNA-Seq quantification and allele-specific and
haplotype analysis, and can estimate transcript abundances of the multiple isoforms present in a
gene. Although it can be coupled directly with aligners (like Bowtie), eXpress can also be used
with de novo assemblers, so a reference genome is not needed to perform alignment. It
runs on Linux, Mac and Windows.
• ERANGE performs alignment, normalization and quantification of expressed genes.
• featureCounts: an efficient general-purpose read quantifier.
• FDM
• FineSplice: enhanced splice junction detection and estimation from RNA-Seq data.
• GFOLD[58]: generalized fold change for ranking differentially expressed genes from RNA-seq
data.
• globalSeq[59]: global test for counts, testing for association between RNA-Seq and
high-dimensional data.
• GPSeq: a software tool to analyze RNA-seq data to estimate gene and exon
expression, and identify differentially expressed genes and differentially spliced exons.
• IsoDOT – Differential RNA-isoform Expression.
• Limma powers differential expression analyses for RNA-sequencing and microarray
studies.
• LPEseq: accurately tests differential expression with a limited number of replicates.
• Kallisto: "Kallisto is a program for quantifying abundances of transcripts from RNA-Seq data,
or more generally of target sequences using high-throughput sequencing reads. It is based on
the novel idea of pseudoalignment for rapidly determining the compatibility of reads with targets,
without the need for alignment. On benchmarks with standard RNA-Seq data, kallisto can
quantify 30 million human reads in less than 3 minutes on a Mac desktop computer using only
the read sequences and a transcriptome index that itself takes less than 10 minutes to build."
• MATS: Multivariate Analysis of Transcript Splicing (MATS).
• MAPTest provides a general testing framework for differential expression analysis of
RNA-Seq time course experiments. The method is based on a latent negative-binomial Gaussian
mixture model. The proposed test is optimal in the maximum average power. The test allows not
only identification of traditional DE genes but also testing of a variety of composite hypotheses of
biological interest.[60]
• MetaDiff: differential isoform expression analysis using random-effects meta-regression.
• metaseqR is a Bioconductor package that detects differentially expressed genes from
RNA-Seq data by combining six statistical algorithms using weights estimated from their performance
with simulated data estimated from real data, either public or user-based. In this way, metaseqR
optimizes the tradeoff between precision and sensitivity.[61] In addition, metaseqR creates a
detailed and interactive report with a variety of diagnostic and exploration plots and
auto-generated text.
• MMSEQ is a pipeline for estimating isoform expression and allelic imbalance in diploid
organisms based on RNA-Seq. The pipeline employs tools like Bowtie, TopHat,
ArrayExpressHTS and SAMtools, and uses edgeR or DESeq to perform differential expression analysis.
• MultiDE
• Myrna is a pipeline tool that runs in a cloud environment (Elastic MapReduce) or on a single
computer for estimating differential gene expression in RNA-Seq datasets. Bowtie is employed
for short read alignment and R algorithms for interval calculations, normalization, and statistical
processing.
• NEUMA is a tool to estimate RNA abundances using length normalization, based on
uniquely aligned reads and mRNA isoform models. NEUMA uses known transcriptome data
available in databases like RefSeq.
 NOISeq NOISeq is a non-parametric approach for the identification of differentially
expressed genes from count data or previously normalized count data. NOISeq empirically
models the noise distribution of count changes by contrasting fold-change differences (M) and
absolute expression differences (D) for all the features in samples within the same condition.
 NPEBseq is a nonparametric empirical Bayesian-based method for differential expression
analysis.
 NSMAP allows inference of isoforms as well estimation of expression levels, without
annotated information. The exons are aligned and splice junctions are identified using TopHat.
All the possible isoforms are computed by a combination of the detected exons.
 NURD an implementation of a new method to estimate isoform expression from non-uniform
RNA-seq data.
 PANDORA An R package for the analysis and result reporting of RNA-Seq data by
combining multiple statistical algorithms.
 PennSeq PennSeq: accurate isoform-specific gene expression quantification in RNA-Seq by
modeling non-uniform read distribution.
 Quark Quark enables semi-reference-based compression of RNA-seq data.
 QuasR Quantify and Annotate Short Reads in R.
 RapMap A Rapid, Sensitive and Accurate Tool for Mapping RNA-seq Reads to
Transcriptomes.
 RNAeXpress Can be run with Java GUI or command line on Mac, Windows, and Linux. It
can be configured to perform read counting, feature detection or GTF comparison on mapped
rnaseq data.
 Rcount Rcount: simple and flexible RNA-Seq read counting.
 rDiff is a tool that can detect differential RNA processing (e.g. alternative splicing,
polyadenylation or ribosome occupancy).
 RNASeqPower Calculating samples Size estimates for RNA Seq studies. R package
version.
 RNA-Skim RNA-Skim: a rapid method for RNA-Seq quantification at transcript-level.
 rSeq rSeq is a set of tools for RNA-Seq data analysis. It consists of programs that deal with
many aspects of RNA-Seq data analysis, such as read quality assessment, reference sequence
generation, sequence mapping, gene and isoform expressions (RPKMs) estimation, etc.
 RSEM
 rQuant is a web service (Galaxy installation) that determines
abundances of transcripts per gene locus, based on quadratic programming. rQuant is able to
evaluate biases introduced by experimental conditions. A combination of tools is employed:
PALMapper (reads alignment), mTiM and mGene (inference of new transcripts).
 Salmon is a software tool for computing transcript abundance from RNA-seq data using
either an alignment-free (based directly on the raw reads) or an alignment-based (based on pre-
computed alignments) approach. It uses an online stochastic optimization approach to maximize
the likelihood of the transcript abundances under the observed data. The software itself is
capable of making use of many threads to produce accurate quantification estimates quickly. It
is part of the Sailfish suite of software, and is the successor to the Sailfish tool.
 SAJR is a read counter written in Java and an R package for differential splicing analysis. It uses
junction reads to estimate exon exclusion and reads mapped within the exon to estimate its
inclusion. SAJR models this with a GLM using a quasibinomial distribution and uses a log-likelihood
test to assess significance.
 Scotty Performs power analysis to estimate the number of replicates and depth of
sequencing required to call differential expression.
 Seal is an alignment-free algorithm that quantifies sequence expression by matching k-mers between
raw reads and a reference transcriptome. Handles paired reads and alternate isoforms, and
uses little memory. Accepts all common read formats, and outputs read counts, coverage, and
FPKM values per reference sequence. Open-source, written in pure Java; supports all platforms
with no recompilation and no other dependencies. Distributed with BBMap. (Seal - Sequence
Expression AnaLyzer - is unrelated to the SEAL distributed short-read aligner.)
 semisup[62] Semi-supervised mixture model: detecting SNPs with interactive effects on a
quantitative trait
 Sleuth is a program for analysis of RNA-Seq experiments for which transcript abundances
have been quantified with kallisto.
 SplicingCompass differential splicing detection using RNA-Seq data.
 sSeq The purpose of this R package is to discover the genes that are differentially
expressed between two conditions in RNA-seq experiments.
 StringTie is a fast and highly efficient assembler of RNA-Seq alignments into potential
transcripts. It uses a novel network flow algorithm as well as an optional de novo assembly step
to assemble and quantitate full-length transcripts representing multiple splice variants for each
gene locus. It was designed as a successor to Cufflinks (its developers include some of the
Cufflinks developers) and has many of the same features, but runs far faster and in far less
memory.
 TIGAR Transcript isoform abundance estimation method with gapped alignment of RNA-Seq
data by variational Bayesian inference.
 TimeSeq Detecting Differentially Expressed Genes in Time Course RNA-Seq Data.
 WemIQ is a software tool to quantify isoform expression and exon splicing ratios from RNA-
seq data accurately and robustly.
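Several entries above (Seal, Salmon, RNA-Skim) describe alignment-free quantification, in which reads are matched to a reference transcriptome through shared k-mers rather than full alignments, and counts are then reported as FPKM. The following is a minimal Python sketch of that general idea only. It is not the algorithm used by any of the listed tools. The function names, the "most shared k-mers wins" assignment rule, and the treatment of each read as one fragment are assumptions made for illustration.

```python
from collections import defaultdict


def kmers(seq, k):
    """Yield every substring of length k in seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_index(transcripts, k):
    """Map each k-mer to the set of transcript names that contain it."""
    index = defaultdict(set)
    for name, seq in transcripts.items():
        for km in kmers(seq, k):
            index[km].add(name)
    return index


def count_reads(reads, index, k):
    """Assign each read to the transcript sharing the most k-mers with it.

    Illustrative rule only: ties go to the alphabetically first transcript,
    and reads with no matching k-mer are skipped.
    """
    counts = defaultdict(int)
    for read in reads:
        hits = defaultdict(int)
        for km in kmers(read, k):
            for name in index.get(km, ()):
                hits[name] += 1
        if hits:
            best = max(sorted(hits), key=hits.get)
            counts[best] += 1
    return dict(counts)


def fpkm(counts, lengths):
    """Fragments per kilobase of transcript per million assigned fragments."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {name: c * 1e9 / (lengths[name] * total)
            for name, c in counts.items()}


if __name__ == "__main__":
    # Toy data, invented for this example.
    transcripts = {"tx1": "ACGTACGTAA", "tx2": "TTTTGGGGCC"}
    reads = ["ACGTAC", "GGGGCC", "CGTACG"]
    index = build_index(transcripts, k=4)
    counts = count_reads(reads, index, k=4)
    lengths = {name: len(seq) for name, seq in transcripts.items()}
    print(counts)
    print(fpkm(counts, lengths))
```

Real tools handle paired reads, reads shared among isoforms, sequencing errors and bias correction, all of which this sketch ignores.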
Evaluation of quantification and differential expression
 CompcodeR RNAseq data simulation, differential expression analysis and performance
comparison of differential expression methods.
 DEAR-O Differential Expression Analysis based on RNA-seq data – Online.
 PROPER comprehensive power evaluation for differential expression using RNA-seq.
 RNAontheBENCH computational and empirical resources for benchmarking RNAseq
quantification and differential expression methods.
 rnaseqcomp Several quantitative and visualized benchmarks for RNA-seq quantification
pipelines. Two-condition quantifications for genes, transcripts, junctions or exons by each
pipeline, with the necessary meta information, should be organized into numeric matrices in order to
proceed with the evaluation.
Multi-tool solutions
 DEB is a web interface/pipeline that allows comparison of the significantly expressed
genes reported by different tools. Three algorithms are currently available: edgeR, DESeq and baySeq.
 SARTools A DESeq2- and EdgeR-Based R Pipeline for Comprehensive Differential Analysis
of RNA-Seq Data.
Transposable Element expression
 TeXP is a Transposable Element quantification pipeline that deconvolves pervasive
transcription from autonomous transcription of LINE-1 elements. [63]