Data Retrieval
The retrieved data may be stored in a file, printed, or viewed on the screen. A query language,
such as Structured Query Language (SQL), is used to prepare the queries.
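As a minimal sketch of how a query language prepares a retrieval request, the example below uses Python's standard-library sqlite3 module with an in-memory database. The table name `sequences`, its columns, and the sample rows are hypothetical and chosen only for illustration.

```python
import sqlite3

def find_entries(conn, organism):
    """Return (accession, length) rows for one organism, ordered by accession."""
    cur = conn.execute(
        "SELECT accession, length FROM sequences "
        "WHERE organism = ? ORDER BY accession",
        (organism,),
    )
    return cur.fetchall()

# Build a tiny in-memory database (hypothetical schema and rows).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sequences (accession TEXT, organism TEXT, length INTEGER)")
conn.executemany(
    "INSERT INTO sequences VALUES (?, ?, ?)",
    [("X0001", "E. coli", 1200), ("X0002", "H. sapiens", 850), ("X0003", "E. coli", 430)],
)

rows = find_entries(conn, "E. coli")
for accession, length in rows:
    # The retrieved data is viewed on the screen; it could equally be written to a file.
    print(accession, length)
```

The `?` placeholder keeps the search term separate from the SQL text, which is the usual way to pass user input to a query.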
The amount of biologically relevant data is increasing so rapidly that knowing how to access and
search this information is essential.
Retrieval systems such as SRS, Entrez, and DBGET allow text searching of multiple molecular biology
databases and provide links to relevant information for entries that match the search criteria. The
three systems differ in the databases they search and the links they have to other information.
SRS is a homogeneous interface to over 80 biological databases that was developed at
the European Bioinformatics Institute (EBI) at Hinxton, UK.
It includes databases of sequences, metabolic pathways, transcription factors, application
results (like BLAST, SSEARCH, FASTA), protein 3-D structures, genomes, mappings,
mutations, and locus specific mutations.
The web page listing all the databases contains, for each database, a link to a description page
that includes the date on which the database was last updated. Select one or more databases to
search before entering your query.
After getting results, choose an analysis tool (such as CLUSTALW or PHYLIP), enter its
parameters, and run it.
The SRS is highly recommended for use.
Entrez:
DBGET:
The integrated database retrieval system DBGET/LinkDB is the backbone of the Japanese
GenomeNet service.
DBGET is used to search and extract entries from a wide range of molecular biology
databases, while LinkDB is used to search and compute links between entries in different
databases.
The WWW version of DBGET/LinkDB at GenomeNet is integrated with other search tools,
such as BLAST, FASTA and MOTIF, and with local helper applications, such as RasMol.
There is data-mining software that retrieves data from genomic sequence databases, as well as
visualization tools that analyze and retrieve information from proteomic databases. These include
the following.
BLAST:
BLAST is a set of search programs, available for the Windows platform among others, that is used to
perform fast similarity searches regardless of whether the query is protein or DNA.
Nucleotide sequences can be compared against a nucleotide database, and a protein database can
be searched for matches to a query protein sequence. NCBI has also introduced a queuing system
for BLAST (QBLAST) that allows users to retrieve results at their convenience and to format their
results multiple times with different formatting options.
Depending on the type of sequences to compare, there are different programs:
blastp compares an amino acid query sequence against a protein sequence database
blastx compares a nucleotide query sequence translated in all reading frames against a
protein sequence database
tblastn compares a protein query sequence against a nucleotide sequence database
dynamically translated in all reading frames
tblastx compares the six-frame translations of a nucleotide query sequence against the
six-frame translations of a nucleotide sequence database.
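The choice among these programs can be sketched as a small lookup on the query and database types. This is only an illustrative helper, not part of BLAST itself; the function name and its arguments are hypothetical. The `blastn` branch (untranslated nucleotide against nucleotide) is not described in the list above and is included as an assumption for completeness.

```python
def choose_blast_program(query_type, db_type, translate_both=False):
    """Pick a BLAST program name from the types of sequences being compared.

    query_type and db_type are "protein" or "nucleotide".
    translate_both only matters for nucleotide-versus-nucleotide searches.
    """
    if query_type == "protein" and db_type == "protein":
        return "blastp"
    if query_type == "nucleotide" and db_type == "protein":
        return "blastx"   # query is translated in all reading frames
    if query_type == "protein" and db_type == "nucleotide":
        return "tblastn"  # database is translated in all reading frames
    if query_type == "nucleotide" and db_type == "nucleotide":
        # tblastx translates both sides; blastn (assumed here) compares them directly.
        return "tblastx" if translate_both else "blastn"
    raise ValueError("sequence types must be 'protein' or 'nucleotide'")

print(choose_blast_program("nucleotide", "protein"))
```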
FASTA:
FASTA is an alignment program for protein sequences created by Pearson and Lipman in 1988.
The program is one of the many heuristic algorithms proposed to speed up sequence comparison.
The basic idea is to add a fast prescreen step to locate the highly matching segments between two
sequences, and then extend these matching segments to local alignments using more rigorous
algorithms such as Smith-Waterman.
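The two-stage idea can be sketched as follows. This is a simplified illustration, not the FASTA implementation: the prescreen counts exact k-letter word matches per diagonal, and sequences that pass are scored with a plain Smith-Waterman recurrence over the whole matrix (the real program restricts the rigorous step to a region around the best segments). The function names, scoring values, and the `min_hits` threshold are assumptions.

```python
from collections import defaultdict

def kmer_diagonal_hits(query, target, k=2):
    """Prescreen: count exact k-mer matches on each diagonal (i - j)."""
    index = defaultdict(list)
    for j in range(len(target) - k + 1):
        index[target[j:j + k]].append(j)
    hits = defaultdict(int)
    for i in range(len(query) - k + 1):
        for j in index.get(query[i:i + k], []):
            hits[i - j] += 1
    return dict(hits)

def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score with a linear gap penalty."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best

def prescreened_score(query, target, k=2, min_hits=2):
    """Run the rigorous step only if some diagonal has enough k-mer hits."""
    hits = kmer_diagonal_hits(query, target, k)
    if not hits or max(hits.values()) < min_hits:
        return 0  # rejected by the fast prescreen
    return smith_waterman_score(query, target)

print(prescreened_score("ACGT", "ACGT"))
```

The speed-up comes from the prescreen: most database sequences share too few words with the query and never reach the quadratic-time alignment step.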
EMBOSS:
EMBOSS (European Molecular Biology Open Software Suite) is a sequence analysis software package. It
can work with data in a range of formats and can also retrieve sequence data transparently from the
Web. Extensive libraries are provided with the package, allowing other scientists to release
their software as open source. It provides a set of sequence-analysis programs and supports
all UNIX platforms.
ClustalW:
It is a fully automated sequence alignment tool for DNA and protein sequences. It returns the
best match over the total length of the input sequences, whether protein or nucleic acid.
Bioinformatics tools for analysis of DNA:
RasMol:
It is a powerful research tool to display the structure of DNA, proteins, and smaller molecules.
Protein Explorer, a derivative of RasMol, is an easier-to-use program.
WebAct- This is the web version of ACT (Artemis Comparison Tool), a DNA sequence
comparison viewer based on Artemis. (http://www.webact.org).
Electronic PCR:
Splign:
OSIRIS:
Facilitates the assessment of multiplex short tandem repeat (STR) DNA profiles based on
laboratory-specific protocols.
LALIGN- It finds multiple matching sub-segments in two sequences and reports the % identity
for each sub-segment. (http://www.lalign.org).
• GraphAlin- It presents the % identity between two protein, RNA, or DNA molecules in both
graphical and numerical form. (http://www.graphalin.org).
• GeneOrder- It is a tool well suited to the alignment of small GenBank genome sequences (up to
0.25Mb). A newer version, GeneOrder 3.0, is available. (http://www.genesorder.org).
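The % identity that tools such as LALIGN and GraphAlin report can be illustrated with a short calculation over an existing pairwise alignment. This is a minimal sketch under one assumed convention (gapped columns are excluded from the denominator); individual tools may define the denominator differently, and the function name is hypothetical.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over aligned columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == gap or y == gap:
            continue  # skip gapped columns (assumed convention)
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0

print(percent_identity("ACGT", "ACGA"))
```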
BLAST RefseqGene:
Finds regions of local similarity between query sequences and genomic sequences in the
RefSeqGene/LRG set
ORF finder:
Vec Screen:
Clustal W- PBIL:
GENIO/Logo:
PDB is a very large, universal repository for the processing and distribution of 3-dimensional
structure data of macromolecules. The information in PDB is derived from a variety of methods, such
as NMR, X-ray crystallography, cryo-electron microscopy, and theoretical modeling. The database
gives users access to structural data, methods for visualizing structures, and downloads of
structural information.[7]
NCBI Structure Database (MMDB): It is a database of experimentally determined 3D structures of
biomolecules. Most of these data are derived from X-ray crystallography and NMR spectroscopy. The
database provides biologists with broad information on the biological functions of proteins, on the
mechanisms related to those functions, and on the relationships between biomolecules and their
evolutionary history. It also supports comparative analysis of the 3D structures of proteins. The
NCBI Structure Database is also called MMDB (Molecular Modeling Database) and includes 3D structures
of macromolecules and visualization tools for comparative analysis of proteins.[8]
Database and tools for protein structure visualization:
Cn3D: Cn3D ("see in 3-D") is a viewer of structures and sequence alignments for the MMDB database.
It facilitates viewing of 3-D structures and of sequence-structure or structure-structure
alignments. It serves as a helper application for the browser: files can be downloaded to the PC
and the application launched. It facilitates the analysis of several proteins simultaneously. The
proteins are superimposed in order to analyze structural alignments and to compare their active
sites, amino acid mutations, angles, distances, and H bonds between their atoms. This viewer is
linked to the Swiss-Model server.[10]
Chemscape Chime, RasMol and Protein Explorer: RasMol is one of the common tools for visualization of
protein structure. It can read molecular structure files from PDB. Chemscape Chime serves as a
plug-in that permits structure visualization within the browser. Protein Explorer likewise serves as
a plug-in that permits viewing of protein structures within the browser. Both Chemscape Chime and
Protein Explorer are derived from RasMol.[11]
Mage and Kinemages: It is another tool for protein structure visualization. It can rotate the entire
image in real time, display parts by turning them on and off, select points for identification, and
animate changes between different forms.[6]
PDBsum: It is a database that provides a largely graphical, illustrated summary of the main
information on each biomolecular structure in the Protein Data Bank. It consists of images of the
structure, detailed structural analysis derived from the PROMOTIF program, schematic diagrams of
interactions, and a summary of PROCHECK results.[12]
Protein structure alignment tools:
VAST (Vector Alignment Search Tool): It is a tool produced by NCBI that identifies proteins with
similar 3D structures. It is therefore a structure similarity search service.[13]
DALI: It is a computational protein structure alignment tool used for comparing protein structures
in 3D.[14]
B. Domain architecture databases:
Conserved Domain Database (CDD): It is a database containing sequence alignments and profiles that
represent protein domains conserved during the course of molecular evolution.[15]
CDART (Conserved Domain Architecture Retrieval Tool): It is used to search for proteins that have
similar domain architectures.[16]
C. Bioinformatics tools for plotting protein-ligand interactions:
Ligplot: It is used to find interactions between a protein and a ligand; hydrogen bonds and
hydrophobic contacts can also be represented in this tool.[17]
D. Approaches for classification of proteins:
Several databases classify proteins, usually on the basis of their structural similarities. Both
structural and evolutionary relationships are factors in the classification. The protein hierarchy
has several levels, but the main levels considered are family, superfamily, and fold.
Family: At this level proteins are grouped into families with a clear and known evolutionary
relatedness, so it is called the level of clear evolutionary relationship.
Superfamily: Proteins at this level have low sequence identities, but their structural and
functional characteristics suggest a common evolutionary origin, so it is called the level of
probable common evolutionary origin.
Fold: Proteins at this level need not share an evolutionary origin; their structural similarities
may arise from the physics and chemistry of proteins favoring certain chain topologies and packing
arrangements. This level is therefore also called the level of major structural similarity.
SCOP: It is a database for the structural classification of proteins. It provides a comprehensive
classification of the structural and evolutionary relationships between proteins with known
structures.[18]
CATH (Class, Architecture, Topology and Homologous superfamily): This database provides a
hierarchical classification of protein domain structures, clustering proteins at four levels:
Class (C), Architecture (A), Topology (T), and Homologous superfamily (H).
PROSPECT:
PROSPECT (PROtein Structure Prediction and Evaluation Computer ToolKit) is a protein-
structure prediction system that employs a computational technique called protein threading to
construct a protein's 3-D model.
STRING: STRING stands for Search Tool for the Retrieval of Interacting Genes/Proteins. It collects
associations derived from high-throughput experimental data, from mining of databases and
literature, and from predictions based on genomic context analysis. It assembles them in a common
reference set and presents the evidence in a consistent and intuitive web interface.
(http://string.embl.org).
YASPIN: It is built on three individual web servers: cons-PPISP, PINUP, and Promate. It is
a meta web server used for protein-protein interaction site prediction.
(http://www.yaspin.org).
SPLIT: This transmembrane protein topology prediction server provides a modified hydrophobic
moment index and clear, colorful output including beta reference (http://www.split).
OCTOPUS: This tool uses a novel combination of hidden Markov models and artificial neural
networks. It predicts the correct topology for 94% of the dataset of 124 sequences with known
structures. (http://octopus.org).
Swiss-Prot:
It contains annotated or commented sequences, that is, each sequence has been
reviewed, documented and linked to other databases.
TrEMBL:
PDB:
Protein Data Bank is the 3-D tertiary structure database of proteins that have been
crystallized. External link: PDB (http://www.rcsb.org/pdb/ )
COPIA :
COPIA (COnsensus Pattern Identification and Analysis) is a protein structure analysis tool for
discovering motifs (conserved regions) in a family of protein sequences. Such motifs can then be
used to determine whether new protein sequences belong to the family, to predict the secondary and
tertiary structure and function of proteins, and to study the evolutionary history of the sequences.
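How a discovered motif can be used to test family membership can be sketched with a regular expression. This does not reproduce COPIA's motif discovery; it only shows the membership check once a pattern is known. The pattern below and the example sequences are hypothetical.

```python
import re

# Hypothetical consensus pattern: C, any two residues, C, any three residues, H.
MOTIF = re.compile(r"C.{2}C.{3}H")

def has_motif(seq):
    """True if the protein sequence contains the family motif."""
    return MOTIF.search(seq) is not None

def motif_positions(seq):
    """0-based start positions of non-overlapping motif matches."""
    return [m.start() for m in MOTIF.finditer(seq)]

print(has_motif("MKCAACGGGHLL"), motif_positions("MKCAACGGGHLL"))
```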
BLAST:
Displays the results of a precomputed BLAST search of a protein against all other protein
sequences at NCBI.
CD Tree:
Cn3D:
Displays and manipulates 3-dimensional structures and alignments from the structure databases.
COBALT:
CDART:
CD Search:
VAST:
PIR:
Protein Information Resource is divided into four sub-databases with decreasing levels of
annotation. External link: PIR (http://pir.georgetown.edu/ )
INTERPRO:
General tools
These tools perform normalization and calculate the abundance of each gene expressed in a
sample.[48] RPKM, FPKM and TPM[49] are some of the units used to quantify expression.
Some software is also designed to study the variability of gene expression between samples
(differential expression). Quantitative and differential studies depend heavily on the quality of
read alignment and the accuracy of isoform reconstruction. Several studies comparing
differential expression methods are available.[50][51][52]
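The normalization units mentioned above can be illustrated with a short calculation. This is a minimal sketch using the commonly stated definitions, not the code of any particular tool; the function names and the example counts and gene lengths are hypothetical. FPKM follows the same formula as RPKM with fragments counted in place of reads.

```python
def rpkm(counts, lengths_bp):
    """Reads per kilobase of transcript per million mapped reads."""
    total = sum(counts)
    if total == 0:
        raise ValueError("no mapped reads")
    return [c * 1e9 / (l * total) for c, l in zip(counts, lengths_bp)]

def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalize first, then scale to one million."""
    rates = [c / l for c, l in zip(counts, lengths_bp)]
    scale = sum(rates)
    if scale == 0:
        raise ValueError("no mapped reads")
    return [r * 1e6 / scale for r in rates]

# Hypothetical read counts and gene lengths (in base pairs) for three genes.
counts = [100, 200, 700]
lengths = [1000, 2000, 1000]
print(rpkm(counts, lengths))
print(tpm(counts, lengths))
```

In this example the first two genes receive the same RPKM even though one has twice the reads, because it is also twice as long. TPM values always sum to one million within a sample, which makes proportions easier to compare across samples.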