
Bioinformatics

Bioinformatics is a scientific discipline that integrates biology, computer science, and information technology to manage complex biological data. The field has evolved significantly since the 1990s with advancements in high-throughput DNA sequencing and the growth of various 'omics' projects, necessitating sophisticated computational tools for data analysis. Key tasks in bioinformatics include sequence alignment, protein folding, and evolutionary analysis, with various databases and algorithms available for researchers to utilize.

Uploaded by

georginaroudri

Bioinformatics:

Copyright© Kerstin Wagner


Introduction: What is bioinformatics?
Bioinformatics can be defined as the body of tools and algorithms needed to handle large
and complex biological information.

Bioinformatics is a scientific discipline created from the interaction
of biology and computer science.

The NCBI defines bioinformatics as:
"Bioinformatics is the field of science in which biology, computer
science, and information technology merge into a single discipline"
Genomics era: High-throughput DNA sequencing

The first high-throughput genomics technology was automated DNA sequencing
in the early 1990s.

In 1995, Venter and Hamilton Smith used a whole-genome shotgun sequencing strategy to
sequence the genomes of Mycoplasma and Haemophilus.

In September 1999, Celera Genomics completed the sequencing of the
Drosophila genome.

The 3-billion-bp human genome sequence was generated in a competition between
the publicly funded Human Genome Project and Celera.
High-throughput DNA sequencing

Top image: confocal detection of fluorescently labeled DNA
by the MegaBACE sequencer

That was then. How about now?
The trend of data growth
The 21st century is a century of biotechnology and OMICS:

- Genomics: New sequence information is being produced at increasing rates. (The
contents of GenBank double every year)
[Figure: nucleotides (billions) versus year, 1980 to 2000]

- Transcriptomics: Microarray: Global expression analysis: RNA
levels of every gene in the genome analyzed in parallel.
Progressively replaced by RNA-seq

- Proteomics: Global protein analysis generates large mass
spectra libraries.

- Metabolomics: Global metabolite analysis: 25,000 secondary
metabolites characterized
How to handle the large amount of information?

Drew Sheneman, New Jersey--The Newark Star Ledger

Answer: bioinformatics and the Internet


Bioinformatics history
In the 1960s: the birth of bioinformatics

IBM 7090 computer

Margaret Oakley Dayhoff created:
- The first protein database
- The first program for sequence assembly

There is a need for computers and algorithms that allow:
Access, processing, storing, sharing, retrieving, visualizing, annotating…
Why do we need the Internet?
"Omics" projects and their associated information involve a huge amount
of data that is stored on computers all over the world.
It is impossible to maintain up-to-date copies of all relevant
databases within the lab, so access to the data is via the Internet.
[Figure labels: "Database storage"; "You are here"]
Scope of this lab
The lab will touch on the following computational tasks:
- Similarity search
- Sequence comparison: alignment, multiple alignment, retrieval
- Sequence analysis: signal peptide, transmembrane domain,…
- Protein folding: secondary structure from sequence
- Sequence evolution: phylogenetic trees

It will also make you familiar with the bioinformatics resources available on the
web to do these tasks.
Applying algorithms to analyze genomics data
You have just cloned a gene. Questions to ask:

Is it already in databases?
-Accession #?
-Annotation?

Protein characteristics?
-Sub-localization
-Soluble?
-3D fold

Other information?
-Expression profile?
-Mutants?

Are there conserved regions?
-Alignments?
-Domains?

Are there similar sequences?
-% identity?
-Family member?

Evolutionary relationship?
-Phylogenetic tree

A critical failure of current bioinformatics is the lack of a single software
package that can perform all of these functions.
DNA (nucleotide sequences) databases
These are large databases, and searching any one of them should produce
similar results because they exchange information routinely.

-GenBank (NCBI): http://www.ncbi.nlm.nih.gov

-Ensembl: http://useast.ensembl.org/index.html

-DDBJ (DNA DataBase of Japan): http://www.ddbj.nig.ac.jp

-TIGR: http://tigr.org/tdb/tgi

-Yeast: http://yeastgenome.org

-Microbes: http://img.jgi.doe.gov/cgi-bin/pub/main.cgi
Protein (amino acid) databases
Known proteins:
-Swiss-Prot (very high level of annotation)
http://au.expasy.org/

-PIR (Protein Information Resource): the world's most
comprehensive catalog of information on proteins
http://www.pir.uniprot.org/

Translated databases:
-TREMBL (translated EMBL): includes entries that have
not yet been annotated for inclusion in Swiss-Prot.
http://www.ebi.ac.uk/trembl/access.html

-GenPept (translation of coding regions in GenBank)

-pdb (sequences derived from the 3D structures in the
Brookhaven PDB) http://www.rcsb.org/pdb/
Database homology searching
Homology searches use algorithms that give the search an efficient mathematical basis,
one that can be translated into statistical significance.

Assumes that sequence, structure, and function are inter-related.

All similarity searching methods rely on the concepts of alignment
and distance between sequences.

A similarity score is calculated from a distance: the number of DNA
bases or amino acids that are different between two sequences.
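
The distance described above can be sketched in a few lines of Python. This is only an illustration for two sequences that are already aligned to the same length; the function names and the example sequences are made up for this sketch, and real search tools use substitution matrices and gap penalties rather than a plain mismatch count.

```python
def hamming_distance(a: str, b: str) -> int:
    """Count positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for x, y in zip(a, b) if x != y)


def percent_identity(a: str, b: str) -> float:
    """Turn the distance into a simple similarity score (% identical positions)."""
    return 100.0 * (len(a) - hamming_distance(a, b)) / len(a)


# Hypothetical example sequences (not from the lecture)
seq1 = "GATTACA"
seq2 = "GACTATA"
print(hamming_distance(seq1, seq2))
print(round(percent_identity(seq1, seq2), 1))
```

The smaller the distance, the higher the similarity score.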
Database search methods: Sequence Alignment
Two broad classes of sequence alignments exist:

- Global alignment: not sensitive
QKESGPSSSYC
VQQESGLVRTTC

- Local alignment: faster
ESG
ESG

The most widely used local similarity algorithms are:
- Smith-Waterman (http://www.ebi.ac.uk/MPsrch/)
- Basic Local Alignment Search Tool (BLAST, http://www.ncbi.nih.gov)
- Fast Alignment (FASTA, http://fasta.genome.jp; http://www.ebi.ac.uk/fasta33/;
http://www.arabidopsis.org/cgi-bin/fasta/nph-TAIRfasta.pl)
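
To make the global/local distinction concrete, here is a minimal dynamic-programming sketch in Python that scores the two example sequences both ways. It assumes a toy scoring scheme (match +2, mismatch -1, linear gap -2) chosen only for illustration; the servers listed above use substitution matrices such as BLOSUM and more elaborate gap penalties. With `local=True` it follows the Smith-Waterman idea (scores never drop below zero and the best cell anywhere is reported); with `local=False` it scores the alignment end to end.

```python
def alignment_score(a, b, local, match=2, mismatch=-1, gap=-2):
    """Best alignment score of a and b (toy scoring, assumed for this sketch)."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment: leading gaps are penalized
        for i in range(1, rows):
            H[i][0] = i * gap
        for j in range(1, cols):
            H[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score = max(diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if local:
                # Local alignment: restart whenever the score would go negative
                score = max(score, 0)
                best = max(best, score)
            H[i][j] = score
    return best if local else H[-1][-1]


s1 = "QKESGPSSSYC"
s2 = "VQQESGLVRTTC"
print("global score:", alignment_score(s1, s2, local=False))
print("local score: ", alignment_score(s1, s2, local=True))
```

The local score is driven by the shared ESG region, while the global score is pulled down by the dissimilar ends of the two sequences.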
Which algorithm to use for database similarity search?

Speed:
BLAST > FASTA > Smith-Waterman (It is VERY SLOW and uses a
LOT OF COMPUTER POWER)

Sensitivity/statistics:
FASTA is more sensitive and misses fewer homologues.
Smith-Waterman is even more sensitive.

BLAST calculates probabilities.

FASTA is more accurate than BLAST for DNA-DNA searches.


Tools to search databases
The dilemma: DNA or protein?

Search by similarity: using nucleotide seq. or using amino acid seq.

- Is the comparison of two nucleotide sequences accurate?

- By translating into amino acid sequence, are we losing information?
The genetic code is degenerate (two or more codons can represent
the same amino acid)

- Very different DNA sequences may code for similar protein sequences.
We certainly do not want to miss those cases!
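
A small Python sketch of this point: two DNA sequences that share only 2 of 9 bases but translate to the same peptide. The codon table below is a deliberately partial subset of the standard genetic code (only the six codons needed here), and the two sequences are invented for the example.

```python
# Partial codon table: a small subset of the standard genetic code
CODONS = {
    "CTT": "L", "TTG": "L",   # leucine
    "TCT": "S", "AGC": "S",   # serine
    "CGT": "R", "AGA": "R",   # arginine
}


def translate(dna: str) -> str:
    """Translate frame 1 of a DNA string using the partial table above."""
    return "".join(CODONS[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


dna_a = "CTTTCTCGT"   # hypothetical sequence
dna_b = "TTGAGCAGA"   # hypothetical sequence, different codon choices

identical_bases = sum(1 for x, y in zip(dna_a, dna_b) if x == y)
print(translate(dna_a), translate(dna_b))                 # same peptide
print(identical_bases, "of", len(dna_a), "bases identical")
```

A nucleotide search could easily miss this pair, while a protein-level comparison sees a perfect match.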
Reasons for translating
Comparing DNA sequences gives more random matches:
[Figure: "A good alignment with end-gaps" next to "A very poor alignment", captioned "Almost 50% identity!"]

Conservation of protein in evolution (DNA similarity decays faster!)

Conclusion:
It is almost always better to compare coding sequences in their amino acid form,
especially if they are very divergent.
- Very highly similar nucleotide sequences may give better results.
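
One way to see why DNA gives more random matches is to compute the chance that two unrelated residues are identical. The Python sketch below assumes, purely for illustration, that all residues are equally frequent and that no gaps are allowed; real compositions are uneven, and allowing gaps pushes the chance identity of DNA alignments higher still.

```python
def chance_identity(frequencies):
    """Probability that two independently drawn residues are identical."""
    return sum(p * p for p in frequencies)


dna_uniform = [1 / 4] * 4          # assumed: A, C, G, T equally frequent
protein_uniform = [1 / 20] * 20    # assumed: 20 amino acids equally frequent

print(f"DNA:     {chance_identity(dna_uniform):.2f}")
print(f"Protein: {chance_identity(protein_uniform):.2f}")
```

With a 4-letter alphabet about a quarter of positions match by chance alone, against about one in twenty for proteins, so a given percent identity means much more at the protein level.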
BLAST and FASTA variants

FASTA: Compares a DNA query to DNA database, or a protein query
to protein database
FASTX: Compares a translated DNA query to a protein database
TFASTA: Compares a protein query to a translated DNA database

BLASTN: Compares a DNA query to DNA database.

BLASTP: Compares a protein query to protein database.

BLASTX: Compares the 6-frame translations of DNA query to protein
database.
TBLASTN: Compares a protein query to the 6-frame translations of a DNA
database. You can however define your frame of interest
TBLASTX: Compares the 6-frame translations of DNA query to the 6-frame
translations of a DNA database (each frame-by-frame comparison is comparable to
a BLASTP search!)

PSI-BLAST: Performs iterative database searches. The results from each round
are incorporated into a 'position specific' score matrix, which is
used for further searching
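
The "6-frame translation" used by BLASTX, TBLASTN and TBLASTX can be sketched in Python as follows. This is an illustrative sketch, not the code those programs run: it assumes the standard genetic code, uppercase A/C/G/T input, and writes stop codons as `*`. The function names and the example sequence are invented for the sketch.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna: str) -> str:
    """Return the opposite strand, read 5' to 3'."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna: str) -> str:
    """Translate complete codons only; unknown codons become X."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frames(dna: str) -> dict:
    """Three reading frames on each strand, keyed +1..+3 and -1..-3."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


for name, peptide in six_frames("ATGGCCTAA").items():   # hypothetical query
    print(name, peptide)
```

Each of the six peptides is then compared against the protein side of the search, which is why the translated searches cost more than BLASTN or BLASTP.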
A practical example of sequence alignment
http://www.ncbi.nlm.nih.gov

BLAST results
Detailed BLAST results

E value: the expectation value, i.e. the number of hits scoring at least as well as
this one that you would expect to find by chance in a database of this size.
The lower the E, the more significant the score.
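
As a rough illustration of how an E value relates to the score and to database size, the Python sketch below uses the standard relation between a bit score S' and the expectation value, E = m × n × 2^(-S'), where m is the query length and n the database length. The numbers are invented, and real BLAST output uses edge-corrected "effective" lengths, so treat this as an approximation rather than a reimplementation.

```python
import math


def expect_value(bit_score: float, query_len: float, db_len: float) -> float:
    """Approximate E value: E = m * n * 2**(-S') for bit score S'."""
    return query_len * db_len * 2.0 ** (-bit_score)


def p_value(e_value: float) -> float:
    """Probability of at least one chance hit: P = 1 - exp(-E)."""
    return -math.expm1(-e_value)


# Hypothetical search: 300-residue query against a database of 1e9 residues
for score in (30, 40, 50):
    e = expect_value(score, 300, 1e9)
    print(score, f"E = {e:.3g}", f"P = {p_value(e):.3g}")
```

The same bit score becomes less significant as the database grows, and for small E the probability P is almost equal to E.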
Database searching tips
Use the latest database version.

Use BLAST first, then a finer tool (FASTA,…)

Search both strands when using FASTA.

Translate sequences where relevant.

Search the 6-frame translation of a DNA database.

E < 0.05 is statistically significant, usually biologically interesting.

If the query has repeated segments, delete them and repeat the search.
