Bioinformatics
Genomics: New sequence information is being produced at increasing rates.
(The contents of GenBank double every year)
[Figure: GenBank size in nucleotides (billions, 0 to 6) by year, 1980 to 2000, with a "You are here" marker]
Scope of this lab
The lab will touch on the following computational tasks:
Similarity search
Sequence comparison: Alignment, multiple alignment, retrieval
Sequence analysis: Signal peptide, transmembrane domain,…
Protein folding: secondary structure from sequence
Sequence evolution: phylogenetic trees
-Ensembl: http://useast.ensembl.org/index.html
-TIGR: http://tigr.org/tdb/tgi
-Yeast: http://yeastgenome.org
-Microbes: http://img.jgi.doe.gov/cgi-bin/pub/main.cgi
Protein (amino acid) databases
Known proteins:
-Swiss-Prot (very high level of annotation)
http://au.expasy.org/
Translated databases:
-TrEMBL (translated EMBL): includes entries that have
not yet been annotated and integrated into Swiss-Prot.
http://www.ebi.ac.uk/trembl/access.html
All similarity searching methods rely on the concepts of alignment
and distance between sequences.
A similarity score is calculated from a distance: the number of DNA
bases or amino acids that differ between two sequences.
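The distance described above can be sketched in a few lines of Python. This is a minimal illustration, assuming the two sequences are already aligned and of equal length (no gaps); the function names hamming_distance and percent_identity are chosen here for illustration and do not come from any database tool.

```python
def hamming_distance(seq_a, seq_b):
    """Count the positions at which two equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(seq_a, seq_b))


def percent_identity(seq_a, seq_b):
    """Turn the distance into a simple similarity score (0 to 100)."""
    matches = len(seq_a) - hamming_distance(seq_a, seq_b)
    return 100.0 * matches / len(seq_a)


# Two made-up DNA fragments that differ at two of seven positions.
print(hamming_distance("GATTACA", "GACTATA"))
print(round(percent_identity("GATTACA", "GACTATA"), 1))
```

Real search programs use richer scores (substitution matrices, gap penalties), but the idea is the same: fewer differences means a higher similarity score.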
Database search methods: Sequence Alignment
Two broad classes of sequence alignments exist:
Global alignment: not sensitive
QKESGPSSSYC
VQQESGLVRTTC
Local alignment: faster
ESG
ESG
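The difference between the two classes can be sketched with a small dynamic-programming scorer. This is an illustration only: the scoring values (match +2, mismatch -1, gap -1) are assumptions picked for the example, the function name alignment_score is hypothetical, and the code returns only the score, not the alignment itself. With local=False it fills the table in the Needleman-Wunsch (global) style; with local=True it follows the Smith-Waterman (local) style, where scores never drop below zero and the best cell anywhere in the table is reported.

```python
def alignment_score(a, b, local, match=2, mismatch=-1, gap=-1):
    """Score the best global or local alignment of sequences a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    if not local:
        # Global: leading gaps are penalised along the first row and column.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap

    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            cell = max(
                score[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,       # gap in b
                score[i][j - 1] + gap,       # gap in a
            )
            if local:
                cell = max(cell, 0)          # local: restart instead of going negative
                best = max(best, cell)
            score[i][j] = cell

    # Global score is the bottom-right cell; local score is the best cell anywhere.
    return best if local else score[-1][-1]


# The shared ESG motif scored against the first example sequence.
print(alignment_score("ESG", "QKESGPSSSYC", local=True))   # 6: three matches
print(alignment_score("ESG", "QKESGPSSSYC", local=False))  # -2: three matches, eight gaps

# The two example sequences from the slide.
print(alignment_score("QKESGPSSSYC", "VQQESGLVRTTC", local=True))
print(alignment_score("QKESGPSSSYC", "VQQESGLVRTTC", local=False))
```

The local score rewards the conserved region and ignores the unrelated flanks, while the global score is pulled down by everything that has to be aligned around it.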
http://www.arabidopsis.org/cgi-bin/fasta/nph-TAIRfasta.pl
Which algorithm to use for database similarity search?
Speed:
BLAST > FASTA > Smith-Waterman (Smith-Waterman is VERY SLOW and uses a
LOT OF COMPUTER POWER)
Sensitivity/statistics:
FASTA is more sensitive than BLAST and misses fewer homologues.
Smith-Waterman is even more sensitive.
Search by similarity
Very different DNA sequences may code for similar protein sequences
We certainly do not want to miss those cases!
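A small sketch of why this happens: the genetic code is redundant, so different codons can encode the same amino acid. The example below builds the standard genetic code table and translates two made-up DNA fragments (invented for this illustration, not taken from any database). The helper name translate is hypothetical, and it assumes the sequence starts in frame and contains only the letters A, C, G and T.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order; '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate an in-frame DNA sequence, ignoring any trailing partial codon."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


dna_1 = "TTATCTCGT"  # Leu-Ser-Arg, using one set of codons
dna_2 = "CTGAGCAGA"  # Leu-Ser-Arg, using different codons

differing = sum(x != y for x, y in zip(dna_1, dna_2))
print(differing, "of", len(dna_1), "bases differ")  # 7 of 9
print(translate(dna_1), translate(dna_2))           # LSR LSR
```

At the DNA level these two fragments look almost unrelated, yet the protein sequences are identical.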
Reasons for translating
Comparing DNA sequences gives more random matches:
[Figure: two example alignments, captioned "A good alignment with end-gaps" and "A very poor alignment"]
Conclusion:
It is almost always better to compare coding sequences in their amino acid form,
especially if they are very divergent.
For very highly similar sequences, comparing at the nucleotide level may give better results.
BLAST and FASTA variants
PSI-BLAST: Performs iterative database searches. The results from each round
are incorporated into a 'position specific' score matrix, which is
used for further searching
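The idea of a position-specific score matrix can be sketched as follows. This is a deliberately simplified illustration and not the PSI-BLAST procedure itself: it assumes the hits are already aligned without gaps, uses a uniform background frequency and a flat pseudocount, and the names build_pssm and score_with_pssm are hypothetical. The example hits are invented for the illustration.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned, pseudocount=1.0):
    """Build one log-odds score table per alignment column."""
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    pssm = []
    for column in zip(*aligned):
        counts = Counter(column)
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def score_with_pssm(pssm, seq):
    """Sum the position-specific scores of seq, one residue per column."""
    return sum(column[aa] for column, aa in zip(pssm, seq))


# Hypothetical hits from a first search round, already aligned.
hits = ["ESG", "ESG", "ESA", "DSG"]
pssm = build_pssm(hits)

print(round(score_with_pssm(pssm, "ESG"), 2))  # conserved residues score high
print(round(score_with_pssm(pssm, "WWW"), 2))  # unseen residues score low
```

In an iterative search, the matrix from one round would be used to score database sequences in the next round, and newly found hits would be added before the matrix is rebuilt.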
A practical example of sequence alignment
http://www.ncbi.nlm.nih.gov
BLAST results
Detailed BLAST results