Lecture - 02 - Comparative Sequence Analysis
Sequence Analysis
Heredity Unit
Latest on Genome Sequencing
• Human Genome Project (1990 – 2003)
Now!
Our Genome and Need for Comparative Genomics
• Number of bases: about 3.2 billion base pairs
• Similarity between individuals: almost all (99.9%) nucleotide bases are exactly the same in all people
Proteome to Protein
Genes: about 20,000 protein-coding genes (early estimates were about 30,000)
Need for Comparative Proteomics
• Number of reported proteins: 150 million and counting
Benefits of Comparative Genomics
• Comparison of whole genome sequences provides a highly detailed
view of how organisms are related to each other at the genetic level
Fly vs. Humans
Comparison of the fruit fly genome with the human genome:
Evolutionary Relationship
SARS-CoV-2
http://bacterialphylogeny.info/overview.html
What have we done and what’s next?
DONE: Gene and Protein Sequences
• GenBank (DNA Sequences)
• UniProt (Protein Sequences)
• GeneMark (Gene Prediction)
From Sequences to Comparisons
• Problem: If we sequence a new gene or protein, can we compare it with the existing information in GenBank or UniProt?
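One basic way to compare two sequences is local alignment by dynamic programming, the same technique the BLAST workflow applies to its promising candidates. The sketch below is a minimal Smith-Waterman score calculation. The function name, the match/mismatch/gap values and the linear gap penalty are illustrative assumptions, not the scoring used by any particular tool.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between strings a and b.

    Scoring values are illustrative assumptions (simple match/mismatch,
    linear gap penalty), not the parameters of a real BLAST run.
    """
    prev = [0] * (len(b) + 1)  # previous row of the DP table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            # Score for aligning a[i-1] with b[j-1]
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, so never drop below 0
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    print(local_alignment_score("TTACGTTT", "ACGT"))
```

The table has one cell per pair of positions, so the cost grows with the product of the two sequence lengths. That is why running it against every sequence in a large database is impractical.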
BLAST - Workflow
1. BLAST runs the expensive “Dynamic Programming” alignment only on “promising” database sequences, not on the whole database.
2. Promising sequences are found by searching for short, exactly matching sub-strings, which requires an index. A suffix tree is one such index: it finds the longest common sub-string of two strings in time linear in their combined length. BLAST itself uses the word lookup table described in steps 3 and 4.
3. BLAST creates a list of all “words” (short subsequences) that reach a certain “threshold” score when compared with the query sequence. Words are runs of consecutive residues: by default 11 nucleotides for blastn and 28 for megablast (where sizes from 16 to 256 can be selected), or typically 3 amino acids for protein searches.
4. A lookup hash table is made of all such words and “neighboring” words derived from the query sequence (rather than just random words).
5. When a BLAST search is run, candidate sequences from the database are picked based on matches to these short words from the query sequence.
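The seeding idea in the workflow above can be sketched as follows: index every word of the query in a hash table, then scan a database sequence and report where its words hit the table. This is a simplified illustration that uses exact words only (no neighboring words). The function names and the example sequences are assumptions made for this sketch.

```python
from collections import defaultdict


def build_query_index(query, word_size=3):
    """Map each word of length word_size in the query to its start positions."""
    index = defaultdict(list)
    for pos in range(len(query) - word_size + 1):
        index[query[pos:pos + word_size]].append(pos)
    return index


def find_seed_hits(query, subject, word_size=3):
    """Return (query_pos, subject_pos, word) for every exact word match."""
    index = build_query_index(query, word_size)
    hits = []
    for s_pos in range(len(subject) - word_size + 1):
        word = subject[s_pos:s_pos + word_size]
        # .get avoids adding empty entries to the defaultdict
        for q_pos in index.get(word, ()):
            hits.append((q_pos, s_pos, word))
    return hits


if __name__ == "__main__":
    # Hypothetical subject sequence, chosen only to produce two hits
    for hit in find_seed_hits("PQGELV", "AAPQGAAELV"):
        print(hit)
```

Only sequences that produce such hits would go on to the dynamic programming step, which is what keeps the search fast.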
BLOSUM62 Match/Mismatch Matrix
• Here the word is PQG, and its neighboring words are all words with a score of 13 or higher (for three letters) as calculated by the given scoring system (e.g., BLOSUM62). T is the user-provided threshold; here T = 13.
• PSG is a neighboring word (score 13); PQA is not (score 12).
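The neighboring-word test can be checked by hand or with a few lines of code. The sketch below sums BLOSUM62 scores position by position and compares the total with T = 13. Only the handful of BLOSUM62 entries needed for these examples are included, and the function names are assumptions for illustration. A full implementation would load the complete matrix.

```python
# Subset of BLOSUM62, limited to the residue pairs used in this example.
BLOSUM62_SUBSET = {
    ("P", "P"): 7,
    ("Q", "Q"): 5,
    ("G", "G"): 6,
    ("Q", "S"): 0,
    ("Q", "E"): 2,
    ("G", "A"): 0,
}


def pair_score(a, b):
    """Look up a symmetric substitution score; raises KeyError if not in the subset."""
    if (a, b) in BLOSUM62_SUBSET:
        return BLOSUM62_SUBSET[(a, b)]
    return BLOSUM62_SUBSET[(b, a)]


def word_score(word, candidate):
    """Sum the substitution scores of two equal-length words."""
    return sum(pair_score(a, b) for a, b in zip(word, candidate))


def is_neighbor(word, candidate, threshold=13):
    """A candidate is a neighboring word if it scores at least the threshold T."""
    return word_score(word, candidate) >= threshold


if __name__ == "__main__":
    for candidate in ("PQG", "PSG", "PEG", "PQA"):
        print(candidate, word_score("PQG", candidate), is_neighbor("PQG", candidate))
```

PQG scores 7 + 5 + 6 = 18 against itself, PSG scores 7 + 0 + 6 = 13, and PQA scores 7 + 5 + 0 = 12, which is why PSG passes the threshold and PQA does not.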
Example BLAST search method
Query sequence: PQGELV
• Find all database sequences that have at least 2 matches among our 3 words
• PQG, GEL and PEG (PEG is a neighboring word of PQG)
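The filtering step in this example can be sketched as below: keep only database sequences that contain at least 2 of the 3 words. The function name and the tiny database are hypothetical, made up for illustration. The sketch mirrors the slide's simplified statement and does not check where in the sequence the matches fall.

```python
def sequences_with_enough_word_matches(database, words, min_matches=2):
    """Return {name: matched_words} for sequences containing >= min_matches words."""
    selected = {}
    for name, sequence in database.items():
        matched = [word for word in words if word in sequence]
        if len(matched) >= min_matches:
            selected[name] = matched
    return selected


if __name__ == "__main__":
    # Hypothetical database sequences, invented for this example
    database = {
        "seq1": "MKPQGAAGELT",  # contains PQG and GEL
        "seq2": "AAPEGAA",      # contains only PEG
        "seq3": "GELPEG",       # contains GEL and PEG
    }
    words = ["PQG", "GEL", "PEG"]
    print(sequences_with_enough_word_matches(database, words))
```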
BLAST for Nucleotides and Proteins
• Nucleotides
• blastn
• Compares a nucleotide query sequence against a nucleotide sequence
database.
• Proteins
• blastp
• Compares an amino acid query sequence against a protein sequence
database.
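Comparing an unknown nucleotide sequence with possible “protein” sequences!!
• blastx
> But what about the 6 possible reading frames? blastx translates the nucleotide query in all six and compares each translation against a protein sequence database.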
How about the reverse of blastx?
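• tblastn
• Compares a protein query sequence against a nucleotide sequence database translated in all six reading frames.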
Comparing all six-frame translations of a nucleotide sequence with all six-frame translations of a nucleotide DB
• tblastx
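The translated searches (blastx, tblastn, tblastx) all depend on translating a nucleotide sequence in six reading frames: three offsets on the forward strand and three on the reverse complement. The sketch below shows that step using the standard genetic code. The function names and the frame labels (+1 to -3) are choices made for this illustration, and the sketch only handles the letters A, C, G and T.

```python
from itertools import product

# Standard genetic code, listed in TCAG order for each codon position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def reverse_complement(dna):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons; unknown codons become 'X', stops become '*'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna):
    """Return the translations of all six reading frames, keyed '+1'..'-3'."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    for frame, protein in six_frame_translations("ATGGCCTAA").items():
        print(frame, protein)
```

For the query ATGGCCTAA, frame +1 reads ATG GCC TAA and gives M, A and a stop codon. A translated search compares each of the six resulting protein strings rather than only the one the gene actually uses.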
Getting started with BLAST
Getting started:
http://www.ncbi.nlm.nih.gov/
http://www.ncbi.nlm.nih.gov/BLAST/
and
http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/information3.html
So what if we find the Alien Gene in GenBank?
• Homologs
• Features (including DNA and protein sequences) in species being compared that are similar
because they are ancestrally related
• Orthologs
• Homologous genes (or any DNA sequences) that separated because of a speciation event
• Derived from the same gene in the last common ancestor
• Paralogs
• Homologous genes that separated because of gene duplication events within the same species