Lecture - 02 - Comparative Sequence Analysis

Comparative Sequence Analysis

Department of Life Sciences, SBASSE, LUMS


Genome to Gene

Heredity Unit

Latest on Genome Sequencing
• Human Genome Project (1990 – 2003)

Now!

Our Genome and Need for Comparative Genomics
• Number of bases: 3.2 billion bases

• Number of chromosomes: 23 pairs

• Percentage of genes: Only about 1% of the genome codes for proteins

• Protein-coding Gene Number: 20,000 - 25,000

• Average gene size: ~3,000 bases, with huge variation


• Largest known human gene consists of 2.4 million bases (dystrophin)

• Repetition: Almost 45-50% of the DNA is repetitive

• Similarity between individuals: Almost all (99.9%) nucleotide bases are exactly the same in all people
Proteome to Protein
Genes: 30,000

Alternative splicing: 2 - 3 variants per gene

3 x 30,000 = 90,000 proteins

Post-translational modifications (about 10 forms per protein)

10 x 90,000 = 900,000 proteins

Peng and Gygi, JMS 2001


Asa Wheelock

Need for Comparative Proteomics
• Number of reported proteins: 150 million and counting

Benefits of Comparative Genomics
• Comparison of whole genome sequences provides a highly detailed view of how organisms are related to each other at the genetic level

• Comparative genomics also provides a powerful tool for studying evolutionary changes among organisms

• Helps to identify genes that are conserved or common among species, as well as genes that give each organism its unique characteristics

Fly vs. Humans
Comparison of the fruit fly genome with the human genome:

• About 75% of genes are conserved

• The two organisms appear to share a core set of genes

• Two-thirds of human genes known to be involved in cancer have counterparts in the fruit fly

Evolutionary Relationship

COV2

http://bacterialphylogeny.info/overview.html

What have we done and what’s next?
DONE: Gene and Protein Sequences
• GenBank (DNA Sequences)
• UniProt (Protein Sequences)
• GeneMark (Gene Prediction)

NEXT: Sequence & Structure Analysis
• BLAST (nucleotide, protein)
• PDB
• I-TASSER

From Sequences to Comparisons
• Problem: If we sequence a new gene or protein, can we compare it with the existing information in GenBank or UniProt?

• Idea: Compare NOVEL sequences with KNOWN (previously characterized) genes or proteins.

• Benefit: STRUCTURAL, FUNCTIONAL and EVOLUTIONARY information can be inferred from WELL DESIGNED comparisons.

• The most common tool for this is BLAST.


BLAST?
• Basic Local Alignment Search Tool

• A method for rapid searching of sequence databases, for both nucleotides and proteins.

• BLAST is a local alignment method: it detects regions of similarity, including those embedded in otherwise unrelated proteins. When two sequences are similar along their whole length, the local alignment can span them end to end.

• Uses statistical theory to estimate whether a match might have occurred by chance.
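
The "occurred by chance" statistic is commonly reported as an expected value (E-value). A minimal sketch, assuming the Karlin-Altschul form E = K * m * n * exp(-lambda * S); the function names are hypothetical, and the default `lam` and `k` values are illustrative assumptions, not parameters taken from these slides:

```python
import math

def expected_hits(raw_score, query_len, db_len, lam=0.318, k=0.134):
    """E-value: number of alignments scoring >= raw_score expected by chance.

    lam and k are illustrative (assumed) values; a real search uses
    parameters fitted to the chosen scoring matrix and gap penalties.
    """
    return k * query_len * db_len * math.exp(-lam * raw_score)

def bit_score(raw_score, lam=0.318, k=0.134):
    """Normalised score S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)

if __name__ == "__main__":
    # Higher raw scores give rapidly smaller E-values for the same search space.
    for s in (20, 40, 80):
        e = expected_hits(s, query_len=250, db_len=50_000_000)
        print(f"S={s:3d}  bits={bit_score(s):6.1f}  E={e:.3g}")
```

A small E-value means a match of that score is unlikely to arise by chance in a database of that size; the same raw score becomes less significant as the database grows.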
https://blast.ncbi.nlm.nih.gov/Blast.cgi

BLAST - Workflow
1. BLAST searches the database sequences using “Dynamic Programming” only on “promising” sequences.

2. Finding promising sequences requires an index in which perfectly matching sub-strings can be looked up very quickly. A suffix tree is one such structure: it finds the longest matching sub-string between two strings efficiently. NCBI BLAST itself uses a lookup table of words for this purpose (step 4).

3. BLAST creates a list of all “words” (short subsequences) that score at least a certain “threshold” when compared with the query sequence. A word is a run of consecutive residues: typically 3 amino acids for proteins, and longer for nucleotides (for example 11 in blastn, 16-256 in megablast).

4. A lookup hash table is made of all words in the query sequence together with their “neighboring” words (rather than just random words).

5. When a BLAST search is run, candidate sequences from the database are picked based on perfect matches to small sub-sequences in the query sequence.
BLOSUM62 Match/Mismatch Matrix

• Here the word is PQG, and neighboring words are all words whose score against it is at least 13 (for three letters), as calculated by the given scoring system (e.g., BLOSUM62). T is the user-provided threshold!
• PSG (score 13) is a neighboring word, PQA (score 12) is not.
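
The neighborhood of PQG can be enumerated directly. A minimal sketch, assuming only the three BLOSUM62 rows needed for this word (transcribed by hand for illustration, so worth checking against the full matrix); `word_score` and `neighborhood` are hypothetical helper names:

```python
from itertools import product

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

# BLOSUM62 rows for P, Q and G only, in AMINO_ACIDS column order.
# Hand-transcribed for this example; verify against the full matrix.
BLOSUM62_ROWS = {
    "P": [-1, -2, -2, -1, -3, -1, -1, -2, -2, -3, -3, -1, -2, -4, 7, -1, -1, -4, -3, -2],
    "Q": [-1, 1, 0, 0, -3, 5, 2, -2, 0, -3, -2, 1, 0, -3, -1, 0, -1, -2, -1, -2],
    "G": [0, -2, 0, -1, -3, -2, -2, 6, -2, -4, -4, -2, -3, -3, -2, 0, -2, -2, -3, -3],
}

def word_score(query_word, candidate):
    """Sum of per-position substitution scores between two equal-length words."""
    return sum(
        BLOSUM62_ROWS[q][AMINO_ACIDS.index(c)]
        for q, c in zip(query_word, candidate)
    )

def neighborhood(query_word, threshold):
    """All words of the same length scoring at least `threshold` against the query word."""
    hits = {}
    for letters in product(AMINO_ACIDS, repeat=len(query_word)):
        candidate = "".join(letters)
        score = word_score(query_word, candidate)
        if score >= threshold:
            hits[candidate] = score
    return hits

if __name__ == "__main__":
    ranked = sorted(neighborhood("PQG", 13).items(), key=lambda kv: -kv[1])
    for word, score in ranked:
        print(word, score)
```

Raising the threshold T shrinks the neighborhood (faster, less sensitive); lowering it admits more words such as PQA.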

Example BLAST search method
Query sequence: PQGELV

• Make a list of all possible k-mer words (length 3 for proteins):

PQG (score 18)
QGE (score 16)
GEL (score 15)
ELV (score 13)

• Assign scores from BLOSUM62 and keep the words with score >= 13

• In total we get: PQG, QGE, GEL & ELV
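
The word list for PQGELV can be reproduced in a few lines. A minimal sketch, assuming only the BLOSUM62 diagonal (self-match) scores for the six residues in the query; `kmers`, `self_score` and `seed_words` are hypothetical helper names:

```python
# BLOSUM62 self-match scores for the residues of the example query only.
SELF_SCORE = {"P": 7, "Q": 5, "G": 6, "E": 5, "L": 4, "V": 4}

def kmers(sequence, k=3):
    """All overlapping words of length k, in order of appearance."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

def self_score(word):
    """Score of a word aligned against itself."""
    return sum(SELF_SCORE[residue] for residue in word)

def seed_words(sequence, k=3, threshold=13):
    """Words whose self-score reaches the threshold, with their scores."""
    return [(w, self_score(w)) for w in kmers(sequence, k) if self_score(w) >= threshold]

if __name__ == "__main__":
    for word, score in seed_words("PQGELV"):
        print(f"{word} (score {score})")
```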


Example BLAST search method (continued)
• Make k-mers (word size 3) of all sequences in the database
• Store them in a suffix tree (a fast tree structure for finding identical matches)

• Find all database sequences that have at least 2 matches among our 3 words
• PQG, GEL & PEG

• Find a database hit and extend the alignment (High-scoring Segment Pair, HSP):

Query:    M E T P Q G I A V
Database: - - - P Q G E L V
Score:          8 5 5 2 0 8

• HSP: PQGI (score 8+5+5+2 = 20)

• If 2 HSPs in the query sequence are < 40 positions apart:
• Full alignment on query and hit sequences
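
The seed-and-extend step can be sketched as follows. This is an illustration under stated assumptions, not BLAST's actual implementation: a plain dictionary stands in for the index, the scoring is a placeholder match/mismatch scheme (so the numbers differ from the slide's), and the `x_drop` cutoff and all function names are hypothetical.

```python
from collections import defaultdict

def index_words(sequence, k=3):
    """Map each k-mer to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(sequence) - k + 1):
        index[sequence[i:i + k]].append(i)
    return index

def simple_score(a, b, match=5, mismatch=-4):
    """Placeholder scoring (assumed); a real search would use e.g. BLOSUM62."""
    return match if a == b else mismatch

def extend_hit(query, subject, q_pos, s_pos, k=3, x_drop=10, score=simple_score):
    """Ungapped extension of an exact seed match in both directions.

    Walking stops once the running score falls more than x_drop below the
    best score seen; only the best-scoring stretch is kept.
    Returns (score, query_start, query_end) with an exclusive end.
    """
    seed = sum(score(query[q_pos + i], subject[s_pos + i]) for i in range(k))

    def walk(q, s, step):
        best = running = 0
        best_len = length = 0
        while 0 <= q < len(query) and 0 <= s < len(subject):
            running += score(query[q], subject[s])
            length += 1
            if running > best:
                best, best_len = running, length
            elif best - running > x_drop:
                break
            q += step
            s += step
        return best, best_len

    right_gain, right_len = walk(q_pos + k, s_pos + k, +1)
    left_gain, left_len = walk(q_pos - 1, s_pos - 1, -1)
    return seed + left_gain + right_gain, q_pos - left_len, q_pos + k + right_len

def find_hsps(query, subject, k=3):
    """Look up every query word in the subject index and extend each hit."""
    subject_index = index_words(subject, k)
    hsps = []
    for q_pos in range(len(query) - k + 1):
        for s_pos in subject_index.get(query[q_pos:q_pos + k], []):
            hsps.append(extend_hit(query, subject, q_pos, s_pos, k))
    return hsps

if __name__ == "__main__":
    for score, start, end in find_hsps("METPQGIAV", "PQGELV"):
        print(score, "METPQGIAV"[start:end])
```

With this placeholder scoring the mismatches I/E and A/L outweigh the final V/V match, so the segment stays at the PQG seed; a substitution matrix that rewards similar residues can extend it further, as on the slide.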
Advantages of BLAST
• The BLAST algorithm was written to balance speed against increased sensitivity for finding distant sequence relationships.
• Speed is achieved by:
1. Pre-indexing the database before the search
2. Parallel processing
3. A hash table that contains neighborhood words rather than just random words.

• BLAST emphasizes regions of local alignment to detect relationships among sequences that share only isolated regions of similarity.

BLAST for Nucleotides and Proteins
• Nucleotides
• blastn
• Compares a nucleotide query sequence against a nucleotide sequence
database.

• Proteins
• blastp
• Compares an amino acid query sequence against a protein sequence
database.

Comparing an unknown nucleotide sequence with possible “protein” sequences!!
• blastx
> but what about the 6 possible reading frames?

• Compares a nucleotide query sequence translated in all six reading frames against a protein sequence database.

• This option may be used to find potential translation products of an unknown nucleotide sequence.
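
The six reading frames are three offsets on the given strand plus three on its reverse complement. A minimal sketch of that translation step, assuming the standard genetic code and an uppercase or lowercase DNA string; the function names and the "+1" to "-3" frame labels are illustrative choices:

```python
BASES = "TCAG"
# Standard genetic code, codons ordered by first, second, third base in TCAG order.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna):
    """Complement each base, then reverse the strand."""
    return dna.upper().translate(COMPLEMENT)[::-1]

def translate(dna):
    """Translate complete codons; 'X' marks codons with non-ACGT bases."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )

def six_frame_translations(dna):
    """Frames +1..+3 on the given strand, -1..-3 on the reverse complement."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames

if __name__ == "__main__":
    for frame, protein in six_frame_translations("ATGGCC").items():
        print(frame, protein)
```

Each of the six translated strings would then be compared against the protein database, which is why a blastx search does roughly six times the work of a single protein query.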

How about the reverse of blastx?
• tblastn

• Compares a protein query sequence against a nucleotide sequence database dynamically translated in all reading frames.

Comparing all translated ORFs of a nucleotide sequence with all ORFs of a nucleotide DB
• tblastx

• Compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database.
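
The five programs differ only in the type of query, the type of database, and what gets translated. A small sketch summarising that choice; the `choose_program` helper and its `translate_both` flag are hypothetical and not part of any BLAST interface:

```python
# (query type, database type) -> program, following the descriptions above.
PROGRAMS = {
    ("nucleotide", "nucleotide"): "blastn",
    ("protein", "protein"): "blastp",
    ("nucleotide", "protein"): "blastx",   # query translated in six frames
    ("protein", "nucleotide"): "tblastn",  # database translated in six frames
}

def choose_program(query_type, db_type, translate_both=False):
    """Pick a BLAST program name from the sequence types involved.

    translate_both=True selects tblastx, where both a nucleotide query and a
    nucleotide database are compared at the protein level.
    """
    if translate_both:
        if (query_type, db_type) != ("nucleotide", "nucleotide"):
            raise ValueError("tblastx needs a nucleotide query and a nucleotide database")
        return "tblastx"
    return PROGRAMS[(query_type, db_type)]

if __name__ == "__main__":
    print(choose_program("nucleotide", "protein"))
    print(choose_program("nucleotide", "nucleotide", translate_both=True))
```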

Getting started with BLAST:
http://www.ncbi.nlm.nih.gov/
http://www.ncbi.nlm.nih.gov/BLAST/
and
http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/information3.html

So what if we find the Alien Gene in GenBank?
• Homologs
• Features (including DNA and protein sequences) in the species being compared that are similar because they are ancestrally related

• Homologs can be either orthologs or paralogs

• Orthologs
• Homologous genes (or any DNA sequences) that separated because of a speciation event
• Derived from the same gene in the last common ancestor

• Paralogs
• Homologous genes that separated because of gene duplication events within the same species
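
The distinction comes down to which event separated the two genes. A toy sketch, assuming a hand-built gene tree whose internal nodes are already labeled "speciation" or "duplication"; the tree, the gene names and the helper functions are all hypothetical simplifications:

```python
# Toy gene tree: internal nodes are (event, left, right); leaves are gene names.
# An ancestral gene duplicates, then both copies are inherited by two species.
GENE_TREE = (
    "duplication",
    ("speciation", "human_alpha", "mouse_alpha"),
    ("speciation", "human_beta", "mouse_beta"),
)

def leaves(node):
    """Set of gene names below a node."""
    if isinstance(node, str):
        return {node}
    _, left, right = node
    return leaves(left) | leaves(right)

def relationship(tree, gene_a, gene_b):
    """Classify two homologs by the event at their last common ancestor node."""
    if gene_a == gene_b or not {gene_a, gene_b} <= leaves(tree):
        raise ValueError("need two different genes that are both in the tree")
    event, left, right = tree
    for child in (left, right):
        # Descend while both genes still sit under the same child.
        if not isinstance(child, str) and {gene_a, gene_b} <= leaves(child):
            return relationship(child, gene_a, gene_b)
    return "orthologs" if event == "speciation" else "paralogs"

if __name__ == "__main__":
    print(relationship(GENE_TREE, "human_alpha", "mouse_alpha"))
    print(relationship(GENE_TREE, "human_alpha", "human_beta"))
```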
