Sequence Alignment: Lecture 2, Thursday April 3, 2003
[Figure: transcription → pre-mRNA → splicing → mature mRNA → translation → protein]
Human genome: 3×10^9 bp
~30,000 genes
~200,000 exons
~23 Mb coding
~15 Mb noncoding
Modifications to Needleman-Wunsch:
Iteration: F(i, j) = max { 0,
                           F(i – 1, j) – d,
                           F(i, j – 1) – d,
                           F(i – 1, j – 1) + s(xi, yj) }
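The recurrence with the extra 0 term can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's own code: the function name `local_alignment_score` and the match/mismatch values standing in for s(xi, yj) are assumptions, and only the best score is returned (no traceback).

```python
def local_alignment_score(x, y, match=1, mismatch=-1, d=1):
    """Best local alignment score using F(i, j) = max{0, ...} (scores are assumed values)."""
    M, N = len(x), len(y)
    # Row 0 and column 0 stay 0: a local alignment may start anywhere.
    F = [[0] * (N + 1) for _ in range(M + 1)]
    best = 0
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(0,
                          F[i - 1][j] - d,
                          F[i][j - 1] - d,
                          F[i - 1][j - 1] + s)
            best = max(best, F[i][j])  # the best cell may be anywhere in the matrix
    return best


print(local_alignment_score("TTACGTT", "GGACGGG"))
```

The 0 term lets an alignment restart whenever the running score would go negative, which is why the answer is read from the best cell rather than from F(M, N).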
Iteration:
F(i, j) = max { F(i – 1, j – 1) + s(xi, yj),
                G(i – 1, j – 1) + s(xi, yj) }
G(i, j) = max { F(i – 1, j) – d,
                F(i, j – 1) – d,
                G(i, j – 1) – e,
                G(i – 1, j) – e }
Termination: same
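A minimal Python sketch of the two-matrix recurrence, where F holds alignments ending in an aligned pair and G holds alignments ending in a gap (d to open, e to extend). The slide's initialization did not survive extraction, so the boundary handling here is an assumption: F(0, 0) = 0, everything else starts at minus infinity, and border cells are filled by the same G recurrence. The function name and score values are also hypothetical.

```python
NEG = float("-inf")


def two_state_gap_score(x, y, match=1, mismatch=-1, d=3, e=1):
    """Global score with the F/G recurrences (initialization is an assumption)."""
    M, N = len(x), len(y)
    F = [[NEG] * (N + 1) for _ in range(M + 1)]
    G = [[NEG] * (N + 1) for _ in range(M + 1)]
    F[0][0] = 0
    for i in range(M + 1):
        for j in range(N + 1):
            if i == 0 and j == 0:
                continue
            if i > 0 and j > 0:
                s = match if x[i - 1] == y[j - 1] else mismatch
                F[i][j] = max(F[i - 1][j - 1], G[i - 1][j - 1]) + s
            # Neighbours outside the matrix count as minus infinity.
            up_F = F[i - 1][j] if i > 0 else NEG
            up_G = G[i - 1][j] if i > 0 else NEG
            left_F = F[i][j - 1] if j > 0 else NEG
            left_G = G[i][j - 1] if j > 0 else NEG
            G[i][j] = max(up_F - d, left_F - d, left_G - e, up_G - e)
    return max(F[M][N], G[M][N])


print(two_state_gap_score("ACGT", "AT"))
```

With d = 3 and e = 1, aligning ACGT to AT as A--T costs one gap opening and one extension, which is cheaper than two separate gaps.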
Iteration:
For i = 1…M
For j = max(1, i – k)…min(N, i+k)
Termination: same
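The banded loop can be sketched as follows, assuming global alignment with linear gap penalty d. Cells outside the band |i – j| ≤ k are treated as minus infinity and never stored; the dictionary storage, the function name, and the scores are illustrative choices, not taken from the slides.

```python
NEG = float("-inf")


def banded_global_score(x, y, k, match=1, mismatch=-1, d=1):
    """Global alignment score restricted to the band |i - j| <= k."""
    M, N = len(x), len(y)
    if abs(M - N) > k:
        raise ValueError("band too narrow to reach cell (M, N)")
    F = {(0, 0): 0}                       # only cells inside the band are stored

    def get(i, j):
        return F.get((i, j), NEG)         # anything outside the band is -infinity

    for j in range(1, min(N, k) + 1):
        F[(0, j)] = -j * d
    for i in range(1, M + 1):
        if i <= k:
            F[(i, 0)] = -i * d
        for j in range(max(1, i - k), min(N, i + k) + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[(i, j)] = max(get(i - 1, j) - d,
                            get(i, j - 1) - d,
                            get(i - 1, j - 1) + s)
    return F[(M, N)]


print(banded_global_score("ACGT", "AGT", k=1))
```

Each row touches at most 2k + 1 cells, so the work is proportional to k times the sequence length rather than to M × N.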
[Figure: DP matrix with labels k*, N – k*, M/2, M/2]
Lecture 3, Tuesday April 8, 2003
Today
• Time Warping
Definition:
A t-block is a t × t square of the DP matrix
Idea:
Divide matrix in t-blocks,
Precompute t-blocks
Speedup: O(t)
The Four-Russian Algorithm
• For i = 1……K
• For j = 1……K
• Compute Di,j as a function of
Another observation:
( Assume m = 0, s = 1, d = 1 )
Lemma: two adjacent cells of F(·, ·) differ by at most 1
Proof of Lemma:
1. Same row:
a. F(i, j) – F(i – 1, j) ≤ +1
b. F(i, j) – F(i – 1, j) ≥ -1
Case xi aligned to ya (replace xi with a gap): ≥ -1
x1……xi-1 xi                   x1……xi-1 –
y1……ya-1 ya ya+1…yj     →     y1……ya-1 ya ya+1…yj
Case xi aligned to a gap (drop that column): +1
x1……xi-1 xi                   x1……xi-1
y1……ya-1 –  ya…yj       →     y1……ya-1 ya…yj
Proof of Lemma:
3. Same diagonal:
a. F(i, j) – F(i – 1, j – 1) ≤ +1
b. F(i, j) – F(i – 1, j – 1) ≥ -1
Case xi aligned to yj (drop that column): ≥ -1
x1……xi-1 xi                   x1……xi-1
         |
y1……yj-1 yj             →     y1……yj-1
Case xi aligned to a gap (drop that column): +1
x1……xi-1 xi                   x1……xi-1
y1……ya-1 –  ya…yj       →     y1……ya-1 ya…yj
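The bounds in the proof (differences between adjacent cells lie between -1 and +1) can be checked numerically. The sketch below assumes the scoring m = 0, s = 1, d = 1 is read as costs to minimize, i.e. plain edit distance; the function names are hypothetical.

```python
def edit_distance_matrix(x, y):
    """DP matrix for match 0, mismatch 1, gap 1, read as costs (an assumption)."""
    M, N = len(x), len(y)
    F = [[0] * (N + 1) for _ in range(M + 1)]
    for i in range(M + 1):
        F[i][0] = i
    for j in range(N + 1):
        F[0][j] = j
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            F[i][j] = min(F[i - 1][j] + 1,
                          F[i][j - 1] + 1,
                          F[i - 1][j - 1] + (x[i - 1] != y[j - 1]))
    return F


def adjacent_cells_differ_by_at_most_one(F):
    """Check vertical, horizontal and diagonal neighbours of every cell."""
    for i in range(1, len(F)):
        for j in range(1, len(F[0])):
            for neighbour in (F[i - 1][j], F[i][j - 1], F[i - 1][j - 1]):
                if abs(F[i][j] - neighbour) > 1:
                    return False
    return True


F = edit_distance_matrix("AACT", "CACT")
print(adjacent_cells_differ_by_at_most_one(F), F[4][4])
```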
Definition:
The offset vector is a
t-long vector of values
from {-2, -1, 0, 1, 2},
where the first entry is 0
The Four-Russian Algorithm
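Example:
x = AACT
y = CACT
(row above and row below the block: offset vectors of its top and bottom rows; first and last columns: offset vectors of its left and right columns)

         A    A    C    T
         0    1   -1   -1
C   0    5    6    5    4    0
A   0    5    5    6    5    1
C  -1    4    5    5    6    1
T   1    5    5    6    5   -1
         0    0    1   -1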
Example:
x = AACT
y = CACT
(same layout; the cell values differ from the previous example, the offset vectors do not)

         A    A    C    T
         0    1   -1   -1
C   0    1    2    1    0    0
A   0    1    1    2    1    1
C  -1    0    1    1    2    1
T   1    1    1    2    1   -1
         0    0    1   -1
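A short Python sketch of how the offset vectors in the two examples are obtained: the first entry is 0 and each later entry is the difference from the previous cell. The helper names are hypothetical; the block values are copied from the examples.

```python
def offset_vector(values):
    """First entry 0, then differences between consecutive cells."""
    return [0] + [b - a for a, b in zip(values, values[1:])]


def border_offsets(block):
    """Offset vectors of the top row, left column, bottom row and right column."""
    top = offset_vector(block[0])
    left = offset_vector([row[0] for row in block])
    bottom = offset_vector(block[-1])
    right = offset_vector([row[-1] for row in block])
    return top, left, bottom, right


block_a = [[5, 6, 5, 4], [5, 5, 6, 5], [4, 5, 5, 6], [5, 5, 6, 5]]
block_b = [[1, 2, 1, 0], [1, 1, 2, 1], [0, 1, 1, 2], [1, 1, 2, 1]]

print(border_offsets(block_a))
print(border_offsets(block_a) == border_offsets(block_b))
```

Both blocks give the same four offset vectors even though their absolute values differ by 4, which is what makes a table keyed on offsets reusable.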
Definition:
The offset function of a t-block
is a function that for any
The Four-Russian Algorithm
We can keep all these values in a table, and look up in linear time,
or in O(1) time if we assume
constant-lookup RAM for log-sized inputs
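A rough Python sketch of the table idea, under several assumptions that are not in the slides: edit-distance costs (match 0, mismatch 1, gap 1), blocks that share their border rows and columns, sequence lengths that are multiples of t, and a table filled lazily with `functools.lru_cache` instead of being precomputed for all inputs. The names `block_offsets` and `edit_distance_blocks` are hypothetical.

```python
from functools import lru_cache
from itertools import accumulate


def offsets(values):
    """Offset vector as a hashable tuple: 0, then consecutive differences."""
    return tuple([0] + [b - a for a, b in zip(values, values[1:])])


@lru_cache(maxsize=None)          # the cache plays the role of the lookup table
def block_offsets(top_off, left_off, xs, ys):
    """Return (bottom offsets, right offsets) of one block, corner value taken as 0."""
    t = len(xs)
    top = list(accumulate(top_off))       # top row relative to the corner
    left = list(accumulate(left_off))     # left column relative to the corner
    B = [top] + [[left[i]] + [0] * t for i in range(1, t + 1)]
    for i in range(1, t + 1):
        for j in range(1, t + 1):
            B[i][j] = min(B[i - 1][j] + 1, B[i][j - 1] + 1,
                          B[i - 1][j - 1] + (xs[i - 1] != ys[j - 1]))
    return offsets(B[t]), offsets([row[t] for row in B])


def edit_distance_blocks(x, y, t):
    """Edit distance computed block by block from border offsets only."""
    M, N = len(x), len(y)
    assert M % t == 0 and N % t == 0, "sketch assumes lengths divisible by t"
    top_row = list(range(N + 1))                   # F(0, j) = j
    for bi in range(0, M, t):
        left_col = list(range(bi, bi + t + 1))     # F(i, 0) = i
        new_row = [left_col[-1]]
        for bj in range(0, N, t):
            top = top_row[bj:bj + t + 1]
            bottom_off, right_off = block_offsets(
                offsets(top), offsets(left_col), x[bi:bi + t], y[bj:bj + t])
            # Turn offsets back into values using the known corner cells.
            bottom = [left_col[-1] + v for v in accumulate(bottom_off)]
            left_col = [top[-1] + v for v in accumulate(right_off)]
            new_row.extend(bottom[1:])
        top_row = new_row
    return top_row[N]


print(edit_distance_blocks("AACTAACT", "CACTCACT", 4))
```

Because `block_offsets` never sees absolute cell values, any two blocks with the same border offsets and the same substrings share one table entry.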
Heuristic Local Aligners
Sequenced Genomes:
Vertebrate: ~30,000
Insects: ~14,000
Worm: ~17,000
Fungi: ~6,000-10,000
[Figure: "Our new gene" searched against databases of size 10^10 - 10^11]
Main idea:
Dictionary:
All words of length k (~11)
Alignment:
Ungapped extensions until score falls below a threshold
Output:
All local alignments with score > statistical threshold
[Figure: the query is scanned against the DB]
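The dictionary / ungapped extension / output steps can be sketched in Python as below. This is an illustration under assumptions, not BLAST's actual implementation: only exact word matches seed an alignment, scoring is +1/-1, extension stops once the score drops more than a hypothetical `x_drop` below the best seen, and `threshold` is a plain score cutoff standing in for the statistical threshold.

```python
from collections import defaultdict


def build_dictionary(db, k):
    """Map every word of length k in db to its start positions."""
    index = defaultdict(list)
    for p in range(len(db) - k + 1):
        index[db[p:p + k]].append(p)
    return index


def _extend(query, db, qi, pi, step, match, mismatch, x_drop):
    """Walk in one direction; return (best gain, number of positions kept)."""
    score = best = 0
    steps = kept = 0
    while 0 <= qi < len(query) and 0 <= pi < len(db):
        score += match if query[qi] == db[pi] else mismatch
        steps += 1
        if score > best:
            best, kept = score, steps
        elif best - score > x_drop:
            break
        qi += step
        pi += step
    return best, kept


def ungapped_hits(query, db, k, threshold, match=1, mismatch=-1, x_drop=2):
    """All distinct ungapped local alignments seeded by a shared k-word."""
    index = build_dictionary(db, k)
    found = set()
    for q in range(len(query) - k + 1):
        for p in index.get(query[q:q + k], ()):
            right_gain, right_len = _extend(query, db, q + k, p + k, 1,
                                            match, mismatch, x_drop)
            left_gain, left_len = _extend(query, db, q - 1, p - 1, -1,
                                          match, mismatch, x_drop)
            score = k * match + left_gain + right_gain
            if score > threshold:
                qs, ps = q - left_len, p - left_len
                length = left_len + k + right_len
                found.add((score, query[qs:qs + length], db[ps:ps + length]))
    return sorted(found, reverse=True)


# Made-up flanking bases around a 9-base core with one mismatch.
print(ungapped_hits("TTGTAAGGTCCAA", "CCGTTAGGTCCGG", k=4, threshold=4))
```

Three overlapping words (AGGT, GGTC, GTCC) seed the same diagonal here, and the set collapses them into a single reported alignment.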
BLAST Original Version
Example:
k = 4, T = 4
Sequences (figure axes):
A C G A A G T A A G G T C C A G T
C C C T T C C T G G A T T G C G A
Output:
GTAAGGTCC
GTTAGGTCC
Gapped BLAST
Added features:
• Pairs of words can initiate alignment
• Nearby alignments are merged
Sequences (figure axes):
A C G A A G T A A G G T C C A G T
C T G A T C C T G G A T T G C G A
Output:
GTAAGGTCCAGT
GTTAGGTC-AGT
MEGABLAST:
Optimized to align very similar sequences
Works best when k = 4i ≥ 16
Linear gap penalty
PSI-BLAST:
BLAST produces many hits
Those are aligned, and a pattern is extracted
Pattern is used for next search; above steps iterated
WU-BLAST: (Wash U BLAST)
Optimized, added features
BlastZ
Combines BLAST/PatternHunter methodology
Query: 4      tacaccccgattacaccccga 24
              ||||||| |||||||||||||
Sbjct: 125138 tacacccagattacaccccga 125158

Query: 4      tacaccccgattacaccccga 24
              ||||||| |||||||||||||
Sbjct: 125104 tacacccagattacaccccga 125124

Query: 4    tacaccccgattacaccccga 24
            ||||||| |||||||||||||
Sbjct: 3891 tacacccagattacaccccga 3911

Query: 3    tgacaatagagggtctggcagaggctcctggccgcggtgcggagcgtctggagcggagca 62
            ||||||||||||| ||||||||||||||||||| ||||||||||||||||||||||||||
Sbjct: 1144 tgacaatagaggggctggcagaggctcctggccccggtgcggagcgtctggagcggagca 1203
Main features:
• Non-consecutive position words
• Highly optimized
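A small Python sketch of non-consecutive position words: a mask of 1s and 0s selects which positions of each window must match, so a mismatch under a 0 does not destroy the hit. The masks and sequences below are toy values chosen for illustration, and the function names are hypothetical; no particular seed pattern is taken from the slides.

```python
def spaced_words(seq, mask):
    """Yield (position, word) keeping only the window positions marked '1' in mask."""
    keep = [i for i, c in enumerate(mask) if c == "1"]
    for p in range(len(seq) - len(mask) + 1):
        yield p, "".join(seq[p + i] for i in keep)


def seed_hits(a, b, mask):
    """Pairs (position in a, position in b) whose spaced words are identical."""
    index = {}
    for p, word in spaced_words(b, mask):
        index.setdefault(word, []).append(p)
    return [(q, p) for q, word in spaced_words(a, mask) for p in index.get(word, [])]


a, b = "AAGTC", "AACTC"            # differ only at the third base
print(seed_hits(a, b, "111"))      # consecutive word of weight 3
print(seed_hits(a, b, "1101"))     # non-consecutive word, also weight 3
```

Both masks require three matching positions, but only the non-consecutive one can step over the mismatch.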
[Figure: hit counts in example regions: 6 hits, 7 hits, 5 hits, 7 hits, 3 hits, 3 hits]
On a 70% conserved region:
                          Consecutive   Non-consecutive
Expected # hits:          1.07          0.97
Prob[at least one hit]:   0.30          0.47
[Figure labels: 11 positions, 11 positions, 10 positions]