Alignment Algorithm
Definition
The Levenshtein distance between two words/sequences is the
minimal number of substitutions, insertions, and deletions needed to
transform one into the other.
Example
ACCCGA and ACTA have (at most) distance 3:
ACCCGA → ACCGA → ACCTA → ACTA
In biology, different operations have different costs. (Why?)
Example
ACCCGA →(C,−) ACCGA →(G,T) ACCTA →(−,T) ATCCTA
Remarks
• Natural ‘evolution-motivated’ problem definition.
Example
a = ACGGAT
b = CCGCTT
possible alignments are, for example,

ACGGAT        ACGGA-T
CCGCTT        -CCGCTT
Recall:
Definition (Edit Distance)
The edit distance of a and b is
dw(a, b) = min{ w̃(S) | a can be transformed into b by the edit-operation sequence S }.
Definition (Alignment Distance)
The alignment distance of a and b is
Dw(a, b) = min{ w(A) | A is an alignment of a and b }.
Remarks
• Proof idea:
  dw(a, b) ≤ Dw(a, b): an alignment yields a sequence of edit operations;
  Dw(a, b) ≤ dw(a, b): a sequence of edit operations yields an equal or
  better alignment (needs the triangle inequality)
• Reduces edit distance to alignment distance
• We will see: the alignment distance can be computed efficiently by
  dynamic programming (using Bellman’s Principle of Optimality)
Remarks
• The principle is valid for the alignment distance problem
Notational remarks
The alignment matrix D for a = AT, b = AAGT:

        A   A   G   T
    0   1   2   3   4
A   1   0   1   2   3
T   2   1   1   2   2
Theorem
For the alignment matrix D of a and b it holds that
• D0,0 = 0
• for all 1 ≤ i ≤ n: Di,0 = Σk=1..i w(ak, −) = Di−1,0 + w(ai, −)
• for all 1 ≤ j ≤ m: D0,j = Σk=1..j w(−, bk) = D0,j−1 + w(−, bj)
• Di,j = min { Di−1,j−1 + w(ai, bj)   (match),
               Di−1,j + w(ai, −)      (deletion),
               Di,j−1 + w(−, bj)      (insertion) }
D0,0 := 0
for i := 1 to n do
    Di,0 := Di−1,0 + w(ai, −)
end for
for j := 1 to m do
    D0,j := D0,j−1 + w(−, bj)
end for
for i := 1 to n do
    for j := 1 to m do
        Di,j := min { Di−1,j−1 + w(ai, bj),
                      Di−1,j + w(ai, −),
                      Di,j−1 + w(−, bj) }
    end for
end for
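A minimal Python sketch of this fill procedure; the function name, the gap symbol '-', and the check at the end are illustrative, not part of the lecture material:

def alignment_matrix(a, b, w):
    """Fill the alignment matrix D for sequences a and b.

    w(x, y) is the cost of aligning symbol x with y; the gap is passed as '-'.
    D[i][j] holds the minimal cost of aligning a[:i] with b[:j].
    """
    n, m = len(a), len(b)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):                      # first column: delete a_1..a_i
        D[i][0] = D[i - 1][0] + w(a[i - 1], '-')
    for j in range(1, m + 1):                      # first row: insert b_1..b_j
        D[0][j] = D[0][j - 1] + w('-', b[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(
                D[i - 1][j - 1] + w(a[i - 1], b[j - 1]),  # match/mismatch
                D[i - 1][j] + w(a[i - 1], '-'),           # deletion
                D[i][j - 1] + w('-', b[j - 1]),           # insertion
            )
    return D

# Reproduces the matrix of the example below (a = AT, b = AAGT, unit cost):
unit = lambda x, y: 0 if x == y else 1
print(alignment_matrix("AT", "AAGT", unit)[-1][-1])  # -> 2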
Example
• a = AT, b = AAGT
• w(x, y) = 0 if x = y, 1 otherwise (unit cost)
        A   A   G   T
    0   1   2   3   4
A   1   0   1   2   3
T   2   1   1   2   2
Remarks
• assuming m ≤ n is w.l.o.g., since we can exchange a and b
• space complexity can be improved to O(n) for computation of the
  distance (simple: “store only the current and the last row”) and for the
  traceback (more involved; Hirschberg’s algorithm uses “Divide and Conquer”)
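A sketch of the “store only the current and the last row” idea for the distance value only (illustrative function name; recovering the traceback would need the Hirschberg approach mentioned above):

def alignment_distance_two_rows(a, b, w):
    """Compute the distance D[n][m] keeping only two rows of the matrix."""
    n, m = len(a), len(b)
    prev = [0] * (m + 1)
    for j in range(1, m + 1):                      # row 0: insertions only
        prev[j] = prev[j - 1] + w('-', b[j - 1])
    for i in range(1, n + 1):
        cur = [prev[0] + w(a[i - 1], '-')] + [0] * m
        for j in range(1, m + 1):
            cur[j] = min(prev[j - 1] + w(a[i - 1], b[j - 1]),   # match/mismatch
                         prev[j] + w(a[i - 1], '-'),            # deletion
                         cur[j - 1] + w('-', b[j - 1]))         # insertion
        prev = cur
    return prev[m]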
Motivation:
• The alignments

      GA--T        G-A-T
      GAAGT        GAAGT

  have the same edit distance.
• The first one is biologically more reasonable: it is more likely
that evolution introduces one large gap than two small ones.
• This means: gap cost should be non-linear, sub-additive!
g (k + l) ≤ g (k) + g (l).
Example:
a = ATG---CGAC--GC
b = -TGCGGCG-CTTTC
⇒ ∆a = {---, --}, ∆b = {-, -}
General sub-additive gap penalty
Theorem
Let D be the alignment matrix of a and b with cost w and gap
penalty g , such that Dij = wg (a1..i , b1..j ). Then:
• D0,0 = 0
• for all 1 ≤ i ≤ n: Di,0 = g (i)
• for all 1 ≤ j ≤ m: D0,j = g (j)
• Di,j = min { Di−1,j−1 + w(ai, bj)        (match),
               min1≤k≤i Di−k,j + g(k)      (deletion of length k),
               min1≤k≤j Di,j−k + g(k)      (insertion of length k) }
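A direct Python sketch of this recurrence (names are illustrative; note the inner minimizations over k make it take O(nm(n + m)) time):

def alignment_with_general_gaps(a, b, w, g):
    """D[i][j] as in the theorem above, for a general (sub-additive) gap penalty g(k)."""
    n, m = len(a), len(b)
    D = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = g(i)
    for j in range(1, m + 1):
        D[0][j] = g(j)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best = D[i - 1][j - 1] + w(a[i - 1], b[j - 1])                       # match
            best = min(best, min(D[i - k][j] + g(k) for k in range(1, i + 1)))   # deletion of length k
            best = min(best, min(D[i][j - k] + g(k) for k in range(1, j + 1)))   # insertion of length k
            D[i][j] = best
    return D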
Definition
A gap penalty g is affine iff there are real constants α and β such
that for all k ∈ N: g(k) = α + βk.
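Assuming α ≥ 0 (a non-negative gap-opening cost, as is typical), an affine penalty is indeed sub-additive:
g(k) + g(l) = 2α + β(k + l) ≥ α + β(k + l) = g(k + l).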
Remarks
• Affine gap penalties are almost as good as general ones: distinguishing
  gap opening cost (α) and gap extension cost (β) is “biologically reasonable”.
• The minimal alignment cost with an affine gap penalty can be computed
  in O(n²) time! (Gotoh algorithm)
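A compact sketch of the Gotoh idea, assuming cost minimization with g(k) = α + βk; the matrix names D, V, H and the initialization details are illustrative, not taken from the slides. Besides D, two helper matrices track alignments that currently end in a gap, so every cell costs O(1):

import math

def gotoh(a, b, w, alpha, beta):
    """Minimal alignment cost with affine gap penalty g(k) = alpha + beta*k.

    D[i][j]: min cost of aligning a[:i] with b[:j]
    V[i][j]: same, but the alignment ends with a gap in b (a_i deleted)
    H[i][j]: same, but the alignment ends with a gap in a (b_j inserted)
    """
    n, m = len(a), len(b)
    INF = math.inf
    D = [[0.0] * (m + 1) for _ in range(n + 1)]
    V = [[INF] * (m + 1) for _ in range(n + 1)]
    H = [[INF] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = V[i][0] = alpha + beta * i
    for j in range(1, m + 1):
        D[0][j] = H[0][j] = alpha + beta * j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            V[i][j] = min(D[i - 1][j] + alpha + beta, V[i - 1][j] + beta)  # open/extend gap in b
            H[i][j] = min(D[i][j - 1] + alpha + beta, H[i][j - 1] + beta)  # open/extend gap in a
            D[i][j] = min(D[i - 1][j - 1] + w(a[i - 1], b[j - 1]), V[i][j], H[i][j])
    return D[n][m]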
Remarks
• That is, local alignment asks for the subsequences of a and b that are most similar to each other
Definition
The local alignment matrix H of a and b is (Hi,j)0≤i≤n,0≤j≤m, where Hi,j is
the maximal similarity score of an alignment of a suffix of a1..i with a suffix
of b1..j (both possibly empty).
Remarks
• Slocal(a, b) = maxi,j Hi,j (!)
• all entries Hi,j ≥ 0, since Sglobal(ε, ε) = 0 (the empty alignment scores 0)
• Hi,j = 0 implies that no subsequences of a and b that end in positions i
  and j align with positive score
Theorem
For the local alignment matrix H of a and b,
• H0,0 = 0
• for all 1 ≤ i ≤ n: Hi,0 = 0
• for all 1 ≤ j ≤ m: H0,j = 0
• Hi,j = max { 0                            (empty alignment),
               Hi−1,j−1 + s(ai, bj),
               Hi−1,j + s(ai, −),
               Hi,j−1 + s(−, bj) }
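A Python sketch of this recurrence, computing only the best local score; the function name, the gap symbol '-', and the example scores are assumptions for illustration:

def local_alignment_score(a, b, s):
    """Best local alignment score of a and b for a similarity function s(x, y)."""
    n, m = len(a), len(b)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            H[i][j] = max(
                0,                                         # empty alignment
                H[i - 1][j - 1] + s(a[i - 1], b[j - 1]),   # match/mismatch
                H[i - 1][j] + s(a[i - 1], '-'),            # gap in b
                H[i][j - 1] + s('-', b[j - 1]),            # gap in a
            )
            best = max(best, H[i][j])
    return best

# With assumed scores match = +2, mismatch/gap = -2, the matrix of the
# small example further below (a = AA, b = ACAA) is obtained; best score 4:
sim = lambda x, y: 2 if x == y and x != '-' else -2
print(local_alignment_score("AA", "ACAA", sim))  # -> 4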
Remarks
• Complexity: O(n²) time and space; again, the space complexity can be improved
• Requires that the similarity function is centered around zero, i.e.
  positive = similar, negative = dissimilar.
• Extension to affine gap cost works
• Traceback?
Example (local alignment matrix H for a = AA, b = ACAA):

        A   C   A   A
    0   0   0   0   0
A   0   2   0   2   2
A   0   2   0   2   4
Example
S(A) = s(−, A, T) + s(A, C, C) + s(C, −, C) + s(C, −, −) + · · · + s(C, C, G)
(the score of the multiple alignment is the sum of its column scores)
Drawbacks?
Heuristics:
1. Compute pairwise distances between all input sequences
• align all against all
• if necessary, transform similarities into distances (e.g. Feng-Doolittle)
2. Cluster sequences by their distances, e.g. by
• Unweighted Pair Group Method (UPGMA)
• Neighbor Joining (NJ)
• ...
Progressive Alignment — Example
IN: a(1) = ACCG , a(2) = TTGG , a(3) = TCG , a(4) = CTGG
w(x, y) =  0  if x = y
           2  if x = − or y = −
           3  otherwise (mismatch)
• prototypical progressive alignment
• similarity score with affine gap cost
• neighbor joining for tree construction
• special ‘tricks’ for gap handling
• Progressive alignment + consistency heuristic
• Avoid mistakes when optimizing locally by modifying the scores
  (“library extension”)
• Modified scores reflect global consistency
• Details of the consistency transformation: next slide
• Merges local and global alignment
Alignment:          Profile:
ACGG-               A:  0.75  0     0     0     0
ACCG-               C:  0     1     0.5   0     0
AC-G-               G:  0     0     0.25  1     0.25
TCCGG               T:  0.25  0     0     0     0

Consensus: ACCG-
Remarks
• A profile of a multiple alignment consists of character
frequency vectors for each column.
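A small sketch of this computation, assuming the multiple alignment is given as a list of equal-length strings (names are illustrative):

from collections import Counter

def profile(alignment, alphabet="ACGT"):
    """Column-wise character frequencies of a multiple alignment."""
    ncols = len(alignment[0])
    nrows = len(alignment)
    prof = {c: [0.0] * ncols for c in alphabet}
    for j in range(ncols):
        counts = Counter(row[j] for row in alignment)
        for c in alphabet:
            prof[c][j] = counts[c] / nrows
    return prof

# The alignment from the example above:
print(profile(["ACGG-", "ACCG-", "AC-G-", "TCCGG"]))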
[The frog is more likely to climb the ladder when the sun shines. Assume that
the weather is hidden, but we can observe the frog.]