
Sequence Alignment
S. Will, 18.417, Fall 2011

Motivation: assess similarity of sequences and learn about their evolutionary relationship.
Why do we want to know this?

Example: Sequence Alignment
ACCCGA            ACCCGA
ACTA     ⇒ align  AC--TA
TCCTA             TCC-TA

Homology: Alignment is reasonable if the sequences are homologous.

[Phylogenetic tree: common ancestor ACCGA; intermediate ACCTA; leaves ACCCGA, TCCTA, ACTA; edge mutations T, C, C]

Definition (Sequence Homology)
Two or more sequences are homologous iff they evolved from a common ancestor.
[Homology in anatomy]

Plan (and Some Preliminaries)

• First: study only pairwise alignment.
  Fix an alphabet Σ such that − ∉ Σ; − is called the gap symbol.
  The elements of Σ∗ are called sequences.
  Fix two sequences a, b ∈ Σ∗.
• For pairwise sequence comparison: define edit distance, define alignment distance, show equivalence of the distances, define the alignment problem and an efficient algorithm; gap penalties, local alignment.
• Later: extend pairwise alignment to multiple alignment.

Definition (Alphabet, words)

An alphabet Σ is a finite set (of symbols/characters). Σ+ denotes the set of non-empty words over Σ, i.e. Σ+ := ∪_{i>0} Σ^i. A word x ∈ Σ^n has length n, written |x|. Σ∗ := Σ+ ∪ {ε}, where ε denotes the empty word of length 0.

Levenshtein Distance

Definition
The Levenshtein Distance between two words/sequences is the
minimal number of substitutions, insertions and deletions to
transform one into the other.

Example
ACCCGA and ACTA have (at most) distance 3:
ACCCGA → ACCGA → ACCTA → ACTA
In biology, operations have different costs. (Why?)
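
The unit-cost case is easy to compute by dynamic programming; as a concrete illustration, here is a minimal Python sketch (the slides give no code here, so the function name and layout are my own):

# A minimal sketch of the unit-cost Levenshtein distance via dynamic programming.
def levenshtein(a: str, b: str) -> int:
    n, m = len(a), len(b)
    # D[i][j] = distance between prefixes a[:i] and b[:j]
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i                               # i deletions
    for j in range(1, m + 1):
        D[0][j] = j                               # j insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if a[i - 1] == b[j - 1] else 1
            D[i][j] = min(D[i - 1][j - 1] + sub,  # substitution/match
                          D[i - 1][j] + 1,        # deletion
                          D[i][j - 1] + 1)        # insertion
    return D[n][m]

assert levenshtein("ACCCGA", "ACTA") == 3         # the example above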

Edit Distance: Operations

Definition (Edit Operations)
An edit operation is a pair (x, y) ∈ (Σ ∪ {−})² \ {(−, −)}. We call (x, y)
• substitution iff x ≠ − and y ≠ −
• deletion iff y = −
• insertion iff x = −
For sequences a, b, write a →(x,y) b iff a is transformed to b by operation (x, y). Furthermore, write a ⇒S b iff a is transformed to b by a sequence of edit operations S.

Example
ACCCGA →(C,−) ACCGA →(G,T) ACCTA →(−,T) ATCCTA
ACCCGA ⇒(C,−),(G,T),(−,T) ATCCTA

Recall: − ∉ Σ; a, b are sequences in Σ∗


Edit Distance: Cost and Problem Definition

Definition (Cost, Edit Distance)
Let w : (Σ ∪ {−})² → R, such that w(x, y) is the cost of an edit operation (x, y). The cost of a sequence of edit operations S = e1, …, en is

w̃(S) = ∑_{i=1}^{n} w(ei).

The edit distance of sequences a and b is

dw(a, b) = min{ w̃(S) | a ⇒S b }.

Is the definition reasonable?

Definition (Metric)
A function d : X² → R is called a metric iff
1.) d(x, y) = 0 iff x = y, 2.) d(x, y) = d(y, x), 3.) d(x, y) ≤ d(x, z) + d(z, y).
Remarks: 1.) for a metric d, d(x, y) ≥ 0; 2.) dw is a metric iff w(x, y) ≥ 0; 3.) in the following, assume dw is a metric.

Remarks
• Natural 'evolution-motivated' problem definition.
• Not obvious how to compute the edit distance efficiently ⇒ define alignment distance.

Alignment Distance

Definition (Alignment)
A pair of words a*, b* ∈ (Σ ∪ {−})∗ is called an alignment of sequences a and b (a* and b* are called alignment strings), iff
1. |a*| = |b*|
2. for all 1 ≤ i ≤ |a*|: a*i ≠ − or b*i ≠ −
3. deleting all gap symbols − from a* yields a, and deleting all − from b* yields b

Example
a = ACGGAT
b = CCGCTT
Possible alignments are
a* = AC-GG-AT     a* = ACGG---AT
b* = -CCGCT-T  or b* = --CCGCT-T  or … (exponentially many)
Edit operations of the first alignment: (A,-),(-,C),(G,C),(-,T),(A,-)

Alignment Distance

Definition (Cost of Alignment, Alignment Distance)
The cost of the alignment (a*, b*), given a cost function w on edit operations, is

w(a*, b*) = ∑_{i=1}^{|a*|} w(a*i, b*i).

The alignment distance of a and b is

Dw(a, b) = min{ w(a*, b*) | (a*, b*) is an alignment of a and b }.

Alignment Distance = Edit Distance

Theorem (Equivalence of Edit and Alignment Distance)
For metric w, dw(a, b) = Dw(a, b).

Remarks
• Proof idea:
  dw(a, b) ≤ Dw(a, b): an alignment yields a sequence of edit operations.
  Dw(a, b) ≤ dw(a, b): a sequence of edit operations yields an equal or better alignment (needs the triangle inequality).
• Reduces edit distance to alignment distance.
• We will see: the alignment distance can be computed efficiently by dynamic programming (using Bellman's Principle of Optimality).

Principle of Optimality and Dynamic Programming

Principle of Optimality:
'Optimal solutions consist of optimal partial solutions.'

Example: Shortest Path

Idea of Dynamic Programming (DP):
• Solve partial problems first and materialize the results
• (Recursively) solve larger problems based on smaller ones

Remarks
• The principle is valid for the alignment distance problem
• The Principle of Optimality enables the programming method DP
• Dynamic programming is widely used in Computational Biology and you will meet it quite often in this class

Alignment Matrix

Idea: choose alignment distances of prefixes a1..i and b1..j as partial solutions and define a matrix of these partial solutions.

Let n := |a|, m := |b|.

Definition (Alignment matrix)
The alignment matrix of a and b is the (n + 1) × (m + 1) matrix D := (Dij)_{0≤i≤n, 0≤j≤m} defined by

Dij := Dw(a1..i, b1..j)
     = min{ w(a*, b*) | (a*, b*) is an alignment of a1..i and b1..j }.

Notational remarks
• ai is the i-th character of a
• ax..y is the sequence ax ax+1 … ay (a subsequence of a)
• by convention, ax..y = ε if x > y

Alignment Matrix Example

Example
• a = AT, b = AAGT
• w(x, y) = 0 iff x = y, 1 otherwise

        A  A  G  T
     0  1  2  3  4
A    1  0  1  2  3
T    2  1  1  2  2

Remark: The alignment matrix D contains the alignment distance (= edit distance) of a and b in Dn,m.

Needleman-Wunsch Algorithm

Claim
For (a*, b*) an alignment of a and b with length r = |a*|,
w(a*, b*) = w(a*1..r−1, b*1..r−1) + w(a*r, b*r).

Theorem
For the alignment matrix D of a and b, it holds that
• D0,0 = 0
• for all 1 ≤ i ≤ n: Di,0 = ∑_{k=1}^{i} w(ak, −) = Di−1,0 + w(ai, −)
• for all 1 ≤ j ≤ m: D0,j = ∑_{k=1}^{j} w(−, bk) = D0,j−1 + w(−, bj)
• Dij = min of
    Di−1,j−1 + w(ai, bj)   (match)
    Di−1,j + w(ai, −)      (deletion)
    Di,j−1 + w(−, bj)      (insertion)

Remark: The theorem claims that each prefix alignment distance can be computed from a constant number of smaller ones.
Proof: Induction over i + j.

Needleman-Wunsch Algorithm (Pseudocode)

D0,0 := 0
for i := 1 to n do
  Di,0 := Di−1,0 + w(ai, −)
end for
for j := 1 to m do
  D0,j := D0,j−1 + w(−, bj)
end for
for i := 1 to n do
  for j := 1 to m do
    Di,j := min( Di−1,j−1 + w(ai, bj),
                 Di−1,j + w(ai, −),
                 Di,j−1 + w(−, bj) )
  end for
end for
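
As a cross-check, the pseudocode transcribes to runnable Python as follows; this is a sketch, with the cost function w as a parameter (the unit-cost default is an assumption for the example):

# Needleman-Wunsch alignment distance, a direct transcription of the
# pseudocode above; returns the whole matrix D.
def nw_distance(a, b, w=lambda x, y: 0 if x == y else 1):
    n, m = len(a), len(b)
    D = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = D[i - 1][0] + w(a[i - 1], '-')   # column of deletions
    for j in range(1, m + 1):
        D[0][j] = D[0][j - 1] + w('-', b[j - 1])   # row of insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(D[i - 1][j - 1] + w(a[i - 1], b[j - 1]),  # match
                          D[i - 1][j] + w(a[i - 1], '-'),           # deletion
                          D[i][j - 1] + w('-', b[j - 1]))           # insertion
    return D

D = nw_distance("AT", "AAGT")
assert D[2][4] == 2   # matches the example matrix above
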
Back to Example

Example
• a = AT, b = AAGT
• w(x, y) = 0 iff x = y, 1 otherwise

Filling the matrix row by row:

        A  A  G  T
     0  1  2  3  4
A    1  0  1  2  3
T    2  1  1  2  2

Open: how to find the best alignment?

Traceback

w(x, y) = 0 iff x = y, 1 otherwise

        A  A  G  T
     0  1  2  3  4
A    1  0  1  2  3
T    2  1  1  2  2

Remarks
• Start in (n, m). For every (i, j) determine the optimal case.
• Not necessarily unique.
• The sequence of trace arrows lets us infer the best alignment.
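
On top of the matrix returned by nw_distance above, the traceback can be sketched as follows (the tie-breaking order match, then deletion, then insertion is an arbitrary choice):

# Reconstruct one optimal alignment from the filled matrix D.
def traceback(a, b, D, w=lambda x, y: 0 if x == y else 1):
    a_row, b_row = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + w(a[i - 1], b[j - 1]):
            a_row.append(a[i - 1]); b_row.append(b[j - 1]); i -= 1; j -= 1  # match
        elif i > 0 and D[i][j] == D[i - 1][j] + w(a[i - 1], '-'):
            a_row.append(a[i - 1]); b_row.append('-'); i -= 1               # deletion
        else:
            a_row.append('-'); b_row.append(b[j - 1]); j -= 1               # insertion
    return ''.join(reversed(a_row)), ''.join(reversed(b_row))

# one optimal alignment of cost 2, e.g. ('-A-T', 'AAGT')
print(traceback("AT", "AAGT", nw_distance("AT", "AAGT")))
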
Complexity

• compute one entry: three cases, i.e. constant time
• nm entries ⇒ fill matrix in O(nm) time
• traceback: O(n + m) time
• TOTAL: O(n²) time and space (assuming m ≤ n)

Remarks
• assuming m ≤ n is w.l.o.g., since we can exchange a and b
• space complexity can be improved to O(n) for the computation of the distance (simple: "store only current and last row") and for the traceback (more involved: the Hirschberg algorithm uses "Divide and Conquer" for computing the trace)

Plan

• We have seen how to compute the pairwise edit distance and the corresponding optimal alignment.
• Before going multiple, we will look at two further special topics for pairwise alignment:
  • more realistic, non-linear gap cost and
  • similarity scores and local alignment

Alignment Cost Revisited

Motivation:
• The alignments
  GA--T      G-A-T
  GAAGT  and GAAGT
  have the same edit distance.
• The first one is biologically more reasonable: it is more likely that evolution introduces one large gap than two small ones.
• This means: gap cost should be non-linear, sub-additive!

Gap Penalty

Definition (Gap Penalty)
A gap penalty is a function g : N → R that is sub-additive, i.e.

g(k + l) ≤ g(k) + g(l).

A gap in an alignment string a* is a substring of a* that consists of only gap symbols − and is maximally extended. ∆a* is the multiset of gaps in a*.
The alignment cost with gap penalty g of (a*, b*) is

wg(a*, b*) = ∑_{1≤r≤|a*|, a*r≠−, b*r≠−} w(a*r, b*r)   (cost of mismatches)
           + ∑_{x ∈ ∆a* ⊎ ∆b*} g(|x|)                 (cost of gaps)

Example:
a* = ATG---CGAC--GC
b* = -TGCGGCG-CTTTC   ⇒ ∆a* = {---, --}, ∆b* = {-, -}

General sub-additive gap penalty

Theorem
Let D be the alignment matrix of a and b with cost w and gap penalty g, i.e. Dij is the minimal cost wg over alignments of a1..i and b1..j. Then:
• D0,0 = 0
• for all 1 ≤ i ≤ n: Di,0 = g(i)
• for all 1 ≤ j ≤ m: D0,j = g(j)
• Dij = min of
    Di−1,j−1 + w(ai, bj)          (match)
    min_{1≤k≤i} Di−k,j + g(k)     (deletion of length k)
    min_{1≤k≤j} Di,j−k + g(k)     (insertion of length k)

Remarks
• Complexity: O(n³) time, O(n²) space
• pseudocode, correctness, traceback left as exercise
• much more realistic, but significantly more expensive than Needleman-Wunsch ⇒ can we improve it?
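
To make the O(n³) claim concrete, a minimal Python sketch of the recursion (this is not the exercise solution for correctness and traceback; names and the example penalty g(k) = 2 + k are illustrative):

# O(n^3) DP for a general sub-additive gap penalty g; w and g are parameters.
def general_gap_distance(a, b, w, g):
    n, m = len(a), len(b)
    D = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = g(i)                 # one leading gap of length i
    for j in range(1, m + 1):
        D[0][j] = g(j)                 # one leading gap of length j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(
                D[i - 1][j - 1] + w(a[i - 1], b[j - 1]),           # match
                min(D[i - k][j] + g(k) for k in range(1, i + 1)),  # deletion of length k
                min(D[i][j - k] + g(k) for k in range(1, j + 1)),  # insertion of length k
            )
    return D[n][m]

# demo with an affine-shaped sub-additive penalty g(k) = 2 + k
print(general_gap_distance("GAT", "GAAGT", lambda x, y: 0 if x == y else 1, lambda k: 2 + k))
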
Affine gap cost

Definition
A gap penalty is affine iff there are real constants α and β such that for all k ∈ N: g(k) = α + βk.

Remarks
• Affine gap penalties are almost as good as general ones: distinguishing gap opening (α) and gap extension cost (β) is "biologically reasonable".
• The minimal alignment cost with affine gap penalty can be computed in O(n²) time! (Gotoh algorithm)

Gotoh algorithm: sketch only

In addition to the alignment matrix D, define two further matrices/states:
• Ai,j := cost of the best alignment of a1..i, b1..j that ends with a deletion (ai over −)
• Bi,j := cost of the best alignment of a1..i, b1..j that ends with an insertion (− over bj)

Recursions:
Ai,j = min( Ai−1,j + β,            (deletion extension)
            Di−1,j + g(1) )        (deletion opening)
Bi,j = min( Bi,j−1 + β,            (insertion extension)
            Di,j−1 + g(1) )        (insertion opening)
Dij  = min( Di−1,j−1 + w(ai, bj),  (match)
            Ai,j,                  (deletion closing)
            Bi,j )                 (insertion closing)

Remark: O(n²) time and space
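
The recursions translate into the following Python sketch, assuming g(k) = α + βk; INF marks boundary states that cannot end in a gap:

INF = float("inf")

# Gotoh's O(n^2) affine-gap algorithm (distance version), a sketch.
def gotoh(a, b, w, alpha, beta):
    n, m = len(a), len(b)
    D = [[0.0] * (m + 1) for _ in range(n + 1)]
    A = [[INF] * (m + 1) for _ in range(n + 1)]   # best cost ending in a deletion
    B = [[INF] * (m + 1) for _ in range(n + 1)]   # best cost ending in an insertion
    for i in range(1, n + 1):
        A[i][0] = alpha + beta * i                # leading gap in b, g(i)
        D[i][0] = A[i][0]
    for j in range(1, m + 1):
        B[0][j] = alpha + beta * j                # leading gap in a, g(j)
        D[0][j] = B[0][j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            A[i][j] = min(A[i - 1][j] + beta,               # deletion extension
                          D[i - 1][j] + alpha + beta)       # deletion opening, g(1)
            B[i][j] = min(B[i][j - 1] + beta,               # insertion extension
                          D[i][j - 1] + alpha + beta)       # insertion opening, g(1)
            D[i][j] = min(D[i - 1][j - 1] + w(a[i - 1], b[j - 1]),  # match
                          A[i][j], B[i][j])                         # close a gap
    return D[n][m]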


Similarity

Definition (Similarity)
The similarity of an alignment (a*, b*) is

s(a*, b*) = ∑_{i=1}^{|a*|} s(a*i, b*i),

where s : (Σ ∪ {−})² → R is a similarity function such that for x ∈ Σ: s(x, x) > 0, s(x, −) < 0, s(−, x) < 0.

Observation: Instead of minimizing alignment cost, one can maximize similarity:

Sij = max( Si−1,j−1 + s(ai, bj),
           Si−1,j + s(ai, −),
           Si,j−1 + s(−, bj) )

Motivation:
• defining similarity of 'building blocks' could be more natural, e.g. similarity of amino acids
• similarity is useful for local alignment

Local Alignment Motivation

Local alignment asks for the best alignment of any two subsequences of a and b. Important application: search! (e.g. BLAST combines heuristics and local alignment)

Example
a = AWGVIACAILAGRS
b = VIVTAIAVAGYY

In contrast, all previous methods compute "global alignments".

Why is distance not useful?

Example
a) XXXAAXXXX   b) XXAAAAAXXXX
   YYAAYY         YYYAAAAAYY

Where is the stronger local motif? Only similarity can distinguish.

Local Alignment

Definition (Local Alignment Problem)
Let s be a similarity on alignments.

Sglobal(a, b) := max { s(a*, b*) | (a*, b*) alignment of a and b }    (global similarity)

Slocal(a, b) := max_{1≤i′<i≤n, 1≤j′<j≤m} Sglobal(ai′..i, bj′..j)      (local similarity)

The local alignment problem is to compute Slocal(a, b).

Remarks
• That is, local alignment asks for the subsequences of a and b that have the best alignment.
• How would we define the local alignment matrix for DP?
• For example, why does "Hi,j := Slocal(a1..i, b1..j)" not work?

Local Alignment Matrix

Definition
The local alignment matrix H of a and b is (Hi,j)_{0≤i≤n, 0≤j≤m} defined by

Hi,j := max_{0≤i′≤i, 0≤j′≤j} Sglobal(ai′+1..i, bj′+1..j).

Remarks
• Slocal(a, b) = max_{i,j} Hi,j (!)
• all entries Hi,j ≥ 0, since Sglobal(ε, ε) = 0
• Hi,j = 0 implies that no subsequences of a and b that end at the respective i and j are similar
• allows a case distinction / the principle of optimality holds!

Local Alignment Algorithm — Case Distinction

Cases for Hi,j, by the last alignment column:
1.) (ai over bj)   2.) (ai over −)   3.) (− over bj)
4.) 0, since if each of the above cases is dissimilar (i.e. negative), there is still (ε, ε).

Local Alignment Algorithm (Smith-Waterman Algorithm)

Theorem
For the local alignment matrix H of a and b,
• H0,0 = 0
• for all 1 ≤ i ≤ n: Hi,0 = 0
• for all 1 ≤ j ≤ m: H0,j = 0
• Hij = max of
    0                       (empty alignment)
    Hi−1,j−1 + s(ai, bj)
    Hi−1,j + s(ai, −)
    Hi,j−1 + s(−, bj)

Local Alignment Remarks

Remarks
• Complexity: O(n²) time and space; again, the space complexity can be improved
• Requires that the similarity function is centered around zero, i.e. positive = similar, negative = dissimilar
• Extension to affine gap cost works
• Traceback?

Local Alignment Example

Example
• a = AAC, b = ACAA
• s(x, y) = 2 iff x = y, −3 otherwise

        A  C  A  A
     0  0  0  0  0
A    0  2  0  2  2
A    0  2  0  2  4
C    0  0  4  1  1

Traceback: start at the maximum entry, trace back to the first 0 entry.

Substitution/Similarity Matrices

• In practice: use similarity matrices learned from closely related sequences or multiple alignments
  • PAM (Percent Accepted Mutations) for proteins
  • BLOSUM (BLOcks of Amino Acid SUbstitution) for proteins
  • RIBOSUM for RNA
• Scores are (scaled) log odds scores: log( Pr[x, y | Related] / Pr[x, y | Background] )

For example, BLOSUM62: [matrix figure]

Multiple Alignment

Example: Sequence Alignment
a(1) = ACCCGAG                ACCCGA-G-
a(2) = ACTACC    ⇒ align A =  AC--TAC-C
a(3) = TCCTACGG               TCC-TACGG

Definition
A multiple alignment A of K sequences a(1), …, a(K) is a K × N matrix (Ai,j)_{1≤i≤K, 1≤j≤N} (N is the number of columns of A) where
1. each entry Ai,j ∈ (Σ ∪ {−})
2. for each row i: deleting all gaps from (Ai,1 … Ai,N) yields a(i)
3. no column j contains only gap symbols

How to Score Multiple Alignments

As for pairwise alignment:
• Assume columns are scored independently
• Score is the sum over alignment columns

S(A) = ∑_{j=1}^{N} s(A1j, …, AKj)

Example (columns of the alignment above):
S(A) = s(A,A,T) + s(C,C,C) + s(C,−,C) + s(C,−,−) + … + s(−,C,G)

How do we know the similarities? How to define s(x, y, z)?
As log odds: s(x, y, z) = log( Pr[x, y, z | Related] / Pr[x, y, z | Background] )?
Problems? Can we learn similarities for triples, 4-tuples, …?

Sum-Of-Pairs Score

Idea: approximate column scores by pairwise scores

s(x1, …, xK) = ∑_{1≤k<l≤K} s(xk, xl)

Sum-of-pairs is the most commonly used scoring scheme for multiple alignments.
(Extensible to gap penalties, in particular affine gap cost)

Drawbacks?
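
For concreteness, a small Python sketch of sum-of-pairs scoring (it assumes a pairwise similarity s that also scores characters against the gap symbol, typically with s(−, −) = 0):

from itertools import combinations

# Sum-of-pairs score of one alignment column.
def sum_of_pairs(column, s):
    return sum(s(x, y) for x, y in combinations(column, 2))

# Whole-alignment score: sum over the columns of a K x N alignment (list of rows).
def sp_score(rows, s):
    return sum(sum_of_pairs(col, s) for col in zip(*rows))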

Optimal Multiple Alignment

Idea: use dynamic programming

Example
For 3 sequences a, b, c, use a 3-dimensional matrix (after initialization):

Si,j,k = max of
  Si−1,j−1,k−1 + s(ai, bj, ck)
  Si−1,j−1,k   + s(ai, bj, −)
  Si−1,j,k−1   + s(ai, −, ck)
  Si,j−1,k−1   + s(−, bj, ck)
  Si−1,j,k     + s(ai, −, −)
  Si,j−1,k     + s(−, bj, −)
  Si,j,k−1     + s(−, −, ck)

For K sequences use a K-dimensional matrix.
Complexity?

Heuristic Multiple Alignment: Progressive Alignment

Idea: compute optimal alignments only pairwise

Example
4 sequences a(1), a(2), a(3), a(4)
1. determine how they are related ⇒ tree, e.g. ((a(1), a(2)), (a(3), a(4)))
2. align the most closely related sequences first ⇒ (optimally) align a(1) and a(2) by DP
3. go on ⇒ (optimally) align a(3) and a(4) by DP
4. go on?! ⇒ (optimally) align the two alignments. How can we do that?
5. Done. We produced a multiple alignment of a(1), a(2), a(3), a(4).

Remarks: - Optimality is not guaranteed. Why?
         - The tree is known as the guide tree. How can we get it?

Guide tree

The guide tree determines the order of pairwise alignments in the progressive alignment scheme.
The order of the progressive alignment steps is crucial for quality!

Heuristics (a clustering sketch follows below):
1. Compute pairwise distances between all input sequences
   • align all against all
   • if needed, transform similarities to distances (e.g. Feng-Doolittle)
2. Cluster sequences by their distances, e.g. by
   • Unweighted Pair Group Method (UPGMA)
   • Neighbor Joining (NJ)
Aligning Alignments

Two (multiple) alignments A and B can be aligned by DP in the same way as two sequences.
Idea:
• An alignment is a sequence of alignment columns.
  Example:
  ACCCGA-G-
  AC--TAC-C  ≡  (A,A,T), (C,C,C), (C,−,C), …, (−,C,G)
  TCC-TACGG
• Assign a similarity to two columns from resp. A and B, e.g. s((G,C,G), (C)) by sum-of-pairs.

We can use dynamic programming, which recurses over alignment scores of prefixes of alignments.

Consequences for the progressive alignment scheme:
• Optimization is only local.
• Commits to local decisions: "Once a gap, always a gap."

Progressive Alignment — Example

IN: a(1) = ACCG, a(2) = TTGG, a(3) = TCG, a(4) = CTGG
w(x, y) = 0 iff x = y, 2 iff x = − or y = −, 3 otherwise (for mismatch)

• Compute all-against-all edit distances and cluster

Align ACCG and TTGG:     Align ACCG and TCG:
      T  T  G  G               T  C  G
   0  2  4  6  8            0  2  4  6
A  2  3  5  7  9         A  2  3  5  7
C  4  5  6  8 10         C  4  5  3  6
C  6  7  8  9 11         C  6  7  5  6
G  8  9 10  8  9         G  8  9  8  5

Align ACCG and CTGG:     Align TTGG and TCG:
      C  T  G  G               T  C  G
   0  2  4  6  8            0  2  4  6
A  2  3  5  7  9         T  2  0  3  6
C  4  2  5  8 10         T  4  2  3  6
C  6  4  5  8 11         G  6  5  5  3
G  8  7  7  5  8         G  8  8  8  5

Align TTGG and CTGG:     Align TCG and CTGG:
      C  T  G  G               C  T  G  G
   0  2  4  6  8            0  2  4  6  8
T  2  3  2  5  8         T  2  3  2  5  8
T  4  5  3  5  8         C  4  2  5  5  8
G  6  7  6  3  5         G  6  5  5  5  5
G  8  9  9  6  3

⇒ distance matrix

       a(1)  a(2)  a(3)  a(4)
a(1)     0     9     5     8
a(2)           0     5     3
a(3)                 0     5
a(4)                       0

⇒ Cluster (e.g. UPGMA): a(2) and a(4) are closest; then a(1) and a(3)

⇒ guide tree ((a(2), a(4)), (a(1), a(3)))

• Align a(2) and a(4):  TTGG    Align a(1) and a(3):  ACCG
                        CTGG                          -TCG
• Align the alignments!

Align TTGG with ACCG:
      CTGG      -TCG

          A    C    C    G
          -    T    C    G
      0    4   12   20   28
TC    8   10    .    .    .
TT   16    .    .    .    .
GG   24    .    .    .    .
GG   32    .    .    .    .

Column-pair costs, e.g.:
• w(TC, −−) = w(T,−) + w(C,−) + w(T,−) + w(C,−) = 8
• w(−−, A−) = w(−,A) + w(−,−) + w(−,A) + w(−,−) = 4
• w(TC, A−) = w(T,A) + w(C,A) + w(T,−) + w(C,−) = 10
• w(TC, CT) = w(T,C) + w(C,C) + w(T,T) + w(C,T) = 6
• …

After filling and traceback:
      TTGG
      CTGG
=⇒    ACCG
      -TCG
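
The column-pair cost used above can be sketched as follows; it sums w over all pairs with one character from each column, matching the computations in the example (names are illustrative):

# Cost of aligning column colA (from alignment A) with column colB (from B).
def column_pair_cost(colA, colB, w):
    return sum(w(x, y) for x in colA for y in colB)

w = lambda x, y: 0 if x == y else (2 if '-' in (x, y) else 3)
assert column_pair_cost("TC", "--", w) == 8    # matches the example
assert column_pair_cost("TC", "A-", w) == 10
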
A Classical Approach: CLUSTAL W

• prototypical progressive alignment
• similarity score with affine gap cost
• neighbor joining for tree construction
• special 'tricks' for gap handling

Advanced Progressive Alignment in MUSCLE

[Flowchart: 1.) alignment draft, 2.) reestimation, 3.) iterative refinement]

Consistency-based scoring in T-Coffee

• Progressive alignment + consistency heuristic
• Avoid mistakes when optimizing locally by modifying the scores ("library extension")
• Modified scores reflect global consistency
• Details of the consistency transformation: below
• Merges local and global alignments

Consistency Transformation
• For each sequence triplet: strengthen compatible edges
• This moves global information into the scores
• Consistency-based scores guide pairwise alignments towards (global) consistency

[Figures: misalignment by the standard procedure; correct alignment after library extension; all-to-all alignments for weighting]

Alignment Profiles

Alignment:    Profile:
ACGG-               col1  col2  col3  col4  col5
ACCG-         A:    0.75  0     0     0     0
AC-G-         C:    0     1     0.5   0     0
TCCGG         G:    0     0     0.25  1     0.25
              T:    0.25  0     0     0     0
Consensus:
ACCG-

Remarks
• A profile of a multiple alignment consists of character frequency vectors for each column.
• The profile describes the sequences of the alignment in a rigid way.
• Modeling insertions/deletions requires profile HMMs.
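
A small sketch of computing a profile in Python (gap frequencies are omitted, as in the example table):

from collections import Counter

# Per-column character frequencies of a multiple alignment (list of rows).
def profile(rows, alphabet="ACGT"):
    cols = list(zip(*rows))    # columns of the alignment
    return [{c: Counter(col)[c] / len(col) for c in alphabet} for col in cols]

P = profile(["ACGG-", "ACCG-", "AC-G-", "TCCGG"])
assert P[0]["A"] == 0.75 and P[2]["C"] == 0.5   # matches the example table
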
Hidden Markov Models (HMMs)

Example of a simple HMM

[Figure: two hidden weather states S (sun) and R (rain), with transitions 0.8 (S→S), 0.2 (S→R), 0.4 (R→S), 0.6 (R→R); from S the frog is observed at the top/bottom of the ladder with probabilities 2/3 and 1/3, from R with 1/6 and 5/6]
[The frog climbs the ladder more likely when the sun shines. Assume that the weather is hidden, but we can observe the frog.]

• Idea: the probability of an observation depends on a hidden state, where there are specific probabilities to change states.
• Hidden Markov Models generate observation sequences (e.g. TBTTT) according to an (encoded) probability distribution.
• One can compute things like the "most probable path given an observation sequence", … (no details here)

Profile HMMs

• Profile HMMs describe the (probability distribution of) sequences in a multiple alignment (observation ≡ sequence).
• hidden states = insertion (Ii), match (Mi), deletion (Di) in relation to the consensus (state sequence ≡ alignment string)

Alignment:
ACGG-
ACCG-
AC-G-
TCCGG
Consensus:
ACCG-

Remarks
• Profile HMMs are used to search for sequences that are similar to the sequences of a given alignment (Pfam, HMMer)
• Profile HMMs can be used to construct multiple alignments
• We come back to HMMs when we discuss SCFGs.
