
Sequence Alignment
S. Will, 18.417, Fall 2011

Motivation: assess similarity of sequences and learn about their evolutionary relationship.
Why do we want to know this?

Example: Sequence Alignment
ACCCGA            ACCCGA
ACTA     ⇒ align  AC--TA
TCCTA             TCC-TA

Homology: Alignment is reasonable if the sequences are homologous.

[Phylogenetic tree: common ancestor ACCGA; intermediate ACCTA; leaves ACCCGA, TCCTA, ACTA; edge mutations T, C, C]

Definition (Sequence Homology)
Two or more sequences are homologous iff they evolved from a common ancestor.
[Homology in anatomy]

Plan (and Some Preliminaries)

• First: study only pairwise alignment.
  Fix an alphabet Σ such that − ∉ Σ; − is called the gap symbol.
  The elements of Σ∗ are called sequences.
  Fix two sequences a, b ∈ Σ∗.
• For pairwise sequence comparison: define edit distance, define alignment distance, show equivalence of the distances, define the alignment problem and an efficient algorithm; gap penalties, local alignment.
• Later: extend pairwise alignment to multiple alignment.

Definition (Alphabet, words)

An alphabet Σ is a finite set (of symbols/characters). Σ+ denotes the set of non-empty words over Σ, i.e. Σ+ := ∪_{i>0} Σ^i. A word x ∈ Σ^n has length n, written |x|. Σ∗ := Σ+ ∪ {ε}, where ε denotes the empty word of length 0.

Levenshtein Distance

Definition
The Levenshtein Distance between two words/sequences is the
minimal number of substitutions, insertions and deletions to
transform one into the other.

Example
ACCCGA and ACTA have (at most) distance 3:
ACCCGA → ACCGA → ACCTA → ACTA
In biology, operations have different costs. (Why?)
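
The unit-cost case is easy to compute by dynamic programming; as a concrete illustration, here is a minimal Python sketch (the slides give no code here, so the function name and layout are my own):

# A minimal sketch of the unit-cost Levenshtein distance via dynamic programming.
def levenshtein(a: str, b: str) -> int:
    n, m = len(a), len(b)
    # D[i][j] = distance between prefixes a[:i] and b[:j]
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i                               # i deletions
    for j in range(1, m + 1):
        D[0][j] = j                               # j insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if a[i - 1] == b[j - 1] else 1
            D[i][j] = min(D[i - 1][j - 1] + sub,  # substitution/match
                          D[i - 1][j] + 1,        # deletion
                          D[i][j - 1] + 1)        # insertion
    return D[n][m]

assert levenshtein("ACCCGA", "ACTA") == 3         # the example above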

Edit Distance: Operations

Definition (Edit Operations)
An edit operation is a pair (x, y) ∈ (Σ ∪ {−})² \ {(−, −)}. We call (x, y)
• substitution iff x ≠ − and y ≠ −
• deletion iff y = −
• insertion iff x = −
For sequences a, b, write a →(x,y) b iff a is transformed to b by operation (x, y). Furthermore, write a ⇒S b iff a is transformed to b by a sequence of edit operations S.

Example
ACCCGA →(C,−) ACCGA →(G,T) ACCTA →(−,T) ATCCTA
ACCCGA ⇒(C,−),(G,T),(−,T) ATCCTA

Recall: − ∉ Σ; a, b are sequences in Σ∗


Edit Distance: Cost and Problem Definition

Definition (Cost, Edit Distance)
Let w : (Σ ∪ {−})² → R, such that w(x, y) is the cost of an edit operation (x, y). The cost of a sequence of edit operations S = e1, …, en is

w̃(S) = ∑_{i=1}^{n} w(ei).

The edit distance of sequences a and b is

dw(a, b) = min{ w̃(S) | a ⇒S b }.

Is the definition reasonable?

Definition (Metric)
A function d : X² → R is called a metric iff
1.) d(x, y) = 0 iff x = y, 2.) d(x, y) = d(y, x), 3.) d(x, y) ≤ d(x, z) + d(z, y).
Remarks: 1.) for a metric d, d(x, y) ≥ 0; 2.) dw is a metric iff w(x, y) ≥ 0; 3.) in the following, assume dw is a metric.

Remarks
• Natural 'evolution-motivated' problem definition.
• Not obvious how to compute the edit distance efficiently ⇒ define alignment distance.

Alignment Distance

Definition (Alignment)
A pair of words a*, b* ∈ (Σ ∪ {−})∗ is called an alignment of sequences a and b (a* and b* are called alignment strings), iff
1. |a*| = |b*|
2. for all 1 ≤ i ≤ |a*|: a*i ≠ − or b*i ≠ −
3. deleting all gap symbols − from a* yields a, and deleting all − from b* yields b

Example
a = ACGGAT
b = CCGCTT
Possible alignments are
a* = AC-GG-AT     a* = ACGG---AT
b* = -CCGCT-T  or b* = --CCGCT-T  or … (exponentially many)
Edit operations of the first alignment: (A,-),(-,C),(G,C),(-,T),(A,-)

Alignment Distance

Definition (Cost of Alignment, Alignment Distance)
The cost of the alignment (a*, b*), given a cost function w on edit operations, is

w(a*, b*) = ∑_{i=1}^{|a*|} w(a*i, b*i).

The alignment distance of a and b is

Dw(a, b) = min{ w(a*, b*) | (a*, b*) is an alignment of a and b }.

Alignment Distance = Edit Distance

Theorem (Equivalence of Edit and Alignment Distance)
For metric w, dw(a, b) = Dw(a, b).

Remarks
• Proof idea:
  dw(a, b) ≤ Dw(a, b): an alignment yields a sequence of edit operations.
  Dw(a, b) ≤ dw(a, b): a sequence of edit operations yields an equal or better alignment (needs the triangle inequality).
• Reduces edit distance to alignment distance.
• We will see: the alignment distance can be computed efficiently by dynamic programming (using Bellman's Principle of Optimality).

Principle of Optimality and Dynamic Programming

Principle of Optimality:
'Optimal solutions consist of optimal partial solutions.'

Example: Shortest Path

Idea of Dynamic Programming (DP):
• Solve partial problems first and materialize the results
• (Recursively) solve larger problems based on smaller ones

Remarks
• The principle is valid for the alignment distance problem
• The Principle of Optimality enables the programming method DP
• Dynamic programming is widely used in Computational Biology and you will meet it quite often in this class

Alignment Matrix

Idea: choose alignment distances of prefixes a1..i and b1..j as partial solutions and define a matrix of these partial solutions.

Let n := |a|, m := |b|.

Definition (Alignment matrix)
The alignment matrix of a and b is the (n + 1) × (m + 1) matrix D := (Dij)_{0≤i≤n, 0≤j≤m} defined by

Dij := Dw(a1..i, b1..j)
     = min{ w(a*, b*) | (a*, b*) is an alignment of a1..i and b1..j }.

Notational remarks
• ai is the i-th character of a
• ax..y is the sequence ax ax+1 … ay (a subsequence of a)
• by convention, ax..y = ε if x > y

Alignment Matrix Example

Example
• a = AT, b = AAGT
• w(x, y) = 0 iff x = y, 1 otherwise

        A  A  G  T
     0  1  2  3  4
A    1  0  1  2  3
T    2  1  1  2  2

Remark: The alignment matrix D contains the alignment distance (= edit distance) of a and b in Dn,m.

Needleman-Wunsch Algorithm

Claim
For (a*, b*) an alignment of a and b with length r = |a*|,
w(a*, b*) = w(a*1..r−1, b*1..r−1) + w(a*r, b*r).

Theorem
For the alignment matrix D of a and b, it holds that
• D0,0 = 0
• for all 1 ≤ i ≤ n: Di,0 = ∑_{k=1}^{i} w(ak, −) = Di−1,0 + w(ai, −)
• for all 1 ≤ j ≤ m: D0,j = ∑_{k=1}^{j} w(−, bk) = D0,j−1 + w(−, bj)
• Dij = min of
    Di−1,j−1 + w(ai, bj)   (match)
    Di−1,j + w(ai, −)      (deletion)
    Di,j−1 + w(−, bj)      (insertion)

Remark: The theorem claims that each prefix alignment distance can be computed from a constant number of smaller ones.
Proof: Induction over i + j.

Needleman-Wunsch Algorithm (Pseudocode)

D0,0 := 0
for i := 1 to n do
  Di,0 := Di−1,0 + w(ai, −)
end for
for j := 1 to m do
  D0,j := D0,j−1 + w(−, bj)
end for
for i := 1 to n do
  for j := 1 to m do
    Di,j := min( Di−1,j−1 + w(ai, bj),
                 Di−1,j + w(ai, −),
                 Di,j−1 + w(−, bj) )
  end for
end for
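
As a cross-check, the pseudocode transcribes to runnable Python as follows; this is a sketch, with the cost function w as a parameter (the unit-cost default is an assumption for the example):

# Needleman-Wunsch alignment distance, a direct transcription of the
# pseudocode above; returns the whole matrix D.
def nw_distance(a, b, w=lambda x, y: 0 if x == y else 1):
    n, m = len(a), len(b)
    D = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = D[i - 1][0] + w(a[i - 1], '-')   # column of deletions
    for j in range(1, m + 1):
        D[0][j] = D[0][j - 1] + w('-', b[j - 1])   # row of insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(D[i - 1][j - 1] + w(a[i - 1], b[j - 1]),  # match
                          D[i - 1][j] + w(a[i - 1], '-'),           # deletion
                          D[i][j - 1] + w('-', b[j - 1]))           # insertion
    return D

D = nw_distance("AT", "AAGT")
assert D[2][4] == 2   # matches the example matrix above
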
Back to Example

Example
• a = AT, b = AAGT
• w(x, y) = 0 iff x = y, 1 otherwise

Filling the matrix row by row:

        A  A  G  T
     0  1  2  3  4
A    1  0  1  2  3
T    2  1  1  2  2

Open: how to find the best alignment?

Traceback

w(x, y) = 0 iff x = y, 1 otherwise

        A  A  G  T
     0  1  2  3  4
A    1  0  1  2  3
T    2  1  1  2  2

Remarks
• Start in (n, m). For every (i, j) determine the optimal case.
• Not necessarily unique.
• The sequence of trace arrows lets us infer the best alignment.
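
On top of the matrix returned by nw_distance above, the traceback can be sketched as follows (the tie-breaking order match, then deletion, then insertion is an arbitrary choice):

# Reconstruct one optimal alignment from the filled matrix D.
def traceback(a, b, D, w=lambda x, y: 0 if x == y else 1):
    a_row, b_row = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + w(a[i - 1], b[j - 1]):
            a_row.append(a[i - 1]); b_row.append(b[j - 1]); i -= 1; j -= 1  # match
        elif i > 0 and D[i][j] == D[i - 1][j] + w(a[i - 1], '-'):
            a_row.append(a[i - 1]); b_row.append('-'); i -= 1               # deletion
        else:
            a_row.append('-'); b_row.append(b[j - 1]); j -= 1               # insertion
    return ''.join(reversed(a_row)), ''.join(reversed(b_row))

# one optimal alignment of cost 2, e.g. ('-A-T', 'AAGT')
print(traceback("AT", "AAGT", nw_distance("AT", "AAGT")))
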
Complexity

• compute one entry: three cases, i.e. constant time
• nm entries ⇒ fill matrix in O(nm) time
• traceback: O(n + m) time
• TOTAL: O(n²) time and space (assuming m ≤ n)

Remarks
• assuming m ≤ n is w.l.o.g., since we can exchange a and b
• space complexity can be improved to O(n) for the computation of the distance (simple: "store only current and last row") and for the traceback (more involved: the Hirschberg algorithm uses "Divide and Conquer" for computing the trace)

Plan

• We have seen how to compute the pairwise edit distance and the corresponding optimal alignment.
• Before going multiple, we will look at two further special topics for pairwise alignment:
  • more realistic, non-linear gap cost and
  • similarity scores and local alignment

Alignment Cost Revisited

Motivation:
• The alignments
  GA--T      G-A-T
  GAAGT  and GAAGT
  have the same edit distance.
• The first one is biologically more reasonable: it is more likely that evolution introduces one large gap than two small ones.
• This means: gap cost should be non-linear, sub-additive!

Gap Penalty

Definition (Gap Penalty)
A gap penalty is a function g : N → R that is sub-additive, i.e.

g(k + l) ≤ g(k) + g(l).

A gap in an alignment string a* is a substring of a* that consists of only gap symbols − and is maximally extended. ∆a* is the multiset of gaps in a*.
The alignment cost with gap penalty g of (a*, b*) is

wg(a*, b*) = ∑_{1≤r≤|a*|, a*r≠−, b*r≠−} w(a*r, b*r)   (cost of mismatches)
           + ∑_{x ∈ ∆a* ⊎ ∆b*} g(|x|)                 (cost of gaps)

Example:
a* = ATG---CGAC--GC
b* = -TGCGGCG-CTTTC   ⇒ ∆a* = {---, --}, ∆b* = {-, -}

General sub-additive gap penalty

Theorem
Let D be the alignment matrix of a and b with cost w and gap penalty g, i.e. Dij is the minimal cost wg over alignments of a1..i and b1..j. Then:
• D0,0 = 0
• for all 1 ≤ i ≤ n: Di,0 = g(i)
• for all 1 ≤ j ≤ m: D0,j = g(j)
• Dij = min of
    Di−1,j−1 + w(ai, bj)          (match)
    min_{1≤k≤i} Di−k,j + g(k)     (deletion of length k)
    min_{1≤k≤j} Di,j−k + g(k)     (insertion of length k)

Remarks
• Complexity: O(n³) time, O(n²) space
• pseudocode, correctness, traceback left as exercise
• much more realistic, but significantly more expensive than Needleman-Wunsch ⇒ can we improve it?
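
To make the O(n³) claim concrete, a minimal Python sketch of the recursion (this is not the exercise solution for correctness and traceback; names and the example penalty g(k) = 2 + k are illustrative):

# O(n^3) DP for a general sub-additive gap penalty g; w and g are parameters.
def general_gap_distance(a, b, w, g):
    n, m = len(a), len(b)
    D = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = g(i)                 # one leading gap of length i
    for j in range(1, m + 1):
        D[0][j] = g(j)                 # one leading gap of length j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(
                D[i - 1][j - 1] + w(a[i - 1], b[j - 1]),           # match
                min(D[i - k][j] + g(k) for k in range(1, i + 1)),  # deletion of length k
                min(D[i][j - k] + g(k) for k in range(1, j + 1)),  # insertion of length k
            )
    return D[n][m]

# demo with an affine-shaped sub-additive penalty g(k) = 2 + k
print(general_gap_distance("GAT", "GAAGT", lambda x, y: 0 if x == y else 1, lambda k: 2 + k))
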
Affine gap cost

Definition
A gap penalty is affine iff there are real constants α and β such that for all k ∈ N: g(k) = α + βk.

Remarks
• Affine gap penalties are almost as good as general ones: distinguishing gap opening (α) and gap extension cost (β) is "biologically reasonable".
• The minimal alignment cost with affine gap penalty can be computed in O(n²) time! (Gotoh algorithm)

Gotoh algorithm: sketch only

In addition to the alignment matrix D, define two further matrices/states:
• Ai,j := cost of the best alignment of a1..i, b1..j that ends with a deletion (ai over −)
• Bi,j := cost of the best alignment of a1..i, b1..j that ends with an insertion (− over bj)

Recursions:
Ai,j = min( Ai−1,j + β,            (deletion extension)
            Di−1,j + g(1) )        (deletion opening)
Bi,j = min( Bi,j−1 + β,            (insertion extension)
            Di,j−1 + g(1) )        (insertion opening)
Dij  = min( Di−1,j−1 + w(ai, bj),  (match)
            Ai,j,                  (deletion closing)
            Bi,j )                 (insertion closing)

Remark: O(n²) time and space
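
The recursions translate into the following Python sketch, assuming g(k) = α + βk; INF marks boundary states that cannot end in a gap:

INF = float("inf")

# Gotoh's O(n^2) affine-gap algorithm (distance version), a sketch.
def gotoh(a, b, w, alpha, beta):
    n, m = len(a), len(b)
    D = [[0.0] * (m + 1) for _ in range(n + 1)]
    A = [[INF] * (m + 1) for _ in range(n + 1)]   # best cost ending in a deletion
    B = [[INF] * (m + 1) for _ in range(n + 1)]   # best cost ending in an insertion
    for i in range(1, n + 1):
        A[i][0] = alpha + beta * i                # leading gap in b, g(i)
        D[i][0] = A[i][0]
    for j in range(1, m + 1):
        B[0][j] = alpha + beta * j                # leading gap in a, g(j)
        D[0][j] = B[0][j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            A[i][j] = min(A[i - 1][j] + beta,               # deletion extension
                          D[i - 1][j] + alpha + beta)       # deletion opening, g(1)
            B[i][j] = min(B[i][j - 1] + beta,               # insertion extension
                          D[i][j - 1] + alpha + beta)       # insertion opening, g(1)
            D[i][j] = min(D[i - 1][j - 1] + w(a[i - 1], b[j - 1]),  # match
                          A[i][j], B[i][j])                         # close a gap
    return D[n][m]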


Similarity

Definition (Similarity)
The similarity of an alignment (a*, b*) is

s(a*, b*) = ∑_{i=1}^{|a*|} s(a*i, b*i),

where s : (Σ ∪ {−})² → R is a similarity function such that for x ∈ Σ: s(x, x) > 0, s(x, −) < 0, s(−, x) < 0.

Observation: Instead of minimizing alignment cost, one can maximize similarity:

Sij = max( Si−1,j−1 + s(ai, bj),
           Si−1,j + s(ai, −),
           Si,j−1 + s(−, bj) )

Motivation:
• defining similarity of 'building blocks' could be more natural, e.g. similarity of amino acids
• similarity is useful for local alignment

Local Alignment Motivation

Local alignment asks for the best alignment of any two subsequences of a and b. Important application: search! (e.g. BLAST combines heuristics and local alignment)

Example
a = AWGVIACAILAGRS
b = VIVTAIAVAGYY

In contrast, all previous methods compute "global alignments".

Why is distance not useful?

Example
a) XXXAAXXXX   b) XXAAAAAXXXX
   YYAAYY         YYYAAAAAYY

Where is the stronger local motif? Only similarity can distinguish.

Local Alignment

Definition (Local Alignment Problem)
Let s be a similarity on alignments.

Sglobal(a, b) := max { s(a*, b*) | (a*, b*) alignment of a and b }    (global similarity)

Slocal(a, b) := max_{1≤i′<i≤n, 1≤j′<j≤m} Sglobal(ai′..i, bj′..j)      (local similarity)

The local alignment problem is to compute Slocal(a, b).

Remarks
• That is, local alignment asks for the subsequences of a and b that have the best alignment.
• How would we define the local alignment matrix for DP?
• For example, why does "Hi,j := Slocal(a1..i, b1..j)" not work?

Local Alignment Matrix

Definition
The local alignment matrix H of a and b is (Hi,j)_{0≤i≤n, 0≤j≤m} defined by

Hi,j := max_{0≤i′≤i, 0≤j′≤j} Sglobal(ai′+1..i, bj′+1..j).

Remarks
• Slocal(a, b) = max_{i,j} Hi,j (!)
• all entries Hi,j ≥ 0, since Sglobal(ε, ε) = 0
• Hi,j = 0 implies that no subsequences of a and b that end at the respective i and j are similar
• allows a case distinction / the principle of optimality holds!

Local Alignment Algorithm — Case Distinction

Cases for Hi,j, by the last alignment column:
1.) (ai over bj)   2.) (ai over −)   3.) (− over bj)
4.) 0, since if each of the above cases is dissimilar (i.e. negative), there is still (ε, ε).

Local Alignment Algorithm (Smith-Waterman Algorithm)

Theorem
For the local alignment matrix H of a and b,
• H0,0 = 0
• for all 1 ≤ i ≤ n: Hi,0 = 0
• for all 1 ≤ j ≤ m: H0,j = 0
• Hij = max of
    0                       (empty alignment)
    Hi−1,j−1 + s(ai, bj)
    Hi−1,j + s(ai, −)
    Hi,j−1 + s(−, bj)

Local Alignment Remarks

Remarks
• Complexity: O(n²) time and space; again, the space complexity can be improved
• Requires that the similarity function is centered around zero, i.e. positive = similar, negative = dissimilar
• Extension to affine gap cost works
• Traceback?

Local Alignment Example

Example
• a = AAC, b = ACAA
• s(x, y) = 2 iff x = y, −3 otherwise

        A  C  A  A
     0  0  0  0  0
A    0  2  0  2  2
A    0  2  0  2  4
C    0  0  4  1  1

Traceback: start at the maximum entry, trace back to the first 0 entry.

Substitution/Similarity Matrices

• In practice: use similarity matrices learned from closely related sequences or multiple alignments
  • PAM (Percent Accepted Mutations) for proteins
  • BLOSUM (BLOcks of Amino Acid SUbstitution) for proteins
  • RIBOSUM for RNA
• Scores are (scaled) log odds scores: log( Pr[x, y | Related] / Pr[x, y | Background] )

For example, BLOSUM62: [matrix figure]

Multiple Alignment

Example: Sequence Alignment
a(1) = ACCCGAG                ACCCGA-G-
a(2) = ACTACC    ⇒ align A =  AC--TAC-C
a(3) = TCCTACGG               TCC-TACGG

Definition
A multiple alignment A of K sequences a(1), …, a(K) is a K × N matrix (Ai,j)_{1≤i≤K, 1≤j≤N} (N is the number of columns of A) where
1. each entry Ai,j ∈ (Σ ∪ {−})
2. for each row i: deleting all gaps from (Ai,1 … Ai,N) yields a(i)
3. no column j contains only gap symbols

How to Score Multiple Alignments

As for pairwise alignment:
• Assume columns are scored independently
• Score is the sum over alignment columns

S(A) = ∑_{j=1}^{N} s(A1j, …, AKj)

Example (columns of the alignment above):
S(A) = s(A,A,T) + s(C,C,C) + s(C,−,C) + s(C,−,−) + … + s(−,C,G)

How do we know the similarities? How to define s(x, y, z)?
As log odds: s(x, y, z) = log( Pr[x, y, z | Related] / Pr[x, y, z | Background] )?
Problems? Can we learn similarities for triples, 4-tuples, …?

Sum-Of-Pairs Score

Idea: approximate column scores by pairwise scores

s(x1, …, xK) = ∑_{1≤k<l≤K} s(xk, xl)

Sum-of-pairs is the most commonly used scoring scheme for multiple alignments.
(Extensible to gap penalties, in particular affine gap cost)

Drawbacks?
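
For concreteness, a small Python sketch of sum-of-pairs scoring (it assumes a pairwise similarity s that also scores characters against the gap symbol, typically with s(−, −) = 0):

from itertools import combinations

# Sum-of-pairs score of one alignment column.
def sum_of_pairs(column, s):
    return sum(s(x, y) for x, y in combinations(column, 2))

# Whole-alignment score: sum over the columns of a K x N alignment (list of rows).
def sp_score(rows, s):
    return sum(sum_of_pairs(col, s) for col in zip(*rows))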

Optimal Multiple Alignment

Idea: use dynamic programming

Example
For 3 sequences a, b, c, use a 3-dimensional matrix (after initialization):

Si,j,k = max of
  Si−1,j−1,k−1 + s(ai, bj, ck)
  Si−1,j−1,k   + s(ai, bj, −)
  Si−1,j,k−1   + s(ai, −, ck)
  Si,j−1,k−1   + s(−, bj, ck)
  Si−1,j,k     + s(ai, −, −)
  Si,j−1,k     + s(−, bj, −)
  Si,j,k−1     + s(−, −, ck)

For K sequences use a K-dimensional matrix.
Complexity?

Heuristic Multiple Alignment: Progressive Alignment

Idea: compute optimal alignments only pairwise

Example
4 sequences a(1), a(2), a(3), a(4)
1. determine how they are related ⇒ tree, e.g. ((a(1), a(2)), (a(3), a(4)))
2. align the most closely related sequences first ⇒ (optimally) align a(1) and a(2) by DP
3. go on ⇒ (optimally) align a(3) and a(4) by DP
4. go on?! ⇒ (optimally) align the two alignments. How can we do that?
5. Done. We produced a multiple alignment of a(1), a(2), a(3), a(4).

Remarks: - Optimality is not guaranteed. Why?
         - The tree is known as the guide tree. How can we get it?

Guide tree

The guide tree determines the order of pairwise alignments in the progressive alignment scheme.
The order of the progressive alignment steps is crucial for quality!

Heuristics (a clustering sketch follows below):
1. Compute pairwise distances between all input sequences
   • align all against all
   • if needed, transform similarities to distances (e.g. Feng-Doolittle)
2. Cluster sequences by their distances, e.g. by
   • Unweighted Pair Group Method (UPGMA)
   • Neighbor Joining (NJ)
Aligning Alignments

Two (multiple) alignments A and B can be aligned by DP in the same way as two sequences.
Idea:
• An alignment is a sequence of alignment columns.
  Example:
  ACCCGA-G-
  AC--TAC-C  ≡  (A,A,T), (C,C,C), (C,−,C), …, (−,C,G)
  TCC-TACGG
• Assign a similarity to two columns from resp. A and B, e.g. s((G,C,G), (C)) by sum-of-pairs.

We can use dynamic programming, which recurses over alignment scores of prefixes of alignments.

Consequences for the progressive alignment scheme:
• Optimization is only local.
• Commits to local decisions: "Once a gap, always a gap."

Progressive Alignment — Example

IN: a(1) = ACCG, a(2) = TTGG, a(3) = TCG, a(4) = CTGG
w(x, y) = 0 iff x = y, 2 iff x = − or y = −, 3 otherwise (for mismatch)

• Compute all-against-all edit distances and cluster

Align ACCG and TTGG:     Align ACCG and TCG:
      T  T  G  G               T  C  G
   0  2  4  6  8            0  2  4  6
A  2  3  5  7  9         A  2  3  5  7
C  4  5  6  8 10         C  4  5  3  6
C  6  7  8  9 11         C  6  7  5  6
G  8  9 10  8  9         G  8  9  8  5

Align ACCG and CTGG:     Align TTGG and TCG:
      C  T  G  G               T  C  G
   0  2  4  6  8            0  2  4  6
A  2  3  5  7  9         T  2  0  3  6
C  4  2  5  8 10         T  4  2  3  6
C  6  4  5  8 11         G  6  5  5  3
G  8  7  7  5  8         G  8  8  8  5

Align TTGG and CTGG:     Align TCG and CTGG:
      C  T  G  G               C  T  G  G
   0  2  4  6  8            0  2  4  6  8
T  2  3  2  5  8         T  2  3  2  5  8
T  4  5  3  5  8         C  4  2  5  5  8
G  6  7  6  3  5         G  6  5  5  5  5
G  8  9  9  6  3

⇒ distance matrix

       a(1)  a(2)  a(3)  a(4)
a(1)     0     9     5     8
a(2)           0     5     3
a(3)                 0     5
a(4)                       0

⇒ Cluster (e.g. UPGMA): a(2) and a(4) are closest; then a(1) and a(3)

⇒ guide tree ((a(2), a(4)), (a(1), a(3)))

• Align a(2) and a(4):  TTGG    Align a(1) and a(3):  ACCG
                        CTGG                          -TCG
• Align the alignments!

Align TTGG with ACCG:
      CTGG      -TCG

          A    C    C    G
          -    T    C    G
      0    4   12   20   28
TC    8   10    .    .    .
TT   16    .    .    .    .
GG   24    .    .    .    .
GG   32    .    .    .    .

Column-pair costs, e.g.:
• w(TC, −−) = w(T,−) + w(C,−) + w(T,−) + w(C,−) = 8
• w(−−, A−) = w(−,A) + w(−,−) + w(−,A) + w(−,−) = 4
• w(TC, A−) = w(T,A) + w(C,A) + w(T,−) + w(C,−) = 10
• w(TC, CT) = w(T,C) + w(C,C) + w(T,T) + w(C,T) = 6
• …

After filling and traceback:
      TTGG
      CTGG
=⇒    ACCG
      -TCG
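
The column-pair cost used above can be sketched as follows; it sums w over all pairs with one character from each column, matching the computations in the example (names are illustrative):

# Cost of aligning column colA (from alignment A) with column colB (from B).
def column_pair_cost(colA, colB, w):
    return sum(w(x, y) for x in colA for y in colB)

w = lambda x, y: 0 if x == y else (2 if '-' in (x, y) else 3)
assert column_pair_cost("TC", "--", w) == 8    # matches the example
assert column_pair_cost("TC", "A-", w) == 10
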
A Classical Approach: CLUSTAL W

• prototypical progressive alignment
• similarity score with affine gap cost
• neighbor joining for tree construction
• special 'tricks' for gap handling

Advanced Progressive Alignment in MUSCLE

[Flowchart: 1.) alignment draft, 2.) reestimation, 3.) iterative refinement]

Consistency-based scoring in T-Coffee

• Progressive alignment + consistency heuristic
• Avoid mistakes when optimizing locally by modifying the scores ("library extension")
• Modified scores reflect global consistency
• Details of the consistency transformation: below
• Merges local and global alignments

Consistency Transformation
• For each sequence triplet: strengthen compatible edges
• This moves global information into the scores
• Consistency-based scores guide pairwise alignments towards (global) consistency

[Figures: misalignment by the standard procedure; correct alignment after library extension; all-to-all alignments for weighting]

Alignment Profiles

Alignment:    Profile:
ACGG-               col1  col2  col3  col4  col5
ACCG-         A:    0.75  0     0     0     0
AC-G-         C:    0     1     0.5   0     0
TCCGG         G:    0     0     0.25  1     0.25
              T:    0.25  0     0     0     0
Consensus:
ACCG-

Remarks
• A profile of a multiple alignment consists of character frequency vectors for each column.
• The profile describes the sequences of the alignment in a rigid way.
• Modeling insertions/deletions requires profile HMMs.
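
A small sketch of computing a profile in Python (gap frequencies are omitted, as in the example table):

from collections import Counter

# Per-column character frequencies of a multiple alignment (list of rows).
def profile(rows, alphabet="ACGT"):
    cols = list(zip(*rows))    # columns of the alignment
    return [{c: Counter(col)[c] / len(col) for c in alphabet} for col in cols]

P = profile(["ACGG-", "ACCG-", "AC-G-", "TCCGG"])
assert P[0]["A"] == 0.75 and P[2]["C"] == 0.5   # matches the example table
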
Hidden Markov Models (HMMs)

Example of a simple HMM

[Figure: two hidden weather states S (sun) and R (rain), with transitions 0.8 (S→S), 0.2 (S→R), 0.4 (R→S), 0.6 (R→R); from S the frog is observed at the top/bottom of the ladder with probabilities 2/3 and 1/3, from R with 1/6 and 5/6]
[The frog climbs the ladder more likely when the sun shines. Assume that the weather is hidden, but we can observe the frog.]

• Idea: the probability of an observation depends on a hidden state, where there are specific probabilities to change states.
• Hidden Markov Models generate observation sequences (e.g. TBTTT) according to an (encoded) probability distribution.
• One can compute things like the "most probable path given an observation sequence", … (no details here)

Profile HMMs

• Profile HMMs describe the (probability distribution of) sequences in a multiple alignment (observation ≡ sequence).
• hidden states = insertion (Ii), match (Mi), deletion (Di) in relation to the consensus (state sequence ≡ alignment string)

Alignment:
ACGG-
ACCG-
AC-G-
TCCGG
Consensus:
ACCG-

Remarks
• Profile HMMs are used to search for sequences that are similar to the sequences of a given alignment (Pfam, HMMer)
• Profile HMMs can be used to construct multiple alignments
• We come back to HMMs when we discuss SCFGs.
