Viterbi Algorithm
[Figure: HMM with two hidden states H and L. Start -> H with probability 0.5, Start -> L with probability 0.5. Emission probabilities: H emits A = 0.2, C = 0.3, G = 0.3, T = 0.2; L emits A = 0.3, C = 0.2, G = 0.2, T = 0.3. Transition probabilities: H->H = 0.5, H->L = 0.5, L->H = 0.4, L->L = 0.6.]
Observed sequence: S = GGCACTGAA
There are several paths through the hidden states (H and L) that lead to
the given sequence S.
Example: P = LLHHHHLLL
The probability that the HMM produces sequence S through path P is:
p = p_L(0) * e_L(G) * p_LL * e_L(G) * p_LH * e_H(C) * ...
  = 0.5 * 0.2 * 0.6 * 0.2 * 0.4 * 0.3 * ...
  = ...
where p_L(0) is the probability of starting in state L, e_k(i) is the probability of emitting nucleotide i in state k, and p_kl is the probability of the transition from state k to state l.
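As an illustration, the sketch below computes this path probability directly from the model parameters of the figure (Python; the dictionary and function names are ours, not from the source):

    # HMM parameters from the figure above
    start = {'H': 0.5, 'L': 0.5}
    trans = {'H': {'H': 0.5, 'L': 0.5}, 'L': {'H': 0.4, 'L': 0.6}}
    emit  = {'H': {'A': 0.2, 'C': 0.3, 'G': 0.3, 'T': 0.2},
             'L': {'A': 0.3, 'C': 0.2, 'G': 0.2, 'T': 0.3}}

    def path_probability(path, seq):
        """Probability that the HMM emits seq while following the given state path."""
        p = start[path[0]] * emit[path[0]][seq[0]]
        for i in range(1, len(seq)):
            p *= trans[path[i - 1]][path[i]] * emit[path[i]][seq[i]]
        return p

    print(path_probability("LLHHHHLLL", "GGCACTGAA"))  # the product written out above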
There are several paths through the hidden states (H and L) that lead
to the given sequence, but they do not have the same probability.
The Viterbi algorithm is a dynamic programming algorithm that
allows us to compute the most probable path. Its principle is similar to
the DP algorithms used to align two sequences (e.g. Needleman-Wunsch).
Source: Borodovsky & Ekisheva, 2006
The probability of the most probable path ending in state k with observation i at position x is:

p_k(i, x) = e_k(i) * max_l [ p_l(x-1) * p_lk ]

where e_k(i) is the probability of observing element i in state k, p_l(x-1) is the score of the most probable path ending in state l at the previous position, and p_lk is the probability of the transition from state l to state k.
In our example, the probability of the most probable path ending in state H with observation
A at the 4th position is:

p_H(A, 4) = e_H(A) * max [ p_H(3) * p_HH , p_L(3) * p_LH ]

We can thus compute recursively (from the first to the last element of our sequence) the
probability of the most probable path.
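As a quick numeric check of this single cell (a sketch; the scores p_H(3) = -8.21 and p_L(3) = -8.79 are taken from the Viterbi matrix shown below):

    from math import log2

    # Most probable path ending in H with 'A' at position 4 (log2 scale):
    # p_H(A,4) = e_H(A) + max(p_H(3) + p_HH, p_L(3) + p_LH)
    pH3, pL3 = -8.21, -8.79
    pH_A4 = log2(0.2) + max(pH3 + log2(0.5), pL3 + log2(0.4))
    print(round(pH_A4, 2))  # -11.53, matching the matrix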
[Figure: the same HMM with all probabilities converted to log2. Start->H = -1, Start->L = -1. Emissions: H: A = -2.322, C = -1.737, G = -1.737, T = -2.322; L: A = -1.737, C = -2.322, G = -2.322, T = -1.737. Transitions: H->H = -1, H->L = -1, L->H = -1.322, L->L = -0.737. Observed sequence: GGCACTGAA.]
Probability (in log2) that G at the first position was emitted by state H: p_H(G,1) = -1 + (-1.737) ≈ -2.73.
[Figure: the same log2 HMM, now highlighting the probability (in log2) that G at the 2nd position was emitted by state H.]
[Figure: the same log2 HMM.]
Viterbi matrix (log2 scores) for S = GGCACTGAA:

       G1      G2      C3      A4      C5    T6..A8    A9
  H  -2.73   -5.47   -8.21  -11.53  -14.01    ...    -25.65
  L  -3.32   -6.06   -8.79  -10.94  -14.01    ...    -24.49
We then compute iteratively the probabilities pH(i,x) and pL(i,x) that nucleotide i at position x was
emitted by state H or L, respectively. The highest probability obtained for the nucleotide at the last
position is the probability of the most probable path. This path can be retrieved by back-tracking.
[Figure: the same log2 HMM and Viterbi matrix, with back-tracking arrows starting from the highest score at the last position (-24.49, in state L).]
The most probable path, retrieved by back-tracking, is: HHHLLLLLL
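A compact Viterbi implementation (a sketch in Python under the parameters above; function and variable names are ours) reproduces both the matrix and the back-tracked path:

    from math import log2

    start = {'H': 0.5, 'L': 0.5}
    trans = {'H': {'H': 0.5, 'L': 0.5}, 'L': {'H': 0.4, 'L': 0.6}}
    emit  = {'H': {'A': 0.2, 'C': 0.3, 'G': 0.3, 'T': 0.2},
             'L': {'A': 0.3, 'C': 0.2, 'G': 0.2, 'T': 0.3}}

    def viterbi(seq):
        states = list(start)
        # v[x][k]: log2 probability of the most probable path ending in state k at position x
        v = [{k: log2(start[k]) + log2(emit[k][seq[0]]) for k in states}]
        back = []
        for x in range(1, len(seq)):
            v.append({})
            back.append({})
            for k in states:
                # best previous state l, maximizing p_l(x-1) * p_lk
                best = max(states, key=lambda l: v[x - 1][l] + log2(trans[l][k]))
                back[-1][k] = best
                v[x][k] = v[x - 1][best] + log2(trans[best][k]) + log2(emit[k][seq[x]])
        last = max(states, key=lambda k: v[-1][k])   # best final state
        path = [last]
        for pointers in reversed(back):              # back-tracking
            path.append(pointers[path[-1]])
        return v, ''.join(reversed(path))

    v, path = viterbi("GGCACTGAA")
    print(path, round(max(v[-1].values()), 2))  # HHHLLLLLL -24.49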
[Figure: the same HMM. Observed sequence: S = GGCA.]
What is the probability P(S) that this sequence S was generated by the
HMM model?
This probability P(S) is given by the sum of the probabilities p_i(S) of all
possible paths that produce this sequence.
P(S) can be computed by dynamic programming, using either the
so-called Forward or the Backward algorithm.
[Figure: the same HMM with Start; the Forward values for S = GGCA are filled in step by step.]
Here f_k(i,x) denotes the Forward variable: the total probability of emitting the first x nucleotides and ending in state k.
f_H(G,1) = 0.5*0.3 = 0.15
f_L(G,1) = 0.5*0.2 = 0.1
f_H(G,2) = 0.15*0.5*0.3 + 0.1*0.4*0.3 = 0.0345
f_L(G,2) = 0.1*0.6*0.2 + 0.15*0.5*0.2 = 0.027
f_H(C,3) = ... + ...
f_L(C,3) = ... + ...
f_H(A,4) = 0.0013767
f_L(A,4) = 0.0024665
=> The probability that the sequence S was generated by the HMM model
is thus P(S) = 0.0013767 + 0.0024665 = 0.0038432.
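A minimal Forward implementation (a sketch, repeating the same parameter dictionaries as in the Viterbi sketch above) reproduces this value:

    # Same model parameters as in the Viterbi sketch above
    start = {'H': 0.5, 'L': 0.5}
    trans = {'H': {'H': 0.5, 'L': 0.5}, 'L': {'H': 0.4, 'L': 0.6}}
    emit  = {'H': {'A': 0.2, 'C': 0.3, 'G': 0.3, 'T': 0.2},
             'L': {'A': 0.3, 'C': 0.2, 'G': 0.2, 'T': 0.3}}

    def forward(seq):
        """P(S): sum over all state paths, by dynamic programming."""
        f = {k: start[k] * emit[k][seq[0]] for k in start}
        for obs in seq[1:]:
            f = {k: emit[k][obs] * sum(f[l] * trans[l][k] for l in trans) for k in start}
        return sum(f.values())

    print(forward("GGCA"))  # ~0.0038432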
The probability that sequence S = "GGCA" was generated by the HMM model is P_HMM(S) =
0.0038432.
To assess the significance of this value, we have to compare it to the probability that
sequence S was generated by the background model (i.e. by chance).
Ex: if all nucleotides have the same probability, p_bg = 0.25, the probability of observing S by
chance is: P_bg(S) = p_bg^4 = 0.25^4 ≈ 0.0039.
Thus, for this particular example, it is likely that the sequence S does not match the HMM
model (P_bg > P_HMM).
NB: this toy model is very simple and does not reflect any biological motif. In fact,
both states H and L have emission probabilities close to the background probabilities,
which makes the model unrealistic and unsuitable for detecting specific motifs.
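The comparison itself is a one-liner; expressing it as a log2-odds score (a common convention, not shown in the source) makes the sign of the verdict explicit:

    from math import log2

    p_hmm = 0.0038432      # P_HMM(S) from the Forward computation above
    p_bg  = 0.25 ** 4      # background probability of any 4-mer
    print(log2(p_hmm / p_bg))  # negative: S is (slightly) better explained by the background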
HMM: Summary
The Viterbi algorithm is used to compute the most probable path (as well as
its probability). Given the parameters of the HMM model and a particular
output sequence, it finds the state sequence that is most likely to have
generated that output sequence. It works by finding a maximum over
all possible state sequences.
In sequence analysis, this method can be used, for example, to predict coding
vs non-coding sequences.
In fact, there are often many state sequences that can produce the same
output sequence, but with different probabilities. The probability that the
HMM model generates that output sequence is obtained by summing over all
possible state sequences. This can also be done efficiently using the Forward
algorithm (or the Backward algorithm), which is likewise a dynamic
programming algorithm.
In sequence analysis, this method can be used, for example, to compute the
probability that a particular DNA region matches an HMM motif (i.e. was emitted
by the HMM model). An HMM motif can represent, for example, a TF binding site.
HMM: Remarks
To create an HMM model (i.e. to find the most likely set of state-transition and
output probabilities for each state), we need a set of (training) sequences,
which do not need to be aligned.
No tractable algorithm is known for solving this problem exactly, but a local
maximum likelihood can be derived efficiently using the Baum-Welch
algorithm or the Baldi-Chauvin algorithm. The Baum-Welch algorithm is
an example of a forward-backward algorithm and a special case of the
Expectation-Maximization algorithm.
For more details, see Durbin et al. (1998).
HMMER
The HMMER3 package contains a set of programs (developed by S. Eddy) to build
HMM models (from a set of aligned sequences) and to use HMM models (to align
sequences or to find sequences in databases). These programs are available on the
Mobyle platform (http://mobyle.pasteur.fr/cgi-bin/MobylePortal/portal.py).