Forward–backward algorithm
The forward and backward steps may also be called "forward message pass" and "backward message pass" - these terms are due to the message passing used in general belief propagation approaches. At each single observation in the sequence, probabilities to be used for calculations at the next observation are computed. The smoothing step can be calculated simultaneously during the backward pass. This step allows the algorithm to take into account any past observations of output for computing more accurate results.

In the first pass, the forward–backward algorithm computes a set of forward probabilities which provide, for all k ∈ {1, ..., t}, the probability of ending up in any particular state given the first k observations in the sequence, i.e. P(X_k | o_{1:k}). In the second pass, the algorithm computes a set of backward probabilities which provide the probability of observing the remaining observations given any starting point k, i.e. P(o_{k+1:t} | X_k). These two sets of probability distributions can then be combined to obtain the distribution over states at any specific point in time given the entire observation sequence:
    P(X_k | o_{1:t}) = P(X_k | o_{1:k}, o_{k+1:t}) ∝ P(o_{k+1:t} | X_k) P(X_k | o_{1:k})

The last step follows from an application of Bayes' rule and the conditional independence of o_{k+1:t} and o_{1:k} given X_k.

As outlined above, the algorithm involves three steps:

1. computing forward probabilities
2. computing backward probabilities
3. computing smoothed values

2 Forward probabilities

The description that follows uses as a running example the transition matrix

    T = ( 0.7 0.3 )
        ( 0.3 0.7 )

In a typical Markov model, we would multiply a state vector by this matrix to obtain the probabilities for the subsequent state. In a hidden Markov model the state is unknown, and we instead observe events associated with the possible states. An event matrix of the form

    B = ( 0.9 0.1 )
        ( 0.2 0.8 )

provides the probabilities for observing events given a particular state. In this example, event 1 will be observed 90% of the time if we are in state 1, while event 2 has a 10% probability of occurring in this state. In contrast, event 1 will only be observed 20% of the time if we are in state 2, and event 2 has an 80% chance of occurring. Given a state vector π, the probability of observing event j is then:

    P(O = j) = Σ_i π_i b_{i,j}

This can be represented in matrix form by multiplying the state vector π by an observation matrix O_j = diag(B_{·,o_j}) containing only diagonal entries. Each entry is the probability of the observed event given each state. Continuing the above example, an observation of event 1 would be:

    O_1 = ( 0.9 0.0 )
          ( 0.0 0.2 )

This allows us to calculate the probabilities associated with transitioning to a new state and observing the given event as:

    f_{0:1} = π T O_1

The resulting probability vector can be scaled at each step by a factor c_t chosen so that its entries sum to 1. The product of these scaling factors is the total probability for observing the given sequence:

    P(o_1, o_2, ..., o_t | π) = Π_{s=1}^{t} c_s

This allows us to interpret the scaled probability vector as:

    f̂_{0:t}(i) = f_{0:t}(i) / Π_{s=1}^{t} c_s
              = P(o_1, o_2, ..., o_t, X_t = x_i | π) / P(o_1, o_2, ..., o_t | π)
              = P(X_t = x_i | o_1, o_2, ..., o_t, π)

We thus find that the product of the scaling factors provides us with the total probability for observing the given sequence up to time t and that the scaled probability vector provides us with the probability of being in each state at this time.

3 Backward probabilities

A similar procedure can be constructed to find backward probabilities. These intend to provide the probabilities:

    b_{t:T}(i) = P(o_{t+1}, o_{t+2}, ..., o_T | X_t = x_i)

That is, we now want to assume that we start in a particular state (X_t = x_i), and we are now interested in the probability of observing all future events from this state. Since the initial state is assumed as given (i.e. the prior probability of this state = 100%), we begin with

    b_{T:T} = [1 1 ... 1]^T

and proceed backwards using the scaled recursion

    b̂_{t-1:T} = c_t^{-1} T O_t b̂_{t:T}

where b̂_{t:T} represents the previous, scaled vector. The result is that the scaled probability vector is related to the backward probabilities by:

    b̂_{t:T}(i) = b_{t:T}(i) / Π_{s=t+1}^{T} c_s

This is useful because it allows us to find the total probability of being in each state at a given time, t, by multiplying these values:

    γ_t(i) = P(X_t = x_i | o_1, ..., o_T, π)
           = f_{0:t}(i) b_{t:T}(i) / P(o_1, ..., o_T | π)
           = f_{0:t}(i) b_{t:T}(i) / Π_{s=1}^{T} c_s
           = f̂_{0:t}(i) b̂_{t:T}(i)
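This smoothing identity can be checked numerically. The sketch below (pure Python; the three-event observation sequence and the uniform prior are assumptions chosen for illustration) computes the scaled forward and backward vectors for the two-state model above, renormalizes their elementwise products, and compares the result against a brute-force posterior obtained by enumerating all state paths:

```python
from itertools import product

T = [[0.7, 0.3], [0.3, 0.7]]   # transition matrix
B = [[0.9, 0.1], [0.2, 0.8]]   # event matrix
pi = [0.5, 0.5]                # uniform prior over the initial state (assumed)
obs = [0, 1, 0]                # events, 0-indexed: event 1, event 2, event 1 (assumed)
N, L = 2, len(obs)

# Scaled forward vectors f-hat, each normalised to sum to 1.
f, fwd = pi[:], []
for o in obs:
    f = [B[i][o] * sum(T[j][i] * f[j] for j in range(N)) for i in range(N)]
    s = sum(f)
    fwd.append([v / s for v in f])
    f = fwd[-1]

# Scaled backward vectors b-hat, starting from b_{T:T} = (1, 1).
b, bkw = [1.0, 1.0], [[1.0, 1.0]]
for o in reversed(obs[1:]):
    b = [sum(T[i][j] * B[j][o] * b[j] for j in range(N)) for i in range(N)]
    s = sum(b)
    bkw.insert(0, [v / s for v in b])
    b = bkw[0]

# Smoothed values: normalise the elementwise product f-hat * b-hat once more.
gamma = []
for t in range(L):
    g = [fwd[t][i] * bkw[t][i] for i in range(N)]
    s = sum(g)
    gamma.append([v / s for v in g])

# Brute-force posterior for comparison: enumerate every state path.
post, total = [[0.0] * N for _ in range(L)], 0.0
for path in product(range(N), repeat=L):
    p = sum(pi[x] * T[x][path[0]] for x in range(N)) * B[path[0]][obs[0]]
    for t in range(1, L):
        p *= T[path[t - 1]][path[t]] * B[path[t]][obs[t]]
    total += p
    for t in range(L):
        post[t][path[t]] += p
gamma_bf = [[post[t][i] / total for i in range(N)] for t in range(L)]

assert all(abs(gamma[t][i] - gamma_bf[t][i]) < 1e-12
           for t in range(L) for i in range(N))
```

Note that the check succeeds even though the backward vectors are normalised with their own constants rather than the forward c_t's: at each time step the product is only proportional to the true posterior, so the final per-step renormalization absorbs the mismatch.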
4 Example

This example uses the transition and event matrices introduced above, which we will represent in our calculations as:

    T = ( 0.7 0.3 )        B = ( 0.9 0.1 )
        ( 0.3 0.7 )            ( 0.2 0.8 )

The observation sequence consists of event 1 at steps 1, 2, 4 and 5 and event 2 at step 3, giving the observation matrices:

    O_1 = O_2 = O_4 = O_5 = ( 0.9 0.0 )        O_3 = ( 0.1 0.0 )
                            ( 0.0 0.2 )              ( 0.0 0.8 )
Starting from the uniform prior f̂_{0:0} = (0.5000, 0.5000)^T and writing the scaled forward updates with column vectors as f̂_{0:t} = c_t^{-1} O_t T^T f̂_{0:t-1}, the forward calculations are:

    f̂_{0:0} = ( 0.5000, 0.5000 )^T
    f̂_{0:1} = c_1^{-1} O_1 T^T f̂_{0:0} = c_1^{-1} ( 0.4500, 0.1000 )^T = ( 0.8182, 0.1818 )^T
    f̂_{0:2} = c_2^{-1} O_2 T^T f̂_{0:1} = c_2^{-1} ( 0.5645, 0.0745 )^T = ( 0.8834, 0.1166 )^T
    f̂_{0:3} = c_3^{-1} O_3 T^T f̂_{0:2} = c_3^{-1} ( 0.0653, 0.2772 )^T = ( 0.1907, 0.8093 )^T
    f̂_{0:4} = c_4^{-1} O_4 T^T f̂_{0:3} = c_4^{-1} ( 0.3386, 0.1247 )^T = ( 0.7308, 0.2692 )^T
    f̂_{0:5} = c_5^{-1} O_5 T^T f̂_{0:4} = c_5^{-1} ( 0.5331, 0.0815 )^T = ( 0.8673, 0.1327 )^T
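The forward pass above can be reproduced in a few lines of plain Python (a sketch; states and events are 0-indexed):

```python
T = [[0.7, 0.3], [0.3, 0.7]]   # transition matrix
B = [[0.9, 0.1], [0.2, 0.8]]   # event matrix
obs = [0, 0, 1, 0, 0]          # events 1, 1, 2, 1, 1 (0-indexed)

f = [0.5, 0.5]                 # uniform prior f_{0:0}
history = [f]
for o in obs:
    # f_{0:t} = O_t T^T f_{0:t-1}, then scale so the entries sum to 1
    f = [B[i][o] * sum(T[j][i] * f[j] for j in range(2)) for i in range(2)]
    c = sum(f)
    f = [v / c for v in f]
    history.append(f)

print([round(v, 4) for v in history[-1]])  # → [0.8673, 0.1327]
```

Each entry of `history` matches the corresponding f̂_{0:t} vector in the table above to four decimal places.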
We are then able to compute the backward probabilities (using the observations in reverse order and normalizing with different constants; α denotes normalization so that the entries sum to 1):

    b̂_{5:5} = ( 1.0000, 1.0000 )^T
    b̂_{4:5} = α T O_5 b̂_{5:5} = α ( 0.6900, 0.4100 )^T = ( 0.6273, 0.3727 )^T
    b̂_{3:5} = α T O_4 b̂_{4:5} = α ( 0.4175, 0.2215 )^T = ( 0.6533, 0.3467 )^T
    b̂_{2:5} = α T O_3 b̂_{3:5} = α ( 0.1289, 0.2138 )^T = ( 0.3763, 0.6237 )^T
    b̂_{1:5} = α T O_2 b̂_{2:5} = α ( 0.2745, 0.1889 )^T = ( 0.5923, 0.4077 )^T
    b̂_{0:5} = α T O_1 b̂_{1:5} = α ( 0.3976, 0.2170 )^T = ( 0.6469, 0.3531 )^T

Finally, we will compute the smoothed probability values. These results must also be scaled so that their entries sum to 1, because we did not scale the backward probabilities with the c_t's found earlier. The backward probability vectors above thus actually represent the likelihood of each state at time t given the future observations. Because these vectors are proportional to the actual backward probabilities, the result has to be scaled an additional time. Writing the elementwise product as ∘:

    γ_0 = α ( f̂_{0:0} ∘ b̂_{0:5} ) = α ( 0.3235, 0.1765 )^T = ( 0.6469, 0.3531 )^T
    γ_1 = α ( f̂_{0:1} ∘ b̂_{1:5} ) = α ( 0.4846, 0.0741 )^T = ( 0.8673, 0.1327 )^T
    γ_2 = α ( f̂_{0:2} ∘ b̂_{2:5} ) = α ( 0.3324, 0.0728 )^T = ( 0.8204, 0.1796 )^T
    γ_3 = α ( f̂_{0:3} ∘ b̂_{3:5} ) = α ( 0.1246, 0.2806 )^T = ( 0.3075, 0.6925 )^T
    γ_4 = α ( f̂_{0:4} ∘ b̂_{4:5} ) = α ( 0.4584, 0.1003 )^T = ( 0.8204, 0.1796 )^T
    γ_5 = α ( f̂_{0:5} ∘ b̂_{5:5} ) = α ( 0.8673, 0.1327 )^T = ( 0.8673, 0.1327 )^T

Notice that the value of γ_0 is equal to b̂_{0:5} and that γ_5 is equal to f̂_{0:5}. This follows naturally because both f̂_{0:5} and b̂_{0:5} begin with uniform priors over the initial and final state vectors (respectively) and take into account all of the observations. However, γ_0 will only be equal to b̂_{0:5} when our initial state vector represents a uniform prior (i.e. all entries are equal). When this is not the case, b̂_{0:5} needs to be combined with the initial state vector to find the most likely initial state. We thus find that the forward probabilities by themselves are sufficient to calculate the most likely final state. Similarly, the backward probabilities can be combined with the initial state vector to provide the most probable initial state given the observations. The forward and backward probabilities need only be combined to infer the most probable states between the initial and final points.

The calculations above reveal that the most probable weather state on every day except for the third one was rain. They tell us more than this, however, as they now provide a way to quantify the probabilities of each state at different times. Perhaps most importantly, our value at γ_5 quantifies our knowledge of the state vector at the end of the observation sequence. We can then use this to predict the probability of the various weather states tomorrow, as well as the probability of observing an umbrella.

5 Performance

The brute-force procedure for the solution of this problem is the generation of all possible N^T state sequences and calculating the joint probability of each state sequence with the observed series of events. This approach has time complexity O(T · N^T), where T is the length of sequences and N is the number of symbols in the state alphabet. This is intractable for realistic problems, as the number of possible hidden node sequences typically is extremely high. However, the forward–backward algorithm has time complexity O(N^2 T).

An enhancement to the general forward–backward algorithm, called the Island algorithm, trades smaller memory usage for longer running time, taking O(N^2 T log T) time and O(N^2 log T) memory. On a computer with an unlimited number of processors, this can be reduced to O(N^2 T) total time, while still taking only O(N^2 log T) memory.[1]

In addition, algorithms have been developed to compute f̂_{0:t+1} efficiently through online smoothing, such as the fixed-lag smoothing (FLS) algorithm (Russell & Norvig 2010, Figure 15.6 p. 580).

6 Pseudocode

    ForwardBackward(guessState, sequenceIndex):
        if sequenceIndex is past the end of the sequence, return 1
        if (guessState, sequenceIndex) has been seen before, return saved result
        result = 0
        for each neighboring state n:
            result = result + (transition probability from guessState to n
                               given observation element at sequenceIndex)
                              * ForwardBackward(n, sequenceIndex + 1)
        save result for (guessState, sequenceIndex)
        return result

7 Python example

Given an HMM (just like in the Viterbi algorithm) represented in the Python programming language:

states = ('Healthy', 'Fever')
end_state = 'E'
observations = ('normal', 'cold', 'dizzy')
start_probability = {'Healthy': 0.6, 'Fever': 0.4}
transition_probability = {
    'Healthy': {'Healthy': 0.69, 'Fever': 0.3, 'E': 0.01},
    'Fever': {'Healthy': 0.4, 'Fever': 0.59, 'E': 0.01},
}
emission_probability = {
    'Healthy': {'normal': 0.5, 'cold': 0.4, 'dizzy': 0.1},
    'Fever': {'normal': 0.1, 'cold': 0.3, 'dizzy': 0.6},
}
We can write the implementation like this:

def fwd_bkw(x, states, a_0, a, e, end_st):
    L = len(x)

    # Forward part of the algorithm.
    fwd = []
    f_prev = {}
    c_prod = 1.0  # product of the forward scaling factors
    for i, x_i in enumerate(x):
        f_curr = {}
        for st in states:
            if i == 0:
                prev_f_sum = a_0[st]  # base case for the forward part
            else:
                prev_f_sum = sum(f_prev[k] * a[k][st] for k in states)
            f_curr[st] = e[st][x_i] * prev_f_sum
        sum_prob = sum(f_curr.values())
        c_prod *= sum_prob
        for st in states:
            f_curr[st] /= sum_prob  # normalising to make sum == 1
        fwd.append(f_curr)
        f_prev = f_curr
    p_fwd = c_prod * sum(f_curr[k] * a[k][end_st] for k in states)

    # Backward part of the algorithm.
    bkw = []
    b_prev = {}
    d_prod = 1.0  # product of the backward scaling factors
    for i, x_i_plus in enumerate(reversed(x[1:] + (None,))):
        b_curr = {}
        for st in states:
            if i == 0:
                b_curr[st] = a[st][end_st]  # base case for the backward part
            else:
                b_curr[st] = sum(a[st][l] * e[l][x_i_plus] * b_prev[l] for l in states)
        sum_prob = sum(b_curr.values())
        d_prod *= sum_prob
        for st in states:
            b_curr[st] /= sum_prob  # normalising to make sum == 1
        bkw.insert(0, b_curr)
        b_prev = b_curr
    p_bkw = d_prod * sum(a_0[l] * e[l][x[0]] * b_curr[l] for l in states)

    # Merging the two parts. The scaled vectors are only proportional to the
    # true forward and backward probabilities, so the product is normalised
    # once more to obtain the smoothed distribution at each time step.
    posterior = []
    for i in range(L):
        prod = {st: fwd[i][st] * bkw[i][st] for st in states}
        norm = sum(prod.values())
        posterior.append({st: prod[st] / norm for st in states})

    assert abs(p_fwd - p_bkw) < 1e-10
    return fwd, bkw, posterior
The function fwd_bkw takes the following arguments: x is the sequence of observations, e.g. ['normal', 'cold', 'dizzy']; states is the set of hidden states; a_0 is the start probability; a are the transition probabilities; and e are the emission probabilities.

For simplicity of code, we assume that the observation sequence x is non-empty and that a[i][j] and e[i][j] are defined for all states i, j.
In the running example, the forward–backward algorithm is used as follows:

def example():
    return fwd_bkw(observations,
                   states,
                   start_probability,
                   transition_probability,
                   emission_probability,
                   end_state)

for line in example():
    print(' '.join(map(str, line)))
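The memoized recursion from the pseudocode section can also be rendered directly in Python. The sketch below reuses the model above; as an assumption, the base case returns the end-state transition probability so that the recursion matches the end-state convention of fwd_bkw:

```python
from functools import lru_cache

# Model from the example above.
states = ('Healthy', 'Fever')
observations = ('normal', 'cold', 'dizzy')
start_probability = {'Healthy': 0.6, 'Fever': 0.4}
transition_probability = {
    'Healthy': {'Healthy': 0.69, 'Fever': 0.3, 'E': 0.01},
    'Fever': {'Healthy': 0.4, 'Fever': 0.59, 'E': 0.01},
}
emission_probability = {
    'Healthy': {'normal': 0.5, 'cold': 0.4, 'dizzy': 0.1},
    'Fever': {'normal': 0.1, 'cold': 0.3, 'dizzy': 0.6},
}

@lru_cache(maxsize=None)  # "save result for (guessState, sequenceIndex)"
def forward_backward(guess_state, sequence_index):
    # Past the end of the sequence: only the transition into the end state
    # remains (an assumption matching the end-state convention above).
    if sequence_index == len(observations):
        return transition_probability[guess_state]['E']
    # Sum over neighbouring states, weighting each transition by the
    # probability of emitting the observation at sequence_index.
    return sum(
        transition_probability[guess_state][n]
        * emission_probability[n][observations[sequence_index]]
        * forward_backward(n, sequence_index + 1)
        for n in states
    )

# Total probability of the observations and the final end transition,
# obtained by folding in the start distribution and first emission.
p = sum(start_probability[s]
        * emission_probability[s][observations[0]]
        * forward_backward(s, 1)
        for s in states)
print(p)  # ≈ 0.000356
```

This quantity agrees with the p_fwd and p_bkw values computed inside fwd_bkw, since all three express the joint probability of the observation sequence and the end transition.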
8 See also

Baum–Welch algorithm
Viterbi algorithm
BCJR algorithm
9 References

[1] J. Binder, K. Murphy and S. Russell. Space-Efficient Inference in Dynamic Probabilistic Networks. Int'l Joint Conf. on Artificial Intelligence, 1997.
10 External links
An interactive spreadsheet for teaching the forward
backward algorithm (spreadsheet and article with
step-by-step walk-through)
Tutorial of hidden Markov models including the
forwardbackward algorithm
Collection of AI algorithms implemented in Java
(including HMM and the forwardbackward algorithm)