The Expectation Maximization Algorithm
Frank Dellaert
Abstract
This note represents my attempt at explaining the EM algorithm (Hartley, 1958;
Dempster et al., 1977; McLachlan and Krishnan, 1997). This is just a slight
variation on Tom Minka’s tutorial (Minka, 1998), perhaps a little easier (or perhaps
not). It includes a graphical example to provide some intuition.
1 Intuitive Explanation of EM
EM is an iterative optimization method for estimating unknown parameters Θ from measurement
data U, in the presence of "hidden" nuisance variables J that are not observed and must
be integrated out. In particular, we want to maximize the posterior
probability of the parameters Θ given the data U, marginalizing over J:
$$\Theta^* = \operatorname*{argmax}_{\Theta} \; \sum_{J \in \mathcal{J}^n} P(\Theta, J \mid U) \tag{1}$$
The intuition behind EM is an old one: alternate between estimating the unknowns
Θ and the hidden variables J. This idea has been around for a long time. However,
instead of finding the best J ∈ J^n given an estimate Θ at each iteration, EM computes a
distribution over the space J^n. One of the earliest papers on EM is (Hartley, 1958), but
the seminal reference that formalized EM and provided a proof of convergence is the
“DLR” paper by Dempster, Laird, and Rubin (Dempster et al., 1977). A recent book
devoted entirely to EM and applications is (McLachlan and Krishnan, 1997), whereas
(Tanner, 1996) is another popular and very useful reference.
One of the most insightful explanations of EM, one that provides a deeper understanding
of its operation than the intuition of alternating between variables, is in terms of lower-
bound maximization (Neal and Hinton, 1998; Minka, 1998). In this derivation, the
E-step can be interpreted as constructing a local lower-bound to the posterior distribu-
tion, whereas the M-step optimizes the bound, thereby improving the estimate for the
unknowns. This is demonstrated below for a simple example.
Figure 1: EM example: Mixture components and data. The data consists of three
samples drawn from each mixture component, shown above as circles and triangles.
The means of the mixture components are −2 and 2, respectively.
Figure 2: The true likelihood function of the two component means θ1 and θ2, given
the data in Figure 1.
[Figure 3: surface plots of the lower bound over (θ1, θ2) at successive EM iterations; the first panel is titled "i=1, Q=−3.279564".]
Consider the mixture estimation problem shown in Figure 1, where the goal is to estimate
the two component means θ1 and θ2 given 6 samples drawn from the mixture,
but without knowing from which mixture component each sample was drawn. The state space is
two-dimensional, and the true likelihood function is shown in Figure 2. Note that there
are two modes, located respectively at (−2, 2) and (2, −2). This makes perfect sense,
as we can switch the mixture components without affecting the quality of the solution.
Note also that the true likelihood is computed by integrating over all possible data as-
sociations, and hence we can find a maximum likelihood solution without solving a
correspondence problem. However, even for only 6 samples, this requires summing
over the space of 2^6 = 64 possible data associations.
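To make the cost of that summation concrete, here is a minimal sketch that evaluates the true log-likelihood of a candidate (θ1, θ2) by brute-force enumeration of all 2^6 data associations. The six sample values, the unit component variances, and the equal mixture weights are illustrative assumptions, not values taken from the figure.

```python
import itertools
import math

# Hypothetical stand-in for the 6 samples of Figure 1 (3 per component);
# the actual values used in the note's figures are not given in the text.
samples = [-2.5, -1.9, -1.4, 1.3, 2.1, 2.6]


def log_gaussian(x, mean, var=1.0):
    """Log-density of a Gaussian with (assumed) known variance."""
    return -0.5 * math.log(2.0 * math.pi * var) - 0.5 * (x - mean) ** 2 / var


def true_log_likelihood(theta1, theta2):
    """log P(U | theta1, theta2), summing over all 2^6 data associations J."""
    total = 0.0
    for J in itertools.product([0, 1], repeat=len(samples)):
        # log P(U, J | theta) for one association, with equal mixture weights.
        log_joint = sum(
            math.log(0.5) + log_gaussian(x, theta1 if j == 0 else theta2)
            for x, j in zip(samples, J)
        )
        total += math.exp(log_joint)
    return math.log(total)


# The two symmetric modes of Figure 2 give the same value:
print(true_log_likelihood(-2.0, 2.0))
print(true_log_likelihood(2.0, -2.0))
```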
EM proceeds as follows in this example. In the E-step, a "soft" assignment is computed
that assigns a posterior probability to each possible association of each individual sample.
In the current example, there are 2 mixture components and 6 samples, so the computed
probabilities can be represented in a 2 × 6 table. Given these probabilities, EM computes a
tight lower bound to the true likelihood function of Figure 2. The bound is constructed
such that it touches the likelihood function at the current estimate, and it is only close to
the true likelihood in the neighborhood of this estimate. The bound and its corresponding
probability table are computed at each iteration, as shown in Figure 3. In this case,
EM was run for 5 iterations. In the M-step, the lower bound is maximized (the maximizer is
shown by a black asterisk in the figure), and the corresponding new estimate (θ1, θ2) is
guaranteed to have a posterior probability at least as high as that of the current estimate.
Each successive bound is a better approximation to the likelihood near its mode, until at
convergence the bound touches the likelihood at the local maximum, and progress can
no longer be made. This is shown in the last panel of Figure 3. The soft-assignment and
mean-update computations for this example are sketched below.
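The following is a minimal sketch of these two steps for the example, under the same illustrative assumptions as above (hypothetical sample values, unit variances, equal mixture weights, and a flat prior on the means so the M-step reduces to maximum likelihood): the E-step fills the 2 × 6 responsibility table, and the M-step updates each mean as a responsibility-weighted average.

```python
import math

# Same illustrative samples as before (hypothetical, not taken from the figure).
samples = [-2.5, -1.9, -1.4, 1.3, 2.1, 2.6]


def em_two_means(theta1, theta2, iters=5):
    """EM for the two component means of a unit-variance, equal-weight mixture,
    assuming a flat prior on the means (so the M-step is maximum likelihood)."""
    for _ in range(iters):
        # E-step: the 2 x 6 table of posterior assignment probabilities;
        # resp[i] is P(component 1 | sample i, current means).
        resp = []
        for x in samples:
            p1 = math.exp(-0.5 * (x - theta1) ** 2)
            p2 = math.exp(-0.5 * (x - theta2) ** 2)
            resp.append(p1 / (p1 + p2))
        # M-step: maximize the bound; for Gaussian means this reduces to
        # responsibility-weighted averages of the samples.
        theta1 = sum(r * x for r, x in zip(resp, samples)) / sum(resp)
        theta2 = sum((1 - r) * x for r, x in zip(resp, samples)) / sum(
            1 - r for r in resp
        )
    return theta1, theta2


print(em_two_means(0.5, -0.5))  # ends up near one of the two modes of Figure 2
```

Starting from a guess such as (0.5, −0.5), a handful of iterations should already land close to one of the two symmetric modes.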
The idea behind EM is to start with a guess Θ^t for the parameters Θ, construct an
easily computed lower bound B(Θ; Θ^t) to the function log P(Θ|U), and maximize
that bound instead. If iterated, this procedure converges to a local maximizer Θ^* of
the objective function, provided the bound improves at each iteration. In log form, the
objective (1) is

$$\Theta^* = \operatorname*{argmax}_{\Theta} \; \log \sum_{J \in \mathcal{J}^n} P(\Theta, J \mid U) \tag{2}$$

To motivate the bound, note that the key problem with maximizing (2) is that it involves the
logarithm of a (big) sum, which is difficult to deal with. Fortunately, we can construct
a tractable lower bound B(Θ; Θt ) that instead contains a sum of logarithms. To derive
the bound, first trivially rewrite log P (U, Θ) as
$$\log P(U, \Theta) \;=\; \log \sum_{J \in \mathcal{J}^n} P(U, J, \Theta) \;=\; \log \sum_{J \in \mathcal{J}^n} f^t(J)\,\frac{P(U, J, \Theta)}{f^t(J)}$$

where f^t(J) is an arbitrary probability distribution over the space J^n of hidden variables J. By Jensen's inequality, we have

$$B(\Theta; \Theta^t) \;\stackrel{\Delta}{=}\; \sum_{J \in \mathcal{J}^n} f^t(J) \log \frac{P(U, J, \Theta)}{f^t(J)} \;\leq\; \log \sum_{J \in \mathcal{J}^n} f^t(J)\,\frac{P(U, J, \Theta)}{f^t(J)}$$
Note that we have transformed a log of sums into a sum of logs, which was the prime
motivation.
EM goes one step further and tries to find the best bound, defined as the bound B(Θ; Θ^t)
that touches the objective function log P(U, Θ) at the current guess Θ^t. Intuitively,
finding the best bound at each iteration will guarantee that we obtain an improved es-
timate Θ^{t+1} when we locally maximize the bound with respect to Θ. Since we know
B(Θ; Θ^t) to be a lower bound, the optimal bound at Θ^t can be found by maximizing

$$B(\Theta^t; \Theta^t) = \sum_{J \in \mathcal{J}^n} f^t(J) \log \frac{P(U, J, \Theta^t)}{f^t(J)} \tag{3}$$

with respect to the distribution f^t(J).
The maximizing distribution is the posterior over the hidden variables, f^t(J) = P(J|U, Θ^t).
By examining the value of the resulting optimal bound at Θ^t we see that it indeed
touches the objective function:

$$B(\Theta^t; \Theta^t) = \sum_{J \in \mathcal{J}^n} P(J|U, \Theta^t) \log \frac{P(U, J, \Theta^t)}{P(J|U, \Theta^t)} = \log P(U, \Theta^t)$$

since P(U, J, Θ^t)/P(J|U, Θ^t) = P(U, Θ^t) for every J.
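A quick numerical sanity check of this property, under the same illustrative mixture assumptions as before (hypothetical samples, unit variances, equal weights, flat prior): the bound built with f^t(J) = P(J|U, Θ^t) equals log P(U, Θ^t) at Θ = Θ^t and stays below log P(U, Θ) elsewhere.

```python
import itertools
import math

# Same illustrative setup as before: hypothetical samples, unit variances,
# equal mixture weights, and a flat prior on (theta1, theta2).
samples = [-2.5, -1.9, -1.4, 1.3, 2.1, 2.6]
assignments = list(itertools.product([0, 1], repeat=len(samples)))


def log_joint(theta, J):
    """log P(U, J, theta), up to the constant flat-prior term."""
    return sum(
        math.log(0.5) - 0.5 * math.log(2.0 * math.pi) - 0.5 * (x - theta[j]) ** 2
        for x, j in zip(samples, J)
    )


def log_objective(theta):
    """log P(U, theta) = log of the sum of P(U, J, theta) over all J."""
    return math.log(sum(math.exp(log_joint(theta, J)) for J in assignments))


def bound(theta, theta_t):
    """B(theta; theta_t) with f^t(J) = P(J | U, theta_t)."""
    log_z = log_objective(theta_t)
    total = 0.0
    for J in assignments:
        f = math.exp(log_joint(theta_t, J) - log_z)  # posterior f^t(J)
        total += f * (log_joint(theta, J) - math.log(f))
    return total


theta_t = (0.5, -0.5)
print(bound(theta_t, theta_t), log_objective(theta_t))          # equal: the bound touches
print(bound((2.0, -2.0), theta_t), log_objective((2.0, -2.0)))  # bound <= objective
```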
2.2 Maximizing The Bound
With f^t(J) = P(J|U, Θ^t), the bound can be decomposed as

$$B(\Theta; \Theta^t) = \sum_{J \in \mathcal{J}^n} f^t(J)\,\big[\log P(U, J|\Theta) + \log P(\Theta) - \log f^t(J)\big] = Q^t(\Theta) + \log P(\Theta) + H^t$$

where Q^t(Θ) ≜ ⟨log P(U, J|Θ)⟩ is the expected log-likelihood under f^t, and H^t is the
entropy of f^t. Since H^t does not depend on Θ, we can maximize the bound with respect
to Θ using the first two terms only:

$$\Theta^{t+1} = \operatorname*{argmax}_{\Theta}\,\big[\,Q^t(\Theta) + \log P(\Theta)\,\big] \tag{4}$$
At each iteration, the EM algorithm first finds an optimal lower bound B(Θ; Θ^t) at the
current guess Θ^t (equation 3), and then maximizes this bound to obtain an improved
estimate Θ^{t+1} (equation 4). Because the bound is expressed as an expectation, the first
step is called the "expectation-step" or E-step, whereas the second step is called the
"maximization-step" or M-step. The EM algorithm can thus be conveniently summarized
as the following two steps (a generic sketch of the resulting loop is given below):
• E-step: calculate f^t(J) ≜ P(J|U, Θ^t)
• M-step: Θ^{t+1} = argmax_Θ [Q^t(Θ) + log P(Θ)]
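Read as pseudocode, the two steps amount to nothing more than the following loop; the function names, the callback structure, and the fixed iteration count are illustrative choices rather than anything prescribed by the note.

```python
def em(theta0, e_step, m_step, num_iters=10):
    """Generic EM loop: the E-step computes f^t(J) = P(J | U, theta^t),
    and the M-step returns argmax over theta of Q^t(theta) + log P(theta)."""
    theta = theta0
    for _ in range(num_iters):
        f_t = e_step(theta)   # E-step: distribution over the hidden variables
        theta = m_step(f_t)   # M-step: maximize the optimal lower bound
    return theta
```

For the mixture example above, e_step would return the 2 × 6 responsibility table and m_step the responsibility-weighted means.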
A Relation to the Expected Log-Posterior
Note that we have chosen to define Q^t(Θ) as the expected log-likelihood, as in (Dempster
et al., 1977; McLachlan and Krishnan, 1997), i.e.,

$$Q^t(\Theta) \;\stackrel{\Delta}{=}\; \langle \log P(U, J|\Theta) \rangle$$

where ⟨·⟩ denotes the expectation with respect to f^t(J) = P(J|U, Θ^t). One could equally
work with the expected log-posterior, which by Bayes' rule satisfies

$$\langle \log P(\Theta|U, J) \rangle = \langle \log P(U, J|\Theta) + \log P(\Theta) - \log P(U, J) \rangle \tag{5}$$
Here the second term does not depend on J and can be taken out of the expectation,
and the last term does not depend on Θ. Hence, maximizing (5) with respect to Θ is
equivalent to (4):
$$\operatorname*{argmax}_{\Theta}\, \langle \log P(\Theta|U, J) \rangle = \operatorname*{argmax}_{\Theta}\, \big[\, \langle \log P(U, J|\Theta) \rangle + \log P(\Theta) \,\big] = \operatorname*{argmax}_{\Theta}\, \big[\, Q^t(\Theta) + \log P(\Theta) \,\big]$$
References
[1] Dempster, A., Laird, N., and Rubin, D. (1977). Maximum likelihood from incom-
plete data via the EM algorithm. Journal of the Royal Statistical Society, Series B,
39(1):1–38.
[2] Hartley, H. (1958). Maximum likelihood estimation from incomplete data. Bio-
metrics, 14:174–194.
[3] McLachlan, G. and Krishnan, T. (1997). The EM Algorithm and Extensions. Wiley
Series in Probability and Statistics. John Wiley & Sons.
[4] Minka, T. (1998). Expectation-Maximization as lower bound maximization. Tutorial
published on the web at http://www-white.media.mit.edu/~tpminka/papers/em.html.
[5] Neal, R. and Hinton, G. (1998). A view of the EM algorithm that justifies incre-
mental, sparse, and other variants. In Jordan, M., editor, Learning in Graphical
Models. Kluwer Academic Press.
[6] Tanner, M. (1996). Tools for Statistical Inference. Springer Verlag, New York.
Third Edition.