A Tutorial on MM Algorithms
David R. HUNTER and Kenneth LANGE
David R. Hunter is Assistant Professor, Department of Statistics, Penn State University, University Park, PA 16802-2111 (E-mail: dhunter@stat.psu.edu). Kenneth Lange is Professor, Departments of Biomathematics and Human Genetics, David Geffen School of Medicine at UCLA, Los Angeles, CA 90095-1766. Research supported in part by USPHS grants GM53275 and MH59490.

Most problems in frequentist statistics involve optimization of a function such as a likelihood or a sum of squares. EM algorithms are among the most effective algorithms for maximum likelihood estimation because they consistently drive the likelihood uphill by maximizing a simple surrogate function for the log-likelihood. Iterative optimization of a surrogate function as exemplified by an EM algorithm does not necessarily require missing data. Indeed, every EM algorithm is a special case of the more general class of MM optimization algorithms, which typically exploit convexity rather than missing data in majorizing or minorizing an objective function. In our opinion, MM algorithms deserve to be part of the standard toolkit of professional statisticians. This article explains the principle behind MM algorithms, suggests some methods for constructing them, and discusses some of their attractive features. We include numerous examples throughout the article to illustrate the concepts described. In addition to surveying previous work on MM algorithms, this article introduces some new material on constrained optimization and standard error estimation.

KEY WORDS: Constrained optimization; EM algorithm; Majorization; Minorization; Newton–Raphson.

1. INTRODUCTION

Maximum likelihood and least squares are the dominant forms of estimation in frequentist statistics. Toy optimization problems designed for classroom presentation can be solved analytically, but most practical maximum likelihood and least squares estimation problems must be solved numerically. This article discusses an optimization method that typically relies on convexity arguments and is a generalization of the well-known EM algorithm (Dempster, Laird, and Rubin 1977; McLachlan and Krishnan 1997). We call any algorithm based on this iterative method an MM algorithm.

To our knowledge, the general principle behind MM algorithms was first enunciated by the numerical analysts Ortega and Rheinboldt (1970) in the context of line search methods. De Leeuw and Heiser (1977) presented an MM algorithm for multidimensional scaling contemporary with the classic Dempster et al. (1977) article on EM algorithms. Although the work of de Leeuw and Heiser did not spark the same explosion of interest from the statistical community set off by the Dempster et al. (1977) article, steady development of MM algorithms has continued. The MM principle reappears, among other places, in robust regression (Huber 1981), in correspondence analysis (Heiser 1987), in the quadratic lower bound principle of Böhning and Lindsay (1988), in the psychometrics literature on least squares (Bijleveld and de Leeuw 1991; Kiers and Ten Berge 1992), and in medical imaging (De Pierro 1995; Lange and Fessler 1995). The recent survey articles of de Leeuw (1994), Heiser (1995), Becker, Yang, and Lange (1997), and Lange, Hunter, and Yang (2000) deal with the general principle, but it is not until the rejoinder of Hunter and Lange (2000a) that the acronym MM first appears. This acronym pays homage to the earlier names "majorization" and "iterative majorization" of the MM principle, emphasizes its crucial link to the better-known EM principle, and diminishes the possibility of confusion with the distinct subject in mathematics known as majorization (Marshall and Olkin 1979). Recent work has demonstrated the utility of MM algorithms in a broad range of statistical contexts, including quantile regression (Hunter and Lange 2000b), survival analysis (Hunter and Lange 2002), paired and multiple comparisons (Hunter 2004), variable selection (Hunter and Li 2002), and DNA sequence analysis (Sabatti and Lange 2002).

One of the virtues of the MM acronym is that it does double duty. In minimization problems, the first M of MM stands for majorize and the second M for minimize. In maximization problems, the first M stands for minorize and the second M for maximize. (We define the terms "majorize" and "minorize" in Section 2.) A successful MM algorithm substitutes a simple optimization problem for a difficult optimization problem. Simplicity can be attained by (a) avoiding large matrix inversions, (b) linearizing an optimization problem, (c) separating the parameters of an optimization problem, (d) dealing with equality and inequality constraints gracefully, or (e) turning a nondifferentiable problem into a smooth problem. Iteration is the price we pay for simplifying the original problem.

In our view, MM algorithms are easier to understand and sometimes easier to apply than EM algorithms. Although we have no intention of detracting from EM algorithms, their dominance over MM algorithms is a historical accident. An EM algorithm operates by identifying a theoretical complete data space. In the E-step of the algorithm, the conditional expectation of the complete data log-likelihood is calculated with respect to the observed data. The surrogate function created by the E-step is, up to a constant, a minorizing function. In the M-step, this minorizing function is maximized with respect to the parameters of the underlying model; thus, every EM algorithm is an example of an MM algorithm. Construction of an EM algorithm sometimes demands creativity in identifying the complete data and technical skill in calculating an often complicated conditional expectation and then maximizing it analytically. MM and EM algorithms hinge on somewhat different mathematical maneuvers. However, the skills required by most MM algorithms are no harder to master than the skills required by most EM algorithms. The purpose of this article is to present some strategies for constructing MM algorithms and to illustrate various aspects of these algorithms through the study of specific examples.

We conclude this section with a note on nomenclature. Just as EM is more a prescription for creating algorithms than an actual algorithm, MM refers not to a single algorithm but to a class of algorithms. Thus, this article refers to specific EM and MM algorithms but never to "the MM algorithm" or "the EM algorithm."
[Figure 1. For q = 0.8, (a) depicts the "vee" function ρ_q(θ) and its quadratic majorizing function for θ^(m) = −0.75; (b) shows the objective function f(θ) that is minimized by the 0.8 quantile of the sample 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 5, along with its quadratic majorizer, for θ^(m) = 2.5.]
In the minimization version of the algorithm, the next iterate θ^(m+1) is chosen to minimize the majorizing surrogate g(θ | θ^(m)), and this choice forces the objective function f(θ) downhill. Indeed, the inequality

    f(θ^(m+1)) = g(θ^(m+1) | θ^(m)) + f(θ^(m+1)) − g(θ^(m+1) | θ^(m))
               ≤ g(θ^(m) | θ^(m)) + f(θ^(m)) − g(θ^(m) | θ^(m))
               = f(θ^(m))                                                  (2)

follows directly from the fact g(θ^(m+1) | θ^(m)) ≤ g(θ^(m) | θ^(m)) and definition (1). The descent property (2) lends an MM algorithm remarkable numerical stability. With straightforward changes, the MM recipe also applies to maximization rather than minimization: To maximize a function f(θ), we minorize it by a surrogate function g(θ | θ^(m)) and maximize g(θ | θ^(m)) to produce the next iterate θ^(m+1).
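The recipe above can be written down almost verbatim as code. The following sketch is not from the article; it is a minimal Python illustration of the majorize-and-minimize loop on the toy objective f(θ) = |θ − 1| + |θ + 1|, using the quadratic majorizer |θ − c| ≤ (θ − c)²/(2|θ^(m) − c|) + |θ^(m) − c|/2 for each absolute-value term (essentially the device used for quantiles in Section 2.1). Each iteration minimizes the surrogate in closed form and checks the descent property (2).

```python
import numpy as np

# Toy objective: f(theta) = |theta - 1| + |theta + 1|, minimized on [-1, 1] with value 2.
c = np.array([1.0, -1.0])

def f(theta):
    return np.sum(np.abs(theta - c))

def mm_step(theta_m):
    """One majorize-minimize step.

    Each term |theta - c_k| is majorized at theta_m by the quadratic
    (theta - c_k)^2 / (2 w_k) + w_k / 2 with w_k = |theta_m - c_k|,
    so the surrogate is an easily minimized sum of quadratics.
    (The weights are undefined if theta_m hits a c_k exactly; see Section 2.1.)
    """
    w = np.abs(theta_m - c)
    return np.sum(c / w) / np.sum(1.0 / w)

theta = 3.0                          # starting value away from the kinks at +/-1
for m in range(10):
    new_theta = mm_step(theta)
    assert f(new_theta) <= f(theta) + 1e-12   # descent property (2)
    theta = new_theta
    print(m + 1, theta, f(theta))
```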
2.1 Calculation of Sample Quantiles

The function f(θ) and its majorizer g(θ | θ^(m)) are shown in Figure 1(b) for a particular sample of size n = 12. Setting the first derivative of g(θ | θ^(m)) equal to zero gives the minimum point

    θ^(m+1) = [ n(2q − 1) + Σ_{i=1}^n w_i^(m) x_i ] / [ Σ_{i=1}^n w_i^(m) ],      (5)

where the weight w_i^(m) = |x_i − θ^(m)|^(−1) depends on θ^(m). A flaw of algorithm (5) is that the weight w_i^(m) is undefined whenever θ^(m) = x_i. In mending this flaw, Hunter and Lange (2000b) also discussed the broader technique of quantile regression introduced by Koenker and Bassett (1978). From a computational perspective, the most fascinating thing about the quantile-finding algorithm is that it avoids sorting and relies entirely on arithmetic and iteration. For the case of the sample median (q = 1/2), algorithm (5) is found in Schlossmacher (1973) and is shown to be an MM algorithm by Lange and Sinsheimer (1993) and Heiser (1995).
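Update (5) is simple enough to state directly as code. The sketch below is not part of the article; it is a straightforward Python transcription of iteration (5) for the sample and quantile of Figure 1(b). The small constant eps is an assumption added purely to sidestep the undefined-weight flaw noted above; Hunter and Lange (2000b) treat that issue more carefully.

```python
import numpy as np

def mm_quantile(x, q, theta0, iterations=50, eps=1e-10):
    """Approximate the q-th sample quantile by iterating update (5).

    w_i = 1 / |x_i - theta| are the weights of the quadratic majorizer;
    eps is a crude safeguard against the undefined weights that occur
    when theta lands exactly on a data point.
    """
    x = np.asarray(x, dtype=float)
    n = len(x)
    theta = theta0
    for _ in range(iterations):
        w = 1.0 / (np.abs(x - theta) + eps)
        theta = (n * (2.0 * q - 1.0) + np.sum(w * x)) / np.sum(w)
    return theta

# The sample of Figure 1(b); its 0.8 quantile is 3.
sample = [1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 5]
print(mm_quantile(sample, q=0.8, theta0=2.5))   # converges toward 3
```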
Because g(θ | θ^(m)) in Equation (4) is a quadratic function of θ, expression (5) coincides with the more general Newton–Raphson update

    θ^(m+1) = θ^(m) − [ ∇²g(θ^(m) | θ^(m)) ]^(−1) ∇g(θ^(m) | θ^(m)),      (6)

where ∇g(θ^(m) | θ^(m)) and ∇²g(θ^(m) | θ^(m)) denote the gradient vector and the Hessian matrix of g(θ | θ^(m)) evaluated at θ^(m). Because the descent property (2) depends only on decreasing g(θ | θ^(m)) and not on minimizing it, the update (6) can serve in cases where g(θ | θ^(m)) lacks a closed form minimizer, provided this update decreases the value of g(θ | θ^(m)). In the context of EM algorithms, Dempster et al. (1977) called an algorithm that reduces g(θ | θ^(m)) without actually minimizing it a generalized EM (GEM) algorithm. The specific case of Equation (6), which we call a gradient MM algorithm, was studied in the EM context by Lange (1995a), who pointed out that update (6) saves us from performing iterations within iterations and yet still displays the same local rate of convergence as a full MM algorithm that minimizes g(θ | θ^(m)) at each iteration.
3. TRICKS OF THE TRADE

In the quantile example of Section 2.1, the convex "vee" function admits a quadratic majorizer as depicted in Figure 1(a). In general, many majorizing or minorizing relationships may be derived from standard inequalities rooted in convexity or concavity; this section collects several of the most useful ones.

3.1 Jensen's Inequality

Applying Jensen's inequality to the convex function −ln t and the ratio a(X)/b(X) of two probability densities a(x) and b(x) gives −ln E[a(X)/b(X)] ≤ E[−ln{a(X)/b(X)}]. If X has the density b(x), then E[a(X)/b(X)] = 1, so the left-hand side above vanishes and we obtain

    E[ln a(X)] ≤ E[ln b(X)],

which is sometimes known as the information inequality. It is this inequality that guarantees that a minorizing function is constructed in the E-step of any EM algorithm (de Leeuw 1994; Heiser 1995), making every EM algorithm an MM algorithm.

3.2 Minorization via Supporting Hyperplanes

Jensen's inequality is easily derived from the supporting hyperplane property of a convex function: Any linear function tangent to the graph of a convex function is a minorizer at the point of tangency. Thus, if κ(θ) is convex and differentiable, then

    κ(θ) ≥ κ(θ^(m)) + ∇κ(θ^(m))^t (θ − θ^(m)),      (7)

with equality when θ = θ^(m). This inequality is illustrated by the example of Section 7 involving constrained optimization.

3.3 Majorization via the Definition of Convexity

If we wish to majorize a convex function instead of minorizing it, then we can use the standard definition of convexity; namely, κ(t) is convex if and only if

    κ( Σ_i α_i t_i ) ≤ Σ_i α_i κ(t_i)      (8)

for any finite collection of points t_i and corresponding multipliers α_i with α_i ≥ 0 and Σ_i α_i = 1. Application of definition (8) is particularly effective when κ(t) is composed with a linear function x^t θ. For instance, suppose for vectors x, θ, and θ^(m) that we make the substitution t_i = x_i (θ_i − θ_i^(m)) / α_i + x^t θ^(m). Inequality (8) then becomes

    κ(x^t θ) ≤ Σ_i α_i κ( x_i (θ_i − θ_i^(m)) / α_i + x^t θ^(m) ).      (9)

Alternatively, if all components of x, θ, and θ^(m) are positive, then we may take t_i = x^t θ^(m) θ_i / θ_i^(m) and α_i = x_i θ_i^(m) / x^t θ^(m). Now inequality (8) becomes

    κ(x^t θ) ≤ Σ_i [ x_i θ_i^(m) / x^t θ^(m) ] κ( x^t θ^(m) θ_i / θ_i^(m) ).      (10)
Inequalities (9) and (10) have been used to construct MM algorithms in the contexts of medical imaging (De Pierro 1995; Lange and Fessler 1995) and least-squares estimation without matrix inversion (Becker et al. 1997).
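Because (9) and (10) are consequences of definition (8), they are easy to check numerically. The snippet below is not from the article; it is a small Python sanity check of both majorizations for the convex choice κ(t) = e^t on randomly drawn positive vectors, with the free multipliers α_i of (9) set, arbitrarily for this illustration, to the same weights used in (10).

```python
import numpy as np

rng = np.random.default_rng(0)
kappa = np.exp            # any convex function works; exp keeps everything finite

for _ in range(1000):
    d = rng.integers(2, 6)
    x = rng.uniform(0.1, 2.0, size=d)         # positive vectors, as required by (10)
    theta = rng.uniform(0.1, 2.0, size=d)
    theta_m = rng.uniform(0.1, 2.0, size=d)

    lhs = kappa(x @ theta)

    # Multipliers alpha_i > 0 summing to one; here alpha_i = x_i theta_m_i / x'theta_m.
    alpha = x * theta_m / (x @ theta_m)

    # Inequality (9): substitution t_i = x_i (theta_i - theta_m_i) / alpha_i + x'theta_m.
    rhs9 = np.sum(alpha * kappa(x * (theta - theta_m) / alpha + x @ theta_m))

    # Inequality (10): substitution t_i = (x'theta_m) theta_i / theta_m_i.
    rhs10 = np.sum(alpha * kappa((x @ theta_m) * theta / theta_m))

    assert lhs <= rhs9 + 1e-9 and lhs <= rhs10 + 1e-9

print("majorizations (9) and (10) verified on 1000 random cases")
```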
3.4 Majorization via a Quadratic Upper Bound

If a convex function κ(θ) is twice differentiable and has bounded curvature, then we can majorize κ(θ) by a quadratic function with sufficiently high curvature and tangent to κ(θ) at θ^(m) (Böhning and Lindsay 1988). In algebraic terms, if we can find a positive definite matrix M such that M − ∇²κ(θ) is nonnegative definite for all θ, then

    κ(θ) ≤ κ(θ^(m)) + ∇κ(θ^(m))^t (θ − θ^(m)) + (1/2) (θ − θ^(m))^t M (θ − θ^(m))

provides a quadratic upper bound. For example, Heiser (1995) noted in the unidimensional case that

    1/θ ≤ 1/θ^(m) − (θ − θ^(m)) / (θ^(m))² + (θ − θ^(m))² / c³

for 0 < c ≤ min{θ, θ^(m)}. The corresponding quadratic lower bound principle for minorization is the basis for the logistic regression example of Section 6.
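A quick numerical check, again not from the article, confirms Heiser's bound on a grid; the particular values of θ^(m), c, and the grid are assumptions made only for this illustration.

```python
import numpy as np

theta_m, c = 1.0, 0.2                        # bound is claimed for 0 < c <= min(theta, theta_m)
theta = np.linspace(c, 5.0, 2001)            # check on the region theta >= c

quad = (1.0 / theta_m
        - (theta - theta_m) / theta_m**2
        + (theta - theta_m) ** 2 / c**3)

# Majorization: the quadratic lies above 1/theta everywhere on the region,
# and by construction it matches 1/theta in value and slope at theta = theta_m.
assert np.all(1.0 / theta <= quad + 1e-12)
print("Heiser's quadratic upper bound for 1/theta verified on a grid")
```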
3.5 The Arithmetic-Geometric Mean Inequality

The arithmetic-geometric mean inequality is a special case of inequality (8). Taking κ(t) = e^t and α_i = 1/m yields

    exp( (1/m) Σ_{i=1}^m t_i ) ≤ (1/m) Σ_{i=1}^m e^{t_i}.

If we let x_i = e^{t_i}, then we obtain the standard form

    ( Π_{i=1}^m x_i )^{1/m} ≤ (1/m) Σ_{i=1}^m x_i      (11)

of the arithmetic-geometric mean inequality. Because the exponential function is strictly convex, equality holds if and only if all of the x_i are equal. Inequality (11) is helpful in constructing the majorizer

    x_1 x_2 ≤ [ x_2^(m) / (2 x_1^(m)) ] x_1² + [ x_1^(m) / (2 x_2^(m)) ] x_2².      (12)

The same device leads to inequality (13), which is the Cauchy–Schwartz inequality. De Leeuw and Heiser (1977) and Groenen (1993) used inequality (13) to derive MM algorithms for multidimensional scaling.

4. SEPARATION OF PARAMETERS AND CYCLIC MM

One of the key criteria in judging minorizing or majorizing functions is their ease of optimization. Successful MM algorithms in high-dimensional parameter spaces often rely on surrogate functions in which the individual parameter components are separated. In other words, the surrogate function mapping θ ∈ U ⊂ R^d → R reduces to the sum of d real-valued functions taking the real-valued arguments θ_1 through θ_d. Because the d univariate functions may be optimized one by one, this makes the surrogate function easier to optimize at each iteration.

4.1 Poisson Sports Model

Consider a simplified version of a model proposed by Maher (1982) for a sports contest between two individuals or teams in which the number of points scored by team i against team j follows a Poisson process with intensity e^{o_i − d_j}, where o_i is an "offensive strength" parameter for team i and d_j is a "defensive strength" parameter for team j. If t_ij is the length of time that i plays j and p_ij is the number of points that i scores against j, then the corresponding Poisson log-likelihood function is

    ℓ_ij(θ) = p_ij (o_i − d_j) − t_ij e^{o_i − d_j} + p_ij ln t_ij − ln p_ij!,      (14)

where θ = (o, d) is the parameter vector. Note that the parameters should satisfy a linear constraint, such as Σ_i o_i + Σ_j d_j = 0, in order for the model to be identifiable; otherwise, it is clearly possible to add the same constant to each o_i and d_j without altering the likelihood. We make two simplifying assumptions. First, different games are independent of each other. Second, each team's point total within a single game is independent of its opponent's point total. The second assumption is more suspect than the first because it implies that a team's offensive and defensive performances are somehow unrelated to one another; nonetheless the model gives an interesting first approximation to reality. Under these assumptions, the full data log-likelihood is obtained by summing ℓ_ij(θ) over all pairs (i, j). Setting the partial derivatives of the log-likelihood equal to zero leads to the equations

    e^{−d̂_j} = Σ_i p_ij / ( Σ_i t_ij e^{ô_i} )    and    e^{ô_i} = Σ_j p_ij / ( Σ_j t_ij e^{−d̂_j} ).
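The two stationarity equations above couple the offensive and defensive parameters, but each block has a closed-form solution when the other block is held fixed, so they can be solved by alternating between them, one block at a time, in the cyclic spirit of this section. The Python sketch below is not taken from the article and does not reproduce its MM derivation for this model; it simply alternates the two displayed equations on synthetic data (team sizes, playing times, and scores are all made up for illustration), re-centering after each sweep to enforce the identifiability constraint Σ_i o_i + Σ_j d_j = 0. Because each block update exactly maximizes the log-likelihood over its own block, the log-likelihood can never decrease from sweep to sweep.

```python
import numpy as np

rng = np.random.default_rng(1)
teams = 4
o_true = rng.normal(0.0, 0.5, teams)                 # offensive strengths
d_true = rng.normal(0.0, 0.5, teams)                 # defensive strengths
t = np.full((teams, teams), 10.0)                    # playing time t_ij
np.fill_diagonal(t, 0.0)                             # a team never plays itself
p = rng.poisson(t * np.exp(o_true[:, None] - d_true[None, :]))   # scores p_ij

def loglik(o, d):
    lam = t * np.exp(o[:, None] - d[None, :])
    return np.sum(p * np.log(np.where(lam > 0.0, lam, 1.0)) - lam)

o = np.zeros(teams)
d = np.zeros(teams)
ll_prev = -np.inf
for sweep in range(100):
    # Block update of o with d fixed:  exp(o_i) = sum_j p_ij / sum_j t_ij exp(-d_j).
    o = np.log(p.sum(axis=1) / (t * np.exp(-d)[None, :]).sum(axis=1))
    # Block update of d with o fixed:  exp(-d_j) = sum_i p_ij / sum_i t_ij exp(o_i).
    d = -np.log(p.sum(axis=0) / (t * np.exp(o)[:, None]).sum(axis=0))
    # Re-center so that sum(o) + sum(d) = 0; this leaves every o_i - d_j,
    # and hence the likelihood, unchanged.
    shift = (o.sum() + d.sum()) / (2 * teams)
    o -= shift
    d -= shift
    ll = loglik(o, d)
    assert ll >= ll_prev - 1e-8                      # the log-likelihood never decreases
    ll_prev = ll

print("offensive strengths:", np.round(o, 3))
print("defensive strengths:", np.round(d, 3))
```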
Inspection of a Newton–Raphson iteration reveals that it requires evaluation and inversion of the Hessian matrix ∇²f(θ^(m)). If θ has p components, then the number of calculations needed to invert the p × p matrix ∇²f(θ) is roughly proportional to p³. By contrast, an MM algorithm that separates parameters usually takes on the order of p or p² arithmetic operations per iteration. Thus, well-designed MM algorithms tend to require more iterations but simpler iterations than Newton–Raphson. For this reason MM algorithms sometimes enjoy an advantage in computational speed.

For example, the Poisson process scoring model for the NBA dataset of Section 4 has 57 parameters (two for each of 29 teams minus one for the linear constraint). A single matrix inversion of a 57 × 57 matrix requires roughly 387,000 floating point operations according to MATLAB. Thus, even a single Newton–Raphson iteration requires more computation in this example than the 300,000 floating point operations required for the MM algorithm to converge completely in 28 iterations. Numerical stability also enters the balance sheet. A Newton–Raphson algorithm can behave poorly if started too far from an optimum point. By contrast, MM algorithms are guaranteed to appropriately increase or decrease the value of the objective function at every iteration.
Other types of deterministic optimization algorithms, such as Fisher scoring, quasi-Newton methods, or gradient-free methods like Nelder–Mead, occupy a kind of middle ground. Although none of them can match Newton–Raphson in required iterations until convergence, each has its own merits. The expected information matrix used in Fisher scoring is sometimes easier to evaluate than the observed information matrix of Newton–Raphson. Scoring does not automatically lead to an increase in the log-likelihood, but at least (unlike Newton–Raphson) it can always be made to do so if some form of backtracking is incorporated. Quasi-Newton methods mitigate or even eliminate the need for matrix inversion. The Nelder–Mead approach is applicable in situations where the objective function is nondifferentiable. Because of the complexities of practical problems, it is impossible to declare any optimization algorithm best overall. In our experience, however, MM algorithms are often difficult to beat in terms of stability and computational simplicity.
6. STANDARD ERROR ESTIMATES

In most cases, a maximum likelihood estimator has asymptotic covariance matrix equal to the inverse of the expected information matrix. In practice, the expected information matrix is often well-approximated by the observed information matrix −∇²ℓ(θ̂) computed by differentiating the log-likelihood ℓ(θ) twice. Thus, after the MLE θ̂ has been found, a standard error of θ̂ can be obtained by taking square roots of the diagonal entries of the inverse of −∇²ℓ(θ̂). In some problems, however, direct calculation of ∇²ℓ(θ̂) is difficult. Here we propose two numerical approximations to this matrix that exploit quantities readily obtained by running an MM algorithm. Let g(θ | θ^(m)) denote a minorizing function of the log-likelihood ℓ(θ) at the point θ^(m), and define

    M(ϑ) = argmax_θ g(θ | ϑ)

to be the MM algorithm map taking θ^(m) to θ^(m+1).

6.1 Numerical Differentiation via MM

The two numerical approximations to −∇²ℓ(θ̂) are based on the formulas

    ∇²ℓ(θ̂) = ∇²g(θ̂ | θ̂) [ I − ∇M(θ̂) ]                          (19)
            = ∇²g(θ̂ | θ̂) + [ ∂/∂ϑ ∇g(θ̂ | ϑ) ]|_{ϑ = θ̂},          (20)

where I denotes the identity matrix. These formulas were derived by Lange (1999) using two simple facts: First, the tangency of ℓ(θ) and its minorizer implies that their gradient vectors are equal at the point of minorization; and second, the gradient of g(θ | θ^(m)) at its maximizer M(θ^(m)) is zero. Alternative derivations of formulas (19) and (20) were given by Meng and Rubin (1991) and Oakes (1999), respectively. Although these formulas have been applied to standard error estimation in the EM algorithm literature, where Meng and Rubin (1991) base their SEM idea on formula (19), to our knowledge neither has been applied in the broader context of MM algorithms.

Approximation of ∇²ℓ(θ̂) using Equation (19) requires a numerical approximation of the Jacobian matrix ∇M(θ), whose i, j entry equals

    ∂M_i(θ)/∂θ_j = lim_{δ→0} [ M_i(θ + δ e_j) − M_i(θ) ] / δ,      (21)

where the vector e_j is the jth standard basis vector having a one in its jth component and zeros elsewhere. Because M(θ̂) = θ̂, the jth column of ∇M(θ̂) may be approximated using only output from the corresponding MM algorithm by (a) iterating until θ̂ is found, (b) altering the jth component of θ̂ by a small amount δ_j, (c) applying the MM algorithm to this altered θ, (d) subtracting θ̂ from the result, and (e) dividing by δ_j. Approximation of ∇²ℓ(θ̂) using Equation (20) is analogous except it involves numerically approximating the Jacobian of h(ϑ) = ∇g(θ̂ | ϑ). In this case one may exploit the fact that h(θ̂) is zero.

6.2 An MM Algorithm for Logistic Regression

To illustrate these ideas and facilitate comparison of the various numerical methods, we consider an example in which the Hessian of the log-likelihood is easy to compute. Böhning and Lindsay (1988) apply the quadratic bound principle of Section 3.4 to the case of logistic regression, in which we have an n × 1 vector Y of binary responses and an n × p matrix X of predictors. The model stipulates that the probability π_i(θ) that Y_i = 1 equals exp{θ^t x_i} / (1 + exp{θ^t x_i}). Straightforward differentiation of the resulting log-likelihood function shows that

    ∇²ℓ(θ) = − Σ_{i=1}^n π_i(θ) [1 − π_i(θ)] x_i x_i^t.

Since π_i(θ)[1 − π_i(θ)] is bounded above by 1/4, we may define the negative definite matrix B = −(1/4) X^t X and conclude that ∇²ℓ(θ) − B is nonnegative definite as desired. Therefore, the quadratic function

    g(θ | θ^(m)) = ℓ(θ^(m)) + ∇ℓ(θ^(m))^t (θ − θ^(m)) + (1/2) (θ − θ^(m))^t B (θ − θ^(m))

minorizes ℓ(θ).
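Maximizing the quadratic minorizer above gives the update θ^(m+1) = θ^(m) − B^(−1) ∇ℓ(θ^(m)), and the resulting MM map M feeds directly into the standard error formula (19). The Python sketch below is not from the article; it fits a logistic regression by iterating this update and then approximates −∇²ℓ(θ̂) via (19) with the forward-difference recipe (a)-(e) of Section 6.1. The simulated data, the perturbation δ, and the function names are assumptions for illustration only; the low birth weight data behind Table 2 are not reproduced in this excerpt.

```python
import numpy as np

def mm_map(theta, X, y, B_inv):
    """One MM update: maximize the quadratic minorizer g(. | theta).

    Maximizing g gives theta - B^{-1} grad l(theta), where B = -(1/4) X'X
    is the fixed curvature bound of Bohning and Lindsay (1988).
    """
    pi = 1.0 / (1.0 + np.exp(-X @ theta))
    grad = X.T @ (y - pi)                     # gradient of the log-likelihood
    return theta - B_inv @ grad

def fit_and_se(X, y, iterations=200, delta=1e-5):
    n, p = X.shape
    B = -0.25 * X.T @ X
    B_inv = np.linalg.inv(B)
    theta = np.zeros(p)
    for _ in range(iterations):               # step (a): iterate to the MLE
        theta = mm_map(theta, X, y, B_inv)

    # Steps (b)-(e): forward-difference approximation of the Jacobian of M at theta_hat.
    dM = np.empty((p, p))
    for j in range(p):
        perturbed = theta.copy()
        perturbed[j] += delta
        dM[:, j] = (mm_map(perturbed, X, y, B_inv) - theta) / delta

    # Formula (19): Hessian of l at theta_hat = Hessian of g times (I - dM);
    # here the Hessian of the quadratic surrogate g is simply B.
    hess = B @ (np.eye(p) - dM)
    se = np.sqrt(np.diag(np.linalg.inv(-hess)))
    return theta, se

# Simulated predictors and binary responses, for illustration only.
rng = np.random.default_rng(2)
X = np.column_stack([np.ones(500), rng.normal(size=(500, 2))])
true_theta = np.array([0.5, 1.0, -1.0])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ true_theta)))

theta_hat, se_hat = fit_and_se(X, y)
print("estimates:      ", np.round(theta_hat, 3))
print("standard errors:", np.round(se_hat, 3))
```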
Table 2. Estimated Coefficients and Standard Errors for the Low Birth Weight Logistic Regression Example

                              Standard errors based on:
    Variable      θ̂            Exact ∇²ℓ(θ̂)    Equation (19)    Equation (20)
    Constant      0.48062       1.1969           1.1984           1.1984
    AGE          −0.029549      0.037031         0.037081         0.037081
    LWT          −0.015424      0.0069194        0.0069336        0.0069336
    RACE2         1.2723        0.52736          0.52753          0.52753
    RACE3         0.8805        0.44079          0.44076          0.44076
    SMOKE         0.93885       0.40215          0.40219          0.40219
    PTL           0.54334       0.34541          0.34545          0.34545
    HT            1.8633        0.69754          0.69811          0.69811
    UI            0.76765       0.45932          0.45933          0.45933
    FTV           0.065302      0.17240          0.17251          0.17251

The construction below keeps each iterate in the interior of the parameter space but allows strict inequalities to become equalities in the limit. Consider the problem of minimizing f(θ) subject to the constraints v_j(θ) ≥ 0 for 1 ≤ j ≤ q, where each v_j(θ) is a concave, differentiable function. Since −v_j(θ) is convex, we know from inequality (7) that

    v_j(θ^(m)) − v_j(θ) ≥ ∇v_j(θ^(m))^t (θ^(m) − θ).

Application of the similar inequality ln s − ln t ≥ s^(−1)(s − t) implies that

    v_j(θ^(m)) [ ln v_j(θ^(m)) − ln v_j(θ) ] ≥ v_j(θ^(m)) − v_j(θ).
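The excerpt stops before the two bounds are combined, but chaining them shows that v_j(θ^(m))[ln v_j(θ^(m)) − ln v_j(θ)] + ∇v_j(θ^(m))^t (θ − θ^(m)) ≥ 0, with equality at θ = θ^(m). Consequently, for any ω > 0, adding ω times the sum of these terms to f(θ) produces a majorizer whose logarithmic barrier keeps every iterate strictly feasible while still allowing constraints to become active in the limit. The Python sketch below illustrates this construction on a deliberately simple problem, minimizing (1/2)‖θ − a‖² subject to θ_j ≥ 0 (so v_j(θ) = θ_j); the problem, the value of ω, and the closed-form coordinate update are assumptions made for the illustration and are not taken from the article.

```python
import numpy as np

def constrained_mm(a, omega=1.0, iterations=200):
    """Minimize 0.5*||theta - a||^2 subject to theta_j >= 0 by a barrier-style MM step.

    For each constraint v_j(theta) = theta_j, the surrogate adds the nonnegative term
    omega * [theta_m_j*(ln theta_m_j - ln theta_j) + (theta_j - theta_m_j)],
    which vanishes at theta = theta_m, so g majorizes f.  The surrogate separates
    the parameters, and minimizing it coordinatewise gives the positive root of
    theta_j^2 + (omega - a_j)*theta_j - omega*theta_m_j = 0.
    """
    a = np.asarray(a, dtype=float)
    theta = np.ones_like(a)                      # start strictly inside the feasible region
    f = lambda th: 0.5 * np.sum((th - a) ** 2)
    for _ in range(iterations):
        new_theta = 0.5 * ((a - omega) + np.sqrt((a - omega) ** 2 + 4.0 * omega * theta))
        assert f(new_theta) <= f(theta) + 1e-12  # descent property of the majorizer
        theta = new_theta
    return theta

a = np.array([2.0, -1.0, 0.5, -3.0])
print(constrained_mm(a))   # approaches the constrained minimizer max(a, 0) = [2, 0, 0.5, 0]
```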