A Tutorial on MM Algorithms
David R. HUNTER and Kenneth LANGE
The American Statistician, February 2004, Vol. 58, No. 1. DOI: 10.1198/0003130042836

Most problems in frequentist statistics involve optimization of a function such as a likelihood or a sum of squares. EM algorithms are among the most effective algorithms for maximum likelihood estimation because they consistently drive the likelihood uphill by maximizing a simple surrogate function for the log-likelihood. Iterative optimization of a surrogate function as exemplified by an EM algorithm does not necessarily require missing data. Indeed, every EM algorithm is a special case of the more general class of MM optimization algorithms, which typically exploit convexity rather than missing data in majorizing or minorizing an objective function. In our opinion, MM algorithms deserve to be part of the standard toolkit of professional statisticians. This article explains the principle behind MM algorithms, suggests some methods for constructing them, and discusses some of their attractive features. We include numerous examples throughout the article to illustrate the concepts described. In addition to surveying previous work on MM algorithms, this article introduces some new material on constrained optimization and standard error estimation.

KEY WORDS: Constrained optimization; EM algorithm; Majorization; Minorization; Newton–Raphson.

David R. Hunter is Assistant Professor, Department of Statistics, Penn State University, University Park, PA 16802-2111 (E-mail: dhunter@stat.psu.edu). Kenneth Lange is Professor, Departments of Biomathematics and Human Genetics, David Geffen School of Medicine at UCLA, Los Angeles, CA 90095-1766. Research supported in part by USPHS grants GM53275 and MH59490.

1. INTRODUCTION

Maximum likelihood and least squares are the dominant forms of estimation in frequentist statistics. Toy optimization problems designed for classroom presentation can be solved analytically, but most practical maximum likelihood and least squares estimation problems must be solved numerically. This article discusses an optimization method that typically relies on convexity arguments and is a generalization of the well-known EM algorithm method (Dempster, Laird, and Rubin 1977; McLachlan and Krishnan 1997). We call any algorithm based on this iterative method an MM algorithm.

To our knowledge, the general principle behind MM algorithms was first enunciated by the numerical analysts Ortega and Rheinboldt (1970) in the context of line search methods. De Leeuw and Heiser (1977) presented an MM algorithm for multidimensional scaling contemporary with the classic Dempster et al. (1977) article on EM algorithms. Although the work of de Leeuw and Heiser did not spark the same explosion of interest from the statistical community set off by the Dempster et al. (1977) article, steady development of MM algorithms has continued. The MM principle reappears, among other places, in robust regression (Huber 1981), in correspondence analysis (Heiser 1987), in the quadratic lower bound principle of Böhning and Lindsay (1988), in the psychometrics literature on least squares (Bijleveld and de Leeuw 1991; Kiers and Ten Berge 1992), and in medical imaging (De Pierro 1995; Lange and Fessler 1995). The recent survey articles of de Leeuw (1994), Heiser (1995), Becker, Yang, and Lange (1997), and Lange, Hunter, and Yang (2000) deal with the general principle, but it is not until the rejoinder of Hunter and Lange (2000a) that the acronym MM first appears. This acronym pays homage to the earlier names "majorization" and "iterative majorization" of the MM principle, emphasizes its crucial link to the better-known EM principle, and diminishes the possibility of confusion with the distinct subject in mathematics known as majorization (Marshall and Olkin 1979). Recent work has demonstrated the utility of MM algorithms in a broad range of statistical contexts, including quantile regression (Hunter and Lange 2000b), survival analysis (Hunter and Lange 2002), paired and multiple comparisons (Hunter 2004), variable selection (Hunter and Li 2002), and DNA sequence analysis (Sabatti and Lange 2002).

One of the virtues of the MM acronym is that it does double duty. In minimization problems, the first M of MM stands for majorize and the second M for minimize. In maximization problems, the first M stands for minorize and the second M for maximize. (We define the terms "majorize" and "minorize" in Section 2.) A successful MM algorithm substitutes a simple optimization problem for a difficult optimization problem. Simplicity can be attained by (a) avoiding large matrix inversions, (b) linearizing an optimization problem, (c) separating the parameters of an optimization problem, (d) dealing with equality and inequality constraints gracefully, or (e) turning a nondifferentiable problem into a smooth problem. Iteration is the price we pay for simplifying the original problem.

In our view, MM algorithms are easier to understand and sometimes easier to apply than EM algorithms. Although we have no intention of detracting from EM algorithms, their dominance over MM algorithms is a historical accident. An EM algorithm operates by identifying a theoretical complete data space. In the E-step of the algorithm, the conditional expectation of the complete data log-likelihood is calculated with respect to the observed data. The surrogate function created by the E-step is, up to a constant, a minorizing function. In the M-step, this minorizing function is maximized with respect to the parameters of the underlying model; thus, every EM algorithm is an example of an MM algorithm. Construction of an EM algorithm sometimes demands creativity in identifying the complete data and technical skill in calculating an often complicated conditional expectation and then maximizing it analytically.

In contrast, typical applications of MM revolve around careful inspection of a log-likelihood or other objective function to be optimized, with particular attention paid to convexity and inequalities. Thus, success with MM algorithms and success with EM algorithms hinge on somewhat different mathematical maneuvers. However, the skills required by most MM algorithms are no harder to master than the skills required by most EM algorithms. The purpose of this article is to present some strategies for constructing MM algorithms and to illustrate various aspects of these algorithms through the study of specific examples.

We conclude this section with a note on nomenclature. Just as EM is more a prescription for creating algorithms than an actual algorithm, MM refers not to a single algorithm but to a class of algorithms. Thus, this article refers to specific EM and MM algorithms but never to "the MM algorithm" or "the EM algorithm."
2. THE MM PHILOSOPHY

Let θ^(m) represent a fixed value of the parameter θ, and let g(θ | θ^(m)) denote a real-valued function of θ whose form depends on θ^(m). The function g(θ | θ^(m)) is said to majorize a real-valued function f(θ) at the point θ^(m) provided

\[
g(\theta \mid \theta^{(m)}) \ge f(\theta) \quad \text{for all } \theta, \qquad
g(\theta^{(m)} \mid \theta^{(m)}) = f(\theta^{(m)}). \tag{1}
\]

In other words, the surface θ ↦ g(θ | θ^(m)) lies above the surface f(θ) and is tangent to it at the point θ = θ^(m). The function g(θ | θ^(m)) is said to minorize f(θ) at θ^(m) if −g(θ | θ^(m)) majorizes −f(θ) at θ^(m).

Ordinarily, θ^(m) represents the current iterate in a search of the surface f(θ). In a majorize-minimize MM algorithm, we minimize the majorizing function g(θ | θ^(m)) rather than the actual function f(θ). If θ^(m+1) denotes the minimizer of g(θ | θ^(m)), then we can show that the MM procedure forces f(θ) downhill. Indeed, the inequality

\[
\begin{aligned}
f(\theta^{(m+1)}) &= g(\theta^{(m+1)} \mid \theta^{(m)}) + f(\theta^{(m+1)}) - g(\theta^{(m+1)} \mid \theta^{(m)}) \\
&\le g(\theta^{(m)} \mid \theta^{(m)}) + f(\theta^{(m)}) - g(\theta^{(m)} \mid \theta^{(m)}) \\
&= f(\theta^{(m)})
\end{aligned} \tag{2}
\]

follows directly from the fact g(θ^(m+1) | θ^(m)) ≤ g(θ^(m) | θ^(m)) and definition (1). The descent property (2) lends an MM algorithm remarkable numerical stability. With straightforward changes, the MM recipe also applies to maximization rather than minimization: To maximize a function f(θ), we minorize it by a surrogate function g(θ | θ^(m)) and maximize g(θ | θ^(m)) to produce the next iterate θ^(m+1).
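The recipe translates almost directly into code. The following Python sketch is our own illustration rather than anything from the article; build_majorizer and minimize_surrogate are hypothetical, problem-specific callables, and the loop simply alternates the two MM steps while checking the descent property (2).

```python
# Minimal sketch (ours, not the authors') of a generic majorize-minimize loop.
# `build_majorizer(theta_m)` must return a surrogate g satisfying definition (1);
# `minimize_surrogate(g)` must return the minimizer of that surrogate.

def mm_minimize(f, build_majorizer, minimize_surrogate, theta0,
                tol=1e-8, max_iter=1000):
    theta = theta0
    for _ in range(max_iter):
        g = build_majorizer(theta)           # majorize f at the current iterate
        theta_next = minimize_surrogate(g)   # minimize the surrogate, not f itself
        # Descent property (2): the objective can never increase.
        if f(theta_next) > f(theta):
            raise RuntimeError("surrogate does not majorize f at the current iterate")
        if abs(f(theta) - f(theta_next)) < tol:
            return theta_next
        theta = theta_next
    return theta
```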

2.1 Calculation of Sample Quantiles

As a one-dimensional example, consider the problem of computing a sample quantile from a sample x_1, ..., x_n of n real numbers. One can readily prove (Hunter and Lange 2000b) that for q ∈ (0, 1), a qth sample quantile of x_1, ..., x_n minimizes the function

\[
f(\theta) = \sum_{i=1}^{n} \rho_q(x_i - \theta), \tag{3}
\]

where ρ_q(θ) is the "vee" function

\[
\rho_q(\theta) =
\begin{cases}
q\theta & \theta \ge 0 \\
-(1-q)\theta & \theta < 0.
\end{cases}
\]

When q = 1/2, this function is proportional to the absolute value function; for q ≠ 1/2, the "vee" is tilted to one side or the other. As seen in Figure 1(a), it is possible to majorize the "vee" function at any nonzero point by a simple quadratic function. Specifically, for a given θ^(m) ≠ 0, ρ_q(θ) is majorized at ±θ^(m) by

\[
\zeta_q(\theta \mid \theta^{(m)}) = \frac{1}{4}
\left\{ \frac{\theta^2}{|\theta^{(m)}|} + (4q - 2)\theta + |\theta^{(m)}| \right\}.
\]

Fortunately, the majorization relation between functions is closed under the formation of sums, nonnegative products, limits, and composition with an increasing function. These rules permit us to work piecemeal in simplifying complicated objective functions. Thus, the function f(θ) of Equation (3) is majorized at the point θ^(m) by

\[
g(\theta \mid \theta^{(m)}) = \sum_{i=1}^{n} \zeta_q(x_i - \theta \mid x_i - \theta^{(m)}). \tag{4}
\]

The function f(θ) and its majorizer g(θ | θ^(m)) are shown in Figure 1(b) for a particular sample of size n = 12.

[Figure 1. For q = 0.8, (a) depicts the "vee" function ρ_q(θ) and its quadratic majorizing function for θ^(m) = −0.75; (b) shows the objective function f(θ) that is minimized by the 0.8 quantile of the sample 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 5, along with its quadratic majorizer, for θ^(m) = 2.5.]

Setting the first derivative of g(θ | θ^(m)) equal to zero gives the minimum point

\[
\theta^{(m+1)} = \frac{n(2q - 1) + \sum_{i=1}^{n} w_i^{(m)} x_i}{\sum_{i=1}^{n} w_i^{(m)}}, \tag{5}
\]

where the weight w_i^(m) = |x_i − θ^(m)|^{-1} depends on θ^(m). A flaw of algorithm (5) is that the weight w_i^(m) is undefined whenever θ^(m) = x_i. In mending this flaw, Hunter and Lange (2000b) also discussed the broader technique of quantile regression introduced by Koenker and Bassett (1978). From a computational perspective, the most fascinating thing about the quantile-finding algorithm is that it avoids sorting and relies entirely on arithmetic and iteration. For the case of the sample median (q = 1/2), algorithm (5) is found in Schlossmacher (1973) and is shown to be an MM algorithm by Lange and Sinsheimer (1993) and Heiser (1995).

Because g(θ | θ^(m)) in Equation (4) is a quadratic function of θ, expression (5) coincides with the more general Newton–Raphson update

\[
\theta^{(m+1)} = \theta^{(m)} - \left[ \nabla^2 g(\theta^{(m)} \mid \theta^{(m)}) \right]^{-1}
\nabla g(\theta^{(m)} \mid \theta^{(m)}), \tag{6}
\]

where ∇g(θ^(m) | θ^(m)) and ∇²g(θ^(m) | θ^(m)) denote the gradient vector and the Hessian matrix of g(θ | θ^(m)) evaluated at θ^(m). Because the descent property (2) depends only on decreasing g(θ | θ^(m)) and not on minimizing it, the update (6) can serve in cases where g(θ | θ^(m)) lacks a closed form minimizer, provided this update decreases the value of g(θ | θ^(m)). In the context of EM algorithms, Dempster et al. (1977) called an algorithm that reduces g(θ | θ^(m)) without actually minimizing it a generalized EM (GEM) algorithm. The specific case of Equation (6), which we call a gradient MM algorithm, was studied in the EM context by Lange (1995a), who pointed out that update (6) saves us from performing iterations within iterations and yet still displays the same local rate of convergence as a full MM algorithm that minimizes g(θ | θ^(m)) at each iteration.
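For concreteness, a minimal Python version of iteration (5) might look as follows. This sketch is ours, not the authors', and it omits any safeguard for the undefined-weight flaw noted above; it simply stops when successive iterates agree to a tolerance.

```python
import numpy as np

def mm_quantile(x, q, theta0, max_iter=100, tol=1e-8):
    """Approximate the qth sample quantile of x via the MM update (5).

    No safeguard is included for iterates that land exactly on a sample
    point, where the weights w_i are undefined.
    """
    x = np.asarray(x, dtype=float)
    n = len(x)
    theta = float(theta0)
    for _ in range(max_iter):
        w = 1.0 / np.abs(x - theta)                        # weights w_i^(m)
        theta_new = (n * (2 * q - 1) + np.sum(w * x)) / np.sum(w)
        if abs(theta_new - theta) < tol:
            return theta_new
        theta = theta_new
    return theta

# The sample from Figure 1(b): the 0.8 quantile is 3, and starting from
# theta0 = 2.5 the iterates approach it without any sorting.
print(mm_quantile([1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 5], q=0.8, theta0=2.5))
```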
3. TRICKS OF THE TRADE

In the quantile example of Section 2.1, the convex "vee" function admits a quadratic majorizer as depicted in Figure 1(a). In general, many majorizing or minorizing relationships may be derived from various inequalities stemming from convexity or concavity. This section outlines some common inequalities used to construct majorizing or minorizing functions for various types of objective functions.

3.1 Jensen's Inequality

Jensen's inequality states for a convex function κ(x) and any random variable X that κ[E(X)] ≤ E[κ(X)]. Since −ln(x) is a convex function, we conclude for probability densities a(x) and b(x) that

\[
-\ln \mathrm{E}\!\left[ \frac{a(X)}{b(X)} \right] \le
-\mathrm{E}\!\left[ \ln \frac{a(X)}{b(X)} \right].
\]

If X has the density b(x), then E[a(X)/b(X)] = 1, so the left-hand side above vanishes and we obtain

\[
\mathrm{E}[\ln a(X)] \le \mathrm{E}[\ln b(X)],
\]

which is sometimes known as the information inequality. It is this inequality that guarantees that a minorizing function is constructed in the E-step of any EM algorithm (de Leeuw 1994; Heiser 1995), making every EM algorithm an MM algorithm.

3.2 Minorization via Supporting Hyperplanes

Jensen's inequality is easily derived from the supporting hyperplane property of a convex function: Any linear function tangent to the graph of a convex function is a minorizer at the point of tangency. Thus, if κ(θ) is convex and differentiable, then

\[
\kappa(\theta) \ge \kappa(\theta^{(m)}) + \nabla\kappa(\theta^{(m)})^t (\theta - \theta^{(m)}), \tag{7}
\]

with equality when θ = θ^(m). This inequality is illustrated by the example of Section 7 involving constrained optimization.

3.3 Majorization via the Definition of Convexity

If we wish to majorize a convex function instead of minorizing it, then we can use the standard definition of convexity; namely, κ(t) is convex if and only if

\[
\kappa\!\left( \sum_i \alpha_i t_i \right) \le \sum_i \alpha_i \kappa(t_i) \tag{8}
\]

for any finite collection of points t_i and corresponding multipliers α_i with α_i ≥ 0 and Σ_i α_i = 1. Application of definition (8) is particularly effective when κ(t) is composed with a linear function x^t θ. For instance, suppose for vectors x, θ, and θ^(m) that we make the substitution t_i = x_i(θ_i − θ_i^(m))/α_i + x^t θ^(m). Inequality (8) then becomes

\[
\kappa(x^t \theta) \le \sum_i \alpha_i \,
\kappa\!\left( \frac{x_i}{\alpha_i}\bigl(\theta_i - \theta_i^{(m)}\bigr) + x^t \theta^{(m)} \right). \tag{9}
\]

Alternatively, if all components of x, θ, and θ^(m) are positive, then we may take t_i = x^t θ^(m) θ_i / θ_i^(m) and α_i = x_i θ_i^(m) / x^t θ^(m). Now inequality (8) becomes

\[
\kappa(x^t \theta) \le \sum_i \frac{x_i \theta_i^{(m)}}{x^t \theta^{(m)}} \,
\kappa\!\left( \frac{x^t \theta^{(m)} \theta_i}{\theta_i^{(m)}} \right). \tag{10}
\]

Inequalities (9) and (10) have been used to construct MM algorithms in the contexts of medical imaging (De Pierro 1995; Lange and Fessler 1995) and least-squares estimation without matrix inversion (Becker et al. 1997).
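Because the right-hand sides of (9) and (10) are sums of one-dimensional terms, they are easy to sanity-check numerically. The snippet below is our own check of inequality (10), using κ = exp and randomly generated positive vectors; it confirms both the majorization and the tangency at θ^(m).

```python
import numpy as np

rng = np.random.default_rng(0)
kappa = np.exp                         # any convex function works; exp is convenient

x       = rng.uniform(0.5, 2.0, size=5)
theta   = rng.uniform(0.5, 2.0, size=5)
theta_m = rng.uniform(0.5, 2.0, size=5)

def surrogate(theta):
    """Right-hand side of inequality (10): a parameter-separated majorizer."""
    alpha = x * theta_m / (x @ theta_m)
    t = (x @ theta_m) * theta / theta_m
    return np.sum(alpha * kappa(t))

assert kappa(x @ theta) <= surrogate(theta)                 # majorization
assert np.isclose(kappa(x @ theta_m), surrogate(theta_m))   # tangency at theta^(m)
print("inequality (10) holds and is tight at theta^(m)")
```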
3.4 Majorization via a Quadratic Upper Bound

If a convex function κ(θ) is twice differentiable and has bounded curvature, then we can majorize κ(θ) by a quadratic function with sufficiently high curvature and tangent to κ(θ) at θ^(m) (Böhning and Lindsay 1988). In algebraic terms, if we can find a positive definite matrix M such that M − ∇²κ(θ) is nonnegative definite for all θ, then

\[
\kappa(\theta) \le \kappa(\theta^{(m)}) + \nabla\kappa(\theta^{(m)})^t (\theta - \theta^{(m)})
+ \frac{1}{2} (\theta - \theta^{(m)})^t M (\theta - \theta^{(m)})
\]

provides a quadratic upper bound. For example, Heiser (1995) noted in the unidimensional case that

\[
\frac{1}{\theta} \le \frac{1}{\theta^{(m)}} - \frac{\theta - \theta^{(m)}}{(\theta^{(m)})^2}
+ \frac{(\theta - \theta^{(m)})^2}{c^3}
\]

for 0 < c ≤ min{θ, θ^(m)}. The corresponding quadratic lower bound principle for minorization is the basis for the logistic regression example of Section 6.

3.5 The Arithmetic-Geometric Mean Inequality

The arithmetic-geometric mean inequality is a special case of inequality (8). Taking κ(t) = e^t and α_i = 1/m yields

\[
\exp\!\left( \frac{1}{m} \sum_{i=1}^{m} t_i \right) \le \frac{1}{m} \sum_{i=1}^{m} e^{t_i}.
\]

If we let x_i = e^{t_i}, then we obtain the standard form

\[
\sqrt[m]{\prod_{i=1}^{m} x_i} \le \frac{1}{m} \sum_{i=1}^{m} x_i \tag{11}
\]

of the arithmetic-geometric mean inequality. Because the exponential function is strictly convex, equality holds if and only if all of the x_i are equal. Inequality (11) is helpful in constructing the majorizer

\[
x_1 x_2 \le \frac{x_2^{(m)}}{2 x_1^{(m)}} x_1^2 + \frac{x_1^{(m)}}{2 x_2^{(m)}} x_2^2 \tag{12}
\]

of the product of two positive numbers. This inequality is used in the sports contest model of Section 4.
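Since inequality (12) is the key step in the sports model of Section 4, the following short Python check (again our own, with arbitrary positive numbers) confirms both the bound and its tangency at the current iterate.

```python
import numpy as np

rng = np.random.default_rng(1)
x1, x2 = rng.uniform(0.1, 5.0, size=2)     # arbitrary positive numbers
x1m, x2m = rng.uniform(0.1, 5.0, size=2)   # the current iterate (x1^(m), x2^(m))

def majorizer(x1, x2):
    # Right-hand side of (12): a separated quadratic upper bound on x1 * x2.
    return x1**2 * x2m / (2 * x1m) + x2**2 * x1m / (2 * x2m)

assert x1 * x2 <= majorizer(x1, x2)                # majorization
assert np.isclose(x1m * x2m, majorizer(x1m, x2m))  # equality at the current iterate
print("inequality (12) verified")
```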
3.6 The Cauchy–Schwartz Inequality

The Cauchy–Schwartz inequality for the Euclidean norm is a special case of inequality (7). The function κ(θ) = ‖θ‖ is convex because it satisfies the triangle inequality and the homogeneity condition ‖αθ‖ = |α| · ‖θ‖. Since κ(θ) = √(Σ_i θ_i²), we see that ∇κ(θ) = θ/‖θ‖, and therefore inequality (7) gives

\[
\|\theta\| \ge \|\theta^{(m)}\| + \frac{(\theta - \theta^{(m)})^t \theta^{(m)}}{\|\theta^{(m)}\|}
= \frac{\theta^t \theta^{(m)}}{\|\theta^{(m)}\|}, \tag{13}
\]

which is the Cauchy–Schwartz inequality. De Leeuw and Heiser (1977) and Groenen (1993) used inequality (13) to derive MM algorithms for multidimensional scaling.
4. SEPARATION OF PARAMETERS AND CYCLIC MM

One of the key criteria in judging minorizing or majorizing functions is their ease of optimization. Successful MM algorithms in high-dimensional parameter spaces often rely on surrogate functions in which the individual parameter components are separated. In other words, the surrogate function mapping θ ∈ U ⊂ R^d → R reduces to the sum of d real-valued functions taking the real-valued arguments θ_1 through θ_d. Because the d univariate functions may be optimized one by one, this makes the surrogate function easier to optimize at each iteration.

4.1 Poisson Sports Model

Consider a simplified version of a model proposed by Maher (1982) for a sports contest between two individuals or teams in which the number of points scored by team i against team j follows a Poisson process with intensity e^{o_i − d_j}, where o_i is an "offensive strength" parameter for team i and d_j is a "defensive strength" parameter for team j. If t_ij is the length of time that i plays j and p_ij is the number of points that i scores against j, then the corresponding Poisson log-likelihood function is

\[
\ell_{ij}(\theta) = p_{ij}(o_i - d_j) - t_{ij} e^{o_i - d_j}
+ p_{ij} \ln t_{ij} - \ln p_{ij}!, \tag{14}
\]

where θ = (o, d) is the parameter vector. Note that the parameters should satisfy a linear constraint, such as Σ_i o_i + Σ_j d_j = 0, in order for the model to be identifiable; otherwise, it is clearly possible to add the same constant to each o_i and d_j without altering the likelihood. We make two simplifying assumptions. First, different games are independent of each other. Second, each team's point total within a single game is independent of its opponent's point total. The second assumption is more suspect than the first because it implies that a team's offensive and defensive performances are somehow unrelated to one another; nonetheless the model gives an interesting first approximation to reality. Under these assumptions, the full data log-likelihood is obtained by summing ℓ_ij(θ) over all pairs (i, j). Setting the partial derivatives of the log-likelihood equal to zero leads to the equations

\[
e^{\hat{o}_i} = \frac{\sum_j p_{ij}}{\sum_j t_{ij} e^{-\hat{d}_j}}
\qquad \text{and} \qquad
e^{-\hat{d}_j} = \frac{\sum_i p_{ij}}{\sum_i t_{ij} e^{\hat{o}_i}}
\]

satisfied by the maximum likelihood estimate (ô, d̂). These equations do not admit a closed form solution, so we turn to an MM algorithm.

Because the task is to maximize the log-likelihood (14), we need a minorizing function. Focusing on the −t_ij e^{o_i − d_j} term, we may use inequality (12) to show that

\[
-t_{ij} e^{o_i - d_j} \ge
- \frac{t_{ij}\, e^{2 o_i}}{2\, e^{o_i^{(m)} + d_j^{(m)}}}
- \frac{t_{ij}}{2}\, e^{-2 d_j} e^{o_i^{(m)} + d_j^{(m)}}. \tag{15}
\]

Although the right side of the above inequality may appear more complicated than the left side, it is actually simpler in one important respect: the parameter components o_i and d_j are separated on the right side but not on the left. Summing the log-likelihood (14) over all pairs (i, j) and invoking inequality (15) yields the function

\[
g(\theta \mid \theta^{(m)}) = \sum_i \sum_j
\left[ p_{ij}(o_i - d_j)
- \frac{t_{ij}\, e^{2 o_i}}{2\, e^{o_i^{(m)} + d_j^{(m)}}}
- \frac{t_{ij}}{2}\, e^{-2 d_j} e^{o_i^{(m)} + d_j^{(m)}} \right]
\]

minorizing the full log-likelihood up to an additive constant independent of θ. The fact that the components of θ are separated by g(θ | θ^(m)) permits us to update parameters one by one and substantially reduces computational costs. Setting the partial derivatives of g(θ | θ^(m)) equal to zero yields the updates

\[
o_i^{(m+1)} = \frac{1}{2} \ln
\left\{ \frac{\sum_j p_{ij}}{\sum_j t_{ij} e^{-o_i^{(m)} - d_j^{(m)}}} \right\},
\qquad
d_j^{(m+1)} = -\frac{1}{2} \ln
\left\{ \frac{\sum_i p_{ij}}{\sum_i t_{ij} e^{o_i^{(m)} + d_j^{(m)}}} \right\}. \tag{16}
\]

The question now arises as to whether one should modify algorithm (16) so that updated subsets of the parameters are used as soon as they become available. For instance, if we update the o vector before the d vector in each iteration of algorithm (16), we could replace the formula for d_j^(m+1) above by

\[
d_j^{(m+1)} = -\frac{1}{2} \ln
\left\{ \frac{\sum_i p_{ij}}{\sum_i t_{ij} e^{o_i^{(m+1)} + d_j^{(m)}}} \right\}. \tag{17}
\]

In practice, an MM algorithm often takes fewer iterations when we cycle through the parameters updating one at a time than when we update the whole vector at once as in algorithm (16). We call such versions of MM algorithms cyclic MM algorithms; they generalize the ECM algorithms of Meng and Rubin (1993). A cyclic MM algorithm always drives the objective function in the right direction; indeed, every iteration of a cyclic MM algorithm is simply an MM iteration on a reduced parameter set.
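A compact Python sketch of updates (16) and (17) follows. It is our own rendering of the algorithm, not the authors' MATLAB code; it assumes a score matrix p and a playing-time matrix t with zero diagonals, with every team having scored at least once and played at least one game, and it renormalizes after each sweep to enforce the identifiability constraint.

```python
import numpy as np

def mm_sports(p, t, n_iter=500, cyclic=False):
    """MM updates (16)/(17) for the Poisson sports model (our sketch).

    p[i, j] = points scored by team i against team j,
    t[i, j] = minutes of play between i and j (0 if they never met).
    Set cyclic=True to reuse the new o vector in the d update, as in (17).
    """
    n = p.shape[0]
    o = np.zeros(n)
    d = np.zeros(n)
    for _ in range(n_iter):
        # Update (16) for the offensive strengths o_i.
        denom_o = np.sum(t * np.exp(-o[:, None] - d[None, :]), axis=1)
        o_new = 0.5 * np.log(p.sum(axis=1) / denom_o)
        o_used = o_new if cyclic else o
        # Update for the defensive strengths d_j.
        denom_d = np.sum(t * np.exp(o_used[:, None] + d[None, :]), axis=0)
        d_new = -0.5 * np.log(p.sum(axis=0) / denom_d)
        o, d = o_new, d_new
        # Enforce the constraint sum(o) + sum(d) = 0; o_i - d_j is unchanged.
        shift = (o.sum() + d.sum()) / (2 * n)
        o, d = o - shift, d - shift
    return o, d
```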
4.2 Application to National Basketball Association Results

Table 1 summarizes our application of the Poisson sports model to the results of the 2002–2003 regular season of the National Basketball Association. In these data, t_ij is measured in minutes. A regular game lasts 48 minutes, and each overtime period, if necessary, adds five minutes. Thus, team i is expected to score 48 e^{ô_i − d̂_j} points against team j when the two teams meet and do not tie. Team i is ranked higher than team j if ô_i − d̂_j > ô_j − d̂_i, which is equivalent to ô_i + d̂_i > ô_j + d̂_j.

Table 1. Ranking of All 29 NBA Teams on the Basis of the 2002–2003 Regular Season According to Their Estimated Offensive Strength Plus Defensive Strength. Each team played 82 games.

Team            ô_i + d̂_i   Wins      Team            ô_i + d̂_i   Wins
Cleveland        -0.0994      17       Phoenix           0.0166      44
Denver           -0.0845      17       New Orleans       0.0169      47
Toronto          -0.0647      24       Philadelphia      0.0187      48
Miami            -0.0581      25       Houston           0.0205      43
Chicago          -0.0544      30       Minnesota         0.0259      51
Atlanta          -0.0402      35       LA Lakers         0.0277      50
LA Clippers      -0.0355      27       Indiana           0.0296      48
Memphis          -0.0255      28       Utah              0.0299      47
New York         -0.0164      37       Portland          0.0320      50
Washington       -0.0153      37       Detroit           0.0336      50
Boston           -0.0077      44       New Jersey        0.0481      49
Golden State     -0.0051      38       San Antonio       0.0611      60
Orlando          -0.0039      42       Sacramento        0.0686      59
Milwaukee        -0.0027      42       Dallas            0.0804      60
Seattle           0.0039      40

It is worth emphasizing some of the virtues of the model. First, the ranking of the 29 NBA teams on the basis of the estimated sums ô_i + d̂_i for the 2002–2003 regular season is not perfectly consistent with their cumulative wins; strength of schedule and margins of victory are reflected in the model. Second, the model gives the point-spread function for a particular game as the difference of two independent Poisson random variables. Third, one can easily amend the model to rank individual players rather than teams by assigning to each player an offensive and defensive intensity parameter. If each game is divided into time segments punctuated by substitutions, then the MM algorithm can be adapted to estimate the assigned player intensities. This might provide a rational basis for salary negotiations that takes into account subtle differences between players not reflected in traditional sports statistics.

Finally, the NBA dataset sheds light on the comparative speeds of the original MM algorithm (16) and its cyclic modification (17). The cyclic MM algorithm converged in fewer iterations (25 instead of 28). However, because of the additional work required to recompute the denominators in Equation (17), the cyclic version required slightly more floating-point operations as counted by MATLAB (301,157 instead of 289,998).

5. SPEED OF CONVERGENCE

MM algorithms and Newton–Raphson algorithms have complementary strengths. On one hand, Newton–Raphson algorithms boast a quadratic rate of convergence as they near a local optimum point θ*. In other words, under certain general conditions,

\[
\lim_{m \to \infty} \frac{\|\theta^{(m+1)} - \theta^*\|}{\|\theta^{(m)} - \theta^*\|^2} = c
\]

for some constant c. This quadratic rate of convergence is much faster than the linear rate of convergence

\[
\lim_{m \to \infty} \frac{\|\theta^{(m+1)} - \theta^*\|}{\|\theta^{(m)} - \theta^*\|} = c < 1 \tag{18}
\]

displayed by typical MM algorithms. Hence, Newton–Raphson algorithms tend to require fewer iterations than MM algorithms. On the other hand, an iteration of a Newton–Raphson algorithm can be far more computationally onerous than an iteration of an MM algorithm. Examination of the form

\[
\theta^{(m+1)} = \theta^{(m)} - \left[ \nabla^2 f(\theta^{(m)}) \right]^{-1} \nabla f(\theta^{(m)})
\]

of a Newton–Raphson iteration reveals that it requires evaluation and inversion of the Hessian matrix ∇²f(θ^(m)). If θ has p components, then the number of calculations needed to invert the p × p matrix ∇²f(θ) is roughly proportional to p³. By contrast, an MM algorithm that separates parameters usually takes on the order of p or p² arithmetic operations per iteration. Thus, well-designed MM algorithms tend to require more iterations but simpler iterations than Newton–Raphson. For this reason MM algorithms sometimes enjoy an advantage in computational speed.

For example, the Poisson process scoring model for the NBA dataset of Section 4 has 57 parameters (two for each of 29 teams minus one for the linear constraint). A single matrix inversion of a 57 × 57 matrix requires roughly 387,000 floating point operations according to MATLAB. Thus, even a single Newton–Raphson iteration requires more computation in this example than the 300,000 floating point operations required for the MM algorithm to converge completely in 28 iterations. Numerical stability also enters the balance sheet. A Newton–Raphson algorithm can behave poorly if started too far from an optimum point. By contrast, MM algorithms are guaranteed to appropriately increase or decrease the value of the objective function at every iteration.

Other types of deterministic optimization algorithms, such as Fisher scoring, quasi-Newton methods, or gradient-free methods like Nelder–Mead, occupy a kind of middle ground. Although none of them can match Newton–Raphson in required iterations until convergence, each has its own merits. The expected information matrix used in Fisher scoring is sometimes easier to evaluate than the observed information matrix of Newton–Raphson. Scoring does not automatically lead to an increase in the log-likelihood, but at least (unlike Newton–Raphson) it can always be made to do so if some form of backtracking is incorporated. Quasi-Newton methods mitigate or even eliminate the need for matrix inversion. The Nelder–Mead approach is applicable in situations where the objective function is nondifferentiable. Because of the complexities of practical problems, it is impossible to declare any optimization algorithm best overall. In our experience, however, MM algorithms are often difficult to beat in terms of stability and computational simplicity.
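The linear rate in (18) is easy to observe empirically: store the MM iterates and look at the ratios of successive errors. The helper below is our own small utility for doing so; applied to the quantile iteration of Section 2.1, for instance, the ratios settle near a constant below one.

```python
import numpy as np

def successive_error_ratios(iterates, theta_star):
    """Ratios ||theta^(m+1) - theta*|| / ||theta^(m) - theta*|| along an MM run.

    For a linearly convergent MM algorithm these ratios approach the constant
    c < 1 in (18); for Newton-Raphson near the optimum they drop toward zero.
    """
    errors = [np.linalg.norm(np.atleast_1d(t) - theta_star) for t in iterates]
    return [b / a for a, b in zip(errors, errors[1:]) if a > 0]
```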
6. STANDARD ERROR ESTIMATES

In most cases, a maximum likelihood estimator has asymptotic covariance matrix equal to the inverse of the expected information matrix. In practice, the expected information matrix is often well-approximated by the observed information matrix −∇²ℓ(θ̂) computed by differentiating the log-likelihood ℓ(θ) twice. Thus, after the MLE θ̂ has been found, a standard error of θ̂ can be obtained by taking square roots of the diagonal entries of the inverse of −∇²ℓ(θ̂). In some problems, however, direct calculation of ∇²ℓ(θ̂) is difficult. Here we propose two numerical approximations to this matrix that exploit quantities readily obtained by running an MM algorithm. Let g(θ | θ^(m)) denote a minorizing function of the log-likelihood ℓ(θ) at the point θ^(m), and define

\[
M(\vartheta) = \arg\max_{\theta} \, g(\theta \mid \vartheta)
\]

to be the MM algorithm map taking θ^(m) to θ^(m+1).

6.1 Numerical Differentiation via MM

The two numerical approximations to −∇²ℓ(θ̂) are based on the formulas

\[
\begin{aligned}
\nabla^2 \ell(\hat{\theta})
&= \nabla^2 g(\hat{\theta} \mid \hat{\theta}) \left[ I - \nabla M(\hat{\theta}) \right]
&& \text{(19)} \\
&= \nabla^2 g(\hat{\theta} \mid \hat{\theta})
+ \left. \frac{\partial}{\partial \vartheta} \nabla g(\hat{\theta} \mid \vartheta) \right|_{\vartheta = \hat{\theta}},
&& \text{(20)}
\end{aligned}
\]

where I denotes the identity matrix. These formulas were derived by Lange (1999) using two simple facts: First, the tangency of ℓ(θ) and its minorizer imply that their gradient vectors are equal at the point of minorization; and second, the gradient of g(θ | θ^(m)) at its maximizer M(θ^(m)) is zero. Alternative derivations of formulas (19) and (20) were given by Meng and Rubin (1991) and Oakes (1999), respectively. Although these formulas have been applied to standard error estimation in the EM algorithm literature (Meng and Rubin 1991 base their SEM idea on formula (19)), to our knowledge neither has been applied in the broader context of MM algorithms.

Approximation of ∇²ℓ(θ̂) using Equation (19) requires a numerical approximation of the Jacobian matrix ∇M(θ), whose i, j entry equals

\[
\frac{\partial}{\partial \theta_j} M_i(\theta)
= \lim_{\delta \to 0} \frac{M_i(\theta + \delta e_j) - M_i(\theta)}{\delta}, \tag{21}
\]

where the vector e_j is the jth standard basis vector having a one in its jth component and zeros elsewhere. Because M(θ̂) = θ̂, the jth column of ∇M(θ̂) may be approximated using only output from the corresponding MM algorithm by (a) iterating until θ̂ is found, (b) altering the jth component of θ̂ by a small amount δ_j, (c) applying the MM algorithm to this altered θ, (d) subtracting θ̂ from the result, and (e) dividing by δ_j. Approximation of ∇²ℓ(θ̂) using Equation (20) is analogous except it involves numerically approximating the Jacobian of h(ϑ) = ∇g(θ̂ | ϑ). In this case one may exploit the fact that h(θ̂) is zero.
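Steps (a) through (e) translate into a short routine. The sketch below is our own reading of the procedure based on formula (19); mm_map and hess_g_hat are user-supplied quantities (one full MM update and the matrix ∇²g(θ̂ | θ̂)), and the finite-difference step mirrors the relative increment used later in Section 6.3.

```python
import numpy as np

def observed_info_via_mm(theta_hat, mm_map, hess_g_hat, rel_step=1e-3):
    """Approximate the observed information matrix via formula (19).

    theta_hat  : the MLE found by running the MM algorithm to convergence
    mm_map     : function computing one full MM update M(theta)
    hess_g_hat : the matrix of second derivatives of g(. | theta_hat) at theta_hat
    This is a sketch of steps (a)-(e) of Section 6.1, not code from the paper.
    """
    theta_hat = np.asarray(theta_hat, dtype=float)
    p = theta_hat.size
    dM = np.empty((p, p))
    for j in range(p):
        delta = rel_step * theta_hat[j] if theta_hat[j] != 0 else rel_step
        theta_pert = theta_hat.copy()
        theta_pert[j] += delta                               # step (b)
        dM[:, j] = (mm_map(theta_pert) - theta_hat) / delta  # steps (c)-(e)
    return -hess_g_hat @ (np.eye(p) - dM)                    # -formula (19)

# Standard errors: np.sqrt(np.diag(np.linalg.inv(observed_info_via_mm(...)))).
```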

Adding the last two inequalities, we see that


6.2 An MM Algorithm for Logistic Regression

To illustrate these ideas and facilitate comparison of the various numerical methods, we consider an example in which the Hessian of the log-likelihood is easy to compute. Böhning and Lindsay (1988) apply the quadratic bound principle of Section 3.4 to the case of logistic regression, in which we have an n × 1 vector Y of binary responses and an n × p matrix X of predictors. The model stipulates that the probability π_i(θ) that Y_i = 1 equals exp{θ^t x_i} / (1 + exp{θ^t x_i}). Straightforward differentiation of the resulting log-likelihood function shows that

\[
\nabla^2 \ell(\theta) = -\sum_{i=1}^{n} \pi_i(\theta) \left[ 1 - \pi_i(\theta) \right] x_i x_i^t.
\]

Since π_i(θ)[1 − π_i(θ)] is bounded above by 1/4, we may define the negative definite matrix B = −(1/4) X^t X and conclude that ∇²ℓ(θ) − B is nonnegative definite as desired. Therefore, the quadratic function

\[
g(\theta \mid \theta^{(m)}) = \ell(\theta^{(m)})
+ \nabla \ell(\theta^{(m)})^t (\theta - \theta^{(m)})
+ \frac{1}{2} (\theta - \theta^{(m)})^t B (\theta - \theta^{(m)})
\]

minorizes ℓ(θ) at θ^(m). The MM algorithm proceeds by maximizing this quadratic, giving

\[
\theta^{(m+1)} = \theta^{(m)} - B^{-1} \nabla \ell(\theta^{(m)})
= \theta^{(m)} + 4 (X^t X)^{-1} X^t \left[ Y - \pi(\theta^{(m)}) \right]. \tag{22}
\]

Since the MM algorithm of Equation (22) needs to invert X^t X only once, it enjoys an increasing computational advantage over Newton–Raphson as the number of predictors p increases (Böhning and Lindsay 1988).
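Update (22) is only a few lines of code. The following Python sketch is our own implementation of it, not the authors' code; it assumes X^t X is invertible and simply runs a fixed number of iterations rather than testing convergence.

```python
import numpy as np

def mm_logistic(X, Y, n_iter=200):
    """Logistic regression via the quadratic-bound MM update (22) (our sketch).

    The matrix 4 * (X'X)^{-1} is computed once and reused at every iteration,
    which is the source of the advantage over Newton-Raphson noted above.
    """
    n, p = X.shape
    A = 4.0 * np.linalg.inv(X.T @ X)   # -B^{-1}, computed a single time
    theta = np.zeros(p)
    for _ in range(n_iter):
        pi = 1.0 / (1.0 + np.exp(-X @ theta))   # pi_i(theta)
        theta = theta + A @ (X.T @ (Y - pi))    # update (22)
    return theta
```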

6.3 Application to Low Birth Weight Data

We now test the standard error approximations based on Equations (19) and (20) on the low birth weight dataset of Hosmer and Lemeshow (1989). This dataset involves 189 observations and eight maternal predictors. The response is 0 or 1 according to whether an infant is born underweight, defined as less than 2.5 kilograms. The predictors include mother's age in years (AGE), weight at last menstrual period (LWT), race (RACE2 and RACE3), smoking status during pregnancy (SMOKE), number of previous premature labors (PTL), presence of hypertension history (HT), presence of uterine irritability (UI), and number of physician visits during the first trimester (FTV). Each of these predictors is quantitative except for race, which is a three-level factor with level 1 for whites, level 2 for blacks, and level 3 for other races. Table 2 shows the maximum likelihood estimates and asymptotic standard errors for the 10 parameters. The differentiation increment δ_j was θ̂_j/1000 for each parameter θ_j. The standard error approximations in the two rightmost columns turn out to be the same in this example, but in other models they will differ. The close agreement of the approximations with the "gold standard" based on the exact value ∇²ℓ(θ̂) is clearly good enough for practical purposes.

Table 2. Estimated Coefficients and Standard Errors for the Low Birth Weight Logistic Regression Example

                              Standard errors based on:
Variable      θ̂            Exact ∇²ℓ(θ̂)   Equation (19)   Equation (20)
Constant      0.48062       1.1969          1.1984          1.1984
AGE          -0.029549      0.037031        0.037081        0.037081
LWT          -0.015424      0.0069194       0.0069336       0.0069336
RACE2         1.2723        0.52736         0.52753         0.52753
RACE3         0.8805        0.44079         0.44076         0.44076
SMOKE         0.93885       0.40215         0.40219         0.40219
PTL           0.54334       0.34541         0.34545         0.34545
HT            1.8633        0.69754         0.69811         0.69811
UI            0.76765       0.45932         0.45933         0.45933
FTV           0.065302      0.17240         0.17251         0.17251

7. HANDLING CONSTRAINTS

Many optimization problems impose constraints on parameters. For example, parameters are often required to be nonnegative. Here we discuss a majorization technique that in a sense eliminates inequality constraints. For this adaptive barrier method (Censor and Zenios 1992; Lange 1994) to work, an initial point θ^(0) must be selected with all inequality constraints strictly satisfied. The barrier method confines subsequent iterates to the interior of the parameter space but allows strict inequalities to become equalities in the limit.

Consider the problem of minimizing f(θ) subject to the constraints v_j(θ) ≥ 0 for 1 ≤ j ≤ q, where each v_j(θ) is a concave, differentiable function. Since −v_j(θ) is convex, we know from inequality (7) that

\[
v_j(\theta^{(m)}) - v_j(\theta) \ge \nabla v_j(\theta^{(m)})^t (\theta^{(m)} - \theta).
\]

Application of the similar inequality ln s − ln t ≥ s^{-1}(s − t) implies that

\[
v_j(\theta^{(m)}) \left[ -\ln v_j(\theta) + \ln v_j(\theta^{(m)}) \right]
\ge v_j(\theta^{(m)}) - v_j(\theta).
\]

Adding the last two inequalities, we see that

\[
v_j(\theta^{(m)}) \left[ -\ln v_j(\theta) + \ln v_j(\theta^{(m)}) \right]
+ \nabla v_j(\theta^{(m)})^t (\theta - \theta^{(m)}) \ge 0,
\]

with equality when θ = θ^(m). Summing over j and multiplying by a positive tuning parameter ω, we construct the function

\[
g(\theta \mid \theta^{(m)}) = f(\theta) + \omega \sum_{j=1}^{q}
\left[ v_j(\theta^{(m)}) \ln \frac{v_j(\theta^{(m)})}{v_j(\theta)}
+ (\theta - \theta^{(m)})^t \nabla v_j(\theta^{(m)}) \right] \tag{23}
\]

majorizing f(θ) at θ^(m). The presence of the term ln v_j(θ) in Equation (23) prevents v_j(θ^(m+1)) ≤ 0 from occurring. The multiplier v_j(θ^(m)) of ln v_j(θ) gradually adapts and allows v_j(θ^(m+1)) to tend to 0 if it is inclined to do so. When there are equality constraints Aθ = b in addition to the inequality constraints v_j(θ) ≥ 0, these should be enforced during the minimization of g(θ | θ^(m)).
7.1 Multinomial Sampling

To gain a feel for how these ideas work in practice, consider the problem of maximum likelihood estimation given a random sample of size n from a multinomial distribution. If there are q categories and n_i observations fall in category i, then the log-likelihood reduces to Σ_i n_i ln θ_i plus a constant. The components of the parameter vector θ satisfy θ_i ≥ 0 and Σ_i θ_i = 1. Although it is well known that the maximum likelihood estimates are given by θ̂_i = n_i/n, this example is instructive because it is explicitly solvable and demonstrates the linear rate of convergence of the proposed MM algorithm.

To minimize the negative log-likelihood f(θ) = −Σ_i n_i ln θ_i subject to the q inequality constraints v_i(θ) = θ_i ≥ 0 and the equality constraint Σ_i θ_i = 1, we construct the majorizing function

\[
g(\theta \mid \theta^{(m)}) = f(\theta)
- \omega \sum_{i=1}^{q} \theta_i^{(m)} \ln \theta_i
+ \omega \sum_{i=1}^{q} \theta_i
\]

suggested in Equation (23), omitting irrelevant constants. We minimize g(θ | θ^(m)) while enforcing Σ_i θ_i = 1 by introducing a Lagrange multiplier and looking for a stationary point of the Lagrangian

\[
h(\theta) = g(\theta \mid \theta^{(m)}) + \lambda \left( \sum_i \theta_i - 1 \right).
\]

Setting ∂h(θ)/∂θ_i equal to zero and multiplying by θ_i gives

\[
-n_i - \omega \theta_i^{(m)} + \omega \theta_i + \lambda \theta_i = 0.
\]

Summing on i reveals that λ = n and yields the update

\[
\theta_i^{(m+1)} = \frac{n_i + \omega \theta_i^{(m)}}{n + \omega}.
\]

Hence, all iterates have positive components if they start with positive components. The final rearrangement

\[
\theta_i^{(m+1)} - \frac{n_i}{n} = \frac{\omega}{n + \omega}
\left( \theta_i^{(m)} - \frac{n_i}{n} \right)
\]

demonstrates that θ^(m) approaches the estimate θ̂ at the linear rate ω/(n + ω), regardless of whether θ̂ occurs on the boundary of the parameter space where one or more of its components θ̂_i equal zero.
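The resulting update is trivial to implement. The sketch below is ours and iterates it on a small hypothetical count vector; it illustrates both the linear rate ω/(n + ω) and the ability of an iterate to approach the boundary when a count is zero.

```python
import numpy as np

def mm_multinomial(counts, omega=1.0, n_iter=200):
    """Adaptive-barrier MM update for multinomial proportions (Section 7.1).

    theta_i^(m+1) = (n_i + omega * theta_i^(m)) / (n + omega);
    starting from the interior keeps every iterate strictly positive, and
    the error shrinks by the factor omega / (n + omega) each iteration.
    """
    counts = np.asarray(counts, dtype=float)
    n = counts.sum()
    theta = np.full(len(counts), 1.0 / len(counts))   # strictly interior start
    for _ in range(n_iter):
        theta = (counts + omega * theta) / (n + omega)
    return theta

# Hypothetical counts: the MLE is counts/n = [0.5, 0.3, 0.0, 0.2], and the
# third component tends to the boundary value 0 without ever leaving (0, 1).
print(mm_multinomial([5, 3, 0, 2]))
```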
8. DISCUSSION

This article is meant to whet readers' appetites, not satiate them. We have omitted much. For instance, there is a great deal known about the convergence properties of MM algorithms that is too mathematically demanding to present here. Fortunately, almost all results from the EM algorithm literature (Wu 1983; Lange 1995a; McLachlan and Krishnan 1997; Lange 1999) carry over without change to MM algorithms. Furthermore, there are several methods for accelerating EM algorithms that are also applicable to accelerating MM algorithms (Heiser 1995; Lange 1995b; Jamshidian and Jennrich 1997; Lange et al. 2000).

Although this survey article necessarily reports much that is already known, there are some new results here. Our MM treatment of constrained optimization in Section 7 is more general than previous versions in the literature (Censor and Zenios 1992; Lange 1994). The application of Equation (20) to the estimation of standard errors in MM algorithms is new, as is the extension of the SEM idea of Meng and Rubin (1991) to the MM case.

There are so many examples of MM algorithms in the literature that we are unable to cite them all. Readers should be on the lookout for these and for known EM algorithms that can be explained more simply as MM algorithms. Even more importantly, we hope this article will stimulate readers to discover new MM algorithms.

[Received June 2003. Revised September 2003.]

REFERENCES

Becker, M. P., Yang, I., and Lange, K. (1997), "EM Algorithms Without Missing Data," Statistical Methods in Medical Research, 6, 38–54.
Bijleveld, C. C. J. H., and de Leeuw, J. (1991), "Fitting Longitudinal Reduced-Rank Regression Models by Alternating Least Squares," Psychometrika, 56, 433–447.
Böhning, D., and Lindsay, B. G. (1988), "Monotonicity of Quadratic Approximation Algorithms," Annals of the Institute of Statistical Mathematics, 40, 641–663.
Censor, Y., and Zenios, S. A. (1992), "Proximal Minimization With D-Functions," Journal of Optimization Theory and Applications, 73, 451–464.
de Leeuw, J. (1994), "Block Relaxation Algorithms in Statistics," in Information Systems and Data Analysis, eds. H. H. Bock, W. Lenski, and M. M. Richter, Berlin: Springer-Verlag, pp. 308–325.
de Leeuw, J., and Heiser, W. J. (1977), "Convergence of Correction Matrix Algorithms for Multidimensional Scaling," in Geometric Representations of Relational Data, eds. J. C. Lingoes, E. Roskam, and I. Borg, Ann Arbor, MI: Mathesis Press, pp. 735–752.
Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977), "Maximum Likelihood from Incomplete Data via the EM Algorithm," Journal of the Royal Statistical Society, Ser. B, 39, 1–38.
De Pierro, A. R. (1995), "A Modified Expectation Maximization Algorithm for Penalized Likelihood Estimation in Emission Tomography," IEEE Transactions on Medical Imaging, 14, 132–137.
Groenen, P. J. F. (1993), The Majorization Approach to Multidimensional Scaling: Some Problems and Extensions, Leiden, The Netherlands: DSWO Press.
Heiser, W. J. (1987), "Correspondence Analysis with Least Absolute Residuals," Computational Statistics and Data Analysis, 5, 337–356.
Heiser, W. J. (1995), "Convergent Computing by Iterative Majorization: Theory and Applications in Multidimensional Data Analysis," in Recent Advances in Descriptive Multivariate Analysis, ed. W. J. Krzanowski, Oxford: Clarendon Press, pp. 157–189.
Hosmer, D. W., and Lemeshow, S. (1989), Applied Logistic Regression, New York: Wiley.
Huber, P. J. (1981), Robust Statistics, New York: Wiley.
Hunter, D. R. (2004), "MM Algorithms for Generalized Bradley–Terry Models," The Annals of Statistics, 32, 386–408.
Hunter, D. R., and Lange, K. (2000a), Rejoinder to discussion of "Optimization Transfer Using Surrogate Objective Functions," Journal of Computational and Graphical Statistics, 9, 52–59.
Hunter, D. R., and Lange, K. (2000b), "Quantile Regression via an MM Algorithm," Journal of Computational and Graphical Statistics, 9, 60–77.
Hunter, D. R., and Lange, K. (2002), "Computing Estimates in the Proportional Odds Model," Annals of the Institute of Statistical Mathematics, 54, 155–168.
Hunter, D. R., and Li, R. (2002), "A Connection Between Variable Selection and EM-Type Algorithms," Technical Report 0201, Dept. of Statistics, Pennsylvania State University.
Jamshidian, M., and Jennrich, R. I. (1997), "Quasi-Newton Acceleration of the EM Algorithm," Journal of the Royal Statistical Society, Ser. B, 59, 569–587.
Kiers, H. A. L., and Ten Berge, J. M. F. (1992), "Minimization of a Class of Matrix Trace Functions by Means of Refined Majorization," Psychometrika, 57, 371–382.
Koenker, R., and Bassett, G. (1978), "Regression Quantiles," Econometrica, 46, 33–50.
Lange, K. (1994), "An Adaptive Barrier Method for Convex Programming," Methods and Applications of Analysis, 1, 392–402.
Lange, K. (1995a), "A Gradient Algorithm Locally Equivalent to the EM Algorithm," Journal of the Royal Statistical Society, Ser. B, 57, 425–437.
Lange, K. (1995b), "A Quasi-Newton Acceleration of the EM Algorithm," Statistica Sinica, 5, 1–18.
Lange, K. (1999), Numerical Analysis for Statisticians, New York: Springer-Verlag.
Lange, K., and Fessler, J. A. (1995), "Globally Convergent Algorithms for Maximum A Posteriori Transmission Tomography," IEEE Transactions on Image Processing, 4, 1430–1438.
Lange, K., Hunter, D. R., and Yang, I. (2000), "Optimization Transfer Using Surrogate Objective Functions" (with discussion), Journal of Computational and Graphical Statistics, 9, 1–20.
Lange, K., and Sinsheimer, J. S. (1993), "Normal/Independent Distributions and Their Applications in Robust Regression," Journal of Computational and Graphical Statistics, 2, 175–198.
Luenberger, D. G. (1984), Linear and Nonlinear Programming (2nd ed.), Reading, MA: Addison-Wesley.
Maher, M. J. (1982), "Modelling Association Football Scores," Statistica Neerlandica, 36, 109–118.
Marshall, A. W., and Olkin, I. (1979), Inequalities: Theory of Majorization and Its Applications, San Diego: Academic Press.
McLachlan, G. J., and Krishnan, T. (1997), The EM Algorithm and Extensions, New York: Wiley.
Meng, X.-L., and Rubin, D. B. (1991), "Using EM to Obtain Asymptotic Variance-Covariance Matrices: The SEM Algorithm," Journal of the American Statistical Association, 86, 899–909.
Meng, X.-L., and Rubin, D. B. (1993), "Maximum Likelihood Estimation via the ECM Algorithm: A General Framework," Biometrika, 80, 267–278.
Oakes, D. (1999), "Direct Calculation of the Information Matrix via the EM Algorithm," Journal of the Royal Statistical Society, Ser. B, 61, 479–482.
Ortega, J. M., and Rheinboldt, W. C. (1970), Iterative Solutions of Nonlinear Equations in Several Variables, New York: Academic Press, pp. 253–255.
Sabatti, C., and Lange, K. (2002), "Genomewide Motif Identification Using a Dictionary Model," Proceedings of the IEEE, 90, 1803–1810.
Schlossmacher, E. J. (1973), "An Iterative Technique for Absolute Deviations Curve Fitting," Journal of the American Statistical Association, 68, 857–859.
Wu, C. F. J. (1983), "On the Convergence Properties of the EM Algorithm," The Annals of Statistics, 11, 95–103.