Advanced Statistical Computing
Roger D. Peng
2018-07-17
Contents

Welcome
  Stay in Touch!
  Setup

1 Introduction
  1.1 Example: Linear Models
  1.2 Principle of Optimization Transfer
  1.3 Textbooks vs. Computers

3 General Optimization
  3.1 Steepest Descent
  3.2 The Newton Direction
  3.3 Quasi-Newton
  3.4 Conjugate Gradient
  3.5 Coordinate Descent

4 The EM Algorithm
  4.1 EM Algorithm for Exponential Families
  4.2 Canonical Examples
  4.3 A Minorizing Function
  4.4 Missing Information Principle
  4.5 Acceleration Methods

5 Integration
  5.1 Laplace Approximation
Welcome
The book covers material taught in the Johns Hopkins Biostatistics Advanced Statistical Computing course.
I taught this course off and on from 2003–2016 to upper level PhD students in Biostatistics. The course
ran for 8 weeks each year, which is a fairly compressed schedule for material of this nature. Because of the
short time frame, I felt the need to present material in a manner that assumed that students would often be
using others’ software to implement these algorithms but that they would need to know what was going on
underneath. In particular, should something go wrong with one of these algorithms, it’s important that they
know enough to diagnose the problem and make an attempt at fixing it. Therefore, the book is a bit light on
the details, in favor of giving a more general overview of the methods.
This book is a WORK IN PROGRESS.
Stay in Touch!
If you are interested in hearing more from me about things that I’m working on (books, data science courses,
podcast, etc.), you can do two things:
• First, I encourage you to join the Johns Hopkins Data Science Lab mailing list. On this list I send out
updates of my own activities as well as occasional comments on data science current events. You can
also find out what my co-conspirators Jeff Leek, Brian Caffo, and Stephanie Hicks are up to because
sometimes they do really cool stuff.
• Second, I have a regular podcast called Not So Standard Deviations that I co-host with Dr. Hilary
Parker, a Data Scientist at Stitch Fix. On this podcast, Hilary and I talk about the craft of data
science and discuss common issues and problems in analyzing data. We also compare how data science
is approached in both academia and industry contexts and discuss the latest industry trends. You can
listen to recent episodes on our web site or you can subscribe to it in Apple Podcasts or your favorite
podcasting app.
For those of you who purchased a printed copy of this book, I encourage you to go to the Leanpub web site
and obtain the e-book version, which is available for free. The reason is that I will occasionally update the
book with new material and readers who purchase the e-book version are entitled to free updates (this is
unfortunately not yet possible with printed books) and will be notified when they are released. You can also
find a web version of this book at the bookdown web site.
Thanks again for purchasing this book and please do stay in touch!
Setup
This book makes use of the following R packages, which should be installed to take full advantage of the
examples.
dplyr
ggplot2
knitr
MASS
microbenchmark
mvtnorm
readr
remotes
tidypvals
tidyr
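Most of these are on CRAN; tidypvals is installed from GitHub (as shown later in the p-values example). A sketch of the installation:

install.packages(c("dplyr", "ggplot2", "knitr", "MASS", "microbenchmark",
                   "mvtnorm", "readr", "remotes", "tidyr"))
remotes::install_github("jtleek/tidypvals")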
Figure 1: The process of statistical modeling.
1 Introduction
The journey from statistical model to useful output has many steps, most of which are taught in other books
and courses. The purpose of this book is to focus on one particular aspect of this journey: the development
and implementation of statistical algorithms.
It’s often nice to think about statistical models and various inferential philosophies and techniques, but when
the rubber meets the road, we need an algorithm and a computer program implementation to get the results
we need from a combination of our data and our models. This book is about how we fit models to data and
the algorithms that we use to do so.
Consider, for example, the simple linear model

\[
y = \beta_0 + \beta_1 x + \varepsilon \tag{1}
\]
This model has unknown parameters β0 and β1. Given observations (y1, x1), (y2, x2), . . . , (yn, xn), we can
combine these data with the likelihood principle, which gives us a procedure for producing model parameter
estimates. The likelihood can be maximized to produce maximum likelihood estimates,

\[
\hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x}
\]

and

\[
\hat\beta_1 = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2}.
\]

These statistics, β̂0 and β̂1, can then be interpreted, depending on the area of application, or used for other
purposes, perhaps as inputs to other procedures. In this simple example, we can see how each component of
the modeling process works.
Component            Implementation
-------------------  --------------------
Model                Linear regression
Principle/Technique  Likelihood principle
Algorithm            Maximization
Statistic            β̂0, β̂1
In this example, the maximization of the likelihood was simple because the solution was available in closed
form. However, in most other cases, there will not be a closed form solution and some specific algorithm will
be needed to maximize the likelihood.
Changing the implementation of a given component can lead to different outcomes further down the chain
and can even produce completely different outputs. Identical estimates for the parameters in this model can
be produced (in this case) by replacing the likelihood principle with the principle of least squares. However,
changing the principle to produce, for example, maximum a posteriori estimates would have produced different
statistics at the end.
There is a general principle that will be repeated in this book that Kenneth Lange calls “optimization transfer”.
The basic idea applies to the problem of maximizing a function f .
1. We want to maximize f , but it is difficult to do so.
2. We can compute an approximation to f , call it g, based on local information about f .
3. Instead of maximizing f , we “transfer” the maximization problem to g and maximize g instead.
4. We iterate Steps 2 and 3 until convergence.
The difficult problem of maximizing f is replaced with the simpler problem of maximizing g coupled with
iteration (otherwise known as computation). This is the optimization “transfer”.
Note that all of the above applies to minimization problems, because maximizing f is equivalent to minimizing
−f.
One confusing aspect of statistical computing is that often there is a disconnect between what is printed in a
statistical computing textbook and what should be implemented on the computer. In textbooks, it is usually
simpler to present solutions as convenient mathematical formulas whenever possible, in order to communicate
basic ideas and to provide some insight. However, directly translating these formulas into computer code is
usually not advisable because there are many problematic aspects of computers that are simply not relevant
when writing things down on paper.
Some key issues to look for when implementing statistical or numerical solutions on the computer are
1. Overflow - When numbers get too big, they cannot be represented on a computer and so often NAs are
produced instead;
2. Underflow - Similar to overflow, numbers can get too small for computers to represent, resulting in
errors or warnings or inaccurate computation;
3. Near linear dependence - the existence of linear dependence in matrix computations depends on the
precision of a machine. Because computers are finite precision, there are commonly situations where
one might think there is no linear dependence but the computer cannot tell the difference.
All three of the above problems arise from the finite precision nature of all computers. One must take care to
use algorithms that do calculations in the computable range and that automatically handle things like near
dependence.
Below, I highlight some common examples in statistics where the implementation diverges from what
textbooks explain as the solution: Computing with logarithms, the least squares solution to the linear
regression estimation problem, and the computation of the multivariate Normal density. The latter two problems, on
paper, involve inverting a matrix, which is typically a warning sign in any linear algebra problem. While
matrix inverses are commonly found in statistics textbooks, it’s rare in practice that you will ever want to
directly compute them. This point bears repeating: If you find yourself computing the inverse of a
matrix, there is usually a better way of doing whatever you are trying to do.
Most textbooks write out functions, such as densities, in their natural form. For example, the univariate
Normal distribution with mean µ and variance ‡ 2 is written
\[
f(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{1}{2\sigma^2}(x-\mu)^2}
\]
and you can compute this value for any x, μ, and σ in R with the dnorm() function.
But in practice, you almost never have to compute this exact number. Usually you can get away with
computing the log of this value (and with dnorm() you can set the option log = TRUE). In some situations,
such as with importance sampling, you do have to compute density values on the original scale, and that can
be considered a disadvantage of that technique.
Computing densities with logarithms is much more numerically stable than computing densities without
them. With the exponential function in the density, numbers can get very small quickly, to the point where
they are too small for the machine to represent (underflow). In some situations, you may need to take the
ratio of two densities and then you may end up with either underflow or overflow (if numbers get too big).
Doing calculations on a log scale (and then exponentiating them later if needed) usually resolves problems of
underflow or overflow.
In this book (and in any other), when you see expressions like f(x)/g(x), you should think that this means
exp(log f(x) − log g(x)). The two are equivalent but the latter is likely more numerically stable. In fact,
most of the time, you never have to re-exponentiate the values, in which case you can spend your entire time
in log-land. For example, in the rejection sampling algorithm, you need to determine if U ≤ f(x)/g(x). However,
taking the log of both sides allows you to do the exact same comparison in a much more numerically stable
way.
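As a small sketch of that idea (the specific densities and value of x below are only illustrative):

u <- runif(1)
x <- 40
logf <- dnorm(x, log = TRUE)          ## log f(x); dnorm(40) underflows to 0 on the raw scale
logg <- dt(x, df = 2, log = TRUE)     ## log g(x)
accept <- log(u) <= logf - logg       ## same comparison as U <= f(x)/g(x), done in log-land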
The typical linear regression model, written in matrix form, is represented as follows,

\[
y = X\beta + \varepsilon
\]

where y is an n × 1 observed response, X is the n × p predictor matrix, β is the p × 1 coefficient vector, and
ε is the n × 1 error vector.
In most textbooks the solution for estimating β, whether it be via maximum likelihood or least squares, is
written as

\[
\hat\beta = (X'X)^{-1}X'y.
\]

And indeed, that is the solution. In R, this could be translated literally as

betahat <- solve(t(X) %*% X) %*% t(X) %*% y
where solve() is used to invert the cross product matrix X'X. However, one would never compute the
actual value of β̂ this way on the computer. The formula presented above is only computed in textbooks.
The primary reason is that computing the direct inverse of X'X is very expensive computationally and
is a potentially unstable operation on a computer when there is high collinearity amongst the predictors.
Furthermore, in computing β̂ we do not actually need the inverse of X'X, so why compute it? A simpler
approach would be to take the normal equations,

\[
X'X\beta = X'y
\]

and solve them directly. In R, we could write

solve(crossprod(X), crossprod(X, y))

Rather than compute the inverse of X'X, we directly compute β̂ via Gaussian elimination. This approach
has the benefit of being more numerically stable and being much faster.
set.seed(2017-07-13)
X <- matrix(rnorm(5000 * 100), 5000, 100)
y <- rnorm(5000)
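The timing output shown below for the naive textbook translation was presumably produced by a call along these lines (using the microbenchmark package):

library(microbenchmark)
microbenchmark(solve(t(X) %*% X) %*% t(X) %*% y)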
Unit: milliseconds
expr min lq mean median
solve(t(X) %*% X) %*% t(X) %*% y 8.797343 9.835521 11.94031 10.85574
uq max neval
13.02197 57.42063 100
The following timing compares the naive approach with computing β̂ via the normal equations and Gaussian elimination using solve().
microbenchmark(solve(t(X) %*% X) %*% t(X) %*% y,
solve(crossprod(X), crossprod(X, y)))
Unit: milliseconds
expr min lq mean median
solve(t(X) %*% X) %*% t(X) %*% y 8.187201 8.957787 10.140894 9.197867
solve(crossprod(X), crossprod(X, y)) 1.841551 1.934159 2.012302 1.992533
uq max neval
11.096826 36.599512 100
2.061172 3.284126 100
You can see that between the two approaches there is a more than 5-fold difference in computation time,
with the second approach being considerably faster.
However, this approach breaks down when there is any collinearity in the X matrix. For example, we can
tack on a column to X that is very similar (but not identical) to the first column of X.
W <- cbind(X, X[, 1] + rnorm(5000, sd = 0.0000000001))
solve(crossprod(W), crossprod(W, y))
With the nearly collinear column added, this call breaks down because the cross product W'W is close to
singular. A more robust alternative is the QR decomposition. Here, we use the fact that X can be decomposed
as X = QR, where Q is an orthonormal matrix and R is an upper triangular matrix. Given that, we can write

\[
X'X\beta = X'y
\]

as

\[
R'Q'QR\beta = R'Q'y
\]
\[
R'R\beta = R'Q'y
\]
\[
R\beta = Q'y
\]

because Q'Q = I. At this point, we can solve for β via Gaussian elimination, which is greatly simplified
because R is already upper triangular. The QR decomposition has the added benefit that we do not have
to compute the cross product X'X at all, as this matrix can be numerically unstable if it is not properly
centered or scaled.
We can see in R code that even with our (computationally) rank-deficient matrix W above, the QR decomposition
proceeds without error.
Qw <- qr(W)
str(Qw)
List of 4
$ qr : num [1:5000, 1:101] 70.88664 -0.01277 0.01561 0.00158 0.02451 ...
$ rank : int 100
$ qraux: num [1:101] 1.01 1.03 1.01 1.02 1.02 ...
$ pivot: int [1:101] 1 2 3 4 5 6 7 8 9 10 ...
- attr(*, "class")= chr "qr"
Note that the output of qr() computes the rank of W to be 100, not 101, because of the collinear column.
From there, we can get β̂ if we want using qr.coef(),
betahat <- qr.coef(Qw, y)
We do not show it here, but the very last element of betahat is NA because a coefficient corresponding to the
last column of W (the collinear column) could not be calculated.
While the QR decomposition does handle collinearity, we do pay a price in speed.
library(ggplot2)
m <- microbenchmark(solve(t(X) %*% X) %*% t(X) %*% y,
solve(crossprod(X), crossprod(X, y)),
qr.coef(qr(X), y))
autoplot(m)
Compared to the approaches above, the QR method is comparable in speed to the naive approach, but it is a
better and more stable method.
In practice, we do not use functions like qr() or qr.coef() directly because higher level functions like lm()
do the work for us. However, for certain narrow, highly optimized cases, it may be fruitful to turn to another
matrix decomposition to compute linear regression coefficients, particularly if this must be done repeatedly
in a loop.
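For instance, when coefficients must be recomputed many times in a loop, one option (a sketch, not from the text) is to call the lower-level fitting routine directly rather than lm():

## lm.fit() skips the formula/model-frame machinery of lm() and fits via a QR decomposition
fit <- lm.fit(x = X, y = y)
betahat <- fit$coefficients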
Computing the multivariate normal density is a common problem in statistics, such as in fitting spatial
statistical models or Gaussian process models. Because optimization procedures used to compute maximum
likelihood estimates or likelihood ratios can be evaluated hundreds or thousands of times in a single run, it’s
useful to have a highly efficient procedure for evaluating the multivariate Normal density.
The log of the p-dimensional multivariate Normal density is

\[
\log\varphi(x \mid \mu, \Sigma) = -\frac{p}{2}\log 2\pi - \frac{1}{2}\log|\Sigma| - \frac{1}{2}(x-\mu)'\Sigma^{-1}(x-\mu).
\]

The critical, and most time-consuming, part of computing the multivariate Normal density is the quadratic
form,

\[
(x-\mu)'\Sigma^{-1}(x-\mu).
\]

We can simplify this problem a bit by focusing on the centered version of x, which we will refer to as z = x − μ.
Hence, we are trying to compute

\[
z'\Sigma^{-1}z.
\]
Here, much like the linear regression example above, the key bottleneck is the inversion of the p-dimensional
covariance matrix Σ. If we take z to be a p × 1 column vector, then a literal translation of the mathematics
into R code might look something like this,

t(z) %*% solve(Sigma) %*% z

But once again, we are taking on the difficult and unstable task of inverting Σ when, at the end of the day,
we do not need this inverse.
Instead of taking the textbook translation approach, we can make use of the Cholesky decomposition of Σ.
The Cholesky decomposition of a positive definite matrix provides

\[
\Sigma = R'R
\]

where R is an upper triangular matrix. R is sometimes referred to as the "square root" of Σ (although it is
not unique). Using the Cholesky decomposition of Σ and the rules of matrix algebra, we can then write

\[
z'\Sigma^{-1}z = z'(R'R)^{-1}z = z'R^{-1}R'^{-1}z = (R'^{-1}z)'(R'^{-1}z) = v'v
\]

where v = R'^{-1}z is a p × 1 vector. Furthermore, we can avoid inverting R' by computing v as the solution
to the linear system

\[
R'v = z.
\]

Once we have computed v, we can compute the quadratic form as v'v, which is simply the cross product of
two p-dimensional vectors!
Another benefit of the Cholesky decomposition is that it gives us a simple way to compute the log-determinant
of Σ. The log-determinant of Σ is simply 2 times the sum of the log of the diagonal elements of R.
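As a small illustration (a sketch, assuming Sigma is a positive definite matrix):

R <- chol(Sigma)                     ## upper triangular R with Sigma = R'R
logdet <- 2 * sum(log(diag(R)))      ## log-determinant of Sigma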
Here is an implementation of the naive approach to computing the quadratic form in the multivariate Normal.
set.seed(2017-07-13)
z <- matrix(rnorm(200 * 100), 200, 100)
S <- cov(z)
quad.naive <- function(z, S) {
Sinv <- solve(S)
rowSums((z %*% Sinv) * z)
}
We can first take a look at the output that this function produces.
library(dplyr)
quad.naive(z, S) %>% summary
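The Cholesky-based function quad.chol() is not reproduced in this extract; a minimal sketch consistent with the derivation above (using chol() and backsolve()) would be:

quad.chol <- function(z, S) {
        R <- chol(S)                               ## S = R'R with R upper triangular
        v <- backsolve(R, t(z), transpose = TRUE)  ## solve R'v = z for each row of z
        colSums(v * v)                             ## quadratic form v'v for each observation
}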
We can verify that this function produces the same output as the naive version.
quad.chol(z, S) %>% summary
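The timing comparison shown below was presumably generated by a call along these lines:

microbenchmark(quad.naive(z, S),
               quad.chol(z, S))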
Unit: microseconds
expr min lq mean median uq max
quad.naive(z, S) 593.125 749.9505 1075.3927 830.0240 903.1380 17847.149
quad.chol(z, S) 252.818 448.3040 592.0377 485.8535 528.9575 9492.268
neval
100
100
We can see that the version using the Cholesky decomposition takes about 60% of the time of the naive
version. In a single evaluation, this may not amount to much time. However, over the course of potentially
many iterations, these kinds of small savings can add up.
The key lesson here is that our use of the Cholesky decomposition takes advantage of the fact that we know
that the covariance matrix in a multivariate Normal is symmetric and positive definite. The naive version of
the algorithm that just blindly inverts the covariance matrix is not able to take advantage of this information.
The bisection algorithm is a simple method for finding the roots of one-dimensional functions. The goal is to
find a root x0 ∈ [a, b] such that f(x0) = 0. The algorithm starts with a large interval, known to contain x0,
and then successively reduces the size of the interval until it brackets the root. The theoretical underpinning
of the algorithm is the intermediate value theorem, which states that if a continuous function f takes values
f(a) and f(b) at the end points of the interval [a, b], then f must take all values between f(a) and f(b)
somewhere in the interval. So if f(a) < γ < f(b), then there exists a c ∈ [a, b] such that f(c) = γ.
Using this information, we can present the bisection algorithm. First we must check that sign(f(a)) ≠
sign(f(b)). Otherwise, the interval does not contain the root and might need to be widened. Then we can
proceed:
1. Let c = (a + b)/2.
2. If f(c) = 0, stop; c is the root.
3. If sign(f(a)) ≠ sign(f(c)), set b ← c; otherwise set a ← c.
4. Return to Step 1 and repeat until convergence.
Figure 2: Ideal setup for bisection algorithm.
The bisection algorithm can run into problems in situations where the function f is not well behaved. The
ideal situation for the bisection algorithm looks something like this.
Here, f (a) and f (b) are of opposite signs and the root is clearly in between a and b.
In the scenario below, the algorithm will not start because f (a) > 0 and f (b) > 0.
In this next scenario, there are two roots between a and b, in addition to having f (a) > 0 and f (b) > 0. One
would need to reduce the length of the starting interval in order to find either root.
In the scenario below, the algorithm will start because f (a) and f (b) are of opposite sign, but there is no root.
Convergence of the bisection algorithm can be determined by either having |b − a| < ε for some small ε or
having |f(b) − f(a)| < ε. Which criterion you use will depend on the specific application and on what kinds
of tolerances are required.
Given a cumulative distribution function F(x) and a number p ∈ (0, 1), a quantile of F is a number x such
that F(x) = p. The bisection algorithm can be used to find a quantile x for a given p by defining the function
g(x) = F(x) − p and solving for the value of x that achieves g(x) = 0.
Another way to put this is that we are inverting the CDF to compute x = F^{-1}(p). So the bisection algorithm
can be used to invert functions in these situations.
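A small sketch of this idea (the helper function below is illustrative, not from the text):

## Find the p-th quantile of a CDF by bisection on g(x) = F(x) - p
cdf_quantile <- function(p, cdf, a, b, tol = 1e-8) {
        g <- function(x) cdf(x) - p
        stopifnot(sign(g(a)) != sign(g(b)))   ## the interval must bracket the root
        while (abs(b - a) > tol) {
                mid <- (a + b) / 2
                if (sign(g(a)) != sign(g(mid))) b <- mid else a <- mid
        }
        (a + b) / 2
}
cdf_quantile(0.25, pnorm, -10, 10)   ## should be close to qnorm(0.25)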
Figure 3: Derivative of f at the root is 0.
Figure 5: Interval contains an asymptote but no root.
One of the ways in which algorithms will be compared is via their rates of convergence to some limiting value.
Typically, we have an iterative algorithm that is trying to find the maximum/minimum of a function and we
want an estimate of how long it will take to reach that optimal value. There are three rates of convergence
that we will focus on here—linear, superlinear, and quadratic—which are ordered from slowest to fastest.
In our context, rates of convergence are typically determined by how much information about the target
function f we use in the updating process of the algorithm. Algorithms that use little information about f ,
such as the bisection algorithm, converge slowly. Algorithms that require more information about f , such as
derivative information, typically converge more quickly. There is no free lunch!
Suppose we have a sequence {x_n} such that x_n → x_∞ in R^k. We say the convergence is linear if there exists
r ∈ (0, 1) such that

\[
\frac{\|x_{n+1} - x_\infty\|}{\|x_n - x_\infty\|} \le r
\]

for all n sufficiently large.
2.2.1.1 Example
The simple sequence x_n = 1 + (1/2)^n converges linearly to x_∞ = 1 because

\[
\frac{\|x_{n+1} - x_\infty\|}{\|x_n - x_\infty\|} = \frac{(1/2)^{n+1}}{(1/2)^{n}} = \frac{1}{2},
\]

which is always in (0, 1).
We say the convergence is superlinear if

\[
\lim_{n\to\infty} \frac{\|x_{n+1} - x_\infty\|}{\|x_n - x_\infty\|} = 0.
\]

The sequence above does not converge superlinearly because the ratio is always constant, and so can never
converge to zero as n → ∞. However, the sequence x_n = 1 + (1/n)^n converges superlinearly to 1.
Quadratic convergence is the fastest form of convergence that we will discuss here and is generally considered
desirable if possible to achieve. We say the sequence converges at a quadratic rate if there exists some constant
0 < M < ∞ such that

\[
\frac{\|x_{n+1} - x_\infty\|}{\|x_n - x_\infty\|^2} \le M
\]

for all n sufficiently large.
Extending the examples from above, the sequence x_n = 1 + (1/n)^{2^n} converges quadratically to 1. With this
sequence, we have

\[
\frac{\|x_{n+1} - x_\infty\|}{\|x_n - x_\infty\|^2}
= \frac{\left(\frac{1}{n+1}\right)^{2^{n+1}}}{\left(\frac{1}{n}\right)^{2\cdot 2^{n}}}
= \left(\frac{n}{n+1}\right)^{2^{n+1}} \le 1.
\]
For the bisection algorithm, the error that we make in estimating the root is x_n = |b_n − a_n|, where a_n
and b_n represent the end points of the bracketing interval at iteration n. However, we know that the size
of the interval in the bisection algorithm decreases by a half at each iteration. Therefore, we can write
x_n = 2^{-n}|b_0 − a_0|, so that the rate of convergence is

\[
\frac{x_{n+1}}{x_n} = \frac{2^{-(n+1)}|b_0 - a_0|}{2^{-n}|b_0 - a_0|} = \frac{1}{2},
\]

meaning the bisection algorithm converges at a linear rate.
We want to find a solution to the equation f(x) = 0 for f : R^k → R and x ∈ S ⊂ R^k. One approach to solving
this problem is to characterize solutions as fixed points of other functions. For example, if f(x0) = 0, then x0
is a fixed point of the function g(x) = f(x) + x. Another such function might be g(x) = x(f(x) + 1) for x ≠ 0.
In some situations, we can construct a function g and a sequence x_n = g(x_{n−1}) such that the
sequence satisfies x_n → x_∞ where g(x_∞) = x_∞. In other words, the sequence of values x_n converges to a fixed point
of g. If this fixed point satisfies f(x_∞) = 0, then we have found a solution to our original problem.
The most important algorithm that we will discuss based on functional iteration is Newton's method. The
EM algorithm can also be formulated as a functional iteration, and we will discuss that
in a later section.
When can such a functional iteration procedure work? The Shrinking Lemma gives us the conditions under
which this type of sequence will converge.
The Shrinking Lemma gives conditions under which a sequence derived via functional iteration will converge
to a fixed point. Let M be a closed subset of a complete normed vector space and let f : M → M be a map.
Assume that there exists a K, 0 < K < 1, such that for all x, y ∈ M,

\[
\|f(x) - f(y)\| \le K\|x - y\|.
\]

Then f has a unique fixed point, i.e. there is a unique point x0 ∈ M such that f(x0) = x0.
Proof: The basic idea of the proof of this lemma is that for a given x ∈ M, we can construct a Cauchy sequence
{f^n(x)} that converges to x0, where f^n(x) represents the nth functional iteration of x, i.e. f^2(x) = f(f(x)).
Given x ∈ M, we can write

\[
\|f^{n}(x) - f^{m}(x)\| = \|f^{m}(f^{k}(x)) - f^{m}(x)\| \le K^{m}\|f^{k}(x) - x\|.
\]

Therefore, there exists some N such that for all m, n ≥ N (say n = m + k), we have ‖f^n(x) − f^m(x)‖ ≤ ε,
because K^m → 0 as m → ∞. As a result, the sequence {f^n(x)} is a Cauchy sequence, so let x0 be its limit.
Given ε > 0, let N be such that for all n ≥ N, ‖f^n(x) − x0‖ ≤ ε. Then we can also say that for n ≥ N,

\[
\|f^{n+1}(x) - f(x_0)\| \le K\|f^{n}(x) - x_0\| \le K\varepsilon.
\]

So what we have is {f^n(x)} → x0 and {f^{n+1}(x)} → f(x0). Therefore, x0 is a fixed point of f, so that
f(x0) = x0.
To show that x0 is unique, suppose that x1 is another fixed point of f. Then

\[
\|x_1 - x_0\| = \|f(x_1) - f(x_0)\| \le K\|x_1 - x_0\|,
\]

which, because K < 1, can only hold if ‖x1 − x0‖ = 0, i.e. x1 = x0.
Now suppose g is continuously differentiable, maps a closed interval M into itself, and satisfies |g'(x)| ≤ K < 1
for all x ∈ M. Then by the mean value theorem, ‖g(x) − g(y)‖ ≤ K‖x − y‖ for all x, y ∈ M, so g is a shrinking
map on M and has a unique fixed point there.
Newton's method builds a sequence of values {x_n} via functional iteration that converges to the root of a
function f. Let that root be called x_∞ and let x_n be the current estimate. By the mean value theorem, we
know there exists some z such that

\[
f(x_n) = f'(z)(x_n - x_\infty),
\]

where z is somewhere between x_n and x_∞. Rearranging terms, we can write

\[
x_\infty = x_n - \frac{f(x_n)}{f'(z)}.
\]

Obviously, we do not know x_∞ or z, so we can replace them with our next iterate x_{n+1} and our current
iterate x_n, giving us the Newton update formula,

\[
x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)}.
\]
We will discuss Newton’s method more in the later section on general optimization, as it is a core method for
minimizing functions.
Newton's method can be written as a functional iteration that converges to a fixed point. Let f be a function
that is twice continuously differentiable and suppose there exists an x_∞ such that f(x_∞) = 0 and f'(x_∞) ≠ 0.
Then there exists a δ such that for any x0 ∈ (x_∞ − δ, x_∞ + δ), the sequence

\[
x_n = g(x_{n-1}) = x_{n-1} - \frac{f(x_{n-1})}{f'(x_{n-1})}
\]

converges to x_∞.
Note that

\[
g'(x) = 1 - \frac{f'(x)^2 - f(x)f''(x)}{f'(x)^2} = \frac{f(x)f''(x)}{f'(x)^2}.
\]

Therefore, g'(x_∞) = 0 because we assume f(x_∞) = 0 and f'(x_∞) ≠ 0. Further, we know g' is continuous
because we assumed f was twice continuously differentiable.
Therefore, given K < 1, there exists δ > 0 such that for all x ∈ (x_∞ − δ, x_∞ + δ), we have |g'(x)| < K. For
any a, b ∈ (x_∞ − δ, x_∞ + δ) we can also write, by the mean value theorem,

\[
\|g(a) - g(b)\| \le K\|a - b\|.
\]

In the interval x_∞ ± δ we have that g is a shrinking map. Therefore, there exists a unique fixed point x_∞
such that g(x_∞) = x_∞. This value x_∞ is a root of f.
Although the proof of Newton's method's convergence to a root can be done using the Shrinking Lemma, the
convergence rate of Newton’s method is considerably faster than the linear rate of generic shrinking maps.
This fast convergence is obtained via the additional assumptions we make about the smoothness of the
function f .
Suppose again that f is twice continuously differentiable and that there exists x_∞ such that f(x_∞) = 0.
Given some small ε > 0, we can approximate f around x_∞ with

\[
f(x_\infty + \varepsilon) = f(x_\infty) + \varepsilon f'(x_\infty) + \frac{\varepsilon^2}{2}f''(x_\infty) + o(\varepsilon^2)
= 0 + \varepsilon f'(x_\infty) + \frac{\varepsilon^2}{2}f''(x_\infty) + o(\varepsilon^2),
\]

and similarly for the first derivative,

\[
f'(x_\infty + \varepsilon) = f'(x_\infty) + \varepsilon f''(x_\infty) + o(\varepsilon).
\]

Recall that the Newton update is

\[
x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)}.
\]
Using the time-honored method of adding and subtracting, we can write this as

\[
x_{n+1} - x_\infty = x_n - x_\infty - \frac{f(x_n)}{f'(x_n)}.
\]
If we let ε_{n+1} = x_{n+1} − x_∞ and ε_n = x_n − x_∞, then we can rewrite the above as

\[
\varepsilon_{n+1} = \varepsilon_n - \frac{f(x_n)}{f'(x_n)}
= \varepsilon_n - \frac{f(x_\infty + \varepsilon_n)}{f'(x_\infty + \varepsilon_n)}.
\]

From here, we can use the approximations written out earlier to give us

\[
\varepsilon_{n+1} \approx \varepsilon_n - \frac{\varepsilon_n f'(x_\infty) + \frac{\varepsilon_n^2}{2}f''(x_\infty)}{f'(x_\infty) + \varepsilon_n f''(x_\infty)}
= \varepsilon_n^2 \left(\frac{\frac{1}{2}f''(x_\infty)}{f'(x_\infty) + \varepsilon_n f''(x_\infty)}\right).
\]

Taking absolute values and dividing, we have

\[
\frac{|\varepsilon_{n+1}|}{|\varepsilon_n|^2} \le M
\]

as n → ∞, which is the definition of quadratic convergence. Of course, for this to work we need that
f''(x_∞) < ∞ and that f'(x_∞) ≠ 0.
In summary, Newton’s method is very fast in the neighborhood of the root and furthermore has a direct
multivariate generalization (unlike the bisection method). However, the need to evaluate f' at each iteration
requires more computation (and more assumptions about the smoothness of f). Additionally, Newton's
method can, in a sense, be “too fast” in that there is no guarantee that each iteration of Newton’s method is
an improvement (i.e. is closer to the root). In certain cases, Newton’s method can swing wildly out of control
and diverge. Newton’s method is only guaranteed to converge in the neighborhood of the root; the exact size
of that neighborhood is usually not known.
In many statistical modeling applications, we have a likelihood function L that is induced by a probability
distribution that we assume generated the data. This likelihood is typically parameterized by a vector θ and
maximizing L(θ) provides us with the maximum likelihood estimate (MLE), or θ̂. In practice, it makes more
sense to maximize the log-likelihood function, ℓ(θ), which in many common applications is equivalent to
solving the score equations ℓ'(θ) = 0 for θ.
Newton's method can be applied to generate a sequence that converges to the MLE θ̂. If we assume θ is a
k × 1 vector, we can iterate

\[
\theta_{n+1} = \theta_n - \ell''(\theta_n)^{-1}\ell'(\theta_n)
\]

where ℓ'' is the Hessian of the log-likelihood function.
Note that the formula above computes an inverse of a k × k matrix, which should serve as an immediate
warning sign that this is not how the algorithm should be implemented. In practice, it may make more sense
to solve the system of equations

\[
[\ell''(\theta_n)]\,\theta_{n+1} = [\ell''(\theta_n)]\,\theta_n - \ell'(\theta_n)
\]

rather than invert ℓ''(θ_n) directly at every iteration.
However, it may make sense to invert ℓ''(θ_n) at the very end of the algorithm to obtain the observed
information matrix −ℓ''(θ̂). This observed information matrix can be used to obtain asymptotic standard
errors for θ̂ for making inference about θ.
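As a sketch of what a single Newton step might look like in R (the score() and hessian() functions here are hypothetical placeholders for the gradient and Hessian of the log-likelihood):

## One Newton step: solve the linear system instead of forming the matrix inverse
newton_step <- function(theta, score, hessian) {
        theta - solve(hessian(theta), score(theta))
}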
Suppose we observe data x_1, x_2, . . . , x_n iid ∼ Poisson(μ) and we would like to estimate μ via maximum
likelihood. The score function is ℓ'(μ) = n x̄/μ − n, and maximizing the likelihood amounts to finding the root
of the score function.

[Figure: the score function plotted over a range of μ values.]
The figure above shows that this is clearly a nice smooth function for Newton's method to work on. Recall
that for the Newton iteration, we also need the second derivative, which in this case is

\[
\ell''(\mu) = -\frac{n\bar{x}}{\mu^2}.
\]

The Newton iteration is then

\[
\mu_{n+1} = \mu_n - \left[-\frac{n\bar{x}}{\mu_n^2}\right]^{-1}\left(\frac{n\bar{x}}{\mu_n} - n\right)
= 2\mu_n - \frac{\mu_n^2}{\bar{x}}.
\]
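The data and the score() function used in the plotting code below are not shown in this extract; a minimal setup consistent with that code (the simulated data and seed here are only illustrative) might be:

set.seed(2017-08-09)                        ## hypothetical seed
x <- rpois(100, 5)                          ## simulated Poisson data
n <- length(x)
xbar <- mean(x)
score <- function(mu) n * xbar / mu - n     ## the score function l'(mu)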
Using the functional programming aspects of R, we can write a function that executes the functional iteration
of Newton's method for however many iterations we wish to run the algorithm.
The following Iterate() code takes a function as argument and generates an “iterator” version of it where
the number of iterations is an argument.
Funcall <- function(f, ...) f(...)
Iterate <- function(f, n = 1) {
function(x) {
Reduce(Funcall, rep.int(list(f), n), x, right = TRUE)
}
}
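For example, Iterate() simply composes a function with itself a given number of times:

Iterate(function(x) x / 2, 3)(16)   ## halve 16 three times, giving 2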
Now we can pass a single iteration of the Newton step as an argument to the Iterate() function defined
above.
single_iteration <- function(x) {
2 * x - x^2 / xbar
}
g <- function(x0, n) {
giter <- Iterate(single_iteration, n)
giter(x0)
}
Finally, to facilitate plotting of this function, it is helpful if our iterator function is vectorized with respect to
n. The Vectorize() function can help us here.
g <- Vectorize(g, "n")
Let’s use a starting point of µ0 = 10. We can plot the score function along with the values of each of the
Newton iterates for 7 iterations.
par(mar = c(5, 5, 4, 1))
curve(score, .35, 10, xlab = expression(mu), ylab = expression(score(mu)), cex.axis = 0.8)
abline(h = 0, lty = 3)
iterates <- g(10, 1:7) ## Generate values for 7 functional iterations with a starting value of 10.
abline(v = c(10, iterates), lty = 2)
axis(3, c(10, iterates), labels = c(0, 1:7), cex = 2, cex.axis = 0.8)
mtext("Iteration #", at = 2, line = 2.5)
[Figure: score(μ) plotted against μ, with vertical dashed lines marking the starting value and the first seven Newton iterates.]
We can see that by the 7th iteration we are quite close to the root, which in this case is 5.1.
Another feature to note of Newton’s algorithm here is that when the function is relatively flat, the algorithm
makes large moves either to the left or right. However, when the function is relatively steep, the moves are
smaller in distance. This makes sense because the size of the deviation from the current iterate depends on
the inverse of ℓ'' at the current iterate. When ℓ' is flat, ℓ'' will be small and hence its inverse large.
3 General Optimization
The general optimization problem can be stated as follows. Given a function f : R^k → R, we want to find
min_{x∈S} f(x), where S ⊂ R^k. The general approach to solving this problem that we will discuss is called a line
search method. With line search methods, given f and a current estimate x_n of the location of the minimum,
we want to
1. Choose a direction p_n in k-dimensional space;
2. Choose a step length in the direction p_n, usually by solving min_{α>0} f(x_n + αp_n) to get α_n;
3. Update our estimate with x_{n+1} = x_n + α_n p_n.
Clearly then, with line search methods, the two questions one must answer are how should we choose the
direction? and how far should we step? Almost all line search approaches provide variations on the answers
to those two questions.
Care must be taken in addressing the problems involved with line search methods because typically one must
assume that the size of the parameter space is large (i.e. k is large). Therefore, one of the constraints for all
methods is minimizing the amount of computation that needs to be done due to the large parameter space.
Efficiency with respect to memory (storage of data or parameters) and computation time is key.
Figure 6: Direction of steepest descent.
Perhaps the most obvious direction to choose when attempting to minimize a function f starting at x_n is the
direction of steepest descent, or −f'(x_n). This is the direction that is orthogonal to the contours of f at the
point x_n and hence is the direction in which f is changing most rapidly at x_n.
The updating procedure for a steepest descent algorithm, given the current estimate x_n, is then

\[
x_{n+1} = x_n - \alpha_n f'(x_n).
\]
While it might seem logical to always go in the direction of steepest descent, it can occasionally lead to some
problems. In particular, when certain parameters are highly correlated with each other, the steepest descent
algorithm can require many steps to reach the minimum.
The figure below illustrates a function whose contours are highly correlated and hence elliptical.
Depending on the starting value, the steepest descent algorithm could take many steps to wind its way
towards the minimum.
Figure 7: Steepest descent with highly correlated parameters.
One can use steepest descent to compute the maximum likelihood estimate of the mean in a multivariate
Normal density, given a sample of data. However, when the data are highly correlated, as they are in the
simulated example below, the log-likelihood surface can become difficult to optimize. In such cases, a very
narrow ridge develops in the log-likelihood that can be difficult for the steepest descent algorithm to navigate.
In the example below, we actually compute the negative log-likelihood because the algorithm is designed to
minimize functions.
set.seed(2017-08-10)
mu <- c(1, 2)
S <- rbind(c(1, .9), c(.9, 1))
x <- MASS::mvrnorm(500, mu, S)
nloglike <- function(mu1, mu2) {
dmv <- mvtnorm::dmvnorm(x, c(mu1, mu2), S, log = TRUE)
-sum(dmv)
}
nloglike <- Vectorize(nloglike, c("mu1", "mu2"))
nx <- 40
ny <- 40
xg <- seq(-5, 5, len = nx)
yg <- seq(-5, 6, len = ny)
g <- expand.grid(xg, yg)
nLL <- nloglike(g[, 1], g[, 2])
z <- matrix(nLL, nx, ny)
par(mar = c(4.5, 4.5, 1, 1))
contour(xg, yg, z, nlevels = 40, xlab = expression(mu[1]),
ylab = expression(mu[2]))
abline(h = 0, v = 0, lty = 2)
[Figure: contour plot of the negative log-likelihood over (μ1, μ2), showing a long, narrow valley.]
Note that in the figure above the surface is highly stretched and that the minimum (1, 2) lies in the middle of
a narrow valley. For the steepest descent algorithm we will start at the point (−5, −2) and track the path of
the algorithm.
library(dplyr, warn.conflicts = FALSE)
norm <- function(x) x / sqrt(sum(x^2))
Sinv <- solve(S) ## I know I said not to do this!
step1 <- function(mu, alpha = 1) {
D <- sweep(x, 2, mu, "-")
score <- colSums(D) %>% norm
mu + alpha * drop(Sinv %*% score)
}
steep <- function(mu, n = 10, ...) {
results <- vector("list", length = n)
for(i in seq_len(n)) {
results[[i]] <- step1(mu, ...)
mu <- results[[i]]
}
results
}
m <- do.call("rbind", steep(c(-5, -2), 8))
m <- rbind(c(-5, -2), m)
[Figure: contour plot of the negative log-likelihood with the path of the steepest descent iterates overlaid.]
We can see that the path of the algorithm is rather winding as it traverses the narrow valley. Now, we have
fixed the step-length in this case, which is probably not optimal. However, one can still see that the algorithm
has some difficulty navigating the surface because the direction of steepest descent never points directly
towards the minimum.
Given a current best estimate x_n, we can approximate f with a quadratic polynomial. For some small p,

\[
f(x_n + p) \approx f(x_n) + p'f'(x_n) + \frac{1}{2}p'f''(x_n)p.
\]

If we minimize the right hand side with respect to p, we obtain

\[
p_n = -f''(x_n)^{-1}f'(x_n),
\]

which we can think of as the steepest descent direction "twisted" by the inverse of the Hessian matrix
f''(x_n)^{-1}. Newton's method has a "natural" step length of 1, so that the updating procedure is

\[
x_{n+1} = x_n - f''(x_n)^{-1}f'(x_n).
\]
Newton’s method makes a quadratic approximation to the target function f at each step of the algorithm.
This follows the “optimization transfer” principle mentioned earlier, whereby we take a complex function
f , replace it with a simpler function g that is easier to optimize, and then optimize the simpler function
repeatedly until convergence to the solution.
We can visualize how Newton’s method makes its quadratic approximation to the target function easily in
one dimension.
curve(-dnorm(x), -2, 3, lwd = 2, ylim = c(-0.55, .1))
xn <- -1.2
abline(v = xn, lty = 2)
axis(3, xn, expression(x[n]))
g <- function(x) {
-dnorm(xn) + (x-xn) * xn * dnorm(xn) - 0.5 * (x-xn)^2 * (dnorm(xn) - xn * (xn * dnorm(xn)))
}
curve(g, -2, 3, add = TRUE, col = 4)
op <- optimize(g, c(0, 3))
abline(v = op$minimum, lty = 2)
axis(3, op$minimum, expression(x[n+1]))
[Figure: the function −dnorm(x) with its quadratic approximation at x_n and the location of the next iterate x_{n+1}.]
In the above figure, the next iterate, xn+1 is actually further away from the minimum than our previous
iterate xn . The quadratic approximation that Newton’s method makes to f is not guaranteed to be good at
every point of the function.
This shows an important “feature” of Newton’s method, which is that it is not monotone. The successive
iterations that Newton’s method produces are not guaranteed to be improvements in the sense that each
iterate is closer to the truth. The tradeoff here is that while Newton’s method is very fast (quadratic
convergence), it can be unstable at times. Monotone algorithms (like the EM algorithm that we discuss later)
that always produce improvements, are more stable, but generally converge at slower rates.
In the next figure, however, we can see that the solution provided by the next approximation, xn+2 , is indeed
quite close to the true minimum.
curve(-dnorm(x), -2, 3, lwd = 2, ylim = c(-0.55, .1))
xn <- -1.2
op <- optimize(g, c(0, 3))
abline(v = op$minimum, lty = 2)
axis(3, op$minimum, expression(x[n+1]))
xn <- op$minimum
curve(g, -2, 3, add = TRUE, col = 4)
op <- optimize(g, c(0, 3))
abline(v = op$minimum, lty = 2)
axis(3, op$minimum, expression(x[n+2]))
[Figure: the quadratic approximation made at x_{n+1}, whose minimizer x_{n+2} is close to the true minimum.]
It is worth noting that in the rare event that f is in fact a quadratic polynomial, Newton’s method will
converge in a single step because the quadratic approximation that it makes to f will be exact.
The generalized linear model is an extension of the standard linear model to allow for non-Normal response
distributions. The distributions used typically come from an exponential family whose density functions
share some common characteristics. With a GLM, we typically present the model as y_i ∼ p(y_i | μ_i), where p is an
exponential family distribution, E[y_i] = μ_i,

\[
g(\mu_i) = x_i'\beta,
\]

where g is a nonlinear link function, and Var(y_i) = V(μ_i), where V is a known variance function.
Unlike the standard linear model, the maximum likelihood estimate of the parameter vector β cannot be
obtained in closed form, so an iterative algorithm must be used to obtain the estimate. The traditional
algorithm used is the Fisher scoring algorithm. This algorithm uses a linear approximation to the nonlinear
link function g, which can be written as

\[
g(y_i) \approx g(\mu_i) + (y_i - \mu_i)g'(\mu_i).
\]

The typical notation of GLMs refers to z_i = g(μ_i) + (y_i − μ_i)g'(μ_i) as the working response. The Fisher
scoring algorithm then works as follows.
1. Start with μ̂_i, some initial value.
2. Compute z_i = g(μ̂_i) + (y_i − μ̂_i)g'(μ̂_i).
3. Given the n × 1 vector of working responses z and the n × p predictor matrix X, compute a weighted
regression of z on X to get

\[
\beta_n = (X'WX)^{-1}X'Wz
\]

where W is a diagonal matrix with diagonal elements

\[
w_{ii} = \left[g'(\mu_i)^2 V(\mu_i)\right]^{-1}.
\]

4. Update μ̂_i via g(μ̂_i) = x_i'β_n and return to Step 2, iterating until convergence.
Naturally, when doing a weighted regression, we would weight by the inverse of the variances.
Using the Poisson regression example, we can draw a connection between the usual Fisher scoring algorithm
for fitting GLMs and Newton's method. Recall that if ℓ(β) is the log-likelihood as a function of the regression
parameters β, then the Newton updating scheme is

\[
\beta_{n+1} = \beta_n + \ell''(\beta_n)^{-1}\left[-\ell'(\beta_n)\right].
\]

The log-likelihood for a Poisson regression model can be written in vector/matrix form as

\[
\ell(\beta) = y'X\beta - \exp(X\beta)'\mathbf{1}
\]

where the exponential is taken component-wise on the vector Xβ. The gradient function is

\[
\ell'(\beta) = X'y - X'\exp(X\beta) = X'(y - \mu)
\]

and the Hessian is

\[
\ell''(\beta) = -X'WX
\]

where W is a diagonal matrix with the values w_{ii} = exp(x_i'β) on the diagonal. The Newton iteration is then

\[
\beta_{n+1} = \beta_n + (X'WX)^{-1}X'(y - \mu).
\]
Therefore the iteration is exactly the same as the Fisher scoring algorithm in this case. In general, Newton’s
method and Fisher scoring will coincide with any generalized linear model using an exponential family with a
canonical link function.
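To make the connection concrete, here is a minimal sketch (not from the text) of the Fisher scoring/IRLS iteration for Poisson regression with the canonical log link, assuming a design matrix X (including an intercept column) and a count response y:

irls_poisson <- function(X, y, maxit = 25, tol = 1e-8) {
        beta <- rep(0, ncol(X))
        for (i in seq_len(maxit)) {
                eta <- drop(X %*% beta)
                mu <- exp(eta)                          ## inverse link
                z <- eta + (y - mu) / mu                ## working response
                w <- mu                                 ## weights [g'(mu)^2 V(mu)]^{-1} = mu
                beta_new <- drop(solve(crossprod(X, w * X), crossprod(X, w * z)))
                if (max(abs(beta_new - beta)) < tol) break
                beta <- beta_new
        }
        beta
}

The result should agree with glm(y ~ X - 1, family = poisson).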
The nlm() function in R implements Newton’s method for minimizing a function given a vector of starting
values. By default, one does not need to supply the gradient or Hessian functions; they will be estimated
numerically by the algorithm. However, for the purposes of improving accuracy of the algorithm, both the
gradient and Hessian can be supplied as attributes of the target function.
As an example, we will use the nlm() function to fit a simple logistic regression model for binary data. This
model specifies that y_i ∼ Bernoulli(p_i) where

\[
\log\frac{p_i}{1 - p_i} = \beta_0 + x_i\beta_1
\]

and the goal is to estimate β via maximum likelihood. Given the assumed Bernoulli distribution, we can
write the log-likelihood as

\[
\begin{aligned}
\log L(\beta) &= \log\left\{\prod_{i=1}^n p_i^{y_i}(1-p_i)^{1-y_i}\right\} \\
&= \sum_{i=1}^n y_i\log p_i + (1-y_i)\log(1-p_i) \\
&= \sum_{i=1}^n y_i\log\frac{p_i}{1-p_i} + \log(1-p_i) \\
&= \sum_{i=1}^n y_i(\beta_0 + x_i\beta_1) + \log\left(\frac{1}{1+e^{\beta_0+x_i\beta_1}}\right) \\
&= \sum_{i=1}^n y_i(\beta_0 + x_i\beta_1) - \log\left(1+e^{\beta_0+x_i\beta_1}\right)
\end{aligned}
\]

If we take the very last line of the above derivation and take a single element inside the sum, we have

\[
\ell_i(\beta) = y_i(\beta_0 + x_i\beta_1) - \log\left(1+e^{\beta_0+x_i\beta_1}\right).
\]
We will need the gradient and Hessian of this with respect to β. Because the sum and the derivative are
exchangeable, we can then sum each of the individual gradients and Hessians to get the full gradient and
Hessian for the entire sample, so that

\[
\ell'(\beta) = \sum_{i=1}^n \ell_i'(\beta)
\quad\text{and}\quad
\ell''(\beta) = \sum_{i=1}^n \ell_i''(\beta).
\]
Now, taking the gradient and Hessian of the above expression may be mildly inconvenient, but it is far from
impossible. Nevertheless, R provides an automated way to do symbolic differentiation so that manual work
can be avoided. The deriv() function computes the gradient and Hessian of an expression symbolically so
that it can be used in minimization routines. It cannot compute gradients of arbitrary expressions, but it
does support a wide range of common statistical functions.
3.2.2.1 Example: Trends in p-values Over Time
The tidypvals package written by Jeff Leek contains datasets taken from the literature collecting p-values
associated with various publications along with some information about those publications (i.e. journal, year,
DOI). One question that comes up is whether there has been any trend over time in the claimed statistical
significance of publications, where “statistical significance” is defined as having a p-value less than 0.05.
The tidypvals package is available from GitHub and can be installed using the install_github() function
in the remotes package.
remotes::install_github("jtleek/tidypvals")
Once installed, we will make use of the jager2014 dataset. In particular, we are interested in creating an
indicator of whether a p-value is less than 0.05 and regressing it on the year variable.
library(tidypvals)
library(dplyr)
jager <- mutate(tidypvals::jager2014,
pvalue = as.numeric(as.character(pvalue)),
y = ifelse(pvalue < 0.05
| (pvalue == 0.05 & operator == "lessthan"),
1, 0),
x = year - 2000) %>%
tbl_df
Note here that we have subtracted the year 2000 off of the year variable so that x = 0 corresponds to year
== 2000.
Next we compute the gradient and Hessian of the negative log-likelihood with respect to —0 and —1 using the
deriv() function. Below, we specify function.arg = TRUE in the call to deriv() because we want deriv()
to return a function whose arguments are b0 and b1.
nll_one <- deriv(~ -(y * (b0 + x * b1) - log(1 + exp(b0 + b1 * x))),
c("b0", "b1"), function.arg = TRUE, hessian = TRUE)
The function nll_one() produced by deriv() evaluates the negative log-likelihood for each data point. The
output from nll_one() will have attributes "gradient" and "hessian" which represent the gradient and
Hessian, respectively. For example, using the data from the jager dataset, we can evaluate the negative
log-likelihood at β0 = 0, β1 = 0.
x <- jager$x
y <- jager$y
str(nll_one(0, 0))
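The wrapper function nll() used below is not shown in this extract; a minimal version consistent with how it is called (summing the per-observation values and their gradient and Hessian attributes) might be:

nll <- function(b) {
        v <- nll_one(b[1], b[2])
        f <- sum(v)                                             ## total negative log-likelihood
        attr(f, "gradient") <- colSums(attr(v, "gradient"))     ## summed gradient
        attr(f, "hessian") <- apply(attr(v, "hessian"), c(2, 3), sum)  ## summed Hessian
        f
}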
Now, we can evaluate the full negative log-likelihood with the nll() function. Note that nll() takes a single
numeric vector as input as this is what the nlm() function is expecting.
nll(c(0, 0))
[1] 10849.83
attr(,"gradient")
b0 b1
-4586.5 -21854.5
attr(,"hessian")
b0 b1
b0 3913.25 19618.25
b1 19618.25 137733.75
Using β0 = 0, β1 = 0 as the initial value, we can call nlm() to minimize the negative log-likelihood.
res <- nlm(nll, c(0, 0))
res
$minimum
[1] 7956.976
$estimate
[1] 1.57032807 -0.04416515
$gradient
[1] -0.000001451746 -0.000002782241
$code
[1] 1
$iterations
[1] 4
Note first in the output that there is a code with the value 4 and that the number of iterations is 100.
Whenever the number of iterations in an optimization algorithm is a nice round number, the chances are
good that it hit some preset iteration limit. This in turn usually means the algorithm didn't converge.
In the help for nlm() we also learn that the code value of 4 means “iteration limit exceeded”, which is
generally not good. Luckily, the solution is simple: we can increase the iteration limit and let the algorithm
run longer.
res <- nlm(nll, c(0, 0), iterlim = 1000)
res
$minimum
[1] 7956.976
$estimate
[1] 1.57032807 -0.04416515
$gradient
[1] -0.000001451746 -0.000002782241
$code
[1] 1
$iterations
[1] 4
Here we see that the number of iterations used was 260, which is well below the iteration limit. Now we get
code equal to 2 which means that “successive iterates within tolerance, current iterate is probably solution”.
Sounds like good news!
Lastly, most optimization algorithms have an option to scale your parameter values so that they roughly vary
on the same scale. If your target function has parameters that vary on wildly different scales, this can cause a
practical problem for the computer (it's not a problem for the theory). The way to deal with this in nlm() is
to use the typsize argument, which is a vector equal in length to the parameter vector which provides the
relative sizes of the parameters.
Here, I give typsize = c(1, 0.1), which indicates to nlm() that the first parameter, β0, should be roughly
10 times larger than the second parameter, β1, when the target function is at its minimum.
res <- nlm(nll, c(0, 0), iterlim = 1000,
typsize = c(1, 0.1))
res
$minimum
[1] 7956.976
$estimate
[1] 1.57032807 -0.04416515
$gradient
[1] -0.000001451745 -0.000002782238
$code
[1] 1
$iterations
[1] 4
Running this call to nlm() you’ll notice that the solution is the same but the number of iterations is actually
much less than before (4 iterations) which means the algorithm ran faster. Generally speaking, scaling the
parameter vector appropriately (if possible) improves the performance of all optimization algorithms and
in my experience is almost always a good idea. The specific values given to the typsize argument are not
important; rather their relationships to each other (i.e. orders of magnitude) are what matter.
3.3 Quasi-Newton
Quasi-Newton methods arise from the desire to use something like Newton's method for its speed but without
having to compute the Hessian matrix each time. The idea is that if the Newton iteration is

\[
\theta_{n+1} = \theta_n - f''(\theta_n)^{-1}f'(\theta_n),
\]

is there some other matrix that we can use to replace either f''(θ_n) or f''(θ_n)^{-1}? That is, can we use a revised
iteration,

\[
\theta_{n+1} = \theta_n - B_n^{-1}f'(\theta_n),
\]

where B_n is simpler to compute but still allows the algorithm to converge quickly?
This is a challenging problem because f''(θ_n) gives us a lot of information about the surface of f at θ_n, and
throwing out this information results in, well, a severe loss of information.
The idea with Quasi-Newton is to find a solution B_n to the problem

\[
f'(\theta_n) - f'(\theta_{n-1}) = B_n(\theta_n - \theta_{n-1}).
\]
The equation above is sometimes referred to as the secant equation. Note first that this requires us to store two
values, ◊n and ◊n≠1 . Also, in one dimension, the solution is trivial: we can simply divide the left-hand-side
by ◊n ≠ ◊n≠1 . However, in more than one dimension, there exists an infinite number of solutions and we need
some way to constrain the problem to arrive at a sensible answer.
The key to Quasi-Newton approaches in general is that while we initially may not have much information
about f , with each iteration we obtain just a little bit more. Specifically, we learn more about the Hessian
matrix through successive differences in f'. Therefore, with each iteration we can incorporate this newly
obtained information into our estimate of the Hessian matrix. The constraints placed on the matrix B_n are
that it be symmetric and that it be close to B_{n−1}. These constraints can be satisfied by updating B_n via the
addition of rank one matrices.
If we let y_n = f'(θ_n) − f'(θ_{n−1}) and s_n = θ_n − θ_{n−1}, then the secant equation is y_n = B_n s_n. One updating
procedure for B_n is

\[
B_n = B_{n-1} + \frac{y_n y_n'}{y_n's_n} - \frac{B_{n-1}s_n s_n'B_{n-1}}{s_n'B_{n-1}s_n}.
\]
The above updating procedure was developed by Broyden, Fletcher, Goldfarb, and Shanno (BFGS). An
analogous approach, which solves the alternative secant equation H_n y_n = s_n, was proposed by Davidon,
Fletcher, and Powell (DFP).
Note that in the case of the BFGS method, we actually use B_n^{-1} in the Newton update. However, it is not
necessary to solve for B_n and then invert it directly. We can directly update B_{n-1}^{-1} to produce B_n^{-1} via the
Sherman-Morrison update formula. This formula allows us to generate the new inverse matrix by using the
previous inverse and some matrix multiplication.
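For reference, the Sherman-Morrison formula for updating an inverse after a rank-one change is

\[
(A + uv')^{-1} = A^{-1} - \frac{A^{-1}uv'A^{-1}}{1 + v'A^{-1}u},
\]

which can be applied once for each rank-one term added to B_{n-1}.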
Quasi-Newton methods in R can be accessed through the optim() function, which is a general purpose
optimization function. The optim() function implements a variety of methods but in this section we will
focus on the "BFGS" and "L-BFGS-B" methods.
A kernel density estimate of the NO2 data shows the following distribution.
library(ggplot2)
ggplot(dat, aes(x = no2)) +
geom_density()
[Figure: kernel density estimate of the no2 values.]
As an initial stab at characterizing the distribution of the NO2 values (and to demonstrate the use of optim()
for fitting models), we will try to fit a truncated Normal model to the data. The truncated Normal can
make sense for these kinds of data because they are strictly positive, making a standard Normal distribution
inappropriate.
For the truncated normal, truncated from below at 0, the density of the data is
\[
f(x) = \frac{\frac{1}{\sigma}\varphi\left(\frac{x-\mu}{\sigma}\right)}{\int_0^\infty \frac{1}{\sigma}\varphi\left(\frac{x-\mu}{\sigma}\right)dx}.
\]
The unknown parameters are μ and σ. Given the density, we can attempt to estimate μ and σ by maximum
likelihood. In this case, we will minimize the negative log-likelihood of the data.
We can use the deriv() function to compute the negative log-likelihood and its gradient automatically.
Because we are using quasi-Newton methods here we do not need the Hessian matrix.
nll_one <- deriv(~ -log(dnorm((x - mu)/s) / s) + log(0.5),
c("mu", "s"),
function.arg = TRUE)
The optim() function works a bit differently from nlm() in that instead of having the gradient as an attribute
of the negative log-likelihood, the gradient needs to be a separate function.
First the negative log-likelihood.
nll <- function(p) {
v <- nll_one(p[1], p[2])
sum(v)
}
Then the gradient function.
nll_grad <- function(p) {
v <- nll_one(p[1], p[2])
colSums(attr(v, "gradient"))
}
Now we can pass the nll() and nll_grad() functions to optim() to obtain estimates of μ and σ. We will
use starting values of μ = 1 and σ = 5. To use the "BFGS" quasi-Newton method you need to specify it in
the method argument. The default method for optim() is the Nelder-Mead simplex method. We also specify
hessian = TRUE to tell optim() to numerically calculate the Hessian matrix at the optimum point.
x <- dat$no2
res <- optim(c(1, 5), nll, gr = nll_grad,
method = "BFGS", hessian = TRUE)
$par
[1] 13.23731 8.26315
$value
[1] 4043.641
$counts
function gradient
35 19
$convergence
[1] 0
$message
NULL
$hessian
[,1] [,2]
[1,] 20.8700535980 0.0005659674
[2,] 0.0005659674 41.7458205694
The optim() function returns a list with 5 elements (plus a Hessian matrix if hessian = TRUE is set). The
first element that you should check is the convergence code. If convergence is 0, that is good. Anything
other than 0 could indicate a problem, the nature of which depends on the algorithm you are using (see the
help page for optim() for more details). This time we also had optim() compute the Hessian (numerically)
at the optimal point so that we could derive asymptotic standard errors if we wanted.
First note that there were a few messages printed to the console while the algorithm was running indicating
that NaNs were produced by the target function. This is likely because the function was attempting to take
the log of negative numbers. Because we used the "BFGS" algorithm, we were conducting an unconstrained
optimization. Therefore, it's possible that the algorithm's search produced negative values for σ, which don't
make sense in this context. In order to constrain the search, we can use the "L-BFGS-B" method, which is a
“limited memory” BFGS algorithm with “box constraints”. This allows you to put a lower and upper bound
on each parameter in the model.
Note that optim() allows your target function to produce NA or NaN values, and indeed from the output it
seems that the algorithm eventually converged on the answer anyway. But since we know that the parameters
in this model are constrained, we can go ahead and use the alternate approach.
Here we set the lower bound for all parameters to be 0 but allow the upper bound to be infinity (Inf), which
is the default.
res <- optim(c(1, 5), nll, gr = nll_grad,
method = "L-BFGS-B", hessian = TRUE,
lower = 0)
res
$par
[1] 13.237470 8.263546
$value
[1] 4043.641
$counts
function gradient
14 14
$convergence
[1] 0
$message
[1] "CONVERGENCE: REL_REDUCTION_OF_F <= FACTR*EPSMCH"
$hessian
[,1] [,2]
[1,] 20.868057205 -0.000250771
[2,] -0.000250771 41.735838073
We can see now that the warning messages are gone, but the solution is identical to that produced by the
original "BFGS" method.
The maximum likelihood estimate of µ is 13.24 and the estimate of σ is 8.26. If we wanted to obtain
asymptotic standard errors for these parameters, we could look at the Hessian matrix.
solve(res$hessian) %>%
diag %>%
sqrt
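The code that constructs the dens data frame used in the next plot does not appear in the text as extracted. A plausible reconstruction, evaluating the fitted truncated Normal density over a grid (the names xpts, ypts, and dens mirror those used in the mixture example later), is the following sketch.
## Reconstruction (not shown in the original text): fitted truncated Normal curve
xpts <- seq(0, 50, len = 100)
dens <- with(res, {
    data.frame(xpts = xpts,
               ypts = dnorm(xpts, par[1], par[2]) /
                   (1 - pnorm(0, par[1], par[2])))
})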
Then we can overlay the fitted model on top of the density using geom_line().
ggplot(dat, aes(x = no2)) +
geom_density() +
geom_line(aes(x = xpts, y = ypts), data = dens, col = "steelblue",
lty = 2)
(Figure: density of no2 with the fitted truncated Normal overlaid.)
It’s not a great fit. Looking at the density smooth of the data, it’s clear that there are two modes to the
data, suggesting that a truncated Normal might not be sufficient to characterize the data.
One alternative in this case would be a mixture of two Normals, which might capture the two modes. For a
two-component mixture, the density for the data would be
$$
f(x) = \lambda\,\frac{1}{\sigma_1}\,\varphi\!\left(\frac{x-\mu_1}{\sigma_1}\right) + (1-\lambda)\,\frac{1}{\sigma_2}\,\varphi\!\left(\frac{x-\mu_2}{\sigma_2}\right).
$$
Commonly, we see that this model is fit using more complex algorithms like the EM algorithm or Markov
chain Monte Carlo methods. While those methods do provide greater stability in the estimation process (as
we will see later), we can in fact use Newton-type methods to maximize the likelihood directly with a little
care.
First we can write out the negative log-likelihood symbolically and allow R’s deriv() function to compute
the gradient function.
nll_one <- deriv(~ -log(lambda * dnorm((x-mu1)/s1)/s1 + (1-lambda)*dnorm((x-mu2)/s2)/s2),
c("mu1", "mu2", "s1", "s2", "lambda"),
function.arg = TRUE)
Then, as before, we can specify separate negative log-likelihood (nll) and gradient R functions (nll_grad).
nll <- function(p) {
p <- as.list(p)
v <- do.call("nll_one", p)
sum(v)
}
nll_grad <- function(p) {
v <- do.call("nll_one", as.list(p))
colSums(attr(v, "gradient"))
}
Finally, we can pass those functions into optim() with an initial vector of parameters. Here, we are careful
to specify
• We are using the "L-BFGS-B" method so that we can specify a lower bound of 0 for all parameters and an
upper bound of 1 for the λ parameter
• We set the parscale option in the list of control parameters, which is similar to the typsize argument
to nlm(). The goal here is to give optim() a scaling for each parameter around the optimal point.
x <- dat$no2
pstart <- c(5, 10, 2, 3, 0.5)
res <- optim(pstart, nll, gr = nll_grad, method = "L-BFGS-B",
control = list(parscale = c(2, 2, 1, 1, 0.1)),
lower = 0, upper = c(Inf, Inf, Inf, Inf, 1))
The algorithm appears to run without any warnings or messages. We can take a look at the output.
res
$par
[1] 3.7606598 16.1469811 1.6419640 7.2378153 0.2348927
$value
[1] 4879.924
$counts
function gradient
17 17
$convergence
[1] 0
$message
[1] "CONVERGENCE: REL_REDUCTION_OF_F <= FACTR*EPSMCH"
The convergence code of 0 is a good sign and the parameter estimates in the par vector all seem reasonable.
We can overlay the fitted model on to the density smooth to see how the model does.
xpts <- seq(0, 50, len = 100)
dens <- with(res, {
data.frame(xpts = xpts,
ypts = par[5]*dnorm(xpts, par[1], par[3]) + (1-par[5])*dnorm(xpts, par[2], par[4]))
})
ggplot(dat, aes(x = no2)) +
geom_density() +
geom_line(aes(x = xpts, y = ypts), data = dens, col = "steelblue",
lty = 2)
(Figure: density of no2 with the fitted two-component mixture overlaid.)
The fit is still not wonderful, but at least this model captures roughly the locations of the two modes in the
density. Also, it would seem that the model captures the tail of the density reasonably well, although this
would need to be checked more carefully by looking at the quantiles.
Finally, as with most models and optimization schemes, it’s usually a good idea to vary the starting points to
see if our current estimate is a local mode.
pstart <- c(1, 20, 5, 2, 0.1)
res <- optim(pstart, nll, gr = nll_grad, method = "L-BFGS-B",
control = list(parscale = c(2, 2, 1, 1, 0.1)),
lower = 0, upper = c(Inf, Inf, Inf, Inf, 1))
res
$par
[1] 3.760571 16.146834 1.641961 7.237776 0.234892
$value
[1] 4879.924
$counts
function gradient
22 22
$convergence
[1] 0
$message
[1] "CONVERGENCE: REL_REDUCTION_OF_F <= FACTR*EPSMCH"
Here we see that with a slightly different starting point we get the same values and same minimum negative
log-likelihood.
3.4 Conjugate Gradient
Conjugate gradient methods represent a kind of steepest descent approach “with a twist”. With steepest
descent, we begin our minimization of a function f starting at x0 by traveling in the direction of the negative
gradient −f′(x0). In subsequent steps, we continue to travel in the direction of the negative gradient evaluated
at each successive point until convergence.
The conjugate gradient approach begins in the same manner, but diverges from steepest descent after the first
step. In subsequent steps, the direction of travel must be conjugate to the direction most recently traveled.
Two vectors u and v are conjugate with respect to the matrix A if u′Av = 0.
Before we go further, let’s take a concrete example. Suppose we want to minimize the quadratic function
$$
f(x) = \frac{1}{2}\,x'Ax - x'b
$$
where x is a p-dimensional vector and A is a p × p symmetric matrix. Starting at a point x0, both steepest
descent and conjugate gradient would take us in the direction of p0 = −f′(x0) = b − Ax0, which is the
negative gradient. Once we have moved in that direction to the point x1, the next direction, p1, must satisfy
p0′Ap1 = 0. So we can begin with the steepest descent direction −f′(x1) but then we must modify it to make
it conjugate to p0. The constraint p0′Ap1 = 0 allows us to back calculate the next direction, starting with
−f′(x1), because we have
$$
p_0'A\left(-f'(x_1) - \frac{p_0'A\,(-f'(x_1))}{p_0'Ap_0}\,p_0\right) = 0.
$$
Without the presence of the matrix A, this process is simply Gram-Schmidt orthogonalization. We can
continue with this process, each time taking the steepest descent direction and modifying it to make it
conjugate with the previous direction. The conjugate gradient process is (somewhat) interesting here because
for minimizing a p-dimensional quadratic function it will converge within p steps.
In reality, we do not deal with exactly quadratic functions and so the above-described algorithm is not
feasible. However, the nonlinear conjugate gradient method draws from these ideas and develops a reasonable
algorithm for finding the minimum of an arbitrary smooth function. It has the feature that it only requires
storage of two gradient vectors, which for large problems with many parameters is a significant savings in
storage versus Newton-type algorithms, which require storage of a gradient vector and a p × p Hessian matrix.
The Fletcher-Reeves nonlinear conjugate gradient algorithm works as follows. Starting with x0,
1. Let p0 = −f′(x0).
2. Solve
$$
\min_{\alpha > 0} f(x_0 + \alpha p_0)
$$
for α★ and set x1 = x0 + α★ p0.
3. For n = 1, 2, ..., let rn = −f′(xn) and compute
$$
\beta_n = \frac{r_n' r_n}{r_{n-1}' r_{n-1}}.
$$
The next direction is then pn = rn + βn p_{n−1}, and we update
$$
x_{n+1} = x_n + \alpha^\star p_n,
$$
where α★ is the solution to the problem min_{α>0} f(xn + α pn). Check convergence and if not converged,
repeat.
It is perhaps simpler to describe this method with an illustration. Here, we show the contours of a 2-dimensional
function.
f <- deriv(~ x^2 + y^2 + a * x * y, c("x", "y"), function.arg = TRUE)
a <- 1
n <- 40
xpts <- seq(-3, 2, len = n)
ypts <- seq(-2, 3, len = n)
gr <- expand.grid(x = xpts, y = ypts)
feval <- with(gr, f(x, y))
z <- matrix(feval, nrow = n, ncol = n)
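The call that actually draws the contour plot is not shown in the text as extracted; a minimal version, consistent with the plotting code used later in this section, would be the following.
## Sketch (reconstruction): draw the contours of the quadratic function
par(mar = c(5, 4, 1, 1))
contour(xpts, ypts, z, nlevels = 20)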
(Figure: contour plot of the two-dimensional quadratic function.)
We will use as a starting point the point (−2.5, 1.2), as indicated in the figure above.
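The starting point itself is not defined in the code shown above; using the values stated in the text, a minimal definition would be:
x0 <- -2.5   ## starting x coordinate (from the text)
y0 <- 1.2    ## starting y coordinate (from the text)
points(x0, y0, pch = 19, cex = 2)   ## mark the starting point on the contour plot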
From the figure, it is clear that ideally we would be able to travel in the direction that would take us directly
to the minimum of the function, shown here.
par(mar = c(5, 4, 1, 1))
contour(xpts, ypts, z, nlevels = 20)
points(x0, y0, pch = 19, cex = 2)
arrows(x0, y0, 0, 0, lwd = 3, col = "grey")
(Figure: contour plot with the starting point and the direct path to the minimum shown in grey.)
If only life were so easy. The idea behind conjugate gradient is to construct that direction using a series of
conjugate directions. First we start with the steepest descent direction. Here, we extract the gradient and
find the optimal α value (i.e. the step size).
f0 <- f(x0, y0)
p0 <- drop(-attr(f0, "gradient")) ## Steepest descent direction (negative gradient)
f.sub <- function(alpha) {
ff <- f(x0 + alpha * p0[1], y0 + alpha * p0[2])
as.numeric(ff)
}
op <- optimize(f.sub, c(0, 4)) ## Compute the optimal alpha
alpha <- op$minimum
Now that we’ve computed the gradient and the optimal α, we can take a step in the steepest descent direction.
x1 <- x0 + alpha * p0[1]
y1 <- y0 + alpha * p0[2]
(Figure: contour plot after taking the first steepest descent step from the starting point.)
Now we need to compute the next direction in which to travel. We begin with the steepest descent direction
again. The figure below shows what direction that would be (without any modifications for conjugacy).
f1 <- f(x1, y1)
f1g <- drop(attr(f1, "gradient")) ## Gradient at (x1, y1)
p1 <- -f1g ## Steepest descent direction
f.sub <- function(alpha) { ## Line search along p1 starting from (x1, y1)
ff <- f(x1 + alpha * p1[1], y1 + alpha * p1[2])
as.numeric(ff)
}
op <- optimize(f.sub, c(0, 4)) ## Compute the optimal alpha
alpha <- op$minimum
x2 <- x1 + alpha * p1[1] ## Find the next point
y2 <- y1 + alpha * p1[2]
Now we can plot the next direction that is chosen by the usual steepest descent approach.
par(mar = c(5, 4, 1, 1))
contour(xpts, ypts, z, nlevels = 20)
points(x0, y0, pch = 19, cex = 2)
arrows(x0, y0, 0, 0, lwd = 3, col = "grey")
arrows(x0, y0, x1, y1, lwd = 2)
arrows(x1, y1, x2, y2, col = "red", lwd = 2, lty = 2)
(Figure: contour plot showing the first step taken (black) and the next steepest descent direction (dashed red).)
However, the conjugate gradient approach computes a slightly different direction in which to travel.
f1 <- f(x1, y1)
f1g <- drop(attr(f1, "gradient"))
beta <- drop(crossprod(f1g) / crossprod(p0)) ## Fletcher-Reeves
p1 <- -f1g + beta * p0 ## Conjugate gradient direction
f.sub <- function(alpha) {
ff <- f(x1 + alpha * p1[1], y1 + alpha * p1[2])
as.numeric(ff)
}
op <- optimize(f.sub, c(0, 4)) ## Compute the optimal alpha
alpha <- op$minimum
x2c <- x1 + alpha * p1[1] ## Find the next point
y2c <- y1 + alpha * p1[2]
Finally, we can plot the direction that the conjugate gradient method takes.
par(mar = c(5, 4, 1, 1))
contour(xpts, ypts, z, nlevels = 20)
points(x0, y0, pch = 19, cex = 2)
arrows(x0, y0, 0, 0, lwd = 3, col = "grey")
arrows(x0, y0, x1, y1, lwd = 2)
arrows(x1, y1, x2, y2, col = "red", lwd = 2, lty = 2)
arrows(x1, y1, x2c, y2c, lwd = 2)
(Figure: contour plot comparing the next steepest descent direction (dashed red) with the conjugate gradient direction (black), which points directly at the minimum.)
In this case, because the target function was exactly quadratic, the process converged on the minimum in
exactly 2 steps. We can see that the steepest descent algorithm would have taken many more steps to wind
its way towards the minimum.
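As an aside, R's optim() function also provides a nonlinear conjugate gradient implementation via method = "CG". A quick sketch applying it to this same function, wrapping f so that it takes a single parameter vector, might look like the following (the wrapper names fn and gr are mine).
fn <- function(p) as.numeric(f(p[1], p[2]))               ## objective value
gr <- function(p) drop(attr(f(p[1], p[2]), "gradient"))   ## gradient vector
optim(c(-2.5, 1.2), fn, gr, method = "CG")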
3.5 Coordinate Descent
The idea behind coordinate descent methods is simple. If f is a k-dimensional function, we can minimize
f by successively minimizing each of the individual dimensions of f in a cyclic fashion, while holding the
values of f in the other dimensions fixed. This approach is sometimes referred to as cyclic coordinate descent.
The primary advantage of this approach is that it takes an arbitrarily complex k-dimensional problem and
reduces it to a collection of k one-dimensional problems. The disadvantage is that convergence can often be
painfully slow, particularly in problems where f is not well-behaved. In statistics, a popular version of this
algorithm is known as backfitting and is used to fit generalized additive models.
If we take a simple quadratic function we can take a detailed look at how coordinate descent works. Let's use
the function
$$
f(x, y) = x^2 + y^2 + xy.
$$
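The code that sets up the grid and draws the contour plot below is not shown in the text as extracted. A reconstruction consistent with the plotting code used later in this section (the axis ranges are an assumption) is:
## Sketch (reconstruction): contour plot of f(x, y) = x^2 + y^2 + x*y
a <- 1
n <- 40
xpts <- seq(-1.5, 1.5, len = n)
ypts <- seq(-1.5, 1.5, len = n)
gr <- expand.grid(x = xpts, y = ypts)
feval <- with(gr, matrix(f(x, y), nrow = n, ncol = n))
par(mar = c(5, 4, 1, 1))
contour(xpts, ypts, feval, nlevels = 20, xlab = "x", ylab = "y")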
points(-1, -1, pch = 19, cex = 2)
abline(h = -1)
(Figure: contour plot of f with the initial point (−1, −1) and the transect y = −1.)
Let’s take as our initial point (−1, −1) and begin our minimization along the x dimension. We can draw a
transect at the y = −1 level (thus holding y constant) and attempt to find the minimum along that transect.
Because f is a quadratic function, the one-dimensional function induced by holding y = −1 is also a quadratic.
feval <- f(xpts, y = -1)
plot(xpts, feval, type = "l", xlab = "x", ylab = "f(x | y = -1)")
(Figure: the one-dimensional function f(x | y = −1) along the transect.)
We can minimize this one-dimensional function with the optimize() function (or we could do it by hand if
we’re not lazy).
fx <- function(x) {
f(x, y = -1)
}
op <- optimize(fx, c(-1.5, 1.5))
op
$minimum
[1] 0.5
$objective
[1] 0.75
Granted, we could have done this analytically because we are looking at a simple quadratic function. But in
general, you will need a one-dimensional optimizer (like the optimize() function in R) to complete each of
the coordinate descent iterations.
This completes one iteration of the coordinate descent algorithm and our new starting point is (0.5, −1).
Let's store this new x value and move on to the next iteration, which will minimize along the y direction.
x1 <- op$minimum
(Figure: contour plot with the updated point (0.5, −1) and the transect x = 0.5.)
The transect drawn by holding x = 0.5 is shown in the Figure above. The one-dimensional function
corresponding to that transect is shown below (again, a one-dimensional quadratic function).
feval <- f(x = x1, ypts)
plot(ypts, feval, type = "l", xlab = "y",
ylab = sprintf("f(x = %.1f | y)", x1))
(Figure: the one-dimensional function f(x = 0.5 | y) along the transect.)
Minimizing this one-dimensional function, we get the following.
fy <- function(y) {
f(x = x1, y)
}
op <- optimize(fy, c(-1.5, 1.5))
op
$minimum
[1] -0.25
$objective
[1] 0.1875
This completes another iteration of the coordinate descent algorithm and we can plot our progress below.
y1 <- op$minimum
feval <- with(gr, matrix(f(x, y), nrow = n, ncol = n))
par(mar = c(5, 4, 1, 1))
contour(xpts, ypts, feval, nlevels = 20, xlab = "x", ylab = "y")
points(-1, -1, pch = 1, cex = 2) ## Initial point
abline(h = -1, lty = 2)
points(x1, -1, pch = 1, cex = 2) ## After one step
abline(v = x1, lty = 2)
points(x1, y1, pch = 19, cex = 2) ## After two steps
abline(h = y1) ## New transect
(Figure: contour plot showing the progress after two coordinate descent iterations.)
We can see that after two iterations we are quite a bit closer to the minimum. But we still have a ways to go,
given that we can only move along the coordinate axis directions. For a truly quadratic function, this is not
an efficient way to find the minimum, particularly when Newton’s method will find the minimum in a single
step! Of course, Newton’s method can achieve that kind of performance because it uses two derivatives worth
of information. The coordinate descent approach uses no derivative information. There’s no free lunch!
In the above example, the coordinates x and y were moderately correlated but not dramatically so. In general,
coordinate descent algorithms show very poor performance when the coordinates are strongly correlated.
The specifics of the coordinate descent algorithm will vary greatly depending on the general function being
minimized, but the essential algorithm is as follows. Given a function f : R^p → R,
1. For j = 1, ..., p, minimize f_j(x) = f(..., x_{j−1}, x, x_{j+1}, ...), where x_1, ..., x_{j−1}, x_{j+1}, ..., x_p are all held
fixed at their current values. For this we can use any simple one-dimensional optimizer.
2. Check for convergence. If not converged, go back to 1.
A minimal sketch of this loop is given below.
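The following sketch (not from the original text) implements the cyclic loop above for any function f that takes a numeric vector, using optimize() for each one-dimensional sub-problem; the search interval and tolerances are assumptions.
coord_descent <- function(f, x0, lower = -10, upper = 10, tol = 1e-8, maxit = 100) {
    x <- x0
    for(iter in seq_len(maxit)) {
        x.old <- x
        for(j in seq_along(x)) {
            fj <- function(xj) {      ## f as a function of coordinate j only
                x[j] <- xj
                f(x)
            }
            x[j] <- optimize(fj, c(lower, upper))$minimum
        }
        if(sum(abs(x - x.old)) < tol) break   ## check for convergence
    }
    x
}
## Example with f(x, y) = x^2 + y^2 + xy starting from (-1, -1)
coord_descent(function(p) p[1]^2 + p[2]^2 + p[1] * p[2], c(-1, -1))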
To take a look at the convergence rate for coordinate descent, we will use as an example a slightly more
general version of the quadratic function above,
$$
f(x, y) = x^2 + y^2 + axy,
$$
where a represents the amount of correlation or coupling between the x and y coordinates. If a = 0
there is no coupling and x and y vary independently.
At each iteration of the coordinate descent algorithm, we minimize a one-dimensional version of this function.
If we fix y = c, then we want to minimize
$$
f_{y=c}(x) = x^2 + c^2 + acx.
$$
Taking the derivative of this with respect to x and setting it equal to zero gives us the minimum at
$$
x_{\min} = \frac{-ac}{2}.
$$
Similarly, if we fix x = c, then we can minimize an analogous function f_{x=c}(y) to get a minimum point of
y_min = −ac/2.
Looking at the coordinate descent algorithm, we can develop the recurrence relationship
$$
\begin{pmatrix} x_{n+1} \\ y_{n+1} \end{pmatrix} = \begin{pmatrix} -\frac{a}{2}\, y_n \\ -\frac{a}{2}\, x_{n+1} \end{pmatrix}.
$$
Rewinding this back to the initial point, we can then write that
$$
|x_n - x_0| = |x_n - 0| = \left(\frac{a}{2}\right)^{2n-1} y_0,
$$
where x_0 is the minimum point along the x direction. We can similarly say that
$$
|y_n - y_0| = \left(\frac{a}{2}\right)^{2n} x_0.
$$
Looking at the rates of convergence separately for each dimension, we can then show that in the x direction,
$$
\frac{|x_{n+1} - x_0|}{|x_n - x_0|} = \frac{\left(\frac{a}{2}\right)^{2(n+1)-1} y_0}{\left(\frac{a}{2}\right)^{2n-1} y_0} = \left(\frac{a}{2}\right)^2.
$$
In order to achieve linear convergence for this algorithm, we must have $\left(\frac{a}{2}\right)^2 \in (0, 1)$, which can be true for
some values of a. But for values of a ≥ 2 we would not even be able to obtain linear convergence.
In summary, coordinate descent algorithms are conceptually (and computationally) easy to implement but
depending on the nature of the target function, convergence may not be possible, even under seemingly
reasonable scenarios like the simple one above. Given that we typically do not have very good information
about the nature of the target function, particularly in high-dimensional problems, coordinate descent
algorithms should be used with care.
3.5.2 Generalized Additive Models
Before we begin this section, I want to point out that Brian Caffo has a nice video introduction to generalized
additive models on his YouTube channel.
Generalized additive models represent an interesting class of models that provide nonparametric flexibility to
estimate a high-dimensional function without succumbing to the notorious “curse of dimensionality”. In the
traditional linear model, we model the outcome y as
$$
y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \varepsilon.
$$
Generalized additive models replace this formulation with a slightly more general one,
$$
y = \alpha + s_1(x_1 \mid \lambda_1) + s_2(x_2 \mid \lambda_2) + \cdots + s_p(x_p \mid \lambda_p) + \varepsilon,
$$
where s_1, ..., s_p are smooth functions whose smoothness is controlled by the parameters λ_1, ..., λ_p. The
key compromise of this model is that rather than estimate an arbitrary smooth p-dimensional function, we
estimate a series of p one-dimensional functions. This is a much simpler problem but still allows us to capture
a variety of nonlinear relationships.
The question now is how do we estimate these smooth functions? Hastie and Tibshirani proposed a backfitting
algorithm whereby each sj () would be estimated one-at-a-time while holding all of the other functions constant.
This is essentially a coordinate descent algorithm where the coordinates are one-dimensional functions in a
function space.
The sj () functions can be estimated using any kind of smoother. Hastie and Tibshirani used running median
smoothers for robustness in many of their examples, but one could use splines, kernel smoothers, or many
others.
The backfitting algorithm for additive models works as follows. Given a model of the form
$$
y_i = \alpha + \sum_{j=1}^p s_j(x_{ij} \mid \lambda_j) + \varepsilon_i,
$$
where i = 1, ..., n,
1. Initialize $\alpha = \frac{1}{n}\sum_{i=1}^n y_i$ and $s_1 = s_2 = \cdots = s_p = 0$.
2. Given current values $s_1^{(n)}, \ldots, s_p^{(n)}$, for j = 1, ..., p, let
$$
r_{ij} = y_i - \alpha - \sum_{\ell \neq j} s_\ell(x_{i\ell} \mid \lambda_\ell)
$$
so that $r_{ij}$ is the partial residual for predictor j and observation i. Given this set of partial residuals
$r_{1j}, \ldots, r_{nj}$, we can estimate $s_j$ by smoothing the relationship between $r_{ij}$ and $x_{ij}$ using any smoother
we choose. Essentially, we need to solve the mini-problem
$$
r_{ij} = s_j(x_{ij} \mid \lambda_j) + \varepsilon_i
$$
using standard nonparametric smoothers. As part of this process, we may need to estimate $\lambda_j$ using
a procedure like generalized cross-validation or something similar. At the end of this step we have
$s_1^{(n+1)}, \ldots, s_p^{(n+1)}$.
3. We can evaluate
$$
\Delta = \sum_{j=1}^p \left\| s_j^{(n+1)} - s_j^{(n)} \right\|
$$
or
$$
\Delta = \frac{\sum_{j=1}^p \left\| s_j^{(n+1)} - s_j^{(n)} \right\|}{\sum_{j=1}^p \left\| s_j^{(n)} \right\|},
$$
where ‖·‖ is some reasonable metric. If Δ is less than some pre-specified tolerance, we can stop the
algorithm. Otherwise, we can go back to Step 2 and do another round of backfitting. A minimal sketch of
one pass of this procedure is given below.
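The following sketch (not from the original text) runs the backfitting loop with loess() standing in for an arbitrary smoother; X is assumed to be an n × p numeric matrix and y a numeric response vector.
backfit <- function(y, X, tol = 1e-6, maxit = 50) {
    n <- nrow(X); p <- ncol(X)
    alpha <- mean(y)
    s <- matrix(0, n, p)                  ## s[, j] holds s_j evaluated at x_ij
    for(iter in seq_len(maxit)) {
        s.old <- s
        for(j in seq_len(p)) {
            r <- y - alpha - rowSums(s[, -j, drop = FALSE])   ## partial residuals
            s[, j] <- fitted(loess(r ~ X[, j]))               ## smooth r on x_j
            s[, j] <- s[, j] - mean(s[, j])                   ## center for identifiability
        }
        if(sum(abs(s - s.old)) < tol * (sum(abs(s.old)) + tol)) break
    }
    list(alpha = alpha, s = s)
}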
4 The EM Algorithm
The EM algorithm is one of the most popular algorithms in all of statistics. A quick look at Google Scholar
shows that the paper by Art Dempster, Nan Laird, and Don Rubin has been cited more than 50,000 times.
The EM stands for “Expectation-Maximization”, which indicates the two-step nature of the algorithm. At a
high level, there are two steps: The “E-Step” and the “M-step” (duh!).
The EM algorithm is not so much an algorithm as a methodology for creating a family of algorithms. We
will get into how exactly it works a bit later, but suffice it to say that when someone says “We used the EM
algorithm,” that probably isn’t enough information to understand exactly what they did. The devil is in
the details and most problems will need a bit of hand crafting. That said, there are a number of canonical
problems now where an EM-type algorithm is the standard approach.
The basic idea underlying the EM algorithm is as follows. We observe some data that we represent with Y .
However, there are some missing data, that we represent with Z, that make life difficult for us. Together, the
observed data Y and the missing data Z make up the complete data X = (Y, Z).
1. We imagine the complete data have a density g(y, z | θ) that is parametrized by the vector of parameters
θ. Because of the missing data, we cannot evaluate g.
2. The observed data have the density
$$
f(y \mid \theta) = \int g(y, z \mid \theta)\, dz.
$$
3. The missing data density is then
$$
h(z \mid y, \theta) = \frac{g(y, z \mid \theta)}{f(y \mid \theta)}.
$$
Because we do not know θ, we can plug in θ_0, our current best estimate, to evaluate the missing data density.
The E-step of the algorithm computes
$$
Q(\theta \mid \theta_0) = \mathbb{E}\left[\log g(y, z \mid \theta) \mid y, \theta_0\right],
$$
the expected complete data log-likelihood, where the expectation is taken with respect to h(z | y, θ_0); the
M-step then maximizes Q(θ | θ_0) with respect to θ to produce the next estimate. In particular, one can see
that it's helpful if log g(y, z | θ) is linear in the missing data so that taking the expectation is a simple
operation.
4.1 EM Algorithm for Exponential Families
Data that are generated from a regular exponential family distribution have a density that takes the form
$$
g(x \mid \theta) = h(x)\exp\{\theta' t(x)\}/a(\theta),
$$
where θ is the canonical parameter and t(x) is the vector of sufficient statistics. When thinking about the
EM algorithm, the ideal scenario is that the complete data density can be written as an exponential family. In
that case, for the E-step, if y represents the observed component of the complete data, we can write
$$
Q(\theta \mid \theta_0) = \theta'\,\mathbb{E}[t(x) \mid y, \theta_0] - \log a(\theta).
$$
(Note: We can ignore the h(x) term because it does not involve the θ parameter.) In order to maximize this
function with respect to θ, we can take the derivative and set it equal to zero,
$$
Q'(\theta \mid \theta_0) = \mathbb{E}[t(x) \mid y, \theta_0] - \mathbb{E}_\theta[t(x)] = 0.
$$
Hence, for exponential family distributions, executing the M-step is equivalent to setting
$$
\mathbb{E}[t(x) \mid y, \theta_0] = \mathbb{E}_\theta[t(x)],
$$
where E_θ[t(x)] is the unconditional expectation of the complete data and E[t(x) | y, θ_0] is the conditional
expectation of the missing data, given the observed data.
4.2 Canonical Examples
In this section, we give some canonical examples of how the EM algorithm can be used to estimate model
parameters. These examples are simple enough that they can be solved using more direct methods, but they
are nevertheless useful for demonstrating how to set up the two-step EM algorithm in various scenarios.
Suppose we have data y1, ..., yn that are sampled independently from a two-part mixture of Normals model
with density
$$
f(y \mid \theta) = \lambda\,\varphi(y \mid \mu_1, \sigma_1^2) + (1 - \lambda)\,\varphi(y \mid \mu_2, \sigma_2^2),
$$
where φ(y | µ, σ²) is the Normal density with mean µ and variance σ². The unknown parameter vector is
θ = (µ1, µ2, σ1², σ2², λ) and the log-likelihood is
$$
\log f(y_1, \ldots, y_n \mid \theta) = \sum_{i=1}^n \log\left[\lambda\,\varphi(y_i \mid \mu_1, \sigma_1^2) + (1 - \lambda)\,\varphi(y_i \mid \mu_2, \sigma_2^2)\right].
$$
This problem is reasonably simple enough that it could be solved using a direct optimization method like
Newton’s method, but the EM algorithm provides a nice stable approach to finding the optimum.
The art of applying the EM algorithm is coming up with a useful complete data model. In this example,
the approach is to hypothesize that each observation comes from one of two populations parameterized by
(µ1, σ1²) and (µ2, σ2²), respectively. The “missing data” in this case are the labels identifying which observation
came from which population. Therefore, we assert that there are missing data z1, ..., zn such that
$$
z_i \sim \text{Bernoulli}(\lambda).
$$
It’s easy to show that
$$
\sum_{z=0}^{1} g(y, z \mid \theta) = f(y \mid \theta),
$$
so that when we “integrate” out the missing data, we get the observed data density.
The complete data log-likelihood is then
$$
\log g(y, z \mid \theta) = \sum_{i=1}^n z_i \log\varphi(y_i \mid \mu_1, \sigma_1^2) + (1 - z_i)\log\varphi(y_i \mid \mu_2, \sigma_2^2) + z_i\log\lambda + (1 - z_i)\log(1 - \lambda).
$$
Note that this function is nice and linear in the missing data zi. To evaluate the Q(θ | θ0) function we need
to take the expectation of the above expression with respect to the missing data density h(z | y, θ0). But what
is that? The missing data density will be proportional to the complete data density, so that each zi is,
conditionally on yi and θ0, a Bernoulli random variable with success probability
$$
\pi_i = \mathbb{E}[z_i \mid y_i, \theta_0] = \frac{\lambda_0\,\varphi(y_i \mid \mu_1, \sigma_1^2)}{\lambda_0\,\varphi(y_i \mid \mu_1, \sigma_1^2) + (1 - \lambda_0)\,\varphi(y_i \mid \mu_2, \sigma_2^2)},
$$
evaluated at the current parameter values. These πi are what we need to compute the Q() function in the
E-step.
$$
\begin{aligned}
Q(\theta \mid \theta_0) &= \mathbb{E}\left[\sum_{i=1}^n z_i\log\varphi(y_i \mid \mu_1, \sigma_1^2) + (1 - z_i)\log\varphi(y_i \mid \mu_2, \sigma_2^2) + z_i\log\lambda + (1 - z_i)\log(1 - \lambda)\ \middle|\ y, \theta_0\right] \\
&= \sum_{i=1}^n \pi_i\log\varphi(y_i \mid \mu_1, \sigma_1^2) + (1 - \pi_i)\log\varphi(y_i \mid \mu_2, \sigma_2^2) + \pi_i\log\lambda + (1 - \pi_i)\log(1 - \lambda) \\
&= \sum_{i=1}^n \pi_i\left[-\frac{1}{2}\log 2\pi\sigma_1^2 - \frac{1}{2\sigma_1^2}(y_i - \mu_1)^2\right] + (1 - \pi_i)\left[-\frac{1}{2}\log 2\pi\sigma_2^2 - \frac{1}{2\sigma_2^2}(y_i - \mu_2)^2\right] \\
&\qquad\quad + \pi_i\log\lambda + (1 - \pi_i)\log(1 - \lambda)
\end{aligned}
$$
In order to compute πi, we will need to use the current estimates of µ1, σ1², µ2, and σ2² (in addition to the
data y1, ..., yn). We can then compute the gradient of Q in order to maximize it for the current iteration. After
doing that we get the next values, which are
$$
\begin{aligned}
\hat\mu_1 &= \frac{\sum \pi_i y_i}{\sum \pi_i} \\
\hat\mu_2 &= \frac{\sum (1 - \pi_i) y_i}{\sum (1 - \pi_i)} \\
\hat\sigma_1^2 &= \frac{\sum \pi_i (y_i - \mu_1)^2}{\sum \pi_i} \\
\hat\sigma_2^2 &= \frac{\sum (1 - \pi_i)(y_i - \mu_2)^2}{\sum (1 - \pi_i)} \\
\hat\lambda &= \frac{1}{n}\sum \pi_i
\end{aligned}
$$
Once we have these updated estimates, we can go back to the E-step and recompute our Q function.
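As a sketch (not from the original text), one complete EM iteration for this model can be written directly from the formulas above; theta here is an assumed list of parameter names of my choosing, with s1 and s2 denoting standard deviations.
em_step <- function(y, theta) {
    with(theta, {
        ## E-step: posterior probability that each observation is from component 1
        pi <- lambda * dnorm(y, mu1, s1) /
            (lambda * dnorm(y, mu1, s1) + (1 - lambda) * dnorm(y, mu2, s2))
        ## M-step: weighted updates from the formulas above
        list(mu1 = sum(pi * y) / sum(pi),
             mu2 = sum((1 - pi) * y) / sum(1 - pi),
             s1 = sqrt(sum(pi * (y - mu1)^2) / sum(pi)),
             s2 = sqrt(sum((1 - pi) * (y - mu2)^2) / sum(1 - pi)),
             lambda = mean(pi))
    })
}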
4.2.2 Censored Exponential Data
Suppose we have survival times y1, ..., yn ∼ Exponential(λ). However, we do not observe these survival
times because some of them are censored at times c1, ..., cn. Because the censoring times are known, what
we actually observe are the data (min(y1, c1), δ1), ..., (min(yn, cn), δn), where δi = 1 if yi ≤ ci and δi = 0 if yi
is censored at time ci.
The complete data density is simply the exponential distribution with rate parameter λ,
$$
g(y_1, \ldots, y_n \mid \lambda) = \prod_{i=1}^n \frac{1}{\lambda}\exp(-y_i/\lambda).
$$
We can divide the data into the observations that we fully observe (δi = 1) and those that are censored
(δi = 0). For the censored data, the complete survival time is “missing”, so we can denote the complete survival
time as zi. Given that, the Q(λ | λ0) function is
$$
Q(\lambda \mid \lambda_0) = \mathbb{E}\left[\left. -n\log\lambda - \frac{1}{\lambda}\sum_{i=1}^n \left\{\delta_i y_i + (1 - \delta_i) z_i\right\} \ \right|\ y, \lambda_0\right].
$$
But what is E[zi | yi, λ0]? Because we assume the underlying data are exponentially distributed, we can use
the “memoryless” property of the exponential distribution. That is, given that we have survived until the
censoring time ci, our expected survival time beyond that is simply λ. Because we don’t know λ yet we can
plug in our current best estimate. Now, for the E-step we have
$$
Q(\lambda \mid \lambda_0) = -n\log\lambda - \frac{1}{\lambda}\sum_{i=1}^n \left\{\delta_i y_i + (1 - \delta_i)(c_i + \lambda_0)\right\}.
$$
With the missing data removed from the Q function, we can execute the M-step and maximize the above
function to get
$$
\hat\lambda = \frac{1}{n}\sum_{i=1}^n \left\{\delta_i y_i + (1 - \delta_i)(c_i + \lambda_0)\right\}.
$$
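A sketch of this iteration in R (not from the original text) follows; it assumes the observed times are stored with censored observations recorded at their censoring time, and d is the censoring indicator (1 = fully observed, 0 = censored).
em_exp <- function(y, d, lambda0 = 1, tol = 1e-8, maxit = 1000) {
    for(i in seq_len(maxit)) {
        ## M-step with the E-step substitution E[z_i] = c_i + lambda0
        lambda1 <- mean(d * y + (1 - d) * (y + lambda0))
        if(abs(lambda1 - lambda0) < tol) break
        lambda0 <- lambda1
    }
    lambda1
}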
4.3 A Minorizing Function
One of the positive qualities of the EM algorithm is that it is very stable. Unlike Newton’s algorithm, where
each iteration may or may not be closer to the optimal value, each iteration of the EM algorithm is designed
to increase the observed log-likelihood. This is the ascent property of the EM algorithm, which we will show
later. This stability, though, comes at a price: the EM algorithm’s convergence rate is linear (while Newton’s
algorithm is quadratic). This can make running the EM algorithm painful at times, particularly when one
has to compute standard errors via a resampling approach like the bootstrap.
The EM algorithm is a minorization approach. Instead of directly maximizing the log-likelihood, which is
difficult to evaluate, the algorithm constructs a minorizing function and optimizes that function instead.
What is a minorizing function? Following Chapter 7 of Jan de Leeuw’s Block Relaxation Algorithms in
Statistics a function g minorizes f over X at y if
1. g(x) ≤ f(x) for all x ∈ X
2. g(y) = f (y)
In the description of the EM algorithm above, Q(θ | θ0) is the minorizing function. The benefits of this
approach are
1. Q(θ | θ0) is a much nicer function that is easy to optimize, and
2. because Q(θ | θ0) minorizes ℓ(θ | y), maximizing it is guaranteed to increase (or at least not
decrease) ℓ(θ | y). This is because if θn is our current estimate of θ and Q(θ | θn) minorizes ℓ(θ | y) at
θn, then we have
$$
\ell(\theta_{n+1} \mid y) \geq Q(\theta_{n+1} \mid \theta_n) \geq Q(\theta_n \mid \theta_n) = \ell(\theta_n \mid y).
$$
Let’s take a look at how this minorization process works. We can begin with the observed log-likelihood
$$
\log f(y \mid \theta) = \log\int g(y, z \mid \theta)\, dz.
$$
Using the time-honored strategy of adding and subtracting, we can show that if θ0 is our current estimate of
θ,
$$
\begin{aligned}
\log f(y \mid \theta) - \log f(y \mid \theta_0)
&= \log\int g(y, z \mid \theta)\,dz - \log\int g(y, z \mid \theta_0)\,dz \\
&= \log\frac{\int g(y, z \mid \theta)\,dz}{\int g(y, z \mid \theta_0)\,dz} \\
&= \log\frac{\int g(y, z \mid \theta_0)\,\frac{g(y, z \mid \theta)}{g(y, z \mid \theta_0)}\,dz}{\int g(y, z \mid \theta_0)\,dz} \\
&= \log\int \frac{g(y, z \mid \theta)}{g(y, z \mid \theta_0)}\, h(z \mid y, \theta_0)\,dz \\
&= \log \mathbb{E}\left[\frac{g(y, z \mid \theta)}{g(y, z \mid \theta_0)}\ \middle|\ y, \theta_0\right]
\end{aligned}
$$
The right-hand side of the above equation, the middle part of which is a function of θ, is our minorizing
function. We can see that for θ = θ0 we have that the minorizing function is equal to log f(y | θ0).
4.3.1 Example: Minorization in a Two-Part Mixture Model
We will revisit the two-part Normal mixture model from before. Suppose we have data y1 , . . . , yn that are
sampled independently from a two-part mixture of Normals model with density
$$
f(y \mid \theta) = \lambda\,\varphi(y \mid \mu_1, \sigma_1^2) + (1 - \lambda)\,\varphi(y \mid \mu_2, \sigma_2^2).
$$
(Figure: histogram of the simulated data x.)
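The code that simulates the data x is not shown in the text as extracted. A set-up consistent with the description (true λ = 0.4; two Normal components with parameters treated as known) might look like the following; the component means, standard deviations, sample size, and seed are assumptions, not values from the text.
## Sketch (assumed values): simulate from the two-part mixture
set.seed(2017-10-01)          ## arbitrary seed (assumption)
n <- 100                      ## sample size (assumption)
mu1 <- 1; s1 <- 1             ## component 1 parameters (assumption)
mu2 <- 4; s2 <- 1             ## component 2 parameters (assumption)
lambda <- 0.4                 ## true mixing proportion (stated in the text)
z <- rbinom(n, 1, lambda)
x <- ifelse(z == 1, rnorm(n, mu1, s1), rnorm(n, mu2, s2))
hist(x)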
For the purposes of this example, let’s assume that µ1, µ2, σ1², and σ2² are known. The only unknown parameter
is λ, the mixing proportion. The observed data log-likelihood is
$$
\log f(y_1, \ldots, y_n \mid \lambda) = \sum_{i=1}^n \log\left[\lambda\,\varphi(y_i \mid \mu_1, \sigma_1^2) + (1 - \lambda)\,\varphi(y_i \mid \mu_2, \sigma_2^2)\right].
$$
We can plot the observed data log-likelihood in this case with the simulated data above. First, we can write
a function encoding the mixture density as a function of the data and ⁄.
f <- function(x, lambda) {
lambda * dnorm(x, mu1, s1) + (1-lambda) * dnorm(x, mu2, s2)
}
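The definition of the loglike() function used below does not appear in the extracted text; it is presumably analogous to the one given later in the acceleration example, along with a curve() call to draw the log-likelihood.
## Sketch (reconstruction): observed data log-likelihood as a function of lambda
loglike <- Vectorize(
    function(lambda) {
        sum(log(f(x, lambda)))
    }
)
curve(loglike, 0.01, 0.95, n = 200, xlab = expression(lambda))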
(Figure: the observed data log-likelihood as a function of λ.)
Note that the true value is λ = 0.4. We can compute the maximum likelihood estimate in this simple case
with
op <- optimize(loglike, c(0.1, 0.9), maximum = TRUE)
op$maximum
[1] 0.3097435
In this case it would appear that the maximum likelihood estimate exhibits some bias, but we won’t worry
about that right now.
We can illustrate how the minorizing function works by starting with an initial value of λ0 = 0.8.
lam0 <- 0.8
minor <- function(lambda) {
p1 <- sum(log(f(x, lam0)))
pi <- lam0 * dnorm(x, mu1, s1) / (lam0 * dnorm(x, mu1, s1)
+ (1 - lam0) * dnorm(x, mu2, s2))
p2 <- sum(pi * dnorm(x, mu1, s1, log = TRUE)
+ (1-pi) * dnorm(x, mu2, s2, log = TRUE)
+ pi * log(lambda)
+ (1-pi) * log(1-lambda))
p3 <- sum(pi * dnorm(x, mu1, s1, log = TRUE)
+ (1-pi) * dnorm(x, mu2, s2, log = TRUE)
+ pi * log(lam0)
+ (1-pi) * log(1-lam0))
p1 + p2 - p3
}
minor <- Vectorize(minor, "lambda")
Now we can plot the minorizing function along with the observed log-likelihood.
par(mar = c(5,4, 1, 1))
curve(loglike, 0.01, 0.95, ylab = "Log-likelihood",
xlab = expression(lambda))
curve(minor, 0.01, 0.95, add = TRUE, col = "red")
legend("topright", c("obs. log-likelihood", "minorizing function"),
col = 1:2, lty = 1, bty = "n")
(Figure: the observed log-likelihood (black) and the minorizing function (red) at λ0 = 0.8.)
Maximizing the minorizing function gives us the next estimate of ⁄ in the EM algorithm. It’s clear from the
picture that maximizing the minorizing function will increase the observed log-likelihood.
par(mar = c(5,4, 2, 1))
curve(loglike, 0.01, 0.95, ylab = "Log-likelihood",
xlab = expression(lambda), xlim = c(-0.5, 1),
ylim = c())
abline(v = lam0, lty = 2)
mtext(expression(lambda[0]), at = lam0, side = 3)
curve(minor, 0.01, 0.95, add = TRUE, col = "red", lwd = 2)
op <- optimize(minor, c(0.1, 0.9), maximum = TRUE)
abline(v = op$maximum, lty = 2)
lam0 <- op$maximum
curve(minor, 0.01, 0.95, add = TRUE, col = "blue", lwd = 2)
abline(v = lam0, lty = 2)
mtext(expression(lambda[1]), at = lam0, side = 3)
op <- optimize(minor, c(0.1, 0.9), maximum = TRUE)
abline(v = op$maximum, lty = 2)
mtext(expression(lambda[2]), at = op$maximum, side = 3)
legend("topleft",
c("obs. log-likelihood", "1st minorizing function", "2nd minorizing function"),
col = c(1, 2, 4), lty = 1, bty = "n")
(Figure: the observed log-likelihood with the first and second minorizing functions and the iterates λ0, λ1, λ2.)
In the figure above, the second minorizing function is constructed using λ1 and maximized to get λ2. This
process of constructing the minorizing function and maximizing can be repeated until convergence. This is
the EM algorithm at work!
The flip side of minorization is majorization, which is used in minimization problems. We can implement a
constrained minimization procedure by creating a surrogate function that majorizes the target function and
satisfies the constraints. Specifically, the goal is to minimize a function f(θ) subject to a set of constraints of
the form g_i(θ) ≥ 0, where
$$
g_i(\theta) = u_i'\theta - c_i
$$
and where u_i is a vector of the same length as θ, c_i is a constant, and i = 1, ..., ℓ. These constraints are
linear constraints on the parameters. Given the constraints and θn, the estimate of θ at iteration n, we can
construct the surrogate function,
$$
R(\theta \mid \theta_n) = f(\theta) - \lambda\sum_{i=1}^{\ell}\left[g_i(\theta_n)\log g_i(\theta) - u_i'\theta\right],
$$
with λ > 0.
4.4 Missing Information Principle
So far, we have described the EM algorithm for computing maximum likelihood estimates in some missing
data problems. But the original presentation of the EM algorithm did not discuss how to obtain any measures
of uncertainty, such as standard errors. One obvious candidate would be the observed information matrix.
However, much like with the observed log-likelihood, the observed information matrix is difficult to compute
because of the missing data.
Recalling the notation from the previous section, let f(y | θ) be the observed data density, g(y, z | θ) the
complete data density, and h(z | y, θ) := g(y, z | θ)/f(y | θ) the missing data density. From this we can write
the following series of identities:
$$
\begin{aligned}
f(y \mid \theta) &= \frac{g(y, z \mid \theta)}{h(z \mid y, \theta)} \\
-\log f(y \mid \theta) &= -\log g(y, z \mid \theta) - \left[-\log h(z \mid y, \theta)\right] \\
\mathbb{E}\left[-\frac{\partial^2}{\partial\theta\,\partial\theta'}\log f(y \mid \theta)\right] &= \mathbb{E}\left[-\frac{\partial^2}{\partial\theta\,\partial\theta'}\log g(y, z \mid \theta)\right] - \mathbb{E}\left[-\frac{\partial^2}{\partial\theta\,\partial\theta'}\log h(z \mid y, \theta)\right] \\
I_Y(\theta) &= I_{Y,Z}(\theta) - I_{Z\mid Y}(\theta)
\end{aligned}
$$
Here, we refer to I_Y(θ) as the observed data information matrix, I_{Y,Z}(θ) as the complete data information
matrix, and I_{Z|Y}(θ) as the missing information matrix. This identity allows for the nice interpretation that
the “observed information” equals the “complete information” minus the “missing information”.
If we could easily evaluate I_Y(θ), we could simply plug in the maximum likelihood estimate θ̂ and obtain
standard errors from I_Y(θ̂). However, because of the missing data, I_Y(θ) is difficult to evaluate. Presumably,
I_{Y,Z}(θ) is reasonable to compute because it is based on the complete data. What then is I_{Z|Y}(θ), the missing
information matrix?
Let S(y | θ) = ∂/∂θ log f(y | θ) be the observed score function and let S(y, z | θ) = ∂/∂θ log g(y, z | θ) be the
complete data score function. In a critically important paper, Tom Louis showed that
$$
I_{Z\mid Y}(\theta) = \mathbb{E}\left[S(y, z \mid \theta)\, S(y, z \mid \theta)'\right] - S(y \mid \theta)\, S(y \mid \theta)',
$$
with the expectation taken with respect to the missing data density h(z | y, θ). The first part of the right
hand side involves computations on the complete data, which is fine. Unfortunately, the second part involves
the observed score function, which is presumably difficult to evaluate. However, by definition, S(y | θ̂) = 0 at
the maximum likelihood estimate θ̂. Therefore, we can write the observed information matrix at the MLE as
$$
I_Y(\hat\theta) = I_{Y,Z}(\hat\theta) - \mathbb{E}\left[S(y, z \mid \hat\theta)\, S(y, z \mid \hat\theta)'\right]
$$
so that all computations are done on the complete data. Note also that
$$
I_{Y,Z}(\hat\theta) = -\mathbb{E}\left[\frac{\partial^2}{\partial\theta\,\partial\theta'}\log g(y, z \mid \theta)\ \middle|\ \hat\theta, y\right] = -Q''(\hat\theta \mid \hat\theta).
$$
Meilijson showed that when the observed data y = y1, ..., yn are iid, then
$$
S(y \mid \theta) = \sum_{i=1}^n S(y_i \mid \theta)
$$
and hence the observed information can be estimated empirically by
$$
\hat I_Y(\theta) = \sum_{i=1}^n S(y_i \mid \theta)\, S(y_i \mid \theta)' - \frac{1}{n}\, S(y \mid \theta)\, S(y \mid \theta)'.
$$
Again, because S(y | θ̂) = 0 at the MLE, we can ignore the second part of the expression if we are interested
in obtaining the observed information at the location of the MLE. As for the first part of the expression,
Louis also showed that
$$
S(y_i \mid \theta) = \mathbb{E}\left[S(y_i, z_i \mid \theta) \mid y_i, \theta_0\right],
$$
where the expectation is once again taken with respect to the missing data density. Therefore, we can transfer
computations on the observed score function to computations on the complete score function.
4.5 Acceleration Methods
Dempster et al. showed that the convergence rate for the EM algorithm is linear, which can be painfully
slow for some problems. Therefore, a cottage industry has developed around the notion of speeding up the
convergence of the algorithm. Two approaches that we describe here are one proposed by Tom Louis based
on the Aitken acceleration technique and the SQUAREM approach of Varadhan and Roland.
Let M(θ) be the map representing a single iteration of the EM algorithm, so that θ_{n+1} = M(θ_n). Then,
under standard regularity conditions, we can approximate M near the optimum value θ* with
$$
M(\theta) \approx \theta^\star + J(\theta^\star)(\theta - \theta^\star),
$$
where J(θ*) = I_{Z|Y}(θ*) I_{Y,Z}(θ*)^{-1}, which can be interpreted as characterizing the proportion of missing data.
(Dempster et al. also showed that the rate of convergence of the EM algorithm is determined by the modulus
of the largest eigenvalue of J(θ*).) Furthermore, for large j and n, we have
$$
\theta_{n+j+1} - \theta_{n+j} \approx J^{(n)}(\theta^\star)\,(\theta_{j+1} - \theta_j),
$$
where θ* is the MLE, and J^{(n)}(θ*) is J multiplied by itself n times. Then if θ* is the limit of the sequence
{θ_n}, we can write (trivially) for any j
$$
\theta^\star = \theta_j + \sum_{k=1}^\infty (\theta_{k+j} - \theta_{k+j-1}).
$$
We can then approximate this with
$$
\begin{aligned}
\theta^\star &\approx \theta_j + \left(\sum_{k=0}^\infty J^{(k)}(\theta^\star)\right)(\theta_{j+1} - \theta_j) \\
&= \theta_j + \left(I - J(\theta^\star)\right)^{-1}(\theta_{j+1} - \theta_j).
\end{aligned}
$$
The last equivalence is possible because the eigenvalues of J are all less than one in absolute value.
Given this relation, the acceleration method proposed by Louis works as follows. Given θn, the current
estimate of θ,
1. Compute θ_{n+1} using the standard EM algorithm.
2. Compute (I − Ĵ)^{-1} = I_{Y,Z}(θn) I_Y(θn)^{-1}.
3. Let θ* = θn + (I − Ĵ)^{-1}(θ_{n+1} − θn).
4. Set θ_{n+1} = θ*.
The cost of using Louis’s technique is minimal if the dimension of θ is small. Ultimately, it comes down to
the cost of inverting I_Y(θn) relative to running a single iteration of the EM algorithm. Further, it’s worth
emphasizing that the convergence of the approach is only guaranteed for values of θ in a neighborhood of the
optimum θ*, but the size and nature of that neighborhood is typically unknown in applications.
Looking at the algorithm described above, we can gather some basic heuristics of how it works. When the
information in the observed data is high relative to the complete data, then the value of (I − Ĵ)^{-1} will be
close to 1 and the sequence of iterates generated by the algorithm will be very similar to the usual EM
sequence. However, if the proportion of missing data is high, then (I − Ĵ)^{-1} will be much greater than 1 and
the modifications that the algorithm makes to the usual EM sequence will be large.
(Figure: histogram of the simulated data y.)
If we assume µ1, µ2, σ1 and σ2 are known, then we can visualize the observed data log-likelihood as a function
of λ.
f <- function(y, lambda) {
lambda * dnorm(y, mu1, s1) + (1-lambda) * dnorm(y, mu2, s2)
}
loglike <- Vectorize(
function(lambda) {
sum(log(f(y, lambda)))
}
)
curve(loglike, 0.01, 0.95, n = 200, xlab = expression(lambda))
(Figure: the observed data log-likelihood as a function of λ.)
Because the observed log-likelihood is relatively simple in this case, we can maximize it directly and obtain
the true maximum likelihood estimate.
op <- optimize(loglike, c(0.01, 0.95), maximum = TRUE, tol = 1e-8)
op$maximum
[1] 0.3097386
We can encode the usual EM iteration as follows. The M function represents a single iteration of the EM
algorithm as a function of the current value of λ.
make_pi <- function(lambda, y, mu1, mu2, s1, s2) {
lambda * dnorm(y, mu1, s1) / (lambda * dnorm(y, mu1, s1) +
(1 - lambda) * (dnorm(y, mu2, s2)))
}
M <- function(lambda0) {
pi.est <- make_pi(lambda0, y, mu1, mu2, s1, s2)
mean(pi.est)
}
We can also encode the accelerated version here with the function Mstar. The functions Iy and Iyz encode
the observed and complete data information matrices.
Iy <- local({
d <- deriv3(~ log(lambda * dnorm(y, mu1, s1) + (1-lambda) * dnorm(y, mu2, s2)),
"lambda", function.arg = TRUE)
function(lambda) {
H <- attr(d(lambda), "hessian")
sum(H)
}
})
Iyz <- function(lambda) {
    ## Reconstruction: the original definition of Iyz is incomplete in the text
    ## as extracted. This version uses the curvature of the complete data
    ## log-likelihood in lambda, with the missing z_i replaced by their
    ## conditional expectations (the same sign convention as Iy above).
    pi.est <- make_pi(lambda, y, mu1, mu2, s1, s2)
    -sum(pi.est / lambda^2 + (1 - pi.est) / (1 - lambda)^2)
}
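The Mstar() function is likewise not shown completely in the extracted text. A sketch consistent with steps 1 through 4 of Louis's method above, using the one-parameter identity (I − J)^{-1} = I_{Y,Z}(λ)/I_Y(λ), might be:
Mstar <- function(lambda0) {
    lambda1 <- M(lambda0)    ## one standard EM step
    ## Accelerated update: lambda0 + (I - J)^{-1} (lambda1 - lambda0)
    lambda0 + (Iyz(lambda0) / Iy(lambda0)) * (lambda1 - lambda0)
}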
Taking a starting value of λ = 0.1, we can see the speed at which the original EM algorithm and the
accelerated versions converge toward the MLE.
lambda0 <- 0.1
lambda0star <- 0.1
iter <- 6
EM <- numeric(iter)
Accel <- numeric(iter)
for(i in 1:iter) {
pihat <- make_pi(lambda0, y, mu1, mu2, s1, s2)
lambda1 <- M(lambda0)
lambda1star <- Mstar(lambda0star)
EM[i] <- lambda1
Accel[i] <- lambda1star
lambda0 <- lambda1
lambda0star <- lambda1star
}
results <- data.frame(EM = EM, Accel = Accel,
errorEM = abs(EM - op$maximum),
errorAccel = abs(Accel - op$maximum))
4.5.2 SQUAREM
Let M(θ) be the map representing a single iteration of the EM algorithm so that θ_{n+1} = M(θn). Given the
current value θ0,
1. Let θ1 = M(θ0)
2. Let θ2 = M(θ1)
3. Compute the difference r = θ1 − θ0
4. Let v = (θ2 − θ1) − r
5. Compute the step length α
6. Modify α if necessary
7. Let θ′ = θ0 − 2αr + α²v
8. Let θ1 = M(θ′)
9. Compare θ1 with θ0 and check for convergence. If we have not yet converged, let θ0 = θ1 and go back
to Step 1.
A minimal sketch of this scheme is given below.
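The following sketch (not from the original text) implements the steps above around any EM map M(). The step length uses one common choice, α = −‖r‖/‖v‖, capped at −1; the text leaves the specific formula and modification rule unstated, so both are assumptions here.
squarem1 <- function(M, theta0, tol = 1e-8, maxit = 100) {
    for(i in seq_len(maxit)) {
        theta1 <- M(theta0)
        theta2 <- M(theta1)
        r <- theta1 - theta0
        v <- (theta2 - theta1) - r
        alpha <- -sqrt(sum(r^2)) / sqrt(sum(v^2))    ## one common step-length choice
        alpha <- min(alpha, -1)                      ## modify alpha if necessary
        theta.prime <- theta0 - 2 * alpha * r + alpha^2 * v
        theta.new <- M(theta.prime)                  ## stabilizing EM step
        if(sum(abs(theta.new - theta0)) < tol) return(theta.new)
        theta0 <- theta.new
    }
    theta0
}
## For example, with the mixture EM map from above: squarem1(M, 0.1)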
5 Integration
In statistical applications we often need to compute quantities of the form
$$
\mathbb{E}_f\, g(X) = \int g(x)\, f(x)\, dx,
$$
where X is a random variable drawn from a distribution with density function f. Another quantity
that we often need to compute is the normalizing constant for a probability density function. If X has a
density that is proportional to p(x | θ), then its normalizing constant is ∫ p(x | θ) dx.
In both problems—computing the expectation and computing the normalizing constant—an integral must be
evaluated.
Approaches to solving the integration problem roughly fall into two categories. The first category involves
identifying a sequence of estimates to the integral that eventually converge to the true value as some index
(usually involving time or resources) goes to infinity. Adaptive quadrature, independent Monte Carlo and
Markov chain Monte Carlo techniques all fall into this category. Given enough time and resources, these
techniques should converge to the true value of the integral.
The second category of techniques involves identifying a class of alternative functions that are easier to
work with, finding the member of that class that best matches the true function, and then working with
the alternate function instead to compute the integral. Laplace approximation, variational inference, and
approximate Bayes computation (ABC) fall into this category of approaches. For a given dataset, these
approaches will not provide the true integral value regardless of time and resources, but as the sample size
increases, the approximations will get better.
5.1 Laplace Approximation
The first technique that we will discuss is Laplace approximation. This technique can be used for reasonably
well behaved functions that have most of their mass concentrated in a small area of their domain. Technically,
it works for functions that are in the class of L2, meaning that
$$
\int g(x)^2\, dx < \infty.
$$
Such a function generally has very rapidly decreasing tails so that in the far reaches of the domain we would
not expect to see large spikes.
Imagine a function that looks as follows
(Figure: a function g(x) with most of its mass concentrated around the point x0.)
We can see that this function has most of its mass concentrated around the point x0 and that we could
probably approximate the area under the function with something like a step function.
(Figure: the same function approximated by a step function around x0.)
The benefit of using something like a step function is that the area under a step function is trivial to compute. If
we could find a principled and automatic way to find that approximating step function, and it were easier
than just directly computing the integral in the first place, then we could have an alternative to computing
the integral. In other words, we could perhaps say that
$$
\int g(x)\, dx \approx g(x_0)\,\varepsilon.
$$
Suppose we have a function g(x) ∈ L2 which achieves its maximum at x0. We want to compute
$$
\int_a^b g(x)\, dx.
$$
We can write g(x) = exp(h(x)), where h(x) = log g(x), so that h also achieves its maximum at x0.
From here we can take a Taylor series approximation of h(x) around the point x0 to give us
$$
\int_a^b \exp(h(x))\, dx \approx \int_a^b \exp\!\left(h(x_0) + h'(x_0)(x - x_0) + \frac{1}{2}h''(x_0)(x - x_0)^2\right) dx.
$$
Because we assumed h(x) achieves its maximum at x0, we know h′(x0) = 0. Therefore, we can simplify the
above expression to be
$$
= \int_a^b \exp\!\left(h(x_0) + \frac{1}{2}h''(x_0)(x - x_0)^2\right) dx.
$$
Given that h(x0) is a constant that doesn’t depend on x, we can pull it outside the integral. In addition, we
can rearrange some of the terms to give us
$$
= \exp(h(x_0))\int_a^b \exp\!\left(-\frac{1}{2}\,\frac{(x - x_0)^2}{-h''(x_0)^{-1}}\right) dx.
$$
Now that looks more like it, right? Inside the integral we have a quantity that is proportional to a Normal
density with mean x0 and variance −h″(x0)^{−1}. At this point we are just one call to the pnorm() function
away from approximating our integral. All we need is to compute our normalizing constants.
If we let Φ(x | µ, σ²) be the cumulative distribution function for the Normal distribution with mean µ and
variance σ² (and φ is its density function), then we can write the above expression as
$$
\begin{aligned}
&= \exp(h(x_0))\sqrt{\frac{2\pi}{-h''(x_0)}}\int_a^b \varphi\!\left(x \mid x_0, -h''(x_0)^{-1}\right) dx \\
&= \exp(h(x_0))\sqrt{\frac{2\pi}{-h''(x_0)}}\left[\Phi\!\left(b \mid x_0, -h''(x_0)^{-1}\right) - \Phi\!\left(a \mid x_0, -h''(x_0)^{-1}\right)\right]
\end{aligned}
$$
Recall that exp(h(x0)) = g(x0). If b = ∞ and a = −∞, as is commonly the case, then the term in the square
brackets is equal to 1, making the Laplace approximation equal to the value of the function g(x) at its mode
multiplied by a constant that depends on the curvature of the function h.
One final note about the Laplace approximation is that it replaces the problem of integrating a function
with the problem of maximizing it. In order to compute the Laplace approximation, we have to compute
the location of the mode, which is an optimization problem. Often, this problem is faster to solve using
well-understood function optimizers than integrating the same function would be.
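As a small illustration (not from the original text), the following sketch computes a Laplace approximation numerically. The test function here is a Normal density, so the true value of the integral over the real line is 1; the optimization interval and finite-difference step are assumptions.
## Sketch: Laplace approximation with a = -Inf, b = Inf
g <- function(x) dnorm(x, mean = 1, sd = 2)   ## example function; true integral is 1
h <- function(x) log(g(x))
x0 <- optimize(h, c(-20, 20), maximum = TRUE)$maximum   ## mode of h (and of g)
eps <- 1e-4
h2 <- (h(x0 + eps) - 2 * h(x0) + h(x0 - eps)) / eps^2   ## numerical h''(x0)
g(x0) * sqrt(2 * pi / (-h2))                            ## Laplace approximation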
5.1.1 Computing the Posterior Mean
In Bayesian computations we often want to compute the posterior mean of a parameter given the observed
data. If y represents data we observe and y comes from the distribution f(y | θ) with parameter θ and θ has
a prior distribution π(θ), then we usually want to compute the posterior distribution p(θ | y) and its mean,
$$
\mathbb{E}_p[\theta] = \int \theta\, p(\theta \mid y)\, d\theta.
$$
We can write
$$
\int \theta\, p(\theta \mid y)\, d\theta = \frac{\int \theta\, f(y \mid \theta)\pi(\theta)\, d\theta}{\int f(y \mid \theta)\pi(\theta)\, d\theta}
= \frac{\int \theta\, \exp(\log f(y \mid \theta)\pi(\theta))\, d\theta}{\int \exp(\log f(y \mid \theta)\pi(\theta))\, d\theta}.
$$
Here, we’ve used the age old trick of exponentiating and log-ging.
If we let h(θ) = log f(y | θ)π(θ), then we can use the same Laplace approximation procedure described in the
previous section. However, in order to do that we must know where h(θ) achieves its maximum. Because
h(θ) is simply a monotonic transformation of a function proportional to the posterior density, we know that
h(θ) achieves its maximum at the posterior mode.
Let θ̂ be the posterior mode of p(θ | y). Then we have
$$
\begin{aligned}
\int \theta\, p(\theta \mid y)\, d\theta
&\approx \frac{\int \theta\, \exp\!\left(h(\hat\theta) + \frac{1}{2}h''(\hat\theta)(\theta - \hat\theta)^2\right) d\theta}{\int \exp\!\left(h(\hat\theta) + \frac{1}{2}h''(\hat\theta)(\theta - \hat\theta)^2\right) d\theta} \\
&= \frac{\int \theta\, \exp\!\left(\frac{1}{2}h''(\hat\theta)(\theta - \hat\theta)^2\right) d\theta}{\int \exp\!\left(\frac{1}{2}h''(\hat\theta)(\theta - \hat\theta)^2\right) d\theta} \\
&= \frac{\int \theta\, \sqrt{\frac{2\pi}{-h''(\hat\theta)}}\,\varphi\!\left(\theta \mid \hat\theta, -h''(\hat\theta)^{-1}\right) d\theta}{\int \sqrt{\frac{2\pi}{-h''(\hat\theta)}}\,\varphi\!\left(\theta \mid \hat\theta, -h''(\hat\theta)^{-1}\right) d\theta} \\
&= \hat\theta
\end{aligned}
$$
Hence, the Laplace approximation to the posterior mean is equal to the posterior mode. This approximation
is likely to work well when the posterior is unimodal and relatively symmetric around the mode. Furthermore,
the more concentrated the posterior is around θ̂, the better.
As an example, consider the model
$$
Y \mid \mu \sim \text{Poisson}(\mu), \qquad \mu \sim \text{Gamma}(a, b),
$$
where the Gamma density is
$$
f(\mu) = \frac{1}{b^a\,\Gamma(a)}\,\mu^{a-1} e^{-\mu/b}.
$$
In this case, given an observation y, the posterior distribution is simply a Gamma distribution with shape
parameter y + a and scale parameter 1/(1 + 1/b).
Suppose we observe y = 2. We can draw the posterior distribution and prior distribution as follows.
make_post <- function(y, shape, scale) {
function(x) {
dgamma(x, shape = y + shape,
scale = 1 / (1 + 1 / scale))
}
}
set.seed(2017-11-29)
y <- 2
prior.shape <- 3
prior.scale <- 3
p <- make_post(y, prior.shape, prior.scale)
curve(p, 0, 12, n = 1000, lwd = 3, xlab = expression(mu),
ylab = expression(paste("p(", mu, " | y)")))
curve(dgamma(x, shape = prior.shape, scale = prior.scale), add = TRUE,
lty = 2)
legend("topright", legend = c("Posterior", "Prior"), lty = c(1, 2), lwd = c(3, 1), bty = "n")
(Figure: the Gamma posterior (solid) and prior (dashed) densities for µ.)
Because this is a Gamma distribution, we can also compute the posterior mode in closed form.
pmode <- (y + prior.shape - 1) * (1 / (1 + 1 / prior.scale))
pmode
[1] 3
We can also compute the mean.
pmean <- (y + prior.shape) * (1 / (1 + 1 / prior.scale))
pmean
[1] 3.75
From the skewness in the figure above, it’s clear that the mean and the mode should not match.
We can now see what the Laplace approximation to the posterior looks like in this case. First, we can compute
the gradient and Hessian of the Gamma density.
a <- prior.shape
b <- prior.scale
fhat <- deriv3(~ mu^(y + a - 1) * exp(-mu * (1 + 1/b)) / ((1/(1+1/b))^(y+a) * gamma(y + a)), "mu", function.arg = TRUE)
Then we can compute the quadratic approximation to the density via the lapprox() function below.
post.shape <- y + prior.shape - 1
post.scale <- 1 / (length(y) + 1 / prior.scale)
lapprox <- Vectorize(function(mu, mu0 = pmode) {
deriv <- fhat(mu0)
grad <- attr(deriv, "gradient")
hess <- drop(attr(deriv, "hessian"))
f <- function(x) dgamma(x, shape = post.shape, scale = post.scale)
hpp <- (hess * f(mu0) - grad^2) / f(mu0)^2
exp(log(f(mu0)) + 0.5 * hpp * (mu - mu0)^2)
}, "mu")
Plotting the true posterior and the Laplace approximation gives us the following.
curve(p, 0, 12, n = 1000, lwd = 3, xlab = expression(mu),
ylab = expression(paste("p(", mu, " | y)")))
curve(dgamma(x, shape = prior.shape, scale = prior.scale), add = TRUE,
lty = 2)
legend("topright",
legend = c("Posterior Density", "Prior Density", "Laplace Approx"),
lty = c(1, 2, 1), lwd = c(3, 1, 1), col = c(1, 1, 2), bty = "n")
curve(lapprox, 0.001, 12, n = 1000, add = TRUE, col = 2, lwd = 2)
(Figure: the posterior density, the prior density, and the Laplace approximation to the posterior.)
The solid red curve is the Laplace approximation and we can see that in the neighborhood of the mode, the
approximation is reasonable. However, as we move farther away from the mode, the tail of the Gamma is
heavier on the right.
Of course, this Laplace approximation is done with only a single observation. One would expect the
approximation to improve as the sample size increases. In this case, with respect to the posterior mode as an
approximation to the posterior mean, we can see that the difference between the two is simply
$$
\hat\theta_{\text{mean}} - \hat\theta_{\text{mode}} = \frac{1}{n + 1/b}.
$$
If we could simulate $x_1, \ldots, x_n \overset{\text{i.i.d.}}{\sim} f$, then by the law of large numbers, we would have
$$
\frac{1}{n}\sum_{i=1}^n h(x_i) \longrightarrow \mathbb{E}_f[h(X)],
$$
which, we will note for now, does not depend on the dimension of the random variable $X_1$.
This approach to computing the expectation above is known as Monte Carlo integration which takes
advantage of the law of large numbers saying that averages of numbers converge to their expectation. Monte
Carlo integration can be quite useful, but it takes the problem of computing the integral directly (or via
approximation) and replaces it with a problem of drawing samples from an arbitrary density function f .
Before we go further, we will take a brief diversion into random number generation.
In order to use the simulation-based techniques described in this book, we will need to be able to generate
sequences of random numbers. Most of the time in the software we are using, there is a function that will do
this for us. For example, in R, if we want to generate uniformly distributed random numbers over a fixed
interval, we can use the runif() function.
Nevertheless, there are two issues that are worth considering here:
1. It is useful to know a little about what is going on under the hood of these random number generators.
How exactly is the sequence of numbers created?
2. Built-in functions in R are only useful for well-known or well-characterized distributions. However, with
many simulation-based techniques, we will want to generate random numbers from distributions that
we have likely never seen before and for which there will not be any built-in function.
The truth is that R, along with most other analytics packages, does not generate genuine random numbers.
R generates pseudo-random numbers that appear to be random but are actually generated in a deterministic
way. This approach sounds worse, but it’s actually better for two reasons. First, generating genuine random
numbers can be slow and often will depend on some outside source of entropy/randomness. Second, genuine
random numbers are not reproducible, so if you wanted to re-create some results based on simulations, you
would not ever be able to do so.
Pseudo-random number generators (PRNGs) have a long history that we will not cover here. One useful thing
to know is that this is a tricky area and it is not simple to wander in and start developing your own PRNGs. It
is useful to know how the systems work, but after that it's best to leave the specifics to the experts.
The most commonly used class of PRNGs in scientific applications is the linear congruential generator. The
basic idea behind an LCG is that we have a starting seed, and then from there we generate pseudo-random
numbers via a recurrence relation. Most LCGs have the form
$$
X_{n+1} = (a X_n + c) \bmod m,
$$
where a is called the multiplier, c is the increment, and m is the modulus. For n = 0, the value X0 is the seed.
Modular arithmetic is needed in order to prevent the sequence from going off to infinity. For most generators,
the values X0 , X1 , . . . are integers. However, we could for example generate Uniform(0, 1) variates by taking
Un = Xn /m.
Given the recurrence relation above and the modular arithmetic, the maximum number of distinct values that
can be generated by an LCG is m, so we would need m to be very large. The hope is that if the PRNG is
well-designed, the sequence should hit every number from 0 to m ≠ 1 before repeating. If a number is repeated
before all numbers are seen, then the generator has a period in it that is shorter than the maximal period
that is possible. Setting the values of a, c, and m is a tricky business and can require some experimentation.
In summary, don’t do this at home. As an (historical) example, the random number generator proposed by
the book Numerical Recipes specified that a = 1664525, c = 1013904223, and m = 2^32.
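For illustration only (not from the original text), here is a minimal LCG using those historical constants; in practice you should rely on R's built-in generators.
lcg <- function(n, seed = 1) {
    a <- 1664525; c <- 1013904223; m <- 2^32
    x <- numeric(n)
    for(i in seq_len(n)) {
        seed <- (a * seed + c) %% m
        x[i] <- seed
    }
    x / m    ## map to (approximately) Uniform(0, 1)
}
u <- lcg(5)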
Perhaps the biggest problem with using the historical LCGs for generating random numbers is that their
periods are too short, even if they manage to hit the maximal period. Given the scale of simulations being
conducted today, even a period of 2^32 would likely be too short to appear sufficiently random. Most analytical
software systems have since moved on to other more sophisticated generators. For example, the default in R
is the Mersenne-Twister, which has a long period of 2^19937 − 1.
Further notes:
• The randomness of a pseudo-random sequence can be checked via statistical tests of uniformity, such as
the Kolmogorov-Smirnov test, the Chi-square test, or the Marsaglia “die hard” tests.
• Many PRNGs generate sequences that look random in one dimension but do not look random when
embedded into higher dimensions. It is possible for PRNGs to generate numbers that lie on a higher-
dimensional hyperplane that still look random in one dimension.
Uniform random numbers are useful, but usually we want to generate random numbers from some non-uniform
distribution. There are a few ways to do this depending on the distribution.
The most generic method (but not necessarily the simplest) uses the inverse of the cumulative distribution
function of the distribution.
Suppose we want to draw samples from a distribution with density f and cumulative distribution function
F(x) = ∫_{−∞}^{x} f(t) dt. Then we can do the following:
1. Generate U ∼ Unif(0, 1).
2. Set X = F^{−1}(U); then X has distribution function F.
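A minimal sketch of the first step (the seed and the sample size of 100 are arbitrary choices, not from the text):
set.seed(2018-06-04)   ## arbitrary seed
u <- runif(100)        ## Step 1: Uniform(0, 1) draws
hist(u)
rug(u)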
[Figure: Histogram of u, the Uniform(0, 1) draws.]
Then we can apply the inverse CDF.
lambda <- 2 ## Exponential with mean 2
x <- -lambda * log(1 - u)
hist(x)
rug(x)
[Figure: Histogram of x, the Exponential draws obtained via the inverse CDF.]
The problem with this method is that the CDF is often difficult to invert in closed form, and so other methods
will be needed to generate many random variables.
The inverse of the CDF is not the only function that we can use to transform uniform random variables into
random variables with other distributions. Here are some common transformations.
To generate Normal random variables, we can use the Box-Muller transformation:
1. Generate U1, U2 ∼ Unif(0, 1) using a standard PRNG.
2. Let
Z1 = √(−2 log U1) cos(2π U2)
Z2 = √(−2 log U1) sin(2π U2).
Then Z1 and Z2 are independent N(0, 1) random variables.
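A minimal sketch of this transformation in R (the sample size is an arbitrary choice):
n <- 1000
u1 <- runif(n)
u2 <- runif(n)
z1 <- sqrt(-2 * log(u1)) * cos(2 * pi * u2)   ## standard Normal draws
z2 <- sqrt(-2 * log(u1)) * sin(2 * pi * u2)   ## a second, independent set
## qqnorm(z1) can be used to check that the draws look Normal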
What do we do if we want to generate samples of a random variable with density f and there isn’t a built
in function for doing this? If the random variable is of a reasonably low dimension (less than 10?), then
rejection sampling is a plausible general approach.
The idea of rejection sampling is that although we cannot easily sample from f , there exists another density
g, like a Normal distribution or perhaps a t-distribution, from which it is easy for us to sample (because
there’s a built in function or someone else wrote a nice function). Then we can sample from g directly and
then “reject” the samples in a strategic way to make the resulting “non-rejected” samples look like they came
from f . The density g will be referred to as the “candidate density” and f will be the “target density”.
In order to use the rejection sampling algorithm, we must first ensure that the support of f is a subset of
the support of g. If X_f is the support of f and X_g is the support of g, then we must have X_f ⊂ X_g. This
makes sense: if there’s a region of the support of f that g can never touch, then that area will never get
sampled. In addition, we must assume that
c = sup_{x ∈ X_f} f(x) / g(x) < ∞
and that we can calculate c. The easiest way to satisfy this assumption is to make sure that g has heavier
tails than f. We cannot have g decrease at a faster rate than f in the tails, or else rejection sampling
will not work.
6.3.1 The Algorithm
The rejection sampling algorithm for drawing a sample from the target density f is then
1. Simulate U ∼ Unif(0, 1).
2. Simulate a candidate X ∼ g from the candidate density.
3. If
U ≤ f(X) / (c g(X))
then “accept” the candidate X. Otherwise, “reject” X and go back to the beginning.
The algorithm can be repeated until the desired number of samples from the target density f has been
accepted.
As a simple example, suppose we wanted to generate samples from a N (0, 1) density. We could use the t2
distribution as our candidate density as it has heavier tails than the Normal. Plotting those two densities,
along with a sample from the t2 density gives us the picture below.
set.seed(2017-12-4)
curve(dnorm(x), -6, 6, xlab = "x", ylab = "Density", n = 200)
curve(dt(x, 2), -6, 6, add = TRUE, col = 4, n = 200)
legend("topright", c("Normal density", "t density"),
col = c(1, 4), bty = "n", lty = 1)
x <- rt(200, 2)
rug(x, col = 4)
[Figure: Standard Normal density and t2 density on (−6, 6), with a rug showing the 200 samples from the t2 density.]
Given what we know about the standard Normal density, most of the samples should be between −3 and +3,
except perhaps in very large samples (this is a sample of size 200). From the picture, there are samples in the
range of 4–6. In order to transform the t2 samples into N (0, 1) samples, we will need to reject many of the
samples out in the tail. On the other hand, there are too few samples in the range of [−2, 2] and so we will
have to disproportionately accept samples in that range until it represents the proper N (0, 1) density.
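The text does not show code for the accept/reject step itself, but a minimal sketch for this Normal-versus-t2 example might look as follows; the seed and the number of candidates are arbitrary choices.
set.seed(2018-06-04)
c_val <- dnorm(1) / dt(1, 2)     ## c = sup f(x)/g(x), attained at x = 1 (about 1.257)
n_cand <- 1000
x_cand <- rt(n_cand, 2)          ## candidates drawn from g
u <- runif(n_cand)
accept <- u <= dnorm(x_cand) / (c_val * dt(x_cand, 2))
x_accepted <- x_cand[accept]     ## these behave like N(0, 1) draws
mean(accept)                     ## acceptance rate should be roughly 1/c, about 0.8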
Before we move on, it's worth noting that the rejection sampling method requires that we can evaluate the
target density f . That is how we compute the rejection/acceptance ratio in Step 3. In most cases, this will
not be a problem.
One property of the rejection sampling algorithm is that the number of draws we need to take from the
candidate density g before we accept a candidate is a geometric random variable with success probability 1/c.
We can think of the decision to accept or reject a candidate as a sequence of iid coin flips that has a specific
probability of coming up “heads” (i.e. being accepted). That probability is 1/c and we can calculate that as
follows.
P(X accepted) = P( U ≤ f(X) / (c g(X)) )
              = ∫ P( U ≤ f(x) / (c g(x)) | X = x ) g(x) dx
              = ∫ [ f(x) / (c g(x)) ] g(x) dx
              = 1/c
This property of rejection sampling has implications for how we choose the candidate density g. In theory,
any density can be chosen as the candidate as long as its support includes the support of f . However, in
practice we will want to choose g so that it matches f as closely as possible. As a rule of thumb, candidates g
that match f closely will have smaller values of c and thus will accept candidates with higher probability. We
want to avoid large values of c because large values of c lead to an algorithm that rejects a lot of candidates
and has lower efficiency.
In the example above with the Normal distribution and the t2 distribution, the ratio f (x)/g(x) was maximized
at x = 1 (or x = −1) and so the value of c for that setup was 1.257, which implies an acceptance probability
of about 0.8. Suppose, however, that we wanted to simulate from a Uniform(0, 1) density and we used an
Exponential(1) as our candidate density. The plot of the two densities looks as follows.
curve(dexp(x, 1), 0, 1, col = 4, ylab = "Density")
segments(0, 1, 1, 1)
legend("bottomleft", c("f(x) Uniform", "g(x) Exponential"), lty = 1, col = c(1, 4), bty = "n")
[Figure: The Uniform(0, 1) target density f(x) and the Exponential(1) candidate density g(x) on (0, 1).]
Here, the ratio of f (x)/g(x) is maximized at x = 1 and so the value of c is 2.718, which implies an acceptance
probability of about 0.37. While running the rejection sampling algorithm in this way to produce Uniform
random variables will still work, it will be very inefficient.
We can now show that the distribution of the accepted values from the rejection sampling algorithm above
follows the target density f . We can do this by calculating the distribution function of the accepted values
and showing that it is equal to F(t) = ∫_{−∞}^{t} f(x) dx.
P(X ≤ t | X accepted) = P(X ≤ t, X accepted) / P(X accepted)
                      = P(X ≤ t, X accepted) / (1/c)
                      = c E_g[ E[ 1{X ≤ t} 1{U ≤ f(X) / (c g(X))} | X ] ]
                      = c E_g[ 1{X ≤ t} E[ 1{U ≤ f(X) / (c g(X))} | X ] ]
                      = c E_g[ 1{X ≤ t} f(X) / (c g(X)) ]
                      = ∫_{−∞}^{∞} 1{x ≤ t} (f(x) / g(x)) g(x) dx
                      = ∫_{−∞}^{t} f(x) dx
                      = F(t)
This shows that the distribution function of the candidate values, given that they are accepted, is equal to
the distribution function corresponding to the target density.
A few further notes:
1. We only need to know f and g up to a constant of proportionality. In many applications we will not
know the normalizing constants for these densities, but we do not need them. That is, if f(x) = k1 f*(x)
and g(x) = k2 g*(x), we can proceed with the algorithm using f* and g* even if we do not know the
values of k1 and k2.
2. Any number c′ ≥ c will work in the rejection sampling algorithm, but the algorithm will be less efficient.
3. Throughout the algorithm, operations can (and should!) be done on a log scale.
4. The higher the dimension of f and g, the less efficient the rejection sampling algorithm will be.
5. Whether c = ∞ or not depends on the tail behavior of the densities f and g. If g(x) ↓ 0 faster than
f(x) ↓ 0 as x → ∞, then f(x)/g(x) ↑ ∞.
What if we cannot calculate c = sup_{x ∈ X_f} f(x)/g(x), or are simply too lazy to do so? Fear not, because it turns out
we almost never have to do so. A slight modification of the standard rejection sampling algorithm will allow
us to estimate c while also sampling from the target density f. The tradeoff (there is always a tradeoff!) is
that we must make a more stringent assumption about c, namely that it is achievable. That is, there exists
some value x_c ∈ X_f such that f(x_c)/g(x_c) is equal to sup_{x ∈ X_f} f(x)/g(x).
The modified algorithm is the empirical supremum rejection sampling algorithm of Caffo, Booth, and Davison.
The algorithm goes as follows. First we must choose some starting value of c, call it ĉ, such that ĉ > 1. Then,
1. Draw U ∼ Unif(0, 1).
2. Draw X ∼ g, the candidate density.
3. Accept X if U ≤ f(X) / (ĉ g(X)); otherwise reject X.
4. Let ĉ* = max{ ĉ, f(X)/g(X) }.
5. Update ĉ = ĉ*.
6. Go to Step 1.
From the algorithm we can see that at each iteration, we get more information about the ratio f (X)/g(X)
and can update our estimate of c accordingly.
One way to think of this algorithm is to conceptualize a separate sequence Ỹi , which is 0 or 1 depending
on whether Xi should be rejected (0) or accepted (1). This sequence Ỹi is the accept/reject determination
sequence. Under the standard rejection sampling algorithm, the sequence Ỹi is generated using the true
value of c. Under the empirical supremum rejection sampling (ESUP) scheme, we generate a slightly different
sequence Yi using our continuously updated value of ĉ.
If we drew values X1 , X2 , X3 , X4 , X5 , X6 , . . . from the candidate density g, then we could visualize the
acceptance/rejection process as it might occur using the true value of c and our estimate ĉ.
Following the diagram above, we can see that using the estimate ĉ, there are two instances where we accept
a value when we should have rejected it (X1 and X4 ). In every other instance in the sequence, the value
of Yi was equal to Ỹi . The theory behind the ESUP algorithm is that eventually, the sequence Yi becomes
identical to the sequence Ỹi and therefore we will accept/reject candidates in the same manner as we would
have if we had used the true c.
If f and g are discrete distributions, then the proof of the ESUP algorithm is fairly straightforward. Specifically,
Caffo, Booth, and Davison showed that P(Yi ≠ Ỹi infinitely often) = 0. Recall that by assumption, there
exists some x_c ∈ X_f such that c = f(x_c)/g(x_c). Therefore, as we independently sample candidates from g, at some
point, we will sample the value xc , in which case we will achieve the value c. Once that happens, we are then
using the standard rejection sampling algorithm and our estimate ĉ never changes.
Figure 8: Empirical supremum rejection sampling scheme.
Let γ = min{ i : x_i = x_c }, where x_i ∼ g. So γ is the first time that we see the value x_c as we are sampling
candidates x_i from g. The probability that we sample x_c is g(x_c) (recall that g is assumed to be discrete
here) and so γ has a Geometric distribution with success probability g(x_c). Once we observe x_c, the ESUP
algorithm and the standard rejection sampling algorithms converge and are identical.
From here, we can use the coupling inequality, which tells us that

P(Yi ≠ Ỹi) ≤ P(γ > i) = (1 − g(x_c))^i,

which, because the right-hand side sums to a finite value, implies by the Borel-Cantelli lemma that P(Yi ≠ Ỹi infinitely often) = 0. Therefore, eventually the
sequences Yi and Ỹi must converge and at that point the ESUP algorithm will be identical to the rejection
sampling algorithm.
In practice, we will not know exactly when the ESUP algorithm has converged to the standard rejection
sampling algorithm. However, Caffo, Booth, and Davison report that the convergence is generally fast. Therefore, a
reasonable approach might be to discard the first several accepted values (e.g. a “burn in”) and then use the
remaining values.
We can see how quickly ESUP converges in a simple example where the target density is the standard
Normal and the candidate density is the t2 distribution. Here we simulate 500 draws and start with a value
ĉ = 1.0001. Note that in the code below, all of the computations are done on the log scale for the sake of
numerical stability.
set.seed(2017-12-04)
N <- 500
y_tilde <- numeric(N) ## Binary accept/reject for "true" algorithm
y <- numeric(N) ## Binary accept/reject for ESUP
log_c_true <- dnorm(1, log = TRUE) - dt(1, 2, log = TRUE)
log_chat <- numeric(N + 1)
log_chat[1] <- log(1.0001) ## Starting c value
for(i in seq_len(N)) {
    u <- runif(1)
    x <- rt(1, 2)
    r_true <- dnorm(x, log = TRUE) - dt(x, 2, log = TRUE) - log_c_true
    rhat <- dnorm(x, log = TRUE) - dt(x, 2, log = TRUE) - log_chat[i]
    y_tilde[i] <- log(u) <= r_true
    y[i] <- log(u) <= rhat
    log_chat[i+1] <- max(log_chat[i],
                         dnorm(x, log = TRUE) - dt(x, 2, log = TRUE))
}
Now we can plot log10(|ĉ − c|) for each iteration to see how the magnitude of the error changes with each
iteration.
c_true <- exp(log_c_true)
chat <- exp(log_chat)
plot(log10(abs(chat - c_true)), type = "l",
xlab = "Iteration", ylab = expression(paste(log[10], "(Absolute Error)")))
[Figure: log10 of the absolute error |ĉ − c| by iteration.]
We can see that by iteration 40 or so, ĉ and c differ only in the 5th decimal place and beyond. By the 380th
iteration, they differ only beyond the 6th decimal place.
With rejection sampling, we ultimately obtain a sample from the target density f . With that sample, we can
create any number of summaries, statistics, or visualizations. However, what if we are interested in the more
narrow problem of computing a mean, such as E_f[h(X)] for some function h : R^k → R? Clearly, this is a
problem that can be solved with rejection sampling: First obtain a sample x_1, . . . , x_n ∼ f and then compute
µ̂_n = (1/n) Σ_{i=1}^{n} h(x_i)
with the obtained sample. As n → ∞ we know by the Law of Large Numbers that µ̂_n → E_f[h(X)]. Further,
the Central Limit Theorem gives us √n (µ̂_n − E_f[h(X)]) → N(0, σ²). So far so good.
However, with rejection sampling, in order to obtain a sample of size n, we must generate, on average, c × n
candidates from g, the candidate density, and then reject about (c − 1) × n of them. If c ≈ 1 then this will
not be too inefficient. But in general, if c is much larger than 1 then we will be generating a lot of candidates
from g and ultimately throwing most of them away.
It’s worth noting that in most cases, the candidates generated from g fall within the domain of f , so that
they are in fact values that could plausibly come from f . They are simply over- or under-represented in the
frequency with which they appear. For example, if g has heavier tails than f , then there will be too many
extreme values generated from g. Rejection sampling simply thins out those extreme values to obtain the
right proportion. But what if we could take those rejected values and, instead of discarding them, simply
downweight or upweight them in a specific way?
Note that we can rewrite the target expectation as follows,

E_f[h(X)] = E_g[ h(X) f(X)/g(X) ].

This suggests that if we draw samples x_1, . . . , x_n ∼ g from the candidate density, we can estimate E_f[h(X)] with

µ̃_n = (1/n) Σ_{i=1}^{n} (f(x_i)/g(x_i)) h(x_i) = (1/n) Σ_{i=1}^{n} w_i h(x_i) ≈ E_f[h(X)]
In the equation above, the values wi = f (xi )/g(xi ) are referred to as the importance weights because they
take each of the candidates xi generated from g and reweight them when taking the average. Note that if
f = g, so that we are simply sampling from the target density, then this estimator is just the sample mean of
the h(xi )s. The estimator µ̃n is known as the importance sampling estimator.
When comparing rejection sampling with importance sampling, we can see that
• Rejection sampling samples directly from f and then uses the samples to compute a simple mean
• Importance sampling samples from g and then reweights those samples by f (x)/g(x)
For estimating expectations, one might reasonably believe that the importance sampling approach is more
efficient than the rejection sampling approach because it does not discard any data.
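As a small illustration of the importance sampling estimator (the target, candidate, and the function h are arbitrary choices for this sketch), suppose we want E_f[h(X)] = E[X²] = 1 where f is the standard Normal and the candidate g is the t2 density:
set.seed(2018-06-04)
n <- 10000
x <- rt(n, 2)                 ## samples from the candidate density g
w <- dnorm(x) / dt(x, 2)      ## importance weights f(x)/g(x)
h <- function(x) x^2
sum(w * h(x)) / n             ## importance sampling estimate of E[X^2] = 1
mean(w)                       ## should be close to 1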
In fact, we can see this by writing the rejection sampling estimator of the expectation in a different way. Let
c = sup_{x ∈ X_f} f(x)/g(x). Given a sample x_1, . . . , x_n ∼ g and u_1, . . . , u_n ∼ Unif(0, 1), then

µ̂_n = [ Σ_i 1{ u_i ≤ f(x_i) / (c g(x_i)) } h(x_i) ] / [ Σ_i 1{ u_i ≤ f(x_i) / (c g(x_i)) } ]
What importance sampling does, effectively, is replace the indicator functions in the above expression with
their expectation. So instead of having a hard threshold, where observation xi is either included (accepted) or
not (rejected), importance sampling smooths out the acceptance/rejection process so that every observation
plays some role.
If we take the expectation of the indicator functions above, we get (note that the cs cancel)
µ̃_n = [ Σ_i (f(x_i)/g(x_i)) h(x_i) ] / [ Σ_i f(x_i)/g(x_i) ]
    = [ (1/n) Σ_i (f(x_i)/g(x_i)) h(x_i) ] / [ (1/n) Σ_i f(x_i)/g(x_i) ]
which is roughly equivalent to the importance sampling estimate if we take into account that
(1/n) Σ_{i=1}^{n} f(x_i)/g(x_i) ≈ 1

because

E_g[ f(X)/g(X) ] = ∫ (f(x)/g(x)) g(x) dx = 1
The point of all this is to show that the importance sampling estimator of the mean can be seen as a
“smoothed out” version of the rejection sampling estimator. The advantage of the importance sampling
estimator is that it does not discard any data and thus is more efficient.
Note that we do not need to know the normalizing constants for the target density or the candidate density.
If f* and g* are the unnormalized target and candidate densities, respectively, then we can use the modified
importance sampling estimator,

µ*_n = [ Σ_i (f*(x_i)/g*(x_i)) h(x_i) ] / [ Σ_i f*(x_i)/g*(x_i) ].
An interesting application of importance sampling is the examination of the sensitivity of posterior inferences
with respect to prior specification. Suppose we observe data y with density f(y | θ) and we specify a prior for
θ as π(θ | ψ0), where ψ0 is a hyperparameter. The posterior for θ is thus

p(θ | y, ψ0) ∝ f(y | θ) π(θ | ψ0)

and we would like to compute the posterior mean of θ. If we can draw θ_1, . . . , θ_n, a sample of size n from
p(θ | y, ψ0), then we can estimate the posterior mean with (1/n) Σ_i θ_i. However, this posterior mean is estimated
using a specific hyperparameter ψ0. What if we would like to see what the posterior mean would be for a
different value of ψ? Do we need to draw a new sample of size n? Thankfully, the answer is no. We can
simply take our existing sample θ_1, . . . , θ_n and reweight it to get our new posterior mean under a different
value of ψ.
Given a sample θ_1, . . . , θ_n drawn from p(θ | y, ψ0), we would like to know E[θ | y, ψ] for some ψ ≠ ψ0. The
idea is to treat our original p(θ | y, ψ0) as a "candidate density" from which we have already drawn a large
sample θ_1, . . . , θ_n. Then we want to know the posterior mean of θ under a "target density" p(θ | y, ψ). We can
then write our importance sampling estimator as

Ê[θ | y, ψ] = [ Σ_i θ_i w_i ] / [ Σ_i w_i ],   where   w_i = π(θ_i | ψ) / π(θ_i | ψ0).

In this case, the importance sampling weights are simply the ratio of the prior under ψ to the prior under ψ0.
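A hedged sketch of this reweighting idea for a Normal mean with a Normal prior, where ψ plays the role of the prior standard deviation; the data, prior values, and sample sizes are all illustrative choices, not from the text.
set.seed(2018-06-04)
y <- rnorm(20, mean = 1.5, sd = 1)            ## simulated data, unit variance likelihood
n <- length(y)
psi0 <- 10                                    ## original prior sd (vague prior)
post_var <- 1 / (n + 1 / psi0^2)              ## conjugate posterior variance under psi0
theta <- rnorm(5000, post_var * sum(y), sqrt(post_var))
mean(theta)                                   ## posterior mean under psi0
psi <- 1                                      ## a tighter prior we want to explore
w <- dnorm(theta, 0, psi) / dnorm(theta, 0, psi0)  ## ratio of priors
sum(w * theta) / sum(w)                       ## approximate posterior mean under psi, no new sampling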
6.4.2 Properties of the Importance Sampling Estimator
So far we’ve talked about how to estimate an expectation with respect to an arbitrary target density f using
importance sampling. However, we have not yet discussed the variance of that estimator. An analysis
of the variance of the importance sampling estimator is assisted by the Delta method and by viewing the
importance sampling estimator as a ratio estimator.
Recall that the Delta method states that if Y_n is a k-dimensional random variable with mean µ, g : R^k → R
is differentiable, and further we have

√n (Y_n − µ) →_D N(0, Σ)

as n → ∞, then

√n (g(Y_n) − g(µ)) →_D N(0, g′(µ)′ Σ g′(µ))

as n → ∞.
For the importance sampling estimator, we have f is the target density, g is the candidate density, and
x1 , . . . , xn are samples from g. The estimator of Ef [h(X)] is written as
[ (1/n) Σ_i h(x_i) w(x_i) ] / [ (1/n) Σ_i w(x_i) ]

where

w(x_i) = f(x_i) / g(x_i)
are the importance sampling weights.
If we let g((a, b)) = a/b, then g′((a, b)) = (1/b, −a/b²). If we define the vector Y_n = ( (1/n) Σ h(x_i) w_i, (1/n) Σ w_i ),
then the importance sampling estimator is simply g(Y_n). Furthermore, we have

E_g[Y_n] = E_g[ ( (1/n) Σ h(x_i) w(x_i), (1/n) Σ w(x_i) ) ] = (E_f[h(X)], 1) = µ

and

Σ = n Var(Y_n) = [ Var(h(X)w(X))          Cov(h(X)w(X), w(X))
                   Cov(h(X)w(X), w(X))    Var(w(X))           ]
Note that the above quantity can be estimated consistently using the sample versions of each quantity in the
matrix.
Therefore, the variance of the importance sampling estimator of E_f[h(X)] is g′(Y_n)′ Σ g′(Y_n), which we can
expand to

n ( Σ h(x_i)w(x_i) / Σ w(x_i) )² ( Σ h(x_i)² w(x_i)² / (Σ h(x_i)w(x_i))²
    − 2 Σ h(x_i)w(x_i)² / [ (Σ h(x_i)w(x_i)) (Σ w(x_i)) ]
    + Σ w(x_i)² / (Σ w(x_i))² )
Given this, for the importance sampling estimator, we need the following to be true,
E_g[ h(X)² w(X)² ] = E_g[ h(X)² (f(X)/g(X))² ] < ∞,

E_g[ w(X)² ] = E_g[ (f(X)/g(X))² ] < ∞,

and

E_g[ h(X) w(X)² ] = E_g[ h(X) (f(X)/g(X))² ] < ∞.

All of the above conditions are true if the conditions for rejection sampling are satisfied, that is, if
sup_{x ∈ X_f} f(x)/g(x) < ∞.
7 Markov Chain Monte Carlo
The phrase “Markov chain Monte Carlo” encompasses a broad array of techniques that have in common a
few key ideas. The setup for all the techniques that we will discuss in this book is as follows:
1. We want to sample from some complicated density or probability mass function π. Often, this density
is the result of a Bayesian computation, so it can be interpreted as a posterior density. The presumption
here is that we can evaluate π but we cannot sample from it.
2. We know that certain stochastic processes called Markov chains will converge to a stationary distribution
(if it exists and if specific conditions are satisfied). Simulating from such a Markov chain for a long
enough time will eventually give us a sample from the chain’s stationary distribution.
3. Given the functional form of the density π, we want to construct a Markov chain that has π as its
stationary distribution.
4. We want to sample values from the Markov chain such that the sequence of values {x_n} generated by
the chain converges in distribution to the density π.
In order for all these ideas to make sense, we need to first go through some background on Markov chains.
The rest of this chapter will be spent defining all these terms, the conditions under which they make sense,
and giving examples of how they can be implemented in practice.
7.1 Background
There are many introductions and overviews of Markov chain Monte Carlo out there and a quick web search
will reveal them. A decent introductory book is Markov Chain Monte Carlo in Practice by Gilks, Richardson,
and Spiegelhalter, but there are many others. In this section we will give just the briefest of overviews to get
things started.
A Markov chain is a stochastic process that evolves over time by transitioning into different states. The
sequence of states is denoted by the collection {Xi } and the transition between states is random, following
the rule
P(X_t | X_{t−1}, X_{t−2}, . . . , X_0) = P(X_t | X_{t−1})
(The above notation is for discrete distributions; we will get to the continuous case later.)
This relationship means that the probability distribution of the process at time t, given all of the previous
values of the chain, is equal to the probability distribution given just the previous value (this is also known as
the Markov property). So in determining the sequence of values that the chain takes, we can determine the
distribution of our next value given just our current value.
The collection of states that a Markov chain can visit is called the state space and the quantity that governs
the probability that the chain moves from one state to another state is the transition kernel or transition
matrix.
A classic example of a Markov chain has a state space and transition probabilities that can be drawn as
follows.
Here, we have three states in the state space and the arrows indicate to which state you can travel given your
current state. The fraction next to each arrow gives the transition probability of moving to a
given state. Note that in this Markov chain, you can only ever travel to two possible states no matter what
your current state happens to be.
Figure 9: Simple Markov chain
We can write the transition probabilities in a matrix, a.k.a. the transition matrix, as
P = [ 1/2  1/2   0
      1/2   0   1/2
       0   1/2  1/2 ]
With the matrix notation, the interpretation of the entries is that if we are currently on iteration n of the
chain, then
P(X_{n+1} = j | X_n = i) = P_{ij}
From the matrix, we can see that the probability of going from state 1 to state 3 is 0, because the (1, 3) entry
of the matrix is 0.
Suppose we start this Markov chain in state 3 with probability 1, so the initial probability distribution over
the three states is fi0 = (0, 0, 1). What is the probability distribution of the states after one iteration? The
way the transition matrix works, we can write
π_1 = π_0 P = (0, 1/2, 1/2)
A quick check of the diagram above confirms that if we start in state 3, then after one iteration we can only
be in either state 3 or in state 2. So state 1 gets 0 probability.
What is the probability distribution over the states after n iterations? We can continue the process above to
get
π_n = π_0 (P P P · · · P) = π_0 P^n,  where the product contains n factors of P.
For example, after five iterations, starting in state 3, we would have π_5 = (0.3125, 0.34375, 0.34375).
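We can check this calculation directly in R by iterating the matrix multiplication; this is just a sketch of the computation described above.
P <- matrix(c(1/2, 1/2, 0,
              1/2, 0, 1/2,
              0, 1/2, 1/2), nrow = 3, byrow = TRUE)
pi_n <- c(0, 0, 1)        ## start in state 3 with probability 1
for(i in 1:5)
    pi_n <- pi_n %*% P    ## one application of the transition matrix
pi_n                      ## (0.3125, 0.34375, 0.34375)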
For a Markov chain with a discrete state space and transition matrix P, let π* be such that π* P = π*. Then
π* is a stationary distribution of the Markov chain and the chain is said to be stationary if it reaches this
distribution.
The basic limit theorem for Markov chains says that, under a specific set of assumptions that we will detail
below, we have

‖π* − π_n‖ → 0

as n → ∞, where ‖ · ‖ is the total variation distance between the two densities. Therefore, no matter where
we start the Markov chain (π_0), π_n will eventually approach the stationary distribution. Another way to
think of this is that

lim_{n→∞} π_n(i) = π*(i)
This chain clearly visits the odd numbers when n is odd and the even numbers when n is even.
Furthermore, if we are at any state i, we cannot revisit state i except in a number of steps that is
a multiple of 2. This chain therefore has a period of 2. If a chain has a period of 1 it is aperiodic
(otherwise it is periodic).
The sequence of states moving in the "forward" direction (with respect to time) is equal in distribution
to the sequence of states moving in the "backward" direction. Further, the definition above implies that
(X_0, X_1) =_D (X_1, X_0), which further implies that X_0 =_D X_1. Because X_1 is equal in distribution to X_0, this
implies that π_1 = π_0. However, because π_1 = π_0 P, where P is the transition matrix, this means that π_0 is
the stationary distribution, which we will now refer to as π. Therefore, a time reversible Markov chain is
stationary.
In addition to stationarity, the time reversibility property tells us that for all i, j, the following are all
equivalent:
(X_0, X_1) =_D (X_1, X_0)
P(X_0 = i, X_1 = j) = P(X_1 = i, X_0 = j)
P(X_0 = i) P(X_1 = j | X_0 = i) = P(X_0 = j) P(X_1 = i | X_0 = j)
The last line can also be written as
π(i) P(i, j) = π(j) P(j, i)
which are called the local balance equations. A key property that we will exploit later is that if the local
balance equations hold for a transition matrix P and distribution π, then π is the stationary distribution of a
chain governed by the transition matrix P.
Why is all this important? Time reversibility is relevant to analyses making use of Markov chain Monte
Carlo because it allows us to find a way to construct a proper Markov chain from which to simulate. In
most MCMC applications, we have no trouble identifying the stationary distribution. We already know the
stationary distribution because it is usually a posterior density resulting from a complex Bayesian calculation.
Our problem is that we cannot simulate from the distribution. So the problem is constructing a Markov
chain that leads to a given stationary distribution.
Time reversibility gives us a way to construct a Markov chain that converges to a given stationary distribution.
As long as we can show that a Markov chain with a given transition kernel/matrix satisfies the local balance
equations with respect to the stationary distribution, we can know that the chain will converge to the
stationary distribution.
7.1.4 Summary
7.2 Metropolis-Hastings
Let q(Y | X) be a transition density for p-dimensional X and Y from which we can easily simulate and let
π(X) be our target density (i.e. the stationary distribution that our Markov chain will eventually converge
to). The Metropolis-Hastings procedure is an iterative algorithm where at each stage, there are three steps.
Suppose we are currently in the state x and we want to know how to move to the next state in the state
space.
1. Simulate a candidate value y ∼ q(Y | x). Note that the candidate value depends on our current state x.
2. Let

α(y | x) = min{ π(y) q(x | y) / [π(x) q(y | x)], 1 }

α(y | x) is referred to as the acceptance ratio.
3. Simulate u ∼ Unif(0, 1). If u ≤ α(y | x), then the next state is equal to y. Otherwise, the next state is
still x (we stay in the same place).
This three step process represents the transition kernel for our Markov chain from which we are simulating.
Recall that the hope is that our Markov chain will, after many simulations, converge to the stationary
distribution. Eventually, we can be reasonably sure that the samples that we draw from this process are
draws from the stationary distribution, i.e. π(X).
Why does this work? Recall that we need to have the Markov chain generated by this transition kernel be
time reversible. If K(y | x) is the transition kernel embodied by the three steps above, then we need to show
that

K(y | x) π(x) = K(x | y) π(y).

The function K(y | x) can be decomposed into two parts and we can treat each separately. First we can show
that

α(y | x) q(y | x) π(x) = min{ π(y) q(x | y) / [π(x) q(y | x)], 1 } q(y | x) π(x)
                       = min{ π(y) q(x | y), π(x) q(y | x) }
                       = min{ π(x) q(y | x), π(y) q(x | y) }
                       = α(x | y) q(x | y) π(y)
The last line is trivially true because on both sides of the equation we are taking the minimum of the same
quantities.
For the second part, let r(x) = 1 − ∫ α(s | x) q(s | x) ds. Then we need to show that

1{y = x} r(x) π(x) = 1{x = y} r(y) π(y)
over the set where y = x. But this is trivially true because if y = x then every quantity in the above equation
is the same regardless of whether an x or a y appears in it.
In the random walk Metropolis-Hastings algorithm, the proposal is of the form

y = x + ε

where ε ∼ g and g is a probability density symmetric about 0. Given this definition, we have

q(y | x) = g(ε) = g(y − x)

and

q(x | y) = g(x − y) = g(−ε) = g(ε).

Because q(y | x) is symmetric in x and y, the Metropolis-Hastings acceptance ratio α(y | x) simplifies to
α(y | x) = min{ π(y) q(x | y) / [π(x) q(y | x)], 1 }
         = min{ π(y) / π(x), 1 }
Given our current state x, the random walk Metropolis-Hastings algorithm proceeds as follows:
1. Simulate ε ∼ g and let y = x + ε.
2. Compute α(y | x) = min{ π(y)/π(x), 1 }.
3. Simulate u ∼ Unif(0, 1). If u ≤ α(y | x) then accept y as the next state, otherwise stay at x.
It should be noted that this form of the Metropolis-Hastings algorithm was the original form of the Metropolis
algorithm.
We can see what this looks like by running the iteration many times. In the code below, the target density is a
standard Normal and the proposal is a Uniform(−δ, δ) random walk.
delta <- 0.5
N <- 500
x <- numeric(N)
x[1] <- 0
set.seed(2018-06-04)
for(i in 2:N) {
    eps <- runif(1, -delta, delta)
    y <- x[i-1] + eps
    alpha <- min(dnorm(y, log = TRUE) - dnorm(x[i-1], log = TRUE), 0)
    u <- runif(1, 0, 1)
    if(log(u) <= alpha)
        x[i] <- y
    else
        x[i] <- x[i-1]
}
summary(x)
[Figure: Histogram of x, the samples from the random walk Metropolis sampler.]
We can also look at a trace plot of the samples to see how the chain moved around.
library(ggplot2)
qplot(1:N, x, geom = "line", xlab = "Iteration")
[Figure: Trace plot of the sampled values x by iteration.]
The independence Metropolis algorithm defines a transition density as q(y | x) = q(y). In other words, the
candidate proposals do not depend on the current state x. Otherwise, the algorithm works the same as the
original Metropolis-Hastings algorithm, with a modified acceptance ratio,
α(y | x) = min{ π(y) q(x) / [π(x) q(y)], 1 }
The independence sampler seems to work well in situations where rejection sampling might be reasonable,
i.e. relatively low-dimensional problems. In particular, it has the interesting property that if

C = sup_x π(x) / q(x) < ∞
then

‖π_n − π‖ ≤ k ρ^n

for 0 < ρ < 1 and some constant k > 0. So if we have the same conditions under which rejection sampling is
allowed, then we can prove that convergence of the chain to the stationary distribution is geometric with rate
ρ. The value of ρ depends on C; if C is close to 1, then ρ will be small. This is an interesting property of this
sampler, but it is largely of theoretical interest.
In the slice sampler, given the current state x, we first draw an auxiliary variable

y ∼ Unif(0, π(x)).

Note that

π(x) = ∫ 1{0 ≤ y ≤ π(x)} dy.

Therefore, f(x, y) = 1{0 ≤ y ≤ π(x)} is a joint density. The slice sampler method creates an auxiliary
variable y and then integrates it out later to give back our original target density π(x).
As depicted in the illustration, the slice sampler can be useful for densities that might have multiple modes,
where it's easy to get stuck around one mode. The sampling of x* allows one to jump easily between modes.
However, sampling x* may be difficult depending on the complexity of π(x) and ultimately may not be worth
the effort. If π(x) is very wiggly or is high-dimensional, then calculating the region {x : π(x) ≥ y} will be
very difficult.
The hit and run sampler combines ideas from line search optimization methods with MCMC sampling. Here,
suppose we have the current state x in p-dimensions and we want to propose a new state. Let e be a random
p-dimensional vector that indicates a random direction in which to travel. Then construct the density

p(r) ∝ π(x + r e)

where r is a scalar (and hence p(r) is a 1-dimensional density). From this density, sample a value for r and
set the proposal to be

x* = x + r e.
This process has the advantage that it chooses more directions than a typical Gibbs sampler (see below) and
does not require a multi-dimensional proposal distribution. However, sampling from the density p(r) may not
be straightforward, as there is no guaranteed closed form solution.
Figure 10: Slice Sampler
7.2.5 Single Component Metropolis-Hastings
The standard Metropolis algorithm updates the entire parameter vector at once with a single step. Hence, a
p-dimensional parameter vector must have a p-dimensional proposal distribution q. However, it is sometimes
simpler to update individual components of the parameter vector one at a time, in an algorithm known as
single component Metropolis-Hastings (SCMH).
Let x^(n) = (x_1^(n), x_2^(n), . . . , x_p^(n)) be a p-dimensional vector representing our parameters of interest at iteration
n. Define

x_{−i} = (x_1, . . . , x_{i−1}, x_{i+1}, . . . , x_p)

as the vector x with the ith component removed. The SCMH algorithm updates each parameter in x one at
a time. All that is needed to update the ith component of x^(n) is a proposal distribution q(y | x_i^(n), x_{−i}^(n)) for
proposing a new value of x_i given the current value of x_i and all the other components.
At iteration n, SCMH updates the ith component via the following steps.
1. Sample y_i ∼ q_i(y | x_i^(n), x_{−i}^(n)) as a proposal for component i.
2. Let

α(y_i | x^(n)) = min{ [ π(y_i | x_{−i}^(n)) q(x_i^(n) | y_i, x_{−i}^(n)) ] / [ π(x_i^(n) | x_{−i}^(n)) q(y_i | x_i^(n), x_{−i}^(n)) ], 1 }

3. Accept y_i as the new value of the ith component with probability α(y_i | x^(n)); otherwise retain x_i^(n).
The attraction of an algorithm like single component Metropolis-Hastings is that it converts a p-dimensional
problem into p separate 1-dimensional problems, each of which is likely simple to solve. This advantage is
not unlike that seen with coordinate descent algorithms discussed previously. SCMH can however be very
exploratory and may not be efficient in exploring the parameter space.
Gibbs sampling is a variant of SCMH that uses full conditional distributions for the proposal distribution for
each component. Given a target density π(x) = π(x_1, . . . , x_p), we cycle through sampling from π(x_i | x_{−i}) to
update the ith component. Rather than go through an accept/reject step as with Metropolis-Hastings, we
always accept, for reasons that will be detailed below.
For example, with a three-component density π(x, y, z), the full conditional distributions associated with this
density are

π(x | y, z),
π(y | x, z),

and

π(z | x, y).
If our current state is (xn , yn , zn ) at the nth iteration, then we update our parameter values with the following
steps.
1. Sample x_{n+1} ∼ π(x | y_n, z_n)
2. Sample y_{n+1} ∼ π(y | x_{n+1}, z_n)
3. Sample z_{n+1} ∼ π(z | x_{n+1}, y_{n+1})
At each step of the sampling we use the most recent values of all the other components in the full conditional
distribution. In a landmark paper, Geman and Geman showed that if p(x_n, y_n, z_n) is the density of our
parameters at the nth iteration, then as n → ∞,

p(x_n, y_n, z_n) → π(x, y, z)
p(x_n) → π(x)
p(y_n) → π(y)
p(z_n) → π(z)
Suppose we want to simulate from a bivariate Normal distribution with mean µ = (µ_1, µ_2) and covariance
matrix

Σ = [ σ_1²        ρ σ_1 σ_2
      ρ σ_1 σ_2   σ_2²      ]

where ρ is the correlation between the two components.
Although there are exact ways to do this, we can make use of Gibbs sampling to simulate a Markov chain
that will converge to a bivariate Normal. At the nth iteration with state vector (x_n, y_n), we can update our
state with the following steps.
1. Sample x_{n+1} ∼ N( µ_1 + ρ (σ_1/σ_2)(y_n − µ_2), σ_1² (1 − ρ²) )
2. Sample y_{n+1} ∼ N( µ_2 + ρ (σ_2/σ_1)(x_{n+1} − µ_1), σ_2² (1 − ρ²) )
If fl is close to 1, then this algorithm will likely take a while to converge. But it is simple to simulate and
demonstrates the mechanics of Gibbs sampling.
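A minimal sketch of this two-step Gibbs sampler in R; the parameter values, starting values, and chain length are arbitrary choices.
set.seed(2018-06-04)
mu1 <- 1; mu2 <- 2               ## means
s1 <- 1; s2 <- 2                 ## standard deviations
rho <- 0.8                       ## correlation
N <- 1000
x <- y <- numeric(N)
x[1] <- mu1; y[1] <- mu2         ## starting values
for(i in 2:N) {
    x[i] <- rnorm(1, mu1 + rho * (s1 / s2) * (y[i-1] - mu2),
                  s1 * sqrt(1 - rho^2))
    y[i] <- rnorm(1, mu2 + rho * (s2 / s1) * (x[i] - mu1),
                  s2 * sqrt(1 - rho^2))
}
cor(x, y)                        ## should be near rho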
Suppose we observe y_1, . . . , y_n ∼ N(µ, τ^{−1}) independently, where τ is the precision of the Normal distribution. We can
use independent priors for the two parameters,

µ ∼ N(0, w^{−1})
τ ∼ Gamma(α, β).
Because the priors are independent here, we will have to use Gibbs sampling to sample from the posterior
distribution of µ and τ.
Recall that with Bayes' Theorem, the joint posterior of µ and τ is

π(µ, τ | y) ∝ π(µ, τ, y) = f(y | µ, τ) p(µ) p(τ)
The full conditional distribution for µ is then

p(µ | τ, y) ∝ f(y | µ, τ) p(µ) p(τ)
            ∝ f(y | µ, τ) p(µ)
            ∝ exp( −(τ/2) Σ (y_i − µ)² ) exp( −(w/2) µ² )
            ∝ exp( −[ ((nτ + w)/2) µ² − τ (Σ y_i) µ ] )
            = N( τ Σ y_i / (nτ + w), 1 / (nτ + w) )
and

p(τ | µ, y) ∝ f(y | µ, τ) p(µ) p(τ)
            ∝ f(y | µ, τ) p(τ)
            ∝ τ^{n/2} exp( −(τ/2) Σ (y_i − µ)² ) τ^{α−1} exp(−τ β)
            = τ^{α−1+n/2} exp( −τ [ β + (1/2) Σ (y_i − µ)² ] )
            = Gamma( α + n/2, β + (1/2) Σ (y_i − µ)² )
The Gibbs sampler therefore alternates between sampling from a Normal distribution and a Gamma
distribution. In this case, the priors were chosen so that the full conditional distributions could be sampled
in closed form.
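A minimal sketch of this Gibbs sampler in R; the simulated data, hyperparameters, and chain length are arbitrary choices for illustration.
set.seed(2018-06-04)
y <- rnorm(50, mean = 1, sd = 2)           ## some example data
n <- length(y)
w <- 0.01; alpha <- 1; beta <- 1           ## hyperparameters (assumed values)
N <- 1000
mu <- tau <- numeric(N)
mu[1] <- mean(y); tau[1] <- 1 / var(y)     ## starting values
for(i in 2:N) {
    ## full conditional for mu given tau
    mu[i] <- rnorm(1, tau[i-1] * sum(y) / (n * tau[i-1] + w),
                   sqrt(1 / (n * tau[i-1] + w)))
    ## full conditional for tau given mu
    tau[i] <- rgamma(1, alpha + n / 2,
                     rate = beta + 0.5 * sum((y - mu[i])^2))
}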
Gibbs sampling has a fairly straightforward connection to the single component Metropolis-Hastings algorithm
described earlier. We can reframe the Gibbs sampling algorithm for component i at iteration n as
1. Sample y_i ∼ q_i(y | x_i^(n), x_{−i}^(n)) = π(y | x_{−i}^(n)). Here, the proposal distribution is simply the full conditional
distribution of x_i given x_{−i}.
2. The acceptance probability is then

α(y_i | x_i, x_{−i}) = min{ [ π(y_i | x_{−i}) π(x_i | x_{−i}) ] / [ π(x_i | x_{−i}) π(y_i | x_{−i}) ], 1 } = 1
The Gibbs sampling algorithm is like the SCMH algorithm but it always accepts the proposal.
Given the relationship between Gibbs sampling and SCMH, we can use this to extend the basic Gibbs
algorithm. In some cases, we will not be able to sample directly from the full conditional distribution of a
component. In those cases, we can substitute a standard Metropolis-Hastings step with a proposal/acceptance
procedure.
For components that require a Metropolis-Hastings step, we will need to come up with a proposal density for
those components. In this case, the target density is simply the full conditional of component i given the
other components. For component i then, the algorithm proceeds as
1. Sample y_i ∼ q(y | x)
2. Compute the acceptance probability

α(y_i | x) = min{ [ π(y_i | x_{−i}) q(x_i | x) ] / [ π(x_i | x_{−i}) q(y_i | x) ], 1 }

3. Accept y_i as the new value of the ith component with probability α(y_i | x); otherwise retain x_i.
7.3.5 Reparametrization
In some Metropolis-Hastings or hybrid Gibbs sampling problems we may have parameters where it is easier
to sample from a full conditional of a transformed version of the parameter. For example, we may need
to sample from the full conditional p(λ | ·) of a parameter that only takes values between 0 and 1. Doing
something like a random walk proposal can be tricky given the restricted range.
One solution is to transform the parameter to a space that has an infinite range. For a parameter λ ∈ (0, 1)
we can use the logit transformation to map it to (−∞, ∞), i.e.

z = logit(λ) = log( λ / (1 − λ) ).
Then we can do the proposal and acceptance steps on the transformed parameter z, which has an infinite
domain. However, we need to be careful that the acceptance ratio is calculated properly to account for
the nonlinear transformation of λ. In this case we need to compute the determinant of the Jacobian of the
transformation mapping z back to λ.
The algorithm at iteration n in the transformed space would work as follows.
1. Sample z* ∼ g(z | z_n), where z_n = logit(λ_n) and g is the proposal density.
2. Compute

α(z* | z_n) = min{ [ p(logit^{−1}(z*) | ·) g(z_n | z*) |J(z*)| ] / [ p(logit^{−1}(z_n) | ·) g(z* | z_n) |J(z_n)| ], 1 }

where |J(z)| is the determinant of the Jacobian of the transformation that maps from z ↦ λ, i.e. the
function

logit^{−1}(z) = exp(z) / (1 + exp(z)).
In R, we can easily compute this Jacobian (and that for any other transformation) with the deriv() function.
Jacobian <- local({
J <- deriv(~ exp(x) / (1 + exp(x)), "x",
function.arg = TRUE)
function(x) {
val <- J(x)
drop(attr(val, "gradient"))
}
})
Jacobian(0)
## x
## 0.25
Because the transformation here is for a single parameter, the determinant is straightforward. However, for a
multi-dimensional mapping, we would need an extra step to compute the determinant. The det() function
can be used for this purpose, but it almost always makes sense to work on the log scale, which the related
determinant() function returns directly via its logarithm = TRUE argument.
As an example where a Metropolis step is needed inside a Gibbs sampler, consider the hierarchical model

y_ij ∼ Bernoulli(λ_ij)
logit(λ_ij) = α_i
α_i ∼ N(µ, σ²)
µ ∼ N(0, D)
σ² ∼ InverseGamma(a, b)

where i = 1, . . . , n and j = 1, . . . , G.
For this example, assume that the hyperparameters a, b, and D are known. The goal is to sample from the
posterior of µ and σ². The joint posterior of the parameters given the data is

π(µ, σ², α | y) ∝ [ Π_{i=1}^{n} Π_{j=1}^{G} p(y_ij | logit^{−1}(α_i)) ] [ Π_{i=1}^{n} φ(α_i | µ, σ²) ] φ(µ | 0, D) g(σ²)

where p() is the Bernoulli density, φ() is the Normal density, and g() is the inverse gamma density.
To implement the Gibbs sampler, we need to cycle through three classes of full conditional distributions.
First is the full conditional for σ², which can be written in closed form given the prior. In order to compute
the full conditional, we simply select everything from the full posterior that has a σ² in it:

p(σ² | ·) ∝ [ Π_{i=1}^{n} φ(α_i | µ, σ²) ] g(σ²)
          = InverseGamma( a + n/2, b + (1/2) Σ (α_i − µ)² )
Similarly, we can pick out all the components of the full posterior that have a µ in them, and the full conditional
distribution for µ is

p(µ | ·) ∝ [ Π_{i=1}^{n} φ(α_i | µ, σ²) ] φ(µ | 0, D)
         = N( (D / (D + σ²/n)) ᾱ, (D σ²/n) / (D + σ²/n) )

where ᾱ = (1/n) Σ α_i.
Finally, the full conditionals for the α_i can be computed independently because the α_i s are assumed to be
independent:

p(α_i | ·) ∝ [ Π_{j=1}^{G} p(y_ij | logit^{−1}(α_i)) ] φ(α_i | µ, σ²).

Unfortunately, there is no way to simplify this further with the combination of the Bernoulli likelihood and
the Normal random effects distribution. Therefore, some proposal/acceptance step will have to be run to
sample from the α_i s.
Once you’ve got your simulated Markov chain running, the question arises regarding when it should be
stopped. On an infinite timeline, we know from theory that the samples drawn from the chain will come from
the target density. On every other timeline, we must design some rule for stopping the chain in order to get
things like parameter estimates or posterior distributions.
Determining how and when to stop a chain depends a bit on what exactly you are trying to do. In many
cases, regardless of your inferential philosophy, you are trying to obtain an estimate of a parameter in a
model. Typically, we estimate these parameters by using a summary statistic like the mean of the posterior
distribution. We calculate this summary statistic by drawing samples from the posterior and computing the
arithmetic mean.
Hence, we are estimating a parameter and in doing so we must worry about the usual things we worry
about when estimating a parameter. In particular, we must worry about the uncertainty introduced by our
procedure, which in this case is the Markov chain Monte Carlo sampling procedure. If our chain ran to
infinity, we would have no uncertainty due to MCMC sampling (we would still have other kinds of uncertainty,
such as from our finite dataset). But because our chain will not run to infinity, stopping the chain can be
thought of as answering the question of how much Monte Carlo uncertainty is acceptable.
One way to measure the amount of uncertainty introduced through MCMC sampling is with Monte Carlo
standard errors. The thought experiment here is, if we were to repeatedly run our MCMC sampler (with
different random number generator seeds!) for a fixed N number of iterations, how much variability would
we expect to see in our estimate of the parameter? Monte Carlo standard errors should give us a sense of this
variability. If the standard errors are too big, we can run the sampler for longer. Exactly how long we will
need to run the sampler to achieve a given standard error will depend on the efficiency and mixing of the
sampler.
One way to compute Monte Carlo standard errors is with the method of batch means. The idea here is we
divide our long MCMC sampler chain into equal size segments and compute our summary statistic on each
segment. If the segments are of the proper size, we can think of them as “independent replicates”, even
though individual samples of the MCMC sampler will not be independent of each other in general. From
these replicates, we compute an estimate of variability from the given chain.
More specifically, suppose x1 , x2 , x3 , . . . are samples from an MCMC sampler that we have run for N iterations
and we want to estimate E[h(X)] where the expectation is taken with respect to some posterior distribution.
We first decide on a batch size K and let’s assume that N/K = M where M is an integer. We then divide
the chain into segments x1 , . . . , xK , xK+1 , . . . , x2K , etc. From here, we can compute our segment summary
statistics,
b_1 = (1/K) Σ_{i=1}^{K} h(x_i)

b_2 = (1/K) Σ_{i=K+1}^{2K} h(x_i)

...

b_M = (1/K) Σ_{i=(M−1)K+1}^{MK} h(x_i)

b̄ = (1/M) Σ_{i=1}^{M} b_i = (1/(KM)) Σ_{i=1}^{KM} h(x_i) → E[h(X)]

as N → ∞.
Once we have the batch means, we can compute a standard error based on the assumed ergodicity of the
chain, which gives us,
√M ( (b̄ − E[h(X)]) / s ) → N(0, 1).
The batch means standard error is the square root of

s² = (K/M) Σ_{i=1}^{M} (b_i − b̄)².
For a specific implementation of the batch means procedure, one can use the method of Jones, Haran, Caffo,
and Neath. The source code for this procedure is available at Murali Haran’s web site. Jones, et al recommend
a batch size of N 1/2 (or the smallest integer closest to that) and that is the default setting for their software.
If the chain is mixing relatively well, a batch size of N 1/3 can be used and may be more efficient.
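Here is a rough sketch of a batch means calculation (not the Jones et al. implementation); it splits the chain into batches, computes the batch means, and reports a simple standard error obtained by treating the batch means as approximately independent replicates. The function name and defaults are my own choices.
batch_means <- function(x, h = identity, K = floor(sqrt(length(x)))) {
    N <- length(x)
    M <- floor(N / K)                  ## number of batches
    x <- x[seq_len(M * K)]             ## drop any leftover iterations
    b <- vapply(seq_len(M), function(j) {
        mean(h(x[((j - 1) * K + 1):(j * K)]))
    }, numeric(1))
    list(est = mean(b), se = sd(b) / sqrt(M))
}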
It’s worth noting that the Monte Carlo standard error is a quantity with units attached to it. Therefore,
determining when the standard error is “small enough” will require a certain understanding of the context in
which the problem is being addressed. No universal recommendation can be made on how small is small
enough. Some view this property as a downside, preferring something that is “unit free”, but I see it as an
important reminder that these parameters that we are estimating come from the real world and are addressing
real problems.
Another approach to monitoring the convergence of a MCMC sampler is to think about what we might
expect when a chain has “converged”. If we were to start multiple parallel chains in many different starting
values, the theory claims that they should all eventually converge to the stationary distribution. So after
some amount of time, it should be impossible to distinguish between the multiple chains. They should all
“look” like the stationary distribution. One way to assess this is to compare the variation between chains to
the variation within the chains. If all the chains are “the same”, then the between chain variation should be
close to zero.
Let x_1^(j), x_2^(j), . . . be samples from the jth Markov chain and suppose there are J chains run in parallel with
different starting values.
1. For each chain, first discard D values as "burn-in" and keep the remaining L values, x_D^(j), x_{D+1}^(j), . . . , x_{D+L−1}^(j).
For example, you might set D = L.
2. Calculate

x̄_j = (1/L) Σ_{t=1}^{L} x_t^(j)   (chain mean)

x̄_· = (1/J) Σ_{j=1}^{J} x̄_j   (grand mean)

B = (L/(J − 1)) Σ_{j=1}^{J} (x̄_j − x̄_·)²   (between chain variance)

s_j² = (1/(L − 1)) Σ_{t=1}^{L} (x_t^(j) − x̄_j)²   (within chain variance)

W = (1/J) Σ_{j=1}^{J} s_j²
Simulated annealing is a technique for minimizing functions that makes use of the ideas from Markov chain
Monte Carlo samplers, which is why it is in this section of the book. It is a particularly useful method for
functions that are very misbehaved and wiggly; the kinds of functions that Newton-style optimizers tend to
be bad at minimizing.
Suppose we want to find the global minimum of a function h(θ), where θ is a vector of parameters in a space
S. Ideally, we would simulate a Markov chain whose target density π(θ) was a point mass on the global
minimum. This would be great if we could do it! But then we wouldn't have a problem. The idea with
simulated annealing is that we build successive approximations to π(θ) until we have an approximation that
is very close to the target density.
Let S* = {θ ∈ S : h(θ) = min_θ h(θ)}. Then define π(θ) ∝ 1 for all θ ∈ S* and π(θ) = 0 for all θ ∉ S*. In
other words, π(θ) is the uniform distribution over all the global minimizers. The ultimate goal is to find some
way to sample from π(θ).
We will begin by building an approximate density called π_T(θ), where

π_T(θ) ∝ exp(−h(θ)/T)

and where T is called the "temperature". This density has two key properties:
1. As T → ∞, π_T(θ) approaches the uniform density;
2. As T ↓ 0, π_T(θ) → π(θ).
The aim is then to draw many samples from π_T(θ), initially with a large value of T, and to lower T towards
0 slowly. As we lower T, the density π_T(θ) will become more and more concentrated around the minima of
h(θ).
Given that π_T(θ) → π(θ) as T ↓ 0, why not just start with T really small? The problem is that if T is small
from the start, then the sampler will quickly get "stuck" in whatever local mode is near our initial value.
Once there, it will not be able to jump out and go to another mode. So the general strategy is to start with
a large T so that the sample space can be adequately explored and so we don't get stuck in a local minimum.
Then, as we lower the temperature, we can be sure that we have explored as much of the space as feasible.
The sampling procedure is then to first choose a symmetric proposal density q(· | θ). Then, if we are at
iteration n with state θ_n,
1. Sample θ* ∼ q(θ | θ_n).
2. Sample U ∼ Unif(0, 1).
3. Compute

α(θ* | θ_n) = min{ exp(−h(θ*)/T) / exp(−h(θ_n)/T), 1 }
            = min( exp(−(h(θ*) − h(θ_n))/T), 1 )

4. If U ≤ α(θ* | θ_n), set θ_{n+1} = θ*; otherwise set θ_{n+1} = θ_n.
A commonly used cooling schedule for lowering the temperature across iterations is

T_n = a / log(n + b)

for some a, b > 0. Because of the logarithm in the denominator, this cooling schedule is excruciatingly slow,
making the simulated annealing algorithm a very slow algorithm to converge.
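A minimal sketch of simulated annealing for a wiggly one-dimensional function; the objective function, proposal scale, schedule constants, and number of iterations are all illustrative choices, not from the text.
h <- function(theta) (theta - 2)^2 + 3 * sin(5 * theta)   ## a wiggly objective to minimize
set.seed(2018-06-04)
N <- 5000
theta <- numeric(N)
theta[1] <- 10                           ## starting value
a <- 10; b <- 1                          ## cooling schedule constants
for(n in 1:(N - 1)) {
    Tn <- a / log(n + b)                 ## temperature at iteration n
    prop <- theta[n] + rnorm(1, sd = 0.5)            ## symmetric random walk proposal
    alpha <- min(exp(-(h(prop) - h(theta[n])) / Tn), 1)
    if(runif(1) <= alpha)
        theta[n + 1] <- prop
    else
        theta[n + 1] <- theta[n]
}
theta[which.min(h(theta))]               ## best value visited by the sampler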