Bayesian Inference
Class Notes
Manuel Arellano
March 8, 2016
1 Introduction
Bayesian methods have traditionally had limited influence in empirical economics, but they have
become increasingly important with the popularization of computer-intensive stochastic simulation
algorithms in the 1990s. This is particularly so in macroeconomics, where applications of Bayesian
inference include vector autoregressions (VARs) and dynamic stochastic general equilibrium (DSGE)
models. Bayesian approaches are also attractive in models with many parameters, such as panel
data models with individual heterogeneity and flexible nonlinear regression models. Examples include
discrete choice models of consumer demand in the fields of industrial organization and marketing.
An empirical study uses data to learn about quantities of interest (parameters). A likelihood
function or some of its features specify the information in the data about those quantities. Such
specification typically involves the use of a priori information in the form of parametric or functional
restrictions. In the Bayesian approach to inference, one assigns a probability measure not only to the
sample space but also to the parameter space. Specifying a probability distribution over potential
parameter values is the conventional way of modelling uncertainty in decision-making, and offers a
systematic way of incorporating uncertain prior information into statistical procedures.
Outline The following section introduces the Bayesian way of combining a prior distribution with
the likelihood of the data to generate point and interval estimates. This is followed by some comments
on the specification of prior distributions. Next we turn to asymptotic approximations; the
main result is that in regular cases there is a large-sample equivalence between Bayesian probability
statements and frequentist confidence statements. As a result, frequentist and Bayesian inferences
are often very similar and can be reinterpreted in each other's terms. Finally, we review Markov
chain Monte Carlo (MCMC) methods. The development of these methods has greatly reduced the
computational difficulties that held back Bayesian applications in the past.
Bayesian methods are now not only generally feasible, but sometimes also a better practical
alternative to frequentist methods. The upshot is an emerging Bayesian/frequentist synthesis around
increasing agreement on what works for different kinds of problems. The shifting focus from
philosophical debate to methodological considerations is a healthy state of affairs, because both
frequentist and Bayesian approaches have features that are appealing to most scientists.
2 Bayesian inference
Let us consider a data set $y = (y_1, \ldots, y_n)$ and a probability density (or mass) function of $y$ conditional
on an unknown parameter $\theta$:
$$f(y_1, \ldots, y_n \mid \theta).$$
If $y$ is an iid sample then $f(y_1, \ldots, y_n \mid \theta) = \prod_{i=1}^n f(y_i \mid \theta)$, where $f(y_i \mid \theta)$ denotes the pdf of $y_i$. In
survey sampling $f(y_i \mid \theta_0)$ is the pdf of the population, $(y_1, \ldots, y_n)$ are $n$ independent draws from
that population, and $\theta_0$ denotes the true value of $\theta$ in the pdf that generated the data.
In general, for shortness we just write $f(y \mid \theta) = f(y_1, \ldots, y_n \mid \theta)$. As a function of the parameter
this is called the likelihood function, also denoted $L(\theta)$. We are interested in inference about the
unknown parameter $\theta$ given the data. Any uncertain prior information about the value of $\theta$ is specified
in a prior probability distribution for the parameter, $p(\theta)$. Both the likelihood and the prior are
chosen by the researcher. We then combine the prior distribution and the sample information, using
Bayes' theorem, to obtain the conditional distribution of the parameter given the data, also known as
the posterior distribution:
$$p(\theta \mid y) = \frac{f(y, \theta)}{f(y)} = \frac{f(y \mid \theta)\, p(\theta)}{\int f(y \mid \theta)\, p(\theta)\, d\theta}.$$
Since the denominator does not depend on $\theta$,
$$p(\theta \mid y) \propto f(y \mid \theta)\, p(\theta) = L(\theta)\, p(\theta).$$
Once we calculate this product, all we have to do is to find the constant that makes this expression
integrate to one as a function of the parameter. The posterior density describes how likely it is that
a value of $\theta$ has generated the observed data.
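To make this concrete, here is a minimal sketch in Python (not part of the original notes) that evaluates $L(\theta)\,p(\theta)$ on a grid and finds the normalizing constant numerically; the Bernoulli likelihood, the Beta(2, 2) prior, and the data are illustrative choices.

```python
import numpy as np

y = np.array([1, 0, 1, 1, 0, 1])            # hypothetical Bernoulli data
theta = np.linspace(0.001, 0.999, 999)       # grid over the parameter space
dtheta = theta[1] - theta[0]

likelihood = theta ** y.sum() * (1 - theta) ** (len(y) - y.sum())
prior = 6 * theta * (1 - theta)              # Beta(2, 2) density
unnormalized = likelihood * prior

# Find the constant that makes the product integrate to one in theta.
posterior = unnormalized / (unnormalized.sum() * dtheta)
print((posterior * dtheta).sum())            # ~1.0
```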
Point estimation We can use the posterior density to form optimal point estimates. The notion
of optimality is minimizing mean posterior loss for some loss function $\ell(r)$:
$$\min_c \int \ell(c - \theta)\, p(\theta \mid y)\, d\theta.$$
The posterior mean is the point estimate that minimizes mean squared loss $\ell(r) = r^2$. The posterior
median minimizes mean absolute loss $\ell(r) = |r|$. The posterior mode $\tilde{\theta}$ is the maximizer of the
posterior density and minimizes mean Dirac loss. When the prior density is flat, the posterior mode
coincides with the maximum likelihood estimator.
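For illustration, a sketch of the three optimal point estimates for a Beta-shaped posterior; the hyperparameter values are hypothetical, and the closed-form mode requires both parameters to exceed one.

```python
from scipy import stats

a, b = 5.0, 3.0                       # hypothetical Beta posterior parameters
post = stats.beta(a, b)

post_mean = post.mean()               # minimizes mean squared loss
post_median = post.median()           # minimizes mean absolute loss
post_mode = (a - 1) / (a + b - 2)     # maximizes the posterior density
print(post_mean, post_median, post_mode)
```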
Interval estimation The posterior quantiles characterize the posterior uncertainty about the
parameter, and they can be used to obtain interval estimates. Any interval $(\theta_\ell, \theta_u)$ such that
$$\int_{\theta_\ell}^{\theta_u} p(\theta \mid y)\, d\theta = 1 - \alpha$$
is called a credible interval with coverage probability $1 - \alpha$. If the posterior density is unimodal,
a common choice is the shortest connected credible interval, or the highest posterior density (HPD)
interval. In practice, an equal-tail-probability interval is often favored because of its computational
simplicity. In that case, $\theta_\ell$ and $\theta_u$ are just the $\alpha/2$ and $1 - \alpha/2$ posterior quantiles, respectively.
Equal-tail-probability intervals tend to be longer than the others, except in the case of a symmetric
posterior density. If the posterior is multi-modal then the HPD interval may consist of disjoint segments.1
Frequentist confidence intervals and Bayesian credible intervals are the two main interval estimation
methods in statistics. In a confidence interval the coverage probability is calculated from a sampling
density, whereas in a credible interval the coverage probability is calculated from a posterior density.
As discussed in the next section, despite the differences between the two methods, they often provide
similar interval estimates in large samples.
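The two constructions can be sketched as follows for a unimodal Beta posterior; the grid search for the shortest connected interval is an illustrative device, and the posterior parameters are hypothetical.

```python
import numpy as np
from scipy import stats

post = stats.beta(5.0, 3.0)                  # hypothetical unimodal posterior
alpha = 0.10

# Equal-tail interval: the alpha/2 and 1 - alpha/2 posterior quantiles.
equal_tail = post.ppf([alpha / 2, 1 - alpha / 2])

# Shortest connected (HPD) interval: search over intervals with coverage 1 - alpha.
lo = np.linspace(0.0, alpha, 1001)
lengths = post.ppf(lo + 1 - alpha) - post.ppf(lo)
k = np.argmin(lengths)
hpd = (post.ppf(lo[k]), post.ppf(lo[k] + 1 - alpha))
print(equal_tail, hpd)                       # the HPD interval is (weakly) shorter
```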
Bernoulli example Let us consider a random sample $(y_1, \ldots, y_n)$ of Bernoulli random variables.
The likelihood of the sample is given by
$$L(\theta) = \theta^m (1 - \theta)^{n - m}$$
where $m = \sum_{i=1}^n y_i$. The maximum likelihood estimator is
$$\hat{\theta} = \frac{m}{n}.$$
In general, given some prior $p(\theta)$, the posterior mode solves $\max_\theta L(\theta)\, p(\theta)$.
Since $\theta$ is a probability value, a suitable parameter space over which to specify a prior probability
distribution is the $(0, 1)$ interval. A flexible and convenient choice is the Beta distribution:
$$p(\theta; \alpha, \beta) = \frac{1}{B(\alpha, \beta)}\, \theta^{\alpha - 1} (1 - \theta)^{\beta - 1}$$
where $B(\alpha, \beta)$ is the beta function, which is constant with respect to $\theta$:
$$B(\alpha, \beta) = \int_0^1 s^{\alpha - 1} (1 - s)^{\beta - 1}\, ds.$$
The quantities $(\alpha, \beta)$ are parameters of the prior, to be set according to our a priori information
about $\theta$. These parameters are called prior hyperparameters.
1 The minimum density of any point within the HPD interval exceeds the density of any point outside that interval.
The Beta distribution is a convenient prior because the posterior is also a Beta distribution:
$$p(\theta \mid y) \propto L(\theta)\, p(\theta) \propto \theta^{m + \alpha - 1} (1 - \theta)^{n - m + \beta - 1}.$$
That is, if $\theta \sim \mathrm{Beta}(\alpha, \beta)$ then $\theta \mid y \sim \mathrm{Beta}(\alpha + m,\ \beta + n - m)$. This situation is described by saying
that the Beta distribution is the conjugate prior to the Bernoulli.
The posterior mode is given by
$$\tilde{\theta} = \arg\max_\theta \left[\theta^{m + \alpha - 1} (1 - \theta)^{n - m + \beta - 1}\right] = \frac{m + \alpha - 1}{n + \alpha - 1 + \beta - 1}. \tag{1}$$
This result illustrates some interesting properties of the posterior mode in this example. The posterior
mode is equivalent to the ML estimate from a data set with $\alpha - 1$ additional ones and $\beta - 1$ additional
zeros. This data augmentation interpretation provides guidance on how to choose $\alpha$ and $\beta$ to describe
a priori knowledge about the probability of success in Bernoulli trials. It also illustrates the vanishing
effect of the prior in a large sample: if $n$ is large, $\tilde{\theta} \approx \hat{\theta}$. However, maximum likelihood
may not be a satisfactory estimator in a small sample that contains only zeros if the probability of
success is known a priori to be greater than zero.
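A sketch of the conjugate update and of the posterior mode in (1), with simulated data and hypothetical hyperparameters:

```python
import numpy as np

y = np.random.default_rng(0).binomial(1, 0.3, size=50)   # simulated Bernoulli draws
m, n = y.sum(), y.size
a, b = 2.0, 2.0                                          # prior Beta(a, b)

a_post, b_post = a + m, b + n - m                        # theta | y ~ Beta(a + m, b + n - m)
mle = m / n
mode = (m + a - 1) / (n + a + b - 2)                     # equation (1)

# Data augmentation reading: the mode is the MLE of a sample with a - 1 extra
# ones and b - 1 extra zeros.
assert np.isclose(mode, (m + (a - 1)) / (n + (a - 1) + (b - 1)))
print(mle, mode)
```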
Conjugate priors One consideration in selecting the form of both prior and likelihood is mathematical
convenience. Conjugate prior distributions, such as the Beta density in the previous example,
have traditionally played a central role in Bayesian inference for analytical and computational reasons.
A prior is conjugate for a family of distributions if the prior and the posterior are of the same family.
When a likelihood model is used together with its conjugate prior, the posterior is not only known to
be from the same family of densities as the prior, but explicit formulas for the posterior hyperparameters
are also available. In general, distributions in the exponential family have conjugate priors. Some
likelihood models together with their conjugate priors are the following:
Bernoulli – Beta
Binomial – Beta
Poisson – Gamma
Normal with known variance – Normal
Exponential – Gamma
Uniform – Pareto
Geometric – Beta
Conjugate priors not only have advantages in tractability but also in interpretation, since the prior
can be interpreted in terms of a prior sample size or additional pseudo-data (as illustrated in the
Bernoulli example).
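As another instance from the list, here is a sketch of the Poisson-Gamma update under a rate parameterization (data and hyperparameters hypothetical): with a Gamma prior with shape $a$ and rate $b$, the posterior has shape $a + \sum_i y_i$ and rate $b + n$, so $b$ acts as a prior number of observations and $a$ as a prior total count.

```python
import numpy as np

y = np.random.default_rng(1).poisson(4.0, size=30)   # simulated Poisson counts
a, b = 2.0, 1.0                                      # prior shape and rate

a_post, b_post = a + y.sum(), b + y.size             # conjugate Gamma update
post_mean = a_post / b_post                          # Gamma mean in this rate form
print(post_mean, y.mean())
```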
Informative priors The argument for using a probability distribution to specify uncertain a
priori information is more compelling when prior knowledge can be associated with past experience, or
with a process of eliciting consensus expert views. Other times, a parameter is a random realization
drawn from some population, for example in a model with individual effects for longitudinal survey
data; a situation in which there exists an actual population prior distribution. In those cases one
would like the prior to accurately express the information available about the parameters. Often,
however, little is known a priori, and one would like a prior density to express just that lack of
information, an issue we consider next.
Flat priors For a scalar $\theta$ taking values on the entire real line, a uniform or flat prior distribution
is typically employed as an uninformative prior, that is, one that sets $p(\theta) = 1$. A flat prior is non-informative
in the sense of having little impact on the posterior, which is simply a renormalization
of the likelihood into a density for $\theta$.2 A flat prior is therefore appealing from the point of view of
seeking to summarize the likelihood.
Note that a flat prior is improper in the sense that $\int p(\theta)\, d\theta = \infty$.3 If an improper prior is
combined with a likelihood that cannot be renormalized (due to lacking a finite integral with respect
to $\theta$), the result is an improper posterior that cannot be used for inference. Flat priors are often
approximated by a proper prior with a large variance.
If $p(\theta)$ is uniform, then the prior of a transformation of $\theta$ is not uniform. If $\theta$ is a positive number,
a standard reference prior is a flat prior on $\ln \theta$, $p(\ln \theta) = 1$, which implies
$$p(\theta) = \frac{1}{\theta}. \tag{2}$$
Similarly, if $\theta$ lies in the $(0, 1)$ interval, a flat prior on the logit transformation of $\theta$, $\ln \frac{\theta}{1 - \theta}$, implies
$$p(\theta) = \frac{1}{\theta (1 - \theta)}. \tag{3}$$
These priors are improper because $\int_0^\infty \theta^{-1}\, d\theta$ and $\int_0^1 [\theta (1 - \theta)]^{-1}\, d\theta$ both diverge. They are easily dominated
by the data, but (2) assigns most of the weight to values of $\theta$ that are either very large or very close
to zero, and (3) puts most of the weight on values very near 0 and 1.
For example, if $(y_1, \ldots, y_n)$ is a random sample from a normal population $N(\mu, \sigma^2)$, the standard
improper reference prior for $(\mu, \sigma)$ is to specify independent flat priors on $\mu$ and $\ln \sigma$, so that
$$p(\mu, \sigma) = p(\mu)\, p(\sigma) = \frac{1}{\sigma}.$$
2 Though arguably a flat prior places a large weight on extreme parameter values.
3 Any prior distribution with infinite mass is called improper.
Jeffreys prior This is a rule for choosing a non-informative prior that is invariant to transformation:
$$p(\theta) \propto [\det I(\theta)]^{1/2},$$
where $I(\theta)$ denotes the Fisher information matrix.
Bernoulli example continued Let us illustrate three standard candidates for a non-informative
prior in the Bernoulli example. The first possibility is to use a flat prior on the log-odds scale, leading
to (3); this is the Beta(0, 0) distribution, since it can be regarded as the limit of the numerator of the
Beta density as $\alpha, \beta \to 0$. The second is Jeffreys' prior, which in this case is proportional to
$$p(\theta) = \frac{1}{\sqrt{\theta (1 - \theta)}},$$
and corresponds to the Beta(0.5, 0.5) distribution. Finally, the third candidate is the uniform prior
$p(\theta) = 1$, which corresponds to the Beta(1, 1) distribution.
All three priors are data augmentation priors. The Beta(0, 0) prior adds no prior observations,
Jeffreys' prior adds one observation with half a success and half a failure, and the uniform prior adds
two observations with one success and one failure. The ML estimator coincides with the posterior
mode for the Beta(1, 1) prior, and with the posterior mean for the Beta(0, 0) prior.
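A sketch comparing the posteriors implied by the three candidate priors, using hypothetical values of $m$ and $n$; each prior enters only through the Beta hyperparameters:

```python
from scipy import stats

m, n = 7, 20                                   # hypothetical successes and trials
for a, b, label in [(0.0, 0.0, "flat on log-odds"),
                    (0.5, 0.5, "Jeffreys"),
                    (1.0, 1.0, "uniform")]:
    post = stats.beta(m + a, n - m + b)        # posterior Beta(m + a, n - m + b)
    print(label, post.mean(), post.interval(0.95))
```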
4.1 Consistency of the posterior distribution
If the population distribution of a random sample $y = (y_1, \ldots, y_n)$ is included in the parametric
likelihood family, so that it equals $f(y_i \mid \theta_0)$ for some $\theta_0$, the posterior is consistent in the sense that it
converges to a point mass at the true parameter value $\theta_0$ as $n \to \infty$. When the true distribution is
not included in the parametric family, there is no longer a true value $\theta_0$, except in the sense of the
value $\theta_0$ that makes the model distribution $f(y_i \mid \theta)$ closest to the true distribution $g(y_i)$ according
to the Kullback-Leibler divergence:
$$KL(\theta) = \int \ln \left[\frac{g(y_i)}{f(y_i \mid \theta)}\right] g(y_i)\, dy_i,$$
so that4
$$\theta_0 = \arg\min_{\theta \in \Theta} KL(\theta).$$
Here is a consistency theorem of the posterior distribution for a discrete parameter space. The
result is valid even if $g(y_i)$ is not included in the $f(y_i \mid \theta)$ family, in which case we may refer to
$\prod_{i=1}^n f(y_i \mid \theta)$ as a pseudo-likelihood and to $p(\theta \mid y)$ as a pseudo-posterior. The theorem and
its proof are taken from Gelman et al (2014, p. 586).
Theorem (finite parameter space) If the parameter space $\Theta$ is finite and $\Pr(\theta = \theta_0) > 0$,
then $\Pr(\theta = \theta_0 \mid y) \to 1$ as $n \to \infty$, where $\theta_0$ is the value of $\theta$ that minimizes the Kullback-Leibler
divergence.
Proof: For any $\theta \neq \theta_0$, consider the log posterior odds relative to $\theta_0$:
$$\ln \frac{p(\theta \mid y)}{p(\theta_0 \mid y)} = \ln \frac{p(\theta)}{p(\theta_0)} + \sum_{i=1}^n \ln \frac{f(y_i \mid \theta)}{f(y_i \mid \theta_0)}. \tag{4}$$
For fixed values of $\theta$ and $\theta_0$, if the $y_i$'s are iid draws from $g(y_i)$, the second term on the right is a sum
of $n$ iid random variables with mean
$$E\left[\ln \frac{f(y_i \mid \theta)}{f(y_i \mid \theta_0)}\right] = KL(\theta_0) - KL(\theta) \leq 0.$$
Thus, as long as $\theta_0$ is the unique minimizer of $KL(\theta)$, for $\theta \neq \theta_0$ the second term on the right of (4)
is the sum of $n$ iid random variables with negative mean. By the LLN, the sum approaches $-\infty$ as
$n \to \infty$. As long as the first term on the right is finite (provided $p(\theta_0) > 0$), the whole expression
approaches $-\infty$ in the limit. Then $p(\theta \mid y) / p(\theta_0 \mid y) \to 0$, and so $p(\theta \mid y) \to 0$. Moreover, since all probabilities
add up to 1, $p(\theta_0 \mid y) \to 1$.
If $\theta$ has a continuous distribution, $p(\theta_0 \mid y)$ is always zero for any finite sample, and so the previous
argument does not apply, but it can still be shown that $p(\theta \mid y)$ becomes more and more concentrated
about $\theta_0$ as $n$ increases. A statement of the theorem for the continuous case in Gelman et al is as
follows.
4 Equivalently, $\theta_0 = \arg\max_{\theta \in \Theta} E[\ln f(y \mid \theta)]$.
Theorem (continuous parameter space) If $\theta$ is defined on a compact set $\Theta$ and $A$ is a neighborhood
of $\theta_0$ with nonzero prior probability, then $\Pr(\theta \in A \mid y) \to 1$ as $n \to \infty$, where $\theta_0$ is the value
of $\theta$ that minimizes $KL(\theta)$.
Bernoulli example Recall that the posterior distribution in this case is
$$p(\theta \mid y) \propto \theta^{m + \alpha - 1} (1 - \theta)^{n - m + \beta - 1},$$
that is, $\theta \mid y \sim \mathrm{Beta}(m + \alpha,\ n - m + \beta)$, so that5
$$Var(\theta \mid y) = \frac{(m + \alpha)(n - m + \beta)}{(n + \alpha + \beta)^2 (n + \alpha + \beta + 1)} = O\left(\frac{1}{n}\right).$$
As $n$ increases the posterior distribution becomes concentrated at a value that does not depend on
the prior distribution.
By the strong LLN, for each $\varepsilon > 0$,6
$$\Pr\left(\lim_{n \to \infty} \left|\frac{m}{n} - \theta_0\right| < \varepsilon \,\Big|\, \theta_0\right) = 1.$$
Therefore, with probability 1, the sequence of posterior probability densities
has a limit distribution with mean $\theta_0$ and variance 0, independent of $\alpha$ and $\beta$. Thus, under each
conjugate Beta prior, with probability 1, the posterior probability for $\theta$ converges to the Dirac delta
distribution concentrated on the true parameter value.
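A simulation sketch of this concentration (the true value, prior hyperparameters, and sample sizes are illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
theta0, a, b = 0.3, 2.0, 2.0
for n in [10, 100, 1000, 10000]:
    m = rng.binomial(n, theta0)                 # sufficient statistic for the sample
    post = stats.beta(m + a, n - m + b)         # exact Beta posterior
    print(n, post.mean(), post.var())           # variance shrinks at rate O(1/n)
```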
We have seen that as $n \to \infty$ the posterior distribution converges to a degenerate measure at the
true value (posterior consistency). To obtain a non-degenerate limit, we consider the sequence of
posterior distributions of $\psi = \sqrt{n}\,(\theta - \hat{\theta})$, whose densities are given by7
$$p(\psi \mid y) = \frac{1}{\sqrt{n}}\, p\left(\hat{\theta} + \frac{\psi}{\sqrt{n}} \,\Big|\, y\right).$$
5 The mean and variance of $X \sim \mathrm{Beta}(\alpha, \beta)$ are
$$E(X) = \frac{\alpha}{\alpha + \beta}, \qquad Var(X) = \frac{\alpha \beta}{(\alpha + \beta)^2 (\alpha + \beta + 1)}.$$
6 That is, $\int \mathbf{1}\left\{\lim_{n \to \infty} \left|\frac{m}{n} - \theta_0\right| < \varepsilon\right\} \theta_0^m (1 - \theta_0)^{n - m}\, dm = 1$, for any $\theta_0$.
7 We use the ML estimator $\hat{\theta}$ as the centering quantity, but the limiting result is unaffected if the posterior mode $\tilde{\theta}$ is used instead, or if $\psi = \sqrt{n}\,(\theta - T_n)$ with $T_n = \theta_0 + [n I(\theta_0)]^{-1}\, \partial \ln L(\theta_0) / \partial \theta$.
The basic result of large-sample Bayesian inference is that as more and more data arrive, the
posterior distribution approaches a normal distribution. This result is known as the Bernstein-von
Mises theorem. See, for example, Lehmann and Casella (1998, Theorem 8.2, p. 489), van der Vaart
(1998, Theorem 10.1, p. 141), or Chernozhukov and Hong (2003, Theorem 1, p. 305).
A formal statement for iid data and a scalar parameter, under the standard regularity conditions of
MLE asymptotics, the condition that the prior $p(\theta)$ is continuous and positive in an open neighborhood
of $\theta_0$, and some additional technical conditions, is as follows:
$$\int \left| p(\psi \mid y) - \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{\psi^2}{2\sigma^2}\right) \right| d\psi \overset{p}{\to} 0,$$
where $\sigma^2 = 1 / I(\theta_0)$. That is, the $L_1$ distance between the posterior of $\psi$ (the posterior scaled by
$\sqrt{n}$ and centered at the random quantity $\hat{\theta}$) and a $N(0, \sigma^2)$ density goes to zero in probability.
Thus, for large $n$, $p(\theta \mid y)$ is approximately a random normal density with random mean parameter
$\hat{\theta}$ and constant variance parameter $I(\theta_0)^{-1} / n$:
$$p(\theta \mid y) \approx N\left(\hat{\theta},\ \frac{1}{n}\, I(\theta_0)^{-1}\right).$$
The result can be extended to a multidimensional parameter. To gain intuition for this result, let us
consider a Taylor expansion of $\ln p(\theta \mid y)$ about the posterior mode $\tilde{\theta}$:
$$\ln p(\theta \mid y) \approx \ln p(\tilde{\theta} \mid y) + \frac{\partial \ln p(\tilde{\theta} \mid y)}{\partial \theta'}\,(\theta - \tilde{\theta}) + \frac{1}{2}\,(\theta - \tilde{\theta})'\, \frac{\partial^2 \ln p(\tilde{\theta} \mid y)}{\partial \theta\, \partial \theta'}\,(\theta - \tilde{\theta})$$
$$= c - \frac{1}{2}\, \sqrt{n}\,(\theta - \tilde{\theta})' \left[-\frac{1}{n}\, \frac{\partial^2 \ln p(\tilde{\theta} \mid y)}{\partial \theta\, \partial \theta'}\right] \sqrt{n}\,(\theta - \tilde{\theta}).$$
Note that $\partial \ln p(\tilde{\theta} \mid y) / \partial \theta = 0$. Moreover,
$$-\frac{1}{n}\, \frac{\partial^2 \ln p(\tilde{\theta} \mid y)}{\partial \theta\, \partial \theta'} = -\frac{1}{n}\, \frac{\partial^2 \ln p(\tilde{\theta})}{\partial \theta\, \partial \theta'} - \frac{1}{n} \sum_{i=1}^n \frac{\partial^2 \ln f(y_i \mid \tilde{\theta})}{\partial \theta\, \partial \theta'} = -\frac{1}{n} \sum_{i=1}^n \frac{\partial^2 \ln f(y_i \mid \tilde{\theta})}{\partial \theta\, \partial \theta'} + O\left(\frac{1}{n}\right) \approx I(\tilde{\theta}).$$
Thus, for large $n$ the curvature of the log posterior can be approximated by the Fisher information:
$$\ln p(\theta \mid y) \approx c - \frac{1}{2}\, \sqrt{n}\,(\theta - \tilde{\theta})'\, I(\tilde{\theta})\, \sqrt{n}\,(\theta - \tilde{\theta}).$$
Dropping terms that do not include $\theta$, we get the approximation
$$p(\theta \mid y) \propto \exp\left[-\frac{1}{2}\,(\theta - \tilde{\theta})'\, n I(\tilde{\theta})\,(\theta - \tilde{\theta})\right],$$
which corresponds to the kernel of a multivariate normal density $N\left(\tilde{\theta},\ \frac{1}{n}\, I(\tilde{\theta})^{-1}\right)$.
Often, convergence to normality of the posterior distribution of a parameter can be improved
by transformation. If $\phi$ is a continuous transformation of $\theta$, then both $p(\theta \mid y)$ and $p(\phi \mid y)$ approach
normal distributions, but the accuracy of the approximation for finite $n$ can vary substantially with
the transformation chosen.
A Bernstein-von Mises theorem states that under adequate conditions the posterior distribution
is asymptotically normal, centered at the MLE, with a variance equal to the asymptotic frequentist
variance of the MLE. From a frequentist point of view, this implies that Bayesian methods can be used
to obtain statistically efficient estimators and consistent confidence intervals. The limiting distribution
does not depend on the Bayesian prior.
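In the Bernoulli model the approximation can be checked directly: since $I(\theta) = 1/[\theta(1-\theta)]$, the approximating density is $N(\hat{\theta},\ \hat{\theta}(1-\hat{\theta})/n)$. The following sketch (with simulated data) compares it with the exact Beta posterior.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, theta0, a, b = 500, 0.3, 1.0, 1.0
m = rng.binomial(n, theta0)
mle = m / n

exact = stats.beta(m + a, n - m + b)                     # exact posterior
approx = stats.norm(mle, np.sqrt(mle * (1 - mle) / n))   # BvM normal approximation

grid = np.linspace(0.2, 0.4, 201)
print(np.max(np.abs(exact.pdf(grid) - approx.pdf(grid))))  # small for large n
```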
If $g(y_i) \neq f(y_i \mid \theta)$ for all $\theta \in \Theta$, then the fitted model $f(y_i \mid \theta)$ is misspecified. In that case the
large-$n$ sampling distribution of the pseudo-ML estimator is
$$\sqrt{n}\,(\hat{\theta} - \theta_0) \overset{d}{\to} N(0, S), \qquad S = M V M,$$
with $M = [-E(H_i)]^{-1} = [I(\theta_0)]^{-1}$, $V = E(q_i q_i')$, and
$$q_i = \frac{\partial \ln f(y_i \mid \theta_0)}{\partial \theta}, \qquad H_i = \frac{\partial^2 \ln f(y_i \mid \theta_0)}{\partial \theta\, \partial \theta'}.$$
In a correctly specified model the information identity $V = M^{-1}$ holds, but in general $V \neq M^{-1}$.
The large-sample shape of a posterior distribution obtained from $\prod_{i=1}^n f(y_i \mid \theta)$ becomes close to
$$\theta \mid y \sim N\left(\hat{\theta},\ \frac{1}{n}\, M\right).$$
Thus, misspecification produces a discrepancy between the sampling distribution of $\hat{\theta}$ and the shape of
the (pseudo-)likelihood. That is, the pseudo-likelihood does not correctly reflect the sample information
about $\theta$ contained in $\hat{\theta}$. So, for the purpose of Bayesian inference about $\theta_0$ (in the knowledge that $\theta_0$
is only a pseudo-true value), it makes sense to start from the correct large-sample approximation to
the likelihood of $\hat{\theta}$ instead of the (incorrect) approximate likelihood of $(y_1, \ldots, y_n)$. That is, to consider
a posterior distribution of the form
$$p(\theta \mid \hat{\theta}) \propto \exp\left[-\frac{1}{2}\, n\,(\theta - \hat{\theta})'\, S^{-1}\,(\theta - \hat{\theta})\right] p(\theta). \tag{5}$$
This approach is proposed in Müller (2013), who shows that Bayesian inference about $\theta_0$ has lower
asymptotic frequentist risk when the standard pseudo-posterior
$$p(\theta \mid y) \propto \exp\left[\sum_{i=1}^n \ln f(y_i \mid \theta)\right] p(\theta) \tag{6}$$
is substituted by the pseudo-posterior (5), which relies on the asymptotic likelihood of $\hat{\theta}$ (an "artificial"
normal posterior centered at the MLE with a sandwich covariance matrix).
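A sketch of the sandwich computation behind (5); the per-observation scores and Hessians evaluated at the MLE would come from the fitted model, and the function below is an illustrative construction, not Müller's (2013) implementation.

```python
import numpy as np

def sandwich_variance(q, H):
    """q: (n, k) scores at the MLE; H: (n, k, k) Hessians at the MLE.

    Returns the estimated sampling variance S/n of the pseudo-MLE,
    with S = M V M, M = [-E(H_i)]^{-1} and V = E(q_i q_i').
    """
    n = q.shape[0]
    M = np.linalg.inv(-H.mean(axis=0))                   # [-E(H_i)]^{-1}
    V = (q[:, :, None] * q[:, None, :]).mean(axis=0)     # E(q_i q_i')
    return M @ V @ M / n
```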
The posterior mode is consistent in repeated sampling with $\theta$ fixed as $n \to \infty$. Moreover, the posterior
mode is also asymptotically normal in repeated samples. So the large-sample Bayesian statement
holds:
$$\left[n I(\tilde{\theta})\right]^{1/2} \left(\theta - \tilde{\theta}\right) \,\Big|\, y \overset{d}{\approx} N(0, I).$$
See, for example, Lehmann and Casella (1998, Theorem 8.3, p. 490).
These results imply that in regular estimation problems the posterior distribution is asymptotically
the same as the repeated-sample distribution. So, for example, a 95% central posterior interval for $\theta$
will cover the true value 95% of the time under repeated sampling with any fixed true $\theta$. The frequentist
statement speaks of probabilities of $\tilde{\theta}(y)$, whereas the Bayesian statement speaks of probabilities of $\theta$.
Specifically,
$$\Pr(\theta \leq r \mid y) = \int \mathbf{1}(\theta \leq r)\, p(\theta \mid y)\, d\theta \propto \int \mathbf{1}(\theta \leq r)\, f(y \mid \theta)\, p(\theta)\, d\theta,$$
$$\Pr\left[\tilde{\theta}(y) \leq r \mid \theta_0\right] = \int \mathbf{1}\left[\tilde{\theta}(y) \leq r\right] f(y \mid \theta_0)\, dy.$$
These results require that the true data distribution is included in the parametric likelihood family.
Bernoulli example continued The posterior mode corresponding to the Beta prior with parameters
$(\alpha, \beta)$ in (1) and the maximum likelihood estimator $\hat{\theta} = m/n$ satisfy
$$\sqrt{n}\, \tilde{\theta} = \sqrt{n}\, \hat{\theta} + R_n,$$
where
$$R_n = \frac{\sqrt{n}}{n + k} \left(\alpha - 1 - k\, \frac{m}{n}\right)$$
and $k = \alpha + \beta - 2$. Since $R_n \overset{p}{\to} 0$, it follows that $\sqrt{n}\,(\tilde{\theta} - \theta_0)$ has the same asymptotic distribution as
$\sqrt{n}\,(\hat{\theta} - \theta_0)$, namely $N[0,\ \theta_0 (1 - \theta_0)]$. Therefore, the normalized posterior mode has an asymptotic
normal distribution which is independent of the prior parameters and has the same asymptotic variance
as that of the MLE, so the posterior mode is asymptotically efficient.
Robustness to statistical principle and its failures The dual frequentist/Bayesian interpretation
of many textbook estimation procedures suggests that it is possible to aim for robustness to
statistical philosophies in statistical methodology, at least in regular estimation problems.
Even for small samples, many statistical methods can be considered as approximations to Bayesian
inferences based on particular prior distributions. As a way of understanding a statistical procedure,
it is often useful to determine the implicit underlying prior distribution (Gelman et al 2014, p. 92).
In the case of unit roots, the symmetry between Bayesian probability statements and classical confidence
statements breaks down. With normal errors and a flat prior the Bayesian posterior is normal even
if the true data generating process is a random walk (Sims and Uhlig 1991). Kim (1998) studied
conditions for asymptotic posterior normality which cover much more general situations than the
normal random walk with flat priors.
Bayesian analysis typically requires evaluating posterior moments of the form
$$E[h(\theta) \mid y] = \int h(\theta)\, p(\theta \mid y)\, d\theta, \qquad p(\theta \mid y) \propto f(y \mid \theta)\, p(\theta),$$
for various functions $h(\cdot)$. For problems for which no analytic solution exists, MCMC methods provide
powerful tools for evaluating these integrals, especially when $\theta$ is high dimensional.
MCMC is a collection of computational methods that produce an ergodic Markov chain $\theta^{(1)}, \theta^{(2)}, \ldots$ with
the stationary distribution $p(\theta \mid y)$. A continuous-state Markov chain is a sequence $\theta^{(1)}, \theta^{(2)}, \ldots$ that
satisfies the Markov property: conditional on the history of the chain, the distribution of $\theta^{(j+1)}$
depends only on the current state $\theta^{(j)}$.
The probability $\Pr(\theta' \mid \theta)$ of transitioning from state $\theta$ to state $\theta'$ is called the transition kernel, and
we denote it $K(\theta' \mid \theta)$. Our interest will be in the steady-state probability distribution of the process.
Given a starting value $\theta^{(0)}$, a chain $\theta^{(1)}, \theta^{(2)}, \ldots, \theta^{(M)}$ is generated using a transition kernel
with stationary distribution $p(\theta \mid y)$, which ensures the convergence of the marginal distribution
of $\theta^{(M)}$ to $p(\theta \mid y)$. For sufficiently large $M$, MCMC methods produce a dependent sample
$\theta^{(1)}, \theta^{(2)}, \ldots, \theta^{(M)}$ whose empirical distribution approaches $p(\theta \mid y)$. The ergodicity and
construction of the chains usually imply that, as $M \to \infty$,
$$\hat{h} = \frac{1}{M} \sum_{j=1}^M h\left(\theta^{(j)}\right) \overset{p}{\to} \int h(\theta)\, p(\theta \mid y)\, d\theta.$$
Analogously, a 90% interval estimate is constructed simply by taking the 0.05 and 0.95
quantiles of the sequence $h(\theta^{(1)}), \ldots, h(\theta^{(M)})$.
In the theory of Markov chains, one looks for conditions under which there exists an invariant
distribution, and conditions under which iterations of the transition kernel $K(\theta' \mid \theta)$ converge to the
invariant distribution. In the context of MCMC methods the situation is the reverse: the invariant
distribution is known, and in order to generate samples from it the methods look for a transition kernel
whose iterations converge to the invariant distribution. The problem is to find a suitable $K(\theta' \mid \theta)$
that satisfies the invariance property:
$$p(\theta' \mid y) = \int K(\theta' \mid \theta)\, p(\theta \mid y)\, d\theta. \tag{7}$$
Under the invariance property, if $\theta^{(j)}$ is a draw from $p(\theta \mid y)$ then $\theta^{(j+1)}$ is also a draw from $p(\theta \mid y)$.
A useful fact is that the steady-state distribution $p(\theta \mid y)$ satisfies the detailed balance condition:
$$K(\theta' \mid \theta)\, p(\theta \mid y) = K(\theta \mid \theta')\, p(\theta' \mid y) \quad \text{for all } \theta, \theta'. \tag{8}$$
The interpretation of equation (8) is that the amount of mass transitioning from $\theta$ to $\theta'$ is the same
as the amount of mass that transitions back from $\theta'$ to $\theta$.
The invariance property is not enough to guarantee that an average of draws $h(\theta^{(j)})$ from $K(\theta' \mid \theta)$
converges to the posterior mean. It has to be proved that $K(\theta' \mid \theta)$ has a unique invariant distribution,
that repeatedly drawing from $K(\theta' \mid \theta)$ leads to convergence to the unique invariant distribution
regardless of the initial condition, and that the dependence of the draws $\theta^{(j)}$ decays sufficiently fast
that Monte Carlo sample averages converge to population means. Robert and Casella (2004)
provide a textbook treatment of the convergence theory for MCMC algorithms.
Two general methods of constructing transition kernels are the Metropolis-Hastings algorithm and
the Gibbs sampler, which we discuss in turn.
The Metropolis-Hastings algorithm proceeds by generating candidates that are either accepted or
rejected according to some probability, which is driven by a ratio of posterior evaluations. A description
of the algorithm is as follows.
Given the posterior density $f(y \mid \theta)\, p(\theta)$, known up to a constant, and a prespecified conditional
density $q(\theta' \mid \theta)$ called the "proposal distribution", generate $\theta^{(1)}, \theta^{(2)}, \ldots, \theta^{(M)}$ in the following way:
1. Choose a starting value $\theta^{(0)}$.
2. Draw a proposal $\theta^*$ from $q(\theta^* \mid \theta^{(j)})$.
3. Update $\theta^{(j+1)}$ from $\theta^{(j)}$ for $j = 0, 1, 2, \ldots$ using
$$\theta^{(j+1)} = \begin{cases} \theta^* & \text{with probability } \rho(\theta^* \mid \theta^{(j)}), \\ \theta^{(j)} & \text{with probability } 1 - \rho(\theta^* \mid \theta^{(j)}), \end{cases}$$
where
$$\rho(\theta^* \mid \theta^{(j)}) = \min\left\{1,\ \frac{f(y \mid \theta^*)\, p(\theta^*)\, q(\theta^{(j)} \mid \theta^*)}{f(y \mid \theta^{(j)})\, p(\theta^{(j)})\, q(\theta^* \mid \theta^{(j)})}\right\}.$$
Some intuition for how the algorithm deals with a candidate transition from $\theta$ to $\theta^*$ is as follows
(Letham and Rudin 2012). If $p(\theta^* \mid y) > p(\theta \mid y)$, then for every accepted draw of $\theta$ we should have at
least as many accepted draws of $\theta^*$, and so we always accept the transition $\theta \to \theta^*$. If $p(\theta^* \mid y) < p(\theta \mid y)$,
then for every accepted draw $\theta$ we should have on average $p(\theta^* \mid y) / p(\theta \mid y)$ accepted draws of $\theta^*$. We thus
accept the transition with probability $p(\theta^* \mid y) / p(\theta \mid y)$. Thus, for any proposed transition, we accept it with
probability $\min\left[1,\ p(\theta^* \mid y) / p(\theta \mid y)\right]$, which corresponds to $\rho(\theta^* \mid \theta)$ when the proposal distribution is symmetric,
$q(\theta^* \mid \theta) = q(\theta \mid \theta^*)$, as is the case in the original Metropolis algorithm.
The chain of draws so produced spends a relatively high proportion of time in the higher-density
regions and a lower proportion in the lower-density regions. Because these proportions of time
are balanced in the right way, the generated sequence of parameter draws has the desired marginal
distribution in the limit. A key practical aspect of this calculation is that the posterior constant of
integration is not needed, since $\rho(\theta^* \mid \theta)$ depends only on a posterior ratio.
A common choice is a random-walk proposal, $\theta^* = \theta^{(j)} + \varepsilon$ with $\varepsilon \sim N(0, \sigma^2)$,
for some variance $\sigma^2$. In practice, one will try several proposal distributions to find out which is most
suitable in terms of rejection rates and coverage of the parameter space.8
Other practical considerations include discarding a certain number of the first draws to reduce the
dependence on the starting point (burn-in), and only retaining every $d$-th iteration of the chain to
reduce the dependence between draws (thinning).
8 See Letham and Rudin (2012) for examples of the practice of MCMC simulation using the OpenBUGS software package.
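Putting the pieces together, here is a random-walk Metropolis sketch targeting the Bernoulli posterior under a flat prior; the proposal scale, chain length, burn-in, and thinning settings are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
y = rng.binomial(1, 0.3, size=100)       # simulated Bernoulli data
m, n = y.sum(), y.size

def log_post(theta):
    """Log of L(theta) * p(theta) with a flat prior on (0, 1)."""
    if theta <= 0.0 or theta >= 1.0:
        return -np.inf                   # zero posterior outside (0, 1)
    return m * np.log(theta) + (n - m) * np.log(1 - theta)

M, sigma, burn, thin = 50_000, 0.1, 5_000, 10
theta, draws = 0.5, []
for j in range(M):
    prop = theta + sigma * rng.standard_normal()        # symmetric proposal
    # Symmetric q cancels, so accept with prob min(1, posterior ratio).
    if np.log(rng.uniform()) < log_post(prop) - log_post(theta):
        theta = prop
    if j >= burn and j % thin == 0:                     # burn-in and thinning
        draws.append(theta)

draws = np.array(draws)
print(draws.mean(), np.quantile(draws, [0.05, 0.95]))   # point and 90% interval
```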
Transition kernel and convergence of the M-H algorithm The M-H algorithm describes
how to generate a parameter draw $\theta^{(j+1)}$ conditional on a parameter draw $\theta^{(j)}$. Since the proposal
distribution $q(\theta' \mid \theta)$ and the acceptance probability $\rho(\theta' \mid \theta)$ depend only on the current state, the
sequence of draws forms a Markov chain. The M-H transition kernel can be written as
$$K(\theta' \mid \theta) = q(\theta' \mid \theta)\, \rho(\theta' \mid \theta) + r(\theta)\, \delta_\theta(\theta'). \tag{9}$$
The first term, $q(\theta' \mid \theta)\, \rho(\theta' \mid \theta)$, is the density that $\theta'$ is proposed given $\theta$, times the probability that
it is accepted. To this we add the term $r(\theta)\, \delta_\theta(\theta')$, which gives the probability $r(\theta)$ that, conditional
on $\theta$, the proposal is rejected, times the Dirac delta function $\delta_\theta(\theta')$, equal to one if $\theta' = \theta$ and zero
otherwise. Here
$$r(\theta) = 1 - \int q(\theta' \mid \theta)\, \rho(\theta' \mid \theta)\, d\theta'.$$
If the proposal is rejected, then the algorithm sets $\theta^{(j+1)} = \theta^{(j)}$, which means that, conditional on the
rejection, the transition density contains a point mass at $\theta' = \theta$, which is captured by the Dirac delta
function.
For the M-H algorithm to generate a sequence of draws from $p(\theta \mid y)$, a necessary condition is that
the posterior distribution is an invariant distribution under the transition kernel (9), namely that it
satisfies condition (7). See Lancaster (2004, p. 213) or Herbst and Schorfheide (2015) for proofs that
$K(\theta' \mid \theta)$ satisfies the invariance property.
The Gibbs sampler is a fast sampling method that can be used in situations where we have access to
conditional distributions.
The idea behind the Gibbs sampler is to partition the parameter vector into two components,
$\theta = (\theta_1, \theta_2)$. Instead of sampling directly from $K(\theta^{(j+1)} \mid \theta^{(j)})$, one first samples $\theta_1^{(j+1)}$ from
$p(\theta_1 \mid \theta_2^{(j)}, y)$ and then samples $\theta_2^{(j+1)}$ from $p(\theta_2 \mid \theta_1^{(j+1)}, y)$. Clearly, if $(\theta_1^{(j)}, \theta_2^{(j)})$ is a draw from the
posterior distribution, so is $(\theta_1^{(j+1)}, \theta_2^{(j+1)})$ generated as above, so that the Gibbs sampler kernel
satisfies the invariance property; that is, it has $p(\theta_1, \theta_2 \mid y)$ as its stationary distribution (see Lancaster
2004, p. 209).
The Gibbs sampler kernel is
$$K(\theta_1', \theta_2' \mid \theta_1, \theta_2) = p(\theta_1' \mid \theta_2, y)\, p(\theta_2' \mid \theta_1', y).$$
It can be regarded as a special case of Metropolis-Hastings where the proposal distribution is taken
to be the conditional posterior distribution.
The Gibbs sampler is related to data augmentation. A probit model nicely illustrates this aspect
(Lancaster 2004, Example 4.17, p. 211).
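A sketch of a two-block Gibbs sampler for a normal model with unknown mean $\mu$ and precision $\tau$ under the reference prior $p(\mu, \tau) \propto 1/\tau$, where both full conditionals are available in closed form (data simulated; this is not the probit illustration referenced above).

```python
import numpy as np

rng = np.random.default_rng(5)
y = rng.normal(2.0, 1.5, size=200)       # simulated normal data
n, ybar = y.size, y.mean()

M = 10_000
mu, tau = ybar, 1.0                      # starting values
draws = np.empty((M, 2))
for j in range(M):
    # mu | tau, y ~ N(ybar, 1 / (n * tau))
    mu = rng.normal(ybar, 1.0 / np.sqrt(n * tau))
    # tau | mu, y ~ Gamma(n / 2, rate = sum((y - mu)^2) / 2)
    rate = 0.5 * np.sum((y - mu) ** 2)
    tau = rng.gamma(shape=n / 2.0, scale=1.0 / rate)
    draws[j] = mu, tau

print(draws[1000:].mean(axis=0))         # posterior means after burn-in
```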
Bibliographical note
A good textbook source on applied Bayesian methods is Gelman, Carlin, Stern, Dunson, Vehtari,
and Rubin (2014). Textbook treatments of Bayesian econometrics include Koop (2003), Lancaster
(2004), Geweke (2005), and Greenberg (2012).
Rothenberg's (1973) Cowles Foundation Monograph 23 provides a classic discussion of the use of a
priori information in frequentist and Bayesian approaches to econometrics.
Herbst and Schorfheide (2015) provide an up-to-date account of applications of Bayesian inference
to DSGE macro models.
Arellano and Bonhomme (2011) review nonlinear panel data models, drawing on the link between
random-e¤ects approaches and Bayesian computation.
Ruppert, Wand, and Carroll (2003) and Rossi (2014) discuss likelihood-based inference for flexible
nonlinear models.
Fiorentini, Sentana, and Shephard (2004) develop simulation-based Bayesian estimation methods for
time-series latent variable models of financial volatility.
The initial work on Bayesian asymptotics is due to Laplace. Further early work was done by
Bernstein (1917) and von Mises (1931). Textbook sources are Lehmann and Casella (1998) and van
der Vaart (1998). Chernozhukov and Hong (2003) provide a review of the literature.
The Metropolis-Hastings (M-H) algorithm was developed by Metropolis et al. in 1953 and generalized
by Hastings in 1970, but it was unknown to statisticians until the early 1990s. Tierney (1994) and Chib
and Greenberg (1995) created awareness of the algorithm and stimulated its use in statistics.
The name Gibbs sampler was introduced by Geman and Geman (1984) after the statistical physicist
Willard Gibbs.
Chib (2001) and Robert and Casella (1999) provide excellent treatments of MCMC methods.
In models with moment restrictions, Chernozhukov and Hong (2003) propose using a GMM-like
criterion function in place of the unknown likelihood to calculate quasi-posterior distributions by
MCMC methods.
References
[1] Arellano, Manuel, and Stéphane Bonhomme (2011): “Nonlinear Panel Data Analysis”, Annual
Review of Economics, 3, 395–424.
[2] Bernstein, S. (1917): Theory of Probability, 4th Edition 1946. Gostekhizdat, Moscow–Leningrad
(in Russian).
[3] Chernozhukov, Victor, and Han Hong (2003): "An MCMC approach to classical estimation",
Journal of Econometrics, 115, 293–346.
[4] Chib, Siddhartha (2001): "Markov chain Monte Carlo methods: computation and inference". In:
Heckman, J.J., Leamer, E. (Eds.), Handbook of Econometrics, Vol. 5. North-Holland, Amsterdam,
3564–3634 (Chapter 5).
[5] Chib, Siddhartha and Edward Greenberg (1995): “Understanding the Metropolis–Hastings Algo-
rithm”, The American Statistician, 49, 327–335.
[6] Fiorentini, Gabriele, Enrique Sentana, and Neil Shephard (2004): "Likelihood-Based Estimation
of Latent Generalized ARCH Structures", Econometrica, 72(5), 1481–1517.
[7] Gelman, Andrew, John Carlin, Hal Stern, David Dunson, Aki Vehtari, and Donald Rubin (2014):
Bayesian Data Analysis, Third Edition, CRC Press.
[8] Geman, Stuart and Donald Geman (1984): “Stochastic Relaxation, Gibbs Distributions, and the
Bayesian Restoration of Images”, IEEE Transactions on Pattern Analysis and Machine Intelli-
gence, 6(6), 721-741.
[9] Geweke, John (2005): Contemporary Bayesian Econometrics and Statistics, John Wiley & Sons.
[10] Greenberg, Edward (2012): Introduction to Bayesian econometrics, Cambridge University Press.
[11] Hastings, W. K. (1970): "Monte Carlo Sampling Methods Using Markov Chains and Their Ap-
plications", Biometrika, 57(1), 97–109.
[12] Herbst, Edward, and Frank Schorfheide (2015): Bayesian Estimation of DSGE Models, Princeton.
[13] Kim, Jae-Young (1998): "Large Sample Properties of Posterior Densities, Bayesian Information
Criterion and the Likelihood Principle in Nonstationary Time Series Models", Econometrica, 66,
359–380.
[14] Koop, Gary (2003): Bayesian Econometrics, John Wiley & Sons.
[15] Lancaster, Tony (2004): An Introduction to Modern Bayesian Econometrics, Blackwell Publishing.
[16] Lehmann, E. L. and George Casella (1998): Theory of Point Estimation, Second Edition, Springer.
[17] Letham, Ben and Cynthia Rudin (2012): “Probabilistic Modeling and Bayesian Analysis”, Pre-
diction, Machine Learning, and Statistics Lecture Notes, Sloan School of Management, MIT.
[18] Metropolis, N., A. Rosenbluth, M. Rosenbluth, A. Teller, and E. Teller (1953): "Equations of
State Calculations by Fast Computing Machines", Journal of Chemical Physics, 21, 1087–1092.
[19] Müller, Ulrich (2013): "Risk of Bayesian Inference in Misspecified Models, and the Sandwich
Covariance Matrix", Econometrica, 81(5), 1805–1849.
[20] Robert, C.P. and George Casella (1999): Monte Carlo Statistical Methods. Springer, Berlin.
[21] Rossi, Peter E. (2014): Bayesian Non- and Semi-parametric Methods and Applications, Princeton
University Press.
[22] Rothenberg, Thomas (1973): Efficient Estimation with A Priori Information, Cowles Foundation
Monograph 23, Yale University Press.
[23] Ruppert, David, M. P. Wand, and R. J. Carroll (2003): Semiparametric Regression, Cambridge
University Press.
[24] Sims, Christopher A. and Harald Uhlig (1991): "Understanding Unit Rooters: A Helicopter Tour"
Econometrica, 59(6), 1591–1599.
[25] Tierney, Luke (1994): "Markov Chains for Exploring Posterior Distributions" (with discussion),
Annals of Statistics, 22, 1701–1762.
[26] van der Vaart, A. W. (1998): Asymptotic Statistics, Cambridge University Press.
[27] von Mises, Richard (1931): Wahrscheinlichkeitsrechnung. Springer, Berlin (Probability Theory, in
German).