9.520 Class 19: Approximate Inference
Ruslan Salakhutdinov
BCS and CSAIL, MIT
Plan
1. Introduction/Notation.
2. Motivating examples: Bayesian PMF, Bayesian neural nets, undirected models.
3. Deterministic approximations: Laplace approximation, BIC, variational inference.
4. Stochastic approximations: simple Monte Carlo, importance sampling, MCMC (Metropolis-Hastings, Gibbs).
References/Acknowledgements
• Chris Bishop’s book: Pattern Recognition and Machine
Learning, chapter 11 (many figures are borrowed from this book).
Bayes Rule:
$$P(\theta|x) = \frac{P(x|\theta)\,P(\theta)}{P(x)}$$
where
$$P(x) = \int P(x, \theta)\, d\theta \quad \text{(marginalization)}$$
I will use probability distribution and probability density interchangeably. It should be obvious from the context.
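As a concrete sketch of the two formulas above, the following short Python example (my own illustration, not from the slides) computes the posterior over a coin's bias on a discretized grid, with the evidence P(x) obtained by marginalization:

```python
import numpy as np
from math import comb

# Discretized Bayes rule: theta is a coin's bias, x the number of
# heads observed in n flips.
theta = np.linspace(0.01, 0.99, 99)        # grid over the parameter
prior = np.ones_like(theta) / len(theta)   # uniform prior P(theta)

x, n = 7, 10                               # observed: 7 heads in 10 flips
likelihood = comb(n, x) * theta**x * (1 - theta)**(n - x)  # P(x|theta)

# Marginalization: P(x) = sum_theta P(x|theta) P(theta)
evidence = np.sum(likelihood * prior)

# Bayes rule: P(theta|x) = P(x|theta) P(theta) / P(x)
posterior = likelihood * prior / evidence
print("posterior mean:", np.sum(theta * posterior))   # ~ 0.67
```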
Inference Problem
Given a dataset D = {x1, ..., xn}:
Bayes Rule:
$$P(\theta|D) = \frac{P(D|\theta)\,P(\theta)}{P(D)}$$
where
• $P(D|\theta)$ — likelihood function of $\theta$
• $P(\theta)$ — prior probability of $\theta$
• $P(\theta|D)$ — posterior distribution over $\theta$
Prediction
For a new observation $x^*$, we average over the posterior:
$$P(x^*|D) = \int P(x^*|\theta, D)\, P(\theta|D)\, d\theta$$
Model Selection
To compare model classes $M$, we need the marginal likelihood:
$$P(D|M) = \int P(D|\theta, M)\, P(\theta|M)\, d\theta$$
Computational Challenges
• Computing marginal likelihoods often requires computing very high-dimensional integrals.
Bayesian PMF
[Figure: a sparsely observed user-by-movie rating matrix R, modeled as the product of low-rank user and movie feature matrices, R ≈ U⊤V.]
The resulting posterior over the user and movie feature matrices is intractable. Need to approximate.
Bayesian Neural Nets
Regression problem: given a set of i.i.d. observations $X = \{x_n\}_{n=1}^N$ with corresponding targets $D = \{t_n\}_{n=1}^N$.
Likelihood:
$$p(D|X, w) = \prod_{n=1}^{N} \mathcal{N}\big(t_n \mid y(x_n, w), \beta^2\big)$$
Bayesian Neural Nets
Likelihood:
$$p(D|X, w) = \prod_{n=1}^{N} \mathcal{N}\big(t_n \mid y(x_n, w), \beta^2\big)$$
Remark: Radford Neal (1994) showed that, under certain conditions, as the number of hidden units goes to infinity, a Gaussian prior over the parameters results in a Gaussian process prior over functions.
Undirected Models
x is a binary random vector with $x_i \in \{+1, -1\}$:
$$p(x) = \frac{1}{Z} \exp\Big( \sum_{(i,j)\in E} \theta_{ij}\, x_i x_j + \sum_{i \in V} \theta_i\, x_i \Big)$$
where the partition function Z sums over all $2^{|V|}$ configurations of x, which makes exact inference intractable for large graphs.
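To make the role of Z concrete, here is a brute-force Python sketch (the parameters and graph size are illustrative, not from the slides); the sum over all 2^d states is exactly what blows up in high dimensions:

```python
import itertools
import numpy as np

# Tiny pairwise binary model: log p~(x) = sum_{(i,j)} theta_ij x_i x_j
# + sum_i theta_i x_i, with a fully connected graph on d nodes.
d = 4
rng = np.random.default_rng(0)
theta_pair = {(i, j): rng.normal(0, 0.5)
              for i in range(d) for j in range(i + 1, d)}
theta_node = rng.normal(0, 0.5, size=d)

def unnorm_log_p(x):
    """log p~(x) for a configuration x in {-1, +1}^d."""
    pair = sum(t * x[i] * x[j] for (i, j), t in theta_pair.items())
    return pair + theta_node @ x

# The partition function: 2^d terms, intractable for large d.
states = [np.array(s) for s in itertools.product([-1, +1], repeat=d)]
Z = sum(np.exp(unnorm_log_p(x)) for x in states)

x = np.array([+1, -1, +1, +1])
print("p(x) =", np.exp(unnorm_log_p(x)) / Z)
```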
Inference
For most situations we will be interested in evaluating the expectation:
$$\mathbb{E}[f] = \int f(z)\, p(z)\, dz$$
We will use the following notation: $p(z) = \tilde{p}(z)/Z$, where $\tilde{p}(z)$ is easy to evaluate and $Z$ is the (typically unknown) normalizing constant.
Laplace Approximation
Consider:
$$p(z) = \frac{\tilde p(z)}{Z}$$
Goal: find a Gaussian approximation $q(z)$ which is centered on a mode of the distribution $p(z)$.
[Figure: a non-Gaussian density p(z) and its Gaussian approximation q(z) centered at the mode.]
Expanding $\ln \tilde p(z)$ to second order around a mode $z_0$, with $A = -\nabla\nabla \ln \tilde p(z)\big|_{z=z_0}$:
$$Z = \int \tilde p(z)\, dz \approx \tilde p(z_0) \int \exp\Big(-\tfrac{1}{2}(z - z_0)^\top A\, (z - z_0)\Big)\, dz = \tilde p(z_0)\, \frac{(2\pi)^{D/2}}{|A|^{1/2}}$$
Bayesian inference: $P(\theta|D) = \frac{1}{P(D)}\, P(D|\theta)\, P(\theta)$. The Laplace approximation to the posterior is
$$p(\theta|D) \approx \frac{|A|^{1/2}}{(2\pi)^{D/2}} \exp\Big(-\tfrac{1}{2}(\theta - \theta_{MAP})^\top A\, (\theta - \theta_{MAP})\Big)$$
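A minimal 1D sketch of this recipe, assuming a toy unnormalized density p̃ of my own choosing: find the mode numerically, estimate A by finite differences, and compare the Laplace estimate of Z against numerical integration.

```python
import numpy as np

# Toy unnormalized density (illustrative): the main bump plus a smaller
# one, so the Laplace approximation misses mass away from the mode.
def log_p_tilde(z):
    return np.log(np.exp(-0.5 * (z - 1.0) ** 2)
                  + 0.5 * np.exp(-0.5 * (z + 2.0) ** 2))

grid = np.linspace(-8.0, 8.0, 100_001)
z0 = grid[np.argmax(log_p_tilde(grid))]      # crude mode search

h = 1e-4                                     # A = -d^2/dz^2 ln p~(z) at z0
A = -(log_p_tilde(z0 + h) - 2 * log_p_tilde(z0) + log_p_tilde(z0 - h)) / h**2

Z_laplace = np.exp(log_p_tilde(z0)) * np.sqrt(2 * np.pi / A)
Z_true = np.exp(log_p_tilde(grid)).sum() * (grid[1] - grid[0])
print("Laplace Z:", Z_laplace, " numerical Z:", Z_true)
```

Because q is fit to a single mode, the Laplace estimate here underestimates Z, which also contains mass from the second bump.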
Laplace Approximation
Remember $p(z) = \tilde p(z)/Z$, where we approximate:
$$Z = \int \tilde p(z)\, dz \approx \tilde p(z_0) \int \exp\Big(-\tfrac{1}{2}(z - z_0)^\top A\, (z - z_0)\Big)\, dz = \tilde p(z_0)\, \frac{(2\pi)^{D/2}}{|A|^{1/2}}$$
Bayesian inference: $P(\theta|D) = \frac{1}{P(D)}\, P(D|\theta)\, P(\theta)$. Hence
$$\ln P(D) \approx \ln P(D|\theta_{MAP}) + \underbrace{\ln P(\theta_{MAP}) + \frac{D}{2}\ln 2\pi - \frac{1}{2}\ln |A|}_{\text{Occam factor: penalizes model complexity}}$$
Bayesian Information Criterion
BIC can be obtained from the Laplace approximation:
$$\ln P(D) \approx \ln P(D|\theta_{MAP}) + \ln P(\theta_{MAP}) + \frac{D}{2}\ln 2\pi - \frac{1}{2}\ln |A|$$
by taking the large-sample limit ($N \to \infty$), where $N$ is the number of data points:
$$\ln P(D) \approx \ln P(D|\theta_{MAP}) - \frac{D}{2}\ln N$$
• Quick, easy, and does not depend on the prior.
• Can use the maximum likelihood estimate of θ instead of the MAP estimate.
• D denotes the number of “well-determined parameters”.
• Danger: counting parameters can be tricky (e.g. infinite models).
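As a quick illustration (the model and data are mine, not from the slides), here is the BIC score for a Gaussian fitted by maximum likelihood, using the large-N form above with the ML estimate in place of the MAP estimate:

```python
import numpy as np

# BIC ~ ln P(D | theta_ML) - (D/2) ln N for a Gaussian fit.
rng = np.random.default_rng(1)
data = rng.normal(2.0, 1.5, size=500)

N = len(data)
mu, sigma = data.mean(), data.std()   # ML estimates (ddof=0)
D = 2                                 # number of fitted parameters
log_lik = np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                 - 0.5 * (data - mu) ** 2 / sigma**2)
bic = log_lik - 0.5 * D * np.log(N)
print("ln P(D|theta_ML) =", log_lik, " BIC =", bic)
```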
Variational Inference
Key Idea: Approximate the intractable distribution $p(\theta|D)$ with a simpler, tractable distribution $q(\theta)$.
We can lower-bound the marginal likelihood using Jensen’s inequality:
$$\ln p(D) = \ln \int p(D, \theta)\, d\theta = \ln \int q(\theta)\, \frac{p(D, \theta)}{q(\theta)}\, d\theta$$
$$\geq \int q(\theta) \ln \frac{p(D, \theta)}{q(\theta)}\, d\theta = \underbrace{\int q(\theta) \ln p(D, \theta)\, d\theta + \underbrace{\int q(\theta) \ln \frac{1}{q(\theta)}\, d\theta}_{\text{entropy functional}}}_{\text{variational lower bound}} = \mathcal{L}(q)$$
The variational lower bound can also be written as
$$\mathcal{L}(q) = \ln p(D) - KL\big(q(\theta)\,\|\,p(\theta|D)\big)$$
where $KL(q\|p)$ is the Kullback–Leibler divergence, a non-symmetric measure of the difference between two probability distributions q and p.
The goal of variational inference is to maximize the variational lower bound $\mathcal{L}(q)$ with respect to the approximating distribution $q$, or equivalently to minimize $KL(q\|p)$.
Variational Inference
Key Idea: Approximate the intractable distribution $p(\theta|D)$ with a simpler, tractable distribution $q(\theta)$ by minimizing $KL(q(\theta)\|p(\theta|D))$.
We can choose a fully factorized distribution, $q(\theta) = \prod_{i=1}^{D} q_i(\theta_i)$, also known as a mean-field approximation.
The variational lower bound takes the form:
$$\mathcal{L}(q) = \int q(\theta) \ln p(D, \theta)\, d\theta + \int q(\theta) \ln \frac{1}{q(\theta)}\, d\theta$$
$$= \int q_j(\theta_j) \underbrace{\left[ \int \ln p(D, \theta) \prod_{i \neq j} q_i(\theta_i)\, d\theta_i \right]}_{\mathbb{E}_{i \neq j}[\ln p(D, \theta)]} d\theta_j + \sum_i \int q_i(\theta_i) \ln \frac{1}{q_i(\theta_i)}\, d\theta_i$$
Suppose we keep $\{q_{i \neq j}\}$ fixed and maximize $\mathcal{L}(q)$ with respect to all possible forms of the distribution $q_j(\theta_j)$.
Variational Approximation
[Figure: the original distribution (yellow), along with the Laplace (red) and variational (green) approximations.]
By maximizing $\mathcal{L}(q)$ with respect to all possible forms of the distribution $q_j(\theta_j)$, we obtain the general expression:
$$q_j^*(\theta_j) = \frac{\exp\big(\mathbb{E}_{i \neq j}[\ln p(D, \theta)]\big)}{\int \exp\big(\mathbb{E}_{i \neq j}[\ln p(D, \theta)]\big)\, d\theta_j}$$
Iterative Procedure: Initialize all $q_j$ and then iterate through the factors, replacing each in turn with a revised estimate.
Convergence is guaranteed because the bound is convex with respect to each of the factors $q_j$ (see Bishop, chapter 10).
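A minimal sketch of this iterative procedure for the classic 2D Gaussian example (the target's numbers are my own; the mean updates follow from applying q_j* ∝ exp(E_{i≠j}[ln p]) to a Gaussian target with precision matrix Λ):

```python
import numpy as np

# Mean-field VI for a 2D Gaussian target p(z) = N(mu, inv(Lambda)).
# Each Gaussian factor q_i has fixed variance 1/Lambda_ii and a mean
# that depends on the current mean of the other factor.
mu = np.array([0.0, 0.0])
Lambda = np.array([[2.0, 1.2],    # precision matrix of the target
                   [1.2, 2.0]])

m = np.array([5.0, -5.0])         # initial means of q1, q2
for sweep in range(20):
    # Update q1 holding q2 fixed, then q2 holding q1 fixed.
    m[0] = mu[0] - (Lambda[0, 1] / Lambda[0, 0]) * (m[1] - mu[1])
    m[1] = mu[1] - (Lambda[1, 0] / Lambda[1, 1]) * (m[0] - mu[0])

print("mean-field means:", m, "(true mean:", mu, ")")
print("q variances:", 1 / Lambda[0, 0], 1 / Lambda[1, 1],
      "vs true marginal variances:", np.diag(np.linalg.inv(Lambda)))
```

The factor means converge to the true mean, but the factorized variances 1/Λ_ii understate the true marginal variances, a well-known property of mean-field approximations.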
Inference: Recap
For most situations we will be interested in evaluating the expectation:
$$\mathbb{E}[f] = \int f(z)\, p(z)\, dz$$
We will use the following notation: $p(z) = \tilde p(z)/Z$.
Simple Monte Carlo
General Idea: Draw independent samples $\{z^{(1)}, \ldots, z^{(N)}\}$ from the distribution $p(z)$ to approximate the expectation:
$$\mathbb{E}[f] = \int f(z)\, p(z)\, dz \approx \frac{1}{N} \sum_{n=1}^{N} f(z^{(n)}) = \hat f$$
Note that $\mathbb{E}[\hat f] = \mathbb{E}[f]$, so the estimator $\hat f$ has the correct mean (it is unbiased).
The variance:
$$\mathrm{var}[\hat f] = \frac{1}{N}\, \mathbb{E}\big[(f - \mathbb{E}[f])^2\big]$$
Remark: The accuracy of the estimator does not depend on the dimensionality of z.
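A minimal sketch (the target and f are my own choices): estimating E[z²] under a standard normal, where the exact answer is 1 and the standard error shrinks as 1/√N regardless of dimension:

```python
import numpy as np

# Simple Monte Carlo estimate of E[f] with f(z) = z^2, p(z) = N(0, 1).
rng = np.random.default_rng(0)
N = 100_000
z = rng.standard_normal(N)        # independent samples z^(n) ~ p(z)
f_hat = np.mean(z ** 2)           # (1/N) sum_n f(z^(n))
print(f_hat)                      # ~ 1.0; std error ~ sqrt(var[f]/N)
```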
Simple Monte Carlo
In general:
$$\int f(z)\, p(z)\, dz \approx \frac{1}{N} \sum_{n=1}^{N} f(z^{(n)}), \quad z^{(n)} \sim p(z)$$
Predictive distribution:
$$P(x^*|D) = \int P(x^*|\theta, D)\, P(\theta|D)\, d\theta \approx \frac{1}{N} \sum_{n=1}^{N} P(x^*|\theta^{(n)}, D), \quad \theta^{(n)} \sim p(\theta|D)$$
Basic Sampling Algorithm
How can we generate samples from simple non-uniform distributions, assuming we can generate samples from the uniform distribution?
Define the cumulative distribution function
$$h(y) = \int_{-\infty}^{y} p(\hat y)\, d\hat y$$
Then, if $u \sim \mathrm{Uniform}[0, 1]$, the transformed variable $y = h^{-1}(u)$ is distributed according to $p(y)$.
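For example, for the exponential distribution p(y) = λe^{−λy} the CDF inverts in closed form, giving the short sketch below (the choice of distribution is mine):

```python
import numpy as np

# Inverse-CDF sampling for p(y) = lam * exp(-lam * y), y >= 0:
# h(y) = 1 - exp(-lam * y), so h^{-1}(u) = -ln(1 - u) / lam.
rng = np.random.default_rng(0)
lam = 2.0
u = rng.uniform(size=100_000)          # u ~ Uniform[0, 1]
y = -np.log(1.0 - u) / lam             # y = h^{-1}(u) ~ Exponential(lam)
print(y.mean())                        # ~ 1/lam = 0.5
```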
Rejection Sampling
Sampling from the target distribution $p(z) = \tilde p(z)/Z_p$ is difficult.
Suppose we have an easy-to-sample proposal distribution $q(z)$ and a constant $k$ such that $k\, q(z) \geq \tilde p(z),\ \forall z$. We draw $z_0 \sim q(z)$ and accept it with probability $\tilde p(z_0)/(k\, q(z_0))$.
Rejection Sampling
The probability that a sample is accepted is:
$$p(\text{accept}) = \int \frac{\tilde p(z)}{k\, q(z)}\, q(z)\, dz = \frac{1}{k} \int \tilde p(z)\, dz$$
so the tightest valid k (the smallest one with $k\, q(z) \geq \tilde p(z)$) gives the highest acceptance rate.
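A minimal sketch with an illustrative target of my own choosing; here p̃(z) ≤ 2·exp(−z²/2), so k = 2σ√(2π) with a N(0, σ²) proposal guarantees the envelope condition k·q(z) ≥ p̃(z):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 2.0
k = 2.0 * sigma * np.sqrt(2 * np.pi)         # valid envelope constant

def p_tilde(z):
    # Illustrative unnormalized target, bounded by 2 * exp(-z^2/2).
    return np.exp(-0.5 * z**2) * (1 + np.sin(3 * z) ** 2)

def q_pdf(z):
    return np.exp(-0.5 * (z / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

z = rng.normal(0, sigma, size=100_000)       # z ~ q
u = rng.uniform(0, k * q_pdf(z))             # u ~ Uniform[0, k q(z)]
samples = z[u < p_tilde(z)]                  # accept if u < p~(z)
print("acceptance rate:", len(samples) / len(z))
```

The acceptance rate printed at the end estimates (1/k)∫p̃(z)dz, matching the formula above.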
Importance Sampling
Suppose we have an easy-to-sample proposal distribution q(z), such
that q(z) > 0 if p(z) > 0.
$$\mathbb{E}[f] = \int f(z)\, p(z)\, dz = \int f(z)\, \frac{p(z)}{q(z)}\, q(z)\, dz \approx \frac{1}{N} \sum_{n=1}^{N} \frac{p(z^{(n)})}{q(z^{(n)})}\, f(z^{(n)}), \quad z^{(n)} \sim q(z)$$
The quantities $w^{(n)} = p(z^{(n)})/q(z^{(n)})$ are known as importance weights.
Unlike rejection sampling, all samples are retained.
But wait: we cannot compute p(z), only p̃(z).
Importance Sampling
Let our proposal be of the form $q(z) = \tilde q(z)/Z_q$:
$$\mathbb{E}[f] = \int f(z)\, p(z)\, dz = \int f(z)\, \frac{p(z)}{q(z)}\, q(z)\, dz = \frac{Z_q}{Z_p} \int f(z)\, \frac{\tilde p(z)}{\tilde q(z)}\, q(z)\, dz$$
$$\approx \frac{Z_q}{Z_p}\, \frac{1}{N} \sum_{n} \frac{\tilde p(z^{(n)})}{\tilde q(z^{(n)})}\, f(z^{(n)}) = \frac{Z_q}{Z_p}\, \frac{1}{N} \sum_{n} w^{(n)} f(z^{(n)}), \quad z^{(n)} \sim q(z)$$
But we can use the same importance weights to approximate the ratio $Z_p/Z_q$:
$$\frac{Z_p}{Z_q} = \frac{1}{Z_q} \int \tilde p(z)\, dz = \int \frac{\tilde p(z)}{\tilde q(z)}\, q(z)\, dz \approx \frac{1}{N} \sum_{n} w^{(n)}$$
Hence:
$$\mathbb{E}[f] \approx \sum_{n} \frac{w^{(n)}}{\sum_m w^{(m)}}\, f(z^{(n)}) \qquad \text{(consistent but biased).}$$
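A minimal sketch, assuming we can evaluate only the unnormalized target p̃ (the particular target and proposal are mine): the self-normalized estimator needs neither Z_p nor Z_q:

```python
import numpy as np

rng = np.random.default_rng(0)

def p_tilde(z):
    # Unnormalized target: exp(-z^2/2 - z), i.e. N(-1, 1) in disguise.
    return np.exp(-0.5 * z**2 - z)

sigma = 2.0
z = rng.normal(0.0, sigma, size=200_000)    # z^(n) ~ q(z)
q_tilde = np.exp(-0.5 * (z / sigma) ** 2)   # unnormalized proposal q~

w = p_tilde(z) / q_tilde                    # importance weights w^(n)
estimate = np.sum(w * z) / np.sum(w)        # self-normalized estimator
print(estimate)                             # ~ -1.0 (the true mean)
```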
Problems
If our proposal distribution q(z) poorly matches our target distribution p(z), then:
• Rejection sampling: almost all samples are rejected.
• Importance sampling: the estimate is dominated by a few samples with large weights, and the variance of the estimator can be very large.
Markov Chains
A first-order Markov chain: a series of random variables $\{z^{(1)}, \ldots, z^{(N)}\}$ such that the following conditional independence property holds for $n \in \{1, \ldots, N-1\}$:
$$p(z^{(n+1)} \mid z^{(1)}, \ldots, z^{(n)}) = p(z^{(n+1)} \mid z^{(n)})$$
Markov Chains
The marginal probability of a particular state can be computed as:
$$p(z^{(n+1)}) = \sum_{z^{(n)}} T(z^{(n+1)} \leftarrow z^{(n)})\, p(z^{(n)})$$
where $T(z' \leftarrow z)$ denotes the transition probability from z to z'.
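To make the update concrete, here is a tiny sketch with a 3-state chain (the transition matrix is illustrative, not from the slides); with columns indexed by the source state, the marginal update is a matrix-vector product, and iterating it converges to an invariant distribution:

```python
import numpy as np

# T[i, j] = T(state i <- state j), so each column sums to one
# and the marginal update is p_{n+1} = T @ p_n.
T = np.array([[0.90, 0.10, 0.30],
              [0.05, 0.80, 0.30],
              [0.05, 0.10, 0.40]])
assert np.allclose(T.sum(axis=0), 1.0)

p = np.array([1.0, 0.0, 0.0])   # start deterministically in state 0
for n in range(100):
    p = T @ p                   # p(z^{n+1}) = sum_{z^n} T(z^{n+1} <- z^n) p(z^n)
print(p)                        # converges to the invariant distribution of T
```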
Detailed Balance
A sufficient (but not necessary) condition for ensuring that π(z) is
invariant is to choose a transition kernel that satisfies a detailed
balance property:
$$\pi(z')\, T(z \leftarrow z') = \pi(z)\, T(z' \leftarrow z)$$
Ergodicity: there exists a finite $K$ such that, for any starting state $z$, $T^K(z' \leftarrow z) > 0$ for all $z'$ with $\pi(z') > 0$.
Metropolis-Hastings Algorithm
A Markov chain transition operator from the current state z to a new state z' is defined as follows:
• A new ‘candidate’ state $z^*$ is proposed according to some proposal distribution $q(z^*|z)$, e.g. $\mathcal{N}(z, \sigma^2)$.
• The candidate state $z^*$ is accepted with probability:
$$\min\left(1,\ \frac{\tilde\pi(z^*)}{\tilde\pi(z)}\, \frac{q(z|z^*)}{q(z^*|z)}\right)$$
• If accepted, the chain moves to $z' = z^*$; otherwise the new state is a copy of the current state, $z' = z$.
Metropolis-Hastings Algorithm
We can show that the M-H transition kernel leaves $\pi(z)$ invariant by showing that it satisfies detailed balance:
$$\pi(z)\, T(z' \leftarrow z) = \pi(z)\, q(z'|z)\, \min\left(1,\ \frac{\pi(z')}{\pi(z)}\, \frac{q(z|z')}{q(z'|z)}\right)$$
$$= \min\big(\pi(z)\, q(z'|z),\ \pi(z')\, q(z|z')\big)$$
$$= \pi(z')\, q(z|z')\, \min\left(\frac{\pi(z)}{\pi(z')}\, \frac{q(z'|z)}{q(z|z')},\ 1\right) = \pi(z')\, T(z \leftarrow z')$$
Note that whether the chain is ergodic will depend on the particulars
of π and proposal distribution q.
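A minimal sketch of the algorithm for a 1D unnormalized target of my own choosing, with a symmetric Gaussian proposal so that the q-ratio in the acceptance probability cancels:

```python
import numpy as np

rng = np.random.default_rng(0)

def log_pi_tilde(z):
    # Unnormalized bimodal target: bumps near z = -2 and z = +2.
    return np.logaddexp(-0.5 * ((z + 2) / 0.5) ** 2,
                        -0.5 * ((z - 2) / 1.0) ** 2)

sigma = 1.0                       # proposal step size
z = 0.0
samples = []
for t in range(50_000):
    z_star = z + sigma * rng.standard_normal()   # z* ~ N(z, sigma^2)
    # Accept with prob min(1, pi~(z*)/pi~(z)); the q-ratio is 1 here.
    if np.log(rng.uniform()) < log_pi_tilde(z_star) - log_pi_tilde(z):
        z = z_star
    samples.append(z)             # rejected proposals repeat the state

samples = np.array(samples[5_000:])   # discard burn-in
print(samples.mean(), samples.std())
```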
Metropolis-Hastings Algorithm
[Figure: illustration of Metropolis-Hastings sampling with a Gaussian proposal distribution.]
Choice of Proposal
Proposal distribution: $q(z'|z) = \mathcal{N}(z, \rho^2)$. The scale ρ must be chosen carefully: if ρ is too small the chain explores the space slowly, and if ρ is too large most candidate moves are rejected.
Gibbs Sampler
Consider sampling from $p(z_1, \ldots, z_N)$. The Gibbs sampler updates one variable at a time, sampling each in turn from its conditional distribution given all the others:
$$z_i \sim p(z_i \mid z_1, \ldots, z_{i-1}, z_{i+1}, \ldots, z_N)$$
Cycling through the variables in this way defines a Markov chain whose stationary distribution is the joint distribution of interest. In Bayesian PMF, the samples $(u_i^{(n)}, v_j^{(n)})$ are generated by running such a Gibbs sampler, whose stationary distribution is the posterior distribution of interest.
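A minimal sketch of Gibbs sampling for a 2D Gaussian with correlation ρ (my own illustrative target), where both full conditionals are univariate Gaussians:

```python
import numpy as np

# For a standard bivariate Gaussian with correlation rho, the full
# conditionals are z1 | z2 ~ N(rho * z2, 1 - rho^2) and symmetrically.
rng = np.random.default_rng(0)
rho = 0.8
z1, z2 = 0.0, 0.0
samples = []
for t in range(50_000):
    z1 = rng.normal(rho * z2, np.sqrt(1 - rho**2))  # sample z1 | z2
    z2 = rng.normal(rho * z1, np.sqrt(1 - rho**2))  # sample z2 | z1
    samples.append((z1, z2))

samples = np.array(samples[5_000:])                 # discard burn-in
print("empirical correlation:", np.corrcoef(samples.T)[0, 1])  # ~ 0.8
```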
Bayesian PMF
Monte Carlo approximation:
$$p(r^*_{ij}|R) \approx \frac{1}{N} \sum_{n=1}^{N} p\big(r^*_{ij} \mid u_i^{(n)}, v_j^{(n)}\big)$$
The conditional distributions over the user and movie feature vectors are Gaussians, and hence easy to sample from:
$$p(u_i \mid R, V, \Theta_U, \alpha) = \mathcal{N}\big(u_i \mid \mu_i^*, \Sigma_i^*\big)$$
$$p(v_j \mid R, U, \Theta_V, \alpha) = \mathcal{N}\big(v_j \mid \mu_j^*, \Sigma_j^*\big)$$
MCMC: Main Problems
Main problems of MCMC:
• Hard to diagnose convergence (burn-in).
• Difficulty sampling from distributions with isolated modes.