The document discusses several computational methods for Bayesian model choice, including importance sampling, bridge sampling, harmonic means, Chib's solution, cross-model solutions, nested sampling, and ABC model choice. Importance sampling can be used to approximate Bayes factors by sampling from an importance distribution. Bridge sampling is a special case of importance sampling that uses the posterior distribution under one model as the importance distribution for another model.
1. On some computational methods for Bayesian model choice
On some computational methods for Bayesian model choice
Christian P. Robert
CREST-INSEE and Université Paris Dauphine
http://www.ceremade.dauphine.fr/~xian
MaxEnt 2009, Oxford, July 6, 2009
Joint works with M. Beaumont, N. Chopin,
J.-M. Cornuet, J.-M. Marin and D. Wraith
2. On some computational methods for Bayesian model choice
Outline
1 Evidence
2 Importance sampling solutions
3 Cross-model solutions
4 Nested sampling
5 ABC model choice
3. On some computational methods for Bayesian model choice
Evidence
Bayes tests
Formal construction of Bayes tests
Definition (Test)
Given an hypothesis H0 : θ ∈ Θ0 on the parameter θ ∈ Θ of a
statistical model, a test is a statistical procedure that takes its
values in {0, 1}.
Theorem (Bayes test)
The Bayes estimator associated with π and with the 0–1 loss is
δ^π(x) = 1 if π(θ ∈ Θ0 | x) > π(θ ∉ Θ0 | x), and 0 otherwise,
5. On some computational methods for Bayesian model choice
Evidence
Bayes factor
Bayes factor
Definition (Bayes factors)
For testing hypothesis H0 : θ ∈ Θ0 vs. Ha : θ ∉ Θ0, under the prior
π(Θ0)π0(θ) + π(Θ0^c)π1(θ),
the central quantity is
B01(x) = [π(Θ0|x)/π(Θ0^c|x)] ÷ [π(Θ0)/π(Θ0^c)] = ∫_{Θ0} f(x|θ)π0(θ) dθ / ∫_{Θ0^c} f(x|θ)π1(θ) dθ = m0(x)/m1(x)
[Jeffreys, 1939]
6. On some computational methods for Bayesian model choice
Evidence
Model choice
Model choice and model comparison
Choice between models
Several models available for the same observation x
Mi : x ∼ fi (x|θi ), i∈I
where the family I can be finite or infinite
Identical setup: Replace hypotheses with models but keep
marginal likelihoods and Bayes factors
8. On some computational methods for Bayesian model choice
Evidence
Model choice
Bayesian model choice
Probabilise the entire model/parameter space
allocate probabilities pi to all models Mi
define priors πi (θi ) for each parameter space Θi
compute
π(Mi|x) = pi ∫_{Θi} fi(x|θi) πi(θi) dθi / Σ_j pj ∫_{Θj} fj(x|θj) πj(θj) dθj
take largest π(Mi|x) to determine the “best” model, or use the averaged predictive
Σ_j π(Mj|x) ∫_{Θj} fj(x′|θj) πj(θj|x) dθj
12. On some computational methods for Bayesian model choice
Evidence
Evidence
Evidence
All these problems end up with a similar quantity, the evidence
Zk = ∫_{Θk} πk(θk) Lk(θk) dθk,
aka the marginal likelihood.
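As a quick sanity check of this quantity (not from the slides), the evidence of a toy conjugate model, where Z is known in closed form, can be recovered by a plain Monte Carlo average of the likelihood over prior draws; the N(0,1) prior and single N(θ,1) observation below are illustrative assumptions.

    import numpy as np

    # Toy model: x | theta ~ N(theta, 1), prior theta ~ N(0, 1).
    # Closed-form evidence: the marginal of x is N(0, 2).
    rng = np.random.default_rng(0)
    x = 1.3

    def likelihood(theta):
        return np.exp(-0.5 * (x - theta) ** 2) / np.sqrt(2 * np.pi)

    # Monte Carlo estimate of Z = E_prior[L(theta)]
    theta = rng.normal(0.0, 1.0, size=100_000)   # draws from the prior
    Z_mc = likelihood(theta).mean()

    Z_exact = np.exp(-0.25 * x ** 2) / np.sqrt(4 * np.pi)
    print(Z_mc, Z_exact)   # the two values should agree to 2-3 decimals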
13. On some computational methods for Bayesian model choice
Importance sampling solutions
Importance sampling revisited
1 Evidence
2 Importance sampling solutions
Regular importance
Harmonic means
Chib’s solution
3 Cross-model solutions
4 Nested sampling
5 ABC model choice
14. On some computational methods for Bayesian model choice
Importance sampling solutions
Regular importance
Bayes factor approximation
When approximating the Bayes factor
B01 = ∫_{Θ0} f0(x|θ0)π0(θ0) dθ0 / ∫_{Θ1} f1(x|θ1)π1(θ1) dθ1,
use of importance functions ϖ0 and ϖ1 and
B̂01 = [ n0⁻¹ Σ_{i=1}^{n0} f0(x|θ0^i)π0(θ0^i)/ϖ0(θ0^i) ] / [ n1⁻¹ Σ_{i=1}^{n1} f1(x|θ1^i)π1(θ1^i)/ϖ1(θ1^i) ]
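A minimal sketch of this importance sampling approximation of B01, assuming two hypothetical conjugate normal models and taking both importance functions ϖ0 and ϖ1 to be a Student t centred at the observation; none of these choices come from the talk.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    x = 2.0   # single observation (illustrative)

    # Hypothetical models: M0 has prior theta ~ N(0,1), M1 has prior theta ~ N(5,1);
    # both with likelihood x | theta ~ N(theta, 1).
    def integrand(theta, prior_mean):
        return stats.norm.pdf(x, theta, 1) * stats.norm.pdf(theta, prior_mean, 1)

    def evidence_is(prior_mean, n=50_000):
        prop = stats.t(df=3, loc=x, scale=1.5)      # importance function
        theta = prop.rvs(size=n, random_state=rng)
        w = integrand(theta, prior_mean) / prop.pdf(theta)
        return w.mean()

    B01 = evidence_is(0.0) / evidence_is(5.0)
    # exact check: the marginal of x is N(prior_mean, 2) under each model
    B01_exact = stats.norm.pdf(x, 0, np.sqrt(2)) / stats.norm.pdf(x, 5, np.sqrt(2))
    print(B01, B01_exact)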
15. On some computational methods for Bayesian model choice
Importance sampling solutions
Regular importance
Bridge sampling
Special case:
If
π1(θ1|x) ∝ π̃1(θ1|x)
π2(θ2|x) ∝ π̃2(θ2|x)
live on the same space (Θ1 = Θ2), then
B12 ≈ (1/n) Σ_{i=1}^n π̃1(θi|x)/π̃2(θi|x),   θi ∼ π2(θ|x)
[Gelman & Meng, 1998; Chen, Shao & Ibrahim, 2000]
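A hedged illustration of the identity: with two hypothetical unnormalised densities π̃1 and π̃2 on the same space (their normalising constants set to 3.0 and 0.5 so the answer is known), the average of π̃1/π̃2 over draws from π2 recovers B12.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)
    pi1_tilde = lambda t: 3.0 * stats.norm.pdf(t, 0.0, 1.0)   # Z1 = 3.0
    pi2_tilde = lambda t: 0.5 * stats.norm.pdf(t, 0.3, 1.0)   # Z2 = 0.5

    theta = rng.normal(0.3, 1.0, size=100_000)                # draws from pi2(theta|x)
    B12_hat = np.mean(pi1_tilde(theta) / pi2_tilde(theta))
    print(B12_hat)   # close to Z1/Z2 = 6; variance grows if the overlap is poor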
16. On some computational methods for Bayesian model choice
Importance sampling solutions
Regular importance
Bridge sampling variance
The bridge sampling estimator does poorly if
var(B̂12)/B12² ≈ (1/n) E[ { (π1(θ) − π2(θ)) / π2(θ) }² ]
is large, i.e. if π1 and π2 have little overlap...
19. On some computational methods for Bayesian model choice
Importance sampling solutions
Regular importance
An infamous example
When
α(θ) = 1 / {π̃1(θ) π̃2(θ)},
the harmonic mean approximation to B21 (and, by inversion, to B12) is
B̂21 = [ (1/n1) Σ_{i=1}^{n1} 1/π̃1(θ1i|x) ] / [ (1/n2) Σ_{i=1}^{n2} 1/π̃2(θ2i|x) ],   θji ∼ πj(θ|x)
[Newton & Raftery, 1994]
Infamous: Most often leads to an infinite variance!!!
[Radford Neal’s blog, 2008]
21. On some computational methods for Bayesian model choice
Importance sampling solutions
Regular importance
“The Worst Monte Carlo Method Ever”
“The good news is that the Law of Large Numbers guarantees that this estimator is consistent, i.e., it will very likely be very close to the correct answer if you use a sufficiently large number of points from the posterior distribution.
The bad news is that the number of points required for this estimator to get close to the right answer will often be greater than the number of atoms in the observable universe. The even worse news is that it’s easy for people to not realize this, and to naively accept estimates that are nowhere close to the correct value of the marginal likelihood.”
[Radford Neal’s blog, Aug. 23, 2008]
23. On some computational methods for Bayesian model choice
Importance sampling solutions
Regular importance
Optimal bridge sampling
The optimal choice of auxiliary function is
α⋆(θ) = (n1 + n2) / {n1 π1(θ|x) + n2 π2(θ|x)},
leading to
B̂12 ≈ [ (1/n2) Σ_{i=1}^{n2} π̃1(θ2i|x) / {n1 π1(θ2i|x) + n2 π2(θ2i|x)} ] / [ (1/n1) Σ_{i=1}^{n1} π̃2(θ1i|x) / {n1 π1(θ1i|x) + n2 π2(θ1i|x)} ],   θji ∼ πj(θ|x)
Back later!
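Because α⋆ involves the normalised π1 and π2, and hence the unknown B12 itself, one standard way to use it is the fixed-point iteration sketched below (a Meng and Wong type scheme, on the same hypothetical pair of unnormalised densities as above; not code from the talk).

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(3)
    pi1_tilde = lambda t: 3.0 * stats.norm.pdf(t, 0.0, 1.0)   # Z1 = 3.0
    pi2_tilde = lambda t: 0.5 * stats.norm.pdf(t, 0.3, 1.0)   # Z2 = 0.5

    n1 = n2 = 50_000
    th1 = rng.normal(0.0, 1.0, size=n1)   # draws from pi1(theta|x)
    th2 = rng.normal(0.3, 1.0, size=n2)   # draws from pi2(theta|x)

    r = 1.0                               # current guess of B12 = Z1/Z2
    for _ in range(50):
        # alpha* up to constants: 1 / (n1 pi1_tilde + n2 r pi2_tilde)
        num = np.mean(pi1_tilde(th2) / (n1 * pi1_tilde(th2) + n2 * r * pi2_tilde(th2)))
        den = np.mean(pi2_tilde(th1) / (n1 * pi1_tilde(th1) + n2 * r * pi2_tilde(th1)))
        r = num / den
    print(r)   # converges to Z1/Z2 = 6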
24. On some computational methods for Bayesian model choice
Importance sampling solutions
Regular importance
Optimal bridge sampling (2)
Reason (δ method):
Var(B̂12)/B12² ≈ (1/(n1 n2)) [ ∫ π1(θ)π2(θ){n1π1(θ) + n2π2(θ)} α(θ)² dθ / { ∫ π1(θ)π2(θ)α(θ) dθ }² − 1 ]
Drag: dependence on the unknown normalising constants, solved iteratively
26. On some computational methods for Bayesian model choice
Importance sampling solutions
Regular importance
Ratio importance sampling
Another identity:
B12 = Eϕ[π̃1(θ)/ϕ(θ)] / Eϕ[π̃2(θ)/ϕ(θ)]
for any density ϕ with sufficiently large support
[Torrie & Valleau, 1977]
Use of a single sample θ1, . . . , θn from ϕ:
B̂12 = Σ_{i=1}^n π̃1(θi)/ϕ(θi) / Σ_{i=1}^n π̃2(θi)/ϕ(θi)
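A short sketch of the single-sample ratio estimator, again on a hypothetical pair of unnormalised normal densities with known constants and a wide Student t instrumental density ϕ.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(4)
    pi1_tilde = lambda t: 3.0 * stats.norm.pdf(t, 0.0, 1.0)   # Z1 = 3.0
    pi2_tilde = lambda t: 0.5 * stats.norm.pdf(t, 0.3, 1.0)   # Z2 = 0.5

    phi = stats.t(df=3, loc=0.15, scale=2.0)                  # wide-support instrumental density
    theta = phi.rvs(size=100_000, random_state=rng)           # a single sample from phi
    w1 = pi1_tilde(theta) / phi.pdf(theta)
    w2 = pi2_tilde(theta) / phi.pdf(theta)
    print(w1.sum() / w2.sum())   # estimates B12 = Z1/Z2 = 6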
28. On some computational methods for Bayesian model choice
Importance sampling solutions
Regular importance
Ratio importance sampling (2)
Approximate variance:
var(B̂12)/B12² ≈ (1/n) Eϕ[ (π1(θ) − π2(θ))² / ϕ(θ)² ]
Optimal choice:
ϕ*(θ) = |π1(θ) − π2(θ)| / ∫ |π1(η) − π2(η)| dη
[Chen, Shao & Ibrahim, 2000]
Formally better than bridge sampling
31. On some computational methods for Bayesian model choice
Importance sampling solutions
Harmonic means
Approximating Zk from a posterior sample
Use of the [harmonic mean] identity
E^{πk}[ ϕ(θk) / {πk(θk)Lk(θk)} | x ] = ∫ [ ϕ(θk) / {πk(θk)Lk(θk)} ] [ πk(θk)Lk(θk) / Zk ] dθk = 1/Zk,
no matter what the proposal ϕ(·) is.
[Gelfand & Dey, 1994; Bartolucci et al., 2006]
Direct exploitation of the MCMC output
33. On some computational methods for Bayesian model choice
Importance sampling solutions
Harmonic means
Comparison with regular importance sampling
Harmonic mean: constraint opposed to the usual importance sampling constraints: ϕ(θ) must have lighter (rather than fatter) tails than πk(θk)Lk(θk) for the approximation
Ẑ1k = 1 / [ (1/T) Σ_{t=1}^T ϕ(θk^{(t)}) / {πk(θk^{(t)}) Lk(θk^{(t)})} ]
to have a finite variance.
E.g., use finite support kernels (like Epanechnikov’s kernel) for ϕ
35. On some computational methods for Bayesian model choice
Importance sampling solutions
Harmonic means
Comparison with regular importance sampling (cont’d)
Compare Ẑ1k with a standard importance sampling approximation
Ẑ2k = (1/T) Σ_{t=1}^T πk(θk^{(t)}) Lk(θk^{(t)}) / ϕ(θk^{(t)})
where the θk^{(t)}’s are generated from the density ϕ(·) (with fatter tails, like t’s)
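The contrast between the two estimators can be checked on a toy conjugate model where Z is known exactly; in the sketch below Ẑ1k uses a finite-support (hence light-tailed) uniform ϕ on a posterior credible interval and Ẑ2k uses a fat-tailed Student t proposal. All settings are illustrative assumptions.

    import numpy as np
    from scipy import stats

    # Toy model: x | theta ~ N(theta, 1) with a N(0, 1) prior, so the posterior is
    # N(x/2, 1/2) and the exact evidence is N(x; 0, 2).
    rng = np.random.default_rng(5)
    x = 1.3
    prior = stats.norm(0, 1)
    lik = lambda t: stats.norm.pdf(x, t, 1)
    post = stats.norm(x / 2, np.sqrt(0.5))
    Z_true = stats.norm.pdf(x, 0, np.sqrt(2))

    T = 100_000
    th_post = post.rvs(size=T, random_state=rng)      # stand-in for MCMC output
    lo, hi = post.ppf(0.025), post.ppf(0.975)
    phi_light = stats.uniform(lo, hi - lo)            # phi with lighter tails than pi*L
    Z1 = 1.0 / np.mean(phi_light.pdf(th_post) / (prior.pdf(th_post) * lik(th_post)))

    phi_fat = stats.t(df=3, loc=x / 2, scale=1.0)     # phi with fatter tails than pi*L
    th_is = phi_fat.rvs(size=T, random_state=rng)
    Z2 = np.mean(prior.pdf(th_is) * lik(th_is) / phi_fat.pdf(th_is))
    print(Z1, Z2, Z_true)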
36. On some computational methods for Bayesian model choice
Importance sampling solutions
Harmonic means
Approximating Zk using a mixture representation
Bridge sampling recall
Design a specific mixture for simulation [importance sampling]
purposes, with density
ϕk (θk ) ∝ ω1 πk (θk )Lk (θk ) + ϕ(θk ) ,
where ϕ(·) is arbitrary (but normalised)
Note: ω1 is not a probability weight
38. On some computational methods for Bayesian model choice
Importance sampling solutions
Harmonic means
Approximating Z using a mixture representation (cont’d)
Corresponding MCMC (=Gibbs) sampler
At iteration t
1 Take δ(t) = 1 with probability
ω1 πk(θk^{(t−1)}) Lk(θk^{(t−1)}) / { ω1 πk(θk^{(t−1)}) Lk(θk^{(t−1)}) + ϕ(θk^{(t−1)}) }
and δ(t) = 2 otherwise;
2 If δ(t) = 1, generate θk^{(t)} ∼ MCMC(θk^{(t−1)}, θk^{(t)}) where MCMC(θk, θk′) denotes an arbitrary MCMC kernel associated with the posterior πk(θk|x) ∝ πk(θk)Lk(θk);
3 If δ(t) = 2, generate θk^{(t)} ∼ ϕ(θk) independently
41. On some computational methods for Bayesian model choice
Importance sampling solutions
Harmonic means
Evidence approximation by mixtures
Rao-Blackwellised estimate
ξ̂ = (1/T) Σ_{t=1}^T ω1 πk(θk^{(t)}) Lk(θk^{(t)}) / { ω1 πk(θk^{(t)}) Lk(θk^{(t)}) + ϕ(θk^{(t)}) },
which converges to ω1 Zk / {ω1 Zk + 1}.
Deduce Ẑ3k from ω1 Ẑ3k / {ω1 Ẑ3k + 1} = ξ̂, i.e.
Ẑ3k = [ Σ_{t=1}^T ω1 πk(θk^{(t)}) Lk(θk^{(t)}) / {ω1 πk(θk^{(t)}) Lk(θk^{(t)}) + ϕ(θk^{(t)})} ] / [ ω1 Σ_{t=1}^T ϕ(θk^{(t)}) / {ω1 πk(θk^{(t)}) Lk(θk^{(t)}) + ϕ(θk^{(t)})} ]
[Bridge sampler redux]
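A schematic implementation of the whole scheme (mixture sampler plus Ẑ3k) on the same toy conjugate model; ω1, the auxiliary density ϕ and the random-walk MCMC step are arbitrary illustrative choices, not taken from the talk.

    import numpy as np
    from scipy import stats

    # Toy model: x | theta ~ N(theta, 1), theta ~ N(0, 1); exact evidence N(x; 0, 2).
    rng = np.random.default_rng(6)
    x = 1.3
    target = lambda t: stats.norm.pdf(t, 0, 1) * stats.norm.pdf(x, t, 1)   # pi_k(theta) L_k(theta)
    phi = stats.norm(x / 2, 1.0)                                           # normalised auxiliary density
    omega1 = 5.0

    T = 20_000
    theta, draws = 0.0, np.empty(T)
    for t in range(T):
        p1 = omega1 * target(theta) / (omega1 * target(theta) + phi.pdf(theta))
        if rng.random() < p1:
            # delta = 1: one Metropolis step targeting pi_k(theta | x)
            prop = theta + rng.normal(0, 0.5)
            if rng.random() < target(prop) / target(theta):
                theta = prop
        else:
            # delta = 2: independent draw from phi
            theta = phi.rvs(random_state=rng)
        draws[t] = theta

    w = omega1 * target(draws) / (omega1 * target(draws) + phi.pdf(draws))
    Z3 = w.sum() / (omega1 * (1.0 - w).sum())   # from omega1*Z/(omega1*Z + 1) = xi-hat
    print(Z3, stats.norm.pdf(x, 0, np.sqrt(2)))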
43. On some computational methods for Bayesian model choice
Importance sampling solutions
Chib’s solution
Chib’s representation
Direct application of Bayes’ theorem: given x ∼ fk(x|θk) and θk ∼ πk(θk),
Zk = mk(x) = fk(x|θk) πk(θk) / πk(θk|x),
an identity that holds for any value of θk.
Use of an approximation to the posterior at a fixed point θk*:
Ẑk = m̂k(x) = fk(x|θk*) πk(θk*) / π̂k(θk*|x).
45. On some computational methods for Bayesian model choice
Importance sampling solutions
Chib’s solution
Case of latent variables
For missing variable z as in mixture models, natural Rao-Blackwell estimate
π̂k(θk*|x) = (1/T) Σ_{t=1}^T πk(θk*|x, zk^{(t)}),
where the zk^{(t)}’s are Gibbs sampled latent variables
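A compact sketch of Chib's representation with this Rao-Blackwell step, on a hypothetical two-component mixture where only the weight p is unknown (so label switching does not arise); the data, prior and evaluation point p* are illustrative assumptions.

    import numpy as np
    from scipy import stats

    # Hypothetical mixture p N(0,1) + (1-p) N(3,1), unknown weight p with p ~ U(0,1).
    rng = np.random.default_rng(7)
    n = 50
    z0 = rng.random(n) < 0.3
    xs = np.where(z0, rng.normal(0, 1, n), rng.normal(3, 1, n))
    f0, f1 = stats.norm.pdf(xs, 0, 1), stats.norm.pdf(xs, 3, 1)
    loglik = lambda p: np.sum(np.log(p * f0 + (1 - p) * f1))

    T, p, p_star = 20_000, 0.5, 0.3            # p_star: a fixed point in the bulk of the posterior
    dens = np.empty(T)                         # pi(p* | x, z^(t)) along the Gibbs run
    for t in range(T):
        prob0 = p * f0 / (p * f0 + (1 - p) * f1)
        z = rng.random(n) < prob0              # latent allocations z^(t)
        n0 = int(z.sum())
        p = rng.beta(1 + n0, 1 + n - n0)       # full conditional of p
        dens[t] = stats.beta.pdf(p_star, 1 + n0, 1 + n - n0)

    log_post_star = np.log(dens.mean())        # Rao-Blackwell estimate of pi(p*|x)
    log_Z = loglik(p_star) + 0.0 - log_post_star   # log prior at p* is 0 under U(0,1)
    print(log_Z)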
46. On some computational methods for Bayesian model choice
Importance sampling solutions
Chib’s solution
Label switching impact
A mixture model [special case of missing variable model] is
invariant under permutations of the indices of the components.
E.g., mixtures
0.3N (0, 1) + 0.7N (2.3, 1)
and
0.7N (2.3, 1) + 0.3N (0, 1)
are exactly the same!
⇒ The component parameters θi are not identifiable marginally since they are exchangeable
48. On some computational methods for Bayesian model choice
Importance sampling solutions
Chib’s solution
Connected difficulties
1 Number of modes of the likelihood of order O(k!):
⇒ maximization and even [MCMC] exploration of the posterior surface harder
2 Under exchangeable priors on (θ, p) [prior invariant under permutation of the indices], all posterior marginals are identical:
⇒ posterior expectation of θ1 equal to posterior expectation of θ2
50. On some computational methods for Bayesian model choice
Importance sampling solutions
Chib’s solution
License
Since the Gibbs output does not exhibit exchangeability, the Gibbs sampler has not explored the whole parameter space: it lacks the energy to switch enough component allocations simultaneously.
[Figure: Gibbs output for the mixture, traces and pairwise plots of µi, pi and σi over 500 iterations.]
51. On some computational methods for Bayesian model choice
Importance sampling solutions
Chib’s solution
Label switching paradox
We should observe the exchangeability of the components [label
switching] to conclude about convergence of the Gibbs sampler.
If we observe it, then we do not know how to estimate the
parameters.
If we do not, then we are uncertain about the convergence!!!
54. On some computational methods for Bayesian model choice
Importance sampling solutions
Chib’s solution
Compensation for label switching
For mixture models, zk^{(t)} usually fails to visit all configurations in a balanced way, despite the symmetry predicted by the theory
πk(θk|x) = πk(σ(θk)|x) = (1/k!) Σ_{σ∈Sk} πk(σ(θk)|x)
for all σ’s in Sk, the set of all permutations of {1, . . . , k}.
Consequences on the numerical approximation, biased by a factor of order k!
Recover the theoretical symmetry by using
π̂k(θk*|x) = 1/(T k!) Σ_{σ∈Sk} Σ_{t=1}^T πk(σ(θk*)|x, zk^{(t)}).
[Berkhof, Mechelen, & Gelman, 2003]
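A schematic helper for this symmetrisation, assuming a user-supplied function cond_density(θ, z) returning πk(θ|x, z); it simply averages the Rao-Blackwell estimate above over all k! permutations.

    import math
    import numpy as np
    from itertools import permutations

    def symmetrised_density(theta_star, z_draws, cond_density):
        # Average pi(sigma(theta*) | x, z^(t)) over all sigma in S_k and over the T
        # Gibbs draws z^(t), as in Berkhof, van Mechelen & Gelman (2003).
        # theta_star   : (k, d) array of component-specific parameters
        # z_draws      : iterable of T allocation vectors from the Gibbs run
        # cond_density : assumed user-supplied function (theta, z) -> density value
        k = theta_star.shape[0]
        total = 0.0
        for sigma in permutations(range(k)):
            permuted = theta_star[list(sigma)]
            total += np.mean([cond_density(permuted, z) for z in z_draws])
        return total / math.factorial(k)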
56. On some computational methods for Bayesian model choice
Importance sampling solutions
Chib’s solution
Galaxy dataset
n = 82 galaxies as a mixture of k normal distributions with both
mean and variance unknown.
[Roeder, 1992]
[Figure: relative frequency histogram of the galaxy data with the average density estimate overlaid.]
57. On some computational methods for Bayesian model choice
Importance sampling solutions
Chib’s solution
Galaxy dataset (k)
Using only the original estimate, with θk* taken as the MAP estimator,
log(m̂k(x)) = −105.1396
for k = 3 (based on 10³ simulations), while introducing the permutations leads to
log(m̂k(x)) = −103.3479
Note that
−105.1396 + log(3!) = −103.3479

k        2        3        4        5        6        7        8
m̂k(x)   −115.68  −103.35  −102.66  −101.93  −102.88  −105.48  −108.44

Estimates of the marginal likelihoods by the symmetrised Chib’s approximation (based on 10⁵ Gibbs iterations and, for k > 5, 100 permutations selected at random in Sk).
[Lee, Marin, Mengersen & Robert, 2008]
60. On some computational methods for Bayesian model choice
Cross-model solutions
Cross-model solutions
1 Evidence
2 Importance sampling solutions
3 Cross-model solutions
Reversible jump
Saturation schemes
Implementation error
4 Nested sampling
5 ABC model choice
61. On some computational methods for Bayesian model choice
Cross-model solutions
Reversible jump
Reversible jump
Idea: Set up a proper measure-theoretic framework for designing moves between models Mk
[Green, 1995]
Create a reversible kernel K on H = ∪k {k} × Θk such that
∫_A ∫_B K(x, dy) π(x) dx = ∫_B ∫_A K(y, dx) π(y) dy
for the invariant density π [x is of the form (k, θ(k))]
63. On some computational methods for Bayesian model choice
Cross-model solutions
Reversible jump
Local moves
For a move between two models, M1 and M2 , the Markov chain
being in state θ1 ∈ M1 , denote by K1→2 (θ1 , dθ) and K2→1 (θ2 , dθ)
the corresponding kernels, under the detailed balance condition
π(dθ1 ) K1→2 (θ1 , dθ) = π(dθ2 ) K2→1 (θ2 , dθ) ,
and take, wlog, dim(M2 ) > dim(M1 ).
Proposal expressed as
θ2 = Ψ1→2 (θ1 , v1→2 )
where v1→2 is a random variable of dimension
dim(M2 ) − dim(M1 ), generated as
v1→2 ∼ ϕ1→2 (v1→2 ) .
65. On some computational methods for Bayesian model choice
Cross-model solutions
Reversible jump
Local moves (2)
In this case, q1→2(θ1, dθ2) has density
ϕ1→2(v1→2) |∂Ψ1→2(θ1, v1→2)/∂(θ1, v1→2)|⁻¹,
by the Jacobian rule.
If ϖ1→2 is the probability of choosing a move to M2 while in M1, the acceptance probability reduces to
α(θ1, v1→2) = 1 ∧ [ π(M2, θ2) ϖ2→1 / {π(M1, θ1) ϖ1→2 ϕ1→2(v1→2)} ] |∂Ψ1→2(θ1, v1→2)/∂(θ1, v1→2)|.
If several models are considered simultaneously, with probability ϖ1→2 of choosing a move to M2 while in M1, as in
K(x, B) = Σ_{m=1}^∞ ∫ ρm(x, y) qm(x, dy) + ω(x) I_B(x)
68. On some computational methods for Bayesian model choice
Cross-model solutions
Reversible jump
Generic reversible jump acceptance probability
Acceptance probability of θ2 = Ψ1→2(θ1, v1→2) is
α(θ1, v1→2) = 1 ∧ [ π(M2, θ2) ϖ2→1 / {π(M1, θ1) ϖ1→2 ϕ1→2(v1→2)} ] |∂Ψ1→2(θ1, v1→2)/∂(θ1, v1→2)|,
while the acceptance probability of θ1 with (θ1, v1→2) = Ψ1→2⁻¹(θ2) is
α(θ1, v1→2) = 1 ∧ [ π(M1, θ1) ϖ1→2 ϕ1→2(v1→2) / {π(M2, θ2) ϖ2→1} ] |∂Ψ1→2(θ1, v1→2)/∂(θ1, v1→2)|⁻¹
⇒ Difficult calibration
71. On some computational methods for Bayesian model choice
Cross-model solutions
Reversible jump
Green’s sampler
Algorithm
Iteration t (t ≥ 1): if x(t) = (m, θ(m)),
1 Select model Mn with probability πmn
2 Generate umn ∼ ϕmn(u) and set (θ(n), vnm) = Ψm→n(θ(m), umn)
3 Take x(t+1) = (n, θ(n)) with probability
min{ [ π(n, θ(n)) πnm ϕnm(vnm) / {π(m, θ(m)) πmn ϕmn(umn)} ] |∂Ψm→n(θ(m), umn)/∂(θ(m), umn)| , 1 }
and take x(t+1) = x(t) otherwise.
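A minimal reversible jump sketch in the spirit of this algorithm, for two hypothetical nested models (dimensions 1 and 2) whose posterior model probabilities are known, so the output can be checked; the completion v ∼ N(0,1) with Ψ(θ, v) = (θ, v) and all weights are illustrative assumptions.

    import numpy as np

    # Joint target: pi(1, t) = w1 N(t; 0, 1) and pi(2, t) = w2 N(t1; 0, 1) N(t2; 0, 1),
    # so the chain should spend a fraction w1/(w1 + w2) of its time in model 1.
    rng = np.random.default_rng(8)
    w1, w2 = 0.3, 0.7
    log_norm = lambda v: -0.5 * v ** 2 - 0.5 * np.log(2 * np.pi)   # N(0,1) log-density

    def log_target(model, theta):
        if model == 1:
            return np.log(w1) + log_norm(theta[0])
        return np.log(w2) + log_norm(theta).sum()

    model, theta = 1, np.array([0.0])
    n_iter, visits1 = 20_000, 0
    for t in range(n_iter):
        if model == 1:                       # jump up: complete theta with v, Jacobian 1
            v = rng.normal()
            prop = np.array([theta[0], v])
            log_alpha = log_target(2, prop) - log_target(1, theta) - log_norm(v)
            if np.log(rng.random()) < log_alpha:
                model, theta = 2, prop
        else:                                # reverse (dimension-dropping) jump
            prop = theta[:1]
            log_alpha = log_target(1, prop) + log_norm(theta[1]) - log_target(2, theta)
            if np.log(rng.random()) < log_alpha:
                model, theta = 1, prop
        # within-model refreshment (exact draw here: each conditional is standard normal)
        theta = rng.normal(size=theta.shape)
        visits1 += model == 1
    print(visits1 / n_iter)                  # should be close to w1/(w1 + w2) = 0.3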
72. On some computational methods for Bayesian model choice
Cross-model solutions
Reversible jump
Interpretation
The representation puts us back in a fixed dimension setting:
M1 × V1→2 and M2 in one-to-one relation.
reversibility imposes that θ1 is derived as (θ1, v1→2) = Ψ1→2⁻¹(θ2)
appears like a regular Metropolis–Hastings move from the couple (θ1, v1→2) to θ2 when the stationary distributions are π(M1, θ1) × ϕ1→2(v1→2) and π(M2, θ2), and when the proposal distribution is deterministic
74. On some computational methods for Bayesian model choice
Cross-model solutions
Saturation schemes
Alternative
Saturation of the parameter space H = ∪k {k} × Θk by creating
θ = (θ1, . . . , θD)
a model index M
pseudo-priors πj(θj|M = k) for j ≠ k
[Carlin & Chib, 1995]
Validation by
P(M = k|x) = ∫ P(M = k|x, θ) π(θ|x) dθ ∝ pk Zk,
where the (marginal) posterior is [not πk]
π(θ|x) = Σ_{k=1}^D P(θ, M = k|x) ∝ Σ_{k=1}^D pk Zk πk(θk|x) Π_{j≠k} πj(θj|M = k).
76. On some computational methods for Bayesian model choice
Cross-model solutions
Saturation schemes
MCMC implementation
Run a Markov chain (M(t), θ1^{(t)}, . . . , θD^{(t)}) with stationary distribution π(θ, M|x) by
1 Pick M(t) = k with probability π(θ(t−1), k|x)
2 Generate θk^{(t)} from the posterior πk(θk|x) [or an MCMC step]
3 Generate θj^{(t)} (j ≠ k) from the pseudo-prior πj(θj|M = k)
Approximate P(M = k|x) by
p̌k(x) ∝ pk Σ_{t=1}^T [ fk(x|θk^{(t)}) πk(θk^{(t)}) Π_{j≠k} πj(θj^{(t)}|M = k) ] / [ Σ_{ℓ=1}^D pℓ fℓ(x|θℓ^{(t)}) πℓ(θℓ^{(t)}) Π_{j≠ℓ} πj(θj^{(t)}|M = ℓ) ]
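A small sketch of this saturated sampler for two hypothetical conjugate normal models, with pseudo-priors set equal to the priors (a legal, if inefficient, choice); the Rao-Blackwellised average of the full-conditional model probabilities plays the role of p̌k and is compared with the exact P(M = k|x). All settings are illustrative assumptions.

    import numpy as np
    from scipy import stats

    # M1: x ~ N(theta1, 1), theta1 ~ N(0, 1);  M2: x ~ N(theta2, 1), theta2 ~ N(5, 1).
    rng = np.random.default_rng(9)
    x = 2.0
    p = np.array([0.5, 0.5])
    priors = [stats.norm(0, 1), stats.norm(5, 1)]
    posts = [stats.norm(x / 2, np.sqrt(0.5)), stats.norm((5 + x) / 2, np.sqrt(0.5))]

    def joint(k, th):
        # p_k f_k(x|theta_k) pi_k(theta_k) times the pseudo-prior of the other block
        j = 1 - k
        return p[k] * stats.norm.pdf(x, th[k], 1) * priors[k].pdf(th[k]) * priors[j].pdf(th[j])

    T = 10_000
    th = np.array([0.0, 5.0])
    rb = np.zeros(2)                                    # Rao-Blackwellised model probabilities
    for t in range(T):
        w = np.array([joint(0, th), joint(1, th)])
        w /= w.sum()
        rb += w
        k = rng.choice(2, p=w)                          # 1. pick the model index M^(t)
        th[k] = posts[k].rvs(random_state=rng)          # 2. theta_k from its posterior
        th[1 - k] = priors[1 - k].rvs(random_state=rng) # 3. theta_j from its pseudo-prior

    m = np.array([stats.norm.pdf(x, 0, np.sqrt(2)), stats.norm.pdf(x, 5, np.sqrt(2))])
    print(rb / T, p * m / (p * m).sum())                # estimate vs. exact P(M = k|x)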
79. On some computational methods for Bayesian model choice
Cross-model solutions
Implementation error
Scott’s (2002) proposal
Suggests estimating P(M = k|x) by
P̂(M = k|x) ∝ pk Σ_{t=1}^T [ fk(x|θk^{(t)}) / Σ_{j=1}^D pj fj(x|θj^{(t)}) ],
based on D simultaneous and independent MCMC chains (θk^{(t)})_t, 1 ≤ k ≤ D, with stationary distributions πk(θk|x) [instead of the above joint]
81. On some computational methods for Bayesian model choice
Cross-model solutions
Implementation error
Congdon’s (2006) extension
Selecting flat [prohibited] pseudo-priors, Congdon uses instead
P̂(M = k|x) ∝ pk Σ_{t=1}^T [ fk(x|θk^{(t)}) πk(θk^{(t)}) / Σ_{j=1}^D pj fj(x|θj^{(t)}) πj(θj^{(t)}) ],
where again the θk^{(t)}’s are MCMC chains with stationary distributions πk(θk|x)
82. On some computational methods for Bayesian model choice
Cross-model solutions
Implementation error
Examples
Example (Model choice)
Model M1: x|θ ∼ U(0, θ) with prior θ ∼ Exp(1) versus model M2: x|θ ∼ Exp(θ) with prior θ ∼ Exp(1). Equal prior weights on both models: ϖ1 = ϖ2 = 0.5.
[Figure: approximations of P(M = 1|x) as a function of y: Scott’s (2002) in blue and Congdon’s (2006) in red; N = 10⁶ simulations.]
84. On some computational methods for Bayesian model choice
Cross-model solutions
Implementation error
Examples (2)
Example (Model choice (2))
Normal model M1: x ∼ N(θ, 1) with θ ∼ N(0, 1) vs. normal model M2: x ∼ N(θ, 1) with θ ∼ N(5, 1)
[Figure: comparison of both approximations with P(M = 1|x) as a function of y: Scott’s (2002) (green, mixed dashes) and Congdon’s (2006) (brown, long dashes); N = 10⁴ simulations.]
85. On some computational methods for Bayesian model choice
Cross-model solutions
Implementation error
Examples (3)
Example (Model choice (3))
Model M1: x ∼ N(0, 1/ω) with ω ∼ Exp(a) vs. M2: exp(x) ∼ Exp(λ) with λ ∼ Exp(b).
[Figure: comparison of Congdon’s (2006) approximation (brown, dashed) with P(M = 1|x) as a function of y, for (a, b) equal to (.24, 8.9), (.56, .7), (4.1, .46) and (.98, .081), respectively; N = 10⁴ simulations.]
86. On some computational methods for Bayesian model choice
Nested sampling
Nested sampling
1 Evidence
2 Importance sampling solutions
3 Cross-model solutions
4 Nested sampling
Purpose
Implementation
Error rates
Constraints
Importance variant
A mixture comparison
87. On some computational methods for Bayesian model choice
Nested sampling
Purpose
Nested sampling: Goal
Skilling’s (2007) technique using the one-dimensional representation:
Z = E^π[L(θ)] = ∫_0^1 ϕ(x) dx
with
ϕ⁻¹(l) = P^π(L(θ) > l).
Note: ϕ(·) is intractable in most cases.
88. On some computational methods for Bayesian model choice
Nested sampling
Implementation
Nested sampling: First approximation
Approximate Z by a Riemann sum:
Ẑ = Σ_{i=1}^N (x_{i−1} − x_i) ϕ(x_i)
where the x_i’s are either
deterministic: x_i = e^{−i/N},
or random:
x_0 = 1,  x_{i+1} = t_i x_i,  t_i ∼ Be(N, 1),
so that E[log x_i] = −i/N.
89. On some computational methods for Bayesian model choice
Nested sampling
Implementation
Extraneous white noise
Take
Z = ∫ e^{−θ} dθ = ∫ δ⁻¹ e^{−(1−δ)θ} · δ e^{−δθ} dθ = E_δ[ δ⁻¹ e^{−(1−δ)θ} ]
Ẑ = Σ_{i=1}^N δ⁻¹ e^{−(1−δ)θ_i} (x_{i−1} − x_i),   θ_i ∼ E(δ) I(θ_i ≤ θ_{i−1})

Comparison of variances and MSEs:
N     deterministic   random
50    4.64            10.5
      4.65            10.5
100   2.47            4.9
      2.48            5.02
500   .549            1.01
      .550            1.14
91. On some computational methods for Bayesian model choice
Nested sampling
Implementation
Nested sampling: Alternative representation
Another representation is
Z ≈ Σ_{i=0}^{N−1} {ϕ(x_{i+1}) − ϕ(x_i)} x_i
which is a special case of
Z ≈ Σ_{i=0}^{N−1} {L(θ_{(i+1)}) − L(θ_{(i)})} π({θ; L(θ) > L(θ_{(i)})})
where the θ_{(i)}’s are ordered by increasing likelihood, L(θ_{(i)}) < L(θ_{(i+1)}) < · · ·
[Lebesgue version of Riemann’s sum]
93. On some computational methods for Bayesian model choice
Nested sampling
Implementation
Nested sampling: Second approximation
Estimate the (intractable) ϕ(x_i) by ϕ_i:
Nested sampling
Start with N values θ1, . . . , θN sampled from π.
At iteration i,
1 Take ϕ_i = L(θ_k), where θ_k is the point with smallest likelihood in the current pool;
2 Replace θ_k with a sample from the prior constrained to L(θ) > ϕ_i: the current N points are then sampled from the prior constrained to L(θ) > ϕ_i.
Note that
π({θ; L(θ) > L(θ_{(i+1)})}) / π({θ; L(θ) > L(θ_{(i)})}) ≈ 1 − 1/N
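A bare-bones nested sampling sketch following these two steps, on a toy model with known evidence; the constrained-prior step is done by naive rejection from the prior (workable here only), and the fixed iteration budget is an illustrative stopping rule.

    import numpy as np

    # Toy model: x | theta ~ N(theta, 1), theta ~ N(0, 1); exact evidence N(x; 0, 2).
    rng = np.random.default_rng(10)
    x = 1.3
    L = lambda t: np.exp(-0.5 * (x - t) ** 2) / np.sqrt(2 * np.pi)   # likelihood

    N = 500
    theta = rng.normal(0, 1, size=N)       # N points from the prior
    Z, x_prev = 0.0, 1.0
    for i in range(1, 6 * N + 1):          # fixed budget, i.e. truncation at e^(-6)
        k = int(np.argmin(L(theta)))
        phi_i = L(theta[k])                # smallest likelihood in the current pool
        x_i = np.exp(-i / N)               # deterministic prior-volume schedule
        Z += (x_prev - x_i) * phi_i        # Riemann-sum accumulation
        x_prev = x_i
        while True:                        # replace theta_k under the constraint L > phi_i
            cand = rng.normal(0, 1)
            if L(cand) > phi_i:
                theta[k] = cand
                break
    print(Z, np.exp(-0.25 * x ** 2) / np.sqrt(4 * np.pi))   # estimate vs exact evidence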
97. On some computational methods for Bayesian model choice
Nested sampling
Implementation
Nested sampling: Third approximation
Iterate the above steps until a given stopping iteration j is reached: e.g.,
observe very small changes in the approximation Ẑ;
reach the maximal value of L(θ) when the likelihood is bounded and its maximum is known;
truncate the integral Z at level ε, i.e. replace
∫_0^1 ϕ(x) dx with ∫_ε^1 ϕ(x) dx
98. On some computational methods for Bayesian model choice
Nested sampling
Error rates
Approximation error
Error = Ẑ − Z
 = Σ_{i=1}^j (x_{i−1} − x_i) ϕ_i − ∫_0^1 ϕ(x) dx
 = − ∫_0^ε ϕ(x) dx
 + [ Σ_{i=1}^j (x_{i−1} − x_i) ϕ(x_i) − ∫_ε^1 ϕ(x) dx ]   (Quadrature Error)
 + Σ_{i=1}^j (x_{i−1} − x_i) {ϕ_i − ϕ(x_i)}   (Stochastic Error)
[Dominated by Monte Carlo]
99. On some computational methods for Bayesian model choice
Nested sampling
Error rates
A CLT for the Stochastic Error
The (dominating) stochastic error is OP (N −1/2 ):
D
N 1/2 Z − Z → N (0, V )
with
V =− sϕ (s)tϕ (t) log(s ∨ t) ds dt.
s,t∈[ ,1]
[Proof based on Donsker’s theorem]
100. On some computational methods for Bayesian model choice
Nested sampling
Error rates
What of log Z?
If the interest lies in log Z, Slutsky’s transform of the CLT gives
N^{1/2} (log Ẑ − log Z) →_D N(0, V/Z²)
Note: the number of simulated points equals the number of iterations j, and is a multiple of N: if one stops at the first iteration j such that e^{−j/N} < ε, then
j = N ⌈− log ε⌉.
103. On some computational methods for Bayesian model choice
Nested sampling
Constraints
Sampling from constr’d priors
Exact simulation from the constrained prior is intractable in most
cases
Skilling (2007) proposes to use MCMC but this introduces a bias
(stopping rule)
If implementable, then slice sampler can be devised at the same
cost [or less]
106. On some computational methods for Bayesian model choice
Nested sampling
Constraints
Banana illustration
Case of a banana-shaped target made of a twisted 2D normal:
x̃2 = x2 + β x1² − 100β
[Haario, Saksman, Tamminen, 1999]
β = .03, σ = (100, 1)
107. On some computational methods for Bayesian model choice
Nested sampling
Constraints
Banana illustration (2)
Use of nested sampling with N = 1000, 50 MCMC steps with size
0.1, compared with a population Monte Carlo (PMC) based on 10
iterations with 5000 points per iteration and final sample of 50000
points, using nine Student’s t components with 9 df
[Wraith, Kilbinger, Benabed et al., 2009, Phys. Rev. D]
Evidence estimation
108. On some computational methods for Bayesian model choice
Nested sampling
Constraints
Banana illustration (2)
Use of nested sampling with N = 1000, 50 MCMC steps with size
0.1, compared with a population Monte Carlo (PMC) based on 10
iterations with 5000 points per iteration and final sample of 50000
points, using nine Student’s t components with 9 df
[Wraith, Kilbinger, Benabed et al., 2009, Phys. Rev. D]
E[X1 ] estimation
109. On some computational methods for Bayesian model choice
Nested sampling
Constraints
Banana illustration (2)
Use of nested sampling with N = 1000, 50 MCMC steps with size
0.1, compared with a population Monte Carlo (PMC) based on 10
iterations with 5000 points per iteration and final sample of 50000
points, using nine Student’s t components with 9 df
[Wraith, Kilbinger, Benabed et al., 2009, Phys. Rev. D]
E[X2 ] estimation
110. On some computational methods for Bayesian model choice
Nested sampling
Importance variant
An IS variant of nested sampling
Consider an instrumental prior π̃ and likelihood L̃, with weight function
w(θ) = π(θ)L(θ) / {π̃(θ)L̃(θ)}
and weighted NS estimator
Ẑ = Σ_{i=1}^j (x_{i−1} − x_i) ϕ_i w(θ_i).
Then choose (π̃, L̃) so that sampling from π̃ constrained to L̃(θ) > l is easy; e.g. N(c, Id) constrained to ‖c − θ‖ < r.
112. On some computational methods for Bayesian model choice
Nested sampling
A mixture comparison
Mixture pN (0, 1) + (1 − p)N (µ, σ) posterior
Posterior on (µ, σ) for n observations with µ = 2 and σ = 3/2, when p is known
Use of a uniform prior both on (−2, 6) for µ and on (.001, 16) for log σ²
Occurrences of posterior bursts for µ = xi
113. On some computational methods for Bayesian model choice
Nested sampling
A mixture comparison
Experiment
[Figures: MCMC sample for n = 16 observations from the mixture; nested sampling sequence with M = 1000 starting points.]
114. On some computational methods for Bayesian model choice
Nested sampling
A mixture comparison
Experiment
[Figures: MCMC sample for n = 50 observations from the mixture; nested sampling sequence with M = 1000 starting points.]
115. On some computational methods for Bayesian model choice
Nested sampling
A mixture comparison
Comparison
1 Nested sampling: M = 1000 points, with 10 random walk moves at each step, simulations from the constrained prior and a stopping rule at 95% of the observed maximum likelihood
2 T = 10⁴ MCMC (=Gibbs) simulations producing non-parametric estimates of ϕ
[Diebolt & Robert, 1990]
3 Monte Carlo estimates Ẑ1, Ẑ2, Ẑ3 using a product of two Gaussian kernels
4 Numerical integration based on an 850 × 950 grid [reference value, confirmed by Chib’s approximation]
116. On some computational methods for Bayesian model choice
Nested sampling
A mixture comparison
Comparison (cont’d)
[Figure: boxplots of the four approximations V1–V4 (values roughly between 0.85 and 1.15), based on a sample of 10 observations for µ = 2 and σ = 3/2 (150 replicas).]