This document summarizes a talk given by Heiko Strathmann on using partial posterior paths to estimate expectations from large datasets without full posterior simulation. The key ideas are:
1. Construct a path of "partial posteriors" by sequentially adding mini-batches of data and computing expectations over these posteriors.
2. "Debias" the path of expectations to obtain an unbiased estimator of the true posterior expectation using a technique from stochastic optimization literature.
3. This approach allows estimating posterior expectations with sub-linear computational cost in the number of data points, without requiring full posterior simulation or imposing restrictions on the likelihood.
Experiments on synthetic and real-world examples demonstrate competitive performance versus standard MCMC.
1. Unbiased Bayes for Big Data:
Paths of Partial Posteriors
Heiko Strathmann
Gatsby Unit, University College London
Oxford ML lunch, February 25, 2015
3. Being Bayesian: Averaging beliefs of the unknown
φ = ∫ dθ ϕ(θ) p(θ|D)
where the posterior is p(θ|D) ∝ p(D|θ) p(θ), i.e. likelihood of the data times prior.
4. Metropolis-Hastings Transition Kernel
Target π(θ) ∝ p(θ|D)
At iteration j + 1, with state θ(j):
Propose θ′ ∼ q(θ|θ(j))
Accept θ(j+1) ← θ′ with probability
min{ [π(θ′)/π(θ(j))] × [q(θ(j)|θ′)/q(θ′|θ(j))], 1 }
Reject θ(j+1) ← θ(j) otherwise.
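The kernel above can be sketched in a few lines. This is a generic illustration (not code from the talk), using a symmetric Gaussian random walk so the proposal ratio q(θ(j)|θ′)/q(θ′|θ(j)) cancels:

```python
import numpy as np

def metropolis_hastings(log_target, theta0, n_iter, step=0.5, rng=None):
    """Random-walk Metropolis-Hastings with a symmetric Gaussian proposal."""
    rng = np.random.default_rng(0) if rng is None else rng
    theta = theta0
    samples = np.empty(n_iter)
    for j in range(n_iter):
        prop = theta + step * rng.standard_normal()       # propose theta' ~ q(.|theta)
        log_ratio = log_target(prop) - log_target(theta)  # log pi(theta')/pi(theta(j))
        if np.log(rng.uniform()) < log_ratio:             # accept with prob min(ratio, 1)
            theta = prop
        samples[j] = theta                                # on reject, theta stays put
    return samples

# Toy target: standard normal, log-density up to an additive constant
samples = metropolis_hastings(lambda t: -0.5 * t**2, theta0=0.0, n_iter=20000)
```

Note every iteration evaluates `log_target`, which for a posterior means touching all N data points; this is exactly the problem the next slide raises.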
5. Big D &amp; MCMC
Need to evaluate p(θ|D) ∝ p(D|θ)p(θ) in every iteration.
For example, for D = {x1, . . . , xN},
p(D|θ) = ∏_{i=1}^N p(xi|θ)
Infeasible for growing N.
Lots of current research: can we use subsets of D?
6. Desiderata for Bayesian estimators
1. No (additional) bias
2. Finite & controllable variance
3. Computational costs sub-linear in N
4. No problems with transition kernel design
9. Stochastic gradient Langevin (Welling &amp; Teh 2011)
θ(j+1) = θ(j) + (εj/2) [∇θ log p(θ) + Σ_{i=1}^N ∇θ log p(xi|θ)]_{θ=θ(j)} + ηj,  ηj ∼ N(0, εj)
Two changes:
1. Noisy gradients with mini-batches: let I ⊆ {1, . . . , N} and use the rescaled log-likelihood gradient (N/|I|) Σ_{i∈I} ∇θ log p(xi|θ)|_{θ=θ(j)}
2. Don't evaluate the MH ratio, but always accept; decrease step-size/noise εj → 0 to compensate, with
Σ_{j=1}^∞ εj = ∞ and Σ_{j=1}^∞ εj² &lt; ∞
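A toy sketch of the two changes above, for a Gaussian mean model x_i ∼ N(θ, 1) with a N(0, 10) prior. The model, step-size schedule, and constants are illustrative choices, not taken from the talk:

```python
import numpy as np

rng = np.random.default_rng(1)
N, batch = 10_000, 100
x = rng.normal(2.0, 1.0, size=N)           # synthetic data with true mean 2

theta, thetas = 0.0, []
for j in range(1, 5001):
    eps = 3e-4 * j ** -0.55                # step sizes: sum eps_j = inf, sum eps_j^2 < inf
    I = rng.integers(0, N, size=batch)     # mini-batch indices (change 1)
    grad_prior = -theta / 10.0             # gradient of log N(theta; 0, 10) prior
    grad_lik = (N / batch) * np.sum(x[I] - theta)   # rescaled mini-batch gradient
    # Langevin step with injected noise; no MH correction (change 2)
    theta += eps / 2 * (grad_prior + grad_lik) + np.sqrt(eps) * rng.standard_normal()
    thetas.append(theta)
```

After burn-in the chain fluctuates around the posterior mean, which here is close to the sample mean of the data.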
10. Austerity (Korattikara, Chen, Welling 2014)
Idea: rewrite the MH ratio as a hypothesis test.
At iteration j, draw u ∼ Uniform[0, 1] and compute
µ0 = (1/N) log[ u × p(θ(j))/p(θ′) × q(θ′|θ(j))/q(θ(j)|θ′) ]
µ = (1/N) Σ_{i=1}^N li,   li := log p(xi|θ′) − log p(xi|θ(j))
Accept if µ &gt; µ0; reject otherwise.
Subsample the li; central limit theorem; t-test.
Increase data if no significance; multiple testing correction.
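The reformulation can be sanity-checked numerically: with a flat prior and symmetric proposal (so the prior and proposal ratios drop out of µ0), the test µ &gt; µ0 on all N terms is exactly the standard MH accept condition. A minimal check, with a Gaussian model as an illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(2)
N = 500
x = rng.normal(1.0, 1.0, size=N)
theta_cur, theta_prop = 0.8, 1.1

def log_lik(theta):                        # per-point Gaussian log-likelihoods
    return -0.5 * (x - theta) ** 2

l = log_lik(theta_prop) - log_lik(theta_cur)   # l_i as defined on the slide
u = rng.uniform()
mu0 = np.log(u) / N                        # flat prior, symmetric q: ratios are 1
mu = l.mean()

# mu > mu0  <=>  sum(l_i) > log u  <=>  the usual MH accept test
assert (mu > mu0) == (np.log(u) < l.sum())
```

The Austerity method then replaces the full mean µ by a subsampled estimate and a t-test, which is where the approximation (and bias) enters.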
11. Bardenet, Doucet, Holmes 2014
Similar to Austerity, but with analysis:
Concentration bounds for MH (CLT might not hold)
Bound for probability of wrong decision
For uniformly ergodic original kernel
Approximate kernel converges
Bound for TV distance of approximation and target
Limitations:
Still approximate
Only random walk
Uses all data on hard (?) problems
12. Firefly MCMC (Maclaurin &amp; Adams 2014)
First asymptotically exact MCMC kernel using sub-sampling
Augment state space with binary indicator variables
Only few data points are "bright"
Dark points approximated by a lower bound on the likelihood
Limitations:
Bound might not be available
Loose bounds → worse than standard MCMC → need MAP estimate
Linear in N: likelihood evaluations at least qdark→bright · N
Mixing time cannot be better than 1/qdark→bright
13. Alternative transition kernels
Existing methods construct alternative transition kernels.
(Welling &amp; Teh 2011), (Korattikara, Chen, Welling 2014), (Bardenet, Doucet, Holmes 2014),
(Maclaurin &amp; Adams 2014), (Chen, Fox, Guestrin 2014).
They
use mini-batches
inject noise
augment the state space
make clever use of approximations
Problem: Most methods
are biased
have no convergence guarantees
mix badly
14. Reminder: Where we came from: expectations
E_{p(θ|D)}{ϕ(θ)},   ϕ : Θ → R
Idea: Assuming the goal is estimation, give up on simulation.
16. Idea Outline
1. Construct partial posterior distributions
2. Compute partial expectations (biased)
3. Remove bias
Note:
No simulation from p(θ|D)
Partial posterior expectations less challenging
Exploit standard MCMC methodology &amp; engineering
But not restricted to MCMC
17. Disclaimer
Goal is not to replace posterior sampling, but to provide a ...
different perspective when the goal is estimation
Method does not do uniformly better than MCMC, but ...
we show cases where computational gains can be achieved
18. Partial Posterior Paths
Model p(x, θ) = p(x|θ)p(θ), data D = {x1, . . . , xN}
Full posterior πN := p(θ|D) ∝ p(x1, . . . , xN|θ)p(θ)
L subsets Dl of sizes |Dl| = nl
Here: n1 = a, n2 = 2^1 a, n3 = 2^2 a, . . . , nL = 2^(L−1) a
Partial posterior ˜πl := p(θ|Dl) ∝ p(Dl|θ)p(θ)
Path from prior to full posterior
p(θ) = ˜π0 → ˜π1 → ˜π2 → · · · → ˜πL = πN = p(θ|D)
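The doubling batch schedule can be written down directly; capping the final subset at N (when N is not a power-of-two multiple of a) is my assumption, not stated on the slide:

```python
import numpy as np

def batch_sizes(N, a):
    """Geometric schedule n_l = 2^(l-1) * a, with the last subset capped at N."""
    sizes = []
    n = a
    while n < N:
        sizes.append(n)
        n *= 2
    sizes.append(N)   # final partial posterior uses the full data set
    return sizes

print(batch_sizes(1000, 10))   # [10, 20, 40, 80, 160, 320, 640, 1000]
```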
20. Partial posterior path statistics
For partial posterior paths
p(θ) = ˜π0 → ˜π1 → ˜π2 → · · · → ˜πL = πN = p(θ|D)
define a sequence {φt}_{t=1}^∞ as
φt := Ê_{˜πt}{ϕ(θ)}  for t &lt; L
φt := φ := Ê_{πN}{ϕ(θ)}  for t ≥ L
This gives
φ1 → φ2 → · · · → φL = φ
Ê_{˜πt}{ϕ(θ)} is an empirical estimate; not necessarily MCMC.
21. Debiasing Lemma (Rhee &amp; Glynn 2012, 2014)
Let φ and {φt}_{t=1}^∞ be real-valued random variables. Assume
lim_{t→∞} E{|φt − φ|²} = 0
and let T be an integer random variable with P[T ≥ t] &gt; 0 for t ∈ N. Assume
Σ_{t=1}^∞ E{|φt−1 − φ|²} / P[T ≥ t] &lt; ∞
Then an unbiased estimator of E{φ} is
φ*_T = Σ_{t=1}^T (φt − φt−1) / P[T ≥ t]
Here: we may set P[T ≥ t] = 0 for t &gt; L, since φt+1 − φt = 0 there.
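A sketch of the estimator on a deterministic toy sequence φt → φ = 1, with truncation probabilities P(T = t) ∝ 2^(−αt) as on the complexity slide, and φ0 := 0 so the t = 1 term contributes φ1 (P[T ≥ 1] = 1). Averaging many replications should recover φL:

```python
import numpy as np

def debiased_estimate(phi_seq, alpha, rng):
    """One Rhee-Glynn debiasing replication, truncating at random level T."""
    L = len(phi_seq)
    lam = 2.0 ** (-alpha * np.arange(1, L + 1))   # unnormalised P(T = t)
    lam /= lam.sum()
    tail = lam[::-1].cumsum()[::-1]               # tail[t-1] = P(T >= t)
    T = int(rng.choice(L, p=lam)) + 1             # draw truncation level
    prev, est = 0.0, 0.0                          # phi_0 := 0
    for t in range(1, T + 1):
        est += (phi_seq[t - 1] - prev) / tail[t - 1]
        prev = phi_seq[t - 1]
    return est

rng = np.random.default_rng(3)
phi_seq = [1 - 2.0 ** (-t) for t in range(1, 11)]     # phi_t -> phi = 1
reps = [debiased_estimate(phi_seq, alpha=0.5, rng=rng) for _ in range(20000)]
print(np.mean(reps))   # close to phi_L = 1 - 2^-10 (unbiased)
```

Each replication only touches the first T stages of the path, which is what makes the average cost sub-linear.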
29. Algorithm illustration
[Figure: scatter of debiasing estimates ϕ*_r in the (µ1, µ2) plane, showing the prior mean, the running debiasing estimate (1/R) Σ_{r=1}^R ϕ*_r, and the true posterior mean.]
30. Computational complexity
Assume geometric batch-size increase nt and truncation probabilities
Λt := P(T = t) ∝ 2^(−αt),  α ∈ (0, 1)
Average computational cost is sub-linear:
O(a (N/a)^(1−α))
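The claimed scaling can be checked numerically: with nt = 2^(t−1) a and Λt ∝ 2^(−αt), the expected number of data points touched, E[Σ_{t≤T} nt] = Σ_t nt P(T ≥ t), grows like (N/a)^(1−α) rather than N. A sketch (the final batch is not capped at N here, which only affects constants):

```python
import numpy as np

def avg_cost(N, a=10, alpha=0.5):
    """Expected total data touched by one debiasing replication."""
    L = int(np.ceil(np.log2(N / a))) + 1
    n = a * 2.0 ** np.arange(L)                 # batch sizes n_1, ..., n_L
    lam = 2.0 ** (-alpha * np.arange(1, L + 1)) # unnormalised P(T = t)
    lam /= lam.sum()
    tail = lam[::-1].cumsum()[::-1]             # P(T >= t)
    return float((n * tail).sum())              # E[sum_{t<=T} n_t]

for N in [10**4, 10**6, 10**8]:
    print(N, round(avg_cost(N)))
```

With alpha = 0.5 the cost should grow roughly like sqrt(N): increasing N by 10^4 increases the cost by a factor of about 10^2, not 10^4.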
31. Variance-computation tradeoffs in Big Data
Variance:
E{(φ*_T)²} = Σ_{t=1}^∞ [E{|φt−1 − φ|²} − E{|φt − φ|²}] / P[T ≥ t]
If we assume that ∀t ≤ L there are a constant c and β &gt; 0 s.t.
E{|φt−1 − φ|²} ≤ c / nt^β
and furthermore α &lt; β, then
Σ_{t=1}^L E{|φt−1 − φ|²} / P[T ≥ t] = O(1)
and the variance stays bounded as N → ∞.
33. Synthetic log-Gaussian
[Figure: estimation error vs. number of data nt (10^1 to 10^5) for the posterior mean of σ in a log N(0, σ²) model.]
(Bardenet, Doucet, Holmes 2014) uses all data
(Korattikara, Chen, Welling 2014) gives the wrong result
34. Synthetic log-Gaussian debiasing
[Figure: running average (1/R) Σ_{r=1}^R ϕ*_r over replications r for the posterior mean of σ in a log N(µ, σ²) model, alongside the used data Σ_{t=1}^{Tr} nt per replication.]
Truly large-scale version: N ≈ 10^8
Sum of likelihood evaluations: ≈ 0.25N
35. Non-factorising likelihoods
No need for
p(D|θ) = ∏_{i=1}^N p(xi|θ)
Example: approximate Gaussian Process regression
Estimate predictive mean
k*ᵀ (K + λI)^(−1) y
No MCMC (!)
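A sketch of this idea using exact kernel ridge regression rather than the random-feature approximation used in the talk; the data, kernel, and regularisation below are illustrative choices. Each φt is the predictive mean at a test point computed from the first nt training points, so the path of estimates needs no MCMC at all:

```python
import numpy as np

def rbf(a, b, ell=1.0):
    """Squared-exponential kernel matrix between 1-D point sets a and b."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell**2)

rng = np.random.default_rng(4)
N = 512
X = rng.uniform(-3, 3, size=N)
y = np.sin(X) + 0.1 * rng.standard_normal(N)   # noisy observations of sin(x)
x_star, lam = np.array([0.5]), 0.1

def predictive_mean(n):
    """phi_t: predictive mean k*^T (K + lam I)^-1 y from the first n points."""
    K = rbf(X[:n], X[:n])
    k_star = rbf(x_star, X[:n])
    return float((k_star @ np.linalg.solve(K + lam * np.eye(n), y[:n]))[0])

# Partial-posterior-style path of estimates on growing subsets
for n in [32, 64, 128, 256, 512]:
    print(n, predictive_mean(n))
```

The sequence of predictive means converges as n grows, so the same debiasing construction applies with these φt in place of posterior expectations.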
36. Toy example
N = 10^4, D = 1
m = 100 random Fourier features (Rahimi &amp; Recht, 2007)
Predictive mean on 1000 test data
[Figure: MSE of the predictive mean vs. number of data nt, and MSE of the debiasing estimate over replications R. Average cost: 469.]
37. Gaussian Processes for Big Data
(Hensman, Fusi, Lawrence, 2013): SVI with inducing variables
Airtime delays, N = 700,000, D = 8
Estimate predictive mean on 100,000 test data
[Figure: RMSE of the debiasing estimate over replications R. Average cost: 2773.]
39. Conclusions
If the goal is estimation rather than simulation, we arrive at
1. No bias
2. Finite &amp; controllable variance
3. Data complexity sub-linear in N
4. No problems with transition kernel design
Practical:
Not limited to MCMC
Not limited to factorising likelihoods
Competitive initial results
Parallelisable, re-uses existing engineering effort
40. Still biased?
MCMC and finite time
The MCMC estimator Ê_{˜πt}{ϕ(θ)} is not unbiased
Could imagine a two-stage process:
Apply debiasing to the MC estimator
Use it to debias the partial posterior path
Need conditions on MC convergence to control variance
(Agapiou, Roberts, Vollmer, 2014)
Memory restrictions
Partial posterior expectations need to be computable
Memory limitations cause bias
e.g. large-scale GMRF (Lyne et al, 2014)
41. Free lunch? Not uniformly better than MCMC
Need P[T ≥ t] &gt; 0 for all t
Negative example: a9a dataset (Welling &amp; Teh, 2011)
N ≈ 32,000
Converges, but full posterior sampling is likely
[Figure: debiasing estimates of β1 over replications r.]
Useful for very large (redundant) datasets
42. Xi'an's 'Og, Feb 2015
Discussion of M. Betancourt's note on HMC and subsampling.
...the information provided by the whole data is only available
when looking at the whole data.
See http://goo.gl/bFQvd6