This document summarizes a talk given by Heiko Strathmann on using partial posterior paths to estimate expectations from large datasets without full posterior simulation. The key ideas are:
1. Construct a path of "partial posteriors" by sequentially adding mini-batches of data and computing expectations over these posteriors.
2. "Debias" the path of expectations to obtain an unbiased estimator of the true posterior expectation using a technique from stochastic optimization literature.
3. This approach allows estimating posterior expectations with sub-linear computational cost in the number of data points, without requiring full posterior simulation or imposing restrictions on the likelihood.
Experiments on synthetic and real-world examples demonstrate competitive performance versus standard MCMC.
1. Unbiased Bayes for Big Data:
Paths of Partial Posteriors
Heiko Strathmann
Gatsby Unit, University College London
Oxford ML lunch, February 25, 2015
3. Being Bayesian: Averaging beliefs of the unknown
φ = ∫ dθ ϕ(θ) p(θ|D)
where the posterior is p(θ|D) ∝ p(D|θ) p(θ), i.e. likelihood of the data times prior.
4. Metropolis-Hastings Transition Kernel
Target π(θ) ∝ p(θ|D)
At iteration j + 1, with state θ(j):
Propose θ′ ∼ q(θ|θ(j))
Accept θ(j+1) ← θ′ with probability
min{ [π(θ′)/π(θ(j))] × [q(θ(j)|θ′)/q(θ′|θ(j))], 1 }
Reject θ(j+1) ← θ(j) otherwise.
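The kernel above can be sketched in a few lines. This is a generic illustration (not code from the talk), using a symmetric Gaussian random walk so the proposal ratio q(θ(j)|θ′)/q(θ′|θ(j)) cancels:

```python
import numpy as np

def metropolis_hastings(log_target, theta0, n_iter, step=0.5, rng=None):
    """Random-walk Metropolis-Hastings with a symmetric Gaussian proposal."""
    rng = np.random.default_rng(0) if rng is None else rng
    theta = theta0
    samples = np.empty(n_iter)
    for j in range(n_iter):
        prop = theta + step * rng.standard_normal()       # propose theta' ~ q(.|theta)
        log_ratio = log_target(prop) - log_target(theta)  # log pi(theta')/pi(theta(j))
        if np.log(rng.uniform()) < log_ratio:             # accept with prob min(ratio, 1)
            theta = prop
        samples[j] = theta                                # on reject, theta stays put
    return samples

# Toy target: standard normal, log-density up to an additive constant
samples = metropolis_hastings(lambda t: -0.5 * t**2, theta0=0.0, n_iter=20000)
```

Note every iteration evaluates `log_target`, which for a posterior means touching all N data points; this is exactly the problem the next slide raises.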
5. Big D &amp; MCMC
Need to evaluate p(θ|D) ∝ p(D|θ)p(θ) in every iteration.
For example, for D = {x1, . . . , xN},
p(D|θ) = ∏_{i=1}^N p(xi|θ)
Infeasible for growing N.
Lots of current research: can we use subsets of D?
6. Desiderata for Bayesian estimators
1. No (additional) bias
2. Finite & controllable variance
3. Computational costs sub-linear in N
4. No problems with transition kernel design
9. Stochastic gradient Langevin (Welling &amp; Teh 2011)
θ(j+1) = θ(j) + (εj/2) [∇θ log p(θ) + Σ_{i=1}^N ∇θ log p(xi|θ)]_{θ=θ(j)} + ηj,  ηj ∼ N(0, εj)
Two changes:
1. Noisy gradients with mini-batches: let I ⊆ {1, . . . , N} and use the rescaled log-likelihood gradient (N/|I|) Σ_{i∈I} ∇θ log p(xi|θ)|_{θ=θ(j)}
2. Don't evaluate the MH ratio, but always accept; decrease step-size/noise εj → 0 to compensate, with
Σ_{j=1}^∞ εj = ∞ and Σ_{j=1}^∞ εj² &lt; ∞
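A toy sketch of the two changes above, for a Gaussian mean model x_i ∼ N(θ, 1) with a N(0, 10) prior. The model, step-size schedule, and constants are illustrative choices, not taken from the talk:

```python
import numpy as np

rng = np.random.default_rng(1)
N, batch = 10_000, 100
x = rng.normal(2.0, 1.0, size=N)           # synthetic data with true mean 2

theta, thetas = 0.0, []
for j in range(1, 5001):
    eps = 3e-4 * j ** -0.55                # step sizes: sum eps_j = inf, sum eps_j^2 < inf
    I = rng.integers(0, N, size=batch)     # mini-batch indices (change 1)
    grad_prior = -theta / 10.0             # gradient of log N(theta; 0, 10) prior
    grad_lik = (N / batch) * np.sum(x[I] - theta)   # rescaled mini-batch gradient
    # Langevin step with injected noise; no MH correction (change 2)
    theta += eps / 2 * (grad_prior + grad_lik) + np.sqrt(eps) * rng.standard_normal()
    thetas.append(theta)
```

After burn-in the chain fluctuates around the posterior mean, which here is close to the sample mean of the data.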
10. Austerity (Korattikara, Chen, Welling 2014)
Idea: rewrite the MH ratio as a hypothesis test.
At iteration j, draw u ∼ Uniform[0, 1] and compute
µ0 = (1/N) log[ u × p(θ(j))/p(θ′) × q(θ′|θ(j))/q(θ(j)|θ′) ]
µ = (1/N) Σ_{i=1}^N li,   li := log p(xi|θ′) − log p(xi|θ(j))
Accept if µ &gt; µ0; reject otherwise.
Subsample the li; central limit theorem; t-test.
Increase data if no significance; multiple testing correction.
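The reformulation can be sanity-checked numerically: with a flat prior and symmetric proposal (so the prior and proposal ratios drop out of µ0), the test µ &gt; µ0 on all N terms is exactly the standard MH accept condition. A minimal check, with a Gaussian model as an illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(2)
N = 500
x = rng.normal(1.0, 1.0, size=N)
theta_cur, theta_prop = 0.8, 1.1

def log_lik(theta):                        # per-point Gaussian log-likelihoods
    return -0.5 * (x - theta) ** 2

l = log_lik(theta_prop) - log_lik(theta_cur)   # l_i as defined on the slide
u = rng.uniform()
mu0 = np.log(u) / N                        # flat prior, symmetric q: ratios are 1
mu = l.mean()

# mu > mu0  <=>  sum(l_i) > log u  <=>  the usual MH accept test
assert (mu > mu0) == (np.log(u) < l.sum())
```

The Austerity method then replaces the full mean µ by a subsampled estimate and a t-test, which is where the approximation (and bias) enters.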
11. Bardenet, Doucet, Holmes 2014
Similar to Austerity, but with analysis:
Concentration bounds for MH (CLT might not hold)
Bound for probability of wrong decision
For uniformly ergodic original kernel
Approximate kernel converges
Bound for TV distance of approximation and target
Limitations:
Still approximate
Only random walk
Uses all data on hard (?) problems
12. Firefly MCMC (Maclaurin &amp; Adams 2014)
First asymptotically exact MCMC kernel using sub-sampling
Augment state space with binary indicator variables
Only few data points are "bright"
Dark points approximated by a lower bound on the likelihood
Limitations:
Bound might not be available
Loose bounds → worse than standard MCMC → need MAP estimate
Linear in N: likelihood evaluations at least qdark→bright · N
Mixing time cannot be better than 1/qdark→bright
13. Alternative transition kernels
Existing methods construct alternative transition kernels.
(Welling &amp; Teh 2011), (Korattikara, Chen, Welling 2014), (Bardenet, Doucet, Holmes 2014),
(Maclaurin &amp; Adams 2014), (Chen, Fox, Guestrin 2014).
They
use mini-batches
inject noise
augment the state space
make clever use of approximations
Problem: Most methods
are biased
have no convergence guarantees
mix badly
14. Reminder: Where we came from: expectations
E_{p(θ|D)}{ϕ(θ)},   ϕ : Θ → R
Idea: Assuming the goal is estimation, give up on simulation.
16. Idea Outline
1. Construct partial posterior distributions
2. Compute partial expectations (biased)
3. Remove bias
Note:
No simulation from p(θ|D)
Partial posterior expectations less challenging
Exploit standard MCMC methodology &amp; engineering
But not restricted to MCMC
17. Disclaimer
Goal is not to replace posterior sampling, but to provide a ...
different perspective when the goal is estimation
Method does not do uniformly better than MCMC, but ...
we show cases where computational gains can be achieved
18. Partial Posterior Paths
Model p(x, θ) = p(x|θ)p(θ), data D = {x1, . . . , xN}
Full posterior πN := p(θ|D) ∝ p(x1, . . . , xN|θ)p(θ)
L subsets Dl of sizes |Dl| = nl
Here: n1 = a, n2 = 2^1 a, n3 = 2^2 a, . . . , nL = 2^(L−1) a
Partial posterior ˜πl := p(θ|Dl) ∝ p(Dl|θ)p(θ)
Path from prior to full posterior
p(θ) = ˜π0 → ˜π1 → ˜π2 → · · · → ˜πL = πN = p(θ|D)
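The doubling batch schedule can be written down directly; capping the final subset at N (when N is not a power-of-two multiple of a) is my assumption, not stated on the slide:

```python
import numpy as np

def batch_sizes(N, a):
    """Geometric schedule n_l = 2^(l-1) * a, with the last subset capped at N."""
    sizes = []
    n = a
    while n < N:
        sizes.append(n)
        n *= 2
    sizes.append(N)   # final partial posterior uses the full data set
    return sizes

print(batch_sizes(1000, 10))   # [10, 20, 40, 80, 160, 320, 640, 1000]
```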
20. Partial posterior path statistics
For partial posterior paths
p(θ) = ˜π0 → ˜π1 → ˜π2 → · · · → ˜πL = πN = p(θ|D)
define a sequence {φt}_{t=1}^∞ as
φt := Ê_{˜πt}{ϕ(θ)}  for t &lt; L
φt := φ := Ê_{πN}{ϕ(θ)}  for t ≥ L
This gives
φ1 → φ2 → · · · → φL = φ
Ê_{˜πt}{ϕ(θ)} is an empirical estimate; not necessarily MCMC.
21. Debiasing Lemma (Rhee &amp; Glynn 2012, 2014)
Let φ and {φt}_{t=1}^∞ be real-valued random variables. Assume
lim_{t→∞} E{|φt − φ|²} = 0
and let T be an integer random variable with P[T ≥ t] &gt; 0 for t ∈ N. Assume
Σ_{t=1}^∞ E{|φt−1 − φ|²} / P[T ≥ t] &lt; ∞
Then an unbiased estimator of E{φ} is
φ*_T = Σ_{t=1}^T (φt − φt−1) / P[T ≥ t]
Here: we may set P[T ≥ t] = 0 for t &gt; L, since φt+1 − φt = 0 there.
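A sketch of the estimator on a deterministic toy sequence φt → φ = 1, with truncation probabilities P(T = t) ∝ 2^(−αt) as on the complexity slide, and φ0 := 0 so the t = 1 term contributes φ1 (P[T ≥ 1] = 1). Averaging many replications should recover φL:

```python
import numpy as np

def debiased_estimate(phi_seq, alpha, rng):
    """One Rhee-Glynn debiasing replication, truncating at random level T."""
    L = len(phi_seq)
    lam = 2.0 ** (-alpha * np.arange(1, L + 1))   # unnormalised P(T = t)
    lam /= lam.sum()
    tail = lam[::-1].cumsum()[::-1]               # tail[t-1] = P(T >= t)
    T = int(rng.choice(L, p=lam)) + 1             # draw truncation level
    prev, est = 0.0, 0.0                          # phi_0 := 0
    for t in range(1, T + 1):
        est += (phi_seq[t - 1] - prev) / tail[t - 1]
        prev = phi_seq[t - 1]
    return est

rng = np.random.default_rng(3)
phi_seq = [1 - 2.0 ** (-t) for t in range(1, 11)]     # phi_t -> phi = 1
reps = [debiased_estimate(phi_seq, alpha=0.5, rng=rng) for _ in range(20000)]
print(np.mean(reps))   # close to phi_L = 1 - 2^-10 (unbiased)
```

Each replication only touches the first T stages of the path, which is what makes the average cost sub-linear.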
29. Algorithm illustration
[Figure: scatter of debiasing estimates ϕ*_r in the (µ1, µ2) plane, showing the prior mean, the running debiasing estimate (1/R) Σ_{r=1}^R ϕ*_r, and the true posterior mean.]
30. Computational complexity
Assume geometric batch-size increase nt and truncation probabilities
Λt := P(T = t) ∝ 2^(−αt),  α ∈ (0, 1)
Average computational cost is sub-linear:
O(a (N/a)^(1−α))
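The claimed scaling can be checked numerically: with nt = 2^(t−1) a and Λt ∝ 2^(−αt), the expected number of data points touched, E[Σ_{t≤T} nt] = Σ_t nt P(T ≥ t), grows like (N/a)^(1−α) rather than N. A sketch (the final batch is not capped at N here, which only affects constants):

```python
import numpy as np

def avg_cost(N, a=10, alpha=0.5):
    """Expected total data touched by one debiasing replication."""
    L = int(np.ceil(np.log2(N / a))) + 1
    n = a * 2.0 ** np.arange(L)                 # batch sizes n_1, ..., n_L
    lam = 2.0 ** (-alpha * np.arange(1, L + 1)) # unnormalised P(T = t)
    lam /= lam.sum()
    tail = lam[::-1].cumsum()[::-1]             # P(T >= t)
    return float((n * tail).sum())              # E[sum_{t<=T} n_t]

for N in [10**4, 10**6, 10**8]:
    print(N, round(avg_cost(N)))
```

With alpha = 0.5 the cost should grow roughly like sqrt(N): increasing N by 10^4 increases the cost by a factor of about 10^2, not 10^4.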
31. Variance-computation tradeoffs in Big Data
Variance:
E{(φ*_T)²} = Σ_{t=1}^∞ [E{|φt−1 − φ|²} − E{|φt − φ|²}] / P[T ≥ t]
If we assume that ∀t ≤ L there are a constant c and β &gt; 0 s.t.
E{|φt−1 − φ|²} ≤ c / nt^β
and furthermore α &lt; β, then
Σ_{t=1}^L E{|φt−1 − φ|²} / P[T ≥ t] = O(1)
and the variance stays bounded as N → ∞.
33. Synthetic log-Gaussian
[Figure: estimation error vs. number of data nt (10^1 to 10^5) for the posterior mean of σ in a log N(0, σ²) model.]
(Bardenet, Doucet, Holmes 2014) uses all data
(Korattikara, Chen, Welling 2014) gives the wrong result
34. Synthetic log-Gaussian debiasing
[Figure: running average (1/R) Σ_{r=1}^R ϕ*_r over replications r for the posterior mean of σ in a log N(µ, σ²) model, alongside the used data Σ_{t=1}^{Tr} nt per replication.]
Truly large-scale version: N ≈ 10^8
Sum of likelihood evaluations: ≈ 0.25N
35. Non-factorising likelihoods
No need for
p(D|θ) = ∏_{i=1}^N p(xi|θ)
Example: approximate Gaussian Process regression
Estimate predictive mean
k*ᵀ (K + λI)^(−1) y
No MCMC (!)
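A sketch of this idea using exact kernel ridge regression rather than the random-feature approximation used in the talk; the data, kernel, and regularisation below are illustrative choices. Each φt is the predictive mean at a test point computed from the first nt training points, so the path of estimates needs no MCMC at all:

```python
import numpy as np

def rbf(a, b, ell=1.0):
    """Squared-exponential kernel matrix between 1-D point sets a and b."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell**2)

rng = np.random.default_rng(4)
N = 512
X = rng.uniform(-3, 3, size=N)
y = np.sin(X) + 0.1 * rng.standard_normal(N)   # noisy observations of sin(x)
x_star, lam = np.array([0.5]), 0.1

def predictive_mean(n):
    """phi_t: predictive mean k*^T (K + lam I)^-1 y from the first n points."""
    K = rbf(X[:n], X[:n])
    k_star = rbf(x_star, X[:n])
    return float((k_star @ np.linalg.solve(K + lam * np.eye(n), y[:n]))[0])

# Partial-posterior-style path of estimates on growing subsets
for n in [32, 64, 128, 256, 512]:
    print(n, predictive_mean(n))
```

The sequence of predictive means converges as n grows, so the same debiasing construction applies with these φt in place of posterior expectations.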
36. Toy example
N = 10^4, D = 1
m = 100 random Fourier features (Rahimi &amp; Recht, 2007)
Predictive mean on 1000 test data
[Figure: MSE of the predictive mean vs. number of data nt, and MSE of the debiasing estimate over replications R. Average cost: 469.]
37. Gaussian Processes for Big Data
(Hensman, Fusi, Lawrence, 2013): SVI with inducing variables
Airtime delays, N = 700,000, D = 8
Estimate predictive mean on 100,000 test data
[Figure: RMSE of the debiasing estimate over replications R. Average cost: 2773.]
39. Conclusions
If the goal is estimation rather than simulation, we arrive at
1. No bias
2. Finite &amp; controllable variance
3. Data complexity sub-linear in N
4. No problems with transition kernel design
Practical:
Not limited to MCMC
Not limited to factorising likelihoods
Competitive initial results
Parallelisable, re-uses existing engineering effort
40. Still biased?
MCMC and finite time
The MCMC estimator Ê_{˜πt}{ϕ(θ)} is not unbiased
Could imagine a two-stage process:
Apply debiasing to the MC estimator
Use it to debias the partial posterior path
Need conditions on MC convergence to control variance
(Agapiou, Roberts, Vollmer, 2014)
Memory restrictions
Partial posterior expectations need to be computable
Memory limitations cause bias
e.g. large-scale GMRF (Lyne et al, 2014)
41. Free lunch? Not uniformly better than MCMC
Need P[T ≥ t] &gt; 0 for all t
Negative example: a9a dataset (Welling &amp; Teh, 2011)
N ≈ 32,000
Converges, but full posterior sampling is likely
[Figure: debiasing estimates of β1 over replications r.]
Useful for very large (redundant) datasets
42. Xi'an's 'Og, Feb 2015
Discussion of M. Betancourt's note on HMC and subsampling.
...the information provided by the whole data is only available
when looking at the whole data.
See http://goo.gl/bFQvd6