
1

Unbiased Bayes for Big Data:
Paths of Partial Posteriors
Heiko Strathmann
Gatsby Unit, University College London
Oxford ML lunch, February 25, 2015

2

Joint work

3

Being Bayesian: Averaging beliefs of the unknown
φ = ∫ dθ ϕ(θ) p(θ|D)
where the posterior satisfies p(θ|D) ∝ p(D|θ) p(θ)  (likelihood of the data × prior)

4

Metropolis Hastings Transition Kernel
Target π(θ) ∝ p(θ|D)
At iteration j + 1, with current state θ^(j):
Propose θ' ∼ q(θ'|θ^(j))
Accept θ^(j+1) ← θ' with probability
min{ [π(θ') / π(θ^(j))] × [q(θ^(j)|θ') / q(θ'|θ^(j))], 1 }
Reject, i.e. θ^(j+1) ← θ^(j), otherwise.
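To make the acceptance step concrete, here is a minimal numpy sketch of one Metropolis-Hastings transition with a symmetric Gaussian random walk (so the proposal ratio cancels); the function names, step size, and log_target interface are illustrative assumptions, not part of the slides.

```python
import numpy as np

def mh_step(theta, log_target, rng, step=0.5):
    """One Metropolis-Hastings step with a symmetric Gaussian random-walk
    proposal, so q(theta_j | theta') / q(theta' | theta_j) = 1."""
    prop = theta + step * rng.standard_normal(theta.shape)
    # Accept with probability min(1, pi(prop) / pi(theta)), computed in log space
    if np.log(rng.uniform()) < log_target(prop) - log_target(theta):
        return prop                     # accept the proposal
    return theta                        # reject: keep the current state

# Example: draws from a standard normal target
rng = np.random.default_rng(0)
theta = np.zeros(1)
samples = []
for _ in range(1000):
    theta = mh_step(theta, lambda th: -0.5 * np.sum(th ** 2), rng)
    samples.append(theta)
```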

5

Big D & MCMC
Need to evaluate
p(θ|D) ∝ p(D|θ)p(θ)
in every iteration.
For example, for D = {x1, . . . , xN},
p(D|θ) = ∏_{i=1}^N p(x_i|θ)
Infeasible for growing N
Lots of current research: Can we use subsets of D?

6

Desiderata for Bayesian estimators
1. No (additional) bias
2. Finite & controllable variance
3. Computational costs sub-linear in N
4. No problems with transition kernel design

7

Outline
Literature Overview
Partial Posterior Path Estimators
Experiments & Extensions
Discussion

8

Outline
Literature Overview
Partial Posterior Path Estimators
Experiments & Extensions
Discussion

9

Stochastic gradient Langevin (Welling & Teh 2011)
Δθ^(j) = (ε_j / 2) [ ∇_θ log p(θ)|_{θ=θ^(j)} + ∑_{i=1}^N ∇_θ log p(x_i|θ)|_{θ=θ^(j)} ] + η_j
Two changes:
1. Noisy gradients with mini-batches: let I ⊆ {1, . . . , N} and use the log-likelihood gradient ∑_{i∈I} ∇_θ log p(x_i|θ)|_{θ=θ^(j)}
2. Don't evaluate the MH ratio but always accept; decrease the step-size/noise ε_j → 0 to compensate, with
∑_{i=1}^∞ ε_i = ∞ and ∑_{i=1}^∞ ε_i² < ∞
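A hedged numpy sketch of one such update follows; grad_log_prior, grad_log_lik and the step size ε_j are placeholders, and the N/|I| rescaling of the mini-batch gradient is the usual convention rather than something spelled out on the slide.

```python
import numpy as np

def sgld_step(theta, X, grad_log_prior, grad_log_lik, eps_j, batch_size, rng):
    """One stochastic gradient Langevin update in the spirit of Welling & Teh (2011).
    grad_log_lik(x, theta) returns the gradient of log p(x | theta)."""
    N = X.shape[0]
    I = rng.choice(N, size=batch_size, replace=False)            # mini-batch indices
    # Mini-batch estimate of the full log-likelihood gradient
    grad_lik = (N / batch_size) * sum(grad_log_lik(X[i], theta) for i in I)
    noise = np.sqrt(eps_j) * rng.standard_normal(theta.shape)    # eta_j ~ N(0, eps_j I)
    return theta + 0.5 * eps_j * (grad_log_prior(theta) + grad_lik) + noise
```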

10

Austerity (Korattikara, Chen, Welling 2014)
Idea: rewrite the MH ratio as a hypothesis test
At iteration j, draw u ∼ Uniform[0, 1] and compute
µ_0 = (1/N) log [ u × p(θ^(j))/p(θ') × q(θ'|θ^(j))/q(θ^(j)|θ') ]
µ = (1/N) ∑_{i=1}^N l_i,   where l_i := log p(x_i|θ') − log p(x_i|θ^(j))
Accept if µ > µ_0; reject otherwise
Subsample the l_i, central limit theorem, t-test
Increase data if not significant, multiple testing correction
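A simplified sketch of the sequential test behind this: grow a subsample of the l_i until a t-test can distinguish their mean from µ_0, or until all data are used. The batch schedule, the fixed significance level, and the missing multiple-testing correction are simplifications relative to the published algorithm.

```python
import numpy as np
from scipy import stats

def austerity_decision(log_lik_diffs, mu0, rng, batch=100, eps=0.05):
    """Approximate MH accept/reject via a sequential t-test on the mean of
    l_i = log p(x_i|theta') - log p(x_i|theta_j), compared against mu0."""
    N = len(log_lik_diffs)
    perm = rng.permutation(N)
    n = 0
    while True:
        n = min(n + batch, N)
        sample = log_lik_diffs[perm[:n]]
        mu_hat = sample.mean()
        if n == N:
            return mu_hat > mu0          # all data used: exact decision
        # Standard error with a finite-population correction, then a t-test
        se = sample.std(ddof=1) / np.sqrt(n) * np.sqrt(1.0 - (n - 1) / (N - 1))
        p_val = 2 * stats.t.sf(abs(mu_hat - mu0) / max(se, 1e-12), df=n - 1)
        if p_val < eps:                  # confident enough to decide early
            return mu_hat > mu0
```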

11

Bardenet, Doucet, Holmes 2014
Similar to Austerity, but with analysis:
Concentration bounds for MH (CLT might not hold)
Bound for probability of wrong decision
For uniformly ergodic original kernel
Approximate kernel converges
Bound for TV distance of approximation and target
Limitations:
Still approximate
Only random walk
Uses all data on hard (?) problems

12

Firefly MCMC (Maclaurin & Adams 2014)
First asymptotically exact MCMC kernel using sub-sampling
Augment the state space with binary indicator variables
Only a few data points are "bright"
Dark points approximated by a lower bound on the likelihood
Limitations:
Bound might not be available
Loose bounds → worse than standard MCMC → need MAP estimate
Linear in N: likelihood evaluations at least q_{dark→bright} · N
Mixing time cannot be better than 1/q_{dark→bright}

13

Alternative transition kernels
Existing methods construct alternative transition kernels.
(Welling & Teh 2011), (Korattikara, Chen, Welling 2014), (Bardenet, Doucet, Holmes 2014),
(Maclaurin & Adams 2014), (Chen, Fox, Guestrin 2014).
They
use mini-batches
inject noise
augment the state space
make clever use of approximations
Problem: Most methods
are biased
have no convergence guarantees
mix badly

14

Reminder: Where we came from: expectations
E_{p(θ|D)}{ϕ(θ)},   ϕ : Θ → R
Idea: Assuming the goal is estimation, give up on simulation.

15

Outline
Literature Overview
Partial Posterior Path Estimators
Experiments & Extensions
Discussion

16

Idea Outline
1. Construct partial posterior distributions
2. Compute partial expectations (biased)
3. Remove bias
Note:
No simulation from p(θ|D)
Partial posterior expectations less challenging
Exploit standard MCMC methodology & engineering
But not restricted to MCMC

17

Disclaimer
Goal is not to replace posterior sampling, but to provide a ...
different perspective when the goal is estimation
Method does not do uniformly better than MCMC, but ...
we show cases where computational gains can be achieved

18

Partial Posterior Paths
Model p(x, θ) = p(x|θ)p(θ), data D = {x1, . . . , xN}
Full posterior πN := p(θ|D) ∝ p(x1, . . . , xN|θ)p(θ)
L subsets D_l of sizes |D_l| = n_l
Here: n_1 = a, n_2 = 2^1 a, n_3 = 2^2 a, . . . , n_L = 2^{L−1} a
Partial posterior ˜π_l := p(θ|D_l) ∝ p(D_l|θ) p(θ)
Path from prior to full posterior:
p(θ) = ˜π_0 → ˜π_1 → ˜π_2 → · · · → ˜π_L = π_N = p(θ|D)
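A small sketch of how the data sets behind this path might be laid out; nesting the subsets and the base size a are choices made here for illustration.

```python
import numpy as np

def partial_data_sets(X, a, rng):
    """Nested subsets D_1 ⊂ D_2 ⊂ ... with |D_t| = a, 2a, 4a, ..., N.
    The partial posterior pi_t is then p(theta | D_t) ∝ p(D_t | theta) p(theta)."""
    N = X.shape[0]
    perm = rng.permutation(N)            # random order, so each D_t is a random subset
    sizes, n = [], a
    while n < N:
        sizes.append(n)
        n *= 2
    sizes.append(N)                      # the last partial posterior is the full one
    return [X[perm[:s]] for s in sizes]
```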

19

Gaussian Mean, Conjugate Prior
[Figure: partial posteriors over (µ1, µ2), contracting from the prior towards the full posterior as the subset fraction grows from 1/100 through 2/100, 4/100, . . . to 100/100.]

20

Partial posterior path statistics
For partial posterior paths
p(θ) = ˜π_0 → ˜π_1 → ˜π_2 → · · · → ˜π_L = π_N = p(θ|D)
define the sequence {φ_t}_{t=1}^∞ as
φ_t := Ê_{˜π_t}{ϕ(θ)} for t < L
φ_t := φ := Ê_{π_N}{ϕ(θ)} for t ≥ L
This gives
φ_1 → φ_2 → · · · → φ_L = φ
Ê_{˜π_t}{ϕ(θ)} is an empirical estimate, not necessarily from MCMC.
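One way to obtain the empirical estimates Ê_{˜π_t}{ϕ(θ)} is plain MH on each partial posterior, as in the sketch below; the chain length, the omitted burn-in, and the reuse of mh_step from the earlier sketch are all illustrative assumptions, and any consistent estimator could be substituted.

```python
import numpy as np

def make_phi_hat(data_sets, log_prior, log_lik, phi, rng, n_mcmc=2000):
    """Return phi_hat(t): an empirical estimate of E_{pi_t}[phi(theta)],
    here obtained by running the mh_step sketched earlier on partial posterior t."""
    def phi_hat(t):
        D = data_sets[t - 1]
        log_target = lambda th: log_prior(th) + sum(log_lik(x, th) for x in D)
        theta, draws = np.zeros(1), []          # arbitrary start, no burn-in shown
        for _ in range(n_mcmc):
            theta = mh_step(theta, log_target, rng)
            draws.append(phi(theta))
        return float(np.mean(draws))
    return phi_hat
```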

21

Debiasing Lemma (Rhee & Glynn 2012, 2014)
Let φ and {φ_t}_{t=1}^∞ be real-valued random variables. Assume
lim_{t→∞} E|φ_t − φ|² = 0
Let T be an integer-valued random variable with P[T ≥ t] > 0 for t ∈ N
Assume
∑_{t=1}^∞ E|φ_{t−1} − φ|² / P[T ≥ t] < ∞
Then an unbiased estimator of E{φ} is
φ*_T = ∑_{t=1}^T (φ_t − φ_{t−1}) / P[T ≥ t]
Here: P[T ≥ t] = 0 for t > L since φ_{t+1} − φ_t = 0
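A minimal sketch of one replication of this estimator over a partial posterior path, assuming the truncated geometric truncation distribution from the later complexity slide and the convention φ_0 = 0; phi_hat(t) is any empirical estimate of the t-th partial posterior expectation, e.g. the one sketched above.

```python
import numpy as np

def debiased_replicate(phi_hat, L, alpha, rng):
    """One replication of the Rhee & Glynn style estimator:
    phi*_T = sum_{t=1}^T (phi_t - phi_{t-1}) / P[T >= t], with P(T = t) ∝ 2^{-alpha t}
    supported on {1, ..., L} and the convention phi_0 = 0."""
    lam = 2.0 ** (-alpha * np.arange(1, L + 1))
    lam /= lam.sum()                              # P(T = t)
    tail = lam[::-1].cumsum()[::-1]               # P(T >= t)
    T = int(rng.choice(L, p=lam)) + 1             # random truncation level
    est, prev = 0.0, 0.0
    for t in range(1, T + 1):
        cur = phi_hat(t)                          # estimate under partial posterior t
        est += (cur - prev) / tail[t - 1]
        prev = cur
    return est

# Final estimate: average R independent replications of debiased_replicate(...)
```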

22

Algorithm illustration
[Figure (slides 22-29): successive frames in the (µ1, µ2) plane, starting from the prior mean and adding one debiasing replicate at a time; the final frame shows the individual debiasing estimates ϕ*_r, their running average (1/R) ∑_{r=1}^R ϕ*_r, and the true posterior mean.]

30

Computational complexity
Assume geometric batch size increase n_t and truncation probabilities
Λ_t := P(T = t) ∝ 2^{−αt},   α ∈ (0, 1)
Average computational cost is sub-linear:
O( a (N/a)^{1−α} )
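A quick numeric check of this claim under the stated assumptions (n_t = 2^{t−1} a, Λ_t ∝ 2^{−αt}), counting cost as the total amount of data touched along the path up to the drawn truncation level and ignoring per-level MCMC iteration counts:

```python
import numpy as np

def expected_data_cost(N, a, alpha):
    """E[ sum_{t <= T} n_t ] with n_t = 2^(t-1) * a capped at N and
    P(T = t) ∝ 2^(-alpha t) on t = 1, ..., L."""
    L = int(np.ceil(np.log2(N / a))) + 1
    sizes = np.minimum(a * 2 ** np.arange(L), N)      # n_1, ..., n_L
    cum = np.cumsum(sizes)                            # data touched up to level t
    lam = 2.0 ** (-alpha * np.arange(1, L + 1))
    lam /= lam.sum()
    return float(lam @ cum)

# e.g. expected_data_cost(10**8, a=100, alpha=0.5) is of order sqrt(a * N),
# i.e. roughly 10^5 data points per replication instead of N = 10^8
```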

31

Variance-computation trade-offs in Big Data
Variance:
E{(φ*_T)²} = ∑_{t=1}^∞ [ E{|φ_{t−1} − φ|²} − E{|φ_t − φ|²} ] / P[T ≥ t]
If we assume that for all t ≤ L there is a constant c and β > 0 such that
E|φ_{t−1} − φ|² ≤ c / n_t^β
and furthermore α < β, then
∑_{t=1}^L E|φ_{t−1} − φ|² / P[T ≥ t] = O(1)
and the variance stays bounded as N → ∞.
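For completeness, a short derivation of why this holds under the slide's assumptions, treating P[T ≥ t] as proportional to 2^{−αt} up to constants:

```latex
\sum_{t=1}^{L} \frac{\mathbb{E}\,|\phi_{t-1}-\phi|^{2}}{\mathbb{P}[T \ge t]}
\;\lesssim\; \sum_{t=1}^{L} \frac{c\, n_t^{-\beta}}{2^{-\alpha t}}
\;=\; \frac{c\,2^{\beta}}{a^{\beta}} \sum_{t=1}^{L} 2^{(\alpha-\beta) t}
\;\le\; \frac{c\,2^{\beta}}{a^{\beta}} \sum_{t=1}^{\infty} 2^{(\alpha-\beta) t}
\;=\; O(1), \qquad \text{since } n_t = 2^{t-1} a \text{ and } \alpha < \beta .
```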

32

Outline
Literature Overview
Partial Posterior Path Estimators
Experiments & Extensions
Discussion

33

Synthetic log-Gaussian
[Figure: estimated posterior mean of σ under a log N(0, σ²) model as a function of the number of data n_t, from 10¹ to 10⁵.]
(Bardenet, Doucet, Holmes 2014) → uses all data
(Korattikara, Chen, Welling 2014) → wrong result

34

Synthetic log-Gaussian  debiasing
[Figure, left: running average (1/R) ∑_{r=1}^R ϕ*_r of the debiased estimate of the posterior mean of σ under log N(µ, σ²), over replications r. Right: data used, ∑_{t=1}^{T_r} n_t.]
Truly large-scale version: N ≈ 10⁸
Sum of likelihood evaluations: ≈ 0.25 N

35

Non-factorising likelihoods
No need for
p(D|θ) = ∏_{i=1}^N p(x_i|θ)
Example: approximate Gaussian Process regression
Estimate the predictive mean k_* (K + λI)^{−1} y
No MCMC (!)
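As a sketch of how the path statistic φ_t could look in this non-factorising setting, here is a plain-numpy partial-data GP predictive mean k_* (K + λI)^{−1} y; the squared-exponential kernel and its hyperparameters are assumptions for illustration, and the result computed on subset D_t would be plugged into the debiasing estimator sketched earlier.

```python
import numpy as np

def gp_predictive_mean(X, y, X_test, lengthscale=1.0, lam=1e-2):
    """GP-regression predictive mean k_*(K + lam I)^{-1} y with a
    squared-exponential kernel; evaluated on a subset D_t this plays the
    role of the path statistic phi_t."""
    def kern(A, B):
        sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-0.5 * sq_dists / lengthscale ** 2)
    K = kern(X, X) + lam * np.eye(len(X))
    return kern(X_test, X) @ np.linalg.solve(K, y)
```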

36

Toy example
N = 10⁴, D = 1
m = 100 random Fourier features (Rahimi & Recht, 2007)
Predictive mean on 1000 test data
[Figure, left: MSE of the GP-regression predictive mean vs. number of data n_t (MSE convergence). Right: MSE of the debiased estimate over replications R; average cost: 469.]
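For reference, a hedged sketch of the random Fourier feature approximation used here: the Gaussian kernel is replaced by an m-dimensional feature map, so the solve costs O(m³) instead of O(n³); the kernel choice, lengthscale, and ridge parameter are illustrative.

```python
import numpy as np

def rff_predictive_mean(X, y, X_test, m=100, lengthscale=1.0, lam=1e-2, rng=None):
    """Approximate k_*(K + lam I)^{-1} y with m random Fourier features
    (Rahimi & Recht 2007), i.e. k(x, x') ≈ z(x)^T z(x') for a Gaussian kernel."""
    rng = rng or np.random.default_rng()
    d = X.shape[1]
    W = rng.standard_normal((d, m)) / lengthscale      # spectral frequencies
    b = rng.uniform(0.0, 2.0 * np.pi, size=m)          # random phases
    z = lambda A: np.sqrt(2.0 / m) * np.cos(A @ W + b)
    Z = z(X)
    weights = np.linalg.solve(Z.T @ Z + lam * np.eye(m), Z.T @ y)
    return z(X_test) @ weights
```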

37

Gaussian Processes for Big Data
(Hensman, Fusi, Lawrence, 2013): SVI & inducing variables
Airtime delays, N = 700,000, D = 8
Estimate predictive mean on 100,000 test data
[Figure: RMSE of the debiased predictive mean over replications R; average cost: 2773.]

38

Outline
Literature Overview
Partial Posterior Path Estimators
Experiments & Extensions
Discussion

39

Conclusions
If goal is estimation rather than simulation, we arrive at
1. No bias
2. Finite & controllable variance
3. Data complexity sub-linear in N
4. No problems with transition kernel design
Practical:
Not limited to MCMC
Not limited to factorising likelihoods
Competitive initial results
Parallelisable, re-uses existing engineering effort

40

Still biased?
MCMC and finite time
MCMC estimator ˆE˜πt{ϕ(θ)} is not unbiased
Could imagine two-stage process
Apply debiasing to MC estimator
Use to debias partial posterior path
Need conditions on MC convergence to control variance,
(Agapiou, Roberts, Vollmer, 2014)
Memory restrictions
Partial posterior expectations need to be computable
Memory limitations cause bias
e.g. large-scale GMRF (Lyne et al, 2014)

41

Free lunch? Not uniformly better than MCMC
Need P[T ≥ t] > 0 for all t
Negative example: a9a dataset (Welling & Teh, 2011)
N ≈ 32, 000
Converges, but full posterior sampling likely
[Figure: debiasing estimates of the coefficient β1 over replications r.]
Useful for very large (redundant) datasets

42

Xi'an's 'Og, Feb 2015
Discussion of M. Betancourt's note on HMC and subsampling.
...the information provided by the whole data is only available
when looking at the whole data.
See http://goo.gl/bFQvd6

43

Thank you
Questions?
