Machine Learning and Pattern Recognition Sampling Based Approximations
Our prediction is the average of the predictions made by S different plausible model fits,
sampled from the posterior distribution over parameters.
However, it is not at all obvious how to draw samples from the posterior over weights for
general models. For simple versions of linear regression, we know that p(w | D) is Gaussian,
but we don’t need to approximate the integral in that case. For logistic regression there’s no
obvious way to draw samples from the posterior distribution (if we don’t approximate it
with a Gaussian).
A family of methods, widely used in Statistics, known as Markov chain Monte Carlo (MCMC)
methods, can be used to draw samples from the posterior distribution for models like logistic
regression and neural networks. We don’t cover the details of MCMC in this course. If you’re
interested, Iain has a tutorial here: https://homepages.inf.ed.ac.uk/imurray2/teaching/
15nips/ — or a longer tutorial on probabilistic modelling that puts it in slightly more context:
https://homepages.inf.ed.ac.uk/imurray2/teaching/14mlss/
Here $r^{(s)} = \frac{p(w^{(s)} \mid D)}{q(w^{(s)})}$ is the importance weight, which upweights the predictions for parameters
that are more probable under the posterior than under the distribution we sampled from.
1. For example reweighting data in a loss function to reflect how they were gathered, or weighting the importance
of different trial runs in reinforcement learning, depending on the policy from which they were sampled.
$$p(w \mid D) = \frac{P(D \mid w)\, p(w)}{P(D)}, \qquad (7)$$
because we can’t usually evaluate the denominator P(D). However, we can approximate that
using importance sampling!
$$
\begin{aligned}
P(D) &= \int P(D \mid w)\, p(w)\, \mathrm{d}w & (8)\\
&= \int q(w)\, \frac{P(D \mid w)\, p(w)}{q(w)}\, \mathrm{d}w & (9)\\
&= \mathbb{E}_{q(w)}\!\left[\frac{P(D \mid w)\, p(w)}{q(w)}\right] & (10)\\
&\approx \frac{1}{S} \sum_{s=1}^{S} \frac{P(D \mid w^{(s)})\, p(w^{(s)})}{q(w^{(s)})}
\;=\; \frac{1}{S} \sum_{s=1}^{S} \tilde{r}^{(s)}, \qquad w^{(s)} \sim q(w). & (11)
\end{aligned}
$$
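To make the estimator in (11) concrete, here is a minimal sketch on a toy conjugate Gaussian model of our own choosing (not from the notes), where the exact marginal likelihood is available for comparison. Taking q(w) equal to the prior makes the importance weights reduce to the likelihood.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy conjugate model (illustrative): prior w ~ N(0, 1),
# likelihood y_n ~ N(w, 1) for N observations, so the exact
# marginal likelihood P(D) is available in closed form.
y = rng.normal(1.0, 1.0, size=10)
N = len(y)

def log_lik(w):
    # log P(D | w) for an array of parameter samples w, shape (S,)
    return -0.5 * np.sum((y[None, :] - w[:, None]) ** 2, axis=1) \
           - 0.5 * N * np.log(2 * np.pi)

S = 100_000
# Take q(w) to be the prior, so r~(s) = P(D|w)p(w)/q(w) = P(D|w(s)).
w = rng.normal(0.0, 1.0, size=S)
r_tilde = np.exp(log_lik(w))
P_D = r_tilde.mean()               # (1/S) sum_s r~(s), Equation (11)

# Exact answer for checking: jointly y ~ N(0, I + 1 1^T).
Sigma = np.eye(N) + np.ones((N, N))
exact = np.exp(-0.5 * y @ np.linalg.solve(Sigma, y)) \
        / np.sqrt((2 * np.pi) ** N * np.linalg.det(Sigma))
print(P_D, exact)                  # the two estimates should be close
```

Here the prior overlaps the posterior reasonably well, so the simple Monte Carlo average is accurate; the point of the surrounding discussion is that this stops being true in high dimensions.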
$$P(y=1 \mid x, D) \;\approx\; \frac{\frac{1}{S}\sum_{s=1}^{S} \sigma\!\left(w^{(s)\top} x\right) \tilde{r}^{(s)}}{\frac{1}{S}\sum_{s'=1}^{S} \tilde{r}^{(s')}}, \qquad w^{(s)} \sim q(w), \qquad (13)$$
or
$$P(y=1 \mid x, D) \;\approx\; \sum_{s=1}^{S} \sigma\!\left(w^{(s)\top} x\right) r^{(s)}, \qquad w^{(s)} \sim q(w). \qquad (14)$$
In this final form, the average is under the distribution defined by the ‘normalized importance
weights’:
$$r^{(s)} = \frac{\tilde{r}^{(s)}}{\sum_{s'=1}^{S} \tilde{r}^{(s')}}. \qquad (15)$$
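A sketch of (14)–(15) in code (an illustrative toy example with names of our choosing, not the notes' implementation): self-normalized importance sampling for the posterior predictive of a small logistic regression model, with q(w) taken to be the prior so that the unnormalized weight is just the likelihood.

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Toy dataset (ours, for illustration): 2D logistic regression.
X = rng.normal(size=(20, 2))
w_true = np.array([2.0, -1.0])
Y = (rng.uniform(size=20) < sigmoid(X @ w_true)).astype(float)

S = 50_000
ws = rng.normal(size=(S, 2))     # w(s) ~ q(w), here q = prior = N(0, I)

# Unnormalized log-weights: with q equal to the prior, log r~(s) is
# just the Bernoulli log-likelihood log P(D | w(s)).
probs = sigmoid(ws @ X.T)        # shape (S, N)
log_r = (Y * np.log(probs) + (1 - Y) * np.log(1 - probs)).sum(axis=1)

# Normalized weights, Equation (15); subtracting the max log-weight
# before exponentiating avoids underflow and cancels in the ratio.
r_tilde = np.exp(log_r - log_r.max())
r = r_tilde / r_tilde.sum()

# Posterior predictive at a test input, Equation (14).
x_test = np.array([1.0, 0.5])
p_pred = np.sum(sigmoid(ws @ x_test) * r)
print(p_pred)
```

Working with log-weights and subtracting the maximum before exponentiating is the standard trick here: any constant shift cancels in the normalized weights, so only ratios of weights matter.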
Consider a 1-dimensional bimodal posterior p(w | D) and q(w) a Gaussian centred at the
trough of p(w | D) as shown in the figure below.
[Figure: a 1-dimensional bimodal posterior p(w | D), with a Gaussian q(w) centred at the trough between its two modes. The annotation notes that when q is the prior, the importance weights are proportional to the likelihood: r(θ^{(s)}) ∝ P(Data | θ^{(s)}).]
This importance sampling procedure works in principle for any model where we can sample
possible models from the prior and evaluate the likelihood, including logistic regression.
However, if we have many parameters, it is unlikely that any of our S samples from the prior
will match the data well, and our estimates will be poor.
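The degeneracy with many parameters can be seen by tracking the effective sample size of the weights, $\mathrm{ESS} = \left(\sum_s \tilde{r}^{(s)}\right)^2 / \sum_s \left(\tilde{r}^{(s)}\right)^2$. A sketch on a toy model of our own (numbers and model are illustrative): prior samples in higher dimensions almost never land where the likelihood is large, so a handful of weights dominate and the ESS collapses towards 1.

```python
import numpy as np

rng = np.random.default_rng(2)

def effective_sample_size(log_r):
    # ESS = (sum_s r~(s))^2 / sum_s (r~(s))^2, computed from
    # log-weights; the subtracted constant cancels in the ratio.
    r = np.exp(log_r - log_r.max())
    return r.sum() ** 2 / (r ** 2).sum()

S = 10_000
ess = []
for D in [1, 10, 100]:
    # Toy model: prior w ~ N(0, I_D); one observation per dimension,
    # y_d ~ N(w_d, 0.1^2), so the posterior is far narrower than the prior.
    y = rng.normal(0.0, 1.0, size=D)
    w = rng.normal(size=(S, D))                    # w(s) ~ q(w) = prior
    log_r = -0.5 * np.sum((y - w) ** 2, axis=1) / 0.1 ** 2  # log-likelihood
    ess.append(effective_sample_size(log_r))
print(ess)   # ESS drops sharply as the dimension grows
```

Even with S = 10,000 samples, in 100 dimensions essentially one sample carries all the weight, which is the failure mode described above.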
We could try to make the sampling distribution q(w) approximate the posterior, but for
models with many parameters it is difficult to approximate the posterior well enough for
importance sampling to work well. Advanced sampling methods like MCMC (mentioned
above) and more advanced importance sampling methods (e.g., Sequential Monte Carlo, SMC)
have been applied to neural networks, but are beyond the scope of this course.