Machine Learning and Pattern Recognition Sampling Based Approximations
Our prediction is the average of the predictions made by S different plausible model fits,
sampled from the posterior distribution over parameters.
However, it is not at all obvious how to draw samples from the posterior over weights for
general models. For simple versions of linear regression, we know that p(w | D) is Gaussian,
but we don’t need to approximate the integral in that case. For logistic regression there’s no
obvious way to draw samples from the posterior distribution (if we don’t approximate it
with a Gaussian).
A family of methods, widely used in Statistics, known as Markov chain Monte Carlo (MCMC)
methods, can be used to draw samples from the posterior distribution for models like logistic
regression and neural networks. We don’t cover the details of MCMC in this course. If you’re
interested, Iain has a tutorial here: https://homepages.inf.ed.ac.uk/imurray2/teaching/
15nips/ — or a longer tutorial on probabilistic modelling that puts it in slightly more context:
https://homepages.inf.ed.ac.uk/imurray2/teaching/14mlss/
Here $r^{(s)} = \frac{p(w^{(s)} \mid D)}{q(w^{(s)})}$ is the importance weight, which upweights the predictions for parameters
that are more probable under the posterior than under the distribution we sampled from.
1. For example reweighting data in a loss function to reflect how they were gathered, or weighting the importance
of different trial runs in reinforcement learning, depending on the policy from which they were sampled.
$$p(w \mid D) = \frac{P(D \mid w)\, p(w)}{P(D)}, \qquad (7)$$
because we can’t usually evaluate the denominator P(D). However, we can approximate that
using importance sampling!
$$
\begin{aligned}
P(D) &= \int P(D \mid w)\, p(w)\, \mathrm{d}w & (8)\\
&= \int q(w)\, \frac{P(D \mid w)\, p(w)}{q(w)}\, \mathrm{d}w & (9)\\
&= \mathbb{E}_{q(w)}\!\left[\frac{P(D \mid w)\, p(w)}{q(w)}\right] & (10)\\
&\approx \frac{1}{S} \sum_{s=1}^{S} \frac{P(D \mid w^{(s)})\, p(w^{(s)})}{q(w^{(s)})}
\;=\; \frac{1}{S} \sum_{s=1}^{S} \tilde{r}^{(s)}, \qquad w^{(s)} \sim q(w). & (11)
\end{aligned}
$$
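To make the estimator in (11) concrete, here is a minimal sketch on a toy conjugate Gaussian model of our own choosing (not from the notes), where the exact marginal likelihood is available for comparison. Taking q(w) equal to the prior makes the importance weights reduce to the likelihood.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy conjugate model (illustrative): prior w ~ N(0, 1),
# likelihood y_n ~ N(w, 1) for N observations, so the exact
# marginal likelihood P(D) is available in closed form.
y = rng.normal(1.0, 1.0, size=10)
N = len(y)

def log_lik(w):
    # log P(D | w) for an array of parameter samples w, shape (S,)
    return -0.5 * np.sum((y[None, :] - w[:, None]) ** 2, axis=1) \
           - 0.5 * N * np.log(2 * np.pi)

S = 100_000
# Take q(w) to be the prior, so r~(s) = P(D|w)p(w)/q(w) = P(D|w(s)).
w = rng.normal(0.0, 1.0, size=S)
r_tilde = np.exp(log_lik(w))
P_D = r_tilde.mean()               # (1/S) sum_s r~(s), Equation (11)

# Exact answer for checking: jointly y ~ N(0, I + 1 1^T).
Sigma = np.eye(N) + np.ones((N, N))
exact = np.exp(-0.5 * y @ np.linalg.solve(Sigma, y)) \
        / np.sqrt((2 * np.pi) ** N * np.linalg.det(Sigma))
print(P_D, exact)                  # the two estimates should be close
```

Here the prior overlaps the posterior reasonably well, so the simple Monte Carlo average is accurate; the point of the surrounding discussion is that this stops being true in high dimensions.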
$$P(y=1 \mid x, D) \;\approx\; \frac{\frac{1}{S}\sum_{s=1}^{S} \sigma\!\left(w^{(s)\top} x\right) \tilde{r}^{(s)}}{\frac{1}{S}\sum_{s'=1}^{S} \tilde{r}^{(s')}}, \qquad w^{(s)} \sim q(w), \qquad (13)$$
or
$$P(y=1 \mid x, D) \;\approx\; \sum_{s=1}^{S} \sigma\!\left(w^{(s)\top} x\right) r^{(s)}, \qquad w^{(s)} \sim q(w). \qquad (14)$$
In this final form, the average is under the distribution defined by the ‘normalized importance
weights’:
$$r^{(s)} = \frac{\tilde{r}^{(s)}}{\sum_{s'=1}^{S} \tilde{r}^{(s')}}. \qquad (15)$$
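A sketch of (14)–(15) in code (an illustrative toy example with names of our choosing, not the notes' implementation): self-normalized importance sampling for the posterior predictive of a small logistic regression model, with q(w) taken to be the prior so that the unnormalized weight is just the likelihood.

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Toy dataset (ours, for illustration): 2D logistic regression.
X = rng.normal(size=(20, 2))
w_true = np.array([2.0, -1.0])
Y = (rng.uniform(size=20) < sigmoid(X @ w_true)).astype(float)

S = 50_000
ws = rng.normal(size=(S, 2))     # w(s) ~ q(w), here q = prior = N(0, I)

# Unnormalized log-weights: with q equal to the prior, log r~(s) is
# just the Bernoulli log-likelihood log P(D | w(s)).
probs = sigmoid(ws @ X.T)        # shape (S, N)
log_r = (Y * np.log(probs) + (1 - Y) * np.log(1 - probs)).sum(axis=1)

# Normalized weights, Equation (15); subtracting the max log-weight
# before exponentiating avoids underflow and cancels in the ratio.
r_tilde = np.exp(log_r - log_r.max())
r = r_tilde / r_tilde.sum()

# Posterior predictive at a test input, Equation (14).
x_test = np.array([1.0, 0.5])
p_pred = np.sum(sigmoid(ws @ x_test) * r)
print(p_pred)
```

Working with log-weights and subtracting the maximum before exponentiating is the standard trick here: any constant shift cancels in the normalized weights, so only ratios of weights matter.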
Consider a 1-dimensional bimodal posterior p(w | D) and q(w) a Gaussian centred at the
trough of p(w | D) as shown in the figure below.
[Figure: a 1-dimensional bimodal posterior p(w | D), with a Gaussian q(w) centred at the trough between its two modes. The annotation notes that when q is the prior, the importance weights are proportional to the likelihood: r(θ^{(s)}) ∝ P(Data | θ^{(s)}).]
This importance sampling procedure works in principle for any model where we can sample
possible models from the prior and evaluate the likelihood, including logistic regression.
However, if we have many parameters, it is unlikely that any of our S samples from the prior
will match the data well, and our estimates will be poor.
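The degeneracy with many parameters can be seen by tracking the effective sample size of the weights, $\mathrm{ESS} = \left(\sum_s \tilde{r}^{(s)}\right)^2 / \sum_s \left(\tilde{r}^{(s)}\right)^2$. A sketch on a toy model of our own (numbers and model are illustrative): prior samples in higher dimensions almost never land where the likelihood is large, so a handful of weights dominate and the ESS collapses towards 1.

```python
import numpy as np

rng = np.random.default_rng(2)

def effective_sample_size(log_r):
    # ESS = (sum_s r~(s))^2 / sum_s (r~(s))^2, computed from
    # log-weights; the subtracted constant cancels in the ratio.
    r = np.exp(log_r - log_r.max())
    return r.sum() ** 2 / (r ** 2).sum()

S = 10_000
ess = []
for D in [1, 10, 100]:
    # Toy model: prior w ~ N(0, I_D); one observation per dimension,
    # y_d ~ N(w_d, 0.1^2), so the posterior is far narrower than the prior.
    y = rng.normal(0.0, 1.0, size=D)
    w = rng.normal(size=(S, D))                    # w(s) ~ q(w) = prior
    log_r = -0.5 * np.sum((y - w) ** 2, axis=1) / 0.1 ** 2  # log-likelihood
    ess.append(effective_sample_size(log_r))
print(ess)   # ESS drops sharply as the dimension grows
```

Even with S = 10,000 samples, in 100 dimensions essentially one sample carries all the weight, which is the failure mode described above.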
We could try to make the sampling distribution q(w) approximate the posterior, but for
models with many parameters it is difficult to approximate the posterior well enough for
importance sampling to work well. Advanced sampling methods like MCMC (mentioned
above) and more advanced importance sampling methods (e.g., Sequential Monte Carlo, SMC)
have been applied to neural networks, but are beyond the scope of this course.