Advanced ML Notes (Midterm)

Gaussian processes are a non-parametric Bayesian method for regression and classification problems. They place a prior distribution over functions, assuming this prior takes the form of a multivariate Gaussian distribution. This distribution is then updated based on observed data to obtain a posterior distribution over functions. Kernels are used to define the covariance between points in order to sample meaningful functions from the Gaussian process prior. Hyperparameters of the kernel and noise parameters are tuned to maximize the likelihood of the observed data.

Uploaded by abdhatemsh

Advanced Machine Learning

Gaussian Processes
Multivariate Gaussian Distributions
A Gaussian distribution is a continuous probability distribution for a real-valued random variable.

X ~ N(μ, σ²)

PDF: f(x) = (1 / (σ√(2π))) exp(−(1/2)((x − μ)/σ)²)
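The PDF can be evaluated directly; a minimal sketch (the function name is ours, not from the notes):

```python
import math

def gaussian_pdf(x, mu=0.0, sigma=1.0):
    """Density f(x) of N(mu, sigma^2)."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

# At the mean of a standard normal the density peaks at 1/sqrt(2*pi) ≈ 0.3989
peak = gaussian_pdf(0.0)
```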

Given a column vector of normally distributed random variables, each with finite mean and variance, the joint distribution is also normally distributed. This is called a multivariate Gaussian distribution.
X = (X1, X2, …, Xn)ᵀ ~ N(μ, K)

μ = (μ1, μ2, …, μn)ᵀ

K = [ K_X1X1  K_X1X2  …  K_X1Xn
      K_X2X1  K_X2X2  …  K_X2Xn
      ⋮       ⋮           ⋮
      K_XnX1  K_XnX2  …  K_XnXn ]

The covariance matrix describes the shape of the distribution

K_XiXj = cov(Xi, Xj) = E[(Xi − μi)(Xj − μj)]
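The covariance definition can be checked numerically by sampling; a small NumPy sketch with illustrative μ and K:

```python
import numpy as np

# Mean vector and covariance matrix of a 2-D Gaussian (illustrative values)
mu = np.array([0.0, 1.0])
K = np.array([[1.0, 0.8],
              [0.8, 1.0]])  # symmetric, positive semi-definite

rng = np.random.default_rng(0)
samples = rng.multivariate_normal(mu, K, size=50_000)

# The empirical covariance approximates K_XiXj = E[(Xi - mu_i)(Xj - mu_j)]
emp_K = np.cov(samples.T)
```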

Marginalization and Conditioning

Given a joint Gaussian distribution of two random variables

P_{X,Y} = [X; Y] ~ N(μ, K)

the marginals are Gaussian,

X ~ N(μ_X, K_XX), Y ~ N(μ_Y, K_YY)

and so are the conditionals:

X|Y ~ N(μ_X + K_XY K_YY⁻¹ (Y − μ_Y), K_XX − K_XY K_YY⁻¹ K_YX)
Y|X ~ N(μ_Y + K_YX K_XX⁻¹ (X − μ_X), K_YY − K_YX K_XX⁻¹ K_XY)

The marginal probability P_X is obtained by integrating P_{X,Y} over all values of Y:

P_X = ∫ P_{X,Y}(x, y) dy = ∫ P_{X|Y}(x|y) P_Y(y) dy = E_Y[P_{X|Y}(x|y)]

This can be thought of as examining the conditional probability of X given a particular value of Y, and then averaging this conditional probability over the distribution of Y.

The conditional probabilities P_{X|y} and P_{Y|x} are calculated using Bayes' rule:

P_{X|y} = P(X, y) / P(Y = y), P_{Y|x} = P(x, Y) / P(X = x)

Gaussian distributions are closed under marginalization and conditioning; the distributions resulting from these operations are also Gaussian.

[Figure: marginalization and conditioning of a bivariate Gaussian]
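The conditioning formulas can be sketched directly; a minimal NumPy helper (function and variable names are ours):

```python
import numpy as np

def condition_gaussian(mu_x, mu_y, K_xx, K_xy, K_yy, y_obs):
    """Parameters of X | Y = y_obs for a joint Gaussian:
    mean = mu_x + K_xy K_yy^-1 (y_obs - mu_y)
    cov  = K_xx - K_xy K_yy^-1 K_yx
    """
    w = np.linalg.solve(K_yy, y_obs - mu_y)
    cond_mu = mu_x + K_xy @ w
    cond_K = K_xx - K_xy @ np.linalg.solve(K_yy, K_xy.T)
    return cond_mu, cond_K

# 1-D example: X and Y standard normal with covariance 0.8, observe Y = 1
cmu, cK = condition_gaussian(np.zeros(1), np.zeros(1),
                             np.array([[1.0]]), np.array([[0.8]]),
                             np.array([[1.0]]), np.array([1.0]))
# The mean moves toward the observation and the variance shrinks below the prior
```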
Gaussian Processes
A Gaussian process is a collection of random variables, any finite number of which have a joint
Gaussian distribution.
{X_t ; t ∈ T} is a Gaussian process if and only if, for every finite set of indices t1, …, tk:

X_{t1,…,tk} = (X_{t1}, …, X_{tk}) is a multivariate Gaussian random variable.

In simpler terms, for {X_t ; t ∈ T} (a process) to be a Gaussian process, any linear combination of (X_{t1}, …, X_{tk}) must follow a univariate Gaussian distribution.

When Gaussian processes are used in a task like regression or classification, each data point is treated as a random variable, and we try to infer the underlying distribution of these random variables as each new data point arrives (Bayesian inference).
Suppose X is the training data (features) and Y is the testing data (labels). We model the underlying distribution as a multivariate Gaussian distribution P_{X,Y}; this distribution spans the space of possible function values that we want to predict.
Priors and Posteriors
Before looking at our data, there are many possible "candidate" functions that could have produced it; these are called priors.
As we look at the data, we can narrow down the functions that could have generated it: we perform conditioning on our priors using our data points to turn them into posteriors. The more data points we look at, the more confident we can be about the data-generating function.
Kernels
A kernel is a function that measures the similarity of two inputs 𝑥, 𝑥′ and is tuned by a set of
hyperparameters 𝜏.

k: ℝⁿ × ℝⁿ → ℝ,  K = cov(X, X′) = k(x, x′ | τ)

Ex.
Squared-exponential (SE) / Radial Basis Function (RBF):

k(x, x′ | τ) = σ² exp(−(1/2)((x − x′)/ℓ)²)

τ = {σ, ℓ}
Changing the hyperparameters of the kernel changes the covariance matrix and the resulting
functions sampled using the kernel.
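The SE/RBF kernel is easy to realize directly; a minimal NumPy sketch for 1-D inputs, with `sigma` and `length` standing in for σ and ℓ (names are ours):

```python
import numpy as np

def rbf_kernel(x1, x2, sigma=1.0, length=1.0):
    """Squared-exponential kernel k(x, x') = sigma^2 exp(-0.5 ((x - x') / l)^2),
    evaluated for all pairs of 1-D inputs; returns the covariance matrix."""
    diff = x1[:, None] - x2[None, :]
    return sigma**2 * np.exp(-0.5 * (diff / length)**2)

x = np.linspace(0.0, 5.0, 6)
K = rbf_kernel(x, x)  # a valid covariance matrix: symmetric, PSD, sigma^2 diagonal
```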

[Figure: functions sampled with a small ℓ, a large ℓ, and a large σ]

ℓ and σ set the horizontal and vertical "range" of the function.


The Gaussian process will sample functions with nearby 𝑦’s for 𝑥’s deemed similar by the kernel.

Prior and Posterior Distributions


The prior distribution comes from creating a Gaussian distribution using the kernel:

P_{X,Y} = [X; Y] ~ N(μ, K), K = k(x, x′ | τ)
This distribution is updated using the data points to form the posterior distribution

D = {(x_i, y_i)}_{i=1}^N  *Data

The posterior distribution over the predicted function f* comes from conditioning the Gaussian process on the data points:

P_{f*|X*,D} ~ N(μ_{f*}, K_{f*})
P_{f*|X*,D} ~ N(K_{X*,X} K_{X,X}⁻¹ y, K_{X*,X*} − K_{X*,X} K_{X,X}⁻¹ K_{X,X*})

Noise
We add noise to the training points 𝑌 to model the error of measurements
𝑌 = 𝑓(𝑋 ) + 𝜖, 𝜖~𝑁(0, 𝜎𝜖2 )

This modifies the setup of the joint distribution P_{X,Y}:

P_{X,Y} = [X; Y] ~ N(0, K) = N([0; 0], [K_XX, K_XY; K_YX, K_YY + σ_ε² I])

*σ_ε² is a hyperparameter
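Putting the posterior formulas and the noise term together, GP regression can be sketched end to end; a toy NumPy implementation where the RBF kernel, data, and hyperparameter values are all illustrative:

```python
import numpy as np

def rbf(a, b, sigma=1.0, length=1.0):
    d = a[:, None] - b[None, :]
    return sigma**2 * np.exp(-0.5 * (d / length)**2)

def gp_posterior(x_train, y_train, x_test, noise=0.1, sigma=1.0, length=1.0):
    """Posterior mean and covariance at x_test:
    mu* = K*x (Kxx + noise^2 I)^-1 y
    K*  = K** - K*x (Kxx + noise^2 I)^-1 Kx*"""
    K = rbf(x_train, x_train, sigma, length) + noise**2 * np.eye(len(x_train))
    K_s = rbf(x_test, x_train, sigma, length)
    K_ss = rbf(x_test, x_test, sigma, length)
    mu = K_s @ np.linalg.solve(K, y_train)
    cov = K_ss - K_s @ np.linalg.solve(K, K_s.T)
    return mu, cov

x_tr = np.array([-2.0, 0.0, 2.0])
y_tr = np.sin(x_tr)
# Predict at a point inside the data (x = 0) and one far away (x = 3)
mu, cov = gp_posterior(x_tr, y_tr, np.array([0.0, 3.0]))
```

Uncertainty (the posterior variance) is small near observed points and grows back toward the prior variance far from them.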
Combining Kernels
Kernels can be combined together to create more specialized kernels that better represent our prior
knowledge about the data.
Common kernel combinations are addition and multiplication, but any method is allowed as long as the covariance matrix the resulting kernel produces is positive semi-definite.

[Figure: functions sampled from k_lin(x, x′) and k_per(x, x′) alone, and from their combinations such as k_lin(x, x′) + k_per(x, x′)]

Hyperparameter Selection
We tune the hyperparameters of the kernel τ and the noise parameter σ_ε² such that the prior samples explain the data well even before conditioning.

We pick τ by maximizing the log-likelihood of y after integrating out the possible f(·)'s:

arg max_τ (log P_{y|X,τ})

log P(y | X, τ, σ_ε²) = log ∫ P(y | f̃, σ_ε²) P(f̃ | X, τ) df̃ = log N(y | 0, K_XX + σ_ε² I)

Any single sample from the prior describes the data imperfectly; this "imperfectness" is captured by the latent function f̃.
We can think of tuning as trying to reduce the number of poorly fitting (blue) sample lines we would see over an infinite number of samples.
Because our distributions are Gaussian, we can calculate this integral algebraically in closed form.
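Since the integral collapses to log N(y | 0, K_XX + σ_ε² I), it can be evaluated and maximized directly; a sketch using a Cholesky factorization and a small grid over the length-scale (data and grid values are illustrative, a real tuner would use gradients):

```python
import numpy as np

def rbf(a, b, sigma=1.0, length=1.0):
    d = a[:, None] - b[None, :]
    return sigma**2 * np.exp(-0.5 * (d / length)**2)

def log_marginal_likelihood(x, y, length, sigma=1.0, noise=0.1):
    """log N(y | 0, K_XX + noise^2 I), via Cholesky for numerical stability."""
    K = rbf(x, x, sigma, length) + noise**2 * np.eye(len(x))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return (-0.5 * y @ alpha
            - np.log(np.diag(L)).sum()
            - 0.5 * len(x) * np.log(2 * np.pi))

x = np.linspace(0.0, 5.0, 20)
y = np.sin(x)  # the "data": a smooth function with length-scale around 1
lengths = [0.01, 0.1, 1.0, 10.0]
best = max(lengths, key=lambda l: log_marginal_likelihood(x, y, l))
```

Too small an ℓ treats the points as independent noise; too large an ℓ cannot bend with the data; the marginal likelihood is highest in between.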
Ex.

The data has both a quadratic trend as well as a periodic one; using just one kernel to model this distribution won't be sufficient.

We can capture the periodic trend with a periodic kernel:

k_per(x, x′) = exp(−(2/ℓ²) sin²((π/p)|x − x′|))
The quadratic trend can be captured by multiplying two linear kernels together

𝑘𝑙𝑖𝑛 = 𝑣𝑥𝑥′
𝑘𝑞𝑢𝑎𝑑 = (𝑣1 𝑥𝑥 ′ )(𝑣2 𝑥𝑥 ′ )

To combine these two kernels we simply add them together:

k* = (v1 x x′)(v2 x x′) + exp(−(2/ℓ²) sin²((π/p)|x − x′|))

The prior samples produced using the new kernel are much more likely to fit our data.
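A sketch of the combined kernel k* with hyperparameters fixed to illustrative values, checking that the resulting covariance matrix stays symmetric and positive semi-definite (so it remains a valid kernel):

```python
import numpy as np

def k_quad(x1, x2, v1=1.0, v2=1.0):
    """Product of two linear kernels (v1 x x')(v2 x x') -> quadratic trend."""
    outer = np.outer(x1, x2)
    return (v1 * outer) * (v2 * outer)

def k_per(x1, x2, length=1.0, period=1.0):
    """Periodic kernel exp(-(2/l^2) sin^2(pi |x - x'| / p))."""
    d = np.abs(x1[:, None] - x2[None, :])
    return np.exp(-(2.0 / length**2) * np.sin(np.pi * d / period)**2)

def k_star(x1, x2):
    """Sum of the quadratic and periodic kernels; a sum of valid kernels is valid."""
    return k_quad(x1, x2) + k_per(x1, x2)

x = np.linspace(-1.0, 1.0, 5)
K = k_star(x, x)
```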
Conformal Prediction
Uncertainty Quantification
Uncertainty quantification is the process of trying to assess the reliability and confidence level of predictions made by an ML model. The idea is to provide a prediction interval rather than a single point prediction, and to give this interval a confidence level.

[Figure: a single point prediction vs. an interval prediction with an attached probabilistic statement]

Validity
The validity of an interval prediction refers to whether the probabilistic statement is true (i.e., is there an actual 90% likelihood that the true value is in the proposed interval). Validity must be ensured even with finite datasets; this is called finite-sample validity.
Efficiency/Tightness

The efficiency of a prediction interval is a measure of its tightness. While an inflated prediction interval may possess the property of finite-sample validity, it won't be efficient.

Model-Agnostic and Distribution-Free

The prediction interval shouldn't be bound to some point predictor (model) or some data distribution.
Conformal Predictors
Conformity
Conformity is a measure of agreement or harmony between a point and the expectation for that point (i.e. does the point stand out from the others). Using this we can define our prediction interval as a set:

Prediction set = {all plausible labels that, if assigned to a new object, will make it conform sufficiently with the previous objects}
To determine if a point is sufficiently conforming we define a function that measures the conformity
of an object with a bag of other objects, this is called a (non-)conformity measuring function.

f( ) : Conformity Measuring Function

f({p_0, …, p_i}; p_j)

If p_j were the blue point, the function should give a high value; if p_j were the purple point, it should give a low value.

Some realizations of non-conformity functions:

- Absolute error/loss: the difference between a line (model) fitted to all points and the point

  |M(x_j) − y_j|

- Absolute deviation: the deviation of the point from the average of all points

  |(∑_{i=1}^{n+1} y_i)/(n+1) − y_j|

The actual value given by this conformity function isn’t very useful because conformity is a relative
measure.

Instead of using the conformity value we create a ranking of the points based on conformity. With this
ranking we can propose criteria for “sufficient” conformity, which could for example be: being in the
top 90% of the conformity ranking.
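The ranking idea can be sketched with the absolute-deviation measure from above (the data values are made up for illustration):

```python
import numpy as np

def abs_deviation_scores(ys):
    """Non-conformity of each point: |mean(all points) - y_i|.
    Larger score = less conforming."""
    return np.abs(ys - ys.mean())

ys = np.array([1.0, 1.1, 0.9, 1.05, 5.0])  # the last point is an outlier
scores = abs_deviation_scores(ys)
ranks = scores.argsort()  # indices ordered from most to least conforming
```

The raw score values are not meaningful on their own; only the relative ranking is used to decide which points conform "sufficiently".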
Full/Transductive Conformal Predictors
Given a confidence level δ, training points (x_0, y_0), …, (x_{n−1}, y_{n−1}), and some test point (x_n, ?):

Conformity measuring function: F()

For every y ∈ Y, define the bag B ≜ [(x_0, y_0), …, (x_{n−1}, y_{n−1}), (x_n, y)]

f_i ≜ F(B, (x_i, y_i)) ∀i ∈ {0, …, n−1}, and f_n ≜ F(B, (x_n, y))

Γ(x_0, y_0, …, x_{n−1}, y_{n−1}, x_n) = {y ∈ Y : |{i : f_i < f_n}| / n < δ}

ℙ(y_n ∈ Γ) ≥ δ
Due to the predictor being transductive, it is very computationally expensive.
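A toy sketch of the full/transductive predictor, using the label-only absolute-deviation measure from earlier (so the x's are omitted; a real measure would involve a fitted model, and the data here are made up):

```python
import numpy as np

def full_conformal_set(y_train, candidates, delta=0.9):
    """Transductive conformal prediction set over candidate labels.

    For each candidate y we form the augmented bag, score every point by
    absolute deviation from the bag mean, and keep y when the fraction of
    training scores strictly below its own score stays under delta.
    Note the bag (and all scores) is recomputed for every candidate,
    which is what makes the full predictor expensive."""
    n = len(y_train) + 1
    kept = []
    for y in candidates:
        bag = np.append(y_train, y)
        scores = np.abs(bag - bag.mean())
        frac = np.sum(scores[:-1] < scores[-1]) / n
        if frac < delta:
            kept.append(y)
    return kept

# Labels clustered near 1.0: a nearby candidate conforms, a far one does not
kept = full_conformal_set(np.linspace(0.9, 1.1, 10), [1.0, 10.0], delta=0.9)
```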

*Transduction vs. Induction


Transduction is reasoning from observed training cases to test cases. Induction is reasoning from
observed training cases to general rules, which are then applied to test cases.
For example: a classification problem

The inductive approach would be to use the labeled points to train a model, and then have it predict labels for the unknown points. A KNN algorithm might be used, but with the few labeled data points it would be difficult to capture the complexity of the data.

The transductive approach would be to consider all the points while performing labeling. A transductive algorithm would label unlabeled points based on the clusters they belong to.

Transductive algorithms are better at making predictions with fewer labeled points. However,
transductive algorithms don’t build a model. If a previously unknown point is added to the set, the
entire transductive algorithm would need to be repeated with all of the points in order to predict a
label.
Split/Inductive Conformal Predictor
Instead of considering all points when calculating conformity, we split the dataset into two disjoint sets: a training set and a calibration set.

D = (x_0, y_0), …, (x_{n−1}, y_{n−1})

D_train, D_calibrate ⊂ D

We use the training set to fit a model using the underlying algorithm, and we use the calibration set to construct the conformity ranking and find the prediction set. This way we avoid having to refit the model for every possible label value.
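A minimal sketch of the split/inductive scheme, with the training-split "model" reduced to a simple mean predictor (any fitted regressor could take its place; the data and quantile rule are illustrative):

```python
import numpy as np

def split_conformal_interval(y_train, y_cal, alpha=0.1):
    """Split conformal interval around a trivial mean predictor.

    The model is fitted once on the training split; calibration residuals
    supply the quantile that widens the point prediction into an interval
    with roughly (1 - alpha) coverage."""
    pred = float(np.mean(y_train))            # stand-in for a fitted model M(x)
    residuals = np.sort(np.abs(y_cal - pred)) # non-conformity on calibration set
    n = len(residuals)
    k = int(np.ceil((n + 1) * (1 - alpha)))   # finite-sample corrected rank
    q = residuals[min(k, n) - 1]
    return pred - q, pred + q

y_train = np.full(5, 1.0)
y_cal = np.array([0.9, 1.1, 1.0, 0.95, 1.05, 1.2, 0.8, 1.0, 1.0, 1.0])
lo, hi = split_conformal_interval(y_train, y_cal, alpha=0.1)
```

Unlike the full predictor, the model is fitted only once, and scoring a new point needs no refitting.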
