Advanced ML Notes (Midterm)
Gaussian Processes
Multivariate Gaussian Distributions
A Gaussian distribution is a continuous probability distribution for a real-valued random variable.
$X \sim N(\mu, \sigma^2)$

PDF: $f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2\right)$
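A quick sanity check of the PDF formula (a minimal sketch, assuming NumPy and SciPy are available; the values of $\mu$ and $\sigma$ are arbitrary):

```python
import numpy as np
from scipy.stats import norm

mu, sigma = 1.0, 2.0          # example mean and standard deviation
x = np.linspace(-5, 7, 5)     # a few evaluation points

# PDF written out exactly as in the formula above
pdf_manual = (1.0 / (sigma * np.sqrt(2 * np.pi))) * np.exp(-0.5 * ((x - mu) / sigma) ** 2)

# Reference implementation from SciPy
pdf_scipy = norm.pdf(x, loc=mu, scale=sigma)

print(np.allclose(pdf_manual, pdf_scipy))  # True
```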
Given a column vector of normally distributed random variables, each with finite mean and variance, the joint distribution of the vector is also normally distributed. This is called a multivariate Gaussian distribution.
$X = \begin{pmatrix} X_1 \\ X_2 \\ \vdots \\ X_n \end{pmatrix} \sim N(\mu, K)$

$\mu = \begin{pmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_n \end{pmatrix}, \quad K = \begin{bmatrix} K_{X_1 X_1} & K_{X_1 X_2} & \cdots & K_{X_1 X_n} \\ K_{X_1 X_2} & K_{X_2 X_2} & \cdots & K_{X_2 X_n} \\ \vdots & \vdots & \ddots & \vdots \\ K_{X_1 X_n} & K_{X_2 X_n} & \cdots & K_{X_n X_n} \end{bmatrix}$
Gaussian distributions are closed under marginalization and conditioning, meaning the distributions resulting from these operations are also Gaussian.

Marginalization

$P_X(x) = \int P_{X,Y}(x, y)\, dy = \int P_{X|Y}(x \mid y)\, P_Y(y)\, dy = \mathbb{E}_Y[P_{X|Y}(x \mid y)]$

The marginal probability $P_X$ is obtained by integrating $P_{X,Y}$ over all values of $Y$. This can be thought of as examining the conditional probability of $X$ given a particular value of $Y$, and then averaging this conditional probability over the distribution of $Y$.

Conditioning

$X \mid Y \sim N\left(\mu_X + K_{XY} K_{YY}^{-1}(Y - \mu_Y),\; K_{XX} - K_{XY} K_{YY}^{-1} K_{YX}\right)$

$Y \mid X \sim N\left(\mu_Y + K_{YX} K_{XX}^{-1}(X - \mu_X),\; K_{YY} - K_{YX} K_{XX}^{-1} K_{XY}\right)$

The conditional probabilities $P_{X|y} = \frac{P(X, y)}{P(Y = y)}$ and $P_{Y|x} = \frac{P(x, Y)}{P(X = x)}$ are calculated using Bayes' rule.
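A minimal NumPy sketch of the conditioning formulas above; the 3-dimensional joint Gaussian and its block split into $X$ and $Y$ are made-up illustrative values:

```python
import numpy as np

# Joint Gaussian over (X, Y): X is the first component, Y the last two.
mu = np.array([0.0, 1.0, -1.0])
K = np.array([[2.0, 0.8, 0.3],
              [0.8, 1.5, 0.5],
              [0.3, 0.5, 1.0]])   # must be symmetric positive semi-definite

ix, iy = [0], [1, 2]             # index blocks for X and Y
mu_x, mu_y = mu[ix], mu[iy]
K_xx = K[np.ix_(ix, ix)]
K_xy = K[np.ix_(ix, iy)]
K_yx = K[np.ix_(iy, ix)]
K_yy = K[np.ix_(iy, iy)]

# Marginalization: X alone is simply N(mu_x, K_xx).

# Conditioning: X | Y = y_obs ~ N(mu_x + K_xy K_yy^{-1}(y_obs - mu_y), K_xx - K_xy K_yy^{-1} K_yx)
y_obs = np.array([1.4, -0.2])    # an observed value of Y
cond_mean = mu_x + K_xy @ np.linalg.solve(K_yy, y_obs - mu_y)
cond_cov  = K_xx - K_xy @ np.linalg.solve(K_yy, K_yx)

print(cond_mean, cond_cov)
```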
Gaussian Processes
A Gaussian process is a collection of random variables, any finite number of which have a joint
Gaussian distribution.
$\{X_t;\, t \in T\}$ is Gaussian if and only if, for every finite set of indices $t_1, \ldots, t_k \in T$, the vector $(X_{t_1}, \ldots, X_{t_k})$ has a multivariate Gaussian distribution.

In simpler terms, for $\{X_t;\, t \in T\}$ (a process) to be a Gaussian process, any linear combination of $(X_{t_1}, \ldots, X_{t_k})$ must follow a univariate Gaussian distribution.
When Gaussian processes are used in a task like regression or classification, each data point is treated as a random variable. With each new data point we then try to infer the underlying distribution of these random variables (Bayesian inference).
Suppose $X$ is the training data (features) and $Y$ is the testing data (labels). We model the underlying distribution as a multivariate Gaussian distribution $P_{X,Y}$; this distribution spans the space of possible function values that we want to predict.
Priors and Posteriors
Before looking at our data, there are many possible "candidate" functions that could have produced our data; these are called priors.
As we look at the data, we can narrow down the functions that could have generated it. We perform conditioning on our priors using our data points to turn them into posteriors. The more data points we look at, the more confident we can be about the data-generating function.
Kernels
A kernel is a function that measures the similarity of two inputs 𝑥, 𝑥′ and is tuned by a set of
hyperparameters 𝜏.
Ex.
Squared-exponential (SE) / Radial Basis Function (RBF):

$k(x, x' \mid \tau) = \sigma^2 \exp\left(-\frac{1}{2}\left(\frac{x - x'}{\ell}\right)^2\right), \quad \tau = \{\sigma, \ell\}$
Changing the hyperparameters of the kernel changes the covariance matrix and the resulting
functions sampled using the kernel.
*Figure: prior samples drawn with a small ℓ, a large ℓ, and a large σ.
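A minimal sketch of sampling prior functions from a zero-mean GP with the SE/RBF kernel; the input grid and the (σ, ℓ) settings are arbitrary illustrative choices:

```python
import numpy as np

def rbf_kernel(x1, x2, sigma=1.0, ell=1.0):
    """Squared-exponential kernel k(x, x') = sigma^2 * exp(-0.5 * ((x - x') / ell)^2)."""
    diff = x1[:, None] - x2[None, :]
    return sigma**2 * np.exp(-0.5 * (diff / ell) ** 2)

x = np.linspace(0, 5, 100)                   # input grid
rng = np.random.default_rng(0)

for sigma, ell in [(1.0, 0.3), (1.0, 2.0), (3.0, 2.0)]:   # small l, large l, large sigma
    K = rbf_kernel(x, x, sigma=sigma, ell=ell)
    K += 1e-8 * np.eye(len(x))               # jitter for numerical stability
    # Draw 3 functions from the zero-mean GP prior N(0, K)
    samples = rng.multivariate_normal(np.zeros(len(x)), K, size=3)
    print(f"sigma={sigma}, ell={ell}: sample std = {samples.std():.2f}")
```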
$D = \{(x_i, y_i)\}_{i=1}^{N}$ *Data
The posterior distribution over the predicted function $f^*$ comes from conditioning the Gaussian process on the data points:

$P_{f^*|X^*, D} \sim N(\mu_{f^*}, K_{f^*})$

$P_{f^*|X^*, D} \sim N\left(K_{X^*,X} K_{X,X}^{-1}\, y,\; K_{X^*,X^*} - K_{X^*,X} K_{X,X}^{-1} K_{X,X^*}\right)$
Noise
We add noise to the training points $Y$ to model measurement error:

$Y = f(X) + \epsilon, \quad \epsilon \sim N(0, \sigma_\epsilon^2)$

*$\sigma_\epsilon^2$ is a hyperparameter
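A minimal sketch of the noisy GP posterior above (the rbf_kernel helper and the training data are illustrative assumptions, not part of the notes):

```python
import numpy as np

def rbf_kernel(x1, x2, sigma=1.0, ell=1.0):
    diff = x1[:, None] - x2[None, :]
    return sigma**2 * np.exp(-0.5 * (diff / ell) ** 2)

# Made-up noisy training data D = {(x_i, y_i)} and test inputs X*
rng = np.random.default_rng(1)
X = np.array([0.5, 1.5, 2.0, 3.5, 4.0])
y = np.sin(X) + 0.1 * rng.standard_normal(X.shape)   # Y = f(X) + eps
X_star = np.linspace(0, 5, 50)

sigma_eps = 0.1                                      # noise hyperparameter sigma_eps

K_xx = rbf_kernel(X, X) + sigma_eps**2 * np.eye(len(X))   # K_{X,X} + noise on the diagonal
K_sx = rbf_kernel(X_star, X)                               # K_{X*,X}
K_ss = rbf_kernel(X_star, X_star)                          # K_{X*,X*}

# Posterior mean and covariance:
#   mu_f*  = K_{X*,X} K_{X,X}^{-1} y
#   K_f*   = K_{X*,X*} - K_{X*,X} K_{X,X}^{-1} K_{X,X*}
alpha   = np.linalg.solve(K_xx, y)
mu_star = K_sx @ alpha
K_star  = K_ss - K_sx @ np.linalg.solve(K_xx, K_sx.T)

std_star = np.sqrt(np.clip(np.diag(K_star), 0, None))     # pointwise predictive std
print(mu_star[:5], std_star[:5])
```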
Combining Kernels
Kernels can be combined to create more specialized kernels that better represent our prior knowledge about the data.
Common kernel combinations are addition and multiplication, but any combination is allowed as long as the covariance matrix produced by the resulting kernel is positive semi-definite.
Hyperparameter Selection
We tune the hyperparameters of the kernel $\tau$ and the noise parameter $\sigma_\epsilon^2$ such that the prior samples explain the data well even before conditioning.

$k_{lin}(x, x') = v\, x x'$

$k_{quad}(x, x') = (v_1\, x x')(v_2\, x x')$

The prior samples produced using the new kernel are much more likely to fit our data.
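A minimal sketch of combining kernels: the product of two linear kernels $k_{lin} = v\,xx'$ gives the quadratic kernel $k_{quad}$, and the resulting covariance matrix remains positive semi-definite (the grid and the $v$ values are arbitrary):

```python
import numpy as np

def linear_kernel(x1, x2, v=1.0):
    """k_lin(x, x') = v * x * x'"""
    return v * np.outer(x1, x2)

x = np.linspace(-2, 2, 20)

# Product of two linear kernels gives the quadratic kernel k_quad = (v1 x x')(v2 x x')
K_quad = linear_kernel(x, x, v=0.5) * linear_kernel(x, x, v=2.0)

# Sums and products of valid kernels remain positive semi-definite
eigvals = np.linalg.eigvalsh(K_quad)
print(eigvals.min() >= -1e-10)   # True (up to numerical error)
```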
Conformal Prediction
Uncertainty Quantification
Uncertainty quantification is the process of assessing the reliability and confidence level of predictions made by an ML model. The idea is to provide a prediction interval rather than a single point prediction, and to attach a confidence level to this interval.
*Figure: a single point prediction vs. an interval prediction with an attached probabilistic statement.
Validity
The validity of the interval prediction refers to whether the probabilistic statement is true (i.e. is there an actual 90% likelihood that the true value lies in the proposed interval). Validity must be ensured even with finite datasets; this is called finite-sample validity.
Efficiency/Tightness

The efficiency (or tightness) of the interval prediction refers to how narrow the proposed interval is; among valid predictors, tighter intervals are more informative.

Conformity

A conformity function $f(\{p_0, \ldots, p_i\};\, p_j)$ measures how well a point $p_j$ conforms to a set of points $\{p_0, \ldots, p_i\}$. For example:

- Model residual: the deviation of the point from the model's prediction, $|M(x_j) - y_j|$
- Absolute deviation: the deviation of the point from the average of all points, $\left|\frac{\sum_{i=1}^{n} y_i + y_j}{n+1} - y_j\right|$
The actual value given by this conformity function isn’t very useful because conformity is a relative
measure.
Instead of using the conformity value we create a ranking of the points based on conformity. With this
ranking we can propose criteria for “sufficient” conformity, which could for example be: being in the
top 90% of the conformity ranking.
Full/Transductive Conformal Predictors
Given a confidence level 𝛿, training points (𝑥0 , 𝑦0 ), … (𝑥𝑛−1 , 𝑦𝑛−1 ), and some test point (𝑥𝑛 , ? )
Conformity Measuring Function: 𝐹()
$\forall y \in Y$, let $B \triangleq [(x_0, y_0), \ldots, (x_{n-1}, y_{n-1}), (x_n, y)]$

$f_i \triangleq F(B, (x_i, y_i)) \;\; \forall i \in \{0, \ldots, n-1\}$ and $f_n \triangleq F(B, (x_n, y))$

$\Gamma(x_0, y_0, \ldots, x_{n-1}, y_{n-1}, x_n) = \left\{ y \in Y : \frac{|\{i = 1, \ldots, n : f_i < f_n\}|}{n} < \delta \right\}$

$\mathbb{P}(y_n \in \Gamma) \geq \delta$
Because the predictor is transductive, it is very computationally expensive: the conformity scores must be recomputed for every candidate value of $y$ and for every new test point.
Transductive algorithms are better at making predictions with fewer labeled points. However,
transductive algorithms don’t build a model. If a previously unknown point is added to the set, the
entire transductive algorithm would need to be repeated with all of the points in order to predict a
label.
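A minimal sketch of the full/transductive procedure above, using the "deviation from the average" conformity function (which only looks at the labels) and a discretized grid of candidate labels; the data and the grid are made up for illustration:

```python
import numpy as np

def full_conformal_set(y_train, y_grid, delta):
    """Full/transductive conformal prediction with the 'deviation from the average'
    conformity function; returns the accepted candidate labels (the prediction set)."""
    n = len(y_train)
    accepted = []
    for y in y_grid:                        # every candidate label must be tried
        bag = np.append(y_train, y)         # B: the n training labels plus the candidate
        avg = bag.mean()                    # average of all n + 1 points
        f = np.abs(avg - bag)               # f_i for the training points; f_n is the last entry
        frac = np.mean(f[:-1] < f[-1])      # |{i : f_i < f_n}| / n
        if frac < delta:                    # criterion from the formula above
            accepted.append(y)
    return np.array(accepted)

# Illustrative data: 20 labels and a grid of candidate values
rng = np.random.default_rng(0)
y_train = rng.normal(loc=5.0, scale=1.0, size=20)
y_grid = np.linspace(0.0, 10.0, 501)

interval = full_conformal_set(y_train, y_grid, delta=0.9)
print(interval.min(), interval.max())       # rough prediction interval for a new label
```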
Split/Inductive Conformal Predictor
Instead of considering all points when calculating conformity, we split the dataset into two disjoint sets: a training set and a calibration set.
𝐷 = (𝑥0 , 𝑦0 ), … (𝑥𝑛−1 , 𝑦𝑛−1 )
𝐷𝑡𝑟𝑎𝑖𝑛 , 𝐷𝑐𝑎𝑙𝑖𝑏𝑟𝑎𝑡𝑒 ⊂ 𝐷
We use the training set to fit a model using the underlying algorithm, and we use the calibration set to construct the conformity ranking and find the prediction set. This way we get around having to refit the model for every possible value.
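A minimal sketch of the split/inductive procedure, using a simple least-squares fit as a stand-in for the underlying algorithm and the residual $|M(x_j) - y_j|$ as the score on the calibration set; the data, the model, and the quantile rule are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative 1-D regression data
x = rng.uniform(0, 10, size=200)
y = 2.0 * x + rng.normal(scale=1.5, size=200)

# Split D into two disjoint sets: D_train and D_calibrate
x_tr, y_tr = x[:100], y[:100]
x_cal, y_cal = x[100:], y[100:]

# Underlying algorithm: a least-squares line fit on D_train (stands in for any model M)
slope, intercept = np.polyfit(x_tr, y_tr, deg=1)
predict = lambda x_new: slope * x_new + intercept

# Calibration: score each calibration point with the residual |M(x_j) - y_j|
scores = np.abs(predict(x_cal) - y_cal)

# Threshold chosen so that roughly a delta fraction of calibration points conform better
delta = 0.9
n_cal = len(scores)
q = np.sort(scores)[int(np.ceil((n_cal + 1) * delta)) - 1]   # finite-sample quantile

# Prediction interval for a new input: M(x_new) +/- q, no refitting needed
x_new = 4.2
print(predict(x_new) - q, predict(x_new) + q)
```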