Probabilistic Programming
An Introduction to Applications in Machine Learning

Amirabbas Asadi
amir.asadi78@sharif.edu

Sharif University of Technology
Department of Mathematical Sciences
2024
Presentation Outline

1 Bayesian Learning
2 Probabilistic Programs
3 Approximate Bayesian Inference
  Markov Chain Monte Carlo
  Variational Inference
4 Differentiable Programming
Bayesian Learning

2, 4, 6, 8, 10, ?

H_1: f_1(n) = 2n
H_2: f_2(n) = 0.0167n^5 − 0.25n^4 + 1.4167n^3 − 3.75n^2 + 6.5667n − 2
H_3: f_3(n) = 0.05n^5 − 0.75n^4 + 4.25n^3 − 11.25n^2 + 15.7n − 6

All three functions reproduce the data exactly.
All have the same likelihood:

p(D|H_1) = p(D|H_2) = p(D|H_3)

Then why do people choose the first one for the same data?
Bayesian Learning

But how can we quantify a prior belief and take it into account?
We can formulate our prior belief p(H) as a distribution over all hypotheses and weight the likelihood by it: p(D|H) p(H).

To be a valid probability distribution, a normalization constant is needed:

p(H|D) = p(D|H) p(H) / p(D)

Bayes' Theorem

Bayesian Learning provides a natural framework for updating our beliefs:

p(H) --D_1--> p(H|{D_1}) --D_2--> p(H|{D_1, D_2}) --D_3--> ⋯
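A tiny numerical illustration (the prior values below are assumptions for illustration, not from the slides): with equal likelihoods, the posterior over H_1, H_2, H_3 is determined entirely by the prior.

prior      = [0.90, 0.05, 0.05]   # assumed prior favouring the simpler hypothesis H_1
likelihood = [1.0, 1.0, 1.0]      # all three hypotheses reproduce the data exactly
posterior  = prior .* likelihood ./ sum(prior .* likelihood)
# posterior == [0.90, 0.05, 0.05]: the data alone cannot separate the hypotheses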
Probabilistic Models

How to represent a probabilistic model?

A table of all possible events!
Probabilistic Graphical Models
Probabilistic Programming

Can we represent a joint density with an algorithm?
Beta-Binomial Example

@model function coin_toss(N, X)
    θ ~ Beta(1.0, 1.0)            # uniform prior over the coin bias
    for i in 1:N
        X[i] ~ Bernoulli(θ)       # observation model for each toss (assumed Bernoulli)
    end
end

This program defines the joint density p(θ, X).
Beta-Binomial Example

Observed data: X = [1, 1, 1, 1, 0]
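A hedged usage sketch (not on the original slides) for drawing posterior samples of θ with Turing:

using Turing, Statistics

X = [1, 1, 1, 1, 0]
chain = sample(coin_toss(length(X), X), NUTS(), 1_000)
mean(chain[:θ])    # posterior mean of the bias, ≈ (1 + 4) / (2 + 5) for a Beta(1, 1) prior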
Probabilistic Programs

Probabilistic Programs offer more representation power than PGMs!

@model function program()
    T ~ Geometric(0.1)               # random number of coin tosses
    S = 0
    X = Vector{Any}(undef, T)
    for t ∈ 1:T
        X[t] ~ Bernoulli(0.5)        # each toss is a fair coin
        S = S + X[t]
    end
    return S                         # number of heads
end
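A usage sketch (assumed, not from the slides): executing the model runs the generative program forward, drawing T and the X[t] from their priors.

using Turing

m = program()
m()                                       # one forward simulation; returns a draw of S
prior_chain = sample(m, Prior(), 1_000)   # prior samples of T and the X[t]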
Probabilistic Programming Languages
Why is inference difficult?

p(z|x) = p(x|z) p(z) / p(x)

To obtain p(x) we have to marginalize over all possible hypotheses:

p(x) = ∫ p(x, z) dz

Now imagine what p(x) looks like if we use something like a neural network inside the model!
If we choose the likelihood and prior carefully, exact inference is possible.

Likelihood      Conjugate Prior
Bernoulli       Beta
Binomial        Beta
Poisson         Gamma
Categorical     Dirichlet
Uniform         Pareto
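For example, the Beta prior is conjugate to the Bernoulli likelihood: with θ ~ Beta(a, b) and observations x_1, …, x_N ∈ {0, 1},

θ | x_1, …, x_N ~ Beta(a + Σ_i x_i, b + N − Σ_i x_i)

so the posterior stays in the same family as the prior and can be written down in closed form.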
Approximate Inference Methods

Exact inference is not possible for most models.
Markov Chain Monte Carlo

The posterior p(z|x) is intractable.
Markov Chain Monte Carlo

Consider a particle in ℝ^n with an initial position (state) X_0. When the particle is in a position X_t it moves to a position X_{t+1} with probability p(X_{t+1} | X_t), so we have a sequence of random variables

X_0, X_1, X_2, ...

Such a stochastic process is called a Markov chain.

Under some conditions, after a time τ the Markov chain forgets its initial state and becomes stationary. In other words, the terms of the sequence X_τ, X_{τ+1}, ... can be treated as samples from the stationary distribution.
Markov Chain Monte Carlo

Is it possible to construct a Markov chain that converges to a specific distribution?

We need a way to guarantee the existence of a stationary distribution.
Markov Chain Monte Carlo

Definition
A Markov chain is called reversible if it satisfies the detailed balance equations: the probability of being in a state x_i and then transitioning to a state x_j is equal to the probability of being in x_j and then transitioning to x_i. Formally:

π(x_i) P(x_j|x_i) = π(x_j) P(x_i|x_j)

Random Walk Metropolis-Hastings

So we use the detailed balance equation:

π(x) P(x′|x) = π(x′) P(x|x′)
Writing the transition probability as a proposal g times an acceptance probability A, P(x′|x) = g(x′|x) A(x′, x), we rewrite the equation as

A(x′, x) / A(x, x′) = [π(x′) g(x|x′)] / [π(x) g(x′|x)]

The Metropolis-Hastings algorithm defines an acceptance ratio that satisfies the above condition:

A(x′, x) = min(1, [π(x′) g(x|x′)] / [π(x) g(x′|x)])
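A minimal random-walk Metropolis-Hastings sketch in Julia (illustrative, not from the slides); logπ is the unnormalized log target density and σ the proposal scale.

function random_walk_mh(logπ, z₀, σ, T)
    z = z₀
    samples = [z₀]
    for t ∈ 1:T
        y = z .+ σ .* randn(size(z))     # symmetric Gaussian proposal
        α = logπ(y) - logπ(z)            # g cancels for a symmetric proposal
        if rand() < exp(α)
            z = y
        end
        push!(samples, z)
    end
    return samples
end

# usage: target a 2D standard normal
# samples = random_walk_mh(z -> -0.5 * sum(abs2, z), zeros(2), 0.5, 10_000)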
Random Walk Metropolis-Hastings

The acceptance probability becomes unsatisfactory in higher dimensions.
RWMH treats the target density as a black box.

Advanced MCMC methods exploit properties of the target density!
Langevin Monte Carlo

Overdamped Langevin dynamics:

dX_t = ∇log π(X_t) dt + √2 dW_t

using ForwardDiff, LinearAlgebra

# one Euler-Maruyama step of the discretized Langevin dynamics with step size τ
function langevin_dynamics(logπ, z, τ)
    ζ = sqrt(2τ) * randn(size(z))            # injected Gaussian noise
    ∇logπ = ForwardDiff.gradient(logπ, z)    # gradient of the log target density
    z .+ τ * ∇logπ .+ ζ
end

# Metropolis-adjusted Langevin algorithm (MALA): Langevin proposal + MH correction
function langevin_monte_carlo(logπ, z₀, τ, T)
    z = z₀
    samples = [z₀]
    for t ∈ 1:T
        ζ = sqrt(2τ) * randn(size(z))
        ∇logπ = ForwardDiff.gradient(logπ, z)
        y = z .+ τ * ∇logπ .+ ζ                               # Langevin proposal
        ∇logπ_y = ForwardDiff.gradient(logπ, y)
        logg_y_z = -1/(4τ) * norm(y .- z .- τ * ∇logπ)^2      # log g(y|z)
        logg_z_y = -1/(4τ) * norm(z .- y .- τ * ∇logπ_y)^2    # log g(z|y)
        α = logπ(y) + logg_z_y - logπ(z) - logg_y_z           # log acceptance ratio
        if rand() < exp(α)
            z = y
            push!(samples, z)
        end
    end
    return samples
end
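A usage sketch (not from the slides), targeting a 2D standard normal:

samples = langevin_monte_carlo(z -> -0.5 * sum(abs2, z), randn(2), 0.05, 10_000)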
Research Trends

alternatives for the base dynamics (HMC)
adaptive samplers (NUTS)
exploiting other properties of the target density
compositional samplers
SMC, particle filters
Useful Tools

BlackJAX: Composable Bayesian Inference in JAX

efficient implementation of samplers
composable samplers
GPU acceleration
suitable for designing PPLs or using with existing ones
Example: Bayesian Linear Regression
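The model definition shown on the slides is lost in extraction; a minimal Turing.jl sketch of what a bayesian_regression(x, y) model could look like (the priors and variable names here are assumptions):

using Turing

@model function bayesian_regression(x, y)
    σ ~ truncated(Normal(0, 1); lower=0)   # observation noise (assumed prior)
    α ~ Normal(0, 10)                      # intercept
    β ~ Normal(0, 10)                      # slope
    for i in eachindex(y)
        y[i] ~ Normal(α + β * x[i], σ)
    end
end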
Example: Bayesian Linear Regression

Inference:

ch = sample(bayesian_regression(x, y), NUTS(), 10000)
Example: Bayesian Neural Network

Let f(x; w) be a neural network with three layers and sigmoid activation:

w = [w_L1, w_L2, w_L3]

f(x; w) = σ(σ(σ(x w_L1) w_L2) w_L3)

We define a multivariate normal prior over w:

w ~ 𝒩(0, I)
Example: Bayesian Neural Network

Some samples from p(w)
Example: Bayesian Neural Network

Now we update our belief about the network weights using the dataset below:

y ~ Bernoulli(f(x; w))
Example: Bayesian Neural Network
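Only the tail of the neural_network helper survives extraction; a plausible reconstruction, assuming the three dense sigmoid layers defined earlier and a flat weight vector per layer (the two input features and the reshaping are assumptions):

using Turing, LinearAlgebra

sigmoid(x) = 1 / (1 + exp(-x))

function neural_network(X, L1, L2, L3, H_dim)
    W1 = reshape(L1, 2, H_dim)        # assumed: two input features
    W2 = reshape(L2, H_dim, H_dim)
    W3 = reshape(L3, H_dim, 1)
    H1 = sigmoid.(X * W1)
    H2 = sigmoid.(H1 * W2)
    O  = sigmoid.(H2 * W3)
    return O
end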
Example: Bayesian Neural Network

@model function bayesian_neural_network(X, y, σ, H_dim)
    # number of weights in each layer
    n_L1 = 2 * H_dim
    n_L2 = H_dim * H_dim
    n_L3 = H_dim

    # isotropic prior covariance with standard deviation σ
    Σ(n) = Diagonal(abs2.(σ .* ones(n)))

    # multivariate normal priors over the flattened layer weights
    L1 ~ MvNormal(zeros(n_L1), Σ(n_L1))
    L2 ~ MvNormal(zeros(n_L2), Σ(n_L2))
    L3 ~ MvNormal(zeros(n_L3), Σ(n_L3))

    # forward pass and Bernoulli likelihood for the binary labels
    O = neural_network(X, L1, L2, L3, H_dim)
    for i in eachindex(y)
        y[i] ~ Bernoulli(O[:][i])
    end
end
Example: Bayesian Neural Network

Some samples from p(w|y)
Bayesian Neural Ordinary Differential Equations

Imagine we have a differential equation with unknown parameters:

du_1/dt = −α u_1 − β u_1 u_2
du_2/dt = −δ u_2 + γ u_1 u_2

How to differentiate through an ODE solver!?

SciML: Open Source Software for Scientific Machine Learning
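A hedged sketch (not from the slides) of inferring the unknown ODE parameters by combining Turing.jl with DifferentialEquations.jl; the priors, solver choice, and noise model are assumptions:

using Turing, DifferentialEquations, LinearAlgebra

function dynamics!(du, u, p, t)
    α, β, γ, δ = p
    du[1] = -α * u[1] - β * u[1] * u[2]
    du[2] = -δ * u[2] + γ * u[1] * u[2]
end

prob = ODEProblem(dynamics!, [1.0, 1.0], (0.0, 10.0), ones(4))

@model function bayes_ode(data, ts)
    α ~ truncated(Normal(1.5, 0.5); lower=0)   # assumed priors over the parameters
    β ~ truncated(Normal(1.0, 0.5); lower=0)
    γ ~ truncated(Normal(1.0, 0.5); lower=0)
    δ ~ truncated(Normal(1.0, 0.5); lower=0)
    σ ~ InverseGamma(2, 3)                     # observation noise

    pred = solve(prob, Tsit5(); p=[α, β, γ, δ], saveat=ts)   # differentiable ODE solve
    for i in eachindex(ts)
        data[:, i] ~ MvNormal(pred[i], σ^2 * I)
    end
end

# chain = sample(bayes_ode(data, ts), NUTS(0.65), 1_000)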
Variational Inference

Another idea is to find a tractable density q(z; λ) as close as possible to the posterior.
Variational Inference

log p(x) = log ∫ p(x, z) dz
         = log ∫ p(x, z) q(z; λ) / q(z; λ) dz
         = log E_{q(z;λ)} [ p(x, z) / q(z; λ) ]
         ≥ E_{q(z;λ)} [ log ( p(x, z) / q(z; λ) ) ]          (Jensen's inequality)
Variational Inference

The lower bound is the evidence lower bound (ELBO):

ℒ(λ) = E_{q(z;λ)} [ log p(x, z) − log q(z; λ) ]

Maximizing ℒ(λ) is equivalent to minimizing D_KL(q || p).
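In Turing.jl this optimization can be run with ADVI; a hedged usage sketch (the exact API may vary across Turing versions):

using Turing

model = coin_toss(5, [1, 1, 1, 1, 0])   # any Turing model, e.g. the earlier example
q = vi(model, ADVI(10, 1_000))          # 10 Monte Carlo samples per gradient step, 1000 steps
z = rand(q, 1_000)                      # samples from the fitted variational density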
Research Trends

How to pick a variational distribution?
Is minimizing D_KL a good idea?!
Differentiable Programming

The moral of the story!
Advanced inference methods require our program to be differentiable.

How to write a probabilistic program in a differentiable way?
Differentiable Programming

Softmax as a differentiable approximation of argmax:

softmax(x_i) = e^{x_i} / Σ_j e^{x_j}
Now can we come up with a soft approximation of sort?

[10, 5, 20, 8, 40, 0]
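One simple differentiable relaxation (illustrative, not the construction referenced on the slide): soften the rank of each element with pairwise sigmoid comparisons; as the temperature τ → 0 the hard ranks are recovered.

sigmoid(t) = 1 / (1 + exp(-t))

# soft rank: rank_i ≈ 1 + Σ_j sigmoid((x_j − x_i) / τ); the largest element gets rank ≈ 1
function soft_rank(x; τ = 1.0)
    n = length(x)
    [1 + sum(sigmoid((x[j] - x[i]) / τ) for j in 1:n if j != i) for i in 1:n]
end

soft_rank([10, 5, 20, 8, 40, 0]; τ = 0.1)   # ≈ [3, 5, 2, 4, 1, 6]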
Example: Inverse Optimization

Suppose we have a graph with unknown edge weights.
CVXpyLayers: Differentiable Convex Optimization Layers
AlgoVision: Differentiable Algorithms and Algorithmic Supervision
Learning Resources

Probabilistic Programming (25 lectures) by Dr. Frank Wood at UBC
The book An Introduction to Probabilistic Programming
Learning Resources

Monte Carlo Methods
Adrian Barbu, Song-Chun Zhu
Learning Resources

Neural Algorithmic Reasoning
Homepage: algo-reasoning.github.io