Cheatsheet: Variables-based models
• If any of these domains becomes empty, we stop the local backtracking search.

• If we un-assign a variable Xi, we have to restore the domains of its neighbors.

Factor graphs
r Definition – A factor graph, also referred to as a Markov random field, is a set of variables
X = (X1,...,Xn) where Xi ∈ Domaini, together with m factors f1,...,fm with each fj(X) ≥ 0.
r Most constrained variable – It is a variable-level ordering heuristic that selects the next
unassigned variable that has the fewest consistent values. This has the effect of making
inconsistent assignments fail earlier in the search, which enables more efficient pruning.
r Least constrained value – It is a value-level ordering heuristic that assigns the next value
that yields the highest number of consistent values of neighboring variables. Intuitively, this
procedure chooses first the values that are most likely to work.
Remark: in practice, this heuristic is useful when all factors are constraints.
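The two ordering heuristics above can be sketched in a few lines of Python. This is a minimal sketch, not part of the original cheatsheet: `domains` maps each unassigned variable to its current consistent values, and `count_consistent` is a hypothetical helper that counts the consistent values left for the neighbors after a tentative assignment.

```python
# Sketch of the two ordering heuristics for backtracking search.
# `domains`: dict mapping unassigned variables to their consistent values.
# `count_consistent(var, value, assignment)`: hypothetical helper counting
# the consistent values left for var's neighbors after assigning var=value.

def most_constrained_variable(domains):
    # Pick the unassigned variable with the fewest consistent values left.
    return min(domains, key=lambda var: len(domains[var]))

def least_constrained_value(var, domains, assignment, count_consistent):
    # Order values so that the one leaving the most options open comes first.
    return sorted(domains[var],
                  key=lambda v: count_consistent(var, v, assignment),
                  reverse=True)
```

With these, a backtracking search would call `most_constrained_variable` to choose the next variable and iterate over `least_constrained_value` to choose its values.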
r Scope and arity – The scope of a factor fj is the set of variables it depends on. The size of
this set is called the arity.
Remark: factors of arity 1 and 2 are called unary and binary respectively.
r Assignment weight – Each assignment x = (x1,...,xn) yields a weight Weight(x) defined as
the product of all factors fj applied to that assignment. Its expression is given by:

Weight(x) = ∏_{j=1}^{m} fj(x)

Here, constraint j is said to be satisfied by assignment x if and only if fj(x) = 1.
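The weight definition above is straightforward to compute. A minimal sketch, assuming factors are represented as callables on a dict-valued assignment (this representation is an assumption, not from the cheatsheet):

```python
from math import prod

# Sketch: an assignment's weight is the product of all factors evaluated
# on it. Factors are callables returning non-negative numbers; factors
# valued in {0, 1} act as constraints.

def weight(assignment, factors):
    # Weight(x) = product over j of f_j(x)
    return prod(f(assignment) for f in factors)

# Example with two binary constraint factors over x1, x2, x3:
factors = [
    lambda x: 1.0 if x["x1"] != x["x2"] else 0.0,
    lambda x: 1.0 if x["x2"] != x["x3"] else 0.0,
]
print(weight({"x1": 0, "x2": 1, "x3": 0}, factors))  # both satisfied -> 1.0
```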
r AC-3 – The AC-3 algorithm is a multi-step lookahead heuristic that applies forward checking
to all relevant variables. After a given assignment, it performs forward checking and then
successively enforces arc consistency with respect to the neighbors of variables whose domains
changed during the process.
Remark: AC-3 can be implemented both iteratively and recursively.

• we assign to each element u ∈ Domaini a weight w(u) that is the product of all factors
connected to that variable,

• we sample v from the probability distribution induced by w and assign it to Xi.

Remark: Gibbs sampling can be seen as the probabilistic counterpart of ICM. It has the
advantage of being able to escape local minima in most cases.

r Independence – Let A,B be a partitioning of the variables X. We say that A and B are
independent if there are no edges between A and B, and we write:

A,B independent ⇐⇒ A ⊥⊥ B

Remark: independence is the key property that allows us to solve subproblems in parallel.

r Conditional independence – We say that A and B are conditionally independent given C
if conditioning on C produces a graph in which A and B are independent. In this case, it is
written:

A ⊥⊥ B | C

r Elimination – Elimination is a factor graph transformation that removes Xi from the graph
and solves a small subproblem conditioned on its Markov blanket as follows:

• Consider all factors fi,1,...,fi,k that depend on Xi

• Remove Xi and fi,1,...,fi,k

• Add a new factor fnew,i defined as:

fnew,i(x) = max_{xi} ∏_{l=1}^{k} fi,l(x)
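The Gibbs sampling steps described above (weight each candidate value by the product of connected factors, then sample) can be sketched as follows. This is a minimal sketch with an assumed representation: `factors_of[i]` is a hypothetical map from a variable to the factors connected to it, each factor a callable on the full assignment.

```python
import random

# Sketch of one Gibbs-sampling sweep over the variables, following the
# two steps above. Assumes at least one value of each variable has
# positive weight.

def gibbs_sweep(assignment, domains, factors_of, rng=random):
    for i, domain in domains.items():
        # weight w(u): product of all factors connected to Xi, with Xi = u
        weights = []
        for u in domain:
            assignment[i] = u
            w = 1.0
            for f in factors_of[i]:
                w *= f(assignment)
            weights.append(w)
        # sample v from the distribution induced by w and assign it to Xi
        assignment[i] = rng.choices(domain, weights=weights)[0]
    return assignment
```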
r Treewidth – The treewidth of a factor graph is the maximum arity of any factor created by
variable elimination with the best variable ordering.
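The elimination transformation described above can be sketched as code. A minimal sketch under assumed representations (factors as callables on a dict assignment; `domain_i` is Xi's domain): eliminating Xi yields a new factor over its Markov blanket that maximizes out xi.

```python
# Sketch of the elimination step: removing Xi produces a new factor
# f_new(x) = max over xi of the product of the factors f_{i,1},...,f_{i,k}
# that depend on Xi.

def eliminate(i, domain_i, factors_i):
    def f_new(assignment):
        best = 0.0
        for xi in domain_i:
            extended = dict(assignment, **{i: xi})
            p = 1.0
            for f in factors_i:
                p *= f(extended)
            best = max(best, p)
        return best
    return f_new
```

Note this is the max-product version matching the formula above; a sum-product variant would replace the max by a sum.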
r Locally normalized – For each xParents(i), all factors are local conditional distributions.
Hence they have to satisfy:

∑_{xi} p(xi|xParents(i)) = 1
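The local-normalization condition above is easy to verify numerically. A small sketch with an assumed representation: `cpt` (hypothetical) maps each parent assignment to a distribution over xi.

```python
# Check that, for each setting of the parents, the conditional
# p(xi | xParents(i)) sums to 1 over xi.

cpt = {
    ("rain",): {"wet": 0.9, "dry": 0.1},
    ("sun",):  {"wet": 0.2, "dry": 0.8},
}

def locally_normalized(cpt, tol=1e-9):
    return all(abs(sum(dist.values()) - 1.0) < tol for dist in cpt.values())

print(locally_normalized(cpt))  # True
```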
Bayesian networks

In this section, our goal will be to compute conditional probabilities. What is the probability of
a query given evidence?

Introduction

r Explaining away – Suppose causes C1 and C2 influence an effect E. Conditioning on the
effect E and on one of the causes (say C1) changes the probability of the other cause (say C2).
In this case, we say that C1 has explained away C2.

r Directed acyclic graph – A directed acyclic graph (DAG) is a finite directed graph with
no directed cycles.

r Bayesian network – A Bayesian network is a directed acyclic graph (DAG) that specifies
a joint distribution over random variables X = (X1,...,Xn) as a product of local conditional
distributions, one for each node:

P(X1 = x1,...,Xn = xn) ≜ ∏_{i=1}^{n} p(xi|xParents(i))

Remark: Bayesian networks are factor graphs imbued with the language of probability.

Probabilistic programs

r Concept – A probabilistic program randomizes variable assignment. That way, we can write
down complex Bayesian networks that generate assignments without us having to explicitly
specify associated probabilities.
Remark: examples of probabilistic programs include Hidden Markov model (HMM), factorial
HMM, naive Bayes, latent Dirichlet allocation, diseases and symptoms and stochastic block
models.

r Summary – The list below summarizes the common probabilistic programs as well as their
applications:

• Markov model: Xi ∼ p(Xi|Xi−1) — language modeling

• Hidden Markov model (HMM): Ht ∼ p(Ht|Ht−1), Et ∼ p(Et|Ht) — object tracking

• Factorial HMM: Ht^o ∼ p(Ht^o|Ht−1^o) for o ∈ {a,b}, Et ∼ p(Et|Ht^a,Ht^b) — multiple object tracking

• Naive Bayes: Y ∼ p(Y), Wi ∼ p(Wi|Y) — document classification

Inference

r General probabilistic inference strategy – The strategy to compute the probability
P(Q|E = e) of query Q given evidence E = e is as follows:

• Step 1: Remove variables that are not ancestors of the query Q or the evidence E by
marginalization

• Step 2: Convert Bayesian network to factor graph

• Step 3: Condition on the evidence E = e

• Step 4: Remove nodes disconnected from the query Q by marginalization

• Step 5: Run a probabilistic inference algorithm (manual, variable elimination, Gibbs
sampling, particle filtering)

r Forward-backward algorithm – This algorithm computes exact smoothing queries in an
HMM of size L using forward messages F, backward messages B and smoothed values S,
with the convention F0 = BL+1 = 1. From this procedure and these notations, we get that:

P(H = hk|E = e) = Sk(hk)

Remark: this algorithm interprets each assignment to be a path where each edge hi−1 → hi is
of weight p(hi|hi−1)p(ei|hi).

r Gibbs sampling – This algorithm is an iterative approximate method that uses a small set of
assignments (particles) to represent a large probability distribution. From a random assignment
x, Gibbs sampling performs the following steps for i ∈ {1,...,n} until convergence:

• For all u ∈ Domaini, compute the weight w(u) of assignment x where Xi = u

• Sample v from the probability distribution induced by w and set Xi = v

r Particle filtering – This algorithm keeps track of a set C of K particles over time by running
the following 3 steps iteratively:

• Step 1: proposal - For each old particle xt−1 ∈ C, sample x from the transition probability
distribution p(x|xt−1) and add x to a set C′.

• Step 2: weighting - Weigh each x of the set C′ by w(x) = p(et|x), where et is the evidence
observed at time t.

• Step 3: resampling - Sample K elements from the set C′ using the probability distribution
induced by w and store them in C: these are the current particles xt.

Remark: a more expensive version of this algorithm also keeps track of past particles in the
proposal step.

r Maximum likelihood – If we don't know the local conditional distributions, we can learn
them using maximum likelihood:

max_θ ∏_{x∈Dtrain} p(X = x; θ)

r Laplace smoothing – For each distribution d and partial assignment (xParents(i),xi), add λ
to countd(xParents(i),xi), then normalize to get probability estimates.
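The Laplace smoothing recipe above (add λ to each count, then normalize) can be sketched directly. This is a minimal sketch with assumed names: `data` is a list of (parent assignment, value) observations and `values` is the domain of Xi.

```python
from collections import Counter

# Sketch of Laplace smoothing for one local conditional distribution:
# add lam to the count of every (parents, value) pair, then normalize
# per parent assignment to get probability estimates.

def laplace_estimates(data, values, lam=1.0):
    counts = Counter(data)
    parents_seen = {p for p, _ in data}
    estimates = {}
    for p in parents_seen:
        total = sum(counts[(p, v)] + lam for v in values)
        for v in values:
            estimates[(p, v)] = (counts[(p, v)] + lam) / total
    return estimates

probs = laplace_estimates([("sun", "dry"), ("sun", "dry"), ("sun", "wet")],
                          values=["dry", "wet"], lam=1.0)
print(probs[("sun", "dry")])  # (2 + 1) / (3 + 2) = 0.6
```

With λ = 0 this reduces to the plain maximum-likelihood counts; λ > 0 keeps unseen values from getting probability zero.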