
Cheatsheet Variables Models

This document summarizes techniques for solving constraint satisfaction problems (CSPs) using variable-based models and factor graphs. It discusses backtracking search with dynamic variable and value ordering heuristics like most constrained variable and least constrained value. It also covers lookahead techniques like forward checking and arc consistency (AC-3). Approximate methods discussed include beam search. The document emphasizes that independence properties allow breaking large problems into smaller subproblems that can be solved in parallel.

CS 221 – Artificial Intelligence https://stanford.edu/~shervine

VIP Cheatsheet: Variables-based models

Afshine Amidi and Shervine Amidi

May 26, 2019

Dynamic ordering

r Dependent factors – The set of dependent factors of variable Xi with partial assignment x
is called D(x,Xi), and denotes the set of factors that link Xi to already assigned variables.

r Backtracking search – Backtracking search is an algorithm used to find maximum weight
assignments of a factor graph. At each step, it chooses an unassigned variable and explores
its values by recursion. Dynamic ordering (i.e. choice of variables and values) and lookahead
(i.e. early elimination of inconsistent options) can be used to explore the graph more efficiently,
although the worst-case runtime stays exponential: O(|Domain|^n).
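A minimal sketch of backtracking search, assuming a static variable ordering and a toy factor graph. The names `backtracking_search`, `dependent_factors` and the three example factors are hypothetical and only serve the illustration; no dynamic ordering or lookahead is used here.

```python
from typing import Callable, Dict, List, Tuple

# Assumed representation: a factor is (scope, function of an assignment dict).
Factor = Tuple[Tuple[str, ...], Callable[[Dict[str, int]], float]]

def backtracking_search(domains: Dict[str, List[int]], factors: List[Factor]):
    """Return (best_weight, best_assignment) by recursive exploration."""
    variables = list(domains)
    best = {"weight": 0.0, "assignment": None}

    def dependent_factors(x, var):
        # D(x, Xi): factors on var whose other variables are already assigned.
        return [f for scope, f in factors
                if var in scope and all(v == var or v in x for v in scope)]

    def recurse(x, weight):
        if len(x) == len(variables):
            if weight > best["weight"]:
                best["weight"], best["assignment"] = weight, dict(x)
            return
        var = next(v for v in variables if v not in x)  # static ordering
        for value in domains[var]:
            x[var] = value
            delta = 1.0
            for f in dependent_factors(x, var):
                delta *= f(x)
            if delta > 0:  # a zero factor can never lead to a positive weight
                recurse(x, weight * delta)
            del x[var]

    recurse({}, 1.0)
    return best["weight"], best["assignment"]

domains = {"A": [0, 1], "B": [0, 1], "C": [0, 1]}
factors = [
    (("A", "B"), lambda x: 1.0 if x["A"] != x["B"] else 0.0),  # constraint
    (("B", "C"), lambda x: 2.0 if x["B"] == x["C"] else 1.0),  # preference
    (("A",), lambda x: 3.0 if x["A"] == 1 else 1.0),           # unary
]
weight, assignment = backtracking_search(domains, factors)
print(weight, assignment)
```

Each factor is multiplied in exactly once, at the moment its last variable gets assigned, which is why only the dependent factors are evaluated at each step.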
Constraint satisfaction problems

In this section, our objective is to find maximum weight assignments of variable-based models.
One advantage over states-based models is that problem-specific constraints are more
convenient to encode.

Factor graphs

r Definition – A factor graph, also referred to as a Markov random field, is a set of variables
X = (X1,...,Xn) where Xi ∈ Domaini and m factors f1,...,fm with each fj(X) ≥ 0.

r Forward checking – It is a one-step lookahead heuristic that preemptively removes
inconsistent values from the domains of neighboring variables. It has the following characteristics:

• After assigning a variable Xi, it eliminates inconsistent values from the domains of all its
neighbors.
• If any of these domains becomes empty, we stop the local backtracking search.
• If we un-assign a variable Xi, we have to restore the domain of its neighbors.
r Most constrained variable – It is a variable-level ordering heuristic that selects the next
unassigned variable that has the fewest consistent values. This has the effect of making
inconsistent assignments fail earlier in the search, which enables more efficient pruning.
r Least constrained value – It is a value-level ordering heuristic that assigns the next value
that yields the highest number of consistent values of neighboring variables. Intuitively, this
procedure chooses first the values that are most likely to work.
Remark: in practice, this heuristic is useful when all factors are constraints.
r Scope and arity – The scope of a factor fj is the set of variables it depends on. The size of
this set is called the arity.
Remark: factors of arity 1 and 2 are called unary and binary respectively.
r Assignment weight – Each assignment x = (x1,...,xn) yields a weight Weight(x) defined as
the product of all factors fj applied to that assignment. Its expression is given by:

$$\text{Weight}(x) = \prod_{j=1}^{m} f_j(x)$$
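As a small worked instance of the weight formula, the sketch below multiplies three made-up factors over variables X1, X2, X3. The factor values and the function name `weight` are assumptions chosen for illustration.

```python
from math import prod

def weight(x, factors):
    """Weight(x): product of every factor evaluated on the full assignment x."""
    return prod(f(x) for f in factors)

factors = [
    lambda x: 1 if x["X1"] == x["X2"] else 0,  # constraint: X1 and X2 agree
    lambda x: 3 if x["X2"] == x["X3"] else 2,  # soft preference for agreement
    lambda x: 2 if x["X3"] == "B" else 1,      # unary preference on X3
]

w_a = weight({"X1": "R", "X2": "R", "X3": "B"}, factors)  # 1 * 2 * 2
w_b = weight({"X1": "R", "X2": "B", "X3": "B"}, factors)  # 0 * 3 * 2
print(w_a, w_b)
```

A single factor equal to 0 is enough to make the whole weight 0, whatever the other factors return.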

r Constraint satisfaction problem – A constraint satisfaction problem (CSP) is a factor
graph where all factors are binary-valued, i.e. take values in {0,1}; we call them constraints:

$$\forall j \in \{1,\dots,m\}, \quad f_j(x) \in \{0,1\}$$

Here, the constraint j with assignment x is said to be satisfied if and only if fj(x) = 1.

r Consistent assignment – An assignment x of a CSP is said to be consistent if and only if
Weight(x) = 1, i.e. all constraints are satisfied.

The example above is an illustration of the 3-color problem with backtracking search coupled
with most constrained variable exploration and least constrained value heuristic, as well as
forward checking at each step.
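The combination described for the 3-color problem can be sketched as follows. This is one possible implementation under simple assumptions (adjacency given as a dict, constraints limited to "neighbors differ"); `three_color` and the Australia map are illustrative choices, not taken from the cheatsheet.

```python
def three_color(neighbors, colors=("red", "green", "blue")):
    """Backtracking with MCV, LCV and forward checking; returns a dict or None."""
    domains = {v: list(colors) for v in neighbors}
    assignment = {}

    def backtrack():
        if len(assignment) == len(neighbors):
            return True
        unassigned = [v for v in neighbors if v not in assignment]
        # Most constrained variable: fewest remaining consistent values.
        var = min(unassigned, key=lambda v: len(domains[v]))
        open_neighbors = [n for n in neighbors[var] if n not in assignment]

        # Least constrained value: try first the value that removes the
        # fewest options from the unassigned neighbors.
        def removals(value):
            return sum(value in domains[n] for n in open_neighbors)

        for value in sorted(domains[var], key=removals):
            assignment[var] = value
            # Forward checking: drop `value` from the neighbors' domains.
            pruned = [n for n in open_neighbors if value in domains[n]]
            for n in pruned:
                domains[n].remove(value)
            if all(domains[n] for n in open_neighbors) and backtrack():
                return True
            # Un-assign: restore the neighbors' domains.
            for n in pruned:
                domains[n].append(value)
            del assignment[var]
        return False

    return dict(assignment) if backtrack() else None

australia = {
    "WA": ["NT", "SA"], "NT": ["WA", "SA", "Q"],
    "SA": ["WA", "NT", "Q", "NSW", "V"], "Q": ["NT", "SA", "NSW"],
    "NSW": ["Q", "SA", "V"], "V": ["SA", "NSW"], "T": [],
}
coloring = three_color(australia)
print(coloring)
```

Because forward checking keeps every domain consistent with the already assigned neighbors, any value picked from a domain is guaranteed not to clash with earlier choices.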
r Arc consistency – We say that arc consistency of variable Xl with respect to Xk is enforced
when for each xl ∈ Domainl :

• unary factors of Xl are non-zero,


• there exists at least one xk ∈ Domaink such that any factor between Xl and Xk is
non-zero.
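The two conditions above can be enforced by pruning Domainl, as in the sketch below. The function name `enforce_arc_consistency` and the dictionaries `unary` and `binary` are hypothetical containers chosen for this example; binary factors are assumed to be stored per ordered pair (Xl, Xk).

```python
def enforce_arc_consistency(domains, unary, binary, l, k):
    """Prune Domain_l so that Xl is arc consistent with respect to Xk.

    Returns True if Domain_l changed.
    """
    def supported(xl):
        # Condition 1: unary factors of Xl are non-zero.
        if any(f(xl) == 0 for f in unary.get(l, [])):
            return False
        # Condition 2: some xk makes every factor between Xl and Xk non-zero.
        return any(all(f(xl, xk) != 0 for f in binary.get((l, k), []))
                   for xk in domains[k])

    kept = [xl for xl in domains[l] if supported(xl)]
    changed = len(kept) != len(domains[l])
    domains[l] = kept
    return changed

domains = {"X1": [1, 2, 3], "X2": [1, 2]}
unary = {"X1": [lambda x1: x1 % 2]}  # X1 must be odd
binary = {
    ("X1", "X2"): [lambda x1, x2: int(x1 < x2)],  # constraint X1 < X2
    ("X2", "X1"): [lambda x2, x1: int(x1 < x2)],  # same constraint, seen from X2
}
changed_1 = enforce_arc_consistency(domains, unary, binary, "X1", "X2")
changed_2 = enforce_arc_consistency(domains, unary, binary, "X2", "X1")
print(domains)
```

Here X1 = 2 is removed by the unary factor, X1 = 3 because no value of X2 is larger, and X2 = 1 because no remaining value of X1 is smaller.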

Stanford University 1 Spring 2019



r AC-3 – The AC-3 algorithm is a multi-step lookahead heuristic that applies forward checking
to all relevant variables. After a given assignment, it performs forward checking and then
successively enforces arc consistency with respect to the neighbors of variables for which the
domain changed during the process.
Remark: AC-3 can be implemented both iteratively and recursively.

Remark: independence is the key property that allows us to solve subproblems in parallel.

r Conditional independence – We say that A and B are conditionally independent given C
if conditioning on C produces a graph in which A and B are independent. In this case, it is
written:

$$A \text{ and } B \text{ cond. indep. given } C \iff A \perp\!\!\!\perp B \mid C$$
r Conditioning – Conditioning is a transformation that aims to make variables independent:
it breaks up a factor graph into smaller pieces that can be solved in parallel, each with
backtracking search. In order to condition on a variable Xi = v, we do as follows:

• Consider all factors f1,...,fk that depend on Xi
• Remove Xi and f1,...,fk
• Add gj(x) for j ∈ {1,...,k} defined as:

$$g_j(x) = f_j(x \cup \{X_i : v\})$$

r Markov blanket – Let A ⊆ X be a subset of variables. We define MarkovBlanket(A) to be
the neighbors of A that are not in A.

r Proposition – Let C = MarkovBlanket(A) and B = X\(A ∪ C). Then we have:

$$A \perp\!\!\!\perp B \mid C$$

Approximate methods

r Beam search – Beam search is an approximate algorithm that extends partial assignments
of n variables of branching factor b = |Domain| by exploring the K top paths at each step. The
beam size K ∈ {1,...,b^n} controls the tradeoff between efficiency and accuracy. This algorithm
has a time complexity of O(n · Kb log(Kb)).
The example below illustrates a possible beam search of parameters K = 2, b = 3 and n = 5.

Remark: K = 1 corresponds to greedy search whereas K → +∞ is equivalent to BFS tree search.
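A compact sketch of beam search with n = 5 variables and b = 3 values each. The scoring function `delta_weight` is an invented toy (it rewards larger values and changes between consecutive variables), and partial assignments are assumed to be stored as tuples.

```python
def beam_search(domains, delta_weight, K):
    """Extend partial assignments one variable at a time, keeping the K best.

    delta_weight(x, v) is the factor contribution of appending value v to the
    partial assignment x (a tuple of already chosen values).
    """
    beam = [(1.0, ())]  # (weight, partial assignment)
    for domain in domains:
        candidates = [(w * delta_weight(x, v), x + (v,))
                      for w, x in beam for v in domain]
        candidates = [c for c in candidates if c[0] > 0]
        candidates.sort(key=lambda c: c[0], reverse=True)  # O(Kb log(Kb))
        beam = candidates[:K]
    return beam

def delta_weight(x, v):
    # Toy factors: value v has weight v + 1, doubled if it differs from
    # the previous variable's value.
    bonus = 2.0 if x and x[-1] != v else 1.0
    return (v + 1) * bonus

domains = [[0, 1, 2]] * 5
greedy = beam_search(domains, delta_weight, K=1)
beam = beam_search(domains, delta_weight, K=2)
exhaustive = beam_search(domains, delta_weight, K=3 ** 5)
print(greedy, beam, exhaustive[0])
```

With K = 1 the procedure is greedy, and with K = b^n nothing is ever discarded, so the top entry is the exact maximum weight assignment.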


r Iterated conditional modes – Iterated conditional modes (ICM) is an iterative approximate
algorithm that modifies the assignment of a factor graph one variable at a time until convergence.
At step i, we assign to Xi the value v that maximizes the product of all factors connected to
that variable.
Remark: ICM may get stuck in local optima.
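The sketch below is one way to write ICM, assuming factors are stored as (scope, function) pairs and that a variable is only changed on strict improvement. The two-variable factor is a made-up example in which one starting point converges to a suboptimal assignment.

```python
from math import prod

def icm(domains, factors, x, max_sweeps=100):
    """Iterated conditional modes from the initial assignment x."""
    x = dict(x)
    for _ in range(max_sweeps):
        changed = False
        for var in domains:
            # Only the factors connected to var matter for this update.
            local = [f for scope, f in factors if var in scope]

            def score(v):
                return prod(f({**x, var: v}) for f in local)

            best = max(domains[var], key=score)
            if score(best) > score(x[var]):
                x[var] = best
                changed = True
        if not changed:  # converged: no single-variable change helps
            break
    return x

# Both variables agreeing on 1 is best (weight 3), agreeing on 0 is second (2).
pair = lambda x: {(0, 0): 2, (1, 1): 3}.get((x["A"], x["B"]), 1)
domains = {"A": [0, 1], "B": [0, 1]}
factors = [(("A", "B"), pair)]

stuck = icm(domains, factors, {"A": 0, "B": 0})
better = icm(domains, factors, {"A": 0, "B": 1})
print(stuck, better)
```

Starting from (0, 0), flipping a single variable lowers the weight from 2 to 1, so the search stops there even though (1, 1) has weight 3.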
r Gibbs sampling – Gibbs sampling is an iterative approximate method that modifies the
assignment of a factor graph one variable at a time until convergence. At step i:

• we assign to each element u ∈ Domaini a weight w(u) that is the product of all factors
connected to that variable, evaluated with Xi = u,
• we sample v from the probability distribution induced by w and assign it to Xi.

Remark: Gibbs sampling can be seen as the probabilistic counterpart of ICM. It has the
advantage of being able to escape local optima in most cases.

Factor graph transformations

r Independence – Let A,B be a partitioning of the variables X. We say that A and B are
independent if there are no edges between A and B and we write:

$$A, B \text{ independent} \iff A \perp\!\!\!\perp B$$

r Elimination – Elimination is a factor graph transformation that removes Xi from the graph
and solves a small subproblem conditioned on its Markov blanket as follows:

• Consider all factors fi,1,...,fi,k that depend on Xi
• Remove Xi and fi,1,...,fi,k
• Add fnew,i(x) defined as:

$$f_{\text{new},i}(x) = \max_{x_i} \prod_{l=1}^{k} f_{i,l}(x)$$
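The elimination steps can be sketched as below, assuming factors are (scope, function) pairs and that the new factor is tabulated over the Markov blanket of the eliminated variable. The helper name `eliminate` and the three toy factors are illustrative assumptions.

```python
from itertools import product
from math import prod

def eliminate(var, domains, factors):
    """Remove var and its factors; add f_new defined on var's Markov blanket."""
    touching = [(s, f) for s, f in factors if var in s]
    rest = [(s, f) for s, f in factors if var not in s]
    blanket = tuple(sorted({v for s, _ in touching for v in s} - {var}))

    # Tabulate f_new(x) = max over x_i of the product of the removed factors.
    table = {}
    for values in product(*(domains[v] for v in blanket)):
        x = dict(zip(blanket, values))
        table[values] = max(prod(f({**x, var: xi}) for _, f in touching)
                            for xi in domains[var])

    def f_new(x):
        return table[tuple(x[v] for v in blanket)]

    return rest + [(blanket, f_new)]

domains = {"A": [0, 1], "B": [0, 1], "C": [0, 1]}
factors = [
    (("A", "B"), lambda x: 1.0 if x["A"] != x["B"] else 0.0),
    (("B", "C"), lambda x: 2.0 if x["B"] == x["C"] else 1.0),
    (("A",), lambda x: 3.0 if x["A"] == 1 else 1.0),
]
for var in ["A", "B", "C"]:
    factors = eliminate(var, domains, factors)

# Every remaining factor has an empty scope; their product is the max weight.
max_weight = prod(f({}) for _, f in factors)
print(max_weight)
```

The arity of each new factor equals the size of the eliminated variable's Markov blanket at that point, which is what makes the elimination order matter.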


r Treewidth – The treewidth of a factor graph is the maximum arity of any factor created by
variable elimination with the best variable ordering. In other words,

$$\text{Treewidth} = \min_{\text{orderings}} \max_{i \in \{1,\dots,n\}} \text{arity}(f_{\text{new},i})$$

The example below illustrates the case of a factor graph of treewidth 3.

Remark: finding the best variable ordering is an NP-hard problem.

Bayesian networks

In this section, our goal will be to compute conditional probabilities. What is the probability of
a query given evidence?

Introduction

r Explaining away – Suppose causes C1 and C2 influence an effect E. Conditioning on the
effect E and on one of the causes (say C1) changes the probability of the other cause (say C2).
In this case, we say that C1 has explained away C2.

r Directed acyclic graph – A directed acyclic graph (DAG) is a finite directed graph with
no directed cycles.

r Bayesian network – A Bayesian network is a directed acyclic graph (DAG) that specifies
a joint distribution over random variables X = (X1,...,Xn) as a product of local conditional
distributions, one for each node:

$$P(X_1 = x_1, \dots, X_n = x_n) \triangleq \prod_{i=1}^{n} p(x_i \mid x_{\text{Parents}(i)})$$

Remark: Bayesian networks are factor graphs imbued with the language of probability.

r Locally normalized – For each xParents(i), all factors are local conditional distributions.
Hence they have to satisfy:

$$\sum_{x_i} p(x_i \mid x_{\text{Parents}(i)}) = 1$$

As a result, sub-Bayesian networks and conditional distributions are consistent.
Remark: local conditional distributions are the true conditional distributions.

r Marginalization – The marginalization of a leaf node yields a Bayesian network without
that node.

Probabilistic programs

r Concept – A probabilistic program randomizes variable assignments. That way, we can write
down complex Bayesian networks that generate assignments without us having to explicitly
specify associated probabilities.
Remark: examples of probabilistic programs include Hidden Markov model (HMM), factorial
HMM, naive Bayes, latent Dirichlet allocation, diseases and symptoms and stochastic block
models.

r Summary – The table below summarizes the common probabilistic programs as well as their
applications:

Program                     Algorithm                               Example
Markov Model                Xi ∼ p(Xi|Xi−1)                         Language modeling
Hidden Markov Model (HMM)   Ht ∼ p(Ht|Ht−1)                         Object tracking
                            Et ∼ p(Et|Ht)


Factorial HMM               Ht^o ∼ p(Ht^o|Ht−1^o) for o ∈ {a,b}     Multiple object tracking
                            Et ∼ p(Et|Ht^a,Ht^b)
Naive Bayes                 Y ∼ p(Y)                                Document classification
                            Wi ∼ p(Wi|Y)
Latent Dirichlet            α ∈ R^K distribution                    Topic modeling
Allocation (LDA)            Zi ∼ p(Zi|α)
                            Wi ∼ p(Wi|Zi)

Inference

r General probabilistic inference strategy – The strategy to compute the probability
P(Q|E = e) of query Q given evidence E = e is as follows:

• Step 1: Remove variables that are not ancestors of the query Q or the evidence E by
marginalization
• Step 2: Convert Bayesian network to factor graph
• Step 3: Condition on the evidence E = e
• Step 4: Remove nodes disconnected from the query Q by marginalization
• Step 5: Run probabilistic inference algorithm (manual, variable elimination, Gibbs sampling,
particle filtering)

r Forward-backward algorithm – This algorithm computes the exact value of P(H = hk|E = e)
(smoothing query) for any k ∈ {1,...,L} in the case of an HMM of size L. To do so, we proceed
in 3 steps:

• Step 1: for i ∈ {1,...,L}, compute
$$F_i(h_i) = \sum_{h_{i-1}} F_{i-1}(h_{i-1})\, p(h_i \mid h_{i-1})\, p(e_i \mid h_i)$$
• Step 2: for i ∈ {L,...,1}, compute
$$B_i(h_i) = \sum_{h_{i+1}} B_{i+1}(h_{i+1})\, p(h_{i+1} \mid h_i)\, p(e_{i+1} \mid h_{i+1})$$
• Step 3: for i ∈ {1,...,L}, compute
$$S_i(h_i) = \frac{F_i(h_i)\, B_i(h_i)}{\sum_{h_i} F_i(h_i)\, B_i(h_i)}$$

with the convention F_0 = B_{L+1} = 1. From this procedure and these notations, we get that

$$P(H = h_k \mid E = e) = S_k(h_k)$$

Remark: this algorithm interprets each assignment to be a path where each edge hi−1 → hi is
of weight p(hi|hi−1)p(ei|hi).

r Gibbs sampling – This algorithm is an iterative approximate method that uses a small set of
assignments (particles) to represent a large probability distribution. From a random assignment
x, Gibbs sampling performs the following steps for i ∈ {1,...,n} until convergence:

• For all u ∈ Domaini, compute the weight w(u) of assignment x where Xi = u
• Sample v from the probability distribution induced by w: v ∼ P(Xi = v|X−i = x−i)
• Set Xi = v

Remark: X−i denotes X\{Xi} and x−i represents the corresponding assignment.

r Particle filtering – This algorithm approximates the posterior density of state variables
given the evidence of observation variables by keeping track of K particles at a time. Starting
from a set of particles C of size K, we run the following 3 steps iteratively:

• Step 1: proposal - For each old particle xt−1 ∈ C, sample x from the transition probability
distribution p(x|xt−1) and add x to a set C'.
• Step 2: weighting - Weigh each x of the set C' by w(x) = p(et|x), where et is the evidence
observed at time t.
• Step 3: resampling - Sample K elements from the set C' using the probability distribution
induced by w and store them in C: these are the current particles xt.

Remark: a more expensive version of this algorithm also keeps track of past particles in the
proposal step.

r Maximum likelihood – If we don't know the local conditional distributions, we can learn
them using maximum likelihood.

$$\max_{\theta} \prod_{x \in D_{\text{train}}} p(X = x; \theta)$$

r Laplace smoothing – For each distribution d and partial assignment (xParents(i),xi), add λ
to countd(xParents(i),xi), then normalize to get probability estimates.

r Algorithm – The Expectation-Maximization (EM) algorithm gives an efficient method for
estimating the parameter θ through maximum likelihood estimation by repeatedly constructing
a lower-bound on the likelihood (E-step) and optimizing that lower bound (M-step) as follows:

• E-step: Evaluate the posterior probability q(h) that each data point e came from a
particular cluster h as follows:

$$q(h) = P(H = h \mid E = e; \theta)$$

• M-step: Use the posterior probabilities q(h) as cluster specific weights on data points e
to determine θ through maximum likelihood.
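To make the E-step and M-step concrete, here is a sketch of EM on an assumed toy model that is not part of the cheatsheet: each data point e is a number of heads out of m flips of one of two coins, the hidden cluster h says which coin was used, and θ = (pi, p0, p1). The function name `em_two_coins`, the data and the initial values are all illustrative.

```python
from math import comb, log

def em_two_coins(data, m, p=(0.4, 0.6), pi=0.5, iters=50):
    """EM for a mixture of two coins; returns (pi, (p0, p1), log-likelihoods)."""
    p0, p1 = p
    trace = []
    for _ in range(iters):
        # E-step: q = P(H = 1 | E = e; theta) for each data point e.
        q, ll = [], 0.0
        for e in data:
            l0 = (1 - pi) * comb(m, e) * p0 ** e * (1 - p0) ** (m - e)
            l1 = pi * comb(m, e) * p1 ** e * (1 - p1) ** (m - e)
            q.append(l1 / (l0 + l1))
            ll += log(l0 + l1)
        trace.append(ll)
        # M-step: weighted maximum likelihood, with q as cluster weights.
        n1 = sum(q)
        n0 = len(data) - n1
        pi = n1 / len(data)
        p1 = sum(qi * e for qi, e in zip(q, data)) / (m * n1)
        p0 = sum((1 - qi) * e for qi, e in zip(q, data)) / (m * n0)
    return pi, (p0, p1), trace

data = [1, 2, 9, 8, 2, 9, 1, 8]  # heads out of m = 10 flips
pi, (p0, p1), trace = em_two_coins(data, m=10)
print(round(pi, 3), round(p0, 3), round(p1, 3))
```

The recorded log-likelihood should never decrease from one iteration to the next, which is a convenient sanity check when implementing EM.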
