
Chapter 2 Modeling, Inference and Prediction

Throughout this work our approach will be to:

1. define a probabilistic model with certain unknown parameters for data of a particular character;
2. perform inference, that is, find values of the unknown parameters of the model that best explain observations;
3. make predictions using a model whose parameters have been determined.

In this chapter I describe the framework in which we execute these steps. A more detailed treatment can be found in Wainwright and Jordan (2008).

2.1 Probabilistic Models

Our approach uses the language of directed graphical models to describe probabilistic models. Directed graphical models have been described as a synthesis of graph theory and probability. In this framework, distributions are represented as directed, acyclic graphs. Nodes in this graph represent variables and arrows indicate, informally, a possible dependence between variables.¹

Figure 2.1: The language of graphical models. (a) Unobserved variable named Z. (b) Observed (indicated by shading) variable named X. (c) Variable V possibly dependent on variable U (indicated by arrow). (d) Variable Y replicated N times (indicated by box).

The constituents of directed graphical models are

1. unshaded nodes indicating unobserved variables whose names are enclosed in the circle;
2. shaded nodes indicating observed variables;
3. arrows between nodes indicating a possible dependence between variables;
4. boxes which depict replication.

These are shown in Figure 2.1. Associated with each node is a conditional probability distribution over the variable represented by that node. That probability distribution is conditioned on the variables represented by that node's parents. That is, letting $x_i$ represent the variable associated with the $i$th node,

$$p_i(x_i \mid x_{j \in \mathrm{parents}(i)}) \qquad (2.1)$$

describes the distribution of xi . The full joint distribution of the entire graphical model can thus be written as
The dependence between variables can be formally described by D-separation which is outside the scope of this text.
1

p(x) =

pi (xi |xjparents(i) ).

(2.2)

Note that it is straightforward to evaluate the probability of a state in this formalism; one need only take the product of the evaluation of each $p_i$. This formalism also makes it convenient to simulate draws from the distribution by drawing each constituent variable in topological order. Because each of the variables that $x_i$ is conditioned on is a parent, and all parent variables are guaranteed to have fixed values by dint of the topological sort, $x_i$ can be simulated by doing a single draw from $p_i$. This also means it is straightforward to describe each probability distribution as a generative process, that is, a sequence of probabilistic steps by which the data were hypothetically generated. The intermediate steps of the generative process create unobserved variables while the final step generates the observed data, i.e., the leaves of the graph. This construction will be of particular interest in the sequel.
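To make this concrete, the following is a minimal sketch of ancestral sampling on a small hypothetical three-node chain (the model itself is only an illustration, not one used in this chapter): each node is drawn only after its parents have fixed values, following a topological order.

```python
import random

# A hypothetical three-node chain z1 -> z2 -> x, used only to illustrate
# ancestral sampling; each entry maps a node to (parents, conditional sampler).
model = {
    "z1": ([], lambda: random.random() < 0.3),
    "z2": (["z1"], lambda z1: random.random() < (0.8 if z1 else 0.1)),
    "x":  (["z2"], lambda z2: random.gauss(1.0 if z2 else -1.0, 1.0)),
}
topological_order = ["z1", "z2", "x"]  # parents always precede their children

def ancestral_sample():
    """Draw one joint sample by simulating each p_i(x_i | parents) in turn."""
    values = {}
    for node in topological_order:
        parents, sampler = model[node]
        # Every parent already has a fixed value, so a single draw suffices.
        values[node] = sampler(*(values[p] for p in parents))
    return values

sample = ancestral_sample()
```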

2.2 Inference

With a probabilistic model thus defined, our goal is to find values of the unobserved variables which explain the observed variables. More formally, we are interested in finding the posterior distribution of hidden variables ($z$) conditioned on observed variables ($x$),

$$p(z \mid x). \qquad (2.3)$$

For all but a few special cases, it is computationally prohibitive to compute this exactly. To see why, let us recall the definition of marginalization,

$$p(z \mid x) = \frac{p(x, z)}{p(x)} = \frac{p(x, z)}{\sum_{z'} p(x, z')}.$$
As mentioned in the previous section, evaluating the joint distribution $p(x, z)$ is straightforward. However, to compute the posterior probability we must evaluate the joint probability across all possible values of $z$. Since the number of possible values of $z$ increases exponentially with the number of variables comprising $z$, this quickly becomes prohibitive. Thus we turn to approximate methods.

There are many approaches to approximating the posterior, such as Markov Chain Monte Carlo (MCMC) (Neal 1993). However, we will use variational approximations in this work because they do not rely on stochasticity, they are amenable to various optimization approaches, and they have been empirically shown to achieve good approximations. Variational methods approximate the true posterior, $p(z \mid x)$, with an approximate posterior, $q(z)$. The approximation chosen is the distribution which is, in some sense, closest to the true distribution. The definition of closeness used is the Kullback-Leibler (KL) divergence,
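As a rough illustration of why exact computation blows up, the sketch below evaluates $p(x)$ and the exact posterior for a model with $K$ binary hidden variables by enumerating all $2^K$ configurations; the log-joint passed in is an arbitrary stand-in, and the point is only the size of the loop.

```python
import itertools
import math

def exact_posterior(log_joint, num_latents):
    """Exact p(z | x) over binary latents by enumerating all 2**num_latents states.

    log_joint(z) returns log p(x, z) for the fixed observations; the size of
    this loop is what makes exact inference prohibitive as num_latents grows.
    """
    states = list(itertools.product([0, 1], repeat=num_latents))
    joint = [math.exp(log_joint(z)) for z in states]
    evidence = sum(joint)  # p(x) = sum_z p(x, z)
    return {z: p / evidence for z, p in zip(states, joint)}

# An arbitrary made-up log-joint over 10 binary latents (1024 terms to sum).
posterior = exact_posterior(lambda z: -0.5 * sum(z), num_latents=10)
```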


$$\begin{aligned}
\mathrm{KL}(q(z)\,\|\,p(z \mid x)) &= \sum_z q(z) \log \frac{q(z)}{p(z \mid x)} \qquad (2.4) \\
&= -\sum_z q(z) \log \frac{p(z \mid x)}{q(z)} \\
&\geq -\log \sum_z q(z) \frac{p(z \mid x)}{q(z)} \\
&= -\log \sum_z p(z \mid x) = -\log 1 = 0, \qquad (2.5)
\end{aligned}$$

where the inequality follows from Jensen's inequality. This choice of distance can be intuitively justified in several ways. One is to rewrite the KL divergence as

$$\mathrm{KL}(q(z)\,\|\,p(z \mid x)) = -\mathbb{E}_q[\log p(z \mid x)] - H(q), \qquad (2.6)$$

where $H(\cdot)$ denotes entropy. Thus the KL divergence promotes distributions $q$ which look similar to $p$ while adding an entropy regularization. Another justification of the KL divergence arises from its relationship to the likelihood of the observed data,

$$\begin{aligned}
\mathrm{KL}(q(z)\,\|\,p(z \mid x)) &= -\mathbb{E}_q[\log p(z \mid x)] - H(q) \\
&= -\mathbb{E}_q\left[\log \frac{p(z, x)}{p(x)}\right] - H(q) \\
&= -\mathbb{E}_q[\log p(z, x)] + \mathbb{E}_q[\log p(x)] - H(q) \\
&= \log p(x) - \mathbb{E}_q[\log p(z, x)] - H(q). \qquad (2.7)
\end{aligned}$$

This representation first implies that the problem can be expressed as finding the distance between the variational distribution and the joint distribution rather than the posterior distribution. The second implication is that this distance can be used to form an evidence lower bound (ELBO); as the distance decreases, this lower bound on the likelihood of our data tightens. Our objective now is to find $q$ such that

$$q^*(z) = \operatorname*{argmin}_{q \in \mathcal{Q}} \mathrm{KL}(q(z)\,\|\,p(z \mid x)). \qquad (2.8)$$
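A small numerical check of the identity in Equation 2.7 and the nonnegativity shown in Equation 2.5 can be reassuring; the toy joint distribution and the candidate $q$ below are arbitrary assumptions with a single binary latent variable.

```python
import math

# Toy problem with one binary latent z at a fixed observation x; the joint
# probabilities p(z, x) and the candidate q(z) below are arbitrary assumptions.
p_joint = {0: 0.3, 1: 0.1}
p_x = sum(p_joint.values())                          # p(x) = sum_z p(x, z)
p_post = {z: pz / p_x for z, pz in p_joint.items()}  # p(z | x)
q = {0: 0.7, 1: 0.3}                                 # variational distribution

kl = sum(q[z] * math.log(q[z] / p_post[z]) for z in q)
entropy = -sum(q[z] * math.log(q[z]) for z in q)
expected_log_joint = sum(q[z] * math.log(p_joint[z]) for z in q)

# Equation 2.7: KL(q || p(z|x)) = log p(x) - E_q[log p(z, x)] - H(q), and KL >= 0.
assert abs(kl - (math.log(p_x) - expected_log_joint - entropy)) < 1e-12
assert kl >= 0.0
```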

Note that this is trivially minimized when $q^*(z) = p(z \mid x)$, the true posterior. Therefore, the optimization problem as formulated is equivalent to posterior inference. But since this is intractable, a tractable approximation is made by restricting the search space $\mathcal{Q}$. A common choice is the family of factorized distributions,

$$q(z) = \prod_i q_i(z_i). \qquad (2.9)$$

This choice of $\mathcal{Q}$ is often termed a naïve variational approximation. This expression is convenient since

$$\begin{aligned}
H(q) &= -\mathbb{E}_q\Big[\log \prod_i q_i(z_i)\Big] \\
&= -\mathbb{E}_q\Big[\sum_i \log q_i(z_i)\Big] \\
&= -\sum_i \mathbb{E}_q[\log q_i(z_i)] \qquad (2.10) \\
&= \sum_i H(q_i).
\end{aligned}$$
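As a quick sanity check of Equation 2.10, the snippet below compares the entropy of a fully factorized distribution over three binary variables, computed by brute-force enumeration, against the sum of the per-factor entropies; the factor probabilities are arbitrary.

```python
import itertools
import math

# Factor probabilities q_i(z_i = 1) for three independent binary variables.
probs = [0.2, 0.5, 0.9]

def bernoulli_entropy(p):
    return -(p * math.log(p) + (1 - p) * math.log(1 - p))

# Entropy of the joint q(z) = prod_i q_i(z_i), computed by direct enumeration.
joint_entropy = 0.0
for z in itertools.product([0, 1], repeat=len(probs)):
    qz = 1.0
    for zi, p in zip(z, probs):
        qz *= p if zi == 1 else 1 - p
    joint_entropy -= qz * math.log(qz)

# Equation 2.10: the joint entropy equals the sum of the per-factor entropies.
assert abs(joint_entropy - sum(bernoulli_entropy(p) for p in probs)) < 1e-12
```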

Further, recall from the discussion above that in a generative process all of the observations ($x$) appear as leaves of the graph. Therefore the expected log joint probability can be expressed as


$$\begin{aligned}
\mathbb{E}_q[\log p(z, x)] &= \mathbb{E}_q\Big[\log \prod_i p_i(z_i \mid z_{j \in \mathrm{parents}(i)}) \prod_{i'} p_{i'}(x_{i'} \mid z_{j \in \mathrm{parents}(i')})\Big] \\
&= \sum_i \mathbb{E}_q\big[\log p_i(z_i \mid z_{j \in \mathrm{parents}(i)})\big] + \sum_{i'} \mathbb{E}_q\big[\log p_{i'}(x_{i'} \mid z_{j \in \mathrm{parents}(i')})\big].
\end{aligned}$$

Note that because of marginalization the expectation of term $p_i$ depends only on $\{q_j(z_j) \mid j \in \mathrm{parents}(i)\}$ if $i$ is a leaf node, and $\{q_j(z_j) \mid j \in \mathrm{parents}(i) \cup \{i\}\}$ otherwise. Optimizing this with respect to a common choice for $p_i$ warrants further elucidation below.

2.2.1 Exponential family distributions

Exponential family distributions are a class of distributions which take a particular form. This form encompasses many common distributions and is convenient to optimize with respect to the objective described in the previous section. Exponential family distributions take the following form:

$$p(x) = \frac{\exp(\eta^T \phi(x))}{Z(\eta)}. \qquad (2.11)$$

The normalization constant $Z(\eta)$ is chosen so that the distribution sums (or integrates) to one. The vector $\eta$ is termed the natural parameters while $\phi(x)$ are the sufficient statistics. Figure 2.2 helps illustrate how common distributions such as the Gaussian and the Beta can be expressed in this representation. The structure of the exponential family representation allows for these distributions to be easily manipulated in the variational optimization above. In particular,


Figure 2.2: Two exponential family functions. The title of each panel shows the value of the natural parameters for the depicted distribution. (a) The Gaussian distribution has sufficient statistics $\phi(x) = \langle x^2, x \rangle$. The natural parameters are related to the common parameterization by $\eta = \langle -\frac{1}{2\sigma^2}, \frac{\mu}{\sigma^2} \rangle$. The normalization constant $Z$ is given by $\sqrt{\frac{\pi}{-\eta_1}} \exp\!\big(-\frac{\eta_2^2}{4\eta_1}\big)$. (b) The Beta distribution has sufficient statistics $\phi(x) = \langle \log(x), \log(1-x) \rangle$. The natural parameters are related to the common parameterization by $\eta = \langle \alpha - 1, \beta - 1 \rangle$. The normalization constant $Z$ is given by $\frac{\Gamma(\eta_1 + 1)\Gamma(\eta_2 + 1)}{\Gamma(\eta_1 + \eta_2 + 2)}$.


Figure 2.3: A directed graphical model representation of a Gaussian mixture model.

$$\begin{aligned}
\mathbb{E}_q[\log p(x, z)] &= \mathbb{E}_q\left[\log \frac{\exp(\eta^T \phi(x, z))}{Z(\eta)}\right] \\
&= \mathbb{E}_q\left[\eta^T \phi(x, z)\right] - \mathbb{E}_q[\log Z(\eta)] \\
&= \mathbb{E}_q[\eta]^T\, \mathbb{E}_q[\phi(x, z)] - \mathbb{E}_q[\log Z(\eta)], \qquad (2.12)
\end{aligned}$$

where the last line follows by independence under a fully-factorized variational distribution.
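The Gaussian panel of Figure 2.2 can be checked directly in code; the sketch below converts an arbitrary $(\mu, \sigma^2)$ pair to natural parameters, evaluates the density via the exponential family form of Equation 2.11, and compares against the usual Gaussian density.

```python
import math

# An arbitrary Gaussian in its common parameterization.
mu, sigma2 = 1.5, 2.0

# Natural parameters and log-normalizer from Figure 2.2(a).
eta1, eta2 = -1.0 / (2.0 * sigma2), mu / sigma2
log_Z = 0.5 * math.log(math.pi / -eta1) - eta2 ** 2 / (4.0 * eta1)

def expfam_pdf(x):
    """Equation 2.11 with sufficient statistics phi(x) = (x^2, x)."""
    return math.exp(eta1 * x ** 2 + eta2 * x - log_Z)

def gaussian_pdf(x):
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

# Both parameterizations give the same density.
assert abs(expfam_pdf(0.7) - gaussian_pdf(0.7)) < 1e-12
```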

2.3 Example

To illustrate the procedure described in the previous sections, we perform it on a simple Gaussian mixture model. Figure 2.3 shows a directed graphical model for this example. We describe the generative process as:

1. For $i \in \{0, 1\}$,
   (a) Draw $\mu_i \sim \mathrm{Uniform}(-\infty, \infty)$.
2. For $n \in [N]$,
   (a) Draw mixture indicator $z_n \sim \mathrm{Bernoulli}(0.5)$;
   (b) Draw observation $x_n \sim \mathcal{N}(\mu_{z_n}, 1)$.

Our goal now is to approximate the posterior distribution of the hidden variables, $p(z, \mu \mid x)$, conditioned on observations $x$. To do so we use the factorized distribution,
$$q(\mu, z) = r(\mu_0 \mid m_0)\, r(\mu_1 \mid m_1) \prod_n q_n(z_n \mid \nu_n), \qquad (2.13)$$

where $q_n(z_n \mid \nu_n)$ is a binomial distribution with parameter $\nu_n$, and $r(\mu_i \mid m_i)$ is a Gaussian distribution with mean $m_i$ and unit variance. With the variational family thus parameterized, the optimization problem becomes that of maximizing

$$\mathcal{L} = \mathbb{E}_q[\log p(x, z, \mu)] + H(q) \qquad (2.14)$$

with respect to $\nu$ and $m$. To do so we first appeal to Equation 2.12 for the expected log probability of an exponential family with our choice of parameterization,

$$\begin{aligned}
\mathbb{E}_q[\log p(x_n \mid \mu_i)] &= -\frac{1}{2}x_n^2 + \mathbb{E}_q[\mu_i]\, x_n - \frac{1}{2}\mathbb{E}_q[\mu_i^2] - \frac{1}{2}\log 2\pi \\
&= -\frac{1}{2}x_n^2 + m_i x_n - \frac{1}{2}(1 + m_i^2) - \frac{1}{2}\log 2\pi.
\end{aligned}$$

Since we have chosen uniform distributions for $z$ and $\mu$, we can express the expected log probability of the joint as

$$\begin{aligned}
\mathbb{E}_q[\log p(x, z, \mu)] &= \mathbb{E}_q\Big[\log \prod_n p(x_n \mid \mu_0)^{1 - z_n}\, p(x_n \mid \mu_1)^{z_n}\Big] + \mathrm{const} \\
&= \sum_n \mathbb{E}_q[(1 - z_n)\log p(x_n \mid \mu_0)] + \mathbb{E}_q[z_n \log p(x_n \mid \mu_1)] + \mathrm{const} \\
&= \sum_n (1 - \nu_n)\,\mathbb{E}_q[\log p(x_n \mid \mu_0)] + \nu_n\,\mathbb{E}_q[\log p(x_n \mid \mu_1)] + \mathrm{const} \\
&= \sum_n (1 - \nu_n)\Big(m_0 x_n - \frac{1}{2}m_0^2\Big) + \nu_n\Big(m_1 x_n - \frac{1}{2}m_1^2\Big) + C,
\end{aligned}$$

where $C$ contains terms which do not depend on either $\nu_n$ or $m_i$. We also compute the entropy terms,

$$\begin{aligned}
H(q_n(z_n \mid \nu_n)) &= -\big[(1 - \nu_n)\log(1 - \nu_n) + \nu_n \log \nu_n\big] \\
H(r_i(\mu_i \mid m_i)) &= \frac{1}{2}\log(2\pi e).
\end{aligned}$$

To optimize these expressions we take the derivative with respect to each variable,

$$\begin{aligned}
\frac{\partial \mathcal{L}}{\partial \nu_n} &= \frac{1}{2}(m_1 - m_0)(2x_n - m_1 - m_0) + \log \frac{1 - \nu_n}{\nu_n} \\
\frac{\partial \mathcal{L}}{\partial m_0} &= \sum_n (1 - \nu_n)(x_n - m_0) \\
\frac{\partial \mathcal{L}}{\partial m_1} &= \sum_n \nu_n (x_n - m_1).
\end{aligned}$$


Figure 2.4: 100 points drawn from the mixture model depicted in Figure 2.3 with $\mu_0 = -3$ and $\mu_1 = 3$. The $x$ axis denotes observed values while the vertical axis and coloring denote the latent mixture indicator values.

Setting these equal to zero yields the following optimality conditions,

$$\begin{aligned}
\nu_n &= \sigma\Big(\frac{1}{2}(m_1 - m_0)(2x_n - m_1 - m_0)\Big) \\
m_0 &= \frac{\sum_n (1 - \nu_n) x_n}{\sum_n (1 - \nu_n)} \\
m_1 &= \frac{\sum_n \nu_n x_n}{\sum_n \nu_n},
\end{aligned}$$

where $\sigma(x)$ denotes the sigmoid function $\frac{1}{1 + \exp(-x)}$. This is a system of transcendental equations which cannot be solved analytically. However, we may apply coordinate ascent: we initialize each variable to some guess and repeatedly cycle through the variables, optimizing them one at a time while holding the others fixed.
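A minimal implementation of this coordinate ascent scheme might look like the following sketch; the synthetic data (generated with the true means of Figure 2.4), the initialization, and the number of sweeps are assumptions made for the example.

```python
import math
import random

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

# Synthetic data resembling Figure 2.4: a balanced mixture with means -3 and 3.
random.seed(0)
x = [random.gauss(3.0 if random.random() < 0.5 else -3.0, 1.0) for _ in range(100)]

# Arbitrary asymmetric initialization of the variational parameters.
m0, m1 = -1.0, 1.0

# Coordinate ascent: repeatedly apply each optimality condition in turn,
# holding the other variational parameters fixed.
for _ in range(5):
    nu = [sigmoid(0.5 * (m1 - m0) * (2 * xn - m1 - m0)) for xn in x]
    m0 = sum((1 - n) * xn for n, xn in zip(nu, x)) / sum(1 - n for n in nu)
    m1 = sum(n * xn for n, xn in zip(nu, x)) / sum(nu)

print(m0, m1)  # should approach the true component means of -3 and 3
```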


Figure 2.5: Estimated values of $m_0$ and $m_1$ as a function of iteration using coordinate ascent. The variational method is able to quickly recover the true values of these parameters (shown as dashed lines).


Figure 2.4 shows the result of simulating 100 draws from the distribution to be estimated. The distribution has $\mu_0 = -3$ and $\mu_1 = 3$. The $x$ axis denotes observed values while the vertical axis and coloring denote the latent mixture indicator values. Figure 2.5 shows the result of applying the variational method with coordinate ascent estimation. The series show the estimated values of $m_i$ as a function of iteration. The approach is able to quickly find the parameters of the true generating distributions (dashed lines).

Figure 2.6: The mixture model of Figure 2.3 augmented with an additional unobserved datum to be predicted.

2.4 Prediction

With an approximate posterior in hand, our goal is often to make predictions about data we have not yet seen. That is, given some observed data $x_{1:N}$ we wish to evaluate the probability of an additional datum $x_{N+1}$,

$$p(x_{N+1} \mid x_{1:N}). \qquad (2.15)$$

This desideratum is illustrated in Figure 2.6 for the case of the Gaussian mixture of the previous section. On the right-hand side another unobserved instance of a draw from the mixture model has been added as the datum to be predicted. One way of approaching the problem is by writing the predictive distribution as a marginalization,

$$\begin{aligned}
p(x_{N+1} \mid x_{1:N}) &= \sum_{z_{N+1}} \sum_{z_{1:N}} p(x_{N+1}, z_{N+1} \mid z_{1:N})\, p(z_{1:N} \mid x_{1:N}) \\
&= \sum_{z_{N+1}} \mathbb{E}_p\left[p(x_{N+1}, z_{N+1} \mid z_{1:N})\right] \\
&\approx \sum_{z_{N+1}} \mathbb{E}_q\left[p(x_{N+1}, z_{N+1} \mid z_{1:N})\right], \qquad (2.16)
\end{aligned}$$

where the expectation on the second line is taken with respect to the true posterior of the observed data, $p(z_{1:N} \mid x_{1:N})$, and the expectation on the third line is taken with respect to the variational approximation to the posterior, $q(z_{1:N})$. In the case of the Gaussian mixture, this expression is

$$\begin{aligned}
p(x_{N+1} \mid x_{1:N}) &\approx \frac{1}{2}\mathbb{E}_q[p(x_{N+1} \mid \mu_1)] + \frac{1}{2}\mathbb{E}_q[p(x_{N+1} \mid \mu_0)] \\
&= \frac{1}{2} p(x_{N+1} \mid m_1) + \frac{1}{2} p(x_{N+1} \mid m_0). \qquad (2.17)
\end{aligned}$$
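Evaluating the approximation in Equation 2.17 is then straightforward given fitted variational means; in the sketch below the values of $m_0$, $m_1$, and the new observation are hypothetical (for example, $m_0$ and $m_1$ could come from the coordinate ascent sketch above).

```python
import math

def gaussian_pdf(x, mean, var=1.0):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def approx_predictive(x_new, m0, m1):
    """Equation 2.17: an equal-weight mixture of unit-variance Gaussians at m0, m1."""
    return 0.5 * gaussian_pdf(x_new, m1) + 0.5 * gaussian_pdf(x_new, m0)

# Hypothetical fitted variational means and a hypothetical new observation.
print(approx_predictive(2.5, m0=-3.0, m1=3.0))
```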

The efficacy of this approach is demonstrated in Figure 2.7, wherein we empirically estimate the expected value of $p(x_{N+1} \mid x_{1:N})$ by drawing $M$ additional values and taking their average. The dashed line shows the expectation estimated using the variational approximation.

We have now described a framework for defining probabilistic models, inferring the values of their unknowns using data, and taking the model and inferred values to provide predictions about unseen data. In the following chapters we leverage this framework to model, understand, and make predictions about networked data.


Figure 2.7: Estimated expected value of $p(x_{N+1} \mid x_{1:N})$ taken by averaging $M$ random draws from this distribution. The dashed line shows the value of this expectation estimated by the variational approximation.
