
Introduction to probabilistic graphical models 2013/2014

Lecture 2 — October 11
Lecturer: Guillaume Obozinski Scribes: Aymeric Reshef, Claire Vernade

• Course webpage: http://www.di.ens.fr/~fbach/courses/fall2013/

2.1 Single node models (last part)


The previous course introduced the notion of Maximum Likelihood Estimator (MLE). Basic
examples on the Bernoulli, multinomial and Gaussian models were made explicit, and
side notes detailed the use of Lagrange multipliers and of differentials. The last example
used the multivariate Gaussian model. We recall it briefly in the next subsection.

2.1.1 The Multivariate Gaussian model


Let X be a random variable taking values in Rᵈ, let µ ∈ Rᵈ, and let Σ ∈ Rᵈˣᵈ be a positive
definite matrix. X follows a multivariate Gaussian model (denoted by X ∼ N(µ, Σ)) if

$$p_{\mu,\Sigma}(x) = \frac{1}{\sqrt{(2\pi)^d \det \Sigma}} \exp\left( -\frac{1}{2}(x-\mu)^\top \Sigma^{-1} (x-\mu) \right).$$
Let X₁, …, Xₙ ∼ N(µ, Σ) be i.i.d. Then the negative log-likelihood of the joint distribution is

$$-\ell(\mu, \Sigma) = -\sum_{i=1}^{n} \log p_{\mu,\Sigma}(x_i) = \frac{nd}{2}\log(2\pi) + \frac{n}{2}\log(\det\Sigma) + \frac{1}{2}\sum_{i=1}^{n}(x_i-\mu)^\top \Sigma^{-1}(x_i-\mu).$$
Its gradient with respect to µ is given by

$$\nabla_\mu \ell(\mu, \Sigma) = \Sigma^{-1}\sum_{i=1}^{n}(x_i-\mu) = \Sigma^{-1}\left(\sum_{i=1}^{n} x_i - n\mu\right) = \Sigma^{-1}(n\bar{x} - n\mu),$$

which vanishes at µ̂ = x̄ = (1/n) Σᵢ xᵢ, the empirical mean.


In order to compute the gradient with respect to Σ, we first write A = Σ⁻¹, so that

$$-\ell(\mu, \Sigma) = \frac{nd}{2}\log(2\pi) - \frac{n}{2}\log(\det A) + \frac{1}{2}\sum_{i=1}^{n}(x_i-\mu)^\top A (x_i-\mu) = \frac{nd}{2}\log(2\pi) - \frac{n}{2}\log(\det A) + \frac{n}{2}\operatorname{Tr}(A\tilde{\Sigma}),$$


where we introduced the empirical covariance matrix Σ̃ defined as

$$\tilde{\Sigma} = \frac{1}{n}\sum_{i=1}^{n}(x_i-\mu)(x_i-\mu)^\top.$$

The matrix A appears in the expression of the log-likelihood through two terms: (n/2) log det A and (n/2) Tr(AΣ̃).
Define f(A) = Tr(AΣ̃). Then f(A + H) − f(A) = Tr(HΣ̃), which leads to ∇f(A) = Σ̃.
Now, write log det A as

$$\log\det(A+H) = \log\det\left(A^{1/2}\left(I + A^{-1/2} H A^{-1/2}\right) A^{1/2}\right) = \log\det A + \log\det(I + \tilde{H}),$$

where A^{1/2} stands for the square-root matrix of A (it exists, since A is positive definite) and H̃ = A^{-1/2} H A^{-1/2}. Let us see what log det(I + H̃) looks like. Noting that log det I = 0, and denoting by (λ₁, …, λ_d) the eigenvalues of H̃, we have

$$\log\det(I+\tilde{H}) = \log\det(I+\tilde{H}) - \log\det I = \sum_{j=1}^{d}\log(1+\lambda_j) = \sum_{j=1}^{d}\lambda_j + o(\|\tilde{H}\|).$$

But then,

$$\sum_{j=1}^{d}\lambda_j = \operatorname{Tr}(\tilde{H}) = \operatorname{Tr}(A^{-1/2} H A^{-1/2}) = \operatorname{Tr}(H A^{-1}).$$

We conclude that ∇_A log det A = A⁻¹.
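As a quick sanity check (ours, not part of the original notes), the identity ∇_A log det A = A⁻¹ can be verified numerically with a symmetric finite-difference approximation of the directional derivative; the matrix A below is a randomly generated positive definite matrix.

```python
import numpy as np

rng = np.random.default_rng(0)

# Build a random symmetric positive definite matrix A.
d = 4
B = rng.standard_normal((d, d))
A = B @ B.T + d * np.eye(d)

# The directional derivative of log det at A in direction H
# should equal Tr(H A^{-1}).
H = rng.standard_normal((d, d))
H = (H + H.T) / 2                      # symmetric perturbation, as in the derivation
eps = 1e-6

_, ld_plus = np.linalg.slogdet(A + eps * H)
_, ld_minus = np.linalg.slogdet(A - eps * H)
fd = (ld_plus - ld_minus) / (2 * eps)  # finite-difference estimate
exact = np.trace(H @ np.linalg.inv(A)) # Tr(H A^{-1})

print(fd, exact)                       # the two values should agree closely
```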


Plugging these results into the gradient of the negative log-likelihood with respect to A, we have

$$\nabla_A(-\ell)(A) = -\frac{n}{2}A^{-1} + \frac{n}{2}\tilde{\Sigma}.$$

The optimality condition ∇_A(−ℓ)(A) = 0 leads to A⁻¹ = Σ̃, which means that

$$\hat{\Sigma} = \frac{1}{n}\sum_{i=1}^{n}(x_i-\hat{\mu})(x_i-\hat{\mu})^\top$$

is the empirical covariance matrix.


Note that we assumed that A was invertible, which is an implicit condition when writing
log det A. Rigorously speaking, the maximum likelihood estimator is therefore undefined
when Σ̃ is not invertible; in practice, the MLE is extended by continuity to the rank-deficient
case.
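In code, the two estimators are one-liners. Here is a minimal numpy sketch (variable names are our own) that draws a sample and computes µ̂ and Σ̂ as derived above.

```python
import numpy as np

rng = np.random.default_rng(1)

# True parameters of a 3-dimensional Gaussian.
mu_true = np.array([1.0, -2.0, 0.5])
Sigma_true = np.array([[2.0, 0.3, 0.0],
                       [0.3, 1.0, 0.2],
                       [0.0, 0.2, 0.5]])

n = 10_000
X = rng.multivariate_normal(mu_true, Sigma_true, size=n)  # shape (n, d)

mu_hat = X.mean(axis=0)                # empirical mean
centered = X - mu_hat
Sigma_hat = centered.T @ centered / n  # MLE: normalized by n, not n - 1

print(mu_hat)
print(Sigma_hat)
```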

2.2 Models with two nodes


In this section, we work with two nodes: one node corresponds to an input X, and one node
corresponds to an output Y .
Recall that when dealing with two random variables X and Y, one can use a generative
model, i.e. one that models the joint distribution p(X, Y), or one can instead use a conditional
model (often considered equivalent to the slightly different concept of discriminative model),
which models the conditional probability p(Y | X) of the output given the input. The two
following models, linear regression and logistic regression, are conditional models.


2.2.1 Linear regression


Let us assume that Y ∈ R depends linearly on X ∈ Rᵖ. Let w ∈ Rᵖ be a weighting vector
and σ² > 0. We make the following assumption:

$$Y \mid X \sim \mathcal{N}(w^\top X, \sigma^2),$$

which can be rewritten as

$$Y = w^\top X + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, \sigma^2).$$

Note that if there is an offset w₀ ∈ R, that is, if Y = wᵀX + w₀ + ε, one can always redefine a weighting vector w̃ ∈ R^{p+1} such that

$$Y = \tilde{w}^\top \begin{pmatrix} X \\ 1 \end{pmatrix} + \varepsilon.$$

Let D = {(x₁, y₁), …, (xₙ, yₙ)} be a training set of i.i.d. pairs. Each yᵢ is a
label (a decision) on observation xᵢ. We consider the conditional distribution of all outputs
given all inputs, which is a product of terms because of the independence of the pairs forming
the training set:

$$p(y_1, \dots, y_n \mid x_1, \dots, x_n; w, \sigma^2) = \prod_{i=1}^{n} p(y_i \mid x_i; w, \sigma^2).$$

The associated negative log-likelihood has the following expression:

$$-\ell(w, \sigma^2) = -\sum_{i=1}^{n}\log p(y_i \mid x_i) = \frac{n}{2}\log(2\pi\sigma^2) + \frac{1}{2}\sum_{i=1}^{n}\frac{(y_i - w^\top x_i)^2}{\sigma^2}.$$

The minimization problem with respect to w can now be reformulated as:

$$\hat{w} = \arg\min_{w} \frac{1}{2n}\sum_{i=1}^{n}(y_i - w^\top x_i)^2.$$

Define the so-called design matrix X as

$$\mathbf{X} = \begin{pmatrix} x_1^\top \\ \vdots \\ x_n^\top \end{pmatrix} \in \mathbb{R}^{n\times p}$$

and denote by y the vector of coordinates (y₁, …, yₙ). The minimization problem over w
can be rewritten more compactly as:

$$\hat{w} = \arg\min_{w} \frac{1}{2n}\|y - \mathbf{X}w\|^2.$$
Let f : w ↦ (1/2n)‖y − Xw‖² = (1/2n)(yᵀy − 2wᵀXᵀy + wᵀXᵀXw). The function f is strictly convex if and
only if its Hessian matrix (1/n)XᵀX is invertible. This is never the case when n < p (we then
deal with an underdetermined problem); most of the time, the Hessian matrix is invertible
when n ≥ p. When this is not the case, we often use Tikhonov regularization, which adds
a penalization of the ℓ₂-norm of w by minimizing f(w) + λ‖w‖² for some hyperparameter
λ > 0.
The gradient of f is

$$\nabla f(w) = \frac{1}{n}\mathbf{X}^\top(\mathbf{X}w - y) = 0 \iff \mathbf{X}^\top\mathbf{X}w = \mathbf{X}^\top y.$$

The equation XᵀXw = Xᵀy is known as the normal equation. If XᵀX is invertible, then
the optimal weighting vector is

$$\hat{w} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top y = \mathbf{X}^\dagger y,$$

where X† = (XᵀX)⁻¹Xᵀ is the Moore-Penrose pseudo-inverse of X. If XᵀX is not invertible,
the solution is no longer unique: for any h ∈ ker(X), ŵ = (XᵀX)†Xᵀy + h is an admissible
solution; in that case, however, regularization becomes necessary.
The computational cost of evaluating the optimal weighting vector from X and y is O(p³)
(use a Cholesky decomposition of the matrix XᵀX and solve two triangular systems).
Now, let us differentiate the negative log-likelihood with respect to σ²: we have

$$\nabla_{\sigma^2}(-\ell)(w, \sigma^2) = \frac{n}{2\sigma^2} - \frac{1}{2\sigma^4}\sum_{i=1}^{n}(y_i - w^\top x_i)^2.$$

Setting this gradient to zero gives

$$\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{w}^\top x_i)^2.$$
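As an illustrative sketch (ours, with synthetic data), the normal equations can be solved via a Cholesky factorization and the noise variance estimated as above:

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

rng = np.random.default_rng(2)

# Synthetic data: n observations in dimension p, with the offset handled
# by appending a constant column of ones to the design matrix.
n, p = 200, 5
X = rng.standard_normal((n, p))
X = np.hstack([X, np.ones((n, 1))])            # offset trick: w~ in R^{p+1}
w_true = rng.standard_normal(p + 1)
y = X @ w_true + 0.3 * rng.standard_normal(n)  # noise with sigma = 0.3

# Solve the normal equations X^T X w = X^T y with a Cholesky
# decomposition of X^T X followed by two triangular solves.
c, low = cho_factor(X.T @ X)
w_hat = cho_solve((c, low), X.T @ y)

# MLE of the noise variance: mean squared residual (normalized by n).
sigma2_hat = np.mean((y - X @ w_hat) ** 2)
print(w_hat, sigma2_hat)
```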

In practice, whenever we use a data matrix X in machine learning, we first preprocess
it so that it is not too badly conditioned, in order to avoid numerical issues. Two main
operations are applied columnwise: a centering (subtract the mean of the column's
coefficients) and a normalization (divide the coefficients of a column by the standard
deviation of the column vector). Note that this preprocessing does not guarantee that the
resulting matrix is well-conditioned: in particular, it can still be low rank.
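A minimal sketch of this columnwise standardization (our own illustration):

```python
import numpy as np

def standardize_columns(X: np.ndarray) -> np.ndarray:
    """Center each column of X and divide it by its standard deviation."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0.0] = 1.0  # guard against constant (zero-variance) columns
    return (X - mean) / std
```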

2.2.2 Logistic regression


Let X ∈ Rᵖ and Y ∈ {0, 1}. We assume that Y follows a Bernoulli distribution with parameter
θ; the problem is to estimate θ. Let us define the sigmoid function σ, defined on the real axis and
taking values in (0, 1), by

$$\forall z \in \mathbb{R}, \quad \sigma(z) = \frac{1}{1+e^{-z}}.$$

The sigmoid function is plotted in Figure 2.1. One can easily prove that

$$\forall z \in \mathbb{R}, \quad \sigma(-z) = 1 - \sigma(z), \qquad \sigma'(z) = \sigma(z)(1-\sigma(z)) = \sigma(z)\sigma(-z).$$
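A small implementation note (ours, not from the lecture): evaluating σ naively can overflow in the exponential for large |z|, so implementations usually branch on the sign of z; a sketch:

```python
import numpy as np

def sigmoid(z: np.ndarray) -> np.ndarray:
    """Numerically stable sigmoid, branching on the sign of z."""
    out = np.empty_like(z, dtype=float)
    pos = z >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    ez = np.exp(z[~pos])          # safe: z < 0 on this branch
    out[~pos] = ez / (1.0 + ez)   # equivalent form sigma(z) = e^z / (1 + e^z)
    return out
```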


[Figure 2.1: the sigmoid function σ(x), plotted for x ∈ [−10, 10].]

We now assume that, for a given observation X = x, the output Y | X = x follows a
Bernoulli law with parameter θ = σ(wᵀx), where w is again a weighting vector. In practice,
we can still add an offset and use wᵀx + w₀. The conditional distribution is then given by

$$p(Y = y \mid X = x) = \theta^{y}(1-\theta)^{1-y} = \sigma(w^\top x)^{y}\,\sigma(-w^\top x)^{1-y}.$$

Given a training set D = {(x₁, y₁), …, (xₙ, yₙ)} of i.i.d. random variables, we can compute
the log-likelihood

$$\ell(w) = \sum_{i=1}^{n} y_i \log\sigma(w^\top x_i) + (1-y_i)\log\sigma(-w^\top x_i).$$

In order to maximize the log-likelihood (equivalently, to minimize −ℓ, which is convex since
z ↦ log(1 + e⁻ᶻ) is convex and w ↦ wᵀxᵢ is linear), we compute its gradient. Writing ηᵢ = σ(wᵀxᵢ):

$$\nabla_w \ell(w) = \sum_{i=1}^{n} y_i\frac{\sigma(w^\top x_i)\sigma(-w^\top x_i)}{\sigma(w^\top x_i)}x_i - (1-y_i)\frac{\sigma(w^\top x_i)\sigma(-w^\top x_i)}{\sigma(-w^\top x_i)}x_i = \sum_{i=1}^{n} x_i(y_i - \eta_i).$$

Thus, ∇_w ℓ(w) = 0 ⟺ Σᵢ xᵢ(yᵢ − σ(wᵀxᵢ)) = 0. This equation is nonlinear in w, and we need
an iterative optimization method to solve it. For this purpose, we derive the Hessian matrix
of ℓ:

$$H\ell(w) = \sum_{i=1}^{n} x_i\left(0 - \sigma'(w^\top x_i)\right)x_i^\top = \sum_{i=1}^{n} \left(-\eta_i(1-\eta_i)\right)x_i x_i^\top = -\mathbf{X}^\top \operatorname{Diag}(\eta_i(1-\eta_i))\,\mathbf{X},$$

where X is the design matrix defined previously.
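These two formulas translate directly into numpy; an illustrative sketch (ours):

```python
import numpy as np

def logistic_grad_hess(w, X, y):
    """Gradient and Hessian of the logistic log-likelihood l(w).

    X: (n, p) design matrix, y: (n,) labels in {0, 1}.
    Returns grad = X^T (y - eta), hess = -X^T Diag(eta (1 - eta)) X.
    """
    eta = 1.0 / (1.0 + np.exp(-X @ w))       # eta_i = sigma(w^T x_i)
    grad = X.T @ (y - eta)
    hess = -(X.T * (eta * (1.0 - eta))) @ X  # row-wise weighting of X^T
    return grad, hess
```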


In the following we discuss first- and second-order optimization methods and apply them
to logistic regression.


First-order methods
Let f : Rᵖ → R be a convex C¹ function that we want to minimize. A descent direction
at point x is a vector d such that ⟨d, ∇f(x)⟩ < 0. The minimization of f can be done by
applying a descent algorithm, which iteratively takes a step in a descent direction, leading
to an iterative scheme of the form

x(k+1) = x(k) + ε(k) d(k) ,

where ε(k) is the stepsize. The direction d(k) is often chosen as the opposite of the gradient
of f at point x(k) : d(k) = −∇f (x(k) ).
There are several choices for ε⁽ᵏ⁾:

1. Constant step: ε⁽ᵏ⁾ = ε; but then the scheme does not necessarily converge.

2. Decreasing step size: ε⁽ᵏ⁾ ∝ 1/k, with Σₖ ε⁽ᵏ⁾ = ∞ and Σₖ (ε⁽ᵏ⁾)² < ∞; the scheme is then
   guaranteed to converge.

3. One can determine ε⁽ᵏ⁾ by a line search, which tries to find min_ε f(x⁽ᵏ⁾ + ε d⁽ᵏ⁾):

   • either exactly, but this is costly and rather useless in many situations;
   • or approximately (see the Armijo line search, sketched below), which is a better method.
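An illustrative sketch (ours, not from the notes) of gradient descent with Armijo backtracking; the constants alpha and beta are conventional choices:

```python
import numpy as np

def gradient_descent_armijo(f, grad_f, x0, alpha=0.3, beta=0.5,
                            tol=1e-8, max_iter=1000):
    """Gradient descent with Armijo backtracking line search.

    At each step, shrink eps by `beta` until the sufficient-decrease
    condition f(x + eps d) <= f(x) + alpha * eps * <grad f(x), d> holds,
    with descent direction d = -grad f(x).
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) < tol:
            break
        d = -g           # steepest-descent direction
        eps = 1.0
        while f(x + eps * d) > f(x) + alpha * eps * (g @ d):
            eps *= beta  # backtrack
        x = x + eps * d
    return x

# Usage example: minimize the quadratic f(x) = ||x||^2 / 2.
x_min = gradient_descent_armijo(lambda x: 0.5 * x @ x, lambda x: x,
                                np.array([3.0, -4.0]))
```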

Second-order methods
This time, let f : Rᵖ → R be a C² function that we want to minimize. We write the
second-order Taylor expansion of f:

$$f(x) = f(x_t) + (x-x_t)^\top\nabla f(x_t) + \frac{1}{2}(x-x_t)^\top Hf(x_t)(x-x_t) + o(\|x-x_t\|^2) \overset{\mathrm{def}}{=} g_t(x) + o(\|x-x_t\|^2).$$

A local optimum x* is then reached when

$$\nabla f(x^*) = 0 \quad \text{and} \quad Hf(x^*) \succeq 0.$$

In order to solve such a problem, we are going to use Newton's method. If f is a convex
function, then ∇g_t(x) = ∇f(x_t) + Hf(x_t)(x − x_t), and we only need to find x* such that
∇g_t(x*) = 0, i.e. we set

$$x_{t+1} = x_t - [Hf(x_t)]^{-1}\nabla f(x_t).$$

If the Hessian matrix is not invertible, we can regularize the problem and minimize
g_t(x) + λ‖x − x_t‖² instead.
In general, this update, called the pure Newton step, does not lead to a convergent
algorithm, even if the function is convex! It is then necessary to use the so-called damped
Newton method to obtain a convergent algorithm, which consists in the iterations

$$x_{t+1} = x_t - \varepsilon_t\,[Hf(x_t)]^{-1}\nabla f(x_t),$$

where ε_t is set with the Armijo line search.
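A sketch of the damped Newton iteration (ours; it reuses the Armijo backtracking idea from the sketch above):

```python
import numpy as np

def damped_newton(f, grad_f, hess_f, x0, alpha=0.3, beta=0.5,
                  tol=1e-10, max_iter=100):
    """Damped Newton method: Newton direction plus an Armijo step size."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) < tol:
            break
        # Newton direction: solve Hf(x) d = -g rather than inverting Hf(x).
        d = np.linalg.solve(hess_f(x), -g)
        eps = 1.0
        while f(x + eps * d) > f(x) + alpha * eps * (g @ d):
            eps *= beta
        x = x + eps * d
    return x
```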


This method may be computationally costly in high dimension because of the inverse of
the Hessian matrix that needs to be computed (in practice, the linear system that needs to
be solved) at each iteration. For some functions, however, the pure Newton method does
converge; this is the case for logistic regression.
In the context of non-convex optimization, the situation is more complicated because
the Hessian can have negative eigenvalues; in that case, so-called trust-region methods are
typically used.

Application to logistic regression


We now write the form that Newton's algorithm takes for logistic regression. We had:

$$\ell(w) = \sum_{i=1}^{n} y_i\log\sigma(w^\top x_i) + (1-y_i)\log\sigma(-w^\top x_i),$$

$$\nabla_w \ell(w) = \sum_{i=1}^{n} x_i(y_i - \eta_i) = \mathbf{X}^\top(y - \eta),$$

$$H\ell(w) = -\mathbf{X}^\top\operatorname{Diag}(\eta_i(1-\eta_i))\,\mathbf{X}.$$

The second-order Taylor expansion of the log-likelihood gives

$$\ell(w) \approx \ell(w_t) + (w-w_t)^\top\nabla\ell(w_t) + \frac{1}{2}(w-w_t)^\top H\ell(w_t)(w-w_t).$$

Let us set h = w − w_t. The maximization problem becomes:

$$\max_h\; h^\top\mathbf{X}^\top(y-\eta) - \frac{1}{2}h^\top\mathbf{X}^\top\operatorname{Diag}(\eta(1-\eta))\mathbf{X}h \iff \max_h\; h^\top\nabla_w\ell(w_t) + \frac{1}{2}h^\top H\ell(w_t)h.$$
According to the previous part, this leads to setting w_{t+1} = w_t − Hℓ(w_t)⁻¹∇_w ℓ(w_t). The
maximization problem above can be seen as a weighted linear regression over h of a function
of the form

$$\sum_i \frac{(\tilde{y}_i - x_i^\top h)^2}{\sigma_i^2}, \qquad \tilde{y}_i = \frac{y_i - \eta_i}{\eta_i(1-\eta_i)}, \quad \sigma_i^2 = [\eta_i(1-\eta_i)]^{-1}.$$

Thus, this method is often referred to as the iteratively reweighted least squares (IRLS) algorithm.
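A compact sketch (ours) of the resulting Newton/IRLS iteration, where X is the design matrix with rows xᵢᵀ and y the vector of 0/1 labels:

```python
import numpy as np

def logistic_newton(X, y, n_iter=25, tol=1e-10):
    """Newton's method (IRLS) for logistic regression.

    Each iteration solves (X^T D X) delta = X^T (y - eta) with
    D = Diag(eta (1 - eta)), i.e. w <- w - Hl(w)^{-1} grad l(w).
    """
    n, p = X.shape
    w = np.zeros(p)
    for _ in range(n_iter):
        eta = 1.0 / (1.0 + np.exp(-X @ w))
        D = eta * (1.0 - eta)  # diagonal weights
        delta = np.linalg.solve((X.T * D) @ X, X.T @ (y - eta))
        w += delta
        if np.linalg.norm(delta) < tol:
            break
    return w
```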

We may also run into a classification problem with more than two classes: Y ∈ {1, …, K}
with Y ∼ M(1, π₁(x), …, π_K(x)). We then need to define a rule over the classifiers
(softmax function, one-versus-all, etc.) in order to make a decision.
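For instance, the softmax rule models the class probabilities as πₖ(x) ∝ exp(wₖᵀx); a hedged numpy sketch, where the weight vectors wₖ are hypothetical parameters to be learned:

```python
import numpy as np

def softmax_probs(W, x):
    """Class probabilities pi_k(x) = exp(w_k^T x) / sum_j exp(w_j^T x).

    W: (K, p) matrix whose rows are the class weight vectors w_k.
    """
    scores = W @ x
    scores -= scores.max()  # shift scores for numerical stability
    e = np.exp(scores)
    return e / e.sum()

# Decision rule: predict the class with the highest probability.
# y_hat = np.argmax(softmax_probs(W, x))
```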

2.2.3 Generative models


This section briefly presents the Fisher linear discriminant, also known as linear discriminant
analysis. Suppose that we have X ∈ Rᵖ and Y ∈ {0, 1}. By Bayes' rule,

$$P(Y=1 \mid X=x) = \frac{P(X=x \mid Y=1)\,P(Y=1)}{P(X=x \mid Y=1)\,P(Y=1) + P(X=x \mid Y=0)\,P(Y=0)}.$$

The modelling assumption then consists in taking X | Y = 1 ∼ N(µ₁, Σ₁) and
X | Y = 0 ∼ N(µ₀, Σ₀). Fisher's assumption is that Σ₁ = Σ₀ = Σ.
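Under the shared-covariance assumption, the posterior P(Y = 1 | X = x) is in fact a sigmoid of a linear function of x; a sketch (ours) computing it directly from the generative parameters µ₀, µ₁, Σ and a prior π₁ = P(Y = 1), all assumed given:

```python
import numpy as np

def lda_posterior(x, mu0, mu1, Sigma, pi1):
    """P(Y = 1 | X = x) for Gaussian class conditionals with shared Sigma.

    With a common covariance, the log-odds are linear in x:
    log P(1|x)/P(0|x) = w^T x + b for suitable w and b.
    """
    Sigma_inv = np.linalg.inv(Sigma)
    w = Sigma_inv @ (mu1 - mu0)
    b = (-0.5 * (mu1 @ Sigma_inv @ mu1 - mu0 @ Sigma_inv @ mu0)
         + np.log(pi1 / (1.0 - pi1)))
    return 1.0 / (1.0 + np.exp(-(w @ x + b)))
```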


2.3 Unsupervised classification


Unsupervised learning consists in finding a label prediction function based on unlabeled
training data only. In the case where the learning problem is a classification problem, and
under the assumption that the classes form clusters in input space, the problem reduces to
a clustering problem, which consists in finding groups of points that form dense clusters.
When the clusters are assumed to be isotropic, the K-means formulation is appropriate.

The K-means algorithm


We start from a set of unlabelled data points (x₁, …, xₙ), where xᵢ ∈ Rᵖ, that we wish to
divide into K clusters defined by their centroids (µ₁, …, µ_K). The problem can be formulated as:

$$\min_{\mu_1,\dots,\mu_K} \frac{1}{n}\sum_{i=1}^{n}\min_{k}\|x_i - \mu_k\|^2.$$

The minimization inside the summation makes this a nonconvex problem. The K-means
algorithm is a greedy algorithm which consists in iteratively applying the two steps

$$C_k \leftarrow \left\{\, i \;\middle|\; \|x_i - \mu_k\|^2 = \min_j \|x_i - \mu_j\|^2 \,\right\}, \qquad \mu_k \leftarrow \frac{1}{|C_k|}\sum_{i\in C_k} x_i.$$

The first step defines the clusters Cₖ by assigning each data point to its closest centroid;
the second step then updates the centroids given the new clusters.

Two remarks:

• It can be shown that K-means converges in a finite number of steps.

• The algorithm, however, typically gets stuck in local minima; in practice it is necessary
  to try several restarts of the algorithm with random initializations to have a chance of
  obtaining a better solution (as in the sketch below).
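An illustrative implementation (ours) of the two alternating steps, with the random restarts suggested above:

```python
import numpy as np

def kmeans(X, K, n_iter=100, n_restarts=10, seed=0):
    """K-means with random restarts; returns the best (centroids, labels)."""
    rng = np.random.default_rng(seed)
    best_cost, best = np.inf, None
    for _ in range(n_restarts):
        mu = X[rng.choice(len(X), size=K, replace=False)]  # random init
        for _ in range(n_iter):
            # Assignment step: each point goes to its closest centroid.
            dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
            labels = dists.argmin(axis=1)
            # Update step: centroid = mean of its cluster
            # (keep the old centroid if the cluster is empty).
            new_mu = np.array([X[labels == k].mean(axis=0)
                               if np.any(labels == k) else mu[k]
                               for k in range(K)])
            if np.allclose(new_mu, mu):
                break
            mu = new_mu
        cost = dists.min(axis=1).mean()
        if cost < best_cost:
            best_cost, best = cost, (mu, labels)
    return best
```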
