Lecture 2 — October 11
Lecturer: Guillaume Obozinski Scribes: Aymeric Reshef, Claire Vernade
But then,
$$\sum_{j=1}^{d} \lambda_j = \mathrm{Tr}(\tilde{H}) = \mathrm{Tr}\big(A^{-1/2} H A^{-1/2}\big) = \mathrm{Tr}(H A^{-1}).$$
$$Y \mid X \sim \mathcal{N}(w^\top X, \sigma^2),$$
and denote by $y$ the vector of coordinates $(y_1, \cdots, y_n)$. The minimization problem over $w$ can be rewritten in the more compact form
$$\min_{w \in \mathbb{R}^p} \; \frac{1}{2}\, \| y - Xw \|^2 .$$
The equation $X^\top X w = X^\top y$ is known as the normal equation. If $X^\top X$ is invertible, then the optimal weighting vector is
$$\hat{w} = (X^\top X)^{-1} X^\top y = X^\dagger y,$$
where $X^\dagger = (X^\top X)^{-1} X^\top$ is the Moore-Penrose pseudo-inverse of $X$. If $X^\top X$ is not invertible, the solution is no longer unique, and for any $h \in \ker(X)$, $\hat{w} = (X^\top X)^\dagger X^\top y + h$ is an admissible solution. In that case, however, it would be necessary to use regularization.
The computational cost of evaluating the optimal weighting vector from $X$ and $y$ is $O(p^3)$ (use a Cholesky decomposition of the matrix $X^\top X$ and solve two triangular systems).
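As an illustration, here is a minimal NumPy/SciPy sketch of this Cholesky-based solve (the function name and the synthetic data are placeholders, not from the notes):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def least_squares_cholesky(X, y):
    """Solve the normal equation X^T X w = X^T y, assuming X^T X is invertible."""
    gram = X.T @ X                   # p x p matrix X^T X
    rhs = X.T @ y                    # right-hand side X^T y
    c, low = cho_factor(gram)        # Cholesky factorization, O(p^3)
    return cho_solve((c, low), rhs)  # two triangular solves

# Synthetic example (placeholder data)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=100)
w_hat = least_squares_cholesky(X, y)
```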
Now, let's differentiate $l(w, \sigma^2)$ with respect to $\sigma^2$: we have
$$\nabla_{\sigma^2} l(w, \sigma^2) = \frac{n}{2\sigma^2} - \frac{n}{2\sigma^4}\, \frac{1}{n}\sum_{i=1}^n (y_i - w^\top x_i)^2 .$$
Setting this derivative to zero gives the maximum likelihood estimator $\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^n (y_i - \hat{w}^\top x_i)^2$.
The sigmoid function $\sigma(z) = \frac{1}{1 + e^{-z}}$ satisfies
$$\forall z \in \mathbb{R}, \quad \sigma(-z) = 1 - \sigma(z),$$
$$\forall z \in \mathbb{R}, \quad \sigma'(z) = \sigma(z)(1 - \sigma(z)) = \sigma(z)\sigma(-z).$$
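As a quick numerical sanity check of these two identities (a sketch, not part of the original notes):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-5.0, 5.0, 11)
# sigma(-z) = 1 - sigma(z)
assert np.allclose(sigmoid(-z), 1.0 - sigmoid(z))
# sigma'(z) = sigma(z) * sigma(-z), checked against a central finite difference
h = 1e-6
finite_diff = (sigmoid(z + h) - sigmoid(z - h)) / (2.0 * h)
assert np.allclose(finite_diff, sigmoid(z) * sigmoid(-z), atol=1e-8)
```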
[Figure: graph of the sigmoid function σ(x) for x ∈ [−10, 10], increasing from 0 to 1 with σ(0) = 0.5.]
Given a training set $D = \{(x_1, y_1), \cdots, (x_n, y_n)\}$ of iid random variables, we can compute the log-likelihood
$$l(w) = \sum_{i=1}^n y_i \log \sigma(w^\top x_i) + (1 - y_i) \log \sigma(-w^\top x_i).$$
In order to maximize the log-likelihood (equivalently, to minimize the negative log-likelihood, which is convex since $z \mapsto \log(1 + e^{-z})$ is a convex function and $w \mapsto w^\top x_i$ is linear), we calculate its gradient. Writing $\eta_i = \sigma(w^\top x_i)$:
$$\nabla_w l(w) = \sum_{i=1}^n y_i\, \frac{\sigma(w^\top x_i)\,\sigma(-w^\top x_i)}{\sigma(w^\top x_i)}\, x_i - (1 - y_i)\, \frac{\sigma(w^\top x_i)\,\sigma(-w^\top x_i)}{\sigma(-w^\top x_i)}\, x_i = \sum_{i=1}^n x_i (y_i - \eta_i).$$
Thus, $\nabla_w l(w) = 0 \iff \sum_{i=1}^n x_i \big(y_i - \sigma(w^\top x_i)\big) = 0$. This equation is nonlinear and we need an iterative optimization method to solve it. For this purpose, we derive the Hessian matrix of $l$:
$$\mathrm{H}\, l(w) = \sum_{i=1}^n x_i \big(0 - \sigma(w^\top x_i)\sigma(-w^\top x_i)\big)\, x_i^\top = \sum_{i=1}^n \big(-\eta_i (1 - \eta_i)\big)\, x_i x_i^\top = -X^\top \mathrm{Diag}\big(\eta_i (1 - \eta_i)\big)\, X.$$
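In matrix form, with the rows $x_i^\top$ stacked in $X$ and the labels in $y$, the gradient and Hessian can be computed as in the following NumPy sketch (the function names are placeholders, not from the notes):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_grad_hess(w, X, y):
    """Gradient and Hessian of the logistic log-likelihood l(w)."""
    eta = sigmoid(X @ w)                              # eta_i = sigma(w^T x_i)
    grad = X.T @ (y - eta)                            # sum_i x_i (y_i - eta_i)
    hess = -X.T @ (X * (eta * (1.0 - eta))[:, None])  # -X^T Diag(eta_i (1 - eta_i)) X
    return grad, hess
```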
First-order methods
Let $f : \mathbb{R}^p \to \mathbb{R}$ be the convex $C^1$ function that we want to minimize. A descent direction at point $x$ is a vector $d$ such that $\langle d, \nabla f(x) \rangle < 0$. The minimization of $f$ can be done by applying a descent algorithm, which iteratively takes a step in a descent direction, leading to an iterative scheme of the form
$$x^{(k+1)} = x^{(k)} + \varepsilon^{(k)} d^{(k)},$$
where $\varepsilon^{(k)}$ is the step size. The direction $d^{(k)}$ is often chosen as the opposite of the gradient of $f$ at point $x^{(k)}$: $d^{(k)} = -\nabla f(x^{(k)})$.
There are several choices for $\varepsilon^{(k)}$:
1. Constant step: $\varepsilon^{(k)} = \varepsilon$. But the scheme does not necessarily converge.
2. Decreasing step size: $\varepsilon^{(k)} \propto \frac{1}{k}$, with $\sum_k \varepsilon^{(k)} = \infty$ and $\sum_k (\varepsilon^{(k)})^2 < \infty$. The scheme is guaranteed to converge.
3. One can determine $\varepsilon^{(k)}$ by doing a line search, which tries to find $\min_\varepsilon f(x^{(k)} + \varepsilon d^{(k)})$:
• either exactly, but this is costly and rarely worthwhile;
• or approximately (see the Armijo line search), which is usually the better method.
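As an illustration, here is a minimal sketch of gradient descent with an Armijo backtracking line search (the constants $10^{-4}$ and $1/2$ are common but arbitrary choices, not from the notes):

```python
import numpy as np

def gradient_descent(f, grad_f, x0, max_iter=1000, tol=1e-8):
    """Gradient descent with an Armijo backtracking line search."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) < tol:
            break
        d = -g                   # descent direction: opposite of the gradient
        fx, eps = f(x), 1.0
        # backtrack until the sufficient-decrease (Armijo) condition holds
        while f(x + eps * d) > fx + 1e-4 * eps * (g @ d):
            eps *= 0.5
        x = x + eps * d
    return x

# Placeholder usage on a strictly convex quadratic
A = np.array([[3.0, 1.0], [1.0, 2.0]])
x_star = gradient_descent(lambda x: 0.5 * x @ A @ x, lambda x: A @ x, x0=[1.0, 1.0])
```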
Second-order methods
This time, let $f : \mathbb{R}^p \to \mathbb{R}$ be the $C^2$ function that we want to minimize. We write the second-order Taylor expansion of $f$:
$$f(x) = f(x_t) + (x - x_t)^\top \nabla f(x_t) + \frac{1}{2} (x - x_t)^\top \mathrm{H}f(x_t)(x - x_t) + o\big(\|x - x_t\|^2\big) \stackrel{\text{def}}{=} g_t(x) + o\big(\|x - x_t\|^2\big).$$
A local optimum $x^*$ is then reached when
$$\begin{cases} \nabla f(x^*) = 0 \\ \mathrm{H}f(x^*) \succeq 0. \end{cases}$$
In order to solve such a problem, we are going to use Newton's method. If $f$ is a convex function, then $\nabla g_t(x) = \nabla f(x_t) + \mathrm{H}f(x_t)(x - x_t)$ and we only need to find $x^*$ so that $\nabla g_t(x) = 0$, i.e. we set $x_{t+1} = x_t - [\mathrm{H}f(x_t)]^{-1} \nabla f(x_t)$. If the Hessian matrix is not invertible, we can regularize the problem and minimize $g_t(x) + \lambda \|x - x_t\|^2$ instead.
In general the previous update, called the pure Newton step, does not lead to a convergent algorithm, even if the function is convex! To obtain a convergent algorithm it is then necessary to use the so-called damped Newton method, which consists in doing the following iterations:
$$x_{t+1} = x_t - \varepsilon_t\, [\mathrm{H}f(x_t)]^{-1} \nabla f(x_t),$$
where the step size $\varepsilon_t \in (0, 1]$ is chosen by a line search.
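A minimal sketch of such a damped Newton iteration, assuming $\mathrm{H}f(x_t)$ is positive definite and choosing $\varepsilon_t$ by Armijo backtracking (the constants and the stopping rule are illustrative choices, not from the notes):

```python
import numpy as np

def damped_newton(f, grad_f, hess_f, x0, max_iter=100, tol=1e-10):
    """Damped Newton: Newton direction plus a backtracking choice of the step size."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g, H = grad_f(x), hess_f(x)
        d = np.linalg.solve(H, -g)       # Newton direction -[Hf(x_t)]^{-1} grad f(x_t)
        if -(g @ d) < tol:               # squared Newton decrement is small: stop
            break
        fx, eps = f(x), 1.0
        while f(x + eps * d) > fx + 1e-4 * eps * (g @ d):  # Armijo backtracking
            eps *= 0.5
        x = x + eps * d
    return x
```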
This method may be computationally costly in high dimension because the Hessian matrix has to be inverted at each iteration. For some functions, however, the pure Newton method does converge; this is the case for logistic regression.
In the context of non-convex optimization, the situation is more complicated because
the Hessian can have negative eigenvalues. In that case, so-called trust region methods are
typically used.
We may run into a classification problem with more than two classes: $Y \in \{1, \cdots, K\}$ with $Y \sim \mathcal{M}(1, \pi_1(x), \cdots, \pi_K(x))$. We will need to define a rule over the classifiers (softmax function, one-versus-all, etc.) in order to make a decision.
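For instance, the softmax rule sets $\pi_k(x) \propto \exp(w_k^\top x)$ and picks the most probable class; a small numerically stabilized sketch (the weight matrix $W$ and the data are placeholders, not from the notes):

```python
import numpy as np

def softmax(scores):
    """Map K scores to probabilities pi_1, ..., pi_K (shifted for numerical stability)."""
    s = scores - np.max(scores)
    e = np.exp(s)
    return e / e.sum()

# Placeholder decision rule: one weight vector per class, pick the most probable class
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 5))           # K = 3 classes, p = 5 features
x = rng.normal(size=5)
probs = softmax(W @ x)                # pi_k(x) proportional to exp(w_k^T x)
y_hat = int(np.argmax(probs)) + 1     # classes labeled 1, ..., K as above
```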
The minimization step inside the summation leads to a nonconvex problem. The K-means algorithm is a greedy algorithm which consists in iteratively applying the two steps
$$C_k \leftarrow \Big\{\, i \;\Big|\; \|x_i - \mu_k\|^2 = \min_j \|x_i - \mu_j\|^2 \,\Big\},$$
$$\mu_k \leftarrow \frac{1}{|C_k|} \sum_{i \in C_k} x_i .$$
The first step defines the clusters $C_k$ by assigning each data point to its closest centroid. The second step then updates the centroids given the new clusters.
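A compact sketch of these two alternating steps (the random initialization and the iteration cap are assumptions, not specified in the notes):

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Alternate the assignment step and the centroid-update step of K-means."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=K, replace=False)]     # random initial centroids
    for _ in range(n_iter):
        # Step 1: assign each point to its closest centroid
        dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Step 2: recompute each centroid as the mean of its cluster
        new_mu = np.array([X[labels == k].mean(axis=0) if np.any(labels == k) else mu[k]
                           for k in range(K)])
        if np.allclose(new_mu, mu):                        # assignments are stable: stop
            break
        mu = new_mu
    return labels, mu
```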
Two remarks:
• The algorithm, however, typically gets stuck in local minima, and in practice it is necessary to try several restarts of the algorithm with random initializations to have a chance of obtaining a better solution.