
Convex optimization - Homework 3

1. Second order methods for dual problem

1. Consider the LASSO (Least Absolute Shrinkage and Selection Operator) problem

minimize (1/2)||Xw − y||_2^2 + λ||w||_1

in the variable w ∈ R^d, with X ∈ R^{n×d}, y ∈ R^n and λ > 0 the regularization parameter.

We can rewrite the problem as

minimize (1/2)||z − y||_2^2 + λ||w||_1
subject to z = Xw

The Lagrangian of the problem is
L(w, z, v) = (1/2)||z − y||_2^2 + λ||w||_1 + v^T(Xw − z).

As the function is separable,

inf_{w,z} L(w, z, v) = inf_w (λ||w||_1 + v^T Xw) + inf_z ((1/2)||z − y||_2^2 − v^T z)

For the first part,

(1/2)||z − y||_2^2 − v^T z = (1/2)z^T z − (y + v)^T z + (1/2)y^T y
argmin_z ((1/2)z^T z − (y + v)^T z + (1/2)y^T y) = y + v
inf_z ((1/2)||z − y||_2^2 − v^T z) = −(1/2)||v||_2^2 − y^T v

And for the second part,

inf_w (λ||w||_1 + (X^T v)^T w) = Σ_{i=1}^d inf_{w_i} (λ|w_i| + (X^T v)_i w_i)

inf_{w_i} (λ|w_i| + (X^T v)_i w_i) = 0 if |(X^T v)_i| ≤ λ, −∞ otherwise

inf_w (λ||w||_1 + (X^T v)^T w) = 0 if ||X^T v||_∞ ≤ λ, −∞ otherwise
We deduce that:

inf_{w,z} L(w, z, v) = −(1/2)||v||_2^2 − y^T v when ||X^T v||_∞ ≤ λ

Thus, the dual problem is

maximize −(1/2)||v||_2^2 − y^T v
subject to ||X^T v||_∞ ≤ λ

We can rewrite this problem as

minimize v^T Qv + p^T v
subject to Av ≤ b

in the variable v ∈ R^n, with Q = (1/2)I_n, p = y, A = (X, −X)^T ∈ R^{2d×n} and b = λ1_{2d}, where 1_{2d} is the vector of all ones in R^{2d}.
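As a sketch (function name and shapes are my own, not from the original code), the QP data (Q, p, A, b) can be assembled from (X, y, λ) as follows:

```python
import numpy as np

def dual_qp_data(X, y, lam):
    """Build the data (Q, p, A, b) of the dual QP
         minimize  v^T Q v + p^T v   subject to  A v <= b,
    following the identification above."""
    n, d = X.shape
    Q = 0.5 * np.eye(n)          # Q = (1/2) I_n
    p = y.copy()                 # p = y
    A = np.vstack([X.T, -X.T])   # A = (X, -X)^T, shape (2d, n)
    b = lam * np.ones(2 * d)     # b = lambda * 1_{2d}
    return Q, p, A, b
```

The 2d rows of A encode ||X^T v||_∞ ≤ λ as the pair of inequalities X^T v ≤ λ1 and −X^T v ≤ λ1; note that v = 0 is strictly feasible since λ > 0.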

3. We can take v_0 = 0, which satisfies the initial condition (it is strictly feasible since λ > 0).


We notice that the bigger µ is, the larger the decrease of the objective function at each iteration; consequently, when µ is bigger, the number of iterations decreases.

Figure 1: The objective function vs the number of iterations for different µ

Figure 2: The gap in a semilog scale

Thus, an appropriate choice of µ is 50.
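The second-order scheme used here can be sketched as a standard barrier method with Newton centering steps. This is a minimal sketch under assumptions of my own (the QP data (Q, p, A, b) identified above, common backtracking constants, illustrative tolerances); it is not the code used to produce the figures:

```python
import numpy as np

def centering_step(Q, p, A, b, v, t, tol=1e-8, max_iter=50):
    """Newton's method on  t*(v^T Q v + p^T v) - sum(log(b - A v))."""
    alpha, beta = 0.25, 0.5          # backtracking constants (a common choice)
    f = lambda u: t * (u @ Q @ u + p @ u) - np.sum(np.log(b - A @ u))
    for _ in range(max_iter):
        s = b - A @ v                # slacks, kept strictly positive
        grad = t * (2 * Q @ v + p) + A.T @ (1.0 / s)
        hess = 2 * t * Q + A.T @ np.diag(1.0 / s**2) @ A
        dv = np.linalg.solve(hess, -grad)
        if -(grad @ dv) / 2 <= tol:  # Newton decrement stopping test
            break
        step = 1.0
        # shrink until the trial point is strictly feasible
        while np.min(b - A @ (v + step * dv)) <= 0:
            step *= beta
        # backtracking (Armijo) line search
        while f(v + step * dv) > f(v) + alpha * step * (grad @ dv):
            step *= beta
        v = v + step * dv
    return v

def barrier_method(Q, p, A, b, v0, mu=50, t0=1.0, tol=1e-6):
    v, t, m = v0, t0, len(b)
    while m / t > tol:               # duality-gap stopping criterion
        v = centering_step(Q, p, A, b, v, t)
        t *= mu
    return v
```

Each outer iteration multiplies t by µ, which is why a larger µ reduces the number of outer iterations.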

2. First order methods for primal problem

1. The function f(w) = (1/2)||Xw − y||_2^2 + λ||w||_1 is not differentiable on the set {w ∈ R^d | ∃i ∈ {1, ..., d}, w_i = 0}; elsewhere, it is differentiable with

∇f(w) = X^T Xw − X^T y + λ(sgn(w_i))_{1≤i≤d}.

By setting g(w) = X^T Xw − X^T y + λ(sgn(w_i))_{1≤i≤d}, with sgn(0) = 0, we have

∀z ∈ R^d, f(z) − f(w) − g(w)^T(z − w) ≥ 0,

so we can take g(w) as a subgradient of f at w.
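The subgradient inequality can be checked numerically; the sketch below (function names are mine) evaluates f and g as defined above:

```python
import numpy as np

def lasso_obj(X, y, lam, w):
    """f(w) = (1/2)||Xw - y||_2^2 + lam * ||w||_1"""
    r = X @ w - y
    return 0.5 * r @ r + lam * np.sum(np.abs(w))

def lasso_subgradient(X, y, lam, w):
    """g(w) = X^T X w - X^T y + lam * sgn(w), with sgn(0) = 0
    (np.sign returns 0 at 0, matching the convention above)."""
    return X.T @ (X @ w - y) + lam * np.sign(w)
```

Sampling random points z and verifying f(z) ≥ f(w) + g(w)^T(z − w) gives a quick sanity check that g is indeed a subgradient.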

Figure 3: The objective function vs the number of iterations for different strategies

Figure 4: The gap in a semilog scale for different strategies

• Strategy 1: constant step size, α_k = h

• Strategy 2: constant step length, α_k = h/||g_k||_2

• Strategy 3: square summable but not summable, α_k = h/k

• Strategy 4: nonsummable diminishing, α_k = h/√k

We can see that the 4th strategy is the fastest, while the 1st and the 2nd are very slow. However, if we continue to iterate a certain number of times, we notice that the first two strategies are more precise, even though they do not converge.
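A minimal sketch of the subgradient method with the four step-size rules above (the function name and the default values of h and the iteration count are my own, not the ones used for the figures):

```python
import numpy as np

def subgradient_method(X, y, lam, strategy, h=0.01, iters=300):
    """Subgradient method with one of the four step-size rules.
    Returns the last iterate and the best objective value seen
    (the subgradient method is not a descent method)."""
    w = np.zeros(X.shape[1])
    f_best = np.inf
    for k in range(1, iters + 1):
        g = X.T @ (X @ w - y) + lam * np.sign(g := None) if False else \
            X.T @ (X @ w - y) + lam * np.sign(w)   # subgradient from part 1
        if strategy == 1:
            alpha = h                              # constant step size
        elif strategy == 2:
            alpha = h / np.linalg.norm(g)          # constant step length
        elif strategy == 3:
            alpha = h / k                          # square summable, not summable
        else:
            alpha = h / np.sqrt(k)                 # nonsummable diminishing
        w = w - alpha * g
        r = X @ w - y
        f_best = min(f_best, 0.5 * r @ r + lam * np.sum(np.abs(w)))
    return w, f_best
```

Tracking the best value seen, rather than the last one, is the standard way to report progress for a method whose iterates oscillate.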
2. The function is of the form f(w) = (1/2)w^T X^T Xw − y^T Xw + (1/2)y^T y + λ||w||_1.
If we expand it in coordinates, we get:

f(w) = (1/2) Σ_{i=1}^d Σ_{j=1}^d (X^T X)_{ij} w_i w_j − Σ_{i=1}^d (X^T y)_i w_i + (1/2)y^T y + λ Σ_{i=1}^d |w_i|.

A more convenient form is:

f(w) = (1/2) Σ_{i=1}^d (X^T X)_{ii} w_i^2 + Σ_{i=1}^d Σ_{j=1}^{i−1} (X^T X)_{ij} w_i w_j − Σ_{i=1}^d (X^T y)_i w_i + (1/2)y^T y + λ Σ_{i=1}^d |w_i|.

Here, for each i ∈ {1, ..., d}, if we fix w_j for j ≠ i, f_i(w_i) = f(w) takes the form

f_i(w_i) = α_i w_i^2 + β_i w_i + γ_i |w_i| + δ_i

with α_i = (1/2)(X^T X)_{ii}, β_i = Σ_{j≠i} (X^T X)_{ij} w_j − (X^T y)_i, γ_i = λ and δ_i = (1/2) Σ_{j≠i} Σ_{k≠i} (X^T X)_{kj} w_k w_j − Σ_{j≠i} (X^T y)_j w_j + (1/2)y^T y + λ Σ_{j≠i} |w_j|.
We can see that α_i, γ_i > 0. Thus, when we study the function f_i(w_i) = α_i w_i^2 + β_i w_i + γ_i |w_i| + δ_i, we know that it is convex and tends to +∞ at ±∞. Moreover:

f_i'(w_i) = 2α_i w_i + β_i + γ_i if w_i > 0
f_i'(w_i) = 2α_i w_i + β_i − γ_i if w_i < 0

(the function is not differentiable at 0).

We can thus deduce that:

argmin_{w_i} f_i(w_i) = 0 if |β_i| ≤ γ_i
argmin_{w_i} f_i(w_i) = (−β_i − γ_i)/(2α_i) if β_i < −γ_i
argmin_{w_i} f_i(w_i) = (−β_i + γ_i)/(2α_i) if β_i > γ_i

We can thus set, at each iteration, w_i^{(k+1)} = argmin_{w_i} f_i(w_i) and w_j^{(k+1)} = w_j^{(k)} for j ≠ i, cycling over i.
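This coordinate update is a soft-thresholding step; a sketch of the resulting cyclic coordinate descent (my own implementation of the closed form above, assuming the columns of X are nonzero so that α_i > 0):

```python
import numpy as np

def coordinate_descent(X, y, lam, n_cycles=500):
    """Cyclic coordinate descent for the LASSO, using the closed-form
    minimizer of f_i(w_i) = alpha_i w_i^2 + beta_i w_i + gamma_i |w_i| + delta_i."""
    d = X.shape[1]
    G = X.T @ X                  # Gram matrix, entries (X^T X)_{ij}
    c = X.T @ y                  # entries (X^T y)_i
    w = np.zeros(d)
    for _ in range(n_cycles):
        for i in range(d):
            alpha = 0.5 * G[i, i]
            # beta_i = sum_{j != i} (X^T X)_{ij} w_j - (X^T y)_i
            beta = G[i] @ w - G[i, i] * w[i] - c[i]
            gamma = lam
            if abs(beta) <= gamma:
                w[i] = 0.0
            elif beta < -gamma:
                w[i] = (-beta - gamma) / (2 * alpha)
            else:
                w[i] = (-beta + gamma) / (2 * alpha)
    return w
```

At a fixed point of these updates, every coordinate satisfies its own optimality condition given the others, which for this problem is exactly the subgradient optimality condition of f.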

We notice that the coordinate descent method is faster than the subgradient method. Indeed, while the subgradient method does not converge but oscillates around p*, the coordinate descent method converges in a certain number of steps (approximately 250 to reach a gap ε = 10^−3).

Figure 5: The objective function vs the number of iterations for the coordinate
descent

Figure 6: The gap in a semilog scale for the coordinate descent

Figure 7: Comparison of the gap for the 2 methods
