
Subgradient Methods

• subgradient method and stepsize rules
• convergence results and proof
• optimal step size and alternating projections
• speeding up subgradient methods

EE364b, Stanford University updated: April 7, 2022


Subgradient method
subgradient method is simple algorithm to minimize nondifferentiable
convex function f

x(k+1) = x(k) − αk g(k)

• x(k) is the kth iterate
• g(k) is any subgradient of f at x(k)
• αk > 0 is the kth step size

not a descent method, so we keep track of best point so far

fbest(k) = min_{i=1,...,k} f(x(i))
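For concreteness, here is a minimal sketch of this loop in Python/NumPy; `f`, `subgrad`, and `stepsize` are assumed to be user-supplied callables (step-size rules follow on the next slide):

```python
import numpy as np

def subgradient_method(f, subgrad, x0, stepsize, iters=1000):
    """Minimal subgradient method sketch: not a descent method,
    so we keep track of the best point seen so far."""
    x = np.asarray(x0, dtype=float)
    f_best, x_best = f(x), x.copy()
    for k in range(1, iters + 1):
        g = subgrad(x)                  # g^(k): any subgradient of f at x^(k)
        x = x - stepsize(k) * g         # x^(k+1) = x^(k) - alpha_k * g^(k)
        fx = f(x)
        if fx < f_best:                 # f_best^(k) = min_i f(x^(i))
            f_best, x_best = fx, x.copy()
    return x_best, f_best
```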

EE364b, Stanford University 1


Step size rules
step sizes are fixed ahead of time
• constant step size: αk = α (constant)

• constant step length: αk = γ/‖g(k)‖2 (so ‖x(k+1) − x(k)‖2 = γ)

• square summable but not summable: step sizes satisfy

  Σ_{k=1}^∞ αk² < ∞,    Σ_{k=1}^∞ αk = ∞

• nonsummable diminishing: step sizes satisfy

  lim_{k→∞} αk = 0,    Σ_{k=1}^∞ αk = ∞
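As a small sketch, these rules can be written as step-size functions; the constants alpha and gamma below are illustrative choices, not values from the slides:

```python
import numpy as np

# illustrative constants, not values taken from the slides
alpha, gamma = 0.1, 0.05

def constant_step_size(k):
    return alpha                   # alpha_k = alpha

def square_summable(k):
    return alpha / k               # sum alpha_k^2 < inf, sum alpha_k = inf

def nonsummable_diminishing(k):
    return alpha / np.sqrt(k)      # alpha_k -> 0, sum alpha_k = inf

def constant_step_length(g):
    # alpha_k = gamma / ||g^(k)||_2, so each step has length gamma
    return gamma / np.linalg.norm(g)
```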

EE364b, Stanford University 2


Assumptions

• f⋆ = inf_x f(x) > −∞, with f(x⋆) = f⋆

• ‖g‖2 ≤ G for all g ∈ ∂f (equivalent to Lipschitz condition on f )

• ‖x(1) − x⋆‖2 ≤ R

these assumptions are stronger than needed, just to simplify proofs

EE364b, Stanford University 3


Convergence results
define f̄ = lim_{k→∞} fbest(k)

• constant step size: f̄ − f⋆ ≤ G²α/2, i.e.,
  converges to G²α/2-suboptimal
  (converges to f⋆ if f differentiable, α small enough)

• constant step length: f̄ − f⋆ ≤ Gγ/2, i.e.,
  converges to Gγ/2-suboptimal

• diminishing step size rule: f̄ = f⋆, i.e., converges

EE364b, Stanford University 4


Convergence proof
key quantity: Euclidean distance to the optimal set, not the function value

let x⋆ be any minimizer of f

‖x(k+1) − x⋆‖2² = ‖x(k) − αk g(k) − x⋆‖2²
               = ‖x(k) − x⋆‖2² − 2αk g(k)ᵀ(x(k) − x⋆) + αk² ‖g(k)‖2²
               ≤ ‖x(k) − x⋆‖2² − 2αk (f(x(k)) − f⋆) + αk² ‖g(k)‖2²

using f⋆ = f(x⋆) ≥ f(x(k)) + g(k)ᵀ(x⋆ − x(k))

EE364b, Stanford University 5


apply recursively to get

‖x(k+1) − x⋆‖2² ≤ ‖x(1) − x⋆‖2² − 2 Σ_{i=1}^k αi (f(x(i)) − f⋆) + Σ_{i=1}^k αi² ‖g(i)‖2²
               ≤ R² − 2 Σ_{i=1}^k αi (f(x(i)) − f⋆) + G² Σ_{i=1}^k αi²

now we use

Σ_{i=1}^k αi (f(x(i)) − f⋆) ≥ (fbest(k) − f⋆) Σ_{i=1}^k αi

to get

fbest(k) − f⋆ ≤ (R² + G² Σ_{i=1}^k αi²) / (2 Σ_{i=1}^k αi).
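To see how the bound behaves, a small sketch that evaluates its right-hand side for a given step-size sequence (R, G, and the schedule below are placeholder values):

```python
import numpy as np

def suboptimality_bound(alphas, R, G):
    """(R^2 + G^2 * sum(alpha_i^2)) / (2 * sum(alpha_i))."""
    alphas = np.asarray(alphas, dtype=float)
    return (R**2 + G**2 * np.sum(alphas**2)) / (2 * np.sum(alphas))

# constant step size alpha over k steps: bound tends to G^2 * alpha / 2
R, G, alpha, k = 1.0, 1.0, 0.01, 100000
print(suboptimality_bound(np.full(k, alpha), R, G))   # close to 0.005
```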

EE364b, Stanford University 6


constant step size: for αk = α we get

fbest(k) − f⋆ ≤ (R² + G²kα²) / (2kα)

righthand side converges to G²α/2 as k → ∞

constant step length: for αk = γ/‖g(k)‖2 we get

fbest(k) − f⋆ ≤ (R² + Σ_{i=1}^k αi² ‖g(i)‖2²) / (2 Σ_{i=1}^k αi) ≤ (R² + γ²k) / (2γk/G),

righthand side converges to Gγ/2 as k → ∞

EE364b, Stanford University 7


square summable but not summable step sizes:
suppose step sizes satisfy

Σ_{k=1}^∞ αk² < ∞,    Σ_{k=1}^∞ αk = ∞

then

fbest(k) − f⋆ ≤ (R² + G² Σ_{i=1}^k αi²) / (2 Σ_{i=1}^k αi)

as k → ∞, numerator converges to a finite number, denominator
converges to ∞, so fbest(k) → f⋆

EE364b, Stanford University 8


Stopping criterion
• terminating when (R² + G² Σ_{i=1}^k αi²) / (2 Σ_{i=1}^k αi) ≤ ε is really, really, slow

• optimal choice of αi to achieve (R² + G² Σ_{i=1}^k αi²) / (2 Σ_{i=1}^k αi) ≤ ε for smallest k:

  αi = (R/G)/√k,   i = 1, . . . , k

  number of steps required: k = (RG/ε)²

• the truth: there really isn’t a good stopping criterion for the subgradient
method . . .

EE364b, Stanford University 9


Example: Piecewise linear minimization

minimize f(x) = max_{i=1,...,m} (aiᵀx + bi)

to find a subgradient of f : find index j for which

ajᵀx + bj = max_{i=1,...,m} (aiᵀx + bi)

and take g = aj

subgradient method: x(k+1) = x(k) − αk aj
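As an illustration, a sketch of this method in NumPy on a random PWL instance; the data A, b, the dimensions, and the step-size rule below are illustrative, not the instance used on the following slides:

```python
import numpy as np

def pwl_subgrad_step(x, A, b, alpha):
    """One subgradient step for f(x) = max_i (a_i^T x + b_i)."""
    vals = A @ x + b
    j = np.argmax(vals)                  # index attaining the maximum
    return x - alpha * A[j], vals[j]     # g = a_j is a subgradient at x

# illustrative usage on random data
rng = np.random.default_rng(0)
A, b = rng.standard_normal((100, 20)), rng.standard_normal(100)
x, f_best = np.zeros(20), np.inf
for k in range(1, 3001):
    x, fx = pwl_subgrad_step(x, A, b, alpha=0.1 / np.sqrt(k))
    f_best = min(f_best, fx)             # best value seen so far
```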

EE364b, Stanford University 10


problem instance with n = 20 variables, m = 100 terms, f⋆ ≈ 1.1

fbest(k) − f⋆, constant step length γ = 0.05, 0.01, 0.005

[figure: fbest − fmin (log scale) versus iteration k, for the three step lengths γ]

EE364b, Stanford University 11


diminishing step rules αk = 0.1/√k and αk = 1/√k, square summable
step size rules αk = 1/k and αk = 10/k

[figure: fbest − fmin (log scale) versus iteration k, for the four step size rules]

EE364b, Stanford University 12


Optimal step size when f⋆ is known

• choice due to Polyak:

  αk = (f(x(k)) − f⋆) / ‖g(k)‖2²

  (can also use when optimal value is estimated)

• motivation: start with basic inequality

  ‖x(k+1) − x⋆‖2² ≤ ‖x(k) − x⋆‖2² − 2αk (f(x(k)) − f⋆) + αk² ‖g(k)‖2²

  and choose αk to minimize righthand side
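A one-line sketch of this rule (f_star is assumed known, or replaced by an estimate):

```python
import numpy as np

def polyak_step(f_x, f_star, g):
    """Polyak step: alpha_k = (f(x^(k)) - f_star) / ||g^(k)||_2^2."""
    return (f_x - f_star) / np.dot(g, g)
```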

EE364b, Stanford University 13


• yields

  ‖x(k+1) − x⋆‖2² ≤ ‖x(k) − x⋆‖2² − (f(x(k)) − f⋆)² / ‖g(k)‖2²

  (in particular, ‖x(k) − x⋆‖2 decreases each step)

• applying recursively,

  Σ_{i=1}^k (f(x(i)) − f⋆)² / ‖g(i)‖2² ≤ R²

  and so

  Σ_{i=1}^k (f(x(i)) − f⋆)² ≤ R²G²

  which proves f(x(k)) → f⋆

EE364b, Stanford University 14



PWL example with Polyak’s step size, αk = 0.1/√k, and αk = 1/k

[figure: fbest − fmin (log scale) versus iteration k, for the three step size rules]

EE364b, Stanford University 15


Finding a point in the intersection of convex sets

C = C1 ∩ · · · ∩ Cm is nonempty, C1, . . . , Cm ⊆ Rn closed and convex

find a point in C by minimizing

f (x) = max{dist(x, C1), . . . , dist(x, Cm)}

with dist(x, Cj ) = f (x), a subgradient of f is

g = ∇dist(x, Cj) = (x − PCj(x)) / ‖x − PCj(x)‖2

EE364b, Stanford University 16


subgradient update with optimal step size:

x(k+1) = x(k) − αk g(k)
       = x(k) − f(x(k)) (x(k) − PCj(x(k))) / ‖x(k) − PCj(x(k))‖2
       = PCj(x(k))

• a version of the famous alternating projections algorithm

• at each step, project the current point onto the farthest set

• for m = 2 sets, projections alternate onto one set, then the other

• convergence: dist(x(k), C) → 0 as k → ∞
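A sketch of this scheme for m sets given as a list of projection maps; `projections` is a user-supplied list of functions returning PCj(x), and for m = 2 this is classical alternating projections:

```python
import numpy as np

def alternating_projections(x0, projections, iters=100):
    """Subgradient method with Polyak step on f(x) = max_j dist(x, C_j):
    at each step, project onto the farthest set."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        dists = [np.linalg.norm(x - P(x)) for P in projections]
        j = int(np.argmax(dists))     # farthest set
        x = projections[j](x)         # x^(k+1) = P_Cj(x^(k))
    return x
```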

EE364b, Stanford University 17


Alternating projections
first few iterations:

[figure: iterates x1, x2, . . . and projections y1, y2, . . . alternating between the sets C and D, approaching x⋆]

. . . x(k) eventually converges to a point x⋆ ∈ C1 ∩ C2

EE364b, Stanford University 18


Example: Positive semidefinite matrix completion

• some entries of matrix in Sn fixed; find values for others so completed matrix is PSD

• C1 = Sn+, C2 is (affine) set in Sn with specified fixed entries

• projection onto C1 by eigenvalue decomposition, truncation: for X = Σ_{i=1}^n λi qi qiᵀ,

  PC1(X) = Σ_{i=1}^n max{0, λi} qi qiᵀ

• projection of X onto C2 by re-setting specified entries to fixed values
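A sketch of the two projections in NumPy; the mask/values convention for the fixed entries is an illustrative choice:

```python
import numpy as np

def proj_psd(X):
    """Project onto S^n_+: eigendecompose and truncate negative eigenvalues."""
    lam, Q = np.linalg.eigh(X)
    return (Q * np.maximum(lam, 0.0)) @ Q.T

def proj_fixed(X, mask, values):
    """Project onto C2: re-set the specified entries (mask True) to their fixed values."""
    Y = X.copy()
    Y[mask] = values[mask]
    return Y
```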

EE364b, Stanford University 19


specific example: 50 × 50 matrix missing about half of its entries

• initialize X(1) with unknown entries set to 0

EE364b, Stanford University 20


convergence is linear:
[figure: dist (log scale) versus iteration k, decreasing linearly on the log scale]

EE364b, Stanford University 21


Polyak step size when f⋆ isn’t known

• use step size

  αk = (f(x(k)) − fbest(k) + γk) / ‖g(k)‖2²

  with Σ_{k=1}^∞ γk = ∞, Σ_{k=1}^∞ γk² < ∞

• fbest(k) − γk serves as estimate of f⋆

• γk is on the scale of the objective value

• can show fbest(k) → f⋆
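A sketch of this rule; gamma0 = 10 below mirrors the γk = 10/(10 + k) choice used in the example on the next slide:

```python
import numpy as np

def polyak_step_estimated(f_x, f_best, g, k, gamma0=10.0):
    """alpha_k = (f(x^(k)) - f_best^(k) + gamma_k) / ||g^(k)||_2^2,
    with gamma_k = gamma0 / (gamma0 + k): nonsummable, square summable."""
    gamma_k = gamma0 / (gamma0 + k)
    return (f_x - f_best + gamma_k) / np.dot(g, g)
```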

EE364b, Stanford University 22


PWL example with Polyak’s step size, using f⋆, and estimated with γk = 10/(10 + k)

[figure: fbest − fmin (log scale) versus iteration k, for the optimal and estimated step sizes]

EE364b, Stanford University 23


Speeding up subgradient methods

• subgradient methods are very slow

• often convergence can be improved by keeping memory of past steps

x(k+1) = x(k) − αk g(k) + βk (x(k) − x(k−1))

(heavy ball method)

other ideas: localization methods, conjugate directions, . . .

EE364b, Stanford University 24


A couple of speedup algorithms

x(k+1) = x(k) − αk s(k),   αk = (f(x(k)) − f⋆) / ‖s(k)‖2²

(we assume f⋆ is known or can be estimated)

• ‘filtered’ subgradient, s(k) = (1 − β)g(k) + βs(k−1), where β ∈ [0, 1)

• Camerini, Fratta, and Maffioli (1975)

  s(k) = g(k) + βk s(k−1),   βk = max{0, −γk (s(k−1))ᵀ g(k) / ‖s(k−1)‖2²}

  where γk ∈ [0, 2) (γk = 1.5 ‘recommended’)
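A sketch of the two direction updates; s_prev is the previous direction s(k−1), gamma = 1.5 follows the recommendation above, and beta = 0.25 is an illustrative choice in [0, 1):

```python
import numpy as np

def filtered_direction(g, s_prev, beta=0.25):
    """'Filtered' subgradient: s^(k) = (1 - beta) g^(k) + beta s^(k-1)."""
    return (1.0 - beta) * g + beta * s_prev

def cfm_direction(g, s_prev, gamma=1.5):
    """Camerini-Fratta-Maffioli: s^(k) = g^(k) + beta_k s^(k-1),
    beta_k = max{0, -gamma * s_prev^T g / ||s_prev||_2^2}."""
    beta_k = max(0.0, -gamma * np.dot(s_prev, g) / np.dot(s_prev, s_prev))
    return g + beta_k * s_prev
```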

EE364b, Stanford University 25


PWL example, Polyak’s step, filtered subgradient, CFM step

[figure: fbest − fmin (log scale) versus iteration k, for the three methods]

EE364b, Stanford University 26


Optimality of the subgradient method

• optimal choice of αi to achieve fbest(k) − f⋆ ≤ (R² + G² Σ_{i=1}^k αi²) / (2 Σ_{i=1}^k αi) ≤ ε:

  αi = (R/G)/√k,   i = 1, . . . , k

  number of steps required: k = (RG/ε)²

• fbest(k) − f⋆ ≤ RG/√k after k iterations

• this is optimal among first order methods based on subgradients

EE364b, Stanford University 27


Subgradient oracle

• we query a point x

• oracle returns a subgradient g ∈ ∂f (x) and the function value f (x)

• there exists a convex function such that

  fbest(k) − f⋆ ≥ RG/√k

EE364b, Stanford University 28


Worst case function

• Suppose x ∈ Rn and let f(x) = max_{1≤i≤k} xi + (λ/2)‖x‖2²

EE364b, Stanford University 29


Resisting oracle

• f(x) = max_{1≤i≤k} xi + (λ/2)‖x‖2²

• f(x) is minimized at

  x∗i = −1/(λk) for 1 ≤ i ≤ k,   x∗i = 0 for k + 1 ≤ i ≤ n

  with optimal value f(x∗) = −1/(2λk)

• ei + λx is a subgradient (for any index i attaining the max)

• it can be checked that 0 ∈ ∂f (x∗)
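A sketch of this worst-case function and the subgradient the resisting oracle returns; lam and the dimensions are placeholders:

```python
import numpy as np

def worst_case_f(x, k, lam):
    """f(x) = max_{1<=i<=k} x_i + (lam/2) ||x||_2^2."""
    return np.max(x[:k]) + 0.5 * lam * np.dot(x, x)

def worst_case_subgrad(x, k, lam):
    """Oracle returns e_{i*} + lam * x, where i* is the first index
    attaining max_{1<=i<=k} x_i (np.argmax picks the first maximizer)."""
    g = lam * x.copy()
    g[int(np.argmax(x[:k]))] += 1.0
    return g
```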

EE364b, Stanford University 30


• suppose that the subgradient oracle returns the subgradient

  ei∗ + λx ∈ ∂f(x) = ∂( max_{1≤i≤k} xi + (λ/2)‖x‖2² )

  where i∗ is the first index such that xi∗ = max_{1≤i≤k} xi

• we initialize at x0 = 0, f(x0) = 0 and observe that

  x1 = (−α1, 0, 0, . . . , 0)ᵀ,                f(x1) ≥ 0
  x2 = (−(α1 + λα2), −α2, 0, . . . , 0)ᵀ,      f(x2) ≥ 0
  . . .
  xk−1 = (−∗, −∗, . . . , −∗, 0, . . . , 0)ᵀ (first k − 1 coordinates nonzero),   f(xk−1) ≥ 0

EE364b, Stanford University 31


Lower bound
• we can set λ to control R = ‖x0 − x∗‖2 and G = ‖∂f(x)‖2 and obtain

  fbest(k) − f⋆ ≥ RG / (2(1 + √k))

• the lower bound matches the earlier upper bound

  fbest(k) − f⋆ ≤ RG/√k

  up to constants

• subgradient method is optimal among first-order methods

• localization methods can achieve better complexity

EE364b, Stanford University 32
