
UW-Madison CS/ISyE/Math/Stat 726 Spring 2024

Lecture 11: Acceleration via Regularization and Restarting; Lower Bounds

Yudong Chen

Last week we discussed two variants of Nesterov’s accelerated gradient descent (AGD).

Algorithm 1 Nesterov's AGD, smooth and strongly convex

input: initial $x_0$, strong convexity and smoothness parameters $m, L$, number of iterations $K$
initialize: $x_{-1} = x_0$, $\beta = \frac{\sqrt{L/m} - 1}{\sqrt{L/m} + 1}$.
for $k = 0, 1, \ldots, K$:
    $y_k = x_k + \beta (x_k - x_{k-1})$
    $x_{k+1} = y_k - \frac{1}{L} \nabla f(y_k)$
return $x_K$
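
For concreteness, here is a minimal NumPy sketch of Algorithm 1; the function name agd_strongly_convex and the gradient-oracle argument grad_f are our own naming choices, not from the lecture.

import numpy as np

def agd_strongly_convex(grad_f, x0, m, L, K):
    # Nesterov's AGD for m-strongly convex, L-smooth f (Algorithm 1).
    beta = (np.sqrt(L / m) - 1) / (np.sqrt(L / m) + 1)  # constant momentum weight
    x_prev, x = x0.copy(), x0.copy()                    # x_{-1} = x_0
    for _ in range(K):
        y = x + beta * (x - x_prev)       # extrapolation (momentum) step
        x_prev, x = x, y - grad_f(y) / L  # gradient step from y gives x_{k+1}
    return x                              # approximately x_K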

Theorem 1. For Nesterov's AGD Algorithm 1 applied to $m$-strongly convex $L$-smooth $f$, we have
$$f(x_k) - f^* \le \left(1 - \sqrt{\frac{m}{L}}\right)^k \cdot \frac{(L+m) \|x_0 - x^*\|_2^2}{2}.$$
Equivalently, we have $f(x_k) - f^* \le \epsilon$ after at most $k = O\left(\sqrt{\frac{L}{m}} \log \frac{L \|x_0 - x^*\|_2^2}{\epsilon}\right)$ iterations.
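To see the equivalence, use the elementary inequality $1 - t \le e^{-t}$:
$$\left(1 - \sqrt{\frac{m}{L}}\right)^k \cdot \frac{(L+m)\|x_0 - x^*\|_2^2}{2} \le \exp\left(-k\sqrt{\frac{m}{L}}\right) \cdot \frac{(L+m)\|x_0 - x^*\|_2^2}{2} \le \epsilon$$
whenever $k \ge \sqrt{\frac{L}{m}} \log \frac{(L+m)\|x_0 - x^*\|_2^2}{2\epsilon}$, and $\frac{L+m}{2} \le L$.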

Algorithm 2 Nesterov's AGD, smooth convex

input: initial $x_0$, smoothness parameter $L$, number of iterations $K$
initialize: $x_{-1} = x_0$, $\lambda_0 = 0$, $\beta_0 = 0$.
for $k = 0, 1, \ldots, K$:
    $y_k = x_k + \beta_k (x_k - x_{k-1})$
    $x_{k+1} = y_k - \frac{1}{L} \nabla f(y_k)$
    $\lambda_{k+1} = \frac{1 + \sqrt{1 + 4\lambda_k^2}}{2}$, $\beta_{k+1} = \frac{\lambda_k - 1}{\lambda_{k+1}}$
return $x_K$
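
Analogously, a sketch of Algorithm 2; this is a direct transcription of the recursion as stated above, and again the names are ours:

import numpy as np

def agd_convex(grad_f, x0, L, K):
    # Nesterov's AGD for L-smooth convex f (Algorithm 2).
    x_prev, x = x0.copy(), x0.copy()  # x_{-1} = x_0
    lam, beta = 0.0, 0.0              # lambda_0 = 0, beta_0 = 0
    for _ in range(K):
        y = x + beta * (x - x_prev)       # momentum with time-varying weight beta_k
        x_prev, x = x, y - grad_f(y) / L  # gradient step from y
        lam_next = (1 + np.sqrt(1 + 4 * lam**2)) / 2
        beta, lam = (lam - 1) / lam_next, lam_next
    return x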

Theorem 2. For Nesterov's AGD Algorithm 2 applied to $L$-smooth convex $f$, we have
$$f(x_k) - f(x^*) \le \frac{2L \|x_0 - x^*\|_2^2}{k^2}.$$

In this lecture, we will show that the two types of acceleration above are closely related: we
can use one to derive the other. We then show that in a certain precise (but narrow) sense, the
convergence rates of AGD are optimal among first-order methods. For this reason, AGD is also
known as Nesterov’s optimal method.


1 Acceleration via regularization


Suppose we only know the AGD method for strongly convex functions (Algorithm 1) and its $\left(1 - \sqrt{\frac{m}{L}}\right)^k$ guarantee (Theorem 1). Can we use it as a subroutine to develop an accelerated algorithm for (non-strongly) convex functions with a $\frac{1}{k^2}$ convergence rate?
The answer is yes (up to logarithmic factors). One approach is to add a regularizer $\epsilon \|x\|_2^2$ to $f(x)$ and apply Algorithm 1 to the function $f(x) + \epsilon \|x\|_2^2$, which is strongly convex. See HW 3.
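
In code, this reduction is a one-liner on top of the sketch given after Algorithm 1; the precise choice of $\epsilon$ and the resulting guarantee are the subject of HW 3, so the parameters below are illustrative assumptions only:

def agd_via_regularization(grad_f, x0, L, K, eps):
    # f(x) + eps * ||x||^2 is (2*eps)-strongly convex and (L + 2*eps)-smooth,
    # so Algorithm 1 applies to the regularized objective.
    grad_reg = lambda x: grad_f(x) + 2 * eps * x
    return agd_strongly_convex(grad_reg, x0, m=2 * eps, L=L + 2 * eps, K=K)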

2 Acceleration via restarting


In the opposite direction, suppose we only know the AGD method for (non-strongly) convex functions (Algorithm 2) and its $\frac{1}{k^2}$ guarantee (Theorem 2). Can we use it as a subroutine to develop an accelerated algorithm for strongly convex functions with a $\left(1 - \sqrt{\frac{m}{L}}\right)^k$ convergence rate (equivalently, a $\sqrt{\frac{L}{m}} \log \frac{1}{\epsilon}$ iteration complexity)?
This is possible using a classical and powerful idea in optimization: restarting. See Algorithm 3. In each round, we run Algorithm 2 for $\sqrt{8L/m}$ iterations to obtain $x_{t+1}$. In the next round, we restart Algorithm 2 using $x_{t+1}$ as the initial solution and run for another $\sqrt{8L/m}$ iterations. This is repeated for $T$ rounds.

Algorithm 3 Restarting AGD

input: initial $x_0$, strong convexity and smoothness parameters $m, L$, number of rounds $T$
for $t = 0, 1, \ldots, T$:
    Run Algorithm 2 with $x_t$ (initial solution), $L$ (smoothness parameter), $\sqrt{8L/m}$ (number of iterations) as the input. Let $x_{t+1}$ be the output.
return $x_T$
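
A sketch of Algorithm 3, reusing agd_convex from above (taking the ceiling of the iteration count is our choice, since $\sqrt{8L/m}$ need not be an integer):

import math

def restart_agd(grad_f, x0, m, L, T):
    # Restarted AGD: rerun Algorithm 2, warm-started at the previous round's output.
    K = math.ceil(math.sqrt(8 * L / m))  # iterations per round
    x = x0
    for _ in range(T):
        x = agd_convex(grad_f, x, L, K)  # momentum state is re-initialized each call
    return x                             # approximately x_T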

Exercise 1. How is Algorithm 3 different from running Algorithm 2 without restarting for $T \times \sqrt{8L/m}$ iterations?

2.1 Analysis
Suppose $f$ is $m$-strongly convex and $L$-smooth. By Theorem 2, we know that
$$f(x_{t+1}) - f(x^*) \le \frac{2L \|x_t - x^*\|_2^2}{8L/m} = \frac{m \|x_t - x^*\|_2^2}{4}.$$
By strong convexity, we have
$$f(x_t) \ge f(x^*) + \underbrace{\langle \nabla f(x^*), x_t - x^* \rangle}_{=0} + \frac{m}{2} \|x_t - x^*\|_2^2,$$
hence $\|x_t - x^*\|_2^2 \le \frac{2}{m} (f(x_t) - f(x^*))$. Combining the two displays, we get
$$f(x_{t+1}) - f(x^*) \le \frac{m}{4} \cdot \frac{2}{m} (f(x_t) - f(x^*)) = \frac{f(x_t) - f(x^*)}{2}.$$


That is, each round of Algorithm 3 halves the optimality gap. It follows that
$$f(x_T) - f(x^*) \le \left(\frac{1}{2}\right)^T (f(x_0) - f(x^*)).$$
Therefore, $f(x_T) - f(x^*) \le \epsilon$ can be achieved after at most
$$T = O\left(\log \frac{f(x_0) - f(x^*)}{\epsilon}\right) \text{ rounds},$$
which corresponds to a total of
$$T \times \sqrt{\frac{8L}{m}} = O\left(\sqrt{\frac{L}{m}} \log \frac{f(x_0) - f(x^*)}{\epsilon}\right) \text{ AGD iterations}.$$
This iteration complexity is the same as Theorem 1 up to a logarithmic factor.
Remark 1. Note how strong convexity is needed in the above argument.
Remark 2. Optional reading: This overview article discusses restarting as a general/meta algorithmic technique.

3 Lower bounds
In this section, we consider a class of first-order iterative algorithms that satisfy $x_0 = 0$ and
$$x_{k+1} \in \mathrm{Lin}\{\nabla f(x_0), \nabla f(x_1), \ldots, \nabla f(x_k)\}, \quad \forall k \ge 0, \qquad (1)$$
where the RHS denotes the linear subspace spanned by $\nabla f(x_0), \nabla f(x_1), \ldots, \nabla f(x_k)$; in other words, $x_{k+1}$ is an (arbitrary) linear combination of the gradients at the previous $(k+1)$ iterates.

3.1 Smooth and convex f

Theorem 3. There exists an $L$-smooth convex function $f$ such that any first-order method in the sense of (1) must satisfy
$$f(x_k) - f(x^*) \ge \frac{3L \|x_0 - x^*\|_2^2}{32(k+1)^2}.$$
Comparing with this lower bound, we see that the $\frac{L}{k^2}$ rate for AGD in Theorem 2 is optimal/unimprovable (up to constants).
Proof of Theorem 3. Let $A \in \mathbb{R}^{d \times d}$ be the matrix given by
$$A_{ij} = \begin{cases} 2, & i = j, \\ -1, & j \in \{i-1, i+1\}, \\ 0, & \text{otherwise.} \end{cases} \qquad (2)$$
Explicitly,
$$A = \begin{pmatrix} 2 & -1 & 0 & 0 & \cdots & \cdots & 0 \\ -1 & 2 & -1 & 0 & \cdots & \cdots & 0 \\ 0 & -1 & 2 & -1 & 0 & \cdots & 0 \\ & & \ddots & \ddots & \ddots & & \\ 0 & \cdots & & -1 & 2 & -1 \\ 0 & \cdots & & & -1 & 2 \end{pmatrix}.$$


Let $e_i \in \mathbb{R}^d$ denote the $i$-th standard basis vector. Consider the quadratic function
$$f(x) = \frac{L}{8} x^\top A x - \frac{L}{4} x^\top e_1,$$
which is convex and $L$-smooth since $0 \preceq A \preceq 4I$. Note that $\nabla f(x) = \frac{L}{4}(Ax - e_1)$. By induction, we can show that for $k \ge 1$,
$$x_k \in \mathrm{Lin}\{e_1, Ax_1, \ldots, Ax_{k-1}\} \subseteq \mathrm{Lin}\{e_1, \ldots, e_k\}.$$
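
The spectral bound $0 \preceq A \preceq 4I$ is easy to sanity-check numerically; a small sketch (the dimension $d = 50$ is an arbitrary test value):

import numpy as np

d = 50
# Tridiagonal matrix A from (2): 2 on the diagonal, -1 on the first off-diagonals.
A = 2 * np.eye(d) - np.eye(d, k=1) - np.eye(d, k=-1)

eigvals = np.linalg.eigvalsh(A)
assert 0 <= eigvals.min() and eigvals.max() <= 4  # consistent with 0 <= A <= 4I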

Therefore, if we let $A_k \in \mathbb{R}^{d \times d}$ denote the matrix obtained by zeroing out the entries of $A$ outside the top-left $k \times k$ block, then
$$f(x_k) = \frac{L}{8} x_k^\top A_k x_k - \frac{L}{4} x_k^\top e_1 \ge f_k^* := \min_x \left(\frac{L}{8} x^\top A_k x - \frac{L}{4} x^\top e_1\right).$$

By setting the gradient to zero, we find that the minimum above is attained by
$$x_k^* := \left(1 - \frac{1}{k+1}, 1 - \frac{2}{k+1}, \ldots, 1 - \frac{k}{k+1}, 0, \ldots, 0\right)^\top \in \mathbb{R}^d,$$
with $f_k^* = -\frac{L}{8}\left(1 - \frac{1}{k+1}\right)$. It follows that the global minimizer $x^* = x_d^*$ of $f$ satisfies $f(x^*) = f_d^* = -\frac{L}{8}\left(1 - \frac{1}{d+1}\right)$ and (since $x_0 = 0$)
$$\|x_d^* - x_0\|_2^2 = \|x_d^*\|_2^2 = \sum_{i=1}^d \left(1 - \frac{i}{d+1}\right)^2 = \frac{1}{(d+1)^2} \sum_{j=1}^d j^2 = \frac{d(2d+1)}{6(d+1)} \le \frac{d+1}{3}.$$
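
The closed forms for $x_k^*$ and $f_k^*$ can likewise be verified numerically; a sketch continuing the snippet above (the test values of $d$, $k$, $L$ are arbitrary):

import numpy as np

d, k, L = 50, 10, 1.0
A = 2 * np.eye(d) - np.eye(d, k=1) - np.eye(d, k=-1)
Ak = A.copy()
Ak[k:, :] = 0
Ak[:, k:] = 0  # zero out A outside the top-left k x k block
e1 = np.zeros(d)
e1[0] = 1.0

x_star_k = np.zeros(d)
x_star_k[:k] = 1 - np.arange(1, k + 1) / (k + 1)

assert np.allclose(Ak @ x_star_k, e1)  # stationarity: gradient (L/4)(A_k x - e1) vanishes
f_k_star = L / 8 * x_star_k @ Ak @ x_star_k - L / 4 * x_star_k @ e1
assert np.isclose(f_k_star, -L / 8 * (1 - 1 / (k + 1)))  # matches the claimed value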

Combining pieces and taking $d = 2k+1$, we have
$$f(x_k) - f(x^*) \ge f_k^* - f_d^* = \frac{L}{8}\left(\frac{1}{k+1} - \frac{1}{2k+2}\right) = \frac{L}{16} \cdot \frac{k+1}{(k+1)^2} = \frac{L}{32} \cdot \frac{d+1}{(k+1)^2} \ge \frac{3L \|x^* - x_0\|_2^2}{32(k+1)^2}.$$

3.2 Smooth and strongly convex f

For strongly convex functions, we have the following lower bound, which shows that the $\left(1 - \frac{1}{\sqrt{L/m}}\right)^k$ rate of AGD in Theorem 1 cannot be significantly improved.

Theorem 4. There exists an $m$-strongly convex and $L$-smooth function such that any first-order method in the sense of (1) must satisfy
$$f(x_k) - f(x^*) \ge \frac{m}{2}\left(1 - \frac{4}{\sqrt{L/m}}\right)^{k+1} \|x_0 - x^*\|_2^2.$$


Proof. Let $A \in \mathbb{R}^{d \times d}$ be defined in (2) above and consider the function
$$f(x) = \frac{L-m}{8}\left(x^\top A x - 2 x^\top e_1\right) + \frac{m}{2} \|x\|_2^2,$$
which is $L$-smooth and $m$-strongly convex (its Hessian is $\frac{L-m}{4} A + mI$, whose eigenvalues lie in $[m, L]$ since $0 \preceq A \preceq 4I$). Strong convexity implies that
$$f(x_k) - f(x^*) \ge \frac{m}{2} \|x_k - x^*\|_2^2. \qquad (3)$$
A similar argument as above shows that $x_k \in \mathrm{Lin}\{e_1, \ldots, e_k\}$, hence
$$\|x_k - x^*\|_2^2 \ge \sum_{i=k+1}^d x^*(i)^2, \qquad (4)$$
where $x^*(i)$ denotes the $i$th entry of $x^*$. For simplicity we take $d \to \infty$ (we omit the formal limiting argument).¹ The minimizer $x^*$ can be computed by setting the gradient of $f$ to zero, which gives an infinite set of equations:

$$1 - 2\,\frac{L/m + 1}{L/m - 1}\, x^*(1) + x^*(2) = 0,$$
$$x^*(k-1) - 2\,\frac{L/m + 1}{L/m - 1}\, x^*(k) + x^*(k+1) = 0, \quad k = 2, 3, \ldots$$
Solving these equations gives
$$x^*(i) = \left(\frac{\sqrt{L/m} - 1}{\sqrt{L/m} + 1}\right)^i, \quad i = 1, 2, \ldots \qquad (5)$$
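
A quick numerical spot-check that (5) indeed satisfies this recurrence (the condition number $\kappa = L/m = 10$ is an arbitrary test value):

import numpy as np

kappa = 10.0  # kappa = L/m, arbitrary test value
q = (np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)
x = lambda i: q**i  # candidate solution x*(i) = q^i from (5)

assert np.isclose(1 - 2 * (kappa + 1) / (kappa - 1) * x(1) + x(2), 0)
for k in range(2, 6):
    assert np.isclose(x(k - 1) - 2 * (kappa + 1) / (kappa - 1) * x(k) + x(k + 1), 0)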

Combining pieces, we obtain
$$\begin{aligned}
f(x_k) - f(x^*) &\ge \frac{m}{2} \sum_{i=k+1}^{\infty} x^*(i)^2 && \text{by (3) and (4)} \\
&\ge \frac{m}{2} \left(\frac{\sqrt{L/m} - 1}{\sqrt{L/m} + 1}\right)^{2(k+1)} \|x_0 - x^*\|_2^2 && \text{by (5) and } x_0 = 0 \\
&= \frac{m}{2} \left(1 - \frac{4}{\sqrt{L/m} + 1} + \frac{4}{(\sqrt{L/m} + 1)^2}\right)^{k+1} \|x_0 - x^*\|_2^2 \\
&\ge \frac{m}{2} \left(1 - \frac{4}{\sqrt{L/m}}\right)^{k+1} \|x_0 - x^*\|_2^2.
\end{aligned}$$

Remark 3. The lower bounds in Theorems 3 and 4 are in the worst-case/minimax sense: one cannot find a first-order method that achieves a better convergence rate on all smooth convex functions than AGD. This, however, does not prevent better rates from being achieved for a subclass of such functions. It is also possible to achieve better rates by using higher-order information (e.g., the Hessian).
¹ The convergence rates for AGD in Theorems 1 and 2 do not explicitly depend on the dimension $d$, hence these results can be generalized to infinite dimensions.
