Lecture 11: AGD, Restart, and Lower Bounds
Last week we discussed two variants of Nesterov’s accelerated gradient descent (AGD).
Theorem 1. For Nesterov's AGD Algorithm 1 applied to m-strongly convex, L-smooth f, we have
$$ f(x_k) - f^* \le \left(1 - \sqrt{\frac{m}{L}}\right)^{k} \cdot \frac{(L+m)\,\|x_0 - x^*\|_2^2}{2}. $$
Equivalently, we have $f(x_k) - f^* \le \epsilon$ after at most $k = O\!\left(\sqrt{\frac{L}{m}}\,\log\frac{L\|x_0 - x^*\|_2^2}{\epsilon}\right)$ iterations.
[Algorithm 2: Nesterov's AGD for L-smooth convex f (pseudocode not shown here); it returns $x_K$.]

Theorem 2. For Nesterov's AGD Algorithm 2 applied to convex, L-smooth f, we have
$$ f(x_k) - f(x^*) \le \frac{2L\,\|x_0 - x^*\|_2^2}{k^2}. $$
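Since the pseudocode for Algorithm 2 is not reproduced above, the following is a minimal Python sketch of one standard (FISTA-style) variant of Nesterov's AGD for L-smooth convex functions; the momentum schedule below is a common choice and may differ from the exact updates used in lecture, and the names `agd`, `grad`, and `n_iters` are illustrative.

```python
import numpy as np

def agd(grad, x0, L, n_iters):
    """One standard (FISTA-style) variant of Nesterov's AGD for an L-smooth convex f.

    grad    : callable returning the gradient of f at a point
    x0      : starting point (numpy array)
    L       : smoothness constant of f
    n_iters : number of iterations
    """
    x = y = np.asarray(x0, dtype=float)
    t = 1.0
    for _ in range(n_iters):
        x_next = y - grad(y) / L                          # gradient step at the extrapolated point
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t ** 2)) / 2.0
        y = x_next + ((t - 1.0) / t_next) * (x_next - x)  # momentum / extrapolation step
        x, t = x_next, t_next
    return x
```

This variant enjoys the same $O(1/k^2)$ guarantee as in Theorem 2.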
In this lecture, we will show that the two types of acceleration above are closely related: we
can use one to derive the other. We then show that in a certain precise (but narrow) sense, the
convergence rates of AGD are optimal among first-order methods. For this reason, AGD is also
known as Nesterov’s optimal method.
2 Restarting AGD

[Algorithm 3: restarted AGD. Run Algorithm 2 for $\lceil\sqrt{8L/m}\rceil$ iterations, restart it from the point it returns, and repeat for T rounds; return $x_T$.]
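Below is a minimal sketch of the restart scheme in the spirit of Algorithm 3, reusing the `agd` routine sketched above; the round length $\lceil\sqrt{8L/m}\rceil$ matches the analysis in Section 2.1, but the exact stopping rule used in lecture may differ.

```python
import math

def restarted_agd(grad, x0, L, m, n_rounds):
    """Restarted AGD in the spirit of Algorithm 3: run AGD for ceil(sqrt(8L/m))
    iterations, restart from the returned point, and repeat for n_rounds rounds.
    Relies on the agd() routine sketched above.
    """
    round_len = math.ceil(math.sqrt(8.0 * L / m))
    x = x0
    for _ in range(n_rounds):
        x = agd(grad, x, L, round_len)   # each round halves the optimality gap (Section 2.1)
    return x
```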
Exercise 1. How is Algorithm 3 different from running Algorithm 2 without restarting for $T \times \sqrt{8L/m}$ iterations?
2.1 Analysis
Suppose f is m-strongly convex and L-smooth. By Theorem 2, we know that
$$ f(x_{t+1}) - f(x^*) \le \frac{2L\,\|x_t - x^*\|_2^2}{8L/m} = \frac{m\,\|x_t - x^*\|_2^2}{4}. $$
By strong convexity, we have
$$ f(x_t) \ge f(x^*) + \underbrace{\langle \nabla f(x^*),\, x_t - x^* \rangle}_{=0} + \frac{m}{2}\,\|x_t - x^*\|_2^2, $$
hence $\|x_t - x^*\|_2^2 \le \frac{2}{m}\bigl(f(x_t) - f(x^*)\bigr)$. Combining, we get
$$ f(x_{t+1}) - f(x^*) \le \frac{f(x_t) - f(x^*)}{2}. $$
That is, each round of Algorithm 3 halves the optimality gap. It follows that
$$ f(x_T) - f(x^*) \le \left(\frac{1}{2}\right)^{T} \bigl(f(x_0) - f(x^*)\bigr). $$
Therefore, $f(x_T) - f(x^*) \le \epsilon$ can be achieved after at most
$$ T = O\!\left(\log \frac{f(x_0) - f(x^*)}{\epsilon}\right) \text{ rounds,} $$
which corresponds to a total of
$$ T \times \sqrt{\frac{8L}{m}} = O\!\left(\sqrt{\frac{L}{m}}\,\log\frac{f(x_0) - f(x^*)}{\epsilon}\right) \text{ AGD iterations.} $$
This iteration complexity is the same as Theorem 1 up to a logarithmic factor.
Remark 1. Note how strong convexity is needed in the above argument.
Remark 2. Optional reading: This overview article discusses restarting as a general/meta algorithmic technique.
3 Lower bounds
In this section, we consider a class of first-order iterative algorithms that satisfy $x_0 = 0$ and
$$ x_{k+1} \in \mathrm{Lin}\{\nabla f(x_0), \nabla f(x_1), \ldots, \nabla f(x_k)\}, \quad \forall\, k \ge 0, \tag{1} $$
where the RHS denotes the linear subspace spanned by $\nabla f(x_0), \nabla f(x_1), \ldots, \nabla f(x_k)$; in other words, $x_{k+1}$ is an (arbitrary) linear combination of the gradients at the first $k+1$ iterates $x_0, \ldots, x_k$.
The lower-bound constructions below use the tridiagonal matrix $A \in \mathbb{R}^{d \times d}$, given explicitly by
$$ A = \begin{pmatrix}
2 & -1 & 0 & 0 & \cdots & \cdots & 0 \\
-1 & 2 & -1 & 0 & \cdots & \cdots & 0 \\
0 & -1 & 2 & -1 & 0 & \cdots & 0 \\
& & \ddots & \ddots & \ddots & & \\
0 & \cdots & & -1 & 2 & -1 \\
0 & \cdots & & & -1 & 2
\end{pmatrix}. \tag{2} $$
Let $e_i \in \mathbb{R}^d$ denote the $i$-th standard basis vector. Consider the quadratic function
$$ f(x) = \frac{L}{8}\, x^\top A x - \frac{L}{4}\, x^\top e_1, $$
which is convex and L-smooth since $0 \preceq A \preceq 4I$. Note that $\nabla f(x) = \frac{L}{4}(Ax - e_1)$. By induction, we can show that for $k \ge 1$, $x_k \in \mathrm{Lin}\{e_1, \ldots, e_k\}$.
Therefore, if we let Ak ∈ Rd×d denote the matrix obtained by zeroing out the entries of A outside
the top-left k × k block, then
$$ f(x_k) = \frac{L}{8}\, x_k^\top A_k x_k - \frac{L}{4}\, x_k^\top e_1 \;\ge\; f_k^* := \min_x\; \frac{L}{8}\, x^\top A_k x - \frac{L}{4}\, x^\top e_1. $$
In particular, the minimizer $x_d^*$ of $f$ over $\mathbb{R}^d$ satisfies $A x_d^* = e_1$, so its entries are $x_d^*(i) = 1 - \frac{i}{d+1}$ and
$$ \|x_d^* - x_0\|_2^2 = \|x_d^*\|_2^2 = \sum_{i=1}^{d}\left(1 - \frac{i}{d+1}\right)^2 \le \frac{d+1}{3}. $$
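This construction can be checked numerically. The sketch below (with illustrative names and a small $d$) builds $A$ from (2), verifies $0 \preceq A \preceq 4I$, confirms that plain gradient descent started from $x_0 = 0$ (one member of the class (1)) keeps $x_k$ supported on the first $k$ coordinates, and checks the bound on $\|x_d^*\|_2^2$.

```python
import numpy as np

d, L = 50, 10.0
# Tridiagonal matrix A from (2): 2 on the diagonal, -1 on the first off-diagonals.
A = 2 * np.eye(d) - np.eye(d, k=1) - np.eye(d, k=-1)
e1 = np.eye(d)[:, 0]

# 0 <= A <= 4I, so f below is convex and L-smooth.
eigs = np.linalg.eigvalsh(A)
assert eigs.min() >= -1e-12 and eigs.max() <= 4 + 1e-12

# f(x) = (L/8) x^T A x - (L/4) x^T e1, with gradient (L/4)(A x - e1).
grad = lambda x: (L / 4) * (A @ x - e1)

# Plain gradient descent from x0 = 0 is one member of the class (1):
# each iterate x_k should be supported on the first k coordinates.
x = np.zeros(d)
for k in range(1, 11):
    x = x - grad(x) / L
    assert np.allclose(x[k:], 0.0)

# Minimizer over R^d: A x* = e1 has entries x*(i) = 1 - i/(d+1),
# so ||x* - x0||^2 = ||x*||^2 <= (d+1)/3.
x_star = np.linalg.solve(A, e1)
assert np.allclose(x_star, 1 - np.arange(1, d + 1) / (d + 1))
assert x_star @ x_star <= (d + 1) / 3
```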
Theorem 4. There exists an m-strongly convex and L-smooth function such that any first-order method in
the sense of (1) must satisfy
$$ f(x_k) - f(x^*) \ge \frac{m}{2}\left(1 - \frac{4}{\sqrt{L/m}}\right)^{k+1} \|x_0 - x^*\|_2^2. $$
Proof. Let A ∈ Rd×d be defined in (2) above and consider the function
$$ f(x) = \frac{L-m}{8}\left(x^\top A x - 2\, x^\top e_1\right) + \frac{m}{2}\,\|x\|_2^2, $$
which is L-smooth and m-strongly convex. Strong convexity implies that
$$ f(x_k) - f(x^*) \ge \frac{m}{2}\,\|x_k - x^*\|_2^2. \tag{3} $$
A similar argument as above shows that xk ∈ Lin {e1 , . . . , ek } , hence
$$ \|x_k - x^*\|_2^2 \ge \sum_{i=k+1}^{d} x^*(i)^2, \tag{4} $$
where x ∗ (i ) denotes the ith entry of x ∗ . For simplicity we take d → ∞ (we omit the formal limiting
argument).1 The minimizer x ∗ can be computed by setting the gradient of f to zero, which gives
an infinite set of equations
$$ 1 - 2\,\frac{L/m+1}{L/m-1}\, x^*(1) + x^*(2) = 0, $$
$$ x^*(k-1) - 2\,\frac{L/m+1}{L/m-1}\, x^*(k) + x^*(k+1) = 0, \quad k = 2, 3, \ldots $$
Solving these equations gives
$$ x^*(i) = \left(\frac{\sqrt{L/m}-1}{\sqrt{L/m}+1}\right)^{i}, \quad i = 1, 2, \ldots \tag{5} $$
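As a sanity check, one can truncate the infinite system to a large finite $d$, solve the stationarity condition numerically, and compare with the closed form (5); the sketch below uses illustrative parameter values, and the agreement is only approximate near the truncation boundary.

```python
import numpy as np

d, L, m = 400, 100.0, 1.0      # truncation dimension and illustrative constants
kappa = L / m

# Tridiagonal A from (2) and the strongly convex hard function
#   f(x) = (L - m)/8 * (x^T A x - 2 x^T e1) + (m/2) ||x||^2.
A = 2 * np.eye(d) - np.eye(d, k=1) - np.eye(d, k=-1)
e1 = np.eye(d)[:, 0]

# Stationarity condition: (L - m)/4 * (A x - e1) + m x = 0.
H = (L - m) / 4 * A + m * np.eye(d)
x_star = np.linalg.solve(H, (L - m) / 4 * e1)

# Closed form (5): x*(i) = ((sqrt(kappa) - 1)/(sqrt(kappa) + 1))^i.
q = (np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)
closed_form = q ** np.arange(1, d + 1)

# Entries far from the truncation boundary agree to high accuracy.
assert np.allclose(x_star[:50], closed_form[:50], atol=1e-8)
```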
$$
\begin{aligned}
f(x_k) - f(x^*) &\ge \frac{m}{2} \sum_{i=k+1}^{\infty} x^*(i)^2 && \text{by (3) and (4)} \\
&\ge \frac{m}{2} \left(\frac{\sqrt{L/m}-1}{\sqrt{L/m}+1}\right)^{2(k+1)} \|x_0 - x^*\|_2^2 && \text{by (5) and } x_0 = 0 \\
&= \frac{m}{2} \left(1 - \frac{4}{\sqrt{L/m}+1} + \frac{4}{(\sqrt{L/m}+1)^2}\right)^{k+1} \|x_0 - x^*\|_2^2 \\
&\ge \frac{m}{2} \left(1 - \frac{4}{\sqrt{L/m}}\right)^{k+1} \|x_0 - x^*\|_2^2.
\end{aligned}
$$
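To illustrate Theorem 4 concretely, the sketch below runs plain gradient descent (which belongs to the class (1) when started from $x_0 = 0$) on a finite-dimensional truncation of the hard function and checks that the optimality gap stays above the stated bound for small $k$; the dimension, step size, and constants are illustrative, and the finite $d$ only approximates the $d \to \infty$ argument.

```python
import numpy as np

d, L, m = 1000, 400.0, 1.0     # need L/m > 16 for the bound to be nontrivial
A = 2 * np.eye(d) - np.eye(d, k=1) - np.eye(d, k=-1)
e1 = np.eye(d)[:, 0]

# f(x) = (L - m)/8 (x^T A x - 2 x^T e1) + (m/2)||x||^2 = (1/2) x^T H x - b^T x.
H = (L - m) / 4 * A + m * np.eye(d)
b = (L - m) / 4 * e1
f = lambda x: 0.5 * x @ (H @ x) - b @ x
grad = lambda x: H @ x - b

x_star = np.linalg.solve(H, b)
f_star = f(x_star)

# Plain gradient descent from x0 = 0 belongs to the class (1).
x = np.zeros(d)
for k in range(30):
    gap = f(x) - f_star                     # optimality gap at the k-th iterate
    bound = (m / 2) * (1 - 4 / np.sqrt(L / m)) ** (k + 1) * (x_star @ x_star)
    assert gap >= bound                     # lower bound of Theorem 4
    x = x - grad(x) / L                     # one gradient step
```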
Remark 3. The lower bounds in Theorems 3 and 4 are in the worst-case/minimax sense: one cannot find a first-order method that achieves a better convergence rate than AGD on all smooth convex functions. This, however, does not prevent better rates from being achieved for a subclass of such functions. It is also possible to achieve better rates by using higher-order information (e.g., the Hessian).
1 The convergence rates for AGD in Theorems 1 and 2 do not explicitly depend on the dimension d, hence these
results can be generalized to infinite dimensions.