Lecture 11: AGD, Restart, and Lower Bounds
Last week we discussed two variants of Nesterov’s accelerated gradient descent (AGD).
Theorem 1. For Nesterov's AGD Algorithm 1 applied to m-strongly convex, L-smooth f, we have
$$ f(x_k) - f^* \le \left(1 - \sqrt{\frac{m}{L}}\right)^{k} \cdot \frac{(L+m)\,\|x_0 - x^*\|_2^2}{2}. $$
Equivalently, we have $f(x_k) - f^* \le \epsilon$ after at most $k = O\!\left(\sqrt{\frac{L}{m}}\,\log\frac{L\|x_0 - x^*\|_2^2}{\epsilon}\right)$ iterations.
[Algorithm 2: Nesterov's AGD for L-smooth convex f (pseudocode not shown here); it returns $x_K$.]

Theorem 2. For Nesterov's AGD Algorithm 2 applied to convex, L-smooth f, we have
$$ f(x_k) - f(x^*) \le \frac{2L\,\|x_0 - x^*\|_2^2}{k^2}. $$
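Since the pseudocode for Algorithm 2 is not reproduced above, the following is a minimal Python sketch of one standard (FISTA-style) variant of Nesterov's AGD for L-smooth convex functions; the momentum schedule below is a common choice and may differ from the exact updates used in lecture, and the names `agd`, `grad`, and `n_iters` are illustrative.

```python
import numpy as np

def agd(grad, x0, L, n_iters):
    """One standard (FISTA-style) variant of Nesterov's AGD for an L-smooth convex f.

    grad    : callable returning the gradient of f at a point
    x0      : starting point (numpy array)
    L       : smoothness constant of f
    n_iters : number of iterations
    """
    x = y = np.asarray(x0, dtype=float)
    t = 1.0
    for _ in range(n_iters):
        x_next = y - grad(y) / L                          # gradient step at the extrapolated point
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t ** 2)) / 2.0
        y = x_next + ((t - 1.0) / t_next) * (x_next - x)  # momentum / extrapolation step
        x, t = x_next, t_next
    return x
```

This variant enjoys the same $O(1/k^2)$ guarantee as in Theorem 2.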
In this lecture, we will show that the two types of acceleration above are closely related: we
can use one to derive the other. We then show that in a certain precise (but narrow) sense, the
convergence rates of AGD are optimal among first-order methods. For this reason, AGD is also
known as Nesterov’s optimal method.
2 Restarting AGD

[Algorithm 3: restarted AGD. Run Algorithm 2 for $\lceil\sqrt{8L/m}\rceil$ iterations, restart it from the point it returns, and repeat for T rounds; return $x_T$.]
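Below is a minimal sketch of the restart scheme in the spirit of Algorithm 3, reusing the `agd` routine sketched above; the round length $\lceil\sqrt{8L/m}\rceil$ matches the analysis in Section 2.1, but the exact stopping rule used in lecture may differ.

```python
import math

def restarted_agd(grad, x0, L, m, n_rounds):
    """Restarted AGD in the spirit of Algorithm 3: run AGD for ceil(sqrt(8L/m))
    iterations, restart from the returned point, and repeat for n_rounds rounds.
    Relies on the agd() routine sketched above.
    """
    round_len = math.ceil(math.sqrt(8.0 * L / m))
    x = x0
    for _ in range(n_rounds):
        x = agd(grad, x, L, round_len)   # each round halves the optimality gap (Section 2.1)
    return x
```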
Exercise 1. How is Algorithm 3 different from running Algorithm 2 without restarting for $T \times \sqrt{8L/m}$ iterations?
2.1 Analysis
Suppose f is m-strongly convex and L-smooth. By Theorem 2, we know that
$$ f(x_{t+1}) - f(x^*) \le \frac{2L\,\|x_t - x^*\|_2^2}{8L/m} = \frac{m\,\|x_t - x^*\|_2^2}{4}. $$
By strong convexity, we have
$$ f(x_t) \ge f(x^*) + \underbrace{\langle \nabla f(x^*),\, x_t - x^* \rangle}_{=0} + \frac{m}{2}\,\|x_t - x^*\|_2^2, $$
hence $\|x_t - x^*\|_2^2 \le \frac{2}{m}\bigl(f(x_t) - f(x^*)\bigr)$. Combining, we get
$$ f(x_{t+1}) - f(x^*) \le \frac{f(x_t) - f(x^*)}{2}. $$
That is, each round of Algorithm 3 halves the optimality gap. It follows that
$$ f(x_T) - f(x^*) \le \left(\frac{1}{2}\right)^{T} \bigl(f(x_0) - f(x^*)\bigr). $$
Therefore, $f(x_T) - f(x^*) \le \epsilon$ can be achieved after at most
$$ T = O\!\left(\log \frac{f(x_0) - f(x^*)}{\epsilon}\right) \text{ rounds,} $$
which corresponds to a total of
$$ T \times \sqrt{\frac{8L}{m}} = O\!\left(\sqrt{\frac{L}{m}}\,\log\frac{f(x_0) - f(x^*)}{\epsilon}\right) \text{ AGD iterations.} $$
This iteration complexity is the same as Theorem 1 up to a logarithmic factor.
Remark 1. Note how strong convexity is needed in the above argument.
Remark 2. Optional reading: This overview article discusses restarting as a general/meta algorithmic technique.
3 Lower bounds
In this section, we consider a class of first-order iterative algorithms that satisfy $x_0 = 0$ and
$$ x_{k+1} \in \mathrm{Lin}\{\nabla f(x_0), \nabla f(x_1), \ldots, \nabla f(x_k)\}, \quad \forall\, k \ge 0, \tag{1} $$
where the RHS denotes the linear subspace spanned by $\nabla f(x_0), \nabla f(x_1), \ldots, \nabla f(x_k)$; in other words, $x_{k+1}$ is an (arbitrary) linear combination of the gradients at the first $k+1$ iterates $x_0, \ldots, x_k$.
The lower-bound constructions below use the tridiagonal matrix $A \in \mathbb{R}^{d \times d}$, given explicitly by
$$ A = \begin{pmatrix}
2 & -1 & 0 & 0 & \cdots & \cdots & 0 \\
-1 & 2 & -1 & 0 & \cdots & \cdots & 0 \\
0 & -1 & 2 & -1 & 0 & \cdots & 0 \\
& & \ddots & \ddots & \ddots & & \\
0 & \cdots & & -1 & 2 & -1 \\
0 & \cdots & & & -1 & 2
\end{pmatrix}. \tag{2} $$
Let $e_i \in \mathbb{R}^d$ denote the $i$-th standard basis vector. Consider the quadratic function
$$ f(x) = \frac{L}{8}\, x^\top A x - \frac{L}{4}\, x^\top e_1, $$
which is convex and L-smooth since $0 \preceq A \preceq 4I$. Note that $\nabla f(x) = \frac{L}{4}(Ax - e_1)$. By induction, we can show that for $k \ge 1$, $x_k \in \mathrm{Lin}\{e_1, \ldots, e_k\}$.
Therefore, if we let Ak ∈ Rd×d denote the matrix obtained by zeroing out the entries of A outside
the top-left k × k block, then
$$ f(x_k) = \frac{L}{8}\, x_k^\top A_k x_k - \frac{L}{4}\, x_k^\top e_1 \;\ge\; f_k^* := \min_x\; \frac{L}{8}\, x^\top A_k x - \frac{L}{4}\, x^\top e_1. $$
In particular, the minimizer $x_d^*$ of $f$ over $\mathbb{R}^d$ satisfies $A x_d^* = e_1$, so its entries are $x_d^*(i) = 1 - \frac{i}{d+1}$ and
$$ \|x_d^* - x_0\|_2^2 = \|x_d^*\|_2^2 = \sum_{i=1}^{d}\left(1 - \frac{i}{d+1}\right)^2 \le \frac{d+1}{3}. $$
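This construction can be checked numerically. The sketch below (with illustrative names and a small $d$) builds $A$ from (2), verifies $0 \preceq A \preceq 4I$, confirms that plain gradient descent started from $x_0 = 0$ (one member of the class (1)) keeps $x_k$ supported on the first $k$ coordinates, and checks the bound on $\|x_d^*\|_2^2$.

```python
import numpy as np

d, L = 50, 10.0
# Tridiagonal matrix A from (2): 2 on the diagonal, -1 on the first off-diagonals.
A = 2 * np.eye(d) - np.eye(d, k=1) - np.eye(d, k=-1)
e1 = np.eye(d)[:, 0]

# 0 <= A <= 4I, so f below is convex and L-smooth.
eigs = np.linalg.eigvalsh(A)
assert eigs.min() >= -1e-12 and eigs.max() <= 4 + 1e-12

# f(x) = (L/8) x^T A x - (L/4) x^T e1, with gradient (L/4)(A x - e1).
grad = lambda x: (L / 4) * (A @ x - e1)

# Plain gradient descent from x0 = 0 is one member of the class (1):
# each iterate x_k should be supported on the first k coordinates.
x = np.zeros(d)
for k in range(1, 11):
    x = x - grad(x) / L
    assert np.allclose(x[k:], 0.0)

# Minimizer over R^d: A x* = e1 has entries x*(i) = 1 - i/(d+1),
# so ||x* - x0||^2 = ||x*||^2 <= (d+1)/3.
x_star = np.linalg.solve(A, e1)
assert np.allclose(x_star, 1 - np.arange(1, d + 1) / (d + 1))
assert x_star @ x_star <= (d + 1) / 3
```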
Theorem 4. There exists an m-strongly convex and L-smooth function such that any first-order method in
the sense of (1) must satisfy
$$ f(x_k) - f(x^*) \ge \frac{m}{2}\left(1 - \frac{4}{\sqrt{L/m}}\right)^{k+1} \|x_0 - x^*\|_2^2. $$
Proof. Let A ∈ Rd×d be defined in (2) above and consider the function
$$ f(x) = \frac{L-m}{8}\left(x^\top A x - 2\, x^\top e_1\right) + \frac{m}{2}\,\|x\|_2^2, $$
which is L-smooth and m-strongly convex. Strong convexity implies that
$$ f(x_k) - f(x^*) \ge \frac{m}{2}\,\|x_k - x^*\|_2^2. \tag{3} $$
A similar argument as above shows that xk ∈ Lin {e1 , . . . , ek } , hence
$$ \|x_k - x^*\|_2^2 \ge \sum_{i=k+1}^{d} x^*(i)^2, \tag{4} $$
where x ∗ (i ) denotes the ith entry of x ∗ . For simplicity we take d → ∞ (we omit the formal limiting
argument).1 The minimizer x ∗ can be computed by setting the gradient of f to zero, which gives
an infinite set of equations
$$ 1 - 2\,\frac{L/m+1}{L/m-1}\, x^*(1) + x^*(2) = 0, $$
$$ x^*(k-1) - 2\,\frac{L/m+1}{L/m-1}\, x^*(k) + x^*(k+1) = 0, \quad k = 2, 3, \ldots $$
Solving these equations gives
$$ x^*(i) = \left(\frac{\sqrt{L/m}-1}{\sqrt{L/m}+1}\right)^{i}, \quad i = 1, 2, \ldots \tag{5} $$
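As a sanity check, one can truncate the infinite system to a large finite $d$, solve the stationarity condition numerically, and compare with the closed form (5); the sketch below uses illustrative parameter values, and the agreement is only approximate near the truncation boundary.

```python
import numpy as np

d, L, m = 400, 100.0, 1.0      # truncation dimension and illustrative constants
kappa = L / m

# Tridiagonal A from (2) and the strongly convex hard function
#   f(x) = (L - m)/8 * (x^T A x - 2 x^T e1) + (m/2) ||x||^2.
A = 2 * np.eye(d) - np.eye(d, k=1) - np.eye(d, k=-1)
e1 = np.eye(d)[:, 0]

# Stationarity condition: (L - m)/4 * (A x - e1) + m x = 0.
H = (L - m) / 4 * A + m * np.eye(d)
x_star = np.linalg.solve(H, (L - m) / 4 * e1)

# Closed form (5): x*(i) = ((sqrt(kappa) - 1)/(sqrt(kappa) + 1))^i.
q = (np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)
closed_form = q ** np.arange(1, d + 1)

# Entries far from the truncation boundary agree to high accuracy.
assert np.allclose(x_star[:50], closed_form[:50], atol=1e-8)
```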
$$
\begin{aligned}
f(x_k) - f(x^*) &\ge \frac{m}{2} \sum_{i=k+1}^{\infty} x^*(i)^2 && \text{by (3) and (4)} \\
&\ge \frac{m}{2} \left(\frac{\sqrt{L/m}-1}{\sqrt{L/m}+1}\right)^{2(k+1)} \|x_0 - x^*\|_2^2 && \text{by (5) and } x_0 = 0 \\
&= \frac{m}{2} \left(1 - \frac{4}{\sqrt{L/m}+1} + \frac{4}{(\sqrt{L/m}+1)^2}\right)^{k+1} \|x_0 - x^*\|_2^2 \\
&\ge \frac{m}{2} \left(1 - \frac{4}{\sqrt{L/m}}\right)^{k+1} \|x_0 - x^*\|_2^2.
\end{aligned}
$$
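To illustrate Theorem 4 concretely, the sketch below runs plain gradient descent (which belongs to the class (1) when started from $x_0 = 0$) on a finite-dimensional truncation of the hard function and checks that the optimality gap stays above the stated bound for small $k$; the dimension, step size, and constants are illustrative, and the finite $d$ only approximates the $d \to \infty$ argument.

```python
import numpy as np

d, L, m = 1000, 400.0, 1.0     # need L/m > 16 for the bound to be nontrivial
A = 2 * np.eye(d) - np.eye(d, k=1) - np.eye(d, k=-1)
e1 = np.eye(d)[:, 0]

# f(x) = (L - m)/8 (x^T A x - 2 x^T e1) + (m/2)||x||^2 = (1/2) x^T H x - b^T x.
H = (L - m) / 4 * A + m * np.eye(d)
b = (L - m) / 4 * e1
f = lambda x: 0.5 * x @ (H @ x) - b @ x
grad = lambda x: H @ x - b

x_star = np.linalg.solve(H, b)
f_star = f(x_star)

# Plain gradient descent from x0 = 0 belongs to the class (1).
x = np.zeros(d)
for k in range(30):
    gap = f(x) - f_star                     # optimality gap at the k-th iterate
    bound = (m / 2) * (1 - 4 / np.sqrt(L / m)) ** (k + 1) * (x_star @ x_star)
    assert gap >= bound                     # lower bound of Theorem 4
    x = x - grad(x) / L                     # one gradient step
```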
Remark 3. The lower bounds in Theorems 3 and 4 are in the worst-case/minimax sense: one cannot find a first-order method that achieves a better convergence rate than AGD on all smooth convex functions. This, however, does not prevent better rates from being achieved for a subclass of such functions. It is also possible to achieve better rates by using higher-order information (e.g., the Hessian).
1 The convergence rates for AGD in Theorems 1 and 2 do not explicitly depend on the dimension d, hence these
results can be generalized to infinite dimensions.