
Contents

1 Theory of Convex Functions

2 Gradient Descent

3 Projected and Proximal Gradient Descent

4 Subgradient Descent

5 Stochastic Gradient Descent

6 Nonconvex Functions

7 Newton's Method

8 Quasi-Newton Methods

9 Frank-Wolfe

10 Coordinate Descent
Chapter 1

Theory of Convex Functions

Contents
1.1 Notation
1.2 Convex sets
1.3 Convex functions
    1.3.1 Differentiable functions
    1.3.2 First-order characterization of convexity
    1.3.3 Second-order characterization of convexity
    1.3.4 Operations that preserve convexity
1.4 Minimizing convex functions
    1.4.1 Strictly convex functions
    1.4.2 Example: Least squares
    1.4.3 Constrained Minimization
1.5 Existence of a minimizer
    1.5.1 Sublevel sets and the Weierstrass Theorem
1.6 Examples
    1.6.1 Handwritten digit recognition
    1.6.2 Master's Admission
1.7 Exercises
This chapter develops the basic theory of convex functions that we will
need later. Much of the material is also covered in other courses, so we will
refer to the literature for standard material and focus more on material that
we feel is less standard (but important in our context).

1.1 Notation
For vectors in $\mathbb{R}^d$, we use bold font, and for their coordinates normal font, e.g. $\mathbf{x} = (x_1, \dots, x_d) \in \mathbb{R}^d$. $\mathbf{x}_1, \mathbf{x}_2, \dots$ denotes a sequence of vectors. Vectors are considered as column vectors, unless they are explicitly transposed. $\|x\|$ denotes the Euclidean norm ($\ell_2$-norm or 2-norm) of a vector $x$,
$$\|x\|^2 = x^\top x = \sum_{i=1}^{d} x_i^2.$$
We also use
$$\mathbb{N} = \{1, 2, \dots\} \quad \text{and} \quad \mathbb{R}_+ := \{x \in \mathbb{R} : x \ge 0\}$$
to denote the natural and non-negative real numbers, respectively. We are freely using basic notions and material such as open and closed sets, vector spaces, continuity, convergence, limits, and the triangle inequality, among others.

1.2 Convex sets


Definition 1.1. A set $C \subseteq \mathbb{R}^d$ is convex if for any two points $x, y \in C$, the connecting line segment is contained in $C$. In formulas: if for all $\lambda \in [0, 1]$, $\lambda x + (1 - \lambda) y \in C$; see Figure 1.1.

Figure 1.1: A convex set (left) and a non-convex set (right)

Observation 1.2. Let $C_i$, $i \in I$, be convex sets, where $I$ is a (possibly infinite) index set. Then $C = \bigcap_{i \in I} C_i$ is a convex set.

1.3 Convex functions


We are considering real-valued functions $f : \mathbf{dom}(f) \to \mathbb{R}$, where $\mathbf{dom}(f) \subseteq \mathbb{R}^d$ denotes the domain of $f$. The graph of $f$ is the set $\{(x, f(x)) \in \mathbb{R}^{d+1} : x \in \mathbf{dom}(f)\}$. The epigraph (Figure 1.2) is the set of points above the graph,
$$\mathbf{epi}(f) := \{(x, \alpha) \in \mathbb{R}^{d+1} : x \in \mathbf{dom}(f),\ \alpha \ge f(x)\}.$$

Figure 1.2: Graph and epigraph of a non-convex function (left) and a convex function (right)

Definition 1.3 ([BV04, 3.1.1]). A function $f : \mathbf{dom}(f) \to \mathbb{R}$ is convex if (i) $\mathbf{dom}(f)$ is convex and (ii) for all $x, y \in \mathbf{dom}(f)$ and all $\lambda \in [0, 1]$, we have
$$f(\lambda x + (1 - \lambda) y) \le \lambda f(x) + (1 - \lambda) f(y). \tag{1.1}$$

Geometrically, the condition means that the line segment connecting the two points $(x, f(x)), (y, f(y)) \in \mathbb{R}^{d+1}$ lies pointwise above the graph of $f$; see Figure 1.3. (Whenever we say "above", we mean "above or on".) An important special case arises when $f : \mathbb{R}^d \to \mathbb{R}$ is an affine function, i.e. $f(x) = c^\top x + c_0$ for some vector $c \in \mathbb{R}^d$ and scalar $c_0 \in \mathbb{R}$. In this case, (1.1) is always satisfied with equality, and line segments connecting points on the graph lie pointwise on the graph.

Figure 1.3: A convex function

Observation 1.4. $f$ is a convex function if and only if $\mathbf{epi}(f)$ is a convex set.

Proof. This is easy, but let us still do it to illustrate the concepts. Let $f$ be a convex function and consider two points $(x, \alpha), (y, \beta) \in \mathbf{epi}(f)$ and $\lambda \in [0, 1]$. This means $f(x) \le \alpha$, $f(y) \le \beta$, hence by convexity of $f$,
$$f(\lambda x + (1 - \lambda) y) \le \lambda f(x) + (1 - \lambda) f(y) \le \lambda \alpha + (1 - \lambda) \beta.$$
Therefore, by definition of the epigraph,
$$\lambda (x, \alpha) + (1 - \lambda)(y, \beta) = (\lambda x + (1 - \lambda) y,\ \lambda \alpha + (1 - \lambda) \beta) \in \mathbf{epi}(f),$$
so $\mathbf{epi}(f)$ is a convex set. In the other direction, let $\mathbf{epi}(f)$ be a convex set and consider two points $x, y \in \mathbf{dom}(f)$ and $\lambda \in [0, 1]$. By convexity of $\mathbf{epi}(f)$, we have
$$\mathbf{epi}(f) \ni \lambda (x, f(x)) + (1 - \lambda)(y, f(y)) = (\lambda x + (1 - \lambda) y,\ \lambda f(x) + (1 - \lambda) f(y)),$$
and this is just a different way of writing (1.1).

Lemma 1.5 (Jensen's inequality). Let $f : \mathbb{R}^d \to \mathbb{R}$ be a convex function, $x_1, \dots, x_m \in \mathbf{dom}(f)$, and $\lambda_1, \dots, \lambda_m \in \mathbb{R}_+$ such that $\sum_{i=1}^{m} \lambda_i = 1$. Then
$$f\left(\sum_{i=1}^{m} \lambda_i x_i\right) \le \sum_{i=1}^{m} \lambda_i f(x_i).$$

For $m = 2$, this is (1.1). The proof of the general case is Exercise 1.
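As a quick numerical sanity check (our own addition, not part of the notes), the following sketch evaluates both sides of Jensen's inequality for the convex function $f(x) = \|x\|^2$ and a random convex combination; the choice of function and the use of NumPy are assumptions made purely for illustration.

```python
import numpy as np

def f(x):
    # A convex test function: the squared Euclidean norm.
    return np.dot(x, x)

rng = np.random.default_rng(0)
d, m = 3, 5
xs = rng.normal(size=(m, d))              # points x_1, ..., x_m
lam = rng.random(m)
lam /= lam.sum()                          # lambda_i >= 0, summing to 1

lhs = f(lam @ xs)                         # f(sum_i lambda_i x_i)
rhs = sum(l * f(x) for l, x in zip(lam, xs))  # sum_i lambda_i f(x_i)
print(lhs <= rhs + 1e-12)                 # Jensen's inequality holds
```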


Lemma 1.6. Let $f$ be convex and suppose that $\mathbf{dom}(f)$ is open. Then $f$ is continuous.

This is not entirely obvious (see Exercise 2), and it becomes false if we consider convex functions over general vector spaces. What saves us is that $\mathbb{R}^d$ has finite dimension.

As an example, let us consider $f(x_1, x_2) = x_1^2 + x_2^2$. The graph of $f$ is the unit paraboloid in $\mathbb{R}^3$, which looks convex. However, verifying (1.1) directly is somewhat cumbersome. Next, we develop better ways to do this if the function under consideration is differentiable.

1.3.1 Differentiable functions


The following is standard material taught in multivariate calculus. As we
frequently need it, we include a refresher here.
Definition 1.7. Let $f : \mathbf{dom}(f) \to \mathbb{R}^m$ where $\mathbf{dom}(f) \subseteq \mathbb{R}^d$ is open. The function $f$ is called differentiable at $x \in \mathbf{dom}(f)$ if there exist an $(m \times d)$-matrix $A$ and an error function $r : \mathbb{R}^d \to \mathbb{R}^m$ defined around $\mathbf{0} \in \mathbb{R}^d$ such that for all $y$ in some neighborhood of $x$,
$$f(y) = f(x) + A(y - x) + r(y - x),$$
where
$$\lim_{v \to \mathbf{0}} \frac{\|r(v)\|}{\|v\|} = 0.$$
It then also follows that the matrix $A$ is unique, and it is called the differential or Jacobian matrix of $f$ at $x$. We will denote it by $Df(x)$. More precisely, $Df(x)$ is the matrix of partial derivatives at the point $x$,
$$Df(x)_{ij} = \frac{\partial f_i}{\partial x_j}(x).$$
$f$ is called differentiable if $f$ is differentiable at all $x \in \mathbf{dom}(f)$.

Differentiability at $x$ means that in some neighborhood of $x$, $f$ is approximated by a (unique) affine function $f(x) + Df(x)(y - x)$, up to a sublinear error term. If $m = 1$, $Df(x)$ is a row vector typically denoted by $\nabla f(x)^\top$, where the (column) vector $\nabla f(x)$ is called the gradient of $f$ at $x$. Geometrically, this means that the graph of the affine function $f(x) + \nabla f(x)^\top (y - x)$ is a tangent hyperplane to the graph of $f$ at $(x, f(x))$; see Figure 1.4.

Figure 1.4: If $f$ is differentiable at $x$, the graph of $f$ is locally (around $x$) approximated by a tangent hyperplane

Let us do a simple example to illustrate the concept of differentiability. Consider the function $f(x) = x^2$. We know that its derivative is $f'(x) = 2x$. But why? For $y = x + v$, we compute
$$f(y) = (x + v)^2 = x^2 + 2vx + v^2 = f(x) + 2x \cdot v + v^2 = f(x) + A(y - x) + r(y - x),$$
where $A := 2x$ and $r(y - x) = r(v) := v^2$. We have $\lim_{v \to 0} \frac{|r(v)|}{|v|} = \lim_{v \to 0} |v| = 0$. Hence, $A = 2x$ is indeed the differential (a.k.a. derivative) of $f$ at $x$.
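To make the definition concrete, here is a small numerical illustration (our own addition, assuming NumPy): for a differentiable function, the error $r(v) = f(x+v) - f(x) - \nabla f(x)^\top v$ should vanish faster than $\|v\|$ as $v \to \mathbf{0}$.

```python
import numpy as np

def f(x):
    return x[0] ** 2 + x[1] ** 2            # f(x1, x2) = x1^2 + x2^2

def grad_f(x):
    return np.array([2 * x[0], 2 * x[1]])   # its gradient

x = np.array([1.0, -2.0])
v = np.array([0.3, 0.7])
for t in [1.0, 0.1, 0.01, 0.001]:
    vt = t * v
    r = f(x + vt) - f(x) - grad_f(x) @ vt   # error of the affine approximation
    print(t, abs(r) / np.linalg.norm(vt))   # the ratio ||r(v)|| / ||v|| tends to 0
```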
In computing differentials, the chain rule is particularly useful.

Lemma 1.8 (Chain rule). Let $f : \mathbf{dom}(f) \to \mathbb{R}^m$, $\mathbf{dom}(f) \subseteq \mathbb{R}^d$, and $g : \mathbf{dom}(g) \to \mathbb{R}^d$. Suppose that $g$ is differentiable at $x \in \mathbf{dom}(g)$ and that $f$ is differentiable at $g(x) \in \mathbf{dom}(f)$. Then $f \circ g$ (the composition of $f$ and $g$) is differentiable at $x$, with the differential given by the matrix equation
$$D(f \circ g)(x) = Df(g(x)) \, Dg(x).$$

Let us do an example. Let $f : \mathbb{R}^d \to \mathbb{R}^m$ be a differentiable function, and fix $x, y \in \mathbb{R}^d$. Now define $g : \mathbb{R} \to \mathbb{R}^d$ by $g(t) = x + t(y - x)$ and set $h = f \circ g$. Thus, $h(t) = f(x + t(y - x))$, and we have
$$h'(t) = Dh(t) = Df(x + t(y - x)) \, Dg(t) = Df(x + t(y - x))(y - x). \tag{1.2}$$

The following is a general result that we will later use in specific settings. As its proof also highlights some important notions and techniques, we will give it here. As a preparation, we need the concept of the spectral norm of a matrix.

Definition 1.9. Let $A$ be an $(m \times d)$-matrix. Then
$$\|A\| := \max_{v \in \mathbb{R}^d,\, v \neq \mathbf{0}} \frac{\|Av\|}{\|v\|} = \max_{\|v\| = 1} \|Av\|$$
is the 2-norm (or spectral norm) of $A$.

In words, the spectral norm is the largest factor by which a unit vector can be stretched in length under the mapping $v \mapsto Av$.

Also recall that a function $f : \mathbf{dom}(f) \to \mathbb{R}^m$ is $B$-Lipschitz (or simply Lipschitz if there is a suitable $B$) if $\|f(x) - f(y)\| \le B \|x - y\|$ for all $x, y \in \mathbf{dom}(f)$. In particular, Lipschitz functions are continuous.
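As an aside (our own addition), the spectral norm is easy to compute numerically; in NumPy, `np.linalg.norm(A, 2)` returns the largest singular value of a matrix, which coincides with Definition 1.9. A minimal sketch comparing it against the "maximum stretch" interpretation:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(4, 3))                  # an (m x d)-matrix

spectral = np.linalg.norm(A, 2)              # largest singular value = spectral norm

# Compare with the definition: the largest ||Av|| over (many random) unit vectors v.
vs = rng.normal(size=(100000, 3))
vs /= np.linalg.norm(vs, axis=1, keepdims=True)
empirical = np.max(np.linalg.norm(vs @ A.T, axis=1))
print(spectral, empirical)                   # empirical <= spectral, and close to it
```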

Theorem 1.10. Let $f : \mathbf{dom}(f) \to \mathbb{R}^m$ be differentiable, $X \subseteq \mathbf{dom}(f)$ a convex set, and $B \in \mathbb{R}_+$. If $X \subseteq \mathbf{dom}(f)$ is open, the following two statements are equivalent. For any convex $X \subseteq \mathbf{dom}(f)$, (ii) implies (i).

(i) $f$ is $B$-Lipschitz, meaning that
$$\|f(x) - f(y)\| \le B \|x - y\|, \quad \forall x, y \in X.$$

(ii) $f$ has differentials bounded by $B$, meaning that
$$\|Df(x)\| \le B, \quad \forall x \in X.$$

Indeed, (i) might not imply (ii) if $X$ is closed. As a trivial example, the Lipschitz condition is always satisfied over $X = \{\mathbf{0}\}$ but does not say anything about $\|Df(x)\|$.

Proof. Suppose that $f$ is $B$-Lipschitz over an open set $X$. For $v \in \mathbb{R}^d$, $v \to \mathbf{0}$, differentiability at $x \in X$ yields for small $v$ that $x + v \in X$ and therefore
$$B \|v\| \ge \|f(x + v) - f(x)\| = \|Df(x)v + r(v)\| \ge \|Df(x)v\| - \|r(v)\|,$$
where $\|r(v)\| / \|v\| \to 0$, the first inequality uses (i), and the last is the reverse triangle inequality. Rearranging and dividing by $\|v\|$, we get
$$\frac{\|Df(x)v\|}{\|v\|} \le B + \frac{\|r(v)\|}{\|v\|}.$$
Let $v^\star$ be a unit vector such that $\|Df(x)\| = \|Df(x)v^\star\| / \|v^\star\|$ and let $v = t v^\star$ for $t \to 0$. Then we further get
$$\|Df(x)\| \le B + \frac{\|r(v)\|}{\|v\|} \to B,$$
and $\|Df(x)\| \le B$ follows, so differentials are bounded by $B$.

For the other direction, suppose that differentials are bounded by $B$ over $X$ (not necessarily open); we apply the fundamental theorem of calculus:
$$\int_a^b h'(t)\,dt = h(b) - h(a), \tag{1.3}$$
where $h : \mathbf{dom}(h) \to \mathbb{R}^m$ is a univariate differentiable function, $h'$ its componentwise derivative, $[a, b] \subseteq \mathbf{dom}(h)$, and $\int$ the componentwise integral. For fixed $x, y \in X$, $x \neq y$, we apply this with
$$h(t) = f(x + t(y - x)),$$
in which case the chain rule yields
$$h'(t) = Df(x + t(y - x))(y - x),$$
see (1.2). Note that $h$ is well-defined since $X$ was assumed to be convex. Then we compute
$$\begin{aligned}
\|f(y) - f(x)\| &= \|h(1) - h(0)\| \\
&= \left\| \int_0^1 h'(t)\,dt \right\| \le \int_0^1 \|h'(t)\|\,dt && \text{(Exercise 46)} \\
&= \int_0^1 \|Df(x + t(y - x))(y - x)\|\,dt \\
&\le \int_0^1 \|Df(x + t(y - x))\| \, \|y - x\|\,dt && \text{(spectral norm)} \\
&\le \int_0^1 B \|y - x\|\,dt && \text{(bounded differentials)} \\
&= B \|y - x\|.
\end{aligned}$$
Hence, $f$ is $B$-Lipschitz over $X$.

1.3.2 First-order characterization of convexity


Now we come back to convex functions with image in $\mathbb{R}$. If a function $f : \mathbf{dom}(f) \to \mathbb{R}$ is differentiable, convexity can be characterized by an inequality involving the gradient.

Lemma 1.11 ([BV04, 3.1.3]). Suppose that $\mathbf{dom}(f)$ is open and that $f$ is differentiable; in particular, the gradient (vector of partial derivatives)
$$\nabla f(x) := \left( \frac{\partial f}{\partial x_1}(x), \dots, \frac{\partial f}{\partial x_d}(x) \right)$$
exists at every point $x \in \mathbf{dom}(f)$. Then $f$ is convex if and only if $\mathbf{dom}(f)$ is convex and
$$f(y) \ge f(x) + \nabla f(x)^\top (y - x) \tag{1.4}$$
holds for all $x, y \in \mathbf{dom}(f)$.

Geometrically, this means that for all $x \in \mathbf{dom}(f)$, the graph of $f$ lies above its tangent hyperplane at the point $(x, f(x))$; see Figure 1.5.

Figure 1.5: First-order characterization of convexity

Proof. Suppose that $f$ is convex, meaning that for $t \in (0, 1)$,
$$f(x + t(y - x)) = f((1 - t)x + ty) \le (1 - t)f(x) + t f(y) = f(x) + t(f(y) - f(x)).$$
Dividing by $t$ and using differentiability at $x$, we get
$$f(y) \ge f(x) + \frac{f(x + t(y - x)) - f(x)}{t} = f(x) + \frac{\nabla f(x)^\top t(y - x) + r(t(y - x))}{t} = f(x) + \nabla f(x)^\top (y - x) + \frac{r(t(y - x))}{t},$$
where the error term $r(t(y - x))/t$ goes to $0$ as $t \to 0$. The inequality $f(y) \ge f(x) + \nabla f(x)^\top (y - x)$ follows.

Now suppose this inequality holds for all $x, y \in \mathbf{dom}(f)$, and define $z := \lambda x + (1 - \lambda) y \in \mathbf{dom}(f)$ (by convexity of $\mathbf{dom}(f)$). Then we have
$$f(x) \ge f(z) + \nabla f(z)^\top (x - z),$$
$$f(y) \ge f(z) + \nabla f(z)^\top (y - z).$$
After multiplying the first inequality by $\lambda$ and the second one by $(1 - \lambda)$, the gradient terms cancel in the sum of the two inequalities, and we get
$$\lambda f(x) + (1 - \lambda) f(y) \ge f(z) = f(\lambda x + (1 - \lambda) y).$$
This is convexity.

For $f(x_1, x_2) = x_1^2 + x_2^2$, we have $\nabla f(x) = (2x_1, 2x_2)$, hence (1.4) boils down to
$$y_1^2 + y_2^2 \ge x_1^2 + x_2^2 + 2x_1(y_1 - x_1) + 2x_2(y_2 - x_2),$$
which after some rearranging of terms is equivalent to
$$(y_1 - x_1)^2 + (y_2 - x_2)^2 \ge 0,$$
hence true. There are relevant convex functions that are not differentiable; see Figure 1.6 for an example. More generally, Exercise 7 asks you to prove that the $\ell_1$-norm (or 1-norm) $f(x) = \|x\|_1$ is convex.

Figure 1.6: A non-differentiable convex function, $f(x) = |x|$

1.3.3 Second-order characterization of convexity


If f : dom(f ) ! R is twice differentiable (meaning that the function rf is
differentiable), convexity can be characterized as follows.
Lemma 1.12 ([BV04, 3.1.4]). Suppose that dom(f ) is open and that f is twice
differentiable; in particular, the Hessian (matrix of second partial derivatives)
0 1
@2f @2f @2f
(x) (x) · · · (x)
B @x@12@x 1 @x1 @x2
@2f
@x1 @xd
2f C
B @x2 @x1 (x) @x2 @x2 (x) · · · @x@2 @x
f
(x) C
2
r f (x) = B B d C
.. .. .. C
@ . . ··· . A
@2f @2f @2f
@xd @x1
(x) @xd @x2
(x) ··· @xd @xd
(x)

exists at every point x 2 dom(f ) and is symmetric. Then f is convex if and only
if dom(f ) is convex, and for all x 2 dom(f ), we have
r2 f (x) ⌫ 0 (i.e. r2 f (x) is positive semidefinite). (1.5)

12
(A symmetric matrix M is positive semidefinite, denoted by M ⌫ 0, if x> M x
0 for all x, and positive definite, denoted by M 0, if x> M x > 0 for all x 6= 0.)

Geometrically, this means that the graph of f has non-negative curva-


ture everywhere and hence “looks like a bowl”. For f (x1 , x2 ) = x21 + x22 ,
we have ✓ ◆
2 2 0
r f (x) = ,
0 2
which is a positive definite matrix. In higher dimensions, the same ar-
gument can be used to show that the squared distance dy (x) = kx
yk2 to a fixed point y is a convex function; see Exercise 3. The non-
squared Euclidean distance kx yk is also convex in x, as a consequence
of Lemma 1.13(ii) below and the fact that every seminorm (in particular
the Euclidean norm kxk) is convex (Exercise 8). The squared Euclidean
distance has the advantage that it is differentiable, while the Euclidean
distance itself (whose graph is an “ice cream cone” for d = 2) is not.
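Condition (1.5) can also be checked numerically by inspecting the eigenvalues of the Hessian. The following sketch (our own illustration, assuming NumPy) does this for the Hessian of $f(x_1, x_2) = x_1^2 + x_2^2$ and for an indefinite matrix for contrast.

```python
import numpy as np

def is_psd(M, tol=1e-10):
    # A symmetric matrix is positive semidefinite iff all its eigenvalues are >= 0.
    return bool(np.all(np.linalg.eigvalsh(M) >= -tol))

H1 = np.array([[2.0, 0.0], [0.0, 2.0]])   # Hessian of x1^2 + x2^2: positive definite
H2 = np.array([[1.0, 2.0], [2.0, 1.0]])   # eigenvalues 3 and -1: not positive semidefinite
print(is_psd(H1), is_psd(H2))             # True False
```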

1.3.4 Operations that preserve convexity


There are two important operations that preserve convexity.

Lemma 1.13 (Exercise 4).

(i) Let $f_1, f_2, \dots, f_m$ be convex functions and $\lambda_1, \lambda_2, \dots, \lambda_m \in \mathbb{R}_+$. Then $f := \sum_{i=1}^{m} \lambda_i f_i$ is convex on $\mathbf{dom}(f) := \bigcap_{i=1}^{m} \mathbf{dom}(f_i)$.

(ii) Let $f$ be a convex function with $\mathbf{dom}(f) \subseteq \mathbb{R}^d$ and $g : \mathbb{R}^m \to \mathbb{R}^d$ an affine function, meaning that $g(x) = Ax + b$ for some matrix $A \in \mathbb{R}^{d \times m}$ and some vector $b \in \mathbb{R}^d$. Then the function $f \circ g$ (that maps $x$ to $f(Ax + b)$) is convex on $\mathbf{dom}(f \circ g) := \{x \in \mathbb{R}^m : g(x) \in \mathbf{dom}(f)\}$.

1.4 Minimizing convex functions


The main feature that makes convex functions attractive in optimization
is that every local minimum is a global one, so we cannot “get stuck” in
local optima. This is quite intuitive if we think of the graph of a convex
function as being bowl-shaped.

Definition 1.14. A local minimum of $f : \mathbf{dom}(f) \to \mathbb{R}$ is a point $x$ such that there exists $\varepsilon > 0$ with
$$f(x) \le f(y) \quad \forall y \in \mathbf{dom}(f) \text{ satisfying } \|y - x\| < \varepsilon.$$

Lemma 1.15. Let $x^\star$ be a local minimum of a convex function $f : \mathbf{dom}(f) \to \mathbb{R}$. Then $x^\star$ is a global minimum, meaning that
$$f(x^\star) \le f(y) \quad \forall y \in \mathbf{dom}(f).$$

Proof. Suppose there exists $y \in \mathbf{dom}(f)$ such that $f(y) < f(x^\star)$ and define $y' := \lambda x^\star + (1 - \lambda) y$ for $\lambda \in (0, 1)$. From convexity (1.1), we get that $f(y') < f(x^\star)$. Choosing $\lambda$ so close to $1$ that $\|y' - x^\star\| < \varepsilon$ yields a contradiction to $x^\star$ being a local minimum.

This does not mean that a convex function always has a global minimum. Think of $f(x) = x$ as a trivial example. But even if $f$ is bounded from below over $\mathbf{dom}(f)$, it may fail to have a global minimum ($f(x) = e^x$). To ensure the existence of a global minimum, we need additional conditions. For example, it suffices if outside some ball $B$, all function values are larger than some value $f(x)$, $x \in B$. In this case, we can restrict $f$ to $B$ without changing the smallest attainable value, and on $B$ (which is compact), $f$ attains a minimum by continuity (Lemma 1.6). An easy example: for $f(x_1, x_2) = x_1^2 + x_2^2$, we know that outside any ball containing $\mathbf{0}$, $f(x) > f(\mathbf{0}) = 0$.

Another easy condition in the differentiable case is given by the following result.

Lemma 1.16. Suppose that $f : \mathbf{dom}(f) \to \mathbb{R}$ is convex and differentiable over an open domain $\mathbf{dom}(f) \subseteq \mathbb{R}^d$. Let $x \in \mathbf{dom}(f)$. If $\nabla f(x) = \mathbf{0}$, then $x$ is a global minimum.

Proof. Suppose that $\nabla f(x) = \mathbf{0}$. According to Lemma 1.11, we have
$$f(y) \ge f(x) + \nabla f(x)^\top (y - x) = f(x)$$
for all $y \in \mathbf{dom}(f)$, so $x$ is a global minimum.

The converse is also true and is a corollary of Lemma 1.22 [BV04, 4.2.3].

Lemma 1.17. Suppose that $f : \mathbf{dom}(f) \to \mathbb{R}$ is convex and differentiable over an open domain $\mathbf{dom}(f) \subseteq \mathbb{R}^d$. Let $x \in \mathbf{dom}(f)$. If $x$ is a global minimum, then $\nabla f(x) = \mathbf{0}$.
1.4.1 Strictly convex functions
In general, a global minimum of a convex function is not unique (think of $f(x) = 0$ as a trivial example). However, if we forbid "flat" parts of the graph of $f$, a global minimum becomes unique (if it exists at all).

Definition 1.18 ([BV04, 3.1.1]). A function $f : \mathbf{dom}(f) \to \mathbb{R}$ is strictly convex if (i) $\mathbf{dom}(f)$ is convex and (ii) for all $x \neq y \in \mathbf{dom}(f)$ and all $\lambda \in (0, 1)$, we have
$$f(\lambda x + (1 - \lambda) y) < \lambda f(x) + (1 - \lambda) f(y). \tag{1.6}$$

This means that the open line segment connecting $(x, f(x))$ and $(y, f(y))$ is pointwise strictly above the graph of $f$. For example, $f(x) = x^2$ is strictly convex.

Lemma 1.19 ([BV04, 3.1.4]). Suppose that $\mathbf{dom}(f)$ is open and that $f$ is twice differentiable. If the Hessian satisfies $\nabla^2 f(x) \succ 0$ for every $x \in \mathbf{dom}(f)$ (i.e., $z^\top \nabla^2 f(x) z > 0$ for any $z \neq \mathbf{0}$), then $f$ is strictly convex.

The converse is false, though: $f(x) = x^4$ is strictly convex but has vanishing second derivative at $x = 0$.

Lemma 1.20. Let $f : \mathbf{dom}(f) \to \mathbb{R}$ be strictly convex. Then $f$ has at most one global minimum.

Proof. Suppose $x^\star \neq y^\star$ are two global minima with $f_{\min} = f(x^\star) = f(y^\star)$, and let $z = \frac{1}{2} x^\star + \frac{1}{2} y^\star$. By (1.6),
$$f(z) < \frac{1}{2} f_{\min} + \frac{1}{2} f_{\min} = f_{\min},$$
a contradiction to $x^\star$ and $y^\star$ being global minima.

1.4.2 Example: Least squares


Suppose we want to fit a hyperplane to a set of data points x1 , . . . , xm in
Rd , based on the hypothesis that the points actually come (approximately)
from a hyperplane. A classical method for this is least squares. For con-
creteness, let us do this in R2 . Suppose that the data points are

(1, 10), (2, 11), (3, 11), (4, 10), (5, 9), (6, 10), (7, 9), (8, 10),

15
y y

x x

Figure 1.7: Data points in R2 (left) and least-squares fit (right)

Figure 1.7 (left).


Also, for simplicity (and quite appropriately in this case), let us restrict
to fitting a linear model, of more formally to fit non-vertical lines of the
form y = w0 + w1 x. If (xi , yi ) is the i-th data point, the least squares fit
chooses w0 , w1 such that the least squares objective
8
X
f (w0 , w1 ) = (w1 xi + w0 yi ) 2
i=1

is minimized. It easily follows from Lemma 1.13 that f is convex. In fact,

f (w0 , w1 ) = 204w12 + 72w1 w0 706w1 + 8w02 160w0 + 804, (1.7)

so we can check convexity directly using the second order condition. We


have gradient

rf (w0 , w1 ) = (72w1 + 16w0 160, 408w1 + 72w0 706)

and Hessian ✓ ◆
2 16 72
r (w0 , w1 ) = .
72 408

16
A 2 ⇥ 2 matrix is positive semidefinite if the diagonal elements and the
determinant are positive, which is the case here, so f is actually strictly
convex and has a unique global minimum. To find it, we solve the linear
system rf (w0 , w1 ) = (0, 0) of two equations in two unknowns and obtain
the global minimum
⇣ 43 1 ⌘
(w0? , w1? ) = , .
4 6
Hence, the “optimal” line is
1 43
y= x+ ,
6 4
see Figure 1.7 (right).
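The computation above is easy to reproduce; the following sketch (our own addition, assuming NumPy) builds the $2 \times 2$ linear system $\nabla f(w_0, w_1) = (0, 0)$ from the eight data points and solves it.

```python
import numpy as np

# The eight data points (x_i, y_i) from above.
x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([10, 11, 11, 10, 9, 10, 9, 10], dtype=float)

# Setting the gradient of f(w0, w1) = sum_i (w1*x_i + w0 - y_i)^2 to zero gives:
#   n*w0        + sum(x)*w1   = sum(y)
#   sum(x)*w0   + sum(x^2)*w1 = sum(x*y)
A = np.array([[len(x), x.sum()], [x.sum(), (x * x).sum()]])
b = np.array([y.sum(), (x * y).sum()])
w0, w1 = np.linalg.solve(A, b)
print(w0, w1)   # 10.75 and -0.1666..., i.e. (43/4, -1/6)
```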

1.4.3 Constrained Minimization


Frequently, we are interested in minimizing a convex function only over a
subset X of its domain.

Definition 1.21. Let f : dom(f ) ! R be convex and let X ✓ dom(f ) be a


convex set. A point x 2 X is a minimizer of f over X if

f (x)  f (y) 8y 2 X.

If f is differentiable, minimizers of f over X have a very useful charac-


terization.

Lemma 1.22 ([BV04, 4.2.3]). Suppose that f : dom(f ) ! R is convex and


differentiable over an open domain dom(f ) ✓ Rd , and let X ✓ dom(f ) be a
convex set. Point x? 2 X is a minimizer of f over X if and only if

rf (x? )> (x x? ) 0 8x 2 X.

Applying the this result with X = dom(f ), we recover Lemma 1.16,


and because dom(f ) is open, its converse Lemma 1.17 follows [BV04,
4.2.3]. If X does not contain the global minimum, then Lemma 1.22 has
a nice geometric interpretation. Namely, it means that X is contained in
the halfspace {x 2 Rd : rf (x? )> (x x? ) 0} (normal vector rf (x? ) point-
ing into the halfspace); see Figure 1.8. In still other words, x x? forms a
non-obtuse angle with rf (x? ) for all x 2 X.

Figure 1.8: Optimality condition for constrained optimization

We typically write constrained minimization problems in the form
$$\operatorname*{argmin} \{f(x) : x \in X\} \tag{1.8}$$
or
$$\begin{aligned}
&\text{minimize} && f(x) \\
&\text{subject to} && x \in X.
\end{aligned} \tag{1.9}$$

1.5 Existence of a minimizer


The existence of a minimizer (or a global minimum if X = dom(f )) will
be an assumption made by most minimization algorithms that we discuss
later. In practice, such algorithms are being used (and often also work)
if there is no minimizer. By “work”, we mean in this case that they com-
pute a point x such that f (x) is close to inf y2X f (y), assuming that the
infimum is finite (as in f (x) = ex ). But a sound theoretical analysis usu-
ally requires the existence of a minimizer. Therefore, this section develops
tools that may helps us in analyzing whether this is the case for a given
convex function. To avoid technicalities, we restrict ourselves to the case
dom(f ) = Rd .

18
1.5.1 Sublevel sets and the Weierstrass Theorem
Definition 1.23. Let $f : \mathbb{R}^d \to \mathbb{R}$ and $\alpha \in \mathbb{R}$. The set
$$f^{\le \alpha} := \{x \in \mathbb{R}^d : f(x) \le \alpha\}$$
is the $\alpha$-sublevel set of $f$; see Figure 1.9.

Figure 1.9: Sublevel sets of a non-convex function (left) and a convex function (right)

It is easy to see from the definition that every sublevel set of a convex function is convex. Moreover, as a consequence of continuity of $f$, sublevel sets are closed. The following (known as the Weierstrass Theorem) just formalizes an argument that we have made earlier.

Theorem 1.24. Let $f : \mathbb{R}^d \to \mathbb{R}$ be a convex function, and suppose there is a nonempty and bounded sublevel set $f^{\le \alpha}$. Then $f$ has a global minimum.

Proof. We know that $f$, as a continuous function, attains a minimum over the closed and bounded (= compact) set $f^{\le \alpha}$ at some $x^\star$. This $x^\star$ is also a global minimum, as it has value $f(x^\star) \le \alpha$, while any $x \notin f^{\le \alpha}$ has value $f(x) > \alpha \ge f(x^\star)$.
1.6 Examples
In the following two sections, we give two examples of convex function
minimization tasks that arise from machine learning applications.

1.6.1 Handwritten digit recognition


Suppose you want to write a program that recognizes handwritten decimal digits $0, 1, \dots, 9$. You have a set $P$ of grayscale images ($28 \times 28$ pixels, say) that represent handwritten decimal digits, and for each image $x \in P$, you know the digit $d(x) \in \{0, \dots, 9\}$ that it represents; see Figure 1.10. You want to train your program with the set $P$, and after that, use it to recognize handwritten digits in arbitrary $28 \times 28$ images.

Figure 1.10: Some training images from the MNIST data set (picture from http://corochann.com/mnist-dataset-introduction-1138.html)

The classical approach is the following. We represent an image as a feature vector $x \in \mathbb{R}^{784}$, where $x_i$ is the gray value of the $i$-th pixel (in some order). During the training phase, we compute a matrix $W \in \mathbb{R}^{10 \times 784}$ and then use the vector $y = Wx \in \mathbb{R}^{10}$ to predict the digit seen in an arbitrary image $x$. The idea is that $y_j$, $j = 0, \dots, 9$, corresponds to the probability of the digit being $j$. This does not work directly, since the entries of $y$ may be negative and generally do not sum up to $1$. But we can convert $y$ to a vector $z$ of actual probabilities, such that a small $y_j$ leads to a small probability $z_j$ and a large $y_j$ to a large probability $z_j$. How to do this is not canonical, but here is a well-known formula that works:
$$z_j = z_j(y) = \frac{e^{y_j}}{\sum_{k=0}^{9} e^{y_k}}. \tag{1.10}$$
The classification then simply outputs digit $j$ with probability $z_j$. The matrix $W$ is chosen such that it (approximately) minimizes the classification error on the training set $P$. Again, it is not canonical how we measure classification error; here we use the following loss function to evaluate the error induced by a given matrix $W$:
$$\ell(W) = -\sum_{x \in P} \ln z_{d(x)}(Wx) = \sum_{x \in P} \left( \ln \Big( \sum_{k=0}^{9} e^{(Wx)_k} \Big) - (Wx)_{d(x)} \right). \tag{1.11}$$
This function "punishes" images for which the correct digit $j$ has low probability $z_j$ (corresponding to a significantly negative value of $\ln z_j$). In an ideal world, the correct digit would always have probability $1$, resulting in $\ell(W) = 0$. But under (1.10), probabilities are always strictly between $0$ and $1$, so we have $\ell(W) > 0$ for all $W$.

Exercise 5 asks you to prove that $\ell$ is convex. In Exercise 6, you will characterize the situations in which $\ell$ has a global minimum.
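A direct transcription of (1.10) and (1.11) into code looks as follows. This is only an illustrative sketch (our own addition, assuming NumPy and small random stand-in data rather than actual MNIST images), using the usual log-sum-exp shift for numerical stability.

```python
import numpy as np

def softmax(y):
    # (1.10): convert a score vector y into a probability vector z.
    e = np.exp(y - y.max())   # shifting by max(y) does not change z, avoids overflow
    return e / e.sum()

def loss(W, images, labels):
    # (1.11): sum over images of  ln(sum_k exp((Wx)_k)) - (Wx)_{d(x)}.
    total = 0.0
    for x, d in zip(images, labels):
        y = W @ x
        total += np.log(np.exp(y - y.max()).sum()) + y.max() - y[d]
    return total

rng = np.random.default_rng(0)
images = rng.random((5, 784))            # stand-in for 28x28 grayscale images
labels = rng.integers(0, 10, size=5)     # their digits d(x)
W = np.zeros((10, 784))
print(loss(W, images, labels))           # equals 5 * ln(10) for the all-zeros matrix
```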

1.6.2 Master’s Admission


The computer science department of a well-known Swiss university is admitting top international students to its MSc program, in a competitive application process. Applicants are submitting various documents (GPA, TOEFL test score, GRE test scores, reference letters, ...). During the evaluation of an application, the admission committee would like to compute a (rough) forecast of the applicant's performance in the MSc program, based on the submitted documents.¹

¹ Any resemblance to real departments is purely coincidental. Also, no serious department will base performance forecasts on data from 10 students, as we do it here.

Data on the actual performance of students admitted in the past is available. To keep things simple in the following example, let us base the forecast on GPA (grade point average) and TOEFL (Test of English as a Foreign Language) only. GPA scores are normalized to a scale with a minimum of 0.0 and a maximum of 4.0, where admission starts from 3.5. TOEFL scores are on an integer scale between 0 and 120, where admission starts from 100.

Table 1.1 contains the known data. GGPA (graduation grade point average on a Swiss grading scale) is the average grade obtained by an admitted student over all courses in the MSc program. The Swiss scale goes from 1 to 6, where 1 is the lowest grade, 6 is the highest, and 4 is the lowest passing grade.

GPA TOEFL GGPA


3.52 100 3.92
3.66 109 4.34
3.76 113 4.80
3.74 100 4.67
3.93 100 5.52
3.88 115 5.44
3.77 115 5.04
3.66 107 4.73
3.87 106 5.03
3.84 107 5.06

Table 1.1: Data for 10 admitted students: GPA and TOEFL scores (at time
of application), GGPA (at time of graduation)

As in Section 1.4.2, we are attempting a linear regression with least squares fit, i.e. we are making the hypothesis that
$$\text{GGPA} \approx w_0 + w_1 \cdot \text{GPA} + w_2 \cdot \text{TOEFL}. \tag{1.12}$$
However, in our scenario, the relevant GPA scores span a range of only 0.5, while the relevant TOEFL scores span a range of 20. The resulting least squares objective would be somewhat ugly; we already saw this in our previous example (1.7), where the data points had a large second coordinate, resulting in the $w_1$-scale being very different from the $w_0$-scale. This time, we normalize first, so that $w_1$ and $w_2$ become comparable and allow us to understand the relative influences of GPA and TOEFL.

The general setting is this: we have $n$ inputs $x_1, \dots, x_n$, where each vector $x_i \in \mathbb{R}^d$ consists of $d$ input variables; then we have $n$ outputs $y_1, \dots, y_n \in \mathbb{R}$. Each pair $(x_i, y_i)$ is an observation. In our case, $d = 2$, $n = 10$, and for example, $((3.93, 100), 5.52)$ is an observation (of a student doing very well). With variable weights $w_0$, $w = (w_1, \dots, w_d) \in \mathbb{R}^d$, we plan to minimize the least squares objective
$$f(w_0, w) = \sum_{i=1}^{n} (w_0 + w^\top x_i - y_i)^2.$$

We first want to assume that the inputs and outputs are centered, meaning that
$$\frac{1}{n} \sum_{i=1}^{n} x_i = \mathbf{0}, \qquad \frac{1}{n} \sum_{i=1}^{n} y_i = 0.$$
This can be achieved by simply subtracting the mean $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$ from every input and the mean $\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i$ from every output. In our example, this yields the numbers in Table 1.2 (left).

Centered observations:

GPA     TOEFL   GGPA
-0.24   -7.2    -0.94
-0.10    1.8    -0.52
-0.01    5.8    -0.05
-0.02   -7.2    -0.18
 0.17   -7.2     0.67
 0.12    7.8     0.59
 0.01    7.8     0.19
-0.10   -0.2    -0.12
 0.11   -1.2     0.17
 0.07   -0.2     0.21

Normalized inputs (outputs unchanged):

GPA     TOEFL   GGPA
-2.04   -1.28   -0.94
-0.88    0.32   -0.52
-0.05    1.03   -0.05
-0.16   -1.28   -0.18
 1.42   -1.28    0.67
 1.02    1.39    0.59
 0.06    1.39    0.19
-0.88   -0.04   -0.12
 0.89   -0.21    0.17
 0.62   -0.04    0.21

Table 1.2: Centered observations (left); normalized inputs (right)

After centering, the global minimum $(w_0^\star, w^\star)$ of the least squares objective satisfies $w_0^\star = 0$, while $w^\star$ is unaffected by the centering (Exercise 9), so that we can simply omit the variable $w_0$ in the sequel.

Finally, we assume that all $d$ input variables are on the same scale, meaning that
$$\frac{1}{n} \sum_{i=1}^{n} x_{ij}^2 = 1, \quad j = 1, \dots, d.$$
To achieve this for fixed $j$ (assuming that no variable is $0$ in all inputs), we multiply all $x_{ij}$ by $s(j) = \sqrt{n / \sum_{i=1}^{n} x_{ij}^2}$ (which, in the optimal solution $w^\star$, just multiplies $w_j^\star$ by $1/s(j)$, an argument very similar to the one in Exercise 9). For our data set, the resulting normalized data are shown in Table 1.2 (right). Now the least squares objective (after omitting $w_0$) is
$$f(w_1, w_2) = \sum_{i=1}^{10} (w_1 x_{i1} + w_2 x_{i2} - y_i)^2 \approx 10 w_1^2 + 10 w_2^2 + 1.99 w_1 w_2 - 8.7 w_1 - 2.79 w_2 + 2.09.$$

This is minimized at
$$w^\star = (w_1^\star, w_2^\star) \approx (0.43, 0.097),$$
so if our initial hypothesis (1.12) is true, we should have
$$y_i \approx y_i^\star = 0.43\, x_{i1} + 0.097\, x_{i2} \tag{1.13}$$
in the normalized data. This can quickly be checked, and the results are not perfect, but not too bad, either; see Table 1.3 (ignore the last column for now).
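The whole pipeline of centering, rescaling, and least squares can be reproduced in a few lines. The sketch below (our own addition, assuming NumPy) starts from the raw values in Table 1.1 and recovers weights close to $w^\star \approx (0.43, 0.097)$.

```python
import numpy as np

# Raw data from Table 1.1: columns GPA, TOEFL, GGPA.
data = np.array([
    [3.52, 100, 3.92], [3.66, 109, 4.34], [3.76, 113, 4.80], [3.74, 100, 4.67],
    [3.93, 100, 5.52], [3.88, 115, 5.44], [3.77, 115, 5.04], [3.66, 107, 4.73],
    [3.87, 106, 5.03], [3.84, 107, 5.06],
])
X, y = data[:, :2], data[:, 2]

X = X - X.mean(axis=0)                   # center the inputs
y = y - y.mean()                         # center the outputs
X = X / np.sqrt((X ** 2).mean(axis=0))   # normalize: (1/n) * sum_i x_ij^2 = 1

# Least squares without w0 (justified by Exercise 9): minimize ||Xw - y||^2.
w, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w)                                 # approximately [0.43, 0.097]
```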
What we also see from (1.13) is that the first input variable (GPA) has a much higher influence on the output (GGPA) than the second one (TOEFL). In fact, if we drop the second one altogether, we obtain outputs $z_i^\star$ (last column in Table 1.3) that seem equivalent to the predicted outputs $y_i^\star$ within the level of noise that we have anyway.
We conclude that TOEFL scores are probably not indicative for the per-
formance of admitted students, so the admission committee should not
care too much about them. Requiring a minimum score of 100 might make
sense, but whenever an applicant reaches at least this score, the actual
value does not matter.

x_i1    x_i2    y_i     y_i*    z_i*
-2.04   -1.28   -0.94   -1.00   -0.87
-0.88    0.32   -0.52   -0.35   -0.37
-0.05    1.03   -0.05    0.08   -0.02
-0.16   -1.28   -0.18   -0.19   -0.07
 1.42   -1.28    0.67    0.49    0.61
 1.02    1.39    0.59    0.57    0.44
 0.06    1.39    0.19    0.16    0.03
-0.88   -0.04   -0.12   -0.38   -0.37
 0.89   -0.21    0.17    0.36    0.38
 0.62   -0.04    0.21    0.26    0.27

Table 1.3: Outputs $y_i^\star$ predicted by the linear model (1.13) and by the model $z_i^\star = 0.43\, x_{i1}$ that simply ignores the second input variable

The LASSO. So far, we have computed linear functions $y = 0.43 x_1 + 0.097 x_2$ and $z = 0.43 x_1$ that "explain" the historical data from Table 1.1. However, they are optimized to fit the historical data, not the future. We may have overfitting. This typically leads to unreliable predictions of high variance in the future. Also, ideally, we would like non-indicative variables (such as the TOEFL in our example) to actually have weight 0, so that the model "knows" the important variables and is therefore easier to interpret.

The question is: how can we in general improve the quality of our forecast? There are various heuristics to identify the "important" variables (subset selection). A very simple one is to just forget about weights close to 0 in the least squares solution. However, for this, we need to define what it means to be close to 0; and it may happen that small changes in the data lead to different variables being dropped if their weights are around the threshold. On the other end of the spectrum, there is best subset selection, where we compute the least squares solution subject to the constraint that there are at most $k$ nonzero weights, for some $k$ that we believe is the right number of important variables. This is NP-hard, though.

A popular approach that in many cases improves forecasts and at the same time identifies important variables has been suggested by Tibshirani in 1996 [Tib96]. Instead of minimizing the least squares objective globally, it is minimized over a suitable $\ell_1$-ball (a ball in the 1-norm $\|w\|_1 = \sum_{j=1}^{d} |w_j|$):
$$\begin{aligned}
&\text{minimize} && \sum_{i=1}^{n} (w^\top x_i - y_i)^2 \\
&\text{subject to} && \|w\|_1 \le R,
\end{aligned} \tag{1.14}$$
where $R \in \mathbb{R}_+$ is some parameter. In our case, if we for example
$$\begin{aligned}
&\text{minimize} && f(w_1, w_2) = 10 w_1^2 + 10 w_2^2 + 1.99 w_1 w_2 - 8.7 w_1 - 2.79 w_2 + 2.09 \\
&\text{subject to} && |w_1| + |w_2| \le 0.2,
\end{aligned} \tag{1.15}$$
we obtain weights $w^\star = (w_1^\star, w_2^\star) = (0.2, 0)$: the non-indicative TOEFL score has disappeared automatically! For $R = 0.3$, the same happens (with $w_1^\star = 0.3$, respectively). For $R = 0.4$, the TOEFL score starts creeping back in: we get $(w_1^\star, w_2^\star) \approx (0.36, 0.036)$. For $R = 0.5$, we have $(w_1^\star, w_2^\star) \approx (0.41, 0.086)$, while for $R = 0.6$ (and all larger values of $R$), we recover the original solution $(w_1^\star, w_2^\star) \approx (0.43, 0.097)$.
It is important to understand that using the “fixed” weights (which
may be significantly shrunken), we make predictions worse on the histori-
cal data (this must be so, since least squares was optimal for the historical
data). But future predictions may benefit (a lot). To quantify this benefit,
we need to make statistical assumptions about future observations; this is
beyond the scope of our treatment here.
The phenomenon that adding a constraint on $\|w\|_1$ tends to set weights to $0$ is not restricted to $d = 2$. The constrained minimization problem (1.14) is called the LASSO (least absolute shrinkage and selection operator) and has the tendency to assign weights of $0$ and thus to select a subset of input variables, where $R$ controls how aggressive the selection is.
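To see the selection effect on our toy problem, one can simply search the $\ell_1$-ball numerically. The sketch below (our own illustration, assuming NumPy) minimizes the explicit objective from (1.15) over a fine grid of the constraint set for $R = 0.2$ and recovers the solution $(0.2, 0)$ with value $0.75$; increasing $R$ in this sketch reproduces the progression reported above.

```python
import numpy as np

def f(w1, w2):
    # Least squares objective on the normalized admission data, as in (1.15).
    return 10 * w1**2 + 10 * w2**2 + 1.99 * w1 * w2 - 8.7 * w1 - 2.79 * w2 + 2.09

R = 0.2
grid = np.linspace(-R, R, 2001)
W1, W2 = np.meshgrid(grid, grid)
feasible = np.abs(W1) + np.abs(W2) <= R          # grid points inside the l1-ball
vals = np.where(feasible, f(W1, W2), np.inf)     # objective, +inf outside the ball
i, j = np.unravel_index(np.argmin(vals), vals.shape)
print(W1[i, j], W2[i, j], vals[i, j])            # approximately 0.2, 0.0, 0.75
```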
In our example, it is easy to get an intuition why this works. Let us look at the case $R = 0.2$. The smallest value attainable in (1.15) is the smallest $\alpha$ such that the (elliptical) sublevel set $f^{\le \alpha}$ of the least squares objective $f$ still intersects the $\ell_1$-ball $\{(w_1, w_2) : |w_1| + |w_2| \le 0.2\}$. This smallest value turns out to be $\alpha = 0.75$; see Figure 1.11. For this value of $\alpha$, the sublevel set intersects the $\ell_1$-ball in exactly one point, namely $(0.2, 0)$.

At $(0.2, 0)$, the ellipse $\{(w_1, w_2) : f(w_1, w_2) = \alpha\}$ is "vertical enough" to just intersect the corner of the $\ell_1$-ball. The reason is that the center of the ellipse is relatively close to the $w_1$-axis, when compared to its size. As $R$ increases, the relevant value of $\alpha$ decreases, and the ellipse gets smaller and less vertical around the $w_1$-axis, until it eventually stops intersecting the $\ell_1$-ball $\{(w_1, w_2) : |w_1| + |w_2| \le R\}$ in a corner (dashed situation in Figure 1.11, for $R = 0.4$).

Figure 1.11: Lasso

Even though we have presented a toy example in this section, the background is real. The theory of admission and in particular performance forecasts has been developed in a recent PhD thesis by Zimmermann [Zim16].

1.7 Exercises
Exercise 1. Prove Jensen’s inequality (Lemma 1.5)!

Exercise 2. Prove that a convex function $f$ (with $\mathbf{dom}(f)$ open) is continuous (Lemma 1.6)!

Hint: First prove that a convex function $f$ is bounded on any cube $C = [l_1, u_1] \times [l_2, u_2] \times \cdots \times [l_d, u_d] \subseteq \mathbf{dom}(f)$, with the maximum value occurring at some corner of the cube (a point $z$ such that $z_i \in \{l_i, u_i\}$ for all $i$). Then use this fact to show that, given $x \in \mathbf{dom}(f)$ and $\varepsilon > 0$, all $y$ in a sufficiently small ball around $x$ satisfy $|f(y) - f(x)| < \varepsilon$.

Exercise 3. Prove that the function $d_y : \mathbb{R}^d \to \mathbb{R}$, $x \mapsto \|x - y\|^2$ is strictly convex for any $y \in \mathbb{R}^d$. (Use Lemma 1.19.)

Exercise 4. Prove Lemma 1.13! Can (ii) be generalized to show that for two convex functions $f, g$, the function $f \circ g$ is convex as well?

Exercise 5. Consider the function $\ell$ defined in (1.11). Prove that $\ell$ is convex!

Exercise 6. Consider the logistic regression problem with two classes. Given a training set $P$ consisting of datapoint and label pairs $(x, y)$ where $x \in \mathbb{R}^d$ and $y \in \{-1, +1\}$, we define our loss $\ell$ for weight vector $w \in \mathbb{R}^d$ to be
$$\ell(w) = -\sum_{(x, y) \in P} \ln z(y\, w^\top x),$$
where $z(s) = 1/(1 + \exp(-s))$. This loss function is in fact a simplification of (1.11) when we only have two classes.

We say that the weight vector $w$ is a separator for $P$ if for all $(x, y) \in P$,
$$y (w^\top x) \ge 0.$$
A separator is said to be trivial if for all $(x, y) \in P$,
$$y (w^\top x) = 0.$$
For example, $w = \mathbf{0}$ is a trivial separator. Depending on the data $P$, there may be other trivial separators.

Prove the following statement: the function $\ell$ has a global minimum if and only if all separators are trivial.
Exercise 7. Prove that the function $f(x) = \|x\|_1 = \sum_{i=1}^{d} |x_i|$ ($\ell_1$-norm) is convex!

Exercise 8. A seminorm is a function $f : \mathbb{R}^d \to \mathbb{R}$ satisfying the following two properties for all $x, y \in \mathbb{R}^d$ and all $\lambda \in \mathbb{R}$.

(i) $f(\lambda x) = |\lambda| f(x)$,

(ii) $f(x + y) \le f(x) + f(y)$ (triangle inequality).

Prove that every seminorm is convex!

Exercise 9. Suppose that we have centered observations $(x_i, y_i)$ such that $\sum_{i=1}^{n} x_i = \mathbf{0}$ and $\sum_{i=1}^{n} y_i = 0$. Let $(w_0^\star, w^\star)$ be the global minimum of the least squares objective
$$f(w_0, w) = \sum_{i=1}^{n} (w_0 + w^\top x_i - y_i)^2.$$
Prove that $w_0^\star = 0$. Also, suppose $x_i'$ and $y_i'$ are such that for all $i$, $x_i' = x_i + q$, $y_i' = y_i + r$. Show that $(w_0, w)$ minimizes $f$ if and only if $(w_0 - w^\top q + r, w)$ minimizes
$$f'(w_0, w) = \sum_{i=1}^{n} (w_0 + w^\top x_i' - y_i')^2.$$
Chapter 2

Gradient Descent

Contents
2.1 Overview
2.2 The algorithm
2.3 Vanilla analysis
2.4 Lipschitz convex functions: O(1/ε²) steps
2.5 Smooth convex functions: O(1/ε) steps
2.6 Interlude
2.7 Smooth and strongly convex functions: O(log(1/ε)) steps
2.8 Exercises
2.1 Overview
The gradient descent algorithm (including variants such as projected or
stochastic gradient descent) is the most useful workhorse for minimizing
loss functions in practice. The algorithm is extremely simple and surpris-
ingly robust in the sense that it also works well for many loss functions
that are not convex. While it is easy to construct (artificial) non-convex
functions on which gradient descent goes completely astray, such func-
tions do not seem to be typical in practice; however, understanding this
on a theoretical level is an open problem, and only few results exist in this
direction.
The vast majority of theoretical results concerning the performance of
gradient descent hold for convex functions only. In this and the following
chapters, we will present some of these results, but maybe more impor-
tantly, the main ideas behind them. As it turns out, the number of ideas
that we need is rather small, and typically, they are shared between dif-
ferent results. Our approach is therefore to fully develop each idea once,
in the context of a concrete result. If the idea reappears, we will typically
only discuss the changes that are necessary in order to establish a new re-
sult from this idea. In order to avoid boredom from ideas that reappear
too often, we omit other results and variants that one could also get along
the lines of what we discuss.
Let $f : \mathbb{R}^d \to \mathbb{R}$ be a convex and differentiable function. We also assume that $f$ has a global minimum $x^\star$, and the goal is to find (an approximation of) $x^\star$. This usually means that for a given $\varepsilon > 0$, we want to find $x \in \mathbb{R}^d$ such that
$$f(x) - f(x^\star) < \varepsilon.$$
Notice that we are not making an attempt to get near to $x^\star$ itself; there can be several minima $x_1^\star \neq x^\star \neq x_2^\star$ with $f(x_1^\star) = f(x_2^\star) = f(x^\star)$.
Table 2.1 gives an overview of the results that we will prove. They con-
cern several variants of gradient descent as well as several classes of func-
tions. The significance of each algorithm and function class will briefly be
discussed when it first appears.
In Chapter 6, we will also look at gradient descent on functions that are not convex. In this case, provably small approximation error can still be obtained for some particularly well-behaved functions (we will give an example). For smooth (but not necessarily convex) functions, we generally cannot show convergence in error, but a (much) weaker convergence property still holds.
                      Lipschitz convex   smooth convex   smooth & strongly   strongly convex
                      functions          functions       convex functions    functions

gradient descent      Thm. 2.1           Thm. 2.7        Thm. 2.11
                      O(1/ε²)            O(1/ε)          O(log(1/ε))

projected gradient    Thm. 3.2           Thm. 3.4        Thm. 3.5
descent               O(1/ε²)            O(1/ε)          O(log(1/ε))

proximal gradient                        Thm. 3.14
descent                                  O(1/ε)

subgradient descent   Thm. 4.7                                               Thm. 4.11
                      O(1/ε²)                                                O(1/ε)

stochastic gradient   Thm. 5.1                                               Thm. 5.2
descent               O(1/ε²)                                                O(1/ε)

Table 2.1: Results on gradient descent. Below each theorem, the number of steps is given which the respective variant needs on the respective function class to achieve additive approximation error at most ε.


2.2 The algorithm


Gradient descent is a very simple iterative algorithm for finding the desired approximation $x$, under suitable conditions that we will get to. It computes a sequence $x_0, x_1, \dots$ of vectors such that $x_0$ is arbitrary, and for each $t \ge 0$, $x_{t+1}$ is obtained from $x_t$ by making a step $v_t \in \mathbb{R}^d$:
$$x_{t+1} = x_t + v_t.$$
How do we choose $v_t$ in order to get closer to optimality, meaning that $f(x_{t+1}) < f(x_t)$? From differentiability of $f$ at $x_t$ (Definition 1.7), we know that for $\|v_t\|$ tending to $0$,
$$f(x_t + v_t) = f(x_t) + \nabla f(x_t)^\top v_t + \underbrace{r(v_t)}_{o(\|v_t\|)} \approx f(x_t) + \nabla f(x_t)^\top v_t.$$
To get any decrease in function value at all, we have to choose $v_t$ such that $\nabla f(x_t)^\top v_t < 0$. But among all steps $v_t$ of the same length, we should in fact choose the one with the most negative value of $\nabla f(x_t)^\top v_t$, so that we maximize our decrease in function value. This is achieved when $v_t$ points into the direction of the negative gradient $-\nabla f(x_t)$. But as differentiability guarantees decrease only for small steps, we also want to control how far we go along the direction of the negative gradient.

Therefore, the step of gradient descent is defined by
$$x_{t+1} := x_t - \gamma \nabla f(x_t). \tag{2.1}$$
Here, $\gamma$ is a fixed stepsize, but it may also make sense to have $\gamma$ depend on $t$. For now, $\gamma$ is fixed. We hope that for some reasonably small integer $t$, in the $t$-th iteration we get that $f(x_t) - f(x^\star) < \varepsilon$; see Figure 2.1 for an example.

Now it becomes clear why we are assuming that $\mathbf{dom}(f) = \mathbb{R}^d$: the update step (2.1) may in principle take us "anywhere", so in order to get a well-defined algorithm, we want to make sure that $f$ is defined and differentiable everywhere.

The choice of $\gamma$ is critical for the performance. If $\gamma$ is too small, the process might take too long, and if $\gamma$ is too large, we are in danger of overshooting. It is not clear at this point whether there is a "right" stepsize.
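For concreteness, here is a minimal sketch of the update rule (2.1) in code (our own illustration, assuming NumPy), run on the quadratic function from Figure 2.1 with $x_0 = (0, 0)$ and $\gamma = 0.1$.

```python
import numpy as np

def gradient_descent(grad_f, x0, gamma, num_steps):
    # Iterates x_{t+1} = x_t - gamma * grad f(x_t), returning all iterates.
    xs = [np.asarray(x0, dtype=float)]
    for _ in range(num_steps):
        xs.append(xs[-1] - gamma * grad_f(xs[-1]))
    return xs

# f(x1, x2) = 2*(x1 - 4)^2 + 3*(x2 - 3)^2, as in Figure 2.1; global minimum (4, 3).
grad_f = lambda x: np.array([4 * (x[0] - 4), 6 * (x[1] - 3)])

iterates = gradient_descent(grad_f, x0=[0.0, 0.0], gamma=0.1, num_steps=30)
print(iterates[-1])   # close to the global minimum (4, 3)
```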

2.3 Vanilla analysis


Let $x_t$ be some iterate in the sequence (2.1). We abbreviate $g_t := \nabla f(x_t)$ and will relate this vector to our current direction from an optimum, $x_t - x^\star$. By definition of gradient descent (2.1), $g_t = (x_t - x_{t+1})/\gamma$, hence
$$g_t^\top (x_t - x^\star) = \frac{1}{\gamma} (x_t - x_{t+1})^\top (x_t - x^\star). \tag{2.2}$$

Figure 2.1: Example run of gradient descent on the quadratic function $f(x_1, x_2) = 2(x_1 - 4)^2 + 3(x_2 - 3)^2$ with global minimum $(4, 3)$; we have chosen $x_0 = (0, 0)$, $\gamma = 0.1$; dashed lines represent level sets of $f$ (points of constant $f$-value)

Now we apply (somewhat out of the blue, but this will clear up in the next step) the basic vector equation $2 v^\top w = \|v\|^2 + \|w\|^2 - \|v - w\|^2$ (a.k.a. the cosine theorem) to rewrite the same expression as
$$g_t^\top (x_t - x^\star) = \frac{1}{2\gamma} \left( \|x_t - x_{t+1}\|^2 + \|x_t - x^\star\|^2 - \|x_{t+1} - x^\star\|^2 \right) = \frac{\gamma}{2} \|g_t\|^2 + \frac{1}{2\gamma} \left( \|x_t - x^\star\|^2 - \|x_{t+1} - x^\star\|^2 \right). \tag{2.3}$$
Next we sum this up over the iterations $t$, so that the latter two terms in the bracket cancel in a telescoping sum:
$$\sum_{t=0}^{T-1} g_t^\top (x_t - x^\star) = \frac{\gamma}{2} \sum_{t=0}^{T-1} \|g_t\|^2 + \frac{1}{2\gamma} \left( \|x_0 - x^\star\|^2 - \|x_T - x^\star\|^2 \right) \le \frac{\gamma}{2} \sum_{t=0}^{T-1} \|g_t\|^2 + \frac{1}{2\gamma} \|x_0 - x^\star\|^2. \tag{2.4}$$
So far, we have not used any properties of the function $f$ or its gradient $g_t$, except the definition of the update step $x_{t+1} = x_t - \gamma g_t$. Now we invoke convexity of $f$, or more precisely the first-order characterization of convexity (1.4) with $x = x_t$, $y = x^\star$:
$$f(x_t) - f(x^\star) \le g_t^\top (x_t - x^\star). \tag{2.5}$$
Hence we further obtain
$$\sum_{t=0}^{T-1} \left( f(x_t) - f(x^\star) \right) \le \frac{\gamma}{2} \sum_{t=0}^{T-1} \|g_t\|^2 + \frac{1}{2\gamma} \|x_0 - x^\star\|^2. \tag{2.6}$$
This gives us an upper bound for the average error $f(x_t) - f(x^\star)$, $t = 0, \dots, T - 1$, hence in particular for the error incurred by the iterate with the smallest function value. The last iterate is not necessarily the best one: gradient descent with fixed stepsize will in general also make steps that overshoot and actually increase the function value; see Exercise 12(i).

The question is of course: is this result any good? In general, the answer is no. A dependence on $\|x_0 - x^\star\|$ is to be expected (the further we start from $x^\star$, the longer we will take); the dependence on the squared gradients $\|g_t\|^2$ is more of an issue, and if we cannot control them, we cannot say much.

2.4 Lipschitz convex functions: O(1/ε²) steps

Here is the cheapest "solution" to squeeze something out of the vanilla analysis (2.4): let us simply assume that all gradients of $f$ are bounded in norm. Equivalently, such functions are Lipschitz continuous over $\mathbb{R}^d$ by Theorem 1.10. (A small subtlety here is that in the situation of real-valued functions, Theorem 1.10 is talking about the spectral norm of the $(1 \times d)$-matrix (or row vector) $\nabla f(x)^\top$, while below, we are talking about the Euclidean norm of the (column) vector $\nabla f(x)$; but these two norms are the same; see Exercise 10.)

Assuming bounded gradients rules out many interesting functions, though. For example, $f(x) = x^2$ (a supermodel in the world of convex functions) already doesn't qualify, as $f'(x) = 2x$, and this is unbounded as $x$ tends to infinity. But let's care about supermodels later.
Theorem 2.1. Let $f : \mathbb{R}^d \to \mathbb{R}$ be convex and differentiable with a global minimum $x^\star$; furthermore, suppose that $\|x_0 - x^\star\| \le R$ and $\|\nabla f(x)\| \le B$ for all $x$. Choosing the stepsize
$$\gamma := \frac{R}{B \sqrt{T}},$$
gradient descent (2.1) yields
$$\frac{1}{T} \sum_{t=0}^{T-1} \left( f(x_t) - f(x^\star) \right) \le \frac{RB}{\sqrt{T}}.$$

Proof. This is a simple calculation on top of (2.6): after plugging in the bounds $\|x_0 - x^\star\| \le R$ and $\|g_t\| \le B$, we get
$$\sum_{t=0}^{T-1} \left( f(x_t) - f(x^\star) \right) \le \frac{\gamma}{2} B^2 T + \frac{1}{2\gamma} R^2,$$
so we want to choose $\gamma$ such that
$$q(\gamma) = \frac{\gamma}{2} B^2 T + \frac{R^2}{2\gamma}$$
is minimized. Setting the derivative to zero yields the above value of $\gamma$, and $q(R/(B\sqrt{T})) = RB\sqrt{T}$. Dividing by $T$, the result follows.
This means that in order to achieve $\min_{t=0}^{T-1} (f(x_t) - f(x^\star)) \le \varepsilon$, we need
$$T \ge \frac{R^2 B^2}{\varepsilon^2}$$
many iterations. This is not particularly good when it comes to concrete numbers (think of desired error $\varepsilon = 10^{-6}$ when $R, B$ are somewhat larger). On the other hand, the number of steps does not depend on $d$, the dimension of the space. This is very important, since we often optimize in high-dimensional spaces. Of course, $R$ and $B$ may depend on $d$, but in many relevant cases, this dependence is mild.

What happens if we don't know $R$ and/or $B$? An idea is to "guess" $R$ and $B$, run gradient descent with $T$ and $\gamma$ resulting from the guess, check whether the result has absolute error at most $\varepsilon$, and repeat with a different guess otherwise. This fails, however, since in order to compute the absolute error, we need to know $f(x^\star)$, which we typically don't. But Exercise 13 asks you to show that knowing $R$ is sufficient.
2.5 Smooth convex functions: O(1/ε) steps

Our workhorse in the vanilla analysis was the first-order characterization of convexity: for all $x, y \in \mathbf{dom}(f)$, we have
$$f(y) \ge f(x) + \nabla f(x)^\top (y - x). \tag{2.7}$$
Next we want to look at functions for which $f(y)$ can be bounded from above by $f(x) + \nabla f(x)^\top (y - x)$, up to at most quadratic error. The following definition applies to all differentiable functions; convexity is not required.

Definition 2.2. Let $f : \mathbf{dom}(f) \to \mathbb{R}$ be a differentiable function, $X \subseteq \mathbf{dom}(f)$ convex and $L \in \mathbb{R}_+$. The function $f$ is called smooth (with parameter $L$) over $X$ if
$$f(y) \le f(x) + \nabla f(x)^\top (y - x) + \frac{L}{2} \|x - y\|^2, \quad \forall x, y \in X. \tag{2.8}$$
If $X = \mathbf{dom}(f)$, $f$ is simply called smooth.

Recall that (2.7) says that for any $x$, the graph of $f$ is above its tangential hyperplane at $(x, f(x))$. In contrast, (2.8) says that for any $x \in X$, the graph of $f$ is below a not-too-steep tangential paraboloid at $(x, f(x))$; see Figure 2.2.

This notion of smoothness has become standard in convex optimization, but the naming is somewhat unfortunate, since there is an (older) definition of a smooth function in mathematical analysis, where it means a function that is infinitely often differentiable.

Let us discuss some cases. If $L = 0$, (2.7) and (2.8) together require that
$$f(y) = f(x) + \nabla f(x)^\top (y - x), \quad \forall x, y \in \mathbf{dom}(f),$$
meaning that $f$ is an affine function. A simple calculation shows that our supermodel function $f(x) = x^2$ is smooth with parameter $L = 2$:
$$f(y) = y^2 = x^2 + 2x(y - x) + (x - y)^2 = f(x) + f'(x)(y - x) + \frac{L}{2} (x - y)^2.$$
More generally, we also claim that all quadratic functions of the form $f(x) = x^\top Q x + b^\top x + c$ are smooth, where $Q$ is a $(d \times d)$-matrix, $b \in \mathbb{R}^d$ and $c \in \mathbb{R}$. Because $x^\top Q x = x^\top Q^\top x$, we get that $x^\top Q x = \frac{1}{2} x^\top (Q + Q^\top) x$, where $\frac{1}{2}(Q + Q^\top)$ is symmetric. Therefore, we can assume without loss of generality that $Q$ is symmetric, i.e., it suffices to show that quadratic functions defined by symmetric matrices are smooth.

Figure 2.2: A smooth convex function

Lemma 2.3 (Exercise 11). Let $f(x) = x^\top Q x + b^\top x + c$, where $Q$ is a symmetric $(d \times d)$-matrix, $b \in \mathbb{R}^d$, $c \in \mathbb{R}$. Then $f$ is smooth with parameter $2\|Q\|$, where $\|Q\|$ is the spectral norm of $Q$ (Definition 1.9).

The (univariate) convex function $f(x) = x^4$ is not smooth (over $\mathbb{R}$): at $x = 0$, condition (2.8) reads as
$$y^4 \le \frac{L}{2} y^2,$$
and there is obviously no $L$ that works for all $y$. The function is smooth, however, over any bounded set $X$ (Exercise 16).

In general, and this is the important message here, only functions of asymptotically at most quadratic growth can be smooth. It is tempting to believe that any such "subquadratic" function is actually smooth, but this is not true. Exercise 12(iii) provides a counterexample.

While bounded gradients are equivalent to Lipschitz continuity of $f$ (Theorem 1.10), smoothness turns out to be equivalent to Lipschitz continuity of $\nabla f$, provided that $f$ is convex over the whole space. In general, Lipschitz continuity of $\nabla f$ implies smoothness, but not the other way around.

Lemma 2.4. Let $f : \mathbb{R}^d \to \mathbb{R}$ be convex and differentiable. The following two statements are equivalent.

(i) $f$ is smooth with parameter $L$.

(ii) $\|\nabla f(x) - \nabla f(y)\| \le L \|x - y\|$ for all $x, y \in \mathbb{R}^d$.

We will derive the direction (ii)$\Rightarrow$(i) as Lemma 6.1 in Chapter 6 (which requires neither convexity nor domain $\mathbb{R}^d$). The other direction is a bit more involved. A proof of the equivalence can be found in the lecture slides of L. Vandenberghe, http://www.seas.ucla.edu/~vandenbe/236C/lectures/gradient.pdf.

The operations that we have shown to preserve convexity (Lemma 1.13) also preserve smoothness. This immediately gives us a rich collection of smooth functions.

Lemma 2.5 (Exercise 14).

(i) Let $f_1, f_2, \dots, f_m$ be smooth with parameters $L_1, L_2, \dots, L_m$, and let $\lambda_1, \lambda_2, \dots, \lambda_m \in \mathbb{R}_+$. Then the function $f := \sum_{i=1}^{m} \lambda_i f_i$ is smooth with parameter $\sum_{i=1}^{m} \lambda_i L_i$ over $\mathbf{dom}(f) := \bigcap_{i=1}^{m} \mathbf{dom}(f_i)$.

(ii) Let $f : \mathbf{dom}(f) \to \mathbb{R}$ with $\mathbf{dom}(f) \subseteq \mathbb{R}^d$ be smooth with parameter $L$, and let $g : \mathbb{R}^m \to \mathbb{R}^d$ be an affine function, meaning that $g(x) = Ax + b$, for some matrix $A \in \mathbb{R}^{d \times m}$ and some vector $b \in \mathbb{R}^d$. Then the function $f \circ g$ (that maps $x$ to $f(Ax + b)$) is smooth with parameter $L\|A\|^2$ on $\mathbf{dom}(f \circ g) := \{x \in \mathbb{R}^m : g(x) \in \mathbf{dom}(f)\}$, where $\|A\|$ is the spectral norm of $A$ (Definition 1.9).

We next show that for smooth convex functions, the vanilla analysis provides a better bound than it does under bounded gradients. In particular, we are now able to serve the supermodel $f(x) = x^2$.

We start with a preparatory lemma showing that gradient descent (with suitable stepsize $\gamma$) makes progress in function value on smooth functions in every step. We call this sufficient decrease, and maybe surprisingly, it does not require convexity.

Lemma 2.6. Let $f : \mathbb{R}^d \to \mathbb{R}$ be differentiable and smooth with parameter $L$ according to (2.8). With stepsize
$$\gamma := \frac{1}{L},$$
gradient descent (2.1) satisfies
$$f(x_{t+1}) \le f(x_t) - \frac{1}{2L} \|\nabla f(x_t)\|^2, \quad t \ge 0.$$
More specifically, this already holds if $f$ is smooth with parameter $L$ over the line segment connecting $x_t$ and $x_{t+1}$.

Proof. We apply the smoothness condition (2.8) and the definition of gradient descent, which yields $x_{t+1} - x_t = -\nabla f(x_t)/L$. We compute
$$f(x_{t+1}) \le f(x_t) + \nabla f(x_t)^\top (x_{t+1} - x_t) + \frac{L}{2} \|x_t - x_{t+1}\|^2 = f(x_t) - \frac{1}{L} \|\nabla f(x_t)\|^2 + \frac{1}{2L} \|\nabla f(x_t)\|^2 = f(x_t) - \frac{1}{2L} \|\nabla f(x_t)\|^2.$$

Theorem 2.7. Let $f : \mathbb{R}^d \to \mathbb{R}$ be convex and differentiable with a global minimum $x^\star$; furthermore, suppose that $f$ is smooth with parameter $L$ according to (2.8). Choosing stepsize
$$\gamma := \frac{1}{L},$$
gradient descent (2.1) yields
$$f(x_T) - f(x^\star) \le \frac{L}{2T} \|x_0 - x^\star\|^2, \quad T > 0.$$

Proof. We apply sufficient decrease (Lemma 2.6) to bound the sum of the $\|g_t\|^2 = \|\nabla f(x_t)\|^2$ after step (2.6) of the vanilla analysis as follows:
$$\frac{1}{2L} \sum_{t=0}^{T-1} \|\nabla f(x_t)\|^2 \le \sum_{t=0}^{T-1} \left( f(x_t) - f(x_{t+1}) \right) = f(x_0) - f(x_T). \tag{2.9}$$
With $\gamma = 1/L$, (2.6) then yields
$$\sum_{t=0}^{T-1} \left( f(x_t) - f(x^\star) \right) \le \frac{1}{2L} \sum_{t=0}^{T-1} \|\nabla f(x_t)\|^2 + \frac{L}{2} \|x_0 - x^\star\|^2 \le f(x_0) - f(x_T) + \frac{L}{2} \|x_0 - x^\star\|^2,$$
equivalently
$$\sum_{t=1}^{T} \left( f(x_t) - f(x^\star) \right) \le \frac{L}{2} \|x_0 - x^\star\|^2. \tag{2.10}$$
Because $f(x_{t+1}) \le f(x_t)$ for each $0 \le t \le T$ by Lemma 2.6, taking the average we get that
$$f(x_T) - f(x^\star) \le \frac{1}{T} \sum_{t=1}^{T} \left( f(x_t) - f(x^\star) \right) \le \frac{L}{2T} \|x_0 - x^\star\|^2.$$

This improves over the bound of Theorem 2.1. With $R^2 := \|x_0 - x^\star\|^2$, we now only need
$$T \ge \frac{R^2 L}{2\varepsilon}$$
iterations instead of $R^2 B^2 / \varepsilon^2$ to achieve absolute error at most $\varepsilon$.

Exercise 15 shows that we do not need to know $L$ to obtain the same asymptotic runtime.

2.6 Interlude
Let us get back to the supermodel f (x) = x2 (that is smooth with param-
eter L = 2, as we observed before). According to Theorem 2.7, gradient
descent (2.1) with stepsize = 1/2 satisfies
1 2
f (xT ) 
x. (2.11)
T 0
Here we used that the minimizer is x? = 0. Let us check how good this
bound really is. For our concrete function and concrete stepsize, (2.1) reads
as
1
xt+1 = xt rf (xt ) = xt xt = 0,
2
41
so we are always done after one step! But we will see in the next section
that this is only because the function is particularly beautiful, and on top of
that, we have picked the best possible smoothness parameter. To simulate
a more realistic situation here, let us assume that we have not looked at the
supermodel too closely and found it to be smooth with parameter L = 4
only (which is a suboptimal but still valid parameter). In this case, γ = 1/4
and (2.1) becomes

    x_{t+1} = x_t − (1/4)∇f(x_t) = x_t − x_t/2 = x_t/2.

So, we in fact have

    f(x_T) = f(x_0/2^T) = x_0²/2^{2T}.   (2.12)

This is still vastly better than the bound of (2.11)! While (2.11) requires
T ≈ x_0²/ε to achieve f(x_T) ≤ ε, (2.12) requires only

    T ≈ (1/2) log(x_0²/ε),

which is an exponential improvement in the number of steps.
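The effect is easy to reproduce numerically. Below is a minimal Python sketch
(ours, not part of the original notes) that runs gradient descent on the supermodel
f(x) = x² with the two stepsizes γ = 1/2 and γ = 1/4 discussed above.

```python
# Gradient descent on the "supermodel" f(x) = x^2, with f'(x) = 2x.
def gradient_descent(x0, stepsize, num_steps):
    x = x0
    iterates = [x]
    for _ in range(num_steps):
        x = x - stepsize * 2 * x   # x_{t+1} = x_t - gamma * f'(x_t)
        iterates.append(x)
    return iterates

# Stepsize 1/2 (smoothness parameter L = 2): done after one step.
print(gradient_descent(x0=1.0, stepsize=0.5, num_steps=3))   # [1.0, 0.0, 0.0, 0.0]

# Stepsize 1/4 (pessimistic parameter L = 4): x_t = x_0 / 2^t, geometric decrease.
print(gradient_descent(x0=1.0, stepsize=0.25, num_steps=3))  # [1.0, 0.5, 0.25, 0.125]
```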

2.7 Smooth and strongly convex functions: O(log(1/ε)) steps
The supermodel function f (x) = x2 is not only smooth (“not too curved”)
but also strongly convex (“not too flat”). It will turn out that this is the
crucial ingredient that makes gradient descent fast.

Definition 2.8. Let f : dom(f) → R be a convex and differentiable function,
X ⊆ dom(f) convex and μ ∈ R_+, μ > 0. Function f is called strongly convex
(with parameter μ) over X if

    f(y) ≥ f(x) + ∇f(x)ᵀ(y − x) + (μ/2)‖x − y‖²,   ∀x, y ∈ X.   (2.13)

If X = dom(f), f is simply called strongly convex.

[Figure: the graph of f(y) wedged between the upper paraboloid
f(x) + ∇f(x)ᵀ(y − x) + (L/2)‖x − y‖² and the lower paraboloid
f(x) + ∇f(x)ᵀ(y − x) + (μ/2)‖x − y‖².]

Figure 2.3: A smooth and strongly convex function

While smoothness according to (2.8) says that for any x 2 X, the graph
of f is below a not-too-steep tangential paraboloid at (x, f (x)), strong con-
vexity means that the graph of f is above a not-too-flat tangential paraboloid
at (x, f (x)). The graph of a smooth and strongly convex function is there-
fore at every point wedged between two paraboloids; see Figure 2.3.
We can also interpret (2.13) as a strengthening of convexity. In the form
of (2.7), convexity reads as

    f(y) ≥ f(x) + ∇f(x)ᵀ(y − x),   ∀x, y ∈ dom(f),

and therefore says that every convex function satisfies (2.13) with μ = 0.
Lemma 2.9 (Exercise 17). If f : Rd ! R is strongly convex with parameter
µ > 0, then f is strictly convex and has a unique global minimum.
The supermodel f (x) = x2 is particularly beautiful since it is both
smooth and strongly convex with the same parameter L = µ = 2 (go-
ing through the calculations in Exercise 11 will reveal this). We can easily

characterize the class of particularly beautiful functions. These are exactly
the ones whose sublevel sets are ℓ2-balls.

Lemma 2.10 (Exercise 18). Let f : R^d → R be strongly convex with parameter
μ > 0 and smooth with parameter μ. Prove that f is of the form

    f(x) = (μ/2)‖x − b‖² + c,

where b ∈ R^d, c ∈ R.
Once we have a unique global minimum x*, we can attempt to prove
that lim_{t→∞} x_t = x* in gradient descent. We start from the vanilla analysis
(2.3) and plug in the lower bound g_tᵀ(x_t − x*) = ∇f(x_t)ᵀ(x_t − x*) ≥ f(x_t) −
f(x*) + (μ/2)‖x_t − x*‖² resulting from strong convexity. We get

    f(x_t) − f(x*) ≤ (γ/2)‖∇f(x_t)‖² + (1/(2γ))(‖x_t − x*‖² − ‖x_{t+1} − x*‖²) − (μ/2)‖x_t − x*‖².   (2.14)

Rewriting this yields a bound on ‖x_{t+1} − x*‖² in terms of ‖x_t − x*‖², along
with some “noise” that we still need to take care of:

    ‖x_{t+1} − x*‖² ≤ 2γ(f(x*) − f(x_t)) + γ²‖∇f(x_t)‖² + (1 − γμ)‖x_t − x*‖².   (2.15)

Theorem 2.11. Let f : R^d → R be convex and differentiable. Suppose that f is
smooth with parameter L according to (3.5) and strongly convex with parameter
μ > 0 according to (3.9). Exercise 20 asks you to prove that there is a unique
global minimum x* of f. Choosing

    γ := 1/L,

gradient descent (2.1) with arbitrary x_0 satisfies the following two properties.

(i) Squared distances to x* are geometrically decreasing:

    ‖x_{t+1} − x*‖² ≤ (1 − μ/L)‖x_t − x*‖²,   t ≥ 0.

(ii) The absolute error after T iterations is exponentially small in T:

    f(x_T) − f(x*) ≤ (L/2)(1 − μ/L)^T ‖x_0 − x*‖²,   T > 0.
Proof. For (i), we show that the noise in (2.15) disappears. By sufficient
decrease (Lemma 2.6), we know that

    f(x*) − f(x_t) ≤ f(x_{t+1}) − f(x_t) ≤ −(1/(2L))‖∇f(x_t)‖²,

and hence the noise can be bounded as follows: using γ = 1/L, multiplying
by 2γ and rearranging the terms, we get

    2γ(f(x*) − f(x_t)) + γ²‖∇f(x_t)‖² ≤ 0.

Hence, (2.15) actually yields

    ‖x_{t+1} − x*‖² ≤ (1 − γμ)‖x_t − x*‖² = (1 − μ/L)‖x_t − x*‖²

and

    ‖x_T − x*‖² ≤ (1 − μ/L)^T ‖x_0 − x*‖².

The bound in (ii) follows from smoothness (2.8), using ∇f(x*) = 0
(Lemma 1.17):

    f(x_T) − f(x*) ≤ ∇f(x*)ᵀ(x_T − x*) + (L/2)‖x_T − x*‖² = (L/2)‖x_T − x*‖².

This implies that after

    T ≥ (L/μ) ln(R²L/(2ε))

iterations, we reach absolute error at most ε.

2.8 Exercises
Exercise 10. Let c ∈ R^d. Prove that the spectral norm of cᵀ equals the Euclidean
norm of c, meaning that

    max_{x≠0} |cᵀx|/‖x‖ = ‖c‖.

Exercise 11. Prove Lemma 2.3: The quadratic function f(x) = xᵀQx + bᵀx + c
is smooth with parameter 2‖Q‖.
Exercise 12. Consider the function f (x) = |x|3/2 for x 2 R.
(i) Prove that f is strictly convex and differentiable, with a unique global min-
imum x? = 0.
(ii) Prove that for every fixed stepsize in gradient descent (2.1) applied to f ,
there exists x0 for which f (x1 ) > f (x0 ).
(iii) Prove that f is not smooth.
(iv) Let X ✓ R be a closed convex set such that 0 2 X and X 6= {0}. Prove
that f is not smooth over X.
Exercise 13. In order to obtain average error at most " in Theorem 2.1, we need
to choose iteration number and step size as
    T ≥ (RB/ε)²,   γ := R/(B√T).
If R or B are unknown, we cannot do this.
But suppose that we know R. Develop an algorithm that—not knowing B—
finds a vector x such that f (x) f (x? ) < ", using at most
    O((RB/ε)²)
many gradient descent steps!
Exercise 14. Prove Lemma 2.5! (Operations which preserve smoothness)
Exercise 15. In order to obtain average error at most " in Theorem 2.7, we need
to choose
    γ := 1/L,   T ≥ R²L/(2ε),
if kx0 x? k  R. If L is unknown, we cannot do this.
But suppose that we know R. Develop an algorithm that—not knowing L—
finds a vector x such that f (x) f (x? ) < ", using at most
    O(R²L/(2ε))
many gradient descent steps!

Exercise 16. Let a ∈ R. Prove that f(x) = x⁴ is smooth over X = (−a, a) and
determine a concrete smoothness parameter L.

Exercise 17. Prove Lemma 2.9! (Strongly convex functions have unique global
minimum)

Exercise 18. Prove Lemma 2.10! (Strongly convex and smooth functions)

Chapter 3

Projected and Proximal Gradient


Descent

Contents
3.1 The Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.2 Bounded gradients: O(1/"2 ) steps . . . . . . . . . . . . . . . . 49
3.3 Smooth convex functions: O(1/") steps . . . . . . . . . . . . 51
3.4 Smooth and strongly convex functions: O(log(1/")) steps . . 54
3.5 Projecting onto `1 -balls . . . . . . . . . . . . . . . . . . . . . . 56
3.6 Proximal gradient descent . . . . . . . . . . . . . . . . . . . . 60
3.6.1 The proximal gradient algorithm . . . . . . . . . . . . 61
3.6.2 Convergence in O(1/") steps . . . . . . . . . . . . . . 62
3.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63

3.1 The Algorithm
Another way to control gradients in (2.4) is to minimize f over a closed
convex subset X ✓ Rd . For example, we may have a constrained opti-
mization problem to begin with (for example the LASSO in Section 1.6.2),
or we happen to know some region X containing a global minimum x? , so
that we can restrict our search to that region. In this case, gradient descent
also works, but we need an additional projection step. After all, it can hap-
pen that some iteration of (2.1) takes us “into the wild” (out of X) where
we have no business to do. Projected gradient descent is the following
modification. We choose x_0 ∈ X arbitrary and for t ≥ 0 define

    y_{t+1} := x_t − γ∇f(x_t),   (3.1)
    x_{t+1} := Π_X(y_{t+1}) := argmin_{x∈X} ‖x − y_{t+1}‖².   (3.2)

This means, after each iteration, we project the obtained iterate yt+1 back
to X. This may be very easy (think of X as the unit ball in which case we
just have to scale yt+1 down to length 1 if it is longer). But it may also be
very difficult. In general, computing ⇧X (yt+1 ) means to solve an auxiliary
convex constrained minimization problem in each step! Here, we are just
assuming that we can do this. The projection is well-defined since dy :=
kx yk2 has bounded sublevel sets. Moreover, dy (x) is strictly convex, so
the minimum over X (that exists by continuity of dy and compactness of X
intersected with any nonempty sublevel set) is unique by Lemma 1.20. We
note that finding an initial x0 2 X also reduces to projection (of 0, for
example) onto X.
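For illustration, here is a minimal Python sketch (ours) of the update (3.1)–(3.2)
in the easy case where X is a Euclidean ball, so that the projection is just a
rescaling; the concrete objective and radius below are our own choices.

```python
import numpy as np

def project_onto_ball(y, radius=1.0):
    # Projection onto the Euclidean ball {x : ||x|| <= radius}:
    # scale y down if it is too long, otherwise keep it.
    norm = np.linalg.norm(y)
    return y if norm <= radius else (radius / norm) * y

def projected_gradient_descent(grad, x0, stepsize, num_steps, project):
    x = project(x0)                      # make sure we start inside X
    for _ in range(num_steps):
        y = x - stepsize * grad(x)       # plain gradient step (3.1)
        x = project(y)                   # projection step (3.2)
    return x

# Example: minimize f(x) = ||x - c||^2 over the unit ball, with c outside the ball.
c = np.array([2.0, 1.0])
grad = lambda x: 2 * (x - c)             # f is smooth with L = 2
x_final = projected_gradient_descent(grad, np.zeros(2), stepsize=0.5, num_steps=50,
                                     project=project_onto_ball)
print(x_final, c / np.linalg.norm(c))    # both close to the projection of c onto the ball
```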

3.2 Bounded gradients: O(1/"2) steps


As in the unconstrained case, let us first assume that gradients are bounded
by a constant B—this time over X. This implies that f is B-Lipschitz over
X (see Theorem 1.10), but the converse may not hold.
To show that the vanilla analysis still goes through, we need the fol-
lowing
Fact 3.1. Let X ⊆ R^d be closed and convex, x ∈ X, y ∈ R^d. Then

(i) (x − Π_X(y))ᵀ(y − Π_X(y)) ≤ 0.

(ii) ‖x − Π_X(y)‖² + ‖y − Π_X(y)‖² ≤ ‖x − y‖².

Part (i) says that the vectors x − Π_X(y) and y − Π_X(y) form an obtuse
angle, and (ii) equivalently says that the square of the long side x − y in
the triangle formed by the three points is at least the sum of squares of the
two short sides; see Figure 3.1.

[Figure: the point y outside X, its projection Π_X(y) ∈ X, and a point x ∈ X,
with an angle of at least 90° at Π_X(y).]

Figure 3.1: Illustration of Fact 3.1

Proof. Π_X(y) is by definition a minimizer of the (differentiable) convex
function d_y(x) = ‖x − y‖² over X, and (i) is just the equivalent optimality
condition of Lemma 1.22. We need X to be closed in the first place in order
to ensure that we can project onto X (see Exercise 20 applied with d_y(x)).
Indeed, for example, 1 has no closest point in the set [−1, 0) ⊆ R.
Part (ii) follows from (i) via the (by now well-known) equation 2vᵀw =
‖v‖² + ‖w‖² − ‖v − w‖².
If we minimize f over a closed and bounded (= compact) convex set X,
we get the existence of a minimizer and a bound R for the initial distance
to it for free; assuming that f is continuously differentiable, we also have a
bound B for the gradient norms over X. This is because then x 7! krf (x)k
is a continuous function that attains a maximum over X. In this case, our
vanilla analysis yields a much more useful result than the one in Theo-
rem 2.1, with the same stepsize and the same number of steps.

Theorem 3.2. Let f : R^d → R be convex and differentiable, X ⊆ R^d closed and
convex, x* a minimizer of f over X; furthermore, suppose that ‖x_0 − x*‖ ≤ R,
and that ‖∇f(x)‖ ≤ B for all x ∈ X. Choosing the constant stepsize

    γ := R/(B√T),

projected gradient descent (3.1) with x_0 ∈ X yields

    (1/T) Σ_{t=0}^{T−1} (f(x_t) − f(x*)) ≤ RB/√T.

Proof. The only required changes to the vanilla analysis are that in steps
(2.2) and (2.3), x_{t+1} needs to be replaced by y_{t+1} as this is the real next
(non-projected) gradient descent iterate after these steps; we therefore get

    g_tᵀ(x_t − x*) = (γ/2)‖g_t‖² + (1/(2γ))(‖x_t − x*‖² − ‖y_{t+1} − x*‖²).   (3.3)

From Fact 3.1 (ii) (with x = x*, y = y_{t+1}), we obtain ‖x_{t+1} − x*‖² ≤ ‖y_{t+1} − x*‖²,
hence we get

    g_tᵀ(x_t − x*) ≤ (γ/2)‖g_t‖² + (1/(2γ))(‖x_t − x*‖² − ‖x_{t+1} − x*‖²)   (3.4)

and return to the previous vanilla analysis for the remainder of the proof.

3.3 Smooth convex functions: O(1/") steps


We recall from Definition 2.2 that f is smooth over X if

    f(y) ≤ f(x) + ∇f(x)ᵀ(y − x) + (L/2)‖x − y‖²,   ∀x, y ∈ X.   (3.5)
To minimize f over X, we use projected gradient descent again. The
runtime turns out to be the same as in the unconstrained case. Again, we
have sufficient decrease. This is not obvious from the following lemma,
but you are asked to prove it in Exercise 19.

Lemma 3.3. Let f : R^d → R be differentiable and smooth with parameter L over
X according to (3.5). Choosing stepsize

    γ := 1/L,

projected gradient descent (3.1) with arbitrary x_0 ∈ X satisfies

    f(x_{t+1}) ≤ f(x_t) − (1/(2L))‖∇f(x_t)‖² + (L/2)‖y_{t+1} − x_{t+1}‖²,   t ≥ 0.

More specifically, this already holds if f is smooth with parameter L over the line
segment connecting x_t and x_{t+1}.
Proof. We proceed similarly to the proof of the “unconstrained” sufficient
decrease Lemma 2.6, except that we now need to deal with projected gra-
dient descent. We again start from smoothness but then use y_{t+1} = x_t −
∇f(x_t)/L, followed by the usual equation 2vᵀw = ‖v‖² + ‖w‖² − ‖v − w‖²:

    f(x_{t+1}) ≤ f(x_t) + ∇f(x_t)ᵀ(x_{t+1} − x_t) + (L/2)‖x_t − x_{t+1}‖²
              = f(x_t) − L(y_{t+1} − x_t)ᵀ(x_{t+1} − x_t) + (L/2)‖x_t − x_{t+1}‖²
              = f(x_t) − (L/2)(‖y_{t+1} − x_t‖² + ‖x_{t+1} − x_t‖² − ‖y_{t+1} − x_{t+1}‖²)
                + (L/2)‖x_t − x_{t+1}‖²
              = f(x_t) − (L/2)‖y_{t+1} − x_t‖² + (L/2)‖y_{t+1} − x_{t+1}‖²
              = f(x_t) − (1/(2L))‖∇f(x_t)‖² + (L/2)‖y_{t+1} − x_{t+1}‖².

Theorem 3.4. Let f : R^d → R be convex and differentiable. Let X ⊆ R^d be
a closed convex set, and assume that there is a minimizer x* of f over X; fur-
thermore, suppose that f is smooth over X with parameter L according to (3.5).
Choosing stepsize

    γ := 1/L,

projected gradient descent (3.1) with x_0 ∈ X satisfies

    f(x_T) − f(x*) ≤ (L/(2T))‖x_0 − x*‖²,   T > 0.
Proof. The plan is as in the proof of Theorem 2.7 to use the inequality

    (1/(2L))‖∇f(x_t)‖² ≤ f(x_t) − f(x_{t+1}) + (L/2)‖y_{t+1} − x_{t+1}‖²   (3.6)

resulting from sufficient decrease (Lemma 3.3) to bound the squared gra-
dient ‖g_t‖² = ‖∇f(x_t)‖² in the vanilla analysis. Unfortunately, (3.6) has
an extra term compared to what we got in the unconstrained case. But we
can compensate for this in the vanilla analysis itself. Let us go back to its
“constrained” version (3.3), featuring y_{t+1} instead of x_{t+1}:

    g_tᵀ(x_t − x*) = (γ/2)‖g_t‖² + (1/(2γ))(‖x_t − x*‖² − ‖y_{t+1} − x*‖²).

Previously, we applied ‖x_{t+1} − x*‖² ≤ ‖y_{t+1} − x*‖² (Fact 3.1(ii)) to get back
on the unconstrained vanilla track. But in doing so, we dropped a term
that now becomes useful. Indeed, Fact 3.1(ii) actually yields ‖x_{t+1} − x*‖² +
‖y_{t+1} − x_{t+1}‖² ≤ ‖y_{t+1} − x*‖², so that we get the following upper bound
for g_tᵀ(x_t − x*):

    (γ/2)‖g_t‖² + (1/(2γ))(‖x_t − x*‖² − ‖x_{t+1} − x*‖² − ‖y_{t+1} − x_{t+1}‖²).   (3.7)

Using f(x_t) − f(x*) ≤ g_tᵀ(x_t − x*) from convexity, we have (with γ = 1/L)
that

    Σ_{t=0}^{T−1} (f(x_t) − f(x*)) ≤ Σ_{t=0}^{T−1} g_tᵀ(x_t − x*)   (3.8)
        ≤ (1/(2L)) Σ_{t=0}^{T−1} ‖g_t‖² + (L/2)‖x_0 − x*‖² − (L/2) Σ_{t=0}^{T−1} ‖y_{t+1} − x_{t+1}‖².

To bound the sum of the squared gradients, we use (3.6):

    (1/(2L)) Σ_{t=0}^{T−1} ‖g_t‖² ≤ Σ_{t=0}^{T−1} (f(x_t) − f(x_{t+1}) + (L/2)‖y_{t+1} − x_{t+1}‖²)
        = f(x_0) − f(x_T) + (L/2) Σ_{t=0}^{T−1} ‖y_{t+1} − x_{t+1}‖².

Plugging this into (3.8), the extra terms cancel, and we arrive, as in the
unconstrained case, at

    Σ_{t=1}^{T} (f(x_t) − f(x*)) ≤ (L/2)‖x_0 − x*‖².

The statement follows as in the proof of Theorem 2.7 from the fact that due
to sufficient decrease (Exercise 19), the last iterate is the best one.

3.4 Smooth and strongly convex functions: O(log(1/ε)) steps
Assuming that f is smooth and strongly convex over a set X, we can also
prove fast convergence of projected gradient descent. This does not re-
quire any new ideas, we have seen all the ingredients before.
We recall from Definition 2.8 that f is strongly convex with parameter
μ > 0 over X if

    f(y) ≥ f(x) + ∇f(x)ᵀ(y − x) + (μ/2)‖x − y‖²,   ∀x, y ∈ X.   (3.9)
Theorem 3.5. Let f : R^d → R be convex and differentiable. Let X ⊆ R^d
be a nonempty closed and convex set and suppose that f is smooth over X with
parameter L according to (3.5) and strongly convex over X with parameter μ > 0
according to (3.9). Exercise 20 asks you to prove that there is a unique minimizer
x* of f over X. Choosing

    γ := 1/L,

projected gradient descent (3.1) with arbitrary x_0 satisfies the following two prop-
erties.

(i) Squared distances to x* are geometrically decreasing:

    ‖x_{t+1} − x*‖² ≤ (1 − μ/L)‖x_t − x*‖²,   t ≥ 0.

(ii) The absolute error after T iterations is exponentially small in T:

    f(x_T) − f(x*) ≤ ‖∇f(x*)‖(1 − μ/L)^{T/2}‖x_0 − x*‖ + (L/2)(1 − μ/L)^T ‖x_0 − x*‖²,   T > 0.
We note that this is almost the same result as in Theorem 2.11 for the
unconstrained case; in fact, the result in part (i) is identical, but in part (ii),
we get an additional term. This is due to the fact that in the constrained
case, we cannot argue that rf (x? ) = 0. In fact, this additional term is the
dominating one, once the error becomes small. It has the effect that the
required number of steps to reach error at most " will roughly double, in
comparison to the bound of Theorem 2.11.
Proof. In the strongly convex case, the “constrained” vanilla bound (3.7)

    (γ/2)‖∇f(x_t)‖² + (1/(2γ))(‖x_t − x*‖² − ‖x_{t+1} − x*‖² − ‖y_{t+1} − x_{t+1}‖²)

on f(x_t) − f(x*) can be strengthened to

    (γ/2)‖∇f(x_t)‖² + (1/(2γ))(‖x_t − x*‖² − ‖x_{t+1} − x*‖² − ‖y_{t+1} − x_{t+1}‖²) − (μ/2)‖x_t − x*‖².   (3.10)

Now we proceed as in the proof of Theorem 2.11 and rewrite the latter
bound into a bound on ‖x_{t+1} − x*‖², namely

    ‖x_{t+1} − x*‖² ≤ 2γ(f(x*) − f(x_t)) + γ²‖∇f(x_t)‖² − ‖y_{t+1} − x_{t+1}‖² + (1 − γμ)‖x_t − x*‖²,

so we have geometric decrease in squared distance to x*, up to some noise.
Again, we show that by sufficient decrease, the noise in this bound disap-
pears. From Lemma 3.3, we know that

    f(x*) − f(x_t) ≤ f(x_{t+1}) − f(x_t) ≤ −(1/(2L))‖∇f(x_t)‖² + (L/2)‖y_{t+1} − x_{t+1}‖²,

and using this, the noise can be bounded. Multiplying the previous in-
equality by 2/L and rearranging the terms, we get

    (2/L)(f(x*) − f(x_t)) + (1/L²)‖∇f(x_t)‖² − ‖y_{t+1} − x_{t+1}‖² ≤ 0.

With γ = 1/L, this exactly shows that the noise is nonpositive. This yields
(i). The bound in (ii) follows from smoothness (2.8):

    f(x_T) − f(x*) ≤ ∇f(x*)ᵀ(x_T − x*) + (L/2)‖x* − x_T‖²
                  ≤ ‖∇f(x*)‖ ‖x_T − x*‖ + (L/2)‖x* − x_T‖²   (Cauchy-Schwarz)
                  ≤ ‖∇f(x*)‖(1 − μ/L)^{T/2}‖x_0 − x*‖ + (L/2)(1 − μ/L)^T ‖x_0 − x*‖².

3.5 Projecting onto `1-balls
Problems that are `1 -regularized appear among the most commonly used
models in machine learning and signal processing, and we have already
discussed the Lasso as an important example of that class. We will now
address how to perform projected gradient as an efficient optimization for
`1 -constrained problems. Let

    X = B_1(R) := {x ∈ R^d : ‖x‖_1 = Σ_{i=1}^d |x_i| ≤ R}

be the `1 -ball of radius R > 0 around 0, i.e., the set of all points with 1-
norm at most R. Our goal is to compute ⇧X (v) for a given vector v, i.e. the
projection of v onto X; see Figure 3.2.

[Figure: the ℓ1-ball X = B_1(R), a point v outside of it, and its projection Π_X(v).]

Figure 3.2: Projecting onto an ℓ1-ball

At first sight, this may look like a rather complicated task. Geometri-
cally, X is a cross polytope (square for d = 2, octahedron for d = 3), and as
such it has 2d many facets. But we can start with some basic simplifying
observations.

Fact 3.6. We may assume without loss of generality that (i) R = 1, (ii) v_i ≥ 0 for
all i, and (iii) Σ_{i=1}^d v_i > 1.

Proof. If we project v/R onto B_1(1), we obtain Π_X(v)/R (just scale Fig-
ure 3.2), so we can restrict to the case R = 1. For (ii), we observe that
simultaneously flipping the signs of a fixed subset of coordinates in both
v and x ∈ X yields vectors v′ and x′ ∈ X such that ‖x − v‖ = ‖x′ − v′‖;
thus, x minimizes the distance to v if and only if x′ minimizes the distance
to v′. Hence, it suffices to compute Π_X(v) for vectors with nonnegative
entries. If Σ_{i=1}^d v_i ≤ 1, we have Π_X(v) = v and are done, so the interesting
case is (iii).
Fact 3.7. Under the assumptions of Fact 3.6, x = Π_X(v) satisfies x_i ≥ 0 for all i
and Σ_{i=1}^d x_i = 1.

Proof. If x_i < 0 for some i, then (−x_i − v_i)² ≤ (x_i − v_i)² (since v_i ≥ 0),
so flipping the i-th sign in x would yield another vector in X at least as
close to v as x, but such a vector cannot exist by strict convexity of the
squared distance. And if Σ_{i=1}^d x_i < 1, then x′ = x + λ(v − x) ∈ X for some
small positive λ, with ‖x′ − v‖ = (1 − λ)‖x − v‖, again contradicting the
optimality of x.
Corollary 3.8. Under the assumptions of Fact 3.6,

    Π_X(v) = argmin_{x∈Δ_d} ‖x − v‖²,

where

    Δ_d := {x ∈ R^d : Σ_{i=1}^d x_i = 1, x_i ≥ 0 ∀i}

is the standard simplex.
This means, we have reduced the projection onto an `1 -ball to the pro-
jection onto the standard simplex; see Figure 3.3.
To address the latter task, we make another assumption that can be
established by suitably permuting the entries of v (which just permutes
the entries of its projection onto Δ_d in the same way).

Fact 3.9. We may assume without loss of generality that v_1 ≥ v_2 ≥ · · · ≥ v_d.

Lemma 3.10. Let x* := argmin_{x∈Δ_d} ‖x − v‖². Under the assumption of Fact 3.9,
there exists (a unique) p ∈ {1, . . . , d} such that

    x*_i > 0,  i ≤ p,
    x*_i = 0,  i > p.

[Figure: the standard simplex Δ_d, a point v outside of it, and its projection Π_X(v).]

Figure 3.3: Projecting onto the standard simplex

Proof. We are using the optimality criterion of Lemma 1.22:

    ∇d_v(x*)ᵀ(x − x*) = 2(x* − v)ᵀ(x − x*) ≥ 0,   x ∈ Δ_d,   (3.11)

where d_v(z) := ‖z − v‖² is the squared distance to v.
Because Σ_{i=1}^d x*_i = 1, there is at least one positive entry in x*. It remains
to show that we cannot have x*_i = 0 and x*_{i+1} > 0. Indeed, in this situa-
tion, we could decrease x*_{i+1} by some small positive ε and simultaneously
increase x*_i to ε to obtain a vector x ∈ Δ_d such that

    (x* − v)ᵀ(x − x*) = (0 − v_i)ε − (x*_{i+1} − v_{i+1})ε = ε(v_{i+1} − v_i − x*_{i+1}) < 0,

since v_{i+1} − v_i ≤ 0 and x*_{i+1} > 0, contradicting the optimality (3.11).


But we can say even more about x? .

Lemma 3.11. Under the assumption of Fact 3.9, and with p as in Lemma 3.10,

    x*_i = v_i − Θ_p,   i ≤ p,

where

    Θ_p = (1/p)(Σ_{i=1}^p v_i − 1).

Proof. Suppose x*_i − v_i < x*_j − v_j for some i, j ≤ p. As before, we could then
decrease x*_j > 0 by some small positive ε and simultaneously increase x*_i
by ε to obtain x ∈ Δ_d such that

    (x* − v)ᵀ(x − x*) = (x*_i − v_i)ε − (x*_j − v_j)ε = ε((x*_i − v_i) − (x*_j − v_j)) < 0,

again contradicting (3.11). Hence x*_i − v_i has the same value for all i ≤ p;
call this common value −Θ_p. The expression for Θ_p is then obtained from

    1 = Σ_{i=1}^p x*_i = Σ_{i=1}^p (v_i − Θ_p) = Σ_{i=1}^p v_i − pΘ_p.

Let us summarize the situation: we now have d candidates for x*,
namely the vectors

    x*(p) := (v_1 − Θ_p, . . . , v_p − Θ_p, 0, . . . , 0),   p ∈ {1, . . . , d},   (3.12)

and we just need to find the right one. In order for candidate x*(p) to
comply with Lemma 3.10, we must have

    v_p − Θ_p > 0,   (3.13)

and this actually ensures x*(p)_i > 0 for all i ≤ p by the assumption of
Fact 3.9 and therefore x*(p) ∈ Δ_d. But there could still be several values of
p satisfying (3.13). Among them, we simply pick the one for which x*(p)
minimizes the distance to v. It is not hard to see that this can be done in
time O(d log d), by first sorting v and then carefully updating the values
Θ_p and ‖x*(p) − v‖² as we vary p to check all candidates.
But actually, there is an even simpler criterion that saves us from com-
paring distances.
Lemma 3.12. Under the assumption of Fact 3.9, with x*(p) as in (3.12), and
with

    p* := max{ p ∈ {1, . . . , d} : v_p − (1/p)(Σ_{i=1}^p v_i − 1) > 0 },

it holds that

    argmin_{x∈Δ_d} ‖x − v‖² = x*(p*).

The proof is Exercise 21. Together with our previous reductions, we
obtain the following result.
Theorem 3.13. Let v ∈ R^d, R ∈ R_+, and let X = B_1(R) be the ℓ1-ball around 0 of
radius R. The projection

    Π_X(v) = argmin_{x∈X} ‖x − v‖²

of v onto B_1(R) can be computed in time O(d log d).


This can be improved to time O(d), based on the observation that a
given p can be compared to the value p? in Lemma 3.12 in linear time,
without the need to presort v [DSSSC08].
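The reductions above translate directly into a short algorithm. The following
Python sketch is our own rendering of the O(d log d) method (sorting, then
applying Facts 3.6 to 3.9 and Lemma 3.12); function names are ours.

```python
import numpy as np

def project_onto_simplex(v):
    # Projection of a nonnegative vector v with sum(v) > 1 onto the
    # standard simplex, following Lemma 3.12.
    d = len(v)
    u = np.sort(v)[::-1]                         # v sorted in decreasing order (Fact 3.9)
    cumsum = np.cumsum(u)
    p_candidates = np.arange(1, d + 1)
    # p* = largest p with u_p - (sum_{i<=p} u_i - 1)/p > 0
    p_star = p_candidates[u - (cumsum - 1) / p_candidates > 0][-1]
    theta = (cumsum[p_star - 1] - 1) / p_star    # Theta_{p*} from Lemma 3.11
    return np.maximum(v - theta, 0.0)

def project_onto_l1_ball(v, radius=1.0):
    if np.sum(np.abs(v)) <= radius:              # already inside the ball
        return v.copy()
    # Reduce to the simplex: work with |v|/R, then restore signs and scale (Fact 3.6).
    x = project_onto_simplex(np.abs(v) / radius)
    return np.sign(v) * x * radius

v = np.array([0.8, -2.0, 0.3])
x = project_onto_l1_ball(v, radius=1.0)
print(x, np.sum(np.abs(x)))                      # lies on the boundary: 1-norm equals 1
```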

3.6 Proximal gradient descent


Many optimization problems in applications come with additional struc-
ture. An important class of objective functions is composed as

    f(x) := g(x) + h(x),   (3.14)

where g is a “nice” function, whereas h is a “simple” additional term
which, however, doesn’t satisfy the assumptions of niceness which we used
in the convergence analysis so far. In particular, an important case is
when h is not differentiable.
The classical gradient step for unconstrained minimization of a func-
tion g can be equivalently written as

    x_{t+1} = argmin_{y∈R^d} { g(x_t) + ∇g(x_t)ᵀ(y − x_t) + (1/(2γ))‖y − x_t‖² }   (3.15)
            = argmin_{y∈R^d} (1/(2γ))‖y − (x_t − γ∇g(x_t))‖².   (3.16)

To obtain the last equality, we have just completed the quadratic ‖v‖² +
2vᵀw + ‖w‖² = ‖v + w‖² for v := γ∇g(x_t) and w := y − x_t. Here it is
crucial that v is independent of the optimization variable y, so the terms
that do not involve y can be ignored when taking the argmin. The scaling
by 1/(2γ) is also irrelevant but we keep it for better illustrating the next step.

The interpretation of the above equivalent reformulation of the classic
gradient step is important for us, and is what has enabled the previous
convergence analysis in Section 2.5 for smooth unconstrained optimiza-
tion: For the particular choice of stepsize γ := 1/L which we have used,
the above formulation shows that the gradient descent step exactly min-
imizes the local quadratic model of g at our current iterate x_t, formed by
the smoothness property with parameter L as defined in (2.8).

Our goal in this section is to minimize f = g + h, instead of only the


smooth part g alone. The idea of the proximal gradient method is to mod-
ify the simple quadratic model (3.15) above, so as to make it a valid model
for f , that is a model which upper bounds f at all points. The simplest way
to do this is to just treat the h function separately by adding it unmodified.
We obtain the update equation for proximal gradient descent

    x_{t+1} := argmin_{y∈R^d} { g(x_t) + ∇g(x_t)ᵀ(y − x_t) + (1/(2γ))‖y − x_t‖² + h(y) }   (3.17)
             = argmin_y { (1/(2γ))‖y − (x_t − γ∇g(x_t))‖² + h(y) }.   (3.18)

The last formulation makes clear that the resulting update tries to com-
bine the two goals of staying close to the classic gradient update while
also minimizing h.

3.6.1 The proximal gradient algorithm


We define the proximal mapping for a given function h and parameter γ > 0:

    prox_{h,γ}(z) := argmin_y { (1/(2γ))‖y − z‖² + h(y) }.

An iteration of proximal gradient descent is defined as

    x_{t+1} := prox_{h,γ}(x_t − γ∇g(x_t)).   (3.19)

This same update step can also be written in a different form as

    x_{t+1} = x_t − γ G_{h,γ}(x_t)   (3.20)

for G_{h,γ}(x) := (1/γ)(x − prox_{h,γ}(x − γ∇g(x))) being the so-called generalized
gradient of f.

A generalization of gradient descent. The proximal gradient descent
method (3.19) is also known as generalized gradient descent. In the special
case h ⌘ 0, we of course recover classic gradient descent.
More interestingly, it is also a generalization of projected gradient de-
scent as we have discussed in the previous sections. Given a closed convex
set X, the indicator function of the set X is given as the convex function
ι_X : R^d → R ∪ {+∞},

    ι_X(x) := 0 if x ∈ X, and ι_X(x) := +∞ otherwise.   (3.21)

When using the indicator function of our constraint set X as h ≡ ι_X, it is
easy to see that the proximal mapping simply becomes

    prox_{h,γ}(z) := argmin_y { (1/(2γ))‖y − z‖² + ι_X(y) }
                   = argmin_{y∈X} ‖y − z‖² = Π_X(z),

which is the projection of z onto X.


As we will see, the convergence of proximal gradient descent will be as fast
as that of classic gradient descent. However, this does not come entirely for
free. In every iteration, we now have to additionally compute the proximal
mapping. This can be very expensive if h is complex. Nevertheless, for some
important examples of h, the proximal mapping is efficient to compute,
such as for the ℓ1-norm.
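To make this concrete, here is a small Python sketch (ours, not part of the original
notes) of proximal gradient descent (3.19) for an ℓ1-regularized least-squares
objective g(x) = ½‖Ax − b‖² and h(x) = λ‖x‖_1; for this h, the proximal mapping
is the coordinate-wise soft-thresholding operator.

```python
import numpy as np

def prox_l1(z, gamma, lam):
    # prox_{h,gamma}(z) for h = lam * ||.||_1: coordinate-wise soft-thresholding.
    return np.sign(z) * np.maximum(np.abs(z) - gamma * lam, 0.0)

def proximal_gradient_descent(A, b, lam, num_steps):
    # g(x) = 0.5 * ||Ax - b||^2 is smooth with L = ||A^T A|| (spectral norm).
    L = np.linalg.norm(A.T @ A, 2)
    gamma = 1.0 / L
    x = np.zeros(A.shape[1])
    for _ in range(num_steps):
        grad_g = A.T @ (A @ x - b)                    # gradient of the smooth part only
        x = prox_l1(x - gamma * grad_g, gamma, lam)   # proximal step (3.19)
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 10))
x_true = np.zeros(10); x_true[:3] = [2.0, -1.0, 0.5]  # sparse ground truth
b = A @ x_true
x_hat = proximal_gradient_descent(A, b, lam=0.1, num_steps=500)
print(np.round(x_hat, 2))                             # close to x_true, small entries shrunk to 0
```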

3.6.2 Convergence in O(1/") steps


Interestingly, the vanilla convergence analysis for smooth functions as in
Theorem 2.7 directly applies to the more general case of proximal gradi-
ent descent. Intuitively, this means that the proximal method only “sees” the
nice smooth part g of the objective, and is not impacted by the additional h,
which it treats separately in each step.
Theorem 3.14. Let g : R^d → R be convex and smooth with parameter L, let h be
convex, and suppose that prox_{h,γ}(x) := argmin_z {‖x − z‖²/(2γ) + h(z)} can be
computed. Choosing the fixed stepsize

    γ := 1/L,

proximal gradient descent (3.19) with arbitrary x_0 satisfies

    f(x_T) − f(x*) ≤ (L/(2T))‖x_0 − x*‖²,   T > 0.
Proof. The proof follows the vanilla analysis for the smooth case, applying
it only to g, while always keeping h separate, as in (3.17). We leave the
details as Exercise 22 for the reader.

3.7 Exercises
Exercise 19. Prove that in Theorem 3.4 (i),

    f(x_{t+1}) ≤ f(x_t).

Exercise 20. Prove that under the assumptions of Theorem 3.5, f has a unique
minimizer x? over any nonempty closed and convex set X ✓ Rd ! In particular,
for X = Rd , we obtain the existence of a unique global minimum.

Exercise 21. Prove Lemma 3.12!


Hint: It is useful to prove that with x*(p) as in (3.12) and satisfying (3.13),

    x*(p) = argmin{ ‖x − v‖ : Σ_{i=1}^d x_i = 1, x_{p+1} = · · · = x_d = 0 }.

Exercise 22. Prove Theorem 3.14!

Chapter 4

Subgradient Descent

Contents
4.1 Subgradients . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.2 Differentiability of convex functions . . . . . . . . . . . . . . 67
4.3 The algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.4 Lipschitz convex functions: O(1/"2 ) steps . . . . . . . . . . . 68
4.5 Tame strong convexity: O(1/") steps . . . . . . . . . . . . . . 69
4.6 Optimality of first-order methods . . . . . . . . . . . . . . . . 72
4.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73

4.1 Subgradients
Definition 4.1. Let f : dom(f) → R. Then g ∈ R^d is a subgradient of f at
x ∈ dom(f) if

    f(y) ≥ f(x) + gᵀ(y − x)   ∀y ∈ dom(f).   (4.1)

The set of subgradients of f at x is called the subdifferential at x and is denoted
by ∂f(x).
The notion of a subgradient can be seen as a generalization of the gra-
dient, for functions which are not necessarily differentiable. A promi-
nent example is the `1 -norm, which we have discussed in Exercise 7. Fig-
ure 4.1 shows that this function has several subgradients at x = 0 (one-
dimensional case).

[Figure: the graph of f(x) = |x| together with two lines through the origin,
y ↦ (1/5)y and y ↦ −(2/5)y; each line y ↦ gy with g ∈ [−1, 1] lies below the graph.]

Figure 4.1: The function f(x) = |x| has subgradients g ∈ [−1, 1] at 0, since
f(y) ≥ gy for exactly g ∈ [−1, 1].

Lemma 4.2 (Exercise 23). If f : dom(f ) ! R is differentiable at x 2 dom(f ),


then @f (x) ✓ {rf (x)}.
This means that in the differentiable case, there is either exactly one
subgradient rf (x), or no subgradient at all (if f is not above its tangent
hyperplane at x; see Figure 1.4).
Definition 4.1 above looks suspiciously similar to the first-order char-
acterization of convexity (1.4) that we discussed earlier. Indeed, the only
difference is that here we have replaced rf (x) by g. It turns out that con-
vexity is equivalent to the existence of subgradients everywhere. So we

get a “first order characterization” of convexity that also covers the non-
differentiable case.

Lemma 4.3 (Exercise 24). A function f : dom(f) → R is convex if and only if
dom(f) is convex and ∂f(x) ≠ ∅ for all x ∈ dom(f).

It turns out that Lipschitz continuity can be characterized by bounded


subgradients. For real-valued convex functions, this is a generalization of
Lemma 1.10 to the non-differentiable case.

Lemma 4.4 (Exercise 25). Let f : dom(f) → R be convex, dom(f) open,
B ∈ R_+. Then the following two statements are equivalent.

(i) ‖g‖ ≤ B for all x ∈ dom(f) and all g ∈ ∂f(x).

(ii) |f(x) − f(y)| ≤ B‖x − y‖ for all x, y ∈ dom(f).

Subgradient optimality condition. Subgradients also allow us to de-


scribe cases of optimality for functions which are not necessarily differ-
entiable (and not necessarily convex), in the spirit of Lemma 1.16:

Lemma 4.5. Suppose that f : dom(f) → R and x ∈ dom(f). If 0 ∈ ∂f(x),
then x is a global minimum.

Proof. By (4.1), g = 0 ∈ ∂f(x) gives

    f(y) ≥ f(x) + gᵀ(y − x) = f(x)

for all y ∈ dom(f), so x is a global minimum.


Here we see (again) that subgradients are “stronger” than gradients for
differentiable functions. Indeed, if rf (x) = 0 for a differentiable function
f and x 2 dom(f ), we can only say that x is a critical point, but not nec-
essarily a global minimum. Unlike the gradient, a subgradient yields by
definition a linear lower bound to the function.

4.2 Differentiability of convex functions
Before we move on to subgradient descent, we want to get a feeling for
how “wild” non-differentiable convex functions can be. The answer is:
they are surprisingly tame. While there are continuous functions that are
nowhere differentiable (the classical example is the Weierstrass function),
convex functions cannot be as pathological. In fact, a convex function f
is differentiable almost everywhere. Formally, this means that wherever you
are in dom(f ), you find points arbitrarily close to you at which f is differ-
entiable. In still other words, the set of points where f is not differentiable
has measure 0 [Roc97, Theorem 25.5].
This does not mean that we can ignore non-differentiability in opti-
mization. For example, as Figure 4.1 demonstrates, the global minimum x?
can easily be a “kink”, a point where f is not differentiable. Also, while
running an iterative optimization scheme, we may always stumble upon
an intermediate kink.
An important fact is the following characterization of subdifferentials;

Theorem 4.6 ([Roc97, Theorem 25.6]). Let f : dom(f) → R be convex,
dom(f) open, x ∈ dom(f). Then ∂f(x) is the convex hull of the set

    S(x) = { lim_{n→∞} ∇f(x_n) : lim_{n→∞} x_n = x }.

In words, we consider sequences (xn )n2N that converge to x and for


which the sequence of gradients (rf (xn ))n2N exists and also converges;
the theorem says that the limit is a subgradient at x, and that any subgra-
dient can be obtained as a convex combination of such limit subgradients.
In the example of Figure 4.1, there are two types of sequences converg-
ing to 0 such that the gradients converge as well. These are sequences that
have almost all elements negative (gradients converge to −1), and sequences
that have almost all elements positive (gradients converge to 1). Conse-
quently, the subgradients at 0 are formed by the set [−1, 1], the convex hull
of −1 and 1.

4.3 The algorithm
An iteration of subgradient descent is defined as

    let g_t ∈ ∂f(x_t),
    x_{t+1} := x_t − γ_t g_t.   (4.2)

In contrast to our previous descent algorithms, we allow a time-varying


stepsize here. This can of course be done for any descent algorithm but so
far, we just did not need it. Later in this chapter, we will make use of a
time-varying step size.
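A minimal Python sketch (ours) of the update (4.2), using the example f(x) = ‖x‖_1,
for which one valid subgradient is obtained by taking sign(x_i) in each coordinate
(any value in [−1, 1] would do at a zero coordinate); the stepsize follows the
constant choice of Theorem 4.7 below, with R and B computed for this example.

```python
import numpy as np

def subgradient_l1(x):
    # One valid subgradient of f(x) = ||x||_1: sign(x_i) per coordinate,
    # with 0 chosen at the non-differentiable points x_i = 0.
    return np.sign(x)

def subgradient_descent(subgrad, x0, stepsizes):
    x = x0.astype(float)
    best = x.copy()
    for gamma_t in stepsizes:
        x = x - gamma_t * subgrad(x)    # x_{t+1} = x_t - gamma_t * g_t
        if np.linalg.norm(x, 1) < np.linalg.norm(best, 1):
            best = x.copy()             # keep the best iterate; the last one need not be best
    return best

T = 1000
x0 = np.array([3.0, -2.0])
R = np.linalg.norm(x0)                  # ||x0 - x*|| with minimizer x* = 0
B = np.sqrt(len(x0))                    # ||sign(x)|| <= sqrt(d) bounds the subgradient norms
gamma = R / (B * np.sqrt(T))            # constant stepsize as in Theorem 4.7
x_best = subgradient_descent(subgradient_l1, x0, [gamma] * T)
print(x_best, np.linalg.norm(x_best, 1))  # ell_1 norm close to the optimal value 0
```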

4.4 Lipschitz convex functions: O(1/"2) steps


The following result gives the convergence for Subgradient Descent. It is
identical to Theorem 2.1, up to relaxing the requirement of differentiability.

Theorem 4.7. Let f : R^d → R be convex and B-Lipschitz continuous with a
global minimum x*; furthermore, suppose that ‖x_0 − x*‖ ≤ R. Choosing the
constant stepsize

    γ_t = γ := R/(B√T),

subgradient descent (4.2) yields

    (1/T) Σ_{t=0}^{T−1} (f(x_t) − f(x*)) ≤ RB/√T.

Proof. The proof is identical to the one of Theorem 2.1 presented in Sec-
tion 2.4. The only change is that gt is a subgradient now and not a gra-
dient, so that the inequality (2.5) now follows from the subgradient prop-
erty (4.1) instead of the first-order characterization of convexity. The re-
quired bound kgt k2  B 2 follows from Lemma 4.4 (“convex and Lipschitz
= bounded subgradients”).

Projected subgradient descent. Theorem 3.2 for constrained optimiza-


tion in O(1/"2 ) steps directly extends to the case of subgradient descent as
well.

4.5 Tame strong convexity: O(1/") steps
(Projected) gradient descent converges in O(log(1/")) steps for functions
that are both smooth and strongly convex. But if a function is non-differen-
tiable, then it cannot be smooth under the natural definition of smoothness
(Exercise 26). It can still be strongly convex, however, so it is natural to ask
whether strong convexity alone allows us to obtain a convergence result.
The answer is no in general, but before we discuss this, let us define strong
convexity for not necessarily differentiable functions. This is straightfor-
ward; for differentiable functions, we recover Definition 2.8. Here, we
restrict to the unconstrained case for simplicity.
Definition 4.8. Let f : dom(f) → R be convex, μ ∈ R_+, μ > 0. Function f is
called strongly convex (with parameter μ) if

    f(y) ≥ f(x) + gᵀ(y − x) + (μ/2)‖x − y‖²,   ∀x, y ∈ dom(f), ∀g ∈ ∂f(x).   (4.3)
Actually, requiring (4.3) only for some g 2 @f (x) would be another
straightforward generalization of Definition 2.8, so which one is the “right”
one? The answer is that it does not matter if dom(f ) is open. We could
even afford to not require anything for points x where f is not differen-
tiable. This is a consequence of Theorem 4.6 (Exercise 27).
Strong convexity has the following useful characterization.
Lemma 4.9 (Exercise 28). Let f : dom(f) → R be convex, dom(f) open,
μ ∈ R_+, μ > 0. f is strongly convex with parameter μ if and only if f_μ :
dom(f) → R defined by

    f_μ(x) = f(x) − (μ/2)‖x‖²,   x ∈ dom(f),

is convex.
Let’s look at the problem with (sub)gradient descent on strongly con-
vex functions.
Lemma 4.10 (Exercise 29). The function f (x) = e|x| is strongly convex with
parameter µ = 1.
This function is of course far from being smooth; it grows exponen-
tially, so there can’t be any quadratic upper bounds. In fact, as strong

convexity only requires quadratic lower bounds, strongly convex functions
can be extremely fast-growing. In such a situation, (sub)gradient descent
will overshoot already for tiny step sizes and diverge.
In case of f(x) = e^{|x|}, the function is differentiable at x ≠ 0 with f′(x) =
sgn(x)e^{|x|}, so the (sub)gradient step is

    x_{t+1} = x_t − γ_t sgn(x_t) e^{|x_t|}.

For |x| only mildly larger than 0, the step will overshoot the optimum
x⇤ = 0 and take us (much) further away. To compensate for this, we would
need extremely small stepsizes. These in turn would lead to extremely
poor convergence for functions such as f (x) = x2 /2 (which is also strongly
convex with µ = 1) . Hence, there are no stepsizes that fit all strongly
convex functions with a fixed strong convexity parameter µ.
To succeed with (sub)gradient descent in this situation, we therefore
need to make some additional assumptions. Smoothness (quadratic upper
bounds) is such an assumption, but in the non-differentiable case, this is
precisely not an option. What people have done instead is to assume that
the subgradients gt that we encounter during the algorithm are bounded
in norm.
To ensure bounded subgradients, we could simply assume that f is
Lipschitz, but then we will only make a statement about an empty function
class. The reason is that a function cannot be globally strongly convex and
Lipschitz at the same time (Exercise 30). It can be strongly convex and
have bounded gradients over a closed and bounded set X, so analyzing
projected subgradient descent is an alternative.
But even when we optimize over Rd , we may be lucky and only hit
iterates with small subgradients. This will typically happen if we start
sufficiently close to optimality. In this case, there are step sizes t (not
depending on the observed gradients) that give us useful error bounds.
Below, we prove such a bound for subgradient descent, and this re-
sult then clearly extends to gradient descent on differentiable and strongly
convex (but not necessarily smooth) functions. The bound on the number
of steps will be O(1/") which is of course much worse than O(log(1/")),
but still better than O(1/"2 ) that we get in the Lipschitz case. So assum-
ing strong convexity results in a convergence behavior as in the smooth
case—if the gradients stay bounded, and this is what we mean by “tame”.
In order to analyze subgradient descent on strongly convex functions,

we will for the first time depart from algorithm variants with a constant
stepsize γ, but instead use a time-varying stepsize γ_t decreasing over time.

Theorem 4.11. Let f : R^d → R be strongly convex with parameter μ > 0 and
let x* be the unique global minimum of f. With decreasing step size

    γ_t := 2/(μ(t + 1)),   t > 0,

subgradient descent (4.2) yields

    f( (2/(T(T + 1))) Σ_{t=1}^T t · x_t ) − f(x*) ≤ 2B²/(μ(T + 1)),

where B = max_{t=1}^T ‖g_t‖.

Unlike in previous convergence results, small error is not achieved by


some iterate that we have gone through, but by a convex combination of
iterates.
Proof. We start from the vanilla analysis (2.3) (with γ = γ_t):

    g_tᵀ(x_t − x*) = (γ_t/2)‖g_t‖² + (1/(2γ_t))(‖x_t − x*‖² − ‖x_{t+1} − x*‖²).

Now we plug in the lower bound g_tᵀ(x_t − x*) ≥ f(x_t) − f(x*) + (μ/2)‖x_t − x*‖²
resulting from strong convexity to obtain (with ‖g_t‖² ≤ B²) that

    f(x_t) − f(x*) ≤ (B²γ_t)/2 + ((1/γ_t − μ)/2)‖x_t − x*‖² − (1/(2γ_t))‖x_{t+1} − x*‖².   (4.4)

Unlike in the vanilla analysis (where we had γ_t = γ, μ = 0), the right-hand
side does not telescope anymore when we sum over all t ≤ T; to fix this,
we precisely need the time-varying stepsize.
Let’s make a small computation: to get telescoping behavior, we would
need that 1/γ_t = 1/γ_{t+1} − μ. For example, 1/γ_t = μ(1 + t) satisfies this, but
our choice 1/γ_t = μ(1 + t)/2 does not. Exercise 31 asks you to compute
what happens when we actually choose 1/γ_t = μ(1 + t); this will let you
appreciate the seemingly “wrong” choice of γ_t = 2/(μ(t + 1)) here. Plugging in
this stepsize and multiplying with t on both sides, we get

    t(f(x_t) − f(x*)) ≤ B²t/(μ(t + 1)) + (μ/4)( t(t − 1)‖x_t − x*‖² − (t + 1)t‖x_{t+1} − x*‖² )
                     ≤ B²/μ + (μ/4)( t(t − 1)‖x_t − x*‖² − (t + 1)t‖x_{t+1} − x*‖² ).

Summing from t = 1, . . . , T, we obtain a telescoping sum:

    Σ_{t=1}^T t(f(x_t) − f(x*)) ≤ TB²/μ + (μ/4)( 0 − T(T + 1)‖x_{T+1} − x*‖² ) ≤ TB²/μ.

Since

    (2/(T(T + 1))) Σ_{t=1}^T t = 1,

Jensen’s inequality (Lemma 1.5) yields

    f( (2/(T(T + 1))) Σ_{t=1}^T t · x_t ) − f(x*) ≤ (2/(T(T + 1))) Σ_{t=1}^T t(f(x_t) − f(x*)).

This in turn implies

    f( (2/(T(T + 1))) Σ_{t=1}^T t · x_t ) − f(x*) ≤ 2B²/(μ(T + 1)).

Unlike all previous bounds, this bound seems to be independent of
the initial distance ‖x_0 − x*‖ to the optimum. However, there is no free
lunch here. The initial distance will typically affect the bound B (think of
a quadratic function where B is proportional to ‖x_0 − x*‖).
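A Python sketch (ours) of this variant, using as example the strongly convex
but non-differentiable function f(x) = ½‖x‖² + ‖x‖_1 with μ = 1; note that the
output is the weighted average of the iterates from Theorem 4.11, not the last
iterate.

```python
import numpy as np

def averaged_subgradient_descent(subgrad, x0, mu, T):
    x = x0.astype(float)
    weighted_sum = np.zeros_like(x)
    for t in range(1, T + 1):
        gamma_t = 2.0 / (mu * (t + 1))          # decreasing stepsize of Theorem 4.11
        x = x - gamma_t * subgrad(x)
        weighted_sum += t * x                   # accumulate t * x_t
    return (2.0 / (T * (T + 1))) * weighted_sum # weighted average of the iterates

# f(x) = 0.5*||x||^2 + ||x||_1 is strongly convex with mu = 1 (Lemma 4.9)
# and not differentiable at coordinates equal to 0; its unique minimizer is x* = 0.
subgrad = lambda x: x + np.sign(x)              # one valid subgradient of f
x_avg = averaged_subgradient_descent(subgrad, np.array([5.0, -3.0]), mu=1.0, T=1000)
print(x_avg, np.linalg.norm(x_avg))             # close to the unique minimizer x* = 0
```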

4.6 Optimality of first-order methods


With all the convergence rates we have seen so far, a very natural question
to ask is if these rates are best possible or not. Surprisingly, the rate can
indeed not be improved in general.

Theorem 4.12 (Nesterov). For any T ≤ d − 1 and starting point x_0, there is a
function f in the problem class of B-Lipschitz functions over R^d, such that any
(sub)gradient method has an objective error at least

    f(x_T) − f(x*) ≥ RB/(2(1 + √(T + 1))).

The above theorem applies to all first-order methods which form iter-
ates by linearly combining past iterates and (sub)gradients, and requires
the dimension d to be sufficiently large.

4.7 Exercises
Exercise 23. Prove Lemma 4.2, meaning that a function that is differentiable at x
has at most one subgradient there, namely rf (x).

Exercise 24. Prove the easy direction of Lemma 4.3, meaning that the existence
of subgradients everywhere implies convexity!

Exercise 25. Prove Lemma 4.4 (Lipschitz continuity and bounded subgradients).

Exercise 26. Generalizing Definition 2.2, let us call a (not necessarily differen-
tiable) function f : R^d → R smooth with parameter L ∈ R_+ if for all x ∈ R^d,
there exists g_x ∈ R^d (not necessarily a subgradient; we do not assume that f is
convex) such that

    f(y) ≤ f(x) + g_xᵀ(y − x) + (L/2)‖x − y‖²,   ∀x, y ∈ R^d.

This means that for every point x, the graph of f is below the graph of the
quadratic function y ↦ f(x) + g_xᵀ(y − x) + (L/2)‖x − y‖².
Prove that if f is smooth according to this definition, then f is differentiable,
with gx = rf (x) for all x. In particular, for differentiable functions, the notion of
smoothness introduced above coincides with the one of Definition 2.2; moreover,
non-differentiable functions cannot be smooth.

Exercise 27. Suppose that f : R^d → R is convex and satisfies

    f(y) ≥ f(x) + ∇f(x)ᵀ(y − x) + (μ/2)‖x − y‖²

for all x such that ∇f(x) exists, and for all y. Prove that this implies

    f(y) ≥ f(x) + g_xᵀ(y − x) + (μ/2)‖x − y‖²

for all x, all g_x ∈ ∂f(x) and all y.

Exercise 28. Prove Lemma 4.9: f is strongly convex with parameter μ over an
open domain if and only if f_μ : x ↦ f(x) − (μ/2)‖x‖² is convex over the same
domain.

Exercise 29. Prove Lemma 4.10: f (x) = e|x| is strongly convex with parameter
µ = 1.

Exercise 30. Prove that a function cannot simultaneously be Lipschitz and strongly
convex!

Exercise 31. Which result can you prove when you use the “telescoping stepsize”

    γ_t = 1/(μ(t + 1))

in Theorem 4.11 instead of γ_t = 2/(μ(t + 1))?

Chapter 5

Stochastic Gradient Descent

Contents
5.1 The algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
5.2 Unbiasedness . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
5.3 Bounded stochastic gradients: O(1/"2 ) steps . . . . . . . . . 78
5.4 Tame strong convexity: O(1/") steps . . . . . . . . . . . . . . 79
5.5 Stochastic Subgradient Descent . . . . . . . . . . . . . . . . . 80
5.6 Mini-batch variants . . . . . . . . . . . . . . . . . . . . . . . . 80
5.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81

5.1 The algorithm
Many objective functions occurring in machine learning are formulated as
sum structured objective functions
    f(x) = (1/n) Σ_{i=1}^n f_i(x).   (5.1)
Here fi is typically the cost function of the i-th datapoint, taken from a
training set of n elements in total.
We have already seen an example for this: the loss function (1.11) in
the handwritten digit recognition (Section 1.6.1) has one term for each of
the n training images x ∈ P:

    ℓ(W) = −Σ_{x∈P} ln z_{d(x)}(Wx).

The normalizing factor 1/n that we assume in the general setting (5.1)
will just simplify the following a bit.
An iteration of stochastic gradient descent (SGD) in its basic form is de-
fined as

    sample i ∈ [n] uniformly at random,
    x_{t+1} := x_t − γ_t ∇f_i(x_t).   (5.2)
This update looks almost identical to the classical gradient method, the
only difference being that we have computed the gradient not of the en-
tire f but only of one particular (randomly chosen) function fi . As we will
need varying stepsizes a bit later, we allow for the stepsize to depend on t
now.
In the above setting, the update vector gt := rfi (xt ) is called a stochastic
gradient. Formally, gt is a vector of d random variables, but we will also
simply call this a random variable.
The crucial advantage of SGD versus its classical gradient descent coun-
terpart is the efficiency per iteration: While computing the full gradient for
a sum structured problem (5.1) would require us to compute n individual
gradients of the fi functions, an iteration of SGD requires only a single
one of those, and therefore is n times cheaper. SGD has therefore become
the main workhorse for training machine learning models. Whether such
cheaper iterations also give similar progress is another question, which we
analyze next.
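A minimal Python sketch (ours) of the update (5.2) on a sum-structured
least-squares toy objective; each iteration touches a single data point only.

```python
import numpy as np

def sgd(grad_fi, n, x0, stepsize, num_steps, rng):
    x = x0
    for _ in range(num_steps):
        i = rng.integers(n)                 # sample i uniformly at random from {0, ..., n-1}
        x = x - stepsize * grad_fi(i, x)    # step using the gradient of f_i only
    return x

# f(x) = (1/n) * sum_i 0.5 * (a_i^T x - b_i)^2, so grad f_i(x) = (a_i^T x - b_i) * a_i.
rng = np.random.default_rng(0)
n, d = 200, 5
A = rng.standard_normal((n, d))
x_true = rng.standard_normal(d)
b = A @ x_true
grad_fi = lambda i, x: (A[i] @ x - b[i]) * A[i]

x_hat = sgd(grad_fi, n, np.zeros(d), stepsize=0.05, num_steps=5000, rng=rng)
print(np.linalg.norm(x_hat - x_true))       # small: x_hat is close to the minimizer x_true
```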

5.2 Unbiasedness
We would like to start with the vanilla analysis again, but now we can-
not bound the random variable gt> (xt x? ) from below using (2.5), as the
inequality
f (xt ) f (x? )  gt> (xt x? )
may hold or not hold, depending on how gt turns out. But it still holds in
expectation, as we show now.
The vector gt may be far from the true gradient, and of high variance,
but in expectation over the random choice of i, it does coincide with the
full gradient of f . We formalize this as
    E[g_t | x_t = x] = (1/n) Σ_{i=1}^n ∇f_i(x) = ∇f(x),   x ∈ R^d.   (5.3)

Here, E[g_t | x_t = x] is the conditional expectation of g_t, given the event
{x_t = x}. If this event is nonempty, linearity of conditional expectations
yields that

    E[g_tᵀ(x − x*) | x_t = x] = E[g_t | x_t = x]ᵀ(x − x*) = ∇f(x)ᵀ(x − x*).
Using the fact that {x_t = x} can occur only for x in some finite set X (one
element for every choice of indices throughout all iterations), the partition
theorem further gives us

    E[g_tᵀ(x_t − x*)] = Σ_{x∈X} E[g_tᵀ(x − x*) | x_t = x] prob(x_t = x)
                     = Σ_{x∈X} ∇f(x)ᵀ(x − x*) prob(x_t = x)
                     = E[∇f(x_t)ᵀ(x_t − x*)].

Hence, we have

    E[g_tᵀ(x_t − x*)] = E[∇f(x_t)ᵀ(x_t − x*)] ≥ E[f(x_t) − f(x*)].   (5.4)

The last inequality is by convexity, and this means that the lower bound
(2.5) holds in expectation.
Exercise 32 lets you recall some basics around conditional expectations.
Under (5.3) we say that the stochastic gradient gt is an unbiased estimator
of the gradient, for any time-step t.
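The unbiasedness property (5.3) is easy to check numerically; the following small
sketch (ours, using the same least-squares toy objective as above) compares the
average of all n possible stochastic gradients at a point with the full gradient.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 100, 4
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)

grad_fi = lambda i, x: (A[i] @ x - b[i]) * A[i]        # gradient of f_i
grad_f = lambda x: A.T @ (A @ x - b) / n               # gradient of f = (1/n) sum_i f_i

x = rng.standard_normal(d)
avg_stochastic = np.mean([grad_fi(i, x) for i in range(n)], axis=0)
print(np.allclose(avg_stochastic, grad_f(x)))          # True: E[g_t | x_t = x] = grad f(x)
```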

5.3 Bounded stochastic gradients: O(1/"2) steps
To get a first result out of the vanilla analysis, we assumed in Section 2.4
that krf (x)k2  B 2 for all x 2 Rd , where B was a constant. Here, we
are assuming the same for the expected squared norms of our stochastic
gradients. And we are getting the same result, except that it now holds for
the expected function values.
Theorem 5.1. Let f : R^d → R be a convex and differentiable function, and let
x* be a global minimum of f; furthermore, suppose that ‖x_0 − x*‖ ≤ R, and that
E[‖g_t‖²] ≤ B² for all t. Choosing the constant stepsize

    γ := R/(B√T),

stochastic gradient descent (5.2) yields

    (1/T) Σ_{t=0}^{T−1} E[f(x_t) − f(x*)] ≤ RB/√T.
Proof. Taking expectations on both sides of the vanilla analysis (2.4) and
using linearity of expectations, we get

    Σ_{t=0}^{T−1} E[g_tᵀ(x_t − x*)] ≤ (γ/2) Σ_{t=0}^{T−1} E[‖g_t‖²] + (1/(2γ))‖x_0 − x*‖².   (5.5)

By (5.4),

    E[f(x_t) − f(x*)] ≤ E[g_tᵀ(x_t − x*)].

Plugging this into (5.5), using E[‖g_t‖²] ≤ B² and ‖x_0 − x*‖ ≤ R, we get

    Σ_{t=0}^{T−1} E[f(x_t) − f(x*)] ≤ (γ/2)B²T + R²/(2γ),

from which the statement follows with the choice of γ as in Theorem 2.1.

Constrained optimization. For constrained optimization, Theorem 5.1


for the convergence in O(1/"2 ) steps directly extends to constrained prob-
lems as well. After every step of SGD, projection back to X is applied as
usual. The resulting algorithm is called projected SGD.

5.4 Tame strong convexity: O(1/") steps
It is possible to strengthen our above SGD analysis. One way to do so
is under the additional assumption of strong convexity of the objective
function f (as in Definition 2.8). Again, the proof works by “taking ex-
pectations” over a previous analysis, in this case the one for subgradient
descent in the tame strongly convex case (Theorem 4.11).

Theorem 5.2. Let f : R^d → R be differentiable and strongly convex with pa-
rameter μ > 0; let x* be the unique global minimum of f. With decreasing step
size

    γ_t := 2/(μ(t + 1)),

stochastic gradient descent (5.2) yields

    E[ f( (2/(T(T + 1))) Σ_{t=1}^T t · x_t ) − f(x*) ] ≤ 2B²/(μ(T + 1)),

where B² = max_{t=1}^T E[‖g_t‖²].

Proof. We start from the vanilla analysis (2.3) (with γ = γ_t) and take expec-
tations on both sides:

    E[g_tᵀ(x_t − x*)] = (γ_t/2) E[‖g_t‖²] + (1/(2γ_t)) ( E[‖x_t − x*‖²] − E[‖x_{t+1} − x*‖²] ).

Now we use (5.4) along with strong convexity to get a lower bound

    E[g_tᵀ(x_t − x*)] = E[∇f(x_t)ᵀ(x_t − x*)]
                     ≥ E[f(x_t) − f(x*)] + (μ/2) E[‖x_t − x*‖²]

for the left-hand side. Combining the previous two equations and using
E[‖g_t‖²] ≤ B², we get the “expected version” of (4.4):

    E[f(x_t) − f(x*)] ≤ (B²γ_t)/2 + ((1/γ_t − μ)/2) E[‖x_t − x*‖²] − (1/(2γ_t)) E[‖x_{t+1} − x*‖²].

The proof continues as in Theorem 4.11, with every step being the “ex-
pected version” of the corresponding step in the earlier proof.

5.5 Stochastic Subgradient Descent
For problems which are not necessarily differentiable, we modify SGD to
use a subgradient of fi in each iteration. The update of stochastic subgra-
dient descent is given by
    sample i ∈ [n] uniformly at random,
    let g_t ∈ ∂f_i(x_t),   (5.6)
    x_{t+1} := x_t − γ_t g_t.

Let g_i : R^d → R^d denote the function that selects the subgradient of f_i
at the current point. Then we have g_t = g_i(x_t) for random i. Unbiasedness
now becomes

    E[g_t | x_t = x] = (1/n) Σ_{i=1}^n g_i(x) =: g(x),   x ∈ R^d.

It is immediate from the subgradient property that g(x) ∈ ∂f(x) if g_i(x) ∈
∂f_i(x) for all i. As in Section 5.2 for SGD, we then get

    E[g_tᵀ(x_t − x*)] = E[g(x_t)ᵀ(x_t − x*)].

This in turn can be lower bounded by

    E[f(x_t) − f(x*)] + (μ/2) E[‖x_t − x*‖²],
with µ = 0 in the convex case and µ > 0 in the strongly convex case,
now using g(xt )’s subgradient property (4.1) in the convex and (4.3) in the
strongly convex case instead of the first-order condition for rf (xt ). As
this lower bound is the crucial ingredient in the previous two analyses of
convergence in O(1/"2 ) and O(1/") steps, the results directly extend to the
case of subgradient descent as well.

5.6 Mini-batch variants


Instead of using a single element f_i of our sum objective (5.1) to form a
stochastic gradient g_t = ∇f_i(x_t), another variant is to use an average of
several of them:

    g̃_t := (1/m) Σ_{j=1}^m g_t^j,   (5.7)

where g_t^j = ∇f_{i_j}(x_t) for an index i_j. The set of the (distinct) i_j indices is
called a mini-batch, and m is the mini-batch size.
Using the step direction g̃t defines mini-batch SGD. For m = 1, we re-
cover SGD as originally defined, while for m = n we recover full gradient
descent.
Mini-batch SGD can be advantageous in several applications. For ex-
ample, parallelization over up to m processors will easily give a speed-up
for the gradient computation, which is typically the main cost of running
SGD. Here, parallelization exploits the fact that all gtj are defined at the
same iterate xt and can therefore be computed independently.
Taking an average of many independent random variables reduces the
variance. In the context of mini-batch SGD, we obtain that for a larger size
of the mini-batch m, our estimate g̃_t will be closer to the true gradient, in
expectation:

    E[‖g̃_t − ∇f(x_t)‖²] = E[ ‖(1/m) Σ_{j=1}^m (g_t^j − ∇f(x_t))‖² ]
                        = (1/m) E[‖g_t^1 − ∇f(x_t)‖²]
                        = (1/m) ( E[‖g_t^1‖²] − ‖∇f(x_t)‖² ) ≤ B²/m.
Using a modification of the above analysis, it is possible to use this
property to relate the above convergence rate of SGD to the rate of full
gradient descent.
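A mini-batch version of the earlier SGD sketch (ours); the averaged stochastic
gradient (5.7) is assembled from m randomly chosen distinct indices per step.

```python
import numpy as np

def minibatch_sgd(grad_fi, n, x0, stepsize, num_steps, batch_size, rng):
    x = x0
    for _ in range(num_steps):
        batch = rng.choice(n, size=batch_size, replace=False)       # m distinct indices
        g_tilde = np.mean([grad_fi(i, x) for i in batch], axis=0)   # averaged gradient (5.7)
        x = x - stepsize * g_tilde
    return x

# Reusing the least-squares setup: f_i(x) = 0.5 * (a_i^T x - b_i)^2.
rng = np.random.default_rng(0)
n, d = 200, 5
A = rng.standard_normal((n, d))
x_true = rng.standard_normal(d)
b = A @ x_true
grad_fi = lambda i, x: (A[i] @ x - b[i]) * A[i]

x_hat = minibatch_sgd(grad_fi, n, np.zeros(d), stepsize=0.1, num_steps=1000,
                      batch_size=20, rng=rng)
print(np.linalg.norm(x_hat - x_true))       # small: the averaged gradients reduce the variance
```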

5.7 Exercises
Exercise 32. Let Y be a random variable over a finite probability space (Ω, prob),
where prob : 2^Ω → [0, 1]; this avoids subtleties in defining conditional probabili-
ties and expectations, and it covers the random variables occurring in SGD, since
in each step, we are randomly choosing among a finite set of n indices. Further-
more, let B ⊆ Ω be an event.
For nonempty B, the conditional expectation of Y given B is the number

    E[Y | B] := Σ_{y∈Y(Ω)} y · prob(Y = y | B),

where Y = y is shorthand for the event {ω ∈ Ω : Y(ω) = y}.
Finally, for two events A and B ≠ ∅, the conditional probability prob(A | B)
is defined as

    prob(A | B) := prob(A ∩ B) / prob(B).

If B = ∅, E[Y | B] can be defined arbitrarily.
Prove the following statements.

(i) Alternative definition of conditional expectation:

    prob(B) · E[Y | B] = Σ_{ω∈B} Y(ω) prob(ω).

(ii) Partition Theorem: Let B_1, . . . , B_m be a partition of Ω. Then

    E[Y] = Σ_{i=1}^m E[Y | B_i] prob(B_i).

(iii) Linearity of conditional expectation: For random variables Y_1, . . . , Y_m over
(Ω, prob) and real numbers λ_1, . . . , λ_m, and if B ≠ ∅,

    Σ_{i=1}^m λ_i E[Y_i | B] = E[ Σ_{i=1}^m λ_i Y_i | B ].

Chapter 6

Nonconvex functions

Contents
6.1 Smooth functions . . . . . . . . . . . . . . . . . . . . . . . . . 85
6.2 Trajectory analysis . . . . . . . . . . . . . . . . . . . . . . . . 90
6.2.1 Deep linear neural networks . . . . . . . . . . . . . . 91
6.2.2 A simple nonconvex function . . . . . . . . . . . . . . 93
6.2.3 Smoothness along the trajectory . . . . . . . . . . . . 96
6.2.4 Convergence . . . . . . . . . . . . . . . . . . . . . . . 98
6.3 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100

So far, all convergence results that we have given for variants of gra-
dient descent have been for convex functions. And there is a good reason
for this: on nonconvex functions, gradient descent can in general not be
expected to come close (in distance or function value) to the global mini-
mum x? , even if there is one.
As an example, consider the nonconvex function from Figure 1.2 (left).
Figure 6.1 shows what happens if we start gradient descent somewhere “to
the right”, with a not too large stepsize so that we do not overshoot. For
any sufficiently large T , the iterate xT will be close to the local minimum
y? , but not to the global minimum x? .

[Figure: a nonconvex function with global minimum x*, local minimum y*, and
starting point x0 to the right of y*.]

Figure 6.1: Gradient descent may get stuck in a local minimum y* ≠ x*

Even if the global minimum is the unique local minimum, gradient


descent is not guaranteed to get there, as it may also get stuck in a saddle
point, or even fail to reach anything at all; see Figure 6.2.
In practice, variants of gradient descent are often observed to perform
well even on nonconvex functions, but theoretical explanations for this are
mostly missing.
In this chapter, we show that under favorable conditions, we can still
say something useful about the behavior of gradient descent, even on non-
convex functions.

[Figure: two functions; on the left, a flat region (saddle point) y* between x0 and
the global minimum x*; on the right, a function that flattens out to the right of x*,
with x0 in the flat part.]

Figure 6.2: Gradient descent may get stuck in a flat region (saddle point)
y* (left), or reach neither a local minimum nor a saddle point (right).

6.1 Smooth functions


A particularly low hanging fruit is the analysis of gradient descent on
smooth (but not necessarily convex) functions. We recall from Defini-
tion 2.2 that a differentiable function f : dom(f) → R is smooth with
parameter L ∈ R_+ over a convex set X ⊆ dom(f) if

    f(y) ≤ f(x) + ∇f(x)ᵀ(y − x) + (L/2)‖x − y‖²,   ∀x, y ∈ X.
This means that at every point x 2 X, the graph of f is below a not-too-
steep tangential paraboloid, and this may happen even if the function is
not convex; see Figure 6.3.
There is a class of arbitrarily smooth nonconvex functions, namely the
differentiable concave functions. A function f is called concave if −f is
convex. Hence, for all x, the graph of a differentiable concave function
is below the tangent hyperplane at x, hence f is smooth with parameter
L = 0; see Figure 6.4.
However, from our optimization point of view, concave functions are
boring, since they have no global minimum (at least in the unconstrained
setting that we are treating here). Gradient descent will then simply “run
off to infinity”.
We will therefore consider smooth functions that have a global min-
imum x? . Are there even such functions that are not convex? Actually,

85
Figure 6.3: A smooth and nonconvex function: the graph of $f$ stays below the paraboloid $y \mapsto f(x) + \nabla f(x)^\top (y - x) + \frac{L}{2}\|x - y\|^2$

many. As we show next, any twice differentiable function with bounded


Hessians over some convex set X is smooth over X. A concrete example of
a smooth function that is not convex but has a global minimum (actually,
many), is f (x) = sin(x).

Lemma 6.1. Let $f : \mathbf{dom}(f) \to \mathbb{R}$ be twice differentiable, with $X \subseteq \mathbf{dom}(f)$ a convex set, and $\|\nabla^2 f(x)\| \leq L$ for all $x \in X$, where $\|\cdot\|$ is again the spectral norm. Then $f$ is smooth with parameter $L$ over $X$.

Proof. By Theorem 1.10 (applied to the gradient function $\nabla f$), bounded Hessians imply Lipschitz continuity of the gradient,
\[
\|\nabla f(x) - \nabla f(y)\| \leq L\|x - y\|, \quad x, y \in X. \tag{6.1}
\]

We show that this in turn implies smoothness. This is in fact the easy
direction of Lemma 2.4 (in the twice differentiable case), and we proceed
as in the proof of Theorem 1.10 by employing the fundamental theorem of

86
Figure 6.4: A concave function and the first-order characterization of concavity: $f(y) \leq f(x) + \nabla f(x)^\top (y - x)$, $\forall x, y \in \mathbb{R}^d$

calculus with $h : \mathbf{dom}(h) \to \mathbb{R}$, $[0,1] \subseteq \mathbf{dom}(h)$ given by
\[
h(t) = f\big(x + t(y - x)\big), \quad t \in \mathbb{R},
\]
in which case the chain rule (see (1.2)) yields
\[
h'(t) = \nabla f\big(x + t(y - x)\big)^\top (y - x).
\]

87
For any $x, y \in X$, we now compute
\begin{align*}
& f(y) - f(x) - \nabla f(x)^\top (y - x) \\
&= h(1) - h(0) - \nabla f(x)^\top (y - x) \\
&= \int_0^1 h'(t)\,dt - \nabla f(x)^\top (y - x) \\
&= \int_0^1 \nabla f(x + t(y - x))^\top (y - x)\,dt - \nabla f(x)^\top (y - x) \\
&= \int_0^1 \big( \nabla f(x + t(y - x))^\top (y - x) - \nabla f(x)^\top (y - x) \big)\,dt \\
&= \int_0^1 \big( \nabla f(x + t(y - x)) - \nabla f(x) \big)^\top (y - x)\,dt \\
&\leq \int_0^1 \big| \big( \nabla f(x + t(y - x)) - \nabla f(x) \big)^\top (y - x) \big|\,dt \\
&\leq \int_0^1 \| \nabla f(x + t(y - x)) - \nabla f(x) \| \, \|y - x\|\,dt && \text{(Cauchy-Schwarz)} \\
&\leq \int_0^1 L\|t(y - x)\| \, \|y - x\|\,dt && \text{(Lipschitz continuous gradients)} \\
&= \int_0^1 L t\,\|x - y\|^2\,dt \\
&= \frac{L}{2}\|x - y\|^2.
\end{align*}
This is smoothness over X according to Definition 2.2.
For twice differentiable functions, the converse is also (almost) true. If $f$ is smooth over an open convex subset $X \subseteq \mathbf{dom}(f)$, the maximum eigenvalue of the Hessian is bounded over $X$ (Exercise 33). We can only bound the eigenvalues from above since e.g. concave functions are smooth with parameter $L = 0$ but generally have unbounded Hessians. It is also not hard to understand why openness is necessary in general. Indeed, for a point $x$ on the boundary of $X$, the smoothness condition does not give us any information about nearby points not in $X$. As a consequence, even at points with large Hessians, $f$ might look smooth inside $X$. As a simple example, consider $f(x_1, x_2) = x_1^2 + M x_2^2$ with $M \in \mathbb{R}_+$ large. The function $f$ is smooth with $L = 2$ over $X = \{(x_1, x_2) : x_2 = 0\}$: indeed, over this set, $f$ looks just like the supermodel function $f(x) = x^2$. But for all $x$, we have $\|\nabla^2 f(x)\| = 2M$.

88
Now we get back to gradient descent on smooth functions with a global
minimum. The punchline is so unspectacular that there is no harm in
spoiling it already now: What we can prove is that krf (xt )k2 converges to
0 at the same rate as f (xt ) f (x? ) converges to 0 in the convex case. Nat-
urally, f (xt ) f (x? ) itself is not guaranteed to converge in the nonconvex
case, for example if xt converges to a local minimum that is not global, as
in Figure 6.1.
It is tempting to interpret convergence of krf (xt )k2 to 0 as convergence
to a critical point of f (a point where the gradient vanishes). But this inter-
pretation is not fully accurate in general, as Figure 6.2 (right) shows: The
algorithm may enter a region where f asymptotically approaches some
value, without reaching it (think of the rightmost piece of the function in the figure as $f(x) = e^{-x}$). In this case, the gradient converges to 0, but the
iterates are nowhere near a critical point.
Theorem 6.2. Let $f : \mathbb{R}^d \to \mathbb{R}$ be differentiable with a global minimum $x^\star$; furthermore, suppose that $f$ is smooth with parameter $L$ according to Definition 2.2. Choosing stepsize
\[
\gamma := \frac{1}{L},
\]
gradient descent (2.1) yields
\[
\frac{1}{T}\sum_{t=0}^{T-1} \|\nabla f(x_t)\|^2 \leq \frac{2L}{T}\big(f(x_0) - f(x^\star)\big), \quad T > 0.
\]

In particular, $\|\nabla f(x_t)\|^2 \leq \frac{2L}{T}\big(f(x_0) - f(x^\star)\big)$ for some $t \in \{0, \ldots, T-1\}$. And also, $\lim_{t\to\infty} \|\nabla f(x_t)\|^2 = 0$ (Exercise 34).
Proof. We recall that sufficient decrease (Lemma 2.6) does not require convexity, and this gives
\[
f(x_{t+1}) \leq f(x_t) - \frac{1}{2L}\|\nabla f(x_t)\|^2, \quad t \geq 0.
\]
Rewriting this into a bound on the gradient yields
\[
\|\nabla f(x_t)\|^2 \leq 2L\big(f(x_t) - f(x_{t+1})\big).
\]
Hence, we get a telescoping sum
\[
\sum_{t=0}^{T-1} \|\nabla f(x_t)\|^2 \leq 2L\big(f(x_0) - f(x_T)\big) \leq 2L\big(f(x_0) - f(x^\star)\big).
\]

89
The statement follows.
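To see the guarantee of Theorem 6.2 in action, here is a minimal numerical sketch (the nonconvex test function, starting point and iteration count are our own illustrative choices, not from the text): gradient descent with stepsize $1/L$, tracking the squared gradient norms.

import numpy as np

# Smooth but nonconvex test function (our choice): f(x) = sum_i sin(x_i) + ||x||^2 / 4.
# Its Hessian is diag(1/2 - sin(x_i)), so its spectral norm is at most 3/2, i.e. L = 1.5.
def grad_f(x):
    return np.cos(x) + 0.5 * x

L = 1.5
x = np.array([3.0, -2.0])          # arbitrary starting point
T = 500
sq_norms = []
for t in range(T):
    g = grad_f(x)
    sq_norms.append(g @ g)
    x = x - g / L                  # gradient descent step with stepsize 1/L

print(np.mean(sq_norms))           # average squared gradient norm, O(1/T) by Theorem 6.2
print(min(sq_norms))               # at least one iterate has a tiny gradient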
In the smooth setting, gradient descent has another interesting prop-
erty: with stepsize 1/L, it cannot overshoot. By this, we mean that it
cannot pass a critical point (in particular, not the global minimum) when
moving from xt to xt+1 . Equivalently, with a smaller stepsize, no critical
point can be reached. With stepsize 1/L, it is possible to reach a critical
point, as we have demonstrated for the supermodel function f (x) = x2 in
Section 2.6.

Lemma 6.3 (Exercise 35). Let $f : \mathbb{R}^d \to \mathbb{R}$ be differentiable; let $x \in \mathbb{R}^d$ such that $\nabla f(x) \neq 0$, i.e. $x$ is not a critical point. Suppose that $f$ is smooth with parameter $L$ over the line segment connecting $x$ and $x' = x - \gamma\nabla f(x)$, where $\gamma = 1/L' < 1/L$. Then $x'$ is also not a critical point.

Figure 6.5 illustrates the situation.

Figure 6.5: Gradient descent on smooth functions: when moving from $x$ to $x' = x - \gamma\nabla f(x)$ with $\gamma < 1/L$, $x'$ will not be a critical point (left); equivalently, with $\gamma = 1/L$, we cannot overshoot, i.e. pass a critical point (middle); with $\gamma = 1/L$, we may exactly reach a critical point (right).

6.2 Trajectory analysis


Even if the “landscape” (graph) of a nonconvex function has local minima,
saddle points, and flat parts, it is sometimes possible to prove that gradient
descent avoids these bad spots and still converges to a global minimum.
For this, one needs a good starting point and some theoretical understand-
ing of what happens when we start there—this is trajectory analysis.

90
In 2018, results along these lines appeared that prove convergence of gradient descent to a global minimum when training deep linear neural net-
works, under suitable conditions. In this section, we will study a vastly
simplified setting that allows us to show the main ideas (and limitations)
behind one particular trajectory analysis [ACGH18].
In our simplified setting, we will look at the task of minimizing a con-
crete and very simple nonconvex function. This function turns out to be
smooth along the trajectories that we analyze, and this is one important
ingredient. However, smoothness alone does not suffice to prove con-
vergence to the global minimum, let alone fast convergence: As we have
seen in the last section, we can in general only guarantee that the gradient
norms converge to 0, and at a rather slow rate. To get beyond this, we will
need to exploit additional properties of the function under consideration.

6.2.1 Deep linear neural networks


Let us go back to the problem of learning linear models as discussed in
Section 1.6.2, using the example of Master’s admission. We had n inputs
x1 , . . . , xn , where each input xi 2 Rd consisted of d input variables; and
we had n outputs y1 , . . . , yn 2 R. Then we made the hypothesis that (after
centering), output values depend (approximately) linearly on the input,
yi ⇡ w > xi ,
for a weight vector w = (w1 , . . . , wd ) 2 Rd to be learned.
Now we consider the more general case where there is not just one
output yi 2 R as response to the i-th input, but m outputs yi 2 Rm . In this
case, the linear hypothesis becomes
yi ⇡ W xi ,
for a weight matrix W 2 Rm⇥d to be learned. The matrix that best fits this
hypothesis on the given observations is the least-squares matrix
\[
W^\star = \operatorname*{argmin}_{W \in \mathbb{R}^{m \times d}} \sum_{i=1}^{n} \|W x_i - y_i\|^2.
\]

If we let $X \in \mathbb{R}^{d \times n}$ be the matrix whose columns are the $x_i$ and $Y \in \mathbb{R}^{m \times n}$ the matrix whose columns are the $y_i$, we can equivalently write this as
\[
W^\star = \operatorname*{argmin}_{W \in \mathbb{R}^{m \times d}} \|W X - Y\|_F^2, \tag{6.2}
\]

91
where $\|A\|_F = \sqrt{\sum_{i,j} a_{ij}^2}$ is the Frobenius norm of a matrix $A$.

Finding W ⇤ (the global minimum of a convex quadratic function) is a


simple task that boils down to solving a system of linear equations; see
also Section 1.4.2. A fancy way of saying this is that we are training a
linear neural network with one layer, see Figure 6.6 (left).
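As a quick illustration (our own sketch, not part of the text; the random instance is arbitrary), the one-layer problem (6.2) can be solved directly with an off-the-shelf least-squares routine:

import numpy as np

rng = np.random.default_rng(1)
d, m, n = 5, 3, 100                       # input dim, output dim, #observations
X = rng.normal(size=(d, n))               # columns are the inputs x_i
W_true = rng.normal(size=(m, d))
Y = W_true @ X + 0.01 * rng.normal(size=(m, n))   # columns are the outputs y_i

# (6.2): W* = argmin_W ||W X - Y||_F^2.  Transposing, this is an ordinary
# least-squares problem X^T W^T ~ Y^T, solved column by column by lstsq.
W_star = np.linalg.lstsq(X.T, Y.T, rcond=None)[0].T

print(np.linalg.norm(W_star @ X - Y))     # small residual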

Figure 6.6: Left: A linear neural network over d input variables x =


(x1 , . . . , xd ) and m output variables y = (y1 , . . . , ym ). The edge connecting
input variable xj with output variable yi has a weight wij (to be learned),
and all weights together form a weight matrix W 2 Rm⇥d . Given the
weights, the network computes the linear transformation y = W x be-
tween inputs and outputs. Right: a deep linear neural network of depth
3 with weight matrices W1 , W2 , W3 . Given the weights, the network com-
putes the linear transformation y = W3 W2 W1 x.

But what if we have $\ell$ layers (Figure 6.6 (right))? Training such a network corresponds to minimizing
\[
\|W_\ell W_{\ell-1} \cdots W_1 X - Y\|_F^2,
\]
over $\ell$ weight matrices $W_1, \ldots, W_\ell$ to be learned. In case of linear neural networks, there is no benefit in adding layers, as any linear transformation $x \mapsto W_\ell W_{\ell-1} \cdots W_1 x$ can of course be represented as $x \mapsto W x$ with $W := W_\ell W_{\ell-1} \cdots W_1$. But from a theoretical point of view, a deep linear neu-
ral network gives us a simple playground in which we can try to under-
stand why training deep neural networks with gradient descent works,

92
despite the fact that the objective function is no longer convex. The hope
is that such an understanding can ultimately lead to an analyis of gradient
descent (or other suitable methods) for “real” (meaning non-linear) deep
neural networks.
In the next section, we will discuss the case where all matrices are 1 ⇥ 1,
so they are just numbers. This is arguably a toy example in our already
simple playground. Still, it gives rise to a nontrivial nonconvex function,
and the analysis of gradient descent on it will require similar ingredients
as the one on general deep linear neural networks [ACGH18].

6.2.2 A simple nonconvex function


The function (that we consider fixed throughout the section) is $f : \mathbb{R}^d \to \mathbb{R}$ defined by
\[
f(x) := \frac{1}{2}\Big(\prod_{k=1}^{d} x_k - 1\Big)^2. \tag{6.3}
\]
As $d$ is fixed, we will abbreviate $\prod_{k=1}^{d} x_k$ by $\prod_k x_k$ throughout. Minimizing
this function corresponds to training a deep linear neural network with d
layers, one neuron per layer, with just one training input x = 1 and a
corresponding output y = 1. Figure 6.7 visualizes the function f for d = 2.
First of all, the function f does have global minima, as it is nonnegative,
and value 0 can be achieved (in many ways). Hence, we immediately
know how to minimize this (for example, set xk = 1 for all k). The question
is whether gradient descent also knows, and if so, how we prove this.
Let us start by computing the gradient. We have
\[
\nabla f(x) = \Big(\prod_{k} x_k - 1\Big)\Big(\prod_{k \neq 1} x_k, \ldots, \prod_{k \neq d} x_k\Big)^\top. \tag{6.4}
\]

What are the critical points, the ones where $\nabla f(x)$ vanishes? This happens when $\prod_k x_k = 1$, in which case we have a global minimum (level 0
in Figure 6.7). But there are other critical points. Whenever at least two
of the xk are zero, the gradient also vanishes, and the value of f is 1/2 at
such a point (point 0 in Figure 6.7). This already shows that the function
cannot be convex, as for convex functions, every critical point is a global
minimum (Lemma 1.16). It is easy to see that every non-optimal critical
point must have two or more zeros.
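For concreteness, here is a small sketch (our own, not from the text) of the function (6.3) and its gradient (6.4):

import numpy as np

def f(x):
    """f(x) = 1/2 (prod_k x_k - 1)^2, see (6.3)."""
    return 0.5 * (np.prod(x) - 1.0) ** 2

def grad_f(x):
    """Gradient (6.4): (prod_k x_k - 1) * (prod_{k != i} x_k)_{i=1..d}."""
    p = np.prod(x)
    partials = p / x          # prod over k != i; valid since we only use it for x > 0
    return (p - 1.0) * partials

x = np.array([0.5, 0.5, 0.5])
print(f(x), grad_f(x))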

93
Figure 6.7: Level sets of $f(x_1, x_2) = \frac{1}{2}(x_1 x_2 - 1)^2$

In fact, all critical points except the global minima are saddle points.
This is because at any such point x, we can slightly perturb the (two or
more) zero entries in such a way that the product of all entries becomes
either positive or negative, so that the function value either decreases or
increases.
Figure 6.8 visualizes (scaled) negative gradients of f for d = 2; these are
the directions in which gradient descent would move from the tails of the
respective arrows. The figure already indicates that it is difficult to avoid
convergence to a global minimum, but it is possible (see Exercise 37).
We now want to show that for any dimension $d$, and from anywhere in $X = \{x : x > 0, \prod_k x_k \leq 1\}$, gradient descent will converge to a global
minimum. Unfortunately, our function f is not smooth over X. For the
analysis, we will therefore show that f is smooth along the trajectory of

94
Figure 6.8: Scaled negative gradients of $f(x_1, x_2) = \frac{1}{2}(x_1 x_2 - 1)^2$

gradient descent for suitable L, so that we get sufficient decrease

\[
f(x_{t+1}) \leq f(x_t) - \frac{1}{2L}\|\nabla f(x_t)\|^2, \quad t \geq 0
\]
by Lemma 2.6.
This already shows that gradient descent cannot converge to a saddle
point: all these have (at least two) zero entries and therefore function value
1/2. But for starting point x0 2 X, we have f (x0 ) < 1/2, so we can never
reach a saddle while decreasing f .
But doesn’t this mean that we necessarily have to converge to a global
minimum? No, because the sublevel sets of f are unbounded, so it could in
principle happen that gradient descent runs off to infinity while constantly
improving $f(x_t)$ (an example is gradient descent on $f(x) = e^{-x}$). Or some

95
other bad behavior occurs (we haven’t characterized what can go wrong).
So there is still something to prove.

How about convergence from other starting points? For $x > 0$, $\prod_k x_k \geq 1$, we also get convergence (Exercise 36). But there are also starting points
from which gradient descent will not converge to a global minimum (Ex-
ercise 37).
The following simple lemma is the key to showing that gradient de-
scent behaves nicely in our case.
Definition 6.4. Let $x > 0$ (componentwise), and let $c \geq 1$ be a real number. $x$ is called $c$-balanced if $x_i \leq c x_j$ for all $1 \leq i, j \leq d$.

In fact, any initial iterate $x_0 > 0$ is $c$-balanced for some (possibly large) $c$.

Lemma 6.5. Let $x > 0$ be $c$-balanced with $\prod_k x_k \leq 1$. Then for any stepsize $\gamma > 0$, $x' := x - \gamma\nabla f(x)$ satisfies $x' \geq x$ (componentwise) and is also $c$-balanced.
If c = 1 (all entries of x are equal), this is easy to see since then also
all entries of rf (x) in (6.4) are equal.
Later we will show that for suitable step size, we also maintain that $\prod_k x'_k \leq 1$, so that gradient descent only
goes through balanced iterates.
Q Q
Proof. Set $\delta := -\gamma\big(\prod_k x_k - 1\big)\big(\prod_k x_k\big) \geq 0$. Then the gradient descent update assumes the form
\[
x'_k = x_k + \frac{\delta}{x_k}, \quad k = 1, \ldots, d.
\]
For $i, j$, we have $x_i \leq c x_j$ and $x_j \leq c x_i$ ($\Leftrightarrow 1/x_i \leq c/x_j$). We therefore get
\[
x'_i = x_i + \frac{\delta}{x_i} \leq c x_j + \frac{c\delta}{x_j} = c x'_j.
\]

6.2.3 Smoothness along the trajectory


It will turn out that our function f —despite not being globally smooth—
is smooth over the trajectory of gradient descent, assuming that we start with $x_0 > 0$, $\prod_k (x_0)_k < 1$. We will derive this from bounded Hessians.
Let us therefore start by computing the Hessian matrix r2 f (x), where by

96
definition, r2 f (x)ij is the j-th partial derivative of the i-th entry of rf (x).
This $i$-th entry is
\[
(\nabla f)_i = \Big(\prod_{k} x_k - 1\Big)\prod_{k \neq i} x_k
\]
and its $j$-th partial derivative is therefore
\[
\nabla^2 f(x)_{ij} =
\begin{cases}
\Big(\displaystyle\prod_{k \neq i} x_k\Big)^2, & j = i, \\[2mm]
2\displaystyle\prod_{k \neq i} x_k \prod_{k \neq j} x_k - \prod_{k \neq i,j} x_k, & j \neq i.
\end{cases}
\]
This looks promising: if $\prod_k x_k \leq 1$, then we would also expect that the products $\prod_{k \neq i} x_k$ and $\prod_{k \neq i,j} x_k$ are small, in which case all entries of the Hessian are small, giving us a bound on $\|\nabla^2 f(x)\|$ that we need to establish smoothness of $f$. However, for general $x$, this fails. If $x$ contains entries close to 0, it may happen that some terms $\prod_{k \neq i} x_k$ and $\prod_{k \neq i,j} x_k$ are actually very large.
What comes to our rescue is again c-balancedness.
Lemma 6.6. Suppose that $x > 0$ is $c$-balanced (Definition 6.4). Then for any $I \subseteq \{1, \ldots, d\}$, we have
\[
\Big(\frac{1}{c}\Big)^{|I|}\Big(\prod_{k} x_k\Big)^{1 - |I|/d} \;\leq\; \prod_{k \notin I} x_k \;\leq\; c^{|I|}\Big(\prod_{k} x_k\Big)^{1 - |I|/d}.
\]

Proof. For any $i$, we have $x_i^d \geq (1/c)^d \prod_k x_k$ by balancedness, hence $x_i \geq (1/c)\big(\prod_k x_k\big)^{1/d}$. It follows that
\[
\prod_{k \notin I} x_k = \frac{\prod_k x_k}{\prod_{i \in I} x_i} \leq \frac{\prod_k x_k}{(1/c)^{|I|}\big(\prod_k x_k\big)^{|I|/d}} = c^{|I|}\Big(\prod_k x_k\Big)^{1 - |I|/d}.
\]
The lower bound follows in the same way from $x_i^d \leq c^d \prod_k x_k$.
This lets us bound the Hessians of c-balanced points.
Lemma 6.7. Let $x > 0$ be $c$-balanced with $\prod_k x_k \leq 1$. Then
\[
\|\nabla^2 f(x)\| \leq \|\nabla^2 f(x)\|_F \leq 3dc^2,
\]
where $\|A\|_F$ is the Frobenius norm and $\|A\|$ the spectral norm.

97
Proof. The fact that $\|A\| \leq \|A\|_F$ is Exercise 38. To bound the Frobenius norm, we use the previous lemma to compute
\[
\nabla^2 f(x)_{ii} = \Big(\prod_{k \neq i} x_k\Big)^2 \leq c^2
\]
and for $i \neq j$,
\[
\big|\nabla^2 f(x)_{ij}\big| \leq 2\prod_{k \neq i} x_k \prod_{k \neq j} x_k + \prod_{k \neq i,j} x_k \leq 3c^2.
\]
Hence, $\|\nabla^2 f(x)\|_F^2 \leq 9d^2c^4$. Taking square roots, the statement follows.
This now implies smoothness of $f$ along the whole trajectory of gradient descent, under the usual “smooth stepsize” $\gamma = 1/L = 1/(3dc^2)$.

Lemma 6.8. Let $x > 0$ be $c$-balanced with $\prod_k x_k < 1$, $L = 3dc^2$. Let $\gamma := 1/L$. Then for all $0 \leq \nu \leq \gamma$,
\[
x' := x - \nu\nabla f(x) \geq x
\]
is $c$-balanced with $\prod_k x'_k \leq 1$, and $f$ is smooth with parameter $L$ over the line segment connecting $x$ and $x - \gamma\nabla f(x)$.

Proof. We get that $x' \geq x > 0$ is $c$-balanced by Lemma 6.5. Furthermore, $\nabla f(x) \neq 0$ (due to $x > 0$, $\prod_k x_k < 1$, we can't be at a critical point). By Lemma 6.3 (no overshooting), we can't reach $\prod_k x'_k = 1$ (global minimum) for $\nu < \gamma$, as $f$ is smooth with parameter $L$ over $X = \mathbf{conv}\{x, x'\}$ for the smallest such $\nu$, using Lemma 6.1 with the bound on the Hessians from the previous lemma. By continuity, we therefore get $\prod_k x'_k \leq 1$ for all $\nu \leq \gamma$, and $f$ is smooth with parameter $L$ over the line segment connecting $x$ and $x'$ for $\nu = \gamma$.

6.2.4 Convergence
Theorem 6.9. Let $c \geq 1$ and $\delta > 0$ such that $x_0 > 0$ is $c$-balanced with $\delta \leq \prod_k (x_0)_k < 1$. Choosing stepsize
\[
\gamma = \frac{1}{3dc^2},
\]

98
gradient descent satisfies
\[
f(x_T) \leq \Big(1 - \frac{\delta^2}{3c^4}\Big)^T f(x_0), \quad T \geq 0.
\]

This means that the loss indeed converges to its optimal value 0, and
does so with a fast exponential error decrease. Exercise 39 asks you to
prove that also the iterates themselves converge (to an optimal solution),
so gradient descent will not run off to infinity.
Proof. For each $t \geq 0$, $f$ is smooth over $\mathbf{conv}(\{x_t, x_{t+1}\})$ with parameter $L = 3dc^2$, hence Lemma 2.6 yields sufficient decrease:
\[
f(x_{t+1}) \leq f(x_t) - \frac{1}{6dc^2}\|\nabla f(x_t)\|^2. \tag{6.5}
\]
For every $c$-balanced $x$ with $\delta \leq \prod_k x_k \leq 1$, we have
\begin{align*}
\|\nabla f(x)\|^2 &= 2f(x)\sum_{i=1}^{d}\Big(\prod_{k \neq i} x_k\Big)^2 \\
&\geq 2f(x)\,\frac{d}{c^2}\Big(\prod_{k} x_k\Big)^{2 - 2/d} && \text{(Lemma 6.6)} \\
&\geq 2f(x)\,\frac{d}{c^2}\Big(\prod_{k} x_k\Big)^{2} \\
&\geq 2f(x)\,\frac{d\delta^2}{c^2}.
\end{align*}
Then, (6.5) further yields
\[
f(x_{t+1}) \leq f(x_t) - \frac{1}{6dc^2}\,2f(x_t)\,\frac{d\delta^2}{c^2} = f(x_t)\Big(1 - \frac{\delta^2}{3c^4}\Big),
\]
proving the theorem.
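The following sketch (our own; all numerical choices are illustrative) runs gradient descent with the stepsize of Theorem 6.9 from a $c$-balanced starting point and compares the observed loss with the guaranteed bound:

import numpy as np

def f(x):
    return 0.5 * (np.prod(x) - 1.0) ** 2

def grad_f(x):
    p = np.prod(x)
    return (p - 1.0) * (p / x)        # valid since all iterates stay > 0

d = 5
x = np.full(d, 0.9)                   # 1-balanced (c = 1), prod(x) = 0.9^5 < 1
c, delta = 1.0, np.prod(x)
gamma = 1.0 / (3 * d * c**2)          # stepsize of Theorem 6.9

f0 = f(x)
for t in range(200):
    x = x - gamma * grad_f(x)

rate = 1.0 - delta**2 / (3 * c**4)    # guaranteed decrease factor per step
print(f(x), f0 * rate**200)           # observed loss vs. guaranteed bound
print(x)                              # close to the all-ones solution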


This looks great: just as for strongly convex functions, we seem to have
fast convergence since the function value goes down by a constant factor
in each step. There is a catch, though. To see this, consider the starting

99
solution $x_0 = (1/2, \ldots, 1/2)$. This is $c$-balanced with $c = 1$, but the $\delta$ that we get is $1/2^d$. Hence, the “constant factor” is
\[
1 - \frac{1}{3 \cdot 4^d},
\]
and we need $T \approx 4^d$ to reduce the initial error by a constant factor not depending on $d$.

Indeed, for this starting value $x_0$, the gradient is exponentially small, so we are crawling towards the optimum at exponentially small speed. In order to get polynomial-time convergence, we need to start with a $\delta$ that decays at most polynomially with $d$. For large $d$, this requires us to start very close to optimality. As a concrete example, let us try to achieve a constant $\delta$ (not depending on $d$) with a 1-balanced solution of the form $x_i = 1 - b/d$ for all $i$. For this, we need that
\[
\Big(1 - \frac{b}{d}\Big)^d \approx e^{-b} = \Omega(1),
\]
and this requires $b = O(1)$. Hence, we need to start at distance $O(1/\sqrt{d})$ from the optimal solution $(1, \ldots, 1)$.
The problem is due to constant stepsize. Indeed, f is locally much
smoother at small $x_0$ than Lemma 6.8 predicts, so we could afford much
larger steps in the beginning. The lemma covers the “worst case” when
we are close to optimality already.
So could we improve using a time-varying stepsize? The question is
moot: if we know the function f under consideration, we do not need
to run any optimization in the first place. The question we were trying
to address is whether and how a standard gradient descent algorithm is
able to optimize nonconvex functions as well. Above, we have given a
(partially satisfactory) answer for a concrete function: yes, it can, but at a
very slow rate, if d is large and the starting point not close to optimality
yet.

6.3 Exercises
Exercise 33. Let f : Rn ! R twice differentiable, with X ✓ dom(f ) an open
convex set, and suppose that f is smooth with parameter L over X. Prove that

100
under these conditions, the largest eigenvalue of the Hessian satisfies $\lambda_{\max}(\nabla^2 f(x)) \leq L$ for all $x \in X$.

Exercise 34. Prove that the statement of Theorem 6.2 implies that
\[
\lim_{t \to \infty} \|\nabla f(x_t)\|^2 = 0.
\]

Exercise 35. Prove Lemma 6.3 (gradient descent does not overshoot on smooth
functions).
Exercise 36. Consider the function $f(x) = \frac{1}{2}\big(\prod_{k=1}^{d} x_k - 1\big)^2$. Prove that for any starting point $x_0 \in X = \{x \in \mathbb{R}^d : x > 0, \prod_k x_k \geq 1\}$ and any $\varepsilon > 0$, gradient descent attains $f(x_T) \leq \varepsilon$ for some iteration $T$.
Exercise 37. Consider the function $f(x) = \frac{1}{2}\big(\prod_{k=1}^{d} x_k - 1\big)^2$. Prove that for even dimension $d \geq 2$, there is a point $x_0$ (not a critical point) such that gradient
descent does not converge to a global minimum when started at x0 , regardless of
step size(s).

Exercise 38. Prove that for any matrix A, kAk  kAkF , where k·k is the spectral
norm and k·kF the Frobenius norm.

Exercise 39. Prove that the sequence $(x_T)_{T \geq 0}$ of iterates in Theorem 6.9 converges to an optimal solution $x^\star$.

101
Chapter 7

Newton’s Method

Contents
7.1 1-dimensional case . . . . . . . . . . . . . . . . . . . . . . . . 103
7.2 Newton’s method for optimization . . . . . . . . . . . . . . . 105
7.3 Once you’re close, you’re there. . . . . . . . . . . . . . . . . . . 107
7.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111

102
7.1 1-dimensional case
The Newton method (or Newton-Raphson method, invented by Sir Isaac
Newton and formalized by Joseph Raphson) is an iterative method for
finding a zero of a differentiable univariate function f : R ! R. Starting
from some number x0 , it computes

\[
x_{t+1} := x_t - \frac{f(x_t)}{f'(x_t)}, \quad t \geq 0. \tag{7.1}
\]

Figure 7.1 shows what happens. xt+1 is the point where the tangent line
to the graph of $f$ at $(x_t, f(x_t))$ intersects the $x$-axis. In formulas, $x_{t+1}$ is the
solution of the linear equation

\[
f(x_t) + f'(x_t)(x - x_t) = 0,
\]

and this yields the update formula (7.1).

Figure 7.1: One step of Newton's method

The Newton step (7.1) obviously fails if f 0 (xt ) = 0 and may get out of
control if |f 0 (xt )| is very small. Any theoretical analysis will have to make
suitable assumptions to avoid this. But before going into this, we look at
Newton’s method in a benign case.

103
Let $f(x) = x^2 - R$, where $R \in \mathbb{R}_+$. $f$ has two zeros, $\sqrt{R}$ and $-\sqrt{R}$. Starting for example at $x_0 = R$, we hope to converge to $\sqrt{R}$ quickly. In this case, (7.1) becomes
\[
x_{t+1} = x_t - \frac{x_t^2 - R}{2x_t} = \frac{1}{2}\Big(x_t + \frac{R}{x_t}\Big). \tag{7.2}
\]
This is in fact the Babylonian method to compute square roots, and here we
see that it is just a special case of Newton’s method. p
Can we prove that we indeed quickly converge to $\sqrt{R}$? What we immediately see from (7.2) is that all iterates will be positive and hence
\[
x_{t+1} = \frac{1}{2}\Big(x_t + \frac{R}{x_t}\Big) \geq \frac{x_t}{2}.
\]
So we cannot be too fast. Suppose $R \geq 1$. In order to even get $x_t < 2\sqrt{R}$, we need at least $T \geq \log(R)/2$ steps. It turns out that the Babylonian method starts taking off only when $x_t - \sqrt{R} < 1/2$, say (Exercise 40 asks you to prove that it takes $O(\log R)$ steps to get there).

To watch takeoff, let us now suppose that $x_0 - \sqrt{R} < 1/2$, so we are starting close to $\sqrt{R}$ already. We rewrite (7.2) as
\[
x_{t+1} - \sqrt{R} = \frac{x_t - \sqrt{R}}{2} - \frac{\sqrt{R}\,(x_t - \sqrt{R})}{2x_t} = \frac{1}{2x_t}\big(x_t - \sqrt{R}\big)^2. \tag{7.3}
\]
p
Assuming for now that $R \geq 1/4$, all iterates have value at least $\sqrt{R} \geq 1/2$, hence we get
\[
x_{t+1} - \sqrt{R} \leq \big(x_t - \sqrt{R}\big)^2.
\]
This means that the error goes to 0 quadratically, and
\[
x_T - \sqrt{R} \leq \big(x_0 - \sqrt{R}\big)^{2^T} < \Big(\frac{1}{2}\Big)^{2^T}, \quad T \geq 0. \tag{7.4}
\]
What does this tell us? In order to get $x_T - \sqrt{R} < \varepsilon$, we only need $T = \log\log(\frac{1}{\varepsilon})$ steps! Hence, it takes a while to get to roughly $\sqrt{R}$, but from there, we achieve high accuracy very fast.
from there, we achieve high accuracy very fast.


Let us do a concrete example (with IEEE 754 double arithmetic). If $R = 1000$, we need 7 steps to get $x_7 - \sqrt{1000} < 1/2$, and then just 3 more steps to get $x_{10}$ equal to $\sqrt{1000}$ up to the machine precision (53 binary digits). In this last phase, we essentially double the number of correct digits in each iteration!
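This is easy to reproduce; here is a short sketch (ours) of the Babylonian iteration (7.2) for $R = 1000$:

import math

R = 1000.0
x = R                      # start at x0 = R
for t in range(1, 11):
    x = 0.5 * (x + R / x)  # Babylonian / Newton step (7.2)
    print(t, x - math.sqrt(R))
# the error drops below 1/2 around t = 7 and reaches machine precision by t = 10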

104
7.2 Newton’s method for optimization
Suppose we want to find a global minimum x? of a differentiable con-
vex function f : R ! R (assuming that a global minimum exists). Lem-
mata 1.16 and 1.17 guarantee that we can equivalently search for a zero of
the derivative f 0 . To do this, we can apply Newton’s method if f is twice
differentiable; the update step then becomes
\[
x_{t+1} := x_t - \frac{f'(x_t)}{f''(x_t)} = x_t - f''(x_t)^{-1} f'(x_t), \quad t \geq 0. \tag{7.5}
\]
There is no reason to restrict to d = 1. Here is Newton’s method for min-
imizing a convex function f : Rd ! R. We choose x0 arbitrarily and then
iterate:
\[
x_{t+1} := x_t - \nabla^2 f(x_t)^{-1}\nabla f(x_t), \quad t \geq 0. \tag{7.6}
\]
The update vector r2 f (xt ) 1 rf (xt ) is the result of a matrix-vector mul-
tiplication: we invert the Hessian at xt and multiply the result with the
gradient at xt . As before, this fails if the Hessian is not invertible, and may
get out of control if the Hessian has small norm.
We have introduced iteration (7.6) simply as a (more or less natural)
generalization of (7.5), but there’s more to it. If we consider (7.6) as a
special case of a general update scheme
\[
x_{t+1} = x_t - H(x_t)\nabla f(x_t),
\]
where $H(x) \in \mathbb{R}^{d \times d}$ is some matrix, then we see that also gradient descent (2.1) is of this form, with $H(x_t) = \gamma I$. Hence, Newton's method can
also be thought of as “adaptive gradient descent” where the adaptation is
w.r.t. the local geometry of the function at xt . Indeed, as we show next,
this allows Newton’s method to converge on all nondegenerate quadratic
functions in one step, while gradient descent only does so with the right
stepsize on “beautiful” quadratic functions whose sublevel sets are Eu-
clidean balls (Exercise 18).
Lemma 7.1. A nondegenerate quadratic function is a function of the form
\[
f(x) = \frac{1}{2}x^\top M x - q^\top x + c,
\]
where $M \in \mathbb{R}^{d \times d}$ is an invertible symmetric matrix, $q \in \mathbb{R}^d$, $c \in \mathbb{R}$. Let $x^\star = M^{-1}q$ be the unique solution of $\nabla f(x) = 0$ (the unique global minimum if $f$ is convex). With any starting point $x_0 \in \mathbb{R}^d$, Newton's method (7.6) yields $x_1 = x^\star$.

105
Proof. We have $\nabla f(x) = M x - q$ (this implies $x^\star = M^{-1}q$) and $\nabla^2 f(x) = M$. Hence,
\[
x_0 - \nabla^2 f(x_0)^{-1}\nabla f(x_0) = x_0 - M^{-1}(M x_0 - q) = M^{-1}q = x^\star.
\]

In particular, Newton’s method can solve an invertible system M x = q


of linear equations in one step. But no miracle is happening here, as this
step involves the inversion of the matrix r2 f (x0 ) = M .
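A tiny numerical sketch (ours, on an arbitrary random instance) of Lemma 7.1: one Newton step on a nondegenerate quadratic lands exactly on $x^\star$.

import numpy as np

rng = np.random.default_rng(2)
d = 4
A = rng.normal(size=(d, d))
M = A @ A.T + np.eye(d)          # symmetric positive definite -> convex quadratic
q = rng.normal(size=d)

def grad(x):
    return M @ x - q             # gradient of f(x) = 1/2 x^T M x - q^T x + c

def hess(x):
    return M                     # constant Hessian

x0 = rng.normal(size=d)
x1 = x0 - np.linalg.solve(hess(x0), grad(x0))    # one Newton step (7.6)

print(np.allclose(x1, np.linalg.solve(M, q)))    # True: x1 = x* = M^{-1} q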
More generally, the behavior of Newton’s method is affine invariant.
By this, we mean that it is invariant under any invertible affine transfor-
mation, as follows:

Lemma 7.2 (Exercise 41). Let $f : \mathbb{R}^d \to \mathbb{R}$ be twice differentiable, $A \in \mathbb{R}^{d \times d}$ an invertible matrix, $b \in \mathbb{R}^d$. Let $g : \mathbb{R}^d \to \mathbb{R}^d$ be the (bijective) affine function $g(y) = Ay + b$, $y \in \mathbb{R}^d$. Finally, for a twice differentiable function $h : \mathbb{R}^d \to \mathbb{R}$, let $N_h : \mathbb{R}^d \to \mathbb{R}^d$ denote the Newton step for $h$, i.e.
\[
N_h(x) := x - \nabla^2 h(x)^{-1}\nabla h(x),
\]
whenever this is defined. Then we have $N_{f \circ g} = g^{-1} \circ N_f \circ g$.

This says that in order to perform a Newton step for $f \circ g$ on $y_t$, we can transform $y_t$ to $x_t = g(y_t)$, perform the Newton step for $f$ on $x_t$, and transform the result $x_{t+1}$ back to $y_{t+1} = g^{-1}(x_{t+1})$. Another way of saying this is that the following diagram commutes:
\[
\begin{array}{ccc}
x_t & \xrightarrow{\;N_f\;} & x_{t+1} \\
g \uparrow & & \downarrow g^{-1} \\
y_t & \xrightarrow{\;N_{f\circ g}\;} & y_{t+1}
\end{array}
\]

106
Hence, while gradient descent suffers if the coordinates are at very dif-
ferent scales, Newton’s method doesn’t.
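Affine invariance is also easy to verify numerically. The following sketch (ours; the test function, the matrix $A$ and the point $y$ are arbitrary choices) checks the identity of Lemma 7.2 at one point:

import numpy as np

# f(x) = sum_i x_i^4 / 4 + ||x||^2 / 2: convex with a non-constant Hessian.
grad_f = lambda x: x**3 + x
hess_f = lambda x: np.diag(3 * x**2 + 1)

def newton_step(x, grad, hess):
    return x - np.linalg.solve(hess(x), grad(x))

A = np.array([[2.0, 1.0, 0.0], [0.0, 3.0, 1.0], [1.0, 0.0, 2.0]])   # invertible
b = np.array([0.5, -1.0, 2.0])
g = lambda y: A @ y + b

# Gradient and Hessian of f o g via the chain rule.
grad_fg = lambda y: A.T @ grad_f(g(y))
hess_fg = lambda y: A.T @ hess_f(g(y)) @ A

y = np.array([0.3, -0.7, 1.1])
lhs = newton_step(y, grad_fg, hess_fg)                               # N_{f o g}(y)
rhs = np.linalg.solve(A, newton_step(g(y), grad_f, hess_f) - b)      # g^{-1}(N_f(g(y)))
print(np.allclose(lhs, rhs))                                         # True, as in Lemma 7.2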
We conclude the general exposition with another interpretation of New-
ton’s method: each step minimizes the local second-order Taylor approxi-
mation.
Lemma 7.3 (Exercise 44). Let $f$ be convex and twice differentiable at $x_t \in \mathbf{dom}(f)$, with $\nabla^2 f(x_t) \succ 0$ being invertible. The vector $x_{t+1}$ resulting from the Newton step (7.6) satisfies
\[
x_{t+1} = \operatorname*{argmin}_{x \in \mathbb{R}^d}\; f(x_t) + \nabla f(x_t)^\top (x - x_t) + \frac{1}{2}(x - x_t)^\top \nabla^2 f(x_t)(x - x_t).
\]

7.3 Once you’re close, you’re there. . .


We will prove a result about Newton’s method that may seem rather weak:
under suitable conditions, and starting close to the global minimum, we
will reach distance at most " to the minimum within log log(1/") steps.
The weak part here is of course not the number of steps log log(1/")—this
is much faster than anything we have seen so far—but the assumption that
we are starting close to the minimum already. Under such an assumption,
we say that we have a local convergence result.
Global convergence results that hold for every starting point in general
were unknown for Newton’s method as in (7.6) until recently. Under a sta-
bility assumption on the Hessian, global convergence was shown to hold
by [KSJ18]. There are some variants of the method for which such results
can be proved, most notably the cubic regularization variant of Nesterov
and Polyak [NP06]. Weak global convergence results can be obtained by
adding a step size to (7.6) and always making only steps that decrease the
function value (which may not happen under the full Newton step).
An alternative is to use gradient descent to get us sufficiently close to
the global minimum, and then switch to Newton’s method for the rest. In
Chapter 2, we have seen that under favorable conditions, we may know
when gradient descent has taken us close enough.
In practice, Newton’s method is often (but not always) much faster
than gradient descent in terms of the number of iterations. The price to pay
is a higher iteration cost, since we need to compute (and invert) Hessians.
After this disclaimer, let us state the main result right away. We follow
Vishnoi [Vis14].

107
Theorem 7.4. Let f : dom(f ) ! R be convex with a unique global mini-
mum x? . Suppose that there is a ball X ✓ dom(f ) with center x? such that the
following two properties hold.
(i) Bounded inverse Hessians: There exists a real number $\mu > 0$ such that
\[
\|\nabla^2 f(x)^{-1}\| \leq \frac{1}{\mu}, \quad \forall x \in X.
\]

(ii) Lipschitz continuous Hessians: There exists a real number $B > 0$ such that
\[
\|\nabla^2 f(x) - \nabla^2 f(y)\| \leq B\|x - y\|, \quad \forall x, y \in X.
\]

In both cases, the matrix norm is the spectral norm defined in Lemma 2.5. Property (i) in particular stipulates that Hessians are invertible at all points in $X$. Then, for $x_t \in X$ and $x_{t+1}$ resulting from the Newton step (7.6), we have
\[
\|x_{t+1} - x^\star\| \leq \frac{B}{2\mu}\|x_t - x^\star\|^2.
\]

Before we prove this main theorem, here is the local convergence result
that follows from it.
Corollary 7.5 (Exercise 42). With the assumptions and terminology of Theorem 7.4, and if $x_0 \in X$ satisfies
\[
\|x_0 - x^\star\| \leq \frac{\mu}{B},
\]
then Newton's method (7.6) yields
\[
\|x_T - x^\star\| \leq \frac{\mu}{B}\Big(\frac{1}{2}\Big)^{2^T - 1}, \quad T \geq 0.
\]
Hence, we have a bound as in (7.4) for the last phase of the Babylonian method: in order to get $\|x_T - x^\star\| < \varepsilon$, we only need $T = \log\log(\frac{1}{\varepsilon})$ steps. But before this fast behavior kicks in, we need to be $\mu/B$-close to $x^\star$ already.
An intuitive reason for fast convergence is that under our assumptions,
the Hessians we encounter are almost constant when we are close to x? .
This means that locally, our function behaves almost like a quadratic func-
tion which has truly constant Hessians and allows Newton’s method to
convergence in one step (Lemma 7.1).

108
Lemma 7.6 (Exercise 43). With the assumptions and terminology of Theorem 7.4, and if $x_0 \in X$ satisfies
\[
\|x_0 - x^\star\| \leq \frac{\mu}{B},
\]
then the Hessians in Newton's method satisfy the relative error bound
\[
\frac{\|\nabla^2 f(x_t) - \nabla^2 f(x^\star)\|}{\|\nabla^2 f(x^\star)\|} \leq \Big(\frac{1}{2}\Big)^{2^t - 1}, \quad t \geq 0.
\]
We still owe the reader the proof of the main convergence result, Theorem 7.4:

Proof of Theorem 7.4. To simplify notation, let us abbreviate $H := \nabla^2 f$, $x = x_t$, $x' = x_{t+1}$. Subtracting $x^\star$ from both sides of (7.6), we get
\begin{align*}
x' - x^\star &= x - x^\star - H(x)^{-1}\nabla f(x) \\
&= x - x^\star + H(x)^{-1}\big(\nabla f(x^\star) - \nabla f(x)\big) \\
&= x - x^\star + H(x)^{-1}\int_0^1 H(x + t(x^\star - x))(x^\star - x)\,dt,
\end{align*}
using the fundamental theorem of calculus and the chain rule as in (1.2) with $h(t) = \nabla f(x + t(x^\star - x))$. With
\[
x - x^\star = -H(x)^{-1}H(x)(x^\star - x) = -H(x)^{-1}\int_0^1 H(x)(x^\star - x)\,dt,
\]
we further get
\[
x' - x^\star = H(x)^{-1}\int_0^1\big(H(x + t(x^\star - x)) - H(x)\big)(x^\star - x)\,dt.
\]
Taking norms, we have
\[
\|x' - x^\star\| \leq \|H(x)^{-1}\| \cdot \Big\|\int_0^1\big(H(x + t(x^\star - x)) - H(x)\big)(x^\star - x)\,dt\Big\|,
\]
where we have used that $\|Ay\| \leq \|A\| \cdot \|y\|$ for any matrix $A \in \mathbb{R}^{d \times d}$ and any vector $y \in \mathbb{R}^d$, which follows directly from the definition of the spectral norm. As we also have
\[
\Big\|\int_0^1 g(t)\,dt\Big\| \leq \int_0^1\|g(t)\|\,dt
\]

109
for any vector-valued function $g$ (Exercise 46), we can further bound
\begin{align*}
\|x' - x^\star\| &\leq \|H(x)^{-1}\|\int_0^1\big\|\big(H(x + t(x^\star - x)) - H(x)\big)(x^\star - x)\big\|\,dt \\
&\leq \|H(x)^{-1}\|\int_0^1\big\|H(x + t(x^\star - x)) - H(x)\big\| \cdot \|x^\star - x\|\,dt \\
&= \|H(x)^{-1}\| \cdot \|x^\star - x\|\int_0^1\big\|H(x + t(x^\star - x)) - H(x)\big\|\,dt.
\end{align*}
We can now use the properties (i) and (ii) (bounded inverse Hessians, Lipschitz continuous Hessians) to conclude that
\[
\|x' - x^\star\| \leq \frac{1}{\mu}\|x^\star - x\|\int_0^1 B\|t(x^\star - x)\|\,dt = \frac{B}{\mu}\|x^\star - x\|^2\underbrace{\int_0^1 t\,dt}_{1/2}.
\]

How realistic are properties (i) and (ii)? If $f$ is twice continuously differentiable (meaning that the second derivative $\nabla^2 f$ is continuous), then we will always find suitable values of $\mu$ and $B$ over a ball $X$ with center $x^\star$—provided that $\nabla^2 f(x^\star) \neq 0$.

Indeed, already in the one-dimensional case, we see that under $f''(x^\star) = 0$ (vanishing second derivative at the global minimum), Newton's method will in the worst case reduce the distance to $x^\star$ only by a constant factor in each step, no matter how close to $x^\star$ we start. Exercise 45 asks you to find such an example. In such a case, we have linear convergence, but the fast quadratic convergence ($O(\log\log(1/\varepsilon))$ steps) cannot be proven.
One way to ensure bounded inverse Hessians is to require strong con-
vexity over X.
Lemma 7.7 (Exercise 47). Let $f : \mathbf{dom}(f) \to \mathbb{R}$ be twice differentiable and strongly convex with parameter $\mu$ over an open convex subset $X \subseteq \mathbf{dom}(f)$ according to Definition 2.8, meaning that
\[
f(y) \geq f(x) + \nabla f(x)^\top (y - x) + \frac{\mu}{2}\|x - y\|^2, \quad \forall x, y \in X.
\]
Then $\nabla^2 f(x)$ is invertible and $\|\nabla^2 f(x)^{-1}\| \leq 1/\mu$ for all $x \in X$, where $\|\cdot\|$ is the spectral norm defined in Lemma 2.5.

110
7.4 Exercises
Exercise 40. Consider the Babylonian method (7.2). Prove that we get $x_T - \sqrt{R} < 1/2$ for $T = O(\log R)$.
Exercise 41. Prove Lemma 7.2!
Exercise 42. Prove Corollary 7.5!
Exercise 43. Prove Lemma 7.6!
Exercise 44. Prove Lemma 7.3!
Exercise 45. Let $\varepsilon > 0$ be any real number. Find an example of a convex function $f : \mathbb{R} \to \mathbb{R}$ such that (i) the unique global minimum $x^\star$ has a vanishing second derivative $f''(x^\star) = 0$, and (ii) Newton's method satisfies
\[
|x_{t+1} - x^\star| \geq (1 - \varepsilon)|x_t - x^\star|,
\]
for all $x_t \neq x^\star$.
Exercise 46. This exercise is just meant to recall some basics around integrals.
Show that for a vector-valued function $g : \mathbb{R} \to \mathbb{R}^d$, the inequality
\[
\Big\|\int_0^1 g(t)\,dt\Big\| \leq \int_0^1\|g(t)\|\,dt
\]
holds, where $\|\cdot\|$ is the 2-norm (always assuming that the functions under consideration are integrable)! You may assume (i) that integrals are linear:
\[
\int_0^1\big(\lambda_1 g_1(t) + \lambda_2 g_2(t)\big)\,dt = \lambda_1\int_0^1 g_1(t)\,dt + \lambda_2\int_0^1 g_2(t)\,dt,
\]
and (ii), if $g(t) \geq 0$ for all $t \in [0,1]$, then $\int_0^1 g(t)\,dt \geq 0$.
Exercise 47. Prove Lemma 7.7! You may want to proceed in the following steps.
(i) Prove that the function $g(x) = f(x) - \frac{\mu}{2}\|x\|^2$ is convex over $X$ (see also Exercise 28).

(ii) Prove that $\nabla^2 f(x)$ is invertible for all $x \in X$.

(iii) Prove that all eigenvalues of $\nabla^2 f(x)^{-1}$ are positive and at most $1/\mu$.

(iv) Prove that for a symmetric matrix $M$, the spectral norm $\|M\|$ is the largest absolute eigenvalue.

111
Chapter 8

Quasi-Newton Methods

Contents
8.1 The secant method . . . . . . . . . . . . . . . . . . . . . . . . 113
8.2 The secant condition . . . . . . . . . . . . . . . . . . . . . . . 115
8.3 Quasi-Newton methods . . . . . . . . . . . . . . . . . . . . . 115
8.4 Greenstadt’s approach (Optional Material) . . . . . . . . . . . 116
8.4.1 The method of Lagrange multipliers . . . . . . . . . . 118
8.4.2 Application to Greenstadt’s Update . . . . . . . . . . 118
8.4.3 The Greenstadt family . . . . . . . . . . . . . . . . . . 120
8.4.4 The BFGS method . . . . . . . . . . . . . . . . . . . . 122
8.4.5 The L-BFGS method . . . . . . . . . . . . . . . . . . . 124
8.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128

112
The main computational bottleneck in Newton’s method (7.6) is the
computation and inversion of the Hessian matrix in each step. This matrix
has size $d \times d$, so it will take up to $O(d^3)$ time to invert it (or to solve the system $\nabla^2 f(x_t)\,\Delta x = -\nabla f(x_t)$ that gives us the next Newton step $\Delta x$).
Already in the 1950s, attempts were made to circumvent this costly step,
the first one going back to Davidon [Dav59].
In this chapter, we will (for a change) not prove convergence results;
rather, we focus on the development of Quasi-Newton methods, and how
state-of-the-art methods arise from first principles. To motivate the ap-
proach, let us go back to the 1-dimensional case.

8.1 The secant method


Like Newton’s method (7.1), the secant method is an iterative method for
finding a zero of a univariate function. Unlike Newton’s method, it does
not use derivatives and hence does not require the function under con-
sideration to be differentiable. In fact, it is (therefore) much older than
Newton’s method. Reversing history and starting from the Newton step
\[
x_{t+1} := x_t - \frac{f(x_t)}{f'(x_t)}, \quad t \geq 0,
\]
we can derive the secant method by replacing the derivative $f'(x_t)$ with its finite difference approximation
\[
\frac{f(x_t) - f(x_{t-1})}{x_t - x_{t-1}}.
\]
As we (in the differentiable case) have
\[
f'(x_t) = \lim_{x \to x_t}\frac{f(x_t) - f(x)}{x_t - x},
\]
we get
\[
\frac{f(x_t) - f(x_{t-1})}{x_t - x_{t-1}} \approx f'(x_t)
\]
for $|x_t - x_{t-1}|$ small. As the method proceeds, we expect consecutive iterates $x_{t-1}, x_t$ to become closer and closer, so that the secant step
\[
x_{t+1} := x_t - f(x_t)\,\frac{x_t - x_{t-1}}{f(x_t) - f(x_{t-1})}, \quad t \geq 1 \tag{8.1}
\]

113
approximates the Newton step (two starting values x0 , x1 need to be cho-
sen here). Figure 8.1 shows what the method does: it constructs the line
through the two points (xt 1 , f (xt 1 )) and (xt , f (xt )) on the graph of f ; the
next iterate xt+1 is where this line intersects the x-axis. Exercise 48 asks
you to formally prove this.

Figure 8.1: One step of the secant method

Convergence of the secant method can be analyzed, but we don’t do


this here. The main point for us is that we now have a derivative-free ver-
sion of Newton’s method.
When the task is to optimize a differentiable univariate function, we
can apply the secant method to its derivative to obtain the secant method
for optimization:
\[
x_{t+1} := x_t - f'(x_t)\,\frac{x_t - x_{t-1}}{f'(x_t) - f'(x_{t-1})}, \quad t \geq 1. \tag{8.2}
\]

This is a second-derivative-free version of Newton’s method (7.5) for opti-


mization. The plan is now to generalize this to higher dimensions to obtain
a Hessian-free version of Newton’s method (7.6) for optimization over Rd .
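As an aside, here is a minimal sketch (ours; the test function and starting values are arbitrary choices) of the secant step (8.2) applied to a univariate convex function:

import math

def fprime(x):
    # derivative of f(x) = x**2 + exp(-x); f is convex with a unique minimizer
    return 2 * x - math.exp(-x)

x_prev, x = 1.0, 0.9          # two starting values are needed
for t in range(20):
    denom = fprime(x) - fprime(x_prev)
    if denom == 0:            # iterates have converged to machine precision
        break
    x_prev, x = x, x - fprime(x) * (x - x_prev) / denom   # secant step (8.2)

print(x, fprime(x))           # x ~ 0.3517..., derivative ~ 0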

114
8.2 The secant condition
Applying finite difference approximation to the second derivative of $f$ (we're still in the 1-dimensional case), we get
\[
H_t := \frac{f'(x_t) - f'(x_{t-1})}{x_t - x_{t-1}} \approx f''(x_t),
\]
which we can write as
\[
f'(x_t) - f'(x_{t-1}) = H_t(x_t - x_{t-1}) \approx f''(x_t)(x_t - x_{t-1}). \tag{8.3}
\]
Now, while Newton's method for optimization uses the update step
\[
x_{t+1} = x_t - f''(x_t)^{-1}f'(x_t), \quad t \geq 0,
\]
the secant method works with the approximation $H_t \approx f''(x_t)$:
\[
x_{t+1} = x_t - H_t^{-1}f'(x_t), \quad t \geq 1. \tag{8.4}
\]
The fact that $H_t$ approximates $f''(x_t)$ in the twice differentiable case was our motivation for the secant method, but in the method itself, there is no reference to $f''$ (which is exactly the point). All that is needed is the secant condition from (8.3) that defines $H_t$:
\[
f'(x_t) - f'(x_{t-1}) = H_t(x_t - x_{t-1}). \tag{8.5}
\]
This view can be generalized to higher dimensions. If $f : \mathbb{R}^d \to \mathbb{R}$ is differentiable, (8.4) becomes
\[
x_{t+1} = x_t - H_t^{-1}\nabla f(x_t), \quad t \geq 1, \tag{8.6}
\]
where $H_t \in \mathbb{R}^{d \times d}$ is now supposed to be a symmetric matrix satisfying the $d$-dimensional secant condition
\[
\nabla f(x_t) - \nabla f(x_{t-1}) = H_t(x_t - x_{t-1}). \tag{8.7}
\]

8.3 Quasi-Newton methods


If f is twice differentiable, the secant condition (8.7) along with the first-
order Taylor approximation of rf (x) yields the d-dimensional analog of
(8.3):
\[
\nabla f(x_t) - \nabla f(x_{t-1}) = H_t(x_t - x_{t-1}) \approx \nabla^2 f(x_t)(x_t - x_{t-1}),
\]

115
We might therefore hope that Ht ⇡ r2 f (xt ), and this would mean that
(8.6) approximates Newton’s method. Therefore, whenever we use (8.6)
with a symmetric matrix satisfying the secant condition (8.7), we say that
we have a Quasi-Newton method.
In the 1-dimensional case, there is only one Quasi-Newton method—
the secant method (8.1). Indeed, equation (8.5) uniquely defines the num-
ber Ht in each step.
But in the $d$-dimensional case, the matrix $H_t$ in the secant condition is underdetermined, starting from $d = 2$: taking the symmetry requirement into account, (8.7) is a system of $d$ equations in $d(d+1)/2$ unknowns, so if
of which one to choose, and how to do so efficiently; after all, we want to
get some savings over Newton’s method.
Newton’s method is a Quasi-Newton method if and only if f is a non-
degenerate quadratic function (Exercise 49). Hence, Quasi-Newton meth-
ods do not generalize Newton’s method but form a family of related algo-
rithms.
The first Quasi-Newton method was developed by William C. Davi-
don in 1956; he desperately needed iterations that were faster than those
of Newton's method in order to obtain results in the short time spans be-
tween expected failures of the room-sized computer that he used to run
his computations on.
But the paper he wrote about his new method got rejected for lacking
a convergence analysis, and for allegedly dubious notation. It became a
very influential Technical Report in 1959 [Dav59] and was finally officially
published in 1991, with a foreword giving the historical context [Dav91].
Ironically, Quasi-Newton methods are today the methods of choice in a
number of relevant machine learning applications.

8.4 Greenstadt’s approach (Optional Material)


For efficiency reasons (we want to avoid matrix inversions), Quasi-Newton methods typically directly deal with the inverse matrices $H_t^{-1}$. Suppose that we have the iterates $x_{t-1}, x_t$ as well as the matrix $H_{t-1}^{-1}$; now we want to compute a matrix $H_t^{-1}$ to perform the next Quasi-Newton step (8.6). How should we choose $H_t^{-1}$?

116
We draw some intuition from (the analysis of) Newton's method. Recall that we have shown $\nabla^2 f(x_t)$ to fluctuate only very little in the region of extremely fast convergence (Lemma 7.6); in fact, Newton's method is optimal (one step!) when $\nabla^2 f(x_t)$ is actually constant—this is the case of a quadratic function, see Lemma 7.1. Hence, in a Quasi-Newton method, it also makes sense to have $H_t \approx H_{t-1}$, or $H_t^{-1} \approx H_{t-1}^{-1}$.
Greenstadt's approach from 1970 [Gre70] is to update $H_{t-1}^{-1}$ by an “error matrix” $E_t$ to obtain
\[
H_t^{-1} = H_{t-1}^{-1} + E_t.
\]
Moreover, the errors should be as small as possible, subject to the constraints that $H_t^{-1}$ is symmetric and satisfies the secant condition (8.7). A simple measure of the error introduced by an update matrix $E$ is its squared Frobenius norm
\[
\|E\|_F^2 := \sum_{i=1}^{d}\sum_{j=1}^{d} e_{ij}^2.
\]

Since Greenstadt considered the resulting Quasi-Newton method as “too


specialized”, he searched for a compromise between variability in the method
and simplicity of the resulting formulas. This led him to minimize the er-
ror term
\[
\|A E A^\top\|_F^2,
\]
where $A \in \mathbb{R}^{d \times d}$ is some fixed invertible transformation matrix. If $A = I$,
we recover the squared Frobenius norm of E.
Let us now fix $t$ and simplify notation by setting
\begin{align*}
H &:= H_{t-1}^{-1}, \\
H' &:= H_t^{-1}, \\
E &:= E_t, \\
\sigma &:= x_t - x_{t-1}, \\
y &:= \nabla f(x_t) - \nabla f(x_{t-1}), \\
r &:= \sigma - Hy.
\end{align*}
The update formula then is
\[
H' = H + E, \tag{8.8}
\]
and the secant condition $\nabla f(x_t) - \nabla f(x_{t-1}) = H_t(x_t - x_{t-1})$ becomes
\[
H'y = \sigma \quad (\Leftrightarrow\; Ey = r). \tag{8.9}
\]

117
Greenstadt's approach can now be distilled into the following convex constrained minimization problem in the $d^2$ variables $E_{ij}$:
\[
\begin{array}{ll}
\text{minimize} & \frac{1}{2}\|A E A^\top\|_F^2 \\
\text{subject to} & Ey = r \\
& E^\top - E = 0
\end{array} \tag{8.10}
\]

8.4.1 The method of Lagrange multipliers


Minimization subject to equality constraints can be done via the method
of Lagrange multipliers. Here we need it only for the case of linear equality
constraints in which case the method assumes a very simple form.
Theorem 8.1. Let $f : \mathbb{R}^d \to \mathbb{R}$ be convex and differentiable, $C \in \mathbb{R}^{m \times d}$ for some $m \in \mathbb{N}$, $e \in \mathbb{R}^m$, $x^\star \in \mathbb{R}^d$ such that $Cx^\star = e$. Then the following two statements are equivalent.

(i) $x^\star = \operatorname{argmin}\{f(x) : x \in \mathbb{R}^d, Cx = e\}$

(ii) There exists a vector $\lambda \in \mathbb{R}^m$ such that
\[
\nabla f(x^\star)^\top = \lambda^\top C.
\]

The entries of $\lambda$ are known as the Lagrange multipliers.

Proof. The easy direction is (ii)$\Rightarrow$(i): if $\lambda$ as specified exists and $x \in \mathbb{R}^d$ satisfies $Cx = e$, we get
\[
\nabla f(x^\star)^\top(x - x^\star) = \lambda^\top C(x - x^\star) = \lambda^\top(e - e) = 0.
\]
Hence, $x^\star$ is a minimizer of $f$ over $\{x \in \mathbb{R}^d : Cx = e\}$ by the optimality condition of Lemma 1.22.

The other direction is Exercise 50.

8.4.2 Application to Greenstadt’s Update


In order to apply this method to (8.10), we need to compute the gradient of $f(E) = \frac{1}{2}\|A E A^\top\|_F^2$. Formally, this is a $d^2$-dimensional vector, but it is customary and more practical to write it as a matrix again,
\[
\nabla f(E) = \Big(\frac{\partial f(E)}{\partial E_{ij}}\Big)_{1 \leq i,j \leq d}.
\]

118
Fact 8.2 (Exercise 51). Let $A, B \in \mathbb{R}^{d \times d}$ be two matrices. With $f : \mathbb{R}^{d \times d} \to \mathbb{R}$, $f(E) := \frac{1}{2}\|A E B\|_F^2$, we have
\[
\nabla f(E) = A^\top A E B B^\top.
\]


The second step is to write the system of equations $Ey = r$, $E^\top - E = 0$ in Greenstadt's convex program (8.10) in matrix form $Cx = e$ so that we can apply the method of Lagrange multipliers according to Theorem 8.1.

As there are $d + d^2$ equations in $d^2$ variables, it is best to think of the rows of $C$ as being indexed with elements $i \in [d] := \{1, \ldots, d\}$ for the first $d$ equations $Ey = r$, and pairs $(i,j) \in [d] \times [d]$ for the last $d^2$ symmetry constraints (more than half of which are redundant, but we don't care). Columns of $C$ are indexed with pairs $(i,j)$ as well.

Let us denote by $\lambda \in \mathbb{R}^d$ the Lagrange multipliers for the first $d$ equations and by $\Gamma \in \mathbb{R}^{d \times d}$ the ones for the last $d^2$ ones.

In column $(i,j)$ of $C$ corresponding to variable $E_{ij}$, we have entry $y_j$ in row $i$ as well as entries $1$ (row $(j,i)$) and $-1$ (row $(i,j)$). Taking the inner product with the Lagrange multipliers, this column therefore yields
\[
\lambda_i y_j + \Gamma_{ji} - \Gamma_{ij}.
\]
After aggregating these entries into a $d \times d$ matrix, Theorem 8.1 tells us that we should aim for equality with $\nabla f(E)$ as derived in Fact 8.2. We have proved the following intermediate result.
Lemma 8.3. An update matrix $E^\star$ satisfying the constraints $Ey = r$ (secant condition in the next step) and $E^\top - E = 0$ (symmetry) is a minimizer of the error function $f(E) := \frac{1}{2}\|A E A^\top\|_F^2$ subject to the aforementioned constraints if and only if there exists a vector $\lambda \in \mathbb{R}^d$ and a matrix $\Gamma \in \mathbb{R}^{d \times d}$ such that
\[
W E^\star W = \lambda y^\top + \Gamma^\top - \Gamma, \tag{8.11}
\]
where $W := A^\top A$ (a symmetric and positive definite matrix).

Note that $\lambda y^\top$ is the outer product of a column and a row vector and hence a matrix. As we assume $A$ to be invertible, the quadratic function $f(E)$ is easily seen to be strongly convex and as a consequence has a unique minimizer $E^\star$ subject to the set of linear equations in (8.10) (see Lemma 2.9, which also applies if we minimize over a closed set). Hence, we know that the minimizer $E^\star$ and corresponding Lagrange multipliers $\lambda, \Gamma$ exist.

119
8.4.3 The Greenstadt family
We need to solve the system of equations
\begin{align}
Ey &= r, \tag{8.12}\\
E^\top - E &= 0, \tag{8.13}\\
W E W &= \lambda y^\top + \Gamma^\top - \Gamma. \tag{8.14}
\end{align}

This system is linear in $E, \lambda, \Gamma$, hence easy to solve computationally. However, we want a formula for the unique solution $E^\star$ in terms of the parameters $W, y, \sigma = r + Hy$. In the following derivation, we closely follow Greenstadt [Gre70, pages 4–5].

With $M := W^{-1}$ (which exists since $W = A^\top A$ is positive definite), (8.14) can be rewritten as
\[
E = M\big(\lambda y^\top + \Gamma^\top - \Gamma\big)M. \tag{8.15}
\]
Transposing this system (using that $M$ is symmetric) yields
\[
E^\top = M\big(y\lambda^\top + \Gamma - \Gamma^\top\big)M.
\]
By symmetry (8.13), we can subtract the latter two equations to obtain
\[
M\big(\lambda y^\top - y\lambda^\top + 2\Gamma^\top - 2\Gamma\big)M = 0.
\]
As $M$ is invertible, this is equivalent to
\[
\Gamma^\top - \Gamma = \frac{1}{2}\big(y\lambda^\top - \lambda y^\top\big),
\]
so we can eliminate $\Gamma$ by substituting back into (8.15):
\[
E = M\Big(\lambda y^\top + \frac{1}{2}\big(y\lambda^\top - \lambda y^\top\big)\Big)M = \frac{1}{2}M\big(\lambda y^\top + y\lambda^\top\big)M. \tag{8.16}
\]
To also eliminate $\lambda$, we now use (8.12)—the secant condition in the next step—to get
\[
Ey = \frac{1}{2}M\big(\lambda y^\top + y\lambda^\top\big)My = r.
\]
Premultiplying with $2M^{-1}$ gives
\[
2M^{-1}r = \big(\lambda y^\top + y\lambda^\top\big)My = \lambda\,y^\top My + y\,\lambda^\top My.
\]
120
Hence,
\[
\lambda = \frac{1}{y^\top My}\Big(2M^{-1}r - y\,\lambda^\top My\Big). \tag{8.17}
\]
To get rid of $\lambda$ on the right hand side, we premultiply this with $y^\top M$ to obtain
\[
\underbrace{y^\top M\lambda}_{z} = \frac{1}{y^\top My}\Big(2y^\top r - (y^\top My)(\underbrace{\lambda^\top My}_{z})\Big) = \frac{2y^\top r}{y^\top My} - \underbrace{\lambda^\top My}_{z}.
\]
It follows that
\[
z = \lambda^\top My = \frac{y^\top r}{y^\top My}.
\]
This in turn can be substituted into the right-hand side of (8.17) to remove $\lambda$ there, and we get
\[
\lambda = \frac{1}{y^\top My}\Big(2M^{-1}r - \frac{(y^\top r)}{y^\top My}\,y\Big).
\]
Consequently,
\begin{align*}
\lambda y^\top &= \frac{1}{y^\top My}\Big(2M^{-1}ry^\top - \frac{(y^\top r)}{y^\top My}\,yy^\top\Big), \\
y\lambda^\top &= \frac{1}{y^\top My}\Big(2yr^\top M^{-1} - \frac{(y^\top r)}{y^\top My}\,yy^\top\Big).
\end{align*}
This gives us an explicit formula for $E$, by substituting the previous expressions back into (8.16). For this, we compute
\begin{align*}
M\lambda y^\top M &= \frac{1}{y^\top My}\Big(2ry^\top M - \frac{(y^\top r)}{y^\top My}\,Myy^\top M\Big), \\
My\lambda^\top M &= \frac{1}{y^\top My}\Big(2Myr^\top - \frac{(y^\top r)}{y^\top My}\,Myy^\top M\Big),
\end{align*}
and consequently,
\[
E = \frac{1}{2}M\big(\lambda y^\top + y\lambda^\top\big)M = \frac{1}{y^\top My}\Big(ry^\top M + Myr^\top - \frac{(y^\top r)}{y^\top My}\,Myy^\top M\Big). \tag{8.18}
\]

121
Finally, we use $r = \sigma - Hy$ to obtain the update matrix $E^\star$ in terms of the original parameters $H = H_{t-1}^{-1}$ (previous approximation of the inverse Hessian that we now want to update to $H_t^{-1} = H' = H + E^\star$), $\sigma = x_t - x_{t-1}$ (previous Quasi-Newton step) and $y = \nabla f(x_t) - \nabla f(x_{t-1})$ (previous change in gradients). This gives us the Greenstadt family of Quasi-Newton methods.

Definition 8.4. Let $M \in \mathbb{R}^{d \times d}$ be a symmetric and invertible matrix. Consider the Quasi-Newton method
\[
x_{t+1} = x_t - H_t^{-1}\nabla f(x_t), \quad t \geq 1,
\]
where $H_0 = I$ (or some other positive definite matrix), and $H_t^{-1} = H_{t-1}^{-1} + E_t$ is chosen for all $t \geq 1$ in such a way that $H_t^{-1}$ is symmetric and satisfies the secant condition
\[
\nabla f(x_t) - \nabla f(x_{t-1}) = H_t(x_t - x_{t-1}).
\]
For any fixed $t$, set
\begin{align*}
H &:= H_{t-1}^{-1}, \\
H' &:= H_t^{-1}, \\
\sigma &:= x_t - x_{t-1}, \\
y &:= \nabla f(x_t) - \nabla f(x_{t-1}),
\end{align*}
and define
\[
E^\star = \frac{1}{y^\top\sigma}\Big(\sigma y^\top M + My\sigma^\top - Hyy^\top M - Myy^\top H - \frac{1}{y^\top My}\big(y^\top\sigma - y^\top Hy\big)Myy^\top M\Big)\Big|_{\text{with } W, M \text{ as above}}. \tag{8.19}
\]
More precisely, with general parameter $M$,
\[
E^\star = \frac{1}{y^\top My}\Big(\sigma y^\top M + My\sigma^\top - Hyy^\top M - Myy^\top H - \frac{1}{y^\top My}\big(y^\top\sigma - y^\top Hy\big)Myy^\top M\Big). \tag{8.19}
\]
If the update matrix $E_t = E^\star$ is used, the method is called the Greenstadt method with parameter $M$.

8.4.4 The BFGS method


In his paper, Greenstadt suggested two obvious choices for the matrix $M$ in Definition 8.4, namely $M = H$ (the previous approximation of the inverse Hessian) and $M = I$. In the next paper of the same issue of the same

122
journal, Goldfarb suggested to use the matrix $M = H'$, the next approximation of the inverse Hessian. Even though we don't yet have it, we can use it in the formula (8.19) since we know that $H'$ will by design satisfy the secant condition $H'y = \sigma$. And as $M$ always appears next to $y$ in (8.19), $My = H'y = \sigma$, so $H'$ disappears from the formula!

Definition 8.5. The BFGS method is the Greenstadt method with parameter $M = H' = H_t^{-1}$ in step $t$, in which case the update matrix $E^\star$ assumes the form
\begin{align*}
E^\star &= \frac{1}{y^\top\sigma}\Big(2\sigma\sigma^\top - Hy\sigma^\top - \sigma y^\top H - \frac{1}{y^\top\sigma}\big(y^\top\sigma - y^\top Hy\big)\sigma\sigma^\top\Big) \\
&= \frac{1}{y^\top\sigma}\Big(-Hy\sigma^\top - \sigma y^\top H + \Big(1 + \frac{y^\top Hy}{y^\top\sigma}\Big)\sigma\sigma^\top\Big), \tag{8.20}
\end{align*}
where $H = H_{t-1}^{-1}$, $\sigma = x_t - x_{t-1}$, $y = \nabla f(x_t) - \nabla f(x_{t-1})$.

We leave it as Exercise 52 (i) to prove that the denominator $y^\top\sigma$ appearing twice in the formula is positive, unless the function $f$ is flat between the iterates $x_{t-1}$ and $x_t$. And under $y^\top\sigma > 0$, the BFGS method has an-
other nice property: if the previous matrix H is positive definite, then also
the next matrix H 0 is positive definite; see Exercise 52 (ii). In this sense, the
matrices Ht 1 behave like proper inverse Hessians.
The method is named after Broyden, Fletcher, Goldfarb and Shanno
who all came up with it independently around 1970. Greenstadt’s name is
mostly forgotten.
Let’s take a step back and see what we have achieved. Recall that our
starting point was that Newton’s method needs to compute and invert
Hessian matrices in each iteration and therefore has in practice a cost of
O(d3 ) per iteration. Did we improve over this?
First of all, any method in Greenstadt’s family avoids the computation
of Hessian matrices altogether. Only gradients are needed. In the BFGS
method in particular, the cost per iteration drops to O(d2 ). Indeed, the
computation of the update matrix E ? in Definition 8.5 reduces to matrix-
vector multiplications and outer-product computations, all of which can
be done in O(d2 ) time.
Newton and Quasi-Newton methods are often performed with scaled
steps. This means that the iteration becomes

\[
x_{t+1} = x_t - \alpha_t H_t^{-1}\nabla f(x_t), \quad t \geq 1, \tag{8.21}
\]

123
for some ↵t 2 R+ . This parameter can for example be chosen such that
f (xt+1 ) is minimized (line search). Another approach is backtracking line
search where we start with ↵t = 1, and as long as this does not lead to
sufficient progress, we halve ↵t . Line search ensures that the matrices Ht 1
in the BFGS method remain positive definite [Gol70].
As the Greenstadt update method just depends on the step $\sigma = x_t - x_{t-1}$ but not on how it was obtained, the update works in exactly the same
way as before even if scaled steps are being used.
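To make the update concrete, here is a small sketch (ours, not from the text; the quadratic test problem and the plain backtracking rule are illustrative choices) of a BFGS iteration that maintains the inverse Hessian approximation via (8.20) and uses scaled steps as in (8.21):

import numpy as np

def bfgs_update(H, sigma, y):
    """Return H' = H + E* with E* from (8.20); assumes y @ sigma > 0."""
    ys = y @ sigma
    Hy = H @ y
    return H + (-np.outer(Hy, sigma) - np.outer(sigma, Hy)
                + (1.0 + (y @ Hy) / ys) * np.outer(sigma, sigma)) / ys

# Convex quadratic test problem f(x) = 1/2 x^T M x - q^T x.
rng = np.random.default_rng(3)
d = 6
A = rng.normal(size=(d, d))
M, q = A @ A.T + np.eye(d), rng.normal(size=d)
f = lambda x: 0.5 * x @ M @ x - q @ x
grad = lambda x: M @ x - q

H = np.eye(d)                             # H_0^{-1} = I
x_prev = rng.normal(size=d)
x = x_prev - 0.01 * grad(x_prev)          # any first step to obtain x_1
for t in range(20):
    sigma, y = x - x_prev, grad(x) - grad(x_prev)
    if np.linalg.norm(sigma) < 1e-12:     # already converged
        break
    H = bfgs_update(H, sigma, y)
    p = H @ grad(x)                       # Quasi-Newton direction
    alpha = 1.0                           # backtracking line search, cf. (8.21)
    while f(x - alpha * p) > f(x):
        alpha /= 2
    x_prev, x = x, x - alpha * p

print(np.linalg.norm(grad(x)))            # small: x is close to M^{-1} q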

8.4.5 The L-BFGS method


In high dimensions d, even an iteration cost of O(d2 ) as in the BFGS method
may be prohibitive. In fact, already at the end of the 1970s, the first limited
memory (and limited time) variants of the method have been proposed.
Here we essentially follow Nocedal [Noc80]. The idea is to use only in-
formation from the previous m iterations, for some small value of m, and
“forget” anything older. In order to describe the resulting L-BFGS method,
we first rewrite the BFGS update formula in product form.

Observation 8.6. With $E^\star$ as in Definition 8.5 and $H' = H + E^\star$, we have
\[
H' = \Big(I - \frac{\sigma y^\top}{y^\top\sigma}\Big)H\Big(I - \frac{y\sigma^\top}{y^\top\sigma}\Big) + \frac{\sigma\sigma^\top}{y^\top\sigma}. \tag{8.22}
\]

To verify this, simply expand the product in the right-hand side and
compare with (8.20).
We further observe that we do not need the actual matrix H 0 = Ht 1 to
perform the next Quasi-Newton step (8.6), but only the vector H 0 rf (xt ).
Here is the crucial insight.

Lemma 8.7. Let $H, H'$ be as in Observation 8.6, i.e.
\[
H' = \Big(I - \frac{\sigma y^\top}{y^\top\sigma}\Big)H\Big(I - \frac{y\sigma^\top}{y^\top\sigma}\Big) + \frac{\sigma\sigma^\top}{y^\top\sigma}.
\]
Let $g' \in \mathbb{R}^d$. Suppose that we have an oracle to compute $s = Hg$ for any vector $g$. Then $s' = H'g'$ can be computed with one oracle call and $O(d)$ additional arithmetic operations, assuming that $\sigma$ and $y$ are known.

124
Proof. From (8.22), we conclude that
\[
H'g' = \underbrace{\Big(I - \frac{\sigma y^\top}{y^\top\sigma}\Big)\underbrace{H\underbrace{\Big(I - \frac{y\sigma^\top}{y^\top\sigma}\Big)g'}_{g}}_{s}}_{w} + \underbrace{\frac{\sigma\sigma^\top}{y^\top\sigma}\,g'}_{h}.
\]
We compute the vectors $h, g, s, w, z$ in turn. We have
\[
h = \frac{\sigma\sigma^\top}{y^\top\sigma}\,g' = \sigma\,\frac{\sigma^\top g'}{y^\top\sigma},
\]
so $h$ can be computed with two inner products, a real division, and a multiplication of $\sigma$ with a scalar. For $g$, we obtain
\[
g = \Big(I - \frac{y\sigma^\top}{y^\top\sigma}\Big)g' = g' - y\,\frac{\sigma^\top g'}{y^\top\sigma},
\]
which is a multiplication of $y$ with a scalar that we already know, followed by a vector addition. To get $s = Hg$, we call the oracle. For $w$, we similarly have
\[
w = \Big(I - \frac{\sigma y^\top}{y^\top\sigma}\Big)s = s - \sigma\,\frac{y^\top s}{y^\top\sigma},
\]
which is one inner product (the other one we already know), a real division, a multiplication of $\sigma$ with a scalar, and a vector addition. Finally,
\[
H'g' = z = w + h
\]
is a vector addition. In total, we needed three inner product computations, three scalar multiplications, three vector additions, two real divisions, and one oracle call.
How do we implement the oracle? We simply apply the previous
Lemma recursively. Let

k = xk xk 1 ,
yk = rf (xk ) rf (xk 1 )

125
be the values of $\sigma$ and $y$ in iteration $k \leq t$. When we perform the Quasi-Newton step $x_{t+1} = x_t - H_t^{-1}\nabla f(x_t)$ in iteration $t \geq 1$, we have already computed these vectors for $k = 1, \ldots, t$. Using Lemma 8.7, we could therefore call the recursive procedure in Figure 8.2 with $k = t$, $g' = \nabla f(x_t)$ to compute the required vector $H_t^{-1}\nabla f(x_t)$ in iteration $t$. To maintain the im-
mediate connection to Lemma 8.7, we refrain from introducing extra vari-
ables for values that occur several times; but in an actual implementation,
this would be done, of course.

function BFGS-STEP($k, g'$)   ▷ returns $H_k^{-1} g'$
    if $k = 0$ then
        return $H_0^{-1} g'$
    else   ▷ apply Lemma 8.7
        $h = \sigma_k\,\dfrac{\sigma_k^\top g'}{y_k^\top\sigma_k}$
        $g = g' - y_k\,\dfrac{\sigma_k^\top g'}{y_k^\top\sigma_k}$
        $s = $ BFGS-STEP($k-1, g$)
        $w = s - \sigma_k\,\dfrac{y_k^\top s}{y_k^\top\sigma_k}$
        $z = w + h$
        return $z$
    end if
end function

Figure 8.2: Recursive view of the BFGS method. To compute $H_t^{-1}\nabla f(x_t)$, call the function with arguments $(t, \nabla f(x_t))$; values $\sigma_k, y_k$ from iterations $1, \ldots, t$ are assumed to be available.

By Lemma 8.7, the runtime of BFGS- STEP(t, rf (xt )) is O(td). For t >
d, this is slower (and needs more memory) than the standard BFGS step
according to Definition 8.5 which always takes O(d2 ) time.
The benefit of the recursive variant is that it can easily be adapted to
a step that is faster (and needs less memory) than the standard BFGS step.
The idea is to let the recursion bottom out after a fixed number m of recur-
sive calls (in practice, values of m  10 are not uncommon). The step then
has runtime O(md) which is a substantial saving over the standard step if
m is much smaller than d.

126
The only remaining question is what we return when the recursion now bottoms out prematurely at $k = t - m$. As we don't know the matrix $H_{t-m}^{-1}$, we cannot return $H_{t-m}^{-1}g'$ (which would be the correct output in this case). Instead, we pretend that we have started the whole method just now and use our initial matrix $H_0$ instead of $H_{t-m}$.¹ The resulting algorithm is depicted in Figure 8.3.

function L-BFGS-STEP(k, ℓ, g')                 ▷ ℓ ≤ k; returns s' ≈ H_k^{-1} g'
    if ℓ = 0 then
        return H_0^{-1} g'
    else                                       ▷ apply Lemma 8.7
        h = σ_k (σ_k^⊤ g') / (y_k^⊤ σ_k)
        g = g' − y_k (σ_k^⊤ g') / (y_k^⊤ σ_k)
        s = L-BFGS-STEP(k − 1, ℓ − 1, g)
        w = s − σ_k (y_k^⊤ s) / (y_k^⊤ σ_k)
        z = w + h
        return z
    end if
end function

Figure 8.3: The L-BFGS method. To compute H_t^{-1}∇f(x_t) based on the pre-
vious m iterations, call the function with arguments (t, m, ∇f(x_t)); the values
σ_k, y_k from iterations t − m + 1, . . . , t are assumed to be available.
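
As an illustration, the following NumPy sketch mirrors Figure 8.3 (and, with ℓ = k, also Figure 8.2). It is our own sketch under additional assumptions: the pairs σ_k, y_k are stored in sequences sigmas, ys indexed by iteration number, and H_0 is taken to be the identity matrix, so that the base case simply returns g'.

import numpy as np

def lbfgs_step(k, ell, g_prime, sigmas, ys):
    """Approximate H_k^{-1} g' using only the last `ell` (sigma, y) pairs.

    Sketch of Figure 8.3 with H_0 = I. sigmas[k] and ys[k] hold
    sigma_k = x_k - x_{k-1} and y_k = grad f(x_k) - grad f(x_{k-1}).
    """
    if ell == 0:
        return g_prime                               # base case: H_0^{-1} g' with H_0 = I
    sigma, y = sigmas[k], ys[k]
    denom = y @ sigma                                # y_k^T sigma_k
    alpha = (sigma @ g_prime) / denom
    h = alpha * sigma                                # h = sigma_k (sigma_k^T g') / (y_k^T sigma_k)
    g = g_prime - alpha * y                          # g = g' - y_k (sigma_k^T g') / (y_k^T sigma_k)
    s = lbfgs_step(k - 1, ell - 1, g, sigmas, ys)    # recursive "oracle" call
    w = s - ((y @ s) / denom) * sigma                # w = s - sigma_k (y_k^T s) / (y_k^T sigma_k)
    return w + h                                     # z = w + h

Calling lbfgs_step(t, m, grad, sigmas, ys) touches only the pairs from iterations t − m + 1, . . . , t and costs O(md) time, in line with the discussion above.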

Note that the L-BFGS method is still a Quasi-Newton method as long
as m ≥ 1: if we go through at least one update step of the form H' = H + E,
the matrix H' will satisfy the secant condition by design, irrespective of H.

¹ In practice, we can do better: as we already have some information from previous
steps, we can use this information to construct a more tuned H_0. We don't go into this
here.

8.5 Exercises
Exercise 48. Consider a step of the secant method:

    x_{t+1} = x_t − f(x_t) (x_t − x_{t−1}) / (f(x_t) − f(x_{t−1})),   t ≥ 1.

Assuming that x_t ≠ x_{t−1} and f(x_t) ≠ f(x_{t−1}), prove that the line through
the two points (x_{t−1}, f(x_{t−1})) and (x_t, f(x_t)) intersects the x-axis at the point
x = x_{t+1}.
Exercise 49. Let f : R^d → R be a twice differentiable function with nonzero
Hessians everywhere. Prove that the following two statements are equivalent.

(i) f is a nondegenerate quadratic function, meaning that

    f(x) = (1/2) x^⊤ M x − q^⊤ x + c,

where M ∈ R^{d×d} is an invertible symmetric matrix, q ∈ R^d, c ∈ R (see
also Lemma 7.1).

(ii) Applied to f, Newton's update step

    x_{t+1} := x_t − ∇²f(x_t)^{−1} ∇f(x_t),   t ≥ 1,

defines a Quasi-Newton method for all x_0, x_1 ∈ R^d.


Exercise 50. Prove the direction (i)⇒(ii) of Theorem 8.1! You may want to
proceed in the following steps.

1. Prove the Poor Man's Farkas Lemma: a system of linear equations Ax =
b in d variables has a solution if and only if for all λ ∈ R^d, λ^⊤A = 0^⊤ implies
λ^⊤b = 0. (You may use the fact that the row rank of a matrix equals its
column rank.)

2. Argue that x* = argmin{∇f(x*)^⊤x : x ∈ R^d, Cx = e}.

3. Apply the Poor Man's Farkas Lemma.


Exercise 51. Prove Fact 8.2!
Exercise 52. Consider the BFGS method (Definition 8.5).

(i) Prove that y^⊤σ > 0, unless x_t = x_{t−1}, or f(λx_t + (1 − λ)x_{t−1}) = λf(x_t) +
(1 − λ)f(x_{t−1}) for all λ ∈ (0, 1).

(ii) Prove that if H is positive definite and y^⊤σ > 0, then also H' is positive
definite. You may want to use the product form of the BFGS update as
developed in Observation 8.6.

Chapter 9

Frank-Wolfe

Contents

TODO (see slides)

Chapter 10

Coordinate Descent

Contents
10.1 Coordinate Descent . . . . . . . . . . . . . . . . . . . . . . . . 133
10.2 Randomized Coordinate Descent . . . . . . . . . . . . . . . . 133
10.2.1 The Polyak-Łojasiewicz Condition . . . . . . . . . . . 135
10.2.2 Importance Sampling . . . . . . . . . . . . . . . . . . 136
10.3 Steepest Coordinate Descent . . . . . . . . . . . . . . . . . . . 136
10.4 Non-smooth objectives . . . . . . . . . . . . . . . . . . . . . . 138
10.5 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
10.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141

10.1 Coordinate Descent

Coordinate descent methods generate a sequence {x_t}_{t≥0} of iterates as fol-
lows:

    x_{t+1} := x_t + γ e_{i_t},   (10.1)

where e_i denotes the i-th unit basis vector in R^d, and γ is a suitable stepsize
for the selected coordinate i_t of our objective function. Here we will focus
on the gradient-based choice of the stepsize,

    x_{t+1} := x_t − (1/L) ∇_{i_t} f(x_t) e_{i_t}.   (10.2)

As an alternative, for some problems we can find an even better step-
size by solving the single-variable minimization argmin_{γ∈R} f(x_t + γ e_{i_t}) in
closed form.
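
As a concrete illustration (a minimal sketch, not from the notes), the update (10.2) can be implemented as follows; grad_i is an assumed helper that returns the single partial derivative ∇_i f(x), and the active coordinate is drawn uniformly at random, anticipating the next section.

import numpy as np

def coordinate_descent(grad_i, x0, L, steps, rng=np.random.default_rng(0)):
    """Sketch of the coordinate descent update (10.2).

    grad_i(x, i) is assumed to return the partial derivative of f at x
    with respect to coordinate i; L is the coordinate-wise smoothness
    constant used for the stepsize 1/L.
    """
    x = x0.astype(float).copy()
    d = x.shape[0]
    for _ in range(steps):
        i = rng.integers(d)          # uniformly random active coordinate
        x[i] -= grad_i(x, i) / L     # x_{t+1} = x_t - (1/L) grad_i f(x_t) e_i
    return x

# Example usage for f(x) = ||Ax - b||^2, where grad_i f(x) = 2 A[:, i] @ (A @ x - b).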

10.2 Randomized Coordinate Descent


In random coordinate descent, the active coordinate it in each step is chosen
uniformly at random from the set [d].
[Nes12] shows that randomized coordinate descent can outperform gradient
descent in terms of overall runtime, provided that our problem of interest has d
variables and updating a single coordinate is about d times cheaper than
computing the full gradient.

Convergence Analysis. To analyze coordinate descent methods, we as-
sume coordinate-wise smoothness of f, which is defined as

    f(x + γe_i) ≤ f(x) + γ∇_i f(x) + (L/2)γ²   ∀x ∈ R^d, ∀γ ∈ R,   (10.3)

for any coordinate i. As with our familiar definition of smoothness, the
property here is equivalent to the gradient being coordinate-wise Lipschitz
continuous, that is |∇_i f(x + γe_i) − ∇_i f(x)| ≤ L|γ| for all x ∈ R^d, γ ∈ R, i ∈ [d].
We have seen the equivalence in Lemma 2.4 previously.
If we additionally assume strong convexity, we can obtain a fast linear
convergence rate as follows.

Theorem 10.1. Consider minimization of a function f which is coordinate-wise
smooth with constant L as in (10.3), and is strongly convex with parameter µ >
0. Then coordinate descent with a stepsize of 1/L,

    x_{t+1} := x_t − (1/L) ∇_{i_t} f(x_t) e_{i_t},

when choosing the active coordinate i_t uniformly at random, has an expected lin-
ear convergence rate of

    E[f(x_t) − f*] ≤ (1 − µ/(dL))^t [f(x_0) − f*].
Proof. We follow [KNS16]. Plugging the update rule (10.2) into the smooth-
ness condition (10.3), we obtain the step improvement

    f(x_{t+1}) ≤ f(x_t) − (1/(2L)) |∇_{i_t} f(x_t)|².

By taking the expectation of both sides with respect to i_t we have

    E[f(x_{t+1})] ≤ f(x_t) − (1/(2L)) E[|∇_{i_t} f(x_t)|²]
                 = f(x_t) − (1/(2L)) (1/d) Σ_i |∇_i f(x_t)|²
                 = f(x_t) − (1/(2dL)) ‖∇f(x_t)‖².

We now use the fact that strongly convex functions satisfy (1/2)‖∇f(x)‖² ≥
µ(f(x) − f*) for all x. This is proven in Lemma 10.2 below and is a property of
separate interest. Subtracting f* from both sides, we therefore obtain

    E[f(x_{t+1}) − f*] ≤ (1 − µ/(dL)) [f(x_t) − f*].

Applying this recursively and using iterated expectations yields the result.

For the algorithm variant using exact coordinate optimization instead
of using the fixed stepsize 1/L, the same result still holds (since progress
per step is at least as good).

10.2.1 The Polyak-Łojasiewicz Condition

A function f satisfies the Polyak-Łojasiewicz inequality (PL) if the following
holds for some µ > 0:

    (1/2) ‖∇f(x)‖² ≥ µ(f(x) − f*)   ∀x.   (10.4)

The condition was proposed by Polyak in 1963, and also by Łojasiewicz in
the same year. It implies the quadratic growth condition.
Lemma 10.2 (Strong Convexity ⇒ PL). Let f be strongly convex with param-
eter µ > 0. Then f satisfies PL for the same µ.

Proof. For all x and y we have

    f(y) ≥ f(x) + ⟨∇f(x), y − x⟩ + (µ/2) ‖y − x‖².

Minimizing each side of the inequality with respect to y (the left-hand side
is minimized at y = x*, the right-hand side at y = x − (1/µ)∇f(x)), we obtain

    f(x*) ≥ f(x) − (1/(2µ)) ‖∇f(x)‖²,

which implies that the PL inequality holds with the same value µ.
The PL condition is a weaker condition than strong convexity. For ex-
ample, it can be shown that it is satisfied for all compositions f (x) :=
g(Ax) for strongly convex g and arbitrary matrix A, including least squares
regression and many other applications in machine learning.
As we have seen in the proof of the above theorem, the linear conver-
gence rate holds not only for strongly convex objectives but indeed for the
wider class of any f satisfying the PL condition:
Corollary 10.3. For minimization of a function f which is coordinate-wise smooth
with constant L as in (10.3), satisfies the PL inequality (10.4), and has a non-
empty solution set X*, random coordinate descent with a stepsize of 1/L has the
expected linear convergence rate of

    E[f(x_t) − f*] ≤ (1 − µ/(dL))^t [f(x_0) − f*].
Using the same proof technique, gradient descent can be shown to ex-
hibit a linear convergence rate for PL functions as well, see Exercise 53.

10.2.2 Importance Sampling

Uniformly random selection of the active coordinate might not always be
the best choice. Let us consider an individual smoothness constant L_i for
each coordinate i, that is,

    f(x + γe_i) ≤ f(x) + γ∇_i f(x) + (L_i/2)γ²   (10.5)

for all x ∈ R^d and γ ∈ R. In this case, instead of uniform random sampling
of the active coordinate, it makes sense to sample proportionally to the L_i
values, as suggested by [Nes12]. Formally, the selection rule picks i with
probability P[i_t = i] = L_i / Σ_j L_j.
For coordinate descent using these modified sampling probabilities, and
using a stepsize of 1/L_{i_t}, the same convergence argument as above can be
shown (Exercise 54) to give the faster rate of

    E[f(x_t) − f*] ≤ (1 − µ/(dL̄))^t [f(x_0) − f*],

where L̄ = (1/d) Σ_{i=1}^d L_i now is the average of all coordinate-wise smoothness
constants. Note that this value can be much smaller than the global L we
have used above, since that one was required to hold for all i and therefore
has to be chosen as L = max_i L_i.
Similar importance sampling strategies work for the different setting
of stochastic gradient descent (SGD) on sum-structured problems. The
practical gain of importance sampling over uniform sampling can be
very significant for coordinate descent (or SGD), in particular for sparse or
inhomogeneous data.
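
A minimal sketch of this sampling scheme (again with an assumed helper grad_i for single partial derivatives; the function and variable names are ours):

import numpy as np

def importance_sampled_cd(grad_i, x0, L_coords, steps,
                          rng=np.random.default_rng(0)):
    """Coordinate descent with importance sampling (sketch).

    L_coords[i] is the coordinate-wise smoothness constant L_i from (10.5).
    Coordinate i is sampled with probability L_i / sum_j L_j, and the
    stepsize for the sampled coordinate is 1/L_i.
    """
    x = x0.astype(float).copy()
    L_coords = np.asarray(L_coords, dtype=float)
    probs = L_coords / L_coords.sum()        # P[i_t = i] = L_i / sum_j L_j
    for _ in range(steps):
        i = rng.choice(len(probs), p=probs)  # importance sampling
        x[i] -= grad_i(x, i) / L_coords[i]   # stepsize 1/L_i on coordinate i
    return x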

10.3 Steepest Coordinate Descent

In contrast to random coordinate descent, steepest coordinate descent (or
greedy coordinate descent) chooses the active coordinate according to

    i_t := argmax_{i∈[d]} |∇_i f(x_t)|,   (10.6)

which is also called the Gauss-Southwell (GS) rule.
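
In code, the GS rule is a single argmax over the absolute gradient entries. A minimal sketch (our own naming) that naively recomputes the full gradient in every step, cf. the cost discussion at the end of this section:

import numpy as np

def gauss_southwell_step(grad, x, L):
    """One steepest coordinate descent step, sketch of rule (10.6).

    grad(x) is assumed to return the full gradient of f at x; in practice
    one would maintain it incrementally whenever that is cheap.
    """
    g = grad(x)
    i = int(np.argmax(np.abs(g)))   # Gauss-Southwell rule: largest |partial derivative|
    x_new = x.astype(float).copy()
    x_new[i] -= g[i] / L            # stepsize 1/L on the selected coordinate
    return x_new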

Convergence Analysis. It is easy to show that the same convergence rate
which we have obtained for random coordinate descent in Theorem 10.1
also holds for steepest coordinate descent. To see this, the only ingredient
we need is the fact that

    max_i |∇_i f(x)|² ≥ (1/d) Σ_i |∇_i f(x)|²,

and since we now have a deterministic algorithm, there is no need to take
expectations in the proof.

Corollary 10.4. For minimization of a function f which is coordinate-wise smooth
with constant L as in (10.3), and is strongly convex with parameter µ > 0, steep-
est coordinate descent with a stepsize of 1/L has the linear convergence rate of

    f(x_t) − f* ≤ (1 − µ/(dL))^t [f(x_0) − f*].
It was shown by [NSL+15] that a stronger convergence result can be
obtained for this algorithm when the strong convexity of f is measured
with respect to the ℓ1-norm instead of the standard Euclidean norm, i.e.

    f(y) ≥ f(x) + ⟨∇f(x), y − x⟩ + (µ_1/2) ‖y − x‖_1².

Theorem 10.5. For minimization of a function f which is coordinate-wise smooth
with constant L as in (10.3), and is strongly convex w.r.t. the ℓ1-norm with pa-
rameter µ_1 > 0, steepest coordinate descent with a stepsize of 1/L has the linear
convergence rate of

    f(x_t) − f* ≤ (1 − µ_1/L)^t [f(x_0) − f*].
The proof again directly follows the one of Theorem 10.1, but uses the
following lemma, measuring the PL inequality in the ℓ∞-norm:

Lemma 10.6. Let f be strongly convex w.r.t. the ℓ1-norm with parameter µ_1 > 0.
Then f satisfies

    (1/2) ‖∇f(x)‖_∞² ≥ µ_1 (f(x) − f*).

The proof of the lemma is not given here, but follows the same strategy
as in the earlier analogous Lemma 10.2. It then uses a property of convex
conjugate functions (coming from the fact that the norms ‖·‖_1 and ‖·‖_∞ are
dual to each other).

In summary, we have that steepest coordinate descent can be up to d
times faster than random coordinate descent in terms of the number of itera-
tions. However, the selection rule is of course more costly: naively, finding
the steepest coordinate requires computing the full gradient, and might
therefore cost d times more than using a random coordinate.
Steepest coordinate descent is nevertheless an attractive choice for prob-
lem classes where we can obtain (or maintain) the steepest coordinate ef-
ficiently. This includes several practical cases, for example when the gradi-
ents are sparse, e.g. because the original data is sparse. Another important
use case are problems where we want to find a solution in as few steps as
possible, i.e. a sparse solution. For example, the Lasso problem is interesting
in terms of both mentioned aspects. Last but not least, we note that the
steepest selection rule (10.6) looks very similar to the Frank-Wolfe algorithm
when one is optimizing over an ℓ1-ball. This is not a coincidence; indeed, the
two algorithms and their convergence are closely related in that case.

10.4 Non-smooth objectives

So far, we have only considered unconstrained and smooth optimization
problems in this chapter.
We have just proven that coordinate descent converges for differen-
tiable, smooth f. What if f is not differentiable at all points? Earlier, when
we analyzed gradient methods, we saw that in that case the extension to
subgradients was straightforward and maintained the convergence results
up to a small slowdown. Unfortunately, for coordinate descent the situa-
tion is not that easy.
Even when using exact minimization on each coordinate step, the al-
gorithm can get permanently stuck in a non-optimal point, as for example
for the objective function shown in Figure 10.2. Not all hope is lost, how-
ever. Consider the class of composite problems (recall proximal gradient
descent as discussed in Section 3.6),

    f(x) := g(x) + h(x)   with   h(x) = Σ_i h_i(x_i),   (10.7)

for g convex and smooth, and h separable with each h_i convex but possibly
non-smooth.

138
Figure 10.1: A smooth function: f(x) := ‖x‖². Figure by Alp Yurtsever &
Volkan Cevher, EPFL.

Figure 10.2: A non-smooth function: f(x) := ‖x‖² + |x_1 − x_2|. Figure by Alp
Yurtsever & Volkan Cevher, EPFL.

For this class of problems, coordinate descent with exact minimization
converges to a global optimum, as illustrated in Figure 10.3.
One very important class of applications here are smooth functions f
combined with ℓ1-regularization, such as the Lasso.

Figure 10.3: A function with separable non-smooth part: f(x) := ‖x‖² +
‖x‖_1. Figure by Alp Yurtsever & Volkan Cevher, EPFL.
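
A common implementable variant for this composite class performs a proximal step on the active coordinate, in the spirit of proximal gradient descent from Section 3.6. The following is a sketch under our own naming; prox_i stands for the proximal operator of h_i:

import numpy as np

def proximal_coordinate_descent(grad_g_i, prox_i, x0, L, steps,
                                rng=np.random.default_rng(0)):
    """Sketch of coordinate descent for composite f = g + sum_i h_i.

    grad_g_i(x, i): partial derivative of the smooth part g at x.
    prox_i(v, i, tau): proximal operator of h_i with parameter tau,
                       i.e. argmin_u  h_i(u) + (1/(2*tau)) * (u - v)**2.
    """
    x = x0.astype(float).copy()
    d = x.shape[0]
    for _ in range(steps):
        i = rng.integers(d)
        v = x[i] - grad_g_i(x, i) / L    # gradient step on the smooth part g
        x[i] = prox_i(v, i, 1.0 / L)     # proximal step on the separable part h_i
    return x

For ℓ1-regularization, i.e. h_i(u) = λ|u|, the corresponding proximal operator is the well-known soft-thresholding operation.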

10.5 Applications

Coordinate descent methods are used widely in classic machine learning
applications. Variants of coordinate methods form the state of the art for
the class of generalized linear models, including linear classifiers and re-
gression models, as long as separable convex regularizers are used (e.g. ℓ1-
or ℓ2-norm regularization).
For least-squares linear regression f(x) := ‖Ax − b‖², exact coordinate
minimization can easily be performed in closed form.
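
As a concrete example (our own sketch, not from the notes): fixing all coordinates except the i-th, f is a convex quadratic in x_i, and setting the partial derivative 2A_i^⊤(Ax − b) to zero gives the exact coordinate update x_i ← x_i − A_i^⊤(Ax − b)/‖A_i‖², where A_i denotes the i-th column of A. Maintaining the residual Ax − b makes every update cheap:

import numpy as np

def exact_cd_least_squares(A, b, steps, rng=np.random.default_rng(0)):
    """Exact coordinate minimization for f(x) = ||Ax - b||^2 (sketch).

    Maintains the residual r = Ax - b so that each coordinate update only
    needs one column of A. Assumes A has no all-zero columns.
    """
    n = A.shape[1]
    x = np.zeros(n)
    r = A @ x - b                              # residual Ax - b
    col_norms = (A * A).sum(axis=0)            # ||A_i||^2 for every column i
    for _ in range(steps):
        i = rng.integers(n)
        delta = -(A[:, i] @ r) / col_norms[i]  # exact minimizer over coordinate i
        x[i] += delta
        r += delta * A[:, i]                   # keep the residual up to date
    return x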

Lasso. The optimization problem for sparse least-squares linear regres-
sion (also known as the Lasso) is given by

    min_{x∈R^n} ‖Ax − b‖² + λ‖x‖_1   (10.8)

for some regularization parameter λ > 0. It is an instance of our class of
composite optimization problems (10.7).

Support Vector Machines. The original optimization problem for the
Support Vector Machine (SVM) is given by

    min_{w∈R^d} Σ_{i=1}^n ℓ(y_i A_i^⊤ w) + (λ/2) ‖w‖²   (10.9)

where ℓ : R → R, ℓ(z) := max{0, 1 − z} is the hinge loss function. Here for
any i, 1 ≤ i ≤ n, the vector A_i ∈ R^d is the i-th data example, and y_i ∈ {±1}
is the corresponding label.
The dual optimization problem for the SVM is given by

    max_{α∈R^n} α^⊤1 − (1/(2λ)) α^⊤ Y A^⊤ A Y α   such that 0 ≤ α_i ≤ 1 ∀i   (10.10)

where Y := diag(y), and A ∈ R^{d×n} again collects all n data examples as
its columns. The dual problem is an instance of our class of composite
optimization problems (10.7), since the non-differentiable box constraint
0 ≤ α_i ≤ 1 ∀i can be written as a separable non-smooth part h as required.

10.6 Exercises
Exercise 53 (Alternative analysis for gradient descent). Let f be smooth with
constant L in the classical sense, and satisfy the PL inequality (10.4). Let the prob-
lem min_x f(x) have a non-empty solution set X*. Prove that gradient descent
with a stepsize of 1/L has a global linear convergence rate

    f(x_t) − f* ≤ (1 − µ/L)^t (f(x_0) − f*).
Exercise 54 (Importance Sampling). Consider random coordinate descent which
selects the i-th coordinate with probability proportional to the value L_i, where L_i
is the individual smoothness constant for coordinate i as in (10.5).
When using a stepsize of 1/L_{i_t}, prove that we obtain the faster rate of

    E[f(x_t) − f*] ≤ (1 − µ/(dL̄))^t [f(x_0) − f*],

where L̄ = (1/d) Σ_{i=1}^d L_i now is the average of all coordinate-wise smoothness
constants. Note that this value can be much smaller than the global L we have
used above, since that one was required to hold for all i and therefore has to be
chosen as L = max_i L_i.
Can you come up with an example from machine learning where L̄ ≪ L?
Exercise 55. Derive the solution to exact coordinate minimization for the Lasso
problem (10.8), for the i-th coordinate. Write A_{−i} for the d × (n − 1) matrix
obtained by removing the i-th column from A, and similarly x_{−i} for the vector x
with the i-th entry removed.

Bibliography

[ACGH18] Sanjeev Arora, Nadav Cohen, Noah Golowich, and Wei Hu. A
convergence analysis of gradient descent for deep linear neu-
ral networks. CoRR, abs/1810.02281, 2018.

[BV04] Stephen Boyd and Lieven Vandenberghe. Convex Optimization.
Cambridge University Press, New York, NY, USA, 2004.
https://web.stanford.edu/~boyd/cvxbook/.

[Dav59] William C. Davidon. Variable metric method for minimiza-


tion. Technical Report ANL-5990, AEC Research and Devel-
opment, 1959.

[Dav91] William C. Davidon. Variable metric method for minimiza-


tion. SIAM J. Optimization, 1(1):1–17, 1991.

[DSSSC08] John Duchi, Shai Shalev-Shwartz, Yoram Singer, and Tushar
Chandra. Efficient projections onto the ℓ1-ball for learning in
high dimensions. In Proceedings of the 25th International Confer-
ence on Machine Learning, pages 272–279, 2008.

[Gol70] D. Goldfarb. A family of variable-metric methods derived by


variational means. Mathematics of Computation, 24(109):23–26,
1970.

[Gre70] J. Greenstadt. Variations on variable-metric methods. Mathe-


matics of Computation, 24(109):1–22, 1970.

[KNS16] Hamed Karimi, Julie Nutini, and Mark Schmidt. Linear Con-
vergence of Gradient and Proximal-Gradient Methods Under
the Polyak-Łojasiewicz Condition. In ECML PKDD 2016: Ma-
chine Learning and Knowledge Discovery in Databases, pages

795–811. Springer International Publishing, Cham, September
2016.

[KSJ18] Sai Praneeth Karimireddy, Sebastian U Stich, and Martin


Jaggi. Global linear convergence of Newton’s method with-
out strong-convexity or Lipschitz gradients. arXiv, 2018.

[Nes12] Yu Nesterov. Efficiency of coordinate descent methods on


huge-scale optimization problems. SIAM Journal on Optimiza-
tion, 22(2):341–362, 2012.

[Noc80] J. Nocedal. Updating quasi-Newton matrices with limited stor-
age. Mathematics of Computation, 35(151):773–782, 1980.

[NP06] Yurii Nesterov and B.T. Polyak. Cubic regularization of Newton
method and its global performance. Mathematical Program-
ming, 108(1):177–205, Aug 2006.

[NSL+15] Julie Nutini, Mark W Schmidt, Issam H Laradji, Michael P
Friedlander, and Hoyt A Koepke. Coordinate Descent Con-
verges Faster with the Gauss-Southwell Rule Than Random
Selection. In ICML, pages 1632–1641, 2015.

[Roc97] R. Tyrrell Rockafellar. Convex Analysis. Princeton Landmarks


in Mathematics. Princeton University Press, 1997.

[Tib96] Robert Tibshirani. Regression shrinkage and selection via the


LASSO. J. R. Statist. Soc. B, 58(1):267–288, 1996.

[Vis14] Nisheeth Vishnoi. Lecture notes on fundamentals of convex
optimization, 2014. https://tcs.epfl.ch/files/content/sites/tcs/files/Lec3-Fall14-Web.pdf.

[Zim16] Judith Zimmermann. Information Processing for Effective and
Stable Admission. PhD thesis, ETH Zurich, 2016.

