Chapter 1
Contents
1.1 Notation
1.2 Convex sets
1.3 Convex functions
    1.3.1 Differentiable functions
    1.3.2 First-order characterization of convexity
    1.3.3 Second-order characterization of convexity
    1.3.4 Operations that preserve convexity
1.4 Minimizing convex functions
    1.4.1 Strictly convex functions
    1.4.2 Example: Least squares
    1.4.3 Constrained Minimization
1.5 Existence of a minimizer
    1.5.1 Sublevel sets and the Weierstrass Theorem
1.6 Examples
    1.6.1 Handwritten digit recognition
    1.6.2 Master's Admission
1.7 Exercises
This chapter develops the basic theory of convex functions that we will
need later. Much of the material is also covered in other courses, so we will
refer to the literature for standard material and focus more on material that
we feel is less standard (but important in our context).
1.1 Notation
For vectors in $\mathbb{R}^d$, we use bold font, and for their coordinates normal font, e.g. $\mathbf{x} = (x_1, \dots, x_d) \in \mathbb{R}^d$. $\mathbf{x}_1, \mathbf{x}_2, \dots$ denotes a sequence of vectors. Vectors are considered as column vectors, unless they are explicitly transposed. $\|\mathbf{x}\|$ denotes the Euclidean norm ($\ell_2$-norm or 2-norm) of vector $\mathbf{x}$,
$$\|\mathbf{x}\|^2 = \mathbf{x}^\top\mathbf{x} = \sum_{i=1}^d x_i^2.$$
We also use
$$\mathbb{N} = \{1, 2, \dots\} \quad\text{and}\quad \mathbb{R}_+ := \{x \in \mathbb{R} : x \ge 0\}$$
to denote the natural and non-negative real numbers, respectively. We are freely using basic notions and material such as open and closed sets, vector spaces, continuity, convergence, limits, triangle inequality, among others.
The graph of $f$ is the set $\{(x, f(x)) : x \in \mathrm{dom}(f)\}$. The epigraph (Figure 1.2) is the set of points above the graph,
Figure 1.2: Graph and epigraph of a non-convex function (left) and a convex function (right)
The line segment connecting $(x, f(x))$ and $(y, f(y))$ lies pointwise above the graph of $f$; see Figure 1.3. (Whenever we say "above", we mean "above or on".)
An important special case arises when $f : \mathbb{R}^d \to \mathbb{R}$ is an affine function, i.e. $f(x) = c^\top x + c_0$ for some vector $c \in \mathbb{R}^d$ and scalar $c_0 \in \mathbb{R}$. In this case, (1.1) is always satisfied with equality, and line segments connecting points on the graph lie pointwise on the graph.
[Figure 1.3: the value $f(\lambda x + (1-\lambda)y)$ lies below the line segment between $(x, f(x))$ and $(y, f(y))$.]
Lemma 1.5 (Jensen's inequality). Let $f : \mathbb{R}^d \to \mathbb{R}$ be a convex function, $x_1, \dots, x_m \in \mathrm{dom}(f)$, and $\lambda_1, \dots, \lambda_m \in \mathbb{R}_+$ such that $\sum_{i=1}^m \lambda_i = 1$. Then
$$f\Big(\sum_{i=1}^m \lambda_i x_i\Big) \le \sum_{i=1}^m \lambda_i f(x_i).$$
It then also follows that the matrix $A$ is unique, and it is called the differential or Jacobian matrix of $f$ at $x$. We will denote it by $Df(x)$. More precisely, $Df(x)$ is the matrix of partial derivatives at the point $x$,
$$Df(x)_{ij} = \frac{\partial f_i}{\partial x_j}(x).$$
$f$ is called differentiable if $f$ is differentiable at all $x \in \mathrm{dom}(f)$.
Differentiability at $x$ means that in some neighborhood of $x$, $f$ is approximated by a (unique) affine function $f(x) + Df(x)(y - x)$, up to a sublinear error term. If $m = 1$, $Df(x)$ is a row vector typically denoted by $\nabla f(x)^\top$, where the (column) vector $\nabla f(x)$ is called the gradient of $f$ at $x$. Geometrically, this means that the graph of the affine function $f(x) + \nabla f(x)^\top(y - x)$ is a tangent hyperplane to the graph of $f$ at $(x, f(x))$; see Figure 1.4.

[Figure 1.4: the graph of $f$ and the tangent hyperplane $f(x) + \nabla f(x)^\top(y - x)$ at $(x, f(x))$.]
differentiable at $g(x) \in \mathrm{dom}(f)$. Then $f \circ g$ (the composition of $f$ and $g$) is differentiable at $x$, with the differential given by the matrix equation
$$D(f \circ g)(x) = Df(g(x))\, Dg(x).$$
The following is a general result that we will later use in specific settings. As its proof also highlights some important notions and techniques, we will give it here. As a preparation, we need the concept of the spectral norm of a matrix,
$$\|A\| := \max_{v \in \mathbb{R}^d,\, v \neq 0} \frac{\|Av\|}{\|v\|} = \max_{\|v\| = 1} \|Av\|.$$
In words, the spectral norm is the largest factor by which a unit vector can be stretched in length under the mapping $v \mapsto Av$.
Also recall that a function $f : \mathrm{dom}(f) \to \mathbb{R}^m$ is $B$-Lipschitz (or simply Lipschitz if there is a suitable $B$) if $\|f(x) - f(y)\| \le B\|x - y\|$ for all $x, y \in \mathrm{dom}(f)$. In particular, Lipschitz functions are continuous.
$$\|Df(x)\| \le B, \quad \forall x \in X.$$
Indeed, (i) might not imply (ii) if $X$ is closed. As a trivial example, the Lipschitz condition is always satisfied over $X = \{0\}$ but does not say anything about $\|Df(x)\|$.
Proof. Suppose that $f$ is $B$-Lipschitz over an open set $X$. For $v \in \mathbb{R}^d$, $v \to 0$, differentiability at $x \in X$ yields for small $v \in \mathbb{R}^d$ that $x + v \in X$ and therefore
$$B\|v\| \ge \|f(x + v) - f(x)\| = \|Df(x)v + r(v)\| \ge \|Df(x)v\| - \|r(v)\|,$$
where $\|r(v)\|/\|v\| \to 0$, the first inequality uses (i), and the last is the reverse triangle inequality. Rearranging and dividing by $\|v\|$, we get
$$\frac{\|Df(x)v\|}{\|v\|} \le B + \frac{\|r(v)\|}{\|v\|}.$$
Let $v^\star$ be a unit vector such that $\|Df(x)\| = \|Df(x)v^\star\|/\|v^\star\|$ and let $v = tv^\star$ for $t \to 0$. Then we further get
$$\|Df(x)\| \le B + \frac{\|r(v)\|}{\|v\|} \to B,$$
see (1.2). Note that $h$ is well-defined since $X$ was assumed to be convex. Then we compute
$$f(x + t(y - x)) = f((1-t)x + ty) \le (1-t)f(x) + t f(y) = f(x) + t(f(y) - f(x)).$$
This is convexity.
For $f(x_1, x_2) = x_1^2 + x_2^2$, we have $\nabla f(x) = (2x_1, 2x_2)$, hence (1.4) boils down to
$$y_1^2 + y_2^2 \ge x_1^2 + x_2^2 + 2x_1(y_1 - x_1) + 2x_2(y_2 - x_2),$$
which after some rearranging of terms is equivalent to
$$(y_1 - x_1)^2 + (y_2 - x_2)^2 \ge 0,$$
hence true. There are relevant convex functions that are not differentiable, see Figure 1.6 for an example. More generally, Exercise 7 asks you to prove that the $\ell_1$-norm (or 1-norm) $f(x) = \|x\|_1$ is convex.
[Figure 1.6: the function $f(x) = |x|$ is convex but not differentiable at $0$.]
exists at every point $x \in \mathrm{dom}(f)$ and is symmetric. Then $f$ is convex if and only if $\mathrm{dom}(f)$ is convex, and for all $x \in \mathrm{dom}(f)$, we have
$$\nabla^2 f(x) \succeq 0 \quad \text{(i.e. } \nabla^2 f(x) \text{ is positive semidefinite)}. \tag{1.5}$$
(A symmetric matrix $M$ is positive semidefinite, denoted by $M \succeq 0$, if $x^\top M x \ge 0$ for all $x$, and positive definite, denoted by $M \succ 0$, if $x^\top M x > 0$ for all $x \neq 0$.)
(i) Let $f_1, f_2, \dots, f_m$ be convex functions, $\lambda_1, \lambda_2, \dots, \lambda_m \in \mathbb{R}_+$. Then $f := \sum_{i=1}^m \lambda_i f_i$ is convex on $\mathrm{dom}(f) := \bigcap_{i=1}^m \mathrm{dom}(f_i)$.
Definition 1.14. A local minimum of $f : \mathrm{dom}(f) \to \mathbb{R}$ is a point $x$ such that there exists $\varepsilon > 0$ with
$$f(x) \le f(y) \quad \forall y \in \mathrm{dom}(f) \text{ satisfying } \|y - x\| < \varepsilon.$$
Lemma 1.15. Let $x^\star$ be a local minimum of a convex function $f : \mathrm{dom}(f) \to \mathbb{R}$. Then $x^\star$ is a global minimum, meaning that
$$f(x^\star) \le f(y) \quad \forall y \in \mathrm{dom}(f).$$
Proof. Suppose there exists $y \in \mathrm{dom}(f)$ such that $f(y) < f(x^\star)$ and define $y' := \lambda x^\star + (1 - \lambda)y$ for $\lambda \in (0, 1)$. From convexity (1.1), we get that
$$f(y') \le \lambda f(x^\star) + (1 - \lambda) f(y) < f(x^\star).$$
Choosing $\lambda$ so close to $1$ that $\|y' - x^\star\| < \varepsilon$ yields a contradiction to $x^\star$ being a local minimum.
This does not mean that a convex function always has a global minimum. Think of $f(x) = x$ as a trivial example. But also if $f$ is bounded from below over $\mathrm{dom}(f)$, it may fail to have a global minimum ($f(x) = e^x$). To ensure the existence of a global minimum, we need additional conditions. For example, it suffices if outside some ball $B$, all function values are larger than some value $f(x)$, $x \in B$. In this case, we can restrict $f$ to $B$, without changing the smallest attainable value. And on $B$ (which is compact), $f$ attains a minimum by continuity (Lemma 1.6). An easy example: for $f(x_1, x_2) = x_1^2 + x_2^2$, we know that outside any ball containing $0$, $f(x) > f(0) = 0$.
Another easy condition in the differentiable case is given by the following result.
Lemma 1.16. Suppose that $f : \mathrm{dom}(f) \to \mathbb{R}$ is convex and differentiable over an open domain $\mathrm{dom}(f) \subseteq \mathbb{R}^d$. Let $x \in \mathrm{dom}(f)$. If $\nabla f(x) = 0$, then $x$ is a global minimum.
Proof. Suppose that $\nabla f(x) = 0$. According to Lemma 1.11, we have
$$f(y) \ge f(x) + \nabla f(x)^\top(y - x) = f(x)$$
for all $y \in \mathrm{dom}(f)$, so $x$ is a global minimum.
The converse is also true and is a corollary of Lemma 1.22 [BV04, 4.2.3].
Lemma 1.17. Suppose that $f : \mathrm{dom}(f) \to \mathbb{R}$ is convex and differentiable over an open domain $\mathrm{dom}(f) \subseteq \mathbb{R}^d$. Let $x \in \mathrm{dom}(f)$. If $x$ is a global minimum, then $\nabla f(x) = 0$.
1.4.1 Strictly convex functions
In general, a global minimum of a convex function is not unique (think of $f(x) = 0$ as a trivial example). However, if we forbid "flat" parts of the graph of $f$, a global minimum becomes unique (if it exists at all).
This means that the open line segment connecting $(x, f(x))$ and $(y, f(y))$ is pointwise strictly above the graph of $f$. For example, $f(x) = x^2$ is strictly convex.
Lemma 1.19 ([BV04, 3.1.4]). Suppose that $\mathrm{dom}(f)$ is open and that $f$ is twice differentiable. If the Hessian $\nabla^2 f(x) \succ 0$ for every $x \in \mathrm{dom}(f)$ (i.e., $z^\top \nabla^2 f(x) z > 0$ for any $z \neq 0$), then $f$ is strictly convex.
The converse is false, though: $f(x) = x^4$ is strictly convex but has vanishing second derivative at $x = 0$.
Lemma 1.20. Let $f : \mathrm{dom}(f) \to \mathbb{R}$ be strictly convex. Then $f$ has at most one global minimum.
Proof. Suppose $x^\star \neq y^\star$ are two global minima with $f_{\min} = f(x^\star) = f(y^\star)$, and let $z = \frac12 x^\star + \frac12 y^\star$. By (1.6),
$$f(z) < \tfrac12 f_{\min} + \tfrac12 f_{\min} = f_{\min},$$
a contradiction to $x^\star$ and $y^\star$ being global minima.
Consider the following data points $(x_i, y_i)$, to which we want to fit a line $y = w_0 + w_1 x$ in the least squares sense (Section 1.4.2); see Figure 1.7:
$$(1, 10),\ (2, 11),\ (3, 11),\ (4, 10),\ (5, 9),\ (6, 10),\ (7, 9),\ (8, 10).$$
The least squares objective $f(w_0, w_1) = \sum_{i=1}^{8}(w_0 + w_1 x_i - y_i)^2$ has gradient
$$\nabla f(w_0, w_1) = \big(16 w_0 + 72 w_1 - 160,\ 72 w_0 + 408 w_1 - 706\big)$$
and Hessian
$$\nabla^2 f(w_0, w_1) = \begin{pmatrix} 16 & 72 \\ 72 & 408 \end{pmatrix}.$$
A symmetric $2 \times 2$ matrix is positive definite if its diagonal elements and its determinant are positive, which is the case here, so $f$ is actually strictly convex and has a unique global minimum. To find it, we solve the linear system $\nabla f(w_0, w_1) = (0, 0)$ of two equations in two unknowns and obtain the global minimum
$$(w_0^\star, w_1^\star) = \Big(\frac{43}{4}, -\frac{1}{6}\Big).$$
Hence, the "optimal" line is
$$y = -\frac{1}{6}x + \frac{43}{4},$$
see Figure 1.7 (right).
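As a quick numerical sanity check (a minimal NumPy sketch, not needed for the derivation above), the same minimizer can be recovered with a generic least squares solver:

```python
import numpy as np

# The data points (x_i, y_i) from above
x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([10, 11, 11, 10, 9, 10, 9, 10], dtype=float)

# Design matrix for the model y ≈ w0 + w1 * x
A = np.column_stack([np.ones_like(x), x])

# Solve min_w ||A w - y||^2
w, *_ = np.linalg.lstsq(A, y, rcond=None)
print(w)  # approximately [10.75, -0.1667], i.e. (43/4, -1/6)
```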
$$f(x) \le f(y) \quad \forall y \in X.$$
$$\nabla f(x^\star)^\top(x - x^\star) \ge 0 \quad \forall x \in X.$$

[Figure 1.8: at a minimizer $x^\star$ of $f$ over $X$, the condition $\nabla f(x^\star)^\top(x - x^\star) \ge 0$ holds for all $x \in X$.]

or
$$\text{minimize } f(x) \quad \text{subject to } x \in X. \tag{1.9}$$
1.5.1 Sublevel sets and the Weierstrass Theorem
Definition 1.23. Let $f : \mathbb{R}^d \to \mathbb{R}$, $\alpha \in \mathbb{R}$. The set
$$f^{\le \alpha} := \{x \in \mathbb{R}^d : f(x) \le \alpha\}$$
is the $\alpha$-sublevel set of $f$; see Figure 1.9.
Figure 1.9: Sublevel set of a non-convex function (left) and a convex function (right)
It is easy to see from the definition that every sublevel set of a convex
function is convex. Moreover, as a consequence of continuity of f , sublevel
sets are closed. The following (known as the Weierstrass Theorem) just
formalizes an argument that we have made earlier.
1.6 Examples
In the following two sections, we give two examples of convex function
minimization tasks that arise from machine learning applications.
Figure 1.10: Some training images from the MNIST data set (picture from http://corochann.com/mnist-dataset-introduction-1138.html)
then use the vector $y = W x \in \mathbb{R}^{10}$ to predict the digit seen in an arbitrary image $x$. The idea is that $y_j$, $j = 0, \dots, 9$ corresponds to the probability of the digit being $j$. This does not work directly, since the entries of $y$ may be negative and generally do not sum up to 1. But we can convert $y$ to a vector $z$ of actual probabilities, such that a small $y_j$ leads to a small probability $z_j$ and a large $y_j$ to a large probability $z_j$. How to do this is not canonical, but here is a well-known formula that works:
$$z_j = z_j(y) = \frac{e^{y_j}}{\sum_{k=0}^{9} e^{y_k}}. \tag{1.10}$$
The classification then simply outputs digit $j$ with probability $z_j$. The matrix $W$ is chosen such that it (approximately) minimizes the classification error on the training set $P$. Again, it is not canonical how we measure classification error; here we use the following loss function to evaluate the error induced by a given matrix $W$:
$$\ell(W) = -\sum_{x \in P} \ln z_{d(x)}(Wx) = \sum_{x \in P}\Big(\ln\Big(\sum_{k=0}^{9} e^{(Wx)_k}\Big) - (Wx)_{d(x)}\Big). \tag{1.11}$$
This function "punishes" images for which the correct digit $j$ has low probability $z_j$ (corresponding to a significantly negative value of $\log z_j$). In an ideal world, the correct digit would always have probability 1, resulting in $\ell(W) = 0$. But under (1.10), probabilities are always strictly between 0 and 1, so we have $\ell(W) > 0$ for all $W$.
Exercise 5 asks you to prove that ` is convex. In Exercise 6, you will
characterize the situations in which ` has a global minimum.
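For concreteness, here is a minimal NumPy sketch of (1.10) and (1.11); the names X (one image per row) and labels (the correct digits $d(x)$) are placeholders for the training set $P$:

```python
import numpy as np

def softmax(y):
    """Turn the score vector y (length 10) into probabilities z via (1.10)."""
    e = np.exp(y - y.max())   # subtracting the max is mathematically equivalent and avoids overflow
    return e / e.sum()

def loss(W, X, labels):
    """Loss (1.11): negative log-probability of the correct digit, summed over the training set."""
    total = 0.0
    for x, d in zip(X, labels):
        z = softmax(W @ x)     # class probabilities for image x
        total -= np.log(z[d])  # punish low probability of the correct digit d(x)
    return total
```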
Data on the actual performance of students admitted in the past is available. To keep things simple in the following example, let us base
the forecast on GPA (grade point average) and TOEFL (Test of English as
a Foreign Language) only. GPA scores are normalized to a scale with a
minimum of 0.0 and a maximum of 4.0, where admission starts from 3.5.
TOEFL scores are on an integer scale between 0 and 120, where admission
starts from 100.
Table 1.1 contains the known data. GGPA (graduation grade point av-
erage on a Swiss grading scale) is the average grade obtained by an ad-
mitted student over all courses in the MSc program. The Swiss scale goes
from 1 to 6 where 1 is the lowest grade, 6 is the highest, and 4 is the lowest
passing grade.
Table 1.1: Data for 10 admitted students: GPA and TOEFL scores (at time
of application), GGPA (at time of graduation)
However, in our scenario, the relevant GPA scores span a range of only 0.5 while the relevant TOEFL scores span a range of 20. The resulting least squares objective would be somewhat ugly; we already saw this in our previous example (1.7), where the data points had large second coordinate, resulting in the $w_1$-scale being very different from the $w_2$-scale. This time, we normalize first, so that $w_1$ and $w_2$ become comparable and allow us to understand the relative influences of GPA and TOEFL.
The general setting is this: we have $n$ inputs $x_1, \dots, x_n$, where each vector $x_i \in \mathbb{R}^d$ consists of $d$ input variables; then we have $n$ outputs $y_1, \dots, y_n \in \mathbb{R}$. Each pair $(x_i, y_i)$ is an observation. In our case, $d = 2$, $n = 10$, and for example, $((3.93, 100), 5.52)$ is an observation (of a student doing very well).
With variable weights $w_0$, $w = (w_1, \dots, w_d) \in \mathbb{R}^d$, we plan to minimize the least squares objective
$$f(w_0, w) = \sum_{i=1}^n (w_0 + w^\top x_i - y_i)^2.$$
We first want to assume that the inputs and outputs are centered, meaning that
$$\frac{1}{n}\sum_{i=1}^n x_i = 0, \qquad \frac{1}{n}\sum_{i=1}^n y_i = 0.$$
This can be achieved by simply subtracting the mean $\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i$ from every input and the mean $\bar{y} = \frac{1}{n}\sum_{i=1}^n y_i$ from every output. In our example, this yields the numbers in Table 1.2 (left).
After centering, the global minimum $(w_0^\star, w^\star)$ of the least squares objective satisfies $w_0^\star = 0$ while $w^\star$ is unaffected by centering (Exercise 9), so that we can simply omit the variable $w_0$ in the sequel.
Finally, we assume that all $d$ input variables are on the same scale, meaning that
$$\frac{1}{n}\sum_{i=1}^n x_{ij}^2 = 1, \qquad j = 1, \dots, d.$$
To achieve this for fixed $j$ (assuming that no variable is 0 in all inputs), we multiply all $x_{ij}$ by $s(j) = \sqrt{n / \sum_{i=1}^n x_{ij}^2}$ (which, in the optimal solution $w^\star$, just multiplies $w_j^\star$ by $1/s(j)$, an argument very similar to the one in Exercise 9). For our data set, the resulting normalized data are shown in Table 1.2 (right). Now the least squares objective (after omitting $w_0$) is
$$f(w_1, w_2) = \sum_{i=1}^{10}(w_1 x_{i1} + w_2 x_{i2} - y_i)^2 \approx 10 w_1^2 + 10 w_2^2 + 1.99\, w_1 w_2 - 8.7\, w_1 - 2.79\, w_2 + 2.09.$$
This is minimized at $(w_1^\star, w_2^\star) \approx (0.43, 0.097)$, i.e. by the linear model
$$y^\star = 0.43\, x_1 + 0.097\, x_2 \tag{1.13}$$
in the normalized data. This can quickly be checked, and the results are not perfect, but not too bad, either; see Table 1.3 (ignore the last column for now).
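The whole pipeline (centering, rescaling, solving the least squares problem) is only a few lines of NumPy. In the sketch below, X_raw (an n × d array of GPA/TOEFL inputs) and y_raw (the GGPA outputs) are placeholders for the data of Table 1.1:

```python
import numpy as np

def normalized_least_squares(X_raw, y_raw):
    """Center inputs and outputs, rescale each input variable to mean square 1,
    and solve the least squares problem without the intercept w0."""
    X = X_raw - X_raw.mean(axis=0)                  # centered inputs
    y = y_raw - y_raw.mean()                        # centered outputs
    s = np.sqrt(X.shape[0] / (X ** 2).sum(axis=0))  # scaling factors s(j)
    X = X * s                                       # now (1/n) * sum_i x_ij^2 = 1 for every j
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

# w = normalized_least_squares(X_raw, y_raw)  # for Table 1.1 this yields a fit like (1.13)
```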
What we also see from (1.13) is that the first input variable (GPA) has a
much higher influence on the output (GGPA) than the second one (TOEFL).
In fact, if we drop the second one altogether, we obtain outputs zi? (last col-
umn in Table 1.3) that seem equivalent to the predicted outputs yi? within
the level of noise that we have anyway.
We conclude that TOEFL scores are probably not indicative of the performance of admitted students, so the admission committee should not
care too much about them. Requiring a minimum score of 100 might make
sense, but whenever an applicant reaches at least this score, the actual
value does not matter.
 x_i1    x_i2    y_i     y_i*    z_i*
-2.04 -1.28 -0.94 -1.00 -0.87
-0.88 0.32 -0.52 -0.35 -0.37
-0.05 1.03 -0.05 0.08 -0.02
-0.16 -1.28 -0.18 -0.19 -0.07
1.42 -1.28 0.67 0.49 0.61
1.02 1.39 0.59 0.57 0.44
0.06 1.39 0.19 0.16 0.03
-0.88 -0.04 -0.12 -0.38 -0.37
0.89 -0.21 0.17 0.36 0.38
0.62 -0.04 0.21 0.26 0.27
Table 1.3: Outputs $y_i^\star$ predicted by the linear model (1.13) and by the model $z_i^\star = 0.43\, x_{i1}$ that simply ignores the second input variable
$\sum_{j=1}^d |w_j|$):
$$\text{minimize } \sum_{i=1}^n (w^\top x_i - y_i)^2 \quad \text{subject to } \|w\|_1 \le R, \tag{1.14}$$
where $R \in \mathbb{R}_+$ is some parameter. In our case, if we for example
[Figure: the level set $10 w_1^2 + 10 w_2^2 + 1.99\, w_1 w_2 - 8.7\, w_1 - 2.79\, w_2 + 2.09 = 0.75$ around the minimizer $(0.43, 0.097)$, together with the $\ell_1$-ball constraint (for $R = 0.4$).]
Even though we have presented a toy example in this section, the back-
ground is real. The theory of admission and in particular performance
forecasts has been developed in a recent PhD thesis by Zimmermann [Zim16].
1.7 Exercises
Exercise 1. Prove Jensen’s inequality (Lemma 1.5)!
Exercise 4. Prove Lemma 1.13! Can (ii) be generalized to show that for two convex functions $f, g$, the function $f \circ g$ is convex as well?
Exercise 6. Consider the logistic regression problem with two classes. Given a training set $P$ consisting of datapoint and label pairs $(x, y)$ where $x \in \mathbb{R}^d$ and $y \in \{-1, +1\}$, we define our loss $\ell$ for weight vector $w \in \mathbb{R}^d$ to be
$$\ell(w) = -\sum_{(x, y) \in P} \ln z(y\, w^\top x),$$
where $z(s) = 1/(1 + \exp(-s))$. This loss function is in fact a simplification of (1.11) when we only have two classes.
We say that the weight vector $w$ is a separator for $P$ if for all $(x, y) \in P$,
$$y(w^\top x) \ge 0.$$
$$y(w^\top x) = 0.$$
(i) $f(\lambda x) = |\lambda| f(x)$,
Exercise 9. Suppose that we have centered observations $(x_i, y_i)$ such that $\sum_{i=1}^n x_i = 0$, $\sum_{i=1}^n y_i = 0$. Let $(w_0^\star, w^\star)$ be the global minimum of the least squares objective
$$f(w_0, w) = \sum_{i=1}^n (w_0 + w^\top x_i - y_i)^2.$$
Prove that $w_0^\star = 0$. Also, suppose $x_i'$ and $y_i'$ are such that for all $i$, $x_i' = x_i + q$, $y_i' = y_i + r$. Show that $(w_0, w)$ minimizes $f$ if and only if $(w_0 - w^\top q + r, w)$ minimizes
$$f'(w_0, w) = \sum_{i=1}^n (w_0 + w^\top x_i' - y_i')^2.$$
Chapter 2
Gradient Descent
Contents
2.1 Overview
2.2 The algorithm
2.3 Vanilla analysis
2.4 Lipschitz convex functions: $O(1/\varepsilon^2)$ steps
2.5 Smooth convex functions: $O(1/\varepsilon)$ steps
2.6 Interlude
2.7 Smooth and strongly convex functions: $O(\log(1/\varepsilon))$ steps
2.8 Exercises
2.1 Overview
The gradient descent algorithm (including variants such as projected or
stochastic gradient descent) is the most useful workhorse for minimizing
loss functions in practice. The algorithm is extremely simple and surpris-
ingly robust in the sense that it also works well for many loss functions
that are not convex. While it is easy to construct (artificial) non-convex
functions on which gradient descent goes completely astray, such func-
tions do not seem to be typical in practice; however, understanding this
on a theoretical level is an open problem, and only few results exist in this
direction.
The vast majority of theoretical results concerning the performance of
gradient descent hold for convex functions only. In this and the following
chapters, we will present some of these results, but maybe more impor-
tantly, the main ideas behind them. As it turns out, the number of ideas
that we need is rather small, and typically, they are shared between dif-
ferent results. Our approach is therefore to fully develop each idea once,
in the context of a concrete result. If the idea reappears, we will typically
only discuss the changes that are necessary in order to establish a new re-
sult from this idea. In order to avoid boredom from ideas that reappear
too often, we omit other results and variants that one could also get along
the lines of what we discuss.
Let $f : \mathbb{R}^d \to \mathbb{R}$ be a convex and differentiable function. We also assume that $f$ has a global minimum $x^\star$, and the goal is to find (an approximation of) $x^\star$. This usually means that for a given $\varepsilon > 0$, we want to find $x \in \mathbb{R}^d$ such that
$$f(x) - f(x^\star) < \varepsilon.$$
Notice that we are not making an attempt to get near to $x^\star$ itself — there can be several minima $x_1^\star \neq x^\star \neq x_2^\star$ with $f(x_1^\star) = f(x_2^\star) = f(x^\star)$.
Table 2.1 gives an overview of the results that we will prove. They con-
cern several variants of gradient descent as well as several classes of func-
tions. The significance of each algorithm and function class will briefly be
discussed when it first appears.
In Chapter 6, we will also look at gradient descent on functions that
are not convex. In this case, provably small approximation error can still
be obtained for some particularly well-behaved functions (we will give an
example). For smooth (but not necessarily convex) functions, we gener-
                        | Lipschitz convex | smooth convex | smooth & strongly | strongly convex
                        | functions        | functions     | convex functions  | functions
gradient descent        | Thm. 2.1         | Thm. 2.7      | Thm. 2.11         |
                        | O(1/ε²)          | O(1/ε)        | O(log(1/ε))       |
projected gradient      | Thm. 3.2         | Thm. 3.4      | Thm. 3.5          |
descent                 | O(1/ε²)          | O(1/ε)        | O(log(1/ε))       |
proximal gradient       |                  | Thm. 3.14     |                   |
descent                 |                  | O(1/ε)        |                   |
subgradient descent     | Thm. 4.7         |               |                   | Thm. 4.11
                        | O(1/ε²)          |               |                   | O(1/ε)
stochastic gradient     | Thm. 5.1         |               |                   | Thm. 5.2
descent                 | O(1/ε²)          |               |                   | O(1/ε)

Table 2.1: Results on gradient descent. Below each theorem, the number of steps is given which the respective variant needs on the respective function class to achieve additive approximation error at most $\varepsilon$.
$$x_{t+1} = x_t + v_t.$$
tending to 0. To get any decrease in function value at all, we have to choose $v_t$ such that $\nabla f(x_t)^\top v_t < 0$. But among all steps $v_t$ of the same length, we should in fact choose the one with the most negative value of $\nabla f(x_t)^\top v_t$, so that we maximize our decrease in function value. This is achieved when $v_t$ points into the direction of the negative gradient $-\nabla f(x_t)$. But as differentiability guarantees decrease only for small steps, we also want to control how far we go along the direction of the negative gradient.
Therefore, the step of gradient descent is defined by
$$x_{t+1} := x_t - \gamma \nabla f(x_t). \tag{2.1}$$
Here, $\gamma$ is a fixed stepsize, but it may also make sense to have $\gamma$ depend on $t$. For now, $\gamma$ is fixed. We hope that for some reasonably small integer $t$, in the $t$-th iteration we get that $f(x_t) - f(x^\star) < \varepsilon$; see Figure 2.1 for an example.
Now it becomes clear why we are assuming that $\mathrm{dom}(f) = \mathbb{R}^d$: The update step (2.1) may in principle take us "anywhere", so in order to get a well-defined algorithm, we want to make sure that $f$ is defined and differentiable everywhere.
The choice of $\gamma$ is critical for the performance. If $\gamma$ is too small, the process might take too long, and if $\gamma$ is too large, we are in danger of overshooting. It is not clear at this point whether there is a "right" stepsize.
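In code, the update (2.1) is a one-liner. Here is a minimal sketch (plain NumPy, fixed stepsize gamma, gradient supplied by the caller):

```python
import numpy as np

def gradient_descent(grad_f, x0, gamma, num_steps):
    """Gradient descent (2.1): x_{t+1} = x_t - gamma * grad_f(x_t)."""
    x = np.asarray(x0, dtype=float)
    iterates = [x.copy()]
    for _ in range(num_steps):
        x = x - gamma * grad_f(x)
        iterates.append(x.copy())
    return iterates

# Example: the supermodel f(x) = x^2 with gradient 2x and stepsize gamma = 1/4.
iterates = gradient_descent(lambda x: 2 * x, x0=[5.0], gamma=0.25, num_steps=10)
```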
Now we apply (somewhat out of the blue, but this will clear up in the next step) the basic vector equation $2 v^\top w = \|v\|^2 + \|w\|^2 - \|v - w\|^2$ (a.k.a. the cosine theorem):

[Figure 2.1: an example run of gradient descent, iterates $x_0, x_1, \dots, x_5$.]
$$g_t^\top(x_t - x^\star) = \frac{1}{2\gamma}\Big(\|x_t - x_{t+1}\|^2 + \|x_t - x^\star\|^2 - \|x_{t+1} - x^\star\|^2\Big) = \frac{1}{2\gamma}\Big(\gamma^2\|g_t\|^2 + \|x_t - x^\star\|^2 - \|x_{t+1} - x^\star\|^2\Big) = \frac{\gamma}{2}\|g_t\|^2 + \frac{1}{2\gamma}\Big(\|x_t - x^\star\|^2 - \|x_{t+1} - x^\star\|^2\Big). \tag{2.3}$$
Next we sum this up over the iterations t, so that the latter two terms in
the bracket cancel in a telescoping sum.
$$\sum_{t=0}^{T-1} g_t^\top(x_t - x^\star) = \frac{\gamma}{2}\sum_{t=0}^{T-1}\|g_t\|^2 + \frac{1}{2\gamma}\Big(\|x_0 - x^\star\|^2 - \|x_T - x^\star\|^2\Big) \le \frac{\gamma}{2}\sum_{t=0}^{T-1}\|g_t\|^2 + \frac{1}{2\gamma}\|x_0 - x^\star\|^2. \tag{2.4}$$
So far, we have not used any properties of the function $f$ or its gradient $g_t$, except the definition of the update step $x_{t+1} = x_t - \gamma g_t$. Now we invoke convexity of $f$, or more precisely the first-order characterization of convexity (1.4) with $x = x_t$, $y = x^\star$:
$$f(x_t) - f(x^\star) \le g_t^\top(x_t - x^\star). \tag{2.5}$$
Together with (2.4), this yields
$$\sum_{t=0}^{T-1}\big(f(x_t) - f(x^\star)\big) \le \frac{\gamma}{2}\sum_{t=0}^{T-1}\|g_t\|^2 + \frac{1}{2\gamma}\|x_0 - x^\star\|^2. \tag{2.6}$$
This gives us an upper bound for the average error $f(x_t) - f(x^\star)$, $t = 0, \dots, T-1$, hence in particular for the error incurred by the iterate with the smallest function value. The last iterate is not necessarily the best one: gradient descent with fixed stepsize will in general also make steps that overshoot and actually increase the function value; see Exercise 12(i).
The question is of course: is this result any good? In general, the answer is no. A dependence on $\|x_0 - x^\star\|$ is to be expected (the further we start from $x^\star$, the longer we will take); the dependence on the squared gradients $\|g_t\|^2$ is more of an issue, and if we cannot control them, we cannot say much.
Theorem 2.1. Let $f : \mathbb{R}^d \to \mathbb{R}$ be convex and differentiable with a global minimum $x^\star$; furthermore, suppose that $\|x_0 - x^\star\| \le R$ and $\|\nabla f(x)\| \le B$ for all $x$. Choosing the stepsize
$$\gamma := \frac{R}{B\sqrt{T}},$$
gradient descent (2.1) yields
$$\frac{1}{T}\sum_{t=0}^{T-1}\big(f(x_t) - f(x^\star)\big) \le \frac{RB}{\sqrt{T}}.$$
Proof. This is a simple calculation on top of (2.6): after plugging in the bounds $\|x_0 - x^\star\| \le R$ and $\|g_t\| \le B$, we get
$$\sum_{t=0}^{T-1}\big(f(x_t) - f(x^\star)\big) \le \frac{\gamma}{2}B^2 T + \frac{1}{2\gamma}R^2,$$
2.5 Smooth convex functions: $O(1/\varepsilon)$ steps
Our workhorse in the vanilla analysis was the first-order characterization of convexity: for all $x, y \in \mathrm{dom}(f)$, we have
$$f(y) \ge f(x) + \nabla f(x)^\top(y - x). \tag{2.7}$$
Next we want to look at functions for which $f(y)$ can be bounded from above by $f(x) + \nabla f(x)^\top(y - x)$, up to at most quadratic error. The following definition applies to all differentiable functions, convexity is not required.
Definition 2.2. Let $f : \mathrm{dom}(f) \to \mathbb{R}$ be a differentiable function, $X \subseteq \mathrm{dom}(f)$ convex and $L \in \mathbb{R}_+$. Function $f$ is called smooth (with parameter $L$) over $X$ if
$$f(y) \le f(x) + \nabla f(x)^\top(y - x) + \frac{L}{2}\|x - y\|^2, \quad \forall x, y \in X. \tag{2.8}$$
If $X = \mathrm{dom}(f)$, $f$ is simply called smooth.
Recall that (2.7) says that for any $x$, the graph of $f$ is above its tangential hyperplane at $(x, f(x))$. In contrast, (2.8) says that for any $x \in X$, the graph of $f$ is below a not-too-steep tangential paraboloid at $(x, f(x))$; see Figure 2.2.
This notion of smoothness has become standard in convex optimization, but the naming is somewhat unfortunate, since there is an (older) definition of a smooth function in mathematical analysis where it means a function that is infinitely often differentiable.
Let us discuss some cases. If $L = 0$, (2.7) and (2.8) together require that
$$f(y) = f(x) + \nabla f(x)^\top(y - x), \quad \forall x, y \in \mathrm{dom}(f),$$
meaning that $f$ is an affine function. A simple calculation shows that our supermodel function $f(x) = x^2$ is smooth with parameter $L = 2$:
$$f(y) = y^2 = x^2 + 2x(y - x) + (x - y)^2 = f(x) + f'(x)(y - x) + \frac{L}{2}(x - y)^2.$$
More generally, we also claim that all quadratic functions of the form $f(x) = x^\top Q x + b^\top x + c$ are smooth, where $Q$ is a $(d \times d)$ matrix, $b \in \mathbb{R}^d$ and $c \in \mathbb{R}$. Because $x^\top Q x = x^\top Q^\top x$, we get that $x^\top Q x = \frac12 x^\top(Q + Q^\top)x$, where $\frac12(Q + Q^\top)$ is symmetric. Therefore, we can assume without loss of generality that $Q$ is symmetric, i.e., it suffices to show that quadratic functions defined by symmetric matrices are smooth.
[Figure 2.2: a smooth convex function lies above its tangent hyperplane $f(x) + \nabla f(x)^\top(y - x)$ and below the paraboloid $f(x) + \nabla f(x)^\top(y - x) + \frac{L}{2}\|x - y\|^2$.]
Lemma 2.3 (Exercise 11). Let $f(x) = x^\top Q x + b^\top x + c$, where $Q$ is a symmetric $(d \times d)$ matrix, $b \in \mathbb{R}^d$, $c \in \mathbb{R}$. Then $f$ is smooth with parameter $2\|Q\|$, where $\|Q\|$ is the spectral norm of $Q$ (Definition 1.9).
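For a concrete quadratic, the smoothness parameter of Lemma 2.3 can be read off the spectral norm; a small NumPy sketch (the matrix Q here is chosen arbitrarily for illustration):

```python
import numpy as np

# An arbitrary symmetric matrix Q defining f(x) = x^T Q x + b^T x + c
Q = np.array([[2.0, 1.0],
              [1.0, 3.0]])

# Smoothness parameter from Lemma 2.3: L = 2 * ||Q|| (spectral norm = largest singular value)
L = 2 * np.linalg.norm(Q, 2)
print(L)  # about 7.24 for this Q
```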
The (univariate) convex function $f(x) = x^4$ is not smooth (over $\mathbb{R}$): at $x = 0$, condition (2.8) reads as
$$y^4 \le \frac{L}{2}y^2,$$
and there is obviously no $L$ that works for all $y$. The function is smooth, however, over any bounded set $X$ (Exercise 16).
In general—and this is the important message here—only functions of
asymptotically at most quadratic growth can be smooth. It is tempting to
believe that any such “subquadratic” function is actually smooth, but this
is not true. Exercise 12(iii) provides a counterexample.
While bounded gradients are equivalent to Lipschitz continuity of $f$ (Theorem 1.10), smoothness turns out to be equivalent to Lipschitz continuity of $\nabla f$—if $f$ is convex over the whole space. In general, Lipschitz continuity of $\nabla f$ implies smoothness, but not the other way around.
We next show that for smooth convex functions, the vanilla analysis provides a better bound than it does under bounded gradients. In particular, we are now able to serve the supermodel $f(x) = x^2$.
We start with a preparatory lemma showing that gradient descent (with suitable stepsize $\gamma$) makes progress in function value on smooth functions in every step. We call this sufficient decrease, and maybe surprisingly, it does not require convexity.
Lemma 2.6. Let $f : \mathbb{R}^d \to \mathbb{R}$ be differentiable and smooth with parameter $L$ according to (2.8). With
$$\gamma := \frac{1}{L},$$
gradient descent (2.1) satisfies
$$f(x_{t+1}) \le f(x_t) - \frac{1}{2L}\|\nabla f(x_t)\|^2, \quad t \ge 0.$$
More specifically, this already holds if $f$ is smooth with parameter $L$ over the line segment connecting $x_t$ and $x_{t+1}$.
Proof. We apply the smoothness condition (2.8) and the definition of gradient descent that yields $x_{t+1} - x_t = -\nabla f(x_t)/L$. We compute
$$f(x_{t+1}) \le f(x_t) + \nabla f(x_t)^\top(x_{t+1} - x_t) + \frac{L}{2}\|x_t - x_{t+1}\|^2 = f(x_t) - \frac{1}{L}\|\nabla f(x_t)\|^2 + \frac{1}{2L}\|\nabla f(x_t)\|^2 = f(x_t) - \frac{1}{2L}\|\nabla f(x_t)\|^2.$$
Theorem 2.7. Let $f : \mathbb{R}^d \to \mathbb{R}$ be convex and differentiable with a global minimum $x^\star$; furthermore, suppose that $f$ is smooth with parameter $L$ according to (2.8). Choosing stepsize $\gamma := 1/L$, gradient descent (2.1) yields
$$f(x_T) - f(x^\star) \le \frac{L}{2T}\|x_0 - x^\star\|^2, \quad T > 0.$$
Proof. We apply sufficient decrease (Lemma 2.6) to bound the sum of the $\|g_t\|^2 = \|\nabla f(x_t)\|^2$ after step (2.6) of the vanilla analysis as follows:
$$\frac{1}{2L}\sum_{t=0}^{T-1}\|\nabla f(x_t)\|^2 \le \sum_{t=0}^{T-1}\big(f(x_t) - f(x_{t+1})\big) = f(x_0) - f(x_T). \tag{2.9}$$
With $\gamma = 1/L$, (2.6) then yields
$$\sum_{t=0}^{T-1}\big(f(x_t) - f(x^\star)\big) \le \frac{1}{2L}\sum_{t=0}^{T-1}\|\nabla f(x_t)\|^2 + \frac{L}{2}\|x_0 - x^\star\|^2 \le f(x_0) - f(x_T) + \frac{L}{2}\|x_0 - x^\star\|^2,$$
equivalently
$$\sum_{t=1}^{T}\big(f(x_t) - f(x^\star)\big) \le \frac{L}{2}\|x_0 - x^\star\|^2. \tag{2.10}$$
Because $f(x_{t+1}) \le f(x_t)$ for each $0 \le t \le T$ by Lemma 2.6, by taking the average we get that
$$f(x_T) - f(x^\star) \le \frac{1}{T}\sum_{t=1}^{T}\big(f(x_t) - f(x^\star)\big) \le \frac{L}{2T}\|x_0 - x^\star\|^2.$$
2.6 Interlude
Let us get back to the supermodel $f(x) = x^2$ (that is smooth with parameter $L = 2$, as we observed before). According to Theorem 2.7, gradient descent (2.1) with stepsize $\gamma = 1/2$ satisfies
$$f(x_T) \le \frac{1}{T}x_0^2. \tag{2.11}$$
Here we used that the minimizer is $x^\star = 0$. Let us check how good this bound really is. For our concrete function and concrete stepsize, (2.1) reads as
$$x_{t+1} = x_t - \frac{1}{2}\nabla f(x_t) = x_t - x_t = 0,$$
so we are always done after one step! But we will see in the next section that this is only because the function is particularly beautiful, and on top of that, we have picked the best possible smoothness parameter. To simulate a more realistic situation here, let us assume that we have not looked at the supermodel too closely and found it to be smooth with parameter $L = 4$ only (which is a suboptimal but still valid parameter). In this case, $\gamma = 1/4$ and (2.1) becomes
$$x_{t+1} = x_t - \frac{1}{4}\nabla f(x_t) = x_t - \frac{x_t}{2} = \frac{x_t}{2}.$$
So, we in fact have
$$f(x_T) = f\Big(\frac{x_0}{2^T}\Big) = \frac{1}{2^{2T}}x_0^2. \tag{2.12}$$
This is still vastly better than the bound of (2.11)! While (2.11) requires $T \approx x_0^2/\varepsilon$ to achieve $f(x_T) \le \varepsilon$, (2.12) requires only
$$T \approx \frac{1}{2}\log_2\Big(\frac{x_0^2}{\varepsilon}\Big),$$
[Figure 2.3: a smooth and strongly convex function is wedged between the tangential paraboloids $f(x) + \nabla f(x)^\top(y - x) + \frac{\mu}{2}\|x - y\|^2$ and $f(x) + \nabla f(x)^\top(y - x) + \frac{L}{2}\|x - y\|^2$.]
While smoothness according to (2.8) says that for any $x \in X$, the graph of $f$ is below a not-too-steep tangential paraboloid at $(x, f(x))$, strong convexity means that the graph of $f$ is above a not-too-flat tangential paraboloid at $(x, f(x))$. The graph of a smooth and strongly convex function is therefore at every point wedged between two paraboloids; see Figure 2.3.
We can also interpret (2.13) as a strengthening of convexity. In the form of (2.7), convexity reads as
$$f(y) \ge f(x) + \nabla f(x)^\top(y - x),$$
and therefore says that every convex function satisfies (2.13) with $\mu = 0$.
Lemma 2.9 (Exercise 17). If $f : \mathbb{R}^d \to \mathbb{R}$ is strongly convex with parameter $\mu > 0$, then $f$ is strictly convex and has a unique global minimum.
The supermodel $f(x) = x^2$ is particularly beautiful since it is both smooth and strongly convex with the same parameter $L = \mu = 2$ (going through the calculations in Exercise 11 will reveal this). We can easily
characterize the class of particularly beautiful functions. These are exactly the ones whose sublevel sets are $\ell_2$-balls.
Lemma 2.10 (Exercise 18). Let $f : \mathbb{R}^d \to \mathbb{R}$ be strongly convex with parameter $\mu > 0$ and smooth with parameter $\mu$. Prove that $f$ is of the form
$$f(x) = \frac{\mu}{2}\|x - b\|^2 + c,$$
where $b \in \mathbb{R}^d$, $c \in \mathbb{R}$.
Once we have a unique global minimum $x^\star$, we can attempt to prove that $\lim_{t\to\infty} x_t = x^\star$ in gradient descent. We start from the vanilla analysis (2.3) and plug in the lower bound $g_t^\top(x_t - x^\star) = \nabla f(x_t)^\top(x_t - x^\star) \ge f(x_t) - f(x^\star) + \frac{\mu}{2}\|x_t - x^\star\|^2$ resulting from strong convexity. We get
$$f(x_t) - f(x^\star) \le \frac{\gamma}{2}\|\nabla f(x_t)\|^2 + \frac{1}{2\gamma}\Big(\|x_t - x^\star\|^2 - \|x_{t+1} - x^\star\|^2\Big) - \frac{\mu}{2}\|x_t - x^\star\|^2. \tag{2.14}$$
Rewriting this yields a bound on $\|x_{t+1} - x^\star\|^2$ in terms of $\|x_t - x^\star\|^2$, along with some "noise" that we still need to take care of:
$$\|x_{t+1} - x^\star\|^2 \le 2\gamma\big(f(x^\star) - f(x_t)\big) + \gamma^2\|\nabla f(x_t)\|^2 + (1 - \gamma\mu)\|x_t - x^\star\|^2.$$
2.8 Exercises
Exercise 10. Let $c \in \mathbb{R}^d$. Prove that the spectral norm of $c^\top$ equals the Euclidean norm of $c$, meaning that
$$\max_{x \neq 0} \frac{|c^\top x|}{\|x\|} = \|c\|.$$
Exercise 11. Prove Lemma 2.3: The quadratic function $f(x) = x^\top Q x + b^\top x + c$ is smooth with parameter $2\|Q\|$.
Exercise 12. Consider the function $f(x) = |x|^{3/2}$ for $x \in \mathbb{R}$.
(i) Prove that $f$ is strictly convex and differentiable, with a unique global minimum $x^\star = 0$.
(ii) Prove that for every fixed stepsize $\gamma$ in gradient descent (2.1) applied to $f$, there exists $x_0$ for which $f(x_1) > f(x_0)$.
(iii) Prove that $f$ is not smooth.
(iv) Let $X \subseteq \mathbb{R}$ be a closed convex set such that $0 \in X$ and $X \neq \{0\}$. Prove that $f$ is not smooth over $X$.
Exercise 13. In order to obtain average error at most $\varepsilon$ in Theorem 2.1, we need to choose iteration number and step size as
$$T \ge \Big(\frac{RB}{\varepsilon}\Big)^2, \qquad \gamma := \frac{R}{B\sqrt{T}}.$$
If $R$ or $B$ are unknown, we cannot do this.
But suppose that we know $R$. Develop an algorithm that—not knowing $B$—finds a vector $x$ such that $f(x) - f(x^\star) < \varepsilon$, using at most
$$O\Big(\Big(\frac{RB}{\varepsilon}\Big)^2\Big)$$
many gradient descent steps!
Exercise 14. Prove Lemma 2.5! (Operations which preserve smoothness)
Exercise 15. In order to obtain average error at most $\varepsilon$ in Theorem 2.7, we need to choose
$$\gamma := \frac{1}{L}, \qquad T \ge \frac{R^2 L}{2\varepsilon},$$
if $\|x_0 - x^\star\| \le R$. If $L$ is unknown, we cannot do this.
But suppose that we know $R$. Develop an algorithm that—not knowing $L$—finds a vector $x$ such that $f(x) - f(x^\star) < \varepsilon$, using at most
$$O\Big(\frac{R^2 L}{2\varepsilon}\Big)$$
many gradient descent steps!
Exercise 16. Let $a \in \mathbb{R}$. Prove that $f(x) = x^4$ is smooth over $X = (-a, a)$ and determine a concrete smoothness parameter $L$.
Exercise 17. Prove Lemma 2.9! (Strongly convex functions have unique global
minimum)
Exercise 18. Prove Lemma 2.10! (Strongly convex and smooth functions)
Chapter 3
Contents
3.1 The Algorithm
3.2 Bounded gradients: $O(1/\varepsilon^2)$ steps
3.3 Smooth convex functions: $O(1/\varepsilon)$ steps
3.4 Smooth and strongly convex functions: $O(\log(1/\varepsilon))$ steps
3.5 Projecting onto $\ell_1$-balls
3.6 Proximal gradient descent
    3.6.1 The proximal gradient algorithm
    3.6.2 Convergence in $O(1/\varepsilon)$ steps
3.7 Exercises
3.1 The Algorithm
Another way to control gradients in (2.4) is to minimize $f$ over a closed convex subset $X \subseteq \mathbb{R}^d$. For example, we may have a constrained optimization problem to begin with (for example the LASSO in Section 1.6.2), or we happen to know some region $X$ containing a global minimum $x^\star$, so that we can restrict our search to that region. In this case, gradient descent also works, but we need an additional projection step. After all, it can happen that some iteration of (2.1) takes us "into the wild" (out of $X$) where we have no business to do. Projected gradient descent is the following modification. We choose $x_0 \in X$ arbitrary and for $t \ge 0$ define
$$y_{t+1} := x_t - \gamma\nabla f(x_t), \qquad x_{t+1} := \Pi_X(y_{t+1}), \tag{3.1}$$
where $\Pi_X$ denotes the projection onto $X$.
This means, after each iteration, we project the obtained iterate $y_{t+1}$ back to $X$. This may be very easy (think of $X$ as the unit ball, in which case we just have to scale $y_{t+1}$ down to length 1 if it is longer). But it may also be very difficult. In general, computing $\Pi_X(y_{t+1})$ means to solve an auxiliary convex constrained minimization problem in each step! Here, we are just assuming that we can do this. The projection is well-defined since $d_y(x) := \|x - y\|^2$ has bounded sublevel sets. Moreover, $d_y(x)$ is strictly convex, so the minimum over $X$ (that exists by continuity of $d_y$ and compactness of $X$ intersected with any nonempty sublevel set) is unique by Lemma 1.20. We note that finding an initial $x_0 \in X$ also reduces to projection (of $0$, for example) onto $X$.
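As a sketch in code (assuming the projection onto X is available; for the unit ball it is just the rescaling mentioned above):

```python
import numpy as np

def project_unit_ball(y):
    """Projection onto the Euclidean unit ball: scale y down to length 1 if it is longer."""
    norm = np.linalg.norm(y)
    return y / norm if norm > 1 else y

def projected_gradient_descent(grad_f, project, x0, gamma, num_steps):
    """Projected gradient descent (3.1): gradient step, then projection back onto X."""
    x = project(np.asarray(x0, dtype=float))  # finding a feasible x0 is itself a projection
    for _ in range(num_steps):
        y = x - gamma * grad_f(x)             # unconstrained gradient step
        x = project(y)                        # project y_{t+1} back onto X
    return x
```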
(ii) $\|x - \Pi_X(y)\|^2 + \|y - \Pi_X(y)\|^2 \le \|x - y\|^2$.
Part (i) says that the vectors $x - \Pi_X(y)$ and $y - \Pi_X(y)$ form an obtuse angle, and (ii) equivalently says that the square of the long side $x - y$ in the triangle formed by the three points is at least the sum of squares of the two short sides; see Figure 3.1.

[Figure 3.1: the angle between $x - \Pi_X(y)$ and $y - \Pi_X(y)$ at $\Pi_X(y)$ is at least $90°$.]
Theorem 3.2. Let $f : \mathbb{R}^d \to \mathbb{R}$ be convex and differentiable, $X \subseteq \mathbb{R}^d$ closed and convex, $x^\star$ a minimizer of $f$ over $X$; furthermore, suppose that $\|x_0 - x^\star\| \le R$, and that $\|\nabla f(x)\| \le B$ for all $x \in X$. Choosing the constant stepsize
$$\gamma := \frac{R}{B\sqrt{T}},$$
projected gradient descent (3.1) with $x_0 \in X$ yields
$$\frac{1}{T}\sum_{t=0}^{T-1}\big(f(x_t) - f(x^\star)\big) \le \frac{RB}{\sqrt{T}}.$$
Proof. The only required changes to the vanilla analysis are that in steps (2.2) and (2.3), $x_{t+1}$ needs to be replaced by $y_{t+1}$ as this is the real next (non-projected) gradient descent iterate after these steps; we therefore get
$$g_t^\top(x_t - x^\star) = \frac{\gamma}{2}\|g_t\|^2 + \frac{1}{2\gamma}\Big(\|x_t - x^\star\|^2 - \|y_{t+1} - x^\star\|^2\Big). \tag{3.3}$$
By Lemma 3.1(ii) (with $x = x^\star$, $y = y_{t+1}$), we have $\|x_{t+1} - x^\star\|^2 \le \|y_{t+1} - x^\star\|^2$, hence
$$g_t^\top(x_t - x^\star) \le \frac{\gamma}{2}\|g_t\|^2 + \frac{1}{2\gamma}\Big(\|x_t - x^\star\|^2 - \|x_{t+1} - x^\star\|^2\Big), \tag{3.4}$$
and return to the previous vanilla analysis for the remainder of the proof.
Lemma 3.3. Let $f : \mathbb{R}^d \to \mathbb{R}$ be differentiable and smooth with parameter $L$ over $X$ according to (3.5). Choosing stepsize
$$\gamma := \frac{1}{L},$$
projected gradient descent (3.1) with arbitrary $x_0 \in X$ satisfies
$$f(x_{t+1}) \le f(x_t) - \frac{1}{2L}\|\nabla f(x_t)\|^2 + \frac{L}{2}\|y_{t+1} - x_{t+1}\|^2, \quad t \ge 0.$$
More specifically, this already holds if $f$ is smooth with parameter $L$ over the line segment connecting $x_t$ and $x_{t+1}$.
Proof. We proceed similar to the proof of the "unconstrained" sufficient decrease Lemma 2.6, except that we now need to deal with projected gradient descent. We again start from smoothness but then use $y_{t+1} = x_t - \nabla f(x_t)/L$, followed by the usual equation $2 v^\top w = \|v\|^2 + \|w\|^2 - \|v - w\|^2$:
$$f(x_{t+1}) \le f(x_t) + \nabla f(x_t)^\top(x_{t+1} - x_t) + \frac{L}{2}\|x_t - x_{t+1}\|^2 = f(x_t) - L(y_{t+1} - x_t)^\top(x_{t+1} - x_t) + \frac{L}{2}\|x_t - x_{t+1}\|^2 = f(x_t) - \frac{L}{2}\Big(\|y_{t+1} - x_t\|^2 + \|x_{t+1} - x_t\|^2 - \|y_{t+1} - x_{t+1}\|^2\Big) + \frac{L}{2}\|x_t - x_{t+1}\|^2 = f(x_t) - \frac{L}{2}\|y_{t+1} - x_t\|^2 + \frac{L}{2}\|y_{t+1} - x_{t+1}\|^2 = f(x_t) - \frac{1}{2L}\|\nabla f(x_t)\|^2 + \frac{L}{2}\|y_{t+1} - x_{t+1}\|^2.$$
We use the bound
$$\frac{1}{2L}\|\nabla f(x_t)\|^2 \le f(x_t) - f(x_{t+1}) + \frac{L}{2}\|y_{t+1} - x_{t+1}\|^2 \tag{3.6}$$
resulting from sufficient decrease (Lemma 3.3) to bound the squared gradient $\|g_t\|^2 = \|\nabla f(x_t)\|^2$ in the vanilla analysis. Unfortunately, (3.6) has an extra term compared to what we got in the unconstrained case. But we can compensate for this in the vanilla analysis itself. Let us go back to its "constrained" version (3.3), featuring $y_{t+1}$ instead of $x_{t+1}$:
$$g_t^\top(x_t - x^\star) = \frac{\gamma}{2}\|g_t\|^2 + \frac{1}{2\gamma}\Big(\|x_t - x^\star\|^2 - \|y_{t+1} - x^\star\|^2\Big).$$
Using $f(x_t) - f(x^\star) \le g_t^\top(x_t - x^\star)$ from convexity, we have (with $\gamma = 1/L$) that
$$\sum_{t=0}^{T-1}\big(f(x_t) - f(x^\star)\big) \le \sum_{t=0}^{T-1} g_t^\top(x_t - x^\star) \le \frac{1}{2L}\sum_{t=0}^{T-1}\|g_t\|^2 + \frac{L}{2}\|x_0 - x^\star\|^2 - \frac{L}{2}\sum_{t=0}^{T-1}\|y_{t+1} - x_{t+1}\|^2. \tag{3.8}$$
Plugging this into (3.8), the extra terms cancel, and we arrive—as in the unconstrained case—at
$$\sum_{t=1}^{T}\big(f(x_t) - f(x^\star)\big) \le \frac{L}{2}\|x_0 - x^\star\|^2.$$
The statement follows as in the proof of Theorem 2.7 from the fact that due to sufficient decrease (Exercise 19), the last iterate is the best one.
3.5 Projecting onto $\ell_1$-balls
Problems that are $\ell_1$-regularized appear among the most commonly used models in machine learning and signal processing, and we have already discussed the Lasso as an important example of that class. We will now address how to perform projected gradient descent as an efficient optimization method for $\ell_1$-constrained problems. Let
$$X = B_1(R) := \Big\{x \in \mathbb{R}^d : \|x\|_1 = \sum_{i=1}^d |x_i| \le R\Big\}$$
be the $\ell_1$-ball of radius $R > 0$ around $0$, i.e., the set of all points with 1-norm at most $R$. Our goal is to compute $\Pi_X(v)$ for a given vector $v$, i.e. the projection of $v$ onto $X$; see Figure 3.2.
[Figure 3.2: projection $\Pi_X(v)$ of a point $v$ onto the $\ell_1$-ball $X = B_1(R)$.]
At first sight, this may look like a rather complicated task. Geometrically, $X$ is a cross polytope (square for $d = 2$, octahedron for $d = 3$), and as such it has $2^d$ many facets. But we can start with some basic simplifying observations.
Fact 3.6. We may assume without loss of generality that (i) $R = 1$, (ii) $v_i \ge 0$ for all $i$, and (iii) $\sum_{i=1}^d v_i > 1$.
Proof. If we project $v/R$ onto $B_1(1)$, we obtain $\Pi_X(v)/R$ (just scale Figure 3.2), so we can restrict to the case $R = 1$. For (ii), we observe that simultaneously flipping the signs of a fixed subset of coordinates in both $v$ and $x \in X$ yields vectors $v'$ and $x' \in X$ such that $\|x - v\| = \|x' - v'\|$; thus, $x$ minimizes the distance to $v$ if and only if $x'$ minimizes the distance to $v'$. Hence, it suffices to compute $\Pi_X(v)$ for vectors with nonnegative entries. If $\sum_{i=1}^d v_i \le 1$, we have $\Pi_X(v) = v$ and are done, so the interesting case is (iii).
Fact 3.7. Under the assumptions of Fact 3.6, $x = \Pi_X(v)$ satisfies $x_i \ge 0$ for all $i$ and $\sum_{i=1}^d x_i = 1$.
Proof. If $x_i < 0$ for some $i$, then $(-x_i - v_i)^2 \le (x_i - v_i)^2$ (since $v_i \ge 0$), so flipping the $i$-th sign in $x$ would yield another vector in $X$ at least as close to $v$ as $x$, but such a vector cannot exist by strict convexity of the squared distance. And if $\sum_{i=1}^d x_i < 1$, then $x' = x + \lambda(v - x) \in X$ for some small positive $\lambda$, with $\|x' - v\| = (1 - \lambda)\|x - v\|$, again contradicting the optimality of $x$.
Corollary 3.8. Under the assumptions of Fact 3.6,
$$\Pi_X(v) = \operatorname*{argmin}_{x \in \Delta_d} \|x - v\|^2,$$
where
$$\Delta_d := \Big\{x \in \mathbb{R}^d : \sum_{i=1}^d x_i = 1,\ x_i \ge 0\ \forall i\Big\}$$
is the standard simplex.
This means, we have reduced the projection onto an $\ell_1$-ball to the projection onto the standard simplex; see Figure 3.3.
To address the latter task, we make another assumption that can be established by suitably permuting the entries of $v$ (which just permutes the entries of its projection onto $\Delta_d$ in the same way).
Fact 3.9. We may assume without loss of generality that $v_1 \ge v_2 \ge \cdots \ge v_d$.
Lemma 3.10. Let $x^\star := \operatorname{argmin}_{x \in \Delta_d}\|x - v\|^2$. Under the assumption of Fact 3.9, there exists (a unique) $p \in \{1, \dots, d\}$ such that
$$x^\star_i > 0, \quad i \le p, \qquad x^\star_i = 0, \quad i > p.$$
[Figure 3.3: projection of $v$ onto the standard simplex $\Delta_d$.]
Lemma 3.11. Under the assumption of Fact 3.9, and with $p$ as in Lemma 3.10,
$$x^\star_i = v_i - \Theta_p, \quad i \le p,$$
where
$$\Theta_p = \frac{1}{p}\Big(\sum_{i=1}^{p} v_i - 1\Big).$$
Proof. Suppose $x^\star_i - v_i < x^\star_j - v_j$ for some $i, j \le p$. As before, we could then decrease $x^\star_j > 0$ by some small positive $\varepsilon$ and simultaneously increase $x^\star_i$ by $\varepsilon$ to obtain $x \in \Delta_d$ such that
and we just need to find the right one. In order for candidate $x^\star(p)$ to comply with Lemma 3.10, we must have
$$v_p - \Theta_p > 0, \tag{3.13}$$
and this actually ensures $x^\star(p)_i > 0$ for all $i \le p$ by the assumption of Fact 3.9 and therefore $x^\star(p) \in \Delta_d$. But there could still be several values of $p$ satisfying (3.13). Among them, we simply pick the one for which $x^\star(p)$ minimizes the distance to $v$. It is not hard to see that this can be done in time $O(d \log d)$, by first sorting $v$ and then carefully updating the values $\Theta_p$ and $\|x^\star(p) - v\|^2$ as we vary $p$ to check all candidates.
But actually, there is an even simpler criterion that saves us from comparing distances.
Lemma 3.12. Under the assumption of Fact 3.9, with $x^\star(p)$ as in (3.12), and with
$$p^\star := \max\Big\{p \in \{1, \dots, d\} : v_p - \frac{1}{p}\Big(\sum_{i=1}^{p} v_i - 1\Big) > 0\Big\},$$
it holds that
$$\operatorname*{argmin}_{x \in \Delta_d}\|x - v\|^2 = x^\star(p^\star).$$
The proof is Exercise 21. Together with our previous reductions, we
obtain the following result.
Theorem 3.13. Let $v \in \mathbb{R}^d$, $R \in \mathbb{R}_+$, and $X = B_1(R)$ the $\ell_1$-ball around $0$ of radius $R$. The projection $\Pi_X(v)$ of $v$ onto $X$ can be computed in time $O(d \log d)$.
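A possible NumPy implementation of this projection, following the reductions of Facts 3.6–3.9 and the criterion of Lemma 3.12 (sorting dominates the $O(d \log d)$ running time), is sketched below:

```python
import numpy as np

def project_simplex(v):
    """Project v (v_i >= 0, sum(v) > 1) onto the standard simplex Delta_d (Lemmas 3.10-3.12)."""
    u = np.sort(v)[::-1]                                     # v_1 >= v_2 >= ... >= v_d (Fact 3.9)
    theta = (np.cumsum(u) - 1) / np.arange(1, len(v) + 1)    # Theta_p for p = 1, ..., d
    p_star = np.max(np.nonzero(u - theta > 0)[0]) + 1        # largest p with v_p - Theta_p > 0
    return np.maximum(v - theta[p_star - 1], 0)

def project_l1_ball(v, R=1.0):
    """Projection onto the l1-ball B_1(R), via the reductions of Facts 3.6 and 3.7."""
    if np.abs(v).sum() <= R:
        return v.copy()                                      # v is already in the ball
    x = project_simplex(np.abs(v) / R)                       # reduce to the simplex case
    return np.sign(v) * x * R                                # undo scaling and sign flips
```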
To obtain the last equality, we have just completed the quadratic $\|v\|^2 + 2 v^\top w + \|w\|^2 = \|v + w\|^2$ for $v := \gamma\nabla g(x_t)$ and $w := y - x_t$. Here it is crucial that $v$ is independent of the optimization variable $y$, so the term $\|v\|^2$ can be ignored when taking the argmin. The scaling by $\frac{1}{2\gamma}$ is also irrelevant but we keep it for better illustrating the next step.
The interpretation of the above equivalent reformulation of the classic gradient step is important for us, and is what has enabled the previous convergence analysis in Section 2.5 for smooth unconstrained optimization: For the particular choice of stepsize $\gamma := \frac{1}{L}$ which we have used, the above formulation shows that the gradient descent step exactly minimizes the local quadratic model of $g$ at our current iterate $x_t$, formed by the smoothness property with parameter $L$ as defined in (2.8).
A generalization of gradient descent. The proximal gradient descent method (3.19) is also known as generalized gradient descent. In the special case $h \equiv 0$, we of course recover classic gradient descent.
More interestingly, it is also a generalization of projected gradient descent as we have discussed in the previous sections. Given a closed convex set $X$, the indicator function of the set $X$ is given as the convex function
$$\iota_X : \mathbb{R}^d \to \mathbb{R} \cup \{+\infty\}, \qquad x \mapsto \iota_X(x) := \begin{cases} 0 & \text{if } x \in X, \\ +\infty & \text{otherwise.} \end{cases} \tag{3.21}$$
When using the indicator function of our constraint set $X$ as $h \equiv \iota_X$, it is easy to see that the proximal mapping simply becomes
$$\mathrm{prox}_{h,\gamma}(z) := \operatorname*{argmin}_{y}\Big\{\frac{1}{2\gamma}\|y - z\|^2 + \iota_X(y)\Big\} = \operatorname*{argmin}_{y \in X}\|y - z\|^2 = \Pi_X(z),$$
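As a sketch in code (assuming the update (3.19) has the form $x_{t+1} := \mathrm{prox}_{h,\gamma}(x_t - \gamma\nabla g(x_t))$, with the proximal mapping supplied by the caller):

```python
import numpy as np

def proximal_gradient_descent(grad_g, prox_h, x0, gamma, num_steps):
    """Proximal gradient descent: gradient step on g, then proximal map of h."""
    x = np.asarray(x0, dtype=float)
    for _ in range(num_steps):
        x = prox_h(x - gamma * grad_g(x), gamma)   # prox_h(z, gamma) computes prox_{h,gamma}(z)
    return x

# With prox_h(z, gamma) = Pi_X(z) (projection onto X) we recover projected gradient descent,
# and with prox_h(z, gamma) = z (i.e. h identically 0) we recover classic gradient descent (2.1).
```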
$$f(x_T) - f(x^\star) \le \frac{L}{2T}\|x_0 - x^\star\|^2, \quad T > 0.$$
Proof. The proof follows the vanilla analysis for the smooth case, applying it only to $g$, while always keeping $h$ separate, as in (3.17). We leave the details as Exercise 22 for the reader.
3.7 Exercises
Exercise 19. Prove that in Theorem 3.4 (i),
$$f(x_{t+1}) \le f(x_t).$$
Exercise 20. Prove that under the assumptions of Theorem 3.5, $f$ has a unique minimizer $x^\star$ over any nonempty closed and convex set $X \subseteq \mathbb{R}^d$! In particular, for $X = \mathbb{R}^d$, we obtain the existence of a unique global minimum.
Chapter 4
Subgradient Descent
Contents
4.1 Subgradients
4.2 Differentiability of convex functions
4.3 The algorithm
4.4 Lipschitz convex functions: $O(1/\varepsilon^2)$ steps
4.5 Tame strong convexity: $O(1/\varepsilon)$ steps
4.6 Optimality of first-order methods
4.7 Exercises
4.1 Subgradients
Definition 4.1. Let $f : \mathrm{dom}(f) \to \mathbb{R}$. Then $g \in \mathbb{R}^d$ is a subgradient of $f$ at $x \in \mathrm{dom}(f)$ if
$$f(y) \ge f(x) + g^\top(y - x) \quad \forall y \in \mathrm{dom}(f). \tag{4.1}$$

[Figure 4.1: the function $f(x) = |x|$ together with affine lower bounds induced by subgradients at $x = 0$.]
get a “first order characterization” of convexity that also covers the non-
differentiable case.
4.2 Differentiability of convex functions
Before we move on to subgradient descent, we want to get a feeling for
how “wild” non-differentiable convex functions can be. The answer is:
they are surprisingly tame. While there are continuous functions that are
nowhere differentiable (the classical example is the Weierstrass function), convex functions cannot be as pathological. In fact, a convex function $f$
is differentiable almost everywhere. Formally, this means that wherever you
are in dom(f ), you find points arbitrarily close to you at which f is differ-
entiable. In still other words, the set of points where f is not differentiable
has measure 0 [Roc97, Theorem 25.5].
This does not mean that we can ignore non-differentiability in opti-
mization. For example, as Figure 4.1 demonstrates, the global minimum x?
can easily be a “kink”, a point where f is not differentiable. Also, while
running an iterative optimization scheme, we may always stumble upon
an intermediate kink.
An important fact is the following characterization of subdifferentials;
4.3 The algorithm
An iteration of subgradient descent is defined as follows:
$$\text{choose } g_t \in \partial f(x_t) \text{ and set } x_{t+1} := x_t - \gamma_t\, g_t. \tag{4.2}$$
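In code, the only difference to gradient descent is that we query a subgradient oracle and allow the stepsize to depend on $t$; a minimal sketch:

```python
import numpy as np

def subgradient_descent(subgrad_f, x0, stepsizes):
    """Subgradient descent (4.2): x_{t+1} = x_t - gamma_t * g_t with g_t in the subdifferential."""
    x = np.asarray(x0, dtype=float)
    for gamma_t in stepsizes:
        x = x - gamma_t * subgrad_f(x)
    return x

# Example: f(x) = |x|; sign(x) is a valid subgradient everywhere (0 at the kink x = 0).
x_final = subgradient_descent(np.sign, x0=[3.0], stepsizes=[1.0 / (t + 1) for t in range(100)])
```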
Proof. The proof is identical to the one of Theorem 2.1 presented in Section 2.4. The only change is that $g_t$ is a subgradient now and not a gradient, so that the inequality (2.5) now follows from the subgradient property (4.1) instead of the first-order characterization of convexity. The required bound $\|g_t\|^2 \le B^2$ follows from Lemma 4.4 ("convex and Lipschitz = bounded subgradients").
4.5 Tame strong convexity: $O(1/\varepsilon)$ steps
(Projected) gradient descent converges in $O(\log(1/\varepsilon))$ steps for functions that are both smooth and strongly convex. But if a function is non-differentiable, then it cannot be smooth under the natural definition of smoothness (Exercise 26). It can still be strongly convex, however, so it is natural to ask whether strong convexity alone allows us to obtain a convergence result.
The answer is no in general, but before we discuss this, let us define strong
convexity for not necessarily differentiable functions. This is straightfor-
ward; for differentiable functions, we recover Definition 2.8. Here, we
restrict to the unconstrained case for simplicity.
Definition 4.8. Let $f : \mathrm{dom}(f) \to \mathbb{R}$ be convex, $\mu \in \mathbb{R}_+$, $\mu > 0$. Function $f$ is called strongly convex (with parameter $\mu$) if
$$f(y) \ge f(x) + g^\top(y - x) + \frac{\mu}{2}\|x - y\|^2, \quad \forall x, y \in \mathrm{dom}(f),\ \forall g \in \partial f(x). \tag{4.3}$$
Actually, requiring (4.3) only for some g 2 @f (x) would be another
straightforward generalization of Definition 2.8, so which one is the “right”
one? The answer is that it does not matter if dom(f ) is open. We could
even afford to not require anything for points x where f is not differen-
tiable. This is a consequence of Theorem 4.6 (Exercise 27).
Strong convexity has the following useful characterization.
Lemma 4.9 (Exercise 28). Let $f : \mathrm{dom}(f) \to \mathbb{R}$ be convex, $\mathrm{dom}(f)$ open, $\mu \in \mathbb{R}_+$, $\mu > 0$. $f$ is strongly convex with parameter $\mu$ if and only if $f_\mu : \mathrm{dom}(f) \to \mathbb{R}$ defined by
$$f_\mu(x) = f(x) - \frac{\mu}{2}\|x\|^2, \quad x \in \mathrm{dom}(f)$$
is convex.
Let’s look at the problem with (sub)gradient descent on strongly con-
vex functions.
Lemma 4.10 (Exercise 29). The function $f(x) = e^{|x|}$ is strongly convex with parameter $\mu = 1$.
This function is of course far from being smooth; it grows exponentially, so there can't be any quadratic upper bounds. In fact, as strong convexity only requires quadratic lower bounds, strongly convex functions can be extremely fast-growing. In such a situation, (sub)gradient descent will overshoot already for tiny step sizes and diverge.
In case of $f(x) = e^{|x|}$, the function is differentiable at $x \neq 0$ with $f'(x) = \mathrm{sgn}(x)e^{|x|}$, so the (sub)gradient step is
$$x_{t+1} = x_t - \gamma_t\, \mathrm{sgn}(x_t)e^{|x_t|}.$$
For $|x|$ only mildly larger than 0, the step will overshoot the optimum $x^\star = 0$ and take us (much) further away. To compensate for this, we would need extremely small stepsizes. These in turn would lead to extremely poor convergence for functions such as $f(x) = x^2/2$ (which is also strongly convex with $\mu = 1$). Hence, there are no stepsizes that fit all strongly convex functions with a fixed strong convexity parameter $\mu$.
To succeed with (sub)gradient descent in this situation, we therefore
need to make some additional assumptions. Smoothness (quadratic upper
bounds) is such an assumption, but in the non-differentiable case, this is
precisely not an option. What people have done instead is to assume that
the subgradients gt that we encounter during the algorithm are bounded
in norm.
To ensure bounded subgradients, we could simply assume that f is
Lipschitz, but then we will only make a statement about an empty function
class. The reason is that a function cannot be globally strongly convex and
Lipschitz at the same time (Exercise 30). It can be strongly convex and
have bounded gradients over a closed and bounded set X, so analyzing
projected subgradient descent is an alternative.
But even when we optimize over $\mathbb{R}^d$, we may be lucky and only hit iterates with small subgradients. This will typically happen if we start sufficiently close to optimality. In this case, there are step sizes $\gamma_t$ (not depending on the observed gradients) that give us useful error bounds.
Below, we prove such a bound for subgradient descent, and this re-
sult then clearly extends to gradient descent on differentiable and strongly
convex (but not necessarily smooth) functions. The bound on the number
of steps will be O(1/") which is of course much worse than O(log(1/")),
but still better than O(1/"2 ) that we get in the Lipschitz case. So assum-
ing strong convexity results in a convergence behavior as in the smooth
case—if the gradients stay bounded, and this is what we mean by “tame”.
In order to analyze subgradient descent on strongly convex functions,
we will for the first time depart from algorithm variants with a constant stepsize $\gamma$, but instead use a time-varying stepsize $\gamma_t$ decreasing over time:
$$\gamma_t := \frac{2}{\mu(t + 1)}, \quad t > 0.$$
The vanilla analysis (2.3) with stepsize $\gamma_t$ reads
$$g_t^\top(x_t - x^\star) = \frac{\gamma_t}{2}\|g_t\|^2 + \frac{1}{2\gamma_t}\Big(\|x_t - x^\star\|^2 - \|x_{t+1} - x^\star\|^2\Big).$$
Now we plug in the lower bound $g_t^\top(x_t - x^\star) \ge f(x_t) - f(x^\star) + \frac{\mu}{2}\|x_t - x^\star\|^2$ resulting from strong convexity to obtain (with $\|g_t\|^2 \le B^2$) that
$$f(x_t) - f(x^\star) \le \frac{B^2\gamma_t}{2} + \frac{\gamma_t^{-1} - \mu}{2}\|x_t - x^\star\|^2 - \frac{\gamma_t^{-1}}{2}\|x_{t+1} - x^\star\|^2. \tag{4.4}$$
Unlike in the vanilla analysis (where we had $\gamma_t = \gamma$, $\mu = 0$), the right-hand side does not telescope anymore when we sum over all $t \le T$; to fix this, we precisely need the time-varying stepsize.
Let's make a small computation: to get telescoping behavior, we would need that $\gamma_t^{-1} = \gamma_{t+1}^{-1} - \mu$. For example, $\gamma_t^{-1} = \mu(1 + t)$ satisfies this, but our choice $\gamma_t^{-1} = \mu(1 + t)/2$ does not. Exercise 31 asks you to compute what happens when we actually choose $\gamma_t^{-1} = \mu(1 + t)$; this will let you appreciate the seemingly "wrong" choice of $\gamma_t = \frac{2}{\mu(t+1)}$ here. Plugging in
this stepsize and multiplying with $t$ on both sides, we get
$$t\big(f(x_t) - f(x^\star)\big) \le \frac{B^2 t}{\mu(t + 1)} + \frac{\mu}{4}\Big(t(t - 1)\|x_t - x^\star\|^2 - (t + 1)t\|x_{t+1} - x^\star\|^2\Big) \le \frac{B^2}{\mu} + \frac{\mu}{4}\Big(t(t - 1)\|x_t - x^\star\|^2 - (t + 1)t\|x_{t+1} - x^\star\|^2\Big).$$
Summing from $t = 1, \dots, T$, we obtain a telescoping sum:
$$\sum_{t=1}^{T} t\big(f(x_t) - f(x^\star)\big) \le \frac{T B^2}{\mu} + \frac{\mu}{4}\Big(0 - T(T + 1)\|x_{T+1} - x^\star\|^2\Big) \le \frac{T B^2}{\mu}.$$
Since
$$\frac{2}{T(T + 1)}\sum_{t=1}^{T} t = 1,$$
Jensen's inequality (Lemma 1.5) yields
$$f\Big(\frac{2}{T(T + 1)}\sum_{t=1}^{T} t\, x_t\Big) - f(x^\star) \le \frac{2}{T(T + 1)}\sum_{t=1}^{T} t\big(f(x_t) - f(x^\star)\big).$$
Theorem 4.12 (Nesterov). For any $T \le d - 1$ and starting point $x_0$, there is a function $f$ in the problem class of $B$-Lipschitz functions over $\mathbb{R}^d$, such that any (sub)gradient method has an objective error at least
$$f(x_T) - f(x^\star) \ge \frac{RB}{2(1 + \sqrt{T + 1})}.$$
The above theorem applies to all first-order methods which form iterates by linearly combining past iterates and (sub)gradients, and requires the dimension $d$ to be sufficiently large.
4.7 Exercises
Exercise 23. Prove Lemma 4.2, meaning that a function that is differentiable at $x$ has at most one subgradient there, namely $\nabla f(x)$.
Exercise 24. Prove the easy direction of Lemma 4.3, meaning that the existence of subgradients everywhere implies convexity!
Exercise 25. Prove Lemma 4.4 (Lipschitz continuity and bounded subgradients).
Exercise 26. Generalizing Definition 2.2, let us call a (not necessarily differentiable) function $f : \mathbb{R}^d \to \mathbb{R}$ smooth with parameter $L \in \mathbb{R}_+$ if for all $x \in \mathbb{R}^d$, there exists $g_x \in \mathbb{R}^d$ (not necessarily a subgradient; we do not assume that $f$ is convex) such that
$$f(y) \le f(x) + g_x^\top(y - x) + \frac{L}{2}\|x - y\|^2, \quad \forall y \in \mathbb{R}^d.$$
This means that for every point $x$, the graph of $f$ is below the graph of the quadratic function $f(x) + g_x^\top(y - x) + \frac{L}{2}\|x - y\|^2$.
Prove that if $f$ is smooth according to this definition, then $f$ is differentiable, with $g_x = \nabla f(x)$ for all $x$. In particular, for differentiable functions, the notion of smoothness introduced above coincides with the one of Definition 2.2; moreover, non-differentiable functions cannot be smooth.
for all $x$ such that $\nabla f(x)$ exists, and for all $y$. Prove that this implies
$$f(y) \ge f(x) + g_x^\top(y - x) + \frac{\mu}{2}\|x - y\|^2$$
for all $x$, all $g_x \in \partial f(x)$ and all $y$.
Exercise 28. Prove Lemma 4.9: $f$ is strongly convex with parameter $\mu$ over an open domain if and only if $f_\mu : x \mapsto f(x) - \frac{\mu}{2}\|x\|^2$ is convex over the same domain.
Exercise 29. Prove Lemma 4.10: $f(x) = e^{|x|}$ is strongly convex with parameter $\mu = 1$.
Exercise 30. Prove that a function cannot simultaneously be Lipschitz and strongly convex!
Exercise 31. Which result can you prove when you use the "telescoping stepsize"
$$\gamma_t = \frac{1}{\mu(t + 1)}?$$
Chapter 5
Contents
5.1 The algorithm
5.2 Unbiasedness
5.3 Bounded stochastic gradients: $O(1/\varepsilon^2)$ steps
5.4 Tame strong convexity: $O(1/\varepsilon)$ steps
5.5 Stochastic Subgradient Descent
5.6 Mini-batch variants
5.7 Exercises
5.1 The algorithm
Many objective functions occurring in machine learning are formulated as sum structured objective functions
$$f(x) = \frac{1}{n}\sum_{i=1}^n f_i(x). \tag{5.1}$$
Here fi is typically the cost function of the i-th datapoint, taken from a
training set of n elements in total.
We have already seen an example for this: the loss function (1.11) in
the handwritten digit recognition (Section 1.6.1) has one term for each of
the $n$ training images $x \in P$:
$$\ell(W) = -\sum_{x \in P} \ln z_{d(x)}(Wx).$$
The normalizing factor 1/n that we assume in the general setting (5.1)
will just simplify the following a bit.
An iteration of stochastic gradient descent (SGD) in its basic form is defined as
$$\text{sample } i \in [n] \text{ uniformly at random, and set } x_{t+1} := x_t - \gamma_t \nabla f_i(x_t). \tag{5.2}$$
This update looks almost identical to the classical gradient method, the
only difference being that we have computed the gradient not of the en-
tire f but only of one particular (randomly chosen) function fi . As we will
need varying stepsizes a bit later, we allow for the stepsize to depend on t
now.
In the above setting, the update vector gt := rfi (xt ) is called a stochastic
gradient. Formally, gt is a vector of d random variables, but we will also
simply call this a random variable.
The crucial advantage of SGD versus its classical gradient descent coun-
terpart is the efficiency per iteration: While computing the full gradient for
a sum structured problem (5.1) would require us to compute n individual
gradients of the fi functions, an iteration of SGD requires only a single
one of those, and therefore is n times cheaper. SGD has therefore become
the main workhorse for training machine learning models. Whether such
cheaper iterations also give similar progress is another question, which we
analyze next.
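To make this concrete, here is a minimal sketch of the update (5.2) in code; the least-squares summands, the constant stepsize, and the function names are illustrative assumptions rather than part of the text:

```python
import numpy as np

def sgd(grad_fi, n, x0, stepsize, num_iters, rng=np.random.default_rng(0)):
    """Basic SGD for f(x) = (1/n) * sum_i f_i(x), following update (5.2).

    grad_fi(i, x) returns the gradient of the i-th summand f_i at x.
    """
    x = x0.copy()
    for t in range(num_iters):
        i = rng.integers(n)                 # sample i in [n] uniformly at random
        x = x - stepsize * grad_fi(i, x)    # x_{t+1} := x_t - gamma_t * grad f_i(x_t)
    return x

# Illustrative example (assumption): least squares, f_i(x) = 0.5 * (a_i^T x - b_i)^2.
rng = np.random.default_rng(1)
A, b = rng.standard_normal((100, 5)), rng.standard_normal(100)
grad_fi = lambda i, x: (A[i] @ x - b[i]) * A[i]
x_hat = sgd(grad_fi, n=100, x0=np.zeros(5), stepsize=0.01, num_iters=5000)
```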
5.2 Unbiasedness
We would like to start with the vanilla analysis again, but now we can-
not bound the random variable gt> (xt x? ) from below using (2.5), as the
inequality
$$f(\mathbf{x}_t) - f(\mathbf{x}^\star) \le \mathbf{g}_t^\top(\mathbf{x}_t - \mathbf{x}^\star)$$
may hold or not hold, depending on how gt turns out. But it still holds in
expectation, as we show now.
The vector gt may be far from the true gradient, and of high variance,
but in expectation over the random choice of i, it does coincide with the
full gradient of f . We formalize this as
$$\mathbb{E}\big[\mathbf{g}_t \mid \mathbf{x}_t = \mathbf{x}\big] = \frac{1}{n}\sum_{i=1}^{n}\nabla f_i(\mathbf{x}) = \nabla f(\mathbf{x}), \quad \mathbf{x}\in\mathbb{R}^d. \qquad (5.3)$$
Here, $\mathbb{E}[\mathbf{g}_t \mid \mathbf{x}_t = \mathbf{x}]$ is the conditional expectation of $\mathbf{g}_t$, given the event $\{\mathbf{x}_t = \mathbf{x}\}$. If this event is nonempty, linearity of conditional expectations yields that
$$\mathbb{E}\big[\mathbf{g}_t^\top(\mathbf{x}-\mathbf{x}^\star) \mid \mathbf{x}_t=\mathbf{x}\big] = \mathbb{E}\big[\mathbf{g}_t \mid \mathbf{x}_t=\mathbf{x}\big]^\top(\mathbf{x}-\mathbf{x}^\star) = \nabla f(\mathbf{x})^\top(\mathbf{x}-\mathbf{x}^\star).$$
Using the fact that {xt = x} can occur only for x in some finite set X (one
element for every choice of indices throughout all iterations), the partition
theorem further gives us
$$\mathbb{E}\big[\mathbf{g}_t^\top(\mathbf{x}_t-\mathbf{x}^\star)\big] = \sum_{\mathbf{x}\in X}\mathbb{E}\big[\mathbf{g}_t^\top(\mathbf{x}-\mathbf{x}^\star)\mid\mathbf{x}_t=\mathbf{x}\big]\,\mathrm{prob}(\mathbf{x}_t=\mathbf{x}) = \sum_{\mathbf{x}\in X}\nabla f(\mathbf{x})^\top(\mathbf{x}-\mathbf{x}^\star)\,\mathrm{prob}(\mathbf{x}_t=\mathbf{x}) = \mathbb{E}\big[\nabla f(\mathbf{x}_t)^\top(\mathbf{x}_t-\mathbf{x}^\star)\big].$$
Hence, we have
$$\mathbb{E}\big[\mathbf{g}_t^\top(\mathbf{x}_t-\mathbf{x}^\star)\big] = \mathbb{E}\big[\nabla f(\mathbf{x}_t)^\top(\mathbf{x}_t-\mathbf{x}^\star)\big] \ge \mathbb{E}\big[f(\mathbf{x}_t) - f(\mathbf{x}^\star)\big]. \qquad (5.4)$$
The last inequality is by convexity, and this means that the lower bound (2.5) holds in expectation.
Exercise 32 lets you recall some basics around conditional expectations.
Under (5.3) we say that the stochastic gradient gt is an unbiased estimator
of the gradient, for any time-step t.
5.3 Bounded stochastic gradients: O(1/ε²) steps
To get a first result out of the vanilla analysis, we assumed in Section 2.4
that krf (x)k2 B 2 for all x 2 Rd , where B was a constant. Here, we
are assuming the same for the expected squared norms of our stochastic
gradients. And we are getting the same result, except that it now holds for
the expected function values.
Theorem 5.1. Let $f:\mathbb{R}^d\to\mathbb{R}$ be a convex and differentiable function, and let $\mathbf{x}^\star$ be a global minimum of $f$; furthermore, suppose that $\|\mathbf{x}_0-\mathbf{x}^\star\|\le R$, and that $\mathbb{E}\big[\|\mathbf{g}_t\|^2\big]\le B^2$ for all $t$. Choosing the constant stepsize
$$\gamma := \frac{R}{B\sqrt{T}},$$
stochastic gradient descent (5.2) yields
$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\big[f(\mathbf{x}_t)-f(\mathbf{x}^\star)\big] \le \frac{RB}{\sqrt{T}}.$$
Proof. Taking expectations on both sides of the vanilla analysis (2.4) and using linearity of expectations, we get
$$\sum_{t=0}^{T-1}\mathbb{E}\big[\mathbf{g}_t^\top(\mathbf{x}_t-\mathbf{x}^\star)\big] \le \frac{\gamma}{2}\sum_{t=0}^{T-1}\mathbb{E}\big[\|\mathbf{g}_t\|^2\big] + \frac{1}{2\gamma}\|\mathbf{x}_0-\mathbf{x}^\star\|^2. \qquad (5.5)$$
By (5.4),
$$\mathbb{E}\big[f(\mathbf{x}_t)-f(\mathbf{x}^\star)\big] \le \mathbb{E}\big[\mathbf{g}_t^\top(\mathbf{x}_t-\mathbf{x}^\star)\big].$$
Plugging this into (5.5), using $\mathbb{E}\big[\|\mathbf{g}_t\|^2\big]\le B^2$ and $\|\mathbf{x}_0-\mathbf{x}^\star\|\le R$, we get
$$\sum_{t=0}^{T-1}\mathbb{E}\big[f(\mathbf{x}_t)-f(\mathbf{x}^\star)\big] \le \frac{\gamma}{2}B^2 T + \frac{1}{2\gamma}R^2,$$
from which the statement follows with the choice of $\gamma$ as in Theorem 2.1.
5.4 Tame strong convexity: O(1/ε) steps
It is possible to strengthen our above SGD analysis. One way to do so
is under the additional assumption of strong convexity of the objective
function f (as in Definition 2.8). Again, the proof works by “taking ex-
pectations” over a previous analysis, in this case the one for subgradient
descent in the tame strongly convex case (Theorem 4.11).
$$\mathbb{E}\Big[f\Big(\frac{2}{T(T+1)}\sum_{t=1}^T t\cdot\mathbf{x}_t\Big) - f(\mathbf{x}^\star)\Big] \le \frac{2B^2}{\mu(T+1)},$$
where $B^2 = \max_{t=1}^{T}\mathbb{E}\big[\|\mathbf{g}_t\|^2\big]$.
Proof. We start from the vanilla analysis (2.3) (with $\gamma = \gamma_t$) and take expectations on both sides:
$$\mathbb{E}\big[\mathbf{g}_t^\top(\mathbf{x}_t-\mathbf{x}^\star)\big] = \frac{\gamma_t}{2}\mathbb{E}\big[\|\mathbf{g}_t\|^2\big] + \frac{1}{2\gamma_t}\Big(\mathbb{E}\big[\|\mathbf{x}_t-\mathbf{x}^\star\|^2\big] - \mathbb{E}\big[\|\mathbf{x}_{t+1}-\mathbf{x}^\star\|^2\big]\Big).$$
Now we use (5.4) along with strong convexity to get a lower bound
$$\mathbb{E}\big[\mathbf{g}_t^\top(\mathbf{x}_t-\mathbf{x}^\star)\big] = \mathbb{E}\big[\nabla f(\mathbf{x}_t)^\top(\mathbf{x}_t-\mathbf{x}^\star)\big] \ge \mathbb{E}\big[f(\mathbf{x}_t)-f(\mathbf{x}^\star)\big] + \frac{\mu}{2}\mathbb{E}\big[\|\mathbf{x}_t-\mathbf{x}^\star\|^2\big]$$
for the left-hand side. Combining the previous two equations and using $\mathbb{E}\big[\|\mathbf{g}_t\|^2\big]\le B^2$, we get the "expected version" of (4.4):
$$\mathbb{E}\big[f(\mathbf{x}_t)-f(\mathbf{x}^\star)\big] \le \frac{\gamma_t}{2}B^2 + \frac{\gamma_t^{-1}-\mu}{2}\,\mathbb{E}\big[\|\mathbf{x}_t-\mathbf{x}^\star\|^2\big] - \frac{\gamma_t^{-1}}{2}\,\mathbb{E}\big[\|\mathbf{x}_{t+1}-\mathbf{x}^\star\|^2\big].$$
The proof continues as in Theorem 4.11, with every step being the “ex-
pected version” of the corresponding step in the earlier proof.
5.5 Stochastic Subgradient Descent
For problems which are not necessarily differentiable, we modify SGD to
use a subgradient of fi in each iteration. The update of stochastic subgra-
dient descent is given by
sample $i \in [n]$ uniformly at random
let $\mathbf{g}_t \in \partial f_i(\mathbf{x}_t)$     (5.6)
$\mathbf{x}_{t+1} := \mathbf{x}_t - \gamma_t\,\mathbf{g}_t$.
5.6 Mini-batch variants
Instead of using a single stochastic gradient, we can average $m$ of them in each step, using the update direction
$$\tilde{\mathbf{g}}_t := \frac{1}{m}\sum_{j=1}^{m}\mathbf{g}_{tj},$$
where $\mathbf{g}_{tj} = \nabla f_{i_j}(\mathbf{x}_t)$ for an index $i_j$. The set of the (distinct) indices $i_j$ is called a mini-batch, and $m$ is the mini-batch size.
Using the step direction g̃t defines mini-batch SGD. For m = 1, we re-
cover SGD as originally defined, while for m = n we recover full gradient
descent.
Mini-batch SGD can be advantageous in several applications. For ex-
ample, parallelization over up to m processors will easily give a speed-up
for the gradient computation, which is typically the main cost of running
SGD. Here, parallelization exploits the fact that all gtj are defined at the
same iterate xt and can therefore be computed independently.
Taking an average of many independent random variables reduces the
variance. In the context of mini-batch SGD, we obtain that for larger size
of the mini-batch m our estimate g̃t will be closer to the true gradient, in
expectation:
$$\mathbb{E}\Big[\big\|\tilde{\mathbf{g}}_t - \nabla f(\mathbf{x}_t)\big\|^2\Big] = \mathbb{E}\Big[\Big\|\frac{1}{m}\sum_{j=1}^m\mathbf{g}_{tj} - \nabla f(\mathbf{x}_t)\Big\|^2\Big] = \frac{1}{m}\,\mathbb{E}\Big[\big\|\mathbf{g}_{t1} - \nabla f(\mathbf{x}_t)\big\|^2\Big] = \frac{1}{m}\Big(\mathbb{E}\big[\|\mathbf{g}_{t1}\|^2\big] - \|\nabla f(\mathbf{x}_t)\|^2\Big) \le \frac{B^2}{m}.$$
Using a modification of the above analysis, it is possible to use this
property to relate the above convergence rate of SGD to the rate of full
gradient descent.
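The following short sketch (the least-squares setup and all names are illustrative assumptions, not from the text) checks this $1/m$ behavior numerically:

```python
import numpy as np

# Sketch: averaging m stochastic gradients reduces E||g_tilde - grad f||^2 roughly by 1/m.
rng = np.random.default_rng(0)
A, b = rng.standard_normal((200, 10)), rng.standard_normal(200)
x = rng.standard_normal(10)

full_grad = A.T @ (A @ x - b) / len(b)        # gradient of (1/n) sum_i 0.5*(a_i^T x - b_i)^2
def stoch_grad(idx):                          # gradient of a single summand f_i
    return (A[idx] @ x - b[idx]) * A[idx]

for m in [1, 10, 100]:
    errs = []
    for _ in range(2000):
        batch = rng.integers(len(b), size=m)  # i.i.d. uniform indices
        g_tilde = np.mean([stoch_grad(i) for i in batch], axis=0)
        errs.append(np.sum((g_tilde - full_grad) ** 2))
    print(m, np.mean(errs))                   # mean squared error decreases roughly like 1/m
```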
5.7 Exercises
Exercise 32. Let $Y$ be a random variable over a finite probability space $(\Omega, \mathrm{prob})$ where $\mathrm{prob} : 2^\Omega \to [0,1]$; this avoids subtleties in defining conditional probabilities and expectations, and it covers the random variables occurring in SGD, since in each step, we are randomly choosing among a finite set of $n$ indices. Furthermore, let $B \subseteq \Omega$ be an event.
For nonempty $B$, the conditional expectation of $Y$ given $B$ is the number
$$\mathbb{E}\big[Y \mid B\big] := \sum_{y\in Y(\Omega)} y\cdot\mathrm{prob}\big[Y = y \mid B\big],$$
where $Y = y$ is shorthand for the event $\{\omega\in\Omega : Y(\omega) = y\}$.
Finally, for two events $A$ and $B \ne \emptyset$, the conditional probability $\mathrm{prob}[A \mid B]$ is defined as
$$\mathrm{prob}\big[A \mid B\big] := \frac{\mathrm{prob}\big[A\cap B\big]}{\mathrm{prob}\big[B\big]}.$$
If $B = \emptyset$, $\mathbb{E}[Y \mid B]$ can be defined arbitrarily.
Prove the following statements.
Chapter 6
Nonconvex functions
Contents
6.1 Smooth functions . . . . . . . . . . . . . . . . . . . . . . . . . 85
6.2 Trajectory analysis . . . . . . . . . . . . . . . . . . . . . . . . 90
6.2.1 Deep linear neural networks . . . . . . . . . . . . . . 91
6.2.2 A simple nonconvex function . . . . . . . . . . . . . . 93
6.2.3 Smoothness along the trajectory . . . . . . . . . . . . 96
6.2.4 Convergence . . . . . . . . . . . . . . . . . . . . . . . 98
6.3 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
So far, all convergence results that we have given for variants of gra-
dient descent have been for convex functions. And there is a good reason
for this: on nonconvex functions, gradient descent can in general not be
expected to come close (in distance or function value) to the global mini-
mum x? , even if there is one.
As an example, consider the nonconvex function from Figure 1.2 (left).
Figure 6.1 shows what happens if we start gradient descent somewhere “to
the right”, with a not too large stepsize so that we do not overshoot. For
any sufficiently large T , the iterate xT will be close to the local minimum
y? , but not to the global minimum x? .
[Figure 6.1: gradient descent started at $x_0$ "to the right" converges to the local minimum $y^\star$, not the global minimum $x^\star$.]
Figure 6.2: Gradient descent may get stuck in a flat region (saddle point) $y^\star$ (left), or reach neither a local minimum nor a saddle point (right).
[Figure: smoothness, the graph of $f$ lies below the quadratic upper bound $f(\mathbf{x}) + \nabla f(\mathbf{x})^\top(\mathbf{y}-\mathbf{x}) + \frac{L}{2}\|\mathbf{x}-\mathbf{y}\|^2$.]
We show that this in turn implies smoothness. This is in fact the easy
direction of Lemma 2.4 (in the twice differentiable case), and we proceed
as in the proof of Theorem 1.10 by employing the fundamental theorem of calculus, applied to the univariate function
$$h(t) = f\big(\mathbf{x} + t(\mathbf{y}-\mathbf{x})\big), \quad t\in\mathbb{R}.$$
For any $\mathbf{x}, \mathbf{y} \in X$, we now compute
$$\begin{aligned}
&f(\mathbf{y}) - f(\mathbf{x}) - \nabla f(\mathbf{x})^\top(\mathbf{y}-\mathbf{x})\\
&= h(1) - h(0) - \nabla f(\mathbf{x})^\top(\mathbf{y}-\mathbf{x})\\
&= \int_0^1 h'(t)\,dt - \nabla f(\mathbf{x})^\top(\mathbf{y}-\mathbf{x})\\
&= \int_0^1 \nabla f(\mathbf{x}+t(\mathbf{y}-\mathbf{x}))^\top(\mathbf{y}-\mathbf{x})\,dt - \nabla f(\mathbf{x})^\top(\mathbf{y}-\mathbf{x})\\
&= \int_0^1 \Big(\nabla f(\mathbf{x}+t(\mathbf{y}-\mathbf{x}))^\top(\mathbf{y}-\mathbf{x}) - \nabla f(\mathbf{x})^\top(\mathbf{y}-\mathbf{x})\Big)\,dt\\
&= \int_0^1 \big(\nabla f(\mathbf{x}+t(\mathbf{y}-\mathbf{x})) - \nabla f(\mathbf{x})\big)^\top(\mathbf{y}-\mathbf{x})\,dt\\
&\le \int_0^1 \Big|\big(\nabla f(\mathbf{x}+t(\mathbf{y}-\mathbf{x})) - \nabla f(\mathbf{x})\big)^\top(\mathbf{y}-\mathbf{x})\Big|\,dt\\
&\le \int_0^1 \big\|\nabla f(\mathbf{x}+t(\mathbf{y}-\mathbf{x})) - \nabla f(\mathbf{x})\big\|\,\|\mathbf{y}-\mathbf{x}\|\,dt \quad\text{(Cauchy-Schwarz)}\\
&\le \int_0^1 L\,\|t(\mathbf{y}-\mathbf{x})\|\,\|\mathbf{y}-\mathbf{x}\|\,dt \quad\text{(Lipschitz continuous gradients)}\\
&= \int_0^1 Lt\,\|\mathbf{x}-\mathbf{y}\|^2\,dt\\
&= \frac{L}{2}\|\mathbf{x}-\mathbf{y}\|^2.
\end{aligned}$$
This is smoothness over X according to Definition 2.2.
For twice differentiable functions, the converse is also (almost) true.
If f is smooth over an open convex subset X ✓ dom(f ), the maximum
eigenvalue of the Hessian is bounded over X (Exercise 33 ). We can only
bound the eigenvalues from above since e.g. concave functions are smooth
with parameter L = 0 but generally have unbounded Hessians. It is also
not hard to understand why openness is necessary in general. Indeed, for
a point x on the boundary of X, the smoothness condition does not give
us any information about nearby points not in X. As a consequence, even
at points with large Hessians, f might look smooth inside X. As a simple
example, consider $f(x_1, x_2) = x_1^2 + Mx_2^2$ with $M\in\mathbb{R}_+$ large. The function $f$ is smooth with $L = 2$ over $X = \{(x_1, x_2) : x_2 = 0\}$: indeed, over this set, $f$ looks just like the supermodel. But for all $\mathbf{x}$, we have $\|\nabla^2 f(\mathbf{x})\| = 2M$.
Now we get back to gradient descent on smooth functions with a global
minimum. The punchline is so unspectacular that there is no harm in
spoiling it already now: What we can prove is that krf (xt )k2 converges to
0 at the same rate as f (xt ) f (x? ) converges to 0 in the convex case. Nat-
urally, f (xt ) f (x? ) itself is not guaranteed to converge in the nonconvex
case, for example if xt converges to a local minimum that is not global, as
in Figure 6.1.
It is tempting to interpret convergence of krf (xt )k2 to 0 as convergence
to a critical point of f (a point where the gradient vanishes). But this inter-
pretation is not fully accurate in general, as Figure 6.2 (right) shows: The
algorithm may enter a region where f asymptotically approaches some
value, without reaching it (think of the rightmost piece of the function in the figure as $f(x) = e^{-x}$). In this case, the gradient converges to 0, but the iterates are nowhere near a critical point.
Theorem 6.2. Let f : Rd ! R be differentiable with a global minimum x? ; fur-
thermore, suppose that f is smooth with parameter L according to Definition 2.2.
Choosing stepsize
$$\gamma := \frac{1}{L},$$
gradient descent (2.1) yields
$$\frac{1}{T}\sum_{t=0}^{T-1}\|\nabla f(\mathbf{x}_t)\|^2 \le \frac{2L}{T}\big(f(\mathbf{x}_0) - f(\mathbf{x}^\star)\big), \quad T > 0.$$
The statement follows.
In the smooth setting, gradient descent has another interesting prop-
erty: with stepsize 1/L, it cannot overshoot. By this, we mean that it
cannot pass a critical point (in particular, not the global minimum) when
moving from xt to xt+1 . Equivalently, with a smaller stepsize, no critical
point can be reached. With stepsize 1/L, it is possible to reach a critical
point, as we have demonstrated for the supermodel function f (x) = x2 in
Section 2.6.
In 2018, results along these lines have appeared that prove convergence of gradient descent to a global minimum in training deep linear neural networks, under suitable conditions. In this section, we will study a vastly
simplified setting that allows us to show the main ideas (and limitations)
behind one particular trajectory analysis [ACGH18].
In our simplified setting, we will look at the task of minimizing a con-
crete and very simple nonconvex function. This function turns out be
smooth along the trajectories that we analyze, and this is one important
ingredient. However, smoothness alone does not suffice to prove con-
vergence to the global minimum, let alone fast convergence: As we have
seen in the last section, we can in general only guarantee that the gradient
norms converge to 0, and at a rather slow rate. To get beyond this, we will
need to exploit additional properties of the function under consideration.
the matrix whose columns are the yi , we can equivalently write this as
$$W^\star = \operatorname*{argmin}_{W\in\mathbb{R}^{m\times d}} \|WX - Y\|_F^2, \qquad (6.2)$$
where $\|A\|_F = \sqrt{\sum_{i,j} a_{ij}^2}$ is the Frobenius norm of a matrix $A$.
[Figure 6.6: a linear neural network with a single weight matrix $W$ (left) and with layers $W_1, W_2, W_3$ (right).]
But what if we have $\ell$ layers (Figure 6.6, right)? Training such a network corresponds to minimizing
$$\|W_\ell W_{\ell-1}\cdots W_1 X - Y\|_F^2$$
over $\ell$ weight matrices $W_1,\dots,W_\ell$ to be learned. In case of linear neural networks, there is no benefit in adding layers, as any linear transformation $\mathbf{x}\mapsto W_\ell W_{\ell-1}\cdots W_1\mathbf{x}$ can of course be represented as $\mathbf{x}\mapsto W\mathbf{x}$ with $W := W_\ell W_{\ell-1}\cdots W_1$. But from a theoretical point of view, a deep linear neural network gives us a simple playground in which we can try to understand why training deep neural networks with gradient descent works,
despite the fact that the objective function is no longer convex. The hope
is that such an understanding can ultimately lead to an analyis of gradient
descent (or other suitable methods) for “real” (meaning non-linear) deep
neural networks.
In the next section, we will discuss the case where all matrices are 1 ⇥ 1,
so they are just numbers. This is arguably a toy example in our already
simple playground. Still, it gives rise to a nontrivial nonconvex function,
and the analysis of gradient descent on it will require similar ingredients
as the one on general deep linear neural networks [ACGH18].
What are the critical points, the ones where $\nabla f(\mathbf{x})$ vanishes? This happens when $\prod_k x_k = 1$, in which case we have a global minimum (level 0 in Figure 6.7). But there are other critical points. Whenever at least two of the $x_k$ are zero, the gradient also vanishes, and the value of $f$ is $1/2$ at such a point (point $\mathbf{0}$ in Figure 6.7). This already shows that the function cannot be convex, as for convex functions, every critical point is a global minimum (Lemma 1.16). It is easy to see that every non-optimal critical point must have two or more zeros.
Figure 6.7: Level sets of $f(x_1, x_2) = \frac12(x_1x_2 - 1)^2$
In fact, all critical points except the global minima are saddle points.
This is because at any such point x, we can slightly perturb the (two or
more) zero entries in such a way that the product of all entries becomes
either positive or negative, so that the function value either decreases or
increases.
Figure 6.8 visualizes (scaled) negative gradients of f for d = 2; these are
the directions in which gradient descent would move from the tails of the
respective arrows. The figure already indicates that it is difficult to avoid
convergence to a global minimum, but it is possible (see Exercise 37).
We now want to Q show that for any dimension d, and from anywhere in
X = {x : x > 0, k xk 1}, gradient descent will converge to a global
minimum. Unfortunately, our function f is not smooth over X. For the
analysis, we will therefore show that f is smooth along the trajectory of
Figure 6.8: Scaled negative gradients of $f(x_1, x_2) = \frac12(x_1x_2 - 1)^2$
$$f(\mathbf{x}_{t+1}) \le f(\mathbf{x}_t) - \frac{1}{2L}\|\nabla f(\mathbf{x}_t)\|^2, \quad t\ge 0,$$
by Lemma 2.6.
This already shows that gradient descent cannot converge to a saddle
point: all these have (at least two) zero entries and therefore function value
1/2. But for starting point x0 2 X, we have f (x0 ) < 1/2, so we can never
reach a saddle while decreasing f .
But doesn’t this mean that we necessarily have to converge to a global
minimum? No, because the sublevel sets of f are unbounded, so it could in
principle happen that gradient descent runs off to infinity while constantly
improving f (xt ) (an example is gradient descent on f (x) = e x ). Or some
other bad behavior occurs (we haven’t characterized what can go wrong).
So there is still something to prove.
How about convergence from other starting points? For $\mathbf{x} > 0$ with $\prod_k x_k \ge 1$, we also get convergence (Exercise 36). But there are also starting points from which gradient descent will not converge to a global minimum (Exercise 37).
The following simple lemma is the key to showing that gradient de-
scent behaves nicely in our case.
Definition 6.4. Let $\mathbf{x} > 0$ (componentwise), and let $c \ge 1$ be a real number. $\mathbf{x}$ is called $c$-balanced if $x_i \le c x_j$ for all $1 \le i, j \le d$.
In fact, any initial iterate x0 > 0 is c-balanced for some (possibly large) c.
Lemma 6.5. Let $\mathbf{x} > 0$ be $c$-balanced with $\prod_k x_k \le 1$. Then for any stepsize $\gamma > 0$, $\mathbf{x}' := \mathbf{x} - \gamma\nabla f(\mathbf{x})$ satisfies $\mathbf{x}' \ge \mathbf{x}$ (componentwise) and is also $c$-balanced.
If $c = 1$ (all entries of $\mathbf{x}$ are equal), this is easy to see since then also all entries of $\nabla f(\mathbf{x})$ in (6.4) are equal. Later we will show that for suitable step size, we also maintain that $\prod_k x'_k \le 1$, so that gradient descent only goes through balanced iterates.
Proof. Set $\Delta := -\gamma\big(\prod_k x_k - 1\big)\prod_k x_k \ge 0$. Then the gradient descent update assumes the form
$$x'_k = x_k + \frac{\Delta}{x_k}, \quad k = 1,\dots,d.$$
For $i, j$, we have $x_i \le cx_j$ and $x_j \le cx_i$ ($\Leftrightarrow 1/x_i \le c/x_j$). We therefore get
$$x'_i = x_i + \frac{\Delta}{x_i} \le cx_j + \frac{c\Delta}{x_j} = cx'_j.$$
By definition, $\nabla^2 f(\mathbf{x})_{ij}$ is the $j$-th partial derivative of the $i$-th entry of $\nabla f(\mathbf{x})$. This $i$-th entry is
$$(\nabla f(\mathbf{x}))_i = \Big(\prod_k x_k - 1\Big)\prod_{k\ne i}x_k.$$
Proof. The fact that $\|A\| \le \|A\|_F$ is Exercise 38. To bound the Frobenius norm, we use the previous lemma to compute
$$\big|\nabla^2 f(\mathbf{x})_{ii}\big| = \Big(\prod_{k\ne i}x_k\Big)^2 \le c^2,$$
and for $i \ne j$,
$$\big|\nabla^2 f(\mathbf{x})_{ij}\big| \le 2\prod_{k\ne i}x_k\prod_{k\ne j}x_k + \prod_{k\ne i,j}x_k \le 3c^2.$$
Hence, $\|\nabla^2 f(\mathbf{x})\|_F^2 \le 9d^2c^4$. Taking square roots, the statement follows.
This now implies smoothness of $f$ along the whole trajectory of gradient descent, under the usual "smooth stepsize" $\gamma = 1/L = 1/(3dc^2)$.
Lemma 6.8. Let $\mathbf{x} > 0$ be $c$-balanced with $\prod_k x_k < 1$, $L = 3dc^2$. Let $\gamma := 1/L$. Then for all $0 \le \nu \le \gamma$,
$$\mathbf{x}' := \mathbf{x} - \nu\nabla f(\mathbf{x}) \ge \mathbf{x}$$
is $c$-balanced with $\prod_k x'_k \le 1$, and $f$ is smooth with parameter $L$ over the line segment connecting $\mathbf{x}$ and $\mathbf{x} - \gamma\nabla f(\mathbf{x})$.
6.2.4 Convergence
Theorem 6.9. Let $c \ge 1$ and $\delta > 0$ such that $\mathbf{x}_0 > 0$ is $c$-balanced with $\delta \le \prod_k (\mathbf{x}_0)_k < 1$. Choosing stepsize
$$\gamma = \frac{1}{3dc^2},$$
gradient descent satisfies
$$f(\mathbf{x}_T) \le \left(1 - \frac{\delta^2}{3c^4}\right)^T f(\mathbf{x}_0), \quad T\ge 0.$$
This means that the loss indeed converges to its optimal value 0, and
does so with a fast exponential error decrease. Exercise 39 asks you to
prove that also the iterates themselves converge (to an optimal solution),
so gradient descent will not run off to infinity.
Proof. For each $t \ge 0$, $f$ is smooth over $\mathrm{conv}(\{\mathbf{x}_t, \mathbf{x}_{t+1}\})$ with parameter $L = 3dc^2$, hence Lemma 2.6 yields sufficient decrease:
$$f(\mathbf{x}_{t+1}) \le f(\mathbf{x}_t) - \frac{1}{6dc^2}\|\nabla f(\mathbf{x}_t)\|^2. \qquad (6.5)$$
For every $c$-balanced $\mathbf{x}$ with $\delta \le \prod_k x_k \le 1$, we have
$$\|\nabla f(\mathbf{x})\|^2 = 2f(\mathbf{x})\sum_{i=1}^d\Big(\prod_{k\ne i}x_k\Big)^2 \ge 2f(\mathbf{x})\,\frac{d}{c^2}\Big(\prod_k x_k\Big)^{2-2/d} \;\;\text{(Lemma 6.6)}\;\; \ge 2f(\mathbf{x})\,\frac{d}{c^2}\Big(\prod_k x_k\Big)^{2} \ge 2f(\mathbf{x})\,\frac{d}{c^2}\,\delta^2.$$
Then, (6.5) further yields
$$f(\mathbf{x}_{t+1}) \le f(\mathbf{x}_t) - \frac{1}{6dc^2}\,2f(\mathbf{x}_t)\,\frac{d}{c^2}\,\delta^2 = f(\mathbf{x}_t)\left(1 - \frac{\delta^2}{3c^4}\right),$$
solution $\mathbf{x}_0 = (1/2,\dots,1/2)$. This is $c$-balanced with $c = 1$, but the $\delta$ that we get is $1/2^d$. Hence, the "constant factor" is
$$\left(1 - \frac{1}{3\cdot 4^d}\right),$$
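As a quick experiment, the following minimal sketch (the dimension, iteration count, and names are illustrative assumptions) runs gradient descent on $f(\mathbf{x}) = \frac12(\prod_k x_k - 1)^2$ with the stepsize $1/(3dc^2)$ from Theorem 6.9, starting from the 1-balanced point $\mathbf{x}_0 = (1/2,\dots,1/2)$:

```python
import numpy as np

def f(x):
    return 0.5 * (np.prod(x) - 1.0) ** 2

def grad_f(x):
    p = np.prod(x)
    return (p - 1.0) * p / x       # entry i equals (prod - 1) * prod_{k != i} x_k, valid for x > 0

d = 5
x = np.full(d, 0.5)                # c-balanced with c = 1; prod(x) = 1/2^d < 1
c = np.max(x) / np.min(x)          # c = 1 here
gamma = 1.0 / (3 * d * c ** 2)     # "smooth stepsize" from Theorem 6.9
for t in range(50000):
    x = x - gamma * grad_f(x)
print(f(x), np.prod(x))            # f(x) tends to 0, prod(x) tends to 1
```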
6.3 Exercises
Exercise 33. Let f : Rn ! R twice differentiable, with X ✓ dom(f ) an open
convex set, and suppose that f is smooth with parameter L over X. Prove that
under these conditions, the largest eigenvalue of the Hessian satisfies $\lambda_{\max}(\nabla^2 f(\mathbf{x})) \le L$ for all $\mathbf{x}\in X$.
Exercise 34. Prove that the statement of Theorem 6.2 implies that
Exercise 35. Prove Lemma 6.3 (gradient descent does not overshoot on smooth
functions).
Exercise 36. Consider the function $f(\mathbf{x}) = \frac12\big(\prod_{k=1}^d x_k - 1\big)^2$. Prove that for any starting point $\mathbf{x}_0 \in X = \{\mathbf{x}\in\mathbb{R}^d : \mathbf{x} > 0, \prod_k x_k \ge 1\}$ and any $\varepsilon > 0$, gradient descent attains $f(\mathbf{x}_T) \le \varepsilon$ for some iteration $T$.
Exercise 37. Consider the function $f(\mathbf{x}) = \frac12\big(\prod_{k=1}^d x_k - 1\big)^2$. Prove that for even dimension $d \ge 2$, there is a point $\mathbf{x}_0$ (not a critical point) such that gradient descent does not converge to a global minimum when started at $\mathbf{x}_0$, regardless of step size(s).
Exercise 38. Prove that for any matrix A, kAk kAkF , where k·k is the spectral
norm and k·kF the Frobenius norm.
Exercise 39. Prove that the sequence $(\mathbf{x}_T)_{T\ge 0}$ of iterates in Theorem 6.9 converges to an optimal solution $\mathbf{x}^\star$.
Chapter 7
Newton’s Method
Contents
7.1 1-dimensional case . . . . . . . . . . . . . . . . . . . . . . . . 103
7.2 Newton’s method for optimization . . . . . . . . . . . . . . . 105
7.3 Once you’re close, you’re there. . . . . . . . . . . . . . . . . . . 107
7.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
7.1 1-dimensional case
The Newton method (or Newton-Raphson method, invented by Sir Isaac
Newton and formalized by Joseph Raphson) is an iterative method for
finding a zero of a differentiable univariate function f : R ! R. Starting
from some number x0 , it computes
$$x_{t+1} := x_t - \frac{f(x_t)}{f'(x_t)}, \quad t\ge 0. \qquad (7.1)$$
Figure 7.1 shows what happens. xt+1 is the point where the tangent line
to the graph of $f$ at $(x_t, f(x_t))$ intersects the $x$-axis. In formulas, $x_{t+1}$ is the solution of the linear equation
$$f(x_t) + f'(x_t)(x - x_t) = 0.$$
[Figure 7.1: one step of Newton's method: the tangent at $(x_t, f(x_t))$ intersects the $x$-axis at $x_{t+1}$.]
The Newton step (7.1) obviously fails if f 0 (xt ) = 0 and may get out of
control if |f 0 (xt )| is very small. Any theoretical analysis will have to make
suitable assumptions to avoid this. But before going into this, we look at
Newton’s method in a benign case.
Let $f(x) = x^2 - R$, where $R\in\mathbb{R}_+$. $f$ has two zeros, $\sqrt{R}$ and $-\sqrt{R}$. Starting for example at $x_0 = R$, we hope to converge to $\sqrt{R}$ quickly. In this case, (7.1) becomes
$$x_{t+1} = x_t - \frac{x_t^2 - R}{2x_t} = \frac{1}{2}\Big(x_t + \frac{R}{x_t}\Big). \qquad (7.2)$$
This is in fact the Babylonian method to compute square roots, and here we
see that it is just a special case of Newton's method.
Can we prove that we indeed quickly converge to $\sqrt{R}$? What we immediately see from (7.2) is that all iterates will be positive and hence
$$x_{t+1} = \frac{1}{2}\Big(x_t + \frac{R}{x_t}\Big) \ge \frac{x_t}{2}.$$
So we cannot be too fast. Suppose $R \ge 1$. In order to even get $x_t < 2\sqrt{R}$, we need at least $T \ge \log(R)/2$ steps. It turns out that the Babylonian method starts taking off only when $x_t - \sqrt{R} < 1/2$, say (Exercise 40 asks you to prove that it takes $O(\log R)$ steps to get there).
To watch takeoff, let us now suppose that $x_0 - \sqrt{R} < 1/2$, so we are starting close to $\sqrt{R}$ already. We rewrite (7.2) as
$$x_{t+1} - \sqrt{R} = \frac{x_t - \sqrt{R}}{2} - \frac{\sqrt{R}}{2x_t}\big(x_t - \sqrt{R}\big) = \frac{1}{2x_t}\big(x_t - \sqrt{R}\big)^2. \qquad (7.3)$$
Assuming for now that $R \ge 1/4$, all iterates have value at least $\sqrt{R} \ge 1/2$, hence we get
$$x_{t+1} - \sqrt{R} \le \big(x_t - \sqrt{R}\big)^2.$$
This means that the error goes to 0 quadratically, and
$$x_T - \sqrt{R} \le \big(x_0 - \sqrt{R}\big)^{2^T} < \left(\frac{1}{2}\right)^{2^T}, \quad T\ge 0. \qquad (7.4)$$
What does this tell us? In order to get $x_T - \sqrt{R} < \varepsilon$, we only need $T = \log\log(1/\varepsilon)$ steps! Hence, it takes a while to get to roughly $\sqrt{R}$, but from then on, convergence is extremely fast.
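A minimal sketch of the iteration (7.2) in code (the value of $R$ and the iteration count are illustrative assumptions):

```python
def babylonian_sqrt(R, x0=None, num_iters=8):
    """Newton's method (7.1) applied to f(x) = x^2 - R, i.e. the update (7.2)."""
    x = float(R) if x0 is None else float(x0)   # starting point x0 = R as in the text
    for _ in range(num_iters):
        x = 0.5 * (x + R / x)                   # x_{t+1} = (x_t + R / x_t) / 2
    return x

print(babylonian_sqrt(1000.0))   # approaches sqrt(1000) ~ 31.6227766...
```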
7.2 Newton’s method for optimization
Suppose we want to find a global minimum x? of a differentiable con-
vex function f : R ! R (assuming that a global minimum exists). Lem-
mata 1.16 and 1.17 guarantee that we can equivalently search for a zero of
the derivative f 0 . To do this, we can apply Newton’s method if f is twice
differentiable; the update step then becomes
$$x_{t+1} := x_t - \frac{f'(x_t)}{f''(x_t)} = x_t - f''(x_t)^{-1}f'(x_t), \quad t\ge 0. \qquad (7.5)$$
There is no reason to restrict to d = 1. Here is Newton’s method for min-
imizing a convex function f : Rd ! R. We choose x0 arbitrarily and then
iterate:
$$\mathbf{x}_{t+1} := \mathbf{x}_t - \nabla^2 f(\mathbf{x}_t)^{-1}\nabla f(\mathbf{x}_t), \quad t\ge 0. \qquad (7.6)$$
The update vector r2 f (xt ) 1 rf (xt ) is the result of a matrix-vector mul-
tiplication: we invert the Hessian at xt and multiply the result with the
gradient at xt . As before, this fails if the Hessian is not invertible, and may
get out of control if the Hessian has small norm.
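A minimal sketch of the update (7.6) in code (the quadratic test function is an illustrative assumption; solving the linear system rather than inverting the Hessian is a standard implementation choice, not prescribed by the text):

```python
import numpy as np

def newton_method(grad, hess, x0, num_iters=20):
    """Newton's method (7.6): x_{t+1} = x_t - (Hessian at x_t)^{-1} * gradient at x_t."""
    x = x0.astype(float).copy()
    for _ in range(num_iters):
        x = x - np.linalg.solve(hess(x), grad(x))   # solve instead of forming the inverse
    return x

# Illustrative example: a nondegenerate quadratic f(x) = 0.5 x^T M x - q^T x,
# on which a single Newton step reaches the minimizer (as shown in Lemma 7.1 below).
M = np.array([[3.0, 1.0], [1.0, 2.0]])
q = np.array([1.0, -1.0])
x1 = newton_method(lambda x: M @ x - q, lambda x: M, np.array([10.0, -7.0]), num_iters=1)
print(np.allclose(x1, np.linalg.solve(M, q)))       # True
```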
We have introduced iteration (7.6) simply as a (more or less natural)
generalization of (7.5), but there’s more to it. If we consider (7.6) as a
special case of a general update scheme
$$\mathbf{x}_{t+1} = \mathbf{x}_t - H(\mathbf{x}_t)\nabla f(\mathbf{x}_t),$$
where $H(\mathbf{x})\in\mathbb{R}^{d\times d}$ is some matrix, then we see that also gradient descent (2.1) is of this form, with $H(\mathbf{x}_t) = \gamma I$. Hence, Newton's method can
also be thought of as “adaptive gradient descent” where the adaptation is
w.r.t. the local geometry of the function at xt . Indeed, as we show next,
this allows Newton’s method to converge on all nondegenerate quadratic
functions in one step, while gradient descent only does so with the right
stepsize on “beautiful” quadratic functions whose sublevel sets are Eu-
clidean balls (Exercise 18).
Lemma 7.1. A nondegenerate quadratic function is a function of the form
$$f(\mathbf{x}) = \frac12\mathbf{x}^\top M\mathbf{x} - \mathbf{q}^\top\mathbf{x} + c,$$
where $M\in\mathbb{R}^{d\times d}$ is an invertible symmetric matrix, $\mathbf{q}\in\mathbb{R}^d$, $c\in\mathbb{R}$. Let $\mathbf{x}^\star = M^{-1}\mathbf{q}$ be the unique solution of $\nabla f(\mathbf{x}) = 0$ (the unique global minimum if $f$ is convex). With any starting point $\mathbf{x}_0\in\mathbb{R}^d$, Newton's method (7.6) yields $\mathbf{x}_1 = \mathbf{x}^\star$.
Proof. We have $\nabla f(\mathbf{x}) = M\mathbf{x} - \mathbf{q}$ (this implies $\mathbf{x}^\star = M^{-1}\mathbf{q}$) and $\nabla^2 f(\mathbf{x}) = M$. Hence,
$$\mathbf{x}_0 - \nabla^2 f(\mathbf{x}_0)^{-1}\nabla f(\mathbf{x}_0) = \mathbf{x}_0 - M^{-1}(M\mathbf{x}_0 - \mathbf{q}) = M^{-1}\mathbf{q} = \mathbf{x}^\star.$$
Hence, while gradient descent suffers if the coordinates are at very dif-
ferent scales, Newton’s method doesn’t.
We conclude the general exposition with another interpretation of New-
ton’s method: each step minimizes the local second-order Taylor approxi-
mation.
Lemma 7.3 (Exercise 44). Let $f$ be convex and twice differentiable at $\mathbf{x}_t\in\mathrm{dom}(f)$, with $\nabla^2 f(\mathbf{x}_t) \succeq 0$ being invertible. The vector $\mathbf{x}_{t+1}$ resulting from the Newton step (7.6) satisfies
$$\mathbf{x}_{t+1} = \operatorname*{argmin}_{\mathbf{x}\in\mathbb{R}^d}\; f(\mathbf{x}_t) + \nabla f(\mathbf{x}_t)^\top(\mathbf{x}-\mathbf{x}_t) + \frac12(\mathbf{x}-\mathbf{x}_t)^\top\nabla^2 f(\mathbf{x}_t)(\mathbf{x}-\mathbf{x}_t).$$
Theorem 7.4. Let f : dom(f ) ! R be convex with a unique global mini-
mum x? . Suppose that there is a ball X ✓ dom(f ) with center x? such that the
following two properties hold.
(i) Bounded inverse Hessians: There exists a real number $\mu > 0$ such that
$$\|\nabla^2 f(\mathbf{x})^{-1}\| \le \frac{1}{\mu}, \quad \forall\mathbf{x}\in X.$$
(ii) Lipschitz continuous Hessians: There exists a real number $B > 0$ such that
$$\|\nabla^2 f(\mathbf{x}) - \nabla^2 f(\mathbf{y})\| \le B\|\mathbf{x}-\mathbf{y}\|, \quad \forall\mathbf{x},\mathbf{y}\in X.$$
In both cases, the matrix norm is the spectral norm defined in Lemma 2.5. Prop-
erty (i) in particular stipulates that Hessians are invertible at all points in X.
Then, for xt 2 X and xt+1 resulting from the Newton step (7.6), we have
$$\|\mathbf{x}_{t+1} - \mathbf{x}^\star\| \le \frac{B}{2\mu}\|\mathbf{x}_t - \mathbf{x}^\star\|^2.$$
Before we prove this main theorem, here is the local convergence result
that follows from it.
Corollary 7.5 (Exercise 42). With the assumptions and terminology of Theo-
rem 7.4, and if $\mathbf{x}_0\in X$ satisfies
$$\|\mathbf{x}_0 - \mathbf{x}^\star\| \le \frac{\mu}{B},$$
then Newton's method (7.6) yields
$$\|\mathbf{x}_T - \mathbf{x}^\star\| \le \frac{\mu}{B}\left(\frac{1}{2}\right)^{2^T - 1}, \quad T\ge 0.$$
Hence, we have a bound as (7.4) for the last phase of the Babylonian
method: in order to get kxT x? k < ", we only need T = log log( 1" ) steps.
But before this fast behavior kicks in, we need to be µ/B-close to x? al-
ready.
An intuitive reason for fast convergence is that under our assumptions,
the Hessians we encounter are almost constant when we are close to x? .
This means that locally, our function behaves almost like a quadratic func-
tion which has truly constant Hessians and allows Newton’s method to
convergence in one step (Lemma 7.1).
Lemma 7.6 (Exercise 43). With the assumptions and terminology of Theorem 7.4,
and if $\mathbf{x}_0\in X$ satisfies
$$\|\mathbf{x}_0 - \mathbf{x}^\star\| \le \frac{\mu}{B},$$
then the Hessians in Newton's method satisfy the relative error bound
$$\frac{\|\nabla^2 f(\mathbf{x}_t) - \nabla^2 f(\mathbf{x}^\star)\|}{\|\nabla^2 f(\mathbf{x}^\star)\|} \le \left(\frac{1}{2}\right)^{2^t - 1}, \quad t\ge 0.$$
We still owe the reader the proof of the main convergence result, Theorem 7.4:
Proof of Theorem 7.4. To simplify notation, let us abbreviate $H := \nabla^2 f$, $\mathbf{x} = \mathbf{x}_t$, $\mathbf{x}' = \mathbf{x}_{t+1}$. Subtracting $\mathbf{x}^\star$ from both sides of (7.6), we get
$$\begin{aligned}
\mathbf{x}' - \mathbf{x}^\star &= \mathbf{x} - \mathbf{x}^\star - H(\mathbf{x})^{-1}\nabla f(\mathbf{x})\\
&= \mathbf{x} - \mathbf{x}^\star + H(\mathbf{x})^{-1}\big(\nabla f(\mathbf{x}^\star) - \nabla f(\mathbf{x})\big)\\
&= \mathbf{x} - \mathbf{x}^\star + H(\mathbf{x})^{-1}\int_0^1 H\big(\mathbf{x} + t(\mathbf{x}^\star - \mathbf{x})\big)(\mathbf{x}^\star - \mathbf{x})\,dt,
\end{aligned}$$
using the fundamental theorem of calculus and the chain rule as in (1.2)
with h(t) = rf (x + t(x? x)). With
$$\mathbf{x} - \mathbf{x}^\star = H(\mathbf{x})^{-1}H(\mathbf{x})(\mathbf{x} - \mathbf{x}^\star) = -H(\mathbf{x})^{-1}\int_0^1 H(\mathbf{x})(\mathbf{x}^\star - \mathbf{x})\,dt,$$
we further get
$$\mathbf{x}' - \mathbf{x}^\star = H(\mathbf{x})^{-1}\int_0^1 \big(H(\mathbf{x} + t(\mathbf{x}^\star - \mathbf{x})) - H(\mathbf{x})\big)(\mathbf{x}^\star - \mathbf{x})\,dt.$$
Taking norms, this yields
$$\|\mathbf{x}' - \mathbf{x}^\star\| \le \|H(\mathbf{x})^{-1}\|\cdot\Big\|\int_0^1 \big(H(\mathbf{x} + t(\mathbf{x}^\star - \mathbf{x})) - H(\mathbf{x})\big)(\mathbf{x}^\star - \mathbf{x})\,dt\Big\|,$$
where we have used that kAyk kAk · kyk for any matrix A 2 Rd⇥d and
any vector y 2 Rd which follows directly from the definition of the spectral
norm. As we also have
$$\Big\|\int_0^1 g(t)\,dt\Big\| \le \int_0^1 \|g(t)\|\,dt$$
for any vector-valued function $g$ (Exercise 46), we can further bound
$$\begin{aligned}
\|\mathbf{x}'-\mathbf{x}^\star\| &\le \|H(\mathbf{x})^{-1}\|\int_0^1\big\|\big(H(\mathbf{x}+t(\mathbf{x}^\star-\mathbf{x})) - H(\mathbf{x})\big)(\mathbf{x}^\star-\mathbf{x})\big\|\,dt\\
&\le \|H(\mathbf{x})^{-1}\|\int_0^1\big\|H(\mathbf{x}+t(\mathbf{x}^\star-\mathbf{x})) - H(\mathbf{x})\big\|\cdot\|\mathbf{x}^\star-\mathbf{x}\|\,dt\\
&= \|H(\mathbf{x})^{-1}\|\cdot\|\mathbf{x}^\star-\mathbf{x}\|\int_0^1\big\|H(\mathbf{x}+t(\mathbf{x}^\star-\mathbf{x})) - H(\mathbf{x})\big\|\,dt.
\end{aligned}$$
We can now use the properties (i) and (ii) (bounded inverse Hessians, Lip-
schitz continuous Hessians) to conclude that
$$\|\mathbf{x}'-\mathbf{x}^\star\| \le \frac{1}{\mu}\,\|\mathbf{x}^\star-\mathbf{x}\|\int_0^1 B\,\|t(\mathbf{x}^\star-\mathbf{x})\|\,dt = \frac{B}{\mu}\,\|\mathbf{x}^\star-\mathbf{x}\|^2\underbrace{\int_0^1 t\,dt}_{1/2}.$$
How realistic are properties (i) and (ii)? If $f$ is twice continuously differentiable (meaning that the second derivative $\nabla^2 f$ is continuous), then we will always find suitable values of $\mu$ and $B$ over a ball $X$ with center $\mathbf{x}^\star$, provided that $\nabla^2 f(\mathbf{x}^\star) \ne 0$.
Indeed, already in the one-dimensional case, we see that under $f''(x^\star) = 0$ (vanishing second derivative at the global minimum), Newton's method will in the worst case reduce the distance to $x^\star$ by at most a constant factor in each step, no matter how close to $x^\star$ we start. Exercise 45 asks you to find such an example. In such a case, we have linear convergence, but the fast quadratic convergence ($O(\log\log(1/\varepsilon))$ steps) cannot be proven.
One way to ensure bounded inverse Hessians is to require strong con-
vexity over X.
Lemma 7.7 (Exercise 47). Let f : dom(f ) ! R be twice differentiable and
strongly convex with parameter µ over an open convex subset X ✓ dom(f )
according to Definition 2.8, meaning that
$$f(\mathbf{y}) \ge f(\mathbf{x}) + \nabla f(\mathbf{x})^\top(\mathbf{y}-\mathbf{x}) + \frac{\mu}{2}\|\mathbf{x}-\mathbf{y}\|^2, \quad \forall\mathbf{x},\mathbf{y}\in X.$$
Then $\nabla^2 f(\mathbf{x})$ is invertible and $\|\nabla^2 f(\mathbf{x})^{-1}\| \le 1/\mu$ for all $\mathbf{x}\in X$, where $\|\cdot\|$ is the spectral norm defined in Lemma 2.5.
7.4 Exercises
Exercise 40. Consider the Babylonian method (7.2). Prove that we get $x_T - \sqrt{R} < 1/2$ for $T = O(\log R)$.
Exercise 41. Prove Lemma 7.2!
Exercise 42. Prove Corollary 7.5!
Exercise 43. Prove Lemma 7.6!
Exercise 44. Prove Lemma 7.3!
Exercise 45. Let $\varepsilon > 0$ be any real number. Find an example of a convex function $f : \mathbb{R}\to\mathbb{R}$ such that (i) the unique global minimum $x^\star$ has a vanishing second derivative $f''(x^\star) = 0$, and (ii) Newton's method satisfies
$$|x_{t+1} - x^\star| \ge (1-\varepsilon)\,|x_t - x^\star|,$$
for all $x_t \ne x^\star$.
Exercise 46. This exercise is just meant to recall some basics around integrals.
Show that for a vector-valued function $g : \mathbb{R}\to\mathbb{R}^d$, the inequality
$$\Big\|\int_0^1 g(t)\,dt\Big\| \le \int_0^1\|g(t)\|\,dt$$
holds, where $\|\cdot\|$ is the 2-norm (always assuming that the functions under consideration are integrable)! You may assume (i) that integrals are linear:
$$\int_0^1\big(\lambda_1 g_1(t) + \lambda_2 g_2(t)\big)\,dt = \lambda_1\int_0^1 g_1(t)\,dt + \lambda_2\int_0^1 g_2(t)\,dt;$$
and (ii), if $g(t)\ge 0$ for all $t\in[0,1]$, then $\int_0^1 g(t)\,dt \ge 0$.
Exercise 47. Prove Lemma 7.7! You may want to proceed in the following steps.
(i) Prove that the function $g(\mathbf{x}) = f(\mathbf{x}) - \frac{\mu}{2}\|\mathbf{x}\|^2$ is convex over $X$ (see also Exercise 28).
(ii) Prove that r2 f (x) is invertible for all x 2 X.
(iii) Prove that all eigenvalues of $\nabla^2 f(\mathbf{x})^{-1}$ are positive and at most $1/\mu$.
(iv) Prove that for a symmetric matrix M , the spectral norm kM k is the largest
absolute eigenvalue.
Chapter 8
Quasi-Newton Methods
Contents
8.1 The secant method . . . . . . . . . . . . . . . . . . . . . . . . 113
8.2 The secant condition . . . . . . . . . . . . . . . . . . . . . . . 115
8.3 Quasi-Newton methods . . . . . . . . . . . . . . . . . . . . . 115
8.4 Greenstadt’s approach (Optional Material) . . . . . . . . . . . 116
8.4.1 The method of Lagrange multipliers . . . . . . . . . . 118
8.4.2 Application to Greenstadt’s Update . . . . . . . . . . 118
8.4.3 The Greenstadt family . . . . . . . . . . . . . . . . . . 120
8.4.4 The BFGS method . . . . . . . . . . . . . . . . . . . . 122
8.4.5 The L-BFGS method . . . . . . . . . . . . . . . . . . . 124
8.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
The main computational bottleneck in Newton's method (7.6) is the computation and inversion of the Hessian matrix in each step. This matrix has size $d\times d$, so it will take up to $O(d^3)$ time to invert it (or to solve the system $\nabla^2 f(\mathbf{x}_t)\,\Delta\mathbf{x} = -\nabla f(\mathbf{x}_t)$ that gives us the next Newton step $\Delta\mathbf{x}$).
Already in the 1950s, attempts were made to circumvent this costly step,
the first one going back to Davidon [Dav59].
In this chapter, we will (for a change) not prove convergence results;
rather, we focus on the development of Quasi-Newton methods, and how
state-of-the-art methods arise from first principles. To motivate the ap-
proach, let us go back to the 1-dimensional case.
approximates the Newton step (two starting values x0 , x1 need to be cho-
sen here). Figure 8.1 shows what the method does: it constructs the line
through the two points (xt 1 , f (xt 1 )) and (xt , f (xt )) on the graph of f ; the
next iterate xt+1 is where this line intersects the x-axis. Exercise 48 asks
you to formally prove this.
[Figure 8.1: one step of the secant method: the line through $(x_{t-1}, f(x_{t-1}))$ and $(x_t, f(x_t))$ intersects the $x$-axis at $x_{t+1}$.]
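A minimal sketch of the secant iteration described above (the test function and function names are illustrative assumptions):

```python
def secant_method(f, x0, x1, num_iters=20):
    """Secant method: intersect the line through (x_{t-1}, f(x_{t-1})) and
    (x_t, f(x_t)) with the x-axis to obtain x_{t+1}."""
    prev, cur = float(x0), float(x1)
    for _ in range(num_iters):
        if f(cur) == f(prev):      # stop once the two points coincide numerically
            break
        prev, cur = cur, cur - f(cur) * (cur - prev) / (f(cur) - f(prev))
    return cur

# Illustrative usage: finding the zero sqrt(2) of f(x) = x^2 - 2.
print(secant_method(lambda x: x * x - 2.0, 1.0, 2.0))   # ~ 1.41421356
```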
8.2 The secant condition
Applying finite difference approximation to the second derivative of f
(we’re still in the 1-dimensional case), we get
$$H_t := \frac{f'(x_t) - f'(x_{t-1})}{x_t - x_{t-1}} \approx f''(x_t),$$
which we can write as
$$f'(x_t) - f'(x_{t-1}) = H_t(x_t - x_{t-1}) \approx f''(x_t)(x_t - x_{t-1}). \qquad (8.3)$$
Now, while Newton's method for optimization uses the update step
$$x_{t+1} = x_t - f''(x_t)^{-1}f'(x_t), \quad t\ge 0,$$
the secant method works with the approximation $H_t \approx f''(x_t)$:
$$x_{t+1} = x_t - H_t^{-1}f'(x_t), \quad t\ge 1. \qquad (8.4)$$
The fact that Ht approximates f 00 (xt ) in the twice differentiable case
was our motivation for the secant method, but in the method itself, there
is no reference to f 00 (which is exactly the point). All that is needed is the
secant condition from (8.3) that defines Ht :
$$f'(x_t) - f'(x_{t-1}) = H_t(x_t - x_{t-1}). \qquad (8.5)$$
This view can be generalized to higher dimensions. If f : Rd ! R is
differentiable, (8.4) becomes
$$\mathbf{x}_{t+1} = \mathbf{x}_t - H_t^{-1}\nabla f(\mathbf{x}_t), \quad t\ge 1, \qquad (8.6)$$
where $H_t\in\mathbb{R}^{d\times d}$ is now supposed to be a symmetric matrix satisfying the $d$-dimensional secant condition
$$\nabla f(\mathbf{x}_t) - \nabla f(\mathbf{x}_{t-1}) = H_t(\mathbf{x}_t - \mathbf{x}_{t-1}). \qquad (8.7)$$
We might therefore hope that Ht ⇡ r2 f (xt ), and this would mean that
(8.6) approximates Newton’s method. Therefore, whenever we use (8.6)
with a symmetric matrix satisfying the secant condition (8.7), we say that
we have a Quasi-Newton method.
In the 1-dimensional case, there is only one Quasi-Newton method—
the secant method (8.1). Indeed, equation (8.5) uniquely defines the num-
ber Ht in each step.
But in the d-dimensional case, the matrix Ht in the secant condition is
underdetermined, starting from d = 2: Taking the symmetry requirement
into account, (8.7) is a system of d equations in (d + 1)/2 unknowns, so if
it is satisfiable at all, there are many solutions Ht . This raises the question
of which one to choose, and how to do so efficiently; after all, we want to
get some savings over Newton’s method.
Newton’s method is a Quasi-Newton method if and only if f is a non-
degenerate quadratic function (Exercise 49). Hence, Quasi-Newton meth-
ods do not generalize Newton’s method but form a family of related algo-
rithms.
The first Quasi-Newton method was developed by William C. Davi-
don in 1956; he desperately needed iterations that were faster than those
of Newton's method in order to obtain results in the short time spans be-
tween expected failures of the room-sized computer that he used to run
his computations on.
But the paper he wrote about his new method got rejected for lacking
a convergence analysis, and for allegedly dubious notation. It became a
very influential Technical Report in 1959 [Dav59] and was finally officially
published in 1991, with a foreword giving the historical context [Dav91].
Ironically, Quasi-Newton methods are today the methods of choice in a
number of relevant machine learning applications.
We draw some intuition from (the analysis of) Newton's method. Recall that we have shown $\nabla^2 f(\mathbf{x}_t)$ to fluctuate only very little in the region of extremely fast convergence (Lemma 7.6); in fact, Newton's method is optimal (one step!) when $\nabla^2 f(\mathbf{x}_t)$ is actually constant; this is the case of a quadratic function, see Lemma 7.1. Hence, in a Quasi-Newton method, it also makes sense to have that $H_t \approx H_{t-1}$, or $H_t^{-1} \approx H_{t-1}^{-1}$.
Greenstadt's approach from 1970 [Gre70] is to update $H_{t-1}^{-1}$ by an "error matrix" $E_t$ to obtain
$$H_t^{-1} = H_{t-1}^{-1} + E_t.$$
Moreover, the errors should be as small as possible, subject to the con-
straints that Ht 1 is symmetric and satisfies the secant condition (8.7). A
simple measure of error introduced by an update matrix E is its squared
Frobenius norm
$$\|E\|_F^2 := \sum_{i=1}^d\sum_{j=1}^d e_{ij}^2.$$
Greenstadt’s approach can now be distilled into the following convex
constrained minimization problem in the d2 variables Eij :
Fact 8.2 (Exercise 51). Let $A, B\in\mathbb{R}^{d\times d}$ be two matrices. With $f : \mathbb{R}^{d\times d}\to\mathbb{R}$, $f(E) := \frac12\|AEB\|_F^2$, we have
$$\nabla f(E) = A^\top A E B B^\top.$$
8.4.3 The Greenstadt family
We need to solve the system of equations
Ey = r, (8.12)
E>
E = 0, (8.13)
W EW = y> + >
. (8.14)
> 1 >
= y y> ,
2
so we can eliminate by substituting back into (8.15):
✓ ◆
1 > 1 >
E=M >
y + y y >
M = M y> + y M. (8.16)
2 2
To also eliminate , we now use (8.12)—the secant condition in the next
step—to get
1
Ey = M y> + y > M y = r.
2
Premultiplying with 2M gives
1
> >
2M 1
r= y> + y M y = y> M y + y M y.
Hence,
1 >
= 2M 1
r y My . (8.17)
y> M y
To get rid of on the right hand side, we premultiply this with y> M to
obtain
0 1
1 @ > 2y> r
y> M = > 2y r (y> M y)( >
M y )A = > >
My
| {z } y M y | {z } y My | {z }
z z z
It follows that
> y> r
z= My = .
y> M y
This in turn can be substituted into the right-hand side of (8.17) to remove
there, and we get
✓ ◆
1 1 (y> r)
= > 2M r y .
y My y> M y
Consequently,
✓ ◆
> 1 (y> r)
y = >
2M 1 ry> yy >
,
y My y> M y
✓ ◆
> 1 (y> r)
y = 2yr> M 1 yy >
.
y> M y y> M y
and consequently,
✓ ◆
1 > > 1 > > (y> r) >
E= M y +y M= > ry M + M yr M yy M .
2 y My y> M y
(8.18)
Finally, we use $\mathbf{r} = \boldsymbol{\sigma} - H\mathbf{y}$ to obtain the update matrix $E^\star$ in terms of the original parameters $H = H_{t-1}^{-1}$ (previous approximation of the inverse Hessian that we now want to update to $H_t^{-1} = H' = H + E^\star$), $\boldsymbol{\sigma} = \mathbf{x}_t - \mathbf{x}_{t-1}$ (previous Quasi-Newton step) and $\mathbf{y} = \nabla f(\mathbf{x}_t) - \nabla f(\mathbf{x}_{t-1})$ (previous change in gradients). This gives us the Greenstadt family of Quasi-Newton methods.
$$\mathbf{x}_{t+1} = \mathbf{x}_t - H_t^{-1}\nabla f(\mathbf{x}_t), \quad t\ge 1,$$
with
$$H := H_{t-1}^{-1}, \qquad H' := H_t^{-1}, \qquad \boldsymbol{\sigma} := \mathbf{x}_t - \mathbf{x}_{t-1}, \qquad \mathbf{y} := \nabla f(\mathbf{x}_t) - \nabla f(\mathbf{x}_{t-1}),$$
and define
$$E^\star = \frac{1}{\mathbf{y}^\top M\mathbf{y}}\Big(\boldsymbol{\sigma}\mathbf{y}^\top M + M\mathbf{y}\boldsymbol{\sigma}^\top - H\mathbf{y}\mathbf{y}^\top M - M\mathbf{y}\mathbf{y}^\top H - \frac{1}{\mathbf{y}^\top M\mathbf{y}}\big(\mathbf{y}^\top\boldsymbol{\sigma} - \mathbf{y}^\top H\mathbf{y}\big)M\mathbf{y}\mathbf{y}^\top M\Big). \qquad (8.19)$$
journal, Goldfarb suggested to use the matrix M = H 0 , the next approxi-
mation of the inverse Hessian. Even though we don’t yet have it, we can
use it in the formula (8.19) since we know that H 0 will by design satisfy the
secant condition $H'\mathbf{y} = \boldsymbol{\sigma}$. And as $M$ always appears next to $\mathbf{y}$ in (8.19), $M\mathbf{y} = H'\mathbf{y} = \boldsymbol{\sigma}$, so $H'$ disappears from the formula!
Definition 8.5. The BFGS method is the Greenstadt method with parameter $M = H' = H_t^{-1}$ in step $t$, in which case the update matrix $E^\star$ assumes the form
$$E^\star = \frac{1}{\mathbf{y}^\top\boldsymbol{\sigma}}\Big(2\boldsymbol{\sigma}\boldsymbol{\sigma}^\top - H\mathbf{y}\boldsymbol{\sigma}^\top - \boldsymbol{\sigma}\mathbf{y}^\top H - \frac{1}{\mathbf{y}^\top\boldsymbol{\sigma}}\big(\mathbf{y}^\top\boldsymbol{\sigma} - \mathbf{y}^\top H\mathbf{y}\big)\boldsymbol{\sigma}\boldsymbol{\sigma}^\top\Big) = \frac{1}{\mathbf{y}^\top\boldsymbol{\sigma}}\Big(-H\mathbf{y}\boldsymbol{\sigma}^\top - \boldsymbol{\sigma}\mathbf{y}^\top H + \Big(1 + \frac{\mathbf{y}^\top H\mathbf{y}}{\mathbf{y}^\top\boldsymbol{\sigma}}\Big)\boldsymbol{\sigma}\boldsymbol{\sigma}^\top\Big), \qquad (8.20)$$
for some ↵t 2 R+ . This parameter can for example be chosen such that
f (xt+1 ) is minimized (line search). Another approach is backtracking line
search where we start with ↵t = 1, and as long as this does not lead to
sufficient progress, we halve ↵t . Line search ensures that the matrices Ht 1
in the BFGS method remain positive definite [Gol70].
As the Greenstadt update method just depends on the step = xt
xt 1 but not on how it was obtained, the update works in exactly the same
way as before even if scaled steps are being used.
To verify this, simply expand the product in the right-hand side and
compare with (8.20).
We further observe that we do not need the actual matrix $H' = H_t^{-1}$ to perform the next Quasi-Newton step (8.6), but only the vector $H'\nabla f(\mathbf{x}_t)$.
Here is the crucial insight.
Proof. From (8.22), we conclude that
$$H'\mathbf{g}' = \underbrace{\underbrace{\Big(I - \frac{\boldsymbol{\sigma}\mathbf{y}^\top}{\mathbf{y}^\top\boldsymbol{\sigma}}\Big)\underbrace{H\underbrace{\Big(I - \frac{\mathbf{y}\boldsymbol{\sigma}^\top}{\mathbf{y}^\top\boldsymbol{\sigma}}\Big)\mathbf{g}'}_{\mathbf{g}}}_{\mathbf{s}}}_{\mathbf{w}} + \underbrace{\frac{\boldsymbol{\sigma}\boldsymbol{\sigma}^\top}{\mathbf{y}^\top\boldsymbol{\sigma}}\,\mathbf{g}'}_{\mathbf{h}}}_{\mathbf{z}},$$
so $\mathbf{h}$ can be computed with two inner products, a real division, and a multiplication of $\boldsymbol{\sigma}$ with a scalar. For $\mathbf{g}$, we obtain
$$\mathbf{g} = \Big(I - \frac{\mathbf{y}\boldsymbol{\sigma}^\top}{\mathbf{y}^\top\boldsymbol{\sigma}}\Big)\mathbf{g}' = \mathbf{g}' - \mathbf{y}\,\frac{\boldsymbol{\sigma}^\top\mathbf{g}'}{\mathbf{y}^\top\boldsymbol{\sigma}}.$$
H 0 g0 = z = w + h
$$\boldsymbol{\sigma}_k = \mathbf{x}_k - \mathbf{x}_{k-1}, \qquad \mathbf{y}_k = \nabla f(\mathbf{x}_k) - \nabla f(\mathbf{x}_{k-1})$$
be the values of $\boldsymbol{\sigma}$ and $\mathbf{y}$ in iteration $k \le t$. When we perform the Quasi-
Newton step xt+1 = xt Ht 1 rf (xt ) in iteration t 1, we have already
computed these vectors for k = 1, . . . , t. Using Lemma 8.7, we could there-
fore call the recursive procedure in Figure 8.2 with k = t, g0 = rf (xt ) to
compute the required vector Ht 1 rf (xt ) in iteration t. To maintain the im-
mediate connection to Lemma 8.7, we refrain from introducing extra vari-
ables for values that occur several times; but in an actual implementation,
this would be done, of course.
By Lemma 8.7, the runtime of BFGS- STEP(t, rf (xt )) is O(td). For t >
d, this is slower (and needs more memory) than the standard BFGS step
according to Definition 8.5 which always takes O(d2 ) time.
The benefit of the recursive variant is that it can easily be adapted to
a step that is faster (and needs less memory) than the standard BFGS step.
The idea is to let the recursion bottom out after a fixed number m of recur-
sive calls (in practice, values of m 10 are not uncommon). The step then
has runtime O(md) which is a substantial saving over the standard step if
m is much smaller than d.
The only remaining question is what we return when the recursion now bottoms out prematurely at $k = t-m$. As we don't know the matrix $H_{t-m}^{-1}$, we cannot return $H_{t-m}^{-1}\mathbf{g}'$ (which would be the correct output in this case). Instead, we pretend that we have started the whole method just now and use our initial matrix $H_0$ instead of $H_{t-m}$. The resulting algorithm is depicted in Figure 8.3.
Figure 8.3: The L-BFGS method. To compute $H_t^{-1}\nabla f(\mathbf{x}_t)$ based on the previous $m$ iterations, call the function with arguments $(t, m, \nabla f(\mathbf{x}_t))$; values $\boldsymbol{\sigma}_k, \mathbf{y}_k$ from iterations $t-m+1,\dots,t$ are assumed to be available.
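In practice, the recursion of Figure 8.3 is often unrolled into an iterative "two-loop" form. The following is a minimal sketch of such an implementation (not the notes' pseudocode verbatim; the function name and the identity base matrix $H_0$ are assumptions). It computes an approximation of $H_t^{-1}\nabla f(\mathbf{x}_t)$ from the last $m$ pairs $(\boldsymbol{\sigma}_k, \mathbf{y}_k)$ in $O(md)$ time:

```python
import numpy as np

def lbfgs_direction(grad, sigmas, ys, H0_diag=None):
    """Two-loop recursion: approximate H_t^{-1} * grad from the last m pairs.

    sigmas[k] = x_k - x_{k-1}, ys[k] = grad f(x_k) - grad f(x_{k-1}),
    both lists ordered from oldest to newest; H0_diag is a diagonal base
    matrix (identity if None). Assumes y_k^T sigma_k > 0 for all pairs.
    """
    q = grad.copy()
    alphas = []
    rhos = [1.0 / (y @ s) for s, y in zip(sigmas, ys)]
    for s, y, rho in zip(reversed(sigmas), reversed(ys), reversed(rhos)):
        a = rho * (s @ q)              # newest pair first
        alphas.append(a)
        q = q - a * y
    r = q if H0_diag is None else H0_diag * q    # apply the base matrix H0
    for (s, y, rho), a in zip(zip(sigmas, ys, rhos), reversed(alphas)):
        b = rho * (y @ r)              # oldest pair first on the way back
        r = r + (a - b) * s
    return r                           # ~ H_t^{-1} grad
```

The Quasi-Newton step would then be $\mathbf{x}_{t+1} = \mathbf{x}_t - \alpha_t \cdot$ `lbfgs_direction(...)` with a suitable (line-search) scaling $\alpha_t$.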
8.5 Exercises
Exercise 48. Consider a step of the secant method:
$$x_{t+1} = x_t - f(x_t)\,\frac{x_t - x_{t-1}}{f(x_t) - f(x_{t-1})}, \quad t\ge 1.$$
Assuming that $x_t \ne x_{t-1}$ and $f(x_t) \ne f(x_{t-1})$, prove that the line through the two points $(x_{t-1}, f(x_{t-1}))$ and $(x_t, f(x_t))$ intersects the $x$-axis at the point $x = x_{t+1}$.
Exercise 49. Let f : Rd ! R be a twice differentiable function with nonzero
Hessians everywhere. Prove that the following two statements are equivalent.
(i) f is a nondegenerate quadratic function, meaning that
$$f(\mathbf{x}) = \frac12\mathbf{x}^\top M\mathbf{x} - \mathbf{q}^\top\mathbf{x} + c,$$
where $M\in\mathbb{R}^{d\times d}$ is an invertible symmetric matrix, $\mathbf{q}\in\mathbb{R}^d$, $c\in\mathbb{R}$ (see also Lemma 7.1).
(i) Prove that $\mathbf{y}^\top\boldsymbol{\sigma} > 0$, unless $\mathbf{x}_t = \mathbf{x}_{t-1}$, or $f(\lambda\mathbf{x}_t + (1-\lambda)\mathbf{x}_{t-1}) = \lambda f(\mathbf{x}_t) + (1-\lambda)f(\mathbf{x}_{t-1})$ for all $\lambda\in(0,1)$.
(ii) Prove that if $H$ is positive definite and $\mathbf{y}^\top\boldsymbol{\sigma} > 0$, then also $H'$ is positive definite. You may want to use the product form of the BFGS update as developed in Observation 8.6.
Chapter 9
Frank-Wolfe
Contents
TODO (see slides)
Chapter 10
Coordinate Descent
Contents
10.1 Coordinate Descent . . . . . . . . . . . . . . . . . . . . . . . . 133
10.2 Randomized Coordinate Descent . . . . . . . . . . . . . . . . 133
10.2.1 The Polyak-Łojasiewicz Condition . . . . . . . . . . . 135
10.2.2 Importance Sampling . . . . . . . . . . . . . . . . . . 136
10.3 Steepest Coordinate Descent . . . . . . . . . . . . . . . . . . . 136
10.4 Non-smooth objectives . . . . . . . . . . . . . . . . . . . . . . 138
10.5 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
10.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
10.1 Coordinate Descent
Coordinate descent methods generate a sequence $\{\mathbf{x}_t\}_{t\ge 0}$ of iterates as follows:
$$\mathbf{x}_{t+1} := \mathbf{x}_t + \gamma\,\mathbf{e}_{i_t}, \qquad (10.1)$$
where $\mathbf{e}_i$ denotes the $i$-th unit basis vector in $\mathbb{R}^d$, and $\gamma$ is a suitable stepsize for the selected coordinate of our objective function. Here we will focus on the gradient-based choice of the stepsize as
$$\mathbf{x}_{t+1} := \mathbf{x}_t - \frac{1}{L}\nabla_{i_t} f(\mathbf{x}_t)\,\mathbf{e}_{i_t}, \qquad (10.2)$$
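A minimal sketch of update (10.2) with uniformly random coordinate selection (the separable quadratic objective and all names are illustrative assumptions):

```python
import numpy as np

def coordinate_descent(grad_i, d, x0, L, num_iters, rng=np.random.default_rng(0)):
    """Randomized coordinate descent with the gradient-based stepsize (10.2):
    x_{t+1} = x_t - (1/L) * grad_i f(x_t) * e_i for a uniformly random coordinate i.

    grad_i(x, i) returns the i-th partial derivative of f at x.
    """
    x = x0.copy()
    for t in range(num_iters):
        i = rng.integers(d)           # active coordinate, uniform at random
        x[i] -= grad_i(x, i) / L      # update only coordinate i
    return x

# Illustrative example: a separable quadratic f(x) = 0.5 * sum_i c_i * x_i^2.
c = np.array([1.0, 4.0, 9.0])
x_hat = coordinate_descent(lambda x, i: c[i] * x[i], d=3,
                           x0=np.array([5.0, -3.0, 2.0]), L=c.max(), num_iters=500)
print(x_hat)   # close to the minimizer 0
```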
Theorem 10.1. Consider minimization of a function $f$ which is coordinate-wise smooth with constant $L$ as in (10.3), and is strongly convex with parameter $\mu > 0$. Then, coordinate descent with a stepsize of $1/L$,
$$\mathbf{x}_{t+1} := \mathbf{x}_t - \frac{1}{L}\nabla_{i_t} f(\mathbf{x}_t)\,\mathbf{e}_{i_t},$$
when choosing the active coordinate $i_t$ uniformly at random, has an expected linear convergence rate of
$$\mathbb{E}[f(\mathbf{x}_t) - f^\star] \le \Big(1 - \frac{\mu}{dL}\Big)^t\,[f(\mathbf{x}_0) - f^\star].$$
Proof. We follow [KNS16]. By plugging the update rule (10.1) into the smoothness condition (10.3), we have the step improvement
$$f(\mathbf{x}_{t+1}) \le f(\mathbf{x}_t) - \frac{1}{2L}\,|\nabla_{i_t} f(\mathbf{x}_t)|^2.$$
By taking the expectation of both sides with respect to $i_t$ we have
$$\mathbb{E}[f(\mathbf{x}_{t+1})] \le f(\mathbf{x}_t) - \frac{1}{2L}\,\mathbb{E}\big[|\nabla_{i_t} f(\mathbf{x}_t)|^2\big] = f(\mathbf{x}_t) - \frac{1}{2L}\cdot\frac{1}{d}\sum_i|\nabla_i f(\mathbf{x}_t)|^2 = f(\mathbf{x}_t) - \frac{1}{2dL}\|\nabla f(\mathbf{x}_t)\|^2.$$
We now use the fact that strongly convex functions satisfy $\frac12\|\nabla f(\mathbf{x})\|^2 \ge \mu(f(\mathbf{x}) - f^\star)$ for all $\mathbf{x}$. This is proven in Lemma 10.2 below and is a property of separate interest. Subtracting $f^\star$ from both sides, we therefore obtain
$$\mathbb{E}[f(\mathbf{x}_{t+1}) - f^\star] \le \Big(1 - \frac{\mu}{dL}\Big)\,[f(\mathbf{x}_t) - f^\star].$$
Applying this recursively and using iterated expectations yields the result.
10.2.1 The Polyak-Łojasiewicz Condition
A function f satisfies the Polyak-Łojasiewicz Inequality (PL) if the following
holds for some µ > 0,
$$\frac12\|\nabla f(\mathbf{x})\|^2 \ge \mu\big(f(\mathbf{x}) - f^\star\big), \quad \forall\,\mathbf{x}. \qquad (10.4)$$
10.2.2 Importance Sampling
Uniformly random selection of the active coordinate might not always be
the best choice. Let us consider an individual smoothness constant Li for
each coordinate i, that is
$$f(\mathbf{x} + \gamma\mathbf{e}_i) \le f(\mathbf{x}) + \gamma\,\nabla_i f(\mathbf{x}) + \frac{L_i}{2}\gamma^2. \qquad (10.5)$$
Convergence Analysis. It is easy to show that the same convergence rate
which we have obtained for random coordinate descent in Theorem 10.1
also holds for steepest coordinate descent. To see this, the only ingredient
we need is the fact that
$$\max_i |\nabla_i f(\mathbf{x})|^2 \ge \frac{1}{d}\sum_i |\nabla_i f(\mathbf{x})|^2,$$
Lemma 10.6. Let f be strongly convex w.r.t. the `1 -norm with parameter µ1 > 0.
Then f satisfies
$$\frac12\|\nabla f(\mathbf{x})\|_\infty^2 \ge \mu_1\big(f(\mathbf{x}) - f^\star\big).$$
The proof of the lemma is not given here, but follows the same strategy
as in the earlier analogue Lemma 10.2. It then uses a property of convex
conjugate functions (coming from the fact that the norms $\|\cdot\|_1$ and $\|\cdot\|_\infty$ are
dual to each other).
Figure 10.1: A smooth function: $f(\mathbf{x}) := \|\mathbf{x}\|^2$. Figure by Alp Yurtsever & Volkan Cevher, EPFL.
for $g$ convex and smooth, and $h(\mathbf{x}) = \sum_i h_i(x_i)$ separable with $h_i$ convex but possibly non-smooth. For this class of problems, coordinate descent
with exact minimization converges to a global optimum, as illustrated in
Figure 10.3.
One very important class of applications here are smooth functions f
combined with `1 -regularization, such as the Lasso.
Figure 10.3: A function with separable non-smooth part: $f(\mathbf{x}) := \|\mathbf{x}\|^2 + \|\mathbf{x}\|_1$. Figure by Alp Yurtsever & Volkan Cevher, EPFL.
10.5 Applications
Coordinate descent methods are used widely in classic machine learning
applications. Variants of coordinate methods form the state of the art for
the class of generalized linear models, including linear classifiers and re-
gression models, as long as separable convex regularizers are used (e.g. $\ell_1$ or $\ell_2$ norm regularization).
For least-squares linear regression $f(\mathbf{x}) := \|A\mathbf{x} - \mathbf{b}\|^2$, exact coordinate minimization can be performed in closed form.
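For instance (a sketch; the matrix sizes and helper name are assumptions, not from the text), minimizing $\|A\mathbf{x}-\mathbf{b}\|^2$ over the single coordinate $x_i$ with all other coordinates fixed yields $x_i = \mathbf{a}_i^\top(\mathbf{b} - A_{-i}\mathbf{x}_{-i})/\|\mathbf{a}_i\|^2$, where $\mathbf{a}_i$ denotes the $i$-th column of $A$:

```python
import numpy as np

def exact_coordinate_step(A, b, x, i):
    """Exactly minimize ||Ax - b||^2 over coordinate i, all other coordinates fixed."""
    a_i = A[:, i]
    residual_without_i = b - A @ x + a_i * x[i]        # equals b - A_{-i} x_{-i}
    x = x.copy()
    x[i] = a_i @ residual_without_i / (a_i @ a_i)
    return x

rng = np.random.default_rng(0)
A, b = rng.standard_normal((50, 4)), rng.standard_normal(50)
x = np.zeros(4)
for t in range(200):                                    # cycle through the coordinates
    x = exact_coordinate_step(A, b, x, t % 4)
print(np.allclose(A.T @ (A @ x - b), 0, atol=1e-6))     # optimality of the least-squares solution
```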
Lasso. The optimization problem for sparse least squares linear regres-
sion (also known as the Lasso) is given by
where $\ell : \mathbb{R}\to\mathbb{R}$, $\ell(z) := \max\{0, 1-z\}$ is the hinge loss function. Here for any $i$, $1\le i\le n$, the vector $A_i\in\mathbb{R}^d$ is the $i$-th data example, and $y_i\in\{\pm1\}$ is the corresponding label.
The dual optimization problem for the SVM is given by
$$\max_{\alpha\in\mathbb{R}^n}\; \alpha^\top\mathbf{1} - \frac{1}{2\lambda}\,\alpha^\top Y A^\top A Y\alpha \quad\text{such that } 0\le\alpha_i\le 1\;\;\forall i \qquad (10.10)$$
10.6 Exercises
Exercise 53 (Alternative analysis for gradient descent). Let f be smooth with
constant L in the classical sense, and satisfy the PL inequality (10.4). Let the prob-
lem minx f (x) have a non-empty solution set X ? . Prove that gradient descent
with a stepsize of 1/L has a global linear convergence rate
$$f(\mathbf{x}_t) - f^\star \le \Big(1 - \frac{\mu}{L}\Big)^t\big(f(\mathbf{x}_0) - f^\star\big).$$
Exercise 54 (Importance Sampling). Consider random coordinate descent with
selecting the i-th coordinate with probability proportional to the Li value, where Li
is the individual smoothness constant for each coordinate i as in (10.5).
When using a stepsize of $1/L_{i_t}$, prove that we obtain the faster rate of
$$\mathbb{E}[f(\mathbf{x}_t) - f^\star] \le \Big(1 - \frac{\mu}{d\bar{L}}\Big)^t\,[f(\mathbf{x}_0) - f^\star],$$
where $\bar{L} = \frac{1}{d}\sum_{i=1}^d L_i$ now is the average of all coordinate-wise smoothness constants. Note that this value can be much smaller than the global $L$ we have used above, since that one was required to hold for all $i$ and so has to be chosen as $L = \max_i L_i$ instead.
Can you come up with an example from machine learning where $\bar{L} \ll L$?
Exercise 55. Derive the solution to exact coordinate minimization for the Lasso problem (10.8), for the $i$-th coordinate. Write $A_{-i}$ for the $(d-1)\times n$ matrix obtained by removing the $i$-th column from $A$, and similarly for the vector $\mathbf{x}_{-i}$ with one entry removed accordingly.
Bibliography
[ACGH18] Sanjeev Arora, Nadav Cohen, Noah Golowich, and Wei Hu. A
convergence analysis of gradient descent for deep linear neu-
ral networks. CoRR, abs/1810.02281, 2018.
[KNS16] Hamed Karimi, Julie Nutini, and Mark Schmidt. Linear Con-
vergence of Gradient and Proximal-Gradient Methods Under
the Polyak-Łojasiewicz Condition. In ECML PKDD 2016: Ma-
chine Learning and Knowledge Discovery in Databases, pages
795–811. Springer International Publishing, Cham, September
2016.