John C. Duchi
Contents
1 Introduction
1.1 Scope, limitations, and other references
1.2 Notation
2 Basic Convex Analysis
2.1 Introduction and Definitions
2.2 Properties of Convex Sets
2.3 Continuity and Local Differentiability of Convex Functions
2.4 Subgradients and Optimality Conditions
2.5 Calculus rules with subgradients
3 Subgradient Methods
3.1 Introduction
3.2 The gradient and subgradient methods
3.3 Projected subgradient methods
3.4 Stochastic subgradient methods
4 The Choice of Metric in Subgradient Methods
4.1 Introduction
4.2 Mirror Descent Methods
4.3 Adaptive stepsizes and metrics
5 Optimality Guarantees
5.1 Introduction
5.2 Le Cam’s Method
5.3 Multiple dimensions and Assouad’s Method
A Technical Appendices
A.1 Continuity of Convex Functions
A.2 Probability background
A.3 Auxiliary results on divergences
B Questions and Exercises
2010 Mathematics Subject Classification. Primary 65Kxx; Secondary 90C15, 62C20.
Key words and phrases. Convexity, stochastic optimization, subgradients, mirror descent, minimax optimal.
1. Introduction
In this set of four lectures, we study the basic analytical tools and algorithms
necessary for the solution of stochastic convex optimization problems, as well as
for providing various optimality guarantees associated with the methods. As we
proceed through the lectures, we will be more exact about the precise problem
formulations, providing a number of examples, but roughly, by a stochastic op-
timization problem we mean a numerical optimization problem that arises from
observing data from some (random) data-generating process. We focus almost
exclusively on first-order methods for the solution of these types of problems, as
they have proven quite successful in the large scale problems that have driven
many advances throughout the early 2000s.
Our main goal in these lectures, as in the lectures by S. Wright in this volume,
is to develop methods for the solution of optimization problems arising in large-
scale data analysis. Our route will be somewhat circuitous, as we will build the
necessary convex analytic and other background (see Lecture 2), but broadly, the
problems we wish to solve are the problems arising in stochastic convex optimiza-
tion. In these problems, we have samples S coming from a sample space S, drawn
from a distribution P, and we have some decision vector x ∈ Rn that we wish to
choose to minimize the expected loss
(1.0.1)    f(x) := EP[F(x; S)] = ∫_S F(x; s) dP(s),
where F is convex in its first argument.
The methods we consider for minimizing problem (1.0.1) are typically sim-
ple methods that are slower to converge than more advanced methods—such as
Newton or other second-order methods—for deterministic problems, but have the
advantage that they are robust to noise in the optimization problem itself. Con-
sequently, it is often relatively straightforward to derive generalization bounds
for these procedures: if they produce an estimate x̂ exhibiting good performance
on some sample S1, . . . , Sm drawn from P, then they are likely to exhibit good
performance (on average) for future data, that is, to have small objective f(x̂);
see Lecture 3, and especially Theorem 3.4.11. It is of course often advantageous
to take advantage of problem structure and geometric aspects of the problem,
broadly defined, which is the goal of mirror descent and related methods, which
we discuss in Lecture 4.
The last part of our lectures is perhaps the most unusual for material on opti-
mization, which is to investigate optimality guarantees for stochastic optimization
problems. In Lecture 5, we study the sample complexity of solving problems of
the form (1.0.1). More precisely, we measure the performance of an optimization
procedure given samples S1 , . . . , Sm drawn independently from the population
distribution P, denoted by x̂ = x̂(S1:m), in a uniform sense: for a class of objec-
tive functions F, a procedure’s performance is its expected error—or risk—for the
worst member of the class F. We provide lower bounds on this maximum risk,
showing that the first-order procedures we have developed satisfy certain notions
of optimality.
We briefly outline the coming lectures. The first lecture provides definitions
and the convex analytic tools necessary for the development of our algorithms
and other ideas, developing separation properties of convex sets as well as other
properties of convex functions from basic principles. The second two lectures
investigate subgradient methods and their application to certain stochastic opti-
mization problems, demonstrating a number of convergence results. The second
lecture focuses on standard subgradient-type methods, while the third investi-
gates more advanced material on mirror descent and adaptive methods, which
require more care but can yield substantial practical performance benefits. The
final lecture investigates optimality guarantees for the various methods we study,
demonstrating two standard techniques for proving lower bounds on the ability
of any algorithm to solve stochastic optimization problems.
1.1. Scope, limitations, and other references The lectures assume some limited
familiarity with convex functions and convex optimization problems and their
formulation, which will help appreciation of the techniques herein. All that is
truly essential is a level of mathematical maturity that includes some real analysis,
linear algebra, and introductory probability. In terms of real analysis, a typical
undergraduate course, such as one based on Marsden and Hoffman’s Elementary
Real Analysis [37] or Rudin’s Principles of Mathematical Analysis [50], is sufficient.
Readers should not consider these lectures in any way a comprehensive view of
convex analysis or stochastic optimization. These subjects are well-established,
and there are numerous references.
Our lectures begin with convex analysis, whose study Rockafellar, influenced
by Fenchel, launched in his 1970 book Convex Analysis [49]. We develop the basic
ideas necessary for our treatment of first-order (gradient-based) methods for op-
timization, which includes separating and supporting hyperplane theorems, but
we provide essentially no treatment of the important concepts of Lagrangian and
Fenchel duality, support functions, or saddle point theory more broadly. For these
and other important ideas, I have found the books of Rockafellar [49], Hiriart-
Urruty and Lemaréchal [27, 28], Bertsekas [8], and Boyd and Vandenberghe [12]
illuminating.
Convex optimization itself is a huge topic, with thousands of papers and nu-
merous books on the subject. Because of our focus on solution methods for large-
scale problems arising out of data collection, we are somewhat constrained in
our views. Boyd and Vandenberghe [12] provide an excellent treatment of the
possibilities of modeling engineering and scientific problems as convex optimiza-
tion problems, as well as some important numerical methods. Polyak [47] pro-
vides a treatment of stochastic and non-stochastic methods for optimization from
which ours borrows substantially. Nocedal and Wright [46] and Bertsekas [9]
also describe more advanced methods for the solution of optimization problems,
focusing on non-stochastic optimization problems for which there are many so-
phisticated methods.
Because of our goal to solve problems of the form (1.0.1), we develop first-order
methods that are in some ways robust to many types of noise from sampling.
There are other approaches to dealing with data uncertainty, notably the field
of robust optimization [6], whose researchers study and develop tractable (polynomial-
time-solvable) formulations for a variety of data-based problems in engineering and
the sciences. The book of Shapiro et al. [54] provides a more comprehensive
picture of stochastic modeling problems and optimization algorithms than we
have been able to in our lectures, as stochastic optimization is by itself a major
field. Several recent surveys on online learning and online convex optimization
provide complementary treatments to ours [26, 52].
The last lecture traces its roots to seminal work in information-based complexity
by Nemirovski and Yudin in the early 1980s [41], who investigate the limits of “op-
timal” algorithms, where optimality is defined in a worst-case sense according to
an oracle model of algorithms given access to function, gradient, or other types
of local information about the problem at hand. Issues of optimal estimation in
statistics are as old as the field itself, and the minimax formulation we use is
originally due to Wald in the late 1930s [59, 60]. We prove our results using
information-theoretic tools, which have broader applications across statistics and
have been developed by many authors [31, 33, 61, 62].
1.2. Notation We use mostly standard notation throughout these notes, but for
completeness, we collect it here. We let R denote the typical field of real numbers,
with Rn having its usual meaning as n-dimensional Euclidean space. Given
vectors x and y, we let ⟨x, y⟩ denote the inner product between x and y. Given a
norm ‖·‖, its dual norm ‖·‖∗ is defined as
    ‖z‖∗ := sup {⟨z, x⟩ | ‖x‖ ≤ 1} .
Hölder’s inequality (see Exercise 4) shows that the ℓp and ℓq norms, defined by
    ‖x‖p = ( Σ_{j=1}^n |xj|^p )^{1/p}
(and as the limit ‖x‖∞ = maxj |xj|) are dual to one another, where 1/p + 1/q = 1
and p, q ∈ [1, ∞]. Throughout, we will assume that ‖x‖₂ = √⟨x, x⟩ is the norm
defined by the inner product ⟨·, ·⟩.
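As a quick numerical illustration (an addition to these notes, not part of the original text), the following Python snippet checks Hölder’s inequality ⟨z, x⟩ ≤ ‖x‖p ‖z‖q on random vectors and evaluates ⟨z, x⟩ at the standard maximizer over the ℓp unit ball, which attains the value ‖z‖q; the conjugate pair p = 3, q = 3/2 and the dimension are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 10, 3.0
q = p / (p - 1.0)                     # conjugate exponent: 1/p + 1/q = 1

z = rng.normal(size=n)
x = rng.normal(size=n)

lp = lambda v, r: np.sum(np.abs(v) ** r) ** (1.0 / r)

# Hölder's inequality: <z, x> <= ||x||_p ||z||_q
assert z @ x <= lp(x, p) * lp(z, q) + 1e-12

# Maximizer of <z, x> over the l_p unit ball:
# x*_j = sign(z_j) |z_j|^(q-1) / ||z||_q^(q-1), which has ||x*||_p = 1
x_star = np.sign(z) * np.abs(z) ** (q - 1) / lp(z, q) ** (q - 1)
print(lp(x_star, p))                  # ~ 1.0
print(z @ x_star, lp(z, q))           # equal: the dual of the l_p norm is the l_q norm
```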
We also require notation related to sets. For a sequence of vectors v1 , v2 , v3 , . . .,
we let (vn) denote the entire sequence. Given sets A and B, we write A ⊂ B to denote
that A is a subset of (possibly equal to) B, and A ⊊ B to mean that A is a strict subset
of B. The notation cl A denotes the closure of A, while int A denotes the interior
of the set A. For a function f, the set dom f is its domain. If f : Rn → R ∪ {+∞}
is convex, we let dom f := {x ∈ Rn | f(x) < +∞}.
2.1. Introduction and Definitions This set of lecture notes considers convex op-
timization problems, numerical optimization problems of the form
(2.1.1)    minimize f(x)   subject to x ∈ C,
where f is a convex function and C is a convex set. While we will consider
tools to solve these types of optimization problems presently, this first lecture is
mostly concerned with the analytic tools and background that underlie solution
methods for these problems.
The starting point for any study of convex functions is the definition and study
of convex sets, which are intimately related to convex functions. To that end, we
recall that a set C ⊂ Rn is convex if for all x, y ∈ C,
λx + (1 − λ)y ∈ C for λ ∈ [0, 1].
See Figure 2.1.2.
A convex function is similarly defined: a function f : Rn → (−∞, ∞] is convex
if for all x, y ∈ dom f := {x ∈ Rn | f(x) < +∞}
f(λx + (1 − λ)y) ≤ λf(x) + (1 − λ)f(y) for λ ∈ [0, 1].
The epigraph of a function is defined as
epi f := {(x, t) : f(x) ≤ t}.
Figure 2.1.3. (a) The convex function f(x) = max{x2 , −2x − .2}
and (b) its epigraph, which is a convex set.
One may ask why, precisely, we focus on convex functions. In short, as Rock-
afellar [49] notes, convex optimization problems are the clearest dividing line
between numerical problems that are efficiently solvable, often by iterative meth-
ods, and numerical problems for which we have no hope. We give one simple
result in this direction first:
To see this, note that if x is a local minimum then for any y ∈ C, we have for
small enough t > 0 that
    f(x) ≤ f(x + t(y − x)),   or   0 ≤ (f(x + t(y − x)) − f(x)) / t.
We now use the criterion of increasing slopes, that is, for any convex function f the
function
(2.1.5)    t ↦ (f(x + tu) − f(x)) / t
is increasing in t > 0. (See Fig. 2.1.4.) Indeed, let 0 ≤ t1 ≤ t2. Then
Figure 2.1.4. The slopes (f(x + t) − f(x))/t increase, with t1 < t2 < t3.
epigraphs and gradients, results that in turn find many applications in the design
of optimization algorithms as well as optimality certificates.
A few basic properties We list a few simple properties that convex sets have,
which are evident from their definitions. First, if Cα are convex sets for each
α ∈ A, where A is an arbitrary index set, then the intersection
    C = ∩_{α∈A} Cα
is also convex. Additionally, convex sets are closed under scalar multiplication: if
α ∈ R and C is convex, then
αC := {αx : x ∈ C}
is evidently convex. The Minkowski sum of two convex sets is defined by
C1 + C2 := {x1 + x2 : x1 ∈ C1 , x2 ∈ C2 },
and is also convex. To see this, note that if xi, yi ∈ Ci, then
    λ(x1 + x2) + (1 − λ)(y1 + y2) = (λx1 + (1 − λ)y1) + (λx2 + (1 − λ)y2) ∈ C1 + C2,
where the first parenthesized term belongs to C1 and the second to C2.
In particular, convex sets are closed under all linear combinations: if α ∈ Rm, then
    C = Σ_{i=1}^m αi Ci
is also convex.
We also define the convex hull of a set of points x1, . . . , xm ∈ Rn by
    Conv{x1, . . . , xm} = { Σ_{i=1}^m λi xi : λi ≥ 0, Σ_{i=1}^m λi = 1 }.
This set is clearly a convex set.
Projections We now turn to a discussion of orthogonal projection onto a con-
vex set, which will allow us to develop a number of separation properties and
alternate characterizations of convex sets. See Figure 2.2.5 for a geometric view
of projection. We begin by stating a classical result about the projection of zero
onto a convex set.
Theorem 2.2.1 (Projection of zero). Let C be a closed convex set not containing the
origin 0. Then there is a unique point xC ∈ C such that ‖xC‖₂ = inf_{x∈C} ‖x‖₂. Moreover,
‖xC‖₂ = inf_{x∈C} ‖x‖₂ if and only if
(2.2.2)    ⟨xC, y − xC⟩ ≥ 0
for all y ∈ C.
Proof. The key to the proof is the following parallelogram identity, which holds
in any inner product space: for any x, y,
(2.2.3)    (1/2)‖x − y‖₂² + (1/2)‖x + y‖₂² = ‖x‖₂² + ‖y‖₂².
Define M := inf_{x∈C} ‖x‖₂. Now, let (xn) ⊂ C be a sequence of points in C such that
‖xn‖₂ → M as n → ∞. By the parallelogram identity (2.2.3), for any n, m ∈ N,
we have
    (1/2)‖xn − xm‖₂² = ‖xn‖₂² + ‖xm‖₂² − (1/2)‖xn + xm‖₂².
Fix ǫ > 0, and choose N ∈ N such that n ≥ N implies that ‖xn‖₂² ≤ M² + ǫ. Then
for any m, n ≥ N, we have
(2.2.4)    (1/2)‖xn − xm‖₂² ≤ 2M² + 2ǫ − (1/2)‖xn + xm‖₂².
Now we use the convexity of the set C. We have (1/2)xn + (1/2)xm ∈ C for any n, m,
which implies
    (1/2)‖xn + xm‖₂² = 2‖(1/2)xn + (1/2)xm‖₂² ≥ 2M²
by definition of M. Using the above inequality in the bound (2.2.4), we see that
    (1/2)‖xn − xm‖₂² ≤ 2M² + 2ǫ − 2M² = 2ǫ.
In particular, ‖xn − xm‖₂ ≤ 2√ǫ; since ǫ was arbitrary, (xn) forms a Cauchy
sequence and so must converge to a point xC. The continuity of the norm ‖·‖₂
implies that ‖xC‖₂ = inf_{x∈C} ‖x‖₂, and the fact that C is closed implies that xC ∈ C.
Now we show the inequality (2.2.2) holds if and only if xC is the projection of
the origin 0 onto C. Suppose that inequality (2.2.2) holds. Then
    ‖xC‖₂² = ⟨xC, xC⟩ ≤ ⟨xC, y⟩ ≤ ‖xC‖₂ ‖y‖₂,
the last inequality following from the Cauchy-Schwarz inequality. Dividing each
side by ‖xC‖₂ implies that ‖xC‖₂ ≤ ‖y‖₂ for all y ∈ C. For the converse, let xC
minimize ‖x‖₂ over C. Then for any t ∈ [0, 1] and any y ∈ C, we have
    ‖xC‖₂² ≤ ‖(1 − t)xC + ty‖₂² = ‖xC + t(y − xC)‖₂² = ‖xC‖₂² + 2t⟨xC, y − xC⟩ + t²‖y − xC‖₂².
Subtracting ‖xC‖₂² and t²‖y − xC‖₂² from both sides of the above inequality, we
have
    −t²‖y − xC‖₂² ≤ 2t⟨xC, y − xC⟩.
Dividing both sides of the above inequality by 2t, we have
    −(t/2)‖y − xC‖₂² ≤ ⟨xC, y − xC⟩
for all t ∈ (0, 1]. Letting t ↓ 0 gives the desired inequality.
With this theorem in place, a simple shift gives a characterization of more
general projections onto convex sets.
Corollary 2.2.6 (Projection onto convex sets). Let C be a closed convex set and x ∈
Rn. Then there is a unique point πC(x), called the projection of x onto C, such
that ‖x − πC(x)‖₂ = inf_{y∈C} ‖x − y‖₂, that is, πC(x) = argmin_{y∈C} ‖y − x‖₂². The
projection is characterized by the inequality
(2.2.7)    ⟨πC(x) − x, y − πC(x)⟩ ≥ 0
for all y ∈ C.
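As a sanity check on the characterization (2.2.7), the following sketch (added here for illustration; it is not from the original notes) projects a point onto the Euclidean unit ball, for which πC(x) = x/max{1, ‖x‖₂}, and verifies the inequality ⟨πC(x) − x, y − πC(x)⟩ ≥ 0 at randomly drawn points y ∈ C.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5

def project_ball(x, radius=1.0):
    """Euclidean projection onto {y : ||y||_2 <= radius}."""
    nrm = np.linalg.norm(x)
    return x if nrm <= radius else (radius / nrm) * x

x = 3.0 * rng.normal(size=n)          # a point typically outside the ball
pi_x = project_ball(x)

for _ in range(1000):
    y = rng.normal(size=n)
    y = y / max(1.0, np.linalg.norm(y))      # an arbitrary point of C
    # characterization (2.2.7): <pi_C(x) - x, y - pi_C(x)> >= 0
    assert np.dot(pi_x - x, y - pi_x) >= -1e-10
```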
Figure 2.2.5. Projection of the point x onto the set C (with projection πC(x)), exhibiting ⟨x − πC(x), y − πC(x)⟩ ≤ 0.
Proof. When x ∈ C, the statement is clear. For x ∉ C, the corollary simply fol-
lows by considering the set C ′ = C − x, then using Theorem 2.2.1 applied to the
recentered set.
Corollary 2.2.8 (Non-expansive projections). Projections onto convex sets are non-
expansive, in particular,
    ‖πC(x) − y‖₂ ≤ ‖x − y‖₂
for any x ∈ Rn and y ∈ C.
Proposition 2.2.10 (Strict separation of points). Let C be a closed convex set. Given
any point x ∉ C, there is a vector v such that
(2.2.11)    ⟨v, x⟩ > sup_{y∈C} ⟨v, y⟩.
Moreover, we can take the vector v = x − πC(x), and ⟨v, x⟩ ≥ sup_{y∈C} ⟨v, y⟩ + ‖v‖₂².
See Figure 2.2.9.
We can also investigate the existence of hyperplanes that support the convex
set C, meaning that they touch only its boundary and never enter its interior.
Such hyperplanes—and the halfspaces associated with them—provide alternate
descriptions of convex sets and functions. See Figure 2.2.13.
Figure 2.2.18. The function f (solid blue line) and affine under-
estimators (dotted lines).
Corollary 2.2.19. Let f be a closed convex function that is not identically −∞. Then
    f(x) = sup_{v∈Rn, b∈R} { ⟨v, x⟩ + b : f(y) ≥ b + ⟨v, y⟩ for all y ∈ Rn }.
Proof. First, we note that epi f is closed by definition. Moreover, we know that we
can write
epi f = ∩{H : H ⊃ epi f},
where H denotes a halfspace. More specifically, we may index each halfspace by
(v, a, c) ∈ Rn × R × R, and we have Hv,a,c = {(x, t) ∈ Rn × R : ⟨v, x⟩ + at ≤ c}.
Now, because H ⊃ epi f, we must be able to take t → ∞, so that a ≤ 0. If a < 0,
we may divide by |a| and assume without loss of generality that a = −1, while
otherwise a = 0. So if we let
    H1 := {(v, c) : Hv,−1,c ⊃ epi f}   and   H0 := {(v, c) : Hv,0,c ⊃ epi f},
then
    epi f = ( ∩_{(v,c)∈H1} Hv,−1,c ) ∩ ( ∩_{(v,c)∈H0} Hv,0,c ).
We would like to show that epi f = ∩_{(v,c)∈H1} Hv,−1,c, as the set Hv,0,c is a vertical
hyperplane separating the domain of f, dom f, from the rest of the space.
To that end, we show that for any (v1, c1) ∈ H1 and (v0, c0) ∈ H0, we have
    H := ∩_{λ≥0} H_{v1+λv0, −1, c1+λc0} = H_{v1,−1,c1} ∩ H_{v0,0,c0}.
theorem begins by showing that if a convex function is bounded in some set, then
it is Lipschitz continuous in the set, then using Lemma 2.3.1 we can show that on
compact sets f is indeed bounded.
Theorem 2.3.2. Let f be convex and defined on a set C with non-empty interior. Let
B ⊆ int C be compact. Then there is a constant L such that |f(x) − f(y)| ≤ L‖x − y‖ on
B, that is, f is L-Lipschitz continuous on B.
The last result, which we make strong use of in the next section, concerns the
existence of directional derivatives for convex functions.
Definition 2.3.3. The directional derivative of a function f at a point x in the direc-
tion u is
    f′(x; u) := lim_{α↓0} (1/α)[f(x + αu) − f(x)].
This definition makes sense by our earlier arguments that convex functions have
increasing slopes (recall expression (2.1.5)). To see that the above definition makes
sense, we restrict our attention to x ∈ int dom f, so that we can approach x from
all directions. By taking u = y − x for any y ∈ dom f,
    f(x + α(y − x)) = f((1 − α)x + αy) ≤ (1 − α)f(x) + αf(y),
so that
    (1/α)[f(x + α(y − x)) − f(x)] ≤ (1/α)[αf(y) − αf(x)] = f(y) − f(x) = f(x + u) − f(x).
We also know from Theorem 2.3.2 that f is locally Lipschitz, so for small enough
α there exists some L such that f(x + αu) ≥ f(x) − Lα‖u‖, and thus f′(x; u) ≥
−L‖u‖. Further, an argument by convexity (the criterion (2.1.5) of increasing
slopes) shows that the function
    α ↦ (1/α)[f(x + αu) − f(x)]
is increasing, so we can replace the limit in the definition of f′(x; u) with an
infimum over α > 0, that is, f′(x; u) = inf_{α>0} (1/α)[f(x + αu) − f(x)]. Noting that if
x is on the boundary of dom f and x + αu ∉ dom f for any α > 0, then f′(x; u) =
+∞, we have proved the following theorem.
Theorem 2.3.4. For convex f, at any point x ∈ dom f and for any u, the directional
derivative f ′ (x; u) exists and is
    f′(x; u) = lim_{α↓0} (1/α)[f(x + αu) − f(x)] = inf_{α>0} (1/α)[f(x + αu) − f(x)].
If x ∈ int dom f, there exists a constant L < ∞ such that |f′(x; u)| ≤ L‖u‖ for any
u ∈ Rn. If f is Lipschitz continuous with respect to the norm ‖·‖, we can take L to be
the Lipschitz constant of f.
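The increasing-slopes criterion (2.1.5) behind Theorem 2.3.4 is easy to observe numerically. The snippet below (an added illustration; the function and points are arbitrary choices) evaluates the difference quotient α ↦ (f(x + αu) − f(x))/α for the convex function f(x) = ‖x‖₁ and confirms that it is nondecreasing in α, so the limit defining f′(x; u) is also the infimum over α > 0.

```python
import numpy as np

f = lambda x: np.sum(np.abs(x))       # a convex (nonsmooth) function
rng = np.random.default_rng(2)
x, u = rng.normal(size=4), rng.normal(size=4)

alphas = np.logspace(-6, 1, 30)
slopes = [(f(x + a * u) - f(x)) / a for a in alphas]

# slopes are nondecreasing in alpha (criterion (2.1.5)),
# so the smallest slope approximates f'(x; u)
assert all(s1 <= s2 + 1e-8 for s1, s2 in zip(slopes, slopes[1:]))
print("f'(x; u) ≈", slopes[0])
```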
Lastly, we state a well-known condition that is equivalent to convexity. This is
intuitive: if a function is bowl-shaped, it should have positive second derivatives.
Theorem 2.3.5. Let f : Rn → R be twice continuously differentiable. Then f is convex
if and only if ∇²f(x) ⪰ 0 for all x, that is, ∇²f(x) is positive semidefinite.
is, f should always have global linear underestimators of itself. When a function f
is convex, the subgradient generalizes the derivative of f (which is a global linear
underestimator of f when f is differentiable), and is also intimately related to
optimality conditions for convex minimization.
Theorem 2.4.3. Let x ∈ int dom f. Then ∂f(x) is nonempty, closed convex, and com-
pact.
Proof. The fact that ∂f(x) is closed and convex is straightforward. Indeed, all we
need to see this is to recognize that
    ∂f(x) = ∩_z {g : f(z) ≥ f(x) + ⟨g, z − x⟩},
which is an intersection of half-spaces, which are all closed and convex.
Now we need to show that ∂f(x) ≠ ∅. This will essentially follow from the
following fact: the set epi f has a supporting hyperplane at the point (x, f(x)).
Indeed, from Theorem 2.2.14, we know that there exist a vector v and scalar b
such that
    ⟨v, x⟩ + bf(x) ≥ ⟨v, y⟩ + bt
for all (y, t) ∈ epi f (that is, y and t such that f(y) ≤ t). Rearranging slightly, we
have
    ⟨v, x − y⟩ ≥ b(t − f(x))
and setting y = x shows that b ≤ 0. This is close to what we desire, since if b < 0
we set t = f(y) and see that
    −bf(y) ≥ −bf(x) + ⟨v, y − x⟩   or   f(y) ≥ f(x) − ⟨v/b, y − x⟩
for all y, by dividing both sides by −b. In particular, −v/b is a subgradient. Thus,
suppose for the sake of contradiction that b = 0. In this case, we have ⟨v, x − y⟩ ≥
0 for all y ∈ dom f, but we assumed that x ∈ int dom f, so for small enough ǫ > 0,
we can set y = x + ǫv. This would imply that ⟨v, x − y⟩ = −ǫ⟨v, v⟩ ≥ 0, i.e. v = 0,
contradicting the fact that at least one of v and b must be non-zero.
For the compactness of ∂f(x), we use Lemma 2.3.1, which implies that f is
bounded in an ℓ1-ball around x. As x ∈ int dom f by assumption, there is
some ǫ > 0 such that x + ǫB ⊂ int dom f for the ℓ1-ball B = {v : ‖v‖₁ ≤ 1}.
Lemma 2.3.1 implies that sup_{v∈B} f(x + ǫv) = M < ∞ for some M, so we have
M ≥ f(x + ǫv) ≥ f(x) + ǫ⟨g, v⟩ for all v ∈ B and g ∈ ∂f(x), or ‖g‖∞ ≤ (M − f(x))/ǫ.
Thus ∂f(x) is closed and bounded, hence compact.
The next two results require a few auxiliary results related to the directional
derivative of a convex function. The reason for this is that both require connect-
ing the local properties of the convex function f with the sub-differential ∂f(x),
which is difficult in general since ∂f(x) can consist of multiple vectors. However,
by looking at directional derivatives, we can accomplish what we desire. The
connection between a directional derivative and the subdifferential is contained
in the next two lemmas.
Proof. Denote the set on the right hand side of the equality (2.4.5) by S = {g :
⟨g, u⟩ ≤ f′(x; u)}, and let g ∈ S. By the increasing slopes condition, we have
    ⟨g, u⟩ ≤ f′(x; u) ≤ (f(x + αu) − f(x)) / α
for all u and α > 0; in particular, by taking α = 1 and u = y − x, we have the
standard subgradient inequality that f(x) + ⟨g, y − x⟩ ≤ f(y). So if g ∈ S, then
g ∈ ∂f(x). Conversely, for any g ∈ ∂f(x), the definition of a subgradient implies
that
    f(x + αu) ≥ f(x) + ⟨g, x + αu − x⟩ = f(x) + α⟨g, u⟩.
Subtracting f(x) from both sides and dividing by α gives that
    (1/α)[f(x + αu) − f(x)] ≥ sup_{g∈∂f(x)} ⟨g, u⟩.
Proof. Certainly, Lemma 2.4.4 shows that f′(x; u) ≥ sup_{g∈∂f(x)} ⟨g, u⟩. We must
show the other direction. To that end, note that, viewed as a function of u,
f′(x; u) is convex and positively homogeneous, meaning that f′(x; tu) = tf′(x; u)
for t > 0. Thus, we can always write (by Corollary 2.2.19)
    f′(x; u) = sup { ⟨v, u⟩ + b : f′(x; w) ≥ b + ⟨v, w⟩ for all w ∈ Rn }.
Using the positive homogeneity, we have f′(x; 0) = 0 and thus we must have
b = 0, so that u ↦ f′(x; u) is characterized as the supremum of linear functions:
    f′(x; u) = sup { ⟨v, u⟩ : f′(x; w) ≥ ⟨v, w⟩ for all w ∈ Rn }.
But the set {v : ⟨v, w⟩ ≤ f′(x; w) for all w} is simply ∂f(x) by Lemma 2.4.4.
A relatively straightforward calculation using Lemma 2.4.4, which we give
in the next proposition, shows that the subgradient is simply the gradient of
differentiable convex functions. Note that as a consequence of this, we have
the first-order inequality that f(y) ≥ f(x) + ⟨∇f(x), y − x⟩ for any differentiable
convex function.
Proposition 2.4.8. Let f be convex and differentiable at a point x. Then ∂f(x) = {∇f(x)}.
Proposition 2.4.9. Suppose that f is L-Lipschitz with respect to the norm ‖·‖ over a set
C, where C ⊂ int dom f. Then
    sup{ ‖g‖∗ : g ∈ ∂f(x), x ∈ C } ≤ L.
Moreover, we have that ‖x‖ = sup_{y:‖y‖∗≤1} ⟨y, x⟩. Fixing x ∈ Rn, we thus see that
if ‖g‖∗ ≤ 1 and ⟨g, x⟩ = ‖x‖, then
    ‖x‖ + ⟨g, y − x⟩ = ‖x‖ − ‖x‖ + ⟨g, y⟩ ≤ sup_{v:‖v‖∗≤1} ⟨v, y⟩ = ‖y‖.
Theorem 2.4.11. Let f be convex. The point x ∈ int dom f minimizes f over a convex
set C if and only if there exists a subgradient g ∈ ∂f(x) such that simultaneously for all
y ∈ C,
(2.4.12)    ⟨g, y − x⟩ ≥ 0.
Proof. One direction of the theorem is easy. Indeed, pick y ∈ C. Then certainly
there exists g ∈ ∂f(x) for which ⟨g, y − x⟩ ≥ 0. Then by definition,
    f(y) ≥ f(x) + ⟨g, y − x⟩ ≥ f(x).
This holds for any y ∈ C, so x is clearly optimal.
For the converse, suppose that x minimizes f over C. Then for any y ∈ C and
any t > 0 such that x + t(y − x) ∈ C, we have
    f(x + t(y − x)) ≥ f(x),   or   0 ≤ (f(x + t(y − x)) − f(x)) / t.
Taking the limit as t → 0, we have f′(x; y − x) ≥ 0 for all y ∈ C. Now, let
us suppose for the sake of contradiction that there exists a y such that for all
g ∈ ∂f(x), we have ⟨g, y − x⟩ < 0. Because
    ∂f(x) = {g : ⟨g, u⟩ ≤ f′(x; u) for all u ∈ Rn}
by Lemma 2.4.6, and ∂f(x) is compact, we have that sup_{g∈∂f(x)} ⟨g, y − x⟩ is at-
tained, which would imply
    f′(x; y − x) < 0.
This is a contradiction.
2.5. Calculus rules with subgradients We present a number of calculus rules
that show how subgradients are, essentially, similar to derivatives, with a few
exceptions (see also Ch. VII of [27]). When we develop methods for optimization
problems based on subgradients, these basic calculus rules will prove useful.
Scaling. If we let h(x) = αf(x) for some α > 0, then ∂h(x) = α∂f(x).
Finite sums. Suppose that f1, . . . , fm are convex functions and let f = Σ_{i=1}^m fi.
Then
    ∂f(x) = Σ_{i=1}^m ∂fi(x),
where the addition is Minkowski addition. To see that Σ_{i=1}^m ∂fi(x) ⊂ ∂f(x),
let gi ∈ ∂fi(x) for each i, in which case it is clear that
    f(y) = Σ_{i=1}^m fi(y) ≥ Σ_{i=1}^m [ fi(x) + ⟨gi, y − x⟩ ],
so that Σ_{i=1}^m gi ∈ ∂f(x). The converse is somewhat more
technical and is a special case of the results to come.
Integrals. More generally, we can extend this summation result to integrals, as-
suming the integrals exist. These calculations are essential for our development
of stochastic optimization schemes based on stochastic (sub)gradient information
in the coming lectures. Indeed, for each s ∈ S, where S is some set, let fs be
convex. Let µ be a positive measure on the set S, and define the convex function
f(x) = ∫ fs(x) dµ(s). In the notation of the introduction (Eq. (1.0.1)) and the prob-
lems coming in Section 3.4, we take µ to be a probability distribution on a set S,
and if F(·; s) is convex in its first argument for all s ∈ S, then we may take
f(x) = E[F(x; S)]
and satisfy the conditions above. We shall see many such examples in the sequel.
Then if we let gs(x) ∈ ∂fs(x) for each s ∈ S, we have (assuming the integral
exists and that the selections gs(x) are appropriately measurable)
(2.5.1)    ∫ gs(x) dµ(s) ∈ ∂f(x).
To see the inclusion, note that for any y we have
    ⟨ ∫ gs(x) dµ(s), y − x ⟩ = ∫ ⟨gs(x), y − x⟩ dµ(s) ≤ ∫ (fs(y) − fs(x)) dµ(s) = f(y) − f(x).
So the inclusion (2.5.1) holds. Eliding a few technical details, one generally ob-
tains the equality
    ∂f(x) = { ∫ gs(x) dµ(s) : gs(x) ∈ ∂fs(x) for each s ∈ S }.
Returning to our running example of stochastic optimization, if we have a
collection of functions F : Rn × S → R, where for each s ∈ S the function F(·; s)
is convex, then f(x) = E[F(x; S)] is convex when we take expectations over S, and
taking
g(x; s) ∈ ∂F(x; s)
gives a stochastic gradient with the property that E[g(x; S)] ∈ ∂f(x). For more on
these calculations and conditions, see the classic paper of Bertsekas [7], which
addresses the measurability issues.
Affine transformations. Let f : Rm → R be convex and A ∈ Rm×n and
b ∈ Rm . Then h : Rn → R defined by h(x) = f(Ax + b) is convex and has
subdifferential
    ∂h(x) = Aᵀ ∂f(Ax + b).
Indeed, let g ∈ ∂f(Ax + b), so that
    h(y) = f(Ay + b) ≥ f(Ax + b) + ⟨g, (Ay + b) − (Ax + b)⟩ = h(x) + ⟨Aᵀg, y − x⟩,
giving the result.
Finite maxima. Let fi, i = 1, . . . , m, be convex functions, and f(x) = max_{i≤m} fi(x).
Then we have
    epi f = ∩_{i≤m} epi fi,
which is convex, and f is convex. Now, let i be any index such that fi (x) = f(x),
and let gi ∈ ∂fi (x). Then we have for any y ∈ Rn that
    f(y) ≥ fi(y) ≥ fi(x) + ⟨gi, y − x⟩ = f(x) + ⟨gi, y − x⟩.
So gi ∈ ∂f(x). More generally, we have the result that
(2.5.2) ∂f(x) = Conv{∂fi (x) : fi (x) = f(x)},
that is, the subgradient set of f is the convex hull of the subgradients of active
functions at x, that is, those attaining the maximum. If there is only a single
unique active function fi , then ∂f(x) = ∂fi (x). See Figure 2.5.3 for a graphical
representation.
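For a concrete instance of the rule (2.5.2), consider f(x) = max_{i≤m} (⟨ai, x⟩ + bi), a finite maximum of affine functions: any active index i gives the subgradient ai ∈ ∂f(x). The sketch below (added for illustration; the data are arbitrary) returns one such subgradient and checks the subgradient inequality f(y) ≥ f(x) + ⟨g, y − x⟩ at random points y.

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 6, 4
A, b = rng.normal(size=(m, n)), rng.normal(size=m)

f = lambda x: np.max(A @ x + b)          # f(x) = max_i <a_i, x> + b_i

def subgradient(x):
    """Return a_i for an index i active at x; by (2.5.2) this lies in the subdifferential."""
    i = int(np.argmax(A @ x + b))
    return A[i]

x = rng.normal(size=n)
g = subgradient(x)
for _ in range(1000):
    y = rng.normal(size=n)
    assert f(y) >= f(x) + g @ (y - x) - 1e-10   # subgradient inequality
```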
3. Subgradient Methods
Lecture Summary: In this lecture, we discuss first order methods for the min-
imization of convex functions. We focus almost exclusively on subgradient-
based methods, which are essentially universally applicable for convex opti-
mization problems, because they rely very little on the structure of the prob-
lem being solved. This leads to effective but slow algorithms in classical
optimization problems. In large scale problems arising out of machine learn-
ing and statistical tasks, however, subgradient methods enjoy a number of
(theoretical) optimality properties and have excellent practical performance.
good method for solving the problem (3.1.1) is hopeless, though we will see that
the subgradient method does essentially apply in this generality.
Convex programming methodologies developed in the last fifty years or so
have given powerful methods for solving optimization problems. The perfor-
mance of many methods for solving convex optimization problems is measured
by the amount of time or number of iterations required of them to give an ǫ-
optimal solution to the problem (3.1.1), roughly, how long it takes to find some x̂
such that f(x̂) − f(x⋆) ≤ ǫ and dist(x̂, C) ≤ ǫ for an optimal x⋆ ∈ C. Essentially
any problem for which we can compute subgradients efficiently can be solved
to accuracy ǫ in time polynomial in the dimension n of the problem and log(1/ǫ)
by the ellipsoid method (cf. [41, 45]). Moreover, for somewhat better structured
(but still quite general) convex problems, interior point and second order meth-
ods [12, 45] are practically and theoretically quite efficient, sometimes requiring
only O(log log(1/ǫ)) iterations to achieve optimization error ǫ. (See the lectures by S.
Wright in this volume.) These methods use the Newton method as a basic solver,
along with specialized representations of the constraint set C, and are quite pow-
erful.
However, for large scale problems, the time complexity of standard interior
point and Newton methods can be prohibitive. Indeed, for n-dimensional problems—
that is, when x ∈ Rn —interior point methods scale at best as O(n3 ), and can be
much worse. When n is large (where today, large may mean n ≈ 10⁹), this be-
comes highly non-trivial. In such large scale problems and problems arising from
any type of data-collection process, it is reasonable to expect that our representa-
tion of problem data is inexact at best. In statistical machine learning problems,
for example, this is often the case; generally, many applications do not require
accuracy higher than, say, ǫ = 10⁻² or 10⁻³, in which case faster but less exact
methods become attractive.
It is with this motivation that we attack solving the problem (3.1.1) in this
lecture, showing classical subgradient algorithms. These algorithms have the ad-
vantage that their per-iteration costs are low—O(n) or smaller for n-dimensional
problems—but they achieve low accuracy solutions to (3.1.1) very quickly. More-
over, depending on problem structure, they can sometimes achieve convergence
rates that are independent of problem dimension. More precisely, and as we will
see later, the methods we study will guarantee convergence to an ǫ-optimal so-
lution to problem (3.1.1) in O(1/ǫ²) iterations, while methods that achieve better
dependence on ǫ require at least n log(1/ǫ) iterations.
3.2. The gradient and subgradient methods We begin by focusing on the un-
constrained case, that is, when the set C in problem (3.1.1) is C = Rn . That is, we
wish to solve
    minimize_{x∈Rn} f(x).
We first review the gradient descent method, using it as motivation for what fol-
lows. In the gradient descent method, we minimize the objective (3.1.1) by iteratively
updating
(3.2.2) xk+1 = xk − αk ∇f(xk ),
where αk > 0 is a positive sequence of stepsizes. The original motivations for
this choice of update come from the fact that x⋆ minimizes a convex f if and only
if 0 = ∇f(x⋆ ); we believe a more compelling justification comes from the idea
of modeling the convex function being minimized. Indeed, the update (3.2.2) is
equivalent to
(3.2.3)    xk+1 = argmin_x { f(xk) + ⟨∇f(xk), x − xk⟩ + (1/(2αk)) ‖x − xk‖₂² }.
The interpretation is as follows: the linear functional x ↦ f(xk) + ⟨∇f(xk), x − xk⟩
is the best linear approximation to the function f at the point xk, and we would
like to make progress minimizing f. So we minimize this linear approximation,
but to make sure that it has fidelity to the function f, we add a quadratic ‖x − xk‖₂²
to penalize moving too far from xk, which would invalidate the linear approxi-
mation. See Figure 3.2.1. Assuming that f is continuously differentiable (often,
one assumes the gradient ∇f(x) is Lipschitz), then gradient descent is a descent
method if the stepsize αk > 0 is small enough—it monotonically decreases the
objective f(xk ). We spend no more time on the convergence of gradient-based
methods, except to say that the choice of the stepsize αk is often extremely im-
portant, and there is a body of research on carefully choosing directions as well
as stepsize lengths; Nesterov [44] provides an excellent treatment of many of the
basic issues.
Subgradient algorithms The subgradient method is a minor variant of the method (3.2.2),
except that instead of using the gradient, we use a subgradient. The method can
be written simply: for k = 1, 2, . . ., we iterate
i. Choose any subgradient
gk ∈ ∂f(xk )
ii. Take the subgradient step
(3.2.4) xk+1 = xk − αk gk .
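In code, the iteration (3.2.4) is only a few lines. The following sketch is an illustration added to these notes (the objective f(x) = ‖Ax − b‖₁ and the data are arbitrary choices, and the fixed stepsize is untuned); it runs the subgradient method and tracks the best function value seen so far, since, as discussed next, the method need not decrease f at every step.

```python
import numpy as np

rng = np.random.default_rng(4)
m, n = 40, 10
A, b = rng.normal(size=(m, n)), rng.normal(size=m)

f = lambda x: np.sum(np.abs(A @ x - b))            # convex, nonsmooth objective
subgrad = lambda x: A.T @ np.sign(A @ x - b)        # an element of the subdifferential

x = np.zeros(n)
alpha = 1e-3                                        # fixed stepsize
best = f(x)
for k in range(5000):
    x = x - alpha * subgrad(x)                      # update (3.2.4)
    best = min(best, f(x))
print("best objective value:", best)
```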
Unfortunately, the subgradient method is not, in general, a descent method.
For a simple example, take the function f(x) = |x|, and let x1 = 0. Then except
for the choice g = 0, all subgradients g ∈ ∂f(0) = [−1, 1] are ascent directions.
This is not just an artifact of 0 being optimal for f; in higher dimensions, this
behavior is common. Consider, for example, f(x) = ‖x‖₁ and let x = e1 ∈ Rn,
the first standard basis vector. Then ∂f(x) = e1 + Σ_{i=2}^n ti ei, where ti ∈ [−1, 1].
Any vector g = e1 + Σ_{i=2}^n ti ei with Σ_{i=2}^n |ti| > 1 is an ascent direction for f,
meaning that f(x − αg) > f(x) for all α > 0. If we were to pick a uniformly
random g ∈ ∂f(e1 ), for example, then the probability that g is a descent direction
is exponentially small in the dimension n.
In general, the characterization of the subgradient set ∂f(x) as in Lemma 2.4.4,
as {g : f′(x; u) ≥ ⟨g, u⟩ for all u}, where f′(x; u) = lim_{t↓0} (f(x + tu) − f(x))/t is the
directional derivative, and the fact that f′(x; u) = sup_{g∈∂f(x)} ⟨g, u⟩ guarantee
that
    argmin_{g∈∂f(x)} ‖g‖₂²
is a descent direction, but we do not prove this here. Indeed, finding such a
descent direction would require explicitly calculating the entire subgradient set
∂f(x), which for a number of functions is non-trivial and breaks the simplicity of
the subgradient method (3.2.4), which works with any subgradient.
It is the case, however, that so long as the point x does not minimize f(x), then
subgradients descend on a related quantity: the distance of x to any optimal point.
Indeed, let g ∈ ∂f(x), and let x⋆ ∈ argmin_x f(x) (we assume such a point exists),
which need not be unique. Then we have for any α that
    (1/2)‖x − αg − x⋆‖₂² = (1/2)‖x − x⋆‖₂² − α⟨g, x − x⋆⟩ + (α²/2)‖g‖₂².
The key is that for small enough α > 0, the quantity on the right is strictly
smaller than (1/2)‖x − x⋆‖₂², as we now show. We use the defining inequality of the
subgradient, that is, that f(y) ≥ f(x) + ⟨g, y − x⟩ for all y, including x⋆. This gives
−⟨g, x − x⋆⟩ = ⟨g, x⋆ − x⟩ ≤ f(x⋆) − f(x), and thus
(3.2.5)    (1/2)‖x − αg − x⋆‖₂² ≤ (1/2)‖x − x⋆‖₂² − α(f(x) − f(x⋆)) + (α²/2)‖g‖₂².
From inequality (3.2.5), we see immediately that, no matter our choice g ∈ ∂f(x),
we have
    0 < α < 2(f(x) − f(x⋆)) / ‖g‖₂²   implies   ‖x − αg − x⋆‖₂² < ‖x − x⋆‖₂².
Summarizing, by noting that f(x) − f(x⋆) > 0, we have
Observation 3.2.6. If 0 ∉ ∂f(x), then for any x⋆ ∈ argmin_x f(x) and any g ∈ ∂f(x),
there is a stepsize α > 0 such that ‖x − αg − x⋆‖₂² < ‖x − x⋆‖₂².
Theorem 3.2.7. Let αk ≥ 0 be any non-negative sequence of stepsizes and let the preceding
assumptions hold. Let xk be generated by the subgradient iteration (3.2.4). Then for all
K ≥ 1,
    Σ_{k=1}^K αk [f(xk) − f(x⋆)] ≤ (1/2)‖x1 − x⋆‖₂² + (1/2) Σ_{k=1}^K αk² M².
Proof. The entire proof essentially amounts to writing down the distance ‖xk+1 − x⋆‖₂²
and expanding the square, which we do. By applying inequality (3.2.5), we have
    (1/2)‖xk+1 − x⋆‖₂² = (1/2)‖xk − αk gk − x⋆‖₂²
                       ≤ (1/2)‖xk − x⋆‖₂² − αk(f(xk) − f(x⋆)) + (αk²/2)‖gk‖₂².
Rearranging this inequality and using that ‖gk‖₂² ≤ M², we obtain
    αk [f(xk) − f(x⋆)] ≤ (1/2)‖xk − x⋆‖₂² − (1/2)‖xk+1 − x⋆‖₂² + (αk²/2)‖gk‖₂²
                      ≤ (1/2)‖xk − x⋆‖₂² − (1/2)‖xk+1 − x⋆‖₂² + (αk²/2)M².
By summing the preceding expression from k = 1 to k = K and canceling the
alternating ±‖xk − x⋆‖₂² terms, we obtain the theorem.
Theorem 3.2.7 is the starting point from which we may derive a number of
useful consequences. First, we use convexity to obtain the following immediate
corollary (we assume that αk > 0 in the corollary).
Corollary 3.2.8. Let Ak = Σ_{i=1}^k αi and define x̄K = (1/AK) Σ_{k=1}^K αk xk. Then
    f(x̄K) − f(x⋆) ≤ ( ‖x1 − x⋆‖₂² + Σ_{k=1}^K αk² M² ) / ( 2 Σ_{k=1}^K αk ).
Proof. Noting that AK⁻¹ Σ_{k=1}^K αk = 1, we see by convexity that
    f(x̄K) − f(x⋆) ≤ (1 / Σ_{k=1}^K αk) Σ_{k=1}^K αk f(xk) − f(x⋆) = AK⁻¹ [ Σ_{k=1}^K αk (f(xk) − f(x⋆)) ].
Applying Theorem 3.2.7 gives the result.
Corollary 3.2.8 allows us to give a number of basic convergence guarantees
based on our stepsize choices. For example, we see that whenever we have
    αk → 0   and   Σ_{k=1}^∞ αk = ∞,
then Σ_{k=1}^K αk² / Σ_{k=1}^K αk → 0 and so
    f(x̄K) − f(x⋆) → 0 as K → ∞.
Moreover, we can give specific stepsize choices to optimize the bound. For exam-
ple, let us assume for simplicity that R² = ‖x1 − x⋆‖₂² is our distance (radius) to
optimality. Then choosing a fixed stepsize αk = α, we have
(3.2.9)    f(x̄K) − f(x⋆) ≤ R²/(2Kα) + αM²/2.
Optimizing this bound by taking α = R/(M√K) gives
    f(x̄K) − f(x⋆) ≤ RM/√K.
Given that subgradient descent methods are not descent methods, it often
makes sense, instead of tracking the (weighted) average of the points or using
the final point, to use the best point observed thus far. Naturally, if we let
    x_k^best = argmin_{xi : i≤k} f(xi)
Example 3.2 (Some norm balls): Let us consider updates when C = {x : ‖x‖p ≤ 1}
for p ∈ {1, 2, ∞}, each of which is reasonably simple, though the projections are
no longer affine. First, for p = ∞, we consider each coordinate j = 1, 2, . . . , n in
turn, giving
    [πC(x)]j = min{1, max{xj, −1}},
that is, we simply truncate the coordinates of x to be in the range [−1, 1]. For
p = 2, we have a similarly simple update:
    πC(x) = x if ‖x‖₂ ≤ 1,   and   πC(x) = x/‖x‖₂ otherwise.
When p = 1, that is, C = {x : ‖x‖₁ ≤ 1}, the update is somewhat more complex. If
‖x‖₁ ≤ 1, then πC(x) = x. Otherwise, we find the (unique) t > 0 such that
    Σ_{j=1}^n [|xj| − t]₊ = 1,
and then set [πC(x)]j = sign(xj)[|xj| − t]₊.
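The three projections of Example 3.2 can each be written in a few lines. The sketch below is an added illustration: it assumes the standard sort-based way of locating the threshold t for the ℓ1 ball, which is one common implementation rather than the only one.

```python
import numpy as np

def project_linf(x):
    """Projection onto {x : ||x||_inf <= 1}: truncate coordinates to [-1, 1]."""
    return np.clip(x, -1.0, 1.0)

def project_l2(x):
    """Projection onto {x : ||x||_2 <= 1}."""
    nrm = np.linalg.norm(x)
    return x if nrm <= 1.0 else x / nrm

def project_l1(x):
    """Projection onto {x : ||x||_1 <= 1} via the threshold t with
    sum_j [|x_j| - t]_+ = 1, found here by sorting."""
    if np.sum(np.abs(x)) <= 1.0:
        return x
    u = np.sort(np.abs(x))[::-1]              # |x| sorted in decreasing order
    cssv = np.cumsum(u)
    j = np.arange(1, len(x) + 1)
    rho = np.max(j[u - (cssv - 1.0) / j > 0])
    t = (cssv[rho - 1] - 1.0) / rho
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

x = np.array([1.5, -0.2, 0.7, -2.0])
print(project_linf(x), project_l2(x), project_l1(x))
```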
Theorem 3.3.4. Let xk be generated by the projected subgradient iteration (3.3.1), where
the stepsizes αk > 0 are non-increasing. Then
    Σ_{k=1}^K [f(xk) − f(x⋆)] ≤ R²/(2αK) + (1/2) Σ_{k=1}^K αk M².
Proof. The starting point of the proof is the same basic inequality as we have been
using, that is, the distance kxk+1 − x⋆ k22 . In this case, we note that projections can
never increase distances to points x⋆ ∈ C, so that
    ‖xk+1 − x⋆‖₂² = ‖πC(xk − αk gk) − x⋆‖₂² ≤ ‖xk − αk gk − x⋆‖₂².
    Σ_{k=1}^K (1/(2αk)) [ ‖xk − x⋆‖₂² − ‖xk+1 − x⋆‖₂² ]
      = Σ_{k=2}^K ( 1/(2αk) − 1/(2αk−1) ) ‖xk − x⋆‖₂² + (1/(2α1)) ‖x1 − x⋆‖₂² − (1/(2αK)) ‖xK+1 − x⋆‖₂²
      ≤ Σ_{k=2}^K ( 1/(2αk) − 1/(2αk−1) ) R² + (1/(2α1)) R²
because αk ≤ αk−1. Noting that this last sum telescopes and that ‖gk‖₂² ≤ M² in
inequality (3.3.5) gives the result.
One application of this result is when we use a decreasing stepsize of αk = α/√k,
which allows nearly as strong a convergence rate as in the fixed stepsize
case when the number of iterations K is known, but the algorithm provides a
guarantee for all iterations k. Here, we have that
    Σ_{k=1}^K 1/√k ≤ ∫_0^K t^{−1/2} dt = 2√K,
and so by taking x̄K = (1/K) Σ_{k=1}^K xk we obtain the following corollary.
Corollary 3.3.6. In addition to the conditions of the preceding paragraph, let the condi-
tions of Theorem 3.3.4 hold. Then
    f(x̄K) − f(x⋆) ≤ R²/(2α√K) + M²α/√K.
So we see that convergence is guaranteed, at the “best” rate 1/√K, for all iter-
ations. Here, we say “best” because this rate is unimprovable—there are worst
case functions for which no method can achieve a rate of convergence faster than
RM/√K—but in practice, one would hope to attain better behavior by leveraging
problem structure.
Definition 3.4.1. A stochastic gradient oracle for the function f consists of a triple
(g, S, P), where S is a sample space, P is a probability distribution, and g : Rn ×
S → Rn is a mapping that for each x ∈ dom f satisfies
    EP[g(x, S)] = ∫ g(x, s) dP(s) ∈ ∂f(x),
where S ∈ S is a sample drawn from P.
Often, with some abuse of notation, we will use g or g(x) for shorthand of the
random vector g(x, S) when this does not cause confusion.
A standard example for these types of problems is stochastic programming,
where we wish to solve the convex optimization problem
To make the setting (3.4.2) more concrete, consider the robust regression prob-
lem (3.2.12), which uses
    f(x) = (1/m)‖Ax − b‖₁ = (1/m) Σ_{i=1}^m |⟨ai, x⟩ − bi|.
Then a natural stochastic gradient, which requires time only O(n) to compute
(as opposed to O(m · n) to compute Ax − b), is to uniformly at random draw an
index i ∈ [m], then return
    g = ai sign(⟨ai, x⟩ − bi).
More generally, given any problem in which one has a large dataset {s1 , . . . , sm },
and we wish to minimize the sum
    f(x) = (1/m) Σ_{i=1}^m F(x; si),
then drawing an index i ∈ {1, . . . , m} uniformly at random and using g ∈ ∂x F(x; si )
is a stochastic gradient. Computing this stochastic gradient requires only the time
necessary for computing some element of the subgradient set ∂x F(x; si ), while the
standard subgradient method applied to these problems is m-times more expen-
sive in each iteration.
More generally, the expectation E[F(x; S)] is often intractable to compute,
especially if S has a high-dimensional distribution. In statistical and machine learn-
ing applications, we may not even know the distribution P, but we can observe
i.i.d. samples Si ∼ P. In these cases, it may be impossible to even implement the cal-
culation of a subgradient f ′ (x) ∈ ∂f(x), but sampling from P is possible, allowing
us to compute stochastic subgradients.
Stochastic subgradient method With this motivation in place, we can describe
the (projected) stochastic subgradient method. Simply, the method iterates as
follows:
(1) Compute a stochastic subgradient gk at the point xk, where E[gk | xk] ∈
∂f(xk).
(2) Perform the projected subgradient step
    xk+1 = πC(xk − αk gk).
This is essentially identical to the projected gradient method (3.3.1), except that
we replace the true subgradient with a stochastic gradient.
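A minimal sketch of this procedure (an illustration added here, not the authors' code) applies the projected stochastic subgradient method to the robust regression objective f(x) = (1/m) Σ_i |⟨ai, x⟩ − bi| over C = {x : ‖x‖₂ ≤ R}, using the single-sample stochastic gradient g = ai sign(⟨ai, x⟩ − bi) described above; the synthetic data, the value of R, and the stepsize choice are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(5)
m, n, R = 100, 50, 4.0
A = rng.normal(size=(m, n))
b = A @ rng.normal(size=n) + 0.1 * rng.normal(size=m)

f = lambda x: np.mean(np.abs(A @ x - b))

def project(x, R):
    nrm = np.linalg.norm(x)
    return x if nrm <= R else (R / nrm) * x

K = 2000
M = np.sqrt(np.mean(np.sum(A ** 2, axis=1)))    # crude bound with E||g||_2^2 <= M^2
alpha = R / (M * np.sqrt(K))                    # fixed stepsize, as in the text
x, x_avg = np.zeros(n), np.zeros(n)
for k in range(1, K + 1):
    i = rng.integers(m)                         # sample an index uniformly at random
    g = A[i] * np.sign(A[i] @ x - b[i])         # stochastic subgradient
    x = project(x - alpha * g, R)               # projected subgradient step
    x_avg += (x - x_avg) / k                    # running average of the iterates
print("f(average iterate):", f(x_avg))
```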
Figure 3.4.3. Stochastic subgradient method versus non-
stochastic subgradient method performance on problem (3.4.4).
In the next section, we analyze the convergence of the procedure, but here we
give two examples that exhibit some of the typical behavior of these
methods.
Example 3.3 (Robust regression): We consider the robust regression problem (3.2.12),
solving
(3.4.4)    minimize_x  f(x) = (1/m) Σ_{i=1}^m |⟨ai, x⟩ − bi|   subject to ‖x‖₂ ≤ R,
using the random sample g = ai sign(⟨ai, x⟩ − bi) as our stochastic gradient. We
generate A = [a1 · · · am]⊤ by drawing ai ∼ N(0, In×n) i.i.d. and bi = ⟨ai, u⟩ + εi|εi|³,
where εi ∼ N(0, 1) i.i.d. and u is a Gaussian random variable with identity covariance.
We use n = 50, m = 100, and R = 4 for this experiment.
We plot the results of running the stochastic gradient iteration versus stan-
dard projected subgradient descent in Figure 3.4.3; both methods run with the
fixed stepsize α = R/(M√K) for M² = (1/m)‖A‖Fr², which optimizes the convergence
guarantees for the methods. We see in the figure the typical performance of a sto-
chastic gradient method: the initial progress in improving the objective is quite
fast, but the method eventually stops making progress once it achieves some low
accuracy (in this case, 10⁻¹). In this figure we should make clear, however, that
each iteration of the stochastic gradient method requires time O(n), while each
iteration of the (non-noisy) projected gradient method requires time O(n · m), a
factor of approximately 100 times slower. ♦
Example 3.4 (Multiclass support vector machine): Our second example is some-
what more complex. We are given a collection of 16 × 16 grayscale images of
handwritten digits {0, 1, . . . , 9}, and wish to classify images, represented as vec-
tors a ∈ R256 , as one of the 10 digits. In a general k-class classification problem,
we represent the multiclass classifier using the matrix
X = [x1 x2 · · · xk ] ∈ Rn×k ,
where k = 10 for the digit classification problem. Given a data vector a ∈ Rn , the
“score” associated with class l is then ⟨xl, a⟩, and the goal (given image data) is
to find a matrix X assigning high scores to the correct image labels. (In machine
learning, the typical notation is to use weight vectors w1 , . . . , wk ∈ Rn instead of
x1 , . . . , xk , but we use X to remain consistent with our optimization focus.) The
predicted class for a data vector a ∈ Rn is then
    argmax_{l∈[k]} ⟨a, xl⟩ = argmax_{l∈[k]} [Xᵀa]l.
where [t]+ = max{t, 0} denotes the positive part. Then F is convex in X, and for
a pair (a, b) we have F(X; (a, b)) = 0 if and only if the classifier represented by X
Theorem 3.4.7. Let the conditions of the preceding paragraph hold and let αk > 0 be a
non-increasing sequence of stepsizes. Let x̄K = (1/K) Σ_{k=1}^K xk. Then
    E[f(x̄K) − f(x⋆)] ≤ R²/(2KαK) + (1/(2K)) Σ_{k=1}^K αk M².
Proof. The analysis is quite similar to our previous analyses, in that we simply
expand the error ‖xk+1 − x⋆‖₂². Let us define f′(x) := E[g(x, S)] ∈ ∂f(x) to be
the expected subgradient returned by the stochastic gradient oracle, and let ξk =
gk − f′(xk) be the error in the kth subgradient. Then
    (1/2)‖xk+1 − x⋆‖₂² = (1/2)‖πC(xk − αk gk) − x⋆‖₂²
                       ≤ (1/2)‖xk − αk gk − x⋆‖₂²
                       = (1/2)‖xk − x⋆‖₂² − αk⟨gk, xk − x⋆⟩ + (αk²/2)‖gk‖₂²,
as in the proof of Theorems 3.2.7 and 3.3.4. Now, we add and subtract αk⟨f′(xk), xk − x⋆⟩,
which gives
    (1/2)‖xk+1 − x⋆‖₂² ≤ (1/2)‖xk − x⋆‖₂² − αk⟨f′(xk), xk − x⋆⟩ + (αk²/2)‖gk‖₂² − αk⟨ξk, xk − x⋆⟩
                       ≤ (1/2)‖xk − x⋆‖₂² − αk[f(xk) − f(x⋆)] + (αk²/2)‖gk‖₂² − αk⟨ξk, xk − x⋆⟩,
where we have used the standard first-order convexity inequality.
Except for the error term ⟨ξk, xk − x⋆⟩, the proof is completely identical to that
of Theorem 3.3.4. Indeed, dividing each side of the preceding display by αk and
rearranging, we have
    f(xk) − f(x⋆) ≤ (1/(2αk)) [ ‖xk − x⋆‖₂² − ‖xk+1 − x⋆‖₂² ] + (αk/2)‖gk‖₂² − ⟨ξk, xk − x⋆⟩.
Summing this inequality, as in the proof of Theorem 3.3.4 following inequal-
ity (3.3.5), yields that
(3.4.8)    Σ_{k=1}^K [f(xk) − f(x⋆)] ≤ R²/(2αK) + (1/2) Σ_{k=1}^K αk ‖gk‖₂² − Σ_{k=1}^K ⟨ξk, xk − x⋆⟩.
The inequality (3.4.8) is the basic inequality from which all our subsequent con-
vergence guarantees follow.
For this theorem, we need only take expectations, realizing that
    E[⟨ξk, xk − x⋆⟩] = E[ E[⟨g(xk) − f′(xk), xk − x⋆⟩ | xk] ]
                     = E[⟨E[g(xk) | xk] − f′(xk), xk − x⋆⟩] = 0,
because E[g(xk) | xk] = f′(xk). Thus we obtain
    E[ Σ_{k=1}^K (f(xk) − f(x⋆)) ] ≤ R²/(2αK) + (1/2) Σ_{k=1}^K αk M²
once we realize that E[‖gk‖₂²] ≤ M², which gives the desired result.
Theorem 3.4.7 makes it clear that, in expectation, we can achieve the same con-
vergence guarantees as in the non-noisy case. This does not mean that stochastic
subgradient methods are always as good as non-stochastic methods, but it does
show the robustness of the subgradient method even to substantial noise. So
while the subgradient method is very slow, its slowness comes with the benefit
that it can handle large amounts of noise.
We now provide a few corollaries on the convergence of stochastic gradient de-
scent. For background on probabilistic modes of convergence, see Appendix A.2.
Corollary 3.4.9. Let the conditions of Theorem 3.4.7 hold, and let αk = R/(M√k) for
each k. Then
    E[f(x̄K)] − f(x⋆) ≤ 3RM/(2√K)
for all K ∈ N.
The proof of the corollary is identical to that of Corollary 3.3.6 for the projected
gradient method, once we substitute α = R/M in the bound. We can also obtain
convergence in probability of the iterates more generally.
Corollary 3.4.10. Let αk be non-summable but convergent to zero, that is, Σ_{k=1}^∞ αk =
∞ and αk → 0. Then f(x̄K) − f(x⋆) → 0 in probability as K → ∞, that is, for all ǫ > 0 we have
    lim sup_{K→∞} P( f(x̄K) − f(x⋆) ≥ ǫ ) = 0.
Theorem 3.4.11. In addition to the conditions of Theorem 3.4.7, assume that ‖g‖₂ ≤ M
for all stochastic subgradients g. Then for any ǫ > 0,
    f(x̄K) − f(x⋆) ≤ R²/(2KαK) + (M²/(2K)) Σ_{k=1}^K αk + (RM/√K) ǫ
with probability at least 1 − e^{−ǫ²/2}.
Written differently, we see that by taking αk = R/(M√k) and setting δ = e^{−ǫ²/2}, we
have
    f(x̄K) − f(x⋆) ≤ 3MR/√K + (MR/√K) √(2 log(1/δ))
with probability at least 1 − δ. That is, we have convergence of O(MR/√K) with
high probability.
Before providing the proof proper, we discuss two examples in which the
boundedness condition holds. Recall from Lecture 2 that a convex function f
is M-Lipschitz if and only if ‖g‖₂ ≤ M for all g ∈ ∂f(x) and x ∈ Rn, so Theo-
rem 3.4.11 requires that the random functions F(·; S) are Lipschitz over the domain
C. Our robust regression and multiclass support vector machine examples both
satisfy the conditions of the theorem so long as the data is bounded. More pre-
cisely, for the robust regression problem (3.2.12) with loss F(x; (a, b)) = |⟨a, x⟩ − b|,
we have ∂F(x; (a, b)) = a sign(⟨a, x⟩ − b) so that the condition ‖g‖₂ ≤ M holds
if and only if ‖a‖₂ ≤ M. For the multiclass hinge loss problem (3.4.6), with
F(X; (a, b)) = Σ_{l≠b} [1 + ⟨a, xl − xb⟩]₊, Exercise 5 develops the subgradient cal-
culations, but again, we have the boundedness of ∂X F(X; (a, b)) if and only if
a ∈ Rn is bounded.
Proof. We begin with the basic inequality of Theorem 3.4.7, inequality (3.4.8). We
see that we would like to bound the probability that
    Σ_{k=1}^K ⟨ξk, x⋆ − xk⟩
is large. First, we note that the iterate xk is a function of ξ1 , . . . , ξk−1 , and we
have the conditional expectation
E[ξk | ξ1 , . . . , ξk−1 ] = E[ξk | xk ] = 0.
Moreover, using the boundedness assumption that ‖g‖₂ ≤ M, we have ‖ξk‖₂ =
‖gk − f′(xk)‖₂ ≤ 2M and
    |⟨ξk, xk − x⋆⟩| ≤ ‖ξk‖₂ ‖xk − x⋆‖₂ ≤ 2MR.
Thus, the sequence Σ_{k=1}^K ⟨ξk, xk − x⋆⟩ is a bounded difference martingale se-
quence, and we may apply Azuma’s inequality (Theorem A.2.5), which guarantees
    P( Σ_{k=1}^K ⟨ξk, x⋆ − xk⟩ ≥ t ) ≤ exp( −t² / (2KM²R²) )
for all t ≥ 0. Substituting t = MR√K ǫ, we obtain that
    P( (1/K) Σ_{k=1}^K ⟨ξk, x⋆ − xk⟩ ≥ ǫMR/√K ) ≤ exp( −ǫ²/2 ),
as desired.
Summarizing the results of this section, we see a number of consequences.
First, stochastic gradient methods guarantee that after O(1/ǫ²) iterations, we have
error at most f(x) − f(x⋆ ) = O(ǫ). Secondly, this convergence is (at least to the
order in ǫ) the same as in the non-noisy case; that is, stochastic gradient meth-
ods are robust enough to noise that their convergence is hardly affected by it. In
addition to this, they are often applicable in situations in which we cannot even
evaluate the objective f, whether for computational reasons or because we do not
have access to it, as in statistical problems. This robustness to noise and good
performance has led to wide adoption of subgradient-like methods as the de facto
choice for many large-scale data-based optimization problems. In the coming sec-
tions, we give further discussion of the optimality of stochastic gradient methods,
showing that—roughly—when we have access only to noisy data, it is impossi-
ble to solve (certain) problems to accuracy better than ǫ given 1/ǫ² data points;
thus, using more expensive but accurate optimization methods may have limited
benefit (though there may still be some benefit practically!).
Notes and further reading Our treatment in this chapter borrows from a num-
ber of resources. The two heaviest are the lecture notes for Stephen Boyd’s Stan-
ford EE364b course [10, 11] and Polyak’s Introduction to Optimization [47]. Our
guarantees of high probability convergence are similar to those originally de-
veloped by Cesa-Bianchi et al. [16] in the context of online learning, which Ne-
mirovski et al. [40] more fully develop. More references on subgradient methods
include the lecture notes of Nemirovski [43] and Nesterov [44].
A number of extensions of (stochastic) subgradient methods are possible, in-
cluding to online scenarios in which we observe streaming sequences of func-
tions [25, 63]; our analysis in this section follows closely that of Zinkevich [63].
The classic paper of Polyak and Juditsky [48] shows that stochastic gradient de-
scent methods, coupled with averaging, can achieve asymptotically optimal rates
of convergence even to constant factors. Recent work in machine learning by a
number of authors [18, 32, 53] has shown how to leverage the structure of opti-
mization problems based on finite sums, that is, when f(x) = (1/N) Σ_{i=1}^N fi(x), to
develop methods that achieve convergence rates similar to those of interior point
methods but with iteration complexity close to stochastic gradient methods.
4.2. Mirror Descent Methods Our first set of results focuses on mirror descent
methods, which modify the basic subgradient update to use a different distance-
measuring function rather than the squared ℓ2 -term. Before presenting these
methods, we give a few definitions. Let h be a convex function that is differentiable on C. The Bregman divergence associated with h is defined as
(4.2.2) Dh (x, y) = h(x) − h(y) − h∇h(y), x − yi .
The divergence Dh is always nonnegative, by the standard first-order inequality
for convex functions, and measures the gap between the linear approximation
h(y) + h∇h(y), x − yi for h(x) taken from the point y and the value h(x) at x. See
Figure 4.2.1. As one standard example, if we take $h(x) = \frac12\|x\|_2^2$, then $D_h(x,y) = \frac12\|x-y\|_2^2$. A second common example follows by taking the entropy functional $h(x) = \sum_{j=1}^n x_j\log x_j$, restricting $x$ to the probability simplex (i.e. $x \succeq 0$ and $\sum_j x_j = 1$). We then have $D_h(x,y) = \sum_{j=1}^n x_j\log\frac{x_j}{y_j}$, the entropic or Kullback-Leibler divergence.
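To make these two examples concrete, the following short Python sketch (our own illustration, not part of the lectures' code; the function names are ours) evaluates the Bregman divergence (4.2.2) for the squared Euclidean and negative-entropy cases and checks them against the closed forms above.

```python
import numpy as np

def bregman(h, grad_h, x, y):
    """Bregman divergence D_h(x, y) = h(x) - h(y) - <grad h(y), x - y>."""
    return h(x) - h(y) - np.dot(grad_h(y), x - y)

# Example 1: h(x) = (1/2)||x||_2^2 gives D_h(x, y) = (1/2)||x - y||_2^2.
h_sq = lambda x: 0.5 * np.dot(x, x)
grad_sq = lambda x: x

# Example 2: negative entropy on the simplex gives the KL divergence.
h_ent = lambda x: np.sum(x * np.log(x))
grad_ent = lambda x: np.log(x) + 1.0

x = np.array([0.2, 0.3, 0.5])
y = np.array([0.4, 0.4, 0.2])
print(bregman(h_sq, grad_sq, x, y), 0.5 * np.sum((x - y) ** 2))    # both 0.065
print(bregman(h_ent, grad_ent, x, y), np.sum(x * np.log(x / y)))   # equal KL values
```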
Because the quantity (4.2.2) is always non-negative and convex in its first ar-
gument, it is natural to treat it as a distance-like function in the development of
Thus, substituting into the update (4.2.3), we see that the choice $h(x) = \frac12\|x\|_2^2$ recovers the standard (stochastic sub)gradient method
$$x_{k+1} = \operatorname*{argmin}_{x\in C}\bigg\{\langle g_k, x\rangle + \frac{1}{2\alpha_k}\|x - x_k\|_2^2\bigg\}.$$
It is evident that h is strongly convex with respect to the ℓ2 -norm for any con-
straint set C. ♦
Example 4.2 (Solving problems on the simplex with exponentiated gradient methods): Suppose that our constraint set $C = \{x \in \mathbb{R}^n_+ : \langle 1, x\rangle = 1\}$ is the probability simplex in $\mathbb{R}^n$. Then updates with the standard Euclidean distance are somewhat challenging—though there are efficient implementations [14, 23]—and it is natural to ask for a simpler method.
With that in mind, let $h(x) = \sum_{j=1}^n x_j\log x_j$ be the negative entropy, which is convex because it is the sum of convex functions. (The derivatives of $f(t) = t\log t$ are $f'(t) = \log t + 1$ and $f''(t) = 1/t > 0$ for $t > 0$.) Then we have
$$D_h(x,y) = \sum_{j=1}^n\big[x_j\log x_j - y_j\log y_j - (\log y_j + 1)(x_j - y_j)\big] = \sum_{j=1}^n x_j\log\frac{x_j}{y_j} + \langle 1, y - x\rangle = D_{\mathrm{kl}}(x\|y).$$
We assume that the $y_j > 0$, though this is not strictly necessary. Though we have not discussed this, we write the Lagrangian for this problem by introducing Lagrange multipliers $\tau \in \mathbb{R}$ for the equality constraint $\langle 1, x\rangle = 1$ and $\lambda \in \mathbb{R}^n_+$ for the inequality $x \succeq 0$. Then we obtain the Lagrangian
$$L(x,\tau,\lambda) = \langle g, x\rangle + \sum_{j=1}^n\Big(x_j\log\frac{x_j}{y_j} + \tau x_j - \lambda_j x_j\Big) - \tau.$$
Minimizing out $x$ to find the appropriate form for the solution, we take derivatives with respect to $x$ and set them to zero to find
$$0 = \frac{\partial}{\partial x_j}L(x,\tau,\lambda) = g_j + \log x_j + 1 - \log y_j + \tau - \lambda_j,$$
or
$$x_j(\tau,\lambda) = y_j\exp(-g_j - 1 - \tau + \lambda_j).$$
We may take $\lambda_j = 0$, as the latter expression yields all positive $x_j$, and to satisfy the constraint that $\sum_j x_j = 1$, we set $\tau = \log(\sum_j y_je^{-g_j}) - 1$. Thus we have the update
$$x_i = \frac{y_i\exp(-g_i)}{\sum_{j=1}^n y_j\exp(-g_j)}.$$
Rewriting this in terms of the precise update at time $k$ for the mirror descent method, we have for each coordinate $i$ of iterate $k+1$ of the method that
(4.2.4) $\quad x_{k+1,i} = \dfrac{x_{k,i}\exp(-\alpha_k g_{k,i})}{\sum_{j=1}^n x_{k,j}\exp(-\alpha_k g_{k,j})}.$
This is the so-called exponentiated gradient update, also known as entropic mirror
descent.
Later, after stating and proving our main convergence theorems, we will show
that the negative entropy is strongly convex with respect to the ℓ1 -norm, meaning
that our coming convergence guarantees apply. ♦
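As an illustration of the update (4.2.4), the following Python sketch performs entropic mirror descent (exponentiated gradient) over the simplex; the subgradient oracle `subgrad` and the stepsize sequence are placeholders that the user supplies, so this is only a minimal sketch of the method.

```python
import numpy as np

def exponentiated_gradient(subgrad, n, stepsizes, K):
    """Entropic mirror descent (4.2.4) on the probability simplex in R^n.

    subgrad(x) should return a (sub)gradient g_k at the current iterate x;
    stepsizes is any sequence of length >= K of positive stepsizes alpha_k.
    """
    x = np.full(n, 1.0 / n)                 # start at the uniform distribution
    avg = np.zeros(n)
    for k in range(K):
        g = subgrad(x)
        w = x * np.exp(-stepsizes[k] * g)   # multiplicative (exponentiated) update
        x = w / w.sum()                     # renormalize back onto the simplex
        avg += x
    return avg / K                          # averaged iterate \bar{x}_K
```

In practice one would often perform the multiplicative update in the log domain to avoid underflow when $\alpha_k g_k$ has large entries; the version above keeps the correspondence with (4.2.4) as transparent as possible.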
$C = \{x \in \mathbb{R}^n_+ : \langle 1, x\rangle = 1\}$, the probability simplex. In this case, the update with $\ell_p$-norms becomes a problem of solving
$$\operatorname*{minimize}_{x}\ \langle v, x\rangle + \frac12\|x\|_p^2 \quad\text{subject to}\quad \langle 1, x\rangle = 1,\ x \succeq 0,$$
where $v = \alpha_k(p-1)g_k - \phi(x_k)$, and $\varphi$ and $\phi$ are defined as above. An analysis of the Karush-Kuhn-Tucker conditions for this problem (omitted) yields that the solution to the problem is given by finding the $t^\star \in \mathbb{R}$ such that
$$\sum_{j=1}^n \varphi_j\big([-v_j + t^\star]_+\big) = 1 \quad\text{and setting}\quad x_j = \varphi\big([-v_j + t^\star]_+\big).$$
Theorem 4.2.6. Let $\alpha_k > 0$ be any non-increasing sequence of stepsizes, and let the above assumptions hold. Let $x_k$ be generated by the mirror descent iteration (4.2.3). If $D_h(x, x^\star) \le R^2$ for all $x \in C$, then for all $K \in \mathbb{N}$
$$\sum_{k=1}^K[f(x_k) - f(x^\star)] \le \frac{1}{\alpha_K}R^2 + \sum_{k=1}^K\frac{\alpha_k}{2}\|g_k\|_*^2.$$
If $\alpha_k \equiv \alpha$ is constant, then for all $K \in \mathbb{N}$
$$\sum_{k=1}^K[f(x_k) - f(x^\star)] \le \frac{1}{\alpha}D_h(x^\star, x_1) + \frac{\alpha}{2}\sum_{k=1}^K\|g_k\|_*^2.$$
As an immediate consequence of this theorem, we see that if $\bar x_K = \frac1K\sum_{k=1}^K x_k$ or $\bar x_K = \operatorname*{argmin}_{x_k}f(x_k)$, and we have the gradient bound $\|g\|_* \le M$ for all $g \in \partial f(x)$ for $x \in C$, then (say, in the second case) convexity implies
(4.2.7) $\quad f(\bar x_K) - f(x^\star) \le \dfrac{1}{K\alpha}D_h(x^\star, x_1) + \dfrac{\alpha}{2}M^2.$
By comparing with the bound (3.2.9), we see that the mirror descent (non-Euclidean gradient descent) method gives roughly the same type of convergence guarantee as standard subgradient descent. Roughly, we expect the following behavior with a fixed stepsize: a rate of convergence of roughly $1/(\alpha K)$ until we are within a radius $\alpha$ of the optimum, after which mirror descent and subgradient descent essentially jam—they just jump back and forth near the optimum.
Proof. We begin by considering the progress made in a single update of xk , but
whereas our previous proofs all began with a Lyapunov function for the distance
kxk − x⋆ k2 , we use function value gaps instead of the distance to optimality. Us-
ing the first order convexity inequality—i.e. the definition of a subgradient—we
have
f(xk ) − f(x⋆ ) 6 hgk , xk − x⋆ i .
The idea is to show that replacing xk with xk+1 makes the term hgk , xk − x⋆ i
small because of the definition of xk+1 , but xk and xk+1 are close together so that
this is not much of a difference.
First, we add and subtract hgk , xk+1 i to obtain
(4.2.8) f(xk ) − f(x⋆ ) 6 hgk , xk+1 − x⋆ i + hgk , xk − xk+1 i .
Now, we use the first-order necessary and sufficient conditions for optimality of convex optimization problems given by Theorem 2.4.11. Because $x_{k+1}$ solves problem (4.2.3), we have
$$\Big\langle g_k + \alpha_k^{-1}\big(\nabla h(x_{k+1}) - \nabla h(x_k)\big),\ x - x_{k+1}\Big\rangle \ge 0 \quad\text{for all } x \in C.$$
Taking $x = x^\star$, using the definition (4.2.2) of the Bregman divergence to rewrite $\langle\nabla h(x_{k+1}) - \nabla h(x_k), x^\star - x_{k+1}\rangle$, and substituting into inequality (4.2.8) yields
(4.2.9) $\quad f(x_k) - f(x^\star) \le \dfrac{1}{\alpha_k}\big[D_h(x^\star, x_k) - D_h(x^\star, x_{k+1}) - D_h(x_{k+1}, x_k)\big] + \langle g_k, x_k - x_{k+1}\rangle.$
The second insight is that the subtraction of Dh (xk+1 , xk ) allows us to cancel
some of hgk , xk − xk+1 i. To see this, recall the Fenchel-Young inequality, which
states that
$$\langle x, y\rangle \le \frac{\eta}{2}\|x\|^2 + \frac{1}{2\eta}\|y\|_*^2$$
for any pair of dual norms $(\|\cdot\|, \|\cdot\|_*)$ and any $\eta > 0$. To see this, note that by definition of the dual norm, we have $\langle x, y\rangle \le \|x\|\,\|y\|_*$, and for any constants $a, b \in \mathbb{R}$ and $\eta > 0$, we have $0 \le \frac12(\eta^{1/2}a - \eta^{-1/2}b)^2 = \frac{\eta}{2}a^2 + \frac{1}{2\eta}b^2 - ab$, so that $\|x\|\,\|y\|_* \le \frac{\eta}{2}\|x\|^2 + \frac{1}{2\eta}\|y\|_*^2$. In particular, we have
$$\langle g_k, x_k - x_{k+1}\rangle \le \frac{\alpha_k}{2}\|g_k\|_*^2 + \frac{1}{2\alpha_k}\|x_k - x_{k+1}\|^2.$$
The strong convexity assumption on $h$ guarantees $D_h(x_{k+1}, x_k) \ge \frac12\|x_k - x_{k+1}\|^2$, or that
$$-\frac{1}{\alpha_k}D_h(x_{k+1}, x_k) + \langle g_k, x_k - x_{k+1}\rangle \le \frac{\alpha_k}{2}\|g_k\|_*^2.$$
Substituting this into inequality (4.2.9), we have
(4.2.10) $\quad f(x_k) - f(x^\star) \le \dfrac{1}{\alpha_k}\big[D_h(x^\star, x_k) - D_h(x^\star, x_{k+1})\big] + \dfrac{\alpha_k}{2}\|g_k\|_*^2.$
This inequality should look similar to inequality (3.3.5) in the proof of Theo-
rem 3.3.4 on the projected subgradient method in Lecture 3. Indeed, using that
Dh (x⋆ , xk ) 6 R2 by assumption, an identical derivation to that in Theorem 3.3.4
gives the first result of this theorem. For the second when the stepsize is fixed,
note that
$$\sum_{k=1}^K[f(x_k) - f(x^\star)] \le \sum_{k=1}^K\frac{1}{\alpha}\big[D_h(x^\star, x_k) - D_h(x^\star, x_{k+1})\big] + \sum_{k=1}^K\frac{\alpha}{2}\|g_k\|_*^2 = \frac{1}{\alpha}\big[D_h(x^\star, x_1) - D_h(x^\star, x_{K+1})\big] + \sum_{k=1}^K\frac{\alpha}{2}\|g_k\|_*^2,$$
which is the second result.
We briefly provide a few remarks before moving on. As a first remark, all
of the preceding analysis carries through in an almost completely identical fash-
ion in the stochastic case. We state the most basic result, as the extension from
Section 3.4 is essentially straightforward.
Corollary 4.2.11. Let the conditions of Theorem 4.2.6 hold, except that instead of re-
ceiving a vector gk ∈ ∂f(xk ) at iteration k, the vector gk is a stochastic subgradient
satisfying E[gk | xk ] ∈ ∂f(xk ). Then for any non-increasing stepsize sequence αk
(where αk may be chosen dependent on g1 , . . . , gk ),
$$\mathbb{E}\bigg[\sum_{k=1}^K\big(f(x_k) - f(x^\star)\big)\bigg] \le \mathbb{E}\bigg[\frac{R^2}{\alpha_K} + \sum_{k=1}^K\frac{\alpha_k}{2}\|g_k\|_*^2\bigg].$$
Proof. We sketch the result. The proof is identical to that for Theorem 4.2.6, ex-
cept that we replace gk with the particular vector f ′ (xk ) satisfying E[gk | xk ] =
f ′ (xk ) ∈ ∂f(xk ). Then
$$f(x_k) - f(x^\star) \le \langle f'(x_k), x_k - x^\star\rangle = \langle g_k, x_k - x^\star\rangle + \langle f'(x_k) - g_k, x_k - x^\star\rangle,$$
and an identical derivation yields the following analogue of inequality (4.2.10):
$$f(x_k) - f(x^\star) \le \frac{1}{\alpha_k}\big[D_h(x^\star, x_k) - D_h(x^\star, x_{k+1})\big] + \frac{\alpha_k}{2}\|g_k\|_*^2 + \langle f'(x_k) - g_k, x_k - x^\star\rangle.$$
This inequality holds regardless of how we choose $\alpha_k$. Moreover, by iterating expectations, we have
$$\mathbb{E}\big[\langle f'(x_k) - g_k, x_k - x^\star\rangle\big] = \mathbb{E}\big[\langle f'(x_k) - \mathbb{E}[g_k \mid x_k], x_k - x^\star\rangle\big] = 0,$$
which gives the corollary once we follow an identical derivation to Theorem 4.2.6.
Thus, if we have the bound $\mathbb{E}[\|g\|_*^2] \le M^2$ for all stochastic subgradients, then taking $\bar x_K = \frac1K\sum_{k=1}^K x_k$ and $\alpha_k = R/(M\sqrt k)$, we obtain
(4.2.12) $\quad \mathbb{E}[f(\bar x_K) - f(x^\star)] \le \dfrac{RM}{\sqrt K} + \dfrac{R\max_k\mathbb{E}[\|g_k\|_*^2]}{MK}\displaystyle\sum_{k=1}^K\frac{1}{2\sqrt k} \le \dfrac{3RM}{\sqrt K},$
where we have used that $\mathbb{E}[\|g\|_*^2] \le M^2$ and $\sum_{k=1}^K k^{-1/2} \le 2\sqrt K$.
In addition, we can provide concrete convergence guarantees for a few meth-
ods, revisiting our earlier examples. We begin with Example 4.2, exponentiated
gradient descent.
Corollary 4.2.13. Let $C = \{x \in \mathbb{R}^n_+ : \langle 1, x\rangle = 1\}$, and take $h(x) = \sum_{j=1}^n x_j\log x_j$, the negative entropy. Let $x_1 = \frac1n\mathbf{1}$, the vector whose entries are each $1/n$. Then if $\bar x_K = \frac1K\sum_{k=1}^K x_k$, the exponentiated gradient method (4.2.4) with fixed stepsize $\alpha$ guarantees
$$f(\bar x_K) - f(x^\star) \le \frac{\log n}{K\alpha} + \frac{\alpha}{2K}\sum_{k=1}^K\|g_k\|_\infty^2.$$
Proof. To apply Theorem 4.2.6, we must show that the negative entropy h is
strongly convex with respect to the ℓ1 -norm, whose dual norm is the ℓ∞ -norm.
By a Taylor expansion, we know that for any $x, y \in C$, we have
$$h(x) = h(y) + \langle\nabla h(y), x - y\rangle + \frac12(x - y)^\top\nabla^2h(\tilde x)(x - y)$$
for some $\tilde x$ between $x$ and $y$, that is, $\tilde x = tx + (1-t)y$ for some $t \in [0,1]$. Calculating these quantities, this is equivalent to
$$D_{\mathrm{kl}}(x\|y) = D_h(x,y) = \frac12(x - y)^\top\operatorname{diag}\Big(\frac{1}{\tilde x_1},\ldots,\frac{1}{\tilde x_n}\Big)(x - y) = \frac12\sum_{j=1}^n\frac{(x_j - y_j)^2}{\tilde x_j}.$$
Proof. First, we note that $h(x) = \frac{1}{2(p-1)}\|x\|_p^2$ is strongly convex with respect to the $\ell_p$-norm, where $1 < p \le 2$. (Recall Example 4.3 and see Exercise 9.) Moreover, we know that the dual to the $\ell_p$-norm is the conjugate $\ell_q$-norm with $1/p + 1/q = 1$, and thus Theorem 4.2.6 implies that
$$\sum_{k=1}^K[f(x_k) - f(x^\star)] \le \frac{1}{\alpha_K}\sup_{x\in C}D_h(x, x^\star) + \sum_{k=1}^K\frac{\alpha_k}{2}\|g_k\|_q^2.$$
Now, we use that if $C$ is contained in the $\ell_1$-ball of radius $R_1$, then $(p-1)D_h(x,y) \le \|x\|_p^2 + \|y\|_p^2 \le \|x\|_1^2 + \|y\|_1^2 \le 2R_1^2$. Moreover, because $p = 1 + \frac{1}{\log(2n)}$, we have $q = 1 + \log(2n)$, and
$$\|v\|_q \le \|\mathbf{1}\|_q\|v\|_\infty = n^{1/q}\|v\|_\infty \le n^{1/\log(2n)}\|v\|_\infty \le e\|v\|_\infty.$$
Substituting this into the previous display and noting that $1/(p-1) = \log(2n)$ gives the first result. Bounding $\sum_{k=1}^K k^{-1/2} \le 2\sqrt K$ by an integral comparison and using convexity gives the second.
So we see that, in more general cases than the simple simplex constraint af-
forded by the entropic mirror descent (exponentiated gradient) updates, we have
convergence guarantees of order $\sqrt{\log n}/\sqrt K$, which may be substantially faster
than that guaranteed by the standard projected gradient methods.
[Figure 4.2.16: optimality gap $f_k^{\mathrm{best}} - f^\star$ versus iteration for the distance-generating functions $h(x) = \sum_j x_j\log x_j$, $h(x) = \frac12\|x\|_2^2$, and $h(x) = \frac12\|x\|_p^2$.]
where $\epsilon_i \stackrel{\mathrm{iid}}{\sim} N(0, 10^{-2})$, and $m = 20$ and the dimension $n = 3000$. Then we define
$$f(x) := \|Ax - b\|_1 = \sum_{i=1}^m|\langle a_i, x\rangle - b_i|.$$
(See the papers [14, 23] for a full derivation of this expression.) We use stepsizes $\alpha_k = \alpha_0/\sqrt k$, where the initial stepsize $\alpha_0$ is chosen to optimize the convergence guarantee for each of the methods (see the coming section). In Figure 4.2.16, we plot the results of performing the projected gradient method versus the exponentiated gradient (entropic mirror descent) method and a method using the distance-generating function $h(x) = \frac12\|x\|_p^2$ for $p = 1 + 1/\log(2n)$, which can also be shown to be optimal, showing the optimality gap versus iteration count. While all three methods are sensitive to the initial stepsize, the mirror descent method (4.2.4) enjoys faster convergence than the standard gradient-based method.
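A rough Python sketch of this experiment (our own illustration, not the code used for the figure) follows. The sizes $m = 20$, $n = 3000$, the noise level, and the stepsize schedule are taken from the text; how $A$ and $b$ are generated, the constraint being the probability simplex, and the value of $\alpha_0$ are assumptions we make for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 20, 3000                        # sizes from the text
A = rng.standard_normal((m, n))        # assumed: Gaussian data matrix
x_true = rng.exponential(size=n)       # assumed: a point in the simplex
x_true /= x_true.sum()
b = A @ x_true + rng.normal(scale=0.1, size=m)   # eps_i ~ N(0, 10^{-2})

def f(x):
    return np.abs(A @ x - b).sum()     # f(x) = ||Ax - b||_1

def subgrad(x):
    return A.T @ np.sign(A @ x - b)    # a subgradient of f at x

# entropic mirror descent on the simplex with alpha_k = alpha_0 / sqrt(k)
alpha0 = 1.0                           # would be tuned, as described in the text
x = np.full(n, 1.0 / n)
best = f(x)
for k in range(1, 2001):
    g = subgrad(x)
    w = x * np.exp(-(alpha0 / np.sqrt(k)) * g)
    x = w / w.sum()
    best = min(best, f(x))
print("best objective value found:", best)
```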
4.3. Adaptive stepsizes and metrics In our discussion of mirror descent meth-
ods, we assumed we knew enough about the geometry of the problem at hand—
or at least the constraint set—to choose an appropriate metric and associated
distance-generating function h. In other situations, however, it may be advanta-
geous to adapt the metric being used, or at least the stepsizes, to achieve faster
convergence guarantees. We begin by describing a simple scheme for choosing
stepsizes to optimize bounds on convergence, which means one does not need
to know the Lipschitz constants of gradients ahead of time, and then move on to
somewhat more involved schemes that use a distance-generating function of the
type $h(x) = \frac12x^\top Ax$ for some matrix $A$, which may change depending on infor-
mation observed during solution of the problem. We leave proofs of the major
results in these sections to exercises at the end of the lectures.
Adaptive stepsizes Let us begin by recalling the convergence guarantees for
mirror descent in the stochastic case, given by Corollary 4.2.11, which assumes
the stepsize αk used to calculate xk+1 is chosen based on the observed gradients
$g_1, \ldots, g_k$ (it may be specified ahead of time). In this case, taking $\bar x_K = \frac1K\sum_{k=1}^K x_k$,
Proof. In contrast to mirror descent methods, in this proof we return to our classic Lyapunov-based style of proof for standard subgradient methods, looking at the distance $\|x_k - x^\star\|$. Let $\|x\|_A^2 = \langle x, Ax\rangle$ for any positive semidefinite matrix $A$. We claim that
(4.3.6) $\quad \|x_{k+1} - x^\star\|_{H_k}^2 \le \big\|x_k - H_k^{-1}g_k - x^\star\big\|_{H_k}^2,$
the analogue of the fact that projections are non-expansive. This is an immediate consequence of the update (4.3.4): we have that
$$x_{k+1} = \operatorname*{argmin}_{x\in C}\big\|x - (x_k - H_k^{-1}g_k)\big\|_{H_k}^2,$$
which is a Euclidean projection of $x_k - H_k^{-1}g_k$ onto $C$ (in the norm $\|\cdot\|_{H_k}$). Then the standard result that projections are non-expansive (Corollary 2.2.8) gives inequality (4.3.6).
Inequality (4.3.6) is the key to our analysis, as previously. Expanding the square on the right side of the inequality, we obtain
$$\frac12\|x_{k+1} - x^\star\|_{H_k}^2 \le \frac12\big\|x_k - H_k^{-1}g_k - x^\star\big\|_{H_k}^2 = \frac12\|x_k - x^\star\|_{H_k}^2 - \langle g_k, x_k - x^\star\rangle + \frac12\|g_k\|_{H_k^{-1}}^2,$$
and taking expectations we have $\mathbb{E}[\langle g_k, x_k - x^\star\rangle \mid x_k] \ge f(x_k) - f(x^\star)$ by convexity and the fact that $\mathbb{E}[g_k \mid x_k] \in \partial f(x_k)$. Thus
$$\frac12\mathbb{E}\big[\|x_{k+1} - x^\star\|_{H_k}^2\big] \le \frac12\mathbb{E}\big[\|x_k - x^\star\|_{H_k}^2\big] - \mathbb{E}[f(x_k) - f(x^\star)] + \frac12\mathbb{E}\big[\|g_k\|_{H_k^{-1}}^2\big].$$
Rearranging, we have
$$\mathbb{E}[f(x_k) - f(x^\star)] \le \frac12\mathbb{E}\Big[\|x_k - x^\star\|_{H_k}^2 - \|x_{k+1} - x^\star\|_{H_k}^2\Big] + \frac12\mathbb{E}\big[\|g_k\|_{H_k^{-1}}^2\big].$$
Summing this inequality from $k = 1$ to $K$ gives the theorem.
We may specialize the theorem in a number of ways to develop particular algo-
rithms. One specialization, which is convenient because the computational over-
head is fairly small, is to use diagonal matrices Hk . In particular, the AdaGrad
method sets
(4.3.7) $\quad H_k = \dfrac{1}{\alpha}\operatorname{diag}\bigg(\displaystyle\sum_{i=1}^k g_ig_i^\top\bigg)^{1/2},$
where α > 0 is a pre-specified constant (stepsize). In this case, the following
corollary to Theorem 4.3.5 follows. Exercise 10 sketches the proof of the corollary,
which is similar to that of Corollary 4.3.3. In the corollary, recall that $\operatorname{tr}(A) = \sum_{j=1}^n A_{jj}$ is the trace of a matrix.
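As a concrete (if minimal) illustration of the diagonal update (4.3.7), the following Python sketch runs AdaGrad on the box constraint $C = \{x : \|x\|_\infty \le 1\}$, which we choose purely for illustration: because $H_k$ is diagonal, the projection in the $\|\cdot\|_{H_k}$ norm reduces to coordinate-wise clipping. The stochastic subgradient oracle, the stepsize $\alpha$, and the starting point are placeholders.

```python
import numpy as np

def adagrad(subgrad, n, alpha, K, eps=1e-12):
    """Diagonal AdaGrad (4.3.7) on C = {x : ||x||_inf <= 1}.

    subgrad(x, k) returns a stochastic subgradient g_k at iterate x.
    """
    x = np.zeros(n)
    sum_sq = np.zeros(n)                 # running sum of g_i^2, per coordinate
    avg = np.zeros(n)
    for k in range(K):
        g = subgrad(x, k)
        sum_sq += g ** 2
        H = np.sqrt(sum_sq) / alpha      # diagonal of H_k
        x = x - g / (H + eps)            # scaled step x_k - H_k^{-1} g_k
        x = np.clip(x, -1.0, 1.0)        # projection is coordinate-wise for diagonal H_k
        avg += x
    return avg / K
```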
[Figure: optimality gap $f(x_k) - f^\star$ (log scale) versus iteration $k/100$ for AdaGrad and SGD.]
Notes and further reading The mirror descent method was originally devel-
oped by Nemirovski and Yudin [41] in order to more carefully control the norms
of gradients, and associated dual spaces, in first-order optimization methods.
Since their original development, a number of researchers have explored variants
and extensions of their methods. Beck and Teboulle [5] give an analysis of mirror
descent as a non-Euclidean gradient method, which is the approach we take in
this lecture. Nemirovski et al. [40] study mirror descent methods in stochastic
settings, giving high-probability convergence guarantees similar to those we gave
in the previous lecture. Bubeck and Cesa-Bianchi [15] explore the use of mirror
descent methods in the context of bandit optimization problems, where instead of
observing stochastic gradients one observes only random function values f(x) + ε,
where ε is mean-zero noise.
Variable metric methods have a similarly long history. Our simple results with
stepsize selection follow the more advanced techniques of Auer et al. [3] (see es-
pecially their Lemma 3.5), and the AdaGrad method (and our development) is
due to Duchi, Hazan, and Singer [22] and McMahan and Streeter [38]. More gen-
eral metric methods include Shor’s space dilation methods (of which the ellipsoid
method is a celebrated special case), which develop matrices Hk that make new
directions of descent somewhat less correlated with previous directions, allowing
faster convergence in directions toward x⋆ ; see the books of Shor [55, 56] as well
as the thesis of Nedić [39]. Newton methods, which we do not discuss, use scaled
multiples of ∇2 f(xk ) for Hk , while Quasi-Newton methods approximate ∇2 f(xk )
with Hk while using only gradient-based information; for more on these and
other more advanced methods for smooth optimization problems, see the books
of Nocedal and Wright [46] and Boyd and Vandenberghe [12].
5. Optimality Guarantees
Lecture Summary: In this lecture, we provide a framework for demonstrat-
ing the optimality of a number of algorithms for solving stochastic optimiza-
tion problems. In particular, we introduce minimax lower bounds, showing
how techniques for reducing estimation problems to statistical testing prob-
lems allow us to prove lower bounds on optimization.
5.1. Introduction The procedures and algorithms we have presented thus far en-
joy good performance on a number of statistical, machine learning, and stochastic
optimization tasks, and we have provided theoretical guarantees on their perfor-
mance. It is interesting to ask whether it is possible to improve the algorithms, or
in what ways it may be possible to improve them. With that in mind, in this lec-
ture we develop a number of tools for showing optimality—according to certain
metrics—of optimization methods for stochastic problems.
Minimax rates We provide optimality guarantees in the minimax framework for
optimality, which proceeds roughly as follows: we have a collection of possi-
ble problems and an error measure for the performance of a procedure, and we
measure a procedure’s performance by its behavior on the hardest (most difficult)
member of the problem class. We then ask for the best procedure under this
worst-case error measure. Let us describe this more formally in the context of our
stochastic optimization problems, where the goal is to understand the difficulty
where the expectation is taken over the subgradients g(xi , Si , f) returned by the
stochastic oracle and any randomness in the chosen iterates, or query points,
x1 , . . . , xK of the optimization method. Of course, if we only consider this ex-
cess objective value for a fixed function f, then a trivial optimization procedure
achieves excess risk 0: simply return some x ∈ argminx∈C f(x). It is thus impor-
tant to ask for a more uniform notion of risk: we would like the procedure to have
good performance uniformly across all functions f ∈ F, leading us to measure the
performance of a procedure by its worst-case risk
$$\sup_{f\in\mathcal F}\ \mathbb{E}\Big[f\big(\hat x(g_1,\ldots,g_K)\big) - \inf_{x^\star\in C}f(x^\star)\Big],$$
where the supremum is taken over functions f ∈ F (the subgradient oracle g then
implicitly depends on f). An optimal estimator for this metric then gives the
minimax risk for optimizing the family of stochastic optimization problems {f}f∈F
over x ∈ C ⊂ Rn , which is
(5.1.2) $\quad M_K(C,\mathcal F) := \displaystyle\inf_{\hat x_K}\sup_{f\in\mathcal F}\ \mathbb{E}\Big[f\big(\hat x_K(g_1,\ldots,g_K)\big) - \inf_{x^\star\in C}f(x^\star)\Big].$
[Figure 5.1.3: two functions $f_0$ and $f_1$ with separation $\delta = d_{\mathrm{opt}}(f_0, f_1)$ and the set $\{x : f_1(x) \le f_1^\star + \delta\}$.]
the set $C$ be
(5.1.4) $\quad d_{\mathrm{opt}}(f_0, f_1; C) := \sup\Big\{\delta \ge 0 : \text{for any } x \in C,\ f_1(x) \le f_1^\star + \delta \text{ implies } f_0(x) \ge f_0^\star + \delta, \text{ and } f_0(x) \le f_0^\star + \delta \text{ implies } f_1(x) \ge f_1^\star + \delta\Big\}.$
That is, if we have any point x such that fv (x) − f⋆v 6 dopt (f0 , f1 ), then x cannot
optimize f1−v well, i.e. we can only optimize one of the two functions f0 and f1
to accuracy dopt (f0 , f1 ). See Figure 5.1.3 for an illustration of this quantity. For
example, if f1 (x) = (x + c)2 and f0 (x) = (x − c)2 for a constant c 6= 0, then we
have dopt (f1 , f0 ) = c2 .
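A quick check of this claimed value: take $c > 0$ without loss of generality, and suppose $f_1(x) \le f_1^\star + c^2 = c^2$. Then $(x + c)^2 \le c^2$, so $x \in [-2c, 0]$, and hence $f_0(x) = (x - c)^2 \ge c^2 = f_0^\star + c^2$; the symmetric implication holds with $f_0$ and $f_1$ exchanged. On the other hand, for any $\delta > c^2$ the point $x = 0$ satisfies $f_1(0) = c^2 \le f_1^\star + \delta$ while $f_0(0) = c^2 < f_0^\star + \delta$, so the implication fails, and indeed $d_{\mathrm{opt}}(f_1, f_0) = c^2$.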
This separation dopt allows us to give a reduction from optimization to testing
via the canonical hypothesis testing problem, which is defined as follows:
1. Nature chooses an index V ∈ V uniformly at random
2. Conditional on the choice V = v, the procedure observes stochastic subgradi-
ents for the function fv according to the oracle g(xk , Sk , fv ) for i.i.d. Sk .
Then, given the observed subgradients, the goal is to test which of the random
indices v nature chose. Intuitively, if we can optimize fv well—to better than
the separation dopt (fv , fv ′ )—then we can identify the index v. If we can show
this, then we can adapt classical statistical results on optimal hypothesis testing
to lower bound the probability of error in testing whether the data was generated
conditional on V = v.
More formally, we have the following key lower bound. In the lower bound,
we say that a collection of functions {fv }v∈V is δ-separated, where δ > 0, if
(5.1.5) dopt (fv , fv ′ ; C) > δ for each v, v ′ ∈ V with v 6= v ′ .
Then we have the next proposition.
Proposition 5.1.6. Let $V$ be drawn uniformly at random from $\mathcal V$, where $|\mathcal V| < \infty$, and assume the collection $\{f_v\}_{v\in\mathcal V}$ is $\delta$-separated. Then for any optimization procedure $\hat x$ based on the observed subgradients,
$$\frac{1}{|\mathcal V|}\sum_{v\in\mathcal V}\mathbb{E}\big[f_v(\hat x) - f_v^\star\big] \ge \delta\cdot\inf_{\hat v}P(\hat v \ne V),$$
where the distribution P is the joint distribution over the random index V and the ob-
served gradients $g_1,\ldots,g_K$, and the infimum is taken over all testing procedures $\hat v$ based on the observed data.
Definition 5.2.1. Let P and Q be distributions on a space S, and assume that they
are both absolutely continuous with respect to a measure $\mu$ on $S$. The variation distance between $P$ and $Q$ is
$$\|P - Q\|_{\mathrm{TV}} := \sup_{A\subset S}|P(A) - Q(A)| = \frac12\int_S|p(s) - q(s)|\,d\mu(s).$$
The Kullback-Leibler divergence between $P$ and $Q$ is
$$D_{\mathrm{kl}}(P\|Q) := \int_S p(s)\log\frac{p(s)}{q(s)}\,d\mu(s).$$
We can connect the variation distance to binary hypothesis tests via the following
lemma, due to Le Cam. The lemma states that testing between two distributions
is hard precisely when they are close in variation distance.
$$\inf_{\hat v}\big\{P_1(\hat v \ne 1) + P_{-1}(\hat v \ne -1)\big\} = \inf_A\big\{1 - P_1(A) + P_{-1}(A)\big\} = 1 - \sup_A\big\{P_1(A) - P_{-1}(A)\big\} = 1 - \|P_1 - P_{-1}\|_{\mathrm{TV}},$$
as desired.
As an immediate consequence of Lemma 5.2.2, we obtain the standard mini-
max lower bound based on binary hypothesis testing. In particular, let f1 and f−1
be $\delta$-separated and belong to $\mathcal F$, and assume that the method $\hat x$ receives data (in this case, the data is the $K$ subgradients) from $P_v^K$ when $f_v$ is the true function.
Then we immediately have
(5.2.3) $\quad M_K(C,\mathcal F) \ge \displaystyle\inf_{\hat x_K}\max_{v\in\{-1,1\}}\big\{\mathbb{E}_{P_v}[f_v(\hat x_K) - f_v^\star]\big\} \ge \frac{\delta}{2}\Big[1 - \big\|P_1^K - P_{-1}^K\big\|_{\mathrm{TV}}\Big].$
samples. First, we use Pinsker's inequality (see Appendix A.3, Proposition A.3.2 for a proof): for any distributions $P$ and $Q$,
$$\|P - Q\|_{\mathrm{TV}}^2 \le \frac12D_{\mathrm{kl}}(P\|Q).$$
As we see presently, the KL-divergence tensorizes when we have multiple obser-
vations from different distributions (see Lemma 5.2.8 to come), allowing substan-
tially easier computation of individual divergence terms. Then we have the fol-
lowing theorem.
Theorem 5.2.4. Let F be a collection of convex functions, and let f1 , f−1 ∈ F. Assume
that when function fv is to be optimized, we observe K subgradients according to PvK .
Then
$$M_K(C,\mathcal P) \ge \frac{d_{\mathrm{opt}}(f_{-1}, f_1; C)}{2}\bigg[1 - \sqrt{\frac12D_{\mathrm{kl}}\big(P_1^K\|P_{-1}^K\big)}\bigg].$$
What remains to give a concrete lower bound, then, is (1) to construct a family
of well-separated functions f1 , f−1 , and (2) to construct a stochastic gradient or-
acle for which we give a small upper bound on the KL-divergence between the
distributions P1 and P−1 associated with the functions, which means that testing
between P1 and P−1 is hard.
Constructing well-separated functions Our first goal is to construct a family
of well-separated functions and an associated first-order subgradient oracle that
makes the functions hard to distinguish. We parameterize our functions—of
which we construct only 2—by a parameter δ > 0 governing their separation.
Our construction applies in dimension n = 1: let us assume that C contains the
interval [−R, R] (this is no loss of generality, as we may simply shift the interval).
Then define the M-Lipschitz continuous functions
(5.2.5) f1 (x) = Mδ|x − R| and f−1 (x) = Mδ|x + R|.
See Figure 5.2.6 for an example of these functions, which makes clear that their
separation (5.1.4) is
dopt (f1 , f−1 ) = δMR.
We also consider the stochastic oracle for this problem, recalling that we must
construct subgradients satisfying E[kgk22 ] 6 M2 . We will do slightly more: we
will guarantee that |g| 6 M always. With this in mind, we assume that δ 6 1,
and define the stochastic gradient oracle for the distribution $P_v$, $v \in \{-1, 1\}$, at the point $x$ to be
(5.2.7) $\quad g_v(x) = \begin{cases} M\operatorname{sign}(x - vR) & \text{with probability } \frac{1+\delta}{2}\\ -M\operatorname{sign}(x - vR) & \text{with probability } \frac{1-\delta}{2}.\end{cases}$
At $x = vR$ the oracle simply returns a random sign. Then by inspection, we see that
$$\mathbb{E}[g_v(x)] = \frac{M\delta}{2}\operatorname{sign}(x - vR) - \frac{M\delta}{2}\big(-\operatorname{sign}(x - vR)\big) = M\delta\operatorname{sign}(x - vR) \in \partial f_v(x)$$
for $v = -1, 1$. Thus, the combination of the functions (5.2.5) and the stochastic gradients (5.2.7) gives us a valid subgradient oracle and a well-separated pair of functions.

[Figure 5.2.6: the functions $f_{-1}$ and $f_1$ of (5.2.5) on $[-R, R]$, with separation $d_{\mathrm{opt}}(f_{-1}, f_1) = MR\delta$.]
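A small Python sketch (ours, purely illustrative) of the pair of functions (5.2.5) and the oracle (5.2.7) makes the construction concrete; it can be used to simulate the information an optimization method actually receives.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x, v, M, R, delta):
    """The functions (5.2.5): f_v(x) = M * delta * |x - v*R| for v in {-1, +1}."""
    return M * delta * abs(x - v * R)

def oracle(x, v, M, R, delta):
    """Stochastic subgradient oracle (5.2.7) for f_v.

    Returns +M*sign(x - vR) with probability (1+delta)/2 and
            -M*sign(x - vR) with probability (1-delta)/2.
    """
    s = np.sign(x - v * R)
    if s == 0:                                # at x = vR, return a random sign
        s = rng.choice([-1.0, 1.0])
    flip = 1.0 if rng.random() < (1 + delta) / 2 else -1.0
    return M * s * flip

# sanity check: the oracle is unbiased for the subgradient M*delta*sign(x - vR)
M, R, delta, v, x = 1.0, 1.0, 0.2, 1, 0.3
print(np.mean([oracle(x, v, M, R, delta) for _ in range(200000)]))  # approx -0.2
```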
Bounding the distance between distributions The second step in proving our
minimax lower bound is to upper bound the distance between the distributions
that generate the subgradients our methods observe. This means that testing
which of the functions we are optimizing is challenging, giving us a strong lower
bound. At a high level, building off of Theorem 5.2.4, we hope to show an upper
bound of the form
$$D_{\mathrm{kl}}\big(P_1^K\|P_{-1}^K\big) \le \kappa\delta^2$$
for some $\kappa$. This is a local condition, allowing us to scale our problems with $\delta$ to achieve minimax bounds. If we have such a quadratic bound, we may simply choose $\delta^2 = 1/(2\kappa)$, giving the constant probability of error
$$1 - \big\|P_1^K - P_{-1}^K\big\|_{\mathrm{TV}} \ge 1 - \sqrt{D_{\mathrm{kl}}\big(P_1^K\|P_{-1}^K\big)/2} \ge 1 - \sqrt{\frac{\kappa\delta^2}{2}} \ge \frac12.$$
To this end, we begin with a standard lemma (the chain rule for KL diver-
gence), which applies when we have K potentially dependent observations from
a distribution. The result is an immediate consequence of Bayes’ rule.
Lemma 5.2.8. Let $P_v(\cdot\mid g_1,\ldots,g_{k-1})$ denote the conditional distribution of $g_k$ given $g_1,\ldots,g_{k-1}$ under $P_v$. For each $k \in \mathbb{N}$ let $P_1^k$ and $P_{-1}^k$ be the distributions of the first $k$ subgradients $g_1,\ldots,g_k$. Then
$$D_{\mathrm{kl}}\big(P_1^K\|P_{-1}^K\big) = \sum_{k=1}^K\mathbb{E}_{P_1^{k-1}}\Big[D_{\mathrm{kl}}\big(P_1(\cdot\mid g_1,\ldots,g_{k-1})\,\|\,P_{-1}(\cdot\mid g_1,\ldots,g_{k-1})\big)\Big].$$
Using Lemma 5.2.8, we have the following upper bound on the KL-divergence between $P_1^K$ and $P_{-1}^K$ for the stochastic gradient (5.2.7).

Lemma 5.2.9. Let the $K$ observations under distribution $P_v$ come from the stochastic gradient oracle (5.2.7). Then for $\delta \le \frac45$,
$$D_{\mathrm{kl}}\big(P_1^K\|P_{-1}^K\big) \le 3K\delta^2.$$

Proof. We use the chain rule for KL-divergence, whence we must only provide an upper bound on the individual terms. We first note that $x_k$ is a function of $g_1,\ldots,g_{k-1}$ (because we may assume w.l.o.g. that $x_k$ is deterministic) so that $P_v(\cdot\mid g_1,\ldots,g_{k-1})$ is the distribution of a Bernoulli random variable with distribution (5.2.7), i.e. with probabilities $\frac{1\pm\delta}{2}$. Thus we have
$$D_{\mathrm{kl}}\big(P_1(\cdot\mid g_1,\ldots,g_{k-1})\,\|\,P_{-1}(\cdot\mid g_1,\ldots,g_{k-1})\big) \le D_{\mathrm{kl}}\Big(\frac{1+\delta}{2}\,\Big\|\,\frac{1-\delta}{2}\Big) = \frac{1+\delta}{2}\log\frac{1+\delta}{1-\delta} + \frac{1-\delta}{2}\log\frac{1-\delta}{1+\delta} = \delta\log\frac{1+\delta}{1-\delta}.$$
By a Taylor expansion, we have that
$$\delta\log\frac{1+\delta}{1-\delta} = \delta\Big(\delta - \frac12\delta^2 + O(\delta^3)\Big) - \delta\Big(-\delta - \frac12\delta^2 + O(\delta^3)\Big) = 2\delta^2 + O(\delta^4) \le 3\delta^2$$
for $\delta \le \frac45$, or
$$D_{\mathrm{kl}}\big(P_1(\cdot\mid g_1,\ldots,g_{k-1})\,\|\,P_{-1}(\cdot\mid g_1,\ldots,g_{k-1})\big) \le 3\delta^2$$
for $\delta \le \frac45$. Summing over $k$ completes the proof.
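The inequality $\delta\log\frac{1+\delta}{1-\delta} \le 3\delta^2$ used in the proof is also easy to confirm numerically; a short Python check (ours, over a grid of $\delta \le 4/5$) might read as follows.

```python
import numpy as np

d = np.linspace(1e-6, 0.8, 1000)
assert np.all(d * np.log((1 + d) / (1 - d)) <= 3 * d ** 2)  # holds on the whole grid
```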
Putting it all together: a minimax lower bound With Lemma 5.2.9 in place
along with our construction (5.2.5) of well-separated functions, we can now give
a theorem on the best possible convergence guarantees for a broad family of
problems.
Theorem 5.2.10. Let C ⊂ Rn be a convex set containing an ℓ2 ball of radius R, and let P
denote the collection of distributions generating stochastic subgradients with kgk2 6 M
with probability 1. Then
$$M_K(C,\mathcal P) \ge \frac{RM}{4\sqrt6\sqrt K}$$
for all K ∈ N.
Proof. We combine Le Cam’s method, Lemma 5.2.2 (and the subsequent Theo-
rem 5.2.4) with our construction (5.2.5) and their stochastic subgradients (5.2.7).
Certainly, the class of n-dimensional optimization problems is at least as challeng-
ing as a 1-dimensional problem (we may always restrict our functions to depend
only on a single coordinate), so that for any δ > 0 we have
$$M_K(C,\mathcal F) \ge \frac{\delta MR}{2}\bigg[1 - \sqrt{\frac12D_{\mathrm{kl}}\big(P_1^K\|P_{-1}^K\big)}\bigg].$$
Now we use Lemma 5.2.9, which guarantees the further lower bound
$$M_K(C,\mathcal F) \ge \frac{\delta MR}{2}\bigg[1 - \sqrt{\frac{3K\delta^2}{2}}\bigg],$$
valid for all $\delta \le \frac45$. Choosing $\delta^2 = \frac{1}{6K} \le \frac{16}{25}$ (so that $\delta \le \frac45$), we have that $D_{\mathrm{kl}}(P_1^K\|P_{-1}^K) \le \frac12$, and
$$M_K(C,\mathcal F) \ge \frac{\delta MR}{4}.$$
Substituting our choice of δ into this expression gives the theorem.
In short, Theorem 5.2.10 gives a guarantee that matches the upper bounds of
the previous lectures to within a numerical constant factor of 10. A more careful
inspection of our analysis allows us to prove a lower bound, at least as K → ∞,
of $1/(8\sqrt K)$. In particular, by using Theorem 3.4.7 of our lecture on subgradient
methods, we find that if the set C contains an ℓ2 -ball of radius Rinner and is
contained in an ℓ2 -ball of radius Router , we have
(5.2.11) $\quad \dfrac{1}{\sqrt{96}}\dfrac{MR_{\mathrm{inner}}}{\sqrt K} \le M_K(C,\mathcal F) \le \dfrac{MR_{\mathrm{outer}}}{\sqrt K}$
for all K ∈ N, where the upper bound is attained by the stochastic projected
subgradient method.
5.3. Multiple dimensions and Assouad’s Method The results in Section 5.2 pro-
vide guarantees for problems where we can embed much of the difficulty of our
family F in optimizing a pair of only two functions—something reminiscent of
problems in classical statistics on the “hardest one-dimensional subproblem” (see,
for example, the work of Donoho, Liu, and MacGibbon [19]). In many stochas-
tic optimization problems, the higher-dimension n yields increased difficulty, so
that we would like to derive bounds that incorporate dimension more directly.
With that in mind, we develop a family of lower bounds, based on Assouad’s
method [2], that reduce optimization to a collection of binary hypothesis tests,
one for each of the n dimensions of the problem.
More precisely, we let V = {−1, 1}n be the n-dimensional binary hypercube,
and for each v ∈ V, we assume we have a function fv ∈ F where fv : Rn → R.
Without loss of generality, we will assume that our constraint set C has the point
0 in its interior. Let $\delta \in \mathbb{R}^n_+$ be an $n$-dimensional nonnegative vector. Then we say that the functions $\{f_v\}$ induce a $\delta$-separation in the Hamming metric if for any $x \in C \subset \mathbb{R}^n$ we have
(5.3.1) $\quad f_v(x) - f_v^\star \ge \displaystyle\sum_{j=1}^n\delta_j1\{\operatorname{sign}(x_j) \ne v_j\},$
where the subscript $j$ denotes the $j$th coordinate. For example, if we define the function $f_v(x) = \delta\|x - v\|_1$ for each $v \in \mathcal V$, then certainly $\{f_v\}$ is $\delta\mathbf{1}$-separated in the Hamming metric; more generally, $f_v(x) = \sum_{j=1}^n\delta_j|x_j - v_j|$ is $\delta$-separated.
With this definition, we have the following lemma, providing a lower bound for
functions f : Rn → R.
Then
$$\frac{1}{2^n}\sum_{v\in\{-1,1\}^n}\mathbb{E}\big[f_v(\hat x) - f_v^\star\big] \ge \frac12\sum_{j=1}^n\delta_j\big(1 - \|P_{+j} - P_{-j}\|_{\mathrm{TV}}\big).$$
Now we use Le Cam's lemma (Lemma 5.2.2) on optimal binary hypothesis tests to see that
$$P_{+j}\big(\operatorname{sign}(\hat x_j) \ne 1\big) + P_{-j}\big(\operatorname{sign}(\hat x_j) \ne -1\big) \ge 1 - \|P_{+j} - P_{-j}\|_{\mathrm{TV}},$$
which gives the desired result.
As a nearly immediate consequence of Lemma 5.3.2, we see that if the separa-
tion is a constant δ > 0 for each coordinate, we have the following lower bound
on the minimax risk.
Proposition 5.3.3. Let the collection {fv }v∈V ⊂ F, where V = {−1, 1}n , be δ-separated
in Hamming metric for some δ ∈ R+ , and let the conditions of Lemma 5.3.2 hold. Then
$$M_K(C,\mathcal F) \ge \frac{n\delta}{2}\Bigg[1 - \sqrt{\frac{1}{2n}\sum_{j=1}^nD_{\mathrm{kl}}\big(P_{+j}\|P_{-j}\big)}\Bigg].$$
by Pinsker’s inequality. Substituting this into the previous bound gives the de-
sired result.
With this proposition, we can give a number of minimax lower bounds. We
focus on two concrete cases, which show that the stochastic gradient procedures
we have developed are optimal for a variety of problems. We give one result,
deferring others to the exercises associated with the lecture notes. For our main
result using Assouad’s method, we consider optimization problems for which the
set C ⊂ Rn contains an ℓ∞ ball of radius R. We also assume that the stochastic
gradient oracle satisfies the ℓ1 -bound condition
$$\mathbb{E}\big[\|g(x, S, f)\|_1^2\big] \le M^2.$$
This means that all the functions f ∈ F are M-Lipschitz continuous with respect
to the ℓ∞ -norm, that is, |f(x) − f(y)| 6 M kx − yk∞ .
Theorem 5.3.4. Let F and the stochastic gradient oracle be as above, and assume C ⊃
$[-R, R]^n$. Then
$$M_K(C,\mathcal F) \ge RM\min\bigg\{\frac15,\ \frac{1}{\sqrt{96}}\frac{\sqrt n}{\sqrt K}\bigg\}.$$
Proof. Our proof is similar to our construction of our earlier lower bounds, except
that now we must construct functions defined on Rn so that our minimax lower
bound on convergence rate grows with the dimension. Let δ > 0 be fixed for now.
For each $v \in \mathcal V = \{-1,1\}^n$, define the function
$$f_v(x) := \frac{M\delta}{n}\|x - Rv\|_1.$$
Then by inspection, the collection $\{f_v\}$ is $\frac{MR\delta}{n}$-separated in the Hamming metric, as
$$f_v(x) = \frac{M\delta}{n}\sum_{j=1}^n|x_j - Rv_j| \ge \frac{M\delta}{n}\sum_{j=1}^nR\,1\{\operatorname{sign}(x_j) \ne v_j\}.$$
the equality
$$\mathbb{E}[g(x, f_v)] = M\sum_{j=1}^ne_j\Big[\frac{1+\delta}{2n}\operatorname{sign}(x_j - Rv_j) - \frac{1-\delta}{2n}\operatorname{sign}(x_j - Rv_j)\Big] = \frac{M\delta}{n}\operatorname{sign}(x - Rv).$$
It remains to upper bound the KL-divergence terms. Let PvK denote the distribu-
tion of the K subgradients the method observes for the function fv , and let v(±j)
denote the vector v except that its jth entry is forced to be ±1. Then, we may use
the convexity of the KL-divergence to obtain that
$$D_{\mathrm{kl}}\big(P_{+j}\|P_{-j}\big) \le \frac{1}{2^n}\sum_{v\in\mathcal V}D_{\mathrm{kl}}\big(P_{v(+j)}^K\,\|\,P_{v(-j)}^K\big).$$
Let us thus bound $D_{\mathrm{kl}}(P_v^K\|P_{v'}^K)$ when $v$ and $v'$ differ in only a single coordinate (we let it be the first coordinate with no loss of generality). Let us assume for notational simplicity $M = 1$ for the next calculation, as this only changes the support of the subgradient distribution (5.3.5) but not any divergences. Applying the chain rule (Lemma 5.2.8), we have
$$D_{\mathrm{kl}}\big(P_v^K\|P_{v'}^K\big) = \sum_{k=1}^K\mathbb{E}_{P_v}\Big[D_{\mathrm{kl}}\big(P_v(\cdot\mid g_{1:k-1})\,\|\,P_{v'}(\cdot\mid g_{1:k-1})\big)\Big].$$
We consider one of the terms, noting that the kth query xk is a function of
g1 , . . . , gk−1 . We have
Choosing $\delta^2 = \min\{16/25,\ n/(4K)\}$ gives the result of the theorem.
A few remarks are in order about the theorem. First, we see that it recovers
the 1-dimensional result of Theorem 5.2.10, as we may simply take n = 1 in
the theorem statement. Second, we see that if we wish to optimize over a set
larger than the ℓ2 -ball, then there must necessarily be some dimension-dependent
penalty, at least in the worst case. Lastly, the result again is sharp. By using
Theorem 3.4.7, we obtain the following corollary.
A. Technical Appendices
A.1. Continuity of Convex Functions In this appendix, we provide proofs of
the basic continuity results for convex functions. Our arguments are based on
those of Hiriart-Urruty and Lemaréchal [27].
Proof of Lemma 2.3.1 We can write $x \in B_1$ as $x = \sum_{i=1}^nx_ie_i$, where the $e_i$ are the standard basis vectors and $\sum_{i=1}^n|x_i| \le 1$. Thus, we have
$$f(x) = f\bigg(\sum_{i=1}^ne_ix_i\bigg) = f\bigg(\sum_{i=1}^n|x_i|\operatorname{sign}(x_i)e_i + (1 - \|x\|_1)\cdot0\bigg) \le \sum_{i=1}^n|x_i|f\big(\operatorname{sign}(x_i)e_i\big) + (1 - \|x\|_1)f(0) \le \max\big\{f(e_1), f(-e_1), f(e_2), f(-e_2), \ldots, f(e_n), f(-e_n), f(0)\big\}.$$
The first inequality uses the fact that the $|x_i|$ and $(1 - \|x\|_1)$ form a convex combination, since $x \in B_1$, as does the second.
For the lower bound, note that because $x \in \operatorname{int}B_1$ satisfies $x \in \operatorname{int}\operatorname{dom}f$, we have $\partial f(x) \ne \emptyset$ by Theorem 2.4.3. In particular, there is a vector $g$ such that $f(y) \ge f(x) + \langle g, y - x\rangle$ for all $y$, and even more,
$$f(y) \ge f(x) + \inf_{y\in B_1}\langle g, y - x\rangle \ge f(x) - 2\|g\|_\infty$$
for all $y \in B_1$.
Proof of Theorem 2.3.2 First, let us suppose that for each point $x_0 \in C$, there exists an open ball $B \subset \operatorname{int}\operatorname{dom}f$ such that
(A.1.1) $\quad |f(x) - f(x')| \le L\|x - x'\|_2 \quad\text{for all } x, x' \in B.$
The collection of such balls $B$ covers $C$, and as $C$ is compact, there exists a finite subcover $B_1,\ldots,B_k$ with associated Lipschitz constants $L_1,\ldots,L_k$. Take $L = \max_iL_i$ to obtain the result. It thus remains to show that we can construct balls satisfying the Lipschitz condition (A.1.1) at each point $x_0 \in C$.
With that in mind, we use Lemma 2.3.1, which shows that for each point $x_0$, there is some $\epsilon > 0$ and $-\infty < m \le M < \infty$ such that
$$-\infty < m \le \inf_{v:\|v\|_2\le2\epsilon}f(x_0 + v) \le \sup_{v:\|v\|_2\le2\epsilon}f(x_0 + v) \le M < \infty.$$
We make the following claim, from which the condition (A.1.1) evidently follows based on the preceding display.

Lemma A.1.2. Let $\epsilon > 0$, let $f$ be convex, and let $B = \{v : \|v\|_2 \le 1\}$. Suppose that $f(x) \in [m, M]$ for all $x \in x_0 + 2\epsilon B$. Then
$$|f(x) - f(x')| \le 2\frac{M - m}{\epsilon}\|x - x'\|_2 \quad\text{for all } x, x' \in x_0 + \epsilon B.$$

Proof. Let $x, x' \in x_0 + \epsilon B$. Let
$$x'' = x' + \epsilon\frac{x' - x}{\|x' - x\|_2} \in x_0 + 2\epsilon B,$$
Without loss of generality, we assume $\mathbb{E}[X] = 0$ and $0 \in [a, b]$. Let $\psi(\lambda) = \log\mathbb{E}[e^{\lambda X}]$. Then
$$\psi'(\lambda) = \frac{\mathbb{E}[Xe^{\lambda X}]}{\mathbb{E}[e^{\lambda X}]} \quad\text{and}\quad \psi''(\lambda) = \frac{\mathbb{E}[X^2e^{\lambda X}]}{\mathbb{E}[e^{\lambda X}]} - \frac{\mathbb{E}[Xe^{\lambda X}]^2}{\mathbb{E}[e^{\lambda X}]^2}.$$
Note that $\psi'(0) = \mathbb{E}[X] = 0$. Let $P$ denote the distribution of $X$, and assume without loss of generality that $X$ has a density $p$.$^5$ Define the random variable $Y$ to have the shifted density $f$ defined by
$$f(y) = \frac{e^{\lambda y}}{\mathbb{E}[e^{\lambda X}]}p(y)$$
for $y \in \mathbb{R}$, where $p(y) = 0$ for $y \notin [a, b]$. Then $\mathbb{E}[Y] = \psi'(\lambda)$ and $\operatorname{Var}(Y) = \mathbb{E}[Y^2] - \mathbb{E}[Y]^2 = \psi''(\lambda)$. But of course, we know that $Y \in [a, b]$ because the distribution $P$ of $X$ is supported on $[a, b]$, so that
$$\psi''(\lambda) = \operatorname{Var}(Y) \le \frac{(b - a)^2}{4}$$
by inequality (A.2.4). Using Taylor's theorem, we have that
$$\psi(\lambda) = \psi(0) + \underbrace{\psi'(0)}_{=0}\lambda + \frac{\lambda^2}{2}\psi''(\tilde\lambda) = \psi(0) + \frac{\lambda^2}{2}\psi''(\tilde\lambda)$$
for some $\tilde\lambda$ between $0$ and $\lambda$. But $\psi''(\tilde\lambda) \le \frac{(b-a)^2}{4}$, so that $\psi(\lambda) \le \frac{\lambda^2}{2}\frac{(b-a)^2}{4}$ as desired.
Proof. We prove the upper tail, as the lower tail is similar. The proof is a nearly
immediate consequence of Hoeffding’s lemma (Lemma A.2.3) and the Chernoff
bound technique. Indeed, we have
$$P\bigg(\sum_{i=1}^nX_i \ge t\bigg) \le \mathbb{E}\bigg[\exp\bigg(\lambda\sum_{i=1}^nX_i\bigg)\bigg]\exp(-\lambda t)$$
5We may assume there is a dominating base measure µ with respect to which P has a density p.
for all $\lambda \ge 0$. Now, letting $Z_i$ be the sequence to which the $X_i$ are adapted, we iterate conditional expectations. We have
$$\mathbb{E}\bigg[\exp\bigg(\lambda\sum_{i=1}^nX_i\bigg)\bigg] = \mathbb{E}\bigg[\mathbb{E}\bigg[\exp\bigg(\lambda\sum_{i=1}^{n-1}X_i\bigg)\exp(\lambda X_n)\ \Big|\ Z_{n-1}\bigg]\bigg] = \mathbb{E}\bigg[\exp\bigg(\lambda\sum_{i=1}^{n-1}X_i\bigg)\mathbb{E}\big[\exp(\lambda X_n)\mid Z_{n-1}\big]\bigg] \le \mathbb{E}\bigg[\exp\bigg(\lambda\sum_{i=1}^{n-1}X_i\bigg)\bigg]e^{\frac{\lambda^2B^2}{8}}$$
because $X_1,\ldots,X_{n-1}$ are functions of $Z_{n-1}$. By iteratively applying this calculation, we arrive at
(A.2.6) $\quad \mathbb{E}\Big[\exp\Big(\lambda\sum_{i=1}^nX_i\Big)\Big] \le \exp\Big(\dfrac{\lambda^2nB^2}{8}\Big).$
Now we optimize by choosing $\lambda \ge 0$ to minimize the upper bound that inequality (A.2.6) provides, namely
$$P\bigg(\sum_{i=1}^nX_i \ge t\bigg) \le \inf_{\lambda\ge0}\exp\bigg(\frac{\lambda^2nB^2}{8} - \lambda t\bigg) = \exp\bigg(-\frac{2t^2}{nB^2}\bigg)$$
by taking $\lambda = 4t/(nB^2)$.
A.3. Auxiliary results on divergences We present a few standard results on
divergences without proof, referring to standard references (e.g. the book of Cover
and Thomas [17] or the extensive paper on divergence measures by Liese and
Vajda [35]). Nonetheless, we state and prove a few results. The first is known as
the data processing inequality, and it says that processing a random variable (even
adding noise to it) can only make distributions closer together. See Cover and
Thomas [17] or Theorem 14 of Liese and Vajda [35] for a proof.
where $[t]_+ = \max\{t, 0\}$ denotes the positive part. We will use stochastic gradient descent to attempt to minimize
$$f(X) := \mathbb{E}_P[F(X; (A, B))] = \int F(X; (a, b))\,dP(a, b),$$
where the expectation is taken over pairs $(A, B)$.
(a) Show that F is convex.
(b) Show that F(X; (a, b)) = 0 if and only if the classifier represented by X has a
large margin, meaning that
ha, xb i > ha, xl i + 1 for all l 6= b.
(c) For a pair (a, b), give a way to calculate a vector G ∈ ∂F(X; (a, b)) (note that
G ∈ Rd×k ).
Question 6: In this problem, you will perform experiments to explore the per-
formance of stochastic subgradient methods for classification problems, specif-
ically, a handwritten digit recognition problem using zip code data from the
United States Postal Service (this data is taken from the book [24], originally
due to Yann Le Cun). The data—training data zip.train, test data zip.test,
and information file zip.inf—are available for download from the zipped tar
file http://web.stanford.edu/~jduchi/PCMIConvex/ZIPCodes.tgz. Starter code is
available for julia and Matlab at the following urls.
i. For Julia: http://web.stanford.edu/~jduchi/PCMIConvex/sgd.jl
ii. For Matlab: http://web.stanford.edu/~jduchi/PCMIConvex/matlab.tgz
There are two methods left un-implemented in the starter code: the sgd method
and the MulticlassSVMSubgradient method. Implement these methods (you may
find the code for unit-testing the multiclass SVM subgradient useful to double
check your implementation). For the SGD method, your stepsizes should be
proportional to $\alpha_i \propto 1/\sqrt i$, and you should project $X$ to the Frobenius norm ball
$$B_r := \{X \in \mathbb{R}^{d\times k} : \|X\|_{\mathrm{Fr}} \le r\}, \quad\text{where } \|X\|_{\mathrm{Fr}}^2 = \sum_{ij}X_{ij}^2.$$
We have implemented a pre-processing step that also kernelizes the data representation. Let the function $K(a, a') = \exp(-\frac{1}{2\tau}\|a - a'\|_2^2)$. Then the kernelized data representation transforms each datapoint $a \in \mathbb{R}^d$ into a vector
$$\phi(a) = \begin{bmatrix}K(a, a_{i_1}) & K(a, a_{i_2}) & \cdots & K(a, a_{i_m})\end{bmatrix}^\top,$$
where $i_1,\ldots,i_m$ is a random subset of $\{1,\ldots,N\}$ (see GetKernelRepresentation).
Once you have implemented the sgd and MulticlassSVMSubgradient methods,
use the method RunExperiment (Julia/Matlab). What performance do you get in
classification? Which digits is your classifier most likely to confuse?
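For reference, a kernelized feature map of the kind the pre-processing step computes might look like the following Python sketch. The actual GetKernelRepresentation in the starter code is in Julia/Matlab; this is only our illustration of the map described above, and the value of $\tau$, the subset size, and the seed are assumptions.

```python
import numpy as np

def kernel_representation(X, tau, m, seed=0):
    """Map each row a of X (an N x d data matrix) to
    phi(a) = (K(a, a_{i_1}), ..., K(a, a_{i_m})), with
    K(a, a') = exp(-||a - a'||_2^2 / (2 * tau)), for a random subset i_1, ..., i_m.
    """
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    idx = rng.choice(N, size=m, replace=False)      # random anchor points
    anchors = X[idx]                                # shape (m, d)
    # squared distances via ||a||^2 + ||a'||^2 - 2<a, a'>, clipped at 0 for safety
    sq = (X ** 2).sum(axis=1)[:, None] + (anchors ** 2).sum(axis=1)[None, :] - 2.0 * X @ anchors.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * tau)), idx

# usage sketch: Phi, idx = kernel_representation(train_X, tau=10.0, m=500)
```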
Here we have assumed that Dh (x⋆ , xk ) 6 R2 for all k. We now use this inequality
to prove Corollary 4.3.3.
Now apply an argument similar to that used in Example 4.2 to show the strong convexity of $h(x) = \sum_jx_j\log x_j$, but applying Hölder's inequality instead of Cauchy-Schwarz.
(d) Suppose that the domain C = {x : kxk∞ 6 1}. What is the expected regret of
AdaGrad? Show that (to a numerical constant factor we ignore) this expected
regret is always smaller than the expected regret bound for standard projected
gradient descent, which is
$$\mathbb{E}\bigg[\sum_{k=1}^K\big(f(x_k) - f(x^\star)\big)\bigg] \le O(1)\sup_{x\in C}\|x - x^\star\|_2\,\bigg(\mathbb{E}\bigg[\sum_{k=1}^K\|g_k\|_2^2\bigg]\bigg)^{1/2}.$$
Hint: Use Cauchy-Schwarz.
(e) As in the previous sub-question, assume that C = {x : kxk∞ 6 1}. Suppose
that the subgradients are such that gk ∈ {−1, 0, 1}n for all k, and that for each
coordinate j we have P(gk,j 6= 0) = pj . Show that AdaGrad has convergence
guarantee
$$\mathbb{E}\bigg[\sum_{k=1}^K\big(f(x_k) - f(x^\star)\big)\bigg] \le \frac{3\sqrt K}{2}\sum_{j=1}^n\sqrt{p_j}.$$
$$d_{\mathrm{opt}}(f_{-1}, f_1; C) := \sup\Big\{\delta \ge 0 : \text{for any } x \in C,\ f_1(x) \le f_1^\star + \delta \text{ implies } f_{-1}(x) \ge f_{-1}^\star + \delta, \text{ and } f_{-1}(x) \le f_{-1}^\star + \delta \text{ implies } f_1(x) \ge f_1^\star + \delta\Big\}.$$
When $C = \mathbb{R}$ (or, more generally, as long as $C \supset [-\delta, \delta]$), show that
$$d_{\mathrm{opt}}(f_{-1}, f_1; C) \ge \frac{\lambda}{2}\delta^2.$$
(b) Show that the Kullback-Leibler divergence between two normal distributions $P_1 = N(\mu_1, \sigma^2)$ and $P_2 = N(\mu_2, \sigma^2)$ is
$$D_{\mathrm{kl}}(P_1\|P_2) = \frac{(\mu_1 - \mu_2)^2}{2\sigma^2}.$$
(c) Use Le Cam’s method to show the following lower bound for stochastic opti-
mization: for any optimization procedure b xK using K noisy gradient evalua-
tions,
$$\max_{v\in\{-1,1\}}\mathbb{E}_{P_v}\big[f_v(\hat x_K) - f_v^\star\big] \ge \frac{\sigma^2}{32\lambda K}.$$
Compare the result with the regret upper bound in problem 7. Hint: If $P_v^K$ denotes the distribution of the $K$ noisy gradients for function $f_v$, show that
$$D_{\mathrm{kl}}\big(P_1^K\|P_{-1}^K\big) \le \frac{2K\lambda^2\delta^2}{\sigma^2}.$$
Question 12: Let C = {x ∈ Rn : kxk∞ 6 1}, and consider the collection of
functions F where the stochastic gradient oracle g : Rn × S × F → {−1, 0, 1}n
satisfies
P(gj (x, S, f) 6= 0) 6 pj
for each coordinate j = 1, 2, . . . , n. Show that, for large enough K ∈ N, a minimax
lower bound for this class of functions and the given stochastic oracle is
$$M_K(C,\mathcal F) \ge \frac{c}{\sqrt K}\sum_{j=1}^n\sqrt{p_j},$$
where c > 0 is a numerical constant. How does this compare to the convergence
guarantee that AdaGrad gives?
References
[1] Alekh Agarwal, Peter L. Bartlett, Pradeep Ravikumar, and Martin J. Wainwright, Information-
theoretic lower bounds on the oracle complexity of convex optimization, IEEE Transactions on Informa-
tion Theory 58 (2012), no. 5, 3235–3249. ←74
[2] P. Assouad, Deux remarques sur l’estimation, Comptes Rendus des Séances de l’Académie des
Sciences, Série I 296 (1983), no. 23, 1021–1024. ←70
[3] P. Auer, N. Cesa-Bianchi, and C. Gentile, Adaptive and self-confident on-line learning algorithms,
Journal of Computer and System Sciences 64 (2002), no. 1, 48–75. ←60
[4] K. Azuma, Weighted sums of certain dependent random variables, Tohoku Mathematical Journal 68
(1967), 357–367. ←77
[5] A. Beck and M. Teboulle, Mirror descent and nonlinear projected subgradient methods for convex opti-
mization, Operations Research Letters 31 (2003), 167–175. ←59
[6] Aharon Ben-Tal, Laurent El Ghaoui, and Arkadi Nemirovski, Robust optimization, Princeton Uni-
versity Press, 2009. ←4
[7] D. P. Bertsekas, Stochastic optimization problems with nondifferentiable cost functionals, Journal of
Optimization Theory and Applications 12 (1973), no. 2, 218–231. ←22
[8] Dimitri P. Bertsekas, Convex optimization theory, Athena Scientific, 2009. ←3, 24
[9] D.P. Bertsekas, Nonlinear programming, Athena Scientific, 1999. ←3
[10] Stephen Boyd, John Duchi, and Lieven Vandenberghe, Subgradients, 2015. Course notes for Stan-
ford Course EE364b. ←43
[11] Stephen Boyd and Almir Mutapcic, Stochastic subgradient meth-
ods, 2007. Course notes for EE364b at Stanford, available at
http://www.stanford.edu/class/ee364b/notes/stoch_subgrad_notes.pdf. ←43
[12] Stephen Boyd and Lieven Vandenberghe, Convex optimization, Cambridge University Press, 2004.
←3, 20, 24, 25, 60
[13] Gábor Braun, Cristóbal Guzmán, and Sebastian Pokutta, Lower bounds on the oracle complexity of
nonsmooth convex optimization via information theory, IEEE Transactions on Information Theory 63
(2017), no. 7. ←74
[14] P. Brucker, An O(n) algorithm for quadratic knapsack problems, Operations Research Letters 3
(1984), no. 3, 163–166. ←33, 46, 54
[15] Sébastien Bubeck and Nicoló Cesa-Bianchi, Regret analysis of stochastic and nonstochastic multi-
armed bandit problems, Foundations and Trends in Machine Learning 5 (2012), no. 1, 1–122. ←60
[16] N. Cesa-Bianchi, A. Conconi, and C. Gentile, On the generalization ability of on-line learning algo-
rithms, IEEE Transactions on Information Theory 50 (September 2004), no. 9, 2050–2057. ←43
[17] Thomas M. Cover and Joy A. Thomas, Elements of information theory, second edition, Wiley, 2006.
←74, 78
[18] Aaron Defazio, Francis Bach, and Simon Lacoste-Julien, SAGA: A fast incremental gradient method
with support for non-strongly convex composite objectives, Advances in neural information processing
systems 27, 2014. ←43
[19] David L. Donoho, Richard C. Liu, and Brenda MacGibbon, Minimax risk over hyperrectangles, and
implications, Annals of Statistics 18 (1990), no. 3, 1416–1437. ←70
[20] D.L. Donoho, Compressed sensing, Technical report, Stanford University, 2006. ←31
[21] John C. Duchi, Stats311/EE377: Information theory and statistics, 2015. ←74
[22] John C. Duchi, Elad Hazan, and Yoram Singer, Adaptive subgradient methods for online learning and
stochastic optimization, Journal of Machine Learning Research 12 (2011), 2121–2159. ←56, 60
[23] John C. Duchi, Shai Shalev-Shwartz, Yoram Singer, and Tushar Chandra, Efficient projections onto
the ℓ1 -ball for learning in high dimensions, Proceedings of the 25th international conference on
machine learning, 2008. ←33, 46, 54
[24] Trevor Hastie, Robert Tibshirani, and Jerome Friedman, The elements of statistical learning, Second,
Springer, 2009. ←81
[25] Elad Hazan, The convex optimization approach to regret minimization, Optimization for machine
learning, 2012. ←43
[26] Elad Hazan, Introduction to online convex optimization, Foundations and Trends in Optimization 2
(2016), no. 3–4, 157–325. ←4
[27] J. Hiriart-Urruty and C. Lemaréchal, Convex analysis and minimization algorithms I, Springer, New
York, 1993. ←3, 21, 24, 74
[28] J. Hiriart-Urruty and C. Lemaréchal, Convex Analysis and Minimization Algorithms II, Springer,
New York, 1993. ←3, 24
[29] Jean-Baptiste Hiriart-Urruty and Claude Lemaréchal, Fundamentals of convex analysis, Springer,
2001. ←24
[30] W. Hoeffding, Probability inequalities for sums of bounded random variables, Journal of the American
Statistical Association 58 (March 1963), no. 301, 13–30. ←76
[31] I. A. Ibragimov and R. Z. Has’minskii, Statistical estimation: Asymptotic theory, Springer-Verlag,
1981. ←4, 62, 74
[32] Rie Johnson and Tong Zhang, Accelerating stochastic gradient descent using predictive variance reduc-
tion, Advances in neural information processing systems 26, 2013. ←43
[33] Lucien Le Cam, Asymptotic methods in statistical decision theory, Springer-Verlag, 1986. ←4, 62, 65
[34] Erich L. Lehmann and George Casella, Theory of point estimation, second edition, Springer, 1998.
←62
[35] Friedrich Liese and Igor Vajda, On divergences and informations in statistics and information theory,
IEEE Transactions on Information Theory 52 (2006), no. 10, 4394–4412. ←78
[36] David Luenberger, Optimization by vector space methods, Wiley, 1969. ←24
[37] Jerrold Marsden and Michael Hoffman, Elementary classical analysis, second edition, W.H. Freeman,
1993. ←3
[38] Brendan McMahan and Matthew Streeter, Adaptive bound optimization for online convex optimiza-
tion, Proceedings of the twenty third annual conference on computational learning theory, 2010.
←60
[39] Angelia Nedić, Subgradient methods for convex minimization, Ph.D. Thesis, 2002. ←60
[40] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro, Robust stochastic approximation approach to
stochastic programming, SIAM Journal on Optimization 19 (2009), no. 4, 1574–1609. ←43, 60
[41] A. Nemirovski and D. Yudin, Problem complexity and method efficiency in optimization, Wiley, 1983.
←4, 25, 59, 74
[42] Arkadi Nemirovski, Efficient methods in convex programming, 1994. Technion: The Israel Institute
of Technology. ←74
[43] Arkadi Nemirovski, Lectures on modern convex optimization, 2005. Georgia Institute of Technology.
←43
[44] Y. Nesterov, Introductory lectures on convex optimization, Kluwer Academic Publishers, 2004. ←26,
43, 74
[45] Y. Nesterov and A. Nemirovski, Interior-point polynomial algorithms in convex programming, SIAM
Studies in Applied Mathematics, 1994. ←25
[46] Jorge Nocedal and Stephen J. Wright, Numerical optimization, Springer, 2006. ←3, 60
[47] B. T. Polyak, Introduction to optimization, Optimization Software, Inc., 1987. ←3, 43
[48] B. T. Polyak and A. B. Juditsky, Acceleration of stochastic approximation by averaging, SIAM Journal
on Control and Optimization 30 (1992), no. 4, 838–855. ←43
[49] R. Tyrell Rockafellar, Convex analysis, Princeton University Press, 1970. ←3, 6, 24
[50] Walter Rudin, Principles of mathematical analysis, third edition, McGraw-Hill, 1976. ←3
[51] S. Shalev-Shwartz, Online learning: Theory, algorithms, and applications, Ph.D. Thesis, 2007. ←47
[52] Shai Shalev-Shwartz, Online learning and online convex optimization, Foundations and Trends in
Machine Learning 4 (2012), no. 2, 107–194. ←4
[53] Shai Shalev-Shwartz and Tong Zhang, Stochastic dual coordinate ascent methods for regularized loss
minimization, Journal of Machine Learning Research 14 (2013), 567–599. ←43
[54] Alexander Shapiro, Darinka Dentcheva, and Andrzej Ruszczyński, Lectures on stochastic program-
ming: Modeling and theory, SIAM and Mathematical Programming Society, 2009. ←4
[55] Naum Zuselevich Shor, Minimization methods for nondifferentiable functions, translated by Krzystof
Kiwiel and Andrzej Ruszczyński, Springer-Verlag, 1985. ←60
[56] Naum Zuselevich Shor, Nondifferentiable optimization and polynomial problems, Springer, 1998. ←56,
60
[57] R. Tibshirani, Regression shrinkage and selection via the lasso, J. Royal. Statist. Soc B. 58 (1996), no. 1,
267–288. ←31
[58] Alexandre B. Tsybakov, Introduction to nonparametric estimation, Springer, 2009. ←63, 74
[59] Abraham Wald, Contributions to the theory of statistical estimation and testing hypotheses, Annals of
Mathematical Statistics 10 (1939), no. 4, 299–326. ←4, 74
[60] Abraham Wald, Statistical decision functions which minimize the maximum risk, Annals of Mathemat-
ics 46 (1945), no. 2, 265–280. ←4
[61] Y. Yang and A. Barron, Information-theoretic determination of minimax rates of convergence, Annals of
Statistics 27 (1999), no. 5, 1564–1599. ←4, 63
[62] Bin Yu, Assouad, Fano, and Le Cam, Festschrift for lucien le cam, 1997, pp. 423–435. ←4, 63, 74
[63] Martin Zinkevich, Online convex programming and generalized infinitesimal gradient ascent, Proceed-
ings of the twentieth international conference on machine learning, 2003. ←43