
IAS/Park City Mathematics Series

Volume 00, Pages 000–000


S 1079-5634(XX)0000-0

Introductory Lectures on Stochastic Optimization

John C. Duchi

Contents
1 Introduction
1.1 Scope, limitations, and other references
1.2 Notation
2 Basic Convex Analysis
2.1 Introduction and Definitions
2.2 Properties of Convex Sets
2.3 Continuity and Local Differentiability of Convex Functions
2.4 Subgradients and Optimality Conditions
2.5 Calculus rules with subgradients
3 Subgradient Methods
3.1 Introduction
3.2 The gradient and subgradient methods
3.3 Projected subgradient methods
3.4 Stochastic subgradient methods
4 The Choice of Metric in Subgradient Methods
4.1 Introduction
4.2 Mirror Descent Methods
4.3 Adaptive stepsizes and metrics
5 Optimality Guarantees
5.1 Introduction
5.2 Le Cam's Method
5.3 Multiple dimensions and Assouad's Method
A Technical Appendices
A.1 Continuity of Convex Functions
A.2 Probability background
A.3 Auxiliary results on divergences
B Questions and Exercises
2010 Mathematics Subject Classification. Primary 65Kxx; Secondary 90C15, 62C20.
Key words and phrases. Convexity, stochastic optimization, subgradients, mirror descent, minimax optimal.

©0000 (copyright holder)


1. Introduction
In this set of four lectures, we study the basic analytical tools and algorithms
necessary for the solution of stochastic convex optimization problems, as well as
for providing various optimality guarantees associated with the methods. As we
proceed through the lectures, we will be more exact about the precise problem
formulations, providing a number of examples, but roughly, by a stochastic op-
timization problem we mean a numerical optimization problem that arises from
observing data from some (random) data-generating process. We focus almost
exclusively on first-order methods for the solution of these types of problems, as
they have proven quite successful in the large scale problems that have driven
many advances throughout the early 2000s.
Our main goal in these lectures, as in the lectures by S. Wright in this volume,
is to develop methods for the solution of optimization problems arising in large-
scale data analysis. Our route will be somewhat circuitous, as we will build the
necessary convex analytic and other background (see Lecture 2), but broadly, the
problems we wish to solve are the problems arising in stochastic convex optimiza-
tion. In these problems, we have samples S coming from a sample space S, drawn
from a distribution P, and we have some decision vector x ∈ Rn that we wish to
choose to minimize the expected loss
(1.0.1)    f(x) := E_P[F(x; S)] = ∫_S F(x; s) dP(s),
where F is convex in its first argument.
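To make the formulation (1.0.1) concrete, the following sketch approximates f(x) = E_P[F(x; S)] by a sample average over draws S_1, . . . , S_m; the particular loss F(x; s) = (1/2)(⟨a_s, x⟩ − b_s)^2 and the synthetic sampling distribution are illustrative assumptions, not problems taken from the text.

    import numpy as np

    rng = np.random.default_rng(0)

    def F(x, s):
        # assumed example: F(x; s) = 0.5 * (<a_s, x> - b_s)^2, convex in x
        a_s, b_s = s
        return 0.5 * (a_s @ x - b_s) ** 2

    # draws S_1, ..., S_m from a (here synthetic) distribution P
    samples = [(rng.normal(size=3), rng.normal()) for _ in range(1000)]

    def f_hat(x):
        # sample-average approximation of f(x) = E_P[F(x; S)]
        return np.mean([F(x, s) for s in samples])

    print(f_hat(np.zeros(3)))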
The methods we consider for minimizing problem (1.0.1) are typically sim-
ple methods that are slower to converge than more advanced methods—such as
Newton or other second-order methods—for deterministic problems, but have the
advantage that they are robust to noise in the optimization problem itself. Con-
sequently, it is often relatively straightforward to derive generalization bounds
for these procedures: if they produce an estimate x̂ exhibiting good performance
on some sample S_1, . . . , S_m drawn from P, then they are likely to exhibit good
performance (on average) for future data, that is, to have small objective f(x̂);
see Lecture 3, and especially Theorem 3.4.11. It is of course often advantageous
to take advantage of problem structure and geometric aspects of the problem,
broadly defined, which is the goal of mirror descent and related methods, which
we discuss in Lecture 4.
The last part of our lectures is perhaps the most unusual for material on opti-
mization, which is to investigate optimality guarantees for stochastic optimization
problems. In Lecture 5, we study the sample complexity of solving problems of
the form (1.0.1). More precisely, we measure the performance of an optimization
procedure given samples S1 , . . . , Sm drawn independently from the population
distribution P, denoted by x̂ = x̂(S_{1:m}), in a uniform sense: for a class of objec-
tive functions F, a procedure’s performance is its expected error—or risk—for the
worst member of the class F. We provide lower bounds on this maximum risk,
showing that the first-order procedures we have developed satisfy certain notions
of optimality.
We briefly outline the coming lectures. The first lecture provides definitions
and the convex analytic tools necessary for the development of our algorithms
and other ideas, developing separation properties of convex sets as well as other
properties of convex functions from basic principles. The second two lectures
investigate subgradient methods and their application to certain stochastic opti-
mization problems, demonstrating a number of convergence results. The second
lecture focuses on standard subgradient-type methods, while the third investi-
gates more advanced material on mirror descent and adaptive methods, which
require more care but can yield substantial practical performance benefits. The
final lecture investigates optimality guarantees for the various methods we study,
demonstrating two standard techniques for proving lower bounds on the ability
of any algorithm to solve stochastic optimization problems.
1.1. Scope, limitations, and other references The lectures assume some limited
familiarity with convex functions and convex optimization problems and their
formulation, which will help appreciation of the techniques herein. All that is
truly essential is a level of mathematical maturity that includes some real analysis,
linear algebra, and introductory probability. In terms of real analysis, a typical
undergraduate course, such as one based on Marsden and Hoffman's Elementary
Real Analysis [37] or Rudin's Principles of Mathematical Analysis [50], is sufficient.
Readers should not consider these lectures in any way a comprehensive view of
convex analysis or stochastic optimization. These subjects are well-established,
and there are numerous references.
Our lectures begin with convex analysis, whose study Rockafellar, influenced
by Fenchel, launched in his 1970 book Convex Analysis [49]. We develop the basic
ideas necessary for our treatment of first-order (gradient-based) methods for op-
timization, which includes separating and supporting hyperplane theorems, but
we provide essentially no treatment of the important concepts of Lagrangian and
Fenchel duality, support functions, or saddle point theory more broadly. For these
and other important ideas, I have found the books of Rockafellar [49], Hiriart-
Urruty and Lemaréchal [27, 28], Bertsekas [8], and Boyd and Vandenberghe [12]
illuminating.
Convex optimization itself is a huge topic, with thousands of papers and nu-
merous books on the subject. Because of our focus on solution methods for large-
scale problems arising out of data collection, we are somewhat constrained in
our views. Boyd and Vandenberghe [12] provide an excellent treatment of the
possibilities of modeling engineering and scientific problems as convex optimiza-
tion problems, as well as some important numerical methods. Polyak [47] pro-
vides a treatment of stochastic and non-stochastic methods for optimization from
which ours borrows substantially. Nocedal and Wright [46] and Bertsekas [9]
also describe more advanced methods for the solution of optimization problems,
focusing on non-stochastic optimization problems for which there are many so-
phisticated methods.
Because of our goal to solve problems of the form (1.0.1), we develop first-order
methods that are in some ways robust to many types of noise from sampling.
There are other approaches to dealing with data uncertainty, notably those of
researchers in robust optimization [6], who study and develop tractable (polynomial-
time-solvable) formulations for a variety of data-based problems in engineering and
the sciences. The book of Shapiro et al. [54] provides a more comprehensive
picture of stochastic modeling problems and optimization algorithms than we
have been able to in our lectures, as stochastic optimization is by itself a major
field. Several recent surveys on online learning and online convex optimization
provide complementary treatments to ours [26, 52].
The last lecture traces its roots to seminal work on information-based complexity
by Nemirovski and Yudin in the early 1980s [41], who investigate the limits of “op-
timal” algorithms, where optimality is defined in a worst-case sense according to
an oracle model of algorithms given access to function, gradient, or other types
of local information about the problem at hand. Issues of optimal estimation in
statistics are as old as the field itself, and the minimax formulation we use is
originally due to Wald in the late 1930s [59, 60]. We prove our results using infor-
mation theoretic tools, which have broader applications across statistics, and that
have been developed by many authors [31, 33, 61, 62].
1.2. Notation We use mostly standard notation throughout these notes, but for
completeness, we collect it here. We let R denote the typical field of real numbers,
with Rn having its usual meaning as n-dimensional Euclidean space. Given
vectors x and y, we let ⟨x, y⟩ denote the inner product between x and y. Given a
norm ‖·‖, its dual norm ‖·‖_* is defined as
    ‖z‖_* := sup {⟨z, x⟩ | ‖x‖ ≤ 1}.
Hölder's inequality (see Exercise 4) shows that the ℓ_p and ℓ_q norms, defined by
    ‖x‖_p = ( Σ_{j=1}^n |x_j|^p )^{1/p}
(and as the limit ‖x‖_∞ = max_j |x_j|) are dual to one another, where 1/p + 1/q = 1
and p, q ∈ [1, ∞]. Throughout, we will assume that ‖x‖_2 = √⟨x, x⟩ is the norm
defined by the inner product ⟨·, ·⟩.
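As a quick numerical sanity check of the duality just described (a sketch added here, not part of the original notes), the following lines verify Hölder's inequality |⟨z, x⟩| ≤ ‖z‖_q ‖x‖_p for a conjugate pair with 1/p + 1/q = 1.

    import numpy as np

    rng = np.random.default_rng(0)
    z, x = rng.normal(size=5), rng.normal(size=5)
    p, q = 3.0, 1.5                                     # conjugate exponents: 1/3 + 1/1.5 = 1
    lhs = abs(z @ x)                                    # |<z, x>|
    rhs = np.linalg.norm(z, q) * np.linalg.norm(x, p)   # ||z||_q * ||x||_p
    assert lhs <= rhs + 1e-12                           # Holder's inequality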
We also require notation related to sets. For a sequence of vectors v1 , v2 , v3 , . . .,
we let (v_n) denote the entire sequence. Given sets A and B, we write A ⊂ B to denote
that A is a subset of (possibly equal to) B, and A ⊊ B to mean that A is a strict subset
of B. The notation cl A denotes the closure of A, while int A denotes the interior
of the set A. For a function f, the set dom f is its domain. If f : Rn → R ∪ {+∞}
is convex, we let dom f := {x ∈ Rn | f(x) < +∞}.

2. Basic Convex Analysis


Lecture Summary: In this lecture, we will outline several standard facts
from convex analysis, the study of the mathematical properties of convex
functions and sets. For the most part, our analysis and results will all be
with the aim of setting the necessary background for understanding first-
order convex optimization methods, though some of the results we state will
be quite general.

2.1. Introduction and Definitions This set of lecture notes considers convex op-
timization problems, numerical optimization problems of the form

(2.1.1)    minimize f(x)   subject to x ∈ C,
where f is a convex function and C is a convex set. While we will consider
tools to solve these types of optimization problems presently, this first lecture is
concerned most with the analytic tools and background that underlies solution
methods for these problems.

Figure 2.1.2. (a) A convex set. (b) A non-convex set.

The starting point for any study of convex functions is the definition and study
of convex sets, which are intimately related to convex functions. To that end, we
recall that a set C ⊂ Rn is convex if for all x, y ∈ C,
λx + (1 − λ)y ∈ C for λ ∈ [0, 1].
See Figure 2.1.2.
A convex function is similarly defined: a function f : Rn → (−∞, ∞] is convex
if for all x, y ∈ dom f := {x ∈ Rn | f(x) < +∞}
f(λx + (1 − λ)y) 6 λf(x) + (1 − λ)f(y) for λ ∈ [0, 1].
The epigraph of a function is defined as
epi f := {(x, t) : f(x) 6 t},
and by inspection, a function is convex if and only if its epigraph is a convex
set. A convex function f is closed if its epigraph is a closed set; continuous
convex functions are always closed. We will assume throughout that any convex
function we deal with is closed. See Figure 2.1.3 for graphical representations of
these ideas, which make clear that the epigraph is indeed a convex set.

Figure 2.1.3. (a) The convex function f(x) = max{x^2, −2x − 0.2}
and (b) its epigraph, which is a convex set.

One may ask why, precisely, we focus on convex functions. In short, as Rock-
afellar [49] notes, convex optimization problems are the clearest dividing line
between numerical problems that are efficiently solvable, often by iterative meth-
ods, and numerical problems for which we have no hope. We give one simple
result in this direction first:

Observation. Let f : R^n → R be convex and x be a local minimum of f (respectively
a local minimum over a convex set C). Then x is a global minimum of f (resp. a global
minimum of f over C).

To see this, note that if x is a local minimum then for any y ∈ C, we have for
small enough t > 0 that
    f(x) ≤ f(x + t(y − x)),   or   0 ≤ [f(x + t(y − x)) − f(x)] / t.
We now use the criterion of increasing slopes, that is, for any convex function f the
function
(2.1.5)    t ↦ [f(x + tu) − f(x)] / t
is increasing in t > 0. (See Fig. 2.1.4.) Indeed, let 0 ≤ t_1 ≤ t_2. Then
    [f(x + t_1 u) − f(x)] / t_1
        = (t_2 / t_1) · [f(x + t_2 (t_1/t_2) u) − f(x)] / t_2
        = (t_2 / t_1) · [f((1 − t_1/t_2) x + (t_1/t_2)(x + t_2 u)) − f(x)] / t_2
        ≤ (t_2 / t_1) · [(1 − t_1/t_2) f(x) + (t_1/t_2) f(x + t_2 u) − f(x)] / t_2
        = [f(x + t_2 u) − f(x)] / t_2.

Figure 2.1.4. The slopes [f(x + t) − f(x)] / t increase, with t_1 < t_2 < t_3.

In particular, because 0 ≤ f(x + t(y − x)) − f(x) for small enough t > 0, we see that for
all t > 0 we have
    0 ≤ [f(x + t(y − x)) − f(x)] / t,   or   f(x) ≤ inf_{t > 0} f(x + t(y − x)) ≤ f(y)
for all y ∈ C.
Most of the results herein apply in general Hilbert (complete inner product)
spaces, and many of our proofs will not require anything particular about finite
dimensional spaces, but for simplicity we use Rn as the underlying space on
which all functions and sets are defined.1 While we present all proofs in the
chapter, we try to provide geometric intuition that will aid a working knowledge
of the results, which we believe is the most important.
2.2. Properties of Convex Sets Convex sets enjoy a number of very nice proper-
ties that allow efficient and elegant descriptions of the sets, as well as providing
a number of nice properties concerning their separation from one another. To
that end, in this section, we give several fundamental properties on separating
and supporting hyperplanes for convex sets. The results here begin by showing
that there is a unique (Euclidean) projection to any convex set C, then use this
fact to show that whenever a point is not contained in a set, it can be separated
from the set by a hyperplane. This result can be extended to show separation
of convex sets from one another and that points in the boundary of a convex set
have a hyperplane tangent to the convex set running through them. We leverage
these results in the sequel by making connections of supporting hyperplanes to
epigraphs and gradients, results that in turn find many applications in the design
of optimization algorithms as well as optimality certificates.
¹The generality of Hilbert, or even Banach, spaces in convex analysis is seldom needed. Readers
familiar with arguments in these spaces will, however, note that the proofs can generally be extended
to infinite dimensional spaces in reasonably straightforward ways.
A few basic properties We list a few simple properties that convex sets have,
which are evident from their definitions. First, if Cα are convex sets for each
α ∈ A, where A is an arbitrary index set, then the intersection
    C = ∩_{α ∈ A} C_α
is also convex. Additionally, convex sets are closed under scalar multiplication: if
α ∈ R and C is convex, then
    αC := {αx : x ∈ C}
is evidently convex. The Minkowski sum of two convex sets is defined by
    C_1 + C_2 := {x_1 + x_2 : x_1 ∈ C_1, x_2 ∈ C_2},
and is also convex. To see this, note that if x_i, y_i ∈ C_i, then
    λ(x_1 + x_2) + (1 − λ)(y_1 + y_2) = [λx_1 + (1 − λ)y_1] + [λx_2 + (1 − λ)y_2] ∈ C_1 + C_2,
since the first bracketed term lies in C_1 and the second in C_2.
In particular, convex sets are closed under all linear combinations: if α ∈ R^m, then
C = Σ_{i=1}^m α_i C_i is also convex.
We also define the convex hull of a set of points x_1, . . . , x_m ∈ R^n by
    Conv{x_1, . . . , x_m} = { Σ_{i=1}^m λ_i x_i : λ_i ≥ 0, Σ_{i=1}^m λ_i = 1 }.
This set is clearly a convex set.
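The convex hull definition translates directly into a small feasibility problem: a point lies in Conv{x_1, . . . , x_m} exactly when some nonnegative weights summing to one reproduce it. The sketch below checks membership this way; it assumes SciPy's linear programming routine is available and is an illustration added here, not part of the original notes.

    import numpy as np
    from scipy.optimize import linprog

    def in_convex_hull(p, points):
        # feasibility LP: find lambda >= 0 with sum(lambda) = 1 and sum_i lambda_i x_i = p
        m, _ = points.shape
        A_eq = np.vstack([points.T, np.ones((1, m))])
        b_eq = np.concatenate([p, [1.0]])
        res = linprog(c=np.zeros(m), A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * m)
        return res.success

    pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
    print(in_convex_hull(np.array([0.25, 0.25]), pts))   # True: inside the triangle
    print(in_convex_hull(np.array([1.0, 1.0]), pts))     # False: outside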
Projections We now turn to a discussion of orthogonal projection onto a con-
vex set, which will allow us to develop a number of separation properties and
alternate characterizations of convex sets. See Figure 2.2.5 for a geometric view
of projection. We begin by stating a classical result about the projection of zero
onto a convex set.

Theorem 2.2.1 (Projection of zero). Let C be a closed convex set not containing the
origin 0. Then there is a unique point xC ∈ C such that kxC k2 = infx∈C kxk2 . Moreover,
kxC k2 = infx∈C kxk2 if and only if
(2.2.2) hxC , y − xC i > 0
for all y ∈ C.

Proof. The key to the proof is the following parallelogram identity, which holds
in any inner product space: for any x, y,
(2.2.3)    (1/2)‖x − y‖_2^2 + (1/2)‖x + y‖_2^2 = ‖x‖_2^2 + ‖y‖_2^2.
Define M := inf_{x ∈ C} ‖x‖_2. Now, let (x_n) ⊂ C be a sequence of points in C such that
‖x_n‖_2 → M as n → ∞. By the parallelogram identity (2.2.3), for any n, m ∈ N, we have
    (1/2)‖x_n − x_m‖_2^2 = ‖x_n‖_2^2 + ‖x_m‖_2^2 − (1/2)‖x_n + x_m‖_2^2.
Fix ε > 0, and choose N ∈ N such that n ≥ N implies that ‖x_n‖_2^2 ≤ M^2 + ε. Then
for any m, n ≥ N, we have
(2.2.4)    (1/2)‖x_n − x_m‖_2^2 ≤ 2M^2 + 2ε − (1/2)‖x_n + x_m‖_2^2.
Now we use the convexity of the set C. We have (1/2)x_n + (1/2)x_m ∈ C for any n, m,
which implies
    (1/2)‖x_n + x_m‖_2^2 = 2‖(1/2)x_n + (1/2)x_m‖_2^2 ≥ 2M^2
by definition of M. Using the above inequality in the bound (2.2.4), we see that
    (1/2)‖x_n − x_m‖_2^2 ≤ 2M^2 + 2ε − 2M^2 = 2ε.
In particular, ‖x_n − x_m‖_2 ≤ 2√ε; since ε was arbitrary, (x_n) forms a Cauchy
sequence and so must converge to a point x_C. The continuity of the norm ‖·‖_2
implies that ‖x_C‖_2 = inf_{x ∈ C} ‖x‖_2, and the fact that C is closed implies that x_C ∈ C.
Now we show the inequality (2.2.2) holds if and only if x_C is the projection of
the origin 0 onto C. Suppose that inequality (2.2.2) holds. Then
    ‖x_C‖_2^2 = ⟨x_C, x_C⟩ ≤ ⟨x_C, y⟩ ≤ ‖x_C‖_2 ‖y‖_2,
the last inequality following from the Cauchy–Schwarz inequality. Dividing each
side by ‖x_C‖_2 implies that ‖x_C‖_2 ≤ ‖y‖_2 for all y ∈ C. For the converse, let x_C
minimize ‖x‖_2 over C. Then for any t ∈ [0, 1] and any y ∈ C, we have
    ‖x_C‖_2^2 ≤ ‖(1 − t)x_C + ty‖_2^2 = ‖x_C + t(y − x_C)‖_2^2 = ‖x_C‖_2^2 + 2t⟨x_C, y − x_C⟩ + t^2‖y − x_C‖_2^2.
Subtracting ‖x_C‖_2^2 and t^2‖y − x_C‖_2^2 from both sides of the above inequality, we have
    −t^2‖y − x_C‖_2^2 ≤ 2t⟨x_C, y − x_C⟩.
Dividing both sides of the above inequality by 2t, we have
    −(t/2)‖y − x_C‖_2^2 ≤ ⟨x_C, y − x_C⟩
for all t ∈ (0, 1]. Letting t ↓ 0 gives the desired inequality. □
With this theorem in place, a simple shift gives a characterization of more
general projections onto convex sets.

Corollary 2.2.6 (Projection onto convex sets). Let C be a closed convex set and x ∈
Rn . Then there is a unique point πC (x), called the projection of x onto C, such
that kx − πC (x)k2 = infy∈C kx − yk2 , that is, πC (x) = argminy∈C ky − xk22 . The
projection is characterized by the inequality
(2.2.7) hπC (x) − x, y − πC (x)i > 0
for all y ∈ C.

Figure 2.2.5. Projection of the point x onto the set C (with projection π_C(x)),
exhibiting ⟨x − π_C(x), y − π_C(x)⟩ ≤ 0.

Proof. When x ∈ C, the statement is clear. For x 6∈ C, the corollary simply fol-
lows by considering the set C ′ = C − x, then using Theorem 2.2.1 applied to the
recentered set. 

Corollary 2.2.8 (Non-expansive projections). Projections onto convex sets are non-
expansive, in particular,
kπC (x) − yk2 6 kx − yk2
for any x ∈ Rn and y ∈ C.

Proof. When x ∈ C, the inequality is clear, so assume that x ∉ C. Now use
inequality (2.2.7) from the previous corollary. By adding and subtracting y in the
inner product, we have
    0 ≤ ⟨π_C(x) − x, y − π_C(x)⟩
      = ⟨π_C(x) − y + y − x, y − π_C(x)⟩
      = −‖π_C(x) − y‖_2^2 + ⟨y − x, y − π_C(x)⟩.
We rearrange the above and then use the Cauchy–Schwarz or Hölder's inequality,
which gives
    ‖π_C(x) − y‖_2^2 ≤ ⟨y − x, y − π_C(x)⟩ ≤ ‖y − x‖_2 ‖y − π_C(x)‖_2.
Now divide both sides by ‖π_C(x) − y‖_2. □
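As a computational aside (a sketch under the assumption that C is the Euclidean ball, for which the projection has a simple closed form; this example is not from the original notes), one can verify the characterization (2.2.7) and the non-expansiveness of Corollary 2.2.8 numerically:

    import numpy as np

    def project_l2_ball(x, radius=1.0):
        # Euclidean projection onto C = {y : ||y||_2 <= radius}
        norm = np.linalg.norm(x)
        return x if norm <= radius else (radius / norm) * x

    rng = np.random.default_rng(0)
    x = 3.0 * rng.normal(size=5)                 # a point typically outside the unit ball
    pi_x = project_l2_ball(x)
    for _ in range(1000):
        y = project_l2_ball(rng.normal(size=5))  # an arbitrary point of C
        assert (pi_x - x) @ (y - pi_x) >= -1e-9                              # inequality (2.2.7)
        assert np.linalg.norm(pi_x - y) <= np.linalg.norm(x - y) + 1e-9      # Corollary 2.2.8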
Separation Properties Projections are important not just because of their ex-
istence, but because they also guarantee that convex sets can be described by
halfspaces that contain them as well as that any two convex sets are separated by
hyperplanes. Moreover, the separation can be strict if one of the sets is compact.
Figure 2.2.9. Separation of the point x from the set C by the vector v = x − π_C(x).

Proposition 2.2.10 (Strict separation of points). Let C be a closed convex set. Given
any point x ∉ C, there is a vector v such that
(2.2.11)    ⟨v, x⟩ > sup_{y ∈ C} ⟨v, y⟩.
Moreover, we can take the vector v = x − π_C(x), and ⟨v, x⟩ ≥ sup_{y ∈ C} ⟨v, y⟩ + ‖v‖_2^2.
See Figure 2.2.9.

Proof. Indeed, since x ∉ C, we have x − π_C(x) ≠ 0. By setting v = x − π_C(x), we
have from the characterization (2.2.7) that
    0 ≥ ⟨v, y − π_C(x)⟩ = ⟨v, y − x + x − π_C(x)⟩ = ⟨v, y − x + v⟩ = ⟨v, y − x⟩ + ‖v‖_2^2.
In particular, we see that ⟨v, x⟩ ≥ ⟨v, y⟩ + ‖v‖_2^2 for all y ∈ C. □

Proposition 2.2.12 (Strict separation of convex sets). Let C_1, C_2 be closed convex
sets, with C_2 compact. Then there is a vector v such that
    inf_{x ∈ C_1} ⟨v, x⟩ > sup_{x ∈ C_2} ⟨v, x⟩.

Proof. The set C = C_1 − C_2 is convex and closed.² Moreover, we have 0 ∉ C, so
that there is a vector v such that 0 < inf_{z ∈ C} ⟨v, z⟩ by Proposition 2.2.10. Thus we have
    0 < inf_{z ∈ C_1 − C_2} ⟨v, z⟩ = inf_{x ∈ C_1} ⟨v, x⟩ − sup_{x ∈ C_2} ⟨v, x⟩,
which is our desired result. □

²If C_1 is closed and C_2 is compact, then C_1 + C_2 is closed. Indeed, let z_n = x_n + y_n be a convergent
sequence of points (say z_n → z) with z_n ∈ C_1 + C_2. We claim that z ∈ C_1 + C_2. Indeed, passing
to a subsequence if necessary, we may assume y_n → y ∈ C_2. Then on the subsequence, we have
x_n = z_n − y_n → z − y, so that x_n is convergent and necessarily converges to a point x ∈ C_1.

We can also investigate the existence of hyperplanes that support the convex
set C, meaning that they touch only its boundary and never enter its interior.
Such hyperplanes—and the halfspaces associated with them—provide alternate
descriptions of convex sets and functions. See Figure 2.2.13.

Figure 2.2.13. Supporting hyperplanes to a convex set.

Theorem 2.2.14 (Supporting hyperplanes). Let C be a closed convex set and x ∈ bd C,
the boundary of C. Then there exists a vector v ≠ 0 supporting C at x, that is,
(2.2.15)    ⟨v, x⟩ ≥ ⟨v, y⟩ for all y ∈ C.
Proof. Let (x_n) be a sequence of points approaching x from outside C, that is,
x_n ∉ C for any n, but x_n → x. For each n, we can take s_n = x_n − π_C(x_n) and
define v_n = s_n / ‖s_n‖_2. Then (v_n) is a sequence satisfying ⟨v_n, x_n⟩ ≥ ⟨v_n, y⟩ for
all y ∈ C, and since ‖v_n‖_2 = 1, the sequence (v_n) belongs to the compact set
{v : ‖v‖_2 ≤ 1}.³ Passing to a subsequence if necessary, it is clear that there is a
vector v such that v_n → v, and we have ⟨v, x⟩ ≥ ⟨v, y⟩ for all y ∈ C. □
Theorem 2.2.16 (Halfspace intersections). Let C ⊊ R^n be a closed convex set. Then
C is the intersection of all the halfspaces containing it; moreover,
(2.2.17)    C = ∩_{x ∈ bd C} H_x,
where H_x denotes the intersection of the halfspaces containing C whose bounding
hyperplanes support C at x.

Proof. It is clear that C ⊆ ∩_{x ∈ bd C} H_x. Indeed, let h_x ≠ 0 support C at x ∈ bd C
and consider H_x = {y : ⟨h_x, x⟩ ≥ ⟨h_x, y⟩}. By Theorem 2.2.14 we see that H_x ⊇ C.
Now we show the other inclusion: ∩_{x ∈ bd C} H_x ⊆ C. Suppose for the sake
of contradiction that z ∈ ∩_{x ∈ bd C} H_x satisfies z ∉ C. We will construct a hy-
perplane supporting C that separates z from C, which will be a contradiction to
our supposition. Since C is closed, the projection π_C(z) of z onto C satisfies
⟨z − π_C(z), z⟩ > sup_{y ∈ C} ⟨z − π_C(z), y⟩ by Proposition 2.2.10. In particular, defin-
ing v_z = z − π_C(z), the hyperplane {y : ⟨v_z, y⟩ = ⟨v_z, π_C(z)⟩} supports C
at the point π_C(z) (Corollary 2.2.6), and the halfspace {y : ⟨v_z, y⟩ ≤ ⟨v_z, π_C(z)⟩}
does not contain z but does contain C. This contradicts the assumption that
z ∈ ∩_{x ∈ bd C} H_x. □

³In a general Hilbert space, this set is actually weakly compact by Alaoglu's theorem. However, in
a weakly compact set, any sequence has a weakly convergent subsequence, that is, there exists a
subsequence n(m) and a vector v such that ⟨v_{n(m)}, y⟩ → ⟨v, y⟩ for all y.
As a not too immediate consequence of Theorem 2.2.16 we obtain the following
characterization of a convex function as the supremum of all affine functions that
minorize the function (that is, affine functions that are everywhere less than or
equal to the original function). This is intuitive: if f is a closed convex function,
meaning that epi f is closed, then epi f is the intersection of all the halfspaces
containing it. The challenge is showing that we may restrict this intersection to
non-vertical halfspaces. See Figure 2.2.18.

Figure 2.2.18. The function f (solid blue line) and affine under-
estimators (dotted lines).

Corollary 2.2.19. Let f be a closed convex function that is not identically −∞. Then
    f(x) = sup_{v ∈ R^n, b ∈ R} {⟨v, x⟩ + b : f(y) ≥ b + ⟨v, y⟩ for all y ∈ R^n}.

Proof. First, we note that epi f is closed by definition. Moreover, we know that we
can write
    epi f = ∩{H : H ⊇ epi f},
where H denotes a halfspace. More specifically, we may index each halfspace by
(v, a, c) ∈ R^n × R × R, and we have H_{v,a,c} = {(x, t) ∈ R^n × R : ⟨v, x⟩ + at ≤ c}.
Now, because H ⊇ epi f, we must be able to take t → ∞, so that a ≤ 0. If a < 0,
we may divide by |a| and assume without loss of generality that a = −1, while
otherwise a = 0. So if we let
    H_1 := {(v, c) : H_{v,−1,c} ⊇ epi f}   and   H_0 := {(v, c) : H_{v,0,c} ⊇ epi f},
then
    epi f = ( ∩_{(v,c) ∈ H_1} H_{v,−1,c} ) ∩ ( ∩_{(v,c) ∈ H_0} H_{v,0,c} ).
We would like to show that epi f = ∩_{(v,c) ∈ H_1} H_{v,−1,c}, as the set H_{v,0,c} is a vertical
hyperplane separating the domain of f, dom f, from the rest of the space.
To that end, we show that for any (v_1, c_1) ∈ H_1 and (v_0, c_0) ∈ H_0, we have
    H := ∩_{λ ≥ 0} H_{v_1 + λv_0, −1, c_1 + λc_0} = H_{v_1,−1,c_1} ∩ H_{v_0,0,c_0}.
Indeed, suppose that (x, t) ∈ H_{v_1,−1,c_1} ∩ H_{v_0,0,c_0}. Then
    ⟨v_1, x⟩ − t ≤ c_1   and   λ⟨v_0, x⟩ ≤ λc_0 for all λ ≥ 0.
Summing these, we have
(2.2.20)    ⟨v_1 + λv_0, x⟩ − t ≤ c_1 + λc_0 for all λ ≥ 0,
or (x, t) ∈ H. Conversely, if (x, t) ∈ H then inequality (2.2.20) holds, so that taking
λ → ∞ we have ⟨v_0, x⟩ ≤ c_0, while taking λ = 0 we have ⟨v_1, x⟩ − t ≤ c_1.
Noting that each H_{v_1 + λv_0, −1, c_1 + λc_0} belongs to {H_{v,−1,c} : (v, c) ∈ H_1}, we see that
    epi f = ∩_{(v,c) ∈ H_1} H_{v,−1,c} = {(x, t) ∈ R^n × R : ⟨v, x⟩ − t ≤ c for all (v, c) ∈ H_1}.
This is equivalent to the claim in the corollary. □


2.3. Continuity and Local Differentiability of Convex Functions Here we dis-
cuss several important results concerning convex functions in finite dimensions.
We will see that assuming that a function f is convex is quite strong. In fact, we
will see the (intuitive, if one pictures a convex function) facts that f is continuous,
has a directional derivative everywhere, and in fact is locally Lipschitz. We defer
the proofs of the first two results on continuity to Appendix A.1, as they are not
fully necessary for our development.
We begin with the fact that if f is defined on a compact domain, then f has an
upper bound. The first step in this direction is to argue that this holds for ℓ_1 balls,
which can be proved by a simple argument with the definition of convexity.

Lemma 2.3.1. Let f be convex and defined on the ℓ_1 ball in n dimensions: B_1 = {x ∈
R^n : ‖x‖_1 ≤ 1}. Then there exist −∞ < m ≤ M < ∞ such that m ≤ f(x) ≤ M for all
x ∈ B_1.

We provide a proof of this lemma, as well as the coming theorem, in Appen-
dix A.1, as they are not central to our development, relying on a few results in
the sequel. The coming theorem makes use of the above lemma to show that on
compact domains, convex functions are Lipschitz continuous. The proof of the
theorem begins by showing that if a convex function is bounded in some set, then
it is Lipschitz continuous in the set, then using Lemma 2.3.1 we can show that on
compact sets f is indeed bounded.
Theorem 2.3.2. Let f be convex and defined on a set C with non-empty interior. Let
B ⊆ int C be compact. Then there is a constant L such that |f(x) − f(y)| 6 L kx − yk on
B, that is, f is L-Lipschitz continuous on B.
The last result, which we make strong use of in the next section, concerns the
existence of directional derivatives for convex functions.
Definition 2.3.3. The directional derivative of a function f at a point x in the direc-
tion u is
    f′(x; u) := lim_{α ↓ 0} (1/α)[f(x + αu) − f(x)].
This definition makes sense by our earlier arguments that convex functions have
increasing slopes (recall expression (2.1.5)). To see that the above definition makes
sense, we restrict our attention to x ∈ int dom f, so that we can approach x from
all directions. By taking u = y − x for any y ∈ dom f,
    f(x + α(y − x)) = f((1 − α)x + αy) ≤ (1 − α)f(x) + αf(y),
so that
    (1/α)[f(x + α(y − x)) − f(x)] ≤ (1/α)[αf(y) − αf(x)] = f(y) − f(x) = f(x + u) − f(x).
We also know from Theorem 2.3.2 that f is locally Lipschitz, so for small enough
α there exists some L such that f(x + αu) ≥ f(x) − Lα‖u‖, and thus f′(x; u) ≥
−L‖u‖. Further, an argument by convexity (the criterion (2.1.5) of increasing
slopes) shows that the function
    α ↦ (1/α)[f(x + αu) − f(x)]
is increasing, so we can replace the limit in the definition of f′(x; u) with an
infimum over α > 0, that is, f′(x; u) = inf_{α > 0} (1/α)[f(x + αu) − f(x)]. Noting that if
x is on the boundary of dom f and x + αu ∉ dom f for any α > 0, then f′(x; u) =
+∞, we have proved the following theorem.

Theorem 2.3.4. For convex f, at any point x ∈ dom f and for any u, the directional
derivative f′(x; u) exists and is
    f′(x; u) = lim_{α ↓ 0} (1/α)[f(x + αu) − f(x)] = inf_{α > 0} (1/α)[f(x + αu) − f(x)].
If x ∈ int dom f, there exists a constant L < ∞ such that |f′(x; u)| ≤ L‖u‖ for any
u ∈ R^n. If f is Lipschitz continuous with respect to the norm ‖·‖, we can take L to be
the Lipschitz constant of f.
Lastly, we state a well-known condition that is equivalent to convexity. This is
intuitive: if a function is bowl-shaped, it should have positive second derivatives.
Theorem 2.3.5. Let f : R^n → R be twice continuously differentiable. Then f is convex
if and only if ∇²f(x) ⪰ 0 for all x, that is, ∇²f(x) is positive semidefinite.

Proof. We may essentially reduce the argument to one-dimensional problems, be-
cause if f is twice continuously differentiable, then for each v ∈ R^n we may define
h_v : R → R by
    h_v(t) = f(x + tv),
and f is convex if and only if h_v is convex for each v (because convexity is a
property only of lines, by definition). Moreover, we have
    h_v″(0) = v^T ∇²f(x) v,
and ∇²f(x) ⪰ 0 if and only if h_v″(0) ≥ 0 for all v.
Thus, with no loss of generality, we assume n = 1 and show that f is convex if
and only if f″(x) ≥ 0. First, suppose that f″(x) ≥ 0 for all x. Then using that
    f(y) = f(x) + f′(x)(y − x) + (1/2)(y − x)^2 f″(x̃)
for some x̃ between x and y, we have that f(y) ≥ f(x) + f′(x)(y − x) for all x, y.
Let λ ∈ [0, 1]. Then we have
    f(y) ≥ f(λx + (1 − λ)y) + λf′(λx + (1 − λ)y)(y − x)   and
    f(x) ≥ f(λx + (1 − λ)y) + (1 − λ)f′(λx + (1 − λ)y)(x − y).
Multiplying the first inequality by 1 − λ and the second by λ, then adding, we
obtain
    (1 − λ)f(y) + λf(x) ≥ (1 − λ)f(λx + (1 − λ)y) + λf(λx + (1 − λ)y) = f(λx + (1 − λ)y),
that is, f is convex.
For the converse, let δ > 0 and define x_1 = x + δ > x > x − δ = x_0. Then we
have x_1 − x_0 = 2δ, and
    f(x_1) = f(x) + f′(x)δ + (δ^2/2) f″(x̃_1)   and   f(x_0) = f(x) − f′(x)δ + (δ^2/2) f″(x̃_0)
for some x̃_1, x̃_0 ∈ [x − δ, x + δ]. Adding these quantities and defining c_δ := f(x_1) +
f(x_0) − 2f(x) ≥ 0 (the last inequality by convexity), we have
    c_δ = (δ^2/2)[f″(x̃_1) + f″(x̃_0)].
By continuity, we have f″(x̃_i) → f″(x) as δ → 0, and as c_δ/δ^2 ≥ 0 for all δ > 0,
we must have
    2f″(x) = lim sup_{δ → 0} {f″(x̃_1) + f″(x̃_0)} = lim sup_{δ → 0} 2c_δ/δ^2 ≥ 0.
This gives the result. □
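Theorem 2.3.5 is easy to check numerically for a specific smooth convex function. The sketch below (an illustration added here; the logistic-type loss f(x) = log(1 + e^{⟨a,x⟩}) is an assumed example) forms the Hessian σ(1 − σ) a a^T, where σ = 1/(1 + e^{−⟨a,x⟩}), and confirms its eigenvalues are nonnegative.

    import numpy as np

    def hessian_log_loss(x, a):
        # Hessian of f(x) = log(1 + exp(<a, x>)) is sigma * (1 - sigma) * a a^T
        sigma = 1.0 / (1.0 + np.exp(-(a @ x)))
        return sigma * (1.0 - sigma) * np.outer(a, a)

    rng = np.random.default_rng(1)
    a, x = rng.normal(size=4), rng.normal(size=4)
    eigenvalues = np.linalg.eigvalsh(hessian_log_loss(x, a))
    print(eigenvalues.min() >= -1e-12)   # nonnegative eigenvalues, consistent with convexity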
2.4. Subgradients and Optimality Conditions The subgradient set of a function
f at a point x ∈ dom f is defined as follows:
(2.4.1)    ∂f(x) := {g : f(y) ≥ f(x) + ⟨g, y − x⟩ for all y}.
Intuitively, since a function is convex if and only if epi f is convex, the subgradient
set ∂f should be non-empty and consist of supporting hyperplanes to epi f. That
is, f should always have global linear underestimators of itself. When a function f
is convex, the subgradient generalizes the derivative of f (which is a global linear
underestimator of f when f is differentiable), and is also intimately related to
optimality conditions for convex minimization.

Figure 2.4.2. Subgradients of a convex function. At the point x_1, the subgradient
g_1 is the gradient. At the point x_2, there are multiple subgradients, because the
function is non-differentiable. We show the linear functions given by g_2, g_3 ∈ ∂f(x_2).

Existence and characterizations of subgradients Our first theorem guarantees
that the subdifferential set is non-empty.

Theorem 2.4.3. Let x ∈ int dom f. Then ∂f(x) is nonempty, closed convex, and compact.

Proof. The fact that ∂f(x) is closed and convex is straightforward. Indeed, all we
need to see this is to recognize that
    ∂f(x) = ∩_z {g : f(z) ≥ f(x) + ⟨g, z − x⟩},
which is an intersection of half-spaces, which are all closed and convex.
Now we need to show that ∂f(x) ≠ ∅. This will essentially follow from the
following fact: the set epi f has a supporting hyperplane at the point (x, f(x)).
Indeed, from Theorem 2.2.14, we know that there exist a vector v and scalar b
such that
    ⟨v, x⟩ + b f(x) ≥ ⟨v, y⟩ + b t
for all (y, t) ∈ epi f (that is, y and t such that f(y) ≤ t). Rearranging slightly, we have
    ⟨v, x − y⟩ ≥ b(t − f(x)),
and setting y = x shows that b ≤ 0. This is close to what we desire, since if b < 0
we set t = f(y) and see that
    −b f(y) ≥ −b f(x) + ⟨v, y − x⟩,   or   f(y) ≥ f(x) + ⟨−v/b, y − x⟩
for all y, by dividing both sides by −b. In particular, −v/b is a subgradient. Thus,
suppose for the sake of contradiction that b = 0. In this case, we have ⟨v, x − y⟩ ≥ 0
for all y ∈ dom f, but we assumed that x ∈ int dom f, so for small enough ε > 0,
we can set y = x + εv. This would imply that ⟨v, x − y⟩ = −ε⟨v, v⟩ ≥ 0, i.e. v = 0,
contradicting the fact that at least one of v and b must be non-zero.
For the compactness of ∂f(x), we use Lemma 2.3.1, which implies that f is
bounded in an ℓ_1-ball around x. As x ∈ int dom f by assumption, there is
some ε > 0 such that x + εB ⊂ int dom f for the ℓ_1-ball B = {v : ‖v‖_1 ≤ 1}.
Lemma 2.3.1 implies that sup_{v ∈ B} f(x + εv) = M < ∞ for some M, so we have
M ≥ f(x + εv) ≥ f(x) + ε⟨g, v⟩ for all v ∈ B and g ∈ ∂f(x), or ‖g‖_∞ ≤ (M − f(x))/ε.
Thus ∂f(x) is closed and bounded, hence compact. □
The next two results require a few auxiliary results related to the directional
derivative of a convex function. The reason for this is that both require connect-
ing the local properties of the convex function f with the sub-differential ∂f(x),
which is difficult in general since ∂f(x) can consist of multiple vectors. However,
by looking at directional derivatives, we can accomplish what we desire. The
connection between a directional derivative and the subdifferential is contained
in the next two lemmas.

Lemma 2.4.4. An equivalent characterization of the subdifferential ∂f(x) of f at x is
(2.4.5)    ∂f(x) = {g : ⟨g, u⟩ ≤ f′(x; u) for all u}.

Proof. Denote the set on the right hand side of the equality (2.4.5) by S = {g :
⟨g, u⟩ ≤ f′(x; u) for all u}, and let g ∈ S. By the increasing slopes condition, we have
    ⟨g, u⟩ ≤ f′(x; u) ≤ [f(x + αu) − f(x)] / α
for all u and α > 0; in particular, by taking α = 1 and u = y − x, we have the
standard subgradient inequality that f(x) + ⟨g, y − x⟩ ≤ f(y). So if g ∈ S, then
g ∈ ∂f(x). Conversely, for any g ∈ ∂f(x), the definition of a subgradient implies that
    f(x + αu) ≥ f(x) + ⟨g, x + αu − x⟩ = f(x) + α⟨g, u⟩.
Subtracting f(x) from both sides and dividing by α gives that
    (1/α)[f(x + αu) − f(x)] ≥ sup_{g ∈ ∂f(x)} ⟨g, u⟩
for all α > 0; in particular, g ∈ S. □


The representation (2.4.5) gives another proof that ∂f(x) is compact, as claimed in
Theorem 2.4.3. Because we know that f′(x; u) is finite for all u when x ∈ int dom f,
any g ∈ ∂f(x) satisfies
    ‖g‖_2 = sup_{u : ‖u‖_2 ≤ 1} ⟨g, u⟩ ≤ sup_{u : ‖u‖_2 ≤ 1} f′(x; u) < ∞.

Lemma 2.4.6. Let f be closed convex and ∂f(x) ≠ ∅. Then
(2.4.7)    f′(x; u) = sup_{g ∈ ∂f(x)} ⟨g, u⟩.

Proof. Certainly, Lemma 2.4.4 shows that f′(x; u) ≥ sup_{g ∈ ∂f(x)} ⟨g, u⟩. We must
show the other direction. To that end, note that, viewed as a function of u,
f′(x; u) is convex and positively homogeneous, meaning that f′(x; tu) = tf′(x; u)
for t > 0. Thus, we can always write (by Corollary 2.2.19)
    f′(x; u) = sup {⟨v, u⟩ + b : f′(x; w) ≥ b + ⟨v, w⟩ for all w ∈ R^n}.
Using the positive homogeneity, we have f′(x; 0) = 0, and thus we must have
b = 0, so that u ↦ f′(x; u) is characterized as the supremum of linear functions:
    f′(x; u) = sup {⟨v, u⟩ : f′(x; w) ≥ ⟨v, w⟩ for all w ∈ R^n}.
But the set {v : ⟨v, w⟩ ≤ f′(x; w) for all w} is simply ∂f(x) by Lemma 2.4.4. □
A relatively straightforward calculation using Lemma 2.4.4, which we give
in the next proposition, shows that the subgradient is simply the gradient of
differentiable convex functions. Note that as a consequence of this, we have
the first-order inequality that f(y) > f(x) + h∇f(x), y − xi for any differentiable
convex function.

Proposition 2.4.8. Let f be convex and differentiable at a point x. Then ∂f(x) = {∇f(x)}.

Proof. If f is differentiable at a point x, then the chain rule implies that
    f′(x; u) = ⟨∇f(x), u⟩ ≥ ⟨g, u⟩
for any g ∈ ∂f(x), the inequality following from Lemma 2.4.4. By replacing u with
−u, we have f′(x; −u) = −⟨∇f(x), u⟩ ≥ −⟨g, u⟩ as well, so that ⟨g, u⟩ = ⟨∇f(x), u⟩ for
all u. Letting u vary in (for example) the set {u : ‖u‖_2 ≤ 1} gives the result. □
Lastly, we have the following consequence of the previous lemmas, which re-
lates the norms of subgradients g ∈ ∂f(x) to the Lipschitzian properties of f.
Recall that a function f is L-Lipschitz with respect to the norm k·k over a set C if
|f(x) − f(y)| 6 L kx − yk
for all x, y ∈ C. Then the following proposition is an immediate consequence of
Lemma 2.4.6.

Proposition 2.4.9. Suppose that f is L-Lipschitz with respect to the norm k·k over a set
C, where C ⊂ int dom f. Then
sup{kgk∗ : g ∈ ∂f(x), x ∈ C} 6 L.

Examples We can provide a number of examples of subgradients. A general
rule of thumb is that, if it is possible to compute the function, it is possible to
compute its subgradients. As a first example, we consider
    f(x) = |x|.
Then by inspection, we have
    ∂f(x) = {−1} if x < 0,   ∂f(x) = [−1, 1] if x = 0,   ∂f(x) = {1} if x > 0.
A more complex example is given by any vector norm ‖·‖. In this case, we use
the fact that the dual norm is defined by
    ‖y‖_* := sup_{x : ‖x‖ ≤ 1} ⟨x, y⟩.
Moreover, we have that ‖x‖ = sup_{y : ‖y‖_* ≤ 1} ⟨y, x⟩. Fixing x ∈ R^n, we thus see that
if ‖g‖_* ≤ 1 and ⟨g, x⟩ = ‖x‖, then
    ‖x‖ + ⟨g, y − x⟩ = ‖x‖ − ‖x‖ + ⟨g, y⟩ ≤ sup_{v : ‖v‖_* ≤ 1} ⟨v, y⟩ = ‖y‖.
It is possible to show a converse—we leave this as an exercise for the interested
reader—and we claim that
    ∂‖x‖ = {g ∈ R^n : ‖g‖_* ≤ 1, ⟨g, x⟩ = ‖x‖}.
For a more concrete example, we have
    ∂‖x‖_2 = {x / ‖x‖_2} if x ≠ 0,   and   ∂‖x‖_2 = {u : ‖u‖_2 ≤ 1} if x = 0.
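These examples are simple to implement. The following sketch (an illustration added here, not code from the notes) returns one element of each subdifferential and checks the defining inequality (2.4.1) at a few random points.

    import numpy as np

    def subgrad_l1(x):
        # an element of the subdifferential of ||x||_1; any value in [-1, 1] works where x_j = 0
        return np.sign(x)

    def subgrad_l2(x):
        # an element of the subdifferential of ||x||_2; any u with ||u||_2 <= 1 works at x = 0
        norm = np.linalg.norm(x)
        return x / norm if norm > 0 else np.zeros_like(x)

    rng = np.random.default_rng(0)
    for f, subgrad in [(lambda v: np.abs(v).sum(), subgrad_l1),
                       (np.linalg.norm, subgrad_l2)]:
        x = rng.normal(size=4)
        g = subgrad(x)
        for _ in range(100):
            y = rng.normal(size=4)
            assert f(y) >= f(x) + g @ (y - x) - 1e-9   # the subgradient inequality (2.4.1)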
Optimality properties Subgradients also allow us to characterize solutions to
convex optimization problems, giving similar characterizations to those we pro-
vided for projections. The next theorem, containing necessary and sufficient con-
ditions for a point x to minimize a convex function f, generalizes the standard
first-order optimality conditions for differentiable f (e.g., Section 4.2.3 in [12]).
The intuition for Theorem 2.4.11 is that there is a vector g in the subgradient set
∂f(x) such that −g is a supporting hyperplane to the feasible set C at the point
x. That is, the directions of decrease of the function f lie outside the optimization
set C. Figure 2.4.10 shows this behavior.

Theorem 2.4.11. Let f be convex. The point x ∈ int dom f minimizes f over a convex
set C if and only if there exists a subgradient g ∈ ∂f(x) such that simultaneously for all
y ∈ C,
(2.4.12) hg, y − xi > 0.

Figure 2.4.10. The point x⋆ minimizes f over C (the shown level curves) if and
only if for some g ∈ ∂f(x⋆), ⟨g, y − x⋆⟩ ≥ 0 for all y ∈ C. Note that not all
subgradients satisfy this inequality.

Proof. One direction of the theorem is easy. Indeed, suppose that g ∈ ∂f(x) satisfies
⟨g, y − x⟩ ≥ 0 for all y ∈ C. Then by definition, for any y ∈ C,
    f(y) ≥ f(x) + ⟨g, y − x⟩ ≥ f(x),
so x is clearly optimal.
For the converse, suppose that x minimizes f over C. Then for any y ∈ C and
any t > 0 such that x + t(y − x) ∈ C, we have
    f(x + t(y − x)) ≥ f(x),   or   0 ≤ [f(x + t(y − x)) − f(x)] / t.
Taking the limit as t ↓ 0, we have f′(x; y − x) ≥ 0 for all y ∈ C. Now, let
us suppose for the sake of contradiction that there exists a y such that for all
g ∈ ∂f(x), we have ⟨g, y − x⟩ < 0. Because
    ∂f(x) = {g : ⟨g, u⟩ ≤ f′(x; u) for all u ∈ R^n}
by Lemma 2.4.4, and ∂f(x) is compact, we have that sup_{g ∈ ∂f(x)} ⟨g, y − x⟩ is at-
tained, which by Lemma 2.4.6 would imply
    f′(x; y − x) < 0.
This is a contradiction. □
2.5. Calculus rules with subgradients We present a number of calculus rules
that show how subgradients are, essentially, similar to derivatives, with a few
exceptions (see also Ch. VII of [27]). When we develop methods for optimization
problems based on subgradients, these basic calculus rules will prove useful.
Scaling. If we let h(x) = αf(x) for some α > 0, then ∂h(x) = α∂f(x).
Finite sums. Suppose that f_1, . . . , f_m are convex functions and let f = Σ_{i=1}^m f_i.
Then
    ∂f(x) = Σ_{i=1}^m ∂f_i(x),
where the addition is Minkowski addition. To see that Σ_{i=1}^m ∂f_i(x) ⊂ ∂f(x),
let g_i ∈ ∂f_i(x) for each i, in which case it is clear that f(y) = Σ_{i=1}^m f_i(y) ≥
Σ_{i=1}^m [f_i(x) + ⟨g_i, y − x⟩], so that Σ_{i=1}^m g_i ∈ ∂f(x). The converse is somewhat more
technical and is a special case of the results to come.
Integrals. More generally, we can extend this summation result to integrals, as-
suming the integrals exist. These calculations are essential for our development
of stochastic optimization schemes based on stochastic (sub)gradient information
in the coming lectures. Indeed, for each s ∈ S, where S is some set, let f_s be
convex. Let µ be a positive measure on the set S, and define the convex function
f(x) = ∫ f_s(x) dµ(s). In the notation of the introduction (Eq. (1.0.1)) and the prob-
lems coming in Section 3.4, we take µ to be a probability distribution on a set S,
and if F(·; s) is convex in its first argument for all s ∈ S, then we may take
    f(x) = E[F(x; S)]
and satisfy the conditions above. We shall see many such examples in the sequel.
Then if we let g_s(x) ∈ ∂f_s(x) for each s ∈ S, we have (assuming the integral
exists and that the selections g_s(x) are appropriately measurable)
(2.5.1)    ∫ g_s(x) dµ(s) ∈ ∂f(x).
To see the inclusion, note that for any y we have
    ⟨∫ g_s(x) dµ(s), y − x⟩ = ∫ ⟨g_s(x), y − x⟩ dµ(s) ≤ ∫ (f_s(y) − f_s(x)) dµ(s) = f(y) − f(x).
So the inclusion (2.5.1) holds. Eliding a few technical details, one generally ob-
tains the equality
    ∂f(x) = {∫ g_s(x) dµ(s) : g_s(x) ∈ ∂f_s(x) for each s ∈ S}.
Returning to our running example of stochastic optimization, if we have a
collection of functions F : R^n × S → R, where for each s ∈ S the function F(·; s)
is convex, then f(x) = E[F(x; S)] is convex when we take expectations over S, and
taking
    g(x; s) ∈ ∂F(x; s)
gives a stochastic gradient with the property that E[g(x; S)] ∈ ∂f(x). For more on
these calculations and conditions, see the classic paper of Bertsekas [7], which
addresses the measurability issues.
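The practical upshot is that choosing any measurable selection g(x; s) ∈ ∂F(x; s) and sampling S ∼ P yields an unbiased estimate of an element of ∂f(x). A minimal sketch follows; the absolute-loss regression F(x; s) = |⟨a_s, x⟩ − b_s| is an assumed example rather than a problem defined in the text.

    import numpy as np

    def stochastic_subgrad(x, a_s, b_s):
        # g(x; s) in dF(x; s) for F(x; s) = |<a_s, x> - b_s|; then E[g(x; S)] lies in df(x)
        return np.sign(a_s @ x - b_s) * a_s

    rng = np.random.default_rng(0)
    x = np.zeros(3)
    g = stochastic_subgrad(x, rng.normal(size=3), rng.normal())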
Affine transformations. Let f : Rm → R be convex and A ∈ Rm×n and
b ∈ Rm . Then h : Rn → R defined by h(x) = f(Ax + b) is convex and has
subdifferential
    ∂h(x) = A^T ∂f(Ax + b).
Indeed, let g ∈ ∂f(Ax + b), so that
    h(y) = f(Ay + b) ≥ f(Ax + b) + ⟨g, (Ay + b) − (Ax + b)⟩ = h(x) + ⟨A^T g, y − x⟩,
giving the result.
Finite maxima. Let f_i, i = 1, . . . , m, be convex functions, and f(x) = max_{i ≤ m} f_i(x).
Then we have
    epi f = ∩_{i ≤ m} epi f_i,
which is convex, and so f is convex. Now, let i be any index such that f_i(x) = f(x),
and let g_i ∈ ∂f_i(x). Then we have for any y ∈ R^n that
    f(y) ≥ f_i(y) ≥ f_i(x) + ⟨g_i, y − x⟩ = f(x) + ⟨g_i, y − x⟩.
So g_i ∈ ∂f(x). More generally, we have the result that
(2.5.2)    ∂f(x) = Conv{∂f_i(x) : f_i(x) = f(x)},
that is, the subgradient set of f is the convex hull of the subgradients of active
functions at x, that is, those attaining the maximum. If there is only a single
unique active function f_i, then ∂f(x) = ∂f_i(x). See Figure 2.5.3 for a graphical
representation.

Figure 2.5.3. Subgradients of finite maxima. The function f(x) = max{f_1(x), f_2(x)},
where f_1(x) = x^2 and f_2(x) = −2x − 1/5, and f is differentiable everywhere except
at x_0 = −1 + √(4/5).
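The finite-maximum rule (2.5.2) is exactly how one differentiates piecewise losses in practice. As an assumed illustration (not an example from the text), for the hinge function F(w) = max{0, 1 − b⟨a, w⟩} a valid subgradient comes from whichever piece is active:

    import numpy as np

    def subgrad_hinge(w, a, b):
        # rule (2.5.2): pick a subgradient of an active function in max{0, 1 - b <a, w>}
        margin = 1.0 - b * (a @ w)
        if margin > 0:                   # the affine piece is uniquely active
            return -b * a
        if margin < 0:                   # the constant 0 piece is uniquely active
            return np.zeros_like(w)
        return -0.5 * b * a              # both active: any convex combination is valid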

Uncountable maxima (supremum). Lastly, consider f(x) = sup_{α ∈ A} f_α(x), where
A is an arbitrary index set and f_α is convex for each α ∈ A. First, let us assume that
the supremum is attained at some α ∈ A. Then, identically to the above, we have
that ∂f_α(x) ⊂ ∂f(x). More generally, we have
    ∂f(x) ⊃ Conv{∂f_α(x) : f_α(x) = f(x)}.
Achieving equality in the preceding definition requires a number of conditions,
and if the supremum is not attained, the function f may not be subdifferentiable.
Notes and further reading The study of convex analysis and optimization orig-
inates, essentially, with Rockafellar’s 1970 book Convex Analysis [49]. Because of
the limited focus of these lecture notes, we have only barely touched on many
topics in convex analysis, developing only those we need. Two omissions are
perhaps the most glaring: except tangentially, we have provided no discussion of
conjugate functions and conjugacy, and we have not discussed Lagrangian dual-
ity, both of which are central to any study of convex analysis and optimization.
A number of books provide coverage of convex analysis in finite and infinite di-
mensional spaces and make excellent further reading. For broad coverage of con-
vex optimization problems, theory, and algorithms, Boyd and Vandenberghe [12]
is an excellent reference, also providing coverage of basic convex duality theory
and conjugate functions. For deeper forays into convex analysis, personal fa-
vorites of mine include the books of Hiriart-Urruty and Lemaréchal [27, 28], as
well as the shorter volume [29], and Bertsekas [8] also provides an elegant geo-
metric picture of convex analysis and optimization. Our approach here follows
Hiriart-Urruty and Lemaréchal’s most closely. For a treatment of the issues of
separation, convexity, duality, and optimization in infinite dimensional spaces,
an excellent reference is the classic book by Luenberger [36].

3. Subgradient Methods
Lecture Summary: In this lecture, we discuss first order methods for the min-
imization of convex functions. We focus almost exclusively on subgradient-
based methods, which are essentially universally applicable for convex opti-
mization problems, because they rely very little on the structure of the prob-
lem being solved. This leads to effective but slow algorithms in classical
optimization problems. In large scale problems arising out of machine learn-
ing and statistical tasks, however, subgradient methods enjoy a number of
(theoretical) optimality properties and have excellent practical performance.

3.1. Introduction In this lecture, we explore a basic subgradient method, and a


few variants thereof, for solving general convex optimization problems. Through-
out, we will attack the problem
(3.1.1)    minimize_x f(x)   subject to x ∈ C,
where f : Rn → R is convex (though it may take on the value +∞ for x 6∈ dom f)
and C is a closed convex set. Certainly in this generality, finding a universally
good method for solving the problem (3.1.1) is hopeless, though we will see that
the subgradient method does essentially apply in this generality.
Convex programming methodologies developed in the last fifty years or so
have given powerful methods for solving optimization problems. The perfor-
mance of many methods for solving convex optimization problems is measured
by the amount of time or number of iterations required of them to give an ε-
optimal solution to the problem (3.1.1), roughly, how long it takes to find some x̂
such that f(x̂) − f(x⋆) ≤ ε and dist(x̂, C) ≤ ε for an optimal x⋆ ∈ C. Essentially
any problem for which we can compute subgradients efficiently can be solved
to accuracy ε in time polynomial in the dimension n of the problem and log(1/ε)
by the ellipsoid method (cf. [41, 45]). Moreover, for somewhat better structured
(but still quite general) convex problems, interior point and second order meth-
ods [12, 45] are practically and theoretically quite efficient, sometimes requiring
only O(log log(1/ε)) iterations to achieve optimization error ε. (See the lectures by S.
Wright in this volume.) These methods use the Newton method as a basic solver,
along with specialized representations of the constraint set C, and are quite pow-
erful.
However, for large scale problems, the time complexity of standard interior
point and Newton methods can be prohibitive. Indeed, for n-dimensional problems—
that is, when x ∈ R^n—interior point methods scale at best as O(n^3), and can be
much worse. When n is large (where today, large may mean n ≈ 10^9), this be-
comes highly non-trivial. In such large scale problems and problems arising from
any type of data-collection process, it is reasonable to expect that our representa-
tion of problem data is inexact at best. In statistical machine learning problems,
for example, this is often the case; generally, many applications do not require
accuracy higher than, say, ε = 10^{-2} or 10^{-3}, in which case faster but less exact
methods become attractive.
It is with this motivation that we attack solving the problem (3.1.1) in this
lecture, showing classical subgradient algorithms. These algorithms have the ad-
vantage that their per-iteration costs are low—O(n) or smaller for n-dimensional
problems—but they achieve low accuracy solutions to (3.1.1) very quickly. More-
over, depending on problem structure, they can sometimes achieve convergence
rates that are independent of problem dimension. More precisely, and as we will
see later, the methods we study will guarantee convergence to an ǫ-optimal so-
lution to problem (3.1.1) in O(1/ε^2) iterations, while methods that achieve better
dependence on ε require at least n log(1/ε) iterations.
3.2. The gradient and subgradient methods We begin by focusing on the un-
constrained case, that is, when the set C in problem (3.1.1) is C = Rn . That is, we
wish to solve
    minimize_{x ∈ R^n} f(x).
Figure 3.2.1. Left: linear approximation (in black) to the function f(x) = log(1 + e^x)
(in blue) at the point x_k = 0. Right: linear plus quadratic upper bound for the
function f(x) = log(1 + e^x) at the point x_k = 0. This is the upper bound and
approximation of the gradient method (3.2.3) with the choice α_k = 1.

We first review the gradient descent method, using it as motivation for what fol-
lows. In the gradient descent method, we minimize the objective (3.1.1) by iteratively
updating
(3.2.2)    x_{k+1} = x_k − α_k ∇f(x_k),
where α_k > 0 is a positive sequence of stepsizes. The original motivations for
this choice of update come from the fact that x⋆ minimizes a convex f if and only
if 0 = ∇f(x⋆); we believe a more compelling justification comes from the idea
of modeling the convex function being minimized. Indeed, the update (3.2.2) is
equivalent to
(3.2.3)    x_{k+1} = argmin_x { f(x_k) + ⟨∇f(x_k), x − x_k⟩ + (1/(2α_k)) ‖x − x_k‖_2^2 }.
The interpretation is as follows: the linear functional x ↦ f(x_k) + ⟨∇f(x_k), x − x_k⟩
is the best linear approximation to the function f at the point x_k, and we would
like to make progress minimizing f. So we minimize this linear approximation,
but to make sure that it has fidelity to the function f, we add a quadratic ‖x − x_k‖_2^2
to penalize moving too far from x_k, which would invalidate the linear approxi-
mation. See Figure 3.2.1. Assuming that f is continuously differentiable (often,
one assumes the gradient ∇f(x) is Lipschitz), then gradient descent is a descent
method if the stepsize αk > 0 is small enough—it monotonically decreases the
objective f(xk ). We spend no more time on the convergence of gradient-based
methods, except to say that the choice of the stepsize αk is often extremely im-
portant, and there is a body of research on carefully choosing directions as well
as stepsize lengths; Nesterov [44] provides an excellent treatment of many of the
basic issues.

Subgradient algorithms The subgradient method is a minor variant of the method (3.2.2),
except that instead of using the gradient, we use a subgradient. The method can
be written simply: for k = 1, 2, . . ., we iterate
    i. Choose any subgradient g_k ∈ ∂f(x_k).
    ii. Take the subgradient step
(3.2.4)    x_{k+1} = x_k − α_k g_k.
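In code, the iteration (3.2.4) is only a few lines. The sketch below is an illustrative implementation added here (the subgradient oracle subgrad is supplied by the user); it returns the whole trajectory so that the averaged or best iterate, discussed shortly, can be formed.

    import numpy as np

    def subgradient_method(subgrad, x0, stepsizes):
        # iterate x_{k+1} = x_k - alpha_k * g_k for any g_k in df(x_k)   (update (3.2.4))
        x = np.asarray(x0, dtype=float)
        iterates = [x.copy()]
        for alpha in stepsizes:
            g = subgrad(x)
            x = x - alpha * g
            iterates.append(x.copy())
        return iterates

    # example usage on f(x) = ||x||_1, whose subgradient selection is the sign vector
    trajectory = subgradient_method(np.sign, np.ones(3), [0.1] * 50)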
Unfortunately, the subgradient method is not, in general, a descent method.
For a simple example, take the function f(x) = |x|, and let x1 = 0. Then except
for the choice g = 0, all subgradients g ∈ ∂f(0) = [−1, 1] are ascent directions.
This is not just an artifact of 0 being optimal for f; in higher dimensions, this
behavior is common. Consider, for example, f(x) = ‖x‖_1 and let x = e_1 ∈ R^n,
the first standard basis vector. Then ∂f(x) = {e_1 + Σ_{i=2}^n t_i e_i : t_i ∈ [−1, 1]}.
Any vector g = e_1 + Σ_{i=2}^n t_i e_i with Σ_{i=2}^n |t_i| > 1 is an ascent direction for f,
meaning that f(x − αg) > f(x) for all α > 0. If we were to pick a uniformly
random g ∈ ∂f(e_1), for example, then the probability that g is a descent direction
is exponentially small in the dimension n.
In general, the characterization of the subgradient set ∂f(x) in Lemma 2.4.4
as {g : f′(x; u) ≥ ⟨g, u⟩ for all u}, where f′(x; u) = lim_{t ↓ 0} [f(x + tu) − f(x)] / t is the
directional derivative, and the fact that f′(x; u) = sup_{g ∈ ∂f(x)} ⟨g, u⟩, guarantee that
    argmin_{g ∈ ∂f(x)} ‖g‖_2^2
is a descent direction, but we do not prove this here. Indeed, finding such a
descent direction would require explicitly calculating the entire subgradient set
∂f(x), which for a number of functions is non-trivial and breaks the simplicity of
the subgradient method (3.2.4), which works with any subgradient.
It is the case, however, that so long as the point x does not minimize f(x), then
subgradients descend on a related quantity: the distance of x to any optimal point.
Indeed, let g ∈ ∂f(x), and let x⋆ ∈ argmin_x f(x) (we assume such a point exists),
which need not be unique. Then we have for any α that
    (1/2)‖x − αg − x⋆‖_2^2 = (1/2)‖x − x⋆‖_2^2 − α⟨g, x − x⋆⟩ + (α^2/2)‖g‖_2^2.
The key is that for small enough α > 0, the quantity on the right is strictly
smaller than (1/2)‖x − x⋆‖_2^2, as we now show. We use the defining inequality of the
subgradient, that is, that f(y) ≥ f(x) + ⟨g, y − x⟩ for all y, including x⋆. This gives
−⟨g, x − x⋆⟩ = ⟨g, x⋆ − x⟩ ≤ f(x⋆) − f(x), and thus
(3.2.5)    (1/2)‖x − αg − x⋆‖_2^2 ≤ (1/2)‖x − x⋆‖_2^2 − α(f(x) − f(x⋆)) + (α^2/2)‖g‖_2^2.

From inequality (3.2.5), we see immediately that, no matter our choice g ∈ ∂f(x),
we have
2(f(x) − f(x⋆ ))
0<α< implies kx − αg − x⋆ k22 < kx − x⋆ k22 .
kgk22
Summarizing, by noting that f(x) − f(x⋆ ) > 0, we have

Observation 3.2.6. If 0 6∈ ∂f(x), then for any x⋆ ∈ argminx f(x) and any g ∈ ∂f(x),
there is a stepsize α > 0 such that kx − αg − x⋆ k22 < kx − x⋆ k22 .

This observation is the key to the analysis of subgradient methods.


Convergence guarantees Perhaps unsurprisingly, given the simplicity of the
subgradient method, the analysis of convergence for the method is also quite
simple. We begin by stating a general result on the convergence of subgradient
methods; we provide a number of variants in the sequel. We make a few sim-
plifying assumptions in stating our result, several of which are not completely
necessary, but which considerably simplify the analysis. We enumerate them
here:
i. There is at least one (possibly non-unique) minimizing point x⋆ ∈ argminx f(x)
with f(x⋆ ) = infx f(x) > −∞
ii. The subgradients are bounded: for all x and all g ∈ ∂f(x), we have the
subgradient bound kgk2 6 M < ∞ (independently of x).

Theorem 3.2.7. Let αk > 0 be any non-negative sequence of stepsizes and the preceding
assumptions hold. Let xk be generated by the subgradient iteration (3.2.4). Then for all
K > 1,
K K
X 1 1X 2 2
αk [f(xk ) − f(x⋆ )] 6 kx1 − x⋆ k22 + αk M .
2 2
k=1 k=1

Proof. The entire proof essentially amounts to writing down the distance kxk+1 − x⋆ k22
and expanding the square, which we do. By applying inequality (3.2.5), we have
1 1
kxk+1 − x⋆ k22 = kxk − αk gk − x⋆ k22
2 2
(3.2.5) 1 α2
6 kxk − x⋆ k22 − αk (f(xk ) − f(x⋆ )) + k kgk k22 .
2 2
Rearranging this inequality and using that kgk k22 6 M2 , we obtain

1 1 α2
αk [f(xk ) − f(x⋆ )] 6 kxk − x⋆ k22 − kxk+1 − x⋆ k22 + k kgk k22
2 2 2
1 1 α2
6 kxk − x⋆ k22 − kxk+1 − x⋆ k22 + k M2 .
2 2 2
By summing the preceding expression from k = 1 to k = K and canceling the
alternating ± kxk − x⋆ k22 terms, we obtain the theorem. 
John C. Duchi 29

Theorem 3.2.7 is the starting point from which we may derive a number of
useful consquences. First, we use convexity to obtain the following immediate
corollary (we assume that αk > 0 in the corollary).
P 1 PK
Corollary 3.2.8. Let Ak = k i=1 αi and define xK = AK k=1 αk xk . Then
P
kx1 − x⋆ k22 + K 2
k=1 αk M
2
f(xK ) − f(x⋆ ) 6 PK .
2 k=1 αk
−1 PK
Proof. Noting that AK k=1 αk = 1, we see by convexity that
K
" K #
⋆ 1 X
⋆ −1
X

f(xK ) − f(x ) 6 PK αk f(xk ) − f(x ) = AK αk (f(xk ) − f(x )) .
k=1 αk k=1 k=1
Applying Theorem 3.2.7 gives the result. 
Corollary 3.2.8 allows us to give a number of basic convergence guarantees
based on our stepsize choices. For example, we see that whenever we have

X
αk → 0 and αk = ∞,
k=1
PK 2
PK
then k=1 αk / k=1 αk → 0 and so
f(xK ) − f(x⋆ ) → 0 as K → ∞.
Moreover, we can give specific stepsize choices to optimize the bound. For exam-
ple, let us assume for simplicity that R2 = kx1 − x⋆ k22 is our distance (radius) to
optimality. Then choosing a fixed stepsize αk = α, we have
R2 αM2
(3.2.9) f(xK ) − f(x⋆ ) 6 + .
2Kα 2
R
Optimizing this bound by taking α = √ gives
M K
RM
f(xK ) − f(x⋆ ) 6 √ .
K
Given that subgradient descent methods are not descent methods, it often
makes sense, instead of tracking the (weighted) average of the points or using
the final point, to use the best point observed thus far. Naturally, if we let
xbest
k = argmin f(xi )
xi :i6k

and define fbest


k = f(xbest
k ),then we have the same convergence guarantees that
P
best ⋆ R2 + K 2
k=1 αk M
2
f(xk ) − f(x ) 6 PK .
2 k=1 αk
A number of more careful stepsize choices are possible, though we refer to the
notes at the end of this lecture for more on these choices and applications outside
of those we consider, as our focus is naturally circumscribed.
30 Introductory Lectures on Stochastic Optimization

1
10

α = .01
α = .1
α =1
f(xk ) − f(x⋆ ) 10
0 α =10

-1
10

-2
10

-3
10
0 500 1000 1500 2000 2500 3000 3500 4000

Figure 3.2.10. Subgradient method applied to the robust regres-


sion problem (3.2.12) with fixed stepsizes.
1
10

α = .01
α = .1
α =1
10
0 α =10
k − f(x )

-1
10
fbest ⋆

-2
10

-3
10
0 500 1000 1500 2000 2500 3000 3500 4000

Figure 3.2.11. Subgradient method applied to the robust regres-


sion problem (3.2.12) with fixed stepsizes, showing performance
of the best iterate fbest
k − f(x ).

Example We now present an example that has applications in robust statistics


and other data fitting scenarios. As a motivating scenario, suppose we have a
sequence of vectors ai ∈ Rn and target responses bi ∈ R, and we would like to
predict bi via the inner product hai , xi for some vector x. If there are outliers or
other data corruptions in the targets bi , a natural objective for this task, given the
John C. Duchi 31

data matrix A = [a1 · · · am ]⊤ ∈ Rm×n and vector b ∈ Rm , is the absolute error


m
1 1 X
(3.2.12) f(x) = kAx − bk1 = | hai , xi − bi |.
m m
i=1
We perform subgradient descent on this objective, which has subgradient
m
1 T 1 X
g(x) = A sign(Ax − b) = ai sign(hai , xi − bi ) ∈ ∂f(x)
m m
i=1
at the point x, for K = 4000 iterations with a fixed stepsize αk ≡ α for all k. We
give the results in Figures 3.2.10 and 3.2.11, which exhibit much of the typical
behavior of subgradient methods. From the plots, we see roughly a few phases of
behavior: the method with stepsize α = 1 makes progress very quickly initially,
but then enters its “jamming” phase, where it essentially makes no more progress.
(The largest stepsize, α = 10, simply jams immediately.) The accuracy of the
methods with different stepsizes varies greatly, as well—the smaller the stepsize,
the better the (final) performance of the iterates xk , but initial progress is much
slower.
3.3. Projected subgradient methods It is often the case that we wish to solve
problems not over Rn but over some constrained set, for example, in the Lasso [57]
and in compressed sensing applications [20] one minimizes an objective such as
kAx − bk22 subject to kxk1 6 R for some constant R < ∞. Recalling the prob-
lem (3.1.1), we more generally wish to solve the problem
minimize f(x) subject to x ∈ C ⊂ Rn ,
where C is a closed convex set, not necessarily Rn . The projected subgradient
method is close to the subgradient method, except that we replace the iteration
with
(3.3.1) xk+1 = πC (xk − αk gk )
where
πC (x) = argmin{kx − yk2 }
y∈C

denotes the (Euclidean) projection onto C. As in the gradient case (3.2.3), we


can reformulate the update as making a linear approximation, with quadratic
damping, to f and minimizing this approximation: by algebraic manipulation,
the update (3.3.1) is equivalent to

1
(3.3.2) xk+1 = argmin f(xk ) + hgk , x − xk i + kx − xk k22 .
x∈C 2αk
Figure 3.3.3 shows an example of the iterations of the projected gradient method
applied to minimizing f(x) = kAx − bk22 subject to the ℓ1 -constraint kxk1 6 1.
Note that the method iterates between moving outside the ℓ1 -ball toward the
minimum of f (the level curves) and projecting back onto the ℓ1 -ball.
32 Introductory Lectures on Stochastic Optimization

Figure 3.3.3. Example execution of the projected gradient


method (3.3.1), on minimizing f(x) = 12 kAx − bk22 subject to
kxk1 6 1.

It is very important in the projected subgradient method that the projection


mapping πC be efficiently computable—the method is effective essentially only in
problems where this is true. In many situations, this is the case, but some care is
necessary if the objective f is simple while the set C is complex. In such scenarios,
projecting onto the set C may be as complex as solving the original optimization
problem (3.1.1). For example, a general linear programming problem is described
by
minimize hc, xi subject to Ax = b, Cx  d.
x
Then computing the projection onto the set {x : Ax = b, Cx  d} is at least as
difficult as solving the original problem.
Examples of projections As noted above, it is important that projections πC
be efficiently calculable, and often a method’s effectiveness is governed by how
quickly one can compute the projection onto the constraint set C. With that in
mind, we now provide two examples exhibiting convex sets C onto which projec-
tion is reasonably straightforward and for which we can write explicit, concrete
projected subgradient updates.
Example 3.1: Suppose that C is an affine set, represented by C = {x ∈ Rn : Ax =
b} for A ∈ Rm×n , m 6 n, where A is full rank. (So that A is a short and fat
matrix and AAT ≻ 0.) Then the projection of x onto C is
πC (x) = (I − AT (AAT )−1 A)x + AT (AAT )−1 b,
and if we begin the iterates from a point xk ∈ C, i.e. with Axk = b, then
xk+1 = πC (xk − αk gk ) = xk − αk (I − AT (AAT )−1 A)gk ,
that is, we simply project gk onto the nullspace of A and iterate. ♦
John C. Duchi 33

Example 3.2 (Some norm balls): Let us consider updates when C = {x : kxkp 6 1}
for p ∈ {1, 2, ∞}, each of which is reasonably simple, though the projections are
no longer affine. First, for p = ∞, we consider each coordinate j = 1, 2, . . . , n in
turn, giving
[πC (x)]j = min{1, max{xj , −1}},
that is, we simply truncate the coordinates of x to be in the range [−1, 1]. For
p = 2, we have a similarly simple to describe update:

x if kxk2 6 1
πC (x) =
x/ kxk2 otherwise.
When p = 1, that is, C = {x : kxk1 6 1}, the update is somewhat more complex. If
kxk1 6 1, then πC (x) = x. Otherwise, we find the (unique) t > 0 such that
n
X  
|xj | − t +
= 1,
j=1

and then set the coordinates j via


 
[πC (x)]j = sign(xj ) |xj | − t + .
There are numerous efficient algorithms for finding this t (e.g. [14, 23]). ♦

Convergence results We prove the convergence of the projected subgradient


using an argument similar to our proof of convergence for the classic (uncon-
strained) subgradient method. We assume that the set C is contained in the
interior of the domain of the function f, which (as noted in the lecture on con-
vex analysis) guarantees that f is Lipschitz continuous and subdifferentiable, so
that there exists M < ∞ with kgk2 6 M for all g ∈ ∂f. We make the following
assumptions in the next theorem.
i. The set C ⊂ Rn is compact and convex, and kx − x⋆ k2 6 R < ∞ for all x ∈ C.
ii. There exists M < ∞ such that kgk2 6 M for all g ∈ ∂f(x) and x ∈ C.
We make the compactness assumption to allow for a slightly different result than
Theorem 3.2.7.

Theorem 3.3.4. Let xk be generated by the projected subgradient iteration (3.3.1), where
the stepsizes αk > 0 are non-increasing. Then
K K
X R2 1X
[f(xk ) − f(x⋆ )] 6 + α k M2 .
2αK 2
k=1 k=1

Proof. The starting point of the proof is the same basic inequality as we have been
using, that is, the distance kxk+1 − x⋆ k22 . In this case, we note that projections can
never increase distances to points x⋆ ∈ C, so that
kxk+1 − x⋆ k22 = kπC (xk − αk gk ) − x⋆ k22 6 kxk − αk gk − x⋆ k22 .
34 Introductory Lectures on Stochastic Optimization

Now, as in our earlier derivation, we apply inequality (3.2.5) to obtain


1 1 α2
kxk+1 − x⋆ k22 6 kxk − x⋆ k22 − αk [f(xk ) − f(x⋆ )] + k kgk k22 .
2 2 2
Rearranging this slightly by dividing by αk , we find that
1 h i α
f(xk ) − f(x⋆ ) 6 kxk − x⋆ k22 − kxk+1 − x⋆ k22 + k kgk k22 .
2αk 2
Now, using a variant of the telescoping sum in the proof of Theorem 3.2.7 we
have
(3.3.5)
1 h i X
XK XK K
αk
[f(xk ) − f(x⋆ )] 6 kxk − x⋆ k22 − kxk+1 − x⋆ k22 + kgk k22 .
2αk 2
k=1 k=1 k=1
We rearrange the middle sum in expression (3.3.5), obtaining

1 h i
K
X
kxk − x⋆ k22 − kxk+1 − x⋆ k22
2αk
k=1
K  
X 1 1 1 1
= − kxk − x⋆ k22 + kx1 − x⋆ k22 − kxK − x⋆ k22
2αk 2αk−1 2α1 2αK
k=2
K  
X 1 1 1 2
6 − R2 + R
2αk 2αk−1 2α1
k=2

because αk 6 αk−1 . Noting that this last sum telescopes and that kgk k22 6 M2 in
inequality (3.3.5) gives the result. 
One application of this result is when we use a decreasing stepsize of αk =

α/ k, which allows nearly as strong of a convergence rate as in the fixed stepsize
case when the number of iterations K is known, but the algorithm provides a
guarantee for all iterations k. Here, we have that
K ZK √
X 1 1
√ 6 t− 2 dt = 2 K,
k=1
k 0

1 P K
and so by taking xK = K k=1 xk we obtain the following corollary.

Corollary 3.3.6. In addition to the conditions of the preceding paragraph, let the condi-
tions of Theorem 3.3.4 hold. Then
R2 M2 α
f(xK ) − f(x⋆ ) 6 √ + √ .
2α K K

So we see that convergence is guaranteed, at the “best” rate 1/ K, for all iter-
ations. Here, we say “best” because this rate is unimprovable—there are worst
case functions for which no method can achieve a rate of convergence faster than

RM/ K—but in practice, one would hope to attain better behavior by leveraging
problem structure.
John C. Duchi 35

3.4. Stochastic subgradient methods The real power of subgradient methods,


which has become evident in the last ten or fifteen years, is in their applicability to
large scale optimization problems. Indeed, while subgradient methods guarantee
only slow convergence—requiring 1/ǫ2 iterations to achieve ǫ-accuracy—their
simplicity provides the benefit that they are robust to a number of errors. In fact,
subgradient methods achieve unimprovable rates of convergence for a number
of optimization problems with noise, and they often do so very computationally
efficiently.
Stochastic optimization problems The basic building block for stochastic (sub)gradient
methods is the stochastic (sub)gradient, often called the stochastic (sub)gradient or-
acle. Let f : Rn → R ∪ {∞} be a convex function, and fix x ∈ dom f. (We will
typically omit the sub- qualifier in what follows.) Then a random vector g is a
stochastic gradient for f at the point x if E[g] ∈ ∂f(x), or
f(y) > f(x) + hE[g], y − xi for all y.
Said somewhat more formally, we make the following definition.

Definition 3.4.1. A stochastic gradient oracle for the function f consists of a triple
(g, S, P), where S is a sample space, P is a probability distribution, and g : Rn ×
S → Rn is a mapping that for each xZ∈ dom f satisfies
EP [g(x, S)] = g(x, s)dP(s) ∈ ∂f(x),
where S ∈ S is a sample drawn from P.

Often, with some abuse of notation, we will use g or g(x) for shorthand of the
random vector g(x, S) when this does not cause confusion.
A standard example for these types of problems is stochastic programming,
where we wish to solve the convex optimization problem

minimize f(x) := EP [F(x; S)]


(3.4.2)
subject to x ∈ C.
Here S is a random variable on the space S with distribution P (so the expectation
EP [F(x; S)] is taken according to P), and for each s ∈ S, the function x 7→ F(x; s) is
convex. Then we immediately see that if we let
g(x, s) ∈ ∂x F(x; s),
then g is a stochastic gradient when we draw S ∼ P and set g = g(x, S), as in
Lecture 2 (recall expression (2.5.1)). Recalling this calculation, we have
f(y) = EP [F(y; S)] > EP [F(x; S) + hg(x, S), y − xi] = f(x) + hEP [g(x, S)], y − xi
so that EP [g(x, S)] is a stochastic subgradient.
36 Introductory Lectures on Stochastic Optimization

To make the setting (3.4.2) more concrete, consider the robust regression prob-
lem (3.2.12), which uses
m
1 1 X
f(x) = kAx − bk1 = | hai , xi − bi |.
m m
i=1
Then a natural stochastic gradient, which requires time only O(n) to compute
(as opposed to O(m · n) to compute Ax − b), is to uniformly at random draw an
index i ∈ [m], then return
g = ai sign(hai , xi − bi ).
More generally, given any problem in which one has a large dataset {s1 , . . . , sm },
and we wish to minimize the sum
m
1 X
f(x) = F(x; si ),
m
i=1
then drawing an index i ∈ {1, . . . , m} uniformly at random and using g ∈ ∂x F(x; si )
is a stochastic gradient. Computing this stochastic gradient requires only the time
necessary for computing some element of the subgradient set ∂x F(x; si ), while the
standard subgradient method applied to these problems is m-times more expen-
sive in each iteration.
More generally, the expectation E[F(x; S)] is generally intractable to compute,
especially if S is a high-dimensional distribution. In statistical and machine learn-
ing applications, we may not even know the distribution P, but we can observe
iid
samples Si ∼ P. In these cases, it may be impossible to even implement the cal-
culation of a subgradient f ′ (x) ∈ ∂f(x), but sampling from P is possible, allowing
us to compute stochastic subgradients.
Stochastic subgradient method With this motivation in place, we can describe
the (projected) stochastic subgradient method. Simply, the method iterates as
follows:
(1) Compute a stochastic subgradient gk at the point xk , where E[gk | xk ] ∈
∂f(x)
(2) Perform the projected subgradient step
xk+1 = πC (xk − αk gk ).
This is essentially identical to the projected gradient method (3.3.1), except that
we replace the true subgradient with a stochastic gradient.
John C. Duchi 37

101
Subgradient
Stochastic
100

10-1
f(xk ) − f(x⋆ )

10-2

10-3

10-4
0 500 1000 1500 2000
Iteration k
Figure 3.4.3. Stochastic subgradient method versus non-
stochastic subgradient method performance on problem (3.4.4).

In the next section, we analyze the convergence of the procedure, but here we
give two examples example here that exhibit some of the typical behavior of these
methods.
Example 3.3 (Robust regression): We consider the robust regression problem (3.2.12),
solving
m
1 X
(3.4.4) minimize f(x) = | hai , xi − bi | subject to kxk2 6 R,
x m
i=1
using the random sample g = ai sign(hai , xi − bi ) as our stochastic gradient. We
iid
generate A = [a1 · · · am ]⊤ by drawing ai ∼ N(0, In×n ) and bi = hai , ui + εi |εi |3 ,
iid
where εi ∼ N(0, 1) and u is a Gaussian random variable with identity covariance.
We use n = 50, m = 100, and R = 4 for this experiment.
We plot the results of running the stochastic gradient iteration versus stan-
dard projected subgradient descent in Figure 3.4.3; both methods run with the

fixed stepsize α = R/M K for M2 = m 1
kAk2Fr , which optimizes the convergence
guarantees for the methods. We see in the figure the typical performance of a sto-
chastic gradient method: the initial progress in improving the objective is quite
fast, but the method eventually stops making progress once it achieves some low
accuracy (in this case, 10−1 ). In this figure we should make clear, however, that
each iteration of the stochastic gradient method requires time O(n), while each
iteration of the (non-noisy) projected gradient method requires times O(n · m), a
factor of approximately 100 times slower. ♦

Example 3.4 (Multiclass support vector machine): Our second example is some-
what more complex. We are given a collection of 16 × 16 grayscale images of
38 Introductory Lectures on Stochastic Optimization

103
SGD: α1 = R/M
Non-stochastic: α1 =0.01R/M
102 Non-stochastic: α1 =0.1R/M
Non-stochastic: α1 =1.0R/M
f(Xk ) − f(X⋆ )
101 Non-stochastic: α1 =10.0R/M
Non-stochastic: α1 =100.0R/M
100

10-1

10-2

10-3 0 10 20 30 40 50 60 70 80
Effective passes through A

Figure 3.4.5. Comparison of stochastic versus non-stochastic


methods for the average hinge-loss minimization problem (3.4.6).
The horizontal axis is a measure of the time used by each method,
represented as the number of times the matrix-vector product
XT ai is computed. Stochastic gradient descent vastly outper-
forms the non-stochastic methods.

handwritten digits {0, 1, . . . , 9}, and wish to classify images, represented as vec-
tors a ∈ R256 , as one of the 10 digits. In a general k-class classification problem,
we represent the multiclass classifier using the matrix
X = [x1 x2 · · · xk ] ∈ Rn×k ,
where k = 10 for the digit classification problem. Given a data vector a ∈ Rn , the
“score” associated with class l is then hxl , ai, and the goal (given image data) is
to find a matrix X assigning high scores to the correct image labels. (In machine
learning, the typical notation is to use weight vectors w1 , . . . , wk ∈ Rn instead of
x1 , . . . , xk , but we use X to remain consistent with our optimization focus.) The
predicted class for a data vector a ∈ Rn is then
argmax ha, xl i = argmax{[XT a]l }.
l∈[k] l∈[k]

We represent single training examples as pairs (a, b) ∈ Rn × {1, . . . k}, and as a


convex surrogate for a misclassification error that the matrix X makes on the pair
(a, b), we use the multiclass hinge loss function
F(X; (a, b)) = max [1 + ha, xl − xb i]+
l6=b

where [t]+ = max{t, 0} denotes the positive part. Then F is convex in X, and for
a pair (a, b) we have F(X; (a, b)) = 0 if and only if the classifer represented by X
John C. Duchi 39

has a large margin, meaning that


ha, xb i > ha, xl i + 1 for all l 6= b.
In this example, we have a sample of N = 7291 digits (ai , bi ) ∈ Rn × {1, . . . , k},
and we compare the performance of stochastic subgradient descent to standard
subgradient descent for solving the problem
N
1 X
(3.4.6) minimize f(X) = F(X; (ai , bi )) subject to kXkFr 6 R
N
i=1
where R = 40. We perform stochastic gradient descent using stepsizes αk =
√ 1 PN 2
α1 / k, where α1 = R/M and M2 = N i=1 kai k2 (this is an approximation
to the Lipschitz constant of f). For our stochastic gradient oracle, we select an
index i ∈ {1, . . . , N} uniformly at random, then take g ∈ ∂X F(X; (ai , bi )). For
the standard subgradient method, we also perform projected subgradient de-
scent, where we compute subgradients by taking gi ∈ ∂F(X; (ai , bi )) and set-
1 PN
ting g = N
√ i=1 gi ∈ ∂f(X). We use an identical stepsize strategy of setting
αk = α1 / k, but use the five stepsizes α1 = 10−j R/M for j ∈ {−2, −1, . . . , 2}. We
plot the results of this experiment in Figure 3.4.5, showing the optimality gap
(vertical axis) plotted against the number of matrix-vector products X⊤ a com-
puted, normalized by N = 7291. The plot makes clear that computing the entire
subgradient ∂f(X) is wasteful: the non-stochastic methods’ convergence, in terms
of iteration count, is potentially faster than that for the stochastic method, but
the large (7291×) per-iteration speedup the stochastic method enjoys because of
its random sampling yields substantially better performance. Though we do not
demonstrate this in the figure, this benefit remains typically true even across a
range of stepsize choices, suggesting the benefits of stochastic gradient methods
in stochastic programming problems such as problem (3.4.6). ♦

Convergence guarantees We now turn to guarantees of convergence for the


stochastic subgradient method. As in our analysis of the projected subgradi-
ent method, we assume that C is compact and there is some R < ∞ such that
kx⋆ − xk2 6 R for all x ∈ C, that projections πC are efficiently computable, and
that for all x ∈ C we have the bound E[kg(x, S)k22 ] 6 M2 for our stochastic oracle
g. (The oracle’s noise S may depend on the previous iterates, but we always have
the unbiased condition E[g(x, S)] ∈ ∂f(x).)

Theorem 3.4.7. Let the conditions of the preceding paragraph hold and let αk > 0 be a
P
non-increasing sequence of stepsizes. Let xK = K1 Kk=1 xk . Then
K
R2 1 X

E[f(xK ) − f(x )] 6 + α k M2 .
2KαK 2K
k=1

Proof. The analysis is quite similar to our previous analyses, in that we simply
expand the error kxk+1 − x⋆ k22 . Let use define f ′ (x) := E[g(x, S)] ∈ ∂f(x) to be
40 Introductory Lectures on Stochastic Optimization

the expected subgradient returned by the stochastic gradient oracle, and let ξk =
gk − f ′ (xk ) be the error in the kth subgradient. Then
1 1
kxk+1 − x⋆ k22 = kπC (xk − αk gk ) − x⋆ k22
2 2
1
6 kxk − αk gk − x⋆ k22
2
1 α2
= kxk − x⋆ k22 − αk hgk , xk − x⋆ i + k kgk k22 ,
2 2
as in the proof of Theorems 3.2.7 and 3.3.4. Now, we add and subtract αk hf ′ (xk ), xk − x⋆ i,
which gives

1 1
α2
kxk+1 − x⋆ k22 6 kxk − x⋆ k22 − αk f ′ (xk ), xk − x⋆ + k kgk k22 − αk hξk , xk − x⋆ i
2 2 2
α 2
1
6 kxk − x⋆ k22 − αk [f(xk ) − f(x⋆ )] + k kgk k22 − αk hξk , xk − x⋆ i ,
2 2
where we have used the standard first-order convexity inequality.
Except for the error term hξk , xk − x⋆ i, the proof is completely identical to that
of Theorem 3.3.4. Indeed, dividing each side of the preceding display by αk and
rearranging, we have
1   α
f(xk ) − f(x⋆ ) 6 kxk − x⋆ k22 − kxk+1 − x⋆ k22 + k kgk k22 − hξk , xk − x⋆ i .
2αk 2
Summing this inequality, as in the proof of Theorem 3.3.4 following inequal-
ity (3.3.5), yields that
K K K
X R2 1X X
(3.4.8) [f(xk ) − f(x⋆ )] 6 + αk kgk k22 − hξk , xk − x⋆ i .
2αK 2
k=1 k=1 k=1
The inequality (3.4.8) is the basic inequality from which all our subsequent con-
vergence guarantees follow.
For this theorem, we need only take expectations, realizing that


E[hξk , xk − x⋆ i] = E E[ g(xk ) − f ′ (xk ), xk − x⋆ | xk ]
h i
= E hE[g(xk ) | xk ] −f ′ (xk ), xk − xi = 0.
| {z }
=f ′ (xk )

Thus we obtain
X
K  K
R2 1X
E (f(xk ) − f(x )) 6 ⋆
+ α k M2
2αK 2
k=1 k=1

once we realize that E[kgk k22 ] 6 M2 , which gives the desired result. 
Theorem 3.4.7 makes it clear that, in expectation, we can achieve the same con-
vergence guarantees as in the non-noisy case. This does not mean that stochastic
subgradient methods are always as good as non-stochastic methods, but it does
show the robustness of the subgradient method even to substantial noise. So
John C. Duchi 41

while the subgradient method is very slow, its slowness comes with the benefit
that it can handle large amounts of noise.
We now provide a few corollaries on the convergence of stochastic gradient de-
scent. For background on probabilistic modes of convergence, see Appendix A.2.

Corollary 3.4.9. Let the conditions of Theorem 3.4.7 hold, and let αk = R/M k for
each k. Then
3RM
E[f(xK )] − f(x⋆ ) 6 √
2 K
for all K ∈ N.

The proof of the corollary is identical to that of Corollary 3.3.6 for the projected
gradient method, once we substitute α = R/M in the bound. We can also obtain
convergence in probability of the iterates more generally.
P
Corollary 3.4.10. Let αk be non-summable but convergent to zero, that is, ∞k=1 αk =
p
∞ and αk → 0. Then f(xK ) − f(x⋆ ) → 0 as K → ∞, that is, for all ǫ > 0 we have
lim sup P (f(xk ) − f(x⋆ ) > ǫ) = 0.
k→∞

The above corollaries guarantee convergence of the iterates in expectation and


with high probability, but sometimes it is advantageous to give finite sample
guarantees of convergence with high probability. We can do this under somewhat
stronger conditions on the subgradient noise sequence and using the Azuma-
Hoeffding inequality (Theorem A.2.5 in Appendix A.2), which we present now.

Theorem 3.4.11. In addition to the conditions of Theorem 3.4.7, assume that kgk2 6 M
for all stochastic subgradients g. Then for any ǫ > 0,
K
R2 X αk RM
f(xK ) − f(x⋆ ) 6 + M2 + √ ǫ
2KαK 2 K
k=1
1 2
with probability at least 1 − e− 2 ǫ .
1 2
Written differently, we see that by taking αk = √ R and setting δ = e− 2 ǫ , we
kM
have q
1
3MR MR 2 log δ
f(xK ) − f(x⋆ ) 6 √ + √
K K

with probability at least 1 − δ. That is, we have convergence of O(MR/ K) with
high probability.
Before providing the proof proper, we discuss two examples in which the
boundedness condition holds. Recall from Lecture 2 that a convex function f
is M-Lipschitz if and only if kgk2 6 M for all g ∈ ∂f(x) and x ∈ Rn , so Theo-
rem 3.4.11 requires that the random functions F(·; S) are Lipschitz over the domain
C. Our robust regression and multiclass support vector machine examples both
satisfy the conditions of the theorem so long as the data is bounded. More pre-
cisely, for the robust regression problem (3.2.12) with loss F(x; (a, b)) = | ha, xi − b|,
42 Introductory Lectures on Stochastic Optimization

we have ∂F(x; (a, b)) = a sign(ha, xi − b) so that the condition kgk2 6 M holds
if and only if kak2 6 M. For the multiclass hinge loss problem (3.4.6), with
P
F(X; (a, b)) = l6=b [1 + ha, xl − xb i]+ , Exercise 5 develops the subgradient cal-
culations, but again, we have the boundedness of ∂X F(X; (a, b)) if and only if
a ∈ Rn is bounded.
Proof. We begin with the basic inequality of Theorem 3.4.7, inequality (3.4.8). We
see that we would like to bound the probability that
K
X
hξk , x⋆ − xk i
k=1
is large. First, we note that the iterate xk is a function of ξ1 , . . . , ξk−1 , and we
have the conditional expectation
E[ξk | ξ1 , . . . , ξk−1 ] = E[ξk | xk ] = 0.
Moreover, using the boundedness assumption that kgk2 6 M, we have kξk k2 =
kgk − f ′ (xk )k2 6 2M and
| hξk , xk − x⋆ i | 6 kξk k2 kxk − x⋆ k2 6 2MR.
PK
k=1 hξk , xk − x i is a bounded difference martingale se-
Thus, the sequence ⋆

quence, and we may apply Azuma’s inequality (Theorem A.2.5), which gurantees
X K   
t2
P hξk , x⋆ − xk i > t 6 exp −
2KM2 R2
k=1

for all t > 0. Substituting t = MR Kǫ, we obtain that
 X K   2
1 ǫMR ǫ
P hξk , x⋆ − xk i > √ 6 exp − ,
K K 2
k=1
as desired. 
Summarizing the results of this section, we see a number of consequences.
First, stochastic gradient methods guarantee that after O(1/ǫ2 ) iterations, we have
error at most f(x) − f(x⋆ ) = O(ǫ). Secondly, this convergence is (at least to the
order in ǫ) the same as in the non-noisy case; that is, stochastic gradient meth-
ods are robust enough to noise that their convergence is hardly affected by it. In
addition to this, they are often applicable in situations in which we cannot even
evaluate the objective f, whether for computational reasons or because we do not
have access to it, as in statistical problems. This robustness to noise and good
performance has led to wide adoption of subgradient-like methods as the de facto
choice for many large-scale data-based optimization problems. In the coming sec-
tions, we give further discussion of the optimality of stochastic gradient methods,
showing that—roughly—when we have access only to noisy data, it is impossi-
ble to solve (certain) problems to accuracy better than ǫ given 1/ǫ2 data points;
John C. Duchi 43

thus, using more expensive but accurate optimization methods may have limited
benefit (though there may still be some benefit practically!).
Notes and further reading Our treatment in this chapter borrows from a num-
ber of resources. The two heaviest are the lecture notes for Stephen Boyd’s Stan-
ford’s EE364b course [10, 11] and Polyak’s Introduction to Optimization [47]. Our
guarantees of high probability convergence are similar to those originally de-
veloped by Cesa-Bianchi et al. [16] in the context of online learning, which Ne-
mirovski et al. [40] more fully develop. More references on subgradient methods
include the lecture notes of Nemirovski [43] and Nesterov [44].
A number of extensions of (stochastic) subgradient methods are possible, in-
cluding to online scenarios in which we observe streaming sequences of func-
tions [25, 63]; our analysis in this section follows closely that of Zinkevich [63].
The classic paper of Polyak and Juditsky [48] shows that stochastic gradient de-
scent methods, coupled with averaging, can achieve asymptotically optimal rates
of convergence even to constant factors. Recent work in machine learning by a
number of authors [18, 32, 53] has shown how to leverage the structure of opti-
1 PN
mization problems based on finite sums, that is, when f(x) = N i=1 fi (x), to
develop methods that achieve convergence rates similar to those of interior point
methods but with iteration complexity close to stochastic gradient methods.

4. The Choice of Metric in Subgradient Methods


Lecture Summary: Standard subgradient and projected subgradient meth-
ods are inherently Euclidean—they rely on measuring distances using Eu-
clidean norms, and their updates are based on Euclidean steps. In this lec-
ture, we study methods for more carefully choosing the metric, giving rise
to mirror descent, also known as non-Euclidean subgradient descent, as well
as methods for adapting the updates performed to the problem at hand. By
more carefully studying the geometry of the optimization problem being
solved, we show how faster convergence guarantees are possible.

4.1. Introduction In the previous lecture, we studied projected subgradient meth-


ods for solving the problem (2.1.1) by iteratively updating xk+1 = πC (xk − αk gk ),
where πC denotes Euclidean projection. The convergence of these methods, as
exemplified by Corollaries 3.2.8 and 3.4.9, scales as
MR diam(C)Lip(f)
(4.1.1) f(xK ) − f(x⋆ ) 6 √ = O(1) √ ,
K K
where R = supx∈C kx − x⋆ k2 and M is the Lipschitz constant of f over the set C
with respect to the ℓ2 -norm,
X n 1
2
2
M = sup sup kgk2 = gj .
x∈C g∈∂f(x) j=1
44 Introductory Lectures on Stochastic Optimization

The convergence guarantee (4.1.1) reposes on Euclidean measures of scale—the


diameter of C and norm of the subgradients g are both measured in ℓ2 -norm. It is
thus natural to ask if we can develop methods whose convergence rates depend
on other measures of scale of f and C, obtainining better problem-dependent
behavior and geometry. With that in mind, in this lecture we derive a number of
methods that use either non-Euclidean or adaptive updates to better reflect the
geometry of the underlying optimization problem.

h(x)
Dh (x, y)

h(y) h(y) + h∇h(y), x − yi

Figure 4.2.1. Bregman divergence Dh (x, y). The bottom upper


function is h(x) = log(1 + ex ), the lower (linear) is the linear
approximation x 7→ h(y) + h∇h(y), x − yi to h at y.

4.2. Mirror Descent Methods Our first set of results focuses on mirror descent
methods, which modify the basic subgradient update to use a different distance-
measuring function rather than the squared ℓ2 -term. Before presenting these
methods, we give a few definitions. Let h be a differentiable convex function,
differentiable on C. The Bregman divergence associated with h is defined as
(4.2.2) Dh (x, y) = h(x) − h(y) − h∇h(y), x − yi .
The divergence Dh is always nonnegative, by the standard first-order inequality
for convex functions, and measures the gap between the linear approximation
h(y) + h∇h(y), x − yi for h(x) taken from the point y and the value h(x) at x. See
Figure 4.2.1. As one standard example, if we take h(x) = 21 kxk22 , then Dh (x, y) =
1 2
2 kx − ykP2 . A second common example follows by taking the entropy functional
n
h(x) = j=1 xj log xj , restricting x to Pthe probability simplex (i.e. x  0 and
P n xj
j xj = 1). We then have D h (x, y) = j=1 xj log yj , the entropic or Kullback-
Leibler divergence.
Because the quantity (4.2.2) is always non-negative and convex in its first ar-
gument, it is natural to treat it as a distance-like function in the development of
John C. Duchi 45

optimization procedures. Indeed, by recalling the updates (3.2.3) and (3.3.2), by


analogy we consider the method
i. Compute subgradient gk ∈ ∂f(xk )
ii. Perform update

1
xk+1 = argmin f(xk ) + hgk , x − xk i + Dh (x, xk )
x∈C αk
(4.2.3)
1
= argmin hgk , xi + Dh (x, xk ) .
x∈C αk
This scheme is the mirror descent method. Thus, each differentiable convex function
h gives a new optimization scheme, where we often attempt to choose h to better
match the geometry of the underlying constraint set C.
To this point, we have been vague about the “geometry” of the constraint set,
so we attempt to be somewhat more concrete. We say that h is λ-strongly convex
over C with respect to the norm k·k if
λ
h(y) > h(x) + h∇h(x), y − xi + kx − yk2 for all x, y ∈ C.
2
Importantly, this norm need not be the typical ℓ2 or Euclidean norm. Then our
goal is, roughly, to choose a strongly convex function h so that the diameter of
C is small in the norm k·k with respect to which h is strongly convex (as we see
presently, an analogue of the bound (4.1.1) holds). In the standard updates (3.2.3)
and (3.3.2), we use the squared Euclidean norm to trade between making progress
on the linear approximation x 7→ f(xk ) + hgk , x − xk i and making sure the approx-
imation is reasonable—we regularize progress. Thus it is natural to ask that the
function h we use provide a similar type of regularization, and consequently, we
will require that the function h be 1-strongly convex (usually shortened to the
unqualified strongly convex) with respect to some norm k·k over the constraint
set C in the mirror descent method (4.2.3).4 Note that strong convexity of h is
equivalent to
1
Dh (x, y) > kx − yk2 for all x, y ∈ C.
2
Examples of mirror descent Before analyzing the method (4.2.3), we present a
few examples, showing the updates that are possible as well as verifying that the
associated divergence is appropriately strongly convex. One of the nice conse-
quences of allowing different divergence measures Dh , as opposed to only the
Euclidean divergence, is that they often yield cleaner or simpler updates.
Example 4.1 (Gradient descent is mirror descent): Let h(x) = 12 kxk22 . Then
∇h(y) = y, and
1 1 1 1 1
Dh (x, y) = kxk22 − kyk22 − hy, x − yi = kxk22 + kyk22 − hx, yi = kx − yk22 .
2 2 2 2 2
4 This is not strictly a requirement, and sometimes it is analytitcally convenient to avoid this, but our
analysis is simpler when h is strongly convex.
46 Introductory Lectures on Stochastic Optimization

Thus, substituting into the update (4.2.3), we see the choice h(x) = 1
2 kxk22 recovers
the standard (stochastic sub)gradient method

1
xk+1 = argmin hgk , xi + kx − xk k22 .
x∈C 2αk
It is evident that h is strongly convex with respect to the ℓ2 -norm for any con-
straint set C. ♦

Example 4.2 (Solving problems on the simplex with exponentiated gradient meth-
ods): Suppose that our constraint set C = {x ∈ Rn + : h1, xi = 1} is the probability
simplex in Rn . Then updates with the standard Euclidean distance are some-
what challenging—though there are efficient implementations [14, 23]—and it is
natural to ask for a simpler method.
P
With that in mind, let h(x) = n j=1 xj log xj be the negative entropy, which is
convex because it is the sum of convex functions. (The derivatives of f(t) = t log t
are f ′ (t) = log t + 1 and f ′′ (t) = 1/t > 0 for t > 0.) Then we have
n
X  
Dh (x, y) = xj log xj − yj log yj − (log yj + 1)(xj − yj )
j=1
n
X xj
= xj log + h1, y − xi = Dkl (x||y) ,
yj
j=1

the KL-divergence between x and y (when extended to Rn + , though over C we


have h1, x − yi = 0). This gives us the form of the update (4.2.3).
Let us consider the update (4.2.3). Simplifying notation, we would like to solve
n
X xj
minimize hg, xi + xj log subject to h1, xi = 1, x  0.
yj
j=1

We assume that the yj > 0, though this is not strictly necessary. Though we
have not discussed this, we write the Lagrangian for this problem by introducing
Lagrange multipliers τ ∈ R for the equality constraint h1, xi = 1 and λ ∈ Rn
+ for
the inequality x  0. Then we obtain Lagrangian
n  
X xj
L(x, τ, λ) = hg, xi + xj log + τxj − λj xj − τ.
yj
j=1

Minimizing out x to find the appropriate form for the solution, we take deriva-
tives with respect to x and set them to zero to find

0= L(x, τ, λ) = gj + log xj + 1 − log yj + τ − λj ,
∂xj
or
xj (τ, λ) = yj exp(−gj − 1 − τ + λj ).
We may take λj = 0, as the latter expression yields all positive xj , and to satisfy
P P
the constraint that j xj = 1, we set τ = log( j yj e−gj ) − 1. Thus we have the
John C. Duchi 47

update
y exp(−gi )
xi = Pn i .
j=1 yj exp(−gj )
Rewriting this in terms of the precise update at time k for the mirror descent
method, we have for each coordinate i of iterate k + 1 of the method that
xk,i exp(−αk gk,i )
(4.2.4) xk+1,i = Pn .
j=1 xk,j exp(−αk gk,j )
This is the so-called exponentiated gradient update, also known as entropic mirror
descent.
Later, after stating and proving our main convergence theorems, we will show
that the negative entropy is strongly convex with respect to the ℓ1 -norm, meaning
that our coming convergence guarantees apply. ♦

Example 4.3 (Using ℓp -norms): As a final example, we consider using squared


ℓp -norms for our distance-generating function h. These have nice robustness
properties, and are also finite on any compact set (unlike the KL-divergence of
Example 4.2). Indeed, let p ∈ (1, 2], and define h(x) = 2(p−1) 1
kxk2p . We claim
without proof that h is strongly convex with respect to the ℓp -norm, that is,
1
Dh (x, y) > kx − yk2p .
2
(See, for example, the thesis of Shalev-Shwartz [51] and Question 9 in the exer-
cises. This inequality fails for powers other than 2 as well as for p > 2.)
We do not address the constrained case here, assuming instead that C = Rn .
In this case, we have
1 h i⊤
2−p
∇h(x) = kxkp sign(x1 )|x1 |p−1 · · · sign(xn )|xn |p−1 .
p−1
Now, if we define the function φ(x) = (p − 1)∇h(x), then a calculation verifies
that the function ϕ : Rn → Rn defined coordinate-wise by
2−p 2−q
φj (x) = kxkp sign(xj )|xj |p−1 and ϕj (y) = kykq sign(yj )|yj |q−1 ,
where p1 + q1 = 1, satisfies ϕ(φ(x)) = x, that is, ϕ = φ−1 (and similarly φ = ϕ−1 ).
Thus, the mirror descent update (4.2.3) when C = Rn becomes the somewhat
more complex
(4.2.5) xk+1 = ϕ(φ(xk ) − αk (p − 1)gk ) = (∇h)−1 (∇h(xk ) − αk gk ).
The second form of the update (4.2.5), that is, that involving the inverse of the
gradient mapping (∇h)−1 , holds more generally, that is, for any strictly convex
and differentiable h. This is the original form of the mirror descent update (4.2.3),
and it justifies the name mirror descent, as the gradient is “mirrored” through
the distance-generating function h and back again. Nonetheless, we find the
modeling perspective of (4.2.3) somewhat easier to explain.
We remark in passing that while constrained updates are somewhat more chal-
lenging for this case, a few are efficiently solvable. For example, suppose that
48 Introductory Lectures on Stochastic Optimization

C = {x ∈ Rn + : h1, xi = 1}, the probability simplex. In this case, the update with
ℓp -norms becomes a problem of solving
1
minimize hv, xi + kxk2p subject to h1, xi = 1, x  0,
x 2
where v = αk (p − 1)gk − φ(xk ), and ϕ and φ are defined as above. An analysis
of the Karush-Kuhn-Tucker conditions for this problem (omitted) yields that the
solution to the problem is given by finding the t⋆ ∈ R such that
n
X    
ϕj ( −vj + t⋆ + ) = 1 and setting xj = ϕ( −vj + t⋆ + ).
j=1

Because ϕ is increasing in its argument with ϕ(0) = 0, this t⋆ can be found to


accuracy ǫ in time O(n log ǫ1 ) by binary search. ♦

Convergence guarantees With the mirror descent method described, we now


provide an analysis of its convergence behavior. In this case, the analysis is
somewhat more complex than that for the subgradient, projected subgradient,
and stochastic subgradient methods, as we cannot simply expand the distance
kxk+1 − x⋆ k22 . Thus, we give a variant proof that relies on the optimality condi-
tions for convex optimization problems, as well as a few tricks involving norms
and their dual norms. Recall that we assume that the function h is strongly con-
vex with respect to some norm k·k, and that the associated dual norm k·k∗ is
defined by
kyk∗ := sup hy, xi .
x:kxk61

Theorem 4.2.6. Let αk > 0 be any sequence of non-increasing stepsizes and the above as-
sumptions hold. Let xk be generated by the mirror descent iteration (4.2.3). If Dh (x, x⋆ ) 6
R2 for all x ∈ C, then for all K ∈ N
K K
X 1 2 X αk
[f(xk ) − f(x⋆ )] 6 R + kgk k2∗ .
αK 2
k=1 k=1
If αk ≡ α is constant, then for all K ∈ N
K K
X 1 αX
[f(xk ) − f(x⋆ )] 6 Dh (x⋆ , x1 ) + kgk k2∗ .
α 2
k=1 k=1
1 PK
As an immediate consequence of this theorem, we see that if xK = K k=1 xk or
xK = argminxk f(xk ) and we have the gradient bound kgk∗ 6 M for all g ∈ ∂f(x)
for x ∈ C, then (say, in the second case) convexity implies
1 α
(4.2.7) f(xK ) − f(x⋆ ) 6 Dh (x⋆ , x1 ) + M2 .
Kα 2
By comparing with the bound (3.2.9), we see that the mirror descent (non-Euclidean
gradient descent) method gives roughly the same type of convergence guarantees
as standard subgradient descent. Roughly we expect the following type of behav-
ior with a fixed stepsize: a rate of convergence of roughly 1/αK until we are
John C. Duchi 49

within a radius α of the optimum, after which mirror descent and subgradient
descent essentially jam—they just jump back and forth near the optimum.
Proof. We begin by considering the progress made in a single update of xk , but
whereas our previous proofs all began with a Lyapunov function for the distance
kxk − x⋆ k2 , we use function value gaps instead of the distance to optimality. Us-
ing the first order convexity inequality—i.e. the definition of a subgradient—we
have
f(xk ) − f(x⋆ ) 6 hgk , xk − x⋆ i .
The idea is to show that replacing xk with xk+1 makes the term hgk , xk − x⋆ i
small because of the definition of xk+1 , but xk and xk+1 are close together so that
this is not much of a difference.
First, we add and subtract hgk , xk+1 i to obtain
(4.2.8) f(xk ) − f(x⋆ ) 6 hgk , xk+1 − x⋆ i + hgk , xk − xk+1 i .
Now, we use the the first-order necessary and sufficient conditions for optimality
of convex optimization problems given by Theorem 2.4.11. Because xk+1 solves
problem (4.2.3), we have
D E
gk + α−1
k (∇h(xk+1 ) − ∇h(xk )) , x − xk+1 > 0 for all x ∈ C.

In particular, this inequality holds for x = x⋆ , and substituting into expres-


sion (4.2.8) yields
1
f(xk ) − f(x⋆ ) 6 h∇h(xk+1 ) − ∇h(xk ), x⋆ − xk+1 i + hgk , xk − xk+1 i .
αk
We now use two tricks: an algebraic identity involving Dh and the Fenchel-Young
inequality. By algebraic manipulations, we have that
h∇h(xk+1 ) − ∇h(xk ), x⋆ − xk+1 i = Dh (x⋆ , xk ) − Dh (x⋆ , xk+1 ) − Dh (xk+1 , xk ).
Substituting into the preceding display, we have

(4.2.9)
1
f(xk ) − f(x⋆ ) 6 [Dh (x⋆ , xk ) − Dh (x⋆ , xk+1 ) − Dh (xk+1 , xk )] + hgk , xk − xk+1 i .
αk
The second insight is that the subtraction of Dh (xk+1 , xk ) allows us to cancel
some of hgk , xk − xk+1 i. To see this, recall the Fenchel-Young inequality, which
states that
η 1
hx, yi 6 kxk2 + kyk2∗
2 2η
for any pair of dual norms (k·k , k·k∗ ) and any η > 0. To see this, note that by
definition of the dual norm, we have hx, yi 6 kxk kyk∗ , and for any constants
1 1
a, b ∈ R and η > 0, we have 0 6 21 (η 2 a − η− 2 b)2 = η2 a2 + 2η 1 2
b − ab, so that
η 2 1 2
kxk kyk∗ 6 2 kxk + 2η kyk∗ . In particular, we have
αk 1
hgk , xk − xk+1 i 6 kgk k2∗ + kxk − xk+1 k2 .
2 2αk
50 Introductory Lectures on Stochastic Optimization

The strong convexity assumption on h guarantees Dh (xk , xk+1 ) > 21 kxk − xk+1 k2 ,
or that
1 α
− Dh (xk+1 , xk ) + hgk , xk − xk+1 i 6 k kgk k2∗ .
αk 2
Substituting this into inequality (4.2.9), we have
1 α
(4.2.10) f(xk ) − f(x⋆ ) 6 [Dh (x⋆ , xk ) − Dh (x⋆ , xk+1 )] + k kgk k2∗ .
αk 2
This inequality should look similar to inequality (3.3.5) in the proof of Theo-
rem 3.3.4 on the projected subgradient method in Lecture 3. Indeed, using that
Dh (x⋆ , xk ) 6 R2 by assumption, an identical derivation to that in Theorem 3.3.4
gives the first result of this theorem. For the second when the stepsize is fixed,
note that
K K K
X X 1 X α
[f(xk ) − f(x⋆ )] 6 [Dh (x⋆ , xk ) − Dh (x⋆ , xk+1 )] + kgk k2∗
α 2
k=1 k=1 k=1
K
1 X α
= [Dh (x⋆ , x1 ) − Dh (x⋆ , xK+1 )] + kgk k2∗ ,
α 2
k=1
which is the second result. 
We briefly provide a few remarks before moving on. As a first remark, all
of the preceding analysis carries through in an almost completely identical fash-
ion in the stochastic case. We state the most basic result, as the extension from
Section 3.4 is essentially straightforward.
Corollary 4.2.11. Let the conditions of Theorem 4.2.6 hold, except that instead of re-
ceiving a vector gk ∈ ∂f(xk ) at iteration k, the vector gk is a stochastic subgradient
satisfying E[gk | xk ] ∈ ∂f(xk ). Then for any non-increasing stepsize sequence αk
(where αk may be chosen dependent on g1 , . . . , gk ),
XK  " K
#
R 2 X α k 2
E (f(xk ) − f(x⋆ )) 6 E + kgk k∗ .
αK 2
k=1 k=1
Proof. We sketch the result. The proof is identical to that for Theorem 4.2.6, ex-
cept that we replace gk with the particular vector f ′ (xk ) satisfying E[gk | xk ] =
f ′ (xk ) ∈ ∂f(xk ). Then



f(xk ) − f(x⋆ ) 6 f ′ (xk ), xk − x⋆ = hgk , xk − x⋆ i + f ′ (xk ) − gk , xk − x⋆ ,
and an identical derivation yields the following analogue of inequality (4.2.10):
1 α

f(xk ) − f(x⋆ ) 6 [Dh (x⋆ , xk ) − Dh (x⋆ , xk+1 )] + k kgk k2∗ + f ′ (xk ) − gk , xk − x⋆ .
αk 2
This inequality holds regardless of how we choose αk . Moreover, by iterating
expectations, we have



E[ f ′ (xk ) − gk , xk − x⋆ ] = E[ f ′ (xk ) − E[gk | xk ], xk − x⋆ ] = 0,
which gives the corollary once we follow an identical derivation to Theorem 4.2.6.

John C. Duchi 51

Thus, if we have the bound E[kgk2∗ ] 6 M2 for all stochastic subgradients, then
1 PK

taking xK = K k=1 xk and αk = R/M k, then
K
RM R maxk E[kgk k2∗ ] X 1 RM
(4.2.12) E[f(xK ) − f(x⋆ )] 6 √ + √ 6 3√
K M 2 k K
k=1
P √
− 21 6 2 K.
where we have used that E[kgk2∗ ] 6 M2 and K k=1 k
In addition, we can provide concrete convergence guarantees for a few meth-
ods, revisiting our earlier examples. We begin with Example 4.2, exponentiated
gradient descent.
Pn
Corollary 4.2.13. Let C = {x ∈ Rn + : h1, xi = 1}, and take h(x) = j=1 xj log xj , the
1
negative entropy. Let x1 = n 1, the vector whose entries are each 1/n. Then if xK =
1 PK
K k=1 xk , the exponentiated gradient method (4.2.4) with fixed stepsize α guarantees
K
log n α X
f(xK ) − f(x⋆ ) 6 + kgk k2∞ .
Kα 2K
k=1

Proof. To apply Theorem 4.2.6, we must show that the negative entropy h is
strongly convex with respect to the ℓ1 -norm, whose dual norm is the ℓ∞ -norm.
By a Taylor expansion, we know that for any x, y ∈ C, we have
1
h(x) = h(y) + h∇h(y), x − yi + (x − y)⊤ ∇2 h(e x)(x − y)
2
for some ex between x and y, that is, e
x = tx + (1 − t)y for some t ∈ [0, 1]. Calculat-
ing these quantities, this is equivalent to
 
1 ⊤ 1 1
Dkl (x||y) = Dh (x, y) = (x − y) diag ,..., (x − y)
2 e
x1 e
xn
n
1 X (xj − yj )2
= .
2 e
xj
j=1

Using the Cauchy-Schwarz inequality and the fact that e x ∈ C, we have


n n q  X 2  X
n 1 n 1
X X |xj − yj | (xj − yj )2 2
kx − yk1 = |xj − yj | = exj p 6 e
xj .
e
xj e
xj
j=1 j=1 j=1 j=1
| {z }
=1

That is, we have Dkl (x||y) = Dh (x, y) > 1


2 kx − yk21 ,
and h is strongly convex with
respect to the ℓ1 -norm over C.
With this strong convexity result in hand, we may apply second result of The-
orem 4.2.6, achieving
K K
X Dkl (x⋆ | x1 ) α X
[f(xk ) − f(x⋆ )] 6 + kgk k2∞ .
α 2
k=1 k=1
1
If x1 = then Dkl (x||x1 ) = h(x) + log n 6 log n, as h(x) 6 0 for x ∈ C. Thus,
n 1,
1 PK
dividing by K and using that f(xK ) 6 K k=1 f(xk ) gives the corollary. 
52 Introductory Lectures on Stochastic Optimization

Inspecting the guarantee Corollary 4.2.13 provides versus that guaranteed by


the standard (non-stochastic) projected subgradient method (i.e. using h(x) =
1 2
2 kxk2 as in Theorem 3.3.4) is instructive. In the case of projected subgradient
descent, we have Dh (x⋆ , x) = 12 kx⋆ − xk22 6 1 for all x, x⋆ ∈ C = {x ∈ Rn + :
h1, xi = 1} (and this distance is achieved). However, the dual norm to the ℓ2 -norm
is ℓ2 , meaning we measure the size of the gradient terms kgk k in ℓ2 -norm. As

kgk k∞ 6 kgk k2 6 n kgk k∞ , supposing that kgk k∞ 6 1 for all k, the conver-
gence guarantee r
log n
O(1)
p K
may be up to n/ log n-times better than that guaranteed by the standard (Eu-
clidean) projected gradient method.
Lastly, we provide a final convergence guarantee for the mirror descent method
using ℓp -norms, where p ∈ (1, 2]. Using such norms has the benefit that Dh
is bounded whenever the set C is compact—distinct from the relative entropy
P x
Dh (x, y) = j xj log yjj —and thus providing a nicer guarantee of convergence.
Indeed, for h(x) = 12 kxk2p we always have that
n
1 1 X
Dh (x, y) = kxk2p − kyk2p − 2−p
kykp sign(yj )|yj |p−1 (xj − yj )
2 2
j=1
n
1 1 X
(4.2.14) = kxk2p + kyk2p − kykp
2−p
|yj |p−1 sign(yj )xj 6 kxk2p + kyk2p ,
2 2
j=1
| {z }
6 12 kxk2p + 12 kyk2p

where the inequality uses that q(p − 1) = p and


n
X X
n 1 X
n 1
q p
2−p 2−p
kykp |yj |p−1 |xj | 6 kykp |yj |q(p−1) |xj |p
j=1 j=1 j=1
p
2−p 1 1
= kykp q
kykp kxkp = kykp kxkp 6 kyk2p + kxk2p .
2 2
More generally, with h(x) = 21 kx − x0 k2p , we have Dh (x, y) 6 kx − x0 k2p + ky − x0 k2p .
As one example, we obtain the following corollary.
1
Corollary 4.2.15. Let h(x) = 2(p−1) kxk2p , where p = 1 + 1
log(2n) , and assume that
n
C ⊂ {x ∈ R : kxk1 6 R1 }. Then
K K
X 2R21 log(2n) e2 X
[f(xk ) − f(x⋆ )] 6 + αk kgk k2∞ .
αK 2
k=1 k=1
p 1 PK
In particular, taking αk = R1 log(2n)/k/e and xK = K k=1 xk gives
p
R1 log(2n)

f(xK ) − f(x ) 6 3e √ .
K
John C. Duchi 53

1
Proof. First, we note that h(x) = 2(p−1) kxk2p is strongly convex with respect to the
ℓp -norm, where 1 < p 6 2. (Recall Example 4.3 and see Exercise 9.) Moreover, we
know that the dual to the ℓp -norm is the conjugate ℓq -norm with 1/p + 1/q = 1,
and thus Theorem 4.2.6 implies that
K K
X 1 X αk
[f(xk ) − f(x⋆ )] 6 sup Dh (x, x⋆ ) + kgk k2q .
αK x∈C 2
k=1 k=1
Now, we use that if C is contained in the ℓ1 -ball of radius R1 , then (p − 1)Dh (x, y) 6
kxk2p + kyk2p 6 kxk21 + kyk21 6 2R21 . Moreover, because p = 1 + log(2n) 1
, we have
q = 1 + log(2n), and
1 1
kvkq 6 k1kq kvk∞ = n q kvk∞ = n log(2n) kvk∞ 6 e kvk∞ .
Substituting this into the previous display and noting that 1/(p − 1) = log(2n)
P − 12 and using convexity gives the second.
gives the first result. Integrating Kk=1 k

So we see that, in more general cases than the simple simplex constraint af-
forded by the entropic mirror descent (exponentiated gradient) updates, we have
p √
convergence guarantees of order log n/ K, which may be substantially faster
than that guaranteed by the standard projected gradient methods.

h(x) = xj logxj
j

h(x) = 21 ||x||22
h(x) = 21 ||x||2p
fbest
k
−f ⋆

-1
10

0 200 400 600 800 1000

Figure 4.2.16. Convergence of mirror descent (entropic gradient


method) versus projected gradient method.

A simulated mirror-descent example With our convergence theorems given, we


provide a (simulation-based) example of the convergence behavior for an opti-
mization problem for which it is natural to use non-Euclidean norms. We con-
sider a robust regression problem of the following form: we let A ∈ Rm×n have
1
entries drawn i.i.d. N(0, 1) with rows a⊤ ⊤
1 , . . . , am . We let bi = 2 (ai,1 + ai,2 ) + εi
54 Introductory Lectures on Stochastic Optimization

iid
where εi ∼ N(0, 10−2 ), and m = 20 and the dimension n = 3000. Then we define
m
X
f(x) := kAx − bk1 = | hai , xi − bi |,
i=1

which has subgradients A⊤ sign(Ax − b). We minimize f over the simplex C =


{x ∈ Rn+ : h1, xi = 1}; this is the same robust regression problem (3.2.12), except
with a particular choice of C.
We compare the subgradient method to exponentiated gradient descent for
this problem, noting that the Euclidean projection of a vector v ∈ Rn to the set C
 
has coordinates xj = vj − t + , where t ∈ R is chosen so that
n n
X X  
xj = vj − t +
= 1.
j=1 j=1

(See the papers [14, 23] for a full derivation of this expression.) We use stepsizes

αk = α0 / k, where the initial stepsize α0 is chosen to optimize the convergence
guarantee for each of the methods (see the coming section). In Figure 4.2.16, we
plot the results of performing the projected gradient method versus the expo-
nentiated gradient (entropic mirror decent) method and a method using distance
generating functions h(x) = 21 kxk2p for p = 1 + 1/ log(2n), which can also be
shown to be optimal, showing the optimality gap versus iteration count. All
three methods are sensitive to initial stepsize, the mirror descent method (4.2.4)
enjoys faster convergence than the standard gradient-based method.
4.3. Adaptive stepsizes and metrics In our discussion of mirror descent meth-
ods, we assumed we knew enough about the geometry of the problem at hand—
or at least the constraint set—to choose an appropriate metric and associated
distance-generating function h. In other situations, however, it may be advanta-
geous to adapt the metric being used, or at least the stepsizes, to achieve faster
convergence guarantees. We begin by describing a simple scheme for choosing
stepsizes to optimize bounds on convergence, which means one does not need
to know the Lipschitz constants of gradients ahead of time, and then move on to
somewhat more involved schemes that use a distance-generating function of the
type h(x) = 21 x⊤ Ax for some matrix A, which may change depending on infor-
mation observed during solution of the problem. We leave proofs of the major
results in these sections to exercises at the end of the lectures.
Adaptive stepsizes Let us begin by recalling the convergence guarantees for
mirror descent in the stochastic case, given by Corollary 4.2.11, which assumes
the stepsize αk used to calculate xk+1 is chosen based on the observed gradients
1 PK
g1 , . . . , gk (it may be specified ahead of time). In this case, taking xK = K k=1 xk ,
John C. Duchi 55

we have by Corollary 4.2.11 that as long as Dh (x, x⋆ ) 6 R2 for all x ∈ C, then


" K
#
⋆ R2 1 X αk 2
(4.3.1) E[f(xK ) − f(x )] 6 E + kgk k∗ .
KαK K 2
k=1
Now, if we were to use a fixed stepsize αk = α for all k, we see that the choice of
stepsize minimizing
K
R2 α X
+ kgk k2∗
Kα 2K
k=1
is
X
K − 1

√ 2
2
α = 2R kgk k∗ ,
k=1
which, when substituted into the bound (4.3.1) yields
" K 1 #

√ R X 2
2
(4.3.2) E[f(xK ) − f(x )] 6 2 E kgk k∗ .
K
k=1
While the stepsize choice α⋆ and the resulting bound are not strictly possible,
as we do not know the magnitudes of the gradients kgk k∗ before the procedure
executes, in Exercise 8, we prove the following corollary, which uses the “up to
now” optimal choice of stepsize αk .
q
Pk 2
Corollary 4.3.3. Let the conditions of Corollary 4.2.11 hold. Let αk = R/ i=1 kgi k∗ .
Then " K
X 1 #
⋆ R 2
2
E[f(xK ) − f(x )] 6 3 E kgk k∗ ,
K
k=1
1 PK
where xK = K k=1 xk .

When comparing Corollary 4.3.3 to Corollary 4.2.11, we see by Jensen's inequality
that, if E[‖g_k‖_*²] ≤ M² for all k, then

E[ ( ∑_{k=1}^K ‖g_k‖_*² )^{1/2} ] ≤ ( ∑_{k=1}^K E[‖g_k‖_*²] )^{1/2} ≤ √(M² K) = M √K.

Thus, ignoring the √2 versus 3 multiplier, the bound of Corollary 4.3.3 is always
at least as tight as that provided by Corollary 4.2.11 and its immediate consequence (4.2.12).
We do not explore these particular stepsize choices further, but
turn to more sophisticated adaptation strategies.
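Before moving on, here is a minimal sketch of the stepsize rule of Corollary 4.3.3, specialized to the Euclidean case (so the dual norm is the ℓ2-norm); the routines subgrad and project are assumed to be supplied by the user and are not part of the corollary itself.

using LinearAlgebra

# Sketch: projected subgradient method with the "up to now" optimal stepsize
# alpha_k = R / sqrt(sum_{i<=k} ||g_i||^2) of Corollary 4.3.3, Euclidean case.
# subgrad(x) returns a (stochastic) subgradient at x; project(x) is the
# Euclidean projection onto C.
function adaptive_stepsize_method(subgrad, project, x0, R, K)
    x = copy(x0)
    xbar = zero(x0)
    sumsq = 0.0                               # running sum of squared gradient norms
    for k in 1:K
        g = subgrad(x)
        sumsq += dot(g, g)
        alpha = R / sqrt(sumsq + eps())       # adaptive stepsize
        x = project(x .- alpha .* g)
        xbar .+= x ./ K                       # running average of the iterates
    end
    return xbar
end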
Variable metric methods and the adaptive gradient method In variable metric
methods, the idea is to adjust the metric with which one constructs updates to
better reflect local (or non-local) problem structure. The basic framework is very
similar to the standard subgradient method (or the mirror descent method), and
proceeds as follows.
(i) Receive subgradient gk ∈ ∂f(xk ) (or stochastic subgradient gk satisfying
E[gk | xk ] ∈ ∂f(xk ))
(ii) Update the positive semidefinite matrix H_k ∈ R^{n×n}
(iii) Compute the update

(4.3.4)   x_{k+1} = argmin_{x∈C} { ⟨g_k, x⟩ + ½ ⟨x − x_k, H_k (x − x_k)⟩ }.
The method (4.3.4) subsumes a number of standard and less standard optimization
methods. If H_k = (1/α_k) I_{n×n}, a scaled identity matrix, we recover the (stochastic)
subgradient method (3.2.4) when C = R^n (or (3.3.2) generally). If f is
twice differentiable and C = R^n, then taking H_k = ∇²f(x_k) to be the Hessian
twice differentiable and C = Rn , then taking Hk = ∇2 f(xk ) to be the Hessian
of f at xk gives the (undamped) Newton method, and using Hk = ∇2 f(xk ) even
when C 6= Rn gives a constrained Newton method. More general choices of
Hk can even give the ellipsoid method and other classical convex optimization
methods [56].
In our case, we specialize the iterations above to focus on diagonal matrices Hk ,
and we do not assume the function f is smooth (not even differentiable). This, of
course, renders unusable standard methods using second order information in
the matrix Hk (as it does not exist), but we may still develop useful algorithms.
It is possible to consider more general matrices [22], but their additional com-
putational cost generally renders them impractical in large scale and stochastic
settings. With that in mind, let us develop a general framework for algorithms
and provide their analysis.
We begin with a general convergence guarantee.

Theorem 4.3.5. Let H_k be a sequence of positive definite matrices, where H_k is a function
of g_1, . . . , g_k (and potentially some additional randomness). Let g_k be (stochastic)
subgradients with E[g_k | x_k] ∈ ∂f(x_k). Then

E[ ∑_{k=1}^K (f(x_k) − f(x⋆)) ] ≤ (1/2) E[ ∑_{k=2}^K ( ‖x_k − x⋆‖²_{H_k} − ‖x_k − x⋆‖²_{H_{k−1}} ) + ‖x_1 − x⋆‖²_{H_1} ]
   + (1/2) ∑_{k=1}^K E[ ‖g_k‖²_{H_k^{−1}} ].

Proof. In contrast to mirror descent methods, in this proof we return to our classic
Lyapunov-based style of proof for standard subgradient methods, looking at the
distance ‖x_k − x⋆‖. Let ‖x‖²_A = ⟨x, Ax⟩ for any positive semidefinite matrix A. We
claim that

(4.3.6)   ‖x_{k+1} − x⋆‖²_{H_k} ≤ ‖x_k − H_k^{−1} g_k − x⋆‖²_{H_k},

the analogue of the fact that projections are non-expansive. This is an immediate
consequence of the update (4.3.4): we have that

x_{k+1} = argmin_{x∈C} ‖x − (x_k − H_k^{−1} g_k)‖²_{H_k},

which is a Euclidean projection of x_k − H_k^{−1} g_k onto C (in the norm ‖·‖_{H_k}). Then
the standard result that projections are non-expansive (Corollary 2.2.8) gives inequality (4.3.6).
Inequality (4.3.6) is the key to our analysis, as previously. Expanding the
square on the right side of the inequality, we obtain

(1/2) ‖x_{k+1} − x⋆‖²_{H_k} ≤ (1/2) ‖x_k − H_k^{−1} g_k − x⋆‖²_{H_k}
   = (1/2) ‖x_k − x⋆‖²_{H_k} − ⟨g_k, x_k − x⋆⟩ + (1/2) ‖g_k‖²_{H_k^{−1}},

and taking expectations we have E[⟨g_k, x_k − x⋆⟩ | x_k] ≥ f(x_k) − f(x⋆) by convexity
and the fact that E[g_k | x_k] ∈ ∂f(x_k). Thus

(1/2) E[ ‖x_{k+1} − x⋆‖²_{H_k} ] ≤ E[ (1/2) ‖x_k − x⋆‖²_{H_k} − (f(x_k) − f(x⋆)) + (1/2) ‖g_k‖²_{H_k^{−1}} ].

Rearranging, we have

E[f(x_k) − f(x⋆)] ≤ E[ (1/2) ‖x_k − x⋆‖²_{H_k} − (1/2) ‖x_{k+1} − x⋆‖²_{H_k} + (1/2) ‖g_k‖²_{H_k^{−1}} ].

Summing this inequality from k = 1 to K gives the theorem.
We may specialize the theorem in a number of ways to develop particular algorithms.
One specialization, which is convenient because the computational overhead
is fairly small, is to use diagonal matrices H_k. In particular, the AdaGrad
method sets

(4.3.7)   H_k = (1/α) diag( ∑_{i=1}^k g_i g_i^⊤ )^{1/2},

where α > 0 is a pre-specified constant (stepsize). In this case, we have the following
corollary to Theorem 4.3.5. Exercise 10 sketches the proof of the corollary,
which is similar to that of Corollary 4.3.3. In the corollary, recall that tr(A) = ∑_{j=1}^n A_{jj}
is the trace of a matrix.

Corollary 4.3.8 (AdaGrad convergence). Let R_∞ := sup_{x∈C} ‖x − x⋆‖_∞ be the ℓ_∞ radius
of the set C and let the conditions of Theorem 4.3.5 hold. Then with the choice (4.3.7)
in the variable metric method, we have

E[ ∑_{k=1}^K (f(x_k) − f(x⋆)) ] ≤ (1/(2α)) R_∞² E[tr(H_K)] + α E[tr(H_K)].

Inspecting Corollary 4.3.8, we see a few consequences. First, by choosing α = R_∞,
we obtain the expected convergence guarantee (3/2) R_∞ E[tr(H_K)]. If we let x̄_K = (1/K) ∑_{k=1}^K x_k
as usual, and let g_{k,j} denote the jth component of the kth gradient
observed by the method, then we immediately obtain the convergence guarantee

(4.3.9)   E[f(x̄_K) − f(x⋆)] ≤ (3/(2K)) R_∞ E[tr(H_K)] = (3/(2K)) R_∞ E[ ∑_{j=1}^n ( ∑_{k=1}^K g_{k,j}² )^{1/2} ].
In addition to proving the bound (4.3.9), Exercise 10 also shows that, if C = {x ∈ R^n : ‖x‖_∞ ≤ 1},
then the bound (4.3.9) is always better than the bounds (e.g.
Corollary 3.4.9) guaranteed by standard stochastic gradient methods. In addition,
the bound (4.3.9) is unimprovable: there are stochastic optimization problems
for which no algorithm can achieve a faster convergence rate. These types of
problems generally involve data in which the gradients g have highly varying
components (or components that are often zero, i.e. the gradients g are sparse),
as for such problems geometric aspects are quite important.
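To make the update (4.3.4) with the diagonal choice (4.3.7) concrete, the following sketch implements diagonal AdaGrad for the box constraint C = {x : ‖x‖_∞ ≤ R}, for which the H_k-weighted projection is a coordinate-wise clip; the function names and the box-constrained specialization are our own choices, not part of the text above.

# Sketch: diagonal AdaGrad, i.e. update (4.3.4) with H_k = diag(sum g_i.^2)^{1/2} / alpha,
# specialized to the box {x : norm(x, Inf) <= R}. subgrad(x) returns a stochastic
# subgradient of f at x.
function adagrad(subgrad, x0, R, alpha, K)
    x = copy(x0)
    xbar = zero(x0)
    sumsq = zero(x0)                              # per-coordinate sums of squared gradients
    for k in 1:K
        g = subgrad(x)
        sumsq .+= g .^ 2
        eta = alpha ./ (sqrt.(sumsq) .+ eps())    # per-coordinate stepsizes alpha ./ diag(H_k)
        x = clamp.(x .- eta .* g, -R, R)          # coordinate-wise projection onto the box
        xbar .+= x ./ K                           # average iterate
    end
    return xbar
end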

Figure 4.3.10 (axes: optimality gap f(x_k) − f⋆, log scale, versus iteration k/100). A comparison of the convergence
of AdaGrad and SGD on the problem (4.3.11) for the best initial stepsize α for
each method.

We now give an example application of the AdaGrad method, showing its
performance on a simulated example. We consider solving the problem

(4.3.11)   minimize f(x) = (1/m) ∑_{i=1}^m [1 − b_i ⟨a_i, x⟩]_+ subject to ‖x‖_∞ ≤ 1,

where the vectors a_i ∈ {−1, 0, 1}^n with m = 5000 and n = 1000. This is the
objective common to hinge loss (support vector machine) classification problems.
For each coordinate j ∈ {1, . . . , n}, we set a_{i,j} ∈ {±1} to have a random sign
with probability 1/j, and a_{i,j} = 0 otherwise. Drawing u ∈ {−1, 1}^n uniformly at
random, we set b_i = sign(⟨a_i, u⟩) with probability .95 and b_i = −sign(⟨a_i, u⟩)
otherwise. For this problem, the coordinates of a_i (and hence subgradients or
stochastic subgradients of f) naturally have substantial variability, making it a
natural problem for adaptation of the metric H_k.
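The following sketch generates one instance of the data just described and computes a stochastic subgradient of the objective in (4.3.11) by sampling a single term uniformly at random; the variable and function names are our own.

using LinearAlgebra, Random

# Sketch: simulated data for problem (4.3.11) and a stochastic subgradient at x.
function make_data(m = 5000, n = 1000)
    A = zeros(m, n)
    for i in 1:m, j in 1:n
        if rand() < 1 / j                           # coordinate j nonzero with probability 1/j
            A[i, j] = rand() < 0.5 ? 1.0 : -1.0     # random sign
        end
    end
    u = rand([-1.0, 1.0], n)                        # planted labeling direction
    b = [rand() < 0.95 ? sign(dot(A[i, :], u)) : -sign(dot(A[i, :], u)) for i in 1:m]
    return A, b
end

# Stochastic subgradient of f(x) = (1/m) sum_i [1 - b_i <a_i, x>]_+ at x.
function hinge_subgrad(A, b, x)
    i = rand(1:size(A, 1))
    margin = 1 - b[i] * dot(A[i, :], x)
    return margin > 0 ? -b[i] .* A[i, :] : zero(x)
end

Such a subgradient routine can be fed into the AdaGrad sketch above (with R = 1) to run a comparison in the spirit of Figures 4.3.10 and 4.3.12.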
In Figure 4.3.10, we show the convergence behavior of AdaGrad versus sto-
chastic gradient descent (SGD) on one realization of this problem, where at each
iteration we choose a stochastic gradient by selecting i ∈ {1, . . . , m} uniformly
at random, then setting g_k ∈ ∂[1 − b_i ⟨a_i, x_k⟩]_+. For SGD, we use stepsizes
α_k = α/√k, where α is the best stepsize of several choices (based on the eventual
convergence of the method), while AdaGrad uses the matrix (4.3.7), with
α similarly chosen based on the best eventual convergence. The plot shows
the typical behavior of AdaGrad with respect to stochastic gradient methods, at
least for problems with appropriate geometry: with good initial stepsize choice,
AdaGrad often outperforms stochastic gradient descent. (We have been vague
about the “right” geometry for problems in which we expect AdaGrad to per-
form well. Roughly, problems for which the domain C is well-approximated by a
box {x ∈ Rn : kxk∞ 6 c} are those for which we expect AdaGrad to succeed, and
otherwise, it may exhibit worse performance than standard subgradient methods.
As in any problem, some care is needed in the choice of methods.) Figure 4.3.12
shows this somewhat more broadly, plotting the convergence f(xk ) − f(x⋆ ) ver-
sus iteration k for a number of initial stepsize choices for both stochastic gradient
descent and AdaGrad on the problem (4.3.11). Roughly, we see that both meth-
ods are sensitive to initial stepsize choice, but the best choice for AdaGrad often
outperforms the best choice for SGD.

Figure 4.3.12 (axes: optimality gap f(x_k) − f⋆, log scale, versus iteration k/100). A comparison of the convergence
of AdaGrad and SGD on the problem (4.3.11) for various initial stepsize
choices α ∈ {10^{−i/2}, i = −2, . . . , 2} = {.1, .316, 1, 3.16, 10}. Both
methods are sensitive to the initial stepsize choice α, though for
each initial stepsize choice, AdaGrad has better convergence than
the subgradient method.

Notes and further reading   The mirror descent method was originally developed
by Nemirovski and Yudin [41] in order to more carefully control the norms
of gradients, and associated dual spaces, in first-order optimization methods.
Since their original development, a number of researchers have explored variants
and extensions of their methods. Beck and Teboulle [5] give an analysis of mirror
descent as a non-Euclidean gradient method, which is the approach we take in

this lecture. Nemirovski et al. [40] study mirror descent methods in stochastic
settings, giving high-probability convergence guarantees similar to those we gave
in the previous lecture. Bubeck and Cesa-Bianchi [15] explore the use of mirror
descent methods in the context of bandit optimization problems, where instead of
observing stochastic gradients one observes only random function values f(x) + ε,
where ε is mean-zero noise.
Variable metric methods have a similarly long history. Our simple results with
stepsize selection follow the more advanced techniques of Auer et al. [3] (see es-
pecially their Lemma 3.5), and the AdaGrad method (and our development) is
due to Duchi, Hazan, and Singer [22] and McMahan and Streeter [38]. More gen-
eral metric methods include Shor’s space dilation methods (of which the ellipsoid
method is a celebrated special case), which develop matrices Hk that make new
directions of descent somewhat less correlated with previous directions, allowing
faster convergence in directions toward x⋆ ; see the books of Shor [55, 56] as well
as the thesis of Nedić [39]. Newton methods, which we do not discuss, use scaled
multiples of ∇2 f(xk ) for Hk , while Quasi-Newton methods approximate ∇2 f(xk )
with Hk while using only gradient-based information; for more on these and
other more advanced methods for smooth optimization problems, see the books
of Nocedal and Wright [46] and Boyd and Vandenberghe [12].

5. Optimality Guarantees
Lecture Summary: In this lecture, we provide a framework for demonstrat-
ing the optimality of a number of algorithms for solving stochastic optimiza-
tion problems. In particular, we introduce minimax lower bounds, showing
how techniques for reducing estimation problems to statistical testing prob-
lems allow us to prove lower bounds on optimization.

5.1. Introduction The procedures and algorithms we have presented thus far en-
joy good performance on a number of statistical, machine learning, and stochastic
optimization tasks, and we have provided theoretical guarantees on their perfor-
mance. It is interesting to ask whether it is possible to improve the algorithms, or
in what ways it may be possible to improve them. With that in mind, in this lec-
ture we develop a number of tools for showing optimality—according to certain
metrics—of optimization methods for stochastic problems.
Minimax rates We provide optimality guarantees in the minimax framework for
optimality, which proceeds roughly as follows: we have a collection of possi-
ble problems and an error measure for the performance of a procedure, and we
measure a procedure’s performance by its behavior on the hardest (most difficult)
member of the problem class. We then ask for the best procedure under this
worst-case error measure. Let us describe this more formally in the context of our
stochastic optimization problems, where the goal is to understand the difficulty

of minimizing a convex function f subject to constraints x ∈ C while observing


only stochastic gradient (or other noisy) information about f. Our bounds build
on three objects:
(i) A collection F of convex functions f : Rn → R
(ii) A closed convex set C ⊂ Rn over which we optimize
(iii) A stochastic gradient oracle, which consists of a sample space S, a gradient
mapping
g : Rn × S × F → Rn ,
and (implicitly) a probability distribution P on S. The stochastic gradient
oracle may be queried at a point x, and when queried, draws S ∼ P with the
property that
(5.1.1) E[g(x, S, f)] ∈ ∂f(x).
Depending on the scenario of the problem, the optimization procedure may be
given access either to S or simply the value of the stochastic gradient g = g(x, S, f),
and the goal is to use the sequence of observations g(xk , Sk , f), for k = 1, 2, . . . , to
optimize f.
A simple example of the setting (i)–(iii) is as follows. Let A ∈ R^{n×n} be a fixed
positive definite matrix, and let F be the collection of convex functions of the form
f(x) = ½ x^⊤ A x − b^⊤ x for b ∈ R^n. Then C may be any convex set, and (for the
sake of proving lower bounds, not for real applicability in solving problems) we
might take the stochastic gradient

g = ∇f(x) + ξ = Ax − b + ξ,  where ξ ∼ N(0, I_{n×n}) i.i.d.
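Purely as an illustration of such an oracle (it plays no role in the lower bounds to come), the following one-line Julia sketch returns the exact gradient of f(x) = ½ x^⊤ A x − b^⊤ x plus standard Gaussian noise.

using LinearAlgebra

# Sketch: a stochastic gradient oracle for f(x) = (1/2) x'Ax - b'x, returning the
# exact gradient plus standard Gaussian noise, so that E[g] = Ax - b.
noisy_gradient(A, b, x) = A * x .- b .+ randn(length(x))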
A somewhat more complex example, but with more fidelity to real problems,
comes from the stochastic programming problem (3.4.2) from Lecture 3 on sub-
gradient methods. In this case, there is a known convex function F : Rn × S → R,
which is the instantaneous loss function F(x; s). The problem is then to optimize
fP (x) := EP [F(x; S)]
where the distribution P on the random variable S is unknown to the method
a priori; there is then a correspondence between distributions P and functions
f ∈ F. Generally, an optimization procedure is given access to a sample S1 , . . . , SK drawn i.i.d.
according to the distribution P (in this case, there is no selection of points xi by the
optimization procedure, as the sample S1 , . . . , SK contains even more information
than the stochastic gradients). A similar variant with a natural stochastic gradient
oracle is to set g(x, s, F) ∈ ∂F(x; s) instead of providing the sample S = s.
We focus in this note on the case when the optimization procedure may view
only the sequence of subgradients g1 , g2 , . . . at the points it queries. We note in
passing, however, that for many problems we can reconstruct S from a gradient
g ∈ ∂F(x; S). For example, consider a logistic regression problem with data s =
(a, b) ∈ {0, 1}^n × {−1, 1}, a typical data case. Then

F(x; s) = log(1 + e^{−b⟨a,x⟩}),  and  ∇_x F(x; s) = − (1 / (1 + e^{b⟨a,x⟩})) b a,
so that (a, b) is identifiable from any g ∈ ∂F(x; s). More generally, classical linear
models in statistics have gradients that are scaled multiples of the data, so that
the sample s is typically identifiable from g ∈ ∂F(x; s).
Now, given function f and stochastic gradient oracle g, an optimization pro-
cedure chooses query points x1 , x2 , . . . , xK and observes stochastic subgradients
gk with E[gk ] ∈ ∂f(xk ). Based on these stochastic gradients, the optimization
procedure outputs b xK , and we assess the quality of the procedure in terms of the
excess loss

E[ f(x̂_K(g_1, . . . , g_K)) − inf_{x⋆∈C} f(x⋆) ],
where the expectation is taken over the subgradients g(xi , Si , f) returned by the
stochastic oracle and any randomness in the chosen iterates, or query points,
x1 , . . . , xK of the optimization method. Of course, if we only consider this ex-
cess objective value for a fixed function f, then a trivial optimization procedure
achieves excess risk 0: simply return some x ∈ argminx∈C f(x). It is thus impor-
tant to ask for a more uniform notion of risk: we would like the procedure to have
good performance uniformly across all functions f ∈ F, leading us to measure the
performance of a procedure by its worst-case risk
 
sup_{f∈F} E[ f(x̂(g_1, . . . , g_K)) − inf_{x⋆∈C} f(x⋆) ],

where the supremum is taken over functions f ∈ F (the subgradient oracle g then
implicitly depends on f). An optimal estimator for this metric then gives the
minimax risk for optimizing the family of stochastic optimization problems {f}f∈F
over x ∈ C ⊂ R^n, which is

(5.1.2)   M_K(C, F) := inf_{x̂_K} sup_{f∈F} E[ f(x̂_K(g_1, . . . , g_K)) − inf_{x⋆∈C} f(x⋆) ].

We take the supremum (worst case) over functions f ∈ F and the infimum
over all possible optimization schemes x̂_K using K stochastic gradient samples.
A criticism of the framework (5.1.2) is that it is too pessimistic: by taking a
worst case over functions f ∈ F, one is making the family of
problems too challenging. We will not address these challenges except to say
that one response is to develop adaptive procedures x̂, which are simultaneously
optimal for a variety of collections of problems F.
The basic approach There are a variety of techniques for providing lower bounds
on the minimax risk (5.1.2). Each of them transforms the maximum risk by lower
bounding it via a Bayesian problem (e.g. [31, 33, 34]), then proving a lower bound
on the performance of all possible estimators for the Bayesian problem. In partic-
ular, let {f_v} ⊂ F be a collection of functions in F indexed by some (finite or countable)
set V, and let π be any probability mass function over V. For each f, let f⋆ = inf_{x∈C} f(x), and write f⋆_v = inf_{x∈C} f_v(x).
Then for any procedure x̂, the maximum risk has lower bound

sup_{f∈F} E[f(x̂) − f⋆] ≥ ∑_v π(v) E[f_v(x̂) − f⋆_v].

While trivial, this lower bound serves as the departure point for each of the
subsequent techniques for lower bounding the minimax risk. The lower bound
also allows us to assume that the procedure x̂ is deterministic. Indeed, assume that
x̂ is non-deterministic, which we can represent generally as depending on some
auxiliary random variable U independent of the observed subgradients. Then we
certainly have

E[ ∑_v π(v) E[f_v(x̂) − f⋆_v | U] ] ≥ inf_u ∑_v π(v) E[f_v(x̂) − f⋆_v | U = u],

that is, there is some realization of the auxiliary randomness that is at least as
good as the average realization. We can simply incorporate this realization into our minimax
optimal procedures x̂, and thus we assume from this point onward that all our
optimization procedures are deterministic when proving our lower bounds.

Figure 5.1.3 (the plot shows f_0 and f_1, the separation δ = d_opt(f_0, f_1), and the region {x : f_1(x) ≤ f⋆_1 + δ}). Separation of optimizers of f_0 and f_1. Optimizing
one function to accuracy better than δ = d_opt(f_0, f_1) implies we
optimize the other poorly; the gap f(x) − f⋆ is at least δ.

The second step in proving minimax bounds is to reduce the optimization


problem to a type of statistical test [58, 61, 62]. To perform this reduction, we de-
fine a distance-like quantity between functions such that, if we have optimized a
function fv to better than the distance, we cannot have optimized other functions
well. In particular, consider two convex functions f0 and f1 . Let f⋆v = infx∈C fv (x)
for v ∈ {0, 1}. We let the optimization separation between functions f0 and f1 over
the set C be

(5.1.4)   d_opt(f_0, f_1; C) := sup{ δ ≥ 0 : for any x ∈ C, f_1(x) ≤ f⋆_1 + δ implies f_0(x) ≥ f⋆_0 + δ, and f_0(x) ≤ f⋆_0 + δ implies f_1(x) ≥ f⋆_1 + δ }.
That is, if we have any point x such that f_v(x) − f⋆_v ≤ d_opt(f_0, f_1), then x cannot
optimize f_{1−v} well, i.e. we can only optimize one of the two functions f_0 and f_1
to accuracy d_opt(f_0, f_1). See Figure 5.1.3 for an illustration of this quantity. For
example, if f_1(x) = (x + c)² and f_0(x) = (x − c)² for a constant c ≠ 0, then we
have d_opt(f_1, f_0) = c²: if f_1(x) ≤ δ then |x + c| ≤ √δ, so that f_0(x) ≥ (2|c| − √δ)² ≥ δ whenever δ ≤ c².
This separation dopt allows us to give a reduction from optimization to testing
via the canonical hypothesis testing problem, which is defined as follows:
1. Nature chooses an index V ∈ V uniformly at random
2. Conditional on the choice V = v, the procedure observes stochastic subgradi-
ents for the function fv according to the oracle g(xk , Sk , fv ) for i.i.d. Sk .
Then, given the observed subgradients, the goal is to test which of the random
indices v nature chose. Intuitively, if we can optimize fv well—to better than
the separation dopt (fv , fv ′ )—then we can identify the index v. If we can show
this, then we can adapt classical statistical results on optimal hypothesis testing
to lower bound the probability of error in testing whether the data was generated
conditional on V = v.
More formally, we have the following key lower bound. In the lower bound,
we say that a collection of functions {fv }v∈V is δ-separated, where δ > 0, if
(5.1.5)   d_opt(f_v, f_{v′}; C) ≥ δ for each v, v′ ∈ V with v ≠ v′.
Then we have the next proposition.

Proposition 5.1.6. Let V be drawn uniformly at random from V, where |V| < ∞, and assume the
collection {f_v}_{v∈V} is δ-separated. Then for any optimization procedure x̂ based on the
observed subgradients,

(1/|V|) ∑_{v∈V} E[f_v(x̂) − f⋆_v] ≥ δ · inf_{v̂} P(v̂ ≠ V),

where the distribution P is the joint distribution over the random index V and the observed
gradients g_1, . . . , g_K, and the infimum is taken over all testing procedures v̂ based
on the observed data.

Proof. We let P_v denote the distribution of the subgradients conditional on the
choice V = v, meaning that E[g_k | V = v] ∈ ∂f_v(x_k). We observe that for any v,
we have

E[f_v(x̂) − f⋆_v] ≥ δ E[ 1{f_v(x̂) > f⋆_v + δ} ] = δ P_v(f_v(x̂) > f⋆_v + δ).

Now, define the hypothesis test v̂, which is a function of x̂, by

v̂ = v if f_v(x̂) ≤ f⋆_v + δ, and v̂ arbitrary in V otherwise.

This is a well-defined mapping, as by the condition that d_opt(f_v, f_{v′}) ≥ δ, there
can be only a single index v such that f_v(x̂) ≤ f⋆_v + δ. We then note the following
implication:

v̂ ≠ v implies f_v(x̂) > f⋆_v + δ.

Thus we have

P_v(v̂ ≠ v) ≤ P_v(f_v(x̂) > f⋆_v + δ),

or, summarizing, we have

(1/|V|) ∑_{v∈V} E[f_v(x̂) − f⋆_v] ≥ δ · (1/|V|) ∑_{v∈V} P_v(v̂ ≠ v).

But by definition of the distribution P, we have (1/|V|) ∑_{v∈V} P_v(v̂ ≠ v) = P(v̂ ≠ V),
and taking the best possible test v̂ gives the result of the proposition.
Proposition 5.1.6 allows us to then bring in the tools of optimal testing in
statistics and information theory, which we can use to prove lower bounds. To
leverage Proposition 5.1.6, we follow a two phase strategy: we construct a well-
separated function collection, and then we show that it is difficult to test which of
the functions we observe data from. There is a natural tension in the proposition,
as it is easier to distinguish functions that are far apart (i.e. large δ), while
hard-to-distinguish functions (i.e. those with large P(v̂ ≠ V)) often have smaller separation. Thus
we trade these against one another carefully in constructing our lower bounds on
the minimax risk. We also present a variant lower bound in Section 5.3 based on
a similar reduction, except that we use multiple binary hypothesis tests.
5.2. Le Cam’s Method Our first set of lower bounds is based on Le Cam’s
method [33], which uses optimality guarantees for simple binary hypothesis tests
to provide lower bounds for optimization problems. That is, we let V = {−1, 1}
and will construct only pairs of functions and distributions P1 , P−1 generating
data. In this section, we show how to use these binary hypothesis tests to prove
lower bounds on the family of stochastic optimization problems characterized by
the following conditions: the domain C ⊂ Rn contains an ℓ2 -ball of radius R and
the subgradients gk satisfy the second moment bound
E[‖g_k‖₂²] ≤ M²
for all k. We assume that F consists of M-Lipschitz continuous convex functions.
With the definition (5.1.4) of the separation in terms of optimization value,
we can provide a lower bound on optimization in terms of distances between
distributions P1 and P−1 . Before we continue, we require a few definitions about
distances between distributions.

Definition 5.2.1. Let P and Q be distributions on a space S, and assume that they
are both absolutely continuous with respect to a measure µ on S, with densities p and q respectively. The variation
distance between P and Q is

‖P − Q‖_TV := sup_{A⊂S} |P(A) − Q(A)| = (1/2) ∫_S |p(s) − q(s)| dµ(s).

The Kullback-Leibler divergence between P and Q is

D_kl(P||Q) := ∫_S p(s) log( p(s)/q(s) ) dµ(s).
We can connect the variation distance to binary hypothesis tests via the following
lemma, due to Le Cam. The lemma states that testing between two distributions
is hard precisely when they are close in variation distance.

Lemma 5.2.2. Let P_1 and P_{−1} be any distributions. Then

inf_{v̂} { P_1(v̂ ≠ 1) + P_{−1}(v̂ ≠ −1) } = 1 − ‖P_1 − P_{−1}‖_TV.

Proof. Any testing procedure v̂ : S → {−1, 1} maps one region of the sample space,
call it A, to 1 and the complement A^c to −1. Thus, we have

P_1(v̂ ≠ 1) + P_{−1}(v̂ ≠ −1) = P_1(A^c) + P_{−1}(A) = 1 − P_1(A) + P_{−1}(A).

Optimizing over v̂ is then equivalent to optimizing over sets A, yielding

inf_{v̂} { P_1(v̂ ≠ 1) + P_{−1}(v̂ ≠ −1) } = inf_A { 1 − P_1(A) + P_{−1}(A) }
   = 1 − sup_A { P_1(A) − P_{−1}(A) } = 1 − ‖P_1 − P_{−1}‖_TV,

as desired.
As an immediate consequence of Lemma 5.2.2, we obtain the standard minimax
lower bound based on binary hypothesis testing. In particular, let f_1 and f_{−1}
be δ-separated and belong to F, and assume that the method x̂ receives data (in
this case, the data is the K subgradients) from P_v^K when f_v is the true function.
Then we immediately have

(5.2.3)   M_K(C, F) ≥ inf_{x̂_K} max_{v∈{−1,1}} { E_{P_v}[f_v(x̂_K) − f⋆_v] } ≥ (δ/2) [ 1 − ‖P_1^K − P_{−1}^K‖_TV ].

Inequality (5.2.3) gives a quantitative guarantee on an intuitive fact: if we observe


data from one of two distributions P1 and P−1 that are close, while the optimiz-
ers of the functions f1 and f−1 associated with P1 and P−1 differ, it is difficult to
optimize well. Moreover, there is a natural tradeoff—the farther apart the func-
tions f1 and f−1 are (i.e. δ = dopt (f1 , f−1 ) is large), the bigger the penalty for
optimizing one well, but conversely, this usually forces the distributions P1 and
P−1 to be quite different, as they provide subgradient information on f1 and f−1 ,
respectively.
It is challenging to compute quantities—especially with multiple samples—
involving the variation distance, so we now convert our bounds to ones involving
the KL-divergence, which is computationally easier when dealing with multiple
samples. First, we use Pinsker's inequality (see Appendix A.3, Proposition A.3.2
for a proof): for any distributions P and Q,

‖P − Q‖²_TV ≤ (1/2) D_kl(P||Q).
As we see presently, the KL-divergence tensorizes when we have multiple obser-
vations from different distributions (see Lemma 5.2.8 to come), allowing substan-
tially easier computation of individual divergence terms. Then we have the fol-
lowing theorem.

Theorem 5.2.4. Let F be a collection of convex functions, and let f_1, f_{−1} ∈ F. Assume
that when the function f_v is to be optimized, we observe K subgradients according to P_v^K.
Then

M_K(C, P) ≥ (d_opt(f_{−1}, f_1; C) / 2) [ 1 − √( (1/2) D_kl(P_1^K || P_{−1}^K) ) ].

What remains to give a concrete lower bound, then, is (1) to construct a family
of well-separated functions f1 , f−1 , and (2) to construct a stochastic gradient or-
acle for which we give a small upper bound on the KL-divergence between the
distributions P1 and P−1 associated with the functions, which means that testing
between P1 and P−1 is hard.
Constructing well-separated functions   Our first goal is to construct a family
of well-separated functions and an associated first-order subgradient oracle that
makes the functions hard to distinguish. We parameterize our functions (we construct
only two of them) by a parameter δ > 0 governing their separation.
Our construction applies in dimension n = 1: let us assume that C contains the
interval [−R, R] (this is no loss of generality, as we may simply shift the interval).
Then define the M-Lipschitz continuous functions

(5.2.5)   f_1(x) = Mδ|x − R|  and  f_{−1}(x) = Mδ|x + R|.

See Figure 5.2.6 for an example of these functions, which makes clear that their
separation (5.1.4) is

d_opt(f_1, f_{−1}) = δMR.

We also consider the stochastic oracle for this problem, recalling that we must
construct subgradients satisfying E[‖g‖₂²] ≤ M². We will do slightly more: we
will guarantee that |g| ≤ M always. With this in mind, we assume that δ ≤ 1,
and define the stochastic gradient oracle for the distribution P_v, v ∈ {−1, 1}, at the
point x to be

(5.2.7)   g_v(x) = M sign(x − vR) with probability (1+δ)/2, and g_v(x) = −M sign(x − vR) with probability (1−δ)/2.

At x = vR the oracle simply returns a random sign. Then by inspection, we see
that

E[g_v(x)] = ((1+δ)/2) M sign(x − vR) − ((1−δ)/2) M sign(x − vR) = Mδ sign(x − vR) ∈ ∂f_v(x)
Figure 5.2.6. The function construction (5.2.5) with separation d_opt(f_1, f_{−1}) = MRδ
(the plot shows f_{−1} and f_1 on [−R, R] and the region {x : f_1(x) ≤ f⋆_1 + MRδ}).

for v = −1, 1. Thus the combination of the functions (5.2.5) and the stochastic
gradients (5.2.7) gives us a valid subgradient oracle and a well-separated pair of functions.
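As a small sanity check of this construction, the following sketch simulates the oracle (5.2.7); the function name and default parameter values are our own.

# Sketch: simulate the oracle (5.2.7) for f_v(x) = M*delta*|x - v*R|. With
# probability (1+delta)/2 it returns the correctly signed subgradient
# M*sign(x - v*R), otherwise its negation, so E[g] = M*delta*sign(x - v*R).
function lecam_oracle(x, v; M = 1.0, R = 1.0, delta = 0.1)
    s = x == v * R ? rand([-1.0, 1.0]) : sign(x - v * R)   # random sign at the kink
    return rand() < (1 + delta) / 2 ? M * s : -M * s
end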
Bounding the distance between distributions The second step in proving our
minimax lower bound is to upper bound the distance between the distributions
that generate the subgradients our methods observe. This means that testing
which of the functions we are optimizing is challenging, giving us a strong lower
bound. At a high level, building off of Theorem 5.2.4, we hope to show an upper
bound of the form

D_kl(P_1^K || P_{−1}^K) ≤ κδ²

for some κ. This is a local condition, allowing us to scale our problems with δ
to achieve minimax bounds. If we have such a quadratic bound, we may simply choose
δ² = 1/(2κ), giving the constant probability of error

1 − ‖P_1^K − P_{−1}^K‖_TV ≥ 1 − √( D_kl(P_1^K || P_{−1}^K) / 2 ) ≥ 1 − √( κδ²/2 ) = 1/2.
To this end, we begin with a standard lemma (the chain rule for KL diver-
gence), which applies when we have K potentially dependent observations from
a distribution. The result is an immediate consequence of Bayes’ rule.

Lemma 5.2.8. Let P_v(· | g_1, . . . , g_{k−1}) denote the conditional distribution of g_k given
g_1, . . . , g_{k−1} under P_v. For each k ∈ N let P_1^k and P_{−1}^k be distributions on the subgradients
g_1, . . . , g_k. Then

D_kl(P_1^K || P_{−1}^K) = ∑_{k=1}^K E_{P_1^{k−1}} [ D_kl( P_1(· | g_1, . . . , g_{k−1}) || P_{−1}(· | g_1, . . . , g_{k−1}) ) ].
Using Lemma 5.2.8, we have the following upper bound on the KL-divergence
between P_1^K and P_{−1}^K for the stochastic gradient (5.2.7).

Lemma 5.2.9. Let the K observations under distribution P_v come from the stochastic
gradient oracle (5.2.7). Then for δ ≤ 4/5,

D_kl(P_1^K || P_{−1}^K) ≤ 3Kδ².
Proof. We use the chain rule for the KL-divergence, whence we must only provide
an upper bound on the individual terms. We first note that x_k is a function
of g_1, . . . , g_{k−1} (because we may assume w.l.o.g. that x_k is deterministic), so that
P_v(· | g_1, . . . , g_{k−1}) is the distribution of a Bernoulli random variable as in (5.2.7), i.e. with probabilities (1 ± δ)/2. Thus we have

D_kl( P_1(· | g_1, . . . , g_{k−1}) || P_{−1}(· | g_1, . . . , g_{k−1}) ) ≤ D_kl( (1+δ)/2 || (1−δ)/2 )
   = ((1+δ)/2) log((1+δ)/(1−δ)) + ((1−δ)/2) log((1−δ)/(1+δ))
   = δ log((1+δ)/(1−δ)).

By a Taylor expansion, we have that

δ log((1+δ)/(1−δ)) = δ[δ − δ²/2 + O(δ³)] − δ[−δ − δ²/2 + O(δ³)] = 2δ² + O(δ⁴) ≤ 3δ²

for δ ≤ 4/5, that is,

D_kl( P_1(· | g_1, . . . , g_{k−1}) || P_{−1}(· | g_1, . . . , g_{k−1}) ) ≤ 3δ²

for δ ≤ 4/5. Summing over k completes the proof.
Putting it all together: a minimax lower bound With Lemma 5.2.9 in place
along with our construction (5.2.5) of well-separated functions, we can now give
a theorem on the best possible convergence guarantees for a broad family of
problems.
Theorem 5.2.10. Let C ⊂ R^n be a convex set containing an ℓ2-ball of radius R, and let P
denote the collection of distributions generating stochastic subgradients with ‖g‖₂ ≤ M
with probability 1. Then

M_K(C, P) ≥ RM / (4√6 √K)

for all K ∈ N.

Proof. We combine Le Cam's method, Lemma 5.2.2 (and the subsequent Theorem 5.2.4),
with our construction (5.2.5) and its stochastic subgradients (5.2.7).
Certainly, the class of n-dimensional optimization problems is at least as challenging
as a 1-dimensional problem (we may always restrict our functions to depend
only on a single coordinate), so that for any δ > 0 we have

M_K(C, F) ≥ (δMR/2) [ 1 − √( (1/2) D_kl(P_1^K || P_{−1}^K) ) ].

Now we use Lemma 5.2.9, which guarantees the further lower bound

M_K(C, F) ≥ (δMR/2) [ 1 − √( 3Kδ²/2 ) ],

valid for all δ ≤ 4/5. Choosing δ² = 1/(6K), so that δ ≤ 4/5, we have D_kl(P_1^K || P_{−1}^K) ≤ 1/2, and hence

M_K(C, F) ≥ δMR/4.

Substituting our choice of δ into this expression gives the theorem.

In short, Theorem 5.2.10 gives a guarantee that matches the upper bounds of
the previous lectures to within a numerical constant factor of 10. A more careful
inspection of our analysis allows us to prove a lower bound, at least as K → ∞,
of RM/(8√K). In particular, by using Theorem 3.4.7 of our lecture on subgradient
methods, we find that if the set C contains an ℓ2-ball of radius R_inner and is
contained in an ℓ2-ball of radius R_outer, we have

(5.2.11)   (1/√96) (M R_inner / √K) ≤ M_K(C, F) ≤ M R_outer / √K

for all K ∈ N, where the upper bound is attained by the stochastic projected
subgradient method.
5.3. Multiple dimensions and Assouad’s Method The results in Section 5.2 pro-
vide guarantees for problems where we can embed much of the difficulty of our
family F in optimizing a pair of only two functions—something reminiscent of
problems in classical statistics on the “hardest one-dimensional subproblem” (see,
for example, the work of Donoho, Liu, and MacGibbon [19]). In many stochas-
tic optimization problems, the higher dimension n yields increased difficulty, so
that we would like to derive bounds that incorporate dimension more directly.
With that in mind, we develop a family of lower bounds, based on Assouad’s
method [2], that reduce optimization to a collection of binary hypothesis tests,
one for each of the n dimensions of the problem.
More precisely, we let V = {−1, 1}n be the n-dimensional binary hypercube,
and for each v ∈ V, we assume we have a function fv ∈ F where fv : Rn → R.
Without loss of generality, we will assume that our constraint set C has the point
0 in its interior. Let δ ∈ R^n_+ be an n-dimensional nonnegative vector. Then we
say that the functions {f_v} induce a δ-separation in the Hamming metric if for any
x ∈ C ⊂ R^n we have

(5.3.1)   f_v(x) − f⋆_v ≥ ∑_{j=1}^n δ_j 1{ sign(x_j) ≠ v_j },

where the subscript j denotes the jth coordinate. For example, if we define the
function f_v(x) = δ ‖x − v‖₁ for each v ∈ V and a scalar δ ≥ 0, then certainly {f_v} is δ1-separated
in the Hamming metric; more generally, f_v(x) = ∑_{j=1}^n δ_j |x_j − v_j| is δ-separated.
With this definition, we have the following lemma, providing a lower bound for
functions f : Rn → R.
Lemma 5.3.2 (Generalized Assouad). Let δ ∈ R^n_+ and let {f_v}, where v ∈ V = {−1, 1}^n,
be δ-separated in the Hamming metric. Let x̂ be any optimization algorithm, and
let P_v be the distribution of (all) the subgradients g_1, . . . , g_K the procedure x̂ observes
when optimizing f_v. Define

P_{+j} = (1/2^{n−1}) ∑_{v: v_j = 1} P_v  and  P_{−j} = (1/2^{n−1}) ∑_{v: v_j = −1} P_v.

Then

(1/2^n) ∑_{v∈{−1,1}^n} E[f_v(x̂) − f⋆_v] ≥ (1/2) ∑_{j=1}^n δ_j ( 1 − ‖P_{+j} − P_{−j}‖_TV ).

Proof. By using the separation condition, we immediately see that

E[f_v(x̂) − f⋆_v] ≥ ∑_{j=1}^n δ_j P_v( sign(x̂_j) ≠ v_j )

for any v ∈ V. Averaging over the vectors v ∈ V, we obtain

(1/2^n) ∑_{v∈V} E[f_v(x̂) − f⋆_v] ≥ ∑_{j=1}^n δ_j (1/|V|) ∑_{v∈V} P_v( sign(x̂_j) ≠ v_j )
   = ∑_{j=1}^n δ_j [ (1/|V|) ∑_{v: v_j = 1} P_v( sign(x̂_j) ≠ 1 ) + (1/|V|) ∑_{v: v_j = −1} P_v( sign(x̂_j) ≠ −1 ) ]
   = ∑_{j=1}^n (δ_j/2) [ P_{+j}( sign(x̂_j) ≠ 1 ) + P_{−j}( sign(x̂_j) ≠ −1 ) ].

Now we use Le Cam's lemma (Lemma 5.2.2) on optimal binary hypothesis tests
to see that

P_{+j}( sign(x̂_j) ≠ 1 ) + P_{−j}( sign(x̂_j) ≠ −1 ) ≥ 1 − ‖P_{+j} − P_{−j}‖_TV,

which gives the desired result.
As a nearly immediate consequence of Lemma 5.3.2, we see that if the separa-
tion is a constant δ > 0 for each coordinate, we have the following lower bound
on the minimax risk.

Proposition 5.3.3. Let the collection {f_v}_{v∈V} ⊂ F, where V = {−1, 1}^n, be δ-separated
in the Hamming metric for some scalar δ ∈ R_+, and let the conditions of Lemma 5.3.2 hold. Then

M_K(C, F) ≥ (nδ/2) [ 1 − √( (1/(2n)) ∑_{j=1}^n D_kl(P_{+j} || P_{−j}) ) ].

Proof. Lemma 5.3.2 guarantees that

M_K(C, F) ≥ (δ/2) ∑_{j=1}^n ( 1 − ‖P_{+j} − P_{−j}‖_TV ).
Applying the Cauchy-Schwarz inequality, we have

∑_{j=1}^n ‖P_{+j} − P_{−j}‖_TV ≤ √( n ∑_{j=1}^n ‖P_{+j} − P_{−j}‖²_TV ) ≤ √( (n/2) ∑_{j=1}^n D_kl(P_{+j} || P_{−j}) )

by Pinsker's inequality. Substituting this into the previous bound gives the desired result.
With this proposition, we can give a number of minimax lower bounds. We
focus on two concrete cases, which show that the stochastic gradient procedures
we have developed are optimal for a variety of problems. We give one result,
deferring others to the exercises associated with the lecture notes. For our main
result using Assouad’s method, we consider optimization problems for which the
set C ⊂ Rn contains an ℓ∞ ball of radius R. We also assume that the stochastic
gradient oracle satisfies the ℓ1 -bound condition
E[‖g(x, S, f)‖₁²] ≤ M².

This means that all the functions f ∈ F are M-Lipschitz continuous with respect
to the ℓ_∞-norm, that is, |f(x) − f(y)| ≤ M ‖x − y‖_∞.

Theorem 5.3.4. Let F and the stochastic gradient oracle be as above, and assume C ⊃ [−R, R]^n. Then

M_K(C, F) ≥ RM min{ 1/5, (1/√96) √(n/K) }.
Proof. Our proof is similar to the construction of our earlier lower bounds, except
that now we must construct functions defined on R^n so that our minimax lower
bound on the convergence rate grows with the dimension. Let δ > 0 be fixed for now.
For each v ∈ V = {−1, 1}^n, define the function

f_v(x) := (Mδ/n) ‖x − Rv‖₁.

Then by inspection, the collection {f_v} is (MRδ/n)-separated in the Hamming metric, as

f_v(x) = (Mδ/n) ∑_{j=1}^n |x_j − Rv_j| ≥ (Mδ/n) ∑_{j=1}^n R · 1{ sign(x_j) ≠ v_j }.

Now, we must (as before) construct a stochastic subgradient oracle. Let e_1, . . . , e_n
be the n standard basis vectors. For each v ∈ V, we define the stochastic subgradient as

(5.3.5)   g(x, f_v) = M e_j sign(x_j − Rv_j) with probability (1+δ)/(2n), and g(x, f_v) = −M e_j sign(x_j − Rv_j) with probability (1−δ)/(2n), for each j ∈ {1, . . . , n}.

That is, the oracle randomly chooses a coordinate j ∈ {1, . . . , n}, then conditional
on this choice, flips a biased coin and with probability (1+δ)/2 returns the correctly
signed jth coordinate of the subgradient, M e_j sign(x_j − Rv_j), and otherwise returns
the negative. Letting sign(x) denote the vector of signs of x, we then have
the equality

E[g(x, f_v)] = M ∑_{j=1}^n e_j sign(x_j − Rv_j) [ (1+δ)/(2n) − (1−δ)/(2n) ] = (Mδ/n) sign(x − Rv).

That is, E[g(x, f_v)] ∈ ∂f_v(x) as desired.
Now, we apply Proposition 5.3.3, which guarantees that

(5.3.6)   M_K(C, F) ≥ (MRδ/2) [ 1 − √( (1/(2n)) ∑_{j=1}^n D_kl(P_{+j} || P_{−j}) ) ].

It remains to upper bound the KL-divergence terms. Let P_v^K denote the distribution
of the K subgradients the method observes for the function f_v, and let v(±j)
denote the vector v except that its jth entry is forced to be ±1. Then, we may use
the convexity of the KL-divergence to obtain that

D_kl(P_{+j} || P_{−j}) ≤ (1/2^n) ∑_{v∈V} D_kl( P_{v(+j)}^K || P_{v(−j)}^K ).

Let us thus bound D_kl(P_v^K || P_{v′}^K) when v and v′ differ in only a single coordinate
(we let it be the first coordinate with no loss of generality). Let us assume for
notational simplicity that M = 1 for the next calculation, as this only changes the
support of the subgradient distribution (5.3.5) but not any divergences. Applying
the chain rule (Lemma 5.2.8), we have

D_kl(P_v^K || P_{v′}^K) = ∑_{k=1}^K E_{P_v}[ D_kl( P_v(· | g_{1:k−1}) || P_{v′}(· | g_{1:k−1}) ) ].

We consider one of the terms, noting that the kth query x_k is a function of
g_1, . . . , g_{k−1}. We have

D_kl( P_v(· | x_k) || P_{v′}(· | x_k) )
   = P_v(g = e_1 | x_k) log[ P_v(g = e_1 | x_k) / P_{v′}(g = e_1 | x_k) ] + P_v(g = −e_1 | x_k) log[ P_v(g = −e_1 | x_k) / P_{v′}(g = −e_1 | x_k) ],

because P_v and P_{v′} assign the same probability to all subgradients except when
g ∈ {±e_1}. Continuing the derivation, we obtain

D_kl( P_v(· | x_k) || P_{v′}(· | x_k) ) = ((1+δ)/(2n)) log((1+δ)/(1−δ)) + ((1−δ)/(2n)) log((1−δ)/(1+δ)) = (δ/n) log((1+δ)/(1−δ)).

Noting that this final quantity is bounded by 3δ²/n for δ ≤ 4/5 gives that

D_kl(P_v^K || P_{v′}^K) ≤ 3Kδ²/n   if δ ≤ 4/5.

Substituting the preceding calculation into the lower bound (5.3.6), we obtain

M_K(C, F) ≥ (MRδ/2) [ 1 − √( (1/(2n)) ∑_{j=1}^n 3Kδ²/n ) ] = (MRδ/2) [ 1 − √( 3Kδ²/(2n) ) ].

Choosing δ² = min{16/25, n/(6K)} gives the result of the theorem.
A few remarks are in order about the theorem. First, we see that it recovers
the 1-dimensional result of Theorem 5.2.10, as we may simply take n = 1 in
the theorem statement. Second, we see that if we wish to optimize over a set
larger than the ℓ2 -ball, then there must necessarily be some dimension-dependent
penalty, at least in the worst case. Lastly, the result again is sharp. By using
Theorem 3.4.7, we obtain the following corollary.

Corollary 5.3.7. In addition to the conditions of Theorem 5.3.4, let C ⊂ Rn contain an


ℓ∞ box of radius Rinner and be contained in an ℓ∞ box of radius Router . Then
√ √
1 1 n n
Rinner M min , √ √ 6 MK (C, F) 6 Router M min 1, √ .
5 96 K K
Notes and further reading The minimax criterion for measuring optimality of
optimization and estimation procedures has a long history, dating back at least
to Wald [59] in 1939. The information-theoretic approach to optimality guaran-
tees was extensively developed by Ibragimov and Has’minskii [31], and this is
our approach. Our treatment in this chapter is specifically based off of that by
Agarwal et al. [1] for proving lower bounds for stochastic optimization problems,
though our results appear to have slightly sharper constants. Notably missing
in our treatment is the use of Fano’s inequality for lower bounds, which is com-
monly used to prove converse statements to achievability results in information
theory [17, 62]. Recent treatments of various techniques for proving lower bounds
in statistics can be found in the book of Tsybakov [58] or the lecture notes [21].
Our focus on stochastic optimization problems allows reasonably straightfor-
ward reductions from optimization to statistical testing problems, for which in-
formation theoretic and statistical tools give elegant solutions. It is possible to
give lower bounds for non-stochastic problems, where the classical reference is
the book of Nemirovski and Yudin [41] (who also provide optimality guaran-
tees for stochastic problems). The basic idea is to provide lower bounds for the
oracle model of convex optimization, where we consider optimality in terms of
the number of queries to an oracle giving true first- or second-order information
(as opposed to the stochastic oracle studied here). More recent work, includ-
ing the lecture notes [42] and the book [44] provide a somewhat easier guide to
such results, while the recent paper of Braun et al. [13] shows how to leverage
information-theoretic tools to prove optimality guarantees even for non-stochastic
optimization problems.

A. Technical Appendices
A.1. Continuity of Convex Functions In this appendix, we provide proofs of
the basic continuity results for convex functions. Our arguments are based on
those of Hiriart-Urruty and Lemaréchal [27].
Proof of Lemma 2.3.1   We can write x ∈ B_1 as x = ∑_{i=1}^n x_i e_i, where the e_i are the
standard basis vectors and ∑_{i=1}^n |x_i| ≤ 1. Thus, we have

f(x) = f( ∑_{i=1}^n e_i x_i ) = f( ∑_{i=1}^n |x_i| sign(x_i) e_i + (1 − ‖x‖₁)·0 )
   ≤ ∑_{i=1}^n |x_i| f(sign(x_i) e_i) + (1 − ‖x‖₁) f(0)
   ≤ max{ f(e_1), f(−e_1), f(e_2), f(−e_2), . . . , f(e_n), f(−e_n), f(0) }.

The first inequality uses the fact that the |x_i| and (1 − ‖x‖₁) form a convex combination,
since x ∈ B_1, as does the second.
For the lower bound, note that because x ∈ int B_1 satisfies x ∈ int dom f,
we have ∂f(x) ≠ ∅ by Theorem 2.4.3. In particular, there is a vector g such that
f(y) ≥ f(x) + ⟨g, y − x⟩ for all y, and even more,

f(y) ≥ f(x) + inf_{y∈B_1} ⟨g, y − x⟩ ≥ f(x) − 2 ‖g‖_∞

for all y ∈ B_1.
Proof of Theorem 2.3.2   First, let us suppose that for each point x_0 ∈ C, there
exists an open ball B ⊂ int dom f such that

(A.1.1)   |f(x) − f(x′)| ≤ L ‖x − x′‖₂ for all x, x′ ∈ B.

The collection of such balls B covers C, and as C is compact, there exists a
finite subcover B_1, . . . , B_k with associated Lipschitz constants L_1, . . . , L_k. Take
L = max_i L_i to obtain the result. It thus remains to show that we can construct
balls satisfying the Lipschitz condition (A.1.1) at each point x_0 ∈ C.
With that in mind, we use Lemma 2.3.1, which shows that for each point x_0,
there is some ε > 0 and −∞ < m ≤ M < ∞ such that

−∞ < m ≤ inf_{v: ‖v‖₂ ≤ 2ε} f(x_0 + v) ≤ sup_{v: ‖v‖₂ ≤ 2ε} f(x_0 + v) ≤ M < ∞.

We make the following claim, from which the condition (A.1.1) evidently follows
based on the preceding display.

Lemma A.1.2. Let ε > 0, f be convex, and B = {v : ‖v‖₂ ≤ 1}. Suppose that f(x) ∈ [m, M]
for all x ∈ x_0 + 2εB. Then

|f(x) − f(x′)| ≤ ((M − m)/ε) ‖x − x′‖₂ for all x, x′ ∈ x_0 + εB.

Proof. Let x, x′ ∈ x_0 + εB. Let

x′′ = x′ + ε (x′ − x)/‖x′ − x‖₂ ∈ x_0 + 2εB,

as (x′ − x)/‖x′ − x‖₂ ∈ B. By construction, we have that x′ ∈ {tx + (1 − t)x′′, t ∈ [0, 1]},
the segment between x and x′′; explicitly,

(1 + ε/‖x′ − x‖₂) x′ = x′′ + (ε/‖x′ − x‖₂) x   or   x′ = (‖x′ − x‖₂/(‖x′ − x‖₂ + ε)) x′′ + (ε/(‖x′ − x‖₂ + ε)) x.

Then we find that

f(x′) ≤ (‖x − x′‖₂/(‖x − x′‖₂ + ε)) f(x′′) + (ε/(‖x − x′‖₂ + ε)) f(x),

or

f(x′) − f(x) ≤ (‖x − x′‖₂/(‖x − x′‖₂ + ε)) [ f(x′′) − f(x) ] ≤ (‖x − x′‖₂/(‖x − x′‖₂ + ε)) [M − m]
   ≤ ((M − m)/ε) ‖x − x′‖₂.

Swapping the roles of x and x′ gives the result.
A.2. Probability background In this section, we very tersely review a few of
the necessary definitions and results that we employ here. We provide a non
measure-theoretic treatment, as it is not essential for the basic uses we have.

Definition A.2.1. A sequence X1 , X2 , . . . of random vectors converges in probability


to a random vector X∞ if for all ǫ > 0, we have
lim sup_{n→∞} P(‖X_n − X_∞‖ ≥ ε) = 0.

Definition A.2.2. A sequence X1 , X2 , . . . of random vectors is a martingale if there is


a sequence of random variables Z1 , Z2 , . . . (which may contain all the information
about X1 , X2 , . . .) such that for each n, (i) Xn is a function of Zn , (ii) Zn−1 is a
function of Zn , and (iii) we have the conditional expectation condition
E[Xn | Zn−1 ] = Xn−1 .
When condition (i) is satisfied, we say that X_n is adapted to Z. We say that a sequence
X_1, X_2, . . . is a martingale difference sequence if S_n = ∑_{i=1}^n X_i is a martingale,
or, equivalently, if E[X_n | Z_{n−1}] = 0.

We now provide a self-contained proof of the Azuma-Hoeffding inequality. We begin
with an important intermediate result, Hoeffding's lemma.

Lemma A.2.3 (Hoeffding's Lemma [30]). Let X be a random variable with a ≤ X ≤ b.
Then

E[exp(λ(X − E[X]))] ≤ exp( λ²(b − a)²/8 )   for all λ ∈ R.

Proof. First, we note that if Y is any random variable with Y ∈ [c_1, c_2], then
Var(Y) ≤ (c_2 − c_1)²/4. Indeed, we have that E[Y] minimizes E[(Y − t)²] over t ∈ R,
so that

(A.2.4)   Var(Y) = E[(Y − E[Y])²] ≤ E[ (Y − (c_1 + c_2)/2)² ] ≤ ( (c_2 − c_1)/2 )² = (c_2 − c_1)²/4.
Without loss of generality, we assume E[X] = 0 and 0 ∈ [a, b]. Let ψ(λ) = log E[e^{λX}]. Then

ψ′(λ) = E[X e^{λX}] / E[e^{λX}]   and   ψ′′(λ) = E[X² e^{λX}] / E[e^{λX}] − E[X e^{λX}]² / E[e^{λX}]².

Note that ψ′(0) = E[X] = 0. Let P denote the distribution of X, and assume
without loss of generality that X has a density p (formally, we may assume there is a dominating base measure µ with respect to which P has a density p). Define the random variable Y
to have the shifted density f defined by

f(y) = e^{λy} p(y) / E[e^{λX}]

for y ∈ R, where p(y) = 0 for y ∉ [a, b]. Then E[Y] = ψ′(λ) and Var(Y) = E[Y²] − E[Y]² = ψ′′(λ).
But of course, we know that Y ∈ [a, b] because the distribution P
of X is supported on [a, b], so that

ψ′′(λ) = Var(Y) ≤ (b − a)²/4

by inequality (A.2.4). Using Taylor's theorem, we have that

ψ(λ) = ψ(0) + ψ′(0)λ + (λ²/2) ψ′′(λ̃) = ψ(0) + (λ²/2) ψ′′(λ̃)

for some λ̃ between 0 and λ. But ψ(0) = 0 and ψ′′(λ̃) ≤ (b − a)²/4, so that ψ(λ) ≤ (λ²/2)(b − a)²/4 = λ²(b − a)²/8, as
desired.

Theorem A.2.5 (Azuma-Hoeffding Inequality [4]). Let X_1, X_2, . . . be a martingale
difference sequence with |X_i| ≤ B for all i = 1, 2, . . .. Then

P( ∑_{i=1}^n X_i ≥ t ) ≤ exp( −2t²/(nB²) )

and

P( ∑_{i=1}^n X_i ≤ −t ) ≤ exp( −2t²/(nB²) )

for all t ≥ 0.

Proof. We prove the upper tail, as the lower tail is similar. The proof is a nearly
immediate consequence of Hoeffding's lemma (Lemma A.2.3) and the Chernoff
bound technique. Indeed, we have

P( ∑_{i=1}^n X_i ≥ t ) ≤ E[ exp( λ ∑_{i=1}^n X_i ) ] exp(−λt)

for all λ ≥ 0. Now, letting Z_i be the sequence to which the X_i are adapted, we
iterate conditional expectations. We have

E[ exp( λ ∑_{i=1}^n X_i ) ] = E[ E[ exp( λ ∑_{i=1}^{n−1} X_i ) exp(λX_n) | Z_{n−1} ] ]
   = E[ exp( λ ∑_{i=1}^{n−1} X_i ) E[exp(λX_n) | Z_{n−1}] ]
   ≤ E[ exp( λ ∑_{i=1}^{n−1} X_i ) ] e^{λ²B²/8},

because X_1, . . . , X_{n−1} are functions of Z_{n−1}. By iteratively applying this calculation,
we arrive at

(A.2.6)   E[ exp( λ ∑_{i=1}^n X_i ) ] ≤ exp( λ²nB²/8 ).

Now we optimize by choosing λ ≥ 0 to minimize the upper bound that inequality (A.2.6)
provides, namely

P( ∑_{i=1}^n X_i ≥ t ) ≤ inf_{λ≥0} exp( λ²nB²/8 − λt ) = exp( −2t²/(nB²) ),

by taking λ = 4t/(nB²).
A.3. Auxiliary results on divergences We present a few standard results on
divergences without proof, referring to standard references (e.g. the book of Cover
and Thomas [17] or the extensive paper on divergence measures by Liese and
Vajda [35]). Nonetheless, we state and prove a few results. The first is known as
the data processing inequality, and it says that processing a random variable (even
adding noise to it) can only make distributions closer together. See Cover and
Thomas [17] or Theorem 14 of Liese and Vajda [35] for a proof.

Proposition A.3.1 (Data processing). Let P_0 and P_1 be distributions on a random
variable S ∈ S, let Q(· | s) denote any conditional probability distribution conditioned
on s, and define

Q_v(A) = ∫ Q(A | s) dP_v(s)

for v = 0, 1 and all sets A. Then

‖Q_0 − Q_1‖_TV ≤ ‖P_0 − P_1‖_TV   and   D_kl(Q_0||Q_1) ≤ D_kl(P_0||P_1).

This proposition is perhaps somewhat intuitive: it says that if we do any process-


ing on a random variable S ∼ P, then there is less “information” about the initial
distribution of P than if we did no further processing.
A consequence of this result is Pinsker’s inequality.

Proposition A.3.2 (Pinsker's inequality). Let P and Q be arbitrary distributions. Then

‖P − Q‖²_TV ≤ (1/2) D_kl(P||Q).

Proof. First, we note that if we show the result assuming that the sample space
S on which P and Q are defined is finite, we have the general result. Indeed,
suppose that A ⊂ S achieves the supremum

‖P − Q‖_TV = sup_{A⊂S} |P(A) − Q(A)|.

(We may assume without loss of generality that such a set exists.) Then if we
define P̃ and Q̃ to be the binary distributions with P̃(0) = P(A) and P̃(1) = 1 − P(A),
and similarly for Q̃, we have ‖P − Q‖_TV = ‖P̃ − Q̃‖_TV, and Proposition A.3.1
immediately guarantees that

D_kl(P̃||Q̃) ≤ D_kl(P||Q).

Let us assume then that |S| < ∞.
In this case, Pinsker's inequality is an immediate consequence of the strong
convexity of the negative entropy functional h(p) = ∑_{i=1}^n p_i log p_i with respect to
the ℓ1-norm over the probability simplex. For completeness, let us prove this. Let
p and q ∈ R^n_+ satisfy ∑_{i=1}^n p_i = ∑_{i=1}^n q_i = 1. Then Taylor's theorem guarantees that

h(q) = h(p) + ⟨∇h(p), q − p⟩ + (1/2) (q − p)^⊤ ∇²h(q̃) (q − p),

where q̃ = λp + (1 − λ)q for some λ ∈ [0, 1]. Now, we note that

∇²h(p) = diag(1/p_1, . . . , 1/p_n),

and using that ∇h(p) = [log p_i + 1]_{i=1}^n, we find

h(q) = h(p) + ∑_{i=1}^n (q_i − p_i) log p_i + (1/2) ∑_{i=1}^n (q_i − p_i)²/q̃_i.

Using the Cauchy-Schwarz inequality, we have

( ∑_{i=1}^n |q_i − p_i| )² = ( ∑_{i=1}^n √(q̃_i) · |q_i − p_i|/√(q̃_i) )² ≤ ( ∑_{i=1}^n q̃_i )( ∑_{i=1}^n (q_i − p_i)²/q̃_i ).

Of course, this gives

h(q) ≥ h(p) + ∑_{i=1}^n (q_i − p_i) log p_i + (1/2) ‖p − q‖₁².

Rearranging this, we have h(q) − h(p) − ⟨∇h(p), q − p⟩ = ∑_{i=1}^n q_i log(q_i/p_i), or that

D_kl(q||p) ≥ (1/2) ‖p − q‖₁² = 2 ‖P − Q‖²_TV.

This is the result.

B. Questions and Exercises


Exercises for Lecture 2
Question 1: Let π_C(x) := argmin_{y∈C} ‖x − y‖₂ denote the Euclidean projection of
x onto the set C, where C is closed convex. Show that the projection is a Lipschitz
mapping, that is,

‖π_C(x_0) − π_C(x_1)‖₂ ≤ ‖x_0 − x_1‖₂
for all vectors x0 , x1 . Show that, even if C is compact, this inequality cannot (in
general) be improved.

Question 2: Let Sn = {A ∈ Rn×n : A = AT } be the set of n × n symmetric


matrices. Let f(X) = λmax (X) for X ∈ Sn . Show that f is convex and compute
∂f(X).

Question 3: A convex function f is called λ-strongly convex with respect to the
norm ‖·‖ on the (convex) domain X if for any x, y ∈ X, we have

f(y) ≥ f(x) + ⟨g, y − x⟩ + (λ/2) ‖x − y‖²

for all g ∈ ∂f(x). Recall that a function f is L-Lipschitz continuous with respect to
the norm ‖·‖ on the domain X if

|f(x) − f(y)| ≤ L ‖x − y‖ for all x, y ∈ X.

Let f be λ-strongly convex w.r.t. ‖·‖ and let h_1, h_2 be L-Lipschitz continuous convex
functions with respect to the norm ‖·‖. For i = 1, 2 define

x_i = argmin_{x∈X} { f(x) + h_i(x) }.

Show that

‖x_1 − x_2‖ ≤ 2L/λ.

Hint: You may use the fact, demonstrated in the notes, that if h is L-Lipschitz and
convex, then ‖g‖_* ≤ L for all g ∈ ∂h(x), where ‖·‖_* is the dual norm to ‖·‖.

Question 4 (Hölder's inequality): Let x and y be vectors in R^n and p, q ∈ (1, ∞)
be conjugate, that is, satisfy 1/p + 1/q = 1. In this question, we will show that
⟨x, y⟩ ≤ ‖x‖_p ‖y‖_q, and moreover, that ‖·‖_p and ‖·‖_q are dual norms. (The result
is essentially immediate in the case that p = 1 and q = ∞.)
(a) Show that for any a, b ≥ 0 and any η > 0, we have

ab ≤ (η^p/p) a^p + (1/(η^q q)) b^q.

Hint: use the concavity of the logarithm and that 1/p + 1/q = 1.
(b) Show that ⟨x, y⟩ ≤ (η^p/p) ‖x‖_p^p + (1/(η^q q)) ‖y‖_q^q for all η > 0.
(c) Using the result of part (b), show that ⟨x, y⟩ ≤ ‖x‖_p ‖y‖_q.
(d) Show that ‖·‖_p and ‖·‖_q are dual norms.

Exercises for Lecture 3


Question 5: In this question and the next, we perform experiments with (stochastic)
subgradient methods to train a handwritten digit recognition classifier
(one to recognize the digits {0, 1, . . . , 9}). A warning: we use optimization notation
here, consistent with Example 3.4, which is non-standard for typical machine
learning or statistical learning applications.
We represent a multiclass classifier using a matrix

X = [x_1 x_2 · · · x_k] ∈ R^{d×k},

where there are k classes, and the predicted class for a data vector a ∈ R^d is

argmax_{l∈[k]} ⟨a, x_l⟩ = argmax_{l∈[k]} { [X^⊤ a]_l }.

We represent data as pairs (a, b) ∈ R^d × {1, . . . , k}, where a is the data point (features)
and b the label of the data point. We use the multiclass hinge loss function

F(X; (a, b)) = max_{l≠b} [1 + ⟨a, x_l − x_b⟩]_+,

where [t]_+ = max{t, 0} denotes the positive part. We will use stochastic gradient
descent to attempt to minimize

f(X) := E_P[F(X; (A, B))] = ∫ F(X; (a, b)) dP(a, b),

where the expectation is taken over pairs (A, B).
(a) Show that F is convex.
(b) Show that F(X; (a, b)) = 0 if and only if the classifier represented by X has a
large margin, meaning that

⟨a, x_b⟩ ≥ ⟨a, x_l⟩ + 1 for all l ≠ b.

(c) For a pair (a, b), give a way to calculate a matrix G ∈ ∂F(X; (a, b)) (note that G ∈ R^{d×k}).

Question 6: In this problem, you will perform experiments to explore the per-
formance of stochastic subgradient methods for classification problems, specif-
ically, a handwritten digit recognition problem using zip code data from the
United States Postal Service (this data is taken from the book [24], originally
due to Yann LeCun). The data (training data zip.train, test data zip.test,
and information file zip.inf) are available for download from the zipped tar
file http://web.stanford.edu/~jduchi/PCMIConvex/ZIPCodes.tgz. Starter code is
available for Julia and Matlab at the following URLs.
i. For Julia: http://web.stanford.edu/~jduchi/PCMIConvex/sgd.jl
ii. For Matlab: http://web.stanford.edu/~jduchi/PCMIConvex/matlab.tgz
There are two methods left un-implemented in the starter code: the sgd method
and the MulticlassSVMSubgradient method. Implement these methods (you may
find the code for unit-testing the multiclass SVM subgradient useful to double
82 Introductory Lectures on Stochastic Optimization

check your implementation). For the SGD method, your stepsizes should be proportional to αi ∝ 1/√i, and you should project X onto the Frobenius norm ball
Br := {X ∈ Rd×k : ‖X‖_Fr ≤ r}, where ‖X‖_Fr^2 = Σ_{i,j} X_ij^2.
We have implemented a pre-processing step that also kernelizes the data representation. Let the function K(a, a′) = exp(−(1/(2τ))‖a − a′‖_2^2). Then the kernelized data representation transforms each datapoint a ∈ Rd into a vector
φ(a) = [K(a, a_{i1}) K(a, a_{i2}) · · · K(a, a_{im})]^⊤,
where i1, . . . , im is a random subset of {1, . . . , N} (see GetKernelRepresentation).
Once you have implemented the sgd and MulticlassSVMSubgradient methods,
use the method RunExperiment (Julia/Matlab). What performance do you get in
classification? Which digits is your classifier most likely to confuse?
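For orientation only, here is a minimal Julia sketch of the projected stochastic subgradient loop the question asks for, with steps proportional to 1/√i and Euclidean projection onto the Frobenius norm ball; the names sgd and project_frobenius, the uniform sampling, and the default parameters are illustrative assumptions and do not match the starter code’s actual interface.

using LinearAlgebra                     # norm(X) of a matrix is its Frobenius norm

project_frobenius(X, r) = norm(X) <= r ? X : (r / norm(X)) * X

function sgd(subgrad, data::Matrix{Float64}, labels::Vector{Int};
             r = 100.0, alpha0 = 1.0, iters = 10_000)
    d, N = size(data)
    k = maximum(labels)
    X = zeros(d, k)
    for i in 1:iters
        j = rand(1:N)                              # sample a training pair uniformly
        G = subgrad(X, data[:, j], labels[j])      # e.g. the hinge subgradient sketched above
        X = project_frobenius(X - (alpha0 / sqrt(i)) * G, r)
    end
    return X
end

With the kernelized representation, the columns of data would be the vectors φ(a_j).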

Question 7: In this problem, we give a simple bound on the rate of convergence for stochastic optimization for minimization of strongly convex functions. Let C denote a compact convex set and f denote a λ-strongly convex function with respect to the ℓ2-norm on C, meaning that
f(y) ≥ f(x) + ⟨g, y − x⟩ + (λ/2)‖x − y‖_2^2 for all g ∈ ∂f(x), x, y ∈ C.
Consider the following stochastic gradient method: at iteration k, we
i. receive a noisy subgradient gk with E[gk | xk ] ∈ ∂f(xk )
ii. perform the projected subgradient step
xk+1 = πC (xk − αk gk ).
Show that if E[‖gk‖_2^2] ≤ M^2 for all k, then with the stepsize choice αk = 1/(λk), we have the convergence guarantee
E[ Σ_{k=1}^K (f(xk) − f(x∗)) ] ≤ (M^2/(2λ)) (log K + 1).
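As a quick numerical sanity check of this guarantee (not part of the exercise), the following Julia sketch runs the method on the simple λ-strongly convex function f(x) = (λ/2)‖x‖_2^2 over the unit ℓ2 ball with Gaussian gradient noise; the function name and parameter defaults are illustrative assumptions.

using LinearAlgebra

function strongly_convex_sgd(; lambda = 1.0, sigma = 1.0, K = 10_000, n = 10)
    x = randn(n)
    x /= max(norm(x), 1.0)                      # start inside C = {x : ||x||_2 <= 1}
    total_gap = 0.0
    for k in 1:K
        total_gap += lambda / 2 * norm(x)^2     # f(x_k) - f(x*), since x* = 0
        g = lambda * x + sigma * randn(n)       # noisy subgradient with E[g | x] = grad f(x)
        x -= g / (lambda * k)                   # stepsize alpha_k = 1/(lambda * k)
        x /= max(norm(x), 1.0)                  # projection pi_C onto the unit l2 ball
    end
    return total_gap                            # compare with (M^2/(2 lambda))(log K + 1), here M^2 is roughly lambda^2 + sigma^2 * n
end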
Exercises for Lecture 4
Question 8: We saw in the lecture that if we use mirror descent,
xk+1 = argmin_{x∈C} { ⟨gk, x⟩ + (1/αk) Dh(x, xk) },
in the stochastic setting with E[gk | xk] ∈ ∂f(xk), then we have the regret bound
E[ Σ_{k=1}^K (f(xk) − f(x⋆)) ] ≤ E[ (1/αK) R^2 + (1/2) Σ_{k=1}^K αk ‖gk‖_∗^2 ].

Here we have assumed that Dh(x⋆, xk) ≤ R^2 for all k. We now use this inequality
to prove Corollary 4.3.3.
In particular, choose the stepsize αk adaptively at the kth step by optimizing the convergence bound up to the current iterate, that is, set
αk = R ( Σ_{i=1}^k ‖gi‖_∗^2 )^{−1/2},
based on the previous subgradients. Prove that in this case one has
E[ Σ_{k=1}^K (f(xk) − f(x⋆)) ] ≤ 3R E[ ( Σ_{k=1}^K ‖gk‖_∗^2 )^{1/2} ].
Conclude Corollary 4.3.3.
Hint: An intermediate step, which may be useful, is to prove the following inequality: for any non-negative sequence a1, a2, . . . , ak, one has
Σ_{i=1}^k ( a_i / ( Σ_{j=1}^i a_j )^{1/2} ) ≤ 2 ( Σ_{i=1}^k a_i )^{1/2}.
Induction is one natural strategy.
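A minimal Julia sketch of this adaptive stepsize inside an unconstrained subgradient loop may help fix ideas; it assumes the dual norm is the ℓ2 norm, omits the projection onto C, and the small constant guarding against division by zero is an illustrative addition.

function adaptive_stepsize_run(subgrad, x0::Vector{Float64}; R = 1.0, K = 1000)
    x = copy(x0)
    sumsq = 0.0
    for k in 1:K
        g = subgrad(x)                      # a (possibly stochastic) subgradient at x
        sumsq += sum(abs2, g)               # accumulates sum_{i <= k} ||g_i||_2^2
        alpha = R / sqrt(sumsq + 1e-12)     # alpha_k = R (sum_{i<=k} ||g_i||_*^2)^(-1/2)
        x = x - alpha * g                   # add a projection or mirror step for a constrained set C
    end
    return x
end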

Question 9 (Strong convexity of ℓp-norms): Prove the claim of Example 4.3. That is, for some fixed p ∈ (1, 2], if h(x) = (1/(2(p − 1))) ‖x‖_p^2, show that h is strongly convex with respect to the ℓp-norm.
Hint: Let Ψ(t) = (1/(2(p − 1))) t^{2/p} and φ(t) = |t|^p, noting that h(x) = Ψ( Σ_{j=1}^n φ(xj) ). Then by a Taylor expansion, this question is equivalent to showing that for any w, x ∈ Rn, we have
x^⊤ ∇^2 h(w) x ≥ ‖x‖_p^2,
where, defining the shorthand vector ∇φ(w) = [φ′(w1) · · · φ′(wn)]^⊤, we have
∇^2 h(w) = Ψ′′( Σ_{j=1}^n φ(wj) ) ∇φ(w)∇φ(w)^⊤ + Ψ′( Σ_{j=1}^n φ(wj) ) diag( φ′′(w1), . . . , φ′′(wn) ).
Now apply an argument similar to that used in Example 4.2 to show the strong convexity of h(x) = Σ_j xj log xj, but applying Hölder’s inequality instead of Cauchy-Schwarz.

Question 10 (Variable metric methods and AdaGrad): Consider the following variable-metric method for minimizing a convex function f on a convex set C ⊂ Rn:
xk+1 = argmin_{x∈C} { ⟨gk, x⟩ + (1/2)(x − xk)^⊤ Hk (x − xk) },
where E[gk] ∈ ∂f(xk). In the lecture, we showed that
E[ Σ_{k=1}^K (f(xk) − f(x⋆)) ] ≤ (1/2) E[ Σ_{k=2}^K ( ‖xk − x⋆‖_{Hk}^2 − ‖xk − x⋆‖_{Hk−1}^2 ) + ‖x1 − x⋆‖_{H1}^2 ] + (1/2) Σ_{k=1}^K E[ ‖gk‖_{Hk^{−1}}^2 ].
(a) Let
Hk = diag( Σ_{i=1}^k gi gi^⊤ )^{1/2}
be the diagonal matrix whose entries are the square roots of the sum of the squares of the gradient coordinates. (This is the AdaGrad method.) Show that
‖xk − x⋆‖_{Hk}^2 − ‖xk − x⋆‖_{Hk−1}^2 ≤ ‖xk − x⋆‖_∞ tr(Hk − Hk−1),
where tr(A) = Σ_{i=1}^n Aii is the trace of the matrix A.
(b) Assume that R∞ = sup_{x∈C} ‖x − x⋆‖_∞ is finite. Show that with any choice of diagonal matrix Hk, we obtain
E[ Σ_{k=1}^K (f(xk) − f(x⋆)) ] ≤ (1/2) R∞ E[tr(HK)] + (1/2) Σ_{k=1}^K E[ ‖gk‖_{Hk^{−1}}^2 ].
(c) Let gk,j denote the jth coordinate of the kth subgradient, and let Hk be chosen as above. Show that
E[ Σ_{k=1}^K (f(xk) − f(x⋆)) ] ≤ (3/2) R∞ Σ_{j=1}^n E[ ( Σ_{k=1}^K gk,j^2 )^{1/2} ].
(d) Suppose that the domain C = {x : ‖x‖_∞ ≤ 1}. What is the expected regret of AdaGrad? Show that (to a numerical constant factor we ignore) this expected regret is always smaller than the expected regret bound for standard projected gradient descent, which is
E[ Σ_{k=1}^K (f(xk) − f(x⋆)) ] ≤ O(1) sup_{x∈C} ‖x − x⋆‖_2 E[ ( Σ_{k=1}^K ‖gk‖_2^2 )^{1/2} ].
Hint: Use the Cauchy-Schwarz inequality.
(e) As in the previous sub-question, assume that C = {x : ‖x‖_∞ ≤ 1}. Suppose that the subgradients are such that gk ∈ {−1, 0, 1}^n for all k, and that for each coordinate j we have P(gk,j ≠ 0) = pj. Show that AdaGrad has the convergence guarantee
E[ Σ_{k=1}^K (f(xk) − f(x⋆)) ] ≤ (3√K/2) Σ_{j=1}^n √pj .
What is the corresponding bound for standard projected gradient descent? How much better can AdaGrad be?
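For concreteness, here is a minimal Julia sketch of the diagonal AdaGrad update of part (a), specialized to the constraint set C = {x : ‖x‖_∞ ≤ 1} of parts (d) and (e); the function name and the small ridge constant are illustrative assumptions rather than part of the method as stated.

function adagrad(subgrad, x0::Vector{Float64}; K = 1000, ridge = 1e-8)
    x = copy(x0)
    s = zeros(length(x))                    # s[j] = sum_{i <= k} g_{i,j}^2
    for k in 1:K
        g = subgrad(x)
        s .+= g .^ 2
        x .-= g ./ (sqrt.(s) .+ ridge)      # variable-metric step with H_k = diag(sqrt.(s))
        x .= clamp.(x, -1.0, 1.0)           # projection onto {x : ||x||_inf <= 1}
    end
    return x
end

Under a diagonal metric the constrained update over an ℓ∞ ball decouples coordinate by coordinate, which is why the projection reduces to clamping each coordinate to [−1, 1].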
Exercises for Lecture 5
Question 11: In this problem, we prove a lower bound for strongly convex optimization problems. Suppose at each iteration of the optimization procedure, we receive a noisy subgradient gk satisfying
gk = ∇f(xk) + ξk, where the ξk are i.i.d. N(0, σ^2).
To prove a lower bound for optimization procedures, we use the functions
fv(x) = (λ/2)(x − vδ)^2, v ∈ {±1}.
Let f⋆v = 0 denote the minimum value of fv over R, for v = ±1.
(a) Recall the separation between two functions f1 and f−1 as defined previously (5.1.4),
dopt(f−1, f1; C) := sup{ δ ≥ 0 : for any x ∈ C, f1(x) ≤ f⋆1 + δ implies f−1(x) ≥ f⋆−1 + δ, and f−1(x) ≤ f⋆−1 + δ implies f1(x) ≥ f⋆1 + δ }.
When C = R (or, more generally, as long as C ⊃ [−δ, δ]), show that
dopt(f−1, f1; C) ≥ (λ/2) δ^2.
(b) Show that the Kullback-Leibler divergence between two normal distributions P1 = N(µ1, σ^2) and P2 = N(µ2, σ^2) is
Dkl(P1 ‖ P2) = (µ1 − µ2)^2 / (2σ^2).
(c) Use Le Cam’s method to show the following lower bound for stochastic optimization: for any optimization procedure x̂K using K noisy gradient evaluations,
max_{v∈{−1,1}} EPv[ fv(x̂K) − f⋆v ] ≥ σ^2/(32λK).
Compare the result with the regret upper bound in problem 7. Hint: If PvK denotes the distribution of the K noisy gradients for the function fv, show that
Dkl(P1K ‖ P−1K) ≤ 2Kλ^2 δ^2 / σ^2.
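To make the oracle model concrete (this is only a simulation sketch, not part of the proof), the following Julia code draws the noisy gradients gk = fv′(xk) + ξk, runs the stepsize of Question 7, and returns the optimality gap of the averaged iterate, which can be compared with the σ^2/(32λK) lower bound; the function name, the use of the average as x̂K, and the parameter defaults are illustrative assumptions.

function noisy_oracle_run(v::Int; lambda = 1.0, delta = 0.1, sigma = 1.0, K = 1000)
    x, xbar = 0.0, 0.0
    for k in 1:K
        g = lambda * (x - v * delta) + sigma * randn()  # g_k = f_v'(x_k) + xi_k with xi_k ~ N(0, sigma^2)
        x -= g / (lambda * k)                           # the stepsize 1/(lambda * k) from Question 7
        xbar += (x - xbar) / k                          # running average of the iterates
    end
    return lambda / 2 * (xbar - v * delta)^2            # f_v(xhat_K) - f_v^*
end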
Question 12: Let C = {x ∈ Rn : ‖x‖_∞ ≤ 1}, and consider the collection of functions F where the stochastic gradient oracle g : Rn × S × F → {−1, 0, 1}^n satisfies
P(gj(x, S, f) ≠ 0) ≤ pj
for each coordinate j = 1, 2, . . . , n. Show that, for large enough K ∈ N, a minimax lower bound for this class of functions and the given stochastic oracle is
MK(C, F) ≥ c (1/√K) Σ_{j=1}^n √pj ,
where c > 0 is a numerical constant. How does this compare to the convergence
guarantee that AdaGrad gives?

References
[1] Alekh Agarwal, Peter L. Bartlett, Pradeep Ravikumar, and Martin J. Wainwright, Information-
theoretic lower bounds on the oracle complexity of convex optimization, IEEE Transactions on Informa-
tion Theory 58 (2012), no. 5, 3235–3249. ←74
[2] P. Assouad, Deux remarques sur l’estimation, Comptes Rendus des Séances de l’Académie des
Sciences, Série I 296 (1983), no. 23, 1021–1024. ←70
[3] P. Auer, N. Cesa-Bianchi, and C. Gentile, Adaptive and self-confident on-line learning algorithms,
Journal of Computer and System Sciences 64 (2002), no. 1, 48–75. ←60
[4] K. Azuma, Weighted sums of certain dependent random variables, Tohoku Mathematical Journal 68
(1967), 357–367. ←77
[5] A. Beck and M. Teboulle, Mirror descent and nonlinear projected subgradient methods for convex opti-
mization, Operations Research Letters 31 (2003), 167–175. ←59
[6] Aharon Ben-Tal, Laurent El Ghaoui, and Arkadi Nemirovski, Robust optimization, Princeton Uni-
versity Press, 2009. ←4
[7] D. P. Bertsekas, Stochastic optimization problems with nondifferentiable cost functionals, Journal of
Optimization Theory and Applications 12 (1973), no. 2, 218–231. ←22
[8] Dimitri P. Bertsekas, Convex optimization theory, Athena Scientific, 2009. ←3, 24
[9] D.P. Bertsekas, Nonlinear programming, Athena Scientific, 1999. ←3
[10] Stephen Boyd, John Duchi, and Lieven Vandenberghe, Subgradients, 2015. Course notes for Stan-
ford Course EE364b. ←43
[11] Stephen Boyd and Almir Mutapcic, Stochastic subgradient meth-
ods, 2007. Course notes for EE364b at Stanford, available at
http://www.stanford.edu/class/ee364b/notes/stoch_subgrad_notes.pdf. ←43
[12] Stephen Boyd and Lieven Vandenberghe, Convex optimization, Cambridge University Press, 2004.
←3, 20, 24, 25, 60
[13] Gábor Braun, Cristóbal Guzmán, and Sebastian Pokutta, Lower bounds on the oracle complexity of
nonsmooth convex optimization via information theory, IEEE Transactions on Information Theory 63
(2017), no. 7. ←74
[14] P. Brucker, An O(n) algorithm for quadratic knapsack problems, Operations Research Letters 3
(1984), no. 3, 163–166. ←33, 46, 54
[15] Sébastien Bubeck and Nicolò Cesa-Bianchi, Regret analysis of stochastic and nonstochastic multi-armed bandit problems, Foundations and Trends in Machine Learning 5 (2012), no. 1, 1–122. ←60
[16] N. Cesa-Bianchi, A. Conconi, and C. Gentile, On the generalization ability of on-line learning algorithms, IEEE Transactions on Information Theory 50 (2004), no. 9, 2050–2057. ←43
[17] Thomas M. Cover and Joy A. Thomas, Elements of information theory, second edition, Wiley, 2006.
←74, 78
[18] Aaron Defazio, Francis Bach, and Simon Lacoste-Julien, SAGA: A fast incremental gradient method
with support for non-strongly convex composite objectives, Advances in neural information processing
systems 27, 2014. ←43
[19] David L. Donoho, Richard C. Liu, and Brenda MacGibbon, Minimax risk over hyperrectangles, and
implications, Annals of Statistics 18 (1990), no. 3, 1416–1437. ←70
[20] D.L. Donoho, Compressed sensing, Technical report, Stanford University, 2006. ←31
[21] John C. Duchi, Stats311/EE377: Information theory and statistics, 2015. ←74
[22] John C. Duchi, Elad Hazan, and Yoram Singer, Adaptive subgradient methods for online learning and
stochastic optimization, Journal of Machine Learning Research 12 (2011), 2121–2159. ←56, 60
[23] John C. Duchi, Shai Shalev-Shwartz, Yoram Singer, and Tushar Chandra, Efficient projections onto
the ℓ1 -ball for learning in high dimensions, Proceedings of the 25th international conference on
machine learning, 2008. ←33, 46, 54
[24] Trevor Hastie, Robert Tibshirani, and Jerome Friedman, The elements of statistical learning, Second,
Springer, 2009. ←81
[25] Elad Hazan, The convex optimization approach to regret minimization, Optimization for machine
learning, 2012. ←43
[26] Elad Hazan, Introduction to online convex optimization, Foundations and Trends in Optimization 2
(2016), no. 3–4, 157–325. ←4
[27] J. Hiriart-Urruty and C. Lemaréchal, Convex analysis and minimization algorithms I, Springer, New
York, 1993. ←3, 21, 24, 74
[28] J. Hiriart-Urruty and C. Lemaréchal, Convex Analysis and Minimization Algorithms II, Springer,
New York, 1993. ←3, 24
[29] Jean-Baptiste Hiriart-Urruty and Claude Lemaréchal, Fundamentals of convex analysis, Springer,
2001. ←24
[30] W. Hoeffding, Probability inequalities for sums of bounded random variables, Journal of the American Statistical Association 58 (1963), no. 301, 13–30. ←76
[31] I. A. Ibragimov and R. Z. Has’minskii, Statistical estimation: Asymptotic theory, Springer-Verlag,
1981. ←4, 62, 74
[32] Rie Johnson and Tong Zhang, Accelerating stochastic gradient descent using predictive variance reduc-
tion, Advances in neural information processing systems 26, 2013. ←43
[33] Lucien Le Cam, Asymptotic methods in statistical decision theory, Springer-Verlag, 1986. ←4, 62, 65
[34] Erich L. Lehmann and George Casella, Theory of point estimation, second edition, Springer, 1998.
←62
[35] Friedrich Liese and Igor Vajda, On divergences and informations in statistics and information theory,
IEEE Transactions on Information Theory 52 (2006), no. 10, 4394–4412. ←78
[36] David Luenberger, Optimization by vector space methods, Wiley, 1969. ←24
[37] Jerrold Marsden and Michael Hoffman, Elementary classical analysis, second edition, W.H. Freeman,
1993. ←3
[38] Brendan McMahan and Matthew Streeter, Adaptive bound optimization for online convex optimiza-
tion, Proceedings of the twenty third annual conference on computational learning theory, 2010.
←60
[39] Angelia Nedić, Subgradient methods for convex minimization, Ph.D. Thesis, 2002. ←60
[40] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro, Robust stochastic approximation approach to
stochastic programming, SIAM Journal on Optimization 19 (2009), no. 4, 1574–1609. ←43, 60
[41] A. Nemirovski and D. Yudin, Problem complexity and method efficiency in optimization, Wiley, 1983.
←4, 25, 59, 74
[42] Arkadi Nemirovski, Efficient methods in convex programming, 1994. Technion: The Israel Institute
of Technology. ←74
[43] Arkadi Nemirovski, Lectures on modern convex optimization, 2005. Georgia Institute of Technology.
←43
[44] Y. Nesterov, Introductory lectures on convex optimization, Kluwer Academic Publishers, 2004. ←26,
43, 74
[45] Y. Nesterov and A. Nemirovski, Interior-point polynomial algorithms in convex programming, SIAM
Studies in Applied Mathematics, 1994. ←25
[46] Jorge Nocedal and Stephen J. Wright, Numerical optimization, Springer, 2006. ←3, 60
[47] B. T. Polyak, Introduction to optimization, Optimization Software, Inc., 1987. ←3, 43
[48] B. T. Polyak and A. B. Juditsky, Acceleration of stochastic approximation by averaging, SIAM Journal
on Control and Optimization 30 (1992), no. 4, 838–855. ←43
[49] R. Tyrell Rockafellar, Convex analysis, Princeton University Press, 1970. ←3, 6, 24
[50] Walter Rudin, Principles of mathematical analysis, third edition, McGraw-Hill, 1976. ←3
[51] S. Shalev-Shwartz, Online learning: Theory, algorithms, and applications, Ph.D. Thesis, 2007. ←47
[52] Shai Shalev-Shwartz, Online learning and online convex optimization, Foundations and Trends in
Machine Learning 4 (2012), no. 2, 107–194. ←4
[53] Shai Shalev-Shwartz and Tong Zhang, Stochastic dual coordinate ascent methods for regularized loss
minimization, Journal of Machine Learning Research 14 (2013), 567–599. ←43
[54] Alexander Shapiro, Darinka Dentcheva, and Andrzej Ruszczyński, Lectures on stochastic program-
ming: Modeling and theory, SIAM and Mathematical Programming Society, 2009. ←4
[55] Naum Zuselevich Shor, Minimization methods for nondifferentiable functions, translated by Krzysztof Kiwiel and Andrzej Ruszczyński, Springer-Verlag, 1985. ←60
[56] Naum Zuselevich Shor, Nondifferentiable optimization and polynomial problems, Springer, 1998. ←56,
60
[57] R. Tibshirani, Regression shrinkage and selection via the lasso, J. Royal. Statist. Soc B. 58 (1996), no. 1,
267–288. ←31
[58] Alexandre B. Tsybakov, Introduction to nonparametric estimation, Springer, 2009. ←63, 74
[59] Abraham Wald, Contributions to the theory of statistical estimation and testing hypotheses, Annals of
Mathematical Statistics 10 (1939), no. 4, 299–326. ←4, 74
[60] Abraham Wald, Statistical decision functions which minimize the maximum risk, Annals of Mathemat-
ics 46 (1945), no. 2, 265–280. ←4
[61] Y. Yang and A. Barron, Information-theoretic determination of minimax rates of convergence, Annals of
Statistics 27 (1999), no. 5, 1564–1599. ←4, 63
[62] Bin Yu, Assouad, Fano, and Le Cam, Festschrift for lucien le cam, 1997, pp. 423–435. ←4, 63, 74
[63] Martin Zinkevich, Online convex programming and generalized infinitesimal gradient ascent, Proceed-
ings of the twentieth international conference on machine learning, 2003. ←43

Stanford University, Stanford CA 94305


E-mail address: jduchi@stanford.edu
