Convex Optimization for Machine Learning

Sébastien Bubeck
Abstract
This monograph presents the main mathematical ideas in convex optimization. Starting from the fundamental theory of black-box optimization, the material progresses towards recent advances in structural optimization and stochastic optimization. Our presentation of black-box optimization, strongly influenced by the seminal book of Nesterov, includes the analysis of the Ellipsoid Method, as well as (accelerated) gradient descent schemes. We also pay special attention to non-Euclidean settings (relevant algorithms include Frank-Wolfe, Mirror Descent, and Dual Averaging) and discuss their relevance in machine learning. We provide a gentle introduction to structural optimization with FISTA (to optimize a sum of a smooth and a simple non-smooth term), Saddle-Point Mirror Prox (Nemirovski's alternative to Nesterov's smoothing), and a concise description of Interior Point Methods. In stochastic optimization we discuss Stochastic Gradient Descent, mini-batches, Random Coordinate Descent, and sublinear algorithms. We also briefly touch upon convex relaxation of combinatorial problems and the use of randomness to round solutions, as well as random-walk-based methods.
Contents

1 Introduction
  1.1 – 1.6 (including: Why convexity?, Black-box model, Structured optimization)
2 Convex optimization in finite dimension
  2.1 – 2.2
3 Dimension-free convex optimization
  3.1 – 3.6 (including: Lower bounds, Nesterov's Accelerated Gradient Descent)
4 Almost dimension-free convex optimization in non-Euclidean spaces
  4.1 Mirror maps
  4.2 Mirror Descent
  4.3 Standard setups for Mirror Descent
  4.4 Lazy Mirror Descent, aka Nesterov's Dual Averaging
  4.5 Mirror Prox
  4.6 The vector field point of view on MD, DA, and MP
5 Beyond the black-box model
  5.1 – 5.3
6 (sections 6.1 – 6.7)
1 Introduction
The central objects of our study are convex functions and convex sets
in ℝⁿ.

Definition 1.1 (Convex sets and convex functions). A set X ⊆ ℝⁿ is said to be convex if it contains all of its segments, that is

  ∀ (x, y, γ) ∈ X × X × [0, 1], (1 − γ)x + γy ∈ X.

A function f : X → ℝ is said to be convex if it always lies below its chords, that is

  ∀ (x, y, γ) ∈ X × X × [0, 1], f((1 − γ)x + γy) ≤ (1 − γ)f(x) + γf(y).
We are interested in algorithms that take as input a convex set X and
a convex function f and output an approximate minimum of f over X .
We write compactly the problem of finding the minimum of f over X
as
  min. f(x)
  s.t. x ∈ X.
In the following we will make more precise how the set of constraints X
and the objective function f are specified to the algorithm. Before that
we proceed to give a few important examples of convex optimization
problems in machine learning.
1.1

Many machine learning problems are of the form

  min_{x ∈ ℝⁿ} Σ_{i=1}^m f_i(x) + λ R(x),    (1.1)

where the f_i are convex data-fitting terms and R is a convex regularizer. For least-squares regression this takes the form

  min_{x ∈ ℝⁿ} ‖Wx − Y‖²₂ + λ R(x),

where W ∈ ℝ^{m×n} is the matrix with w_i^⊤ on the iᵗʰ row and Y = (y₁, …, y_m)^⊤. With R(x) = ‖x‖²₂ one obtains the ridge regression problem, while with R(x) = ‖x‖₁ this is the LASSO problem.
1.2
A basic result about convex sets that we shall use extensively is the
Separation Theorem.
Theorem 1.1 (Separation Theorem). Let X ⊆ ℝⁿ be a closed convex set, and x₀ ∈ ℝⁿ \ X. Then there exist w ∈ ℝⁿ and t ∈ ℝ such that

  w^⊤ x₀ < t, and ∀ x ∈ X, w^⊤ x ≥ t.

Note that if X is not closed then one can only guarantee that w^⊤ x₀ ≤ w^⊤ x, ∀ x ∈ X (and w ≠ 0). This immediately implies the Supporting Hyperplane Theorem.
Applying the Separation Theorem to the epigraph of f, one obtains (a, b) ∈ ℝⁿ × ℝ such that

  a^⊤ x + b f(x) ≥ a^⊤ y + b t, ∀ (y, t) ∈ epi(f).    (1.2)

Clearly, by letting t tend to infinity, one can see that b ≤ 0. Now let us assume that x is in the interior of X. Then for ε > 0 small enough, y = x + εa ∈ X, which implies that b cannot be equal to 0 (recall that if b = 0 then necessarily a ≠ 0, which allows to conclude by contradiction). Thus rewriting (1.2) for t = f(y) one obtains

  f(x) − f(y) ≤ (1/|b|) a^⊤ (x − y).

Thus a/|b| ∈ ∂f(x), which concludes the proof of the second claim.
Finally let f be a convex and differentiable function. Then by definition:

  f(y) ≥ (f((1 − γ)x + γy) − (1 − γ)f(x)) / γ
       = f(x) + (f(x + γ(y − x)) − f(x)) / γ
       → f(x) + ∇f(x)^⊤ (y − x) as γ → 0,

which shows that ∇f(x) ∈ ∂f(x).
1.3 Why convexity?
1.4 Black-box model
We now describe our first model of input for the objective function and the set of constraints. In the black-box model we assume that we have unlimited computational resources, the constraint set X is known, and the objective function f : X → ℝ is unknown but can be accessed through queries to oracles:

- A zeroth order oracle takes as input a point x ∈ X and outputs the value of f at x.
- A first order oracle takes as input a point x ∈ X and outputs a subgradient of f at x.

In this context we are interested in understanding the oracle complexity of convex optimization, that is how many queries to the oracles are necessary and sufficient to find an ε-approximate minimum of a convex function. To show an upper bound on the oracle complexity we need to propose an algorithm, while lower bounds are obtained by information-theoretic reasoning (we need to argue that if the number of queries is too small then we don't have enough information about the function to identify an ε-approximate solution).
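As a concrete illustration of the black-box model, the two oracles can be represented as a thin interface that counts queries. The following sketch is ours, not the text's; the example function f(x) = ‖x‖₁ and its subgradient sign(x) are just one valid instantiation.

```python
import numpy as np

class FirstOrderOracle:
    """Black-box access to a convex f: value queries (zeroth order)
    and subgradient queries (first order), with a query counter."""

    def __init__(self, f, subgradient):
        self.f = f
        self.subgradient = subgradient
        self.queries = 0

    def value(self, x):          # zeroth order oracle
        self.queries += 1
        return self.f(x)

    def grad(self, x):           # first order oracle
        self.queries += 1
        return self.subgradient(x)

# Example: f(x) = |x|_1 is convex but non-differentiable at 0;
# sign(x) is a valid subgradient everywhere.
oracle = FirstOrderOracle(lambda x: np.abs(x).sum(), np.sign)
x = np.array([1.0, -2.0])
print(oracle.value(x))   # 3.0
print(oracle.grad(x))    # [ 1. -1.]
print(oracle.queries)    # 2
```

An oracle-complexity bound then simply bounds `oracle.queries` at the time the algorithm outputs its candidate point.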
1.5 Structured optimization
1.6
In this text we will often assume a fixed number of iterations t, and the algorithms we consider can depend on t. Similarly we assume that the relevant parameters describing the regularity of the objective function (Lipschitz constant, smoothness constant, strong convexity parameter) are known and can also be used to tune the algorithm's own parameters. The interested
also be used to tune the algorithms own parameters. The interested
reader can find guidelines to adapt to these potentially unknown
parameters in the references given in the text.
Notation. We always denote by x* a point in X such that f(x*) = min_{x∈X} f(x) (note that the optimization problem under consideration will always be clear from the context). In particular we always assume that x* exists. For a vector x ∈ ℝⁿ we denote by x(i) its iᵗʰ coordinate. The dual of a norm ‖·‖ (defined later) will be denoted either ‖·‖* or ‖·‖_* (depending on whether the norm already comes with a subscript). Other notation is standard (e.g., Iₙ for the n×n identity matrix, ⪯ for the positive semi-definite order on matrices, etc.).
The following table summarizes the algorithms covered in this text.

| f                             | Algorithm         | Rate                | # Iterations      | Cost/iteration                                       |
|-------------------------------|-------------------|---------------------|-------------------|------------------------------------------------------|
| non-smooth                    | Center of Gravity | exp(−t/n)           | n log(1/ε)        | one gradient, one n-dim integral                     |
| non-smooth                    | Ellipsoid Method  | (R/r) exp(−t/n²)    | n² log(R/(rε))    | one gradient, separation oracle, matrix-vector mult. |
| non-smooth, Lipschitz         | PGD               | RL/√t               | R²L²/ε²           | one gradient, one projection                         |
| smooth                        | PGD               | βR²/t               | βR²/ε             | one gradient, one projection                         |
| smooth                        | Nesterov's AGD    | βR²/t²              | √(β/ε)            | one gradient                                         |
| smooth (arbitrary norm)       | FW                | βR²/t               | βR²/ε             | one gradient, one linear opt.                        |
| strongly convex, Lipschitz    | PGD               | L²/(αt)             | L²/(αε)           | one gradient, one projection                         |
| strongly convex, smooth       | PGD               | R² exp(−t/Q)        | Q log(R²/ε)       | one gradient, one projection                         |
| strongly convex, smooth       | Nesterov's AGD    | R² exp(−t/√Q)       | √Q log(R²/ε)      | one gradient                                         |
| f + g, f smooth, g simple     | FISTA             | βR²/t²              | √(β/ε)            | one gradient of f, prox. step with g                 |
| max_y φ(x,y), φ smooth        | SP-MP             | βR²/t               | βR²/ε             | MD step on X, MD step on Y                           |
| c^⊤x, X with F ν-self-conc.   | IPM               | ν O(1) exp(−t/√ν)   | √ν log(ν/ε)       | Newton direction for F on X                          |
2 Convex optimization in finite dimension

2.1

Starting from S₁ = X, the method iterates the following two steps:

(1) Compute the center of gravity of the current set:

  c_t = (1/vol(S_t)) ∫_{x ∈ S_t} x dx.    (2.1)

(2) Query the first order oracle at c_t and obtain w_t ∈ ∂f(c_t). Let

  S_{t+1} = S_t ∩ {x ∈ ℝⁿ : (x − c_t)^⊤ w_t ≤ 0}.
If stopped after t queries to the first order oracle then we use t queries
to a zeroth order oracle to output
  x_t ∈ argmin_{1 ≤ r ≤ t} f(c_r).
This procedure is known as the center of gravity method; it was discovered independently on both sides of the Wall by Levin [1965] and
Newman [1965].
Theorem 2.1. The center of gravity method satisfies

  f(x_t) − min_{x ∈ X} f(x) ≤ 2B (1 − 1/e)^{t/n}.
Before proving this result a few comments are in order.
To attain an ε-optimal point the center of gravity method requires O(n log(2B/ε)) queries to both the first and zeroth order oracles. It can be shown that this is the best one can hope for, in the sense that for ε small enough one needs Ω(n log(1/ε)) calls to the oracle in order to find an ε-optimal point, see Nemirovski and Yudin [1983] for a formal proof.
The rate of convergence given by Theorem 2.1 is exponentially fast. In the optimization literature this is called a linear rate for the following reason: the number of iterations required to attain an ε-optimal point is proportional to log(1/ε), which means that to double the number of digits in accuracy one needs to double the number of iterations, hence the linear nature of the convergence rate.

The last and most important comment concerns the computational complexity of the method. It turns out that finding the center of gravity c_t is a very difficult problem by itself, and we do not have a computationally efficient procedure to carry out this computation in general. In Section 6.7 we will discuss a relatively recent (compared to the 50-year-old center of gravity method!) breakthrough that gives a randomized algorithm to approximately compute the center of gravity. This will in turn give a randomized center of gravity method which we will describe in detail.
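The two steps of the method can be sketched in code. Since computing the exact centroid is intractable in general, this toy version (our own construction, not the text's, and a crude stand-in for the randomized methods of Section 6.7) approximates c_t by rejection sampling over a bounding box, with X taken to be a Euclidean ball.

```python
import numpy as np

rng = np.random.default_rng(0)

def center_of_gravity_method(grad, f, n, radius, t, n_samples=20000):
    """Toy center of gravity method on X = Euclidean ball of given radius.
    The centroid of S_t is approximated by Monte Carlo sampling."""
    cuts = []                      # halfspace cuts (c, w): keep x with (x-c)·w <= 0
    best_x, best_val = None, np.inf
    for _ in range(t):
        # Approximate c_t: sample the bounding box, keep points in S_t.
        pts = rng.uniform(-radius, radius, size=(n_samples, n))
        mask = np.linalg.norm(pts, axis=1) <= radius
        for c, w in cuts:
            mask &= (pts - c) @ w <= 0
        if not mask.any():         # too few survivors to estimate the centroid
            break
        c_t = pts[mask].mean(axis=0)
        if f(c_t) < best_val:
            best_x, best_val = c_t, f(c_t)
        w_t = grad(c_t)            # first order oracle at c_t
        cuts.append((c_t, w_t))    # S_{t+1} = S_t ∩ {(x - c_t)·w_t <= 0}
    return best_x, best_val

# Minimize f(x) = |x - a|^2 over the unit ball (minimizer a lies inside).
a = np.array([0.3, -0.2])
f = lambda x: np.sum((x - a) ** 2)
grad = lambda x: 2 * (x - a)
x, val = center_of_gravity_method(grad, f, n=2, radius=1.0, t=10)
```

The per-step volume reduction of Theorem 2.1 shows up here as the surviving sample fraction shrinking by roughly a (1 − 1/e) factor per cut.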
We now turn to the proof of Theorem 2.1, which relies on an elementary result from convex geometry about the volume of the two pieces obtained by cutting a convex set with a hyperplane through its center of gravity.
2.2 The ellipsoid method

Recall that an ellipsoid is a convex set of the form

  E = {x ∈ ℝⁿ : (x − c)^⊤ H^{−1} (x − c) ≤ 1},

where c ∈ ℝⁿ, and H is a symmetric positive definite matrix. Geometrically c is the center of the ellipsoid, and the semi-axes of E are given by the eigenvectors of H, with lengths given by the square root of the corresponding eigenvalues.
We give now a simple geometric lemma, which is at the heart of the
ellipsoid method.
Lemma 2.2. Let E₀ = {x ∈ ℝⁿ : (x − c₀)^⊤ H₀^{−1} (x − c₀) ≤ 1}. For any w ∈ ℝⁿ, w ≠ 0, there exists an ellipsoid E such that

  E ⊇ {x ∈ E₀ : w^⊤ (x − c₀) ≤ 0},    (2.3)

and

  vol(E) ≤ exp(−1/(2n)) vol(E₀).    (2.4)
Furthermore the ellipsoid E of Lemma 2.2 can be taken with center and shape matrix

  c = c₀ − (1/(n+1)) · H₀w / √(w^⊤ H₀ w),    (2.5)

  H = (n²/(n²−1)) (H₀ − (2/(n+1)) · H₀ w w^⊤ H₀ / (w^⊤ H₀ w)).    (2.6)

(The proof of Lemma 2.2 reduces to the case of the Euclidean ball B by an affine change of variables and then bounds vol(E)/vol(B) by a direct computation.)

If stopped after t iterations, the ellipsoid method outputs

  argmin_{c ∈ {c₁, …, c_t} ∩ X} f(c).
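The updates (2.5)–(2.6) translate directly into code. The following is a minimal sketch under our own naming and on a toy smooth objective; it is not the monograph's pseudocode.

```python
import numpy as np

def ellipsoid_method(grad, f, c0, H0, t):
    """Ellipsoid method sketch: cut the current ellipsoid with the
    subgradient halfspace at its center, then replace it by the smaller
    enclosing ellipsoid given by the updates (2.5)-(2.6)."""
    n = len(c0)
    c, H = c0.astype(float), H0.astype(float)
    best_x, best_val = c.copy(), f(c)
    for _ in range(t):
        w = grad(c)                       # subgradient at the center
        Hw = H @ w
        wHw = w @ Hw
        c = c - Hw / ((n + 1) * np.sqrt(wHw))                               # (2.5)
        H = (n**2 / (n**2 - 1)) * (H - (2 / (n + 1)) * np.outer(Hw, Hw) / wHw)  # (2.6)
        if f(c) < best_val:
            best_x, best_val = c.copy(), f(c)
    return best_x, best_val

# Toy run: minimize f(x) = |x - a|^2 starting from the unit ball (n >= 2).
a = np.array([0.5, -0.25])
x, val = ellipsoid_method(lambda x: 2 * (x - a), lambda x: np.sum((x - a)**2),
                          c0=np.zeros(2), H0=np.eye(2), t=100)
```

Each step costs one gradient and a few matrix-vector products, matching the per-iteration cost listed in the summary table.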
3 Dimension-free convex optimization
We investigate here variants of the gradient descent scheme. This iterative algorithm, which can be traced back to Cauchy [1847], is the
simplest strategy to minimize a differentiable function f on ℝⁿ. Starting at some initial point x₁ ∈ ℝⁿ, for a step size η > 0 it iterates the following equation:

  x_{t+1} = x_t − η ∇f(x_t).    (3.1)
[Fig. Illustration of the projection Π_X(y) of y onto X, with the distances ‖y − Π_X(y)‖, ‖y − x‖, and ‖Π_X(y) − x‖ for a point x ∈ X.]
3.1 Projected Subgradient Descent

[Fig. Illustration of Projected Subgradient Descent: a gradient step (3.2) followed by a projection (3.3).]

The method iterates, for g_t ∈ ∂f(x_t) and a step size η > 0,

  y_{t+1} = x_t − η g_t,    (3.2)
  x_{t+1} = Π_X(y_{t+1}).    (3.3)
Theorem 3.1. Let X be a closed convex set, f a convex and L-Lipschitz function on X, and assume that ‖x₁ − x*‖ ≤ R. Then Projected Subgradient Descent with η = R/(L√t) satisfies

  f((1/t) Σ_{s=1}^t x_s) − f(x*) ≤ RL/√t.

Proof. Using the definition of subgradients, the definition of the method, and the elementary identity 2a^⊤b = ‖a‖² + ‖b‖² − ‖a − b‖², one obtains for any s:

  f(x_s) − f(x*) ≤ g_s^⊤(x_s − x*)
    = (1/η) (x_s − y_{s+1})^⊤ (x_s − x*)
    = (1/(2η)) (‖x_s − x*‖² + ‖x_s − y_{s+1}‖² − ‖y_{s+1} − x*‖²)
    = (1/(2η)) (‖x_s − x*‖² − ‖y_{s+1} − x*‖²) + (η/2) ‖g_s‖².

Now note that ‖g_s‖ ≤ L, and furthermore by Lemma 3.1

  ‖y_{s+1} − x*‖ ≥ ‖x_{s+1} − x*‖.

Summing the resulting inequality over s, and using that ‖x₁ − x*‖ ≤ R, yields

  Σ_{s=1}^t (f(x_s) − f(x*)) ≤ R²/(2η) + ηL²t/2.

Plugging in the value of η gives the claimed bound (by convexity, f of the average is at most the average of the f(x_s)).
The computational bottleneck of Projected Subgradient Descent is often the projection step, which is a convex optimization problem by itself. In some cases this problem admits an analytical solution (think of X being a Euclidean ball), or an easy and fast combinatorial algorithm to solve it (this is the case for X being an ℓ₁-ball, see Duchi et al. [2008]). We will see in Section 3.3 a projection-free algorithm which operates under an extra assumption of smoothness on the function to be optimized.
Finally we observe that the step size recommended by Theorem 3.1 depends on the number of iterations to be performed. In practice this may be an undesirable feature. However, using a time-varying step size of the form η_s = R/(L√s), one can prove the same rate up to a log t factor.
In any case these step sizes are very small, which is the reason for
the slow convergence. In the next section we will see that by assuming
smoothness in the function f one can afford to be much more aggressive.
Indeed in this case, as one approaches the optimum the size of the
gradients themselves will go to 0, resulting in a sort of auto-tuning of
the step sizes which does not happen for an arbitrary convex function.
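A minimal sketch of (3.2)–(3.3) with the fixed step size of Theorem 3.1 follows. The toy problem (an ℓ₁ objective over the Euclidean unit ball) and the constants L and R passed in are our own choices, not the text's.

```python
import numpy as np

def projected_subgradient_descent(subgrad, project, x1, L, R, t):
    """Projected Subgradient Descent, eqs. (3.2)-(3.3), with the
    fixed step size eta = R / (L * sqrt(t)) from Theorem 3.1."""
    eta = R / (L * np.sqrt(t))
    x = np.asarray(x1, dtype=float)
    avg = np.zeros_like(x)
    for _ in range(t):
        y = x - eta * subgrad(x)   # gradient step (3.2)
        x = project(y)             # projection (3.3)
        avg += x / t
    return avg                     # Theorem 3.1 bounds f(avg) - f(x*)

# Toy problem: f(x) = |x - a|_1 (Lipschitz, non-smooth) over the unit ball.
a = np.array([0.2, -0.3])
subgrad = lambda x: np.sign(x - a)
project = lambda y: y / max(1.0, np.linalg.norm(y))   # onto {|x|_2 <= 1}
x_avg = projected_subgradient_descent(subgrad, project, x1=np.array([1.0, 0.0]),
                                      L=np.sqrt(2), R=2.0, t=4000)
```

With t = 4000 the theorem guarantees a suboptimality of RL/√t ≈ 0.045 for the averaged iterate.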
3.2 Gradient descent for smooth functions

Theorem 3.2. Let f be convex and β-smooth on ℝⁿ. Then gradient descent with η = 1/β satisfies

  f(x_t) − f(x*) ≤ 2β‖x₁ − x*‖² / (t − 1).

To prove this we first record an elementary observation: for a β-smooth f and any x, y ∈ ℝⁿ,

  |f(x) − f(y) − ∇f(y)^⊤(x − y)| ≤ ∫₀¹ βt‖x − y‖² dt = (β/2) ‖x − y‖².

In particular, if f is also convex, then for any x, y ∈ ℝⁿ,

  0 ≤ f(x) − f(y) − ∇f(y)^⊤(x − y) ≤ (β/2) ‖x − y‖².    (3.4)
The next lemma, which improves the basic inequality for subgradients under the smoothness assumption, shows that in fact f is convex and β-smooth if and only if (3.4) holds true. In the literature (3.4) is often used as a definition of β-smooth convex functions.
Lemma 3.3. Let f be such that (3.4) holds true. Then for any x, y ∈ ℝⁿ, one has

  f(x) − f(y) ≤ ∇f(x)^⊤(x − y) − (1/(2β)) ‖∇f(x) − ∇f(y)‖².

Proof. Let z = y − (1/β)(∇f(y) − ∇f(x)). Then, using convexity for the pair (x, z) and (3.4) for the pair (z, y), one has

  f(x) − f(y) = f(x) − f(z) + f(z) − f(y)
    ≤ ∇f(x)^⊤(x − z) + ∇f(y)^⊤(z − y) + (β/2) ‖z − y‖²
    = ∇f(x)^⊤(x − y) + (∇f(x) − ∇f(y))^⊤(y − z) + (1/(2β)) ‖∇f(x) − ∇f(y)‖²
    = ∇f(x)^⊤(x − y) − (1/(2β)) ‖∇f(x) − ∇f(y)‖².

We can now prove Theorem 3.2. By (3.4) and the definition of the method one has

  f(x_{s+1}) − f(x_s) ≤ −(1/(2β)) ‖∇f(x_s)‖².    (3.5)

In particular, denoting δ_s = f(x_s) − f(x*), this shows

  δ_{s+1} ≤ δ_s − (1/(2β)) ‖∇f(x_s)‖².

One also has by convexity

  δ_s ≤ ∇f(x_s)^⊤(x_s − x*) ≤ ‖∇f(x_s)‖ · ‖x_s − x*‖.

We will prove below that ‖x_s − x*‖ is decreasing with s, which with the two above displays implies

  δ_{s+1} ≤ δ_s − δ_s² / (2β‖x₁ − x*‖²).

Let ω = 1/(2β‖x₁ − x*‖²). Dividing the last inequality by δ_s δ_{s+1} gives

  1/δ_{s+1} − 1/δ_s ≥ ω δ_s/δ_{s+1} ≥ ω,

and thus, summing over s,¹

  1/δ_t ≥ ω(t − 1),

which proves the rate of Theorem 3.2. To show that ‖x_s − x*‖ is decreasing with s, note first that summing Lemma 3.3 applied at (x, y) and at (y, x) yields the co-coercivity of the gradient:

  (∇f(x) − ∇f(y))^⊤(x − y) ≥ (1/β) ‖∇f(x) − ∇f(y)‖².    (3.6)

Using (3.6) with y = x* (so that ∇f(x_s)^⊤(x_s − x*) ≥ (1/β)‖∇f(x_s)‖²) one obtains

  ‖x_{s+1} − x*‖² = ‖x_s − (1/β) ∇f(x_s) − x*‖²
    = ‖x_s − x*‖² − (2/β) ∇f(x_s)^⊤(x_s − x*) + (1/β²) ‖∇f(x_s)‖²
    ≤ ‖x_s − x*‖² − (1/β²) ‖∇f(x_s)‖²
    ≤ ‖x_s − x*‖²,

which concludes the proof.
The constrained case
We now come back to the constrained problem
  min. f(x)
  s.t. x ∈ X.

Similarly to what we did in Section 3.1 we consider the projected gradient descent algorithm, which iterates x_{t+1} = Π_X(x_t − (1/β)∇f(x_t)).
The key point in the analysis of gradient descent for unconstrained smooth optimization is that a step of gradient descent started at x will decrease the function value by at least (1/(2β))‖∇f(x)‖², see (3.5). In the constrained case we cannot expect that this would still hold true, as a step may be cut short by the projection. The next lemma defines the right quantity to measure progress in the constrained case.
¹ The last step in the sequence of implications can be improved by taking 1/δ₁ into account. Indeed one can easily show with (3.4) that δ₁ ≤ (β/2)‖x₁ − x*‖², that is 1/δ₁ ≥ 4ω. This improves the rate of Theorem 3.2 from 2β‖x₁ − x*‖²/(t − 1) to 2β‖x₁ − x*‖²/(t + 3).
Lemma 3.4. Let x, y ∈ X, x⁺ = Π_X(x − (1/β)∇f(x)), and g_X(x) = β(x − x⁺). Then the following holds true:

  f(x⁺) − f(y) ≤ g_X(x)^⊤(x − y) − (1/(2β)) ‖g_X(x)‖².

Proof. We first observe that

  ∇f(x)^⊤(x⁺ − y) ≤ g_X(x)^⊤(x⁺ − y),    (3.7)
which follows from Lemma 3.1. Now we use (3.7) as follows to prove
the lemma (we also use (3.4) which still holds true in the constrained
case)
  f(x⁺) − f(y) = f(x⁺) − f(x) + f(x) − f(y)
    ≤ ∇f(x)^⊤(x⁺ − x) + (β/2) ‖x⁺ − x‖² + ∇f(x)^⊤(x − y)
    = ∇f(x)^⊤(x⁺ − y) + (1/(2β)) ‖g_X(x)‖²
    ≤ g_X(x)^⊤(x⁺ − y) + (1/(2β)) ‖g_X(x)‖²
    = g_X(x)^⊤(x − y) − (1/(2β)) ‖g_X(x)‖².

With Lemma 3.4 at hand one can show that projected gradient descent on a convex and β-smooth function (with η = 1/β) satisfies

  f(x_t) − f(x*) ≤ (3β‖x₁ − x*‖² + f(x₁) − f(x*)) / t.

Proof. Lemma 3.4 immediately gives

  f(x_{s+1}) − f(x_s) ≤ −(1/(2β)) ‖g_X(x_s)‖²,

and

  f(x_{s+1}) − f(x*) ≤ ‖g_X(x_s)‖ · ‖x_s − x*‖.

We will prove that ‖x_s − x*‖ is decreasing with s, which with the two above displays will imply (denoting δ_s = f(x_s) − f(x*))

  δ_{s+1} ≤ δ_s − δ_{s+1}² / (2β‖x₁ − x*‖²),

and an easy induction then shows that

  δ_s ≤ (3β‖x₁ − x*‖² + f(x₁) − f(x*)) / s.

Finally, to prove monotonicity of ‖x_s − x*‖, one uses Lemma 3.4 with y = x* (which shows g_X(x_s)^⊤(x_s − x*) ≥ (1/(2β))‖g_X(x_s)‖²) as follows:

  ‖x_{s+1} − x*‖² = ‖x_s − (1/β) g_X(x_s) − x*‖²
    = ‖x_s − x*‖² − (2/β) g_X(x_s)^⊤(x_s − x*) + (1/β²) ‖g_X(x_s)‖²
    ≤ ‖x_s − x*‖²,

which concludes the proof.
3.3 Conditional Gradient Descent, aka Frank-Wolfe

Conditional Gradient Descent performs the following updates for t ≥ 1, where (γ_s)_{s≥1} is a fixed sequence in [0, 1]:

  y_t ∈ argmin_{y ∈ X} ∇f(x_t)^⊤ y,    (3.8)
  x_{t+1} = (1 − γ_t) x_t + γ_t y_t.    (3.9)
In words, Conditional Gradient Descent makes a step in the steepest descent direction given the constraint set X; see Figure 3.3 for an illustration.
Fig. 3.3 Illustration of the Conditional Gradient Descent method.
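The updates (3.8)–(3.9) need only a linear optimization oracle over X, never a projection. The following sketch is ours; the simplex linear oracle and the quadratic test problem are illustrative choices, not part of the text.

```python
import numpy as np

def frank_wolfe(grad, linear_min, x1, t):
    """Conditional Gradient Descent (3.8)-(3.9) with gamma_s = 2/(s+1):
    each step solves a linear problem over X instead of projecting."""
    x = np.asarray(x1, dtype=float)
    for s in range(1, t + 1):
        y = linear_min(grad(x))        # y_s in argmin_{y in X} <grad f(x_s), y>
        gamma = 2.0 / (s + 1)
        x = (1 - gamma) * x + gamma * y
    return x

# Toy: minimize f(x) = |x - a|^2 over the simplex. The linear oracle over
# the simplex just picks the vertex with the smallest gradient coordinate.
def simplex_linear_min(g):
    e = np.zeros_like(g)
    e[np.argmin(g)] = 1.0
    return e

a = np.array([0.6, 0.3, 0.1])          # a lies in the simplex
x = frank_wolfe(lambda x: 2 * (x - a), simplex_linear_min,
                x1=np.array([1.0, 0.0, 0.0]), t=500)
```

Note that after t steps the iterate is a convex combination of at most t vertices, which is exactly the sparsity property discussed below.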
Assume that f is convex and β-smooth w.r.t. some norm ‖·‖, that R = sup_{x,y ∈ X} ‖x − y‖, and that γ_s = 2/(s+1) for s ≥ 1. Then for any t ≥ 2, Conditional Gradient Descent satisfies

  f(x_t) − f(x*) ≤ 2βR² / (t + 1).

Proof. The following inequalities hold true, using respectively β-smoothness (it can easily be seen that (3.4) holds true for β-smoothness in an arbitrary norm), the definition of x_{s+1}, the definition of y_s, and convexity:

  f(x_{s+1}) − f(x_s) ≤ ∇f(x_s)^⊤(x_{s+1} − x_s) + (β/2) ‖x_{s+1} − x_s‖²
    ≤ γ_s ∇f(x_s)^⊤(y_s − x_s) + (β/2) γ_s² R²
    ≤ γ_s (f(x*) − f(x_s)) + (β/2) γ_s² R².

Rewriting this in terms of δ_s = f(x_s) − f(x*), one obtains

  δ_{s+1} ≤ (1 − γ_s) δ_s + (β/2) γ_s² R².

A simple induction using that γ_s = 2/(s+1) finishes the proof (note that the initialization is done at step 2, with the above inequality yielding δ₂ ≤ (β/2)R²).
In addition to being projection-free and norm-free, the Conditional Gradient Descent satisfies a perhaps even more important property: it produces sparse iterates. More precisely consider the situation
where X ⊆ ℝⁿ is a polytope, that is the convex hull of a finite set of points (these points are called the vertices of X). Then Carathéodory's theorem states that any point x ∈ X can be written as a convex combination of at most n + 1 vertices of X. On the other hand, by definition
of the Conditional Gradient Descent, one knows that the tth iterate xt
can be written as a convex combination of t vertices (assuming that x1
is a vertex). Thanks to the dimension-free rate of convergence one is
usually interested in the regime where t n, and thus we see that the
iterates of Conditional Gradient Descent are very sparse in their vertex
representation.
We note an interesting corollary of the sparsity property together with the rate of convergence we proved: smooth functions on the simplex {x ∈ ℝⁿ₊ : Σ_{i=1}^n x(i) = 1} always admit sparse approximate minimizers. More precisely there must exist a point x with only t non-zero coordinates and such that f(x) − f(x*) = O(1/t). Clearly this is the best one can hope for in general, as can be seen with the function f(x) = ‖x‖²₂: by Cauchy-Schwarz one has ‖x‖₁ ≤ √(‖x‖₀) ‖x‖₂, which implies on the simplex ‖x‖²₂ ≥ 1/‖x‖₀.
Next we describe an application where the three properties of Conditional Gradient Descent (projection-free, norm-free, and sparse iterates) are critical to develop a computationally efficient procedure.
An application of Conditional Gradient Descent: Least-squares regression with structured sparsity
This example is inspired by an open problem of Lugosi [2010] (what
is described below solves the open problem). Consider the problem of
approximating a signal Y Rn by a small combination of dictionary
elements d1 , . . . , dN Rn . One way to do this is to consider a LASSO
type problem in dimension N of the following form (with λ ∈ ℝ fixed):

  min_{x ∈ ℝ^N} ‖Y − Σ_{i=1}^N x(i) dᵢ‖²₂ + λ ‖x‖₁.
Let D ∈ ℝ^{n×N} be the dictionary matrix with iᵗʰ column given by dᵢ. Instead of considering the penalized version of the problem one could look at the following constrained problem (with s ∈ ℝ fixed) on which we will now focus:

  min_{x ∈ ℝ^N} ‖Y − Dx‖²₂ subject to ‖x‖₁ ≤ s,

which is equivalent (by rescaling x by 1/s) to

  min_{x ∈ ℝ^N} ‖Y/s − Dx‖²₂ subject to ‖x‖₁ ≤ 1.    (3.10)
A short calculation shows that this function is smooth w.r.t. ‖·‖₁. Indeed, assuming ‖dᵢ‖₂ ≤ m for all i, one has for any x, y ∈ ℝ^N,

  ‖∇f(x) − ∇f(y)‖_∞ = 2 max_{1≤i≤N} |dᵢ^⊤ Σ_{j=1}^N d_j (x(j) − y(j))| ≤ 2m² ‖x − y‖₁.

Applying the rate of Conditional Gradient Descent in this setting yields

  f(x_t) − f(x*) ≤ 8m² / (t + 1).    (3.12)

Putting together (3.11) and (3.12) we proved that one can get an ε-optimal solution to (3.10) with a computational effort of O(m² p(n)/ε + m⁴/ε²) using Conditional Gradient Descent.
3.4 Strong convexity

We say that f : X → ℝ is α-strongly convex if for all x, y ∈ X and any g ∈ ∂f(x),

  f(x) − f(y) ≤ g^⊤(x − y) − (α/2) ‖x − y‖².    (3.13)
Combining (3.13) with the computation of Section 3.1, Projected Subgradient Descent with the time-varying step size η_s = 2/(α(s+1)) satisfies

  f(x_s) − f(x*) ≤ (η_s/2) L² + (1/(2η_s) − α/2) ‖x_s − x*‖² − (1/(2η_s)) ‖x_{s+1} − x*‖²
    = L²/(α(s+1)) + (α/4) ((s−1)‖x_s − x*‖² − (s+1)‖x_{s+1} − x*‖²).

Multiplying this inequality by s yields

  s(f(x_s) − f(x*)) ≤ L²/α + (α/4) (s(s−1)‖x_s − x*‖² − s(s+1)‖x_{s+1} − x*‖²),

and summing over s the second term telescopes, which gives the rate f(Σ_{s=1}^t (2s/(t(t+1))) x_s) − f(x*) ≤ 2L²/(α(t+1)).
As we will see now, having both strong convexity and smoothness allows for a drastic improvement in the convergence rate. We denote Q = β/α for the condition number of f. The key observation is that Lemma 3.4 can be improved to (with the notation of the lemma):

  f(x⁺) − f(y) ≤ g_X(x)^⊤(x − y) − (1/(2β)) ‖g_X(x)‖² − (α/2) ‖x − y‖².    (3.14)

With this at hand, projected gradient descent with η = 1/β satisfies, using (3.14) with x = x_t and y = x*,

  ‖x_{t+1} − x*‖² = ‖x_t − x*‖² − (2/β) g_X(x_t)^⊤(x_t − x*) + (1/β²) ‖g_X(x_t)‖²
    ≤ (1 − α/β) ‖x_t − x*‖²
    ≤ (1 − α/β)^t ‖x₁ − x*‖²
    ≤ exp(−t/Q) ‖x₁ − x*‖²,

which concludes the proof.
We now show that in the unconstrained case one can improve the
rate by a constant factor, precisely one can replace Q by (Q + 1)/4 in
the oracle complexity bound by using a larger step size. This is not a
spectacular gain but the reasoning is based on an improvement of (3.6)
which can be of interest by itself. Note that (3.6) and the lemma to
follow are sometimes referred to as coercivity of the gradient.
Lemma 3.5. Let f be β-smooth and α-strongly convex on ℝⁿ. Then for all x, y ∈ ℝⁿ, one has

  (∇f(x) − ∇f(y))^⊤(x − y) ≥ (αβ/(α+β)) ‖x − y‖² + (1/(α+β)) ‖∇f(x) − ∇f(y)‖².

Proof. Let φ(x) = f(x) − (α/2)‖x‖². By definition of α-strong convexity one has that φ is convex. Furthermore one can show that φ is (β−α)-smooth by proving (3.4) (and using that it implies smoothness). Thus using (3.6) one gets

  (∇φ(x) − ∇φ(y))^⊤(x − y) ≥ (1/(β−α)) ‖∇φ(x) − ∇φ(y)‖²,

which gives the claimed result with straightforward computations.

With this lemma one can show that gradient descent with the larger step size η = 2/(α+β) satisfies

  f(x_{t+1}) − f(x*) ≤ (β/2) exp(−4t/(Q+1)) ‖x₁ − x*‖².
Proof. First note that by β-smoothness (since ∇f(x*) = 0) one has

  f(x_t) − f(x*) ≤ (β/2) ‖x_t − x*‖².

Now using Lemma 3.5 one obtains

  ‖x_{t+1} − x*‖² = ‖x_t − η∇f(x_t) − x*‖²
    = ‖x_t − x*‖² − 2η ∇f(x_t)^⊤(x_t − x*) + η² ‖∇f(x_t)‖²
    ≤ (1 − 2ηαβ/(α+β)) ‖x_t − x*‖² + η (η − 2/(α+β)) ‖∇f(x_t)‖²
    = ((Q−1)/(Q+1))² ‖x_t − x*‖²,

and thus by induction

  ‖x_{t+1} − x*‖² ≤ ((Q−1)/(Q+1))^{2t} ‖x₁ − x*‖² ≤ exp(−4t/(Q+1)) ‖x₁ − x*‖²,

which together with the first display concludes the proof.
3.5 Lower bounds

We consider black-box procedures such that

  x_{s+1} ∈ x₁ + span(g₁, …, g_s), where g_r ∈ ∂f(x_r).    (3.15)

Theorem 3.8. Let t ≤ n and L, R > 0. There exists a convex and L-Lipschitz function f such that for any black-box procedure satisfying (3.15),

  min_{1≤s≤t} f(x_s) − min_{x ∈ B₂(R)} f(x) ≥ RL / (2(1 + √t)).

There also exists an α-strongly convex and L-Lipschitz function f such that for any such procedure,

  min_{1≤s≤t} f(x_s) − min_{x ∈ B₂(L/(2α))} f(x) ≥ L² / (8αt).
For the proof one considers the function

  f(x) = γ max_{1≤i≤t} x(i) + (α/2) ‖x‖²,

for which one can choose subgradients so that the black-box procedure explores only one new coordinate per query. Let y be defined by y(i) = −γ/(αt) for 1 ≤ i ≤ t and y(i) = 0 for t + 1 ≤ i ≤ n. It is clear that 0 ∈ ∂f(y) and thus the minimal value of f is

  f(y) = −γ²/(αt) + (α/2) · γ²/(α²t) = −γ²/(2αt).

On the other hand, for s ≤ t the iterate x_s lies in the span of e₁, …, e_{s−1}, which implies f(x_s) ≥ 0 and thus

  f(x_s) − f(y) ≥ γ²/(2αt).

Taking γ = L/2 and R = L/(2α) we proved the lower bound for α-strongly convex functions (note in particular that ‖y‖²₂ = γ²/(α²t) = L²/(4α²t) ≤ R² with these parameters). On the other hand, taking α = L/(R(1+√t)) and γ = L√t/(1+√t) concludes the proof for convex functions (note in particular that ‖y‖²₂ = γ²/(α²t) = R² with these parameters).
We proceed now to the smooth case. We recall that for a twice differentiable function f, β-smoothness is equivalent to the largest eigenvalue of the Hessian of f being smaller than β at any point, which we write

  ∇²f(x) ⪯ β Iₙ, ∀x.

Furthermore α-strong convexity is equivalent to

  ∇²f(x) ⪰ α Iₙ, ∀x.
Theorem 3.9. Let t ≤ (n−1)/2 and β > 0. There exists a β-smooth convex function f such that for any black-box procedure satisfying (3.15),

  min_{1≤s≤t} f(x_s) − f(x*) ≥ (3β/32) · ‖x₁ − x*‖² / (t + 1)².

Proof. For k ≤ n, let A_k ∈ ℝ^{n×n} be the symmetric tridiagonal matrix defined by

  (A_k)_{i,j} = 2 if i = j and i ≤ k;
  (A_k)_{i,j} = −1 if j ∈ {i−1, i+1}, i ≤ k, j ≠ k+1;
  (A_k)_{i,j} = 0 otherwise.

It is easy to verify that 0 ⪯ A_k ⪯ 4Iₙ since

  x^⊤ A_k x = 2 Σ_{i=1}^k x(i)² − 2 Σ_{i=1}^{k−1} x(i)x(i+1)
            = x(1)² + x(k)² + Σ_{i=1}^{k−1} (x(i) − x(i+1))².

We consider now the following β-smooth convex function:

  f(x) = (β/8) x^⊤ A_{2t+1} x − (β/4) x^⊤ e₁.
Similarly to what happened in the proof of Theorem 3.8, one can see here too that x_s must lie in the linear span of e₁, …, e_{s−1} (because of our assumption on the black-box procedure). In particular for s ≤ t we necessarily have x_s(i) = 0 for i = s, …, n, which implies x_s^⊤ A_{2t+1} x_s = x_s^⊤ A_s x_s. In other words, if we denote

  f_k(x) = (β/8) x^⊤ A_k x − (β/4) x^⊤ e₁,

then one has f(x_s) = f_s(x_s) ≥ min_x f_s(x) =: f_k* (with k = s). Setting the gradient of f_k to zero shows that f_k is minimized at x*_k with

  x*_k(i) = 1 − i/(k+1) for i ≤ k, and x*_k(i) = 0 for i > k,

which gives f_k* = (β/8)(1/(k+1) − 1) and

  ‖x*_k‖² = Σ_{i=1}^k (1 − i/(k+1))² = Σ_{i=1}^k (i/(k+1))² ≤ (k+1)/3.

Putting everything together (f* being the minimum of f = f_{2t+1}):

  min_{1≤s≤t} f(x_s) − f* ≥ f_t* − f_{2t+1}* = (β/8)(1/(t+1) − 1/(2t+2)) ≥ (3β/32) · ‖x*_{2t+1}‖² / (t+1)²,

which concludes the proof since one may take x₁ = 0, so that ‖x₁ − x*‖² = ‖x*_{2t+1}‖².
Theorem 3.10. Let Q > 1. There exists a β-smooth and α-strongly convex function f : ℓ₂ → ℝ with Q = β/α such that for any t ≥ 1 and any black-box procedure satisfying (3.15),

  f(x_t) − f(x*) ≥ (α/2) ((√Q − 1)/(√Q + 1))^{2(t−1)} ‖x₁ − x*‖².

Note that for large values of the condition number Q one has

  ((√Q − 1)/(√Q + 1))^{2(t−1)} ≈ exp(−4(t−1)/√Q).
Proof. The overall argument is similar to the proof of Theorem 3.9. Let A : ℓ₂ → ℓ₂ be the linear operator that corresponds to the infinite tridiagonal matrix with 2 on the diagonal and −1 on the upper and lower diagonals, and consider the function

  f(x) = (α(Q−1)/8) (⟨Ax, x⟩ − 2⟨e₁, x⟩) + (α/2) ‖x‖².

We already proved that 0 ⪯ A ⪯ 4I, which easily implies that f is α-strongly convex and β-smooth. Now as always the key observation is that for this function, thanks to our assumption on the black-box procedure, one necessarily has x_t(i) = 0, ∀i ≥ t. This implies in particular:

  ‖x_t − x*‖² ≥ Σ_{i=t}^∞ x*(i)².

Furthermore, by α-strong convexity,

  f(x_t) − f(x*) ≥ (α/2) ‖x_t − x*‖².

Thus it only remains to compute x*. This can be done by differentiating f and setting the gradient to 0, which gives the following infinite set of equations:

  1 − 2((Q+1)/(Q−1)) x*(1) + x*(2) = 0,
  x*(k−1) − 2((Q+1)/(Q−1)) x*(k) + x*(k+1) = 0, ∀k ≥ 2.

It is easy to verify that x* defined by x*(i) = ((√Q−1)/(√Q+1))^i satisfies this infinite set of equations, and the conclusion of the theorem then follows by straightforward computations.
3.6 Nesterov's Accelerated Gradient Descent
So far our results leave a gap in the case of smooth optimization: gradient descent achieves an oracle complexity of O(1/ε) (respectively O(Q log(1/ε)) in the strongly convex case), while we proved a lower bound of Ω(1/√ε) (respectively Ω(√Q log(1/ε))). In this section we close these gaps with Nesterov's Accelerated Gradient Descent.
3.6.1

Nesterov's Accelerated Gradient Descent starts at an initial point y₁ = x₁ and iterates the following equations for s ≥ 1:

  y_{s+1} = x_s − (1/β) ∇f(x_s),
  x_{s+1} = (1 + (√Q − 1)/(√Q + 1)) y_{s+1} − ((√Q − 1)/(√Q + 1)) y_s.
Theorem 3.11. Let f be α-strongly convex and β-smooth. Then Nesterov's Accelerated Gradient Descent satisfies

  f(y_t) − f(x*) ≤ ((α + β)/2) ‖x₁ − x*‖² exp(−(t−1)/√Q).
Fig. 3.4 Illustration of Nesterov's Accelerated Gradient Descent.
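The strongly convex variant above is short enough to sketch directly. This is our own toy instantiation (an ill-conditioned diagonal quadratic); the parameters alpha and beta are assumed known, as stated in Section 1.6.

```python
import numpy as np

def nesterov_agd_strongly_convex(grad, x1, alpha, beta, t):
    """Nesterov's Accelerated Gradient Descent for an alpha-strongly
    convex, beta-smooth f: a gradient step to y, then a momentum step."""
    Q = beta / alpha                       # condition number
    momentum = (np.sqrt(Q) - 1) / (np.sqrt(Q) + 1)
    x = np.asarray(x1, dtype=float)
    y_prev = x.copy()
    for _ in range(t):
        y = x - grad(x) / beta             # y_{s+1} = x_s - (1/beta) grad f(x_s)
        x = (1 + momentum) * y - momentum * y_prev
        y_prev = y
    return y_prev

# Ill-conditioned quadratic f(x) = 1/2 x^T diag(d) x with d in [alpha, beta].
d = np.array([1.0, 100.0])                 # alpha = 1, beta = 100, Q = 100
y = nesterov_agd_strongly_convex(lambda x: d * x, x1=np.array([1.0, 1.0]),
                                 alpha=1.0, beta=100.0, t=200)
```

With Q = 100 and t = 200, Theorem 3.11 guarantees a function gap of order exp(−(t−1)/10), i.e. far below what plain gradient descent's exp(−t/Q) factor would give in the same budget.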
We define α-strongly convex quadratic functions Φ_s, s ≥ 1, by induction as follows:

  Φ₁(x) = f(x₁) + (α/2) ‖x − x₁‖²,
  Φ_{s+1}(x) = (1 − 1/√Q) Φ_s(x)
             + (1/√Q) (f(x_s) + ∇f(x_s)^⊤(x − x_s) + (α/2) ‖x − x_s‖²).    (3.16)

Intuitively Φ_s becomes a finer and finer approximation (from below) to f in the following sense:

  Φ_{s+1}(x) ≤ f(x) + (1 − 1/√Q)^s (Φ₁(x) − f(x)).    (3.17)

The above inequality can be proved immediately by induction, using the fact that by α-strong convexity one has

  f(x_s) + ∇f(x_s)^⊤(x − x_s) + (α/2) ‖x − x_s‖² ≤ f(x).

The crucial claim, which we prove below, is that

  f(y_s) ≤ min_{x ∈ ℝⁿ} Φ_s(x).    (3.18)

This claim yields the theorem: using (3.18), then (3.17), then β-smoothness (which bounds Φ₁(x*) − f(x*) = f(x₁) − f(x*) + (α/2)‖x₁ − x*‖² by ((α+β)/2)‖x₁ − x*‖²), one gets

  f(y_t) − f(x*) ≤ Φ_t(x*) − f(x*)
    ≤ (1 − 1/√Q)^{t−1} (Φ₁(x*) − f(x*))
    ≤ ((α + β)/2) ‖x₁ − x*‖² (1 − 1/√Q)^{t−1}
    ≤ ((α + β)/2) ‖x₁ − x*‖² exp(−(t−1)/√Q).
We now prove (3.18) by induction (note that it is true at s = 1 since x₁ = y₁). Let Φ_s* = min_{x ∈ ℝⁿ} Φ_s(x). Using the definition of y_{s+1} (and β-smoothness), convexity, and the induction hypothesis, one gets

  f(y_{s+1}) ≤ f(x_s) − (1/(2β)) ‖∇f(x_s)‖²
    = (1 − 1/√Q) f(y_s) + (1 − 1/√Q)(f(x_s) − f(y_s)) + (1/√Q) f(x_s) − (1/(2β)) ‖∇f(x_s)‖²
    ≤ (1 − 1/√Q) Φ_s* + (1 − 1/√Q) ∇f(x_s)^⊤(x_s − y_s) + (1/√Q) f(x_s) − (1/(2β)) ‖∇f(x_s)‖².

Thus we now have to show that

  Φ_{s+1}* ≥ (1 − 1/√Q) Φ_s* + (1 − 1/√Q) ∇f(x_s)^⊤(x_s − y_s) + (1/√Q) f(x_s) − (1/(2β)) ‖∇f(x_s)‖².    (3.19)

To prove this inequality we have to understand better the functions Φ_s. First note that ∇²Φ_s(x) = αIₙ (immediate by induction) and thus Φ_s has to be of the following form:

  Φ_s(x) = Φ_s* + (α/2) ‖x − v_s‖²,

for some v_s ∈ ℝⁿ. Now observe that by differentiating (3.16) one obtains

  ∇Φ_{s+1}(x) = (1 − 1/√Q) α (x − v_s) + (1/√Q) ∇f(x_s) + (α/√Q)(x − x_s).

In particular Φ_{s+1} is by definition minimized at v_{s+1}, which can now be defined by induction using the above identity; precisely:

  v_{s+1} = (1 − 1/√Q) v_s + (1/√Q) x_s − (1/(α√Q)) ∇f(x_s).    (3.20)

Using the form of Φ_s and Φ_{s+1}, as well as the original definition (3.16), one gets the following identity by evaluating Φ_{s+1} at x_s:

  Φ_{s+1}* + (α/2) ‖x_s − v_{s+1}‖²
    = (1 − 1/√Q) Φ_s* + (α/2)(1 − 1/√Q) ‖x_s − v_s‖² + (1/√Q) f(x_s).    (3.21)

Note that thanks to (3.20) one has

  ‖x_s − v_{s+1}‖² = (1 − 1/√Q)² ‖x_s − v_s‖² + (1/(α²Q)) ‖∇f(x_s)‖²
                   − (2/(α√Q))(1 − 1/√Q) ∇f(x_s)^⊤(v_s − x_s),

which combined with (3.21) yields

  Φ_{s+1}* = (1 − 1/√Q) Φ_s* + (1/√Q) f(x_s) + (α/(2√Q))(1 − 1/√Q) ‖x_s − v_s‖²
           − (1/(2β)) ‖∇f(x_s)‖² + (1/√Q)(1 − 1/√Q) ∇f(x_s)^⊤(v_s − x_s).

Since the term (α/(2√Q))(1 − 1/√Q)‖x_s − v_s‖² is nonnegative, (3.19) follows once we know that v_s − x_s = √Q(x_s − y_s). Finally we show the latter by induction, which thus concludes the proof of (3.19) and also the proof of the theorem:

  v_{s+1} − x_{s+1} = (1 − 1/√Q) v_s + (1/√Q) x_s − (1/(α√Q)) ∇f(x_s) − x_{s+1}
    = √Q x_s − (√Q − 1) y_s − (√Q/β) ∇f(x_s) − x_{s+1}
    = √Q y_{s+1} − (√Q − 1) y_s − x_{s+1}
    = √Q (x_{s+1} − y_{s+1}),

where the last equality follows from the definition of x_{s+1}.
Nesterov's Accelerated Gradient Descent can also be analyzed without strong convexity. In this case the method takes the following form: define λ₀ = 0, λ_s = (1 + √(1 + 4λ_{s−1}²))/2, and γ_s = (1 − λ_s)/λ_{s+1}; starting at y₁ = x₁, iterate

  y_{s+1} = x_s − (1/β) ∇f(x_s),
  x_{s+1} = (1 − γ_s) y_{s+1} + γ_s y_s.
Theorem 3.12. Let f be a convex and β-smooth function. Then Nesterov's Accelerated Gradient Descent satisfies

  f(y_t) − f(x*) ≤ 2β‖x₁ − x*‖² / t².
Proof. By β-smoothness, the definition of y_{s+1}, and convexity, one has for any u ∈ ℝⁿ,

  f(y_{s+1}) − f(u) ≤ ∇f(x_s)^⊤(x_s − u) − (1/(2β)) ‖∇f(x_s)‖².

Applying this with u = y_s and u = x*, multiplying the first resulting inequality by λ_s − 1, adding the second, multiplying by λ_s, and using that λ_s² − λ_s = λ_{s−1}² together with the identity 2a^⊤b − ‖a‖² = ‖b‖² − ‖b − a‖², one obtains (with δ_s = f(y_s) − f(x*))

  λ_s² δ_{s+1} − λ_{s−1}² δ_s
    ≤ (β/2) (‖λ_s x_s − (λ_s − 1)y_s − x*‖² − ‖λ_s y_{s+1} − (λ_s − 1)y_s − x*‖²).    (3.24)

Next remark that, by definition, one has

  x_{s+1} = y_{s+1} + γ_s (y_s − y_{s+1})
  ⟺ λ_{s+1} x_{s+1} = λ_{s+1} y_{s+1} + (1 − λ_s)(y_s − y_{s+1})
  ⟺ λ_{s+1} x_{s+1} − (λ_{s+1} − 1) y_{s+1} = λ_s y_{s+1} − (λ_s − 1) y_s.    (3.25)

Putting together (3.24) and (3.25), and denoting u_s = λ_s x_s − (λ_s − 1)y_s − x*, one obtains

  λ_s² δ_{s+1} − λ_{s−1}² δ_s ≤ (β/2) (‖u_s‖² − ‖u_{s+1}‖²).

Summing these inequalities from s = 1 to s = t − 1 one obtains (using λ₀ = 0 and u₁ = x₁ − x*):

  δ_t ≤ (β / (2λ_{t−1}²)) ‖u₁‖².

By induction it is easy to see that λ_{t−1} ≥ t/2, which concludes the proof.
4 Almost dimension-free convex optimization in non-Euclidean spaces
In the previous chapter we showed that dimension-free oracle complexity is possible when the objective function f and the constraint
set X are well-behaved in the Euclidean norm; e.g. if for all points
x X and all subgradients g f (x), one has that kxk2 and kgk2
are independent of the ambient dimension n. If this assumption is not
met then the gradient descent techniques of Chapter 3 may lose their
dimension-free convergence rates. For instance consider a differentiable
convex function f defined on the Euclidean ball B2,n and such that
the situation for a moment and forget that we are doing optimization
in finite dimension. We already observed that Projected Gradient
Descent works in an arbitrary Hilbert space H. Suppose now that we
are interested in the more general situation of optimization in some
Banach space B. In other words the norm that we use to measure the various quantities of interest does not derive from an inner product (think of B = ℓ₁ for example). In that case the Gradient Descent strategy does not even make sense: indeed the gradients (more formally the Fréchet derivatives) ∇f(x) are elements of the dual space B* and thus one cannot perform the computation x − η∇f(x) (it simply does not make sense). We did not have this problem for optimization in a Hilbert space H since by the Riesz representation theorem H is isometric to its dual H*. The great insight of Nemirovski and Yudin is that one can still do a gradient descent by first mapping the point x ∈ B into the dual space B*, then performing the gradient update in the dual space, and finally mapping back the resulting point to the primal space B.
Of course the new point in the primal space might lie outside of the
constraint set X B and thus we need a way to project back the
point on the constraint set X . Both the primal/dual mapping and the
projection are based on the concept of a mirror map which is the key
element of the scheme. Mirror maps are defined in Section 4.1, and
the above scheme is formally described in Section 4.2.
In the rest of this chapter we fix an arbitrary norm ‖·‖ on ℝⁿ, and a compact convex set X ⊆ ℝⁿ. The dual norm ‖·‖* is defined as ‖g‖* = sup_{x ∈ ℝⁿ : ‖x‖ ≤ 1} g^⊤x. We say that a convex function f : X → ℝ is (i) L-Lipschitz w.r.t. ‖·‖ if ∀x ∈ X, g ∈ ∂f(x), ‖g‖* ≤ L, (ii) β-smooth w.r.t. ‖·‖ if ‖∇f(x) − ∇f(y)‖* ≤ β‖x − y‖, ∀x, y ∈ X, and (iii) α-strongly convex w.r.t. ‖·‖ if

  f(x) − f(y) ≤ g^⊤(x − y) − (α/2) ‖x − y‖², ∀x, y ∈ X, g ∈ ∂f(x).    (4.1)
4.1 Mirror maps

The projection associated to a mirror map Φ (with Bregman divergence D_Φ) is defined by

  Π^Φ_X(y) = argmin_{x ∈ X ∩ D̄} D_Φ(x, y).

Properties (i) and (iii) of a mirror map ensure the existence and uniqueness of this projection (in particular since x ↦ D_Φ(x, y) is locally increasing on the boundary of D). The following lemma shows that the Bregman divergence essentially behaves as the Euclidean norm squared in terms of projections (recall Lemma 3.1).

Lemma 4.1. Let x ∈ X ∩ D̄ and y ∈ D. Then

  (∇Φ(Π^Φ_X(y)) − ∇Φ(y))^⊤ (Π^Φ_X(y) − x) ≤ 0,

which also implies

  D_Φ(x, Π^Φ_X(y)) + D_Φ(Π^Φ_X(y), y) ≤ D_Φ(x, y).
¹ Assumption (ii) can be relaxed in some cases, see for example Audibert et al. [2014].
[Fig. Illustration of Mirror Descent: x_t is mapped to the dual space via ∇Φ, a gradient step (4.2) is performed there, the result is mapped back via (∇Φ)⁻¹ to y_{t+1} ∈ D, and finally projected onto X (4.3).]

4.2 Mirror Descent

Mirror Descent iterates, for g_t ∈ ∂f(x_t),

  ∇Φ(y_{t+1}) = ∇Φ(x_t) − η g_t,    (4.2)
  x_{t+1} = Π^Φ_X(y_{t+1}).    (4.3)
Theorem 4.1. Let Φ be a mirror map ρ-strongly convex on X ∩ D w.r.t. ‖·‖, take x₁ ∈ argmin_{x ∈ X∩D} Φ(x), and let R² = sup_{x ∈ X∩D} Φ(x) − Φ(x₁). Let f be convex and L-Lipschitz w.r.t. ‖·‖. Then Mirror Descent with η = (R/L)√(2ρ/t) satisfies

  f((1/t) Σ_{s=1}^t x_s) − f(x*) ≤ RL √(2/(ρt)).

Proof. Let x ∈ X ∩ D. By convexity, the definition of the method, and the three-point identity for Bregman divergences, one has

  f(x_s) − f(x) ≤ g_s^⊤(x_s − x)
    = (1/η) (∇Φ(x_s) − ∇Φ(y_{s+1}))^⊤ (x_s − x)
    = (1/η) (D_Φ(x, x_s) + D_Φ(x_s, y_{s+1}) − D_Φ(x, y_{s+1}))
    ≤ (1/η) (D_Φ(x, x_s) + D_Φ(x_s, y_{s+1}) − D_Φ(x, x_{s+1}) − D_Φ(x_{s+1}, y_{s+1})),

where the last inequality uses Lemma 4.1. The term D_Φ(x, x_s) − D_Φ(x, x_{s+1}) will lead to a telescopic sum when summing over s = 1 to s = t, and it remains to bound the other term as follows, using ρ-strong convexity of the mirror map and az − bz² ≤ a²/(4b), ∀z ∈ ℝ:

  D_Φ(x_s, y_{s+1}) − D_Φ(x_{s+1}, y_{s+1})
    = Φ(x_s) − Φ(x_{s+1}) − ∇Φ(y_{s+1})^⊤(x_s − x_{s+1})
    = η g_s^⊤(x_s − x_{s+1}) − D_Φ(x_{s+1}, x_s)
    ≤ ηL ‖x_s − x_{s+1}‖ − (ρ/2) ‖x_s − x_{s+1}‖²
    ≤ η²L²/(2ρ).

We proved

  Σ_{s=1}^t (f(x_s) − f(x)) ≤ D_Φ(x, x₁)/η + ηL²t/(2ρ),

which concludes the proof up to trivial computations.
Mirror Descent can equivalently be written as

  x_{t+1} ∈ argmin_{x ∈ X∩D} D_Φ(x, y_{t+1}),    (4.4)

or, in proximal form,

  x_{t+1} ∈ argmin_{x ∈ X∩D} η g_t^⊤ x + D_Φ(x, x_t).    (4.5)
4.3 Standard setups for Mirror Descent

The simplex setup corresponds to the negative entropy mirror map Φ(x) = Σ_{i=1}^n x(i) log x(i), for which Mirror Descent on the simplex attains a rate of order √(log n / t). The analogous spectrahedron setup uses the negative von Neumann entropy Φ(X) = Σ_{i=1}^n λ_i(X) log λ_i(X), where λ₁(X), …, λ_n(X) are the eigenvalues of X.
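In the simplex setup, the abstract scheme (4.2)–(4.3) collapses to multiplicative updates (often called exponentiated gradient). The sketch below is our own illustration on a linear objective; the step size is a hand-picked constant rather than the tuned value of Theorem 4.1.

```python
import numpy as np

def mirror_descent_simplex(subgrad, x1, eta, t):
    """Mirror Descent with the negative entropy mirror map on the simplex.
    The update (4.2)-(4.3) becomes a multiplicative step followed by an
    L1 renormalization."""
    x = np.asarray(x1, dtype=float)
    avg = np.zeros_like(x)
    for _ in range(t):
        y = x * np.exp(-eta * subgrad(x))   # dual-space gradient step
        x = y / y.sum()                     # Bregman projection onto simplex
        avg += x / t
    return avg

# Toy: minimize the linear function f(x) = c^T x over the simplex;
# the minimum is attained at the vertex of the smallest c(i).
c = np.array([0.3, 0.1, 0.5])
x_avg = mirror_descent_simplex(lambda x: c, x1=np.ones(3) / 3, eta=0.1, t=2000)
```

Note the dimension dependence: the rate degrades only as √(log n), which is what makes this setup attractive for very high-dimensional simplex-constrained problems.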
4.4 Lazy Mirror Descent, aka Nesterov's Dual Averaging

Dual Averaging replaces the incremental update (4.2) by a minimization against the accumulated gradients: for t ≥ 1,

  x_t ∈ argmin_{x ∈ X∩D} η Σ_{s=1}^{t−1} g_s^⊤ x + Φ(x).    (4.6)

Theorem 4.2. Under the assumptions of Theorem 4.1, Dual Averaging with η = (R/L)√(ρ/(2t)) satisfies

  f((1/t) Σ_{s=1}^t x_s) − f(x*) ≤ 2RL √(2/(ρt)).
Proof. We define ψ_t(x) = η Σ_{s=1}^t g_s^⊤ x + Φ(x), so that x_t ∈ argmin_{x ∈ X∩D} ψ_{t−1}(x). Since Φ is ρ-strongly convex, so is ψ_t, and one clearly has

  ψ_t(x_t) − ψ_t(x_{t+1}) ≥ (ρ/2) ‖x_{t+1} − x_t‖²,

where the inequality comes from the first order optimality condition for x_{t+1} (see Proposition 1.3). Next observe that

  ψ_t(x_{t+1}) − ψ_t(x_t) = ψ_{t−1}(x_{t+1}) − ψ_{t−1}(x_t) + η g_t^⊤(x_{t+1} − x_t)
    ≥ η g_t^⊤(x_{t+1} − x_t).

Putting together the two above displays and using Cauchy-Schwarz (with the assumption ‖g_t‖* ≤ L) one obtains

  (ρ/2) ‖x_{t+1} − x_t‖² ≤ η g_t^⊤(x_t − x_{t+1}) ≤ ηL ‖x_t − x_{t+1}‖.

In particular ‖x_t − x_{t+1}‖ ≤ 2ηL/ρ, and thus

  g_t^⊤(x_t − x_{t+1}) ≤ 2ηL²/ρ.    (4.7)

Now we claim that for any x ∈ X ∩ D,

  Σ_{s=1}^t g_s^⊤(x_{s+1} − x) ≤ (Φ(x) − Φ(x₁))/η,    (4.8)

which would clearly conclude the proof thanks to (4.7) and straightforward computations. Equation (4.8) is equivalent to

  Σ_{s=1}^t g_s^⊤ x_{s+1} + Φ(x₁)/η ≤ Σ_{s=1}^t g_s^⊤ x + Φ(x)/η,

and we prove the latter by induction on t (it holds at t = 0 since x₁ minimizes Φ on X ∩ D). Assuming it at t − 1 and applying it with x = x_{t+1}, one gets

  Σ_{s=1}^t g_s^⊤ x_{s+1} + Φ(x₁)/η ≤ g_t^⊤ x_{t+1} + Σ_{s=1}^{t−1} g_s^⊤ x_{t+1} + Φ(x_{t+1})/η
    ≤ Σ_{s=1}^t g_s^⊤ x + Φ(x)/η,

where the last inequality follows from the definition (4.6) of x_{t+1}.
4.5 Mirror Prox

Mirror Prox performs the following updates:²

  ∇Φ(y′_{t+1}) = ∇Φ(x_t) − η ∇f(x_t),
  y_{t+1} ∈ argmin_{x ∈ X∩D} D_Φ(x, y′_{t+1}),
  ∇Φ(x′_{t+1}) = ∇Φ(x_t) − η ∇f(y_{t+1}),
  x_{t+1} ∈ argmin_{x ∈ X∩D} D_Φ(x, x′_{t+1}).

In words: first make a Mirror Descent step from x_t using the gradient at x_t to obtain y_{t+1}, and then make a second step from x_t, this time using the gradient evaluated at y_{t+1}.

² Basically Mirror Prox allows for a smooth vector field point of view (see Section 4.6), while Mirror Descent does not.

[Fig. Illustration of Mirror Prox: two dual-space gradient steps from ∇Φ(x_t), using ∇f(x_t) and ∇f(y_{t+1}) respectively, each followed by a projection back onto X.]
4.6 The vector field point of view on MD, DA, and MP

In this section we consider a mirror map Φ that satisfies the assumptions from Theorem 4.1.

By inspecting the proof of Theorem 4.1 one can see that for arbitrary vectors g₁, …, g_t ∈ ℝⁿ the Mirror Descent strategy described by (4.2) or (4.3) (or alternatively by (4.5)) satisfies, for any x ∈ X ∩ D,

  Σ_{s=1}^t g_s^⊤(x_s − x) ≤ R²/η + (η/(2ρ)) Σ_{s=1}^t ‖g_s‖²_*.    (4.10)

The observation that the sequence of vectors (g_s) does not have to come from the subgradients of a fixed function f is the starting point for the theory of Online Learning, see Bubeck [2011] for more details. In this text we use it to generalize the above schemes to saddle point calculations. Similarly, the Dual Averaging strategy (4.6) satisfies

  Σ_{s=1}^t g_s^⊤(x_s − x) ≤ R²/η + (2η/ρ) Σ_{s=1}^t ‖g_s‖²_*,

and Mirror Prox, applied to a β-Lipschitz vector field g and run with η = ρ/β, satisfies for the evaluation points g(y_{t+1}):

  Σ_{s=1}^t g(y_{s+1})^⊤(y_{s+1} − x) ≤ R²/η.    (4.11)
5 Beyond the black-box model
5.1 Sum of a smooth and a simple function

Consider the minimization of a function of the form f + g, where f is β-smooth and g is convex but "simple", in the sense that the proximal step below can be computed efficiently:¹

  min_{x ∈ ℝⁿ} f(x) + g(x).

ISTA (Iterative Shrinkage-Thresholding Algorithm) iterates the proximal gradient step

  x_{s+1} = argmin_{x ∈ ℝⁿ} g(x) + (β/2) ‖x − (x_s − (1/β)∇f(x_s))‖²₂,

and attains a rate of order β‖x₁ − x*‖²₂/(2t). Its accelerated version FISTA combines the proximal step with Nesterov's momentum

  x_{s+1} = (1 − γ_s) y_{s+1} + γ_s y_s,

and satisfies

  f(y_t) + g(y_t) − (f(x*) + g(x*)) ≤ 2β‖x₁ − x*‖²₂ / t².

¹ We restrict to unconstrained minimization for sake of simplicity. One can extend the discussion to constrained minimization by using ideas from Section 3.2.
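For g = λ‖·‖₁ the proximal step has the closed form known as soft-thresholding, so ISTA is a few lines. The sketch below, including the random toy data, is ours and not the monograph's.

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal step for g(x) = tau * |x|_1."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def ista(A, y, lam, t):
    """ISTA for min_x 1/2 |Ax - y|^2 + lam |x|_1: a gradient step on the
    smooth part followed by the prox of the simple non-smooth part."""
    beta = np.linalg.norm(A, 2) ** 2          # smoothness of 1/2 |Ax - y|^2
    x = np.zeros(A.shape[1])
    for _ in range(t):
        grad = A.T @ (A @ x - y)
        x = soft_threshold(x - grad / beta, lam / beta)
    return x

rng = np.random.default_rng(1)
A = rng.standard_normal((40, 10))
x_true = np.zeros(10)
x_true[:2] = [3.0, -2.0]                      # sparse ground truth
y = A @ x_true
x_hat = ista(A, y, lam=0.1, t=500)
```

FISTA would wrap the same prox step with the momentum recursion of Section 3.6, improving the 1/t rate to 1/t² at no extra per-iteration cost.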
5.2
Quite often the non-smoothness of a function f comes from a max operation. More precisely non-smooth functions can often be represented
as
  f(x) = max_{1≤i≤m} f_i(x),    (5.1)
where the functions f_i are smooth. This was the case for instance with the function we used to prove the black-box lower bound 1/√t for non-smooth optimization in Theorem 3.8. We will see now that by using this structural representation one can in fact attain a rate of 1/t. This was first observed in Nesterov [2004b], who proposed what is now known as Nesterov's smoothing technique. Here we will present the alternative method of Nemirovski [2004a], which we find more transparent. Most of what is described in this section can be found in Juditsky and Nemirovski [2011a,b].
In the next subsection we introduce the more general problem of
saddle point computation. We then proceed to apply a modified version
of Mirror Descent to this problem, which will be useful both in Chapter
6 and also as a warm-up for the more powerful modified Mirror Prox
that we introduce next.
5.2.1 Saddle point computation

Let X ⊆ ℝⁿ and Y ⊆ ℝᵐ be compact convex sets, and let φ : X × Y → ℝ be continuous, convex in its first argument and concave in its second. We are interested in finding a point (x̃, ỹ) with small duality gap max_{y ∈ Y} φ(x̃, y) − min_{x ∈ X} φ(x, ỹ).

The key observation is that the duality gap can be controlled similarly to the suboptimality gap f(x) − f(x*) in a simple convex optimization problem. Indeed, denoting by g_X(x, y) a subgradient of φ(·, y) at x and by g_Y(x, y) a subgradient of −φ(x, ·) at y, one has for any (x, y) ∈ X × Y,

  φ(x̃, ỹ) − φ(x, ỹ) ≤ g_X(x̃, ỹ)^⊤(x̃ − x),

and

  φ(x̃, y) − φ(x̃, ỹ) ≤ g_Y(x̃, ỹ)^⊤(ỹ − y).

In particular, using the notation z = (x, y) ∈ Z := X × Y and g(z) = (g_X(x, y), g_Y(x, y)), we just proved

  max_{y ∈ Y} φ(x̃, y) − min_{x ∈ X} φ(x, ỹ) ≤ g(z̃)^⊤(z̃ − z),    (5.2)

for a well-chosen z ∈ Z (namely the pair attaining the max and the min above).
Theorem 5.1. Assume that φ(·, y) is L_X-Lipschitz w.r.t. ‖·‖_X and that φ(x, ·) is L_Y-Lipschitz w.r.t. ‖·‖_Y. Then SP-MD (Saddle-Point Mirror Descent on Z) with well-chosen weights satisfies

  max_{y ∈ Y} φ((1/t) Σ_{s=1}^t x_s, y) − min_{x ∈ X} φ(x, (1/t) Σ_{s=1}^t y_s) ≤ (R_X L_X + R_Y L_Y) √(2/t).

Proof. One applies Mirror Descent on Z = X × Y with the mirror map Φ(z) = a Φ_X(x) + b Φ_Y(y) for well-chosen a, b > 0. The resulting norm on Z satisfies

  ‖z‖²_Z = a ‖x‖²_X + b ‖y‖²_Y,

and thus the vector field (g_t) used in SP-MD satisfies

  ‖g_t‖_{Z*} ≤ √(L²_X/a + L²_Y/b).

Using (4.10) together with (5.2) and the values of a, b concludes the proof.
5.2.3 Saddle-Point Mirror Prox (SP-MP)

Assume now that φ is (β₁₁, β₁₂, β₂₂, β₂₁)-smooth, meaning that the partial gradients of φ are Lipschitz in each argument with the corresponding constants. Then SP-MP, that is Mirror Prox on Z = X × Y with a = 1/R²_X, b = 1/R²_Y, and η = 1/(2 max(β₁₁R²_X, β₂₂R²_Y, β₁₂R_XR_Y, β₂₁R_XR_Y)), satisfies

  max_{y ∈ Y} φ((1/t) Σ_{s=1}^t u_{s+1}, y) − min_{x ∈ X} φ(x, (1/t) Σ_{s=1}^t v_{s+1})
    ≤ (4/t) max(β₁₁R²_X, β₂₂R²_Y, β₁₂R_XR_Y, β₂₁R_XR_Y).

Proof. In light of the proof of Theorem 5.1 and (4.11), it clearly suffices to show that the vector field g(z) = (∇_x φ(x, y), −∇_y φ(x, y)) is β-Lipschitz w.r.t. ‖z‖_Z = √((1/R²_X)‖x‖²_X + (1/R²_Y)‖y‖²_Y) with β = 2 max(β₁₁R²_X, β₂₂R²_Y, β₁₂R_XR_Y, β₂₁R_XR_Y). In other words one needs to show that

  ‖g(z) − g(z′)‖_{Z*} ≤ β ‖z − z′‖_Z,

which can be done with straightforward calculations (by introducing g(x′, y) and using the definition of smoothness for φ).
Applications

5.2.4.1 Minimizing a maximum of smooth functions

Consider the problem $\min_{x\in\mathcal X}\max_{1\le i\le m} f_i(x)$, which admits the saddle point representation
\[
\min_{x\in\mathcal X}\max_{y\in\Delta_m}\phi(x,y), \qquad \phi(x,y) = \sum_{i=1}^m y_i f_i(x),
\]
for which
\[
\nabla_x \phi(x,y) = \sum_{i=1}^m y_i \nabla f_i(x), \qquad \nabla_y \phi(x,y) = (f_1(x), \ldots, f_m(x)).
\]
5.2.4.2 Matrix games
Let $A \in \mathbb{R}^{n\times m}$. We denote by $\|A\|_{\max}$ the maximal entry (in absolute value) of $A$, and by $A_i \in \mathbb{R}^n$ the $i$-th column of $A$. We consider the problem of computing a Nash equilibrium for the zero-sum game corresponding to the loss matrix $A$, that is we want to solve
\[
\min_{x\in\Delta_n}\max_{y\in\Delta_m} x^\top A y.
\]
Here we equip both $\Delta_n$ and $\Delta_m$ with $\|\cdot\|_1$. Let $\phi(x, y) = x^\top A y$. Using that $\nabla_x \phi(x,y) = Ay$ and $\nabla_y \phi(x,y) = A^\top x$ one immediately obtains $\beta_{11} = \beta_{22} = 0$. Furthermore, since
\[
\|A(y - y')\|_\infty = \Big\|\sum_{i=1}^m (y_i - y_i')\,A_i\Big\|_\infty \le \|A\|_{\max}\,\|y - y'\|_1,
\]
one also has $\beta_{12} = \beta_{21} = \|A\|_{\max}$. Thus SP-MP with the negentropy on both $\Delta_n$ and $\Delta_m$ attains an $\epsilon$-optimal pair of mixed strategies with $O\big(\|A\|_{\max}\sqrt{\log(n)\log(m)}/\epsilon\big)$ iterations. Furthermore the computational complexity of a step of SP-MP is dominated by the matrix-vector multiplications, which are $O(nm)$. Thus overall the complexity of getting an $\epsilon$-optimal Nash equilibrium with SP-MP is $O\big(\|A\|_{\max}\sqrt{\log(n)\log(m)}\, nm/\epsilon\big)$.
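To make the SP-MP recipe for matrix games concrete, here is a small numpy sketch with the negentropy mirror map on both simplices, so that the mirror steps become multiplicative updates. The step size is a heuristic of order $1/\|A\|_{\max}$ rather than the exact constant from the analysis, and the function names are our own.

```python
import numpy as np

def sp_mirror_prox(A, iters=5000, eta=None):
    """Saddle Point Mirror Prox for min_{x in Delta_n} max_{y in Delta_m} x^T A y,
    using the negentropy mirror map on both simplices (multiplicative updates).
    The step size is a heuristic of order 1/||A||_max, not the tuned constant."""
    n, m = A.shape
    if eta is None:
        eta = 1.0 / (2.0 * np.abs(A).max())
    x, y = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
    x_sum, y_sum = np.zeros(n), np.zeros(m)
    for _ in range(iters):
        # extrapolation step: gradients taken at the current point (x, y)
        u = x * np.exp(-eta * (A @ y)); u /= u.sum()
        v = y * np.exp(eta * (A.T @ x)); v /= v.sum()
        # main step: gradients taken at the extrapolated point (u, v)
        x = x * np.exp(-eta * (A @ v)); x /= x.sum()
        y = y * np.exp(eta * (A.T @ u)); y /= y.sum()
        x_sum += u; y_sum += v
    return x_sum / iters, y_sum / iters  # averaged extrapolated iterates

def duality_gap(A, x, y):
    """max_y x^T A y - min_x x^T A y over the two simplices."""
    return (A.T @ x).max() - (A @ y).min()
```

On a small game the duality gap of the averaged iterates decays at the $O(1/t)$ rate discussed above.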
5.2.4.3 Linear classification

Here one is interested in a problem of the form
\[
\max_{x\in B_{2,n}}\min_{1\le i\le m} A_i^\top x = \max_{x\in B_{2,n}}\min_{y\in\Delta_m} x^\top A y. \tag{5.3}
\]
Assuming that $\|A_i\|_2 \le B$, and using the calculations we did in Section 5.2.4.1, it is clear that $\phi(x, y) = x^\top A y$ is $(0, B, 0, B)$-smooth with respect to $\|\cdot\|_2$ on $B_{2,n}$ and $\|\cdot\|_1$ on $\Delta_m$.
5.3 Interior point methods

Given a barrier $F$ for $\mathcal X$ and $t \ge 0$, let $F_t(x) = t\, c^\top x + F(x)$ and $x^*(t) = \mathrm{argmin}_x F_t(x)$; the curve $t \mapsto x^*(t)$ is called the central path.
The idea of the barrier method is to move along the central path by boosting a fast locally convergent algorithm, which we denote for the moment by $\mathcal{A}$, using the following scheme: assume that one has computed $x^*(t)$; then one uses $\mathcal{A}$ initialized at $x^*(t)$ to compute $x^*(t')$ for some $t' > t$. There is a clear tension in the choice of $t'$: on the one hand $t'$ should be large in order to make as much progress as possible on the central path, but on the other hand $x^*(t)$ needs to be close enough to $x^*(t')$ to be in the basin of fast convergence for $\mathcal{A}$ when run on $F_{t'}$.
IPM follows the above methodology with $\mathcal{A}$ being Newton's method. Indeed, as we will see in the next subsection, Newton's method has a quadratic convergence rate, in the sense that if initialized close enough to the optimum it attains an $\epsilon$-optimal point in $O(\log\log(1/\epsilon))$ iterations! Thus we now have a clear plan to make these ideas formal and analyze the iteration complexity of IPM:
(1) First we need to describe precisely the region of fast convergence for Newton's method. This will lead us to define self-concordant functions, which are natural functions for Newton's method.
(2) Then we need to evaluate precisely how much larger $t'$ can be compared to $t$, so that $x^*(t)$ is still in the region of fast convergence of Newton's method when optimizing the function $F_{t'}$ with $t' > t$. This will lead us to define $\nu$-self-concordant barriers.
Now note that $\nabla f(x^*) = 0$, and thus with the above formula one obtains
\[
\nabla f(x_k) = \int_0^1 \nabla^2 f\big(x^* + s(x_k - x^*)\big)\,(x_k - x^*)\, ds,
\]
which yields
\[
x_{k+1} - x^* = x_k - x^* - [\nabla^2 f(x_k)]^{-1}\int_0^1 \nabla^2 f\big(x^* + s(x_k - x^*)\big)\,(x_k - x^*)\, ds
= [\nabla^2 f(x_k)]^{-1}\int_0^1 \big[\nabla^2 f(x_k) - \nabla^2 f\big(x^* + s(x_k - x^*)\big)\big]\,(x_k - x^*)\, ds.
\]
Self-concordant functions

A key quantity in what follows is the Newton decrement of $f$ at $x$,
\[
\lambda_f(x) := \sqrt{\nabla f(x)^\top [\nabla^2 f(x)]^{-1}\,\nabla f(x)}. \tag{5.4}
\]
$\nu$-self-concordant barriers

We deal here with Step (2) of the plan described in Section 5.3.1. Given Theorem 5.4 we want $t'$ to be as large as possible and such that
\[
\lambda_{F_{t'}}(x^*(t)) \le 1/4. \tag{5.5}
\]
Since the Hessian of $F_{t'}$ equals the Hessian of $F$, and since $\nabla F(x^*(t)) = -tc$, one has
\[
\lambda_{F_{t'}}(x^*(t)) = \|t'c + \nabla F(x^*(t))\|_{x^*(t)}^* = (t' - t)\,\|c\|_{x^*(t)}^*. \tag{5.6}
\]
Thus taking
\[
t' = t + \frac{1}{4\,\|c\|_{x^*(t)}^*} \tag{5.7}
\]
yields (5.5). The issue is now to control $\|c\|_{x^*(t)}^*$, and this is what the notion of a $\nu$-self-concordant barrier provides: $F$ is a $\nu$-self-concordant barrier for $\mathcal X$ if it is self-concordant on $\mathrm{int}(\mathcal X)$ and
\[
\nabla^2 F(x) \succeq \frac{1}{\nu}\,\nabla F(x)\,[\nabla F(x)]^\top. \tag{5.8}
\]
Indeed (5.8) implies
\[
\|\nabla F(x)\|_x^* = \sup_{h:\ h^\top \nabla^2 F(x)\, h \le 1} \nabla F(x)^\top h \le \sup_{h:\ h^\top \big(\frac{1}{\nu}\nabla F(x)[\nabla F(x)]^\top\big) h \le 1} \nabla F(x)^\top h = \sqrt{\nu},
\]
and thus, since $\nabla F(x^*(t)) = -tc$, one has $\|c\|_{x^*(t)}^* \le \sqrt{\nu}/t$.
Thus a safe choice to increase the penalization parameter is $t' = \big(1 + \frac{1}{4\sqrt{\nu}}\big)t$. Note that the condition (5.8) can also be written as the fact that $\|\nabla F(x)\|_x^* \le \sqrt{\nu}$ for all $x \in \mathrm{int}(\mathcal X)$.
One also needs to relate the suboptimality gap to the penalization parameter. Using $tc = \nabla F_t(y) - \nabla F(y)$, for $y$ with $\lambda_{F_t}(y) < 1$ one has
\[
c^\top y - \min_{x\in\mathcal X} c^\top x \le \frac{\nu}{t} + \frac1t\,\big(\nabla F_t(y) - \nabla F(y)\big)^\top (y - x^*(t)) \le \frac{\nu}{t} + \frac1t\,\frac{\lambda_{F_t}(y)\,\big(\lambda_{F_t}(y) + \sqrt{\nu}\big)}{1 - \lambda_{F_t}(y)}. \tag{5.10}
\]
Finally, for $t' > t$,
\[
\lambda_{F_{t'}}(x) \le \frac{t'}{t}\,\lambda_{F_t}(x) + \Big(\frac{t'}{t} - 1\Big)\sqrt{\nu}. \tag{5.11}
\]

5.3.5 Path-following scheme
We can now formally describe and analyze the most basic IPM, called the path-following scheme. Let $F$ be a $\nu$-self-concordant barrier for $\mathcal X$. Assume that one can find $x_0$ such that $\lambda_{F_{t_0}}(x_0) \le 1/4$ for some small value $t_0 > 0$ (we describe a method to find $x_0$ at the end of this subsection). Then for $k \ge 0$, let
\[
t_{k+1} = \Big(1 + \frac{1}{13\sqrt{\nu}}\Big)\, t_k, \qquad x_{k+1} = x_k - [\nabla^2 F(x_k)]^{-1}\big(t_{k+1}\, c + \nabla F(x_k)\big).
\]
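As an illustration, the scheme above can be sketched for an LP $\min c^\top x$ s.t. $Ax \le b$ with the logarithmic barrier, which is $m$-self-concordant for $m$ constraints. The damped Newton step and the small inner re-centering loop are implementation choices for numerical robustness, not part of the scheme as stated in the text.

```python
import numpy as np

def lp_barrier_path_following(c, A, b, x0, t0=1.0, eps=1e-6):
    """Path-following sketch for min c^T x s.t. A x <= b, using the logarithmic
    barrier F(x) = -sum_i log(b_i - a_i^T x) (so nu = m, the number of
    constraints). x0 must be strictly feasible. The t-schedule follows the
    text; the damping and inner loop are robustness heuristics."""
    m = A.shape[0]
    nu = float(m)
    t, x = t0, x0.astype(float).copy()

    def newton_step(x, t):
        s = b - A @ x                          # slacks, kept > 0 throughout
        grad = t * c + A.T @ (1.0 / s)
        hess = A.T @ ((1.0 / s**2)[:, None] * A)
        dx = np.linalg.solve(hess, grad)
        lam = np.sqrt(max(grad @ dx, 0.0))     # Newton decrement
        return x - dx / (1.0 + lam), lam       # damped step stays feasible

    while nu / t > eps:                        # stop once the path gap nu/t is small
        t *= 1.0 + 1.0 / (13.0 * np.sqrt(nu))
        for _ in range(5):                     # re-center on F_t
            x, lam = newton_step(x, t)
            if lam < 0.25:
                break
    return x
```

For instance, minimizing $x_1 + x_2$ over the unit box from the strictly feasible point $(0.5, 0.5)$ drives the objective down to roughly $\nu/t$.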
The next theorem shows that after $O\big(\sqrt{\nu}\,\log\frac{\nu}{t_0\epsilon}\big)$ iterations of the path-following scheme one obtains an $\epsilon$-optimal point.
Theorem 5.6. The path-following scheme described above satisfies
\[
c^\top x_k - \min_{x\in\mathcal X} c^\top x \le \frac{2\nu}{t_0}\exp\Big(-\frac{k}{1 + 13\sqrt{\nu}}\Big).
\]
Proof. We show that the iterates $(x_k)_{k\ge0}$ remain close to the central path $(x^*(t_k))_{k\ge0}$. Precisely one can easily prove by induction that
\[
\lambda_{F_{t_k}}(x_k) \le 1/4.
\]
Indeed using Theorem 5.4 and equation (5.11) one immediately obtains
\[
\lambda_{F_{t_{k+1}}}(x_{k+1}) \le 2\,\lambda_{F_{t_{k+1}}}(x_k)^2 \le 2\Big(\frac{t_{k+1}}{t_k}\,\lambda_{F_{t_k}}(x_k) + \Big(\frac{t_{k+1}}{t_k} - 1\Big)\sqrt{\nu}\Big)^2 \le 1/4,
\]
where we used in the last inequality that $t_{k+1}/t_k = 1 + \frac{1}{13\sqrt{\nu}}$ and $\nu \ge 1$. Thus using (5.10) one obtains
\[
c^\top x_k - \min_{x\in\mathcal X} c^\top x \le \frac{\nu + \sqrt{\nu}/3 + 1/12}{t_k} \le \frac{2\nu}{t_k}.
\]
Observe that $t_k = \big(1 + \frac{1}{13\sqrt{\nu}}\big)^k t_0$, which finally yields
\[
c^\top x_k - \min_{x\in\mathcal X} c^\top x \le \frac{2\nu}{t_0}\Big(1 + \frac{1}{13\sqrt{\nu}}\Big)^{-k}.
\]
At this point we still need to explain how one can get close to an initial point $x^*(t_0)$ of the central path. This can be done with the following rather clever trick. Assume that one has some point $y_0 \in \mathcal X$. The observation is that $y_0$ is on the central path at $t = 1$ for the problem where $c$ is replaced by $-\nabla F(y_0)$. Now instead of following this central path as $t \to +\infty$, one follows it as $t \to 0$. Indeed for $t$ small enough the central paths for $c$ and for $-\nabla F(y_0)$ will be very close. Thus we iterate the following equations, starting with $t'_0 = 1$:
\[
t'_{k+1} = \Big(1 - \frac{1}{13\sqrt{\nu}}\Big)\, t'_k, \qquad y_{k+1} = y_k - [\nabla^2 F(y_k)]^{-1}\big(-t'_{k+1}\,\nabla F(y_0) + \nabla F(y_k)\big).
\]
A straightforward analysis shows that for $k = O(\sqrt{\nu}\log\nu)$, which corresponds to $t'_k = 1/\nu^{O(1)}$, one obtains a point $y_k$ such that $\lambda_{F_{t'_k}}(y_k) \le 1/4$. In other words one can initialize the path-following scheme with $t_0 = t'_k$ and $x_0 = y_k$.
5.3.6
with a -self-concordant barrier is O M log , where M is the complexity of computing a Newton direction (which can be done by computing and inverting the Hessian of the barrier). Thus the efficiency of
the method is directly related to the form of the self-concordant barrier that one can construct for X . It turns out that for LPs and SDPs
6
Convex optimization and randomness
6.1 Non-smooth stochastic optimization
Indeed the Mirror Descent analysis applies to the vector field $(\tilde g(x_s))$ and gives
\[
\sum_{s=1}^t \tilde g(x_s)^\top (x_s - x) \le \frac{R^2}{\eta} + \frac{\eta}{2}\sum_{s=1}^t \|\tilde g(x_s)\|_*^2.
\]
Moreover, by convexity and the tower rule,
\[
\mathbb{E}\,\frac1t\sum_{s=1}^t \big(f(x_s) - f(x)\big) \le \mathbb{E}\,\frac1t\sum_{s=1}^t \mathbb{E}\big(\tilde g(x_s)\,\big|\,x_s\big)^\top (x_s - x) = \frac1t\,\mathbb{E}\sum_{s=1}^t \tilde g(x_s)^\top (x_s - x).
\]
Similarly, in the Euclidean and strongly convex case, one can directly generalize Theorem 3.5. Precisely we consider Stochastic Gradient Descent (SGD), that is S-MD with $\Phi(x) = \frac12\|x\|_2^2$, with time-varying step size $(\eta_t)_{t\ge1}$, that is
\[
x_{t+1} = \Pi_{\mathcal X}\big(x_t - \eta_t\,\tilde g(x_t)\big).
\]
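As a concrete illustration, here is a minimal numpy sketch of projected SGD on a toy problem. The $1/\sqrt{t}$ step size schedule and the toy oracle are our own illustrative choices, not the tuned schedules from the theorems.

```python
import numpy as np

rng = np.random.default_rng(0)

def project_l2_ball(x, R):
    """Euclidean projection onto the centered l2 ball of radius R."""
    nrm = np.linalg.norm(x)
    return x if nrm <= R else (R / nrm) * x

def sgd(stoch_grad, x0, steps, R, eta0=1.0):
    """Projected SGD x_{t+1} = Pi_X(x_t - eta_t g~(x_t)) with eta_t = eta0/sqrt(t),
    returning the averaged iterate. The 1/sqrt(t) schedule is one standard
    choice, not the only schedule considered in the text."""
    x = x0.astype(float).copy()
    avg = np.zeros_like(x)
    for t in range(1, steps + 1):
        x = project_l2_ball(x - (eta0 / np.sqrt(t)) * stoch_grad(x), R)
        avg += x
    return avg / steps

# Toy oracle: f(x) = 0.5 E||x - xi||^2 with xi ~ N(mu, I); the minimizer over a
# large enough ball is mu, and x - xi is an unbiased stochastic gradient.
mu = np.array([1.0, -2.0])
stoch_grad = lambda x: x - (mu + rng.standard_normal(2))
x_hat = sgd(stoch_grad, x0=np.zeros(2), steps=5000, R=5.0)
```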
6.2 Smooth stochastic optimization and mini-batch SGD
with an appropriate step size satisfies
\[
\mathbb{E}\, f\Big(\frac{1}{t}\sum_{s=1}^{t} x_{s+1}\Big) - f(x^*) \le R\sigma\sqrt{\frac{2}{t}} + \frac{\beta R^2}{t}.
\]
² While being true in general this statement does not say anything about specific functions/oracles. For example it was shown in Bach and Moulines [2013] that acceleration can be obtained for the square loss and the logistic loss.
By $\beta$-smoothness one has
\[
f(x_{s+1}) - f(x_s) \le \nabla f(x_s)^\top (x_{s+1} - x_s) + \frac{\beta}{2}\|x_{s+1} - x_s\|^2
\le \tilde g_s^\top (x_{s+1} - x_s) + \frac{\eta}{2}\|\nabla f(x_s) - \tilde g_s\|^2 + \frac12(\beta + 1/\eta)\|x_{s+1} - x_s\|^2
\le \tilde g_s^\top (x_{s+1} - x_s) + \frac{\eta}{2}\|\nabla f(x_s) - \tilde g_s\|^2 + (\beta + 1/\eta)\, D_\Phi(x_{s+1}, x_s).
\]
Observe that, using the same argument as to derive (4.9), one has
\[
\frac{1}{\beta + 1/\eta}\,\tilde g_s^\top (x_{s+1} - x^*) \le D_\Phi(x^*, x_s) - D_\Phi(x^*, x_{s+1}) - D_\Phi(x_{s+1}, x_s).
\]
Thus
\[
f(x_{s+1}) \le f(x_s) + \tilde g_s^\top (x^* - x_s) + (\beta + 1/\eta)\big(D_\Phi(x^*, x_s) - D_\Phi(x^*, x_{s+1})\big) + \frac{\eta}{2}\|\nabla f(x_s) - \tilde g_s\|^2
\le f(x^*) + (\tilde g_s - \nabla f(x_s))^\top (x^* - x_s) + (\beta + 1/\eta)\big(D_\Phi(x^*, x_s) - D_\Phi(x^*, x_{s+1})\big) + \frac{\eta}{2}\|\nabla f(x_s) - \tilde g_s\|^2.
\]
In particular this yields
\[
\mathbb{E} f(x_{s+1}) - f(x^*) \le (\beta + 1/\eta)\,\mathbb{E}\big(D_\Phi(x^*, x_s) - D_\Phi(x^*, x_{s+1})\big) + \frac{\eta\sigma^2}{2}.
\]
For the mini-batch oracle one has
\[
\mathbb{E}\Big\|\frac1m\sum_{i=1}^m \tilde g_i(x) - \nabla f(x)\Big\|_2^2 = \frac1m\,\mathbb{E}\|\tilde g_1(x) - \nabla f(x)\|_2^2 \le \frac{2B^2}{m}.
\]
Thus one obtains that with $t$ calls to the (original) stochastic oracle, that is $t/m$ iterations of the mini-batch SGD, one has a suboptimality gap bounded by
\[
R\sqrt{\frac{2B^2}{m}}\,\sqrt{\frac{2}{t/m}} + \frac{\beta R^2}{t/m} = \frac{2RB}{\sqrt{t}} + \frac{\beta m R^2}{t}.
\]
Thus as long as $m \le \frac{B}{\beta R}\sqrt{t}$ one obtains, with mini-batch SGD and $t$ calls to the oracle, a point which is $\frac{3RB}{\sqrt{t}}$-optimal.
t
Mini-batch SGD can be a better option than basic SGD in at least
two situations: (i) When the computation for an iteration of mini-batch
SGD can be distributed between multiple processors. Indeed a central
unit can send the message to the processors that estimates of the gradient at point xs has to be computed, then each processor can work
independently and send back the average estimate they obtained. (ii)
Even in a serial setting mini-batch SGD can sometimes be advantageous, in particular if some calculations can be re-used to compute
several estimated gradients at the same point.
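The variance-reduction mechanism behind mini-batching can be checked numerically: averaging $m$ independent gradient estimates at the same point divides the variance by $m$, which is exactly what improves the statistical term in the bound above. The toy oracle below is our own illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

def minibatch_grad(stoch_grad, x, m):
    """Average m independent single-sample gradient estimates at the same point.
    The variance of the average is 1/m times the single-sample variance."""
    return np.mean([stoch_grad(x) for _ in range(m)], axis=0)

# Empirical check of the 1/m variance reduction on a toy unbiased oracle.
mu = np.ones(3)
stoch_grad = lambda x: x - (mu + rng.standard_normal(3))
x = np.zeros(3)
var_single = np.var([stoch_grad(x) for _ in range(20000)], axis=0).mean()
var_batch64 = np.var([minibatch_grad(stoch_grad, x, 64) for _ in range(2000)], axis=0).mean()
```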
6.3 Improved SGD for a sum of smooth and strongly convex functions

Let us examine in more detail the main example from Section 1.1. That is, one is interested in the unconstrained minimization of
\[
f(x) = \frac{1}{m}\sum_{i=1}^m f_i(x),
\]
where $f_1, \ldots, f_m$ are $\beta$-smooth and convex functions, and $f$ is $\alpha$-strongly convex. Typically in machine learning contexts $\alpha$ can be as small as $1/m$, while $\beta$ is of order of a constant. In other words the condition number $Q = \beta/\alpha$ can be as large as $\Omega(m)$. Let us now compare the basic Gradient Descent, that is
\[
x_{t+1} = x_t - \frac{\eta}{m}\sum_{i=1}^m \nabla f_i(x_t),
\]
to SGD
\[
x_{t+1} = x_t - \eta\,\nabla f_{i_t}(x_t),
\]
where $i_t$ is drawn uniformly at random in $[m]$ (independently of everything else). Theorem 3.6 shows that Gradient Descent requires $O(mQ\log(1/\epsilon))$ gradient computations (which can be improved to $O(m\sqrt{Q}\log(1/\epsilon))$ with Nesterov's Accelerated Gradient Descent).
The SVRG scheme proceeds in epochs indexed by $s$: within epoch $s$, starting from $x_1^{(s)} = y^{(s)}$, iterate
\[
x_{t+1}^{(s)} = x_t^{(s)} - \eta\big(\nabla f_{i_t}(x_t^{(s)}) - \nabla f_{i_t}(y^{(s)}) + \nabla f(y^{(s)})\big),
\]
where $i_t$ is drawn uniformly at random (and independently of everything else) in $[m]$. Also let
\[
y^{(s+1)} = \frac1k\sum_{t=1}^k x_t^{(s)}.
\]
\[
\mathbb{E} f(y^{(s+1)}) - f(x^*) = \mathbb{E} f\Big(\frac1k\sum_{t=1}^k x_t^{(s)}\Big) - f(x^*) \le 0.9\,\big(f(y^{(s)}) - f(x^*)\big),
\]
which clearly implies the theorem. To simplify the notation in the following we drop the dependency on $s$, that is we want to show that
\[
\mathbb{E} f\Big(\frac1k\sum_{t=1}^k x_t\Big) - f(x^*) \le 0.9\,\big(f(y) - f(x^*)\big). \tag{6.1}
\]
Recall that the update takes the form
\[
x_{t+1} = x_t - \eta\, v_t, \tag{6.2}
\]
where
\[
v_t = \nabla f_{i_t}(x_t) - \nabla f_{i_t}(y) + \nabla f(y).
\]
Using Lemma 6.1, we upper bound $\mathbb{E}_{i_t}\|v_t\|_2^2$ as follows (also recall that $\mathbb{E}\|X - \mathbb{E}(X)\|_2^2 \le \mathbb{E}\|X\|_2^2$, and $\mathbb{E}_{i_t}\nabla f_{i_t}(x^*) = \nabla f(x^*) = 0$):
\[
\mathbb{E}_{i_t}\|v_t\|_2^2 \le 2\,\mathbb{E}_{i_t}\|\nabla f_{i_t}(x_t) - \nabla f_{i_t}(x^*)\|_2^2 + 2\,\mathbb{E}_{i_t}\|\nabla f_{i_t}(y) - \nabla f_{i_t}(x^*) - \nabla f(y)\|_2^2
\le 2\,\mathbb{E}_{i_t}\|\nabla f_{i_t}(x_t) - \nabla f_{i_t}(x^*)\|_2^2 + 2\,\mathbb{E}_{i_t}\|\nabla f_{i_t}(y) - \nabla f_{i_t}(x^*)\|_2^2
\le 4\beta\,\big(f(x_t) - f(x^*) + f(y) - f(x^*)\big). \tag{6.3}
\]
Summing over $t$ and using that the iterates start at $x_1 = y$, together with $\alpha$-strong convexity (so that $\|x_1 - x^*\|_2^2 \le \frac{2}{\alpha}(f(y) - f(x^*))$), one obtains
\[
\mathbb{E}\|x_{k+1} - x^*\|_2^2 \le \|x_1 - x^*\|_2^2 - 2\eta(1 - 2\beta\eta)\,\mathbb{E}\sum_{t=1}^k \big(f(x_t) - f(x^*)\big) + 4\beta\eta^2 k\,\big(f(y) - f(x^*)\big),
\]
and thus by convexity
\[
\mathbb{E} f\Big(\frac1k\sum_{t=1}^k x_t\Big) - f(x^*) \le \Big(\frac{1}{\alpha(1 - 2\beta\eta)\eta k} + \frac{2\beta\eta}{1 - 2\beta\eta}\Big)\big(f(y) - f(x^*)\big).
\]
Using $\eta = \frac{1}{10\beta}$ and $k = 20Q$ finally yields (6.1), which itself concludes the proof.
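The scheme just analyzed can be sketched in a few lines of numpy. The least squares instance and the particular values of $\eta$ and $k$ below are our own illustrative choices, only loosely following the text's prescription $\eta = 1/(10\beta)$, $k = 20Q$.

```python
import numpy as np

rng = np.random.default_rng(2)

def svrg(grads, full_grad, x0, eta, k, epochs):
    """Sketch of SVRG for f = (1/m) sum_i f_i: each epoch computes one full
    gradient at the anchor y, runs k inner steps with the variance-reduced
    direction v_t = grad f_i(x_t) - grad f_i(y) + grad f(y), and averages."""
    m = len(grads)
    y = x0.astype(float).copy()
    for _ in range(epochs):
        anchor_grad = full_grad(y)          # one full gradient per epoch
        x = y.copy()
        acc = np.zeros_like(y)
        for _ in range(k):
            i = rng.integers(m)
            x = x - eta * (grads[i](x) - grads[i](y) + anchor_grad)
            acc += x
        y = acc / k                         # next anchor: average of inner iterates
    return y

# Toy instance: least squares, f_i(x) = 0.5 (a_i^T x - b_i)^2 with solution 1.
A = rng.standard_normal((50, 5))
b = A @ np.ones(5)
grads = [lambda x, a=A[i], bi=b[i]: a * (a @ x - bi) for i in range(50)]
full_grad = lambda x: A.T @ (A @ x - b) / 50
x_svrg = svrg(grads, full_grad, np.zeros(5), eta=0.006, k=700, epochs=100)
```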
6.4 Random Coordinate Descent
Consider the oracle that draws $i$ uniformly at random in $[n]$ and returns $\tilde g(x) = n\,\nabla_i f(x)\, e_i$; it is unbiased and satisfies
\[
\mathbb{E}\|\tilde g(x)\|_2^2 = \frac1n\sum_{i=1}^n \|n\,\nabla_i f(x)\, e_i\|_2^2 = n\,\|\nabla f(x)\|_2^2.
\]
Thus using Theorem 6.1 (with $\Phi(x) = \frac12\|x\|_2^2$, that is S-MD being SGD) one immediately obtains the following result.
Theorem 6.5. Let $f$ be convex and $L$-Lipschitz on $\mathbb{R}^n$. Then RCD with $\eta = \frac{R}{L}\sqrt{\frac{2}{nt}}$ satisfies
\[
\mathbb{E} f\Big(\frac1t\sum_{s=1}^t x_s\Big) - \min_{x\in\mathcal X} f(x) \le RL\sqrt{\frac{2n}{t}}.
\]
In the smooth case one considers instead the update
\[
x_{s+1} = x_s - \frac{1}{\beta_{i_s}}\,\nabla_{i_s} f(x_s)\, e_{i_s}.
\]
Furthermore we study a more general sampling distribution than uniform: precisely, for $\gamma \ge 0$ we assume that $i_s$ is drawn (independently) from the distribution $p_\gamma$ defined by
\[
p_\gamma(i) = \frac{\beta_i^\gamma}{\sum_{j=1}^n \beta_j^\gamma}, \quad i \in [n].
\]
We will use the norms
\[
\|x\|_{[\gamma]} = \sqrt{\sum_{i=1}^n \beta_i^\gamma\, x_i^2}, \qquad \|x\|_{[\gamma]}^* = \sqrt{\sum_{i=1}^n \beta_i^{-\gamma}\, x_i^2}.
\]
Theorem 6.6. Let $\gamma \ge 0$. Let $f$ be convex and such that $u \in \mathbb{R} \mapsto f(x + u e_i)$ is $\beta_i$-smooth for any $i \in [n]$, $x \in \mathbb{R}^n$. Then RCD($\gamma$) satisfies for $t \ge 2$
\[
\mathbb{E} f(x_t) - f(x^*) \le \frac{2\,R_{1-\gamma}^2(x_1)\sum_{i=1}^n \beta_i^\gamma}{t - 1},
\]
where
\[
R_{1-\gamma}(x_1) = \sup_{x\in\mathbb{R}^n:\ f(x)\le f(x_1)} \|x - x^*\|_{[1-\gamma]}.
\]
Recall from Theorem 3.2 that in this context the basic Gradient Descent attains a rate of $\beta\|x_1 - x^*\|_2^2/t$ where $\beta \le \sum_{i=1}^n \beta_i$ (see the discussion above). Thus we see that RCD(1) greatly improves upon Gradient Descent for functions where $\beta$ is of order of $\sum_{i=1}^n \beta_i$. Indeed in this case both methods attain the same accuracy after a fixed number of iterations, but the iterations of Coordinate Descent are potentially much cheaper than the iterations of Gradient Descent.
Proof. By applying (3.5) to the $\beta_i$-smooth function $u \in \mathbb{R} \mapsto f(x + u e_i)$ one obtains
\[
f\Big(x - \frac{1}{\beta_i}\nabla_i f(x)\, e_i\Big) - f(x) \le -\frac{1}{2\beta_i}\big(\nabla_i f(x)\big)^2.
\]
We use this as follows:
\[
\mathbb{E}_{i_s} f(x_{s+1}) - f(x_s) = \sum_{i=1}^n p_\gamma(i)\Big(f\Big(x_s - \frac{1}{\beta_i}\nabla_i f(x_s)\, e_i\Big) - f(x_s)\Big)
\le -\sum_{i=1}^n \frac{p_\gamma(i)}{2\beta_i}\big(\nabla_i f(x_s)\big)^2
= -\frac{1}{2\sum_{i=1}^n \beta_i^\gamma}\big(\|\nabla f(x_s)\|_{[1-\gamma]}^*\big)^2.
\]
In other words, denoting $\delta_s = \mathbb{E} f(x_s) - f(x^*)$ and using $\delta_s \le \|\nabla f(x_s)\|_{[1-\gamma]}^*\, R_{1-\gamma}(x_1)$ (by convexity, since the iterates stay in the initial level set), one obtains
\[
\delta_{s+1} \le \delta_s - \frac{1}{2\,R_{1-\gamma}^2(x_1)\sum_{i=1}^n \beta_i^\gamma}\,\delta_s^2.
\]
The proof can be concluded with similar computations as for Theorem 3.2.
We discussed above the specific case of $\gamma = 1$. Both $\gamma = 0$ and $\gamma = 1/2$ also have an interesting behavior, and we refer to Nesterov [2012] for more details. The latter paper also contains a discussion of high probability results and potential acceleration à la Nesterov. We also refer to Richtárik and Takáč [2012] for a discussion of RCD in a distributed setting.
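A minimal numpy sketch of RCD($\gamma$) follows; the separable quadratic is our own toy instance, chosen so that the coordinate-wise smoothness constants $\beta_i$ are explicit.

```python
import numpy as np

rng = np.random.default_rng(3)

def rcd(grad_coord, betas, x0, iters, gamma=1.0):
    """RCD(gamma): draw coordinate i with probability beta_i^gamma / sum_j beta_j^gamma
    and update x <- x - (1/beta_i) * (d_i f(x)) e_i, as in the text."""
    x = x0.astype(float).copy()
    p = betas**gamma / np.sum(betas**gamma)
    for _ in range(iters):
        i = rng.choice(len(x), p=p)
        x[i] -= grad_coord(x, i) / betas[i]
    return x

# Toy instance: separable quadratic f(x) = 0.5 sum_i beta_i x_i^2, whose
# coordinate-wise smoothness constants are exactly beta_i.
betas = np.array([1.0, 10.0, 100.0])
grad_coord = lambda x, i: betas[i] * x[i]
x_rcd = rcd(grad_coord, betas, x0=np.ones(3), iters=2000)
```

On this instance a single step along coordinate $i$ already zeroes that coordinate, so the iterate converges as soon as every coordinate has been sampled.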
6.4.2 RCD for strongly convex and smooth functions
If in addition to directional smoothness one also assumes strong convexity, then RCD attains in fact a linear rate.
Theorem 6.7. Let $\gamma \ge 0$. Let $f$ be $\alpha$-strongly convex w.r.t. $\|\cdot\|_{[1-\gamma]}$, and such that $u \in \mathbb{R} \mapsto f(x + u e_i)$ is $\beta_i$-smooth for any $i \in [n]$, $x \in \mathbb{R}^n$. Let $Q_\gamma = \frac{\sum_{i=1}^n \beta_i^\gamma}{\alpha}$. Then RCD($\gamma$) satisfies
\[
\mathbb{E} f(x_{t+1}) - f(x^*) \le \Big(1 - \frac{1}{Q_\gamma}\Big)^t\big(f(x_1) - f(x^*)\big).
\]
Lemma 6.2. Let $f$ be $\alpha$-strongly convex w.r.t. some norm $\|\cdot\|$ on $\mathbb{R}^n$. Then
\[
f(x) - f(x^*) \le \frac{1}{2\alpha}\big(\|\nabla f(x)\|^*\big)^2.
\]
Proof. By strong convexity one has, for any $y$,
\[
f(y) \ge f(x) + \nabla f(x)^\top (y - x) + \frac{\alpha}{2}\|x - y\|^2 \ge f(x) - \|\nabla f(x)\|^*\,\|x - y\| + \frac{\alpha}{2}\|x - y\|^2 \ge f(x) - \frac{1}{2\alpha}\big(\|\nabla f(x)\|^*\big)^2,
\]
which concludes the proof by taking $y = x^*$.
We can now prove Theorem 6.7.
Proof. In the proof of Theorem 6.6 we showed that
\[
\delta_{s+1} \le \delta_s - \frac{1}{2\sum_{i=1}^n \beta_i^\gamma}\big(\|\nabla f(x_s)\|_{[1-\gamma]}^*\big)^2.
\]
On the other hand Lemma 6.2 shows that
\[
\big(\|\nabla f(x_s)\|_{[1-\gamma]}^*\big)^2 \ge 2\alpha\,\delta_s.
\]
The proof is concluded with straightforward calculations.
6.5 Acceleration by randomization for saddle points
For the matrix game $\min_{x\in\Delta_n}\max_{y\in\Delta_m} x^\top A y$ one can use the following stochastic oracles:
\[
\tilde g_{\mathcal X}(x, y) = A_I, \ \text{where } I \in [m] \text{ is drawn according to } y \in \Delta_m, \tag{6.4}
\]
and, for $i \in [m]$,
\[
\tilde g_{\mathcal Y}(x, y)(i) = A_i(J), \ \text{where } J \in [n] \text{ is drawn according to } x \in \Delta_n. \tag{6.5}
\]
Clearly $\|\tilde g_{\mathcal X}(x,y)\|_\infty \le \|A\|_{\max}$ and $\|\tilde g_{\mathcal Y}(x,y)\|_\infty \le \|A\|_{\max}$, which implies that S-SP-MD attains an $\epsilon$-optimal pair of points with $O\big(\|A\|_{\max}^2\log(n+m)/\epsilon^2\big)$ iterations. Furthermore the computational complexity of a step of S-SP-MD is dominated by drawing the indices $I$ and $J$, which takes $O(n+m)$. Thus overall the complexity of getting an $\epsilon$-optimal Nash equilibrium with S-SP-MD is $O\big(\|A\|_{\max}^2 (n+m)\log(n+m)/\epsilon^2\big)$. While the dependency on $\epsilon$ is worse than for SP-MP (see Section 5.2.4.2), the dependency on the dimensions is $\tilde O(n+m)$ instead of $O(nm)$. In particular, quite astonishingly, this is sublinear in the size of the matrix $A$. The possibility of sublinear algorithms for this problem was first observed in Grigoriadis and Khachiyan [1995].
Linear classification. Here $x \in B_{2,n}$ and $y \in \Delta_m$. Thus the stochastic oracle for the $x$-subgradient can be taken as in (6.4), but for the $y$-subgradient we modify (6.5) as follows. For a vector $x$ we denote by $x^2$ the vector with entries $x^2(j) = x(j)^2$. Unfortunately the resulting variance term can be of order $n$. However it turns out that one can do a more careful analysis of Mirror Descent in terms of local norms, which allows one to prove that the "local variance" is dimension-free. We refer to Bubeck and Cesa-Bianchi [2012] for more details on these local norms, and to Clarkson et al. [2012] for the specific details in the linear classification situation.
6.6 Convex relaxation and randomized rounding

In this section we illustrate the power of randomization on the MAXCUT problem. Given a graph with symmetric non-negative weights $A_{i,j}$ and Laplacian $L$, finding a maximum cut is equivalent (up to the factor $1/4$) to solving
\[
\max_{x\in\{-1,1\}^n} x^\top L x, \tag{6.7}
\]
since the value of the cut defined by the signs of $x$ equals
\[
\frac14\sum_{i,j=1,\, i\ne j}^n A_{i,j}\,(1 - x_i x_j) = \frac14\, x^\top L x.
\]
Writing $x^\top L x = \langle L, xx^\top\rangle$, and noting that $xx^\top$ is positive semidefinite with unit diagonal, one obtains
\[
\max_{x\in\{-1,1\}^n} x^\top L x \le \max_{X\in\mathbb{S}^n_+,\ X_{i,i}=1,\ i\in[n]} \langle L, X\rangle.
\]
The right hand side in the above display is known as the convex (or SDP) relaxation of MAXCUT. The convex relaxation is an SDP and thus one can find its solution efficiently with Interior Point Methods (see Section 5.3). The following result states both the Goemans-Williamson strategy and the corresponding approximation ratio.
Theorem 6.9. Let $\Sigma$ be the solution to the SDP relaxation of MAXCUT, let $\xi \sim N(0, \Sigma)$ and $\zeta = \mathrm{sign}(\xi)$. Then
\[
\mathbb{E}\,\zeta^\top L\,\zeta \ge 0.878\,\max_{x\in\{-1,1\}^n} x^\top L x.
\]
The proof of this result is based on the following elementary geometric lemma.
Lemma 6.3. Let $\xi \sim N(0, \Sigma)$ with $\Sigma_{i,i} = 1$ for $i \in [n]$, and $\zeta = \mathrm{sign}(\xi)$. Then
\[
\mathbb{E}\,\zeta_i\zeta_j = \frac{2}{\pi}\arcsin(\Sigma_{i,j}).
\]
Proof. Let $V \in \mathbb{R}^{n\times n}$ (with $i$-th row $V_i^\top$) be such that $\Sigma = VV^\top$. Note that since $\Sigma_{i,i} = 1$ one has $\|V_i\|_2 = 1$ (remark also that necessarily $|\Sigma_{i,j}| \le 1$, which will be important in the proof of Theorem 6.9). Let $\varepsilon \sim N(0, I_n)$ be such that $\xi = V\varepsilon$. Then $\zeta_i = \mathrm{sign}(V_i^\top\varepsilon)$, and in particular
\[
\mathbb{E}\,\zeta_i\zeta_j = 1 - 2\,\mathbb{P}\big(\mathrm{sign}(V_i^\top\varepsilon) \ne \mathrm{sign}(V_j^\top\varepsilon)\big) = 1 - \frac{2\arccos(V_i^\top V_j)}{\pi} = \frac{2}{\pi}\arcsin(\Sigma_{i,j}),
\]
where we used that the probability that a uniformly random hyperplane through the origin separates two unit vectors is proportional to the angle between them.
Also remark that for $X \in \mathbb{R}^{n\times n}$ such that $X_{i,i} = 1$, one has
\[
\langle L, X\rangle = \sum_{i,j=1}^n A_{i,j}\,(1 - X_{i,j}), \tag{6.8}
\]
and in particular for $x \in \{-1,1\}^n$, $x^\top L x = \sum_{i,j=1}^n A_{i,j}(1 - x_i x_j)$. Thus, using Lemma 6.3, and the facts that $A_{i,j} \ge 0$ and $|\Sigma_{i,j}| \le 1$ (see the proof of Lemma 6.3), one has
\[
\mathbb{E}\,\zeta^\top L\,\zeta = \sum_{i,j=1}^n A_{i,j}\Big(1 - \frac{2}{\pi}\arcsin(\Sigma_{i,j})\Big)
\ge 0.878\sum_{i,j=1}^n A_{i,j}\,(1 - \Sigma_{i,j})
= 0.878\,\langle L, \Sigma\rangle
= 0.878\max_{X\in\mathbb{S}^n_+,\ X_{i,i}=1,\ i\in[n]} \langle L, X\rangle
\ge 0.878\max_{x\in\{-1,1\}^n} x^\top L x.
\]
More generally, one can compare the SDP
\[
\max_{X\in\mathbb{S}^n_+,\ X_{i,i}=1,\ i\in[n]} \langle B, X\rangle
\]
to the combinatorial problem
\[
\max_{x\in\{-1,1\}^n} x^\top B x
\]
for an arbitrary symmetric matrix $B$. With $X$ the solution of the SDP, $\xi \sim N(0, X)$ and $\zeta = \mathrm{sign}(\xi)$, Lemma 6.3 yields
\[
\mathbb{E}\,\zeta^\top B\,\zeta = \sum_{i,j=1}^n B_{i,j}\,\frac{2}{\pi}\arcsin(X_{i,j}) = \frac{2}{\pi}\langle B, \arcsin(X)\rangle,
\]
where $\arcsin(X)$ is understood entrywise. One then uses the Taylor expansion
\[
\arcsin(\xi) = \sum_{k=0}^{\infty} \frac{\binom{2k}{k}}{4^k(2k+1)}\,\xi^{2k+1} = \xi + \sum_{k=1}^{\infty} \frac{\binom{2k}{k}}{4^k(2k+1)}\,\xi^{2k+1}:
\]
since Hadamard (entrywise) powers of a positive semidefinite matrix are positive semidefinite, this expansion shows that $\arcsin(X) \succeq X$, and thus for $B \succeq 0$ one obtains $\mathbb{E}\,\zeta^\top B\,\zeta \ge \frac{2}{\pi}\langle B, X\rangle$.
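The identity of Lemma 6.3, which drives the rounding step, is easy to check by simulation; the correlation value 0.6 below is an arbitrary choice of ours.

```python
import numpy as np

rng = np.random.default_rng(4)

# Monte Carlo check of Lemma 6.3: for xi ~ N(0, Sigma) with unit diagonal and
# zeta = sign(xi), one has E[zeta_i zeta_j] = (2/pi) arcsin(Sigma_ij).
rho = 0.6
Sigma = np.array([[1.0, rho], [rho, 1.0]])
xi = rng.multivariate_normal(np.zeros(2), Sigma, size=200000)
zeta = np.sign(xi)
empirical = float(np.mean(zeta[:, 0] * zeta[:, 1]))
predicted = (2.0 / np.pi) * np.arcsin(rho)
```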
6.7 Random walk based methods
Thus if one can ensure that $S_t$ is in (near) isotropic position, and that $\|c_t - \hat c_t\|_2$ is small (say smaller than $0.1$), then the randomized center of gravity method (which replaces $c_t$ by $\hat c_t$) will converge at the same speed as the original center of gravity method.
Assuming that $S_t$ is in isotropic position one immediately obtains $\mathbb{E}\|c_t - \hat c_t\|_2^2 = \frac{n}{N}$, and thus by Chebyshev's inequality one has $\mathbb{P}(\|c_t - \hat c_t\|_2 > 0.1) \le 100\frac{n}{N}$. In other words with $N = O(n)$ one can ensure that the randomized center of gravity method makes progress on a constant fraction of the iterations (to ensure progress at every step one would need a larger value of $N$ because of a union bound, but this is unnecessary).
Let us now consider the issue of putting $S_t$ in near-isotropic position. Let
\[
\hat\Sigma_t = \frac1N\sum_{i=1}^N (X_i - \hat c_t)(X_i - \hat c_t)^\top.
\]
Rudelson [1999] showed that as long as $N = \tilde\Omega(n)$, one has with high probability (say at least probability $1 - 1/n^2$) that the set $\hat\Sigma_t^{-1/2}(S_t - \hat c_t)$ is in near-isotropic position.
Thus it only remains to explain how to sample from a near-isotropic convex set $K$. This is where random walk ideas come into the picture. The hit-and-run walk is described as follows: at a point $x \in K$, let $\mathcal L$ be a line that goes through $x$ in a direction taken uniformly at random, then move to a point chosen uniformly at random on $\mathcal L \cap K$. Lovász [1998] showed that if the starting point of the hit-and-run walk is chosen from a distribution close enough to the uniform distribution on $K$, then after $O(n^3)$ steps the distribution of the last point is $\epsilon$-away (in total variation) from the uniform distribution on $K$. In the randomized center of gravity method one can obtain a good initial distribution for $S_t$ by using the distribution that was obtained for $S_{t-1}$. In order to initialize the entire process correctly we start here with $S_1 = [-L, L]^n \supseteq \mathcal X$ (in Section 2.1 we used $S_1 = \mathcal X$), and thus we also have to use a separation oracle at iterations where $\hat c_t \notin \mathcal X$, just like we did for the Ellipsoid Method (see Section 2.2).
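The hit-and-run transition rule itself is a few lines of numpy; the chord computation below assumes the polytope representation $\{x : Ax \le b\}$ with a bounded feasible set, and the mixing guarantee is the cited result, not something this sketch verifies.

```python
import numpy as np

rng = np.random.default_rng(5)

def hit_and_run(x, A, b, steps):
    """Hit-and-run in the polytope {x : A x <= b} (assumed bounded, x strictly
    feasible): pick a uniformly random direction, compute the chord of the
    polytope along that line, move to a uniform point on the chord."""
    x = x.astype(float).copy()
    for _ in range(steps):
        d = rng.standard_normal(x.shape)
        d /= np.linalg.norm(d)
        Ad = A @ d
        slack = b - A @ x                      # strictly positive in the interior
        with np.errstate(divide="ignore", invalid="ignore"):
            ratios = slack / Ad
        hi = ratios[Ad > 1e-12].min()          # furthest feasible step forward
        lo = ratios[Ad < -1e-12].max()         # furthest feasible step backward
        x = x + rng.uniform(lo, hi) * d
    return x
```

For example, running the walk in the unit square produces points that stay feasible and whose empirical mean approaches the center.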
Wrapping up the above discussion, we showed (informally) that to attain an $\epsilon$-optimal point with the randomized center of gravity method one needs $\tilde O(n)$ iterations; each iteration requires $\tilde O(n)$ random samples from $S_t$ (in order to put it in isotropic position) as well as a call to either the separation oracle or the first order oracle, and each sample costs $\tilde O(n^3)$ steps of the random walk. Thus overall one needs $\tilde O(n)$ calls to the separation oracle and the first order oracle, as well as $\tilde O(n^5)$ steps of the random walk.
References
S. Arora and B. Barak. Computational Complexity: A Modern Approach. Cambridge University Press, 2009.
J.-Y. Audibert, S. Bubeck, and R. Munos. Bandit view on noisy optimization. In S. Sra, S. Nowozin, and S. Wright, editors, Optimization for Machine Learning. MIT Press, 2011.
J.-Y. Audibert, S. Bubeck, and G. Lugosi. Regret in online combinatorial optimization. Mathematics of Operations Research, 39:31–45, 2014.
F. Bach and E. Moulines. Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n). In Advances in Neural Information Processing Systems (NIPS), 2013.
A. Beck and M. Teboulle. Mirror Descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31(3):167–175, 2003.
A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.
A. Ben-Tal and A. Nemirovski. Lectures on Modern Convex Optimization: Analysis, Algorithms, and Engineering Applications. SIAM, 2001.