Algorithms for Convex Optimization
Nisheeth K. Vishnoi
Contents

Preface
Notation
1 Bridging continuous and discrete optimization
1.1 An example: the maximum flow problem
1.2 Linear programming
1.3 Fast and exact algorithms via interior point methods
1.4 Ellipsoid method beyond succinct linear programs
2 Preliminaries
2.1 Derivatives, gradients, and Hessians
2.2 Taylor series
2.3 Linear algebra, matrices, and eigenvalues
2.4 The Cauchy-Schwarz inequality
2.5 Norms
2.6 Euclidean topology
2.7 Dynamical systems
2.8 Graphs
2.9 Exercises
3 Convexity
3.1 Convex sets
3.2 Convex functions
3.3 The usefulness of convexity
3.4 Exercises
4 Convex Optimization and Efficiency
4.1 Convex programs
4.2 Computational models
4.3 Membership problem for convex sets
4.4 Solution concepts for optimization problems
Preface
The structure of the book. The book has roughly four parts: Chapters 3, 4,
and 5 provide an introduction to convexity, models of computation and no-
tions of efficiency in convex optimization, and duality. Chapters 6, 7, and 8
introduce first-order methods such as gradient descent, mirror descent and the
multiplicative weights update method, and accelerated gradient descent respec-
tively. Chapters 9, 10, and 11 present Newton’s method and various interior
point methods for linear programs. Chapters 12 and 13 present cutting plane
methods such as the ellipsoid method for linear and general convex programs.
Chapter 1 summarizes the book via a brief history of the interplay between
continuous and discrete optimization: how the search for fast algorithms for
discrete problems is leading to improvements in algorithms for convex opti-
mization.
Many chapters contain applications ranging from finding maximum flows,
minimum cuts, and perfect matchings in graphs, to linear optimization over
0-1-polytopes, to submodular function minimization, to computing maximum
entropy distributions over combinatorial polytopes.
The book is self-contained and starts with a review of calculus, linear alge-
bra, geometry, dynamical systems, and graph theory in Chapter 2. Exercises
posed in this book not only play an important role in checking one’s under-
standing, but sometimes important methods and concepts are introduced and
developed entirely through them. Examples include the Frank-Wolfe method,
coordinate descent, stochastic gradient descent, online convex optimization,
the min-max theorem for zero-sum games, the Winnow algorithm for classifi-
cation, the conjugate gradient method, primal-dual interior point method, and
matrix scaling.
How to use this book. This book can be used either as a textbook for a stan-
dalone advanced undergraduate or beginning graduate level course, or can act
as a supplement to an introductory course on convex optimization or algo-
rithm design. The intended audience includes advanced undergraduate stu-
dents, graduate students, and researchers from theoretical computer science,
discrete optimization, operations research, statistics, and machine learning. To
make this book accessible to a broad audience with different backgrounds, the
writing style deliberately emphasizes the intuition, sometimes at the expense
of rigor.
A course for a theoretical computer science or discrete optimization audi-
ence could cover the entire book. A course on convex optimization can omit
the applications to discrete optimization and can, instead, include applications
as per the choice of the instructor. Finally, an introductory course on convex
optimization for machine learning could include material from Chapters 2-6.
Preface ix
Beyond convex optimization? This book should also prepare the reader for
working in areas beyond convex optimization, e.g., nonconvex optimization
and geodesic convex optimization, which are currently in their formative years.
Nonconvex optimization. One property of convex functions is that a “local”
minimum is also a “global” minimum. Thus, algorithms for convex optimiza-
tion, essentially, find a local minimum. Interestingly, this viewpoint has led
to convex optimization methods being wildly successful for nonconvex op-
timization problems, especially those that arise in machine learning. Unlike
convex programs, some of which can be NP-hard to optimize, most interesting
classes of nonconvex optimization problems are NP-hard. Hence, in many of
these applications, one defines a suitable notion of local minimum and looks
for methods that can take us to one. Thus, algorithms for convex optimization
are important for nonconvex optimization as well.
Geodesic convex optimization. Sometimes, a function that is nonconvex in
a Euclidean space turns out to be convex if one introduces a suitable Rieman-
nian metric on the underlying space and redefines convexity with respect to
the straight lines – geodesics – induced by the metric. Such a function is called
geodesically convex; see the survey by Vishnoi (2018). Unlike convex opti-
mization, however, how to optimize such a function over a geodesically convex
set is an underdeveloped theory – waiting to be explored.
Acknowledgments. The contents of this book have been developed over sev-
eral courses – for both undergraduate and graduate students – that I have
taught, starting in Fall 2014. The content of the book is closest to that of a
course taught in Fall 2019 at Yale. I am grateful to all the students and other
attendees of these courses for their questions and comments that have made
me reflect on the topic and improve the presentation. I am thankful to Slo-
bodan Mitrovic, Damian Straszak, Jakub Tarnawski, and George Zakhour for
being some of the first to take this course and scribing my initial lectures on
this topic. Special thanks to Damian for scribing a significant fraction of
my lectures, sometimes adding his own insights. I am indebted to Somenath
Biswas, Elisa Celis, Yan Zhong Ding, and Anay Mehrotra for carefully reading
a draft of this book and giving numerous valuable comments and suggestions.
Finally, this book has been influenced by several classic works: Geometric
algorithms and combinatorial optimization by Grötschel et al. (1988), Convex
optimization by Boyd and Vandenberghe (2004), Introductory lectures on con-
vex optimization by Nesterov (2014), and The multiplicative weights update
method: A meta-algorithm and applications by Arora et al. (2012).
Notation
Graphs:
• A graph G has a vertex set V and an edge set E. All graphs are assumed to
be undirected unless stated otherwise. If the graph is weighted, there is a
weight function w : E → R≥0 .
• A graph is said to be simple if there is at most one edge between two
vertices and there are no edges whose endpoints are the same vertex.
• Typically, n is reserved for the number of vertices |V|, and m for the number
of edges |E|.
Probability:
• E_D[·] denotes the expectation and Pr_D[·] denotes the probability over a
distribution D. The subscript is dropped when clear from context.
Running times:
• Standard big-O notation is used to describe the limiting behavior of a func-
tion. Õ denotes that potential poly-logarithmic factors have been omitted,
i.e., f = Õ(g) is equivalent to f = O(g log^k(g)) for some constant k.
1
Bridging continuous and discrete optimization
One of the first combinatorial algorithms for the s − t-maximum flow prob-
lem was presented in the seminal work by Ford and Fulkerson (1956). Roughly
speaking, the Ford-Fulkerson method starts by setting xi = 0 for all edges i
and checks if there is a path from s to t such that the capacity of each edge on
it is 1. If there is such a path, the method adds 1 to the flow value of the edges
that point (from tail to head) in the direction of this path and subtracts 1 from
the flow values of edges that point in the opposite direction. Given the new
flow value on each edge, it constructs a residual graph where the capacity of
each edge is updated to reflect how much additional flow can still be pushed
through it, and the algorithm repeats. If there is no path left between s and t in
the residual graph, it stops and outputs the current flow values.
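To make the description above concrete, here is a minimal Python sketch of the Ford-Fulkerson method in the unit-capacity case, with breadth-first search used to find augmenting paths (the Edmonds-Karp rule); the function name and graph encoding are our own illustration, not from the book.

```python
from collections import deque

def ford_fulkerson_unit(n, edges, s, t):
    """Maximum s-t flow when every edge has capacity 1.

    `edges` is a list of directed pairs (u, v) on vertices 0..n-1.
    A residual capacity of 1 on (u, v) means one more unit of flow
    can be pushed from u to v.
    """
    residual = {}
    for (u, v) in edges:
        residual[(u, v)] = residual.get((u, v), 0) + 1
        residual.setdefault((v, u), 0)  # reverse (residual) arc

    def augmenting_path():
        # BFS from s in the residual graph, keeping parent pointers.
        parent, queue = {s: None}, deque([s])
        while queue:
            u = queue.popleft()
            for v in range(n):
                if residual.get((u, v), 0) > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            return None
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        return path

    flow = 0
    while (path := augmenting_path()) is not None:
        for (u, v) in path:
            residual[(u, v)] -= 1  # push one unit along the path
            residual[(v, u)] += 1  # and allow it to be undone later
        flow += 1
    return flow

# Two edge-disjoint s-t paths, so the maximum flow value is 2.
print(ford_fulkerson_unit(4, [(0, 1), (1, 3), (0, 2), (2, 3)], 0, 3))
```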
The fact that the algorithm always outputs a maximum s−t-flow is nontrivial
and a consequence of duality; in particular, of the max-flow min-cut theorem
that states that the maximum amount of flow that can be pushed from s to t is
equal to the minimum number of edges in G whose deletion leads to discon-
necting s from t. This latter problem is referred to as the s − t-minimum cut
problem and is the dual of the s − t-maximum flow problem. Duality gives a
way to certify that we are at an optimal solution and, if not, suggests a way to
improve the current solution.
It is not hard to see that the Ford-Fulkerson method generalizes to the setting
of nonnegative and integral capacities: now the flow values are
x_i ∈ {−U, . . . , −1, 0, 1, . . . , U}
for some U ∈ Z≥0 . However, the running time of the Ford-Fulkerson method
in this general capacity case depends linearly on U. As the number of bits
required to specify U is roughly log U, this is not a polynomial time algorithm.
Following the work of Ford and Fulkerson (1956), a host of combinatorial
algorithms for the s−t-maximum flow problem were developed. Roughly, each
of them augments the flow in the graph iteratively in an increasingly faster, but
combinatorial, manner. The first polynomial time algorithms were by Dinic
(1970) and by Edmonds and Karp (1972) who used breadth-first search to aug-
ment flows. This line of work culminated in an algorithm by Goldberg and
Rao (1998) that runs in Õ(m · min{n^{2/3}, m^{1/2}} · log U) time. Note that unlike the
Ford-Fulkerson method, these latter combinatorial algorithms are polynomial
time: they find the exact solution to the problem and run in time that depends
polynomially on the number of bits required to describe the input. However,
since the result of Goldberg and Rao (1998), there was no real progress on im-
proving the running times for algorithms for the s − t-maximum flow problem
until 2011.
One problem with the application of gradient descent is that the convex pro-
gram in (1.2) has constraints {x ∈ Rm : Bx = F(e s − et )} and, hence, the
direction gradient descent asks us to move can take us out of this set. A way
to get around this is to project the gradient of the objective function on to
the subspace {x ∈ Rm : Bx = 0} at every step and move in the direction of
the projected gradient. However, this projection step requires solving a least
squares problem which, in turn, reduces to the numerical problem of solving a
linear system of equations. While one can appeal to the Gaussian elimination
method for this latter task, it is not fast enough to warrant improvements over
combinatorial algorithms mentioned earlier. Here, a major result discovered
by Spielman and Teng (2004) implies that such a projection can, in fact, be
computed in time Õ(m). This is achieved by noting that the linear system that
arises when projecting a vector onto the subspace {x ∈ R^m : Bx = 0} is the
same as solving a Laplacian system of the form BB⊤y = a (for a
given vector a), where B is a vertex-edge incidence matrix of the given graph.
Such a result is not known for general linear systems and (implicitly) relies on
the combinatorial structure of the graph that gets encoded in the matrix B.
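The projection step described above can be sketched in a few lines: solving BB⊤y = Bg is exactly the Laplacian system mentioned in the text, although below a dense least-squares solve stands in for a fast Laplacian solver, and the incidence matrix is a toy example of ours.

```python
import numpy as np

def project_to_kernel(B, g):
    """Project g onto {x : Bx = 0} by solving B B^T y = B g."""
    y = np.linalg.lstsq(B @ B.T, B @ g, rcond=None)[0]
    return g - B.T @ y

# Vertex-edge incidence matrix of the path 0 - 1 - 2
# (rows: vertices, columns: edges, entries +1/-1 at the endpoints).
B = np.array([[ 1.0,  0.0],
              [-1.0,  1.0],
              [ 0.0, -1.0]])
g = np.array([1.0, 3.0])
p = project_to_kernel(B, g)
print(p, B @ p)  # B @ p is numerically zero
```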
Thus, roughly speaking, in each iteration, the projected gradient descent al-
gorithm takes a point xt in the space of all s − t-flows of value F, moves to-
wards the set B∞ along the negative gradient of the objective function, and then
projects the new point back to the linear space; see Figure 1.2 for an illustra-
tion. While each iterate is an s − t-flow, it need not be a feasible flow.
Figure 1.2 An illustration of one step of the projected gradient descent in the
algorithm by Lee et al. (2013).
A final issue is that such a method may not lead to an exact solution but
only an approximate solution. Moreover, in general, the number of iterations
depends inverse polynomially on the quality of the desired approximation. Lee
et al. (2013) proved the following result: there is an algorithm that, given an
ε > 0, can compute a feasible s−t-flow of value (1−ε)F in time Õ(mn^{1/3}ε^{−2/3}).
If we ignore the ε in their bound, this improved the result of Goldberg and Rao
(1998) mentioned earlier.
We point out that the combinatorial algorithm of Goldberg and Rao (1998)
has the same running time even when the input graph is directed. It is not clear
how to generalize the gradient descent based algorithm for the s − t-maximum
flow problem presented above to directed graphs.
The results of Christiano et al. (2011) and Lee et al. (2013) were further
improved using increasingly sophisticated ideas from continuous optimiza-
tion and finally led to a nearly linear time algorithm for the undirected s − t-
maximum flow problem in a sequence of work by Sherman (2013), Kelner
et al. (2014), and Peng (2016). Remarkably, while these improvements aban-
doned discrete approaches and used algorithms for convex optimization, beat-
ing the running times of combinatorial algorithms leveraged the underlying
combinatorial structure of the s − t-maximum flow problem.
The goal of this book is to enable a reader to gain an in-depth understanding
of algorithms for convex optimization in a manner that allows them to ap-
ply these algorithms in domains such as combinatorial optimization, algorithm
design, and machine learning. The emphasis is to derive various convex opti-
mization methods in a principled manner and to establish precise running time
bounds in terms of the input length (and not just on the number of iterations).
The book also contains several examples, such as the one of s − t-maximum
flow presented earlier, that illustrate the bridge between continuous and dis-
crete optimization. Laplacian solvers are not discussed in this book. The reader
is referred to the monograph by Vishnoi (2013) for more on that topic.
The focus of Chapters 3, 4, 5 is on basics of convexity, computational mod-
els, and duality. Chapters 6, 7, 8 present three different first-order methods:
gradient descent, mirror descent and multiplicative weights update method,
and accelerated gradient descent. In particular, the gradient descent approach
to the s − t-maximum flow problem discussed here is presented in detail as an
application in Chapter 6. In fact, the fastest version of
the method of Lee et al. (2013) uses the accelerated gradient method. Chap-
ter 7 also draws a connection between mirror descent and the multiplicative
weights update method and shows how the latter can be used to design a fast
(approximate) algorithm for the bipartite maximum matching problem. We re-
mark that the algorithm of Christiano et al. (2011) relies on the multiplicative
weights update method.
While duality and integrality do not directly imply a polynomial time al-
gorithm for the s − t-maximum flow problem, the mathematical structure that
enables these properties is relevant to the design of efficient algorithms for this
problem. It is worth mentioning that these ideas were generalized in a major
way by Edmonds (1965a,b) who figured out an integral polyhedral represen-
tation for the matching problem and gave a polynomial time algorithm for
optimizing linear functions over this polyhedron.
For general linear programs, however, integrality does not hold. The reason
is that for a general matrix A, the determinants of submatrices that show up in
the denominator of the vertices of the associated polyhedra may not be 1 or
−1. However, (for A with integer entries) these determinants cannot be more
than 2^{poly(n,L)} in magnitude, where L is the number of bits required to en-
code A, b, and c. This is a consequence of the fact that the determinant of an
n × n matrix with integer entries bounded by 2^L is no more than n! · 2^{nL}. While there were
combinatorial algorithms for linear programming, e.g., the simplex method of
Dantzig (1990) that moved from one vertex to another, none were known to
run in polynomial time (in the bit complexity) in the worst case.
Figure 1.3 Illustration of one step of the ellipsoid method for the polytope K.
Figure 1.4 Illustration of the interior point method for the polytope K.
where both f and K are convex. Their method outputs a point x̂ such that
f(x̂) ≤ min_{x∈K} f(x) + ε

in time, roughly, poly(n, log(R/ε), T_f, T_K), where R is such that K is contained in a
ball of radius R, time T_f is required to compute the gradient of f, and time T_K
is required to separate a point from K. Thus, this settled convex optimization in
its full generality. However, we emphasize that this result does not imply that
any convex program of the form (1.3) is in P. The reason is that sometimes
it may be impossible to get an efficient gradient oracle for f , or an efficient
separation oracle for K, or a good enough bound on R.
In Chapter 13, we present an algorithm to minimize a convex function over
a convex set and prove the guarantee mentioned above. Subsequently, we show
how this can be used to give a polynomial time algorithm for another combi-
natorial problem – submodular function minimization – given just an eval-
uation oracle to the function. A submodular function f : 2^{[m]} → R has the
diminishing returns property: for sets S ⊆ T ⊆ [m] and an element e ∉ T, the
marginal gain of adding e to S is at least the marginal gain of adding it to T;
that is, f(S ∪ {e}) − f(S) ≥ f(T ∪ {e}) − f(T). The ability to minimize
submodular set functions allows us to obtain separation oracles for matroid
polytopes. Submodular functions arose in discrete optimization, but have
recently found applications in machine learning.
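As a quick illustration of the diminishing returns property, the snippet below checks it exhaustively for a small coverage function, a standard example of a submodular function; the specific sets are our own.

```python
from itertools import combinations

# f(S) = size of the union of the areas covered by the sets in S.
areas = {0: {1, 2, 3}, 1: {3, 4}, 2: {4, 5, 6}, 3: {1, 6}}

def f(S):
    covered = set()
    for i in S:
        covered |= areas[i]
    return len(covered)

# Verify: for all S subset of T and e not in T,
#   f(S + e) - f(S) >= f(T + e) - f(T).
ground, ok = set(areas), True
for r in range(len(ground) + 1):
    for T in map(set, combinations(ground, r)):
        subsets = (set(c) for k in range(len(T) + 1)
                   for c in combinations(sorted(T), k))
        for S in subsets:
            for e in ground - T:
                ok &= f(S | {e}) - f(S) >= f(T | {e}) - f(T)
print(ok)  # True: the coverage function is submodular
```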
Finally, in Chapter 13, we consider convex programs that have recently been
used for solving various counting problems over discrete sets, such as spanning
trees. Given a graph G = (V, E), let T_G denote the set of spanning trees in G
and let PG denote the spanning tree polytope, i.e., the convex hull of indicator
vectors of all the spanning trees in TG . Each vertex of PG corresponds to a
spanning tree in G. The problem that we consider is the following: given a
point θ ∈ PG , find a way to write θ as a convex combination of the vertices
of the polytope PG so that the probability distribution corresponding to this
convex combination maximizes Shannon entropy; see Figure 1.5. To see what
this problem has to do with counting spanning trees, the reader is encouraged
to check that if we let θ be the average of all the vertex vectors of PG , the value
of this optimization problem is exactly log |TG |.
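One way to sanity-check this claim numerically: the uniform distribution over T_G is feasible for this θ and has entropy log |T_G|, and Kirchhoff's matrix-tree theorem (a classical fact, not developed in this chapter) computes |T_G| as a cofactor of the Laplacian. A sketch for the complete graph K_n, with our own choice of n:

```python
import numpy as np
from math import log

n = 5
L = n * np.eye(n) - np.ones((n, n))          # Laplacian of K_n
num_trees = round(np.linalg.det(L[1:, 1:]))  # matrix-tree theorem
print(num_trees, n ** (n - 2))               # 125 125 (Cayley's formula)
print(log(num_trees))                        # entropy of the uniform distribution
```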
As stated, this is an optimization problem where there is a variable corre-
sponding to each vertex of the polytope, the constraints on these variables are
linear, and the objective function maximizes a concave function; see Figure
1.5. Thus, this is a convex program. Note, however, that |TG | can be exponen-
tial in the number of vertices in G; the complete graph on n vertices has n^{n−2}
spanning trees. Thus, the number of variables can be exponential in the input
size and it is not clear how to solve this problem. Interestingly, if one considers
the dual of this convex optimization problem, the number of variables becomes
the number of edges in the graph.
However, there are obstacles to applying the general convex optimization
method to this setting and this is discussed in detail in Chapter 13. In par-
ticular, Chapter 13 presents a polynomial time algorithm for the maximum
entropy problem over polytopes due to Singh and Vishnoi (2014); Straszak
and Vishnoi (2019). Such algorithms have been used to design very general
approximate counting algorithms for discrete problems by Anari and Oveis
Gharan (2017); Straszak and Vishnoi (2017), and have enabled breakthrough
results for the traveling salesman problem by Oveis Gharan et al. (2011);
Karlin et al. (2020).
max −∑_i p_i log p_i
subject to ∑_i p_i v_i = θ,
           ∑_i p_i = 1,
           ∀i, p_i ≥ 0.

Figure 1.5 The maximum entropy problem and its convex program.
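As an illustration only (this is not the algorithm of Chapter 13), the program in Figure 1.5 can be handed to a generic solver for a tiny polytope. Below, the vertices are those of the unit square and θ is its center, so the maximum entropy combination should be the uniform one; the data and solver choice are ours.

```python
import numpy as np
from scipy.optimize import minimize

V = np.array([[0.0, 1.0, 0.0, 1.0],   # vertex coordinates as columns
              [0.0, 0.0, 1.0, 1.0]])
theta = np.array([0.5, 0.5])

def neg_entropy(p):
    return float(np.sum(p * np.log(p + 1e-12)))  # minimize -entropy

constraints = [{"type": "eq", "fun": lambda p: V @ p - theta},
               {"type": "eq", "fun": lambda p: np.sum(p) - 1.0}]
res = minimize(neg_entropy, np.array([0.4, 0.3, 0.2, 0.1]),
               bounds=[(0.0, 1.0)] * 4, constraints=constraints,
               method="SLSQP")
print(res.x)  # close to [0.25, 0.25, 0.25, 0.25]
```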
2
Preliminaries
We review the mathematical preliminaries required for this book. These include some
standard notions and facts from multivariate calculus, linear algebra, geometry, topology,
dynamical systems, and graph theory.
The Hessian is symmetric as the order of i and j does not matter in differentiation.
The definitions of gradient and Hessian presented here easily generalize
to the setting where the function is defined on a subdomain of Rn ; see Chap-
ter 11. As a generalization of the notation used for gradient and Hessian, we
sometimes also use the notation ∇k f (x) instead of Dk f (x).
We now show that differentiation in the multivariate setting can be expressed
2.2 Taylor series
The proof of the following useful lemma is left as an exercise (Exercise 2.2).
Lemma 2.6 (Integral representations of functions) Let f : R^n → R be a
continuously differentiable function. For x, y ∈ R^n, let g : [0, 1] → R be defined
as

g(t) := f(x + t(y − x)).

Then, the following are true:

(i) ġ(t) = ⟨∇f(x + t(y − x)), y − x⟩, and
(ii) f(y) = f(x) + ∫₀¹ ġ(t) dt.

If, in addition, f has a continuous Hessian, then the following are also true:

(i) g̈(t) = (y − x)⊤∇²f(x + t(y − x))(y − x), and
(ii) ⟨∇f(y) − ∇f(x), y − x⟩ = ∫₀¹ g̈(t) dt.
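A quick numerical check of part (ii) of the lemma, for a function, x, and y of our own choosing:

```python
import numpy as np
from scipy.integrate import quad

f = lambda x: float(x @ x)        # f(x) = ||x||^2
grad = lambda x: 2 * x
x, y = np.array([1.0, -2.0]), np.array([0.5, 3.0])

g_dot = lambda t: float(grad(x + t * (y - x)) @ (y - x))  # part (i)
integral, _ = quad(g_dot, 0.0, 1.0)
print(f(y), f(x) + integral)      # both 9.25, as part (ii) asserts
```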
In many interesting cases one can prove that whenever x is close enough to
a, then the higher-order terms do not contribute much to the value of f (x)
and, hence, the second-order (or even first-order) approximation gives a good
estimate of f (x).
Sometimes, a matrix A may be square but not of full rank, or A may be full
rank but not square. In both cases, we can extend the notion of inverse.
The (Moore-Penrose) pseudoinverse of A, denoted A⁺, is a matrix that satisfies:

(i) AA⁺A = A,
(ii) A⁺AA⁺ = A⁺,
(iii) (AA⁺)⊤ = AA⁺, and
(iv) (A⁺A)⊤ = A⁺A.
The pseudoinverse can be shown to always exist. Moreover, if A has linearly
independent columns, then

A⁺ = (A⊤A)⁻¹A⊤

and, if A has linearly independent rows, then

A⁺ = A⊤(AA⊤)⁻¹;

see Exercise 2.10.
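The two closed forms can be checked against a numerical pseudoinverse; the matrix below is our own example with linearly independent columns.

```python
import numpy as np

A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [0.0, 2.0]])
left = np.linalg.inv(A.T @ A) @ A.T          # A+ = (A^T A)^(-1) A^T
print(np.allclose(left, np.linalg.pinv(A)))  # True
print(np.allclose(left @ A, np.eye(2)))      # A+ A = I here
```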
An important subclass of symmetric matrices are positive semidefinite ma-
trices, as defined below.
Definition 2.11 (Positive semidefinite matrix) A real symmetric matrix M
is said to be positive semidefinite (PSD) if for all x ∈ R^n, x⊤Mx ≥ 0. This is
denoted by M ⪰ 0. M is said to be positive definite (PD) if x⊤Mx > 0 holds
for all nonzero x ∈ R^n. This is denoted by M ≻ 0.
For instance, the identity matrix I is PD, while the diagonal 2×2 matrix with a
1 and a −1 on the diagonal is not PSD. As another example, let

M := [ 2  −1
      −1   1 ],

then

∀x ∈ R², x⊤Mx = 2x₁² − 2x₁x₂ + x₂² = x₁² + (x₁ − x₂)² ≥ 0,
hence M is PSD and in fact PD. Occasionally, we make use of the following
convenient notation: for two symmetric matrices M and N we write M ⪯ N if
and only if N − M ⪰ 0. It is not hard to prove that ⪯ defines a partial order on
the set of symmetric matrices. An n × n real PSD matrix M can be written as
M = BB⊤,

where B is an n × n real matrix with possibly dependent rows. B is sometimes
called the square-root of M and denoted by M^{1/2}. If M is PD, then the rows of
such a B are linearly independent (Exercise 2.6).
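A short numerical illustration of the 2 × 2 example above; the matrix M is from the text, while the random quadratic-form test is our own sanity check.

```python
import numpy as np

M = np.array([[ 2.0, -1.0],
              [-1.0,  1.0]])
print(np.linalg.eigvalsh(M))  # both eigenvalues positive, so M is PD
x = np.random.randn(1000, 2)
quad_forms = np.einsum('ij,jk,ik->i', x, M, x)  # x^T M x for each row x
print((quad_forms >= 0).all())  # True
```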
We now review the notion of eigenvalues and eigenvectors of square matri-
ces.
Since this degree two polynomial is nonnegative, it has at most one real zero
and its discriminant must be less than or equal to zero, implying that

(∑_{i=1}^n x_i y_i)² − (∑_{i=1}^n x_i²)(∑_{i=1}^n y_i²) ≤ 0,
2.5 Norms
So far we have talked about the ℓ₂ (Euclidean) norm and proved the Cauchy-
Schwarz inequality for this norm. We now present the general notion of a norm
and give several examples used throughout this book. We conclude with a
generalization of the Cauchy-Schwarz inequality to general norms.
Geometrically, a norm is a way to tell how long a vector is. However, not
every such function is a norm, and the following definition formalizes what it
means to be a norm.
Definition 2.16 (Norm) A norm is a function ‖·‖ : R^n → R that satisfies,
for every u, v ∈ R^n and c ∈ R,

(i) ‖c · u‖ = |c| · ‖u‖,
(ii) ‖u + v‖ ≤ ‖u‖ + ‖v‖, and
(iii) ‖u‖ = 0 if and only if u = 0.
An important class of norms are the ℓ_p-norms for p ≥ 1. Given a p ≥ 1, for
u ∈ R^n, define

‖u‖_p := (∑_{i=1}^n |u_i|^p)^{1/p}.

It can be seen that this is a norm for p ≥ 1. Another class of norms are those
induced by PD matrices. Given an n × n real PD matrix A and u ∈ R^n, define

‖u‖_A := √(u⊤Au).

This can also be shown to be a norm.
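Both families of norms are one-liners to evaluate; the vector and PD matrix below are our own examples.

```python
import numpy as np

u = np.array([3.0, -4.0])
for p in (1, 2, 3):  # l_p norms; p = 2 is the Euclidean norm (5.0)
    print(p, np.sum(np.abs(u) ** p) ** (1.0 / p))

A = np.array([[2.0, 0.0],
              [0.0, 1.0]])  # a PD matrix
print(np.sqrt(u @ A @ u))   # ||u||_A = sqrt(u^T A u)
```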
Definition 2.17 (Dual norm) Let ‖·‖ be a norm in R^n; then the dual norm,
denoted by ‖·‖*, is defined as

‖v‖* := sup{⟨u, v⟩ : u ∈ R^n, ‖u‖ ≤ 1}.

Unless stated otherwise in this book, ‖·‖ will denote ‖·‖₂, the Euclidean norm.
An open ball generalizes the concept of an open interval over real numbers.
A set K ⊆ Rn is said to be open if every point in K is the center of an open
ball contained in K. A set K ⊆ Rn is said to be bounded if there exists an
0 ≤ r < ∞ such that K is contained in an open ball of radius r. A set K ⊆ Rn
is said to be closed if R^n \ K is open. A closed and bounded set K ⊆ R^n is also
sometimes referred to as compact. A point x ∈ Rn is a limit point of a set K
if every open set containing x contains at least one point of K different from
x itself. It can be shown that a set is closed if and only if it contains all of its
limit points. The closure of a set K consists of all points in K along with all
limit points of K. A function f : Rn → R is said to be closed if it maps every
closed set K ⊆ Rn to a closed set.
A point x ∈ K ⊆ Rn is said to be in the interior of K, if some ball of
positive radius containing x is contained in K. For a set K ⊆ Rn let ∂K denote
its boundary: it is the set of all points x in the closure of K that are not in the
interior of K. The point x ∈ K is said to be in the relative interior of K if
2.7 Dynamical systems
the above definition is not quite correct, as F(x) might land outside of Ω – this
is a manifestation of the “existence of solution” issue for discrete systems –
sometimes in order to achieve existence one has to take h very small, and in
some cases it might not be possible at all.
2.8 Graphs
An undirected graph G = (V, E) consists of a finite set of vertices V and a set of
edges E. Each edge e ∈ E is a two element subset of V. A vertex v is said to
be incident to e if v ∈ e. A loop is an edge for which both endpoints are the
same. Two edges are said to be parallel if their endpoints are identical. A graph
without loops and parallel edges is said to be simple. A simple, undirected
graph that has an edge between every pair of vertices is called the complete
graph and is denoted by KV or Kn where n := |V|. For a vertex v ∈ V, N(v)
denotes the set of vertices u such that {u, v} is an edge in E. The degree of
a vertex v is denoted by dv := |N(v)|. A subgraph of a graph G = (V, E) is
simply a graph H = (U, F) such that U ⊆ V and F ⊆ E where the edges in F
are only incident to vertices in U.
A graph G = (V, E) is said to be bipartite if the vertex set V can be parti-
tioned into two parts V1 and V2 and all edges contain one vertex from V1 and
one vertex from V2 . If there is an edge between every vertex in V1 to every ver-
tex in V2 , G is called a complete bipartite graph and denoted by Kn1 ,n2 where
n1 := |V1 | and n2 := |V2 |.
A directed graph is one where the edges E ⊆ V × V are 2-tuples of vertices.
We denote an edge by e = (u, v) where u is the “tail” of the edge e and v is
its “head”. The edge is “outgoing” at u and “incoming” at v. When the context
is clear, we sometimes use e = uv = {u, v} (undirected) or e = uv = (u, v)
(directed) to denote an edge.
For a graph G with adjacency matrix A and diagonal degree matrix D (with
D_vv = d_v), the Laplacian of G is defined as

L := D − A.
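A minimal construction of L = D − A for a small graph; the encoding is our own.

```python
import numpy as np

def laplacian(n, edges):
    """L = D - A for an undirected simple graph on vertices 0..n-1."""
    A = np.zeros((n, n))
    for u, v in edges:
        A[u, v] = A[v, u] = 1.0
    D = np.diag(A.sum(axis=1))  # diagonal matrix of degrees
    return D - A

L = laplacian(3, [(0, 1), (1, 2)])  # the path on 3 vertices
print(L)
print(L @ np.ones(3))  # the all-ones vector is in the kernel of L
```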
2.9 Exercises
2.1 For each of the following functions, compute the gradient and the Hes-
sian, and write the second-order Taylor approximation.
(a) f(x) = ∑_{i=1}^m (a_i⊤x − b_i)² for x ∈ Q^n, where a₁, . . . , a_m ∈ Q^n and
b₁, . . . , b_m ∈ Q.
(b) f(x) = log ∑_{j=1}^m e^{⟨x, v_j⟩}, where v₁, . . . , v_m ∈ Q^n.
2.13 Prove that, for an n × n PD matrix A, ‖u‖_A := √(u⊤Au)
is a norm. What aspect of being a norm for ‖u‖_A breaks down when A
is just guaranteed to be PSD (and not PD)? And when A has negative
eigenvalues?
2.14 Prove that, for all p, q ≥ 1 with 1/p + 1/q = 1, the norm ‖·‖_q is the dual
norm of ‖·‖_p.
x₁^{(k+1)} = x₁^{(k)}(1 − h) + h · l₂/(x₁^{(k)}l₂ + x₂^{(k)}l₁),
x₂^{(k+1)} = x₂^{(k)}(1 − h) + h · l₁/(x₁^{(k)}l₂ + x₂^{(k)}l₁),    (2.2)
x₁^{(0)} = x₂^{(0)} = 1.
2.21 For a simple, connected, and undirected graph G = (V, E), let
Π = B⊤L⁺B,
2.22 Let G = (V, E) be a connected, undirected graph with edge set
E = {e₁, . . . , e_m}. Consider the following algorithm (a runnable sketch
appears after this exercise):
• Set T = E.
• For i = 1, 2, . . . , m
– If the graph (V, T \ {ei }) is connected, set T := T \ {ei }.
• Output T .
Is the algorithm correct? If yes, prove its correctness; otherwise provide
a counterexample.
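For experimentation, here is a runnable rendering of the algorithm above with a naive DFS connectivity test; the encodings and helper names are ours, and the snippet does not by itself answer the correctness question.

```python
def connected(vertices, edges):
    """DFS check that the graph (vertices, edges) is connected."""
    adj = {v: set() for v in vertices}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, stack = set(), [next(iter(vertices))]
    while stack:
        u = stack.pop()
        if u not in seen:
            seen.add(u)
            stack.extend(adj[u] - seen)
    return seen == set(vertices)

def delete_redundant_edges(vertices, edges):
    T = list(edges)
    for e in edges:                        # i = 1, 2, ..., m
        rest = [f for f in T if f != e]
        if connected(vertices, rest):      # e is not needed
            T = rest
    return T

print(delete_redundant_edges({0, 1, 2}, [(0, 1), (1, 2), (0, 2)]))
```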
2.23 Let T = (V, E) be a tree on n vertices V and let L denote its Laplacian.
Given a vector b ∈ Rn , design an algorithm for solving linear systems
of the form Lx = b in time O(n). Hint: use Gaussian elimination but
carefully consider the order in which you eliminate variables.
2.24 Let G = (V, E, w) be a connected, weighted, undirected graph. Let n :=
|V| and m := |E|. We use LG to denote the corresponding Laplacian. For
any subgraph H of G we use the notation LH to denote its Laplacian.
(a) Let T = (V, F) be a connected subgraph of G. Let P_T be any
square matrix satisfying

L_T⁺ = P_T P_T⊤.

Prove that

x⊤P_T⊤L_G P_T x ≥ x⊤x

for all x ∈ R^n satisfying ⟨x, 1⟩ = 0.
(b) Prove that

Tr(L_T⁺ L_G) = ∑_{e∈E} w_e b_e⊤ L_T⁺ b_e.

Prove that if T is the maximum weight spanning tree of G, then
Tr(L_T⁺ L_G) ≤ m(n − 1).
(e) Deduce that if T is a maximum weight spanning tree of G and P_T
is any matrix such that P_T P_T⊤ = L_T⁺, then the condition number of
the matrix P_T L_G P_T⊤ is at most m(n − 1). The condition number
of P_T L_G P_T⊤ is defined as

λ_n(P_T L_G P_T⊤) / λ_2(P_T L_G P_T⊤),

where λ_n(P_T L_G P_T⊤) denotes the largest eigenvalue of P_T L_G P_T⊤ and
λ_2(P_T L_G P_T⊤) is the smallest nonzero eigenvalue of P_T L_G P_T⊤. How
large can the condition number of L_G be?
2.25 Let G = (V, E) (with n := |V| and m := |E|) be a connected, undirected
graph and L be its Laplacian. Take two vertices s ≠ t ∈ V, let χ_st := e_s − e_t
(where e_v is the standard basis vector for a vertex v) and let x ∈ R^n be such
that Lx = χ_st. Define the distance between two vertices s, t ∈ V as:

d(s, t) := x_s − x_t.
For s = t we set d(s, t) = 0. Let k be the value of the minimum s − t
cut, i.e., the minimum number of edges one needs to remove from G to
disconnect s from t. Prove that
k ≤ √(m / d(s, t)).
2.26 A matrix is said to be totally unimodular if each of its square submatri-
ces has determinant 0, +1, or −1. Prove that, for a graph G = (V, E), any
of its vertex-edge incidence matrices B is totally unimodular.
2.27 Prove that the matching polytope of a bipartite graph G = (V, E) can be
equivalently written as:
P_M(G) = { x ∈ R^E : x ≥ 0, ∑_{e : v ∈ e} x_e ≤ 1 ∀v ∈ V }.
Notes
The content presented in this chapter is classic and draws from several text-
books. For a first introduction to calculus, including a proof of the fundamental
theorem of calculus (Theorem 2.5), see the textbook by Apostol (1967a). For
an advanced discussion on multivariate calculus, the reader is referred to the
textbook by Apostol (1967b). For an introduction to real analysis (including
introductory topology), the reader is referred to the textbook by Rudin (1987).
Linear algebra and related topics are covered in great detail in the textbook by
Strang (2006). See also the paper by Strang (1993) for an accessible treatment
of the fundamental theorem of linear algebra. The Cauchy-Schwarz inequality has a
wide range of applications; see the book by Steele (2004). For a formal discus-
sion on dynamical systems, refer to the book by Perko (2001). The books by
Diestel (2012) and Schrijver (2002a) provide in-depth introductions to graph
theory and combinatorial optimization respectively.
3
Convexity
We introduce convex sets, notions of convexity, and show the power that comes along
with convexity: convex sets have separating hyperplanes, subgradients exist, and locally
optimal solutions of convex functions are globally optimal.
3.2 Convex functions
Sometimes when working with convex functions f defined over a certain sub-
set K ⊆ Rn we extend them to the whole Rn by setting f (x) = +∞ for all
x < K. One can check that f is then still convex (on Rn ) when the arithmetic
operations on R ∪ {+∞} are interpreted in the only reasonable way. A function
f is said to be concave if − f is convex.
We now provide two different characterizations of convexity that might be
more convenient to apply in certain situations. They require additional smooth-
ness conditions on f (differentiability or twice-differentiability respectively).
Figure 3.1 The first-order convexity condition at the point x₀. Note that in this
one-dimensional case, the gradient of f is just its derivative.
In other words, any tangent to a convex function f lies below the function f
as illustrated in Figure 3.1. Similarly, any tangent to a concave function lies
above the function.
Conversely, suppose the function f satisfies (3.2). Fix x, y ∈ K and λ ∈ [0, 1].
Let z := λx + (1 − λ)y be some point in the convex hull. Note that the first-order
approximation of f around z underestimates both f (x) and f (y). Thus, the two
underestimates are:
f(x) ≥ f(z) + ⟨∇f(z), x − z⟩,    (3.3)
f(y) ≥ f(z) + ⟨∇f(z), y − z⟩.    (3.4)
Multiplying (3.3) by λ and (3.4) by (1 − λ), and summing both inequalities, we
obtain

λ f(x) + (1 − λ) f(y) ≥ f(z) + ⟨∇f(z), λx + (1 − λ)y − z⟩
                     = f(z) + ⟨∇f(z), 0⟩
                     = f(λx + (1 − λ)y).
Here, the first equality comes from Lemma 2.6 and the last inequality comes
from the application of (3.5) in the integral.

∇²f(x) ⪰ 0, ∀x ∈ K.
0 ≤ (1/τ²) ⟨∇f(x_τ) − ∇f(x), x_τ − x⟩ = (1/τ) ⟨∇f(x_τ) − ∇f(x), s⟩
  = (1/τ) ∫₀^τ ⟨∇²f(x + λs)s, s⟩ dλ,
where the last equality is from the second part of Lemma 2.6. We can conclude
the result by letting τ → 0.
Conversely, suppose that ∀x ∈ K, ∇²f(x) ⪰ 0. Then, for any x, y ∈ K, we
have

f(y) = f(x) + ∫₀¹ ⟨∇f(x + λ(y − x)), y − x⟩ dλ
     = f(x) + ⟨∇f(x), y − x⟩ + ∫₀¹ ⟨∇f(x + λ(y − x)) − ∇f(x), y − x⟩ dλ
     = f(x) + ⟨∇f(x), y − x⟩ + ∫₀¹ ∫₀^λ (y − x)⊤∇²f(x + τ(y − x))(y − x) dτ dλ
     ≥ f(x) + ⟨∇f(x), y − x⟩.

The first and third equalities come from Lemma 2.6 and the last inequality uses
the fact that ∇²f is PSD.
One can also show that if ∇²f(x) ≻ 0 for all x ∈ K, then f is strictly convex.
The converse is not true; see Exercise 3.12.
We now introduce the notion of strong convexity that makes the notion of
strict convexity quantitative.
A differentiable function f is said to be σ-strongly convex (with respect to the
Euclidean norm) if, for all x, y,

f(y) ≥ f(x) + ⟨∇f(x), y − x⟩ + (σ/2)·‖y − x‖².
H := {x ∈ R^n : ⟨h, x⟩ = c}
The first consequence of convexity is that for a closed convex set, a “proof” of
non-membership of a point – a separating hyperplane – always exists.
Proof Let x⋆ be the unique point in K that minimizes the Euclidean distance
to y, i.e.,

x⋆ := argmin_{x∈K} ‖x − y‖.
The proof uses the notion of the epigraph of a function f (denoted by epi f):
epi f := {(x, t) ∈ R^n × R : t ≥ f(x)}.
If h_{n+1} ≠ 0, then by dividing the above by h_{n+1} (and reversing the sign, since
h_{n+1} < 0), we obtain

f(z) ≥ f(x) − (1/h_{n+1})⟨h, z − x⟩.

If instead h_{n+1} = 0, then

⟨h, z − x⟩ ≤ 0, i.e., ⟨h, z⟩ ≤ ⟨h, x⟩,

for all z ∈ K. However, x is in the relative interior of K and, hence, the above
cannot be true. Thus, h_{n+1} must be nonzero and we deduce the theorem.
Theorem 3.13 (Optimality in the unconstrained setting) If f : R^n → R is a
convex, differentiable function, then, for x ∈ R^n,

f(x) ≤ f(y) ∀y ∈ R^n

if and only if

∇f(x) = 0.
Proof Suppose that f(x) ≤ f(y) for all y ∈ R^n. Then, for every v ∈ R^n and
t > 0,

f(x + tv) ≥ f(x)

and, hence,

0 ≤ lim_{t→0} (f(x + tv) − f(x))/t = ⟨∇f(x), v⟩.

Since v ∈ R^n is chosen arbitrarily, if we let v := −∇f(x), we deduce that
∇f(x) = 0.
We proceed with the proof in the opposite direction: if ∇ f (x) = 0 for some
x, then x is a minimizer for f . First-order convexity (Theorem 3.3) of f implies
that ∀y ∈ R^n

f(y) ≥ f(x) + ⟨∇f(x), y − x⟩ = f(x) + ⟨0, y − x⟩ = f(x).

Hence, if ∇f(x) = 0, then f(y) ≥ f(x) ∀y ∈ R^n, proving the claim.
Note that, in the constrained setting (i.e., K ≠ R^n), Theorem 3.13 does not
necessarily hold. However, one can prove the following generalization that
holds for all convex functions, differentiable or not.
Theorem 3.14 (Optimality in the constrained setting) If f : K → R is a
convex and differentiable function defined on a convex set K, then

f(x) ≤ f(y) ∀y ∈ K

if and only if

⟨∇f(x), y − x⟩ ≥ 0 ∀y ∈ K.
The proof of this theorem is left for the reader (Exercise 3.19). We conclude
by noting that if ∇f(x) ≠ 0, then −∇f(x) gives us a supporting hyperplane for
∂K at the point x.
3.4 Exercises
3.1 Is the intersection of two convex sets convex? Is the union of two convex
sets convex?
3.2 Prove that the following sets are convex.
S + T := {x + y : x ∈ S , y ∈ T }.
3.15 Let B ∈ R^{n×n} be a symmetric matrix. Consider the function f(x) = x⊤Bx.
Write the gradient and the Hessian of f(x). When is f convex and when
is it concave?
3.16 Prove that if A is an n×n PD matrix with smallest eigenvalue λ₁, then the
function f(x) = x⊤Ax is λ₁-strongly convex with respect to the ℓ₂-norm.
3.17 Consider the generalized negative entropy function f(x) = ∑_{i=1}^n (x_i log x_i − x_i)
over R^n_{>0}.
(a) Write the gradient and Hessian of f.
(b) Prove f is strictly convex.
(c) Prove that f is not strongly convex with respect to the ℓ₂-norm.
(d) Write the Bregman divergence D_f. Is D_f(x, y) = D_f(y, x) for all
x, y ∈ R^n_{>0}?
(e) Prove that f is 1-strongly convex with respect to the ℓ₁-norm when
restricted to points in the subdomain {x ∈ R^n_{>0} : ∑_{i=1}^n x_i = 1}.
3.18 Prove that for a nonempty, closed, and convex set K, argmin_{x∈K} ‖x − y‖
exists.
3.19 Prove Theorem 3.14.
Notes
The study of convexity is central to a variety of mathematical, scientific, and
engineering disciplines. For a detailed discussion on convex sets and convex
functions, the reader is referred to the classic textbook by Rockafellar (1970).
The textbook by Boyd and Vandenberghe (2004) also provides a comprehen-
sive (and modern) treatment of convex analysis. The book by Barvinok (2002)
provides an introduction to convexity with various mathematical applications,
e.g., to lattices. The book by Krantz (2014) introduces analytical (real and
complex) tools for studying convexity. Many results in this book hold in the
nondifferentiable setting by resorting to subgradients; see the book by Boyd
and Vandenberghe (2004) for more on the calculus of subgradients.
4
Convex Optimization and Efficiency
We present the notion of convex optimization and discuss formally what it means to
solve a convex program efficiently as a function of the representation length of the
input and the desired accuracy.
System of linear equations. Suppose we are given the task to find a solution
to a system of linear equations Ax = b, where A ∈ R^{m×n} and b ∈ R^m. The tradi-
tional method to solve this problem is via Gaussian elimination. Interestingly,
the problem of solving a system of linear equations can also be formulated as
the following convex program:

min_{x∈R^n} ‖Ax − b‖₂².
Linear programming. In a linear program, for given c ∈ R^n, A ∈ R^{m×n}, and
b ∈ R^m, the goal is to solve

min_{x∈R^n} ⟨c, x⟩
subject to Ax = b    (4.2)
           x ≥ 0.
The objective function ⟨c, x⟩ is a linear function, which is (trivially) convex.
The set of points x satisfying Ax = b, x ≥ 0, is a polyhedron, which is also a
convex set. We now illustrate how the problem of finding the shortest path in a
graph between two nodes can be encoded as a linear program.
subject to x ≥ 0,

∀i ∈ V,  ∑_{j∈V} x_{ij} − ∑_{j∈V} x_{ji} =
          1   if i = s,
          −1  if i = t,
          0   otherwise.
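As a sketch, these constraints can be handed to an off-the-shelf LP solver. The objective is not shown in the excerpt above, so we assume unit arc costs (i.e., minimize the number of arcs carrying flow); the example graph is ours.

```python
import numpy as np
from scipy.optimize import linprog

arcs = [(0, 1), (1, 2), (0, 2), (1, 3), (2, 3)]
n, s, t = 4, 0, 3
c = np.ones(len(arcs))           # assumed unit cost per arc

A_eq = np.zeros((n, len(arcs)))  # flow conservation: out minus in
for j, (u, v) in enumerate(arcs):
    A_eq[u, j] += 1.0
    A_eq[v, j] -= 1.0
b_eq = np.zeros(n)
b_eq[s], b_eq[t] = 1.0, -1.0

res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * len(arcs))
print(res.fun)  # 2.0: the shortest s-t path uses two arcs
print(res.x)    # an optimal (here integral) flow
```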
We now examine some specific instances of this question and understand how
to specify the input (x, K) to the algorithm. While for x, we can imagine writing
it down in binary (if it is rational), K is an infinite set and we need to work
with some kind of finite representation. In this process, we introduce several
important concepts which are useful later on.
K := {y ∈ R^n : ⟨a, y⟩ ≤ b}
And finally, for inputs that are collections of rational vectors, numbers, matri-
ces, etc., we set
L(x, a, b) := L(x) + L(a) + L(b),
Example: Intersections and polyhedra. Note that being able to answer mem-
bership questions about two sets K1 , K2 ⊆ Rn allows us to answer questions
about their intersection K1 ∩ K2 . For instance, consider a polyhedron, i.e., a set
of the form

K := {y ∈ R^n : ⟨a_i, y⟩ ≤ b_i for all i = 1, 2, . . . , m},

i.e., an intersection of m halfspaces. Membership of a point x in K can then be
checked directly from the inequalities, in time polynomial in the bit complexity

L(x, a₁, b₁, a₂, b₂, . . . , a_m, b_m).
K := {y ∈ R^n : ‖y‖₁ ≤ r}
and one can argue that none of the 2^n (exponentially many) different linear
inequalities is redundant. Thus, using the method from the previous example,
our algorithm would not be efficient. However, in this case one can just check
whether

∑_{i=1}^n |x_i| ≤ r.
This requires only O(n) arithmetic operations and can be efficiently imple-
mented on a TM.
Example: Spanning tree polytope. Given an undirected and simple (no re-
peated edge or loop) graph G = (V, E) with n := |V| and m := |E| ≤ n(n − 1)/2,
let T ⊆ 2^{[m]} denote the set of all spanning trees in G. A spanning tree in a
graph is a subset of edges that contains no cycles and connects all the vertices.
Define the spanning tree polytope corresponding to G,
P_G ⊆ R^m, as follows:
PG := conv{1T : T ∈ T },
where conv denotes the convex hull of a set of vectors and 1T ∈ Rm is the
indicator vector of a set T ⊆ [m].2 In applications, the input is just the graph
G = (V, E) whose size is O(n+m). PG can have an exponential (in n) number of
vertices. Further, it is not easy, but one can show that the number of inequalities
describing P_G could also be exponential in n. Nevertheless, given x ∈ Q^m
and the graph G, checking whether x ∈ PG can be done in polynomial time in
the number of bits needed to represent x and G, and is discussed more in the
next section.
Example: PSD matrices. Consider now K ⊆ Rn×n to be the set of all sym-
metric PSD matrices. Is it algorithmically easy to check whether X ∈ K? In
Exercise 3.6 we proved that K is convex. Recall that for a symmetric matrix
X ∈ Q^{n×n},

X ∈ K ⇔ ∀y ∈ R^n, y⊤Xy ≥ 0.
Thus, K is defined as an intersection of infinitely many halfspaces – one for
every vector y ∈ Rn . Clearly, it is not possible to go over all such y and check
them one by one, thus a different approach for checking if X ∈ K is required.
Recall that by λ1 (X) we denote the smallest eigenvalue of X (note that because
of symmetry, all its eigenvalues are real). It holds that
X ∈ K ⇔ λ1 (X) ≥ 0.
Recall from basic linear algebra that eigenvalues of X are roots of the poly-
nomial det(λI − X). Hence, one can just try to compute all of them and check
if they are nonnegative. However, computing roots of a polynomial is nontriv-
ial. On the one hand, there are efficient procedures to compute eigenvalues of
matrices. However, they are all approximate since eigenvalues are typically
irrational numbers and, thus, cannot be represented exactly in binary. More
precisely, there are algorithms to compute a sequence of numbers λ′₁, λ′₂, . . . , λ′ₙ
such that |λᵢ − λ′ᵢ| < ε, where λ₁, λ₂, . . . , λₙ are the eigenvalues of X and ε ∈ (0, 1) is
the specified precision. Such algorithms can be made to run in time polynomial in
the bit complexity L(X) and log(1/ε). Consequently, we can check whether X ∈ K
“approximately”. This kind of approximation often suffices in applications.
2 Recall that the indicator vector 1T ∈ {0, 1}m of a set T ⊆ [m] is defined as 1T (i) = 1 for all
i ∈ T and 1T (i) = 0 otherwise.
Figure: a hyperplane {y : ⟨a, y⟩ = b} separating a point x from the convex set K.
Example: Spanning tree polytope. Recall the spanning tree polytope from
the previous section. By definition, to solve the membership problem for this
polytope, it is sufficient to construct a polynomial time separation oracle for
it. Surprisingly, one can prove that there exists a separation oracle for this
polytope that runs in time polynomial in the number of bits needed to encode
the graph G and that only needs access to G – we discuss this and other similar
polytopes in Chapter 13.
In both cases we are correct approximately. Here, the quality of the approxi-
mation depends on the error in our estimate for λ1 (X); we let the reader figure
out the details.
Composing separation oracles. Suppose that K1 and K2 are two convex sets
and that we have access to separation oracles for them. Then we can construct
a separation oracle for their intersection K := K1 ∩ K2 . Indeed, given a point x
we can just check whether x belongs to both of K₁, K₂ and if not, say x ∉ K₁,
then we output the separating hyperplane output by the oracle for K₁. Using a
similar idea, we can also construct a separation oracle for K₁ × K₂ ⊆ R^{2n}.
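A hedged sketch of this composition, using Euclidean balls as stand-in convex sets (for a ball, a separating hyperplane is easy to write down); the oracle convention below is our own.

```python
import numpy as np

# Convention: an oracle returns None if x is in the set, and otherwise
# a pair (a, b) with <a, x> > b while <a, y> <= b for all y in the set.
def ball_oracle(radius):
    def oracle(x):
        norm = np.linalg.norm(x)
        if norm <= radius:
            return None
        return x / norm, radius  # {y : <a, y> <= radius} contains the ball
    return oracle

def intersection_oracle(oracle1, oracle2):
    def oracle(x):
        return oracle1(x) or oracle2(x)  # first violated set answers
    return oracle

K = intersection_oracle(ball_oracle(1.0), ball_oracle(2.0))
print(K(np.array([0.5, 0.5])))  # None: inside both balls
print(K(np.array([1.5, 0.0])))  # hyperplane separating x from the unit ball
```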
Given ε > 0, compute c ∈ Q such that min_{x∈K} f(x) ∈ [c, c + ε]?    (4.3)

Since it requires roughly log(1/ε) bits to store ε, we expect an efficient algorithm
Optimal value vs. optimal point. Perhaps the most natural way to approach
problem (4.3) is to try to find a point x ∈ K such that y⋆ ≤ f(x) ≤ y⋆ + ε.
Then we can simply take c := f(x) − ε. Note that providing such a point x ∈ K
is a significantly stronger solution concept (than just outputting an appropriate
c) and provides much more information. The algorithms we consider in this
book will typically output both the optimal point as well as the value, but there
are examples of methods which naturally compute the optimal value only, and
recovering the optimal input for them is rather nontrivial, if not impossible; see
notes.
f(x) = ⟨c, x⟩

and a polyhedron

K = {x ∈ R^n : Ax ≤ b},

where A ∈ Q^{m×n} and b ∈ Q^m. It is not hard to show that whenever min_{x∈K} f(x)
is bounded (i.e., the solution is not −∞), and K has a vertex, then its optimal
value is achieved at some vertex of K. Thus, in order to solve a linear program,
it is enough to find the minimum value over all its vertices.
It turns out that every vertex x̃ of K can be represented as the solution to
A′x = b′, where A′ is some square submatrix of A and b′ is the corresponding
subvector of b. In particular, this means
Moreover, if the bit complexity of (A, b) is L then the bit complexity of the
optimal solution is polynomial in L (and n) and hence a priori, there is no
direct obstacle (from the viewpoint of bit complexity) for obtaining efficient
algorithms for exactly solving linear programs.
Black box models. Unlike the examples mentioned above where f is specified
“explicitly”, now consider perhaps the most primitive – or black box – access
that we can hope to have for a function:
Given x ∈ Qn , output f (x).
Note that such a black box, or evaluation oracle, is in fact a complete descrip-
tion of the function, i.e., it hides no information about f; however, if we have
no analytical handle on f, such a description might be hard to work with.
In fact, if we drop the convexity assumption and only assume that f is contin-
uous, then one can show that no algorithm can efficiently optimize f ; see notes
for a pointer. Under the convexity assumption the situation improves, but still,
typically more information about the local behavior of f around a given point
is desired to optimize f efficiently. If the function f is sufficiently smooth one
can ask for gradients, Hessians and possibly higher-order derivatives of f . We
formalize this concept in the definition below.
60 Convex Optimization and Efficiency
Remark 4.6 We note that in the oracle model, while we are not charged for
the time it takes to compute ∇^k f(x) given x, we have to account for the bit
length of the input x to the oracle and the bit length of the output.
E [F(x)] = l(x),
and, thus, F(x) provides an “unbiased estimator” for the value l(x). One can
design a similar unbiased estimator for the gradient ∇l(x). Note that such “ran-
domized oracles” have the advantage of being more computationally efficient
over computing the value or gradient of l(x) – in fact, they run m times faster
than their exact deterministic counterparts. Still, in many cases, one can per-
form minimization given just oracles of this type – for instance, using a method
called stochastic gradient descent.
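A sketch of such a randomized oracle, assuming (as is common, though the exact form of l is not shown in this excerpt) that l(x) is an average of m squared-error terms:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 1000, 5
A, b = rng.standard_normal((m, n)), rng.standard_normal(m)

def grad(x):             # exact gradient of l(x) = (1/m) sum_i (a_i.x - b_i)^2
    return 2 * A.T @ (A @ x - b) / m            # costs O(mn)

def stochastic_grad(x):  # unbiased estimator from one random term
    i = rng.integers(m)
    return 2 * (A[i] @ x - b[i]) * A[i]         # costs O(n)

x = rng.standard_normal(n)
avg = np.mean([stochastic_grad(x) for _ in range(200000)], axis=0)
print(np.allclose(avg, grad(x), atol=0.05))  # True, up to sampling noise
```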
f(x) ≤ y⋆ + ε

is polynomial in the bit complexity L of the input, the dimension n, and log(1/ε).
Similarly, we can talk about polynomial time algorithms in the black box
model – then the part depending on L is not present; we instead require that
the number of oracle calls (to the set K and to the function f) is polynomially
bounded in the dimension and, importantly, in log(1/ε).
One might ask why we insist on the dependency on 1/ε to be poly-logarithmic.
There are several important applications where a polynomial dependency on 1/ε
would not be sufficient. For instance, consider a linear program min_{x∈K} ⟨c, x⟩
with K = {x ∈ R^n : Ax ≤ b}, with A ∈ Q^{m×n} and b ∈ Q^m. For simplicity, assume
there is exactly one optimal solution (a vertex) v⋆ ∈ K of value y⋆ = ⟨c, v⋆⟩.
Suppose we would like to find v⋆ in polynomial time, given an algorithm which
provides only approximately optimal answers.
How small an ε > 0 do we need to take to make sure that an approximate
solution x̃ (with ⟨c, x̃⟩ ≤ y⋆ + ε) can be uniquely "rounded" to the optimal
vertex v⋆? A necessary condition for that is that there is no non-optimal vertex
v ∈ K with

y⋆ < ⟨c, v⟩ ≤ y⋆ + ε.
Thus, we should ask: what is the minimum nonzero distance in value between
two distinct vertices of K? By our discussion on linear programming in Sec-
3 Such a rounding algorithm is quite nontrivial and is based on lattice basis reduction.
4.6 Exercises
4.1 Prove that, when A is a full rank n × n matrix and b ∈ R^n, then the unique
minimizer of the function ‖Ax − b‖₂ over all x is A⁻¹b.
4.2 Let M be a collection of subsets of {1, 2, . . . , n} and θ ∈ Rn be given.
For a set M ∈ M, let 1 M ∈ Rn be the indicator vector of the set M, i.e.,
1_M(i) = 1 if i ∈ M and 1_M(i) = 0 otherwise. Let

f(x) := ∑_{i=1}^n θ_i x_i + ln ∑_{M∈M} e^{⟨x, 1_M⟩},

and consider the optimization problem

inf_{x∈R^n} f(x).
(a) Prove that, if θi > 0 for some i, then the value of the above pro-
gram is −∞, i.e., it is unbounded from below.
(b) Assume the program attains a finite optimal value. Is the optimal
point necessarily unique?
(c) Compute the gradient of f.
(d) Is the program convex?
4.3 Give a precise formula for the number of spanning trees in the complete
graph Kn on n vertices.
4.4 For every n, give an example of a polytope that has 2^n vertices but can
be described using 2n inequalities.
4.5 Prove that, if A is an n × n full rank matrix with entries in Q and b ∈ Qn ,
then A−1 b ∈ Qn .
4.6 Prove that the following algorithm for linear programming when the
polytope K ⊆ R^n is given as a list of vertices v₁, . . . , v_N is correct: com-
pute ⟨c, vᵢ⟩ for each i ∈ [N] and output the vertex with the least cost.
4.7 Find a polynomial time separation oracle for a set of the form {x ∈ Rn :
f (x) ≤ y}, where f : Rn → R is a convex function and y ∈ R. Assume
that we have zero- and first-order oracle access to f .
4.8 Give bounds on the time complexity of computing the gradients and Hes-
sians of the functions in Exercise 2.1.
4.9 The goal of this problem is to bound bit complexities of certain quantities
related to linear programs. Let A ∈ Qm×n be a matrix and b ∈ Qm be
a vector and let L be the bit complexity of (A, b). (Thus, in particular,
L > m and L > n.) We assume that K = {x ∈ Rn : Ax ≤ b} is a bounded,
full-dimensional polytope in Rn .
min_{t∈R} t
subject to ∀i ∈ V, M_ii = t − 1,
           ∀ij ∉ E, M_ij = −1,
           M ∈ C_n.
(c) Prove that the value of the above convex program is α(G), where
α(G) is the size of the largest independent set in G: an independent
set is a set of vertices such that no two vertices in the set are
adjacent.
Since computing the size of the largest independent set in G is NP-hard,
we have an instance of convex optimization that is NP-hard.
4.11 Recall that an undirected graph G = (V, E) is said to be bipartite if the
vertex set V has two disjoint parts L, R and all edges go between L and R.
Consider the case when n := |L| = |R| and m := |E|. A perfect matching
in such a graph is a set of n edges such that each vertex has exactly
one edge incident to it. Let M denote the set of all perfect matchings in
G. Let 1 M ∈ {0, 1}E denote the indicator vector of the perfect matching
M ∈ M. Consider the function
X
f (y) := ln eh1M ,yi .
M∈M
Notes
For a detailed background on computational complexity, including models of
computation and formal definitions of complexity classes such as P, NP, RP,
BPP, and #P, the reader is referred to the textbook by Arora and Barak (2009).
For a formal and detailed treatment on oracles for convex functions, see the
book by Nesterov (2014). Algorithmic aspects of convex sets are discussed in
detail in Chapter 2 of the classic book by Grötschel et al. (1988). Chapter 2 of
Grötschel et al. (1988) also presents precise details of the results connecting
separation and optimization.
For an algorithm on approximately computing eigenvalues, see the paper by
Pan and Chen (1999). For more on copositive programs (introduced in Exercise
4.10), the reader is referred to Chapter 7 of the book by Gärtner and Matousek
(2014). In a seminal result, Valiant (1979) proved that counting the number of
perfect matchings in a bipartite graph (introduced in Exercise 4.11) is #P-hard.
5
Duality and Optimality
We introduce the notion of Lagrangian duality and show that under a mild condition,
called Slater’s condition, strong Lagrangian duality holds. Subsequently, we introduce
the Legendre-Fenchel dual that often arises in Lagrangian duality and optimization
methods. Finally, we present the Karush-Kuhn-Tucker (KKT) optimality conditions and
their relation to strong duality.
y_L ≤ y⋆ ≤ y_U,
inf_{x∈R^n} f(x)
subject to f_j(x) ≤ 0 ∀ j = 1, . . . , m,    (5.2)
           h_i(x) = 0 ∀ i = 1, . . . , p.
Note that this problem is a special case of (5.1) when the set K is defined by
m inequalities and p equalities.
Suppose we would like to obtain a lower bound on the optimal value of (5.2).
Towards that, one can apply the general idea of “moving the constraints to
the objective”. More precisely, introduce m new variables λ1 , . . . , λm ≥ 0 for
the inequality constraints and p new variables µ1 , . . . , µ p ∈ R for the equality
constraints; these are referred to as Lagrange multipliers. Then, consider the
following Lagrangian function:
L(x, λ, µ) := f(x) + ∑_{j=1}^m λ_j f_j(x) + ∑_{i=1}^p µ_i h_i(x).    (5.3)
L(x, λ, µ) ≤ f (x)
1 We also allow f to take the value +∞; the discussion in this section is still valid in such a case.
inf_{x∈R^n} L(x, λ, µ) ≤ y⋆.
has a closed-form solution (as a function of λ and µ) and thus allows efficient
computation of lower bounds to y? . We will see some examples soon.
So far we have constructed a function g(λ, µ) over λ, µ such that for every
λ ≥ 0 we have g(λ, µ) ≤ y? . A natural question arises: what is the best lower
bound we can achieve this way?
Definition 5.2 (Lagrangian dual) For the primal optimization problem con-
sidered in Definition 5.1, the following is referred to as the dual optimization
problem:

sup_{λ≥0, µ} g(λ, µ).
We defer the proof of this theorem to later. The proof relies on Theorem 3.11,
along with Slater’s condition, to show the existence of Lagrange multipliers
that achieve strong duality. It has some similarity to the proof of Theorem 3.12
and starts by constructing the following (epigraph-type) convex set related to
the objective function and the constraints.
where F is the set of points that satisfy all the equality constraints. It is then
argued that Lagrange multipliers exist whenever there is a “nonvertical” sup-
porting hyperplane that passes through the point (y⋆, 0) on the boundary of
this convex set. Here y⋆ is the optimal value to the convex program (5.2) and
nonvertical means that the coefficient corresponding to the first coordinate in
the hyperplane is nonzero. Such a nonvertical hyperplane can only fail to exist
when constraints of the convex program can only be achieved at certain ex-
treme points of this convex set, and Slater’s condition ensures that this does
not happen.
For now, we present some interesting cases where strong duality holds and
some where it does not. We conclude this section by providing two examples:
one where strong duality holds even though the primal program is nonconvex
and one where strong duality fails to hold, even for a convex program (imply-
ing that Slater’s condition is not satisfied).
The next step is to derive g(λ) := inf x L(x, λ). For this, we first rewrite
From this form it is easy to see that, unless c − A⊤λ = 0, the minimum of
L(x, λ) over all x ∈ R^n is −∞. More precisely,

g(λ) = ⟨b, λ⟩   if c − A⊤λ = 0,
g(λ) = −∞      otherwise.
Consequently, the dual program is:

max_{λ∈R^m} ⟨b, λ⟩  s.t.  A^⊤λ = c.
sup (3/2)λ − (1/2)λ³
s.t. λ ≥ 0.    (5.10)
The above is a convex program as the objective function is concave. Its optimal
value is attained at λ = 1 and is equal to 1, hence strong duality holds.
Example: strong duality fails for a convex program. The fact that strong
duality can fail for convex programs might be a little disheartening, yet such
programs are not commonly encountered in practice. The reason for such be-
havior is typically an unnatural or redundant description of the domain.
The standard example used to illustrate this is as follows: consider a function
f : D → R given by f (x) := e−x and the two-dimensional domain
D = {(x, y) : y > 0} ⊆ R2 .
The convex program we consider is

inf_{(x,y)∈D} e^{−x}
s.t. x²/y ≤ 0.    (5.11)
Such a program is indeed convex, as e^{−x} and x²/y are convex functions over D, but we see that its description is rather artificial, as the constraint forces x = 0 and hence the y variable is redundant. Still, the optimal value of (5.11) is 1 and
we can derive its Lagrangian dual. To this end, we write down the Lagrangian
L(x, y, λ) = e^{−x} + λ·x²/y,
and derive that

g(λ) = inf_{(x,y)∈D} L(x, y, λ) = 0

for every λ ≥ 0; indeed, taking x = t and y = t³ and letting t → ∞ drives L(x, y, λ) to 0, while L is always positive on D. Thus, g(λ) is a lower bound for the optimal value of (5.11), but there is no λ that makes g(λ) equal to 1. Hence, strong duality fails in this case.
Lemma 5.7 (Conjugate function is closed and convex) For every function
f (convex or not), its conjugate f ∗ is convex and closed.
Another simple, yet useful conclusion of the definition is the following in-
equality.
The following fact explains the name "conjugate" function – the conjugation operation is in fact an involution, and applying it twice brings the original function back. Clearly, this cannot hold in general, as we observed that f⋆ is always convex, but in fact not much more than convexity is required.
Lemma 5.10 (The inverse of the gradient map) Suppose that f is closed,
convex, and differentiable. Then y = ∇ f (x) if and only if x = ∇ f ∗ (y).
The first equality is the definition of f ∗ and the second equality follows from
taking the derivative of the argument in the supremum and plugging in u = x.
However, for any v
where the last equality follows from (5.12). This implies that x = ∇ f ∗ (y).
For the other direction let g := f ∗ . Then, by Lemma 5.7, g is closed and
convex. Thus, we can use the above argument for g to deduce that if x = ∇g(y),
then y = ∇g∗ (x). By Lemma 5.9, we know that g∗ = f and, hence, the claim
follows.
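As a concrete illustration (our worked example, not from the text), take f(x) := (1/2)‖x‖₂². Then

f⋆(y) = sup_x { ⟨x, y⟩ − (1/2)‖x‖₂² } = (1/2)‖y‖₂²,

since the supremum is attained at x = y. Hence f is its own conjugate, so f⋆⋆ = f in accordance with Lemma 5.9, and ∇f(x) = x and ∇f⋆(y) = y are indeed inverses of one another, as Lemma 5.10 asserts.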
We now show that strong duality along with primal and dual feasibility are
equivalent to the KKT conditions. This is sometimes useful in solving a convex
optimization problem.
f(x⋆) = g(λ⋆, µ⋆)
      = inf_x L(x, λ⋆, µ⋆)
      = inf_x ( f(x) + Σ_{j=1}^m λ⋆_j f_j(x) + Σ_{i=1}^p µ⋆_i h_i(x) )
      ≤ f(x⋆) + Σ_{j=1}^m λ⋆_j f_j(x⋆) + Σ_{i=1}^p µ⋆_i h_i(x⋆)
      ≤ f(x⋆).
Here, the last inequality follows from primal and dual feasibility. Thus, all of
the inequalities above should be equalities. Hence, from the tightness of the
first inequality we deduce the stationarity condition and from the tightness of
the second inequality we deduce the complementary slackness condition. The
converse is left as an exercise.
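To see this equivalence on a tiny instance (a worked example of ours): minimize f(x) = x² subject to f₁(x) = 1 − x ≤ 0. The Lagrangian is L(x, λ) = x² + λ(1 − x), so stationarity gives 2x − λ = 0, complementary slackness gives λ(1 − x) = 0, and primal and dual feasibility give x ≥ 1 and λ ≥ 0. The unique solution of this system is x⋆ = 1, λ⋆ = 2. Indeed, g(λ) = inf_x L(x, λ) = λ − λ²/4, and g(2) = 1 = f(x⋆), so strong duality holds with these multipliers.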
where in the last inequality we have used the fact that λ ≠ 0 and the existence of a point x̄ that satisfies Slater's condition. This gives a contradiction, hence λ₀ > 0.
2. λ₀ > 0: in this case, we can divide (5.13) by λ₀ to obtain

inf_{(w,u)∈C} { w + ⟨λ/λ₀, u⟩ } ≥ y⋆.

Maximizing the left hand side over all λ̂ ≥ 0, we obtain the nontrivial side of strong duality, proving the theorem.
Observe that Slater's condition does not hold for the convex optimization problem (5.11).
5.5 Exercises
5.1 Prove that the dual function g in Definition 5.2 is concave irrespective of
whether the primal problem is convex or not.
5.2 Univariate duality. Consider the following optimization problem:
min x2 + 2x + 4
x∈R
s.t. x2 − 4x ≤ −3
(a) Solve this problem, i.e., find the optimal solution.
(b) Is this a convex program?
(c) Derive the dual problem maxλ≥0 g(λ). Find g and its domain.
(d) Prove that weak duality holds.
(e) Is Slater’s condition satisfied? Does strong duality hold?
5.3 Duality for linear programs. Consider the linear program in the canon-
ical form:
minn hc, xi
x∈R
s.t. Ax = b,
x ≥ 0.
Derive its Lagrangian dual and try to arrive at the simplest possible
form. Hint: represent equality constraints hai , xi = bi as hai , xi ≤ bi
and − hai , xi ≤ −bi .
5.4 Shortest vector in an affine space. Consider the following optimization problem

min_{x∈R^n} ‖x‖₂  s.t.  Ax = b.    (5.14)
2 We assume that the output of this oracle is always a vector with rational entries, and the bit
complexity of the output is polynomial in the bit complexity of c. Such oracles exist in
particular for certain important classes of polytopes with rational vertices.
3 As we will see later in the book, the converse of this theorem is also true.
Notes
The book by Boyd and Vandenberghe (2004) provides a thorough treatment
of Lagrangian duality, conjugate functions, and KKT optimality conditions.
The reader looking for additional examples and exercises is referred to this book.
Chapter IV in the book by Barvinok (2002) discusses polarity and convexity
(introduced in Exercise 5.15) in detail.
6
Gradient Descent
The remainder of the book is devoted to the design and analysis of algorithms for con-
vex optimization. We start by presenting the gradient descent method and show how it
can be viewed as a steepest descent. Subsequently, we prove a convergence time bound
on the gradient descent method when the gradient of the function is Lipschitz continu-
ous. Finally, we use the gradient descent method to come up with a fast algorithm for a
discrete optimization problem: computing maximum flow in an undirected graph.
Let xt denote the point chosen in Step 2 of this algorithm at the tth iteration
and let T be the total number of iterations performed. T is usually given as a part of the input, but it could also be implied by a given stopping criterion. The
running time for the algorithm is O(T · M) where M is an upper bound on the
time of each update, including that of finding the starting point. Normally, the
update time M is something which cannot be optimized below a certain level
(for a fixed function f ) and, hence, the main goal is to design methods with T
as small as possible.
max_{‖u‖=1} [ lim_{δ→0} ( f(x) − f(x + δu) ) / δ ].

The expression inside the limit is simply the negative directional derivative of f at x in direction u and, thus, we obtain

max_{‖u‖=1} ⟨−∇f(x), u⟩ = ‖∇f(x)‖,

and the equality is attained at u⋆ = −∇f(x)/‖∇f(x)‖. Thus, the negative of the gradient at the
current point is the direction of steepest descent. This instantaneous direction
at each point x gives rise to a continuous-time dynamical system (Definition
2.19) called the gradient flow of f :
dx/dt = −∇f(x)/‖∇f(x)‖.
Figure 6.1 The steepest descent direction for the function x₁² + 4x₂² at x̃ = (√2, √2).
In discrete time, this yields the update rule x_{t+1} := x_t − η∇f(x_t), where, again, η > 0 is a parameter – the step length (which we might also make depend on the time t or the point x_t). In machine learning, η is often called the learning rate.
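In code, the method is a few lines; the sketch below (names and the test function are ours, and the step length 1/L anticipates the analysis of the next section) runs the update on the quadratic of Figure 6.1:

    import numpy as np

    def gradient_descent(grad_f, x0, eta, T):
        # iterate x_{t+1} = x_t - eta * grad f(x_t) for T steps
        x = np.array(x0, dtype=float)
        for _ in range(T):
            x = x - eta * grad_f(x)
        return x

    # f(x) = x_1^2 + 4 x_2^2; grad f(x) = (2 x_1, 8 x_2) is L-Lipschitz with L = 8
    grad_f = lambda x: np.array([2 * x[0], 8 * x[1]])
    print(gradient_descent(grad_f, [np.sqrt(2), np.sqrt(2)], eta=1 / 8, T=100))
    # output approaches the minimizer (0, 0)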
The Lipschitz constant of the same function can vary dramatically with the
choice of the norms. Unless specified, we assume that both the norms are k · k2 .
‖∇f(x) − ∇f(y)‖ ≤ L‖x − y‖,
be more difficult. However, one can often try to set these values adaptively. As an example, it is possible to start with a guess G₀ (or, equivalently, L₀) and update the guess given the outcome of gradient descent. Should the algorithm "fail" for this value of G₀ (i.e., the gradient at the point at the end of the run is not small enough), we can double the constant: G₁ = 2G₀. Otherwise, we can halve it: G₁ = G₀/2, and so on.
In practice, especially in machine learning applications, these quantities are often known up to good precision and are often considered to be constants. Under such an assumption, the numbers of oracle calls performed by the respective algorithms are O(1/ε²) and O(1/ε) and do not depend on the dimension n. Therefore, such algorithms are often referred to as dimension-free: their convergence rate, as measured in the number of iterations (as clarified later), does not depend on n, only on certain regularity parameters of f. In theoretical computer science applications, an important aspect is to find formulations where these quantities are small, and this is often the key challenge.
Lemma 6.3 (Upper bound on the Bregman divergence for L-smooth functions) Suppose that f : R^n → R is a differentiable function that satisfies ‖∇f(x) − ∇f(y)‖ ≤ L‖x − y‖ for every x, y ∈ R^n. Then, for every x, y ∈ R^n it holds that

f(y) − f(x) − ⟨∇f(x), y − x⟩ ≤ (L/2)‖x − y‖².    (6.4)
We note that this lemma does not assume that f is convex. However, if f is
convex, then the distance in the left hand side of Equation (6.4) is nonnega-
tive. See Figure 6.2 for an illustration. Moreover, when f is convex, (6.4) is
equivalent to the condition that the gradient of f is L-Lipschitz continuous; see
Exercise 6.2.
Figure 6.2 For a convex function f, the gap at y between the value of f and the first-order approximation f(x) + ⟨∇f(x), y − x⟩ of f at x is nonnegative and bounded from above by the quadratic function (L/2)‖x − y‖² when the gradient of f is L-Lipschitz continuous.
Proof of Theorem 6.2. Let us examine the evolution of the error as we iterate. We first use Lemma 6.3 to obtain

f(x_{t+1}) − f(x_t) ≤ ⟨∇f(x_t), x_{t+1} − x_t⟩ + (L/2)‖x_{t+1} − x_t‖²
                   = −η‖∇f(x_t)‖² + (Lη²/2)‖∇f(x_t)‖².
Since we wish to maximize the drop in the function value f(x_t) − f(x_{t+1}), we should choose η so as to make the right hand side as small as possible. The right hand side is a convex function of η, and a simple calculation shows that it is minimized when η = 1/L. Substituting this, we obtain

f(x_{t+1}) − f(x_t) ≤ −(1/(2L))‖∇f(x_t)‖².    (6.5)
Intuitively, (6.5) suggests that if the norm of the gradient k∇ f (xt )k is large, we
make good progress. On the other hand, if k∇ f (xt )k is small, we are already
close to the optimum.
Define R_t := f(x_t) − f(x⋆), which measures how far the current objective value is from the optimal value. Note that R_t is nonincreasing and we would like to find the smallest t for which it is below ε. Let us start by noting that R₁ ≤ LD². This is because
The first inequality in (6.6) follows from the first-order characterization of con-
vexity of f and the second inequality follows (vacuously) from the definition
of D. The optimality of x? (Theorem 3.13) implies that ∇ f (x? ) = 0. Therefore,
by the L-smoothness of f , we obtain
k · (R_t/2)²/(2LD²) ≥ R_t/2.

In other words, k needs to be at least 4LD²/R_t. Let r := ⌈log(R₁/ε)⌉. By repeatedly halving r times to get from R₁ to ε, the required number of steps is at most

Σ_{i=0}^{r} ⌈ 4LD²/(R₁·2^{−i}) ⌉ ≤ (r + 1) + (4LD²/R₁) Σ_{i=0}^{r} 2^i ≤ (r + 1) + (4LD²/R₁)·2^{r+1} = O( LD²/ε ).
i.e., the algorithm can move only in the subspace spanned by the gradients at previous iterations. Note that gradient descent clearly follows this scheme, as

x_t = x_1 − η Σ_{j=1}^{t−1} ∇f(x_j).
We do not restrict the running time of one iteration of such an algorithm, in fact
we allow it to do an arbitrarily long calculation to compute xt from x1 , . . . , xt−1
and the corresponding gradients. In this model we are interested only in the
number of iterations. The lower bound is as follows.
Theorem 6.4 (Lower bound) Consider any algorithm for solving the convex unconstrained minimization problem min_{x∈R^n} f(x) in the model as in (6.9), when the gradient of f is Lipschitz continuous (with a constant L) and the initial point x₁ ∈ R^n satisfies ‖x₁ − x⋆‖ ≤ D. There is a function f such that, for every T,

min_{1≤i≤T} f(x_i) − min_{x∈R^n} f(x) ≥ Ω( LD²/T² ).
The above theorem translates to a lower bound of Ω(1/√ε) iterations to reach an ε-optimal solution. This lower bound does not quite match the upper bound of O(1/ε) established in Theorem 6.2. Therefore one can ask: is there a method which matches the 1/√ε iteration bound? Perhaps surprisingly, the answer is yes. This can be achieved using the so-called accelerated gradient descent method, which is the topic of Chapter 8. We also prove Theorem 6.4 in Exercise 8.4.
This is called the projected gradient descent method: in each iteration, the result of the gradient step is projected back onto K, i.e., x_{t+1} := proj_K(x_t − η∇f(x_t)). The convergence rate of this method remains the same (the proof carries over to this new setting by noting that ‖proj_K(x) − proj_K(y)‖ ≤ ‖x − y‖ for all x, y ∈ R^n). However, depending on K, the projection may or may not be difficult (or computationally expensive) to perform. More precisely, as long as the algorithm has access to an oracle which, given a query point x, returns the projection proj_K(x) of x onto K, we have the following analogue of Theorem 6.2.
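A minimal sketch of projected gradient descent, assuming access to a projection oracle (here the Euclidean unit ball, whose projection is a simple rescaling; all names are illustrative):

    import numpy as np

    def projected_gradient_descent(grad_f, proj_K, x0, eta, T):
        # x_{t+1} = proj_K(x_t - eta * grad f(x_t))
        x = np.array(x0, dtype=float)
        for _ in range(T):
            x = proj_K(x - eta * grad_f(x))
        return x

    # minimize f(x) = ||x - c||^2 over the unit ball; the optimum is (1, 0)
    c = np.array([2.0, 0.0])
    grad_f = lambda x: 2 * (x - c)  # L = 2, so take eta = 1/2
    proj_ball = lambda x: x / max(1.0, np.linalg.norm(x))
    print(projected_gradient_descent(grad_f, proj_ball, [0.0, 0.0], 0.5, 50))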
properties.³ For all vertices u ∈ V\{s, t}, we require that the "incoming" flow is equal to the "outgoing" flow:

⟨e_u, Bx⟩ = 0.

An s−t flow satisfies the given capacities if |x_i| ≤ ρ_i for all i ∈ [m]. The goal is to find a feasible flow in G that maximizes the flow value

⟨e_s, Bx⟩;

here and below, b := e_s − e_t.
From now on we assume that all the capacities are one, i.e., ρi = 1 for all
i ∈ [m] and hence the last constraint in (6.10) simplifies to kxk∞ ≤ 1. The
approach presented here can be extended to the general capacity case.
It is important to note that F⋆ ≥ 1 if in G there is at least one path from s to t; we assume this (we can check this property in Õ(m) time using breadth-first search). Moreover, since each edge has capacity 1, F⋆ ≤ m. Thus, not only is feasibility not an issue, we know that 1 ≤ F⋆ ≤ m.
We do not give a fully detailed proof of the above; rather, we present the main steps and leave certain steps as exercises.
3 As mentioned earlier, we extend x from E to V × V as a skew-symmetric function on the
edges. If the ith edge went between u and v and bi = eu − ev , we let x(v, u) := −x(u, v). If we
do not wish to refer to the direction of the ith edge, we use the notation xi .
4 The notation Õ(f(n)) means O(f(n) · log^k f(n)) for some constant k > 0.
Find x ∈ Rm
s.t. Bx = Fb, (6.11)
kxk∞ ≤ 1.
If we denote
HF := {x ∈ Rm : Bx = Fb}
and
Bm,∞ := {x ∈ Rm : kxk∞ ≤ 1},
then the set of solutions above is the intersection of these two convex sets
and hence the problem really asks the following question for the convex set
K := H_F ∩ B_{m,∞}: is K nonempty, and if so, can we find a point in it?
As we would like to use the gradient descent framework, we need to first state the above problem as a minimization problem. There are two natural ways to formulate this feasibility problem as a convex optimization problem: minimize the distance of x to B_{m,∞} over x ∈ H_F, or minimize the distance of x to H_F over x ∈ B_{m,∞}.
It turns out that the first formulation has the following advantage over the sec-
ond: for a suitable choice of the distance function, the objective function is
convex, has an easily computable first-order oracle, and, importantly, the Lips-
chitz constant of its gradient is O(1). Thus, we reformulate (6.11) as minimiz-
ing the distance of x to B_{m,∞} over x ∈ H_F. Towards that, let P be the projection operator P : R^m → B_{m,∞} given as

(P(x))_i := min{1, max{−1, x_i}} for each i ∈ [m],    (6.13)

i.e., each coordinate of x is truncated to the interval [−1, 1].
5 For that one can either consider projected gradient descent – see Theorem 6.5, or just repeat
the proof of Theorem 6.2 over a linear subspace.
Hence,

f(x) = Σ_{i=1}^m h(x_i),

where h(z) := (max{|z| − 1, 0})².
Thus, in particular,

(∇f(x))_i = { 0           if x_i ∈ [−1, 1],
              2(x_i + 1)  if x_i < −1,
              2(x_i − 1)  if x_i > 1.
While ∇f(x_t) is easy to compute in linear time given the formulas derived previously, computing its projection might be quite expensive. In fact, even if we precomputed Π (which would take roughly O(m³) time), it would still take O(m²) time to apply it to the vector ∇f(x_t). To the rescue comes an important and nontrivial result that such a projection can be computed in time Õ(m). This is achieved by noting that this problem trivially reduces to solving Laplacian systems of the form BB^⊤y = a.
6.5 Exercises
6.1 Let f : Rn → R be a differentiable function. Prove that if k∇ f (x)k ≤ G
for all x ∈ Rn and some G > 0, then f is G-Lipschitz continuous, i.e.,
∀x, y ∈ Rn , | f (x) − f (y)| ≤ G.
Is the converse true? (See Exercise 7.1.)
6.2 Suppose that a differentiable function f : Rn → R has the property that
∀x, y ∈ Rn ,
L
f (y) ≤ f (x) + hy − x, ∇ f (x)i + kx − yk2 . (6.14)
2
Prove that if f is twice-differentiable and has a continuous Hessian, then (6.14) is equivalent to ∇²f(x) ⪯ L·I. Further, prove that if f is also convex, then (6.14) is equivalent to the condition that the gradient of f is L-Lipschitz continuous, i.e.,

∀x, y ∈ R^n, ‖∇f(x) − ∇f(y)‖ ≤ L‖x − y‖.
6.3 Let M be a nonempty family of subsets of {1, 2, . . . , n}. For a set M ∈ M,
let 1 M ∈ Rn be the indicator vector of M, i.e., 1 M (i) = 1 if i ∈ M and
1 M (i) = 0 otherwise. Consider a function f : Rn → R given by
f(x) := log Σ_{M∈M} e^{⟨x, 1_M⟩}.
whenever T = Ω((BD/ε)²), where D := max{‖x − x₁‖ : f(x) ≤ f(x₁)}.
6.8 Frank-Wolfe method. Consider the following algorithm for minimizing
a convex function f : K → R over a convex domain K ⊆ Rn .
• Initialize x1 ∈ K,
• For each iteration t = 1, 2, . . . , T :
– Define z_t := argmin_{x∈K} { f(x_t) + ⟨∇f(x_t), x − x_t⟩ }
– Let x_{t+1} := (1 − γ_t)x_t + γ_t z_t, for some γ_t ∈ [0, 1]
• Output x_T (a code sketch of this iteration appears after this exercise)
(a) Prove that if the gradient of f is L-Lipschitz continuous with respect to a norm ‖·‖, max_{x,y∈K} ‖x − y‖ ≤ D, and γ_t is taken to be Θ(1/t), then

f(x_T) − f(x⋆) ≤ O( LD²/T ),

where x⋆ is any minimizer of min_{x∈K} f(x).
(b) Show that one iteration of this algorithm can be implemented ef-
ficiently when given a first-order oracle access to f and the set K
is any of:
• K := {x ∈ Rn : kxk∞ ≤ 1},
• K := {x ∈ Rn : kxk1 ≤ 1},
• K := {x ∈ Rn : kxk2 ≤ 1}.
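A minimal sketch of the iteration of Exercise 6.8, with the linear minimization oracle written out for K = {x ∈ R^n : ‖x‖₁ ≤ 1} (the other two balls in part (b) are analogous); the names and the choice γ_t = 2/(t + 2) = Θ(1/t) are ours:

    import numpy as np

    def frank_wolfe(grad_f, lmo, x0, T):
        # z_t = argmin_{x in K} <grad f(x_t), x>; x_{t+1} = (1 - gamma_t) x_t + gamma_t z_t
        x = np.array(x0, dtype=float)
        for t in range(1, T + 1):
            z = lmo(grad_f(x))
            gamma = 2.0 / (t + 2.0)
            x = (1 - gamma) * x + gamma * z
        return x

    def lmo_l1(g):
        # a linear function over the l1 ball is minimized at a vertex -sign(g_i) e_i,
        # where i maximizes |g_i|
        i = int(np.argmax(np.abs(g)))
        z = np.zeros_like(g)
        z[i] = -np.sign(g[i])
        return z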
6.9 Recall that for a nonempty subset K ⊆ Rn we can define the distance
function dist(·, K) and the projection operator projK : Rn → K as
dist(x, K) := inf_{y∈K} ‖x − y‖ and proj_K(x) := arginf_{y∈K} ‖x − y‖.
(c) Prove that for any set K ⊆ Rn the function x 7→ dist2 (x, K) is
convex.
(d) Prove correctness of the explicit formula (given in Equation 6.13
in this chapter) for the projection operator projK when K = Bm,∞ =
{x ∈ Rm : kxk∞ ≤ 1}.
(e) Prove that the function f (x) := dist2 (x, K) has a Lipschitz contin-
uous gradient with Lipschitz constant equal to 2.
6.10 Let G = (V, E) be an undirected graph with n vertices and m edges. Let
B ∈ Rn×m be the vertex-edge incidence matrix of G. Assume that G is
connected and let Π := B> (BB> )+ B. Prove that, given a vector g ∈ Rm ,
if we let x? denote the projection of g on the subspace {x ∈ Rm : Bx = 0}
(as defined in Exercise 6.9), then it holds that
x? = g − Πg.
(d) Apply gradient descent to solve the program (6.16). Estimate all
relevant parameters and provide a complete analysis. What is the
running time to reach a point with value at most MinCut s,t (G)+ε?
Hint: to make the objective smooth (have Lipschitz continuous gradient), replace ‖x‖₁ by Σ_{i=1}^m √(x_i² + µ²). Then pick µ appropriately to make the error incurred by this approximation small compared to ε.
Notes
While this chapter focuses on one version of gradient descent, there are several
variants of the gradient descent method and the reader is referred to the book
by Nesterov (2004) and the monograph by Bubeck (2015) for a discussion of
more variants. For more on the Frank-Wolfe method (introduced in Exercise
6.8) see the paper by Jaggi (2013). The lower bound (Theorem 6.4) was first
established in a paper by Nemirovski and Yudin (1983).
The s − t-maximum flow problem is one of the most well-studied problems in combinatorial optimization. Early combinatorial algorithms for this problem included those by Ford and Fulkerson (1956), Dinic (1970), and Edmonds and Karp (1972), leading to an algorithm by Goldberg and Rao (1998) that runs in Õ( m·min{n^{2/3}, m^{1/2}}·log U ) time.
A convex optimization-based approach for the s − t-maximum flow problem was initiated in a paper by Christiano et al. (2011), who gave an algorithm for the s − t-maximum flow problem that runs in time Õ(mn^{1/3}ε^{−11/3}). Section 6.4 is based on a paper by Lee et al. (2013). We refer the reader to Lee et al. (2013) for a stronger version of Theorem 6.6 that achieves a running time of Õ( m^{1.75}/√(εF⋆) ) by applying the accelerated gradient descent method (discussed in Chapter 8) instead of the version we use here; also see Exercise 8.5. By further optimizing the trade-off between these parameters one can obtain an Õ(mn^{1/3}ε^{−2/3}) time
algorithm for the s − t-maximum flow problem. Nearly linear time algorithms
for the s − t-maximum flow problem were discovered in papers by Sherman
(2013), Kelner et al. (2014), and Peng (2016). All of these algorithms used
techniques from convex optimization. See Chapter 11 for a different class of
continuous algorithms for maximum flow problems whose dependence on ε is
polylog(ε−1 ).
All of the above results rely on the availability of fast Laplacian solvers. A nearly linear time Laplacian solver was first discovered in a seminal paper of Spielman and Teng (2004). To read more about Laplacian
systems and their applications to algorithm design, see the surveys by Spielman
(2012) and Teng (2010), and the monograph by Vishnoi (2013).
7
Mirror Descent and the
Multiplicative Weights Update Method
We derive our second algorithm for convex optimization – called the mirror descent
method – via a regularization viewpoint. First, the mirror descent algorithm is devel-
oped for optimizing convex functions over the probability simplex. Subsequently, we
show how to generalize it and, importantly, derive the multiplicative weights update
(MWU) method from it. This latter algorithm is then used to develop a fast approxi-
mate algorithm to solve the bipartite matching problem on graphs.
‖∇f(x)‖₂ ≤ G.    (7.2)
Using the fundamental theorem of calculus (as in the proof of Lemma 6.3), this condition can be shown to imply that f is G-Lipschitz, i.e., |f(x) − f(y)| ≤ G‖x − y‖₂ for all x, y ∈ K; see Exercise 6.1.
conditions are equivalent; see Exercise 7.1. Note, however, that the Lipschitz
continuous gradient condition may not imply the bounded gradient condition.
For instance, it may be the case that G = O(1), but there is no such bound on
the Lipschitz constant of the gradient of f ; see Exercise 7.2. In this case, one
can prove the following theorem.
Theorem 7.1 (Guarantee for gradient descent when the gradient is bounded) There is a gradient descent-based algorithm which, given first-order oracle access to a convex function f : R^n → R, a number G such that ‖∇f(x)‖₂ ≤ G for all x ∈ R^n, an initial point x₁ ∈ R^n and D such that ‖x₁ − x⋆‖₂ ≤ D, and an ε > 0, outputs a sequence of points x₁, . . . , x_T such that

f( (1/T) Σ_{t=1}^T x_t ) − f(x⋆) ≤ ε,

where T = (DG/ε)².
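A sketch of the algorithm behind Theorem 7.1, with the step size η = D/(G√T) of Exercise 7.10 and the averaging done on the fly (the names are ours):

    import numpy as np

    def averaged_gradient_descent(grad_f, x0, D, G, eps):
        # T = (DG/eps)^2 and eta = D/(G sqrt(T)), as in Theorem 7.1 / Exercise 7.10
        T = int(np.ceil((D * G / eps) ** 2))
        eta = D / (G * np.sqrt(T))
        x = np.array(x0, dtype=float)
        avg = np.zeros_like(x)
        for _ in range(T):
            avg += x / T
            x = x - eta * grad_f(x)
        return avg  # the guarantee is f(avg) - f(x*) <= eps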
Remark 7.2 (Notation change in this chapter) Just in this chapter, we index
vectors using upper-indices: x1 , x2 , . . . , xT . This is to avoid confusion with the
coordinates of these vectors which are often referred to, i.e., the ith coordinate
of vector xt is denoted by xit . Also, since the results of this chapter generalize
to arbitrary norms, we are careful about the norm. In particular, k · k refers to a
general norm and the Euclidean norm is explicitly denoted by k · k2 .
Given a point x^t ∈ K, consider the first-order approximation of f at x^t:

f_t(x) := f(x^t) + ⟨∇f(x^t), x − x^t⟩.

Then, by convexity,

∀x ∈ K, f_t(x) ≤ f(x),

and one may consider the update rule

x^{t+1} := argmin_{x∈K} f_t(x).    (7.4)
A downside of the above is that it is very aggressive – in fact, the new point
xt+1 could be very far from xt . This is easily illustrated by an example in one
dimension, when K = [−1, 1] and f(x) = x². If started at 1 and updated using (7.4), the algorithm jumps indefinitely between −1 and 1. This is because one
of these two points is always a minimizer of a linear lower bound of f over K.
Thus, the sequence {xt }t∈N never reaches 0 – the unique minimizer of f . The
reader is encouraged to check the details.
The situation is even worse when the domain K is unbounded: the minimum
is not attained at any finite point and, hence, the update rule (7.4) is not well
defined. This issue can be easily countered when the function is σ-strongly
convex (see Definition 3.8) for some σ > 0, since then we can use a stronger
quadratic lower bound on f at xt , i.e.,
f_t(x) = f(x^t) + ⟨∇f(x^t), x − x^t⟩ + (σ/2)‖x − x^t‖₂².
Then, the minimizer of f_t(x) is always attained at the following point (using the same calculation leading to (7.3)):

x^{t+1} = x^t − (1/σ)∇f(x^t),
which can be made close to x^t by choosing a large σ. For the case of f(x) = x² over [−1, 1],

x^{t+1} = ((σ − 2)/σ) x^t,

which converges to 0 when σ > 1.
This observation leads to the following idea: even if the gradient of f is not
Lipschitz continuous, we can still add a function to ft , to make it smoother.
Specifically, we add a term involving a “distance” function D : K × K → R
that does not allow the new point xt+1 to land far away from the previous point
xt ; D is referred to as a regularizer.1 More precisely, instead of minimizing
1 This D should not be confused with a bound on the distance of the starting point to the optimal
point.
ft (x) we minimize D(x, xt ) + ft (x). To vary the importance of these two terms
we also introduce a positive parameter η > 0 and write the revised update as:
x^{t+1} := argmin_{x∈K} { D(x, x^t) + η( f(x^t) + ⟨∇f(x^t), x − x^t⟩ ) }.

Since the above is an argmin, we can ignore the terms that do not depend on x and simplify it as

x^{t+1} = argmin_{x∈K} { D(x, x^t) + η⟨∇f(x^t), x⟩ }.    (7.5)
Note that by picking a large η, the significance of the regularizer D(x, x^t) is reduced and thus it does not play a big role in choosing the next step. By picking η to be small we force x^{t+1} to stay in a close vicinity of x^t.² However, unlike gradient descent, the value of the function does not have to decrease: f(x^{t+1}) may be more than f(x^t); hence, it is not clear how to analyze the progress of this method. We explain this later.
Before we go any further with these general considerations, let us consider
one important example, where the “right” choice of the distance function D(·, ·)
for a simple convex set already leads to a very interesting algorithm – expo-
nential gradient descent. This algorithm can then be generalized to the setting
of an arbitrary convex set K.
i.e., the set of all probability distributions over n elements. From the discussion
in the previous section, the general form of an algorithm we would like to
construct is
p^{t+1} := argmin_{p∈∆n} { D(p, p^t) + η⟨∇f(p^t), p⟩ },    (7.7)
2 A slightly different, yet related way to ensure that the next point x^{t+1} does not move too far away from x^t is to first compute a candidate x̃^{t+1} according to the rule (7.4) and then to make a small step from x^t towards x̃^{t+1} to obtain x^{t+1}. This is the main idea of the Frank-Wolfe algorithm; see Exercise 6.8.
While not being symmetric, DKL satisfies several natural distance-like proper-
ties. For instance, from convexity it follows that DKL (p, q) ≥ 0. The reason it is
called a divergence is because it can also be seen as the Bregman divergence
(Definition 3.9) corresponding to the function
h(p) := Σ_{i=1}^n p_i log p_i.
DF (x, y) measures the error in approximating F(x) using the first-order Taylor
approximation of F at y. In particular, DF (x, y) ≥ 0 and DF (x, y) → 0 when
x is fixed and y → x. In the next section we derive several other properties of
DKL .
When specialized to this particular distance function DKL , the update rule
takes the form
p^{t+1} := argmin_{p∈∆n} { D_KL(p, p^t) + η⟨∇f(p^t), p⟩ }.    (7.8)
As we prove in the lemma below, the vector pt+1 can be computed using an
explicit formula involving only p^t and ∇f(p^t). It is useful to extend the notion of KL-divergence to R^n_{≥0} by introducing the generalized KL-divergence D_H: it is the Bregman divergence of H(x) := Σ_{i=1}^n (x_i log x_i − x_i), i.e., D_H(p, q) = Σ_{i=1}^n ( p_i log(p_i/q_i) − p_i + q_i ).
The above problem is in fact convex, hence one just needs to find a zero of the
gradient with respect to w to find the minimum. By computing the gradient,
we obtain the following optimality condition
log wi = −ηgi + log qi ,
and, hence,
w?i = qi exp(−ηgi ).
For the second part, we use ideas developed in Chapter 5. To incorporate the constraint Σ_{i=1}^n p_i = 1, we introduce a Lagrange multiplier µ ∈ R and obtain

min_{p≥0} Σ_{i=1}^n p_i log p_i + Σ_{i=1}^n p_i(ηg_i − log q_i) + µ( Σ_{i=1}^n p_i − 1 ).    (7.10)
p⋆ = w⋆ / ‖w⋆‖₁.
point is either 1 or −1.³ Hence, by knowing that the gradient at a certain point x is 1, we still have no clue whether x is close to the minimizer (0) or very far from it. Thus, as opposed to the Lipschitz gradient case, the gradient at a point x does not provide us with a certificate that f(x) is close to optimal. Therefore one naturally needs to gather more information, by visiting multiple points and averaging them in some manner.
Theorem 7.5 (Guarantees for EGD) Suppose that f : ∆n → R is a convex function which satisfies ‖∇f(p)‖∞ ≤ G for all p ∈ ∆n. If we let η := Θ( √(log n)/(G√T) ), then, after T = Θ( G² log n / ε² ) iterations of the EGD algorithm, the point p̄ := (1/T) Σ_{t=1}^T p^t satisfies

f(p̄) − f(p⋆) ≤ ε,

where p⋆ is any minimizer of f over ∆n.
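The EGD iteration is a one-line multiplicative update followed by a renormalization (Lemma 7.4); a sketch with illustrative names:

    import numpy as np

    def egd(grad_f, n, eta, T):
        p = np.full(n, 1.0 / n)  # p^1: the uniform distribution on n elements
        avg = np.zeros(n)
        for _ in range(T):
            avg += p / T
            w = p * np.exp(-eta * grad_f(p))  # w^{t+1}_i = p^t_i exp(-eta g^t_i)
            p = w / w.sum()                   # p^{t+1} = w^{t+1} / ||w^{t+1}||_1
        return avg                            # the averaged point of Theorem 7.5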
then, for any x ∈ S, where y is the Bregman projection of z onto the convex set S (i.e., y = argmin_{w∈S} D_F(w, z)),

D_F(x, y) + D_F(y, z) ≤ D_F(x, z).
It is an instructive exercise to consider the special case of this lemma for F(x) =
kxk22 . It says that if we project z onto a convex set S and call the projection y,
then the angle between the vectors x − y and z − y is obtuse (larger than 90
degrees).
then

⟨∇g(y), w − y⟩ ≥ 0,

which translates to

⟨∇F(y) − ∇F(z), w − y⟩ ≥ 0.
Finally, we state the following inequality that asserts that the negative entropy
function is 1-strongly convex with respect to the `1 -norm, when restricted to
the probability simplex ∆n (Exercise 3.17(e)).
Thus, from now on we focus on the task of providing an upper bound on the sum Σ_{t=1}^T ⟨g^t, p^t − p⟩.
w^{t+1}_i = p^t_i exp(−ηg^t_i).

This can also be written in terms of the gradient of the generalized negative entropy function H(x) = Σ_{i=1}^n (x_i log x_i − x_i) as follows:

g^t = (1/η)( log p^t − log w^{t+1} ) = (1/η)( ∇H(p^t) − ∇H(w^{t+1}) ),    (7.12)
where the log is applied coordinatewise to the appropriate vectors. Thus, using
Lemma 7.6 (law of cosines) we obtain that
⟨g^t, p^t − p⟩ = (1/η) ⟨∇H(p^t) − ∇H(w^{t+1}), p^t − p⟩
              = (1/η) ( D_H(p, p^t) + D_H(p^t, w^{t+1}) − D_H(p, w^{t+1}) ).    (7.13)
η Σ_{t=1}^T ⟨g^t, p^t − p⟩ = Σ_{t=1}^T [ D_H(p, p^t) + D_H(p^t, w^{t+1}) − D_H(p, w^{t+1}) ]
  ≤ Σ_{t=1}^T [ D_H(p, p^t) + D_H(p^t, w^{t+1}) − ( D_H(p, p^{t+1}) + D_H(p^{t+1}, w^{t+1}) ) ]    (7.14)
  = Σ_{t=1}^T [ D_H(p, p^t) − D_H(p, p^{t+1}) ] + [ D_H(p^t, w^{t+1}) − D_H(p^{t+1}, w^{t+1}) ]
  ≤ D_H(p, p^1) + Σ_{t=1}^T [ D_H(p^t, w^{t+1}) − D_H(p^{t+1}, w^{t+1}) ].
In the last step we used the fact that the first term of the sum is telescoping to
DH (p, p1 ) − DH (p, pT +1 ) and that DH (p, pT +1 ) ≥ 0.
By observing that
DH (p, p1 ) = DKL (p, p1 ) ≤ log n
and move to the next point, which is the minimizer of this lower bound "regularized" by a distance function D(·, ·); thus we get (7.5):

x^{t+1} := argmin_{x∈K} { D(x, x^t) + η⟨∇f(x^t), x⟩ }.
When deriving the exponential gradient descent algorithm we used the gener-
alized KL-divergence DH (·, ·).
Mirror descent is defined with respect to DR (·, ·), for any convex regularizer
R : Rn → R. In general, by denoting the gradient at step t by gt we have
x^{t+1} = argmin_{x∈K} { D_R(x, x^t) + η⟨g^t, x⟩ }
        = argmin_{x∈K} { η⟨g^t, x⟩ + R(x) − R(x^t) − ⟨∇R(x^t), x − x^t⟩ }    (7.18)
        = argmin_{x∈K} { R(x) − ⟨∇R(x^t) − ηg^t, x⟩ },
where in the last step we have ignored the terms that do not depend on x. Let w^{t+1} be a point such that

∇R(w^{t+1}) = ∇R(x^t) − ηg^t.

It is not clear under what conditions such a point w^{t+1} should exist and we address this later; for now assume such a w^{t+1} exists. This corresponds to the same w^{t+1} as we had in the EGD algorithm (the "unscaled" version of p^{t+1}).
We then have

x^{t+1} = argmin_{x∈K} { R(x) − ⟨∇R(w^{t+1}), x⟩ }
        = argmin_{x∈K} { R(x) − R(w^{t+1}) − ⟨∇R(w^{t+1}), x − w^{t+1}⟩ }    (7.19)
        = argmin_{x∈K} D_R(x, w^{t+1}).
Note again the analogy with the EGD algorithm: there pt+1 was obtained as a
KL-divergence projection of wt+1 onto the simplex ∆n = K, exactly as above.
This is also called the proximal viewpoint of mirror descent and the calcu-
lations above establish the equivalence of the regularization and the proximal
viewpoints.
∀x ∈ K, ‖∇f(x)‖ ≤ G.

f(x̄) − f(x⋆) ≤ ε
Proof The proof of the above theorem follows exactly as the one we gave
for Theorem 7.5 by replacing the generalized negative entropy function H by
a regularizer R and replacing KL-divergence terms DH by DR .
We now go step by step through the proof of Theorem 7.5 and emphasize
which properties of R and DR are being used.
In Step 1, the reasoning in (7.11) is used to obtain

f(x̄) − f(x⋆) ≤ (1/T) Σ_{t=1}^T ⟨g^t, x^t − x⋆⟩.
Finally, in Step 4, an analog of (7.17) that can be proved under our current assumptions is

D_R(x^t, w^{t+1}) − D_R(x^{t+1}, w^{t+1}) ≤ η‖g^t‖_* ‖x^t − x^{t+1}‖ − (σ/2)‖x^{t+1} − x^t‖².
The above follows from the strong convexity assumption with respect to ‖·‖ (which is used in place of Pinsker's inequality) and the Cauchy-Schwarz inequality for dual norms:

⟨u, v⟩ ≤ ‖u‖ ‖v‖_*.
The rest of the proof is the same and does not rely on any specific properties
of R or DR .
Theorem 7.10 (Guarantees for the MWU algorithm) Consider the MWU algorithm presented in Algorithm 4. Assume that all of the vectors g^t provided by the oracle satisfy ‖g^t‖∞ ≤ G. Then, taking η = Θ( √(log n)/(G√T) ), after T = Θ( G² log n / ε² ) iterations we have

(1/T) Σ_{t=1}^T ⟨g^t, p^t⟩ − min_{p∈∆n} (1/T) Σ_{t=1}^T ⟨g^t, p⟩ ≤ ε.
At this point it might not be clear what the purpose of stating such a theorem is.
However, as we will shortly see (including in several exercises), this theorem
allows us to come up with algorithms for numerous different problems based
on the idea of maintaining weights and changing them multiplicatively. In the
example we provide, we design an algorithm for checking if a bipartite graph
has a perfect matching. This can be further extended to linear programming or
even semidefinite programming.
We assume that the total number of vertices is 2n (note that if the cardinalities of A and B are not the same, then there is no perfect matching in G). Let m := |E| as usual.
Our approach to solving this problem is based on solving the following lin-
ear programming reformulation of the perfect matching problem:
Find x ∈ R^m
s.t. Σ_{e∈E} x_e = n,
     ∀v ∈ V, Σ_{e: v∈e} x_e ≤ 1,    (7.20)
     ∀e ∈ E, x_e ≥ 0.
A solution to the above linear feasibility program is called a fractional perfect matching. The question which arises very naturally is whether the "fractional" problem is equivalent to the original one. The following is a classical exercise in combinatorial optimization (see Exercise 2.27). It basically asserts that the "fractional" bipartite matching polytope (defined in the linear programming description above) has only integral vertices.
Theorem 7.11 (Integrality of the bipartite matching polytope) If G is a
bipartite graph then G has a perfect matching if and only if G has a fractional
perfect matching. Alternatively, the vertices of the bipartite matching polytope
of G, defined as the convex hull of the indicator vectors of all perfect matchings
in G, are exactly these indicator vectors.
Note that one direction of this theorem is trivial: if there is a perfect matching
M ⊆ E in G then its characteristic vector x = 1 M is a fractional perfect match-
ing. The other direction is harder and relies on the assumption that the graph is
bipartite.
Algorithmically, one can also convert fractional matchings into matchings in Õ(|E|) time; we omit the details. Thus, solving (7.20) suffices to solve the perfect matching problem in its original form.
As our algorithm naturally produces approximate answers, for an ε > 0, we
also define an ε-approximate fractional perfect matching to be an x ∈ Rm
which satisfies
Σ_{e∈E} x_e = n,
∀v ∈ V, Σ_{e: v∈e} x_e ≤ 1 + ε,
∀e ∈ E, x_e ≥ 0.
From such an approximate fractional matching one can (given the ideas dis-
cussed above) construct a matching in G of cardinality at least (1 − ε)n and,
thus, also solve the perfect matching problem exactly by taking ε < 1n .
The running time of the above algorithm is certainly not comparable to the best
known algorithms for this problem, but its advantage is its overall simplicity.
and x^t_e ≥ 0 for all e ∈ E.
7.6.3 Analysis
The analysis is divided into two steps: how to find x^t, and the correctness of the algorithm given in the previous section (assuming a method for finding x^t).
in the form
Σ_{e∈E} α_e x_e ≤ β,    (7.21)
If G has a perfect matching M, then there are edges e₁, . . . , e_n that do not share any vertex. Hence,

Σ_{i=1}^n α_{e_i} ≤ β.
e⋆ := argmin_{e∈E} α_e.

Hence α_{e⋆} ≤ β/n, and setting

x^t_{e⋆} = n and x^t_{e′} = 0 for all e′ ≠ e⋆
is also a valid solution to (7.21). Such a choice of x^t also guarantees that for every v ∈ V:

−1 ≤ Σ_{e∈N(v)} x^t_e − 1 ≤ n − 1,

which in particular implies that ‖g^t‖∞ ≤ 1.
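The Step 1 oracle above runs in O(m) time; a sketch (our transcription: vertices are labeled 0, . . . , 2n − 1, and g^t is normalized as g^t_v = (1 − Σ_{e: v∈e} x^t_e)/n, consistent with Step 2 below):

    import numpy as np

    def matching_oracle(w, edges, n):
        # alpha_e = sum of the weights of the endpoints of e; put all n units
        # of "flow" on the cheapest edge e*, as derived above
        alpha = np.array([w[u] + w[v] for (u, v) in edges])
        e_star = int(np.argmin(alpha))
        x = np.zeros(len(edges))
        x[e_star] = n
        g = np.ones(2 * n) / n  # g^t_v = (1 - sum_{e: v in e} x^t_e) / n
        u, v = edges[e_star]
        g[u] -= 1.0
        g[v] -= 1.0
        return x, g  # note ||g||_inf <= 1, as required by Theorem 7.10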
Step 2: Correctness (proof of Theorem 7.12). We now invoke Theorem 7.10 to obtain the guarantee claimed in Theorem 7.12 on the output of Algorithm 5. We start by noticing that, if we set

p^t := w^t / Σ_{v∈V} w^t_v,

Algorithm 5 is a special case of the MWU algorithm with g^t satisfying ‖g^t‖∞ ≤ 1.
Further, we can plug in p := e_v for any fixed v ∈ V in Theorem 7.10 to conclude

−(1/T) Σ_{t=1}^T g^t_v ≤ −(1/T) Σ_{t=1}^T ⟨p^t, g^t⟩ + δ    (7.22)

for T = Θ( log n / δ² ) (since G = 1). The fact that x^t satisfies
Σ_{v∈V} w^t_v Σ_{e: v∈e} x^t_e ≤ Σ_{v∈V} w^t_v

implies that

Σ_{v∈V} w^t_v ( 1 − Σ_{e: v∈e} x^t_e ) ≥ 0.

Dividing by n and ‖w^t‖₁, the above inequality can be seen to be the same as

⟨p^t, g^t⟩ ≥ 0.
Finally, each iteration can be done in O(m) time, as it just requires finding the edge e with minimum Σ_{v: v∈e} w^t_v. This completes the proof of Theorem 7.12.
Remark 7.14 It can be easily seen that the reason for the n² factor in the running time is our bound |Σ_{e: v∈e} x^t_e − 1| ≤ n. If one could always produce a point x^t with

∀v ∈ V, |Σ_{e: v∈e} x^t_e − 1| ≤ ρ,
then the running time would become O(ε−2 mρ2 log n). Note that intuitively,
this should be possible to do, since xt = 1 M (for any perfect matching M) gives
ρ = 1. However, at the same time we would like our procedure for finding xt to
be efficient, preferably it should run in nearly linear time. It is an interesting ex-
ercise to design a nearly linear time algorithm for finding xt with the guarantee
ρ = 2, which yields a much better running time of only O(ε−2 m log2 n).
Figure 7.1 The plot of the function f(x) = x²/(|x| + 1).
7.7 Exercises
7.1 Prove that if f : Rn → R is a convex function which is G-Lipschitz
continuous, then k∇ f (x)k ≤ G for all x ∈ Rn . Extend this result to the
case when f : K → R for some convex set K.
7.2 Give an example of a function f : Rn → R such that k∇ f (x)k2 ≤ 1 for
all x ∈ Rn but the Lipschitz constant of its gradient is unbounded.
7.3 Smoothed absolute value. Consider the function f(x) := x²/(|x| + 1) (see Figure 7.1). As one can see, the function f is a "smooth" variant of x ↦ |x|: it is differentiable everywhere and |f(x) − |x|| → 0 when x tends to either +∞ or −∞. Similarly, one can consider a multivariate extension F : R^n → R of f given by

F(x) := Σ_{i=1}^n x_i²/(|x_i| + 1).
thus, the larger α is, the better approximation we obtain. Further, prove that for every x ∈ R^n,

‖∇s_α(x)‖∞ ≤ 1.
7.5 Prove the following properties of KL-divergence:
(a) D_KL is not symmetric: in general, D_KL(p, q) ≠ D_KL(q, p) for p, q ∈ ∆n.
(b) DKL is indeed a Bregman divergence of a convex function on ∆n
and DKL (p, q) ≥ 0.
7.6 Let F : Rn → R be a convex function. Prove that the function mapping
x 7→ DF (x, y) for a fixed y ∈ Rn is strictly convex.
7.7 Prove Lemma 7.6.
7.8 Prove that for all p ∈ ∆n , DKL (p, p1 ) ≤ log n. Here p1 is the uniform
probability distribution with p1i = 1/n for 1 ≤ i ≤ n.
7.9 Prove that for F(x) := ‖x‖₂² over R^n,

D_F(x, y) = ‖x − y‖₂².
7.10 Gradient descent when the Euclidean norm of gradient is bounded.
Use Theorem 7.9 to prove that there is an algorithm that, given a convex
and differentiable function f : Rn → R with a gradient bounded in
the Euclidean norm by G, an ε > 0, and an initial point x1 satisfying
kx1 − x? k2 ≤ D, produces x ∈ Rn satisfying
f (x) − f (x? ) ≤ ε
in T iterations using a step size η, where

T := (DG/ε)² and η := D/(G√T).
7.11 Stochastic gradient descent. In this problem we study how well gradi-
ent descent does when we have a relatively weak access to the gradient.
In the previous problem, we assumed that for a differentiable convex
function f : Rn → R, we have a first-order oracle: given x it outputs
∇ f (x). Now assume that when we give a point x, we get a random vector
g(x) ∈ Rn from some underlying distribution such that
E[g(x)] = ∇ f (x).
(If x itself has been chosen from some random distribution, then E[g(x)|x] =
∇ f (x).) Assume that k∇ f (x)k2 ≤ G for all x ∈ Rn . Subsequently, we do
gradient descent in the following way. Pick some η > 0, and assume that
the starting point x1 is s.t. kx1 − x? k2 ≤ D. Let
xt+1 := xt − ηg(xt ).
MinCut_{s,t}(G) := min_{x∈R^n, x_s−x_t=1} Σ_{ij∈E} |x_i − x_j|.
Apply the mirror descent algorithm with the regularizer R(x) = kxk22
along with Theorem 7.9 to find MinCuts,t (G) exactly (note that it is an
integer of value at most m). Estimate all relevant parameters and pro-
vide a bound on the running time. Explain how to deal with the (simple)
constraint x s − xt = 1.
7.13 Min-max theorem for zero sum games. In this problem we apply the
MWU algorithm to approximately find equilibria in two player zero-sum
games.
Let A ∈ Rn×m be a matrix with A(i, j) ∈ [0, 1] for all i ∈ [n] and
j ∈ [m]. We consider a game between two players: the row player and a
column player. The game consists of one round in which the row player
picks one row i ∈ {1, 2, . . . , n} and the column player picks one column
j ∈ {1, 2, . . . , m}. The goal of the row player is to minimize the value A(i, j), which she pays to the column player after such a round; the goal of the column player is of course the opposite (to maximize the value A(i, j)).
The min-max theorem asserts that
max_{q∈∆m} min_{i∈{1,...,n}} E_{J←q} A(i, J) = min_{p∈∆n} max_{j∈{1,...,m}} E_{I←p} A(I, j).    (7.23)
Here EI←p A(I, j) is the expected loss of the row player when using the
randomized strategy p ∈ ∆n against a fixed strategy j ∈ {1, 2, . . . , m} of
the column player. Similarly, define E J←q A(i, J). Formally,
E_{I←p} A(I, j) := Σ_{i=1}^n p_i A(i, j) and E_{J←q} A(i, J) := Σ_{j=1}^m q_j A(i, j).
Let opt be the common value of the two quantities in (7.23) correspond-
ing to two optimal strategies p? ∈ ∆n and q? ∈ ∆m respectively. Our
goal is to use the MWU framework to construct, for any ε > 0, a pair of
strategies p ∈ ∆n and q ∈ ∆m such that
(f) What is the total running time of the whole procedure to find an
ε-approximate pair of strategies p and q we set out to find at the
beginning of this problem?
7.14 Winnow algorithm for classification. Suppose we are given m labeled
examples, (a1 , l1 ), (a2 , l2 ), . . . , (am , lm ) where ai ∈ Rn are feature vectors,
and li ∈ {−1, +1} are their labels. Our goal is to find a hyperplane sepa-
rating the points labeled with a +1 from the −1 labeled points. Assume
that the dividing hyperplane contains 0 and its normal is nonnegative.
Hence, formally, our goal is to find p ∈ Rn with p ≥ 0 such that
sign(⟨a_i, p⟩) = l_i
for every i ∈ {1, . . . , m}. By scaling we can assume that kai k∞ ≤ 1
for every i ∈ {1, . . . , m} and that h1, pi = 1 (recall 1 is the vector of all
1s). For notational convenience we redefine ai to be li ai . The problem is
thus reduced to finding a solution to the following linear programming
problem: find a p such that
hai , pi > 0 for every i ∈ {1, . . . , m} where p ∈ ∆n .
Prove the following theorem.
Theorem 7.15 Given a₁, . . . , a_m ∈ R^n and ε > 0, if there exists p⋆ ∈ ∆n such that ⟨a_i, p⋆⟩ ≥ ε for every i ∈ {1, . . . , m}, the Winnow algorithm (Algorithm 6) produces a point p ∈ ∆n such that ⟨a_i, p⟩ > 0 for every i ∈ {1, . . . , m} in T = Θ( ln n / ε² ) iterations.
What is the running time of the entire algorithm?
8: end for
9: return (1/T) Σ_{t=1}^T p^t
Assume that when the oracle returns a feasible solution for a p, the so-
lution x that it returns is not arbitrary but has the following property:
max_i |⟨a_i, x⟩ − b_i| ≤ 1.
Notes
Mirror descent was introduced by Nemirovski and Yudin (1983). Beck and
Teboulle (2003) presented an alternative derivation and analysis of mirror de-
scent. In particular, they showed that mirror descent can be viewed as a non-
linear projected-gradient type method, derived using a general distance-like
function instead of the `22 -distance. This chapter covers both of these view-
points.
The MWU method has appeared in the literature at least as early as the 1950s
and has been rediscovered in many fields since then. It has applications to
optimization (e.g., Exercise 7.15), game theory (e.g., Exercise 7.13), machine
learning (e.g., Exercise 7.14), and theoretical computer science (Plotkin et al.
(1995); Garg and Könemann (2007); Barak et al. (2009)). We refer the reader
to the comprehensive survey by Arora et al. (2012). We note that the variant
of the MWU method we present is often called the “hedge” (see Arora et al.
(2012)). The algorithm for the s − t-maximum flow problem by Christiano
et al. (2011) (mentioned in Chapter 1) relies on MWU; see Vishnoi (2013) for
a presentation.
Matrix variants of the MWU method are studied in papers by Arora and Kale
(2016), Arora et al. (2005), Orecchia et al. (2008), and Orecchia et al. (2012).
This variant of the MWU method is used to design fast algorithms for solving
semidefinite programs which, in turn, are used to come up with approximation
algorithms for problems such as maximum cut and sparsest cut.
The MWU method is one method in the area of online convex optimiza-
tion (introduced in Exercise 7.16). See the monographs by Hazan (2016) and
Shalev-Shwartz (2012) for in depth treatment of online convex optimization.
The perfect matching problem on bipartite graphs has been extensively stud-
ied in the combinatorial optimization literature; see the book by Schrijver
(2002a). It can be reduced to the s−t-maximum flow problem and, thus, solved
in O(nm) time using an algorithm that needs to compute at most n augmenting
paths, where each such iteration takes O(m) time (as it runs depth-first search).
A more refined variant of the above algorithm, which runs in O(m√n) time, was presented in the paper by Hopcroft and Karp (1973). Forty years after the result of Hopcroft and Karp (1973), a partial improvement (for the sparse regime, i.e., when m = O(n)) was obtained in a paper by Madry (2013). His algorithm runs in Õ(m^{10/7}) time and, in fact, works for the s − t-maximum flow problem on directed graphs with unit capacity. It relies on a novel modification
and application of interior point methods (which we introduce in Chapters 10
and 11).
8
Accelerated Gradient Descent
constructs a sequence λ_t which goes to zero as 1/t², a quadratic speed-up over the standard gradient descent method. Formally, we prove the following theorem.
Theorem 8.4 (Existence of optimal estimate sequences) For every convex, L-smooth (with respect to ‖·‖) function f : R^n → R, for every σ-strongly convex regularizer R (with respect to the same norm ‖·‖), and for every x₀ ∈ R^n, there exists an estimate sequence (φ_t, λ_t, x_t)_{t∈N} with

φ₀(x) := f(x₀) + (L/(2σ)) D_R(x, x₀)

and

λ_t ≤ c/t²

for some absolute constant c > 0.
Suppose now that D_R(x⋆, x₀) ≤ D². Then, what we obtain for such a sequence, using conditions (2) and (1) with x = x⋆, is

f(x_t) ≤ φ_t(x⋆)    (upper bound)    (8.3)
      ≤ (1 − λ_t) f(x⋆) + λ_t φ₀(x⋆)    (lower bound)    (8.4)
      = (1 − λ_t) f(x⋆) + λ_t f(x₀) + λ_t (L/(2σ)) D_R(x⋆, x₀)    (8.5)
      = f(x⋆) + λ_t ( f(x₀) − f(x⋆) ) + λ_t (L/(2σ)) D_R(x⋆, x₀)    (8.6)
      ≤ f(x⋆) + λ_t ( ⟨x₀ − x⋆, ∇f(x⋆)⟩ + (L/2)‖x₀ − x⋆‖² )    (L-smoothness)    (8.7)
        + λ_t (L/(2σ)) D_R(x⋆, x₀)    (8.8)
      = f(x⋆) + λ_t L ( (1/2)‖x₀ − x⋆‖² + (1/(2σ)) D_R(x⋆, x₀) )    (since ∇f(x⋆) = 0)    (8.9)
      ≤ f(x⋆) + (cL/t²) ( (1/2)‖x₀ − x⋆‖² + (1/(2σ)) D² )    (λ_t ≤ c/t²)    (8.10)
      ≤ f(x⋆) + cLD²/(σt²).    (by (8.2))    (8.11)
Thus, it is enough to take t ≈ √(LD²/(σε)) to make sure that f(x_t) − f(x⋆) ≤ ε. We cannot yet deduce Theorem 8.2 from Theorem 8.4, as in the form stated it is not algorithmic – we need to know that such a sequence can be efficiently computed using a first-order oracle to f and R only, while Theorem 8.4 only claims existence. However, as we will see, the proof of Theorem 8.4 provides an efficient algorithm to compute estimate sequences.
Thus, the lower bound condition in Definition 8.3 is trivially satisfied. The
upper bound condition follows from noting that
Thus,

φ₀(x) = φ₀⋆ + D_R(x, x₀),    (8.12)

and in the case when R(x) := (1/2)‖x‖₂², this is just a parabola centered at x₀.
The construction of subsequent elements of the estimate sequence is inductive. Suppose we are given (φ_{t−1}, x_{t−1}, λ_{t−1}). Then φ_t will be a convex combination of φ_{t−1} and the linear lower bound L_{t−1} to f at a carefully chosen point y_{t−1} ∈ R^n (to be defined later). More precisely, we set

φ_t(x) := (1 − γ_t)φ_{t−1}(x) + γ_t L_{t−1}(x)

for a parameter γ_t ∈ (0, 1). Along the way, we state several constraints on x_t, y_t, γ_t and λ_t which are necessary for our proofs to work. At the final stage of the proof we collect all these constraints and show that they can be simultaneously satisfied and, thus, there is a way to set these parameters in order to obtain a valid estimate sequence.
hence

φ_t(x) ≤ (1 − λ_t) f(x) + λ_t φ₀(x).

Thus, as long as (8.18) holds, we obtain the lower bound condition. Note also that a different way to state (8.18) is that

λ_t = Π_{1≤i≤t} (1 − γ_i).
Note that this in particular requires us to specify yt−1 , as the right-hand side
depends on yt−1 . Towards this, consider any x ∈ Rn , then
To give some intuition as to why we make such a choice for yt−1 , consider for
a moment the case when R(x) = kxk22 and DR (x, y) is the squared Euclidean
distance. In the above, we would like to obtain a term which looks like a 2nd
order (quadratic) upper bound on f(x̃) around f(y_{t−1}) (here x̃ is the variable), i.e.,

f(y_{t−1}) + ⟨x̃ − y_{t−1}, ∇f(y_{t−1})⟩ + (1/2)‖x̃ − y_{t−1}‖².    (8.19)
Such a choice of yt−1 allows us to cancel out an undesired linear term. We do
not quite succeed in getting the form as in (8.19) – our expression has zt−1
instead of yt−1 in several places and has additional constants in front of the
linear and quadratic term. We deal with these issues in the next step and make
our choice of xt accordingly.
The reason for renaming variables and introducing x̃ follows the same intuition as the choice of y_{t−1} in Step 4. We would like to arrive at an expression which is a quadratic upper bound on f around y_{t−1}, evaluated at a point x̃. Such a choice of x̃ allows us to cancel the γ_t in front of the linear part of this upper bound and hence to obtain the desired expression, assuming that λ_t ≥ γ_t². As we will see later, this constraint really determines the convergence rate of the resulting method. Finally, the choice of x_t follows straightforwardly: we simply pick a point which minimizes this quadratic upper bound on f over x̃.
z₀ = x₀, γ₀ = 0 and λ₀ = 1.
and

y_{t−1} := (1 − γ_t)x_{t−1} + γ_t z_{t−1},
∇R(z_t) := ∇R(z_{t−1}) − (γ_t/λ_t)∇f(y_{t−1}),    (8.20)
x_t := argmin_{x̃} { ⟨x̃, ∇f(y_{t−1})⟩ + (1/2)‖x̃ − y_{t−1}‖² }.
For instance, when ‖·‖ is the ℓ₂-norm, then x_t = y_{t−1} − ∇f(y_{t−1}).
Note that yt is just a “hybrid” of the sequences (xt ) and (zt ), which follow two
different optimization primitives:
1. The update rule to obtain xt is simply to perform one step of gradient
descent starting from yt−1 .
2. The sequence zt is just performing mirror descent with respect to R, taking
gradients at yt−1 .
Thus, accelerated gradient descent is simply a result of combining gradient descent with mirror descent. Note that, in particular, if we choose γ_t = 0 for all t then the method reduces to the gradient descent method from Chapter 6, and if we choose γ_t = 1 for all t then it reduces to mirror descent from Chapter 7.
Pictorially, in algorithm (8.20), the parameters are fixed in the following
order.
x0 x1 xt
& % & % & %
y0 y1 ······ yt−1
% & % & % &
z0 z1 zt
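For R(x) = (1/2)‖x‖₂², the scheme (8.20) specializes, up to the exact choice of constants, to the familiar momentum form of Nesterov's method; the sketch below uses a standard parameterization with a 1/L gradient step rather than the exact γ_t, λ_t derived above:

    import numpy as np

    def accelerated_gradient_descent(grad_f, L, x0, T):
        # a gradient step from the "hybrid" point y, plus momentum
        x_prev = x = np.array(x0, dtype=float)
        for t in range(T):
            y = x + (t / (t + 3.0)) * (x - x_prev)
            x_prev = x
            x = y - grad_f(y) / L
        return x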
The proof of Theorem 8.2 follows simply from the construction of the estimate sequence. The only remaining piece which we have not established yet is that one can take γ_t ≈ 1/t and λ_t ≈ 1/t² – this follows from a straightforward calculation and is proved in the next section.
From the update rules in (8.20) one can see that one can keep track of x_t, y_t, z_t by using a constant number of oracle queries to ∇f, ∇R and (∇R)^{−1}. Furthermore, as already demonstrated in (8.3), we have

f(x_t) ≤ f(x⋆) + O( LD²/(σt²) ).
Thus, to attain f(x_t) ≤ f(x⋆) + ε, taking t = O( √(LD²/(σε)) ) is sufficient. This completes the proof of Theorem 8.2.
8.5.1 Choice of γt s
When deriving the estimate sequence we have made an assumption that
λt ≥ γt2 ,
which does not allow us to set γt s arbitrarily. The following lemma provides
an example setting of γt s which satisfy this constraint.
Lemma 8.6 (Choice of γ_t s) Let γ₀ = γ₁ = γ₂ = γ₃ = 0 and γ_i = 2/i for all i ≥ 4. Then

∀t ≥ 0, Π_{i=1}^t (1 − γ_i) ≥ γ_t².
Proof For t ≤ 4 one can verify the claim directly. Let t > 4. We have

Π_{i=1}^t (1 − γ_i) = (2/4)·(3/5)·(4/6)· . . . ·((t−2)/t) = (2·3)/((t−1)t),

by cancelling all but 2 terms in the numerator and denominator. It is now easy to verify that 6/(t(t−1)) ≥ 4/t² = γ_t².
steps. Here, we used the L-smoothness of f and the fact that ∇f(x⋆) = 0 to obtain

E₀ = f(x₀) − f(x⋆) ≤ (L/2)‖x₀ − x⋆‖₂² ≤ LD²/2.
The map ∇R is the identity in this case, hence no additional computational cost
is incurred because of R.
‖Ay − b‖₂² ≤ ε.

The algorithm performs T := O( √κ(A^⊤A) · log( λ_n(A^⊤A)‖x⋆‖₂/ε ) ) iterations (where x⋆ ∈ R^n satisfies Ax⋆ = b); each iteration requires computing a constant number of matrix-vector multiplications and inner products.
We denote
f (x) := kAx − bk22 .
Note that the optimal value of the above is 0, and is achieved for x = x⋆, where x⋆ is the solution to the linear system considered.
We now derive all the relevant parameters of f which are required to apply Theorem 8.7. By computing the Hessian of f we have

∇²f(x) = A^⊤A.

Since λ₁(A^⊤A)·I ⪯ A^⊤A ⪯ λ_n(A^⊤A)·I, we have that f is L-smooth for L := λ_n(A^⊤A) and β-strongly convex for β := λ₁(A^⊤A).
As a starting point x₀ we can choose x₀ := 0, which is at distance

D := ‖x₀ − x⋆‖₂ = ‖x⋆‖₂
from the optimal solution. Thus, from Theorem 8.7 we obtain the running time

O( √κ(A^⊤A) · log( λ_n(A^⊤A)‖x⋆‖₂/ε ) ).

Note that the gradient of f is

∇f(x) = A^⊤(Ax − b),

hence computing it boils down to performing two matrix-vector multiplications.
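For concreteness, here is the unaccelerated version of this solver (a sketch of ours, using the rescaled objective (1/2)‖Ax − b‖₂², whose gradient is exactly A^⊤(Ax − b)); replacing the loop body with the accelerated update improves the iteration count as in the theorem above:

    import numpy as np

    def solve_via_gradient_descent(A, b, T):
        # minimize f(x) = (1/2)||Ax - b||^2 with step 1/L, L = lambda_max(A^T A)
        L = np.linalg.eigvalsh(A.T @ A)[-1]
        x = np.zeros(A.shape[1])
        for _ in range(T):
            x = x - (A.T @ (A @ x - b)) / L  # grad f(x) = A^T (Ax - b)
        return x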
8.7 Exercises
8.1 Prove that, for a convex function, the L-smoothness condition (8.1) is
equivalent to the L-Lipschitz continuous gradient condition in any norm.
8.2 Conjugate gradient. Consider the following strategy for minimizing a
convex function f : Rn → R. Construct a sequence (φt , Lt , xt )t∈N , where
for every t ∈ N there is a function φt : Rn → R, a linear subspace Lt of
Rn such that {0} = L0 ⊆ L1 ⊆ L2 ⊆ · · · , and a point xt ∈ Lt such that,
together, they satisfy the following conditions:
• Lower bound. ∀t, ∀x ∈ Lt , φt (x) ≤ f (x),
• Common minimizer. ∀t, xt = argmin x∈Rn φt (x) = argmin x∈Lt f (x).
Consider applying this to the function
f(x) = ‖x − x⋆‖²_A,

and

φ_t(x) := ‖x − x_t‖²_A + ‖x_t − x⋆‖²_A.
Let x⋆ := A^{−1}b and notice that it is the same as solving the following problem:

min_{x∈R^n} f(x), where f(x) := (1/2)(x − x⋆)^⊤ A (x − x⋆).
i.e., the algorithm might move only in the subspace spanned by the gra-
dients at previous iterations. We do not restrict the running time of one
iteration of such an algorithm, in fact we allow it to do an arbitrarily
long calculation to compute xt from x1 , . . . , xt−1 and the corresponding
gradients and are interested only in the number of iterations.
Consider the quadratic function f : R^n → R defined as

f(y₁, . . . , y_n) := (L/4) ( (1/2)y₁² + (1/2) Σ_{i=1}^{2t} (y_i − y_{i+1})² + (1/2)y²_{2t+1} − y₁ ).
(c) Prove that the span of the gradients at the first t points is just the
span of {e1 , . . . , et }.
(d) Deduce that

( f(x_{t+1}) − f(x⋆) ) / ‖x₁ − x⋆‖₂² ≥ 3L / (32(t + 1)²).
Thus, the accelerated gradient method is tight up to constants.
8.5 Acceleration for the s−t-maximum flow problem. Use the accelerated gradient method developed in this chapter to improve the running time in Theorem 6.6 for the s−t-maximum flow problem to Õ( m^{1.75}/√(εF⋆) ).
8.6 Acceleration for the s−t-minimum cut problem. Recall the formulation of the s−t-minimum cut problem in an undirected graph G = (V, E) with n vertices and m edges that was studied in Exercise 6.11:
MinCut_{s,t}(G) := min_{x∈Rⁿ, x_s − x_t = 1} ∑_{ij∈E} |x_i − x_j|.
Apply Theorem 8.7 with the regularizer R(x) = ‖x‖₂² to find MinCut_{s,t}(G) exactly. Obtain a method with running time O(m^{3/2}n^{1/2}∆^{1/2}), where ∆ is the maximum degree of a vertex in G, i.e., ∆ := max_{v∈V} |N(v)|.
Notes
The accelerated gradient descent method was discovered by Nesterov (1983)
(see also Nesterov (2004)). This idea of acceleration was then extended to
variants of gradient descent and led to the introduction of algorithms such as
FISTA by Beck and Teboulle (2009). Allen-Zhu and Orecchia (2017) present
the accelerated gradient method as a coupling between the gradient descent
method and the mirror descent method. The lower bound (Theorem 6.4) was
first established in a paper by Nemirovski and Yudin (1983). The heavy ball
method (Exercise 8.3) is due to Polyak (1964). Exercises 8.5 and 8.6 are adapted
from the paper by Lee et al. (2013).
Algorithms for solving a linear system of equations have a rich history. It is well known that Gaussian elimination has a worst case time complexity of O(n³). This can be improved to O(n^ω) ≈ O(n^{2.373}) using fast matrix multiplication; see the books by Trefethen and Bau (1997), Saad (2003), and Golub and Van Loan (1996). The bound presented in Theorem 8.8 is comparable to that of the conjugate gradient method (introduced in Exercise 8.2), which is due to Hestenes and Stiefel (1952); see also the monograph by Sachdeva and Vishnoi (2014). For references on Laplacian solvers, refer to the notes in Chapter 6.
9
Newton’s Method
We begin our journey towards designing algorithms for convex optimization whose number of iterations scales polylogarithmically with the error. As a first step, we derive and analyze the classic Newton's method, which is an example of a second-order method. We argue that Newton's method can be seen as steepest descent on a Riemannian manifold, which then motivates an affinely invariant analysis of its convergence.
[Figure 9.1: one step of Newton's method for finding a root r of a univariate function g; the tangent to g at x₀ crosses the x-axis at the next iterate x₁.]
x₁ := x₀ − g(x₀)/g′(x₀).
Iterating this rule yields Newton's method:
x_{t+1} := x_t − g(x_t)/g′(x_t) for all t ≥ 0. (9.1)
It is evident that the method requires the differentiability of g. In fact, the anal-
ysis assumes even more – that g is twice continuously differentiable.
We now present a simple optimization problem and see how we can attempt
to use Newton’s method to solve it. This example also illustrates that the con-
vergence of Newton’s method may heavily depend on the starting point.
An example. Suppose that for some a > 0 we would like to minimize the function
f(x) := ax − log x
over all x > 0. To solve this optimization problem one can first take the derivative g(x) := f′(x) and try to find a root of g. As f is convex, we know by first-order optimality conditions that the root of g (if it exists) is an optimizer for f. We have
g(x) := f′(x) = a − 1/x.
While it is trivial to see that this equation can be solved exactly and the solution is 1/a, we would still like to apply Newton's method to solve it. One reason is that we would like to illustrate the method on a particularly simple example. Another reason is historical – early computers used Newton's method to compute the reciprocal, as it only involves addition, subtraction, and multiplication. We initialize Newton's method at any point x₀ > 0, and iterate as follows:
x_{t+1} = x_t − g(x_t)/g′(x_t) = 2x_t − a·x_t².
Note that computing x_{t+1} from x_t indeed does not use division. We now try to analyze the sequence {x_t}_{t∈N} and see when it converges to 1/a. Denoting e_t := 1 − a·x_t, we obtain
e_{t+1} = e_t².
Thus, it is now easy to see that whenever |e₀| < 1 then e_t → 0. Further, if |e₀| = 1 then e_t = 1 for all t ≥ 1, and if |e₀| > 1 then e_t → ∞. In terms of x₀ this means that whenever 0 < x₀ < 2/a then x_t → 1/a. However, if we initialize at x₀ = 2/a or x₀ = 0, then the algorithm gets stuck at 0. And even worse: if we initialize at x₀ > 2/a then x_t → −∞. This example shows that choosing the right starting point has a crucial impact on whether Newton's method succeeds or fails.
It is interesting to note that by modifying the function g, for example by taking g(x) = x − 1/a, one obtains different algorithms to compute 1/a. Some of these algorithms might not make sense (for instance, the iteration x_{t+1} = 1/a is not how we would like to compute 1/a), or might not be efficient.
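The iteration above is easy to try out; the following sketch (a toy illustration, under the convergence condition just derived) computes 1/a without any division.

```python
# Division-free Newton iteration x_{t+1} = 2*x_t - a*x_t^2 for computing 1/a.
a = 7.0
x = 0.1                         # any starting point with 0 < x0 < 2/a works
for t in range(7):
    x = 2 * x - a * x * x       # only subtraction and multiplication
    print(t, x)                 # the error 1 - a*x squares at every step
# x converges to 1/7 = 0.142857...
```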
This gives us the bound on the distance from x₁ to r in terms of the distance from x₀ to r:
|r − x₁| = |g″(ξ)|/(2|g′(x₀)|) · |r − x₀|² ≤ M·|r − x₀|²,
where M is as in the statement of the theorem.
and hence, for the error |x_t − r| to become smaller than ε it is enough to take
t ≈ log log(1/ε).
As one can imagine, for this reason Newton’s method is very efficient. In addi-
tion, in practice it turns out to be very robust and sometimes converges rapidly
even when no bounds on M or |x0 − r| are available.
g₁(r) = 0, g₂(r) = 0, . . . , g_n(r) = 0.
A natural question is what the analog of the derivative g′(x₀) in the update rule should be. It turns out that the right analog of g′(x₀) in the multivariate setting is the Jacobian matrix of g at x₀, i.e., Jg(x₀) is the matrix of partial derivatives
[∂g_i/∂x_j (x₀)]_{1≤i,j≤n}.
Hence, we can now consider the following extension of (9.1) to the multivari-
ate setting:
xt+1 := xt − Jg (xt )−1 g(xt ) for all t ≥ 0. (9.2)
where k·k2 denotes the spectral norm of a matrix (see Definition 2.13). For more
details, we refer to Section 9.4 where we present a closely related convergence
result, which can be adapted to this setting.
x⋆ := argmin_{x∈Rⁿ} f(x).
The gradient ∇f is a map from Rⁿ to Rⁿ and its Jacobian J_{∇f} is the Hessian ∇²f. Hence, the update rule (9.2) for finding a root of a multivariate function g, adjusted to this setting, is
x_{t+1} := x_t + n(x_t), where n(x) := −(∇²f(x))⁻¹∇f(x).
Alternatively, let f̃ be the second-order approximation of f around x₀ and set
x₁ := argmin_{x∈Rⁿ} f̃(x).
Hence
x₁ = x₀ − (∇²f(x₀))⁻¹∇f(x₀),
and we recover Newton’s method. Thus, at every step, Newton’s method min-
imizes the second-order approximation around the current point and takes the
minimizer as the next point. This has the following consequence: whenever we
apply Newton's method to a strictly convex quadratic function, i.e., of the form (1/2)x⊤Mx + ⟨b, x⟩ with M ≻ 0, it reaches the minimizer in a single step. Each step of Newton's method requires solving a linear system in the Hessian. In the worst case, this takes O(n³) time using Gaussian elimination (or O(n^ω)
using fast matrix multiplication). However, if the Hessian matrix has a special
form, e.g., it is a Laplacian corresponding to some graph, Newton’s method can
lead to fast algorithms due to the availability of nearly-linear time Laplacian
solvers.
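As a concrete illustration (our sketch; the test function below is the toy example of Section 9.1 extended coordinatewise), the following code implements the multivariate update x_{t+1} = x_t − (∇²f(x_t))⁻¹∇f(x_t), where each iteration solves one linear system in the Hessian.

```python
import numpy as np

def newton_minimize(grad, hess, x0, steps=20):
    """Newton's method for minimizing a strictly convex function."""
    x = x0.copy()
    for _ in range(steps):
        x = x - np.linalg.solve(hess(x), grad(x))  # one Hessian system per step
    return x

# f(x) = <a, x> - sum_i log(x_i): gradient a - 1/x, Hessian diag(1/x^2),
# minimizer x_i = 1/a_i (compare with the univariate reciprocal example above).
a = np.array([2.0, 5.0, 1.0])
x = newton_minimize(lambda x: a - 1.0 / x,
                    lambda x: np.diag(1.0 / x ** 2),
                    x0=np.array([0.3, 0.15, 0.8]))   # start inside (0, 2/a_i)
print(x, 1.0 / a)
```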
One can observe a rough analogy between Theorem 9.4 and Theorem 9.1. In the latter, for the method to have quadratic convergence, |g′(x)| should be large (relative to |g″(x)|). Here, the role of g is played by the gradient ∇f of f. The first condition on H(x) in Definition 9.3 basically says that the “magnitude” of the second derivative of f is “big”. The second condition may be a bit more tricky to decipher: it says that ∇²f(x) is Lipschitz-continuous, and upper-bounds the Lipschitz constant. Assuming f is thrice continuously differentiable, this roughly gives an upper bound on the magnitude of D³f. This intuitive explanation is not quite formal; however, we only wanted to emphasize that the spirit of Theorem 9.4 is the same as that of Theorem 9.1.
The proof of Theorem 9.4 is similar to the proof of Theorem 9.1 and, thus,
is moved to the end of this chapter for the interested reader; see Section 9.7.
when initialized at e
x (this can be checked by hand for this simple 2-dimensional
example).
y := φ(x) = Ax + b.
Consider the function f(x₁, x₂) := (x₁² + K·x₂²)/2, where K > 0 is a large constant. One can see by inspection that the minimum is x⋆ = (0, 0). Nevertheless, we would like to compare the gradient descent method and Newton's method for this function.
The gradient descent method takes a starting point, say x̃ := (1, 1), and goes to
x′ := x̃ − η∇f(x̃).
Since ∇f(x̃) = (1, K), for large K this direction is dominated by the x₂ coordinate and the method is forced to take small steps. Newton's method, on the other hand, moves in the direction n(x̃) = −(∇²f(x̃))⁻¹∇f(x̃) = −x̃. This direction points directly towards the minimum x⋆ = (0, 0) of f and, thus, we are no longer forced to take small steps (see Figure 9.2). How can we reconcile this difference between the predictions of the gradient descent method and Newton's method?
[Figure 9.2: the gradient descent and Newton directions at x̃, compared with the optimum x⋆.]
Recall that in the gradient descent method we chose to follow the nega-
tive gradient direction because it was the direction of steepest descent with
respect to Euclidean distance (see Section 6.2.1). However, since the role of
coordinates x1 and x2 is not symmetric in f , the Euclidean norm is no longer
appropriate and it makes sense to select a different norm to measure the length
of such vectors. Consider the norm
‖(u₁, u₂)‖∘ := √(u₁² + K·u₂²).
(Check that this is indeed a norm when K > 0.) Let us now redo the derivation of gradient descent as steepest descent from Chapter 6 with respect to this new norm. This gives rise to the following optimization problem (analogous to (6.1)):
argmax_{‖u‖∘=1} (−Df(x)[u]). (9.5)
Here, the rationale behind using the A-norm is again (as in the 2-dimensional example) to counter the effect of “stretching” caused by the quadratic term x⊤Ax in the objective. This turns out to be the norm that gives us the right direction, as the optimal vector u (up to scaling) is equal to −x (Exercise 9.6), hence, again, pointing towards the optimum x⋆ = 0. Moreover,
−x = −(2A)⁻¹(2Ax) = −(∇²h(x))⁻¹∇h(x),
The above are sometimes called local inner product and local norm with
respect to the Hessian ∇2 f (·) respectively as they vary with x. Sometimes we
say “local norm” when the underlying function f is clear from the context.
They can be used to measure angles or distances between vectors u, v at each
x and give rise to a new geometry on Rn . We now revisit the derivation of
the gradient descent method as steepest descent, which relied on the use of
Euclidean norm k·k2 , and see what this new geometry yields.
Recall that when deriving the gradient descent algorithm, we decided to pick the direction of steepest descent, which is a solution to the following optimization problem:
max_{‖u‖=1} (−⟨∇f(x), u⟩). (9.6)
By taking the optimal direction u⋆ with respect to the Euclidean norm, i.e., ‖·‖ = ‖·‖₂, we obtained that u⋆ is in the direction of −∇f(x) and arrived at the gradient flow
dx/dt = −∇f(x).
What if we instead maximize over all u of local norm 1? Then, Equation (9.6)
becomes
max_{‖u‖ₓ=1} (−⟨∇f(x), u⟩) = max_{u⊤H(x)u=1} (−⟨∇f(x), u⟩). (9.7)
The rationale behind the above is clear given our discussion on the quadratic
case – we would like to capture the “shape” of the function f around a point x
with our choice of the norm, and now, our best guess for that is the quadratic
term of f around x which is given by the Hessian. Again, using the Cauchy-
Schwarz inequality, we see that the optimal solution to (9.7) is in the direction of −H(x)⁻¹∇f(x), which is exactly the Newton step. Indeed, let v := H(x)⁻¹∇f(x) and observe that
−⟨∇f(x), u⟩ = −⟨H(x)^{1/2}v, H(x)^{1/2}u⟩ ≤ √(v⊤H(x)v) · √(u⊤H(x)u) = ‖v‖ₓ‖u‖ₓ, (9.9)
with equality if and only if H(x)^{1/2}u = −H(x)^{1/2}v. This is the same as
u = −v = −H(x)⁻¹∇f(x).
This inner product g(x) should be a “smooth” function of x. The Riemannian metric induces the norm ‖u‖ₓ := √⟨u, u⟩ₓ. The space Rⁿ along with a Riemannian
nian metric g is an example of a Riemannian manifold. It will take us too far
to define a Riemannian manifold formally, but let us just mention that a mani-
fold Ω is a topological space that locally resembles Euclidean space near each
point. More precisely, an n-dimensional manifold, is a topological space with
the property that each point has a neighborhood that topologically resembles
the Euclidean space of dimension n. It is important to note that, in general, at a
point x on a manifold Ω, one may not be able to move in all possible directions
while still being in Ω (see Figure 9.3). The set of all directions that one can
move in at a point x is referred to as the tangent space at x and denoted by TₓΩ.
1 Note that the use of the word metric here should not be confused with a distance metric.
A Riemannian metric then assigns to each point x an inner product g(x) : TₓΩ × TₓΩ → R.
Thus, we can rephrase our result from the previous section as: Newton’s method
is a gradient descent with respect to a Riemannian metric. In fact, the term
H(x)−1 ∇ f (x) is the Riemannian gradient of f at x with respect to the met-
ric H(·). Note that this type of Riemannian metric is very special: the inner
product at every point is given by a Hessian of a strictly convex function. Such
Riemannian metrics are referred to as Hessian metrics.
f(x₀) − f̃(x₁) = −⟨∇f(x₀), n(x₀)⟩ − (1/2)·n(x₀)⊤∇²f(x₀)n(x₀)
= ⟨∇²f(x₀)n(x₀), n(x₀)⟩ − (1/2)·n(x₀)⊤∇²f(x₀)n(x₀)
= (1/2)·‖n(x₀)‖²_{x₀}.
At this point it is instructive to revisit the example studied in Section 9.4.1. First observe that the new potential ‖n(x̃)‖_{x̃} is small at the point x̃ = (0, 1/K₂²). Indeed, thinking of K₂ as large and suppressing lower order terms, we obtain
n(x̃) = −H(x̃)⁻¹∇F(x̃) ≈ −diag(0, K₂⁻²)·(0, 2)⊤ = (0, −2K₂⁻²)⊤.
Further,
‖n(x̃)‖_{x̃} ≈ Θ(1/K₂).
More generally, kn(x)k x corresponds to the (Euclidean) distance from x to 0
when the polytope P is scaled (along with point x) to become the square
[−1, 1] × [−1, 1]. As our next theorem says, the error measured as kn(x)k x de-
cays quadratically fast in Newton’s method.
‖y − x‖ₓ ≤ δ,
we have
(1 − 3δ)H(x) ⪯ H(y) ⪯ (1 + 3δ)H(x).
Roughly, a function satisfies this condition if for any two points close enough
in the local norm, their Hessians are close enough. It is worth noting that the
NL condition is indeed affinely invariant (Exercise 9.4(c)). This means that
x 7→ f (x) satisfies this condition if and only if x 7→ f (φ(x)) satisfies it, for any
affine change of variables φ. We now state the main theorem of this chapter.
Theorem 9.6 (Quadratic convergence with respect to the local norm) Let f : Rⁿ → R be a strictly convex function satisfying the NL condition for δ₀ = 1/6, let x₀ ∈ Rⁿ be any point, and let
x₁ := x₀ + n(x₀).
If ‖n(x₀)‖_{x₀} ≤ 1/6, then
‖n(x₁)‖_{x₁} ≤ 3‖n(x₀)‖²_{x₀}.
We now inspect the NL condition in more detail. To this end, assume for simplicity that f(x) is a univariate function. Then, the NL condition says, roughly, that
|H(x) − H(y)| ≤ 3‖x − y‖ₓ·|H(x)|,
i.e.,
(|f″(x) − f″(y)| / ‖x − y‖ₓ) · (1/|f″(x)|) ≤ 3.
Note that the first term in the left hand side above, roughly, corresponds to
bounding the third derivative of f in the local norm. Thus, the above says
something very similar to “M ≤ 3” where M is the quantity from Theorem 9.1
(recall that g(x) corresponds to f 0 (x) here). The difference is that the quantities
we consider here are computed with respect to the local norm, as opposed to
the Euclidean norm, as in Theorem 9.1 or Theorem 9.4. The constant “3” is by
no means crucial to the definition of the NL condition, it is only chosen for the
convenience of our future calculations.
Note that ‖n(x₁)‖_{x₁} can also be written as ‖∇f(x₁)‖_{H(x₁)⁻¹} and, similarly, ‖n(x₀)‖_{x₀} = ‖∇f(x₀)‖_{H(x₀)⁻¹}. Therefore, using the fact that the local norms at x₁ and x₀ are the same up to a factor of 2 (see Lemma 9.7), which gives
(1/2)·‖∇f(x₁)‖_{H(x₁)⁻¹} ≤ ‖∇f(x₁)‖_{H(x₀)⁻¹},
it is enough to prove that
‖∇f(x₁)‖_{H(x₀)⁻¹} ≤ (3/2)·‖∇f(x₀)‖²_{H(x₀)⁻¹}. (9.10)
Towards proving (9.10), we first write the gradient ∇f(x₁) in the form A(x₀)∇f(x₀), where A(x₀) is a certain explicit matrix. Subsequently, we show that the norm of A(x₀) is (in a certain sense) small, which in turn allows us to establish Equation (9.10).
∇f(x₁) = ∇f(x₀) + ∫₀¹ H(x₀ + t(x₁ − x₀))(x₁ − x₀) dt
  (by the fundamental theorem of calculus applied to ∇f : Rⁿ → Rⁿ)
= ∇f(x₀) − ∫₀¹ H(x₀ + t(x₁ − x₀))·H(x₀)⁻¹∇f(x₀) dt
  (by rewriting (x₁ − x₀) as −H(x₀)⁻¹∇f(x₀))
= ∇f(x₀) − [∫₀¹ H(x₀ + t(x₁ − x₀)) dt]·H(x₀)⁻¹∇f(x₀)
  (by linearity of the integral)
= [H(x₀) − ∫₀¹ H(x₀ + t(x₁ − x₀)) dt]·H(x₀)⁻¹∇f(x₀)
  (by writing ∇f(x₀) as H(x₀)H(x₀)⁻¹∇f(x₀))
= M(x₀)H(x₀)⁻¹∇f(x₀),
where
M(x₀) := H(x₀) − ∫₀¹ H(x₀ + t(x₁ − x₀)) dt.
By taking ‖·‖_{H(x₀)⁻¹} on both sides of the above derived equality we obtain
‖∇f(x₁)‖_{H(x₀)⁻¹} = ‖M(x₀)H(x₀)⁻¹∇f(x₀)‖_{H(x₀)⁻¹}
= ‖H(x₀)^{−1/2}M(x₀)H(x₀)⁻¹∇f(x₀)‖₂
  (by the fact that ‖u‖_{H(x₀)⁻¹} = ‖H(x₀)^{−1/2}u‖₂)
≤ ‖H(x₀)^{−1/2}M(x₀)H(x₀)^{−1/2}‖ · ‖H(x₀)^{−1/2}∇f(x₀)‖₂
  (since ‖Au‖₂ ≤ ‖A‖·‖u‖₂)
= ‖H(x₀)^{−1/2}M(x₀)H(x₀)^{−1/2}‖ · ‖∇f(x₀)‖_{H(x₀)⁻¹}.
Thus, to conclude the proof it remains to show that the matrix M(x₀) is “small” in the following sense:
‖H(x₀)^{−1/2}M(x₀)H(x₀)^{−1/2}‖ ≤ (3/2)·‖∇f(x₀)‖_{H(x₀)⁻¹}.
For this, by Lemma 9.8, it is enough to show that
−(3/2)δ·H(x₀) ⪯ M(x₀) ⪯ (3/2)δ·H(x₀), (9.11)
where, for brevity, δ := ‖∇f(x₀)‖_{H(x₀)⁻¹} ≤ 1/6 by the hypothesis of the theorem.
This in turn follows from the NL condition. Indeed, if we define
z := x₀ + t(x₁ − x₀)
for t ∈ [0, 1], then
‖z − x₀‖_{x₀} = t‖x₁ − x₀‖_{x₀} = tδ ≤ δ.
‖x₁ − x⋆‖₂ ≤ ‖H(x₀)⁻¹‖ · ‖∫₀¹ (H(x₀ + t(x⋆ − x₀)) − H(x₀)) dt‖ · ‖x⋆ − x₀‖₂. (9.13)
We can then use the Lipschitz condition on H to bound the integral as follows:
‖∫₀¹ (H(x₀ + t(x⋆ − x₀)) − H(x₀)) dt‖ ≤ L·∫₀¹ ‖t(x⋆ − x₀)‖₂ dt ≤ L·‖x⋆ − x₀‖₂·∫₀¹ t dt = (L/2)·‖x⋆ − x₀‖₂.
Together with (9.13) this implies
‖x₁ − x⋆‖₂ ≤ (L·‖H(x₀)⁻¹‖/2)·‖x⋆ − x₀‖₂². (9.14)
We can then take M = L·‖H(x₀)⁻¹‖/2 ≤ L/(2h), which completes the proof.
9.8 Exercises
9.1 Consider the problem of minimizing the function f (x) = x log x over
x ∈ R≥0 . Perform the full convergence analysis of Newton’s method
applied to minimizing f – consider all starting points x0 ∈ R≥0 and
determine where the method converges for each of them.
9.2 Newton’s method to find roots of polynomials. Consider a real-rooted
polynomial p ∈ R[x]. Prove that if Newton’s method is applied to finding
roots of p and is initialized at a point x0 > λmax (p) (where λmax (p) is the
largest root of p), then it converges to the largest root of p. For a given
ε > 0, derive a bound on the number of iterations required to reach a
point ε additively close to λmax .
9.3 Verify that for a twice-differentiable function f : Rn → R,
J∇ f (x) = ∇2 f (x).
9.4 Let f : Rn → R and let n(x) be the Newton step at x with respect to f .
(a) Prove that Newton’s method is affinely invariant.
(b) Prove that the quantity kn(x)k x is affinely invariant while k∇ f (x)k2
is not.
(c) Prove that the NL condition is affinely invariant.
9.5 Consider the following functions f : K → R, check if they satisfy the
NL condition for some constant 0 < δ0 < 1.
(a) f (x) := − log cos(x) on K = (− π2 , π2 ),
(b) f (x) := x log x + (1 − x) log(1 − x) on K = (0, 1),
(c) f(x) := −∑_{i=1}^n log x_i on K = Rⁿ_{>0}.
satisfies the following condition, similar to the NL condition but with the local norm replaced by ℓ∞:
∀w, v ∈ R^{2n} s.t. ‖w − v‖∞ ≤ 1:  (1/10)·∇²f(w) ⪯ ∇²f(v) ⪯ 10·∇²f(w).
9.13 Prove Lemma 9.8.
Notes
For a thorough discussion of Newton's method, we refer to the books by Galántai (2000) and Renegar (2001). Exercise 9.2 is adapted from the paper by
Louis and Vempala (2016). Exercise 9.11 is from the paper by Straszak and
Vishnoi (2016). Exercise 9.12 is extracted from the papers by Cohen et al.
(2017) and Zhu et al. (2017). For a formal introduction to Riemannian man-
ifolds, including Riemannian gradients and Hessian metrics, the reader is re-
ferred to the survey by Vishnoi (2018).
10
An Interior Point Method for Linear
Programming
We build upon Newton's method and its convergence analysis to derive a polynomial time algorithm for linear programming. Key to this algorithm is a reduction from constrained to unconstrained optimization using the notion of a barrier function and the corresponding central path.
P := {x ∈ Rn : Ax ≤ b},
x⋆ ∈ argmin_{x∈P} ⟨c, x⟩.
Note that, in general, x? may not be unique. Moreover, there may be ways to
specify a polyhedron other than as a collection of linear inequalities. We refer
to this version of linear programming as canonical. We study some other vari-
ants of linear programming in subsequent chapters. A word of caution is that
when we transform one form of linear programming into another, we should
carefully account for the running time, including the cost of transformation.
Linear programming is a central problem in optimization. Several of the methods presented earlier in this book can be used to solve linear programs approximately, in time that depends inverse polynomially on the error parameter ε. For instance, the method based on the MWU scheme that we introduced for the bipartite matching problem in Chapter 7 can be applied to general linear programs and yields an algorithm that, given ε > 0, computes a solution x̂ which has value at most ⟨c, x⋆⟩ + ε (for an optimal solution x⋆) and violates all of the constraints by at most ε (additively), in time proportional to 1/ε². Not only
is the dependency on ε unsatisfactory, but when no additional assumptions on
A, b or c are made, these methods run in time exponential in the bit complexity
of the input. As discussed in Chapter 4, for a method to be regarded as poly-
nomial time, we require the dependency on the error parameter ε > 0 in the
running time to be polynomial in log(1/ε); moreover, the dependency on the bit complexity L should also be polynomial. In this chapter, we derive a method
which satisfies both these requirements for linear programming.
Here, we do not discuss how to get rid of the full-dimensionality and nonempti-
ness assumptions. Given the proof of this theorem, it is not too difficult, but
tedious, and we omit it. Also, note that the above theorem does not solve the
linear programming problem mentioned in Definition 10.1 as it does not out-
put x? , but only an approximation for it. However, with a bit more work one
can transform the above algorithm into one that outputs x? . Roughly, this is
achieved by picking ε > 0 small enough (about 2^{−poly(L)}) and rounding the output to a nearby vertex of the polytope.¹
¹ If the solutions of Ax ≤ b do not lie in a proper affine subspace of Rⁿ, then the polytope is full-dimensional.
where F(x) can be regarded as a ‘penalty’ for violating constraints. Thus, F(x) should become big when x is close to the boundary ∂K of the convex set K. One seemingly perfect choice of F would be a function which is 0 inside K and +∞ on the complement of K. However, this objective may no longer be continuous.
If one would like the methods for unconstrained optimization that we de-
veloped in the previous chapters to be applicable here, then F should satisfy
certain properties, and at the very least, convexity. Instead of giving a precise
definition of a barrier function F for K, we list some properties that are expected: at the very least, F should be convex, and F(x) should tend to +∞ as x approaches the boundary ∂K.
Suppose F is such a barrier function. Then, note that solving (10.2) might give
us some idea of what min x∈K hc, xi is, but it does not give us the right answer,
since the addition of F(x) to the objective alters the location of the point we
are looking for. For this reason, one typically considers a family of perturbed
objective functions f_η, parametrized by η > 0, as follows: f_η(x) := η⟨c, x⟩ + F(x).
The idea now is to start at some point, say x⋆₁ (i.e., for η = 1) on the central
path, and progress along the path by gradually taking η to infinity. A method
2 Of course this is not always the case for general linear programs. However, if the polyhedron
is feasible then we can perturb coefficients of each of the constraints by an exponentially small
value to force the feasible set to be full-dimensional.
[Figure 10.1: the central path, starting at the analytic center x₀⋆ and converging to the optimum x⋆ as η → ∞.]
Suppose that the function f_{η₀} satisfies the NL condition from Chapter 9 (we show this later). Then, we know that by performing a Newton step
x₁ := x₀ + n_{η₀}(x₀),
where for any x and η > 0 we define
n_η(x) := −(∇²f_η(x))⁻¹∇f_η(x) = −(∇²F(x))⁻¹∇f_η(x),
we make significant progress towards x⋆_{η₀}. Note that since F is strictly convex, (∇²F(x))⁻¹ exists for all x ∈ int(P). The main idea is to use this progress, and
the opportunity it presents, to increase the value of η0 to
η1 := η0 · (1 + γ)
for some γ > 0 so that x1 is close enough to xη?1 and again satisfies the NL
condition; enabling us to repeat this procedure and continue.
The key question becomes: how large a γ can we pick in order to make this scheme work and produce a sequence of pairs (x₀, η₀), (x₁, η₁), . . . such that η_t increases at a rate (1 + γ) and x_t is in the quadratic convergence region of f_{η_t}? We show in Section 10.6 that the right value for γ is roughly 1/√m. A more precise description of this algorithm is presented in Algorithm 7. This algorithm description is not complete and we explain how to set η₀, why η_T > m/ε suffices, how to compute the Newton step, and how to terminate in upcoming sections.
The notion of closeness of a point x_t to the central path, i.e., to x⋆_{η_t}, that we employ here is
‖n_{η_t}(x_t)‖_{x_t} ≤ 1/6.
This is directly motivated by the guarantee obtained in Chapter 9 for New-
ton’s method – this condition means precisely that xt belongs to the quadratic
convergence region for fηt .
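Putting these pieces together, the following Python sketch mirrors the structure of Algorithm 7 (a simplified rendering, under the assumption that a strictly feasible, well-centered pair (x₀, η₀) is given; initialization and the final termination refinements are treated later in this chapter).

```python
import numpy as np

def path_following_ipm(A, b, c, x0, eta0, eps):
    """Sketch of the path following IPM for min <c, x> over {x : Ax <= b}."""
    m = A.shape[0]
    x, eta = x0.copy(), eta0
    while eta <= m / eps:                            # eta_T > m/eps suffices
        s = b - A @ x                                # slacks s_i(x)
        H = A.T @ ((1.0 / s**2)[:, None] * A)        # Hessian of the log barrier
        g = A.T @ (1.0 / s)                          # gradient of the log barrier
        n = np.linalg.solve(H, -(eta * c + g))       # Newton step n_eta(x)
        x = x + n                                    # re-center towards x*_eta
        eta *= 1 + 1 / (20 * np.sqrt(m))             # advance along the central path
    return x
```

In each pass, one Newton step both re-centers x and permits the multiplicative increase of η; the centrality condition ‖n_η(x)‖ₓ ≤ 1/6 is maintained implicitly by the choice of the update rate, as the analysis below shows.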
where x′ := x + n_η(x).
This implies that one Newton step brings us closer to the central path. The
lemma below tells us that if we are very close to the central path, then it is safe to increase η_t by a factor of roughly 1 + 1/√m so that the centrality is still preserved.
Lemma 10.7 (Effect of re-entering on centrality) For every point x ∈ int(P)
and every two positive η, η0 > 0, we have
‖n_{η′}(x)‖ₓ ≤ (η′/η)·‖n_η(x)‖ₓ + √m·|η′/η − 1|.
Thus, if we initialize the algorithm at η₀ and terminate it at η_T then, after O(√m·log(η_T/η₀)) iterations, each of which amounts to solving one m × m linear system, we arrive at η_T.
Theorem 10.10 (Convergence of the path following IPM) Algorithm 7, after T = O(√m·log(m/(εη₀))) iterations, outputs a point x̂ ∈ P that satisfies
⟨c, x̂⟩ − ⟨c, x⋆⟩ ≤ 2ε.
Moreover, every iteration (step 4) requires solving one linear system of the
form ∇2 F(x)y = z, where x ∈ int(P), z ∈ Rn is determined by x and we solve
for y. Thus, it can be implemented in polynomial time.
Note now that by combining Theorem 10.10 with Lemma 10.8 the proof of
Theorem 10.2 follows easily.
‖y − x‖ₓ ≤ δ,
we have
(1 − 3δ)H(x) ⪯ H(y) ⪯ (1 + 3δ)H(x).
is given by
H(x) = ∑_{i=1}^m a_i a_i⊤ / s_i(x)²,
where
s_i(x) := b_i − ⟨a_i, x⟩;
Thus, in particular, every term in this sum is upper bounded by δ². Hence, for every i = 1, 2, . . . , m we have
(1 − δ)·s_i(x) ≤ s_i(y) ≤ (1 + δ)·s_i(x)
and, thus,
(1 + δ)⁻²/s_i(x)² ≤ 1/s_i(y)² ≤ (1 − δ)⁻²/s_i(x)².
It now follows that
(1 + δ)⁻²·a_i a_i⊤/s_i(x)² ⪯ a_i a_i⊤/s_i(y)² ⪯ (1 − δ)⁻²·a_i a_i⊤/s_i(x)²
and, thus, by summing the above for all i, we obtain
(1 + δ)⁻²·H(x) ⪯ H(y) ⪯ (1 − δ)⁻²·H(x).
To arrive at the NL condition it remains to observe that for every δ ∈ (0, 0.23)
it holds that
1 − 3δ ≤ (1 + δ)−2 and (1 − δ)−2 ≤ 1 + 3δ.
a small constant, and what prevents us from choosing a large γ is the second term |1 − η′/η| · ‖H(x)⁻¹g(x)‖ₓ. Thus, what remains to do is to derive an upper bound on ‖H(x)⁻¹g(x)‖ₓ. To this end, we show something stronger:
sup_{y∈int(P)} ‖H(y)⁻¹g(y)‖_y ≤ √m.
To see this, we pick any y ∈ int(P) and denote z := H(y)⁻¹g(y). From the Cauchy-Schwarz inequality we obtain
‖z‖²_y = g(y)⊤H(y)⁻¹g(y) = ⟨z, g(y)⟩ = ∑_{i=1}^m ⟨z, a_i⟩/s_i(y) ≤ √m · √(∑_{i=1}^m ⟨z, a_i⟩²/s_i(y)²). (10.5)
Since the right hand side equals √m·‖z‖_y, this yields ‖z‖_y ≤ √m.
Termination under the “ideal assumption”. Assume that the point output
by Algorithm 7 is really
x̂ := x⋆_{η_T}.
The point x⋆_η is the minimizer of f_η, hence, by the first order optimality condition, we know
∇f_η(x⋆_η) = 0
and, thus,
g(x⋆_η) = −ηc. (10.7)
Using this observation we obtain that
⟨c, x⋆_η⟩ − ⟨c, x⋆⟩ = −⟨c, x⋆ − x⋆_η⟩ = (1/η)·⟨g(x⋆_η), x⋆ − x⋆_η⟩.
To complete the proof it remains to argue that ⟨g(x⋆_η), x⋆ − x⋆_η⟩ < m. We show that, more generally, for all strictly feasible x and y,
⟨g(x), y − x⟩ = ∑_{i=1}^m ⟨a_i, y − x⟩/s_i(x) = ∑_{i=1}^m (s_i(x) − s_i(y))/s_i(x) = m − ∑_{i=1}^m s_i(y)/s_i(x) < m,
where in the last inequality we make use of the fact that our points x, y are
strictly feasible, i.e., si (x), si (y) > 0 for all i.
Dropping the “ideal assumption”. We now show that even after dropping
the ideal assumption we still get an error which is O(ε). For this, we perform a
constant number of Newton iterations starting from x_T, so that the local norm of the output x̂, i.e., ‖n_{η_T}(x̂)‖_{x̂}, becomes small.
We derive a relation between the length of the Newton step at a point x (with
respect to the function fη ) and the distance to the optimum xη? . We show that
whenever kn(x)k x is sufficiently small, kx − xη? k x is small as well. This fact,
together with a certain strengthening of Lemma 10.12, implies that in the last
step of Algorithm 7, only two additional Newton steps bring us 2ε-close to the
optimum.
We start with an extension of Lemma 10.12 that shows that to get a decent
approximation of the optimum, we do not necessarily need to be on the central
path, but only close enough to it.
Lemma 10.13 (Approximation guarantee for points close to central path)
For every point x ∈ int(P) and every η > 0, if ‖x − x⋆_η‖ₓ < 1, then
⟨c, x⟩ − ⟨c, x⋆⟩ ≤ (m/η)·(1 − ‖x − x⋆_η‖ₓ)⁻¹.
Proof For every y ∈ int(P) we have
⟨c, x − y⟩ = ⟨H(x)^{−1/2}c, H(x)^{1/2}(x − y)⟩ ≤ ‖H(x)^{−1/2}c‖₂ · ‖H(x)^{1/2}(x − y)‖₂ = ‖H(x)⁻¹c‖ₓ · ‖x − y‖ₓ,
where the inequality follows from Cauchy-Schwarz. Let
c_x := H(x)⁻¹c.
This term is also the Riemannian gradient of the objective function ⟨c, x⟩ at x with respect to the Hessian metric. Now, we bound ‖c_x‖ₓ. Imagine we are at the point x and we move in the direction of −c_x until we hit the boundary of the unit ball (in the local norm) around x. We land at the point x − c_x/‖c_x‖ₓ, which is still inside P (as proved in Exercise 10.4). Therefore,
⟨c, x − c_x/‖c_x‖ₓ⟩ ≥ ⟨c, x⋆⟩.
Since ⟨c, c_x⟩ = ‖c_x‖ₓ², by substituting and rearranging in the inequality above, we obtain
‖c_x‖ₓ ≤ ⟨c, x⟩ − ⟨c, x⋆⟩.
Thus,
hc, x − yi ≤ kx − yk x (hc, xi − hc, x? i). (10.8)
Now, we express
⟨c, x⟩ − ⟨c, x⋆⟩ = (⟨c, x⟩ − ⟨c, x⋆_η⟩) + (⟨c, x⋆_η⟩ − ⟨c, x⋆⟩)
and use (10.8) with y = x⋆_η. We obtain
⟨c, x⟩ − ⟨c, x⋆⟩ ≤ (⟨c, x⟩ − ⟨c, x⋆⟩)·‖x − x⋆_η‖ₓ + (⟨c, x⋆_η⟩ − ⟨c, x⋆⟩).
Thus,
(⟨c, x⟩ − ⟨c, x⋆⟩)(1 − ‖x − x⋆_η‖ₓ) ≤ ⟨c, x⋆_η⟩ − ⟨c, x⋆⟩.
By applying Lemma 10.12 the result follows.
Note that in Algorithm 7 we never mention the condition that ‖x − x⋆_η‖ₓ is small. However, we show that it follows from ‖n_η(x)‖ₓ being small. In fact, we prove
the following more general lemma which holds for any function f satisfying
the NL condition.
Lemma 10.14 (Distance to the optimum and the Newton step) Let f : Rⁿ → R be any strictly convex function satisfying the NL condition for δ₀ = 1/6, and let z denote its minimizer. Let x be any point in the domain of f and consider the Newton step n(x) at x. If
‖n(x)‖ₓ < 1/24,
then
‖x − z‖ₓ ≤ 4‖n(x)‖ₓ.
Proof Pick any h such that ‖h‖ₓ ≤ 1/6. Expand f(x + h) into a Taylor series around x and use the mean value theorem (Theorem 9.2) to write
f(x + h) = f(x) + ⟨h, ∇f(x)⟩ + (1/2)·h⊤∇²f(θ)h (10.9)
for some point θ lying in the interval (x, x + h). We proceed by lower bounding the linear term. Consider points y with
‖y‖ₓ = r, where r := 4‖n(x)‖ₓ,
i.e., points on the boundary of the local norm ball of radius r centered at x. For such a point y,
‖y‖ₓ = 4‖n(x)‖ₓ ≤ 4/24 = 1/6.
Proof of Lemma 10.9. Let x̂ be the point obtained by performing two additional Newton steps with a fixed η_T, starting at x_T. Then, since
‖n_{η_T}(x_T)‖_{x_T} ≤ 1/6,
we know that
‖n_{η_T}(x̂)‖_{x̂} ≤ 1/48.
Applying Lemma 10.13 and Lemma 10.14 for such an x̂, we obtain that if η ≥ m/ε then
⟨c, x̂⟩ − ⟨c, x⋆⟩ ≤ ε/(1 − 4‖n_{η_T}(x̂)‖_{x̂}) < 2ε.
10.6.2 Initialization
In this section we present a method for finding an appropriate starting point.
More precisely, we show how to efficiently find some η₀ > 0 and x₀ such that ‖n_{η₀}(x₀)‖_{x₀} ≤ 1/6. Before we start, we remark that we provide a very small η₀ – of order 2^{−poly(L)}. While this enables us to prove that Algorithm 7 can solve linear programming in polynomial time, it does not seem promising when trying to apply IPMs to devise fast algorithms for combinatorial problems. Indeed, there is a factor of log η₀⁻¹ in the bound on the number of iterations, which translates to L. To make an algorithm fast, we ideally need η₀ = Ω(1/poly(m)).
It turns out that for specific problems (such as maximum flow) we can devise
some specialized methods to find such an η0 and x0 .
Finding (η0 , x0 ) given a point in the interior of P. First we show that assum-
ing we are given a point x0 ∈ int(P), we can find a starting pair (η0 , x0 ) with the
desired property. In fact, we assume something stronger: that each constraint is satisfied with slack at least 2^{−Õ(nL)} at x₀, i.e.,
b_i − ⟨a_i, x₀⟩ ≥ 2^{−Õ(nL)}.
[Figure 10.2: central paths Γ_c, Γ_d, Γ_e, Γ_f corresponding to different cost vectors c, d, e, f; all of them emanate from the analytic center x₀⋆.]
Γ_c := {x⋆_η : η ≥ 0},
which corresponds to the objective function ⟨c, x⟩. Note that as η → 0, x⋆_η → x₀⋆, the analytic center of P. Hence, finding a point x₀ close to the analytic
But how do we find a point close to x0? ?
While it is the path Γc that is of interest, in general, one can define other
central paths. If d ∈ Rn is any vector, define
( )
Γd := argmin (ηhd, xi + F(x)) : η ≥ 0 .
x∈int(P)
As one varies d, what do the paths Γd have in common? The origin! They all
start at the same point: the analytic center of P; see Figure 10.2. The idea now
is to pick one such path on which the initial point x0 lies and traverse it in
reverse to reach a point close to the analytic center of this path, giving us x0 .
Recall that g is used to denote the gradient of the logarithmic barrier F. If
we define
d := −g(x0 ),
Proof We have
‖n_{η₀}(x)‖ₓ = ‖H(x)⁻¹(η₀c + g(x))‖ₓ ≤ η₀‖H(x)⁻¹c‖ₓ + ‖H(x)⁻¹g(x)‖ₓ.
Since ‖H(x)⁻¹g(x)‖ₓ ≤ 1/12, to arrive at the conclusion, it suffices to prove that
‖H(x)⁻¹c‖ₓ ≤ ⟨c, x − x⋆⟩.
Thus,
x − c_x/‖c_x‖ₓ ∈ P,
because this point belongs to E_x. Therefore,
⟨c, x − c_x/‖c_x‖ₓ⟩ ≥ ⟨c, x⋆⟩,
since x⋆ minimizes this linear objective over P. The above, after rewriting, gives
⟨c, c_x/‖c_x‖ₓ⟩ ≤ ⟨c, x⟩ − ⟨c, x⋆⟩.
By observing that
⟨c, c_x/‖c_x‖ₓ⟩ = ‖H(x)⁻¹c‖ₓ,
the lemma follows.
Given Lemmas 10.15 and 10.16 we are now ready to prove that a good starting
point can be found in polynomial time.
Proof The lemma follows from a combination of Lemmas 10.15 and 10.16.
1. Start by running the path following IPM in reverse for the cost function
d := −g(x0 ), starting point as x0 , and the starting value of η as 1. Since, by
choice, x0 is optimal for the central path Γd , we know that kn1 (x0 )k x0 = 0.
Thus,
η₀ ≤ 1/(‖c‖₂ · diam(P)) ≤ 1/⟨c, y − x⋆⟩.
a_i⊤x ≤ b_i + t, ∀ 1 ≤ i ≤ m, (10.16)
−C ≤ t ≤ C,
for some large integer C > 0 that we specify in a moment. Taking x = 0 and t big enough (t := 1 + ∑_i |b_i|), we obtain a strictly feasible solution with at least O(1) slack at every constraint; see Exercise 10.5. Thus, we can use Algorithm 7 in conjunction with Lemma 10.17 to solve it up to precision 2^{−Ω̃(nL)} in polynomial time. If P is full-dimensional and nonempty, then, by solving the auxiliary program to 2^{−Ω̃(nL)} precision, we obtain a feasible point x₀ all of whose slacks are at least 2^{−Õ(nL)}.
10.7 Exercises
10.1 Consider a bounded polyhedron
P := {x ∈ Rn : hai , xi ≤ bi , for i = 1, 2, . . . , m}
and let F(x) be the logarithmic barrier function on the interior of P, i.e.,
F(x) := −∑_{i=1}^m log(b_i − ⟨a_i, x⟩).
(c) Prove that there exists a point x₀ in the interior of P such that
b_i − ⟨a_i, x₀⟩ ≥ 2^{−Õ(nL)}
for every i = 1, 2, . . . , m.
(d) Prove that if x0? is the analytic center of P then
10.3 In this problem we would like to show the following intuitive fact: if the
gradient of the logarithmic barrier at a given point x is short then the
point x is far away from the boundary of the polytope. In particular, the
analytic center lies well inside the polytope. Let A ∈ Rm×n , b ∈ Rm and
P = {x ∈ Rn : Ax ≤ b} be a polytope. Suppose that x0 ∈ P is a point such
that
bi − hai , x0 i ≥ δ
Let x0? be the analytic center, i.e., the minimizer of the logarithmic bar-
rier over the polytope.
(a) Prove that for all x ∈ int(P),
E_x ⊆ P.
(b) Assume without loss of generality (by shifting the polytope) that x₀⋆ = 0. Prove that
P ⊆ m·E_{x₀⋆}.
(c) Prove that if the set of constraints is symmetric, i.e., for every constraint of the form ⟨a′, x⟩ ≤ b′ there is a corresponding constraint ⟨a′, x⟩ ≥ −b′, then
x₀⋆ = 0 and P ⊆ √m·E_{x₀⋆}.
(b) Verify that the O(1)-slack in the auxiliary linear program of Equation (10.16) satisfies the condition in Lemma 10.17 with
β = min{Ω(C⁻¹), 2^{−Õ(n(L+nL_C))}},
The formulation. Consider the following linear program and its dual.
min hc, xi
s.t. Ax = b, (10.17)
x ≥ 0.
max hb, yi
s.t. A> y + s = c, (10.18)
s ≥ 0.
(a) Prove that the linear programs (10.17) and (10.18) are dual to each
other.
(b) Prove that if (x, y, s) ∈ F is a feasible primal-dual solution then the duality gap is ⟨c, x⟩ − ⟨b, y⟩ = ∑_{i=1}^m x_i s_i.
(c) Prove, using the above, that if xi si = 0 for every i = 1, 2, . . . , m,
then x is optimal for (10.17) and (y, s) is optimal for (10.18).
The high-level idea. The main idea of this IPM is to approach this goal
by maintaining a solution to the following equations:
xi si = µ, ∀ i = 1, 2, . . . , m (x, y, s) ∈ F+ , (10.19)
for a positive parameter µ > 0. As it turns out, the above defines a unique
pair (x(µ), s(µ)) and, hence, the set of all solutions (over µ > 0) forms a
continuous path in F . The strategy is to approximately follow this path:
we initialize the algorithm at a (perhaps large) value µ0 and reduce it
multiplicatively. More precisely, we start with a triple (x0 , y0 , s0 ) that ap-
proximately satisfies condition (10.19) (with µ := µ0 ) and, subsequently,
produce a sequence (xt , yt , st , µt ) of solutions, with t = 1, 2, 3, . . . such
that (xt , yt , st ) also approximately satisfies (10.19), with µ := µt . We will
show that the value of µ can be reduced by a factor of (1 − γ) with
γ := Θ(m−1/2 ) in every step.
Ax = b,
A⊤y + s = c,
x_i s_i = µ, ∀ i = 1, 2, . . . , m.
Thus, it suffices to take µ := ε/(2m) in order to bring the duality gap down to ε. To obtain a complete algorithm it remains to prove that the following invariant is satisfied in every step t:
(x_t, y_t, s_t) ∈ F₊ and v(x_t, s_t, µ_t) ≤ 1/2. (10.21)
Given the above, we will be able to deduce that after t = O(√m·log(µ₀/ε)) steps, the duality gap drops below ε. We analyze the behavior of the
potential in two steps. We first study what happens when going from
(x, y, s) to (x+∆x, y+∆y, s+∆s) while keeping the value of µ unchanged,
and later we analyze the effect of changing µ to µ(1 − γ).
Analysis of the potential: Newton step. As one can show, the linear
system (10.20) has always a solution. We need to show now that after
one such Newton step, the new point is still strictly feasible and, the
value of the potential is reduced.
(f) Prove that if ∆x and ∆s satisfy Equation (10.20) and v(x + ∆x, s +
∆s, µ) < 1 then x + ∆x > 0 and s + ∆s > 0.
(g) Prove that if ∆x and ∆s satisfy Equation (10.20) then
∑_{i=1}^m (∆x_i)(∆s_i) = 0. (10.22)
(h) Prove that if ∆x and ∆s satisfy Equation (10.20) and v(x, s, µ) < 1 then
v(x + ∆x, s + ∆s, µ) ≤ (1/2) · v(x, s, µ)²/(1 − v(x, s, µ)).
Notes
Interior point methods (IPMs) in the form of affine scaling first appeared in the
Ph.D. thesis of I. I. Dikin in the 60s; see Vanderbei (2001). Karmarkar (1984)
gave a polynomial-time algorithm to solve linear programs using an IPM based
on projective scaling. By then, there was a known polynomial time algorithm
for solving LPs, namely the ellipsoid algorithm by Khachiyan (1979, 1980)
(see Chapter 12). However, the method of choice in practice was the simplex
method due to Dantzig (1990), despite it being known to be inefficient in the
worst-case (see Klee and Minty (1972)).3 Karmarkar, in his paper, also pre-
sented empirical evidence demonstrating that his algorithm was consistently
faster than the simplex method. For a comprehensive historical perspective on
IPMs, we refer to the survey by Wright (2005).
Karmarkar’s algorithm needs roughly O(mL) iterations to find a solution;
here L is the bit complexity of the input linear program and m is the number of
constraints. Renegar (1988) combined Karmarkar’s approach with Newton’s
√
method to design a path following interior point method that took O( mL)
iterations to solve a linear program, and each iteration just had to solve a linear
system of equations of size m × m. A similar result was independently proven
by Gonzaga (1989). This is the result presented in this chapter.
There is also a class of primal-dual interior point methods for solving linear
programs (see Exercise 10.6) and the reader is referred to the book by Wright
(1997) for a discussion. It is possible, though nontrivial, to remove the full-dimensionality assumption from Theorem 10.2; we refer the reader to the book by Grötschel et al. (1988) for a thorough treatment of this.
3 Spielman and Teng (2009) showed that, under certain assumptions on the data (A, b, c), a
variant of the simplex method is provably efficient.
11
Variants of the Interior Point Method and
Self-Concordance
We present various generalizations and extensions of the path following IPM for the
case of linear programming. As an application, we derive a fast algorithm for the s − t-
minimum cost flow problem. Subsequently, we introduce the notion of self-concordance
and give an overview of barrier functions for polytopes and more general convex sets.
Figure 11.1 An s−t-minimum cost flow problem instance. The vertices 1 and 5 correspond to s and t respectively. We assume that all capacities are 1 and the costs of edges are as given in the picture. The minimum cost flow of value F = 1 has cost 11, for F = 2 it is 11 + 13 = 24, and there is no feasible flow for F > 2.
edges (w, v) and (v, w) in place of every undirected edge {w, v}. Moreover, this problem is more general than the s−t-maximum flow problem and the bipartite matching problem studied in Chapters 6 and 7 respectively.
To this end, we need to revisit the linear programming formulation of the minimum cost flow problem in (11.1). Moreover, we need to show that one iteration of an appropriate variant of the path following method can be reduced to solving a Laplacian system. Finally, the issue of finding a suitable point to initialize the interior point method becomes crucial, and we develop a fast algorithm for finding such a point.
min hc, xi
x∈Rm
(11.2)
s.t. Ax ≤ b.
By inspecting the form of the linear program (11.2), it is evident that it has
a different form than that of (11.1). However, this linear programming frame-
work is very expressive and one can easily translate the program (11.1) to the
form (11.2). This follows simply by turning equalities Bx = Fχ st into pairs of
inequalities
Bx ≤ Fχ st and − Bx ≤ −Fχ st ,
and expressing 0 ≤ xi ≤ ρi as
−xi ≤ 0 and xi ≤ ρi .
However, one of the issues is that the resulting polytope
P := {x ∈ Rm : Ax ≤ b}
is not full-dimensional, which was a crucial assumption for the path following
IPM from Chapter 10. Indeed, the running time of this algorithm depends on log(1/β), where β is the distance of an initial point x₀ ∈ P from the boundary of P; see Lemma 10.17.
min_{x∈Rᵐ} ⟨c, x⟩
s.t. [B 0; I I]·[x; y] = [b; u], (11.3)
x, y ≥ 0.
The number of variables in this new program is 2m and the number of linear
constraints is n + m.
By developing a Newton’s method for equality constrained problems and
then deriving a new, corresponding path following interior point method for
linear problems in the form as (11.3) we prove the following theorem.
iteration involves solving one Laplacian linear system, i.e., of the form
(BW B> )h = d,
min hc, xi
x∈Rm
s.t. Ax = b, (11.4)
x ≥ 0,
Eb := {x ∈ Rm : Ax = b} and E := {x ∈ Rm : Ax = 0},
min_{x∈Rᵐ} f(x)
s.t. Ax = b. (11.5)
x′ := x + ñ(x),
where
ñ(x) := −H̃(x)⁻¹g̃(x),
and iterate x₁ := x₀ + ñ(x₀). If ‖ñ(x₀)‖_{x₀} ≤ 1/6, then
‖ñ(x₁)‖_{x₁} ≤ 3‖ñ(x₀)‖²_{x₀}.
These definitions might look abstract, but in fact, the only way they differ from
their unconstrained variants is the fact that we consider E to be the tangent
space instead of Rm . As an example, consider the function
f(x₁, x₂) := 2x₁² + x₂².
When constrained to a “vertical” line
E_b := {(x₁, x₂)⊤ : x₁ = b, x₂ ∈ R}
we obtain that the gradient is
g̃(x₁, x₂) = (0, 2x₂)⊤
and the Hessian H̃(x₁, x₂) is a linear operator E → E (where E = {(0, h)⊤ : h ∈ R}) such that
H̃(x₁, x₂)(0, h)⊤ = (0, 2h)⊤.
Indeed, this is consistent with the intuition that by restricting the function to a
vertical line, only the variable x2 matters.
To summarize, we have the lemma below.
Lemma 11.3 (Gradient and Hessian in a subspace) Let f : Rᵐ → R be strictly convex and let f̃ : E_b → R be its restriction to E_b. Then we have
g̃(x) = Π_E ∇f(x),
H̃(x) = Π_E⊤ H(x) Π_E,
where Π_E : Rᵐ → E is the orthogonal projection operator onto E. Note that
Π_E = I − A⊤(AA⊤)⁻¹A.
ñ(x) = −H(x)⁻¹g(x) + H(x)⁻¹A⊤(AH(x)⁻¹A⊤)⁻¹AH(x)⁻¹g(x),
and, hence,
Π_E(H(x)h + g(x)) = 0.
In other words, this means that H(x)h + g(x) belongs to the space orthogonal to E (denoted by E⊥), i.e.,
Newton's method for minimizing f over E_b then simply starts at any x₀ ∈ E_b and iterates according to
x_{t+1} := x_t + ñ(x_t), for t = 0, 1, 2, . . . (11.7)
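Concretely, the projected Newton step of Lemma 11.4 can be computed with a couple of linear solves, as in the following sketch (our code; H and g denote the unconstrained Hessian and gradient at the current point):

```python
import numpy as np

def equality_constrained_newton_step(H, g, A):
    """Newton step restricted to the subspace {h : Ah = 0} (cf. Lemma 11.4)."""
    Hinv_g = np.linalg.solve(H, g)
    Hinv_At = np.linalg.solve(H, A.T)
    lam = np.linalg.solve(A @ Hinv_At, A @ Hinv_g)   # correction coefficients
    return -Hinv_g + Hinv_At @ lam                   # satisfies A @ step = 0

A = np.array([[1.0, 1.0, 1.0]])                      # a single equality constraint
H = np.diag([1.0, 2.0, 3.0])
g = np.array([1.0, -2.0, 0.5])
step = equality_constrained_newton_step(H, g, A)
print(A @ step)                                      # ~0: the step stays in E
```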
As in the unconstrained setting, we consider the local norm ‖·‖ₓ defined on the tangent space at x by
∀h ∈ E, ‖h‖ₓ² := ⟨h, H̃(x)h⟩ = h⊤H̃(x)h,
we have
(1 − 3δ)H̃(x) ⪯ H̃(y) ⪯ (1 + 3δ)H̃(x).
In the above, the PSD ordering is understood in the usual sense, yet restricted
to the space E. To make this precise, we need to define a notion similar to a
symmetric matrix for linear operators. A linear operator H : E → E is said to
be self-adjoint if hHh1 , h2 i = hh1 , Hh2 i for all h1 , h2 ∈ E.
At this point, all components present in Theorem 11.2 have been formally
defined. The proof of Theorem 11.2 is identical to the proof of Theorem 9.6 in Chapter 9; in fact, one just has to replace f, g, and H with f̃, g̃, and H̃ for the proof to go through.
The idea is to use Newton's method to progress along the central path, i.e., the set Γ_c := {x⋆_η : η ≥ 0}.
For completeness, we now adjust the notation to this new setting. We define the function
f_η(x) := η⟨c, x⟩ + F(x),
where H(x) and g(x) are the Hessian and gradient in Rᵐ of the logarithmic barrier F for the constraint x ≥ 0. Note that, in this case, the Hessian and the gradient of F(x) have particularly simple forms:
H(x) = X⁻², g(x) = X⁻¹1,
where X := Diag(x) is the diagonal matrix with the entries of x on the diagonal and 1 is the all ones vector. Thus, by Lemma 11.3, the equality constrained Newton step with respect to f_η takes the form
ñ_η(x) = −H(x)⁻¹∇f_η(x) + H(x)⁻¹A⊤(AH(x)⁻¹A⊤)⁻¹AH(x)⁻¹∇f_η(x), with H(x)⁻¹ = X². (11.8)
The algorithm for this setting is analogous to Algorithm 7 that was presented in Chapter 10; the only difference is the use of the equality constrained Newton step ñ_η(x) instead of the unconstrained one n_η(x). The theorem that one can formulate in this setting is as follows.
Moreover, every Newton step requires O(m) time, plus the time required to
solve one linear system of equations of the form (AWA> )y = d, where W 0
is a diagonal matrix, d ∈ Rn is a given vector and y is the vector of variables.
For the application to the s − t-minimum cost flow problem, it is also important
to show that finding the initial point on the central path is not a bottleneck.
From arguments presented in Section 10.6.2, one can derive the following
lemma that is an analog of Lemma 10.17.
Lemma 11.8 (Initialization; analog of Lemma 10.17) There is an algorithm that, given a strictly feasible point all of whose coordinates are at least β > 0, finds a pair (η₀, x₀) such that
• x₀ ∈ E_b ∩ Rᵐ_{>0},
• ‖ñ_{η₀}(x₀)‖_{x₀} ≤ 1/6,
• η₀ ≥ Ω(1/(D·‖c‖₂)), where D is the (Euclidean) diameter of the feasible set E_b ∩ Rᵐ_{≥0}.
The algorithm performs Õ(√m·log(D/β)) iterations of the interior point method.
In particular, whenever we can show that the diameter of the feasible set is bounded by a polynomial in the dimension, and we can find a point which is at least an inverse polynomial distance away from the boundary, the total number of iterations of the equality constrained path following IPM (to find the starting point and then find a point ε-close to the optimum) is roughly Õ(√m·log(‖c‖₂/ε)).
11.3 An Õ(√m)-iteration algorithm for minimum cost flow
We show how to conclude Theorem 11.1 using the equality constrained path following IPM presented in the previous section.
Number of iterations. We first note that by Theorem 11.7 the linear program (11.3) can be solved in O(√m·log(m/(εη₀))) iterations of the path following method. In order to match the number of iterations claimed in Theorem 11.1, we need to show that we can initialize the path following method at η₀ such that η₀⁻¹ = poly(m, C, U). We show this in Section 11.3.1.
One can use the formula (11.8) as a starting point for proving the above. However, one still needs to determine how the task of solving a certain 2m × 2m linear system reduces to a Laplacian system. We leave this as an exercise (Exercise 11.3).
1. Bx = (1/2)·χ_st,
2. if an edge i ∈ E does not belong to any directed path from s to t, then x_i = 0,
3. if an edge i ∈ E belongs to some directed path from s to t, then
1/(2m) ≤ x_i ≤ 1/2.
The algorithm runs in O(m) time.
Note that the edges that do not belong to any path from s to t can be removed
from the graph, as they cannot be part of any feasible flow. Below we prove
that one can find such a flow in roughly O(nm) time – the question on how to
make this algorithm more efficient is left as an exercise (Exercise 11.4).
Flows constructed in the lemma above are strictly positive, yet they fail to
satisfy the constraint Bx = Fχ st , i.e., not enough flow is pushed from s to t. To
get around this problem we need the final idea of preconditioning. We add one
directed edge î := (s, t) to the graph with large enough capacity
uî := 2F
We denote the new graph (V, E ∪ {î}) by Ĝ and call it the preconditioned graph. By Lemma 11.10, we can construct a flow of value f = min{1/2, F/2} in G which strictly satisfies all capacity constraints, since the capacities are integral. We can then make it have value exactly F by sending the remaining
F − f ≥ F/2
units directly from s to t through î. This allows us to construct a strictly positive feasible solution on the preconditioned graph.
Note, importantly, that this preconditioning does not change the optimal solution to the original instance of the problem, since sending even one unit of flow through î incurs a large cost – larger than taking any path from s to t in the original graph G. Therefore, solving the s−t-minimum cost flow problem on Ĝ is equivalent to solving it on G. Given this observation, it is enough to run the path following algorithm on the preconditioned graph. Below we show how to find a suitable starting point for the path following method in this case.
Proof First note that we may assume without loss of generality that all edges of G belong to some directed path from s to t. Take the flow x of value f guaranteed by Lemma 11.10 and route the remaining flow through the preconditioning edge, i.e., set
x_î = F − f.
For the slack variables
y_i := ρ_i − x_i
we obtain that
min{1/2, F/2} ≤ y_i.
We now apply Lemma 11.8 with
β := min{1/(2m), F/(2m), 1/2, F/2, F} ≥ min{1, F}/(2m).
It remains to find an appropriate bound on the diameter of the feasible set. Note that we have, for every i ∈ E ∪ {î},
0 ≤ x_i, y_i ≤ ρ_i.
This implies that the Euclidean diameter of the feasible set is bounded by
√(2·∑_{i=1}^m ρ_i²) ≤ √(2m)·U.
Thus, by plugging these parameters into Lemma 11.8, the bound on η0 and the
bound on the number of iterations follow.
11.4 Self-concordance
In this section, by abstracting out the key properties of the logarithmic barrier and the convergence proof of the path following IPM from Chapter 10, we arrive at the central notion of self-concordance. Not only does the notion of self-concordance help us develop path following algorithms for more general convex bodies (beyond polytopes), it also allows us to search for better barrier functions and leads to some of the fastest known algorithms for linear programming.
be the logarithmic barrier function for a polytope, let g(x) be its gradient, and let H(x) be its Hessian. Then, we have
1. for every x ∈ int(P),
g(x)⊤H(x)⁻¹g(x) ≤ m,
and
2. for every x ∈ int(P) and every h ∈ Rⁿ,
|∇³F(x)[h, h, h]| ≤ 2(h⊤H(x)h)^{3/2}.
Recall that the first property was used to show that one can progress along the central path by changing η multiplicatively by 1 + 1/(20√m) (see Lemma 10.7 in Chapter 10).
Chapter 10). The second property was not explicitly mentioned in the proof,
however, it results in the logarithmic barrier F satisfying the NL condition.
Thus, it allows the repeated recentering of the point (bringing it closer to the
central path) using Newton’s method (see Lemma 10.6 in Chapter 10).
Proof of Lemma 11.12. Part 1 was proved in Chapter 10. We provide a slightly different proof here. Let x ∈ int(P) and let S_x ∈ Rᵐˣᵐ be the diagonal matrix with the slacks s_i(x) on the diagonal. Thus, we know that
H(x) = A⊤S_x⁻²A and g(x) = A⊤S_x⁻¹1,
4. (Self-concordance) For all x ∈ int(K) and for all allowed vectors h (in the tangent space of the domain of F),
|∇³F(x)[h, h, h]| ≤ 2(h⊤∇²F(x)h)^{3/2}.
Note that the logarithmic barrier on P satisfies the above conditions with complexity parameter ν = m. Indeed, the third condition above coincides with the first property in Lemma 11.12 and the fourth condition is the same as the second property in Lemma 11.12.
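For instance (a worked check of ours), one can verify the fourth condition by hand in the simplest case m = n = 1: for F(x) = −log x on (0, ∞) we have F″(x) = 1/x² and F‴(x) = −2/x³, so |∇³F(x)[h, h, h]| = 2|h|³/x³ = 2(h²/x²)^{3/2} = 2(h⊤∇²F(x)h)^{3/2}, i.e., self-concordance holds with equality. Moreover, F′(x)²/F″(x) = (1/x²)·x² = 1, so the third condition holds with ν = 1, matching the fact that the logarithmic barrier for m constraints has complexity parameter m.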
and define the central path to consist of minimizers xη? of the above paramet-
ric family of convex functions. Algorithm 7 presented in Chapter 10 for linear
programming can be adapted straightforwardly for this self-concordant bar-
rier function F. The only difference is that m is replaced by ν, the complexity
parameter of the barrier F. I.e., the value η is updated according to the rule
η_{t+1} := η_t·(1 + 1/(20√ν)),
Theorem 11.14 (Convergence of the path following IPM for general barrier functions) Let F(x) be a self-concordant barrier function for P = {x ∈ Rⁿ : Ax ≤ b} with complexity parameter ν. The corresponding path following IPM, after T := O(√ν·log(ν/(εη₀))) iterations, outputs x̂ ∈ P that satisfies
⟨c, x̂⟩ − ⟨c, x⋆⟩ ≤ 2ε.
Moreover, every iteration of the algorithm requires computing the gradient and
Hessian (H(·)) of F at given point x ∈ int(P) and solving one linear system of
the form H(x)y = z, where z ∈ Rn is determined by x and we solve for y.
We prove that if F is a strictly convex barrier then the above imply that ν ≥ 1. For the sake of contradiction, assume ν < 1. Let
g(t) := F′(t)²/F″(t),
and consider the derivative of g:
g′(t) = 2F′(t) − (F′(t)/F″(t))²·F‴(t) = F′(t)·(2 − F′(t)F‴(t)/F″(t)²). (11.9)
Now, using the self-concordance of F, we obtain
F′(t)F‴(t)/F″(t)² ≤ 2√ν. (11.10)
Moreover, since F is a strictly convex function, it holds that F′(t) > 0 for t ∈ (α, 1] for some α ∈ (0, 1). Hence, for t > α, by combining (11.9) and (11.10) we have
g′(t) ≥ 2F′(t)·(1 − √ν).
Hence,
g(t) ≥ g(α) + 2(1 − √ν)·(F(t) − F(α)),
which gives us a contradiction as, on the one hand, g(t) ≤ ν (from the self-
concordance of F), and on the other hand, g(t) is unbounded (as limt→1 F(t) =
+∞).
Interestingly, it has also been shown that every full-dimensional convex subset of Rᵈ admits a self-concordant barrier – called the universal barrier – with parameter O(d). Note, however, that just the existence of an O(d)-self-concordant barrier does not imply that we can construct efficient algorithms for optimizing over P in O(√d) iterations. Indeed, the universal and related barriers are largely mathematical results that are rather hard to make practical from the computational viewpoint – computing gradients and Hessians of these barriers is nontrivial. We now give an overview of two important examples of barrier functions for linear programming that have good computational properties.
σ_i(x) := a_i⊤H(x)⁻¹a_i / s_i(x)².
The vector σ(x) estimates the relative importance of each of the constraints. In fact, it holds that
∑_{i=1}^m σ_i(x) = n and ∀i, 0 ≤ σ_i(x) ≤ 1.
Note that, unlike the logarithmic barrier, if some constraint is repeated multiple
times, its importance in the volumetric barrier does not scale with the number
of repetitions.
If for a moment we assumed that the leverage scores σ(x) do not vary with x, i.e.,
σ(x) ≡ σ,
then Q(x) would correspond to the Hessian ∇²F_σ(x) of the following weighted logarithmic barrier:
F_σ(x) = −∑_{i=1}^m σ_i·log s_i(x).
Recall that, in the case of the logarithmic barrier, each σ_i = 1 and, hence, the total importance of the constraints is m, while here ∑_{i=1}^m σ_i = n. This idea is further generalized in Section 11.5.2.
Figure 11.2 A polytope P and the Dikin ellipsoid centered at $x_0^\star$ – the analytic center of P. While the Dikin ellipsoid is fully contained in P, its $\sqrt{m}$-scaling contains P.
s.t. Ax = b,
x ≥ 0.
$$\min_X\ \langle C, X\rangle$$
where, for symmetric matrices M and N, the inner product is
$$\langle M, N\rangle := \mathrm{Tr}(MN).$$
One can extend path following IPMs to the world of SDPs. All one needs is
an efficiently computable self-concordant barrier function for the cone of PD
matrices
Cn := {X ∈ Rn×n : X is PD}.
Complexity parameter. Note that the logarithmic barrier for the positive or-
thant is self-concordant with parameter n. We show that this also holds when
we move to Cn .
Theorem 11.24 (Complexity parameter of the log barrier function for the
PD cone) The logarithmic barrier function F on the cone of positive definite
matrices Cn is self-concordant with parameter n.
Proof To start, note that to verify the conditions in Definition 11.13 we need to first understand what directions H to consider. As we are moving in the cone Cn, we would like to consider any H of the form X1 − X2 for X1, X2 ∈ Cn, which
coincides with the set of all symmetric matrices. Let X ∈ Cn and H ∈ Rn×n be
a symmetric matrix. We have
$$\nabla F(X)[H] = -\mathrm{Tr}\left(X^{-1}H\right),$$
$$\nabla^2 F(X)[H, H] = \mathrm{Tr}\left(X^{-1}HX^{-1}H\right), \qquad (11.14)$$
$$\nabla^3 F(X)[H, H, H] = -2\,\mathrm{Tr}\left(X^{-1}HX^{-1}HX^{-1}H\right).$$
We are now ready to verify the self-concordance of F(X). The fact that F is a barrier follows simply from the fact that if X tends to ∂Cn then det(X) → 0.
The convexity of F also follows as
$$\nabla^2 F(X)[H, H] = \mathrm{Tr}\left(X^{-1}HX^{-1}H\right) = \mathrm{Tr}\left(H_X^2\right) \ge 0,$$
where $H_X := X^{-1/2}HX^{-1/2}$ is symmetric. We are left with the task of determining the complexity parameter of F. This follows simply from the Cauchy-Schwarz inequality, as
$$\left|\nabla F(X)[H]\right| = \left|\mathrm{Tr}\left(H_X\right)\right| \le \sqrt{n} \cdot \mathrm{Tr}\left(H_X^2\right)^{1/2} = \sqrt{n}\,\nabla^2 F(X)[H, H]^{1/2}.$$
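These identities are easy to sanity-check numerically; the following sketch (illustrative, not from the book) compares the closed-form first derivative with finite differences and verifies the Cauchy-Schwarz step for a random PD matrix.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
B = rng.standard_normal((n, n))
X = B @ B.T + n * np.eye(n)                     # a point in the PD cone
H = rng.standard_normal((n, n)); H = (H + H.T) / 2   # symmetric direction

F = lambda Y: -np.log(np.linalg.det(Y))         # log barrier on the PD cone

# Directional derivative of F at X along H vs. the closed form -Tr(X^{-1}H).
t = 1e-6
fd = (F(X + t * H) - F(X - t * H)) / (2 * t)
closed = -np.trace(np.linalg.solve(X, H))
print(abs(fd - closed))                         # small: the formulas agree

# The self-concordance bound with parameter n (Theorem 11.24):
XinvH = np.linalg.solve(X, H)                   # X^{-1} H
d1 = -np.trace(XinvH)                           # grad F(X)[H]
d2 = np.trace(XinvH @ XinvH)                    # Hess F(X)[H, H] >= 0
print(abs(d1) <= np.sqrt(n) * np.sqrt(d2))      # True
```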
Theorem 11.25 (Gradient and Hessian of logdet) The Hessian and the gradient of the logarithmic barrier F at a point X ∈ Cn can be computed in the time required to invert the matrix X.
Theorem 11.26 (Path following IPM for arbitrary convex sets) Let F(x) be a self-concordant barrier function on a convex set K ⊆ Rn with complexity parameter ν. An analog of Algorithm 7, after $T := O\left(\sqrt{\nu}\,\log\frac{1}{\varepsilon\eta_0}\right)$ iterations, outputs x̂ ∈ K that satisfies
$$\langle c, \hat{x}\rangle \le \min_{x\in K}\langle c, x\rangle + \varepsilon.$$
Moreover, every iteration requires computing the gradient and Hessian (H(·)) of F at a given point x ∈ int(K) and solving one linear system of the form H(x)y = z, where z ∈ Rn is a given vector and y is a vector of variables.
We remark that one can also use the above framework to optimize general
convex functions over K. This follows from the fact that every convex program
11.7 Optimizing over convex sets via self-concordance 237
$\min_{x\in K} f(x)$ can be transformed into one that minimizes a linear function over a convex set. Indeed, the above is equivalent to
$$\min_{(x,t)\in K'} t$$
where
$$K' := \{(x, t) : x \in K,\ f(x) \le t\}.$$
This means that we just need to construct a self-concordant function F on the set K′ to solve $\min_{x\in K} f(x)$.²
2 Note that this reduction makes the convex set K′ rather complicated, as the function f is built into its definition. Nevertheless, this reduction is still useful in many interesting cases.
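For instance (an illustrative example, not from the book), minimizing $f(x) = x^2$ over $K = [-1, 1]$ becomes
$$\min_{(x,t)\in K'} t, \qquad K' = \{(x, t) : -1 \le x \le 1,\ x^2 \le t\},$$
whose optimal value 0 is attained at $(x, t) = (0, 0)$.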
11.8 Exercises
$$\|z\|_{p_2} \le \|z\|_{p_1}.$$
11.6 The goal of this problem is to develop a method for solving the s−t-minimum cost flow problem exactly. We assume here that there exists an algorithm solving the s−t-minimum cost flow problem up to additive error ε > 0 (in value) in time $\widetilde{O}\left(m^{3/2}\log\frac{CU}{\varepsilon}\right)$, where C is the maximum magnitude of a cost of an edge and U is the maximum capacity of an edge.
11.7 Optimal assignment problem. There are n workers and n tasks, the
profit for assigning task j to worker i is wi j ∈ Z. The goal is to assign
exactly one task to each worker so that no task is assigned twice and the
total profit is maximized.
(a) Prove that the optimal assignment problem is captured by the following linear program:
$$\max_{x\in\mathbb{R}^{n\times n}}\ \sum_{i,j} w_{ij} x_{ij} \quad \text{s.t.} \quad \forall i\ \sum_{j} x_{ij} = 1,\ \forall j\ \sum_{i} x_{ij} = 1,\ x \ge 0.$$
(b) Develop an algorithm based on the path-following framework for
solving this problem.
(1) The method should perform $\widetilde{O}(n)$ iterations.
(2) Analyze the time per iteration (time to compute the equality
constrained Newton’s step).
(3) Give an efficient algorithm to find a strictly positive initial
solution along with a bound on η0 .
(4) Give a bound on the total running time.
Note: while the optimal assignment problem can be combinatorially re-
duced to the s − t-minimum cost flow problem, this problem asks to
solve it directly using a path following IPM and not via a reduction to
the s − t-minimum cost flow problem.
11.8 Directed Physarum dynamics. In this problem we develop a different
interior point method for solving linear programs in the standard form
$$\min_{x\in\mathbb{R}^m}\ \langle c, x\rangle \quad \text{s.t.}\quad Ax = b,\ x \ge 0. \qquad (11.16)$$
We assume here that the program is strictly feasible, i.e., S := {x > 0 : Ax = b} ≠ ∅.
The idea is to use the barrier function $F(x) := \sum_{i=1}^m x_i \log x_i$ to endow the positive orthant $\mathbb{R}^m_{>0}$ with the local inner product coming from the Hessian of F(x), i.e., for $x \in \mathbb{R}^m_{>0}$ we define
$$x' := x - g(x),$$
implies the NL condition for some δ0 < 1. Hint: it is useful to first show
that self-concordance implies that for all x ∈ K and for all h1 , h2 , h3 ∈ Rn
it holds that
$$\left|\nabla^3 F(x)[h_1, h_2, h_3]\right| \le 2\,\|h_1\|_x \|h_2\|_x \|h_3\|_x.$$
11.11 Let K ⊆ Rn be a convex set and let F : int(K) → R. Prove that, for every
ν ≥ 0, the Condition (11.17) holds if and only if Condition (11.18) does.
$$\text{For all } x \in \mathrm{int}(K),\quad \nabla F(x)^\top \left(\nabla^2 F(x)\right)^{-1} \nabla F(x) \le \nu. \qquad (11.17)$$
$$\text{For all } x \in \mathrm{int}(K),\ h \in \mathbb{R}^n,\quad (DF(x)[h])^2 \le \nu \cdot D^2 F(x)[h, h]. \qquad (11.18)$$
Notes
The book by Cormen et al. (2001) contains a comprehensive discussion on the
s − t-minimum cost flow problem. The main theorem presented in this chapter
(Theorem 11.1) was first proved by Daitch and Spielman (2008). The proof
presented here is slightly different from the one by Daitch and Spielman (2008)
(which was based on the primal-dual framework). Their algorithm improved
over the previously fastest algorithm by Goldberg and Tarjan (1987) by a factor of roughly $\sqrt{m}$. Using ideas mentioned towards the end of this chapter, Lee and Sidford (2014) improved this bound further to $\widetilde{O}(m\sqrt{n}\log^{O(1)} U)$.
As discussed in this chapter, the s − t-maximum flow problem is a special
case of the s−t-minimum cost flow problem. Hence, the result of Lee and Sidford (2014) also implies an $\widetilde{O}(m\sqrt{n}\log^{O(1)} U)$ time algorithm for the s−t-maximum flow problem; the first improvement since Goldberg and Rao (1998).
Self-concordant functions are presented in the book by Nesterov and Ne-
mirovskii (1994). The lower bound in Theorem 11.14 can also be found in
their book. An alternative proof of the existence of a barrier function with
O(n)-self-concordance can be found in the paper by Bubeck and Eldan (2015).
Coming up with efficient O(n)-self-concordant barriers for various convex sets
K is still an important open problem. The volumetric barrier was introduced
by Vaidya (1987, 1989b,a), and the Lee-Sidford barrier was introduced by Lee
and Sidford (2014).
The Physarum dynamics-based algorithm for linear programming (Exercise
11.8) was analyzed by Straszak and Vishnoi (2016). For more on the isolation
lemma (introduced in Exercise 11.6(b)), see the papers by Mulmuley et al.
(1987) and Gurjar et al. (2018).
12
Ellipsoid Method for Linear Programming
We introduce a class of cutting plane methods for convex optimization and present an
analysis of a special case, namely, the ellipsoid method. We then show how to use this
ellipsoid method to solve linear programs over 0-1-polytopes when we only have access
to a separation oracle for the polytope.
PF := conv{1S : S ∈ F },
where 1S ∈ {0, 1}m is the indicator vector of the set S ⊆ [m]. Such polytopes
are called 0-1-polytopes as all their vertices belong to the set {0, 1}m .
Note that by solving this linear optimization problem over PF for the cost
vector c, we can solve the corresponding combinatorial optimization problem
over F . Indeed, as the optimal solution is always attained at a vertex, say x? =
1S ? , then S ? is the optimal solution to the discrete problem and vice versa.1 As
this is a type of linear program, a natural question is: can we solve such linear
programs using the interior point method?
While this does give us all the inequalities describing $P_M(G)$, there are $2^{m-1}$ of them, one for each odd subset of [m]. Thus, plugging these inequalities into the logarithmic barrier would result in $2^{O(m)}$ terms, exponential in the size of the
input: O(n + m). Similarly, computing either the volumetric barrier or the Lee-
Sidford barrier would require exponential time. In this much time, we might as
well just enumerate all possible subsets of [m] and output the best matching.
One may ask if there are more compact representations of this polytope and
the answer (see notes) turns out to be NO: any representation (even with more
variables) of P M (G) requires exponentially many inequalities. Thus, applying
an algorithm based on the path following method to minimize over P M (G) does
not seem to lead to an efficient algorithm.
In fact, the situation is even worse: this description does not even seem useful in checking if a point is in $P_M(G)$. Interestingly, the proof of Theorem
12.2 leads to a polynomial time algorithm to solve the separation problem over
P M (G) exactly. There are also (related) combinatorial algorithms to find the
maximum matching in a graph. In the next chapter, we see that efficient separa-
tion oracles can also be constructed for a large class of combinatorial polytopes
using the machinery of convex optimization (including techniques presented in
this chapter). Meanwhile, the question is: can we do linear optimization over a
polytope if we just have access to a separation oracle for it? The answer to this
leads us to the ellipsoid method that, historically, predates IPMs.
While we have abandoned the goal of developing IPMs for optimizing linear
functions over 0-1-polytopes with exponentially many constraints in this book,
it is an interesting question if self-concordant barrier functions for polytopes
such as P M (G) that are also computationally tractable exist. More specifically,
is there a self-concordant barrier function F : P M (G) → R with complexity pa-
rameter polynomial in m such that the gradient ∇F(x) and the Hessian ∇2 F(x)
are efficiently computable? By efficiently computable we not only mean poly-
nomial time in m, but also that computing the gradient and Hessian should be
simpler than solving the underlying linear program itself.
Recall that in the NO case, the vector h defines a hyperplane separating x from
the polytope P. An important class of results in combinatorial optimization is
that for many graph-based combinatorial 0-1-polytopes, such as those intro-
duced in Definition 12.1, the separation problem can be solved in polynomial
time. Here, we measure the running time of a separation oracle as a function
of the size of the graph (O(n + m)) and the bit complexity of the input vector x.
Combining the above with the existence of an efficient separation oracle for
P M (G) (Theorem 12.4), we conclude, in particular, that the problem of com-
puting the maximum cardinality matching in a graph is polynomial time solv-
able. Similarly, using the above, one can compute a matching (or a perfect
matching, or a spanning tree) of maximum or minimum weight in polynomial
time. The remainder of this chapter is devoted to the proof of Theorem 12.5.
We start by sketching a general algorithm scheme in the subsequent section
and recover the ellipsoid method as a special case of it. We then prove that the
ellipsoid method indeed takes just a polynomial number of simple iterations to
converge.
lem:
$$\min\ \langle c, x\rangle \quad \text{s.t.}\quad x \in P, \qquad (12.1)$$
where P ⊆ Rm is a polytope and we have access to a separation oracle for
P. Throughout this section we assume that P is full-dimensional. Later we
comment on how to get around this assumption.
such that all of them contain P and their volume reduces significantly at every
step. Eventually, the set Et approximates P so well that a point x ∈ Et is likely
to be in P as well. Note that we use the Lebesgue measure in Rm to measure volumes. We denote by vol(K) the m-dimensional Lebesgue measure of a set K ⊆ Rm. Formally, we consider Algorithm 9; see Figure 12.1 for an illustration.
The following theorem characterizes the performance of the above scheme.
Figure 12.1 An illustration of one iteration of the cutting plane method. One starts with a set Et guaranteed to contain the polytope P and queries a point xt ∈ Et. If xt is not in P, then the separation oracle for P provides a separating hyperplane H. Hence, one can reduce the region to which P belongs by picking a set Et+1 containing the intersection of Et with the shaded halfspace.
$$\mathrm{vol}(E_{t+1}) \le \alpha \cdot \mathrm{vol}(E_t).$$
Then, the cutting plane method described in Algorithm 9 outputs a point x̂ ∈ P after $O\left(\frac{m}{\log \alpha^{-1}} \cdot \log\frac{R}{r}\right)$ iterations.
$$\mathrm{vol}(P) \le \mathrm{vol}(E_t)$$
and
$$\mathrm{vol}(E_t) \le \alpha^t \cdot \mathrm{vol}(E_0).$$
Moreover, since P contains some ball B(x0, r) and E0 is contained in some ball B(x0′, R), we have
$$\mathrm{vol}(B(x_0, r)) \le \alpha^t \cdot \mathrm{vol}(B(x_0', R)).$$
Thus,
$$\frac{\mathrm{vol}(B(x_0, r))}{\mathrm{vol}(B(x_0', R))} \le \alpha^t. \qquad (12.2)$$
However, both B(x0, r) and B(x0′, R) are scalings (and translations) of the unit Euclidean ball B(0, 1), and hence (12.2) reads $(r/R)^m \le \alpha^t$, which gives
$$t \le \frac{m \log\frac{R}{r}}{\log \alpha^{-1}}.$$
Note that if we could find an instantiation of the cutting plane method (a way of choosing Et and xt at every step t) such that the corresponding parameter α is a constant less than 1, then it yields an algorithm for linear programming that performs a polynomial number of iterations – indeed, using the results in one of the exercises, it follows that we can always enclose P in a ball of radius roughly $R = 2^{O(Ln)}$ and find a ball inside of P of radius $r = 2^{-O(Ln)}$. Thus, $\log\frac{R}{r} = O(Ln)$.
The method we present in this chapter does not quite achieve a constant drop in volume but rather
$$\alpha \approx 1 - \frac{1}{m}.$$
However, this still implies a method for linear programming that performs a
polynomial number of iterations, and (as we see later) each iteration is just a
simple matrix operation that can be implemented in polynomial time.
Figure 12.2 An illustration of the use of Euclidean balls in the cutting plane method. Note that the smallest ball containing the intersection of E with the shaded halfplane is E itself.
Approximation using polytopes. The second idea one can try is to use poly-
topes as Et s. We start with E0 := [−R, R]m – a polytope containing P. When-
ever a new hyperplane is obtained from the separation oracle, we update
$$E_{t+1} := E_t \cap \{x : \langle x, h_t\rangle \le \langle x_t, h_t\rangle\},$$
a polytope again. This is certainly the most aggressive strategy one can imagine, as we always cut out as much as possible from Et to obtain Et+1.
We also need to give a strategy for picking a point xt ∈ Et at every iteration.
For this, it is crucial that the point xt is well in the interior of Et , as otherwise
254 Ellipsoid Method for Linear Programming
we might end up cutting only a small piece out of Et and, thus, not reducing
the volume enough. Therefore, to make it work, at every step we need to find
an approximate center of the polytope Et and use it as the query point xt . As
the polytope gets more and more complex, finding a center also becomes more
and more time-consuming. In fact, we are reducing a problem of finding a
point in P to a problem of finding a point in another polytope, seemingly a
cyclic process.
In one of the exercises we show that this scheme can be made to work by maintaining a suitable center of the polytope and using a recentering step whenever a new constraint is added. This is a significantly nontrivial approach, based on the hope that the new center is close to the old one when the polytope does not change too drastically, and hence can be found rather efficiently using, for instance, Newton's method.
through its center, for some h ∈ Rm \ {0}, we define the minimum enclosing ellipsoid of E(x0, M) ∩ H to be the ellipsoid E(x⋆, M⋆) that is the optimal solution to
$$\min_{x\in\mathbb{R}^m,\ M' \succ 0}\ \mathrm{vol}(E(x, M')) \quad \text{s.t.} \quad E(x_0, M) \cap H \subseteq E(x, M').$$
One can show that such a minimum enclosing ellipsoid always exists and is
unique, as the above can be formulated as a convex program. For our applica-
tion, we give an exact formula for the minimum volume ellipsoid containing
the intersection of an ellipsoid with a halfspace.
Given what has been already proved in Theorem 12.7, we are only missing
the following two components (stated in a form of a lemma) to deduce Theo-
rem 12.10.
Lemma 12.11 (Informal; see Lemma 12.12) Consider the ellipsoid algorithm defined in Algorithm 10. Then,
1. the volume of the ellipsoids E0, E1, E2, . . . constructed in the algorithm drops at a rate $\alpha \approx 1 - \frac{1}{2m}$, and
2. every iteration of the ellipsoid algorithm can be implemented in polyno-
mial time.
Given the above lemma (stated formally as Lemma 12.12) we can deduce The-
orem 12.10.
Proof of Theorem 12.10. By combining Theorem 12.7 with Lemma 12.11 we
obtain the bound on the number of iterations:
$$O\left(\frac{m}{\log \alpha^{-1}} \cdot \log\frac{R}{r}\right) = O\left(m^2 \log\frac{R}{r}\right).$$
and
$$\mathrm{vol}(E(x', M')) \le e^{-\frac{1}{2(m+1)}} \cdot \mathrm{vol}(E(x_0, M)).$$
The above lemma states that using ellipsoids we can implement the cutting plane method with volume drop
$$\alpha = e^{-\frac{1}{2(m+1)}} \approx 1 - \frac{1}{2(m+1)} < 1.$$
The proof of Lemma 12.12 has two parts. In the first part, we prove Lemma 12.12 for a very simple and symmetric case. In the second part, we reduce the general case to the symmetric case studied in the first part using an affine transformation of Rm.
From the two previous facts one can easily derive the following corollary re-
garding the volume of an ellipsoid.
Figure 12.3 Illustration of the symmetric case of the argument. E0 is the unit Euclidean ball in R2 and E1 is the ellipse of smallest area that contains the right half (depicted in shaded) of E0. The area of E1 is $\frac{4\sqrt{3}}{9} < 1$ times that of E0.
which follows from the fact that x belongs to the unit ball. Thus, to conclude
that x ∈ E 0 , we just need to show that
$$\left(\frac{m+1}{m}\right)^2\left(x_1 - \frac{1}{m+1}\right)^2 + \frac{m^2-1}{m^2}\left(1 - x_1^2\right) \le 1. \qquad (12.6)$$
Note that the left hand side of (12.6) is a convex (show that the second derivative is positive) and nonnegative (on [0, 1]) function of one variable. We would like to show that it is bounded by 1 for every x1 ∈ [0, 1]. To this end, since a convex function attains its maximum over an interval at one of the endpoints, it suffices to verify the bound at x1 = 0 and x1 = 1. By directly plugging in x1 = 0 and x1 = 1 we obtain that the left hand side of (12.6) equals 1 and thus the inequality is satisfied.
We now proceed to bounding the volume of E′. To this end, note that E′ = E(z, S) with
$$z := \left(\frac{1}{m+1}, 0, 0, \ldots, 0\right)^\top,$$
$$S := \mathrm{Diag}\left(\left(\frac{m}{m+1}\right)^2, \frac{m^2}{m^2-1}, \ldots, \frac{m^2}{m^2-1}\right).$$
Here Diag(x) is a diagonal matrix with diagonal entries as in the vector x. To compute the volume of E′ we simply apply Corollary 12.15:
$$\mathrm{vol}(E') = \det(S)^{1/2} \cdot \mathrm{vol}(E(0, I)).$$
Figure 12.4 An illustration of the affine transformation φ mapping the unit ball E(0, I) to the ellipsoid E(x0, M).
We have
$$\det(S) = \left(\frac{m}{m+1}\right)^2 \cdot \left(\frac{m^2}{m^2-1}\right)^{m-1} = \left(1 - \frac{1}{m+1}\right)^2\left(1 + \frac{1}{m^2-1}\right)^{m-1}.$$
Using the inequality 1 + x ≤ e^x, which is valid for every x ∈ R, we arrive at the upper bound
$$\det(S) \le e^{-\frac{2}{m+1}} \cdot e^{\frac{m-1}{m^2-1}} = e^{-\frac{1}{m+1}}.$$
Finally, we obtain
$$\mathrm{vol}(E') = \det(S)^{1/2} \cdot \mathrm{vol}(E(0, I)) \le e^{-\frac{1}{2(m+1)}} \cdot \mathrm{vol}(E(0, I)).$$
Thus, indeed, this gives a proof of Lemma 12.12 in the case when E(x0 , M) is
the unit ball and h = −e1 .
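A quick numerical check of these constants (our sketch, not from the book):

```python
import numpy as np

# Volume-drop factor det(S)^{1/2} for the half-ball ellipsoid, versus the
# bound e^{-1/(2(m+1))} derived above.
for m in [2, 3, 10, 100]:
    detS = (m / (m + 1))**2 * (m**2 / (m**2 - 1))**(m - 1)
    print(m, np.sqrt(detS), np.exp(-1 / (2 * (m + 1))))
# For m = 2 the factor is 4*sqrt(3)/9 ~ 0.7698, matching Figure 12.3.
```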
The idea now is to find an affine transformation φ so that
1. E(x0 , M) = φ(E(0, I)),
2. {x : hx, hi ≤ hx0 , hi} = φ ({x : x1 ≥ 0}) .
We refer to Figure 12.4 for an illustration of this idea. We claim that when these conditions hold, the ellipsoid E(x′, M′) := φ(E′) satisfies the conclusion of Lemma 12.12. Indeed, using Lemma 12.13 and Lemma 12.16, we have
$$\frac{\mathrm{vol}(E(x', M'))}{\mathrm{vol}(E(x_0, M))} = \frac{\mathrm{vol}(\varphi(E'))}{\mathrm{vol}(\varphi(E(0, I)))} = \frac{\mathrm{vol}(E')}{\mathrm{vol}(E(0, I))} \le e^{-\frac{1}{2(m+1)}}.$$
Moreover, by applying φ to the inclusion
$$\{x \in E(0, I) : x_1 \ge 0\} \subseteq E',$$
we obtain
$$\{x \in E(x_0, M) : \langle x, h\rangle \le \langle x_0, h\rangle\} \subseteq E(x', M').$$
It remains now to derive a formula for φ so that we can obtain an explicit expression for E(x′, M′). We claim that the following affine transformation φ satisfies the above stated properties:
$$\varphi(x) := x_0 + M^{1/2}Ux,$$
where U is any orthogonal matrix, i.e., U ∈ Rm×m satisfies UU⊤ = U⊤U = I, such that Ue1 = v, where
$$v := -\frac{M^{1/2}h}{\left\|M^{1/2}h\right\|_2}.$$
We have
$$\varphi(E(0, I)) = E\left(x_0, M^{1/2}UU^\top M^{1/2}\right) = E(x_0, M).$$
This proves the first condition. Further,
$$\begin{aligned}
\varphi(\{x : x_1 \ge 0\}) &= \{\varphi(x) : x_1 \ge 0\}\\
&= \{\varphi(x) : \langle -e_1, x\rangle \le 0\}\\
&= \{y : \langle -e_1, \varphi^{-1}(y)\rangle \le 0\}\\
&= \{y : \langle -e_1, U^\top M^{-1/2}(y - x_0)\rangle \le 0\}\\
&= \{y : \langle -M^{-1/2}Ue_1, y - x_0\rangle \le 0\}\\
&= \{y : \langle h, y - x_0\rangle \le 0\}
\end{aligned}$$
12.4 Analysis of volume drop and efficiency for ellipsoids 263
and
$$\begin{aligned}
M' &= M^{1/2}U\,\frac{m^2}{m^2-1}\left(I - \frac{2}{m+1}\,e_1e_1^\top\right)U^\top M^{1/2}\\
&= \frac{m^2}{m^2-1}\left(M - \frac{2}{m+1}\left(M^{1/2}Ue_1\right)\left(M^{1/2}Ue_1\right)^\top\right)\\
&= \frac{m^2}{m^2-1}\left(M - \frac{2}{m+1}\cdot\frac{(Mh)(Mh)^\top}{h^\top Mh}\right).
\end{aligned}$$
By plugging in $g := \frac{h}{\|M^{1/2}h\|_2}$, the lemma follows.
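For concreteness, here is a sketch (ours, using the formulas just derived; Lemma 12.12's exact statement is assumed to match them) of the resulting ellipsoid update, together with a check of the guaranteed volume drop:

```python
import numpy as np

def ellipsoid_update(x0, M, h):
    """Given E(x0, M) and a halfspace {x : <h, x - x0> <= 0} through its
    center, return an ellipsoid E(x', M') containing the intersection."""
    m = len(x0)
    Mh = M @ h
    hMh = h @ Mh                                 # equals ||M^{1/2} h||_2^2
    x_new = x0 - Mh / ((m + 1) * np.sqrt(hMh))   # shift towards the kept half
    M_new = (m**2 / (m**2 - 1)) * (M - (2 / (m + 1)) * np.outer(Mh, Mh) / hMh)
    return x_new, M_new

# Volume drop: vol(E(x, M)) is proportional to det(M)^{1/2}.
m = 5
x0, M, h = np.zeros(m), np.eye(m), np.ones(m)
_, M_new = ellipsoid_update(x0, M, h)
ratio = np.sqrt(np.linalg.det(M_new) / np.linalg.det(M))
print(ratio, np.exp(-1 / (2 * (m + 1))))         # ratio is below the bound
```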
This is enough to guarantee that the algorithm is still correct and converges in
the same number of steps (up to a constant factor) assuming that p is large
enough.
$$c'_i := 2^m c_i + 2^{i-1}.$$
The total contribution of the perturbation for any vertex solution is strictly less than $2^m$ and, hence, it does not create any new optimal vertex. Moreover, it is now easy to see that every vertex has a different cost and, thus, the optimal cost (and vertex) is unique.
Finally, note that by doing this perturbation, the bit complexity of c only
grows by O(m).
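A small illustration (ours, with all of {0, 1}^m playing the role of the vertex set) of how this perturbation isolates a unique optimal vertex while preserving optimality:

```python
from itertools import product

m = 4
c = [1, 0, 1, 0]                               # degenerate costs: many ties
vertices = list(product([0, 1], repeat=m))

def cost(c, v):
    return sum(ci * vi for ci, vi in zip(c, v))

best = min(cost(c, v) for v in vertices)
print(sum(cost(c, v) == best for v in vertices))   # several optimal vertices

# Perturbed costs c'_i = 2^m c_i + 2^(i-1): ties broken, optimum preserved.
cp = [2**m * ci + 2**(i - 1) for i, ci in enumerate(c, start=1)]
bestp = min(cost(cp, v) for v in vertices)
opt = [v for v in vertices if cost(cp, v) == bestp]
print(len(opt), opt[0])   # exactly one optimum; it is also optimal for c
```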
$$P' := \left\{x \in P : \langle c, x\rangle \le C + \frac{1}{4}\right\},$$
Proof Assume without loss of generality that 0 ∈ P and that 0 is the unique
optimal solution. It is then enough to consider C = 0, as for C < 0, the polytope
P0 is empty and, for C ≥ 1, the optimal cost is larger than that for C = 0.
We start by noting that P contains a ball B of radius 2−poly(m) . To see this, we
first pick m + 1 affinely independent vertices of P and show that the simplex
spanned by them contains a ball of such a radius.
Next, we show that by scaling this ball down, roughly by a factor of $2^L$ with respect to the origin, it is contained in
$$P' := \left\{x \in P : \langle c, x\rangle \le \frac{1}{4}\right\}.$$
To see this, note that for every point x ∈ P we have
$$\langle c, x\rangle \le \sum_i |c_i| \le 2^L.$$
but
$$\mathrm{vol}\left(2^{-L-3}B\right) = 2^{-m(L+3)}\,\mathrm{vol}(B) = 2^{-m(L+3)-\mathrm{poly}(m)} = 2^{-\mathrm{poly}(m, L)}.$$
of a ball that can be fit in P′ provided in Lemma 12.17, we conclude that the running time of one such application is poly(m, L).
12.6 Exercises
12.1 Consider the undirected graph on three vertices {1, 2, 3} with edges
12, 23, 13.
Prove that the following polytope does not capture the matching poly-
tope of this graph.
x ≥ 0, x1 + x2 ≤ 1, x2 + x3 ≤ 1, x1 + x3 ≤ 1.
12.2 Construct a polynomial time separation oracle for the perfect matching
polytope using the following steps:
(a) Prove that separating over PPM (G) for a graph G = (V, E) reduces
to the following odd minimum cut problem: given x ∈ QE find:
$$\min_{S :\ |S|\ \text{odd}}\ \sum_{i\in S,\ j\in \bar{S},\ ij\in E} x_{ij}.$$
(b) Prove that the odd minimum cut problem is polynomial time solv-
able.
12.3 Recall that, for a polytope P ⊆ Rm, the linear optimization problem is: given a cost vector c ∈ Qm, find a vertex x⋆ ∈ P minimizing ⟨c, x⟩ over x ∈ P.
Let P be a class of full-dimensional 0-1-polytopes for which the linear optimization problem is polynomial time solvable (i.e., polynomial in m and the bit complexity of c). Prove that the separation problem for the class P is also polynomial time solvable using the following steps:
(a) Consider the polar of P with respect to a point x0 in its interior:
$$P^\circ := \{y \in \mathbb{R}^m : \forall x \in P,\ \langle y, x - x_0\rangle \le 1\}.$$
(b) Prove that a polynomial time linear optimization oracle for the po-
lar P◦ of P is enough to implement a polynomial time separation
oracle for P.
(c) Prove that every P ∈ P has a polynomial time linear optimization oracle using the ellipsoid method. Note: one can assume that polynomial time rounding to a vertex solution is possible, given a method to find an ε-optimal solution in poly(m, L(c), log(1/ε)) time.
12.4 Prove that there is a polynomial time algorithm (based on the ellipsoid method) for linear programming in the explicit form, i.e., given a matrix A ∈ Qm×n, b ∈ Qm and c ∈ Qn, find an optimal solution x⋆ ∈ Qn to $\min_{x\in\mathbb{R}^n}\{\langle c, x\rangle : Ax \le b\}$. The following hints may be helpful:
12.5 Prove that the ellipsoid E(x′, M′) ⊆ Rm defined in Lemma 12.12 coincides with the minimum volume ellipsoid containing $\{x \in E(x_0, M) : \langle h, x - x_0\rangle \le 0\}$.
12.6 In this problem, we derive a variant of the cutting plane method for solv-
ing the feasibility problem. In the feasibility problem, the goal is to find
a point x̂ in a convex set K. This algorithm (in contrast to the ellipsoid
method) maintains a polytope as an approximation of K, instead of an
ellipsoid.
$$x_t := \operatorname{argmin}_{x\in P_t} F_t(x).$$
Lower bound on the potential. The next step is to show a lower bound on φt. Intuitively, we would like to show that when t becomes large enough, then (on average) φt+1 − φt > 2 log(1/r), and hence eventually φt exceeds the bound derived above – this gives us a bound on when the algorithm terminates.
Let us denote by Ht := ∇2 Ft (xt ) the Hessian of the logarithmic barrier
at the analytic center.
(b) Prove that for every step t of the algorithm
$$\varphi_{t+1} \ge \varphi_t - \frac{1}{2}\log\left(a_{t+1}^\top H_t^{-1} a_{t+1}\right) \qquad (12.7)$$
and conclude that
$$\varphi_t \ge \varphi_0 - \frac{t}{2}\log\left(\frac{1}{t}\sum_{i=1}^t a_i^\top H_{i-1}^{-1} a_i\right).$$
The next step is to analyze (prove an upper bound on) the sum $\sum_{i=1}^t a_i^\top H_{i-1}^{-1} a_i$. Towards this, we establish the following sequence of steps.
(c) Prove that for every step t of the algorithm
$$L_t := I + \frac{1}{m}\sum_{i=1}^t a_i a_i^\top \preceq H_t.$$
Notes
We refer the reader to the book by Schrijver (2002a) for numerous examples
of 0-1-polytopes related to various combinatorial objects. Theorem 12.2 was
proved in a seminal paper by Edmonds (1965a). We note that a result similar to
Theorem 12.2 also holds for the spanning tree polytope. However, while there
are results showing that there is a way to encode the spanning tree polytope
using polynomially many variables and inequalities, for the matching poly-
tope, Rothvoss (2017) proved that any linear representation of P M (G) requires
exponentially many constraints.
The ellipsoid algorithm was first developed for linear programs by Khachiyan
(1979, 1980) who built upon the ellipsoid method given by Shor (1972) and
Yudin and Nemirovskii (1976). It was further generalized to the case of poly-
topes (when we only have access to a separation oracle for the polytope) by
Grötschel et al. (1981), Padberg and Rao (1981), and Karp and Papadimitriou
(1982). For details on how to deal with the bit precision issues mentioned in
Section 12.4.4, see the book by Grötschel et al. (1988). For details on how to
avoid the full-dimensionality assumption by “jumping” to a lower dimensional
subspace, the reader is also referred to Grötschel et al. (1988). Exercise 12.6 is
based on the paper by Vaidya (1989a).
13
Ellipsoid Method for Convex Optimization
We show how to adapt the ellipsoid method to solve general convex programs. As appli-
cations, we present a polynomial time algorithm for submodular function minimization
and a polynomial time algorithm to compute maximum entropy distributions over com-
binatorial polytopes.
Property (1) shows that the ellipsoid method is applicable more generally than
interior point methods, which in contrast require strong access to the convex
set via a self-concordant barrier function. Properties (2) and (3) say that, at
least asymptotically, the ellipsoid algorithm outperforms first-order methods.
However, so far we have only developed the ellipsoid method for linear pro-
gramming, i.e., when K is a polytope and f is a linear function.
In this chapter, we extend the framework of the ellipsoid method to general
convex sets and convex functions assuming an appropriate (oracle) access to f
and a separation oracle for K. We also switch back from m to n to denote the
ambient space of the optimization problem. More precisely, in Section 13.4 we
prove the following theorem.
$$f(\hat{x}) \le f(x^\star) + \varepsilon,$$
Here, TK and Tf are the running times of the separation oracle for K and the first-order oracle for f, respectively.
It can be seen that Theorem 13.1 applies to this setting by letting ε := δl0. If x̂ is such that
$$f(\hat{x}) \le f(x^\star) + \varepsilon,$$
then
$$r_{\mathcal{F}}(S) := \max\{|T| : T \in \mathcal{F},\ T \subseteq S\},$$
Note that the inclusion from left to right (⊆) is true for all families F; the opposite direction relies on the matroid assumption and is nontrivial.
For the matroid F consisting of all sets of size at most k, the rank function is particularly simple:
$$r_{\mathcal{F}}(S) = \min\{|S|, k\}.$$
Hence, the corresponding matroid polytope is
$$P_{\mathcal{F}} = \left\{x \ge 0 : \forall S \subseteq [n],\ \sum_{i\in S} x_i \le \min\{|S|, k\}\right\}.$$
Note that, as the rank of a singleton set can be at most 1, the right hand side has
constraints of the form 0 ≤ xi ≤ 1 for all 1 ≤ i ≤ n. Hence, they trivially imply
the rank constraint for any set S whose cardinality is at most k. Moreover, there
is a constraint
$$\sum_{i=1}^n x_i \le k.$$
This constraint, along with the fact that xi ≥ 0 for all i, implies the constraint
$$\sum_{i\in T} x_i \le k$$
for any T such that |T | ≥ k. Thus, the corresponding matroid polytope is just
$$P_{\mathcal{F}} = \left\{x \in [0, 1]^n : \sum_{i\in[n]} x_i \le k\right\},$$
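For this particular polytope, the separation problem is trivial to solve directly; a sketch (ours, for illustration):

```python
def separate_uniform_matroid(x, k):
    """Separation oracle for P = {x in [0,1]^n : sum(x) <= k}.
    Returns None if x is in P, else a violated inequality (a, b)
    with <a, x> > b but <a, z> <= b for all z in P."""
    n = len(x)
    for i in range(n):
        if x[i] < 0:                   # violates -x_i <= 0
            a = [0.0] * n; a[i] = -1.0
            return a, 0.0
        if x[i] > 1:                   # violates x_i <= 1
            a = [0.0] * n; a[i] = 1.0
            return a, 1.0
    if sum(x) > k:                     # violates sum_i x_i <= k
        return [1.0] * n, float(k)
    return None                        # x is in the polytope

print(separate_uniform_matroid([0.5, 0.5, 0.5], k=2))  # None: feasible
print(separate_uniform_matroid([0.9, 0.9, 0.9], k=2))  # ([1,1,1], 2.0)
```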
and find
$$S^\star := \operatorname{argmin}_{S\subseteq[n]} F_x(S).$$
Indeed,
$$x \in P_{\mathcal{F}} \quad \text{if and only if} \quad F_x(S^\star) \ge 0.$$
$$\langle y, 1_{S^\star}\rangle \le r_{\mathcal{F}}(S^\star).$$
This is because, when $F_x(S^\star) < 0$, the point x violates this inequality, but Theorem 13.4 implies that the inequality is satisfied by all points in PF.
Thus, in order to solve the separation problem, it is sufficient to solve the min-
imization problem for F x . The crucial observation is that F x is not an arbitrary
function – it is submodular, and submodular functions have a lot of combina-
torial structure that can be leveraged.
We have not specified how the function F is given. Depending on the applica-
tion, it can either be succinctly described using a graph (or some other com-
binatorial structure), or given as an oracle that, given S ⊆ [n], outputs F(S ).
Remarkably, this is all we need to develop algorithms for SFM.
Theorem 13.8 (Polynomial time algorithm for SFM) There is an algorithm that, given oracle access to a submodular function F : 2[n] → [l0, u0] for some integers l0 ≤ u0, and an ε > 0, finds a set S ⊆ [n] such that
$$F(S) \le F(S^\star) + \varepsilon,$$
where S⋆ is a minimizer of F. The algorithm makes $\mathrm{poly}\left(n, \log\frac{u_0-l_0}{\varepsilon}\right)$ queries to the oracle.
Note that, for the application of constructing separation oracles for matroid polytopes, a polynomial running time with respect to $\log\frac{u_0-l_0}{\varepsilon}$ is necessary. Recall that, in the separation problem, the input also consists of an x ∈ [0, 1]n and we considered the function
$$F_x(S) := r_{\mathcal{F}}(S) - \sum_{i\in S} x_i.$$
Thus, if we have a separation oracle for the matroid polytope, it easily extends
to a separation oracle for the matroid base polytope. Note that spanning trees
are maximum cardinality elements in the graphic matroid and, hence, the the-
orem above also gives us a separation oracle for the spanning tree polytope.
where the expectation is taken over a uniformly random choice of λ ∈ [0, 1].
The Lovász extension has many interesting and important properties. First,
observe that the Lovász extension of F is always a continuous function and
agrees with F on integer vectors. It is, however, not smooth. Moreover, one
can show that
$$\min_{x\in[0,1]^n} f(x) = \min_{S\subseteq[n]} F(S);$$
see Exercise 13.8. Further, the following theorem asserts that the Lovász ex-
tension of a submodular function is convex.
where f is the Lovász extension of the submodular function F. Note that the
constraint set is just the hypercube [0, 1]n .
the function f(x) (as demonstrated above) is just a linear function; hence, its gradient can be evaluated efficiently.
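Concretely (a sketch of ours, with a small cut function standing in for the oracle), the Lovász extension, a subgradient, and the chain of sets used by the rounding step below can all be obtained from one sort and n + 1 evaluations of F:

```python
import numpy as np

def lovasz_extension(F, x):
    """Evaluate the Lovász extension f of a set function F at x in [0,1]^n.
    Returns f(x), a subgradient, and the chain S_0 ⊆ S_1 ⊆ ... ⊆ S_n."""
    n = len(x)
    order = np.argsort(-x)                 # coordinates in decreasing order
    sets = [frozenset()]
    for i in order:                        # S_k = top-k coordinates of x
        sets.append(sets[-1] | {i})
    values = np.array([F(S) for S in sets])
    xs = np.concatenate(([1.0], x[order], [0.0]))
    lambdas = xs[:-1] - xs[1:]             # f(x) = sum_k lambda_k F(S_k)
    w = np.zeros(n)
    for k, i in enumerate(order, start=1):
        w[i] = values[k] - values[k - 1]   # greedy subgradient coordinates
    return float(lambdas @ values), w, sets

# Toy submodular F: cut function of the path graph 0 - 1 - 2.
edges = [(0, 1), (1, 2)]
F = lambda S: sum((i in S) != (j in S) for i, j in edges)
f_val, w, sets = lovasz_extension(F, np.array([0.2, 0.9, 0.5]))
print(f_val)   # 1.1; some F(S_k) in the chain is at most this value
```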
We are now ready to establish a proof of Theorem 13.8.
Proof of Theorem 13.8. As seen in the discussion so far, minimizing a sub-
modular function F : 2[n] → R reduces to minimizing its Lovász extension
f : [0, 1]n → R. Moreover, Theorem 13.11 asserts that f is a convex function.
Therefore, one can compute an ε-approximate solution to $\min_{x\in[0,1]^n} f(x)$ given just an oracle that provides values and subgradients of f (this follows from Theorem 13.12). If the range of F is contained in [l, u], then so is the range of f. Hence, the running time bound in Theorem 13.8 follows.
Finally, we show how to round an ε-approximate solution x̂ ∈ [0, 1]n to a set S ⊆ [n]. From the definition of the Lovász extension it follows that
$$f(\hat{x}) = \sum_{i=0}^n \lambda_i F(S_i),$$
p ∈ ∆Ω .
Here, ∆Ω denotes the probability simplex, i.e., the set of all probability distri-
butions over Ω.
Note that the entropy function is concave (nonlinear), and ∆Ω is convex; hence, this is a convex optimization problem. However, solving this problem requires specifying how the vectors are given as input.
If the domain Ω is of size N and the vectors vω are given to us explicitly, then
one can find an ε-approximate solution to Equation (13.3) in time polynomial in N, $\log\frac{1}{\varepsilon}$, and the bit complexity of the input vectors and θ. This can be done, for instance, using the interior point method², or the ellipsoid method from this chapter.
However, when the domain Ω is large and the vectors are implicitly spec-
ified, this problem becomes computationally nontrivial. To take an instruc-
tive example, let Ω be the set of all spanning trees T of an undirected graph
G = (V, E), and let vT := 1T be the indicator vector of the spanning tree T .
Thus, all vectors vT are implicitly specified by the graph G = (V, E). The input
to the entropy maximization problem in the spanning tree case is, thus, a graph
G = (V, E) and a vector θ ∈ QE . In trying to apply the above approach to the
case of spanning trees in a graph, one immediately realizes that N is exponen-
tial in the size of the graph and, thus, a polynomial in N algorithm is, in fact,
exponential in the number of vertices (or edges) in G.
Further, even though an exponential domain (such as spanning trees in a
graph) can be specified compactly (via a graph), the output of the problem is
still of exponential size – a vector p of dimension N = |Ω|. How can we output
a vector of exponential length in polynomial time? We cannot. However, what
is not ruled out is an algorithm that, given an element of the domain (say a tree
T in G), outputs the probability associated to it in polynomial time (pT ). In the
next sections we show that, surprisingly, this is possible and, using the ellipsoid
method for convex optimization, we can obtain a polynomial time algorithm
for the entropy maximization problem over the spanning tree polytope. What
we present here is a sketch consisting of the key steps in the algorithm and the
proof; several steps are left as exercises.
We would like to find the minimum of f (y). For this we apply the general algo-
rithm for convex optimization based on the ellipsoid method that we derive in
Section 13.4. When translated to this setting it has the following consequence.
Theorem 13.17 (Informal; see Theorem 13.1) Suppose the following con-
ditions are satisfied:
1. we are given access to a polynomial time oracle computing the value and
the gradient of f ,
2. y? is guaranteed to lie in a ball B(0, R) for some R > 0, and
3. the values of f over B(0, R) stay in the interval [−M, M] for some M > 0.
Then one can compute a point ŷ satisfying
$$f(\hat{y}) \le f(y^\star) + \varepsilon$$
in time poly(n, log R, log M, log ε−1) (when an oracle query is treated as a unit operation).
Note that we do not require the inner ball due to the following fact (Exercise 13.13):
$$\forall y, y' \in \mathbb{R}^n,\quad |f(y) - f(y')| \le 2\sqrt{n}\,\|y - y'\|_2. \qquad (13.6)$$
Thus, we can set $r := \Theta\left(\frac{\varepsilon}{\sqrt{n}}\right)$. In general, this does not imply a polynomial
time algorithm for the maximum entropy problem as, for specific instantia-
tions, we have to provide and account for an evaluation oracle, a value of R,
and M. Nevertheless, we show how to use the above to obtain a polynomial
time algorithm for the maximum entropy problem for the case of spanning
trees.
solution to
$$\min_{y\in\mathbb{R}^E}\ \log \sum_{T\in\mathcal{T}_G} e^{\langle y,\, 1_T - \theta\rangle} \qquad (13.7)$$
in time poly(|V| + |E|, $\eta^{-1}$, $\log\varepsilon^{-1}$), where TG is the set of all spanning trees in G.
Note, importantly, that the above gives a polynomial time algorithm only if θ is sufficiently inside the polytope PST (or, in other words, when it is far away from the boundary of PST). If the point θ comes close to the boundary (i.e., η ≈ 0), then the bound deteriorates to infinity. This assumption is not necessary and we give a pointer in the notes on how to get rid of it. However, the η-interiority assumption makes the proof simpler and easier to understand.
Lemma 13.19 (Bound on the norm of the optimal solution) Suppose that G = (V, E) is an undirected graph and θ ∈ RE satisfies B(θ, η) ⊆ PST(G) for some η > 0. Then y⋆, the optimal solution to the dual of the max entropy problem (13.7), satisfies
$$\left\|y^\star\right\|_2 \le \frac{|E|}{\eta}.$$
Note that the bound deteriorates as η → 0 and, thus, is not very useful when θ
is close to the boundary.
Proof One can first show that, for every spanning tree T ∈ TG,
$$\left\langle y^\star, 1_T - \theta\right\rangle \le m.$$
Since every point x ∈ PST(G) is a convex combination of vectors 1T, this implies that
$$\left\langle y^\star, x - \theta\right\rangle \le m.$$
As B(θ, η) ⊆ PST(G), it follows that
$$\forall v \in B(0, \eta),\quad \left\langle y^\star, v\right\rangle \le m.$$
In particular, by taking $v := \eta\,\frac{y^\star}{\|y^\star\|_2}$ we obtain that
$$\left\|y^\star\right\|_2 \le \frac{m}{\eta}.$$
Lemma 13.20 (Polynomial time evaluation oracle for f and its gradient) Suppose that G = (V, E) is an undirected graph and f is defined as
$$f(y) := \log \sum_{T\in\mathcal{T}_G} e^{\langle y,\, 1_T - \theta\rangle}.$$
Then there is an algorithm that, given y, outputs the value f(y) and the gradient ∇f(y) in time poly(‖y‖2, |E|).
The proof is left as Exercise 13.14. Note that the running time of the oracle is polynomial in ‖y‖2 and not in log ‖y‖2 as one would perhaps desire. This is a consequence of the fact that even a single term in the sum, $e^{\langle y,\, 1_T - \theta\rangle}$, might be as large as $e^{\|y\|_2}$ and hence requires up to ‖y‖2 bits to represent.
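The key to Exercise 13.14 is that the sum over spanning trees factors through Kirchhoff's matrix-tree theorem. A sketch (ours, with a toy graph and θ as illustrative inputs) of computing f(y) from a weighted Laplacian determinant:

```python
import numpy as np

def f_spanning_trees(edges, n, y, theta):
    """f(y) = log sum_T exp(<y, 1_T - theta>) over spanning trees T.
    By the matrix-tree theorem, sum_T prod_{e in T} w_e (with w_e = e^{y_e})
    equals any cofactor of the weighted Laplacian."""
    w = np.exp(y)
    L = np.zeros((n, n))
    for (u, v), we in zip(edges, w):
        L[u, u] += we; L[v, v] += we
        L[u, v] -= we; L[v, u] -= we
    # Delete one row and column and take the determinant (a cofactor of L).
    sign, logdet = np.linalg.slogdet(L[1:, 1:])
    return logdet - y @ theta

# Triangle graph: 3 spanning trees; at y = 0, theta = 0 we get log 3.
edges = [(0, 1), (1, 2), (0, 2)]
print(f_spanning_trees(edges, 3, np.zeros(3), np.zeros(3)))  # ~ log(3)
```

The gradient has coordinates $(\nabla f(y))_e = \Pr_T[e \in T] - \theta_e$, and these edge marginals can be read off from the same weighted Laplacian (via effective resistances), which is one way to carry out Exercise 13.14.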
when we have a separation oracle for K and a first-order oracle for f. To this end, we start by generalizing the ellipsoid algorithm from Chapter 12 to solve the feasibility problem for any convex set K (not only for polytopes) and, further, we show that this algorithm, combined with the standard binary search procedure, yields an efficient algorithm under mild conditions on f and K.
is convex and that a separation oracle for P is available. Indeed, Theorem 12.10 in Chapter 12 can be rewritten in this more general form to yield the following.
Theorem 13.21 (Solving the feasibility problem for convex sets) There is
an algorithm that, given a separation oracle for a convex set K ⊆ Rn , a radius
R > 0 such that K ⊆ B(0, R) and a parameter r > 0 gives one of the following
outputs
1. YES, along with a point x̂ ∈ K, proving that K is nonempty, or
2. NO, in which case K is guaranteed to not contain a Euclidean ball of
radius r.
The running time of the algorithm is
$$O\left(\left(n^2 + T_K\right)\cdot n^2 \cdot \log\frac{R}{r}\right),$$
where TK is the running time of the separation oracle for K.
Note that the above is written in a slightly different form than Theorem 12.10
in Chapter 12. Indeed, there we assumed that K contains a ball of radius r and
wanted to compute a point inside of K. Here, the algorithm proceeds until the
volume of the current ellipsoid becomes smaller than the volume of a ball of
radius r, and then outputs NO if the center of the current ellipsoid does not lie
in K. This variant can be seen as an approximate nonemptiness check:
1. If K , ∅ and vol(K) is “large enough” (determined by r) then the algo-
rithm outputs YES,
2. If K = ∅ then the algorithm outputs NO,
3. If K , ∅, but vol(K) is small (determined by r), then the algorithm can
answer either YES or NO.
The uncertainty introduced when K has a small volume is typically not a problem, as we can often mathematically force K to either be empty or have a large volume when running the ellipsoid algorithm.
• A parameter ε > 0
Output: An x̂ ∈ K such that $f(\hat{x}) \le \min_{x\in K} f(x) + \varepsilon$
Algorithm:
1: Let l := l0 and u := u0
2: Let $r' := \frac{r\cdot\varepsilon}{2(u_0 - l_0)}$
3: while $u - l > \frac{\varepsilon}{2}$ do
4: Set $g := \left\lfloor\frac{l+u}{2}\right\rfloor$
5: Define
$$K_g := K \cap \{x : f(x) \le g\}$$
for some value g. This can be done as long as a separation oracle for K g is
available. One also needs to be careful about the uncertainty introduced in the
nonemptiness check, as it might potentially cause errors in the binary search
procedure. For this, it is assumed that K contains a ball of radius r and, at every
step of the binary search, the ellipsoid algorithm is called with an appropriate
292 Ellipsoid Method for Convex Optimization
choice of the inner ball parameter. Details are provided in Algorithm 12, where
it is assumed that all values of f over K lie in the interval [l0 , u0 ].
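A compact sketch (ours; an illustration of the scheme, not the book's Algorithm 12) of the binary search is below. The argument `feasibility` stands in for the ellipsoid algorithm of Theorem 13.21 (not implemented here): it takes a separation oracle, the outer radius R, and the inner radius, and returns a point of the queried set or None.

```python
def minimize_convex(f, grad_f, separate_K, feasibility, R, r, l0, u0, eps):
    """Binary search over the optimal value, testing nonemptiness of
    K_g = K ∩ {x : f(x) <= g} with an ellipsoid-type feasibility oracle."""
    def cut_for_Kg(x, g):
        # Separation oracle for K_g:
        a = separate_K(x)
        if a is not None:
            return a                  # a violated constraint of K itself
        if f(x) > g:
            return grad_f(x)          # f(z) <= g forces <grad, z - x> < 0
        return None                   # x lies in K_g
    l, u, x_hat = l0, u0, None
    r_prime = r * eps / (2 * (u0 - l0))
    while u - l > eps / 2:
        g = (l + u) / 2
        point = feasibility(lambda x: cut_for_Kg(x, g), R, r_prime)
        if point is not None:         # YES: K_g nonempty, optimum <= g
            u, x_hat = g, point
        else:                         # NO: K_g has no ball of radius r'
            l = g
    return x_hat
```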
At this point it is not clear that the algorithm presented in Algorithm 12 gives
a correct algorithm. Indeed, we need to verify that the choice of r0 guarantees
that the binary search algorithm gives correct answers most of the time. Fur-
ther, it is not even clear how one would implement such an algorithm, as so far
we have not yet talked about how f is given. It turns out that access to values
and gradients of f (zeroth- and first-order oracles of f ) suffices in this setting.
We call this first-order access to f as well and recall its definition.
K g := K ∩ {x : f (x) ≤ g},
and the second component is to show that, by using the ellipsoid algorithm to test the nonemptiness of K_g with the parameter r′ specified in the algorithm, we are guaranteed to obtain a correct answer up to the required precision. We discuss these two components in separate steps and subsequently conclude the result.
For the first component, recall the sublevel set
$$S_g := \{x \in \mathbb{R}^n : f(x) \le g\},$$
so that K_g = K ∩ S_g.
Bounds on the volume of sublevel sets. In this step, we give a lower bound on the radius of a ball contained in K_g, depending on how close g is to the minimum f⋆ of f on K. This is required to claim that the ellipsoid algorithm correctly determines whether K_g ≠ ∅ in various steps of the binary search.
Lemma 13.24 (Lower bound on volume of sublevel sets) Let K ⊆ Rn be a convex set containing a Euclidean ball of radius r > 0 and f : K → [f⋆, fmax] be a convex function on K with
$$f^\star := \min_{x\in K} f(x).$$
For any
$$g := f^\star + \delta$$
(for δ > 0), define
$$K_g := \{x \in K : f(x) \le g\}.$$
Then, K_g contains a Euclidean ball of radius $r\cdot\frac{\delta}{f_{\max} - f^\star}$.
Given Lemmas 13.23 and 13.24, we are now well equipped to prove Theorem 13.1.
Proof of Theorem 13.1. Let f⋆ = f(x⋆). First, observe that whenever
$$g \ge f^\star + \frac{\varepsilon}{2},$$
then K_g contains a Euclidean ball of radius
$$r' := \frac{r\cdot\varepsilon}{2(u_0 - l_0)}.$$
Hence, for such a g, the ellipsoid algorithm terminates with YES and outputs a point x̂ ∈ K_g. This is a direct consequence of Lemma 13.24.
Let l and u be the values of these variables at the moment of termination of the algorithm. From the above reasoning it follows that
$$u \le f^\star + \varepsilon.$$
Indeed,
$$u \le l + \frac{\varepsilon}{2}$$
and the ellipsoid algorithm can answer NO only for
$$g \le f^\star + \frac{\varepsilon}{2}.$$
Hence,
$$l \le f^\star + \frac{\varepsilon}{2}.$$
Therefore, u ≤ f⋆ + ε and the x̂ output by the algorithm belongs to K_u. Hence,
$$f(\hat{x}) \le f^\star + \varepsilon.$$
Remark 13.25 (Avoiding binary search) By taking a closer look at the algorithm and the proof of Theorem 13.1, one can observe that one does not need to restart the ellipsoid algorithm at every iteration of the binary search. Indeed, one can just reuse the ellipsoid obtained in the previous call. This leads to an algorithm with a slightly reduced running time
$$O\left(\left(n^2 + T_K + T_f\right)\cdot n^2 \cdot \log\left(\frac{R}{r}\cdot\frac{u_0 - l_0}{\varepsilon}\right)\right).$$
Note that there is no square in the logarithmic factor.
The intuition for using the volumetric center xt as the queried point at step t is
that xt is the point around which the Dikin ellipsoid has the largest volume. As
the Dikin ellipsoid is a decent approximation of the polytope, then one should
expect that a hyperplane through the center of this ellipsoid should divide the
polytope into two roughly equal pieces and hence we should expect α to be
small. What can be proved is that indeed, on average (over a large number of
iterations)
$$\frac{\mathrm{vol}(E_{t+1})}{\mathrm{vol}(E_t)} \le 1 - 10^{-6},$$
and, hence, α is indeed a constant (slightly) smaller than 1. Further, the volu-
metric center xt of Et does not have to be recomputed from scratch every single
298 Ellipsoid Method for Convex Optimization
time. In fact, since Et+1 has only one constraint more than Et , we can use New-
ton’s method to compute xt+1 with xt as the starting point. This “recentering”
step can be implemented in O(nω ), or matrix multiplication, time. In fact, to
achieve this running time, the number of facets in Et needs to be kept small
throughout the iterations, hence occasionally the algorithm has to drop certain
constraints (we omit the details).
To summarize, this method, based on the volumetric center, attains the optimal, constant rate of volume drop and hence requires roughly $O\left(n\log\frac{R}{r}\log\frac{1}{\varepsilon}\right)$ iterations to terminate. However, the update time per iteration, $n^\omega \approx n^{2.373}$, is slower than the update time for the ellipsoid method, which is O(n²) (as a rank one update to an n × n matrix). See also the method based on the "analytic center" in Exercise 12.6.
$$G(x) := -\sum_{i=1}^m w_i \log s_i(x) + \frac{1}{2}\log\det\left(A^\top S_x^{-2}A + \lambda I\right) + \frac{\lambda}{2}\|x\|_2^2, \qquad (13.10)$$
$$x_t := \operatorname{argmin}_{x\in E_t} G(x).$$
13.6 Exercises
13.1 Prove that any two bases of a matroid have the same cardinality.
13.2 Given disjoint sets E1 , . . . , E s ⊆ [n] and nonnegative integers k1 , . . . , k s ,
consider the following family of sets
$$r_{\mathcal{F}}(S) = n - \kappa(V, S)$$
$$f_c(S) := |\{ij \in E : i \in S,\ j \notin S\}|.$$
13.8 Prove that for any function F : 2[n] → R (not necessarily submodular) and for its Lovász extension f the following hold:
(a) Prove that f is a convex function.
(b) Prove that if F is submodular, then f − coincides with the Lovász
extension of F.
(c) Prove that if f − coincides with the Lovász extension of F, then F
is submodular.
13.11 Prove that for any finite set Ω, and any probability distribution p on it,
$$\sum_{\omega\in\Omega} p_\omega \log\frac{1}{p_\omega} \le \log|\Omega|.$$
13.12 For a finite set Ω, a collection of vectors {vω }ω∈Ω ⊆ Rn , and a point θ ∈
Rn , prove that the convex program considered in (13.3) is the Lagrangian
dual of that considered in (13.4). For the above, you may assume that the
polytope P := conv{vω : ω ∈ Ω} is a full-dimensional subset of Rn and θ
belongs to the interior of P. Prove that, under these assumptions, strong
duality holds.
13.13 Prove the claim in Equation (13.6).
13.14 Let G = (V, E) be an undirected graph and let Ω be a subset of 2E. Consider the objective of the dual to the max entropy program in (13.4):
$$f(y) := \log \sum_{S\in\Omega} e^{\langle y,\, 1_S - \theta\rangle}.$$
in time poly(log ε−1, n, log L). You may assume that every point x ∈ K satisfies B(x, η) ⊆ P where P is the convex hull of the support of p, and your algorithm might depend polynomially on 1/η. Here, a multiaffine polynomial has the form $p(x) = \sum_{S\subseteq[n]} c_S \prod_{i\in S} x_i$.⁴
13.16 In this problem, we derive an algorithm for the following problem: given a matrix $A \in \mathbb{Z}_{>0}^{n\times n}$ with positive entries, find positive vectors $x, y \in \mathbb{R}^n_{>0}$ such that
XAY is a doubly stochastic matrix,
where X = Diag(x), Y = Diag(y) and a matrix W is called doubly stochastic if its entries are nonnegative and all its rows and columns sum up to 1. We denote the set of all n × n doubly stochastic matrices by Ωn, i.e.,
$$\Omega_n := \left\{W \in [0, 1]^{n\times n} : \forall i \in [n]\ \sum_{j=1}^n W_{i,j} = 1,\ \forall j \in [n]\ \sum_{i=1}^n W_{i,j} = 1\right\}.$$
Consider the following pair of optimization problems:
$$\max_{W\in\Omega_n}\ \sum_{1\le i,j\le n} W_{i,j}\log\frac{A_{i,j}}{W_{i,j}}. \qquad (13.11)$$
4 The support of p is the family of all sets S such that $c_S \neq 0$.
$$\min_{z\in\mathbb{R}^n}\ \sum_{i=1}^n \log\left(\sum_{j=1}^n e^{z_j}A_{i,j}\right) - \sum_{j=1}^n z_j. \qquad (13.12)$$
(a) Prove that both problems in Equations (13.11) and (13.12) are
convex programs.
(b) Prove that the problem in Equation (13.12) is the Lagrangian dual
of that in Equation (13.11) and strong duality holds.
(c) Suppose that z⋆ is an optimal solution to (13.12). Prove that, if y ∈ Rn is such that $y_i := e^{z_i^\star}$ for all i = 1, 2, . . . , n, then there exists x > 0 such that XAY ∈ Ωn.
As an optimal solution to the convex program (13.12) gives us the re-
quired solution, we apply the ellipsoid algorithm to solve it. We denote
by f (z) the objective of (13.12). For simplicity, assume that the entries
of A are integral.
(d) Design an algorithm that, given z ∈ Qn, computes an ε-approximation to the gradient ∇f(z) in time polynomial in the bit complexity of A, the bit complexity of z, log ε−1 and the norm ‖z‖∞ (see the sketch after this exercise).
(e) Prove that f(z) has an optimal solution z⋆ satisfying
$$\left\|z^\star\right\|_\infty \le O(\ln(Mn)),$$
where M is the largest entry of A.
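A sketch (ours; floating point instead of the exact rational arithmetic the exercise calls for) of evaluating f(z) and ∇f(z), together with a small gradient-descent illustration of part (c):

```python
import numpy as np

def f_and_grad(A, z):
    """Objective of (13.12): f(z) = sum_i log(sum_j e^{z_j} A_ij) - sum_j z_j,
    and its gradient."""
    w = np.exp(z)
    row_sums = A @ w                       # row_sums[i] = sum_j e^{z_j} A_ij
    f = np.log(row_sums).sum() - z.sum()
    # df/dz_j = sum_i e^{z_j} A_ij / row_sums[i] - 1: column sums of the
    # row-normalized, column-scaled matrix P, minus one.
    P = (A * w) / row_sums[:, None]
    return f, P.sum(axis=0) - 1.0

# At an optimum the gradient vanishes, i.e., P = X A Y has unit column sums
# as well as unit row sums, so it is doubly stochastic.
rng = np.random.default_rng(2)
A = rng.integers(1, 10, size=(4, 4)).astype(float)
z = np.zeros(4)
for _ in range(2000):                      # f is convex; plain descent works
    _, g = f_and_grad(A, z)
    z -= 0.1 * g
print(np.abs(f_and_grad(A, z)[1]).max())   # ~ 0: columns of P sum to 1
```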
Notes
For a thorough treatment on matroid theory, including a proof of Theorem
13.4, we refer the reader to the book by Schrijver (2002a). A proof of Theorem
13.4 based on the method of iterative rounding appears in the text by Lau et al.
(2011). Theorems 13.1 and 13.8 were first proved in the paper by Grötschel
et al. (1981). The Lovász extension and its properties (such as convexity) were
established by Lovász (1983).
The maximum entropy principle has its origins in the works of Gibbs (1902)
and Jaynes (1957a,b). It is used to learn probability distributions from data;
see the work by Dudik (2007) and Celis et al. (2020). Theorem 13.18 is a
special case of a theorem proved in Singh and Vishnoi (2014). Using a differ-
ent argument one can obtain a variant of Lemma 13.19 that avoids the inte-
riority assumption; see the paper by Straszak and Vishnoi (2019). Maximum
entropy-based algorithms have also been used to design very general approx-
imate counting algorithms for discrete problems in the papers by Anari and
Oveis Gharan (2017) and Straszak and Vishnoi (2017). Recently, the maxi-
mum entropy framework has been generalized to continuous manifolds; see
the paper by Leake and Vishnoi (2020).
The proof of Theorem 13.1 can be adapted to work with approximate or-
acles; we refer to Grötschel et al. (1988) for a thorough discussion on weak
oracles and optimization with them.
The volumetric center-based method is due to Vaidya (1989b). The hybrid
barrier presented in Section 13.5.3 is from the paper by Lee et al. (2015). Using
their cutting plane method, which is based on this barrier function, Lee et al.
(2015) derive a host of new algorithms for combinatorial problems. In par-
ticular, they give the asymptotically fastest known algorithms for submodular
function minimization and matroid intersection.
Exercise 13.15 is adapted from the paper by Straszak and Vishnoi (2019).
The objective function in Exercise 13.15 can be shown to be geodesically con-
vex; see the survey by Vishnoi (2018) for a discussion on geodesic convexity
and related problems.
Bibliography
Allen Zhu, Zeyuan, and Orecchia, Lorenzo. 2017. Linear Coupling: An Ultimate Uni-
fication of Gradient and Mirror Descent. Pages 3:1–3:22 of: 8th Innovations
in Theoretical Computer Science Conference, ITCS 2017, January 9-11, 2017,
Berkeley, CA, USA. LIPIcs, vol. 67.
Anari, Nima, and Oveis Gharan, Shayan. 2017. A Generalization of Permanent In-
equalities and Applications in Counting and Optimization. Pages 384–396 of:
Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Comput-
ing. STOC 2017.
Apostol, Tom M. 1967a. Calculus: One-variable calculus, with an introduction to lin-
ear algebra. Blaisdell book in pure and applied mathematics. Blaisdell Publishing
Company.
Apostol, Tom M. 1967b. Calculus, Vol. 2: Multi-Variable Calculus and Linear Algebra
with Applications to Differential Equations and Probability. New York: J. Wiley.
Arora, Sanjeev, and Barak, Boaz. 2009. Computational complexity: a modern ap-
proach. Cambridge University Press.
Arora, Sanjeev, and Kale, Satyen. 2016. A Combinatorial, Primal-Dual Approach to
Semidefinite Programs. J. ACM, 63(2).
Arora, Sanjeev, Hazan, Elad, and Kale, Satyen. 2005. Fast Algorithms for Approximate
Semidefinite Programming Using the Multiplicative Weights Update Method.
Page 339348 of: Proceedings of the 46th Annual IEEE Symposium on Founda-
tions of Computer Science. FOCS ’05. USA: IEEE Computer Society.
Arora, Sanjeev, Hazan, Elad, and Kale, Satyen. 2012. The Multiplicative Weights Up-
date Method: a Meta-Algorithm and Applications. Theory of Computing, 8(6),
121–164.
Barak, Boaz, Hardt, Moritz, and Kale, Satyen. 2009. The uniform hardcore lemma
via approximate Bregman projections. Pages 1193–1200 of: Proceedings of the
Twentieth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2009,
New York, NY, USA, January 4-6, 2009. SIAM.
Barvinok, Alexander. 2002. A course in convexity. American Mathematical Society.
Beck, Amir, and Teboulle, Marc. 2003. Mirror descent and nonlinear projected subgra-
dient methods for convex optimization. Oper. Res. Lett., 31(3), 167–175.
Beck, Amir, and Teboulle, Marc. 2009. A fast iterative shrinkage-thresholding algo-
rithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1), 183–
202.
Boyd, Stephen, and Vandenberghe, Lieven. 2004. Convex optimization. Cambridge
University Press.
Bubeck, Sébastien. 2015. Convex Optimization: Algorithms and Complexity. Found.
Trends Mach. Learn., 8(3-4), 231–357.
Bubeck, Sébastien, and Eldan, Ronen. 2015. The entropic barrier: a simple and optimal
universal self-concordant barrier. Page 279 of: Proceedings of The 28th Confer-
ence on Learning Theory, COLT 2015, Paris, France, July 3-6, 2015.
Celis, L. Elisa, Keswani, Vijay, and Vishnoi, Nisheeth K. 2020. Data preprocessing
to mitigate bias: A maximum entropy based approach. In: Proceedings of the
37th International Conference on International Conference on Machine Learning.
ICML’20. JMLR.org.
Christiano, Paul, Kelner, Jonathan A., Madry, Aleksander, Spielman, Daniel A., and
Teng, Shang-Hua. 2011. Electrical flows, Laplacian systems, and faster approx-
imation of maximum flow in undirected graphs. Pages 273–282 of: Proceedings
of the 43rd ACM Symposium on Theory of Computing, STOC 2011, San Jose, CA,
USA, 6-8 June 2011.
Cohen, Michael B., Madry, Aleksander, Tsipras, Dimitris, and Vladu, Adrian. 2017.
Matrix Scaling and Balancing via Box Constrained Newton’s Method and Inte-
rior Point Methods. Pages 902–913 of: 58th IEEE Annual Symposium on Foun-
dations of Computer Science, FOCS 2017, Berkeley, CA, USA, October 15-17,
2017. IEEE Computer Society.
Cormen, Thomas H., Leiserson, Charles E., Rivest, Ronald L., and Stein, Clifford.
2001. Introduction to Algorithms. The MIT Press.
Daitch, Samuel I., and Spielman, Daniel A. 2008. Faster approximate lossy generalized
flow via interior point algorithms. Pages 451–460 of: Proceedings of the 40th
Annual ACM Symposium on Theory of Computing, Victoria, British Columbia,
Canada, May 17-20, 2008.
Dantzig, George B. 1990. Origins of the Simplex Method. New York, NY, USA: Asso-
ciation for Computing Machinery. Pages 141–151.
Dasgupta, Sanjoy, Papadimitriou, Christos H., and Vazirani, Umesh. 2006. Algorithms.
1 edn. USA: McGraw-Hill, Inc.
Diestel, Reinhard. 2012. Graph Theory, 4th Edition. Graduate texts in mathematics,
vol. 173. Springer.
Dinic, E. A. 1970. Algorithm for solution of a problem of maximal flow in a network
with power estimation. Soviet Math Dokl, 11, 1277–1280.
Dudik, Miroslav. 2007. Maximum entropy density estimation and modeling geographic
distributions of species.
Edmonds, Jack. 1965a. Maximum Matching and a Polyhedron with 0, 1 Vertices. J. of
Res. the Nat. Bureau of Standards, 69, 125–130.
Edmonds, Jack. 1965b. Paths, trees, and flowers. Canadian Journal of mathematics,
17(3), 449–467.
Edmonds, Jack, and Karp, Richard M. 1972. Theoretical Improvements in Algorithmic
Efficiency for Network Flow Problems. J. ACM, 19(2), 248264.
Farkas, Julius. 1902. Theorie der einfachen Ungleichungen. Journal für die reine und
angewandte Mathematik, 124, 1–27.
Ford, L.R., and Fulkerson, D.R. 1956. Maximal flow in a network. Canadian J. Math.,
8, 399–404.
Galántai, A. 2000. The theory of Newton’s method. Journal of Computational and
Applied Mathematics, 124(1), 25 – 44. Numerical Analysis 2000. Vol. IV: Opti-
mization and Nonlinear Equations.
Garg, Naveen, and Könemann, Jochen. 2007. Faster and Simpler Algorithms for Mul-
ticommodity Flow and Other Fractional Packing problems. SIAM J. Comput.,
37(2), 630–652.
Gärtner, Bernd, and Matoušek, Jiří. 2014. Approximation Algorithms and Semidefinite
Programming. Springer Publishing Company, Incorporated.
Gibbs, Josiah Willard. 1902. Elementary principles in statistical mechanics: developed
with especial reference to the rational foundation of thermodynamics. C. Scrib-
ner’s sons.
Goldberg, Andrew V., and Rao, Satish. 1998. Beyond the Flow Decomposition Barrier.
J. ACM, 45(5), 783–797.
Goldberg, Andrew V., and Tarjan, Robert Endre. 1987. Solving Minimum-Cost Flow
Problems by Successive Approximation. Pages 7–18 of: Proceedings of the 19th
Annual ACM Symposium on Theory of Computing, 1987, New York, New York,
USA.
Golub, G.H., and Van Loan, C.F. 1996. Matrix Computations. Johns Hopkins University
Press.
Gonzaga, Clovis C. 1989. An Algorithm for Solving Linear Programming Problems in
O(n^3 L) Operations. New York, NY: Springer New York. Pages 1–28.
Grötschel, M., Lovász, L., and Schrijver, A. 1981. The ellipsoid method and its conse-
quences in combinatorial optimization. Combinatorica, 1(2), 169–197.
Grötschel, Martin, Lovász, László, and Schrijver, Alexander. 1988. Geometric Algo-
rithms and Combinatorial Optimization. Algorithms and Combinatorics, vol. 2.
Springer.
Gurjar, Rohit, Thierauf, Thomas, and Vishnoi, Nisheeth K. 2018. Isolating a Vertex via
Lattices: Polytopes with Totally Unimodular Faces. Pages 74:1–74:14 of: 45th
International Colloquium on Automata, Languages, and Programming, ICALP
2018, July 9-13, 2018, Prague, Czech Republic. LIPIcs, vol. 107.
Hazan, Elad. 2016. Introduction to Online Convex Optimization. Found. Trends Optim.,
2(3-4), 157–325.
Hestenes, Magnus R., and Stiefel, Eduard. 1952. Methods of Conjugate Gradients for
Solving Linear Systems. Journal of Research of the National Bureau of Standards,
49(Dec.), 409–436.
Hopcroft, John E., and Karp, Richard M. 1973. An n^{5/2} Algorithm for Maximum
Matchings in Bipartite Graphs. SIAM J. Comput., 2(4), 225–231.
Jaggi, Martin. 2013. Revisiting Frank-Wolfe: Projection-Free Sparse Convex Optimiza-
tion. Pages I-427–I-435 of: Proceedings of the 30th International Conference on In-
ternational Conference on Machine Learning - Volume 28. ICML’13. JMLR.org.
Jaynes, Edwin T. 1957a. Information Theory and Statistical Mechanics. Physical Re-
view, 106(May), 620–630.
Jaynes, Edwin T. 1957b. Information Theory and Statistical Mechanics. II. Physical
Review, 108(Oct.), 171–190.
Karlin, Anna R., Klein, Nathan, and Oveis Gharan, Shayan. 2020. A (Slightly) Im-
proved Approximation Algorithm for Metric TSP. CoRR, abs/2007.01409.
Karmarkar, Narendra. 1984. A new polynomial-time algorithm for linear programming.
Combinatorica, 4(4), 373–395.
Karp, Richard M., and Papadimitriou, Christos H. 1982. On Linear Characterizations
of Combinatorial Optimization Problems. SIAM J. Comput., 11(4), 620–632.
Kelner, Jonathan A., Lee, Yin Tat, Orecchia, Lorenzo, and Sidford, Aaron. 2014.
An Almost-Linear-Time Algorithm for Approximate Max Flow in Undirected
Graphs, and its Multicommodity Generalizations. Pages 217–226 of: Proceed-
ings of the Twenty-Fifth Annual ACM-SIAM Symposium on Discrete Algorithms,
SODA 2014, Portland, Oregon, USA, January 5-7, 2014.
Khachiyan, L. G. 1979. A Polynomial Algorithm for Linear Programming. Doklady
Akademii Nauk SSSR, 244(5), 1093–1096.
Khachiyan, L. G. 1980. Polynomial algorithms in linear programming. USSR Compu-
tational Mathematics and Mathematical Physics, 20(1), 53–72.
Klee, V., and Minty, G. J. 1972. How Good is the Simplex Algorithm? Pages 159–175
of: Inequalities III. Academic Press Inc.
Kleinberg, Jon, and Tardos, Éva. 2005. Algorithm Design. USA: Addison-Wesley
Longman Publishing Co., Inc.
Krantz, S.G. 2014. Convex Analysis. Textbooks in Mathematics. Taylor & Francis.
Lau, Lap-Chi, Ravi, R., and Singh, Mohit. 2011. Iterative Methods in Combinatorial
Optimization. 1st edn. USA: Cambridge University Press.
Leake, Jonathan, and Vishnoi, Nisheeth K. 2020. On the computability of continuous
maximum entropy distributions with applications. Pages 930–943 of: Proceedings
of the 52nd Annual ACM SIGACT Symposium on Theory of Computing, STOC
2020, Chicago, IL, USA, June 22-26, 2020. ACM.
Lee, Yin Tat, and Sidford, Aaron. 2014. Path Finding Methods for Linear Program-
ming: Solving Linear Programs in Õ(√rank) Iterations and Faster Algorithms for
Maximum Flow. Pages 424–433 of: 55th IEEE Annual Symposium on Founda-
tions of Computer Science, FOCS 2014, Philadelphia, PA, USA, October 18-21,
2014. IEEE Computer Society.
Lee, Yin Tat, Rao, Satish, and Srivastava, Nikhil. 2013. A New Approach to Computing
Maximum Flows Using Electrical Flows. Pages 755–764 of: Proceedings of the
Forty-fifth Annual ACM Symposium on Theory of Computing. STOC ’13. New
York, NY, USA: ACM.
Lee, Yin Tat, Sidford, Aaron, and Wong, Sam Chiu-wai. 2015. A Faster Cutting Plane
Method and its Implications for Combinatorial and Convex Optimization. Pages
1049–1065 of: IEEE 56th Annual Symposium on Foundations of Computer Sci-
ence, FOCS 2015, Berkeley, CA, USA, 17-20 October, 2015.
Louis, Anand, and Vempala, Santosh S. 2016. Accelerated Newton Iteration for Roots
of Black Box Polynomials. Pages 732–740 of: IEEE 57th Annual Symposium
on Foundations of Computer Science, FOCS 2016, 9-11 October 2016, Hyatt Re-
gency, New Brunswick, New Jersey, USA. IEEE Computer Society.
Lovász, László. 1983. Submodular functions and convexity. Pages 235–257 of: Math-
ematical Programming: The State of the Art. Springer.
Madry, Aleksander. 2013. Navigating Central Path with Electrical Flows: From Flows
to Matchings, and Back. Pages 253–262 of: 54th Annual IEEE Symposium on
Foundations of Computer Science, FOCS 2013, 26-29 October, 2013, Berkeley,
CA, USA.
Mulmuley, Ketan, Vazirani, Umesh V., and Vazirani, Vijay V. 1987. Matching is as easy
as matrix inversion. Combinatorica, 7, 105–113.
Nemirovski, A., and Yudin, D. 1983. Problem Complexity and Method Efficiency in
Optimization. Wiley Interscience.
Nesterov, Y., and Nemirovskii, A. 1994. Interior-Point Polynomial Algorithms in Con-
vex Programming. Society for Industrial and Applied Mathematics.
Nesterov, Yurii. 1983. A method for unconstrained convex minimization problem with
the rate of convergence O(1/k^2). Pages 543–547 of: Doklady AN USSR, vol. 269.
Nesterov, Yurii. 2004. Introductory lectures on convex optimization. Vol. 87. Springer
Science & Business Media.
Nesterov, Yurii. 2014. Introductory Lectures on Convex Optimization: A Basic Course.
1 edn. Springer.
Orecchia, Lorenzo, Schulman, Leonard J., Vazirani, Umesh V., and Vishnoi,
Nisheeth K. 2008. On partitioning graphs via single commodity flows. Pages
461–470 of: Proceedings of the 40th Annual ACM Symposium on Theory of Com-
puting, Victoria, British Columbia, Canada, May 17-20, 2008. ACM.
Orecchia, Lorenzo, Sachdeva, Sushant, and Vishnoi, Nisheeth K. 2012. Approximating
the Exponential, the Lanczos Method and an Õ(m)-Time Spectral Algorithm for
Balanced Separator. Pages 1141–1160 of: Proceedings of the Forty-Fourth Annual
ACM Symposium on Theory of Computing. STOC ’12. New York, NY, USA:
Association for Computing Machinery.
Oveis Gharan, Shayan, Saberi, Amin, and Singh, Mohit. 2011. A Randomized Round-
ing Approach to the Traveling Salesman Problem. Pages 267–276 of: FOCS’11:
Proceedings of the 52nd Annual IEEE Symposium on Foundations of Computer
Science.
Padberg, Manfred W, and Rao, M Rammohan. 1981. The Russian method for linear
inequalities III: Bounded integer programming. Research report, INRIA.
Pan, Victor Y., and Chen, Zhao Q. 1999. The Complexity of the Matrix Eigenproblem.
Pages 507–516 of: STOC. ACM.
Peng, Richard. 2016. Approximate Undirected Maximum Flows in O(m polylog(n))
Time. Pages 1862–1867 of: Proceedings of the Twenty-Seventh Annual ACM-
SIAM Symposium on Discrete Algorithms, SODA 2016, Arlington, VA, USA, Jan-
uary 10-12, 2016.
Perko, Lawrence. 2001. Differential equations and dynamical systems. Vol. 7. Springer
Science & Business Media.
Plotkin, Serge A., Shmoys, David B., and Tardos, Éva. 1995. Fast Approximation
Algorithms for Fractional Packing and Covering Problems. Math. Oper. Res.,
20(2), 257–301.
Polyak, Boris. 1964. Some methods of speeding up the convergence of iteration meth-
ods. USSR Computational Mathematics and Mathematical Physics, 4(12), 1–17.
Renegar, James. 1988. A polynomial-time algorithm, based on Newton’s method, for
linear programming. Math. Program., 40(1-3), 59–93.