
Algorithms for Convex Optimization

Nisheeth K. Vishnoi

This material will be published by Cambridge University Press as


Algorithms for Convex Optimization by Nisheeth K. Vishnoi. This
pre-publication version is free to view and download for personal use
only. Not for re-distribution, re-sale or use in derivative works.
©Nisheeth K. Vishnoi 2020.
Dedicated to Maya and Vayu
Contents

Preface page v
Notation viii
1 Bridging continuous and discrete optimization 1
1.1 An example: the maximum flow problem 2
1.2 Linear programming 8
1.3 Fast and exact algorithms via interior point methods 12
1.4 Ellipsoid method beyond succinct linear programs 13
2 Preliminaries 15
2.1 Derivatives, gradients, and hessians 15
2.2 Taylor series 17
2.3 Linear algebra, matrices, and eigenvalues 18
2.4 The Cauchy-Schwarz inequality 20
2.5 Norms 21
2.6 Euclidean topology 22
2.7 Dynamical systems 23
2.8 Graphs 24
2.9 Exercises 27
3 Convexity 32
3.1 Convex sets 32
3.2 Convex functions 33
3.3 The usefulness of convexity 39
3.4 Exercises 44
4 Convex Optimization and Efficiency 47
4.1 Convex programs 47
4.2 Computational models 49
4.3 Membership problem for convex sets 51
4.4 Solution concepts for optimization problems 57


4.5 The notion of polynomial time for convex optimization 61


4.6 Exercises 63
5 Duality and Optimality 67
5.1 Lagrangian duality 68
5.2 The conjugate function 72
5.3 KKT optimality conditions 74
5.4 Proof of strong duality under Slater’s condition 75
5.5 Exercises 77
6 Gradient Descent 81
6.1 The setup 81
6.2 Gradient descent 82
6.3 Analysis when the gradient is Lipschitz continuous 86
6.4 Application: the maximum flow problem 92
6.5 Exercises 99
7 Mirror Descent and Multiplicative Weights Update 105
7.1 Beyond the Lipschitz gradient condition 105
7.2 A local optimization principle and regularizers 107
7.3 Exponential gradient descent 109
7.4 Mirror descent 117
7.5 Multiplicative weights update framework 121
7.6 Application: perfect matching in bipartite graphs 122
7.7 Exercises 129
8 Accelerated Gradient Descent 137
8.1 The setup 137
8.2 Main result on accelerated gradient descent 138
8.3 Proof strategy: estimate sequences 139
8.4 Construction of an estimate sequence 141
8.5 The algorithm and its analysis 145
8.6 Application to solving linear systems 148
8.7 Exercises 151
9 Newton’s Method 155
9.1 Finding a root of a univariate function 155
9.2 Newton’s method for multivariate functions 159
9.3 Newton’s method for unconstrained optimization 160
9.4 First take on the analysis 162
9.5 Newton’s method as steepest descent 165
9.6 Analysis of Newton’s method based on the local norm 170
9.7 Analysis based on the Euclidean norm 175

9.8 Exercises 177


10 An Interior Point Method for Linear Programming 180
10.1 Linear programming 180
10.2 To unconstrained optimization via barrier functions 182
10.3 The logarithmic barrier function 183
10.4 The central path 184
10.5 A path following algorithm for linear programming 185
10.6 Analysis of the path following IPM 189
10.7 Exercises 203
11 Variants of the Interior Point Method and Self-Concordance 210
11.1 The minimum cost flow problem 210
11.2 An IPM for linear programming in standard form 214

11.3 An Õ(√m)-iteration algorithm for minimum cost flow 221
11.4 Self-concordance 224
11.5 Linear programming using self-concordant barriers 227
11.6 Semidefinite programming using IPM 233
11.7 Optimizing over convex sets via self-concordance 235
11.8 Exercises 237
12 Ellipsoid Method for Linear Programming 243
12.1 0-1-polytopes with exponentially many constraints 243
12.2 Cutting plane methods 246
12.3 Ellipsoid method 253
12.4 Analysis of volume drop and efficiency for ellipsoids 256
12.5 Linear optimization over 0-1-polytopes 263
12.6 Exercises 267
13 Ellipsoid Method for Convex Optimization 273
13.1 Convex optimization using the ellipsoid method? 273
13.2 Submodular function minimization 275
13.3 Maximum entropy distributions over discrete domains 282
13.4 Convex optimization using the ellipsoid method 288
13.5 Variants of the cutting plane method 294
13.6 Exercises 298
Bibliography 303
Index 311
Preface

Convex optimization studies the problem of minimizing a convex function over


a convex set. Convexity, along with its numerous implications, has been used
to come up with efficient algorithms for many classes of convex programs.
Consequently, convex optimization has broadly impacted several disciplines
of science and engineering.
In the last few years, algorithms for convex optimization have revolution-
ized algorithm design, both for discrete and continuous optimization prob-
lems. The fastest known algorithms for problems such as maximum flow in
graphs, maximum matching in bipartite graphs, and submodular function min-
imization, involve an essential and nontrivial use of algorithms for convex op-
timization such as gradient descent, mirror descent, interior point methods,
and cutting plane methods. Surprisingly, algorithms for convex optimization
have also been used to design counting problems over discrete objects such
as matroids. Simultaneously, algorithms for convex optimization have become
central to many modern machine learning applications. The demand for al-
gorithms for convex optimization, driven by larger and increasingly complex
input instances, has also significantly pushed the state of the art of convex op-
timization itself.
The goal of this book is to enable a reader to gain an in-depth understand-
ing of algorithms for convex optimization. The emphasis is to derive key al-
gorithms for convex optimization from first principles and to establish precise
running time bounds in terms of the input length. Given the broad applicability
of these methods, it is not possible for a single book to cover all of their
applications. This book shows applications to fast algorithms
for various discrete optimization and counting problems. The applications se-
lected in this book serve the purpose of illustrating a rather surprising bridge
between continuous and discrete optimization.


The structure of the book. The book has roughly four parts: Chapters 3, 4,
and 5 provide an introduction to convexity, models of computation and no-
tions of efficiency in convex optimization, and duality. Chapters 6, 7, and 8
introduce first-order methods such as gradient descent, mirror descent and the
multiplicative weights update method, and accelerated gradient descent respec-
tively. Chapters 9, 10, and 11 present Newton’s method and various interior
point methods for linear programs. Chapters 12 and 13 present cutting plane
methods such as the ellipsoid method for linear and general convex programs.
Chapter 1 summarizes the book via a brief history of the interplay between
continuous and discrete optimization: how the search for fast algorithms for
discrete problems is leading to improvements in algorithms for convex opti-
mization.
Many chapters contain applications ranging from finding maximum flows,
minimum cuts, and perfect matchings in graphs, to linear optimization over
0-1-polytopes, to submodular function minimization, to computing maximum
entropy distributions over combinatorial polytopes.
The book is self-contained and starts with a review of calculus, linear alge-
bra, geometry, dynamical systems, and graph theory in Chapter 2. Exercises
posed in this book not only play an important role in checking one’s under-
standing, but sometimes important methods and concepts are introduced and
developed entirely through them. Examples include the Frank-Wolfe method,
coordinate descent, stochastic gradient descent, online convex optimization,
the min-max theorem for zero-sum games, the Winnow algorithm for classifi-
cation, the conjugate gradient method, primal-dual interior point method, and
matrix scaling.

How to use this book. This book can be used either as a textbook for a stan-
dalone advanced undergraduate or beginning graduate level course, or can act
as a supplement to an introductory course on convex optimization or algo-
rithm design. The intended audience includes advanced undergraduate stu-
dents, graduate students, and researchers from theoretical computer science,
discrete optimization, operations research, statistics, and machine learning. To
make this book accessible to a broad audience with different backgrounds, the
writing style deliberately emphasizes the intuition, sometimes at the expense
of rigor.
A course for a theoretical computer science or discrete optimization audi-
ence could cover the entire book. A course on convex optimization can omit
the applications to discrete optimization and can, instead, include applications
as per the choice of the instructor. Finally, an introductory course on convex
optimization for machine learning could include material from Chapters 2-6.

Beyond convex optimization? This book should also prepare the reader for
working in areas beyond convex optimization, e.g., nonconvex optimization
and geodesic convex optimization, which are currently in their formative years.
Nonconvex optimization. One property of convex functions is that a “local”
minimum is also a “global” minimum. Thus, algorithms for convex optimiza-
tion, essentially, find a local minimum. Interestingly, this viewpoint has led
to convex optimization methods being wildly successful for nonconvex op-
timization problems, especially those that arise in machine learning. Unlike
convex programs, some of which can be NP-hard to optimize, most interesting
classes of nonconvex optimization problems are NP-hard. Hence, in many of
these applications, one defines a suitable notion of local minimum and looks
for methods that can take us to one. Thus, algorithms for convex optimization
methods are important for nonconvex optimization as well.
Geodesic convex optimization. Sometimes, a function that is nonconvex in
a Euclidean space turns out to be convex if one introduces a suitable Rieman-
nian metric on the underlying space and redefines convexity with respect to
the straight lines – geodesics – induced by the metric. Such a function is called
geodesically convex; see the survey by Vishnoi (2018). Unlike convex opti-
mization, however, how to optimize such a function over a geodesically convex
set is an underdeveloped theory – waiting to be explored.

Acknowledgments. The contents of this book have been developed over sev-
eral courses – for both undergraduate and graduate students – that I have
taught, starting in Fall 2014. The content of the book is closest to that of a
course taught in Fall 2019 at Yale. I am grateful to all the students and other
attendees of these courses for their questions and comments that have made
me reflect on the topic and improve the presentation. I am thankful to Slo-
bodan Mitrovic, Damian Straszak, Jakub Tarnawski, and George Zakhour for
being some of the first to take this course and scribing my initial lectures on
this topic. Special thanks to Damian for scribing a significant fraction of
my lectures, sometimes adding his own insights. I am indebted to Somenath
Biswas, Elisa Celis, Yan Zhong Ding, and Anay Mehrotra for carefully reading
a draft of this book and giving numerous valuable comments and suggestions.
Finally, this book has been influenced by several classic works: Geometric
algorithms and combinatorial optimization by Grötschel et al. (1988), Convex
optimization by Boyd and Vandenberghe (2004), Introductory lectures on con-
vex optimization by Nesterov (2014), and The multiplicative weights update
method: A meta-algorithm and applications by Arora et al. (2012).

August 28, 2020 Nisheeth K. Vishnoi


Notation

Numbers and sets:


• The set of natural numbers, integers, rationals, and real numbers are de-
noted by N, Z, Q and R respectively. Z≥0 , Q≥0 and R≥0 denote the set of
non-negative integers, rationals, and reals respectively.
• For a positive integer n, we denote by [n] the set {1, 2, . . . , n}.
• For a set S ⊆ [n], we use 1_S ∈ R^n to denote the indicator vector of S, defined
as 1_S(i) = 1 for all i ∈ S and 1_S(i) = 0 otherwise.
• For a set S ⊆ [n] of cardinality k, we sometimes write R^S to denote R^k.

Vectors, matrices, inner products, and norms:


• Vectors are denoted by x and y. A vector x ∈ R^n is a column vector but is
usually written as x = (x_1, . . . , x_n). The transpose of a vector x is denoted
by x^⊤.
• The standard basis vectors in R^n are denoted by e_1, . . . , e_n, where e_i is the
vector whose ith entry is one and the remaining entries are zero.
• For vectors x, y ∈ Rn , by x ≥ y, we mean that xi ≥ yi for all i ∈ [n].
• For a vector x ∈ Rn , we use Diag(x) to denote the n × n matrix whose
(i, i)th entry is xi for 1 ≤ i ≤ n and is zero on all other entries.
• When it is clear from context, 0 and 1 are also used to denote vectors with
all 0 entries and all 1 entries respectively.
• For vectors x and y, their inner product is denoted by ⟨x, y⟩ or x^⊤y.

• For a vector x, its ℓ_2 or Euclidean norm is denoted by ∥x∥_2 := √⟨x, x⟩.
We sometimes also refer to the ℓ_1 or Manhattan norm ∥x∥_1 := ∑_{i=1}^n |x_i|.
The ℓ_∞-norm is defined as ∥x∥_∞ := max_{i∈[n]} |x_i|.

• The outer product of a vector x with itself is denoted by xx^⊤.


• Matrices are denoted by capitals, e.g., A and L. The transpose of A is denoted
by A^⊤.


• The trace of an n × n matrix A is Tr(A) := ∑_{i=1}^n A_{ii}. The determinant of an
n × n matrix A is det(A) = ∑_{σ∈S_n} sgn(σ) ∏_{i=1}^n A_{iσ(i)}. Here S_n is the set of
all permutations of n elements and sgn(σ) = (−1)^{inv(σ)} is the sign of the
permutation σ, where inv(σ) is the number of inversions in σ, i.e., the number
of pairs i < j such that σ(i) > σ(j).

Graphs:
• A graph G has a vertex set V and an edge set E. All graphs are assumed to
be undirected unless stated otherwise. If the graph is weighted, there is a
weight function w : E → R≥0 .
• A graph is said to be simple if there is at most one edge between two
vertices and there are no edges whose endpoints are the same vertex.
• Typically, n is reserved for the number of vertices |V|, and m for the number
of edges |E|.

Probability:
• E_D[·] denotes the expectation and Pr_D[·] denotes the probability over a
distribution D. The subscript is dropped when clear from context.

Running times:
• Standard big-O notation is used to describe the limiting behavior of a func-
tion. Õ denotes that potential poly-logarithmic factors have been omitted,
i.e., f = Õ(g) is equivalent to f = O(g log^k(g)) for some constant k.
1
Bridging continuous and discrete optimization

A large part of algorithm design is concerned with problems that optimize or


enumerate over discrete structures such as paths, trees, cuts, flows, and match-
ings in objects such as graphs. Important examples include:
(i) Given a graph G = (V, E), a source s ∈ V, a sink t ∈ V, find a flow on
the edges of G of maximum value from s to t while ensuring that each
edge has at most one unit flow going through it.
(ii) Given a graph G = (V, E), find a matching of maximum size in G.
(iii) Given a graph G = (V, E), count the number of spanning trees in G.
Algorithms for these fundamental problems have been sought for more than
a century due to their numerous applications. Traditionally, such algorithms
are discrete in nature, leverage the rich theory of duality and integrality,
and are studied in the areas of algorithms and combinatorial optimization; see
the books by Dasgupta et al. (2006), Kleinberg and Tardos (2005), and Schri-
jver (2002a). However, classic algorithms for these problems have not always
turned out to be fast enough to handle the rapidly increasing input sizes of
modern-day problems.
An alternative continuous approach for designing faster algorithms for dis-
crete problems has emerged. At a very high level, the approach is to first formu-
late the problem as a convex program, and then develop continuous algorithms
such as gradient descent, the interior point method, or the ellipsoid method to
solve it. The innovative use of convex optimization formulations coupled with
algorithms that move in geometric spaces and leverage linear solvers has led
to faster algorithms for many discrete problems. This pursuit has also signifi-
cantly improved the state of the art of algorithms for convex optimization. For
these improvements to be possible it is often crucial to abandon an entirely
combinatorial viewpoint; simultaneously, fast convergence of continuous al-
gorithms often leverage the underlying combinatorial structure.


1.1 An example: the maximum flow problem


We illustrate the interplay between continuous and discrete optimization through
the s − t-maximum flow problem on undirected graphs.

The maximum flow problem. Given an undirected graph G = (V, E) with


n := |V| and m := |E|, we first define the vertex-edge incidence matrix B ∈
R^{n×m} associated to it. Direct each edge i ∈ E arbitrarily and let i+ denote the
head vertex of i and i− denote its tail vertex. For every edge i, the matrix B
contains a column b_i := e_{i+} − e_{i−} ∈ R^n, where {e_j}_{j∈[n]} are the standard basis
vectors for R^n.
Given s ≠ t ∈ V, an s − t-flow in G is an assignment x : E → R that satisfies
the following conservation of flow property: for all vertices j ∈ V \ {s, t}, we
require that the incoming flow is equal to the outgoing flow, i.e.,

⟨e_j, Bx⟩ = 0.

An s − t-flow is said to be feasible if

|x_i| ≤ 1

for all i ∈ E, i.e., the magnitude of the flow in each edge respects its capacity
(1 here). The objective of the s − t-maximum flow problem is to find a feasible
s − t-flow in G that maximizes the flow out of s, i.e., the value

⟨e_s, Bx⟩.
The s − t-maximum flow problem has not only been used to encode various real-
world routing and scheduling problems; many fundamental combinatorial prob-
lems, such as finding a maximum matching in a bipartite graph, were shown to be
its special cases; see Schrijver (2002a) and Schrijver (2002b) for an extensive
discussion.

Combinatorial algorithms for the maximum flow problem. An important


fact about the s − t-maximum flow problem is that there always exists an inte-
gral flow that maximizes the objective function. As it will be explained later in
this book, this is a consequence of the fact that the matrix B is totally unimod-
ular: every square submatrix of B has determinant either 0, 1, or −1. Thus, we
can restrict
x_i ∈ {−1, 0, 1}
for each i ∈ E, making the search space for the optimal s − t-maximum flow
discrete. Because of this, the problem has been traditionally viewed as a com-
binatorial optimization problem.

One of the first combinatorial algorithms for the s − t-maximum flow prob-
lem was presented in the seminal work by Ford and Fulkerson (1956). Roughly
speaking, the Ford-Fulkerson method starts by setting xi = 0 for all edges i
and checks if there is a path from s to t such that the capacity of each edge on
it is 1. If there is such a path, the method adds 1 to the flow value of the edges
that point (from head to tail) in the direction of this path and subtracts 1 from
the flow values of edges that point in the opposite direction. Given the new
flow value on each edge, it constructs a residual graph where the capacity of
each edge is updated to reflect how much additional flow can still be pushed
through it, and the algorithm repeats. If there is no path left between s and t in
the residual graph, it stops and outputs the current flow values.
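
The method is short enough to sketch in code. The following Python sketch handles the unit-capacity, undirected case described above; the graph representation and function names are illustrative choices (not from the book), and the breadth-first search rule for choosing augmenting paths anticipates the Edmonds-Karp refinement discussed below.

```python
from collections import deque

def ford_fulkerson_unit(n, edges, s, t):
    """Minimal sketch of the Ford-Fulkerson method on an undirected
    graph with unit capacities. `edges` is a list of pairs (u, v);
    returns the value of a maximum s-t flow."""
    # Each undirected unit-capacity edge is modeled as two directed
    # arcs, each with residual capacity 1.
    cap = {}
    adj = [[] for _ in range(n)]
    for (u, v) in edges:
        cap[(u, v)] = cap.get((u, v), 0) + 1
        cap[(v, u)] = cap.get((v, u), 0) + 1
        adj[u].append(v)
        adj[v].append(u)

    def bfs_augmenting_path():
        # Breadth-first search for an s-t path in the residual graph.
        parent = {s: None}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            if u == t:
                break
            for v in adj[u]:
                if v not in parent and cap[(u, v)] > 0:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            return None
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        return path

    value = 0
    while True:
        path = bfs_augmenting_path()
        if path is None:
            return value  # no residual s-t path: current flow is maximum
        for (u, v) in path:
            cap[(u, v)] -= 1  # push one unit of flow forward
            cap[(v, u)] += 1  # add one unit of residual capacity backward
        value += 1
```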
The fact that the algorithm always outputs a maximum s−t-flow is nontrivial
and a consequence of duality; in particular, of the max-flow min-cut theorem,
which states that the maximum amount of flow that can be pushed from s to t is
equal to the minimum number of edges in G whose deletion leads to discon-
necting s from t. This latter problem is referred to as the s − t-minimum cut
problem and is the dual of the s − t-maximum flow problem. Duality gives a
way to certify that we are at an optimal solution and, if not, suggests a way to
improve the current solution.
It is not hard to see that the Ford-Fulkerson method generalizes to the setting
of nonnegative and integral capacities: now the flow values are

x_i ∈ {−U, . . . , −1, 0, 1, . . . , U}

for some U ∈ Z≥0 . However, the running time of the Ford-Fulkerson method
in this general capacity case depends linearly on U. As the number of bits
required to specify U is roughly log U, this is not a polynomial time algorithm.
Following the work of Ford and Fulkerson (1956), a host of combinatorial
algorithms for the s−t-maximum flow problem were developed. Roughly, each
of them augments the flow in the graph iteratively in an increasingly faster, but
combinatorial, manner. The first polynomial time algorithms were by Dinic
(1970) and by Edmonds and Karp (1972) who used breadth-first search to aug-
ment flows. This line of work culminated in an algorithm by Goldberg and
Rao (1998) that runs in Õ(m · min{n^{2/3}, m^{1/2}} log U) time. Note that unlike the
Ford-Fulkerson method, these latter combinatorial algorithms are polynomial
time: they find the exact solution to the problem and run in time that depends
polynomially on the number of bits required to describe the input. However,
since the result of Goldberg and Rao (1998), there was no real progress on im-
proving the running times for algorithms for the s − t-maximum flow problem
until 2011.

Convex programming-based algorithms. Starting with the paper by Chris-


tiano et al. (2011), the last decade has seen striking progress on the s − t-
maximum flow problem. One of the keys to this success has been to abandon
combinatorial approaches and view the s − t-maximum flow problem through
the lens of continuous optimization. At a very high level, these approaches still
maintain a vector x ∈ Rm which is updated in every iteration, but this update
is dictated by continuous and geometric quantities associated to the graph and
is not constrained to be a feasible s − t-flow in the intermediate steps of the
algorithm. Here, we outline one such approach for the s − t-maximum flow
problem from the paper by Lee et al. (2013).
For this discussion, assume that we are also given a value F and that we
would like to find a feasible s − t-flow of value F.1 Lee et al. (2013) start with
the observation that the problem of checking if there is a feasible s − t-flow of
value F in G is equivalent to determining if the intersection of the sets
{x ∈ R^m : Bx = F(e_s − e_t)} ∩ {x ∈ R^m : |x_i| ≤ 1, ∀i ∈ [m]}        (1.1)
is nonempty. Moreover, finding a feasible s − t-flow of value F is equivalent
to finding a point in this intersection. Note that the first set in Equation 1.1 is
the set of all s − t-flows (a linear space) and the second set is the set of all
vectors that satisfy the capacity constraints, in this case the `∞ -ball of radius
one, denoted by B∞ , which is a polytope.
Their main idea is to reduce this nonemptiness testing problem to a convex
optimization problem. To motivate their idea, suppose that we have convex
sets K_1 and K_2 and the goal is to find a point in their intersection (or assert that
there is none). One way to formulate this problem as a convex optimization
problem is as follows: find a point x ∈ K1 that minimizes the distance to K2 .
As K1 is convex, for this formulation to be a convex optimization problem,
we need to find a convex function that captures the distance of a point x to
K2 . It can be checked that the squared Euclidean distance has this property.
Alternatively, one could consider the convex optimization problem where we
switch the roles of K1 and K2 : find a point x ∈ K2 that minimizes the distance
to K1 . Note here that, while the squared Euclidean distance to a set is a convex
function, it is nonlinear. Thus, at this point it may seem like we are heading
in the wrong direction. We started off with a combinatorial problem that is a
special type of a linear programming problem, and here we are with a nonlinear
optimization formulation for it.
Thus two questions arise: which formulation should we choose? And why
should this convex optimization approach lead us to faster algorithms?
1 Using a solution to this problem, we could solve the s − t-maximum flow problem by
performing a binary search over F.
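
To make the footnote concrete, here is a minimal sketch of that binary search; the oracle is_feasible(F), which decides whether a feasible s − t-flow of integral value F exists, and the bound upper are hypothetical names introduced for illustration.

```python
def max_flow_value(is_feasible, upper):
    """Binary search for the largest integral F with a feasible
    s-t flow of value F. `upper` is any upper bound on the maximum
    flow value, e.g., the degree of s in a unit-capacity graph."""
    lo, hi = 0, upper
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if is_feasible(mid):
            lo = mid       # a flow of value mid exists; search higher
        else:
            hi = mid - 1   # no flow of value mid; search lower
    return lo
```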

Lee et al. (2013) considered the following convex optimization formulation


for the s − t-maximum flow problem:

min_{x∈R^m}  dist²(x, B_∞)
s.t.  Bx = F(e_s − e_t),        (1.2)

where dist(x, B_∞) is the Euclidean distance of x to the set B_∞ := {y ∈ R^m :
∥y∥_∞ ≤ 1}. As the optimization problem above minimizes a convex function
over a convex set, it is indeed a convex program. The choice of this formula-
tion, however, comes with a foresight that relies on an understanding of algo-
rithms for convex optimization.
A basic method to minimize a convex function is gradient descent, which
is an iterative algorithm that, in each iteration, takes a step in the direction of
the negative gradient of the function it is supposed to minimize. While gradient
descent does so in an attempt to optimize the function locally, the convexity of
the objective function implies that a local minimum of a convex function is
also its global minimum. Gradient descent only requires oracle access to the
gradient, or first derivative, of the objective function and is, thus, called a first-
order method. It is really a meta-algorithm and, to instantiate it, one has to fix
its parameters such as the step-size and must specify a starting point. These
parameters, in turn, depend on various properties of the program including
estimates of smoothness of the objective function and those of the closeness of
the starting point to the optimal point.
For the convex program in Equation (1.2), the objective function has an
easy to compute first-order oracle. This follows from the observation that it
decomposes into a sum of squared-distances, one for each coordinate, and each
of these functions is quadratic. Moreover, the objective function is smooth:
the change in its gradient is bounded by a constant times the change in its
argument; one can visually inspect this in Figure 1.1.

Figure 1.1 The function dist²(z, [−1, 1]).
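
Concretely, the objective and its gradient take only a few lines, since the squared distance to B_∞ decomposes coordinatewise as in Figure 1.1. The following sketch (an illustration in Python with numpy, not the book's implementation) also makes the smoothness claim tangible: the gradient is 2-Lipschitz.

```python
import numpy as np

def dist2_Binf(x):
    """Squared Euclidean distance of x to the unit ell-infinity ball;
    it decomposes coordinatewise as in Figure 1.1."""
    proj = np.clip(x, -1.0, 1.0)          # nearest point of B_infty
    return np.sum((x - proj) ** 2)

def grad_dist2_Binf(x):
    """Gradient of the objective: 2 * (x - projection onto B_infty).
    Each coordinate's derivative is 0 inside [-1, 1] and 2(x_i -/+ 1)
    outside, so the gradient changes by at most twice the change in
    its argument, i.e., the objective is smooth."""
    return 2.0 * (x - np.clip(x, -1.0, 1.0))
```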



One problem with the application of gradient descent is that the convex pro-
gram in (1.2) has constraints {x ∈ Rm : Bx = F(e s − et )} and, hence, the
direction gradient descent asks us to move can take us out of this set. A way
to get around this is to project the gradient of the objective function on to
the subspace {x ∈ Rm : Bx = 0} at every step and move in the direction of
the projected gradient. However, this projection step requires solving a least
squares problem which, in turn, reduces to the numerical problem of solving a
linear system of equations. While one can appeal to the Gaussian elimination
method for this latter task, it is not fast enough to warrant improvements over
combinatorial algorithms mentioned earlier. Here, a major result discovered
by Spielman and Teng (2004) implies that such a projection can, in fact, be
computed in time O(m).
e This is achieved by noting that the linear system that
arises when projecting a vector onto the subspace {x ∈ Rm : Bx = 0} is the
same as solving a Laplacian systems that are of the form BB> y = a (for a
given vector a), where B is a vertex-edge incidence matrix of the given graph.
Such a result is not known for general linear systems and (implicitly) relies on
the combinatorial structure of the graph that gets encoded in the matrix B.
Thus, roughly speaking, in each iteration, the projected gradient descent al-
gorithm takes a point x_t in the space of all s − t-flows of value F, moves to-
wards the set B_∞ along the negative gradient of the objective function, and then
projects the new point back to the linear space; see Figure 1.2 for an illustra-
tion. While each iterate is an s − t-flow, it is not a feasible flow.
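
A minimal sketch of one such iteration is given below; a dense least-squares solve stands in for the fast Laplacian solver of Spielman and Teng (2004), so the sketch captures only the logic of the step, not its Õ(m) running time.

```python
import numpy as np

def projected_gradient_step(B, x, eta):
    """One step of the scheme sketched above: x satisfies Bx = F(e_s - e_t);
    move against the gradient of dist^2(., B_infty) and project the step
    back onto the subspace {z : Bz = 0}."""
    g = 2.0 * (x - np.clip(x, -1.0, 1.0))   # gradient of the objective
    # Project g onto Ker(B): g_proj = g - B^T y, where BB^T y = B g.
    # This is a Laplacian system, since B is a vertex-edge incidence matrix.
    y, *_ = np.linalg.lstsq(B @ B.T, B @ g, rcond=None)
    g_proj = g - B.T @ y
    return x - eta * g_proj   # the new point still satisfies Bx = F(e_s - e_t)
```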

Figure 1.2 An illustration of one step of the projected gradient descent in the
algorithm by Lee et al. (2013).

A final issue is that such a method may not lead to an exact solution but
only an approximate solution. Moreover, in general, the number of iterations
depends inverse polynomially in the quality of the desired approximation. Lee
et al. (2013) proved the following result: there is an algorithm that, given an
ε > 0, can compute a feasible s − t-flow of value (1 − ε)F in time Õ(mn^{1/3} ε^{−2/3}).
If we ignore the ε in their bound, this improved the result of Goldberg and Rao
(1998) mentioned earlier.
We point out that the combinatorial algorithm of Goldberg and Rao (1998)
has the same running time even when the input graph is directed. It is not clear
how to generalize the gradient descent-based algorithm for the s − t-maximum
flow problem presented above to directed graphs.
The results of Christiano et al. (2011) and Lee et al. (2013) were further
improved using increasingly sophisticated ideas from continuous optimiza-
tion and finally led to a nearly linear time algorithm for the undirected s − t-
maximum flow problem in a sequence of work by Sherman (2013), Kelner
et al. (2014), and Peng (2016). Remarkably, while these improvements aban-
doned discrete approaches and used algorithms for convex optimization, beat-
ing the running times of combinatorial algorithms leveraged the underlying
combinatorial structure of the s − t-maximum flow problem.
The goal of this book is to enable a reader to gain an in-depth understanding
of algorithms for convex optimization in a manner that allows them to ap-
ply these algorithms in domains such as combinatorial optimization, algorithm
design, and machine learning. The emphasis is to derive various convex opti-
mization methods in a principled manner and to establish precise running time
bounds in terms of the input length (and not just on the number of iterations).
The book also contains several examples, such as the one of s − t-maximum
flow presented earlier, that illustrate the bridge between continuous and dis-
crete optimization. Laplacian solvers are not discussed in this book. The reader
is referred to the monograph by Vishnoi (2013) for more on that topic.
The focus of Chapters 3, 4, 5 is on basics of convexity, computational mod-
els, and duality. Chapters 6, 7, 8 present three different first-order methods:
gradient descent, mirror descent and multiplicative weights update method,
and accelerated gradient descent. In particular, the approach to the s − t-maximum
flow problem sketched here is presented in detail as an application in Chapter 6.
In fact, the fastest version of
the method of Lee et al. (2013) uses the accelerated gradient method. Chap-
ter 7 also draws a connection between mirror descent and the multiplicative
weights update method and shows how the latter can be used to design a fast
(approximate) algorithm for the bipartite maximum matching problem. We re-
mark that the algorithm of Christiano et al. (2011) relies on the multiplicative
weights update method.

Beyond approximate algorithms? The combinatorial algorithms for the s−t-


maximum flow problem, unlike the first-order convex optimization-based al-
gorithms described above, are exact. One can convert the latter approximate
algorithms to exact ones, but it may require setting a very small value of ε
making the overall running time non-polynomial. The remainder of the book
is dedicated to developing algorithms for convex optimization – interior point
and ellipsoid – whose number of iterations depends poly-logarithmically on
ε^{−1} as opposed to polynomially on ε^{−1}. Thus, if we use such algorithms, we
can set ε to be small enough to recover exact algorithms for combinatorial
problems at hand. These algorithms use deeper mathematical structures and
more sophisticated strategies (as explained later). The advantage in learning
these algorithms is that they work more generally – for linear programs and
even convex programs in a very general form. Chapters 9, 10, 11, 12, and 13
develop these methods, their variants, and exhibit applications to a variety of
discrete optimization and counting problems.

1.2 Linear programming


The s − t-maximum flow problem on undirected graphs is a type of linear pro-
gram: a convex optimization problem where the objective function is a linear
function and all the constraints are either linear equalities or inequalities. In
fact, the objective function is to maximize the flow value F ≥ 0 constrained to
the set of feasible s − t-flows of value F; see Equation (1.1).
A linear program can be written in many different ways and we consider
its standard form, where one is given a matrix A ∈ R^{n×m}, a constraint vector
b ∈ R^n, and a cost vector c ∈ R^m, and the goal is to solve the following
optimization problem:

max_{x∈R^m}  ⟨c, x⟩
s.t.  Ax = b,
      x ≥ 0.

Typically we assume n ≤ m and, hence, the rank of A is at most n. Analogous
to the s − t-maximum flow problem, linear programming has a rich duality
theory, and in particular the following is the dual of the above linear program:

min_{y∈R^n}  ⟨b, y⟩
s.t.  A^⊤y ≥ c.

Note that the dual is also a linear program and has n variables.

Linear programming duality asserts that if there is a feasible solution to


both the linear program and its dual then the optimal values of these two linear
programs are the same. Moreover, it is often enough to solve the dual if one
wants to solve the primal and vice-versa. While duality has been known for
linear programming for a very long time (see Farkas (1902)), a polynomial
time algorithm for linear programming was discovered much later. What was
special about the s − t-maximum flow problem that led to a polynomial time
algorithm for it before linear programming?

As mentioned earlier, one crucial property that underlies the s − t-maximum


flow problem is integrality. If one encodes the s − t-maximum flow problem
as a linear program in the standard form, the matrix A turns out to be totally
unimodular: determinants of all square submatrices of A are 0, 1, or −1. In
fact, in the case of the s − t-maximum flow problem, A is just the vertex-edge
incidence matrix of the graph G (which we denoted by B) that can be shown
to be totally unimodular. Because of linearity, one can always assume, without
loss of generality, that the optimal solution is an extreme point, i.e., a vertex
of the polyhedra of constraints (not to be confused with the vertex of a graph).
Every such vertex arises as a solution to a system of linear equations involving
a subset of rows of the matrix A. The total unimodularity of A then, along with
Cramer’s rule from linear algebra, implies that each vertex of the polyhedra of
constraints has integral coordinates.
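
For intuition, total unimodularity can be verified by brute force on tiny examples; the following check enumerates all square submatrices (and is therefore exponential in general), so it is purely illustrative.

```python
import numpy as np
from itertools import combinations

def is_totally_unimodular(A):
    """Brute-force check that every square submatrix of A has
    determinant in {-1, 0, 1}; only suitable for tiny matrices."""
    m, n = A.shape
    for k in range(1, min(m, n) + 1):
        for rows in combinations(range(m), k):
            for cols in combinations(range(n), k):
                d = round(np.linalg.det(A[np.ix_(rows, cols)]))
                if d not in (-1, 0, 1):
                    return False
    return True

# Vertex-edge incidence matrix B of the path s - u - t,
# with both edges directed s -> u -> t.
B = np.array([[ 1,  0],
              [-1,  1],
              [ 0, -1]])
print(is_totally_unimodular(B))  # True
```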

While duality and integrality do not directly imply a polynomial time al-
gorithm for the s − t-maximum flow problem, the mathematical structure that
enables these properties is relevant to the design of efficient algorithms for this
problem. It is worth mentioning that these ideas were generalized in a major
way by Edmonds (1965a,b) who figured out an integral polyhedral represen-
tation for the matching problem and gave a polynomial time algorithm for
optimizing linear functions over this polyhedron.

For general linear programs, however, integrality does not hold. The reason
is that for a general matrix A, the determinants of submatrices that show up in
the denominators of the coordinates of the vertices of the associated polyhedra
may not be 1 or −1. However, (for A with integer entries) these determinants
cannot be more than 2^{poly(n,L)} in magnitude, where L is the number of bits
required to encode A, b, and c. This is a consequence of the fact that the
determinant of an n × n matrix with integer entries bounded by 2^L is no more
than n! · 2^{nL}. While there were
combinatorial algorithms for linear programming, e.g., the simplex method of
Dantzig (1990) that moved from one vertex to another, none were known to
run in polynomial time (in the bit complexity) in the worst case.
Figure 1.3 Illustration of one step of the ellipsoid method for the polytope K.

Ellipsoid method. In the late 1970s, a breakthrough occurred and a polyno-


mial time algorithm for linear programming was discovered by Khachiyan
(1979, 1980). The ellipsoid method is a geometric algorithm that checks if
a given linear program is feasible or not. As in the case of the s − t-maximum
flow problem, solving this feasibility problem implies an algorithm to optimize
a linear function via a binary search argument.
In iteration t, the ellipsoid method approximates the feasible region of the
linear program with an ellipsoid Et and outputs the center (xt ) of this ellipsoid
as its guess for a feasible point. If this guess is incorrect, it requires a certificate
– a hyperplane H that separates the center from the feasible region. It uses this
separating hyperplane to find a new ellipsoid (Et+1 ) that encloses the feasible
region; see Figure 1.3. The key point is that the update ensures that the volume
of the ellipsoid reduces at a fast enough rate, and only requires solving a linear
system of equations to find the new ellipsoid from the previous one. If the
volume of the ellipsoid becomes so small that it cannot contain any feasible
point, we can safely assert the infeasibility of the linear program.
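
The update itself is given by closed-form formulas. The sketch below implements the standard central-cut update (for dimension n ≥ 2), representing the ellipsoid E = {x : (x − c)^⊤P^{−1}(x − c) ≤ 1} by its center c and positive definite matrix P; the volume-drop analysis behind it appears in Chapter 12.

```python
import numpy as np

def ellipsoid_step(c, P, a):
    """One central-cut ellipsoid update, as in Figure 1.3. The current
    ellipsoid is E = {x : (x - c)^T P^{-1} (x - c) <= 1} and the
    separating hyperplane certifies that the feasible region lies in
    the half-space {x : <a, x> <= <a, c>}. Returns the center and
    matrix of the smaller enclosing ellipsoid (requires n >= 2)."""
    n = len(c)
    b = (P @ a) / np.sqrt(a @ P @ a)
    c_new = c - b / (n + 1)
    P_new = (n**2 / (n**2 - 1.0)) * (P - (2.0 / (n + 1)) * np.outer(b, b))
    # Each such step shrinks the volume by a factor of at least
    # exp(-1/(2(n+1))), which is the "fast enough rate" mentioned above.
    return c_new, P_new
```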
The ellipsoid method belongs to the larger class of cutting plane methods
as, in each step, the current ellipsoid is cut by an affine halfspace and, a new
ellipsoid that contains this intersection is determined. The final running time
of Khachiyan’s ellipsoid method was a polynomial in n, L and, importantly, in
log 1ε : the algorithm output a point x̂ in the feasible region such that
hc, x̂i ≤ hc, x? i + ε.
This implied that one can handle an error as small as 2−poly(n,L) , and this is all
we need for linear programming to be in polynomial time. While this put lin-
ear programming in the complexity class P for the first time, the resulting algo-
rithm, when specialized for combinatorial problems such as the s−t-maximum
flow problem, was far from competitive in terms of running time.

Figure 1.4 Illustration of the interior point method for the polytope K.

Interior point methods. In 1984, another continuous polynomial time algo-


rithm to solve linear programs was discovered by Karmarkar (1984): this time
the idea was to move in the interior of the feasible region until one reaches
the optimal solution; see Figure 1.4. Karmarkar’s algorithm had its roots in
the barrier method from nonlinear optimization. The barrier method is one
way to convert a constrained optimization problem to an unconstrained one by
choosing a barrier function for the constraint set. Roughly speaking, a bar-
rier function for a convex set is function that is finite only in the interior of it
and increases to infinity as one approaches the boundary of the feasible region.
Once we have a barrier function for a constraint set, we can add it to the objec-
tive function to penalize any violation of the constraints. For such a function
to be algorithmically useful, it is desirable that it is a convex function and also
has certain smoothness properties (as explained later in this book).
Renegar (1988) combined the barrier approach with Newton's method for
root finding to improve upon Karmarkar's method. Unlike gradient descent,
which is based on a first-order approximation of the objective function, Rene-
gar's algorithm, following Newton's method, considered a quadratic approx-
imation to the objective function around the current point and optimized it to
find the next point. His method took roughly O(√m L log(1/ε)) iterations to find
a solution that is ε (additive) away from the optimal, and each iteration just
had to solve a linear system of equations of size m × m.
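
This is where the per-iteration linear system comes from: minimizing the quadratic approximation amounts to solving one linear system in the Hessian. A generic Newton step, sketched in Python with user-supplied gradient and Hessian oracles (an illustrative interface, not Renegar's algorithm itself), looks as follows.

```python
import numpy as np

def newton_step(grad, hess, x):
    """One step of Newton's method: minimize the quadratic approximation
    f(x) + <g, h> + (1/2) h^T H h over h, i.e., solve H h = -g."""
    g, H = grad(x), hess(x)
    h = np.linalg.solve(H, -g)  # each iteration solves one linear system
    return x + h
```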
Unlike the ellipsoid method, which just needs a separating hyperplane, inte-
rior point methods require the constraints explicitly to compute the second-
order approximation of the objective function (which includes computing the
Hessian of the barrier function for the constraint set). In Chapter 9, we de-
rive Newton’s method from first principles and present its analysis using local
norms. In Chapter 10, we introduce barrier functions and present Renegar’s
path following interior point method for linear programming.

1.3 Fast and exact algorithms via interior point methods


Despite the remarkable effort that went into improving interior point methods
in the late 80s, they could still not compete with combinatorial algorithms for
problems such as the s − t-maximum flow problem. A key obstacle was the
fact that solving a linear system of equations (a primitive used at each step of
the ellipsoid and interior point methods) required roughly O(m^{2.373}) time.
Vaidya (1990) observed that the combinatorial structure of the problem man-
ifests in the linear systems that arise and this structure could be used to speed
up certain linear programs. For instance, and as mentioned earlier, for the s − t-
maximum flow problem, the linear systems that arise correspond to Lapla-
cian systems. Vaidya presented some initial results for such linear systems that
gave hope for improving the per-iteration cost. His program was completed by
Spielman and Teng (2004) who gave an Õ(m) time algorithm to solve Lapla-
cian systems. And, using this (and a few more ideas), Daitch and Spielman
(2008) finally gave an interior point method that improved upon prior algo-
rithms for the s − t-minimum cost flow problem. This is presented in Chapter
11 and was the first sign that general-purpose convex optimization methods
can be specialized to be comparable to or even beat combinatorial algorithms.

Beyond log-barrier functions. Meanwhile, in a sequence of papers, Vaidya


(1987, 1989a,b) introduced the volumetric barrier as a generalization of Kar-
markar’s barrier function and obtain modest improvements on the number of
iterations while ensuring that each iteration of his interior point method for
linear programming still just required multiplying two m × m matrices.
Nesterov and Nemirovskii (1994) abstracted the essence of barrier functions
and introduced the notion of self-concordance. They introduced the universal
barrier function and showed that the number of iterations of an interior point
method based on their universal barrier function is √n, where n is the dimension of
the feasible region. They also showed that this bound cannot, in general, go
below √n. Computing the barrier function that achieved this, however, was
not easier than solving the linear programming problem itself.
Finally, Lee and Sidford (2014), building up on the ideas of Vaidya, gave a
new barrier function for interior point methods that not only came sensationally
close to the bound of O(√n) of Nesterov and Nemirovskii (1994), but each iter-
ation of their method just solves a small number of n × n linear systems. Using
these ideas, they gave an exact algorithm for the s − t-maximum flow problem
that ran in Õ(m√n log² U) time, the first improvement since Goldberg and Rao
(1998). Chapter 11 outlines the methods of Vaidya, Nesterov-Nemirovskii, and
Lee-Sidford.

1.4 Ellipsoid method beyond succinct linear programs


As mentioned earlier, an advantage of the ellipsoid method over interior point
methods is that it just needs a separation oracle for the polytope in order
to optimize a linear function over it. A separation oracle for a convex set is
an algorithm that, given a point, either asserts that the point is in the convex
set, or outputs a hyperplane that separates the point from the convex set. This
fact was exploited by Grötschel et al. (1981) to show that the ellipsoid method
can also be used to perform linear optimization over combinatorial polytopes
that do not have a succinct linear description. Prominent examples include the
matching polytope for general graphs and various matroid polytopes.
Chapter 12 presents the general framework of cutting plane methods, de-
rives the ellipsoid method of Khachiyan, and applies it to the problem of linear
optimization over combinatorial 0-1-polytopes for which we only have a sepa-
ration oracle. Thus, the ellipsoid method may, sometimes, be the only way one
can obtain polynomial time algorithms for combinatorial problems.
Grötschel et al. (1981, 1988) noticed something more about the ellipsoid
method: it can be extended to general convex programs of the form
min_{x∈K} f(x),        (1.3)

where both f and K are convex. Their method outputs a point x̂ such that
f(x̂) ≤ min_{x∈K} f(x) + ε

in time, roughly, poly(n, log(R/ε), T_f, T_K), where R is such that K is contained in a
ball of radius R, time T_f is required to compute the gradient of f, and time T_K
is required to separate a point from K. Thus, this settled convex optimization in
its most generality. However, we emphasize that this result does not imply that
any convex program of the form (1.3) is in P. The reason is that sometimes
it may be impossible to get an efficient gradient oracle for f , or an efficient
separation oracle for K, or a good enough bound on R.
In Chapter 13, we present an algorithm to minimize a convex function over
a convex set and prove the guarantee mentioned above. Subsequently, we show
how this can be used to give a polynomial time algorithm for another combi-
natorial problem – submodular function minimization – given just an eval-
uation oracle to the function. A submodular function f : 2^{[m]} → R has the
diminishing returns property: for sets S ⊆ T ⊆ [m], the marginal gain of
adding an element not in T to S is at least the marginal gain of adding it to
T . The ability to minimize submodular set functions allows us to obtain sep-
aration oracles for matroid polytopes. Submodular functions arose in discrete
optimization, but have recently found applications in machine learning.

Finally, in Chapter 13, we consider convex programs that have been recently
used for designing algorithms for various counting problems over discrete sets, such
as spanning trees. Given a graph G = (V, E), let T_G denote the set of spanning trees in G
and let PG denote the spanning tree polytope, i.e., the convex hull of indicator
vectors of all the spanning trees in TG . Each vertex of PG corresponds to a
spanning tree in G. The problem that we consider is the following: given a
point θ ∈ PG , find a way to write θ as a convex combination of the vertices
of the polytope PG so that the probability distribution corresponding to this
convex combination maximizes Shannon entropy; see Figure 1.5. To see what
this problem has to do with counting spanning trees, the reader is encouraged
to check that if we let θ be the average of all the vertex vectors of PG , the value
of this optimization problem is exactly log |TG |.
As stated, this is an optimization problem where there is a variable corre-
sponding to each vertex of the polytope, the constraints on these variables are
linear, and the objective is to maximize a concave function; see Figure
1.5. Thus, this is a convex program. Note, however, that |TG | can be exponen-
tial in the number of vertices in G; the complete graph on n vertices has n^{n−2}
spanning trees. Thus, the number of variables can be exponential in the input
size and it is not clear how to solve this problem. Interestingly, if one considers
the dual of this convex optimization problem, the number of variables becomes
the number of edges in the graph.
However, there are obstacles to applying the general convex optimization
method to this setting and this is discussed in detail in Chapter 13. In par-
ticular, Chapter 13 presents a polynomial time algorithm for the maximum
entropy problem over polytopes due to Singh and Vishnoi (2014); Straszak
and Vishnoi (2019). Such algorithms have been used to design very general
approximate counting algorithms for discrete problems by Anari and Oveis
Gharan (2017); Straszak and Vishnoi (2017), and have enabled breakthrough
results for the traveling salesman problem by Oveis Gharan et al. (2011);
Karlin et al. (2020).

Figure 1.5 The maximum entropy problem and its convex program:

max  −∑_i p_i log p_i
subject to  ∑_i p_i v_i = θ,
            ∑_i p_i = 1,
            ∀i, p_i ≥ 0.
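
As a toy illustration of the program in Figure 1.5, the sketch below solves the maximum entropy problem for the triangle graph, whose three spanning trees are its 2-edge subsets. It calls a generic solver (scipy) rather than the specialized algorithms of Chapter 13, and the setup is illustrative only; the optimum is the uniform distribution, with entropy log |T_G| = log 3.

```python
import numpy as np
from scipy.optimize import minimize

# Spanning trees of the triangle graph are its three 2-edge subsets,
# written as indicator vectors over the edges (the vertices of P_G).
V = np.array([[1, 1, 0],
              [1, 0, 1],
              [0, 1, 1]], dtype=float)
theta = V.mean(axis=0)  # theta: the average of the vertices of P_G

res = minimize(
    lambda p: np.sum(p * np.log(p)),    # minimize the negative entropy
    x0=np.array([0.5, 0.3, 0.2]),
    bounds=[(1e-9, 1.0)] * 3,
    constraints=[
        {"type": "eq", "fun": lambda p: V.T @ p - theta},  # sum_i p_i v_i = theta
        {"type": "eq", "fun": lambda p: np.sum(p) - 1.0},  # sum_i p_i = 1
    ],
)
print(res.x)                # ~ (1/3, 1/3, 1/3): the uniform distribution
print(-res.fun, np.log(3))  # maximum entropy equals log |T_G| = log 3
```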
2
Preliminaries

We review the mathematical preliminaries required for this book. These include some
standard notions and facts from multivariate calculus, linear algebra, geometry, topology,
dynamical systems, and graph theory.

2.1 Derivatives, gradients, and hessians


We start with the definition of the derivative of a univariate function.
Definition 2.1 (Derivative) For g : R → R and t ∈ R, consider the following
limit:

lim_{δ→0} (g(t + δ) − g(t))/δ.

The function g is said to be differentiable if this limit exists for all t ∈ R and
this limit is called the derivative of g at t. We denote it by

(d/dt) g(t) or ġ(t) or g′(t).
A function is said to be continuously differentiable if its derivative is also
continuous. The central objects of study are functions f : Rn → R. For such
multivariate functions, we can use the definition of the derivative to define
gradients. But first, we need the definition of a directional derivative.
Definition 2.2 (Directional derivative) For f : R^n → R and x ∈ R^n, given
h_1, . . . , h_k ∈ R^n, define D^k f(x) : (R^n)^k → R as

D^k f(x)[h_1, . . . , h_k] := (d/dt_1) · · · (d/dt_k) f(x + t_1h_1 + · · · + t_kh_k) |_{t_1=···=t_k=0}.

D^k f(x) is called the directional derivative as we can evaluate it for any k-tuple
of directions h_1, . . . , h_k.
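
For k = 1, the definition can be approximated numerically by differentiating t ↦ f(x + th) at t = 0. The following Python sketch is a finite-difference check (illustrative only, not part of the book's development).

```python
import numpy as np

def directional_derivative(f, x, h, eps=1e-6):
    """Finite-difference approximation of D f(x)[h] (Definition 2.2,
    k = 1): differentiate t -> f(x + t h) at t = 0."""
    return (f(x + eps * h) - f(x - eps * h)) / (2 * eps)

# Example: f(x) = <x, x> has D f(x)[h] = 2 <x, h>.
f = lambda x: x @ x
x, h = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(directional_derivative(f, x, h), 2 * (x @ h))  # both ~ -3.0
```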


D^1 f(x)[·] is denoted by D f(x)[·], and is also known as the differential of f at
x. When the kth directional derivative of a function f exists for all arguments
to D^k f(·)[·, . . . , ·], we say that f is differentiable up to order k. For k = 1, 2
we call such a function differentiable or twice differentiable. Unless otherwise
stated, we assume that the functions are sufficiently differentiable. In this case,
we can talk about the gradient and the Hessian of f .
Definition 2.3 (Gradient) For a differentiable function f : R^n → R and an
x ∈ R^n, the gradient of f at x is defined as the unique vector g(x) ∈ R^n such
that

⟨g(x), y⟩ = D f(x)[y]

for all y ∈ R^n. When y = e_i, the ith standard basis vector in R^n, D f(x)[e_i] is
denoted by (∂/∂x_i) f(x). We denote the gradient g(x) by ∇f(x) or D f(x) and we
write it as:

∇f(x) = [∂f/∂x_1 (x), ∂f/∂x_2 (x), . . . , ∂f/∂x_n (x)]^⊤.
Definition 2.4 (Hessian) For a twice differentiable function f : R^n → R, its
Hessian at x ∈ R^n is defined as the unique linear function H(x) : R^n → R^n
such that for all x, h_1, h_2 ∈ R^n we have

h_1^⊤ H(x) h_2 = D² f(x)[h_1, h_2].

The Hessian is often denoted by ∇²f(x) or D²f(x) and can be represented as
the matrix whose entry at row i and column j is

(∇²f(x))_{ij} = ∂²f/∂x_i∂x_j (x).

In other words, ∇²f(x) is the following n × n matrix:

[ ∂²f/∂x_1²       ∂²f/∂x_1∂x_2   · · ·  ∂²f/∂x_1∂x_n
  ∂²f/∂x_2∂x_1    ∂²f/∂x_2²      · · ·  ∂²f/∂x_2∂x_n
  ⋮                ⋮               ⋱      ⋮
  ∂²f/∂x_n∂x_1    ∂²f/∂x_n∂x_2   · · ·  ∂²f/∂x_n²     ].
The Hessian is symmetric as the order of i and j does not matter in differentiation.
The definitions of gradient and Hessian presented here easily generalize
to the setting where the function is defined on a subdomain of Rn ; see Chap-
ter 11. As a generalization of the notation used for gradient and Hessian, we
sometimes also use the notation ∇k f (x) instead of Dk f (x).
We now show that differentiation in the multivariate setting can be expressed

in terms of integrals of univariate functions. It relies on the following (part II


of the) fundamental theorem of calculus.
Theorem 2.5 (Fundamental theorem of calculus, part II) Let f : [a, b] →
R be a continuously differentiable function. Then

∫_a^b ḟ(t) dt = f(b) − f(a).

The proof of the following useful lemma is left as an exercise (Exercise 2.2).
Lemma 2.6 (Integral representations of functions) Let f : R^n → R be a
continuously differentiable function. For x, y ∈ R^n, let g : [0, 1] → R be defined
as

g(t) := f(x + t(y − x)).
Then, the following are true:
(i) ġ(t) = ⟨∇f(x + t(y − x)), y − x⟩, and
(ii) f(y) = f(x) + ∫_0^1 ġ(t) dt.

If, in addition, f has a continuous Hessian, then the following are also true:

(i) g̈(t) = (y − x)^⊤ ∇²f(x + t(y − x)) (y − x), and
(ii) ⟨∇f(y) − ∇f(x), y − x⟩ = ∫_0^1 g̈(t) dt.
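
Part (ii) can be sanity-checked numerically by approximating the integral with a quadrature rule; the sketch below (with illustrative choices of f, x, and y) does this for f(z) = ∑_i z_i^4.

```python
import numpy as np

# Numerical check of part (ii): f(y) - f(x) = integral over [0,1] of
# g'(t) = <grad f(x + t(y - x)), y - x>, via a midpoint-rule quadrature.
f = lambda z: np.sum(z ** 4)
grad_f = lambda z: 4 * z ** 3

x, y = np.array([0.0, 1.0]), np.array([1.0, 2.0])
ts = (np.arange(10000) + 0.5) / 10000        # midpoints of 10000 subintervals
integral = np.mean([grad_f(x + t * (y - x)) @ (y - x) for t in ts])
print(f(y) - f(x), integral)                 # both ~ 16.0
```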

2.2 Taylor series


It is often useful to consider a linear or quadratic approximation of a function
f : Rn → R around a certain point a ∈ Rn .
Definition 2.7 (Taylor series and approximation) Let f : Rn → R be
a function that is infinitely differentiable. The Taylor series expansion of f
around a, whenever it exists, is given by:
f(x) = f(a) + ⟨∇f(a), x − a⟩ + (1/2)(x − a)^⊤ ∇²f(a)(x − a) + ∑_{k≥3} (1/k!) ∇^k f(a)[x − a, . . . , x − a].

The first-order (Taylor) approximation of f around a is denoted by

f(a) + ⟨∇f(a), x − a⟩

and the second-order (Taylor) approximation of f around a is denoted by

f(a) + ⟨∇f(a), x − a⟩ + (1/2)(x − a)^⊤ ∇²f(a)(x − a).
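
Both approximations are immediate to compute once gradient and Hessian oracles are available; the following sketch (with illustrative choices of f, a, and x) also shows how the second-order term shrinks the error.

```python
import numpy as np

def taylor_approximations(f, grad_f, hess_f, a, x):
    """First- and second-order Taylor approximations of f around a,
    evaluated at x, as in Definition 2.7."""
    d = x - a
    first = f(a) + grad_f(a) @ d
    second = first + 0.5 * d @ hess_f(a) @ d
    return first, second

# Example: f(x) = exp(x_1 + x_2) near a = 0; the second-order term
# reduces the error from O(||x - a||^2) to O(||x - a||^3).
f = lambda z: np.exp(z[0] + z[1])
grad_f = lambda z: np.exp(z[0] + z[1]) * np.ones(2)
hess_f = lambda z: np.exp(z[0] + z[1]) * np.ones((2, 2))
a, x = np.zeros(2), np.array([0.1, 0.05])
print(f(x), *taylor_approximations(f, grad_f, hess_f, a, x))
```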

In many interesting cases one can prove that whenever x is close enough to
a, then the higher-order terms do not contribute much to the value of f (x)
and, hence, the second-order (or even first-order) approximation gives a good
estimate of f (x).

2.3 Linear algebra, matrices, and eigenvalues


For a given set of vectors S = {v_1, . . . , v_k} ⊆ R^n, their linear span or linear
hull is the following set:

span(S) := { ∑_{i=1}^k α_i v_i : α_i ∈ R }.
Let R^{m×n} denote the set of all m × n matrices over the real numbers. In many
cases, the matrices we work with are square, i.e., m = n, and symmetric, i.e.,
M^⊤ = M (here ⊤ denotes the transpose). The identity matrix of size n is a
square matrix of size n × n with ones on the main diagonal and zeros everywhere
else. It is denoted by I_n. If the dimension is clear from the context, we drop the
subscript n.

Definition 2.8 (Image, nullspace/kernel, and rank) Let A ∈ R^{m×n}.

(i) The image of A is defined as Im(A) := {Ax : x ∈ R^n}.
(ii) The nullspace or kernel of A is defined as Ker(A) := {x ∈ R^n : Ax = 0}.
(iii) The column rank of A is defined to be the maximum number of linearly
independent columns of A, and the row rank the maximum number of
linearly independent rows of A. It turns out that the row rank of a matrix
is the same as its column rank – called the rank of the matrix – and the
rank of A cannot be more than min{m, n}. When the rank of A is equal
to min{m, n}, A is said to have full rank.

Note that both Im(A) and Ker(A) are vector spaces.

Definition 2.9 (Inverse) Let A ∈ R^{n×n}. A is said to be invertible if its rank is
n. In other words, if for each vector y in the image of A, there is exactly one x
such that y = Ax. In this case we use the notation A^{−1} for the inverse of A.

Sometimes, a matrix A may be square but not of full rank, or A may be full
rank but not square. In both cases, we can extend the notion of inverse.

Definition 2.10 (Pseudoinverse) For A ∈ R^{m×n}, a pseudoinverse of A is
defined as a matrix A^+ ∈ R^{n×m} satisfying all of the following criteria:

(i) AA^+A = A,
(ii) A^+AA^+ = A^+,
(iii) (AA^+)^⊤ = AA^+, and
(iv) (A^+A)^⊤ = A^+A.

The pseudoinverse can be shown to always exist. Moreover, if A has linearly
independent columns, then

A^+ = (A^⊤A)^{−1}A^⊤

and, if A has linearly independent rows, then

A^+ = A^⊤(AA^⊤)^{−1};

see Exercise 2.10.
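
These formulas are easy to check numerically; the sketch below compares the column-independent formula against numpy's built-in pseudoinverse and verifies two of the four defining criteria.

```python
import numpy as np

# A has linearly independent columns, so A^+ = (A^T A)^{-1} A^T;
# numpy's pinv computes the same matrix (a quick sanity check).
A = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
A_plus = np.linalg.inv(A.T @ A) @ A.T
print(np.allclose(A_plus, np.linalg.pinv(A)))   # True
print(np.allclose(A @ A_plus @ A, A))           # criterion (i)
print(np.allclose((A @ A_plus).T, A @ A_plus))  # criterion (iii)
```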
An important subclass of symmetric matrices are positive semidefinite ma-
trices, as defined below.
Definition 2.11 (Positive semidefinite matrix) A real symmetric matrix M
is said to be positive semidefinite (PSD) if for all x ∈ R^n, x^⊤Mx ≥ 0. This is
denoted by:

M ⪰ 0.

M is said to be positive definite (PD) if x^⊤Mx > 0 holds for all nonzero x ∈ R^n.
This is denoted by:

M ≻ 0.
For instance, the identity matrix I is PD, while the 2 × 2 diagonal matrix with a 1
and a −1 on the diagonal is not PSD. As another example, let
$$M := \begin{pmatrix} 2 & -1 \\ -1 & 1 \end{pmatrix},$$
then
$$\forall x \in \mathbb{R}^2, \quad x^\top M x = 2x_1^2 - 2x_1 x_2 + x_2^2 = x_1^2 + (x_1 - x_2)^2 \geq 0,$$
hence M is PSD and in fact PD. Occasionally, we make use of the following
convenient notation: for two symmetric matrices M and N we write M ⪯ N if
and only if N − M ⪰ 0. It is not hard to prove that ⪯ defines a partial order on
the set of symmetric matrices. An n × n real PSD matrix M can be written as
$$M = BB^\top,$$
where B is an n × n real matrix with possibly dependent rows. B is sometimes
called the square root of M and denoted by M^{1/2}. If M is PD, then the rows of
such a B are linearly independent (Exercise 2.6).
We now review the notion of eigenvalues and eigenvectors of square matri-
ces.

Definition 2.12 (Eigenvalues and eigenvectors) λ ∈ R and u ∈ Rn is an


eigenvalue-eigenvector pair of the matrix A ∈ R^{n×n} if Au = λu and u ≠ 0.
Geometrically speaking, this means that eigenvectors are vectors that, under
the transformation A, preserve the direction of the vector scaled by λ (the cor-
responding eigenvalue).
Note that for each eigenvalue λ of a matrix A with an eigenvector u, the
vector cu is also an eigenvector with eigenvalue λ for any c ∈ R \ {0}. An
alternate characterization of eigenvalues of a square matrix is that they are
zeros of the polynomial equation det(A − λI) = 0 in λ.
Thus, it follows from the fundamental theorem of algebra, that if A ∈ Rn×n ,
then it has n (possibly repeated and complex) eigenvalues. The set of eigen-
values {λ1 , . . . , λn } of a matrix A ∈ Rn×n is sometimes referred to as the spec-
trum of A. When A is symmetric, all of its eigenvalues are real (Exercise 2.7).
Moreover, if A is PSD, then all its eigenvalues are nonnegative (Exercise 2.8).
Further, in this case, the eigenvectors corresponding to different eigenvalues
can be chosen to be orthogonal to each other.
Definition 2.13 (Matrix norm) For an n × m real-valued matrix A, its 2 → 2,
or spectral, norm is defined as
$$\|A\|_2 := \sup_{x \in \mathbb{R}^m,\, x \neq 0} \frac{\|Ax\|_2}{\|x\|_2}.$$
Theorem 2.14 (Norm of a symmetric matrix) Let A be an n × n real
symmetric matrix with eigenvalues λ1(A) ≤ λ2(A) ≤ · · · ≤ λn(A). Then its norm is
$$\|A\|_2 = \max\{|\lambda_1(A)|, |\lambda_n(A)|\}.$$
If A is a symmetric PSD matrix, ‖A‖₂ = λn(A).
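Theorem 2.14 is easy to check numerically; a minimal sketch (the random symmetric matrix is our own example):

```python
import numpy as np

# Sketch: for a symmetric A, the spectral norm equals
# max{|lambda_1(A)|, |lambda_n(A)|}, as stated in Theorem 2.14.
rng = np.random.default_rng(1)
B = rng.standard_normal((4, 4))
A = (B + B.T) / 2                        # symmetrize

eigs = np.linalg.eigvalsh(A)             # eigenvalues, sorted ascending
assert np.isclose(np.linalg.norm(A, 2), max(abs(eigs[0]), abs(eigs[-1])))
```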

2.4 The Cauchy-Schwarz inequality


The following basic inequality is used frequently in this book.
Theorem 2.15 (Cauchy-Schwarz inequality) For every x, y ∈ Rn,
$$\langle x, y \rangle \leq \|x\|_2 \|y\|_2. \tag{2.1}$$
Intuitively, this inequality can be explained very simply – the two vectors x
and y form together a subspace of dimension at most 2 that can be thought of
as R2 . Furthermore, assuming x, y ∈ R2 , we know that hx, yi = kxk2 kyk2 cos θ.
Since cos θ ≤ 1, the inequality holds. Nevertheless, the following is a formal
proof.

Proof The inequality can be equivalently written as


$$|\langle x, y \rangle|^2 \leq \|x\|_2^2 \, \|y\|_2^2.$$
Let us form the following nonnegative polynomial in z:
$$\sum_{i=1}^{n}(x_i z + y_i)^2 = \left(\sum_{i=1}^{n} x_i^2\right) z^2 + 2\left(\sum_{i=1}^{n} x_i y_i\right) z + \sum_{i=1}^{n} y_i^2 \geq 0.$$

Since this degree two polynomial is nonnegative, it has at most one real zero
and its discriminant must be less than or equal to zero, implying that
$$\left(\sum_{i=1}^{n} x_i y_i\right)^2 - \sum_{i=1}^{n} x_i^2 \sum_{i=1}^{n} y_i^2 \leq 0,$$

thus, completing the proof.

2.5 Norms
So far we have talked about the ℓ2 (Euclidean) norm and proved the Cauchy-Schwarz
inequality for this norm. We now present the general notion of a norm
and give several examples used throughout this book. We conclude with a
generalization of the Cauchy-Schwarz inequality for general norms.
Geometrically, a norm is a way to tell how long a vector is. However, not every
function is a norm, and the following definition formalizes what it means to be
a norm.
Definition 2.16 (Norm) A norm is a function k · k : Rn → R that satisfies,
for every u, v ∈ Rn and c ∈ R,
(i) kc · uk = |c| · kuk,
(ii) ku + vk ≤ kuk + kvk, and
(iii) kuk = 0 if and only if u = 0.
An important class of norms are the ℓ_p-norms for p ≥ 1. Given a p ≥ 1, for
u ∈ Rn, define
$$\|u\|_p := \left(\sum_{i=1}^{n} |u_i|^p\right)^{1/p}.$$

It can be seen that this is a norm for p ≥ 1. Another class of norms are those
induced by PD matrices. Given an n × n real PD matrix A and u ∈ Rn, define
$$\|u\|_A := \sqrt{u^\top A u}.$$
This can also be shown to be a norm.

Definition 2.17 (Dual norm) Let k · k be a norm in Rn , then the dual norm,
denoted by k · k∗ , is defined as

$$\|x\|_* := \sup_{y \in \mathbb{R}^n : \|y\| \leq 1} \langle x, y \rangle.$$

It is an exercise to show that for all p, q ≥ 1, ‖·‖_q is the dual norm of ‖·‖_p
whenever 1/p + 1/q = 1. Since p = q = 2 satisfies this equality, ‖·‖₂ is the dual
norm of itself. Another important example of dual norms is p = 1 and q = ∞. One
can also see that, for a PD matrix A, the dual norm corresponding to ‖u‖_A is ‖u‖_{A⁻¹}.
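For the pair p = 1, q = ∞, the duality can be checked directly: the supremum of ⟨x, y⟩ over the ℓ₁ unit ball is attained at one of its vertices ±e_i, so it equals max_i |x_i| = ‖x‖∞. A small sketch (the point x is our own example):

```python
import numpy as np
from itertools import product

# Sketch: the dual of the l1-norm is the l-infinity norm. The extreme
# points of the l1 unit ball are +/- e_i, so the supremum of <x, y> over
# ||y||_1 <= 1 is max_i |x_i|.
x = np.array([3.0, -1.0, 2.0])
vertices = [s * e for s, e in product([1.0, -1.0], np.eye(3))]
dual_value = max(x @ y for y in vertices)
assert np.isclose(dual_value, np.linalg.norm(x, np.inf))
```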
We conclude with a generalization of the Cauchy-Schwarz inequality.

Theorem 2.18 (Generalized Cauchy-Schwarz inequality) For every pair


of dual norms k·k and k·k∗ , the following general version of the Cauchy-Schwarz
inequality holds:
$$\langle x, y \rangle \leq \|x\| \, \|y\|_*.$$

Unless stated otherwise in this book, k · k will denote k · k2 , the Euclidean norm.

2.6 Euclidean topology


We are concerned with Rn along with natural topology induced on it. An open
ball centered at a point x ∈ Rn and of radius r > 0 is the following set:

{y ∈ Rn : kx − yk2 < r}.

An open ball generalizes the concept of an open interval over real numbers.
A set K ⊆ Rn is said to be open if every point in K is the center of an open
ball contained in K. A set K ⊆ Rn is said to be bounded if there exists an r
with 0 ≤ r < ∞ such that K is contained in an open ball of radius r. A set K ⊆ Rn
is said to be closed if Rn \ K is open. A closed and bounded set K ⊆ Rn is also
sometimes referred to as compact. A point x ∈ Rn is a limit point of a set K
if every open set containing x contains at least one point of K different from
x itself. It can be shown that a set is closed if and only if it contains all of its
limit points. The closure of a set K consists of all points in K along with all
limit points of K. A function f : Rn → R is said to be closed if it maps every
closed set K ⊆ Rn to a closed set.
A point x ∈ K ⊆ Rn is said to be in the interior of K, if some ball of
positive radius containing x is contained in K. For a set K ⊆ Rn let ∂K denote
its boundary: it is the set of all points x in the closure of K that are not in the
interior of K. The point x ∈ K is said to be in the relative interior of K if

K contains the intersection of a ball of positive radius centered at x with the


smallest affine space containing K (also called the affine hull of K).

2.7 Dynamical systems


Many algorithms introduced in this book are best viewed as dynamical sys-
tems. Dynamical systems typically belong to two classes: continuous-time and
discrete-time. Both of them consist of a domain Ω and a “rule” according to
which a point moves in the domain. The difference is that in a discrete-time
dynamical system the time at which the point moves is discrete t = 0, 1, 2, . . . ,
while in the continuous-time dynamical system it is continuous, i.e., t ∈ [0, ∞).
Definition 2.19 (Continuous-time dynamical system) A continuous-time
dynamical system consists of a domain Ω ⊆ Rn and a function G : Ω → Rn .
For any point s ∈ Ω we define a solution (a trajectory) originating at s to be a
curve x : [0, ∞) → Ω such that
$$x(0) = s \quad \text{and} \quad \frac{d}{dt}\,x(t) = G(x(t)) \ \text{ for } t \in (0, \infty).$$
For brevity, the differential equation in the definition above is often written as
ẋ = G(x). The definition essentially says that a solution to a dynamical system
is any curve x : [0, ∞) → Ω which is tangent to G(x(t)) at x(t) for every
t ∈ (0, ∞), thus G(x) gives a direction to be followed at any given point x ∈ Ω.
Definition 2.20 (Discrete-time dynamical system) A discrete-time dynamical
system consists of a domain Ω and a function F : Ω → Ω. For any point s ∈ Ω
we define a solution (a trajectory) originating at s to be the infinite sequence
of points {x(k) }k∈N with x(0) = s and x(k+1) = F(x(k) ) for every k ∈ N.
Sometimes we derive our algorithms by first defining a continuous-time dy-
namical system and then discretizing it. There is no standard way to do this
and sometimes the discretization can be very creative. The simplest way, how-
ever, is the Euler discretization.
Definition 2.21 (First-order Euler discretization) Consider a dynamical
system given by ẋ = G(x) on some domain Ω. Then, given a step size h ∈ (0, 1),
its first-order Euler discretization is a discrete-time dynamical system (Ω, F),
where
F(x) = x + h · G(x).
This is perfectly compatible with the intuition that a continuous dynamical
system at x is basically moving “just a little bit” along G(x). Note that formally

the above definition is not quite correct, as F(x) might land outside of Ω – this
is a manifestation of the “existence of solution” issue for discrete systems –
sometimes in order to achieve existence one has to take h very small, and in
some cases it might not be possible at all.
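A minimal sketch of the Euler discretization, for the one-dimensional system ẋ = −x on Ω = R, whose exact solution is x(t) = x(0)e^{−t} (the step size and time horizon are our own choices):

```python
import math

# Sketch: first-order Euler discretization F(x) = x + h * G(x) applied to
# the continuous-time system x' = -x, integrated up to time t = 1.
def G(x):
    return -x

h = 0.01
x = 1.0                                  # x(0) = 1
for _ in range(int(1.0 / h)):
    x = x + h * G(x)

print(x, math.exp(-1.0))                 # ~0.3660 vs. exact ~0.3679
```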

2.8 Graphs
An undirected graph G = (V, E) consists of a finite set of vertices V and a set of
edges E. Each edge e ∈ E is a two element subset of V. A vertex v is said to
be incident to e if v ∈ e. A loop is an edge for which both endpoints are the
same. Two edges are said to be parallel if their endpoints are identical. A graph
without loops and parallel edges is said to be simple. A simple, undirected
graph that has an edge between every pair of vertices is called the complete
graph and is denoted by KV or Kn where n := |V|. For a vertex v ∈ V, N(v)
denotes the set of vertices u such that {u, v} is an edge in E. The degree of
a vertex v is denoted by dv := |N(v)|. A subgraph of a graph G = (V, E) is
simply a graph H = (U, F) such that U ⊆ V and F ⊆ E where the edges in F
are only incident to vertices in U.
A graph G = (V, E) is said to be bipartite if the vertex set V can be parti-
tioned into two parts V1 and V2 and all edges contain one vertex from V1 and
one vertex from V2 . If there is an edge between every vertex in V1 to every ver-
tex in V2 , G is called a complete bipartite graph and denoted by Kn1 ,n2 where
n1 := |V1 | and n2 := |V2 |.
A directed graph is one where the edges E ⊆ V × V are 2-tuples of vertices.
We denote an edge by e = (u, v) where u is the “tail” of the edge e and v is
its “head”. The edge is “outgoing” at u and “incoming” at v. When the context
is clear, we sometimes use e = uv = {u, v} (undirected) or e = uv = (u, v)
(directed) to denote an edge.

2.8.1 Structures in graphs


A path in a graph is a nonempty subgraph with vertex set v0 , v1 , . . . , vk and
edges {v0 v1 , v1 v2 , . . . , vk−1 vk }. Since a path is defined to be a subgraph, vertices
cannot repeat in a path. A walk is a sequence of vertices connected by edges
without any restriction on repetition. A cycle in a graph is a nonempty sub-
graph with vertex set v0 , v1 , . . . , vk and edges {v0 v1 , v1 v2 , . . . , vk−1 vk , vk v0 }. A
graph is said to be connected if there is a path between every two vertices, and
disconnected otherwise.
A cut in a graph is a subset of edges such that if one removes them from the

graph, the graph is disconnected. For a graph G = (V, E) and s ≠ t ∈ V, an s − t


cut in a graph is a set of edges whose removal leaves no path between s and t.
A graph is said to be acyclic if it does not contain any cycle. A connected
acyclic graph is called a tree. A spanning tree of a graph G = (V, E) is a
subgraph that is a tree with vertex set V. Check that a spanning tree contains
exactly |V| − 1 edges.
An s − t flow in an undirected graph G = (V, E) for s ≠ t ∈ V is an assignment
f : E → R that satisfies the following properties. For all vertices u ∈ V \ {s, t},
we require that the “incoming” flow is equal to the “outgoing” flow:
$$\sum_{v \in V} f(v, u) = 0.$$

As a convention, we extend f from E to V × V as a skew-symmetric function


on the edges: if the edge e = uv, we let f (v, u) := − f (u, v).
A matching M in an undirected graph G = (V, E) is a subset M of edges
such that for every e1 , e2 ∈ M, e1 ∩ e2 = ∅. A matching is said to be perfect if
∪e∈M e = V.

2.8.2 Matrices associated to graphs


Two basic matrices associated with a simple, undirected graph G = (V, E),
indexed by its vertices, are its adjacency matrix A and its degree matrix D.
$$A_{u,v} := \begin{cases} 1 & \text{if } uv \in E, \\ 0 & \text{otherwise,} \end{cases} \qquad \text{and} \qquad D_{u,v} := \begin{cases} d_v & \text{if } u = v, \\ 0 & \text{otherwise.} \end{cases}$$
The adjacency matrix is symmetric. The graph Laplacian of G is defined to be

L := D − A.

Given an undirected graph G = (V, E), consider an arbitrary orientation of its


edges. Let B ∈ {−1, 0, 1}n×m be the matrix whose columns are indexed by the
edges and rows by the vertices of G, where the entry corresponding to (v, e) is
1 if the vertex v is the tail of the directed edge corresponding to e, is −1 if v is the
head of the directed edge e, and is zero otherwise. B is called the vertex-edge
incidence matrix of G. The Laplacian can now be expressed in terms of B.
While B depends on the choice of the directions to the edges, the Laplacian
does not.

Lemma 2.22 (Laplacian and incidence matrix) Let G be a simple, undi-


rected graph with (arbitrarily chosen) vertex-edge incidence matrix B. Then,
BB> = L.
Proof For the diagonal entries of BBᵀ we have $(BB^\top)_{v,v} = \sum_e B_{v,e} B_{v,e}$. The
terms are nonzero only for those edges e which are incident to v, in which
case the product is 1 and, hence, this sum gives the degree of vertex v in the
undirected graph. For the other entries, $(BB^\top)_{u,v} = \sum_e B_{u,e} B_{v,e}$. The product terms
are nonzero only when the edge e is shared by u and v, in which case the
product is −1. Hence, $(BB^\top)_{u,v} = -1$ for all u ≠ v with uv ∈ E. Hence, BBᵀ = L.
Observe that L1 = 0 where 1 is the all ones vector. Thus, the Laplacian is
not full rank. However, if G = (V, E) is connected then the Laplacian has rank
|V| − 1 (exercise) and, in the space orthogonal to the all ones vector, we can
define an inverse of the Laplacian and denote it by L+ .
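A small sketch verifying Lemma 2.22 and the rank statement on a concrete graph (a triangle, our own example), with an arbitrary orientation of the edges:

```python
import numpy as np

# Sketch: incidence matrix B of a triangle on vertices {0, 1, 2} with an
# arbitrary orientation; check BB^T = D - A = L, L1 = 0, and rank(L) = n - 1.
edges = [(0, 1), (1, 2), (0, 2)]
n = 3

B = np.zeros((n, len(edges)))
A = np.zeros((n, n))
for k, (u, v) in enumerate(edges):
    B[u, k], B[v, k] = 1.0, -1.0         # u is the tail, v is the head
    A[u, v] = A[v, u] = 1.0
D = np.diag(A.sum(axis=1))
L = D - A

assert np.allclose(B @ B.T, L)
assert np.allclose(L @ np.ones(n), 0)    # the all ones vector is in Ker(L)
assert np.linalg.matrix_rank(L) == n - 1 # the triangle is connected
```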
A weighted, undirected graph G = (V, E, w) has a function w : E → R>0
that gives weights to edges. For such graphs we define the Laplacian as
L := BW B> ,
where W is the diagonal |E| × |E| matrix with We,e = w(e) and 0 otherwise.

2.8.3 Polytopes associated to graphs


For a graph G = (V, E), let F ⊆ 2E denote a family of subsets of its edges.
For each S ∈ F , let 1S ∈ {0, 1}E denote the indicator vector of the set S .
Consider the polytope PF ⊆ [0, 1]E that is defined as the convex hull of the
vectors {1S }S ∈F . When F is the set of all spanning trees in G, the corresponding
polytope is called the spanning tree polytope of G. When F is the set of all
matchings in G, the corresponding polytope is called the matching polytope
of G.

2.9 Exercises
2.1 For each of the following functions, compute the gradient and the Hessian,
and write the second-order Taylor approximation.
(a) $f(x) = \sum_{i=1}^{m} (a_i^\top x - b_i)^2$, for $x \in \mathbb{Q}^n$, where $a_1, \ldots, a_m \in \mathbb{Q}^n$ and $b_1, \ldots, b_m \in \mathbb{Q}$.
(b) $f(x) = \log\left(\sum_{j=1}^{m} e^{\langle x, v_j \rangle}\right)$, where $v_1, \ldots, v_m \in \mathbb{Q}^n$.

(c) f (X) = Tr(AX) where A is a symmetric n × n matrix and X runs


over symmetric matrices.
(d) f (X) = − log det X, where X runs over positive definite matrices.
2.2 Prove Lemma 2.6.
2.3 Prove that the row rank of a matrix is equal to its column rank.
2.4 Prove that, given A ∈ Rn×n , it holds that λ is an eigenvalue of A if and
only if det(A − λI) = 0.
2.5 Given an n × n PD matrix H and a vector a ∈ Rn, prove that H ⪰ aaᵀ if
and only if 1 ≥ aᵀH⁻¹a.
2.6 Prove that if M is an n × n real symmetric matrix that is PD, then

M = BB>

where B is an n × n real matrix with linearly independent rows.


2.7 Prove that if A is an n×n real symmetric matrix then all of its eigenvalues
are real.
2.8 Prove that every eigenvalue of a PSD matrix is nonnegative.
2.9 Consider a real n × m matrix A with n ≤ m and a vector b ∈ Rm .
(a) Assume that A has full rank (n). Prove that AA> is an invertible
matrix.
(b) Prove that the nonzero eigenvalues of AA> and A> A are the same.
More precisely, prove that every nonzero eigenvalue of AA> is an
eigenvalue of A> A with the same multiplicity and vice versa.
(c) Let p(x) := arg min x∈Rn kA> x − bk22 . Assuming that A is of full
rank, derive a formula for p(x) in terms of A and b.
2.10 For A ∈ Rm×n , prove that its pseudoinverse always exists. Moreover,
prove that, if A has linearly independent columns, then

A+ = (A> A)−1 A>

and, if A has linearly independent rows, then

A+ = A> (AA> )−1 .



2.11 Prove that, for a function f : Rn → R, its differential D f(x) : Rn → R
at a point x ∈ Rn is a linear function.
2.12 Prove that $\|u\|_p := \left(\sum_{i=1}^{n} |u_i|^p\right)^{1/p}$ is a norm for p ≥ 1. Is it also a norm
for 0 < p < 1?
2.13 Given an n × n real PD matrix A and u ∈ Rn, show that
$$\|u\|_A := \sqrt{u^\top A u}$$
is a norm. What aspect of being a norm for ‖u‖_A breaks down when A
is just guaranteed to be PSD (and not PD)? And when A has negative
eigenvalues?
2.14 Prove that for all p, q ≥ 1, the norm ‖·‖_q is the dual norm of ‖·‖_p whenever
1/p + 1/q = 1.

2.15 Prove Theorem 2.18.


2.16 Prove Theorem 2.14.
2.17 Prove that the intersection of any family of closed sets in Rn is also
closed.
2.18 Prove that the closure of K ⊆ Rn is exactly the set of the limits of all
converging sequences of elements of K.
2.19 Find the solution to the following one-dimensional continuous time dy-
namical system
$$\frac{dx}{dt} = -\alpha x^2 \quad \text{with} \quad x(0) = \beta.$$
2.20 Let l₁ = L − 1 and l₂ = L for some L > 1. Fix h ∈ (0, 1). Consider the
discrete-time dynamical system below and its solution x₁⁽ᵏ⁾, x₂⁽ᵏ⁾. Prove
that this solution converges to a pair of numbers as k → ∞ and derive
the pair of numbers it converges to.
$$x_1^{(k+1)} = x_1^{(k)} \left( (1-h) + \frac{h \cdot l_2}{x_1^{(k)} l_2 + x_2^{(k)} l_1} \right),$$
$$x_2^{(k+1)} = x_2^{(k)} \left( (1-h) + \frac{h \cdot l_1}{x_1^{(k)} l_2 + x_2^{(k)} l_1} \right), \tag{2.2}$$
$$x_1^{(0)} = x_2^{(0)} = 1.$$

2.21 For a simple, connected, and undirected graph G = (V, E), let

Π = B> L+ B,

where B is any vertex-edge incidence matrix associated to G. Prove the


following:
(a) Π is symmetric.
(b) Π2 = Π.

(c) The eigenvalues of Π are all either 0 or 1.


(d) The rank of Π is |V| − 1.
(e) Let T be a spanning tree chosen uniformly at random from all
spanning trees in G. Prove that the probability that an edge e be-
longs to T is given by
Pr[e ∈ T ] = Π(e, e).

2.22 Let G = (V, E) (with n := |V| and m := |E|) be an undirected, connected


graph with a weight vector w ∈ RE . Consider the following algorithm
for finding a maximum weight spanning tree in G.
• Sort the edges in nondecreasing order:
w(e1 ) ≤ w(e2 ) ≤ · · · ≤ w(em ).

• Set T = E.
• For i = 1, 2, . . . , m
– If the graph (V, T \ {ei }) is connected, set T := T \ {ei }.
• Output T .
Is the algorithm correct? If yes prove its correctness, otherwise provide
a counterexample.
2.23 Let T = (V, E) be a tree on n vertices V and let L denote its Laplacian.
Given a vector b ∈ Rn , design an algorithm for solving linear systems
of the form Lx = b in time O(n). Hint: use Gaussian elimination but
carefully consider the order in which you eliminate variables.
2.24 Let G = (V, E, w) be a connected, weighted, undirected graph. Let n :=
|V| and m := |E|. We use LG to denote the corresponding Laplacian. For
any subgraph H of G we use the notation LH to denote its Laplacian.
(a) Let T = (V, F) be a connected subgraph of G. Let PT be any
square matrix satisfying
LT+ = PT P>T .
Prove that
x> P>T LG PT x ≥ x> x
for all x ∈ Rn satisfying hx, 1i = 0.
(b) Prove that
$$\mathrm{Tr}\!\left(L_T^+ L_G\right) = \sum_{e \in G} w_e \, b_e^\top L_T^+ b_e,$$

where be is the column of B corresponding to the edge e.



(c) Now let T = (V, F) be a spanning tree of G. For an edge e ∈ G,


write an explicit formula for b>e LT+ be .
(d) The weight of a spanning tree T = (V, F) of G = (V, E, w) is
defined as
X
w(T ) := we .
e∈F

Prove that if T is the maximum weight spanning tree of G, then
$\mathrm{Tr}\!\left(L_T^+ L_G\right) \leq m(n-1)$.
(e) Deduce that if T is a maximum weight spanning tree of G and P_T
is any matrix such that $P_T P_T^\top = L_T^+$, then the condition number of
the matrix $P_T L_G P_T^\top$ is at most m(n − 1). The condition number
of $P_T L_G P_T^\top$ is defined as
$$\frac{\lambda_n(P_T L_G P_T^\top)}{\lambda_2(P_T L_G P_T^\top)}.$$
Here $\lambda_n(P_T L_G P_T^\top)$ denotes the largest eigenvalue of $P_T L_G P_T^\top$ and
$\lambda_2(P_T L_G P_T^\top)$ is the smallest nonzero eigenvalue of $P_T L_G P_T^\top$. How
large can the condition number of L_G be?
2.25 Let G = (V, E) (with n := |V| and m := |E|) be a connected, undirected
graph and L be its Laplacian. Take two vertices s ≠ t ∈ V, let χ_st := e_s − e_t
(where ev is the standard basis vector for a vertex v) and let x ∈ Rn such
that Lx = χ st . Define the distance between two vertices s, t ∈ V as:
d(s, t) := x s − xt .
For s = t we set d(s, t) = 0. Let k be the value of the minimum s − t
cut, i.e., the minimum number of edges one needs to remove from G to
disconnect s from t. Prove that
$$k \leq \sqrt{\frac{m}{d(s, t)}}.$$
2.26 A matrix is said to be totally unimodular if each of its square submatri-
ces has determinant 0, +1, or −1. Prove that, for a graph G = (V, E), any
of its vertex-edge incidence matrices B is totally unimodular.
2.27 Prove that the matching polytope of a bipartite graph G = (V, E) can be
equivalently written as:
$$P_M(G) = \left\{ x \in \mathbb{R}^E : x \geq 0, \ \sum_{e : v \in e} x_e \leq 1 \ \forall v \in V \right\}.$$
Hint: Write P_M(G) = {x ∈ R^E : Ax ≤ b}, and prove that the resulting A is a
totally unimodular matrix.



Notes
The content presented in this chapter is classic and draws from several text-
books. For a first introduction to calculus, including a proof of the fundamental
theorem of calculus (Theorem 2.5), see the textbook by Apostol (1967a). For
an advanced discussion on multivariate calculus, the reader is referred to the
textbook by Apostol (1967b). For an introduction to real analysis (including
introductory topology), the reader is referred to the textbook by Rudin (1987).
Linear algebra and related topics are covered in great detail in the textbook by
Strang (2006). See also the paper by Strang (1993) for an accessible treatment
of the fundamental theorem of algebra. The Cauchy-Schwarz inequality has a
wide range of applications; see the book by Steele (2004). For a formal discus-
sion on dynamical systems, refer to the book by Perko (2001). The books by
Diestel (2012) and Schrijver (2002a) provide in-depth introductions to graph
theory and combinatorial optimization respectively.
3
Convexity

We introduce convex sets, notions of convexity, and show the power that comes along
with convexity: convex sets have separating hyperplanes, subgradients exist, and locally
optimal solutions of convex functions are globally optimal.

3.1 Convex sets


We start by introducing the notion of a convex set.
Definition 3.1 (Convex set) A set K ⊆ Rn is said to be convex if for every
pair of points x, y ∈ K and for every λ ∈ [0, 1] we have,
λx + (1 − λ)y ∈ K.
In other words, a set K ⊆ Rn is convex if for every two points in K, the line
segment connecting them is contained in K. Common examples of convex sets
are:
1. Hyperplanes: sets of the form
{x ∈ Rn : hh, xi = c}
for some h ∈ Rn and c ∈ R.
2. Halfspaces: sets of the form
{x ∈ Rn : hh, xi ≤ c}
for some h ∈ Rn and c ∈ R.
3. Polytopes: For a set of vectors X ⊆ Rn its convex hull conv(X) ⊆ Rn is
defined as the set of all convex combinations
$$\sum_{j=1}^{r} \alpha_j x_j$$


for x1, x2, . . . , xr ∈ X and α1, α2, . . . , αr ≥ 0 such that $\sum_{j=1}^{r} \alpha_j = 1$. When
the set X has a finite cardinality, conv(X) is called a polytope and is convex
by definition. X is said to be full-dimensional if the linear span of the
points defining it is all of Rn.
4. Polyhedra: sets of the form K = {x ∈ Rn : hai , xi ≤ bi for i = 1, 2, . . . , m},
where ai ∈ Rn and bi ∈ R for i = 1, 2, . . . , m. A polyhedron that is
bounded can be shown to be a polytope (Exercise 3.5).
5. ` p -balls: for p ≥ 1, sets of the form B p (a, 1) := {x ∈ Rn : kx − ak p ≤ 1},
where a ∈ Rn is a vector.
6. Ellipsoids: sets of the form {x ∈ Rn : x = T (B) + a} where B is the
n-dimensional unit `2 -ball centered at the origin, T : Rn → Rn is an
invertible linear transformation, and a ∈ Rn . This is seen to be equivalent
to {x ∈ Rn : (x − a)> A(x − a) ≤ 1} where A ∈ Rn×n is a PD matrix. When
a = 0, this set is also the same as the set of all x with kxkA ≤ 1.

3.2 Convex functions


We define convex functions and present two ways to characterize them which
apply depending on their smoothness.

Definition 3.2 (Convexity) A function f : K → R, defined over a convex set


K, is convex if ∀x, y ∈ K and λ ∈ [0, 1], we have

f (λx + (1 − λ)y) ≤ λ f (x) + (1 − λ) f (y). (3.1)

Examples of convex functions include:

1. Linear functions: f (x) = hc, xi for a vector c ∈ Rn .


2. Quadratic functions: f(x) = xᵀAx + bᵀx for a PSD matrix A ∈ R^{n×n} and a
vector b ∈ Rn.
3. Negative entropy function: f : [0, ∞) → R, f (x) = x log x.

Sometimes when working with convex functions f defined over a certain sub-
set K ⊆ Rn we extend them to the whole Rn by setting f(x) = +∞ for all
x ∉ K. One can check that f is then still convex (on Rn) when the arithmetic
operations on R ∪ {+∞} are interpreted in the only reasonable way. A function
f is said to be concave if − f is convex.
We now provide two different characterizations of convexity that might be
more convenient to apply in certain situations. They require additional smooth-
ness conditions on f (differentiability or twice-differentiability respectively).
Figure 3.1 The first-order convexity condition at the point x₀. Note that in this
one-dimensional case, the gradient of f is just its derivative.

Theorem 3.3 (First-order notion of convexity) A differentiable function


f : K → R over a convex set K is convex if and only if

f (y) ≥ f (x) + h∇ f (x), y − xi, ∀x, y ∈ K. (3.2)

In other words, any tangent to a convex function f lies below the function f
as illustrated in Figure 3.1. Similarly, any tangent to a concave function lies
above the function.

Proof Suppose f is convex as in Definition 3.2. Fix any x, y ∈ K. Then, from


(3.1), for every λ ∈ (0, 1) we have:

(1 − λ) f (x) + λ f (y) ≥ f ((1 − λ)x + λy) = f (x + λ(y − x)).

Subtracting (1 − λ) f (x) and dividing by λ yields


$$f(y) \geq f(x) + \frac{f(x + \lambda(y-x)) - f(x)}{\lambda}.$$
Taking the limit as λ → 0, the second term on the right converges to the direc-
tional derivative of f in the direction y − x, hence

f (y) ≥ f (x) + h∇ f (x), y − xi.



Conversely, suppose the function f satisfies (3.2). Fix x, y ∈ K and λ ∈ [0, 1].
Let z := λx + (1 − λ)y be some point in the convex hull. Note that the first-order
approximation of f around z underestimates both f (x) and f (y). Thus, the two
underestimates are:
f (x) ≥ f (z) + h∇ f (z), x − zi, (3.3)
f (y) ≥ f (z) + h∇ f (z), y − zi. (3.4)
Multiplying (3.3) by λ and (3.4) by (1 − λ), and summing both inequalities, we
obtain
λ f(x) + (1 − λ) f(y) ≥ f(z) + ⟨∇f(z), λx + (1 − λ)y − z⟩
= f(z) + ⟨∇f(z), 0⟩
= f(λx + (1 − λ)y).

A convex function can be shown to be continuous (Exercise 3.9), but it does


not have to be differentiable; take for example the function f (x) := |x|. In some
of the examples considered in this book, the function f will be convex but
may not be differentiable everywhere. In such cases, the following notion of a
subgradient turns out to be helpful.
Definition 3.4 (Subgradient) For a convex function f defined over a convex
set K, a vector v is said to be a subgradient of f at a point x ∈ K if for any
y∈K
f (y) ≥ f (x) + hv, y − xi.
The set of subgradients at x is denoted by ∂ f (x).
It follows from the definition that the set ∂ f (x) is always a convex set, even
when f is not convex (Exercise 3.10). In Section 3.3.2 we show that a subgra-
dient of a convex function at any point in its domain always exists, even when
it is not differentiable. Subgradients are useful, especially in linear or nons-
mooth optimization and Theorem 3.3 holds in the nondifferentiable case if we
replace ∇ f (x) with any v ∈ ∂ f (x).
Before we go into the second-order convexity conditions, we prove the
lemma below that is necessary for their proof.
Lemma 3.5 Let f : K → R be a continuously differentiable function over a
convex set K. The function f is convex if and only if for all x, y ∈ K
h∇ f (y) − ∇ f (x), y − xi ≥ 0. (3.5)

Proof Let f be convex. Then from Theorem 3.3, we have

f (x) ≥ f (y) + h∇ f (y), x − yi and f (y) ≥ f (x) + h∇ f (x), y − xi.

Summing both inequalities and rearranging yields (3.5).


Let us now assume that (3.5) holds for all x, y ∈ K. For λ ∈ [0, 1], let
xλ := x + λ(y − x). Then, since ∇ f is continuous, we can apply Lemma 2.6 to
obtain
$$\begin{aligned} f(y) &= f(x) + \int_0^1 \langle \nabla f(x + \lambda(y-x)), y - x \rangle \, d\lambda \\ &= f(x) + \langle \nabla f(x), y - x \rangle + \int_0^1 \langle \nabla f(x_\lambda) - \nabla f(x), y - x \rangle \, d\lambda \\ &= f(x) + \langle \nabla f(x), y - x \rangle + \int_0^1 \frac{1}{\lambda} \langle \nabla f(x_\lambda) - \nabla f(x), x_\lambda - x \rangle \, d\lambda \\ &\geq f(x) + \langle \nabla f(x), y - x \rangle. \end{aligned}$$

Here, the first equality comes from Lemma 2.6 and the last inequality comes
from the application of (3.5) in the integral.

The following second-order notion of convexity generalizes the second deriva-


tive test of convexity for the one-dimensional case that the readers might al-
ready be familiar with.

Theorem 3.6 (Second-order notion of convexity) Suppose K is convex and


open. If f : K → R is twice continuously differentiable, then it is convex if and
only if

$$\nabla^2 f(x) \succeq 0, \quad \forall x \in K.$$

Proof Suppose f : K → R is twice continuously differentiable and convex.


For any x ∈ K, and any s ∈ Rn , since K is open, there is some τ > 0 such that
xτ := x + τs ∈ K. Then, from Lemma 3.5 we have

1
0≤ h∇ f (xτ ) − ∇ f (x), xτ − xi
τ2
1 τ 2
Z
1
= h∇ f (xτ ) − ∇ f (x), si = h∇ f (x + λs)s, si dλ,
τ τ 0

where the last equality is from the second part of Lemma 2.6. We can conclude
the result by letting τ → 0.
Conversely, suppose that ∇²f(x) ⪰ 0 for all x ∈ K. Then, for any x, y ∈ K, we

have
$$\begin{aligned} f(y) &= f(x) + \int_0^1 \langle \nabla f(x + \lambda(y-x)), y - x \rangle \, d\lambda \\ &= f(x) + \langle \nabla f(x), y - x \rangle + \int_0^1 \langle \nabla f(x + \lambda(y-x)) - \nabla f(x), y - x \rangle \, d\lambda \\ &= f(x) + \langle \nabla f(x), y - x \rangle + \int_0^1 \!\! \int_0^\lambda \underbrace{(y-x)^\top \nabla^2 f(x + \tau(y-x))(y-x)}_{\geq 0} \, d\tau \, d\lambda \\ &\geq f(x) + \langle \nabla f(x), y - x \rangle. \end{aligned}$$

The first and third equality come from Lemma 2.6 and the last inequality uses
the fact that ∇2 f is PSD.
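As a sanity check on Theorem 3.6, one can verify numerically that the Hessian of a known convex function is PSD at sample points. A sketch for f(x) = log Σᵢ e^{xᵢ} (see Exercise 2.1(b)), whose Hessian is diag(p) − ppᵀ with pᵢ = e^{xᵢ}/Σⱼ e^{xⱼ}, a standard computation stated here without proof:

```python
import numpy as np

# Sketch: check that the Hessian of f(x) = log(sum_i exp(x_i)) is PSD at
# random points. With p = exp(x) / sum(exp(x)), the Hessian equals
# diag(p) - p p^T.
rng = np.random.default_rng(2)
for _ in range(100):
    x = rng.standard_normal(5)
    p = np.exp(x) / np.exp(x).sum()
    H = np.diag(p) - np.outer(p, p)
    assert np.linalg.eigvalsh(H).min() >= -1e-12  # PSD up to roundoff
```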

For some functions, the inequalities in the notions of convexity mentioned
above can be strict everywhere, and the following definitions capture this
phenomenon.

Definition 3.7 (Strict convexity) A function f : K → R, defined over a


convex set K, is said to be strictly convex if ∀x ≠ y ∈ K and λ ∈ (0, 1), we have

λ f (x) + (1 − λ) f (y) > f (λx + (1 − λ)y). (3.6)

It can be shown that if the function is differentiable then it is strictly convex if


and only if for all x ≠ y ∈ K

f (y) > f (x) + h∇ f (x), y − xi,

see Exercise 3.11. If the function is also twice differentiable and

$$\nabla^2 f(x) \succ 0, \quad \forall x \in K,$$

then f is strictly convex. The converse is not true, see Exercise 3.12.
We now introduce the notion of strong convexity that makes the notion of
strict convexity quantitative.

Definition 3.8 (Strong convexity) For σ > 0 and a norm ‖·‖, a differentiable
function f : K → R, defined over a convex set K, is said to be σ-strongly
convex with respect to ‖·‖ if
$$f(y) \geq f(x) + \langle \nabla f(x), y - x \rangle + \frac{\sigma}{2} \cdot \|y - x\|^2.$$
Note that a strongly convex function is also strictly convex, but not vice versa.
Both strict and strong convexity can be defined for nondifferentiable functions

if the gradient is replaced by the subgradient. If f is twice continuously differ-


entiable and k · k = k · k2 is the `2 -norm, then strong convexity is implied by the
condition
$$\nabla^2 f(x) \succeq \sigma I$$
for all x ∈ K. Intuitively, strong convexity means that there exists a quadratic
lower bound on
f (y) − ( f (x) + h∇ f (x), y − xi);
see Figure 3.2 for an illustration. The quantity f (y) − ( f (x) + h∇ f (x), y − xi) is
important enough to have a name.
Definition 3.9 (Bregman divergence) The Bregman divergence of a func-
tion f : K → R at u, w ∈ K is defined to be:
D f (u, w) := f (w) − ( f (u) + h∇ f (u), w − ui).
Note that, in general, the Bregman divergence is not symmetric in u and w, i.e.,
D f (u, w) is not necessarily equal to D f (w, u) (Exercise 3.17(d)).
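A short sketch (the two functions and the sample points are our own illustrative choices): for f(x) = ‖x‖₂² the Bregman divergence is the symmetric quantity ‖w − u‖₂², while for the negative entropy it is generally asymmetric.

```python
import numpy as np

# Sketch: Bregman divergences D_f(u, w) = f(w) - f(u) - <grad f(u), w - u>.
def bregman(f, grad, u, w):
    return f(w) - f(u) - grad(u) @ (w - u)

u, w = np.array([0.2, 0.8]), np.array([0.5, 0.5])

sq = lambda x: x @ x                       # f(x) = ||x||_2^2
assert np.isclose(bregman(sq, lambda x: 2 * x, u, w), sq(w - u))  # symmetric

ent = lambda x: np.sum(x * np.log(x))      # negative entropy
grad_ent = lambda x: np.log(x) + 1
print(bregman(ent, grad_ent, u, w),        # ~0.2231
      bregman(ent, grad_ent, w, u))        # ~0.1927: asymmetric in u, w
```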

Figure 3.2 Illustration of strong convexity: the gap between f(y) and the first-order
approximation f(x) + ⟨∇f(x), y − x⟩ is at least (σ/2) · ‖y − x‖².



3.3 The usefulness of convexity


In this section we prove some fundamental results about convex sets and con-
vex functions. For instance, one can always “separate” a point that is not in a
convex set from the set by a “simple” object – a hyperplane. This enables us
to provide a “certificate” of when a given point is not in a convex set. And, for
a convex function, a local minimum also has to be a global minimum. These
properties illustrate why convexity is so useful and play a key role in the design
of algorithms for convex optimization.

3.3.1 Separating and supporting hyperplanes for convex sets


For some h ∈ Rn and c ∈ R, let

H := {x ∈ Rn : hh, xi = c}

denote the hyperplane defined by h and c. Recall that for a set K ⊆ Rn , ∂K


denotes its boundary.

Definition 3.10 (Separating and supporting hyperplane) Suppose K ⊆ Rn


is convex. A hyperplane H defined by h and c is said to separate y ∈ Rn from
K if
hh, xi ≤ c (3.7)

for all x ∈ K, but


hh, yi > c.

If y ∈ ∂K, then H is said to be a supporting hyperplane for K containing y if


(3.7) holds and
hh, yi = c.

The first consequence of convexity is that for a closed convex set, a “proof” of
non-membership of a point – a separating hyperplane – always exists.

Theorem 3.11 (Separating and supporting hyperplane theorem) Let K ⊆


Rn be a nonempty, closed, and convex set. Then, given a y ∈ Rn \ K, there is
an h ∈ Rn \ {0} such that

∀x ∈ K hh, xi < hh, yi .

If y ∈ ∂K, then there exists a supporting hyperplane containing y.



Proof Let x⋆ be the unique point in K that minimizes the Euclidean distance
to y, i.e.,
$$x^\star := \operatorname*{argmin}_{x \in K} \|x - y\|.$$
The existence of such a minimizer requires that K is nonempty, closed, and
convex (Exercise 3.18). Since K is convex, the point x_t := (1 − t)x⋆ + tx is in
K for all t ∈ [0, 1] and for all x ∈ K. Thus, by optimality of x⋆,
$$\|x_t - y\|^2 \geq \|x^\star - y\|^2.$$
On the other hand, for small enough t,
$$\|(1-t)x^\star + tx - y\|^2 = \|x^\star - y + t(x - x^\star)\|^2 = \|x^\star - y\|^2 + 2t\langle x^\star - y, x - x^\star \rangle + O(t^2).$$
Taking the limit t → 0, we obtain
$$\langle x^\star - y, x - x^\star \rangle \geq 0 \quad \text{for all } x \in K. \tag{3.8}$$
Now, we can let
$$h := y - x^\star.$$
Note that, in the case y ∉ K, x⋆ ≠ y and, hence, h ≠ 0. Moreover, observe that
for any x ∈ K,
$$\langle h, x \rangle \overset{(3.8)}{\leq} \langle h, x^\star \rangle < \langle h, y \rangle,$$
where the last strict inequality follows from the fact that
$$\langle h, x^\star \rangle - \langle h, y \rangle = -\|h\|^2 < 0.$$
Finally, note that if y ∈ ∂K, then x⋆ = y and the hyperplane corresponding
to h and ⟨h, y⟩ is a supporting hyperplane for K at y.
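The proof is constructive, and for simple sets the separating hyperplane can be computed directly. A sketch for K the Euclidean unit ball (our own example), where the minimizer has the closed form x⋆ = y/‖y‖:

```python
import numpy as np

# Sketch: separating hyperplane for K = {x : ||x||_2 <= 1} and a point y
# outside K. The closest point is x_star = y / ||y||, and h = y - x_star.
y = np.array([2.0, 1.0])
x_star = y / np.linalg.norm(y)
h = y - x_star

# Check <h, x> <= <h, x_star> < <h, y> on random points x of K.
rng = np.random.default_rng(3)
for _ in range(1000):
    x = rng.standard_normal(2)
    x = x / max(1.0, np.linalg.norm(x))    # force x into the unit ball K
    assert h @ x <= h @ x_star + 1e-12 < h @ y
```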

3.3.2 Existence of subgradients


Theorem 3.11 can be used to show the existence of subgradients (see Definition
3.4) of a convex function.
Theorem 3.12 (Existence of subgradients) For a convex function f : K →
R, and an x in the relative interior of a convex set K ⊆ Rn, there always exists
a vector v such that for any y ∈ K
f(y) ≥ f(x) + ⟨v, y − x⟩.
In other words, ∂f(x) ≠ ∅.

The proof uses the notion of the epigraph of a function f (denoted by epi f ):

epi f := {(x, y) ∈ K × R : y ≥ f (x)}.

It is an exercise (Exercise 3.3) to prove that, for a function f : K → R where


K ⊆ Rn is convex, f is convex if and only if epi f is convex. Moreover, epi f is
a closed and nonempty set.

Proof of Theorem 3.12 For a given x in the relative interior of K, consider


the point (x, f (x)) that lies on the boundary of the convex set epi f. Thus, from
the “supporting hyperplane” part of Theorem 3.11, we obtain that there is a
nonzero vector (h₁, . . . , hₙ, hₙ₊₁) ∈ R^{n+1} such that
$$\sum_{i=1}^{n} h_i z_i + h_{n+1} y \leq \sum_{i=1}^{n} h_i x_i + h_{n+1} f(x)$$

for all (z, y) ∈ epi f. Rearranging, we obtain


$$\sum_{i=1}^{n} h_i (z_i - x_i) + h_{n+1} (y - f(x)) \leq 0 \tag{3.9}$$

for all (z, y) ∈ epi f. Plugging in z = x and y ≥ f (x) as a point in epi f, we


deduce that hn+1 ≤ 0. Since (z, f (z)) ∈ epi f , from (3.9) it follows that
$$\sum_{i=1}^{n} h_i (z_i - x_i) + h_{n+1} (f(z) - f(x)) \leq 0.$$

If hₙ₊₁ ≠ 0, then by dividing the above by hₙ₊₁ (and reversing the sign since
hₙ₊₁ < 0), we obtain
$$f(z) \geq f(x) - \frac{1}{h_{n+1}} \langle h, z - x \rangle,$$
where h := (h₁, . . . , hₙ). This establishes that $v := -\frac{h}{h_{n+1}} \in \partial f(x)$. On the other
hand, if hₙ₊₁ = 0, then we know that h ≠ 0 and

hh, z − xi ≤ 0

for all z ∈ K. This implies that

hh, zi ≤ hh, xi

for all z ∈ K. However, x is in the relative interior of K and, hence, the above
cannot be true. Thus, hn+1 must be nonzero and we deduce the theorem.

Figure 3.3 A nonconvex function for which ∇f(x) = 0 at the points x₁, x₂, x₃.

3.3.3 Local optima of convex functions are globally optimum


For simplicity, let us assume that we are in the unconstrained setting, i.e., K =
Rn . In general, for a function f : Rn → R, the gradient can be zero at many
points; see Figure 3.3. However, if f is a convex and differentiable function,
this cannot happen.

Theorem 3.13 (A local optimum of a convex function is a global optimum)


If f : Rn → R is a convex and differentiable function, then for a given x ∈ Rn

f (x) ≤ f (y) ∀y ∈ Rn

if and only if

∇ f (x) = 0.

Proof We start by showing that if x is a global minimizer of f , i.e., satisfies


the first condition of the theorem, then ∇ f (x) = 0. Indeed, for every vector
v ∈ Rn and every t ∈ R we have

f (x + tv) ≥ f (x)

and, hence,
$$0 \leq \lim_{t \to 0} \frac{f(x + tv) - f(x)}{t} = \langle \nabla f(x), v \rangle.$$
Since v ∈ Rn is chosen arbitrarily, if we let v := −∇ f (x), we deduce that
∇ f (x) = 0.
We proceed with the proof in the opposite direction: if ∇ f (x) = 0 for some
x, then x is a minimizer for f . First-order convexity (Theorem 3.3) of f implies

that ∀y ∈ Rn
f (y) ≥ f (x) + h∇ f (x), y − xi
= f (x) + h0, y − xi
= f (x).
Hence, if ∇ f (x) = 0, then f (y) ≥ f (x) ∀y ∈ Rn , proving the claim.

Note that, in the constrained setting (i.e., K , Rn ), Theorem 3.13 does not
necessarily hold. However, one can prove the following generalization that
holds for all convex functions, differentiable or not.
Theorem 3.14 (Optimality in the constrained setting) If f : K → R is a
convex and differentiable function defined on a convex set K, then
f (x) ≤ f (y) ∀y ∈ K
if and only if
h∇ f (x), y − xi ≥ 0 ∀y ∈ K.
The proof of this theorem is left for the reader (Exercise 3.19). We conclude
by noting that if ∇f(x) ≠ 0, then −∇f(x) gives us a supporting hyperplane for
∂K at the point x.

3.4 Exercises
3.1 Is the intersection of two convex sets convex? Is the union of two convex
sets convex?
3.2 Prove that the following sets are convex.

(a) Polyhedra: sets of the form K = {x ∈ Rn : hai , xi ≤ bi for i =


1, 2, . . . , m}, where ai ∈ Rn and bi ∈ R for i = 1, 2, . . . , m.
(b) Ellipsoids: sets of the form K = {x ∈ Rn : x> Ax ≤ 1} where
A ∈ Rn×n is a PD matrix.
(c) Unit balls in ` p -norms for p ≥ 1: B p (a, 1) := {x ∈ Rn : kx − ak p ≤
1}, where a ∈ Rn is a vector.
3.3 Prove that a function f : K → R, where K ⊆ Rn is convex, is convex if
and only if epi f is convex.
3.4 For two sets S , T ⊆ Rn define their Minkowski sum to be

S + T := {x + y : x ∈ S , y ∈ T }.

Prove that if S and T are convex, then so is their Minkowski sum.


3.5 Prove that if a polyhedron K ⊆ Rn is bounded, i.e., K is contained in an
ℓ₂-ball of radius r < ∞, then it can be written as a convex hull of finitely
many points, i.e., K is also a polytope.
3.6 Prove that the set of n × n real PSD matrices is convex.
3.7 Prove that if f : Rn → R is convex, A ∈ Rn×m and b ∈ Rn then g(y) =
f (Ay + b) is also convex.
3.8 Prove that the functions in Exercise 2.1 are convex.
3.9 Prove that a convex function is continuous.
3.10 Prove that the set ∂ f (x) is always a convex set, even when f is not con-
vex.
3.11 Prove that a differentiable function f : K → R is strictly convex if and
only if for all x ≠ y ∈ K

f (y) > f (x) + h∇ f (x), y − xi.

3.12 Prove that if a function f : K → R has a continuous and PD Hessian,


then it is strictly convex. Prove that the converse is false: the univariate
function f (x) := x4 is strictly convex but its second derivative can be 0.
3.13 Prove that a strictly convex function defined on an open convex set can
have at most one minimum.

3.14 Consider a convex function f : K → R. Let λ1 , . . . , λn be a set of n


nonnegative real numbers that sum to 1. Prove that for all x₁, . . . , xₙ ∈ K,
$$f\left(\sum_{i=1}^{n} \lambda_i x_i\right) \leq \sum_{i=1}^{n} \lambda_i f(x_i).$$

3.15 Let B ∈ R^{n×n} be a symmetric matrix. Consider the function f(x) = xᵀBx.

Write the gradient and the Hessian of f (x). When is f convex and when
is it concave?
3.16 Prove that if A is an n×n PD matrix with smallest eigenvalue λ1 , then the
function f (x) = x> Ax is λ1 -strongly convex with respect to the `2 -norm.
3.17 Consider the generalized negative entropy function $f(x) = \sum_{i=1}^{n} (x_i \log x_i - x_i)$ over $\mathbb{R}^n_{>0}$.
(a) Write the gradient and Hessian of f .
(b) Prove f is strictly convex.
(c) Prove that f is not strongly convex with respect to the `2 -norm.
(d) Write the Bregman divergence D f . Is D f (x, y) = D f (y, x) for all
x, y ∈ Rn>0 ?
(e) Prove that f is 1-strongly convex with respect to the ℓ₁-norm when
restricted to points in the subdomain $\{x \in \mathbb{R}^n_{>0} : \sum_{i=1}^{n} x_i = 1\}$.

3.18 Prove that for a nonempty, closed, and convex set K, argmin x∈K kx − yk
exists.
3.19 Prove Theorem 3.14.

Notes
The study of convexity is central to a variety of mathematical, scientific, and
engineering disciplines. For a detailed discussion on convex sets and convex
functions, the reader is referred to the classic textbook by Rockafellar (1970).
The textbook by Boyd and Vandenberghe (2004) also provides a comprehen-
sive (and modern) treatment of convex analysis. The book by Barvinok (2002)
provides an introduction to convexity with various mathematical applications,
e.g., to lattices. The book by Krantz (2014) introduces analytical (real and
complex) tools for studying convexity. Many results in this book hold in the
nondifferentiable setting by resorting to subgradients; see the book by Boyd
and Vandenberghe (2004) for more on the calculus of subgradients.
4
Convex Optimization and Efficiency

We present the notion of convex optimization and discuss formally what it means to
solve a convex program efficiently as a function of the representation length of the
input and the desired accuracy.

4.1 Convex programs


The central object of study is the following class of optimization problems.
Definition 4.1 (Convex program) Given a convex set K ⊆ Rn and a convex
function f : K → R, a convex program is the following optimization problem:
$$\inf_{x \in K} f(x). \tag{4.1}$$

When f is concave, then maximizing f while constrained to be in K is also


referred to as a convex program. We say that a convex program is uncon-
strained when K = Rn , i.e., when we are optimizing over all inputs, and we
call it constrained when the set K is a strict subset of Rn . Further, when f
is differentiable with a continuous derivative, we call (4.1) a smooth convex
program and nonsmooth otherwise. Many of the functions we consider have
higher-order smoothness properties but sometimes we encounter functions that
are not smooth in their domains.
Remark 4.2 (Minimum vs. infimum) Consider the case of convex optimiza-
tion where f (x) = 1/x and K = (0, ∞). In this case,
$$\inf\left\{ \frac{1}{x} : x \in (0, \infty) \right\} = 0$$
and there is no x that attains this value. However, if K ⊆ Rn is closed and
bounded (compact) then we are assured that the infimum value is attained by


a point x ∈ K. In almost all cases presented in this book we are interested in


the latter case. Even when K = Rn is unbounded, the algorithms developed
to solve this optimization problem work in some closed and bounded subset.
Hence, often we ignore this distinction and use minimum instead of infimum.
A similar comment applies to supremum vs. maximum.

4.1.1 Examples of convex programs


Several important problems can be encoded as convex programs. Here are two
prominent examples.

System of linear equations. Suppose we are given the task to find a solution
to a system of linear equations Ax = b where A ∈ R^{m×n} and b ∈ Rᵐ. The tradi-
tional method to solve this problem is via Gaussian elimination. Interestingly,
the problem of solving a system of linear equations can also be formulated as
the following convex program:
$$\min_{x \in \mathbb{R}^n} \|Ax - b\|_2^2.$$

The objective function is indeed convex, as
$$f(x) = x^\top A^\top A x - 2b^\top A x + b^\top b$$
(why?), from which it can be easily calculated that
$$\nabla^2 f(x) = 2A^\top A \succeq 0;$$
hence, by Theorem 3.6, the function f is convex. Therefore, solving such a
convex program can lead to the solution of a system of equations.
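Concretely, setting the gradient 2Aᵀ(Ax − b) to zero yields the normal equations AᵀAx = Aᵀb, so this convex program can be solved by linear algebra. A minimal sketch (the system is our own small example):

```python
import numpy as np

# Sketch: solve Ax = b by minimizing ||Ax - b||_2^2 via the normal
# equations A^T A x = A^T b.
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([3.0, 5.0])

x = np.linalg.solve(A.T @ A, A.T @ b)    # minimizer of the convex program
assert np.allclose(A @ x, b)             # it solves the original system
```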

Linear programming. Linear programming is the problem of optimizing a


linear objective function subject to being inside a polyhedron. A variety of
discrete problems that appear in computer science can be represented as lin-
ear programs, for instance, finding the shortest path between two vertices in a
graph, or finding the maximum flow in a graph. One way to represent a linear
program is as follows: Given A ∈ Rn×m , b ∈ Rn , c ∈ Rm :
$$\min_{x \in \mathbb{R}^m} \ \langle c, x \rangle \quad \text{subject to} \quad Ax = b, \quad x \geq 0. \tag{4.2}$$
The objective function hc, xi is a linear function which is (trivially) convex.
The set of points x satisfying Ax = b, x ≥ 0, is a polyhedron, which is also a

convex set. We now illustrate how the problem of finding the shortest path in a
graph between two nodes can be encoded as a linear program.

Shortest path in a graph. Given a directed graph G = (V, E) with a “source”


vertex s, a “target” vertex t, and a nonnegative “weight” wi j for each edge
(i, j) ∈ E, consider the following program with variables xi j :
$$\begin{aligned} \min_{x \in \mathbb{R}^E} \quad & \sum_{ij \in E} w_{ij} x_{ij} \\ \text{subject to} \quad & x \geq 0, \\ & \forall i \in V, \quad \sum_{j \in V} x_{ij} - \sum_{j \in V} x_{ji} = \begin{cases} 1 & \text{if } i = s, \\ -1 & \text{if } i = t, \\ 0 & \text{otherwise.} \end{cases} \end{aligned}$$

The intuition behind this formulation is that xi j is an indicator variable for


whether the edge (i, j) is part of the minimum weight, or shortest path: 1 when
it is, and 0 if it is not. We wish to select the set of edges with minimum weight,
subject to the constraint that this set forms a walk from s to t (represented by
the equality constraint: for all vertices except s and t the number of incoming
and outgoing edges must be equal).
The equality condition can be rewritten in the form Bx = b, where B is the
edge incidence matrix. In particular, if e = (i, j) is the kth edge, then the kth
column of B has value 1 in the ith row, −1 in the jth row, and 0s elsewhere.
b ∈ R^V has value 1 in the sth row, −1 in the tth row, and 0s elsewhere.
Note that the solution to the above linear program might be “fractional”, i.e.,
might not give true paths. However, one can still prove that the optimal value
is exactly the length of the shortest path, and one can recover the shortest path
from an optimal solution to the above linear program.
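A sketch of this linear program on a toy instance (the graph, weights, and use of scipy's LP solver are our own choices): three vertices with edges 0 → 1 and 1 → 2 of weight 1 and a direct edge 0 → 2 of weight 3; the optimum is the path 0 → 1 → 2 of total weight 2.

```python
import numpy as np
from scipy.optimize import linprog

# Sketch: shortest path LP with s = 0, t = 2 on a small directed graph.
edges = [(0, 1), (1, 2), (0, 2)]
w = np.array([1.0, 1.0, 3.0])
n, s, t = 3, 0, 2

A_eq = np.zeros((n, len(edges)))
for k, (i, j) in enumerate(edges):
    A_eq[i, k] += 1.0                    # x_ij is "outgoing" at i
    A_eq[j, k] -= 1.0                    # and "incoming" at j
b_eq = np.zeros(n)
b_eq[s], b_eq[t] = 1.0, -1.0

res = linprog(w, A_eq=A_eq, b_eq=b_eq)   # default bounds enforce x >= 0
print(res.fun, res.x)                    # 2.0, x = (1, 1, 0)
```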

4.2 Computational models


Towards developing algorithms to solve convex programs of the kind (4.1), we
need to specify a model of computation, explain how we can input a convex set
or a function, and what it means for an algorithm to be efficient. The following
is by no means a substitute for a background on theory of computing but just
a refresher and a reminder that in this book we care about the running time of
algorithms from this point of view.
As the model of computation, we first quickly recall the standard notion of a
Turing machine (TM) that has a single one-directional infinite tape and works

with a finite-sized alphabet and transition function. The “head” of a TM that


implements its transition function encodes the algorithm. A computation on
a TM is initialized with a binary input σ ∈ {0, 1}* on its tape and terminates
with an output τ ∈ {0, 1}* written on it. We measure the running time of such a
TM as the maximum number of steps performed on an input of a certain length
n. Here are some important variations of this standard model which will also
be useful:

• Randomized TMs. In this model, the TM has access to an additional tape


which is an infinite stream of independent and uniformly distributed ran-
dom bits. We then often allow such machines to be sometimes incorrect
(in what they output as a solution), but we require that the probability of
providing a correct answer is, say, at least 0.51.1
• TMs with oracles. For some applications, it is convenient to consider TMs
that have access to a black box primitive (called an oracle) for answering
certain questions or computing certain functions. Formally, such a TM has
an additional tape on which it can write a particular input and query the
oracle, which then writes the answer to such a query on the same tape. If
the size of the output of such an oracle is always polynomially bounded as
a function of the input query length, we call it a polynomial oracle.

As it is standard in computational complexity, our notion of efficiency is poly-


nomial time computability, i.e., we would like to design Turing machines (al-
gorithms) which solve a given problem and have (randomized) polynomial
running time. We will see that for several convex programs this goal is possi-
ble to achieve, while for others it might be hard or even impossible. The reader
is referred to the notes for references on computational complexity and time
complexity classes such as P and its randomized variants such as RP and BPP.

RAM Model. Oftentimes, we (unrealistically) consider the case when each


arithmetic operation (addition, multiplication, subtraction, checking equality,
etc.) takes exactly 1 time step. This is sometimes called the RAM model of
computation. On a TM, adding two numbers takes time proportional to the
number of bits required to represent them. If the numbers encountered in the
algorithm remain polynomially bounded, this causes only a poly-logarithmic
overhead and we do not worry about it. However, sometimes, the numbers to
be added/multiplied are themselves produced by the algorithm and we have to be
careful.
1 In fact, any number strictly more than 0.5 suffices.

In subsequent sections we analyze several natural computational questions
one might ask regarding convex sets and convex functions, and formalize further
what kind of input we should provide and what type of solution we are
really looking for.

4.3 Membership problem for convex sets


Perhaps the simplest computational problem one can consider, that is relevant
to convex optimization is the membership problem in convex sets:

Given a point x ∈ Rn and K ⊆ Rn , is x ∈ K?

We now examine some specific instances of this question and understand how
to specify the input (x, K) to the algorithm. While for x, we can imagine writing
it down in binary (if it is rational), K is an infinite set and we need to work
with some kind of finite representation. In this process, we introduce several
important concepts which are useful later on.

Example: Halfspaces. Let

K := {y ∈ Rn : ha, yi ≤ b}

where a ∈ Rn is a vector and b ∈ R is a number. K then represents a halfspace


in the Euclidean space Rn . For a given x ∈ Rn , checking whether x ∈ K just
boils down to verifying whether the inequality ha, xi ≤ b is satisfied or not.
Computing ha, xi requires O(n) arithmetic operations and, hence, is computa-
tionally efficient.
Let us now try to formalize what an input for an algorithm that checks
whether x ∈ K should be in this case. We need to provide x, a and b to this
algorithm. For this we need to represent these objects exactly using finitely
many bits. Hence, we assume that these are rational numbers, i.e., x ∈ Qn ,
a ∈ Qn and b ∈ Q. If we want to provide a rational number y ∈ Q as input,
we represent it as an irreducible fraction y = y₁/y₂ for y₁, y₂ ∈ Z and give these
numbers in binary. The bit complexity L(z) of an integer z ∈ Z is defined to
be the number of bits required to store its binary representation, thus

L(z) := 1 + dlog(|z| + 1)e

and we overload this notation to rational numbers by defining


!
y1
L(y) := L := L(y1 ) + L(y2 ).
y2
52 Convex Optimization and Efficiency

Further, if x ∈ Qn is a vector of rationals we define

L(x) := L(x1 ) + L(x2 ) + · · · + L(xn ).

And finally, for inputs that are collections of rational vectors, numbers, matri-
ces, etc., we set
L(x, a, b) := L(x) + L(a) + L(b),

i.e., the total bit complexity of all numbers involved.


Given this definition, one can now verify that checking whether x ∈ K, for
a halfspace K can be done in polynomial time with respect to the input length,
i.e., the bit complexity L(x, a, b). It is important to note that the arithmetic op-
erations involved in checking this are multiplication, addition, and comparison
and all of the operations can be implemented efficiently (with respect to the bit
lengths of the inputs) on a TM.
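A sketch of these definitions in code (the halfspace and query point are our own examples; exact rational arithmetic keeps the computation faithful to the bit-complexity model):

```python
import math
from fractions import Fraction

# Sketch: bit complexity L as defined above, and halfspace membership.
def L_int(z: int) -> int:
    return 1 + math.ceil(math.log2(abs(z) + 1))

def L_rat(y: Fraction) -> int:
    return L_int(y.numerator) + L_int(y.denominator)

a = [Fraction(1, 3), Fraction(-2)]
b = Fraction(5, 7)
x = [Fraction(2), Fraction(1, 2)]

# Membership of x in {y : <a, y> <= b} is one inner product and a
# comparison, i.e., polynomial time in the bit complexity L(x, a, b).
print(sum(ai * xi for ai, xi in zip(a, x)) <= b)   # True here
print(sum(L_rat(q) for q in a + [b] + x))          # total bit complexity
```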

Example: Ellipsoids. Consider that case when K ⊆ Rn is an ellipsoid, i.e., a


set of the form
K := {y ∈ Rn : y> Ay ≤ 1}

for a PD matrix A ∈ Qn×n . As before, the task of checking whether x ∈ K,


for an x ∈ Qn can be performed efficiently: indeed, computing x> Ax requires
O(n2 ) arithmetic operations (multiplications and additions) and, thus, can be
performed in polynomial time with respect to the bit complexity L(x, A) of the
input.

Example: Intersections and polyhedra. Note that being able to answer mem-
bership questions about two sets K1 , K2 ⊆ Rn allows us to answer questions
about their intersection K1 ∩ K2 . For instance, consider a polyhedron, i.e., a set
of the form

K := {y ∈ Rn : hai , yi ≤ bi for i = 1, 2, . . . , m},

where ai ∈ Qn and bi ∈ Q for i = 1, 2, . . . , m. Such a K is an intersection of


m halfspaces, each of which is easy to deal with computationally. Determining
whether x ∈ K reduces to m such questions and, hence, is easy as well: it can be
answered in polynomial time with respect to the bit complexity

L(x, a1 , b1 , a2 , b2 , . . . , am , bm ).

Example: `1 -ball. Consider the case where

K := {y ∈ Rn : kyk1 ≤ r}

for r ∈ Q. Interestingly, K falls into the category of polyhedra: it can be written


as
K = {y ∈ Rn : hy, si ≤ r, for all s ∈ {−1, 1}n },

and one can argue that none of the 2n (exponentially many) different linear
inequalities is redundant. Thus, using the method from the previous example,
our algorithm would not be efficient. However, in this case one can just check
whether
$$\sum_{i=1}^{n} |x_i| \leq r.$$

This requires only O(n) arithmetic operations and can be efficiently imple-
mented on a TM.

Example: Polytopes. Recall that, by definition, a polytope K ⊆ Rn is a con-


vex hull of finitely many points v1 , . . . , vN ∈ Qn for some integer N. We know
from Exercise 3.5 that a bounded polyhedron is also a polytope. We can choose
the above points so that each vi ∈ K is also a vertex of K. A vertex of a poly-
tope K is a point x ∈ K which does not belong to conv(K \ {x}). We can show
that if the inequalities and equalities generating the polyhedron contain ratio-
nal coefficients, then all vertices of K have a rational description: vi ∈ Qn , and
the description of each vi is polynomial in the bit complexity of the defining
inequalities. Thus, we can specify a polytope by specifying the list of inequal-
ities and equalities it is supposed to satisfy, or as a list of its vertices. The latter
representation, however, can often be much larger. Consider the case of the
hypercube [0, 1]n : it has 2n vertices but can be described by 2n inequalities,
each of which requires O(n) bits. The membership problem for polytopes can
be solved in polynomial time for both representations. In fact, the linear pro-
gramming problem has a trivial polynomial time algorithm if one is given the
list of vertices describing the polytope: compute hc, vi i for each i ∈ [N] and
output the vertex with the least cost. For this reason, the polyhedral descrip-
tion, which can be much more succinct, is often preferred. Indeed, developing
a polynomial time algorithm for the linear programming problem in this model
was a major development and is presented in later parts of the book.

Example: Spanning tree polytope. Given an undirected and simple (no re-
peated edge or loop) graph G = (V, E) with n := |V| and m := |E| ≤ n(n − 1)/2,

let T ⊆ 2^{[m]} denote the set of all spanning trees in G. A spanning tree in a
graph is a subset of edges that contains no cycles and connects all the vertices.
Define the spanning tree polytope corresponding to G,
PG ⊆ Rm , as follows:
PG := conv{1T : T ∈ T },
where conv denotes the convex hull of a set of vectors and 1T ∈ Rm is the
indicator vector of a set T ⊆ [m].2 In applications, the input is just the graph
G = (V, E) whose size is O(n + m). PG can have an exponential (in n) number of
vertices. Further, though it is not easy to show, the number of inequalities needed
to describe PG can also be exponential in n. Nevertheless, given x ∈ Qm
and the graph G, checking whether x ∈ PG can be done in polynomial time in
the number of bits needed to represent x and G, and is discussed more in the
next section.

Example: PSD matrices. Consider now K ⊆ Rn×n to be the set of all sym-
metric PSD matrices. Is it algorithmically easy to check whether X ∈ K? In
Exercise 3.6 we proved that K is convex. Recall that for a symmetric matrix
X ∈ Qn×n
X ∈ K ⇔ ∀y ∈ Rn y> Xy ≥ 0.
Thus, K is defined as an intersection of infinitely many halfspaces – one for
every vector y ∈ Rn . Clearly, it is not possible to go over all such y and check
them one by one, thus a different approach for checking if X ∈ K is required.
Recall that by λ1 (X) we denote the smallest eigenvalue of X (note that because
of symmetry, all its eigenvalues are real). It holds that
X ∈ K ⇔ λ1 (X) ≥ 0.
Recall from basic linear algebra that eigenvalues of X are roots of the poly-
nomial det(λI − X). Hence, one can just try to compute all of them and check
if they are nonnegative. However, computing roots of a polynomial is nontriv-
ial. There are efficient procedures to compute eigenvalues of matrices, but they
are all approximate, since eigenvalues are typically irrational numbers and, thus,
cannot be represented exactly in binary. More precisely, there are algorithms to
compute a sequence of numbers λ′1 , λ′2 , . . . , λ′n such that |λi − λ′i | < ε, where
λ1 , λ2 , . . . , λn are the eigenvalues of X and ε ∈ (0, 1) is the specified precision.
Such algorithms can be made to run in time polynomial in
the bit complexity L(X) and log(1/ε). Consequently, we can check whether X ∈ K
“approximately”. This kind of approximation often suffices in applications.
2 Recall that the indicator vector 1T ∈ {0, 1}m of a set T ⊆ [m] is defined as 1T (i) = 1 for all
i ∈ T and 1T (i) = 0 otherwise.

Figure 4.1 A convex set K, a point x ∉ K, and a hyperplane {y : ⟨a, y⟩ = b}
separating x from K.

4.3.1 Separation oracles for convex sets


When considering the question of whether a point x is in a set K, in the case
when the answer is negative one might want to give a “certificate” for that,
i.e., a proof that x is outside of K. It turns out that if K is a convex set, such
a certificate always exists and is given by a hyperplane separating x from K
(see Figure 4.1). More precisely, Theorem 3.11 in Chapter 3 asserts that for
any convex and closed set K ⊆ Rn and x ∈ Rn \ K, there exists a hyperplane
separating K from x, i.e., there exists a vector a ∈ Rn and a number b ∈ R such
that ha, yi ≤ b for all y ∈ K and ha, xi > b. In this theorem, the hyperplane
separating K from x is {y ∈ Rn : ha, yi = b}, as K is on one side of the
hyperplane while x is on the other side. The converse of Theorem 3.11 is also
true: every set that can be separated (by a hyperplane) from every point in its
complement is convex.

Theorem 4.3 (Converse of separation over convex sets) Let K ⊆ Rn be a


closed set. If for every point x ∈ Rn \ K there exists a hyperplane separating x
from K, then K is a convex set.

In other words, a convex set can be represented as a collection of hyperplanes


separating it from points outside. For it to be computationally useful, the hy-
perplane separating x and K must be described using rational numbers. This
motivates the following definition.

Definition 4.4 (Separation oracle) A separation oracle for a convex set K ⊆


Rn is a primitive which:

• given x ∈ K, answers YES,


• given x ∉ K, answers NO and outputs a ∈ Qn , b ∈ Q such that the hyper-
plane {y ∈ Rn : ha, yi = b} separates x from K.

For this primitive to be algorithmically useful, we require that the output of


a separation oracle, i.e., a and b, has polynomial bit complexity L(a, b) as a
function of L(x). Separation oracles provide a convenient way of “accessing”
a given convex set K, as by Theorems 3.11 and 4.3 they provide a complete
description of K.
Let us now give some examples and make a couple of remarks regarding
separation oracles.

Example: Spanning tree polytope. Recall the spanning tree polytope from
the previous section. By definition, to solve the membership problem for this
polytope, it is sufficient to construct a polynomial time separation oracle for
it. Surprisingly, one can prove that there exists a separation oracle for this
polytope that runs in time polynomial in the number of bits needed to encode
the graph G and only needs access to G – we discuss this and other similar
polytopes in Chapter 13.

Example: PSD matrices. We note that the algorithm to check if a matrix X


is PSD can be turned into an approximate separation oracle for the set of PSD
matrices. Compute an approximation λ to the smallest eigenvalue λ1 (X) along
with a v such that Xv = λv. If λ ≥ 0, output YES. If λ < 0 output NO along
with vv> . In this case, we know that

hX, vv> i := Tr(Xvv> ) = v> Xv < 0.

In both cases we are correct approximately. Here, the quality of the approxi-
mation depends on the error in our estimate for λ1 (X); we let the reader figure
out the details.
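
A minimal Python sketch of this approximate oracle, assuming numpy is available (the precise handling of the approximation error is omitted here, as in the text):

import numpy as np

def psd_separation_oracle(X):
    # X: a symmetric matrix. For symmetric input, eigh returns the
    # eigenvalues in ascending order, so the first one is the smallest.
    eigenvalues, eigenvectors = np.linalg.eigh(X)
    lam, v = eigenvalues[0], eigenvectors[:, 0]
    if lam >= 0:
        return "YES", None
    # <X, vv^T> = v^T X v = lam < 0, while <Y, vv^T> >= 0 for every PSD Y,
    # so the hyperplane {Y : <Y, vv^T> = 0} separates X from the PSD cone.
    return "NO", np.outer(v, v)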

Separation vs. optimization. It is known that constructing efficient (polyno-


mial time) separation oracles for a given family of convex sets is equivalent to
constructing algorithms to optimize linear functions over convex sets in this
family. When applying such theorems one has to be careful though, as several
technicalities show up related to bit complexity; see the notes of this chapter
for references.

Composing separation oracles. Suppose that K1 and K2 are two convex sets
and that we have access to separation oracles for them. Then we can construct
a separation oracle for their intersection K := K1 ∩ K2 . Indeed, given a point x
we can just check whether x belongs to both of K1 , K2 and if not, say x < K1
then we output the separating hyperplane output by the oracle for K1 . Using a
similar idea, we can also construct a separation oracle for K1 × K2 ⊆ R2n .
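
A sketch of this composition in Python, modeling each separation oracle as a function that returns either ("YES", None) or ("NO", certificate) in the spirit of Definition 4.4 (this interface is our illustrative choice):

def intersection_oracle(oracle1, oracle2):
    # Separation oracle for the intersection of K1 and K2 built from
    # oracles for K1 and K2: a hyperplane separating x from K1 (or from
    # K2) also separates x from the intersection, since the intersection
    # lies inside each of the two sets.
    def oracle(x):
        for o in (oracle1, oracle2):
            answer, certificate = o(x)
            if answer == "NO":
                return "NO", certificate
        return "YES", None
    return oracle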

4.4 Solution concepts for optimization problems


With a better understanding of how to represent convex sets in a computation-
ally convenient way, we now go back to optimizing functions and to a crucial
question: what does it mean to solve a convex program of the form (4.1)?
One way of approaching this problem might be to formulate it as a decision
problem of the following form:
Given c ∈ Q, is min x∈K f (x) = c?
While in many important cases (e.g. linear programming) we can solve this
problem, even for the simplest of nonlinear convex programs this may not be
possible.

Example: Irrational solutions. Consider a simple, univariate convex pro-


gram of the form
min_{x≥1} 2/x + x.

By a direct calculation one can see that the optimal solution is x? = √2 and the
optimal value is 2√2. Thus, we obtain an irrational optimal solution even though
the problem was stated as minimizing a rational function with rational entries.
In such a case the optimal solution cannot be represented exactly using a finite
binary expansion and, thus, we are forced to consider approximate solutions.
For this reason we often cannot hope to obtain the optimal value
y? := min f (x)
x∈K

and we revise the question of asking for an approximation, or in other words,


asking for a small interval containing y? :

Given ε > 0, compute c ∈ Q, such that min x∈K f (x) ∈ [c, c + ε]? (4.3)
Since it requires roughly log(1/ε) bits to store ε, we expect an efficient algorithm

to run in time poly-logarithmic in 1/ε. We elaborate a little more on this issue


in the subsequent section.

Optimal value vs. optimal point. Perhaps the most natural way to approach
problem (4.3) is to try to find a point x ∈ K such that y? ≤ f (x) ≤ y? + ε.
Then we can simply take c := f (x) − ε. Note that providing such a point x ∈ K
is a significantly stronger solution concept (than just outputting an appropriate
c) and provides much more information. The algorithms we consider in this
book will typically output both the optimal point as well as the value, but there
are examples of methods which naturally compute the optimal value only, and
recovering the optimal input for them is rather nontrivial, if not impossible; see
notes.

Example: Linear programming. As discussed in the previous subsections,


linear programming (or linear optimization) is a special case of convex opti-
mization, and a nontrivial case of it is when one is given a linear objective

f (x) = hc, xi

and a polyhedron
K = {x ∈ Rn : Ax ≤ b},

where A ∈ Qm×n and b ∈ Qm . It is not hard to show that whenever min x∈K f (x)
is bounded (i.e., the solution is not −∞), and K has a vertex, then its optimal
value is achieved at some vertex of K. Thus, in order to solve a linear program,
it is enough to find the minimum value over all its vertices.
It turns out that every vertex x̃ of K can be represented as the solution to
A′x = b′, where A′ is some square submatrix of A and b′ is the corresponding
subvector of b. In particular, this means
that there is a rational optimal solution, whenever the input data is rational.
Moreover, if the bit complexity of (A, b) is L then the bit complexity of the
optimal solution is polynomial in L (and n) and hence a priori, there is no
direct obstacle (from the viewpoint of bit complexity) for obtaining efficient
algorithms for exactly solving linear programs.

Distance to optimum vs. approximation in value. Consider a convex pro-


gram which has a unique optimal solution x? ∈ K. One can then ask what is
the relation between the following two statements about an x ∈ K:

• being approximately optimal in value, i.e., f (x) ≤ f (x? ) + ε,


• being close to optimum, i.e., ‖x − x? ‖ ≤ ε.


It is not hard to see that in general (when f is an arbitrary convex function)


neither of these conditions implies the other, even approximately. However,
as we see in later chapters, under certain conditions such as Lipschitzness or
strong convexity, we might expect some such implications to hold.

4.4.1 Representing functions


Finally, we come to the question of how the objective function f is given to
us when trying to solve convex programs of the form (4.1). Below we discuss
several possibilities.

Explicit descriptions of functions. There are several important families of


functions which we often focus on, instead of considering the general, abstract
case. These include
• Linear and affine functions: f (x) = hc, xi + b for a vector c ∈ Qn and
b ∈ Q,
• Quadratic functions: f (x) = x> Ax + hb, xi + c for a PSD matrix A ∈ Qn×n ,
a vector b ∈ Qn and a number c ∈ Q.
• Linear matrix functions: f (X) = Tr(XA) where A ∈ Qn×n is a fixed
symmetric matrix and X ∈ Rn×n is a symmetric matrix variable.
All these families of functions can be described compactly. For instance, a
linear function can be described by a c and b, and a quadratic function by
providing A, b, c.

Black box models. Unlike the examples mentioned above where f is specified
“explicitly”, now consider perhaps the most primitive – or black box – access
that we can hope to have for a function:
Given x ∈ Qn , output f (x).
Note that such a black box, or evaluation oracle, is in fact a complete descrip-
tion of the function, i.e., it hides no information about f . However, if we have
no analytical handle on f , such a description might be hard to work with.
In fact, if we drop the convexity assumption and only assume that f is contin-
uous, then one can show that no algorithm can efficiently optimize f ; see notes
for a pointer. Under the convexity assumption the situation improves, but still,
typically more information about the local behavior of f around a given point
is desired to optimize f efficiently. If the function f is sufficiently smooth one
can ask for gradients, Hessians and possibly higher-order derivatives of f . We
formalize this concept in the definition below.

Definition 4.5 (Function oracle) For a function f : Rn → R and k ∈ N≥0 a


kth order oracle for f is the primitive:

Given x ∈ Rn , output ∇k f (x),

i.e., the kth order derivative of f .

In particular the 0-order oracle (same as an evaluation oracle) is simply a prim-


itive which given x outputs f (x), a 1st-order oracle given x outputs the gradient
∇ f (x) and a 2nd-order oracle outputs the Hessian of f . In some cases, we do
not need the Hessian per se but, for a vector v, it suffices to have (∇2 f (x))−1 v.
We will see (when discussing the gradient descent method) that even a 1st-
order oracle alone is sometimes enough to approximately minimize a convex
function (rather) efficiently.

Remark 4.6 We note that in the oracle model, while we are not charged for
the time it takes to compute ∇k f (x) given x, we have to account for the bit
length of the input x to the oracle and the bit length of the output.

Stochastic models. Another interesting (and practically relevant) way of rep-


resenting a function is to provide a randomized black box procedure for com-
puting their values, gradients or higher-order information. To understand this,
consider an example of a quadratic loss function which often shows up in ma-
chine learning applications. Given a1 , a2 , . . . , am ∈ Rn and b1 , b2 , . . . , bm ∈ R
we define
l(x) := (1/m) ∑_{i=1}^{m} (⟨a_i , x⟩ − b_i )².

Now suppose that F(x) is a randomized procedure which, given x ∈ Qn , outputs
at random one of (⟨a_1 , x⟩ − b_1 )², (⟨a_2 , x⟩ − b_2 )², . . . , (⟨a_m , x⟩ − b_m )², each with
probability 1/m. Then we have

E [F(x)] = l(x),

and, thus, F(x) provides an “unbiased estimator” for the value l(x). One can
design a similar unbiased estimator for the gradient ∇l(x). Note that such “ran-
domized oracles” have the advantage of being computationally cheaper than
computing the value or gradient of l(x) – in fact, they run m times faster
than their exact deterministic counterparts. Still, in many cases, one can per-
form minimization given just oracles of this type – for instance, using a method
called stochastic gradient descent.
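
A minimal Python sketch of such a randomized oracle for l(x) and its gradient, assuming numpy (the interface is our choice; unbiasedness follows from the calculation above):

import numpy as np

def stochastic_oracle(A, b, x, rng=np.random.default_rng()):
    # A: m x n matrix whose rows are a_1, ..., a_m; b: length-m vector.
    # Sample an index i uniformly at random; since each term is chosen
    # with probability 1/m, E[value] = l(x) and E[grad] = grad of l at x.
    m = A.shape[0]
    i = rng.integers(m)
    residual = A[i] @ x - b[i]
    return residual ** 2, 2 * residual * A[i]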

4.5 The notion of polynomial time for convex optimization


For decision problems – where the answer is “YES” or “NO” and the input
is a string of characters – one defines the class P to be all problems that can
be solved by algorithms (deterministic Turing machines) that run in time poly-
nomial in the number of bits needed to represent the input. The randomized
counterparts to P are RP, BPP and problems belonging to these classes are
also considered efficiently solvable. Roughly speaking, the class RP consists
of problems for which there exists a randomized TM that runs in polynomial
time in the input size and, if the correct answer is NO, always returns NO,
and, if the correct answer is YES, returns YES with probability at least 2/3
(and otherwise returns NO).
For optimization problems we can define a concept analogous to a polyno-
mial time algorithm, yet for this we need to take into account the running time
as a function of the precision ε. We say that an algorithm runs in polynomial
time for a class of optimization problems of the form min x∈K f (x) (for a spe-
cific family of sets K and functions f ), if the time to find an ε-approximate
solution x ∈ K, i.e., such that

f (x) ≤ y? + ε

is polynomial in the bit complexity L of the input, the dimension n, and log(1/ε).
Similarly, we can talk about polynomial time algorithms in the black box
model – then the part depending on L is not present, we instead require that
the number of oracle calls (to the set K and to the function f ) is polynomially
bounded in the dimension and, importantly, in log(1/ε).
One might ask why we insist on the dependency on 1/ε to be poly-logarithmic.
There are several important applications where a polynomial dependency on 1/ε
would not be sufficient. For instance, consider a linear program min_{x∈K} hc, xi
with K = {x ∈ Rn : Ax ≤ b} with A ∈ Qm×n and b ∈ Qm . For simplicity, assume
there is exactly one optimal solution (a vertex) v? ∈ K of value y? = hc, v? i.
Suppose we would like to find v? in polynomial time, given an algorithm which
provides only approximately optimal answers.
How small an ε > 0 do we need to take, to make sure that an approximate
solution x̃ (with hc, x̃i ≤ y? + ε) can be uniquely “rounded” to the optimal
vertex v? ? A necessary condition for that is that there is no non-optimal vertex
v ∈ K with
y? < hc, vi ≤ y? + ε.

Thus, we should ask: what is the minimum nonzero distance in value between
two distinct vertices of K? By our discussion on linear programming in Sec-

tion 4.4 it follows that every vertex is a solution to an n×n subsystem of Ax = b.


Thus, one can deduce that a minimal such gap is at least, roughly, 2^{−L} ; see Ex-
ercise 4.9. Consequently, our choice of ε should be no bigger than that for the
rounding algorithm to work properly.3 Assuming our rounding algorithm is ef-
ficient, to get a polynomial time algorithm for linear programming we need to
make the dependence on ε in our approximate algorithm polynomial in log(1/ε)
in order to be polynomial in L. Thus, in this example, an approximate polyno-
mial time algorithm for linear programming yields an exact polynomial time
algorithm for linear programming.
Finally, we note that such an approximate algorithm with logarithmic de-
pendence on 1/ε indeed exists (and we will learn a couple of such algorithms in
this book).

4.5.1 Is convex optimization in P or not?


Given the notion of efficiency in the previous sections one may ask:
Can every convex program be solved in polynomial time?
Even though the class of convex programs is very special and generally be-
lieved to have efficient algorithms, the short answer is NO.
There are various reasons why this is the case. For instance, not all (even nat-
ural examples of) convex sets have polynomial-time separation oracles, which
makes the task of optimizing over them computationally hard (see Exercise 4.10). It may
also be sometimes computationally hard to evaluate a function in polynomial
time (see Exercise 4.11). Further, it may be information-theoretically impos-
sible (especially in black box models) to design algorithms with logarithmic
dependency on the error ε. And finally, being polynomial in the bit complexity
L of the input is also often a serious problem since for certain convex pro-
grams, the bit complexity of all close-to-optimal solutions is exponential in L
and, thus, even the space required to output such a solution is already expo-
nential.
Despite these obstacles, there are still numerous interesting algorithms which
show that certain subclasses of convex programs can be solved efficiently
(though this might not necessarily mean “in polynomial time”). We will see
several examples in subsequent chapters.

3 Such a rounding algorithm is quite nontrivial, and is based on the lattice basis reduction.

4.6 Exercises
4.1 Prove that, when A is a full rank n × n matrix and b ∈ Rn , then the unique
minimizer of the function kAx − bk2 over all x is A−1 b.
4.2 Let M be a collection of subsets of {1, 2, . . . , n} and θ ∈ Rn be given.
For a set M ∈ M, let 1 M ∈ Rn be the indicator vector of the set M, i.e.,
1 M (i) = 1 if i ∈ M and 1 M (i) = 0 otherwise. Let
f (x) := ∑_{i=1}^{n} θ_i x_i + ln ∑_{M∈M} e^{⟨x, 1_M⟩} .

Consider the following optimization problem:

inf f (x).
x∈Rn

(a) Prove that, if θi > 0 for some i, then the value of the above pro-
gram is −∞, i.e., it is unbounded from below.
(b) Assume the program attains a finite optimal value. Is the optimal
point necessarily unique?
(c) Compute the gradient of f.
(d) Is the program convex?
4.3 Give a precise formula for the number of spanning trees in the complete
graph Kn on n vertices.
4.4 For every n, give an example of a polytope that has 2^n vertices but can
be described using 2n inequalities.
4.5 Prove that, if A is an n × n full rank matrix with entries in Q and b ∈ Qn ,
then A−1 b ∈ Qn .
4.6 Prove that the following algorithm for linear programming when the
polytope K ⊆ Rn is given as a list of vertices v1 , . . . , vN is correct: com-
pute hc, vi i for each i ∈ [N] and output the vertex with the least cost.
4.7 Find a polynomial time separation oracle for a set of the form {x ∈ Rn :
f (x) ≤ y}, where f : Rn → R is a convex function and y ∈ R. Assume
that we have zero- and first-order oracle access to f .
4.8 Give bounds on the time complexity of computing the gradients and Hes-
sians of the functions in Exercise 2.1.
4.9 The goal of this problem is to bound bit complexities of certain quantities
related to linear programs. Let A ∈ Qm×n be a matrix and b ∈ Qm be
a vector and let L be the bit complexity of (A, b). (Thus, in particular,
L > m and L > n.) We assume that K = {x ∈ Rn : Ax ≤ b} is a bounded,
full-dimensional polytope in Rn .

(a) Prove that there is an integer M ∈ Z and a matrix B ∈ Zm×n such


that A = (1/M) B and the bit complexities of M and every entry in B
are bounded by L.
(b) Let C be any square and invertible submatrix of A. Consider the
matrix norm ‖C‖₂ := max_{x≠0} ‖Cx‖₂ / ‖x‖₂ . Prove that there exists a con-
stant d such that ‖C‖₂ ≤ 2^{O(L·[log(nL)]^d)} and ‖C⁻¹‖₂ ≤ 2^{O(nL·[log(nL)]^d)} .
(c) Prove that every vertex of K has coordinates in Q with bit com-
plexity O(nL · [log(nL)]^d ) for some constant d.
4.10 A matrix M ∈ Rn×n that is symmetric is said to be copositive if for all
x ∈ Rn≥0 ,
x> Mx ≥ 0.
Let Cn denote the set of all n × n copositive matrices.
(a) Prove that the set Cn is closed and convex.
(b) Let G = (V, E) be a simple and undirected graph. Prove that the
following is a convex program:

min_{t∈R} t
subject to ∀i ∈ V, Mii = t − 1,
∀ij ∉ E, Mij = −1,
M ∈ Cn .
(c) Prove that the value of the above convex program is α(G) where
α(G) is the size of the largest independent set in G: an independent
set is a set of vertices such that no two vertices in the set are
adjacent.
Since computing the size of the largest independent set in G is NP-hard,
we have an instance of convex optimization that is NP-hard.
4.11 Recall that an undirected graph G = (V, E) is said to be bipartite if the
vertex set V has two disjoint parts L, R and all edges go between L and R.
Consider the case when n := |L| = |R| and m := |E|. A perfect matching
in such a graph is a set of n edges such that each vertex has exactly
one edge incident to it. Let M denote the set of all perfect matchings in
G. Let 1 M ∈ {0, 1}E denote the indicator vector of the perfect matching
M ∈ M. Consider the function
f (y) := ln ∑_{M∈M} e^{⟨1_M , y⟩} .

(a) Prove that f is convex.



(b) Consider the bipartite perfect matching polytope of G defined as


P := conv{1 M : M ∈ M}.
Give a polynomial time separation oracle for this polytope.
(c) Prove that, if there is a polynomial time algorithm to evaluate f given the
graph G as input, then one can count the number of perfect match-
ings in G in polynomial time.
Since the problem of computing the number of perfect matchings in a
bipartite graph is #P-hard, we have an instance of convex optimization
that is #P-hard.

Notes
For a detailed background on computational complexity, including models of
computation and formal definitions of complexity classes such as P, NP, RP,
BPP, and #P, the reader is referred to the textbook by Arora and Barak (2009).
For a formal and detailed treatment on oracles for convex functions, see the
book by Nesterov (2014). Algorithmic aspects of convex sets are discussed in
detail in Chapter 2 of the classic book by Grötschel et al. (1988). Chapter 2 of
Grötschel et al. (1988) also presents precise details of the results connecting
separation and optimization.
For an algorithm on approximately computing eigenvalues, see the paper by
Pan and Chen (1999). For more on copositive programs (introduced in Exercise
4.10), the reader is referred to Chapter 7 of the book by Gärtner and Matousek
(2014). In a seminal result, Valiant (1979) proved that counting the number of
perfect matchings in a bipartite graph (introduced in Exercise 4.11) is #P-hard.
5
Duality and Optimality

We introduce the notion of Lagrangian duality and show that under a mild condition,
called Slater’s condition, strong Lagrangian duality holds. Subsequently, we introduce
the Legendre-Fenchel dual that often arises in Lagrangian duality and optimization
methods. Finally, we present Karush-Kuhn-Tucker (KKT) optimality conditions and
their relation to strong duality.

Consider a general (not necessarily convex) optimization problem of the


form

inf f (x), (5.1)


x∈K

where f : K → R and K ⊆ Rn . Let y? be its optimal value. In the


process of computing y? one often tries to obtain a good upper bound yU ∈ R
and a good lower bound yL ∈ R, so that

yL ≤ y? ≤ yU ,

and |yL − yU | is as small as possible. However, by inspecting the form of (5.1)


it is evident that the problems of producing yL and yU seem rather different and
the situation is asymmetric. Finding an upper bound yU may be much simpler,
as it boils down to picking an x ∈ K and taking yU = f (x), which is trivially
a correct upper bound. Giving an (even trivial) lower bound on y? does not
seem to be such a simple task. One can think of duality as a tool to construct
lower bounds for y? in an automatic way which is almost as simple as above
– it reduces to plugging in a feasible input to a different optimization problem,
called the Lagrangian dual of (5.1).


5.1 Lagrangian duality


Definition 5.1 (Primal) Consider a problem of the form

inf f (x)
x∈Rn

s.t. f j (x) ≤ 0 for j = 1, 2, . . . , m, (5.2)


hi (x) = 0 for i = 1, 2, . . . , p.

Here f, f1 , . . . , fm , h1 , . . . , h p : Rn → R need not be convex.1

Note that this problem is a special case of (5.1) when the set K is defined by
m inequalities and p equalities.

K := {x ∈ Rn : f j (x) ≤ 0 for j = 1, 2, . . . , m and hi (x) = 0 for i = 1, 2, . . . , p}.

Suppose we would like to obtain a lower bound on the optimal value of (5.2).
Towards that, one can apply the general idea of “moving the constraints to
the objective”. More precisely, introduce m new variables λ1 , . . . , λm ≥ 0 for
the inequality constraints and p new variables µ1 , . . . , µ p ∈ R for the equality
constraints; these are referred to as Lagrange multipliers. Then, consider the
following Lagrangian function:

L(x, λ, µ) := f (x) + ∑_{j=1}^{m} λ_j f_j (x) + ∑_{i=1}^{p} µ_i h_i (x). (5.3)

One can immediately see that, since λ ≥ 0, whenever x ∈ K we have that

L(x, λ, µ) ≤ f (x)

and L(x, 0, 0) = f (x). Moreover, for every x ∈ Rn ,

sup_{λ≥0,µ} L(x, λ, µ) = f (x) if x ∈ K, and +∞ otherwise;

for x ∈ K the supremum is attained at λ = 0. To see the second equality


above, note that when x ∉ K, then either hi (x) ≠ 0 for some i or f j (x) > 0
for some j. Hence, we can set the corresponding Lagrange multiplier µi or λ j
appropriately (and the other Lagrange multipliers to 0) to let the Lagrangian
go to +∞. Thus, consequently one can write the optimal value y? of (5.2) as

y? := inf sup L(x, λ, µ) = infn sup L(x, λ, µ).


x∈K λ≥0,µ x∈R λ≥0,µ

1 We also allow f to take the value +∞, the discussion in this section is still valid in such a case.

It follows that for every fixed λ ≥ 0 and µ we have

inf L(x, λ, µ) ≤ y? .
x∈Rn

Thus, every choice of λ ≥ 0 and µ provides us with a lower bound for y? .


This is exactly what we were looking for, except that now, our lower bound is
a solution to an optimization problem, which is not necessarily easier than the
one we started with. This is a valid concern, since we wanted a lower bound
which is easy to compute. However, the optimization problem we are required
to solve now is at least unconstrained, i.e., the function L(x, λ, µ) is minimized
over all x ∈ Rn . In fact, for numerous important examples of problems of the
form (5.2), the value
g(λ, µ) := inf_{x∈Rn} L(x, λ, µ) (5.4)

has a closed-form solution (as a function of λ and µ) and thus allows efficient
computation of lower bounds to y? . We will see some examples soon.
So far we have constructed a function g(λ, µ) over λ, µ such that for every
λ ≥ 0 we have g(λ, µ) ≤ y? . A natural question arises: what is the best lower
bound we can achieve this way?

Definition 5.2 (Lagrangian dual) For the primal optimization problem con-
sidered in Definition 5.1, the following is referred to as the dual optimization
problem or the dual program:

sup g(λ, µ), (5.5)


λ≥0,µ

where g(λ, µ) := inf x∈Rn L(x, λ, µ).

From the above considerations we can deduce the following inequality.

Theorem 5.3 (Weak duality)

sup g(λ, µ) ≤ inf f (x). (5.6)


λ≥0,µ x∈K

An important observation is that the dual problem in Definition 5.2 is a convex


optimization problem irrespective of whether the primal problem of Definition
5.1 is convex or not. This is because the Lagrangian dual function g is concave,
being a pointwise infimum of functions affine in (λ, µ); see Exercise 5.1.
One might ask then: does equality hold in the inequality (5.6)? It turns out
that in the general case one cannot expect equality to hold. However, there
are sufficient conditions that imply equality. An important example is Slater’s
condition.

Definition 5.4 (Slater’s condition) Slater’s condition is that there exists a


point x̄ ∈ K, such that all inequality constraints defining K are strict at x̄, i.e.,
hi ( x̄) = 0 for all i = 1, 2, . . . , p and for all j = 1, 2, . . . , m we have f j ( x̄) < 0.

When a convex optimization problem satisfies Slater’s condition, we have the


following fundamental theorem.

Theorem 5.5 (Strong duality) Suppose that the functions f, f1 , f2 , . . . , fm


are convex, that h1 , h2 , . . . , h p are affine, and that Slater’s condition is satisfied. Then

sup g(λ, µ) = inf f (x).


λ≥0,µ x∈K

In other words, the duality gap is zero.

We defer the proof of this theorem to Section 5.4. The proof relies on Theorem 3.11,
along with Slater’s condition, to show the existence of Lagrange multipliers
that achieve strong duality. It has some similarity to the proof of Theorem 3.12
and starts by constructing the following (epigraph-type) convex set related to
the objective function and the constraints.

{(w, u) ∈ R × Rm : f (x) ≤ w, fi (x) ≤ ui ∀i ∈ [m], for some x ∈ F },

where F is the set of points that satisfy all the equality constraints. It is then
argued that Lagrange multipliers exist whenever there is a “nonvertical” sup-
porting hyperplane that passes through the point (y? , 0) on the boundary of
this convex set. Here y? is the optimal value to the convex program (5.2) and
nonvertical means that the coefficient corresponding to the first coordinate in
the hyperplane is nonzero. Such a nonvertical hyperplane can only fail to exist
when constraints of the convex program can only be achieved at certain ex-
treme points of this convex set, and Slater’s condition ensures that this does
not happen.
For now, we present some interesting cases where strong duality holds and
some where it does not. We conclude this section by providing two examples:
one where strong duality holds even though the primal program is nonconvex
and one where strong duality fails to hold, even for a convex program (imply-
ing that Slater’s condition is not satisfied).

Example: dual of a linear program in standard form. Consider a linear


program of the form
min_{x∈Rn} hc, xi
s.t. Ax ≥ b (5.7)

where A is an m × n matrix and b ∈ Rm is a vector. The notation Ax ≥ b is a
short way to say

short way to say

hai , xi ≥ bi for all i = 1, 2, . . . , m,

where a1 , a2 , . . . , am are the rows of A.


To derive its dual we introduce a vector of dual variables λ ∈ Rm
≥0 and con-
sider the Lagrangian

L(x, λ) = hc, xi + hλ, b − Axi.

The next step is to derive g(λ) := inf x L(x, λ). For this, we first rewrite

L(x, λ) = hx, c − A> λi + hb, λi.

From this form it is easy to see that, unless c − A> λ = 0, the minimum of
L(x, λ) over all x ∈ Rn is −∞. More precisely,

g(λ) = hb, λi if c − A> λ = 0, and −∞ otherwise.
Consequently, the dual program is:
max_{λ∈Rm} hb, λi
s.t. A> λ = c, (5.8)
λ ≥ 0.
It is known that in the setting of linear programming strong duality holds, i.e.,
whenever (5.7) has a finite optimal value, then the dual (5.8) is also finite and
has the same optimal value.
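
This can be checked numerically on small instances; the following is a minimal sketch assuming Python with numpy and scipy, on an arbitrary instance of our choosing:

import numpy as np
from scipy.optimize import linprog

A = np.array([[1.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
b = np.array([1.0, 0.2, 0.2])
c = np.array([1.0, 2.0])

# Primal (5.7): min <c, x> s.t. Ax >= b; linprog expects A_ub x <= b_ub.
primal = linprog(c, A_ub=-A, b_ub=-b, bounds=[(None, None)] * 2)
# Dual (5.8): max <b, lam> s.t. A^T lam = c, lam >= 0; negate to maximize.
dual = linprog(-b, A_eq=A.T, b_eq=c, bounds=[(0, None)] * 3)
print(primal.fun, -dual.fun)  # both equal 1.2 here: the two values agree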

Example: strong duality for a nonconvex program. Consider the following



univariate optimization problem, where √· refers to the positive square root:

inf √x
s.t. x ≥ 1. (5.9)

Since x ↦ √x is not a convex function, the above is a nonconvex program. Its
optimal value is equal to 1, attained at the boundary x = 1.
Verify that the dual program takes the following form:

sup (3/2) λ − (1/2) λ³
s.t. λ ≥ 0. (5.10)

The above is a convex program as the objective function is concave. Its optimal
value is attained at λ = 1 and is equal to 1, hence strong duality holds.

Example: strong duality fails for a convex program. The fact that strong
duality can fail for convex programs might be a little disheartening, yet such
programs are not commonly encountered in practice. The reason for such be-
havior is typically an unnatural or redundant description of the domain.
The standard example used to illustrate this is as follows: consider a function
f : D → R given by f (x) := e^{−x} and the two-dimensional domain
D = {(x, y) : y > 0} ⊆ R² .
The convex program we consider is
inf_{(x,y)∈D} e^{−x}
s.t. x²/y ≤ 0. (5.11)
Such a program is indeed convex, as e^{−x} and x²/y are convex functions over D,
but we see that its description is rather artificial, as the constraint forces x = 0
and hence the y variable is redundant. Still, the optimal value of (5.11) is 1 and
we can derive its Lagrangian dual. To this end, we write down the Lagrangian
L(x, y, λ) = e^{−x} + λ x²/y,
and derive that

g(λ) = inf_{(x,y)∈D} L(x, y, λ) = 0

for every λ ≥ 0: indeed, L(x, y, λ) > 0 everywhere on D, while taking x → ∞
with y = x⁴ drives both terms to zero. Thus, g(λ) is a lower bound for the optimal value of (5.11) but
there is no λ that makes g(λ) equal to 1. Hence, strong duality fails in this case.

5.2 The conjugate function


We now introduce a notion that is closely related to Lagrangian dual and often
useful in deriving duals of optimization problems.
Definition 5.6 (Conjugate function) For a function f : Rn → R ∪ {+∞}, its
conjugate f ∗ : Rn → R ∪ {+∞} is defined as
f ∗ (y) := sup hy, xi − f (x),
x∈Rn
n
for y ∈ R .

When f is a univariate function, the conjugate function has a particularly intu-


itive geometric interpretation. Suppose for a moment that f is strictly convex
and that the derivative of f attains all real values. Then for every value of the
slope θ there exists a unique line that is tangent to the graph of f and has slope
θ. The value of f ∗ (θ) is then (the negative of) the intersection of this line with
the y-axis. One can then think of f ∗ as an alternative encoding of f using a
different coordinate system.

Examples of conjugate functions.

1. If f (x) := ax + b, then f ∗ (a) = −b and f ∗ (y) = ∞ for y ≠ a.
2. If f (x) := (1/2) x², then f ∗ (y) = (1/2) y².
3. If f (x) := x log x, then f ∗ (y) = e^{y−1} .
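
To illustrate how such formulas are derived (the following short computation is ours and simply applies Definition 5.6), consider the third example with f (x) := x log x for x > 0. Then

f ∗ (y) = sup_{x>0} (yx − x log x).

Setting the derivative of the expression inside the supremum to zero gives y − log x − 1 = 0, i.e., x = e^{y−1}, and substituting back yields

f ∗ (y) = y e^{y−1} − e^{y−1} (y − 1) = e^{y−1} .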

Recall that a function g : Rn → R is said to be closed if it maps every closed set


K ⊆ Rn to a closed set. For instance g(x) := e^x is not closed as g(R) = (0, +∞),
while R is a closed set and (0, +∞) is not. One property of f ∗ is that it is closed
and convex.

Lemma 5.7 (Conjugate function is closed and convex) For every function
f (convex or not), its conjugate f ∗ is convex and closed.

Proof This follows directly from the definition as f ∗ is a pointwise supremum


of a set of convex (affine) functions, hence it is convex and closed.

Another simple, yet useful conclusion of the definition is the following in-
equality.

Lemma 5.8 (Young-Fenchel inequality) Let f : Rn → R ∪ {+∞} be any


function, then for all x, y ∈ Rn we have

f (x) + f ∗ (y) ≥ hx, yi.
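
Proof This follows immediately from Definition 5.6: for every x, y ∈ Rn we have f ∗ (y) = sup_u (hy, ui − f (u)) ≥ hy, xi − f (x), and rearranging gives the claim.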

The following fact explains the name “conjugate” function – the conjuga-
tion operation is in fact an involution and applying it twice brings the original
function back. Clearly, this cannot hold in general as we observed that f ∗ is
always convex, but in fact not much more than convexity is required.

Lemma 5.9 (Conjugate of conjugate) If f : Rn → R ∪ {+∞} is a convex and


closed function then ( f ∗ )∗ = f .

Finally, we present a lemma that is useful in viewing the gradient of a convex


function as a mapping that is invertible.

Lemma 5.10 (The inverse of the gradient map) Suppose that f is closed,
convex, and differentiable. Then y = ∇ f (x) if and only if x = ∇ f ∗ (y).

Proof Given x such that y = ∇ f (x), it follows from first-order optimality


condition (Theorem 3.13) that

f ∗ (y) = sup (hy, ui − f (u)) = hy, xi − f (x). (5.12)


u

The first equality is the definition of f ∗ and the second equality follows from
taking the derivative of the argument in the supremum and plugging in u = x.
However, for any v

f ∗ (v) ≥ hv, xi − f (x) = hv − y, xi − f (x) + hy, xi = hv − y, xi + f ∗ (y),

where the last equality follows from (5.12). This implies that x = ∇ f ∗ (y).
For the other direction let g := f ∗ . Then, by Lemma 5.7, g is closed and
convex. Thus, we can use the above argument for g to deduce that if x = ∇g(y),
then y = ∇g∗ (x). By Lemma 5.9, we know that g∗ = f and, hence, the claim
follows.

The conjugate function is a fundamental object in convex analysis and opti-


mization. For instance, as we see in some of the exercises, the Lagrangian dual
g(λ, µ) can sometimes be expressed in terms of the conjugate of the primal
objective function.

5.3 KKT optimality conditions


We now introduce the Karush-Kuhn-Tucker (KKT) optimality conditions for
a pair of primal and dual programs introduced in Definitions 5.1 and 5.2.

Definition 5.11 (KKT conditions) Let f, f1 , . . . , fm , h1 , . . . , h p : Rn → R


be functions in variables x1 , . . . , xn . Let L(x, λ, µ) and g(λ, µ) be functions as
defined in Equations (5.3) and (5.4) in dual variables λ1 , . . . , λm and µ1 , . . . , µ p .
Then, x? ∈ Rn , λ? ∈ Rm , and µ? ∈ R p are said to satisfy KKT optimality
conditions if:

1. Primal feasibility: f j (x? ) ≤ 0 for j = 1, . . . , m and hi (x? ) = 0 for i =


1, . . . , p,
2. Dual feasibility: λ? ≥ 0,
3. Stationarity: 0 ∈ ∂ x L(x? , λ? , µ? ), where ∂ x denotes the subgradient set
with respect to the x variables, and

4. Complementary slackness: λ?j f j (x? ) = 0 for j = 1, . . . , m.

We now show that strong duality along with primal and dual feasibility are
equivalent to the KKT conditions. This is sometimes useful in solving a convex
optimization problem.

Theorem 5.12 (Equivalence of strong duality and KKT conditions) Let x?


and (λ? , µ? ) be primal and dual feasible points. Then they satisfy the remaining
KKT conditions if and only if f (x? ) = g(λ? , µ? ).

Proof Assume strong duality holds. Then we have the following:

f (x? ) = g(λ? , µ? )
= inf_x L(x, λ? , µ? )
= inf_x ( f (x) + ∑_{j=1}^{m} λ?j f j (x) + ∑_{i=1}^{p} µ?i hi (x) )
≤ f (x? ) + ∑_{j=1}^{m} λ?j f j (x? ) + ∑_{i=1}^{p} µ?i hi (x? )
≤ f (x? ).

Here, the last inequality follows from primal and dual feasibility. Thus, all of
the inequalities above should be equalities. Hence, from the tightness of the
first inequality we deduce the stationarity condition and from the tightness of
the second inequality we deduce the complementary slackness condition. The
converse is left as an exercise.

5.4 Proof of strong duality under Slater’s condition


Here we prove Theorem 5.5 assuming that the convex program (5.2) satisfies
Slater’s condition and that y? is finite. For simplicity we assume that there are
no equality constraints (p = 0). Key to the proof is the following set:

C := {(w, u) ∈ R × Rm : f (x) ≤ w, fi (x) ≤ ui ∀i ∈ [m], for some x ∈ Rn }.

It is straightforward to verify that C is convex and upwardly closed: for any


(w, u) ∈ C and any (w0 , u0 ) such that u0 ≥ u and w0 ≥ w, we have (w0 , u0 ) ∈ C.
Let y? be the optimal value to the convex program (5.2) (we assume it is
finite). Then, observe that (y? , 0) cannot be in the interior of the set C, as oth-
erwise y? is not optimal. Thus, (y? , 0) is on the boundary of C. Thus, from

Theorem 3.11, we obtain that there is a (λ0 , λ) ≠ 0 such that


h(λ0 , λ), (w, u)i ≥ λ0 y? ∀(w, u) ∈ C. (5.13)
From this, we deduce that λ ≥ 0 and λ0 ≥ 0 as, otherwise, we could make
the left hand side of the inequality above arbitrarily negative, contradicting that
y? is finite. Now, there are two possibilities:
1. λ0 = 0: Then λ ≠ 0 and, on the one hand, (5.13) gives
inf_{(w,u)∈C} hλ, ui ≥ 0.

On the other hand,

inf_{(w,u)∈C} hλ, ui = inf_x ∑_{i=1}^{m} λi fi (x) ≤ ∑_{i=1}^{m} λi fi ( x̄) < 0,

where in the last inequality we have used the fact that λ ≠ 0 and the
existence of a point x̄ that satisfies Slater’s condition. This gives a contra-
diction, hence λ0 > 0.
2. λ0 > 0: in this case, we can divide (5.13) by λ0 to obtain
inf hλ/λ0 , ui + w ≥ y? .
(w,u)∈C

Thus, letting λ̂ := λ/λ0 , we obtain

g(λ̂) = inf_x ( f (x) + ∑_i λ̂i fi (x) ) ≥ y? .

Maximizing the left hand side over all λ̂ ≥ 0, we obtain the nontrivial side
of strong duality, proving the theorem.
Observe that Slater’s condition does not hold for the convex optimization prob-
lem (5.11).

5.5 Exercises
5.1 Prove that the dual function g in Definition 5.2 is concave irrespective of
whether the primal problem is convex or not.
5.2 Univariate duality. Consider the following optimization problem:
min_{x∈R} x² + 2x + 4
s.t. x² − 4x ≤ −3
(a) Solve this problem, i.e., find the optimal solution.
(b) Is this a convex program?
(c) Derive the dual problem maxλ≥0 g(λ). Find g and its domain.
(d) Prove that weak duality holds.
(e) Is Slater’s condition satisfied? Does strong duality hold?
5.3 Duality for linear programs. Consider the linear program in the canon-
ical form:
min_{x∈Rn} hc, xi

s.t. Ax = b,
x ≥ 0.
Derive its Lagrangian dual and try to arrive at the simplest possible
form. Hint: represent equality constraints hai , xi = bi as hai , xi ≤ bi
and − hai , xi ≤ −bi .
5.4 Shortest vector in an affine space. Consider the following optimization
problem
min_{x∈Rn} ‖x‖² (5.14)

s.t. ha, xi = b, (5.15)


where a ∈ Rn is a vector and b ∈ R. The above represents the problem
of finding the (squared) distance from the hyperplane {x : ha, xi = b} to
the origin.
(a) Is this a convex program?
(b) Derive the dual problem.
(c) Solve the dual program, i.e., derive a formula for its optimal solu-
tion.
5.5 Prove that the conjugate functions of the following functions are as spec-
ified:
(a) If f (x) := ax + b, then f ∗ (a) = −b and f ∗ (y) = ∞ for y ≠ a.

(b) If f (x) := (1/2) x², then f ∗ (y) = (1/2) y².
(c) If f (x) := x log x, then f ∗ (y) = e^{y−1} .
5.6 Fill in the details of the proof of Lemma 5.7.
5.7 Prove Lemma 5.9.
5.8 Let f (x) := kxk for some norm k · k : Rn → R. Prove that f ∗ (y) = 0 if
kyk∗ ≤ 1 and ∞ otherwise.
5.9 Consider the primal problem where f j (x) := ha j , xi − b j for some vectors
a1 , . . . , am ∈ Rn and b1 , . . . , bm ∈ R. Prove that the dual function is
g(λ) = −hb, λi − f ∗ (−A> λ)
where A is the m × n matrix whose rows are the vectors a j .
5.10 Compute the dual of the primal convex optimization problem over the
set of real n × n PD matrices X for f (X) := − log det X and f j (X) := a>j Xa j
for j = 1, . . . , m where a j ∈ Rn .
5.11 Prove the converse direction in Theorem 5.12.
5.12 Consider the primal problem where f, f1 , . . . , fm : Rn → R are convex
and there are no equality constraints. Suppose x? and λ? satisfy the KKT
conditions. Prove that
h∇ f (x? ), x − x? i ≥ 0
for all feasible x (Theorem 3.14).
5.13 Compute the conjugate dual of the following functions:
(a) f (x) := |x|.
(b) For a positive definite matrix X, let f (X) := Tr(X log X).
5.14 Lagrange-Hamilton duality in Physics. Consider a particle with unit
mass in one dimension and let V : R → R be a potential function. In
Lagrangian dynamics one typically denotes the position of a particle by
q and its (generalized) velocity by q̇, as it is normally a time derivative
of the position. The Lagrangian of the particle is defined as
L(q, q̇) := (1/2) q̇² − V(q).
The Hamiltonian on the other hand is defined with respect to a momen-
tum variable, denoted by p as
H(q, p) := V(q) + (1/2) p².
Prove that the Hamiltonian is the Legendre-Fenchel dual of the Lagrangian:
H(q, p) = max_{q̇} hq̇, pi − L(q, q̇).


5.15 Polars and convexity. For a set S ⊆ Rn define its polar S ◦ as


S ◦ := {y ∈ Rn : hx, yi ≤ 1, for all x ∈ S }.
(a) Prove that 0 ∈ S ◦ and S ◦ is convex, whether S is convex or not.
(b) Compute the polar of a halfspace S = {x ∈ Rn : ha, xi ≤ b}.
(c) Compute the polar of S = {x ∈ Rn : x1 , x2 , . . . , xn ≥ 0} (the
positive orthant) and the polar of the set of PSD matrices.
(d) What is the polar of a polytope S = conv{x1 , x2 , . . . , xm }? Here
x1 , x2 , . . . , xm ∈ Rn .
(e) Prove that if S contains the origin along with a ball of radius r > 0
around it, then S ◦ is contained in a ball of radius 1/r.
(f) Let S be a closed convex subset of Rn and suppose we have ac-
cess to an oracle, which given c ∈ Qn outputs2 argmin x∈S hc, xi.
Construct an efficient separation oracle for S ◦ .3

2 We assume that the output of this oracle is always a vector with rational entries, and the bit
complexity of the output is polynomial in the bit complexity of c. Such oracles exist in
particular for certain important classes of polytopes with rational vertices.
3 As we will see later in the book, the converse of this theorem is also true.

Notes
The book by Boyd and Vandenberghe (2004) provides a thorough treatment
of Lagrangian duality, conjugate functions, and KKT optimality conditions.
The reader looking for additional examples and exercises is referred to this book.
Chapter IV in the book by Barvinok (2002) discusses polarity and convexity
(introduced in Exercise 5.15) in detail.
6
Gradient Descent

The remainder of the book is devoted to the design and analysis of algorithms for con-
vex optimization. We start by presenting the gradient descent method and show how it
can be viewed as a steepest descent. Subsequently, we prove a convergence time bound
on the gradient descent method when the gradient of the function is Lipschitz continu-
ous. Finally, we use the gradient descent method to come up with a fast algorithm for a
discrete optimization problem: computing maximum flow in an undirected graph.

6.1 The setup


We first consider the problem of unconstrained minimization
min_{x∈Rn} f (x),

when only a first-order oracle to a convex function f is given, i.e., we can


query the gradient of f at any point; see Definition 4.5. Let y? be the optimal
value for this function and assume that x? is a point that achieves this value:
f (x? ) = y? . Our notion of a solution is an algorithm (could be randomized)
which, given access to f and given ε > 0, outputs a point x ∈ Rn such that
f (x) ≤ f (x? ) + ε.
Recall that one measures the running time of such an algorithm as a function
of the number of bits required to represent ε and the number of oracle calls to
the gradient of f . We note that while we are not charged for the time it takes
to compute ∇ f (x) given x, the running time depends on the bit length of x and
that of ∇ f (x). While we normally regard such an algorithm as a polynomial
time algorithm if its total running time is polynomial in log(1/ε) and it outputs the
right answer with high probability, in this chapter we present algorithms with
running times proportional to some polynomial in 1/ε.


6.2 Gradient descent


Gradient descent is not a single method, but a general framework with many
possible realizations. We describe some concrete variants and present guaran-
tees on their running times. The performance guarantees that we are going to
obtain depend on the assumptions that we make about f . We first describe the
core ideas of the gradient descent methods in the unconstrained setting, i.e.,
K := Rn . In Section 6.3.2 we discuss the constrained setting.
The general scheme of gradient descent is summarized below.

1. Choose a starting point x1 ∈ Rn .


2. For some t ≥ 1, suppose x1 , . . . , xt have already been computed. Choose
xt+1 as a linear combination of xt and ∇ f (xt ).
3. Stop once a certain stopping criterion is met and output the last iterate.

Let xt denote the point chosen in Step 2 of this algorithm at the tth iteration
and let T be the total number of iterations performed. T is usually given as a
part of the input, but it could also be implied by a given stopping criterion. The
running time for the algorithm is O(T · M) where M is an upper bound on the
time of each update, including that of finding the starting point. Normally, the
update time M is something which cannot be optimized below a certain level
(for a fixed function f ) and, hence, the main goal is to design methods with T
as small as possible.

6.2.1 Why descend along the gradient?


In what follows we motivate one possible method for choosing the next itera-
tion point xt+1 from xt . Since the process we are about to describe can use only
the local information about f , it makes sense to pick a direction that locally
maximizes the rate of decrease of the function value – the direction of steepest
descent. More formally, we would like to pick a unit vector u for which the
rate at which f decreases is maximized. Such a direction is captured by the
following optimization problem:

max_{‖u‖=1} lim_{δ→0} [ ( f (x) − f (x + δu) ) / δ ].
The expression inside the limit is simply the negative directional derivative of
f at x in direction u and, thus, we obtain

max_{‖u‖=1} (−D f (x)[u]) = max_{‖u‖=1} (− h∇ f (x), ui) . (6.1)

We claim that the above is maximized at

u? := −∇ f (x) / ‖∇ f (x)‖ .
Indeed, consider any point u with norm 1. From the Cauchy-Schwarz inequal-
ity (Theorem 2.15) we have

− h∇ f (x), ui ≤ k∇ f (x)k kuk = k∇ f (x)k ,

and the equality is attained at u = u? . Thus, the negative of the gradient at the
current point is the direction of steepest descent. This instantaneous direction
at each point x gives rise to a continuous-time dynamical system (Definition
2.19) called the gradient flow of f :
dx/dt = −∇ f (x) / ‖∇ f (x)‖ .

Figure 6.1 The steepest descent direction for the function x₁² + 4x₂² at x̃ = (√2, √2).

However, to implement the strategy in an algorithm we should consider dis-


cretizations of the differential equations above. Since we assume first order
access to f , a natural discretization is the so-called Euler discretization (Defi-
nition 2.19):
xt+1 = xt − α ∇ f (xt ) / ‖∇ f (xt )‖ , (6.2)

where α > 0 is the “step length”, i.e., how far we move along u? . Since 1/‖∇ f (xt )‖
is only a normalization factor we can omit it and arrive at the following gradient
descent update rule
xt+1 = xt − η∇ f (xt ), (6.3)

where, again, η > 0 is a parameter – the step length (which we might also make
depend on the time t or the point xt ). In machine learning, η is often called the
learning rate.

6.2.2 Assumptions on the function, gradient, and starting point


While moving in the direction of the negative gradient is a good instantaneous
strategy, it is far from clear how big a step to take. Ideally, we would like to take
big steps hoping for a smaller number of iterations, but there is also a danger.
The view we have of the function at the current point is local and implicitly
assumes the function is linear. However, the function can change or bend quite
dramatically and taking a long step can lead to a large error. This is one of
many issues that can arise.
To bypass this and related problems, it is customary to make assumptions on
the function in terms of parameters such as the Lipschitz constant of the func-
tion f or its gradient. These are natural measures of how “complicated” f is,
and thus how hard to optimize f could be using a first order oracle. Therefore
these “regularity parameters” often show up in the running time guarantees for
the optimization algorithms we develop. Moreover, when designing methods
with oracle access to f only, it is natural to provide as input an additional point
x1 ∈ Rn which is not “too far” from an optimal solution, as otherwise, it is not
even clear where such an algorithm should start its search. More formally, we
list the kind of assumptions which show up in the theorems below. We begin
with a definition.

Definition 6.1 (Lipschitz continuity) For a pair of arbitrary norms k · ka and


k · kb on Rn and Rm respectively and an L > 0, a function g : Rn → Rm is said
to be L-Lipschitz continuous if for all x, y ∈ Rn

kg(x) − g(y)kb ≤ Lkx − yka .

L is called the corresponding Lipschitz constant.

The Lipschitz constant of the same function can vary dramatically with the
choice of the norms. Unless specified, we assume that both the norms are k · k2 .

1. Lipschitz continuous gradient. Suppose f is differentiable and for every


x, y ∈ Rn we have

k∇ f (x) − ∇ f (y)k ≤ L kx − yk ,

where L > 0 is a possibly large but finite constant.



This is also sometimes referred to as the L-smoothness of f . This con-


dition ensures that around x, the gradient changes in a controlled manner
and, thus, the gradient flow trajectories can only “bend” in a controlled
manner. The smaller the L is, the larger the step size we can think of tak-
ing.
2. Bounded gradient. For every x ∈ Rn we have
k∇ f (x)k ≤ G,
where G ∈ Q is a possibly large but finite constant.1 It is an exercise
to show that this condition implies that f is a G-Lipschitz continuous
function (Exercise 6.1). This quantity essentially controls how quickly the
function can go towards infinity. The smaller G is, the slower this growth
is.
3. Good initial point. A point x1 ∈ Qn is provided such that ‖x1 − x? ‖ ≤ D,

where x? is some optimal solution to (4.1).2


We now state the main result that we prove in this chapter.
Theorem 6.2 (Guarantees on gradient descent for Lipschitz continuous
gradient) There is an algorithm which, given first-order oracle (see Def-
inition 4.5) access to a convex function f : Rn → R, a bound L on the
Lipschitz constant of its gradient, an initial point x1 ∈ Rn , a value D such
that max{‖x − x? ‖ : f (x) ≤ f (x1 )} ≤ D (where x? is an optimal solution to
min_{x∈Rn} f (x)), and an ε > 0, outputs a point x ∈ Rn such that f (x) ≤ f (x? ) + ε.
The algorithm makes T = O(LD²/ε) queries to the first-order oracle for f and
performs O(nT ) arithmetic operations.


We note that while in this chapter we only prove the theorem in the above
variant, one can alternatively use the weaker condition ‖x1 − x? ‖ ≤ D to arrive
at the same conclusion. In Chapter 7, we prove a similar result that gives an
algorithm that can compute an ε-approximate solution under the condition that
the gradient is bounded by G; it makes T = O((DG/ε)²) queries to the first-
order oracle for f .
Before going any further, one might wonder if it is reasonable to assume
knowledge of parameters such as G, L and D of the function f in Theorem 6.2,
especially when the access to the function is black box only. While for D it is
often possible to obtain sensible bounds on the solution, finding G or L might
1 We assume here that the bound on the gradient is valid for all x ∈ Rn . One might often relax it
to just x over a suitably large set X ⊆ Rn which contains x1 and an optimal solution x? . This
requires one to prove that the algorithm “stays in X” for the whole computation.
2 There might be more than just one optimal solution, here it is enough that the distance to any
one of them is small.

be more difficult. However, one can often try to set these values adaptively.
As an example, it is possible to start with a guess G0 (or equivalently L0 )
and update the guess given the outcome of the gradient descent. Should the
algorithm “fail” (the gradient at the point at the end of the gradient descent
algorithm is not small enough) for this value of G0 , we can double the constant
G1 = 2G0 . Otherwise, we can halve the constant, G1 = (1/2) G0 , and so on.
In practice, especially in machine learning applications, these quantities are
often known up to a good precision and are often considered to be constants.
Under such an assumption the number of oracle calls performed by the respec-
tive algorithms are O(1/ε²) and O(1/ε) and do not depend on the dimension n.
Therefore, such algorithms are often referred to as dimension-free, their con-
vergence rate, as measured in the number of iterations (as clarified later) does
not depend on n, only on certain regularity parameters of f . In theoretical com-
puter science applications, an important aspect is to find formulations where
these quantities are small and this is often the key challenge.
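The adaptive doubling strategy described above can be sketched in code as follows. This is an illustrative Python sketch, not part of the formal development: the tolerance, the number of inner iterations, and the success test are assumptions made for the example.

```python
# Illustrative sketch: adaptively guessing the smoothness constant L by
# doubling. A run "fails" if the final gradient norm is not small enough.
import numpy as np

def gd_with_guess(grad, x1, L_guess, T):
    """Run T steps of gradient descent with step size 1/L_guess."""
    x = x1.copy()
    for _ in range(T):
        x = x - (1.0 / L_guess) * grad(x)
    return x

def adaptive_gd(grad, x1, T=100, L0=1.0, tol=1e-6, max_rounds=50):
    L = L0
    for _ in range(max_rounds):
        x = gd_with_guess(grad, x1, L, T)
        if np.linalg.norm(grad(x)) <= tol:  # success: the gradient is small
            return x, L
        L *= 2.0                            # failure: double the guess
    return x, L

# Example: f(x) = 2*||x||^2 has gradient 4x, so the true smoothness is L = 4.
grad = lambda x: 4.0 * x
x, L = adaptive_gd(grad, np.ones(5))
print(L, np.linalg.norm(grad(x)))           # the guess settles at L = 4
```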
6.3 Analysis when the gradient is Lipschitz continuous
This section is devoted to proving Theorem 6.2. We start by stating an appropriate variant of the gradient descent algorithm and, subsequently, provide its analysis. In Algorithm 1, the values $D$ and $L$ are as in the statement of Theorem 6.2.
Before we present the proof of Theorem 6.2 we need to establish an important lemma that shows how to upper bound the Bregman divergence (Definition 3.9) using our assumption on the gradient of $f$.

Lemma 6.3 (Upper bound on the Bregman divergence for L-smooth functions) Suppose that $f : \mathbb{R}^n \to \mathbb{R}$ is a differentiable function that satisfies $\|\nabla f(x) - \nabla f(y)\| \le L\|x - y\|$ for every $x, y \in \mathbb{R}^n$. Then, for every $x, y \in \mathbb{R}^n$ it holds that
\[ f(y) - f(x) - \langle \nabla f(x), y - x \rangle \le \frac{L}{2}\|x - y\|^2. \tag{6.4} \]
We note that this lemma does not assume that $f$ is convex. However, if $f$ is convex, then the quantity on the left hand side of Equation (6.4) is nonnegative. See Figure 6.2 for an illustration. Moreover, when $f$ is convex, (6.4) is equivalent to the condition that the gradient of $f$ is $L$-Lipschitz continuous; see Exercise 6.2.
Algorithm 1: Gradient descent when the gradient is Lipschitz continuous
Input:
  • First-order oracle for a convex $f$
  • A bound $L \in \mathbb{Q}_{>0}$ on the Lipschitz constant of the gradient of $f$
  • A bound $D \in \mathbb{Q}_{>0}$ on the distance to an optimal solution
  • A starting point $x_1 \in \mathbb{Q}^n$
  • An $\varepsilon > 0$
Output: A point $x$ such that $f(x) - f(x^\star) \le \varepsilon$
Algorithm:
  1: Set $T := O\left(\frac{LD^2}{\varepsilon}\right)$
  2: Set $\eta := \frac{1}{L}$
  3: for $t = 1, 2, \ldots, T - 1$ do
  4:   $x_{t+1} := x_t - \eta \nabla f(x_t)$
  5: end for
  6: return $x_T$
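For concreteness, here is a minimal numpy sketch of Algorithm 1. The gradient oracle grad_f and the example objective are assumptions made for the illustration.

```python
import numpy as np

def gradient_descent(grad_f, x1, L, T):
    """Algorithm 1: iterate x_{t+1} = x_t - (1/L) * grad_f(x_t)."""
    x = np.array(x1, dtype=float)
    eta = 1.0 / L
    for _ in range(T - 1):
        x = x - eta * grad_f(x)
    return x

# Example: f(x) = ||x - c||^2 has gradient 2(x - c) and smoothness L = 2.
c = np.array([1.0, -2.0, 3.0])
x = gradient_descent(lambda x: 2.0 * (x - c), np.zeros(3), L=2.0, T=200)
print(np.round(x, 6))  # close to the minimizer c
```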

Proof For fixed $x$ and $y$, consider the univariate function
\[ g(t) := f((1-t)x + ty) \]
for $t \in [0, 1]$. We have $g(0) = f(x)$ and $g(1) = f(y)$. Since $g$ is differentiable, the fundamental theorem of calculus gives
\[ \int_0^1 \dot{g}(t)\,dt = f(y) - f(x). \]
Since $\dot{g}(t) = \langle \nabla f((1-t)x + ty), y - x \rangle$ (by the chain rule), we have
\begin{align*}
f(y) - f(x) &= \int_0^1 \langle \nabla f((1-t)x + ty), y - x \rangle \, dt \\
&= \int_0^1 \langle \nabla f(x), y - x \rangle \, dt + \int_0^1 \langle \nabla f((1-t)x + ty) - \nabla f(x), y - x \rangle \, dt \\
&\le \langle \nabla f(x), y - x \rangle + \int_0^1 \|\nabla f((1-t)x + ty) - \nabla f(x)\| \, \|y - x\| \, dt \\
&\le \langle \nabla f(x), y - x \rangle + \int_0^1 \|x - y\| \cdot L\|t(y - x)\| \, dt \\
&\le \langle \nabla f(x), y - x \rangle + \frac{L}{2}\|x - y\|^2,
\end{align*}
where we have used the Cauchy-Schwarz inequality and the $L$-Lipschitz continuity of the gradient of $f$.
Figure 6.2: For a convex function $f$, the gap at $y$ between the value of $f$ and the first-order approximation of $f$ at $x$ is nonnegative and bounded from above by the quadratic function $\frac{L}{2}\|x - y\|^2$ when the gradient of $f$ is $L$-Lipschitz continuous.

Proof of Theorem 6.2. Let us examine the evolution of the error as we iterate. We first use Lemma 6.3 to obtain
\begin{align*}
f(x_{t+1}) - f(x_t) &\le \langle \nabla f(x_t), x_{t+1} - x_t \rangle + \frac{L}{2}\|x_{t+1} - x_t\|^2 \\
&= -\eta\|\nabla f(x_t)\|^2 + \frac{L\eta^2}{2}\|\nabla f(x_t)\|^2.
\end{align*}
Since we wish to maximize the drop in the function value $f(x_t) - f(x_{t+1})$, we should choose $\eta$ so as to make the right hand side as small as possible. The right hand side is a convex function of $\eta$ and a simple calculation shows that it is minimized when $\eta = \frac{1}{L}$. Substituting, we obtain
\[ f(x_{t+1}) - f(x_t) \le -\frac{1}{2L}\|\nabla f(x_t)\|^2. \tag{6.5} \]
Intuitively, (6.5) suggests that if the norm of the gradient $\|\nabla f(x_t)\|$ is large, we make good progress. On the other hand, if $\|\nabla f(x_t)\|$ is small, we are already close to the optimum.
Let us now denote
\[ R_t := f(x_t) - f(x^\star), \]
which measures how far the current objective value is from the optimal value. Note that $R_t$ is nonincreasing and we would like to find the smallest $t$ for which it drops below $\varepsilon$. Let us start by noting that $R_1 \le LD^2$. This is because
\[ R_1 \le \|\nabla f(x_1)\| \cdot \|x_1 - x^\star\| \le D\,\|\nabla f(x_1)\|. \tag{6.6} \]
The first inequality in (6.6) follows from the first-order characterization of convexity of $f$ and the second inequality follows (vacuously) from the definition of $D$. The optimality of $x^\star$ (Theorem 3.13) implies that $\nabla f(x^\star) = 0$. Therefore, by the $L$-smoothness of $f$, we obtain
\[ \|\nabla f(x_1)\| = \|\nabla f(x_1) - \nabla f(x^\star)\| \le L\|x_1 - x^\star\| \le LD. \tag{6.7} \]
Combining (6.6) and (6.7) we obtain that $R_1 \le LD^2$.

Further, from (6.5) we know that
\[ R_t - R_{t+1} \ge \frac{1}{2L}\|\nabla f(x_t)\|^2. \]
Again, by convexity of $f$ and the Cauchy-Schwarz inequality we obtain
\[ R_t = f(x_t) - f(x^\star) \le \langle \nabla f(x_t), x_t - x^\star \rangle \le \|\nabla f(x_t)\| \cdot \|x_t - x^\star\|. \]
Note that we can bound $\|x_t - x^\star\|$ by $D$ since $f(x_t)$ is a nonincreasing sequence in $t$ and, hence, always at most $f(x_1)$. Hence, we obtain
\[ \|\nabla f(x_t)\| \ge \frac{R_t}{D}, \]
and, by (6.5),
\[ R_t - R_{t+1} \ge \frac{R_t^2}{2LD^2}. \tag{6.8} \]
Thus, to bound the number of iterations to reach $\varepsilon$ we need to solve the following calculus problem: given a sequence of numbers $R_1 \ge R_2 \ge R_3 \ge \cdots \ge 0$, with $R_1 \le LD^2$ and satisfying the recursive bound (6.8), find a bound on $T$ for which $R_T \le \varepsilon$.
Before we give a formal proof that $T$ can be bounded by $O\left(\frac{LD^2}{\varepsilon}\right)$, let us provide some intuition by analyzing a continuous time analogue of the recursion (6.8), which gives rise to the following dynamical system:
\[ \frac{d}{dt} R(t) = -\alpha R(t)^2, \]
where $R : [0, \infty) \to \mathbb{R}$ with $R(0) = LD^2$ and $\alpha = \frac{1}{LD^2}$. To solve it, first rewrite the above as
\[ \frac{d}{dt}\left[\frac{1}{R(t)}\right] = \alpha, \]
and hence we obtain
\[ R(t) = \frac{1}{R(0)^{-1} + \alpha t}. \]
From this we deduce that to reach $R(t) \le \varepsilon$, we need to take
\[ t \ge \frac{1 - \frac{\varepsilon}{R(0)}}{\varepsilon\alpha} \approx \Theta\left(\frac{LD^2}{\varepsilon}\right). \]
To formalize this argument in the discrete setting (where $t = 1, 2, \ldots$), the idea is to first estimate the number of steps for $R_1$ to drop below $R_1/2$, then to go from $R_1/2$ to $R_1/4$, then from $R_1/4$ to $R_1/8$, and so on, until we reach $\varepsilon$.
Given (6.8), to go from $R_t$ to $R_t/2$ it suffices to make $k$ steps, where
\[ k \cdot \frac{(R_t/2)^2}{2LD^2} \ge R_t/2. \]
In other words, $k$ needs to be at least $\frac{4LD^2}{R_t}$. Let $r := \left\lceil \log \frac{R_1}{\varepsilon} \right\rceil$. By repeatedly halving $r$ times to get from $R_1$ to $\varepsilon$, the required number of steps is at most
\[ \sum_{i=0}^{r} \left\lceil \frac{4LD^2}{R_1 \cdot 2^{-i}} \right\rceil \le (r+1) + \sum_{i=0}^{r} \frac{4LD^2}{R_1} \cdot 2^i \le (r+1) + \frac{4LD^2}{R_1} \cdot 2^{r+1} = O\left(\frac{LD^2}{\varepsilon}\right). \]
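One can also check this bound numerically by iterating the worst case of the recursion (6.8), i.e., assuming it always holds with equality. The following sketch (illustrative only) shows that the number of steps to reach $\varepsilon$ scales like $LD^2/\varepsilon$.

```python
# Numeric sanity check (illustrative): iterate the slowest progress that
# (6.8) allows, R_{t+1} = R_t - R_t^2 / (2 L D^2), starting from R_1 = L D^2,
# and count the steps until R_t <= eps.
def steps_to_eps(L, D, eps):
    R = L * D * D
    t = 1
    while R > eps:
        R = R - R * R / (2 * L * D * D)
        t += 1
    return t

L, D = 2.0, 1.0
for eps in [1e-1, 1e-2, 1e-3]:
    # The observed count tracks the prediction 2*L*D^2/eps quite closely.
    print(eps, steps_to_eps(L, D, eps), 2 * L * D * D / eps)
```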
6.3.1 A lower bound
One can ask whether the gradient descent algorithm proposed in the previous section, which performs $O(\varepsilon^{-1})$ iterations, is optimal. Consider now a general model for first-order black box minimization which includes gradient descent and many related algorithms. The algorithm is given $x_1 \in \mathbb{R}^n$ and access to a gradient oracle for a convex function $f : \mathbb{R}^n \to \mathbb{R}$. It produces a sequence of points $x_1, x_2, \ldots, x_T$ such that
\[ x_t \in x_1 + \mathrm{span}\{\nabla f(x_1), \ldots, \nabla f(x_{t-1})\}, \tag{6.9} \]
i.e., the algorithm can move only in the subspace spanned by the gradients at previous iterations. Note that gradient descent clearly follows this scheme, as
\[ x_t = x_1 - \eta \sum_{j=1}^{t-1} \nabla f(x_j). \]
We do not restrict the running time of one iteration of such an algorithm; in fact, we allow it to perform an arbitrarily long calculation to compute $x_t$ from $x_1, \ldots, x_{t-1}$ and the corresponding gradients. In this model we are interested only in the number of iterations. The lower bound is as follows.

Theorem 6.4 (Lower bound) Consider any algorithm for solving the convex unconstrained minimization problem $\min_{x \in \mathbb{R}^n} f(x)$ in the model (6.9), when the gradient of $f$ is Lipschitz continuous (with a constant $L$) and the initial point $x_1 \in \mathbb{R}^n$ satisfies $\|x_1 - x^\star\| \le D$. There is a function $f$ such that for any $1 \le T \le \frac{n}{2}$ it holds that
\[ \min_{1 \le i \le T} f(x_i) - \min_{x \in \mathbb{R}^n} f(x) \ge \frac{LD^2}{T^2}. \]
The above theorem translates to a lower bound of $\Omega\left(\frac{1}{\sqrt{\varepsilon}}\right)$ iterations to reach an $\varepsilon$-optimal solution. This lower bound does not quite match the upper bound of $\frac{1}{\varepsilon}$ established in Theorem 6.2. Therefore, one can ask: is there a method which matches the above $\frac{1}{\sqrt{\varepsilon}}$ iterations bound? Surprisingly, the answer is yes. This can be achieved using the so-called accelerated gradient descent, which is the topic of Chapter 8. We also prove Theorem 6.4 in Exercise 8.4.

6.3.2 Projected gradient descent for constrained optimization
So far we have discussed the unconstrained optimization problem; however, the same method can also be extended to the constrained setting, where we would like to solve
\[ \min_{x \in K} f(x) \]
for a convex subset $K \subseteq \mathbb{R}^n$.
When applying the gradient descent method, the next iterate $x_{t+1}$ might fall outside of the convex body $K$. In this case we need to project it back onto $K$: find the point in $K$ with the minimum Euclidean distance to $x$, denoted by $\mathrm{proj}_K(x)$. Formally, for a closed set $K \subseteq \mathbb{R}^n$ and a point $x \in \mathbb{R}^n$,
\[ \mathrm{proj}_K(x) := \mathrm{argmin}_{y \in K} \|x - y\|. \]
We take the projected point to be our new iterate, i.e.,
\[ x_{t+1} = \mathrm{proj}_K\left(x_t - \eta_t \nabla f(x_t)\right). \]
This is called the projected gradient descent method. The convergence rate of this method remains the same (the proof carries over to this new setting by noting that $\|\mathrm{proj}_K(x) - \mathrm{proj}_K(y)\| \le \|x - y\|$ for all $x, y \in \mathbb{R}^n$). However, depending on $K$, the projection may or may not be difficult (or computationally expensive) to perform. More precisely, as long as the algorithm has access to an oracle which, given a query point $x$, returns the projection $\mathrm{proj}_K(x)$ of $x$ onto $K$, we have the following analogue of Theorem 6.2.

Theorem 6.5 (Projected gradient descent for Lipschitz gradient) There is an algorithm which, given first-order oracle access to a convex function $f : K \to \mathbb{R}$, access to the projection operator $\mathrm{proj}_K : \mathbb{R}^n \to K$, a bound $L$ on the Lipschitz constant of its gradient, an initial point $x_1 \in \mathbb{R}^n$, a value $D$ such that
\[ \max\{\|x - x^\star\| : f(x) \le f(x_1)\} \le D \]
(where $x^\star$ is an optimal solution to $\min_{x \in K} f(x)$), and an $\varepsilon > 0$, outputs a point $x \in K$ such that $f(x) \le f(x^\star) + \varepsilon$. The algorithm makes $T = O\left(\frac{LD^2}{\varepsilon}\right)$ queries to the first-order oracle for $f$ and the projection operator $\mathrm{proj}_K$, and performs $O(nT)$ arithmetic operations.
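As an illustration, the following sketch instantiates Theorem 6.5 for the box $[-1, 1]^n$, for which the projection oracle is simply coordinatewise clipping; the example objective is an assumption made for the demo.

```python
import numpy as np

def projected_gradient_descent(grad_f, proj_K, x1, L, T):
    """Projected gradient descent: x_{t+1} = proj_K(x_t - (1/L) grad_f(x_t))."""
    x = np.array(x1, dtype=float)
    for _ in range(T - 1):
        x = proj_K(x - (1.0 / L) * grad_f(x))
    return x

# Example: minimize ||x - c||^2 (L = 2) over the box K = [-1, 1]^n,
# whose projection operator is coordinatewise clipping.
c = np.array([2.0, -0.5, -3.0])
x = projected_gradient_descent(
    grad_f=lambda x: 2.0 * (x - c),
    proj_K=lambda x: np.clip(x, -1.0, 1.0),
    x1=np.zeros(3), L=2.0, T=100)
print(x)  # approximately [1.0, -0.5, -1.0]
```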

6.4 Application: the maximum flow problem
As an application of the algorithms developed in this chapter, we present an
algorithm to solve the problem of computing the s − t-maximum flow in an
undirected graph. This is our first nontrivial example of a discrete optimization
problem for which we first present a formulation as a linear program and then
show how we can use continuous optimization – in this case gradient descent
– to solve the problem.

6.4.1 The s − t-maximum flow problem
We first formally define the s − t-maximum flow problem. The input to this problem consists of an undirected, simple, and unweighted graph $G = (V, E)$ with $n := |V|$ and $m := |E|$, two special vertices $s, t \in V$, where $s$ is the "source" and $t$ is the "sink", and "capacities" $\rho \in \mathbb{Q}_{\ge 0}^m$. Let $B \in \mathbb{R}^{n \times m}$ denote the vertex-edge incidence matrix of the graph introduced in Section 2.8.2.
Recall also the definition of an s − t-flow in $G$ from Section 1.1 and Section 2.8.1. An s − t-flow in $G$ is an assignment $x : E \to \mathbb{R}$ that satisfies the following properties.³ For all vertices $u \in V \setminus \{s, t\}$, we require that the "incoming" flow is equal to the "outgoing" flow:
\[ \langle e_u, Bx \rangle = 0. \]

An s − t-flow satisfies the given capacities if $|x_i| \le \rho_i$ for all $i \in [m]$. The goal is to find a flow in $G$ that is feasible and maximizes the flow
\[ \langle e_s, Bx \rangle \]
out of $s$. Let $F^\star$ denote this maximum flow value.

Then, the following linear program encodes the s − t-maximum flow problem:
\[
\begin{aligned}
\max_{x \in \mathbb{R}^E,\; F \ge 0} \quad & F \\
\text{s.t.} \quad & Bx = Fb, \\
& |x_i| \le \rho_i \quad \forall i \in [m],
\end{aligned} \tag{6.10}
\]
where $b := e_s - e_t$.
From now on we assume that all the capacities are one, i.e., $\rho_i = 1$ for all $i \in [m]$, and hence the last constraint in (6.10) simplifies to $\|x\|_\infty \le 1$. The approach presented here can be extended to the general capacity case.
It is important to note that $F^\star \ge 1$ if in $G$ there is at least one path from $s$ to $t$; we assume this (we can check this property in $\widetilde{O}(m)$ time using breadth-first search). Moreover, since each edge has capacity 1, $F^\star \le m$. Thus, not only is feasibility not an issue, we know that $1 \le F^\star \le m$.

6.4.2 The main result
Theorem 6.6 (Algorithm for the s − t-maximum flow problem) There is an algorithm which, given an undirected graph $G$ with unit capacities, two vertices $s, t$, and an $\varepsilon > 0$, finds an s − t-flow of value at least $(1 - \varepsilon)F^\star$ in time $\widetilde{O}\left(\frac{m^{5/2}}{\varepsilon F^\star}\right)$.⁴

We do not give a fully detailed proof of the above; rather, we present the main steps and leave certain steps as exercises.

³ As mentioned earlier, we extend $x$ from $E$ to $V \times V$ as a skew-symmetric function on the edges. If the $i$th edge goes between $u$ and $v$ and $b_i = e_u - e_v$, we let $x(v, u) := -x(u, v)$. If we do not wish to refer to the direction of the $i$th edge, we use the notation $x_i$.
⁴ The notation $\widetilde{O}(f(n))$ means $O(f(n) \cdot \log^k f(n))$ for some constant $k > 0$.
6.4.3 A formulation as an unconstrained convex program
The first key idea is to reformulate (6.10) as an unconstrained convex optimization problem. Note that it is already convex (in fact linear); however, we would like to avoid complicated constraints such as "$\|x\|_\infty \le 1$". To this end, note that instead of maximizing $F$ we might just make a guess for $F$ and ask if there is a flow $x$ of value $F$ obeying the capacity constraints. If we could solve such a decision problem efficiently, then we could solve (6.10) up to good precision by performing a binary search over $F$. Thus it is enough, for a given $F \in \mathbb{Q}_{\ge 0}$, to solve the following problem:
\[
\begin{aligned}
\text{Find} \quad & x \in \mathbb{R}^m \\
\text{s.t.} \quad & Bx = Fb, \\
& \|x\|_\infty \le 1.
\end{aligned} \tag{6.11}
\]
If we denote
\[ H_F := \{x \in \mathbb{R}^m : Bx = Fb\} \]
and
\[ B_{m,\infty} := \{x \in \mathbb{R}^m : \|x\|_\infty \le 1\}, \]
then the set of solutions above is the intersection of these two convex sets, and hence the problem really asks the following question for the convex set $K := H_F \cap B_{m,\infty}$:

Is $K \ne \emptyset$? If so, find a point in $K$.

As we would like to use the gradient descent framework, we need to first state the above as a minimization problem. We have two choices for posing it as a convex optimization problem:
1. minimize the distance of $x$ from $B_{m,\infty}$ while constraining it to be in $H_F$, or
2. minimize the distance of $x$ from $H_F$ while constraining it to be in $B_{m,\infty}$.

It turns out that the first formulation has the following advantage over the sec-
ond: for a suitable choice of the distance function, the objective function is
convex, has an easily computable first-order oracle, and, importantly, the Lips-
chitz constant of its gradient is O(1). Thus, we reformulate (6.11) as minimiz-
ing the distance of x to Bm,∞ over x ∈ HF . Towards that, let P be the projection
operator P : Rm → Bm,∞ given as

P(x) := argmin kx − yk : y ∈ Bm,∞ .



The final form of (6.10) we would like to consider is
\[
\begin{aligned}
\min_{x \in \mathbb{R}^E} \quad & \|x - P(x)\|^2 \\
\text{s.t.} \quad & Bx = Fb.
\end{aligned} \tag{6.12}
\]
One might be concerned about the constraint $Bx = Fb$, as so far we have mostly talked about unconstrained optimization over $\mathbb{R}^m$. However, as $H_F$ is an affine subspace of $\mathbb{R}^m$, one can easily prove that as long as we, in every step, project the gradient onto $\{x : Bx = 0\}$, the conclusion of Theorem 6.2 still holds.⁵ We will see later that this projection has a fast $\widetilde{O}(m)$ oracle.
Further, since we expect to solve (6.12) only approximately, we need to show how to deal with the errors which might occur in this step and how they affect solving (6.10) given a solution to (6.12); we refer to Section 6.4.6 for a discussion of this.

6.4.4 Bounding parameters for gradient descent
To apply the gradient descent algorithm to problem (6.12) and estimate its running time we need to perform the following steps:
1. Prove that the objective $f(x) := \|x - P(x)\|^2$ is convex on $\mathbb{R}^m$.
2. Prove that $f$ has a Lipschitz continuous gradient and find the Lipschitz constant.
3. Find a "good starting point" $x_1 \in \mathbb{R}^m$.
4. Estimate the running time of a single iteration, i.e., how quickly we can compute the gradient.
We now discuss these steps one by one.

Step 1: Convexity. In general, for every convex set $S \subseteq \mathbb{R}^n$ the function $x \mapsto \mathrm{dist}^2(x, S)$ is convex; see Exercise 6.9. Here
\[ \mathrm{dist}(x, S) := \inf_{y \in S} \|x - y\|. \]

Step 2: Lipschitz continuous gradient. One has to compute the gradient of $f$ first to analyze its properties. For that, we first observe that the projection operator onto the hypercube is described rather simply by the following:
\[
P(x)_i = \begin{cases} x_i & \text{if } x_i \in [-1, 1], \\ -1 & \text{if } x_i < -1, \\ 1 & \text{if } x_i > 1. \end{cases} \tag{6.13}
\]

⁵ For that one can either consider projected gradient descent (see Theorem 6.5) or just repeat the proof of Theorem 6.2 over a linear subspace.
Hence,
\[ f(x) = \sum_{i=1}^m h(x_i), \]
where $h : \mathbb{R} \to \mathbb{R}$ is the function $\mathrm{dist}^2(z, [-1, 1])$, i.e.,
\[
h(z) = \begin{cases} 0 & \text{if } z \in [-1, 1], \\ (z + 1)^2 & \text{if } z < -1, \\ (z - 1)^2 & \text{if } z > 1. \end{cases}
\]
Thus, in particular,
\[
(\nabla f(x))_i = \begin{cases} 0 & \text{if } x_i \in [-1, 1], \\ 2(x_i + 1) & \text{if } x_i < -1, \\ 2(x_i - 1) & \text{if } x_i > 1. \end{cases}
\]
Such a function $\nabla f(x)$ is easily seen to be Lipschitz continuous, with Lipschitz constant $L = 2$ with respect to the Euclidean norm.
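In code, the objective and its gradient are one-liners; the following numpy sketch is purely illustrative.

```python
import numpy as np

def P(x):            # projection onto the cube B_{m,infty}, Equation (6.13)
    return np.clip(x, -1.0, 1.0)

def f(x):            # f(x) = ||x - P(x)||^2 = sum_i h(x_i)
    return np.sum((x - P(x)) ** 2)

def grad_f(x):       # the gradient formula derived above; 2-Lipschitz
    return 2.0 * (x - P(x))

x = np.array([0.5, -1.5, 2.0])
print(f(x), grad_f(x))   # 1.25 and [ 0. -1.  2.]
```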

Step 3: Good starting point. We need to provide a flow $g \in H_F$ which is as close as possible to the "cube" $B_{m,\infty}$. For that we can just find the flow $g \in H_F$ with the smallest Euclidean norm. In other words, we can project the origin onto the affine subspace $H_F$ to obtain a flow $g$ (we discuss the time complexity of this task in Step 4). Note that if $\|g\|^2 > m$ then the optimal value of (6.12) is nonzero, which is enough to conclude that $F^\star < F$. Indeed, if there were a flow $x \in B_{m,\infty}$ such that $Bx = Fb$, then $x$ would be a point in $H_F$ of Euclidean norm
\[ \|x\|^2 \le m\|x\|_\infty^2 \le m, \]
and hence we would arrive at a contradiction with the choice of $g$. Such a choice of $x_1 = g$ implies that $\|x_1 - x^\star\| \le 2\sqrt{m}$, thus we have $D = O(\sqrt{m})$.

Step 4: The complexity of computing gradients. In Step 2 we have already derived a formula for the gradient of $f$. However, since we are working in a constrained setting, the vector $\nabla f(x)$ needs to be projected onto the linear subspace $H := \{x \in \mathbb{R}^m : Bx = 0\}$. In Exercise 6.10 we show that, if we let
\[ \Pi := B^\top (BB^\top)^+ B, \]
where $(BB^\top)^+$ is the pseudoinverse of $BB^\top$ (the Laplacian of $G$), then the matrix $I - \Pi : \mathbb{R}^m \to \mathbb{R}^m$ corresponds to the operator $\mathrm{proj}_H$. Thus, in every step $t$ of the projected gradient descent algorithm, we need to compute
\[ (I - \Pi)\nabla f(x_t). \]
While $\nabla f(x_t)$ is easy to compute in linear time given the formulas derived previously, computing its projection might be quite expensive. In fact, even if we precomputed $\Pi$ (which would take roughly $O(m^3)$ time), it would still take $O(m^2)$ time to apply it to the vector $\nabla f(x_t)$. To the rescue comes an important and nontrivial result that such a projection can be computed in time $\widetilde{O}(m)$. This is achieved by noting that the problem reduces to solving Laplacian systems of the form $BB^\top y = a$.
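The following dense sketch illustrates this projection on a toy graph, using a pseudoinverse for simplicity; this is for illustration only, since the $\widetilde{O}(m)$ algorithms instead solve the Laplacian system $BB^\top y = Bv$ with a fast Laplacian solver.

```python
import numpy as np

def project_to_circulations(B, v):
    """Return (I - Pi) v, where Pi = B^T (B B^T)^+ B.
    Dense illustrative sketch; fast algorithms replace the pseudoinverse
    by a nearly linear time Laplacian solve of (B B^T) y = B v."""
    Lap = B @ B.T                       # the Laplacian of G
    y = np.linalg.pinv(Lap) @ (B @ v)   # y = (B B^T)^+ B v
    return v - B.T @ y                  # v - Pi v

# Tiny example: a triangle on vertices {1,2,3} with edges (1,2), (2,3), (3,1).
B = np.array([[ 1,  0, -1],
              [-1,  1,  0],
              [ 0, -1,  1]], dtype=float)   # vertex-edge incidence, n = m = 3
v = np.array([1.0, 3.0, 0.0])
w = project_to_circulations(B, v)
print(w, np.linalg.norm(B @ w))  # w = [4/3, 4/3, 4/3], and B w is (numerically) zero
```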

6.4.5 Running time analysis
Given the above discussion, we can bound the running time of the gradient descent based algorithm to find a $\delta$-approximate max-flow. By Theorem 6.5 the number of iterations to solve (6.12) up to an error of $\delta$ is
\[ O\left(\frac{LD^2}{\delta}\right) = O\left(\frac{m}{\delta}\right), \]
and the cost of every iteration (and of finding $x_1$) is $\widetilde{O}(m)$. Thus, in total, this requires $\widetilde{O}\left(\frac{m^2}{\delta}\right)$ time.
To recover a flow of value $F^\star(1 - \varepsilon)$ from a solution to (6.12) we need to solve (6.12) up to precision $\delta := \frac{F^\star \varepsilon}{\sqrt{m}}$ (see Section 6.4.6). Thus, we obtain a running time bound of $\widetilde{O}\left(\frac{m^{2.5}}{\varepsilon F^\star}\right)$. The binary search over $F$ to find $F^\star$ up to good precision incurs only a logarithmic factor in $m$ (since $1 \le F^\star \le m$) and, hence, does not significantly affect the running time. This completes the proof of Theorem 6.6.

6.4.6 Dealing with approximate solutions
A concern which arises when dealing with the formulation (6.12) is that in order to solve (6.11) we seemingly need to solve (6.12) exactly, i.e., with error $\varepsilon = 0$ (we need to know whether the optimal value is zero or nonzero), which is not really possible using gradient descent. This is dealt with using the following lemma.
Lemma 6.7 (From approximate to exact flows) Let $g \in \mathbb{R}^m$ be an s-t flow of value $F$ in the graph $G$, i.e., $Bg = Fb$. Suppose that $g$ overflows a total of $F_{ov}$ units of flow (formally, $F_{ov} := \|g - P(g)\|_1$) over all edges. There is an algorithm that, given $g$, finds a flow $g'$ that does not overflow any edge (i.e., $g' \in B_{m,\infty}$) of value $F' \ge F - F_{ov}$. The algorithm runs in time $O(m \log n)$.
The lemma says that any error we incur in solving (6.12) can be efficiently turned into an error in the original objective of (6.10). More precisely, if we solve (6.12) up to an error of $\delta > 0$, then we can efficiently recover a flow of value at least $F - \sqrt{m}\,\delta$. Thus, for the flow to be of value at least $F(1 - \varepsilon)$, we need to set $\delta := \frac{\varepsilon F}{\sqrt{m}}$.
6.5 Exercises
6.1 Let $f : \mathbb{R}^n \to \mathbb{R}$ be a differentiable function. Prove that if $\|\nabla f(x)\| \le G$ for all $x \in \mathbb{R}^n$ and some $G > 0$, then $f$ is $G$-Lipschitz continuous, i.e.,
    \[ \forall x, y \in \mathbb{R}^n, \quad |f(x) - f(y)| \le G\|x - y\|. \]
    Is the converse true? (See Exercise 7.1.)
6.2 Suppose that a differentiable function $f : \mathbb{R}^n \to \mathbb{R}$ has the property that for all $x, y \in \mathbb{R}^n$,
    \[ f(y) \le f(x) + \langle y - x, \nabla f(x) \rangle + \frac{L}{2}\|x - y\|^2. \tag{6.14} \]
    Prove that if $f$ is twice-differentiable and has a continuous Hessian then (6.14) is equivalent to $\nabla^2 f(x) \preceq LI$. Further, prove that if $f$ is also convex, then (6.14) is equivalent to the condition that the gradient of $f$ is $L$-Lipschitz continuous, i.e.,
    \[ \forall x, y \in \mathbb{R}^n, \quad \|\nabla f(x) - \nabla f(y)\| \le L\|x - y\|. \]
6.3 Let $\mathcal{M}$ be a nonempty family of subsets of $\{1, 2, \ldots, n\}$. For a set $M \in \mathcal{M}$, let $\mathbf{1}_M \in \mathbb{R}^n$ be the indicator vector of $M$, i.e., $\mathbf{1}_M(i) = 1$ if $i \in M$ and $\mathbf{1}_M(i) = 0$ otherwise. Consider the function $f : \mathbb{R}^n \to \mathbb{R}$ given by
    \[ f(x) := \log\left(\sum_{M \in \mathcal{M}} e^{\langle x, \mathbf{1}_M \rangle}\right). \]
    Prove that the gradient of $f$ is $L$-Lipschitz continuous with respect to the Euclidean norm, for some $L > 0$ that depends polynomially on $n$.
6.4 Prove Theorem 6.5.
6.5 Gradient descent for strongly convex functions. In this problem we analyze a gradient descent algorithm for minimizing a twice-differentiable convex function $f : \mathbb{R}^n \to \mathbb{R}$ which satisfies $mI \preceq \nabla^2 f(x) \preceq MI$ for every $x \in \mathbb{R}^n$, for some $0 < m \le M$.
    The algorithm starts with some $x_1 \in \mathbb{R}^n$ and at every step $t = 1, 2, \ldots$ it chooses the next point
    \[ x_{t+1} := x_t - \alpha_t \nabla f(x_t), \]
    where $\alpha_t$ is chosen to minimize the value $f(x_t - \alpha \nabla f(x_t))$ over all $\alpha \in \mathbb{R}$ while fixing $x_t$. Let $y^\star := \min\{f(x) : x \in \mathbb{R}^n\}$.
    (a) Prove that
        \[ \forall x, y \in \mathbb{R}^n, \quad \frac{m}{2}\|y - x\|^2 \le f(y) - f(x) + \langle \nabla f(x), x - y \rangle \le \frac{M}{2}\|y - x\|^2. \]
    (b) Prove that
        \[ \forall x \in \mathbb{R}^n, \quad f(x) - \frac{1}{2m}\|\nabla f(x)\|^2 \le y^\star \le f(x) - \frac{1}{2M}\|\nabla f(x)\|^2. \]
    (c) Prove that for every $t = 1, 2, \ldots$
        \[ f(x_{t+1}) \le f(x_t) - \frac{1}{2M}\|\nabla f(x_t)\|^2. \]
    (d) Prove that for every $t = 1, 2, \ldots$
        \[ f(x_t) - y^\star \le \left(1 - \frac{m}{M}\right)^{t-1}\left(f(x_1) - y^\star\right). \]
        What is the number of iterations $t$ required to reach $f(x_t) - y^\star \le \varepsilon$?
    (e) Consider a linear system $Ax = b$, where $b \in \mathbb{R}^n$ is a vector and $A \in \mathbb{R}^{n \times n}$ is a symmetric positive definite matrix such that $\frac{\lambda_n(A)}{\lambda_1(A)} \le \kappa$ (where $\lambda_1(A)$ and $\lambda_n(A)$ are the smallest and the largest eigenvalues of $A$ respectively). Use the above framework to design an algorithm for approximately solving the system $Ax = b$ with logarithmic dependency on the error $\varepsilon > 0$ and polynomial dependency on $\kappa$. What is the running time?
6.6 Subgradient descent for nonsmooth functions. Consider the function $R : \mathbb{R}^n \to \mathbb{R}$ where $R(x) := \|x\|_1$.
    (a) Show that $R(x)$ is convex.
    (b) Show that $R(x)$ is not differentiable everywhere.
    (c) Recall that we say that $g \in \mathbb{R}^n$ is a subgradient of $f : \mathbb{R}^n \to \mathbb{R}$ at a point $x \in \mathbb{R}^n$ if
        \[ \forall y \in \mathbb{R}^n, \quad f(y) \ge f(x) + \langle g, y - x \rangle. \]
        Let $\partial f(x)$ denote the set of all subgradients of $f$ at $x$. Describe $\partial R(x)$ for every $x \in \mathbb{R}^n$.
    (d) Consider the following optimization problem:
        \[ \min\left\{\frac{1}{\eta}\|Ax - b\|^2 + R(x) : x \in \mathbb{R}^n\right\}, \]
        where $A \in \mathbb{R}^{m \times n}$, $b \in \mathbb{R}^m$ and $\eta > 0$. The objective is not differentiable, hence we cannot apply gradient descent directly. Use subgradients to handle the nondifferentiability of $R$. State the update rule and derive the corresponding running time.
6.7 Coordinate descent for smooth functions. Let $f : \mathbb{R}^n \to \mathbb{R}$ be a convex, twice-differentiable function with $\frac{\partial^2 f}{\partial x_i^2} \le \beta_i$ (for every $i = 1, 2, \ldots, n$) and let $B := \sum_{i=1}^n \beta_i$.
    (a) Let $x \in \mathbb{R}^n$ and let
        \[ x' := x - \frac{1}{\beta_i}\frac{\partial f(x)}{\partial x_i}\, e_i, \]
        where $i \in \{1, 2, \ldots, n\}$ is chosen at random from the distribution given by $p_i := \frac{\beta_i}{B}$. Prove that
        \[ \mathbb{E}\left[f(x')\right] \le f(x) - \frac{1}{2B}\|\nabla f(x)\|^2. \]
    (b) Use the above randomized update rule to design a randomized gradient descent-like algorithm which after $T$ steps satisfies
        \[ \mathbb{E}\left[f(x_T)\right] - f(x^\star) \le \varepsilon, \]
        whenever $T = \Omega\left(\frac{BD^2}{\varepsilon}\right)$, where $D := \max\{\|x - x_1\| : f(x) \le f(x_1)\}$.
6.8 Frank-Wolfe method. Consider the following algorithm for minimizing a convex function $f : K \to \mathbb{R}$ over a convex domain $K \subseteq \mathbb{R}^n$.
    • Initialize $x_1 \in K$.
    • For each iteration $t = 1, 2, \ldots, T$:
      – Define $z_t := \mathrm{argmin}_{x \in K}\{f(x_t) + \langle \nabla f(x_t), x - x_t \rangle\}$.
      – Let $x_{t+1} := (1 - \gamma_t)x_t + \gamma_t z_t$, for some $\gamma_t \in [0, 1]$.
    • Output $x_T$.
    (a) Prove that if the gradient of $f$ is $L$-Lipschitz continuous with respect to a norm $\|\cdot\|$, $\max_{x, y \in K}\|x - y\| \le D$, and $\gamma_t$ is taken to be $\Theta\left(\frac{1}{t}\right)$, then
        \[ f(x_T) - f(x^\star) \le O\left(\frac{LD^2}{T}\right), \]
        where $x^\star$ is any minimizer of $\min_{x \in K} f(x)$.
    (b) Show that one iteration of this algorithm can be implemented efficiently when given first-order oracle access to $f$ and the set $K$ is any of:
        • $K := \{x \in \mathbb{R}^n : \|x\|_\infty \le 1\}$,
        • $K := \{x \in \mathbb{R}^n : \|x\|_1 \le 1\}$,
        • $K := \{x \in \mathbb{R}^n : \|x\|_2 \le 1\}$.
6.9 Recall that for a nonempty subset $K \subseteq \mathbb{R}^n$ we can define the distance function $\mathrm{dist}(\cdot, K)$ and the projection operator $\mathrm{proj}_K : \mathbb{R}^n \to K$ as
    \[ \mathrm{dist}(x, K) := \inf_{y \in K} \|x - y\| \quad \text{and} \quad \mathrm{proj}_K(x) := \mathrm{arginf}_{y \in K} \|x - y\|. \]
    (a) Prove that $\mathrm{proj}_K$ is well-defined when $K$ is a closed and convex set, i.e., show that the minimum is attained at a unique point.
    (b) Prove that for all $x, y \in \mathbb{R}^n$,
        \[ \|\mathrm{proj}_K(x) - \mathrm{proj}_K(y)\| \le \|x - y\|. \]
    (c) Prove that for any convex set $K \subseteq \mathbb{R}^n$ the function $x \mapsto \mathrm{dist}^2(x, K)$ is convex.
    (d) Prove the correctness of the explicit formula (given in Equation (6.13) in this chapter) for the projection operator $\mathrm{proj}_K$ when $K = B_{m,\infty} = \{x \in \mathbb{R}^m : \|x\|_\infty \le 1\}$.
    (e) Prove that the function $f(x) := \mathrm{dist}^2(x, K)$ has a Lipschitz continuous gradient with Lipschitz constant equal to 2.
6.10 Let $G = (V, E)$ be an undirected graph with $n$ vertices and $m$ edges. Let $B \in \mathbb{R}^{n \times m}$ be the vertex-edge incidence matrix of $G$. Assume that $G$ is connected and let $\Pi := B^\top (BB^\top)^+ B$. Prove that, given a vector $g \in \mathbb{R}^m$, if we let $x^\star$ denote the projection of $g$ onto the subspace $\{x \in \mathbb{R}^m : Bx = 0\}$ (as defined in Exercise 6.9), then it holds that
    \[ x^\star = g - \Pi g. \]
6.11 s − t-minimum cut problem. Recall the s − t-maximum flow problem from this chapter.
    (a) Prove that the dual of the formulation (6.11) presented earlier in this chapter is equivalent to the following:
        \[
        \begin{aligned}
        \min_{y \in \mathbb{R}^n} \quad & \sum_{ij \in E} |y_i - y_j| \\
        \text{s.t.} \quad & y_s - y_t = 1.
        \end{aligned} \tag{6.15}
        \]
    (b) Prove that the optimal value of (6.15) is equal to $\mathrm{MinCut}_{s,t}(G)$: the minimum number of edges one needs to remove from $G$ to disconnect $s$ from $t$. This latter problem is known as the s − t-minimum cut problem.
    (c) Reformulate (6.15) as follows:
        \[
        \begin{aligned}
        \min_{x \in \mathbb{R}^m} \quad & \|x\|_1 \\
        \text{s.t.} \quad & x \in \mathrm{Im}(B^\top), \\
        & \langle x, z \rangle = 1,
        \end{aligned} \tag{6.16}
        \]
        for some $z \in \mathbb{R}^m$ that depends on $G$ and $s, t$. Write an explicit formula for $z$.
    (d) Apply gradient descent to solve the program (6.16). Estimate all relevant parameters and provide a complete analysis. What is the running time to reach a point with value at most $\mathrm{MinCut}_{s,t}(G) + \varepsilon$?
        Hint: to make the objective smooth (have a Lipschitz continuous gradient) replace $\|x\|_1$ by $\sum_{i=1}^m \sqrt{x_i^2 + \mu^2}$. Then pick $\mu$ appropriately to make the error incurred by this approximation small compared to $\varepsilon$.
Notes
While this chapter focuses on one version of gradient descent, there are several
variants of the gradient descent method and the reader is referred to the book
by Nesterov (2004) and the monograph by Bubeck (2015) for a discussion of
more variants. For more on the Frank-Wolfe method (introduced in Exercise
6.8) see the paper by Jaggi (2013). The lower bound (Theorem 6.4) was first
established in a paper by Nemirovski and Yudin (1983).
The s − t-maximum flow problem is one of the most well-studied problems in combinatorial optimization. Early combinatorial algorithms for this problem included those by Ford and Fulkerson (1956), Dinic (1970), and Edmonds and Karp (1972), leading to an algorithm by Goldberg and Rao (1998) that runs in $\widetilde{O}\left(m \min\{n^{2/3}, m^{1/2}\} \log U\right)$ time.
A convex optimization-based approach for the s − t-maximum flow problem was initiated in a paper by Christiano et al. (2011), who gave an algorithm for the s − t-maximum flow problem that runs in time $\widetilde{O}(mn^{1/3}\varepsilon^{-11/3})$. Section 6.4 is based on a paper by Lee et al. (2013). We refer the reader to Lee et al. (2013) for a stronger version of Theorem 6.6 that achieves a running time of $\widetilde{O}\left(\frac{m^{1.75}}{\sqrt{\varepsilon F^\star}}\right)$ by applying the accelerated gradient descent method (discussed in Chapter 8) instead of the version we use here; also see Exercise 8.5. By further optimizing the trade-off between these parameters one can obtain an $\widetilde{O}\left(mn^{1/3}\varepsilon^{-2/3}\right)$ time algorithm for the s − t-maximum flow problem. Nearly linear time algorithms for the s − t-maximum flow problem were discovered in papers by Sherman (2013), Kelner et al. (2014), and Peng (2016). All of these algorithms use techniques from convex optimization. See Chapter 11 for a different class of continuous algorithms for maximum flow problems whose dependence on $\varepsilon$ is $\mathrm{polylog}(\varepsilon^{-1})$.
All of the above results rely on the availability of fast Laplacian solvers. A nearly linear time algorithm for solving Laplacian systems was first discovered in a seminal paper of Spielman and Teng (2004). To read more about Laplacian systems and their applications to algorithm design, see the surveys by Spielman (2012) and Teng (2010), and the monograph by Vishnoi (2013).
7
Mirror Descent and the
Multiplicative Weights Update Method

We derive our second algorithm for convex optimization, called the mirror descent method, via a regularization viewpoint. First, the mirror descent algorithm is developed for optimizing convex functions over the probability simplex. Subsequently, we show how to generalize it and, importantly, derive the multiplicative weights update (MWU) method from it. This latter algorithm is then used to develop a fast approximate algorithm to solve the bipartite matching problem on graphs.

7.1 Beyond the Lipschitz gradient condition
Consider a convex program
\[ \min_{x \in K} f(x), \tag{7.1} \]
where $f : K \to \mathbb{R}$ is a convex function over a convex set $K \subseteq \mathbb{R}^n$. In Chapter 6 we introduced the (projected) gradient descent algorithm and proved that when $f$ satisfies the Lipschitz gradient condition, it can solve the problem (7.1) up to an additive error of $\varepsilon$ in a number of iterations proportional, roughly, to $\frac{1}{\varepsilon}$. As we have seen, several classes of functions, e.g., quadratic functions $f(x) = x^\top A x + x^\top b$ and squared distances to convex sets $f(x) = \mathrm{dist}^2(x, K')$ (for some convex $K' \subseteq \mathbb{R}^n$), satisfy the Lipschitz gradient condition.
In Chapter 6, we also introduced the bounded gradient condition. It states that there exists a $G > 0$ such that for all $x \in K$,
\[ \|\nabla f(x)\|_2 \le G. \tag{7.2} \]

Using the fundamental theorem of calculus (as in the proof of Lemma 6.3), this condition can be shown to imply that $f$ is $G$-Lipschitz, i.e.,
\[ |f(x) - f(y)| \le G\|x - y\|_2 \]
for all $x, y \in K$; see Exercise 6.1. Moreover, if $f$ is also convex, then these two conditions are equivalent; see Exercise 7.1. Note, however, that the bounded gradient condition may not imply the Lipschitz continuous gradient condition: for instance, it may be the case that $G = O(1)$, but there is no such bound on the Lipschitz constant of the gradient of $f$; see Exercise 7.2. In this case, one can prove the following theorem.

Theorem 7.1 (Guarantee for gradient descent when the gradient is bounded) There is a gradient descent-based algorithm which, given first-order oracle access to a convex function $f : \mathbb{R}^n \to \mathbb{R}$, a number $G$ such that $\|\nabla f(x)\|_2 \le G$ for all $x \in \mathbb{R}^n$, an initial point $x_1 \in \mathbb{R}^n$, a value $D$ such that $\|x_1 - x^\star\|_2 \le D$, and an $\varepsilon > 0$, outputs a sequence of points $x_1, \ldots, x_T$ such that
\[ f\left(\frac{1}{T}\sum_{t=1}^T x_t\right) - f(x^\star) \le \varepsilon, \]
where
\[ T = \left(\frac{DG}{\varepsilon}\right)^2. \]
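A sketch of one algorithm achieving this guarantee (gradient descent with a fixed step size and averaged iterates) appears below; the particular step size $\eta = D/(G\sqrt{T})$ is the standard choice for this setting, stated here as an assumption rather than derived.

```python
import numpy as np

def averaged_gradient_descent(grad_f, x1, D, G, T):
    """Sketch of a method matching Theorem 7.1: fixed step size
    eta = D / (G * sqrt(T)), returning the average of the iterates."""
    x = np.array(x1, dtype=float)
    eta = D / (G * np.sqrt(T))
    avg = np.zeros_like(x)
    for _ in range(T):
        avg += x
        x = x - eta * grad_f(x)
    return avg / T

# Example: f(x) = ||x - c||_2 has a gradient of norm G = 1 everywhere (x != c).
c = np.array([1.0, 2.0])
grad = lambda x: (x - c) / max(np.linalg.norm(x - c), 1e-12)
xbar = averaged_gradient_descent(grad, np.zeros(2), D=3.0, G=1.0, T=10000)
print(np.round(xbar, 2))  # close to c
```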

Note that the dependence on $\varepsilon$, as compared to the Lipschitz continuous gradient case (Theorem 6.2), has become worse: from $\frac{1}{\varepsilon}$ to $\frac{1}{\varepsilon^2}$. Moreover, note that the definition of bounded gradient (7.2) is in terms of the Euclidean norm. Sometimes we might find ourselves in a setting where $\|\nabla f\|_\infty = O(1)$, and a naive bound would only give us $\|\nabla f\|_2 = O(\sqrt{n})$. We see in this chapter how to deal with such situations by a major generalization of this version of gradient descent that can exploit the bounded gradient property in different norms. We introduce the mirror descent method: a powerful method that can be, on the one hand, viewed through the lens of gradient descent in a "dual" space through an appropriate conjugate function and, on the other hand, viewed as a proximal method in the "primal" space. It turns out that these viewpoints are equivalent, and we show that as well.

Remark 7.2 (Notation change in this chapter) Just in this chapter, we index vectors using upper indices: $x^1, x^2, \ldots, x^T$. This is to avoid confusion with the coordinates of these vectors, which are often referred to; i.e., the $i$th coordinate of the vector $x^t$ is denoted by $x_i^t$. Also, since the results of this chapter generalize to arbitrary norms, we are careful about the norm. In particular, $\|\cdot\|$ refers to a general norm and the Euclidean norm is explicitly denoted by $\|\cdot\|_2$.
7.2 A local optimization principle and regularizers
To construct an algorithm for optimizing functions with bounded gradients, we first introduce a general idea: regularization. Our algorithm is iterative; given points $x^1, x^2, \ldots, x^t$, it finds a new point $x^{t+1}$ based on its history. How do we choose the next point $x^{t+1}$ to converge to the minimizer $x^\star$ quickly? An obvious choice would be to let
\[ x^{t+1} := \mathrm{argmin}_{x \in K} f(x). \]
It certainly converges to $x^\star$ quickly (in one step), yet clearly it is not very helpful because then $x^{t+1}$ is hard to compute. To counter this problem, one might try to construct a function $f_t$ (a "simple model" of $f$) that approximates $f$ in a certain sense and is easier to minimize. Then the update rule of our algorithm becomes
\[ x^{t+1} := \mathrm{argmin}_{x \in K} f_t(x). \]

If the approximation $f_t$ of $f$ becomes more and more accurate near $x^t$ as $t$ increases then, intuitively, the sequence of iterates should converge to the minimizer $x^\star$.
One can view the gradient descent method for the Lipschitz continuous gradient case introduced in Chapter 6 as being in this class of algorithms, with
\[ f_t(x) := f(x^t) + \langle \nabla f(x^t), x - x^t \rangle + \frac{L}{2}\|x - x^t\|_2^2. \]
To see this, observe that $f_t$ is convex and, hence, its minimum occurs at a point $x^{t+1}$ where $\nabla f_t(x^{t+1}) = 0$. And
\[ 0 = \nabla f_t(x^{t+1}) = \nabla f(x^t) + L(x^{t+1} - x^t), \]
which implies that
\[ x^{t+1} = x^t - \frac{1}{L}\nabla f(x^t). \tag{7.3} \]
If the gradient of $f$ is $L$-Lipschitz continuous then $f(x) \le f_t(x)$ for all $x \in K$ and, moreover, $f_t(x)$ is a good approximation of $f(x)$ for $x$ in a small neighborhood around $x^t$. In this setting (as proved in Chapter 6), such an update rule guarantees convergence of $x^t$ to the global minimizer.
In general, when the functions we deal with do not have Lipschitz continuous gradients, we might not be able to construct such good quadratic approximations. However, we might still use first-order approximations of $f$ at $x^t$ (using subgradients if we have to): convexity of $f$ implies that if we define
\[ f_t(x) := f(x^t) + \langle \nabla f(x^t), x - x^t \rangle, \]
then
\[ f_t(x) \le f(x) \quad \text{for all } x \in K, \]
and, moreover, we expect $f_t$ to be a decent approximation of $f$ in a small neighborhood around $x^t$. Thus, one could try to apply the resulting update rule:
\[ x^{t+1} := \mathrm{argmin}_{x \in K}\left\{f(x^t) + \langle \nabla f(x^t), x - x^t \rangle\right\}. \tag{7.4} \]
A downside of the above is that it is very aggressive; in fact, the new point $x^{t+1}$ could be very far from $x^t$. This is easily illustrated by an example in one dimension: take $K = [-1, 1]$ and $f(x) = x^2$. If started at 1 and updated using (7.4), the algorithm jumps indefinitely between $-1$ and $1$, because one of these two points is always a minimizer of a linear lower bound of $f$ over $K$. Thus, the sequence $\{x^t\}_{t \in \mathbb{N}}$ never reaches $0$, the unique minimizer of $f$. The reader is encouraged to check the details.
The situation is even worse when the domain $K$ is unbounded: the minimum is not attained at any finite point and, hence, the update rule (7.4) is not well defined. This issue can be easily countered when the function is $\sigma$-strongly convex (see Definition 3.8) for some $\sigma > 0$, since then we can use a stronger, quadratic lower bound on $f$ at $x^t$, i.e.,
\[ f_t(x) = f(x^t) + \langle \nabla f(x^t), x - x^t \rangle + \frac{\sigma}{2}\|x - x^t\|_2^2. \]
The minimizer of $f_t(x)$ is then always attained at the following point (using the same calculation leading to (7.3)):
\[ x^{t+1} = x^t - \frac{1}{\sigma}\nabla f(x^t), \]
which can be made close to $x^t$ by choosing a large $\sigma$. For the case of $f(x) = x^2$ over $[-1, 1]$,
\[ x^{t+1} = \left(1 - \frac{2}{\sigma}\right)x^t, \]
which converges to $0$ when $\sigma > 1$.
This observation leads to the following idea: even if the gradient of $f$ is not Lipschitz continuous, we can still add a function to $f_t$ to make it smoother. Specifically, we add a term involving a "distance" function $D : K \times K \to \mathbb{R}$ that does not allow the new point $x^{t+1}$ to land far away from the previous point $x^t$; $D$ is referred to as a regularizer.¹ More precisely, instead of minimizing $f_t(x)$ we minimize $D(x, x^t) + f_t(x)$. To vary the relative importance of these two terms we also introduce a positive parameter $\eta > 0$ and write the revised update as:
\[ x^{t+1} := \mathrm{argmin}_{x \in K}\left\{D(x, x^t) + \eta\left(f(x^t) + \langle \nabla f(x^t), x - x^t \rangle\right)\right\}. \]
Since the above is an argmin, we can ignore the terms that do not depend on $x$ and simplify it to
\[ x^{t+1} = \mathrm{argmin}_{x \in K}\left\{D(x, x^t) + \eta\langle \nabla f(x^t), x \rangle\right\}. \tag{7.5} \]
Note that by picking a large $\eta$, the significance of the regularizer $D(x, x^t)$ is reduced and thus it does not play a big role in choosing the next step. By picking $\eta$ to be small we force $x^{t+1}$ to stay in a close vicinity of $x^t$.² However, unlike gradient descent, the value of the function does not have to decrease: $f(x^{t+1})$ may be larger than $f(x^t)$; hence, it is not clear how to analyze the progress of this method, and we explain this later.
Before we go any further with these general considerations, let us consider one important example, where the "right" choice of the distance function $D(\cdot, \cdot)$ for a simple convex set already leads to a very interesting algorithm: exponential gradient descent. This algorithm can then be generalized to the setting of an arbitrary convex set $K$.

¹ This $D$ should not be confused with a bound on the distance of the starting point to the optimal point.

7.3 Exponential gradient descent
Consider the convex optimization problem
\[ \min_{p \in \Delta_n} f(p), \tag{7.6} \]
where $f : \Delta_n \to \mathbb{R}$ is a convex function over the (closed and compact) $n$-dimensional probability simplex
\[ \Delta_n := \left\{p \in [0, 1]^n : \sum_{i=1}^n p_i = 1\right\}, \]
i.e., the set of all probability distributions over $n$ elements. From the discussion in the previous section, the general form of the algorithm we would like to construct is
\[ p^{t+1} := \mathrm{argmin}_{p \in \Delta_n}\left\{D(p, p^t) + \eta\langle \nabla f(p^t), p \rangle\right\}, \tag{7.7} \]

² A slightly different, yet related, way to ensure that the next point $x^{t+1}$ does not move too far away from $x^t$ is to first compute a candidate $\widetilde{x}^{t+1}$ according to the rule (7.4) and then to make a small step from $x^t$ towards $\widetilde{x}^{t+1}$ to obtain $x^{t+1}$. This is the main idea of the Frank-Wolfe algorithm; see Exercise 6.8.
where $D$ is a certain distance function on $\Delta_n$. The choice of $D$ is up to us, but ideally it should allow efficient computation of $p^{t+1}$ given $p^t$ and $\nabla f(p^t)$, and should be a "natural" metric compatible with the geometry of the feasible set $\Delta_n$, so as to guarantee quick convergence. For the probability simplex, one choice of such a metric is the relative entropy, called the Kullback-Leibler (KL) divergence. The appropriateness of this metric will be explained later.
one choice of such a metric is the relative entropy, called the Kullback-Leibler
(KL) divergence. The appropriateness of this metric will be explained later.

Definition 7.3 (Kullback-Leibler divergence over $\Delta_n$) For two probability distributions $p, q \in \Delta_n$, their Kullback-Leibler divergence is defined as
\[ D_{KL}(p, q) := -\sum_{i=1}^n p_i \log \frac{q_i}{p_i}. \]
For this definition to make sense, whenever $q_i = 0$ we require that $p_i = 0$. When $p_i = 0$ the corresponding term is set to $0$, as $\lim_{x \to 0^+} x \log x = 0$.

While not symmetric, $D_{KL}$ satisfies several natural distance-like properties. For instance, from convexity it follows that $D_{KL}(p, q) \ge 0$. The reason it is called a divergence is that it can also be seen as the Bregman divergence (Definition 3.9) corresponding to the function
\[ h(p) := \sum_{i=1}^n p_i \log p_i. \]
Recall that for a convex function $F : K \to \mathbb{R}$ that is differentiable over a convex subset $K$ of $\mathbb{R}^n$, the Bregman divergence corresponding to $F$ at $x$ with respect to $y$ is defined as
\[ D_F(x, y) := F(x) - F(y) - \langle \nabla F(y), x - y \rangle. \]

$D_F(x, y)$ measures the error in approximating $F(x)$ using the first-order Taylor approximation of $F$ at $y$. In particular, $D_F(x, y) \ge 0$ and $D_F(x, y) \to 0$ when $x$ is fixed and $y \to x$. In the next section we derive several other properties of $D_{KL}$.
When specialized to this particular distance function $D_{KL}$, the update rule takes the form
\[ p^{t+1} := \mathrm{argmin}_{p \in \Delta_n}\left\{D_{KL}(p, p^t) + \eta\langle \nabla f(p^t), p \rangle\right\}. \tag{7.8} \]
As we prove in the lemma below, the vector $p^{t+1}$ can be computed using an explicit formula involving only $p^t$ and $\nabla f(p^t)$. It is useful to extend the notion of KL-divergence to $\mathbb{R}^n_{\ge 0}$ by introducing the generalized KL-divergence $D_H$: it is the Bregman divergence corresponding to the function
\[ H(x) := \sum_{i=1}^n (x_i \log x_i - x_i). \]
Thus, for $x, y \in \mathbb{R}^n_{\ge 0}$,
\[ D_H(x, y) = -\sum_{i=1}^n x_i \log \frac{y_i}{x_i} + \sum_{i=1}^n (y_i - x_i), \]
where, as before, whenever $y_i = 0$ we require that $x_i = 0$, and when $x_i = 0$ the corresponding term is set to $0$ as $\lim_{x \to 0^+} x \log x = 0$. Note that $D_H(x, y) = D_{KL}(x, y)$ when $x, y \in \Delta_n$, and we often use $D_{KL}$ to denote $D_H$ for all nonnegative vectors (even outside of the simplex).
With the choice of the regularizer being $D_H$, the following lemma gives a characterization of the argmin in (7.5) for $D = D_H$ and $K = \mathbb{R}^n_{\ge 0}$ and $K = \Delta_n$ respectively. In the lemma below, we rename $q := x^t$, $g := \nabla f(x^t)$ and $w = x$ from (7.5).
Lemma 7.4 (Projection under KL-divergence) Consider any vector $q \in \mathbb{R}^n_{\ge 0}$ and a vector $g \in \mathbb{R}^n$.
1. Let $w^\star := \mathrm{argmin}_{w \ge 0}\{D_H(w, q) + \eta\langle g, w \rangle\}$. Then $w_i^\star = q_i \exp(-\eta g_i)$ for all $i = 1, 2, \ldots, n$.
2. Let $p^\star := \mathrm{argmin}_{p \in \Delta_n}\{D_H(p, q) + \eta\langle g, p \rangle\}$. Then $p^\star = \frac{w^\star}{\|w^\star\|_1}$.
Proof In the first part we are given an optimization problem of the form
\[ \min_{w \ge 0}\; \sum_{i=1}^n w_i \log w_i + \sum_{i=1}^n w_i(\eta g_i - \log q_i - 1) + \sum_{i=1}^n q_i. \tag{7.9} \]
The above problem is in fact convex; hence one just needs to find a zero of the gradient with respect to $w$ to find the minimum. By computing the gradient, we obtain the following optimality condition:
\[ \log w_i = -\eta g_i + \log q_i, \]
and, hence,
\[ w_i^\star = q_i \exp(-\eta g_i). \]
For the second part, we use ideas developed in Chapter 5. To incorporate the constraint $\sum_{i=1}^n p_i = 1$, we introduce a Lagrange multiplier $\mu \in \mathbb{R}$ and obtain
\[ \min_{p \ge 0}\; \sum_{i=1}^n p_i \log p_i + \sum_{i=1}^n p_i(\eta g_i - \log q_i) + \mu\left(\sum_{i=1}^n p_i - 1\right). \tag{7.10} \]
Algorithm 2: Exponential gradient descent (EGD)
Input:
  • First-order oracle for a convex function $f : \Delta_n \to \mathbb{R}$
  • An $\eta > 0$
  • An integer $T > 0$
Output: A point $\bar{p} \in \Delta_n$
Algorithm:
  1: Set $p^1 := \frac{1}{n}\mathbf{1}$ (the uniform distribution)
  2: for $t = 1, 2, \ldots, T$ do
  3:   Obtain $g^t := \nabla f(p^t)$
  4:   $w_i^{t+1} := p_i^t \exp(-\eta g_i^t)$
  5:   $p_i^{t+1} := \frac{w_i^{t+1}}{\sum_{j=1}^n w_j^{t+1}}$
  6: end for
  7: return $\bar{p} := \frac{1}{T}\sum_{t=1}^T p^t$

Then, the optimality condition becomes
\[ p_i = q_i \exp(-\eta g_i - \mu), \]
and thus we just need to pick $\mu$ so that $\sum_{i=1}^n p_i = 1$, which gives
\[ p^\star = \frac{w^\star}{\|w^\star\|_1}. \]
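Lemma 7.4 makes Algorithm 2 (stated above) completely explicit. The following compact numpy sketch, with an illustrative linear objective, is an assumption made for the example.

```python
import numpy as np

def egd(grad_f, n, eta, T):
    """Exponential gradient descent (Algorithm 2) on the simplex Delta_n."""
    p = np.full(n, 1.0 / n)        # p^1: the uniform distribution
    p_sum = np.zeros(n)
    for _ in range(T):
        p_sum += p
        g = grad_f(p)
        w = p * np.exp(-eta * g)   # multiplicative update (Lemma 7.4, part 1)
        p = w / w.sum()            # KL projection onto Delta_n (part 2)
    return p_sum / T               # the average of the iterates

# Example: linear f(p) = <c, p>; the minimizer puts all mass on argmin_i c_i.
c = np.array([0.9, 0.1, 0.5, 0.7])
p_bar = egd(lambda p: c, n=4, eta=0.1, T=2000)
print(np.round(p_bar, 3))          # most mass on coordinate 1 (where c_i = 0.1)
```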

7.3.1 Main theorem on exponential gradient descent
We now have all the background necessary to describe the exponential gradient descent (EGD) algorithm (Algorithm 2). In this algorithm, we introduce an auxiliary (weight) vector $w^t$ at every iteration. While it is not necessary for stating the algorithm, it is useful to refer to $w^t$ in the proof of the convergence guarantee of the algorithm.
Note one interesting difference between EGD and the variant of gradient descent studied in Chapter 6: the output of EGD is the average $\bar{p}$ of all iterates, not the last iterate $p^T$. It is not entirely clear under what conditions one can obtain a similar theorem with $p^T$ instead of $\bar{p}$.
To illustrate the problem, note that if $f(x) = |x|$, then the gradient at every point is either $1$ or $-1$.³ Hence, by knowing that the gradient at a certain point $x$ is $1$ we still have no clue whether $x$ is close to the minimizer ($0$) or very far from it. Thus, as opposed to the Lipschitz gradient case, the gradient at a point $x$ does not provide us with a certificate that $f(x)$ is close to optimal. Therefore, one naturally needs to gather more information, by visiting multiple points and averaging them in some manner.
Theorem 7.5 (Guarantees for EGD) Suppose that $f : \Delta_n \to \mathbb{R}$ is a convex function which satisfies $\|\nabla f(p)\|_\infty \le G$ for all $p \in \Delta_n$. If we let $\eta := \Theta\left(\frac{\sqrt{\log n}}{\sqrt{T}G}\right)$ then, after $T = \Theta\left(\frac{G^2 \log n}{\varepsilon^2}\right)$ iterations of the EGD algorithm, the point $\bar{p} := \frac{1}{T}\sum_{t=1}^T p^t$ satisfies
\[ f(\bar{p}) - f(p^\star) \le \varepsilon, \]
where $p^\star$ is any minimizer of $f$ over $\Delta_n$.

7.3.2 Properties of Bregman divergence
We present several important properties of the KL-divergence that are useful in the proof of Theorem 7.5. Many of these are more general and hold for any Bregman divergence.
We start with a simple, yet useful, identity involving the Bregman divergence.
Lemma 7.6 (Law of cosines for Bregman divergence) Let $F : K \to \mathbb{R}$ be a convex, differentiable function and let $x, y, z \in K$. Then
\[ \langle \nabla F(y) - \nabla F(z), y - x \rangle = D_F(x, y) + D_F(y, z) - D_F(x, z). \]
The proof of the above identity is a direct calculation (Exercise 7.7). Note that for the case when $F(x) = \|x\|_2^2$ the above says that for three points $a, b, c \in \mathbb{R}^n$ we have
\[ 2\langle a - c, b - c \rangle = \|b - c\|_2^2 + \|a - c\|_2^2 - \|b - a\|_2^2, \]
which is the familiar law of cosines in Euclidean space.
Perhaps the simplest property of $D_F$ is the fact that it is strictly convex in the first argument, i.e., the mapping
\[ x \mapsto D_F(x, y) \]
is strictly convex. This ensures, for instance, the existence and uniqueness of a point $u \in S$ that minimizes the divergence $D_F(u, x)$ for a point $x$ and a closed and convex set $S$. This is useful in the following generalization of the Pythagorean theorem.

³ Except at the point $x = 0$, which is very unlikely to be hit by an iterative algorithm; hence, we can ignore it.

Theorem 7.7 (Pythagorean theorem for Bregman divergence) Let $F : K \to \mathbb{R}$ be a convex, differentiable function and let $S \subseteq K$ be a closed, convex subset of $K$. Let $x \in S$ and $z \in K$, and let
\[ y := \mathrm{argmin}_{u \in S} D_F(u, z). \]
Then
\[ D_F(x, y) + D_F(y, z) \le D_F(x, z). \]
It is an instructive exercise to consider the special case of this theorem for $F(x) = \|x\|_2^2$. It says that if we project $z$ onto a convex set $S$ and call the projection $y$, then the angle between the vectors $x - y$ and $z - y$ is obtuse (larger than 90 degrees).

Proof of Theorem 7.7. Let $x, y, z$ be as in the statement of the theorem. By the optimality condition for the optimization problem $\min_{u \in S} D_F(u, z)$ at the minimizer $y$ (Theorem 3.14) we obtain that, for every point $w \in S$, if we let
\[ g(u) := D_F(u, z), \]
then
\[ \langle \nabla g(y), w - y \rangle \ge 0, \]
which translates to
\[ \langle \nabla F(y) - \nabla F(z), w - y \rangle \ge 0. \]
Plugging in $w = x$ and using Lemma 7.6, we obtain
\[ D_F(x, y) + D_F(y, z) - D_F(x, z) \le 0. \]
Finally, we state the following inequality, which asserts that the negative entropy function is 1-strongly convex with respect to the $\ell_1$-norm when restricted to the probability simplex $\Delta_n$ (Exercise 3.17(e)).
Lemma 7.8 (Pinsker's inequality) For every $x, y \in \Delta_n$ we have
\[ D_{KL}(x, y) \ge \frac{1}{2}\|x - y\|_1^2. \]
7.3.3 Convergence proof of EGD
Given the properties of the KL-divergence stated in the previous section, we are ready to proceed with the proof of Theorem 7.5. We show that for any $p \in \Delta_n$ it holds that
\[ f(\bar{p}) - f(p) \le \varepsilon, \]
where $\bar{p} = \frac{1}{T}\sum_{t=1}^T p^t$, provided the conditions of the theorem are satisfied; in particular, the result then holds for the minimizer $p^\star$.

Proof of Theorem 7.5.
Step 1: Bounding $f(\bar{p}) - f(p)$ by the gradient terms $\frac{1}{T}\sum_{t=1}^T \langle g^t, p^t - p \rangle$. Start by noting that a simple consequence of convexity of $f$ (used twice below) is that
\begin{align*}
f(\bar{p}) - f(p) &\le \frac{1}{T}\sum_{t=1}^T f(p^t) - f(p) \\
&= \frac{1}{T}\sum_{t=1}^T \left(f(p^t) - f(p)\right) \\
&\le \frac{1}{T}\sum_{t=1}^T \langle \nabla f(p^t), p^t - p \rangle \\
&= \frac{1}{T}\sum_{t=1}^T \langle g^t, p^t - p \rangle. \tag{7.11}
\end{align*}
Thus, from now on we focus on the task of providing an upper bound on the sum $\sum_{t=1}^T \langle g^t, p^t - p \rangle$.

Step 2: Writing $\langle g^t, p^t - p \rangle$ in terms of KL-divergences. Let us fix $t \in \{1, 2, \ldots, T\}$. First, we provide an expression for $g^t$ in terms of $w^{t+1}$ and $p^t$. Since
\[ w_i^{t+1} = p_i^t \exp(-\eta g_i^t) \]
for all $i \in \{1, 2, \ldots, n\}$, we have
\[ g_i^t = \frac{1}{\eta}\left(\log p_i^t - \log w_i^{t+1}\right). \]
This can also be written in terms of the gradient of the generalized negative entropy function $H(x) = \sum_{i=1}^n (x_i \log x_i - x_i)$ as follows:
\[ g^t = \frac{1}{\eta}\left(\log p^t - \log w^{t+1}\right) = \frac{1}{\eta}\left(\nabla H(p^t) - \nabla H(w^{t+1})\right), \tag{7.12} \]
where the $\log$ is applied coordinatewise to the appropriate vectors. Thus, using Lemma 7.6 (the law of cosines) we obtain that
\[ \langle g^t, p^t - p \rangle = \frac{1}{\eta}\langle \nabla H(p^t) - \nabla H(w^{t+1}), p^t - p \rangle = \frac{1}{\eta}\left(D_H(p, p^t) + D_H(p^t, w^{t+1}) - D_H(p, w^{t+1})\right). \tag{7.13} \]

Step 3: Using the Pythagorean theorem to get a telescoping sum. Now, since $p^{t+1}$ is the projection with respect to $D_H$ of $w^{t+1}$ onto $\Delta_n$ (see Lemma 7.4), the generalized Pythagorean theorem (Theorem 7.7) says that
\[ D_H(p, w^{t+1}) \ge D_H(p, p^{t+1}) + D_H(p^{t+1}, w^{t+1}). \]
Thus we can bound the expression $\sum_{t=1}^T \langle g^t, p^t - p \rangle$ as follows:
\begin{align*}
\eta \sum_{t=1}^T \langle g^t, p^t - p \rangle &= \sum_{t=1}^T D_H(p, p^t) + D_H(p^t, w^{t+1}) - D_H(p, w^{t+1}) \\
&\le \sum_{t=1}^T D_H(p, p^t) + D_H(p^t, w^{t+1}) - \left[D_H(p, p^{t+1}) + D_H(p^{t+1}, w^{t+1})\right] \\
&= \sum_{t=1}^T \left[D_H(p, p^t) - D_H(p, p^{t+1})\right] + \left[D_H(p^t, w^{t+1}) - D_H(p^{t+1}, w^{t+1})\right] \\
&\le D_H(p, p^1) + \sum_{t=1}^T \left[D_H(p^t, w^{t+1}) - D_H(p^{t+1}, w^{t+1})\right]. \tag{7.14}
\end{align*}
In the last step we used the fact that the first sum telescopes to $D_H(p, p^1) - D_H(p, p^{T+1})$ and that $D_H(p, p^{T+1}) \ge 0$.

Step 4: Using Pinsker's inequality and the bounded gradient to bound the remaining terms. To bound the second term we first apply the law of cosines:
\[ D_H(p^t, w^{t+1}) - D_H(p^{t+1}, w^{t+1}) = \langle \nabla H(p^t) - \nabla H(w^{t+1}), p^t - p^{t+1} \rangle - D_H(p^{t+1}, p^t) = \eta\langle g^t, p^t - p^{t+1} \rangle - D_H(p^{t+1}, p^t). \tag{7.15} \]
And, finally, we apply Pinsker's inequality (Lemma 7.8) to obtain
\[ D_H(p^t, w^{t+1}) - D_H(p^{t+1}, w^{t+1}) \le \eta\langle g^t, p^t - p^{t+1} \rangle - \frac{1}{2}\|p^{t+1} - p^t\|_1^2, \tag{7.16} \]
where we use the fact that $D_H$ restricted to arguments in $\Delta_n$ is the same as $D_{KL}$. Further, since $\langle g^t, p^t - p^{t+1} \rangle \le \|g^t\|_\infty \|p^t - p^{t+1}\|_1$, we can write
\begin{align*}
D_H(p^t, w^{t+1}) - D_H(p^{t+1}, w^{t+1}) &\le \eta\|g^t\|_\infty \|p^t - p^{t+1}\|_1 - \frac{1}{2}\|p^{t+1} - p^t\|_1^2 \\
&\le \eta G\|p^{t+1} - p^t\|_1 - \frac{1}{2}\|p^{t+1} - p^t\|_1^2 \\
&\le \frac{(\eta G)^2}{2}, \tag{7.17}
\end{align*}
where the last inequality follows by simply maximizing the quadratic function $x \mapsto \eta G x - \frac{1}{2}x^2$.

Step 5: Conclusion of the proof. By combining (7.14) with the summation of (7.17) over all $t$, we obtain
\[ \sum_{t=1}^T \langle g^t, p^t - p \rangle \le \frac{1}{\eta}\left(D_H(p, p^1) + T \cdot \frac{(\eta G)^2}{2}\right). \]
By observing that
\[ D_H(p, p^1) = D_{KL}(p, p^1) \le \log n \]
and optimizing the choice of $\eta$, the theorem follows.

7.4 Mirror descent
In this section we take inspiration from the exponential gradient descent algorithm to derive a general method for convex optimization called mirror descent.

7.4.1 Generalizing the EGD algorithm and the proximal viewpoint
The main idea follows the intuition provided at the very beginning of the chapter. Recall that our update rule is (7.5): at a given point $x^t$ we construct a linear lower bound
\[ f(x^t) + \langle \nabla f(x^t), x - x^t \rangle \le f(x) \]
and move to the next point, which is the minimizer of this lower bound "regularized" by a distance function $D(\cdot, \cdot)$; thus we get (7.5):
\[ x^{t+1} := \mathrm{argmin}_{x \in K}\left\{D(x, x^t) + \eta\langle \nabla f(x^t), x \rangle\right\}. \]
When deriving the exponential gradient descent algorithm we used the generalized KL-divergence $D_H(\cdot, \cdot)$. Mirror descent is defined with respect to $D_R(\cdot, \cdot)$, for an arbitrary convex regularizer $R : \mathbb{R}^n \to \mathbb{R}$. In general, denoting the gradient at step $t$ by $g^t$, we have
\begin{align*}
x^{t+1} &= \mathrm{argmin}_{x \in K}\left\{D_R(x, x^t) + \eta\langle g^t, x \rangle\right\} \\
&= \mathrm{argmin}_{x \in K}\left\{\eta\langle g^t, x \rangle + R(x) - R(x^t) - \langle \nabla R(x^t), x - x^t \rangle\right\} \tag{7.18} \\
&= \mathrm{argmin}_{x \in K}\left\{R(x) - \langle \nabla R(x^t) - \eta g^t, x \rangle\right\},
\end{align*}
where in the last step we have ignored the terms that do not depend on $x$. Let $w^{t+1}$ be a point such that
\[ \nabla R(w^{t+1}) = \nabla R(x^t) - \eta g^t. \]

It is not clear under what conditions such a point $w^{t+1}$ exists and we address this later; for now, assume such a $w^{t+1}$ exists. This corresponds to the same $w^{t+1}$ as we had in the EGD algorithm (the "unscaled" version of $p^{t+1}$). We then have
\begin{align*}
x^{t+1} &= \mathrm{argmin}_{x \in K}\left\{R(x) - \langle \nabla R(w^{t+1}), x \rangle\right\} \\
&= \mathrm{argmin}_{x \in K}\left\{R(x) - R(w^{t+1}) - \langle \nabla R(w^{t+1}), x - w^{t+1} \rangle\right\} \tag{7.19} \\
&= \mathrm{argmin}_{x \in K}\; D_R(x, w^{t+1}).
\end{align*}
Note again the analogy with the EGD algorithm: there, $p^{t+1}$ was obtained as a KL-divergence projection of $w^{t+1}$ onto the simplex $K = \Delta_n$, exactly as above. This is also called the proximal viewpoint of mirror descent, and the calculations above establish the equivalence of the regularization and proximal viewpoints.

7.4.2 The mirror descent algorithm
We are ready to state the mirror descent algorithm (Algorithm 3) in its general form. For it to be well defined, we need to make sure that $w^{t+1}$ always exists. Formally, we assume that the regularizer $R : \Omega \to \mathbb{R}$ has a domain $\Omega$ which contains $K$ as a subset. Moreover, as in the case of the entropy function, we assume that the map $\nabla R : \Omega \to \mathbb{R}^n$ is a bijection; this is perhaps more than what we really need, but such an assumption makes the picture much clearer. In fact, $R$ is sometimes referred to as the mirror map.
Algorithm 3: Mirror descent
Input:
  • First-order oracle access to a convex function $f : K \to \mathbb{R}$
  • Oracle access to the map $\nabla R$ and its inverse
  • Projection operator with respect to $D_R(\cdot, \cdot)$
  • An initial point $x^1 \in K$
  • A parameter $\eta > 0$
  • An integer $T > 0$
Output: A point $\bar{x} \in K$
Algorithm:
  1: for $t = 1, 2, \ldots, T$ do
  2:   Obtain $g^t := \nabla f(x^t)$
  3:   Let $w^{t+1}$ be such that $\nabla R(w^{t+1}) = \nabla R(x^t) - \eta g^t$
  4:   $x^{t+1} := \mathrm{argmin}_{x \in K}\, D_R(x, w^{t+1})$
  5: end for
  6: return $\bar{x} := \frac{1}{T}\sum_{t=1}^T x^t$
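A generic sketch of Algorithm 3 is below, parametrized by $\nabla R$, its inverse, and the $D_R$-projection onto $K$; instantiating these three maps with log, exp, and $\ell_1$-normalization recovers EGD. The concrete maps in the demo are assumptions following Lemma 7.4.

```python
import numpy as np

def mirror_descent(grad_f, grad_R, grad_R_inv, proj_K, x1, eta, T):
    """Mirror descent (Algorithm 3): take a step in the "dual" space via
    grad_R, map back via its inverse, then D_R-project onto K."""
    x = np.array(x1, dtype=float)
    x_sum = np.zeros_like(x)
    for _ in range(T):
        x_sum += x
        w = grad_R_inv(grad_R(x) - eta * grad_f(x))  # grad R(w) = grad R(x) - eta g
        x = proj_K(w)                                # argmin_{x in K} D_R(x, w)
    return x_sum / T

# Instantiation with R = generalized negative entropy over K = Delta_n:
# grad_R = log, its inverse is exp, and the D_R-projection is l1-normalization.
c = np.array([0.9, 0.1, 0.5, 0.7])
x_bar = mirror_descent(
    grad_f=lambda p: c,
    grad_R=np.log, grad_R_inv=np.exp,
    proj_K=lambda w: w / w.sum(),
    x1=np.full(4, 0.25), eta=0.1, T=2000)
print(np.round(x_bar, 3))   # matches the EGD example above
```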

By the theory of conjugate functions developed in Chapter 5, we know that if $R$ is convex and closed, then the inverse of $\nabla R$ is $\nabla R^*$, where $R^*$ is the conjugate of $R$; see Theorem 5.10. Further, as mentioned earlier, if we assume that as $x$ tends to the boundary of the domain of $R$ the norm of the gradient of $R$ tends to infinity, then, along with the strict convexity of the map $x \mapsto D_R(x, y)$ and the compactness of the domain, we can be sure of the existence and uniqueness of the projection $x^{t+1}$ above. Note that in order for the algorithm to be useful, the mirror map $\nabla R$ (and its inverse) should be efficiently computable. Similarly, the projection step
\[ \mathrm{argmin}_{x \in K}\, D_R(x, w^{t+1}) \]
should also be computationally easy to perform. The efficiency of these two operations determines the time required to perform one iteration of mirror descent.

7.4.3 Convergence proof
We are now ready to state the iteration bound guarantee for the mirror descent algorithm presented in Algorithm 3.
Theorem 7.9 (Guarantees for mirror descent) Let $f : K \to \mathbb{R}$ and $R : \Omega \to \mathbb{R}$ be convex functions with $K \subseteq \Omega \subseteq \mathbb{R}^n$ and suppose the following assumptions hold:
1. The gradient map $\nabla R : \Omega \to \mathbb{R}^n$ is a bijection.
2. The function $f$ has bounded gradients with respect to a norm $\|\cdot\|$, i.e.,
   \[ \|\nabla f(x)\| \le G \quad \text{for all } x \in K. \]
3. $R$ is $\sigma$-strongly convex with respect to the dual norm $\|\cdot\|_*$, i.e.,
   \[ D_R(x, y) \ge \frac{\sigma}{2}\|x - y\|_*^2 \quad \text{for all } x, y \in \Omega. \]
If we let $\eta := \Theta\left(\frac{\sqrt{\sigma D_R(x^\star, x^1)}}{\sqrt{T}G}\right)$ then, after $T := \Theta\left(\frac{G^2 D_R(x^\star, x^1)}{\sigma\varepsilon^2}\right)$ iterations of the mirror descent algorithm, the point $\bar{x}$ satisfies
\[ f(\bar{x}) - f(x^\star) \le \varepsilon, \]
where $x^\star$ is any minimizer of $f$ over $K$.

Proof The proof of the above theorem follows exactly as the one we gave for Theorem 7.5, by replacing the generalized negative entropy function $H$ by the regularizer $R$ and replacing the KL-divergence terms $D_H$ by $D_R$.
We now go step by step through the proof of Theorem 7.5 and emphasize which properties of $R$ and $D_R$ are being used.
In Step 1, the reasoning in (7.11) used to obtain
\[ f(\bar{x}) - f(x^\star) \le \frac{1}{T}\sum_{t=1}^T \langle g^t, x^t - x^\star \rangle \]
is general and relies only on the convexity of $f$.
In Step 2, the facts used in equations (7.12) and (7.13) to prove that
\[ \langle g^t, x^t - x^\star \rangle = \frac{1}{\eta}\left(D_R(x^\star, x^t) + D_R(x^t, w^{t+1}) - D_R(x^\star, w^{t+1})\right) \]
are the definition of $w^{t+1}$ and the law of cosines, which is valid for any Bregman divergence (see Lemma 7.6). Subsequently, in Step 3, to arrive at the conclusion of (7.14),
\[ \eta\sum_{t=1}^T \langle g^t, x^t - x^\star \rangle \le D_R(x^\star, x^1) + \sum_{t=1}^T \left[D_R(x^t, w^{t+1}) - D_R(x^{t+1}, w^{t+1})\right], \]
only the generalized Pythagorean theorem (Theorem 7.7) is necessary.

Finally in Step 4, an analog of (7.17) that can be proved under our current
assumptions is

    D_R(x^t, w^{t+1}) − D_R(x^{t+1}, w^{t+1}) ≤ η‖g^t‖ ‖x^t − x^{t+1}‖_* − (σ/2)‖x^{t+1} − x^t‖_*².

The above follows from the strong convexity assumption on R (which is used in
place of Pinsker's inequality) and the Cauchy-Schwarz inequality for dual norms:

    ⟨u, v⟩ ≤ ‖u‖ ‖v‖_*.

The rest of the proof is the same and does not rely on any specific properties
of R or D_R. □

7.5 Multiplicative weights update framework


In the proof of Theorem 7.5, the only place we used the fact that the vectors
g^t are gradients of the function f at points p^t (for t = 1, 2, . . . , T) is the bound
in (7.11). Subsequently, the g^t s were treated as arbitrary vectors, and we proved
a guarantee that in order to reach

    (1/T) Σ_{t=1}^T ⟨g^t, p^t⟩ − min_{p∈∆_n} (1/T) Σ_{t=1}^T ⟨g^t, p⟩ ≤ ε,

it is enough to take T = O(G² log n / ε²). Let us now state this observation as a
theorem. Before doing so, we first state this general meta-algorithm – known under
the name multiplicative weights update (MWU) method; see Algorithm 4.
In the exercises, we develop the more standard variant of the MWU method
and prove guarantees about it.
By reasoning exactly as in the proof of Theorem 7.5, we obtain the following
theorem.

Theorem 7.10 (Guarantees for the MWU algorithm) Consider the MWU
algorithm presented in Algorithm 4. Assume that all of the vectors g^t provided by
the oracle satisfy ‖g^t‖_∞ ≤ G. Then, taking η = Θ(√(log n) / (√T G)), after
T = Θ(G² log n / ε²) iterations we have

    (1/T) Σ_{t=1}^T ⟨g^t, p^t⟩ − min_{p∈∆_n} (1/T) Σ_{t=1}^T ⟨g^t, p⟩ ≤ ε.

Algorithm 4: Multiplicative weights update (MWU) algorithm

Input:
• An oracle providing vectors g^t ∈ R^n at every step t = 1, 2, . . .
• A parameter η > 0
• An integer T > 0
Output: A sequence of probability distributions p^1, p^2, . . . , p^T ∈ ∆_n
Algorithm:
1: Initialize p^1 := (1/n)1 (the uniform probability distribution)
2: for t = 1, 2, . . . , T do
3:   Obtain g^t ∈ R^n from the oracle
4:   Let w^{t+1} ∈ R^n and p^{t+1} ∈ ∆_n be defined as

         w_i^{t+1} := p_i^t exp(−η g_i^t)   and   p_i^{t+1} := w_i^{t+1} / Σ_{j=1}^n w_j^{t+1}

5: end for

At this point it might not be clear what the purpose of stating such a theorem is.
However, as we will shortly see (including in several exercises), this theorem
allows us to come up with algorithms for numerous different problems based
on the idea of maintaining weights and changing them multiplicatively. In the
example we provide, we design an algorithm for checking if a bipartite graph
has a perfect matching. This can be further extended to linear programming or
even semidefinite programming.

7.6 Application: perfect matching in bipartite graphs


We formally define the problem of finding a perfect matching in a bipartite
graph. The input to this problem consists of an undirected and unweighted
bipartite graph G = (V = A ∪ B, E), where A, B are two disjoint sets of vertices
with A ∪ B = V, and all edges in E have one endpoint in A and another in B.
The goal is to find a perfect matching in G: a subset of edges M ⊆ E such
that every vertex v ∈ V is incident to exactly one of the edges in M. We assume
that in the bipartition (A, B) we have |A| = |B| = n, so that the total number of
vertices is 2n (note that if the cardinalities of A and B are not the same, then
there is no perfect matching in G). Let m := |E| as usual.
Our approach to solving this problem is based on solving the following linear
programming reformulation of the perfect matching problem:

    Find x ∈ R^m
    s.t.  Σ_{e∈E} x_e = n,
          Σ_{e: v∈e} x_e ≤ 1   for all v ∈ V,          (7.20)
          x_e ≥ 0              for all e ∈ E.
A solution to the above linear feasibility program is called a fractional perfect
matching. A question which arises very naturally is whether the "fractional"
problem is equivalent to the original one. The following is a classical exercise
in combinatorial optimization (see Exercise 2.27). It basically asserts that the
"fractional" bipartite matching polytope (defined in the linear programming
description above) has only integral vertices.

Theorem 7.11 (Integrality of the bipartite matching polytope) If G is a
bipartite graph then G has a perfect matching if and only if G has a fractional
perfect matching. Alternatively, the vertices of the bipartite matching polytope
of G, defined as the convex hull of the indicator vectors of all perfect matchings
in G, are exactly these indicator vectors.

Note that one direction of this theorem is trivial: if there is a perfect matching
M ⊆ E in G then its characteristic vector x = 1_M is a fractional perfect matching.
The other direction is harder and relies on the assumption that the graph is
bipartite.

Algorithmically, one can also convert fractional matchings into matchings
in Õ(|E|) time; we omit the details. Thus, solving (7.20) suffices to solve the
perfect matching problem in its original form.
As our algorithm naturally produces approximate answers, for an ε > 0, we
also define an ε-approximate fractional perfect matching to be an x ∈ R^m
which satisfies

    Σ_{e∈E} x_e = n,
    Σ_{e: v∈e} x_e ≤ 1 + ε   for all v ∈ V,
    x_e ≥ 0                  for all e ∈ E.

From such an approximate fractional matching one can (given the ideas discussed
above) construct a matching in G of cardinality at least (1 − ε)n and,
thus, also solve the perfect matching problem exactly by taking ε < 1/n.

7.6.1 The main result


We construct an algorithm based on the MWU algorithm to construct ε-approximate
fractional perfect matchings. Below we formally state its running time guaran-
tee.

Theorem 7.12 (Algorithm for the fractional bipartite matching problem)


Algorithm 5, given a bipartite graph G on 2n vertices and m edges that has a
perfect matching, and an ε > 0, outputs an ε-approximate fractional perfect
e −2 n2 m).
matching in G in time O(ε

The running time of the above algorithm is certainly not comparable to the best
known algorithms for this problem, but its advantage is its overall simplicity.

7.6.2 The algorithm


We construct an algorithm for finding fractional solutions to the perfect matching
problem, i.e., to (7.20). In every step of the algorithm, we would like to
construct a point x^t ∈ R^m_{≥0} which satisfies the constraint Σ_{e∈E} x_e^t = n, but is
not necessarily a fractional matching (meaning that it does not have to satisfy
Σ_{e: v∈e} x_e^t ≤ 1 + ε for every v). However, x^t should (in a certain sense) bring us
closer to satisfying all the constraints.

More precisely, we maintain positive weights w^t ∈ R^{2n} over the vertices V of
the graph G. The value w_v^t corresponds to the importance of the inequality
Σ_{e: v∈e} x_e ≤ 1 at the current state of the algorithm. Intuitively, the importance
is large whenever our "current solution" (which one should think of as the
average of all x^t produced so far) violates this inequality; more generally,
the larger the violation of the vth constraint, the larger w_v^t is.

Given such weights, we then compute a new point x^t to be any point which
satisfies Σ_{e∈E} x_e^t = n and satisfies the weighted average (with respect to w^t)
of all the inequalities Σ_{e: v∈e} x_e^t ≤ 1. Next, we update our weights based on the
violations of the inequalities by the new point x^t.

Algorithm 5 is not completely specified, as we have not yet provided a method
for finding x^t. In the analysis, we specify one particular method for finding x^t
and show that it gives an algorithm with a running time as in Theorem 7.12.

Algorithm 5: Approximate perfect matching in bipartite graphs

Input:
• A bipartite graph G = (V, E)
• An η > 0
• A positive integer T ∈ N
Output: An approximate fractional perfect matching x ∈ [0, 1]^m
Algorithm:
1: Initialize w^1 := (1, 1, . . . , 1) ∈ R^{2n}
2: for t = 1, 2, . . . , T do
3:   Find a point x^t ∈ R^m satisfying

         Σ_{v∈V} w_v^t (Σ_{e: v∈e} x_e^t) ≤ Σ_{v∈V} w_v^t,
         Σ_{e∈E} x_e^t = n,   and
         x_e^t ≥ 0 for all e ∈ E

4:   Construct a vector g^t ∈ R^{2n} with

         g_v^t := (1 − Σ_{e: v∈e} x_e^t) / n

5:   Update the weights:

         w_v^{t+1} := w_v^t · exp(−η · g_v^t) for all v ∈ V

6: end for
7: return x := (1/T) Σ_{t=1}^T x^t

7.6.3 Analysis
The analysis is divided into two steps: first, how to find x^t, and second, the
correctness of the algorithm given in the previous section (assuming a method
for finding x^t).

Step 1: An oracle to find x^t. In the lemma below we prove that x^t can
always be found efficiently, and in such a way that the infinity norm of g^t is
bounded by 1. As we will see soon, the bound on ‖g^t‖_∞ is crucial for the
efficiency of our algorithm.

Lemma 7.13 (Oracle) If G has a perfect matching then:

1. x^t always exists and can be found in O(m) time, and
2. we can guarantee that ‖g^t‖_∞ ≤ 1.

Proof If M is a perfect matching, then x^t = 1_M (the indicator vector of M)
satisfies all conditions. However, we do not know M and cannot compute it
easily, but still we would like to find such a point. Let us rewrite the condition

    Σ_{v∈V} w_v^t (Σ_{e: v∈e} x_e) ≤ Σ_{v∈V} w_v^t

in the form

    Σ_{e∈E} α_e x_e ≤ β,          (7.21)

where all the coefficients α_e and β are nonnegative, as follows:

    α_e := Σ_{v: v∈e} w_v^t   and   β := Σ_{v∈V} w_v^t.

If G has a perfect matching M, then there are n edges e_1, . . . , e_n ∈ M that do
not share any vertex and together cover all of V. Hence,

    Σ_{i=1}^n α_{e_i} = β.

Further, let e⋆ be the edge such that

    e⋆ := argmin_{e∈E} α_e.

Then, we have that

    n α_{e⋆} ≤ Σ_{i=1}^n α_{e_i} = β.

Hence, setting

    x_{e⋆}^t = n   and   x_{e′}^t = 0 for all e′ ≠ e⋆

is also a valid solution to (7.21). Such a choice of x^t also guarantees that for
every v ∈ V:

    −1 ≤ Σ_{e: v∈e} x_e^t − 1 ≤ n − 1,

which in particular implies that ‖g^t‖_∞ ≤ 1. □

Step 2: Proof of Theorem 7.12. We now invoke Theorem 7.10 to obtain the
guarantee claimed in Theorem 7.12 on the output of Algorithm 5. We start by
noticing that, if we set

    p^t := w^t / Σ_{v∈V} w_v^t,

then Algorithm 5 is a special case of the MWU algorithm with g^t satisfying
‖g^t‖_∞ ≤ 1. Further, we can plug in p := e_v for any fixed v ∈ V in Theorem 7.10
to conclude

    −(1/T) Σ_{t=1}^T g_v^t ≤ −(1/T) Σ_{t=1}^T ⟨p^t, g^t⟩ + δ          (7.22)

for T = Θ(log n / δ²) (since G = 1). The fact that x^t satisfies

    Σ_{v∈V} w_v^t (Σ_{e: v∈e} x_e^t) ≤ Σ_{v∈V} w_v^t

implies that

    Σ_{v∈V} w_v^t (1 − Σ_{e: v∈e} x_e^t) ≥ 0.

Dividing by n and ‖w^t‖₁, the above inequality can be seen to be the same as

    ⟨p^t, g^t⟩ ≥ 0.

Therefore, from the guarantee (7.22), for any v ∈ V we get

    (1/T) Σ_{t=1}^T (1/n)(Σ_{e: v∈e} x_e^t − 1) ≤ (1/T) · T · 0 + δ.

This implies that for all v ∈ V,

    Σ_{e: v∈e} x_e ≤ 1 + nδ.

Thus, to make the r.h.s. in the above inequality 1 + ε, we pick δ := ε/n and,
hence, T becomes Θ(n² log n / ε²). Further, since x is a convex combination of
the x^t for t = 1, . . . , T, it also satisfies Σ_{e∈E} x_e = n and x ≥ 0. Thus, if G contains
a perfect matching, the oracle keeps outputting a point x^t for t = 1, . . . , T and
the final point x is an ε-approximate fractional perfect matching.

It remains to reason about the running time. The number of iterations is
O(n² log n / ε²), and each iteration consists of finding x^t and updating the weights,
which can be done in O(m) time, as it just requires finding the edge e with
minimum Σ_{v: v∈e} w_v^t. This completes the proof of Theorem 7.12.
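
Putting Lemma 7.13 and the weight update together gives a complete implementation of Algorithm 5. The sketch below is ours: it represents the graph as a list of edges (u, v) with vertices numbered 0, . . . , 2n − 1, uses the single-edge oracle from the proof of Lemma 7.13, and the constants in the choices of T and η are illustrative rather than optimized.

    import math

    def approx_fractional_matching(edges, num_vertices, n, eps):
        # Algorithm 5: eps-approximate fractional perfect matching via MWU.
        # Assumes the bipartite graph has a perfect matching.
        T = max(1, int(4 * n * n * math.log(num_vertices) / eps ** 2))
        eta = math.sqrt(math.log(num_vertices) / T)
        w = [1.0] * num_vertices
        x_avg = [0.0] * len(edges)
        for _ in range(T):
            # Oracle (Lemma 7.13): put all mass n on the edge e* minimizing
            # alpha_e = w_u + w_v.
            e_star = min(range(len(edges)),
                         key=lambda e: w[edges[e][0]] + w[edges[e][1]])
            x_avg[e_star] += n / T
            # Gradient g_v = (1 - sum of x^t_e over edges at v) / n; only the
            # endpoints of e* carry a nonzero edge sum in this round.
            for v in range(num_vertices):
                load = n if v in edges[e_star] else 0.0
                w[v] *= math.exp(-eta * (1.0 - load) / n)
            # Normalizing the weights does not change the oracle's argmin
            # and avoids numerical underflow.
            s = sum(w)
            w = [wv / s for wv in w]
        return {edges[e]: x_avg[e] for e in range(len(edges))}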
Remark 7.14 It can be easily seen that the n² factor in the running time comes
from our bound |Σ_{e: v∈e} x_e^t − 1| ≤ n. If one could always produce a point x^t with

    |Σ_{e: v∈e} x_e^t − 1| ≤ ρ   for all v ∈ V,

then the running time would become O(ε⁻² m ρ² log n). Note that intuitively,
this should be possible to do, since x^t = 1_M (for any perfect matching M) gives
ρ = 1. However, at the same time we would like our procedure for finding x^t to
be efficient; preferably it should run in nearly linear time. It is an interesting
exercise to design a nearly linear time algorithm for finding x^t with the guarantee
ρ = 2, which yields a much better running time of only O(ε⁻² m log² n).

[Figure 7.1: The plot of the function f(x) = x²/(|x| + 1).]

7.7 Exercises
7.1 Prove that if f : R^n → R is a convex function which is G-Lipschitz
    continuous, then ‖∇f(x)‖ ≤ G for all x ∈ R^n. Extend this result to the
    case when f : K → R for some convex set K.
7.2 Give an example of a function f : R^n → R such that ‖∇f(x)‖₂ ≤ 1 for
    all x ∈ R^n but the Lipschitz constant of its gradient is unbounded.
7.3 Smoothed absolute value. Consider the function f(x) := x²/(|x| + 1) (see
    Figure 7.1). As one can see, the function f is a "smooth" variant of x ↦ |x|:
    it is differentiable everywhere and |f(x) − |x|| → 0 when x tends to
    either +∞ or −∞. Similarly, one can consider a multivariate extension
    F : R^n → R of f given by

        F(x) := Σ_{i=1}^n x_i² / (|x_i| + 1).

    This can be seen as a smoothening of the ℓ₁-norm ‖x‖₁. Prove that

        ‖∇F(x)‖_∞ ≤ 1

    for all x ∈ R^n.
7.4 Soft-max function. Since the function x ↦ max{x₁, x₂, . . . , x_n} is not
    differentiable, one often considers the so-called soft-max function

        s_α(x) := (1/α) log(Σ_{i=1}^n e^{αx_i})

    for some α > 0, as a replacement for the maximum. Prove that

        max{x₁, x₂, . . . , x_n} ≤ s_α(x) ≤ (log n)/α + max{x₁, x₂, . . . , x_n};

    thus, the larger α is, the better the approximation we obtain. Further,
    prove that for every x ∈ R^n,

        ‖∇s_α(x)‖_∞ ≤ 1.
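
    A quick numeric sanity check of the sandwich bound above (the sketch is
    ours and is not part of the exercise):

        import math, random

        def soft_max(x, alpha):
            # Stabilized evaluation of (1/alpha) * log(sum_i exp(alpha * x_i)).
            m = max(x)
            return m + math.log(sum(math.exp(alpha * (xi - m)) for xi in x)) / alpha

        random.seed(0)
        x = [random.uniform(-1, 1) for _ in range(10)]
        for alpha in (1.0, 10.0, 100.0):
            s = soft_max(x, alpha)
            assert max(x) <= s <= max(x) + math.log(len(x)) / alpha + 1e-12
            print(alpha, s - max(x))       # the gap shrinks as alpha grows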
7.5 Prove the following properties of the KL-divergence:
    (a) D_KL is not symmetric: in general, D_KL(p, q) ≠ D_KL(q, p) for p, q ∈ ∆_n.
    (b) D_KL is indeed a Bregman divergence of a convex function on ∆_n,
        and D_KL(p, q) ≥ 0.
7.6 Let F : R^n → R be a convex function. Prove that the function mapping
    x ↦ D_F(x, y) for a fixed y ∈ R^n is strictly convex.
7.7 Prove Lemma 7.6.
7.8 Prove that for all p ∈ ∆_n, D_KL(p, p^1) ≤ log n. Here p^1 is the uniform
    probability distribution with p_i^1 = 1/n for 1 ≤ i ≤ n.
7.9 Prove that for F(x) := ‖x‖₂² over R^n,

        D_F(x, y) = ‖x − y‖₂².
7.10 Gradient descent when the Euclidean norm of the gradient is bounded.
    Use Theorem 7.9 to prove that there is an algorithm that, given a convex
    and differentiable function f : R^n → R with a gradient bounded in
    the Euclidean norm by G, an ε > 0, and an initial point x^1 satisfying
    ‖x^1 − x⋆‖₂ ≤ D, produces x ∈ R^n satisfying

        f(x) − f(x⋆) ≤ ε

    in T iterations using a step size η, where

        T := (DG/ε)²   and   η := D/(G√T).
7.11 Stochastic gradient descent. In this problem we study how well gradi-
ent descent does when we have a relatively weak access to the gradient.
In the previous problem, we assumed that for a differentiable convex
function f : Rn → R, we have a first-order oracle: given x it outputs
∇ f (x). Now assume that when we give a point x, we get a random vector
g(x) ∈ Rn from some underlying distribution such that
E[g(x)] = ∇ f (x).
(If x itself has been chosen from some random distribution, then E[g(x)|x] =
∇f(x).) Assume that ‖∇f(x)‖₂ ≤ G for all x ∈ R^n. Subsequently, we do
    gradient descent in the following way. Pick some η > 0, and assume that
    the starting point x^1 is such that ‖x^1 − x⋆‖₂ ≤ D. Let

        x^{t+1} := x^t − η g(x^t).

    Now note that since g is a random variable, so is x^t for all t ≥ 2. Let

        x̄ := (1/T) Σ_{i=1}^T x^i.

    (a) Prove that if we set

            T := (DG/ε)²   and   η := D/(G√T),

        then

            E[f(x̄)] − f(x⋆) ≤ ε.

    (b) Suppose we have a number of labeled examples

            (a₁, l₁), (a₂, l₂), . . . , (a_m, l_m),

        where each a_i ∈ R^n and l_i ∈ R, and our goal is to find an x which
        minimizes

            f(x) := (1/m) Σ_{i=1}^m |⟨x, a_i⟩ − l_i|².

        In this case, for a given x, pick a number i uniformly at random
        from {1, 2, . . . , m} and output

            g(x) = 2a_i(⟨x, a_i⟩ − l_i).

(1) Prove that for this example E[g(x)] = ∇ f (x).


(2) What does the result proved above imply for this example?
(c) What is the advantage of this method over the traditional gradient
descent?
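
    For concreteness, here is a minimal sketch (ours) of the stochastic
    gradient estimator from part (b) together with the averaged iteration:

        import random

        def sgd_least_squares(a, l, eta, T):
            # Averaged SGD for f(x) = (1/m) * sum_i (<x, a_i> - l_i)^2.
            # a: list of m feature vectors; l: list of m labels.
            n, m = len(a[0]), len(a)
            x = [0.0] * n
            x_bar = [0.0] * n
            for _ in range(T):
                i = random.randrange(m)    # a uniformly random example
                r = sum(xj * aj for xj, aj in zip(x, a[i])) - l[i]
                # Unbiased estimate: g(x) = 2 * a_i * (<x, a_i> - l_i).
                x = [xj - eta * 2.0 * r * aj for xj, aj in zip(x, a[i])]
                x_bar = [xb + xj / T for xb, xj in zip(x_bar, x)]
            return x_bar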
7.12 s−t-minimum cut problem. Recall the formulation of the s−t-minimum
    cut problem in an undirected graph G = (V, E) with n vertices and m
    edges that was studied in Exercise 6.11, i.e.,

        MinCut_{s,t}(G) := min_{x∈R^n, x_s−x_t=1} Σ_{ij∈E} |x_i − x_j|.

    Apply the mirror descent algorithm with the regularizer R(x) = ‖x‖₂²
    along with Theorem 7.9 to find MinCut_{s,t}(G) exactly (note that it is an
    integer of value at most m). Estimate all relevant parameters and provide
    a bound on the running time. Explain how to deal with the (simple)
    constraint x_s − x_t = 1.

7.13 Min-max theorem for zero sum games. In this problem we apply the
MWU algorithm to approximately find equilibria in two player zero-sum
games.
    Let A ∈ R^{n×m} be a matrix with A(i, j) ∈ [0, 1] for all i ∈ [n] and
    j ∈ [m]. We consider a game between two players: the row player and the
    column player. The game consists of one round in which the row player
    picks one row i ∈ {1, 2, . . . , n} and the column player picks one column
    j ∈ {1, 2, . . . , m}. The goal of the row player is to minimize the value
    A(i, j), which she pays to the column player after such a round; the
    goal of the column player is of course the opposite (to maximize the
    value A(i, j)).

    The min-max theorem asserts that

        max_{q∈∆_m} min_{i∈{1,...,n}} E_{J←q} A(i, J) = min_{p∈∆_n} max_{j∈{1,...,m}} E_{I←p} A(I, j).          (7.23)

    Here E_{I←p} A(I, j) is the expected loss of the row player when using the
    randomized strategy p ∈ ∆_n against a fixed strategy j ∈ {1, 2, . . . , m} of
    the column player; E_{J←q} A(i, J) is defined similarly. Formally,

        E_{I←p} A(I, j) := Σ_{i=1}^n p_i A(i, j)   and   E_{J←q} A(i, J) := Σ_{j=1}^m q_j A(i, j).

    Let opt be the common value of the two quantities in (7.23), corresponding
    to two optimal strategies p⋆ ∈ ∆_n and q⋆ ∈ ∆_m respectively. Our
    goal is to use the MWU framework to construct, for any ε > 0, a pair of
    strategies p ∈ ∆_n and q ∈ ∆_m such that

        max_j E_{I←p} A(I, j) ≤ opt + ε   and   min_i E_{J←q} A(i, J) ≥ opt − ε.

    (a) Prove the following "easier" direction of von Neumann's theorem:

            max_{q∈∆_m} min_{i∈{1,...,n}} E_{J←q} A(i, J) ≤ min_{p∈∆_n} max_{j∈{1,...,m}} E_{I←p} A(I, j).

    (b) Give an algorithm which, given p ∈ ∆_n, constructs a j ∈ {1, 2, . . . , m}
        which maximizes E_{I←p} A(I, j). What is the running time of the
        algorithm? Show that for such a choice of j we have E_{I←p} A(I, j) ≥
        opt.

    We follow the MWU scheme with p^1, p^2, . . . , p^T ∈ ∆_n and the loss vector
    at step t being f^t := Aq^t, where q^t := e_j with j chosen so as to maximize
    E_{I←p^t} A(I, j). (Recall that e_j is the vector with 1 at coordinate j and 0
    elsewhere.)

    (c) Prove that ‖f^t‖_∞ ≤ 1 and ⟨p⋆, f^t⟩ ≤ opt for every t = 1, 2, . . . , T.
    (d) Use Theorem 7.10 to show that for T large enough,

            opt ≤ (1/T) Σ_{t=1}^T ⟨p^t, f^t⟩ ≤ opt + ε.

        What is the smallest value of T which suffices for this to hold?
        Conclude that for some t it holds that max_j E_{I←p^t} A(I, j) ≤ opt + ε.
    (e) Let

            q := (1/T) Σ_{t=1}^T q^t.

        Prove that for T as in part (d),

            min_i E_{J←q} A(i, J) ≥ opt − ε.

    (f) What is the total running time of the whole procedure to find an
        ε-approximate pair of strategies p and q we set out to find at the
        beginning of this problem?
7.14 Winnow algorithm for classification. Suppose we are given m labeled
examples, (a1 , l1 ), (a2 , l2 ), . . . , (am , lm ) where ai ∈ Rn are feature vectors,
and li ∈ {−1, +1} are their labels. Our goal is to find a hyperplane sepa-
rating the points labeled with a +1 from the −1 labeled points. Assume
that the dividing hyperplane contains 0 and its normal is nonnegative.
    Hence, formally, our goal is to find p ∈ R^n with p ≥ 0 such that

        sign(⟨a_i, p⟩) = l_i

    for every i ∈ {1, . . . , m}. By scaling we can assume that ‖a_i‖_∞ ≤ 1
    for every i ∈ {1, . . . , m} and that ⟨1, p⟩ = 1 (recall that 1 is the vector of all
    1s). For notational convenience we redefine a_i to be l_i a_i. The problem is
    thus reduced to finding a solution to the following linear programming
    problem: find a p such that

        ⟨a_i, p⟩ > 0 for every i ∈ {1, . . . , m}, where p ∈ ∆_n.

    Prove the following theorem.

    Theorem 7.15 Given a₁, . . . , a_m ∈ R^n and ε > 0, if there exists p⋆ ∈ ∆_n
    such that ⟨a_i, p⋆⟩ ≥ ε for every i ∈ {1, . . . , m}, then the Winnow algorithm
    (Algorithm 6) produces a point p ∈ ∆_n such that ⟨a_i, p⟩ > 0 for every
    i ∈ {1, . . . , m} in T = Θ(ln n / ε²) iterations.

    What is the running time of the entire algorithm?

Algorithm 6: The Winnow algorithm

Input:
• A set of m points a₁, . . . , a_m, where a_i ∈ R^n for every i
• An ε > 0
Output: A point p ∈ ∆_n satisfying Theorem 7.15
Algorithm:
1: Let T := Θ(ln n / ε²)
2: Let w_i^1 := 1 for every i ∈ {1, . . . , n}
3: for t = 1, 2, . . . , T do
4:   Let p_j^t := w_j^t / ‖w^t‖₁ for every j
5:   Check for all 1 ≤ i ≤ m whether ⟨a_i, p^t⟩ ≤ 0
6:   If no such i exists then stop and return p^t
7:   If there exists an i such that ⟨a_i, p^t⟩ ≤ 0:
     (a) Let f^t := −a_i
     (b) Update w_j^{t+1} := w_j^t exp(−ε f_j^t) for every j
8: end for
9: return (1/T) Σ_{t=1}^T p^t

7.15 Feasibility of linear inequalities. Consider a general linear feasibility
    problem which asks for a point x satisfying a system of inequalities

        ⟨a_i, x⟩ ≥ b_i

    for i = 1, 2, . . . , m, where a₁, a₂, . . . , a_m ∈ R^n and b₁, b₂, . . . , b_m ∈ R. The
    goal of this problem is to give an algorithm that, given an error parameter
    ε > 0, outputs a point x such that

        ⟨a_i, x⟩ ≥ b_i − ε          (7.24)

    for all i, whenever there is a solution to the above system of inequalities.
    We also assume the existence of an oracle that, given a vector p ∈ ∆_m,
    solves the following relaxed problem: does there exist an x such that

        Σ_{i=1}^m Σ_{j=1}^n p_i a_{ij} x_j ≥ Σ_{i=1}^m p_i b_i.          (7.25)

    Assume that when the oracle returns a feasible solution for a p, the
    solution x that it returns is not arbitrary but has the following property:

        max_i |⟨a_i, x⟩ − b_i| ≤ 1.

    Prove the following theorem:

    Theorem 7.16 There is an algorithm that, if there exists an x such that
    ⟨a_i, x⟩ ≥ b_i for all i, outputs an x̄ that satisfies (7.24). The algorithm
    makes at most O(ln m / ε²) calls to the oracle for the problem mentioned
    in (7.25).
7.16 Online convex optimization. Given a sequence of convex and differentiable
    functions f^1, f^2, . . . : K → R and a sequence of points x^1, x^2, . . . ∈
    K, define the regret up to time T to be

        Regret_T := Σ_{t=1}^T f^t(x^t) − min_{x∈K} Σ_{t=1}^T f^t(x).

    Consider the following strategy, inspired by the mirror descent algorithm
    in this chapter (called follow the regularized leader):

        x^{t+1} := argmin_{x∈K} { Σ_{i=1}^t f^i(x) + R(x) }

    for a convex regularizer R : K → R, with x^1 := argmin_{x∈K} R(x). Assume
    that the gradient of each f^i is bounded everywhere by G and that
    the diameter of K is bounded by D.

    Prove the following:

    (a) For every T = 0, 1, . . . ,

            Regret_T ≤ Σ_{t=1}^T (f^t(x^t) − f^t(x^{t+1})) − R(x^1) + R(x⋆),

        where x⋆ := argmin_{x∈K} Σ_{t=1}^T f^t(x).
    (b) Given an ε > 0, use this method with R(x) := (1/η)‖x‖₂² for an
        appropriate choice of η and T to get

            (1/T) Regret_T ≤ ε.

Notes
Mirror descent was introduced by Nemirovski and Yudin (1983). Beck and
Teboulle (2003) presented an alternative derivation and analysis of mirror
descent. In particular, they showed that mirror descent can be viewed as a
nonlinear projected-gradient type method, derived using a general distance-like
function instead of the squared Euclidean distance. This chapter covers both of
these viewpoints.
The MWU method has appeared in the literature at least as early as the 1950s
and has been rediscovered in many fields since then. It has applications to
optimization (e.g., Exercise 7.15), game theory (e.g., Exercise 7.13), machine
learning (e.g., Exercise 7.14), and theoretical computer science (Plotkin et al.
(1995); Garg and Könemann (2007); Barak et al. (2009)). We refer the reader
to the comprehensive survey by Arora et al. (2012). We note that the variant
of the MWU method we present is often called the “hedge” (see Arora et al.
(2012)). The algorithm for the s − t-maximum flow problem by Christiano
et al. (2011) (mentioned in Chapter 1) relies on MWU; see Vishnoi (2013) for
a presentation.
Matrix variants of the MWU method are studied in papers by Arora and Kale
(2016), Arora et al. (2005), Orecchia et al. (2008), and Orecchia et al. (2012).
This variant of the MWU method is used to design fast algorithms for solving
semidefinite programs which, in turn, are used to come up with approximation
algorithms for problems such as maximum cut and sparsest cut.
The MWU method is one method in the area of online convex optimiza-
tion (introduced in Exercise 7.16). See the monographs by Hazan (2016) and
Shalev-Shwartz (2012) for in depth treatment of online convex optimization.
The perfect matching problem on bipartite graphs has been extensively studied
in the combinatorial optimization literature; see the book by Schrijver
(2002a). It can be reduced to the s−t-maximum flow problem and, thus, solved
in O(nm) time using an algorithm that needs to compute at most n augmenting
paths, where each such iteration takes O(m) time (as it runs a depth-first search).
A more refined variant of the above algorithm, which runs in O(m√n) time,
was presented in the paper by Hopcroft and Karp (1973). Forty years after
the result of Hopcroft and Karp (1973), a partial improvement (for the sparse
regime, i.e., when m = O(n)) was obtained in a paper by Madry (2013). His
algorithm runs in Õ(m^{10/7}) time and, in fact, works for the s−t-maximum flow
problem on directed graphs with unit capacity. It relies on a novel modification
and application of interior point methods (which we introduce in Chapters 10
and 11).
8
Accelerated Gradient Descent

We present Nesterov’s accelerated gradient descent algorithm. This algorithm can be


viewed as a hybrid of the previously introduced gradient descent and mirror descent
methods. We also present an application of the accelerated gradient method to solving
a linear system of equations.

8.1 The setup


In this chapter we revisit the unconstrained optimization problem studied in
Chapter 6:

    min_{x∈R^n} f(x),

where f is convex and its gradient is L-Lipschitz continuous. While most of
the results in Chapter 6 were stated for the Euclidean norm, here we work
with a general pair of dual norms ‖·‖ and ‖·‖_*. We first extend the notion of
L-Lipschitz continuous gradient to all norms.

Definition 8.1 (L-Lipschitz continuous gradient with respect to an arbitrary
norm) A function f : R^n → R is said to have an L-Lipschitz gradient with
respect to a norm ‖·‖ if for all x, y ∈ R^n,

    ‖∇f(x) − ∇f(y)‖_* ≤ L‖x − y‖.

For a convex function, this condition is the same as L-smoothness:

    f(y) ≤ f(x) + ⟨y − x, ∇f(x)⟩ + (L/2)‖x − y‖²   for all x, y ∈ R^n.          (8.1)

We saw this before (see Exercise 6.2) when ‖·‖ is the Euclidean norm. The
general norm case can be established in a similar manner; see Exercise 8.1.


As in Chapter 7, we let R : R^n → R be a σ-strongly convex regularizer with
respect to a norm ‖·‖, i.e.,

    D_R(x, y) := R(x) − R(y) − ⟨∇R(y), x − y⟩ ≥ (σ/2)‖x − y‖².          (8.2)

Recall that D_R(x, y) is called the Bregman divergence of R at x with respect to y.
In this chapter we consider only regularizers for which the map ∇R : R^n → R^n
is bijective. When reading this chapter it might be instructive to keep in mind
the special case of

    R(x) := (1/2)‖x‖₂².

Then D_R(x, y) = (1/2)‖x − y‖₂² and ∇R is the identity map.
In Chapter 6, we saw that there is an algorithm for optimizing L-smooth
functions in roughly O(ε−1 ) iterations. The goal of this chapter is to give a new
algorithm that combines ideas from gradient descent and mirror descent and
achieves O(ε−1/2 ). In Exercise 8.4, we prove that in the black box model, this
is optimal.

8.2 Main result on accelerated gradient descent


The main result of this chapter is the following theorem. We present the crucial
steps of the underlying algorithm at the end of the proof of this theorem as
it requires setting several parameters whose choice becomes clear from the
proof. We are also back to our usual notation of using subscripts (x_t) instead of
superscripts (x^t) to denote the iterates in the algorithm.

Theorem 8.2 (Guarantees on accelerated gradient descent) There is an
algorithm (called accelerated gradient descent), which given:

• first-order oracle access to a convex function f : R^n → R,
• a number L such that the gradient of f is L-Lipschitz continuous with
  respect to a norm ‖·‖,
• oracle access to the gradient map ∇R and its inverse (∇R)⁻¹ = ∇R^*
  for a convex regularizer R : R^n → R,
• a bound on the strong convexity parameter σ > 0 of R with respect to ‖·‖,
• an initial point x_0 ∈ R^n such that D_R(x⋆, x_0) ≤ D² (where x⋆ is an optimal
  solution to min_{x∈R^n} f(x)), and
• an ε > 0,

outputs a point x ∈ R^n such that f(x) ≤ f(x⋆) + ε. The algorithm makes
T := O(√(LD²/(σε))) queries to the respective oracles and performs O(nT)
arithmetic operations.

Note that the algorithm in Theorem 6.2 required O(LD²/(σε)) iterations – exactly
the square of what the above theorem achieves.
8.3 Proof strategy: estimate sequences


In our proof of Theorem 8.2, instead of first stating the algorithm and then
proving its properties, we proceed in the opposite order. We first formulate
an important theorem asserting the existence of a so-called estimate sequence.
In the process of proving this theorem we derive – step by step – an
accelerated gradient descent algorithm, which then turns out to imply Theorem 8.2.

A crucial notion used in deriving the accelerated gradient descent algorithm
is that of an estimate sequence.

Definition 8.3 (Estimate sequence) A sequence (φ_t, λ_t, x_t)_{t∈N}, with functions
φ_t : R^n → R, values λ_t ∈ [0, 1] and vectors x_t ∈ R^n (for all t ∈ N), is said to
be an estimate sequence for a function f : R^n → R if it satisfies the following
properties.

(1) Lower bound. For all t ∈ N and for all x ∈ R^n,

    φ_t(x) ≤ (1 − λ_t) f(x) + λ_t φ_0(x).

(2) Upper bound. For all x ∈ R^n,

    f(x_t) ≤ φ_t(x).

Intuitively, we can think of the sequence (x_t)_{t∈N} as converging to a minimizer
of f. The functions (φ_t)_{t∈N} serve as approximations to f, which provide tighter
and tighter (as t increases) bounds on the gap f(x_t) − f(x⋆). More precisely,
condition (1) says that φ_t(x) is an approximate lower bound to f(x) and condition
(2) says that the minimum value of φ_t is above f(x_t).

To illustrate this definition, suppose for a moment that λ_t = 0 for some t ∈ N
in the estimate sequence. Then, from conditions (2) and (1) we obtain

    f(x_t) ≤ φ_t(x⋆) ≤ f(x⋆).

This implies that x_t is an optimal solution. Thus, as λ_t = 0 may be too ambitious,
λ_t → 0 is what we aim for. In fact, the accelerated gradient method
constructs a sequence λ_t which goes to zero as 1/t², a quadratic speed-up over
the standard gradient descent method. Formally, we prove the following theorem.
Theorem 8.4 (Existence of optimal estimate sequences) For every convex,
L-smooth (with respect to ‖·‖) function f : R^n → R, for every σ-strongly
convex regularizer R (with respect to the same norm ‖·‖), and for every x_0 ∈ R^n,
there exists an estimate sequence (φ_t, λ_t, x_t)_{t∈N} with

    φ_0(x) := f(x_0) + (L/(2σ)) D_R(x, x_0)

and

    λ_t ≤ c/t²

for some absolute constant c > 0.

Suppose now that D_R(x⋆, x_0) ≤ D². Then, what we obtain for such a sequence,
using conditions (2) and (1) with x = x⋆, is

    f(x_t) ≤ φ_t(x⋆)                                                        (8.3)
           ≤ (1 − λ_t) f(x⋆) + λ_t φ_0(x⋆)                                  (8.4)
           = (1 − λ_t) f(x⋆) + λ_t f(x_0) + λ_t (L/(2σ)) D_R(x⋆, x_0)        (8.5)
           = f(x⋆) + λ_t (f(x_0) − f(x⋆)) + λ_t (L/(2σ)) D_R(x⋆, x_0)        (8.6)
           ≤ f(x⋆) + λ_t (⟨x_0 − x⋆, ∇f(x⋆)⟩ + (L/2)‖x_0 − x⋆‖²)            (8.7)
             + λ_t (L/(2σ)) D_R(x⋆, x_0)                                     (8.8)
           = f(x⋆) + λ_t L ((1/2)‖x_0 − x⋆‖² + (1/(2σ)) D_R(x⋆, x_0))        (8.9)
           ≤ f(x⋆) + (cL/t²) ((1/2)‖x_0 − x⋆‖² + (1/(2σ)) D²)               (8.10)
           ≤ f(x⋆) + (3c/2) LD²/(σt²).                                       (8.11)

Here (8.4) is the lower bound condition, (8.5) uses the definition of φ_0,
(8.7)-(8.8) follow from the L-smoothness of f, (8.9) uses ∇f(x⋆) = 0,
(8.10) uses λ_t ≤ c/t² together with D_R(x⋆, x_0) ≤ D², and (8.11) follows from
(8.2), which gives ‖x_0 − x⋆‖² ≤ (2/σ) D_R(x⋆, x_0) ≤ 2D²/σ.

Thus, it is enough to take t ≈ √(LD²/(σε)) to make sure that f(x_t) − f(x⋆) ≤ ε.
We cannot yet deduce Theorem 8.2 from Theorem 8.4, as stated the latter is
not algorithmic – we need to know that such a sequence can be efficiently
computed using a first-order oracle to f and R only, while Theorem 8.4 only
claims existence. However, as we will see, the proof of Theorem 8.4 provides
an efficient algorithm to compute estimate sequences.

8.4 Construction of an estimate sequence


This section is devoted to proving Theorem 8.4. To start, we make the simplifying
assumption that L = 1; this can be ensured by considering f/L instead of f.
Similarly, we assume that R is 1-strongly convex, replacing R by R/σ if necessary.

8.4.1 Step 1: Iterative construction


The construction of the estimate sequence is iterative. Let x_0 ∈ R^n be an
arbitrary point. We set

    φ_0(x) := D_R(x, x_0) + f(x_0)   and   λ_0 := 1.

Thus, the lower bound condition in Definition 8.3 is trivially satisfied. The
upper bound condition follows from noting that

    φ_0⋆ := min_x φ_0(x) = f(x_0).

Thus,

    φ_0(x) = φ_0⋆ + D_R(x, x_0),          (8.12)

and in the case when R(x) := (1/2)‖x‖₂², this is just a parabola centered at x_0.
The construction of subsequent elements of the estimate sequence is inductive.
Suppose we are given (φ_{t−1}, x_{t−1}, λ_{t−1}). Then φ_t will be a convex combination
of φ_{t−1} and the linear lower bound L_{t−1} to f at a carefully chosen point
y_{t−1} ∈ R^n (to be defined later). More precisely, we set

    L_{t−1}(x) := f(y_{t−1}) + ⟨x − y_{t−1}, ∇f(y_{t−1})⟩.          (8.13)

Note that by the first-order convexity criterion,

    L_{t−1}(x) ≤ f(x)          (8.14)

for all x ∈ R^n. We set the new estimate to be

    φ_t(x) := (1 − γ_t)φ_{t−1}(x) + γ_t L_{t−1}(x),          (8.15)

where γ_t ∈ [0, 1] will be determined later.

We would now like to observe a nice property of Bregman divergences: if we
shift a Bregman divergence term of the form x ↦ D_R(x, z) by a linear function
⟨l, x⟩, we again obtain a Bregman divergence, from a possibly different point z′.
This is easy to check for the case R(x) := (1/2)‖x‖₂².

Lemma 8.5 (Shifting a Bregman divergence by a linear term) Let z ∈ R^n
be an arbitrary point and R : R^n → R be a convex regularizer for which
∇R : R^n → R^n is a bijection. Then, for every l ∈ R^n there exists z′ ∈ R^n such
that

    ⟨l, z − x⟩ = D_R(x, z) + D_R(z, z′) − D_R(x, z′)   for all x ∈ R^n.

Moreover, z′ is uniquely determined by the relation

    l = ∇R(z) − ∇R(z′).
The proof of this lemma follows from the generalized Pythagorean theorem
(Theorem 7.7).

It can now be seen by induction that, if we denote the minimum value of φ_t
by φ_t⋆ and denote by z_t the global minimizer of φ_t, i.e., φ_t⋆ = φ_t(z_t), then
we have

    φ_t(x) = φ_t⋆ + λ_t D_R(x, z_t).          (8.16)

From (8.12) it follows that this is true for t = 0 if we let z_0 := x_0. Assume
that it is true for t − 1. From the induction hypothesis, (8.15), (8.13), and
λ_t = (1 − γ_t)λ_{t−1} we know that

    φ_t(x) = (1 − γ_t)(φ_{t−1}⋆ + λ_{t−1} D_R(x, z_{t−1})) + γ_t (f(y_{t−1}) + ⟨x − y_{t−1}, ∇f(y_{t−1})⟩)
           = (1 − γ_t)φ_{t−1}⋆ + γ_t (f(y_{t−1}) − ⟨y_{t−1}, ∇f(y_{t−1})⟩) + γ_t ⟨x, ∇f(y_{t−1})⟩
             + λ_t D_R(x, z_{t−1})
           = (1 − γ_t)φ_{t−1}⋆ + γ_t (f(y_{t−1}) − ⟨y_{t−1}, ∇f(y_{t−1})⟩)
             + λ_t (D_R(x, z_{t−1}) + (γ_t/λ_t)⟨x, ∇f(y_{t−1})⟩)
           = (1 − γ_t)φ_{t−1}⋆ + γ_t (f(y_{t−1}) − ⟨y_{t−1}, ∇f(y_{t−1})⟩)
             + λ_t (D_R(x, z_t) − D_R(z_{t−1}, z_t) + (γ_t/λ_t)⟨z_{t−1}, ∇f(y_{t−1})⟩).

Here, in the last equality we have used Lemma 8.5 (with z = z_{t−1}, z′ = z_t and
l = (γ_t/λ_t)∇f(y_{t−1})). Now it follows that φ_t(x) is minimized when
D_R(x, z_t) = 0, which happens when x = z_t (as no other term depends on x). This
establishes (8.16). Importantly, Lemma 8.5 also gives us the following recurrence
on z_t for all t ≥ 1:

    ∇R(z_t) = ∇R(z_{t−1}) − (γ_t/λ_t)∇f(y_{t−1}).
In the next steps of the proof, we use the above scheme to prove conditions
(1) and (2) of the estimate sequence. This is proved by induction, i.e., when
proving the claim for t ∈ N we assume that it holds for t − 1. Along the way, we
state several constraints on x_t, y_t, γ_t and λ_t which are necessary for our proofs
to work. At the final stage of the proof we collect all these constraints and show
that they can be simultaneously satisfied and, thus, that there is a way to set these
parameters in order to obtain a valid estimate sequence.

8.4.2 Step 2: Ensuring the lower bound condition


We would like to ensure that

    φ_t(x) ≤ (1 − λ_t) f(x) + λ_t φ_0(x).

From our inductive construction we have

    φ_t(x) = (1 − γ_t)φ_{t−1}(x) + γ_t L_{t−1}(x)                              (by def. of φ_t in (8.15))
           ≤ (1 − γ_t)[(1 − λ_{t−1}) f(x) + λ_{t−1} φ_0(x)] + γ_t L_{t−1}(x)    (by the induction hypothesis for t − 1)
           ≤ (1 − γ_t)[(1 − λ_{t−1}) f(x) + λ_{t−1} φ_0(x)] + γ_t f(x)          (by L_{t−1}(x) ≤ f(x) in (8.14))
           ≤ ((1 − γ_t)(1 − λ_{t−1}) + γ_t) f(x) + (1 − γ_t)λ_{t−1} φ_0(x).     (rearranging)          (8.17)

Recall that we set

    λ_t := (1 − γ_t)λ_{t−1},          (8.18)

hence

    φ_t(x) ≤ (1 − λ_t) f(x) + λ_t φ_0(x).

Thus, as long as (8.18) holds, we obtain the lower bound condition. Note also
that a different way to state (8.18) is that

    λ_t = Π_{1≤i≤t} (1 − γ_i).

8.4.3 Step 3: Ensuring the upper bound and the dynamics of yt


To satisfy the upper bound condition, our goal is to set x_t in such a way that

    f(x_t) ≤ min_{x∈R^n} φ_t(x) = φ_t⋆ = φ_t(z_t).

Note that this in particular requires us to specify y_{t−1}, as the right-hand side
depends on y_{t−1}. Towards this, consider any x ∈ R^n; then

    φ_t(x) = (1 − γ_t)φ_{t−1}(x) + γ_t L_{t−1}(x)                              (by def. of φ_t as in (8.15))
           = (1 − γ_t)(φ_{t−1}(z_{t−1}) + λ_{t−1} D_R(x, z_{t−1})) + γ_t L_{t−1}(x)     (by (8.16))
           = (1 − γ_t)(φ_{t−1}(z_{t−1}) + λ_{t−1} D_R(x, z_{t−1}))
             + γ_t (f(y_{t−1}) + ⟨x − y_{t−1}, ∇f(y_{t−1})⟩)                   (by def. of L_{t−1} as in (8.13))
           ≥ (1 − γ_t) f(x_{t−1}) + λ_t D_R(x, z_{t−1})
             + γ_t (f(y_{t−1}) + ⟨x − y_{t−1}, ∇f(y_{t−1})⟩)                   (by the upper bound condition for φ_{t−1})
           ≥ (1 − γ_t)(f(y_{t−1}) + ⟨x_{t−1} − y_{t−1}, ∇f(y_{t−1})⟩)
             + γ_t (f(y_{t−1}) + ⟨x − y_{t−1}, ∇f(y_{t−1})⟩) + λ_t D_R(x, z_{t−1})      (by convexity of f)
           = f(y_{t−1}) + ⟨(1 − γ_t)(x_{t−1} − y_{t−1}) + γ_t(x − y_{t−1}), ∇f(y_{t−1})⟩
             + λ_t D_R(x, z_{t−1})                                             (rearranging)
           = f(y_{t−1}) + ⟨(1 − γ_t)x_{t−1} + γ_t z_{t−1} − y_{t−1}, ∇f(y_{t−1})⟩
             + γ_t ⟨x − z_{t−1}, ∇f(y_{t−1})⟩ + λ_t D_R(x, z_{t−1})            (adding and subtracting γ_t⟨x − z_{t−1}, ∇f(y_{t−1})⟩)
           = f(y_{t−1}) + γ_t ⟨x − z_{t−1}, ∇f(y_{t−1})⟩ + λ_t D_R(x, z_{t−1}),

where the last equality follows by setting

    y_{t−1} := (1 − γ_t)x_{t−1} + γ_t z_{t−1}.

To give some intuition as to why we make such a choice for y_{t−1}, consider for
a moment the case when R(x) = ‖x‖₂² and D_R(x, y) is the squared Euclidean
distance. In the above, we would like to obtain a term which looks like a
second-order (quadratic) upper bound on f(x̃) around f(y_{t−1}) (here x̃ is the
variable), i.e.,

    f(y_{t−1}) + ⟨x̃ − y_{t−1}, ∇f(y_{t−1})⟩ + (1/2)‖x̃ − y_{t−1}‖².          (8.19)

Such a choice of y_{t−1} allows us to cancel out an undesired linear term. We do
not quite succeed in getting the form as in (8.19) – our expression has z_{t−1}
instead of y_{t−1} in several places and has additional constants in front of the
linear and quadratic terms. We deal with these issues in the next step and make
our choice of x_t accordingly.

8.4.4 Step 4: Ensuring condition (2) and the dynamics of x_t

We continue the derivation from Step 3:

    φ_t(x) ≥ f(y_{t−1}) + γ_t ⟨x − z_{t−1}, ∇f(y_{t−1})⟩ + λ_t D_R(x, z_{t−1})      (as established in Step 3)
           ≥ f(y_{t−1}) + γ_t ⟨x − z_{t−1}, ∇f(y_{t−1})⟩ + (λ_t/2)‖x − z_{t−1}‖²    (by 1-strong convexity of R)
           = f(y_{t−1}) + ⟨x̃ − y_{t−1}, ∇f(y_{t−1})⟩ + (λ_t/(2γ_t²))‖x̃ − y_{t−1}‖²   (by the change of variables x̃ − y_{t−1} := γ_t(x − z_{t−1}))
           ≥ f(y_{t−1}) + ⟨x̃ − y_{t−1}, ∇f(y_{t−1})⟩ + (1/2)‖x̃ − y_{t−1}‖²         (assuming λ_t/γ_t² ≥ 1)
           ≥ f(x_t),

where the last step follows by setting

    x_t := argmin_{x̃} { ⟨x̃, ∇f(y_{t−1})⟩ + (1/2)‖x̃ − y_{t−1}‖² };

indeed, by the 1-smoothness of f, the quadratic expression in the penultimate
line upper bounds f(x̃) for every x̃, and its minimum over x̃ is attained at x_t,
so it is at least f(x_t).

The reason for the renaming of variables and introducing x̃ follows the same
intuition as the choice of y_{t−1} in Step 3. We would like to arrive at an expression
which is a quadratic upper bound on f around y_{t−1}, evaluated at a point x̃. Such
a choice of x̃ allows us to cancel the γ_t in front of the linear part of this upper
bound, and hence to obtain the desired expression assuming that λ_t/γ_t² ≥ 1. As we
will see later, this constraint really determines the convergence rate of the
resulting method. Finally, the choice of x_t follows straightforwardly: we simply
pick a point which minimizes this quadratic upper bound on f over x̃.

8.5 The algorithm and its analysis


We now collect all the constraints and summarize the update rules according
to which the new points are obtained. Initially, x_0 ∈ R^n is arbitrary,

    z_0 = x_0,   γ_0 = 0   and   λ_0 = 1.

Further, for t ≥ 1 we have

    λ_t := Π_{1≤i≤t} (1 − γ_i)

and

    y_{t−1} := (1 − γ_t)x_{t−1} + γ_t z_{t−1},
    ∇R(z_t) := ∇R(z_{t−1}) − (γ_t/λ_t)∇f(y_{t−1}),                          (8.20)
    x_t := argmin_{x̃} { ⟨x̃, ∇f(y_{t−1})⟩ + (1/2)‖x̃ − y_{t−1}‖² }.

For instance, when ‖·‖ is the ℓ₂-norm, then

    x_t = y_{t−1} − ∇f(y_{t−1}).
Note that y_t is just a "hybrid" of the sequences (x_t) and (z_t), which follow two
different optimization primitives:

1. The update rule to obtain x_t is simply to perform one step of gradient
   descent starting from y_{t−1}.
2. The sequence z_t is just performing mirror descent with respect to R, taking
   gradients at y_{t−1}.

Thus, accelerated gradient descent is simply a result of combining gradient
descent with mirror descent. Note that, in particular, if we choose γ_t = 0 for all
t then y_t is simply the gradient descent method from Chapter 6, and if we choose
γ_t = 1 for all t then y_t is mirror descent from Chapter 7.

Pictorially, in algorithm (8.20), the parameters are fixed in the following order:

    x_0      x_1                x_t
       ↘   ↗    ↘            ↗
        y_0       y_1  · · ·  y_{t−1}
       ↗   ↘    ↗            ↘
    z_0      z_1                z_t
The proof of Theorem 8.2 follows simply from the construction of the estimate
sequence. The only remaining piece which we have not established yet
is that one can take γ_t ≈ 1/t and λ_t ≈ 1/t² – this follows from a straightforward
calculation and is proved in the next section.

From the update rules in (8.20) one can see that one can keep track of x_t, y_t, z_t
by using a constant number of oracle queries to ∇f, ∇R and (∇R)⁻¹. Furthermore,
as already demonstrated in (8.3)-(8.11), we have

    f(x_t) ≤ f(x⋆) + O(LD²/(σt²)).

Thus, to attain f(x_t) ≤ f(x⋆) + ε, taking t = O(√(LD²/(σε))) is sufficient. This
completes the proof of Theorem 8.2.
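
The update rules (8.20) translate directly into code. Below is a minimal NumPy sketch (ours) for the Euclidean case R(x) = (1/2)‖x‖₂², where ∇R and its inverse are the identity, using the choice of γ_t from Lemma 8.6 in the next subsection; as in the text, we work with f/L so that the smoothness constant is 1.

    import numpy as np

    def accelerated_gd(grad_f, x0, L, T):
        # Accelerated gradient descent (8.20) with R(x) = ||x||_2^2 / 2.
        x = z = np.asarray(x0, dtype=float)
        lam = 1.0
        for t in range(1, T + 1):
            gamma = 0.0 if t <= 3 else 2.0 / t    # Lemma 8.6
            lam *= 1.0 - gamma
            y = (1.0 - gamma) * x + gamma * z
            g = grad_f(y) / L                     # gradient of f/L at y_{t-1}
            # Mirror descent step: grad R is the identity here.
            z = z - (gamma / lam) * g
            # Gradient descent step: x_t = argmin <x~, g> + ||x~ - y||^2 / 2.
            x = y - g
        return x

    # Example: minimize f(x) = ||x||^2 (so L = 2); the iterates approach 0.
    x = accelerated_gd(lambda v: 2 * v, np.ones(5), L=2.0, T=100)
    print(np.linalg.norm(x))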

8.5.1 Choice of the γ_t

When deriving the estimate sequence we made the assumption that

    λ_t ≥ γ_t²,

which does not allow us to set the γ_t arbitrarily. The following lemma provides
an example setting of the γ_t which satisfies this constraint.

Lemma 8.6 (Choice of the γ_t) Let γ_0 = γ_1 = γ_2 = γ_3 = 0 and γ_i = 2/i for all
i ≥ 4. Then

    Π_{i=1}^t (1 − γ_i) ≥ γ_t²   for all t ≥ 0.

Proof For t ≤ 4 one can verify the claim directly. Let t > 4. We have

    Π_{i=1}^t (1 − γ_i) = (2/4) · (3/5) · (4/6) · · · ((t − 2)/t) = (2·3)/((t − 1)t),

by cancelling all but two terms in the numerator and denominator. It is now easy
to verify that 6/(t(t − 1)) ≥ 4/t² = γ_t². □

8.5.2 An algorithm for strongly convex and smooth functions


From Theorem 8.2 one can derive numerous other algorithms. Here, for instance,
we deduce a method for minimizing functions which are both L-smooth
and β-strongly convex (with respect to the Euclidean norm) at the same time,
i.e.,

    (β/2)‖x − y‖₂² ≤ f(x) − f(y) − ⟨x − y, ∇f(y)⟩ ≤ (L/2)‖x − y‖₂²   for all x, y ∈ R^n.

Theorem 8.7 (Accelerated gradient descent for strongly convex functions)
There is an algorithm that for any function f : R^n → R which is both β-strongly
convex and L-smooth with respect to the Euclidean norm, given

• first-order oracle access to f,
• the numbers L, β ∈ R,
• an initial point x_0 ∈ R^n such that ‖x⋆ − x_0‖₂² ≤ D² (where x⋆ is an optimal
  solution to min_{x∈R^n} f(x)), and
• an ε > 0,

outputs a point x ∈ R^n such that f(x) ≤ f(x⋆) + ε. The algorithm makes
T = O(√(L/β) log(LD²/ε)) queries to the respective oracles and performs O(nT)
arithmetic operations.
Note that, importantly, the above method has logarithmic dependency on 1/ε.
Proof Consider R(x) = ‖x‖₂² as our choice of regularizer in Theorem 8.2.
Note that R is 1-strongly convex with respect to the Euclidean norm. Thus, the
algorithm in Theorem 8.2 constructs a sequence of points x_0, x_1, x_2, . . . such
that

    f(x_t) − f(x⋆) ≤ O(LD²/t²).

Denote

    E_t := f(x_t) − f(x⋆).

Initially, we have

    E_0 ≥ (β/2)‖x_0 − x⋆‖₂² = (β/2)D²,

because of the strong convexity of f. Thus, the convergence guarantee of the
algorithm can be rewritten as

    E_t ≤ O(LD²/t²) ≤ O(LE_0/(βt²)).

In particular, the number of steps required to bring the gap from E_0 to E_0/2 is
O(√(L/β)). Thus, to go from E_0 to ε we need

    O(√(L/β) · log(E_0/ε)) = O(√(L/β) · log(LD²/ε))

steps. Here, we used the L-smoothness of f and the fact that ∇f(x⋆) = 0 to
obtain

    E_0 = f(x_0) − f(x⋆) ≤ (L/2)‖x_0 − x⋆‖₂² ≤ LD²/2.

The map ∇R is (up to a constant factor) the identity in this case; hence no
additional computational cost is incurred because of R. □

8.6 Application to solving linear systems


Recall the problem of solving a linear system of equations

    Ax = b,

where A ∈ R^{n×n} is a matrix and b ∈ R^n a vector. For simplicity, we assume
that A is non-singular and hence the system Ax = b has a unique solution.
Hence, A⊤A is positive definite. Let λ₁(A⊤A) and λ_n(A⊤A) be the smallest and
largest eigenvalues of A⊤A respectively. We define the condition number of A⊤A
to be

    κ(A⊤A) := λ_n(A⊤A)/λ₁(A⊤A).

8.6.1 The algorithm

Theorem 8.8 (Solving a system of linear equations in √κ(A⊤A) iterations)
There is an algorithm which, given an invertible square matrix A ∈ R^{n×n},
a vector b and a precision parameter ε > 0, outputs a point y ∈ R^n – an
approximate solution to the linear system Ax = b – which satisfies

    ‖Ay − b‖₂² ≤ ε.

The algorithm performs T := O(√(κ(A⊤A)) log(λ_n(A⊤A)‖x⋆‖₂²/ε)) iterations
(where x⋆ ∈ R^n satisfies Ax⋆ = b), and each iteration requires computing a
constant number of matrix-vector multiplications and inner products.

Proof We apply Theorem 8.7 to solve the optimization problem

    min_{x∈R^n} ‖Ax − b‖₂².

We denote

    f(x) := ‖Ax − b‖₂².

Note that the optimal value of the above is 0, and it is achieved for x = x⋆,
where x⋆ is the solution to the linear system considered.

We now derive all the relevant parameters of f which are required to apply
Theorem 8.7. By computing the Hessian of f we have

    ∇²f(x) = 2A⊤A.

Since λ₁(A⊤A)·I ⪯ A⊤A ⪯ λ_n(A⊤A)·I, we have that f is L-smooth for
L := 2λ_n(A⊤A) and β-strongly convex for β := 2λ₁(A⊤A).
As a starting point x_0 we can choose

    x_0 := 0,

which is at distance

    D := ‖x_0 − x⋆‖₂ = ‖x⋆‖₂

from the optimal solution. Thus, from Theorem 8.7 we obtain the running time

    O(√(κ(A⊤A)) · log(λ_n(A⊤A)‖x⋆‖₂²/ε)).

Note that the gradient of f is

    ∇f(x) = 2A⊤(Ax − b);

hence computing it boils down to performing two matrix-vector multiplications. □
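
As an illustration, the method above specializes to the following NumPy sketch (ours) for f(x) = ‖Ax − b‖₂². It reuses the accelerated_gd routine sketched in Section 8.5; note that this plain variant carries the O(LD²/t²) guarantee of Theorem 8.2 rather than the logarithmic rate of Theorem 8.7 (which additionally uses the strong convexity via restarts), and that computing λ_n exactly below is for simplicity only – an upper bound suffices.

    import numpy as np

    def solve_linear_system(A, b, T):
        # Approximately solve Ax = b by accelerated gradient descent applied
        # to f(x) = ||Ax - b||_2^2, whose gradient is 2 * A^T (Ax - b).
        L = 2.0 * np.linalg.eigvalsh(A.T @ A).max()    # smoothness constant
        grad_f = lambda x: 2.0 * A.T @ (A @ x - b)
        return accelerated_gd(grad_f, np.zeros(A.shape[1]), L, T)

    rng = np.random.default_rng(0)
    A = rng.standard_normal((50, 50)) + 10 * np.eye(50)   # well conditioned
    b = rng.standard_normal(50)
    y = solve_linear_system(A, b, T=300)
    print(np.linalg.norm(A @ y - b))                      # small residual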

8.7 Exercises
8.1 Prove that, for a convex function, the L-smoothness condition (8.1) is
equivalent to the L-Lipschitz continuous gradient condition in any norm.
8.2 Conjugate gradient. Consider the following strategy for minimizing a
    convex function f : R^n → R. Construct a sequence (φ_t, L_t, x_t)_{t∈N}, where
    for every t ∈ N there is a function φ_t : R^n → R, a linear subspace L_t of
    R^n such that {0} = L_0 ⊆ L_1 ⊆ L_2 ⊆ · · · , and a point x_t ∈ L_t such that,
    together, they satisfy the following conditions:

    • Lower bound. ∀t, ∀x ∈ L_t, φ_t(x) ≤ f(x),
    • Common minimizer. ∀t, x_t = argmin_{x∈R^n} φ_t(x) = argmin_{x∈L_t} f(x).

    Consider applying this to the function

        f(x) = ‖x − x⋆‖_A²,

    where A is an n×n PD matrix and x⋆ ∈ R^n is the unique vector satisfying
    Ax⋆ = b. Let

        L_t := span{b, Ab, A²b, . . . , A^{t−1}b},

    and

        φ_t(x) := ‖x − x_t‖_A² + ‖x_t − x⋆‖_A²,

    where x_t ∈ R^n is the unique choice which makes the common minimizer
    property satisfied.

    (a) Show that by choosing (φ_t, L_t, x_t)_{t∈N} as above, the lower bound
        property is satisfied.
    (b) Show that given A and b, there is an algorithm to compute x_1, . . . , x_t
        in total time O(t · (T_A + n)), where T_A is the time required to
        multiply the matrix A by a vector.
    (c) Show that if A has only k pairwise distinct eigenvalues, then x_{k+1} = x⋆.
8.3 Heavy ball method. Given an n × n PD matrix A and a b ∈ R^n, consider
    the following optimization problem:

        min_{x∈R^n} (1/2)x⊤Ax − ⟨b, x⟩.

    Let x⋆ := A⁻¹b and notice that the problem is the same as solving

        min_{x∈R^n} f(x),   where f(x) := (1/2)(x − x⋆)⊤A(x − x⋆).

    Consider the following method to solve this problem:

        x_{t+1} := x_t − η∇f(x_t) + θ(x_t − x_{t−1}).

    The term x_t − x_{t−1} can be viewed as the momentum of the particle.

    (a) Prove that

            [x_{t+1} − x⋆]   [(1 + θ)I − ηA   −θI] [x_t − x⋆    ]
            [x_t − x⋆    ] = [      I           0 ] [x_{t−1} − x⋆]

    (b) Let λ₁ be the smallest eigenvalue of A and λ_n be the largest eigenvalue
        of A. For η := 4/(√λ_n + √λ₁)² and θ := max{|1 − √(ηλ₁)|, |1 − √(ηλ_n)|},
        prove that

            ‖x_{t+1} − x⋆‖₂ ≤ ((√κ(A) − 1)/(√κ(A) + 1))^t ‖x_0 − x⋆‖₂.

        Here κ(A) := λ_n/λ₁.
    (c) Generalize this result to an L-smooth and σ-strongly convex function.
8.4 Lower bound. In this problem we prove Theorem 6.4. Consider a general
    model for first-order black box minimization which includes gradient
    descent, mirror descent, and accelerated gradient descent. The algorithm
    is given x_1 ∈ R^n and access to a gradient oracle for a convex
    function f : R^n → R. It produces a sequence of points x_1, x_2, . . . such
    that

        x_t ∈ x_1 + span{∇f(x_1), . . . , ∇f(x_{t−1})},          (8.21)

    i.e., the algorithm may move only in the subspace spanned by the gradients
    at previous iterations. We do not restrict the running time of one
    iteration of such an algorithm; in fact, we allow it to do an arbitrarily
    long calculation to compute x_t from x_1, . . . , x_{t−1} and the corresponding
    gradients, and are interested only in the number of iterations.

    Consider the quadratic function f : R^n → R defined as

        f(y₁, . . . , y_n) := (L/4)((1/2)y₁² + (1/2)Σ_{i=1}^{2t}(y_i − y_{i+1})² + (1/2)y_{2t+1}² − y₁).

    Here y_i denotes the ith coordinate of y.

    (a) Prove that f is L-smooth with respect to the Euclidean norm.
    (b) Prove that the minimum of f is (L/8)(1/(2t + 2) − 1) and is attained
        at a point x⋆ whose ith coordinate is 1 − i/(2t + 2).
    (c) Prove that the span of the gradients at the first t points is just the
        span of {e₁, . . . , e_t}.
    (d) Deduce that

            (f(x_{t+1}) − f(x⋆)) / ‖x_1 − x⋆‖₂² ≥ 3L/(32(t + 1)²).

    Thus, the accelerated gradient method is tight up to constants.
8.5 Acceleration for the s−t-maximum flow problem. Use the accelerated
    gradient method developed in this chapter to improve the running time
    in Theorem 6.6 for the s−t-maximum flow problem to Õ(m^{1.75}/√(εF⋆)).

8.6 Acceleration for the s−t-minimum cut problem. Recall the formulation
    of the s−t-minimum cut problem in an undirected graph G = (V, E)
    with n vertices and m edges that was studied in Exercise 6.11:

        MinCut_{s,t}(G) := min_{x∈R^n, x_s−x_t=1} Σ_{ij∈E} |x_i − x_j|.

    Apply Theorem 8.7 with the regularizer R(x) = ‖x‖₂² to find MinCut_{s,t}(G)
    exactly. Obtain a method with running time O(m^{3/2}n^{1/2}∆^{1/2}), where ∆ is
    the maximum degree of a vertex in G, in other words,

        ∆ := max_{v∈V} |N(v)|.

    Hint: to make the objective L-smooth, use the following approximation
    of the ℓ₁-norm of a vector y ∈ R^m: ‖y‖₁ ≈ Σ_{j=1}^m √(y_j² + δ²), for an
    appropriately chosen δ > 0.



Notes
The accelerated gradient descent method was discovered by Nesterov (1983)
(see also Nesterov (2004)). This idea of acceleration was then extended to
variants of gradient descent and led to the introduction of algorithms such as
FISTA by Beck and Teboulle (2009). Allen Zhu and Orecchia (2017) present
the accelerated gradient method as a coupling between the gradient descent
method and the mirror descent method. The lower bound (Theorem 6.4) was
first established in a paper by Nemirovski and Yudin (1983). The heavy ball
method (Exercise 8.3) is due to Polyak (1964). Exercises 8.5 and 8.6 are adapted
from the paper by Lee et al. (2013).
Algorithms for solving a linear system of equations have a rich history. It
is well known that Gaussian elimination has a worst case time complexity
of O(n3 ). This can be improved upon by using fast matrix multiplication to
O(nω ) ≈ O(n2.373 ); see the books by Trefethen and Bau (1997), Saad (2003),
and Golub and Van Loan (1996). The bound presented in Theorem 8.8 is comparable
to that of the conjugate gradient method (introduced in Exercise 8.2), which
is due to Hestenes and Stiefel (1952); see also the monograph by Sachdeva and
Vishnoi (2014). For references on Laplacian solvers, refer to the notes
in Chapter 6.
9
Newton’s Method

We begin our journey towards designing algorithms for convex optimization whose
number of iterations scale polylogarithmically with the error. As a first step, we de-
rive and analyze the classic Newton’s method, which is an example of a second-order
method. We argue that Newton’s method can be seen as steepest descent on a Rieman-
nian manifold, which then motivates an affinely invariant analysis of its convergence.

9.1 Finding a root of a univariate function


Newton’s method, also known as the Newton-Raphson method, is a method
for finding successively better approximations to the roots, or zeroes, of a real-
valued function. Formally, given a sufficiently differentiable function g : R →
R, the goal is to find its root (or one of its roots); a point r such that g(r) = 0.
The method assumes a zeroth- and first-order access to g and a point x0 that
is “sufficiently close” to some root of g. It does not assume that g is convex.
We use the notation g0 and g00 in this chapter to denote the first and the second
derivative of g.

9.1.1 Derivation of the update rule


Like all the methods studied in this book, Newton’s method is also iterative
and generates a sequence of points x0 , x1 , . . .. To explain Newton’s method
visually, consider the graph of g as a subset of R × R and the point (x0 , g(x0 )).
Draw a line through this point, which is tangent to the graph of g. Let x1 be the
intersection of the line with the x-axis (see Figure 9.1). The hope is, at least
if one were to believe the figure, that by moving from x0 to x1 we have made
progress in reaching a zero of g. By a simple calculation, one can see that x1

155
156 Newton’s Method

f (x0 )

r x1 x0

Figure 9.1 One step of Newton’s method.

arises from x_0 as follows:

    x_1 := x_0 − g(x_0)/g′(x_0).

Thus, the following iterative algorithm for computing a root of g naturally
follows: start with an x_0 ∈ R and use the inductive definition below to compute
x_1, x_2, x_3, . . .:

    x_{t+1} := x_t − g(x_t)/g′(x_t)   for all t ≥ 0.          (9.1)

It is evident that the method requires the differentiability of g. In fact, the anal-
ysis assumes even more – that g is twice continuously differentiable.
We now present a simple optimization problem and see how we can attempt
to use Newton’s method to solve it. This example also illustrates that the con-
vergence of Newton’s method may heavily depend on the starting point.

An example. Suppose that for some a > 0 we would like to minimize the
function
f (x) := ax − log x

over all positive x > 0. To solve this optimization problem one can first take
the derivative g(x) := f 0 (x) and try to find a root of g. As f is convex, we
know by first-order optimality conditions that the root of g (if it exists) is an
optimizer for f . We have
1
g(x) := f 0 (x) = a − .
x
While it is trivial to see that this equation can be solved exactly and the solu-
tion is 1a , we would still like to apply Newton’s method to solve it. One reason
is that we would like to illustrate the method on a particularly simple example.
Another reason is historical – early computers used Newton’s method to com-
pute the reciprocal as it only involved addition, subtraction, and multiplication.
We initialize Newton’s method at any point x0 > 0, and iterate as follows
g(xt )
xt+1 = xt − = 2xt − axt2 .
g0 (xt )
Note that computing xt+1 from xt indeed does not use division.
We now analyze the sequence {x_t}_{t∈N} and see when it converges to 1/a. Denoting

e_t := 1 − ax_t,

we obtain the following recurrence relation for e_t:

e_{t+1} = e_t².

Thus, it is easy to see that whenever |e_0| < 1 we have e_t → 0. Further, if |e_0| = 1 then e_t = 1 for all t ≥ 1, and if |e_0| > 1 then e_t → ∞.
In terms of x_0, this means that whenever 0 < x_0 < 2/a we have x_t → 1/a. However, if we initialize at x_0 = 2/a or x_0 = 0, then the algorithm gets stuck at 0. And even worse: if we initialize at x_0 > 2/a then x_t → −∞. This example shows that choosing the right starting point has a crucial impact on whether Newton's method succeeds or fails.
It is interesting to note that by modifying the function g, for example by taking g(x) = x − 1/a, one obtains different algorithms to compute 1/a. Some of these algorithms might not make sense (for instance, the iteration x_{t+1} = 1/a is not how we would like to compute 1/a), or might not be efficient.
9.1.2 Quadratic convergence
We now formally analyze Newton's method. Notice that, in the example from the previous section, we encountered the phenomenon that the "distance" to the root was squared at every iteration; this holds in full generality and is sometimes referred to as quadratic convergence.
Theorem 9.1 (Quadratic convergence of Newton's method for finding roots). Suppose g : R → R is twice differentiable and its second derivative is continuous, r ∈ R is a root of g, x_0 ∈ R is a starting point, and

x_1 = x_0 − g(x_0)/g′(x_0).

Then

|r − x_1| ≤ M|r − x_0|²,

where M := sup_{ξ∈(r,x_0)} |g″(ξ)/(2g′(x_0))|.
The proof of this theorem requires the mean value theorem.
Theorem 9.2 (Mean value theorem). If h : R → R is a continuous function on the closed interval [a, b], and differentiable on the open interval (a, b), then there exists a point c ∈ (a, b) such that

h′(c) = (h(b) − h(a))/(b − a).
Proof of Theorem 9.1. We start with a quadratic, or second-order, Taylor approximation of g around the point x_0, using the mean value theorem (Theorem 9.2) for g″, to obtain

g(r) = g(x_0) + (r − x_0)g′(x_0) + (1/2)(r − x_0)² g″(ξ)

for some ξ in the interval (r, x_0). Here, we have used the fact that g″ is continuous. From the definition of x_1, we know that

g(x_0) = g′(x_0)(x_0 − x_1).

We also know that g(r) = 0. Hence, we obtain


1
0 = g0 (x0 )(x0 − x1 ) + (r − x0 )g0 (x0 ) + (r − x0 )2 g00 (ξ)
2
which implies that
1
g0 (x0 )(x1 − r) = (r − x0 )2 g00 (ξ).
2

This gives us the bound on the distance from x1 to r in terms of the distance
from x0 to r
00
g (ξ)
|r − x1 | = 0 |r − x0 |2 ≤ M|r − x0 |2 ,
2g (x0 )
where M is as in the statement of the theorem.

Assuming that M is a small constant, say M ≤ 1 (and remains so throughout the execution of this method), and that |x_0 − r| < 1/2, we obtain quadratically fast convergence of x_t to r. Indeed, after t steps we have

|x_t − r| ≤ |x_0 − r|^{2^t} ≤ 2^{−2^t}

and hence, for the error |x_t − r| to become smaller than ε it is enough to take

t ≈ log log(1/ε).
As one can imagine, for this reason Newton's method is very efficient. In addition, in practice it turns out to be very robust and sometimes converges rapidly even when no bounds on M or |x_0 − r| are available.
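The doubling of the number of correct digits is easy to observe numerically. The following small Python experiment (our own example, with g(x) = x² − 2 and root √2, not one from the text) prints errors that square, up to a constant, at every step.

```python
import math

# Newton's method for a root of g, printing the error |x_t - r| at each step.
def newton_root(g, gprime, x0, r, steps=5):
    x = x0
    for t in range(steps):
        x = x - g(x) / gprime(x)
        print(t, abs(x - r))
    return x

# Errors roughly square each iteration: ~2e-3, 2e-6, 2e-12,
# and then the iterate is correct to machine precision.
newton_root(lambda x: x * x - 2, lambda x: 2 * x, x0=1.5, r=math.sqrt(2))
```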
9.2 Newton's method for multivariate functions
We now extend Newton's method to the multivariate setting, where we would like to find a "root" of a function g : Rⁿ → Rⁿ:

g_1(r) = 0,
g_2(r) = 0,
. . .
g_n(r) = 0.
In other words, g : Rⁿ → Rⁿ is of the form g(x) = (g_1(x), g_2(x), . . . , g_n(x))ᵀ and we would like to find x ∈ Rⁿ such that g(x) = 0.

To find an analog of Newton's method in this setting, we mimic the update rule (9.1) from the univariate setting, i.e.,

x_1 = x_0 − g(x_0)/g′(x_0).
Previously, g(x_0) and g′(x_0) were numbers, but if we go from n = 1 to n > 1, then g(x_0) becomes a vector and it is not immediately clear what g′(x_0) should be. It turns out that the right analog of g′(x_0) in the multivariate setting is the Jacobian matrix of g at x_0, i.e., J_g(x_0) is the matrix of partial derivatives

[∂g_i/∂x_j (x_0)]_{1≤i,j≤n}.

Hence, we can now consider the following extension of (9.1) to the multivariate setting:

x_{t+1} := x_t − J_g(x_t)⁻¹ g(x_t)   for all t ≥ 0.    (9.2)
By extending the proof of Theorem 9.1 we can recover an analogous local quadratic convergence rate for Newton's method in the multivariate setting. We do not state a precise theorem for this variant but, as one might expect, the corresponding value M involves the following two quantities: an upper bound on the "magnitude" of the second derivative of g (in other words, a bound on the Lipschitz constant of x ↦ J_g(x)) and a lower bound on the "magnitude" of J_g(x) of the form

1/‖J_g(x)⁻¹‖₂,

where ‖·‖₂ denotes the spectral norm of a matrix (see Definition 2.13). For more details, we refer to Section 9.4, where we present a closely related convergence result, which can be adapted to this setting.
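A minimal NumPy sketch of update (9.2) follows; rather than forming J_g(x_t)⁻¹ explicitly, it solves the linear system J_g(x_t)d = g(x_t). The test function g (a circle intersected with a line) and the starting point are our own illustrative choices.

```python
import numpy as np

# Multivariate Newton iteration: x_{t+1} = x_t - J_g(x_t)^{-1} g(x_t).
def newton_multivariate(g, jacobian, x0, steps=10):
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        d = np.linalg.solve(jacobian(x), g(x))  # solve J_g(x) d = g(x)
        x = x - d
    return x

# Example: g(x, y) = (x^2 + y^2 - 1, x - y); a root is (1/sqrt(2), 1/sqrt(2)).
g = lambda v: np.array([v[0]**2 + v[1]**2 - 1.0, v[0] - v[1]])
J = lambda v: np.array([[2 * v[0], 2 * v[1]], [1.0, -1.0]])
print(newton_multivariate(g, J, x0=[1.0, 0.5]))  # ~ [0.70710678, 0.70710678]
```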
9.3 Newton's method for unconstrained optimization
How could Newton’s method be used to solve convex programs? The key lies
in the observation from Chapter 3 that the task of minimizing a differentiable
convex function in the unconstrained setting is equivalent to finding a root of
its derivative. In this section, we abstract out the method from the previous
section and present Newton’s method for unconstrained optimization.
9.3.1 From optimization to root finding
Recall that in the unconstrained case, the problem is to find

x⋆ := argmin_{x∈Rⁿ} f(x),

where f is a convex function. We assume that f is sufficiently differentiable in this chapter. The gradient ∇f of f can be thought of as a function from Rⁿ to Rⁿ, and its Jacobian J_{∇f} is the Hessian ∇²f. Hence, the update rule (9.2) for finding a root of a multivariate function g, adjusted to this setting, is

x_{t+1} := x_t − (∇²f(x_t))⁻¹ ∇f(x_t)   for all t ≥ 0.    (9.3)
For notational convenience we define the Newton step at a point x to be

n(x) := −(∇²f(x))⁻¹ ∇f(x).

Then (9.3) can be compactly written as

x_{t+1} := x_t + n(x_t).
9.3.2 Newton's method as a second-order method
We now derive Newton's method from an optimization perspective. Suppose we would like to find a global minimum of f and x_0 is our current approximate solution. Let f̃(x) denote the quadratic, or second-order, approximation of f(x) around x_0, i.e.,

f̃(x) := f(x_0) + ⟨x − x_0, ∇f(x_0)⟩ + (1/2)(x − x_0)ᵀ ∇²f(x_0)(x − x_0).

A natural idea to compute the new approximation x_1 (to a minimizer x⋆ of f) is then to minimize f̃(x) over x ∈ Rⁿ. Since we hope that f̃ approximates f, at least locally, this new point should be an even better approximation to x⋆.
To find x_1 we need to solve

x_1 := argmin_{x∈Rⁿ} f̃(x).

Assuming that f is strictly convex, or rather that the Hessian of f at x_0 is PD, finding such an x_1 is equivalent to finding an x such that ∇f̃(x) = 0, i.e.,

∇f(x_0) + ∇²f(x_0)(x − x_0) = 0.

This, assuming ∇²f(x_0) is invertible, translates to

x − x_0 = −(∇²f(x_0))⁻¹ ∇f(x_0).

Hence

x_1 = x_0 − (∇²f(x_0))⁻¹ ∇f(x_0),
and we recover Newton's method. Thus, at every step, Newton's method minimizes the second-order approximation around the current point and takes its minimizer as the next point. This has the following consequence: whenever we apply Newton's method to a strictly convex quadratic function, i.e., one of the form h(x) = (1/2)xᵀMx + bᵀx for a PD matrix M ∈ R^{n×n} and b ∈ Rⁿ, then no matter which point we start at, after one iteration we land at the unique minimizer.
It is instructive to compare Newton's method to the algorithms we studied in previous chapters. They were all first-order methods and required multiple iterations to reach a point close to the optimizer. Does this mean that Newton's method is a "better" algorithm? On the one hand, the fact that Newton's method also uses the Hessian to perform the iteration makes it more powerful than first-order methods. On the other hand, this power comes at a cost: computationally, one iteration is now more costly, as we need a second-order oracle for the function. More precisely, at every step t, to compute x_{t+1} we need to solve a system of n linear equations in n variables, of the form

(∇²f(x_t)) x = ∇f(x_t).
In the worst case, this takes O(n³) time using Gaussian elimination (or O(n^ω) using fast matrix multiplication). However, if the Hessian matrix has a special form, e.g., it is a Laplacian corresponding to some graph, Newton's method can lead to fast algorithms due to the availability of nearly-linear time Laplacian solvers.
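The following Python sketch illustrates one way to implement iteration (9.3), solving the Hessian system at each step instead of inverting it. The test function f(x) = ⟨c, x⟩ − Σᵢ log x_i (with minimizer x_i = 1/c_i) is our own illustrative choice, not one from the text.

```python
import numpy as np

# Newton's method for unconstrained minimization: x_{t+1} = x_t + n(x_t),
# where n(x) solves the linear system (Hessian of f at x) * n = -gradient.
def newton_minimize(grad, hess, x0, steps=8):
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        n_x = np.linalg.solve(hess(x), -grad(x))
        x = x + n_x
    return x

c = np.array([2.0, 5.0])
grad = lambda x: c - 1.0 / x           # gradient of <c, x> - sum_i log x_i
hess = lambda x: np.diag(1.0 / x**2)   # Hessian is diagonal for this f
print(newton_minimize(grad, hess, x0=[0.4, 0.15]))  # ~ [0.5, 0.2]
```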
9.4 First take on the analysis
In this section, we take a first shot at analyzing Newton's method for optimization. The theorem we state is analogous to Theorem 9.1 (for univariate root finding): it says that one step of Newton's method yields a quadratic improvement in the distance to the optimal solution whenever a condition that we call NE (for Newton-Euclidean) is satisfied.
Definition 9.3 (NE condition). Let f : Rⁿ → R be a function, x⋆ be one of its minimizers, and x_0 be an arbitrary point. Denote by H(x) the Hessian of f at the point x ∈ Rⁿ. We say that the NE(M) condition is satisfied for some M > 0 if there exists a Euclidean ball B(x⋆, R) of radius R around x⋆, containing x_0, and two constants h, L > 0 such that M ≥ L/(2h) and

• for every x ∈ B(x⋆, R), ‖H(x)⁻¹‖ ≤ 1/h,
• for every x, y ∈ B(x⋆, R), ‖H(x) − H(y)‖ ≤ L‖x − y‖₂.

Here the norm of a matrix is the 2 → 2, or spectral, norm.
Theorem 9.4 (Quadratic convergence with respect to the Euclidean norm). Let f : Rⁿ → R and let x⋆ be one of its minimizers. Let x_0 be an arbitrary starting point and define

x_1 := x_0 + n(x_0).

If the NE(M) condition is satisfied, then

‖x_1 − x⋆‖₂ ≤ M‖x_0 − x⋆‖₂².
One can observe a rough analogy between Theorem 9.4 and Theorem 9.1. In the latter, for the method to have quadratic convergence, |g′(x)| should be large (relative to |g″(x)|). Here, the role of g is played by the gradient ∇f of f. The first condition on H(x) in Definition 9.3 basically says that the "magnitude" of the second derivative of f is "big". The second condition may be a bit more tricky to decipher: it says that ∇²f(x) is Lipschitz continuous, and upper-bounds the Lipschitz constant. Assuming f is thrice continuously differentiable, this roughly gives an upper bound on the magnitude of D³f. This intuitive explanation is not quite formal; we only want to emphasize that the spirit of Theorem 9.4 is the same as that of Theorem 9.1.
The proof of Theorem 9.4 is similar to the proof of Theorem 9.1 and, thus,
is moved to the end of this chapter for the interested reader; see Section 9.7.
9.4.1 The problem with the convergence in Euclidean norm
While the statement (and the proof) of Theorem 9.4 might seem to be a natural extension of Theorem 9.1, the fact that it is stated with respect to quantities based on the Euclidean norm ‖·‖₂ makes it hard to apply in many cases. We see in a later section that there is a more natural choice of norm within the context of Newton's method: the local norm. However, before we introduce the local norm, we first show that there is indeed a problem with the use of the Euclidean norm, through a particular example in which Theorem 9.4 fails to give reasonable bounds.
For K_1, K_2 > 0 (to be thought of as large constants), consider the function

f(x_1, x_2) := −log(K_1 − x_1) − log(K_1 + x_1) − log(1/K_2 − x_2) − log(1/K_2 + x_2).

f is defined whenever (x_1, x_2) ∈ (−K_1, K_1) × (−1/K_2, 1/K_2) ⊆ R², it is convex, and its Hessian is the diagonal matrix

H(x_1, x_2) = diag( 1/(K_1 − x_1)² + 1/(K_1 + x_1)²,  1/(1/K_2 − x_2)² + 1/(1/K_2 + x_2)² ).

We would like to find estimates on the parameters h and L (and, hence, M)


for the NE(M) condition to hold in a close neighborhood of the optimal point
(x1? , x2? ) = (0, 0). As we show, the M parameter is always prohibitively large,
so that Theorem 9.4 cannot be applied to reason about the convergence of
Newton’s method in this case. Nevertheless, as we argue in a later section,
Newton’s method works very well when applied to f , no matter how large K1
and K2 might be.
We start by observing that the first point in the NE(M) condition requires an h > 0 such that

hI ⪯ H(x_1, x_2)

in the neighborhood of x⋆. Even at x⋆ we already have

h ≤ 2/K_1²,

i.e., we can make h arbitrarily small by just changing K_1. This is because

H(x⋆) = H((0, 0)) = diag( 2/K_1², 2K_2² ).
The second condition requires a bound on the Lipschitz constant of the Hessian H(x). To see that L is large, consider the point x̃ := (0, 1/K_2²). Note that

H(x̃) − H(x⋆) = diag( 0,  1/(1/K_2 − 1/K_2²)² + 1/(1/K_2 + 1/K_2²)² − 2K_2² ),

and a direct calculation shows that the nonzero entry is Θ(1), hence ‖H(x̃) − H(x⋆)‖ = Θ(1).
This establishes a lower bound of

‖H(x̃) − H(x⋆)‖ / ‖x̃ − x⋆‖₂ = Θ(K_2²)

on the Lipschitz constant L.
Thus, M := L/(2h), which determines the quadratic convergence of Newton's method, is at least Ω(K_1² K_2²) and, in particular, even when we initialize at x̃, which is relatively close to the analytic center x⋆, the guarantee in Theorem 9.4 is too weak to imply that in one step the distance to the analytic center drops. In fact, we get that if we set x_0 = x̃, then the next point x_1 satisfies only

‖x_1‖ ≤ M‖x_0‖² ≤ M · (1/K_2⁴) = Ω(K_1²/K_2²).

Thus, whenever K_1 is at least a constant, Theorem 9.4 does not imply that the distance drops. However, Newton's method does in fact converge rapidly to x⋆
when initialized at x̃ (this can be checked by hand for this simple 2-dimensional example).
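One can also verify this rapid convergence numerically; the small Python check below (our own, not from the text) runs Newton's method on f from the point x̃ = (0, 1/K_2²) for large K_1, K_2 and observes the distance to the minimizer (0, 0) collapse in a few iterations.

```python
import numpy as np

K1, K2 = 1e6, 1e3  # large constants, chosen for illustration

def grad(x):
    return np.array([1/(K1 - x[0]) - 1/(K1 + x[0]),
                     1/(1/K2 - x[1]) - 1/(1/K2 + x[1])])

def hess(x):
    return np.diag([1/(K1 - x[0])**2 + 1/(K1 + x[0])**2,
                    1/(1/K2 - x[1])**2 + 1/(1/K2 + x[1])**2])

x = np.array([0.0, 1/K2**2])  # the point x~ from the discussion above
for t in range(5):
    x = x + np.linalg.solve(hess(x), -grad(x))
    print(t, np.linalg.norm(x))  # distance to x* = (0, 0) drops very fast
```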
9.4.2 Affine invariance of Newton's method
One of the important features of Newton's method is its affine invariance: if we consider an affine change of coordinates

y := φ(x) = Ax + b,

where A ∈ R^{n×n} is an invertible matrix and b ∈ Rⁿ, then Newton's method proceeds over the same sequence of points in the x- and y-coordinates.
Formally, consider the function f : Rⁿ → R written in the y-coordinates and f̃ : Rⁿ → R written in the x-coordinates, i.e.,

f̃(x) = f(Ax + b).
Then, if x_0 ∈ Rⁿ moves to x_1 ∈ Rⁿ by applying one step of Newton's method with respect to f̃, then y_0 = φ(x_0) moves to y_1 = φ(x_1) by applying one step of Newton's method with respect to f. This property does not hold for gradient descent, mirror descent, or any of the first-order methods that we have studied so far. Hence, it is sometimes possible to improve the convergence rate of gradient descent by preconditioning, i.e., changing coordinates, which is not the case for Newton's method.
This gives another reason why the bound obtained in Theorem 9.4 is not satisfactory: it depends on quantities (such as matrix norms) that are not affinely invariant. Thus, even though Newton's method still takes the same trajectory after we change the coordinates in which the function f is expressed, the parameters L and h in the NE(M) condition change and, thus, the final bound in Theorem 9.4 also changes.

To overcome these issues, in the next section we present a different interpretation of Newton's method, as a discretization of a gradient flow with respect to a Riemannian metric, which allows us to analyze Newton's method based only on affinely invariant quantities in the last section.
9.5 Newton's method as steepest descent
Suppose we would like to minimize the convex quadratic function f : R² → R given by

f(x_1, x_2) := x_1² + Kx_2²,
where K > 0 is a large constant. One can see by inspection that the minimizer is x⋆ = (0, 0). Nevertheless, we would like to compare the gradient descent method and Newton's method for this function.
The gradient descent method takes a starting point, say x̃ := (1, 1), and goes to

x′ := x̃ − η∇f(x̃)

for some step size η > 0. In this case,

∇f(x̃) = (2, 2K)ᵀ.

Thus, if we take η of constant size, the new point x′ is at distance roughly Ω(K) from the optimum; hence, instead of getting closer to x⋆, we move farther away. The reason is that our step size η is too large. Indeed, to make the distance to x⋆ drop when going from x̃ to x′ we must take a step size of roughly η ≈ 1/K. This makes the convergence slow.
The Newton step at x̃, on the other hand, is

−(∇²f(x̃))⁻¹ ∇f(x̃) = −(1, 1).    (9.4)

This direction points directly towards the minimizer x⋆ = (0, 0) of f and, thus, we are no longer forced to take small steps (see Figure 9.2). How can we reconcile this difference between the predictions of the gradient descent method and Newton's method?
[Figure 9.2 Illustration of the difference in directions suggested by gradient descent and Newton's method for the function x_1² + 4x_2², with x̃ = (1, 1) and x⋆ = (0, 0). The solid arrow is a step of length 1/2 in the direction (−1, −1) (Newton step) and the dashed arrow is a step of length 1/2 in the direction (−1, −4) (gradient step).]
Recall that in the gradient descent method we chose to follow the negative gradient direction because it was the direction of steepest descent with respect to the Euclidean distance (see Section 6.2.1). However, since the roles of the coordinates x_1 and x_2 are not symmetric in f, the Euclidean norm is no longer appropriate, and it makes sense to select a different norm to measure the length of such vectors. Consider the norm

‖(u_1, u_2)‖_◦ := √(u_1² + Ku_2²).
(Check that this is indeed a norm when K > 0.) Let us now redo the derivation of gradient descent as steepest descent from Chapter 6 with respect to this new norm. This gives rise to the following optimization problem (analogous to (6.1)):

argmax_{‖u‖_◦ = 1} (−Df(x)[u]).    (9.5)

Since Df(x)[u] = 2(x_1 u_1 + Kx_2 u_2), (9.5) is the same as

max_{u_1² + Ku_2² = 1} −2(x_1 u_1 + Kx_2 u_2).
By the Cauchy-Schwarz inequality we obtain that

−(x_1 u_1 + Kx_2 u_2) ≤ √(x_1² + Kx_2²) · √(u_1² + Ku_2²) = ‖x‖_◦ ‖u‖_◦.

Moreover, the inequality above is tight when u points in the direction opposite to x. Thus, the optimal solution to (9.5) at x points in the direction −x. For x̃ = (1, 1), up to scaling, this direction is (−1, −1): the Newton step (9.4).
This example extends immediately to a convex quadratic function of the form

h(x) := xᵀAx

for PD A. Instead of using the gradient direction ∇h(x) = 2Ax, we determine a new search direction by solving the following optimization problem with respect to ‖·‖_A:

argmax_{‖u‖_A = 1} (−Dh(x)[u]) = argmax_{uᵀAu = 1} (−⟨∇h(x), u⟩).
Here, the rationale behind using the A-norm is again (as in the 2-dimensional example) to counter the effect of "stretching" caused by the quadratic term xᵀAx in the objective. This turns out to be the norm that gives us the right direction, as the optimal vector u (up to scaling) is equal to −x (Exercise 9.6); hence, again, it points towards the optimum x⋆ = 0. Moreover,

−x = −(2A)⁻¹(2Ax) = −(∇²h(x))⁻¹ ∇h(x),
which is the Newton step for h at x.
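The contrast between the two directions is easy to see numerically. In the snippet below (an illustration with our own choice A = diag(1, K), matching h(x) = x_1² + Kx_2²), the normalized gradient direction is badly skewed towards the x_2-axis, while the Newton direction is proportional to −x.

```python
import numpy as np

K = 100.0
A = np.diag([1.0, K])
x = np.array([1.0, 1.0])

grad_dir = -2 * A @ x                            # negative gradient: (-2, -2K)
newton_dir = -np.linalg.solve(2 * A, 2 * A @ x)  # -(Hessian)^{-1} gradient = -x
print(grad_dir / np.linalg.norm(grad_dir))       # ~ (-0.0100, -0.99995)
print(newton_dir / np.linalg.norm(newton_dir))   # (-0.7071, -0.7071) = -x/||x||
```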
9.5.1 Steepest descent with respect to a local norm
In this section we present a new norm for a general convex function f : Rⁿ → R so that the Newton step coincides with the direction of steepest descent with respect to this norm. Unlike the quadratic case, where the norm was ‖·‖_A, this new norm will vary with x. However, as in the quadratic case, it will be related to the Hessian of f.
Let f : Rⁿ → R be a strictly convex function, i.e., one whose Hessian ∇²f(x) is PD at every point x ∈ Rⁿ. For brevity, we denote the Hessian ∇²f(x) by H(x). Such a strictly convex function f induces an inner product on Rⁿ. Indeed, at every point x ∈ Rⁿ, an inner product ⟨·,·⟩_x can be defined as

⟨u, v⟩_x := uᵀH(x)v   for all u, v ∈ Rⁿ.

The corresponding norm is

‖u‖_x := √(uᵀH(x)u)   for all u ∈ Rⁿ.
The above are sometimes called the local inner product and local norm with respect to the Hessian ∇²f(·), respectively, as they vary with x. Sometimes we say "local norm" when the underlying function f is clear from the context. They can be used to measure angles or distances between vectors u, v at each x and give rise to a new geometry on Rⁿ. We now revisit the derivation of the gradient descent method as steepest descent, which relied on the use of the Euclidean norm ‖·‖₂, and see what this new geometry yields.
Recall that when deriving the gradient descent algorithm, we decided to pick the direction of steepest descent, which is a solution to the following optimization problem:

argmax_{‖u‖=1} (−Df(x)[u]) = argmax_{‖u‖=1} (−⟨∇f(x), u⟩).    (9.6)

By taking the optimal direction u⋆ with respect to the Euclidean norm, i.e., ‖·‖ = ‖·‖₂, we obtained that u⋆ is in the direction of −∇f(x) and arrived at the gradient flow

dx/dt = −∇f(x).

What if we instead maximize over all u of local norm 1? Then, Equation (9.6) becomes

max_{‖u‖_x = 1} (−⟨∇f(x), u⟩) = max_{uᵀH(x)u = 1} (−⟨∇f(x), u⟩).    (9.7)

The rationale behind the above is clear given our discussion on the quadratic
case – we would like to capture the “shape” of the function f around a point x
with our choice of the norm, and now, our best guess for that is the quadratic
term of f around x which is given by the Hessian. Again, using the Cauchy-
Schwarz inequality we see that the optimal solution to (9.7) is in the direction

− H(x)−1 ∇ f (x), (9.8)

which is exactly the Newton step. Indeed, let v := H(x)−1 ∇ f (x) and observe
that
D E p p
− h∇ f (x), ui = − H(x)1/2 v, H(x)1/2 u ≤ v> H(x)v u> H(x)u = kvk x kuk x ,
(9.9)
and equality if and only if H(x)1/2 u = −H(x)1/2 v. This is the same as

u = −v = −H(x)−1 ∇ f (x)

as claimed. The associated continuous-time dynamical system is


dx  −1
= −H(x)−1 ∇ f (x) = − ∇2 f (x) ∇ f (x).
dt
9.5.2 Local norms are Riemannian metrics
The local norm defined by H(x) introduced in the previous section is an example of a Riemannian metric.¹ Roughly speaking, a Riemannian metric on Rⁿ is a mapping g from Rⁿ to the space of n × n PD matrices. For an x ∈ Rⁿ, g(x) determines the local inner product between any u, v ∈ Rⁿ as

⟨u, v⟩_x := uᵀg(x)v.
This inner product g(x) should be a "smooth" function of x. The Riemannian metric induces the norm ‖u‖_x := √⟨u, u⟩_x. The space Rⁿ along with a Riemannian metric g is an example of a Riemannian manifold. It would take us too far afield to define a Riemannian manifold formally, but let us just mention that a manifold Ω is a topological space that locally resembles Euclidean space near each point. More precisely, an n-dimensional manifold is a topological space with the property that each point has a neighborhood that topologically resembles the Euclidean space of dimension n. It is important to note that, in general, at a point x on a manifold Ω, one may not be able to move in all possible directions while still remaining in Ω (see Figure 9.3). The set of all directions in which one can move at a point x is referred to as the tangent space at x and is denoted by
T_xΩ. In fact, it can be shown to be a vector space. The inner product at x ∈ Ω gives us a way to measure inner products on T_xΩ:

g(x) : T_xΩ × T_xΩ → R.

[Figure 9.3 The tangent space of a point on the 2-dimensional sphere S².]

¹ Note that the use of the word metric here should not be confused with a distance metric.
Thus, we can rephrase our result from the previous section as: Newton's method is gradient descent with respect to a Riemannian metric. In fact, the term H(x)⁻¹∇f(x) is the Riemannian gradient of f at x with respect to the metric H(·). Note that this type of Riemannian metric is very special: the inner product at every point is given by the Hessian of a strictly convex function. Such Riemannian metrics are referred to as Hessian metrics.
9.6 Analysis of Newton's method based on the local norm
The bound on M provided in Theorem 9.4 was stated with respect to the Euclidean distance from the current iterate x_t to the optimal solution x⋆. In this section, we use the local norm introduced in the previous sections to present an affinely invariant analysis of Newton's method for optimization, which avoids the issues that arise from restricting ourselves to the Euclidean norm, as in Theorem 9.4.
9.6.1 A new potential function
Since we deal with convex functions, a natural quantity that can tell us how close or far we are from the optimum is the norm of the gradient ‖∇f(x)‖. However, as observed in the previous section, Newton's method is really gradient descent with respect to the Hessian metric H(x). Since the gradient of f at x with respect to this metric is H(x)⁻¹∇f(x), we should use the local norm of this gradient. Consequently, we arrive at the following potential function:

‖n(x)‖_x := ‖H(x)⁻¹∇f(x)‖_x = √((∇f(x))ᵀ H(x)⁻¹ ∇f(x)).
It can be verified that ‖n(x)‖_x is indeed affinely invariant (Exercise 9.4(b)). We note that ‖n(x)‖_x also has a different interpretation: (1/2)‖n(x)‖_x² is the gap between the current value of f and the minimum value of the second-order quadratic approximation of f at x. To see this, we take an arbitrary point x_0 and set

x_1 := x_0 + n(x_0).

We consider the quadratic approximation f̃ of f at x_0,

f̃(x) := f(x_0) + ⟨∇f(x_0), x − x_0⟩ + (1/2)(x − x_0)ᵀ ∇²f(x_0)(x − x_0).

From our discussion in the previous sections, we know that x_1 minimizes f̃; hence, using the fact that n(x_0) = −(∇²f(x_0))⁻¹ ∇f(x_0), we obtain

f(x_0) − f̃(x_1) = −⟨∇f(x_0), n(x_0)⟩ − (1/2) n(x_0)ᵀ ∇²f(x_0) n(x_0)
             = ⟨∇²f(x_0) n(x_0), n(x_0)⟩ − (1/2) n(x_0)ᵀ ∇²f(x_0) n(x_0)
             = (1/2) ‖n(x_0)‖_{x_0}².
2
At this point it is instructive to revisit the example studied in Section 9.4.1. 
First observe that the new potential kn(e x)kex is small at the point e x = 0, K12 .
2
Indeed, thinking of K2 as large and suppressing lower order terms, we obtain
! ! !
0 0 0 0
n(e x) = −H(e x)−1 ∇F(ex) ≈ − · = − .
0 K2−2 2 2K2−2
Further,
!
1
x)kex ≈ Θ
kn(e .
K2
More generally, kn(x)k x corresponds to the (Euclidean) distance from x to 0
when the polytope P is scaled (along with point x) to become the square
[−1, 1] × [−1, 1]. As our next theorem says, the error measured as kn(x)k x de-
cays quadratically fast in Newton’s method.
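In code, this potential is cheap to evaluate once the gradient and Hessian are available; the helper below is a minimal sketch, with names of our own choosing, that can be combined with the grad/hess functions from the earlier numerical example.

```python
import numpy as np

def newton_decrement(grad_x, hess_x):
    """Return ||n(x)||_x = sqrt( grad f(x)^T H(x)^{-1} grad f(x) )."""
    v = np.linalg.solve(hess_x, grad_x)  # v = H(x)^{-1} grad f(x) = -n(x)
    return np.sqrt(grad_x @ v)

# Usage (hypothetical): newton_decrement(grad(x), hess(x))
```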
9.6.2 Statement of the bound in the local norm
We first define a more convenient and natural NL condition (for Newton-
Local) instead of the NE condition used in Theorem 9.4.
Definition 9.5 (NL condition). Let f : Rⁿ → R. We say that f satisfies the NL condition for δ_0 < 1 if for all 0 < δ ≤ δ_0 < 1 and for all x, y such that

‖y − x‖_x ≤ δ,

we have

(1 − 3δ)H(x) ⪯ H(y) ⪯ (1 + 3δ)H(x).
Roughly, a function satisfies this condition if for any two points close enough
in the local norm, their Hessians are close enough. It is worth noting that the
NL condition is indeed affinely invariant (Exercise 9.4(c)). This means that
x ↦ f(x) satisfies this condition if and only if x ↦ f(φ(x)) satisfies it, for any affine change of variables φ. We now state the main theorem of this chapter.
Theorem 9.6 (Quadratic convergence with respect to the local norm). Let f : Rⁿ → R be a strictly convex function satisfying the NL condition for δ_0 = 1/6, let x_0 ∈ Rⁿ be any point, and let

x_1 := x_0 + n(x_0).

If ‖n(x_0)‖_{x_0} ≤ 1/6, then

‖n(x_1)‖_{x_1} ≤ 3‖n(x_0)‖_{x_0}².
We now inspect the NL condition in more detail. To this end, assume for simplicity that f(x) is a univariate function. Then, the NL condition says, roughly, that

|H(x) − H(y)| ≤ 3‖x − y‖_x |H(x)|

whenever ‖x − y‖_x is small enough. In other words,

(|f″(x) − f″(y)| / ‖x − y‖_x) · (1 / |f″(x)|) ≤ 3.

Note that the first factor on the left hand side above roughly corresponds to bounding the third derivative of f in the local norm. Thus, the above says something very similar to "M ≤ 3", where M is the quantity from Theorem 9.1 (recall that g(x) corresponds to f′(x) here). The difference is that the quantities we consider here are computed with respect to the local norm, as opposed to the Euclidean norm, as in Theorem 9.1 and Theorem 9.4. The constant "3" is by no means crucial to the definition of the NL condition; it is chosen only for the convenience of our later calculations.
9.6.3 Proof of convergence in the local norm
Before proving Theorem 9.6 we first need to establish a simple yet important lemma. It says that if the NL condition is satisfied then the local norms of nearby points are close or, more formally, they have low "distortion" with respect to each other: if we go from x to y and their local distance is a (small) constant, then the local norms at x and y differ by a factor of at most 2.
Lemma 9.7 (Low distortion of close-by norms). Let f : Rⁿ → R be a strictly convex function that satisfies the NL condition for δ_0 = 1/6. Then, whenever x, y ∈ Rⁿ are such that ‖y − x‖_x ≤ 1/6, for every u ∈ Rⁿ we have

1. (1/2)‖u‖_x ≤ ‖u‖_y ≤ 2‖u‖_x, and
2. (1/2)‖u‖_{H(x)⁻¹} ≤ ‖u‖_{H(y)⁻¹} ≤ 2‖u‖_{H(x)⁻¹}.

We remark that in the above, ‖·‖_{H(x)⁻¹} is dual to the local norm ‖·‖_x = ‖·‖_{H(x)}, i.e., ‖·‖_{H(x)⁻¹} = ‖·‖_x*.
Proof of Lemma 9.7. From the NL condition (applied with δ = 1/6) we have

(1/2)H(y) ⪯ H(x) ⪯ 2H(y),

and hence also

(1/2)H(y)⁻¹ ⪯ H(x)⁻¹ ⪯ 2H(y)⁻¹.

Here we used the fact that for two PD matrices A and B, A ⪯ B if and only if B⁻¹ ⪯ A⁻¹. The lemma now follows from the definition of the PSD ordering.
Before we proceed to the proof of Theorem 9.6, we state a simple lemma on
the relationship between the PSD ordering and spectral norms of symmetric
matrices which will be helpful to us in the proof. The proof of this is left as an
exercise (Exercise 9.13).
Lemma 9.8. Suppose A ∈ R^{n×n} is a symmetric positive definite matrix and B ∈ R^{n×n} is symmetric such that

−αA ⪯ B ⪯ αA

for some α ≥ 0. Then

‖A^{−1/2} B A^{−1/2}‖ ≤ α.
We now proceed to prove Theorem 9.6.

Proof of Theorem 9.6. Recall that our goal is to prove that

‖n(x_1)‖_{x_1} ≤ 3‖n(x_0)‖_{x_0}².

Note that kn(x1 )k x1 can also be written as k∇ f (x1 )kH(x1 )−1 and similarly, kn(x0 )k x0 =
k∇ f (x0 )kH(x0 )−1 . Therefore, using the fact that the local norms at x1 and x0 are
the same up to a factor of 2 (see Lemma 9.7) it is enough to prove that

3
k∇ f (x1 )kH(x0 )−1 ≤ k∇ f (x0 )k2H(x0 )−1 . (9.10)
2

To see this, we use the second part of Lemma 9.7) with

x := x1 and y := x0 and u := ∇ f (x1 )

to obtain
1
k∇ f (x1 )kH −1 (x1 ) ≤ k∇ f (x1 )kH −1 (x0 ) .
2
Towards proving (9.10), we first write the gradient ∇f(x_1) in the form A(x_0)∇f(x_0), where A(x_0) is a certain explicit matrix. Subsequently, we show that the norm of A(x_0) is (in a certain sense) small, which in turn allows us to establish Equation (9.10). We have

∇f(x_1) = ∇f(x_0) + ∫₀¹ H(x_0 + t(x_1 − x_0))(x_1 − x_0) dt
    (by the fundamental theorem of calculus applied to ∇f : Rⁿ → Rⁿ)
  = ∇f(x_0) − ∫₀¹ H(x_0 + t(x_1 − x_0)) H(x_0)⁻¹ ∇f(x_0) dt
    (by rewriting (x_1 − x_0) as −H(x_0)⁻¹ ∇f(x_0))
  = ∇f(x_0) − [∫₀¹ H(x_0 + t(x_1 − x_0)) dt] H(x_0)⁻¹ ∇f(x_0)
    (by linearity of the integral)
  = [H(x_0) − ∫₀¹ H(x_0 + t(x_1 − x_0)) dt] H(x_0)⁻¹ ∇f(x_0)
    (by writing ∇f(x_0) as H(x_0)H(x_0)⁻¹ ∇f(x_0))
  = M(x_0) H(x_0)⁻¹ ∇f(x_0),

where in the last equation we used the notation

M(x_0) := H(x_0) − ∫₀¹ H(x_0 + t(x_1 − x_0)) dt.

By taking k·kH(x0 )−1 on both sides of the above derived equality we obtain

k∇ f (x1 )kH(x0 )−1 = M(x0 )H(x0 )−1 ∇ f (x0 ) H(x )−1
0

= H(x )−1/2 M(x )H(x )−1 ∇ f (x )
0 0 0 0 2

(by the fact that kukA = A−1/2 u 2 )

≤ H(x0 )−1/2 M(x0 )H(x0 )−1/2 · H(x0 )−1/2 ∇ f (x0 ) 2
(since kAuk2 ≤ kAk kuk2 )

= H(x0 )−1/2 M(x0 )H(x0 )−1/2 · k∇ f (x0 )kH(x0 )−1 .

(by the fact that kukA = A−1/2 u 2 )
Thus, to conclude the proof it remains to show that the matrix M(x_0) is "small" in the following sense:

‖H(x_0)^{−1/2} M(x_0) H(x_0)^{−1/2}‖ ≤ (3/2)‖∇f(x_0)‖_{H(x_0)⁻¹}.

For this, by Lemma 9.8, it is enough to show that

−(3/2)δH(x_0) ⪯ M(x_0) ⪯ (3/2)δH(x_0),    (9.11)

where, for brevity, δ := ‖∇f(x_0)‖_{H(x_0)⁻¹} ≤ 1/6 by the hypothesis of the theorem. This in turn follows from the NL condition. Indeed, since

δ = ‖∇f(x_0)‖_{H(x_0)⁻¹} = ‖x_1 − x_0‖_{x_0},

if we define

z := x_0 + t(x_1 − x_0)

for t ∈ [0, 1], we have that

‖z − x_0‖_{x_0} = t‖x_1 − x_0‖_{x_0} = tδ ≤ δ.

Hence, from the NL condition, for every t ∈ [0, 1], we obtain

−3tδH(x_0) ⪯ H(x_0) − H(z) = H(x_0) − H(x_0 + t(x_1 − x_0)) ⪯ 3tδH(x_0).

By integrating this inequality with respect to t from t = 0 to t = 1, Equation (9.11) follows. This completes the proof of Theorem 9.6.
9.7 Analysis based on the Euclidean norm
Proof of Theorem 9.4. The basic idea of the proof is the same as in the proof of Theorem 9.1. We consider the function φ : [0, 1] → Rⁿ given by φ(t) := ∇f(x + t(y − x)). Applying the fundamental theorem of calculus to φ (to each coordinate separately) yields

φ(1) − φ(0) = ∫₀¹ φ′(t) dt,

i.e.,

∇f(y) − ∇f(x) = ∫₀¹ H(x + t(y − x))(y − x) dt.    (9.12)
We start by writing x_1 − x⋆ in a convenient form (using (9.12) with x := x_0, y := x⋆, and the fact that ∇f(x⋆) = 0):

x_1 − x⋆ = x_0 − x⋆ + n(x_0)
  = x_0 − x⋆ − H(x_0)⁻¹ ∇f(x_0)
  = x_0 − x⋆ + H(x_0)⁻¹ (∇f(x⋆) − ∇f(x_0))
  = x_0 − x⋆ + H(x_0)⁻¹ ∫₀¹ H(x_0 + t(x⋆ − x_0))(x⋆ − x_0) dt
  = H(x_0)⁻¹ ∫₀¹ (H(x_0 + t(x⋆ − x_0)) − H(x_0))(x⋆ − x_0) dt.
Now take Euclidean norms on both sides to obtain

‖x_1 − x⋆‖₂ ≤ ∫₀¹ ‖H(x_0)⁻¹ (H(x_0 + t(x⋆ − x_0)) − H(x_0))(x⋆ − x_0)‖₂ dt
  ≤ ‖H(x_0)⁻¹‖ · ‖x⋆ − x_0‖₂ · ∫₀¹ ‖H(x_0 + t(x⋆ − x_0)) − H(x_0)‖ dt.    (9.13)
We can then use the Lipschitz condition on H to bound the integral as follows:

∫₀¹ ‖H(x_0 + t(x⋆ − x_0)) − H(x_0)‖ dt ≤ ∫₀¹ L‖t(x⋆ − x_0)‖₂ dt
  ≤ L‖x⋆ − x_0‖₂ ∫₀¹ t dt
  = (L/2)‖x⋆ − x_0‖₂.
Together with (9.13) this implies

‖x_1 − x⋆‖₂ ≤ (L‖H(x_0)⁻¹‖/2) ‖x⋆ − x_0‖₂².    (9.14)

Since L‖H(x_0)⁻¹‖/2 ≤ L/(2h) ≤ M, this completes the proof.
9.8 Exercises
9.1 Consider the problem of minimizing the function f(x) = x log x over x ∈ R_{≥0}. Perform the full convergence analysis of Newton's method applied to minimizing f: consider all starting points x_0 ∈ R_{≥0} and determine where the method converges for each of them.
9.2 Newton's method to find roots of polynomials. Consider a real-rooted polynomial p ∈ R[x]. Prove that if Newton's method is applied to finding roots of p and is initialized at a point x_0 > λ_max(p) (where λ_max(p) is the largest root of p), then it converges to the largest root of p. For a given ε > 0, derive a bound on the number of iterations required to reach a point ε-additively close to λ_max(p).
9.3 Verify that for a twice-differentiable function f : Rⁿ → R, J_{∇f}(x) = ∇²f(x).
9.4 Let f : Rⁿ → R and let n(x) be the Newton step at x with respect to f.
(a) Prove that Newton's method is affinely invariant.
(b) Prove that the quantity ‖n(x)‖_x is affinely invariant while ‖∇f(x)‖₂ is not.
(c) Prove that the NL condition is affinely invariant.
9.5 Consider the following functions f : K → R and check if they satisfy the NL condition for some constant 0 < δ_0 < 1:
(a) f(x) := −log cos(x) on K = (−π/2, π/2),
(b) f(x) := x log x + (1 − x) log(1 − x) on K = (0, 1),
(c) f(x) := −Σᵢ₌₁ⁿ log x_i on K = Rⁿ_{>0}.
9.6 Consider the function

h(x) := xᵀAx − ⟨b, x⟩

for PD A. Prove that

argmax_{‖u‖_A = 1} (−Dh(x)[u]) = −x/‖x‖_A.
9.7 For an n × m real matrix A and a vector b ∈ Rⁿ, consider the set Ω := {x : Ax = b}. Prove that for any x ∈ Ω,

T_xΩ = {y : Ay = 0}.
9.8 Prove that the Riemannian metric x ↦ Diag(x)⁻¹ defined for x ∈ Rⁿ_{>0} is a Hessian metric. Hint: consider the function f(x) := Σᵢ₌₁ⁿ x_i log x_i.
9.9 Let Ω be the subset of R^{n×n} consisting of all positive definite matrices. Prove that the tangent space at any point of Ω is the set of all n × n real symmetric matrices. Further, prove that for any PD matrix X, the inner product

⟨U, V⟩_X := Tr(X⁻¹UX⁻¹V)

for n × n symmetric matrices U, V is a Hessian metric. Hint: consider the function f(X) := −log det X.
9.10 Verify Equation (9.9).
9.11 Physarum dynamics. For an n × m full rank real matrix A and a vector b ∈ Rⁿ, consider the set

Ω := {x : Ax = b, x > 0}

endowed with the Riemannian metric

⟨u, v⟩_x := uᵀX⁻¹v

at every x ∈ Ω, where X := Diag(x). Prove that the vector

P(x) := X(Aᵀ(AXAᵀ)⁻¹b − 1)

is the direction of steepest descent for the function Σᵢ₌₁ᵐ x_i with respect to this Riemannian metric.
9.12 Consider a matrix A ∈ R^{n×n} with strictly positive entries. Prove that the function f : R^{2n} → R given by

f(x, y) = Σ_{1≤i,j≤n} A_{i,j} e^{x_i − y_j} − Σᵢ₌₁ⁿ x_i + Σⱼ₌₁ⁿ y_j

satisfies the following condition, similar to the NL condition but with the local norm replaced by ℓ_∞:

(1/10)∇²f(w) ⪯ ∇²f(v) ⪯ 10∇²f(w)   for all w, v ∈ R^{2n} with ‖w − v‖_∞ ≤ 1.
9.13 Prove Lemma 9.8.
Notes

For a thorough discussion of Newton's method, we refer to the books by Galántai (2000) and Renegar (2001). Exercise 9.2 is adapted from the paper by Louis and Vempala (2016). Exercise 9.11 is from the paper by Straszak and Vishnoi (2016). Exercise 9.12 is extracted from the papers by Cohen et al. (2017) and Zhu et al. (2017). For a formal introduction to Riemannian manifolds, including Riemannian gradients and Hessian metrics, the reader is referred to the survey by Vishnoi (2018).
10
An Interior Point Method for Linear Programming

We build upon Newton's method and its convergence to derive a polynomial time algorithm for linear programming. Key to this algorithm is a reduction from constrained to unconstrained optimization using the notion of a barrier function and the corresponding central path.
10.1 Linear programming
Linear programming is the problem of finding a point that minimizes a linear
function over a polyhedron. Formally, it is the following optimization problem.
Definition 10.1 (Linear program – canonical form). The input consists of a matrix A ∈ Q^{m×n} and a vector b ∈ Q^m that together specify a polyhedron

P := {x ∈ Rⁿ : Ax ≤ b},

and a cost vector c ∈ Qⁿ. The goal is to find an

x⋆ ∈ argmin_{x∈P} ⟨c, x⟩

if P ≠ ∅, or to say infeasible if P = ∅. The bit complexity of the input is the total number of bits required to encode (A, b, c) and is sometimes denoted by L.
Note that, in general, x⋆ may not be unique. Moreover, there may be ways to specify a polyhedron other than as a collection of linear inequalities. We refer to this version of linear programming as canonical. We study some other variants of linear programming in subsequent chapters. A word of caution is that when we transform one form of linear programming into another, we should carefully account for the running time, including the cost of the transformation.

Linear programming is a central problem in optimization and has made its
appearance, implicitly or explicitly, in some of the previous chapters. We also developed various approximate algorithms for some special cases of linear programming using methods such as gradient descent, mirror descent, and MWU. These algorithms require an additional parameter ε > 0 and guarantee to find an x̂ such that

⟨c, x̂⟩ ≤ ⟨c, x⋆⟩ + ε

in time that depends inverse polynomially on the error parameter ε. For instance, the method based on the MWU scheme that we introduced for the bipartite matching problem in Chapter 7 can be applied to general linear programs and yields an algorithm that, given ε > 0, computes a solution x̂ which has value at most ⟨c, x⋆⟩ + ε (for an optimal solution x⋆) and violates all of the constraints by at most ε (additively), in time proportional to 1/ε². Not only is the dependency on ε unsatisfactory, but, when no additional assumptions on A, b, or c are made, these methods run in time exponential in the bit complexity of the input. As discussed in Chapter 4, for a method to be regarded as polynomial time, we require the dependency on the error parameter ε > 0 in the running time to be polynomial in log(1/ε); moreover, the dependency on the bit complexity L should also be polynomial. In this chapter, we derive a method which satisfies both these requirements for linear programming.
Theorem 10.2 (Polynomial time algorithm for solving LPs). There is an algorithm that, given a description of a linear program (A, b, c) with n variables, m constraints, and bit complexity L as in Definition 10.1, and an ε > 0, and assuming that P is full-dimensional¹ and nonempty, outputs a feasible solution x̂ with

⟨c, x̂⟩ ≤ ⟨c, x⋆⟩ + ε

or terminates stating that the polyhedron is infeasible. The algorithm runs in poly(L, log(1/ε)) time.
Here, we do not discuss how to remove the full-dimensionality and nonemptiness assumptions. Given the proof of this theorem, it is not too difficult, but tedious, and we omit it. Also, note that the above theorem does not solve the linear programming problem of Definition 10.1 exactly, as it does not output x⋆ but only an approximation to it. However, with a bit more work one can transform the above algorithm into one that outputs x⋆. Roughly, this is achieved by picking ε > 0 small enough (about 2^{−poly(L)}) and rounding the output x̂ to x⋆ (which is guaranteed to be rational and have bit complexity bounded by O(L)). This is why the poly-logarithmic dependency on 1/ε is crucial.

¹ If the solutions of Ax ≤ b do not lie in a proper affine subspace of Rⁿ, then the polytope is full-dimensional.
10.2 To unconstrained optimization via barrier functions
What makes linear programming hard is the set of constraints encoded as a polyhedron P. However, thus far we have largely discussed methods for unconstrained optimization problems. Projection-based variants of these optimization methods (such as projected gradient descent presented in Chapter 6) can be shown to reduce back to linear programming via an equivalence between separation and optimization discussed in Chapters 12 and 13. In short, we need an entirely new method for constrained optimization. The main goal of this section is to present a different and general methodology to reduce constrained optimization problems to unconstrained optimization problems: via barrier functions.
We consider constrained convex optimization problems of the form

min_{x∈K} f(x),    (10.1)

where f is a convex, real-valued function and K ⊆ Rⁿ is a convex set. To simplify our discussion, we assume that the objective function is linear, i.e., f(x) = ⟨c, x⟩, and that the convex body K is bounded and full-dimensional.
Suppose we have a point x_0 ∈ K and we want to perform an improvement step with respect to the objective ⟨c, x⟩ while maintaining the condition of being inside K. Perhaps the simplest idea is to keep moving in the direction −c to decrease the objective value as much as possible. This will end up on the boundary of K. The second and subsequent iterates would then lie on the boundary, which could force our steps to be short and potentially make this method inefficient. Indeed, such a method (when applied to polytopes) is equivalent to a variant of the simplex method that is known to have an exponential worst-case running time, and we do not pursue this any further.
Instead, we study the possibility of moving the constraints defining K to the objective function and consider the optimization problem

min_{x∈Rⁿ} ⟨c, x⟩ + F(x),    (10.2)

where F(x) can be regarded as a "penalty" for violating the constraints. Thus, F(x) should become big when x is close to the boundary ∂K of the convex set K. One seemingly perfect choice of F would be a function that is 0 inside K and +∞ on the complement of K. However, this objective may no longer be continuous.

If one would like the methods for unconstrained optimization that we de-
veloped in the previous chapters to be applicable here, then F should satisfy
certain properties, and at the very least, convexity. Instead of giving a precise
definition of a barrier function F for K, we list some properties that are ex-
pected:

1. F is defined in the interior of K, i.e., the domain of F is int(K),


2. for every point q ∈ ∂K it holds that: lim x∈K→q F(x) = +∞, and
3. F is strictly convex.
Suppose F is such a barrier function. Then, note that solving (10.2) might give us some idea of what min_{x∈K} ⟨c, x⟩ is, but it does not give us the right answer, since the addition of F(x) to the objective alters the location of the point we are looking for. For this reason, one typically considers a family of perturbed objective functions f_η parametrized by η > 0 as follows:

f_η(x) := η⟨c, x⟩ + F(x).    (10.3)

For mathematical convenience, we can imagine that f_η is defined on all of Rⁿ but attains finite values only on int(K). Intuitively, making η bigger and bigger reduces the influence of F(x) on the optimal value of f_η(x). We now focus on the concrete problem at hand: linear programming.
10.3 The logarithmic barrier function
Recall that the linear programming problem introduced earlier is the problem of minimizing a linear function ⟨c, x⟩ over the polyhedron

P := {x ∈ Rⁿ : ⟨a_i, x⟩ ≤ b_i for i = 1, 2, . . . , m},    (10.4)

where a_1, a_2, . . . , a_m are the rows of A (treated as column vectors). To implement the high-level idea sketched in the previous section we use the following barrier function.

Definition 10.3 (Logarithmic barrier function). For a matrix A ∈ R^{m×n} (with rows a_1, a_2, . . . , a_m ∈ Rⁿ) and a vector b ∈ R^m we define the logarithmic barrier function F : int(P) → R as

F(x) := −Σᵢ₌₁ᵐ log(b_i − ⟨a_i, x⟩).

The domain of this function is the interior of P defined as


int(P) := {x ∈ P : hai , xi < bi , for i = 1, 2, . . . , m}.
For the definition to make sense, we assume that P is bounded (a polytope) and
full-dimensional in Rn .2 Note that F(x) is strictly convex on int(P) and tends
to infinity when approaching the boundary of P (Exercise 10.1(a)). Intuitively,
one can think of each term − log(bi − hai , xi) as exerting a force at the con-
straint hai , xi ≤ bi , which becomes larger the closer the point x comes to the
hyperplane {y : hai , yi = bi }, and is +∞ if the point x lies on or on the wrong
side of this hyperplane.
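The following NumPy sketch (conventions and names ours) computes F together with its gradient and Hessian, matching the formulas used throughout this chapter: ∇F(x) = Σᵢ a_i/s_i(x) and ∇²F(x) = Σᵢ a_i a_iᵀ/s_i(x)², where s_i(x) = b_i − ⟨a_i, x⟩ are the slacks.

```python
import numpy as np

def log_barrier(A, b, x):
    """Value, gradient, and Hessian of F(x) = -sum_i log(b_i - <a_i, x>)."""
    s = b - A @ x                    # slacks s_i(x); all must be positive
    assert np.all(s > 0), "x must lie in int(P)"
    F = -np.sum(np.log(s))
    grad = A.T @ (1.0 / s)           # sum_i a_i / s_i(x)
    H = (A / s[:, None]**2).T @ A    # sum_i a_i a_i^T / s_i(x)^2
    return F, grad, H
```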
10.4 The central path
Given the logarithmic barrier function, we go back to the parametrized family of perturbed objectives {f_η}_{η≥0} introduced in Equation (10.3). Observe that since ⟨c, x⟩ is a linear function, the second-order behavior of f_η is completely determined by F, as

∇²f_η = ∇²F.

In particular, f_η is also strictly convex and has a unique minimizer. This motivates the following notion.
Definition 10.4 (Central path). For η ≥ 0, denote by x_η⋆ the unique minimizer of f_η(x) over x ∈ int(P). We call x_0⋆ the analytic center of P. The set of all these minimizers is called the central path (with respect to the cost vector c) and is denoted by

Γ_c := {x_η⋆ : η ≥ 0}.
While it is not used here, the central path Γ_c can be shown to be continuous using the implicit function theorem. It originates at the analytic center x_0⋆ of P and approaches x⋆ as η → ∞. In other words,

lim_{η→∞} x_η⋆ = x⋆.
[Figure 10.1 Example of a central path for the cost vector c.]

The idea now is to start at some point, say x_1⋆ (i.e., for η = 1), on the central path, and progress along the path by gradually taking η to infinity (see Figure 10.1). A method that follows this general approach is referred to as a path following interior point method (IPM). The key questions from the algorithmic point of view
are as follows:

1. Initialization: How to find the analytic center of P?
2. Following the central path: How to follow the central path using discrete steps?
3. Termination: How to decide when to stop following the central path (our goal is to reach a close enough neighborhood of x⋆)?
We make this method precise in the next section and answer all of these ques-
tions in our analysis of the path following IPM in Section 10.6.
10.5 A path following algorithm for linear programming
While the initialization and termination questions (1 and 3 in the list at the end of the previous section) are certainly important, we first focus on question 2 and explain how to progress along the path. This is where the power of Newton's method developed in the previous chapter is leveraged.

Suppose for a moment that we are given a point x_0 ∈ P that is close to the central path, i.e., to x_{η_0}⋆ for some η_0 > 0. Further assume that the function f_{η_0} satisfies the NL condition from Chapter 9 (we show this later). Then, we know that by performing a Newton step

x_1 := x_0 + n_{η_0}(x_0),

where for any x and η > 0 we define

n_η(x) := −(∇²f_η(x))⁻¹ ∇f_η(x) = −(∇²F(x))⁻¹ ∇f_η(x),

we make significant progress towards x_{η_0}⋆. Note that since F is strictly convex, (∇²F(x))⁻¹ exists for all x ∈ int(P). The main idea is to use this progress, and the opportunity it presents, to increase the value of η_0 to

η_1 := η_0 · (1 + γ)

for some γ > 0, so that x_1 is close enough to x_{η_1}⋆ and again satisfies the NL condition, enabling us to repeat this procedure and continue.
The key question becomes: how large a γ can we pick in order to make this scheme work and produce a sequence of pairs (x_0, η_0), (x_1, η_1), . . . such that η_t increases at a rate (1 + γ) and x_t stays in the quadratic convergence region of f_{η_t}? We show in Section 10.6 that the right value for γ is roughly 1/√m. A more precise description of this algorithm is presented in Algorithm 7. This algorithm description is not complete, and we explain how to set η_0, why η_T > m/ε suffices, how to compute the Newton step, and how to terminate in the upcoming sections.

The notion of closeness of a point x_t to the central path, i.e., to x_{η_t}⋆, that we employ here is

‖n_{η_t}(x_t)‖_{x_t} ≤ 1/6.

This is directly motivated by the guarantee obtained in Chapter 9 for Newton's method: this condition means precisely that x_t belongs to the quadratic convergence region for f_{η_t}.
10.5.1 Structure of the proof of Theorem 10.2
In this section, we present an outline of the proof of Theorem 10.2. We show
that an appropriate implementation of the path following IPM presented in
Algorithm 7 yields a polynomial time algorithm for linear programming. This
requires making the scheme a little bit more precise (explaining how to perform
step 1). However, the crucial part of the analysis is showing that steps 4 and 5
indeed guarantee to track the central path closely. To this end, we prove that
the following closeness invariant holds.
Algorithm 7: Path following IPM for linear programming

Input:
• A ∈ Q^{m×n}, b ∈ Q^m, c ∈ Qⁿ
• An ε > 0
Output: A point x̂ such that Ax̂ ≤ b and ⟨c, x̂⟩ − ⟨c, x⋆⟩ ≤ ε
Algorithm:
1: Initialization: Find an initial η_0 > 0 and x_0 with ‖n_{η_0}(x_0)‖_{x_0} < 1/6
2: Let T be such that η_T := η_0 (1 + 1/(20√m))^T > m/ε
3: for t = 0, 1, . . . , T do
4:   Newton step: x_{t+1} := x_t + n_{η_t}(x_t)
5:   Changing η: η_{t+1} := η_t (1 + 1/(20√m))
6: end for
7: Termination: Calculate x̂ by performing two Newton steps with respect to f_{η_T} starting at x_T
8: return x̂
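A compact Python sketch of the main loop (steps 3–6) is given below, reusing the log_barrier helper sketched in Section 10.3; the initialization (finding η_0, x_0) and the final termination steps are omitted, and a centered starting pair satisfying the closeness condition is assumed to be given.

```python
import numpy as np

def path_following_ipm(A, b, c, x0, eta0, eps):
    """Follow the central path, assuming ||n_eta0(x0)||_x0 < 1/6 holds."""
    m = A.shape[0]
    x, eta = np.asarray(x0, dtype=float), eta0
    while eta <= m / eps:
        _, g, H = log_barrier(A, b, x)
        x = x + np.linalg.solve(H, -(eta * c + g))  # Newton step for f_eta
        eta *= 1 + 1 / (20 * np.sqrt(m))            # gently increase eta
    return x  # termination (two more Newton steps w.r.t. f_etaT) omitted
```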
Lemma 10.5 (Closeness invariant). For every t = 0, 1, 2, . . . , T it holds that

‖n_{η_t}(x_t)‖_{x_t} ≤ 1/6.
To prove that this invariant holds, we consider one iteration and show that, if the invariant holds for t, then it also holds for t + 1. Needless to say, we have to ensure that the closeness invariant holds for t = 0. Every iteration consists of two steps: the Newton step with respect to f_{η_t} and the increase of η_t to η_{t+1}. We formulate two more lemmas that capture what happens when performing each of these steps. The first lemma uses Theorem 9.6.
Lemma 10.6 (Effect of a Newton step on centrality). The logarithmic barrier function F satisfies the NL condition for δ_0 = 1/6. Therefore, for every x ∈ int(P) and every η > 0 such that ‖n_η(x)‖_x ≤ 1/6, it holds that

‖n_η(x′)‖_{x′} ≤ 3‖n_η(x)‖_x²,

where x′ := x + n_η(x).
This implies that one Newton step brings us closer to the central path. The lemma below tells us that if we are very close to the central path, then it is safe to increase η_t by a factor of roughly 1 + 1/√m so that centrality is still preserved.

Lemma 10.7 (Effect of re-centering on centrality). For every point x ∈ int(P) and any two positive η, η′ > 0, we have

‖n_{η′}(x)‖_x ≤ (η′/η)‖n_η(x)‖_x + √m |η′/η − 1|.
Thus, if we initialize the algorithm at η_0 and terminate it at η_T, then after O(√m log(η_T/η_0)) iterations, each of which amounts to solving one n × n linear system (of the form ∇²F(x)y = z, for a given point x ∈ int(P) and vector z), we recover an ε-approximate optimal solution. This concludes the part of the analysis concerning the loop in step 3 of the algorithm and the issue of centrality. We now move to the part concerning initialization.
We would like an algorithm to find a good starting point, i.e., one for which
η0 is not too small and for which the corresponding Newton step is short.
Lemma 10.8 (Efficient initialization). Step 1 of Algorithm 7 can be implemented in polynomial time to yield

η_0 = 2^{−Õ(nL)}

and x_0 ∈ int(P) such that

‖n_{η_0}(x_0)‖_{x_0} ≤ 1/6.

Here n is the dimension of P and L is the bit complexity of (A, b, c).
Note that it is crucial that the above lemma provides a lower bound on η_0. This allows us to deduce that the number of iterations T in Algorithm 7 is polynomially bounded and, hence, that we obtain a polynomial time algorithm for linear programming. It is also worth noting here that for "well-structured" linear programs one can sometimes find starting points with a larger value of η_0, such as η_0 = 1/m, which can have an impact on the number of iterations; see Chapter 11.
Finally, we consider step 7 of the algorithm: termination. This says that by adjusting the point x_T just a little bit we can reach a point x̂ very close to x⋆ in objective value. More precisely, we show the following.

Lemma 10.9 (Efficient termination). By performing two Newton steps initialized at x_T, where T is such that η_T ≥ m/ε, we obtain a point x̂ that satisfies

⟨c, x̂⟩ ≤ ⟨c, x⋆⟩ + 2ε.
We are now ready to state the main theorem characterizing the performance of
Algorithm 7.
Theorem 10.10 (Convergence of the path following IPM). Algorithm 7, after T = O(√m log(m/(εη_0))) iterations, outputs a point x̂ ∈ P that satisfies

⟨c, x̂⟩ ≤ ⟨c, x⋆⟩ + 2ε.

Moreover, every iteration (step 4) requires solving one linear system of the form ∇²F(x)y = z, where x ∈ int(P), z ∈ Rⁿ is determined by x, and we solve for y. Thus, it can be implemented in polynomial time.
Note now that, by combining Theorem 10.10 with Lemma 10.8, the proof of Theorem 10.2 follows easily.
10.6 Analysis of the path following IPM
The purpose of this section is to prove Lemma 10.5, i.e., that the invariant

‖n_{η_t}(x_t)‖_{x_t} ≤ 1/6

holds for all t = 1, 2, . . . , T. As explained in the previous section, this follows from Lemmas 10.6 and 10.7; we now proceed with their proofs.
Effect of a Newton step on centrality. We recall the NL condition introduced in Chapter 9.

Definition 10.11 (NL condition). Let f : Rⁿ → R. We say that f satisfies the NL condition for δ_0 < 1 if for all 0 < δ ≤ δ_0 < 1 and for all x, y such that

‖y − x‖_x ≤ δ,

we have

(1 − 3δ)H(x) ⪯ H(y) ⪯ (1 + 3δ)H(x).
Proof of Lemma 10.6. We need to prove that f_η (or, equivalently, F) satisfies the NL condition for δ_0 = 1/6. In fact, we prove that F satisfies the NL condition for any δ < 1. To this end, check that the Hessian

H(x) = ∇²f_η(x) = ∇²F(x)

is given by

H(x) = Σᵢ₌₁ᵐ a_i a_iᵀ / s_i(x)²,

where
si (x) := bi − hai , xi;

see Exercise 10.1(b). Consider any two points x, y with ky − xk x = δ < 1. We


have
m
hai , y − xi .
X 2
δ2 = (y − x)> H(x)(y − x) = si (x)
i=1

Thus, in particular, every term in this sum is upper bounded by δ². Hence, for every i = 1, 2, . . . , m we have

|s_i(x) − s_i(y)| / s_i(x) = |⟨a_i, y − x⟩| / s_i(x) ≤ δ.

Consequently, for every i = 1, 2, . . . , m we have

(1 − δ)s_i(x) ≤ s_i(y) ≤ (1 + δ)s_i(x)

and, thus,

(1 + δ)⁻² / s_i(x)² ≤ 1 / s_i(y)² ≤ (1 − δ)⁻² / s_i(x)².

It now follows that

(1 + δ)⁻² a_i a_iᵀ / s_i(x)² ⪯ a_i a_iᵀ / s_i(y)² ⪯ (1 − δ)⁻² a_i a_iᵀ / s_i(x)²

and, thus, by summing the above over all i, we obtain

(1 + δ)⁻² H(x) ⪯ H(y) ⪯ (1 − δ)⁻² H(x).

To arrive at the NL condition it remains to observe that for every δ ∈ (0, 0.23) it holds that

1 − 3δ ≤ (1 + δ)⁻² and (1 − δ)⁻² ≤ 1 + 3δ.
Effect of changing η on centrality. We proceed to the proof of Lemma 10.7, which asserts that changing η slightly does not increase ‖n_η(x)‖_x by much for a fixed point x. Let

g(x) := ∇F(x).
10.6 Analysis of the path following IPM 191

Proof of Lemma 10.7. We have


nη0 (x) = H(x)−1 ∇ fη0 (x)
= H(x)−1 (η0 c + g(x))
η0 η0
!
= H(x)−1 (ηc + g(x)) + 1 − H(x)−1 g(x)
η η
η0 η0
!
= H(x)−1 ∇ fη (x) + 1 − H(x)−1 g(x).
η η

After taking norms and applying triangle inequality with respect to k · k x we


obtain
η0 η0

H(x)−1 ∇ fη0 (x) x ≤ H(x)−1 ∇ fη (x) x + 1 − H(x)−1 g(x) x .
η η
We take a pause for a moment and try to understand the significance of the
specific terms in the right hand side of the inequality above. By the closeness
invariant, the term kH(x)−1 ∇ fη (x)k x is a small constant. The goal is to show
that the whole right hand side is also bounded by a small constant. We should
think of η0 as η(1 + γ) for some small γ > 0. Thus, ηη kH(x)−1 (x)∇ fη (x)k x is still
0

a small
constant,
and what prevents us from choosing a large γ is the second
η0
term 1 − η kH(x)−1 g(x)k x . Thus, what remains to do is to derive an upper

bound on kH(x)−1 g(x)k x . To this end, we show something stronger:

sup kH(y)−1 g(y)ky ≤ m.
y∈int(P)

To see this, we pick any y ∈ int(P) and denote z := H −1 (y)g(y). From the
Cauchy-Schwarz inequality we obtain
v
t m
m
X hz, ai i √ X hz, ai i2
kzky = g(y) H(y) g(y) = hz, g(y)i =
2 > −1
≤ m . (10.5)
i=1
si (y) i=1
si (y)2

By inspecting the rightmost expression in the inequality above, we obtain:


m
 m 
X hz, ai i2 X ai a>i 
2
= z 
>
 z = z> H(y)z = kzk2y .
2
(10.6)
i=1
s i (y) i=1
si (y)

Putting (10.5) and (10.6) together we obtain that



kzk2y ≤ mkzky .
Hence,

kzky ≤ m.
192 An Interior Point Method for Linear Programming

Concluding that the invariant holds.

Proof of Lemma 10.5. Suppose that


1
nηt (xt ) x ≤
t 6
for t ≥ 0. Then, by the quadratic convergence of Newton’s method – Lemma 10.6,
we obtain that xt+1 satisfies
2 1
nηt (xt+1 ) x ≤ 3 nηt (xt ) x ≤ .
t+1 t 12
Further, by Lemma 10.7 it follows that
ηt+1 √ η

1 1 1
nηt (xt+1 ) x + m t+1 − 1 < (1+o(1))· +

nηt+1 (xt+1 ) x ≤ ≤ .
t+1 ηt t+1 ηt 12 20 6

10.6.1 Termination condition


First, we give a proof of Lemma 10.9 under the “ideal assumption” that in
the termination step of Algorithm 7 we actually reach xη?T (and not only a
point close to it). Subsequently, we strengthen this result and show that an
approximate minimum of fηT (which we get as a result of our algorithm) also
gives a suitable approximation guarantee.

Termination under the “ideal assumption”. Assume that the point output
by Algorithm 7 is really
x̂ := xη?T .

Then, the lemma below explains our choice of ηT ≈ m


ε.

Lemma 10.12 (Dependence of the approximation on the choice of η) For


every η > 0 we have
m
hc, xη? i − hc, x? i < .
η
Proof Recall that

∇ fη (x) = ∇(ηhc, xi + F(x)) = ηc + ∇F(x) = ηc + g(x).


10.6 Analysis of the path following IPM 193

The point xη? is the minimum of fη , hence, by the first order optimality condi-
tion, we know
∇ fη (xη? ) = 0
and, thus,
g(xη? ) = −ηc. (10.7)
Using this observation we obtain that
E 1D
hc, xη? i − hc, x? i = − c, x? − xη? = g(xη? ), x? − xη? .
D E
η
To complete the proof it remains to argue that g(xη? ), x? − xη? < m. We show
D E

even more: for every two points x, y in the interior of P, we have


hg(x), y − xi < m.
This follows from the following simple calculation

m
X hai , y − xi
hg(x), y − xi =
i=1
si (x)
m
X si (x) − si (y)
=
i=1
si (x)
m
X si (y)
=m−
s
i=1 i
(x)
< m,
where in the last inequality we make use of the fact that our points x, y are
strictly feasible, i.e., si (x), si (y) > 0 for all i.

Dropping the “ideal assumption”. We now show that even after dropping
the ideal assumption we still get an error which is O(ε). For this, we perform a
constant number of Newton iterations starting from xT , so that the local norm
of the output x̂, nηT ( x̂) x̂ , becomes small.


We derive a relation between the length of the Newton step at a point x (with
respect to the function fη ) and the distance to the optimum xη? . We show that
whenever kn(x)k x is sufficiently small, kx − xη? k x is small as well. This fact,
together with a certain strengthening of Lemma 10.12, implies that in the last
step of Algorithm 7, only two additional Newton steps bring us 2ε-close to the
optimum.
194 An Interior Point Method for Linear Programming

We start with an extension of Lemma 10.12 that shows that to get a decent
approximation of the optimum, we do not necessarily need to be on the central
path, but only close enough to it.
Lemma 10.13 (Approximation guarantee for points close to central path)
For every point x ∈ int(P) and every η > 0, if kx − xη? k x < 1 then:
m −1
hc, xi − hc, x? i ≤ 1 − kx − xη? k x .
η
Proof For every y ∈ int(P) we have:
hc, x − yi = hH(x)−1/2 c, H 1/2 (x)(x − y)i

≤ H(x)−1/2 c H 1/2 (x)(x − y)
2 2
= kH(x)−1 ck x kx − yk x
Where inequality follows from Cauchy-Schwarz. Let
c x := H(x)−1 c.
This term is also the Riemannian gradient of the objective function hc, xi at x
with respect to the Hessian metric. Now, we bound kc x k x . Imagine we are at
point x and we move in the direction of −c x until we hit the boundary of the
unit ball (in the local norm) around x. We will land at the point x − kccxxkx , which
is still inside P (as proved in Exercise 10.4). Therefore,
* +
cx
c, x − ≥ hc, x? i.
kc x k x
Since hc, c x i = kc x k2x , by substituting and rearranging in the inequality above,
we obtain
kc x k x ≤ hc, xi − hc, x? i.
Thus,
hc, x − yi ≤ kx − yk x (hc, xi − hc, x? i). (10.8)
Now, we express
hc, xi − hc, x? i = hc, xi − hc, xη? i + hc, xη? i − hc, x? i
and use (10.8) with y = xη? . We obtain
hc, xi − hc, x? i ≤ (hc, xi − hc, x? i)kx − xη? k x + (hc, xη? i − hc, x? i).
Thus,
(hc, xi − hc, x? i)(1 − kx − xη? k x ) ≤ hc, xη? i − hc, x? i.
By applying Lemma 10.12 the result follows.
10.6 Analysis of the path following IPM 195

Note that in Algorithm 7 we never mention the condition that kx− xη? k x is small.
However, we show that it follows from knη (x)k x being small. In fact, we prove
the following more general lemma which holds for any function f satisfying
the NL condition.

Lemma 10.14 (Distance to the optimum and the Newton step) Let f :
Rn → R be any strictly convex function satisfying the NL condition for δ0 = 61 .
Let x be any point in the domain of f . Consider the Newton step n(x) at a point
x. If
1
kn(x)k x < ,
24
then
kx − zk x ≤ 4kn(x)k x ,

where z is the minimizer of f .

Proof Pick any h such that khk x ≤ 16 . Expand f (x + h) into a Taylor series
around x and use the mean value theorem (Theorem 9.2) to write
1
f (x + h) = f (x) + hh, ∇ f (x)i + h> ∇2 f (θ)h (10.9)
2
for some point θ lying in the interval (x, x + h). We proceed by lower bounding
the linear term. Note that

|hh, ∇ f (x)i| = |hh, n(x)i x | ≤ khk x kn(x)k x (10.10)

by the Cauchy-Schwarz inequality. Subsequently, note that


1 > 1
h> ∇2 f (θ)h = h> H(θ)h ≥ h H(x)h = khk2x , (10.11)
2 2
where we used the NL condition with y := θ for δ = 61 for which 1 − 3δ = 12 .
Applying bounds (10.10), (10.11) to the expansion (10.9) results in:
1
f (x + h) ≥ f (x) − khk x kn(x)k x + khk2x . (10.12)
4
Set r := 4kn(x)k x and consider points y satisfying

kyk x = r,

i.e., points on the boundary of the local norm ball of radius r centered at x. For
such a point y,
4 1
kyk x = 4kn(x)k x ≤ =
24 6
196 An Interior Point Method for Linear Programming

and, hence, (10.12) applies. Moreover, (10.12) simplifies to


1
f (x + y) ≥ f (x) − kyk x kn(x)k x + kyk2x = f (x) − 4kn(x)k2x + 4kn(x)k2x ≥ f (x).
4
Since f is strictly convex and its value at the center x of the ball is no more
than its value at the boundary, z, the unique minimizer of f , must belong to the
above mentioned ball, completing the proof of the lemma.

Proof of Lemma 10.9. Let x̂ be the point obtained by doing two additional
Newton steps with a fixed ηT and starting at xT . Then, since
1
knηT (xT )k xT ≤ ,
6
we know that
1
knηT ( x̂)k x̂ ≤ .
48
Applying Lemma 10.13 and Lemma 10.14 for such a x̂ we obtain that if η ≥ m
ε
then
1
hc, x̂i − hc, x? i ≤ ε < 2ε.
1 − 4knηT ( x̂)k x̂

10.6.2 Initialization
In this section we present a method for finding an appropriate starting point.
More precisely, we show how to find efficiently some η0 > 0 and x0 such that
knη0 (x0 )k x0 ≤ 16 . Before we start, we remark that we provide a very small η0 – of
order 2−poly(L) . While this enables us to prove that Algorithm 7 can solve linear
programming in polynomial time, it does not seem promising when trying to
apply IPM to devise fast algorithms for combinatorial problems. Indeed, there
is a factor of log η−1 0 in the bound on the number of iterations, which translates
to L. To make an algorithm fast, we need to have ideally η0 = Ω(1/poly(m)).
It turns out that for specific problems (such as maximum flow) we can devise
some specialized methods to find such an η0 and x0 .

Finding (η0 , x0 ) given a point in the interior of P. First we show that assum-
ing we are given a point x0 ∈ int(P), we can find a starting pair (η0 , x0 ) with the
desired property. In fact, we assume something stronger: that each constraint
is satisfied with slack at least 2−O(nL) at x0 , i.e.,
e

bi − hai , x0 i ≥ 2−O(nL) .
e
10.6 Analysis of the path following IPM 197

𝚪𝒅
𝚪𝒄
𝑓 𝑒
𝒙⋆𝟎
𝚪𝒆
𝚪𝒇
𝑐 𝑑

Figure 10.2 Illustration of central paths Γc , Γd , Γe , Γ f for four different objective


vectors c, d, e and f . All the paths originate at the analytic center x0? and converge
to some boundary point of P. Note that because we minimize the linear objective,
the cost vectors point in the “opposite” direction to the direction traversed by the
corresponding central path.

Our procedure for finding a point in P provides such an x0 , assuming that P is


full-dimensional and nonempty. Here, we do not discuss how to get rid of the
full-dimensionality assumption. It is easy to deal with, but tedious.
Recall that we want to find a point x0 close to the central path

Γc := {xη? : η ≥ 0},

which corresponds to the objective function hc, xi. Note that as η → 0, xη? →
x0? , the analytic center of P. Hence, finding a point x0 close to the analytic
center and choosing η0 to be some small number should be a good strategy.
But how do we find a point close to x0? ?
While it is the path Γc that is of interest, in general, one can define other
central paths. If d ∈ Rn is any vector, define
( )
Γd := argmin (ηhd, xi + F(x)) : η ≥ 0 .
x∈int(P)

As one varies d, what do the paths Γd have in common? The origin! They all
start at the same point: the analytic center of P; see Figure 10.2. The idea now
is to pick one such path on which the initial point x0 lies and traverse it in
reverse to reach a point close to the analytic center of this path, giving us x0 .
Recall that g is used to denote the gradient of the logarithmic barrier F. If
we define
d := −g(x0 ),
198 An Interior Point Method for Linear Programming

then x0 ∈ Γd for η = 1. To see this, denote


fη0 (x) := ηhd, xi + F(x)
and let xη0? be the minimizer of fη0 . Then x10? = x0 since ∇ f10 (x0 ) = 0. If we use
n0η (x) to denote the Newton step at x with respect to fη0 , then note that
n01 (x0 ) = 0.
As mentioned above, our strategy is to move along the Γd path in the direction
of decreasing η. We use an analogous method as in the Algorithm 7. In each
 respect to current η and then, in the
iteration we perform one Newton step with
centering step, decrease η by a factor of 1 − 1√
20 m
. At each step it holds that
1
kn0η (x)k x
≤ by an argument identical to the proof of Lemma 10.5. It remains
6,
to see, how small an η is required (this determines the number of iterations we
need to perform) to allow us to “jump” from Γd to Γc .
For this we first prove that, by taking η small enough, we reach a point x
very close to the analytic center, i.e., for which H(x)−1 g(x) x is small. Next
we show that having such a point x, with an appropriate η0 satisfies
1
nη0 (x) x ≤ .
6
The first part is formalized in the lemma below.
Lemma 10.15 (Following the central path in reverse) Suppose that x0 is a
point such that ∀i = 1, 2, . . . , m,
si (x0 ) ≥ β · max si (x).
x∈P

For any η > 0 denote by n0η (x) the Newton step


n0η (x):= −H(x)−1 (−ηg(x0 ) + g(x)).
β

Then, whenever η ≤ √
24 m
1
and n0η (x) x ≤ 24 , it holds that
1
H(x)−1 g(x) x ≤ .
12
Proof From the triangle inequality of the local norm we have

H(x)−1 g(x) ≤ n0 (x) + η H(x)−1 g(x0 ) .
x η x x

Thus, it remains to show a suitable upper bound on H(x)−1 g(x0 ) x . Denote by
S x and S x0 the diagional matrices with s(x) and s(x0 ) (the slack vectors at x and
x0 respectively) on the diagonals. Then, one can compactly write
H(x) = A> S −2
x A, x 1, and g(x ) = A S x0 1,
g(x) = A> S −1 0 > −1
10.6 Analysis of the path following IPM 199

where 1 ∈ Rm is the all-ones vector. Thus,


2
H(x)−1 g(x0 ) = g(x0 )> H(x)−1 g(x0 ) = 1> S −1 A(A> S −2 A)−1 A> S −1 1.
x x0 x x0

This can be also rewritten as


2
H(x)−1 g(x0 ) x = v> Πv (10.13)
where Π := S −1 > −2 −1 > −1 −1
x A(A S x A) A S x and v := S x S x0 1. One can note now that
Π is an orthogonal projection matrix: indeed Π is symmetric and Π2 = Π.
Therefore,
v> Πv = kΠvk22 ≤ kvk22 . (10.14)
Further
m
X si (x)2 m
kvk22 = ≤ 2. (10.15)
0
s (x )
i=1 i
2 β

Thus, combining (10.13), (10.14) and (10.15), we obtain



m
−1 0
H(x) g(x ) x ≤ ,
β
and the lemma follows.
The next lemma shows that given a point close to the analytic center, we can
use it to initialize our algorithm, assuming that η0 is small enough.
Lemma 10.16 (Switching the central path) Suppose that x ∈ int(P) is a
1
point such that H(x)−1 g(x) x ≤ 12 and η0 is such that
1 1
η0 ≤ · ,
12 hc, x − x? i
where x? is an optimal solution to the linear program. Then nη0 (x) x ≤ 16 .

Proof We have

nη0 (x) x = H(x)−1 (η0 c + g(x)) x ≤ η0 H(x)−1 c x + H(x)−1 g(x) x .

Since H(x)−1 g(x) ≤ 1 , to arrive at the conclusion, it suffices to prove that
x 12

H(x)−1 c x ≤ hc, x − x? i.

To see this, denote


c x := H(x)−1 c.
The following is an exercise (we used this fact before in Lemma 10.13)
E x := {y : (y − x)> H(x)(y − x) ≤ 1} ⊆ P.
200 An Interior Point Method for Linear Programming

Thus,
cx
x− ∈P
kc x k x
because this point belongs to E x . Therefore,
* + D
cx
≥ c, x? ,
E
c, x −
kc x k x
since x? minimizes this linear objective over P. The above, after rewriting
gives
* +
cx
≤ hc, xi − c, x? .
D E
c,
kc x k x
By observing that
* +
cx
c, = H(x)−1 c x ,
kc x k x
the lemma follows.

Given Lemmas 10.15 and 10.16 we are now ready to prove that a good starting
point can be found in polynomial time.

Lemma 10.17 (Finding a starting point given an interior point) There is


an algorithm which given a point x0 ∈ int(P) such that ∀i = 1, 2, . . . , m

si (x0 ) ≥ β · max si (x),


x∈P

outputs a point x0 ∈ int(P) and an η0 > 0 such that


1
nη0 (x0 ) x ≤
0 6
and
1
η0 ≥ ≥ 2−O(nL) .
e
kck2 · diam(P)

The running time of this algorithm is polynomial in n, m, L and log β1 . Here,


diam(P) is the diameter of the smallest ball that contains P.

Proof The lemma follows from a combination of Lemmas 10.15 and 10.16.

1. Start by running the path following IPM in reverse for the cost function
d := −g(x0 ), starting point as x0 , and the starting value of η as 1. Since, by
choice, x0 is optimal for the central path Γd , we know that kn1 (x0 )k x0 = 0.
10.6 Analysis of the path following IPM 201

2. Thus, we keep running it till η becomes less than


β
( )
1
η0 := min √ , .
24 m kck2 · diam(P)
At this point x we know that knη0 (x)k x ≤ 61 .
3. We now do a couple of iterations of the Newton method (with fixed η0 ) to
come to a point y such that
1
knη0 (y)ky ≤ .
24
4. Since all the conditions of Lemma 10.15 are satisfied, we can use it to
ensure that
1
kH(y)−1 g(y)ky ≤ .
12
5. Finally, we use Lemma 10.16 to switch and note that y and η0 satisfy the
initial conditions required to start the path following IPM in the forward
direction for the cost vector c. Here, we use the fact that for x ∈ P, using
the Cauchy-Schwarz inequality, we have
c, x − x? ≤ kck · x − x? ≤ kck · diam(P).
D E
2 2 2

Thus,
1 1
η0 ≤ ≤ .
kck2 · diam(P) hc, y − x? i

To conclude, we use Exercise 10.2 that shows that diam(P) ≤ 2O(nL) .


e

Finding a point in the interior of P. To complete our proof we show how


to find a suitable x0 ∈ int(P). Towards this, we consider an auxiliary linear
program:
min t
(t,x)∈Rn+1

a>i x ≤ bi + t, ∀1 ≤ i ≤ m, (10.16)
− C ≤ t ≤ C.
for some large integer C > 0 that we specify in a moment. Taking x = 0
and t big enough (t := 1 + i |bi |), we obtain a strictly feasible solution with
P
at least O(1) slack at every constraint; see Exercise 10.5. Thus, we can use
the Algorithm 7 in conjunction with Lemma 10.17 to solve it up to precision
2−Ω(nL) in polynomial time. If P is full-dimensional and nonempty, then the
e

optimal solution t? to (10.16) is negative and in fact t? ≤ −2−O(nL) ; see Exercise


e

10.2. Hence, we can set C := 2−O(nL) . Thus, by solving Equation (10.16) up to


e
202 An Interior Point Method for Linear Programming

2−Ω(nL) precision, we obtain a feasible point x0 whose all slacks are at least
e

2−O(nL) , hence, β in Lemma 10.17 is 2−O(nL) .


e e

10.6.3 Proof of Theorem 10.10


Theorem 10.10 can be now concluded from Lemmas 10.12 and 10.5. Indeed,
Lemma 10.12 says that x̂ is an ε-approximate solution whenever η ≥ mε . Since
 T √ 
ηT = η0 1 + 20 1√m , it is enough to take T = Ω m log εηm0 for ηT ≥ mε to
hold.
Since the invariant nηt (xt ) x ≤ 16 is satisfied at every step t (by Lemma 10.5),
t
including for t = T , xT lies in the region of quadratic convergence of Newton’s
method for fηT . Hence, xη?T can be computed from xT . Note that we are not
entirely precise here: we made a simplifying assumption that once we arrive
at the region of quadratic convergence around xη?T then we can actually reach
this point – but we showed how to get around this issue in Section 10.6.1 at the
expense of doubling the error from ε to 2ε.
It remains to observe that in every iteration, the only nontrivial operation
is computing the Newton step nηt (xt ), which in turn boils down to solving the
following linear system
H(xt )y = ∇ fηt (xt ).
Note in particular that computing the inverse of H(xt ) is not necessary. Some-
times, just solving the above linear system might be much faster than inverting
a matrix.
10.7 Exercises 203

10.7 Exercises
10.1 Consider a bounded polyhedron

P := {x ∈ Rn : hai , xi ≤ bi , for i = 1, 2, . . . , m}

and let F(x) be the logarithmic barrier function on the interior of P, i.e.,
m
X
F(x) := − log(bi − hai , xi).
i=1

(a) Prove that F is strictly convex.


(b) Write the gradient and Hessian of F. What is the running time of
evaluating the gradient and multiplying the Hessian with a vector?
Further, define the following function G on the interior of P
 
G(x) := log det ∇2 F(x) .

(a) Prove that G is strictly convex.


(b) Write the gradient and Hessian of G. What is the running time of
evaluating the gradient and multiplying the Hessian with a vector?
10.2 This exercise is a continuation of Exercise 4.9. The goal is to bound bit
complexities of certain additional quantities that show up in the analysis
of the interior point method for linear programming. Let A ∈ Qm×n be a
matrix and b ∈ Qm be a vector and let L be the bit complexity of (A, b).
In particular L ≥ m and L ≥ n. We assume that P = {x ∈ Rn : Ax ≤ b} is
a bounded and full-dimensional polytope in Rn .
(a) Prove that λmin (A> A) ≥ 2−O(nL) and λmax (A> A) ≤ 2O(L) . Here λmin
e e

denotes the smallest eigenvalue and λmax is the largest eigenvalue.


(b) Assuming that P is a polytope, prove that the diameter of P (in
the Euclidean metric) is bounded by 2O(nL) .
e

(c) Prove that there exists a point x0 in the interior of P such that

bi − hai , x0 i ≥ 2−O(nL)
e

for every i = 1, 2, . . . , m.
(d) Prove that if x0? is the analytic center of P then

2−O(mnL) I  H(x0? )  2O(mnL) I,


e e

where H(x0? ) is the Hessian of the logarithmic barrier at x0? .


204 An Interior Point Method for Linear Programming

10.3 In this problem we would like to show the following intuitive fact: if the
gradient of the logarithmic barrier at a given point x is short then the
point x is far away from the boundary of the polytope. In particular, the
analytic center lies well inside the polytope. Let A ∈ Rm×n , b ∈ Rm and
P = {x ∈ Rn : Ax ≤ b} be a polytope. Suppose that x0 ∈ P is a point such
that
bi − hai , x0 i ≥ δ

for every i = 1, 2, . . . , m and some δ > 0. Prove that if D is the diameter


of P (in the Euclidean norm) then for every x ∈ P we have

∀ i = 1, 2, . . . , m, bi − hai , xi ≥ δ · (m + kg(x)k · D)−1 ,

where g(x) is the gradient of the logarithmic barrier function at x.


10.4 Dikin ellipsoid. Let A ∈ Rm×n , b ∈ Rm and P = {x ∈ Rn : Ax ≤ b}
be a full-dimensional polytope. For x ∈ P let H(x) be the Hessian of
the logarithmic barrier function for P. We define the Dikin ellipsoid at a
point x ∈ P as

E x := {y ∈ Rn : (y − x)> H(x)(y − x) ≤ 1}.

Let x0? be the analytic center, i.e., the minimizer of the logarithmic bar-
rier over the polytope.
(a) Prove that for all x ∈ int(P),

E x ⊆ P.

(b) Assume without loss of generality (by shifting the polytope) that
x0? = 0. Prove that

P ⊆ mE x0? .

(c) Prove that if the set of constraints is symmetric, i.e., for every con-
straint of the form ha0 , xi ≤ b0 there is a corresponding constraint
ha0 , xi ≥ −b0 then

x0? = 0 and P ⊆ mE x0? .

10.5 Assume C > 0 is some large constant.


(a) Verify that the constraint region in the auxiliary linear program of
Equation 10.16 is full-dimensional and bounded (assume that P is
full-dimensional and bounded).
10.7 Exercises 205

(b) Verify that the O(1)-slack in the auxiliary linear program of Equa-
tion 10.16 satisfies the condition in Lemma 10.17 with
 
β = min Ω(C −1 ), 2−O(n(L+nLC )) ,
e

where L is the bit complexity of P and LC is the bit complexity of


C.
10.6 Primal-dual path following IPM for linear programming. In this prob-
lem, we derive a different interior point algorithm to solve linear pro-
grams.

The formulation. Consider the following linear program and its dual.

min hc, xi
s.t. Ax = b, (10.17)
x ≥ 0.

max hb, yi
s.t. A> y + s = c, (10.18)
s ≥ 0.

Here, A ∈ Rn×m , b ∈ Rn , and c ∈ Rm . Further, x ∈ Rm , y ∈ Rn and


s ∈ Rm are variables. For simplicity, A is assumed to have rank n (i.e.,
A is full rank). Every triple (x, y, s) such that x satisfies the constraints
of (10.17) and (y, s) satisfies the constraints of (10.18) we call a feasible
primal-dual solution. The set of all such feasible triples we denote by F .
The strictly feasible set, where in addition x > 0 and s > 0 we denote by
F+ .

(a) Prove that the linear programs (10.17) and (10.18) are dual to each
other.
(b) Prove that if (x, y, s) ∈ F is a feasible primal-dual solution then
the duality gap hc, xi − hb, yi = m
P
i=1 xi si .
(c) Prove, using the above, that if xi si = 0 for every i = 1, 2, . . . , m,
then x is optimal for (10.17) and (y, s) is optimal for (10.18).

Thus, finding a solution (x, y, s) ∈ F with xi si = 0 for all i is equivalent


to solving the linear program.
206 An Interior Point Method for Linear Programming

The high-level idea. The main idea of this IPM is to approach this goal
by maintaining a solution to the following equations:
xi si = µ, ∀ i = 1, 2, . . . , m (x, y, s) ∈ F+ , (10.19)
for a positive parameter µ > 0. As it turns out, the above defines a unique
pair (x(µ), s(µ)) and, hence, the set of all solutions (over µ > 0) forms a
continuous path in F . The strategy is to approximately follow this path:
we initialize the algorithm at a (perhaps large) value µ0 and reduce it
multiplicatively. More precisely, we start with a triple (x0 , y0 , s0 ) that ap-
proximately satisfies condition (10.19) (with µ := µ0 ) and, subsequently,
produce a sequence (xt , yt , st , µt ) of solutions, with t = 1, 2, 3, . . . such
that (xt , yt , st ) also approximately satisfies (10.19), with µ := µt . We will
show that the value of µ can be reduced by a factor of (1 − γ) with
γ := Θ(m−1/2 ) in every step.

Description of the update. We describe how to construct the new point


(xt+1 , yt+1 , st+1 ) given (xt , yt , st ). For brevity we drop the superscript t for
now. Given a point (x, y, s) ∈ F and a value µ we use the following
procedure to obtain a new point: we compute vectors ∆x and ∆y and ∆s
such that
= 0,




 A∆x
A ∆y + ∆s = 0,

 >

 (10.20)

 x s + (∆x )s + x (∆s ) = µ, ∀ i = 1, 2, . . . , m,


i i i i i i

and the new point is (x + ∆x, y + ∆y, s + ∆s).


(d) Prove that the above coincides with one step of the Newton method
for root finding applied to the following system of equations. Hint:
Use the fact that (x, y, s) ∈ F .

= b,




 Ax
A y + s = c,

 >



= µ, ∀ i = 1, 2, . . . , m.

x s

i i

Potential function and stopping criteria. We will not be able to find


solutions that satisfy Equation (10.19) exactly, but only approximately.
To measure how accurate our approximation is, we use the following
potential function
v
t m !2
X xi si
v(x, s, µ) := −1 .
i=1
µ
10.7 Exercises 207

Note in particular that v(x, s, µ) = 0 if and only if Equation (10.19) holds.


Moreover, an approximate variant of this observation is also true.
(e) Prove that if v(x, s, µ) ≤ 1
2 then
m
X
xi si ≤ 2µm.
i=1

ε
Thus, it suffices to take µ := 2m in order to bring the duality gap down to
ε. To obtain a complete algorithm it remains to prove that the following
invariant is satisfied in every step t:

(xt , yt , st ) ∈ F+ ,
1 (10.21)
v(xt , st , µt ) ≤ .
2
√ 
Given the above, we will be able to deduce that after t = O m log µε
0

steps, the duality gap drops below ε. We analyze the behavior of the
potential in two steps. We first study what happens when going from
(x, y, s) to (x+∆x, y+∆y, s+∆s) while keeping the value of µ unchanged,
and later we analyze the effect of changing µ to µ(1 − γ).

Analysis of the potential: Newton step. As one can show, the linear
system (10.20) has always a solution. We need to show now that after
one such Newton step, the new point is still strictly feasible and, the
value of the potential is reduced.
(f) Prove that if ∆x and ∆s satisfy Equation (10.20) and v(x + ∆x, s +
∆s, µ) < 1 then x + ∆x > 0 and s + ∆s > 0.
(g) Prove that if ∆x and ∆s satisfy Equation (10.20) then
m
X
(∆xi )(∆si ) = 0. (10.22)
i=1

(h) Prove that if ∆x and ∆s satisfy Equation (10.20) and v(x, s, µ) < 1
then
1 v(x, s, µ)2
v(x + ∆x, s + ∆s, µ) ≤ · .
2 1 − v(x, s, µ)

Analysis of the potential: changing µ. We have shown that v(x, s, µ)


drops when µ is kept intact, it remains to see what happens when we
reduce its value from µ to (1 − γ)µ for some γ > 0.
208 An Interior Point Method for Linear Programming

(i) Prove that if ∆x and ∆s satisfy Equation (10.20) then


1
q
v(x + ∆x, s + ∆s, (1 − γ)µ) ≤ · v(x + ∆x, s + ∆s, µ)2 + γ2 m.
1−γ
The above guarantee allows us to conclude that we can maintain the
invariant (10.21) by taking γ = 3 √1 m . For this, in every step we perform
one step of Newton’s method and then reduce the value of µ by (1 − γ)
and still satisfy v(x, s, µ) ≤ 21 . We obtain the following theorem.
Theorem 10.18 There is an algorithm for linear programming that
given an initial solution (x0 , y0 , s0 , µ0 ) satisfying (10.21) and an ε > 0,

performs O( m log µε ) iterations (in every iteration an m × m linear
0

system is being solved) and outputs an ε-approximate solution x̂ to the


primal problem (10.17) and (ŷ, ŝ) to the dual problem (10.18).
10.7 Exercises 209

Notes
Interior point methods (IPMs) in the form of affine scaling first appeared in the
Ph.D. thesis of I. I. Dikin in the 60s; see Vanderbei (2001). Karmarkar (1984)
gave a polynomial-time algorithm to solve linear programs using an IPM based
on projective scaling. By then, there was a known polynomial time algorithm
for solving LPs, namely the ellipsoid algorithm by Khachiyan (1979, 1980)
(see Chapter 12). However, the method of choice in practice was the simplex
method due to Dantzig (1990), despite it being known to be inefficient in the
worst-case (see Klee and Minty (1972)).3 Karmarkar, in his paper, also pre-
sented empirical evidence demonstrating that his algorithm was consistently
faster than the simplex method. For a comprehensive historical perspective on
IPMs, we refer to the survey by Wright (2005).
Karmarkar’s algorithm needs roughly O(mL) iterations to find a solution;
here L is the bit complexity of the input linear program and m is the number of
constraints. Renegar (1988) combined Karmarkar’s approach with Newton’s

method to design a path following interior point method that took O( mL)
iterations to solve a linear program, and each iteration just had to solve a linear
system of equations of size m × m. A similar result was independently proven
by Gonzaga (1989). This is the result presented in this chapter.
There is also a class of primal-dual interior point methods for solving linear
programs (see Exercise 10.6) and the reader is referred to the book by Wright
(1997) for a discussion. It is possible, though nontrivial, to remove the full-
dimensionality Theorem 10.2; we refer the reader to the book by Grötschel
et al. (1988) for a thorough treatment on this.

3 Spielman and Teng (2009) showed that, under certain assumptions on the data (A, b, c), a
variant of the simplex method is provably efficient.
11
Variants of the Interior Point Method and
Self-Concordance

We present various generalizations and extensions of the path following IPM for the
case of linear programming. As an application, we derive a fast algorithm for the s − t-
minimum cost flow problem. Subsequently, we introduce the notion of self-concordance
and give an overview of barrier functions for polytopes and more general convex sets.

11.1 The minimum cost flow problem


We start by investigating if the interior point method for linear programming
developed in Chapter 10 is applicable for an important generalization of the
s − t-maximum flow in a directed graph called the s − t-minimum cost flow
problem. In this problem, we are given a directed graph G = (V, E) with n :=
|V| and m := |E|, two special vertices s , t ∈ V, where s is the source and t
is the sink, capacities ρ ∈ Qm ≥0 , a target flow value F ∈ Z≥0 , and a cost vector
c ∈ Qm ≥0 . The goal is to find a flow in G that sends F units of flow from s to t that
respects the capacities and is of minimum cost. The cost of a flow is defined to
be the sum of costs over all edges, and the cost of transporting xi units through
edge i ∈ E incurs a cost of ci xi . See Figure 11.1 for an illustration.
We can capture this problem using linear programming as follows:
min hc, xi
s.t. Bx = Fχ st , (11.1)
0 ≤ xi ≤ ρi , ∀ i ∈ E,
where B is the vertex-edge incidence matrix (introduced in Section 2.8.2) of G
and χ st := e s − et ∈ Rn .
One can immediately see that the analogous problem for undirected graphs is
a special case of the one we defined here since one can simply put two directed

210
11.1 The minimum cost flow problem 211

2 11
1 4
6
4
1 2
9
5
4
3

Figure 11.1 An s − t-minimum cost flow problem instance. The vertices 1 and
5 correspond to s and t respectively. We assume that all capacities are 1 and the
costs of edges are as given in the picture. The minimum cost flow of value F = 1
has cost 11, for F = 2 it is 11 + 13 = 24 and there are no feasible flow for F > 2.

edges (w, v) and (v, w) in place of every undirected edge {w, v}. Moreover, this
problem is more general than the s−t-maximum flow problem and the bipartite
matching problem studied introduced in Chapters 6 and 7 respectively.

11.1.1 Linear programming-based fast algorithm?


Most of the (long and rich) research history on the s − t-minimum cost flow
problem revolved around combinatorial algorithms – based on augmenting
paths, push-relabel, and cycle cancelling; see the notes for references. The
best known algorithm obtained using methods from this family has running
time roughly O(mn).
e Since the s − t-minimum cost problem (11.1) can be nat-
urally stated as a linear program one can ask whether it is possible to derive
a competitive, or even faster, algorithm for the minimum cost flow problem
using linear programming methods. The main result from Chapter 10 suggests

that obtaining an algorithm that performs roughly m iterations might be pos-
sible. Further, while the worst case running time of one such iteration could
be as high as O(m3 ), in this case, we can hope to do much better: the matrix
B captures a graph, and hence one might hope that each iteration boils down
to solving a Laplacian system for which we have O(m) e time algorithms. If
the number and complexity of the iterations can be bounded as we hope, this
would yield an algorithm that runs in O(me 3/2 ) time. This would beat the best
known combinatorial algorithm whenever m = o(n2 ).
However, to derive such an algorithm we need to resolve several issues – the
first being that the method we developed in Chapter 10 was for linear program-
ming in the canonical form (Definition 10.1), which does not seem to capture
212 Variants of the Interior Point Method and Self-Concordance

the linear programming formulation the minimum cost flow problem in (11.1).
Moreover, we need to show that one iteration of an appropriate variant of the
path following method can be reduced to solving a Laplacian system. Finally,
the issue of finding a suitable point to initialize the interior point method be-
comes crucial and we develop a fast algorithm for finding such a point.

11.1.2 The issue with the path following IPM


Recall that the primal path following IPM developed in Chapter 10 was geared
towards linear programs of the following form:

min hc, xi
x∈Rm
(11.2)
s.t. Ax ≤ b.
By inspecting the form of the linear program (11.2), it is evident that it has
a different form than that of (11.1). However, this linear programming frame-
work is very expressive and one can easily translate the program (11.1) to the
form (11.2). This follows simply by turning equalities Bx = Fχ st into pairs of
inequalities
Bx ≤ Fχ st and − Bx ≤ −Fχ st ,
and expressing 0 ≤ xi ≤ ρi as
−xi ≤ 0 and xi ≤ ρi .
However, one of the issues is that the resulting polytope
P := {x ∈ Rm : Ax ≤ b}
is not full-dimensional, which was a crucial assumption for the path following
IPM from Chapter 10. Indeed, the running time of this algorithm depends on
β where β is the distance from the boundary of P of an initial point x ∈ P; see
1 0

Lemma 10.17. Since P is not full-dimensional, it is not possible to pick a point


x0 with positive β.
One can try to fix this problem by modifying the feasible set and considering
Pε := {x ∈ Rm : Ax ≤ b + 1ε},
for some tiny ε > 0, where 1 is the all ones vector. Pε is a slightly “inflated”
version of P and, in particular, is full-dimensional. One can then minimize
hc, xi over Pε , hoping that this gives a decent approximation of (11.2). How-
ever, there is at least one issue that arises when doing so. The ε > 0 needs to
be exponentially small – roughly ≈ 2−L (where L is the bit complexity of the
11.1 The minimum cost flow problem 213

linear program) in order to guarantee that we get a decent approximation of


the original program. However, for such a small ε > 0, the polytope Pε is very
“thin” and, thus, we cannot provide a starting point x0 which is far away from
the boundary of Pε . This adds at least an Ω(L) = Ω(m) term in the number of
iterations of the method and also forces us to use extended precision arithmetic
(of roughly L bits) in order to execute the algorithm.

11.1.3 Beyond full-dimensional linear programs


There are several different ways one can try to resolve these issues and obtain
fast linear programming based algorithms for the minimum cost flow prob-
lem. One idea is based on the so-called primal-dual path interior point method,
which is a method to solve the primal and the dual problems simultaneously
using Newton’s method; see Exercise 10.6 in Chapter 10.
Another idea, that we are going to implement in this chapter, is to aban-
don the full-dimensional requirement on the polytope and develop a method
that directly solves linear programs with equality constraints. Towards this, the
first step is to formulate the minimum cost flow problem (11.1) in a form that
involves only equality constraints and nonnegativity of variables. We introduce
a new set of m variables {yi }i∈E – the “edge slacks” and add equality constraints
of the form
∀i ∈ E, yi = ρi − xi .

Thus, by denoting b := Fχ st , our linear program becomes

min hc, xi
x∈Rm
! ! !
B 0 x b
s.t. = , (11.3)
I I y u
x, y ≥ 0.

The number of variables in this new program is 2m and the number of linear
constraints is n + m.
By developing a Newton’s method for equality constrained problems and
then deriving a new, corresponding path following interior point method for
linear problems in the form as (11.3) we prove the following theorem.

Theorem 11.1 (s − t-minimum cost flow using IPM) There is an algorithm


to solve
 √ the minimum
 cost flow problem (11.3) up to an additive error of ε > 0
e m log CU iterations, where U := maxi∈E |ρi | and C := maxi∈E |ci |. Each
in O ε
214 Variants of the Interior Point Method and Self-Concordance

iteration involves solving one Laplacian linear system, i.e., of the form

(BW B> )h = d,

where W is a positive diagonal matrix, d ∈ Rn and h ∈ Rn is the variable


vector.

As solving one such Laplacian system takes time O(m),


e we obtain an algorithm
solving the minimum cost flow problem in time, roughly, O(m e 3/2 ). To be pre-
cise, employing fast Laplacian solvers requires one to show that their approxi-
mate nature does not introduce any problems in the path following scheme.
Our proof of Theorem 11.1 is divided into three parts. First, we derive a new
variant of the path following method for linear programs of the form (11.3) (see

Section 11.2) and show that it takes roughly m iterations to solve it. Subse-
quently, in Section 11.2 we show that each iteration of this new method applied
to (11.3) can be reduced (in O(m) time) to solving one Laplacian system. Fi-
nally, in Section 11.3 we show how one can efficiently find a good starting
point to initialize the interior point algorithm.

11.2 An IPM for linear programming in standard form


The goal of this section is to derive a path following interior point method for
linear programs of the form (11.3) (the reformulation of the s−t-minimum cost
flow problem). More generally, the method derived works for linear programs
that have the following form:

min hc, xi
x∈Rm

s.t. Ax = b, (11.4)
x ≥ 0,

where A ∈ Rn×m , b ∈ Rn and c ∈ Rm . This is often referred to as a linear


program in standard form. It is easy to check that the canonical form studied in
Chapter 10 is the dual of the standard form (up to renaming vectors and switch-
ing min to max). Note that in this section we denote the number of variables
by m and the number of (linear) constraints by n, i.e., the roles of n and m are
swapped when compared to Chapter 10. Also, note that we are overloading m
and n that meant the number of edges and number of vertices in the context
of the s − t-minimum cost problem; the distinction should be clear from the
context. This is done intentionally and is actually consistent with the previous
Chapter 10 since, via duality, the constraints become variables and vice versa.
11.2 An IPM for linear programming in standard form 215

For the sake of brevity, in this section we use the notation

Eb := {x ∈ Rm : Ax = b} and E := {x ∈ Rm : Ax = 0},

where the matrix A and the vector b are fixed throughout.


The idea of the algorithm for solving (11.4) is similar to the path following
IPM introduced in Chapter 10: we use the logarithmic barrier function to define
a central path in the relative interior of the feasible set and then use Newton’s
method to progress along the path. To implement this idea, the first step is
to derive a Newton’s method for minimizing a function restricted to an affine
subspace of Rm .

11.2.1 Equality constrained Newton’s method


Consider a constrained optimization problem of the form

min f (x)
x∈Rm
(11.5)
s.t. Ax = b,

where f : Rm → R is a convex function, A ∈ Rn×m , and b ∈ Rn . We assume


without loss of generality that A has full rank n, i.e., the rows of A are linearly
independent (otherwise there are redundant rows in A which can be removed).
We denote the function f restricted to the domain Eb by e f : Eb → R, i.e.,
f (x) = f (x).
e
In the absence of equality constraints Ax = b we saw that the Newton step
at a point x is defined as

n(x) := −H(x)−1 g(x),

where H(x) ∈ Rm×m is the Hessian of f at x and g(x) is the gradient of f at x,


and the next iterate is computed as

x0 := x + n(x).

When started at x ∈ Eb , the point x + n(x) may no longer belong to Eb and,


hence, we need to adjust the Newton’s method to move only in directions tan-
gent to Eb . To this end, in the next section, we define the notions of a gradient
g(x), and a Hessian H(x)
e e so that the Newton step, defined by the formula

e e −1e
n(x) := H(x) g(x),

gives a well defined method for minimizing e f . Moreover, by defining an ap-


propriate variant of the NL condition for e
f we obtain the following theorem.
216 Variants of the Interior Point Method and Self-Concordance

Theorem 11.2 (Quadratic convergence of equality constrained Newton’s


method) Let e f : Eb → R be a strictly convex function satisfying the NL
condition for δ0 = 61 , x0 ∈ Eb be any point and

x1 := x0 + e
n(x0 ).
1
n(x0 )k x0 ≤
If ke 6 then
n(x0 )k2x0 .
n(x1 )k x1 ≤ 3 ke
ke

We note that the above is completely analogous to Theorem 9.6 in Chapter


9, where we dealt with the unconstrained setting – indeed, the above theorem
ends up being a straightforward extension of this theorem.

11.2.2 Defining the Hessian and gradient on a subspace


The crucial difference between working in Rm and working in Eb is that, given
x ∈ Eb we cannot move in every possible direction from x, but only in tangent
directions, i.e., to x + h for some h ∈ E. In fact, E is the tangent space (as in-
troduced in Chapter 9) corresponding to Eb . Let {v1 , . . . , vk } be an orthonormal
basis for E where vi ∈ Rm . Let ΠE denote the orthogonal projection operator
onto Π. We can see that
Xk
ΠE := vi v>i .
i=1

g(x) should be a vector in E such that for h ∈ E, the linear func-


The gradient e
tion
h 7→ hh,e
g(x)i

provides a good approximation of f (x + h) − f (x) whenever khk is small. Thus,


for x ∈ Eb , we can define
k
X
g(x) :=
e vi D f (x)[vi ]
i=1
k
X
= vi h∇ f (x), vi i
i=1
k
X
= vi v>i ∇ f (x)
i=1
= ΠE ∇ f (x).
11.2 An IPM for linear programming in standard form 217

Similarly, we can obtain a formula for the Hessian


e i j := D2 f (x)[vi , v j ]
H(x)
= v>i H(x)v j
 
= Π>E H(x)ΠE .
ij

These definitions might look abstract, but in fact, the only way they differ from
their unconstrained variants is the fact that we consider E to be the tangent
space instead of Rm . As an example, consider the function
f (x1 , x2 ) := 2x12 + x22 .
When constrained to a “vertical” line
Eb := {(x1 , x2 )> : x1 = b, x2 ∈ R}
we obtain that the gradient is
!
0
g(x1 , x2 ) =
e
2x2
( ! )
e 1 , x2 ) is a linear operator E → E (where E = 0
and the Hessian H(x :h∈R
h
such that
! !
0 0
H(x1 , x2 )
e = .
h 2h
Indeed, this is consistent with the intuition that by restricting the function to a
vertical line, only the variable x2 matters.
To summarize, we have the lemma below.
Lemma 11.3 (Gradient and Hessian in a subspace) Let f : Rm → R be
f : Eb → R be its restriction to Eb . Then we have
strictly convex and let e
g(x) = ΠE ∇ f (x),
e
e = Π>E H(x)ΠE ,
H(x)
where ΠE : Rm → E is the orthogonal projection operator onto E. Note that
ΠE = I − A> (AA> )−1 A.

11.2.3 Defining the Newton step and the NL condition on a


subspace
We would like to define the Newton step to be
e e −1e
n(x) := −H(x) g(x).
218 Variants of the Interior Point Method and Self-Concordance
e is an invertible operator E → E, and hence e
Note that in the above, H(x) n(x)
is well defined. The following lemma states the form of the Newton step re-
stricted to the subspace E in terms of the Hessian and gradient in the ambient
space Rm .

Lemma 11.4 (Newton step in a subspace) Let f : Rm → R be a strictly


f : Eb → R be its restriction to Eb . Then, for every
convex function and let e
x ∈ Eb we have

n(x) = −H(x)−1 g(x) + H(x)−1 A> (AH(x)−1 A> )−1 AH(x)−1 g(x).
e

Proof The Newton step e e −1e


n(x) is defined as −H(x) g(x) and, hence, can be
obtained as the unique solution h to the following linear system

H(x)h

 e = −eg(x),
(11.6)
Ah = 0.

Since ΠE is symmetric and h ∈ E, the first equation gives that

ΠE H(x)h = −ΠE g(x)

and, hence,
ΠE (H(x)h + g(x)) = 0.

In other words, this means that H(x)h + g(x) belongs to the space orthogonal
to E (denoted by E ⊥ ), i.e.,

H(x)h + g(x) ∈ E ⊥ = {A> λ : λ ∈ Rn }.

Consequently, (11.6) can be equivalently rewritten as a linear system of the


form
H(x) A>
! ! !
h −g(x)
· =
A 0 λ 0

with unknowns h ∈ Rm and λ ∈ Rn . From the above λ can be eliminated to


yield
n(x) = H(x)−1 (A> (AH(x)−1 A> )−1 AH(x)−1 − I)g(x),
e

as stated in the lemma.

The Newton’s method for minimizing f over Eb then simply starts at any x0 ∈
Eb and iterates according to

xt+1 := xt + e
n(xt ), for t = 0, 1, 2, . . ., (11.7)
11.2 An IPM for linear programming in standard form 219

As in the unconstrained setting, we consider the local norm k·k x defined on the
tangent space at x by
D E
∀h ∈ E, khk2x := h, H(x)h
e = h> H(x)h,
e

and define the NL condition as follows.

Definition 11.5 (Condition NL on a subspace) Let e f : Eb → R. We say that


f satisfies the NL condition for δ0 < 1 if for all 0 < δ ≤ δ0 < 1, and x, y ∈ Eb
e
such that
ky − xk x ≤ δ,

we have
(1 − 3δ)H(x) e  (1 + 3δ)H(x).
e  H(y) e

In the above, the PSD ordering  is understood in the usual sense, yet restricted
to the space E. To make this precise, we need to define a notion similar to a
symmetric matrix for linear operators. A linear operator H : E → E is said to
be self-adjoint if hHh1 , h2 i = hh1 , Hh2 i for all h1 , h2 ∈ E.

Definition 11.6 For two self-adjoint operators H1 , H2 : E → E we say that


H1  H2 if hh, H1 hi ≤ hh, H2 hi for every h ∈ E.

At this point, all components present in Theorem 11.2 have been formally
defined. The proof of Theorem 11.2 is identical to the proof of Theorem 9.6 in
Chapter 9; in fact, one just has to replace f , g, and H with e
f, e
g and H
e for the
proof to go through.

11.2.4 The IPM for linear programs in standard form


To develop an interior point method for solving linear programs in the stan-
dard form (11.4), we work in the affine subspace Eb and apply a similar path
following scheme as in Chapter 10. More precisely, we define the point xη? as

xη? := argmin (ηhc, xi + F(x)) ,


x∈Eb

where F is the logarithmic barrier function for Rm


>0 , i.e.,
m
X
F(x) := − log xi .
i=1

The idea is to use Newton’s method to progress along the central path, i.e., the
set Γc := {xη? : η ≥ 0}.
220 Variants of the Interior Point Method and Self-Concordance

For completeness, we now adjust the notation to this new setting. We define
the function
fη (x) := ηhc, xi + F(x)

and let efη denote its restriction to Eb . We let e


nη (x) be the Newton step with
respect to fη , i.e., (using Lemma 11.4)
e

nη (x) := H(x)−1 (A> (AH(x)−1 A> )−1 AH(x)−1 − I)(ηc + g(x)),


e

where H(x) and g(x) are the Hessian and gradient of F(x) in Rm . Note that, in
this case, the Hessian and the gradient of F(x) have particularly simple forms:

H(x) = X −2 , g(x) = X −1 1,

where X := Diag(x) is the diagonal matrix with the entries of x on the diagonal
and 1 is the all ones vector. Thus,

nη (x) := X 2 (A> (AX −2 A> )−1 AX 2 − I)(ηc + X −1 1).


e (11.8)

The algorithm for this setting is analogous to Algorithm 7 that was presented in
Chapter 10; the only difference is the use of the equality constrained Newton
step e
nη (x) instead of the unconstrained one nη (x). The theorem that one can
formulate in this setting is as follows.

Theorem 11.7 (Convergence of the equality constrained path following


IPM) The equality constrained version of Algorithm 7, after
!
√ m
T := O m log
εη0
iterations, outputs a point x̂ ∈ Eb ∩ Rm
>0 that satisfies

hc, x̂i ≤ hc, x? i + ε.

Moreover, every Newton step requires O(m) time, plus the time required to
solve one linear system of equations of the form (AWA> )y = d, where W  0
is a diagonal matrix, d ∈ Rn is a given vector and y is the vector of variables.

For the application to the s − t-minimum cost flow problem, it is also important
to show that finding the initial point on the central path is not a bottleneck.
From arguments presented in Section 10.6.2, one can derive the following
lemma that is an analog of Lemma 10.17.

Lemma 11.8 (Starting point) There is an algorithm which given a point


>0 such that xi > β for every i = 1, 2, . . . , m, finds a pair (η0 , x0 )
x 0 ∈ E b ∩ Rm 0

such that

11.3 An O(
e m)-iteration algorithm for minimum cost flow 221

• x0 ∈ Eb ∩ Rm ,
>0
1
• nη0 (x0 ) x ≤ 6 ,
e
 0 
• η0 ≥ Ω D1 · kck1 2 , where D is the (Euclidean) diameter of the feasible set
≥0 .
E b ∩ Rm
√ 
The algorithm performs O
e m log Dβ iterations of the interior point method.

In particular, whenever we can show that the diameter of the feasible set is
bounded by a polynomial in the dimension and we can find a point which
is at least an inverse polynomial distance away from the boundary, then the
total number of iterations of the equality constrained path following IPM (to
find
 √the starting
kck2
 point and then find a point ε-close to optimum) is roughly
O m log ε .
e


e m)-iteration algorithm for minimum cost flow
11.3 An O(
We show how to conclude Theorem 11.1 using the equality constrained path
following method IPM mentioned in the previous section.

Number of iterations. We first note that by Theorem 11.7 the linear pro-

gram (11.3) can be solved in O m log εηm0 iterations of the path following
method. In order to match the number of iterations claimed in Theorem 11.1
we need to show that we can initialize the path following method at η0 such
that η−1
0 = poly(m, C, U). We show this in Section 11.3.1.

Time per iteration.

Lemma 11.9 (Computing a Newton step by solving a Laplacian system)


One iteration of the equality constrained path following IPM applied to the
linear program (11.3) can be implemented in time O(m + T Lap ) where T Lap is
the time required to solve a linear system of the form (BW B> )h = d, where W
is a positive diagonal matrix, d ∈ Rn and h ∈ Rn is an unknown vector.

One can use the formula (11.8) as a starting point for proving the above. How-
ever, one still needs to determine how the task of solving a certain 2m × 2m
linear system reduces to a Laplacian system. We leave this is an exercise (Ex-
ercise 11.3).
222 Variants of the Interior Point Method and Self-Concordance

11.3.1 Finding a starting point efficiently


It remains to show how one can find a point close enough to the central path so
that we can initialize the path following IPM. To this end, we apply Lemma 11.8.
It says that for that we just need to provide a strictly feasible solution (x, y)
to (11.3), i.e., one which satisfies the equality constraints and such that xi , yi >
δ > 0 for some not too small δ (inverse-polynomial in m, C and U suffices).
We start by showing how to find a positive flow of any value and then argue
that by a certain preconditioning of the graph G we can also find a flow of a
prescribed value F using this method.

Lemma 11.10 (Finding a strictly positive flow) There is an algorithm which,


given a directed graph G = (V, E) and two vertices s, t ∈ G, outputs a vector
x ∈ [0, 1]E such that:

1. Bx = 21 χ st ,
2. if an edge i ∈ E does not belong to any directed path from s to t, then
xi = 0,
3. if an edge i ∈ E belongs to some directed path from s to t then
1 1
≤ xi ≤ .
2m 2
The algorithm runs in O(m) time.

Note that the edges that do not belong to any path from s to t can be removed
from the graph, as they cannot be part of any feasible flow. Below we prove
that one can find such a flow in roughly O(nm) time – the question on how to
make this algorithm more efficient is left as an exercise (Exercise 11.4).

Proof sketch of Lemma 11.10. We initialize x ∈ Rm to 0. We fix an edge i =


(v, w) ∈ E. We start by checking whether i belongs to any path connecting s
and t in G. This can be done by checking (using depth first search) whether v
is reachable from s and whether t is reachable from w.
If there is no such path, then we ignore i. Otherwise, let Pi ⊆ E be such a
1
path. We send 2m units of flow through this path, i.e., update
1
x := x + · χ Pi ,
2m
where χPi is the indicator vector of the set Pi .
At the end of this procedure, every edge that lies on a path from s to t has
1 1 1
xi ≥ and xi ≤ m · ≤ .
2m 2m 2

11.3 An O(
e m)-iteration algorithm for minimum cost flow 223

The remaining edges i have xi = 0. In every step we need to update x at O(n)


positions (the length of Pi ), hence the total running time is O(nm).

Flows constructed in the lemma above are strictly positive, yet they fail to
satisfy the constraint Bx = Fχ st , i.e., not enough flow is pushed from s to t. To
get around this problem we need the final idea of preconditioning. We add one
directed edge î := (s, t) to the graph with large enough capacity

uî := 2F

and very large cost


X
cî := 2 |ci |.
i∈E

We denote the new graph (V, E ∪ {î}) by Ĝ and call it the preconditioned
n ograph.
By Lemma 11.10, we can construct a flow of value f = min 21 , F2 in G
which strictly satisfies all capacity constraints since the capacities are integral.
We can then make it have value exactly F by sending the remaining
F
F− f ≥
2
units directly from s to t through î. This allows us to construct a strictly positive
feasible solution on the preconditioned graph.
Note, importantly, that this preconditioning does not change the optimal so-
lution to the original instance of the problem, since sending even one unit of
flow through i incurs a large cost – larger than taking any path from s to t in
the original graph G. Therefore, solving the s − t-minimum cost flow problem
on Ĝ is equivalent to solving it on G. Given this observation, it is enough to
run the path following algorithm on the preconditioned graph. Below we show
how to find a suitable starting point for the path following method in this case.

Lemma 11.11 (Finding a point on the central path) There is an algorithm


that, given a preconditioned instance of s−t-minimum cost flow problem
 (11.3),

outputs a feasible starting point z0 = (x0 , y0 )> ∈ R2m
>0 and η0 := Ω 1
m3 CU
with
U := maxi∈E |ui | and C := maxi∈E |ci | such that
1
enη0 (z0 ) z ≤ .
0 6
√ 
This algorithm performs O e m log U
min{1,F} iterations of the equality con-
strained path following IPM.

Proof First note that we may assume without loss of generality that all edges
224 Variants of the Interior Point Method and Self-Concordance

belong to some directed path from s to t in G. Next, we apply Lemma 11.10 to


find a flow satisfying
( )
1 F
x > 1 · min ,
2m 2m
n o
in G of value f = min 12 , F2 and set

xî = F − f.

This implies that


Bx = Fχ st

and, moreover, for every edge i ∈ E we have


( )
1 F 1
min , ≤ xi ≤ ρi − ,
2m 2m 2
as ρi ∈ Z≥0 for every edge i ∈ E. For xî , we have
F
≤ xî ≤ F = uî − F.
2
Therefore, by setting the slack variables yi to

yi := ρi − xi

we obtain that
( )
1 F
min , ≤ yi .
2 2
We now apply Lemma 11.8 with
( )
1 F 1 F min{1, F}
β := min , , , ,F ≥ .
2m 2m 2 2 2m
It remains to find an appropriate bound on the diameter of the feasible set. Note
that we have, for every i ∈ E ∪ {î},

0 ≤ xi , yi ≤ ρi .

This implies that the Euclidean diameter of the feasible set is bounded by
v
t m
X √
2 ρ2i ≤ 2mU.
i=1

Thus, by plugging these parameters into Lemma 11.8, the bound on η0 and the
bound on the number of iterations follow.
11.4 Self-concordance 225

11.4 Self-concordance
In this section, by abstracting out the key properties of the logarithmic barrier
and the convergence proof of the path following IPM from Chapter 10, we
arrive at the central notion of self-concordance. Not only allows does the notion
of self-concordance help us to develop path following algorithms for more
general convex bodies (beyond polytopes), it also allows us search for better
barrier functions and leads to some of the fastest known algorithm for linear
programming.

11.4.1 Revisiting the properties of the logarithmic barrier


Recall Lemma 10.5 in Chapter 10 which asserts that the invariant
1
nηt (xt ) x ≤
t 6
is preserved at every step of the path following IPM. The proof of this lemma
relies on two important properties of the logarithmic barrier function:

Lemma 11.12 (Properties of the logarithmic barrier function) Let
$$F(x) := -\sum_{i=1}^{m} \log s_i(x)$$
be the logarithmic barrier function for a polytope, let g(x) be its gradient, and let H(x) be its Hessian. Then, we have
1. for every $x \in \mathrm{int}(P)$,
$$\left\| H(x)^{-1} g(x) \right\|_x^2 \leq m,$$
and
2. for every $x \in \mathrm{int}(P)$ and every $h \in \mathbb{R}^n$,
$$\left| \nabla^3 F(x)[h, h, h] \right| \leq 2 \left( h^\top H(x) h \right)^{3/2}.$$

Recall that the first property was used to show that one can progress along the central path by changing η multiplicatively by $1 + \frac{1}{20\sqrt{m}}$ (see Lemma 10.7 in Chapter 10). The second property was not explicitly mentioned in the proof; however, it results in the logarithmic barrier F satisfying the NL condition. Thus, it allows the repeated recentering of the point (bringing it closer to the central path) using Newton's method (see Lemma 10.6 in Chapter 10).

Proof of Lemma 11.12. Part 1 was proved in Chapter 10. We provide a slightly different proof here. Let $x \in \mathrm{int}(P)$ and let $S_x \in \mathbb{R}^{m \times m}$ be the diagonal matrix with slacks $s_i(x)$ on the diagonal. Thus, we know that
$$H(x) = A^\top S_x^{-2} A \quad \text{and} \quad g(x) = A^\top S_x^{-1} \mathbf{1},$$
where $\mathbf{1} \in \mathbb{R}^m$ is the all ones vector. We have
$$\left\| H(x)^{-1} g(x) \right\|_x^2 = g(x)^\top H(x)^{-1} g(x) = \mathbf{1}^\top S_x^{-1} A \left( A^\top S_x^{-2} A \right)^{-1} A^\top S_x^{-1} \mathbf{1} = \mathbf{1}^\top \Pi \mathbf{1},$$
where $\Pi := S_x^{-1} A \left( A^\top S_x^{-2} A \right)^{-1} A^\top S_x^{-1}$. Note that Π is an orthogonal projection matrix: indeed Π is symmetric and $\Pi^2 = \Pi$ (by a direct calculation). Hence,
$$\left\| H(x)^{-1} g(x) \right\|_x^2 = \mathbf{1}^\top \Pi \mathbf{1} = \left\| \Pi \mathbf{1} \right\|_2^2 \leq \left\| \mathbf{1} \right\|_2^2 = m.$$

To prove Part 2, first note that
$$\nabla^3 F(x)[h, h, h] = 2 \sum_{i=1}^{m} \frac{\langle a_i, h \rangle^3}{s_i(x)^3} \quad \text{and} \quad h^\top H(x) h = \sum_{i=1}^{m} \frac{\langle a_i, h \rangle^2}{s_i(x)^2}.$$
Therefore the claim follows from the inequality
$$\|z\|_3 \leq \|z\|_2$$
applied to the vector $z \in \mathbb{R}^m$ defined by $z_i := \frac{\langle a_i, h \rangle}{s_i(x)}$ for all $i = 1, 2, \ldots, m$.
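Both properties are easy to check numerically. The sketch below (ours; it assumes numpy and a random strictly feasible instance) evaluates the gradient, Hessian, and third directional derivative of the logarithmic barrier and verifies the two bounds of Lemma 11.12.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 20, 5
A = rng.standard_normal((m, n))
b = rng.random(m) + 1.0              # x = 0 is strictly feasible: s(0) = b > 0
x = np.zeros(n)

s = b - A @ x                        # slacks s_i(x)
g = A.T @ (1.0 / s)                  # gradient of F(x) = -sum_i log s_i(x)
H = A.T @ np.diag(1.0 / s**2) @ A    # Hessian A^T S_x^{-2} A

# Property 1: ||H(x)^{-1} g(x)||_x^2 = g^T H^{-1} g <= m
print(g @ np.linalg.solve(H, g) <= m + 1e-9)

# Property 2 along a random direction h, with z_i = <a_i, h>/s_i(x)
h = rng.standard_normal(n)
z = (A @ h) / s
print(abs(2 * np.sum(z**3)) <= 2 * (h @ H @ h) ** 1.5 + 1e-9)
```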

11.4.2 Self-concordant barrier functions


The two conditions stated in Lemma 11.12 are useful to understand path fol-
lowing IPMs more generally. Any convex function that satisfies these two
properties and is a barrier for the polytope P (i.e., tends to infinity when ap-
proaching the boundary) is called a self-concordant barrier function.
Definition 11.13 (Self-concordant barrier function) Let K ⊆ Rd be a con-
vex set and let F : int(K) → R be a thrice differentiable function. We say
that F is a self-concordant barrier function with parameter ν if it is satisfies the
following properties:
1. (Barrier) F(x) → ∞ as x → ∂K.
2. (Convexity) F is strictly convex.
3. (Complexity parameter ν) For all x ∈ int(K),
 −1
∇F(x)> ∇2 F(x) ∇F(x) ≤ ν.

4. (Self-concordance) For all $x \in \mathrm{int}(K)$ and for all allowed vectors h (in the tangent space of the domain of F),
$$\left| \nabla^3 F(x)[h, h, h] \right| \leq 2 \left( h^\top \nabla^2 F(x) h \right)^{3/2}.$$

Note that the logarithmic barrier on P satisfies the above conditions with the complexity parameter equal to ν = m. Indeed, the third condition above coincides with the first property in Lemma 11.12 and the fourth condition is the same as the second property in Lemma 11.12.

11.5 Linear programming using self-concordant barriers


Given any self-concordant barrier function F(x) for a polytope P, one can consider the following objective
$$f_\eta(x) := \eta \langle c, x \rangle + F(x)$$
and define the central path to consist of minimizers $x_\eta^\star$ of the above parametric family of convex functions. Algorithm 7 presented in Chapter 10 for linear programming can be adapted straightforwardly for this self-concordant barrier function F. The only difference is that m is replaced by ν, the complexity parameter of the barrier F. I.e., the value η is updated according to the rule
$$\eta_{t+1} := \eta_t \left( 1 + \frac{1}{20\sqrt{\nu}} \right),$$
and the iteration terminates once $\eta_T > \frac{\nu}{\varepsilon}$.


By following the proof of Theorem 10.10 from Chapter 10 for such an F,
one obtains the following result.

Theorem 11.14 (Convergence of the path following IPM for general barrier functions) Let F(x) be a self-concordant barrier function for $P = \{x \in \mathbb{R}^n : Ax \leq b\}$ with complexity parameter ν. The corresponding path following IPM, after $T := O\left(\sqrt{\nu} \log \frac{\nu}{\varepsilon \eta_0}\right)$ iterations, outputs $\hat{x} \in P$ that satisfies
$$\langle c, \hat{x} \rangle \leq \langle c, x^\star \rangle + \varepsilon.$$
Moreover, every iteration of the algorithm requires computing the gradient and Hessian (H(·)) of F at a given point $x \in \mathrm{int}(P)$ and solving one linear system of the form $H(x)y = z$, where $z \in \mathbb{R}^n$ is determined by x and we solve for y.
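The theorem translates into a very short generic algorithm: the barrier enters only through its gradient and Hessian oracles. The Python sketch below is ours, not the book's pseudocode; it assumes the starting pair (x0, η0) already satisfies the centering invariant from Chapter 10 and does not damp the Newton steps, as a robust implementation would.

```python
import numpy as np

def path_following(c, grad, hess, x0, nu, eta0, eps, newton_steps=2):
    """A sketch of the scheme behind Theorem 11.14: minimize <c, x>
    using a nu-self-concordant barrier given via grad/hess oracles."""
    x, eta = np.asarray(x0, dtype=float), float(eta0)
    while eta <= nu / eps:
        eta *= 1 + 1 / (20 * np.sqrt(nu))      # the update rule above
        for _ in range(newton_steps):          # re-center w.r.t. f_eta
            x = x - np.linalg.solve(hess(x), eta * c + grad(x))
    return x

# Toy usage: min <c, x> over the box [0,1]^2 with its log barrier (nu = 4).
c = np.array([1.0, -2.0])
grad = lambda x: -1 / x + 1 / (1 - x)
hess = lambda x: np.diag(1 / x**2 + 1 / (1 - x)**2)
print(path_following(c, grad, hess, [0.5, 0.5], nu=4, eta0=1.0, eps=1e-4))
```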

A lower bound on the complexity parameter of self-concordance. Theorem 11.14 implies that if we could come up with an efficient second-order oracle for a self-concordant barrier function with parameter ν < m, we could obtain an algorithm that performs roughly $\sqrt{\nu}$ iterations (instead of $\sqrt{m}$ as for the logarithmic barrier). Thus, a natural question arises: can we get arbitrarily good barriers? In particular, can we bring the complexity parameter down to, say, O(1)? Unfortunately, the self-concordance complexity parameter cannot go below the dimension of the convex set, as established by the following theorem.
Theorem 11.15 (Lower bound on the complexity parameter for hyper-
cube) Every self-concordant barrier function for the hypercube K := [0, 1]d
has complexity parameter at least d.
We give some intuition for this fact by proving it for d = 1.
Proof of Theorem 11.15 for the special case of d = 1. Let $F : [0, 1] \to \mathbb{R}$ be a self-concordant barrier function for K = [0, 1]. Self-concordance in the univariate setting translates to: for every $t \in (0, 1)$,
$$F'(t)^2 \leq \nu \cdot F''(t), \qquad \left| F'''(t) \right| \leq 2 F''(t)^{3/2}.$$

We prove that if F is a strictly convex barrier then the above imply that ν ≥ 1. For the sake of contradiction, assume ν < 1. Let
$$g(t) := \frac{F'(t)^2}{F''(t)},$$
and consider the derivative of g:
$$g'(t) = 2F'(t) - \left( \frac{F'(t)}{F''(t)} \right)^2 F'''(t) = F'(t) \left( 2 - \frac{F'(t) F'''(t)}{F''(t)^2} \right). \tag{11.9}$$
Now, using the self-concordance of F, we obtain
$$\frac{F'(t) F'''(t)}{F''(t)^2} \leq 2\sqrt{\nu}. \tag{11.10}$$
Moreover, since F is a strictly convex function, it holds that $F'(t) > 0$ for $t \in (\alpha, 1]$ for some $\alpha \in (0, 1)$. Hence, for $t > \alpha$, by combining (11.9) and (11.10) we have
$$g'(t) \geq 2F'(t)\left(1 - \sqrt{\nu}\right).$$
Hence,
$$g(t) \geq g(\alpha) + 2\left(1 - \sqrt{\nu}\right)(F(t) - F(\alpha)),$$

which gives us a contradiction: on the one hand, g(t) ≤ ν (from the self-concordance of F), and on the other hand, g(t) is unbounded (as $\lim_{t \to 1} F(t) = +\infty$).
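For intuition, one can check numerically that the logarithmic barrier F(t) = −log t − log(1 − t) for K = [0, 1] satisfies condition 3 with ν = 1, matching the d = 1 lower bound just proved: writing a = 1/t and b = 1/(1 − t), one has F′(t)²/F″(t) = 1 − 2ab/(a² + b²) < 1, with the supremum 1 approached at the endpoints. A sketch (ours, assuming numpy):

```python
import numpy as np

# F(t) = -log t - log(1-t) is a barrier for K = [0,1].
t = np.linspace(1e-6, 1 - 1e-6, 10**5)
F1 = -1/t + 1/(1 - t)                     # F'(t)
F2 = 1/t**2 + 1/(1 - t)**2                # F''(t)
F3 = -2/t**3 + 2/(1 - t)**3               # F'''(t)
print(np.max(F1**2 / F2))                 # < 1, approaches 1 near the endpoints
print(np.max(np.abs(F3) / F2**1.5))       # < 2, so F is self-concordant
```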
Interestingly, it has also been shown that every full-dimensional convex subset of $\mathbb{R}^d$ admits a self-concordant barrier – called the universal barrier – with parameter O(d). Note, however, that just the existence of an O(d)-self-concordant barrier does not imply that we can construct efficient algorithms for optimizing over P in $O(\sqrt{d})$ iterations. Indeed, the universal and related barriers are largely mathematical results that are rather hard to make practical from the computational viewpoint – computing gradients and Hessians of these barriers is nontrivial. We now give an overview of two important examples of barrier functions for linear programming that have good computational properties.

11.5.1 The volumetric barrier


Definition 11.16 (The volumetric barrier) Let $P = \{x \in \mathbb{R}^n : Ax \leq b\}$, where A is an m × n matrix, and let $F : \mathrm{int}(P) \to \mathbb{R}$ be the logarithmic barrier for P. The volumetric barrier function on P is defined as
$$V(x) := \frac{1}{2} \log \det \nabla^2 F(x).$$
Moreover, the hybrid barrier on P is defined as
$$G(x) := V(x) + \frac{n}{m} F(x).$$

Discussion. The barrier functions V(x) and G(x) improve upon the $\sqrt{m}$ iterations bound obtained for linear programming using the logarithmic barrier. To understand what the volumetric barrier represents, consider the Dikin ellipsoid centered at x:
$$E_x := \{u \in \mathbb{R}^n : (u - x)^\top H(x)(u - x) \leq 1\},$$
where H(x) is the Hessian of F(x). Denote by $\mathrm{vol}(E_x)$ the volume (under the Lebesgue measure) of the Dikin ellipsoid $E_x$. Then, the volumetric barrier, up to an additive constant, is the negative logarithm of this volume:
$$V(x) = -\log \mathrm{vol}(E_x) + \mathrm{const}.$$
To see why this is a barrier function note that, since for every $x \in \mathrm{int}(P)$ the Dikin ellipsoid $E_x$ is contained in P (see Exercise 10.4), whenever x comes close to the boundary of P, $E_x$ becomes very flat and, thus, V(x) tends to +∞.

V(x) can also be thought of as a weighted logarithmic barrier function. Indeed, one can prove that the Hessian $\nabla^2 V(x)$ is multiplicatively close¹ to the following matrix Q(x):
$$Q(x) := \sum_{i=1}^{m} \sigma_i(x) \frac{a_i a_i^\top}{s_i(x)^2},$$
where $\sigma(x) \in \mathbb{R}^m$ is the vector of leverage scores at x, i.e., for $i = 1, 2, \ldots, m$,
$$\sigma_i(x) := \frac{a_i^\top H(x)^{-1} a_i}{s_i(x)^2}.$$
The vector σ(x) estimates the relative importance of each of the constraints. In fact, it holds that
$$\sum_{i=1}^{m} \sigma_i(x) = n \quad \text{and} \quad \forall i, \; 0 \leq \sigma_i(x) \leq 1.$$

Note that, unlike the logarithmic barrier, if some constraint is repeated multiple
times, its importance in the volumetric barrier does not scale with the number
of repetitions.
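The identities $\sum_i \sigma_i(x) = n$ and $0 \leq \sigma_i(x) \leq 1$ are easy to verify numerically: in the sketch below (ours, assuming numpy), the leverage scores are the diagonal entries of the orthogonal projection onto the column span of $S(x)^{-1}A$.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 30, 4
A = rng.standard_normal((m, n))
b = rng.random(m) + 1.0
x = np.zeros(n)                        # strictly feasible: slacks b - Ax = b > 0

s = b - A @ x
Ax = A / s[:, None]                    # rows a_i / s_i(x)
H = Ax.T @ Ax                          # Hessian of the logarithmic barrier
sigma = np.einsum('ij,ij->i', Ax @ np.linalg.inv(H), Ax)   # sigma_i(x)

print(np.isclose(sigma.sum(), n))      # leverage scores sum to n
print(np.all((sigma >= -1e-12) & (sigma <= 1 + 1e-12)))
```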
If for a moment we assumed that the leverage scores σ(x) do not vary with x, i.e.,
$$\sigma(x) \equiv \sigma,$$
then Q(x) would correspond to the Hessian $\nabla^2 F_\sigma(x)$ of the following weighted logarithmic barrier:
$$F_\sigma(x) = -\sum_{i=1}^{m} \sigma_i \log s_i(x).$$
Recall that, in the case of the logarithmic barrier, each $\sigma_i = 1$ and, hence, the total importance of constraints is m, while here $\sum_{i=1}^{m} \sigma_i = n$. This idea is further generalized in Section 11.5.2.

Complexity parameter. The above interpretation of the volumetric barrier V(x) as a weighted logarithmic barrier can then be used to prove that its complexity parameter is ν = O(n). However, it does not satisfy the self-concordance condition. Instead, one can show the following weaker condition:
$$\frac{1}{2} \nabla^2 V(x) \preceq \nabla^2 V(y) \preceq 2 \nabla^2 V(x) \quad \text{whenever} \quad \|x - y\|_x \leq O\left(m^{-1/2}\right).$$
¹ By that we mean that there is a constant γ > 0 such that $\gamma Q(x) \preceq \nabla^2 V(x) \preceq \gamma^{-1} Q(x)$.
To make the above hold not only for x and y with $\|x - y\|_x \leq O\left(m^{-1/2}\right)$ but also when $\|x - y\|_x \leq O(1)$, one can consider an adjustment of the volumetric barrier: the hybrid barrier $G(x) := V(x) + \frac{n}{m} F(x)$.
Theorem 11.17 (Self-concordance of the hybrid barrier) The barrier function G(x) is self-concordant with complexity parameter $\nu = O(\sqrt{nm})$.
Note that this gives an improvement over the logarithmic barrier whenever n = o(m), but does not achieve the optimal bound of O(n).

Computational complexity. While the hybrid barrier is not optimal as far as


the complexity parameter is concerned, the cost of using it to do a Newton step
can be shown to be reasonable.
Theorem 11.18 (Computational complexity of the hybrid barrier) A Newton step in the path following IPM with respect to the hybrid barrier G(x) can be performed in the time it takes to multiply two m × m matrices ($O(m^\omega)$, where ω is the matrix multiplication exponent and is about 2.373).
Recall that one iteration using the logarithmic barrier reduces to solving one linear system. Thus, the above matches the cost per iteration for the logarithmic barrier in the worst case scenario. However, for certain special cases (as for the s − t-minimum cost flow problem) the cost per iteration using the logarithmic barrier might be significantly smaller, even $\widetilde{O}(m)$ instead of $O(m^\omega)$.

11.5.2 The Lee-Sidford barrier


While the volumetric barrier considered the log-volume of the Dikin ellipsoid,
roughly, the Lee-Sidford barrier considers the log-volume of an object that is
a “smoothening” of the John ellipsoid.
Definition 11.19 (John ellipsoid) Given a polytope $P := \{x \in \mathbb{R}^n : a_i^\top x \leq b_i \text{ for } i = 1, \ldots, m\}$ and a point $x \in \mathrm{int}(P)$, the John ellipsoid of P at x is defined to be the ellipsoid of maximum volume that is centered at x and contained in P. For an ellipsoid described by a PD matrix B, its log volume is $-\frac{1}{2} \log \det B$.
It is an exercise (Exercise 11.15) to show that the following convex optimization formulation captures the John ellipsoid:
$$\log \mathrm{vol}(J(x)) := \max_{w \geq 0, \; \sum_{i=1}^{m} w_i = n} \; \log \det \left( \sum_{i=1}^{m} w_i \frac{a_i a_i^\top}{s_i(x)^2} \right). \tag{11.11}$$
Such a barrier J(x) indeed has complexity parameter n; however, the optimal weight vector $w(x) \in \mathbb{R}^m$ turns out to be nonsmooth, which makes it impossible to use J(x) directly as a barrier for optimization. To counter this issue, one can consider smoothenings of J(x). One possible smoothening is the barrier LS(x), defined as follows.

Figure 11.2 A polytope P and the Dikin ellipsoid centered at $x_0^\star$ – the analytic center of P. While the Dikin ellipsoid is fully contained in P, its m-scaling contains P.
Definition 11.20 (The Lee-Sidford barrier) Let $P := \{x \in \mathbb{R}^n : Ax \leq b\}$, where A is an m × n matrix, be a full-dimensional polytope and let $F : \mathrm{int}(P) \to \mathbb{R}$ be the logarithmic barrier for P. Consider the following barrier function defined at a point $x \in \mathrm{int}(P)$ as a solution to the following optimization problem:
$$LS(x) := \max_{w \geq 0} \left[ \log \det \left( \sum_{i=1}^{m} w_i \frac{a_i a_i^\top}{s_i(x)^2} \right) - \frac{n}{m} \sum_{i=1}^{m} w_i \log w_i \right] + \frac{n}{m} F(x). \tag{11.12}$$
The entropy term in LS(x) makes the dependency of w on x smoother and, as in the hybrid barrier, the logarithmic barrier is added with coefficient $\frac{n}{m}$.
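Both inner maximizations – (11.11) with the constraint $\sum_i w_i = n$, and (11.12) with the entropy regularizer – are concave in w, so at a fixed point x they can be solved by off-the-shelf convex solvers. The following sketch is ours (it assumes the cvxpy package and omits the constant $\frac{n}{m}F(x)$ term, which does not affect the optimal w); it is meant only to illustrate the two objectives, not the actual Lee-Sidford implementation.

```python
import cvxpy as cp
import numpy as np

def barrier_weights(A, b, x, entropy=False):
    """Optimal weights for (11.11) (entropy=False) or for the inner
    problem of (11.12) (entropy=True) at a strictly feasible point x."""
    m, n = A.shape
    s = b - A @ x                                       # slacks, must be > 0
    Ms = [np.outer(A[i], A[i]) / s[i]**2 for i in range(m)]
    w = cp.Variable(m, nonneg=True)
    M = sum(w[i] * Ms[i] for i in range(m))             # affine in w
    if entropy:                                         # Lee-Sidford inner problem
        obj = cp.log_det(M) + (n / m) * cp.sum(cp.entr(w))
        prob = cp.Problem(cp.Maximize(obj))
    else:                                               # John ellipsoid (11.11)
        prob = cp.Problem(cp.Maximize(cp.log_det(M)), [cp.sum(w) == n])
    prob.solve(solver=cp.SCS)
    return w.value
```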

Discussion. Remarkably, the Lee-Sidford barrier has led to an IPM algorithm for linear programming with $O\left(\sqrt{n} \log^{O(1)} m\right)$ iterations, each of which solves a small number of linear systems. Thus, it not only improves upon the hybrid barrier in terms of the number of iterations, it also has better computability properties.

Complexity parameter. A property of the Dikin ellipsoid, which intuitively is the reason for the $\sqrt{m}$ iteration bound, is that whenever P is symmetric (i.e., $x \in P$ if and only if $-x \in P$),
$$E_0 \subseteq P \subseteq \sqrt{m} \, E_0.$$
I.e., $E_0$ is a subset of P and, by inflating it by a factor of $\sqrt{m}$, it contains the polytope P. Here, $E_0$ is the Dikin ellipsoid centered at 0. See Figure 11.2 for an illustration. Thus, if we take one step towards the negative cost −c to the boundary of $E_0$, a multiplicative progress of $\left(1 + \frac{1}{\sqrt{m}}\right)$ in the objective is made. This intuition is not completely accurate as it holds only if we start at the analytic center of P; however, one may think of intersecting P with a constraint $\langle c, x \rangle \leq C$ for some guess C, which shifts the analytic center of P to the new iterate, and proceed in this manner. This provides us with the intuition that the Dikin ellipsoid being a $\sqrt{m}$-rounding of the polytope P is responsible for the $\sqrt{m}$ iterations bound. One can prove that, in a similar setting as considered above for the Dikin ellipsoid, it holds that
$$J_0 \subseteq P \subseteq \sqrt{n} \, J_0,$$
where $J_0$ is the John ellipsoid centered at 0. This looks promising as, indeed, m turned into n when compared to the bound for the Dikin ellipsoid and is, roughly, the reason for the improvement the Lee-Sidford barrier provides. This can be formalized in the form of the following theorem.
Theorem 11.21 (Self-concordance of the Lee-Sidford barrier) The barrier
function LS (x) is self-concordant with complexity parameter O(n·polylog(m)).
Note in particular that this matches (up to logarithmic factors) the lower bound
of n on the complexity parameter of a barrier of an n-dimensional polytope.

Computational complexity. While the complexity parameter of the barrier


LS (x) is close to optimal, it is far from clear how one can maintain the Hessian
of such a barrier and be able to perform every iteration of Newton’s method in
time comparable to one iteration for the logarithmic barrier (solving one linear
system). The way this algorithm is implemented is by keeping track of the
current iterate xt and the current vector of weights wt . At every iteration, both
the point xt and the weights wt are updated so as to advance on the central path
(determined by the current wt ) towards the optimum solution. For the sake of
computational efficiency, the weights wt never really correspond to optimizers
of (11.12) with respect to xt , but are obtained as another Newton step with
respect to the old weights wt−1 . This finally leads to the following result.
Theorem 11.22 (Informal, Lee-Sidford algorithm for LP) There is an algorithm for the linear programming problem based on the path following scheme using the Lee-Sidford barrier LS(x) that performs $\widetilde{O}\left(\sqrt{n} \log \frac{1}{\varepsilon}\right)$ iterations to compute an ε-approximate solution. Every iteration of this algorithm requires solving a polylogarithmic number of m × m linear systems.

11.6 Semidefinite programming using IPM

Recall that in Section 11.1 we used the logarithmic barrier for the positive orthant to come up with a linear programming algorithm for linear programs in the standard form
$$\min_{x \in \mathbb{R}^m} \; \langle c, x \rangle \quad \text{s.t.} \quad Ax = b, \; x \geq 0.$$
A generalization of the above to matrices is the following convex optimization problem over PSD matrices, called semidefinite programming (SDP):
$$\begin{aligned} \min_{X} \;\; & \langle C, X \rangle \\ \text{s.t.} \;\; & \langle A_i, X \rangle = b_i, \quad \text{for } i = 1, 2, \ldots, m, \\ & X \succeq 0. \end{aligned} \tag{11.13}$$
Above, X is a variable that is a symmetric n × n matrix, C and $A_1, \ldots, A_m$ are symmetric n × n matrices, and $b_i \in \mathbb{R}$. The inner product on the space of symmetric matrices we use above is
$$\langle M, N \rangle := \mathrm{Tr}(MN).$$
One can extend path following IPMs to the world of SDPs. All one needs is an efficiently computable self-concordant barrier function for the cone of PD matrices
$$C_n := \{X \in \mathbb{R}^{n \times n} : X \text{ is PD}\}.$$

11.6.1 A barrier for the PD cone

Definition 11.23 (Logarithmic barrier on the PD cone) The logarithmic barrier function $F : C_n \to \mathbb{R}$ on $C_n$ is defined to be
$$F(X) := -\log \det X.$$


Discussion. The logarithmic barrier for $C_n$ is a generalization of the logarithmic barrier for the positive orthant. Indeed, if we restrict $C_n$ to the set of diagonal, positive definite matrices, we obtain
$$F(X) = -\sum_{i=1}^{n} \log X_{i,i}.$$
More generally, it is not hard to see that
$$F(X) = -\sum_{i=1}^{n} \log \lambda_i(X),$$
where $\lambda_1(X), \ldots, \lambda_n(X)$ are the eigenvalues of X.

Complexity parameter. Note that the logarithmic barrier for the positive orthant is self-concordant with parameter n. We show that this also holds when we move to $C_n$.
Theorem 11.24 (Complexity parameter of the log barrier function for the PD cone) The logarithmic barrier function F on the cone of positive definite matrices $C_n$ is self-concordant with parameter n.

Proof To start, note that to verify the conditions in Definition 11.13 we first need to understand what vectors h to consider. As we are moving in the cone $C_n$, we would like to consider any H of the form $X_1 - X_2$ for $X_1, X_2 \in C_n$, which coincides with the set of all symmetric matrices. Let $X \in C_n$ and let $H \in \mathbb{R}^{n \times n}$ be a symmetric matrix. We have
$$\begin{aligned} \nabla F(X)[H] &= -\mathrm{Tr}\left(X^{-1} H\right), \\ \nabla^2 F(X)[H, H] &= \mathrm{Tr}\left(X^{-1} H X^{-1} H\right), \\ \nabla^3 F(X)[H, H, H] &= -2 \, \mathrm{Tr}\left(X^{-1} H X^{-1} H X^{-1} H\right). \end{aligned} \tag{11.14}$$
We are now ready to verify the self-concordance of F(X). The fact that F is a barrier follows simply from the fact that if X tends to $\partial C_n$ then $\det(X) \to 0$. The convexity of F also follows as
$$\nabla^2 F(X)[H, H] = \mathrm{Tr}\left(X^{-1} H X^{-1} H\right) = \mathrm{Tr}\left(H_X^2\right) \geq 0,$$
where $H_X := X^{-1/2} H X^{-1/2}$. To prove the last condition, note that
$$\nabla^3 F(X)[H, H, H] = -2 \, \mathrm{Tr}\left(H_X^3\right),$$
and
$$2 \left| \mathrm{Tr}\left(H_X^3\right) \right| \leq 2 \, \mathrm{Tr}\left(H_X^2\right)^{3/2} = 2 \, \nabla^2 F(X)[H, H]^{3/2}.$$

We are left with the task of determining the complexity parameter of F. This follows simply from the Cauchy-Schwarz inequality, as
$$\left| \nabla F(X)[H] \right| = \left| \mathrm{Tr}(H_X) \right| \leq \sqrt{n} \cdot \mathrm{Tr}\left(H_X^2\right)^{1/2} = \sqrt{n} \, \nabla^2 F(X)[H, H]^{1/2}.$$
This completes the proof.

Computational complexity. To apply the logarithmic barrier for optimization over the positive definite cone $C_n$ we also need to show that the gradient and Hessian of F(X) can be computed efficiently, as the next theorem (whose proof follows from the calculations above) demonstrates.
Theorem 11.25 (Gradient and Hessian of logdet) The Hessian and the gradient of the logarithmic barrier F at a point $X \in C_n$ can be computed in the time required to invert the matrix X.
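Indeed, from (11.14), $\nabla F(X) = -X^{-1}$ and the Hessian quadratic form is $H \mapsto \mathrm{Tr}(X^{-1} H X^{-1} H)$, both computable from a single inversion of X. A quick numerical sanity check of these formulas against finite differences (ours, assuming numpy):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5
B = rng.standard_normal((n, n)); X = B @ B.T + n * np.eye(n)   # a PD matrix
H = rng.standard_normal((n, n)); H = (H + H.T) / 2             # symmetric direction

Xinv = np.linalg.inv(X)
F = lambda Y: -np.log(np.linalg.det(Y))
eps = 1e-5

d1 = -np.trace(Xinv @ H)              # nabla F(X)[H] from (11.14)
d2 = np.trace(Xinv @ H @ Xinv @ H)    # nabla^2 F(X)[H, H] from (11.14)
print(np.isclose((F(X + eps*H) - F(X - eps*H)) / (2*eps), d1, rtol=1e-4))
print(np.isclose((F(X + eps*H) - 2*F(X) + F(X - eps*H)) / eps**2, d2, rtol=1e-3))
```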

11.7 Optimizing over convex sets via self-concordance


Finally, we note that when defining the notion of self-concordance, we did not assume that the underlying set K is a polytope. We only required that K is a full-dimensional convex subset of $\mathbb{R}^n$. Consequently, we can use the path following IPM to optimize over arbitrary convex sets, provided we have access to a good starting point and a computationally efficient self-concordant barrier function. In fact, this allows us to solve the problem of minimizing a linear function $\langle c, x \rangle$ over a convex set K, and the corresponding convergence guarantee is analogous to Theorem 11.14.

Theorem 11.26 (Path following IPM for arbitrary convex sets) Let F(x) be a self-concordant barrier function on a convex set $K \subseteq \mathbb{R}^n$ with complexity parameter ν. An analog of Algorithm 7, after $T := O\left(\sqrt{\nu} \log \frac{1}{\varepsilon \eta_0}\right)$ iterations, outputs $\hat{x} \in K$ that satisfies
$$\langle c, \hat{x} \rangle \leq \langle c, x^\star \rangle + \varepsilon.$$
Moreover, every iteration requires computing the gradient and Hessian (H(·)) of F at a given point $x \in \mathrm{int}(K)$ and solving one linear system of the form $H(x)y = z$, where $z \in \mathbb{R}^n$ is a given vector and y is a vector of variables.

We remark that one can also use the above framework to optimize general convex functions over K. This follows from the fact that every convex program $\min_{x \in K} f(x)$ can be transformed into one that minimizes a linear function over a convex set. Indeed, the above is equivalent to
$$\min_{(x,t) \in K'} \; t,$$
where
$$K' := \{(x, t) : x \in K, \; f(x) \leq t\}.$$
This means that we just need to construct a self-concordant function F on the set K′ to solve $\min_{x \in K} f(x)$.²

² Note that this reduction makes the convex set K′ rather complicated, as the function f is built into its definition. Nevertheless, this reduction is still useful in many interesting cases.

11.8 Exercises

11.1 Prove Theorem 11.2.


11.2 Prove Lemma 11.3.
11.3 Prove Lemma 11.9.
11.4 Prove that one can indeed find the required starting point promised in
Lemma 11.10 in O(m) time.
11.5 Prove that for every $1 \leq p_1 \leq p_2 \leq \infty$,
$$\|z\|_{p_2} \leq \|z\|_{p_1}.$$

11.6 The goal of this problem is to develop a method for solving the s − t-minimum cost flow problem exactly. We assume here that there exists an algorithm solving the s − t-minimum cost flow problem up to additive error ε > 0 (in value) in time $\widetilde{O}\left(m^{3/2} \log \frac{CU}{\varepsilon}\right)$, where C is the maximum magnitude of a cost of an edge and U is the maximum capacity of an edge.

(a) Prove that if an instance of the s − t-minimum cost flow problem has a unique optimal solution $x^\star$, then $x^\star$ can be found in $\widetilde{O}\left(m^{3/2} \log CU\right)$ time.
(b) Isolation Lemma. Prove the following lemma. Let $\mathcal{F} \subseteq 2^{[m]}$ be a family of subsets of $[m] := \{1, 2, \ldots, m\}$ and let $w : [m] \to [N]$ be a weight function chosen at random (i.e., w(i) is a random number in $\{1, 2, \ldots, N\}$ for every $i \in [m]$). Then the probability that there is a unique $S^\star \in \mathcal{F}$ maximizing $w(S) := \sum_{i \in S} w(i)$ over all $S \in \mathcal{F}$ is at least $1 - \frac{m}{N}$.
(c) Conclude that there exists a randomized algorithm that with probability at least $1 - m^{-10}$ finds an optimal solution to the s − t-minimum cost flow problem in $\widetilde{O}\left(m^{3/2} \log CU\right)$ time. Hint: All feasible flows for a graph with integral capacities and integral flow value form a convex polytope with integral vertices.

11.7 Optimal assignment problem. There are n workers and n tasks; the profit for assigning task j to worker i is $w_{ij} \in \mathbb{Z}$. The goal is to assign exactly one task to each worker so that no task is assigned twice and the total profit is maximized.
(a) Prove that the optimal assignment problem is captured by the following linear program:
$$\begin{aligned} \max_{x \in \mathbb{R}^{n \times n}} \;\; & \sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij} x_{ij} \\ \text{s.t.} \;\; & \sum_{i=1}^{n} x_{ij} = 1, \quad \forall j = 1, 2, \ldots, n, \\ & \sum_{j=1}^{n} x_{ij} = 1, \quad \forall i = 1, 2, \ldots, n, \\ & x \geq 0. \end{aligned} \tag{11.15}$$
(b) Develop an algorithm based on the path-following framework for solving this problem.
(1) The method should perform $\widetilde{O}(n)$ iterations.
(2) Analyze the time per iteration (the time to compute the equality constrained Newton step).
(3) Give an efficient algorithm to find a strictly positive initial solution along with a bound on $\eta_0$.
(4) Give a bound on the total running time.
Note: while the optimal assignment problem can be combinatorially reduced to the s − t-minimum cost flow problem, this problem asks to solve it directly using a path following IPM and not via a reduction to the s − t-minimum cost flow problem.
Note: while the optimal assignment problem can be combinatorially re-
duced to the s − t-minimum cost flow problem, this problem asks to
solve it directly using a path following IPM and not via a reduction to
the s − t-minimum cost flow problem.
11.8 Directed Physarum dynamics. In this problem we develop a different interior point method for solving linear programs in the standard form
$$\min_{x \in \mathbb{R}^m} \; \langle c, x \rangle \quad \text{s.t.} \quad Ax = b, \; x \geq 0. \tag{11.16}$$
We assume here that the program is strictly feasible, i.e., $S := \{x > 0 : Ax = b\} \neq \emptyset$.
The idea is to use a barrier function $F(x) := \sum_{i=1}^{m} x_i \log x_i$ to endow the positive orthant $\mathbb{R}^m_{>0}$ with the local inner product coming from the Hessian of F(x), i.e., for $x \in \mathbb{R}^m_{>0}$ we define
$$\forall u, v \in \mathbb{R}^m, \quad \langle u, v \rangle_x := u^\top \nabla^2 F(x) v.$$
We now specify the update rule of the new algorithm. For $x \in S$, we define
$$x' := x - g(x),$$
where g(x) is defined as the gradient of the linear function $x \mapsto \langle c, x \rangle$ with respect to the inner product $\langle \cdot, \cdot \rangle_x$, restricted to the affine subspace $\{x : Ax = b\}$. Derive a closed form expression for g(x) in terms of A, b, c and x.
11.9 Prove that the logarithmic barrier for the positive orthant $\mathbb{R}^n_{>0}$ is self-concordant with parameter n.
11.10 Prove that for a strictly convex function F defined on a convex set K, the self-concordance condition
$$\forall x \in K, \; \forall h \in \mathbb{R}^n, \quad \left| \nabla^3 F(x)[h, h, h] \right| \leq 2 \|h\|_x^3$$
implies the NL condition for some $\delta_0 < 1$. Hint: it is useful to first show that self-concordance implies that for all $x \in K$ and for all $h_1, h_2, h_3 \in \mathbb{R}^n$ it holds that
$$\left| \nabla^3 F(x)[h_1, h_2, h_3] \right| \leq 2 \|h_1\|_x \|h_2\|_x \|h_3\|_x.$$

11.11 Let $K \subseteq \mathbb{R}^n$ be a convex set and let $F : \mathrm{int}(K) \to \mathbb{R}$. Prove that, for every $\nu \geq 0$, Condition (11.17) holds if and only if Condition (11.18) does.
$$\text{For all } x \in \mathrm{int}(K), \quad \nabla F(x)^\top \left( \nabla^2 F(x) \right)^{-1} \nabla F(x) \leq \nu. \tag{11.17}$$
$$\text{For all } x \in \mathrm{int}(K), \; h \in \mathbb{R}^n, \quad (DF(x)[h])^2 \leq \nu \cdot D^2 F(x)[h, h]. \tag{11.18}$$

11.12 Prove the following properties of self-concordant barrier functions defined over convex sets.
(a) If $F_1$ is a self-concordant barrier for $K_1 \subseteq \mathbb{R}^n$ and $F_2$ is a self-concordant barrier for $K_2 \subseteq \mathbb{R}^n$, with parameters $\nu_1$ and $\nu_2$ respectively, then $F_1 + F_2$ is a self-concordant barrier for $K_1 \cap K_2$ with parameter $\nu_1 + \nu_2$.
(b) If F is a self-concordant barrier for $K \subseteq \mathbb{R}^n$ with parameter ν and $A \in \mathbb{R}^{m \times n}$ is a matrix, then $y \mapsto F(A^\top y)$ is a self-concordant barrier for $\{y \in \mathbb{R}^m : A^\top y \in K\}$ with the same parameter.
11.13 Consider an ellipsoid defined by an n × n PD matrix B:
$$E_B := \{x : x^\top B x \leq 1\}.$$
Define its volume to be
$$\mathrm{vol}(E_B) := \int_{x \in E_B} d\lambda_n(x),$$
where $\lambda_n$ is the Lebesgue measure in $\mathbb{R}^n$. Prove that
$$\log \mathrm{vol}(E_B) = -\frac{1}{2} \log \det B + \beta$$
for some constant β.


11.14 Volumetric barrier. Let P = {x ∈ Rn : Ax ≤ b}, where A is an m × n ma-
trix, be a full-dimensional polytope and let V(x) denote the volumetric
barrier.
(a) Prove that V(x) is a barrier function.
(b) Recall the definition of the leverage score vector σ(x) ∈ Rm for
x ∈ P from this chapter:
a>i H(x)−1 ai
σi (x) :=
si (x)2
where H(x) is the Hessian of the logarithmic barrier function on
P.
(1) Prove that m i=1 σi (x) = n for every x ∈ P
P

(2) Prove that 0 ≤ σi (x) ≤ 1 for every x ∈ P and every i =


1, 2, . . . , m.
(c) For x ∈ P, let
A x := S (x)−1 A

where S (x) is the diagonal matrix corresponding to the slacks


si (x). Let
P x := A x (A>x A x )−1 A>x

and let P(2)


x denote the matrix whose each entry is the square of
the corresponding entry of P x . Let Σ x denote the diagonal matrix
corresponding to σ(x). Prove that for all x ∈ P
(1) ∇2 V(x) = A>x (3Σ x − 2P(2)
x )A x .
ai a>
(2) Let Q(x) := m σ
P
i=1 i (x) i
si (x)2
. Prove that for all x ∈ P

Q(x)  ∇2 V(x)  5Q(x).

(d) Prove that for all x ∈ P


1
H(x)  Q(x)  H(x).
4m
(e) Conclude the complexity parameter of V(x) is n.
(f) Prove that for all x, y ∈ int(P) such that kx − yk x ≤ 1
8m1/2
,
1 2
∇ V(x)  ∇2 V(y)  β∇2 V(x)
β
for some β = O(1).

11.15 John ellipsoid. Let $P = \{x \in \mathbb{R}^n : Ax \leq b\}$ be a bounded, full-dimensional polytope and let $x_0 \in P$ be a fixed point in its interior. For an n × n matrix $X \succ 0$, define an ellipsoid $E_X$ centered at $x_0$ by
$$E_X := \{y \in \mathbb{R}^n : (y - x_0)^\top X (y - x_0) \leq 1\}.$$
Consider the problem of finding the ellipsoid of largest volume centered at the point $x_0$ in the interior of P that is fully contained in P.
(a) State the problem as a convex program over positive definite matrices X.
(b) Prove that the dual of the above convex program is
$$\min_{w \geq 0: \; \sum_i w_i = n} \; -\log \det \left( \sum_{i=1}^{m} w_i \frac{a_i a_i^\top}{s_i(x_0)^2} \right).$$
(c) Assume that P is symmetric around the origin, i.e., $P = -P$, $0 \in \mathrm{int}(P)$, and let $E_0$ be the John ellipsoid of P at 0. Prove that the ellipsoid $\sqrt{n} E_0$ contains P.
(d) Write the Lagrangian dual of the SDP mentioned in (11.13).

Notes
The book by Cormen et al. (2001) contains a comprehensive discussion of the s − t-minimum cost flow problem. The main theorem presented in this chapter (Theorem 11.1) was first proved by Daitch and Spielman (2008). The proof presented here is slightly different from the one by Daitch and Spielman (2008) (which was based on the primal-dual framework). Their algorithm improved over the previously fastest algorithm by Goldberg and Tarjan (1987) by a factor of roughly $\sqrt{m}$. Using ideas mentioned towards the end of this chapter, Lee and Sidford (2014) improved this bound further to $\widetilde{O}(m\sqrt{n} \log^{O(1)} U)$.
As discussed in this chapter, the s − t-maximum flow problem is a special case of the s − t-minimum cost flow problem. Hence, the result of Lee and Sidford (2014) also implies an $\widetilde{O}(m\sqrt{n} \log^{O(1)} U)$ time algorithm for the s − t-maximum flow problem; the first improvement since Goldberg and Rao (1998).
Self-concordant functions are presented in the book by Nesterov and Nemirovskii (1994). The lower bound in Theorem 11.15 can also be found in their book. An alternative proof of the existence of a barrier function with O(n)-self-concordance can be found in the paper by Bubeck and Eldan (2015). Coming up with efficient O(n)-self-concordant barriers for various convex sets K is still an important open problem. The volumetric barrier was introduced by Vaidya (1987, 1989b,a), and the Lee-Sidford barrier was introduced by Lee and Sidford (2014).
The Physarum dynamics-based algorithm for linear programming (Exercise 11.8) was analyzed by Straszak and Vishnoi (2016). For more on the isolation lemma (introduced in Exercise 11.6(b)), see the papers by Mulmuley et al. (1987) and Gurjar et al. (2018).
12
Ellipsoid Method for Linear Programming

We introduce a class of cutting plane methods for convex optimization and present an
analysis of a special case, namely, the ellipsoid method. We then show how to use this
ellipsoid method to solve linear programs over 0-1-polytopes when we only have access
to a separation oracle for the polytope.

12.1 0-1-polytopes with exponentially many constraints


In combinatorial optimization one often deals with problems of the following type: given a set family $\mathcal{F} \subseteq 2^{[m]}$, i.e., a collection of subsets of $[m] := \{1, 2, \ldots, m\}$, and a cost vector $c \in \mathbb{Z}^m$, find a minimum cost set $S \in \mathcal{F}$:
$$S^\star := \mathop{\mathrm{argmin}}_{S \in \mathcal{F}} \sum_{i \in S} c_i.$$
Many familiar problems on graphs can be formulated as above. For instance, by taking $\mathcal{F} \subseteq 2^E$ to be the set of all matchings in a graph G = (V, E) and letting c = −1 (the negative of the all ones vector of length |E|) we obtain the maximum cardinality matching problem. Similarly, one can obtain the minimum spanning tree problem, or the s − t-maximum flow problem (for unit capacity graphs), or the s − t-minimum cut problem.
When tackling these problems using continuous methods, such as the ones introduced in this book, an idea we have used on several occasions is to first associate a convex set to the domain. In the particular setting of combinatorial optimization, given a family $\mathcal{F} \subseteq 2^{[m]}$, a natural object is the polytope
$$P_{\mathcal{F}} := \mathrm{conv}\{\mathbf{1}_S : S \in \mathcal{F}\},$$
where $\mathbf{1}_S \in \{0, 1\}^m$ is the indicator vector of the set $S \subseteq [m]$. Such polytopes are called 0-1-polytopes as all their vertices belong to the set $\{0, 1\}^m$.

Below are some important examples of 0-1-polytopes involving graphs; many of them have already made their appearance in this book.
Definition 12.1 (Examples of combinatorial polytopes associated to graphs) For an undirected graph G = (V, E) with n vertices and m edges we define:
1. The matching polytope $P_M(G) \subseteq [0, 1]^m$ to be
$$P_M(G) := \mathrm{conv}\left\{\mathbf{1}_S : S \subseteq E \text{ is a matching in } G\right\}.$$
2. The perfect matching polytope $P_{PM}(G) \subseteq [0, 1]^m$ to be
$$P_{PM}(G) := \mathrm{conv}\left\{\mathbf{1}_S : S \subseteq E \text{ is a perfect matching in } G\right\}.$$
3. The spanning tree polytope $P_{ST}(G) \subseteq [0, 1]^m$ to be
$$P_{ST}(G) := \mathrm{conv}\left\{\mathbf{1}_S : S \subseteq E \text{ is a spanning tree in } G\right\}.$$


Solving the maximum matching problem in G reduces to the problem of minimizing a linear function over $P_M(G)$. More generally, one can consider the following linear optimization problem over such polytopes: given a 0-1-polytope $P \subseteq [0, 1]^m$ and a cost vector $c \in \mathbb{Z}^m$, find a vector $x^\star \in P$ such that
$$x^\star := \mathop{\mathrm{argmin}}_{x \in P} \; \langle c, x \rangle.$$
Note that by solving this linear optimization problem over $P_{\mathcal{F}}$ for the cost vector c, we can solve the corresponding combinatorial optimization problem over $\mathcal{F}$. Indeed, as the optimal solution is always attained at a vertex, say $x^\star = \mathbf{1}_{S^\star}$, then $S^\star$ is the optimal solution to the discrete problem and vice versa.¹ As this is a type of linear program, a natural question is: can we solve such linear programs using the interior point method?

12.1.1 The problem with IPMs: The case of matching polytope


The matching polytope of a graph G = (V, E) was defined as a convex hull of its vertices. While it is a valid description, this is not a particularly useful notion from a computational point of view, especially when it comes to applying IPMs that require a polyhedral description of the polytope. Remarkably, it turns out that there is a complete description of all the inequalities defining this polytope.
Theorem 12.2 (Polyhedral description of the matching polytope) Let G = (V, E) be an undirected graph with n vertices and m edges. Then
$$P_M(G) = \left\{ x \in [0, 1]^m : \sum_{i \in S} x_i \leq \frac{|S| - 1}{2}, \; S \subseteq [m] \text{ and } |S| \text{ is odd} \right\}.$$

1 Here we assume for simplicity that the optimal solution is unique.

While this does give us all the inequalities describing $P_M(G)$, there are $2^{m-1}$ of them, one for each odd subset of [m]. Thus, plugging these inequalities into the logarithmic barrier would result in $2^{O(m)}$ terms, exponential in the size of the input: O(n + m). Similarly, computing either the volumetric barrier or the Lee-Sidford barrier would require exponential time. In this much time, we might as well just enumerate all possible subsets of [m] and output the best matching. One may ask if there are more compact representations of this polytope and the answer (see notes) turns out to be NO: any representation (even with more variables) of $P_M(G)$ requires exponentially many inequalities. Thus, applying an algorithm based on the path following method to minimize over $P_M(G)$ does not seem to lead to an efficient algorithm.
In fact, the situation is even worse: this description does not even seem useful in checking if a point is in $P_M(G)$. Interestingly, the proof of Theorem 12.2 leads to a polynomial time algorithm to solve the separation problem over $P_M(G)$ exactly. There are also (related) combinatorial algorithms to find the maximum matching in a graph. In the next chapter, we see that efficient separation oracles can also be constructed for a large class of combinatorial polytopes using the machinery of convex optimization (including techniques presented in this chapter). Meanwhile, the question is: can we do linear optimization over a polytope if we just have access to a separation oracle for it? The answer to this leads us to the ellipsoid method that, historically, predates IPMs.
While we have abandoned the goal of developing IPMs for optimizing linear functions over 0-1-polytopes with exponentially many constraints in this book, it is an interesting question whether self-concordant barrier functions for polytopes such as $P_M(G)$ that are also computationally tractable exist. More specifically, is there a self-concordant barrier function $F : P_M(G) \to \mathbb{R}$ with complexity parameter polynomial in m such that the gradient $\nabla F(x)$ and the Hessian $\nabla^2 F(x)$ are efficiently computable? By efficiently computable we not only mean polynomial time in m, but also that computing the gradient and Hessian should be simpler than solving the underlying linear program itself.

12.1.2 Efficient separation oracles


The following separation problem was introduced in Chapter 4.
Definition 12.3 (Separation problem for polytopes) Given a polytope $P \subseteq \mathbb{R}^m$ and a vector $x \in \mathbb{Q}^m$, the goal is to output one of the following:
• if $x \in P$, output YES, and
• if $x \notin P$, output NO and provide a vector $h \in \mathbb{Q}^m \setminus \{0\}$ such that
$$\forall y \in P, \quad \langle h, y \rangle < \langle h, x \rangle.$$

Recall that in the NO case, the vector h defines a hyperplane separating x from
the polytope P. An important class of results in combinatorial optimization is
that for many graph-based combinatorial 0-1-polytopes, such as those intro-
duced in Definition 12.1, the separation problem can be solved in polynomial
time. Here, we measure the running time of a separation oracle as a function
of the size of the graph (O(n + m)) and the bit complexity of the input vector x.
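For a polytope given explicitly as $P = \{y : Ay \leq b\}$, a separation oracle is immediate: report a violated constraint as the separating hyperplane. A sketch (ours, assuming numpy); combinatorial polytopes such as $P_M(G)$ require genuinely different, problem-specific oracles.

```python
import numpy as np

def separation_oracle(A, b, x):
    """Separation oracle (Definition 12.3) for P = {y : Ay <= b}.
    Returns (True, None) if x is in P, else (False, h) with h = a_i for
    a violated constraint i: every y in P has <a_i, y> <= b_i < <a_i, x>."""
    violation = A @ x - b
    i = int(np.argmax(violation))
    if violation[i] <= 0:
        return True, None
    return False, A[i].copy()
```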

Theorem 12.4 (Separation oracles for combinatorial polytopes) There are polynomial time separation oracles for the polytopes $P_M(G)$, $P_{PM}(G)$ and $P_{ST}(G)$.

We leave the proof of this theorem as an interesting exercise (Exercise 12.2). The question that arises is: given a separation oracle for $P_M(G)$, can we optimize linear functions over this polytope? The main result of this chapter is the following theorem that says that this is indeed the case, not only for the matching polytope, but for all 0-1-polytopes.

Theorem 12.5 (Linear optimization over 0-1-polytopes using a separation oracle) There is an algorithm that, given a separation oracle for a 0-1-polytope $P \subseteq [0, 1]^m$ and a vector $c \in \mathbb{Z}^m$, outputs
$$x^\star \in \mathop{\mathrm{argmin}}_{x \in P} \; \langle c, x \rangle.$$
The algorithm runs in time polynomial in m and L(c) and makes a polynomial number of queries to the separation oracle of P.

Combining the above with the existence of an efficient separation oracle for
P M (G) (Theorem 12.4), we conclude, in particular, that the problem of com-
puting the maximum cardinality matching in a graph is polynomial time solv-
able. Similarly, using the above, one can compute a matching (or a perfect
matching, or a spanning tree) of maximum or minimum weight in polynomial
time. The remainder of this chapter is devoted to the proof of Theorem 12.5.
We start by sketching a general algorithm scheme in the subsequent section
and recover the ellipsoid method as a special case of it. We then prove that the
ellipsoid method indeed takes just a polynomial number of simple iterations to
converge.

12.2 Cutting plane methods

We develop a generic approach for optimizing linear functions over polytopes, called cutting plane methods. We are interested in solving the following problem:
$$\min \; \langle c, x \rangle \quad \text{s.t.} \quad x \in P, \tag{12.1}$$
where $P \subseteq \mathbb{R}^m$ is a polytope and we have access to a separation oracle for P. Throughout this section we assume that P is full-dimensional. Later we comment on how to get around this assumption.

12.2.1 Reducing optimization to feasibility using binary search


Before we proceed with the description of the algorithm, we introduce the first
idea that allows us to reduce the optimization problem (12.1) to a series of
simpler problems of the following form.
Definition 12.6 (Feasibility problem for polytopes) Given (in some form) a polytope $P \subseteq \mathbb{R}^m$, the goal is to output one of the following:
• if $P \neq \emptyset$, output YES and provide a point $x \in P$, and
• if $P = \emptyset$, output NO.
The idea is quite simple: suppose we have a rough estimate that the optimal value of (12.1) belongs to the interval $[l_0, u_0]$. We can then perform binary search, along with an algorithm for solving the feasibility problem, to find an estimate of the optimal value $y^\star$ of (12.1) up to an arbitrary precision. Later, in the proof of Theorem 12.5, we show how one can implement this idea in the setting of 0-1-polytopes.
To implement this idea, Algorithm 8 maintains an interval [l, u] that contains the optimal value $y^\star$ of (12.1) and uses an algorithm to check nonemptiness of a polytope (step 4) to halve it in every step until its length is at most ε > 0. The number of iterations of Algorithm 8 is upper bounded by $O\left(\log \frac{u_0 - l_0}{\varepsilon}\right)$.
Thus, essentially, given an algorithm to check feasibility we are able to per-
form optimization. Several remarks are in order.
1. Note that, given a separation oracle for P, a separation oracle for P′ is easy to construct.
2. Assuming that the optimal value $y^\star$ lies in the initial interval $[l_0, u_0]$, this algorithm allows us to learn an additive ε-approximate optimal value in roughly $\log \frac{u_0 - l_0}{\varepsilon}$ calls to the feasibility oracle. However, it is not clear how to recover the exact optimum $y^\star$. We show that for certain classes of polytopes, such as 0-1-polytopes, one can compute $y^\star$ exactly, given a good enough estimate of $y^\star$.

Algorithm 8: Reducing optimization to feasibility
Input:
• A separation oracle for a polytope P
• A cost function $c \in \mathbb{Q}^m$
• Estimates $l_0 \leq u_0 \in \mathbb{Q}$ such that $l_0 \leq y^\star \leq u_0$
• An ε > 0
Output: A number u such that $u \leq y^\star + \varepsilon$
Algorithm:
1: Set $l := l_0$ and $u := u_0$
2: while u − l > ε do
3:   Set $g := \frac{l+u}{2}$
4:   if $P' := P \cap \{x \in \mathbb{R}^m : \langle c, x \rangle \leq g\} \neq \emptyset$ then
5:     Set u := g
6:   else
7:     Set l := g
8:   end if
9: end while
10: return u

3. Algorithm 8 only computes an approximate optimal value of (12.1) and not the optimal point $x^\star$. One way to deal with this issue is to simply use the feasibility oracle to find a point $\hat{x}$ in $P' := P \cap \{x \in \mathbb{R}^m : \langle c, x \rangle \leq u\}$ for the final value of u. The polytope P′ is guaranteed to be nonempty and the point $\hat{x}$ has value at most $y^\star + \varepsilon$. Again, in the proof of Theorem 12.5 we show how to round $\hat{x}$ to an exact optimal solution.
4. To convert Algorithm 8 into a polynomial time reduction one needs to make sure that the polytopes P′ constructed in step 4 do not become difficult for the feasibility oracle. The running time of the algorithm we present for this task depends on the volume of P′ and it becomes slower as the volume of P′ gets smaller. Thus, in general, one has to carefully bound how complex the polytope P′ becomes as we progress with the algorithm.

Thus, up to the details mentioned above, it is enough to construct an algorithm for checking nonemptiness of polytopes.
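As a quick illustration, here is Algorithm 8 written out in Python (ours; the feasibility oracle is an assumed callback that reports whether $P \cap \{x : \langle c, x \rangle \leq g\}$ is nonempty, returning a witness point in the YES case).

```python
def optimize_via_feasibility(feasibility_oracle, l, u, eps):
    """Binary search of Algorithm 8.  `feasibility_oracle(g)` returns a
    pair (nonempty, point) for the polytope P intersected with the
    halfspace {x : <c, x> <= g}."""
    while u - l > eps:
        g = (l + u) / 2
        nonempty, _ = feasibility_oracle(g)
        if nonempty:
            u = g              # the optimal value is at most g
        else:
            l = g              # the optimal value is greater than g
    return u                   # u <= y* + eps
```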

Algorithm 9: Cutting plane method
Input:
• A separation oracle for a polytope P
• A convex set $E_0$ that is promised to contain P
Output: If $P \neq \emptyset$ then return YES and an $\hat{x} \in P$, else return NO
Algorithm:
1: while $\mathrm{vol}(E_t) \geq \mathrm{vol}(P)$ do
2:   Select a point $x_t \in E_t$
3:   Query the separation oracle for P on $x_t$
4:   if $x_t \in P$ then
5:     return YES and $\hat{x} := x_t$
6:   else
7:     Let $h_t \in \mathbb{Q}^m$ be the separating hyperplane output by the oracle
8:     Construct a new set $E_{t+1}$ so that $E_t \cap \{x : \langle x, h_t \rangle \leq \langle x_t, h_t \rangle\} \subseteq E_{t+1}$
9:   end if
10: end while
11: return NO

12.2.2 Checking feasibility using cutting planes


Suppose we are given a large enough convex set $E_0$ that contains the polytope P. To check whether P is empty or not, we construct a descending sequence of convex subsets
$$E_0 \supseteq E_1 \supseteq E_2 \supseteq \cdots$$
such that all of them contain P and their volume reduces significantly at every step. Eventually, the set $E_t$ approximates P so well that a point $x \in E_t$ is likely to be in P as well. Note that we use the Lebesgue measure in $\mathbb{R}^m$ to measure volumes. We denote by vol(K) the m-dimensional Lebesgue measure of a set $K \subseteq \mathbb{R}^m$. Formally, we consider Algorithm 9; see Figure 12.1 for an illustration.
The following theorem characterizes the performance of the above scheme.

Theorem 12.7 (Number of iterations of the cutting plane method) Suppose that the following conditions on P and the nested sequence of sets $E_0, E_1, E_2, \ldots$ are satisfied:

Figure 12.1 An illustration of one iteration of the cutting plane method. One starts with a set $E_t$ guaranteed to contain the polytope P and queries a point $x_t \in E_t$. If $x_t$ is not in P, then the separation oracle for P provides a separating hyperplane H. Hence, one can reduce the region to which P belongs by picking a set $E_{t+1}$ containing the intersection of $E_t$ with the shaded halfspace.

1. (Bounding ball) The initial set $E_0$ is contained in a Euclidean ball of radius R > 0.
2. (Inner ball) The polytope P contains a Euclidean ball of radius r > 0.
3. (Volume drop) For every step $t = 0, 1, \ldots$, we have
$$\mathrm{vol}(E_{t+1}) \leq \alpha \cdot \mathrm{vol}(E_t),$$
where 0 < α < 1 is a quantity possibly depending on m.
Then, the cutting plane method described in Algorithm 9 outputs a point $\hat{x} \in P$ after $O\left(\frac{m}{\log \alpha^{-1}} \log \frac{R}{r}\right)$ iterations.

Proof The algorithm maintains the invariant that $P \subseteq E_t$ at every step t. Hence, it never terminates with the verdict NO, as
$$\mathrm{vol}(P) \leq \mathrm{vol}(E_t)$$
for every t. However, since the volume of $E_t$ is strictly decreasing at a fixed rate, for some t, the point $x_t$ must be in P.
We now estimate the smallest t for which $x_t \in P$. Note that, from the volume drop condition, we have
$$\mathrm{vol}(E_t) \leq \alpha^t \cdot \mathrm{vol}(E_0).$$

Moreover, since P contains some ball $B(x_0, r)$ and $E_0$ is contained in some ball $B(x_0', R)$, we have
$$\mathrm{vol}(B(x_0, r)) \leq \mathrm{vol}(E_t) \leq \alpha^t \cdot \mathrm{vol}(E_0) \leq \alpha^t \cdot \mathrm{vol}(B(x_0', R)).$$
Thus,
$$\frac{\mathrm{vol}(B(x_0, r))}{\mathrm{vol}(B(x_0', R))} \leq \alpha^t. \tag{12.2}$$
However, both $B(x_0, r)$ and $B(x_0', R)$ are scalings (and translations) of the unit Euclidean ball B(0, 1) and hence
$$\mathrm{vol}(B(x_0, r)) = r^m \cdot \mathrm{vol}(B(0, 1)) \quad \text{and} \quad \mathrm{vol}(B(x_0', R)) = R^m \cdot \mathrm{vol}(B(0, 1)).$$
By plugging the above into (12.2) we obtain
$$t \leq \frac{m \log \frac{R}{r}}{\log \alpha^{-1}}.$$

Note that if we could find an instantiation of the cutting plane method (a way of choosing $E_t$ and $x_t$ at every step t) such that the corresponding parameter α is a constant less than 1, then it yields an algorithm for linear programming that performs a polynomial number of iterations – indeed, using the results in one of the exercises, it follows that we can always enclose P in a ball of radius roughly $R = 2^{O(Ln)}$ and find a ball inside of P of radius $r = 2^{-O(Ln)}$. Thus, $\log \frac{R}{r} = O(Ln)$.
The method we present in this chapter does not quite achieve a constant drop in volume but rather
$$\alpha \approx 1 - \frac{1}{m}.$$
However, this still implies a method for linear programming that performs a polynomial number of iterations, and (as we see later) each iteration is just a simple matrix operation that can be implemented in polynomial time.

12.2.3 Possible choices for Et and xt


Let us discuss two feasible and simple strategies of picking the set Et (and the
point xt ).

Figure 12.2 An illustration of the use of Euclidean balls in the cutting plane method. Note that the smallest ball containing the intersection of E with the shaded halfplane is E itself.

Approximation using Euclidean balls. One of the simplest ideas to implement the cutting plane method is to use Euclidean balls and let $E_t$ and $x_t$ be defined as
$$E_t := B(x_t, r_t)$$
for some radius $r_t$.
The problem with this method is that we cannot force the volume drop when Euclidean balls are used. Indeed, consider a simple example in $\mathbb{R}^2$. Let $E_0 := B(0, 1)$ be the starting ball and suppose that the separation oracle provides the separating hyperplane $h = (1, 0)^\top$, meaning that
$$P \subseteq K := E_0 \cap \{x \in \mathbb{R}^2 : x_1 \leq 0\}.$$
What is the smallest ball containing K? (See Figure 12.2 for an illustration.) It turns out that it is B(0, 1): we cannot find a ball with a smaller radius containing K. This simply follows from the fact that the diameter of K is 2 (the distance between $(0, -1)^\top$ and $(0, 1)^\top$ is 2). Thus, we need to look for a larger family of sets that allows us to achieve a suitable volume drop in every step.

Approximation using polytopes. The second idea one can try is to use polytopes as the $E_t$'s. We start with $E_0 := [-R, R]^m$ – a polytope containing P. Whenever a new hyperplane is obtained from the separation oracle, we update
$$E_{t+1} := E_t \cap \{x : \langle x, h_t \rangle \leq \langle x_t, h_t \rangle\},$$
again a polytope. This is certainly the most aggressive strategy one can imagine, as we always cut out as much as possible from $E_t$ to obtain $E_{t+1}$.
We also need to give a strategy for picking a point $x_t \in E_t$ at every iteration. For this, it is crucial that the point $x_t$ is well in the interior of $E_t$, as otherwise we might end up cutting only a small piece out of $E_t$ and, thus, not reducing the volume enough. Therefore, to make it work, at every step we need to find an approximate center of the polytope $E_t$ and use it as the query point $x_t$. As the polytope gets more and more complex, finding a center also becomes more and more time-consuming. In fact, we are reducing the problem of finding a point in P to a problem of finding a point in another polytope, seemingly a cyclic process.
In one of the exercises we show that this scheme can be made to work by maintaining a suitable center of the polytope and using a recentering step whenever a new constraint is added. This is a significantly nontrivial approach, based on the hope that the new center will be close to the old one when the polytope does not change too drastically, and hence can be found rather efficiently using, for instance, Newton's method.

12.3 Ellipsoid method

The discussion in the previous section indicates that using Euclidean balls is too restrictive, as we may not be able to make any progress in subsequent steps. On the other hand, maintaining a polytope does not sound any easier than solving the original problem, as it is not easy to find a suitable point $x_t \in E_t$ to query at every step t.
In Chapter 11, we saw another set of interesting convex sets associated to polytopes: ellipsoids. We saw that one can approximate polytopes using the so-called Dikin or John ellipsoid. It turns out that choosing the $E_t$'s to be ellipsoids is sufficient: they approximate polytopes well enough and their center is a part of their description. Since ellipsoids are central to this chapter, we recall their definition and introduce the notation we use to denote them in this chapter.
Definition 12.8 (Ellipsoid) An ellipsoid in $\mathbb{R}^m$ is any set of the form
$$E(x_0, M) := \left\{ x \in \mathbb{R}^m : (x - x_0)^\top M^{-1} (x - x_0) \leq 1 \right\},$$
such that $x_0 \in \mathbb{R}^m$ and M is an m × m positive definite matrix.
By definition, $E(x_0, M)$ is symmetric around its center $x_0$. Suppose that $E_t := E(x_t, M_t)$ and the separation oracle for P provides us with the information that the polytope P is contained in the set
$$\{x \in E_t : \langle x, h_t \rangle \leq \langle x_t, h_t \rangle\}.$$
Then, a natural strategy is to take $E_{t+1}$ so as to contain this set and have the smallest volume among all such ellipsoids.

Definition 12.9 (Minimum volume enclosing ellipsoid) Given an ellipsoid $E(x_0, M) \subseteq \mathbb{R}^m$ and a halfspace
$$H := \{x \in \mathbb{R}^m : \langle x, h \rangle \leq \langle x_0, h \rangle\}$$
through its center, for some $h \in \mathbb{R}^m \setminus \{0\}$, we define the minimum enclosing ellipsoid of $E(x_0, M) \cap H$ to be the ellipsoid $E(x^\star, M^\star)$ that is the optimal solution to
$$\begin{aligned} \min_{x \in \mathbb{R}^m, \, M' \succ 0} \;\; & \mathrm{vol}(E(x, M')) \\ \text{s.t.} \;\; & E(x_0, M) \cap H \subseteq E(x, M'). \end{aligned}$$
One can show that such a minimum enclosing ellipsoid always exists and is unique, as the above can be formulated as a convex program. For our application, we give an exact formula for the minimum volume ellipsoid containing the intersection of an ellipsoid with a halfspace.

12.3.1 The algorithm


The ellipsoid algorithm is an instantiation of the cutting plane method (Algo-
rithm 9) where we just use an ellipsoid as the set Et and its center as xt . More-
over, for every step t we take the minimum enclosing ellipsoid as Et+1 . The
ellipsoid algorithm appears in Algorithm 10. Again, we assume that P ⊆ Rm is
a full-dimensional polytope and a separation oracle is available for P.
Note that the description of Algorithm 10 is not yet complete as we have not
provided a way to compute minimum enclosing ellipsoids (required in step 9).
We discuss this question in Section 12.4.

12.3.2 Analysis of the algorithm

We are now ready to state a theorem on the computational efficiency of the ellipsoid method. The main part of the proof is provided in Section 12.4.
Theorem 12.10 (Efficiency of the ellipsoid method) Suppose that $P \subseteq \mathbb{R}^m$ is a full-dimensional polytope that is contained in an m-dimensional Euclidean ball of radius R > 0 and contains an m-dimensional Euclidean ball of radius r > 0. Then, the ellipsoid method (Algorithm 10) outputs a point $\hat{x} \in P$ after $O\left(m^2 \log \frac{R}{r}\right)$ iterations. Moreover, every iteration can be implemented in $O(m^2 + T_{SO})$ time, where $T_{SO}$ is the time required to answer one query by the separation oracle.

Algorithm 10: Ellipsoid algorithm
Input:
• A separation oracle for a polytope P
• A radius $R \in \mathbb{Z}_{>0}$ so that B(0, R) is promised to contain P
Output: If $P \neq \emptyset$ then return YES and an $\hat{x} \in P$, else return NO
Algorithm:
1: Let $E_0 := E(0, R^2 \cdot I)$ where I is the m × m identity matrix
2: while $\mathrm{vol}(E_t) \geq \mathrm{vol}(P)$ do
3:   Let $x_t$ be the center of $E_t$
4:   Query the separation oracle for P on $x_t$
5:   if $x_t \in P$ then
6:     return YES and $\hat{x} := x_t$
7:   else
8:     Let $h_t \in \mathbb{Q}^m$ be the separating hyperplane output by the oracle
9:     Let $E_{t+1} := E(x_{t+1}, M_{t+1})$ be the minimum volume ellipsoid s.t. $E_t \cap \{x : \langle x, h_t \rangle \leq \langle x_t, h_t \rangle\} \subseteq E_{t+1}$
10:   end if
11: end while
12: return NO

Given what has already been proved in Theorem 12.7, we are only missing the following two components (stated in the form of a lemma) to deduce Theorem 12.10.
Lemma 12.11 (Informal; see Lemma 12.12) Consider the ellipsoid algorithm defined in Algorithm 10. Then,
1. the volume of the ellipsoids $E_0, E_1, E_2, \ldots$ constructed in the algorithm drops at a rate $\alpha \approx 1 - \frac{1}{2m}$, and
2. every iteration of the ellipsoid algorithm can be implemented in polynomial time.
Given the above lemma (stated formally as Lemma 12.12) we can deduce The-
orem 12.10.
Proof of Theorem 12.10. By combining Theorem 12.7 with Lemma 12.11 we obtain the bound on the number of iterations:
$$O\left(\frac{m}{\log \alpha^{-1}} \cdot \log \frac{R}{r}\right) = O\left(m^2 \log \frac{R}{r}\right).$$

Moreover, according to Lemma 12.11, every iteration can be performed effi-


ciently. Hence, the second part of Theorem 12.10 also follows.

12.4 Analysis of volume drop and efficiency for ellipsoids


In this section we show that, given an ellipsoid $E(x_0, M)$ and a vector $h \in \mathbb{R}^m$, we can efficiently construct a new ellipsoid $E(x_0', M')$ that has significantly smaller volume and satisfies
$$E(x_0, M) \cap \{x \in \mathbb{R}^m : \langle x, h \rangle \leq \langle x_0, h \rangle\} \subseteq E(x_0', M').$$
While we do not show it here, the construction presented in this section is the minimum volume ellipsoid containing the intersection of $E(x_0, M)$ and the halfspace above. Nevertheless, the lemma below is enough to complete the description and analysis of the ellipsoid method.

Lemma 12.12 (Volume drop) Let $E(x_0, M) \subseteq \mathbb{R}^m$ be any ellipsoid in $\mathbb{R}^m$ and let $h \in \mathbb{R}^m$ be a nonzero vector. There exists an ellipsoid $E(x_0', M') \subseteq \mathbb{R}^m$ such that
$$E(x_0, M) \cap \{x \in \mathbb{R}^m : \langle x, h \rangle \leq \langle x_0, h \rangle\} \subseteq E(x_0', M')$$
and
$$\mathrm{vol}(E(x_0', M')) \leq e^{-\frac{1}{2(m+1)}} \cdot \mathrm{vol}(E(x_0, M)).$$
Moreover, $x_0'$ and M′ are as follows:
$$x_0' := x_0 - \frac{1}{m+1} Mg, \tag{12.3}$$
$$M' := \frac{m^2}{m^2 - 1} \left( M - \frac{2}{m+1} Mg(Mg)^\top \right), \tag{12.4}$$
where $g := \left( h^\top M h \right)^{-1/2} h$.

The above lemma states that using ellipsoids we can implement the cutting plane method with volume drop
$$\alpha = e^{-\frac{1}{2(m+1)}} \approx 1 - \frac{1}{2(m+1)} < 1.$$
The proof of Lemma 12.12 has two parts. In the first part, we prove Lemma 12.12 for a very simple and symmetric case. In the second part, we reduce the general case to the symmetric case studied in the first part using an affine transformation of $\mathbb{R}^m$.
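Formulas (12.3)–(12.4) make each iteration of the ellipsoid method a few matrix-vector products. The sketch below (ours, assuming numpy) implements the update and a feasibility loop in the spirit of Algorithm 10; for simplicity it caps the number of iterations instead of comparing vol(E_t) with vol(P), and it reuses the explicit separation oracle sketched after Definition 12.3.

```python
import numpy as np

def ellipsoid_update(x0, M, h):
    """One application of (12.3)-(12.4)."""
    m = len(x0)
    g = h / np.sqrt(h @ M @ h)
    Mg = M @ g
    x_new = x0 - Mg / (m + 1)
    M_new = (m**2 / (m**2 - 1)) * (M - (2 / (m + 1)) * np.outer(Mg, Mg))
    return x_new, M_new

def ellipsoid_method(oracle, R, m, max_iter=10**4):
    """oracle(x) -> (True, None) if x is in P, else (False, h)."""
    x, M = np.zeros(m), R**2 * np.eye(m)     # E(0, R^2 I) is the ball B(0, R)
    for _ in range(max_iter):
        inside, h = oracle(x)
        if inside:
            return x
        x, M = ellipsoid_update(x, M, h)
    return None                              # declare P empty (up to the volume bound)

# Example: x1 <= 2, x2 <= 2, x1 + x2 >= 1.5, using the oracle from Section 12.1.2.
A = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
b = np.array([2.0, 2.0, -1.5])
print(ellipsoid_method(lambda x: separation_oracle(A, b, x), R=10, m=2))
```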

12.4.1 Volumes and ellipsoids under affine transformations


In this section, we review basic facts about affine transformations and ellip-
soids. We start by proving how the volume of a set changes when an affine
transformation is applied to it.
Lemma 12.13 (Change of volume under affine map) Consider an affine map $\phi(x) := Tx + b$, where $T : \mathbb{R}^m \to \mathbb{R}^m$ is an invertible linear map and $b \in \mathbb{R}^m$ is any vector. Let $K \subseteq \mathbb{R}^m$ be a Lebesgue measurable set. Then
$$\mathrm{vol}(\phi(K)) = |\det(T)| \cdot \mathrm{vol}(K).$$
Proof This is a simple consequence of integration by substitution. We have
$$\mathrm{vol}(\phi(K)) = \int_{\phi(K)} d\lambda_m(x),$$
where $\lambda_m$ is the m-dimensional Lebesgue measure. We apply the change of variables $x := \phi(y)$ and obtain
$$\int_{\phi(K)} d\lambda_m(x) = |\det(T)| \int_K d\lambda_m(y) = |\det(T)| \cdot \mathrm{vol}(K),$$
as T is the Jacobian of the mapping φ.


The next fact analyzes the effect of applying an affine transformation on an
ellipsoid.
Lemma 12.14 (Affine transformations of ellipsoids) Consider an affine map φ(x) := T x + b, where T : Rm → Rm is an invertible linear map and b ∈ Rm is any vector. Let E(x0, M) be any ellipsoid in Rm. Then

φ(E(x0, M)) = E(T x0 + b, T M Tᵀ).
Proof

φ(E(x0, M)) = {φ(x) : (x0 − x)ᵀ M⁻¹ (x0 − x) ≤ 1}
            = {y : (x0 − φ⁻¹(y))ᵀ M⁻¹ (x0 − φ⁻¹(y)) ≤ 1}
            = {y : (x0 − T⁻¹(y − b))ᵀ M⁻¹ (x0 − T⁻¹(y − b)) ≤ 1}
            = {y : (T⁻¹(T x0 − (y − b)))ᵀ M⁻¹ T⁻¹(T x0 − (y − b)) ≤ 1}
            = {y : (T x0 − (y − b))ᵀ (T⁻¹)ᵀ M⁻¹ T⁻¹ (T x0 − (y − b)) ≤ 1}
            = E(T x0 + b, T M Tᵀ).

From the two previous facts one can easily derive the following corollary re-
garding the volume of an ellipsoid.

Corollary 12.15 (Volume of an ellipsoid) Let x0 ∈ Rm and M ∈ Rm×m be a PD matrix. Then

vol(E(x0, M)) = det(M)^(1/2) · Vm,

where Vm denotes the volume of the unit Euclidean ball in Rm.
Proof We first observe (using Lemma 12.14) that

E(x0, M) = φ(E(0, I)),                                             (12.5)

where φ(x) = M^(1/2) x + x0. Lemma 12.13 implies that

vol(E(x0, M)) = det(M)^(1/2) vol(E(0, I)) = det(M)^(1/2) · Vm.
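As a quick numerical companion to Corollary 12.15, here is a sketch (the helper name is ours; it assumes NumPy and SciPy are available) that computes the volume of an ellipsoid from det(M):

    import numpy as np
    from scipy.special import gammaln

    def ellipsoid_volume(M):
        # vol(E(x0, M)) = det(M)^(1/2) * V_m, where V_m = pi^(m/2) / Gamma(m/2 + 1)
        # is the volume of the unit Euclidean ball in R^m (Corollary 12.15).
        m = M.shape[0]
        log_Vm = (m / 2) * np.log(np.pi) - gammaln(m / 2 + 1)
        return np.exp(0.5 * np.linalg.slogdet(M)[1] + log_Vm)

    # Sanity check: E(0, R^2 * I) is the ball B(0, R), of volume R^m * V_m.
    m, R = 4, 2.0
    assert np.isclose(ellipsoid_volume(R**2 * np.eye(m)),
                      R**m * ellipsoid_volume(np.eye(m)))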

12.4.2 The symmetric case


In the symmetric case we assume that the ellipsoid is the unit ball E(0, I) and
we intersect it with the halfspace
H = {x ∈ Rm : x1 ≥ 0}.
Then, one can obtain an ellipsoid of a relatively small volume that contains
E(0, I) ∩ H via an explicit formula as in the lemma below.
Lemma 12.16 (Volume drop for the symmetric case) Consider the Euclidean ball E(0, I). Then the ellipsoid E′ ⊆ Rm given by

E′ := { x ∈ Rm : ((m + 1)/m)² (x1 − 1/(m + 1))² + ((m² − 1)/m²) Σ_{j=2}^m xj² ≤ 1 }

has volume vol(E′) ≤ e^(−1/(2(m+1))) · vol(E(0, I)) and satisfies

{x ∈ E(0, I) : x1 ≥ 0} ⊆ E′.
Proof We start by showing that

{x ∈ E(0, I) : x1 ≥ 0} ⊆ E′.

To this end, take any point x ∈ E(0, I) with x1 ≥ 0; the goal is to show that x ∈ E′. We first reduce the problem to a univariate question using the following observation:

Σ_{j=2}^m xj² ≤ 1 − x1²,

Figure 12.3 Illustration of the symmetric case of the argument. E0 is the unit Euclidean ball in R² and E1 is the ellipse of smallest area that contains the right half (depicted shaded) of E0. The ratio of the area of E1 to that of E0 is 4√3/9 < 1.

which follows from the fact that x belongs to the unit ball. Thus, to conclude that x ∈ E′, we just need to show that

((m + 1)/m)² (x1 − 1/(m + 1))² + ((m² − 1)/m²) (1 − x1²) ≤ 1.      (12.6)

Note that the left hand side of (12.6) is a convex (show that the second derivative is positive) and nonnegative function of the single variable x1 on [0, 1]. We would like to show that it is bounded by 1 for every x1 ∈ [0, 1]. Since a convex function on an interval attains its maximum at an endpoint, it suffices to verify this for both end points x1 = 0, 1. By directly plugging in x1 = 0 and x1 = 1 we obtain that the left hand side of (12.6) equals 1 and thus the inequality is satisfied.
We now proceed to bounding the volume of E′. To this end, note that E′ = E(z, S) with

z := (1/(m + 1), 0, 0, . . . , 0)ᵀ,

S := Diag( (m/(m + 1))², m²/(m² − 1), . . . , m²/(m² − 1) ).

Here Diag(x) is a diagonal matrix with diagonal entries as in the vector x. To compute the volume of E′ we simply apply Corollary 12.15:

vol(E′) = det(S)^(1/2) · vol(E(0, I)).

Figure 12.4 The nonsymmetric case follows by using an affine transformation φ to map the symmetric case to the general one. The unit ball E(0, I) is mapped by φ to the ellipsoid E(x0, M) so that the shaded region {x ∈ E(0, I) : x1 ≥ 0} on the left hand side is also mapped to the shaded region {x ∈ E(x0, M) : ⟨x, h⟩ ≤ ⟨x0, h⟩} on the right hand side.

We have

det(S) = (m/(m + 1))² · (m²/(m² − 1))^(m−1) = (1 − 1/(m + 1))² (1 + 1/(m² − 1))^(m−1).

Using the inequality 1 + x ≤ eˣ, which is valid for every x ∈ R, we arrive at the upper bound

det(S) ≤ (e^(−1/(m+1)))² · (e^(1/(m²−1)))^(m−1) = e^(−1/(m+1)).

Finally, we obtain

vol(E′) = det(S)^(1/2) · vol(E(0, I)) ≤ e^(−1/(2(m+1))) · vol(E(0, I)).

12.4.3 The general case


We now prove Lemma 12.12 in full generality by reducing it to the symmetric
case.
Proof of Lemma 12.12. First note that the ellipsoid E′ := E(z, S) defined for the symmetric case (Lemma 12.16) is given by

z = (1/(m + 1)) e1,

S = (m²/(m² − 1)) ( I − (2/(m + 1)) e1 e1ᵀ ).

Thus, indeed, this gives a proof of Lemma 12.12 in the case when E(x0, M) is the unit ball and h = −e1.

The idea now is to find an affine transformation φ so that

1. E(x0, M) = φ(E(0, I)),
2. {x : ⟨x, h⟩ ≤ ⟨x0, h⟩} = φ({x : x1 ≥ 0}).

We refer to Figure 12.4 for an illustration of this idea. We claim that when these conditions hold, the ellipsoid E(x0′, M′) := φ(E′) satisfies the conclusion of Lemma 12.12. Indeed, using Lemma 12.13 and Lemma 12.16, we have

vol(E(x0′, M′)) / vol(E(x0, M)) = vol(φ(E′)) / vol(φ(E(0, I))) = vol(E′) / vol(E(0, I)) ≤ e^(−1/(2(m+1))).
Moreover, by applying φ to the inclusion

{x ∈ E(0, I) : x1 ≥ 0} ⊆ E′,

we obtain

{x ∈ E(x0, M) : ⟨x, h⟩ ≤ ⟨x0, h⟩} ⊆ E(x0′, M′).
It remains now to derive a formula for φ so that we can obtain an explicit expression for E(x0′, M′). We claim that the following affine transformation φ satisfies the above stated properties:

φ(x) := x0 + M^(1/2) U x,

where U is any orthogonal matrix, i.e., U ∈ Rm×m satisfies UUᵀ = UᵀU = I, such that Ue1 = v where

v = − M^(1/2) h / ||M^(1/2) h||₂.
We have

φ(E(0, I)) = E(x0, M^(1/2) U Uᵀ M^(1/2)) = E(x0, M).

This proves the first condition. Further,

φ({x : x1 ≥ 0}) = {φ(x) : x1 ≥ 0}
               = {φ(x) : ⟨−e1, x⟩ ≤ 0}
               = {y : ⟨−e1, φ⁻¹(y)⟩ ≤ 0}
               = {y : ⟨−e1, Uᵀ M^(−1/2) (y − x0)⟩ ≤ 0}
               = {y : ⟨−M^(−1/2) U e1, y − x0⟩ ≤ 0}
               = {y : ⟨h, y − x0⟩ ≤ 0}

and, hence, the second condition follows.

It remains to derive a formula for E(x0′, M′). By Lemma 12.14, we obtain

E(x0′, M′) = φ(E(z, S)) = E(x0 + M^(1/2) U z, M^(1/2) U S Uᵀ M^(1/2)).

Thus,

x0′ = x0 + (1/(m + 1)) M^(1/2) U e1 = x0 − (1/(m + 1)) · Mh / ||M^(1/2) h||₂

and

M′ = M^(1/2) U ( (m²/(m² − 1)) (I − (2/(m + 1)) e1 e1ᵀ) ) Uᵀ M^(1/2)
   = (m²/(m² − 1)) ( M − (2/(m + 1)) (M^(1/2) U e1)(M^(1/2) U e1)ᵀ )
   = (m²/(m² − 1)) ( M − (2/(m + 1)) · (Mh)(Mh)ᵀ / (hᵀ M h) ).

By plugging in g := h / ||M^(1/2) h||₂, the lemma follows.

12.4.4 Discussion on bit precision issues


In all of the above discussion, we have assumed that the ellipsoids E0, E1, . . . can be computed exactly and no error is introduced at any step. In reality, this is not the case, as computing Et+1 from Et involves taking a square root, an operation that cannot be performed exactly. Moreover, one needs to be careful about the bit complexities of the intermediate computations, as there is no simple reason why they would stay polynomially bounded over the course of the algorithm.

To address this issue one can use the following idea: pick a polynomially bounded integer p and perform all calculations in the ellipsoid algorithm using numbers with at most p bits of precision. In other words, we work with rational numbers with common denominator 2^p and round to such numbers whenever intermediate calculations yield more than p (binary) digits after the point. Such a perturbation might have the undesired effect of slightly shifting (and reshaping) the ellipsoid Et+1 so that it is not guaranteed to contain P anymore. One can fix this by inflating the ellipsoid by a small factor such that:

• it is guaranteed to contain Et ∩ {x : ⟨x, ht⟩ ≤ ⟨xt, ht⟩}, and,
• the volume drop is still relatively large, ≈ e^(−1/(5n)).

Further, one can reason that if Et = E(xt, Mt), then the norm ||xt||₂ and the spectral norms ||Mt|| and ||Mt⁻¹|| grow by at most a factor of 2 in every iteration.

This is enough to guarantee that the algorithm is still correct and converges in
the same number of steps (up to a constant factor) assuming that p is large
enough.

12.5 Linear optimization over 0-1-polytopes


In this section we are set to present a proof of Theorem 12.5 in the case when
the polytope P is full-dimensional (at the end of this section we discuss how
to get rid of this assumption). For this, we use binary search to reduce the
optimization problem to feasibility and then use the ellipsoid algorithm to solve
the feasibility question in every step. A sketch of the algorithm appears in
Algorithm 11.
Note that in contrast to the scheme presented in Section 12.2.1, in Algorithm 11, the binary search runs over integers. We can do that since the cost function is integral and the optimum is guaranteed to be a vertex and, thus, the optimum value is also integral. Note that if y⋆ ∈ Z is the optimal value then the polytope P′ = P ∩ {x : ⟨c, x⟩ ≤ y⋆} is lower dimensional and, hence, has volume zero. For this reason, in Algorithm 11, we introduce a small slack and always ask for objective value at most g + 1/4 instead of at most g, for g ∈ Z. This guarantees that P′ is always full-dimensional and we can provide a lower bound on its volume (necessary for the ellipsoid algorithm to run in polynomial time).

12.5.1 Guaranteeing uniqueness of the optimal solution


We now describe the Perturb step of Algorithm 11 that ensures that there is only one optimal solution. This is done to be able to round a close to optimal solution to a vertex easily. To this end, one can simply define a new cost function c′ ∈ Zm such that

c′_i := 2^m c_i + 2^(i−1).

The total contribution of the perturbation for any vertex solution is strictly less than 2^m and, hence, it does not create any new optimal vertex. Moreover, it is now easy to see that every vertex has a different cost and, thus, the optimal cost (and vertex) is unique.

Finally, note that by doing this perturbation, the bit complexity of c only grows by O(m).
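In code the perturbation is a two-liner; a sketch (the function name is ours; Python's arbitrary-precision integers keep the arithmetic exact):

    def perturb_cost(c):
        # c'_i := 2^m * c_i + 2^(i-1) for i = 1, ..., m.  Any subset of
        # {2^0, ..., 2^(m-1)} sums to strictly less than 2^m, so the perturbation
        # cannot overturn the 2^m-scaled original costs, while two distinct 0-1
        # vertices always differ in the (unique) binary part of the cost.
        m = len(c)
        return [(ci << m) + (1 << i) for i, ci in enumerate(c)]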

Algorithm 11: Linear optimization for 0-1-polytopes

Input:
• A separation oracle to a full-dimensional 0-1-polytope P ⊆ [0, 1]m
• A cost vector c ∈ Zm
Output: An optimal solution to min_{x∈P} ⟨c, x⟩
Algorithm:
1: Perturb the cost function c to guarantee uniqueness of the optimal solution
2: Let l := −||c||₁ and u := ||c||₁ + 1
3: while u − l > 1 do
4:   Set g := ⌊(l + u)/2⌋
5:   Define P′ := P ∩ {x ∈ Rm : ⟨c, x⟩ ≤ g + 1/4}
6:   if P′ ≠ ∅ (use ellipsoid algorithm on P′) then
7:     Set u := g
8:   else
9:     Set l := g
10:  end if
11: end while
12: Use the ellipsoid algorithm to find a point
      x̂ ∈ P ∩ {x ∈ Rm : ⟨c, x⟩ ≤ u + 1/4}
13: Round every coordinate of x̂ to the nearest integer
14: return x̂
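The following Python sketch mirrors the structure of Algorithm 11; here ellipsoid_feasible is an assumed black box for the ellipsoid algorithm of this chapter, returning a point of P ∩ {x : ⟨c, x⟩ ≤ b} or None, and perturb_cost is the routine sketched above:

    from fractions import Fraction

    def optimize_01_polytope(separation_oracle, c, ellipsoid_feasible):
        # Binary search over the (integral) optimal value of min <c, x> over P.
        c = perturb_cost(c)                   # make the optimal vertex unique
        l = -sum(abs(ci) for ci in c)
        u = sum(abs(ci) for ci in c) + 1
        while u - l > 1:
            g = (l + u) // 2
            # P' = P cut with {x : <c, x> <= g + 1/4} is nonempty iff OPT <= g
            if ellipsoid_feasible(separation_oracle, c, g + Fraction(1, 4)) is not None:
                u = g
            else:
                l = g
        x_hat = ellipsoid_feasible(separation_oracle, c, u + Fraction(1, 4))
        return [round(xi) for xi in x_hat]    # Lemma 12.18: rounding yields x*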

12.5.2 Lower bounding the volume of polytopes


We now prove formally that the volume of the polytopes P′ queried in the algorithm is never too small for the ellipsoid method to work in polynomial time.

Lemma 12.17 (Inner ball) Suppose that P ⊆ [0, 1]m is a full-dimensional 0-1-polytope, c ∈ Zm is any cost vector, and C ∈ Z is any integer. Consider the following polytope:

P′ := {x ∈ P : ⟨c, x⟩ ≤ C + 1/4},

then either P′ is empty or P′ contains a Euclidean ball of radius at least 2^(−poly(m,L)), where L is the bit complexity of c.

Proof Assume without loss of generality that 0 ∈ P and that 0 is the unique optimal solution. It is then enough to consider C = 0, as for C < 0, the polytope P′ is empty and, for C ≥ 1, the optimal cost is larger than that for C = 0.

We start by noting that P contains a ball B of radius 2^(−poly(m)). To see this, we first pick m + 1 affinely independent vertices of P and show that the simplex spanned by them contains a ball of such a radius.

Next, we show that by scaling this ball down, roughly by a factor of 2^L with respect to the origin, it is contained in

P′ := {x ∈ P : ⟨c, x⟩ ≤ 1/4}.

To see this, note that for every point x ∈ P we have

⟨c, x⟩ ≤ Σ_i |c_i| ≤ 2^L.

Thus, we have, for every x ∈ P:

⟨x/2^(L+3), c⟩ ≤ 2^(−3) = 1/8.

In particular,

2^(−L−3) B ⊆ P′,

and

vol(2^(−L−3) B) = 2^(−m(L+3)) vol(B) = 2^(−m(L+3)−poly(m)) = 2^(−poly(m,L)).

12.5.3 Rounding the fractional solution


Lemma 12.18 (Rounding the fractional solution) Suppose that P ⊆ [0, 1]m is a full-dimensional 0-1-polytope and c ∈ Zm is any cost vector such that there is a unique point x⋆ in P which minimizes x ↦ ⟨c, x⟩ over x ∈ P. Then, if x ∈ P satisfies

⟨c, x⟩ ≤ ⟨c, x⋆⟩ + 1/3,

then by rounding each coordinate of x to the nearest integer we obtain x⋆.

Proof Let us write x as

x := αx⋆ + (1 − α)z,

where α ∈ [0, 1] and z ∈ P is a convex combination of suboptimal vertices (not including x⋆). Then, we have

⟨c, x⋆⟩ + 1/3 ≥ ⟨c, x⟩
             = ⟨c, αx⋆ + (1 − α)z⟩
             = α⟨c, x⋆⟩ + (1 − α)⟨c, z⟩
             ≥ α⟨c, x⋆⟩ + (1 − α)(⟨c, x⋆⟩ + 1)
             = ⟨c, x⋆⟩ + (1 − α).

Here, we have used the fact that z is a convex combination of suboptimal 0-1-vertices, each of cost strictly greater than that of x⋆; hence, because c is integral, we get

⟨c, z⟩ ≥ ⟨c, x⋆⟩ + 1.

Thus,

1 − α ≤ 1/3.

Hence, for every i = 1, 2, . . . , m,

|x_i − x⋆_i| = (1 − α)|x⋆_i − z_i| ≤ 1 − α ≤ 1/3.

Here, we use the fact that each x⋆_i ∈ {0, 1} and z_i ∈ [0, 1]; hence, their difference is bounded in absolute value by 1. It follows that rounding each coordinate of x to the nearest integer yields x⋆.

12.5.4 The proof


We are now ready to conclude the proof of Theorem 12.5.

Proof of Theorem 12.5. We show that Algorithm 11 is correct and runs in polynomial time. The correctness is a straightforward consequence of the definition of the algorithm. The only step which requires justification is why the rounding produces the optimal solution x⋆; this follows from Lemma 12.18.

To bound the running time of the algorithm, observe that it performs

O(log ||c||₁) = O(L)

iterations and, in every iteration, it applies the ellipsoid algorithm to check the (non)-emptiness of P′. Using Theorem 12.10, and the lower bound on the size of a ball that can be fit in P′ provided in Lemma 12.17, we conclude that the running time of one such application is poly(m, L).

12.5.5 The full-dimensionality assumption


For the proof of Theorem 12.5, we have crucially used the fact that the polytope P is full-dimensional. Indeed, we used it to give a lower bound on the volume of P and its restrictions. When P is lower dimensional, no such bound holds and the analysis needs to be adjusted.

In principle, there are two ways to deal with this issue. The first is to assume that one is given an affine subspace F = {x ∈ Rm : Ax = b} such that the polytope P is full-dimensional when restricted to F. In such a case, one can simply apply the ellipsoid method restricted to F and obtain the same running time bound as in the full-dimensional case; we omit the details.

In the case when the description of a subspace in which P is full-dimensional is not available, the situation becomes much more complicated, but still manageable. The idea is to apply the ellipsoid method to the low dimensional polytope P in order to first find the subspace F. To this end, after enough iterations of the ellipsoid method, the ellipsoid becomes flat and one can read off the directions along which this happens (the ones corresponding to tiny eigenvalues of the PD matrix generating the ellipsoid). Then, one can use the idea of simultaneous Diophantine approximations to round these vectors to a low bit complexity representation of the subspace F.

12.6 Exercises
12.1 Consider the undirected graph on three vertices {1, 2, 3} with edges
12, 23, 13.
Prove that the following polytope does not capture the matching poly-
tope of this graph.
x ≥ 0, x1 + x2 ≤ 1, x2 + x3 ≤ 1, x1 + x3 ≤ 1.
12.2 Construct a polynomial time separation oracle for the perfect matching
polytope using the following steps:
(a) Prove that separating over PPM(G) for a graph G = (V, E) reduces to the following odd minimum cut problem: given x ∈ QE find:

min_{S : |S| odd} Σ_{i∈S, j∈S̄, ij∈E} x_ij.

(b) Prove that the odd minimum cut problem is polynomial time solv-
able.
12.3 Recall that, for a polytope P ⊆ Rm the linear optimization problem is: given a cost vector c ∈ Qm find a vertex x⋆ ∈ P minimizing ⟨c, x⟩ over x ∈ P.
Let P be a class of full-dimensional 0-1-polytopes for which the linear
optimization problem is polynomial time solvable (i.e., polynomial in m
and the bit complexity of c). Prove that the separation problem for the
class P is also polynomial time solvable using the following steps:

(a) Prove that, if P ∈ P, then the separation problem over P◦ (the polar of P) is polynomial time solvable. In this problem it may be convenient to define the polar of P with respect to a point x0 that is deep inside P (i.e., its distance to the boundary of P is at least 2^(−O(m²))), i.e.,

P◦ := {y ∈ Rm : ∀x ∈ P, ⟨y, x − x0⟩ ≤ 1}.

(b) Prove that a polynomial time linear optimization oracle for the polar P◦ of P is enough to implement a polynomial time separation oracle for P.
(c) Prove that every P ∈ P has a polynomial time linear optimization oracle using the ellipsoid method. Note: one can assume that polynomial time rounding to a vertex solution is possible, given a method to find an ε-optimal solution in poly(m, L(c), log(1/ε)) time.
12.4 Prove that there is a polynomial time algorithm (based on the ellipsoid method) for linear programming in the explicit form, i.e., given a matrix A ∈ Qm×n, b ∈ Qm and c ∈ Qn find an optimal solution x⋆ ∈ Qn to min_{x∈Rn} {⟨c, x⟩ : Ax ≤ b}. The following hints may be helpful:

• Reduce the problem of finding an optimal solution to the problem of finding a good enough approximation to the optimal value. For this, think of dropping constraints ⟨ai, x⟩ ≤ bi one by one and solving the program again to see if this changes the optimal value. In the end, only n constraints remain and allow us to determine the optimal solution by solving one linear system.
• To deal with the problem of low dimensionality of the feasible region, perturb the constraints by an exponentially small number while making sure that this does not alter the optimal value by too much.

12.5 Prove that the ellipsoid E(x0′, M′) ⊆ Rm defined in Lemma 12.12 coincides with the minimum volume ellipsoid containing

E(x0, M) ∩ {x ∈ Rm : ⟨x, h⟩ ≤ ⟨x0, h⟩}.

12.6 In this problem, we derive a variant of the cutting plane method for solv-
ing the feasibility problem. In the feasibility problem, the goal is to find
a point x̂ in a convex set K. This algorithm (in contrast to the ellipsoid
method) maintains a polytope as an approximation of K, instead of an
ellipsoid.

Description of the problem. The input to the problem is a convex set K such that
• K ⊆ [0, 1]m (any bounded set K can be scaled down to satisfy this
condition),
• K contains a Euclidean ball of radius r, for some parameter r > 0 (r
is given as input),
• a separation oracle for K is provided.
The goal is to find a point x̂ ∈ K.

Description of the algorithm. The algorithm maintains a sequence of polytopes P0, P1, P2, . . ., all containing K. The polytope Pt is defined by 2m + t constraints: 2m box constraints 0 ≤ xj ≤ 1 and t constraints of the form ⟨ai, x⟩ ≤ bi (with ||ai||₂ = 1), added in the course of the algorithm. At every step t, the algorithm computes the analytic center xt of the polytope Pt, i.e., the minimizer of the logarithmic barrier Ft(x) for the polytope Pt. More formally:

Ft(x) := − Σ_{j=1}^m (log xj + log(1 − xj)) − Σ_{i=1}^t log(bi − ⟨ai, x⟩),

xt := argmin_{x∈Pt} Ft(x).

The algorithm is as follows.


• Let P0 be [0, 1]m
• For t = 0, 1, 2, . . .
– Find xt , the analytic center of Pt
– If xt ∈ K then return x̂ := xt and terminate

– Otherwise, use the separation oracle for K to cut the polytope by a hyperplane through xt, i.e., add a new constraint of the form ⟨at+1, x⟩ ≤ bt+1 := ⟨at+1, xt⟩ with ||at+1||₂ = 1.
  The new polytope is Pt+1 := Pt ∩ {x ∈ Rm : ⟨at+1, x⟩ ≤ bt+1}.

Upper bound on the potential. We analyze the above scheme assuming that the analytic center xt can be found efficiently. While we do not discuss the algorithmic task of computing xt, we use its mathematical properties to give a bound on the number of iterations. We use the minimum value of the logarithmic barrier at iteration t as a potential, i.e.,

φt := min_{x∈Pt} Ft(x) = Ft(xt).

We start by establishing an upper bound for this potential.


(a) Prove that at every step t of the algorithm (except possibly the last step)

φt ≤ (2m + t) log(1/r).

Lower bound on the potential. The next step is to show a lower bound on φt. Intuitively, we would like to show that when t becomes large enough then (on average) φt+1 − φt > 2 log(1/r), and hence eventually φt exceeds the upper bound derived above – this gives us a bound on when the algorithm terminates.

Let us denote by Ht := ∇²Ft(xt) the Hessian of the logarithmic barrier at the analytic center.
(b) Prove that for every step t of the algorithm

φt+1 ≥ φt − (1/2) log(at+1ᵀ Ht⁻¹ at+1)                        (12.7)

and conclude that

φt ≥ φ0 − (t/2) log( (1/t) Σ_{i=1}^t aiᵀ H_{i−1}⁻¹ ai ).

The next step is to analyze (prove an upper bound on) the sum Σ_{i=1}^t aiᵀ H_{i−1}⁻¹ ai. Towards this, we establish the following sequence of steps.
(c) Prove that for every step t of the algorithm

Lt := I + (1/m) Σ_{i=1}^t ai aiᵀ ⪯ Ht.

(d) Prove that, if M ∈ Rm×m is an invertible matrix and u, v ∈ Rm are vectors, then

det(M + uvᵀ) = det(M)(1 + vᵀ M⁻¹ u).
(e) Prove that for every step t of the algorithm

(1/(2m)) Σ_{i=1}^t aiᵀ H_{i−1}⁻¹ ai ≤ (1/(2m)) Σ_{i=1}^t aiᵀ L_{i−1}⁻¹ ai ≤ log det(Lt).
(f) Prove that for every step t of the algorithm

log det(Lt) ≤ m log(1 + t/m²).
From the above it follows that

φt ≥ −(t/2) log( (2m²/t) log(1 + t/m²) ),

and consequently, by combining it with the upper bound on φt, the algorithm does not make more than t := Õ(m²/r⁴) steps until it finds a point x̂ ∈ K. We can thus deduce the following result.

Theorem 12.19 There is an algorithm that, given access to a separation oracle for a set K ⊆ [0, 1]m containing a ball of radius r, finds a point x̂ ∈ K in Õ(m²/r⁴) iterations where, in iteration t, the algorithm computes an analytic center of a polytope given by O(m + t) constraints.

Notes
We refer the reader to the book by Schrijver (2002a) for numerous examples
of 0-1-polytopes related to various combinatorial objects. Theorem 12.2 was
proved in a seminal paper by Edmonds (1965a). We note that a result similar to
Theorem 12.2 also holds for the spanning tree polytope. However, while there
are results showing that there is a way to encode the spanning tree polytope
using polynomially many variables and inequalities, for the matching poly-
tope, Rothvoss (2017) proved that any linear representation of P M (G) requires
exponentially many constraints.
The ellipsoid algorithm was first developed for linear programs by Khachiyan
(1979, 1980) who built upon the ellipsoid method given by Shor (1972) and
Yudin and Nemirovskii (1976). It was further generalized to the case of poly-
topes (when we only have access to a separation oracle for the polytope) by
Grötschel et al. (1981), Padberg and Rao (1981), and Karp and Papadimitriou
(1982). For details on how to deal with the bit precision issues mentioned in
Section 12.4.4, see the book by Grötschel et al. (1988). For details on how to
avoid the full-dimensionality assumption by “jumping” to a lower dimensional
subspace, the reader is also referred to Grötschel et al. (1988). Exercise 12.6 is
based on the paper by Vaidya (1989a).
13
Ellipsoid Method for Convex Optimization

We show how to adapt the ellipsoid method to solve general convex programs. As appli-
cations, we present a polynomial time algorithm for submodular function minimization
and a polynomial time algorithm to compute maximum entropy distributions over com-
binatorial polytopes.

13.1 Convex optimization using the ellipsoid method?


The ellipsoid algorithm for linear programming presented in Chapter 12 has
several desirable features:

1. it only requires a separation oracle to work,


2. its running time depends poly-logarithmically on (u0 − l0)/ε (where ε > 0 is the error and the optimal value lies between l0 and u0), and
3. its running time depends poly-logarithmically on R/r, where R > 0 is the radius of the outer ball and r is the radius of the inner ball.

Property (1) shows that the ellipsoid method is applicable more generally than
interior point methods, which in contrast require strong access to the convex
set via a self-concordant barrier function. Properties (2) and (3) say that, at
least asymptotically, the ellipsoid algorithm outperforms first-order methods.
However, so far we have only developed the ellipsoid method for linear pro-
gramming, i.e., when K is a polytope and f is a linear function.
In this chapter, we extend the framework of the ellipsoid method to general
convex sets and convex functions assuming an appropriate (oracle) access to f
and a separation oracle for K. We also switch back from m to n to denote the
ambient space of the optimization problem. More precisely, in Section 13.4 we
prove the following theorem.


Theorem 13.1 (Ellipsoid algorithm for convex optimization) There is an algorithm that, given

1. a first-order oracle for a convex function f : Rn → R,
2. a separation oracle for a convex set K ⊆ Rn,
3. numbers r > 0 and R > 0 such that K ⊆ B(0, R) and K contains a Euclidean ball of radius r,
4. bounds l0 and u0 such that ∀x ∈ K, l0 ≤ f(x) ≤ u0, and
5. an ε > 0,

outputs a point x̂ ∈ K such that

f(x̂) ≤ f(x⋆) + ε,

where x⋆ is any minimizer of f over K. The running time of the algorithm is

O( (n² + TK + Tf) · n² · log²( (R/r) · ((u0 − l0)/ε) ) ).

Here, TK and Tf are the running times for the separation oracle for K and the first-order oracle for f respectively.

A detailed discussion of the assumptions of Theorem 13.1 is provided in Sec-


tion 13.4. Here, we mention a few things. A first-order oracle for a function
f is understood to be a primitive that, given x, outputs the value f (x) and any
subgradient h(x) ∈ ∂ f (x). Thus, this theorem requires no smoothness or dif-
ferentiability assumptions beyond convexity. In some applications it is useful
to get a multiplicative approximation (instead of an additive approximation):
given a δ > 0, find a point x̂ such that

f(x̂) ≤ f(x⋆)(1 + δ).

It can be seen that Theorem 13.1 applies to this setting by letting ε := δl0. If x̂ is such that

f(x̂) ≤ f(x⋆) + ε,

then

f(x̂) ≤ f(x⋆) + ε = f(x⋆) + δl0 ≤ f(x⋆)(1 + δ).

The running time remains polynomial in all parameters, including log(1/δ).



13.1.1 Is convex optimization in P?


Given the algorithm in Theorem 13.1, we revisit the question of polynomial time solvability of general convex programs that was mentioned (and answered negatively in Exercise 4.10) in Chapter 4. Indeed, even though the result proved in this chapter seems to imply that convex optimization is in P, this does not follow, as it relies on a few subtle, yet important assumptions. To
does not follow, as it relies on a few subtle, yet important assumptions. To
construct a polynomial time algorithm for a specific convex program one has
to first find suitable bounds R and r on the magnitude of the optimal solution
and bounds u0 and l0 on the magnitude of the optimal value. For the ellipsoid
algorithm, such a bound has to be provided as input to even run the algorithm
(see Section 13.4). Moreover, the algorithm requires a separation oracle for K
and an oracle to compute gradients of f – both these computational tasks might
turn out to be provably computationally intractable (NP-hard) in certain cases.
We provide two important and interesting examples of convex optimization:
submodular function minimization and computing maximum entropy dis-
tributions. Both these problems can, in the end, be reduced to optimizing con-
vex functions over convex domains. However, the problem of computing sub-
gradients for the submodular function minimization problem turns out to be
nontrivial. And, for the problem of computing maximum entropy distributions,
both – giving an estimate on the location of the optimizer and the computabil-
ity of the gradient – turn out to be nontrivial. Thus, even if a problem can be
formulated as a convex program, significant additional work may be required
to conclude polynomial time computability; even in the light of Theorem 13.1.

13.2 Submodular function minimization


In some of the earlier chapters, we have already encountered submodular func-
tions and the problem of submodular function minimization, although it has
never been mentioned explicitly. Below we show how it arises rather naturally
when trying to construct separation oracles for combinatorial polytopes.

13.2.1 Separation oracles for 0-1-polytopes


In Chapter 12 we proved that one can efficiently optimize linear functions over 0-1-polytopes

PF := conv{1S : S ∈ F } ⊆ [0, 1]^n,

where F ⊆ 2^[n] is a family of subsets of [n], whenever one can provide an efficient separation oracle for PF. Below we show how to construct separation oracles for a large family of polytopes called matroid polytopes.

Matroids and rank function. For a family F, we start by defining a rank function rF : 2^[n] → N as

rF(S) := max{|T| : T ∈ F, T ⊆ S},

i.e., the maximum cardinality of a set T ∈ F contained in S. Note that rF(S) may not be well defined in general; however, it is well defined for downward closed families, i.e., families F such that for every S ⊆ T ⊆ [n], if T ∈ F then S ∈ F. Of special interest in this section are set families that are matroids.

Definition 13.2 (Matroid) A nonempty family of subsets F of a ground set


[n] is called a matroid if it satisfies the following conditions:

1. (Downward closure) F is downward closed, and


2. (Exchange property) if A and B are in F and |B| > |A| then there exists
x ∈ B \ A such that A ∪ {x} ∈ F .

The motivation to define matroids comes from an attempt to generalize the


notion of independence from the world of vectors in linear algebra. Recall that
a set of vectors {v1 , . . . , vk } in Rn is said to be linearly independent over R if
no nontrivial linear combination of these vectors with coefficients from R is 0.
Linear independence is a downward closed property as removing a vector from
a set of vectors that are linearly independent maintains the linear independence
property. Moreover, if A and B are two sets of linearly independent vectors
with |A| < |B|, then there is some vector in B that is linearly independent from
all vectors in A and, hence, can be added to A while maintaining the linear
independence property.
Other simple examples of matroids are the uniform matroid, i.e., the set of all subsets of [n] of cardinality at most k for an integer k, and the graphic matroid, i.e., for an undirected graph G, the set of all acyclic subsets of edges of G.
The simplest computational question associated to a matroid is the mem-
bership problem: given S , efficiently decide whether S ∈ F or not. This is
usually a corollary of the description of the matroid at hand and we assume
this problem can be solved efficiently. E.g., checking whether a set of vectors
is linearly independent, or a set is of size at most k, or a subset of edges in a
graph contains a cycle or not are all computationally easy. Using this, and the definition of a matroid, it is not too difficult to find an efficient greedy algorithm for the following problem: given S ⊆ [n], find the maximum cardinality T ∈ F such that T ⊆ S. As a consequence, one has the following theorem that states that one can evaluate the rank function of a matroid.
Theorem 13.3 (Evaluating the rank function of a matroid) For a matroid
F on a ground set [n] that has a membership oracle, given an S ⊆ [n], one can
evaluate rF (S ) with a polynomial (in n) number of queries to the membership
oracle.
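The greedy algorithm behind Theorem 13.3 is short enough to sketch (the names are ours; is_independent stands for the membership oracle):

    def matroid_rank(is_independent, S):
        # Greedily grow an independent subset T of S.  By the exchange property,
        # a maximal independent subset of S is also one of maximum cardinality,
        # so |T| = r_F(S).  Uses |S| membership queries.
        T = []
        for e in S:
            if is_independent(T + [e]):
                T.append(e)
        return len(T)

    # Example: the uniform matroid (all sets of cardinality at most k).
    k = 3
    assert matroid_rank(lambda T: len(T) <= k, list(range(10))) == 3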

Polyhedral description of matroid polytopes. The following theorem provides a convenient description of the polytope PF when F is a matroid.

Theorem 13.4 (Matroid polytopes) Suppose that F ⊆ 2^[n] is a matroid, then

PF = { x ≥ 0 : ∀S ⊆ [n], Σ_{i∈S} xi ≤ rF(S) }.

Note that the inclusion from left to right (⊆) is true for all families F; the opposite direction relies on the matroid assumption, and is nontrivial.

For the matroid F consisting of all sets of size at most k, the rank function is particularly simple:

rF(S) = min{|S|, k}.
Hence, the corresponding matroid polytope is

PF = { x ≥ 0 : ∀S ⊆ [n], Σ_{i∈S} xi ≤ min{|S|, k} }.

Note that, as the rank of a singleton set can be at most 1, the right hand side has constraints of the form 0 ≤ xi ≤ 1 for all 1 ≤ i ≤ n. Hence, they trivially imply the rank constraint for any set S whose cardinality is at most k. Moreover, there is a constraint

Σ_{i=1}^n xi ≤ k.

This constraint, along with the fact that xi ≥ 0 for all i, implies the constraint

Σ_{i∈T} xi ≤ k

for any T such that |T| ≥ k. Thus, the corresponding matroid polytope is just

PF = { x ∈ [0, 1]^n : Σ_{i∈[n]} xi ≤ k },

and the separation problem is trivial.
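Indeed, here is a sketch of this trivial oracle (the convention is ours: it returns None when x lies in the polytope, and otherwise a violated inequality (a, b), meaning ⟨a, z⟩ ≤ b holds for all z in the polytope but ⟨a, x⟩ > b):

    def separate_uniform_matroid(x, k):
        n = len(x)
        for i in range(n):
            if x[i] < 0:    # violated constraint: -x_i <= 0
                return ([-1.0 if j == i else 0.0 for j in range(n)], 0.0)
            if x[i] > 1:    # violated constraint: x_i <= 1
                return ([1.0 if j == i else 0.0 for j in range(n)], 1.0)
        if sum(x) > k:      # violated constraint: sum_i x_i <= k
            return ([1.0] * n, float(k))
        return None         # x is in the polytope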


However, in general, including the case of all acyclic subgraphs of a graph,
the number of constraints defining the polytope can be exponential (in the size
of the graph) and separation is a nontrivial problem.

Separating over matroid polytopes. Theorem 13.4 suggests the following strategy for separation over matroid polytopes. Given x ∈ [0, 1]^n, denote

Fx(S) := rF(S) − Σ_{i∈S} xi,

and find

S⋆ := argmin_{S⊆[n]} Fx(S).

Indeed,

x ∈ PF if and only if Fx(S⋆) ≥ 0.

Furthermore, if Fx(S⋆) < 0 then S⋆ provides us with a separating hyperplane:

⟨y, 1_{S⋆}⟩ ≤ rF(S⋆).

This is because

Fx(S⋆) < 0 implies that ⟨x, 1_{S⋆}⟩ > rF(S⋆),

but Theorem 13.4 implies that the inequality is satisfied by all points in PF. Thus, in order to solve the separation problem, it is sufficient to solve the minimization problem for Fx. The crucial observation is that Fx is not an arbitrary function – it is submodular, and submodular functions have a lot of combinatorial structure that can be leveraged.
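Schematically, the resulting separation oracle looks as follows (a sketch: sfm_minimize stands in for any submodular function minimization routine, such as the one of Theorem 13.8 below, and rank for the rank-function oracle):

    def separate_matroid_polytope(x, rank, sfm_minimize):
        # Minimize F_x(S) = r_F(S) - sum_{i in S} x_i over subsets S of [n].
        n = len(x)
        F_x = lambda S: rank(S) - sum(x[i] for i in S)
        S_star = sfm_minimize(F_x, n)
        if F_x(S_star) >= 0:
            return None                           # x is in P_F
        a = [1.0 if i in S_star else 0.0 for i in range(n)]
        return (a, rank(S_star))                  # violated: <a, x> > r_F(S*)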

Definition 13.5 (Submodular function) A function F : 2^[n] → R is called submodular if

∀S, T ⊆ [n],   F(S ∩ T) + F(S ∪ T) ≤ F(S) + F(T).

As alluded to before, the matroid rank function rF is submodular; see Exercise


13.4(a).

Theorem 13.6 (The matroid rank function is submodular) Suppose that


F ⊆ 2[n] is a matroid, then the rank function rF is submodular.

Given the discussion above, we consider the following general problem.



Definition 13.7 (Submodular function minimization (SFM)) Given a submodular function F : 2^[n] → R, find a set S⋆ such that

S⋆ = argmin_{S⊆[n]} F(S).

We have not specified how the function F is given. Depending on the applica-
tion, it can either be succinctly described using a graph (or some other com-
binatorial structure), or given as an oracle that, given S ⊆ [n], outputs F(S ).
Remarkably, this is all we need to develop algorithms for SFM.
Theorem 13.8 (Polynomial time algorithm for SFM) There is an algorithm that, given an oracle access to a submodular function F : 2^[n] → [l0, u0] for some integers l0 ≤ u0, and an ε > 0, finds a set S ⊆ [n] such that

F(S) ≤ F(S⋆) + ε,

where S⋆ is a minimizer of F. The algorithm makes poly(n, log((u0 − l0)/ε)) queries to the oracle.
Note that, for the application of constructing separation oracles for matroid polytopes, a polynomial running time with respect to log((u0 − l0)/ε) is necessary. Recall that, in the separation problem, the input also consists of an x ∈ [0, 1]^n and we considered the function

Fx(S) := rF(S) − Σ_{i∈S} xi.

Given a membership oracle for a matroid, Theorem 13.3 implies an oracle to its rank function, which is submodular by Theorem 13.6. Note that the function Fx is specified by a point x ∈ [0, 1]^n. Even though its range [l0, u0] is small, ε has to be of the order of the smallest xi to recover an optimum (S⋆) and, hence, to give us a separating hyperplane when the given point is outside the polytope. Thus, logarithmic dependence on the error in Theorem 13.8 implies a polynomial dependence on the bit complexity of x for this separation scheme to yield a polynomial time algorithm.¹
Theorem 13.9 (Efficient separation over matroid polytopes) There is an
algorithm that, given a membership oracle to a matroid F ⊆ 2[n] , solves the
separation problem over PF for a point x ∈ [0, 1]n with polynomially many (in
n and the bit complexity of x) queries to the membership oracle.
A consequence of this theorem and Theorem 12.5 from Chapter 12 is that we
can perform linear optimization over matroid polytopes in polynomial time.
1 We note that there are alternative combinatorial methods to separate over matroid polytopes
using duality and the equivalence between linear optimization and separation; see Exercises
5.15 and 13.6.

Separating over matroid base polytopes. The maximum cardinality elements of a matroid are called bases and the corresponding polytope (the convex hull of indicator vectors of bases) is called the matroid base polytope. Suppose r is the size of a base of a matroid (from the exchange property it follows that all bases have the same cardinality). The polyhedral description of a matroid base polytope follows from Theorem 13.4 by adding the following equality to PF:

Σ_{i=1}^n xi = r.                                              (13.1)

Thus, if we have a separation oracle for the matroid polytope, it easily extends to a separation oracle for the matroid base polytope. Note that spanning trees are maximum cardinality elements in the graphic matroid and, hence, the theorem above also gives us a separation oracle for the spanning tree polytope.

13.2.2 Towards an algorithm for SFM – Lovász extension


The first idea in solving the (discrete) SFM problem is to turn it into a convex optimization problem. Towards this, we define the Lovász extension of F : 2^[n] → R.

Definition 13.10 (Lovász extension) Let F : 2^[n] → R be a function. The Lovász extension of F is defined to be the function f : [0, 1]^n → R such that

f(x) := E[F({i : xi > λ})],

where the expectation is taken over a uniformly random choice of λ ∈ [0, 1].

The Lovász extension has many interesting and important properties. First, observe that the Lovász extension of F is always a continuous function and agrees with F on integer vectors. It is, however, not smooth. Moreover, one can show that

min_{x∈[0,1]^n} f(x) = min_{S⊆[n]} F(S);

see Exercise 13.8. Further, the following theorem asserts that the Lovász extension of a submodular function is convex.

Theorem 13.11 (Convexity of the Lovász extension) If a set function F : 2^[n] → R is submodular then its Lovász extension f : [0, 1]^n → R is convex.

Thus, we have reduced SFM to the following convex optimization problem:

min_{x∈[0,1]^n} f(x),                                          (13.2)

where f is the Lovász extension of the submodular function F. Note that the
constraint set is just the hypercube [0, 1]n .

13.2.3 Polynomial time algorithm for minimizing the Lovász extension

In the setting of optimizing a convex function f over a hypercube, Theorem 13.1 can be simplified as follows.

Theorem 13.12 (Informal; see Theorem 13.1) Let f : [0, 1]^n → R be a convex function and suppose the following conditions are satisfied:

1. we are given access to a polynomial time oracle computing the value and the (sub)gradient of f , and
2. the values of f over [0, 1]^n lie in the interval [l0, u0] for some l0 ≤ u0 ∈ R.

Then, there is an algorithm that given ε > 0 outputs an ε-approximate solution to min_{x∈[0,1]^n} f(x) in time poly(n, log(u0 − l0), log(1/ε)).
Given the above theorem, the only remaining step is to present an efficient ora-
cle for computing the values and subgradients of the Lovász extension (which
is piecewise linear). This is a consequence of the lemma below.
Lemma 13.13 (Efficient computability of the Lovász extension) Let F : 2^[n] → R be any function and f : [0, 1]^n → R be its Lovász extension. There is an algorithm that, given x ∈ [0, 1]^n, computes f(x) and a subgradient h(x) ∈ ∂f(x) in time

O(nTF + n²),

where TF is the running time of the evaluation oracle for F.
Proof Without loss of generality, assume that the point x ∈ [0, 1]n on which
we have to evaluate f satisfies
x1 ≤ x2 ≤ · · · ≤ xn .
Then, the definition of f implies that f (x) equals
x1 F([n]) + (x2 − x1 )F([n] \ {1}) + · · · + (xn − xn−1 )F([n] \ [n − 1]) + (1 − xn )F(∅).
The above shows that f is piecewise linear and also gives a formula to compute
f (x) given x (it requires just n+1 evaluations of F). To compute the subgradient
of f at x, one can assume without loss of generality that x1 < x2 < · · · < xn , as
otherwise we can perform a small perturbation to reduce to this case. Now, on
the set
S := {x ∈ [0, 1]n : x1 < x2 < · · · < xn },

the function f (x) (as demonstrated above) is just a linear function, hence, its
gradient can be evaluated efficiently.
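A sketch of this evaluation procedure in NumPy (our own code, following the formula in the proof; F is a callable taking a set of indices):

    import numpy as np

    def lovasz_extension(F, x):
        # Evaluate f(x) and a subgradient h(x) with one sort of x and
        # n + 1 evaluations of F (Lemma 13.13).
        n = len(x)
        order = np.argsort(x)               # coordinates in increasing order
        levels = [0.0] + [x[i] for i in order] + [1.0]
        S = set(range(n))                   # current level set {i : x_i > lambda}
        F_prev = F(S)                       # F([n])
        value = (levels[1] - levels[0]) * F_prev
        h = np.zeros(n)
        for t, i in enumerate(order):
            S = S - {i}                     # lambda crosses x_i: i leaves the set
            F_cur = F(S)
            value += (levels[t + 2] - levels[t + 1]) * F_cur
            h[i] = F_prev - F_cur           # gradient on this linear piece
            F_prev = F_cur
        return value, h

    # f agrees with F on indicator vectors: here F(S) = min(|S|, 2).
    val, _ = lovasz_extension(lambda S: min(len(S), 2), np.array([1.0, 0.0, 1.0, 1.0]))
    assert np.isclose(val, 2.0)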
We are now ready to establish a proof of Theorem 13.8.
Proof of Theorem 13.8. As seen in the discussion so far, minimizing a submodular function F : 2^[n] → R reduces to minimizing its Lovász extension f : [0, 1]^n → R. Moreover, Theorem 13.11 asserts that f is a convex function. Therefore, one can compute an ε-approximate solution to min_{x∈[0,1]^n} f(x) given just an oracle that provides values and subgradients of f (this follows from Theorem 13.12). If the range of F is contained in [l, u], then so is the range of f. Hence, the running time bound in Theorem 13.8 follows.

Finally, we show how to round an ε-approximate solution x̂ ∈ [0, 1]^n to a set S ⊆ [n]. From the definition of the Lovász extension it follows that

f(x̂) = Σ_{i=0}^n λi F(Si),

for some sets

S0 ⊆ S1 ⊆ · · · ⊆ Sn

and λ ∈ ∆n+1 (the probability simplex in dimension n + 1). Thus, for at least one i it holds that

F(Si) ≤ f(x̂),

and we can output this Si, as it satisfies

F(Si) ≤ f(x̂) ≤ f(x⋆) + ε = F⋆ + ε.

13.3 Maximum entropy distributions over discrete domains


In this section we consider the following problem.

Definition 13.14 (Maximum entropy problem) Given a discrete domain Ω, a collection of vectors {vω}ω∈Ω ⊆ Qn, and θ ∈ Qn, find a distribution p⋆ over Ω that is a solution to the following optimization problem:

max Σ_{ω∈Ω} pω log(1/pω)
s.t. Σ_{ω∈Ω} pω vω = θ,                                        (13.3)
     p ∈ ∆Ω.

Here, ∆Ω denotes the probability simplex, i.e., the set of all probability distri-
butions over Ω.

Note that the entropy function is concave (nonlinear), and ∆Ω is convex; hence, this is a convex optimization problem. However, solving this problem requires specifying how the vectors are given as input.

If the domain Ω is of size N and the vectors vω are given to us explicitly, then one can find an ε-approximate solution to Equation (13.3) in time polynomial in N, log(1/ε), and the bit complexity of the input vectors and θ. This can be done for instance using the interior point method², or the ellipsoid method from this chapter.
However, when the domain Ω is large and the vectors are implicitly spec-
ified, this problem becomes computationally nontrivial. To take an instruc-
tive example, let Ω be the set of all spanning trees T of an undirected graph
G = (V, E), and let vT := 1T be the indicator vector of the spanning tree T .
Thus, all vectors vT are implicitly specified by the graph G = (V, E). The input
to the entropy maximization problem in the spanning tree case is, thus, a graph
G = (V, E) and a vector θ ∈ QE . In trying to apply the above approach to the
case of spanning trees in a graph, one immediately realizes that N is exponen-
tial in the size of the graph and, thus, a polynomial in N algorithm is, in fact,
exponential in the number of vertices (or edges) in G.
Further, even though an exponential domain (such as spanning trees in a
graph) can be specified compactly (via a graph), the output of the problem is
still of exponential size – a vector p of dimension N = |Ω|. How can we output
a vector of exponential length in polynomial time? We cannot. However, what
is not ruled out is an algorithm that, given an element of the domain (say a tree
T in G), outputs the probability associated to it in polynomial time (pT ). In the
next sections we show that, surprisingly, this is possible and, using the ellipsoid
method for convex optimization, we can obtain a polynomial time algorithm
for the entropy maximization problem over the spanning tree polytope. What
we present here is a sketch consisting of the key steps in the algorithm and the
proof; several steps are left as exercises.

13.3.1 Dual of the maximum entropy convex program


To counter the problems related to solving the entropy maximization problem
for exponentially large domains Ω we first use duality to transform this prob-
lem into one that has only n variables.
2 For that, one needs to construct a self-concordant barrier function for the sublevel set of the
entropy function.

Definition 13.15 (Dual to the maximum entropy problem) Given a domain Ω, a collection of vectors {vω}ω∈Ω ⊆ Qn and θ ∈ Qn, find a vector y⋆ ∈ Rn, the optimal solution to the problem:

min log( Σ_{ω∈Ω} e^⟨y, vω − θ⟩ )
s.t. y ∈ Rn.                                                   (13.4)
It is an important exercise (Exercise 13.11) to prove that the problems in Equations (13.3) and (13.4) are dual to each other and that strong duality holds whenever θ is in the interior of the polytope corresponding to the convex hull of the vectors {vω}ω∈Ω.

Thus, assuming that strong duality holds, the dual problem (13.4) looks simpler to solve as it is an unconstrained optimization problem in a small number of variables: it is over Rn. Then, first-order optimality conditions imply that the optimal dual solution y⋆ leads to a compact representation of the optimal distribution p⋆.

Lemma 13.16 (Succinct representation of maximum entropy distribution) Suppose that y⋆ ∈ Rn is the optimal solution to the dual to the entropy maximization problem (13.4). Then, assuming strong duality holds, the optimal solution p⋆ to the entropy maximization problem (13.3) can be recovered as

∀ω ∈ Ω,   p⋆ω = e^⟨y⋆, vω⟩ / Σ_{ω′∈Ω} e^⟨y⋆, vω′⟩.             (13.5)

Thus, in principle, it seems to be enough to compute y⋆ in order to obtain a succinct representation of the maximum entropy distribution p⋆.
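When Ω is small enough to enumerate, (13.5) is directly computable; a sketch (names ours), with the usual max-subtraction trick for numerical stability:

    import numpy as np

    def max_entropy_distribution(vs, y_star):
        # p_w is proportional to exp(<y*, v_w>); shifting every exponent by a
        # constant leaves the ratios in (13.5) unchanged but avoids overflow.
        exponents = np.array([np.dot(y_star, v) for v in vs])
        exponents -= exponents.max()
        weights = np.exp(exponents)
        return weights / weights.sum()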
However, as its objective involves summing up over all the vectors (in cases such as spanning trees in graphs), we need a way to compute this exponential-sized sum in polynomial time. Moreover, while the numerator can be easily evaluated given y⋆, the denominator in (13.5) also, essentially, requires an oracle to evaluate the objective function.

Further, we also need a good bound R on ||y⋆||₂. The ellipsoid method depends poly-logarithmically on R, so an exponential in n bound on R might seem sufficient; however, the dual objective function is exponential in y, so to represent the numbers appearing in it using polynomially (in n) many bits, we need a polynomial (in n) bound on R.

Both a polynomial time evaluation oracle and a polynomial bound on R are nontrivial to obtain and may not hold in general settings. We show, however, that for the case of spanning trees, this is possible.

13.3.2 Solving the dual problem using ellipsoid algorithm


Let us denote the objective of the dual program by

f(y) := log( Σ_{ω∈Ω} e^⟨y, vω − θ⟩ ).

We would like to find the minimum of f(y). For this we apply the general algorithm for convex optimization based on the ellipsoid method that we derive in Section 13.4. When translated to this setting it has the following consequence.

Theorem 13.17 (Informal; see Theorem 13.1) Suppose the following conditions are satisfied:

1. we are given access to a polynomial time oracle computing the value and the gradient of f ,
2. y⋆ is guaranteed to lie in a ball B(0, R) for some R > 0, and
3. the values of f over B(0, R) stay in the interval [−M, M] for some M > 0.

Then, there is an algorithm that given ε > 0 outputs ŷ s.t.

f(ŷ) ≤ f(y⋆) + ε

in time poly(n, log R, log M, log(1/ε)) (when an oracle query is treated as a unit operation).

Note that we do not require the inner ball due to the following fact (Exercise 13.13):

∀y, y′ ∈ Rn,   |f(y) − f(y′)| ≤ 2√n ||y − y′||₂.               (13.6)

Thus, we can set r := Θ(ε/√n). In general, this does not imply a polynomial time algorithm for the maximum entropy problem as, for specific instantiations, we have to provide and account for an evaluation oracle, a value of R, and M. Nevertheless, we show how to use the above to obtain a polynomial time algorithm for the maximum entropy problem for the case of spanning trees.

13.3.3 Polynomial time algorithm for the spanning tree case


Theorem 13.18 (Solving the dual to the max entropy problem for spanning trees) Given a graph G = (V, E), numbers η > 0 and ε > 0, and a vector θ ∈ QE such that B(θ, η) ⊆ PST(G), there is an algorithm to find an ε-approximate solution to

min_{y∈RE} log( Σ_{T∈TG} e^⟨y, 1T − θ⟩ )                       (13.7)

in time poly(|V| + |E|, η⁻¹, log(1/ε)), where TG is the set of all spanning trees in G.

Note importantly that the above gives a polynomial time algorithm only if θ is sufficiently inside the polytope PST (in other words, when it is far away from the boundary of PST). If the point θ comes close to the boundary (i.e., η ≈ 0) then the bound deteriorates to infinity. This assumption is not necessary and we give a suggestion in the notes on how to get rid of it. However, the η-interiority assumption makes the proof simpler and easier to understand.

Proving a bound on R. We start by verifying that the second condition in Theorem 13.17 is satisfied, by proving an appropriate upper bound on the magnitude of the optimal solution.

Lemma 13.19 (Bound on the norm of the optimal solution) Suppose that G = (V, E) is an undirected graph and θ ∈ RE satisfies B(θ, η) ⊆ PST(G), for some η > 0. Then y⋆, the optimal solution to the dual to the max entropy problem (13.7), satisfies

||y⋆||₂ ≤ |E|/η.

Note that the bound deteriorates as η → 0 and, thus, is not very useful when θ is close to the boundary.

Proof of Lemma 13.19. Denote m := |E| and let θ ∈ Rm be such that B(θ, η) ⊆ PST(G). Suppose that y⋆ is the optimal solution to the dual program. If f denotes the dual objective, we know from strong duality that any upper bound on the optimal primal solution is an upper bound on f(y⋆). Hence, since the maximum achievable entropy of a distribution over TG is log|TG|, we have

f(y⋆) ≤ log|TG| ≤ log 2^m ≤ m.                                 (13.8)

Recall now that

f(y⋆) = log( Σ_{T∈TG} e^⟨y⋆, 1T − θ⟩ ).

Thus, from Equation (13.8) we obtain that, for every T ∈ TG,

⟨y⋆, 1T − θ⟩ ≤ m.

By taking an appropriate convex combination of the inequalities above, it follows that for every point x ∈ PST(G) we have

⟨y⋆, x − θ⟩ ≤ m.

Hence, from the assumption that B(θ, η) ⊆ PST(G), we conclude that

∀v ∈ B(0, η),   ⟨y⋆, v⟩ ≤ m.

In particular, by taking v := η · y⋆/||y⋆||₂ we obtain that

||y⋆||₂ ≤ m/η.

Efficient evaluation of f. We now verify the first condition in Theorem 13.17. To this end, we need to show that f can be evaluated efficiently.

Lemma 13.20 (Polynomial time evaluation oracle for f and its gradient) Suppose that G = (V, E) is an undirected graph and f is defined as

f(y) := log( Σ_{T∈TG} e^⟨y, 1T − θ⟩ ).

Then there is an algorithm that, given y, outputs the value f(y) and the gradient ∇f(y) in time poly(||y||₂, |E|).

The proof is left as Exercise 13.14. Note that the running time of the oracle is polynomial in ||y||₂ and not in log ||y||₂ as one would perhaps desire. This is a consequence of the fact that even a single term in the sum e^⟨y, 1T − θ⟩ might be as large as e^||y|| and hence requires up to ||y|| bits to represent.
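For intuition, here is a sketch of one standard route to such an oracle, via the weighted matrix-tree theorem, which expresses Σ_T Π_{e∈T} w_e as a cofactor of the weighted graph Laplacian; carrying this out in full (including the gradient, whose coordinate for edge e is the marginal probability that e lies in a random tree minus θ_e) is the content of Exercise 13.14:

    import numpy as np

    def dual_objective(edges, n, y, theta):
        # f(y) = log( sum_T prod_{e in T} exp(y_e) ) - <y, theta>: with edge
        # weights w_e = exp(y_e), the sum over spanning trees equals any cofactor
        # of the weighted Laplacian (weighted matrix-tree theorem).
        L = np.zeros((n, n))
        for (u, v), y_e in zip(edges, y):
            w = np.exp(y_e)        # may need up to ||y|| bits, as noted above
            L[u, u] += w; L[v, v] += w
            L[u, v] -= w; L[v, u] -= w
        _, logdet = np.linalg.slogdet(L[1:, 1:])   # delete row/column 0
        return logdet - float(np.dot(y, theta))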

Proof of Theorem 13.18. We are now ready to conclude a proof of Theorem 13.18 from Theorem 13.17 using the above lemmas.

Proof of Theorem 13.18. We need to verify conditions 1, 2 and 3 in Theorem 13.17.

Condition 2 is satisfied for R = O(m/η) by Lemma 13.19. Given a bound ||y||₂ ≤ R we can now provide a lower and upper bound on f(y) (to verify Condition 3). We have

f(y) = log( Σ_{T∈TG} e^⟨y, 1T − θ⟩ )
     ≤ log( Σ_{T∈TG} e^(||y||₂ · ||1T − θ||₂) )
     ≤ log|TG| + ||y||₂ · O(√m)
     ≤ m + m^(3/2)/η
     = poly(m, 1/η).
Similarly, we can obtain a lower bound and, thus, for all y such that ||y||₂ ≤ R, we have that −M ≤ f(y) ≤ M with M = poly(m, 1/η).

Condition 1 is also satisfied, due to Lemma 13.20. However, it adds a poly(R) factor to the running time (since the evaluation oracle runs in time polynomial in ||y||). Hence, the final bound on the running time (as given by Theorem 13.17) becomes

poly(|V| + |E|, log R, log M, log(1/ε)) · poly(R) = poly(|V| + |E|, η⁻¹, log(1/ε)).

13.4 Convex optimization using the ellipsoid method


This section is devoted to deriving the algorithm, promised in Theorem 13.1, for solving convex programs

min_{x∈K} f(x)

when we have a separation oracle for K and a first-order oracle for f. To this end, we start by generalizing the ellipsoid algorithm from Chapter 12 to solve the feasibility problem for any convex set K (not only for polytopes) and, further, we show that this algorithm, combined with the standard binary search procedure, yields an efficient algorithm under mild conditions on f and K.

13.4.1 From polytopes to convex sets


Note that the ellipsoid algorithm for the feasibility problem for polytopes that we developed in Chapter 12 did not use the fact that P is a polytope, only that P is convex and that a separation oracle for P is available. Indeed, Theorem 12.10 in Chapter 12 can be rewritten in this more general form to yield the following.
in Chapter 12 can be rewritten in this more general form to yield the following.
Theorem 13.21 (Solving the feasibility problem for convex sets) There is
an algorithm that, given a separation oracle for a convex set K ⊆ Rn , a radius
R > 0 such that K ⊆ B(0, R) and a parameter r > 0 gives one of the following
outputs
1. YES, along with a point x̂ ∈ K, proving that K is nonempty, or
2. NO, in which case K is guaranteed to not contain a Euclidean ball of
radius r.
The running time of the algorithm is
 R
O (n2 + T K ) · n2 · log ,
r
where T K is the running time of the separation oracle for K.
Note that the above is written in a slightly different form than Theorem 12.10 in Chapter 12. Indeed, there we assumed that K contains a ball of radius r and wanted to compute a point inside of K. Here, the algorithm proceeds until the volume of the current ellipsoid becomes smaller than the volume of a ball of radius r, and then outputs NO if the center of the current ellipsoid does not lie in K. This variant can be seen as an approximate nonemptiness check:

1. If K ≠ ∅ and vol(K) is "large enough" (determined by r) then the algorithm outputs YES,
2. If K = ∅ then the algorithm outputs NO,
3. If K ≠ ∅, but vol(K) is small (determined by r), then the algorithm can answer either YES or NO.

The uncertainty introduced when K has a small volume is typically not a problem, as we can often mathematically force K to either be empty or have a large volume when running the ellipsoid algorithm.
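Putting Theorem 13.21 into code, a sketch (reusing the ellipsoid_update routine sketched in Chapter 12; the oracle convention is ours: separation_oracle(x) returns None when x is in K, and otherwise a vector h with K ⊆ {z : ⟨z, h⟩ ≤ ⟨x, h⟩}):

    import numpy as np

    def ellipsoid_feasibility(separation_oracle, R, r, n):
        # Shrink ellipsoids until either the center lands in K (output YES) or
        # vol(E) < vol(B(0, r)) (output NO).  Since vol(E) = det(M)^(1/2) * V_n,
        # the stopping rule compares (1/2) log det(M) against n log r.
        x, M = np.zeros(n), R**2 * np.eye(n)
        while 0.5 * np.linalg.slogdet(M)[1] >= n * np.log(r):
            h = separation_oracle(x)
            if h is None:
                return x                        # YES: x is a point of K
            x, M = ellipsoid_update(x, M, h)    # Lemma 12.12 update
        return None                             # NO: K contains no r-ball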

13.4.2 Algorithm for convex optimization


We now build upon the ellipsoid algorithm for checking nonemptiness of a convex set K to derive the ellipsoid method for convex optimization. To find the minimum of a convex function f : Rn → R over K, the idea is to use the binary search method to find the optimal value. The subproblem solved at every step is then simply checking nonemptiness of

Kg := K ∩ {x : f(x) ≤ g},

Algorithm 12: Ellipsoid algorithm for convex optimization


Input:
• A separation oracle to a convex set K ⊆ Rn
• A first-order oracle to a function f : Rn → R
• An R > 0 with the promise that K ⊆ B(0, R)
• An r > 0 with the promise that a ball of radius r is in K
• Numbers l0 , u0 with the promise that
l0 ≤ min_{y∈K} f(y) ≤ max_{y∈K} f(y) ≤ u0

• A parameter ε > 0
Output: An x̂ ∈ K such that f(x̂) ≤ min_{x∈K} f(x) + ε
Algorithm:
1: Let l := l0 and u := u0
2: Let r0 := r·ε / (2(u0 − l0))
3: while u − l > ε/2 do
4: Set g := (l + u)/2
5: Define
Kg := K ∩ {x : f(x) ≤ g}

6: Run the ellipsoid algorithm from Theorem 13.21 on Kg with R, r0


7: if the ellipsoid algorithm outputs YES then
8: Set u := g
9: Let x̂ ∈ Kg be the point returned by the ellipsoid algorithm
10: else
11: Set l := g
12: end if
13: end while
14: return x̂

for some value g. This can be done as long as a separation oracle for Kg is
available. One also needs to be careful about the uncertainty introduced in the
nonemptiness check, as it might potentially cause errors in the binary search
procedure. For this, it is assumed that K contains a ball of radius r and, at every
step of the binary search, the ellipsoid algorithm is called with an appropriate

choice of the inner ball parameter. Details are provided in Algorithm 12, where
it is assumed that all values of f over K lie in the interval [l0 , u0 ].
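In Python, the binary search of Algorithm 12 might look as follows. This is only a sketch: ellipsoid_feasibility is the hypothetical routine sketched in Section 13.4.1, and make_sep_Kg, a separation oracle for Kg constructed after Lemma 13.23 below, is assumed.

def ellipsoid_minimize(sep_K, f_oracle, n, R, r, l0, u0, eps):
    # Binary search over the target value g, as in Algorithm 12.
    r0 = r * eps / (2 * (u0 - l0))   # inner-ball parameter from step 2
    l, u, x_hat = l0, u0, None
    while u - l > eps / 2:
        g = (l + u) / 2
        sep_Kg = make_sep_Kg(sep_K, f_oracle, g)
        feasible, x = ellipsoid_feasibility(sep_Kg, n, R, r0)
        if feasible:
            u, x_hat = g, x          # K_g nonempty: lower the upper bound
        else:
            l = g                    # K_g (essentially) empty: raise the lower bound
    return x_hat                     # satisfies f(x_hat) <= min of f over K, plus eps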
At this point it is not clear that Algorithm 12 is correct. Indeed, we need to verify that the choice of r0 guarantees that the binary search gives correct answers most of the time. Further, it is not even clear how one would implement such an algorithm, as we have not yet said how f is given. It turns out that access to values and gradients of f (zeroth- and first-order oracles for f) suffices in this setting. We call this first-order access to f as well and recall its definition.

Definition 13.22 (First-order oracle) A first-order oracle for a function f :


Rn → R is a primitive that, given x ∈ Qn, outputs the value f(x) ∈ Q and a
vector h(x) ∈ Qn such that

∀z ∈ Rn,  f(z) ≥ f(x) + ⟨z − x, h(x)⟩.

In particular, if f is differentiable then h(x) = ∇ f (x), but more generally h(x)


can be just any subgradient of f at x. We remark that typically one cannot
hope to get an exact first-order oracle, but rather an approximate one and, for
simplicity, we work with exact oracles. An extension to approximate oracles is
possible but requires introducing the so-called weak separation oracles which,
in turn, creates new technical difficulties.
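As a simple illustration, here is what a first-order oracle can look like for the nondifferentiable convex function f(x) = ‖x‖₁ (a sketch; the name f_oracle_l1 is ours). The coordinatewise sign vector is a valid subgradient at every point, since ⟨sign(x), z⟩ ≤ ‖z‖₁ for all z and ⟨sign(x), x⟩ = ‖x‖₁.

import numpy as np

def f_oracle_l1(x):
    # First-order oracle for f(x) = ||x||_1: returns f(x) and a subgradient h(x).
    return np.sum(np.abs(x)), np.sign(x)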

13.4.3 Proof of Theorem 13.1


There are two main components: the first is to show that, given a separation
oracle for K and a first-order oracle for f , we can obtain a separation oracle for

Kg := K ∩ {x : f(x) ≤ g},

and the second component is to show that by using the ellipsoid algorithm to
test the nonemptiness of Kg with the parameter r0 specified in the algorithm,
we are guaranteed to obtain a correct answer up to the required precision. We
discuss these two components in separate steps and subsequently conclude the
result.

Constructing a separation oracle for Kg. In the lemma below, we show that


a separation oracle for the sublevel set

Sg := {x ∈ Rn : f(x) ≤ g}

can be constructed using a first-order oracle for f .



Lemma 13.23 (Separation over sublevel sets using a first-order oracle)


Given a first-order oracle to a convex function f : Rn → R, for any g ∈ Q,
there is a separation oracle for
Sg := {x ∈ Rn : f(x) ≤ g}.
The running time of this separation oracle is polynomial in the bit size of the
input and the time taken by the first-order oracle.
Proof The construction of the oracle is rather simple: given x ∈ Rn we first
use the first-order oracle to obtain f (x). If f (x) ≤ g then the separation oracle
outputs YES. Otherwise, let u be a subgradient of f at x (obtained using the
first-order oracle of f). Then, using the subgradient property, we obtain

∀z ∈ Rn,  f(z) ≥ f(x) + ⟨z − x, u⟩.

As f(x) > g, for every z ∈ Sg, we have

g + ⟨z − x, u⟩ < f(x) + ⟨z − x, u⟩ ≤ f(z) ≤ g.

In other words,

⟨z, u⟩ < ⟨x, u⟩.
Hence, u provides us with a separating hyperplane. The running time of such
an oracle is clearly polynomial in the bit size of the input and the time taken
by the first-order oracle.
By noting that Kg = K ∩ Sg, we obtain that a separation oracle for Kg can be constructed using the respective oracles for K and f.
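In code, the construction in the proof of Lemma 13.23 is a few lines (a sketch; make_sep_Kg is our name, and sep_K, f_oracle are the assumed oracles for K and f):

def make_sep_Kg(sep_K, f_oracle, g):
    # Separation oracle for K_g = K intersected with {x : f(x) <= g}.
    def sep_Kg(x):
        inside, h = sep_K(x)
        if not inside:
            return False, h          # hyperplane separating x from K
        fx, u = f_oracle(x)
        if fx <= g:
            return True, None        # x in K and f(x) <= g, so x in K_g
        return False, u              # <z, u> < <x, u> for every z with f(z) <= g
    return sep_Kg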

Bounds on the volume of sublevel sets. In this step, we give a lower bound on the radius of the largest ball contained in Kg, depending on how close g is to the minimum f⋆ of f on K. This is required to claim that the ellipsoid algorithm correctly determines whether Kg ≠ ∅ in various steps of the binary search.
Lemma 13.24 (Lower bound on volume of sublevel sets) Let K ⊆ Rn be a convex set containing a Euclidean ball of radius r > 0 and f : K → [f⋆, fmax] be a convex function on K with

f⋆ := min_{x∈K} f(x).

For any

g := f⋆ + δ

(for δ > 0), define

Kg := {x ∈ K : f(x) ≤ g}.

Then, Kg contains a Euclidean ball of radius

r · δ / (fmax − f⋆).
Proof Let x⋆ ∈ K be any minimizer of f over K. We can assume without loss of generality that x⋆ = 0, so that f(0) = f⋆. Define

η := δ / (fmax − f⋆).

We claim that

ηK ⊆ Kg,   (13.9)

where ηK := {ηx : x ∈ K}. Since the ball of radius r contained in K becomes a ball of radius ηr in ηK, the claim implies the lemma. Thus, we focus on proving Equation (13.9). For x ∈ ηK, we would like to show that f(x) ≤ g. We have x/η ∈ K and, hence, writing x = (1 − η) · 0 + η · (x/η), from convexity of f we have

f(x) ≤ (1 − η) f(0) + η f(x/η)
     ≤ (1 − η) f⋆ + η fmax
     = f⋆ + η(fmax − f⋆)
     = f⋆ + δ
     = g.

Hence, x ∈ Kg, and the claim follows. □

Proof of Theorem 13.1. Given Lemmas 13.23 and 13.24, we are well equipped
to prove Theorem 13.1.
Proof of Theorem 13.1. Let f⋆ = f(x⋆). First, observe that whenever

g ≥ f⋆ + ε/2,

then Kg contains a Euclidean ball of radius

r0 := r·ε / (2(u0 − l0)).

Hence, for such a g, the ellipsoid algorithm terminates with YES and outputs a point x̂ ∈ Kg. This is a direct consequence of Lemma 13.24.
Let l and u be the values of these variables at the moment of termination of the algorithm. From the above reasoning it follows that

u ≤ f⋆ + ε.

Indeed,

u ≤ l + ε/2

and the ellipsoid algorithm can answer NO only for

g ≤ f⋆ + ε/2.

Hence,

l ≤ f⋆ + ε/2.

Therefore, u ≤ f⋆ + ε and the x̂ output by the algorithm belongs to Ku. Hence,

f(x̂) ≤ f⋆ + ε.

This proves correctness of the algorithm. It remains to analyze the running time. The binary search makes log((u0 − l0)/(ε/2)) calls to the ellipsoid algorithm, and every such call takes

O((n² + TKg) · n² · log(R/r0)) = O((n² + TK + Tf) · n² · log((R/r) · ((u0 − l0)/ε)))

time, where we used Lemma 13.23 to conclude that TKg ≤ TK + Tf. □

Remark 13.25 (Avoiding binary search) By taking a closer look at the al-
gorithm and the proof of Theorem 13.1, one can observe that one does not need
to restart the ellipsoid algorithm at every iteration of the binary search. Indeed,
one can just reuse the ellipsoid obtained in the previous call. This leads to an
algorithm with a slightly reduced running time
O((n² + TK + Tf) · n² · log((R/r) · ((u0 − l0)/ε))).
Note that there is no square in the logarithmic factor.

13.5 Variants of the cutting plane method


In this section we present the high-level details of some variants of the cutting
plane method that can be used in place of the ellipsoid method to improve
results in this and the previous chapter. Recall that the cutting plane method
is, roughly, used to solve the following problem: given a separation oracle to a
convex set K ⊆ Rn (along with a ball B(0, R) containing it), either find a point
x ∈ K or determine that K does not contain a ball of radius r > 0 where r is
also provided as input.

To solve this problem, a cutting plane method maintains a set Et ⊇ K and


shrinks it in every step, so that
E0 ⊇ E1 ⊇ E2 ⊇ · · · ⊇ K.
As a measure of progress, one typically uses the volume of Et and, thus, ideally
one wants to decrease the volume of Et at every step
vol(Et+1 ) < αvol(Et ),
where 0 < α < 1 is the volume drop parameter. The ellipsoid algorithm achieves α ≈ 1 − 1/(2n) and, hence, it requires roughly O((n/log α⁻¹) · log(R/r)) = O(n² log(R/r)) iterations to terminate. It is known that this volume drop rate is tight when using ellipsoids; however, one can ask whether it can be improved to, say, a constant α < 1 when using a different kind of sets Et to approximate the convex body K.

13.5.1 Maintaining a polytope instead of an ellipsoid


Recall that in the cutting plane method the set Et+1 is picked so as to satisfy
Et ∩ H ⊆ Et+1 ,
where

H := {x : ⟨x, ht⟩ ≤ ⟨xt, ht⟩}
is the halfspace passing through the point xt ∈ Et determined by the separat-
ing hyperplane output by the separation oracle for K. In the ellipsoid method
Et+1 is chosen as the minimum volume ellipsoid containing Et ∩ H. One could perhaps argue that this choice of Et+1 is not efficient: as we already know that K ⊆ Et ∩ H, the choice Et+1 := Et ∩ H would be much more reasonable. By following this strategy, starting with E0 lying in a box [−R, R]n, we produce a sequence of polytopes (instead of ellipsoids) containing K. The crucial question that arises is how to pick a point xt ∈ Et so that, no matter what halfspace H through xt is output by the separation oracle, we can still guarantee that Et+1 := Et ∩ H has a significantly smaller volume than Et. Note that when we pick a point xt close to the boundary of Et, H might cut out only a small piece of Et and, hence, vol(Et) ≈ vol(Et+1), meaning that no progress is made in such a step. For this reason, it seems reasonable to pick a point xt that is near the center of the polytope.
It is an exercise to prove that if xt is chosen to be the centroid of the polytope (the centroid c ∈ Rn of a measurable, bounded set K ⊆ Rn is defined to be c := ∫_K x dλn(x), i.e., the mean of the uniform distribution over K) then, no matter what halfspace H through xt is chosen, it holds that

vol(Et ∩ H) ≤ (1 − 1/e) · vol(Et).
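This centroid cut bound can be checked empirically. The following Python sketch (the test body and all names are our choices) samples points uniformly from a triangle and estimates, over many random directions, the smallest fraction of area on either side of a halfspace through the centroid; for a triangle the true worst case is (2/3)² = 4/9 ≈ 0.44, above the guaranteed 1/e ≈ 0.37.

import numpy as np

rng = np.random.default_rng(0)

# Uniform samples from the triangle with vertices (0,0), (1,0), (0,1):
# sample the unit square and fold the upper half onto the lower triangle.
p = rng.random((200_000, 2))
p = np.where((p.sum(axis=1) <= 1)[:, None], p, 1 - p)
centroid = np.array([1.0, 1.0]) / 3

worst = 1.0
for _ in range(1000):
    h = rng.normal(size=2)                     # random direction through the centroid
    frac = np.mean((p - centroid) @ h <= 0)    # fraction on one side of the halfspace
    worst = min(worst, frac, 1 - frac)

print(worst, 1 / np.e)   # about 0.44 vs the guaranteed lower bound 1/e ~ 0.37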
In particular, we obtain a method where α = 1 − 1/e ≈ 0.63 is a constant. While this sounds promising, this method has a significant drawback: the centroid is not efficient to compute and, even for polytopes, there are no known fast algorithms to find it. Given this difficulty, there have been many attempts to define alternative notions of a center of a polytope that are both easy to compute and yield a constant α < 1. We now give brief overviews of two such approaches.

13.5.2 The volumetric center method


The idea is to use the volumetric center of Et as xt. To define it, suppose that Et is defined by the system of inequalities ⟨ai, x⟩ ≤ bi for i = 1, 2, . . . , m. Let F : Et → R be the logarithmic barrier

F(x) = − Σ_{i=1}^m log(bi − aiᵀx),

and similarly let V : Et → R be the volumetric barrier

V(x) = (1/2) log det ∇²F(x)

introduced in Definition 11.16 in Chapter 11. Then, the volumetric center xt of Et is defined as

xt := argmin_{x∈Et} V(x).

The intuition for using the volumetric center xt as the queried point at step t is that xt is the point around which the Dikin ellipsoid has the largest volume. As the Dikin ellipsoid is a decent approximation of the polytope, one should expect that a hyperplane through the center of this ellipsoid divides the polytope into two roughly equal pieces, and hence we should expect α to be small. What can be proved is that, indeed, on average (over a large number of iterations),

vol(Et+1) / vol(Et) ≤ 1 − 10⁻⁶,
and, hence, α is indeed a constant (slightly) smaller than 1. Further, the volu-
metric center xt of Et does not have to be recomputed from scratch every single

time. In fact, since Et+1 has only one constraint more than Et , we can use New-
ton’s method to compute xt+1 with xt as the starting point. This “recentering”
step can be implemented in O(nω ), or matrix multiplication, time. In fact, to
achieve this running time, the number of facets in Et needs to be kept small
throughout the iterations, hence occasionally the algorithm has to drop certain
constraints (we omit the details).
To summarize, this method, based on the volumetric center, attains the optimal, constant rate of volume drop and, hence, requires roughly O(n log(R/r) log(1/ε)) iterations to terminate. However, the update time per iteration, n^ω ≈ n^{2.373}, is slower than the update time for the ellipsoid method, which is O(n²) (as a rank one update to an n × n matrix). See also the method based on the “analytic center” in Exercise 12.6.
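As an illustration, both F and V are straightforward to evaluate from the constraint data (A, b), since ∇²F(x) = Aᵀ Sx⁻² A, where Sx := Diag(b − Ax) is the diagonal matrix of slacks. A Python sketch (names are ours):

import numpy as np

def volumetric_barrier(A, b, x):
    # V(x) = (1/2) log det of the Hessian of F(x) = -sum_i log(b_i - <a_i, x>).
    s = b - A @ x                      # slacks; x must satisfy s > 0
    H = A.T @ (A / s[:, None] ** 2)    # Hessian of F: A^T S_x^{-2} A
    sign, logdet = np.linalg.slogdet(H)
    return 0.5 * logdet

The volumetric center xt is then the minimizer of V over the interior of Et, which can be computed, e.g., by Newton's method as in Chapter 9.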

13.5.3 The hybrid center method


Given the cutting plane method in the previous section, one might wonder
whether it is possible to achieve the optimal, constant volume drop rate and at
the same time keep the running time per iteration as low as that of the ellipsoid method, ≈ n². Lee, Sidford, and Wong showed that such an improvement is indeed possible. To obtain this improvement, consider the following barrier function over the polytope Et:

G(x) := − Σ_{i=1}^m wi log si(x) + (1/2) log det(Aᵀ Sx⁻² A + λI) + (λ/2) ‖x‖₂²,   (13.10)

where si(x) := bi − ⟨ai, x⟩ are the slacks, Sx := Diag(s1(x), . . . , sm(x)), w1, w2, . . . , wm are positive weights, and λ > 0 is a parameter. Similar to Vaidya's approach, the query point is determined as the minimizer of the barrier G(x), the hybrid center

xt := argmin_{x∈Et} G(x).

The use of the regularized weighted logarithmic barrier (13.10) is inspired by the Lee-Sidford barrier function used to obtain an improved path following interior point method, mentioned in Section 11.5.2. It can be proved that maintaining xt and the set of weights wt such that xt is the minimizer of G(x) over x ∈ Et is possible in time O(n²) per iteration on average. Moreover, such a choice of xt guarantees a constant volume drop, on average, with α ≈ 1 − 10⁻²⁷.
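For concreteness, given the data (A, b), weights w, and regularization λ, evaluating the barrier in (13.10) is again only a few lines (a sketch under the slack definitions above; efficiently maintaining its minimizer xt across iterations is the technical heart of the method and is not attempted here):

import numpy as np

def hybrid_barrier(A, b, w, lam, x):
    # G(x) of (13.10): weighted log barrier + regularized log det + quadratic term.
    s = b - A @ x                                  # slacks; require s > 0
    H = A.T @ (A / s[:, None] ** 2)                # A^T S_x^{-2} A
    sign, logdet = np.linalg.slogdet(H + lam * np.eye(A.shape[1]))
    return -np.sum(w * np.log(s)) + 0.5 * logdet + 0.5 * lam * (x @ x)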

13.6 Exercises
13.1 Prove that any two bases of a matroid have the same cardinality.
13.2 Given disjoint sets E1 , . . . , E s ⊆ [n] and nonnegative integers k1 , . . . , k s ,
consider the following family of sets

F := {S ⊆ [n] : |S ∩ Ei | ≤ ki for all i = 1, . . . , s}.

Prove that F is a matroid.


13.3 In this exercise we give two alternate characterizations of a submodular
function.
(a) Prove that a function F : 2[n] → R is submodular if and only if for
all T ⊆ [n] and x, y ∈ [n],

F(T ∪ {x, y}) − F(T ∪ {x}) ≤ F(T ∪ {y}) − F(T ).

(b) Prove that a function F : 2[n] → R is submodular if and only if for


all T ⊆ S ⊆ [n] and x ∈ [n],

F(S ∪ {x}) − F(S ) ≤ F(T ∪ {x}) − F(T ).

13.4 Suppose that F ⊆ 2[n] is a matroid.


(a) Prove that its rank function rF is submodular.
(b) Prove that the following greedy algorithm can be used to compute
the rank rF ([n]) :
initialize S := ∅ and iterate over all elements i ∈ [n] adding i to
S whenever S ∪ {i} ∈ F . Output the cardinality of S at the end as
the rank of F .
(c) Extend this greedy algorithm to compute rF (T ) for any set T ⊆
[n].
13.5 Prove that for the graphic matroid F corresponding to a graph G =
(V, E), the rank function is

rF (S ) = n − κ(V, S )

where n := |V| and κ(V, S ) denotes the number of connected components


of the graph with edge set S and vertex set V.
13.6 Given a membership oracle to a matroid F ⊆ 2[n] and a cost vector
c ∈ Zn, give a polynomial time algorithm to find a T ∈ F that maximizes Σ_{i∈T} ci.

13.7 Prove that the following functions are submodular.



(a) Given a matrix A ∈ Rm×n, the rank function r : 2[n] → R defined as

r(S) := rank(AS).

Here, for S ⊆ [n] we denote by AS the matrix A restricted to the columns in the set S.
(b) Given a graph G = (V, E), the cut function f : 2V → R defined as

f(S) := |{ij ∈ E : i ∈ S, j ∉ S}|.

13.8 Prove that for any function F : 2[n] → R (not necessarily submodular) and for its Lovász extension f the following hold:

min_{S⊆[n]} F(S) = min_{x∈[0,1]n} f(x)   and   max_{S⊆[n]} F(S) = max_{x∈[0,1]n} f(x).

13.9 Prove Theorem 13.11.


13.10 For F : 2[n] → R, define its convex closure f⁻ : [0, 1]n → R to be

f⁻(x) = min{ Σ_{S⊆[n]} αS F(S) : Σ_{S⊆[n]} αS 1S = x, Σ_{S⊆[n]} αS = 1, α ≥ 0 }.


(a) Prove that f⁻ is a convex function.
(b) Prove that if F is submodular, then f − coincides with the Lovász
extension of F.
(c) Prove that if f − coincides with the Lovász extension of F, then F
is submodular.
13.11 Prove that for any finite set Ω, and any probability distribution p on it,

Σ_{ω∈Ω} pω log(1/pω) ≤ log |Ω|.

13.12 For a finite set Ω, a collection of vectors {vω }ω∈Ω ⊆ Rn , and a point θ ∈
Rn , prove that the convex program considered in (13.3) is the Lagrangian
dual of that considered in (13.4). For the above, you may assume that the
polytope P := conv{vω : ω ∈ Ω} is a full-dimensional subset of Rn and θ
belongs to the interior of P. Prove that, under these assumptions, strong
duality holds.
13.13 Prove the claim in Equation (13.6).
13.14 Let G = (V, E) be an undirected graph and let Ω be a subset of 2E .
Consider the objective of the dual to the max entropy program in (13.4):

f(y) := log Σ_{S∈Ω} e^{⟨y, 1S − θ⟩}.

(a) Prove that, if Ω = TG , the set of spanning trees of G, then f (y)


and its gradient ∇ f (y) can be evaluated in time polynomial in |E|
and kyk. For this, first show that if B ∈ RV×E is the vertex-edge
incidence matrix of G and {be }e∈E ∈ RV are its columns, then for
any vector x ∈ RE it holds that
 
 1 > X  XY
det  2 11 +
 xe be be  =
>
xe ,
n e∈E S ∈TG e∈S
V
where 1 ∈ R is the all ones vector.
(b) Prove that, given a polynomial time oracle to evaluate f , one can
compute |Ω| in polynomial time.
13.15 Suppose that p ∈ Z[x1 , . . . , xn ] is a multiaffine polynomial (the degree
of any variable in p is at most 1) with nonnegative coefficients. Suppose
that L is an upper bound on the coefficients of p. Prove that, given an
evaluation oracle for p and a separation oracle for K ⊆ Rn one can
compute a multiplicative (1 + ε) approximation to
max_{θ∈K} min_{x>0} p(x) / Π_{i=1}^n xi^{θi}

in time poly(log ε⁻¹, n, log L). You may assume that every point x ∈ K satisfies B(x, η) ⊆ P, where P is the convex hull of the support of p, and your algorithm might depend polynomially on 1/η. Here, for a multiaffine polynomial p(x) = Σ_{S⊆[n]} cS Π_{i∈S} xi, the support of p is the family of all sets S such that cS ≠ 0.
13.16 In this problem, we derive an algorithm for the following problem: given a matrix A ∈ Zn×n with positive entries, find positive vectors x, y ∈ Rn such that

XAY is a doubly stochastic matrix,
where X = Diag (x), Y = Diag (y) and a matrix W is called doubly
stochastic if its entries are nonnegative and all its rows and columns sum
up to 1. We denote the set of all n × n doubly stochastic matrices by Ωn ,
i.e.,
Ωn := { W ∈ [0, 1]n×n : ∀ i ∈ [n] Σ_{j=1}^n Wi,j = 1, ∀ j ∈ [n] Σ_{i=1}^n Wi,j = 1 }.
Consider the following pair of optimization problems:

max_{W∈Ωn} Σ_{1≤i,j≤n} Wi,j log(Ai,j / Wi,j).   (13.11)
min_{z∈Rn} Σ_{i=1}^n log( Σ_{j=1}^n e^{zj} Ai,j ) − Σ_{j=1}^n zj.   (13.12)

(a) Prove that both problems in Equations (13.11) and (13.12) are
convex programs.
(b) Prove that the problem in Equation (13.12) is the Lagrangian dual
of that in Equation (13.11) and strong duality holds.
(c) Suppose that z⋆ is an optimal solution to (13.12). Prove that, if y ∈ Rn is such that yi := e^{zi⋆} for all i = 1, 2, . . . , n, then there exists x > 0 such that XAY ∈ Ωn.
As an optimal solution to the convex program (13.12) gives us the re-
quired solution, we apply the ellipsoid algorithm to solve it. We denote
by f (z) the objective of (13.12). For simplicity, assume that the entries
of A are integral.
(d) Design an algorithm that, given z ∈ Qn, computes an ε-approximation to the gradient ∇f(z) in time polynomial in the bit complexity of A, the bit complexity of z, log ε⁻¹, and the norm ‖z‖∞.
(e) Prove that f(z) has an optimal solution z⋆ satisfying

‖z⋆‖ ≤ O(ln(Mn)),

assuming that Ai,j ∈ [1, M] for every 1 ≤ i, j ≤ n, using the following steps:
(f) Prove that for any optimal solution z⋆ and any i, j ∈ [n],

1/M ≤ e^{zi⋆ − zj⋆} ≤ M.
(g) Prove that for any c ∈ R and z ∈ Rn ,
f (z) = f (z + c · 1n ),
where 1n = (1, 1, . . . , 1) ∈ Rn .
(h) Give a lower bound l and an upper bound u on the value of f(z) over z ∈ B(0, R) for R := O(ln(Mn)).
By applying the ellipsoid algorithm we obtain the following theorem.
Theorem 13.26 There is an algorithm that, given an integer matrix A ∈ Zn×n with positive entries in the range {1, 2, . . . , M} and an ε > 0, returns an ε-approximate doubly stochastic scaling (x, y) of A in time polynomial in n, log M and log ε⁻¹.

Notes
For a thorough treatment on matroid theory, including a proof of Theorem
13.4, we refer the reader to the book by Schrijver (2002a). A proof of Theorem
13.4 based on the method of iterative rounding appears in the text by Lau et al.
(2011). Theorems 13.1 and 13.8 were first proved in the paper by Grötschel
et al. (1981). The Lovász extension and its properties (such as convexity) were
established by Lovász (1983).
The maximum entropy principle has its origins in the works of Gibbs (1902)
and Jaynes (1957a,b). It is used to learn probability distributions from data;
see the work by Dudik (2007) and Celis et al. (2020). Theorem 13.18 is a
special case of a theorem proved in Singh and Vishnoi (2014). Using a differ-
ent argument one can obtain a variant of Lemma 13.19 that avoids the inte-
riority assumption; see the paper by Straszak and Vishnoi (2019). Maximum
entropy-based algorithms have also been used to design very general approx-
imate counting algorithms for discrete problems in the papers by Anari and
Oveis Gharan (2017) and Straszak and Vishnoi (2017). Recently, the maxi-
mum entropy framework has been generalized to continuous manifolds; see
the paper by Leake and Vishnoi (2020).
The proof of Theorem 13.1 can be adapted to work with approximate or-
acles; we refer to Grötschel et al. (1988) for a thorough discussion on weak
oracles and optimization with them.
The volumetric center-based method is due to Vaidya (1989b). The hybrid
barrier presented in Section 13.5.3 is from the paper by Lee et al. (2015). Using
their cutting plane method, which is based on this barrier function, Lee et al.
(2015) derive a host of new algorithms for combinatorial problems. In par-
ticular, they give the asymptotically fastest known algorithms for submodular
function minimization and matroid intersection.
Exercise 13.15 is adapted from the paper by Straszak and Vishnoi (2019).
The objective function in Exercise 13.15 can be shown to be geodesically con-
vex; see the survey by Vishnoi (2018) for a discussion on geodesic convexity
and related problems.
Bibliography

Allen-Zhu, Zeyuan, and Orecchia, Lorenzo. 2017. Linear Coupling: An Ultimate Uni-
fication of Gradient and Mirror Descent. Pages 3:1–3:22 of: 8th Innovations
in Theoretical Computer Science Conference, ITCS 2017, January 9-11, 2017,
Berkeley, CA, USA. LIPIcs, vol. 67.
Anari, Nima, and Oveis Gharan, Shayan. 2017. A Generalization of Permanent In-
equalities and Applications in Counting and Optimization. Pages 384–396 of:
Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Comput-
ing. STOC 2017.
Apostol, Tom M. 1967a. Calculus: One-variable calculus, with an introduction to lin-
ear algebra. Blaisdell book in pure and applied mathematics. Blaisdell Publishing
Company.
Apostol, Tom M. 1967b. Calculus, Vol. 2: Multi-Variable Calculus and Linear Algebra
with Applications to Differential Equations and Probability. New York: J. Wiley.
Arora, Sanjeev, and Barak, Boaz. 2009. Computational complexity: a modern ap-
proach. Cambridge University Press.
Arora, Sanjeev, and Kale, Satyen. 2016. A Combinatorial, Primal-Dual Approach to
Semidefinite Programs. J. ACM, 63(2).
Arora, Sanjeev, Hazan, Elad, and Kale, Satyen. 2005. Fast Algorithms for Approximate
Semidefinite Programming Using the Multiplicative Weights Update Method.
Pages 339–348 of: Proceedings of the 46th Annual IEEE Symposium on Founda-
tions of Computer Science. FOCS ’05. USA: IEEE Computer Society.
Arora, Sanjeev, Hazan, Elad, and Kale, Satyen. 2012. The Multiplicative Weights Up-
date Method: a Meta-Algorithm and Applications. Theory of Computing, 8(6),
121–164.
Barak, Boaz, Hardt, Moritz, and Kale, Satyen. 2009. The uniform hardcore lemma
via approximate Bregman projections. Pages 1193–1200 of: Proceedings of the
Twentieth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2009,
New York, NY, USA, January 4-6, 2009. SIAM.
Barvinok, Alexander. 2002. A course in convexity. American Mathematical Society.
Beck, Amir, and Teboulle, Marc. 2003. Mirror descent and nonlinear projected subgra-
dient methods for convex optimization. Oper. Res. Lett., 31(3), 167–175.


Beck, Amir, and Teboulle, Marc. 2009. A fast iterative shrinkage-thresholding algo-
rithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1), 183–
202.
Boyd, Stephen, and Vandenberghe, Lieven. 2004. Convex optimization. Cambridge
University Press.
Bubeck, Sébastien. 2015. Convex Optimization: Algorithms and Complexity. Found.
Trends Mach. Learn., 8(3–4), 231–357.
Bubeck, Sébastien, and Eldan, Ronen. 2015. The entropic barrier: a simple and optimal
universal self-concordant barrier. Page 279 of: Proceedings of The 28th Confer-
ence on Learning Theory, COLT 2015, Paris, France, July 3-6, 2015.
Celis, L. Elisa, Keswani, Vijay, and Vishnoi, Nisheeth K. 2020. Data preprocessing
to mitigate bias: A maximum entropy based approach. In: Proceedings of the
37th International Conference on International Conference on Machine Learning.
ICML’20. JMLR.org.
Christiano, Paul, Kelner, Jonathan A., Madry, Aleksander, Spielman, Daniel A., and
Teng, Shang-Hua. 2011. Electrical flows, Laplacian systems, and faster approx-
imation of maximum flow in undirected graphs. Pages 273–282 of: Proceedings
of the 43rd ACM Symposium on Theory of Computing, STOC 2011, San Jose, CA,
USA, 6-8 June 2011.
Cohen, Michael B., Madry, Aleksander, Tsipras, Dimitris, and Vladu, Adrian. 2017.
Matrix Scaling and Balancing via Box Constrained Newton’s Method and Inte-
rior Point Methods. Pages 902–913 of: 58th IEEE Annual Symposium on Foun-
dations of Computer Science, FOCS 2017, Berkeley, CA, USA, October 15-17,
2017. IEEE Computer Society.
Cormen, Thomas H., Leiserson, Charles E., Rivest, Ronald L., and Stein, Clifford.
2001. Introduction to Algorithms. The MIT Press.
Daitch, Samuel I., and Spielman, Daniel A. 2008. Faster approximate lossy generalized
flow via interior point algorithms. Pages 451–460 of: Proceedings of the 40th
Annual ACM Symposium on Theory of Computing, Victoria, British Columbia,
Canada, May 17-20, 2008.
Dantzig, George B. 1990. Origins of the Simplex Method. New York, NY, USA: Asso-
ciation for Computing Machinery. Pages 141–151.
Dasgupta, Sanjoy, Papadimitriou, Christos H., and Vazirani, Umesh. 2006. Algorithms.
1 edn. USA: McGraw-Hill, Inc.
Diestel, Reinhard. 2012. Graph Theory, 4th Edition. Graduate texts in mathematics,
vol. 173. Springer.
Dinic, E. A. 1970. Algorithm for solution of a problem of maximal flow in a network
with power estimation. Soviet Math Dokl, 224(11), 1277–1280.
Dudik, Miroslav. 2007. Maximum entropy density estimation and modeling geographic
distributions of species.
Edmonds, Jack. 1965a. Maximum Matching and a Polyhedron with 0, 1 Vertices. J. of
Res. the Nat. Bureau of Standards, 69, 125–130.
Edmonds, Jack. 1965b. Paths, trees, and flowers. Canadian Journal of mathematics,
17(3), 449–467.
Edmonds, Jack, and Karp, Richard M. 1972. Theoretical Improvements in Algorithmic
Efficiency for Network Flow Problems. J. ACM, 19(2), 248264.

Farkas, Julius. 1902. Theorie der einfachen Ungleichungen. Journal für die reine und
angewandte Mathematik, 124, 1–27.
Ford, L.R., and Fulkerson, D.R. 1956. Maximal flow in a network. Canadian J. Math.,
8, 399–404.
Galántai, A. 2000. The theory of Newton's method. Journal of Computational and
Applied Mathematics, 124(1), 25–44. Numerical Analysis 2000. Vol. IV: Opti-
mization and Nonlinear Equations.
Garg, Naveen, and Könemann, Jochen. 2007. Faster and Simpler Algorithms for Mul-
ticommodity Flow and Other Fractional Packing problems. SIAM J. Comput.,
37(2), 630–652.
Gärtner, Bernd, and Matousek, Jirı́. 2014. Approximation Algorithms and Semidefinite
Programming. Springer Publishing Company, Incorporated.
Gibbs, Josiah Willard. 1902. Elementary principles in statistical mechanics: developed
with especial reference to the rational foundation of thermodynamics. C. Scrib-
ner’s sons.
Goldberg, Andrew V., and Rao, Satish. 1998. Beyond the Flow Decomposition Barrier.
J. ACM, 45(5), 783–797.
Goldberg, Andrew V., and Tarjan, Robert Endre. 1987. Solving Minimum-Cost Flow
Problems by Successive Approximation. Pages 7–18 of: Proceedings of the 19th
Annual ACM Symposium on Theory of Computing, 1987, New York, New York,
USA.
Golub, G.H., and Van Loan, C.F. 1996. Matrix computations. Johns Hopkins Univ
Press.
Gonzaga, Clovis C. 1989. An Algorithm for Solving Linear Programming Problems in
O(n³L) Operations. New York, NY: Springer New York. Pages 1–28.
Grötschel, M., Lovász, L., and Schrijver, A. 1981. The ellipsoid method and its conse-
quences in combinatorial optimization. Combinatorica, 1(2), 169–197.
Grötschel, Martin, Lovász, Lászlo, and Schrijver, Alexander. 1988. Geometric Algo-
rithms and Combinatorial Optimization. Algorithms and Combinatorics, vol. 2.
Springer.
Gurjar, Rohit, Thierauf, Thomas, and Vishnoi, Nisheeth K. 2018. Isolating a Vertex via
Lattices: Polytopes with Totally Unimodular Faces. Pages 74: 1–74: 14 of: 45th
International Colloquium on Automata, Languages, and Programming, ICALP
2018, July 9-13, 2018, Prague, Czech Republic. LIPIcs, vol. 107.
Hazan, Elad. 2016. Introduction to Online Convex Optimization. Found. Trends Optim.,
2(34), 157325.
Hestenes, Magnus R., and Stiefel, Eduard. 1952. Methods of Conjugate Gradients for
Solving Linear Systems. Journal of Research of the National Bureau of Standards,
49(Dec.), 409–436.
Hopcroft, John E., and Karp, Richard M. 1973. An n^{5/2} Algorithm for Maximum
Matchings in Bipartite Graphs. SIAM J. Comput., 2(4), 225–231.
Jaggi, Martin. 2013. Revisiting Frank-Wolfe: Projection-Free Sparse Convex Optimiza-
tion. Pages I-427–I-435 of: Proceedings of the 30th International Conference on In-
ternational Conference on Machine Learning - Volume 28. ICML’13. JMLR.org.
Jaynes, Edwin T. 1957a. Information Theory and Statistical Mechanics. Physical Re-
view, 106(May), 620–630.

Jaynes, Edwin T. 1957b. Information Theory and Statistical Mechanics. II. Physical
Review, 108(Oct.), 171–190.
Karlin, Anna R., Klein, Nathan, and Oveis Gharan, Shayan. 2020. A (Slightly) Im-
proved Approximation Algorithm for Metric TSP. CoRR, abs/2007.01409.
Karmarkar, Narendra. 1984. A new polynomial-time algorithm for linear programming.
Combinatorica, 4(4), 373–395.
Karp, Richard M., and Papadimitriou, Christos H. 1982. On Linear Characterizations
of Combinatorial Optimization Problems. SIAM J. Comput., 11(4), 620–632.
Kelner, Jonathan A., Lee, Yin Tat, Orecchia, Lorenzo, and Sidford, Aaron. 2014.
An Almost-Linear-Time Algorithm for Approximate Max Flow in Undirected
Graphs, and its Multicommodity Generalizations. Pages 217–226 of: Proceed-
ings of the Twenty-Fifth Annual ACM-SIAM Symposium on Discrete Algorithms,
SODA 2014, Portland, Oregon, USA, January 5-7, 2014.
Khachiyan, L. G. 1979. A Polynomial Algorithm for Linear Programming. Doklady
Akademii Nauk SSSR, 244(5), 1093–1096.
Khachiyan, L. G. 1980. Polynomial algorithms in linear programming. USSR Compu-
tational Mathematics and Mathematical Physics, 20(1), 53–72.
Klee, V., and Minty, G. J. 1972. How Good is the Simplex Algorithm? Pages 159–175
of: Inequalities III. Academic Press Inc.
Kleinberg, Jon, and Tardos, Eva. 2005. Algorithm Design. USA: Addison-Wesley
Longman Publishing Co., Inc.
Krantz, S.G. 2014. Convex Analysis. Textbooks in Mathematics. Taylor & Francis.
Lau, Lap-Chi, Ravi, R., and Singh, Mohit. 2011. Iterative Methods in Combinatorial
Optimization. 1st edn. USA: Cambridge University Press.
Leake, Jonathan, and Vishnoi, Nisheeth K. 2020. On the computability of continuous
maximum entropy distributions with applications. Pages 930–943 of: Proccedings
of the 52nd Annual ACM SIGACT Symposium on Theory of Computing, STOC
2020, Chicago, IL, USA, June 22-26, 2020. ACM.
Lee, Yin Tat, and Sidford, Aaron. 2014. Path Finding Methods for Linear Program-
ming: Solving Linear Programs in Õ(√rank) Iterations and Faster Algorithms for
Maximum Flow. Pages 424–433 of: 55th IEEE Annual Symposium on Founda-
tions of Computer Science, FOCS 2014, Philadelphia, PA, USA, October 18-21,
2014. IEEE Computer Society.
Lee, Yin Tat, Rao, Satish, and Srivastava, Nikhil. 2013. A New Approach to Computing
Maximum Flows Using Electrical Flows. Pages 755–764 of: Proceedings of the
Forty-fifth Annual ACM Symposium on Theory of Computing. STOC ’13. New
York, NY, USA: ACM.
Lee, Yin Tat, Sidford, Aaron, and Wong, Sam Chiu-wai. 2015. A Faster Cutting Plane
Method and its Implications for Combinatorial and Convex Optimization. Pages
1049–1065 of: IEEE 56th Annual Symposium on Foundations of Computer Sci-
ence, FOCS 2015, Berkeley, CA, USA, 17-20 October, 2015.
Louis, Anand, and Vempala, Santosh S. 2016. Accelerated Newton Iteration for Roots
of Black Box Polynomials. Pages 732–740 of: IEEE 57th Annual Symposium
on Foundations of Computer Science, FOCS 2016, 9-11 October 2016, Hyatt Re-
gency, New Brunswick, New Jersey, UNITED STATES. IEEE Computer Society.
Lovász, László. 1983. Submodular functions and convexity. Pages 235–257 of: Math-
ematical Programming The State of the Art. Springer.

Madry, Aleksander. 2013. Navigating Central Path with Electrical Flows: From Flows
to Matchings, and Back. Pages 253–262 of: 54th Annual IEEE Symposium on
Foundations of Computer Science, FOCS 2013, 26-29 October, 2013, Berkeley,
CA, USA.
Mulmuley, Ketan, Vazirani, Umesh V., and Vazirani, Vijay V. 1987. Matching is as easy
as matrix inversion. Combinatorica, 7, 105–113.
Nemirovski, A., and Yudin, D. 1983. Problem Complexity and Method Efficiency in
Optimization. Wiley Interscience.
Nesterov, Y., and Nemirovskii, A. 1994. Interior-Point Polynomial Algorithms in Con-
vex Programming. Society for Industrial and Applied Mathematics.
Nesterov, Yurii. 1983. A method for unconstrained convex minimization problem with
the rate of convergence O(1/k²). Pages 543–547 of: Doklady AN USSR, vol. 269.
Nesterov, Yurii. 2004. Introductory lectures on convex optimization. Vol. 87. Springer
Science & Business Media.
Nesterov, Yurii. 2014. Introductory Lectures on Convex Optimization: A Basic Course.
1 edn. Springer.
Orecchia, Lorenzo, Schulman, Leonard J., Vazirani, Umesh V., and Vishnoi,
Nisheeth K. 2008. On partitioning graphs via single commodity flows. Pages
461–470 of: Proceedings of the 40th Annual ACM Symposium on Theory of Com-
puting, Victoria, British Columbia, Canada, May 17-20, 2008. ACM.
Orecchia, Lorenzo, Sachdeva, Sushant, and Vishnoi, Nisheeth K. 2012. Approximating
the Exponential, the Lanczos Method and an Õ(m)-Time Spectral Algorithm for
Balanced Separator. Pages 1141–1160 of: Proceedings of the Forty-Fourth Annual
ACM Symposium on Theory of Computing. STOC ’12. New York, NY, USA:
Association for Computing Machinery.
Oveis Gharan, Shayan, Saberi, Amin, and Singh, Mohit. 2011. A Randomized Round-
ing Approach to the Traveling Salesman Problem. Pages 267–276 of: FOCS’11:
Proceedings of the 52nd Annual IEEE Symposium on Foundations of Computer
Science.
Padberg, Manfred W, and Rao, M Rammohan. 1981. The Russian method for linear
inequalities III: Bounded integer programming. Ph.D. thesis, INRIA.
Pan, Victor Y., and Chen, Zhao Q. 1999. The Complexity of the Matrix Eigenproblem.
Pages 507–516 of: STOC. ACM.
Peng, Richard. 2016. Approximate Undirected Maximum Flows in O(m polylog(n))
Time. Pages 1862–1867 of: Proceedings of the Twenty-Seventh Annual ACM-
SIAM Symposium on Discrete Algorithms, SODA 2016, Arlington, VA, USA, Jan-
uary 10-12, 2016.
Perko, Lawrence. 2001. Differential equations and dynamical systems. Vol. 7. Springer
Science & Business Media.
Plotkin, Serge A., Shmoys, David B., and Tardos, Éva. 1995. Fast Approximation
Algorithms for Fractional Packing and Covering Problems. Math. Oper. Res.,
20(2), 257–301.
Polyak, Boris. 1964. Some methods of speeding up the convergence of iteration meth-
ods. Ussr Computational Mathematics and Mathematical Physics, 4(12), 1–17.
Renegar, James. 1988. A polynomial-time algorithm, based on Newton’s method, for
linear programming. Math. Program., 40(1-3), 59–93.

Renegar, James. 2001. A Mathematical View of Interior-Point Methods in Convex Op-


timization. USA: Society for Industrial and Applied Mathematics.
Rockafellar, R. Tyrrell. 1970. Convex analysis. Princeton Mathematical Series. Prince-
ton, N. J.: Princeton University Press.
Rothvoss, Thomas. 2017. The Matching Polytope has Exponential Extension Com-
plexity. J. ACM, 64(6), 41:1–41:19.
Rudin, Walter. 1987. Real and Complex Analysis, 3rd Ed. USA: McGraw-Hill, Inc.
Saad, Y. 2003. Iterative Methods for Sparse Linear Systems. 2nd edn. Philadelphia,
PA, USA: Society for Industrial and Applied Mathematics.
Sachdeva, Sushant, and Vishnoi, Nisheeth K. 2014. Faster Algorithms via Approxima-
tion Theory. Found. Trends Theor. Comput. Sci., 9(2), 125–210.
Schrijver, Alexander. 2002a. Combinatorial optimization: polyhedra and efficiency.
Vol. 24. Springer.
Schrijver, Alexander. 2002b. On the history of the transportation and maximum flow
problems. Mathematical Programming, 91(3), 437–445.
Shalev-Shwartz, Shai. 2012. Online Learning and Online Convex Optimization. Found.
Trends Mach. Learn., 4(2), 107–194.
Sherman, Jonah. 2013. Nearly Maximum Flows in Nearly Linear Time. Pages 263–269
of: 54th Annual IEEE Symposium on Foundations of Computer Science, FOCS
2013, 26-29 October, 2013, Berkeley, CA, USA.
Shor, N. Z. 1972. Utilization of the operation of space dilatation in the minimization of
convex functions. Cybernetics, 6(1), 7–15.
Singh, Mohit, and Vishnoi, Nisheeth K. 2014. Entropy, optimization and counting.
Pages 50–59 of: Proceedings of the 46th Annual ACM Symposium on Theory of
Computing. ACM.
Spielman, Daniel A. 2012. Algorithms, Graph Theory, and the Solution of Laplacian
Linear Equations. Pages 24–26 of: ICALP (2).
Spielman, Daniel A., and Teng, Shang-Hua. 2004. Nearly-linear time algorithms for
graph partitioning, graph sparsification, and solving linear systems. Pages 81–90
of: STOC’04: Proceedings of the 36th Annual ACM Symposium on the Theory of
Computing.
Spielman, Daniel A., and Teng, Shang-Hua. 2009. Smoothed analysis: an attempt to
explain the behavior of algorithms in practice. Commun. ACM, 52(10), 76–84.
Steele, J. Michael. 2004. The Cauchy-Schwarz Master Class: An Introduction to the
Art of Mathematical Inequalities. USA: Cambridge University Press.
Strang, Gilbert. 1993. The Fundamental Theorem of Linear Algebra. The American
Mathematical Monthly, 100(9), 848–855.
Strang, Gilbert. 2006. Linear algebra and its applications. Belmont, CA: Thomson,
Brooks/Cole.
Straszak, Damian, and Vishnoi, Nisheeth K. 2016. On a Natural Dynamics for Linear
Programming. Page 291 of: Proceedings of the 2016 ACM Conference on Inno-
vations in Theoretical Computer Science, Cambridge, MA, USA, January 14-16,
2016.
Straszak, Damian, and Vishnoi, Nisheeth K. 2017. Real Stable Polynomials and Ma-
troids: Optimization and Counting. Pages 370–383 of: Proceedings of the 49th
Annual ACM SIGACT Symposium on Theory of Computing. STOC 2017.

Straszak, Damian, and Vishnoi, Nisheeth K. 2019. Maximum Entropy Distributions:


Bit Complexity and Stability. Pages 2861–2891 of: Conference on Learning The-
ory, COLT 2019, 25-28 June 2019, Phoenix, AZ, USA. Proceedings of Machine
Learning Research, vol. 99. PMLR.
Teng, Shang-Hua. 2010. The Laplacian Paradigm: Emerging Algorithms for Massive
Graphs. Pages 2–14 of: TAMC.
Trefethen, Lloyd N., and Bau, David. 1997. Numerical linear algebra. SIAM.
Vaidya, Pravin M. 1987. An Algorithm for Linear Programming which Requires
O(((m + n)n² + (m + n)^{1.5} n)L) Arithmetic Operations. Pages 29–38 of: Proceedings
of the 19th Annual ACM Symposium on Theory of Computing, 1987, New York,
New York, USA. ACM.
Vaidya, Pravin M. 1989a. A New Algorithm for Minimizing Convex Functions over
Convex Sets (Extended Abstract). Pages 338–343 of: 30th Annual Symposium on
Foundations of Computer Science, Research Triangle Park, North Carolina, USA,
30 October - 1 November 1989. IEEE Computer Society.
Vaidya, Pravin M. 1989b. Speeding-Up Linear Programming Using Fast Matrix Mul-
tiplication (Extended Abstract). Pages 332–337 of: 30th Annual Symposium on
Foundations of Computer Science, Research Triangle Park, North Carolina, USA,
30 October - 1 November 1989. IEEE Computer Society.
Vaidya, Pravin M. 1990. Solving Linear Equations with Symmetric Diagonally Domi-
nant Matrices by Constructing Good Preconditioners.
Valiant, Leslie G. 1979. The Complexity of Computing the Permanent. Theor. Comput.
Sci., 8, 189–201.
Vanderbei, Robert J. 2001. The Affine-Scaling Method. Boston, MA: Springer US.
Pages 333–347.
Vishnoi, Nisheeth K. 2013. Lx = b. Found. Trends Theor. Comput. Sci., 8(1-2), 1–141.
Vishnoi, Nisheeth K. 2018. Geodesic Convex Optimization: Differentiation on Mani-
folds, Geodesics, and Convexity. CoRR, abs/1806.06373.
Wright, Margaret H. 2005. The interior-point revolution in optimization: history, recent
developments, and lasting consequences. Bull. Amer. Math. Soc. (N.S.), 42, 39–56.
Wright, Stephen J. 1997. Primal-Dual Interior-Point Methods. Society for Industrial
and Applied Mathematics.
Yudin, DB, and Nemirovskii, Arkadi S. 1976. Informational complexity and efficient
methods for the solution of convex extremal problems. Ékon Math Metod, 12,
357–369.
Zhu, Zeyuan Allen-, Li, Yuanzhi, de Oliveira, Rafael Mendes, and Wigderson, Avi.
2017. Much Faster Algorithms for Matrix Scaling. Pages 890–901 of: 58th IEEE
Annual Symposium on Foundations of Computer Science, FOCS 2017, Berkeley,
CA, USA, October 15-17, 2017. IEEE Computer Society.
Index

L-Lipschitz continuous gradient, 88, 137 complexity class


L-smooth, 85, 86, 137 #P, 65
` p -ball, 33 BPP, 50, 61
σ-strongly convex, 37 NP-hard, 64
s − t-flow, 2, 92 P, 50, 61
s − t-maximum flow problem, 2, 3, 92, 102, RP, 50, 61
153, 210, 244 concave, 33
s − t-minimum cost flow problem, 210 conjugate function, 72, 106, 119
s − t-minimum cut problem, 3, 102, 131, 153, conjugate gradient, 151
244 convex combination, 32
0-1-polytopes, 244 convex function, 33
NL condition, 172 convex hull, 32, 33
0-1-polytopes, 13, 247, 264 convex program, 47
constrained, 47
accelerated gradient descent, 7, 152
nonsmooth, 47
affine invariance, 165
smooth, 47
affine scaling, 209
unconstrained, 47
analytic center, 184
convex set
arithmetic operations, 51
intersection, 44
barrier function, 11, 183 membership, 51
binary search, 248 union, 44
bit complexity, 52, 180 convexity, 33
black box model, 59, 90, 152 coordinate descent, 100
breadth-first search, 93 cutting plane methods, 247
Bregman divergence, 38, 86, 110, 113, 120,
depth-first search, 136
130, 141
derivative, 15
Cauchy-Schwarz inequality, 20, 83, 88, 121, differentiable, 15
167 continuous, 15, 35
general, 22 twice, 16, 35
central path, 184, 215 differential, 16
centroid, 296 Dikin ellipsoid, 229
classification, 133 directional derivative, 15, 34, 82
closed, 114, 119 downward closed, 277
function, 22, 73 dual feasibility, 74
closed function, 73 duality, 3, 67
complementary slackness, 75 gap, 70, 205


strong, 70 adjacency matrix, 25


weak, 69 bipartite, 24, 64
dynamical system, 23 complete, 24
continuous-time, 23, 89, 169 perfect matching, 122
continuous-time , 83 complete, 24, 63
discrete-time, 23 connected, 24
Euler discretization, 23, 83 cut, 25
cycle, 24
ellipsoid, 33, 52
degree matrix, 25
ellipsoid method, 10, 244
directed, 24
epigraph, 41, 44, 70
disconnected, 24
estimate sequence, 139
edge, 24
exponential gradient descent, 109, 117
head, 24
first-order, 5, 152 tail, 24
first-order approximation, 18, 35 flow, 25
first-order convexity, 34 independent set, 64
first-order optimality, 157 Laplacian, 25, 96
first-order oracle, 60, 81, 85, 292 loop, 24
follow the regularized leader, 135 matching, 25
Ford-Fulkerson method, 3 perfect, 25
Frank-Wolfe method, 101 perfect matching, 64
full-dimensional, 32 simple, 24
function oracle, 60 spanning tree, 25, 63, 284
fundamental theorem of algebra, 20 subgraph, 24
fundamental theorem of calculus, 17, 87, 105, tree, 25
176 undirected, 24
Gaussian elimination, 6, 48, 162 vertex, 24, 54
generalized negative entropy, 45 degree, 24
generalized Pythagorean theorem, 114, 116, walk, 24, 49
120, 142 weighted, 26
geodesically convex, 303 graphic matroid, 277
global optimum, 42 halfspace, 32, 51
gradient, 15, 16 Hamiltonian dynamics, 78
G-Lipschitz, 105 heavy ball method, 151
bounded, 85, 105 Hessian, 16
Lipschitz, 105 hybrid barrier, 229
gradient descent, 5, 60, 82, 92, 105, 138, 152 hybrid center, 298
accelerated, 91 hypercube, 53
coordinate, 100 hyperplane, 32, 39
initial point, 85 implicit function theorem, 184
learning rate, 84 input length, 52
projected, 92, 96, 105 interior, 184
step length, 83 interior point method, 11
stochastic, 130 iterative rounding, 303
strongly convex, 99
Jacobian, 160
subgradient, 100
John ellipsoid, 231
gradient flow, 83, 165, 168
graph, 24 KKT conditions, 74
s − t-cut, 25 Kullback-Leibler (KL) divergence, 110
s − t-flow, 25 Lagrange
acyclic, 25, 277 multiplier, 68

Lagrangian, 68 spectral norm, 20, 162


dual, 67, 69 spectrum, 20
duality, 68 square-root, 19
Lagrangian dynamics, 78 symmetric, 18
Laplacian system, 6, 97 totally unimodular, 30
Law of cosines, 113 matrix multiplication, 162
law of cosines, 116, 120 matroid, 277
Lee-Sidford barrier, 231, 246 base, 281
Legendre-Fenchel dual, 78, 119 rank function, 277
leverage score, 230 matroid base polytope, 281
linear equations, 48 matroid polytopes, 277
linear function, 33 max-flow min-cut theorem, 3
linear hull, 18 maximum cardinality matching problem, 244
linear program, 8, 48, 58, 63, 92, 245 maximum cut, 136
canonical form, 77, 180 maximum entropy distributions, 274, 276
dual, 71 maximum entropy problem, 14
duality, 70 mean value theorem, 158
standard form, 70, 214 membership problem, 277
linear programming, 58, 62, 180 min-max theorem, 132
linear span, 18 minimum spanning tree problem, 244
linear systems, 148 Minkowski sum, 44
Lipschitz mirror descent, 7, 105, 106, 117, 118, 138, 152
constant, 84 mirror map, 118, 119
continuous, 84 multiplicative weights update, 7, 121
gradient, 84
negative entropy, 33
local inner product, 168
Newton step, 161
local norm, 163, 168
Newton’s method, 11, 155, 215
local optimum, 42
Newton-Raphson method, 155
logarithmic barrier, 183, 215
norm, 21
Lovász extension, 281
`2 , 21
lower bound, 152
` p , 21
matching polytope, 26, 245 dual, 22, 121
matrix Euclidean, 21
2 → 2 norm, 20 matrix, 64
column rank, 18
odd minimum cut problem, 268
condition number, 30
online convex optimization, 135
copositive, 64
open ball, 22
eigenvalue, 20, 54
oracle
eigenvector, 20
first-order, 106
full rank, 18
identity, 18 path following interior point method, 185
image, 18 perfect matching polytope, 245
inverse, 18 Pinsker’s inequality, 116, 121
kernel, 18 polyhedron, 33, 48, 52, 180
norm, 20 polytope, 4, 32, 33, 53
nullspace, 18 vertex, 53, 58, 61
positive definite, 19 preconditioning, 165
positive semidefinite, 19, 54, 56 primal, 68
pseudoinverse, 18, 96 primal feasibility, 74
rank, 18 primal-dual interior point method, 209
row rank, 18 probability simplex, 105, 109, 283

projected gradient descent, 182 strictly convex, 113, 168


projection, 91 strong convexity, 37
projective scaling, 209 strong duality, 70, 75
proximal method, 106 nonconvex, 71
Pythagoras theorem, 114 strongly convex, 108
subgradient, 35, 100
quadratic convergence, 158
existence, 40
quadratic function, 33
sublevel set, 292
RAM model, 50 submodular function, 13, 279
randomized oracle, 60 submodular function minimization, 13, 274,
regret, 135 276
regularization, 107 supporting hyperplane, 39
regularizer, 108, 148 supporting hyperplane theorem, 39
residual graph, 3
tangent space, 169, 217
Riemannian manifold, 169
Taylor approximation, 17, 110
Riemannian metric, 165, 169
Taylor series, 17
second-order approximation, 18 totally unimodular, 2
second-order convexity, 35 traveling salesman problem, 14
second-order method, 155 Turing machine, 50, 61
second-order oracle, 60, 162 oracle, 50
second-order Taylor approximation, 158 randomized, 50, 61
self-adjoint, 219 unbiased estimator, 60
self-concordance, 12, 210 uniform matroid, 277
self-concordant barrier, 229 universal barrier, 12, 229
semidefinite programming, 234
vertex-edge incidence matrix, 2, 25, 92, 210
separating hyperplane, 39
volumetric barrier, 12, 229, 246
separating hyperplane theorem, 39
volumetric center, 297
separation oracle, 13, 55, 277
set weak duality, 69
boundary, 22, 39, 182 Winnow algorithm, 133
bounded, 22, 48 Young-Fenchel inequality, 73
closed, 22, 41, 48, 73
closure, 22
compact, 22, 48
convex, 32
interior, 22
limit point, 22
open, 22
polar, 79
relative interior, 23, 41
Shannon entropy, 14
shortest path problem, 49
simplex method, 9, 182, 209
Slater’s condition, 69, 70, 75
soft-max function, 129
spanning tree polytope, 14, 26, 53, 56, 245
sparsest cut, 136
standard basis vector, 16
steepest descent, 82
stochastic gradient descent, 60
strict convexity, 37
